Record Football Transfers (Part I - Data Scraping)
15 January, 2021
Categories: scraping, data analysis
The beautiful game has undergone a drastic shift. Nations now own clubs. Players are being exchanged for hundreds of millions of currency… and the Premier League won’t let Newcastle join in on the fun.
I thought it would be interesting to graph the highest transfer fees over the past 20 years.
Importing the Data
First of all, I chose to scrape a table off Wikipedia that contained the most expensive football transfers. I did this by using several modules:
- I began by using the requests module to make an HTTP request and retrieve the HTML from the URL of a Wikipedia page stored in wiki_url. The module isn't designed to parse the information.
    data = requests.get(wiki_url)
- To parse the HTML, I passed it into bs4's BeautifulSoup constructor to return a tree-like object that we can interact with. This part requires the lxml parser to be installed.
    soup = BeautifulSoup(data.text, 'lxml')
  This BeautifulSoup object supports methods like .find() and .find_all() to search the nested tag structure of an HTML document.
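As a rough sketch of how .find() and .find_all() fit together, here is the same idea on a tiny hand-written table rather than the real Wikipedia page. The HTML snippet and the "wikitable" class name are assumptions made for illustration, and I've used Python's built-in 'html.parser' so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (data.text in the post).
html = """
<table class="wikitable">
  <tr><th>Player</th><th>From</th><th>To</th></tr>
  <tr><td>Luís Figo</td><td>Barcelona</td><td>Real Madrid</td></tr>
  <tr><td>Kaká</td><td>Milan</td><td>Real Madrid</td></tr>
</table>
"""

# 'html.parser' ships with Python; the post uses 'lxml' instead.
soup = BeautifulSoup(html, "html.parser")

# .find() returns the first matching tag, .find_all() returns every match.
table = soup.find("table", class_="wikitable")
rows = []
for tr in table.find_all("tr"):
    # Collect the text of each header or data cell in this row.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

print(rows)
```

Each entry of rows is one table row as a list of strings, which is the "list of lists" shape used later on.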
Cleaning the Data
Data scraped from the internet is usually not in a presentable format. My end goal is to produce a table that’s easy to read (and access information from). The first course of action was to deal with row elements that span multiple cells. Next up was to remove extra fluff from the cells. Wikipedia, for example, has loads of citation markers like this[1]. I defined a regular expression to get rid of them.
import re

def remove_citation(n):
    # Strip anything in square brackets, e.g. the citation marker "[1]".
    return re.sub(r'[\[].*?[\]]', '', n)
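To see why the non-greedy .*? matters in that pattern, here is a small sketch. The input strings and citation numbers are made up for illustration, and I've written the pattern as \[.*?\], which matches the same text as the character-class version.

```python
import re

def remove_citation(text):
    # \[ and \] match literal brackets; .*? is non-greedy, so each
    # bracketed marker is removed on its own rather than everything
    # between the first '[' and the last ']'.
    return re.sub(r'\[.*?\]', '', text)

print(remove_citation('Kaká[1]'))
print(remove_citation('Milan[2] to Real Madrid[3]'))

# For comparison, a greedy pattern swallows the text between the markers.
print(re.sub(r'\[.*\]', '', 'Milan[2] to Real Madrid[3]'))
```

The first two calls leave the surrounding text intact, while the greedy version cuts the second string down to just 'Milan'.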
There was a fair bit of uninteresting code to fix up the table that I’ll leave out. It was all essentially looking at the tag structure of the table and making changes where necessary.
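For the curious, one way to handle cells that stretch over several rows is to copy their text into every row they cover. This is only a sketch of the idea and not the code I actually used: the helper expand_rowspans and the (text, rowspan) tuples are hypothetical, and the layout of the sample rows is an assumption (the values themselves appear in the table further down).

```python
def expand_rowspans(rows):
    """Copy each cell that spans several rows into every row it covers.

    `rows` is a list of rows; each cell is a (text, rowspan) tuple.
    """
    pending = {}  # column index -> (text, rows still to fill)
    result = []
    for row in rows:
        cells = iter(row)
        out = []
        col = 0
        while True:
            if col in pending:
                # This column is covered by a cell from an earlier row.
                text, left = pending[col]
                out.append(text)
                if left == 1:
                    del pending[col]
                else:
                    pending[col] = (text, left - 1)
            else:
                cell = next(cells, None)
                if cell is None:
                    break  # no more cells in this row
                text, span = cell
                out.append(text)
                if span > 1:
                    pending[col] = (text, span - 1)
            col += 1
        result.append(out)
    return result

# Hypothetical layout: the fee cell covers two rows, the year cell three.
rows = [
    [("Zlatan Ibrahimović", 1), ("56.00", 2), ("9", 3)],
    [("Kaká", 1)],
    [("Cristiano Ronaldo", 1), ("80.00", 1)],
]
for row in expand_rowspans(rows):
    print(row)
```

After the call every row has all three columns filled in, so later steps can treat the table as a plain grid.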
Now I have a list of lists [row1, ..., rown].
- For an easy way to build and sort a table, I used the pandas module. I converted my list of lists to what is called a DataFrame object in pandas. According to Google, a DataFrame is a 2-dimensional labelled data structure with columns of potentially different types.
    df = pd.DataFrame(listofrows)
    df.columns = ['name','from','to','position','feeeuro','fee','year','born']
    df.year = df.year.astype(int)
    df.fee = df.fee.astype(float)
    df = df.sort_values(by=['year','fee'])
  The columns attribute labels the columns, and the two lines that follow change the type of the column entries to integers and floats respectively. A very useful feature of the DataFrame structure is the ability to sort the table by columns. I first sorted by the 'year' of the transfer (which is my independent variable) and then by the transfer 'fee' in GBP (my dependent variable).
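Here is a self-contained sketch of that conversion and sort, using three rows copied from the table in this post and fed in deliberately out of order. I'm assuming the scraped cells arrive as strings, which is why the type conversion comes before the sort.

```python
import pandas as pd

# Three rows taken from the table in this post, deliberately out of order.
listofrows = [
    ['Cristiano Ronaldo', 'Manchester United', 'Real Madrid', 'Forward', '94', '80.00', '9', '1985'],
    ['Luís Figo', 'Barcelona', 'Real Madrid', 'Midfielder', '62', '37.00', '0', '1972'],
    ['Kaká', 'Milan', 'Real Madrid', 'Midfielder', '67', '56.00', '9', '1982'],
]

df = pd.DataFrame(listofrows)
df.columns = ['name', 'from', 'to', 'position', 'feeeuro', 'fee', 'year', 'born']

# Scraped cells are strings, so convert before sorting numerically.
df['year'] = df['year'].astype(int)
df['fee'] = df['fee'].astype(float)

# Sort by year first; ties within a year are broken by fee.
df = df.sort_values(by=['year', 'fee'])
print(df[['name', 'fee', 'year']])
```

Sorting keeps each row's original index label, which is why the left-hand index column of a sorted table no longer counts up in order.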
Without posting the entirety of the table, it came out looking something like this:
        name                from               to           position    feeeuro  fee    year  born
    47  Luís Figo           Barcelona          Real Madrid  Midfielder  62       37.00  0     1972
    24  Zinedine Zidane     Juventus           Real Madrid  Midfielder  76       46.60  1     1972
    34  Zlatan Ibrahimović  Inter Milan        Barcelona    Forward     69.5     56.00  9     1981
    37  Kaká                Milan              Real Madrid  Midfielder  67       56.00  9     1982
    12  Cristiano Ronaldo   Manchester United  Real Madrid  Forward     94       80.00  9     1985
Visualising the Data
Plotting the information was a simple case of using pyplot from the matplotlib module.
from matplotlib import pyplot

pyplot.title('The Highest Football Transfers')
pyplot.xlabel('Year')
pyplot.ylabel('Transfer Fee (£M)')
# 'o' draws each transfer as a separate marker with no connecting line.
pyplot.plot(df.year, df.fee, 'o')
pyplot.show()
Maybe I’ll try to use a statistical model to predict what the average high-profile 2022 signing will look like given the information I have.
Edit: I did end up doing this. Follow this link to find the follow-up!