Record Football Transfers (Part I - Data Scraping)

15 January, 2021

Categories: scraping, data analysis

The beautiful game has undergone a drastic shift. Nations now own clubs. Players are being exchanged for hundreds of millions of currency… and the Premier League won’t let Newcastle join in on the fun.

I thought it would be interesting to graph the highest transfer fees over the past 20 years.

Importing the Data

First of all, I chose to scrape a table off Wikipedia that contained the most expensive football transfers. I did this with the help of several modules.

I began by using the requests module to make an HTTP request and retrieve the HTML from the URL of a Wikipedia page, stored in wiki_url. The module only fetches the page; it isn't designed to parse the information.
```
import requests
data = requests.get(wiki_url)
```
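As a slightly more defensive variant of the request above, here is a minimal sketch that adds a timeout and fails loudly on a bad HTTP status. The helper name fetch_html and the 10-second timeout are assumptions for illustration, not part of the original script.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text at url, raising if the request fails."""
    # timeout stops the call from hanging forever on a dead connection
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

Called as fetch_html(wiki_url), this returns the same text that data.text holds, but a missing page or a server error stops the script straight away instead of handing an error page to the parser.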
To parse the HTML, I passed it into bs4's BeautifulSoup constructor to return a tree-like object that we can interact with.
This part requires the lxml parser to be installed.
```
from bs4 import BeautifulSoup
soup = BeautifulSoup(data.text, 'lxml')
```
This BeautifulSoup object supports methods like .find() and .find_all() to search the nested tag structure of an HTML document.
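To give a feel for how .find() and .find_all() work together, here is a minimal sketch that pulls the rows out of a table. The HTML snippet is a made-up stand-in for the real page, the wikitable class is an assumption about how Wikipedia marks up its tables, and the helper name table_rows is hypothetical. It uses Python's built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the Wikipedia page, for illustration only
sample_html = """
<table class="wikitable">
  <tr><th>Player</th><th>Fee</th></tr>
  <tr><td>Player A</td><td>80[1]</td></tr>
  <tr><td>Player B</td><td>56</td></tr>
</table>
"""


def table_rows(html):
    """Return the first wikitable as a list of rows of cell text."""
    soup = BeautifulSoup(html, 'html.parser')
    # .find() returns the first matching tag (or None if there is none)
    table = soup.find('table', class_='wikitable')
    if table is None:
        return []
    rows = []
    # .find_all() returns every matching tag nested inside the table
    for tr in table.find_all('tr'):
        cells = tr.find_all(['th', 'td'])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


print(table_rows(sample_html))
# [['Player', 'Fee'], ['Player A', '80[1]'], ['Player B', '56']]
```

The citation marker on the first fee is still there at this stage; the text comes out exactly as it appears in the cell.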

Cleaning the Data

Data scraped from the internet is usually not in a presentable format. My end goal is to produce a table that’s easy to read (and access information from). The first course of action was to deal with row elements that span multiple cells. Next up was to remove extra fluff from the cells. Wikipedia, for example, has loads of citations like[1] this. I defined a regular expression to get rid of them.

```
import re

def remove_citation(n):
    return re.sub(r'[\[].*?[\]]', '', n)
```
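Here is a quick check of what this pattern does and does not remove. Note that it strips any bracketed text, not just numbered citations. The stricter remove_numeric_citation below is a hypothetical alternative, not something from the original script, and the sample strings are made up.

```python
import re


def remove_citation(n):
    # Non-greedy match: removes each [...] group separately
    return re.sub(r'[\[].*?[\]]', '', n)


def remove_numeric_citation(n):
    # Hypothetical stricter variant: only removes markers like [1] or [23]
    return re.sub(r'\[\d+\]', '', n)


print(remove_citation('Real Madrid[12]'))               # Real Madrid
print(remove_citation('Forward[a] (loan)[3]'))          # Forward (loan)
print(remove_numeric_citation('Forward[a] (loan)[3]'))  # Forward[a] (loan)
```

The broad version is fine as long as no cell contains square brackets that should be kept.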

There was a fair bit of uninteresting code to fix up the table that I’ll leave out. It was all essentially looking at the tag structure of the table and making changes where necessary.
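The omitted code isn't reproduced here, but to illustrate the problem of cells that span multiple rows, here is one possible sketch. It assumes each scraped cell has already been reduced to a (text, rowspan) pair, and the function name expand_rowspans and the sample values are hypothetical. A cell that spans several rows is copied down into each row it covers, so every row ends up with the same number of columns.

```python
def expand_rowspans(rows):
    """Copy cells with rowspan > 1 down into the rows they cover.

    rows is a list of rows; each row is a list of (text, rowspan) pairs
    in the order the cells appear in the HTML.
    """
    grid = []
    pending = {}  # column index -> (text, number of rows still to fill)
    for row in rows:
        cells = iter(row)
        out = []
        col = 0
        while True:
            if col in pending:
                # This column is covered by a cell from an earlier row
                text, remaining = pending[col]
                out.append(text)
                if remaining > 1:
                    pending[col] = (text, remaining - 1)
                else:
                    del pending[col]
            else:
                cell = next(cells, None)
                if cell is None:
                    break  # no more cells and nothing carried over
                text, rowspan = cell
                out.append(text)
                if rowspan > 1:
                    pending[col] = (text, rowspan - 1)
            col += 1
        grid.append(out)
    return grid


# Made-up example: the year cell in the first row spans two rows
scraped = [
    [('Player A', 1), ('2009', 2)],
    [('Player B', 1)],
    [('Player C', 1), ('2010', 1)],
]
print(expand_rowspans(scraped))
# [['Player A', '2009'], ['Player B', '2009'], ['Player C', '2010']]
```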

Now I have a list of lists [row1, ..., rown].

For an easy way to build and sort a table, I used the pandas module. I converted my list of lists to what is called a DataFrame object in pandas. According to Google, a DataFrame is a 2-dimensional labelled data structure with columns of potentially different types.
```
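import pandas as pd

df = pd.DataFrame(listofrows)

# Label the columns, then convert the scraped strings to numeric types
df.columns = ['name','from','to','position','feeeuro','fee','year','born']
df['year'] = df['year'].astype(int)
df['fee'] = df['fee'].astype(float)

# Sort by year first, then by fee within each year
df = df.sort_values(by=['year','fee'])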
```
The columns attribute labels the columns, and the two lines that follow convert the 'year' and 'fee' entries to integers and floats respectively. A very useful feature of the DataFrame is the ability to sort the table by its columns. I sorted first by the 'year' of the transfer (which is my independent variable) and then by the transfer 'fee' in GBP (my dependent variable).
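To see why the type conversion matters before sorting, here is a small sketch on made-up rows (the sample values are illustrative, not the scraped data). Scraped cells arrive as strings, and strings sort character by character, so '100.0' comes before '37.0' until the column is converted to floats.

```python
import pandas as pd

# Illustrative rows only: scraped values start out as strings
sample = pd.DataFrame(
    [['Player A', '100.0', '2017'],
     ['Player B', '37.0', '2000'],
     ['Player C', '80.0', '2009'],
     ['Player D', '56.0', '2009']],
    columns=['name', 'fee', 'year'],
)

# Sorting the raw strings compares them character by character
as_text = sample.sort_values(by='fee')['fee'].tolist()
print(as_text)  # ['100.0', '37.0', '56.0', '80.0']

# After converting, the sort is numeric: by year, then by fee within a year
sample['year'] = sample['year'].astype(int)
sample['fee'] = sample['fee'].astype(float)
ordered = sample.sort_values(by=['year', 'fee'])
print(ordered['name'].tolist())  # ['Player B', 'Player D', 'Player C', 'Player A']
```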

Without posting the entirety of the table, it came out looking something like this:

```
                   name                from            to     position  feeeuro     fee  year  born
47            Luís Figo           Barcelona   Real Madrid   Midfielder       62   37.00     0  1972
24      Zinedine Zidane            Juventus   Real Madrid   Midfielder       76   46.60     1  1972
34   Zlatan Ibrahimović         Inter Milan     Barcelona      Forward     69.5   56.00     9  1981
37                 Kaká               Milan   Real Madrid   Midfielder       67   56.00     9  1982
12    Cristiano Ronaldo   Manchester United   Real Madrid      Forward       94   80.00     9  1985
```

Visualising the Data

Plotting the information was a simple case of using pyplot from the matplotlib module.

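```
from matplotlib import pyplot

pyplot.title('The Highest Football Transfers')
pyplot.xlabel('Year')
pyplot.ylabel('Transfer Fee (£M)')

# Plot each transfer as a point: year on the x-axis, fee on the y-axis
pyplot.plot(df.year, df.fee, 'o')
pyplot.show()
```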

Maybe I’ll try to use a statistical model to predict what the average high-profile 2022 signing will look like given the information I have.

Edit: I did end up doing this. Follow this link to find the follow-up!