Record Football Transfers (Part I - Data Scraping)

15 January, 2021

Categories: scraping, data analysis

The beautiful game has undergone a drastic shift. Nations now own clubs. Players are being exchanged for hundreds of millions of currency… and the Premier League won’t let Newcastle join in on the fun.

I thought it would be interesting to graph the highest transfer fees over the past 20 years.

Importing the Data

First of all, I chose to scrape a table off Wikipedia that contained the most expensive football transfers. I did this by using several modules.
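The import list itself is not shown here, so the following is only a rough sketch of one way to pull table rows out of a page using nothing but the Python standard library. The names `TableRows`, `fetch_html`, `parse_rows` and the sample HTML are all hypothetical, and this is not necessarily how the table for this post was scraped.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row being built, or None outside a <tr>
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def fetch_html(url):
    """Download a page and decode it as UTF-8 (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_rows(html):
    """Return the table cells in `html` as a list of rows of strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample markup, so the sketch runs without a network connection.
sample = (
    "<table><tr><th>Player</th><th>Fee</th></tr>"
    "<tr><td>Luís Figo</td><td>37<sup>[1]</sup></td></tr></table>"
)
print(parse_rows(sample))
```

Note that the citation marker survives in the cell text, which is exactly the kind of fluff the cleaning step has to deal with.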

Cleaning the Data

Data scraped from the internet is usually not in a presentable format. My end goal is to produce a table that’s easy to read (and access information from). The first course of action was to deal with row elements that span multiple cells. Next up was to remove extra fluff from the cells. Wikipedia, for example, has loads of citations like[1] this. I defined a regular expression to get rid of them.

import re

def remove_citation(n):
    return re.sub(r'\[.*?\]', '', n)  # strip bracketed citation markers such as [1]

There was a fair bit of uninteresting code to fix up the table that I’ll leave out. It was all essentially looking at the tag structure of the table and making changes where necessary.
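For the curious, the multi-cell problem mentioned earlier can be sketched roughly as follows. This is a minimal illustration rather than the code left out of the post: it assumes each cell has already been reduced to a `(text, rowspan)` pair, and `expand_rowspans` and the sample rows are hypothetical.

```python
def expand_rowspans(rows):
    """Copy each cell with rowspan > 1 down into the rows it covers.

    `rows` is a list of rows; each row is a list of (text, rowspan) pairs.
    Returns plain rows of text with one entry per column.
    """
    pending = {}  # column index -> (text, number of rows still to fill)
    result = []
    for row in rows:
        cells = iter(row)
        out = []
        col = 0
        while True:
            if col in pending:
                # This column is covered by a cell from an earlier row.
                text, left = pending[col]
                out.append(text)
                if left == 1:
                    del pending[col]
                else:
                    pending[col] = (text, left - 1)
            else:
                cell = next(cells, None)
                if cell is None:
                    break  # no more cells and nothing carried over
                text, span = cell
                out.append(text)
                if span > 1:
                    pending[col] = (text, span - 1)
            col += 1
        result.append(out)
    return result


# Illustrative input: the second row relies on the fee cell from the first.
rows = [
    [("Zlatan Ibrahimović", 1), ("56.00", 2)],
    [("Kaká", 1)],
]
print(expand_rowspans(rows))
```

After this step every row has the same number of entries, which makes it safe to treat the result as a rectangular table.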

Now I have a list of lists [row1, ..., rown].
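Going from that list of lists to a table might look something like the sketch below. It assumes pandas, and it assumes the scraped cells are still strings at this point; the column names are taken from the output further down, and only two rows are included to keep it short.

```python
import pandas as pd

columns = ["name", "from", "to", "position", "feeeuro", "fee", "year", "born"]

# Assumption: every scraped cell is still a string at this stage.
rows = [
    ["Cristiano Ronaldo", "Manchester United", "Real Madrid", "Forward", "94", "80.00", "9", "1985"],
    ["Luís Figo", "Barcelona", "Real Madrid", "Midfielder", "62", "37.00", "0", "1972"],
]

df = pd.DataFrame(rows, columns=columns)

# Convert the numeric columns so they can be sorted and plotted.
for col in ["feeeuro", "fee", "year", "born"]:
    df[col] = pd.to_numeric(df[col])

# Order the transfers chronologically.
df = df.sort_values("year")
print(df)
```

Converting the columns up front matters because sorting or plotting strings would order "9" after "10".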

Without posting the entirety of the table, it came out looking something like this:

                   name                from            to     position  feeeuro     fee  year  born
47            Luís Figo           Barcelona   Real Madrid   Midfielder       62   37.00     0  1972
24      Zinedine Zidane            Juventus   Real Madrid   Midfielder       76   46.60     1  1972
34   Zlatan Ibrahimović         Inter Milan     Barcelona      Forward     69.5   56.00     9  1981
37                 Kaká               Milan   Real Madrid   Midfielder       67   56.00     9  1982
12    Cristiano Ronaldo   Manchester United   Real Madrid      Forward       94   80.00     9  1985

Visualising the Data

Plotting the information was a simple case of using pyplot from the matplotlib module.

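from matplotlib import pyplot

# df is the cleaned table shown above
pyplot.title('The Highest Football Transfers')
pyplot.xlabel('Year')
pyplot.ylabel('Transfer Fee (£M)')

# One dot per transfer: year on the x-axis, fee on the y-axis
pyplot.plot(df.year, df.fee, 'o')
pyplot.show()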

Maybe I’ll try to use a statistical model to predict what the average high-profile 2022 signing will look like given the information I have.

Edit: I did end up doing this. Follow this link to find the follow-up!