Record Football Transfers (Part II - Prediction)

12 August, 2021

Categories: regression, football, transfers

Continuing on from my last project, I would like to apply some of the statistical modelling techniques I’ve discovered (by watching the last few lectures of MIT 6.0002) so that I can predict the average price of next year’s high-profile transfers.

I’ll be using the idea of polynomial linear regression to fit a curve to the data I’ve collected and make a prediction using said curve.

The Wikipedia table has little information for the years between 2002 and 2008, which would certainly affect the results, so I added the top transfers from each of those years using the same sources as the Wikipedia table (cross-checking them, of course). I then worked out the average price of the highest transfers for each year (at least five per year) and plotted the averages in red.

With my dataset, fee_dataset, the following code performs polynomial regression of degree 1 and plots the resulting line of best fit.

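import pylab
# Fit a degree-1 polynomial (a straight line) to the yearly average fees
linear_model = pylab.polyfit(fee_dataset.year, fee_dataset.fee, 1)
# Evaluate the fitted line at every year in the dataset
linear_predictions = pylab.polyval(linear_model, fee_dataset.year)
pylab.plot(fee_dataset.year, linear_predictions, 'g--', linewidth=1, label="Linear")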
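The quadratic, cubic and quartic models discussed next can be obtained the same way by changing the degree argument. Here is a minimal sketch of that idea, not the exact code behind the post: it calls NumPy's polyfit and polyval directly, and the years and fees arrays are made-up stand-ins for fee_dataset.

```python
import numpy as np

# Hypothetical stand-in data (NOT the post's fee_dataset): years since 2000
# and made-up average fees in £M.
years = np.arange(0, 22)
fees = 20 + 2.5 * years + 3 * np.sin(years / 3.0)

# Fit one polynomial per degree; coefficients come back highest power first.
models = {degree: np.polyfit(years, fees, degree) for degree in (1, 2, 3, 4)}

# Sum of squared residuals of each fit on the data it was fitted to.
sse = {
    degree: float(np.sum((fees - np.polyval(coeffs, years)) ** 2))
    for degree, coeffs in models.items()
}

for degree in models:
    print(f"degree {degree}: {len(models[degree])} coefficients, SSE = {sse[degree]:.2f}")
```

A polynomial of degree n has n + 1 coefficients, and on the data used for fitting the squared error should not grow as the degree increases.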

The linear model fits the early (2000-2008) transfer data poorly and doesn’t account for the slowdown in spending in recent years. The quadratic model does slightly better than the linear model in the early years but also fails to capture the current spending climate.

The quartic model shows the most pronounced downturn, beginning around 2019, while the cubic model shows more of a plateau. There is some circumstantial evidence that contextualises the data:

To quantify how well the models fit the data, I’ve chosen to calculate a quantity, \(R^{2}\), known as the coefficient of determination. It’s defined as:

\[R^{2} = 1 - \dfrac{ \sum_{i} (y_{i} - p_{i})^{2}}{\sum_{i} (y_{i} - \mu)^{2}}\]

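Here \(y_{i}\) are the observed values, \(p_{i}\) are the model’s predictions and \(\mu\) is the mean of the observed values. The calculation for \(R^{2}\) is simple to implement: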

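def coeff_determination(y, p):
    """Return R^2 for observed values y and predicted values p."""
    ymean = sum(y) / len(y)
    ss_res = sum((yi - pi)**2 for yi, pi in zip(y, p))  # residual sum of squares
    ss_tot = sum((yi - ymean)**2 for yi in y)  # total sum of squares
    return 1 - ss_res / ss_tot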
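As a quick sanity check of the formula, a perfect prediction should give a value of 1 and predicting the mean everywhere should give 0. The snippet below repeats the function so that it runs on its own; the observed and predicted values are made up for illustration.

```python
def coeff_determination(y, p):
    ymean = sum(y) / len(y)
    comp = sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / sum((yi - ymean) ** 2 for yi in y)
    return 1 - comp

observed = [1.0, 2.0, 3.0, 4.0]  # made-up values, mean is 2.5

perfect = coeff_determination(observed, observed)            # every residual is zero
mean_only = coeff_determination(observed, [2.5] * 4)         # always predict the mean
close = coeff_determination(observed, [1.1, 1.9, 3.2, 3.8])  # small residuals

print(perfect, mean_only, round(close, 2))
```

For the third case, the residuals sum to 0.10 and the total sum of squares is 5.0, so the result is 1 - 0.02 = 0.98.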

The results are as follows:

    Linear model:       0.7562801560992423
    Quadratic model:    0.8048401397741687
    Cubic model:        0.8355798702015268
    Quartic model:      0.8380772813567788

The \(R^{2}\) values are all relatively high, and the graphs don’t appear to over-fit the data. All four models reasonably capture the rise in average transfer fees since the beginning of the century.
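Since a higher degree can never fit the fitted data worse, a high in-sample value on its own does not rule out over-fitting. One way to probe it, sketched here under assumptions, is to hold out the most recent years, fit on the rest and compare the error on the held-out years. The data below is made up, and the four-year hold-out is an arbitrary choice, so this shows the method only and says nothing about the real dataset.

```python
import numpy as np

# Hypothetical stand-in data (NOT the post's fee_dataset): years since 2000
# and made-up average fees in £M.
years = np.arange(0, 22)
fees = 20 + 2.5 * years + 3 * np.sin(years / 3.0)

# Hold out the last four years; fit only on the earlier ones.
train_x, test_x = years[:-4], years[-4:]
train_y, test_y = fees[:-4], fees[-4:]

holdout_rmse = {}
for degree in (1, 2, 3, 4):
    coeffs = np.polyfit(train_x, train_y, degree)
    errors = test_y - np.polyval(coeffs, test_x)
    holdout_rmse[degree] = float(np.sqrt(np.mean(errors ** 2)))

for degree, rmse in holdout_rmse.items():
    print(f"degree {degree}: hold-out RMSE = {rmse:.2f}")
```

A model whose hold-out error is much larger than that of a lower-degree model is a candidate for over-fitting.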

Taking into account all of the above, I’m inclined to side with the cubic model over the others.

To predict the value of the cubic model for the year 2022, I can define a Polynomial class that takes the cubic model’s coefficients (found with .polyfit(x,y,3)):

class Polynomial:
    def __init__(self, coeffs):
        self.coeffs = coeffs

    def evaluate(self, a):
        output = 0
        for i in range(len(self.coeffs)):
            output += self.coeffs[i]*(a**(len(self.coeffs) - 1 - i))
        return output

# Evaluate at 22 for the year 2022 (this presumes fee_dataset.year holds years since 2000)
prediction = Polynomial(cubic_model).evaluate(22)

Thus, the final prediction of this analysis is roughly £77.89M!
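The evaluate method can also be written with Horner's rule, which avoids computing each power separately. This is a minimal sketch assuming the coefficients are ordered from the highest power down, as polyfit returns them. The horner function and the example coefficients are hypothetical, not the fitted cubic model.

```python
def horner(coeffs, a):
    """Evaluate a polynomial at a; coeffs run from highest power to constant."""
    result = 0
    for c in coeffs:
        result = result * a + c  # shift everything up one power, then add c
    return result

# Made-up cubic 2x^3 - x + 5 evaluated at x = 3
example = horner([2, 0, -1, 5], 3)
print(example)  # 2*27 - 3 + 5 = 56
```

The pylab.polyval function used earlier for the linear predictions performs the same evaluation, so pylab.polyval(cubic_model, 22) is another way to cross-check the class.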

Postgame Analysis

1 September, 2022

The chosen polynomial regression model came quite close to the observed record fee. Antony was the record transfer of the 2022 summer transfer window at a fee of £82.2M, which is £4.31M over the model’s prediction of £77.89M!
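To put the gap in proportion, the difference can be expressed relative to the observed fee. This short check uses only the two figures quoted above; the variable names are mine.

```python
observed_fee = 82.2    # £M, Antony's fee as quoted above
predicted_fee = 77.89  # £M, the cubic model's prediction

abs_error = observed_fee - predicted_fee
rel_error = abs_error / observed_fee

print(f"absolute error: £{abs_error:.2f}M")
print(f"relative error: {rel_error:.1%}")
```

That works out to an under-prediction of roughly 5% of the observed fee.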

Football inflation is truly incredible (and unsustainable…?)