Record Football Transfers (Part II - Prediction)
12 August, 2021
Categories: regression, football, transfers
Continuing on from my last project, I would like to apply some of the statistical modelling techniques I’ve discovered (by watching the last few lectures of MIT 6.0002) so that I can predict the average price of next year’s high-profile transfers.
I’ll be using the idea of polynomial linear regression to fit a curve to the data I’ve collected and make a prediction using said curve.
The gap in the Wikipedia table between 2002 and 2008 would certainly affect the results, so I added the top transfers from each of those years using the same sources as the Wikipedia table (cross-checking them, of course). Then I worked out the average price of the highest transfers (at least five) for each year and plotted them in red:
Using my dataset fee_dataset, I used the following code to perform and plot polynomial linear regression to find the line (polynomial of degree 1) that best fits the data.
import pylab
# Fit a degree-1 polynomial (a line) to the data, then evaluate it at each year.
linear_model = pylab.polyfit(fee_dataset.year, fee_dataset.fee, 1)
linear_predictions = pylab.polyval(linear_model, fee_dataset.year)
pylab.plot(fee_dataset.year, linear_predictions, 'g--', linewidth=1, label="Linear")
- The polyfit(x, y, n) function, available through the pylab module, returns an array of coefficients (highest power first) for the polynomial of degree n that best approximates the data in the least-squares sense, where x is the independent variable and y is the dependent variable. polyval(model, x) evaluates that polynomial at the given x values, which yields the model's predictions.
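The same two calls extend to the higher-degree models compared in this post. Here is a minimal sketch, assuming NumPy's polyfit and polyval (the functions that pylab exposes) and a made-up stand-in for fee_dataset. The years and fees below are illustrative only and are not the data collected for this post:

```python
import numpy as np

# Hypothetical stand-in data: years since 2000 and made-up average fees (in £M).
years = np.array([0, 3, 6, 9, 12, 15, 18, 21], dtype=float)
fees = np.array([20.0, 22.0, 25.0, 40.0, 38.0, 50.0, 75.0, 72.0])

# Fit one polynomial per degree; each result lists coefficients from the
# highest power down to the constant term.
models = {degree: np.polyfit(years, fees, degree) for degree in (1, 2, 3, 4)}

# Evaluate each fitted polynomial at the observed years.
predictions = {degree: np.polyval(coeffs, years) for degree, coeffs in models.items()}

for degree, coeffs in models.items():
    print(degree, len(coeffs))  # a degree-n fit has n + 1 coefficients
```

Each entry in predictions can then be passed to a plotting call in the same way as linear_predictions.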
The quartic model shows the most significant downturn beginning around 2019. The cubic model that’s been fit to the data shows more of a plateau. There is some circumstantial evidence that contextualises the data:
- Despite the pandemic, FFP (Financial Fair Play) regulations have been relaxed in the most recent transfer window (2021) and the data that I've collected concerns the upper echelon of football spending. Their spending power defies the current global climate.
- Neymar's transfer for £198M also greatly skews the results for 2017. This transfer was a singularity amongst singularities. Without his contribution, the average would've been £70.5M. Thanks, PSG.
The linear model fails to fit early (2000-2008) transfer data very well and doesn’t account for the slowing down of spending in recent years. The quadratic model is slightly better than the linear model for early years but also fails to capture the current spending climate reasonably.
To quantify how well the models fit the data, I’ve chosen to calculate a quantity, \(R^{2}\), known as the coefficient of determination. It’s defined as:
\[R^{2} = 1 - \dfrac{ \sum_{i} (y_{i} - p_{i})^{2}}{\sum_{i} (y_{i} - \mu)^{2}}\]
where
- y \(= (y_{i})_{i \in I}\) is the sequence of dependent variable values (average fees in our case),
- p \(= (p_{i})_{i \in I}\) is the sequence of predicted values from our model, and
- \(\mu\) is the mean of y calculated in the ordinary way: \(\mu = \frac{1}{\#I} \left( \sum_{i \in I} y_{i} \right)\) where \(\#I\) is the number of terms in y.
The calculation for \(R^{2}\) is simple to implement:
def coeff_determination(y, p):
    ymean = sum(y) / len(y)
    comp = sum([(yi - pi)**2 for yi, pi in zip(y, p)]) / sum([(yi - ymean)**2 for yi in y])
    return 1 - comp
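Two quick sanity checks on the definition: predictions that match the data exactly should give \(R^{2} = 1\), and a model that always predicts the mean should give \(R^{2} = 0\). Here is a minimal sketch with made-up numbers, restating the function so that the snippet runs on its own:

```python
def coeff_determination(y, p):
    ymean = sum(y) / len(y)
    residual = sum((yi - pi) ** 2 for yi, pi in zip(y, p))  # unexplained variation
    total = sum((yi - ymean) ** 2 for yi in y)              # total variation
    return 1 - residual / total

y = [10.0, 20.0, 30.0, 40.0]   # made-up observations, not the fee data
mean_y = sum(y) / len(y)

perfect = coeff_determination(y, y)                     # predictions equal the data
baseline = coeff_determination(y, [mean_y] * len(y))    # always predict the mean

print(perfect)    # 1.0
print(baseline)   # 0.0
```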
The results are as follows:
Linear model: 0.7562801560992423
Quadratic model: 0.8048401397741687
Cubic model: 0.8355798702015268
Quartic model: 0.8380772813567788
The \(R^{2}\) values are all relatively high and the graphs don’t seem to be over-fitting the data. They all reasonably capture the increasing average transfer fees as time progresses since the beginning of the century.
Taking into account all of the above, I’m inclined to side with the cubic model over the others.
To predict the value of the cubic model for the year 2022, I can define a Polynomial class that I’ll feed the cubic model coefficients (found with .polyfit(x, y, 3)) to:
class Polynomial:
    def __init__(self, coeffs):
        self.coeffs = coeffs

    def evaluate(self, a):
        output = 0
        for i in range(len(self.coeffs)):
            output += self.coeffs[i] * (a ** (len(self.coeffs) - 1 - i))
        return output

prediction = Polynomial(cubic_model).evaluate(22)
Thus, the final prediction of this analysis is roughly £77.89M!
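As a cross-check on the Polynomial class, the same evaluation can be written with Horner's rule, which avoids computing powers explicitly. This is a minimal sketch: horner is a hypothetical helper that is not part of the original analysis, and the coefficients are made up rather than taken from the fitted cubic model:

```python
def horner(coeffs, a):
    """Evaluate a polynomial whose coefficients run from highest power to constant."""
    output = 0
    for c in coeffs:
        # Multiply the running total by a, then add the next coefficient.
        output = output * a + c
    return output

# Made-up cubic 2x^3 - 3x^2 + 0x + 5 evaluated at x = 2: 16 - 12 + 0 + 5 = 9
value = horner([2, -3, 0, 5], 2)
print(value)  # 9
```

The coefficient order matches what polyfit returns, so a fitted model could be passed to horner directly.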
Postgame Analysis
1 September, 2022
The chosen polynomial linear regression model came quite close to the observed record fee. Antony was the record transfer in the 2022 summer transfer window at a fee of £82.2M, which is £4.31M over the model’s prediction of £77.89M!
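The size of that miss can be checked with a couple of lines of arithmetic. Here is a small sketch using only the two figures quoted above; expressing the error relative to the observed fee is my own choice of yardstick:

```python
predicted = 77.89   # model's prediction in £M (from this post)
observed = 82.2     # Antony's reported fee in £M (from this post)

error = observed - predicted
relative_error = error / observed   # error as a fraction of the observed fee

print(f"absolute error: £{error:.2f}M")         # absolute error: £4.31M
print(f"relative error: {relative_error:.1%}")  # relative error: 5.2%
```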
Football inflation is truly incredible (and unsustainable…?)