A Look Into International Goalscoring

29 January, 2024

Categories: football, data analysis, pandas

Data: kaggle/martj42/goalscorers.csv

Code: Jupyter notebook

Competitive play is the meat and potatoes of international football. Who are the players that score consistently when it really counts? And what about the old complaint, “Back in my day, goals were few and far between!”? Join me in this brief analysis, where we’ll take information on 43,189 goals over 100 years of football to piece together some cold, hard interpretations!

The dataset in question describes each goal with 8 variables:

date        home_team  away_team  team     scorer           minute  own_goal  penalty
1916-07-02  Chile      Uruguay    Uruguay  José Piendibene  44.0    False     False

Who are the top scorers?

The code below groups the data by 'team' and 'scorer', then uses .agg() to count how many goals each player has scored:

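# One row per (team, scorer) pair, with the number of goals recorded for that player,
# sorted so the most prolific scorers come first.
record_goalscorers = goals.groupby(
        by=['team', 'scorer']
    )[['scorer']].agg(
        {'scorer': 'count'}
    ).rename(
        columns={'scorer': 'goals_scored'}
    ).reset_index().sort_values(by=['goals_scored'], ascending=False)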
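
The count above includes every row for a scorer, own goals among them. As a minimal sketch of a stricter tally, the snippet below drops rows flagged in the 'own_goal' column before counting. The toy frame and its placeholder names are made up for illustration, and it assumes 'own_goal' holds booleans, as in the sample row shown earlier.

```python
import pandas as pd

# Toy stand-in for the goals table; names and rows are placeholders, not real data.
goals = pd.DataFrame({
    'team': ['Uruguay', 'Uruguay', 'Chile', 'Uruguay'],
    'scorer': ['Player A', 'Player A', 'Player B', 'Player C'],
    'own_goal': [False, False, False, True],
})

# Keep only rows that are not own goals, then count rows per (team, scorer).
top_scorers = (
    goals[~goals['own_goal']]
    .groupby(['team', 'scorer'])
    .size()
    .reset_index(name='goals_scored')
    .sort_values('goals_scored', ascending=False)
)

print(top_scorers)
```

Using .size() counts rows per group directly, so there is no need to select and rename the 'scorer' column afterwards.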

It’s interesting to see that Lionel Messi is 5th in this list. His goals in friendly matches push him up to 3rd place (with 106 goals) in the overall international tally, as can be found here (accurate as of the date of this blog post).

Similarly, Ali Daei of Iran shoots all the way up from 7th with 49 goals to 2nd place with a whopping 108 goals. Friendly matches really skew the perception!

Is there any general trend in the number of goals scored in competitive matches over time?

As usual, a plot is worth a thousand words:
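
For reference, yearly totals like the ones plotted here could be computed roughly as follows. This is a sketch on a handful of made-up rows. The 'year' column matches the name used later in this post, while the 'goals' column name is my own assumption.

```python
import pandas as pd

# Toy stand-in for the goals table: only the 'date' column matters here.
goals = pd.DataFrame({
    'date': ['1916-07-02', '1916-07-06', '1930-07-13', '1930-07-30', '1930-07-30'],
})

# Parse the date strings and pull out the calendar year of each goal.
goals['year'] = pd.to_datetime(goals['date']).dt.year

# One row per year with the number of goals scored in it.
goals_by_year = goals.groupby('year').size().reset_index(name='goals')

print(goals_by_year)
```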

Overall, there’s been a general upturn in the number of goals. This is likely tied to the expansion of international competitions, continental championships and qualifiers: more matches mean more goals. There also seems to be some periodicity, with the total number of goals scored per year spiking and dropping in a regular pattern. This may be because more games are played during each major competition’s qualifying period, i.e. in the year preceding the competition. The tournaments labelled in the code that follows are the Euros and the World Cup.

Instead of a scatter plot, a bar chart colour-coded by competition may help to surface a pattern. To do so, I need to create another column that will serve as a reference point to colour-code the bars:

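def which_tournament(year):
    """Label a year with the major tournament played in it, if any."""
    # Euro 2020 was postponed and played in 2021.
    if year == 2021:
        return 'Euro'
    # Euros: every four years from 1960, skipping the postponed 2020 edition.
    if year >= 1960 and year != 2020 and (year - 1960) % 4 == 0:
        return 'Euro'
    # World Cups: every four years from 1930, except 1942 and 1946 (not held).
    if year >= 1930 and year not in (1942, 1946) and (year - 1930) % 4 == 0:
        return 'World Cup'
    return 'No Tournament'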

goals_by_year['tournament'] = goals_by_year['year'].apply(which_tournament)
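
Once the 'tournament' column exists, turning it into bar colours is a one-line mapping. The sketch below uses a toy frame with made-up totals. The palette, the 'goals' column name and the plot_goals_by_year helper are all hypothetical choices of mine rather than anything taken from the notebook.

```python
import pandas as pd

# Toy stand-in for goals_by_year; the goal totals are made up for illustration.
goals_by_year = pd.DataFrame({
    'year': [2016, 2017, 2018, 2019],
    'goals': [120, 95, 130, 110],
    'tournament': ['Euro', 'No Tournament', 'World Cup', 'No Tournament'],
})

# Hypothetical palette: one colour per tournament label.
palette = {
    'Euro': 'tab:blue',
    'World Cup': 'tab:red',
    'No Tournament': 'lightgrey',
}
bar_colours = goals_by_year['tournament'].map(palette)


def plot_goals_by_year(df, colours):
    """Draw one bar per year, coloured by the tournament label."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots()
    ax.bar(df['year'], df['goals'], color=list(colours))
    ax.set_xlabel('Year')
    ax.set_ylabel('Goals scored')
    return ax
```

Calling plot_goals_by_year(goals_by_year, bar_colours) should then give a chart where tournament years stand out from the grey bars.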

To go even more granular, a similar process can highlight every year that comes before, for example, a World Cup competition:
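
One way to flag those years, as a rough sketch: check whether a World Cup was played in the following year. The helper name is_pre_world_cup_year and the 'pre_world_cup' column are hypothetical, and the toy frame holds only a 'year' column.

```python
import pandas as pd


def is_pre_world_cup_year(year):
    """Return True if a World Cup was played in the year after `year`."""
    next_year = year + 1
    # World Cups: every four years from 1930, except 1942 and 1946 (not held).
    return (
        next_year >= 1930
        and next_year not in (1942, 1946)
        and (next_year - 1930) % 4 == 0
    )


# Toy stand-in for goals_by_year with only the column needed here.
goals_by_year = pd.DataFrame({'year': [2016, 2017, 2018, 2019, 2020, 2021]})
goals_by_year['pre_world_cup'] = goals_by_year['year'].apply(is_pre_world_cup_year)

print(goals_by_year)
```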

This seems like something that could be investigated more rigorously with the theory of time series.
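
As a first step in that direction, autocorrelation gives a quick check for a repeating cycle: if yearly totals follow a four-year rhythm, the series should correlate strongly with itself shifted by four years. The sketch below uses a synthetic series with an exact four-year pattern, not the real data, purely to show the mechanics.

```python
import pandas as pd

# Synthetic yearly totals with an exact four-year cycle (made-up numbers,
# not taken from the dataset).
yearly_goals = pd.Series([10, 12, 11, 20] * 5)

# Correlation of the series with itself shifted by 4 years and by 1 year.
lag4 = yearly_goals.autocorr(lag=4)
lag1 = yearly_goals.autocorr(lag=1)

print(f'lag 4: {lag4:.2f}, lag 1: {lag1:.2f}')
```

On real data the lag-4 value would be well short of perfect, and the long-term upturn would need to be removed first, since a trend inflates the autocorrelation at every lag.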


Maybe goals aren’t as hard to come by these days as previously thought!