Blog of Random Thoughts and Pictures

Euro 2020 (in 2021) Predictions Review

July 18th, 2021

So before the tournament got underway I made my predictions for the Euro 2020 tournament and the results are in.

I had 2 of the Semi-finalists Italy and England, and dare I say I’m chuffed to have called the Winner (Italy) and the Runners-up (England).

I was not so good with the total goals to be scored: I had 129, whereas 142 goals were scored. I also didn’t get to name the top goalscorer for the tournament. I had Harry Kane, who finished just 1 goal away, and in truth this was one goal away from winning a top prize too, but talk of that another day.

All in all though I wasn’t too far away.

I’ve also been able to look back on a scoresheet I did out at the start, where the results were based on my gut reaction to the teams and possible scorelines. Needless to say it wasn’t very good: I got only one of the semi-finalists (Spain) and none of the finalists, so here’s to more data-based decisions.

Euro 2020 (in 2021) Predictions

June 11th, 2021

More predictions, but this time I was also entering a local Euro 2020 predictions competition. This one has proven to be a much harder undertaking. So I grabbed one big spreadsheet, cranked the grey matter (no machine involved this time) and went for the following.

Shall I explain ……. na there’s loads of previews out there already, let’s just see what happens.

Semi finalists 1: France
Semi finalists 2: Italy
Semi finalists 3: Netherlands
Semi finalists 4: England
Winner: Italy
Runner-up: England
Total goals to be scored: 129
Name the top goalscorer for the tournament: Harry Kane

Expected Goals (xG)

May 9th, 2021

As part of my LOI weekly predictions (don’t ask about match day 10, it was a wipe out) I’ve also included an indication of the teams’ expected goals (xG). Now it might be worth giving an introduction to expected goals, and there are four videos worth a review.

  • One by David Sumpter (Friends of Tracking)
  • One by Duncan Alexander at Opta.
  • One by Tifo football.
  • One with example goals with the xG overlaid on the screen.

First up David Sumpter on How to explain expected goals to a football player.

In this video David takes us through the probability of scoring a goal in the penalty area, with an overview of Barcelona’s expected goals statistics. He contrasts a penalty, which is roughly a 75% chance of scoring, with a shot that has only a 7% chance of scoring, and explains what this means.

True to the point there can and should be a reasoned discussion around goal scoring (and goal prevention) instead of always the emotional one, and xG gives some insights on this.

Opta’s Duncan Alexander takes us through the expected goals metric in the video Opta Expected Goals.

Of note (at the time of the video), 4 variables are considered by Opta: Assist Type, Header / Foot, Big Chance and Angle / Distance.

The video by Tifo Football is a nice By The Numbers presentation on What is xG ?

They describe in nicer detail how good a shooting chance was and how likely similar chances were to result in a goal. They also highlight that some providers like StrataBet consider defenders in the way while other models like Opta’s do not, and that xG gives the best value over sets of 5~10 games.

Expected goals (xG)

So expected goals (xG) is the probability of scoring a goal from a given shot: it looks at how good a shooting chance was by asking how likely similar chances were to result in a goal.
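As a rough illustration of treating xG as a probability, the sketch below adds up per-shot scoring probabilities to get a team total. The 0.75 (penalty) and 0.07 figures are the ones mentioned in the David Sumpter video; the mix of shots and the `total_xg` function name are made up for the example.

```python
def total_xg(shot_probabilities):
    """Sum per-shot scoring probabilities into an expected-goals total."""
    for p in shot_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"xG of a shot must be a probability, got {p}")
    return sum(shot_probabilities)


# Hypothetical match: one penalty (0.75) and three low-quality shots (0.07 each).
shots = [0.75, 0.07, 0.07, 0.07]
team_xg = total_xg(shots)
print(f"Team xG: {team_xg:.2f}")
```

A team total like this says nothing about a single match on its own, which fits the point that xG is best read over a run of games.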

Finally here’s a video demonstration of expected goals (xG)

Good Practice in Football Visualisation

April 25th, 2021

This is a review of a special guest lecture from Opta’s Peter McKeever where he gives some insights into how to make better data visualisations.

In this video Peter covers:

  • Elements of Matplotlib
  • Under the Hood: rcParams
  • Layering objects with zorder in plots
  • Works through a real world example

The origin of the code is available on Github under the project “friends-of-tracking-viz-lecture” and worked through in this video.

Peter’s slides are available here in this PDF document. Peter also has an excellent blog with code and further examples.

Set up

Under the organisation on Github called mmoffoot I forked the “friends-of-tracking-viz-lecture” repo into the mmoffoot area.

Then I created a new branch called ‘tottenham’ in this area to cover the changes I made.

What was coded

This is another Jupyter Notebook, but this time I just could not get it to load up the highlight_text python library and so I had to create a bog standard Python programme to run through this code base.

Then I found that highlight_text has changed its interface slightly since Peter coded against it. For example in the Notebook there’s the line

htext.fig_htext(s.format(team,ssn_start,ssn_end),0.15,0.99,highlight_colors=[primary], highlight_weights=["bold"],string_weight="bold",fontsize=22, fontfamily=title_font,color=text_color)

I had to change it to

htext.fig_text(0.15,0.86,s.format(team,ssn_start,ssn_end),highlight_colors=[primary], highlight_weights=["bold"],fontweight="bold",fontsize=22, fontfamily=title_font,color=text_color)

Given that Peter McKeever has run through all the elements coded via the YouTube video and there’s an associated slide deck this is a really nice resource to get started on exact visual items and how to then code them up. Of course for devilment I’ve gone for a Tottenham theme for the final output.

Tottenham’s goal difference from 2010/2011 to 2019/2020

Peter also talks about the blog posts by Lisa Rost, which are well worth a review on how to visualise data. He also gives a pointer towards Tim Bayer and his work on Fantasy Premier League, all of which is excellent.

Finally there’s ThemePy, which is being developed; it is a theme selector / creator and aesthetic manager for Matplotlib. This wrapper’s aim is to simplify the process of customising Matplotlib plots and to enable users who are relatively new to Python or Matplotlib to move beyond the default plotting params we are given with Matplotlib.

Player Ranking Framework

April 11th, 2021

As mentioned in Part I of this multipart post, Luca Pappalardo prepared a video, for the Friends of Tracking channel in 2020, to talk about some elements of a paper related to an open Wyscout data set, and advanced statistics related to passing networks, flow centrality and player ranking.

For this post (Part IV) I’m going to cover my take on the PlayeRank framework created by this team of researchers.

I’ve forked the “mapping-match-events-in-Python” repo into my mmoffoot area and created a new branch called ‘englanddata’ to cover the data set of English Premier League information for the 2017-18 season.

An exhaustive description of the PlayeRank framework is available in this paper Pappalardo, Luca, Cintia, Paolo, Ferragina, Paolo, Massucco, Emanuele, Pedreschi, Dino & Giannotti, Fosca (2019) PlayeRank: Data-driven Performance Evaluation and Player Ranking in Soccer via a Machine Learning Approach. ACM Transactions on Intelligent Systems and Technologies 10(5).

This Notebook builds player rankings from match events. The following steps are required:

  • compute feature weights (learning)
  • compute roles (learning)
  • compute performance scores (rating)
  • aggregate performance scores (ranking)
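To make the rating and ranking steps a little more concrete, here is a toy sketch. It is not the PlayeRank code: the feature names, the weights and the player labels are all hypothetical, the learning steps are skipped entirely (the weights are simply hard-coded), and the aggregation is a plain average per player.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical feature weights; in PlayeRank these come out of the learning step.
WEIGHTS = {"accurate_pass": 0.02, "shot_on_target": 0.30, "foul": -0.10}


def performance_score(features, weights=WEIGHTS):
    """Rating step: weighted sum of a player's feature counts in one match."""
    return sum(weights.get(name, 0.0) * count for name, count in features.items())


def rank_players(match_features):
    """Ranking step: average each player's match scores, then sort descending."""
    scores = defaultdict(list)
    for player, features in match_features:
        scores[player].append(performance_score(features))
    averaged = {player: mean(values) for player, values in scores.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Made-up per-match feature counts for two placeholder players.
matches = [
    ("Player A", {"accurate_pass": 20, "shot_on_target": 3, "foul": 1}),
    ("Player A", {"accurate_pass": 15, "shot_on_target": 1, "foul": 0}),
    ("Player B", {"accurate_pass": 40, "shot_on_target": 0, "foul": 2}),
]
ranking = rank_players(matches)
for player, score in ranking:
    print(f"{player}: {score:.2f}")
```

The real framework also splits players by role before ranking, which is why the Notebook shows a separate ranking for each position.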

It doesn’t take long to run through the [In] steps of the Notebook and for the English data you end up with Figure 1 as seen below.

Figure 1: Player Ranking English Premier League 2017-2018

The visual output from the Notebook is interactive, which is great as you can hover over the points to catch the name. For example in the striker role H. Kane is the outlier (at the top), with S. Agüero second. There’s even a drop down menu to do a comparison.

Figure 2: Player Ranking Comparison for H. Kane and S. Agüero English Premier League 2017-2018

The positions are an interesting element of this ranking system, which is based on a role matrix.

Team Attacking left to right, position 0 is a Striker

And the top players from English Premier League 2017-2018 for each role position

  • Position 0. = H. Kane
  • Position 1. = L. Milivojevic
  • Position 2. = N. Monreal
  • Position 3. = D. Janmaat
  • Position 4. = S. Mane
  • Position 5. = M. Salah
  • Position 6. = N. Otamendi
  • Position 7. = J. Stephens

If I look at the PFA Premier League Team of the Year for 2017-18, Otamendi, Kane and Salah were named in it and also appear here in PlayeRank, but none of the rest. I wonder how D. Silva and K. De Bruyne both of whom are in the PFA team, missed out in this PlayeRank framework.

Overall, over 4 posts, I can say this is a great Jupyter Notebook, firstly to really learn about Jupyter Notebooks, and secondly to be able to see the structure and how to use WyScout data. It is so important given this data set is used by so many top clubs for the scouting, analysis and recruitment of players.

I got caught a little on the passing networks and the flow centrality, but a thread of further investigation into a measure of cohesiveness within the team would certainly be a nice continuation of this topic.

The player ranking and the full explanation of the PlayeRank Framework was fantastic and a joy to read and interact with.

Passing networks and Flow centrality

April 5th, 2021

As mentioned in Part I of this multipart post, Luca Pappalardo prepared a video, for the Friends of Tracking channel in 2020, to talk about some elements of a paper related to an open Wyscout data set, and advanced statistics related to passing networks, flow centrality and player ranking.

For this post (Part III) I’m going to cover my take on Passing networks and Flow centrality.

I’ve forked the “mapping-match-events-in-Python” repo into my mmoffoot area and created a new branch called ‘englanddata’ to cover the data set of English Premier League information for the 2017-18 season.

Passing networks

This Notebook creates a player passing network for any of the matches covered in the data set. The passing network is a weighted network where nodes are players and weighted edges represent movements of the ball between players. The size of an edge is proportional to the number of passes between the players.
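The weighted network described here can be sketched with nothing more than a counter over (passer, receiver) pairs. This is only a minimal illustration with placeholder player labels, not the Notebook’s code; `build_passing_network` is a name I made up.

```python
from collections import Counter


def build_passing_network(passes):
    """Return (nodes, edges); edges maps (passer, receiver) to a pass count."""
    edges = Counter(passes)
    nodes = {player for pair in edges for player in pair}
    return nodes, edges


# Placeholder pass list as (passer, receiver) pairs; not taken from the data set.
passes = [("A", "B"), ("A", "B"), ("B", "C"), ("C", "A"), ("A", "B")]
nodes, edges = build_passing_network(passes)
for (passer, receiver), weight in edges.most_common():
    print(f"{passer} -> {receiver}: {weight} passes")
```

The pass count on each pair is the edge weight, which is what gets drawn as the edge size in the plots.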

Some finer details are covered in a research paper: Cintia et al., The harsh rule of the goals: data-driven performance indicators for football teams, In Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA’2015), 2015.

Now I wanted to continue with a passing network from the Tottenham Hotspur – Leicester City match (2018), given this was the match used for the events up to this point. However, it just turned out as a blank image for both teams via the Notebook. I tried another match and only got one team, and finally the Arsenal – Burnley match produced what I was looking for, so I moved over to this match.

Arsenal Passing Network, Arsenal – Burnley, May 13, 2018
Burnley Passing Network, Arsenal – Burnley May 13, 2018

The produced images need some time for study; even a quick look leaves me wondering what I can take from them. I also tried to gain some insight from a related paper by the authors, P. Cintia, S. Rinzivillo, and L. Pappalardo, “A network-based approach to evaluate the performance of football teams,” in Proceedings of the Machine Learning and Data Mining for Sports Analytics workshop, ECML/PKDD 2015, where it mentions again that nodes are players, directed edges represent passes between players and the size of an edge is proportional to the number of passes between the players. Node 0 indicates the opponent’s goal, and edges ending in node 0 represent goal attempts. However, there is no Node 0 in these passing networks.

The related video at minute 15:00 has a section that describes the passing networks too, but unfortunately there’s still not enough in there to give a hint as to what the take away is.

Flow centrality

Next up is flow centrality, which is a feature that can be computed on the passing network and in this Notebook is described as a way to capture the fraction of times that a player intervenes in those paths that result in a shot. They take into account defensive efficiency by letting each player start a number of paths proportional to the number of balls that he recovers during the match.
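A heavily simplified sketch of that idea: take a set of possession sequences, keep the ones that end in a shot, and work out the fraction of those in which each player is involved. The sequences and player labels are placeholders, and the weighting by ball recoveries mentioned above is left out, so treat it as an illustration of the definition rather than the Notebook’s computation.

```python
from collections import Counter


def flow_centrality(sequences):
    """Fraction of shot-ending sequences in which each player takes part."""
    shot_sequences = [players for players, ended_in_shot in sequences if ended_in_shot]
    if not shot_sequences:
        return {}
    involvement = Counter()
    for players in shot_sequences:
        involvement.update(set(players))  # count a player once per sequence
    return {player: n / len(shot_sequences) for player, n in involvement.items()}


# Placeholder sequences: (players touching the ball in order, ended in a shot?)
sequences = [
    (["A", "B", "C"], True),
    (["B", "C"], True),
    (["A", "D"], False),
    (["C", "B", "C"], True),
]
centrality = flow_centrality(sequences)
for player, value in sorted(centrality.items()):
    print(f"{player}: {value:.2f}")
```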

This concept is only lightly explained in the paper Duch et al., Quantifying the Performance of Individual Players in a Team Activity, PLoS ONE 5(6): e10937, as referenced in the Notebook, and I must admit I still didn’t get it. A read of a source paper by Freeman LC, A set of measures of centrality based upon betweenness, Sociometry 40: 35–41, 1977, clears it all up: “when a particular person in a group is strategically located on the shortest communication path connecting pairs of others, that person is in a central position”. That same paper goes on to show measures that define centrality in terms of the degree to which a point falls on the shortest path between others and therefore has a potential for control of communication.
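Freeman’s shortest-path idea can be sketched in plain Python. The block below counts, for every ordered pair of other nodes, the share of shortest paths that pass through each node, using Brandes’ accumulation scheme (my choice for the sketch, not necessarily what the Notebook does). The tiny network, the node labels and the `betweenness` function name are all placeholders, and edge weights are ignored.

```python
from collections import deque


def betweenness(graph):
    """Shortest-path betweenness for an unweighted directed graph.

    graph maps each node to the nodes it passes to. Every ordered pair
    (source, target) is counted, so a symmetric graph scores each pair twice.
    """
    nodes = set(graph) | {w for targets in graph.values() for w in targets}
    score = {n: 0.0 for n in nodes}
    for source in nodes:
        dist = {n: -1 for n in nodes}   # -1 means not reached yet
        sigma = {n: 0 for n in nodes}   # number of shortest paths from source
        preds = {n: [] for n in nodes}  # predecessors on shortest paths
        dist[source], sigma[source] = 0, 1
        order, queue = [], deque([source])
        while queue:  # breadth-first search from the source
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, ()):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {n: 0.0 for n in nodes}
        for w in reversed(order):  # accumulate dependencies, farthest first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    return score


# Placeholder network: A and C only connect through B, in both directions.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
scores = betweenness(graph)
print(scores)
```

Here B is the only link between A and C, so it picks up all of the betweenness while A and C get none.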

Flow Centrality Burnley 2018, Arsenal – Burnley May 13, 2018

Dare I say Westwood (Burnley) is the betweenness man in control for Burnley.

Might be a follow up idea to classify based on how they measure cohesiveness within the team.

Advanced football visualisations duels on the pitch Italy compared to England

March 27th, 2021

As mentioned in Part I of this multipart post, Luca Pappalardo prepared a video, for the Friends of Tracking channel in 2020, to talk about some elements of a paper related to an open Wyscout data set, and advanced statistics related to passing networks, flow centrality and player ranking.

For this post (Part II) I’m going to cover my take on the match evolution and spatial stats.

I’ve forked the “mapping-match-events-in-Python” repo into my mmoffoot area and created a new branch called ‘englanddata’ to cover the data set of English Premier League information for the 2017-18 season.

Spatial distribution of events

There are tons of events collated in the WyScout API Events from duels to fouls to interruptions, as explained in the WyScout API document. For passes there are 6 types:

  • Pass = Hand pass
  • Pass = Head pass
  • Pass = High pass
  • Pass = Launch
  • Pass = Simple pass
  • Pass = Smart pass
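Counting how often each sub-type shows up is a quick way to get a feel for the events. The sketch below uses a handful of placeholder events; the `eventName` and `subEventName` keys are how I recall the fields being named in the public data set, so treat them as an assumption and check them against the JSON.

```python
from collections import Counter

# Placeholder events shaped like the Wyscout records (field names are an assumption).
events = [
    {"eventName": "Pass", "subEventName": "Simple pass"},
    {"eventName": "Pass", "subEventName": "High pass"},
    {"eventName": "Duel", "subEventName": "Aerial duel"},
    {"eventName": "Pass", "subEventName": "Simple pass"},
]


def count_subtypes(events, event_name):
    """Count the sub-types of one event type, e.g. the kinds of pass."""
    return Counter(
        event["subEventName"] for event in events if event["eventName"] == event_name
    )


pass_counts = count_subtypes(events, "Pass")
for subtype, count in pass_counts.most_common():
    print(f"{subtype}: {count}")
```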

In this Notebook there’s an interesting set of images created which show the distribution of positions per event type. These kernel density plots show the distribution of the events’ positions during the match with the darker the green representing the higher number of events in a specific zone of the field.
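A much cruder cousin of those density plots is a simple grid count of event positions. The sketch below is not a kernel density estimate, just a coarse 2D histogram, and it assumes positions are given as x and y percentages of the pitch (0 to 100), which is my reading of the data set. The sample positions and the `bin_positions` name are made up.

```python
def bin_positions(positions, cols=5, rows=4):
    """Count events per grid cell; positions are (x, y) pairs in the range 0-100."""
    grid = [[0] * cols for _ in range(rows)]
    for x, y in positions:
        col = min(int(x / 100 * cols), cols - 1)  # clamp x == 100 into the last column
        row = min(int(y / 100 * rows), rows - 1)  # clamp y == 100 into the last row
        grid[row][col] += 1
    return grid


# Placeholder event positions, not taken from the data set.
positions = [(10, 10), (12, 15), (95, 50), (100, 100), (55, 40)]
grid = bin_positions(positions)
for row in grid:
    print(row)
```

The darker zones in the Notebook’s plots correspond to the cells that would hold the larger counts here, only smoothed.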

Figure 1: Move the slider to compare Italian Serie A Duels and English Premier League Duels 2017-18

The first image is of duels, and in the WyScout world “Duel” has a specific meaning,

A challenge between two players to gain control of the ball, progress with the ball or change its direction.

With a number of subtypes to consider too: Defensive duel, Offensive duel, Aerial duel, Loose ball duel, and Sliding tackle

For the moment I’m not going into the whys and wherefores of these subtypes, but it’s really interesting to review and compare the images and to see the difference of where the Italian league and the English league host their duels. Dare I say right-backs, left-backs and right wingers, left wingers should look closer if they are moving between the leagues.

A big summer move from England to Italy was Emre Can from Liverpool to Juventus, with Stephan Lichtsteiner coming in from Juventus to Arsenal. Maybe a view of these plots before the new 2018-19 season got underway might have been handy.

Of note there’s a 10,000 event sample size in here by default, so for the Italian and English leagues this represents about 6 matches’ worth of events, and a larger sample size would be nice to see and compare against. It would also be nice to identify specific players (RB, LB and RW, LW) that were strong in those main duel locations, however that will have to be for another day.

Here are where the fouls happen.

Figure 2: Move the slider to compare Italian Serie A Fouls and English Premier League Fouls 2017-18

And the shots.

Figure 3: Move the slider to compare Italian Serie A shots and English Premier League shots 2017-18

Intra-match evolution of the events

Goals are the mainstay of football, and so when looking at the English and Italian leagues (season 2017-18) it’s pleasing to see the difference between the leagues, especially in the 1st half goals.

Yellow cards and red cards are covered in the data set too, and displayed in the Jupyter Notebook but I’ll be honest and say I didn’t take too much time to analyse the results here, because I was fascinated by the Duel plots.

Advanced football visualisations and data analysis of match events

March 22nd, 2021

Luca Pappalardo, an author of the paper (PCR2019) Pappalardo, L., Cintia, P., Rossi, A. et al. A public data set of spatio-temporal match events in soccer competitions. Nature Scientific Data 6, 236 (2019), prepared a video for the Friends of Tracking channel in 2020 to talk about some elements of this paper and the related Wyscout data set used for it.

In this video Luca covers:

  • The Wyscout data set, how it is collected, from players to events.
  • Basic statistics on events and distributions.
  • Plotting events on the field, match evolution and spatial stats.
  • Advanced statistics: passing networks, flow centrality and PlayeRank

For this blog post I’m going to cover my take on the player events in the Wyscout data set and the display of some basic statistics on events and distributions.

The origin of the code is available on Github under the project “mapping-match-events-in-Python” and worked through in this video.

Set up

I forked the “mapping-match-events-in-Python” repo into my mmoffoot area on Github, then created a new branch called ‘englanddata’ in this area to cover the changes I made.

The example code base uses the Italian league data, but the branch name might be a giveaway: seeing as the data set has English Premier League information for the 2017-18 season, I wanted to run the code base against that data set. So I took a copy of the original Jupyter Notebook and ran it against the English data as data_england_exploration.ipynb.

The full list of data available includes:

  • Italian first division 2017-18
  • English first division 2017-18
  • Spanish first division 2017-18
  • French first division 2017-18
  • German first division 2017-18
  • European Championship 2016
  • World Cup 2018

All the matches, events, players, and competition data sets are hosted in a figshare repository with all the data stored in a JSON format.

The way the data is collected is explained in the paper, with a nice visual representation in the Notebook so I won’t ruin that insight and will let you read it in there.

I should say a quick word on Jupyter Notebooks: it’s an interactive way of developing and presenting data science projects, and I can really see that it’s an easy way to follow the code base for this project. It’s easy enough to install Jupyter Notebook on a machine too, and well worth the install.

Plotting events on the field

There are a number of nice overviews of the structure of data given in the early part of the Notebook, but it’s more interesting when it comes to the static plots.

Figure 1: All Events Tottenham Hotspur 5 – 4 Leicester City, May 13, 2018.

Although of course too much detail can overwhelm, and so the interactive plots in this Notebook are a much better mechanism to share this information: you just have to hover the mouse over an event and its details come to the fore.

Figure 2: Pass Events Tottenham Hotspur – Leicester City, May 13, 2018.
 
Figure 3: Foul Events Tottenham Hotspur – Leicester City, May 13, 2018.
 
Figure 4: Fouls by a specific player Tottenham Hotspur – Leicester City, May 13, 2018.

This is a great Jupyter Notebook, firstly to really learn about Jupyter Notebooks, and then of course to be able to see the structure and how to use WyScout data. It is so important given this data set is used by so many top clubs for the scouting, analysis and recruitment of players.

There’s more to come, as I plan to complete the match evolution and spatial stats in Part II of this blog post and finally cover the advanced statistics (passing networks, flow centrality and PlayeRank) in a Part III of this blog post.

Handling StatsBomb Event Data

February 7th, 2021

Yes I did have a blog post back in September 2020, highlighting that I was undertaking the Uppsala University course “Mathematical Modelling of Football” which overtook my time and life in Q4 2020. In completing the course I’m finally getting my head up in 2021 to record all my notes and code and to take a journey to share those notes as I go down through each section and sub-section of the course.

So here’s the first of hopefully many notes from the course, which were originally written in Asciidoc.

 

Purpose of Handling Event Data

The first element to work on via this course is Handling Event Data and the purpose is to learn how to :-

  • Download code and data
  • Organise working folder
  • Load in data from a json file.
  • Using ‘for’ loops and ‘if’ statements
  • Identify specific matches in Statsbomb data

The code needed for this lecture is available at the Github SoccermaticsForPython repo.

Set up

Created a new organisation on Github called mmoffoot standing for Mathematical Modelling of Football. The purpose is to fork the Github projects used in the course to track my own changes to those repos.

First up is a fork of the SoccermaticsForPython repo into the mmoffoot area.

Then created a branch called ‘week1’ covering the changes I had made as of week1 of the course.

The next little hurdle here is the loading of the StatsBomb data. It really lives in another repo on Github called statsbomb / open-data, and in order to always have access to the StatsBomb data within this repo I was going to set up a git submodule for it. This means that any time in the future when this repo is freshly cloned, it has to be done with the recursive command switch.

However, I then noted that the StatsBomb data is over 3 GB in size and that it doesn’t really make sense to have a couple of copies of this data on the one machine, so I just placed it in a directory higher. I then just added a soft link to the source data within the ‘Statsbomb’ folder.

ln -s ../../statsbomb-opendata/data .

Also modified the README file to point this out.

What was coded

The first exercise is to

  1. Edit the code to print out the result list for the Mens World cup
  2. Edit the code to find the ID for England vs. Sweden
  3. Write new code to write out a list of just Sweden’s results in the tournament.

I made the code changes to 1LoadInData.py and ran the code as

python3 1LoadInData.py

The output reads

The match between Croatia and Denmark finished 1 : 1
The match between Australia and Peru finished 0 : 2
.........
.........
The match between Spain and Russia finished 1 : 1
The match between Croatia and England finished 2 : 1
The Sweden match between Mexico and Sweden finished 0 : 3
The Sweden match between Sweden and South Korea finished 1 : 0
The Sweden match between Sweden and Switzerland finished 1 : 0
Sweden vs England has id:8651
The Sweden match between Sweden and England finished 0 : 2
The Sweden match between Germany and Sweden finished 2 : 1

I think the exercise is complete.
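For anyone who wants the gist without the repo, here is a minimal sketch of the three steps using a ‘for’ loop and ‘if’ statements. It is not the contents of 1LoadInData.py: the JSON is an inline stand-in trimmed to the few fields used, the helper names are my own, and match ids 1 and 2 are placeholders (only 8651 comes from the output above).

```python
import json

# Inline stand-in for a StatsBomb matches file; real files hold many more fields.
# Ids 1 and 2 are placeholders, 8651 is the id printed in the output above.
MATCHES_JSON = """
[
  {"match_id": 1, "home_team": {"home_team_name": "Croatia"},
   "away_team": {"away_team_name": "Denmark"}, "home_score": 1, "away_score": 1},
  {"match_id": 2, "home_team": {"home_team_name": "Sweden"},
   "away_team": {"away_team_name": "Switzerland"}, "home_score": 1, "away_score": 0},
  {"match_id": 8651, "home_team": {"home_team_name": "Sweden"},
   "away_team": {"away_team_name": "England"}, "home_score": 0, "away_score": 2}
]
"""


def team_names(match):
    """Return the (home, away) team names of one match record."""
    return match["home_team"]["home_team_name"], match["away_team"]["away_team_name"]


def describe(match):
    """Format one result in the same style as the output above."""
    home, away = team_names(match)
    return f"{home} and {away} finished {match['home_score']} : {match['away_score']}"


def find_match_id(matches, team_a, team_b):
    """Return the id of the first match between the two teams, or None."""
    for match in matches:
        if set(team_names(match)) == {team_a, team_b}:
            return match["match_id"]
    return None


matches = json.loads(MATCHES_JSON)
sweden_results = []
for match in matches:
    print("The match between " + describe(match))
    if "Sweden" in team_names(match):
        sweden_results.append("The Sweden match between " + describe(match))

print("Sweden vs England has id:" + str(find_match_id(matches, "Sweden", "England")))
for line in sweden_results:
    print(line)
```

To run it against the real data, the `json.loads` call would be swapped for `json.load` on the World Cup matches file under the linked data folder.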

What was learned

Learning how to extract match results from the StatsBomb open data is important, and being able to read in the StatsBomb open data is great because at the time of writing it has a number of competitions covered in it.

  • International Mens FIFA World Cup 2018 (competition_id=43)
  • Europe Champions League 2018/2019
  • Europe Champions League 2017/2018
  • Europe Champions League 2016/2017
  • Europe Champions League 2015/2016
  • Europe Champions League 2014/2015
  • Europe Champions League 2013/2014
  • Europe Champions League 2012/2013
  • Europe Champions League 2011/2012
  • Europe Champions League 2010/2011
  • Europe Champions League 2009/2010
  • Europe Champions League 2008/2009
  • Europe Champions League 2006/2007
  • Europe Champions League 2004/2005
  • Europe Champions League 2003/2004
  • Europe Champions League 1999/2000
  • Spain La Liga 2018/2019
  • Spain La Liga 2017/2018
  • Spain La Liga 2016/2017
  • Spain La Liga 2015/2016
  • Spain La Liga 2014/2015
  • Spain La Liga 2013/2014
  • Spain La Liga 2012/2013
  • Spain La Liga 2011/2012
  • Spain La Liga 2010/2011
  • Spain La Liga 2009/2010
  • Spain La Liga 2008/2009
  • Spain La Liga 2007/2008
  • Spain La Liga 2006/2007
  • Spain La Liga 2005/2006
  • Spain La Liga 2004/2005
  • England Premier League 2003/2004
  • International Women’s World Cup 2019 (competition_id=72)
  • United States of America NWSL (Female) 2018
  • England FA Women’s Super League 2019/2020
  • England FA Women’s Super League 2018/2019

Maths, Modelling, Software and Football what a match.

September 2nd, 2020

Back in 2018 I took to reading “Soccermatics” and really enjoyed what I would say is a different view on the sport. I say different because it’s less of the blaa blaa and gives strength to looking at how football problems are being solved on the pitch, with the data to back it up.

More recently reading “Zonal Marking” has really excited the mind, and then watching the “Friends of Tracking” video sessions has been brilliant for taking the little software skills I have, applying them to football and bringing both books’ content into even sharper focus.

I recently spotted the chance to participate in the Uppsala University course “Mathematical Modelling of Football” which to me is just an opportunity not to be missed, and with some help of Google Translate and some very late nights I’m going to take on Mathematical Modelling of Football.

Here’s to going back to school !