The most surprising season in Premier League history? What are the odds…

It’s fair to say that the result of the 2015-2016 Premier League season was a shocking one, with Leicester City winning their first league title despite starting the season priced at up to 5000/1.
In all likelihood such a shock will never happen again, if only because bookmakers will have learned their lesson about offering such steep odds.
This made me wonder, though, how we can quantify the surprise factor of individual games using the odds offered by bookmakers before the match starts.

The data available at football-data.co.uk might be able to offer some insight.
They provide match results (and other stats) but also odds on these results from various bookmakers for several years of English and European football.
Let’s start with Premier League games since the 2000-2001 season, as from that point onwards the odds data is consistently present with only a few gaps.
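As a sketch of how loading this might look (the filenames below are hypothetical, and I’m assuming the site’s usual column layout, with HomeTeam, AwayTeam, FTR for the full-time result, and B365H/B365D/B365A for Bet365’s decimal odds):

```python
import pandas as pd

# Load one CSV per season and stack them together.
# Filenames are hypothetical; columns follow football-data.co.uk's usual format.
seasons = range(2000, 2016)
frames = []
for year in seasons:
    df = pd.read_csv(f"premier_league_{year}.csv")  # hypothetical filename
    df["Season"] = f"{year}-{year + 1}"
    frames.append(df)
matches = pd.concat(frames, ignore_index=True)
```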

Which were the most surprising individual results?

As you might expect, they are all away wins for unfancied teams against high-quality opposition.
At the top of the list is Blackburn’s famous 3-2 win against Manchester United at Old Trafford, the perfect storm of an upset: Blackburn were sitting at the bottom of the table at the time, and Man. United would have gone top with a win. There was also the added embarrassment of the game falling on Sir Alex Ferguson’s 70th birthday and, although nobody knew it at the time, United would go on to lose the title to rivals Man. City on goal difference, so had they not lost this game they would have won the league.

We can also group these results by team across each season, summing the odds on each of the matches that they won.
I was expecting to see Leicester’s 2015-2016 season appear top here, but it was beaten into second place by West Ham in the same season.

You can think of the ‘Win odds sum’ column as the total amount you’d win if you bet £1 on this team to win their game every week.
On average this should come out at about £38 for the season (or a bit less, accounting for the bookmakers’ profit margin), since you would expect to win back about £1 for every £1 staked.
So West Ham’s 2015-2016 value of 73 is very impressive.
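For concreteness, here’s roughly how the ‘Win odds sum’ could be computed, reusing the matches DataFrame and the assumed columns from the sketch above:

```python
# For each match a team won, take the decimal odds offered on that win,
# then total them per team per season. Decimal odds include the stake,
# so the sum is the total return on £1-per-game bets.
home_wins = matches.loc[matches["FTR"] == "H", ["Season", "HomeTeam", "B365H"]]
away_wins = matches.loc[matches["FTR"] == "A", ["Season", "AwayTeam", "B365A"]]
home_wins.columns = ["Season", "Team", "WinOdds"]
away_wins.columns = ["Season", "Team", "WinOdds"]

wins = pd.concat([home_wins, away_wins])
odds_sum = wins.groupby(["Season", "Team"])["WinOdds"].sum()
print(odds_sum.sort_values(ascending=False).head(10))
```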

Breaking that down by game:

Whereas for Leicester we have:

So even though Leicester won more games, and still at long odds, it is West Ham’s capacity for big away wins, despite having a mid-table season, which makes them the most surprising team (in terms of winning at long odds) in recent Premier League history.
Over the whole history of this dataset we can plot these cumulative winning odds:
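A plot along these lines could be produced from the odds_sum series built above:

```python
import matplotlib.pyplot as plt

# One line per team, tracking their cumulative winning odds season by season.
odds_sum.unstack("Team").plot(legend=False)
plt.ylabel("Win odds sum")
plt.show()
```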

The unusually low point last season was of course Aston Villa, while the all-time low, in the 2007-2008 season, belongs to my own club, Derby County, whose solitary win that year came against Newcastle.
Putting £1 on Derby to win every week would have returned less than a fiver across the whole season.

One last thing we might like to do here is try to quantify how surprising a season is as a whole, so let’s take the variance of the cumulative winning odds across the 20 teams:
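Continuing the sketch from above, this is a one-liner on the odds_sum series:

```python
# Variance of the 20 teams' cumulative winning odds within each season:
# a rough proxy for how surprising the season's results were overall.
season_surprise = odds_sum.groupby(level="Season").var()
print(season_surprise.sort_values(ascending=False))
```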

It’s pretty clear, then, that last season had by far the most surprising results in recent Premier League history. Let’s hope it continues.

Prediction markets without money

One of my favourite websites at the moment is Metaculus – an opinion/prediction aggregator where players give predictions between 1% and 99% for the likelihood of various events.
Depending on your prediction and the outcome of the event (as well as the predictions of others) you receive or lose points for being right or wrong.
Increasing your confidence in your prediction increases the amount you can win, but also the amount you can lose. Importantly, though, scores are calculated in such a way that you maximise your expected score by setting the prediction slider to your true belief about the event.
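Metaculus’s actual scoring formula is more involved than this, but a toy logarithmic scoring rule is enough to show the key property (the belief q and the slider grid below are just my own illustration):

```python
import numpy as np

# Under a log scoring rule you score log(p) if the event happens and
# log(1 - p) if it doesn't. If your true belief is q, your expected score
# is q*log(p) + (1-q)*log(1-p), which is maximised at p = q: honesty wins.
q = 0.7                          # your true belief that the event happens
p = np.linspace(0.01, 0.99, 99)  # positions of the prediction slider
expected = q * np.log(p) + (1 - q) * np.log(1 - p)

print(f"expected score peaks at p = {p[np.argmax(expected)]:.2f} (true belief q = {q})")
```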

I’ve long been a fan of the idea that prediction markets can provide accurate estimates of the likelihood of events using the wisdom of the crowds, but usually the problem is that people are unwilling to invest money (or to be seen to be gambling, especially on serious issues).
Playing for points is low stakes, sure, but lots of people are willing to spend a lot of time and effort acquiring points elsewhere on the internet, so I hope that Metaculus grows in popularity.

We won the Data Science Institute’s Summer Datachallenge!

Last week, Alvis Tang (another PhD student at the Centre for Complexity Science) and I found out we had won the Summer Datachallenge, a data science competition hosted by Imperial’s Data Science Institute. Well, we were joint winners with another physicist at Imperial, Jason Cole, but we were still very pleased.

The challenge involved taking a load of raw data concerning London in 2012, including house prices, tweets, Olympic medals, theatre ticket sales, and the data we ended up using, which came from London’s transport system. The tube data consisted of the numbers of entries and exits at each tube station, in 15-minute intervals throughout the whole year, coming to a ~100MB text file. We began by just exploring the data to see what patterns emerged.
This is the number of people entering and exiting my local station, Kennington, on this day two years ago.

[Figure: entries and exits at Kennington, in 15-minute intervals over one day]

As you might expect, there are two daily peaks around the morning and evening rush hours (this was a Tuesday). Since Kennington is a largely residential area, more people enter in the morning and more leave in the evening, and the evening peak is more spread out, suggesting that people leave for work at roughly the same time but get home at lots of different times. We got similar plots for every other tube station in the network, and then decided to have a look at the Olympic period in the summer of 2012 to see whether there was a significant difference. In particular, we were interested in whether TfL’s ‘Get Ahead of the Games’ scheme, designed to get commuters to work from home, was successful.
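For a flavour of that exploration, here’s a minimal sketch; the file name, the date, and the column names (station, timestamp, entries, exits) are all my own assumptions, as the real TfL extract may well be laid out differently:

```python
import pandas as pd
import matplotlib.pyplot as plt

tube = pd.read_csv("tube_counts_2012.csv", parse_dates=["timestamp"])

# Pull out one station on one (arbitrary, hypothetical) day and plot
# the 15-minute entry and exit counts.
kennington = tube[tube["station"] == "Kennington"]
one_day = kennington[kennington["timestamp"].dt.date == pd.Timestamp("2012-07-10").date()]

one_day.plot(x="timestamp", y=["entries", "exits"])
plt.title("Kennington, 15-minute entry/exit counts")
plt.show()
```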

So we picked a few stations around business areas, and compared their traffic on Olympic days against traffic on a usual summer day.
[Figures: traffic at Bank, Canary Wharf, London Bridge, and Liverpool Street]

Here we are looking at the total traffic through the station, i.e. entrances + exits. For almost all the stations this plot looks the same: there is no real difference during peak hours, but outside peak hours there is a small increase in passenger numbers. The one exception to this rule is Canary Wharf, where there was a drastic reduction in numbers during peak commuting hours – so it looks like the ‘Get Ahead of the Games’ campaign worked a lot better there than it did anywhere else.
We guessed that the difference is largely due to HSBC, who reportedly had up to 40% of their workforce working from home, which more than accounts for the difference at commuting peak times.
We also looked at the individual entry gates to the station and found that the East entrance accounted for almost all of the drop in traffic, with the West entrance seeing little difference – and it is the East entrance which is nearest to HSBC’s offices.
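Roughly, the comparison behind these plots could be done like this (same assumed columns as the sketch above; the London 2012 Games ran from 27 July to 12 August):

```python
# Average a station's total traffic (entries + exits) by time of day,
# separately for Olympic days and all other days. In practice you would
# restrict the non-Olympic side to comparable summer weekdays.
station = tube[tube["station"] == "Canary Wharf"].copy()
station["total"] = station["entries"] + station["exits"]
station["olympic"] = station["timestamp"].between("2012-07-27", "2012-08-12")
station["time_of_day"] = station["timestamp"].dt.strftime("%H:%M")

profile = station.groupby(["olympic", "time_of_day"])["total"].mean().unstack(0)
profile.plot()
plt.title("Canary Wharf: Olympic vs ordinary days")
plt.show()
```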

How should we judge scientific impact?

It’s a hard problem. For very applied work, it might be possible to see how the technology sector picks up various scientific advances and judge merit on that basis, but for most scientists the impact of their research is only seen on long timescales, much longer than the few years they spend on an individual project, or working with one particular group.

Traditionally, citations have been used as a measure of success. The assumption is that other scientists aim to work on important topics, so if they have cited some research it must have been useful to them, and therefore also important. So the more citations a paper has, the more useful it has been. The obvious problem with this analysis, though, is that it assumes all citations carry the same amount of usefulness. In reality, there are many reasons to cite something beyond it being a genuinely useful inspiration for the research in question. Writers often cite well-known works in their field, even if they are largely irrelevant to the current research, or cite a paper to explain that its interpretation of something or other is incorrect. And because citations are used as a measure of success, writers often cite their own papers, or those of their collaborators, even when not strictly necessary.

This problem is most clearly uncovered in Simkin & Roychowdhury’s 2002 paper, Read before you cite! They reason that authors who read a paper and then decide to cite it have some small chance of making a typographical mistake when writing their bibliography (note that this was before the days of software reliably auto-generating bibliographies, as is commonplace now). So perhaps paper A is cited by paper B, but a spelling mistake is made in the citation. Paper C then cites both B and A – but makes the exact same mistake as paper B did. Simkin & Roychowdhury reason that the authors of paper C likely didn’t read paper A at all – they just copied the reference from the bibliography of paper B. After all, the chances of making an identical typo independently are very small, and if you had read the paper you would presumably just copy the authors’ names from the original. Using this simple model they calculate that around 80% of citations (in their dataset) are made without the citer ever reading the paper they are citing!
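To make the argument concrete, here’s a small simulation in the spirit of their model (my own sketch, not their actual estimator): if every citer wrote the reference out fresh, identical typos would almost never recur, so repeated misprints are a fingerprint of copying.

```python
import random

def simulate(n_citers=10_000, p_read=0.2, p_typo=0.01, seed=0):
    """Each citer either reads the original (probability p_read) and writes
    the reference fresh, with a small chance of a brand-new typo, or copies
    an earlier citer's reference verbatim, propagating any misprint in it."""
    rng = random.Random(seed)
    citations = []   # 0 = correct reference, k > 0 = the k-th distinct typo
    next_typo = 1
    for _ in range(n_citers):
        if not citations or rng.random() < p_read:
            if rng.random() < p_typo:
                citations.append(next_typo)  # a fresh misprint
                next_typo += 1
            else:
                citations.append(0)
        else:
            citations.append(rng.choice(citations))  # copied, typos and all
    misprints = [c for c in citations if c != 0]
    print(f"{len(misprints)} misprinted citations, "
          f"{len(set(misprints))} distinct misprints")

simulate()  # many repeats of a few distinct typos: the signature of copying
```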

This should all be very concerning for those of us who care about how many times our papers get cited. Measures such as the h-index and impact factors are ultimately based on counting citations. Does it mean that citation analysis is fundamentally flawed? I don’t think so – it just means that we need more sophisticated tools than simple citation counts, looking at other aspects of the topology of citation networks and at the other information we have about papers. This has been the subject of some of my recent research, so I hope to write a few more posts soon to ‘catch up’ to where that research currently is.

About me

I’m James Clough, a PhD student in Physics at Imperial College London – this is my personal website. I am currently working in the Centre for Complexity Science (formerly the Complexity & Networks group) at Imperial, supervised by Dr. Tim Evans and Prof. Kim Christensen. My research is in complex networks, and I am particularly interested in networks constrained by causality, such as citation networks. My work focuses on using the special constraints and properties of these networks to find new ways of characterising and modelling their structure. Other academic interests include game theory, statistics, and other aspects of complexity science, such as applications in economics and government policy.