A Bayesian rating system for football teams

This post introduces FootBayes, a system I devised for rating football teams. The goal of the project is to rate teams in a coherent, principled way using only game results.

There are of course many other projects that do the same. In my opinion, the two most prominent are ClubElo and the Euro Club Index. These are similar in that both are based to different degrees on the Elo rating system originally developed for chess players. They are both very good overall; I’m especially fond of ClubElo due to its careful attention to predictive checks.

However, I have never been satisfied with Elo as a definitive solution to rating football teams, or anything else for that matter. For one, what is usually considered one of its virtues I actually consider a terrible feature: a team’s rating never changes except when it plays a match. This is counter-intuitive, and not how people actually reason about teams. Take group B in the last World Cup. The Netherlands thrashed Spain 5-1 in the fourth game of the tournament, and everyone’s opinion of the Netherlands improved a lot. When, a couple of days later, Chile also beat Spain convincingly, should our opinion of the overall quality of the Dutch change? Elo says no; FootBayes says most certainly yes! Spain getting beaten again is evidence they were never very good in the first place, and that in turn makes the Netherlands’ achievement of beating them by four goals less impressive in retrospect.

Another problem is that Elo was originally designed to predict binary outcomes, but football games are not binary. Any rating system for football needs at least three outcomes: home win, away win and draw. The usual workaround is to count a draw as a half-win for either side, which is relatively reasonable. The adjustment for goal difference, on the other hand, is not: the number of points exchanged between the teams is multiplied by a coefficient that grows with the goal difference. This is inelegant, since it mixes two different rating systems, and it also leads to strange results. If Real Madrid beat Ludogorets at home 1-0, Real’s rating will increase; that can’t be right. The difference in abilities is so great that a narrow 1-0 victory should make one downgrade their estimate of Real’s ability. If Real want to keep being considered one of the best teams ever, they really should be beating minnows at home by larger margins.
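To make the point concrete, here is a minimal sketch of an Elo update with a goal-difference multiplier. The ratings, the K-factor and the multiplier values below are all made up for illustration; real implementations such as ClubElo differ in detail. Because the multiplier only scales the point exchange, it can never flip its sign, so a narrow win for a huge favourite still raises the favourite’s rating:

```python
def elo_expected(r_home, r_away):
    """Expected score for the home side under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((r_away - r_home) / 400.0))

def elo_update(r_home, r_away, score, k=20.0, g=1.0):
    """Points gained by the home side; g is the goal-difference multiplier."""
    return k * g * (score - elo_expected(r_home, r_away))

# A huge favourite (hypothetical ratings) wins 1-0 at home:
narrow = elo_update(2000, 1600, score=1.0, g=1.0)
big = elo_update(2000, 1600, score=1.0, g=1.5)  # bigger margin, bigger multiplier
print(narrow > 0, big > narrow)  # True True: the rating goes up either way
```

No matter how one-sided the matchup, the update for a win is always positive, which is exactly the behaviour criticized above.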

So how does FootBayes actually work? It is based on a few principles:

  1. Each team is described by a single number, its rating. The ratings are transitive, meaning there is no cycle such as Arsenal beat Spurs, Spurs beat Stoke, Stoke beat Arsenal; if the first two hold, then Arsenal must be rated above Stoke. The ratings are normalized so that the mean is 500 and the standard deviation is 250, for interpretability.
  2. The probability distribution of the different outcomes of a match is a function of exclusively the difference in ratings between the home team and the away team. Specifically, the outcomes follow an ordered discrete logistic regression.
  3. The ratings, as well as the coefficients of the regression, are those that make the observed data most likely. If you know the jargon, this is just maximum likelihood, with flat priors on the coefficients; the ratings follow a normal distribution by construction.
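As a concrete sketch of principle 2, here is how an ordered logistic turns a rating difference into the three outcome probabilities. The cutpoints and the scale below are invented for illustration; in the actual model they are fitted from the data:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def match_probabilities(r_home, r_away, c_loss=-0.7, c_win=0.4, scale=250.0):
    """Ordered-logistic win/draw/loss probabilities as a function of the
    rating difference alone. The cutpoints c_loss < c_win and the scale
    are illustrative placeholders, not the fitted FootBayes coefficients."""
    d = (r_home - r_away) / scale
    p_loss = sigmoid(c_loss - d)           # home loss
    p_draw = sigmoid(c_win - d) - p_loss   # draw
    p_win = 1.0 - sigmoid(c_win - d)       # home win
    return p_win, p_draw, p_loss

# Two evenly matched teams; the asymmetric cutpoints encode home advantage:
print(match_probabilities(500, 500))
```

The three probabilities always sum to one, and shifting the rating difference slides probability mass monotonically between the three outcomes.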

That’s it. There is nothing else involved. These assumptions are used to fit a Bayesian inference model in Stan, a probabilistic programming language. A major flaw of this model is that it does not account for time anywhere: a team’s rating is assumed to be constant and cannot vary over time. I have an idea for incorporating time into the ratings, but for now the ratings are estimated per team per season, with no link between a team’s rating last season and this season. For the last Premier League season, the ratings look like this:


They mostly follow the final league table, although not perfectly. This is to be expected; in such a low-scoring game, some teams can go on hot streaks that end up altering their final position in the table. This is the same data, but in table form:

[Table: last season’s FootBayes ratings, with standard deviations]

The ratings are designed so that an average team has a score of around 500, and a good team has a score of at least 750. One striking thing is that the standard deviations of the estimates are all fairly large. Even though Man City are rated about 100 points better than Liverpool on average, the standard deviation of around 90 points allows for a small possibility that Liverpool were actually better than City, and that the results on the field just didn’t reflect this.
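This is easy to quantify with a quick Monte Carlo. Treating each team’s rating as roughly normal with a mean and standard deviation like those in the table (the specific numbers below are made up to match the text, not the fitted values), the probability that Liverpool’s true rating exceeds City’s comes out as:

```python
import random

random.seed(1)
N = 4000
# Illustrative posterior draws: City ~100 points ahead, both sds around 90.
city = [random.gauss(880, 90) for _ in range(N)]
liverpool = [random.gauss(780, 90) for _ in range(N)]
p_liverpool_better = sum(l > c for c, l in zip(city, liverpool)) / N
print(p_liverpool_better)  # small but clearly non-zero
```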


Now, imagine for a moment that City and Liverpool were going to play a match. A static rating system such as Elo would calculate win/draw/loss probabilities from point estimates of the ratings. FootBayes, on the other hand, propagates the uncertainty in the ratings into its predictions. Ratings only make sense if they come with uncertainty.

Of course, if the uncertainty is too big, the ratings don’t mean anything. But the FootBayes ratings allow us to be very sure of at least some things. For comparison, here is the difference between City’s and Arsenal’s ratings for last season:


The two are very clearly different, though you probably didn’t need any math to reach that conclusion if you watched any games last season.

The nice part of having these ratings is being able to get probabilities not only for upcoming matches, but also for who is going to win the league, who is going to be relegated, etc. Here are the ratings for this season so far:

[Table: FootBayes ratings for this season so far]

A few comments:

  • Chelsea and City have a very clear lead over the rest of the field, as they should
  • United are barely any better than last season; their rise in the league table is the result of the collapse of Arsenal and especially Liverpool
  • The model rates Southampton relatively high, on the same level as Arsenal and United
  • Spurs and Liverpool are very mediocre
  • The three worst-rated sides are the three promoted sides: Burnley, QPR, Leicester
  • These ratings do NOT take the team’s performances in previous seasons into account (yet). So it probably overrates underdogs like Southampton, and underrates big teams like Liverpool

Good: we have the ratings, and now we can use them to make predictions about the future. All of the following are the results of 4,000 simulations. First, W/D/L probabilities for the next round of matches:

[Table: win/draw/loss probabilities for the next round of matches]

Next, title probabilities:


Chelsea is favored but City have a chance. That is the consensus anyway. Next, Top 4 probabilities:


Chelsea and City have all but guaranteed their places in next season’s Champions League. The other two spots seem likely to go to two of United, Southampton and Arsenal. Both Spurs and Liverpool have very low chances of claiming one of them; the model rates them very poorly. Finally, relegation:


A tighter battle. Leicester and Burnley are the most endangered, but each still has only about a two-thirds chance of being relegated. QPR come next, which is remarkable given that the model doesn’t know at all which teams were promoted this year. Could all three be going down?
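For the curious, the simulation behind probabilities like these can be sketched in a few lines. The toy league, the ratings and the ordered-logistic cutpoints below are all invented for illustration; the real thing simulates the remaining Premier League fixtures from the fitted posterior:

```python
import math
import random

random.seed(0)

# A toy four-team league with hypothetical (posterior mean, sd) ratings:
teams = {"A": (900, 90), "B": (850, 90), "C": (700, 90), "D": (550, 90)}
fixtures = [(h, a) for h in teams for a in teams if h != a]  # home and away

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_outcome(d, c_loss=-0.7, c_win=0.4, scale=250.0):
    """Draw H/D/A from the ordered logistic given rating difference d.
    Cutpoints and scale are illustrative, not fitted values."""
    z, u = d / scale, random.random()
    if u < sigmoid(c_loss - z):
        return "A"
    if u < sigmoid(c_win - z):
        return "D"
    return "H"

N = 4000
titles = {t: 0 for t in teams}
for _ in range(N):
    # Sample one plausible "true" rating per team from its posterior...
    r = {t: random.gauss(m, s) for t, (m, s) in teams.items()}
    pts = {t: 0 for t in teams}
    for h, a in fixtures:  # ...then play out the season with those ratings.
        out = sample_outcome(r[h] - r[a])
        if out == "H":
            pts[h] += 3
        elif out == "A":
            pts[a] += 3
        else:
            pts[h] += 1
            pts[a] += 1
    champ = max(pts, key=lambda t: (pts[t], random.random()))  # random tiebreak
    titles[champ] += 1

print({t: round(n / N, 3) for t, n in titles.items()})
```

Title, top-4 and relegation probabilities all fall out of the same loop: tally whichever final-table property you care about instead of just the champion.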