Using SMT Solvers to Analyze the Premier League Table

A few days ago reddit user /u/Shurtgal posted a graphic showing the possible points that can be achieved by each football team in the English Premier League. You might think that this graphic would provide insight into questions like the following.

  1. What teams are still mathematically capable of winning the league title?

  2. What is the minimum number of points required to guarantee safety from relegation (finish in one of the top 17 league positions)?

The trouble is, the graphic cannot answer these questions definitively. The following comment from /u/mitters explains why.

Consider Swansea. According to this chart they are not guaranteed safety (2 points short). Which is not actually true. They are in fact already guaranteed safety. Even if they lost every game from now on and did not get the 2 points they need to "guarantee" safety they would still be safe. There's nothing wrong with OP's chart or logic as it would be nearly impossible to plot this table taking into account teams that play each other. Leicester could win all of their remaining games and finish level on points with Swansea. But that would require that West Brom, Newcastle, Sunderland and QPR all drop points. As such, in this (very unlikely) scenario both Leicester and Swansea would be safe because of the points other teams would have to drop as they play each other.

However, the good news is that a wonderful tool known as an SMT Solver can answer these questions definitively. In this article, I will provide an gentle introduction to using SMT solvers by showing how they can be used to answer such questions.

What is an SMT solver?

An SMT solver is essentially a constraint solver. It is able to combine the "base" theory of propositional logic, which is a logic that can described the usual logical connectives AND/OR/NOT along with Boolean variables and combines this with additional theories such as integer arithmetic, bitvector arithmetic, a theory of arrays and so on.

If you've used an Integer Linear Programming (ILP) solver like CPLEX or GLPK, you know you can express constraints involving integers and the solver finds optimal solutions satisfying these constraints. It is easy to ask CPLEX to solve problems like:

Maximize 10x + 10y + 10z where:

x             < 10
3x + 4y       < 15
x       - z   = 10

However it is not so straightforward to ask CPLEX to find a solution where either (x < 10) OR ((3x + 4y < 15) AND (x - z = 10)). SMT handles these constraints easily. And this example shows what we mean by "combining" theories. We've combined propositional logic (AND/OR/XOR, ...) with integer arithmetic which involves operators like addition, subtraction, multiplication and so on.

What are SMT solvers good at?

SMT solvers are currently best at constraint satisfaction problems. At this point they're not so good at optimization problems, but this could change in the future.

I think it's not too far wrong to say that as of now, SMT solvers are primarily being used for the analysis, synthesis and verification of hardware and software.

How does one use an SMT Solver?

One way is to encode problems using the SMT-LIB format, which is understood by a variety of solvers. However, I recommend using a solver which exposes a programmatic interface. A good choice would be Z3 from Microsoft Research which has interfaces to C/C++/C# and Python. In this article, I'll use the Z3 Python interface.

An SMT Formulation of the Premier League Table

Now let me describe how we can formulate an SMT problem that answers these questions.

What problem do we want to solve?

We are given:

  1. The current league table, where each team has played a certain number of games and obtained a certain number of points having scored some number of goals and conceded some number of goals.

  2. We are given the remaining (unplayed) fixtures for the rest of season.

We would like to ask questions such as the following:

  1. Is it still mathematically possible for last season's runners up Liverpool to win the league title?

  2. Is it still mathematically possible for Swansea to qualify for next season's Champions League? (For those who don't follow football, qualifying for the Champions League in the Premier League involves finishing in one of the top 4 league positions.)

  3. Is it still mathematically possible for the team currently at the bottom of the table, Leicester City, to avoid being relegated to the Championship. ( For non-football fans, this would mean that Leicester would have to finish above the bottom three positions, i.e., 17th or higher in the league table.)

A bird's eye view of the SMT formulation

  • We will consider each remaining fixture in the season and create two integer SMT variables that represents the goals scored by the two teams playing the fixture.

  • If a team wins, i.e., scores more goals than the other, it gets three points. We will encode this fact.

  • If the scores are level, both teams get one point each. We will encode this fact as well.

  • We will keep track of the total number of goals scored and conceded by each team by summing up the appropriate values. This is important for the goal difference calculations which decide league positions when the number of points are equal at the end of the season.

  • Finally, we will create an integer variable that represents each teams' final league position. We will create constraints which express the following. When comparing two teams, the team with more points is placed in a higher league position. If two teams have equal points, the team with the better goal difference is placed higher. If the points and goal difference are both equal, then the team with more goals scored is placed higher. If all of these are equal, I think the Premier League schedules an extra "playoff" fixture to decide the league position, but since this is extremely unlikely to happen and has in fact never happened in the past, we will leave this scenario undefined.

Once we have encoded all these facts, we can just ask the solver if it can find a satisfying assignment to all variables that respects the constraints we've provided and results in a team finishing in particular position.

This sounds moderately complicated, but in fact, it is not. And the Z3 Python API is a joy to use. The code will be a lot of fun to read/write.

If you want to jump right into the code, it is available in this bitbucket repository.

A walk through the code

Reading the current league table and fixtures

I got the league table and fixture data from the nice API exposed by www.football-data.org. They have some code samples available here. Since I didn't want to repeatedly hit their servers while I was debugging and testing my code, I wrote a separate script to get the data and write it a file. I read from these files in the functions readTable and readFixtures.

Once we have the data, we have to create a Z3 Solver object which keeps track of all the constraints that are part of our problem.

import z3

S = z3.Solver()

Encoding each result

The function encodeFixtures does this.

def encodeFixtures(S, fixtures):
    """Encode constraints on the number of goals that can be scored
    in each fixture."""
    for f in fixtures:
        f.homeTeamGoals = z3.Int(f.name + ':HTG')
        f.awayTeamGoals = z3.Int(f.name + ':ATG')
        S.add(f.homeTeamGoals >= 0)
        S.add(f.homeTeamGoals <= 15)
        S.add(f.awayTeamGoals >= 0)
        S.add(f.awayTeamGoals <= 15)

As you can see, this is a pretty straightforward function. It takes in a list of fixtures and a Z3 Solver object. It creates two integer variables for each fixture representing the number of goals scored by the home and away teams and adds constraints on these variables to the solver object.

The lower limit of 0 goals should be self-explanatory. I chose 15 as the upper limit because IIRC the biggest victory in the premier league was 9-1 (turns out I was wrong, see footnote), so it's extremely unlikely that more 15 goals will be scored by a team and we can use this limit to bound the search space for the solver.

One thing that might be worth pointing out here is that expressions like f.homeTeamGoals >= 0 are Z3 abstract syntax tree (AST) objects. The Z3 python interface overloads the arithmetic operators on Z3 variables to create AST objects which represent the underlying SMT formulas. And as we've shown here, we can just add these ASTs to the solver objects when specifying constraints to the solver. All this syntactic sugar allows code that uses the Z3 API to be very clean and elegant.

Computing the final points and goal difference of each team

The function that does this is encodeFinalTable.

def encodeFinalTable(S, table, fixtures):
    """Figure out the points tally and goals for/against at the end of the season."""
    # walk through each team
    for row in table:

        # select the home fixtures
        homeFixtures = [f for f in fixtures if f.homeTeam == row.team]
        # count the number of points.
        homePoints = sum([
            z3.If(f.homeTeamGoals > f.awayTeamGoals, 3, 
                z3.If(f.homeTeamGoals == f.awayTeamGoals, 1, 0)) 
            for f in homeFixtures])
        homeGoals = sum([f.homeTeamGoals for f in homeFixtures])

        # same for the away fixtures.
        awayFixtures = [f for f in fixtures if f.awayTeam == row.team]
        awayPoints = sum([
            z3.If(f.awayTeamGoals > f.homeTeamGoals, 3, 
                z3.If(f.awayTeamGoals == f.homeTeamGoals, 1, 0)) 
            for f in awayFixtures])
        awayGoals = sum([f.awayTeamGoals for f in awayFixtures])

        # This is the total number of points.
        row.finalPoints = homePoints + awayPoints + row.points
        # And the final goals scored/conceded
        row.finalGF = homeGoals + row.gf
        row.finalGA = awayGoals + row.ga
        row.finalGD = row.finalGF - row.finalGA

What we're doing here is walking through each team in the league and considering their remaining fixtures. Then this SMT formula says that home team gets 3 points from a fixture if it has scored more goals than the away team, 1 point from a fixture it has the same number of goals as the away team and no points otherwise.

homePoints = sum([
    z3.If(f.homeTeamGoals > f.awayTeamGoals, 3, 
        z3.If(f.homeTeamGoals == f.awayTeamGoals, 1, 0)) 
    for f in homeFixtures])

z3.If(condition, true_value, false_value) is the conditional if-then-else operator, which evaluates condition and returns either true_value or false_value.

Notice again how the overloading of the + operator by Z3 allows Python's sum function to work seamlessly.

The rest of encodeFinalTable should be easy to follow.

Computing league position

Computing the final league position is done by the following function.

def encodeLeaguePositions(S, table):
    """Encode the final league position of each team."""
    # First specify that each team's league position must lie between 1 and N.
    for t in table:
        t.position = z3.Int(t.team + '_pos')
        S.add(t.position >= 1)
        S.add(t.position <= len(table))

    # Now for the relative positions
    for i in xrange(len(table)):
        ti = table[i]
        for j in xrange(len(table)):
            tj = table[j]
            if i == j: continue

            # more points => higher finish
            S.add(z3.Implies(ti.finalPoints > tj.finalPoints, ti.position < tj.position))
            # equal points and better goal difference => higher finish
            S.add(z3.Implies(
                z3.And(ti.finalPoints == tj.finalPoints, ti.finalGD > tj.finalGD),
                ti.position < tj.position))
            # points, GD equal and more goals scored => higher finish
            S.add(z3.Implies(
                z3.And(
                    ti.finalPoints == tj.finalPoints,
                    ti.finalGD == tj.finalGD,
                    ti.finalGF > tj.finalGF),
                ti.position < tj.position))
            # no two teams can have the same position.
            if i < j:
                S.add(ti.position != tj.position)

Most of the code should be easy to understand at this point. The only new Z3 features we're using are z3.Implies and z3.And. And does what you expect. In case you're not familiar with Implies, Implies(A, B) says that B must be true whenever A is true, i.e., A=True and B=False shouldn't be possible.

You might think that this is a slightly inefficient way of encoding what is essentially a sorting function because we're using O(N^2) constraints to encode the ordering of N items. However, this is acceptable for two reasons. First, N=20, and SMT solvers can solve 400 or so constraints quite easily. Also I suspect that the simple, almost naive encoding used here might actually be beneficial for the solver. The solver works by propagating constraints, and having more constraints makes it easier to propagate. In a smarter encoding, the solver will have to infer more constraints and this could possibly make the solution harder. All that said, I haven't actually tried other encodings and this is just an opinion based on my experience working with SAT/SMT solvers.

Using the formulation to answer questions

Let's try to find a list of teams that are still mathematically in the the title race.

def findTeamsWithPossiblePosition(S, table, pos):
    "Returns a list of teams which *can* satisfy predicate pos."
    teams = []
    for t in table:
        S.push()
        S.add(pos(t.position))
        if S.check() == z3.sat:
            teams.append(t)
        S.pop()
    return teams

winners = findTeamsWithPossiblePosition(S, table, lambda p: p == 1)
print 'Teams that can still win the league are:\n%s\n' % (
    '\n'.join('  ' + t.team for t in winners))

We're using a nice feature provided by the Z3 solver, which is the ability push and pop contexts. When you push a context, and then add some constraints and then later pop that context, all the constraints that were added between the push and the corresponding pop are discarded by the solver. This allows incremental solving, where we solve a bunch of related problems with slightly different constraints.

And here we're iterating through the teams and finding whether there is some possible set of results in each of the remaining fixtures that would allow the team to finish on top of the table.

If you're curious about the output, as of today (11 April 2015) it is:

Teams that can still win the league are:
  Chelsea FC
  FC Arsenal London
  Manchester United FC
  Manchester City FC
  FC Southampton
  Liverpool FC
  Tottenham Hotspur FC

Similarly we can find teams that are still in the race for Champions League qualification as well the teams that could possibly be relegated.

relegationCandidates = findTeamsWithPossiblePosition(S, table, lambda p: p >= 18)
print 'Teams that can still get relegated are:\n%s\n' % (
    '\n'.join('  ' + t.team for t in relegationCandidates))

clQualifiers = findTeamsWithPossiblePosition(S, table, lambda p: p <= 4)
print 'Teams that can still qualify for the Champions League are:\n%s\n' % (
    '\n'.join('  ' + t.team for t in clQualifiers))

The output for these functions is:

Teams that can still get relegated are:
  West Ham United FC
  Stoke City FC
  Crystal Palace
  Everton FC
  Newcastle United
  West Bromwich Albion
  Aston Villa FC
  Sunderland AFC
  Hull City FC
  FC Burnley
  Leicester City
  Queens Park Rangers

Teams that can still qualify for the Champions League are:
  Chelsea FC
  FC Arsenal London
  Manchester United FC
  Manchester City FC
  FC Southampton
  Liverpool FC
  Tottenham Hotspur FC
  Swansea City
  West Ham United FC
  Stoke City FC

Let's try to find the minimum number of points required to ensure safety. We are going to do this in the following way. First, we will ask the solver if it is possible for the team to obtain at least X points. For example, Leicester currently have only 25 points and 7 games left to play so they cannot possibly get more than 46 points. So X = 46 is achievable for Leicester but not X=47.

Once we have an achievable value X, we ask whether it is possible that the team gets these many points and still be relegated. If both queries are satisfiable, we increment X and try again (see footnote 2). But if the first query is satisfiable and the second query is not, we know X points are sufficient to guarantee safety.

Here's the code:

def findPointsForSafety(S, team):
    "Find the minimum points need to achieve safety from relegation."
    # first check if the team can achieve safety.
    S.push()
    S.add(team.position < 18)
    r = S.check();
    S.pop()

    # if it can't, return -1
    if r == z3.unsat:
        return -1 # always relegated.

    pts = team.points
    while True:
        S.push()
        S.add(team.finalPoints >= pts)
        # Is it possible finish with more than pts points?
        r = S.check()
        if r == z3.unsat:
            # The team can't even get these many points. Give up.
            S.pop()
            return 0 # May not be relegated but at the mercy of other teams.

        # Is it possible to be relegated?
        S.add(team.position >= 18)
        r = S.check()
        S.pop()
        if r == z3.sat:
            # Try more points, because we can still be relegated.
            pts += 1
        else:
            # can get these many points and can't get relegated!
            return pts

    # should never get here.
    assert False

print 'Minimum points to guarantee safety (current points in parenthesis).'
for t in relegationCandidates:
    pts = findPointsForSafety(S, t)
    ptstr = '??' if pts == 0 else str(pts)
    print '  %-20s %6s ( %2d )' % (t.team, ptstr, t.points)

The output when I run this code is:

Minimum points to guarantee safety (current points in parenthesis).
  West Ham United FC       47 ( 43 )
  Stoke City FC            47 ( 43 )
  Crystal Palace           47 ( 42 )
  Everton FC               47 ( 38 )
  Newcastle United         47 ( 35 )
  West Bromwich Albion     47 ( 33 )
  Aston Villa FC           47 ( 31 )
  Sunderland AFC           46 ( 29 )
  Hull City FC             ?? ( 28 )
  FC Burnley               ?? ( 26 )
  Leicester City           45 ( 25 )
  Queens Park Rangers      ?? ( 25 )

Note there are some teams which can't guarantee safety even if they win all their remaining matches. They will need one of the other teams to lose/draw some games to avoid relegation. These teams are shown with question marks in the table.

Conclusion

Hope you had some fun playing with Z3 and this piqued your interest in SMT solvers.

You can install Z3 from here. The Python API documentation is here. An introductory tutorial to using Z3 with the SMT-LIB input format is here. It might be worthwhile to play with this format for a bit and then try to translate some of these examples into the Z3 Python API.

All the code I've shown above is available on bitbucket.

Let me know if you have any questions/comments or suggestions.

Footnotes

  1. Turns out the biggest victory in the Premier League was 9-0 for Manchester United when playing at home against Ipswich Town. The highest scoring game was Portsmouth 7 - 4 Reading. The good news is both are well within the bounds we selected.

  2. The linear search procedure we're using to determine points for safety can be improved. We could try exponentially increasing increments for X: 1, 2, 4, 8 ... and then do a binary search once we find a value that is not achievable. This would be asymptotically faster, but I decided it's not worth the trouble.

Comments !