Play with Statistical Modeling: Quantify J-League Team Strength with Stan and Python

Introduction

The other day I read the book "Bayesian Statistical Modeling with Stan and R", and it made me want to try statistical modeling myself. It is a very good book and I recommend it.

Stan can be used not only from R but also from Python, via the PyStan library. This time, I will build a model that estimates the strength (or something like it) of each team in the 2016 J1 League from match results.

Main versions used this time

Collect data

J-League match results can be checked at http://www.jleague.jp/stats/SFTD01.html. Suppose the results for each team in the 1st stage of 2016 have been collected, one way or another, into data like the following: https://gist.github.com/mokemokechicken/a26eb7230e51fba7018476f7706e5b0b

away_score away_team date home_score home_team score visitors
1 Nagoya 02/27(Sat) 0 Iwata 0-1 14333
1 Kawasaki F 02/27(Sat) 0 Hiroshima 0-1 18120
1 Fukuoka 02/27(Sat) 2 Tosu 2-1 19762
... ... ... ... ... ... ...

Thinking

Think about the mechanism by which match results are produced

Let's assume that the match result (the score) is generated roughly as follows:

- each team has an "attack power" and a "defense power"
- the stronger a team's attack and the weaker the opponent's defense, the more goals that team scores
- the home team gets an extra boost, a "home power"

So far, nothing about this seems especially strange.

Next, let's step into the world of statistics. First, assume, somewhat arbitrarily, that a "score" follows a Poisson distribution. In other words:

score of a team in a match ~ Poisson(attack power of that team − defense power of the opposing team)

(By the way, ~ means "follows the distribution".) However, the parameter of a Poisson distribution cannot be negative, so wrap the inside in exp for convenience, and add a home power term:

home team's score in a match ~ Poisson(exp(
    (home power + home team's attack power) − away team's defense power
))

and, symmetrically,

away team's score in a match ~ Poisson(exp(
    away team's attack power − (home power + home team's defense power)
))
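To make this generating mechanism concrete, here is a minimal simulation sketch in Python. The parameter values are made up purely for illustration, not estimated from anything:

import numpy as np

np.random.seed(0)

# Hypothetical parameter values, purely for illustration
home_power = 0.05
home_atk, home_def = 0.3, 0.1
away_atk, away_def = 0.2, -0.1

# Each side's score is a Poisson draw with a rate built from the powers above
home_score = np.random.poisson(np.exp((home_power + home_atk) - away_def))
away_score = np.random.poisson(np.exp(away_atk - (home_def + home_power)))
print("simulated score: %d - %d" % (home_score, away_score))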

Write the model in Stan

It ended up looking like this.

model_code = """
data {
    int N;  // N Games
    int K;  // K Teams
    int Th[N]; // Home Team ID
    int Ta[N]; // Away Team ID
    int Sh[N]; // Home Team score point
    int Sa[N]; // Away Team score point
}

parameters {
    real atk[K];
    real def[K];
    real home_power[K];
    real<lower=0> sigma;
    real<lower=0> hp_sigma;
}

model {
    for (k in 1:K) {
        atk[k] ~ normal(0, sigma);
        def[k] ~ normal(0, sigma);
        home_power[k] ~ normal(0, hp_sigma);
    }

    for (n in 1:N) {
        Sh[n] ~ poisson(exp(
            (home_power[Th[n]] + atk[Th[n]]) - 
            (def[Ta[n]])
        ));

        Sa[n] ~ poisson(exp(
            (atk[Ta[n]]) - 
            (def[Th[n]] + home_power[Th[n]])
        ));
    }
}

generated quantities {
    real games[K, K, 2];
    
    for (th in 1:K) {
        for (ta in 1:K) {
            games[th, ta, 1] = poisson_rng(exp((home_power[th] + atk[th]) - (def[ta])));
            games[th, ta, 2] = poisson_rng(exp((atk[ta]) - (def[th] + home_power[th])));
        }
    }
}
"""

Brief explanation

Regarding Stan, the book introduced earlier is an excellent reference, but here is a brief explanation as well.

data block

data {
    int N;  // N Games
    int K;  // K Teams
    int Th[N]; // Home Team ID
    int Ta[N]; // Away Team ID
    int Sh[N]; // Home Team score point
    int Sa[N]; // Away Team score point
}

The data block declares data given from outside. The data collected earlier will be injected later under these names.

parameters block

parameters {
    real atk[K];
    real def[K];
    real home_power[K];
    real<lower=0> sigma;
    real<lower=0> hp_sigma;
}

The parameters block declares the parameters to be estimated. The main ones this time are attack power (atk), defense power (def), and home power (home_power). <lower=0> restricts the range a parameter can take.

model block

model {
    for (k in 1:K) {
        atk[k] ~ normal(0, sigma);
        def[k] ~ normal(0, sigma);
        home_power[k] ~ normal(0, hp_sigma);
    }

    for (n in 1:N) {
        Sh[n] ~ poisson(exp(
            (home_power[Th[n]] + atk[Th[n]]) - 
            (def[Ta[n]])
        ));

        Sa[n] ~ poisson(exp(
            (atk[Ta[n]]) - 
            (def[Th[n]] + home_power[Th[n]])
        ));
    }
}

The model block uses probability distributions to express the relationship between the input data and the parameters to be estimated; the "mechanism" described earlier is written out here. In addition, each team's attack and defense power are assumed to follow a normal distribution with mean 0 and standard deviation sigma, and sigma itself is also an estimated parameter. Without these priors the MCMC calculation would not converge, because only differences between the parameters enter the likelihood: adding the same constant to every atk and def leaves the Poisson rates unchanged, so the absolute values are not identified. In effect, the prior writes down the wish "please settle somewhere around here". What is interesting is that the calculation converged more easily when sigma was estimated than when it was hard-coded to 1 (and the estimated sigma came out much smaller than 1). You might expect fixing sigma to 1 to constrain the range and help convergence, but apparently not.
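To see the identifiability point concretely: shifting every atk and def by the same constant leaves every Poisson rate, and hence the likelihood, unchanged. A tiny check with arbitrary values:

import numpy as np

home_power, atk_home, def_away = 0.05, 0.3, 0.1
c = 5.0  # the same shift applied to every atk and def

rate = np.exp((home_power + atk_home) - def_away)
rate_shifted = np.exp((home_power + (atk_home + c)) - (def_away + c))
print(np.isclose(rate, rate_shifted))  # True: the data alone cannot pin down the absolute values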

generated quantities block

generated quantities {
    real games[K, K, 2];
    
    for (th in 1:K) {
        for (ta in 1:K) {
            games[th, ta, 1] = poisson_rng(exp((home_power[th] + atk[th]) - (def[ta])));
            games[th, ta, 2] = poisson_rng(exp((atk[ta]) - (def[th] + home_power[th])));
        }
    }
}

The generated quantities block is a sort of bonus: it runs extra calculations using the sampled parameter values. Here, each team's parameters are used to sample the score of a match between every pair of teams. This could also be computed on the Python side, but the logic has to stay in sync with the model, so keeping it here seems easier to maintain.

Other

There is also a transformed parameters block, which lets you define new variables from the data and parameters. An expression like (home_power[th] + atk[th]) - (def[ta]) appears in both model and generated quantities this time, so it might be better to factor it out there; I did wonder about it, but... (see the sketch below).
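For reference, a minimal sketch of what that could look like, using hypothetical variable names lambda_home and lambda_away. The model block would then reduce to Sh[n] ~ poisson(lambda_home[n]) and likewise for Sa[n]; generated quantities, which loops over team pairs rather than games, would still need its own expression or a user-defined function:

transformed parameters {
    real<lower=0> lambda_home[N];  // expected goals for the home team in game n
    real<lower=0> lambda_away[N];  // expected goals for the away team in game n

    for (n in 1:N) {
        lambda_home[n] = exp((home_power[Th[n]] + atk[Th[n]]) - def[Ta[n]]);
        lambda_away[n] = exp(atk[Ta[n]] - (def[Th[n]] + home_power[Th[n]]));
    }
}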

Calculate with model and data

Define some useful functions

from hashlib import sha256
import pickle
import time

import requests
import pandas as pd
from pystan import StanModel


def stan_cache(file=None, model_code=None, model_name=None, cache_dir="."):
    """Use just as you would `stan`"""
    # http://pystan.readthedocs.io/en/latest/avoiding_recompilation.html#automatically-reusing-models

    if file:
        if file.startswith("http"):
            model_code = requests.get(file).text
        else:
            with open(file) as f:
                model_code = f.read()
    code_hash = sha256(model_code.encode('utf8')).hexdigest()
    if model_name is None:
        cache_fn = '{}/cached-model-{}.pkl'.format(cache_dir, code_hash)
    else:
        cache_fn = '{}/cached-{}-{}.pkl'.format(cache_dir, model_name, code_hash)
    try:
        with open(cache_fn, 'rb') as f:
            sm = pickle.load(f)
    except (FileNotFoundError, pickle.UnpicklingError):
        start_time = time.time()
        sm = StanModel(model_code=model_code)
        print("compile time: %s sec" % (time.time() - start_time))
        with open(cache_fn, 'wb') as f:
            pickle.dump(sm, f)
    return sm


def sampling_summary_to_df(fit4model):
    """

    :param StanFit4model fit4model:
    :return:
    """
    s = fit4model.summary()
    return pd.DataFrame(s["summary"], columns=s['summary_colnames'], index=s['summary_rownames'])

The stan_cache function caches the compiled Stan model, and can also fetch the model code over HTTP. Convenient.

The sampling_summary_to_df function turns the result of fitting the model into a pandas DataFrame, which is easier to inspect when working in Jupyter.

Execute

import numpy as np
import pandas as pd

df = pd.read_csv("https://gist.githubusercontent.com/mokemokechicken/a26eb7230e51fba7018476f7706e5b0b/raw/90e1c26e172b92d5f52d53d0591a611e0258bd2b/2016-j1league-1st.csv")

team_list = list(sorted(set(df['away_team']) | set(df['home_team'])))
teams = dict(zip(range(1, len(team_list)+1), team_list))  # team ID (1-based, for Stan) -> team name
for k, v in list(teams.items()):
    teams[v] = k  # also register the reverse mapping: team name -> team ID
    
model = stan_cache(model_code=model_code) 

stan_input = {
    "N": len(df),
    "K": len(team_list),
    "Th": df['home_team'].apply(lambda x: teams[x]),
    "Ta": df['away_team'].apply(lambda x: teams[x]),
    "Sh": df['home_score'],
    "Sa": df['away_score'],
}

fitted = model.sampling(data=stan_input, seed=999)
sampling_summary_to_df(fitted).sort_values("Rhat", ascending=False)

Results

For example, the summary of the parameter estimates looks like the following. According to the book, "if Rhat is below 1.1 for all estimated parameters, the calculation can be considered to have converged", so I will treat it as converged for now. It is a little concerning that n_eff, roughly the number of effectively independent samples obtained, is below 100 for some parameters (which matters especially for interval estimates), but I will let that pass this time.

mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
lp__ -214.114891 1.542823 13.712915 -237.604900 -223.960556 -215.747425 -205.033248 -183.817981 79.0 1.049519
hp_sigma 0.094069 0.005835 0.063921 0.011408 0.044970 0.081836 0.132181 0.245524 120.0 1.031258
atk[12] 0.048430 0.009507 0.158794 -0.318161 -0.052664 0.052359 0.155878 0.342186 279.0 1.018430
atk[16] 0.225840 0.008570 0.158951 -0.075769 0.115129 0.223274 0.330571 0.531899 344.0 1.013612
sigma 0.220279 0.002456 0.056227 0.112035 0.182132 0.217550 0.255500 0.344382 524.0 1.011379
... ... ... ... ... ... ... ... ... ... ...
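A convenient way to run this check mechanically is to filter the summary DataFrame; a small sketch using the function and column names above:

summary_df = sampling_summary_to_df(fitted)

# Parameters that fail the Rhat < 1.1 rule of thumb
print(summary_df[summary_df["Rhat"] > 1.1])

# Parameters with fewer than 100 effective samples
print(summary_df[summary_df["n_eff"] < 100])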

View the results

Estimated team strength

Collect parameters by team.

tk = {}

params = fitted.extract()
atks = params['atk']
defs = params['def']
home_power = params['home_power']
games = params['games']


for k, team in enumerate(team_list):
    tk[team] = {
            "atk": np.mean(atks[:, k]),
            "def": np.mean(defs[:, k]),
            "home_power": np.mean(home_power[:, k]),
    }

tdf = pd.DataFrame(tk).T

For now, let's sort them in descending order of home_power.

tdf.sort_values("home_power", ascending=False)
atk def home_power
Kashima 0.225840 0.217699 0.068898
Urawa 0.152638 0.063192 0.041830
Kobe 0.099154 -0.163383 0.030443
Hiroshima 0.293808 -0.000985 0.029861
Nagoya 0.130757 -0.247969 0.015849
Kawasaki F 0.312593 0.087859 0.014728
Niigata 0.010827 -0.146252 0.011885
Iwata 0.048430 -0.105230 0.011843
Sendai 0.037960 -0.150151 0.005462
FC Tokyo -0.072115 0.017485 -0.005050
Gamba Osaka 0.087034 -0.033666 -0.005532
Tosu -0.232595 0.103412 -0.006186
Kashiwa 0.036172 -0.053332 -0.009568
Omiya -0.040014 0.027842 -0.020942
Yokohama FM 0.059420 -0.009322 -0.021393
Fukuoka -0.185292 -0.130759 -0.026643
Kofu 0.002608 -0.268061 -0.037079
Shonan -0.000052 -0.184515 -0.049440

So that is how they came out. The priors center the average strength at 0, so a value above or below 0 indicates whether a team is above or below the J1 average.
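As one example of what the sampled games array can be used for, here is a sketch that estimates win/draw/loss probabilities for a matchup; the pairing of Kashima at home against Urawa is arbitrary, and the names must match those in team_list:

# games has shape (number of MCMC draws, K, K, 2); team IDs are 1-based in Stan
th, ta = teams["Kashima"], teams["Urawa"]
home_goals = games[:, th - 1, ta - 1, 0]
away_goals = games[:, th - 1, ta - 1, 1]
print("P(home win) =", np.mean(home_goals > away_goals))
print("P(draw)     =", np.mean(home_goals == away_goals))
print("P(away win) =", np.mean(home_goals < away_goals))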

By the way, running the same thing on the 2nd-stage data gives a completely different result (Kashima comes out much worse). In the end, this is only what the 1st-stage data says, so take it with a grain of salt.

It is natural if you think about it, but the result is close to the 2016 1st-stage standings: https://ja.wikipedia.org/wiki/2016%E5%B9%B4%E3%81%AEJ1%E3%83%AA%E3%83%BC%E3%82%B0#.E9.A0.86.E4.BD.8D.E8.A1.A8 In particular, there seems to be a rough correlation between total goals scored and conceded and the estimated attack and defense power (although it would be a problem if there were none).

In closing

I tried simulating what would have happened if I had bought toto (the Japanese soccer lottery) for the 2nd stage using this prediction model, but stopped because the results looked terrible (laughs). It might predict well when many matches are packed into a short period; predicting matchday 17 from data up to matchday 16 might work a little better?

It's amazing that this kind of modeling has become so easy, and it's a lot of fun.
