[Python] Machine learning of sports: Analysis of the J-League as an example ②

It's been a while, but this post continues the analysis from the following article: https://qiita.com/ryouta0506/items/100a2b252fbaeb73f493

Although this is a continuation, I am changing the model. In the previous report, we assumed that scores follow a Poisson distribution. As a result, the predicted scores were concentrated at 0 and 1, and a score of 2 was almost never predicted. In soccer, scores of 3 or more are rare, so failing to predict those is acceptable, but a model that cannot predict 2 or 3 goals at a reasonable rate is a problem. So this time I would like to try other probability distributions.

Using the negative binomial distribution

By definition, the Poisson distribution satisfies expected value = variance. This means that the smaller the expected value, the smaller the spread, and the larger the expected value, the larger the spread. When scores concentrate on 0 and 1, as in soccer, the expected value is close to zero, so the spread is small as well, and the probabilities of 2 or 3 goals end up tiny. We therefore use the negative binomial distribution, a statistical model that keeps the character of the Poisson distribution but generalizes its variance.

The negative binomial distribution is a relative of the binomial distribution. The binomial distribution describes the number of times an event occurs in a fixed number of trials (for example, 10), given the event's probability (for example, 0.5 for a coin landing heads). The negative binomial distribution, conversely, describes the number of trials required, given the event's probability and a target number of occurrences. (Note: other definitions exist; I chose the one that is easiest to understand here.) In practical statistical modeling, however, it is usually treated as a generalization of the Poisson distribution that drops the constraint expected value = variance, and that is how it is applied to this model.
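
To see how this generalization spreads probability mass toward higher scores, here is a minimal Python sketch (assuming scipy is available) comparing the two distributions at the same mean; mu = 1.2 and phi = 1.5 are made-up illustrative values, not estimates from the match data.

    from scipy.stats import poisson, nbinom

    mu, phi = 1.2, 1.5   # hypothetical mean and dispersion
    # Stan's neg_binomial_2(mu, phi) corresponds to scipy's nbinom(n, p)
    # with n = phi and p = phi / (phi + mu): the mean is mu and the
    # variance is mu + mu**2 / phi, i.e. larger than the Poisson's.
    n, p = phi, phi / (phi + mu)

    for k in range(6):
        print(k, round(poisson.pmf(k, mu), 3), round(nbinom.pmf(k, n, p), 3))

With the same mean, the negative binomial pmf puts less mass on 0 and 1 and more on 2 and above.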

The Stan code is below.

model_code = """
data {
    int N;  // N Games
    int K;  // K Teams
    int home_team[N]; // Home Team ID
    int away_team[N]; // Away Team ID
    int ht_score[N] ; // Home Team score 
    int at_score[N] ; // Away Team score
}

parameters {
    real<lower=-1,upper=1> atk[K]; // attack strength
    real<lower=-1,upper=1> def[K]; // defense strength
    real<lower=0> home_adv[K];     // home advantage
    real<lower=0> sigma[K];        // sd of the attack/defense priors
    real<lower=0> hadv_sigma;      // sd of the home-advantage prior
    real<lower=0> phi[K] ;         // dispersion of neg_binomial_2
}

model {
    real param1 ;
    real param2 ;
    
    for (k in 1:K) {
        atk[k] ~ normal(0, sigma[k]);
        def[k] ~ normal(0, sigma[k]);
        home_adv[k] ~ normal(0, hadv_sigma);
        phi[k] ~ normal(0,10);
    }

    for (n in 1:N) {
        param1 = exp(home_adv[home_team[n]] + atk[home_team[n]] - def[away_team[n]]) ;
        param2 = exp((atk[away_team[n]]) - (def[home_team[n]] + home_adv[home_team[n]])) ;
        
        ht_score[n] ~ neg_binomial_2(param1,phi[home_team[n]]) ;
        at_score[n] ~ neg_binomial_2(param2,phi[away_team[n]]) ;
        
    }

}

generated quantities {
    int ht_predict[N] ;
    int at_predict[N] ;
    
    for (i in 1:N) {
        ht_predict[i] = neg_binomial_2_rng(exp(home_adv[home_team[i]] + atk[home_team[i]] - def[away_team[i]]),
                                                phi[home_team[i]]) ;
        at_predict[i] = neg_binomial_2_rng(exp((atk[away_team[i]]) - (def[home_team[i]] + home_adv[home_team[i]])),
                                                phi[away_team[i]]) ;
    }
}

"""

The Stan function for the negative binomial distribution is neg_binomial_2(). There is also neg_binomial(), which is parameterized to match the textbook functional form; which one to use depends on how you want to parameterize the model. Here, since param1 and param2 served as the mean in the Poisson model, we use neg_binomial_2(), which takes the mean and a dispersion parameter phi as arguments (the variance becomes mu + mu^2/phi). For reference, the distribution of Kawasaki F's score in Kawasaki F vs. Kashima (Kashima at home) is shown below.

fig02_negbinom_dist.png

Kawasaki F actually scored four goals in this match, so ideally the probability would peak at 3 or 4 goals. Under both the Poisson and the negative binomial distribution, one goal has the highest probability. However, the negative binomial distribution lowers the probabilities of 0 and 1 and reallocates that mass to 2 goals and above, so it assigns a noticeably higher probability to the four goals actually scored than the Poisson distribution does. That said, this is a case study of a single match. Next, we use the likelihood to verify which of the Poisson model and the negative binomial model better represents reality across all matches.

Model evaluation by likelihood

The likelihood is the probability, under a given probability distribution, of observing the data at hand. Read the other way around, it indexes how plausible the statistical model is, taking the observed data as fixed, so we use it to evaluate the models. What we need here is the probability that all the data (all 306 games) occur jointly, which is the product of the individual likelihoods. Since multiplying many small probabilities is numerically awkward, we take logarithms and sum them to obtain the log-likelihood of the whole data set.
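
The same calculation in plain Python, as a minimal sketch (assuming scipy); the scores and the mean mu below are made-up values for illustration.

    import numpy as np
    from scipy.stats import poisson

    scores = np.array([0, 1, 1, 2, 0, 3])   # hypothetical observed scores
    mu = 1.3                                 # hypothetical model mean

    # Independent games: the joint likelihood is a product of pmfs,
    # so the log of the joint likelihood is a sum of log-pmfs.
    log_lik = poisson.logpmf(scores, mu).sum()
    print(log_lik)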

fig03_log_likehood.png

The above shows the distribution of the log-likelihood estimated with Stan. The log-likelihood is negative, and the larger it is (the closer to zero), the better the fit. In other words, for both the home team and the away team, the negative binomial distribution expresses the observed values better. In Stan, the likelihood can be obtained as a distribution by adding the following code to the generated quantities block.

Likelihood calculation for the Poisson distribution
    real ht_log_likehood ;
    real at_log_likehood ;

    ht_log_likehood = 0;
    at_log_likehood = 0;
    
    for (i in 1:N) {
        ht_log_likehood = ht_log_likehood + 
                          poisson_lpmf(ht_score[i] | exp((home_adv[home_team[i]] + atk[home_team[i]]) - (def[away_team[i]])));
        at_log_likehood = at_log_likehood +
                          poisson_lpmf(at_score[i] | exp((atk[away_team[i]]) - (def[home_team[i]] + home_adv[home_team[i]])));
        
    }
Likelihood calculation for the negative binomial distribution

    real ht_log_likehood ;
    real at_log_likehood ;
    
    ht_log_likehood = 0;
    at_log_likehood = 0;
    
    for (i in 1:N) {
        
        ht_log_likehood = ht_log_likehood +
                            neg_binomial_2_lpmf(ht_score[i] | 
                                exp(home_adv[home_team[i]] + atk[home_team[i]] - def[away_team[i]]),phi[home_team[i]]) ;
                                
        at_log_likehood = at_log_likehood +
                            neg_binomial_2_lpmf(at_score[i] |
                                exp((atk[away_team[i]]) - (def[home_team[i]] + home_adv[home_team[i]])),phi[away_team[i]]) ;

    }

Zero-inflated Poisson distribution

The models so far assume that scores arise from a single generating process. However, one can also assume that a score of zero arises from a different process than positive scores. For example, suppose there is first a distribution for whether the team gets any scoring chance during the match, and the goal count follows a Poisson distribution only when a chance occurs. In that case zeros do not come only from the Poisson part: the Bernoulli part (the distribution deciding whether there is a chance) adds extra zeros, so zeros appear at a higher rate than under a pure Poisson model. A model that follows the Poisson distribution but inflates the probability of zero in this way is called the zero-inflated Poisson distribution. Here we apply a slightly customized version of it.
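
Before the Stan code, here is a minimal Python simulation of the customized generating process used below, where positive scores come from a Poisson shifted up by one; p_chance and mu are hypothetical values, not estimates.

    import numpy as np

    rng = np.random.default_rng(0)
    p_chance, mu = 0.6, 1.2   # hypothetical P(scoring chance) and Poisson mean

    def zip_draw(size):
        chance = rng.binomial(1, p_chance, size)   # Bernoulli: any chance at all?
        goals = rng.poisson(mu, size) + 1          # shifted Poisson when scoring
        return np.where(chance == 1, goals, 0)     # forced zero when no chance

    counts = np.bincount(zip_draw(100000))
    print(counts / counts.sum())   # empirical pmf: extra mass piled on zero

The actual Stan code is below.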

model_code = """
data {
    int N;  // N Games
    int K;  // K Teams
    int home_team[N]; // Home Team ID
    int away_team[N]; // Away Team ID
    int ht_score[N] ; // Home Team score 
    int at_score[N] ; // Away Team score
}

transformed data {
    int ht_score_over_zero[N] ;
    int ht_score_trans[N]     ;
    int at_score_over_zero[N] ;
    int at_score_trans[N]     ;
    
    for (i in 1:N) {
        if (ht_score[i]>0) { 
            ht_score_over_zero[i] = 1 ;
            ht_score_trans[i] = ht_score[i]-1 ;
            }
        else {
            ht_score_over_zero[i] = 0 ;
            ht_score_trans[i] = 0 ;
            }
            
        if (at_score[i]>0) { 
            at_score_over_zero[i] = 1 ;
            at_score_trans[i] = at_score[i]-1 ;
            }
        else {
            at_score_over_zero[i] = 0 ;
            at_score_trans[i] = 0;
            }
    }
}

parameters {
    real<lower=-1,upper=1> atk[K]; // attack strength
    real<lower=-1,upper=1> def[K]; // defense strength
    real<lower=0> home_adv[K];     // home advantage
    real<lower=0> sigma[K];        // sd of the attack/defense priors
    real<lower=0> hadv_sigma;      // sd of the home-advantage prior

}

model {
    real param1 ;
    real param2 ;
    
    for (k in 1:K) {
        atk[k] ~ normal(0, sigma[k]);
        def[k] ~ normal(0, sigma[k]);
        home_adv[k] ~ normal(0, hadv_sigma);
    }

    for (n in 1:N) {
        param1 = home_adv[home_team[n]] + atk[home_team[n]] - def[away_team[n]] ;
        param2 = atk[away_team[n]] - (def[home_team[n]] + home_adv[home_team[n]]) ;
        
        ht_score_over_zero[n] ~ bernoulli(inv_logit(param1)) ;
        at_score_over_zero[n] ~ bernoulli(inv_logit(param2)) ;
        
        if (ht_score_over_zero[n] > 0)
            ht_score_trans[n] ~ poisson(exp(param1));

        if (at_score_over_zero[n] > 0)
            at_score_trans[n] ~ poisson(exp(param2));
    }

}

generated quantities {
    int ht_predict_score[N] ;
    int at_predict_score[N] ;
    real ht_log_likehood ;
    real at_log_likehood ;
    
    ht_log_likehood = 0 ;
    at_log_likehood = 0 ;
    
    for (i in 1:N) {
        // Log-likelihood: Bernoulli part for every game,
        // shifted-Poisson part only for games with a goal.
        ht_log_likehood = ht_log_likehood + 
                            bernoulli_lpmf(ht_score_over_zero[i]|
                                                inv_logit(home_adv[home_team[i]] + atk[home_team[i]] - def[away_team[i]])) ;
        at_log_likehood = at_log_likehood +
                            bernoulli_lpmf(at_score_over_zero[i]|
                                                inv_logit(atk[away_team[i]] - (def[home_team[i]] + home_adv[home_team[i]]))) ;
                            
        if (ht_score_over_zero[i]>0) {
            ht_log_likehood = ht_log_likehood +
                            poisson_lpmf(ht_score_trans[i]|
                                                exp(home_adv[home_team[i]] + atk[home_team[i]] - def[away_team[i]])) ;
        }
        if (at_score_over_zero[i]>0) {
            at_log_likehood = at_log_likehood +
                            poisson_lpmf(at_score_trans[i]|
                                                exp(atk[away_team[i]] - (def[home_team[i]] + home_adv[home_team[i]]))) ;
        }
        
        // Predicted scores: draw "scores at all" from the Bernoulli part,
        // then a Poisson draw shifted up by one when a goal occurs.
        if (bernoulli_rng(inv_logit(home_adv[home_team[i]] + atk[home_team[i]] - def[away_team[i]]))==0) {
            ht_predict_score[i] = 0 ;
        }
        else {
            ht_predict_score[i] = poisson_rng(exp(home_adv[home_team[i]] + atk[home_team[i]] - def[away_team[i]])) +1 ;
        }
        
        if (bernoulli_rng(inv_logit(atk[away_team[i]] - (def[home_team[i]] + home_adv[home_team[i]])))==0) {
            at_predict_score[i] = 0 ;
        }
        else {
            at_predict_score[i] = poisson_rng(exp(atk[away_team[i]] - (def[home_team[i]] + home_adv[home_team[i]]))) +1 ;
        }
    }
}

"""

The basic structure of this model is a mixture of a Bernoulli distribution for "scoring" versus "not scoring" and a Poisson distribution for the number of goals. For the Bernoulli part, the value computed from attack strength, defense strength, and home advantage is converted into a probability with the sigmoid function (the inverse of the logit function). If this part judges that the team does not score, the result is fixed at zero. If it judges that the team does score, the Poisson part takes over. An ordinary Poisson distribution always assigns some probability to zero, but since it is applied here only in the "can score" case, a zero outcome would be inconsistent. Therefore the Poisson distribution is shifted right by one so that zero cannot appear.
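
The probability function this implies can be written down directly; here is a minimal sketch (assuming scipy), with p chosen so that P(0) matches the 41.8% mentioned below and mu an illustrative value.

    import numpy as np
    from scipy.stats import poisson

    def zip_shifted_pmf(k, p, mu):
        # P(0) = 1 - p ; P(k) = p * Poisson(k - 1; mu) for k >= 1
        k = np.asarray(k)
        return np.where(k == 0, 1.0 - p, p * poisson.pmf(k - 1, mu))

    p, mu = 1 - 0.418, 1.1   # p from the figure's 41.8% zeros; mu illustrative
    print(zip_shifted_pmf(np.arange(6), p, mu))

First, check the distribution of Kawasaki F's score in Kawasaki F vs. Kashima (Kashima at home).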

fig04_ZIP.png

The probability of 0 goals is the single largest, reflecting the Bernoulli part that decides whether the team scores at all. Since the probability of 0 goals is 41.8%, it is better to conclude that Kawasaki F is more likely than not to score in this match, and in that case the most likely score is two goals. Compared with the Poisson and negative binomial models, which both peak at one goal, this is a considerably better result. Moreover, under this model the probability that Kawasaki F scores four goals is fairly large, at 11.9%.

Finally, check the likelihood of this model. Here an observation must satisfy the Bernoulli distribution and the Poisson distribution at the same time, so its likelihood is the product of the two; after taking logarithms, it becomes the sum of the Bernoulli log-likelihood and the Poisson log-likelihood. Written in Stan, this is as follows.

generated quantities {

    real ht_log_likehood ;
    real at_log_likehood ;
    
    ht_log_likehood = 0 ;
    at_log_likehood = 0 ;
    
    for (i in 1:N) {
        ht_log_likehood = ht_log_likehood + 
                            bernoulli_lpmf(ht_score_over_zero[i]|
                                                inv_logit(home_adv[home_team[i]] + atk[home_team[i]] - def[away_team[i]])) ;
        at_log_likehood = at_log_likehood +
                            bernoulli_lpmf(at_score_over_zero[i]|
                                                inv_logit(atk[away_team[i]] - (def[home_team[i]] + home_adv[home_team[i]]))) ;
                            
        if (ht_score_over_zero[i]>0) {
            ht_log_likehood = ht_log_likehood +
                            poisson_lpmf(ht_score_trans[i]|
                                                exp(home_adv[home_team[i]] + atk[home_team[i]] - def[away_team[i]])) ;
        }
        if (at_score_over_zero[i]>0) {
            at_log_likehood = at_log_likehood +
                            poisson_lpmf(at_score_trans[i]|
                                                exp(atk[away_team[i]] - (def[home_team[i]] + home_adv[home_team[i]]))) ;
        }                                  
    }
}

The result is shown below. The overall likelihood is larger than that of the negative binomial distribution, indicating that this model represents the observed values best of the three.

fig05_ZIP_loglikehood.png
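
As a cross-check, the posterior log-likelihood samples can be compared directly in Python; a minimal sketch, assuming fit_pois, fit_nb, and fit_zip are the PyStan fit objects of the three models, each with the likelihood code included in its generated quantities block.

    import numpy as np

    for name, fit in [('poisson', fit_pois),
                      ('neg binomial', fit_nb),
                      ('zero-inflated', fit_zip)]:
        s = fit.extract()   # dict of posterior draws
        print(name,
              np.mean(s['ht_log_likehood']),
              np.mean(s['at_log_likehood']))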

Summary

We found that the zero-inflated Poisson distribution represents the observed values best for predicting soccer scores. This lets us define each team's attack and defense strength. Attack and defense strength should in turn be determined by the regular starting members and their characteristics (running ability and so on). If that relationship can be modeled well in the future, it will become possible to predict scores along the chain of team characteristics → attack/defense strength → score.
