[Python] I tried to predict J-League matches (data analysis)

Overview

I tried to predict J-League match results based on past data. The model predicted wins, losses, and draws with about 55% accuracy (training data: 2,573 matches, test data: 200 matches).

Disclaimer

I take no responsibility for any consequences of using the content of this article.

Trigger and past efforts

Around 2014, Google predicted World Cup wins and losses. Reading the article "When Google analyzes big data to predict the World Cup, it gets every match right through the quarterfinals. How will it end?", I decided to try making predictions for the J-League. What interested me about Google's effort: information on the positions of every player and the ball can be obtained from soccer data called OPTA (I searched for the data, but it seems too much for me to obtain personally), and, for example, the entire game was simulated using the Monte Carlo method.

Data collection

There are quite a few sites with data about the J-League. I collected the data slowly and quietly so as not to cause the sites any inconvenience.
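As a minimal sketch of that kind of polite collection (the URLs below are placeholders, not the sites actually used), fetching pages one at a time with a pause between requests looks roughly like this:

import time
import requests

# Placeholder URLs -- the actual match-list pages are not named in this article
match_urls = ['https://example.com/matches/%d' % i for i in range(1, 11)]

pages = []
for url in match_urls:
    resp = requests.get(url)
    resp.raise_for_status()
    pages.append(resp.text)
    time.sleep(5)  # pause several seconds between requests so as not to burden the site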


Input data

Match conditions
・・・ weather, attendance, day of the week, whether it was a holiday, kick-off time, etc.

History of the past 6 games for both the home and away teams
・・・ concatenated as-is into the input, without averaging

Data for all starting lineup players
・・・ each player's position, height, weight, birthplace, and goals in the past few games

(754 input variables in total; a rough sketch of how one input vector might be assembled is shown below.)
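As a minimal sketch of assembling one match's input vector (the arrays, their sizes, and their contents here are placeholders; only the total of 754 variables comes from the article), the concatenation without averaging might look like this:

import numpy as np

# Placeholder feature blocks for a single match (values and sizes are illustrative only)
match_conditions = np.array([0.7, 0.43, 1.0, 0.0, 0.75])   # weather, attendance, weekday, holiday flag, kick-off time
home_last6 = np.zeros(6 * 3)                                # past 6 games of the home team, joined as-is
away_last6 = np.zeros(6 * 3)                                # past 6 games of the away team
players = np.zeros(22 * 5)                                  # position, height, weight, birthplace, recent goals per starter

# One row of the input matrix: everything concatenated into a single flat vector
input_vector = np.concatenate([match_conditions, home_last6, away_last6, players])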


Correct answer data

Each match is labeled 0 for a loss, 1 for a win, and 2 for a draw.
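A minimal sketch of that encoding (the raw result strings and the mapping code are assumptions; only the 0/1/2 labels and the label.npy file appear in the article):

import numpy as np

# Hypothetical raw results scraped from the match pages
raw_results = ['win', 'lose', 'draw', 'win']

result_to_label = {'lose': 0, 'win': 1, 'draw': 2}
labels = np.array([result_to_label[r] for r in raw_results], dtype=np.float64)
np.save('label.npy', labels)  # loaded as all_label_np in the training script below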

Data processing, molding

This time I used BeautifulSoup seriously for the first time to extract the data from the HTML. It is extremely convenient; I should have started using it earlier. You should also normalize the inputs to the 0-1 range, which takes only about two lines using np.max.
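A minimal sketch of both steps (the HTML snippet, tag names, and the placeholder feature matrix below are assumptions, not the actual site's markup or the real data):

import numpy as np
from bs4 import BeautifulSoup

# Extract a value from scraped HTML (placeholder markup)
html = '<table class="match"><tr><td class="attendance">18532</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')
attendance = float(soup.find('td', class_='attendance').text)

# Normalize each column of the feature matrix to the 0-1 range in roughly two lines with np.max
all_input_np = np.abs(np.random.randn(2773, 754))  # placeholder for the real feature matrix
all_input_np = all_input_np / np.max(all_input_np, axis=0)
np.save('input.npy', all_input_np)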

Learning source code

Pretty ordinary scikit-learn code:

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# Load the preprocessed features and labels, holding out the last 200 matches for testing
all_input_np = np.load('input.npy')
all_label_np = np.load('label.npy')
train_input = all_input_np[:-200]
test_input = all_input_np[-200:]
train_result = all_label_np[:-200]
test_result = all_label_np[-200:]

# Grid-search the SVM regularization parameter C with a linear kernel
tuned_parameters = [{'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

clf = GridSearchCV(SVC(), tuned_parameters, n_jobs=4, scoring="accuracy")

print(">>>START")
clf.fit(train_input, train_result)
print("Best parameters set found on development set: %s" % clf.best_params_)

# Cross-validation scores for each parameter combination
print("Grid scores on development set:")
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

# Evaluate the best estimator on the held-out 200 matches
print("The scores are computed on the full evaluation set.")
y_true, y_pred = test_result, clf.predict(test_input)
print(classification_report(y_true, y_pred))
print(accuracy_score(y_true, y_pred))

Result

             precision    recall  f1-score   support

        0.0       0.61      0.56      0.59        87
        1.0       0.51      0.78      0.62        78
        2.0       0.00      0.00      0.00        35
avg / total       0.46      0.55      0.50       200

0.55

The accuracy is 55%. As for the labels, 0 is a loss, 1 is a win, and 2 is a draw.

One bit of ingenuity here: it seems about 20% of all soccer matches end in a draw. To raise the overall accuracy, I had the model treat draws as wins during training, so that it effectively predicts only wins and losses. (Conversely, this means draws cannot be predicted, which is why class 2 scores zero above.)
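A minimal sketch of that relabeling, assuming draws (label 2) are simply mapped to wins (label 1) in the training labels; the article does not show exactly how the classes were merged:

import numpy as np

# Merge draws into wins for the training labels only
train_result = np.load('label.npy')[:-200].copy()
train_result[train_result == 2] = 1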

Results using neural networks

The accuracy was about 50%, using a multilayer perceptron (3 layers, ELU activations, dropout rate around 0.5).
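The article does not name the framework or the layer widths, so the following is only a minimal sketch of a comparable network, assuming Keras and arbitrary hidden sizes of 256 and 128:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

all_input_np = np.load('input.npy')
all_label_np = np.load('label.npy')
train_input, test_input = all_input_np[:-200], all_input_np[-200:]
train_result, test_result = all_label_np[:-200], all_label_np[-200:]

# Three dense layers with ELU activations and dropout of 0.5, as mentioned above;
# the widths (256/128) and training settings are assumptions, not the author's values
model = Sequential([
    Dense(256, activation='elu', input_shape=(train_input.shape[1],)),
    Dropout(0.5),
    Dense(128, activation='elu'),
    Dropout(0.5),
    Dense(3, activation='softmax'),  # loss / win / draw
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(train_input, train_result, epochs=50, batch_size=32,
          validation_data=(test_input, test_result))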

Impressions

In the end, it is hard to go far with access to only limited data. With the data collected this time, I think the accuracy would top out at around 60% even if the mathematical model were pushed to its limit. A full-game simulation like Google's sounds like a good idea, but I am skeptical that soccer can be reproduced from numbers like pass success rates alone. Two directions seem promising. One is linking in image data from the matches, which might enable analysis that humans have overlooked or simply cannot handle. The other is visualizing the process of the data analysis, which would make the direction for improving the model and the input data visible.
