[Python] I tried to predict J-League matches (data analysis)

Overview

I tried to predict J-League match results based on past data. The model predicted wins, losses, and draws with about 55% accuracy (training data: 2,573 matches, test data: 200 matches).

Disclaimer

I take no responsibility for any consequences of using the content of this article.

Trigger and past efforts

Around 2014, Google predicted World Cup wins and losses. Reading the article "When Google analyzes big data to predict the World Cup, it gets every match right through the quarterfinals. How will it end?", I decided to try making predictions for the J-League. What interested me about Google's effort: information on the positions of every player and the ball can be obtained from soccer data called OPTA (I searched for the data, but it seems too much for me to obtain personally), and, for example, the entire game was simulated using the Monte Carlo method.

Data collection

There are quite a few sites with data about the J-League. I collected the data slowly and quietly so as not to cause the sites any inconvenience.
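As a minimal sketch of that kind of polite collection (the URLs below are placeholders, not the sites actually used), fetching pages one at a time with a pause between requests looks roughly like this:

import time
import requests

# Placeholder URLs -- the actual match-list pages are not named in this article
match_urls = ['https://example.com/matches/%d' % i for i in range(1, 11)]

pages = []
for url in match_urls:
    resp = requests.get(url)
    resp.raise_for_status()
    pages.append(resp.text)
    time.sleep(5)  # pause several seconds between requests so as not to burden the site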


Input data

Match conditions
・・・ weather, attendance, day of the week, whether it was a holiday, kick-off time, etc.

History of the past 6 games for both the home and away teams
・・・ concatenated as-is into the input, without averaging

Data for all starting lineup players
・・・ each player's position, height, weight, birthplace, and goals in the past few games

(754 input variables in total; a rough sketch of how one input vector might be assembled is shown below.)
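As a minimal sketch of assembling one match's input vector (the arrays, their sizes, and their contents here are placeholders; only the total of 754 variables comes from the article), the concatenation without averaging might look like this:

import numpy as np

# Placeholder feature blocks for a single match (values and sizes are illustrative only)
match_conditions = np.array([0.7, 0.43, 1.0, 0.0, 0.75])   # weather, attendance, weekday, holiday flag, kick-off time
home_last6 = np.zeros(6 * 3)                                # past 6 games of the home team, joined as-is
away_last6 = np.zeros(6 * 3)                                # past 6 games of the away team
players = np.zeros(22 * 5)                                  # position, height, weight, birthplace, recent goals per starter

# One row of the input matrix: everything concatenated into a single flat vector
input_vector = np.concatenate([match_conditions, home_last6, away_last6, players])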


Correct answer data

Each match is labeled 0 for a loss, 1 for a win, and 2 for a draw.
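A minimal sketch of that encoding (the raw result strings and the mapping code are assumptions; only the 0/1/2 labels and the label.npy file appear in the article):

import numpy as np

# Hypothetical raw results scraped from the match pages
raw_results = ['win', 'lose', 'draw', 'win']

result_to_label = {'lose': 0, 'win': 1, 'draw': 2}
labels = np.array([result_to_label[r] for r in raw_results], dtype=np.float64)
np.save('label.npy', labels)  # loaded as all_label_np in the training script below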

Data processing, molding

This time I used BeautifulSoup seriously for the first time to extract the data from the HTML. It is extremely convenient; I should have started using it earlier. You should also normalize the inputs to the 0-1 range, which takes only about two lines using np.max.
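A minimal sketch of both steps (the HTML snippet, tag names, and the placeholder feature matrix below are assumptions, not the actual site's markup or the real data):

import numpy as np
from bs4 import BeautifulSoup

# Extract a value from scraped HTML (placeholder markup)
html = '<table class="match"><tr><td class="attendance">18532</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')
attendance = float(soup.find('td', class_='attendance').text)

# Normalize each column of the feature matrix to the 0-1 range in roughly two lines with np.max
all_input_np = np.abs(np.random.randn(2773, 754))  # placeholder for the real feature matrix
all_input_np = all_input_np / np.max(all_input_np, axis=0)
np.save('input.npy', all_input_np)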

Learning source code

Pretty ordinary scikit-learn code:

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# Load the preprocessed features and labels, holding out the last 200 matches for testing
all_input_np = np.load('input.npy')
all_label_np = np.load('label.npy')
train_input = all_input_np[:-200]
test_input = all_input_np[-200:]
train_result = all_label_np[:-200]
test_result = all_label_np[-200:]

# Grid-search the SVM regularization parameter C with a linear kernel
tuned_parameters = [{'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

clf = GridSearchCV(SVC(), tuned_parameters, n_jobs=4, scoring="accuracy")

print(">>>START")
clf.fit(train_input, train_result)
print("Best parameters set found on development set: %s" % clf.best_params_)

# Cross-validation scores for each parameter combination
print("Grid scores on development set:")
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

# Evaluate the best estimator on the held-out 200 matches
print("The scores are computed on the full evaluation set.")
y_true, y_pred = test_result, clf.predict(test_input)
print(classification_report(y_true, y_pred))
print(accuracy_score(y_true, y_pred))

Result

             precision    recall  f1-score   support

        0.0       0.61      0.56      0.59        87
        1.0       0.51      0.78      0.62        78
        2.0       0.00      0.00      0.00        35
avg / total       0.46      0.55      0.50       200

0.55

The accuracy is 55%. As for the labels, 0 is a loss, 1 is a win, and 2 is a draw.

One bit of ingenuity here: it seems about 20% of all soccer matches end in a draw. To raise the overall accuracy, I had the model treat draws as wins during training, so that it effectively predicts only wins and losses. (Conversely, this means draws cannot be predicted, which is why class 2 scores zero above.)
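A minimal sketch of that relabeling, assuming draws (label 2) are simply mapped to wins (label 1) in the training labels; the article does not show exactly how the classes were merged:

import numpy as np

# Merge draws into wins for the training labels only
train_result = np.load('label.npy')[:-200].copy()
train_result[train_result == 2] = 1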

Results using neural networks

The accuracy was about 50%, using a multilayer perceptron (3 layers, ELU activations, dropout rate around 0.5).
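The article does not name the framework or the layer widths, so the following is only a minimal sketch of a comparable network, assuming Keras and arbitrary hidden sizes of 256 and 128:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

all_input_np = np.load('input.npy')
all_label_np = np.load('label.npy')
train_input, test_input = all_input_np[:-200], all_input_np[-200:]
train_result, test_result = all_label_np[:-200], all_label_np[-200:]

# Three dense layers with ELU activations and dropout of 0.5, as mentioned above;
# the widths (256/128) and training settings are assumptions, not the author's values
model = Sequential([
    Dense(256, activation='elu', input_shape=(train_input.shape[1],)),
    Dropout(0.5),
    Dense(128, activation='elu'),
    Dropout(0.5),
    Dense(3, activation='softmax'),  # loss / win / draw
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(train_input, train_result, epochs=50, batch_size=32,
          validation_data=(test_input, test_result))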

Impressions

In the end, it is hard to go far with access to only limited data. With the data collected this time, I think the accuracy would top out at around 60% even if the mathematical model were pushed to its limit. A full-game simulation like Google's sounds like a good idea, but I am skeptical that soccer can be reproduced from numbers like pass success rates alone. Two directions seem promising. One is linking in image data from the matches, which might enable analysis that humans have overlooked or simply cannot handle. The other is visualizing the process of the data analysis, which would make the direction for improving the model and the input data visible.
