[PYTHON] I tried to predict the victory or defeat of the Premier League using the Qore SDK

Introduction

This is my first post on Qiita. I am almost a beginner at data analysis, so there are probably many mistakes; please point them out. This time, I used the Qore SDK from Quantum Core Inc.

How to use the Qore SDK is explained in the following articles: "The world of reservoir computing ~ with Qore ~" and "Introduction of the Qore SDK and detection of arrhythmia with Qore".

The task is predicting wins and losses in the soccer Premier League. Specifically, I predict the results of matches played in the 2019-2020 season using data from 2010-2018.

The dataset was downloaded from the following site. http://football-data.co.uk/englandm.php

Data preprocessing

I posted all the datasets and preprocessing code on GitHub, so please refer to it: https://github.com/obameyan/QoreSDK-Premire-League

Since it is difficult to describe all of the preprocessing here, I only show the data before and after preprocessing. Briefly, I converted the data so that it could be fed to the Qore SDK as time series data, taking into account factors such as the opposing team, the match result, goals scored and conceded, and hat tricks.
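The actual preprocessing is in the repository above; as a rough illustration of the idea only, here is a hypothetical sketch (with made-up match rows, not the real dataset) of how per-team cumulative features such as goals scored (HTGS/ATGS) and league points (HTP/ATP) can be accumulated match by match:

```python
import pandas as pd

# Toy match rows in the same shape as the raw CSV (hypothetical values)
raw = pd.DataFrame({
    'HomeTeam': ['Man United', 'Bournemouth', 'Man United'],
    'AwayTeam': ['Leicester', 'Cardiff', 'Cardiff'],
    'FTHG': [2, 2, 3],   # full-time home goals
    'FTAG': [1, 0, 0],   # full-time away goals
    'FTR':  ['H', 'H', 'H'],  # full-time result: H / D / A
})

goals = {}   # cumulative goals scored per team BEFORE each match
points = {}  # cumulative league points per team BEFORE each match
rows = []
for _, m in raw.iterrows():
    h, a = m['HomeTeam'], m['AwayTeam']
    # Record the pre-match cumulative stats as that match's features
    rows.append({
        'HTGS': goals.get(h, 0), 'ATGS': goals.get(a, 0),
        'HTP': points.get(h, 0), 'ATP': points.get(a, 0),
    })
    # Then update the running totals with this match's outcome
    goals[h] = goals.get(h, 0) + m['FTHG']
    goals[a] = goals.get(a, 0) + m['FTAG']
    if m['FTR'] == 'H':
        points[h] = points.get(h, 0) + 3
    elif m['FTR'] == 'A':
        points[a] = points.get(a, 0) + 3
    else:
        points[h] = points.get(h, 0) + 1
        points[a] = points.get(a, 0) + 1

features = pd.DataFrame(rows)
```

In the actual preprocessed data shown below, columns such as HTGS/ATGS hold fractions rather than raw counts, so some normalization is also applied; that detail is omitted from this sketch.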

The following is the data before preprocessing (only part of it is shown).

import pandas as pd

# Original data (only part of it is shown below)
raw_data = pd.read_csv('./data/PremierLeague/2018-19.csv')
raw_data.head()
Div Date HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG HTR Referee HS AS HST AST HF AF HC AC HY AY HR AR B365H B365D B365A BWH BWD BWA IWH IWD IWA PSH PSD PSA WHH WHD WHA VCH VCD VCA Bb1X2 BbMxH BbAvH BbMxD BbAvD BbMxA BbAvA BbOU BbMx>2.5 BbAv>2.5 BbMx<2.5 BbAv<2.5 BbAH BbAHh BbMxAHH BbAvAHH BbMxAHA BbAvAHA PSCH PSCD PSCA
E0 10/08/2018 Man United Leicester 2 1 H 1 0 H A Marriner 8 13 6 4 11 8 2 5 2 1 0 0 1.57 3.9 7.50 1.53 4.0 7.50 1.55 3.80 7.00 1.58 3.93 7.50 1.57 3.8 6.00 1.57 4.0 7.00 39 1.60 1.56 4.20 3.92 8.05 7.06 38 2.12 2.03 1.85 1.79 17 -0.75 1.75 1.70 2.29 2.21 1.55 4.07 7.69
E0 11/08/2018 Bournemouth Cardiff 2 0 H 1 0 H K Friend 12 10 4 1 11 9 7 4 1 1 0 0 1.90 3.6 4.50 1.90 3.4 4.40 1.90 3.50 4.10 1.89 3.63 4.58 1.91 3.5 4.00 1.87 3.6 4.75 39 1.93 1.88 3.71 3.53 4.75 4.37 38 2.05 1.98 1.92 1.83 20 -0.75 2.20 2.13 1.80 1.75 1.88 3.61 4.70
E0 11/08/2018 Fulham Crystal Palace 0 2 A 0 1 A M Dean 15 10 6 9 9 11 5 5 1 2 0 0 2.50 3.4 3.00 2.45 3.3 2.95 2.40 3.30 2.95 2.50 3.46 3.00 2.45 3.3 2.80 2.50 3.4 3.00 39 2.60 2.47 3.49 3.35 3.05 2.92 38 2.00 1.95 1.96 1.87 22 -0.25 2.18 2.11 1.81 1.77 2.62 3.38 2.90
E0 11/08/2018 Huddersfield Chelsea 0 3 A 0 2 A C Kavanagh 6 13 1 4 9 8 2 5 2 1 0 0 6.50 4.0 1.61 6.25 3.9 1.57 6.20 4.00 1.55 6.41 4.02 1.62 5.80 3.9 1.57 6.50 4.0 1.62 38 6.85 6.09 4.07 3.90 1.66 1.61 37 2.05 1.98 1.90 1.84 23 1.00 1.84 1.80 2.13 2.06 7.24 3.95 1.58
E0 11/08/2018 Newcastle Tottenham 1 2 A 1 2 A M Atkinson 15 15 2 5 11 12 3 5 2 2 0 0 3.90 3.5 2.04 3.80 3.5 2.00 3.70 3.35 2.05 3.83 3.57 2.08 3.80 3.2 2.05 3.90 3.4 2.10 39 4.01 3.83 3.57 3.40 2.12 2.05 38 2.10 2.01 1.88 1.81 20 0.25 2.20 2.12 1.80 1.76 4.74 3.53 1.89

Next is the data after preprocessing (only part of it is shown).

import pandas as pd

# These are intermediate files produced by the preprocessing, not the raw data
data = pd.read_csv("./data/PremierLeague/allAtt_onehot_large_train.csv")   # training data
dataT = pd.read_csv("./data/PremierLeague/allAtt_onehot_large_test.csv")   # test data

# Keep only the columns used as features and the label
data = data[['HTGS', 'ATGS', 'HTP', 'ATP', 'HM1', 'AM1', 'DiffLP', 'final1']]
dataT = dataT[['HTGS', 'ATGS', 'HTP', 'ATP', 'HM1', 'AM1', 'DiffLP', 'final1']]
data[200:210]
HTGS ATGS HTP ATP HM1 AM1 DiffLP final1
0.4737 0.2568 1.3333 1.0476 3 3 1 0
0.3289 0.3784 0.9048 1.0476 1 1 -3 1
0.4342 0.3243 2.0952 1.2857 3 3 -12 1
0.4342 0.2703 1.8571 1.8095 3 3 1 0
0.3553 0.2432 1.0000 1.2857 0 1 -1 0
0.2763 0.3378 1.1905 1.1905 3 1 9 1
0.4342 0.3919 1.3810 0.9524 1 1 -2 0
0.3289 0.3378 1.0476 1.7143 1 3 2 1
0.4474 0.3784 1.1905 0.8095 3 0 -8 1
0.3816 0.3919 0.8571 1.6667 1 0 15 1

Preprocessing with the Qore SDK

Here, we perform preprocessing with the Qore SDK itself. Specifically, we use qore_sdk.utils.sliding_window() to convert the training data to shape (number of samples, time steps, features) and the labels to shape (number of samples, 1).

import numpy as np
from qore_sdk.utils import sliding_window

x = np.array(data)
x_t = np.array(dataT)

x_train = x[:, :7]   # first 7 columns are the features
x_test = x_t[:, :7]
y_train = x[:, 7]    # last column ('final1') is the label
y_test = x_t[:, 7]

X, y = sliding_window(x_train, 10, 5, axis=0, y=y_train, y_def='mode', y_axis=0)
X_test, y_test = sliding_window(x_test, 10, 5, axis=0, y=y_test, y_def='mode', y_axis=0)
print(X.shape, y.shape, X_test.shape, y_test.shape)
>>  (653, 10, 7), (653, 1), (159, 10, 7), (159, 1)
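To make the windowing concrete, here is a rough NumPy-only sketch of what `sliding_window` appears to do in this call (my own approximation, not the SDK's implementation): windows of length 10 are taken every 5 steps along the time axis, and each window is labeled with the most frequent label inside it (`y_def='mode'`).

```python
import numpy as np

def sliding_window_sketch(x, width, stride, y=None):
    """Extract windows of shape (n_windows, width, n_features) from a 2D array."""
    starts = range(0, len(x) - width + 1, stride)
    Xw = np.stack([x[s:s + width] for s in starts])
    if y is None:
        return Xw
    # 'mode'-style labels: the most frequent label inside each window
    Yw = np.array([[np.bincount(y[s:s + width].astype(int)).argmax()]
                   for s in starts])
    return Xw, Yw

x_toy = np.arange(40, dtype=float).reshape(20, 2)  # 20 time steps, 2 features
y_toy = np.array([0] * 12 + [1] * 8)
Xw, Yw = sliding_window_sketch(x_toy, 10, 5, y_toy)
print(Xw.shape, Yw.shape)  # (3, 10, 2) (3, 1)
```

With 20 time steps, width 10, and stride 5, three windows start at indices 0, 5, and 10; only the last window contains a majority of label 1.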

Learning and prediction using the Qore SDK

Enter the account information that was issued to you.

from qore_sdk.client import WebQoreClient

username = '*****'
password = '*****'
endpoint = '*****'

client = WebQoreClient(username, password, endpoint=endpoint)

Now let's actually train the model.

client.classifier_train(X, y)
>> {'res': 'ok', 'train_time': 0.8582723140716553}

Training finished in an instant. Next, let's check the accuracy on the test data.

from sklearn.metrics import classification_report

res = client.classifier_predict(X_test)
report = classification_report(y_test, res['Y'])
print(report)
              precision    recall  f1-score   support

         0.0       0.73      0.90      0.81       104
         1.0       0.68      0.38      0.49        55

   accuracy                            0.72       159
   macro avg       0.71      0.64      0.65       159
weighted avg       0.71      0.72      0.70       159

The accuracy was 72%. Honestly, it's a middling result, but I think this comes down to the preprocessing... I should have augmented the data and examined the correlations between features more carefully before preprocessing...
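For reference, a quick check of the majority-class baseline implied by the support counts in the report above (104 samples of class 0 vs. 55 of class 1) shows what always predicting class 0 would score:

```python
# Class supports taken from the classification report above
support = {0.0: 104, 1.0: 55}

# Accuracy of always predicting the majority class
baseline = max(support.values()) / sum(support.values())
print(round(baseline, 3))  # 0.654 -- so 72% does beat this trivial baseline
```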

Summary

I had wanted to do data analysis on a daily basis but never managed to, so I am grateful to Quantum Core for giving me the opportunity to do this kind of analysis. I would like to take this opportunity to keep taking on data analysis challenges.
