A machine learning beginner tried to make a horse racing prediction model with Python

Recently, more and more services incorporating machine learning are being developed, and I sometimes direct such development myself.

However, it is undeniably unsatisfying to just blindly use learning models created by people called data scientists and ML engineers, so in order to raise the machine learning knowledge level of a beginner (me), I have summarized the process up to the point of being able to create a learning model myself.

Goal of this article

We will start by setting up a Python environment, and then build a classification model using logistic regression, which seems to be the quickest route. As the subject, I will take on horse racing prediction, for both hobby and profit.

Environment

Prerequisites

The environment used is as follows.

pipenv installation

We will build the Python execution environment using pipenv.

$ pip install pipenv

Build a virtual environment to run python.

$ export PIPENV_VENV_IN_PROJECT=true
$ cd <project_dir>
$ pipenv --python 3.7

PIPENV_VENV_IN_PROJECT is a setting that creates the virtual environment under the project directory (./.venv/).

Library installation

Here, we will install the minimum required libraries.

$ pipenv install pandas
$ pipenv install scikit-learn
$ pipenv install matplotlib
$ pipenv install jupyter

After installation, the Pipfile and Pipfile.lock in the project directory are updated. These four libraries are essentials, so install them as a matter of course.

| Library | Use |
| --- | --- |
| pandas | Data storage and preprocessing (cleansing, integration, transformation, etc.) |
| scikit-learn (imported as sklearn) | Training and prediction with various machine learning algorithms |
| matplotlib | Data visualization by graph drawing |
| jupyter | Interactive programming in the browser |

How to start jupyter notebook

$ cd <project_dir>
$ pipenv run jupyter notebook
...
    To access the notebook, open this file in a browser:
        file:///Users/katayamk/Library/Jupyter/runtime/nbserver-4261-open.html
    Or copy and paste one of these URLs:
        http://localhost:8888/?token=f809cb2bcb716ba5726912d43738dd51992d3d7f20942d71
     or http://127.0.0.1:8888/?token=f809cb2bcb716ba5726912d43738dd51992d3d7f20942d71

Access the localhost URL printed to the terminal to open Jupyter Notebook running on the local server.

This completes the environment construction.

Model building

There are various types of machine learning, such as supervised learning, unsupervised learning, reinforcement learning, and deep learning. This time, as mentioned at the beginning, in order to keep things simple enough for a beginner to build, we will construct a supervised classification model.

Machine learning workflow

The AWS article "What is the machine learning workflow? Explaining AWS machine learning services with graphic recording" was easy to understand, so I recommend referring to it. (figure: machine learning workflow diagram) Since the flow above summarizes the process briefly, we will build the learning model in this order.

1. Data acquisition

To build a horse racing prediction model, we first need past horse racing data. There are also methods that scrape horse racing information sites on the Internet, but in anticipation of future operation, we will purchase and obtain official JRA data. Data source: JRA-VAN Data Lab

You can write a program to fetch the data yourself, but you can also use the free horse racing software provided with the service to export the data to files. (Details are omitted, as they are not the main topic.)

This time, I got the following two types of data files. The target period of the data is 5 years from 2015 to 2019.

| File name | Type of data | Description |
| --- | --- | --- |
| syutsuba_data.csv | Race card data | Program data listing the racehorses entered in upcoming races |
| seiseki_data.csv | Results data | Results data listing the finishing order, etc. of races that were held |

2. Data preprocessing

What is data preprocessing?

This is the most important step in machine learning. We perform the following processing according to the acquired data.

Data cleansing

Remove noisy data and fill in missing values with alternative values.

Data integration

It is rare for all the data required for training to be available in one place from the start; integrating the scattered data produces a consistent dataset.

Data conversion

This is the process of converting data into a prescribed format to improve the quality of the model. For example, scaling numerical data so that it fits within a range such as -1 to 1, or converting categorical data (say, a column whose value is either dog or cat) into dummy variables so that it becomes numerical data.
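As a quick illustration (a toy example of my own, not the article's data), pandas performs this dummy-variable conversion in a single call:

import pandas as pd

# Toy categorical data: 'animal' holds either dog or cat
df_toy = pd.DataFrame({'animal': ['dog', 'cat', 'dog'], 'weight': [8.0, 4.0, 12.0]})

# Each category becomes its own 0/1 column (boolean in recent pandas)
df_toy = pd.get_dummies(df_toy, columns=['animal'])
print(df_toy.columns.tolist())  # ['weight', 'animal_cat', 'animal_dog']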

Preprocessing of horse racing data

From here we actually implement the preprocessing of the horse racing data; using the Jupyter Notebook launched earlier, you can program while checking the state of the data interactively.

First we load the acquired horse racing data into pandas DataFrames; as the end result of preprocessing, the data will be shaped into the following structure.

| Data item | Use | Description |
| --- | --- | --- |
| race_index | Index | ID identifying the race being held |
| This prize | Explanatory variable | Total amount of prize money the racehorse has earned |
| Jockey name | Explanatory variable | Jockey name, converted to dummy variables |
| Within 3 | Objective variable | 1 if the horse's finishing order is within 3rd place, 0 if 4th or lower |

This time, we use the total prize money each horse has won so far as a feature measuring the racehorse's ability. We also adopted the jockey name, considering the large difference a jockey's skill can make. Let's see how accurate a prediction we can get from these two explanatory variables alone.

build.ipynb


import pandas as pd

# Race card data
syutsuba_path = './data/sample/syutsuba_data.csv'
df_syutsuba = pd.read_csv(syutsuba_path, encoding='shift-jis')
df_syutsuba = df_syutsuba[['Race ID', 'This prize', 'Jockey name']]

# Results data
seiseki_path = './data/sample/seiseki_data.csv'
df_seiseki = pd.read_csv(seiseki_path, encoding='shift-jis')
df_seiseki = df_seiseki[['Race ID', 'Confirmed order of arrival']]

The DataFrames are organized as follows. (screenshots: previews of df_syutsuba and df_seiseki)

Reference) Race ID data format

| Subscript (range) | Data length | Description |
| --- | --- | --- |
| 0〜3 | 4 bytes | Year |
| 4〜5 | 2 bytes | Month |
| 6〜7 | 2 bytes | Day |
| 8〜9 | 2 bytes | Racetrack code |
| 10〜11 | 2 bytes | Meeting number |
| 12〜13 | 2 bytes | Day of meeting |
| 14〜15 | 2 bytes | Race number |
| 16〜17 | 2 bytes | Horse number |

Next, we will integrate the acquired data and perform data cleansing and conversion.

build.ipynb


# Merge race card data and results data
df = pd.merge(df_syutsuba, df_seiseki, on='Race ID')

# Remove records with missing values
df.dropna(how='any', inplace=True)

# Add a column indicating whether the finishing order is within 3rd place
f_ranking = lambda x: 1 if x in [1, 2, 3] else 0
df['Within 3'] = df['Confirmed order of arrival'].map(f_ranking)

# Generate dummy variables
df = pd.get_dummies(df, columns=['Jockey name'])

# Set the index (the first 16 characters identify the race, excluding the horse number)
df['race_index'] = df['Race ID'].astype(str).str[0:16]
df.set_index('race_index', inplace=True)

# Delete unnecessary columns
df.drop(['Race ID', 'Confirmed order of arrival'], axis=1, inplace=True)

If you check the DataFrame, you can see that the dummy-encoded column has been replaced by new columns, one per category, each holding a 0-or-1 flag. (screenshot: DataFrame after dummy encoding) Dummy-encoding the jockey names increased the number of columns to 295; note that dummy-encoding a column with many categories can cause overfitting. One mitigation is sketched below.
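A possible mitigation (my own sketch, not something the article does) is to keep only the most frequent jockeys as categories and bucket the rest into an 'other' group, in place of the plain get_dummies call above:

# Keep the 50 most frequent jockeys (50 is an arbitrary illustrative choice)
top_jockeys = df['Jockey name'].value_counts().nlargest(50).index
df['Jockey name'] = df['Jockey name'].where(df['Jockey name'].isin(top_jockeys), 'other')
df = pd.get_dummies(df, columns=['Jockey name'])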

3. Model learning

Next, let's train the model. First, the data is split into training data and evaluation data for both the explanatory and objective variables.

build.ipynb


from sklearn.model_selection import train_test_split

# Store the explanatory variables in dataX
dataX = df.drop(['Within 3'], axis=1)

# Store the objective variable in dataY
dataY = df['Within 3']

# Split the data (training data 0.8, evaluation data 0.2)
X_train, X_test, y_train, y_test = train_test_split(dataX, dataY, test_size=0.2, stratify=dataY)

In short, it is divided into the following four sets of data.

| Variable name | Type of data | Use |
| --- | --- | --- |
| X_train | Explanatory variables | Training data |
| X_test | Explanatory variables | Evaluation data |
| y_train | Objective variable | Training data |
| y_test | Objective variable | Evaluation data |

This time, train_test_split is used to split the training and evaluation data easily, but for data with a time-series structure such as horse racing, accuracy is said to improve if you split the data chronologically: **(past) → training data → evaluation data → (present)**, so that the model never trains on races later than those it is evaluated on. A minimal sketch of such a split follows.
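Here is a minimal sketch of a chronological split (my own code, relying on the fact noted above that the first four characters of race_index are the year):

# Train on 2015-2018, evaluate on 2019
years = dataX.index.str[0:4].astype(int)
X_train, X_test = dataX[years <= 2018], dataX[years >= 2019]
y_train, y_test = dataY[years <= 2018], dataY[years >= 2019]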

Next, we train on the prepared data. sklearn provides the basic algorithms, and this time we will use **logistic regression**.

build.ipynb


from sklearn.linear_model import LogisticRegression

# Create a classifier (logistic regression)
clf = LogisticRegression()

# Train
clf.fit(X_train, y_train)

That's it. It's very easy.
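As an aside (my own addition, not in the article), a logistic regression classifier can also output class probabilities, which is handy if you later want to rank the horses within a race rather than make hard 0/1 calls:

# Probability that each evaluation sample finishes within 3rd place
# (clf.classes_ is [0, 1], so column 1 is the probability of class 1)
proba_within3 = clf.predict_proba(X_test)[:, 1]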

4. Model evaluation

First, let's make predictions on the evaluation data and check the accuracy based on the results.

build.ipynb


# Predict
y_pred = clf.predict(X_test)

# Display the accuracy
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
0.7874043003746538

The accuracy is 0.7874043003746538, meaning about 78% of predictions are correct. At first glance you might think, "Wow, amazing! This will really pay off!", but be careful with this accuracy_score. Try running the following code.

build.ipynb


# Show the confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred, labels=[1, 0]))
[[  339 10031]
 [  410 38332]]

This two-dimensional array, called a confusion matrix, represents the following:

|  | Prediction: within 3rd | Prediction: 4th or lower |
| --- | --- | --- |
| Actual: within 3rd | 339 | 10031 |
| Actual: 4th or lower | 410 | 38332 |

Of these, the accuracy is the proportion of the total accounted for by **Prediction: within 3rd × Actual: within 3rd** (339) and **Prediction: 4th or lower × Actual: 4th or lower** (38332).

**Accuracy**: 0.78 = (339 + 38332) / (339 + 38332 + 410 + 10031)


From this result, you can see that the model predicts "within 3rd" far too rarely in the first place, and the accuracy is inflated by predicting "4th or lower" for the vast majority of samples.

Now that we know that accuracy must be treated with care, what should we use to evaluate the model? One way to make use of the confusion matrix is to check the F value (F1 score).

What is the F value?

It is the harmonic mean of 1 and 2 below.

  1. Of the horses predicted to finish within 3rd place, the percentage that actually did (called precision)
  2. Of the horses that actually finished within 3rd place, the percentage that were predicted to (called recall)

**Precision**: 0.45 = 339 / (339 + 410)
**Recall**: 0.03 = 339 / (339 + 10031)

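As a sanity check (my own arithmetic, not in the original article), the F value printed below is just the harmonic mean of these two numbers:

F1 = 2 × precision × recall / (precision + recall) = 2 × 0.4526 × 0.0327 / (0.4526 + 0.0327) ≈ 0.061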

build.ipynb


# Display the F value
from sklearn.metrics import f1_score
print(f1_score(y_test, y_pred))
0.06097670653835776

Checking the F value this time gave 0.06097670653835776. Since randomly assigning 0s and 1s is said to yield an F value converging to around 0.5, you can see that this 0.06 is an extremely low value.

Correct data imbalance

build.ipynb


print(df['Within 3'].value_counts())
0    193711
1     51848

The ratio of the objective variable (within 3rd : 4th or lower) is roughly 1:4, so the data is somewhat imbalanced; let's correct this a little.

First, install the following library additionally.

$ pipenv install imbalanced-learn

We undersample the training data so that the ratio of "within 3rd" to "4th or lower" becomes 1:2. Undersampling means randomly discarding rows from the over-represented class to bring its count down toward the smaller class.

build.ipynb


from imblearn.under_sampling import RandomUnderSampler

# Undersample the training data ("4th or lower" : "within 3rd" = 2 : 1)
f_count = y_train.value_counts()[1] * 2
t_count = y_train.value_counts()[1]
rus = RandomUnderSampler(sampling_strategy={0: f_count, 1: t_count})
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)  # fit_sample in older imbalanced-learn versions

Now that we have corrected some of the data imbalance, we train and evaluate the model again.

build.ipynb


# Train
clf.fit(X_train_rus, y_train_rus)

# Predict
y_pred = clf.predict(X_test)

# Display the accuracy
print(accuracy_score(y_test, y_pred))
0.7767958950969214

# Show the confusion matrix
print(confusion_matrix(y_test, y_pred, labels=[1, 0]))
[[ 1111  9259]
 [ 1703 37039]]

# Display the F value
print(f1_score(y_test, y_pred))
0.1685376213592233

The F value is 0.1685376213592233, a considerable improvement.

Standardize explanatory variables

We have two explanatory variables, prize money and jockey name. The jockey-name columns take values of 0 or 1 due to dummy encoding, whereas the prize money feature has the following distribution.

build.ipynb


import matplotlib.pyplot as plt
plt.xlabel('prize')
plt.ylabel('freq')
plt.hist(dataX['This prize'], range=(0, 20000), bins=20)

(figure: histogram of the prize money feature) Since the value ranges differ so greatly, the prize money and jockey-name features probably cannot be compared on an equal footing, so each feature needs to be scaled to a comparable range. One method for this is standardization.
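For reference (my own note, not from the article), standardization rescales each feature x as

z = (x − μ) / σ

where μ and σ are the feature's mean and standard deviation, here computed on the training data only.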

build.ipynb


from sklearn.preprocessing import StandardScaler

# Standardize the explanatory variables (fit on the training data, apply the same scaler to both)
sc = StandardScaler()
X_train_rus_std = pd.DataFrame(sc.fit_transform(X_train_rus), columns=X_train_rus.columns)
X_test_std = pd.DataFrame(sc.transform(X_test), columns=X_test.columns)

(screenshot: DataFrame after standardization) With standardization, the values of all explanatory variables now fall within a comparable range, so we train and evaluate the model once more.

build.ipynb


# Train
clf.fit(X_train_rus_std, y_train_rus)

# Predict
y_pred = clf.predict(X_test_std)

# Display the accuracy
print(accuracy_score(y_test, y_pred))
0.7777732529727969

# Show the confusion matrix
print(confusion_matrix(y_test, y_pred, labels=[1, 0]))
[[ 2510  7860]
 [ 3054 35688]]

# Display the F value
print(f1_score(y_test, y_pred))
0.3150495795155014

The F value became 0.3150495795155014, improving further on the previous run and reaching the 30% level. The precision is 0.45 and the recall is 0.24, which seems a reasonable prediction result for horse racing.
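Incidentally (not in the original article), sklearn can print precision, recall, and the F value together, which saves computing them by hand:

from sklearn.metrics import classification_report

# Per-class precision / recall / F1 in one table
print(classification_report(y_test, y_pred, digits=3))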

Check the weight of the regression coefficient

Finally, let's check the regression coefficients to see which explanatory variables have a strong influence on the horse racing prediction.

build.ipynb


pd.options.display.max_rows = X_train_rus_std.columns.size
print(pd.Series(clf.coef_[0], index=X_train_rus_std.columns).sort_values())

Jockey name_Lower principle-0.092015
Jockey name_Seiji Sakai-0.088886
Jockey name_Teruo Eda-0.081689
Jockey name_Hayabusa Mitsuya-0.078886
Jockey name_Toshiya Yamamoto-0.075083
Jockey name_Norifumi Mikamoto-0.073361
Jockey name_Keita Ban-0.072113
Jockey name_Junji Iwabe-0.070202
Jockey name_Bushizawa Tomo-0.069766
Jockey name_Mitsuyuki Miyazaki-0.068009
...(abridgement)
Jockey name_Yasunari Iwata 0.065899
Jockey name_Hironobu Tanabe 0.072882
Jockey name_Moreira 0.073010
Jockey name_Taketoyo 0.084130
Jockey name_Yuichi Fukunaga 0.107660
Jockey name_Yuga Kawada 0.123749
Jockey name_Keita Tosaki 0.127755
Jockey name_M. Dem 0.129514
Jockey name_Lemaire 0.185976
This prize 0.443854

You can see that prize money has the largest positive influence on a predicted top-3 finish, followed by the leading jockeys.

5. Model operation

With the work so far, we have managed to build the model. Next, let's consider actual operation. Horse races are held regularly every week, and I hope to get rich by predicting which horses will finish in the top three of each race.

So, do we need to run the machine learning workflow from the beginning every week? "1. Data acquisition" must be performed each time to obtain the latest race card data, but "2. Data preprocessing" and "3. Model learning" do not have to be redone every time; the model built once can be reused (though periodic model updates are still required). So let's set up that kind of operation.

build.ipynb


import pickle

filename = 'model_sample.pickle'
pickle.dump(clf, open(filename, 'wb'))

In this way, using the pickle library, you can serialize the trained model and save it to a file.

And here is how to restore the saved model.

restore.ipynb


import pickle

filename = 'model_sample.pickle'
clf = pickle.load(open(filename, 'rb'))

# Predict (X_new is a placeholder for the preprocessed explanatory-variable data of the races to be predicted)
y_pred = clf.predict(X_new)

The model can easily be restored and used for future race predictions, enabling efficient operation without repeating data preprocessing and model training each week. One caveat: since this model was trained on standardized features, new race data must be passed through the same fitted StandardScaler (and have the same dummy columns) before predicting.
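A minimal sketch of that, under my own naming assumptions (scaler_sample.pickle, and df_new as a hypothetical DataFrame of new races), might be:

import pickle

# Save the fitted scaler alongside the model
with open('scaler_sample.pickle', 'wb') as f:
    pickle.dump(sc, f)

# At prediction time: align the new data's columns with the training layout,
# standardize with the saved scaler, then predict
with open('scaler_sample.pickle', 'rb') as f:
    sc = pickle.load(f)
X_new = df_new.reindex(columns=X_train_rus.columns, fill_value=0)
y_pred = clf.predict(sc.transform(X_new))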

In closing

With the above, we were able to carry out the series of steps from environment construction to model building. Coming from a beginner, the explanation may be lacking, but I hope it is helpful to people in similar circumstances.

Next time, I would like to try another algorithm and build a mechanism that goes one step further, comparing it with the model created this time not only on prediction accuracy but also on the actual betting balance.
