[PYTHON] A story about achieving a horse racing recovery rate of over 100% through machine learning

Introduction

Do you like horse racing? I'm a beginner who only started this year, but collecting information, making predictions, and getting them right is a lot of fun!

At first just predicting was fun, but before long the feeling of **"I don't want to lose money"** took over.

So while browsing the web for an easy way to win, predicting horse racing with machine learning caught my eye as interesting, and I decided to study up and give it a try.

Target

The payout rate of horse racing is said to be about 70 to 80%, so if you bet naively, your recovery rate should converge to around that level.

So, for the time being, I'll aim for a **recovery rate of 100% or more** using only data available before the race!

Deciding the setup

There are many ways to approach horse racing prediction, such as simply predicting the finishing order or optimizing the betting strategy around the odds. There are also many types of bets to choose from.

This time, I'll divide the finishing positions into three groups (top 3, middle, and bottom) and perform **multi-class classification**.

I'll then buy a **win** bet on the horse the model ranks first. The reason is that the payout rate for win bets is set higher than for bet types like the trifecta that can produce big payouts. (Reference: Buena's Horse Racing Blog - Knowledge for not losing with betting tickets)

Also, I won't use popularity or odds as features. They aren't finalized until just before the race, and I thought it wouldn't be interesting to simply buy the popular horses anyway. (Horse weight, which is announced about 50 minutes before the start, is used as a feature.)

This time I'll focus on races at Tokyo Racecourse and proceed as follows. I narrowed it down to one racetrack because scraping the race data takes a long time, thanks to my inefficient code and a time.sleep of 1 second between requests. (Collecting the data from 2008 to 2019 took about 50 hours...)

It's a hassle, but given enough time you could collect data for all racetracks.

Procedure

  1. Collect race data by scraping this site (netkeiba.com).
  2. Preprocess the data.
  3. Train a model with LightGBM.
  4. Use the model to check the recovery rate over one year.

Scraping

I collected the data by scraping this site (netkeiba.com). As far as I could tell from robots.txt (there wasn't one, actually) and the terms of use, scraping seemed to be acceptable, so I went ahead while being careful not to overload the server. I referred to the following articles for the scraping method.

- Python 3 Crawling & Scraping
- Scraping with Python and Beautiful Soup
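As a rough sketch of the collection loop (the URL pattern and race_id format below are illustrative assumptions, not netkeiba's actual structure):

```python
import time

import requests
from bs4 import BeautifulSoup

def fetch_race(race_id: str) -> BeautifulSoup:
    """Politely fetch and parse one race page (hypothetical URL pattern)."""
    url = f'https://db.netkeiba.com/race/{race_id}/'
    res = requests.get(url)
    res.encoding = res.apparent_encoding  # let requests guess the page encoding
    time.sleep(1)  # 1-second pause per request to avoid overloading the site
    return BeautifulSoup(res.text, 'html.parser')
```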

The collected data looks like this.

(Screenshot of the collected data: スクリーンショット 2020-09-02 20.06.59.png)

When scraping, I dropped horses that didn't have data for their last three races. If there's no past information, I don't think the future can be predicted.

Also, horses that raced at local or overseas tracks may be missing values such as the time index; those gaps are filled with the column mean.
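Filling with the mean is a one-liner per column in pandas. A sketch (the DataFrame and column names are illustrative, following the suffix convention in the feature tables below):

```python
# Fill indices missing from local/overseas starts with the column mean.
for col in ['p_time_num_01', 'p_time_num_02', 'p_time_num_03']:
    df[col] = df[col].fillna(df[col].mean())
```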

Features

This time, the following items were used as features.

Race-day data

| Variable name | Contents |
|---|---|
| kai | Which meeting of the year |
| day | Which day of the meeting |
| race_num | Race number |
| field | Turf or dirt |
| dist | Distance |
| turn | Which direction the course turns |
| weather | Weather |
| field_cond | Going (track condition) |
| ~~place~~ | ~~Venue~~ |
| sum_num | Number of runners |
| prize | Prize money |
| horse_num | Horse number |
| sex | Sex |
| age | Age |
| weight_carry | Carried weight |
| horse_weight | Horse weight |
| weight_change | Change in horse weight |
| l_days | Days since the previous race |

Data from the past three races (suffix 01 = previous race, 02 = two races back, 03 = three races back)

| Variable name | Contents |
|---|---|
| p_place | Venue |
| p_weather | Weather |
| p_race_num | Race number |
| p_sum_num | Number of runners |
| p_horse_num | Horse number |
| p_rank | Finishing position |
| p_field | Turf or dirt |
| p_dist | Distance |
| p_condi | Going (track condition) |
| p_condi_num | Track condition index |
| p_time_num | Time index |

Preprocessing

For preprocessing, I just converted the race times to seconds and label-encoded the categorical variables. Below is the code that label-encodes the weather, as an example.

encode.py


# Map each weather string to a fixed integer so the encoding is
# identical across all data files.
num = df1.columns.get_loc('weather')
for i in range(df1['weather'].size):
    copy = df1.iat[i, num]
    if copy == 'Fine':
        copy = '6'
    elif copy == 'rain':
        copy = '1'
    elif copy == 'light rain':
        copy = '2'
    elif copy == 'Koyuki':
        copy = '3'
    elif copy == 'Cloudy':
        copy = '4'
    elif copy == 'snow':
        copy = '5'
    else:
        copy = '0'
    df1.iat[i, num] = int(copy)

df1['weather'] = df1['weather'].astype('int64')

Label-encode each categorical variable in this way.

I considered scikit-learn's LabelEncoder, which would have been easier, but I didn't use it because it couldn't guarantee that the same category would map to the same number across multiple data files.
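For reference, the same guarantee can be had more compactly by applying a fixed dictionary with pandas `.map` (a sketch using the mapping from the code above; unseen values fall back to 0 as before):

```python
weather_map = {'Fine': 6, 'rain': 1, 'light rain': 2,
               'Koyuki': 3, 'Cloudy': 4, 'snow': 5}
df1['weather'] = df1['weather'].map(weather_map).fillna(0).astype('int64')
```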

Also, LightGBM, the machine learning framework used this time, uses decision trees as its weak learners, so there is no need to standardize the features. (Reference: Introduction to LightGBM)

Predictive model

I build the model with LightGBM, a gradient boosting framework. I chose it because it's fast and (arguably) the strongest approach outside of deep learning.

As for the prediction itself, I framed it as multi-class classification into one of three groups: the top 3, the middle of the field excluding the top 3, and the bottom. For example, with 15 runners, 1st to 3rd place is group 0, 4th to 8th is group 1, and 9th to 15th is group 2.
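As a minimal sketch, the labeling might look like the function below. The boundary between groups 1 and 2 is my assumption, chosen only to match the 15-horse example (the exact cutoff used isn't shown):

```python
import math

def rank_group(rank: int, n_horses: int) -> int:
    """0 = top 3, 1 = middle group, 2 = bottom group (sketch)."""
    if rank <= 3:
        return 0
    # With 15 runners this yields 4th-8th -> 1 and 9th-15th -> 2, as in the text.
    if rank <= math.ceil(n_horses / 2):
        return 1
    return 2
```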

I referred to the following site for usage. (Reference: [[For beginners] LightGBM (multi-class classification) [Python] [Machine learning]](https://mathmatical22.xyz/2020/04/11/%E3%80%90%E5%88%9D%E5%AD%A6%E8%80%85%E5%90%91%E3%81%91%E3%80%91lightgbm-%E5%9F%BA%E6%9C%AC%E7%9A%84%E3%81%AA%E4%BD%BF%E3%81%84%E6%96%B9-%E5%A4%9A%E3%82%AF%E3%83%A9%E3%82%B9%E5%88%86%E9%A1%9E%E7%B7%A8/))

The training/validation data and test data are as follows.

| Training / validation data | Test data |
|---|---|
| Tokyo_2008_2018 | Tokyo_2019 |

The training data is further split into a training set and a model-evaluation set with train_test_split. I didn't do any particular parameter tuning.

train.py


train_set, test_set = train_test_split(keiba_data, test_size=0.2, random_state=0)

#Explain training data Variable data(X_train)And objective variable data(y_train)Divided into
X_train = train_set.drop('rank', axis=1)
y_train = train_set['rank']

#Explanatory variable data for model evaluation data(X_test)And objective variable data(y_test)Divided into
X_test = test_set.drop('rank', axis=1)
y_test = test_set['rank']

#Set the data used for learning
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective': 'multiclassova',
        'num_class': 3,
        'metric': {'multi_error'},
}

model = lgb.train(params,
        train_set=lgb_train, #Designation of training data
        valid_sets=lgb_eval, #Specifying validation data
        verbose_eval=10
)


Running the model

I actually ran it.

![model_pic](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/666677/86df2b27-05bc-cc96-fc27-cc34733ac103.png)

The accuracy is about 54%, so it gets more than half of them right. This value didn't change much no matter how I tweaked the parameters, so I left them as they are.
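For reference, that figure can be reproduced like this (a sketch; `model`, `X_test`, and `y_test` come from the training code above):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# The booster outputs one probability per class; take the most likely class.
y_pred = np.argmax(model.predict(X_test), axis=1)
print(accuracy_score(y_test, y_pred))  # about 0.54 here
```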

Verification

Here are the results of a year-long verification on the 2019 races at Tokyo Racecourse.

Here, the conditions are:

- **Buy one win ticket for 100 yen per race. (Hit: odds × 100 − 100; miss: −100)**
- **Do not bet on a race if fewer than half of its runners still have data. (±0)**

The reason for the second condition is to exclude races that will almost certainly miss, such as two-year-old races where only one horse has data from three or more past races.
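As a sketch, the bookkeeping per race under these conditions might look like this (the function and variable names are mine, not from the original code):

```python
def race_profit(n_with_data: int, n_horses: int, hit: bool, win_odds: float) -> float:
    """Profit in yen for one race under the two conditions above (sketch)."""
    if n_with_data < n_horses / 2:
        return 0.0                   # condition 2: sit the race out (+-0)
    if hit:
        return win_odds * 100 - 100  # a win pays odds x 100 on a 100-yen stake
    return -100.0                    # a miss loses the 100-yen stake
```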

Below is the resulting graph.

(Figure: win-bet results for 2019, tansho.png)


**:relaxed: It feels good :relaxed:**

To be honest, I didn't think the recovery rate would exceed 100% this easily.

The hit rate is about 26%, which is pretty decent.

Because of the second condition there were about 100 races I couldn't bet on, but having still taken part in about 80% of the races, I have no complaints about this recovery rate.


Since the results are promising, I'd like to verify the other bet types as well. Note that only the trifecta assumes buying a 3-horse box (6 combinations).

![fukusho.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/666677/0a27b45c-5aa9-2c7b-844a-410e7d077f0c.png)
![umaren00.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/666677/6c12ed8d-dd26-c9dc-d707-0d1f3cc4d877.png)
![umatan00.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/666677/7a6a1813-35ba-876b-d776-1e8fe57ab257.png)
![renpuku00.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/666677/613b1601-51e4-6b1c-534a-36dc666aa98a.png)
![3rentan_0.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/666677/7f5151f8-4e9a-6e9c-c912-3dfca7635443.png)

Pretty good...! The exacta recovery rate is close to 200%, a wonderful result. That said, bet types that trade a low hit rate for large payouts have large swings in recovery rate, so I'll treat these as reference values only.

One complaint is that the place bet's recovery rate is below 100%, even though the model is trained exactly on whether a horse finishes in the top 3. I'd like to do something about that.

Let's add more conditions

So far, the only condition for buying a ticket has been the number of horses with data, but in practice you would decide based on expected value rather than betting on every race.
So I'd like to add the following new condition.

**・Buy only when the gap between the highest and second-highest predicted values for group 0 is 0.1 or more.**

In other words, buy in a case like this (スクリーンショット 2020-09-02 21.36.00.png) and skip a case like this (スクリーンショット 2020-09-02 21.36.34.png).

The reasons for this condition are:

  1. When there are many strong horses, or when the field is small, the predicted values for finishing in the top 3 all tend to be high, which makes them hard to judge on their own.
  2. If there is a clear gap between the predicted values, the top horse is expected to be considerably stronger than the rest of that race.
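A minimal sketch of this buy rule, assuming `model` is the booster trained above and `race_X` holds the feature rows of a single race:

```python
import numpy as np

probs = model.predict(race_X)     # shape (n_horses, 3): one score per class
top3 = probs[:, 0]                # predicted values for group 0 (top-3 finish)

order = np.argsort(top3)[::-1]    # horses sorted by group-0 prediction, descending
gap = top3[order[0]] - top3[order[1]]

if gap >= 0.1:
    bet_index = order[0]          # buy a win ticket on the standout horse
```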

Below are the results of verification under these conditions.

(Figure: win-bet results with the added condition, tansho_1.png)

**:relaxed: It feels really good :relaxed:**

The hit rate improved significantly from 26% to 39%, and the recovery rate from 130% to 168%. The number of target races dropped by about 250, down to roughly 100 races a year, but considering I'm still betting on about a quarter of them, I think this recovery rate is good.
Let me try the other bet types too.

(Figures: fukusho_1.png, umaren_1.png, umatan_1.png, renpuku_1.png, 3rentan_1.png)

Looking good! Notably, the place bet's hit rate now exceeds 70% and its recovery rate exceeds 100%. Given that the first favorite's place hit rate is around 60 to 65% (Reference: Developer Blog | AlphaImpact Co., Ltd.), this seems very good.

About features

Let's also look at the feature importances of the trained model.

(Figure: feature_importance.png)
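The chart can be produced with LightGBM's built-in helper (a sketch using the `model` trained above):

```python
import lightgbm as lgb
import matplotlib.pyplot as plt

# Plot the top 20 features by how often they are used in tree splits.
lgb.plot_importance(model, max_num_features=20, importance_type='split')
plt.tight_layout()
plt.show()
```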

You can see that the time index is treated as a very important feature. Naturally, horses that posted good times in past races have a high probability of winning.

What surprised me was that the number of days since the last race was treated as being as important as the closing time and horse weight. I don't see many horse racing fans whose predictions put much weight on rotation, so this surprised me. It's also interesting because it echoes Almond Eye, who lost after being forced into a two-week turnaround. Well, I can't actually tell whether the index worsens when the rest is short, lol.

In conclusion

These days horse racing AI seems to be steadily gaining momentum, with some sites running it as a service and Dwango holding its cyber award. Against that backdrop, getting to practice horse racing prediction with machine learning was a lot of fun.

However, the future can't be predicted perfectly, so using this model doesn't mean you'll definitely win. It could well end up losing over this year's and next year's races.

I don't think you should expect too much from this sort of thing, but since some people really do make money with horse racing software, there's a dream in it.

Since it was pointed out that this could read as an article soliciting for financial gain, I removed the URL. :bow:
