[PYTHON] I wrote code that exceeds a 100% recovery rate in horse racing prediction using LightGBM (Part 2)

Introduction

This article is a continuation of:

・ I wrote code that exceeds a 100% recovery rate in horse racing prediction using LightGBM (Part 1)

In Part 1, I wrote about the model in a burst of enthusiasm. In Part 2, I report the results of actually predicting future races and, finally, publish the code.

Predictions after building the model

In fact, I have been publishing predictions on note since July. However, I kept improving the prediction code while posting those predictions, and the code released here is only the basic part, so the predictions in the note posts do not necessarily match the predictions produced by the published code.

・ [Horse Racing Forecast] July 25, 2020
・ [Horse Racing Forecast] July 26, 2020
・ [Horse Racing Forecast] August 01, 2020
・ [Horse Racing Forecast] August 08, 2020 (https://note.com/km_takao/n/n9d2acf507e60)
・ [Horse Racing Forecast] August 09, 2020
・ [Horse Racing Forecast] August 15, 2020
・ [Horse Racing Forecast] August 22, 2020

(No predictions were made for August 2, 16, and 23, 2020 because of other commitments.)

The recovery rates when these predictions are bet as place (fukusho, "double win") bets are as follows. Stakes follow Mr. Ushi's method explained in Part 1: stake = total budget × 0.01 / odds 30 minutes before the race, with the total budget set to 100,000 yen.

| Race date | Total amount bet | Payout | Recovery rate |
| --- | --- | --- | --- |
| July 25, 2020 | 7,500 yen | 9,440 yen | 125% |
| July 26, 2020 | 6,700 yen | 7,350 yen | 109% |
| August 01, 2020 | 10,100 yen | 10,110 yen | 100% |
| August 08, 2020 | 23,700 yen | 23,200 yen | 98% |
| August 09, 2020 | 14,900 yen | 15,210 yen | 102% |
| August 15, 2020 | 23,200 yen | 26,260 yen | 113% |
| August 22, 2020 | 31,000 yen | 30,540 yen | 99% |
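
As a small aside, the recovery rate is simply payout divided by total stake. The sketch below recomputes it over the whole period from the table above; the function name and the tuple layout are my own choices for illustration, not part of the published notebook.

```python
# (race date, total stake in yen, payout in yen), copied from the table above.
results = [
    ("2020-07-25", 7_500, 9_440),
    ("2020-07-26", 6_700, 7_350),
    ("2020-08-01", 10_100, 10_110),
    ("2020-08-08", 23_700, 23_200),
    ("2020-08-09", 14_900, 15_210),
    ("2020-08-15", 23_200, 26_260),
    ("2020-08-22", 31_000, 30_540),
]


def recovery_rate(total_stake: int, payout: int) -> float:
    """Return payout / stake; a value above 1.0 means more came back than was bet."""
    if total_stake <= 0:
        raise ValueError("total_stake must be positive")
    return payout / total_stake


# Aggregate over all race days instead of averaging the daily percentages,
# so that days with larger stakes carry proportionally more weight.
total_stake = sum(stake for _, stake, _ in results)
total_payout = sum(payout for _, _, payout in results)
overall = recovery_rate(total_stake, total_payout)

print(f"{total_stake:,} yen bet, {total_payout:,} yen returned, {overall:.1%} overall")
```

Summing stakes and payouts first matters here because the stake grows a lot over the period, so a plain average of the daily rates would overstate the early, small-stake days.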

The improvements let me place more bets, but the recovery rate got worse. (As a side note, racing has been held at three racecourses since the 15th.) I am still examining whether this was simply an unlucky run of races or whether further improvement is needed.

Similarly, the recovery rates for win bets are as follows.

| Race date | Total amount bet | Payout | Recovery rate |
| --- | --- | --- | --- |
| July 25, 2020 | 2,800 yen | 4,390 yen | 156% |
| July 26, 2020 | 1,900 yen | 1,580 yen | 83% |
| August 01, 2020 | 4,700 yen | 4,410 yen | 93% |
| August 08, 2020 | 9,800 yen | 7,600 yen | 78% |
| August 09, 2020 | 4,500 yen | 3,380 yen | 75% |
| August 15, 2020 | 9,600 yen | 15,060 yen | 157% |
| August 22, 2020 | 12,700 yen | 13,900 yen | 109% |

The number of bets increases here as well, but on some days the recovery rate drops. For comparison, here are the win-bet results when every bet is a flat 100 yen regardless of budget, instead of using Mr. Ushi's staking method.

| Race date | Total amount bet | Payout | Recovery rate |
| --- | --- | --- | --- |
| July 25, 2020 | 1,600 yen | 1,850 yen | 115% |
| July 26, 2020 | 1,100 yen | 1,080 yen | 98% |
| August 01, 2020 | 1,600 yen | 3,500 yen | 218% |
| August 08, 2020 | 3,900 yen | 11,610 yen | 297% |
| August 09, 2020 | 2,400 yen | 8,190 yen | 341% |
| August 15, 2020 | 3,700 yen | 4,530 yen | 122% |
| August 22, 2020 | 4,800 yen | 5,980 yen | 125% |

In other words, the model can predict wins by longshots (anauma), but under Mr. Ushi's staking method the highest odds you can bet on are capped by the size of the budget, so longshots cannot be bet on. As a result, only popular horses with low odds get bets, which appears to be one factor lowering the recovery rate. Conversely, in races without an upset, weighting the stakes as in Mr. Ushi's method helps raise the recovery rate. With a budget of 100,000 yen, the cap has little effect when odds are low (at most around 5x), as with place ("double win") bets. For win bets, however, once the odds exceed about 10x the calculated stake falls below the 100 yen minimum, so win bets are hit especially hard. The constant in the stake formula (0.01 here), your own budget, and the model's predicted values all need to be weighed together.
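
To make the effect of the budget concrete, here is a minimal sketch of the stake calculation (total budget × 0.01 / odds). Rounding down to 100-yen units and returning 0 when the result is below 100 yen are my assumptions for illustration; the function names are hypothetical and the published notebook may handle this differently.

```python
def stake_for(budget: int, odds: float, rate: float = 0.01, unit: int = 100) -> int:
    """Stake in yen: budget * rate / odds, rounded down to a multiple of `unit`.

    Returns 0 when the raw stake is below one unit, i.e. the horse cannot be bet on.
    Rounding down to 100-yen units is an assumption, not taken from the article.
    """
    if odds <= 0:
        raise ValueError("odds must be positive")
    raw = budget * rate / odds
    return int(raw // unit) * unit


def max_bettable_odds(budget: int, rate: float = 0.01, unit: int = 100) -> float:
    """Highest odds at which the raw stake still reaches one betting unit."""
    return budget * rate / unit


budget = 100_000
print(stake_for(budget, 2.5))     # 400: a short-priced favourite gets a large stake
print(stake_for(budget, 10.0))    # 100: exactly the minimum bet
print(stake_for(budget, 15.0))    # 0: the longshot drops out entirely
print(max_bettable_odds(budget))  # 10.0
```

Under these assumptions, raising the budget or the constant raises the odds ceiling in proportion, which is why the three quantities have to be chosen together.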

Publishing the code

I am publishing the code on note. Detailed explanations are given in the note post and in the comments in the notebook. Here I only outline the overall flow.

Scraping past results from the database

Past race results, odds, and other data are scraped from netkeiba's database to build the model. As I wrote in Part 1, the scraping is based on "How to scrape horse racing data using pandas read_html". The scraped features include information on each runner, such as finishing position and jockey name; information on the race itself, such as distance, track condition, and weather; and each horse's odds before the start of the race.
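
For readers who have not used it, the core of the read_html approach looks roughly like the sketch below. To keep it runnable offline it parses a made-up HTML snippet instead of a real netkeiba page; the column names, the race ID, and the function name are placeholders, and pandas needs an HTML parser such as lxml installed for read_html to work.

```python
from io import StringIO

import pandas as pd

# Made-up HTML standing in for a race result page.
# Horse names, jockeys, columns, and values are placeholders, not real data.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>rank</th><th>horse_name</th><th>jockey</th><th>odds</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Horse A</td><td>Jockey X</td><td>3.2</td></tr>
    <tr><td>2</td><td>Horse B</td><td>Jockey Y</td><td>12.5</td></tr>
  </tbody>
</table>
"""


def parse_result_table(html: str, race_id: str) -> pd.DataFrame:
    """Parse the first <table> in `html` and tag every row with its race ID."""
    tables = pd.read_html(StringIO(html))  # returns one DataFrame per <table>
    result = tables[0].copy()
    result.insert(0, "race_id", race_id)   # keep the race ID for later joins and sorting
    return result


past_results = parse_result_table(SAMPLE_HTML, race_id="dummy_race_001")
print(past_results)
```

In real use you would fetch each result page, pass its HTML to a function like this, and concatenate the frames, with a pause between requests so as not to load the site.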

[Image: Screenshot 2020-08-21 22.33.09.png]

Feature creation

The code I am publishing is the foundation of code I am still improving. I think accuracy would improve further if you, for example, create your own features or ensemble with other algorithms. Even the published code, of course, derives new features from the scraped ones by aggregating past results.

One example is the aggregation of each horse's past performance. The aggregation must not include results from races that were still in the future at the time of each race, so race_id is sorted in ascending order and only earlier races are aggregated. For example, here are the aggregated results for Almond Eye:

[Image: Screenshot 2020-08-21 22.27.54.png]

As shown, only past data is aggregated and added as new features. (Note that only data from 2018 to 2020 is used here, to check that the published code works.)
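
One way to get this kind of leak-free aggregation in pandas is to sort by race_id, group by horse, and shift by one race before aggregating. The sketch below is only an illustration on made-up rows: the columns horse_id and rank and the function name are hypothetical, and the notebook's actual implementation may differ.

```python
import pandas as pd

# Made-up rows; horse_id and rank are hypothetical column names.
df = pd.DataFrame({
    "race_id": [3, 1, 2, 1, 3],
    "horse_id": ["A", "A", "A", "B", "B"],
    "rank": [1, 3, 2, 5, 4],
})


def add_past_performance(df: pd.DataFrame) -> pd.DataFrame:
    """Add per-horse features computed only from races earlier than each row."""
    # Ascending race_id puts each horse's races in chronological order.
    out = df.sort_values("race_id", kind="stable").reset_index(drop=True)
    # Number of earlier starts: 0 for a horse's first race in the data.
    out["past_runs"] = out.groupby("horse_id").cumcount()
    # shift(1) drops the current race, so the expanding mean sees only the past.
    out["past_mean_rank"] = out.groupby("horse_id")["rank"].transform(
        lambda s: s.shift(1).expanding().mean()
    )
    return out


features = add_past_performance(df)
print(features)
```

A horse's first race gets NaN for past_mean_rank, which LightGBM can handle as a missing value without extra imputation.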

Modeling

As the title suggests, the model is built with LightGBM. The parameters are tuned automatically with Optuna. I think accuracy can be improved further by adding ensembling and cross-validation at this stage.

Scraping pre-race information and displaying predictions

The pre-race information fed to the model for prediction is scraped not from the netkeiba database but from the race information page (URL ending in /top/?rf=navi). The basic code is almost the same as for scraping past results.

Predicted values are displayed for each race. A column also shows the bet amount calculated in the same way as Mr. 卍's method, using the odds at the time of scraping.

For example, displaying Niigata 4R on Saturday, August 8, 2020 gives:

[Image: Screenshot 2020-08-26 9.07.58.png]

In this case, you would bet on horses whose predict value exceeds some threshold, or on the three horses with the largest predict values.
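
Both selection rules are one-liners in pandas. The sketch below uses made-up predict values for a single race; horse_number, the threshold of 0.3, and the function names are placeholders I chose for illustration.

```python
import pandas as pd

# Made-up model output for one race; horse_number is a hypothetical column name.
race = pd.DataFrame({
    "horse_number": [1, 2, 3, 4, 5, 6],
    "predict": [0.12, 0.45, 0.08, 0.31, 0.27, 0.05],
})


def select_by_threshold(race: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Horses whose predict value exceeds the threshold, strongest first."""
    picked = race[race["predict"] > threshold]
    return picked.sort_values("predict", ascending=False)


def select_top_n(race: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """The n horses with the largest predict values."""
    return race.nlargest(n, "predict")


by_threshold = select_by_threshold(race, threshold=0.3)
top_three = select_top_n(race, n=3)
print(by_threshold["horse_number"].tolist())  # [2, 4]
print(top_three["horse_number"].tolist())     # [2, 4, 5]
```

The threshold rule can skip a race entirely when no horse stands out, while the top-3 rule always bets, so the two behave quite differently over a full race day.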

Full code

The full code is available on note, together with more detailed explanations.
