[8th] Let's predict horse racing with Python ~ Review so far ~

Review of horse racing predictions

This is a horse racing prediction project in Python that I started coding on a whim. Right now I'm predicting horse races, but the horses offered by one-share ownership clubs are ..., and stocks and FX are ** ... I have a lot I want to try.

Here I write up a memorandum of the review so far. I also try fixes for the problems I found while reworking the code, and show the before and after.

I've posted code here and there, but I'm an amateur, so please bear with me. If something should be fixed, please let me know.

Development environment

Windows 10, Jupyter Notebook

Data acquisition

Using the race search in TARGET frontier JV, I search all race data from 2000 onward. I also load the horse data and previous-run data, and export everything as CSV for use.

Data preprocessing

-- Minor corrections
  -- Remove the weight-reduction mark
  -- Convert the margin '----' and the finishing order '0' of horses excluded from competition to an error value
  -- Convert missing dividend values to 0
  -- Convert missing pedigree to 'unknown'
-- Categorical variables
  -- Turn category, turf / dirt, track condition, sex, sire line name, broodmare-sire line name, etc. into dummy variables
-- Deletion of race / horse data outside the prediction target
  -- Delete the race if it is an obstacle race
  -- Delete data for newcomer races and first-time runners (horses without previous race data)
-- Join the past 5 races to each row
-- Objective variable: add a column indicating whether the finishing order is within 3rd
-- Split training data and validation data in chronological order, with the most recent 30% as validation data
-- Imbalanced data correction: undersampling to a False : True ratio of 2 : 1
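The last three steps of the list above (objective variable, chronological split, undersampling) could be sketched roughly as follows. This is only a minimal sketch, not the code I actually ran: the column names `Date` and `Order of arrival`, the function names, and the random seed are all assumptions made for illustration.

```python
import pandas as pd


def add_target(df: pd.DataFrame) -> pd.DataFrame:
    """Add a boolean target column: True if the horse finished within 3rd."""
    out = df.copy()
    # 'Order of arrival' is a hypothetical column name
    out["target"] = out["Order of arrival"].between(1, 3)
    return out


def chronological_split(df, date_col="Date", valid_ratio=0.3):
    """Use the most recent `valid_ratio` of the rows as validation data."""
    ordered = df.sort_values(date_col, kind="stable").reset_index(drop=True)
    n_valid = round(len(ordered) * valid_ratio)
    n_train = len(ordered) - n_valid
    return ordered.iloc[:n_train], ordered.iloc[n_train:]


def undersample(train, target_col="target", false_per_true=2, seed=0):
    """Randomly drop False rows until False : True is at most 2 : 1."""
    pos = train[train[target_col]]
    neg = train[~train[target_col]]
    n_neg = min(len(neg), false_per_true * len(pos))
    neg = neg.sample(n=n_neg, random_state=seed)
    return pd.concat([pos, neg]).sort_index()
```

Only the training part is undersampled here; the validation part keeps its natural ratio so that the evaluation reflects real races.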

Model learning / evaluation

I use LightGBM. The training and evaluation results are as follows.

Out


train  roc_auc_score = 0.7891163706003093
eval  roc_auc_score = 0.7758619662995246


Model operation

Using the trained model, I calculated the recovery rate (total payout divided by total stake) while varying the threshold on the predicted probability; the result is shown in the graph below. The simulation stops once the number of purchases falls below 1/3 of the number of target races, so thresholds with extremely few purchases are excluded. Place bets look promising. (Graph: image.png)
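The threshold sweep described above can be sketched like this. It is a rough sketch under assumptions, not my actual notebook code: the function name, the payout column passed in (payout per 100 yen stake, 0 for a losing ticket), and the fixed 100 yen stake are hypothetical. Thresholds are expected in ascending order.

```python
import pandas as pd


def recovery_by_threshold(table, payout_col, thresholds,
                          race_col="Race ID(new/No horse number)",
                          proba_col="Proba", stake=100):
    """Recovery rate (%) per threshold, buying every horse with Proba >= threshold."""
    n_races = table[race_col].nunique()
    rows = []
    for th in thresholds:
        bought = table[table[proba_col] >= th]
        # Stop once purchases drop below 1/3 of the number of races
        if len(bought) < n_races / 3:
            break
        rate = bought[payout_col].sum() / (stake * len(bought)) * 100
        rows.append({"threshold": th,
                     "purchases": len(bought),
                     "recovery_rate": rate})
    return pd.DataFrame(rows)
```

A recovery rate above 100 means the tickets paid back more than they cost; calling the function once with a win payout column and once with a place payout column gives the two curves.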

Issues

-- When I scrutinized the training data, it turned out that horses excluded from competition were also being predicted. I don't know how much harm this does, but it seems better to fix it.
-- Because horses that have not run before are deleted, it turned out that predictions are made even for races in which none of the top 3 finishers remain in the data. Depending on the proportion and number of such horses in each race, it seems better to exclude those races from prediction.
-- This is the biggest problem: the current model trains on and predicts maiden-race horses and G1 horses on the same footing, and the recovery rate is simulated from those results. Although unconfirmed, I suspect that in low-class races few horses receive a high predicted probability, while in high-class races such as G1 most horses receive a high predicted probability. Should this be handled by modifying the features, or by changing how betting tickets are bought?
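The unconfirmed guess in the last item could be checked with something like the following. This is a sketch only: the `Class` column, the function name, and the 0.5 cutoff are assumptions, and I have not run it on the real data.

```python
import pandas as pd


def high_proba_share_by_class(table, class_col="Class",
                              proba_col="Proba", threshold=0.5):
    """Share of horses with Proba >= threshold, for each race class."""
    is_high = table[proba_col] >= threshold
    # Mean of a boolean Series is the proportion of True values
    return is_high.groupby(table[class_col]).mean().rename("high_proba_share")
```

If the share differs greatly between maiden races and G1, a single global threshold ends up buying mostly high-class races.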

~~I will post this for the time being, and I plan to add the countermeasure results one by one :wave:~~

Removed excluded horses and races with 3 or more non-running horses

Removed horses excluded from competition from the training / validation data

build.jpynd


    df_last = df_last.dropna(subset = ['Previous race ID(new)']) # Drop rows that have no previous race ID

Removed races with 3 or more non-running horses

build.jpynd


# Exclude races in which more than a certain number of horses were dropped
grouped = df_return_table.groupby('Race ID(new/No horse number)')
# Rows remaining in the race minus the declared number of runners (0 if none dropped)
num_horse = lambda x : x.count() - x.mean()
df_return_table['Data reduction'] = grouped['Number of runners'].transform(num_horse)
# Drop races that lost 3 or more horses
df_return_table = df_return_table[~(df_return_table['Data reduction'] <= -3)]
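To see what the `Data reduction` column holds, here is a small toy example using the same column names. The data is made up for illustration: race "A" declared 5 runners but only 2 rows remain, race "B" declared 3 runners and all 3 remain.

```python
import pandas as pd

# Toy data (hypothetical): race "A" lost 3 horses, race "B" lost none
toy = pd.DataFrame({
    "Race ID(new/No horse number)": ["A", "A", "B", "B", "B"],
    "Number of runners": [5, 5, 3, 3, 3],
})

grouped = toy.groupby("Race ID(new/No horse number)")
# Remaining rows minus declared runners, broadcast to every row of the race
toy["Data reduction"] = grouped["Number of runners"].transform(
    lambda x: x.count() - x.mean()
)

# Keep only races that lost fewer than 3 horses
kept = toy[~(toy["Data reduction"] <= -3)]
print(kept["Race ID(new/No horse number)"].unique())  # only race "B" remains
```

Race "A" gets 2 - 5 = -3 and is removed, while race "B" gets 3 - 3 = 0 and is kept.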

Standardized Proba within each race and rescaled it to fit in 0 to 1

Within each race, I standardized Proba using the mean and standard deviation (Proba_std in the table below), then rescaled the value so that it falls within 0 to 1 (Proba_std01 in the table below). In addition, I added a column that ranks Proba within the race (Proba order in the table below).

build.jpynd


# Rank of Proba within each race (1 = highest)
df_return_table['Proba order'] = df_return_table.groupby('Race ID(new/No horse number)')['Proba'].rank(method='dense', ascending=False)

# Standardize Proba within each race (subtract the mean, divide by the standard deviation)
std_scaler = lambda x: (x - x.mean()) / x.std()
df_return_table['Proba_std'] = df_return_table.groupby('Race ID(new/No horse number)')['Proba'].transform(std_scaler)
# Min-max scale Proba_std to 0-1 (over the whole table, not per race)
std01_scaler = lambda x: (x - x.min()) / (x.max() - x.min())
df_return_table['Proba_std01'] = std01_scaler(df_return_table['Proba_std'])

The table used to calculate the recovery rate looks like this: (image: スクリーンショット 2020-12-15 09.17.51.png)

Training / validation results

Races with 2 or more excluded or non-running horses removed; recovery rate simulation method unchanged

The AUC is almost unchanged.

Out


train  roc_auc_score = 0.7891023931796946
eval  roc_auc_score = 0.77596700811738

Recovery rate simulation

As before, I verified by moving the threshold on the Proba value. (Graph: image.png)

Out


WIN_Recovery rate 81.82452193475815
PLACE_Recovery rate 88.57142857142857

For a moment I thought it would improve, but there was almost no change.

Races with 2 or more excluded or non-running horses removed; recovery rate simulated with Proba standardized and normalized within each race

image.png

Out


WIN_max recovery rate 81.76498653852708
PLACE_max recovery rate 85.47403863193337

Races with 2 or more excluded or non-running horses removed; bought the top n horses by Proba in each race

Even when n is varied from 1 to 3, the recovery rate for both win and place bets stays at about 80-81%.
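Buying the top n horses per race can be sketched as below. This is a sketch under assumptions rather than my notebook code: the function name and the payout column (payout per 100 yen stake, 0 for a losing ticket) are hypothetical, and ties are broken with `method="first"` so that exactly n horses are bought per race, unlike the dense ranking used for the `Proba order` column.

```python
import pandas as pd


def top_n_recovery(table, payout_col, n,
                   race_col="Race ID(new/No horse number)",
                   proba_col="Proba", stake=100):
    """Recovery rate (%) when buying the n highest-Proba horses in every race."""
    # Rank within each race; 1 is the horse with the highest Proba
    rank = table.groupby(race_col)[proba_col].rank(method="first", ascending=False)
    bought = table[rank <= n]
    return bought[payout_col].sum() / (stake * len(bought)) * 100
```

Calling it for n = 1, 2, 3 with the win and place payout columns gives the figures to compare against the threshold-based simulation.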

I tried various things, but it didn't improve much.

That's all for this time :wave:
