I started writing Python code for horse racing prediction on a whim. Right now I'm working on predicting horse races, but picking club-ownership horses, stocks, and FX are also on my list of things to try.
Here I'll keep a memo of my past revisions, along with the problems I found while reworking the code and their before-and-after results.
I've posted the code here and there, but I'm an amateur, so please bear with me. If something should be fixed, please let me know.
Environment: Windows 10, Jupyter Notebook
Using the race search in TARGET frontier JV, I pull all race data from 2000 onward. I also read in horse data and previous-run data, and export everything as CSV for later use.
- Minor corrections
  - Remove the weight-allowance (apprentice) mark
  - Convert the margin and finishing position of horses excluded from the race ('----') to 0 as an error value
  - Convert missing dividend values to 0
  - Convert missing pedigree entries to 'unknown'
- Categorical variables
  - Turn race class, turf/dirt, track condition, sex, sire type name, broodmare-sire type name, etc. into dummy variables
- Deletion of unusable race / horse data
  - Delete steeplechase (obstacle) races from the prediction targets
  - Delete newcomer races and first-time runners (horses with no previous-race data)
- Join the past 5 runs for each horse
- Objective variable: add a column indicating whether the finishing position is 3rd or better
- Split training and validation data chronologically, using the most recent 30% as validation
- Imbalance correction: undersample so that False and True rows are in a 2:1 ratio
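The dummy-variable and undersampling steps above can be sketched like this. The column names and toy values are hypothetical stand-ins for the actual TARGET frontier JV export, not the real schema:

```python
import pandas as pd

# Toy frame standing in for the exported race CSV (hypothetical columns)
df = pd.DataFrame({
    'surface': ['turf', 'dirt', 'turf', 'dirt', 'turf', 'turf', 'dirt', 'turf'],
    'sex':     ['M', 'F', 'M', 'M', 'F', 'F', 'M', 'F'],
    'finish_top3': [True, False, False, True, False, False, False, False],
})

# Categorical columns -> dummy (one-hot) variables
df = pd.get_dummies(df, columns=['surface', 'sex'])

# Undersampling: keep all True rows, sample False rows down to a 2:1 ratio
true_rows = df[df['finish_top3']]
false_rows = df[~df['finish_top3']].sample(n=2 * len(true_rows), random_state=0)
df_balanced = pd.concat([true_rows, false_rows]).sort_index()
```

`sort_index()` at the end keeps the rows in their original (chronological) order after sampling.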
I use LightGBM. The training and evaluation results are as follows.
Out
train roc_auc_score = 0.7891163706003093
eval roc_auc_score = 0.7758619662995246
Based on the trained model, I calculated the recovery rate while varying the threshold, as shown in the graph below. The simulation is stopped when the number of purchases falls below 1/3 of the number of target races, so data points with extremely few purchases are excluded. The place (double win) bets look promising.
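The threshold sweep can be sketched as below. The `Proba` and `payout` columns and the toy numbers are assumptions for illustration; `payout` stands for the dividend per 100-yen ticket, 0 for a losing horse:

```python
import numpy as np
import pandas as pd

# Toy prediction table (hypothetical values)
df = pd.DataFrame({
    'Proba':  [0.9, 0.7, 0.55, 0.4, 0.3, 0.2],
    'payout': [150, 0,   320,  0,   0,   0],
})

def recovery_rate(df, threshold):
    """Buy every horse whose Proba is at or above the threshold;
    return total payout / total stake as a percentage (None if no bets)."""
    bets = df[df['Proba'] >= threshold]
    if bets.empty:
        return None
    return bets['payout'].sum() / (100 * len(bets)) * 100

# Sweep the threshold and collect the recovery rate at each point
rates = {round(th, 1): recovery_rate(df, th) for th in np.arange(0.2, 0.9, 0.1)}
```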
- On closer inspection of the training data, it turned out that horses excluded from the race are also being predicted. I don't know how much harm this does, but it seems better to fix it.
- Because horses with no previous runs were deleted, predictions are being made even for races where none of the top-3 finishers remain in the data. Depending on the number and proportion of such horses in each race, it seems better to exclude those races from prediction.
- The biggest problem: the current model trains on and predicts maiden fields and G1 horses on the same footing, and the recovery-rate simulation is based on those results. Although unverified, I suspect that in low-class races few horses get high predicted probabilities, while in high-class races such as G1 nearly every horse does. Can this be handled by reworking the features, or by changing which betting tickets are bought?
~~I'll post this as-is for now, and plan to add the results of each countermeasure one by one :wave:~~
build.ipynb
df_last = df_last.dropna(subset=['Previous race ID(new)'])  # drop rows that have no previous race ID
Remove races with 3 or more horses that have no previous runs
build.ipynb
# Exclude races where more than a certain number of horses had never run
grouped = df_return_table.groupby('Race ID(new/No horse number)')
# 'Number of runners' is constant within a race, so mean() gives the original
# field size and count() gives the rows remaining after deletion; their
# difference is the number of horses dropped from that race (negative)
num_horse = lambda x: x.count() - x.mean()
df_return_table['Data reduction'] = grouped['Number of runners'].transform(num_horse)
# Drop races that lost 3 or more horses
df_return_table = df_return_table[~(df_return_table['Data reduction'] <= -3)]
Within each race, I standardized Proba (Proba_std in the table below) and then rescaled the values into the 0-1 range (Proba_std01 in the table below). I also added a column ranking Proba within the race (Proba order in the table below).
build.ipynb
# Rank Proba within each race (1 = highest)
df_return_table['Proba order'] = df_return_table.groupby('Race ID(new/No horse number)').rank('dense', ascending=False)['Proba']
# Standardize Proba within each race
std_scaler = lambda x: (x - x.mean()) / x.std()
df_return_table['Proba_std'] = df_return_table.groupby('Race ID(new/No horse number)')['Proba'].transform(std_scaler)
# Rescale the standardized values so they fall between 0 and 1
std01_scaler = lambda x: (x - x.min()) / (x.max() - x.min())
df_return_table['Proba_std01'] = df_return_table['Proba_std'].transform(std01_scaler)
The table used to calculate the recovery rate looks like this:
The AUC is almost unchanged.
Out
train roc_auc_score = 0.7891023931796946
eval roc_auc_score = 0.77596700811738
As before, I verify by sweeping the threshold over the Proba value.
Out
WIN_Recovery rate 81.82452193475815
PLACE_Recovery rate 88.57142857142857
For a moment I thought it had improved, but there was almost no change.
Out
WIN_max recovery rate 81.76498653852708
PLACE_max recovery rate 85.47403863193337
Even when n is varied from 1 to 3, the recovery rate for win and place bets stays around 80-81%.
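A minimal sketch of the top-n purchase rule, using the `Proba order` rank computed earlier. The `payout_win` column and the toy numbers here are hypothetical:

```python
import pandas as pd

# Toy two-race table: 'Proba order' is the in-race rank of the model
# probability (1 = highest); 'payout_win' is a hypothetical win dividend
# per 100-yen ticket, 0 for losing horses
df = pd.DataFrame({
    'race':        [1, 1, 1, 2, 2, 2],
    'Proba order': [1, 2, 3, 1, 2, 3],
    'payout_win':  [240, 0, 0, 0, 510, 0],
})

def topn_recovery(df, n):
    """Recovery rate (%) when betting the top-n ranked horses in every race."""
    bets = df[df['Proba order'] <= n]
    return bets['payout_win'].sum() / (100 * len(bets)) * 100
```

With these toy numbers, n=1 buys 2 tickets and hits one (240 yen back on a 200-yen stake), while n=2 buys 4 and hits two.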
I tried various things, but it didn't improve much.
That's all for this time :wave: