I started writing Python code for horse racing prediction on a whim. Right now I'm working on predicting horse races, but picking club-ownership horses, stocks, and FX are also on my list of things to try.
Here I'll keep a memo of my past revisions, along with the problems I found while reworking the code and their before-and-after results.
I've posted the code here and there, but I'm an amateur, so please bear with me. If something should be fixed, please let me know.
Environment: Windows 10, Jupyter Notebook
Using the race search in TARGET frontier JV, I pull all race data from 2000 onward. I also read in horse data and previous-run data, and export everything as CSV for later use.
- Minor corrections
  - Remove the weight-allowance (apprentice) mark
  - Convert the margin and finishing position of horses excluded from the race ('----') to 0 as an error value
  - Convert missing dividend values to 0
  - Convert missing pedigree entries to 'unknown'
- Categorical variables
  - Turn race class, turf/dirt, track condition, sex, sire type name, broodmare-sire type name, etc. into dummy variables
- Deletion of unusable race / horse data
  - Delete steeplechase (obstacle) races from the prediction targets
  - Delete newcomer races and first-time runners (horses with no previous-race data)
- Join the past 5 runs for each horse
- Objective variable: add a column indicating whether the finishing position is 3rd or better
- Split training and validation data chronologically, using the most recent 30% as validation
- Imbalance correction: undersample so that False and True rows are in a 2:1 ratio
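The dummy-variable and undersampling steps above can be sketched like this. The column names and toy values are hypothetical stand-ins for the actual TARGET frontier JV export, not the real schema:

```python
import pandas as pd

# Toy frame standing in for the exported race CSV (hypothetical columns)
df = pd.DataFrame({
    'surface': ['turf', 'dirt', 'turf', 'dirt', 'turf', 'turf', 'dirt', 'turf'],
    'sex':     ['M', 'F', 'M', 'M', 'F', 'F', 'M', 'F'],
    'finish_top3': [True, False, False, True, False, False, False, False],
})

# Categorical columns -> dummy (one-hot) variables
df = pd.get_dummies(df, columns=['surface', 'sex'])

# Undersampling: keep all True rows, sample False rows down to a 2:1 ratio
true_rows = df[df['finish_top3']]
false_rows = df[~df['finish_top3']].sample(n=2 * len(true_rows), random_state=0)
df_balanced = pd.concat([true_rows, false_rows]).sort_index()
```

`sort_index()` at the end keeps the rows in their original (chronological) order after sampling.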
I use LightGBM. The training and evaluation results are as follows.
Out
train roc_auc_score = 0.7891163706003093
eval roc_auc_score = 0.7758619662995246
Based on the trained model, I calculated the recovery rate while varying the threshold, as shown in the graph below. The simulation is stopped when the number of purchases falls below 1/3 of the number of target races, so data points with extremely few purchases are excluded. The place (double win) bets look promising.
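The threshold sweep can be sketched as below. The `Proba` and `payout` columns and the toy numbers are assumptions for illustration; `payout` stands for the dividend per 100-yen ticket, 0 for a losing horse:

```python
import numpy as np
import pandas as pd

# Toy prediction table (hypothetical values)
df = pd.DataFrame({
    'Proba':  [0.9, 0.7, 0.55, 0.4, 0.3, 0.2],
    'payout': [150, 0,   320,  0,   0,   0],
})

def recovery_rate(df, threshold):
    """Buy every horse whose Proba is at or above the threshold;
    return total payout / total stake as a percentage (None if no bets)."""
    bets = df[df['Proba'] >= threshold]
    if bets.empty:
        return None
    return bets['payout'].sum() / (100 * len(bets)) * 100

# Sweep the threshold and collect the recovery rate at each point
rates = {round(th, 1): recovery_rate(df, th) for th in np.arange(0.2, 0.9, 0.1)}
```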
- On closer inspection of the training data, it turned out that horses excluded from the race are also being predicted. I don't know how much harm this does, but it seems better to fix it.
- Because horses with no previous runs were deleted, predictions are being made even for races where none of the top-3 finishers remain in the data. Depending on the number and proportion of such horses in each race, it seems better to exclude those races from prediction.
- The biggest problem: the current model trains on and predicts maiden fields and G1 horses on the same footing, and the recovery-rate simulation is based on those results. Although unverified, I suspect that in low-class races few horses get high predicted probabilities, while in high-class races such as G1 nearly every horse does. Can this be handled by reworking the features, or by changing which betting tickets are bought?
~~I'll post this as-is for now, and plan to add the results of each countermeasure one by one :wave:~~
build.ipynb
df_last = df_last.dropna(subset=['Previous race ID(new)'])  # drop rows that have no previous race ID
Remove races with 3 or more horses that have no previous runs
build.ipynb
# Exclude races where more than a certain number of horses had never run
grouped = df_return_table.groupby('Race ID(new/No horse number)')
# 'Number of runners' is constant within a race, so mean() gives the original
# field size and count() gives the rows remaining after deletion; their
# difference is the number of horses dropped from that race (negative)
num_horse = lambda x: x.count() - x.mean()
df_return_table['Data reduction'] = grouped['Number of runners'].transform(num_horse)
# Drop races that lost 3 or more horses
df_return_table = df_return_table[~(df_return_table['Data reduction'] <= -3)]
Within each race, I standardized Proba (Proba_std in the table below) and then rescaled the values into the 0-1 range (Proba_std01 in the table below). I also added a column ranking Proba within the race (Proba order in the table below).
build.ipynb
# Rank Proba within each race (1 = highest)
df_return_table['Proba order'] = df_return_table.groupby('Race ID(new/No horse number)').rank('dense', ascending=False)['Proba']
# Standardize Proba within each race
std_scaler = lambda x: (x - x.mean()) / x.std()
df_return_table['Proba_std'] = df_return_table.groupby('Race ID(new/No horse number)')['Proba'].transform(std_scaler)
# Rescale the standardized values so they fall between 0 and 1
std01_scaler = lambda x: (x - x.min()) / (x.max() - x.min())
df_return_table['Proba_std01'] = df_return_table['Proba_std'].transform(std01_scaler)
The table used to calculate the recovery rate looks like this:
The AUC is almost unchanged.
Out
train roc_auc_score = 0.7891023931796946
eval roc_auc_score = 0.77596700811738
As before, I verify by sweeping the threshold over the Proba value.
Out
WIN_Recovery rate 81.82452193475815
PLACE_Recovery rate 88.57142857142857
For a moment I thought it had improved, but there was almost no change.
Out
WIN_max recovery rate 81.76498653852708
PLACE_max recovery rate 85.47403863193337
Even when n is varied from 1 to 3, the recovery rate for win and place bets stays around 80-81%.
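A minimal sketch of the top-n purchase rule, using the `Proba order` rank computed earlier. The `payout_win` column and the toy numbers here are hypothetical:

```python
import pandas as pd

# Toy two-race table: 'Proba order' is the in-race rank of the model
# probability (1 = highest); 'payout_win' is a hypothetical win dividend
# per 100-yen ticket, 0 for losing horses
df = pd.DataFrame({
    'race':        [1, 1, 1, 2, 2, 2],
    'Proba order': [1, 2, 3, 1, 2, 3],
    'payout_win':  [240, 0, 0, 0, 510, 0],
})

def topn_recovery(df, n):
    """Recovery rate (%) when betting the top-n ranked horses in every race."""
    bets = df[df['Proba order'] <= n]
    return bets['payout_win'].sum() / (100 * len(bets)) * 100
```

With these toy numbers, n=1 buys 2 tickets and hits one (240 yen back on a 200-yen stake), while n=2 buys 4 and hits two.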
I tried various things, but it didn't improve much.
That's all for this time :wave: