[Python] [Verification] Having deep learning does not mean you can easily exceed a 100% recovery rate in horse racing.

Now the mystery-solving begins.

Original article: "If you have deep learning, you can exceed a 100% recovery rate in horse racing"

First, let's run it

I purchased the program right away and tried it. As the explanation says, it basically worked with copy and paste, but the following two places did not work as-is, so I fixed them myself:

- Parsing of the date data
- Column names of the features used for training / inference

Because random elements are involved, the results are not exactly the same, but the graph behaves in much the same way, so the reproduction seems to have succeeded. As stated in the article, buying in **"part of the range where the 3rd place index is 60 or more and the odds are not too high (around 55-60)"** exceeds 100% on my end as well. The number of target races and records is slightly higher than in the original article, presumably because last week's races (Queen Elizabeth Cup week) have been added.

My results

| Item | Result |
|---|---|
| Number of target races (*) | 3672 |
| Number of target records | 42299 |
| Number of purchases | 74 |
| Number of hits | 13 |
| Hit rate | 17.57% |
| Recovery rate | 172.97% |

Results in the original article

| Item | Result |
|---|---|
| Number of target races (*) | 3639 |
| Number of target records | 41871 |
| Number of purchases | 98 |
| Number of hits | 20 |
| Hit rate | 20.4% |
| Recovery rate | 213.3% |

What kind of horses does it predict?

What kind of horses is it buying to exceed 100%? Is it targeting conditions races rather than main races? Any horse racing fan will be curious. However, the validation DataFrame does not contain horse names; it holds only preprocessed values such as the horse number and popularity, and even the odds are not raw values, so it was very hard to tell. I could not find out without building separate data.

Try replacing deep learning with another model

Here is the code I modified. Let's swap the simple neural network for an even simpler logistic regression.

```python
from sklearn.linear_model import LogisticRegression

# penalty='l1' requires the liblinear (or saga) solver
model = LogisticRegression(C=2.0, penalty='l1', solver='liblinear',
                           random_state=42, multi_class="auto")
model.fit(df_train, train_labels)
```

This time, using this model's output `model.predict_proba(df)[:, 1]` as the 3rd place index, I simulated the same purchase rule as before: "3rd place index of 60 or more, and odds that are not too high (around 55-60)".
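For reference, here is a minimal sketch of that simulation under assumptions of mine: `all_columns` holds the feature columns, and `odds` / `fukusho` are the preprocessed win odds and the double-win refund per 100-yen ticket (0 when the bet missed); the actual column names in the purchased code may differ.

```python
# Minimal sketch of the purchase simulation (column names are assumptions).
win3_pred = model.predict_proba(df[all_columns])[:, 1]   # 3rd place index

buy = (win3_pred >= 0.60) & (df['odds'] >= 0.01) & (df['odds'] < 0.08)

n_buy = buy.sum()
n_hit = (df.loc[buy, 'fukusho'] > 0).sum()
recovery = df.loc[buy, 'fukusho'].sum() / (n_buy * 100)   # 100 yen per ticket

print(f"purchases: {n_buy}, hits: {n_hit}, recovery rate: {recovery:.2%}")
```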

| Item | Result |
|---|---|
| Number of purchases | 175 |
| Number of hits | 21 |
| Hit rate | 12.0% |
| Recovery rate | 127.26% |

**Amazing. Logistic regression also exceeded 100%!** Incidentally, a random forest reached 196%!
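For completeness, the random forest variant can be sketched like this; the hyperparameters are placeholders of mine, not the settings that actually produced the 196% figure.

```python
from sklearn.ensemble import RandomForestClassifier

# Placeholder hyperparameters; the settings behind the 196% figure are unknown.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(df_train, train_labels)

# The same purchase simulation as above is then applied to model.predict_proba(...)[:, 1].
```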

Is the data wrong?

It feels as if some bias in the data, rather than deep learning, is at work. The part of the code that decides what to buy is as follows.

```python
# Set 1 (buy) if the 3rd place index is 0.60 or more and the
# preprocessed odds are in the range 0.01-0.08; otherwise set 0
if win3_pred >= 0.60 and 0.01 <= odds < 0.08:
    return 1
else:
    return 0
```

win3_pred is the 3rd place index before being multiplied by 100. It bothers me that the odds are still in their standardized form (0.01-0.08 corresponds roughly to ordinary win odds of 55-60), but for now I'll rewrite the condition as follows.

```python
# Buy (1) whenever the preprocessed odds are in the range 0.01-0.08,
# ignoring the 3rd place index entirely; otherwise 0
if 0.01 <= odds < 0.08:
    return 1
else:
    return 0
```

This simulates buying a double-win (fukusho) ticket on every horse whose win odds are 55-60, without using the 3rd place index at all.

| Item | Result |
|---|---|
| Number of purchases | 778 |
| Number of hits | 68 |
| Hit rate | 8.74% |
| Recovery rate | 90.67% |

Since the takeout on double-win tickets is 20%, you would expect the recovery rate to settle somewhere around 80% even when the only filter is an odds range. With long-shot tickets a single hit pays a lot, so there is some variance, but 90% still feels a little high. Perhaps there is a problem with the validation data itself. Let's look at where the preprocessing and data manipulation happen.
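As a rough sanity check, the long-run expectation under a flat takeout can be written down directly; the 90.67% observed above sits noticeably above it.

```python
takeout = 0.20                   # JRA takeout on double-win (fukusho) tickets
expected_recovery = 1 - takeout  # long-run expectation of any odds-only filter
print(f"expected: {expected_recovery:.0%}, observed: 90.67%")
```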

Horses without results for their last 5 runs are discarded

This is the first place that caught my eye.

```python
# Delete rows with missing values
df = df.dropna(subset=[
    'past_time_sec1', 'past_time_sec2', 'past_time_sec3',
    'past_time_sec4', 'past_time_sec5'
]).reset_index(drop=True)
```

past_time_sec1 through past_time_sec5 hold the times of each horse's last 5 runs, so any horse that does not have all 5 of those times is thrown away here. This hits races for 2- and 3-year-olds especially hard, where the number of past runs varies a lot. For example, last week's Fukushima 10R, the Fukushima 2-year-old Stakes on November 10, 2019 (https://race.netkeiba.com/?pid=race_old&id=c201903030410), had 14 runners, but only three of them had a complete set of 5 past times, and indeed only three remained in the filtered DataFrame. This dropna takes the record count from **471,500 to 252,885**. **Nearly half of the data is discarded.** The discarded records are mostly 2-3 year olds, horses coming from local (NAR) racing (for which no data was collected), and 2010 records (whose previous-run information cannot be obtained because there is no 2009 data). It does not seem appropriate, but it is not a fatal mistake, because the same rule can be applied at inference time.
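A quick way to see what this filter throws away is to compare record counts just before running the dropna above; the `year` column used for the breakdown is an assumption of mine about the preprocessed data.

```python
# Sketch: run this just before the dropna above to see what gets discarded.
past_cols = ['past_time_sec1', 'past_time_sec2', 'past_time_sec3',
             'past_time_sec4', 'past_time_sec5']

dropped = df[df[past_cols].isnull().any(axis=1)]
print(f"{len(df)} -> {len(df) - len(dropped)}")   # e.g. 471500 -> 252885

# Which years lose the most records? ('year' is an assumed column name)
print(dropped.groupby('year').size())
```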

How many horses get a double-win payout?

Double-win tickets pay out on the top 3 finishers when there are 8 or more runners, on the top 2 when there are 5 to 7 runners, and are not sold at all when there are 4 or fewer. The following processing was **applied to the verification data**.

```python
# Keep only races with 2 or more double-win records and 5 or more records in total
win3_sums = df.groupby('race_id')['win3'].sum()
win3_races = win3_sums[win3_sums >= 2]
win3_races_indexs = win3_races.index.tolist()

win3_counts = df.groupby('race_id')['win3'].count()
win3_races2 = win3_counts[win3_counts >= 5]
win3_races_indexs2 = win3_races2.index.tolist()

race_id_list = list(set(win3_races_indexs) & set(win3_races_indexs2))
```

This process takes the record count from **48,555 to 42,999**, discarding 11.4% of the data. What you really want to exclude is races where no double-win ticket is sold, but this throws away far too much. In fact, there were no JRA races with 4 or fewer runners in 2018-2019 (at least not in my keibadb), so this processing is a problem.
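If the goal is only to drop races where no double-win ticket is sold, the filter should be based on the actual field size rather than on how many records survived the earlier dropna. A sketch of that idea, where `num_horses` (the true number of runners per race) is an assumed column name:

```python
def n_place_payouts(n_runners):
    """Number of double-win (fukusho) payout slots for a given field size."""
    if n_runners >= 8:
        return 3
    elif n_runners >= 5:
        return 2
    else:
        return 0  # fukusho tickets are not sold with 4 or fewer runners

# 'num_horses' (true field size per race) is an assumed column name.
field_size = df.groupby('race_id')['num_horses'].max()
valid_races = field_size[field_size.map(n_place_payouts) > 0].index
df_valid = df[df['race_id'].isin(valid_races)].reset_index(drop=True)
```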

What's going on

What is wrong with this? win3 is the target variable indicating whether a horse finished inside the double-win payout range, so the processing above keeps races where at least 2 horses finished in the payout range and at least 5 records exist. But remember: **horses without results for all of their last 5 runs have already been thrown away**, so **horses that should not have been removed are already gone**. It is a little confusing, so let's look at a concrete example: Kyoto 12R on November 2, 2019 (https://race.netkeiba.com/?pid=race&id=p201908050112&mode=shutuba). Five horses, No. 4 Boccherini, No. 7 Sunray Pocket, No. 9 Narita Blue, No. 10 Theo Amazon, and No. 12 Metropole, were excluded from the records in advance because their last 5 run times were not all available, so the race is treated as an 8-horse race. The finishing order was 4-3-7, and since it was actually a 13-horse race, the double-win payouts went to Nos. 4, 3, and 7. However, **because Nos. 4 and 7 have already been deleted from the DataFrame, only one horse in this race still counts as a double-win winner**, so the race has only one horse "within the double-win range" and gets excluded from the validation data. This operation is impossible for future races, because we obviously do not know before the race which horses will land in the double-win payout range. By the way, No. 5 Nihon Pillow Halo in this race had win odds of about 60 and lost, and the race was quietly dropped from the validation data, so a horse that loses at 55-60 odds no longer has to be bought. The picture is gradually becoming clear: regardless of what model is trained, horses have been removed from the validation data in a way that biases it.

Let's verify with correct data

Let's run the inference without this improper narrowing. Since there were no races without double-win payouts in 2018-2019, the filtering above is unnecessary. I commented it out and simulated the purchases against the un-narrowed verification data.

| Item | Result |
|---|---|
| Number of target races (*) | 5384 |
| Number of target records | 48555 |
| Number of purchases | 88 |
| Number of hits | 13 |
| Hit rate | 14.77% |
| Recovery rate | 145.45% |

This 145% recovery rate is achieved under the following three conditions:

- The 3rd place index (the deep learning model's predicted value) is 60 or more
- Horses without times for all of their last 5 runs are not bought
- Win odds of 55-60

Both the hit rate and the recovery rate have dropped, but on paper this would still be comfortably profitable. Is this the power of deep learning?

What kind of horses does the model rate highly?

When does the 3rd place index come out high? I trained a decision tree using, as the target, the class of whether the 3rd place index is above or below 0.5.

```python
from sklearn.tree import DecisionTreeClassifier

# Fit a shallow tree to explain when the 3rd place index exceeds 0.5
clf = DecisionTreeClassifier(max_depth=3)
clf = clf.fit(df[all_columns], df.win3_pred > 0.5)
```

Let's visualize the resulting tree. (Figure: decision tree visualization, omitted.)
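The original figure is an image, but a similar visualization can be produced along these lines; `all_columns` is the same feature list used above, and the class labels are just my own names for the two sides of the 0.5 split.

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Draw the fitted depth-3 tree with feature names for readability
plt.figure(figsize=(16, 8))
plot_tree(clf, feature_names=all_columns, class_names=['low', 'high'],
          filled=True, rounded=True)
plt.show()
```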

Apparently past_odds1, that is, **the win odds of the previous run**, along with the finishing position of the previous run, is what matters.

So this time, instead of the 3rd place index, let's write the purchase condition as a plain rule: previous-run win odds of 10 or less.

```python
# if win3_pred >= 0.60 and 0.01 <= odds < 0.08:
if raw_past_odds1 <= 10 and 55 <= raw_odds <= 60:
    return 1
else:
    return 0
```

| Item | Result |
|---|---|
| Number of purchases | 115 |
| Number of hits | 15 |
| Hit rate | 13.04% |
| Recovery rate | 147.22% |

Without using the output of the deep learning model at all, **a single hand-written rule produces about the same recovery rate.**

Review all data again

Let's look not just at 2018-2019 but all the way back to 2010 for horses with win odds of 55-60.

Horses with win odds of 55-60, over all 471,500 records

```python
# Yearly summary for horses with win odds of 55-60, over all records
pivot_df = df2[(df2['odds'] >= 55) & (df2['odds'] <= 60)].groupby('year') \
          .fukusho.agg(["mean", "count", "sum"])
pivot_df.columns = ["Recovery rate", "Purchase number", "Refund"]
pivot_df['Income and expenditure'] = pivot_df['Purchase number'] * (-100) + pivot_df['Refund']
pivot_df.style.background_gradient()
```

(Figure: yearly recovery-rate table, omitted.) 2015 is the highest year, but even that is only about 80%. The 2018-2019 verification period is also below 80%, nothing special.

Horses with win odds of 55-60, after removing horses without times for all of their last 5 runs

(Figure: yearly recovery-rate table after the 5-run filter, omitted.) This is the state of the verification data. The recovery rate dropped only in 2016 and rose in every other year. It also rose for the 2018-2019 verification period, exceeding 80%.

Horses with win odds of 55-60 and previous-run odds of 10 or less

(Figures: two yearly tables, omitted; the left is over all data, the right after the 5-run filter.) The condition that matched the deep learning result earlier (147.22%) corresponds to the combined 2018 and 2019 rows of the right-hand table. **You can see that, even buying under exactly the same condition, the recovery rate falls below 60% in some years.**

Conclusion

The rule that replaces the 3rd place index with "previous-run win odds of 10 or less" reached 147.22% only because the period was **2018-2019**; it seems unlikely that this rule will keep the recovery rate above 100% in the future. So what about the 3rd place index itself? If you plot the relationship between the 3rd place index and the previous-run odds... (Figure: scatter plot, omitted.) Other features are of course used as well, but **you can see that the lower the previous-run odds, the higher the 3rd place index.** Which is exactly what the decision tree told us.
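A plot along those lines can be sketched as follows; `past_odds1` is taken from the tree above and may hold preprocessed rather than raw odds, so the axis scale is an assumption on my side.

```python
import matplotlib.pyplot as plt

# Sketch: 3rd place index vs. win odds of the previous run
plt.scatter(df['past_odds1'], df['win3_pred'], s=2, alpha=0.3)
plt.xlabel('win odds of previous run (past_odds1)')
plt.ylabel('3rd place index (win3_pred)')
plt.show()
```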

The original article was titled **"If you have deep learning, you can exceed a 100% recovery rate in horse racing"**, but what it actually shows, I think, is that **"if you remove horses that do not have times for all of their last 5 runs and buy double-win tickets on horses with win odds of 55-60 from November 2018 to November 2019, you can exceed a 100% recovery rate with a simple rule instead of deep learning, and it just happens to work out that way."** **Having deep learning does not make it easy to exceed a 100% recovery rate in horse racing.**

(I did the analysis and wrote this up all in one go, so I may have made mistakes. Please let me know if you find any.)

Lack of love for data

What I want to say throughout this article is: look at your data more closely. I enjoyed this verification and the feeling of a mystery being solved. **Fundamentally, I think data is interesting.** There are many angles to look from and many things to discover. Even the Titanic survival dataset is interesting just to stare at. When you handle the data carefully, all kinds of things become visible, and that is how a good model gets made. **No matter how much deep learning or how good an algorithm you use, a model built without caring about the data will not become a good AI.** If you want to get closer to the data but don't know where to start, go to the racetrack!! It's fun even if you don't make money!!!
