[Python] Horse racing prediction: a story about thinking my recovery rate exceeded 100% with machine learning (LightGBM)


Caution!!!

**This article is fundamentally flawed.**

I apologize to everyone who bookmarked it.

Thank you to @hal27 for pointing out the problem.

**What went wrong:** I made a fatal mistake at the scraping stage. I thought I was acquiring data for the three races that preceded each race, but I was actually acquiring each horse's three most recent races as of the time the scraper ran. In other words, future information leaked into the features.

That said, when I predicted without using any previous-race information at all, the recovery rate still averaged about 90%, so I believe it can exceed 100% even with the correct data.

**Time to start over!**

Please read this article with that mistake in mind (pay particular attention to the part that scrapes the previous-race information).

Introduction

Recently I've been hooked on data analysis.

When working on Kaggle, the data analysis competition platform, I often catch myself thinking, **"Sales forecasting? Isn't there a more interesting theme?"**

That's why I decided to build a horse racing prediction model from scratch as a study project. If it makes money, all the better, and for someone like me who loves horse racing, it's the perfect analysis theme.

Although I'm almost a beginner, I ended up with a model whose recovery rate **stably exceeded 100%**, so in this article I'll describe the rough flow of building the model and the details of the simulation results. If anything about my approach is wrong, please let me know.

Condition setting

**Predict each racehorse's running time and place a win bet on the horse predicted to be fastest.**

Many models out there report the hit rate for picking the winner, but it seems that many of them fail to raise the recovery rate as much as expected. If so, maybe it's better to bet after purely predicting the time instead **(reckless)**, so that's the setting I chose.

Actually, there is another reason...

In horse racing, more money apparently gets bet on popular horses than their ability justifies (reference: the theory that you can't win by focusing on popular horses). In other words, rather than chasing the win hit rate, you might raise the recovery rate by making predictions with an eye on the odds. However, I want to buy my betting tickets well in advance, so I don't want to include the odds, which are only fixed just before the race, in the features.

What should I do...

Horse racing involves so many factors that purely predicting the running time is difficult. But precisely because it's difficult, maybe a time-based model can find promising horses that people don't bet on **(reckless)**. Alright, let's go with running time.

** Learning target and simulation target **

Since I live in Kyoto, I targeted **Kyoto races only**. The data covers almost all races from 2009 to 2019 (see the Data preprocessing section). I divided them into training data and test data and ran the simulation on the test data, for a total of 7 simulated years.

The data was divided as follows.

| Training data | Test data |
|:---|:---|
| 2009–2018 | 2019 |
| 2009–2017 | 2018 |
| 2009–2016 | 2017 |
| 2009–2015 | 2016 |
| 2009–2014 | 2015 |
| 2009–2013 | 2014 |
| 2009–2012 | 2013 |

To avoid leakage, the training data always consists only of the years before the test year.
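As a sketch, this year-based split is just a filter on the `year` column that the scraping code in the Appendix produces (the helper name is mine):

```python
import pandas as pd

def split_by_year(df: pd.DataFrame, test_year: int):
    """Train on all years strictly before test_year, test on test_year only."""
    train = df[df['year'].astype(int) < test_year]
    test = df[df['year'].astype(int) == test_year]
    return train, test

# Example: the first split in the table above
# train_df, test_df = split_by_year(df, 2019)
```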

The features used are explained below.

Flow until model creation

  1. Data acquisition (web scraping)
  2. Data preprocessing
  3. Modeling

1. Data acquisition

Apparently you can get the data easily if you pay for it, but I collected it by web scraping, also as a learning exercise.

First, I studied basic HTML/CSS on Progate. Without at least that much knowledge, I couldn't tell where to find the data I wanted.

For the web scraping itself, I referred to the following articles. Honestly, I think this was the hardest part (too many exceptions!).

- The story of hitting the Teio Sho at Oi Horse Racing with machine learning
- python3 Crawling & Scraping

The site to be scraped is https://www.netkeiba.com/.

The following is the data obtained by scraping (the code is in the Appendix).

The acquired data is as follows

| Feature | Description | Feature | Description |
|:---|:---|:---|:---|
| race_num | Race number of the day | field | Turf or dirt |
| dist | Distance | l_or_r | Clockwise or counterclockwise |
| sum_num | Number of runners | weather | Weather |
| field_cond | Going (track condition) | rank | Finishing order |
| horse_num | Horse number | horse_name | Horse name |
| gender | Sex | age | Age |
| weight | Horse weight | weight_c | Horse weight change |
| time | Running time | sum_score | Career record |
| odds | Win odds | popu | Popularity rank |

In addition, I obtained the same information for each horse's **previous 3 races**.

Of these, the **running time, finishing order, odds, popularity, and horse name are not used as training features** (the running time is the prediction target).

Also, **horses that do not have a full 3 previous races were removed from the data.** However, as long as at least one horse in a race still has the information, I predict the fastest horse among those that remain. Of course, if the actual winner was among the removed horses, the prediction counts as a miss.

About 450 races per year remain (almost all races).
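As a sketch, this filtering step could look like the following, assuming the merged data uses the `time_1`–`time_3` column names from the appendix code and that a missing past race shows up as NaN (both are assumptions about how the data was merged):

```python
import pandas as pd

def drop_incomplete_horses(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only horses for which all three previous races were obtained.
    return df.dropna(subset=['time_1', 'time_2', 'time_3'])
```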

2. Data preprocessing

The acquired data is transformed so it can be fed into a machine learning model. That said, all I did was **label-encode the categorical variables** and **convert string columns to numeric types**.

Since the model used this time is a tree-based algorithm, no standardization is applied.

I also **created a few new features from the acquired ones** (e.g., converting distance and time into speed).
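A minimal sketch of these preprocessing steps; the `m:ss.s` time format and the exact set of categorical columns are my assumptions:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def to_seconds(t: str) -> float:
    # Convert a time string such as '1:23.4' into seconds (format assumed).
    m, s = t.split(':')
    return int(m) * 60 + float(s)

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # String columns to numeric types
    df['time'] = df['time'].map(to_seconds)
    df['dist'] = pd.to_numeric(df['dist'])
    # Label-encode the categorical variables
    for col in ['field', 'l_or_r', 'weather', 'field_cond', 'gender']:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    # Example of an engineered feature: speed in the previous race
    df['speed_1'] = pd.to_numeric(df['dist_1']) / df['time_1'].map(to_seconds)
    return df
```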

3. Modeling

I implemented the model with **LightGBM**, a library for the **gradient boosting decision tree algorithm**. It's one of the most commonly used tools on Kaggle these days.

Below is the implementation code.

```python
import lightgbm as lgb

# Train and validation sets for LightGBM
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_valid, y_valid, reference=lgb_train)

# Plain regression on the running time, evaluated with RMSE
params = {'objective': 'regression',
          'metric': 'rmse',
          }

model = lgb.train(
    params, lgb_train,
    valid_sets=[lgb_train, lgb_eval],
    verbose_eval=10,
    num_boost_round=1000,
    early_stopping_rounds=10)
```

As you can see, I've done almost nothing (laughs).

I tried tuning the hyperparameters with Optuna and the like, but since the evaluation metric differs from the recovery rate, it didn't lead to much improvement in the recovery rate.

Simulation

Below are the simulation results for the 7 years.

**Horizontal axis**: race index. **Vertical axis**: win odds when the bet hits (0 if it misses). **hit_rate**: hit rate. **back_rate**: recovery rate. The train and test ranges in each title show the period of data used ("09" means 2009).

(Seven plots, one per train/test split; file names such as 09-10-11-12,13 indicate training on 2009–2012 and testing on 2013.)
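For reference, a minimal sketch of how hit_rate and back_rate could be computed, assuming flat 100-yen win bets and a per-horse results frame with `race_id`, `time_pred`, `rank`, and `odds` columns (the exact column names are assumptions):

```python
import pandas as pd

def simulate(results: pd.DataFrame) -> None:
    hits, payout = 0, 0.0
    n_races = results['race_id'].nunique()
    for _, race in results.groupby('race_id'):
        pick = race.loc[race['time_pred'].idxmin()]  # fastest predicted horse
        if int(pick['rank']) == 1:                   # the win bet hits
            hits += 1
            payout += 100 * float(pick['odds'])      # stake times win odds
    print(f'hit_rate:  {hits / n_races:.1%}')
    print(f'back_rate: {payout / (100 * n_races):.1%}')
```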

The results are summarized below

| Simulation year | Hits | Hit rate | Recovery rate |
|:---|:---|:---|:---|
| 2013 | 116/460 | 25.2% | 171.9% |
| 2014 | 89/489 | 18.5% | 155.3% |
| 2015 | 120/489 | 24.5% | 154.7% |
| 2016 | 101/491 | 20.6% | 163.6% |
| 2017 | 131/472 | 27.8% | 263.5% |
| 2018 | 145/451 | 32.2% | 191.8% |
| 2019 | 136/459 | 29.6% | 161.7% |
| Average | — | 25.5% | 180.4% |

The results are too good.

The hit rate looks plausible, but what surprised me was that the model occasionally hit horses with very high odds.

**In 2017, it even hit a horse that paid over 250x!**

I was curious, so let's look at the details. Below are that day's race results, sorted by time_pred.

(Screenshots of the predictions and actual results omitted.)

It really did call it correctly.

Something feels wrong, and that scares me. Is it really this easy to exceed 100% in the first place?

What should I check in a case like this...?

**I'll just have to try it in real life!**

Here is what I checked:

- **Am I using only features that are available on race day?**
- **Is the formula for calculating the recovery rate correct?**
- **Are there any discrepancies with information published online?**
- **Am I really selecting the horse with the fastest predicted time?**
- **Poke at the finished model from various angles.**

I played with the model

Since it seems worthwhile, I'll try various things.

1. Betting on the horse predicted to be the slowest

(Plot of the results when betting on the slowest predicted horse.)

It hit only 6 times.

2. Feature importance

Below are LightGBM's feature importances (2019). `a`, `b`, and `c` are the speeds (dist / time) of each horse's last race, second-to-last race, and third-to-last race, respectively.

(Feature importance plot, 2019.)
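For reference, such a plot can be produced with LightGBM's built-in helper, applied to the `model` trained in the Modeling section:

```python
import lightgbm as lgb
import matplotlib.pyplot as plt

# Plot the booster's feature importances (split counts by default).
lgb.plot_importance(model, max_num_features=15)
plt.tight_layout()
plt.show()
```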

I understand why dist (race distance) is important, since we're predicting time, but why is race_id (which race of the year it is) so important?

Is the season important for predicting time?

Environment-related features such as field_cond (going) and race_num (which race of the day) also rank high.

**\* See the Postscript for more on this (added 2020/05/28).**

3. What happens if you keep betting on the nth most popular horse?

This has nothing to do with the model I built (laughs).

Below are the results for **2019**.

| nth most popular | Hit rate | Recovery rate |
|:---|:---|:---|
| Most popular | 30.9% | 71.1% |
| 2nd most popular | 17.3% | 77.2% |
| 3rd most popular | 15.3% | 90.3% |
| 4th most popular | 10.1% | 81.3% |
| 5th most popular | 8.4% | 100.5% |
| 6th most popular | 6.2% | 92.4% |
| 7th most popular | 3.2% | 64.2% |
| 8th most popular | 2.4% | 52.1% |
| 9th most popular | 1.5% | 48.2% |
| 10th most popular | 1.3% | 59.1% |
| 11th most popular | 1.5% | 127.6% |
| 12th most popular | 1.3% | 113.9% |
| 13th most popular | 1.5% | 138.6% |
| 14th most popular | 0.4% | 77.8% |

A very interesting result. If recovery is all you care about, continually buying unpopular horses might work. Though since they almost never hit, it probably isn't much fun to watch (laughs).
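A sketch of how this table can be computed, again assuming 100-yen win bets and numeric `popu`, `rank`, and `odds` columns (the column names are assumptions):

```python
def popularity_strategy(results, n):
    # Always bet 100 yen to win on the nth most popular horse.
    picks = results[results['popu'] == n]
    hit = picks['rank'] == 1
    hit_rate = hit.mean()
    back_rate = (picks.loc[hit, 'odds'] * 100).sum() / (len(picks) * 100)
    return hit_rate, back_rate
```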

Since it's interesting, I also looked at the **2013–2019 average**.

| nth most popular | Hit rate | Recovery rate |
|:---|:---|:---|
| Most popular | 31.7% | 73.4% |
| 2nd most popular | 20.0% | 83.7% |
| 3rd most popular | 13.2% | 80.2% |
| 4th most popular | 9.9% | 81.9% |
| 5th most popular | 7.8% | 89.1% |
| 6th most popular | 5.5% | 89.8% |
| 7th most popular | 4.2% | 86.0% |
| 8th most popular | 2.4% | 64.8% |
| 9th most popular | 2.1% | 64.8% |
| 10th most popular | 1.7% | 80.9% |
| 11th most popular | 1.1% | 98.2% |
| 12th most popular | 1.0% | 69.4% |
| 13th most popular | 1.1% | 113.2% |
| 14th most popular | 0.2% | 35.4% |

Interesting.

Summary

- The recovery rate can exceed 100% with almost no ingenuity.
- LightGBM is awesome.

Appendix

Please note that the code is messy.

・ **Scraping (acquiring race information and the URL of each horse)**


```python
import re
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup


def url_to_soup(url):
    # Wait between requests so we don't hammer the server.
    time.sleep(1)

    html = requests.get(url)
    html.encoding = 'EUC-JP'  # netkeiba pages are encoded in EUC-JP
    return BeautifulSoup(html.text, 'html.parser')


def race_info_df(url, year):
    # 'year' originally came from an outer-scope variable; it is passed in
    # explicitly here so the function is self-contained.
    # Example url: 'https://race.sp.netkeiba.com/?pid=race_result&race_id=201108030409&rf=rs'
    df1 = pd.DataFrame()
    HorseLink = []

    try:
        soup = url_to_soup(url)

        # Pages without result data return empty outputs.
        if soup.find_all('li', class_='NoData') != []:
            return df1, HorseLink

        else:
            race_cols = ['year', 'date', 'place', 'race_num', 'race_class', 'field', 'dist', 'l_or_r',
                         'sum_num', 'weather', 'field_cond', 'rank', 'horse_num', 'horse_name', 'gender', 'age',
                         'weight', 'weight_c', 'time', 'jackie', 'j_weght', 'odds', 'popu']

            # Items common to the whole race #
            Date = soup.find_all('div', class_='Change_Btn Day')[0].text.split()[0]
            Place = soup.find_all('div', class_="Change_Btn Course")[0].text.split()[0]

            RaceClass = soup.find_all('div', class_="RaceDetail fc")[0].text.split()[0][-6:].replace('、', '')

            RaceNum = soup.find('span', id=re.compile("kaisaiDate")).text
            RaceData = soup.find_all('dd', class_="Race_Data")[0].contents
            Field = RaceData[2].text[0]
            Dist = RaceData[2].text[1:5]

            l_index = RaceData[3].find('(')
            r_index = RaceData[3].find(')')
            LOrR = RaceData[3][l_index + 1:r_index]

            RD = RaceData[3][r_index + 1:]
            SumNum = RD.split()[0]  # e.g. '16頭'; the counter suffix is stripped below
            Weather = RD.split()[1]
            FieldCond = soup.find_all('span', class_=re.compile("Item"))[0].text

            # Per-horse items #
            n_horses = int(SumNum[:-1])
            HorseLink = []
            for m in range(n_horses):
                HN = soup.find_all('dt', class_='Horse_Name')[m].contents[1].text
                HL = soup.find_all('dt', class_='Horse_Name')[m].contents[1].get('href')
                HorseLink.append(HL if HN != '' else soup.find_all('dt', class_='Horse_Name')[m].contents[3].get('href'))

            HorseName = []
            for m in range(n_horses):
                HN = soup.find_all('dt', class_='Horse_Name')[m].contents[1].text
                HorseName.append(HN if HN != '' else soup.find_all('dt', class_='Horse_Name')[m].contents[3].text)

            Rank = [soup.find_all('div', class_='Rank')[m].text for m in range(n_horses)]

            HorseNum = [soup.find_all('td', class_=re.compile('Num Waku'))[m].text.strip() for m in range(1, n_horses * 2 + 1, 2)]

            Detail_Left = soup.find_all('span', class_='Detail_Left')
            Gender = [Detail_Left[m].text.split()[0][0] for m in range(n_horses)]
            Age = [Detail_Left[m].text.split()[0][1] for m in range(n_horses)]
            Weight = [Detail_Left[m].text.split()[1][0:3] for m in range(n_horses)]
            WeightC = [Detail_Left[m].text.split()[1][3:].replace('(', '').replace(')', '') for m in range(n_horses)]

            Time = [soup.find_all('td', class_="Time")[m].contents[1].text.split('\n')[1] for m in range(n_horses)]

            Detail_Right = soup.find_all('span', class_='Detail_Right')
            Jackie = [Detail_Right[m].text.split()[0] for m in range(n_horses)]
            JWeight = [Detail_Right[m].text.split()[1].replace('(', '').replace(')', '') for m in range(n_horses)]
            Odds = [soup.find_all('td', class_="Odds")[m].contents[1].text.split('\n')[1][:-1] for m in range(n_horses)]
            Popu = [soup.find_all('td', class_="Odds")[m].contents[1].text.split('\n')[2][:-2] for m in range(n_horses)]

            Year = [year for a in range(n_horses)]
            RaceCols = [Year, Date, Place, RaceNum, RaceClass, Field, Dist, LOrR,
                        SumNum, Weather, FieldCond, Rank, HorseNum, HorseName, Gender, Age,
                        Weight, WeightC, Time, Jackie, JWeight, Odds, Popu]
            for race_col, RaceCol in zip(race_cols, RaceCols):
                df1[race_col] = RaceCol

            return df1, HorseLink

    except Exception:
        # Any page that doesn't match the expected layout is skipped.
        return df1, HorseLink
```

・ **Scraping (each horse's past race information)**


```python
def horse_info_df(HorseLink, df1):

    df2 = pd.DataFrame()

    for n, url2 in enumerate(HorseLink):

        try:
            soup2 = url_to_soup(url2)

            # Columns for the horse's last three races
            # ('2' and 'dist2' in the original were typos for 'odds_2' and 'dist_2').
            horse_cols = ['sum_score',
                          'popu_1', 'rank_1', 'odds_1', 'sum_num_1', 'field_1', 'dist_1', 'time_1',
                          'popu_2', 'rank_2', 'odds_2', 'sum_num_2', 'field_2', 'dist_2', 'time_2',
                          'popu_3', 'rank_3', 'odds_3', 'sum_num_3', 'field_3', 'dist_3', 'time_3']

            # Some pages have an extra table before the results.
            sec = 1
            ya = soup2.find_all('section', class_="RaceResults Sire")
            if ya != []:
                sec = 2
            tbody1 = soup2.find_all('tbody')[sec]
            SomeScore = tbody1.find_all('td')[0].text  # career record

            tbody3 = soup2.find_all('tbody')[2 + sec]

            HorseCols = [SomeScore]

            # NOTE: these rows are the three most recent races on the horse's
            # page at scraping time, which is the leak described in the
            # Caution section.
            for late in range(1, 4):
                HorseCols.append(tbody3.contents[late].find_all('td')[2].text)        # popularity
                HorseCols.append(tbody3.contents[late].find_all('td')[3].text)        # finishing order
                HorseCols.append(tbody3.contents[late].find_all('td')[6].text)        # odds
                HorseCols.append(tbody3.contents[late].find_all('td')[7].text)        # number of runners
                HorseCols.append(tbody3.contents[late].find_all('td')[10].text[0])    # field (turf/dirt)
                HorseCols.append(tbody3.contents[late].find_all('td')[10].text[1:5])  # distance
                HorseCols.append(tbody3.contents[late].find_all('td')[14].text)       # time

            dfplus = pd.DataFrame([HorseCols], columns=horse_cols)
            dfplus['horse_name'] = df1['horse_name'][n]

            df2 = pd.concat([df2, dfplus])
        except Exception:
            # Horses whose pages don't match the expected layout are skipped.
            pass

    return df2
```

Postscript

Thoughts on feature importance (2020/05/28)

Let's think about why race_id scores so high in the feature importances. The following two checks show that its importance is being overestimated.

**1. Compare with and without race_id**

When I removed the race_id feature, the recovery rate got roughly 10 percentage points worse and the test RMSE got about 0.1 worse in all 7 years. So the race_id feature does carry some useful signal.
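A sketch of this ablation (a hypothetical helper, not the author's exact code): drop one feature, retrain with the same parameters, and compare the test RMSE.

```python
import numpy as np
import lightgbm as lgb

def test_rmse_without(feature, X_train, y_train, X_valid, y_valid,
                      X_test, y_test, params):
    # Retrain without one feature and return the test RMSE.
    cols = [c for c in X_train.columns if c != feature]
    train_set = lgb.Dataset(X_train[cols], y_train)
    valid_set = lgb.Dataset(X_valid[cols], y_valid, reference=train_set)
    booster = lgb.train(params, train_set, valid_sets=[valid_set],
                        num_boost_round=1000, early_stopping_rounds=10,
                        verbose_eval=False)
    pred = booster.predict(X_test[cols])
    return float(np.sqrt(np.mean((y_test - pred) ** 2)))
```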

**2. The same experiment with dist, the second most important feature**

(Plot omitted.)

This time the prediction accuracy deteriorated considerably more.

Putting 1 and 2 together: removing dist hurts accuracy far more than removing race_id, even though race_id ranks above it in the importance plot. **So the importance of race_id is overestimated.**

In this experiment, the train and validation data were split randomly, as follows.

```python
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.3,
                                                      random_state=0)
```

With a random split, horses from the same race end up in both the training and validation sets, so just knowing the race_id lets the model infer the approximate time.
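The chronological variant is the same call with shuffling disabled (`random_state` then has no effect):

```python
# The last 30% of rows become the validation set, so horses from one
# race no longer straddle the train/valid boundary.
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,
                                                      test_size=0.3,
                                                      shuffle=False)
```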

Below is the feature importance after setting shuffle=False. (Plot omitted.)

With this change, race_id becomes far less important. However, that is a separate question from improving the recovery rate or the RMSE; both actually got worse.
