[PYTHON] Try to predict the trifecta of a boat race with ranking learning

Introduction

This article explains the internals of the boat race trifecta prediction site "Today, do you have a good prediction?" that I built and released on the web. This time I summarize the machine learning model; data acquisition and preprocessing are covered in a separate article, so please refer to that one.

Create a data frame from the acquired boat race text data

What is ranking learning?

I relied heavily on the following article when implementing this.

・ Horse Racing Prediction AI Again -Part 1- ~ Lambda Rank ~

Ranking learning (learning to rank) is a method for learning relative order relations. As in the article above, it seemed well suited to learning the relative strength of multiple competitors, as in horse racing or boat racing.

I'm still in the middle of reading the papers (laughs), but let's just try it first. The library used is LightGBM.

Prepare the Query dataset

This time, the training data covers January through April 2020, and the data from May 2020 onward is used for validation.

One distinctive ingredient of ranking learning is the "query data". The query data records how many training rows belong to each race. In boat racing a race is normally run with six boats (barring accidents), so

Query data = [6, 6, 6, ..., 6]

i.e. you should get a list containing one "6" per race (assuming no absentees).
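As a quick (hypothetical) sanity check, the group sizes must add up to the total number of training rows, since LightGBM uses them to slice the flat row array back into races:

```python
# Hypothetical example: four races, one of which had an absentee (5 boats)
train_group = [6, 6, 5, 6]

# LightGBM slices the training rows into races using these sizes,
# so their sum must equal the number of training rows
n_rows = sum(train_group)
print(n_rows)  # 23
```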

So we build this query data with the following code.

%%time
import pandas as pd

target_cols = ["Position"]
feature_cols = ["Name", "Lane", "Round", "Month", "Place"]

train = pd.read_csv(train_file)

train_group = []     # number of boats per race (the query data)
shuffled_parts = []  # per-race row blocks, shuffled
for i, k in enumerate(train["Round"]):
    if i == 0:
        temp = k    # Round value of the race currently being scanned
        temp2 = i   # row index where that race starts
    elif k != temp:
        train_group.append(i - temp2)  # number of boats in the race that just ended
        # Shuffle the row order within the race with .sample (see below)
        shuffled_parts.append(train[temp2:i].sample(frac=1))
        temp = k
        temp2 = i

# The loop never appends the final race, so add it here
train_group.append(i + 1 - temp2)
shuffled_parts.append(train[temp2:i + 1].sample(frac=1))

# DataFrame.append was removed in pandas 2.0, so concatenate in one go
train_shuffle = pd.concat(shuffled_parts)

train_y = train_shuffle[target_cols].astype(int)
train = train_shuffle[feature_cols]
print(train.shape)

The train file read by read_csv is produced as described in the article Create a data frame from the acquired boat race text data.

The code counts how many consecutive rows share the same Round and stores the counts in the train_group list. The reference article also warned that **it is risky not to shuffle the row order within each group**, so each race's rows are shuffled with .sample before being stored in train_shuffle.

Applying the same code to the validation dataset yields the validation query data.
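As an aside, the same group sizes and per-race shuffle can be written more compactly with pandas. A sketch on a hypothetical miniature frame (the real one comes from the CSV above); races are identified by consecutive runs of the same Round value, just as in the loop version:

```python
import pandas as pd

# Hypothetical miniature stand-in for the real training frame
train = pd.DataFrame({
    "Round":    [1, 1, 1, 2, 2, 2],
    "Lane":     [1, 2, 3, 1, 2, 3],
    "Position": [1, 2, 3, 2, 1, 3],
})

# Label each consecutive run of identical Round values as one race
race_id = (train["Round"] != train["Round"].shift()).cumsum()

# Group sizes per race, in order of appearance -> the query data
train_group = train.groupby(race_id).size().tolist()

# Shuffle rows within each race (the order inside a group must not be fixed)
train_shuffle = train.groupby(race_id).sample(frac=1, random_state=0)

print(train_group)  # [3, 3]
```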

Use LightGBM

I'll omit missing-value handling, feature engineering, one-hot encoding, and so on, but once the training and query datasets are ready, running the machine learning itself is easy these days. One caveat: apparently due to a LightGBM restriction, an error occurs when a column name contains Japanese, so the following processing was added.

# LightGBM rejects non-ASCII (Japanese) column names, so rename them mechanically
column_list = []
for i in range(len(comb_onehot.columns)):
    column_list.append(str(i) + '_column')

comb_onehot.columns = column_list
train_onehot = comb_onehot[:len(train)]  # split the combined frame back into
val_onehot = comb_onehot[len(train):]    # training and validation parts

With the data split back into training and validation parts, let's run the training.

import lightgbm as lgb

lgbm_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'lambdarank',  # ← this is what selects ranking learning!
    'metric': 'ndcg',           # standard metric for lambdarank
    'ndcg_eval_at': [1, 2, 3],  # we want to predict the top three
    'max_position': 6,          # a boat race has at most six boats
    'learning_rate': 0.01,
    'min_data': 1,
    'min_data_in_bin': 1,
#     'num_leaves': 31,
#     'min_data_in_leaf': 20,
#     'max_depth': 35,
}
lgtrain = lgb.Dataset(train_onehot, train_y, group=train_group)
lgvalid = lgb.Dataset(val_onehot, val_y, group=val_group)
lgb_clf = lgb.train(
    lgbm_params,
    lgtrain,
    num_boost_round=250,
    valid_sets=[lgtrain, lgvalid],
    valid_names=['train', 'valid'],
    early_stopping_rounds=20,
    verbose_eval=5
)

Hyperparameters such as num_leaves should really be tuned, but let's press on without worrying about them here. Prediction on the validation data looks like this. What a convenient age we live in...

# group is not needed at prediction time; scores are computed per row
y_pred = lgb_clf.predict(val_onehot, num_iteration=lgb_clf.best_iteration)
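For reference, the per-race predicted order can be recovered from the flat score array with argsort. Note that since the raw finishing Position (1 = winner) was used directly as the label here, a lower predicted score appears to correspond to a better predicted finish, which is why the evaluation code uses argmin for the winner. A sketch with made-up scores:

```python
import numpy as np

# Made-up scores for one six-boat race; lower score = better predicted finish
y_pred = np.array([0.2, -1.3, 0.8, -0.5, 1.1, 0.4])
val_group = [6]

predicted_top3 = []
j = 0
for size in val_group:
    race_scores = y_pred[j:j + size]
    order = np.argsort(race_scores)            # ascending: predicted winner first
    predicted_top3.append(order[:3].tolist())  # row offsets of predicted 1st-3rd
    j += size

print(predicted_top3)  # [[1, 3, 0]]
```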

The result

The trifecta prediction by ranking learning came out as follows. The trifecta hit rate is 8.15%!


By the way, to compute the hit rates above (especially for 2nd and 3rd place), I wrote the following code. Hmm, redundant!

import numpy as np

# Compute hit rates on the validation data
j = 0
solo_count = 0  # win (predicted 1st place correctly)
doub_count = 0  # exacta (1st and 2nd, in order)
tri_count = 0   # trifecta (1st through 3rd, in order)
for i in val_group:
    result = y_pred[j:j+i]
    ans = val_y[j:j+i].reset_index()

    # The raw Position (1 = winner) was used as the label, so a LOWER
    # predicted score means a better predicted finish: argmin is the winner
    result1st = np.argmin(result)
    if len(np.where(result==sorted(result)[1])[0])>1:
        # two boats share the second-lowest score: take them as 2nd and 3rd
        result2nd = np.where(result==sorted(result)[1])[0][0]
        result3rd = np.where(result==sorted(result)[1])[0][1]
    else:
        if i > 1:
            result2nd = np.where(result==sorted(result)[1])[0][0]
        if i > 2:
            result3rd = np.where(result==sorted(result)[2])[0][0]

    ans1st = int(ans[ans["Position"]==1].index.values)
    if len(ans[ans["Position"]==2].index.values)>1:
        # the data sometimes records two boats at Position 2:
        # treat those two rows as 2nd and 3rd
        ans2nd = int(ans[ans["Position"]==2].index.values[0])
        ans3rd = int(ans[ans["Position"]==2].index.values[1])
    else:
        if i > 1:
            ans2nd = int(ans[ans["Position"]==2].index.values[0])
        if i > 2:
            ans3rd = int(ans[ans["Position"]==3].index.values[0])

    if ans1st==result1st:
        solo_count = solo_count+1

    if i > 1:
        if (ans1st==result1st)&(ans2nd==result2nd):
            doub_count = doub_count+1

    if i > 2:
        if (ans1st==result1st)&(ans2nd==result2nd)&(ans3rd==result3rd):
            tri_count = tri_count+1
    j=j+i

print("Win hit rate:",round(solo_count/len(val_group)*100,2),"%")
print("Exacta hit rate:",round(doub_count/len(val_group)*100,2),"%")
print("Trifecta hit rate:",round(tri_count/len(val_group)*100,2),"%")
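The loop above can be condensed considerably with argsort. A sketch on toy data; unlike the article's version it ignores exact score ties and assumes at least three boats per race:

```python
import numpy as np

def hit_rates(y_pred, y_true, groups):
    """Win / exacta / trifecta hit rates; lower score = better predicted finish."""
    solo = doub = tri = 0
    j = 0
    for size in groups:
        pred_order = np.argsort(y_pred[j:j + size])  # predicted finishing order
        true_order = np.argsort(y_true[j:j + size])  # actual finishing order
        match = pred_order[:3] == true_order[:3]
        solo += int(match[0])         # predicted the winner
        doub += int(match[:2].all())  # predicted 1st and 2nd, in order
        tri += int(match.all())       # predicted 1st through 3rd, in order
        j += size
    n = len(groups)
    return solo / n, doub / n, tri / n

# Toy example: two three-boat races, both predicted perfectly
y_pred = np.array([0.1, 0.9, 0.5,  0.3, 0.2, 0.8])
y_true = np.array([1, 3, 2,  2, 1, 3])  # actual finishing positions
print(hit_rates(y_pred, y_true, [3, 3]))  # (1.0, 1.0, 1.0)
```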

In closing

The result above is a better hit rate than betting blindly (the single most frequent trifecta combination is "1-2-3", which occurs about 7% of the time).

That said, this hit rate alone is nowhere near good enough, so I felt some further ingenuity was needed. I'd like to summarize that in another article.
