[PYTHON] Horse Racing Prediction in Machine Learning-LightGBM Edition-

Introduction

Have you ever bet on horse racing? I had no contact with horse racing until I started studying machine learning. In programming, people often say you should "learn by doing", and I thought it would be nice to have a subject I could enjoy while studying hard. After looking into various options, I found the following:

**・ There are many features and approaches for forecasting → it is rewarding**
**・ It involves money → if I turn it into a tool, it could also become a business**
**・ The correct answer is updated weekly in the form of race results → easy to study**

I chose horse racing for the three reasons above. Since this is my first article, I expect many parts are hard to follow, and I would appreciate it if you could point them out.

What I did

I analyzed horse racing data and got as far as measuring prediction accuracy (AUC). It would be more interesting to compute the recovery rate after actually making predictions, but I do not have the skills for that yet, so I went through "data acquisition -> preprocessing -> training -> AUC calculation -> discussion".

Environment

Google Colab (I started with JupyterLab on my local machine, but switched to Google Colab because the kernel crashed during training on my PC.)

Procedure

1 Scraping from https://www.netkeiba.com/
2 Preprocessing
3 Training and prediction with LightGBM

1 Scraping from netkeiba.com

For this process, I referred to https://qiita.com/penguinz222/items/6a30d026ede2e822e245.

2 Preprocessing

The scraped CSV file looks like this. [Image: スクリーンショット 2021-01-05 13.54.17.png] The data is about 500,000 rows x 20 columns. I will preprocess this data.

First, I will explain some column names that may be unfamiliar.
・ c_weight ... Weight difference from the previous race
・ j_weight ... Jockey's weight
・ popu ... Popularity rank (each horse is given a popularity rank based on information available before the race)
・ odds ... Refund amount ÷ stake (the more popular the horse, the lower the odds)
・ trainerA, trainerB ... Trainer

2-1 Make data easier to handle

The leftmost column is Unnamed: 0. This column is a concatenation of the date, race number, and horse number, so I split it into two columns: race_num (date and race number) and horse_num (horse number).


#Data reading
#The Unnamed: 0 column concatenates the date, race number, and horse number, so rename it and split it.
import pandas as pd

keiba_data = pd.read_csv('/content/drive/MyDrive/Horse racing.csv', encoding = "shift-jis")

keiba_data.rename(columns={"Unnamed: 0":"date_num"},inplace=True)
keiba_data["date_num"]=keiba_data["date_num"].astype(str)
keiba_data["race_num"]=keiba_data["date_num"].str[0:12].astype(int)
keiba_data["horse_num"]=keiba_data["date_num"].str[12:14].astype(int)
keiba_data.drop(columns=["date_num"],inplace=True)
#Move race_num and horse_num to the far left for ease of use.
keiba_data=keiba_data.reindex(columns=['race_num','horse_num','age', 'c_weight', 'course', 'date', 'field', 'gender', 'head_count',
       'horse_name', 'j_weight', 'jackie', 'odds', 'popu', 'race', 'race_name',
       'rank', 'trainerA', 'trainerB', 'weight', 'year'])
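
The split above can also be sketched in plain Python. This is only an illustration: it assumes, as the pandas code does, that the first 12 characters identify the race and the next 2 the horse. The helper name `split_race_id` and the sample ID are hypothetical and not taken from the real data.

```python
def split_race_id(raw_id):
    """Split a 14-character ID string into (race_num, horse_num)."""
    race_num = int(raw_id[0:12])    # first 12 characters: date + race number
    horse_num = int(raw_id[12:14])  # last 2 characters: horse number
    return race_num, horse_num


# Hypothetical ID made up for illustration
race_num, horse_num = split_race_id("20190601011107")
print(race_num, horse_num)  # 201906010111 7
```

Converting the ID to a string first matters, because slicing only works on text and not on the integer that read_csv would otherwise produce.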


2-2 Missing value processing

Check if the data has missing values.

keiba_data.isnull().sum()

<img width="200" alt="スクリーンショット 2021-01-05 14.18.19.png" src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/978166/772236b6-0ab9-b7de-fa42-895d916b98fe.png">

There are some missing values. The model used this time, LightGBM, can handle missing values without any preprocessing, but I read somewhere (sorry, I forgot which article) that LightGBM can also be more accurate when missing values are processed beforehand, so I will process them.

Missing values in numeric columns are filled with 0 or the mean, and rows with missing values in object columns are dropped. Since popu and odds are expected to be fairly important features, I dropped the rows where they are missing rather than filling them with the mean or median. However, since every column has 2,000 or fewer missing values, I don't think the choice of treatment has much effect on the overall result.

#Missing value processing
#Assign the result back instead of calling fillna(inplace=True) on a column,
#which is unreliable in recent pandas versions.
keiba_data["c_weight"] = keiba_data["c_weight"].fillna(0)
keiba_data["j_weight"] = keiba_data["j_weight"].fillna(keiba_data["j_weight"].mean())
keiba_data["weight"] = keiba_data["weight"].fillna(keiba_data["weight"].mean())
#Drop rows where race_name, odds, or popu is missing.
keiba_data.dropna(subset=["race_name","odds","popu"],inplace=True)
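
To make the two strategies concrete, here is a minimal sketch in plain Python, without pandas. It fills missing numbers with the mean of the observed values and drops records where an important field is missing. The function name `fill_with_mean` and all sample values are made up for illustration.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace None entries with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    fill_value = mean(observed)
    return [fill_value if v is None else v for v in values]


# Hypothetical jockey weights with one missing entry
j_weight = [54.0, None, 56.0, 55.0]
filled = fill_with_mean(j_weight)
print(filled)  # [54.0, 55.0, 56.0, 55.0]

# Hypothetical rows: drop any row where odds or popu is missing
rows = [
    {"odds": 3.2, "popu": 1},
    {"odds": None, "popu": 5},
    {"odds": 12.0, "popu": None},
]
kept = [r for r in rows if r["odds"] is not None and r["popu"] is not None]
print(len(kept))  # 1
```

Filling keeps the row at the cost of an invented value, while dropping loses the row but never feeds the model a guess, which is why dropping is the safer choice for features expected to matter most.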

2-3 Checking the data type

keiba_data.dtypes

<img width="200" alt="スクリーンショット 2021-01-05 14.47.56" src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/978166/4816e9e7-8c7e-af6f-905f-32861403a72a.png">

All object-type columns are converted to int with LabelEncoder. Then unnecessary columns such as year are removed.

#Convert categorical variables using LabelEncoder.
from sklearn.preprocessing import LabelEncoder

le=LabelEncoder()
keiba_categorical = keiba_data[["gender","field","horse_name","course","head_count","trainerA","trainerB","race","jackie","race_name"]].apply(le.fit_transform)
#Add a _c suffix to the encoded columns so they do not clash with the originals.
keiba_categorical = keiba_categorical.rename(columns={"race_name":"race_name_c","field":"field_c","gender":"gender_c","horse_name":"horse_name_c","course":"course_c","head_count":"head_count_c","trainerA":"trainerA_c","trainerB":"trainerB_c","jackie":"jackie_c"})
keiba_data = pd.concat([keiba_data,keiba_categorical],axis=1)
#Remove pre-conversion and unnecessary columns
keiba_data.drop(columns=["race_num","horse_num","date","year","race_name","race","trainerA","trainerB","course","field","gender","jackie","head_count","horse_name"],inplace=True)
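
For readers unfamiliar with label encoding, the idea can be sketched in a few lines of plain Python: each distinct category is mapped to an integer. scikit-learn's LabelEncoder assigns the codes in sorted order of the classes, which this sketch imitates. The function name `label_encode` and the sample values are illustrative only.

```python
def label_encode(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    classes = sorted(set(values))
    code_of = {cls: code for code, cls in enumerate(classes)}
    return [code_of[v] for v in values], classes


# Hypothetical values for a categorical column
codes, classes = label_encode(["turf", "dirt", "turf", "dirt", "turf"])
print(classes)  # ['dirt', 'turf']
print(codes)    # [1, 0, 1, 0, 1]
```

The codes are arbitrary identifiers: a larger code does not mean a "larger" category.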

2-4 Feature generation

I consider odds and popu to be quite important features, so I added their product (odds_popu) as a new feature. In addition, since the weight at the previous race can be derived from weight and c_weight, I added the previous weight (pre_weight) as a new feature.

#Feature generation
#The first is the product of odds and popu
#The second is the previous weight
keiba_data["odds_popu"]=keiba_data["odds"]*keiba_data["popu"]
keiba_data["pre_weight"]=keiba_data["weight"]-keiba_data["c_weight"]
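
As a worked example of the two features for a single horse (all numbers are invented for illustration):

```python
# Hypothetical values for one horse
odds = 4.5      # refund amount divided by stake
popu = 2        # popularity rank
weight = 480    # body weight at this race
c_weight = 4    # weight difference from the previous race

odds_popu = odds * popu          # product of odds and popularity rank
pre_weight = weight - c_weight   # weight at the previous race
print(odds_popu, pre_weight)     # 9.0 476
```

Because c_weight is the change from the previous race, subtracting it from the current weight recovers the previous weight.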

2-5 Processing of objective variable

From here, the objective variable rank is processed.


#Confirmation of rank
keiba_data["rank"].unique()


The output shows that, besides numeric rankings, there are rows marked "Cancel" and "Disqualification". A row without a known rank cannot be used for training, so I drop all of these rows. Counting them with count() gives about 2,000, which is a small number, so dropping them should be fine.

#Delete all canceled and disqualified rows.
delete_index = keiba_data.index[(keiba_data["rank"]=="Cancel") | (keiba_data["rank"]=="Disqualification")]
keiba_data.drop(delete_index,inplace=True)

Now, I would like to predict the ranking, but in horse racing what basically matters for the payout is whether a horse finishes in the top three. Conversely, there is no need to predict exactly which horse finishes 7th or 8th. So I will treat this as a binary classification problem: top three or not. In short, the question is whether you win or lose. Some articles by other people divide the ranking into three classes (high, middle, and low) and treat it as a multi-class classification problem.

#Split into ranks 1-3 and everything else to make it a binary classification problem.
keiba_data["rank"]=keiba_data["rank"].astype(int)
keiba_data = keiba_data.assign(target = (keiba_data['rank'] <= 3).astype(int))

I created a column called target, which is 1 if the horse finished in the top three and 0 otherwise.
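
The whole treatment of the objective variable, dropping non-numeric ranks and then binarizing, can be sketched for single values as follows. The function name `to_target` is hypothetical, and returning None for unusable ranks is a choice made for this sketch; the article's code drops those rows instead.

```python
def to_target(rank):
    """Return 1 for a top-three finish, 0 otherwise, None if the rank is not a number."""
    text = str(rank).strip()
    if not text.isdigit():
        return None  # e.g. "Cancel" or "Disqualification": no usable label
    return 1 if int(text) <= 3 else 0


ranks = ["1", "3", "4", "12", "Cancel", "Disqualification"]
targets = [to_target(r) for r in ranks]
print(targets)  # [1, 1, 0, 0, None, None]
```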

That completes the preprocessing. Next, I will use LightGBM to go from training to prediction. The data processed so far looks like this. [Image: スクリーンショット 2021-01-06 11.26.55.png]

3 Training and prediction

The features and the objective variable are split into X and y, and then further split into training data and evaluation data. The data is also converted into a form LightGBM can read with lgb.Dataset.


#Split into features and objective variable, then into train data and test data
import lightgbm as lgb
from sklearn.model_selection import train_test_split

X = keiba_data.drop(['rank','target'], axis=1)
y = keiba_data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

#Data conversion
lgb_train = lgb.Dataset(X_train, y_train)
#reference=lgb_train makes the evaluation data share the training data's binning.
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

The hyperparameters are tuned automatically using a framework called Optuna.


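%%time
#Hyperparameters are automatically set by Optuna.
#%%time must be the first line of the cell. On recent Optuna versions the LightGBM
#integration is shipped in the separate optuna-integration package.
!pip install optuna "optuna-integration[lightgbm]"
import lightgbm as lgb_orig
from optuna.integration import lightgbm as lgb

params = {
    'objective': 'binary',
    'metric': 'auc'
}

#verbose_eval and early_stopping_rounds were removed from train() in LightGBM 4.0,
#so early stopping is passed as a callback instead.
model = lgb.train(params, lgb_train, valid_sets=[lgb_train,lgb_eval],
                    num_boost_round=10,
                    callbacks=[lgb_orig.early_stopping(stopping_rounds=10, verbose=False)])
best_params_ = model.params

#Creating a model with the tuned parameters
model = lgb_orig.train(best_params_, 
                       lgb_train,
                       valid_sets=lgb_eval,
                       num_boost_round=100,
                       callbacks=[lgb_orig.early_stopping(stopping_rounds=10),
                                  lgb_orig.log_evaluation(period=1)])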

The final AUC was 0.816736. Since this is an AUC, it does not mean that 82% of the predictions are correct, but it was a better number than I expected. (For AUC, this site was easy to understand: https://techblog.gmo-ap.jp/2018/12/14/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%81%AE%E8%A9%95%E4%BE%A1%E6%8C%87%E6%A8%99-roc%E6%9B%B2%E7%B7%9A%E3%81%A8auc/)
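
One way to see why AUC is not the same as accuracy: it can be read as the probability that a randomly chosen positive example (here, a horse that finished in the top three) gets a higher score than a randomly chosen negative one, with ties counted as half. The following is a minimal sketch of that pairwise definition. It is not how LightGBM computes the metric internally, and the function name `pairwise_auc` and the sample data are made up.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = top three) and predicted probabilities
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
print(pairwise_auc(y_true, scores))
```

In this toy data, 5 of the 6 positive-negative pairs are ordered correctly, so the AUC is 5/6. The double loop is quadratic in the number of rows, so it is meant for understanding rather than for data of this article's size.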

Finally, let's look at the importance of features.


#Displays the importance of features.
#Use the training columns as the index so that keiba_data is left unchanged.
importance = pd.DataFrame(model.feature_importance(), index=X_train.columns,columns=['importance'])
importance=importance.sort_values(by="importance",ascending=False)
display(importance)

As expected, odds seems to be a fairly important feature. The jockey (jackie_c) also seems to be important. What surprised me was that popu and trainerA had low importance, while trainerB was very important. It may be better to ask someone who is familiar with horse racing about this. I've only bet on horse races a few times, so I still need to study this area.

In conclusion

This time I did only basic preprocessing, but the AUC was 0.816736, which is a good value. Whether the model is actually profitable, however, is another question. Since the model only judges whether a horse will finish in the top three, it does not consider the recovery rate, that is, how much you bet and how much comes back. Betting on popular horses makes it easier to hit, but the odds are lower and the refund is smaller. In short, "you win, but not much money comes back." Ideally, I think it would be good to earn a stable income from popular horses while also betting on long shots (unpopular horses with high odds). I would like to deploy the final model with Django or similar and build something that returns a prediction when you enter the date and race number.
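
The difference between hitting often and actually making money can be illustrated with a small calculation. This sketch takes the recovery rate to mean total refund divided by total stake, and assumes each bet pays stake × odds when it hits. The function names and every number below are invented for illustration and are not results from the model.

```python
def hit_rate(bets):
    """Fraction of bets that hit, for (stake, odds, hit) tuples."""
    return sum(1 for _, _, hit in bets if hit) / len(bets)


def recovery_rate(bets):
    """Total refund divided by total stake, for (stake, odds, hit) tuples."""
    total_stake = sum(stake for stake, _, _ in bets)
    total_refund = sum(stake * odds for stake, odds, hit in bets if hit)
    return total_refund / total_stake


# Hypothetical bets: (stake, odds, whether the bet hit)
popular = [(100, 1.2, True), (100, 1.3, True), (100, 1.1, False), (100, 1.2, True)]
long_shots = [(100, 12.0, True), (100, 15.0, False), (100, 20.0, False), (100, 18.0, False)]

print(hit_rate(popular), recovery_rate(popular))
print(hit_rate(long_shots), recovery_rate(long_shots))
```

In this made-up data the popular horses hit three times out of four yet return less than was staked, while a single long shot more than pays for three misses. A high AUC therefore says nothing by itself about the recovery rate.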

Having built this, I felt that it is great to be able to study programming with a subject you like. It was a lot of fun. As I wrote at the beginning, there are many other approaches to horse racing prediction.

・ Analyze tweets posted just before the race and predict based on which horses are mentioned in positive terms
・ Incorporate bloodlines into the features (for example, offspring of Deep Impact are strong)
・ Use another model, or treat it as a regression problem that predicts the finishing time

There are various possibilities like these, so I would like to try other methods in the future.

Thank you for reading to the end.

All codes are listed below. https://github.com/suzuki24/keiba

Articles that I used as a reference

・ https://qiita.com/Mshimia/items/6c54d82b3792925b8199
・ https://qiita.com/km_takao/items/0a448543961a97fc9c94
・ https://qiita.com/km_takao/items/70f7a7c3c9c533d7bee4
・ https://techblog.gmo-ap.jp/2018/12/14/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%81%AE%E8%A9%95%E4%BE%A1%E6%8C%87%E6%A8%99-roc%E6%9B%B2%E7%B7%9A%E3%81%A8auc/
