[PYTHON] I tried to predict the horses that will be in the top 3 with LightGBM

Purpose

Predict horse racing with machine learning and aim for a recovery rate of 100%.

What to do this time

Using all of the 2019 race result data obtained in the previous article, we predict which horses will finish in the top 3 with LightGBM.

Source code

First, preprocessing.

import datetime

import pandas as pd

def preprocessing(results):
    df = results.copy()

    # Remove rows whose finishing order contains non-numeric characters
    # (e.g. scratched or disqualified horses)
    df = df[~(df["Order of arrival"].astype(str).str.contains(r"\D"))]
    df["Order of arrival"] = df["Order of arrival"].astype(int)

    # Split the sex/age column into sex and age
    df["sex"] = df["Sex and age"].map(lambda x: str(x)[0])
    df["age"] = df["Sex and age"].map(lambda x: str(x)[1:]).astype(int)

    # Split the horse weight column, e.g. "482(+2)", into body weight and weight change
    df["body weight"] = df["Horse weight"].str.split("(", expand=True)[0].astype(int)
    df["Weight change"] = df["Horse weight"].str.split("(", expand=True)[1].str[:-1].astype(int)

    # Convert the win odds to float
    df["Win"] = df["Win"].astype(float)

    # Remove columns that are not used as features
    df.drop(["time", "Difference", "Trainer", "Sex and age", "Horse weight"], axis=1, inplace=True)

    # The scraped dates are strings such as "2019年5月26日"
    df["date"] = pd.to_datetime(df["date"], format="%Y年%m月%d日")

    return df

results_p = preprocessing(results)
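To confirm that the new columns and dtypes came out as intended, a quick check such as the following can be used (a sketch; the exact output depends on your scraped data):

# Quick sanity check of the preprocessed data
print(results_p.dtypes)
print(results_p[["Order of arrival", "sex", "age", "body weight", "Weight change", "Win"]].head())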

I want to divide this into training data and test data, but train_test_split cannot be used here because the training data must be older than the test data. Instead, we create a function that splits the data chronologically using the 'date' column, which is now a datetime type.

def split_data(df, test_size=0.3):
    # Sort the (race id) index by date and split it chronologically
    sorted_id_list = df.sort_values("date").index.unique()
    train_id_list = sorted_id_list[: round(len(sorted_id_list) * (1 - test_size))]
    test_id_list = sorted_id_list[round(len(sorted_id_list) * (1 - test_size)) :]
    train = df.loc[train_id_list].drop(["date"], axis=1)
    test = df.loc[test_id_list].drop(["date"], axis=1)
    return train, test
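As a quick sanity check (a sketch, assuming results_p still has the 'date' column), we can reproduce the split boundary and confirm that the latest training date does not come after the earliest test date:

sorted_ids = results_p.sort_values("date").index.unique()
n_train = round(len(sorted_ids) * 0.7)
# The latest date in the training portion should not exceed the earliest test date
print(results_p.loc[sorted_ids[:n_train], "date"].max())
print(results_p.loc[sorted_ids[n_train:], "date"].min())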

Categorical variables are converted into dummy variables, but the horse name has too many categories, so it is dropped this time.

results_p.drop(["Horse name"], axis=1, inplace=True)
results_d = pd.get_dummies(results_p)

If the order of arrival is 3rd or better, label it 1; otherwise label it 0, and treat this as the objective variable.

results_d["rank"] = results_d["Order of arrival"].map(lambda x: 1 if x < 4 else 0)
results_d.drop(['Order of arrival'], axis=1, inplace=True)
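Only the top 3 finishers in each race get label 1, so the classes are imbalanced; this is why class_weight='balanced' is passed to LightGBM below. A quick look at the balance (exact counts depend on your data):

# Label 0 (4th or worse) should be several times more frequent than label 1 (top 3)
print(results_d["rank"].value_counts())
print(results_d["rank"].value_counts(normalize=True))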

Using the split_data function we created, split the data into training and test sets and train with LightGBM.

import lightgbm as lgb

train, test = split_data(results_d, 0.3)
X_train = train.drop(["rank"], axis=1)
y_train = train["rank"]
X_test = test.drop(["rank"], axis=1)
y_test = test["rank"]

params = {
    "num_leaves": 4,
    "n_estimators": 80,
    "class_weight": "balanced",
    "random_state": 100,
}

lgb_clf = lgb.LGBMClassifier(**params)
lgb_clf.fit(X_train.values, y_train.values)

Evaluate with the AUC score.

from sklearn.metrics import roc_auc_score

y_pred_train = lgb_clf.predict_proba(X_train)[:, 1]
y_pred = lgb_clf.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_train, y_pred_train))
print(roc_auc_score(y_test, y_pred))
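As an optional visual check, the test-set ROC curve behind this AUC score can be drawn with scikit-learn and matplotlib, for example:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Plot the ROC curve for the test-set predictions
fpr, tpr, _ = roc_curve(y_test, y_pred)
plt.plot(fpr, tpr, label="test")
plt.plot([0, 1], [0, 1], linestyle="--", label="random")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()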

The result: training data 0.819, test data 0.812. I think that is a decent score considering no feature engineering has been done yet. Looking at the feature importances,

importances = pd.DataFrame(
    {"features": X_train.columns, "importance": lgb_clf.feature_importances_}
)
importances.sort_values("importance", ascending=False)[:20]

Looking at this, the model depends almost entirely on the win odds; in other words, it has simply become a model that bets on low odds, so I want to improve it by creating features.
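One way to see how heavily the model leans on the win odds is to retrain it without the 'Win' column and compare AUC scores; a minimal sketch, assuming the column names used above:

# Ablation sketch: drop the win odds and retrain with the same parameters.
# A sharp drop in test AUC would confirm the model relies mostly on the odds.
lgb_clf_no_odds = lgb.LGBMClassifier(**params)
lgb_clf_no_odds.fit(X_train.drop(["Win"], axis=1).values, y_train.values)
y_pred_no_odds = lgb_clf_no_odds.predict_proba(X_test.drop(["Win"], axis=1).values)[:, 1]
print(roc_auc_score(y_test, y_pred_no_odds))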

A detailed explanation is available in the video: Data analysis and machine learning starting with horse racing prediction.
