[PYTHON] A concrete method of predicting horse racing by machine learning and simulating the recovery rate

Purpose

Predict horse racing with machine learning and aim for a recovery rate of 100%

What to do this time

This article is a continuation of the following articles:

- Scraping race result data using pandas read_html
- Scraping detailed race information using Beautiful Soup
- Predict the horses that will be in the top 3 in LightGBM
- [Add past performance data of horses to features](https://qiita.com/dijzpeb/items/63cb783c7d45cb91d262)

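This time, I simulate how much I would win or lose if I actually used this model to bet on double wins (複勝, fukusho: a bet that pays out when the chosen horse finishes in the top places, normally the top three).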

Source code

First, scrape the refund table. If you scrape it as-is, the multiple values inside the double win and wide cells run together with no separator, so replace the line-break tag `<br />` with the literal string 'br' before parsing.

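from urllib.request import urlopen
# url is the race page to scrape (built from a race_id in the function below)
f = urlopen(url)
html = f.read()
# Turn each line-break tag into the literal text 'br' so cell values stay separable
html = html.replace(b'<br />', b'br')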
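
To illustrate why the replacement helps, here is a minimal sketch that uses only the standard library. The `CellText` parser and the sample cell are hypothetical stand-ins, not the article's code: the parser simply joins the text inside each `<td>`, which is assumed to mimic how table text is extracted when tags are discarded.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the text of each <td> cell, ignoring any tags inside it."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def cell_texts(html_bytes):
    parser = CellText()
    parser.feed(html_bytes.decode("utf-8"))
    parser.close()
    return parser.cells


# Hypothetical refund cell: three horse numbers separated by line breaks.
raw = b"<table><tr><td>3<br />5<br />11</td></tr></table>"

plain = cell_texts(raw)                              # tags dropped, digits run together
marked = cell_texts(raw.replace(b"<br />", b"br"))   # 'br' survives as a separator

print(plain)                    # ['3511']
print(marked[0].split("br"))    # ['3', '5', '11']
```

Without the replacement, '3511' cannot be split back into horse numbers reliably; with it, a plain string split recovers them.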

As in the previous article, write a function that takes a list of race_id values and scrapes the refund data for each race, then run it and convert the result into a single DataFrame.

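import time
from urllib.request import urlopen

import pandas as pd
from tqdm.notebook import tqdm

def scrape_return_tables(race_id_list, pre_return_tables=None):
    """Scrape the refund tables for each race_id and return {race_id: DataFrame}."""
    # Avoid a mutable default argument, which would be shared between calls
    return_tables = {} if pre_return_tables is None else pre_return_tables
    for race_id in tqdm(race_id_list):
        if race_id in return_tables:
            continue  # already scraped
        try:
            url = "https://db.netkeiba.com/race/" + race_id
            f = urlopen(url)
            html = f.read()
            # Replace line-break tags so multi-value cells can be split on 'br' later
            html = html.replace(b'<br />', b'br')
            dfs = pd.read_html(html)
            # The refund table is split across the second and third tables on the page
            return_tables[race_id] = pd.concat([dfs[1], dfs[2]])
            time.sleep(1)  # wait between requests so as not to overload the server
        except IndexError:
            continue  # the page has no refund tables; skip this race
        except Exception as e:
            print(e)  # report the error and stop, keeping what was scraped so far
            break
    return return_tables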

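return_tables = scrape_return_tables(race_id_list)
# Label every row with its race_id so rows stay traceable after concatenation
for key in return_tables:
    return_tables[key].index = [key] * len(return_tables[key])
# Stack the per-race tables into a single DataFrame
return_tables = pd.concat([return_tables[key] for key in return_tables])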
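
To see what the relabelling step does, here is a small sketch with two made-up tables standing in for scraped results. The race IDs, row labels and values are all hypothetical; only the index handling mirrors the code above.

```python
import pandas as pd

# Hypothetical stand-ins for scraped refund tables, keyed by race_id.
tables = {
    "race_a": pd.DataFrame({0: ["Win", "Double win"], 1: ["3", "3br5br11"]}),
    "race_b": pd.DataFrame({0: ["Win", "Double win"], 1: ["2", "2br7"]}),
}

# Overwrite each table's default 0..n-1 index with its race_id.
for key in tables:
    tables[key].index = [key] * len(tables[key])

# Stack everything into one DataFrame; the index now identifies the race.
combined = pd.concat([tables[key] for key in tables])

print(combined.index.tolist())  # one race_id label per row
print(combined.loc["race_b"])   # all rows belonging to one race
```

Because the index carries the race_id, the refund rows can later be joined to predictions that are indexed the same way.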

Next, create a Return class that processes the double win data into a usable form.

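class Return:
    def __init__(self, return_tables):
        self.return_tables = return_tables

    @property
    def fukusho(self):
        # Keep only the double win rows: column 1 holds horse numbers, column 2 refunds.
        # NOTE: the live netkeiba page labels this row '複勝'; match on that label
        # if your scraped table is in Japanese.
        fukusho = self.return_tables[self.return_tables[0]=='Double win'][[1,2]]
        # Split the 'br'-joined cells into one column per placed horse
        wins = fukusho[1].str.split('br', expand=True).drop([3], axis=1)
        wins.columns = ['win_0', 'win_1', 'win_2']
        returns = fukusho[2].str.split('br', expand=True).drop([3], axis=1)
        returns.columns = ['return_0', 'return_1', 'return_2']

        df = pd.concat([wins, returns], axis=1)
        for column in df.columns:
            # Remove thousands separators such as '1,250' before converting to int
            df[column] = df[column].str.replace(',', '')
        # Races with fewer placed horses leave gaps; fill them with 0
        return df.fillna(0).astype(int)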

rt = Return(return_tables)
rt.fukusho
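
The split-and-clean step can be tried on toy data. This is a sketch under assumptions: the two rows below are invented, and `pd.to_numeric` is swapped in for the plain string handling above so the conversion does not depend on how a given pandas version treats missing strings. The toy cells have at most three pieces, so no fourth column needs dropping here.

```python
import pandas as pd

# Hypothetical double win rows: horse numbers and refunds joined by 'br'.
rows = pd.DataFrame(
    {1: ["3br5br11", "2br7"], 2: ["110br1,250br300", "140br160"]},
    index=["race_a", "race_b"],
)

# expand=True spreads the pieces over columns; shorter rows are padded with missing values.
wins = rows[1].str.split("br", expand=True)
returns = rows[2].str.split("br", expand=True)
wins.columns = ["win_0", "win_1", "win_2"]
returns.columns = ["return_0", "return_1", "return_2"]

df = pd.concat([wins, returns], axis=1)
for column in df.columns:
    # Strip thousands separators, then convert the strings to numbers.
    df[column] = pd.to_numeric(df[column].str.replace(",", ""))
df = df.fillna(0).astype(int)

print(df)
```

The second race has only two placed horses, so its `win_2` and `return_2` end up as 0, which can never match a real horse number.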

Next, create a ModelEvaluator class that takes the trained LightGBM model and the refund data you just scraped, and evaluates the model by calculating its AUC score and the betting balance.

from sklearn.metrics import roc_auc_score

class ModelEvaluator:
    def __init__(self, model, return_tables):
        self.model = model
        self.fukusho = Return(return_tables).fukusho

    def predict_proba(self, X):
        # Probability of the positive class (finishing in the top 3)
        return self.model.predict_proba(X)[:, 1]

    def predict(self, X, threshold=0.5):
        # 1 = bet on this horse, 0 = do not bet
        y_pred = self.predict_proba(X)
        return [0 if p<threshold else 1 for p in y_pred]

    def score(self, y_true, X):
        return roc_auc_score(y_true, self.predict_proba(X))

    def feature_importance(self, X, n_display=20):
        importances = pd.DataFrame({"features": X.columns, 
                                    "importance": self.model.feature_importances_})
        return importances.sort_values("importance", ascending=False)[:n_display]

    def pred_table(self, X, threshold=0.5, bet_only=True):
        # Assumes X is indexed by race_id, like self.fukusho.
        # NOTE: in the raw netkeiba result table this column is named '馬番'.
        pred_table = X.copy()[['Horse number']]
        pred_table['pred'] = self.predict(X, threshold)
        if bet_only:
            # Only the horses to bet on, as a Series of horse numbers
            return pred_table[pred_table['pred']==1]['Horse number']
        else:
            return pred_table

    def calculate_return(self, X, threshold=0.5):
        pred_table = self.pred_table(X, threshold)
        # Every bet costs 100 yen
        money = -100 * len(pred_table)
        df = self.fukusho.copy()
        # Attach each race's placed horses and refunds to every bet (joined on race_id)
        df = df.merge(pred_table, left_index=True, right_index=True, how='right')
        for i in range(3):
            # Add the refund whenever the bet horse is the i-th placed horse
            money += df[df['win_{}'.format(i)]==df['Horse number']]['return_{}'.format(i)].sum()
        return money
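
The arithmetic in calculate_return can be checked by hand with a plain-Python sketch. The `balance` function and the sample data are hypothetical and not part of the article's code; they assume, as the code above does, a 100 yen stake per bet and refunds quoted per 100 yen.

```python
def balance(bets, payouts):
    """Return total refunds minus total stakes, with 100 yen staked per bet.

    bets:    list of (race_id, horse_number) pairs the model chose to bet on
    payouts: {race_id: {horse_number: refund per 100 yen}} for placed horses
    """
    money = -100 * len(bets)
    for race_id, horse in bets:
        # A bet pays out only if the horse is among that race's placed horses.
        money += payouts.get(race_id, {}).get(horse, 0)
    return money


# Hypothetical refunds for two races.
payouts = {
    "race_a": {3: 110, 5: 250, 11: 300},
    "race_b": {2: 140, 7: 160},
}
# Three bets: horse 5 and horse 2 place, horse 8 does not.
bets = [("race_a", 5), ("race_a", 8), ("race_b", 2)]

print(balance(bets, payouts))  # stakes 300, refunds 250 + 140 = 390, balance 90
```

A small case like this is a convenient cross-check that the merge-based version counts each winning bet exactly once.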

Now compute the balance for each threshold from 0 to 0.99 and plot the result:

me = ModelEvaluator(lgb_clf, return_tables)

gain = {}
n_samples = 100
for i in tqdm(range(n_samples)):
    threshold = i / n_samples
    gain[threshold] = me.calculate_return(X_test, threshold)
pd.Series(gain).plot()

I'm clearly losing money, so the model still needs improvement.
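
The balance above is in yen, while the stated goal is a recovery rate of 100%. As a rough sketch, the balance can be converted into a recovery rate (refunds divided by money staked) once the number of bets is known. The `recovery_rate` helper is hypothetical; in the article's setup the number of bets at a given threshold would be `len(me.pred_table(X_test, threshold))`.

```python
def recovery_rate(balance, n_bets, stake=100):
    """Refunds divided by money staked; 1.0 means break-even (100%)."""
    total_stake = stake * n_bets
    if total_stake == 0:
        return None  # no bets were placed, so the rate is undefined
    # balance = refunds - total_stake, so refunds = balance + total_stake
    return (balance + total_stake) / total_stake


# Hypothetical numbers: 100 bets of 100 yen each, ending 2,000 yen down.
print(recovery_rate(-2000, 100))  # 0.8, i.e. a recovery rate of 80%
```

Returning None when no bets are placed avoids a division by zero at thresholds so high that the model never bets.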

A detailed explanation is given in the video: Data analysis / machine learning starting with horse racing prediction
