Predict horse racing with machine learning and aim for a recovery rate of 100%

This article is a continuation of the following articles:

- Scraping race result data using pandas read_html
- Scraping detailed race information using Beautiful Soup
- Predict the horses that will be in the top 3 with LightGBM
- [Add past performance data of horses to features](https://qiita.com/dijzpeb/items/63cb783c7d45cb91d262)

This time, I will simulate how much I would win or lose if I actually used this model to bet on double wins (fukusho).

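First, scrape the payout (refund) table. If you scrape the page as is, the multiple entries in the double win and wide rows run together in a single cell, so replace the `<br />` line-break tag with the string `br` before parsing. That marker can then be used to split the entries.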

```
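from urllib.request import urlopen

f = urlopen(url)  # url: address of the race result page
html = f.read()
# Replace the line-break tag with a marker so the entries can be split later
html = html.replace(b'<br />', b'br')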
```
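To see why the replacement matters, here is a minimal sketch using only the standard library. The HTML fragment and the helper `cell_entries` are made up for illustration, and stripping tags with a regular expression only stands in for what a real table parser does.

```python
import re


def cell_entries(html_bytes, mark_breaks=True):
    """Return the entries of one table cell (hypothetical helper)."""
    if mark_breaks:
        # Same replacement as in the article: turn line breaks into a marker
        html_bytes = html_bytes.replace(b'<br />', b'br')
    # Strip all remaining tags, keeping only the text
    text = re.sub(r'<[^>]+>', '', html_bytes.decode('utf-8'))
    return text.split('br')


# Made-up cell holding three horse numbers separated by line breaks
cell = b'<td>3<br />7<br />12</td>'

separated = cell_entries(cell)                 # marker kept, entries recoverable
merged = cell_entries(cell, mark_breaks=False)  # tags dropped, entries run together
print(separated, merged)
```

Without the marker the three numbers collapse into one string, which is the problem the replacement works around.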

As in the previous article, create a function that takes a list of race_id values and scrapes the payout data for each race. Then run it and convert the result to a DataFrame.

```
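import time

import pandas as pd
from tqdm.notebook import tqdm
from urllib.request import urlopen


def scrape_return_tables(race_id_list, pre_return_tables=None):
    # Start from previously scraped tables, if any, so they are not fetched again
    return_tables = {} if pre_return_tables is None else pre_return_tables
    for race_id in tqdm(race_id_list):
        if race_id in return_tables:
            continue
        try:
            url = "https://db.netkeiba.com/race/" + race_id
            f = urlopen(url)
            html = f.read()
            # Mark line breaks so multi-entry cells can be split later
            html = html.replace(b'<br />', b'br')
            dfs = pd.read_html(html)
            # The payout data is spread over two tables on the page
            return_tables[race_id] = pd.concat([dfs[1], dfs[2]])
            time.sleep(1)  # wait between requests
        except IndexError:
            # The expected tables are missing: skip this race
            continue
        except:
            # Any other error (or a manual interrupt): stop and return what we have
            break
    return return_tables


return_tables = scrape_return_tables(race_id_list)

# Use race_id as the index of every row, then combine everything into one DataFrame
for key in return_tables:
    return_tables[key].index = [key] * len(return_tables[key])
return_tables = pd.concat([return_tables[key] for key in return_tables])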
```
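The last three lines of the block above label every row with its race_id before concatenating. A small sketch of that step on made-up tables (the race ids, row labels, and values here are illustrative assumptions, not scraped data):

```python
import pandas as pd

# Made-up tables standing in for the scraped ones
toy_tables = {
    'race_a': pd.DataFrame({0: ['Win', 'Double win'], 1: ['3', '3br7br12']}),
    'race_b': pd.DataFrame({0: ['Win', 'Double win'], 1: ['5', '5br9br1']}),
}

# Give every row of a table the race id as its index label
for key in toy_tables:
    toy_tables[key].index = [key] * len(toy_tables[key])

# Stack all tables; rows stay traceable to their race through the index
combined = pd.concat([toy_tables[key] for key in toy_tables])

# All payout rows of one race can now be selected by its id
race_a_rows = combined.loc['race_a']
print(combined)
```

Keeping the race_id in the index is what later allows the payout rows to be merged with the predictions by index.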

Next, create a Return class that processes the double win data into a usable form.

```
class Return:
    def __init__(self, return_tables):
        self.return_tables = return_tables

    @property
    def fukusho(self):
        # Double win rows: column 1 holds the horse numbers, column 2 the payouts
        fukusho = self.return_tables[self.return_tables[0]=='Double win'][[1,2]]
        # Split the 'br'-separated entries into columns and drop column 3
        wins = fukusho[1].str.split('br', expand=True).drop([3], axis=1)
        wins.columns = ['win_0', 'win_1', 'win_2']
        returns = fukusho[2].str.split('br', expand=True).drop([3], axis=1)
        returns.columns = ['return_0', 'return_1', 'return_2']
        df = pd.concat([wins, returns], axis=1)
        # Remove thousands separators so the values can be converted to int
        for column in df.columns:
            df[column] = df[column].str.replace(',', '')
        return df.fillna(0).astype(int)


rt = Return(return_tables)
rt.fukusho
```
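The splitting step is easier to follow on a tiny example. The sketch below uses made-up rows in the same `br`-separated shape and assumes at most three entries per cell, so it skips the `drop([3], axis=1)` call. It also fills gaps with the string `'0'` rather than `0`, a small deviation chosen so that the columns stay purely string-valued until the final conversion.

```python
import pandas as pd

# Made-up double win rows: column 1 = horse numbers, column 2 = payouts
toy = pd.DataFrame(
    {1: ['3br7br12', '5br9'], 2: ['150br1,230br410', '110br260']},
    index=['race_a', 'race_b'],
)

# One column per entry; rows with fewer entries get missing values
wins = toy[1].str.split('br', expand=True)
wins.columns = ['win_0', 'win_1', 'win_2']
returns = toy[2].str.split('br', expand=True)
returns.columns = ['return_0', 'return_1', 'return_2']

df = pd.concat([wins, returns], axis=1)
for column in df.columns:
    # '1,230' -> '1230' so that the conversion to int succeeds
    df[column] = df[column].str.replace(',', '')

result = df.fillna('0').astype(int)
print(result)
```

Each race ends up as one row, with the horse in `win_i` paired with the payout in `return_i`.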

Next, create a ModelEvaluator class that takes the trained LightGBM model and the payout data you just scraped, and evaluates the model by calculating the AUC score and the betting balance.

```
from sklearn.metrics import roc_auc_score


class ModelEvaluator:
    def __init__(self, model, return_tables):
        self.model = model
        self.fukusho = Return(return_tables).fukusho

    def predict_proba(self, X):
        # Probability of the positive class
        return self.model.predict_proba(X)[:, 1]

    def predict(self, X, threshold=0.5):
        # 1 (bet) when the probability is at least the threshold, otherwise 0
        y_pred = self.predict_proba(X)
        return [0 if p<threshold else 1 for p in y_pred]

    def score(self, y_true, X):
        return roc_auc_score(y_true, self.predict_proba(X))

    def feature_importance(self, X, n_display=20):
        importances = pd.DataFrame({"features": X.columns,
                                    "importance": self.model.feature_importances_})
        return importances.sort_values("importance", ascending=False)[:n_display]

    def pred_table(self, X, threshold=0.5, bet_only=True):
        pred_table = X.copy()[['Horse number']]
        pred_table['pred'] = self.predict(X, threshold)
        if bet_only:
            # Only the horse numbers that are actually bet on
            return pred_table[pred_table['pred']==1]['Horse number']
        else:
            return pred_table

    def calculate_return(self, X, threshold=0.5):
        pred_table = self.pred_table(X, threshold)
        # Stake 100 on every selected horse
        money = -100 * len(pred_table)
        df = self.fukusho.copy()
        # Attach each race's payout row to the bets placed in that race
        df = df.merge(pred_table, left_index=True, right_index=True, how='right')
        # Add the payout whenever the horse bet on is one of the paying horses
        for i in range(3):
            money += df[df['win_{}'.format(i)]==df['Horse number']]['return_{}'.format(i)].sum()
        return money
```
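To make the logic of `calculate_return` concrete, here is a small worked sketch on made-up numbers. The payout table and the bets are invented for illustration, and the `recovery_rate` line (total payout divided by total stake) is my own addition to connect the balance to the recovery rate in the title.

```python
import pandas as pd

# Made-up processed payout table, one row per race
fukusho = pd.DataFrame(
    {'win_0': [3, 5], 'win_1': [7, 9], 'win_2': [12, 1],
     'return_0': [150, 110], 'return_1': [230, 260], 'return_2': [410, 300]},
    index=['race_a', 'race_b'],
)

# Made-up bets: horses 7 and 2 in race_a, horse 5 in race_b
bets = pd.Series([7, 2, 5], index=['race_a', 'race_a', 'race_b'],
                 name='Horse number')

stake = 100 * len(bets)  # 100 per bet

# One row per bet, carrying the payout row of its race
df = fukusho.merge(bets, left_index=True, right_index=True, how='right')

payout = 0
for i in range(3):
    hit = df['win_{}'.format(i)] == df['Horse number']
    payout += df[hit]['return_{}'.format(i)].sum()

money = payout - stake          # balance, as returned by calculate_return
recovery_rate = payout / stake  # above 1.0 means the bets paid for themselves
print(money, recovery_rate)
```

Here two of the three bets hit (horse 7 pays 230, horse 5 pays 110), so 340 comes back on a stake of 300.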

When I actually run the calculation, sweeping the threshold from 0 to 0.99 and plotting the balance for each value:

```
me = ModelEvaluator(lgb_clf, return_tables)

# Balance on the test set for each threshold from 0.00 to 0.99
gain = {}
n_samples = 100
for i in tqdm(range(n_samples)):
    threshold = i / n_samples
    gain[threshold] = me.calculate_return(X_test, threshold)

pd.Series(gain).plot()
```
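How the threshold shapes this curve can be sketched without a model. The example below uses only plain Python and made-up pairs of predicted probability and payout, where a payout of 0 means the bet lost. The function `gain_at` is a hypothetical stand-in for `calculate_return`, not the article's implementation.

```python
# Made-up candidates: (predicted probability, payout if bet on; 0 = miss)
candidates = [(0.82, 150), (0.64, 0), (0.55, 230), (0.31, 0), (0.12, 410)]


def gain_at(threshold, candidates, stake=100):
    """Balance when betting on every candidate at or above the threshold."""
    payouts = [payout for prob, payout in candidates if prob >= threshold]
    return sum(payouts) - stake * len(payouts)


# Sweep the threshold in steps of 0.1
gain = {i / 10: gain_at(i / 10, candidates) for i in range(11)}
print(gain)
```

A low threshold bets on almost everything, while a high threshold places few or no bets, so the balance moves toward 0 as the threshold approaches 1.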

I'm clearly losing money, so the model still needs improvement.

A detailed explanation is given in the video: Data analysis / machine learning starting with horse racing prediction
