Deep learning is popular these days, so I decided to study it by trying something myself; this is my memo on horse racing prediction.
This is the day-7 article of the Investment Horse Racing Advent Calendar 2016.
Install pyenv and Python (Anaconda)
I installed Python 3.5.1 :: Anaconda 4.0.0 by referring to the article below.
http://qiita.com/oct_itmt/items/2d066801a7464a676994
Install Chainer
Install it with pip:
pip install chainer
JRA-VAN is a paid data service operated by a JRA subsidiary, and it offers a one-month free trial. This time I collected the data using the free trial.
There seem to be several programs that can import JRA-VAN data; TARGET frontier JV can export the data as csv, so I used it.
Incidentally, JRA-VAN compatible software is Windows-only, so this part of the work was done on Windows.
・Start TARGET frontier JV and select "Menu" -> "Event Results CSV"
・Select "Grade data (user settings)"
・Select the items required for training. I think this choice is the key point, but I was not sure, so I decided somewhat arbitrarily (see the image below).
After choosing the items, select the number of years and the racetracks to output, and run the export.
This produced data like the following:
07,08,11,Sapporo,1,Not won*,Turf,1500,3,稍,1,1,5,16.2,3,Male,2,Theo Black,Preceding,35.87,442,-14,53.0,Yuichi Kitamura,01102,Tomoyuki Umeda,01084
07,08,11,Sapporo,1,Not won*,Turf,1500,3,稍,2,2,6,22.8,12,Male,2,Meisho Early,Middle group,36.07,464,+4,54.0,Shinichi Akiyama,01019,Isa Yasuda,00340
07,08,11,Sapporo,1,Not won*,Turf,1500,3,稍,3,3,11,162.0,11,Male,2,Sunday charity,Rear,36.53,424,+6,51.0,Hiroto Mayuzumi,01109,Kunio Takamatsu,00219
...
For training with Chainer, I prepared the input data by referring to the mnist sample (a minimal sketch of the layout follows this list).

- Input data: a two-dimensional float32 array. Each row contains the information for one race, and that row is the input.
- Label data: an int32 array. In mnist the label is the correct digit; here it is the horse number of the first-place finisher of each race.
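A minimal sketch of that layout (an illustration of mine, assuming 0-based horse-number labels; the 240 features per race match the shapes printed in the second training log below):

import numpy as np

# Illustration: 3 races, 240 feature values per race
# (240 matches the loader.train_data shape in the second log below).
train_data = np.zeros((3, 240), dtype=np.float32)          # one row per race
train_data_answer = np.array([4, 0, 11], dtype=np.int32)   # winning horse's number per race (assumed 0-based)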
To train with Chainer, the data is loaded from the csv file into numpy arrays. The race results of a few dates are read as validation data, and the rest as training data.
Since Chainer's input data is unified to float32, the string data contained in the csv file has to be converted to numbers. I prepared the following dictionary for the conversion (a usage sketch follows it).
self.dataMap = {
    # column index in the csv -> categorical value -> numeric code
    3 : { "Sapporo": 0, "Hakodate": 1, "Fukushima": 2, "Tokyo": 3, "Nakayama": 4, "Kyoto": 5, "Niigata": 6, "Hanshin": 7, "Chukyo": 8, "Kokura": 9 },  # racecourse
    6 : { "Turf" : 0, "Da" : 1 },  # surface ("Da" = dirt)
    9 : { "Not" : 0, "Heavy" : 1, "稍" : 2, "Good" : 3 },  # going (bad / heavy / yielding / good)
    15 : { "Male" : 0, "Female" : 1, "Se" : 2 },  # sex ("Se" = gelding)
    18 : { "escape" : 0, "Preceding" : 1, "Middle group" : 2, "Insert" : 3, "Rear" : 4, "Drive in" : 5, "Digenea simplex" : 6, "" : 7 }  # running style ("Digenea simplex" is a mistranslation of マクリ, a sweeping late run)
}
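A sketch of how one csv row could be turned into numbers with this dictionary. This is my illustration of the idea, not the repository's exact code; in particular, zeroing the free-text columns (horse and jockey names) is an assumption:

import numpy as np

def convert_row(row, dataMap):
    # row: one csv line split into strings; dataMap: the dictionary above
    values = []
    for i, col in enumerate(row):
        if i in dataMap:
            values.append(float(dataMap[i][col]))  # categorical column -> numeric code
        else:
            try:
                values.append(float(col))          # numeric column (distance, odds, weight, ...)
            except ValueError:
                values.append(0.0)                 # assumption: free-text columns are zeroed
    return np.array(values, dtype=np.float32)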
pickle
Reading the csv also takes time once the amount of data grows, so after reading it once, I dump the objects to a file and load them from there when training.
Python has a library called pickle that can write an object to a file, so I used that.
import pickle as P

# write
with open('train_data.pickle', 'wb') as f:
    P.dump(self.train_data, f)

# read
with open('train_data.pickle', 'rb') as f:
    self.train_data = P.load(f)
I did not know which model would suit horse racing prediction, so I first tried the mnist sample as-is. Unlike mnist, a race has at most 18 horses, so I changed only the output size to 18.
import chainer
import chainer.functions as F
import chainer.links as L

class MLP(chainer.Chain):

    def __init__(self, n_units, n_out):
        super(MLP, self).__init__(
            # the size of the inputs to each layer will be inferred
            l1=L.Linear(None, n_units),  # n_in -> n_units
            l2=L.Linear(None, n_units),  # n_units -> n_units
            l3=L.Linear(None, n_out),    # n_units -> n_out
        )

    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        return self.l3(h2)

model = L.Classifier(MLP(args.unit, 18))
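Training follows the mnist example's trainer setup. A sketch of how it could look with the loader described above (the Adam optimizer and the loader variable names are assumptions of mine; the PrintReport columns match the logs below):

import chainer
from chainer import iterators, optimizers, training
from chainer.datasets import TupleDataset
from chainer.training import extensions

train = TupleDataset(loader.train_data, loader.train_data_answer)
test = TupleDataset(loader.test_data, loader.test_data_answer)

train_iter = iterators.SerialIterator(train, batch_size=100)
test_iter = iterators.SerialIterator(test, batch_size=100, repeat=False, shuffle=False)

optimizer = optimizers.Adam()  # assumption: same default as the mnist example
optimizer.setup(model)

updater = training.StandardUpdater(train_iter, optimizer)
trainer = training.Trainer(updater, (40, 'epoch'))
trainer.extend(extensions.Evaluator(test_iter, model))
trainer.extend(extensions.LogReport())
trainer.extend(extensions.PrintReport(
    ['epoch', 'main/loss', 'validation/main/loss',
     'main/accuracy', 'validation/main/accuracy', 'elapsed_time']))
trainer.run()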
With the mnist model used as-is, accuracy hardly improved at all. It seems I still need to review the data and rethink the model.
# unit: 1000
# Minibatch-size: 100
# epoch: 40
train_data count = 33972
train_data_answer count = 33972
test_data count = 263
test_data_answer count = 263
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 53.3197 2.88323 0.072 0.086455 6.08774
2 2.8392 2.86606 0.0755 0.0778307 11.7691
3 2.80957 2.84367 0.0756471 0.081164 25.6663
4 2.79768 2.93953 0.0758407 0.081164 39.172
5 2.79359 2.81899 0.0761471 0.0831217 52.5182
6 2.78751 2.82241 0.0754118 0.0644974 65.9139
7 2.78489 2.8214 0.0740882 0.0644974 79.0856
8 2.78344 2.8256 0.0753392 0.0644974 92.4037
9 2.78374 2.80649 0.0748824 0.0644974 105.891
10 2.78153 2.81414 0.0760588 0.0644974 119.413
11 2.78305 2.80919 0.0756047 0.061164 133.375
12 2.78163 2.81012 0.0749706 0.0644974 147.492
13 2.78179 2.81818 0.0759706 0.061164 160.974
14 2.78133 2.81274 0.0743529 0.0678307 174.484
15 2.78157 2.81185 0.0747493 0.0678307 188.008
16 2.78114 2.81094 0.0746765 0.0678307 201.59
17 2.78222 2.8136 0.0759706 0.0678307 215.782
18 2.78156 2.81085 0.0758407 0.0644974 229.198
19 2.78261 2.81022 0.0743529 0.0678307 242.806
20 2.78189 2.81007 0.0737353 0.0678307 256.197
21 2.78089 2.8106 0.0752647 0.0678307 269.714
22 2.78256 2.81243 0.0749853 0.0678307 283.141
23 2.78154 2.81041 0.0757059 0.0678307 296.677
24 2.78148 2.81015 0.0744706 0.0678307 310.393
25 2.78165 2.81023 0.0750442 0.0678307 324.221
26 2.78157 2.81032 0.0757353 0.0678307 338.199
27 2.7815 2.81081 0.0756176 0.0678307 352.488
28 2.78158 2.81084 0.0752353 0.0831217 366.459
29 2.78158 2.81058 0.0738348 0.0831217 380.612
30 2.78151 2.81075 0.0745882 0.0678307 395.066
31 2.78159 2.81096 0.0750882 0.0678307 409.354
32 2.7814 2.8106 0.0756176 0.0678307 423.486
33 2.78167 2.81094 0.0741593 0.0678307 437.62
I tried adding two layers to the mnist network.
class MLP(chainer.Chain):

    def __init__(self, n_units, n_out):
        super(MLP, self).__init__(
            # the size of the inputs to each layer will be inferred
            l1=L.Linear(None, n_units),  # n_in -> n_units
            l2=L.Linear(None, n_units),  # n_units -> n_units
            l3=L.Linear(None, n_units),  # n_units -> n_units
            l4=L.Linear(None, n_units),  # n_units -> n_units
            l5=L.Linear(None, n_out),    # n_units -> n_out
        )

    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        h3 = F.relu(self.l3(h2))
        h4 = F.relu(self.l4(h3))
        return self.l5(h4)
The result looks better than the original mnist model, but it is not stable, and since only the training accuracy keeps improving while validation accuracy stays flat, it is likely overfitting.
# unit: 600
# Minibatch-size: 100
# epoch: 40
train_data count = 33972
train_data_answer count = 33972
test_data count = 263
test_data_answer count = 263
loader.train_data = float32, shape = (33972, 240)
loader.train_data_answer = int32, shape = (33972,)
loader.test_data = float32, shape = (263, 240)
loader.test_data_answer = int32, shape = (263,)
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 12.466 2.91101 0.0711765 0.0731217 5.70191
2 2.7585 2.84412 0.0864118 0.0903704 11.2326
3 2.70088 2.81017 0.0917059 0.0997884 23.8984
4 2.67644 2.7853 0.0983186 0.0942857 35.879
5 2.66051 2.82611 0.104588 0.0592063 47.6442
6 2.64653 2.84772 0.110235 0.0937037 59.3844
7 2.63512 2.81853 0.111765 0.0592063 71.0073
8 2.61773 2.84415 0.120295 0.0831217 82.6477
9 2.60187 2.80639 0.125265 0.091164 94.1596
10 2.59428 2.798 0.130412 0.0944974 105.526
11 2.57489 2.82496 0.132566 0.071164 116.915
12 2.5647 2.84402 0.134647 0.096455 128.34
13 2.54531 2.91482 0.143324 0.0850794 139.941
14 2.53773 2.83752 0.148353 0.0897884 151.488
15 2.52841 2.81961 0.152006 0.0725397 162.79
16 2.51507 2.96342 0.152412 0.071164 174.238
17 2.50024 2.97278 0.158618 0.103704 185.85
18 2.48458 3.03544 0.165074 0.0844974 197.423
19 2.46567 2.98729 0.169794 0.111746 209.066
20 2.46163 2.97408 0.168559 0.081164 220.849
21 2.43796 3.05378 0.177029 0.0878307 232.583
22 2.42934 2.86844 0.181268 0.075873 244.434
23 2.39571 2.9371 0.191206 0.096455 256.263
24 2.37371 2.95642 0.197971 0.091746 268.319
25 2.35578 2.96039 0.207345 0.0978307 280.227
26 2.32705 3.01686 0.216471 0.091746 292.061
27 2.3103 3.077 0.221088 0.0931217 304.246
28 2.26667 3.06368 0.233706 0.106455 316.494
29 2.23368 3.05979 0.248378 0.135661 328.63
30 2.19393 3.45029 0.263412 0.0878307 340.724
31 2.16578 3.47368 0.269529 0.0931217 352.725
32 2.13084 3.31725 0.281765 0.0992063 364.935
33 2.09286 3.59374 0.2959 0.0925397 377.033
34 2.04235 3.6446 0.312088 0.107037 389.009
35 1.99226 3.73125 0.329559 0.0897884 401.031
36 1.95377 3.71884 0.343717 0.109206 412.944
37 1.90421 3.77256 0.360529 0.0978307 424.951
38 1.86084 4.0408 0.377588 0.10254 436.903
39 1.7942 4.35645 0.398176 0.105079 448.987
40 1.74267 4.43788 0.415752 0.0672487 460.879
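Given the overfitting visible in the log above, one standard countermeasure to try next would be dropout between the layers. A sketch of my own (not something tried in this article) applied to the 5-layer model:

import chainer
import chainer.functions as F
import chainer.links as L

class MLPDropout(chainer.Chain):

    def __init__(self, n_units, n_out):
        super(MLPDropout, self).__init__(
            l1=L.Linear(None, n_units),
            l2=L.Linear(None, n_units),
            l3=L.Linear(None, n_units),
            l4=L.Linear(None, n_units),
            l5=L.Linear(None, n_out),
        )

    def __call__(self, x):
        # randomly drop half of the activations during training
        h = F.dropout(F.relu(self.l1(x)), ratio=0.5)
        h = F.dropout(F.relu(self.l2(h)), ratio=0.5)
        h = F.dropout(F.relu(self.l3(h)), ratio=0.5)
        h = F.dropout(F.relu(self.l4(h)), ratio=0.5)
        return self.l5(h)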
So that is how I tried horse racing prediction using JRA data. I feel the accuracy will improve depending on how the data is selected and how the neural network is designed, so I will keep trying various things.
Here is the code I used this time (some of it is commented out due to trial and error ...): https://github.com/takecian/HorseRacePrediction