[PYTHON] [Machine learning] Which school will win this year's Hakone Ekiden? ~ From data to prediction ~

Table of contents

0. Introduction
1. Preparation ~ Apply for a free trial ~
2. Preparation ~ Prepare the dataset ~
3. Now put it in Datarobot!
4. Target setting, modeling
5. Predict test data!
6. Conclusion

0. Introduction

This article is a condensed version of the article I wrote for day 17 of the 2020 Datarobot Advent Calendar.

I probably put more energy into the original article, so I recommend reading it. If you find the original hard to read, this article should be easier to follow.

As the title says, I predicted the winning school of the 97th Hakone Ekiden (2021) using a tool called Datarobot. Datarobot can do things like

・ Data shaping and EDA (exploratory data analysis)
・ Building machine learning models and automatically proposing the best one

at the touch of a button, which makes it a very convenient tool. It is an AutoML tool: a tool that automates the model-building side of machine learning, a category that has been attracting attention in recent years. Datarobot is normally a paid product, but since 2020 a 14-day free trial has been available, so I used that to make the prediction.


# 1. Preparation ~ Apply for a free trial ~

First, apply for the Free Trial. The linked page looks like this:

スクリーンショット 2020-12-06 19.51.18.png

Once you fill in the required items, you are ready to use AutoML.


2. Preparation ~ Prepare the dataset ~

First, the dataset has to be prepared. In a competition the data is provided, so you only need to think about what model to build with it. This time, however, I had to start by building the dataset itself. I had never prepared a dataset from scratch, so this was a very good experience.

This time, the theme was "Predicting the winning school of Hakone Ekiden", so I decided to make a prediction like this.

スクリーンショット 2020-12-06 20.05.31.png

The procedure is as follows. First, I collect data that is likely to be useful as features (more on this later). I then use that data to predict each university's time for this year, that is, I treat it as a **regression problem**. Sorting the predicted times from fastest to slowest gives the predicted standings.
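The last step, turning predicted times into standings, can be sketched as below. This is only an illustration: the function name and the team names are hypothetical, and the times are made-up numbers in seconds.

```python
def rank_by_predicted_time(predicted_seconds):
    """Return (rank, team, seconds) tuples, fastest predicted time first.

    predicted_seconds maps a team name to its predicted total time in seconds.
    """
    ordered = sorted(predicted_seconds.items(), key=lambda item: item[1])
    return [(rank, team, seconds)
            for rank, (team, seconds) in enumerate(ordered, start=1)]


# Hypothetical teams and made-up predictions (seconds)
example = {"team_a": 39900.0, "team_b": 39600.0, "team_c": 40200.0}
for rank, team, seconds in rank_by_predicted_time(example):
    print(rank, team, seconds)
```

Because the ranking only depends on the order of the predicted times, any error that shifts every team by the same amount leaves the standings unchanged.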



Let's look at the data I collected. It covers the past **10** years and **207** teams in total. (Note that all of the data below was entered into an Excel file **by hand**, because I don't know how to scrape. I double-checked it, but some mistakes may remain.)

Data 1-University name

The university name should matter for predicting the time; for example, Aoyama Gakuin University tends to be faster than other universities.

In the past 10 years, a total of 27 universities have taken part in the Hakone Ekiden, and I use their team names as categorical data. There are sites with result pages such as "96th Hakone Ekiden" for the 87th through the 96th race, so I entered the data by referring to them.


Data 2-year

For some reason, records in track and field and other sports improve year by year. The human body itself should not be changing, so (I'm not sure, but) I think the cause is better training methods and equipment brought about by advances in science (and, of course, the athletes' own training). In recent years, "thick-soled shoes" have been a hot topic.

Since records seem likely to keep improving gradually over the years, I use the race number as a feature (for example, 90 for the 90th race). It does not contribute to the ranking, but it should contribute to predicting the time.


Data 3-Athletes' 10000m time

Of course, this should be relevant. The 10000m is run on a flat track while the Hakone Ekiden includes the climb up the mountains of Hakone, so the correlation may not be very strong, but there should be some effect.

Ten runners compete per team in the Hakone Ekiden, so there are ten times per team. With 207 teams, that is 2070 values in total. As with Data 1, I entered them by referring to pages such as the one for the 96th Hakone Ekiden (this was the hardest part...). I also ran into the following problems.


  1. Some runners have no 10000m time (missing values)
  2. The training data shows who actually ran, but for the prediction data (the 2021 race) we do not know who will run until race day

1. Some runners have no 10000m time (missing values)

First, regarding 1.: some runners have no 10000m time, 35 of the 2070 in total. That is not many, but leaving the missing values in would mean either losing data or being unable to train, which is a problem.

Fortunately, for the runners without a 10000m time, a 5000m time was listed instead. Since 10000m is twice 5000m, my first thought was "can I just double the 5000m time?", but a doubled 5000m time is obviously faster than a real 10000m time.

Pace naturally drops over a longer distance, so I applied the following correction.

スクリーンショット 2020-12-06 21.01.06.png

That is, I doubled the 5000m time and added 40 seconds. There is no scientific basis for the 40 seconds; it is simply the value that felt about right after comparing various records while entering the data by hand.
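As a minimal sketch of this correction (the function and parameter names are my own, and the 40-second penalty is the rule of thumb described above, not a validated constant):

```python
def estimate_10000m_seconds(time_5000m_seconds, penalty_seconds=40.0):
    """Estimate a 10000m time from a 5000m time.

    The 5000m time is doubled and a fixed penalty is added to account for
    the slower pace over the longer distance.
    """
    return 2 * time_5000m_seconds + penalty_seconds


# A 5000m time of 14:10.00 is 850 seconds.
estimated = estimate_10000m_seconds(14 * 60 + 10)
print(estimated)  # 1740.0 seconds, i.e. 29:00.00
```

Keeping the penalty as a parameter makes it easy to check how sensitive the later results are to the choice of 40 seconds.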


#### 2. The training data shows who actually ran, but for the prediction data (the 2021 race) we do not know who will run until race day

This is pretty important. Sixteen runners are entered per team, but we do not know which of them will run until race day (runners can be swapped, for example). The historical data, on the other hand, records who ran which section, so the two cannot be compared directly. Therefore, this time I use **the 10 fastest 10000m times among the 16 entered runners**. (There is a rule that only one international student may run, so if more than one international student is in the top 10, runners ranked 11th or lower are included instead.)

To match this, the 10 runners in the training data (the past data) are also sorted by 10000m time, fastest first, rather than by the section they ran.

This is not a very well-grounded operation, but since teams do tend to put strong runners in particular sections (for example, the 2nd section), sorting by time is not so strange.
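The selection rule described above could look like the following sketch. The function name and the input layout (a list of `(time_seconds, is_international)` pairs) are assumptions made for illustration; in the article the selection was done by hand in Excel.

```python
def pick_top10(entries, max_international=1):
    """Pick the 10 fastest 10000m times from the entered runners.

    entries: list of (time_seconds, is_international) pairs, one per runner.
    At most `max_international` international students are kept; any further
    international students are skipped and slower runners move up instead.
    """
    picked = []
    international_count = 0
    for time_seconds, is_international in sorted(entries):
        if is_international:
            if international_count >= max_international:
                continue  # only one international student may run
            international_count += 1
        picked.append(time_seconds)
        if len(picked) == 10:
            break
    return picked


# Made-up entry list: two international students are the two fastest runners.
entries = [(1650.0, True), (1660.0, True)]
entries += [(1700.0 + 5 * i, False) for i in range(14)]
print(pick_top10(entries))
```

The returned list is already sorted from fastest to slowest, which matches the ordering used for `time1` to `time10` in the training data.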



By the way, the times are recorded in seconds, to two decimal places.


Data 4-Wind direction, wind speed, temperature


Strong headwinds, hot weather and the like should affect the time. Because these conditions are common to all teams they do not affect the ranking, but they should contribute to predicting the time.

More specifically, taking into account roughly when the runners pass each place on the outbound and return legs, I used the following:

スクリーンショット 2020-12-06 21.28.03.png

I entered these data from the Japan Meteorological Agency website. I chose these three locations because they were the only places along the course where the agency observes temperature, wind speed, and so on.

Wind speed and temperature can be used as numerical data as they are, but what about wind direction? The Japan Meteorological Agency data uses classifications as fine as "north-northeast".

For directions written with three kanji characters, such as

north-northeast -> north
east-northeast -> east

I simplified them to a single character such as "north" or "east".

However, it seems too rough to force directions written with two kanji characters, such as "northeast", into either "north" or "east". So, in the manner of one-hot encoding, I decided to do the following.

スクリーンショット 2020-12-06 21.46.54.png

In other words, wind from the northwest is interpreted as the sum of the effects of "north" and "west", so "northwest" gets 1 in both "north" and "west". However, this makes winds with two-character names such as "northwest" look stronger than the others, so I did the following.

スクリーンショット 2020-12-08 20.58.44.png

As the figure below shows, the idea is to normalize the strength.

direction
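The encoding can be sketched as follows. It mirrors the rule used in data_format_train.py later in this article (a single direction gets sqrt(2), a two-letter direction gets 1 in each component, so both have the same vector length). The function name is hypothetical, and reducing a three-letter direction to its first letter is my reading of the "north-northeast -> north" rule above.

```python
import math


def encode_wind_direction(direction):
    """Encode a compass direction written with the letters N, S, W, E."""
    if len(direction) == 3:
        # e.g. "NNE" -> "N", "ENE" -> "E"
        direction = direction[0]
    components = [c for c in "NSWE" if c in direction]
    # One component gets sqrt(2); two components get 1 each, so the encoded
    # vector has length sqrt(2) in both cases.
    strength = math.sqrt(2) if len(components) == 1 else 1.0
    return {c: (strength if c in components else 0.0) for c in "NSWE"}


print(encode_wind_direction("NW"))
print(encode_wind_direction("N"))
print(encode_wind_direction("NNE"))
```

An empty string (no wind direction recorded) yields all zeros in this sketch.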

Data 5 - All Japan University Ekiden time

Above I used the 10000m times as a measure of individual running ability, but since an ekiden is a relay road race, ability as a team also matters. So I decided to use the time from the All Japan University Ekiden (hereafter "All Japan"). Together with Hakone and Izumo it is one of the three major university ekiden, so I considered it reliable. For example, for the Hakone Ekiden held at New Year 2020, I use the time from the All Japan held in 2019. I entered the data by referring to this site.

Of course, the runners may differ from those at Hakone, but there should be some correlation.

Some universities ran in the Hakone Ekiden but not in the All Japan: 79 of the 207 teams. There are this many because the Hakone Ekiden is only for universities in the Kanto region and takes 20 of them, whereas the All Japan has a Kanto qualifying round and only the strong Kanto teams get through.

I agonized over how to handle this. Since these teams failed to reach the All Japan (for example, by not getting through the Kanto qualifier), **I filled the missing values with the slowest time among the Kanto teams that ran in the All Japan plus about 3 to 5 minutes, using the same value for every such team**.

For example, if the slowest Kanto university in the All Japan ran 5 hours 25 minutes 0 seconds, a university that did not run in the All Japan is given 5 hours 30 minutes 0 seconds. Of course, this value changes every year.

This is probably a little unfair to teams that did not run in the All Japan, but I couldn't think of anything better...
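A minimal sketch of this fill-in rule for a single year is shown below. The function name, the dictionary layout, and the team names are hypothetical, and the 300-second margin is just the 5-minute case from the example above (the article uses 3 to 5 minutes). It assumes every team in the dictionary is a Hakone entrant, that is, a Kanto university.

```python
def fill_missing_all_japan(times, margin_seconds=300):
    """Fill missing All Japan times for one year.

    times maps team name -> All Japan time in seconds, or None if the team
    did not run. Missing values become the slowest recorded time plus a margin.
    """
    recorded = [t for t in times.values() if t is not None]
    fill_value = max(recorded) + margin_seconds
    return {team: (fill_value if t is None else t) for team, t in times.items()}


# Made-up year: the slowest recorded time is 5:25:00 (19500 s).
year_times = {
    "team_a": 5 * 3600 + 20 * 60,
    "team_b": 5 * 3600 + 25 * 60,
    "team_c": None,
}
print(fill_missing_all_japan(year_times))  # team_c becomes 19800 s (5:30:00)
```

Because the fill value is recomputed from each year's slowest time, it has to be applied year by year rather than once over the whole table.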

スクリーンショット 2020-12-11 15.46.28.png

Hakone Ekiden time (objective variable)

I didn't know this until now, but the Hakone Ekiden course changes once every few years.

However, none of the changes alters the total distance much, so the effect should be small, and I use the Hakone Ekiden times without any correction.


Rejected data 1 - Grade distribution

I wanted to include features such as the team's grade distribution (for example, two 1st-years, three 2nd-years, one 3rd-year, four 4th-years). A team with many 4th-years might have higher morale overall, so I think it would capture a mental effect.

However, I didn't have time to add this to the data, so it was dropped...


Rejected data 2 - Average temperature, average wind speed

At first I planned to use the day's average temperature and average wind speed (and wind direction) instead of the hourly values. However, when I actually looked at the Japan Meteorological Agency data, I found that

・ The daily average temperature mixes day and night temperatures, so it says little about conditions during the race.
・ The wind direction differs completely between day and night, so a daily value does not reflect the daytime wind.

For these reasons, I decided to use the hourly temperature and wind speed instead of the daily averages.


Rejected data 3 - Results of the Izumo University Ekiden

I wanted to include this, but the Izumo University Ekiden was not held this year because of the new coronavirus... The prediction data, that is, this year's Izumo result, therefore does not exist, so including it would be pointless and it was dropped.


About test data

The test data is collected in almost the same way as the training data, but the **wind direction, wind speed, and temperature** need detailed values (a specific hour, not just a date), and the Japan Meteorological Agency has not yet released forecasts for 1/2 and 1/3, so I have to work around that.

According to the Japan Meteorological Agency's one-month forecast, the outlook for the Kanto-Koshinetsu region from 12/26 to 1/8, for temperature, precipitation, and so on, is

・ 20-30% chance of being higher (or more) than normal
・ 40% chance of being about normal
・ 30-40% chance of being lower (or less) than normal

Therefore, I **use the mean of each corresponding column of the training data** as the value of that column in the test data.

I wish the weather forecast would come out soon...

Finishing the data preparation

Finally, convert the manually entered data into the clean form described above.

Code that formats the input data (training data)

data_format_train.py


import pandas as pd

#Path to the input file (placeholder, replace with your own)
path1 = "path1"
#Path to the output file (placeholder, replace with your own)
path2 = "path2"

#Name of columns
name_colmuns = ['year', 'univ','5000_time_idx',
                'wind_velocity_2_yokohama_10', 'wind_direction_2_yokohama_10', 'temp_2_yokohama_10', 
                'wind_velocity_3_yokohama_11', 'wind_direction_3_yokohama_11', 'temp_3_yokohama_11', 
                'wind_velocity_2_tsujido_11', 'wind_direction_2_tsujido_11', 'temp_2_tsujido_11', 
                'wind_velocity_3_tsujido_10', 'wind_direction_3_tsujido_10', 'temp_3_tsujido_10', 
                'wind_velocity_2_odawara_12', 'wind_direction_2_odawara_12', 'temp_2_odawara_12', 
                'wind_velocity_3_odawara_9', 'wind_direction_3_odawara_9', 'temp_3_odawara_9', 
                'time1', 'time2', 'time3', 'time4', 'time5', 'time6', 'time7', 'time8', 'time9', 'time10',
                'japan', 'japan_no_record', 'total_time']

data = pd.read_excel(path1)
data = data[name_colmuns]

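#Convert a 10000m time written as minutes.SSss (e.g. 28.5012 = 28 min 50.12 s) to seconds.
#Formatting to 4 decimals first restores trailing zeros that are lost when Excel stores the value as a number (28.5010 -> 28.501).
def time_to_time(x):
    minutes, frac = f"{float(x):.4f}".split(".")
    return 60*int(minutes) + int(frac)/100
#Convert a total time written as hours:minutes:seconds to seconds
def total_to_total(x):
    y = [int(i) for i in str(x).split(":")]
    return 60*60*y[0] + 60*y[1] + y[2]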

#Convert units to seconds
times = [f"time{i}" for i in range(1,11)]
data["total_time"] = data["total_time"].apply(total_to_total)
data["japan"] = data["japan"].apply(total_to_total)
data["japan_no_record"] = data["japan_no_record"].apply(total_to_total) #Data for all-Japan correction has been entered
for t in times:
    data[t] = data[t].apply(time_to_time)

#Indices (1-10) of the runners whose time is a doubled 5000m time and needs correcting
def hosei_func(x):
    return [int(i) for i in str(x).split(",")]
data["hosei"] = data["5000_time_idx"].apply(hosei_func)

time_column_number = list(range(21,31)) #Column positions of time1..time10
for i in range(data.shape[0]):
    if data.iloc[i,-1][0] < 0: #Last column is "hosei"; a negative entry means no correction is needed
        continue
    else:
        for v in data.iloc[i,-1]:
            data.iloc[i,time_column_number[v-1]] += 40 #The entered value is already the doubled 5000m time, so just add 40 seconds

#Sort each team's ten times from fastest to slowest
#(built as a new DataFrame to avoid modifying a row slice in place)
sorted_rows = [sorted(row) for row in data[times].to_numpy()]
data[times] = pd.DataFrame(sorted_rows, columns=times, index=data.index)

#Fill in missing All Japan times
for i in range(data.shape[0]):
    if data.iloc[i,-4] == 0: #Column -4 is "japan"; 0 was entered when the team has no All Japan record
        data.iloc[i,-4] = data.iloc[i,-3] #Use the fill-in time from "japan_no_record"

#Encode the wind direction data in a one-hot-like way
direction = ["wind_direction_2_yokohama_10","wind_direction_3_yokohama_11",
              "wind_direction_2_tsujido_11","wind_direction_3_tsujido_10",
              "wind_direction_2_odawara_12","wind_direction_3_odawara_9"]

new_direction_column = []
for d in direction:
    for s in "NSWE":
        new_direction_column.append(d + "_" + s)

before_size = data.shape[1]
data[new_direction_column] = 0.0
origin_direction_index = [4,7,10,13,16,19]

for i in range(data.shape[0]):
    for j,idx in enumerate(origin_direction_index):
        
        cnt = 0
        if "N" in data.iloc[i,idx]:
            cnt += 1
        if "S" in data.iloc[i,idx]:
            cnt += 1
        if "W" in data.iloc[i,idx]:
            cnt += 1
        if "E" in data.iloc[i,idx]:
            cnt += 1
        
        strength = 0
        if cnt == 2:
            strength = 1
        else:
            strength = 2**0.5

        if "N" in data.iloc[i,idx]:
            data.iloc[i,before_size + 4*j] = strength
        if "S" in data.iloc[i,idx]:
            data.iloc[i,before_size + 4*j + 1] = strength
        if "W" in data.iloc[i,idx]:
            data.iloc[i,before_size + 4*j + 2] = strength
        if "E" in data.iloc[i,idx]:
            data.iloc[i,before_size + 4*j + 3] = strength  

#Check that the team names work as labels (no accidental spelling variants, e.g. jyuntendo vs jyunntenndo)
teams = data["univ"].unique()
teams.sort()
print(teams)
print(len(teams))

data.drop(direction,axis = 1,inplace = True)
data.drop(["japan_no_record","hosei","5000_time_idx"],axis = 1,inplace = True)

#Verification
data.head(10)

#output
data.to_excel(path2,index = False)

Code that formats the input data (test data)

data_format_test.py



import pandas as pd

path1 = "path1" #Path to the input file (placeholder, replace with your own)
path2 = "path2" #Path to the file output by data_format_train.py (placeholder)
path3 = "path3" #Path to the output file (placeholder)

name_colmuns = ['year', 'univ', 
                'wind_velocity_2_yokohama_10', 'wind_direction_2_yokohama_10', 'temp_2_yokohama_10', 
                'wind_velocity_3_yokohama_11', 'wind_direction_3_yokohama_11', 'temp_3_yokohama_11', 
                'wind_velocity_2_tsujido_11', 'wind_direction_2_tsujido_11', 'temp_2_tsujido_11', 
                'wind_velocity_3_tsujido_10', 'wind_direction_3_tsujido_10', 'temp_3_tsujido_10', 
                'wind_velocity_2_odawara_12', 'wind_direction_2_odawara_12', 'temp_2_odawara_12', 
                'wind_velocity_3_odawara_9', 'wind_direction_3_odawara_9', 'temp_3_odawara_9', 
                'time1', 'time2', 'time3', 'time4', 'time5', 'time6', 'time7', 'time8', 'time9', 'time10', 
                'time11', 'time12', 'time13', 'time14', 'time15', 'time16',
                'japan'] #Missing values are entered manually because there are few

data = pd.read_excel(path1)
data = data[name_colmuns]

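#Convert a 10000m time written as minutes.SSss (e.g. 28.5012 = 28 min 50.12 s) to seconds.
#Formatting to 4 decimals first restores trailing zeros that are lost when Excel stores the value as a number (28.5010 -> 28.501).
def time_to_time(x):
    minutes, frac = f"{float(x):.4f}".split(".")
    return 60*int(minutes) + int(frac)/100
#Convert a total time written as hours:minutes:seconds to seconds
def total_to_total(x):
    y = [int(i) for i in str(x).split(":")]
    return 60*60*y[0] + 60*y[1] + y[2]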

#Convert units to seconds
times = [f"time{i}" for i in range(1,17)]
data["japan"] = data["japan"].apply(total_to_total)
for t in times:
    data[t] = data[t].apply(time_to_time)

#Sort each team's sixteen times from fastest to slowest
#(built as a new DataFrame to avoid modifying a row slice in place)
sorted_rows = [sorted(row) for row in data[times].to_numpy()]
data[times] = pd.DataFrame(sorted_rows, columns=times, index=data.index)

#Load the training data to take the mean wind direction, wind speed, and temperature
train_data = pd.read_excel(path2)

#Average

#Wind direction
direction = ["wind_direction_2_yokohama_10","wind_direction_3_yokohama_11",
              "wind_direction_2_tsujido_11","wind_direction_3_tsujido_10",
              "wind_direction_2_odawara_12","wind_direction_3_odawara_9"]

new_direction_column = []
for d in direction:
    for s in "NSWE":
        new_direction_column.append(d + "_" + s)
        
        
data.drop(direction,axis = 1,inplace = True)

#temperature,wind speed
temp_and_velocity =  ['wind_velocity_2_yokohama_10', 'temp_2_yokohama_10', 
                'wind_velocity_3_yokohama_11',  'temp_3_yokohama_11', 
                'wind_velocity_2_tsujido_11', 'temp_2_tsujido_11', 
                'wind_velocity_3_tsujido_10', 'temp_3_tsujido_10', 
                'wind_velocity_2_odawara_12', 'temp_2_odawara_12', 
                'wind_velocity_3_odawara_9', 'temp_3_odawara_9']


before_size = data.shape[1]
data[new_direction_column] = 0.0 #Float columns, because the averaged values are not integers
temp_and_velocity_data = train_data[temp_and_velocity].mean()
new_direction_data = train_data[new_direction_column].mean()
for i in range(data.shape[0]): #Every team (21 rows) gets the same averaged weather values
    data.iloc[i,2:14] = temp_and_velocity_data
    data.iloc[i,before_size:before_size + 24] = new_direction_data

#Drop unnecessary data
not_top10_times = [f"time{i}" for i in range(11,17)]
data.drop(not_top10_times,axis = 1,inplace = True)

data.isnull().sum() #Confirmation of missing values
data.head() #Various confirmation

data.to_excel(path3,index = False) #Output

The dataset is now ready!

スクリーンショット 2020-12-11 13.29.44.png

3. Now put it in Datarobot!

Data input

First, put this training dataset in Datarobot. From the home screen, click "Create Project" to go to the following screen.

スクリーンショット 2020-12-08 22.25.16.png

Click "Local file" and upload the dataset, and Datarobot reads the data for you. It also analyzes the data automatically at this point. Convenient!

スクリーンショット 2020-12-08 21.16.58.png

Outlier processing

If you wait for a while, you will see the following screen.

スクリーンショット 2020-12-09 9.03.57.png

On this screen you can see the distribution of each feature in the training data and whether there are missing values, outliers, and so on. Clicking "Data quality evaluation" at the upper right showed that there were outliers, so I checked them. One was in wind_velocity_2_yokohama_10 (the wind speed at Yokohama at 10:00 on 1/2), so I clicked on it.

Yokohama_vecocity

Then,

スクリーンショット 2020-12-08 21.19.56.png

This shows the distribution of the data. At this point there do not seem to be any outliers, but if you click "Show outliers" at the bottom left and wait a moment,

スクリーンショット 2020-12-08 21.19.43.png

you can see that there is an outlier: one value is large compared with the rest. I checked that it was not a typing error, and it was not, so I assume the wind really was strong that year and leave it as it is.

The other outliers were handled the same way. There are probably proper ways to treat outliers, but I don't know them, so the outliers were mostly left alone...


## Wind direction becomes a category value ...

Also, if you look closely at the data, some of the wind direction columns created earlier are recognized as categorical values.

スクリーンショット 2020-12-09 9.50.29.png

The unique count of 1 shows that, as a result of the wind direction processing, these columns are 0 in every row. A column with the same value in every row does not contribute to prediction, so these features are removed. (If the value is not 0 in the test data, does it just get thrown away...?)

If you click "Feature set" in the menu, you will see the screen below.

スクリーンショット 2020-12-08 21.35.24.png

I chose "useful features 43", which removed the meaningless features for me. When I checked, only the features that were 0 in every row had been removed, so it seems a feature counts as "useful" as long as it has any chance of contributing to the prediction.


4. Target setting, modeling

Now that we have looked through the data, the next step is to choose the "target", the quantity to predict. We want to predict the Hakone Ekiden time, so the target is "total_time". On the screen where the data above is displayed,

スクリーンショット 2020-12-08 21.43.01.png

You can set the target by clicking "Use as target".

Start modeling!

Finally, the modeling. With the target set, scroll up on the Data tab and you should see a screen like this.

スクリーンショット 2020-12-08 21.43.15.png

Set "Modeling Mode" here. The default is "Quick", but for better accuracy, select "Autopilot".

スクリーンショット 2020-12-08 21.43.58.png

And start.

Yokohama_vecocity

Once started, training will begin on the various models.

Yokohama_vecocity

After waiting for tens of minutes ...
...

Learning is complete!

スクリーンショット 2020-12-11 14.20.39.png

41 models were trained, and they are listed in order of cross-validation score, best first. Datarobot tries all sorts of models, including ones I have never used (my lack of knowledge...), so I am sure it built something more accurate than I could have by myself.

The top model is the one that minimizes RMSE, the evaluation metric, in cross-validation, so let's take a closer look at it.

Model description

Clicking a model's name first shows a description of that model.

スクリーンショット 2020-12-11 14.21.22.png

First, the data is preprocessed with "Regularized Linear Model Preprocessing v20". I wasn't sure what that was, so I clicked on "Regularized Linear Model Preprocessing v20" and got the following display.

Yokohama_vecocity

This shows that it is preprocessing for regularized linear models developed by Datarobot itself. So Datarobot does not rely only on libraries such as scikit-learn and Keras; it also builds its own components.

For the prediction itself, a "Nystroem Kernel SVM Regressor" is applied after this preprocessing. Nystroem seems to be a method that approximates a kernel's feature map with a low-rank approximation (I'm not confident about this). The regression is then done with a kernel SVM.


Effect of features

Next, let's see how much each feature contributes to the prediction. Clicking "Impact of features" under the model's "Interpretation" brings up the screen below.

スクリーンショット 2020-12-11 14.22.16.png

Let's consider what these feature impacts tell us.


### All Japan

Looking at the feature impact, "japan", that is, the All Japan University Ekiden result, has a large influence on the prediction. As expected, team strength matters a great deal.


### Wind direction, wind speed

Below that there are many "direction" features, that is, wind direction. Wind direction may well have a large effect on the time. However, since there are only 10 different sets of wind direction data, it may just be a coincidence that these were good features on this dataset.


University name

The most surprising thing is that "university name" has so little impact. I added it expecting the model to give weight to traditionally strong universities, but it does not seem to have mattered much. Even the same university can run very different times in different years, which may be why it is not a good indicator.

time

Among the time features, the ones with impact are time1, time8, time9, and time10. So the times of the fastest runner and of the slower runners on each team are what matter.

Is the result driven by the fastest runner (is there an outstandingly fast ace?), or dragged down by the slowest runners?

Looking at the effects of each feature,

time1 スクリーンショット 2020-12-11 14.31.08.png

time8 スクリーンショット 2020-12-11 14.31.22.png

time9 スクリーンショット 2020-12-11 14.31.30.png

time10 スクリーンショット 2020-12-11 14.31.39.png

・ time1, time8: teams with **faster** time1 and time8 have **faster** Hakone Ekiden times
・ time9, time10: teams with **faster** time9 and time10 have **slower** Hakone Ekiden times

time1 and time8 have a sensible interpretation, but time9 and time10 are counterintuitive. With so little data, this kind of thing can happen. (There may be some underlying factor, but I don't know what...)


Evaluation of the model

If you click on the "evaluation" part of the model, you should see the following screen.

スクリーンショット 2020-12-11 14.37.23.png

The predictions track the actual values reasonably well, and the coefficient of determination was

R^2 = 0.7706

Not bad, perhaps? Next, looking at the residuals,

スクリーンショット 2020-12-11 14.37.52.png

we get this graph. The histogram on the far right looks a little skewed, but if you squint it resembles a normal distribution.
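For reference, the two metrics mentioned in this article, RMSE and the coefficient of determination R^2, can be computed from actual and predicted values as in the sketch below. This is a plain-Python illustration of the standard definitions, not Datarobot's implementation, and the example values are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up total times in seconds, for illustration only
actual = [39600.0, 39900.0, 40200.0]
predicted = [39700.0, 39900.0, 40100.0]
print(rmse(actual, predicted))
print(r_squared(actual, predicted))
```

RMSE is in the same unit as the target (seconds here), while R^2 is unitless and compares the model against always predicting the mean time.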

5. Predict test data!

Finally, I feed in the test data and predict this year's Hakone Ekiden times!!

How will it turn out??

First, upload the test data you created earlier. Click "Forecast" and click "Import data from here" at the bottom right to upload.

スクリーンショット 2020-12-13 17.13.58.png

Then click on "Calculate Forecast" at the bottom right and wait for a while.

スクリーンショット 2020-12-13 17.15.11.png

Then download the forecast and take a look. As downloaded, the file only lists the times, so the ranking is not obvious; sort by time to make a standings table.

Code to convert forecast data downloaded from Datarobot into a standings

data_format_predict.py


import pandas as pd
import numpy as np

path1 = "path1" #Path to the downloaded forecast file (placeholder, replace with your own)
path2 = "path2" #Path where the standings should be written (placeholder)

predict = pd.read_csv(path1)

univ = np.array(['aoyama','tokai','kokugakuin','teikyou' ,'tokyo_kokusai', 'meiji', 'waseda'
 ,'komazawa' ,'souka', 'toyo', 'jyuntendo' ,'tyuuou', 'jyousai', 'kanagawa'
, 'kokushikan', 'nihon_taiiku', 'yamanashi_gakuin' ,'housei' ,'takusyoku',
 'sennsyuu', 'rengou'])

predict["univ"] = univ[predict["row_id"]]

#Sort by time
predict.sort_values(by = 'Prediction',inplace = True,ignore_index = True)

#Function to convert seconds to "hours:minutes:seconds" notation
def to_time_format(time):
    x = [0] * 3
    x[0] = str(int(time//(60*60))).zfill(2)
    time %= (60*60)
    x[1] = str(int(time//60)).zfill(2)
    time %= 60
    x[2] = str(int(time)).zfill(2)
    return ':'.join(x)

predict["Prediction"] = predict["Prediction"].apply(to_time_format)

predict.drop("row_id",axis = 1,inplace = True) #Erase unnecessary data

predict.to_excel(path2,index = False)  #Output

Result

The result is....

| Rank | University | Time |
|---|---|---|
| 1 | Waseda | 10:53:30 |
| 2 | Komazawa | 10:54:47 |
| 3 | Meiji | 10:56:01 |
| 4 | Tokai | 10:58:25 |
| 5 | Aoyama Gakuin | 10:59:55 |
| 6 | Juntendo | 11:00:23 |
| 7 | Toyo | 11:01:26 |
| 8 | Kokugakuin | 11:03:29 |
| 9 | Teikyo | 11:04:08 |
| 10 | Nippon Sport Science | 11:04:25 |
| 11 | Tokyo International | 11:05:18 |
| 12 | Chuo | 11:05:41 |
| 13 | Yamanashi Gakuin | 11:07:48 |
| 14 | Josai | 11:10:04 |
| 15 | Kanto Student Union | 11:11:14 |
| 16 | Kanagawa | 11:11:15 |
| 17 | Takushoku | 11:13:21 |
| 18 | Soka | 11:13:29 |
| 19 | Kokushikan | 11:15:12 |
| 20 | Hosei | 11:15:28 |
| 21 | Senshu | 11:22:10 |

That is how it turned out!!

Discussion

As a result, **Waseda University was predicted to win**. As for the times, the wind speed, wind direction, and temperature were averaged over the training data, which means **the wind ends up blowing from every direction at once**. So the times themselves are probably not very reliable.

Looking at the standings themselves, on the other hand, Aoyama Gakuin University, which has the image of always winning, is predicted to finish 5th.

Also, "Komazawa, Aoyama Gakuin, Tokai, Meiji" are said to be the top four, yet the model predicts that Waseda will beat them to the title. I would like to think a little about why the model gave this answer.

As discussed in "Effect of features", apart from conditions such as wind speed and temperature, this model is strongly influenced by "japan", "time8", "time10", "time1", and so on. Ranking the teams by each of these features gives


| Rank | japan | time8 | time10 | time1 |
|---|---|---|---|---|
| 1 | Komazawa | Meiji | Meiji | Kokushikan |
| 2 | Tokai | Komazawa | Komazawa | Tokyo International |
| 3 | Meiji | Waseda | Aoyama Gakuin | Komazawa |
| 4 | Aoyama Gakuin | Aoyama Gakuin | Juntendo | Soka |
| 5 | Waseda | Juntendo | Waseda | Takushoku |
| 6 | Toyo | Chuo | Chuo | Waseda |
| 7 | Teikyo | Nippon Sport Science | Toyo | Nippon Sport Science |
| 8 | Juntendo | Kanto Student Union | Kanto Student Union | Yamanashi Gakuin |
| 9 | Kokugakuin | Toyo | Nippon Sport Science | Chuo |
| 10 | Tokyo International | Kanagawa | Kanagawa | Toyo |
| 11 | Nippon Sport Science | Tokai | Tokyo International | Aoyama Gakuin |
| 12 | Yamanashi Gakuin | Yamanashi Gakuin | Takushoku | Juntendo |
| 13 | Josai | Tokyo International | Kokugakuin | Tokai |
| 14 | Soka | Josai | Yamanashi Gakuin | Meiji |
| 15 | Senshu | Takushoku | Josai | Kanagawa |
| 16 | Chuo | Teikyo | Kokushikan | Josai |
| 17 | Kanagawa | Kokugakuin | Hosei | Kokugakuin |
| 18 | Kokushikan | Hosei | Soka | Teikyo |
| 19 | Hosei | Soka | Tokai | Hosei |
| 20 | Takushoku | Kokushikan | Teikyo | Kanto Student Union |
| 21 | Kanto Student Union | Senshu | Senshu | Senshu |

Waseda University and Komazawa University are consistently near the top. Tokai University and Aoyama Gakuin University rank slightly lower on these features, which suggests why they are predicted to be slower than Waseda, Meiji, and Komazawa.

It is not clear why Waseda was predicted to be faster than Komazawa, but when I looked up the average time of the 10 runners, it was as follows.

| Average time ranking | University | Average 10000m time (minutes:seconds) |
|---|---|---|
| 1 | Waseda | 28:19 |
| 2 | Komazawa | 28:26 |
| 3 | Meiji | 28:28 |
| 4 | Chuo | 28:37 |
| 5 | Juntendo | 28:42 |
| 6 | Tokai | 28:43 |
| 7 | Aoyama Gakuin | 28:44 |
| 8 | Toyo | 28:49 |
| 9 | Nippon Sport Science | 28:51 |
| 10 | Kokugakuin | 28:53 |
| 11 | Tokyo International | 28:56 |
| 12 | Kanto Student Union | 28:57 |
| 13 | Kanagawa | 28:58 |
| 14 | Yamanashi Gakuin | 29:00 |
| 15 | Teikyo | 29:01 |
| 16 | Josai | 29:02 |
| 17 | Soka | 29:05 |
| 18 | Takushoku | 29:06 |
| 19 | Hosei | 29:12 |
| 20 | Kokushikan | 29:12 |
| 21 | Senshu | 29:37 |

Note: For runners who only have a 5000m time, the 10000m time is the doubled 5000m time plus 40 seconds, so this may differ from average-time rankings on other websites.


Waseda beats Komazawa, though only slightly. The average time was not included as a feature, but I interpret it as having had an effect through the individual times overall (?).


6. Conclusion

I tried predicting the Hakone Ekiden standings with Datarobot. I think the good thing about Datarobot is that it frees up time to spend on collecting the necessary data and on thinking about what the predictions mean.

The free trial is pretty easy to get, so if there is something you want to try, give it a go.

By the way, **I will be cheering for Waseda University with all my might at this year's Hakone Ekiden!!!!!!**

Thank you for reading to the end.

