[PYTHON] Until the introduction and adoption of the research proposal that won first place in KDD CUP 2019

This is DoCoMo's Ochiai. In this article, I will introduce the efforts [^ 1] that won the championship in one category of the KDD Cup, which is a data analysis competition held at KDD2019. I will.

KDD Cup 2019 Overview

The outline of KDD was introduced in Previous article, so I will focus on KDD Cup here. I will explain. The KDD Cup is a data science competition attached to KDD, first in 1997 and with a tradition of more than 20 years. Until last year, it was a competition in which data and tasks were given, some predictions were made, and the accuracy was competed, as in a normal competition such as Kaggle. I have been challenging the KDD Cup at DoCoMo R & D, and in 2016 I tried it for the first time and became a finalist. In KDD Cup 2019, the conventional competition for prediction accuracy will continue as Task1 of Regular ML Track, and in addition to that, you can freely set tasks using given data and propose research. Regular ML Track Task2, machine learning AutoML Track, which automatically performs each process (feature extraction, prediction model construction, verification, etc.), and Humanity RL Track, which competes for measures to prevent the spread of malaria infection by reinforcement learning, have been newly established. At DoCoMo, we participated in all tracks and were able to win first place in the Regular ML Track Task 2. https://www.kdd.org/kdd2019/docs/Winners_Regular_Baidu.pdf

Proposal Title: Simulating the Effects of Eco-Friendly Transportation Selections for Air Pollution Reduction Keiichi Ochiai, Tsukasa Demizu, Shin Ishiguro, Shohei Maruyama, Akihiro Kawana Research introduction movie: https://www.powtoon.com/online-presentation/bugFjP07kIK/eco-friendly-transportation-selections

Details of Regular ML Track

The Regular ML Track was sponsored by Baidu in China, and a transfer search log for Baidu Maps was provided. There are four types of logs, but here we will explain only the three types used in Task2.

Description of the log provided

Search query log (quoted from official page)

検索クエリ

In the search query log, sid is session ID, pid is user ID (actually rounded by people with similar profiles to protect privacy), req_time is the time when the search request was made, o is the latitude and longitude of the departure point, d Is the latitude and longitude of the destination.

Route candidate log

経路候補ログ

This is also a quote from the official page. Multiple routes (transportation, transportation mode) are presented for one search query. transport mode is an ID that indicates transportation, for example, 1 is a bus, 2 is a subway, and so on. However, since it was not disclosed which ID corresponds to which transportation, it was estimated from various information. After that, Distance is the distance, ETA is the travel time, and estimated price is the usage fee. There are multiple such data for one query.

Click log

クリックログ

This is also a quote from the official page. The click log records the session ID and the selected traffic mode. Using sid as a key, you can link the three logs described so far.

Regular ML Track Task1 was a task to predict the traffic mode used by the user based on the logs explained so far. In Task2, you can freely set tasks and make research proposals using the above logs. Task1 can be evaluated with predictive accuracy, but Task2 does not have such an index, so it was evaluated in a format like a normal paper submission in which a 4-page paper is written in English and the committee member evaluates the content. ..

Proposal content

Our team made a proposal linking the environmental issues of choosing a traffic mode and reducing air pollution. I will explain the details from here.

Research background

Environmental problems are one of the important social problems, and efforts to reduce CO2 are being carried out on a global scale. On the other hand, according to the UN Announcement, CO2 emissions in 2017 turned to increase for the first time in four years, and the Paris Agreement 2 There is also an opinion that the ℃ target (keeping the temperature rise after the Industrial Revolution within 2 ℃) may be difficult.

Based on this situation, I think that not only national and corporate-level efforts but also individual efforts are important. One way for individuals to tackle environmental problems is to choose environmentally friendly transportation in their daily lives. However, simply avoiding the use of transportation that emits CO2 will hinder people's lives, so such actions are not considered to be selected. On the other hand, if it does not interfere with your life, you may be asked to choose a means of transportation that will lead to CO2 reduction. It is thought that users will be motivated to make a choice if they know how much they can contribute to CO2 reduction when they change their means of transportation. However, to show the effect quantitatively, we need a log of people's means of transportation.

Therefore, in this research, we will simulate the amount of reduction in CO2 emissions when an environmentally friendly transportation method is selected by using the log of the transfer search service and assuming that the vehicle actually moved by the clicked transportation method. .. In addition, increasing walking and biking will have a positive effect on health, so we will quantitatively evaluate the effect on health.

approach

Basic idea

From the route candidate log presented to a certain user, the CO2 emissions when using each traffic mode can be calculated by the following formula.

CO2 emissions = distance traveled (km) x unit distance / CO2 emissions per person (g / person / km)

Since the travel distance is in the route candidate log, CO2 emissions can be calculated if the unit distance and CO2 emissions per person are known. Some of the data is publicly by the Transportation Ecology and Mobility Foundation, and we will use it. .. For example, if Bus and Cycling are the following logs of route candidates, the calculation can be performed as shown in the red frame below.

CO2_calc.png

You can also see from the click log that this user was clicking on the Bus. Here we define that alternative modes of transportation are acceptable:

** Alternative Route Travel Time (ETA) ≤ Clicked Route Travel Time ** ** CO2 emissions of alternative route <CO2 emissions of clicked route **

The first condition is that the travel time is less than or equal to the travel time originally selected by the user. The eco-friendly transportation modes are walking and biking, but even if there is a candidate to walk far, it will not be selected. On the other hand, it is assumed that they will be accepted if the travel time of the originally selected traffic mode is not exceeded. The second condition is to reduce CO2 emissions. According to this definition of acceptability, Cycling is acceptable in the previous example. Apply this to all search logs. In the actual log, since the transportation means that satisfy the above conditions are searched for hundreds of thousands of search queries, it can be formulated as a ** combinatorial optimization problem **.

Formulation as an integer optimization problem

Formulate the traffic mode selection as a 0-1 integer optimization problem with constraints on travel time and CO2 emissions. The objective functions and constraints to be minimized are as follows.

定式化

Where $ P_ {i, j} $ is the CO2 emissions when user $ i $ selects traffic mode $ j $, and $ Q_ {i, j} $ is user $ i $ for traffic mode $ j $. Travel time when selected, $ Q \ prime_ {i, j} $ is the travel time in traffic mode clicked by user $ i $ (assuming it actually traveled), $ X_ {i, j} $ is selected It is a value like a One-hot vector where only the traffic mode is 1 and the others are 0. In addition, $ m $ indicates the number of session IDs, and $ n $ indicates the number of traffic modes. At the time of mounting, the units are different for CO2 emissions and travel time, so both are normalized.

The tutorial by Professor Umetani of Osaka University was very helpful for the formulation as an optimization problem.

Introduction to Combinatorial Optimization: From Linear Programming to Integer Programming https://www.slideshare.net/shunjiumetani/ss-17197023

If it can be formulated, it will use the existing solver to solve it. This time, I used a library called PuLP.

result

We compared the results after optimization with the user's click log as the baseline. The results are shown in the table below. Optimized for search logs of about 400,000 queries.

結果

It seems that CO2 emissions can be reduced by about 9.23%. Surprisingly, the result was that the travel time could be reduced by 9.96%. The user may not necessarily be clicking on the fastest route.

Next, let's see how the means of transportation have changed due to optimization. In the table below, the left in parentheses is the traffic mode clicked by the user, and the right is the traffic mode selected by optimization.

定性結果

Since the published log is in the center of Beijing, it seems that there were many cases of traveling short distances by bus or subway, and it seems that there were many changes to change it to a bicycle (red frame in the table). On the other hand, in the case of private cars (Driving), the number of bicycles that have changed to bicycles is small, and many of them remain private cars (blue frame in the table). Where you travel by car, the distance may be long and there may be no alternative.

I will only give an overview of the effects on health. A study from Oxford University analyzed the effect of biking on reducing mortality (the paper is this). According to this study, WHO-recommended physical activity (11.25 METh / week physical exercise, 150 minutes of moderate aerobic exercise) reduces overall mortality by 10%. From this simulation, users can calculate an average of 23.04 minutes (WHO recommended 13.63%) of riding a bicycle per day. We considered that combining the two could reduce total mortality by $ 10 % \ times 13.63 % = 1.36 % $.

As you can see from the results of changes in transportation, the result of this study is that bicycles can be replaced by bicycles. However, when actually trying to do this, there is a problem with bicycle parking space, and the accepted transportation method changes depending on the weather (I do not want to move by bicycle if it rains), so how to get the user to choose a transportation method with low CO2 There are still issues such as UI / UX until practical use. In addition, although it is assumed that you moved according to the means of transportation / route you clicked in the transfer search, there is also a fundamental restriction that you do not know how you actually moved.

Implementation

Finally, I would like to introduce a little about the implementation. For the implementation, I referred to the following article. https://qiita.com/samuelladoco/items/703bf78ea66e8369c455 https://qiita.com/mzmttks/items/82ea3a51e4dbea8fbc17

Create a dictionary in advance that contains CO2 emissions in c_co2 [i, j] and travel time (user i, traffic mode j) in c_eta [i, j]. For example, it looks like this. c_eta[2,1] = 1976.0, c_eta[2,3] = 1146.0, c_eta[2,4] = 1446.0, c_eta[2,5] = 2246.0, c_eta[2,6] = 818.0 In this example, it is assumed that the user with the user ID No. 2 is presented with traffic modes 1,3,4,5,6 as candidates, and the travel time in each traffic mode is as shown above.

import pulp
problem = pulp.LpProblem("ETA-CO2 minimize", pulp.LpMinimize)
x = {}

# 0-Define one variable(Constraint 3 X_{i,j}Is 0,Corresponds to the constraint that 1 is 2 values)
# define 0-1 variable
for i in I:
    for j in J:
        x[i,j] = pulp.LpVariable("x({:},{:})".format(i,j),  0, 1, pulp.LpBinary)

#Define the objective function. Normalize by dividing the travel time and CO2 emissions by the maximum value.
# define objective function. ETA and CO2 emission are normalized by dividing max
max_co2 = max(c_co2.values())
max_eta = max(c_eta.values())
problem += pulp.lpSum( ((c_co2[i,j]/max_co2) * x[i,j]) + ((c_eta[i,j]/max_eta) * x[i,j]) for i in I for j in J if (i,j) in c_co2), "TotalCost"

#Constraint 1:One mode can be assigned for user i
# constraint 1: each user can be assigned one mode
for i in I:
    problem += sum(x[i,j] for j in J if (i,j) in c_co2 ) == 1, "Constraint_leq_{:}".format(i)

#Constraint 2:Shorter than the travel time of the traffic mode clicked by user i
# click_log_Click log is included in df as pandas dataframe
# constraint 2: set "acceptable" condition
for n, t in click_log_df.iterrows():
    i = int(t["sid"])
    if (i,int(t["transport_mode"])) in c_co2:
        baseline = int(c_eta[i, int(t["transport_mode"])]*1.0)

    problem += sum(c_eta[i,j]*x[i,j] for j in J if (i,j) in c_co2 ) <= baseline, "Constraint_co2eq_{:}".format(i) 

#Specify the solver
# set solver
solver = pulp.solvers.PULP_CBC_CMD()
result = problem.solve(solver)
#Outputs whether optimization was possible and the evaluation value at that time
# output
print(pulp.LpStatus[result], pulp.value(problem.objective))

Timeline

The issue of Regular ML Track of KDD Cup 2019 was released around 4/13. After that, I submitted the thesis according to the following schedule.

4/19 First meeting (drafting data utilization method) 5/8 Meeting # 2 (Drafting data utilization method & deciding policy) 5/15 Meeting # 3 (Implemented optimization program, optimization did not converge at this point) 5/31 Meeting # 4 (optimization implementation completed) 6 / 3-10 Writing a dissertation (4 pages in English) 6 / 11-12 English proofreading 6/13 Post 6/15 Submission deadline (the deadline will be extended at a later date by the end of June)

In 2019, there were 10 consecutive holidays during Golden Week, and in about a month and a half, we decided on research themes, implemented them, and wrote a dissertation.

Finally

I wasn't familiar with optimization problems myself, but a colleague I was working with told me in detail about formulation, implementation of Pulp, and what to do if optimization doesn't work. In addition, another member suggests to apply it to environmental problems, and where each traffic mode is provided only by ID, which ID is which traffic mode is published on the Web of fare and data This research was made possible by each team member's contribution, such as identifying them while looking at statistics. I would like to continue doing interesting research and raise awareness of DoCoMo's data analysis and AI fields through external activities such as conference presentations, and I would like to continue to challenge top conferences such as KDD.

Then everyone, Merry Christmas & Happy New Year!

[^ 1]: Won first place in the world at the world's highest data analysis competition "KDD CUP 2019" https://www.nttdocomo.co.jp/binary/pdf/info/news_release/topics_190809_01.pdf

Recommended Posts

Until the introduction and adoption of the research proposal that won first place in KDD CUP 2019
Linux is something like that in the first place
This and that of the inclusion notation.
Talking about the features that pandas and I were in charge of in the project
12. Save the first column in col1.txt and the second column in col2.txt
[Introduction to Python] Summary of functions and methods that frequently appear in Python [Problem format]