[Python] Reinforcement learning: learning from today

In this article, we will travel from Sapporo in Hokkaido to Naha in Okinawa, using reinforcement learning.

This trip across Japan has the following two goals.

  • Goal 1: Reach the destination, Naha.
  • Goal 2: Find the shortest route there.

And the following rule applies.

  • A city that has been visited once must never be visited again. If you mistakenly visit the same city twice, the game is over.

Shouldn't we just use a graph search method?

Yes, that's right: such a route can easily be found with graph search. For more information, please see the linked articles. But here, the purpose is to start learning about reinforcement learning today.

Japan prefectural office location data

The following data, also used in Basics of Graph Theory and Basics of Graph Theory with matplotlib animation, is used here.

Latitude / longitude of the prefectural capital

Download the latitude / longitude data of the prefectural office location.

import urllib.request
url = 'https://raw.githubusercontent.com/maskot1977/ipython_notebook/master/toydata/location.txt'
urllib.request.urlretrieve(url, 'location.txt') #Download data
('location.txt', <http.client.HTTPMessage at 0x110e78ba8>)

Load the downloaded data.

import pandas as pd
location = pd.read_csv('location.txt').values
pd.DataFrame(location)
0 1 2
0 Sapporo 43.0642 141.347
1 Aomori 40.8244 140.74
2 Morioka 39.7036 141.153
3 Sendai 38.2689 140.872
... ... ... ...
44 Miyazaki 31.9111 131.424
45 Kagoshima 31.5603 130.558
46 Naha 26.2125 127.681

47 rows × 3 columns

Let's visualize the read data.

%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(11, 9))
plt.scatter(location[:, 2], location[:, 1])
for city, y, x in location:
    plt.text(x, y, city, alpha=0.5, size=12)
plt.grid()

output_4_1.png

Walking between prefectural office locations

I prepared data, looked up on Google Maps, on how many hours it takes to walk between prefectural capitals (where walking is impossible, a ferry is used). Download it.

url = 'https://raw.githubusercontent.com/maskot1977/ipython_notebook/master/toydata/walk.txt'
urllib.request.urlretrieve(url, 'walk.txt')
('walk.txt', <http.client.HTTPMessage at 0x1220b8a58>)

Load the downloaded data.

walk = pd.read_csv('walk.txt', sep='\s+')
walk
Town1 Town2 Hour Ferry
0 Sapporo Aomori 55 True
1 Akita Aomori 36 NaN
2 Akita Sendai 45 NaN
3 Akita Niigata 52 NaN
... ... ... ... ...
96 Nagasaki Kumamoto 18 NaN
97 Nagasaki Saga 20 NaN
98 Kagoshima Naha 26 NaN

99 rows × 4 columns

Hmm? I just noticed that the Kagoshima–Naha leg is not marked as a ferry... well, it doesn't affect this analysis. Let's visualize the relationships between cities obtained here.

import numpy as np #np.where is used below and numpy has not been imported yet

plt.figure(figsize=(11, 9))
for city, y, x in location:
    plt.text(x, y, city, alpha=0.4, size=8)
    
for w in walk.values:
    t1 = np.where(location[:, 0] == w[0])[0][0] #Index of Town1
    t2 = np.where(location[:, 0] == w[1])[0][0] #Index of Town2
    n1, y1, x1 = location[t1]
    n2, y2, x2 = location[t2]
    plt.plot([x1, x2], [y1, y2]) #Draw the edge between the two cities
    
plt.scatter(location[:, 2], location[:, 1])
plt.grid()

output_12_0.png

A city where a prefectural office is located will be called a "vertex", a line connecting two cities an "edge", and two vertices directly connected by an edge "adjacent". Although we have data on walking travel time between cities, for simplicity the subsequent analysis uses only the "edge" information, not the travel times.

Distance matrix between prefectural office locations

Let's find the distance matrix between the prefectural office locations from the latitude and longitude of the prefectural office location.

import numpy as np
from scipy.spatial import distance

dist_mat = distance.cdist(location[:, 1:], location[:, 1:], metric='euclidean') #Euclidean distance in (latitude, longitude) degrees

Create a function to visualize the matrix with a color map.

import pandas as pd
def visualize_matrix(matrix):
    plt.figure(figsize=(12, 10))
    plt.imshow(matrix, interpolation='nearest', cmap=plt.cm.coolwarm)
    plt.colorbar()
    tick_marks = np.arange(len(matrix))
    plt.xticks(tick_marks, matrix.columns, rotation=90)
    plt.yticks(tick_marks, matrix.columns)
    plt.tight_layout()

Visualize the obtained distance matrix:

visualize_matrix(pd.DataFrame(dist_mat, columns=location[:, 0]))

output_8_0.png

This distance matrix is used for two purposes in this analysis (a short example follows the list).

  • One is the calculation of the distance from your current location to your destination, Naha.
  • The other is to calculate the total distance traveled from the starting point of Sapporo to your current location.
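
As a small illustration of both uses (just a sketch; the indices follow the location table above, e.g. Sapporo = 0, Aomori = 1, Morioka = 2, Naha = 46):

print(dist_mat[2, 46]) #Straight-line distance from a current location (here Morioka) to the destination Naha

partial_route = [0, 1, 2] #Sapporo -> Aomori -> Morioka
print(sum(dist_mat[partial_route[i], partial_route[i + 1]] for i in range(len(partial_route) - 1))) #Total distance traveled so far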

Agent

In reinforcement learning, the learning "brain" is called the "agent". The agent knows which actions it can take and, through learning, adjusts the weights used to choose among them.

Available action options: theta

The variable theta (θ) represents the action options available at each point. In graph theory terms, it represents the "edges" (the connections between vertices). Let's compute it.

import numpy as np
theta_zero = np.full((len(location), len(location)), np.nan) #NaN means "no edge"

for w in walk.values:
    i = np.where(location[:, 0] == w[0])[0][0]
    j = np.where(location[:, 0] == w[1])[0][0]
    theta_zero[i, j] = 1 #Mark the edge in both directions
    theta_zero[j, i] = 1

With this, the cities adjacent to city $ i $ can be read off from theta_zero[i].
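
For example, you can quickly check which cities are adjacent to Tokyo (index 12 in the location table; a small sketch):

adjacent = np.where(theta_zero[12] == 1)[0] #Column indices where an edge exists
print([location[j, 0] for j in adjacent]) #Names of the cities adjacent to Tokyo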

Probability of actions to take: pi

The variable pi (π) represents the probability of each action taken at that point. In graph theory terms, it corresponds to "edge weights". In reinforcement learning, this pi is called the "policy", and it is what gets optimized.

First, as an initial value, assume that when there are multiple options, one of them is chosen at random with equal probability.

def normalize_pi(theta):
    [m, n] = theta.shape
    pi = np.zeros((m, n))
    for i in range(0, m):
        pi[i, :] = theta[i, :] / np.nansum(theta[i, :]) #Divide each row by its number of edges
        
    return np.nan_to_num(pi) #Entries with no edge (NaN) become 0

In this way, the initial value of pi, pi_zero, is obtained.

pi_zero = normalize_pi(theta_zero)
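
As a quick sanity check (a sketch): every city has at least one edge, so each row of pi_zero should now sum to 1.

print(np.allclose(pi_zero.sum(axis=1), 1.0)) #True: each row is a probability distribution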

Let's visualize pi_zero.

visualize_matrix(pd.DataFrame(pi_zero, columns=location[:, 0]))

output_19_0.png

Note that pi_zero is not a symmetric matrix. For example, there is an edge from Sapporo to Aomori and an edge from Aomori to Sapporo, but their weights are not equal: in this graph there is only one choice of next city from Sapporo, whereas there are several from Aomori.
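
You can confirm this directly for the Sapporo (index 0) and Aomori (index 1) rows (a small check):

print(pi_zero[0, 1], pi_zero[1, 0]) #Sapporo->Aomori is 1.0 (only one option); Aomori->Sapporo is smaller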

About "behavior" and "state"

Reinforcement learning treats "actions" and "states" separately: selecting an action in a given state causes a transition to a different state.

The example covered in this article is a special case where "action" and "state" need not be distinguished. Here, the "state" is the city you are currently in, the "action" is the choice of the next city, and that next city is also the next "state". The code is simpler because actions and states coincide, as the small sketch below illustrates.
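
If the state transition were written out explicitly, it would be trivial in this setting (an illustrative sketch, not code from the article):

def transition(state, action):
    #In general an environment maps (state, action) to a next state;
    #here the action "move to city j" simply makes city j the next state
    return action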

Random walk

First, let's try the trip across Japan with the policy pi still equal to pi_zero. Each time a city is reached, the next city is chosen at random, so this is a random walk.

Probabilistically choose the next action

Create a function that chooses the next city according to the probability indicated by policy pi.

def get_next(town, pi):
    return np.random.choice(range(len(pi)), p=pi[town, :])

For example, according to pi_zero, the next city after Tokyo (No. 12) is chosen as follows.

get_next(12, pi_zero)
13

Of course, the result changes with each run, because the choice is random.
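
If you want reproducible runs while experimenting, you can fix NumPy's random seed first (a sketch; not part of the original workflow):

np.random.seed(0) #get_next uses np.random.choice, so fixing the seed makes runs repeatable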

Random walk search

Now, let's search by random walk.

def explore(pi, start=0, goal=46): #Start is 0 (Sapporo), goal is 46 (Naha)
    route = [start] #List to put travel history
    town = start
    while True:
        next_town = get_next(town, pi) #Determine the next city
        if next_town in route: #If you visit the same city twice
            break #Game over

        route.append(next_town) #Add the next city to the history

        if next_town == goal: #Arrive at the goal
            break #Congratulations
        else: #Not a goal yet
            town = next_town #Set "next city" as "current position"

    return route

Start a random walk according to pi_zero

route = explore(pi_zero)
route
[0, 1, 2, 3, 4, 5]

Since it is random, the result of course changes with each run. If the city numbers are hard to picture, converting them to city names makes it easier:

[location[i, 0] for i in route]
['Sapporo', 'Aomori', 'Morioka', 'Sendai', 'Akita', 'Yamagata']

In this example, the game ended because, after walking from Sapporo to Yamagata, the next city chosen was one that had already been visited.

By the way, can we really get from Sapporo to Naha with such a random walk?

Evaluation index of search results

Let's create a way to evaluate the quality of the search results. Calculate the straight-line distance from your current location to your destination (Naha) and the total distance traveled from your departure location (Sapporo) to your current location.

def evaluate(route, goal=46):
    dist_goal = dist_mat[route[-1], goal]
    len_route = sum([dist_mat[route[i], route[i + 1]] for i in range(len(route) - 1)])
    return [dist_goal, len_route]

dist_goal is the "straight-line distance from the current location to the destination (Naha)", and len_route is the "total distance traveled from the departure point (Sapporo) to the current location".

Here is an example run.

route = explore(pi_zero)
[location[i, 0] for i in route], evaluate(route)
(['Sapporo', 'Aomori'], [19.597025248636594, 2.3205099949149073])

In this example, the game ended because the route went from Sapporo to Aomori and then back from Aomori to Sapporo. The distance from the current location (Aomori) to the destination (Naha) is 19.597025248636594, and the distance traveled from the departure point (Sapporo) to the current location (Aomori) is 2.3205099949149073.

The aim of this trip across Japan is first to find routes where dist_goal is zero, and then, among those, the one with the smallest len_route. A small len_route by itself is not a good thing; it only matters for routes that actually reach the goal.

From now on we will play this Sapporo-to-Naha game many times, so let's create a function that judges whether a given result is the best one obtained so far.

best_dist_goal = 1000000 #Initialize with an impossibly large value
best_len_route = 1000000 #Initialize with an impossibly large value

def is_best_ever(): #Compares the globals dist_goal / len_route with the best values so far
    if best_dist_goal >= dist_goal: #No worse than the best dist_goal so far
        if best_dist_goal > dist_goal: #Strictly better dist_goal
            return True #Best ever
        elif best_len_route > len_route: #Same dist_goal but shorter len_route
            return True #Best ever
        else:
            return False #Not the best
    else:
        return False #Not the best

Search execution

Now that we're ready, we'll repeat the random walk search 50,000 times.

%%time #Measure execution time
pi = pi_zero.copy() #Initialization of pi
theta = theta_zero.copy() #initialization of theta

best_dist_goal = 1000000 #Initialize an impossibly large value
best_len_route = 1000000 #Initialize an impossibly large value

dist_goal_history0 = [] #History of dist_goal
len_route_history0 = [] #History of len_route
best_route0 = [] #Best route

for itera in range(50000): #Repeat 50,000 times
    route = explore(pi) #search
    dist_goal, len_route = evaluate(route) #Evaluation
    dist_goal_history0.append(dist_goal) #Record
    len_route_history0.append(len_route) #Record
    
    if is_best_ever(): #If this is the best result so far
        best_dist_goal = dist_goal #Update the best dist_goal
        best_len_route = len_route #Update the best len_route
        best_route0 = route #Update the best route
CPU times: user 10 s, sys: 118 ms, total: 10.2 s
Wall time: 11.8 s

Search results

Create functions to visualize the results. First, a function that visualizes the distributions of the obtained dist_goal and len_route.

def draw_histgrams(dist_goal_history, len_route_history):
    plt.figure(figsize=(max(len_route_history) / 4, max(dist_goal_history) / 4))
    x_max = max(dist_goal_history + len_route_history)
    ax = plt.subplot(2,1,1)
    ax.hist(dist_goal_history, label='dist_goal', bins=20)
    ax.set_xlim([0, x_max])
    ax.grid()
    ax.legend()
    ax = plt.subplot(2,1,2)
    ax.hist(len_route_history, label='len_route', bins=20)
    ax.set_xlim([0, x_max])
    ax.grid()
    ax.legend()
draw_histgrams(dist_goal_history0, len_route_history0)

output_35_0.png

Starting from Sapporo, it is hard to reach Naha; with luck you may get there. You can see that in many cases the game ends near the beginning of the search.

Next, let's look at the relationship between dist_goal and len_route.

def draw_scatter(dist_goal_history, len_route_history):
    plt.figure(figsize=(max(len_route_history) / 4, max(dist_goal_history) / 4))
    plt.scatter(len_route_history, dist_goal_history, alpha=0.5)
    plt.ylabel('dist_goal')
    plt.xlabel('len_route')
    plt.ylim([0, max(dist_goal_history) + 1])
    plt.xlim([0, max(len_route_history) + 1])
    plt.grid()
    plt.show()
draw_scatter(dist_goal_history0, len_route_history0)

output_37_0.png

You can see that Naha was never reached, and that even when the game ends at the same place, the distance traveled up to that point varies (various routes are chosen).

Next, let's see how dist_goal and len_route changed over the iterations.

def visualize_history(history, interval=50, window=50):
    plt.plot(range(0, len(history), interval),  #Overall maximum
             [np.array(history)[:i].max() for i in range(len(history)) if (i%interval) == 1], label='max')
    plt.plot(range(window, len(history)+window, interval), 
             [np.array(history)[i:i + window].mean() for i in range(len(history)) if (i%interval) == 1], 
             label='mean(recent)') #Average of recent interval times
    plt.plot(range(0, len(history), interval), 
             [np.array(history)[:i].mean() for i in range(len(history)) if (i%interval) == 1], label='mean') #Overall average
    plt.plot(range(0, len(history), interval), 
             [np.array(history)[:i].min() for i in range(len(history)) if (i%interval) == 1], label='min') #Overall minimum
    plt.legend()
    plt.grid()
    plt.show()
visualize_history(dist_goal_history0)

output_39_0.png

dist_goal occasionally sets a new minimum, but Naha is never reached, and no matter how many times the search is repeated, the average does not improve. Naturally so, because nothing is being learned.

visualize_history(len_route_history0)

output_40_0.png

Similarly, len_route occasionally sets a new maximum, but Naha is not reached and the average does not improve no matter how many times the search is repeated. Again, this is natural, because nothing is being learned.

Finally, let's display the best route found in this random walk.

def draw_route(route):
    plt.figure(figsize=(11, 9))
    for i in route:
        plt.text(location[i, 2], location[i, 1], location[i, 0], alpha=0.4, size=12)
    plt.grid()
    plt.plot([location[i][2] for i in route], [location[i][1] for i in route])
    plt.ylabel('latitude')
    plt.xlabel('longitude')
    plt.show()
draw_route(best_route0)

output_42_0.png

Despite all that effort, the best route only got as far as Saga. With luck a random walk may reach Naha, but nothing is ever learned.

Policy gradient method

Now it's time to start reinforcement learning. Reinforcement learning methods can be broadly divided into policy gradient methods and value iteration methods. Let's begin with the policy gradient method.

Update policy pi

The obtained route can be evaluated as good or bad by how close its end point is to the goal. We update the policy pi so that the closer a route's end point is to the goal, the more likely its edges are to be chosen in future explorations.

def update_pi(pi, route, goal=46, eta=0.1): #Eta (η) is the learning rate
    new_pi = pi.copy() #Copy the numpy array
    for i in range(len(route) - 1): #For each edge on the route
        town1 = route[i] #Start city of edge i
        town2 = route[i + 1] #End city of edge i
        new_pi[town1, town2] += eta / (dist_mat[route[-1], goal] + 1)
        #The closer the end point of the route is to the goal, the larger the increase
    
    return normalize_pi(new_pi) #Update pi

normalize_pi is applied at the end so that the values in each row again sum to 1.0.

Search execution

Now let's start the search.

%%time
pi = pi_zero.copy()

best_dist_goal = 1000000
best_len_route = 1000000

dist_goal_history1 = []
len_route_history1 = []
best_route1 = []
for itera in range(50000):
    route = explore(pi)
    dist_goal, len_route = evaluate(route)
    dist_goal_history1.append(dist_goal)
    len_route_history1.append(len_route)
        
    pi = update_pi(pi, route)
    
    if is_best_ever():
        best_dist_goal = dist_goal
        best_len_route = len_route
        best_route1 = route
CPU times: user 59.9 s, sys: 340 ms, total: 1min
Wall time: 1min 1s

Search results

Below is an example of the results. They vary from run to run, and in some cases the destination Naha may not be reached even after 50,000 iterations of learning.

The resulting distributions of dist_goal and len_route are very different from the random walk distributions. The number of arrivals at the destination Naha has increased overwhelmingly.

draw_histgrams(dist_goal_history1, len_route_history1)

output_46_0.png

The relationship between dist_goal and len_route is as follows.

draw_scatter(dist_goal_history1, len_route_history1)

output_47_0.png

The history of dist_goal: the destination Naha was reached much earlier than with the random walk, and after that it was reached more and more easily.

visualize_history(dist_goal_history1)

output_48_0.png

This is the first 5000 times in the history of dist_goal.

visualize_history(dist_goal_history1[:5000])

output_49_0.png

Similarly, the history of len_route.

visualize_history(len_route_history1)

output_50_0.png

The first 5000 times in the history of len_route.

visualize_history(len_route_history1[:5000])

output_51_0.png

Here is the best route found. You can see that it is close to the shortest route. Only one part is regrettable: it goes from Fukuoka to Kumamoto via Saga, when it should go directly from Fukuoka to Kumamoto without passing through Saga.

draw_route(best_route1)

output_52_0.png

Save the resulting policy pi as the alias pi_pg.

pi_pg = pi.copy()

Visualize that policy pi_pg

visualize_matrix(pd.DataFrame(pi_pg, columns=location[:, 0]))

output_54_0.png

Critically important edges show up in red. For example, once you arrive in Kagoshima, the only city worth choosing next is Naha; picking anything else (Kumamoto or Miyazaki) is pointless. The matrix captures this clearly.

You can check how it has changed compared to the initial value in this way.

visualize_matrix(pd.DataFrame(pi_pg - pi_zero, columns=location[:, 0]))

output_55_0.png

The places that aren't important haven't changed much. For example, the Shikoku and Kanto regions do not seem to be very important.

Epsilon-Greedy method

In contrast to the policy gradient method above, the Epsilon-Greedy method introduced here and the Q-learning introduced next are called value iteration methods.

In value iteration methods, the policy is conventionally written "Q" rather than "π". Whereas pi started from perfectly equal probabilities, Q starts from unequal random probabilities.

Policy Q

def generate_Q(theta): #Generate unequal random probabilities from the choices you can take
    [m, n] = theta.shape
    Q = np.zeros((m, n))
    rand = np.random.rand(m, n)
    for i in range(m):
        for j in range(n):
            if theta[i, j] == 1:
                Q[i, j] = rand[i, j]
        
    return normalize_pi(Q)

Generates an initial value Q_zero for a non-uniform policy Q

Q_zero = generate_Q(theta_zero)

Visualize the generated Q_zero.

visualize_matrix(pd.DataFrame(Q_zero, columns=location[:, 0]))

output_59_0.png

This is the difference between Q_zero and pi_zero.

visualize_matrix(pd.DataFrame(Q_zero - pi_zero, columns=location[:, 0]))

output_60_0.png

How to choose the next action

In the policy gradient method, get_next chose the next action at random according to the probabilities in policy pi. In the value iteration methods, get_next is rewritten as follows:

def get_next(town, Q, epsilon=0.1):
    if np.random.rand() < epsilon:
        return np.random.choice(range(len(Q)), p=Q[town, :])
    else:
        return np.nanargmax(Q[town, :])

In other words, a new parameter epsilon is introduced: with probability epsilon the action is chosen at random according to policy Q, and otherwise the action with the maximum Q value is chosen. Initially epsilon is set to a large value, so random choices are frequent; it is then gradually decreased so that random choices become rarer.
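
In the search loop later in this article, epsilon starts at 0.5 and is multiplied by 0.99 every iteration, and explore_epsilon_greedy additionally enforces a floor of min_epsilon = 0.01. A small sketch of that decay schedule, just for intuition:

epsilon = 0.5 #Initial value used in the search below
for itera in range(500):
    epsilon = epsilon * 0.99 #Same decay as in the search loop below
    if (itera + 1) % 100 == 0:
        print(itera + 1, max(epsilon, 0.01)) #0.01 is the floor applied inside explore_epsilon_greedy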

Reward

We also introduce the concept of a reward: achieving the goal (reaching Naha) earns 1 point, failing (visiting the same city twice, ending the game) earns -1 point, and any other move earns 0 points.
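
Written as a small helper, the rule above looks like this (a sketch; the actual code below applies the same rule inline inside the search function):

def reward_for(next_town, route, goal=46):
    if next_town in route: #Visited the same city twice: game over
        return -1
    if next_town == goal: #Reached the destination
        return 1
    return 0 #An ordinary move earns nothing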

In the method below, called SARSA, when the Q value of a given action (an edge on the route) is updated, the Q value of the next action (the next edge on the route) is also referenced, discounted by the time discount rate gamma.
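
For reference, the update implemented below is the standard SARSA rule, $Q(s, a) \leftarrow Q(s, a) + \eta \left( r + \gamma Q(s', a') - Q(s, a) \right)$, where $ s $ is the current city, $ a $ the move taken, $ r $ the reward, $ s', a' $ the next city and next move, $ \eta $ the learning rate, and $ \gamma $ the time discount rate. In the code, the terminal (goal) branch drops the $ \gamma Q(s', a') $ term, the game-over branch simply shrinks $ Q(s, a) $ toward zero, and normalize_pi is applied afterwards to renormalize each row.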

def sarsa(town, Q, prev_t, next_t, reward, eta=0.1, gamma=0.9, goal=46):
    if reward == 1: #dist_mat[town, goal] == 0:
        Q[prev_t, town] = Q[prev_t, town] + eta * (reward - Q[prev_t, town])
    elif reward >= 0:
        Q[prev_t, town] = Q[prev_t, town] + eta * (reward - Q[prev_t, town] + gamma * Q[town, next_t])
    else:
        Q[prev_t, town] = Q[prev_t, town] - eta * Q[prev_t, town]
        
    return normalize_pi(Q)

In this way, the reward obtained at the destination comes to influence the choice of the actions leading toward it, and the "negative reward" at game over likewise discourages the actions that led there.

def explore_epsilon_greedy(Q, epsilon=0.1, eta=0.1, gamma=0.9, start=0, goal=46, min_epsilon=0.01):
    epsilon = max(epsilon, min_epsilon)
    route = [start]
    town = get_next(start, Q, epsilon)
    prev_t = start
    while True:
        if town in route:
            reward = -1
            Q = sarsa(town, Q, prev_t, next_t, reward, eta, gamma)
            break
        elif town == goal:
            reward = 1
            route.append(town)
            Q = sarsa(town, Q, prev_t, next_t, reward, eta, gamma)
            break
        else:
            reward = 0
            route.append(town)
            next_t = get_next(town, Q, epsilon)
            Q = sarsa(town, Q, prev_t, next_t, reward, eta, gamma)
            prev_t = town
            town = next_t
    
    return [route, Q]

Search execution

Now let's start the search.

%%time
eta = 0.1 #Learning rate
gamma = 0.9 #Time discount rate
epsilon = 0.5
Q = Q_zero.copy()

best_dist_goal = 1000000
best_len_route = 1000000

best_route2 = []
dist_goal_history2 = []
len_route_history2 = []
for itera in range(50000):
    epsilon = epsilon * 0.99
    route, Q = explore_epsilon_greedy(Q, epsilon, eta, gamma)
    dist_goal, len_route = evaluate(route)
    dist_goal_history2.append(dist_goal)
    len_route_history2.append(len_route)

    if is_best_ever():
        best_dist_goal = dist_goal
        best_len_route = len_route
        best_route2 = route
CPU times: user 7min 50s, sys: 948 ms, total: 7min 50s
Wall time: 7min 53s

Search results

This is an example of the search results by the Epsilon-Greedy method. In fact, the results aren't very stable and can change quite a bit each time you run them (try it out).

First, the distribution of dist_goal and len_route

draw_histgrams(dist_goal_history2, len_route_history2)

output_65_0.png

Relationship between dist_goal and len_route

draw_scatter(dist_goal_history2, len_route_history2) 

output_66_0.png

History of dist_goal

Increasing the learning rate eta makes convergence faster but increases the likelihood of falling into a local solution. A smaller value converges more slowly but is less likely to get stuck in one.

visualize_history(dist_goal_history2)

output_67_0.png

The first 5000 times in the history of dist_goal

visualize_history(dist_goal_history2[:5000])

output_68_0.png

History of len_route

visualize_history(len_route_history2)

output_69_0.png

The first 5000 times in the history of len_route

visualize_history(len_route_history2[:5000])

output_70_0.png

The shortest path obtained by the Epsilon-Greedy method

Again, the result was a bit different from the true shortest path. Compare with the results obtained by the policy gradient method.

draw_route(best_route2)

output_71_0.png

In the policy gradient method, the distance to the destination Naha was used to update pi, so the learning simultaneously pursued "Goal 1: reach the goal" and "Goal 2: find the shortest route". That is why every run tends to produce a result close to the shortest path.

In this Epsilon-Greedy algorithm, the reward reflects only "whether you reached the goal" and "whether the game is over (you visited the same city twice)". It therefore learns "Goal 1: reach the goal" but not "Goal 2: find the shortest path". A result close to the shortest path can happen, but in practice it often converges to a route far from the shortest one (try it).

Think about how the Epsilon-Greedy method could be made to learn "Goal 2: find the shortest path" as well (not covered here).

Save the Q value after training with the Epsilon-Greedy method as Q_eg.

Q_eg = Q.copy()

Its value is as follows:

visualize_matrix(pd.DataFrame(Q_eg, columns=location[:, 0]))

output_73_0.png

Let's compare it with the pi_pg obtained by the policy gradient method.

Q learning

"Q-learning" </ b> is famous as another method of value iterative method. It's basically similar to the Epsilon-Greedy method, except that it doesn't include the randomness that occurs when choosing the "next action". It is said that the convergence will be faster by that amount.

However, in my trials of the code below, convergence was sometimes faster but by no means always, and it often ended up in a local solution (never reaching the destination Naha).
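
For reference, the update implemented below is the standard Q-learning rule, $Q(s, a) \leftarrow Q(s, a) + \eta \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$. The only difference from the SARSA rule above is that the discounted term uses the maximum Q value over the next city's options (np.nanmax in the code) instead of the Q value of the action actually chosen.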

def Q_learning(town, Q, prev_t, reward, eta=0.1, gamma=0.9, goal=46):
    if reward == 1: #dist_mat[town, goal] == 0:
        Q[prev_t, town] = Q[prev_t, town] + eta * (reward - Q[prev_t, town])
    elif reward >= 0:
        Q[prev_t, town] = Q[prev_t, town] + eta * (reward - Q[prev_t, town] + gamma * np.nanmax(Q[town, :]))
    else:
        Q[prev_t, town] = Q[prev_t, town] - eta * Q[prev_t, town]
        
    return normalize_pi(Q)

def explore_Q_learning(Q, epsilon=0.1, eta=0.1, gamma=0.9, start=0, goal=46):
    prev_t = start
    route = [start]
    town = get_next(start, Q, epsilon)
    while True:
        if town in route:
            reward = -1
            Q = Q_learning(town, Q, prev_t, reward, eta, gamma)
            break
        elif town == goal:
            reward = 1 
            route.append(town)
            Q = Q_learning(town, Q, prev_t, reward, eta, gamma)
            break
        else:
            reward = 0
            dist_goal, len_route = evaluate(route)
            if best_dist_goal > dist_goal: #Extra shaping: reward a route whose endpoint is closer to the goal than the best so far (reads the global best_dist_goal)
                reward = 1
            route.append(town)
            next_t = get_next(town, Q, epsilon)
            Q = Q_learning(town, Q, prev_t, reward, eta, gamma)
            prev_t = town
            town = next_t
    
    return [route, Q]

Search execution

%%time
eta = 0.1 #Learning rate
gamma = 0.9 #Time discount rate
epsilon = 0.5
Q = Q_zero.copy()

best_dist_goal = 1000000
best_len_route = 1000000

best_route3 = []
dist_goal_history3 = []
len_route_history3 = []
for itera in range(50000):
    epsilon = epsilon * 0.99
    route, Q = explore_Q_learning(Q, epsilon, eta, gamma)
    dist_goal, len_route = evaluate(route)
    dist_goal_history3.append(dist_goal)
    len_route_history3.append(len_route)

    if is_best_ever():
        best_dist_goal = dist_goal
        best_len_route = len_route
        best_route3 = route
CPU times: user 9min 50s, sys: 1.41 s, total: 9min 52s
Wall time: 9min 54s

Search results

This is an example of search results by Q-learning. As with the Epsilon-Greedy method, the results are not very stable and can vary considerably from run to run (try it).

Distribution of dist_goal and len_route

draw_histgrams(dist_goal_history3, len_route_history3)

output_81_0.png

Relationship between dist_goal and len_route

draw_scatter(dist_goal_history3, len_route_history3) 

output_82_0.png

History of dist_goal

visualize_history(dist_goal_history3)

output_83_0.png

The first 5000 times in the history of dist_goal

visualize_history(dist_goal_history3[:5000])

output_84_0.png

History of len_route

visualize_history(len_route_history3)

output_85_0.png

The first 5000 times in the history of len_route

visualize_history(len_route_history3[:5000])

output_86_0.png

Shortest path obtained by Q-learning

draw_route(best_route3)

output_87_0.png

As this example calculation shows, the result is not the true shortest path. As with the Epsilon-Greedy method, because of how the "reward" is designed, the learning pursues "Goal 1: reach the goal" but not "Goal 2: find the shortest path".

Also, as noted above, in my experience Q-learning may converge faster, but in many cases it does not reach the destination Naha even after 50,000 iterations, and the "shortest path obtained by Q-learning" tends to be longer than the one obtained by the Epsilon-Greedy method. I suspect this is because the randomness in selecting the "next action" is suppressed: once a route that happens to reach the goal is found, there are fewer opportunities to replace it, even if it is long.

Q_qlearn = Q.copy()
visualize_matrix(pd.DataFrame(Q_qlearn, columns=location[:, 0]))

output_89_0.png

So, that's all for reinforcement learning this time. Well, Okinawa is far away.

There is also deep reinforcement learning, which combines this with deep learning, but that is a topic for another time. Ciao.
