[PYTHON] Reinforcement learning 3 Dynamic programming / TD method

Aidemy 2020/11/16

Introduction

Hello, I'm Yope! I am a liberal arts student, but I became interested in the possibilities of AI, so I enrolled in the AI-specialized school "Aidemy" to study. I'd like to share the knowledge I gained there, so I am summarizing it on Qiita. I'm very happy that so many people have read my previous summary articles. Thank you! This is my third post on reinforcement learning. Nice to meet you.

What you will learn this time ・Dynamic programming ・The TD method

Dynamic programming

What is dynamic programming?

・In Chapter 2 we explained that the __Bellman equation__ is used to find the optimal policy; in this chapter we actually carry that out. ・When the model of the environment is completely given as a __Markov decision process (MDP)__, the method of finding the optimal policy is called __dynamic programming (DP)__.

Policy evaluation

・Calculating the value function $V^\pi(s)$ obtained when a policy $\pi$ is followed is called __policy evaluation__. The procedure is as follows. ・First, set a threshold $\epsilon$ in advance for deciding when to stop the updates. Then take the policy $\pi$ as input, initialize $V(s)$ to 0 for all states, and define $\delta$ to hold the largest change over all states in each update so that it can be compared with the threshold. ・For every state s, store the current value as $v = V(s)$ and recompute $V(s)$ with the __Bellman equation__. To measure the size of the update, set $\delta = \max(\delta, |v - V(s)|)$; once $\delta$ __falls below $\epsilon$__, the calculation ends and $V(s)$ is output as an approximate solution of $V^\pi(s)$. ・The code for the above is as follows.

(Screenshot: policy_evaluation() code)

・The policy-evaluation function __policy_evaluation()__ takes "the policy pi, the states, and the threshold epsilon" as arguments. ・First, $\delta = 0$ is defined. Then the procedure above is carried out for every state s: the current value function V is copied into the variable __v__ with __"v = V.copy()"__, and a variable __x__ is prepared to hold the newly computed value of V[s]. ・In __"for (p, s1) in T(s, pi[s]):"__, __p__ is the __probability of taking that action ($\pi(a|s)$)__ obtained from the transition function T, and __s1__ is the __state after the transition (after_state)__. Using these, $V(s)$ is computed with the __Bellman equation__ (see the formula in Chapter 2) and assigned to __x__. ・$\delta$ is then updated with the formula above using $V(s)$ and v, and once it falls below the threshold $\epsilon$, the approximation V of the value function is returned.
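Since the original code is only shown as a screenshot, here is a minimal, hedged sketch of a policy-evaluation routine along the lines described above. The helpers `T(s, a)` (returning `(probability, next_state)` pairs), the reward function `R(s)`, and the value of `gamma` are assumptions introduced for illustration, not the article's exact definitions.

```python
# Minimal policy-evaluation sketch (a reconstruction, not the article's exact code).
# Assumed helpers, purely illustrative:
#   T(s, a) -> list of (probability, next_state) pairs
#   R(s)    -> immediate reward for state s
gamma = 0.8  # discount rate (assumed here; the article uses 0.8 later for Sarsa)

def policy_evaluation(pi, states, epsilon, T, R):
    """Approximate V^pi(s) by sweeping the Bellman expectation backup."""
    V = {s: 0.0 for s in states}                  # V(s) = 0 in all states
    while True:
        delta = 0.0
        for s in states:
            v = V[s]                              # old value, kept for the update check
            V[s] = sum(p * (R(s) + gamma * V[s1]) for (p, s1) in T(s, pi[s]))
            delta = max(delta, abs(v - V[s]))     # delta = max(delta, |v - V(s)|)
        if delta < epsilon:                       # every update fell below the threshold
            return V

# Tiny two-state usage example (StateA / StateB), again purely illustrative
states = ["StateA", "StateB"]
pi = {"StateA": "ActionX", "StateB": "ActionX"}
T = lambda s, a: [(1.0, "StateB" if s == "StateA" else "StateA")]
R = lambda s: 1.0 if s == "StateA" else 0.0
print(policy_evaluation(pi, states, 1e-4, T, R))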

・Execution part ![Screenshot 2020-11-14 17.21.46.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/7260cfde-0802-6188-96e9-db5d3ba10faf.png)

・In the execution part, the V corresponding to the "v" inside __policy_evaluation()__ and the policy pi passed as arguments are defined. Applying the function computes the value function of __StateA__ and the value function of __StateB__ respectively.

Policy iteration

・ In "$ V ^ \ pi (s) " calculated by the value evaluation in the previous section, if the measure that takes the __greedy method __ is " \ pi'", then __ " V ^ \ pi ( s) $ <$ V ^ {\ pi'} (s) $ ”__, which is called value improvement. -By repeating this __value improvement and value evaluation __, the __optimal value function can be derived __. This is called __policy iteration __. -The code of the function __ "policy_iteration ()" __ that improves the value is as follows.

(Screenshot: policy_iteration() code)

・In the code above, the value function V and the policy pi are first initialized. Then __policy evaluation__ is performed on V with the __policy_evaluation()__ function created above, and __policy_flag = True__ is set. ・For each state s, the action values are computed from the updated V with the Bellman equation, and the __actions are sorted__ by that value. If the __policy has not yet converged__, the best action is stored in __pi__ and the loop continues; once it has, the policy pi is returned.
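Again hedged (the original is only a screenshot), a sketch of policy iteration that reuses the `policy_evaluation()` sketch above and the same assumed `T`, `R`, and `gamma`; the `policy_stable` flag plays the role of the article's `policy_flag`.

```python
def policy_iteration(states, actions, epsilon, T, R):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    pi = {s: actions[0] for s in states}          # arbitrary initial policy
    while True:
        V = policy_evaluation(pi, states, epsilon, T, R)   # evaluate the current policy
        policy_stable = True                      # counterpart of the article's policy_flag
        for s in states:
            # greedy improvement: action with the largest one-step lookahead value
            best_a = max(actions,
                         key=lambda a: sum(p * (R(s) + gamma * V[s1])
                                           for (p, s1) in T(s, a)))
            if best_a != pi[s]:
                pi[s] = best_a                    # policy changed: evaluate again
                policy_stable = False
        if policy_stable:
            return pi

# e.g. policy_iteration(states, ["ActionX", "ActionY"], 1e-4, T, R)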

・Execution part ![Screenshot 2020-11-14 20.16.09.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/309fc249-436a-782a-6fb7-46b0d149eb4a.png)

・The policy-iteration function __policy_iteration()__ returns the policy pi, so this execution result represents the policy "take ActionX in StateA and take ActionX in StateB".

Value iteration

・In the policy iteration of the previous section, computing the __policy pi__ requires recomputing the value function for all states __multiple times__, so the amount of computation grows as the number of states increases. The technique that addresses this is __value iteration__. ・In value iteration, it is the __value__ that is computed: the value function is directly computed and updated for each state s, so the __value function needs to be computed only once__ (there is no separate evaluation loop for each candidate policy). ・The code of the value-iteration function __value_iteration()__ is as follows.

(Screenshot: value_iteration() code)

・In this function, the state value V is first __initialized as before__. Unlike policy iteration, the __policy pi is not tracked__ during the updates, so there is no need to __initialize pi__. Next, for each state, the value is evaluated for every action and $V(s)$ is updated with the __maximum__ of those values. ・Finally, the action that yields the highest value under this $V(s)$ is computed and returned as the policy "pi". The best action here is obtained in the same way as in the policy iteration of the previous section.
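A corresponding value-iteration sketch under the same assumptions as the earlier sketches: it sweeps V directly with the max over actions and only reads the greedy policy off at the end.

```python
def value_iteration(states, actions, epsilon, T, R):
    """Update V(s) directly with the Bellman optimality backup, then extract pi."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = V[s]
            # no fixed policy: take the maximum over all actions
            V[s] = max(sum(p * (R(s) + gamma * V[s1]) for (p, s1) in T(s, a))
                       for a in actions)
            delta = max(delta, abs(v - V[s]))
        if delta < epsilon:
            break
    # greedy policy read off the (approximately) optimal value function
    return {s: max(actions,
                   key=lambda a: sum(p * (R(s) + gamma * V[s1])
                                     for (p, s1) in T(s, a)))
            for s in states}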

・Execution part ![Screenshot 2020-11-14 21.00.05.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/751f94af-3263-29a4-d262-053ea8522ea6.png)

TD method

What is the TD method?

・The methods above are __dynamic programming__, which requires the __state transition probability P to be known in advance__. Conversely, when the state transition probabilities are not known, they must be estimated from the rewards actually collected, as we learned in Chapter 1. ・The __TD method__ is one such estimation approach. TD stands for Temporal Difference: the next estimate is updated using the current estimate, without waiting for the final result. ・The TD method includes algorithms such as __Sarsa__ and __Q-learning__.

・From here on, the following __Environment 2__ is used. (Screenshot: the Environment 2 grid)

・The start is the "start" square in the figure, the goal is the "+1" or "-1" square, the black square cannot be entered, and the agent can move to the adjacent square up, down, left, or right. The agent moves in the chosen direction with probability __80%__, slips 90 degrees clockwise from the chosen direction with __10%__, and counterclockwise with the remaining __10%__. As for rewards, the goal squares give __their written score__, and every other square gives __"-0.04"__. ・The following code is a function that outputs the position reached when you try to move __"up twice, then right three times"__ from the start.

(Screenshots: Environment 2 code — T(), go(), take_single_action(), move())

・Since a new environment is used, it is defined first. The transition function __T()__, which is used inside take_single_action(), is defined so that column 0 holds the probability and column 1 holds the next state (next_state). In addition, __go()__, which determines the __destination__ from a state and a direction, is created so that next_state can be derived. ・The __take_single_action()__ function, which decides the next state from the transition probabilities, is used here as well: next_state is advanced until the cumulative transition probability defined in __T()__ exceeds a random number x, and the reward of the square reached at that point is set. ・Finally, the function __move()__, which actually moves "up twice and right three times" from the current state, is created. In the action array, __column 0 defines the vertical move and column 1 the horizontal move__; take_single_action is applied twice for column 0 and three times for column 1 to update the state.
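Because the code itself is only shown in screenshots, the sketch below reconstructs the gist of `T()`, `go()`, `take_single_action()`, and `move()`. The grid layout, coordinates, and blocked cell are assumptions (a 4x3 grid with the black square at (1, 1)); reward handling is left to the later Sarsa sketch.

```python
import random

def T(state, direction):
    """(probability, next_state) pairs: 80% as chosen, 10% clockwise, 10% counterclockwise."""
    clockwise = {"up": "right", "right": "down", "down": "left", "left": "up"}
    counter = {v: k for k, v in clockwise.items()}
    return [(0.8, go(state, direction)),
            (0.1, go(state, clockwise[direction])),
            (0.1, go(state, counter[direction]))]

def go(state, direction):
    """Destination cell; stay put when hitting the edge or the blocked square."""
    dx, dy = {"up": (0, 1), "down": (0, -1), "right": (1, 0), "left": (-1, 0)}[direction]
    nxt = (state[0] + dx, state[1] + dy)
    blocked = {(1, 1)}                                   # assumed position of the black square
    inside = 0 <= nxt[0] <= 3 and 0 <= nxt[1] <= 2       # assumed 4x3 grid
    return nxt if inside and nxt not in blocked else state

def take_single_action(state, direction):
    """Sample one transition: accumulate probabilities until they exceed a random x."""
    x, cumulative = random.uniform(0, 1), 0.0
    for probability, next_state in T(state, direction):
        cumulative += probability
        if cumulative > x:
            break
    return next_state

def move(state):
    """Try to move up twice, then right three times, from the given state."""
    for direction in ["up", "up", "right", "right", "right"]:
        state = take_single_action(state, direction)
    return state

print(move((0, 0)))                                      # assumed start at the bottom-left cell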

・Execution part ![Screenshot 2020-11-15 11.51.56.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/21e90c1f-d9b9-999b-175d-e0aea00aebcf.png)

・Each cell of the environment is defined, along with the start position and the goal squares; the state after the move can then be obtained with move().

Sarsa

・Sarsa is one of the TD methods; it solves the __Bellman equation__ by trial and error. Value iteration and the like used $V(s)$, but here the __action value function $Q(s, a)$__ is used. This Q function is represented as an array of __[all states x all actions]__. ・As for "all states", Environment 2 has 11 reachable states, and there are 4 actions (up, down, left, right). We would therefore like the Q function to be an 11x4 array, but since each __state is represented by coordinates__, that cannot be expressed directly. The state part is therefore mapped to the index __"x coordinate x 10 + y coordinate x 1"__.

・The specific code is as follows (the parts that overlap with the previous section are omitted). (Screenshot: Sarsa helper functions)

-__ "Get_action ()" __ is a function that determines the action based on the maximum value of the Q function (q_table) __ defined later. The __ "t_state" __ passed at this time is the __ "state with transformed coordinates" __ described above. - "take_single_action ()" __ has the same purpose as the previous one, but when reward is not set as the returned value, that is, when __ action destination cannot move due to a wall or obstacle __ , Returns the state "state" and reward "-0.04" in the sense that it stops there, and returns "next_state" and its reward when it is in a square where reward is set. -__ "update_qtable ()" __ is a function that actually calculates and updates the action value function Q. The formula used at the time of this update is: $Q(s,a) += \alpha[r + \gamma Q(s',a') - Q(s,a)]$

・Here $\alpha$ is the learning rate and must be set in advance; this time it is set to __"0.4"__. The discount rate $\gamma$ is __"0.8"__. The state passed here is also the "state with converted coordinates". The calculation itself simply follows the formula. ・Finally, a __trans_state()__ function is created that converts a state "state" into the "state with converted coordinates".
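A hedged sketch of the three helpers just described. The `q_table` shape is an assumption; only the x*10+y indexing, α = 0.4, γ = 0.8, and the Sarsa update formula come from the article.

```python
import numpy as np

alpha, gamma = 0.4, 0.8          # learning rate and discount rate from the article

def trans_state(state):
    """Convert an (x, y) coordinate into the row index 'x * 10 + y * 1'."""
    return state[0] * 10 + state[1] * 1

def get_action(t_state, q_table):
    """Choose the action whose Q value is largest in this state (greedy)."""
    return int(np.argmax(q_table[t_state]))

def update_qtable(q_table, t_state, action, reward, t_next_state, next_action):
    """Sarsa update: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    td_target = reward + gamma * q_table[t_next_state, next_action]
    q_table[t_state, action] += alpha * (td_target - q_table[t_state, action])
    return q_table

# Rows sized generously so that every x*10+y index fits; 4 columns = 4 actions
q_table = np.zeros((40, 4))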

・Execution part ![Screenshot 2020-11-15 12.59.35.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/3984e81e-9e7b-fc38-24c3-5b2d25a1c6ea.png)

(Screenshot: Sarsa execution code, continued)

・When running Sarsa, the action value function Q is initialized first. This time the number of episodes is 500, each episode runs for at most 1000 time steps, and the episode loop ends when __"the reward exceeds 0.7 for 5 consecutive episodes"__, so an array is prepared to track this. The environment is the same as in the previous section. ・Next, the following is repeated for each of the 500 episodes (until the end condition is met). ・First, the __state is initialized__. From it, the "t_state" and the action "action" to be passed to the Q function are computed, using "trans_state()" and "get_action()" respectively. ・Then, for each time step in the episode, __take_single_action()__ yields the next state s' and the reward r obtained by performing action a, and the rewards are __accumulated__. For the state s', the "t_next_state" and the action "next_action" to pass to the Q function are also computed. ・These are passed to __update_qtable()__ to update the __action value function__, and the state and action are updated for the next time step. ・When the goal is finally reached, that episode ends and the next one begins. Before moving on, __"total_reward_vec"__, which holds the rewards used for the end condition, is updated and printed, along with the number of the finished episode and how many steps it took. ・When the __minimum__ of "total_reward_vec" exceeds 0.7, the episode loop ends. A sketch of this loop follows below.
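Here is a sketch of that loop, reusing the helper sketches above. The `step()` wrapper, the terminal-square rewards, and the start cell are assumptions introduced so the skeleton hangs together; the episode count, step limit, reward accumulation, and stopping rule follow the description above.

```python
directions = ["up", "down", "left", "right"]        # action index -> direction name
terminal_reward = {(3, 2): 1.0, (3, 1): -1.0}       # assumed goal squares (+1 / -1)

def step(state, action):
    """One environment step: sample the next cell and look up its reward."""
    next_state = take_single_action(state, directions[action])
    reward = terminal_reward.get(next_state, -0.04)
    return next_state, reward, next_state in terminal_reward

q_table = np.zeros((40, 4))                         # initialize the action value function Q
total_reward_vec = np.zeros(5)                      # rewards of the last 5 episodes

for episode in range(500):
    state = (0, 0)                                  # assumed start cell
    t_state = trans_state(state)
    action = get_action(t_state, q_table)
    episode_reward = 0.0

    for t in range(1000):                           # at most 1000 time steps
        next_state, reward, done = step(state, action)
        episode_reward += reward                    # accumulate the rewards

        t_next_state = trans_state(next_state)
        next_action = get_action(t_next_state, q_table)
        q_table = update_qtable(q_table, t_state, action,
                                reward, t_next_state, next_action)

        state, t_state, action = next_state, t_next_state, next_action
        if done:                                    # reached the +1 or -1 square
            break

    total_reward_vec = np.hstack((total_reward_vec[1:], episode_reward))
    print(f"episode {episode + 1}: {t + 1} steps, reward {episode_reward:.2f}")
    if total_reward_vec.min() > 0.7:                # last 5 rewards all above 0.7
        break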

・Execution result (partial) ![Screenshot 2020-11-15 13.34.46.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/d8d32f7c-e892-dfbe-7c9a-dd7b0689e03f.png)

・Looking at the execution results, once the episodes with a reward of "0.2" dropped out of the last 5 episodes and all of the last 5 rewards exceeded 0.7, learning ended at episode 21 + 1.

Implementation of ε-greedy method in Sarsa

・With the method of the previous section, only the action with the maximum Q value is ever chosen, which risks overlooking a __better policy__. To avoid this, new actions are explored using the __ε-greedy method__. ・The ε-greedy method was also used in Chapter 1, but here it is used with a slight change: __"ε is kept large to increase the exploration ratio while the episode count is small, and is decreased to narrow the exploration as the episodes progress"__. The following formula is used: $\epsilon = 0.5 * (1 / (episode + 1))$

・The specific code is as follows; only the "get_action()" part differs.

(Screenshot: ε-greedy get_action() code)

・In the function above, "epsilon" is first computed with the formula above. If this epsilon is less than or equal to the uniform random number "np.random.uniform(0, 1)", the next action is determined from the __maximum value of the Q function as in the previous section__; otherwise the next action is chosen __at random__.
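A hedged sketch of that ε-greedy `get_action()`; it mirrors the behaviour described above (the screenshot itself is not reproduced) and reuses the `q_table` shape assumed earlier.

```python
import numpy as np

def get_action(t_state, q_table, episode):
    """Epsilon-greedy action choice, with epsilon shrinking as episodes progress."""
    epsilon = 0.5 * (1 / (episode + 1))
    if epsilon <= np.random.uniform(0, 1):
        return int(np.argmax(q_table[t_state]))     # exploit: best action so far
    return int(np.random.choice(q_table.shape[1]))  # explore: random action index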

・Execution part ![Screenshot 2020-11-15 15.53.09.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/d575a11e-ce96-9d67-856a-34d2e8d31eb9.png)

・The execution part is almost the same as in the previous section, but the __"action"__ defined at the start of each episode is determined from the maximum value of the Q function alone, without going through the random-number-based get_action(); get_action() is used for next_action.

Q learning

・There is another TD method, different from Sarsa, called __Q-learning__. It is basically the same as Sarsa, but the difference lies in the processing of each time step: Sarsa __updates the Q function after determining the next action__, whereas Q-learning __decides the next action after updating the Q function__. ・Therefore __"next_action"__ is not passed to __update_qtable()__; instead, the __maximum value "next_max_q" of the Q function in next_state__ is computed and used to update the Q function. Everything else is exactly the same as Sarsa; see the sketch after the screenshot below.

(Screenshot: Q-learning code)
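For comparison, a sketch of the Q-learning version of the update (again a reconstruction, not the screenshot's code): the target uses the maximum Q value in the next state instead of the value of the action actually chosen next.

```python
def update_qtable(q_table, t_state, action, reward, t_next_state):
    """Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    next_max_q = q_table[t_next_state].max()        # no next_action is needed
    td_target = reward + gamma * next_max_q
    q_table[t_state, action] += alpha * (td_target - q_table[t_state, action])
    return q_table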

Summary

・When the model of the environment is completely given as a Markov decision process, the method of finding the optimal policy using the Bellman equation from Chapter 2 is called dynamic programming (DP). ・Computing the value function $V^\pi(s)$ obtained under a policy $\pi$ is called policy evaluation. It is also known that the value function becomes larger when the greedy policy is adopted; this is called value improvement. The optimal policy can be found by repeating evaluation and improvement, which is called policy iteration. ・As a variant of policy iteration, the method that updates the value function directly rather than the policy is called value iteration; it greatly reduces the amount of value-function computation. ・Since policy iteration and value iteration are dynamic programming methods, they can only be used when the state transition probabilities are given; when they are not, the TD method is used. TD methods include Sarsa and Q-learning. ・Sarsa and Q-learning find the optimal policy by computing the action value function Q based on the Bellman equation.

That's all for this time. Thank you for reading to the end.
