[PYTHON] Reinforcement learning 2 Markov decision process / Bellman equation

Aidemy 2020/11/15

Introduction

Hello, I'm Yope! I am a liberal arts student, but I became interested in the possibilities of AI, so I went to the AI-specialized school "Aidemy" to study. I would like to share the knowledge I gained there, so I am summarizing it on Qiita. I am very happy that many people read the previous summary article. Thank you! This is my second post on reinforcement learning. Nice to meet you.

What to learn this time
・ Components of reinforcement learning
・ The Markov decision process
・ Value, return, and state
・ The optimal policy
・ The Bellman equation

Reinforcement learning components

N-armed bandit problem and the next step

- The N-armed bandit problem dealt with in Chapter 1 is simpler than the general problem in two respects: the rewards are immediate, and the state does not change depending on the agent's actions.
- The state represents the situation the environment is currently in. A board game makes this easy to picture: in shogi, once the opponent (the environment) makes a move, the board differs from the player's (the agent's) previous turn, so the next action to take also changes. At this point the state, the "board", has changed.
- Also, as touched on in Chapter 1, the original purpose of reinforcement learning is to maximize delayed rewards such as "winning the game" rather than immediate rewards.
・ This time, we will carry out reinforcement learning that incorporates these concepts of "state" and "time".

About the model of reinforcement learning

・ Reinforcement learning proceeds in the following cycle: ① the agent receives the state $s_{t}$ and acts on the environment with the action $a_{t}$; ② the environment transitions to the state $s_{t+1}$; ③ the environment gives the agent the reward $r_{t+1}$; ④ the agent decides its next action $a_{t+1}$ based on the results of ② and ③.
- "$t$" is the "time step", which counts how many times this cycle has been performed. It is the basic unit of time in reinforcement learning.

Formulation of states, actions, and rewards

- The set of states that the environment can take is formalized as $S = \{s_{0}, s_{1}, s_{2}, ...\}$.
- Similarly, the set of actions that an agent can take in state $s$ is expressed as $A(s) = \{a_{0}, a_{1}, a_{2}, ...\}$.
- The reward $R_{t+1}$ at time step $t+1$ is expressed as $R_{t+1} = r(S_{t}, A_{t}, S_{t+1})$.

・ When the reward is calculated by the formula above, $r$ is called the reward function. As this reward function shows, the future state is determined stochastically only by the current ($t$, $t+1$) state and action, and has nothing to do with past behavior. This property is called the "Markov property".

Markov decision process

What is the Markov decision process?

- As seen in the previous section, the property that the future state is determined only by the current state and action is called the "Markov property", and a reinforcement learning process that satisfies it is called a "Markov decision process (MDP)".
- A Markov decision process consists of five elements: the set of states, the "state space $S$"; the set of actions, the "action space $A(s)$"; the "initial state distribution $P_{0}$", the random variable representing the state at the start; the "state transition probability $P(s'|s,a)$", the probability of moving to the state $s'$ when the action $a$ is performed in the state $s$; and the "reward function $r(s,a,s')$".

- The figure below shows "Environment 1", the environment used this time. It shows that, starting from StateA and taking ActionX, the agent moves to StateB with probability 0.7 and returns to StateA with probability 0.3. Finally, from StateB, taking ActionX reaches the goal (End) with probability 0.8.

(Figure: state transition diagram of Environment 1)

・ Each of these cases, expressed with the five elements above as [s, s', a, p(s'|a,s), r(s,a,s')], forms the "state transition diagram". (Here s is the current state, s' is the state to move to next, a is ActionX or ActionY, p(s'|a,s) is the transition probability, and r(s,a,s') is the reward.)

- The following code puts all state transitions of Environment 1 into an array. When building the array, the states are replaced with numerical values: StateA = 0, StateB = 1, End = 2.

(Figure: code defining the state transition array of Environment 1)
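Since the original screenshot is not available, here is a minimal sketch of what that array could look like, reconstructed from the values quoted elsewhere in the article (the ActionX probabilities for StateA, and the 0.2 / 0.8 transitions with rewards -10 / 100 for StateB). The ActionY rows and the rewards on the StateA transitions are assumptions added only so the later code examples have something to run on.

```python
import numpy as np

# State transition diagram of Environment 1 as an array.
# Columns: [s, s', a, p(s'|s,a), r(s, a, s')]
# States: StateA = 0, StateB = 1, End = 2 / Actions: ActionX = 0, ActionY = 1
state_transition = np.array([
    [0, 1, 0, 0.7,   0],   # A --X--> B   (p = 0.7)
    [0, 0, 0, 0.3,   0],   # A --X--> A   (p = 0.3)
    [0, 0, 1, 1.0, -10],   # A --Y--> A   (assumed values)
    [1, 1, 0, 0.2, -10],   # B --X--> B   (p = 0.2, reward -10)
    [1, 2, 0, 0.8, 100],   # B --X--> End (p = 0.8, reward 100)
    [1, 0, 1, 1.0, -10],   # B --Y--> A   (assumed values)
])
```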

Introducing the concept of time: states and actions

- The five variables of the previous section, the components of the Markov decision process, do not yet include the concept of time, so we now introduce time and formulate them.
- For the state space $S$, define "$S_{t}$", which incorporates the concept of time $t$; likewise, "$A_{t}$" introduces the concept of time into the actions.
- The array used in the definitions is the "state transition diagram" of the previous section. Column 0 of this array is the "state", that is, the "current state (at time $t$)", so it corresponds to "$S_{t}$"; column 2 is the "action", that is, the "action at time $t$", so it corresponds to "$A_{t}$".
- The following is a function that, when passed $S_{t}$, returns $A_{t}$, the set of actions the agent can take in that state.

(Figure: code of the actions() function)
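A minimal sketch of what the actions() function could look like, following the description below and assuming the state_transition array sketched above; the exact original implementation is not visible, so details such as the use of np.isin are assumptions.

```python
def actions(state):
    # Rows of the state transition diagram whose current state (column 0)
    # matches the passed state S_t.
    A = state_transition[np.isin(state_transition[:, 0], state)]
    # Column 2 holds the actions; np.unique() removes duplicates.
    return np.unique(A[:, 2])

# Execution part: StateB is encoded as 1.
state = [1]
print(actions(state))  # -> [0. 1.] (ActionX and ActionY)
```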

-__ "actions ()" __ The __ "state" __ passed to the function is the state "$ S_ {t} ", and the __ "state_transition" __ referenced by A is the __state transition. Figure __. Since this 0th column is " S_ {t} $", describe that this part refers to the one in the same state as the passed state, and __ "A [:, 2]" __ in the 2nd column. Returns a list of possible actions $ A_ {t} $ at that time by referring to. At this time, the return value is a possible action, so it must not be duplicated, that is, it must be a __unique element __. To do this, use __ "np.unique ()" __.

- In the execution part, the current state is "state = [1]", indicating that the agent is in StateB. Looking at the rows of state_transition whose column 0 (the current state) is 1, column 2 contains 0 and 1, so these are returned as the result of the function.

Introducing the concept of time: the other elements

- Here, the initial state distribution, the state transition probability, and the reward function are redefined with the concept of time.
- The initial state distribution is simple: it just defines the state $S_{0}$ at $t = 0$.
- Because of the Markov property of the decision process, the state $S_{t+1}$ at time $t+1$ is determined only by $S_{t}$ and $A_{t}$, so the state transition probability is written $P(s'|s_{t}, a_{t})$.
- The reward $R_{t+1}$ received when transitioning to the state $S_{t+1}$ can be written $R_{t+1} = r(s_{t}, a_{t}, s_{t+1})$. The following code defines this reward function.

(Figure: code of the reward function R())
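A minimal sketch of the reward function along the lines described below, again assuming the state_transition array above and a terminals list containing the End state; the original code is not visible, so the details are assumptions.

```python
terminals = [2]  # End

def R(state, after_state, action):
    # No reward is given once the agent is already at a terminal state.
    if np.isin(state, terminals).all():
        return 0
    # Rows whose columns 0, 1 and 2 match state, after_state and action.
    X = state_transition[np.isin(state_transition[:, 0], state)
                         & np.isin(state_transition[:, 1], after_state)
                         & np.isin(state_transition[:, 2], action)]
    # Column 4 of the matching row holds the reward.
    return X[:, 4]

# Execution part: currently in StateB (1), reaching End (2) with ActionX (0).
state, after_state, action = [1], [2], [0]
print(R(state, after_state, action))  # -> [100.]
```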

-For this function __ "R" , pass the above-mentioned $ s {t}, a {t}, s_ {t + 1} $ as arguments. As a premise, 0 is returned because there is no reward when __ is already at the terminal. -This time as well, the reward is returned by referring to "state_transition", but since reward is in the __4th column of this array, the state (row) that meets the conditions is extracted and that You can return the 4th column. -The condition this time is to find the matching "state, after_state, action" passed in the argument from the 0th, 1st, and 2nd columns of state_transition.

- In the execution part, state is [1], after_state is [2], and action is [0], so we refer to the state transition "currently in StateB and reaching End by performing ActionX". In this state_transition that is row 4, [1, 2, 0, 0.8, 100], so the reward in column 4 of this row, 100, is output.

Episode

- The time it takes from the start to the end of a task is called an "episode". By repeating the cycle "action → state change" many times, the time step advances and one episode is composed.
・ In a reinforcement learning model, learning proceeds by carrying out this one episode, "initialize the environment, let the agent act, and optimize the behavior model based on the received rewards", many times.
・ A turn-based card game is an easy example. The board changes with the player's actions every turn, and the task ends with a win or a loss. Taking this as one episode, by playing many games the agent gradually learns the best moves.

- The following is the function "T()" that defines the transitions within an episode. This function "receives the current state and action, extracts the matching rows from state_transition, and returns the state transition probabilities and the states after the transition".

(Figure: code of the T() function)
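A minimal sketch of the T() function along the lines of the description below, under the same assumptions as above (the state_transition array and the terminals list); the original code is not visible.

```python
def T(state, action):
    # At a terminal state nothing changes, so [(0, terminals)] is returned.
    if np.isin(state, terminals).all():
        return [(0, terminals)]
    # Rows matching the passed state (column 0) and action (column 2).
    X = state_transition[np.isin(state_transition[:, 0], state)
                         & np.isin(state_transition[:, 2], action)]
    # Column 3 is the transition probability, column 1 the state after the
    # transition; return them as a list of (probability, next state) tuples.
    A = X[:, [3, 1]]
    return [tuple(A[i, :]) for i in range(len(A))]

# Execution example: from StateB (1) with ActionX (0).
print(T([1], [0]))  # -> [(0.2, 1.0), (0.8, 2.0)]
```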

- In the definition of T, when the agent is already at a terminal state (terminals), [(0, terminals)] is returned because the state does not change.
- For X, the rows matching the passed arguments are extracted as in the previous section. Column 3 (the state transition probability) and column 1 (the state after the transition) of these rows are stored in A, and for every possible case a tuple "tuple(A[i, :])" is created and the resulting list is returned.

Value / Return / State

Rewards and return

・ Up to this point, we have modeled and defined reinforcement learning. From here, we consider the question "what should be the basis for finding the optimal policy?"
・ If the reward is used as the index, it is determined only by the immediately preceding action, so we overlook actions that have little value at the time but lead to very large value afterwards.
- The solution to this is the "return". The return is the sum of the rewards obtained over a certain period, so later rewards can be taken into account.
・ If the reward obtained by the action $a_{t}$ at time $t$ is "$R_{t}$", the return "$G_{t}$", the sum of the rewards within a certain period, can be calculated by the following formula. $ G_{t} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \displaystyle\sum_{\tau=0}^{\infty} \gamma^{\tau} R_{t+1+\tau} $

- $\gamma$ is called the "discount rate" and takes a value between 0 and 1. It represents "how much a reward received in the future is worth now". The closer it is to 0, the less the future value counts; the closer it is to 1, the more weight is placed on the future value. An illustrative calculation follows below.
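As a small illustration (not from the article), the effect of the discount rate can be checked numerically for a hypothetical reward sequence:

```python
# Return G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... for a finite,
# made-up reward sequence.
def discounted_return(rewards, gamma):
    return sum((gamma ** tau) * r for tau, r in enumerate(rewards))

rewards = [-10, -10, 100]               # hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, 0.9))  # 62.0  -> the future reward of 100 still counts
print(discounted_return(rewards, 0.1))  # -10.0 -> the future reward barely counts
```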

- Although there are several ways to calculate the total reward, the "sum of discounted rewards" is generally used. It is obtained by calculating the discounted return from time $t$ up to a certain time $T$ and taking the limit as $T$ goes to infinity.
- The sum of discounted rewards can be defined with the two functions below.

(Figure: code of take_single_action() and calculate_value())
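A minimal sketch of the two functions along the lines of the description below, reusing the T() and R() sketches above. The number of repetitions (two) follows the article; the discount rate of 0.8 and the other details are assumptions.

```python
import random

def take_single_action(state, action):
    # Draw a random number in [0, 1) and walk through the possible
    # (probability, next_state) pairs; the first pair whose cumulative
    # probability exceeds x gives the sampled next state.
    x = random.uniform(0, 1)
    cumlative_probability = 0.0
    for probability, next_state in T(state, action):
        cumlative_probability += probability
        if x < cumlative_probability:
            break
    return np.ravel(next_state).tolist()

def calculate_value(state, action, episode=2, gamma=0.8):
    # Sum of discounted rewards ("sumg") accumulated over `episode` steps.
    sumg = 0.0
    for i in range(episode):
        next_state = take_single_action(state, action)
        # R may return a length-1 array or 0; np.sum turns both into a scalar.
        sumg += (gamma ** i) * np.sum(R(state, next_state, action))
        state = next_state  # move on to the next state
    return sumg

# Execution example: discounted reward sum starting from StateB with ActionX.
print(calculate_value([1], [0]))
```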

-For __ "take_single_action ()" __, this is the __ "determine the next state based on the transition probability" __ function. As a code, first __ "random.uniform (0,1)" __ generates a random number from 0 to 1. It also defines __cumlative probability "cumlative_probability" __. -In the for statement, the possible probability __ "probability" __ and the next state __ "next_state" __ are extracted from the passed state and action for each episode "T", and the cumulative probability "cumlative_probability" Add "probability" for each episode to, and when this becomes larger than "x", end the for statement and return "next_state" at that point.

- The other function, "calculate_value()", calculates the "sum of discounted rewards". First, "state", "action", and the discounted reward sum "sumg" are defined.
- The for statement repeats for the specified number of steps, here two. In each iteration, the next state is obtained with "take_single_action", $R_{t+1}$ is computed from it together with the current "state" and "action", multiplied by $\gamma^i$, and accumulated into "sumg". Then the "state" is updated, the loop moves to the next step, and finally the cumulative sum of discounted rewards "sumg" is returned.

・ Execution part: (Figure: execution result of calculate_value())

Value function / state value function

・ In a method like that of the previous section, where the return is used as the evaluation standard of a policy, R is determined probabilistically, so the more the states branch, the more complicated the calculation formula becomes.
・ This drawback can be eliminated by taking the expected value of the return. The expected value of the return when starting from a certain state, taking into account all subsequent actions, is called the "value" or "state value". The better the policy, the greater the state value.
・ Introducing this concept makes two things possible: under a given policy, comparing which state is superior; and, with a given state as the starting point, comparing how good each policy is.

Searching for the optimal policy

Optimal policy

- By introducing the state value function in the previous section, it has become possible to compare how good different policies are.
・ When comparing two policies $\pi_{1}, \pi_{2}$:
- in every state $s$ of the state space $S$, the state value $V^{\pi_{1}}(s)$ is greater than or equal to $V^{\pi_{2}}(s)$;
- in at least one state $s$ of the state space $S$, the state value $V^{\pi_{1}}(s)$ is strictly greater than $V^{\pi_{2}}(s)$.

When these two conditions hold, we can say that "$\pi_{1}$ is a better policy than $\pi_{2}$".
- Among all policies there should be a best one; this is called the optimal policy and is written $\pi^*$.

Action value

- The state value function is the "expected value of the return when starting from a certain state", but in practice it is often better to consider a value that also takes the action into account, such as the "expected value of the return when starting from a certain state and taking a certain action". This value is called the "action value", and the function that gives the action value under a certain policy is called the "action value function".
- The action value function can be seen as a function that determines how good each "action" is.
- As a formula, the following geometric series formula is used: $a + ar + ar^2 + \cdots + ar^{n-1} = \large\frac{a(1 - r^n)}{1 - r}$

・ The action value can then be calculated with "a" as the reward and "r" as $\gamma$. The code for this looks like the following:

(Figure: code calculating the action value with the geometric series formula)
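The original code is not shown, so the following is only a sketch of how the geometric series formula could be applied, with the reward as $a$ and $\gamma$ as $r$:

```python
# Discounted sum of a fixed reward "a" received n times, computed with
# a + a*r + ... + a*r^(n-1) = a * (1 - r^n) / (1 - r).
def geometric_value(reward, gamma, n):
    return reward * (1 - gamma ** n) / (1 - gamma)

# For example, receiving a reward of 100 for 10 steps with gamma = 0.8:
print(geometric_value(100, 0.8, 10))  # -> about 446.3
```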

Optimal state value function / Optimal action value function

-Of the "state value function" and "action value function" that we have seen so far, each function when the optimum policy is followed is called __ "optimal state value function" and "optimal action value function" __. .. They are expressed as "$ V ^ \ * (s) " and " Q ^ \ * (s) $", respectively. ・ __ Once the optimal state value function and optimal action value function are obtained __, it is possible to know which action is the most profitable action, so always select the optimal action __ "greedy method" _ Just choose _.

Bellman equation

Optimal state value

- To find the optimal state value function, we need to find each state value function and compare them. The same applies to the action value function, so here we proceed with the state value function.
- The Bellman equation is used to find each state value function. It is based on the idea that "a relation between value functions holds between the state $s$ and the states $s'$ that may be reached as a result of an action".
- The Bellman equation writes each value function as a recurrence, as follows.

V^{\pi}(s) = \displaystyle\sum_{a \in A(s)} \pi(a|s) \displaystyle\sum_{s' \in S} P(s'|s,a) \left( r(s,a,s') + \gamma V^{\pi}(s') \right)

Q^{\pi}(s,a) = \displaystyle\sum_{s' \in S} P(s'|s,a) \left( r(s,a,s') + \gamma \displaystyle\sum_{a' \in A(s')} \pi(a'|s') Q^{\pi}(s',a') \right)

- Here, $\pi(a|s)$ represents the probability that the action $a$ is selected in the state $s$.

-For example, in "Environment 1", when __ "s = StateB", "$ \ gamma $ = 0.8", and "do only ActionX" __, the value function is calculated by the Bellman equation. $ V ^ \ pi (B) = 0.2 (-10 + \ gamma * V ^ \ pi (B)) + 0.8 (100 + \ gamma * 0) $ as the equation for $ V ^ \ pi (B) $ You just have to solve it. That is, to put it more simply, __ "(probability of causing the action) x (Reward by action + $ \ gamma $ x State after movement)" __ is calculated for all cases, and the __ sum __ You can solve the equation using.

・ Environment 1: (Figure: state transition diagram of Environment 1)

Bellman optimal equation

- The Bellman equation can of course also be applied to the optimal state value function and the optimal action value function mentioned above. The Bellman equation under the assumption that the optimal policy is always taken is called the "Bellman optimality equation".
- For example, to compare which of "the policy $\pi_{1}$ that always takes ActionX" and "the policy $\pi_{2}$ that always takes ActionY" is the better policy in Environment 1, the function "calculate_value()" that calculates the return (sum of discounted rewards) can be used as follows.

(Figure: code comparing the two policies with calculate_value())
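A minimal sketch of what that comparison could look like, reusing the calculate_value() sketch above; the use of np.max / np.argmax and the layout of the result arrays are assumptions.

```python
# Sampled return of "always ActionX" (0) and "always ActionY" (1) from each state.
values = np.array([[calculate_value([s], [a]) for a in (0, 1)]
                   for s in (0, 1)])        # rows: StateA, StateB / columns: X, Y
Q_optimum = np.max(values, axis=1)          # best sampled return from each state
Q_pi = np.argmax(values, axis=1)            # which policy achieves it (0 = always ActionX)
print(Q_optimum)
print(Q_pi)  # -> [0 0]: always taking ActionX is the better policy from both states
```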

- Q_optimum represents the maximum return from each state, and Q_pi represents the policy that achieves it. Since Q_pi is [0 0], we can see that it is better to take the policy $\pi_{1}$, which always takes ActionX.

Summary

-The reward "R_ {t + 1}" at the time step "t + 1" is expressed as $ r (S_ {t}, A_ {t}, S_ {t + 1}) $. At this time, the fact that the future state (t + 1) is determined only by the current state and behavior (t) is called "Markov property", and the process of reinforcement learning that satisfies this is called the Markov determination process. ・ In the Markov decision process, the components are represented by an array such as [s, s', a, p (s'| a, s), r (s, a, s')] in the "state transition diagram". ". When the concept of time is introduced, it is represented by $ [S_ {t}, S_ {t + 1}, A_ {t}, S_ {0}, R_ {t + 1}] $. -The time from the start to the end of a task is called an "episode". In reinforcement learning, learning is repeated for this episode by "initializing the environment and optimizing based on the rewards of the agent's actions." ・ In actual reinforcement learning, consider the optimal measures based on "earnings" including not only immediate rewards but also late rewards. Revenue is calculated by the "discount reward sum", which is the sum of the rewards at each time step multiplied by the "discount rate $ \ gamma $". -In addition, when searching for the optimal policy, the amount of calculation will increase if the profit is calculated as it is, so it is better to calculate the "state value function" and "action value function" that are the expected values of the reward. The "Bellman equation" is used to find these. If the Bellman equation is used for the state value function, it is calculated using "probability of causing Action", "reward", and "State after movement".

This time is over. Thank you for reading this far.
