Future strengthening learning

I'm Harima, a first year master's student in the Graduate School of Science. I will summarize my learning contents as a memo. I'm sorry it's hard to see. I would like to know what you do not understand.

Chap.1 Basic theory of reinforcement learning

--Concisely organize the basic concepts of reinforcement learning based on the latest perspective

1.2 Reinforcement learning components

--The basic framework of reinforcement learning and the Markov decision process, which is a mathematical model for describing interactions --Based on this model, formulate various concepts such as profit, value, and policy.

--The framework of reinforcement learning is ** Agent **, ** Environment **, ** Interaction **

--Receive and deliver status, actions, and rewards every hour step

-** Measures ** ・・・ Rules for agents to decide their actions

--Design an algorithm to improve the policy by having the agent act on the environment through ** "action" ** and observe the result in the form of ** "reward" ** and ** "state" **.

――The important issue is how to determine the reward function.

-** Markov decision process ** ・・・ State space $ S $, action space $ A (s) $, initial state distribution $ P_0 $, state transition probability $ P (s'| s, a) $, reward function Stochastic process described by $ r (s, a, s') $

--Set the state set $ S $ to be a set of all states --Let $ s $ be the variable that represents the elements of this set -The state set consisting of $ N $ type states is as follows.

  ```math
  S={s_1,s_2,...,s_N}
  ```

--Let $ S_t $ be the random variable that represents the state in the time step $ t . --If you write the states in order from time step 0, it is as follows $S_0,S_1,S_2,...,S_t,...$$

--The action set $ A (s) $ is a set consisting of all selectable actions in a certain state $ s . $A(s)={a_1,a_2,...,a_M}$$

--Let $ A_t $ be the random variable that represents the behavior of the agent determined in the state $ S_t $ in the time step $ t $ --If you write the states in order from time step 0, it is as follows $A_0,A_1,A_2,...,A_t,...$

--Let $ R_t + 1 $ be the random variable that represents the reward that depends on $ S_t $, $ A_t $, $ S_ {t + 1} $. --Take one of the values of the set $ R $ of all real numbers

model

--The environment probabilistically determines the state (initial state) at the initial time (** initial state distribution **) $S_0~P_0(s)$

--The next state is stochastically determined by the current state and behavior. --When the agent decides the action $ a $ in the state $ s $, the probability that the state transitions to $ s'$ is given as follows. $P(s'|s,a)$

--The state $ S_ {t + 1} $ at the step $ t + 1 $ is determined as follows. $S_{t+1}~P(s'|S_t,A_t)$

-** Markov property ** ・・・ The property that the transition probability is determined only by the immediately preceding state

--The environment determines the reward $ R_ {t + 1} $ according to the current state $ S_t $, the action $ A_t $, and the next state $ S_ {t + 1} . $R_{t+1}=r(S_t,A_t,S_{t+1})$$

--Action is determined based on the agent's policy ($ \ pi $) ――In a certain state, a measure whose behavior is stochastically determined is called a probabilistic measure. --Under the probabilistic policy $ \ pi $, the probability that a certain action $ a $ is selected in a certain state $ s $ is expressed as $ \ pi (a | s) $.

-** Tic-tac-toe ** -Each player puts a stone on the $ 9 $ square of $ 3 x 3 $, and wins if his stones line up in a straight line.

--The agent gives a positive reward to the winning board and a negative reward to the losing board. --The initial state distribution is as follows

  ```math

P_0(s)=\begin{cases}1 ,,,,,, (s=s_1) \ 0 ,,,,,, (otherwise) \end{cases}

      
      
 -** Time Steps and Episodes **

 -** Time step ** ・ ・ ・ Basic unit of time in the interaction between the agent and the environment
      
 -** Episode ** ・ ・ ・ The time from the start to the end of the task is summarized and consists of multiple time steps.
      
 -** What is a good policy **
   
 -** Revenue ** ・ ・ ・ Cumulative reward obtained in a certain period (sum of rewards in the period)
      
 --The reward $ R_t $ obtained in the time step $ t $, the interval length is $ T $, and the revenue $ G_t $ is defined as follows.
      
      ```math
      G_t=\sum^{T-1}_{\tau=0}{R_{t+1+\tau}}
      ```

 --Define profits in longer-term intervals
      
      ```math
      G_t=\lim_{T\rightarrow \infty} \frac{1}{T}\sum^{T-1}_{\tau=0}{R_{t+1+\tau}}
      ```
 -** Discount reward sum ** ・ ・ ・ Profit that expresses future uncertainty in the form of discounted reward
      
      ```math
      G_t=\sum^{\infty}_{\tau=0}\gamma^{\tau}R_{t+1+\tau}=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+...
      ```
 --Discount rate $ \ gamma (0 \ le \ gamma \ le 1) $ is a constant that indicates how much the future will be discounted.
      
 ――Revenue is an index that evaluates the rewards obtained from a long-term perspective.
      
 --Take the expected value of profit on the condition of the state, and call this the ** state value **.
      
 -** State value ** ・ ・ ・ Expected value of profit obtained when action is decided according to policy $ \ pi $ from a certain state
      
      ```math
      V^{\pi}(s)=E^{\pi}[ G_t|S_t=s ]
      ```
 --"Expected value under policy $ \ pi $" ... Expected value when the agent decides the action based on the policy $ \ pi $ from the state $ s $ in the time step $ t $
      
 -Consider an example of finite interval profit of $ T = 1 $
   
 --The profits to consider are as follows
      
      ```math
G_t=R_{t+1}

--The probability that the state is $ s'$ in the time step $ t + 1 $ is as follows

  ```math

P(S_{t+1}=s',A_t=a|S_t=s)=P(S_{t+1}=s'|S_t=s,A_t=a) \pi(a|s)

 --The state value is as follows by taking the expected value with the state $ S_t $ as a condition.
      
      $$\begin{eqnarray*} V^{\pi}(s)&=& E^{\pi}[G_t|S_t=s] \\ 
                                   &=& \sum_{s' \in S} \sum_{a \in A(s)} P(S_{t+1}=s',A_t=a|S_t=s) r(s,a,s') \\
                                    &=& \sum_{s' \in S} \sum_{a \in A(s)} P(S_{t+1}=s'|S_t=s,A_t=a) \pi(a|s) r(s,a,s') \end{eqnarray*} $$
                  
 -Consider an example of finite interval revenue with $ T = 2 $
   
 --The profits to consider are as follows
      
      ```math
G_t=R_{t+1}+R_{t+2}

--Expected values are as follows

  ```math

\begin{eqnarray*} V^{\pi}(s) &=& E[G_t|S_t=s]=E^\pi[R_{t+1}+R_{t+2}|S_t=s] \ &=& \sum_{s''\in S}\sum_{a'\in A(s)}\sum_{s'\in S}\sum_{a\in A(s)} P(S_{t+2}=s'',A_{t+1}=a',S_{t+1}=s',A_t=a|S_t=s){r(s,a,s')+r(s',a',s'')} \ &=& \sum_{s''\in S}\sum_{a'\in A(s)}\sum_{s'\in S}\sum_{a\in A(s)} P(S_{t+2}=s''|S_{t+1}=s',A_{t+1}=a')\pi(a'|s')×P(S_{t+1}=s'|S_t=s,A_t=a)\pi(a|s){r(s,a,s')+r(s',a',s'')} \end{eqnarray*}

 -When fixing $ \ pi $ and changing $ s $
   
 --Evaluate the expected return earned when deciding actions based on certain fixed measures for various conditions
 --It can be used as an index to show the goodness of the state under a certain measure $ \ pi $ (** state value function **)

 -When fixing $ s $ and changing $ \ pi $

 --Evaluate the profits that are expected to be earned by starting actions from a certain state for various measures
 ――Indicator showing the goodness of the policy when starting from a certain state $ s $
   
$$
\forall s\in S,\,\,\,\,\, V^\pi(s) \ge V^{{\pi}^{'}}(s)\\ 
\exists s\in S,\,\,\,\,\, V^\pi(s) >  V^{{\pi}^{'}}(s)
$$  

 -** Optimal policy **
   
 -** Optimal state values ** ・ ・ ・ $ V ^ * (s) $
      
      ```math
\forall s\in S,\,\,\,\,\, V^*(s)=V^{{\pi}^{*}}(s)=\max_\pi V^\pi (s)

-** Behavioral values ** ・・・ $ Q ^ \ pi $

  ```math

Q^\pi(s,a)=E^\pi[G_t|S_t=s,A_t=a]

 -For $ A_t, S_ {t + 1}, A_ {t + 1} $, take the expected value according to the probability of appearance
 ――A trajectory in which each state and action are connected
      
---
      
 -Finite segment revenue of $ T = 1 $
   
      ```math
X_1=\{\Xi=(s,a,s')|s\in S,a\in A,s'\in S\}

-Call $ \ Xi $ ** orbit **

--Orbital set with fixed initial state

  ```math

X_1|_s={\Xi=(s,a,s')|a\in A,s'\in S}

 --A set of orbitals with fixed initial state and behavior
      
      ```math
X_1|_s(s,a)=\{\Xi=(s,a,s')|s'\in S\}

--Revenue is regarded as a function of orbit

  ```math

G_t=G_t(\Xi)

      ```math
V^\pi(s)=\sum_{\Xi\in X_1|_s}P(\Xi)G_t(\Xi)\\
      Q^\pi(s,a)=\sum_{\Xi\in X_1|_{(s,a)}}P(\Xi)G_t(\Xi)

-When $ T = 2 $

  ```math

X_2|_s={\Xi=(s,a,s',a',s'')|a\in A,s'\in S,a'\in A,s''\in S}

      ```math
X_2|_{(s,a)}=\{\Xi=(s,a,s',a',s'')|s'\in S,a'\in A,s''\in S\}

--In the environment shown in Figure 1.2.5, the set of orbits to be considered when calculating the state value is as follows.

  ```math

X_1|_{s_1}={(s_1,a_1,s_3),(s_1,a_2,s_2)}

      ```math
X_2|_{s_1}=\{(s_1,a_1,s_3,a_1,s_4),(s_1,a_1,s_3,a_2,s_1),(s_1,a_2,s_2,a_1,s_1),(s_1,a_2,s_2,a_2,s_4)\}

--The set of trajectories to consider when seeking action value is as follows.

  ```math

X_1|_{(s_1,a_1)}={(s_1,a_1,s_3)}

      ```math
X_2|_{(s_1,a_1)}=\{(s_1,a_1,s_3,a_1,s_4),(s_1,a_1,s_3,a_2,s_1)\}

――How to find a good policy

-** greedy policy **

  ```math

\pi(a|s)=\begin{cases}1 ,,,,,, (a=\arg \max_aQ(s,a)) \ 0 ,,,,,, (otherwise) \end{cases}

 --Estimate the optimal behavioral value function
 ――Sometimes it is necessary to stochastically select actions that are not always the best at that time.
      
 -** $ \ epsilon $ -greedy policy **
      
      ```math
\pi(a|s)=\begin{cases}1-\epsilon+\frac{\epsilon}{|A(s)|} \,\,\,\,\,\, (a= \arg \max_{a} Q(s,a)) \\\frac{\epsilon}{|A(s)|}  \,\,\,\,\,\, (otherwise) \end{cases}

-** Boltzmann (Softmax) policy ** ・・・ Selection probability follows Gibbs distribution

  ```math

\pi(a|s)=\frac{\exp(Q(s,a)/T)}{\sum_{b\in A}\exp(Q(s,b)/T)}

 -$ T $ is the temperature parameter

[PYTHON] Future reinforcement learning_2

** Future strengthening learning **

** Chap.1 Basic theory of reinforcement learning **

** 1.2 Reinforcement learning components **

Future strengthening learning

Chap.1 Basic theory of reinforcement learning

1.2 Reinforcement learning components