** Future strengthening learning **

I'm Harima, a first year master's student in the Graduate School of Science. I will summarize my learning contents as a memo. I'm sorry it's hard to see. I would like to know what you do not understand.

** Chap.1 Basic theory of reinforcement learning **

--Concisely organize the basic concepts of reinforcement learning based on the latest perspective

** 1.2 Reinforcement learning components **

--The basic framework of reinforcement learning and the Markov decision process, which is a mathematical model for describing interactions --Based on this model, formulate various concepts such as profit, value, and policy.

--The framework of reinforcement learning is ** Agent **, ** Environment **, ** Interaction **

--Receive and deliver status, actions, and rewards every hour step

-** Measures ** ・ ・ ・ Rules for agents to decide their actions

--Design an algorithm to improve the policy by having the agent act on the environment through ** "action" ** and observe the result in the form of ** "reward" ** and ** "state" **.

――The important issue is how to determine the reward function.

-** Markov decision process ** ・ ・ ・ State space $S$, action space $A (s)$, initial state distribution $P_0$, state transition probability $P (s'| s, a)$, reward function Stochastic process described by $r (s, a, s')$

--Set the state set $S$ to be a set of all states --Let $s$ be the variable that represents the elements of this set -The state set consisting of $N$ type states is as follows.

  math
S={s_1,s_2,...,s_N}



--Let $S_t$ be the random variable that represents the state in the time step $t . --If you write the states in order from time step 0, it is as follows$S_0,S_1,S_2,...,S_t,...$$--The action set  A (s)  is a set consisting of all selectable actions in a certain state  s . A(s)={a_1,a_2,...,a_M}$$

--Let $A_t$ be the random variable that represents the behavior of the agent determined in the state $S_t$ in the time step $t$ --If you write the states in order from time step 0, it is as follows $A_0,A_1,A_2,...,A_t,...$

--Let $R_t + 1$ be the random variable that represents the reward that depends on $S_t$, $A_t$, $S_ {t + 1}$. --Take one of the values of the set $R$ of all real numbers

• model

--The environment probabilistically determines the state (initial state) at the initial time (** initial state distribution **) $S_0~P_0(s)$

--The next state is stochastically determined by the current state and behavior. --When the agent decides the action $a$ in the state $s$, the probability that the state transitions to $s'$ is given as follows. $P(s'|s,a)$

--The state $S_ {t + 1}$ at the step $t + 1$ is determined as follows. $S_{t+1}~P(s'|S_t,A_t)$

-** Markov property ** ・ ・ ・ The property that the transition probability is determined only by the immediately preceding state

--The environment determines the reward $R_ {t + 1}$ according to the current state $S_t$, the action $A_t$, and the next state $S_ {t + 1} .$R_{t+1}=r(S_t,A_t,S_{t+1})$$--Action is determined based on the agent's policy ( \ pi ) ――In a certain state, a measure whose behavior is stochastically determined is called a probabilistic measure. --Under the probabilistic policy  \ pi , the probability that a certain action  a  is selected in a certain state  s  is expressed as  \ pi (a | s) . -** Tic-tac-toe ** -Each player puts a stone on the  9  square of  3 x 3 , and wins if his stones line up in a straight line. --The agent gives a positive reward to the winning board and a negative reward to the losing board. --The initial state distribution is as follows  math  P_0(s)=\begin{cases}1 ,,,,,, (s=s_1) \ 0 ,,,,,, (otherwise) \end{cases}  -** Time Steps and Episodes ** -** Time step ** ・ ・ ・ Basic unit of time in the interaction between the agent and the environment -** Episode ** ・ ・ ・ The time from the start to the end of the task is summarized and consists of multiple time steps. -** What is a good policy ** -** Revenue ** ・ ・ ・ Cumulative reward obtained in a certain period (sum of rewards in the period) --The reward  R_t  obtained in the time step  t , the interval length is  T , and the revenue  G_t  is defined as follows. math G_t=\sum^{T-1}_{\tau=0}{R_{t+1+\tau}}  --Define profits in longer-term intervals math G_t=\lim_{T\rightarrow \infty} \frac{1}{T}\sum^{T-1}_{\tau=0}{R_{t+1+\tau}}  -** Discount reward sum ** ・ ・ ・ Profit that expresses future uncertainty in the form of discounted reward math G_t=\sum^{\infty}_{\tau=0}\gamma^{\tau}R_{t+1+\tau}=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+...  --Discount rate  \ gamma (0 \ le \ gamma \ le 1)  is a constant that indicates how much the future will be discounted. ――Revenue is an index that evaluates the rewards obtained from a long-term perspective. --Take the expected value of profit on the condition of the state, and call this the ** state value **. -** State value ** ・ ・ ・ Expected value of profit obtained when action is decided according to policy  \ pi  from a certain state math V^{\pi}(s)=E^{\pi}[ G_t|S_t=s ]  --"Expected value under policy  \ pi " ... Expected value when the agent decides the action based on the policy  \ pi  from the state  s  in the time step  t  -Consider an example of finite interval profit of  T = 1  --The profits to consider are as follows math G_t=R_{t+1}  --The probability that the state is  s' in the time step  t + 1  is as follows  math  P(S_{t+1}=s',A_t=a|S_t=s)=P(S_{t+1}=s'|S_t=s,A_t=a) \pi(a|s)  --The state value is as follows by taking the expected value with the state  S_t  as a condition.$$\begin{eqnarray*} V^{\pi}(s)&=& E^{\pi}[G_t|S_t=s] \\
&=& \sum_{s' \in S} \sum_{a \in A(s)} P(S_{t+1}=s',A_t=a|S_t=s) r(s,a,s') \\
&=& \sum_{s' \in S} \sum_{a \in A(s)} P(S_{t+1}=s'|S_t=s,A_t=a) \pi(a|s) r(s,a,s') \end{eqnarray*} $$-Consider an example of finite interval revenue with  T = 2  --The profits to consider are as follows math G_t=R_{t+1}+R_{t+2}  --Expected values are as follows  math  \begin{eqnarray*} V^{\pi}(s) &=& E[G_t|S_t=s]=E^\pi[R_{t+1}+R_{t+2}|S_t=s] \ &=& \sum_{s''\in S}\sum_{a'\in A(s)}\sum_{s'\in S}\sum_{a\in A(s)} P(S_{t+2}=s'',A_{t+1}=a',S_{t+1}=s',A_t=a|S_t=s){r(s,a,s')+r(s',a',s'')} \ &=& \sum_{s''\in S}\sum_{a'\in A(s)}\sum_{s'\in S}\sum_{a\in A(s)} P(S_{t+2}=s''|S_{t+1}=s',A_{t+1}=a')\pi(a'|s')×P(S_{t+1}=s'|S_t=s,A_t=a)\pi(a|s){r(s,a,s')+r(s',a',s'')} \end{eqnarray*}  -When fixing  \ pi  and changing  s  --Evaluate the expected return earned when deciding actions based on certain fixed measures for various conditions --It can be used as an index to show the goodness of the state under a certain measure  \ pi  (** state value function **) -When fixing  s  and changing  \ pi  --Evaluate the profits that are expected to be earned by starting actions from a certain state for various measures ――Indicator showing the goodness of the policy when starting from a certain state  s$$
\forall s\in S,\,\,\,\,\, V^\pi(s) \ge V^{{\pi}^{'}}(s)\\
\exists s\in S,\,\,\,\,\, V^\pi(s) >  V^{{\pi}^{'}}(s)


-** Optimal policy **

-** Optimal state values ** ・ ・ ・ $V ^ * (s)$

math
\forall s\in S,\,\,\,\,\, V^*(s)=V^{{\pi}^{*}}(s)=\max_\pi V^\pi (s)


-** Behavioral values ** ・ ・ ・ $Q ^ \ pi$

  math


Q^\pi(s,a)=E^\pi[G_t|S_t=s,A_t=a]

 -For $A_t, S_ {t + 1}, A_ {t + 1}$, take the expected value according to the probability of appearance
――A trajectory in which each state and action are connected

---

-Finite segment revenue of $T = 1$

math
X_1=\{\Xi=(s,a,s')|s\in S,a\in A,s'\in S\}


-Call $\ Xi$ ** orbit **

--Orbital set with fixed initial state

  math


X_1|_s={\Xi=(s,a,s')|a\in A,s'\in S}

 --A set of orbitals with fixed initial state and behavior

math
X_1|_s(s,a)=\{\Xi=(s,a,s')|s'\in S\}


--Revenue is regarded as a function of orbit

  math


G_t=G_t(\Xi)

      math
V^\pi(s)=\sum_{\Xi\in X_1|_s}P(\Xi)G_t(\Xi)\\
Q^\pi(s,a)=\sum_{\Xi\in X_1|_{(s,a)}}P(\Xi)G_t(\Xi)


-When $T = 2$

  math


X_2|_s={\Xi=(s,a,s',a',s'')|a\in A,s'\in S,a'\in A,s''\in S}

      math
X_2|_{(s,a)}=\{\Xi=(s,a,s',a',s'')|s'\in S,a'\in A,s''\in S\}


--In the environment shown in Figure 1.2.5, the set of orbits to be considered when calculating the state value is as follows.

  math


X_1|_{s_1}={(s_1,a_1,s_3),(s_1,a_2,s_2)}

      math
X_2|_{s_1}=\{(s_1,a_1,s_3,a_1,s_4),(s_1,a_1,s_3,a_2,s_1),(s_1,a_2,s_2,a_1,s_1),(s_1,a_2,s_2,a_2,s_4)\}


--The set of trajectories to consider when seeking action value is as follows.

  math


X_1|_{(s_1,a_1)}={(s_1,a_1,s_3)}

      math
X_2|_{(s_1,a_1)}=\{(s_1,a_1,s_3,a_1,s_4),(s_1,a_1,s_3,a_2,s_1)\}


――How to find a good policy

-** greedy policy **

  math


\pi(a|s)=\begin{cases}1 ,,,,,, (a=\arg \max_aQ(s,a)) \ 0 ,,,,,, (otherwise) \end{cases}

 --Estimate the optimal behavioral value function
――Sometimes it is necessary to stochastically select actions that are not always the best at that time.

-** $\ epsilon$ -greedy policy **

math
\pi(a|s)=\begin{cases}1-\epsilon+\frac{\epsilon}{|A(s)|} \,\,\,\,\,\, (a= \arg \max_{a} Q(s,a)) \\\frac{\epsilon}{|A(s)|}  \,\,\,\,\,\, (otherwise) \end{cases}


-** Boltzmann (Softmax) policy ** ・ ・ ・ Selection probability follows Gibbs distribution

  math


\pi(a|s)=\frac{\exp(Q(s,a)/T)}{\sum_{b\in A}\exp(Q(s,b)/T)}

 -$T$ is the temperature parameter

`