I'm Harima, a first year master's student in the Graduate School of Science. I will summarize my learning contents as a memo. I'm sorry it's hard to see. I would like to know what you do not understand.

--Concisely organize the basic concepts of reinforcement learning based on the latest perspective

--The basic framework of reinforcement learning and the Markov decision process, which is a mathematical model for describing interactions --Based on this model, formulate various concepts such as profit, value, and policy.

--The framework of reinforcement learning is ** Agent **, ** Environment **, ** Interaction **

--Receive and deliver status, actions, and rewards every hour step

-** Measures ** ・ ・ ・ Rules for agents to decide their actions

--Design an algorithm to improve the policy by having the agent act on the environment through ** "action" ** and observe the result in the form of ** "reward" ** and ** "state" **.

――The important issue is how to determine the reward function.

-** Markov decision process ** ・ ・ ・ State space $ S $, action space $ A (s) $, initial state distribution $ P_0 $, state transition probability $ P (s'| s, a) $, reward function Stochastic process described by $ r (s, a, s') $

--Set the state set $ S $ to be a set of all states --Let $ s $ be the variable that represents the elements of this set -The state set consisting of $ N $ type states is as follows.

```
```math
S={s_1,s_2,...,s_N}
```
```

--Let $ S_t $ be the random variable that represents the state in the time step $ t

--The action set $ A (s) $ is a set consisting of all selectable actions in a certain state $ s

--Let $ A_t $ be the random variable that represents the behavior of the agent determined in the state $ S_t $ in the time step $ t $
--If you write the states in order from time step 0, it is as follows

--Let $ R_t + 1 $ be the random variable that represents the reward that depends on $ S_t $, $ A_t $, $ S_ {t + 1} $. --Take one of the values of the set $ R $ of all real numbers

**model**

--The environment probabilistically determines the state (initial state) at the initial time (** initial state distribution **)

--The next state is stochastically determined by the current state and behavior.
--When the agent decides the action $ a $ in the state $ s $, the probability that the state transitions to $ s'$ is given as follows.

--The state $ S_ {t + 1} $ at the step $ t + 1 $ is determined as follows.

-** Markov property ** ・ ・ ・ The property that the transition probability is determined only by the immediately preceding state

--The environment determines the reward $ R_ {t + 1} $ according to the current state $ S_t $, the action $ A_t $, and the next state $ S_ {t + 1}

--Action is determined based on the agent's policy ($ \ pi $) ――In a certain state, a measure whose behavior is stochastically determined is called a probabilistic measure. --Under the probabilistic policy $ \ pi $, the probability that a certain action $ a $ is selected in a certain state $ s $ is expressed as $ \ pi (a | s) $.

-** Tic-tac-toe ** -Each player puts a stone on the $ 9 $ square of $ 3 x 3 $, and wins if his stones line up in a straight line.

--The agent gives a positive reward to the winning board and a negative reward to the losing board. --The initial state distribution is as follows

```
```math
```

P_0(s)=\begin{cases}1 ,,,,,, (s=s_1) \ 0 ,,,,,, (otherwise) \end{cases}

```
-** Time Steps and Episodes **
-** Time step ** ・ ・ ・ Basic unit of time in the interaction between the agent and the environment
-** Episode ** ・ ・ ・ The time from the start to the end of the task is summarized and consists of multiple time steps.
-** What is a good policy **
-** Revenue ** ・ ・ ・ Cumulative reward obtained in a certain period (sum of rewards in the period)
--The reward $ R_t $ obtained in the time step $ t $, the interval length is $ T $, and the revenue $ G_t $ is defined as follows.
```math
G_t=\sum^{T-1}_{\tau=0}{R_{t+1+\tau}}
```
--Define profits in longer-term intervals
```math
G_t=\lim_{T\rightarrow \infty} \frac{1}{T}\sum^{T-1}_{\tau=0}{R_{t+1+\tau}}
```
-** Discount reward sum ** ・ ・ ・ Profit that expresses future uncertainty in the form of discounted reward
```math
G_t=\sum^{\infty}_{\tau=0}\gamma^{\tau}R_{t+1+\tau}=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+...
```
--Discount rate $ \ gamma (0 \ le \ gamma \ le 1) $ is a constant that indicates how much the future will be discounted.
――Revenue is an index that evaluates the rewards obtained from a long-term perspective.
--Take the expected value of profit on the condition of the state, and call this the ** state value **.
-** State value ** ・ ・ ・ Expected value of profit obtained when action is decided according to policy $ \ pi $ from a certain state
```math
V^{\pi}(s)=E^{\pi}[ G_t|S_t=s ]
```
--"Expected value under policy $ \ pi $" ... Expected value when the agent decides the action based on the policy $ \ pi $ from the state $ s $ in the time step $ t $
-Consider an example of finite interval profit of $ T = 1 $
--The profits to consider are as follows
```math
G_t=R_{t+1}
```

--The probability that the state is $ s'$ in the time step $ t + 1 $ is as follows

```
```math
```

P(S_{t+1}=s',A_t=a|S_t=s)=P(S_{t+1}=s'|S_t=s,A_t=a) \pi(a|s)

```
--The state value is as follows by taking the expected value with the state $ S_t $ as a condition.
$$\begin{eqnarray*} V^{\pi}(s)&=& E^{\pi}[G_t|S_t=s] \\
&=& \sum_{s' \in S} \sum_{a \in A(s)} P(S_{t+1}=s',A_t=a|S_t=s) r(s,a,s') \\
&=& \sum_{s' \in S} \sum_{a \in A(s)} P(S_{t+1}=s'|S_t=s,A_t=a) \pi(a|s) r(s,a,s') \end{eqnarray*} $$
-Consider an example of finite interval revenue with $ T = 2 $
--The profits to consider are as follows
```math
G_t=R_{t+1}+R_{t+2}
```

--Expected values are as follows

```
```math
```

\begin{eqnarray*} V^{\pi}(s) &=& E[G_t|S_t=s]=E^\pi[R_{t+1}+R_{t+2}|S_t=s] \ &=& \sum_{s''\in S}\sum_{a'\in A(s)}\sum_{s'\in S}\sum_{a\in A(s)} P(S_{t+2}=s'',A_{t+1}=a',S_{t+1}=s',A_t=a|S_t=s){r(s,a,s')+r(s',a',s'')} \ &=& \sum_{s''\in S}\sum_{a'\in A(s)}\sum_{s'\in S}\sum_{a\in A(s)} P(S_{t+2}=s''|S_{t+1}=s',A_{t+1}=a')\pi(a'|s')×P(S_{t+1}=s'|S_t=s,A_t=a)\pi(a|s){r(s,a,s')+r(s',a',s'')} \end{eqnarray*}

```
-When fixing $ \ pi $ and changing $ s $
--Evaluate the expected return earned when deciding actions based on certain fixed measures for various conditions
--It can be used as an index to show the goodness of the state under a certain measure $ \ pi $ (** state value function **)
-When fixing $ s $ and changing $ \ pi $
--Evaluate the profits that are expected to be earned by starting actions from a certain state for various measures
――Indicator showing the goodness of the policy when starting from a certain state $ s $
$$
\forall s\in S,\,\,\,\,\, V^\pi(s) \ge V^{{\pi}^{'}}(s)\\
\exists s\in S,\,\,\,\,\, V^\pi(s) > V^{{\pi}^{'}}(s)
$$
-** Optimal policy **
-** Optimal state values ** ・ ・ ・ $ V ^ * (s) $
```math
\forall s\in S,\,\,\,\,\, V^*(s)=V^{{\pi}^{*}}(s)=\max_\pi V^\pi (s)
```

-** Behavioral values ** ・ ・ ・ $ Q ^ \ pi $

```
```math
```

Q^\pi(s,a)=E^\pi[G_t|S_t=s,A_t=a]

```
-For $ A_t, S_ {t + 1}, A_ {t + 1} $, take the expected value according to the probability of appearance
――A trajectory in which each state and action are connected
---
-Finite segment revenue of $ T = 1 $
```math
X_1=\{\Xi=(s,a,s')|s\in S,a\in A,s'\in S\}
```

-Call $ \ Xi $ ** orbit **

--Orbital set with fixed initial state

```
```math
```

X_1|_s={\Xi=(s,a,s')|a\in A,s'\in S}

```
--A set of orbitals with fixed initial state and behavior
```math
X_1|_s(s,a)=\{\Xi=(s,a,s')|s'\in S\}
```

--Revenue is regarded as a function of orbit

```
```math
```

G_t=G_t(\Xi)

```
```math
V^\pi(s)=\sum_{\Xi\in X_1|_s}P(\Xi)G_t(\Xi)\\
Q^\pi(s,a)=\sum_{\Xi\in X_1|_{(s,a)}}P(\Xi)G_t(\Xi)
```

-When $ T = 2 $

```
```math
```

X_2|_s={\Xi=(s,a,s',a',s'')|a\in A,s'\in S,a'\in A,s''\in S}

```
```math
X_2|_{(s,a)}=\{\Xi=(s,a,s',a',s'')|s'\in S,a'\in A,s''\in S\}
```

--In the environment shown in Figure 1.2.5, the set of orbits to be considered when calculating the state value is as follows.

```
```math
```

X_1|_{s_1}={(s_1,a_1,s_3),(s_1,a_2,s_2)}

```
```math
X_2|_{s_1}=\{(s_1,a_1,s_3,a_1,s_4),(s_1,a_1,s_3,a_2,s_1),(s_1,a_2,s_2,a_1,s_1),(s_1,a_2,s_2,a_2,s_4)\}
```

--The set of trajectories to consider when seeking action value is as follows.

```
```math
```

X_1|_{(s_1,a_1)}={(s_1,a_1,s_3)}

```
```math
X_2|_{(s_1,a_1)}=\{(s_1,a_1,s_3,a_1,s_4),(s_1,a_1,s_3,a_2,s_1)\}
```

――How to find a good policy

-** greedy policy **

```
```math
```

\pi(a|s)=\begin{cases}1 ,,,,,, (a=\arg \max_aQ(s,a)) \ 0 ,,,,,, (otherwise) \end{cases}

```
--Estimate the optimal behavioral value function
――Sometimes it is necessary to stochastically select actions that are not always the best at that time.
-** $ \ epsilon $ -greedy policy **
```math
\pi(a|s)=\begin{cases}1-\epsilon+\frac{\epsilon}{|A(s)|} \,\,\,\,\,\, (a= \arg \max_{a} Q(s,a)) \\\frac{\epsilon}{|A(s)|} \,\,\,\,\,\, (otherwise) \end{cases}
```

-** Boltzmann (Softmax) policy ** ・ ・ ・ Selection probability follows Gibbs distribution

```
```math
```

\pi(a|s)=\frac{\exp(Q(s,a)/T)}{\sum_{b\in A}\exp(Q(s,b)/T)}

```
-$ T $ is the temperature parameter
```

Recommended Posts