I'm Harima, a first year master's student in the Graduate School of Science. I will summarize my learning contents as a memo. I'm sorry it's hard to see. I would like to know what you do not understand.

**Chap.0 Introduction**

――Reinforcement learning is a theoretical framework for acquiring optimal behavior through trial and error based on experience.

--ex) Bicycle

--The output of the algorithm changes the collected data itself (trade-off between search and use)

――How to collect data in a world where you do not have enough data and it is costly to collect data (← → big data)

--Concisely organize the basic concepts of reinforcement learning based on the latest perspective

-** Reinforcement learning problem ** ・ ・ ・ Problems such as finding the optimal sequence of working methods when there is only incomplete knowledge about the target and what can be observed changes depending on the working on the target.

――The actor who acts is ** agent **, the target to be worked on is ** environment **, the action is ** action **, and the elements of the environment that change accordingly are ** state **

--** Reward ** (** Gain ** in economics, ** Loss ** or ** Cost ** in control engineering) for a unified comparison of what happens in an unknown environment

-** Maximize the total rewards you get through your choice of actions in your environment **

--Express the ** policy ** of the agent's action decision as a function that takes the observation result as input and outputs the action.

-It is necessary to maximize the long-term reward (** revenue **) obtained by combining ** immediate reward ** and ** delayed reward **.

--Calculate ** value ** as conditional expected value when the current state of the agent, the policy to be used, etc. are fixed

--In most cases, it is assumed that the agent has no prior knowledge of the environment or is incomplete.

-** Trade-off between search and use ** ・ ・ ・ If you try to ** use ** the learning results so far, ** search ** will decrease and the chances of loss will increase. On the other hand, if the number of searches is increased, the behavior different from the best behavior is taken, and the reward obtained is reduced (the optimum solution changes dynamically depending on the amount and bias of the obtained information).

-** Multi-arm bandit problem ** ・ ・ ・ The slot machine's arm is $ K $, the refund amount is $ R $, and the probability of winning when the arm $ i $ is subtracted is $ p_i $ (player). Cannot know the probability value in advance)

-** greedy algorithm ** ・ ・ ・ Select the arm with the highest expected value from the results so far

--If you have an arm that you haven't selected n times yet, select that arm. --Otherwise, calculate the average reward so far for all arms. -$ \ mu_i $ chooses the biggest arm

```
\mu_i=\frac{Sum of rewards obtained from arm i so far}{Number of times arm i has been played so far}
```

--The originally non-optimal arm $ i'$ hit a lot during the trial, so the arm $ i'$ refund rate $ p_ {i'} $ is higher than the optimal arm $ i $ refund rate $ p_i $. Misidentified as large

--If you keep pulling arm $ i'$ based on the wrong optimal solution, more results of arm $ i'$ will be collected, and the estimation error of $ p_ {i'} $ will decrease as the number of trials increases, and someday you will make an error. Correctable

--The originally optimal arm $ i $ did not hit much at the time of the trial, so the arm $ i $ refund rate $ p_i $ is not optimal The arm $ i'$ refund rate $ p_ {i'} $ Misidentified as smaller

--Even if you keep pulling the arm $ i'$ based on the wrong optimal solution, the information of the originally optimal arm $ i $ is not collected, so the estimation error does not decrease even if you repeat the trial, and the error cannot be corrected.

-** $ \ epsilon $ -greedy algorithm ** ・ ・ ・ Choose a random arm with probability $ \ epsilon $ ($ 0 \ leq \ epsilon \ leq 1 $)

――If you have an arm that you haven't selected yet, select one from those arms. --Randomly choose one from all arms with a probability of $ \ epsilon $ --Choose the arm with the highest average $ \ mu_i $ reward so far with a probability of $ 1- \ epsilon $

-$ \ Epsilon $ seems to have a more efficient search method than random

-** Optimistic when uncertain ** ... When there is uncertainty in the expected value, a large expected value should be assumed within the range of the uncertainty.

―― 1) Think of a set of “imaginable environments” that is consistent with current knowledge ―― 2) Select the “most convenient” environment from the set ―― 3) The next action is the optimal solution in the most convenient environment.

--In the early stages of learning, the emphasis is on exploration, and in the later stages, the emphasis is on usage.

-** Optimistic initial value method ** ・ ・ ・ Estimate the optimistic expected value of the value of each arm in the form of observing the maximum reward value from each arm $ K $ times before learning.

--Reward upper bound $ r_ \ sup $ --In addition to the results observed during learning, the expected value of the reward for each arm is calculated, assuming that the reward of $ r_ \ sup $ was observed $ K $ times from each arm. -$ \ mu'_i $ chooses the biggest arm

```
\mu'_i = \frac{Sum of rewards obtained from arm i so far+Kr_{\sup}}{Number of times arm i has been played so far+K}
```

-** UCB algorithm ** ・ ・ ・ By gradually approaching the probability of the confidence interval to 1, the necessary search is performed for all options, and the cost of the search and the risk of making a mistake in the optimal solution can be reduced.

-$ R $: Difference between maximum and minimum refund amount --If you have an arm that you haven't chosen yet, choose one --Calculate the expected value of the reward from each arm $ i $ --Calculate the half width of the confidence interval of the reward obtained from each arm $ i $ -$ x_i = \ mu_i + U_i $ chooses the biggest arm $ i $

```
\mu_i=\frac{Sum of rewards obtained from arm i so far}{Number of times arm i has been selected so far}\\
```

```
U_i=R \sqrt{\frac{2 \ln (Total number of plays so far)}{Number of times arm i has been played so far}}
```

-Simulate when $ K = 4 $.

--Refunds are 0.2, 0.3, 0.4 and 0.5 respectively --Repeat 10,000 hour step learning 10,000 times

-Collect information on your own and acquire good behavior for robustness in various unknown environments

--The appropriate algorithm depends on the problem (algorithm selection is important)

Recommended Posts