[PYTHON] Future reinforcement learning_1

** Future strengthening learning **

I'm Harima, a first year master's student in the Graduate School of Science. I will summarize my learning contents as a memo. I'm sorry it's hard to see. I would like to know what you do not understand.

Chap.0 Introduction

――Reinforcement learning is a theoretical framework for acquiring optimal behavior through trial and error based on experience.

--ex) Bicycle

--The output of the algorithm changes the collected data itself (trade-off between search and use)

――How to collect data in a world where you do not have enough data and it is costly to collect data (← → big data)

** Chap.1 Basic theory of reinforcement learning **

--Concisely organize the basic concepts of reinforcement learning based on the latest perspective

** 1.1 What is reinforcement learning **

-** Reinforcement learning problem ** ・ ・ ・ Problems such as finding the optimal sequence of working methods when there is only incomplete knowledge about the target and what can be observed changes depending on the working on the target.

――The actor who acts is ** agent **, the target to be worked on is ** environment **, the action is ** action **, and the elements of the environment that change accordingly are ** state **

--** Reward ** (** Gain ** in economics, ** Loss ** or ** Cost ** in control engineering) for a unified comparison of what happens in an unknown environment

-** Maximize the total rewards you get through your choice of actions in your environment **

--Express the ** policy ** of the agent's action decision as a function that takes the observation result as input and outputs the action.

-It is necessary to maximize the long-term reward (** revenue **) obtained by combining ** immediate reward ** and ** delayed reward **.

--Calculate ** value ** as conditional expected value when the current state of the agent, the policy to be used, etc. are fixed

--In most cases, it is assumed that the agent has no prior knowledge of the environment or is incomplete.

-** Trade-off between search and use ** ・ ・ ・ If you try to ** use ** the learning results so far, ** search ** will decrease and the chances of loss will increase. On the other hand, if the number of searches is increased, the behavior different from the best behavior is taken, and the reward obtained is reduced (the optimum solution changes dynamically depending on the amount and bias of the obtained information).

-** Multi-arm bandit problem ** ・ ・ ・ The slot machine's arm is $ K $, the refund amount is $ R $, and the probability of winning when the arm $ i $ is subtracted is $ p_i $ (player). Cannot know the probability value in advance)

-** greedy algorithm ** ・ ・ ・ Select the arm with the highest expected value from the results so far

--If you have an arm that you haven't selected n times yet, select that arm. --Otherwise, calculate the average reward so far for all arms. -$ \ mu_i $ chooses the biggest arm

\mu_i=\frac{Sum of rewards obtained from arm i so far}{Number of times arm i has been played so far}

--The originally non-optimal arm $ i'$ hit a lot during the trial, so the arm $ i'$ refund rate $ p_ {i'} $ is higher than the optimal arm $ i $ refund rate $ p_i $. Misidentified as large

--If you keep pulling arm $ i'$ based on the wrong optimal solution, more results of arm $ i'$ will be collected, and the estimation error of $ p_ {i'} $ will decrease as the number of trials increases, and someday you will make an error. Correctable

--The originally optimal arm $ i $ did not hit much at the time of the trial, so the arm $ i $ refund rate $ p_i $ is not optimal The arm $ i'$ refund rate $ p_ {i'} $ Misidentified as smaller

--Even if you keep pulling the arm $ i'$ based on the wrong optimal solution, the information of the originally optimal arm $ i $ is not collected, so the estimation error does not decrease even if you repeat the trial, and the error cannot be corrected.

-** $ \ epsilon $ -greedy algorithm ** ・ ・ ・ Choose a random arm with probability $ \ epsilon $ ($ 0 \ leq \ epsilon \ leq 1 $)

――If you have an arm that you haven't selected yet, select one from those arms. --Randomly choose one from all arms with a probability of $ \ epsilon $ --Choose the arm with the highest average $ \ mu_i $ reward so far with a probability of $ 1- \ epsilon $

-$ \ Epsilon $ seems to have a more efficient search method than random

-** Optimistic when uncertain ** ... When there is uncertainty in the expected value, a large expected value should be assumed within the range of the uncertainty.

―― 1) Think of a set of “imaginable environments” that is consistent with current knowledge ―― 2) Select the “most convenient” environment from the set ―― 3) The next action is the optimal solution in the most convenient environment.

--In the early stages of learning, the emphasis is on exploration, and in the later stages, the emphasis is on usage.

-** Optimistic initial value method ** ・ ・ ・ Estimate the optimistic expected value of the value of each arm in the form of observing the maximum reward value from each arm $ K $ times before learning.

--Reward upper bound $ r_ \ sup $ --In addition to the results observed during learning, the expected value of the reward for each arm is calculated, assuming that the reward of $ r_ \ sup $ was observed $ K $ times from each arm. -$ \ mu'_i $ chooses the biggest arm

\mu'_i = \frac{Sum of rewards obtained from arm i so far+Kr_{\sup}}{Number of times arm i has been played so far+K}

-** UCB algorithm ** ・ ・ ・ By gradually approaching the probability of the confidence interval to 1, the necessary search is performed for all options, and the cost of the search and the risk of making a mistake in the optimal solution can be reduced.

-$ R $: Difference between maximum and minimum refund amount --If you have an arm that you haven't chosen yet, choose one --Calculate the expected value of the reward from each arm $ i $ --Calculate the half width of the confidence interval of the reward obtained from each arm $ i $ -$ x_i = \ mu_i + U_i $ chooses the biggest arm $ i $

\mu_i=\frac{Sum of rewards obtained from arm i so far}{Number of times arm i has been selected so far}\\
U_i=R \sqrt{\frac{2 \ln (Total number of plays so far)}{Number of times arm i has been played so far}}

-Simulate when $ K = 4 $.

--Refunds are 0.2, 0.3, 0.4 and 0.5 respectively --Repeat 10,000 hour step learning 10,000 times

-Collect information on your own and acquire good behavior for robustness in various unknown environments

--The appropriate algorithm depends on the problem (algorithm selection is important)

Recommended Posts

Future reinforcement learning_2
Future reinforcement learning_1
[Introduction] Reinforcement learning
Reinforcement learning 1 Python installation
Reinforcement learning 3 OpenAI installation
[Reinforcement learning] Bandit task
Python + Unity Reinforcement Learning (Learning)
Reinforcement learning 1 introductory edition
Reinforcement learning 18 Colaboratory + Acrobat + ChainerRL
Play with reinforcement learning with MuZero
Reinforcement learning 17 Colaboratory + CartPole + ChainerRL
Reinforcement learning 28 colaboratory + OpenAI + chainerRL
Reinforcement learning 19 Colaboratory + Mountain_car + ChainerRL
Reinforcement learning 2 Installation of chainerrl
[Reinforcement learning] Tracking by multi-agent
Reinforcement learning 6 First Chainer RL
Reinforcement learning starting with Python
Reinforcement learning 20 Colaboratory + Pendulum + ChainerRL
Reinforcement learning 5 Try programming CartPole?
Reinforcement learning Learn from today
Reinforcement learning 4 CartPole first step
Deep Reinforcement Learning 1 Introduction to Reinforcement Learning
Deep reinforcement learning 2 Implementation of reinforcement learning
DeepMind Reinforcement Learning Framework Acme
Reinforcement learning: Accelerate Value Iteration
TF2RL: Reinforcement learning library for TensorFlow2.x
Reinforcement learning 34 Make continuous Agent videos
Reinforcement learning 13 Try Mountain_car with ChainerRL.
Python + Unity Reinforcement learning environment construction
Explore the maze with reinforcement learning
Reinforcement learning 8 Try using Chainer UI
Reinforcement learning 24 Colaboratory + CartPole + ChainerRL + ACER
Reinforcement learning 3 Dynamic programming / TD method
Deep Reinforcement Learning 3 Practical Edition: Breakout
I tried reinforcement learning using PyBrain
Learn while making! Deep reinforcement learning_1
real-time-Personal-estimation (learning)
[Reinforcement learning] DQN with your own library
Reinforcement learning to learn from zero to deep
[Reinforcement learning] I implemented / explained R2D3 (Keras-RL)
Reinforcement learning 2 Markov decision process / Bellman equation
Learning record
<Course> Deep Learning Day4 Reinforcement Learning / Tensor Flow
Learning record # 3
Learning record # 1
Machine learning
Reinforcement learning 14 Pendulum was done at ChainerRL.
python learning
[Reinforcement learning] Easy high-speed implementation of Ape-X!
Learning record # 2
6/10 Learning content
Deep Learning
numpy-sigmoid learning
[Python] Easy Reinforcement Learning (DQN) with Keras-RL
Try OpenAI's standard reinforcement learning algorithm PPO
[Reinforcement learning] Search for the best route
Reinforcement learning 11 Try OpenAI acrobot with ChainerRL.
Reinforcement learning 10 Try using a trained neural network.
[Reinforcement learning] R2D2 implementation / explanation revenge commentary (Keras-RL)
See the behavior of drunkenness with reinforcement learning
[Reinforcement learning] Experience Replay is easy with cpprb!