[PYTHON] Reinforcement learning 1 introductory edition

Aidemy 2020/11/15

Introduction

Hello, this is Yope! I am a humanities student, but I became interested in the possibilities of AI, so I enrolled in the AI-specialized school "Aidemy" to study. I would like to share the knowledge I have gained there, so I am summarizing it on Qiita. I am very happy that many people have read my previous summary articles. Thank you! This is my first post on reinforcement learning. Nice to meet you.

What to learn this time
・What is reinforcement learning?
・Agents, environments, and rewards
・Strategies for reinforcement learning

What is reinforcement learning?

・Machine learning can be broadly divided into __"supervised learning," "unsupervised learning," and "reinforcement learning."__
・Of these, __reinforcement learning__, which we will learn this time, is a method aimed at __"discovering the optimal behavior under given conditions."__ For example, in the case of a game, reinforcement learning can discover how to win.

More about reinforcement learning

・Reinforcement learning is carried out on the premise that an __"agent" (player)__ and an __"environment" (stage)__ interact with each other.
・When the agent is in state s and takes action a on the environment, the environment returns a reward R as an evaluation of that behavior, and the agent transitions to the next state s'. By repeating this cycle, the agent proceeds with the task.
・For example, in the case of Super Mario, the agent "Mario" takes the action "avoid the enemy (by jumping)" in an environment of "go forward and aim for the goal, but the game ends when you hit an enemy." The environment returns the reward "the right to go further (the game does not end there)." By repeating this, Mario advances toward the goal.

・In reinforcement learning, the aim is to maximize not only the on-the-spot reward, the __"immediate reward,"__ but also the __"return,"__ which includes the __"delayed rewards"__ that come later.
・In the Mario example above, taking coins is a worthless action on the spot, but if you see value in collecting 100 coins to "gain an extra life," then it can be said that acting so as to take as many coins as possible maximizes the reward.

N-armed bandit problem

・The __N-armed bandit problem__ is an example of reinforcement learning. It asks: __"Suppose there are N slot machines, each paying a reward of 1 for a hit and 0 for a miss. The hit probability differs from machine to machine, but the user cannot see it. If the user can pull one slot machine per trial, how can the average reward per trial be maximized?"__
・The obvious idea is to __"pull the machine with the highest probability,"__ but that machine first has to be found (guessed) through trial. That is the crux of this problem.
・In what follows, we will learn about reinforcement learning using this example.

Creating an agent

・More specifically, an __agent__ is __"something that decides actions in the environment and influences the environment."__
・In the N-armed bandit problem, the agent is the user, who decides which machine to pull, receives a reward, and makes the next decision.
・The guideline by which the agent makes its next decision from the rewards obtained is called a __"policy."__ For example, if the policy in this problem is set to "always 1," the agent will always aim for a machine whose reward is 1.
・The purpose this time is to determine the optimal policy for this agent.

・The following code is a function __"randomselect()"__ that implements the policy __"choose a machine at random."__ It returns __"slot_num,"__ the number of the machine to choose.

*(Screenshot in the original post: スクリーンショット 2020-11-10 22.05.38.png)*
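Since the original code is only available as a screenshot, here is a minimal sketch of what `randomselect()` might look like, assuming there are 5 machines (the value of `N` is an assumption):

```python
import numpy as np

N = 5  # number of slot machines (assumed; the actual value is in the screenshot)

def randomselect():
    # Choose one of the N machines uniformly at random
    slot_num = np.random.randint(0, N)
    return slot_num
```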

Creating an environment

・The __environment__ is the target on which the agent acts. Its role is to receive the action, observe the situation, send a reward to the agent, and advance one time step.
・The environment in this problem is __"the process in which, when the agent pulls a certain machine, it hits or misses according to that machine's probability."__

・The following code is the function __"environment()"__ that defines the environment. __"coins_p"__ stores the hit probability of each machine in a NumPy array, and "results" stores the outcomes of pulling the slots, generated with the binomial distribution function __"np.random.binomial(n, p)"__, where n is the number of trials and p is the probability. This time the number of trials is one and the probability is "coins_p", so you can specify __(1, coins_p)__. As the environment function, the result for the passed "band_number" should be returned, so it is stored in "result" and returned.

*(Screenshot in the original post: スクリーンショット 2020-11-10 22.10.57.png)*
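A minimal sketch following that description (the probabilities in `coins_p` are assumed values, not the ones from the screenshot):

```python
import numpy as np

# Hit probability of each machine (assumed values)
coins_p = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

def environment(band_number):
    # One Bernoulli trial per machine: 1 = hit, 0 = miss
    results = np.random.binomial(1, coins_p)
    # Return only the result of the machine the agent actually pulled
    result = results[band_number]
    return result
```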

Definition of reward

・A __reward__ is an index that evaluates how desirable the agent's series of actions is.
・In this problem, the __return value (0 or 1)__ obtained by pulling the slot is used directly as the reward. This is an __immediate reward__.

・The following code is the function __"reward()"__ that defines the reward. The result obtained from the environment function above is stored in "result", and this is written to position "time", the current trial number, of the array "record" passed as an argument.
・In addition, each result is recorded in the list __"results"__, whose rows have the form __[(machine number), (number of selections), (total reward), (total reward ÷ number of selections)]__.

*(Screenshot in the original post: スクリーンショット 2020-11-10 22.22.14.png)*
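The reward() code is likewise a screenshot; below is a rough sketch under the assumptions above (the exact argument order and variable names may differ from the original):

```python
def reward(record, results, slot_num, time):
    # Pull the chosen machine once and record the immediate reward (0 or 1)
    result = environment(slot_num)
    record[time] = result

    # Update the summary table:
    # results[i] = [machine number, number of selections, total reward, average reward]
    results[slot_num][1] += 1
    results[slot_num][2] += result
    results[slot_num][3] = results[slot_num][2] / results[slot_num][1]
    return record, results
```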

・Furthermore, use these functions, or pass values yourself, to decide __"which machine to pull," "whether it hits or misses," and "how many times to try,"__ then output the number of trials and results for each machine and plot the transition of the __average reward__, as below.

*(Screenshot in the original post: スクリーンショット 2020-11-10 22.24.51.png)*
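The loop itself is also a screenshot; as a reference, here is a minimal sketch of such a main loop (the total of 10,000 trials matches the later greedy run, but the plotting details are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

times = 10000                        # total number of trials
record = np.zeros(times)             # reward obtained at each trial
# results[i] = [machine number, number of selections, total reward, average reward]
results = [[i, 0, 0, 0.0] for i in range(N)]

for time in range(times):
    slot_num = randomselect()                        # policy: choose at random
    record, results = reward(record, results, slot_num, time)

print(results)

# Plot the transition of the average reward
average = np.cumsum(record) / (np.arange(times) + 1)
plt.plot(average)
plt.xlabel("trials")
plt.ylabel("average reward")
plt.show()
```

Swapping `randomselect()` for one of the selection functions introduced later reuses this same loop.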

・Result *(screenshot in the original post: スクリーンショット 2020-11-10 22.25.07.png)*

・Looking at this result, since "which machine to choose" is completely random (probability 1/5 each), the number of times each machine is selected is roughly equal, and the measured hit probabilities are almost the same as the set values. Also, as the graph shows, the average reward stays at around 0.5.

Strategies for the N-armed bandit problem

greedy method

・Now that the agent, the environment, and the reward have been defined, the next step is to think about a __strategy__ for raising the average reward.
・As a basic policy, since there is no information at first, there is no choice but to select machines at random. After some trials, once you can guess each machine's hit probability, you find the machine with the highest probability and keep choosing it.
・Here, collecting information by repeating trials is called __"exploration,"__ and continuing to select the machine judged best is called __"exploitation."__

・The simplest method that follows this policy is the __"greedy method."__ It simply and straightforwardly __"selects the machine with the highest expected value based on the results so far."__
・First, perform exploration. Decide in advance that each machine will be tried n times, and do this for all machines. Once this is done, calculate the expected reward $u_{i}$ of each machine at that point and select the machine with the largest value.
・The expected value $u_{i}$ is calculated as __"(sum of rewards of machine i) ÷ (number of trials of machine i)."__

・The following code is an example of actually defining this method and using the functions that define the environment created in the previous sections.

・The greedy method can be defined as: __"keep slot_num on the same machine until it reaches the maximum number of exploration trials"__ and __"once all machines have reached the maximum number of trials, set slot_num to the machine with the highest expected value at that point."__
・For the latter, the reason the code uses __"np.array()"__ is so that the method __"argmax()"__ can be used to get the index of the maximum value.

*(Screenshot in the original post: スクリーンショット 2020-11-11 12.19.12.png)*
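The definition itself is only a screenshot, so here is a minimal sketch of a greedy selection function along the lines described (the argument n is the number of exploration trials per machine; the function and variable names are assumptions):

```python
import numpy as np

def greedy(results, n):
    # Exploration phase: keep pulling any machine not yet tried n times
    for i in range(len(results)):
        if results[i][1] < n:
            return i
    # Exploitation phase: all machines tried n times, pick the best average reward
    expected = np.array([row[3] for row in results])
    slot_num = np.argmax(expected)
    return slot_num
```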

・Running the greedy method with concrete values *(screenshot in the original post: スクリーンショット 2020-11-11 12.20.11.png)*

・Result ![Screenshot 2020-11-11 12.20.26.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/d2b53fb9-583e-7d14-ce60-f265a667b2be.png)

・As this result shows, every machine is tried up to the given exploration limit __"n = 100,"__ and from that point on all of the remaining trials (9,500) are spent on __"machine 4,"__ which has the highest expected value.

ε (epsilon)-greedy method

・With the greedy method of the previous section, the smaller n is, i.e., the fewer trials spent on exploration, the higher the chance of settling on the wrong machine. Conversely, increasing n raises the chance of selecting the optimal machine, but exploration then consumes a large number of trials, which increases waste.
・The __"ε-greedy method"__ can mitigate this problem. By interleaving exploration and exploitation, it reduces the cost of exploration while also reducing the risk of continuing to choose the wrong machine.
・The flow of the ε-greedy method is as follows.
① If there is a machine that has not been selected yet, select it
② With probability ε, __select a machine at random from all machines__ (exploration)
③ With probability 1-ε, __select the machine with the highest average reward (expected value) so far__ (exploitation)

・In other words, with this method each trial explores with probability ε and exploits with probability 1-ε.

・Definition part of the ε-greedy method ![Screenshot 2020-11-12 10.08.14.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/f62db320-5545-1e0d-e810-d73fc3a99221.png)
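Since the definition is only visible as a screenshot, here is a minimal sketch of an ε-greedy selection function using the same `results` table layout as before (names are assumptions):

```python
import numpy as np

def epsilon_greedy(results, epsilon):
    # ① Select any machine that has never been tried
    for i in range(len(results)):
        if results[i][1] == 0:
            return i
    # ② With probability ε, explore: pick a machine at random
    if np.random.rand() < epsilon:
        return np.random.randint(0, len(results))
    # ③ With probability 1-ε, exploit: pick the machine with the best average reward
    expected = np.array([row[3] for row in results])
    return np.argmax(expected)
```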

・Execution part ![Screenshot 2020-11-12 10.10.43.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/d09711b3-1335-9165-36ec-9cd56a863bd8.png)

・Result ![Screenshot 2020-11-12 10.10.29.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/df18123b-4a82-53a6-7eab-852e42fe26d5.png)

Optimistic initial value method

・With the greedy method, suppose there are "machines A and B that hit with probabilities 0.4 and 0.6" and the expected value of machine A is initially estimated too high, say "0.8". Machine A is used at first, but through trials it is possible to notice that A's expected value seems low (it converges to 0.4) and switch to machine B.
・On the other hand, if the expected value of machine B is estimated too low, say "0.2", then machine B is never tried, that is, only machine A keeps being selected, and the problem is that the mistake is never noticed.
・A method for reducing this kind of risk is the __"optimistic initial value method,"__ which "estimates the expected value higher (optimistically) when it is uncertain."
・The concrete idea is that, before learning, the expected value is calculated on the assumption that each machine has already returned the maximum reward K times.
・Therefore, when using this method, it is necessary to specify the __"maximum reward value ($r_{sup}$)"__ and the __"number of times the maximum value is assumed (K)."__
・Using these two, the expected value of each machine is calculated as $\frac{R(N) + K r_{sup}}{N + K}$, where __R(N)__ is the total reward actually obtained (stored in the third column of results) and N is the number of trials actually performed (stored in the second column of results). So the code looks like the following.

*(Screenshot in the original post: スクリーンショット 2020-11-12 11.01.32.png)*
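The code itself is a screenshot, so below is a minimal sketch of a selection function that follows the formula above (function and variable names are assumptions):

```python
import numpy as np

def optimistic(results, r_sup, K):
    # Optimistic expected value: pretend each machine has already paid the
    # maximum reward r_sup in K extra (imaginary) trials
    expected = np.array([
        (row[2] + K * r_sup) / (row[1] + K)   # (total reward + K*r_sup) / (trials + K)
        for row in results
    ])
    return np.argmax(expected)
```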

・Execution part ![Screenshot 2020-11-12 11.02.52.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/f47baedd-9afe-5c32-1a76-f0236ffaef22.png)

・Result ![Screenshot 2020-11-12 11.02.38.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/6992afc7-4fc9-beaa-767b-bfa559a8a530.png)

soft-max method

・A disadvantage of the ε-greedy method is that during exploration, even a machine that is clearly the worst is still tried a fixed number of times. For example, if there is a machine that hits with probability 20%, paying "1" on a hit and "-100" on a miss, pulling this machine is predicted to be an extremely bad action, yet the ε-greedy method cannot avoid exploring it.
・The __"soft-max method"__ addresses this. It makes __"actions that appear to have high value easy to choose, and actions that appear to have low value hard to choose."__ In other words, it __weights each action__.
・The weight is calculated with the formula $\frac{\exp(Q_{i}/\tau)}{\sum_{j}\exp(Q_{j}/\tau)}$, where $Q_i$ is the expected reward and $\tau$ (tau) is a parameter; the smaller $\tau$ is, the stronger the weighting above becomes.
・The flow of the soft-max method is: if there is no data yet, assume all rewards are "1"; calculate each machine's selection probability with the formula above; make a selection based on it; obtain a reward and update the reward function; and repeat. The specific code is as follows.

*(Screenshot in the original post: スクリーンショット 2020-11-12 11.54.05.png)*

・In the above code, the expected reward __"q"__ is calculated first; it is taken from the 4th column of results. If there is no data yet, the expected reward "q" is set to "1", that is, an array in which all values are 1 is created with __"np.ones()"__.
・Next, the selection probability __"probability"__ is calculated with the formula above using "q" and the passed "tau". Then, treating the machine numbers as the candidates and probability as their probabilities, __"np.random.choice(array, p=probability)"__ determines which machine is selected.
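A minimal sketch of this soft-max selection, under the same assumptions about the `results` layout (the "no data yet" check is an interpretation of the description above):

```python
import numpy as np

def softmax(results, tau):
    # Expected reward of each machine (4th column of results)
    q = np.array([row[3] for row in results])
    # With no data yet, assume a reward of 1 for every machine
    if all(row[1] == 0 for row in results):
        q = np.ones(len(results))
    # Soft-max weighting: exp(q_i / tau) / sum_j exp(q_j / tau)
    probability = np.exp(q / tau) / np.sum(np.exp(q / tau))
    # Choose a machine number according to these probabilities
    return np.random.choice(np.arange(len(results)), p=probability)
```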

・Execution part ![Screenshot 2020-11-12 12.01.40.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/90b343bf-f6b6-9786-e3c9-2528b0771d39.png)

・Looking at the above results, machines are mostly selected in proportion to their hit probabilities. However, in this trial, machine 1 did not hit at all, so it was treated as "a machine with probability 0" and was not selected after that.

UCB1 algorithm

・The __UCB1 algorithm__ is an improved version of the __optimistic initial value method__. Specifically, by judging both __"how often a machine has hit (its success rate)"__ and __"how much is known about the machine (the amount of variation in its data due to chance),"__ it actively explores machines that have not been explored much, and once enough data has been collected it selects the machine with the highest winning probability.
・The flow is as follows. First calculate __"R,"__ the difference between the maximum and minimum reward. Then, if there is a machine that has not been selected yet, select it, and from the results obtained so far calculate the expected reward (success rate) __$u_{i}$__ of each machine.
・In addition, the magnitude of the chance-driven variation in each machine's data, __$x_{i}$__, is calculated as $R\sqrt{\frac{2\log T}{N}}$, where __"T"__ is the total number of plays and __"N"__ is the number of plays of machine i.
・The machine with the largest sum $u_{i} + x_{i}$ is considered the best and is selected.
・The above can be expressed in code as follows (R = 1 for the N-armed bandit). *(Screenshot in the original post: スクリーンショット 2020-11-12 13.35.34.png)*

-For the above code, "a machine that has not been selected yet" can be rephrased as having 0 in the first column (number of trials) of results. In addition, the "total number of trials so far (times)" may be obtained by adding all of these parts with "sum ()". -The magnitude of data variation due to chance of each unit "xi" can be described according to the above formula, but the root is __ "math.sqrt ()" __, __ "math.log ()" __ Each log can be expressed with.

・Execution part ![Screenshot 2020-11-12 13.35.58.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/ba96a21e-3d0e-e34b-480f-3dc74fba84fa.png)

・In this result, every machine is tried at least the minimum number of times, but the machines that appear to have low probability are not tried much, so __waste is reduced__.

About the methods so far

・So far we have seen five methods: the __"greedy method," "ε-greedy method," "soft-max method," "optimistic initial value method," and "UCB1 algorithm."__ The important thing is to use them according to the problem.
・As a premise of reinforcement learning, __"exploration and exploitation are in a trade-off relationship."__ That is, increasing exploration increases trials of non-optimal machines and therefore waste, while increasing exploitation increases the risk of not selecting the best machine.
・For example, when the total number of trials is small, methods with a high exploration rate such as the __"ε-greedy method"__ tend to give a high reward rate because they find the optimal machine easily. On the other hand, when the number of trials is large, the waste of exploration grows, and the __"UCB1 algorithm,"__ which explores more cleverly and exploits more often, tends to give a higher rate of return.
・If all the probabilities were public from the start, the agent would only need to keep pulling the best machine, i.e., no exploration would be necessary. The difference between the reward in that case and the reward obtained with the five methods above is called __"regret."__
・Among the five methods, the one that can minimize regret is the __"UCB1 algorithm."__

・A diagram comparing the results of each method (for the N-armed bandit problem) *(screenshot in the original post: スクリーンショット 2020-11-12 14.06.51.png)*

Summary

・Reinforcement learning is a method aimed at discovering the optimal behavior under given conditions.
・In reinforcement learning, there is an actor called the __"agent."__ In the __N-armed bandit problem__, the agent takes the action of "choosing a machine."
・There is also the __"environment"__ on which the agent acts. The environment in this case is "the process of returning whether the chosen machine hits or misses."
・A __"reward"__ is an index that evaluates the desirability of the agent's behavior. The purpose of reinforcement learning is to maximize this reward.
・Methods for maximizing this reward include the __"greedy method," "ε-greedy method," "soft-max method," "optimistic initial value method," and "UCB1 algorithm."__ In each method, __"exploration"__ is first performed to estimate the probability of each machine, and then __"exploitation"__ is performed to select the optimal machine based on the results.

That's it for this time. Thank you for reading to the end.
