[PYTHON] Deep Reinforcement Learning 1 Introduction to Reinforcement Learning

Aidemy 2020/11/21

Introduction

Hello, this is Yope! I have a completely liberal-arts background, but I became interested in the possibilities of AI, so I enrolled in the AI-specialized school "Aidemy" to study. I would like to share the knowledge I gained there, so I am summarizing it on Qiita. I am very happy that many people read my previous summary articles. Thank you! This is my first post on deep reinforcement learning. Nice to meet you.

What to learn this time
・(Review) Reinforcement learning
・Reinforcement learning strategies
・DQN

(Review) About reinforcement learning

・Reinforcement learning is one method of machine learning.
・The components of reinforcement learning are as follows: the __agent__, which is the subject that acts; the __environment__, which is the target of the actions; the __action__, which acts on the environment; and the __state__, the elements of the environment that change as a result. In addition, the __reward__ is the evaluation obtained immediately for an action, and the __revenue__ is the total reward finally obtained.
・The purpose of reinforcement learning is to maximize this total revenue.
・As a model of reinforcement learning, the __policy__ by which the agent selects actions is expressed as __"input the current state of the environment"__ and __"output an action"__. The policy chooses actions that are expected to yield a higher reward.
・Regarding this "higher reward": if all the rewards were known, the action with the highest reward could simply be selected, but in reality they are rarely given in advance. In such a case, it is necessary to collect information by performing actions that have never been selected; this is called __"search"__ (exploration). After gathering information in this way, it is advisable to select the action that is presumed to be the most rewarding; this is called __"use"__ (exploitation). The loop between agent and environment is sketched below.
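To make the flow concrete, here is a minimal sketch of the agent-environment loop in Python. `CoinEnv` is a hypothetical toy environment invented for this illustration (it is not from any library): action 1 gives a reward of 1 with probability 0.7, and action 0 gives nothing.

```python
import random

# Hypothetical toy environment for illustration only.
class CoinEnv:
    def step(self, action):
        # Action 1 pays a reward of 1 with probability 0.7; action 0 pays nothing.
        reward = 1 if action == 1 and random.random() < 0.7 else 0
        next_state = reward          # a trivial "state" that changes with the action
        return next_state, reward

env = CoinEnv()
state, revenue = 0, 0
for t in range(100):
    action = random.choice([0, 1])   # policy: state in, action out (random here)
    state, reward = env.step(action) # the action acts on the environment
    revenue += reward                # revenue = the total of all rewards
print("revenue:", revenue)
```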

Reinforcement learning strategies

(Review) greedy method

・It is important to choose a policy suited to the problem of how to carry out the search and use described above.
・For example, when the expected values of all rewards are known, the best choice is the __"greedy method"__, in which only the action with the highest expected value is selected.
・However, as mentioned above, there are few cases where all rewards are known in advance. In such cases, it is necessary to sometimes select other actions, even ones whose known reward is small. One policy that does this is the __"ε-greedy method"__: it searches with probability ε and uses with probability 1-ε. By reducing the value of ε as the number of trials grows, the usage rate increases, and the search can proceed efficiently. A sketch of this rule follows.
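Below is a minimal sketch of ε-greedy selection, assuming a hypothetical list `q_estimates` of estimated expected rewards; the decay schedule ε = 1/trial is just one common choice among many.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Search with probability epsilon; otherwise use the best-looking action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # search
    return max(range(len(q_values)), key=lambda a: q_values[a])   # use

q_estimates = [0.2, 0.5, 0.1]        # hypothetical estimated expected rewards
for trial in range(1, 1001):
    epsilon = 1.0 / trial            # epsilon shrinks as trials accumulate
    action = epsilon_greedy(q_estimates, epsilon)
```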

Boltzmann selection

・The ε-greedy method was a policy that selects actions with a certain probability. A similar policy is __"Boltzmann selection"__.
・Boltzmann selection is named this way because the selection probability follows the __Boltzmann distribution__ below.

```math
P(a \mid s) = \frac{\exp\left(Q(s, a) / T\right)}{\sum_{a'} \exp\left(Q(s, a') / T\right)}
```

・In this formula, __T__ is called the __temperature function__; it is a __"function that converges to 0 with the passage of time"__. In the limit __T → ∞__, all actions are selected with the same probability, and in the limit __T → 0__, the action with the maximum expected reward becomes the one most easily selected.
・In other words, since T is large at the beginning, action selection is random, but as T approaches 0 with the passage of time, selection comes to behave like the greedy method. A small sketch follows.
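Here is a minimal sketch of Boltzmann selection under those assumptions; the `q_estimates` values and the decay schedule T = 1/step are made up for illustration.

```python
import numpy as np

def boltzmann_select(q_values, T):
    """Select an action with probability following the Boltzmann distribution."""
    q = np.asarray(q_values, dtype=float)
    prefs = np.exp((q - q.max()) / T)   # subtracting the max avoids overflow at small T
    probs = prefs / prefs.sum()
    return np.random.choice(len(q), p=probs)

q_estimates = [0.2, 0.5, 0.1]           # hypothetical expected rewards
for step in range(1, 6):
    T = 1.0 / step                      # T converging toward 0 over time
    print(f"T={T:.2f}", boltzmann_select(q_estimates, T))
```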

DQN

・DQN expresses the __Q function__ of Q-learning with __deep learning__. The __Q function__ is the __"action value function"__, and __Q-learning__ is a reinforcement learning algorithm that estimates it.
・The __action value function__ takes the __"state s and action a"__ as input and calculates the expected value of the reward obtained when the optimal policy is followed. What Q-learning does is take the sum of the reward obtained by performing an action and the action value obtainable in the next state, and update the function a little at a time (adjusted by the learning rate) using the __difference__ between that sum and the current action value.
・Normally, all combinations of the state s and the action a are represented by a __table function__, but depending on the problem, the number of combinations can become enormous.
・In such cases, DQN solves this by approximating the Q function with deep learning.
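As a reference, here is a minimal sketch of that update in its tabular form, with a hypothetical toy state/action space; DQN replaces the table `Q` with a neural network.

```python
import numpy as np

# Tabular Q-learning update:
#   Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
n_states, n_actions = 5, 2              # made-up sizes for illustration
Q = np.zeros((n_states, n_actions))     # the "table function" over all (s, a)
alpha, gamma = 0.1, 0.99                # learning rate and discount rate

def q_update(s, a, r, s_next):
    target = r + gamma * Q[s_next].max()    # reward + best next action value
    Q[s, a] += alpha * (target - Q[s, a])   # nudge toward the target

q_update(s=0, a=1, r=1.0, s_next=2)
```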

・The characteristics of __DQN__ are as follows (see the next chapter for details). A conceptual sketch of two of them follows this list.
・__Experience Replay__: shuffles the time series of the data to deal with time-series correlation; a batch is created randomly from the recorded data and used for __batch learning__.
・__Target Network__: the error from the target value (the "correct answer") is calculated and the model is adjusted to be close to it; the network used to compute that target is kept fixed for a while so the target does not shift at every update.
・__CNN__: filters and converts images by __convolution__.
・__Clipping__: the reward is clipped to -1 if negative, +1 if positive, and 0 if there is none.
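Here is a conceptual sketch (not a full DQN) of the Target Network and Clipping ideas; `TinyQNet` and its weights are invented purely for illustration.

```python
import copy
import numpy as np

def clip_reward(r):
    """Clipping: negative rewards become -1, positive +1, and no reward stays 0."""
    return int(np.sign(r))

class TinyQNet:
    """Stand-in for a deep Q network, with made-up linear weights."""
    def __init__(self):
        self.w = np.random.randn(4, 2)      # 4 input features, 2 actions
    def predict(self, state):
        return state @ self.w               # an action value for each action

q_net = TinyQNet()
target_net = copy.deepcopy(q_net)           # Target Network: a frozen copy

state = np.ones(4)
# The target ("correct answer") is computed with the frozen copy so it stays
# stable between updates; target_net is re-synced to q_net only periodically.
td_target = clip_reward(2.5) + 0.99 * target_net.predict(state).max()
```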

Experience Replay

・For example, the inputs obtained while the agent plays a game have a __time series__ nature. Time-series inputs are strongly correlated, so if they are used for learning as they are, the learning results become biased and convergence worsens. The solution to this is called __Experience Replay__. It is a method in which the states, actions, and rewards that occurred are recorded (all of them, or up to a certain number), and later __randomly called up__ and used for learning. A minimal buffer sketch follows the figure below.

(Figure: schematic of Experience Replay)
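Below is a minimal sketch of such a buffer, assuming a fixed capacity and uniform random sampling (both common choices, but not the only ones).

```python
import random
from collections import deque

class ReplayBuffer:
    """Records (state, action, reward, next_state) tuples and samples them randomly."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # old entries drop out automatically

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Random sampling breaks the time-series correlation of the inputs.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
for t in range(100):
    buf.add(state=t, action=t % 2, reward=1.0, next_state=t + 1)
batch = buf.sample(batch_size=32)
```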

Summary

・In reinforcement learning, __search__ and __use__ are performed in order to maximize the total __revenue__. How this is done is the __policy__.
・When the expected values of the rewards are known, the __"greedy method"__ is effective: it selects only the action with the highest expected value.
・The __"ε-greedy method"__ handles the case where not all expected reward values are known. This policy searches with probability ε and uses with probability 1-ε.
・A similar policy is __Boltzmann selection__. Values are selected according to the Boltzmann distribution using a temperature function T whose value converges to 0 over time, so actions are selected randomly at first, but over time the action with the highest expected value comes to be selected.
・DQN expresses the __Q function (action value function)__ with __deep learning__. The action value function calculates the expected value of the reward from the state s and the action a as input; DQN is used because the number of combinations of s and a becomes enormous if they are all expressed by a table function.
・One of the features of DQN is __"Experience Replay"__, which records states, actions, and rewards and retrieves them randomly in order to remove the __time-series__ nature of the input data.

That's all for this time. Thank you for reading this far.
