I read "Reinforcement Learning with Python: From Introduction to Practice" Chapter 1

Motivation

For some reason, I wanted to solve a maze with reinforcement learning.

Before that, to study the basics, I bought "Reinforcement Learning with Python: From Introduction to Practice" by Takahiro Kubo.

It is a fairly easy-to-understand book, but I didn't fully understand it on my first read...

I remembered hearing long ago that writing things out helps them stick, so here I have written up the book's contents as I understood them. (There are a lot of quotations...)

There may be some mistakes; I would appreciate it if you could point them out.

What is reinforcement learning?

Reinforcement learning is what's behind famous systems like AlphaGo and AlphaZero.

As a simple example, take training a reinforcement learning agent on a block-breaking (Breakout) game. DeepMind has released a video of this, so please take a look: https://www.youtube.com/watch?v=TmPfTpjtdgg

  1. At first the agent misses completely, but when the ball happens to hit the paddle, it receives a reward (points).
  2. Through trial and error over how to earn rewards, it learns to hit the ball with the paddle and break blocks.
  3. Through trial and error over how to earn the largest reward, it learns that bouncing the ball off the walls and ceiling yields even more points.
  4. By repeating this trial and error thousands of times, it becomes a Breakout master that surpasses humans!

Roughly speaking, did you get the idea? You can think of it simply as learning for the sake of a reward.
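The trial-and-error pattern above can be sketched in a few lines of Python. The toy "environment" and its rewards below are invented purely for illustration (a real Breakout agent learns from screen pixels with a neural network), but the reward-driven loop is the same idea:

```python
import random

random.seed(0)  # make the run reproducible

# A toy stand-in for the game: action 1 ("hit the ball") earns a point,
# action 0 ("miss") earns nothing. The agent does not know this in advance;
# it has to discover it from the rewards it receives.
def step(action):
    return 1 if action == 1 else 0  # reward

totals = {0: 0.0, 1: 0.0}   # reward collected per action
counts = {0: 1, 1: 1}       # times each action was tried

for _ in range(1000):
    if random.random() < 0.1:   # explore: sometimes try a random action
        action = random.choice([0, 1])
    else:                       # exploit: pick the best action found so far
        action = max(totals, key=lambda a: totals[a] / counts[a])
    reward = step(action)
    totals[action] += reward
    counts[action] += 1

best = max(totals, key=lambda a: totals[a] / counts[a])
print(best)  # after enough trial and error the agent settles on action 1
```

The 10% exploration rate is an arbitrary choice here; without some exploration the agent could get stuck repeating its first mediocre action forever.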

Characteristics of reinforcement learning

- Actions have rewards (a kind of correct answer), which makes it somewhat similar to supervised learning.
- Behavior is evaluated from the perspective of maximizing the "sum of rewards".
- The period from the start to the end of the environment is one episode (Episode).
- **Purpose of reinforcement learning: maximize the total reward obtained in one episode**
- **What a reinforcement learning model learns:**
  1. How to evaluate actions
  2. How to choose actions (based on that evaluation) = the strategy

Markov property

In reinforcement learning, we assume that the given environment follows a certain rule. That rule (property) is called the Markov property: "The transition destination depends only on the current state and the action taken there. The reward depends on the current state and the transition destination."

An environment with the Markov property is called a Markov Decision Process (MDP). The four components of an MDP:

- $s$: State
- $a$: Action
- $T$: Transition function (the probability of a state transition). A function that takes a state and an action as arguments and outputs the transition destination (next state) and the transition probability $P_{a}$.
  - Input: state $s$, action $a$. Transition function: $T\left(s, a\right)$. Output: transition destination $s'$, transition probability $P_{a}\left(s, s'\right)$.
- $R$: Reward function (immediate reward). A function that takes the state and the transition destination as arguments and outputs a reward (it may also take the action as an argument).
  - Input: state $s$, transition destination (next state) $s'$. Reward function: $R\left(s, s'\right)$. Output: immediate reward $r$.
- $\pi$: Strategy. A function that receives a state and outputs an action. Something that acts according to the strategy is called an agent.

- Learning in reinforcement learning means adjusting the parameters of the strategy so that it outputs appropriate actions for each state.
- The strategy is the model in reinforcement learning.
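The four MDP components can be made concrete with a small sketch. The two-state environment, its probabilities, and its rewards below are made-up assumptions purely for illustration:

```python
import random

random.seed(0)  # reproducible sampling

# A minimal, hypothetical two-state MDP.
states = ["A", "B"]
actions = ["stay", "move"]

def T(s, a):
    """Transition function: returns {next_state: transition probability}."""
    if a == "move":
        nxt = "B" if s == "A" else "A"
        return {nxt: 0.9, s: 0.1}   # moving mostly succeeds
    return {s: 1.0}

def R(s, s_next):
    """Reward function: depends on the state and the transition destination."""
    return 1.0 if s_next == "B" else 0.0

def pi(s):
    """Strategy: receives a state and outputs an action (here: head toward B)."""
    return "move" if s == "A" else "stay"

# One step of an agent (the thing acting according to the strategy) on the MDP.
s = "A"
a = pi(s)                                    # strategy picks the action
probs = T(s, a)                              # environment gives next-state distribution
s_next = random.choices(list(probs), weights=list(probs.values()))[0]
r = R(s, s_next)                             # immediate reward
print(s, a, s_next, r)
```

The separation matters: $T$ and $R$ belong to the environment, while $\pi$ is the part the agent learns.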

Reward summation formula

Since reinforcement learning aims to maximize the sum of rewards, let's look at the formula for that sum.

The sum of the rewards in MDP is the sum of the immediate rewards.

If the episode ends at time $T$, the sum of rewards $G_{t}$ at time $t$ is defined as:

$$G_{t} := r_{t+1}+r_{t+2}+r_{t+3}+\ldots+r_{T}$$

In practice, a reward obtained in the future is worth less than one obtained now, so each future reward is multiplied by a discount rate $\gamma$:

$$G_{t} := r_{t+1}+\gamma r_{t+2}+\gamma^{2}r_{t+3}+\ldots+\gamma^{T-t-1}r_{T} = \sum_{k=0}^{T-t-1}\gamma^{k}r_{t+k+1}$$

- The discount rate $\gamma$ is between 0 and 1.
- The exponent on the discount rate grows with time, so rewards further in the future are discounted more heavily.
- A future reward discounted by the discount rate is called its present value.
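The discounted sum can be computed directly from the formula above. A short sketch (the reward values are chosen arbitrarily):

```python
# G_t = sum_k gamma^k * r_{t+k+1}, where gamma is the discount rate.
def discounted_sum(rewards, gamma):
    """rewards[k] is r_{t+k+1}; returns the discounted sum of rewards G_t."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0]              # r_{t+1}, r_{t+2}, r_{t+3}
print(discounted_sum(rewards, 1.0))    # no discount: 3.0
print(discounted_sum(rewards, 0.9))    # 1 + 0.9 + 0.81 ≈ 2.71: the future is worth less
```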

What is the expected reward (value)?

Let's express the above formula recursively.

- A recursive formula is one that uses $G$ in the very expression that defines $G_{t}$ (here, $G_{t+1}$ appears in the definition of $G_{t}$).

$$G_{t} := r_{t+1}+\gamma r_{t+2}+\gamma^{2}r_{t+3}+\ldots+\gamma^{T-t-1}r_{T} = r_{t+1}+\gamma\left(r_{t+2}+\gamma r_{t+3}+\ldots+\gamma^{T-t-2}r_{T}\right) = r_{t+1}+\gamma G_{t+1}$$

So what is this $G_{t}$? Since the future rewards are not actually known at time $t$, it is an estimate.

- This estimated "sum of rewards" $G_{t}$ is called the **expected reward** or **value** (the term *value* is used in the explanations below).
- Calculating the value is called **value approximation**.
- This value evaluation is one of the two things reinforcement learning learns: the **"behavior evaluation method"**.
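The recursion can be checked numerically against the direct definition. The reward sequence below is arbitrary:

```python
# Verify numerically that G_t = r_{t+1} + gamma * G_{t+1}.
gamma = 0.9
rewards = [2.0, 0.0, 1.0, 5.0]   # r_{t+1}, ..., r_T for an episode

def G(t):
    """Discounted sum of rewards from step t onward (direct definition)."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

# The direct definition and the recursive form agree at every step.
for t in range(len(rewards) - 1):
    assert abs(G(t) - (rewards[t] + gamma * G(t + 1))) < 1e-12
print("recursion holds")
```

This recursive form is the seed of the Bellman equation, which later chapters build on.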

MDP components in maze movement


- State $s$: the current position (coordinates, square, cell, etc.).
- Action $a$: moving up, down, left, or right (since it is a maze, there are only four directions).
- Transition function $T$: a function that receives the state $s$ and action $a$ and returns the cells that can be moved to and the probability of moving to each (the transition probability $P_{a}$).
- Reward function $R$: a function that receives the state $s$ and the transition destination $s'$ and returns the reward $r$.
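These maze components can be sketched as follows. The grid layout, the deterministic transitions, and the reward values are all assumptions for illustration (the book's maze environment differs in detail):

```python
# A tiny maze MDP: states are (row, col) cells, actions are the four directions.
GRID = [["S", ".", "."],
        [".", "#", "."],
        [".", ".", "G"]]   # 'S' start, 'G' goal, '#' wall

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def T(state, action):
    """Transition function: returns {next_cell: probability}.
    Deterministic here: move if the target cell is open, otherwise stay put."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]) and GRID[nr][nc] != "#":
        return {(nr, nc): 1.0}
    return {(r, c): 1.0}

def R(state, next_state):
    """Reward function: +1 on reaching the goal cell, 0 otherwise."""
    r, c = next_state
    return 1.0 if GRID[r][c] == "G" else 0.0

# One step from the start cell.
s = (0, 0)
(s_next, p), = T(s, "down").items()
print(s_next, p, R(s, s_next))  # → (1, 0) 1.0 0.0
```

In a noisier maze, $T$ could instead spread probability over several neighboring cells, which is exactly why it returns a distribution rather than a single cell.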

Summary

- The period from the start to the end of the environment is one episode (Episode).
- **Purpose of reinforcement learning: maximize the total reward obtained in one episode**
- What the reinforcement learning model learns:
  1. How to evaluate actions
  2. How to choose actions (based on that evaluation) = the strategy
- The four components of an MDP:
  - $s$: State
  - $a$: Action
  - $T$: Transition function. Outputs the transition destination (next state) and transition probability $P_{a}$, given a state and an action.
  - $R$: Reward function (immediate reward). Outputs a reward, given a state and a transition destination (it may also take the action as an argument).
- $\pi$: Strategy. A function that receives a state and outputs an action. Something that acts according to the strategy is called an agent.
- The estimated "sum of rewards" $G_{t}$ is called the **expected reward** or **value**.
- Calculating the value is called **value approximation**; this value evaluation is one of the two things reinforcement learning learns, the **"behavior evaluation method"**.

Reference material

- [Machine Learning Startup Series: Reinforcement Learning with Python, From Introduction to Practice](https://www.amazon.co.jp/dp/4065142989)
- Math Webmemo: converts handwritten formulas to LaTeX. Seriously recommended!
