[PYTHON] Deep Reinforcement Learning 3 Practical Edition: Breakout

Aidemy 2020/11/22

Introduction

Hello, I'm Yope! I come from a thoroughly humanities background, but I was interested in the possibilities of AI, so I enrolled in the AI-specialized school "Aidemy" to study. I want to share the knowledge I gained there, so I am summarizing it on Qiita. I am very happy that so many people have read my previous summary articles. Thank you! This is my third post on deep reinforcement learning. Nice to meet you.

What you will learn this time

Hands-on reinforcement learning with Breakout

Creating an environment

・Create the environment with the same method as in Chapter 2, `gym.make()`. For Breakout, pass `"BreakoutDeterministic-v4"` as the argument.
・The number of available actions can be checked with `env.action_space.n`.

・Code (shown as a screenshot in the original post):
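Roughly, that step would look like the following minimal sketch (the printed value is what gym reports for this environment):

```python
import gym

# Create the Breakout environment, just as in Chapter 2
env = gym.make("BreakoutDeterministic-v4")

# Check the number of available actions
nb_actions = env.action_space.n
print(nb_actions)  # -> 4 (NOOP, FIRE, RIGHT, LEFT)
```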

Model building

・Here we build a multi-layer neural network. The input is 4 frames of the Breakout screen. To reduce the amount of computation, each frame is resized to grayscale 84 × 84 pixels.
・The model uses `Sequential()`. As in Chapter 2, flatten the input with `Flatten()`, add fully connected layers with `Dense`, and activation functions with `Activation`.
・Since the input this time is an image (two-dimensional), we use `Convolution2D()`, a two-dimensional convolution layer. Its first argument, `filters`, specifies the dimensionality of the output space (the number of filters); its second argument, `kernel_size`, specifies the width and height of the convolution window; and `strides` specifies the stride, that is, how far the window moves at each step.

・Code (shown as a screenshot in the original post):
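Since the original code is only a screenshot, here is a sketch modeled on the standard keras-rl Atari DQN example; the exact layer sizes are assumptions:

```python
from keras.models import Sequential
from keras.layers import Permute, Convolution2D, Flatten, Dense, Activation

INPUT_SHAPE = (84, 84)  # grayscale 84x84 pixels
WINDOW_LENGTH = 4       # 4 stacked frames form one input
nb_actions = 4          # env.action_space.n for Breakout (see above)

model = Sequential()
# keras-rl passes observations as (window_length, height, width);
# reorder the axes so the 4 frames become the channel dimension
model.add(Permute((2, 3, 1), input_shape=(WINDOW_LENGTH,) + INPUT_SHAPE))
model.add(Convolution2D(32, (8, 8), strides=(4, 4)))
model.add(Activation('relu'))
model.add(Convolution2D(64, (4, 4), strides=(2, 2)))
model.add(Activation('relu'))
model.add(Convolution2D(64, (3, 3), strides=(1, 1)))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
```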

History and policy settings

・As in Chapter 2, set up the history and the policy required to create an agent.
・Use `SequentialMemory()` for the history. Specify `limit` and `window_length` as arguments.
・Use `BoltzmannQPolicy()` for the Boltzmann policy, and `EpsGreedyQPolicy()` for the ε-greedy method.
・To change the parameter ε linearly, wrap the policy in `LinearAnnealedPolicy()`. With the arguments specified as in the code below, ε is decreased linearly from a maximum of 1.0 to a minimum of 0.1 over 10 steps during training, and fixed at 0.05 during testing.

・Code (shown as a screenshot in the original post):
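A sketch of that setup; the `limit` value is an assumption, and `nb_steps=10` simply follows the description above (in practice a far larger value, on the order of a million steps, is typical for Atari):

```python
from rl.memory import SequentialMemory
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy

# History: replay memory holding up to `limit` transitions,
# with 4 consecutive frames per observation
memory = SequentialMemory(limit=1000000, window_length=WINDOW_LENGTH)

# Policy: epsilon-greedy, with epsilon annealed linearly during training
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps',
                              value_max=1.0,    # epsilon at the start of training
                              value_min=0.1,    # epsilon at the end of annealing
                              value_test=0.05,  # fixed epsilon during tests
                              nb_steps=10)      # annealing length from the text
```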

Agent settings

・An agent can be created by passing `model`, `memory`, `policy`, `nb_actions`, and `nb_steps_warmup` to `DQNAgent()`. After that, specify how it learns with `dqn.compile()`: the first argument is the optimization algorithm and the second is the list of evaluation metrics.

・Code (shown as a screenshot in the original post):
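A sketch under the same assumptions as above; the `nb_steps_warmup` value and the optimizer settings are assumptions:

```python
from rl.agents.dqn import DQNAgent
from keras.optimizers import Adam

dqn = DQNAgent(model=model, memory=memory, policy=policy,
               nb_actions=nb_actions, nb_steps_warmup=50000)

# First argument: optimization algorithm; second: evaluation metrics
dqn.compile(Adam(lr=0.00025), metrics=['mae'])
```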

Running the training

・Once the settings in the previous sections are complete, train with the DQN algorithm using `dqn.fit()`: specify the environment as the first argument and the number of steps to train, `nb_steps`, as the second.
・The training result can be saved in HDF5 format with `dqn.save_weights()`. Specify the file name as the first argument and whether overwriting is allowed, `overwrite`, as the second.

・Code (shown as a screenshot in the original post):
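A sketch of those two calls; the step count and file name are assumptions:

```python
# Train for a fixed number of steps, then save the weights in HDF5 format
dqn.fit(env, nb_steps=1750000)
dqn.save_weights('dqn_breakout_weights.h5f', overwrite=True)
```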

Conducting the test

・Test with the trained agent using `dqn.test()`. The arguments are the same as for `fit`, except that the number of episodes, `nb_episodes`, is specified instead of the number of steps `nb_steps`.
・Incidentally, in this Breakout task one episode lasts until the ball is dropped.
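A sketch of the test call; the episode count is an assumption:

```python
# Evaluate the trained agent over a few episodes
dqn.test(env, nb_episodes=10, visualize=False)
```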

Dueling DQN

What is Dueling DQN?

・Dueling DQN (DDQN) is an advanced version of DQN that modifies the end of the DQN network.
・In DQN, the Q values are output through fully connected layers placed after the first three convolution layers. Dueling DQN splits this fully connected part into two streams: one outputs the state value V and the other the advantage A of each action. The Q values are computed by a final fully connected layer that takes these two as inputs, which gives higher performance than plain DQN.

・Figure: the Dueling DQN network structure (shown as a screenshot in the original post)

Implementation of Dueling DQN

・The implementation of Dueling DQN is the same as DQN up to adding the layers. It can be enabled by passing `enable_dueling_network=True` to `DQNAgent()` when setting up the agent and specifying the Q-value aggregation method with `dueling_type`, which accepts `'avg'`, `'max'`, or `'naive'`.

・Code (shown as a screenshot in the original post):
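A sketch of the change; apart from the two extra arguments, the agent setup is identical to the plain DQN agent above:

```python
dqn = DQNAgent(model=model, memory=memory, policy=policy,
               nb_actions=nb_actions, nb_steps_warmup=50000,
               enable_dueling_network=True,  # split into V and A streams
               dueling_type='avg')           # 'avg', 'max', or 'naive'
dqn.compile(Adam(lr=0.00025), metrics=['mae'])
```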

・Result (shown as a screenshot in the original post)

Summary

・For Breakout as well, the environment can be defined in the same way as in Chapter 2.
・For model building, since the input this time is a two-dimensional image, we use the `Convolution2D` convolution layer.
・The policy uses the ε-greedy method, but the parameter ε needs to be changed linearly. In such a case, use `LinearAnnealedPolicy()` to anneal it linearly.
・A trained model can be saved in HDF5 format with `dqn.save_weights()`.
・Dueling DQN is a DQN that splits the final fully connected part into two streams, computes the state value V and the advantage A separately, and obtains the Q values from the two in the last layer. To implement it, specify `enable_dueling_network` and `dueling_type` in `DQNAgent()`.
