[PYTHON] <Course> Deep Learning Day4 Reinforcement Learning / TensorFlow

study-ai


Deep learning

Table of contents:
[Deep Learning: Day1 NN](https://qiita.com/matsukura04583/items/6317c57bc21de646da8e)
[Deep Learning: Day2 CNN](https://qiita.com/matsukura04583/items/29f0dcc3ddeca4bf69a2)
[Deep Learning: Day3 RNN](https://qiita.com/matsukura04583/items/9b77a238da4441e0f973)
[Deep Learning: Day4 Reinforcement Learning / TensorFlow](https://qiita.com/matsukura04583/items/50806b750c8d77f2305d)

Deep Learning: Day4 Reinforcement Learning / TensorFlow (Lecture Summary)

Section1) TensorFlow implementation exercise

Linear regression (DN65)

[try]

Let's change the value of noise
Let's change the value of d

(Results screenshot omitted.)

⇒ [Discussion]

| Optimizer name | Description |
|---|---|
| GradientDescentOptimizer | Gradient descent optimizer |
| AdagradOptimizer | AdaGrad optimizer |
| MomentumOptimizer | Momentum optimizer |
| AdamOptimizer | Adam optimizer |
| FtrlOptimizer | Follow the Regularized Leader algorithm (I have not studied this one) |
| RMSPropOptimizer | Optimizer that automates the adjustment of the learning rate |

(Reference) Optimizer for tensorflow
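As a minimal sketch of how these optimizers are swapped in (assuming TensorFlow 1.x, as used in the exercises in this post; the toy variable and loss below are hypothetical stand-ins), only the line that constructs the optimizer needs to change:

python

import tensorflow as tf  # assumes TensorFlow 1.x

# Hypothetical one-parameter model and loss, standing in for the exercise's model
w = tf.Variable(0.0)
loss = tf.square(w - 3.0)

# Any optimizer from the table above can be dropped in here
# optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
# optimizer = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
# optimizer = tf.train.RMSPropOptimizer(learning_rate=0.1)
optimizer = tf.train.AdamOptimizer(learning_rate=0.1)
train = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(200):
        sess.run(train)
    print(sess.run(w))  # should approach 3.0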

Nonlinear Regression (DN66)

[try]
Let's change the value of noise
Let's change the value of d

(Results screenshot omitted.)

Exercise (DN67)

[try]

(Results screenshot omitted.)
⇒ [Discussion] Adjusting learning_rate turned out to be more effective than adjusting iters_num (the number of training iterations). [Changed source]

python


import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

#Change the iteration here
iters_num = 10000
plot_interval = 100

#Generate data
n=100
# np.random.rand(): generates uniform random numbers in [0.0, 1.0)
x = np.random.rand(n).astype(np.float32) * 4 - 2
d =  30 * x ** 2 +0.5 * x + 0.2

#Add noise
noise = 0.05
d = d + noise * np.random.randn(n) 

# Model
# Note: no bias term b is used (the x**0 = 1 feature plays that role).
# Changed: the number of weights W was reduced from 4 to 3 (degree-2 polynomial)
#xt = tf.placeholder(tf.float32, [None, 4])
xt = tf.placeholder(tf.float32, [None, 3])
dt = tf.placeholder(tf.float32, [None, 1])
# Changed: W reduced from 4 to 3 to match
#W = tf.Variable(tf.random_normal([4, 1], stddev=0.01))
W = tf.Variable(tf.random_normal([3, 1], stddev=0.01))
y = tf.matmul(xt,W)

# Error function: mean squared error
loss = tf.reduce_mean(tf.square(y - dt))
#Change the learning rate here
optimizer = tf.train.AdamOptimizer(0.001)
train = optimizer.minimize(loss)

#Initialization
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

#Prepare the created data as training data
d_train = d.reshape(-1,1)
#x_train = np.zeros([n, 4])
x_train = np.zeros([n, 3])
for i in range(n):
# Changed: loop over 3 weights instead of 4
#    for j in range(4):
    for j in range(3):
        x_train[i, j] = x[i]**j

#training
for i in range(iters_num):
    if (i+1) % plot_interval == 0:
        loss_val = sess.run(loss, feed_dict={xt:x_train, dt:d_train}) 
        W_val = sess.run(W)
        print('Generation: ' + str(i+1) + '.error= ' + str(loss_val))
    sess.run(train, feed_dict={xt:x_train,dt:d_train})

print(W_val[::-1])
    
#Prediction function
def predict(x):
    result = 0.
# Changed: sum over 3 weights instead of 4
#   for i in range(0,4):
    for i in range(0,3):
        result += W_val[i,0] * x ** i
    return result

fig = plt.figure()
subplot = fig.add_subplot(1,1,1)
plt.scatter(x ,d)
linex = np.linspace(-2,2,100)
liney = predict(linex)
subplot.plot(linex,liney)
plt.show()

MNIST 1 (DN68)

Classification 3 layers (mnist) (DN69)

[try]
Let's resize the hidden layer
Let's change the optimizer
⇒ [Discussion] (screenshot omitted) When the size of the hidden layer was halved, the accuracy dropped significantly. On the other hand, when the optimizer was changed from Adam to Momentum, the accuracy rose from 90% to 94%. I tried the other optimizers as well; RMSProp was the best at 96%. I also tried doubling the size of the hidden layer, but the accuracy improved only by about 1%, so once the hidden layer is large enough, it seems preferable to tune the optimizer instead.
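For reference, here is a minimal sketch of the kind of 3-layer network whose hidden sizes and optimizer are being varied here (assuming TensorFlow 1.x; the layer sizes, variable names, and initialization are my own assumptions, not the course's exact source):

python

import tensorflow as tf  # assumes TensorFlow 1.x

# Hypothetical hidden-layer sizes: halving or doubling these is the first [try] above
hidden_size_1, hidden_size_2 = 600, 300

x = tf.placeholder(tf.float32, [None, 784])   # flattened 28x28 MNIST image
d = tf.placeholder(tf.float32, [None, 10])    # one-hot label

W1 = tf.Variable(tf.random_normal([784, hidden_size_1], stddev=0.01))
b1 = tf.Variable(tf.zeros([hidden_size_1]))
W2 = tf.Variable(tf.random_normal([hidden_size_1, hidden_size_2], stddev=0.01))
b2 = tf.Variable(tf.zeros([hidden_size_2]))
W3 = tf.Variable(tf.random_normal([hidden_size_2, 10], stddev=0.01))
b3 = tf.Variable(tf.zeros([10]))

z1 = tf.nn.relu(tf.matmul(x, W1) + b1)
z2 = tf.nn.relu(tf.matmul(z1, W2) + b2)
y = tf.nn.softmax(tf.matmul(z2, W3) + b3)

# Cross-entropy loss (clipped to avoid log(0))
loss = tf.reduce_mean(-tf.reduce_sum(d * tf.log(tf.clip_by_value(y, 1e-10, 1.0)), axis=1))

# The optimizer is the second knob varied in the [try]
# optimizer = tf.train.MomentumOptimizer(0.01, momentum=0.9)
# optimizer = tf.train.RMSPropOptimizer(0.001)
optimizer = tf.train.AdamOptimizer(0.001)
train = optimizer.minimize(loss)
# (MNIST loading and the training loop are omitted here.)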

Classification CNN (mnist) (DN70)

conv - relu - pool - conv - relu - pool - affine - relu - dropout - affine - softmax
[try]

Let's change the dropout rate to 0
⇒ [Discussion]
(Before change) dropout_rate = 0.5 (screenshot omitted)

(After change) dropout_rate = 0 (screenshot omitted) I expected the accuracy to drop more, but it did not change much.
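A small sketch of what this change amounts to, assuming the TensorFlow 1.x tf.nn.dropout / keep_prob style; the layer shown is a hypothetical stand-in for the affine output in the network above. A dropout rate of 0 corresponds to keeping every unit, i.e. dropout is effectively disabled:

python

import numpy as np
import tensorflow as tf  # assumes TensorFlow 1.x

fc = tf.placeholder(tf.float32, [None, 1024])  # hypothetical affine-layer output
keep_prob = tf.placeholder(tf.float32)         # keep_prob = 1.0 - dropout_rate
drop = tf.nn.dropout(fc, keep_prob=keep_prob)

with tf.Session() as sess:
    dummy = np.ones((1, 1024), dtype=np.float32)
    # dropout_rate = 0.5: about half the units are zeroed (the rest are scaled by 1/keep_prob)
    print(np.mean(sess.run(drop, feed_dict={fc: dummy, keep_prob: 0.5}) == 0))
    # dropout_rate = 0: keep_prob = 1.0, nothing is dropped
    print(np.mean(sess.run(drop, feed_dict={fc: dummy, keep_prob: 1.0}) == 0))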

Example explanation

(Question screenshot omitted.)

⇒ [Discussion] The answer is (a).

  1. GoogLeNet is a network composed of a stack of Inception modules.
  2. The Inception module defines a small network inside a single module, as shown in the figure above, so statement (d) is correct. The Inception module is characterized by performing the convolution part with multiple filter sizes. Statement (c), which reduces dimensions with 1x1 convolutions, is correct, and statement (b), which improves expressiveness while reducing the number of parameters through multiple convolutions, is also correct.

Statement (a) is the incorrect one: regarding the (auxiliary) loss, its feature is that classification is performed in a branch that splits off from the middle of the network.

The explanation of the following examples is omitted.

[DN73] Confirmation test in the example explanation: briefly describe the features of VGG, GoogLeNet, and ResNet.

VGG is the oldest of the three (2014). Its characteristic is simplicity: it stacks simple blocks such as Convolution, Convolution, max_pool. On the other hand, it has a large number of parameters compared to the other two. GoogLeNet's characteristic is its use of the Inception module: dimensionality reduction with 1x1 convolutions and sparsity through the use of multiple filter sizes. ResNet's characteristic is that it enables very deep learning by adding residual (skip) connections through the identity module.
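As an illustration of the skip connection (identity module) mentioned for ResNet, here is a minimal Keras sketch of one residual block; the input shape and filter counts are arbitrary assumptions, not from the course material:

python

from keras.layers import Input, Conv2D, Add, Activation
from keras.models import Model

# One residual block: the output is F(x) + x, where x flows through the skip connection
x_in = Input(shape=(32, 32, 64))
h = Conv2D(64, (3, 3), padding='same', activation='relu')(x_in)
h = Conv2D(64, (3, 3), padding='same')(h)
out = Activation('relu')(Add()([x_in, h]))
model = Model(inputs=x_in, outputs=out)
model.summary()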

Keras2 (DN69)

Simple perceptron

OR circuit
[try]
Change np.random.seed(0) to np.random.seed(1)
Change the number of epochs to 100
Change to an AND circuit and an XOR circuit
Change the batch size to 10 with the OR circuit
Change the number of epochs to 300
⇒ [Discussion]
(Before change) np.random.seed(0) (screenshot omitted)
(After change) changed to np.random.seed(1) (screenshot omitted)
(After change) epochs changed from 30 to 100 (screenshot omitted)
(After change) changed to an AND circuit: OR and AND are linearly separable, but XOR is not linearly separable and cannot be learned.
(After change) batch size changed to 10 with the OR circuit (screenshot omitted)
(After change) number of epochs changed to 300 (screenshot omitted)
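A minimal sketch of the simple perceptron being varied above, assuming Keras 2 with the old-style SGD(lr=...) argument used elsewhere in this post; the loss choice and layer setup are my assumptions, not the course's exact source. Swapping d for [0,0,0,1] (AND) or [0,1,1,0] (XOR), and changing epochs, batch_size, or the seed, reproduces the trials above:

python

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

np.random.seed(0)  # the first trial compares seed(0) with seed(1)

# OR gate training data
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([[0], [1], [1], [1]])

# Single-layer perceptron: one sigmoid unit on two inputs
model = Sequential()
model.add(Dense(1, input_dim=2, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=SGD(lr=0.1), metrics=['accuracy'])

# epochs (30 -> 100 -> 300) and batch_size (1 -> 10) are the other knobs varied above
model.fit(x, d, epochs=30, batch_size=1, verbose=0)
print(model.predict(x))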

Classification (iris)

[try]

(Before change / ReLU) (screenshot omitted)
(Changed the activation function to sigmoid) (screenshot omitted) From the graphs, ReLU is clearly more accurate.
(Changed the optimizer to optimizer = SGD(lr=0.1)) (screenshot omitted)

With optimizer = SGD(lr=0.1), the accuracy improves in places, occasionally reaching 1.0, but it also fluctuates a lot.
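For reference, a minimal sketch of an iris classifier with the optimizer swapped to SGD(lr=0.1) as in the trial above; the layer sizes, the train/test split, and the use of scikit-learn's loader are assumptions on my part:

python

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

# Load iris and split into train/test
x, t = load_iris(return_X_y=True)
x_train, x_test, t_train, t_test = train_test_split(x, t, test_size=0.2, random_state=0)

model = Sequential()
model.add(Dense(12, input_dim=4, activation='relu'))  # swap 'relu' for 'sigmoid' to reproduce the other trial
model.add(Dense(3, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy',
              optimizer=SGD(lr=0.1),                  # the changed setting discussed above
              metrics=['accuracy'])
model.fit(x_train, t_train, epochs=50, verbose=0)
print(model.evaluate(x_test, t_test, verbose=0))      # [loss, accuracy]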

Classification (mnist)

[try]

(Before change) (screenshot omitted)
(After change) one_hot_label changed to False (screenshot omitted)

(After change) Error function changed to sparse_categorical_crossentropy and one_hot_label changed to False (screenshot omitted)

categorical_crossentropy → one_hot_label must be True; sparse_categorical_crossentropy → one_hot_label must be False. If not, an error will occur.

(After change) Changed the value of Adam's lr argument (learning rate 0.01 → 0.1) (screenshot omitted)
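A small sketch of the pairing between the loss function and the label format described above (Keras; the toy data is a hypothetical 3-class stand-in for MNIST):

python

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

# Hypothetical 3-class toy data standing in for MNIST
x = np.random.rand(10, 4).astype('float32')
labels = np.random.randint(0, 3, size=(10,))     # integer labels (one_hot_label=False)
one_hot = to_categorical(labels, num_classes=3)  # one-hot labels (one_hot_label=True)

model = Sequential()
model.add(Dense(3, input_dim=4, activation='softmax'))

# categorical_crossentropy expects one-hot targets
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x, one_hot, epochs=1, verbose=0)

# sparse_categorical_crossentropy expects integer targets; mixing the two up raises an error
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x, labels, epochs=1, verbose=0)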

RNN (Prediction of binary addition) Keras RNN documentation

[try]
(Before change) (screenshot omitted)
(After change) Changed the number of output nodes to 128: SimpleRNN units = 16 ⇒ units = 128. (screenshot omitted) Acc already reaches 0.9299 at epoch 1.
(After change) Changed the output activation function from ReLU to sigmoid. (screenshot omitted) With sigmoid, Acc does not rise as much as with ReLU.
(After change) Changed the output activation function to tanh. (screenshot omitted) Acc does reach 100%, but it takes until epoch 3.

(After change) Optimization method changed to Adam. Source change:

python


#model.compile(loss='mean_squared_error', optimizer=SGD(lr=0.1), metrics=['accuracy'])
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])

(screenshot omitted) Acc gives a mostly good result here as well.

(After change) Input dropout set to 0.5. (screenshot omitted) Acc does not rise as much as expected.

(After change) Recurrent dropout set to 0.3. (screenshot omitted) Here, too, Acc only reaches 98%.

(After change) unroll set to True. (screenshot omitted) This also gives a good result.
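For reference, a minimal sketch showing where the parameters varied above live on the Keras SimpleRNN layer; the input shape (8 time steps, 2 bits per step, for binary addition) and the surrounding layers are assumptions, not the course's exact source:

python

from keras.models import Sequential
from keras.layers import SimpleRNN, Dense

model = Sequential()
model.add(SimpleRNN(units=16,              # -> 128 in the first trial
                    dropout=0.0,           # input dropout -> 0.5
                    recurrent_dropout=0.0, # recurrent dropout -> 0.3
                    unroll=False,          # -> True
                    input_shape=(8, 2)))   # assumed: 8 time steps, 2 input bits per step
model.add(Dense(1, activation='sigmoid'))  # output activation: the trials also try relu / tanh here
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
model.summary()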

Section2) Reinforcement learning

2-1 What is reinforcement learning?

A field of machine learning that aims to create agents that can choose actions in an environment so as to maximize reward in the long run. ⇒ It is a mechanism that improves the principle for deciding actions, based on the profit (reward) given as a result of those actions.

[D81] Reinforcement learning 1 confirmation test: consider an example to which reinforcement learning could be applied, and list the environment, agent, actions, and rewards.

⇒ [Discussion] A stock investment robot:
Environment ⇒ the stock market
Agent ⇒ the investor
Action ⇒ select and invest in stocks that are likely to be profitable
Reward ⇒ the profit/loss from buying and selling stocks

2-2 Application example of reinforcement learning

Marketing example:
Environment: the company's sales promotion department
Agent: software that decides, based on each customer's profile and purchase history, which customers to send campaign emails to
Action: choose one of two actions, send or do not send, for each customer
Reward: a negative reward for the campaign cost and a positive reward for the sales estimated to result from the campaign

2-3 Trade-off between search and use

With perfect knowledge of the environment in advance, it is possible to predict and determine optimal behavior.
⇒ A situation where it is known in advance what kind of customer the campaign email is sent to and what action that customer will take.
⇒ In reinforcement learning, the above assumption does not hold: the agent collects data while acting on the basis of incomplete knowledge, and finds the best actions from that data.

With historical data, if you always take only the best known action, you cannot discover other, possibly better actions ⇒ insufficient exploration. Conversely, if you keep taking only unknown actions, you cannot make use of past experience ⇒ insufficient exploitation. The two are in a trade-off relationship (a toy sketch of this trade-off follows below).
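A toy numpy sketch of this trade-off using an epsilon-greedy rule (the two-armed bandit setup, reward values, and epsilon are all illustrative assumptions, not part of the lecture):

python

import numpy as np

np.random.seed(0)
true_reward = np.array([0.2, 0.8])  # unknown to the agent
estimates = np.zeros(2)             # the agent's current value estimates
counts = np.zeros(2)
epsilon = 0.1                       # fraction of purely exploratory actions

for t in range(1000):
    if np.random.rand() < epsilon:
        a = np.random.randint(2)        # explore: try an action regardless of the estimates
    else:
        a = int(np.argmax(estimates))   # exploit: use the best action found so far
    r = true_reward[a] + 0.1 * np.random.randn()
    counts[a] += 1
    estimates[a] += (r - estimates[a]) / counts[a]  # incremental average of observed rewards
print(estimates)  # should approach [0.2, 0.8]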

2-4 Image of reinforcement learning

(Figure omitted.)

2-5 Differences in reinforcement learning

Differences between reinforcement learning and supervised and unsupervised learning

Conclusion: different goals

History of reinforcement learning
・Although there was a winter era, advances in computation speed are making reinforcement learning possible even for large-scale state spaces.
・The appearance of methods that combine function approximation with Q-learning.

Q-learning
・A method that proceeds with learning by updating the action value function each time an action is taken.
Function approximation methods
・Methods that approximate the value function and the policy function with parameterized functions.
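For reference, the per-step update of the action value function that Q-learning performs is commonly written as follows (a standard textbook form rather than the lecture's slide; α is the learning rate and γ the discount factor):

Q(s_t,a_t) \leftarrow Q(s_t,a_t)+\alpha\left(r_{t+1}+\gamma \max_{a'}Q(s_{t+1},a')-Q(s_t,a_t)\right)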

2-6 Action value function

What is the action value function?
The action value function Q(s,a) expresses the value of taking action a in state s (in contrast to the state value function V(s), which evaluates the value of the state alone).

2-7 Policy function

A policy function is a function that, in policy-based reinforcement learning methods, gives the probability of taking each action in a given state.

2-8 Policy Gradient Method

Policy iteration methods are techniques that model the policy directly and optimize it ⇒ the policy gradient method.

\theta^{(t+1)}=\theta^{(t)}+\epsilon\nabla J(\theta)

What is J? ⇒ It measures how good the policy is, and it must be defined.

Ways to define it:
・the average reward
・the discounted reward sum
The action value function Q(s,a) is defined so as to correspond to the definitions above, and the policy gradient theorem holds:

\nabla_{\theta} J(\theta)=\mathbb{E}_{\pi_\theta}\left[\nabla_{\theta}\log\pi_\theta(a|s)\,Q^{\pi}(s,a)\right]
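As a toy illustration of the two formulas above, here is a numpy sketch of a REINFORCE-style update for a stateless two-action softmax policy (the reward setup, the use of a sampled reward in place of Q^π(s,a), and all constants are my own illustrative assumptions):

python

import numpy as np

np.random.seed(0)
theta = np.zeros(2)                 # policy parameters, one per action
epsilon = 0.1                       # the learning rate epsilon in the update rule
true_reward = np.array([1.0, 0.0])  # hypothetical expected reward for each action

def policy(theta):
    """Softmax policy pi_theta(a) over the two actions."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

for t in range(1000):
    pi = policy(theta)
    a = np.random.choice(2, p=pi)
    q = true_reward[a] + 0.1 * np.random.randn()  # sampled reward standing in for Q(s, a)
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0                         # gradient of log softmax for the chosen action
    theta = theta + epsilon * q * grad_log_pi     # theta^(t+1) = theta^(t) + epsilon * grad J
print(policy(theta))  # the probability of the higher-reward action should approach 1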
