[PYTHON] Implemented DQN in TensorFlow (I wanted to ...)

After reading the articles below, I came away with the impression that DQN (Deep Q-Network) is interesting. AlphaGo, which has been a hot topic lately, is apparently also an extension of DQN... I think? (I don't really understand it.)

- History of DQN + Deep Q-Network written in Chainer
- DQN (Deep Q-Network) learning with an inverted pendulum
- Playing with machine learning with Chainer: can an addition game be learned with reinforcement learning?

So I tried implementing it with TensorFlow... (-_-;)?? I'm not sure I got it right. To be honest, I'm attempting this without properly understanding the theory or the math, and there seem to be very few TensorFlow examples out there. For now I just tried to imitate what I found, so I'd appreciate any comments pointing out misunderstandings or things to fix. Even comments like "this part is correct" or "this part is fine" would help a lot.

Another site I referenced: Deep-Q learning Pong with Tensorflow and PyGame. I mostly referred to the source code in the top half of that page.

Implementation details

Consider the following game:

- There is a number line from 0 to 100.
- The program starts at 0 and heads for 100.
- At each step the program has two choices: move +1 or +2.
- If the position it lands on is a multiple of 8 it receives a reward of -1 (a penalty); otherwise, if it is a multiple of 2, it receives a reward of +1. (A multiple of 8 is also a multiple of 2, so the penalty takes precedence, as in the code below.)

Let's train it a number of times and see how the program behaves.

Environment

- TensorFlow 0.7
- Ubuntu 14.04
- GCE vCPU x8 instance

Implementation

I'll put the full source code at the bottom; here I'll explain it part by part.

Graph creation

def inference(x_ph):

    with tf.name_scope('hidden1'):
        weights = tf.Variable(tf.truncated_normal([NUM_IMPUT, NUM_HIDDEN1], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_HIDDEN1], dtype=tf.float32), name='biases')
        hidden1 = tf.matmul(x_ph, weights) + biases

    with tf.name_scope('hidden2'):
        weights = tf.Variable(tf.truncated_normal([NUM_HIDDEN1, NUM_HIDDEN2], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_HIDDEN2], dtype=tf.float32), name='biases')
        hidden2 = tf.matmul(hidden1, weights) + biases

    with tf.name_scope('output'):
        weights = tf.Variable(tf.truncated_normal([NUM_HIDDEN2, NUM_OUTPUT], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_OUTPUT], dtype=tf.float32), name='biases')
        y = tf.matmul(hidden2, weights) + biases

    return y

There are two hidden layers with 100 units each. These numbers are fairly arbitrary (I settled on them while playing around). The input is a single number indicating the current position. There are two outputs: the expected reward for moving +1 and for moving +2 (I think). Many examples I saw initialized the weights to zero, but that didn't work here, so I used random initialization. (Is that a problem?) As for the activation function, ReLU didn't work well, while connecting the layers with no activation at all worked somewhat decently, so I left it out. (Is that a problem?)

Loss calculation

def loss(y, y_ph):
    return tf.reduce_mean(tf.nn.l2_loss((y - y_ph)))

The loss seems to be the squared error halved, so I implemented it with the API that computes exactly that (tf.nn.l2_loss).
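For reference, tf.nn.l2_loss(t) computes sum(t ** 2) / 2 and already returns a scalar, so the reduce_mean around it doesn't change the value. A minimal sketch of an equivalent explicit form (my own, not from the reference):

def loss_explicit(y, y_ph):
    # same as tf.nn.l2_loss(y - y_ph): half the sum of squared errors
    return 0.5 * tf.reduce_sum(tf.square(y - y_ph))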

The part to actually train

def getNextPositionReward(choice_position):

    if choice_position % 8 == 0:
        next_position_reward = -1.
    elif choice_position % 2 == 0:
        next_position_reward = 1.
    else:
        next_position_reward = 0.

    return next_position_reward

A function that returns a penalty if the candidate position is a multiple of 8 and a reward if it is a multiple of 2.
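As a quick sanity check (my own addition, reusing the getNextPositionReward above): the "always even" trajectory 2, 4, ..., 98 visits 49 even positions, 12 of which are multiples of 8.

even_total = sum(getNextPositionReward(p) for p in range(2, 100, 2))
print(even_total)   # 25.0 = 37 * (+1) + 12 * (-1)

Detouring around a multiple of 8 (for example 6 -> 7 -> 9 -> 10) trades that -1 for two zero-reward steps, so a policy that avoids multiples of 8 should score higher.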

def getNextPosition(position, action_reward1, action_reward2):

    if random.random() < RANDOM_FACTOR:
        if random.randint(0, 1) == 0:
            next_position = position + 1
        else:
            next_position = position + 2
    else:
        if action_reward1 > action_reward2:
            next_position = position + 1
        else:
            next_position = position + 2

    return next_position

This is the part that compares the two estimated values and decides whether to advance +1 or +2. During training, a certain amount of randomness is mixed in so that it keeps exploring.

    for i in range(REPEAT_TIMES):
        position = 0.
        position_history = []
        reward_history = []

        while(True):
            if position >= GOAL:
                break

            choice1_position = position + 1.
            choice2_position = position + 2.

            next_position1_reward = getNextPositionReward(choice1_position)
            next_position2_reward = getNextPositionReward(choice2_position)

            reward1 = sess.run(y, feed_dict={x_ph: [[choice1_position]]})[0]
            reward2 = sess.run(y, feed_dict={x_ph: [[choice2_position]]})[0]

            action_reward1 = next_position1_reward + GAMMA * np.max(reward1)
            action_reward2 = next_position2_reward + GAMMA * np.max(reward2)

            position_history.append([position])
            reward_history.append([action_reward1, action_reward2])

            position = getNextPosition(position, action_reward1, action_reward2)

        sess.run(train_step, feed_dict={x_ph: position_history, y_ph: reward_history})

The training part (excerpt). There are two options; I compare their values and choose the one that looks better. The value of each option is the sum of the reward that will (certainly) be obtained at the candidate position and GAMMA times the maximum of the rewards the network predicts (probably) can be obtained after that. I also collect the current position together with the pair of values as data for supervised learning. This is repeated about 1,000 times. ⇒ This is the part I'm most worried about; I suspect I'm making a big mistake here.
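To make the target computation concrete, here is a small worked example of my own for position 6, with made-up network outputs and GAMMA = 0.8 as in the definitions:

GAMMA = 0.8

# hypothetical network outputs (made up purely for illustration)
q_at_7 = [0.2, 0.5]   # predicted values of +1 / +2 from position 7
q_at_8 = [0.3, 0.1]   # predicted values of +1 / +2 from position 8

# immediate reward at the candidate position plus GAMMA times the
# best predicted value from there
target_plus1 = 0.0 + GAMMA * max(q_at_7)    # 7 is odd             -> 0.4
target_plus2 = -1.0 + GAMMA * max(q_at_8)   # 8 is a multiple of 8 -> -0.76

# reward_history gets [target_plus1, target_plus2] for position 6, and
# getNextPosition would usually choose the +1 move here.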

Result

Let's take a look at the trajectory of how it actually moved after training.

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98]

It just returns every even number. It steps on every single multiple of 8! It seems to have learned to chase the positive reward but not to avoid the negative one. I tried various things, such as making the negative reward larger; the trajectory above did change (so it's not producing a fixed pattern), but it never moved the way I hoped... By the way, the loss did converge.

Source code (all)

import tensorflow as tf
import numpy as np
import random

# definition
NUM_IMPUT = 1
NUM_HIDDEN1 = 100
NUM_HIDDEN2 = 100
NUM_OUTPUT = 2
LEARNING_RATE = 0.1
REPEAT_TIMES = 100
GOAL = 100
LOG_DIR = "tf_log"
GAMMA = 0.8
stddev = 0.01
RANDOM_FACTOR = 0.1

def inference(x_ph):

    with tf.name_scope('hidden1'):
        weights = tf.Variable(tf.truncated_normal([NUM_IMPUT, NUM_HIDDEN1], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_HIDDEN1], dtype=tf.float32), name='biases')
        hidden1 = tf.matmul(x_ph, weights) + biases

    with tf.name_scope('hidden2'):
        weights = tf.Variable(tf.truncated_normal([NUM_HIDDEN1, NUM_HIDDEN2], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_HIDDEN2], dtype=tf.float32), name='biases')
        hidden2 = tf.matmul(hidden1, weights) + biases

    with tf.name_scope('output'):
        weights = tf.Variable(tf.truncated_normal([NUM_HIDDEN2, NUM_OUTPUT], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_OUTPUT], dtype=tf.float32), name='biases')
        y = tf.matmul(hidden2, weights) + biases

    return y

def loss(y, y_ph):
    return tf.reduce_mean(tf.nn.l2_loss((y - y_ph)))

def optimize(loss):
    optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
    train_step = optimizer.minimize(loss)
    return train_step

def getNextPositionReward(choice_position):

    if choice_position % 8 == 0:
        next_position_reward = -1.
    elif choice_position % 2 == 0:
        next_position_reward = 1.
    else:
        next_position_reward = 0.

    return next_position_reward

def getNextPosition(position, action_reward1, action_reward2):

    if random.random() < RANDOM_FACTOR:
        if random.randint(0, 1) == 0:
            next_position = position + 1
        else:
            next_position = position + 2
    else:
        if action_reward1 > action_reward2:
            next_position = position + 1
        else:
            next_position = position + 2

    return next_position

if __name__ == "__main__":

    x_ph = tf.placeholder(tf.float32, [None, NUM_IMPUT])
    y_ph = tf.placeholder(tf.float32, [None, NUM_OUTPUT])

    y = inference(x_ph)
    loss = loss(y, y_ph)
    tf.scalar_summary("Loss", loss)
    train_step = optimize(loss)

    sess = tf.Session()
    summary_op = tf.merge_all_summaries()
    init = tf.initialize_all_variables()
    sess.run(init)
    summary_writer = tf.train.SummaryWriter(LOG_DIR, graph_def=sess.graph_def)

    for i in range(REPEAT_TIMES):
        position = 0.
        position_history = []
        reward_history = []

        while(True):
            if position >= GOAL:
                break

            choice1_position = position + 1.
            choice2_position = position + 2.

            next_position1_reward = getNextPositionReward(choice1_position)
            next_position2_reward = getNextPositionReward(choice2_position)

            reward1 = sess.run(y, feed_dict={x_ph: [[choice1_position]]})[0]
            reward2 = sess.run(y, feed_dict={x_ph: [[choice2_position]]})[0]

            action_reward1 = next_position1_reward + GAMMA * np.max(reward1)
            action_reward2 = next_position2_reward + GAMMA * np.max(reward2)

            position_history.append([position])
            reward_history.append([action_reward1, action_reward2])

            position = getNextPosition(position, action_reward1, action_reward2)

        sess.run(train_step, feed_dict={x_ph: position_history, y_ph: reward_history})
        summary_str = sess.run(summary_op, feed_dict={x_ph: position_history, y_ph: reward_history})
        summary_writer.add_summary(summary_str, i)
        if i % 10 == 0:
            print "Count: " + str(i)

    # TEST
    position = 0
    position_history = []
    while(True):
        if position >= GOAL:
            break

        position_history.append(position)

        rewards = sess.run(y, feed_dict={x_ph: [[position]]})[0]
        choice = np.argmax(rewards)
        if choice == 0:
            position += 1
        else:
            position += 2

    print position_history

I sincerely look forward to your advice and criticism.

2016/04/27 postscript

dsanno gave me some advice in the comments. Thank you very much. Let me try it out.

Part 1

With this problem setting, I think you can learn even without a hidden layer: just a single layer (or an embedding_lookup) with 100 input values (a one-hot vector representing the current position) and 2 output values.

I see, I see... I still don't understand embedding_lookup, so I'll set that aside, turn the input into a one-hot vector, and try it without a hidden layer.

def inference(x_ph):

    with tf.name_scope('output'):
        # with the one-hot input, NUM_IMPUT is assumed to be GOAL (100)
        weights = tf.Variable(tf.truncated_normal([NUM_IMPUT, NUM_OUTPUT], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_OUTPUT], dtype=tf.float32), name='biases')
        y = tf.matmul(x_ph, weights) + biases

    return y

Below is a function to create a one-hot vector.

def onehot(idx):
    idx = int(idx)
    array = np.zeros(GOAL)
    array[idx] = 1.
    return array
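The post doesn't show how the feed changes, but as my own sketch: the placeholder now takes GOAL-dimensional rows, so the network calls and the collected history would look something like this.

print(onehot(3)[:6])    # [0. 0. 0. 1. 0. 0.]

# assumed changes in the training loop (not shown in the original post):
# reward1 = sess.run(y, feed_dict={x_ph: [onehot(choice1_position)]})[0]
# position_history.append(onehot(position))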

Result

[0, 2, 4, 6, 7, 9, 10, 12, 14, 15, 17, 18, 20, 22, 23, 25, 26, 28, 30, 32, 34, 36, 38, 39, 41, 42, 44, 46, 47, 49, 50, 52, 53, 54, 55, 57, 58, 60, 62, 63, 65, 66, 68, 70, 71, 73, 74, 76, 78, 79, 81, 82, 84, 86, 88, 90, 92, 94, 95, 97, 98, 99]

Now that's more like it. It's not perfect, but it seems to be trying to avoid multiples of 8 while stepping on as many multiples of 2 as it can.

Part 2

With ReLU there is no upper bound on the output, which seems to be a poor fit, so use an activation function with an upper bound, such as tanh or relu6.

I tried this with hidden layers of 100 and 100 units, keeping the single-number input (the modified inference is sketched below). The result was not much different from having no activation function.
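The modified code isn't shown in the original post; my guess is that it just wraps each hidden layer in tf.tanh (tf.nn.relu6 would be the other suggested option), something like this sketch:

def inference(x_ph):

    with tf.name_scope('hidden1'):
        weights = tf.Variable(tf.truncated_normal([NUM_IMPUT, NUM_HIDDEN1], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_HIDDEN1], dtype=tf.float32), name='biases')
        hidden1 = tf.tanh(tf.matmul(x_ph, weights) + biases)

    with tf.name_scope('hidden2'):
        weights = tf.Variable(tf.truncated_normal([NUM_HIDDEN1, NUM_HIDDEN2], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_HIDDEN2], dtype=tf.float32), name='biases')
        hidden2 = tf.tanh(tf.matmul(hidden1, weights) + biases)

    with tf.name_scope('output'):
        weights = tf.Variable(tf.truncated_normal([NUM_HIDDEN2, NUM_OUTPUT], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_OUTPUT], dtype=tf.float32), name='biases')
        y = tf.matmul(hidden2, weights) + biases

    return y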

Result

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98]

Part 3

If you can assume that the solution is periodic, use tf.sin as the activation function (for example, sin for the first layer and relu for the second).

I tried this with hidden layers of 100 and 100 units, keeping the single-number input.

def inference(x_ph):

    with tf.name_scope('hidden1'):
        weights = tf.Variable(tf.zeros([NUM_IMPUT, NUM_HIDDEN1], dtype=tf.float32), name='weights')
        biases = tf.Variable(tf.zeros([NUM_HIDDEN1], dtype=tf.float32), name='biases')
        hidden1 = tf.sin(tf.matmul(x_ph, weights) + biases)

    with tf.name_scope('hidden2'):
        weights = tf.Variable(tf.truncated_normal([NUM_HIDDEN1, NUM_HIDDEN2], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_HIDDEN2], dtype=tf.float32), name='biases')
        hidden2 = tf.nn.relu(tf.matmul(hidden1, weights) + biases)

    with tf.name_scope('output'):
        weights = tf.Variable(tf.truncated_normal([NUM_HIDDEN2, NUM_OUTPUT], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_OUTPUT], dtype=tf.float32), name='biases')
        y = tf.matmul(hidden2, weights) + biases

    return y

Result

[0, 2, 4, 6, 8, 9, 10, 12, 14, 15, 17, 18, 20, 22, 23, 25, 26, 28, 29, 30, 31, 33, 34, 36, 38, 39, 41, 43, 44, 46, 47, 49, 50, 51, 53, 55, 57, 58, 60, 62, 63, 64, 66, 68, 69, 71, 73, 74, 76, 78, 79, 81, 82, 83, 84, 85, 87, 89, 90, 92, 94, 95, 97, 98]

It still steps on the very first 8, but it seems to be doing its best here as well.

I tweaked things a little and set the hidden layer units to 500 and 100.

Result

[0, 2, 4, 6, 7, 9, 10, 12, 14, 15, 17, 18, 20, 22, 23, 25, 26, 28, 30, 31, 33, 34, 36, 38, 39, 41, 42, 44, 46, 47, 49, 50, 52, 54, 55, 57, 58, 60, 62, 63, 65, 66, 68, 70, 71, 73, 74, 76, 78, 79, 81, 82, 84, 86, 87, 89, 90, 92, 94, 95, 97, 98]

Just about perfect? It no longer lands on any multiple of 8. I would never have thought of using sin(). Thank you again, dsanno.

Impressions

When I heard about artificial intelligence, I had the illusion that if I just ran it, it would figure everything out by itself, but I've realized that the creator still has to think carefully about the characteristics of the input data.
