[PYTHON] I tried to make Othello AI with tensorflow without understanding the theory of machine learning ~ Implementation ~

Series table of contents

- I tried to make Othello AI with tensorflow without understanding the theory of machine learning ~ Introduction ~
- I tried to make Othello AI with tensorflow without understanding the theory of machine learning ~ Implementation ~
- I tried to make Othello AI with tensorflow without understanding the theory of machine learning ~ Battle Edition ~
- I tried to make Othello AI after trying to understand the theory of machine learning ~ Restart! ~
- I tried to make Othello AI after trying to understand the theory of machine learning ~ What is this Alpha Zero edition ~
- I tried to make a neural network with Excel to understand the theory of machine learning ~ Image recognition mnist edition ~

Continuing from last time: as a complete outsider to this field who has not studied the "theory of machine learning" at all, I would like to build an Othello AI. The sites I referenced are here:

- Implement DQN with Keras, TensorFlow and OpenAI Gym
- Training TensorFlow neural network to play Tic-Tac-Toe game using one-step Q-learning algorithm

Basics of reinforcement learning

I built the Othello AI without studying "machine learning theory" at all, so here is a summary of the minimum knowledge needed for the implementation.

File structure and role

The file structure and the role of each file are as follows:

- train.py --- trains the AI
- Reversi.py --- manages the Othello game
- dqn_agent.py --- manages the AI's training
- FightWithAI.py --- plays a match against the user

Overall algorithm

The DQN algorithm implemented this time looks like this.

(Figure: algorithm.png) If you keep this flow in mind, it will be clear which part of the algorithm each of the following explanations refers to.
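Since the figure itself is not reproduced here, below is a rough comment-only sketch of that flow, restating the steps that are listed later in this article (it is a summary, not runnable training code):

python

# Rough flow of the DQN algorithm used in this article:
#   Initialize Replay Memory D
#   Initialize the Q-Network Q with random weights θ, and the Target Network with θ^ = θ
#   for episode = 1 .. M:
#       create the initial state s1 from the initial screen
#       while not terminal:
#           select an action (random with probability ε, otherwise ai = argmax_a Q(si, a; θ))
#           execute it and observe the reward ri, the next screen, and the terminal flag
#           save the transition in D
#           sample a random minibatch from D and take one gradient step on (yi - Q(si, ai; θ))^2,
#               where yi = ri + γ max_a Q(si+1, a; θ)
#           periodically reset the Target Network: Q^ = Q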

Othello game specifications

The board used for the Othello game and for AI training is an 8x8 two-dimensional array whose cells are numbered as shown in the figure (screen.png).

Reversi.py


self.screen[0~7][0~7]

The actions the AI can select are the numbers 0 to 63 shown in the figure above.

Reversi.py


self.enable_actions[0~63]
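For example, assuming the cells are numbered row by row from 0 to 63 as in the figure, an action number converts to board coordinates as follows. The helper functions here are only for illustration and do not exist in Reversi.py:

python

# Illustrative helpers for the assumed row-major numbering of the 8x8 board
def action_to_cell(action):
    # action number 0-63 -> (row, col) index into screen[0~7][0~7]
    return action // 8, action % 8

def cell_to_action(row, col):
    # (row, col) -> action number 0-63
    return row * 8 + col

print(action_to_cell(26))    # (3, 2)
print(cell_to_action(3, 2))  # 26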

AI training

In AI training, players[0] and players[1] play Othello against each other n_epochs = 1000 times, and at the end the AI of the second player, players[1], is saved.

Reward for AI

- If the AI wins the game, reward = 1
- Otherwise, reward = 0
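As a minimal sketch, this reward rule boils down to the following check (it mirrors the win/end test that appears in train.py below; the function name is only for illustration):

python

# Illustrative only: reward is 1 only for the winner, and only once the game has ended
def compute_reward(game_ended, winner, player_id):
    if game_ended and winner == player_id:
        return 1
    return 0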

Training method

Two AIs play against each other, but if each AI only acts on its own turn, the chain of states leading to the end of the game is broken and the Q value does not propagate back.

So both AIs act on every turn. In addition, separately from the actual progress of the game, I decided to "save the transition in D" for every position that can be played this turn.

train.py



# targets contains every position that can be played this turn
for tr in targets:
    # Copy the current game state
    tmp = copy.deepcopy(env)
    # Perform the action
    tmp.update(tr, playerID[i])
    # Check whether the game has ended
    win = tmp.winner()
    end = tmp.isEnd()
    # Board after the action
    state_X = tmp.screen
    # Positions that can be played after the action
    target_X = tmp.get_enables(playerID[i+1])

    # Both players store the transition
    for j in range(0, len(players)):
        reward = 0
        if end == True:
            if win == playerID[j]:
                # Reward 1 for winning the game
                reward = 1
        # Both players "save the transition in D"
        players[j].store_experience(state, targets, tr, reward, state_X, target_X, end)
        players[j].experience_replay()

The following parts of the DQN algorithm are handled by dqn_agent.py:

- Save the transition (si, ai, ri, si+1, terminal) in D
- Sample a random minibatch of transitions (si, ai, ri, si+1, terminal) from D
- Teacher signal: yi = ri + γ max_a Q(si+1, a; θ)
- For the Q-Network parameters θ, run the gradient method on (yi - Q(si, ai; θ))^2
- Periodically reset the Target Network: Q^ = Q

I do not really understand why this works, so this part is copied almost as-is from the site I referenced.

dqn_agent.py


    def store_experience(self, state, targets, action, reward, state_1, targets_1, terminal):
        self.D.append((state, targets, action, reward, state_1, targets_1, terminal))

    def experience_replay(self):
        state_minibatch = []
        y_minibatch = []

        # sample random minibatch
        minibatch_size = min(len(self.D), self.minibatch_size)
        minibatch_indexes = np.random.randint(0, len(self.D), minibatch_size)

        for j in minibatch_indexes:
            state_j, targets_j, action_j, reward_j, state_j_1, targets_j_1, terminal = self.D[j]
            action_j_index = self.enable_actions.index(action_j)

            y_j = self.Q_values(state_j)

            if terminal:
                y_j[action_j_index] = reward_j
            else:
                # reward_j + gamma * max_action' Q(state', action')
                qvalue, action = self.select_enable_action(state_j_1, targets_j_1)
                y_j[action_j_index] = reward_j + self.discount_factor * qvalue

            state_minibatch.append(state_j)
            y_minibatch.append(y_j)

        # training
        self.sess.run(self.training, feed_dict={self.x: state_minibatch, self.y_: y_minibatch})

        # for log
        self.current_loss = self.sess.run(self.loss, feed_dict={self.x: state_minibatch, self.y_: y_minibatch})
The variables mean the following:

- state --- the board ( = Reversi.screen[0~7][0~7] )
- targets --- positions where a piece can be placed
- action --- the selected action
- reward --- reward for the action (0 or 1)
- state_1 --- the board after the action
- targets_1 --- positions where a piece can be placed after the action
- terminal --- True when the game has ended
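As a concrete (made-up) example, one entry appended to the replay memory D is just a 7-tuple of these values; the boards and position numbers below are dummies:

python

import numpy as np

# A dummy transition in the same format as store_experience() receives
state = np.zeros((8, 8))     # board before the action
state_1 = np.zeros((8, 8))   # board after the action
transition = (state,             # state
              [19, 26, 37, 44],  # targets: playable positions this turn
              26,                # action: the position actually simulated
              0,                 # reward
              state_1,           # state_1
              [20, 29, 34],      # targets_1: playable positions after the action
              False)             # terminal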

Implementation

As described above, players[0] and players[1] play Othello against each other n_epochs = 1000 times, and at the end the AI of the second player, players[1], is saved.

train.py


    # parameters
    n_epochs = 1000
    # environment, agent
    env = Reversi()

    # playerID
    playerID = [env.Black, env.White, env.Black]

    # player agent
    players = []
    # players[0] = env.Black
    players.append(DQNAgent(env.enable_actions, env.name, env.screen_n_rows, env.screen_n_cols))
    # players[1] = env.White
    players.append(DQNAgent(env.enable_actions, env.name, env.screen_n_rows, env.screen_n_cols))

This DQNAgent(env.enable_actions, env.name, env.screen_n_rows, env.screen_n_cols) part corresponds to

- Initialize Replay Memory D
- Initialize the Q-Network Q with random weights θ
- Initialize the Target Network Q^ with θ^ = θ

and is handled by dqn_agent.py.

dqn_agent.py


class DQNAgent:

    def __init__(self, enable_actions, environment_name, rows, cols):
        ...abridgement...
        # Initialize Replay Memory D
        self.D = deque(maxlen=self.replay_memory_size)
        ...abridgement...

    def init_model(self):
        # input layer (rows x cols)
        self.x = tf.placeholder(tf.float32, [None, self.rows, self.cols])

        # flatten (rows x cols)
        size = self.rows * self.cols
        x_flat = tf.reshape(self.x, [-1, size])

        # Initialize Q-Network Q with random weights θ
        W_fc1 = tf.Variable(tf.truncated_normal([size, size], stddev=0.01))
        b_fc1 = tf.Variable(tf.zeros([size]))
        h_fc1 = tf.nn.relu(tf.matmul(x_flat, W_fc1) + b_fc1)

        # Initialize Target Network Q^ with θ^ = θ
        W_out = tf.Variable(tf.truncated_normal([size, self.n_actions], stddev=0.01))
        b_out = tf.Variable(tf.zeros([self.n_actions]))
        self.y = tf.matmul(h_fc1, W_out) + b_out

        # loss function
        self.y_ = tf.placeholder(tf.float32, [None, self.n_actions])
        self.loss = tf.reduce_mean(tf.square(self.y_ - self.y))

        # train operation
        optimizer = tf.train.RMSPropOptimizer(self.learning_rate)
        self.training = optimizer.minimize(self.loss)

        # saver
        self.saver = tf.train.Saver()

        # session
        self.sess = tf.Session()
        self.sess.run(tf.initialize_all_variables())
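Just to clarify the shape of this network, here is a rough equivalent sketched with tf.keras. It only illustrates the architecture (the 8x8 board flattened to 64 inputs, one ReLU hidden layer of 64 units, and 64 Q-value outputs) and is not the code used in this article:

python

import tensorflow as tf

# Equivalent shape of the network above, sketched with tf.keras (illustration only)
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(8, 8)),   # board -> 64 inputs
    tf.keras.layers.Dense(64, activation='relu'),  # hidden layer (size = rows * cols)
    tf.keras.layers.Dense(64)                      # one Q value per action (0-63)
])
model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='mse')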

python


    for e in range(n_epochs):
        # reset
        env.reset()
        terminal = False
This corresponds to the algorithm steps:

- for episode = 1, M do
- Create the initial screen x1 and preprocess it into the initial state s1

python


        while terminal == False: #Loop until the end of one episode

            for i in range(0, len(players)): 
                
                state = env.screen
                targets = env.get_enables(playerID[i])
                
                if len(targets) > 0:
                    # If there is at least one position where a piece can be placed

                    # <- Here, the "save the transition in D" loop shown above runs for every playable position

                    #Choose an action
                    action = players[i].select_action(state, targets, players[i].exploration)
                    #Take action
                    env.update(action, playerID[i])
These lines correspond to the "while not terminal" loop and the action-selection step of the algorithm.

The action selection `agent.select_action(state_t, targets, agent.exploration)` is handled by dqn_agent.py.

- Action selection
- Random action ai
- or ai = argmax_a Q(si, a; θ)

dqn_agent.py


    def Q_values(self, state):
        # Q(state, action) of all actions
        return self.sess.run(self.y, feed_dict={self.x: [state]})[0]

    def select_action(self, state, targets, epsilon):
        if np.random.rand() <= epsilon:
            # random action
            return np.random.choice(targets)
        else:
            # max_action Q(state, action)
            qvalue, action = self.select_enable_action(state, targets)
            return action

    # From the board (state) and the playable positions (targets),
    # return the highest Q value and the position that attains it
    def select_enable_action(self, state, targets):
        Qs = self.Q_values(state)
        # Sort action indices by Q value (ascending) and walk down from the highest
        index = np.argsort(Qs)
        for action in reversed(index):
            if action in targets:
                break
        # max_action Q(state, action)
        qvalue = Qs[action]

        return qvalue, action

- Execute action ai and observe the reward ri, the next screen xi+1, and the end-of-game flag terminal
- Preprocess it to create the next state si+1

Finally, save the AI of the second player



                # Result of performing the action
                terminal = env.isEnd()

        w = env.winner()
        print("EPOCH: {:03d}/{:03d} | WIN: player{:1d}".format(
            e, n_epochs, w))


    # Save the AI of the second player, players[1]
    players[1].save_model()

The source is here. $ git clone https://github.com/sasaco/tf-dqn-reversi.git

Next time, I will cover the Battle Edition.
