[PYTHON] DQN with Chainer. I tried various reinforcement learning methods on tic-tac-toe. (Deep Q Network, Q-Learning, Monte Carlo)

This is my first Qiita post. I work at a company called Ridge-i, where we mainly do machine learning consulting and development.

I had an opportunity to teach reinforcement learning, so using tic-tac-toe as the subject I implemented

--Monte Carlo
--Q-Learning
--Deep Q Network (DQN)

in Jupyter (IPython) and turned it into teaching material.

By the way, between two strong players every game should end in a draw. (The famous lesson from WarGames.)

The conclusions, in short:

--Monte Carlo: easy to implement. With 100 playouts per move it almost never loses (with around 50 it sometimes does).
--Q-Learning: be careful with the design of the update. After 100,000 games it is the strongest at tic-tac-toe.
--Deep Q Network: many pitfalls. I don't know the optimal topology. It was the weakest until I switched to Leaky ReLU. If you teach it the wrong thing, it learns nothing. And so on.

Implementing everything up to Q-Learning took only 1-2 hours, but getting DQN to learn properly took about 6 hours on its own. (Tears)

For learning efficiency it would be better to exploit the board's symmetries (flips and rotations), but this time I prioritized clarity. Also, I didn't properly design the interfaces or hide variables, so encapsulation gets violated here and there. Please overlook that.

The code is on GitHub: https://github.com/narisan25/TTT-RL

If you would like to know the theoretical details, please contact me separately. Corrections, improvements, and advice are all very welcome.

Tic-tac-toe design

This is a rough design. There are many points that could be improved, but this time I focused on clarity and on getting something moving quickly.

The board

The board is an array of 9 cells (indices 0-8): 1 for X, 0 for an empty cell, -1 for O.

EMPTY=0
PLAYER_X=1
PLAYER_O=-1
MARKS={PLAYER_X:"X",PLAYER_O:"O",EMPTY:" "}
DRAW=2

class TTTBoard:
   
    def __init__(self,board=None):
        if board is None:
            self.board = []
            for i in range(9):self.board.append(EMPTY)
        else:
            self.board=board
        self.winner=None
    
    def get_possible_pos(self):
        pos=[]
        for i in range(9):
            if self.board[i]==EMPTY:
                pos.append(i)
        return pos
    
    def print_board(self):
        tempboard=[]
        for i in self.board:
            tempboard.append(MARKS[i])
        row = ' {} | {} | {} '
        hr = '\n-----------\n'
        print((row + hr + row + hr + row).format(*tempboard))
               

                    
    def check_winner(self):
        win_cond = ((1,2,3),(4,5,6),(7,8,9),(1,4,7),(2,5,8),(3,6,9),(1,5,9),(3,5,7))
        for each in win_cond:
            if self.board[each[0]-1] == self.board[each[1]-1]  == self.board[each[2]-1]:
                if self.board[each[0]-1]!=EMPTY:
                    self.winner=self.board[each[0]-1]
                    return self.winner
        return None
    
    def check_draw(self):
        if len(self.get_possible_pos())==0 and self.winner is None:
            self.winner=DRAW
            return DRAW
        return None
    
    def move(self,pos,player):
        if self.board[pos]== EMPTY:
            self.board[pos]=player
        else:
            self.winner=-1*player
        self.check_winner()
        self.check_draw()
    
    def clone(self):
        return TTTBoard(self.board.copy())
    

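Here is a minimal usage sketch of the board class above (the move sequence is just an illustration):

b=TTTBoard()
b.move(0,PLAYER_X)
b.move(4,PLAYER_O)
b.move(1,PLAYER_X)
b.move(8,PLAYER_O)
b.move(2,PLAYER_X)   # X completes the top row (cells 0,1,2)
b.print_board()
print(b.winner==PLAYER_X)   # True
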
Game facilitator

The organizer follows an Observer + Mediator pattern. It manages turns, wins and losses, display, and so on, while keeping the players from touching the board directly.

class TTT_GameOrganizer:

    act_turn=0
    winner=None
    
    def __init__(self,px,po,nplay=1,showBoard=True,showResult=True,stat=100):
        self.player_x=px
        self.player_o=po
        self.nwon={px.myturn:0,po.myturn:0,DRAW:0}
        self.nplay=nplay
        self.players=(self.player_x,self.player_o)
        self.board=None
        self.disp=showBoard
        self.showResult=showResult
        self.player_turn=self.players[random.randrange(2)]
        self.nplayed=0
        self.stat=stat
    
    def progress(self):
        while self.nplayed<self.nplay:
            self.board=TTTBoard()
            while self.board.winner==None:
                if self.disp:print("Turn is "+self.player_turn.name)
                act=self.player_turn.act(self.board)
                self.board.move(act,self.player_turn.myturn)
                if self.disp:self.board.print_board()
               
                if self.board.winner != None:
                    # notice every player that game ends
                    for i in self.players:
                        i.getGameResult(self.board) 
                    if self.board.winner == DRAW:
                        if self.showResult:print ("Draw Game")
                    elif self.board.winner == self.player_turn.myturn:
                        out = "Winner : " + self.player_turn.name
                        if self.showResult: print(out)
                    else:
                        print ("Invalid Move!")
                    self.nwon[self.board.winner]+=1
                else:
                    self.switch_player()
                    #Notice other player that the game is going
                    self.player_turn.getGameResult(self.board)

            self.nplayed+=1
            if self.nplayed%self.stat==0 or self.nplayed==self.nplay:
                print(self.player_x.name+":"+str(self.nwon[self.player_x.myturn])
                      +","+self.player_o.name+":"+str(self.nwon[self.player_o.myturn])
                      +",DRAW:"+str(self.nwon[DRAW]))

            
    def switch_player(self):
        if self.player_turn == self.player_x:
            self.player_turn=self.player_o
        else:
            self.player_turn=self.player_x

Create various players

Random

A simple player that just puts its mark on a random available square.

import random
             

class PlayerRandom:
    def __init__(self,turn):
        self.name="Random"
        self.myturn=turn
        
    def act(self,board):
        acts=board.get_possible_pos()
        i=random.randrange(len(acts))
        return acts[i]
    
    
    def getGameResult(self,board):
        pass

Human

class PlayerHuman:
    def __init__(self,turn):
        self.name="Human"
        self.myturn=turn
        
    def act(self,board):
        valid = False
        while not valid:
            try:
                act = input("Where would you like to place " + str(self.myturn) + " (1-9)? ")
                act = int(act)
                #if act >= 1 and act <= 9 and board.board[act-1]==EMPTY:
                if act >= 1 and act <= 9:
                    valid=True
                    return act-1
                else:
                    print ("That is not a valid move! Please try again.")
            except Exception as e:
                    print (str(act) + " is not a valid move! Please try again.")
        return act
    
    def getGameResult(self,board):
        if board.winner is not None and board.winner!=self.myturn and board.winner!=DRAW:
            print("I lost...")

First, let's have a human play against Random. It would be embarrassing to lose.

def Human_vs_Random():
    
    p1=PlayerHuman(PLAYER_X)
    p2=PlayerRandom(PLAYER_O)
    game=TTT_GameOrganizer(p1,p2)
    game.progress()

Execution result

Human_vs_Random()
Turn is Random
   |   |   
-----------
   |   | O 
-----------
   |   |   
Turn is Human
Where would you like to place 1 (1-9)? 1
 X |   |   
-----------
   |   | O 
-----------
   |   |   
Turn is Random
 X |   |   
-----------
 O |   | O 
-----------
   |   |   
Turn is Human
Where would you like to place 1 (1-9)? 2
 X | X |   
-----------
 O |   | O 
-----------
   |   |   
Turn is Random
 X | X |   
-----------
 O | O | O 
-----------
   |   |   
I lost...
Winner : Random
Human:0,Random:1,DRAW:0

What happened to be left in the log was an example of losing through carelessness.

Improved Random

Pure random is boring, so if there is a square that wins on the next move, this player takes it. (Otherwise it plays randomly.)

class PlayerAlphaRandom:
    
    
    def __init__(self,turn,name="AlphaRandom"):
        self.name=name
        self.myturn=turn
        
    def getGameResult(self,board):
        pass
        
    def act(self,board):
        acts=board.get_possible_pos()
        #see only next winnable act
        for act in acts:
            tempboard=board.clone()
            tempboard.move(act,self.myturn)
            # check if win
            if tempboard.winner==self.myturn:
                #print ("Check mate")
                return act
        i=random.randrange(len(acts))
        return acts[i]

Monte Carlo

Now we enter reinforcement learning territory. With Monte Carlo, for each candidate move we play the game out to the end a fixed number of times under a fixed policy, and pick a good move by looking at the resulting win rate.

The important point about Monte Carlo is that its strength depends on that "fixed number of playouts" and on the "fixed policy".

For the policy I used the improved random player from above. As for the number of playouts, once it exceeds about 100 per move it seems to stop making mistakes; at around 50 it sometimes loses, depending on the random numbers.

From here you could feed the playout results into dynamic programming, or develop this into Monte Carlo Tree Search (MCTS), which combines strong and weak policies, but I won't cover those this time.

class PlayerMC:
    def __init__(self,turn,name="MC"):
        self.name=name
        self.myturn=turn
    
        
    def win_or_rand(self,board,turn):
        acts=board.get_possible_pos()
        #see only next winnable act
        for act in acts:
            tempboard=board.clone()
            tempboard.move(act,turn)
            # check if win
            if tempboard.winner==turn:
                return act
        i=random.randrange(len(acts))
        return acts[i]
           
    def trial(self,score,board,act):
        tempboard=board.clone()
        tempboard.move(act,self.myturn)
        tempturn=self.myturn
        while tempboard.winner is None:
            tempturn=tempturn*-1
            tempboard.move(self.win_or_rand(tempboard,tempturn),tempturn)
        
        if tempboard.winner==self.myturn:
            score[act]+=1
        elif tempboard.winner==DRAW:
            pass
        else:
            score[act]-=1

        
    def getGameResult(self,board):
        pass
        
    
    def act(self,board):
        acts=board.get_possible_pos()
        scores={}
        n=50
        for act in acts:
            scores[act]=0
            for i in range(n):
                #print("Try"+str(i))
                self.trial(scores,board,act)
            
            #print(scores)
            scores[act]/=n
        
        max_score=max(scores.values())
        for act, v in scores.items():
            if v == max_score:
                #print(str(act)+"="+str(v))
                return act

Let two Monte Carlo players fight each other

p1=PlayerMC(PLAYER_X,"M1")
p2=PlayerMC(PLAYER_O,"M2")
game=TTT_GameOrganizer(p1,p2,10,False)
game.progress()

Draw Game
Winner : M2
Draw Game
Draw Game
Draw Game
Draw Game
Draw Game
Winner : M2
Draw Game
Draw Game
M1:0,M2:2,DRAW:8

Almost all draws. If you feel like it, try 100 playouts per move; I think every game will end in a draw.

Q-Learning

When people say reinforcement learning, they usually mean Q-Learning. The reward is 0 during the game, +1 for a win, -1 for a loss, and 0 for a draw. Based on that, we update a dictionary of Q(s, a) values with the following update rule.

Q(s,a) = Q(s,a) + alpha * (reward + gamma * max(Q(s',a')) - Q(s,a))
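
As a worked one-step example with the values used below (alpha = 0.3, gamma = 0.9): starting from the optimistic initial value Q(s,a) = 1, a mid-game transition with reward 0 whose best next value is max(Q(s',a')) = 1 gives

Q(s,a) = 1 + 0.3 * (0 + 0.9 * 1 - 1) = 0.97

while a losing terminal move (reward -1, with max(Q) treated as 0 at the end of the game, as in learn() below) gives

Q(s,a) = 1 + 0.3 * (-1 + 0 - 1) = 0.4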

The important thing is to mix in random moves, following the ε-greedy idea; otherwise you quickly fall into a local optimum. The initial Q value also matters: if you don't make it optimistic, the agent soon stops bothering to explore.
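
A minimal sketch of just the ε-greedy choice, where q is the Q dictionary, state the board tuple, and acts the legal moves (illustrative names; the PlayerQL class below folds this, plus tie-breaking, into policy()):

if random.random() < epsilon:
    act=acts[random.randrange(len(acts))]            # explore: random move
else:
    act=max(acts,key=lambda a:q.get((state,a),1))    # exploit: best known move, optimistic default 1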

Surprisingly easy pitfalls are how to handle max(Q) at the end of the game, and forgetting to do the updates during the game as well.

class PlayerQL:
    def __init__(self,turn,name="QL",e=0.2,alpha=0.3):
        self.name=name
        self.myturn=turn
        self.q={} #set of s,a
        self.e=e
        self.alpha=alpha
        self.gamma=0.9
        self.last_move=None
        self.last_board=None
        self.totalgamecount=0
        
    
    def policy(self,board):
        self.last_board=board.clone()
        acts=board.get_possible_pos()
        #Explore sometimes
        if random.random() < (self.e/(self.totalgamecount//10000+1)):
                i=random.randrange(len(acts))
                self.last_move=acts[i]  # remember the explored move as well
                return acts[i]
        qs = [self.getQ(tuple(self.last_board.board),act) for act in acts]
        maxQ= max(qs)

        if qs.count(maxQ) > 1:
            # more than 1 best option; choose among them randomly
            best_options = [i for i in range(len(acts)) if qs[i] == maxQ]
            i = random.choice(best_options)
        else:
            i = qs.index(maxQ)

        self.last_move = acts[i]
        return acts[i]
    
    def getQ(self, state, act):
        # encourage exploration; "optimistic" 1.0 initial values
        if self.q.get((state, act)) is None:
            self.q[(state, act)] = 1
        return self.q.get((state, act))
    
    def getGameResult(self,board):
        r=0
        if self.last_move is not None:
            if board.winner is None:
                self.learn(self.last_board,self.last_move, 0, board)
                pass
            else:
                if board.winner == self.myturn:
                    self.learn(self.last_board,self.last_move, 1, board)
                elif board.winner !=DRAW:
                    self.learn(self.last_board,self.last_move, -1, board)
                else:
                    self.learn(self.last_board,self.last_move, 0, board)
                self.totalgamecount+=1
                self.last_move=None
                self.last_board=None

    def learn(self,s,a,r,fs):
        pQ=self.getQ(tuple(s.board),a)
        if fs.winner is not None:
            maxQnew=0
        else:
            maxQnew=max([self.getQ(tuple(fs.board),act) for act in fs.get_possible_pos()])
        self.q[(tuple(s.board),a)]=pQ+self.alpha*((r+self.gamma*maxQnew)-pQ)
        #print (str(s.board)+"with "+str(a)+" is updated from "+str(pQ)+" refs MAXQ="+str(maxQnew)+":"+str(r))
        #print(self.q)

    
    def act(self,board):
        return self.policy(board)

Let two Q-Learning players fight each other 100,000 times.

pQ=PlayerQL(PLAYER_O,"QL1")
p2=PlayerQL(PLAYER_X,"QL2")
game=TTT_GameOrganizer(pQ,p2,100000,False,False,10000)
game.progress()

QL1:4371,QL2:4436,DRAW:1193
QL1:8328,QL2:8456,DRAW:3216
QL1:11903,QL2:11952,DRAW:6145
QL1:14268,QL2:14221,DRAW:11511
QL1:15221,QL2:15099,DRAW:19680
QL1:15730,QL2:15667,DRAW:28603
QL1:16136,QL2:16090,DRAW:37774
QL1:16489,QL2:16439,DRAW:47072
QL1:16832,QL2:16791,DRAW:56377
QL1:17128,QL2:17121,DRAW:65751

It has almost converged to draws. The wins and losses that remain come from the ε random moves.

Now let it play against Monte Carlo

Set ε to zero to remove the random element, so it always plays its best move.

pQ.e=0
p2=PlayerMC(PLAYER_X,"M1")
game=TTT_GameOrganizer(pQ,p2,100,False,False,10)
game.progress()
QL1:1,M1:0,DRAW:9
QL1:1,M1:0,DRAW:19
QL1:2,M1:0,DRAW:28
QL1:2,M1:0,DRAW:38
QL1:3,M1:0,DRAW:47
QL1:4,M1:0,DRAW:56
QL1:5,M1:0,DRAW:65
QL1:5,M1:0,DRAW:75
QL1:6,M1:0,DRAW:84
QL1:6,M1:0,DRAW:94

Monte Carlo now loses from time to time; it just can't escape the whims of its random numbers. Incidentally, if you play against this Q-Learning player yourself in this state, it is quite painful.

DQN (Deep Q Network)

Now for the main subject: DQN. The bottleneck of Q-Learning is that the Q(s, a) table becomes huge. Tic-tac-toe is fine, but for something like Go, both s and a would require an enormous amount of memory. With a Q-Network, the network takes s, outputs the expected reward for every a, and we take the argmax. With a Deep Q Network, the dream is "it can discover winning features on its own!?", and I got quite excited about it.

... But I got stuck.

It didn't get any stronger at all. On top of that, there are too many hyperparameters. How many hidden layers should there be? ReLU? Is mean squared error fine for the loss? Adam or SGD for the optimizer? Honestly, I wanted a grid search. But since there is no teacher signal in the first place, there is no loss you can measure against; you can only find out how good it is by making it play opponents.

Furthermore, even when I thought it was training, the chain of Variable links had been cut and it wasn't learning anything at all. I got really stuck there. On top of that, everything was slow because I was trying things out on a machine without a GPU... Without a GPU, each training run takes about 5-10 minutes.
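
For reference, the pattern that eventually worked (and which learn() below follows) is to build the TD target from plain numpy values via .data, so that no gradient flows through the target, and then to backpropagate only through a fresh forward pass on the current state. A rough sketch with illustrative names (x_now, x_next, action, reward, gamma, model, optimizer):

# target side: numpy only, detached from the computational graph
maxQnew=np.max(model(x_next).data[0])     # .data drops the Variable link
target=model(x_now).data[0].copy()
target[action]=reward+gamma*maxQnew

# loss side: fresh forward pass vs. the fixed numpy target
model.zerograds()
loss=F.mean_squared_error(model(x_now),np.array([target],dtype=np.float32))
loss.backward()
optimizer.update()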

import chainer

from chainer import Function, gradient_check, Variable, optimizers, serializers, utils
import chainer.functions as F
import chainer.links as L
import numpy as np
from chainer import computational_graph as c

# Network definition
class MLP(chainer.Chain):

    def __init__(self, n_in, n_units, n_out):
        super(MLP, self).__init__(
            l1=L.Linear(n_in, n_units),  # first layer
            l2=L.Linear(n_units, n_units),  # second layer
            l3=L.Linear(n_units, n_units),  # Third layer
            l4=L.Linear(n_units, n_out),  # output layer
        )

    def __call__(self, x, t=None, train=False):
        h = F.leaky_relu(self.l1(x))
        h = F.leaky_relu(self.l2(h))
        h = F.leaky_relu(self.l3(h))
        h = self.l4(h)

        if train:
            return F.mean_squared_error(h,t)
        else:
            return h

    def get(self,x):
        # input x as float, output float (helper, not used below)
        return self(Variable(np.array([x]).astype(np.float32).reshape(1,1))).data[0][0]


class DQNPlayer:
    def __init__(self, turn,name="DQN",e=1,dispPred=False):
        self.name=name
        self.myturn=turn
        self.model = MLP(9, 162,9)
        self.optimizer = optimizers.SGD()
        self.optimizer.setup(self.model)
        self.e=e
        self.gamma=0.95
        self.dispPred=dispPred
        self.last_move=None
        self.last_board=None
        self.last_pred=None
        self.totalgamecount=0
        self.rwin,self.rlose,self.rdraw,self.rmiss=1,-1,0,-1.5
        
    
    def act(self,board):
        
        self.last_board=board.clone()
        x=np.array([board.board],dtype=np.float32).astype(np.float32)
        
        pred=self.model(x)
        if self.dispPred:print(pred.data)
        self.last_pred=pred.data[0,:]
        act=int(np.argmax(pred.data,axis=1)[0])
        if self.e > 0.2: #decrement epsilon over time
            self.e -= 1/(20000)
        if random.random() < self.e:
            acts=board.get_possible_pos()
            i=random.randrange(len(acts))
            act=acts[i]
        i=0
        while board.board[act]!=EMPTY:
            #print("Wrong Act "+str(board.board)+" with "+str(act))
            self.learn(self.last_board,act, -1, self.last_board)
            x=np.array([board.board],dtype=np.float32).astype(np.float32)
            pred=self.model(x)
            #print(pred.data)
            act=int(np.argmax(pred.data,axis=1)[0])
            i+=1
            if i>10:
                print("Exceed Pos Find"+str(board.board)+" with "+str(act))
                acts=self.last_board.get_possible_pos()
                act=acts[random.randrange(len(acts))]
            
        self.last_move=act
        #self.last_pred=pred.data[0,:]
        return act
    
    def getGameResult(self,board):
        r=0
        if self.last_move is not None:
            if board.winner is None:
                self.learn(self.last_board,self.last_move, 0, board)
                pass
            else:
                if board.board== self.last_board.board:            
                    self.learn(self.last_board,self.last_move, self.rmiss, board)
                elif board.winner == self.myturn:
                    self.learn(self.last_board,self.last_move, self.rwin, board)
                elif board.winner !=DRAW:
                    self.learn(self.last_board,self.last_move, self.rlose, board)
                else:                    #DRAW
                    self.learn(self.last_board,self.last_move, self.rdraw, board)
                self.totalgamecount+=1
                self.last_move=None
                self.last_board=None
                self.last_pred=None

    def learn(self,s,a,r,fs):
        if fs.winner is not None:
            maxQnew=0
        else:
            x=np.array([fs.board],dtype=np.float32).astype(np.float32)
            maxQnew=np.max(self.model(x).data[0])
        update=r+self.gamma*maxQnew
        #print(('Prev Board:{} ,ACT:{}, Next Board:{}, Get Reward {}, Update {}').format(s.board,a,fs.board,r,update))
        #print(('PREV:{}').format(self.last_pred))
        self.last_pred[a]=update
        
        x=np.array([s.board],dtype=np.float32).astype(np.float32)
        t=np.array([self.last_pred],dtype=np.float32).astype(np.float32)
        self.model.zerograds()
        loss=self.model(x,t,train=True)
        loss.backward()
        self.optimizer.update()
        #print(('Updated:{}').format(self.model(x).data))
        #print (str(s.board)+"with "+str(a)+" is updated from "+str(pQ)+" refs MAXQ="+str(maxQnew)+":"+str(r))
        #print(self.q)

The source is long... There are traces of trial and error in it, and I haven't cleaned it up.

Play against the improved random player

With this, it first learns how to place its marks and how to win.

pDQ=DQNPlayer(PLAYER_X)
p2=PlayerAlphaRandom(PLAYER_O)
game=TTT_GameOrganizer(pDQ,p2,20000,False,False,1000)
game.progress()
DQN:206,AlphaRandom:727,DRAW:67
DQN:468,AlphaRandom:1406,DRAW:126
DQN:861,AlphaRandom:1959,DRAW:180
DQN:1458,AlphaRandom:2315,DRAW:227
DQN:2185,AlphaRandom:2560,DRAW:255
DQN:3022,AlphaRandom:2704,DRAW:274
DQN:3832,AlphaRandom:2856,DRAW:312
DQN:4632,AlphaRandom:3023,DRAW:345
DQN:5481,AlphaRandom:3153,DRAW:366
DQN:6326,AlphaRandom:3280,DRAW:394
DQN:7181,AlphaRandom:3400,DRAW:419
DQN:8032,AlphaRandom:3522,DRAW:446
DQN:8902,AlphaRandom:3618,DRAW:480
DQN:9791,AlphaRandom:3705,DRAW:504
DQN:10673,AlphaRandom:3793,DRAW:534
DQN:11545,AlphaRandom:3893,DRAW:562
DQN:12420,AlphaRandom:3986,DRAW:594
DQN:13300,AlphaRandom:4074,DRAW:626
DQN:14183,AlphaRandom:4158,DRAW:659
DQN:15058,AlphaRandom:4246,DRAW:696

It got a lot stronger. It looks like it can at least beat the improved random player.

A match against our trusty Q-Learning player

pDQ.e=1
game=TTT_GameOrganizer(pDQ,pQ,30000,False,False,1000)
game.progress()

DQN:4,QL1:436,DRAW:560
DQN:6,QL1:790,DRAW:1204
DQN:6,QL1:1135,DRAW:1859
DQN:6,QL1:1472,DRAW:2522
DQN:6,QL1:1801,DRAW:3193
DQN:6,QL1:2123,DRAW:3871
DQN:6,QL1:2439,DRAW:4555
DQN:6,QL1:2777,DRAW:5217
DQN:7,QL1:3128,DRAW:5865
DQN:9,QL1:3648,DRAW:6343
DQN:13,QL1:4132,DRAW:6855
DQN:13,QL1:4606,DRAW:7381
DQN:13,QL1:5087,DRAW:7900
DQN:14,QL1:5536,DRAW:8450
DQN:14,QL1:6011,DRAW:8975
DQN:14,QL1:6496,DRAW:9490
DQN:14,QL1:7394,DRAW:9592
DQN:14,QL1:7925,DRAW:10061
DQN:16,QL1:8357,DRAW:10627
DQN:16,QL1:8777,DRAW:11207

Has it finally gotten stronger...? It's a borderline kind of strength. In this log it still loses a lot, but after 100,000 games it settled into mostly draws.

Finally, a match against a human, with ε set to zero.

pDQ.e=0
p2=PlayerHuman(PLAYER_O)
game=TTT_GameOrganizer(pDQ,p2,2)
game.progress()
Turn is Human
Where would you like to place -1 (1-9)? 1
//anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:69: VisibleDeprecationWarning: converting an array with ndim > 0 to an index will result in an error in the future

 O |   |   
-----------
   |   |   
-----------
   |   |   
Turn is DQN
 O |   |   
-----------
   | X |   
-----------
   |   |   
Turn is Human
Where would you like to place -1 (1-9)? 2
 O | O |   
-----------
   | X |   
-----------
   |   |   
Turn is DQN
 O | O | X 
-----------
   | X |   
-----------
   |   |   
Turn is Human
Where would you like to place -1 (1-9)? 7
 O | O | X 
-----------
   | X |   
-----------
 O |   |   
Turn is DQN
 O | O | X 
-----------
 X | X |   
-----------
 O |   |   
Turn is Human
Where would you like to place -1 (1-9)? 6
 O | O | X 
-----------
 X | X | O 
-----------
 O |   |   
Turn is DQN
 O | O | X 
-----------
 X | X | O 
-----------
 O |   | X 
Turn is Human
Where would you like to place -1 (1-9)? 7
 O | O | X 
-----------
 X | X | O 
-----------
 O |   | X 
I lost...
Invalid Move!
Turn is Human
Where would you like to place -1 (1-9)? 2
   | O |   
-----------
   |   |   
-----------
   |   |   
Turn is DQN
   | O |   
-----------
   | X |   
-----------
   |   |   
Turn is Human
Where would you like to place -1 (1-9)? 1
 O | O |   
-----------
   | X |   
-----------
   |   |   
Turn is DQN
 O | O | X 
-----------
   | X |   
-----------
   |   |   
Turn is Human
Where would you like to place -1 (1-9)? 7
 O | O | X 
-----------
   | X |   
-----------
 O |   |   
Turn is DQN
 O | O | X 
-----------
 X | X |   
-----------
 O |   |   
Turn is Human
Where would you like to place -1 (1-9)? 6
 O | O | X 
-----------
 X | X | O 
-----------
 O |   |   
Turn is DQN
 O | O | X 
-----------
 X | X | O 
-----------
 O |   | X 
Turn is Human
Where would you like to place -1 (1-9)? 8
 O | O | X 
-----------
 X | X | O 
-----------
 O | O | X 
Draw Game
DQN:1,Human:0,DRAW:1

Well, that's the kind of moves it plays.

Conclusion

Q-Learning was fairly easy to implement, but with Deep Q Network very little could be settled without trial and error. When I first implemented it with ReLU it didn't get any stronger; it only improved once I switched to Leaky ReLU. Maybe plain ReLU couldn't represent things well because the board contains -1 values. Also, for some reason Adam didn't work while SGD did. Strange; I thought Adam was the strongest. Furthermore, if you skip the games against AlphaRandom and make it fight Q-Learning (the strong one) from the start, it just keeps losing and never gets anywhere. It seems you first need to let it play weak opponents so it has something to build on.

Also, the most difficult thing was

--how to teach DQN what it must not do

That was the crux of it.

This time, whenever it played on a square that was already occupied, I forcibly gave it a negative reward and made it learn from that (up to 10 times in a row), which worked reasonably well. Or maybe I should have let it choose only among the squares it can actually play?
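
If one wanted to try that alternative, a minimal sketch (not what the code above does) would be to mask the occupied squares before taking the argmax, for example:

pred=self.model(np.array([board.board],dtype=np.float32))
q=pred.data[0].copy()
for i in range(9):
    if board.board[i]!=EMPTY:
        q[i]=-np.inf        # never pick an occupied square
act=int(np.argmax(q))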

If you have questions or suggestions for improvement, please let me know. Advice and comments are more than welcome.

Even so, I had assumed tic-tac-toe would take about 2-3 hours, and it took far longer than expected. I am keenly aware of the huge gap between knowing something, being able to implement it (write the code), and actually making it work. I encourage you to implement it yourself rather than leaning too much on this source.

Well, maybe I'll make an Othello version when I have time... I'm debating whether the next challenge should instead be to try AlphaGo's three-network configuration properly.
