[PYTHON] Othello ~ From the tic-tac-toe of "Implementation Deep Learning" (4) [End]

This is a continuation of the articles below. This installment wraps up the Othello series.

Othello-From the tic-tac-toe of "Implementation Deep Learning" (1) http://qiita.com/Kumapapa2012/items/cb89d73782ddda618c99
Othello-From the tic-tac-toe of "Implementation Deep Learning" (2) http://qiita.com/Kumapapa2012/items/f6c654d7c789a074c69b
Othello-From the tic-tac-toe of "Implementation Deep Learning" (3) http://qiita.com/Kumapapa2012/items/3cc20a75c745dc91e826

In the previous article, using Leaky ReLU gave a stable win rate on the 6x6 Othello board, but on the 8x8 board the win rate fluctuated widely and trended downward. This time I changed the slope of the Leaky ReLU again and reran the experiment.

The code is here. ~~However, the code I have pushed is not yet the Leaky ReLU version; I will update it later (^^;~~ Now added. https://github.com/Kumapapa2012/Learning-Machine-Learning/tree/master/Reversi

Changing the Leaky ReLU slope

For the 8x8 Othello board, the result with Leaky ReLU at slope = 0.2, shown last time, was as follows. (Figure: win rate per episode, slope = 0.2)

This time, the result with slope = 0.1 is as follows. (Figure: win rate per episode, slope = 0.1)
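For reference, a minimal sketch of what the slope parameter means; only the function shape and the 0.2 / 0.1 values come from this experiment, and the NumPy implementation below is purely illustrative:

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    """Leaky ReLU: pass positive inputs through, scale negative inputs by `slope`."""
    return np.where(x > 0, x, slope * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x, slope=0.2))  # negative inputs multiplied by 0.2 (last time)
print(leaky_relu(x, slope=0.1))  # negative inputs multiplied by 0.1 (this time)
```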

With slope = 0.1 the win rate now appears to converge, but the sharp drop in win rate around episode 20,000 is still there. This drop is [not seen in the 6x6 board results](http://qiita.com/Kumapapa2012/items/3cc20a75c745dc91e826#leaky-relu-%E3%82%92%E4%BD%BF%E3%81%A3%E3%81%A6%E3%81%BF%E3%81%9F), which makes it an interesting situation.

What is causing the dip?

To be honest, you cannot tell just by looking at the win rate fluctuations, and I am not sure whether it is a learning-rate issue. For clues, I plotted the loss (the squared error between the teacher data and the network output) that is output every 5000 steps. (Figure: loss over training steps)

The loss appears to be related to the win rate, so here the two are overlaid. (Figure: loss and win rate overlaid)
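As a rough idea of how such an overlay can be produced (the file names and log format below are hypothetical placeholders, not the actual logs of this project):

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical logs: one loss value per logging interval and one win-rate value
# per evaluation point; the real project's file layout may well differ.
loss = np.loadtxt("loss_log.txt")
win_rate = np.loadtxt("win_rate_log.txt")

fig, ax1 = plt.subplots()
ax1.plot(np.arange(len(loss)), loss, color="tab:blue", label="loss")
ax1.set_xlabel("logging interval")
ax1.set_ylabel("loss")

# A second y-axis lets the two very different scales share one figure.
ax2 = ax1.twinx()
ax2.plot(np.linspace(0, len(loss), len(win_rate)), win_rate,
         color="tab:orange", label="win rate")
ax2.set_ylabel("win rate")

fig.tight_layout()
plt.show()
```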

Looking at the graph, there does seem to be a connection between increases in the loss and the sharp drop in win rate. My amateur attempt to interpret and break the plot down is as follows:

a) Around episode 16,000 (480k steps): the loss is very low, but the win rate is also low (about 50%). From here the win rate starts to rise. At this point ε of ε-greedy has already reached its minimum value of 0.001, so moves are essentially chosen by the Q values (a minimal sketch of this selection rule follows the list).
b) Around 16,000-22,000 (660k steps): the loss increased slightly as the win rate rose, and partway through, the win rate dropped sharply. With the model at this point, the more it learned, the more it lost; the model seems to have been collapsing.
c) Around 22,000-27,000 (720k steps): the loss stays at a relatively low value and the low win rate persists. Without wins there is no reward, so almost no reward was received during this period.
d) Around 27,000-30,000 (900k steps): the loss grows again. This time learning seems to go well, and the win rate rises.
e) Around 30,000-35,000 (1050k steps): once the loss drops, the win rate keeps rising. Learning appears to be going well.
f) Around 35,000-45,000 (1350k steps): the loss grows again. Last time this was where the second valley in the win rate appeared; this time, however, the win rate does not fall. Perhaps the loss is working in a positive direction here, that is, correcting the model.
g) Around 45,000-48,000 (1440k steps): the loss decreases and the win rate is stable.
h) After 48,000: the loss grows again, but the win rate shows signs of converging.
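The selection rule mentioned in a), sketched minimally; the Q-value lookup interface here is hypothetical, and only the ε floor of 0.001 comes from the article:

```python
import random

EPSILON_MIN = 0.001  # the floor reached by item a)

def select_action(q_values, legal_moves, epsilon=EPSILON_MIN):
    """epsilon-greedy: explore with probability epsilon, otherwise pick argmax Q."""
    if random.random() < epsilon:
        return random.choice(legal_moves)
    # With epsilon at its floor this branch dominates: the move is decided by Q.
    return max(legal_moves, key=lambda move: q_values[move])
```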

Growth in the loss is a sign that the model is changing, in other words that the agent is developing. If this interpretation is correct, around b) the agent was developing in the wrong direction. This time I saved every board position as text, so let us check this assumption. From the opening, let us look at the following position, which seems likely to have made the difference between winning and losing.

[[ 0  0  0  0  0  0  0  0]
 [ 0  0  0 (0) 0 (0)(0) 0]
 [ 0 (0)(0)-1 (0)-1  0  0]
 [(0)-1 -1 -1 -1  1  0  0]
 [ 0  0  0  1  1  0  0  0]
 [ 0  0  0  0  1  0  0  0]
 [ 0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0]]

This is a position where it is the agent's turn to move. Here the agent can place a piece on any of the parenthesized squares (admittedly hard to see). Searching the saved positions for this one with pcregrep found it in 622 of the 50,000 episodes [^1].
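The search itself was done with pcregrep; a rough Python equivalent, assuming a hypothetical layout in which each episode's positions are dumped to their own text file (the real project's layout may differ), could look like this:

```python
from pathlib import Path

# The position from the article with the move markers removed, i.e. as a plain
# board dump; only the three distinctive middle rows are matched here.
target = (
    " [ 0  0  0 -1  0 -1  0  0]\n"
    " [ 0 -1 -1 -1 -1  1  0  0]\n"
    " [ 0  0  0  1  1  0  0  0]"
)

# Hypothetical file layout: episodes/00000.txt, episodes/00001.txt, ...
hits = [path.stem for path in sorted(Path("episodes").glob("*.txt"))
        if target in path.read_text()]
print(len(hits), "episodes contain this position")
```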

Between episodes 20,000 and 30,000, the valley in the win rate, this position appeared five times. The word in parentheses is whether the agent won or lost that episode:

21974 (win), 22078 (lose), 22415 (lose), 29418 (lose), 29955 (win)

In the position above, the four episodes other than 29955 played the following move. Call this move A.

[[ 0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0]
 [ 0  0  0 -1  0 -1  0  0]
 [(1) 1  1  1  1  1  0  0]
 [ 0  0  0  1  1  0  0  0]
 [ 0  0  0  0  1  0  0  0]
 [ 0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0]]

Episode 29955 played the following instead. Call this move B.

[[ 0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0]
 [ 0  0  0 -1 (1)-1  0  0]
 [ 0 -1 -1 -1  1  1  0  0]
 [ 0  0  0  1  1  0  0  0]
 [ 0  0  0  0  1  0  0  0]
 [ 0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0]]

I have not checked every episode after 29955, but in every occurrence of this position I did look at, up to episode 50,000 where the win rate is high and stable, the agent played move B.

Move A is the move that flips the most pieces in this position. During the period when the win rate was at its lowest, it was chosen in 4 of the 5 occurrences, so the agent at that time was most likely biased toward always flipping as many pieces as possible. However, flipping many pieces in the opening is said to be poor Othello strategy: in the opening it is better to flip as few pieces as possible, as with move B, and to occupy squares as far from the "edges" as possible, which pays off later. Flipping as many pieces as possible is exactly the behavior built into my self-made Othello environment, which plays that way 80% of the time [^2]. The agent appears to have learned this move from it.
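To make that environment behavior concrete, here is a minimal sketch under the board encoding used in the dumps above (1 = agent, -1 = environment, 0 = empty); the helper names are mine, not the repository's, and the 80% rate is the only figure taken from the article:

```python
import random
import numpy as np

DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def count_flips(board, row, col, player):
    """Number of opponent pieces flipped if `player` plays at (row, col)."""
    if board[row, col] != 0:
        return 0
    size = board.shape[0]
    total = 0
    for dr, dc in DIRECTIONS:
        r, c, run = row + dr, col + dc, 0
        # Walk over a contiguous run of opponent pieces...
        while 0 <= r < size and 0 <= c < size and board[r, c] == -player:
            run += 1
            r += dr
            c += dc
        # ...and count it only if it is capped by one of player's own pieces.
        if run and 0 <= r < size and 0 <= c < size and board[r, c] == player:
            total += run
    return total

def environment_move(board, player=-1, greedy_rate=0.8):
    """Play the legal move that flips the most pieces 80% of the time, else random."""
    size = board.shape[0]
    legal = [(r, c) for r in range(size) for c in range(size)
             if count_flips(board, r, c, player) > 0]
    if not legal:
        return None  # pass
    if random.random() < greedy_rate:
        return max(legal, key=lambda m: count_flips(board, m[0], m[1], player))
    return random.choice(legal)
```

Running count_flips on the position shown earlier should confirm that move A is the one that flips the most pieces there.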

Since Othello starts from the center of the board, always flipping as many pieces as possible tends, in the opening, to mean actively taking "edge" squares far from the center. Actively taking the edges likely created openings for the "corners" to be taken; the environment took the corners, and the agent kept losing. Indeed, before the losing streak, while the win rate sat around 50%, the agent and the environment took corners with roughly equal probability; but as soon as the agent began favoring the edges, the probability that the slightly stronger environment would take a corner increased, making it easier for the environment to win. The larger the board, the more likely it is that giving up the edges leads to giving up the corners, which is probably why the dip appeared only on the 8x8 board and not on the 6x6 board.

My tentative conclusion, then, is that this drop in win rate happened because the agent was pulled along by a specific behavior of the environment, fell into a state resembling "overfitting", and learned the wrong strategy. How the slope of Leaky ReLU factors into this, however, I cannot yet explain; I would like to keep studying and thinking about it.

In any case, to avoid this situation, one option is to do supervised learning first, as AlphaGo did, so that the agent is reasonably well trained before it ever plays against the environment. Another, if the agent starts growing in the wrong direction and the reward falls, might be to temporarily raise the ε of ε-greedy, for example, to speed up the turnover of the model.
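A minimal sketch of that second idea, raising ε again when recent results collapse; the window size and thresholds are arbitrary placeholders, not values from this project:

```python
from collections import deque

class AdaptiveEpsilon:
    """Decay epsilon as usual, but raise it again if recent results collapse."""

    def __init__(self, eps=1.0, eps_min=0.001, decay=0.9999,
                 bump_to=0.2, window=500, threshold=0.35):
        self.eps = eps
        self.eps_min = eps_min
        self.decay = decay
        self.bump_to = bump_to
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # 1 for a win, 0 for a loss

    def update(self, won):
        self.recent.append(1 if won else 0)
        # Normal annealing toward the floor.
        self.eps = max(self.eps_min, self.eps * self.decay)
        # If the recent win rate drops too far, re-inject exploration.
        if len(self.recent) == self.recent.maxlen:
            if sum(self.recent) / len(self.recent) < self.threshold:
                self.eps = max(self.eps, self.bump_to)
        return self.eps
```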

In closing

I would like to move on to other things, so this is where the Othello series ends.

This whole exercise began with a simple curiosity: what happens if I build an Othello problem myself and throw it at reinforcement learning? Since the goal was never to produce a strong Othello agent, there is no supervised learning at all; the agent only ever interacts with the roughly built, self-made environment. Starting from this "blank slate", in reinforcement learning that deals only with the environment, the environment's behavior has a strong influence on the model; indeed, nothing else influences it at all. As a result, we saw that when the environment's implementation is a little flawed, as here, the agent's learning can temporarily head in an unintended direction. But we also saw that the agent corrects itself and eventually raises its win rate. Perhaps this self-correction is the real value of reinforcement learning [^3]. A somewhat sentimental reading: even with a slightly odd parent (the environment), the child (the agent) can still grow up properly, or so it feels (^^; [^4]

Running 50,000 episodes took more than 10 hours on the 6x6 board and more than 24 hours on the 8x8 board. Moreover, the 8x8 board will not run on my home Pascal GeForce GTX 1050 (2 GB) for lack of memory, so I had to run it on the Maxwell Tesla M60 (8 GB) of an Azure NV6 instance, which is a bit slower than my home machine; as a result, my Azure bill for this month has already passed 10,000 yen. It is hard to keep experimenting, and this too is one of the reasons for ending the Othello series here.

Oh, I want an 8 GB GTX 1070 or 1080... [^5]

References

● Othello/Reversi winning strategies: http://mezasou.com/reversi/top27.html

[^1]: There are about 1.6 million positions in total. Counting rotations, transpositions, and so on would probably turn up more occurrences of essentially the same position, but I leave that aside for now.
[^2]: See the second article in this series.
[^3]: On the big premise that the reward is reasonable.
[^4]: From a professional standpoint this way of thinking may be laughable, but please forgive my lack of study (sweat).
[^5]: This would be great!
