After some practice, I tried using PyTorch to have a machine learn my own Minesweeper. There are still many things I don't understand, so these are notes taken while studying various things. I'll write the memo first and clean it up later if needed.
The goal is to stably clear the beginner level of the (formerly) standard Windows Minesweeper. For now, aim for a win rate of about 90%.
I copied the DQN from here. It isn't enough on its own, so the rest is improvised.
The network uses a sequential model.
The number of neurons in the input layer (the state $s$) is the number of squares on the board, and the hidden layers use (number of squares) × SIZE_MAG neurons.
For now I just scale it by the number of squares on the board (a rough choice).
ReLU is used as the activation function, and Adam (learning rate 0.001) is used as the optimizer.
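As a rough sketch, the network might look like the following in PyTorch. Only the sequential structure, the SIZE_MAG scaling, ReLU, and Adam with learning rate 0.001 come from the text; the number of hidden layers, the 6×6 board size, and the variable names are my assumptions.

```python
import torch.nn as nn
import torch.optim as optim

n_squares = 6 * 6   # number of board squares (assumed 6x6 board)
SIZE_MAG = 8        # scaling factor for the hidden layers

# One input per square (state s), one output per square (action a)
model = nn.Sequential(
    nn.Linear(n_squares, n_squares * SIZE_MAG),
    nn.ReLU(),
    nn.Linear(n_squares * SIZE_MAG, n_squares * SIZE_MAG),
    nn.ReLU(),
    nn.Linear(n_squares * SIZE_MAG, n_squares),
)

optimizer = optim.Adam(model.parameters(), lr=0.001)
```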
The Minesweeper itself is my own implementation. I thought about going the extra mile and working from captured screen images, but that's not the main topic here, so the algorithm is omitted.
The rewards are as follows.
variable | condition
---|---
reward_win | Game cleared
reward_failed | Game failed (hit a mine)
reward_miss | Tried to open an already-open square
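As a sketch, assigning these rewards inside the environment's step function might look like this. The function and argument names are hypothetical; only the three reward variables come from the table above.

```python
def step_reward(cleared, hit_mine, already_open):
    """Return the reward for one action (names are hypothetical)."""
    if cleared:
        return reward_win
    if hit_mine:
        return reward_failed
    if already_open:
        return reward_miss
    return 0  # opening a new square gives no reward at this stage
```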
First, set the board size to 6×6 with 5 mines and see whether it can learn at all.
param
GAMMA = 0.99
NUM_EPISODES = 1000
CAPACITY = 10000
BATCH_SIZE = 200
SIZE_MAG = 8
reward_failed = -100
reward_win = 100
reward_miss = -1
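For reference, a minimal sketch of the replay memory and one DQN update step using these hyperparameters. This is my own sketch, not the copied code: `model` and `optimizer` are the network and Adam optimizer from the sketch above, and states are assumed to be float tensors of length `n_squares`.

```python
import random
from collections import deque, namedtuple

import torch
import torch.nn.functional as F

Transition = namedtuple("Transition", ("state", "action", "next_state", "reward", "done"))
memory = deque(maxlen=CAPACITY)  # experience replay buffer

def replay():
    if len(memory) < BATCH_SIZE:
        return
    batch = Transition(*zip(*random.sample(memory, BATCH_SIZE)))
    state = torch.stack(batch.state)                         # (B, n_squares)
    action = torch.tensor(batch.action).unsqueeze(1)         # (B, 1)
    reward = torch.tensor(batch.reward, dtype=torch.float32)
    next_state = torch.stack(batch.next_state)
    done = torch.tensor(batch.done, dtype=torch.float32)

    # Q(s, a) for the actions actually taken
    q = model(state).gather(1, action).squeeze(1)
    # Target: r + GAMMA * max_a' Q(s', a'), zeroed on terminal states
    with torch.no_grad():
        q_next = model(next_state).max(1)[0]
    target = reward + GAMMA * q_next * (1.0 - done)

    loss = F.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```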
It does win very occasionally, but it feels like pure luck. Looking at the loss, it blew up to around four digits after roughly 2000 steps. Hmm... Looking at the simple reward sum, the agent keeps trying to open only squares that are already open. Is this a reward problem?
So I fixed the rewards.
reward
reward_failed = -100
reward_win = 100
reward_miss = -10
reward_open = 1
reward_open is the reward given for opening a new square.
The loss was calmer than before, but it kept oscillating around 10.
I fiddled with things, but the oscillation and divergence didn't stop. Looking at its behavior, it still tries to open squares that are already open. Time to introduce a fixed target Q-network...
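A minimal sketch of what a fixed target Q-network adds (my own sketch, not the article's code; `TARGET_UPDATE_INTERVAL` is an assumption): keep a frozen copy of the model for computing targets and sync it only occasionally.

```python
import copy

target_model = copy.deepcopy(model)  # frozen copy used for the TD target
TARGET_UPDATE_INTERVAL = 10          # episodes between syncs (assumption)

def update_target():
    # Without this periodic sync the target network keeps its initial
    # weights forever.
    target_model.load_state_dict(model.state_dict())

# In replay(), the target would use target_model instead of model:
#     q_next = target_model(next_state).max(1)[0]
# and in the episode loop:
#     if episode % TARGET_UPDATE_INTERVAL == 0:
#         update_target()
```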
I considered the following possibilities.
As a test, I also ended the game when an already-open square was selected; the loss itself then dropped below 1. When I lowered the initial value of ε (0.5 → 0.2), the loss became even smaller (around 0.1 to 0.01). However, the problem with how mini-batches are drawn still isn't solved, so I'll implement Prioritized Experience Replay. The code is used as-is.
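For reference, a minimal sketch of proportional prioritized experience replay (my own sketch under simple assumptions, not the code referenced above; it also omits the importance-sampling weights of the full algorithm):

```python
import numpy as np

class PrioritizedMemory:
    def __init__(self, capacity, alpha=0.6, eps=0.01):
        self.capacity = capacity
        self.alpha = alpha      # how strongly priorities bias sampling
        self.eps = eps          # keeps every priority strictly positive
        self.data = []
        self.priorities = []

    def push(self, transition):
        # New transitions get the current maximum priority so they are
        # sampled at least once.
        max_p = max(self.priorities, default=1.0)
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(max_p)

    def sample(self, batch_size):
        p = np.array(self.priorities) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.data), batch_size, p=p)
        return idx, [self.data[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        # Larger TD error -> higher chance of being sampled again
        for i, e in zip(idx, td_errors):
            self.priorities[i] = abs(float(e)) + self.eps
```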
Even so, it doesn't work. Well, it turned out to be a coding mistake on my part (the target Q-network was still sitting at its initial values...).
So I fixed that, and... it still didn't work. Are the rewards too large?
reward
reward_failed = -1
reward_win = 1
reward_miss = -1
reward_open = 1
No good after all...
Up to now I'd been training on a different board every episode. Could it learn a single fixed board...? → It could. It takes about 200 to 300 episodes for the win rate to reach 90%. It's kind of cute how the win rate jumps to 90% as soon as it manages to clear the board once.
Then what about changing the board once every 150 episodes? → It can't win at all. It seems to be dragged down by the data from previous training.
Well then, back to changing the board every time!