[PYTHON] Simulating drunken behavior with reinforcement learning

I want to study reinforcement learning

I work with AI and machine learning in my day job. Until now, though, I have mostly studied supervised learning and have barely touched unsupervised learning or reinforcement learning. So yesterday I found this article, which is ideal as a reinforcement learning hands-on, and worked through it myself.

While reading the article and wondering what theme to pick for my own experiment, it struck me that the agent moves __as if it were drunk__. The article's task is to use reinforcement learning to find the optimal route from a start point to an end point. The agent has an 80% chance of moving in the intended direction, a 10% chance of veering left, and a 10% chance of veering right.

image.png

I thought, __"Couldn't this be used to reproduce how a drunk person behaves?"__, so I ran some hands-on experiments. Please read on while recalling __the times you got sloppy drunk at an izakaya before COVID-19__. The original article is also very helpful, so please have a look.

Sample code

The code is almost the same as in that article, with some modifications to introduce the drunkenness variable berobero. For the methods not shown here, please refer to the GitHub repository linked from the article above. Only the modified parts are excerpted below.

# Abstract class
class MDP:
    # MDP: Markov decision process
    # Takes the drunkenness level berobero as an argument
    def __init__(self, init, actlist, terminals, gamma=.9, berobero=0.1):
        # init: initial state
        # actlist: available actions
        # terminals: terminal states
        # gamma: discount factor
        # berobero: probability of veering to each side
        self.init = init
        self.actlist = actlist
        self.terminals = terminals
        self.gamma = gamma
        self.berobero = berobero

# Concrete class
class GridMDP(MDP):
    # Takes the drunkenness level berobero as an argument
    def __init__(self, grid, terminals, init=(0, 0), gamma=.9, berobero=0.1):
        # grid is a matrix that defines the field
        grid.reverse()  # because we want row 0 on bottom, not on top
        MDP.__init__(self, init, actlist=orientations,
                     terminals=terminals, gamma=gamma, berobero=berobero)
        self.grid = grid
        self.berobero = berobero
        self.rows = len(grid)
        self.cols = len(grid[0])

    # Returns a list of (transition probability, next state) pairs
    # berobero=0 is sober, 0.1 is tipsy, 0.3 is sloppy drunk
    # berobero=0.5 always ends up crab-walking, so picture someone whose world is spinning
    def T(self, state, action):
        if action is None:
            return [(0.0, state)]
        else:
            return [(1-2*self.berobero, self.go(state, action)),
                    (self.berobero, self.go(state, turn_right(action))),
                    (self.berobero, self.go(state, turn_left(action)))]

Problem setting

The agent travels from the start point to the end point and looks for the policy that maximizes its reward. With a certain probability it moves in a direction other than the one it intended: __it veers left with probability berobero, veers right with probability berobero, and moves in the intended direction with probability 1 - 2 * berobero.__ So picture __berobero = 0 as sober, 0.1 as tipsy, and 0.3 as sloppy drunk__. __With berobero = 0.5 the agent can only crab-walk, so imagine someone so drunk that the room is spinning.__

image.png
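As a quick check of the probabilities described above, here is a minimal sketch. The function name slip_distribution is hypothetical and is not part of the article's code; it only restates the 1 - 2 * berobero rule.

```python
def slip_distribution(berobero):
    """Return (forward, right, left) probabilities for a drunkenness level.

    Hypothetical helper: restates the rule used in GridMDP.T, nothing more.
    """
    if not 0.0 <= berobero <= 0.5:
        raise ValueError("berobero must be between 0 and 0.5")
    return (1 - 2 * berobero, berobero, berobero)


for level in (0.0, 0.1, 0.3, 0.5):
    forward, right, left = slip_distribution(level)
    print(f"berobero={level}: forward={forward:.1f}, "
          f"right={right:.1f}, left={left:.1f}")
```

At berobero = 0.5 the forward probability is exactly 0, which is why that setting can only produce sideways movement.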

Walk along the wall when you get drunk

When you are really drunk, don't you __put a hand on the wall and move along it__? In a tatami room I would rather not keep saying "excuse me" as I squeeze past people, so my image is to head for the nearest wall and then follow it. I would like to simulate this.

When there are people near the exit

First, let's verify with the example from that article. Consider a bar with one pillar and one person near the exit. We want the optimal route to the exit that avoids bumping into the person.

image.png

If you're drunk, you don't want to risk bumping into someone by heading right too early. So __intuitively, it seems best to go up until you hit the wall and then head right along it__.

loss is the reward received on each square; setting it to -0.5 makes it a room where every move costs 0.5. The more negative this reward, the shorter the route the agent will take. It seems loss could be used to model something like "if you don't reach the exit soon, you'll throw up." (I won't do that this time.)

# Original pattern
loss = 0
grid = [
    [loss, loss, loss, +1],
    [loss, None, loss, -1],
    [loss, loss, loss, loss]
]

# Coordinates are (x, y), counted from the bottom left
sequential_decision_environment = GridMDP(grid, terminals=[(3, 2), (3, 1)], berobero=0.1)

pi = best_policy(sequential_decision_environment, value_iteration(sequential_decision_environment, .01))

print_table(sequential_decision_environment.to_arrows(pi))
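For readers who do not have the referenced GitHub code at hand, here is a minimal standalone sketch of the same experiment. It is a re-implementation under assumptions, not the article's GridMDP: the helper names (build_rewards, go, expected_utility) are hypothetical, arrows stand in for the orientation tuples, and the update and stopping rule follow the standard value iteration form.

```python
# Hypothetical standalone sketch; not the article's GridMDP implementation.
ARROWS = [">", "^", "<", "v"]  # counter-clockwise order
MOVES = {">": (1, 0), "^": (0, 1), "<": (-1, 0), "v": (0, -1)}


def turn_left(action):
    return ARROWS[(ARROWS.index(action) + 1) % 4]


def turn_right(action):
    return ARROWS[(ARROWS.index(action) - 1) % 4]


def build_rewards(grid):
    """Map (x, y) -> reward with row 0 at the bottom; None cells are walls."""
    return {(x, y): r
            for y, row in enumerate(reversed(grid))
            for x, r in enumerate(row) if r is not None}


def go(rewards, state, action):
    """Move one square, or stay put if the target is a wall or off the grid."""
    dx, dy = MOVES[action]
    target = (state[0] + dx, state[1] + dy)
    return target if target in rewards else state


def expected_utility(U, rewards, state, action, berobero):
    outcomes = [(1 - 2 * berobero, action),
                (berobero, turn_right(action)),
                (berobero, turn_left(action))]
    return sum(p * U[go(rewards, state, a)] for p, a in outcomes)


def value_iteration(rewards, terminals, berobero, gamma=0.9, epsilon=0.01):
    U = {s: 0.0 for s in rewards}
    while True:
        new_U, delta = {}, 0.0
        for s, r in rewards.items():
            if s in terminals:
                new_U[s] = r  # terminal squares keep their own reward
            else:
                new_U[s] = r + gamma * max(
                    expected_utility(U, rewards, s, a, berobero)
                    for a in ARROWS)
            delta = max(delta, abs(new_U[s] - U[s]))
        U = new_U
        if delta < epsilon * (1 - gamma) / gamma:
            return U


def best_policy(U, rewards, terminals, berobero):
    return {s: max(ARROWS, key=lambda a: expected_utility(
                U, rewards, s, a, berobero))
            for s in rewards if s not in terminals}


loss = 0
grid = [[loss, loss, loss, +1],
        [loss, None, loss, -1],
        [loss, loss, loss, loss]]
terminals = [(3, 2), (3, 1)]
rewards = build_rewards(grid)

for berobero in (0.1, 0.3, 0.5):
    U = value_iteration(rewards, terminals, berobero)
    pi = best_policy(U, rewards, terminals, berobero)
    print(berobero, "start:", pi[(0, 0)], "next to exit:", pi[(2, 2)])
```

Unlike the article's code, this sketch does not reverse grid in place, so the same grid can be reused across several berobero values.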

Tipsy pattern

Here is the result for berobero = 0.1. Each arrow shows the best direction to move from that square. In this case the result is "the best route is to go up from the start and then head right," which matches intuition.

image.png

Sloppy-drunk pattern

Here is the result for berobero = 0.3. Interestingly, the agent first tries to go left. I sense a determination not to head right no matter what. It is also interesting that, just before the end point, it faces up so that a slip sends it right into the goal or, at worst, left.

image.png

Crab-walk pattern

Here is the case of berobero = 0.5. I sense a strong resolve: __"I'm too drunk... I'll just keep shuffling sideways..."__ It faces right and shuffles sideways, then faces up and shuffles sideways again, without any trouble. __It feels like a state of enlightenment that turns the inability to move forward to its advantage.__

image.png

When escaping from a drinking party of four tatami mats

Now, at last, the practical part. Consider escaping from a back seat at a drinking party with four tables. Let's treat the four tables as four people and see whether the agent heads for the exit saying "excuse me" as it squeezes past, or follows the wall even if that means a detour.

image.png

Tipsy pattern

Here is the result for berobero = 0.1. In this case it takes the "excuse me" route. It isn't very drunk, so it seems fine with squeezing past people.

image.png

Sloppy-drunk pattern

Here is the result for berobero = 0.3. __This is it! I sense a strong determination to follow the wall!__ __When you're drunk, just follow the wall and stay safe!__

I wondered why the same thing doesn't happen with berobero = 0.1. Since loss = 0, a detour should cost nothing, so following the wall ought to be the surest route. If you know why, I would be grateful if you could tell me!

image.png
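On the question of why berobero = 0.1 does not follow the wall, one possible explanation, offered as a guess rather than a confirmed answer: GridMDP is created with the default gamma = 0.9, so even with loss = 0 a reward that arrives later is worth less. The sketch below shows how the +1 at the exit shrinks with the number of moves. The step counts are made-up examples, not measured from the grids above, and the helper name is hypothetical.

```python
GAMMA = 0.9  # default discount factor in the article's GridMDP signature


def discounted_goal_value(steps, gamma=GAMMA, goal_reward=1.0):
    """Present value of the goal reward when it is reached after `steps` moves.

    Assumes loss = 0 on every square and no slips along the way.
    """
    return goal_reward * gamma ** steps


for steps in (4, 6, 8):  # hypothetical route lengths
    print(f"{steps} moves -> {discounted_goal_value(steps):.3f}")
```

Under discounting, then, a detour is not free. A slightly drunk agent may accept a small risk of bumping into someone to save moves, while at berobero = 0.3 that risk may outweigh what the shorter route saves. This is only one plausible reading of the results.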

Crab-walk pattern

Here is the case of berobero = 0.5. It's rather funny that it crab-walks because it can't move forward. __It makes me imagine someone so drunk that a sense of crisis has kicked in and made them calm.__

image.png

When the detour also carries risk

Earlier the detour carried no risk, so this time I'll add a little. Whether you take the shortcut or the detour, you have to say "excuse me" to someone, but on the detour the passage is wider and you can then head right safely. __Does an izakaya like this even exist?__

image.png

Tipsy pattern

Here is the result for berobero = 0.1. In this case it takes the "excuse me" route to the right.

Why does the arrow at the start point face down? I sense a reclusive sorrow, something like __"No, I don't want to go to the exit."__

image.png

Sloppy-drunk pattern

Here is the result for berobero = 0.3. It takes the safe choice. By facing left on the square just above the None cell, it shows a strong determination to "never go right."

image.png

Crab-walk pattern

Here is the case of berobero = 0.5. Once it resorts to crab-walking, the behavior becomes quite a mystery.

image.png

In closing

The goal this time was a reinforcement learning hands-on, so my examination of the conditions is still shallow. In particular, I fixed loss = 0, so I would like to see how the behavior changes when it varies. Also, I used value iteration this time, but I would like to implement Q-learning as well.

And let's all enjoy alcohol in moderation! Thank you for reading to the end! I would be grateful for an LGTM.
