[Python] [Introduction to Reinforcement Learning] Part 1: The Epsilon-Greedy Algorithm in the Bandit Game

Introduction

I would like to write articles that digest machine learning in my own way. I plan to make this a series, starting from the basics and gradually covering more advanced content: first the theory behind it, then code. Questions are welcome.

Reinforcement learning

Reinforcement learning, unlike so-called supervised learning, is an approach that seeks an optimal solution without being given the "answer" for a pre-prepared data set. For example, suppose there are two blackjack tables in front of you (and the win rate differs depending on the dealer). Naturally, if you keep playing at the table with the higher win rate, you maximize your profit. At the outset, however, you do not know how much each table will pay back, so you "learn" which one is better by playing repeatedly. This is the basic idea of reinforcement learning.

Here, two important competing factors, together known as the Explore-Exploit dilemma, are:

- Exploit (~Greedy): the drive to maximize profit based on what you already know.
- Explore (~Epsilon): the drive to "search" for greater profit at the expense of maximizing the immediate return.

Next, let's look at one concrete model: the Epsilon-Greedy Algorithm.

Epsilon-Greedy Algorithm

Put simply, this is a strategy that "usually chooses the option with the higher estimated return (Greedy), but occasionally (with a small probability, Epsilon) changes tack and chooses at random." In the blackjack example above, suppose you play a total of N games and win your first game against the male dealer (M). Your estimated win rate against M (MW) is then 1, while the win rate against the female dealer (W) is still unknown; if her estimate (WW) starts at 0, the next table you choose is again M. Note that after this second game, MW > WW holds regardless of whether you win or lose. If you always play only the table with the highest current estimate, that is, only M (pure Greedy), you will face a single dealer forever and never learn the win rate against W. The remedy is the second half of the name: with probability Epsilon (usually 5 to 10%), choose uniformly at random among all options. Then, for a sufficiently large N, no dealer's estimated win rate goes un-updated, and you can estimate all of them.
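The selection rule just described can be sketched in a few lines (a minimal sketch of my own, not the article's code; the full implementation follows in the next section):

```python
import numpy as np

rng = np.random.default_rng()

def choose(p_estimates, eps=0.05):
    """Epsilon-greedy selection: explore with probability eps, else exploit."""
    if rng.random() < eps:                      # explore
        return int(rng.integers(len(p_estimates)))  # pick any table at random
    return int(np.argmax(p_estimates))          # exploit: highest estimate so far
```

With eps=0, this degenerates into pure Greedy, which is exactly the failure mode described above: `choose([1.0, 0.0], eps=0.0)` returns 0 forever.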

Now, the key quantity here is Epsilon, the "probability of playing at random". If it is large, say 50%, you play randomly half the time, so you play both dealers roughly equally often and learn both win rates early on. At the same time, the Greedy side of the strategy is weakened, so the total return shrinks. What if Epsilon is too small? Then, conversely, you settle early on a dealer who appears to have a high win rate (not necessarily the truly best one) and play only against that dealer. In other words, balance is important (as in life). Incidentally, my life runs on a small Epsilon (an aside).

Let's implement this next.

Implementation

Written in Python 3.x. First, import the libraries and set the parameters we will use.

Bandit_Epsilon_simulation.py


import numpy as np
import matplotlib.pyplot as plt

Num_try = 1000                    # number of plays
Eps = 0.05                        # exploration probability
Bandit_prob = [0.1, 0.45, 0.75]   # true win rate of each dealer

Here, the numbers in Bandit_prob are the "true win rates" of each dealer, but the goal of this algorithm is to discover them through learning. Note that this is information the player does not know in advance.

Next, implement the core algorithm.

Bandit_Epsilon_simulation.py


class Bandit():
    def __init__(self, p):  # p: true win rate - pretend we do not know it
        self.p = p
        self.p_estimate = 0
        self.N = 0

    def pull(self):  # returns 1 with probability p
        return np.random.random() < self.p  # random() draws uniformly from [0, 1)

    def update(self, x):
        self.N += 1
        self.p_estimate = (1 / self.N) * (x + self.p_estimate * (self.N - 1))
        # Running mean, using:
        # <p>_N * N = p_N(=x) + <p>_(N-1) * (N-1)

def experiment():
    bandits = [Bandit(p) for p in Bandit_prob]  # the contents of Bandit_prob are "unknown"

    rewards = np.zeros(Num_try)
    num_explored = 0
    num_exploited = 0
    num_opt = 0
    opt_j = np.argmax([b.p for b in bandits])  # index of the truly best dealer

    for i in range(Num_try):
        # follow the epsilon-greedy algorithm
        if np.random.random() < Eps:  # explore
            num_explored += 1
            j = np.random.choice(len(bandits))  # j: index of the chosen dealer
        else:  # exploit (greedy)
            num_exploited += 1
            j = np.argmax([b.p_estimate for b in bandits])

        if j == opt_j:
            num_opt += 1

        x = bandits[j].pull()  # did the chosen dealer pay out?
        rewards[i] = x
        bandits[j].update(x)

    return bandits, rewards, num_explored, num_exploited, opt_j

The result of each game (win: 1, lose: 0) is returned by the chosen Bandit's pull method. Depending on that result, update then revises the estimated win rate against that dealer after every game. Note that the win rate here is the estimate computed on the player's side as play proceeds; it differs from the dealer's true rate (learning brings the estimate closer to it).
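As a quick sanity check (my own hypothetical example, not from the article), the running-mean formula in update gives the same value as the ordinary average:

```python
import numpy as np

results = [1, 0, 1, 1, 0]   # hypothetical win/lose sequence
p_estimate, N = 0.0, 0
for x in results:
    N += 1
    p_estimate = (1 / N) * (x + p_estimate * (N - 1))  # same formula as update()
print(p_estimate, np.mean(results))  # both are approximately 0.6
```

Updating incrementally like this means we never have to store the full history of game results, only the current estimate and the count N.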

Result

Now let's plot this for several values of Epsilon. The horizontal axis is the number of plays, the vertical axis is the win rate, and the black horizontal line marks the maximum true win rate of 75%.
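The article does not show the plotting code, so here is a minimal, self-contained sketch that could produce a similar figure. The Epsilon values [0.01, 0.05, 0.1] and the helper `run` are my own assumptions, not from the original:

```python
import numpy as np
import matplotlib.pyplot as plt

Num_try = 1000
Bandit_prob = [0.1, 0.45, 0.75]

def run(eps, rng):
    """One epsilon-greedy run; returns the cumulative win rate after each play."""
    p_est = np.zeros(len(Bandit_prob))  # estimated win rates
    n = np.zeros(len(Bandit_prob))      # times each dealer was played
    rewards = np.zeros(Num_try)
    for i in range(Num_try):
        if rng.random() < eps:                     # explore
            j = int(rng.integers(len(Bandit_prob)))
        else:                                      # exploit
            j = int(np.argmax(p_est))
        x = rng.random() < Bandit_prob[j]          # win with the true probability
        n[j] += 1
        p_est[j] += (x - p_est[j]) / n[j]          # incremental running mean
        rewards[i] = x
    return np.cumsum(rewards) / np.arange(1, Num_try + 1)

rng = np.random.default_rng(0)
for eps in [0.01, 0.05, 0.1]:  # example Epsilon values
    plt.plot(run(eps, rng), label=f"eps={eps}")
plt.axhline(max(Bandit_prob), color="k", linestyle="--")  # 75% ceiling
plt.xlabel("number of plays")
plt.ylabel("cumulative win rate")
plt.legend()
plt.show()
```

Because exploration picks a random dealer a fixed fraction of the time, no curve can converge exactly to 75%, which previews the discussion below.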

(Figure: bandit_epsilon.png, win-rate curves for several Epsilon values)

This plot is packed with insights: how quickly each curve reaches its plateau, how high that plateau is, and why it does not reach 75%. Next time, let's work through these points one by one.

Summary

- Reinforcement learning is a model that, without a given "answer", corrects the probability distribution over choices based on the returns (rewards) obtained along the way.
- The Epsilon-Greedy model is one representative example.
- The behavior differs greatly depending on the size of Epsilon.

References

Udemy: Artificial Intelligence: Reinforcement Learning in Python
