While studying Python and reinforcement learning, I tried to build a blackjack strategy. There is a well-known probability-based approach called basic strategy, and my goal is to see how close reinforcement learning can get to it.
I will proceed as follows.
OpenAI Gym is a platform used as a research environment for reinforcement learning. Environments (games) such as CartPole and mazes are provided, so you can easily try reinforcement learning. Every OpenAI Gym environment exposes a common interface that receives actions from the agent and returns the next state and reward. It can be installed as follows; please refer to other pages for the details. In what follows, I assume the installation is complete.
pip install gym
This time, I will register my own blackjack in this OpenAI Gym environment so that I can perform reinforcement learning.
First, let's take a quick look at reinforcement learning. The "agent" observes the "state" of the "environment" and takes an "action" on it. The "environment" feeds the updated "state" and a "reward" back to the "agent". The goal of reinforcement learning is to acquire a way of choosing "actions" (= a policy) that maximizes the sum of the "rewards" obtained in the future.
In this blackjack, we consider reinforcement learning as follows.
- Environment: Blackjack
- Agent: Player
- State: The Player's cards, the Dealer's card, etc.
- Action: The Player's choice: Hit, Stand, etc.
- Reward: Chips won in the game
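The interaction loop described above can be sketched without Gym at all (a toy standalone example; the three-step dynamics and the reward rule are made up purely for illustration):

```python
import random

class ToyEnv:
    """A minimal environment with the Gym-style reset/step interface."""

    def reset(self):
        # Return the initial observation
        self.steps = 0
        return self.steps

    def step(self, action):
        # Apply the action, then return (observation, reward, done, info)
        self.steps += 1
        reward = 1 if action == 1 else 0   # action 1 earns a reward
        done = self.steps >= 3             # episode ends after 3 steps
        return self.steps, reward, done, {}

env = ToyEnv()
obs = env.reset()
done = False
total_reward = 0
while not done:
    action = random.choice([0, 1])               # a random policy
    obs, reward, done, info = env.step(action)
    total_reward += reward
```

The real BlackJackEnv below follows exactly this reset/step contract, just with blackjack-specific state and rewards.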
Follow the steps below to register your own environment in OpenAI Gym.
The file structure is as follows. Note that there are two files named `__init__.py`.
└─ myenv
├─ __init__.py --->Registers BlackJackEnv
└─env
├─ __init__.py --->Indicates where the BlackJack Env is located
├─ blackjack.py --->BlackJack game itself
└─ blackjack_env.py --->Defines the BlackJackEnv class, which inherits gym.Env
Then, follow the procedure to register the environment.
myenv/env/blackjack.py Keep the Blackjack code created last time as it is; it is imported and used by blackjack_env.py below.
myenv/env/blackjack_env.py Create the BlackJackEnv class for the BlackJack game environment that you want to register in OpenAI Gym. Inherit from gym.Env and implement the following three properties and five methods.
- action_space: The actions the player (agent) can select
- observation_space: The information about the game environment that the player (agent) can observe
- reward_range: The range from the minimum to the maximum reward
- reset: A method that resets the environment.
- step: A method that executes an action in the environment and returns the result.
- render: A method that visualizes the environment.
- close: A method that closes the environment; used at the end of learning.
- seed: A method that fixes the random seed.
The action space shows that the player can take four actions: Stand, Hit, Double Down, and Surrender.
action_space
self.action_space = gym.spaces.Discrete(4)
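For reference, the mapping from these integer actions to game commands, used later in the step method, can be written as a lookup table (a standalone sketch; the one-letter codes follow the game code from the previous article):

```python
# 0: Stand, 1: Hit, 2: Double down, 3: Surrender
ACTION_STRS = {0: 's', 1: 'h', 2: 'd', 3: 'r'}

def decode_action(action):
    """Translate a Discrete(4) action index into the game's one-letter command."""
    if action not in ACTION_STRS:
        raise ValueError(f"Undefined Action: {action}")
    return ACTION_STRS[action]
```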
Observe the total points of the Player's hand, the points of the Dealer's face-up card, a flag indicating a soft hand (the Player's hand contains an A), and a flag indicating whether the Player has already hit. Determine the maximum and minimum values for each.
observation_space
high = np.array([
    30,  # player max
    30,  # dealer max
    1,   # is_soft_hand
    1,   # hit flag true
])
low = np.array([
    2,  # player min
    1,  # dealer min
    0,  # is_soft_hand false
    0,  # hit flag false
])
self.observation_space = gym.spaces.Box(low=low, high=high)
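As a quick sanity check of these bounds (a plain-Python sketch, independent of gym.spaces.Box), an observation tuple can be validated component by component:

```python
LOW  = (2, 1, 0, 0)    # player min, dealer min, flags off
HIGH = (30, 30, 1, 1)  # player max, dealer max, flags on

def in_bounds(obs, low=LOW, high=HIGH):
    """Return True if every component of the observation lies in [low, high]."""
    return all(lo <= x <= hi for x, lo, hi in zip(obs, low, high))
```

For example, `in_bounds((12, 10, 0, 1))` is True, while a player total of 31 or a dealer card of 0 falls outside the box.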
Determine the range of rewards. Here it is set to the minimum and maximum number of chips that can be won or lost.
reward_range
self.reward_range = [-10000, 10000]
Initialize self.done, reset the Player's and Dealer's hands with self.game.reset_game(), bet chips (Bet), and deal the cards (Deal). As described under the step method, self.done is a Boolean indicating whether the round has been decided. self.observe() observes and returns the four state values. Note that for this training setup, the Player's chip balance is kept topped up so it never runs out.
reset()
def reset(self):
    # Initialize the state and return the initial observation
    self.done = False
    self.game.reset_game()
    self.game.bet(bet=100)
    self.game.player.chip.balance = 1000  # Keep the balance topped up so it never runs out during training
    self.game.deal()
    return self.observe()
The Player takes one of Stand, Hit, Double down, or Surrender in the environment. When the player's turn ends, the chips are settled. Finally, the following four values are returned.
- observation: The observed state of the environment.
- reward: The amount of reward earned by the action.
- done: A Boolean indicating whether the environment should be reset; in BlackJack, whether the round has been decided.
- info: A dictionary for useful debugging information.
Also, in this learning environment, choosing Double down or Surrender after having hit is a rules violation and incurs a penalty.
step()
def step(self, action):
    # Advance one step: execute the action and return
    # observation, reward, done (has the round finished?) and info (a dict of extra information)
    if action == 0:
        action_str = 's'  # Stand
    elif action == 1:
        action_str = 'h'  # Hit
    elif action == 2:
        action_str = 'd'  # Double down
    elif action == 3:
        action_str = 'r'  # Surrender
    else:
        raise ValueError("Undefined Action: " + str(action))
    hit_flag_before_step = self.game.player.hit_flag
    self.game.player_step(action=action_str)
    if self.game.player.done:
        # The player's turn is over
        self.game.dealer_turn()
        self.game.judge()
        reward = self.get_reward()
        self.game.check_deck()
        print(str(self.game.judgment) + " : " + str(reward))
    elif action >= 2 and hit_flag_before_step:
        reward = -1e3  # Penalty for violating the rules
    else:
        # The player's turn continues
        reward = 0
    observation = self.observe()
    self.done = self.is_done()
    return observation, reward, self.done, {}
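The rule-violation check in step (Double down or Surrender after a hit) can be isolated as a small helper (a standalone sketch with illustrative names, mirroring the condition above):

```python
DOUBLE_DOWN, SURRENDER = 2, 3

def is_rule_violation(action, hit_flag_before_step):
    """Double down and Surrender are only allowed before the first hit."""
    return action in (DOUBLE_DOWN, SURRENDER) and hit_flag_before_step
```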
This time, the render, close, and seed methods are not used.
The whole code of blackjack_env.py looks like this:
myenv/env/blackjack_env.py
import gym
import gym.spaces
import numpy as np

from myenv.env.blackjack import Game


class BlackJackEnv(gym.Env):
    metadata = {'render.modes': ['human', 'ansi']}

    def __init__(self):
        super().__init__()
        self.game = Game()
        self.game.start()
        # Set action_space, observation_space and reward_range
        self.action_space = gym.spaces.Discrete(4)  # stand, hit, double down, surrender
        high = np.array([
            30,  # player max
            30,  # dealer max
            1,   # is_soft_hand
            1,   # hit flag true
        ])
        low = np.array([
            2,  # player min
            1,  # dealer min
            0,  # is_soft_hand false
            0,  # hit flag false
        ])
        self.observation_space = gym.spaces.Box(low=low, high=high)
        self.reward_range = [-10000, 10000]  # Minimum and maximum reward
        self.done = False
        self.reset()

    def reset(self):
        # Initialize the state and return the initial observation
        self.done = False
        self.game.reset_game()
        self.game.bet(bet=100)
        self.game.player.chip.balance = 1000  # Keep the balance topped up so it never runs out during training
        self.game.deal()
        return self.observe()

    def step(self, action):
        # Advance one step: execute the action and return
        # observation, reward, done (has the round finished?) and info (a dict of extra information)
        if action == 0:
            action_str = 's'  # Stand
        elif action == 1:
            action_str = 'h'  # Hit
        elif action == 2:
            action_str = 'd'  # Double down
        elif action == 3:
            action_str = 'r'  # Surrender
        else:
            raise ValueError("Undefined Action: " + str(action))
        hit_flag_before_step = self.game.player.hit_flag
        self.game.player_step(action=action_str)
        if self.game.player.done:
            # The player's turn is over
            self.game.dealer_turn()
            self.game.judge()
            reward = self.get_reward()
            self.game.check_deck()
            print(str(self.game.judgment) + " : " + str(reward))
        elif action >= 2 and hit_flag_before_step:
            reward = -1e3  # Penalty for violating the rules
        else:
            # The player's turn continues
            reward = 0
        observation = self.observe()
        self.done = self.is_done()
        return observation, reward, self.done, {}

    def render(self, mode='human', close=False):
        # Visualize the environment
        # 'human' prints to the console; 'ansi' returns a StringIO
        pass

    def close(self):
        # Close the environment and clean up
        pass

    def seed(self, seed=None):
        # Fix the random seed
        pass

    def get_reward(self):
        # Return the reward: net chips won or lost this round
        reward = self.game.pay_chip() - self.game.player.chip.bet
        return reward

    def is_done(self):
        return bool(self.game.player.done)

    def observe(self):
        if self.game.player.done:
            observation = tuple([
                self.game.player.hand.calc_final_point(),
                self.game.dealer.hand.calc_final_point(),  # Dealer's total score
                int(self.game.player.hand.is_soft_hand),
                int(self.game.player.hit_flag)])
        else:
            observation = tuple([
                self.game.player.hand.calc_final_point(),
                self.game.dealer.hand.hand[0].point,  # Dealer's up card only
                int(self.game.player.hand.is_soft_hand),
                int(self.game.player.hit_flag)])
        return observation
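The reward computed in get_reward is simply the net chip movement: whatever the game pays out minus the bet. For the fixed 100-chip bet used here, the per-round rewards work out as follows (a standalone sketch of that arithmetic; the payout values assume standard even-money blackjack rules, not anything specific to this game code):

```python
BET = 100

def net_reward(payout):
    """Net chips for the round: what the game pays out minus the bet."""
    return payout - BET

# Typical rounds under a fixed 100-chip bet:
win  = net_reward(200)  # an even-money win pays back 2x the bet -> +100
push = net_reward(100)  # a push returns the bet -> 0
loss = net_reward(0)    # a loss pays nothing -> -100
```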
myenv/__init__.py
Register BlackJackEnv with Gym using the gym.envs.registration.register function. Here we declare that the BlackJackEnv class under the env directory under the myenv directory will be called with the ID BlackJack-v0.
myenv/__init__.py
from gym.envs.registration import register

register(
    id='BlackJack-v0',
    entry_point='myenv.env:BlackJackEnv',
)
myenv/env/__init__.py
Declare that the BlackJackEnv class is in blackjack_env.py under the env directory under the myenv directory.
myenv/env/__init__.py
from myenv.env.blackjack_env import BlackJackEnv
In the reinforcement learning code, you can now use the environment with `env = gym.make('BlackJack-v0')`.
Since this article focuses on registering the environment, I will omit the learning code; the next article will cover it.
I registered my own blackjack game as an OpenAI Gym environment. I realized I had to think carefully about what counts as an action, what to observe as the state, what to use as the reward, and how far a single step should advance the environment I made. At first, I had made a single step ridiculously long.
Next, I would like to use this environment to learn blackjack strategies.