While studying Python and reinforcement learning, I tried to build a blackjack strategy. There is a well-known probability-based approach called basic strategy, and my goal is to see how close reinforcement learning can get to it.
I will proceed as follows.
OpenAI Gym is a platform used as a research environment for reinforcement learning. Environments (games) such as CartPole and mazes are provided, so you can easily try reinforcement learning. Every OpenAI Gym environment exposes a common interface that receives actions from the agent and returns the next state and reward. It can be installed as follows; please refer to other pages for the details. In what follows, I assume the installation is complete.
pip install gym
This time, I will register my own blackjack in this OpenAI Gym environment so that I can perform reinforcement learning.
First, let's take a quick look at reinforcement learning. The "agent" observes the "state" of the "environment" and takes an "action" on it. The "environment" feeds the updated "state" and a "reward" back to the "agent". The goal of reinforcement learning is to acquire a way of choosing "actions" (= a policy) that maximizes the sum of the "rewards" obtained in the future.
In this blackjack, we consider reinforcement learning as follows.
- Environment: Blackjack
- Agent: Player
- State: The Player's cards, the Dealer's card, etc.
- Action: The Player's choice: Hit, Stand, etc.
- Reward: Chips won in the game
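The interaction loop described above can be sketched without Gym at all (a toy standalone example; the three-step dynamics and the reward rule are made up purely for illustration):

```python
import random

class ToyEnv:
    """A minimal environment with the Gym-style reset/step interface."""

    def reset(self):
        # Return the initial observation
        self.steps = 0
        return self.steps

    def step(self, action):
        # Apply the action, then return (observation, reward, done, info)
        self.steps += 1
        reward = 1 if action == 1 else 0   # action 1 earns a reward
        done = self.steps >= 3             # episode ends after 3 steps
        return self.steps, reward, done, {}

env = ToyEnv()
obs = env.reset()
done = False
total_reward = 0
while not done:
    action = random.choice([0, 1])               # a random policy
    obs, reward, done, info = env.step(action)
    total_reward += reward
```

The real BlackJackEnv below follows exactly this reset/step contract, just with blackjack-specific state and rewards.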
Follow the steps below to register your own environment in OpenAI Gym.
The file structure is as follows. Note that there are two files named `__init__.py`.
└─ myenv
├─ __init__.py --->Registers BlackJackEnv
└─env
├─ __init__.py --->Indicates where the BlackJack Env is located
├─ blackjack.py --->BlackJack game itself
└─ blackjack_env.py --->Defines the BlackJackEnv class, which inherits gym.Env
Then, follow the procedure to register the environment.
myenv/env/blackjack.py Keep the Blackjack code created last time as it is; it is imported and used by blackjack_env.py below.
myenv/env/blackjack_env.py Create the BlackJackEnv class for the BlackJack game environment that you want to register in OpenAI Gym. Inherit from gym.Env and implement the following three properties and five methods.
- action_space: The actions the player (agent) can select
- observation_space: The information about the game environment that the player (agent) can observe
- reward_range: The range from the minimum to the maximum reward
- reset: A method that resets the environment.
- step: A method that executes an action in the environment and returns the result.
- render: A method that visualizes the environment.
- close: A method that closes the environment; used at the end of learning.
- seed: A method that fixes the random seed.
The action space shows that the player can take four actions: Stand, Hit, Double Down, and Surrender.
action_space
self.action_space = gym.spaces.Discrete(4)
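For reference, the mapping from these integer actions to game commands, used later in the step method, can be written as a lookup table (a standalone sketch; the one-letter codes follow the game code from the previous article):

```python
# 0: Stand, 1: Hit, 2: Double down, 3: Surrender
ACTION_STRS = {0: 's', 1: 'h', 2: 'd', 3: 'r'}

def decode_action(action):
    """Translate a Discrete(4) action index into the game's one-letter command."""
    if action not in ACTION_STRS:
        raise ValueError(f"Undefined Action: {action}")
    return ACTION_STRS[action]
```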
Observe the total points of the Player's hand, the points of the Dealer's face-up card, a flag indicating a soft hand (the Player's hand contains an A), and a flag indicating whether the Player has already hit. Determine the maximum and minimum values for each.
observation_space
high = np.array([
    30,  # player max
    30,  # dealer max
    1,   # is_soft_hand
    1,   # hit flag true
])
low = np.array([
    2,  # player min
    1,  # dealer min
    0,  # is_soft_hand false
    0,  # hit flag false
])
self.observation_space = gym.spaces.Box(low=low, high=high)
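As a quick sanity check of these bounds (a plain-Python sketch, independent of gym.spaces.Box), an observation tuple can be validated component by component:

```python
LOW  = (2, 1, 0, 0)    # player min, dealer min, flags off
HIGH = (30, 30, 1, 1)  # player max, dealer max, flags on

def in_bounds(obs, low=LOW, high=HIGH):
    """Return True if every component of the observation lies in [low, high]."""
    return all(lo <= x <= hi for x, lo, hi in zip(obs, low, high))
```

For example, `in_bounds((12, 10, 0, 1))` is True, while a player total of 31 or a dealer card of 0 falls outside the box.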
Determine the range of rewards. Here it is set to the minimum and maximum number of chips that can be won or lost.
reward_range
self.reward_range = [-10000, 10000]
Initialize self.done, reset the Player's and Dealer's hands with self.game.reset_game(), bet chips (Bet), and deal the cards (Deal). As described under the step method, self.done is a Boolean indicating whether the round has been decided. self.observe() observes and returns the four state values. Note that for this training setup, the Player's chip balance is kept topped up so it never runs out.
reset()
def reset(self):
    # Initialize the state and return the initial observation
    self.done = False
    self.game.reset_game()
    self.game.bet(bet=100)
    self.game.player.chip.balance = 1000  # Keep the balance topped up so it never runs out during training
    self.game.deal()
    return self.observe()
The Player takes one of Stand, Hit, Double down, or Surrender in the environment. When the player's turn ends, the chips are settled. Finally, the following four values are returned.
- observation: The observed state of the environment.
- reward: The amount of reward earned by the action.
- done: A Boolean indicating whether the environment should be reset; in BlackJack, whether the round has been decided.
- info: A dictionary for useful debugging information.
Also, in this learning environment, choosing Double down or Surrender after having hit is a rules violation and incurs a penalty.
step()
def step(self, action):
    # Advance one step: execute the action and return
    # observation, reward, done (has the round finished?) and info (a dict of extra information)
    if action == 0:
        action_str = 's'  # Stand
    elif action == 1:
        action_str = 'h'  # Hit
    elif action == 2:
        action_str = 'd'  # Double down
    elif action == 3:
        action_str = 'r'  # Surrender
    else:
        raise ValueError("Undefined Action: " + str(action))
    hit_flag_before_step = self.game.player.hit_flag
    self.game.player_step(action=action_str)
    if self.game.player.done:
        # The player's turn is over
        self.game.dealer_turn()
        self.game.judge()
        reward = self.get_reward()
        self.game.check_deck()
        print(str(self.game.judgment) + " : " + str(reward))
    elif action >= 2 and hit_flag_before_step:
        reward = -1e3  # Penalty for violating the rules
    else:
        # The player's turn continues
        reward = 0
    observation = self.observe()
    self.done = self.is_done()
    return observation, reward, self.done, {}
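The rule-violation check in step (Double down or Surrender after a hit) can be isolated as a small helper (a standalone sketch with illustrative names, mirroring the condition above):

```python
DOUBLE_DOWN, SURRENDER = 2, 3

def is_rule_violation(action, hit_flag_before_step):
    """Double down and Surrender are only allowed before the first hit."""
    return action in (DOUBLE_DOWN, SURRENDER) and hit_flag_before_step
```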
This time, the render, close, and seed methods are not used.
The whole code of blackjack_env.py looks like this:
myenv/env/blackjack_env.py
import gym
import gym.spaces
import numpy as np

from myenv.env.blackjack import Game


class BlackJackEnv(gym.Env):
    metadata = {'render.modes': ['human', 'ansi']}

    def __init__(self):
        super().__init__()
        self.game = Game()
        self.game.start()
        # Set action_space, observation_space and reward_range
        self.action_space = gym.spaces.Discrete(4)  # stand, hit, double down, surrender
        high = np.array([
            30,  # player max
            30,  # dealer max
            1,   # is_soft_hand
            1,   # hit flag true
        ])
        low = np.array([
            2,  # player min
            1,  # dealer min
            0,  # is_soft_hand false
            0,  # hit flag false
        ])
        self.observation_space = gym.spaces.Box(low=low, high=high)
        self.reward_range = [-10000, 10000]  # Minimum and maximum reward
        self.done = False
        self.reset()

    def reset(self):
        # Initialize the state and return the initial observation
        self.done = False
        self.game.reset_game()
        self.game.bet(bet=100)
        self.game.player.chip.balance = 1000  # Keep the balance topped up so it never runs out during training
        self.game.deal()
        return self.observe()

    def step(self, action):
        # Advance one step: execute the action and return
        # observation, reward, done (has the round finished?) and info (a dict of extra information)
        if action == 0:
            action_str = 's'  # Stand
        elif action == 1:
            action_str = 'h'  # Hit
        elif action == 2:
            action_str = 'd'  # Double down
        elif action == 3:
            action_str = 'r'  # Surrender
        else:
            raise ValueError("Undefined Action: " + str(action))
        hit_flag_before_step = self.game.player.hit_flag
        self.game.player_step(action=action_str)
        if self.game.player.done:
            # The player's turn is over
            self.game.dealer_turn()
            self.game.judge()
            reward = self.get_reward()
            self.game.check_deck()
            print(str(self.game.judgment) + " : " + str(reward))
        elif action >= 2 and hit_flag_before_step:
            reward = -1e3  # Penalty for violating the rules
        else:
            # The player's turn continues
            reward = 0
        observation = self.observe()
        self.done = self.is_done()
        return observation, reward, self.done, {}

    def render(self, mode='human', close=False):
        # Visualize the environment
        # 'human' prints to the console; 'ansi' returns a StringIO
        pass

    def close(self):
        # Close the environment and clean up
        pass

    def seed(self, seed=None):
        # Fix the random seed
        pass

    def get_reward(self):
        # Return the reward: net chips won or lost this round
        reward = self.game.pay_chip() - self.game.player.chip.bet
        return reward

    def is_done(self):
        return bool(self.game.player.done)

    def observe(self):
        if self.game.player.done:
            observation = tuple([
                self.game.player.hand.calc_final_point(),
                self.game.dealer.hand.calc_final_point(),  # Dealer's total score
                int(self.game.player.hand.is_soft_hand),
                int(self.game.player.hit_flag)])
        else:
            observation = tuple([
                self.game.player.hand.calc_final_point(),
                self.game.dealer.hand.hand[0].point,  # Dealer's up card only
                int(self.game.player.hand.is_soft_hand),
                int(self.game.player.hit_flag)])
        return observation
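The reward computed in get_reward is simply the net chip movement: whatever the game pays out minus the bet. For the fixed 100-chip bet used here, the per-round rewards work out as follows (a standalone sketch of that arithmetic; the payout values assume standard even-money blackjack rules, not anything specific to this game code):

```python
BET = 100

def net_reward(payout):
    """Net chips for the round: what the game pays out minus the bet."""
    return payout - BET

# Typical rounds under a fixed 100-chip bet:
win  = net_reward(200)  # an even-money win pays back 2x the bet -> +100
push = net_reward(100)  # a push returns the bet -> 0
loss = net_reward(0)    # a loss pays nothing -> -100
```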
myenv/__init__.py
Register BlackJackEnv with Gym using the gym.envs.registration.register function. Here we declare that the BlackJackEnv class under the env directory under the myenv directory will be called with the ID BlackJack-v0.
myenv/__init__.py
from gym.envs.registration import register

register(
    id='BlackJack-v0',
    entry_point='myenv.env:BlackJackEnv',
)
myenv/env/__init__.py
Declare that the BlackJackEnv class is in blackjack_env.py under the env directory under the myenv directory.
myenv/env/__init__.py
from myenv.env.blackjack_env import BlackJackEnv
In the reinforcement learning code, you can now use the environment with `env = gym.make('BlackJack-v0')`.
Since this article focuses on registering the environment, I will omit the learning code; the next article will cover it.
I registered my own blackjack game as an OpenAI Gym environment. I realized I had to think carefully about what counts as an action, what to observe as the state, what to use as the reward, and how far a single step should advance the environment I made. At first, I had made a single step ridiculously long.
Next, I would like to use this environment to learn blackjack strategies.