[Python] Trying to build a blackjack strategy with reinforcement learning (③ Reinforcement learning in your own OpenAI Gym environment)

Introduction

While studying Python and reinforcement learning, I tried to learn a strategy for blackjack. A probability-based strategy known as the basic strategy already exists, and the goal here is to see how close reinforcement learning can get to it.

The series proceeds in three steps:

  1. Blackjack implementation
  2. Register in the OpenAI gym environment
  3. Learn a blackjack strategy through reinforcement learning ← this article

Development environment

Coding Reinforcement Learning

This time, we will use Q-Learning, which is one of the basic reinforcement learning algorithms.

File organization

The file structure is as follows. The reinforcement learning code created this time is "q-learning_blackjack.py"; the other files were created in "② Register the environment in gym".

├─ q-learning_blackjack.py
└─ myenv
    ├─ __init__.py  ---> Registers BlackJackEnv with Gym (see the sketch below)
    └─ env
       ├─ __init__.py  ---> Indicates where BlackJackEnv is located
       ├─ blackjack.py  ---> The blackjack game itself
       └─ blackjack_env.py  ---> Defines the BlackJackEnv class, which inherits from OpenAI Gym's gym.Env
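
For reference, the environment becomes available to `gym.make` once the `myenv` package has been imported, because the package's `__init__.py` registers it. The following is only a minimal sketch of what that registration might look like; the exact id and entry-point path are assumptions based on the directory tree above, and the real file is created in "② Register the environment in gym".

```python
# myenv/__init__.py -- minimal sketch (the actual file comes from part ②).
# Importing this package registers the custom environment so that
# gym.make('BlackJack-v0') can find it. The entry_point path is assumed
# from the directory tree above.
from gym.envs.registration import register

register(
    id='BlackJack-v0',
    entry_point='myenv.env:BlackJackEnv',
)
```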

Coding

Agent class

self.Q is a table of Q values that is updated as learning progresses; I call it the Q table here. For each **state** (Player's points, Dealer's points, whether the Player holds an Ace, whether the Player has already hit), the Q table holds the value of each action the Player can take: **Stand**, **Hit**, **Double Down**, and **Surrender**.

ex_Qtable.png
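
As a concrete illustration, a single entry of the Q table might look like the following. The exact state tuple layout and the action order are only assumptions matching the description above, and the numbers are made up for the example.

```python
# Hypothetical Q-table entry: the key is the state
# (Player points, Dealer points, Player holds an Ace, Player has already hit),
# the value is the list of Q values for [Stand, Hit, Double Down, Surrender].
Q = {
    (16, 10, False, False): [-0.35, -0.21, -0.48, -0.25],
}
# The greedy action for this state is the index of the largest value,
# i.e. index 1 (Hit) in this made-up example.
```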

The policy method selects an action using the ε-greedy method: with probability `epsilon` it picks a random action, and with probability `1 - epsilon` it picks the action with the highest value in the Q table.

Agent class


```python
import numpy as np
import matplotlib.pyplot as plt


class Agent():
    def __init__(self, epsilon):
        self.Q = {}                # Q table: state -> list of action values
        self.epsilon = epsilon     # exploration rate for the ε-greedy policy
        self.reward_log = []

    def policy(self, state, actions):
        # ε-greedy: explore with probability epsilon, otherwise exploit the Q table.
        if np.random.random() < self.epsilon:
            return np.random.randint(len(actions))
        else:
            if state in self.Q and sum(self.Q[state]) != 0:
                return np.argmax(self.Q[state])
            else:
                return np.random.randint(len(actions))

    def init_log(self):
        self.reward_log = []

    def log(self, reward):
        self.reward_log.append(reward)

    def show_reward_log(self, interval=100, episode=-1):
        if episode > 0:
            # Print the mean and standard deviation of the most recent rewards.
            rewards = self.reward_log[-interval:]
            mean = np.round(np.mean(rewards), 3)
            std = np.round(np.std(rewards), 3)
            print("At Episode {} average reward is {} (+/-{}).".format(episode, mean, std))
        else:
            # Plot the reward history, aggregated every `interval` episodes.
            indices = list(range(0, len(self.reward_log), interval))
            means = []
            stds = []
            for i in indices:
                rewards = self.reward_log[i:(i + interval)]
                means.append(np.mean(rewards))
                stds.append(np.std(rewards))
            means = np.array(means)
            stds = np.array(stds)
            plt.figure()
            plt.title("Reward History")
            plt.xlabel("episode")
            plt.ylabel("reward")
            plt.grid()
            plt.fill_between(indices, means - stds, means + stds, alpha=0.2, color="g")
            plt.plot(indices, means, "o-", color="g", label="Rewards for each {} episode".format(interval))
            plt.legend(loc="best")
            plt.savefig("Reward_History.png")
            plt.show()
```

QLearningAgent class

It inherits from the Agent class created above. The learn method is where the learning happens; one episode corresponds to one blackjack game. `a = self.policy(s, actions)` selects an action according to the current state, and `n_state, reward, done, info = env.step(a)` observes the next state and the reward obtained by actually taking that action. The step function is the one implemented in "② Register the environment in gym".

The following three lines of code correspond to the Q-Learning update formula:

Q(s_t, a_t)\leftarrow(1-\alpha)Q(s_t, a_t)+\alpha(r_{t+1}+\gamma \max_{a_{t+1}}Q(s_{t+1}, a_{t+1}))

Here $\gamma$ (`gamma`) is the discount rate, a parameter that controls how much future value is discounted, and $\alpha$ (`learning_rate`) is the learning rate, a parameter that controls how strongly each update changes the Q value.

Q-Learning formula


```python
gain = reward + gamma * max(self.Q[n_state])        # r + γ max_a' Q(s', a')
estimated = self.Q[s][a]                            # current estimate of Q(s, a)
self.Q[s][a] += learning_rate * (gain - estimated)  # move Q(s, a) toward the gain by α
```

QLearningAgent class


```python
from collections import defaultdict


class QLearningAgent(Agent):
    def __init__(self, epsilon=0.1):
        super().__init__(epsilon)

    def learn(self, env, episode_count=1000, gamma=0.9,
              learning_rate=0.1, render=False, report_interval=5000):
        self.init_log()
        actions = list(range(env.action_space.n))
        # Unseen states start with a value of 0 for every action.
        self.Q = defaultdict(lambda: [0] * len(actions))
        for e in range(episode_count):
            s = env.reset()
            done = False
            reward_history = []
            while not done:
                if render:
                    env.render()
                a = self.policy(s, actions)
                n_state, reward, done, info = env.step(a)

                reward_history.append(reward)
                # Q-Learning update (see the formula above).
                gain = reward + gamma * max(self.Q[n_state])
                estimated = self.Q[s][a]
                self.Q[s][a] += learning_rate * (gain - estimated)
                s = n_state
            else:
                # The while loop always ends without break, so log the episode's total reward.
                self.log(sum(reward_history))

            if e != 0 and e % report_interval == 0:
                self.show_reward_log(episode=e, interval=50)
        env.close()
```

train function

The custom blackjack environment is loaded with `env = gym.make('BlackJack-v0')`.

For how it was created, please refer to "① Implementation of blackjack" and "② Register the environment in gym".

I created a save_Q method to save the Q value table and a show_reward_log method to display the reward log history.
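
The save_Q method itself is not shown in this article, so the following is only a minimal sketch of one way it could be written inside the agent class, using pickle. The file name is arbitrary, and the defaultdict is converted to a plain dict first because its lambda default factory cannot be pickled.

```python
import pickle

# Sketch of a possible save_Q, meant to live inside the Agent (or QLearningAgent) class.
def save_Q(self, filename="Q_table.pkl"):
    # Convert the defaultdict to a plain dict so it can be pickled.
    with open(filename, "wb") as f:
        pickle.dump(dict(self.Q), f)
```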

train function and execution part


```python
import gym

import myenv  # importing the package registers BlackJack-v0 (see part ②)


def train():
    agent = QLearningAgent()
    env = gym.make('BlackJack-v0')
    agent.learn(env, episode_count=50000, report_interval=1000)
    agent.save_Q()
    agent.show_reward_log(interval=500)


if __name__ == "__main__":
    train()
```

Learning results

The learning results are shown below. The horizontal axis is the episode number and the vertical axis is the reward. The green line is the average reward over each 500 episodes, and the green band is the standard deviation of the reward over those 500 episodes. The curve is almost flat after about 20,000 episodes, and, somewhat sadly, the average reward is still below 0 even after 50,000 episodes of learning...

Reward_History_epi50000.png

Comparison with basic strategy

Let's compare the learned Q table with the basic strategy. For each state in the Q table, extract the action with the highest Q value, and build strategy tables for hard hands and soft hands in the same format as the basic strategy.
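
The extraction code is not shown in the article; the following is a minimal sketch of how it could be done, assuming the state tuple layout and the action order described above (Stand, Hit, Double Down, Surrender).

```python
import numpy as np

ACTIONS = ["S", "H", "D", "R"]  # assumed order: Stand, Hit, Double Down, Surrender

def extract_strategy(Q):
    """Map each (Player points, Dealer points) cell to the action with the highest Q value."""
    hard, soft = {}, {}
    for (player, dealer, has_ace, _already_hit), values in Q.items():
        table = soft if has_ace else hard
        # States that differ only in the "already hit" flag simply overwrite each other here.
        table[(player, dealer)] = ACTIONS[int(np.argmax(values))]
    return hard, soft
```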

In the figure below, the left column is the strategy learned by Q-Learning and the right column is the basic strategy. The upper row shows hard hands (no Ace in the hand) and the lower row shows soft hands (an Ace in the hand). In each table, the rows are the Player's points, the columns are the Dealer's points, and the letters indicate the action the Player should take.

The split function is not implemented in this self-made blackjack. Therefore, actions are assigned even when the hard hand player has 4 points (2, 2) and the soft hand player has 12 points (A, A).

strategy_table.png

Comparing the learned result with the basic strategy, the broad tendency is the same: Hit when the Player's points are low and Stand when they are high. I'm glad it managed to learn at least this much. Looking at the details, however, the learned strategy tends to Hit on a soft 19. Hitting there can never cause a Bust, but simply Standing is already a strong position, so this is something it failed to learn well. I wonder why... There are also fewer Double Downs and more Surrenders, which suggests a tendency to avoid risk and minimize losses.

I actually played it

Using the learned Q table and the basic strategy, I played 100 games × 1000 trials. In each game $100 is bet, and the average chips earned over the 100 games is computed for each of the 1000 trials. The histogram of these average winnings is shown below.
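
The evaluation code is not included in the article either; the sketch below shows the kind of loop used, under the assumptions that actions are chosen greedily from the Q table, that unseen states fall back to action 0 (Stand), and that the reward returned by the environment is the win/loss multiple of the $100 bet.

```python
import numpy as np

def evaluate(env, Q, n_games=100, bet=100):
    """Play n_games with the learned Q table and return the average chips won per game."""
    total = 0.0
    for _ in range(n_games):
        s = env.reset()
        done = False
        ep_reward = 0.0
        while not done:
            # Greedy action from the Q table; unseen states fall back to action 0 (Stand).
            a = int(np.argmax(Q[s])) if s in Q else 0
            s, reward, done, _ = env.step(a)
            ep_reward += reward
        total += ep_reward * bet  # assumption: reward is the win/loss multiple of the bet
    return total / n_games
```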

Reward_Histogram_Basic_Q.png

The further a distribution sits to the right in the figure, the better the result, and the basic strategy performs better. For reference, the average was -$8.2 per game with the Q table and -$3.2 with the basic strategy. The highest average winnings came from the basic strategy, but so did the lowest: the basic strategy has a wider distribution. The Q-table distribution is narrower, probably because it Double Downs less and Surrenders more.

In conclusion

Over three articles, I created my own blackjack environment and tried to learn a strategy with reinforcement learning. The result did not beat the basic strategy, but it deepened my understanding of reinforcement learning and programming, and there is still room to improve the learning. Since the environment is my own, I could also expose information such as the expected value of the cards remaining in the deck. That is a bit of a cheat, but I'd like to experiment with it. (Of course, it can't be used at casinos...)

