[PYTHON] [Reinforcement learning] Bandit task

Introduction

I implemented the $n$-armed bandit task in Python, using the textbook ["Reinforcement Learning"](http://www.amazon.co.jp/dp/4627826613) by Richard S. Sutton as a reference.

Structure of this article

Reinforcement learning

Overview

Reinforcement learning is about learning which actions to select in order to maximize reward. In supervised learning the correct action is given directly, whereas in reinforcement learning actions are selected according to some policy, and those actions are evaluated and updated using the reward obtained from the environment. By acting according to the learned policy, reward can be maximized.

Components

The components of reinforcement learning are briefly described below.

The agent observes the environment and selects an action based on the observed state and its value function. The environment feeds back a reward for that action, and this reward is used to update the value of the action.

Bandit task

Rules

A slot machine with $n$ levers is called an $n$-armed bandit. You select one of the $n$ levers and pull it to obtain a reward. First, for each action $a$, the true value $Q^{\ast}(a)$ is generated from a Gaussian distribution with mean $0$ and variance $1$. Then, the reward obtained from the bandit machine when action $a$ is selected is generated from a Gaussian distribution with mean $Q^{\ast}(a)$ and variance $1$. We create multiple bandit machines, and pulling a lever once on every machine is called one play. Reward is maximized by learning over many plays.
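
As an illustration, here is a minimal sketch of this reward-generation process (the variable names `q_star`, `n_action` and the chosen action are my own, not from the textbook):

```python
import numpy

n_action = 10  # number of levers on one machine

# true value Q*(a) of each action, drawn from N(0, 1)
q_star = numpy.random.normal(0.0, 1.0, n_action)

# reward observed when lever a is pulled, drawn from N(Q*(a), 1)
a = 3
reward = numpy.random.normal(q_star[a], 1.0)
print(reward)
```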

Sample-average method

The value of an action can be estimated by averaging the rewards obtained when that action is selected. This is called the sample-average method. Let the true value of action $a$ be $Q^{\ast}(a)$ and its estimate at the $t$-th play be $Q_t(a)$. If action $a$ has been selected $k_a$ times up to the $t$-th play and the rewards obtained for those selections are $r_{1}, r_{2}, \ldots, r_{k_a}$, the value of action $a$ up to the $t$-th play is defined as follows.

$$
Q_{t}(a) = \frac{r_{1} + r_{2} + \cdots + r_{k_a}}{k_a}
$$

As $k_a \rightarrow \infty$, $Q_t(a) \rightarrow Q^{\ast}(a)$ by the law of large numbers.
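
The average does not have to be recomputed from all past rewards each time; it can be maintained incrementally, which is also how the Q table is updated in the implementation below. A minimal sketch, with names of my own choosing:

```python
def update_estimate(q, k, reward):
    """Incremental sample average: fold a new reward into the current
    estimate q, which is based on k previous rewards."""
    k += 1
    q += (reward - q) / k
    return q, k

# example: rewards 1.0, 3.0, 2.0 give running averages 1.0, 2.0, 2.0
q, k = 0.0, 0
for r in [1.0, 3.0, 2.0]:
    q, k = update_estimate(q, k, r)
    print(q)
```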

Action selection rules

The greedy method and the $\varepsilon$-greedy method are explained here as action selection rules. The greedy method selects the action with the highest estimated value: at the $t$-th play it chooses a greedy action $a^{\ast}$ such that $Q_t(a^{\ast}) = \max_{a} Q_t(a)$. The $\varepsilon$-greedy method selects a random action with probability $\varepsilon$ and a greedy action with probability $1-\varepsilon$. Because the greedy method always chooses the action it currently estimates to be best and never tries anything else, it cannot discover an action that is actually better than its current best. The $\varepsilon$-greedy method, on the other hand, behaves greedily most of the time but occasionally acts at random with probability $\varepsilon$, which makes it possible to find actions that are better than the current best.
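
A minimal sketch of the $\varepsilon$-greedy selection rule (the class below does the same thing per machine; the function and variable names here are my own):

```python
import numpy

def select_action(q_values, epsilon):
    """Return a random action with probability epsilon,
    otherwise a greedy action (ties broken at random)."""
    if numpy.random.rand() < epsilon:
        return numpy.random.randint(len(q_values))
    greedy = numpy.flatnonzero(q_values == q_values.max())
    return numpy.random.choice(greedy)

# example: with epsilon = 0.1 this mostly returns 2, and occasionally explores
print(select_action(numpy.array([0.1, -0.5, 0.8]), 0.1))
```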

Implementation

I implemented it as follows. I prepared $2000$ bandit machines, each with $10$ levers, and learned over $2000$ plays. Here, one play means pulling one lever on each of the $2000$ machines. (The results differ slightly from the textbook because some of the rules were changed.)

bandit.py


import numpy
from matplotlib import pyplot
import random
import sys

class Bandit:

	# constructor
	def __init__(self, n_machine, n_action):
		for i in range(n_action):
			_qstar = numpy.random.normal(0.0, 1.0)
			_machine = numpy.random.normal(_qstar, 1.0, n_machine).reshape((-1, 1))
			if i == 0:
				self.machine = _machine
			else:
				self.machine = numpy.hstack((self.machine, _machine))


# public method
	def play(self, n_play, epsilon):
		
		self.q = numpy.zeros(self.machine.shape)
		self.q_count = numpy.zeros(self.machine.shape)
		average_reward = numpy.zeros(n_play)
		n_machine = self.machine.shape[0]

		for _p in range(n_play):
			total = 0.0
			for mac_index in range(n_machine):
				act_index = self.__select_action(mac_index, epsilon)
				reward = self.machine[mac_index, act_index]
				total += reward
				self.__update_qtable(reward, mac_index, act_index)

			average_reward[_p] = total / n_machine
			self.__display(_p, average_reward[_p])

		return average_reward


# private method
	def __select_action(self, mac_index, epsilon):
		if numpy.random.rand() > epsilon:
			act_index = self.__select_greedy_action(mac_index)
		else:
			act_index = self.__select_random_action()

		return act_index


	def __select_greedy_action(self, mac_index):
		# choose among the actions with the maximum estimated value, breaking ties at random
		_max = self.q[mac_index, :].max()
		indexes = numpy.argwhere(self.q[mac_index, :] == _max)
		random.shuffle(indexes)
		return int(indexes[0])


	def __select_random_action(self):
		# pick any of the available actions uniformly at random
		return numpy.random.randint(self.machine.shape[1])


	def __update_qtable(self, reward, mac_index, act_index):
		_q = self.q[mac_index, act_index]
		self.q_count[mac_index, act_index] += 1
		self.q[mac_index, act_index] = _q + (reward - _q) / self.q_count[mac_index, act_index]


	def __display(self, play, average_reward):
		if (play + 1) % 100 == 0:
			print('play: %d, average reward: %f' % (play + 1, average_reward))

main.py


from bandit import *

if __name__ == '__main__':

	# param
	param = sys.argv

	# init
	n_machine = 2000
	n_action = 10
	n_play = 2000
	epsilon = [0.0, 0.01, 0.1, 1.0]

	# draw init
	margin = 5
	color = ['b', 'g', 'r', 'k']
	pyplot.figure(figsize = (8, 6))
	pyplot.xlim(-margin, n_play + margin)

	# bandit machine
	bandit = Bandit(n_machine, n_action)

	# play
	for i in range(len(epsilon)):
		print('play count: %d, epsilon: %.2f' % (n_play, epsilon[i]))
		average_reward = bandit.play(n_play, epsilon[i])
		_label = 'e = %.2f' % epsilon[i]
		pyplot.plot(numpy.arange(n_play), average_reward, color = color[i], label = _label)
		print('!!!finish!!!\n')

	# save and show
	if '-d' in param or '-s' in param:
		pyplot.legend(loc = 'center right')
		if '-s' in param:
			pyplot.savefig('bandit2.png')
		if '-d' in param:
			pyplot.show()

Result

The following results were obtained. With $\varepsilon = 0.0$, a greedy action is always chosen, so once an action with a value of $0$ or more is found, that action keeps being chosen; as a result, the average reward remains almost constant. With $\varepsilon = 1.0$, actions are always chosen at random, so the average reward does not exceed a certain range no matter how many plays are made. With $\varepsilon = 0.1$, a random action is taken with probability $10\%$ and a greedy action with probability $90\%$. This setting finds the best actions fastest, but because it still acts randomly with probability $10\%$ even after learning, the average reward stays around $1.8$. With $\varepsilon = 0.01$, a random action is taken with probability $1\%$ and a greedy action with probability $99\%$. Convergence is slower than with $\varepsilon = 0.1$, but from around $1000$ plays onward its performance is the best.

bandit2.png

In conclusion

We implemented the $n$-armed bandit task in Python and confirmed how the learning process changes as $\varepsilon$ is varied.
