[PYTHON] Multi-armed bandit problem

Summary

I tried to solve the multi-armed bandit problem with Thompson Sampling.

What is the multi-armed bandit problem?

(For the Bernoulli bandit.) You have several slot machines; each play either hits or misses. Each machine has its own win probability, but those probabilities are unknown. The goal is to win as much as possible within a fixed number of plays.

This variant is called the Bernoulli bandit, because the reward of each play follows a [Bernoulli distribution](https://ja.wikipedia.org/wiki/%E3%83%99%E3%83%AB%E3%83%8C%E3%83%BC%E3%82%A4%E5%88%86%E5%B8%83): a discrete distribution that takes the value 1 with probability p and 0 with probability 1 - p.
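As a concrete illustration (the 0.3 win probability below is a made-up example value; in the real problem it is hidden), a single arm's plays can be simulated by drawing from a Bernoulli distribution:

```python
import scipy.stats

# One slot machine ("arm") with win probability p: each play returns
# 1 (hit) with probability p and 0 (miss) with probability 1 - p.
p = 0.3  # illustrative value; in the actual problem this is unknown
pulls = scipy.stats.bernoulli.rvs(p, size=10000, random_state=0)

print(pulls[:10])     # a mix of 0s and 1s
print(pulls.mean())   # close to 0.3 for a large sample
```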

The rough idea of the approach is as follows.

What is Thompson Sampling?

One of the algorithms for solving the multi-armed bandit problem.

The beta distribution appears here because it is the conjugate prior of the Bernoulli distribution: when the likelihood is Bernoulli and the prior is a beta distribution, the posterior is also a beta distribution [2].
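Concretely, starting from a uniform Beta(1, 1) prior, observing S hits and F misses on an arm gives a Beta(1 + S, 1 + F) posterior, and Thompson Sampling simply draws one sample from each arm's posterior and plays the arm with the largest sample. A minimal sketch with made-up counts:

```python
import numpy.random

# Conjugate update: Beta(a, b) prior + Bernoulli observations
# -> Beta(a + S, b + F) posterior. Start from a uniform Beta(1, 1).
S = [40, 5]   # illustrative hit counts for two arms
F = [10, 5]   # illustrative miss counts

numpy.random.seed(0)  # only to make this sketch reproducible
samples = [numpy.random.beta(1 + s, 1 + f) for s, f in zip(S, F)]
chosen = max(range(len(samples)), key=lambda i: samples[i])
print(samples)  # one draw from each arm's posterior
print(chosen)   # the arm Thompson Sampling would play this round
```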

I implemented it like this.

Only the success and failure counts (`Sat` and `Fat`) are actually needed, but `self.results` stores the raw outcomes in case I want them later.

```python
import scipy.stats
import numpy.random


# Bandit machine simulator: one play of an arm with win probability `prob`.
def getResult(prob):
    return scipy.stats.bernoulli.rvs(prob, size=1)[0]


# Multi-armed bandit problem solver using Thompson Sampling.
class BernoulliBandit:
    def __init__(self, numBandits):
        self.numBandits = numBandits
        self.results = {i: [] for i in range(self.numBandits)}

    def getBandit(self):
        # Draw one sample from each arm's Beta posterior and play the argmax.
        posteriors = []
        for b in range(self.numBandits):
            Sat = len([r for r in self.results[b] if r == 1])
            Fat = len([r for r in self.results[b] if r == 0])
            posteriors.append(numpy.random.beta(Sat + 1, Fat + 1))
        return numpy.array(posteriors).argmax()

    def feed(self, selectedBandit, result):
        self.results[selectedBandit].append(result)

    def __str__(self):
        out = ""
        for b in range(self.numBandits):
            Sat = len([r for r in self.results[b] if r == 1])
            Fat = len([r for r in self.results[b] if r == 0])
            # Successes (Sat) first, then failures (Fat), matching the labels.
            out += "Bandit[%d]   Successes: %6d\t  Fails: %4d\n" % (b, Sat, Fat)
        return out


if __name__ == "__main__":
    # set parameters
    numBandits = 3
    numTrials = 5000
    rewards = [0.9, 0.9, 0.4]   # true expected rewards (hidden from the solver)

    bandit = BernoulliBandit(numBandits)
    for t in range(numTrials):
        if t % 100 == 0:
            print(t, end=" ")
        b = bandit.getBandit()
        r = getResult(rewards[b])
        bandit.feed(b, r)
    print()
    print(bandit)
    print("Rewards", rewards)
```

The execution result is as follows. At first the counts seem not to match the true expected rewards, but the cause is a bug in `__str__`: the format string printed the success count (`Sat`) under the "Fails" label and the failure count (`Fat`) under "Successes". With the labels the right way around, the output reads:

```
0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 3100 3200 3300 3400 3500 3600 3700 3800 3900 4000 4100 4200 4300 4400 4500 4600 4700 4800 4900
Bandit[0]   Successes:    250	  Fails:   40
Bandit[1]   Successes:   4261	  Fails:  446
Bandit[2]   Successes:      0	  Fails:    3

Rewards [0.9, 0.9, 0.4]
```

Bandit 1 receives the overwhelming majority of the 5000 plays and wins at roughly its true rate (4261 / 4707 ≈ 0.905). Bandits 0 and 1 are tied at 0.9, and the sampler settled on bandit 1, which is exactly the behavior Thompson Sampling should show.
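Since `self.results` keeps every raw 0/1 outcome, a quick sanity check is to compare each arm's empirical success rate and play count against the true rewards; the best arm should soak up almost all of the plays. A small standalone sketch (the outcome lists below are made-up stand-ins for `bandit.results`):

```python
# Illustrative stand-in for bandit.results after a run.
results = {0: [1, 0, 1, 1], 1: [1] * 90 + [0] * 10, 2: [0, 0, 1]}
rewards = [0.9, 0.9, 0.4]  # true expected rewards from the experiment

for b in sorted(results):
    outcomes = results[b]
    n = len(outcomes)
    rate = sum(outcomes) / n if n else float("nan")
    print("Bandit[%d]  plays=%4d  empirical=%.3f  true=%.2f"
          % (b, n, rate, rewards[b]))
```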

The notebook is also available as a Jupyter gist.

References

  1. http://ibisml.org/archive/ibis2014/ibis2014_bandit.pdf
  2. http://jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf
