I tried to implement a prisoner's dilemma game with observation errors in Python

I implemented a prisoner's dilemma game with observation errors in Python, and used the program to repeatedly simulate and experiment with strategies for the repeated prisoner's dilemma game.

Prisoner's dilemma games are often used, mainly in economics, to analyze cooperative behavior such as collusion between companies; there are many other applications as well.

In the conventional prisoner's dilemma game model, it is assumed that **the opponent's behavior can be directly observed after the fact (perfect monitoring)**, and that each player can decide the next action on that basis. In the real world, however, it is often impossible to observe the opponent's behavior perfectly, and the conventional model cannot capture such situations well.

For this reason, prisoner's dilemma games with **imperfect monitoring** have been actively studied in recent years. Under imperfect monitoring, instead of directly observing the opponent's action, each player observes a **signal** that depends on that action.

This time, we will simulate this repeated prisoner's dilemma game with imperfect observation.

What is a prisoner's dilemma game?

The prisoner's dilemma is one of the most representative games in game theory. It is usually given by the payoff table below.

[図1.png: payoff table of the one-shot prisoner's dilemma game]

The rules of this game are that the two players $X$ and $Y$ each choose to cooperate ($C$: Cooperate) or defect ($D$: Defect). Each player's payoff is determined by the pair of chosen actions.

If $X$ and $Y$ both choose $C$, each gets a reasonably good payoff of $1$. If one chooses $C$ and the other chooses $D$, the player who chose $C$ gets the lowest payoff, $-0.5$, while the player who chose $D$ gets the highest payoff, $1.5$. If both choose $D$, each gets a small payoff of $0$. Mutual cooperation yields the best total payoff for the pair, $2\,(= 1 + 1)$, while a player who is unilaterally betrayed loses heavily.

Thus each player faces a dilemma: both would really like to cooperate, but each is tempted to defect.
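As a quick check in Python, the dilemma comes from the fact that $D$ strictly dominates $C$ in the one-shot game (the payoff values are those from the table above):

```python
# One-shot payoffs to the row player: (own action, opponent's action) -> score.
payoff = {('C', 'C'): 1.0, ('C', 'D'): -0.5, ('D', 'C'): 1.5, ('D', 'D'): 0.0}

# Whatever the opponent does, defecting pays strictly more -- the source of the dilemma.
for opp in ('C', 'D'):
    assert payoff[('D', opp)] > payoff[('C', opp)]
print('D strictly dominates C')
```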

In a single play of this game, cooperation between selfish players is not realized (the Nash equilibrium is mutual defection). On the other hand, in the **repeated prisoner's dilemma game**, in which this game is repeated indefinitely, it has been proved theoretically that cooperation can be sustained even among selfish players (the folk theorem).

Accordingly, which strategies are strong in this repeated prisoner's dilemma game has been studied extensively (Axelrod's tournaments).

Repeated games with imperfect private monitoring

This differs from the traditional prisoner's dilemma game, which assumes **perfect monitoring**: every action a player has taken in the past can be observed exactly. Under **imperfect private monitoring**, players cannot directly observe each other's past actions; instead, each player observes a signal that depends on the opponent's previous action. "Private" means that each player receives an individual signal that the opponent cannot observe.

Concrete example

A prisoner's dilemma game with imperfect private monitoring can model **price competition** between retailers, as follows.

In retail price competition, two stores selling the same product each try to secure sales by either maintaining the product's price or lowering it.

A store's actual profit is determined by the number of customers (the signal) and the price of the product (the store's own action). For example, if both stores maintain their prices (cooperate), customers are not drawn disproportionately to either store, both can sell at the high price, and both secure reasonably high profits.

In such a situation, consider the following case. Normally, if each store maintains its price, both obtain high profits; but the number of customers fluctuates for various reasons. For example, an unpredictable demand shock, or a similar store opening nearby without either player noticing, may reduce a store's customer count. The point is that the price (action) each store chooses is not directly linked to its profit.

On the other hand, a store can also lower its price to attract customers without the other store finding out (for example, by offering a price below the catalog price only at the register). The traditional prisoner's dilemma game cannot capture such situations.

Such situations are what **repeated games with imperfect private monitoring** are designed to capture.

Game setup

- Define the game.

We repeat the game given by the payoff table below.

[図2.png: payoff table of the game with signals $g$ and $b$]

As before, the two players $X$ and $Y$ each choose to cooperate ($C$: Cooperate) or defect ($D$: Defect). A player's payoff is now determined by their own action and the signal they observe.

When the opponent chooses $C$, a player normally observes a **signal** $g$ (for *good*). In that case, choosing $C$ yields $1$ point, while choosing $D$ yields $1.5$ points. Similarly, when the opponent chooses $D$, the player observes the signal $b$ (for *bad*); then choosing $C$ yields $-0.5$ points and choosing $D$ yields $0$ points.
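These payoffs can be written down as a lookup keyed by a player's own action plus the observed signal; this is exactly the `payoff` dict that appears in the program below:

```python
# Payoff per round given (own action + observed signal).
payoff = {'Cg': 1.0, 'Dg': 1.5, 'Cb': -0.5, 'Db': 0.0}
print(payoff['C' + 'g'])  # cooperated and observed g -> 1.0
```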

The assumptions so far are almost the same as in the traditional prisoner's dilemma game.

- Define the observation errors.

In this model, the signals $g$ and $b$ can be flipped by errors caused by environmental and other factors.

Even if the opponent chooses $C$, you may observe the signal $b$, which suggests the opponent chose $D$. Conversely, even if the opponent chooses $D$, you may observe $g$, which suggests the opponent chose $C$. Observing the signal opposite to the one corresponding to the actual action is called an **observation error**.

Here, **the probability that exactly one player misobserves** (only $X$, or only $Y$) is defined as $\epsilon$ each, and **the probability that both players misobserve simultaneously** is defined as $\xi$. We assume the error rates $\epsilon$ and $\xi$ are not too large.
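As a small sketch of this error model (a helper of my own, not part of the program below, which inlines the same logic):

```python
import random

def apply_observation_errors(sig_x, sig_y, epsilon, xi):
    """Flip exactly one signal with probability epsilon each, or both with probability xi."""
    flip = lambda s: 'b' if s == 'g' else 'g'
    r = random.random()
    if r < epsilon:
        return flip(sig_x), sig_y            # only X misobserves
    if r < 2 * epsilon:
        return sig_x, flip(sig_y)            # only Y misobserves
    if r < 2 * epsilon + xi:
        return flip(sig_x), flip(sig_y)      # both misobserve
    return sig_x, sig_y                      # no error

print(apply_observation_errors('g', 'g', 0.05, 0.05))
```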

- Define the players' strategies.

We define player $X$'s strategy as ${\bf p} = (p_{Cg}, p_{Cb}, p_{Dg}, p_{Db}; p_0)$. This is called a memory-one strategy: a player using it decides the current action ($C$ or $D$) probabilistically, considering only the result of the previous round.

For example, $p_{Cg}$ is the probability that player $X$ cooperates in the current round given that $X$ chose $C$ and observed $g$ in the previous round; $p_{Cb}$ is the probability of cooperating given that $X$ chose $C$ and observed $b$. The other entries are defined analogously. $p_0$ is the probability of cooperating in the first round.

Similarly, player $Y$'s strategy is ${\bf q} = (q_{Cg}, q_{Cb}, q_{Dg}, q_{Db}; q_0)$.
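As a small illustration (my own helper, not from the article), a memory-one strategy can be stored as a dict whose keys are the previous round's own action plus observed signal, with `'0'` for the first round:

```python
import random

# e.g. a Tit-for-Tat-like rule: cooperate exactly when the last signal was g
p = {'Cg': 1.0, 'Cb': 0.0, 'Dg': 1.0, 'Db': 0.0, '0': 1.0}

def next_action(strategy, prev_state):
    """Cooperate with the probability the strategy assigns to the previous state."""
    return 'C' if random.random() < strategy[prev_state] else 'D'

print(next_action(p, '0'))   # first round: C with probability p['0']
print(next_action(p, 'Cb'))  # after playing C and observing b: always D here
```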

As an example, consider the game state transitioning from $CC$ (mutual cooperation) to $CD$ ($X$ cooperates, $Y$ defects): each player observes a signal and moves to the next state according to the action chosen under the strategy defined above. Although it is not used in this simulation, the process can be regarded as a Markov chain, and the game's transition matrix can be defined accordingly.
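Although this simulation does not rely on it, here is a minimal sketch of how that transition matrix over joint action pairs could be built, assuming the error model above and the same dict-based strategy representation as the program below (the helper itself is my own, not from the original article):

```python
import numpy as np

ACTIONS = ['C', 'D']
STATES = [(a, b) for a in ACTIONS for b in ACTIONS]  # joint action pairs (a_X, a_Y)

def flip(s):
    return 'b' if s == 'g' else 'g'

def transition_matrix(p, q, epsilon, xi):
    """4x4 transition matrix over joint actions, marginalizing over the noisy signals."""
    M = np.zeros((4, 4))
    for i, (ax, ay) in enumerate(STATES):
        sx0 = 'g' if ay == 'C' else 'b'  # X's error-free signal
        sy0 = 'g' if ax == 'C' else 'b'  # Y's error-free signal
        # possible signal pairs after observation errors, with their probabilities
        cases = [((sx0, sy0), 1 - 2 * epsilon - xi),
                 ((flip(sx0), sy0), epsilon),
                 ((sx0, flip(sy0)), epsilon),
                 ((flip(sx0), flip(sy0)), xi)]
        for (sx, sy), prob in cases:
            pc_x = p[ax + sx]  # X's cooperation probability next round
            pc_y = q[ay + sy]  # Y's cooperation probability next round
            for j, (nx, ny) in enumerate(STATES):
                M[i, j] += prob * (pc_x if nx == 'C' else 1 - pc_x) \
                                * (pc_y if ny == 'C' else 1 - pc_y)
    return M

# example: both players play the uniform memory-one strategy
p = {'Cg': 1/2, 'Cb': 1/2, 'Dg': 1/2, 'Db': 1/2}
q = {'Cg': 1/2, 'Cb': 1/2, 'Dg': 1/2, 'Db': 1/2}
print(transition_matrix(p, q, 0.05, 0.05))  # every row sums to 1
```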

Python program

The following Python program implements the model described above; by running simulations with it, we can investigate which strategies are strong. Below, the strategies are set to ${\bf p} = (1/2, 1/2, 1/2, 1/2; 1/2)$ and ${\bf q} = (1/2, 1/2, 1/2, 1/2; 1/2)$, the game is repeated $100{,}000$ times, and the error rates are $\epsilon = 0.05$ and $\xi = 0.05$.

```python
import random

max_t = 100000  # number of game repetitions

point  = {'x': 0, 'y': 0}          # each player's total score
action = {'x': None, 'y': None}    # each player's action, C or D
signal = {'x': None, 'y': None}    # the signal each player observes
state  = {'x': '0', 'y': '0'}      # game state: previous (action + signal), '0' = first round
payoff = {'Dg': 1.5, 'Cg': 1.0, 'Db': 0, 'Cb': -0.5}  # payoff per (own action + signal)
epsilon = 0.05; xi = 0.05  # probability of a one-sided error (per player) and of a two-sided error

p = {'Cg': 1/2, 'Cb': 1/2, 'Dg': 1/2, 'Db': 1/2, '0': 1/2}  # player X's strategy
q = {'Cg': 1/2, 'Cb': 1/2, 'Dg': 1/2, 'Db': 1/2, '0': 1/2}  # player Y's strategy

# repeated game
for t in range(max_t):
    # choose C or D based on the result of the previous round
    action['x'] = 'C' if random.random() < p[state['x']] else 'D'
    action['y'] = 'C' if random.random() < q[state['y']] else 'D'
    # error-free signals
    signal['x'] = 'g' if action['y'] == 'C' else 'b'
    signal['y'] = 'g' if action['x'] == 'C' else 'b'
    # observation errors: draw r once so the two branches really have
    # probabilities 2*epsilon and xi (a fresh draw in the elif would distort them)
    r = random.random()
    if r < 2 * epsilon:
        # exactly one player's signal flips (each player with probability epsilon)
        player = 'x' if random.random() < 0.5 else 'y'
        signal[player] = 'b' if signal[player] == 'g' else 'g'
    elif r < 2 * epsilon + xi:
        # both players' signals flip (probability xi)
        for player in ['x', 'y']:
            signal[player] = 'b' if signal[player] == 'g' else 'g'
    # add payoffs based on own action and observed signal
    for player in ['x', 'y']:
        point[player] += payoff[action[player] + signal[player]]
        # store this round's result as the next state
        state[player] = action[player] + signal[player]

# print each player's average score per round
print('Player X: {} Points'.format(point['x'] / max_t))
print('Player Y: {} Points'.format(point['y'] / max_t))
```

Running this program produces results like the following.

```
Player X: 0.49855 Points
Player Y: 0.49827 Points
```

In other words, when ${\bf p} = (1/2, 1/2, 1/2, 1/2; 1/2)$ plays against ${\bf q} = (1/2, 1/2, 1/2, 1/2; 1/2)$, the two players' results are almost identical. This is a natural result.

Next, I would like to try various strategies.

Experimenting with the zero-determinant strategy

The zero-determinant strategy (ZD strategy) is a class of strategies for the prisoner's dilemma game. I previously introduced it in Prisoner's Dilemma Game Opponent Exploitation Strategy (Qiita). A ZD strategy is simple in form and can be written as a memory-one strategy $\bf p$ or $\bf q$ (although the derivation of ZD strategies themselves is somewhat involved). This time, we test these strategies in the prisoner's dilemma game with observation errors.

Well-known subclasses of the ZD strategy include the Equalizer strategy and the Extortion strategy. An Equalizer strategy can fix the opponent's average score; an Extortion strategy (extortion meaning blackmail) can exploit the opponent's score.

That is all the background; let's experiment with these strategies.

Equalizer strategy

No matter what strategy the opponent uses, an Equalizer strategy can fix the opponent's score. Setting the Equalizer strategy that pins the opponent's average score at $0.5$ points, ${\bf p} = (0.8, 0.365217, 0.634783, 0.2; 1/2)$, against ${\bf q} = (1/2, 1/2, 1/2, 1/2; 1/2)$ and running the simulation gives the following results.
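For concreteness, these two vectors map onto the dict representation used by the program above as follows (values copied from the text):

```python
p = {'Cg': 0.8, 'Cb': 0.365217, 'Dg': 0.634783, 'Db': 0.2, '0': 1/2}  # Equalizer
q = {'Cg': 1/2, 'Cb': 1/2, 'Dg': 1/2, 'Db': 1/2, '0': 1/2}            # uniform opponent
```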

```
Player X: 0.496385 Points
Player Y: 0.50177 Points
```

Indeed, `Player Y: 0.50177 Points` shows that the opponent's score is close to $0.5$ points. Since this could be a coincidence, let's also set the opponent's strategy to the strongest strategy in the prisoner's dilemma game, always-defect ($ALLD$): ${\bf q} = (0, 0, 0, 0; 0)$. Running the simulation then gives the following results.
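In the same dict representation, the $ALLD$ strategy is simply:

```python
q = {'Cg': 0, 'Cb': 0, 'Dg': 0, 'Db': 0, '0': 0}  # ALLD: never cooperate
```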

```
Player X: -0.004545 Points
Player Y: 0.50022 Points
```

Again, `Player Y: 0.50022 Points` shows that the opponent's score is close to $0.5$ points.

In the figure below, in addition to the strategies above, the simulation was run with the opponent's strategy $\bf q$ drawn at random: $100$ random strategies $\bf q$ were generated and the program was executed for each. The figure shows that the opponent's average score is pinned at $0.5$ points regardless of the opponent's strategy $\bf q$.
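A sweep like this can be reproduced with the following sketch, which refactors the simulation loop above into a reusable function (`simulate` and its signature are my own restructuring, not part of the original program):

```python
import random

def simulate(p, q, max_t=100000, epsilon=0.05, xi=0.05):
    """Average per-round scores of X (strategy p) and Y (strategy q)."""
    payoff = {'Dg': 1.5, 'Cg': 1.0, 'Db': 0.0, 'Cb': -0.5}
    point = {'x': 0.0, 'y': 0.0}
    state = {'x': '0', 'y': '0'}
    flip = lambda s: 'b' if s == 'g' else 'g'
    for _ in range(max_t):
        ax = 'C' if random.random() < p[state['x']] else 'D'
        ay = 'C' if random.random() < q[state['y']] else 'D'
        sx = 'g' if ay == 'C' else 'b'
        sy = 'g' if ax == 'C' else 'b'
        r = random.random()
        if r < epsilon:                    # only X's signal flips
            sx = flip(sx)
        elif r < 2 * epsilon:              # only Y's signal flips
            sy = flip(sy)
        elif r < 2 * epsilon + xi:         # both signals flip
            sx, sy = flip(sx), flip(sy)
        point['x'] += payoff[ax + sx]
        point['y'] += payoff[ay + sy]
        state['x'], state['y'] = ax + sx, ay + sy
    return point['x'] / max_t, point['y'] / max_t

# the Equalizer strategy from the text against 100 random opponents
p_eq = {'Cg': 0.8, 'Cb': 0.365217, 'Dg': 0.634783, 'Db': 0.2, '0': 0.5}
keys = ('Cg', 'Cb', 'Dg', 'Db', '0')
for _ in range(100):
    q_rand = {k: random.random() for k in keys}
    x_avg, y_avg = simulate(p_eq, q_rand)
    print(round(x_avg, 3), round(y_avg, 3))  # y_avg should stay close to 0.5
```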

[error_fig_eq.png: the opponent's average score stays near $0.5$ across 100 random strategies $\bf q$]

Thus the Equalizer strategy can fix the opponent's score regardless of the opponent's strategy, even in the presence of observation errors.

Extortion strategy

In the ordinary, error-free prisoner's dilemma game, the Extortion strategy is known to exploit the opponent's score; even the strongest always-defect strategy ($ALLD$) can do no better than a draw against it.

Is this still possible when there are observation errors?

As with the figure for the Equalizer strategy, varying the opponent's strategy $\bf q$ at random and playing it against the Extortion strategy yields the following figure.

[ext.png: scores of the Extortion strategy against 100 random opponent strategies; the red dot is $ALLD$]

The figure shows that the Extortion strategy can no longer beat the $ALLD$ strategy (the red dot). On the other hand, it still wins against most strategies other than $ALLD$.

Without errors, the Extortion strategy could force a draw against $ALLD$, preventing exploitation by it. With errors, however, it ends up being exploited by the $ALLD$ strategy. On the other hand, it can still exploit strategies other than $ALLD$.

Summary

I implemented a simulation of the prisoner's dilemma game with observation errors in Python. The strategies used here are simple memory-one strategies; stronger ones might be found by using strategies with longer memory or strategies that learn.

This time, I experimented with zero-determinant strategies. Even with observation errors, the Equalizer strategy can fix the opponent's score, and the Extortion strategy, while exploited by the $ALLD$ strategy, can still exploit most other strategies. So even with errors, ZD strategies remain effective to some extent.

If you find a strong strategy or an interesting strategy, please let me know (>_<)

Related literature

- Prisoner's Dilemma Game Opponent Exploitation Strategy (Qiita)
- Summary of Zero-determinant strategies (Qiita)
