(Python) I Analyzed 1 Million Hands ~ Estimating the Number of AA Occurrences ~

Thanks for reading. This is pbird yellow.

This time, I estimated the number of AA appearances from 1 million hands of data. Specifically, I estimate how often AA is likely to appear in an unknown 2,000 hands that you will play in the future. The estimation procedure is as follows.

① Aggregate the hands and create a histogram
② Test whether the aggregated data is normally distributed
③ Compute the mean and standard deviation
④ Estimate the number of AA occurrences at the 95% confidence level

For the material above and the technical terms that appear later, the following book is very easy to understand, so I will link it here:
・ "Complete Self-Study: Introduction to Statistics" (Kindle Edition), Hiroyuki Kojima https://amzn.to/3mSPpqf

Below is the resulting histogram. In the histograms in this article, the vertical axis shows the accumulated count and the horizontal axis shows the number of times AA appeared per 2,000 hands.

[Figure: histogram of AA appearances per 2,000 hands (スクリーンショット 2020-09-26 16.22.47.png)]

■ What is a histogram? A histogram is simply a graph of aggregated data. Taking the figure above as an example:
・ AA appeared 8 times per 2,000 hands (horizontal axis) in 72 of the 2,000-hand blocks (vertical axis).
・ AA appeared 2 times per 2,000 hands (horizontal axis) in 1 block (vertical axis).
And so on.
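The tallying behind a histogram can be sketched in a few lines. The counts below are hypothetical, just to illustrate how the horizontal axis (AA count per block) maps to the vertical axis (number of blocks):

```python
# Reading the histogram as a tally: Counter maps "AA count per 2,000 hands"
# (horizontal axis) to how many 2,000-hand blocks had that count (vertical axis).
from collections import Counter

counts = [8, 9, 8, 10, 2, 8, 9]  # hypothetical per-block AA counts
tally = Counter(counts)
print(tally[8])  # → 3 blocks saw AA exactly 8 times
print(tally[2])  # → 1 block saw AA exactly 2 times
```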

■ What is the SW test? There are various tests for normality, but this time we use the SW (Shapiro-Wilk) test. The SW test checks whether the population behind the aggregated data can be regarded as normally distributed. If normality holds, various statistical laws can be applied, and those laws let us estimate the number of AA occurrences.

The SW test judges normality using the p-value. The p-value is the probability of obtaining data distributed like the aggregated data when sampling at random from the population, under the assumption that the population is normally distributed. By convention, if that probability is below 5%, it is judged too low, and we conclude that the population is not normally distributed. In the figure above, the p-value is 0.08 > 0.05 (= 5%), so the data can (marginally) be said to be consistent with normality.
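In scipy the SW test really is a one-liner. A minimal sketch, using illustrative data (not the article's actual counts):

```python
# Shapiro-Wilk normality test with scipy; data here is illustrative only.
from scipy import stats

data = [8, 9, 7, 10, 9, 8, 11, 6, 9, 10, 8, 7]  # hypothetical AA counts per block
stat, p = stats.shapiro(data)  # returns the test statistic and the p-value
if p >= 0.05:
    print(f"p={p:.2f}: normality cannot be rejected")
else:
    print(f"p={p:.2f}: normality rejected at the 5% level")
```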

■ Mean and standard deviation In the figure above:
・ average → mean (μ)
・ deviation → standard deviation (σ)

■ Estimation at the 95% confidence level Let μ be the mean of the aggregated data and σ its standard deviation. Then, with 95% probability, the number of AA occurrences x falls within μ - 1.96σ ≤ x ≤ μ + 1.96σ.

So: **"If you play 2,000 hands, AA will appear between 3.37 and 15.29 times with 95% probability."**
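As a quick sanity check, the μ and σ behind that interval can be recovered from the stated bounds (assuming the μ ± 1.96σ rule above):

```python
# Back-calculating mu and sigma from the stated interval [3.37, 15.29],
# which spans mu - 1.96*sigma to mu + 1.96*sigma.
lo, hi = 3.37, 15.29
mu = (lo + hi) / 2               # midpoint -> mean
sigma = (hi - lo) / (2 * 1.96)   # half-width / 1.96 -> standard deviation
print(round(mu, 2), round(sigma, 2))  # → 9.33 3.04
```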

Since the estimate is based on only 1 million hands of data, it contains some error. As the data grows and the sample mean and standard deviation approach the population mean and population standard deviation, the estimate becomes more accurate.

By the way, the result for KK is shown below. Since p-value = 0.03 < 0.05, the normality assumption is rejected: the population cannot be said to be normal. In that case the aggregated data cannot be treated as normal either, so a 95% interval cannot be calculated.

[Figure: histogram of KK appearances per 2,000 hands (スクリーンショット 2020-09-26 17.05.46.png)]

However, if the number of hands increases, the p-value should rise and the data should come to look normal.

For QQ, normality holds: **"If you play 2,000 hands, QQ will appear between 3.40 and 14.54 times with 95% probability."**

[Figure: histogram of QQ appearances per 2,000 hands (スクリーンショット 2020-09-26 17.05.26.png)]

By the way, some of you may be asking: why isn't KK normal? Its histogram looks no different from AA's or QQ's!

···I agree with you!!!

The problem is that the p-value sits right around the boundary. It can be worked around by changing the block size from 2,000 hands to 1,000 hands. However, that creates a new problem: since the horizontal axis can only take values of 0 or more, the normal curve gets cut off at zero and a proper analysis becomes impossible...
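Rough arithmetic shows why halving the block size pushes the curve against zero: the expected count halves, while the standard deviation only shrinks by √2 (assuming Poisson-like counts, and taking μ ≈ 9.33, σ ≈ 3.04 from the AA figure):

```python
# Scaling the AA statistics from 2,000-hand blocks down to 1,000-hand blocks.
# mu and sigma per 2,000 hands are assumed from the AA histogram in the article.
mu_2000, sig_2000 = 9.33, 3.04
mu_1000 = mu_2000 / 2              # expected AA count per 1,000 hands
sig_1000 = sig_2000 / 2 ** 0.5     # sigma shrinks by sqrt(2) for count data
lower = mu_1000 - 1.96 * sig_1000  # lower end of the 95% interval
print(round(lower, 2))  # → 0.45, right up against the x >= 0 boundary
```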

[Figure: スクリーンショット 2020-09-27 0.23.56.png]

[Figure: スクリーンショット 2020-09-27 6.57.38.png]

In short, 1 million hands is just too few lol

Still, Python can do these complicated calculations in an instant, which is really convenient. I will post the source code below, so please make use of it!!

To be honest, I don't fully understand the internals of the SW test. In Python it can be run in a single line, so even though I know what numbers come out, the calculation process itself is hard to follow. If you know of any books that explain it with worked examples, I would be grateful if you could let me know!

The source code follows. I am a complete beginner at programming, so if you have suggestions for writing better code, please let me know!!

pokermain.py


from holdcards import Holdcards
from plotgraph import Plotgraph
import os
import glob
import re

path = 'Write the path here'  # directory containing the hand-history .txt files
hand = "AA"   # hand to look up
count = 2000  # block size (hands per block)

# Natural sort: order the files by the number embedded in each filename
num = lambda val: int(re.sub(r"\D", "", val))
filelist = sorted(glob.glob(os.path.join(path, "*.txt"), recursive=True), key=num)

totcards = []
graphdata = []
countdata = []
for item in filelist:
    print(item)
    with open(item) as f:
        data = f.readlines()
        card = Holdcards()
        totcards += card.find_holdcards(data)

# Split the full list of hole cards into blocks of `count` hands,
# discarding the final partial block
i = 0
while len(totcards[count*i:count*(i+1)]) == count:
    graphdata.append(totcards[count*i:count*(i+1)])
    i += 1

# Count how many times the target hand appears in each block
for item in graphdata:
    countdata.append(item.count(hand))

graph = Plotgraph()
graph.writehist(countdata, hand, count, len(graphdata)*count)  # SW test + histogram

holdcards.py



class Holdcards:
    def __init__(self):
        # Map ranks to numeric strength (and back) for ordering the two hole cards
        self.trump = {"A": "14", "K": "13", "Q": "12", "J": "11", "T": "10", "9": "9",
                      "8": "8", "7": "7", "6": "6", "5": "5", "4": "4", "3": "3", "2": "2"}
        self.r_trump = {"14": "A", "13": "K", "12": "Q", "11": "J", "10": "T", "9": "9",
                        "8": "8", "7": "7", "6": "6", "5": "5", "4": "4", "3": "3", "2": "2"}
        self.hands = 0
        self.tothands = 0
        self.handlist = []

    def find_holdcards(self, data):
        # Extract hole cards from "Dealt to ..." lines and normalize them
        # to notation such as "AKs", "AKo", or "AA"
        holdcards = []
        for item in data:
            if 'Dealt to' in item:
                item = item[-7:-2]  # e.g. "Ah Kd"
                if item[1] == item[4]:  # same suit -> suited
                    if int(self.trump.get(item[0])) > int(self.trump.get(item[3])):
                        item = item[0] + item[3] + 's'
                    else:
                        item = item[3] + item[0] + 's'
                else:
                    if int(self.trump.get(item[0])) > int(self.trump.get(item[3])):
                        item = item[0] + item[3] + 'o'
                    elif item[0] == item[3]:  # pair
                        item = item[0] + item[3]
                    else:
                        item = item[3] + item[0] + 'o'
                holdcards.append(item)
        return holdcards

plotgraph.py


import numpy as np
import pandas as pd
import scipy.stats as st
import math
import matplotlib.pyplot as plt

class Plotgraph:
    def __init__(self):
        pass

    # Draw a histogram of countdata and run the SW (Shapiro-Wilk) test
    def writehist(self, countdata, hand, count, tothands):
        df = pd.DataFrame({'p1': countdata})
        target = 'p1'  # column to plot

        # (1) Statistics
        mu = round(df[target].mean(), 2)        # mean
        sig = round(df[target].std(ddof=0), 2)  # population standard deviation (ddof=0)
        print(f'■ mean: {df[target].mean():.2f}, standard deviation: {df[target].std(ddof=0):.2f}')
        ci1, ci2 = (None, None)

        # Drawing parameters
        x_min = round(mu - 3*sig)  # lower plot limit
        x_max = round(mu + 3*sig)  # upper plot limit
        j = 10  # y-axis (frequency) step size
        k = 1   # class (bin) width
        bins = int((x_max - x_min)/k)  # number of bins, e.g. (100-40)/5 -> 12
        d = 0.001

        # Drawing starts here
        plt.figure(dpi=96)
        plt.xlim(x_min, x_max)
        hist_data = plt.hist(df[target], bins=bins, color='tab:cyan',
                             range=(x_min, x_max), rwidth=0.9)
        n = len(hist_data[0])  # number of bins (used as the sample size below)
        plt.title("hand = " + hand + " , totalhands = " + str(tothands))

        # (2) Histogram x-axis ticks
        plt.gca().set_xticks(np.arange(x_min, x_max - k + d, k))

        # Normality test (significance level 5%); note that the test is
        # applied to the bin frequencies
        _, p = st.shapiro(hist_data[0])
        print(hist_data[0])
        print(st.shapiro(hist_data[0]))
        if p >= 0.05:
            print(f'  - p={p:.2f} (p>=0.05): the population can be regarded as normal')
            U2 = df[target].var(ddof=1)  # unbiased variance (population variance estimate)
            print(U2)
            DF = n - 1                   # degrees of freedom
            SE = math.sqrt(U2/n)         # standard error
            print(SE)
            ci1, ci2 = st.t.interval(0.95, df=DF, loc=mu, scale=SE)  # 95% CI for the mean
        else:
            print(f'  * p={p:.2f} (p<0.05): the population cannot be said to be normal')

        # (3) Fitted normal curve
        sig = df[target].std(ddof=1)  # unbiased standard deviation (ddof=1)
        nx = np.linspace(x_min, x_max + d, 150)  # 150 points
        ny = st.norm.pdf(nx, mu, sig) * k * len(df[target])
        plt.plot(nx, ny, color='tab:blue', linewidth=1.5, linestyle='--')

        # (4) x-axis scale and label
        plt.xlabel('total"' + str(hand) + '"/' + str(count) + 'hands', fontsize=12)
        plt.gca().set_xticks(np.arange(x_min, x_max + d, k))

        # (5) y-axis scale and label
        y_max = max(hist_data[0].max(), st.norm.pdf(mu, mu, sig) * k * len(df[target]))
        y_max = int(((y_max//j) + 1)*j)  # smallest multiple of j above the maximum frequency
        plt.ylim(0, y_max)
        plt.gca().set_yticks(range(0, y_max + 1, j))
        plt.ylabel('Accumulation', fontsize=12)

        # (6) Annotate mean, standard deviation, and p-value
        tx = 0.03  # text x position
        ty = 0.91  # text y position
        tt = 0.08  # line spacing
        tp = dict(horizontalalignment='left', verticalalignment='bottom',
                  transform=plt.gca().transAxes, fontsize=11)
        plt.text(tx, ty, f'average {mu:.2f}', **tp)
        plt.text(tx, ty - tt, f'deviation {sig:.2f}', **tp)
        plt.text(tx, ty - tt - tt, f'P-value {p:.2f}', **tp)
        plt.vlines(mu, 0, y_max, color='black', linewidth=1)

        plt.show()
