1. Statistics learned with Python 2. Probability distribution [Thorough understanding of scipy.stats]

1_2_01.PNG

(1) Types of probability distribution

Probability distribution probability distribution Method data
1 Hypergeometric distribution Hypergeometric distribution scipy.stats.hypergeom Discrete type
2 Bernoulli distribution Bernoulli distribution scipy.stats.bernoulli Discrete type
3 Binomial distribution binomial distribution scipy.stats.binom Discrete type
4 Poisson distribution Poisson distribution scipy.stats.poisson Discrete type
5 Geometric distribution Geometric Distribution scipy.stats.geom Discrete type
6 Negative binomial distribution Negative Binomial Distribution scipy.stats.nbinom Discrete type
7 Uniform distribution uniform distribution scipy.stats.uniform Discrete type
8 normal distribution normal distribution scipy.stats.norm Continuous type
9 Exponential distribution exponential distribution scipy.stats.expon Continuous type
10 Gamma distribution gamma distribution scipy.stats.gamma Continuous type
11 Beta distribution beta distribution scipy.stats.betabinom Continuous type
12 Cauchy distribution Cauchy distribution scipy.stats.cauchy Continuous type
13 Lognormal distribution lognormal distribution scipy.stats.lognorm Continuous type
14 Pareto distribution Pareto distribution scipy.stats.pareto Continuous type
15 Weibull distribution Weibull distribution scipy.stats.dweibull Continuous type

⑵ Functions of various methods

#Library for numerical calculations
import numpy as np
from scipy import stats

#Library for graph drawing
import matplotlib.pyplot as plt
%matplotlib inline

#Module to make matplotlib support Japanese display
!pip install japanize-matplotlib
import japanize_matplotlib

➀ rvs (Random variates) Random variables

#Generate pseudo-random numbers that follow a normal distribution with rvs
norm_rvs = stats.norm.rvs(loc=50, scale=20, size=1000, random_state=0) #Expected value=50,standard deviation=20,Quantity=1000

#Visualization (expressed in histogram)
plt.hist(norm_rvs, bins=10, alpha=0.3, ec='blue')
plt.xlabel("class", fontsize=13)
plt.ylabel("frequency", fontsize=13)

plt.show()

1_2_02.PNG

➁ pdf (Probability density function) Probability density function

#Generate arithmetic progression
X = np.arange(start=1, stop=7.1, step=0.1)

#Generate probability density function in pdf
norm_pdf = stats.norm.pdf(x=X, loc=4, scale=0.8) #Expected value=4,standard deviation=0.8 

#Visualization
plt.plot(X, norm_pdf)
plt.xlabel("Random variable X", fontsize=13)
plt.ylabel("Probability density pdf", fontsize=13)
plt.show()

1_2_03.PNG

  • As an attempt, let's check the probability density when the random variable x = 5.
#Calculate the probability density when the random variable 5
x = 5
y = stats.norm.pdf(x=x, loc=4, scale=0.8) #Expected value=4,standard deviation=0.8
print("Random variable x=Probability density at 5:", y)

#Plot on the graph of probability density function
plt.plot(X, norm_pdf)
plt.plot(x, y, 'bo') #Place the blue dots

plt.vlines(x, 0.0, y, lw=2, linestyles='dashed') #Vertical line
plt.hlines(y, 1.0, x, lw=2, linestyles='dashed') #Horizon

plt.xlabel("Random variable X", fontsize=13)
plt.ylabel("Probability density pdf", fontsize=13)

plt.show()

1_2_04.PNG

  • ** Probability density function pdf is a method for continuous data, and the following probability mass function pfm is used for discrete data. </ font> **

➂ pmf (Probability mass function) Probability mass function

  • Notation: pmf (k, n, p, loc = 0)
  • ** Probability mass ** represents ** relative ease ** for each discrete element of random variable X.
  • To put it plainly, the ** probability mass function is a function ** that calculates the probability mass by taking discrete </ font> data as an argument.
  • Here, using ** binomial distribution ** as an example, "the probability of each success (0-4 times) when 5 trials with a success probability of 40% are performed" is shown.
#Probability of success
p = 0.4
#Number of trials
n = 5
#Number of successes
k = np.arange(0, 5) # array([0, 1, 2, 3, 4])

#Calculate the probability for each success with pmf
binom_pmf = stats.binom.pmf(k, n, p) #Number of successes=0~4,Number of trials=5,Probability of success=0.4

#Visualization
plt.plot(k, binom_pmf, 'bo', ms=8)
plt.vlines(k, 0, binom_pmf, colors='b', lw=3, alpha=0.5)

plt.xticks(k) #x-axis scale

plt.xlabel("Random variable X", fontsize=13)
plt.ylabel("Probability mass function pmph", fontsize=13)

plt.show()

1_2_14.PNG

➃ logpdf (Log of the probability density function) log probability density function

  • Notation: logpdf (x, loc = 0, scale = 1)
  • ** The log probability density function ** is the logarithm of the probability density ** with the base e.
  • The value of $ a $ when the number of Napiers $ e = 2.7182818284 ... $ is multiplied by the transcendental number followed by $ a $ to obtain the probability density. The formula is $ e ^ a = $ pdf.
#Generate arithmetic progression
X = np.arange(start=1, stop=7.1, step=0.1)

#Generate logarithm of probability density function with logpdf
norm_logpdf = stats.norm.logpdf(x=X, loc=4, scale=0.8) #Expected value=4,standard deviation=0.8

#Visualization
plt.plot(X, norm_logpdf)
plt.xlabel("Random variable X", fontsize=13)
plt.ylabel("Log probability density function logpdf", fontsize=13)
plt.show()

1_2_05.PNG

  • As an attempt, let's check the random variable x = 4, which shows the maximum logarithm of the probability density in the above figure.
  • As a procedure, first obtain the value of the log probability density ** of the random variable ** x = 4, then calculate the probability density of ** x = 4, then take the logarithm ** and compare both. I will.
#Get maximum log probability density
logpdf_max = np.max(norm_logpdf) 
print('x=Log probability density of 4:', logpdf_max)

#Calculate probability density
x = 4
v_pdf = stats.norm.pdf(x=x, loc=4, scale=0.8) #Expected value=4,standard deviation=0.8

#Calculate the logarithm of probability density
v_pdf_log = np.log(v_pdf)
print('x=Logarithm of probability density of 4:', v_pdf_log)

#Visualization
plt.plot(X, norm_logpdf)
plt.plot(4, logpdf_max, 'bo') #Place the blue dots

logpdf_min = np.min(norm_logpdf) #Minimum log probability density(For drawing)
plt.vlines(x, logpdf_max, logpdf_min, lw=2, linestyles='dashed') #Vertical line
plt.hlines(logpdf_max, 1.0, x, lw=2, linestyles='dashed') #Horizon

plt.xlabel("Random variable X", fontsize=13)
plt.ylabel("Log probability density function logpdf", fontsize=13)

plt.show()

1_2_06.PNG

➄ cdf (Cumulative distribution function) Cumulative distribution function

  • Notation: cdf (x, loc = 0, scale = 1)
  • ** Cumulative distribution function ** calculates the probability that the random variable $ X $ is less than or equal to a certain value $ x $ **.
  • For example, the probability that "the number of rolls is 3 or less" when throwing a dice is the sum of all the probability densities from 1 to 3.
#Generate x-axis arithmetic progression
X = np.linspace(start=1, stop=7, num=100)

#Generate cumulative distribution function with cdf
norm_cdf = stats.norm.cdf(x=X, loc=4, scale=0.8) #Expected value=4,standard deviation=0.8

#x with cdf=Calculate cumulative probabilities of 5 or less
under_5 = stats.norm.cdf(x=5, loc=4, scale=0.8) #Expected value=4,standard deviation=0.8
print('Probability of becoming 5 or less:', under_5)

#Visualization
plt.plot(X, norm_cdf)
plt.plot(5, under_5, 'bo') #Place the blue dots

plt.vlines(5, 0.0, under_5, lw=2, linestyles='dashed') #Vertical line
plt.hlines(under_5, 1.0, 5, lw=2, linestyles='dashed') #Horizon

plt.xlabel("Random variable X", fontsize=13)
plt.ylabel("Cumulative probability density cdf", fontsize=13)

plt.show()

1_2_07.PNG

  • The figure below shows the points of the random variable x = 5 on the probability density function with exactly the same parameters.
  • The cumulative distribution function corresponds to the light blue area in this figure.

1_2_08.PNG

➅ sf (Survival function) Survival function

  • Notation: sf (x, loc = 0, scale = 1)
  • ** Survival function **, contrary to the cumulative distribution function cdf, ** Random variable $ X $ has a certain value $ x $ or more </ font> probability ** Calculate.
  • ** Sometimes defined as "1-cdf" **, but ** sf may be more accurate ** scipy.org https://docs.scipy.org/ You can find it in doc / scipy / reference / generated / scipy.stats.norm.html.

1_2_15.PNG

#Generate x-axis arithmetic progression
X = np.linspace(start=1, stop=7, num=100)

#Generate survival function with sf
norm_sf = stats.norm.sf(x=X, loc=4, scale=0.8) #Expected value=4,standard deviation=0.8

#x with sf=Calculate cumulative probabilities of 5 or more
upper_5 = stats.norm.sf(x=5, loc=4, scale=0.8) #Expected value=4,standard deviation=0.8
print('Probability of 5 or more:', upper_5)

#Visualization
plt.plot(X, norm_sf)
plt.plot(5, upper_5, 'bo') #Place the blue dots

plt.vlines(5, 0.0, upper_5, lw=2, linestyles='dashed') #Vertical line
plt.hlines(upper_5, 1.0, 5, lw=2, linestyles='dashed') #Horizon

plt.xlabel("Random variable X", fontsize=13)
plt.ylabel("Survival function sf", fontsize=13)

plt.show()

1_2_09.PNG

  • The figure below shows the points of the random variable x = 5 on the probability density function with exactly the same parameters.
  • The survival function corresponds to the light blue area in the figure.

1_2_10.PNG

➆ ppf (Percent point function) Percent point function

  • Notation: ppf (q, loc = 0, scale = 1)
  • ** The percentage point function ** is the inverse of the cumulative distribution function cdf, and if you specify the ** cumulative distribution function as q%, it returns a variable ** that takes that value.
  • Therefore, ppf (0.25) corresponds to the first quartile, ppf (0.5) corresponds to the median second quartile, and ppf (0.75) corresponds to the third quartile.
#Generate x-axis arithmetic progression
X = np.arange(start=1, stop=7, step=0.1)

#Generate probability density function in pdf
norm_pdf = stats.norm.pdf(x=X, loc=4, scale=0.8) #Expected value=4,standard deviation=0.8

#Get the variable corresponding to the cumulative distribution function 75% with ppf
q_75 = stats.norm.ppf(q=0.75, loc=4, scale=0.8) #Expected value=4,standard deviation=0.8
print('Cumulative distribution function 75% point random variable:', q_75)

#Get the probability density of the variable with pdf
v = stats.norm.pdf(x=q_75, loc=4, scale=0.8)
print('Cumulative distribution function 75% point probability density:', v)

#Visualization
plt.plot(X, norm_pdf)
plt.plot(q_75, v, 'bo') #Place the blue dots

plt.vlines(q_75, 0.0, v, lw=2, linestyles='dashed') #Vertical line
plt.hlines(v, 1.0, q_75, lw=2, linestyles='dashed') #Horizon

plt.xlabel("Random variable X", fontsize=13)
plt.ylabel("Probability density function pdf", fontsize=13)

plt.show()

1_2_11.PNG

➇ isf (Inverse survival function)

  • Notation: ʻisf (q, loc = 0, scale = 1) `
  • ** The inverse survival function ** is the inverse of the survival function sf, but what you do is the opposite of the percentage point function ppf. ** If the remainder of the survival function is specified as q%, a variable ** that takes that value is returned.
  • Therefore, the quartile is the opposite of the percentage point function ppf, isf (0.25) is the third quartile, isf (0.5) is also the second quartile, and isf (0.75) is the first. Corresponds to the quartile.
#Generate x-axis arithmetic progression
X = np.arange(start=1, stop=7, step=0.1)

#Generate probability density function in pdf
norm_pdf = stats.norm.pdf(x=X, loc=4, scale=0.8) #Expected value=4,standard deviation=0.8

#Get the variable corresponding to the reverse survival function 25% with isf
q_25 = stats.norm.isf(q=0.25, loc=4, scale=0.8) #Expected value=4,standard deviation=0.8
print('Random variable with 25% reverse survival function:', q_25)

#Get the probability density of the variable with pdf
v = stats.norm.pdf(x=q_25, loc=4, scale=0.8)
print('Probability density of 25% point of inverse survival function:', v)

#Visualization
plt.plot(X, norm_pdf)
plt.plot(q_25, v, 'bo') #Place the blue dots

plt.vlines(q_25, 0.0, v, lw=2, linestyles='dashed') #Vertical line
plt.hlines(v, 1.0, q_25, lw=2, linestyles='dashed') #Horizon

plt.xlabel("Random variable X", fontsize=13)
plt.ylabel("Probability density function pdf", fontsize=13)

plt.show()

1_2_12.PNG

➈ interval (Endpoints of the range that contains alpha percent of the distribution) Interval estimation

  • Notation: ʻinterval (alpha, loc = 0, scale = 1)`
  • In the ** empirical distribution ** calculated from actual data, neither the mean nor the standard deviation is always known. I don't really know if the mean obtained from the sample can be approximated to the ** true population parameter $ θ $ ** in the population.
  • Therefore, ** Finding the interval including parameter $ θ $ </ font> with a certain probability is called interval estimation **.
#Generate x-axis arithmetic progression
X = np.arange(start=1, stop=7, step=0.1)

#Generate probability density function in pdf
norm_pdf = stats.norm.pdf(x=X, loc=4, scale=0.8) #Expected value=4,standard deviation=0.8

#Get the variable corresponding to the confidence interval 95% with interval
lower, upper = stats.norm.interval(alpha=0.95, loc=4, scale=0.8) #Confidence factor=0.95,Expected value=4,standard deviation=0.8
print('Lower limit of 95% confidence interval:', lower)
print('Upper limit of 95% confidence interval:', upper)

#Get the probability density of each variable with pdf
v_lower = stats.norm.pdf(x=lower, loc=4, scale=0.8) #Expected value=4,standard deviation=0.8
v_upper = stats.norm.pdf(x=upper, loc=4, scale=0.8) #Expected value=4,standard deviation=0.8

#Visualization
plt.plot(X, norm_pdf)
plt.plot(lower, 0.0, 'k|')
plt.plot(upper, 0.0, 'k|')

plt.vlines(lower, 0.0, v_lower, lw=0.8) #Lower limit vertical line
plt.vlines(upper, 0.0, v_upper, lw=0.8) #Upper vertical line

plt.xlabel("Random variable X", fontsize=13)
plt.ylabel("Probability density function pdf", fontsize=13)

plt.show()

1_2_13.PNG

  • In interval estimation, ** first specify the probability of the interval including the parameter $ θ $ **. In this example, it is ʻalpha = 0.95`, which is called the ** confidence factor **. Often 99%, 95% and 90% are used, but this is optional.
  • Under the specified probability, the lower and upper limits of the interval containing the parameter $ θ $ are calculated. In this example, there is a 95% chance that the parameter $ θ $ indicates that the random variable is between 2.43 and 5.57.

Afterword

  • I have omitted the log cumulative distribution function logcdf, log survival function logsf, etc., but I think that the ones that seem to be used frequently have been suppressed.
  • Using these, I would like to unravel the 15 types of probability distributions listed above one by one.

Recommended Posts