When you study [probability theory](https://ja.wikipedia.org/wiki/確率論) and statistics, you encounter a bewildering variety of [probability distributions](https://ja.wikipedia.org/wiki/確率分布), so many that it can be hard to decide which one applies in which situation.
Which probability distribution explains the values of a [random variable](https://ja.wikipedia.org/wiki/確率変数) in a naturally occurring phenomenon depends on the phenomenon itself.
In this article, I will simulate the phenomena that give rise to various probability distributions in Python and draw the resulting distributions.
In fact, even without knowing where a distribution comes from, you can easily work with it using scipy's scipy.stats module. But first of all, I think it is worthwhile to know under what circumstances each distribution arises.
So here we will implement everything using only Python's random module and numpy:
the random module to draw random values one at a time, and numpy to draw many at once. (Graphs are drawn with matplotlib.)
Then we use scipy.stats to compare the results. If we have understood the meaning of each distribution correctly, the results should match.
How to use scipy.stats is not explained in detail here, since it is covered in articles such as https://qiita.com/supersaiakujin/items/71540d1ecd60ced65add.
Probability distributions are broadly divided into two types, [discrete probability distributions](https://ja.wikipedia.org/wiki/離散確率分布) and continuous probability distributions, and the implementation differs slightly between them.
The main difference is that a discrete probability distribution is described by a [probability mass function](https://ja.wikipedia.org/wiki/確率質量関数) (PMF), whereas a continuous probability distribution is described by a [probability density function](https://ja.wikipedia.org/wiki/確率密度関数) (PDF).
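To make the difference concrete, here is a minimal sketch of how the two function types behave in scipy.stats (the parameter values are arbitrary examples):

```python
import scipy.stats

#Discrete: .pmf() gives actual probabilities, which sum to 1
binom = scipy.stats.binom(10, 0.3)
print(binom.pmf(3))                          #P(X = 3)
print(sum(binom.pmf(k) for k in range(11))) #-> 1.0

#Continuous: .pdf() gives a density, not a probability;
#probabilities come from integrating, e.g. via the CDF
norm = scipy.stats.norm(0, 1)
print(norm.pdf(0))                  #density at 0, not P(X = 0)
print(norm.cdf(1) - norm.cdf(-1))   #P(-1 <= X <= 1) ≈ 0.683
```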
The probability distributions described here are:

**Discrete probability distributions**
- [Binomial distribution](https://ja.wikipedia.org/wiki/二項分布)
- [Bernoulli distribution](https://ja.wikipedia.org/wiki/ベルヌーイ分布)
- [Geometric distribution](https://ja.wikipedia.org/wiki/幾何分布)
- [Negative binomial distribution](https://ja.wikipedia.org/wiki/負の二項分布)
- [Poisson distribution](https://ja.wikipedia.org/wiki/ポアソン分布)

**Continuous probability distributions**
- [Continuous uniform distribution](https://ja.wikipedia.org/wiki/連続一様分布)
- [Exponential distribution](https://ja.wikipedia.org/wiki/指数分布)
- [Normal distribution](https://ja.wikipedia.org/wiki/正規分布)
- [Chi-square distribution](https://ja.wikipedia.org/wiki/カイ二乗分布)
- [Beta distribution](https://ja.wikipedia.org/wiki/ベータ分布)
When you conduct an experiment whose outcome is either success or failure, the probability distribution of the number of successes is the **[binomial distribution](https://ja.wikipedia.org/wiki/二項分布)**.
The probability mass function is

```math
P(x) = C(n,x)\, p^x (1-p)^{n-x}
```

where p is the probability of success, n is the number of trials, and $C(n,x)$ is the combination function for choosing x out of n.
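As a sanity check, this formula can be coded directly with math.comb from the standard library and compared against scipy (a minimal sketch):

```python
import math
import scipy.stats

def binom_pmf(x, n, p):
    #P(x) = C(n,x) p^x (1-p)^(n-x), straight from the formula
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 2000, 0.01
for x in (10, 20, 30):
    print(binom_pmf(x, n, p), scipy.stats.binom.pmf(x, n, p))
```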
For example, suppose that in a game a certain monster drops an item with probability p when defeated. If you defeat n of them, the number of items dropped should follow a binomial distribution.
Let's actually look at random results. Here n = 2000, p = 0.01, and the number of samplings is 10000.
```python
import random
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
#First, import all the modules we want to use.

n_sampl = 10000 #Number of samplings
n = 2000 #Number of monsters to defeat
p = 0.01 #Probability of dropping an item
n_drop = [] #Number of drops in each sampling
for _ in range(n_sampl):
    #Draw n random values and check which are less than or equal to p
    drop = np.random.random(n)<=p
    n_drop.append(drop.sum()) #Store the number of drops
y = np.bincount(n_drop) #Count how often each drop count occurred
x = np.arange(len(y))
#Draw a bar graph
plt.title('Binomial distribution',family='Arial Unicode MS')
plt.bar(x,y,color='#f9a7a0',ec='k')
plt.grid(ls='--')
plt.savefig('binom.png',dpi=100)
plt.close()
```
Next is the implementation in scipy.stats. Here we use .rvs() to generate random values and .pmf() to draw the theoretical curve.
```python
f = scipy.stats.binom(n,p)
plt.title('Binomial distribution',family='Arial Unicode MS')
h = np.bincount(f.rvs(n_sampl))
x = np.arange(len(h))
plt.bar(x,h/n_sampl,color='#a477e2',ec='k')
plt.plot(x,f.pmf(x),'#4aec96')
plt.grid(ls='--')
plt.savefig('binom2.png',dpi=100)
plt.close()
```
The distribution drawn with .rvs() roughly matches the .pmf() curve, and also matches the figure above. Note that here the counts are divided by the number of samplings so that they sum to 1.
A binomial distribution with n = 1 is called the **[Bernoulli distribution](https://ja.wikipedia.org/wiki/ベルヌーイ分布)**.
Setting n = 1 in the probability mass function of the binomial distribution gives

```math
P(x) = p^x (1-p)^{1-x}
```

Since the experiment is run only once, the only possible outcomes are 0 (failure) and 1 (success).
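Since this is just the binomial PMF with n = 1, scipy's bernoulli and binom should agree (a quick check):

```python
import scipy.stats

p = 0.2
for x in (0, 1):
    print(scipy.stats.bernoulli.pmf(x, p), scipy.stats.binom.pmf(x, 1, p))
#prints 0.8 twice for x = 0 and 0.2 twice for x = 1
```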
The code is simpler than the binomial distribution.
```python
n_sampl = 10000
p = 0.2 #Probability of dropping an item
n_drop = [] #Result of each trial (0 or 1)
for _ in range(n_sampl):
    n_drop.append(int(random.random()<=p)) #1 if the random value is less than or equal to p
y = np.bincount(n_drop)
x = np.arange(len(y))
plt.title('Bernoulli distribution',family='Arial Unicode MS')
plt.bar(x,y,color='#f9a7a0',ec='k')
plt.grid(ls='--')
plt.savefig('bernoulli.png',dpi=100)
plt.close()
```
Like the binomial distribution, the **[geometric distribution](https://ja.wikipedia.org/wiki/幾何分布)** arises from experiments that either succeed or fail, but what the geometric distribution describes is the number of trials until the first success.
If the probability of success is p, the probability mass function of the number of trials needed to succeed is

```math
P(x) = p(1-p)^{x-1}
```

For example, the probability distribution of the number of monsters you must defeat until one drops an item with probability p is a geometric distribution.
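The formula is simple enough to evaluate directly (a quick check against scipy.stats.geom.pmf):

```python
import scipy.stats

p = 0.2
for x in range(1, 6):
    direct = p * (1 - p)**(x - 1) #P(x) = p(1-p)^(x-1)
    print(x, direct, scipy.stats.geom.pmf(x, p))
```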
The random simulation is like the binomial case, but this time the number of monsters to defeat is not fixed; we keep going until the first success.
```python
n_sampl = 10000 #Number of samplings
p = 0.2 #Probability of dropping
kaime_drop = [] #List storing on which kill the drop occurred
for _ in range(n_sampl):
    n_kai = 1 #Number of monsters defeated
    #Repeat until successful
    while(random.random()>p):
        n_kai += 1
    #Store the number of monsters defeated when the drop occurred
    kaime_drop.append(n_kai)
y = np.bincount(kaime_drop) #Count how often each kill count occurred
x = np.arange(len(y))
#Draw a bar graph
plt.title('Geometric distribution',family='Arial Unicode MS')
plt.bar(x,y,color='#f9a7a0',ec='k')
plt.grid(ls='--')
plt.savefig('geom.png',dpi=100)
plt.close()
```
Implemented with scipy
```python
f = scipy.stats.geom(p)
plt.title('Geometric distribution',family='Arial Unicode MS')
h = np.bincount(f.rvs(n_sampl))
x = np.arange(1,len(h))
h = h[1:] #geom takes values from 1, so drop the empty count for 0
plt.bar(x,h/n_sampl,color='#a477e2',ec='k')
plt.plot(x,f.pmf(x),'#4aec96')
plt.grid(ls='--')
plt.savefig('geom2.png',dpi=100)
plt.close()
```
Similar to the geometric distribution, the **[negative binomial distribution](https://ja.wikipedia.org/wiki/負の二項分布)** is also a distribution of the number of trials, but instead of stopping at the first success, it considers how many trials are needed to reach a given number of successes.
The probability mass function for the number of failures k before the r-th success is

```math
P(k) = C(k+r-1,k)\, p^r (1-p)^k
```

where p is the probability of success and r is the number of successes you want. When r = 1 this reduces to a geometric distribution (counting failures rather than trials).
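This can be checked numerically. One caveat: scipy's nbinom counts failures rather than total trials, so its r = 1 case matches geom shifted by one (a quick check under that convention):

```python
import scipy.stats

p = 0.2
for k in range(5):
    #nbinom(1, p).pmf(k): probability of k failures before the 1st success
    #geom(p).pmf(k + 1):  probability that the success comes on trial k + 1
    print(scipy.stats.nbinom.pmf(k, 1, p), scipy.stats.geom.pmf(k + 1, p))
```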
For example, if you want r items from a monster that drops them with probability p, how many monsters do you need to defeat?
The simulation is a little more involved than for the geometric distribution, but essentially the same.
```python
n_sampl = 10000
p = 0.2 #Probability of dropping
r = 4 #Number of successes you want
kaime_drop = []
for _ in range(n_sampl):
    n_kai = 0 #Number of monsters defeated
    n_drop = 0 #Number of drops
    #Defeat monsters over and over again
    while(n_drop<r):
        #Check whether this one drops
        if(random.random()<=p):
            n_drop += 1
        n_kai += 1
    #Store the number of monsters defeated once r drops have occurred
    kaime_drop.append(n_kai)
y = np.bincount(kaime_drop)
x = np.arange(len(y))
plt.title('Negative binomial distribution',family='Arial Unicode MS')
plt.bar(x,y,color='#f9a7a0',ec='k')
plt.grid(ls='--')
plt.savefig('nbinom.png',dpi=100)
plt.close()
```
Implemented with scipy
```python
f = scipy.stats.nbinom(r,p)
plt.title('Negative binomial distribution',family='Arial Unicode MS')
y = np.bincount(f.rvs(n_sampl))
x = np.arange(len(y))
plt.bar(x+r,y/n_sampl,color='#a477e2',ec='k') #shift by r: scipy's nbinom counts failures, not total trials
plt.plot(x+r,f.pmf(x),'#4aec96')
plt.grid(ls='--')
plt.savefig('nbinom2.png',dpi=100)
plt.close()
```
If an event can occur with equal probability at any moment, the number of occurrences within a given interval follows a **[Poisson distribution](https://ja.wikipedia.org/wiki/ポアソン分布)**.
The probability mass function is

```math
P(x) = \frac{λ^x e^{-λ}}{x!}
```

where λ is the average number of occurrences in the interval.
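As before, the mass function can be evaluated straight from the formula (a quick check; λ is spelled lam in the sketch):

```python
import math
import scipy.stats

lam = 8
for x in (0, 5, 10):
    direct = lam**x * math.exp(-lam) / math.factorial(x)
    print(x, direct, scipy.stats.poisson.pmf(x, lam))
```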
For example, if a monster appears on average λ times per hour in a game, the number of times it actually appears in each hour follows a Poisson distribution.
```python
n = 10000 #Number of one-hour intervals
λ = 8 #Average number of occurrences per interval
#Random times at which the event occurs
jikan = np.random.uniform(0,n,n*λ)
#Divide into one-hour units and count the occurrences in each interval
kaisuu = np.bincount(jikan.astype(int))
#Count how often each count occurs, to see the distribution of counts
y = np.bincount(kaisuu)
x = np.arange(len(y))
plt.title('Poisson distribution',family='Arial Unicode MS')
plt.bar(x,y,color='#f9a7a0',ec='k')
plt.grid(ls='--')
plt.savefig('poisson.png',dpi=100)
plt.close()
```
It looks a little involved, but the point is to randomize the times at which the event occurs. Since the event happens about λ times per interval, the random values should range over the number of intervals (n), and the number of random values should be nλ.
np.bincount() then counts how many events fall in each interval, and a second np.bincount() gives the distribution of those counts over all intervals.
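Incidentally, numpy can generate such counts directly with np.random.poisson; as a quick cross-check, this should reproduce the shape of the bar graph above:

```python
import numpy as np

#10000 interval counts with mean λ = 8, drawn directly
kaisuu = np.random.poisson(8, 10000)
print(np.bincount(kaisuu)) #roughly the same counts as the simulation above
```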
Implementation with scipy
```python
f = scipy.stats.poisson(λ)
plt.title('Poisson distribution',family='Arial Unicode MS')
h = np.bincount(f.rvs(n))
x = np.arange(len(h))
plt.bar(x,h/n,color='#a477e2',ec='k')
plt.plot(x,f.pmf(x),'#4aec96')
plt.grid(ls='--')
plt.savefig('poisson2.png',dpi=100)
plt.close()
```
Unlike a discrete probability distribution, which assigns probability only to specific values, a continuous probability distribution gives probability 0 to any single value; instead, we get the probability that the value falls within a given range. That is why the distribution is described by a probability density function.
The **[continuous uniform distribution](https://ja.wikipedia.org/wiki/連続一様分布)** is a simple distribution whose density is constant within a certain range.
If the range is from a to b, the probability density function for x ∈ [a, b] is

```math
f(x) = \frac{1}{b-a}
```

and $f(x) = 0$ outside the range.
It is easily implemented with np.random.uniform or random.uniform.
For example, consider a roulette wheel with no pockets. If you throw a ball onto it, it should come to rest anywhere from 0 to 360 degrees with equal probability.
```python
n_sampl = 10000
a = 0 #Minimum
b = 360 #Maximum
x = np.random.uniform(a,b,n_sampl) #Random values
#Draw a histogram
plt.title('Continuous uniform distribution',family='Arial Unicode MS')
plt.hist(x,50,color='#92bee1',ec='k')
plt.grid(ls='--')
plt.savefig('uniform.png',dpi=100)
plt.close()
```
For a continuous distribution, instead of counting values with np.bincount() and drawing a bar graph, plt.hist() draws a histogram showing the density.
scipy has scipy.stats.uniform, but its .rvs() is not much different from np.random.uniform. Note that scipy.stats.uniform takes loc and scale, so the range [a, b] corresponds to scipy.stats.uniform(a, b-a); here a = 0, so the two happen to coincide.
In scipy.stats, the distribution-function method also differs from the discrete case: continuous distributions use .pdf() instead of .pmf().
When drawing the .rvs() samples with plt.hist(), setting density=True divides by the total number of samples so that the histogram matches the values of .pdf().
```python
f = scipy.stats.uniform(a,b-a) #loc=a, scale=b-a
plt.title('Continuous uniform distribution',family='Arial Unicode MS')
_,h,_ = plt.hist(f.rvs(n_sampl),50,color='#ffb3d3',ec='k',density=True)
x = np.linspace(h.min(),h.max(),101)
plt.plot(x,f.pdf(x),'#4aec96')
plt.grid(ls='--')
plt.savefig('uniform2.png',dpi=100)
plt.close()
```
The **[exponential distribution](https://ja.wikipedia.org/wiki/指数分布)** arises in the same setting as the Poisson distribution.
If an event can occur at any moment, the distribution of the time between successive occurrences is exponential.
The probability density function is

```math
f(x) = λe^{-λx}
```

where, as with the Poisson distribution, λ is the average number of occurrences per interval.
For example, if a monster appears on average λ times per hour, how long should you expect to wait after one appears before the next?
The method resembles the Poisson case, but the exponential distribution is even easier.

```python
n = 10000 #Length of time
λ = 10 #Average number of occurrences per hour
t = np.random.uniform(0,n,n*λ) #Random times at which the event occurs
t = np.sort(t) #Sort them
x = t[1:]-t[:-1] #Time differences between successive occurrences
#Draw a histogram
plt.title('Exponential distribution',family='Arial Unicode MS')
plt.hist(x,50,color='#92bee1',ec='k')
plt.grid(ls='--')
plt.savefig('expon.png',dpi=100)
plt.close()
```
Implemented with scipy
```python
f = scipy.stats.expon(0,1/λ) #loc=0, scale=1/λ (the mean waiting time)
plt.title('Exponential distribution',family='Arial Unicode MS')
_,h,_ = plt.hist(f.rvs(10000),50,color='#ffb3d3',ec='k',density=True)
x = np.linspace(h.min(),h.max(),101)
plt.plot(x,f.pdf(x),'#4aec96')
plt.grid(ls='--')
plt.savefig('expon2.png',dpi=100)
plt.close()
```
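numpy also has a direct generator for waiting times: np.random.exponential takes the scale parameter, i.e. the mean waiting time 1/λ (a quick check):

```python
import numpy as np

λ = 10
x = np.random.exponential(1/λ, 10000) #scale = mean waiting time = 1/λ
print(x.mean()) #≈ 1/λ = 0.1
```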
The **[normal distribution](https://ja.wikipedia.org/wiki/正規分布)**, also called the **[Gaussian distribution](https://ja.wikipedia.org/wiki/ガウス分布)**, is probably the best-known distribution of all.
The normal distribution arises in many situations, but this time we will generate one via the [central limit theorem](https://ja.wikipedia.org/wiki/中心極限定理).
According to the central limit theorem, if you take the mean of many random variables, the probability distribution of that mean approaches a normal distribution, whatever the original distribution.
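The "whatever the original distribution" part can be checked by replacing the uniform with something skewed; here is a minimal sketch using exponential samples instead (the scale value 1 is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

#Means of 100 exponential samples, 10000 times: these also pile up into a bell curve
x = np.random.exponential(1, [10000, 100]).mean(1)
plt.hist(x, 50, color='#92bee1', ec='k')
plt.show()
```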
The probability density function of the normal distribution is

```math
f(x) = \frac{1}{\sqrt{2πσ^2}}\, e^{-\frac{(x-μ)^2}{2σ^2}}
```

We will use the pocketless roulette from the continuous-uniform example again, but this time look at the distribution of the mean of 100 spins.
```python
n_sampl = 10000
n = 100
a,b = 0,360
x = np.random.uniform(a,b,[n_sampl,n]).mean(1) #Mean of n spins, n_sampl times
plt.title('Normal distribution',family='Arial Unicode MS')
plt.hist(x,50,color='#92bee1',ec='k')
plt.grid(ls='--')
plt.savefig('norm.png',dpi=100)
plt.close()
```
Next we implement the normal distribution in scipy.stats, but first we need to work out the matching mean and standard deviation.
The mean of the continuous uniform distribution is $\frac{a+b}{2}$ and its variance is $\frac{(b-a)^2}{12}$. Since the variance of the mean of $n$ independent samples is $\frac{1}{n}$ times the single-sample variance, the standard deviation of the mean is $\frac{1}{\sqrt{n}}$ times the single-sample standard deviation.
```python
μ = (a+b)/2 #Mean μ
σ = np.sqrt((b-a)**2/12/n) #Standard deviation σ of the mean of n samples
f = scipy.stats.norm(μ,σ)
plt.title('Normal distribution',family='Arial Unicode MS')
_,h,_ = plt.hist(f.rvs(n_sampl),50,color='#ffb3d3',ec='k',density=True)
x = np.linspace(h.min(),h.max(),101)
plt.plot(x,f.pdf(x),'#4aec96')
plt.grid(ls='--')
plt.savefig('norm2.png',dpi=100)
plt.close()
```
If you draw k values from the standard normal distribution, the distribution of the sum of squares

```math
\sum_{i=1}^{k}x_i^2
```

is the [chi-square distribution](https://ja.wikipedia.org/wiki/カイ二乗分布) with k degrees of freedom. ~~It has nothing to do with [Kai Ni](https://dic.pixiv.net/a/改二).~~
The probability density function is

```math
f(x) = \frac{x^{k/2-1}e^{-x/2}}{2^{k/2}\, \Gamma(k/2)}
```
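This density can be evaluated directly from the formula using the gamma function from scipy.special (a quick check against scipy.stats.chi2.pdf):

```python
import math
import scipy.special
import scipy.stats

k = 5
for x in (1.0, 5.0, 10.0):
    direct = x**(k/2 - 1) * math.exp(-x/2) / (2**(k/2) * scipy.special.gamma(k/2))
    print(x, direct, scipy.stats.chi2.pdf(x, k))
```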
A random implementation using normally distributed random numbers:
```python
n_sampl = 10000
k = 5
randn = np.random.randn(n_sampl,k) #Standard normal random numbers
x = (randn**2).sum(1) #Sum of squares of each row of k values
plt.title('Chi-square distribution',family='Arial Unicode MS')
plt.hist(x,50,color='#92bee1',ec='k')
plt.grid(ls='--')
plt.savefig('chi2.png',dpi=100)
plt.close()
```
Implemented with scipy
```python
f = scipy.stats.chi2(k)
plt.title('Chi-square distribution',family='Arial Unicode MS')
_,h,_ = plt.hist(f.rvs(n_sampl),50,color='#ffb3d3',ec='k',density=True)
x = np.linspace(h.min(),h.max(),101)
plt.plot(x,f.pdf(x),'#4aec96')
plt.grid(ls='--')
plt.savefig('chi22.png',dpi=100)
plt.close()
```
The **[beta distribution](https://ja.wikipedia.org/wiki/ベータ分布)** arises when considering [Bayesian probability](https://ja.wikipedia.org/wiki/ベイズ確率), and it is considerably more involved than the other distributions here.
If you don't know the probability that an event occurs and want to estimate it from the number of successes observed, the probability distribution of that unknown probability is a beta distribution.
The probability density function is

```math
f(x) = \frac{x^{α-1}(1-x)^{β-1}}{B(α,β)}
```

where α−1 is the number of successes, β−1 is the number of failures, and B is the [beta function](https://ja.wikipedia.org/wiki/ベータ関数).
For details, see the book "Complete Self-study: Introduction to Bayesian Statistics".
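Before simulating, one way to see why the beta distribution appears: on a grid of candidate values of p, the posterior under a flat prior is proportional to the binomial likelihood of the observed successes. A minimal sketch (the grid and its normalization are choices of this illustration, not part of the simulation below):

```python
import numpy as np
import scipy.stats

n, k = 4, 3 #4 trials, 3 successes
p_grid = np.linspace(0.005, 0.995, 100) #candidate values of the unknown probability
likelihood = scipy.stats.binom.pmf(k, n, p_grid) #P(k successes out of n | p)
#Normalize over the grid to obtain an approximate density
posterior = likelihood / likelihood.sum() / (p_grid[1] - p_grid[0])
#Compare with the beta density with α = k+1, β = n-k+1
print(np.abs(posterior - scipy.stats.beta.pdf(p_grid, k + 1, n - k + 1)).max())
```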
There are various ways to simulate this distribution; here we divide the probability range from 0 to 1 into 50 intervals and consider the case of 3 successes out of 4 trials.
```python
n_sampl = 10000 #Number of samplings
n_bin = 50 #Number of intervals to divide [0,1] into
n = 4 #Total number of trials
k = 3 #Number of successes
p = [] #List storing interval numbers
x = (np.arange(n_bin)+0.5)/n_bin #Center probability of each interval (0.01, 0.03, 0.05, ..., 0.99)
i_shikou = 0 #Number of stored samples
while(i_shikou<n_sampl):
    rand = np.random.random(n) #Random numbers determining success or failure
    for i in range(n_bin):
        #Number of successes if the true probability were that of this interval
        if((rand<=x[i]).sum()==k):
            #Store the interval number when it matches the observed count
            p.append(i)
            i_shikou += 1
y = np.bincount(p,minlength=n_bin) #minlength keeps the length at n_bin even if some intervals are empty
plt.title('Beta distribution',family='Arial Unicode MS')
plt.bar(x,y,width=1/n_bin,color='#92bee1',ec='k')
plt.grid(ls='--')
plt.savefig('beta.png',dpi=100)
plt.close()
```
np.bincount and plt.bar are used instead of plt.hist here, but since the beta distribution is also continuous, what is being drawn is effectively a histogram.
Implemented in scipy.stats
```python
α = k+1 # 4
β = n-k+1 # 2
f = scipy.stats.beta(α,β)
plt.title('Beta distribution',family='Arial Unicode MS')
plt.hist(f.rvs(n_sampl),50,color='#ffb3d3',ec='k',density=True)
x = np.linspace(0,1,101)
plt.plot(x,f.pdf(x),'#4aec96')
plt.grid(ls='--')
plt.savefig('beta2.png',dpi=100)
plt.close()
```
Actually generating the random results makes it much easier to see how each kind of distribution arises, which deepens understanding and makes the theory more convincing.
The distributions explained above are summarized here.
| Name | Function | Range of x | scipy.stats | Parameters |
|---|---|---|---|---|
| Binomial distribution | $C(n,x)p^x(1-p)^{n-x}$ | 0,1,2,...,n | scipy.stats.binom(n,p) | n, p |
| Bernoulli distribution | $p^x(1-p)^{1-x}$ | 0,1 | scipy.stats.bernoulli(p) | p |
| Geometric distribution | $p(1-p)^{x-1}$ | 1,2,... | scipy.stats.geom(p) | p |
| Negative binomial distribution | $C(x+r-1,x)p^r(1-p)^x$ | 0,1,2,... | scipy.stats.nbinom(r,p) | r, p |
| Poisson distribution | $\frac{λ^x e^{-λ}}{x!}$ | 0,1,2,... | scipy.stats.poisson(λ) | λ |
| Continuous uniform distribution | $\frac{1}{b-a}$ | [a,b] | scipy.stats.uniform(a,b-a) | a, b |
| Exponential distribution | $λe^{-λx}$ | [0,∞) | scipy.stats.expon(0,1/λ) | λ |
| Normal distribution | $\frac{1}{\sqrt{2πσ^2}}e^{-\frac{(x-μ)^2}{2σ^2}}$ | (-∞,∞) | scipy.stats.norm(μ,σ) | μ, σ |
| Chi-square distribution | $\frac{x^{k/2-1}e^{-x/2}}{2^{k/2}\Gamma(k/2)}$ | [0,∞) | scipy.stats.chi2(k) | k |
| Beta distribution | $\frac{x^{α-1}(1-x)^{β-1}}{B(α,β)}$ | [0,1] | scipy.stats.beta(α,β) | α, β |
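As a sanity check of the calls in this table, each frozen distribution's theoretical mean and standard deviation can be printed; the parameter values below are just the examples used in this article (the normal distribution's μ and σ are the values computed in that section):

```python
import numpy as np
import scipy.stats

λ = 10
dists = {
    'binom': scipy.stats.binom(2000, 0.01),
    'bernoulli': scipy.stats.bernoulli(0.2),
    'geom': scipy.stats.geom(0.2),
    'nbinom': scipy.stats.nbinom(4, 0.2),
    'poisson': scipy.stats.poisson(8),
    'uniform': scipy.stats.uniform(0, 360), #loc=a, scale=b-a
    'expon': scipy.stats.expon(0, 1/λ), #loc=0, scale=1/λ
    'norm': scipy.stats.norm(180, np.sqrt(360**2/12/100)),
    'chi2': scipy.stats.chi2(5),
    'beta': scipy.stats.beta(4, 2),
}
for name, f in dists.items():
    print(f'{name}: mean={f.mean():.4f}, std={f.std():.4f}')
```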
See also these articles: https://qiita.com/qiita_kuru/items/d9782185652351c78aac and https://qiita.com/hamage/items/738b65668b1093dd9660