While studying statistical tests, you encounter various probability distributions, and I think it is hard to build an intuition for them just by looking at the formulas. So let's draw each probability distribution in Python while varying its parameters, and attach the resulting images.
For the explanations of the probability distributions, I referred to the following:
- Statistics time
- Introduction to Statistics (Basic Statistics I), Department of Statistics, Faculty of Liberal Arts, University of Tokyo
This article does not go into details such as the derivation of the formulas; instead it focuses on grasping the shape of each distribution and what that distribution means. It covers the following two distributions.
- Binomial distribution
- Poisson distribution
The distribution of the number of successes $X$ in $n$ independent trials (Bernoulli trials) that have only two outcomes, such as a coin toss coming up heads or tails, is called the **binomial distribution**.
- The number of times you roll a die 10 times and get a 1
- The number of heads when you toss a coin 5 times
- The number of wins when a baseball team with a 70% winning percentage plays 144 games

all follow the binomial distribution.
The formula for the probability mass function of the binomial distribution is expressed as follows.
P(X = k) = {}_n C_k \, p^k (1-p)^{n-k}
$n$ is the number of trials, $p$ is the probability of success in each trial, and $k$ is the number of successes.
Also, when a random variable $X$ follows the binomial distribution, the expected value $E(X)$ and variance $V(X)$ are as follows.
E(X) = np
V(X) = np(1 - p)
The expected value is the product of the number of trials $n$ and the probability of success $p$, which matches intuition.
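As a quick numerical sanity check (this sketch uses `scipy.stats.binom`, which is not part of the article's own code), we can confirm that both formulas hold for, say, $n = 50$ and $p = 0.1$:

```python
from scipy.stats import binom

n, p = 50, 0.1
# theoretical mean and variance of Binomial(n, p)
mean, var = binom.stats(n, p, moments='mv')
print(float(mean))  # 5.0 = np
print(float(var))   # 4.5 = np(1 - p)
```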
Now let's draw the probability distribution in Python. Let's check the distribution of the number of successes over $50$ trials with a success probability of $10\%$.
```python
import math

import numpy as np
import matplotlib.pyplot as plt


def comb_(n, k):
    # number of combinations nCk
    return math.factorial(n) / (math.factorial(n - k) * math.factorial(k))


def binomial_dist(p, n, k):
    # binomial probability mass function: nCk * p^k * (1-p)^(n-k)
    return comb_(n, k) * (p**k) * ((1 - p) ** (n - k))


p = 0.1  # probability of success
x = np.arange(1, 50, 1)
y = [binomial_dist(p, 50, i) for i in x]

fig = plt.figure()
plt.bar(x, y, align="center", width=0.4, color="blue",
        alpha=0.5, label="binomial p= " + "{:.1f}".format(p))
plt.legend()
plt.ylim(0, 0.3)
plt.xlim(0, 50)
plt.savefig('binomial_dist_sample.png')  # save before show(), which clears the figure
plt.show()
```
Since the probability of success is $10\%$, you can see that $4$ or $5$ successes are the most likely, which matches the expected value $np = 50 × 0.1 = 5$. You can also see that more than $10$ successes is very unlikely, and $20$ successes would be a miracle.
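These observations can be checked numerically; a small sketch using `scipy.stats.binom` (not used in the article's own code):

```python
from scipy.stats import binom

n, p = 50, 0.1
# the most likely number of successes (the mode) is floor((n + 1) * p) = 5
k_mode = max(range(n + 1), key=lambda k: binom.pmf(k, n, p))
print(k_mode)              # 5
print(binom.sf(9, n, p))   # P(X >= 10): only a few percent
print(binom.pmf(20, n, p)) # P(X = 20): vanishingly small
```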
Now let's see how the distribution changes as we increase the probability of success (change $ p $).
```python
import math

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
%matplotlib notebook


def comb_(n, k):
    return math.factorial(n) / (math.factorial(n - k) * math.factorial(k))


def binomial_dist(p, n, k):
    return comb_(n, k) * (p**k) * ((1 - p) ** (n - k))


fig = plt.figure()


def update(a):
    plt.cla()  # clear the previous frame
    x = np.arange(1, 50, 1)
    y = [binomial_dist(a, 50, i) for i in x]
    plt.bar(x, y, align="center", width=0.4, color="blue",
            alpha=0.5, label="binomial p= " + "{:.1f}".format(a))
    plt.legend()
    plt.ylim(0, 0.3)
    plt.xlim(0, 50)


# blit=True requires update() to return the changed artists, so use blit=False here
ani = animation.FuncAnimation(fig,
                              update,
                              interval=1000,
                              frames=np.arange(0.1, 1, 0.1),
                              blit=False)
plt.show()
ani.save('Binomial_dist.gif', writer='pillow')
```
You can see that the closer $p$ is to $0.5$ (a success probability of $50\%$), the wider the spread of the distribution, and the closer it is to $0$ or $1$, the sharper the shape. Looking at the formula $V(X) = np(1-p)$, the variance is largest when $p$ is close to $0.5$. If success and failure are equally likely, the results vary the most, which matches intuition.
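A quick check of this claim (a small sketch, not part of the original article): compute $V(X) = np(1-p)$ for each value of $p$ the animation sweeps and see where it peaks.

```python
import numpy as np

n = 50
p = np.arange(0.1, 1.0, 0.1)  # the same p values the animation sweeps
variance = n * p * (1 - p)    # V(X) = np(1 - p)
# the variance is largest at p = 0.5
print(p[np.argmax(variance)])
```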
Next is the Poisson distribution. The probability distribution that gives the probability that an event occurring on average $\lambda$ times per unit time occurs exactly $k$ times is called the **Poisson distribution**.
- The number of cars passing through a specific intersection in one hour
- The number of visits to a website in one hour
- The number of emails received per day
- The number of visitors to a store within a certain period

are all said to follow the Poisson distribution.
The formula for the Poisson distribution probability mass function is expressed as follows.
P(X=k) = \frac{\lambda^k \mathrm{e}^{-\lambda}}{k!}
It looks like a puzzling formula, but if you want the detailed derivation, please see the previous article. The expected value $E(X)$ and variance $V(X)$ when a random variable $X$ follows the Poisson distribution are as follows.
E(X) = \lambda
V(X) = \lambda
Since we are talking about an event that occurs an average of $\lambda$ times, it makes sense that the expected value is simply $\lambda$.
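Again a quick numerical check, this time with `scipy.stats.poisson` (an assumption of this sketch; the article's own code below implements the pmf by hand):

```python
from scipy.stats import poisson

lam = 5
# theoretical mean and variance of Poisson(lambda)
mean, var = poisson.stats(mu=lam, moments='mv')
print(float(mean), float(var))  # both equal lambda: 5.0 5.0
```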
Now let's draw the probability distribution in Python. Let's overlay the Poisson distributions for events that occur on average $5$, $15$, and $30$ times per unit time.
```python
import math

import numpy as np
import matplotlib.pyplot as plt


def poisson(k, lambda_):
    # Poisson probability mass function: lambda^k * e^(-lambda) / k!
    k = int(k)
    return (lambda_**k) * np.exp(-lambda_) / math.factorial(k)


x = np.arange(1, 50, 1)
y1 = [poisson(i, 5) for i in x]
y2 = [poisson(i, 15) for i in x]
y3 = [poisson(i, 30) for i in x]

plt.bar(x, y1, align="center", width=0.4, color="red",
        alpha=0.5, label="Poisson λ= %d" % 5)
plt.bar(x, y2, align="center", width=0.4, color="green",
        alpha=0.5, label="Poisson λ= %d" % 15)
plt.bar(x, y3, align="center", width=0.4, color="blue",
        alpha=0.5, label="Poisson λ= %d" % 30)
plt.legend()
plt.savefig('Poisson_sample.png')
plt.show()
```
Since the mean and the variance are both equal to $\lambda$, the larger the value of $\lambda$, the wider the spread of the probability distribution. Here is how the distribution changes as $\lambda$ increases.
```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from scipy.stats import poisson

fig = plt.figure()


def update(a):
    plt.cla()  # clear the previous frame
    x = np.arange(1, 50, 1)
    y = [poisson.pmf(i, a) for i in x]
    plt.bar(x, y, align="center", width=0.4, color="red",
            alpha=0.5, label="Poisson λ= %d" % a)
    plt.legend()
    plt.ylim(0, 0.3)
    plt.xlim(0, 50)


# blit=True requires update() to return the changed artists, so use blit=False here
ani = animation.FuncAnimation(fig,
                              update,
                              interval=500,
                              frames=np.arange(1, 31, 1),
                              blit=False)
plt.show()
ani.save('Poisson_distribution.gif', writer='pillow')
```
You can see the spread of the distribution changing as the value of $\lambda$ increases. The larger $\lambda$, the average number of events per unit time, the more the number of occurrences varies.
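Since $V(X) = \lambda$, the standard deviation grows like $\sqrt{\lambda}$; a short check (using `scipy.stats.poisson`, an assumption of this sketch):

```python
import math
from scipy.stats import poisson

for lam in (1, 5, 15, 30):
    # the standard deviation of Poisson(lambda) is sqrt(lambda)
    print(lam, float(poisson.std(mu=lam)), math.sqrt(lam))
```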
The Poisson distribution is actually derived from the binomial distribution: it is the limit of the binomial distribution as $n → ∞$ and $p → 0$ while keeping $np = \lambda$ constant. (This is known as the Poisson limit theorem, or law of small numbers. The derivation is covered in the [previous article](https://qiita.com/g-k/items/836820b826775feb5628), so if you are interested, please have a look there.)
In other words, among events that follow a binomial distribution, **events with a very large number of trials that occur only rarely** follow the Poisson distribution.
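We can also check this limit numerically: holding $np = 1$ fixed while increasing $n$, the binomial pmf approaches the Poisson pmf (a sketch using `scipy.stats`, not part of the original article):

```python
import numpy as np
from scipy.stats import binom, poisson

lam = 1.0
k = np.arange(0, 20)
for n in (10, 100, 1000):
    p = lam / n  # keep np = lambda fixed while n grows
    max_err = np.max(np.abs(binom.pmf(k, n, p) - poisson.pmf(k, lam)))
    print(n, max_err)  # the largest pmf difference shrinks as n grows
```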
As a concrete example, let's overlay the binomial distribution with $n = 100$, $p = 0.01$ and the Poisson distribution with $\lambda = np = 1$.
```python
import math

import numpy as np
import matplotlib.pyplot as plt


def poisson(k, lambda_):
    k = int(k)
    return (lambda_**k) * np.exp(-lambda_) / math.factorial(k)


def comb_(n, k):
    return math.factorial(n) / (math.factorial(n - k) * math.factorial(k))


def binomial_dist(p, n, k):
    return comb_(n, k) * (p**k) * ((1 - p) ** (n - k))


x = np.arange(1, 100, 1)
y1 = [poisson(i, 1) for i in x]
y2 = [binomial_dist(0.01, 100, i) for i in x]

plt.xlim(0, 30)
plt.bar(x, y1, align="center", width=0.4, color="red",
        alpha=0.5, label="Poisson λ= %d" % 1)
plt.bar(x, y2, align="center", width=0.4, color="blue",
        alpha=0.5, label="binomial p= " + "{:.2f}".format(0.01))
plt.legend()
plt.savefig('bino_poisson.png')
plt.show()
```
You can see that the two distributions overlap almost exactly. Actually drawing the distributions like this makes it easier to understand the relationships between them.
Next time, we will cover the geometric distribution, the exponential distribution, and the negative binomial distribution.