[PYTHON] Summary of probability distributions that often appear in statistics and data analysis

In statistics, the outcomes of events that occur in the world are called **random variables**, and the **probability distribution** is what gives the likelihood of each value a random variable can take.

The shape of the probability distribution a random variable follows depends on the kind of event it is the outcome of. Below is a summary of the probability distributions commonly used in statistics, together with Python code that draws each distribution.

When the random variable is discrete

Here the outcome of an event, $X$, takes discrete values. For each distribution, a graph of its **probability mass function** is shown as an example.

The probability mass function is a function that assigns a probability to each value of $X$.

Bernoulli distribution

Suppose an event $X$ can have only two outcomes: $X = 0$ or $X = 1$. Then, if $P(X=1) = p$, the random variable $X$ follows the **Bernoulli distribution**.

If the random variable $X$ follows the [Bernoulli distribution](https://ja.wikipedia.org/wiki/%E3%83%99%E3%83%AB%E3%83%8C%E3%83%BC%E3%82%A4%E5%88%86%E5%B8%83) with probability $p$ ($X \sim B(p)$), then:


\begin{align}
P(X=1) &= p \\
P(X=0) &= 1 - p \\
E[X] &= p \\
Var(X) &= p(1-p)
\end{align}

Here $E[X]$ denotes the mean and $Var(X)$ the variance.

As an example, here is the probability mass function for the result of taking a single test that has a 30% pass rate.

bernoulli.py

```python
## Bernoulli distribution ---------------------------------------------
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import bernoulli

# probability that X=1
p = 0.3

# possible outcomes
x = [0, 1]

# visualize
fig, ax = plt.subplots(1, 1)
ax.plot(x, bernoulli.pmf(x, p), 'bo', ms=8)
ax.vlines(x, 0, bernoulli.pmf(x, p), colors='b', lw=5, alpha=0.5)
ax.set_xlabel('X')
ax.set_ylabel('probability')
ax.set_title('bernoulli pmf')
ax.set_ylim((0, 1))
ax.set_aspect('equal')
plt.show()
```

bernoulli.png

Binomial distribution

The binomial distribution is a generalization of the Bernoulli distribution: it is simply the sum of $N$ Bernoulli trials. It represents the probability that an event with probability $p$ occurs $k$ times out of $N$ trials.

When $X \sim Bin(N, p)$:


\begin{align}
P(X=k|p) &= \binom{N}{k} p^{k}(1-p)^{N-k} \\
E[X] &= Np \\
Var(X) &= Np(1-p)
\end{align}
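As a quick sanity check (an addition, not from the original article), summing $N$ simulated Bernoulli trials reproduces the binomial mean $Np$ and variance $Np(1-p)$; a minimal sketch:

```python
# Addition: simulate the binomial as a sum of N Bernoulli trials.
import numpy as np

rng = np.random.default_rng(0)
N, p = 5, 0.3
x = rng.binomial(1, p, size=(100_000, N)).sum(axis=1)  # sum of N Bernoulli draws
print(x.mean(), N * p)           # both ~1.5
print(x.var(), N * p * (1 - p))  # both ~1.05
```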

As an example, here is the probability mass function of the number of passes when taking a test with a 30% pass rate five times.

binomial.py

```python
## Binomial distribution ---------------------------------------------
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import binom

# probability that X=1
p = 0.3

# the number of trials
N = 5

# X=1 happens k times
k = np.arange(N + 1)

# visualize
fig, ax = plt.subplots(1, 1)
ax.plot(k, binom.pmf(k, N, p), 'bo', ms=8)
ax.vlines(k, 0, binom.pmf(k, N, p), colors='b', lw=5, alpha=0.5)
ax.set_xlabel('k')
ax.set_ylabel('probability')
ax.set_title('binomial pmf')
ax.set_ylim((0, 1))
plt.show()
```

binomial.png

Poisson distribution

The [Poisson distribution](https://ja.wikipedia.org/wiki/%E3%83%9D%E3%82%A2%E3%82%BD%E3%83%B3%E5%88%86%E5%B8%83) is used to describe the number of times an event occurs. Its only parameter is $\lambda$, which represents the average frequency of the event.

When $X \sim Pois(\lambda)$:


\begin{align}
P(X=k|\lambda) &= \frac{\lambda^{k} e^{-\lambda}}{k!} \hspace{15pt}  for \hspace{10pt} k = 0, 1, 2, ...\\
E[X] &= \lambda \\
Var(X) &= \lambda 
\end{align}

The Poisson distribution is characterized by mean = variance = $\lambda$.
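A quick numerical check of this property (an addition, not in the original):

```python
# Addition: verify mean = variance = lambda for a large Poisson sample.
import numpy as np

rng = np.random.default_rng(0)
lam = 5
x = rng.poisson(lam, size=100_000)
print(x.mean())  # ~5
print(x.var())   # ~5
```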

An example is the firing of nerve cells. Neuronal firing is known to be approximately a Poisson process, so assuming the average firing rate of a neuron is 5 (spikes/s), the probability distribution of the number of firings per second is as follows.

poisson.py

```python
## Poisson distribution --------------------------------------------
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import poisson

# rate parameter
mu = 5

# possible counts (between the 1% and 99% quantiles)
k = np.arange(poisson.ppf(0.01, mu), poisson.ppf(0.99, mu) + 1)

# visualize
fig, ax = plt.subplots(1, 1)
ax.plot(k, poisson.pmf(k, mu), 'bo', ms=8)
ax.vlines(k, 0, poisson.pmf(k, mu), colors='b', lw=5, alpha=0.5)
ax.set_xlabel('k')
ax.set_ylabel('probability')
ax.set_title('poisson pmf')
plt.show()
```

poisson.png

When the random variable is continuous

Here the outcome of an event, $X$, takes continuous values rather than discrete ones. For each distribution, a graph of its **probability density function** is shown as an example.

The probability density function, unlike the probability mass function, assigns each $X$ a relative likelihood rather than a probability. The probability that $X$ falls in the range $a \leq X \leq b$ is given by the area under the density function over that range, that is, the integral $\int_{a}^{b} f(X)dX$. By the definition of probability, the area over the entire range $-\infty \leq X \leq \infty$ is 1.
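This area interpretation can be checked numerically (a small addition, using the standard normal density as an arbitrary example):

```python
# Addition: P(a <= X <= b) equals the integral of the pdf over [a, b].
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

a, b = -1.0, 1.0
area, _ = quad(norm.pdf, a, b)
print(area)                       # ~0.6827
print(norm.cdf(b) - norm.cdf(a))  # same value via the cdf
total, _ = quad(norm.pdf, -np.inf, np.inf)
print(total)                      # ~1.0 over the whole real line
```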

Exponential distribution

The exponential distribution describes, for an event that occurs repeatedly, how much time passes between one occurrence and the next. In short, it is the distribution of waiting times.

More precisely, it represents the time between events in a Poisson process. Its parameter is therefore $\lambda$, the average frequency, just as for the Poisson distribution.

When $X \sim Exp(\lambda)$:


\begin{align}
f(x|\lambda) &= \lambda e^{-\lambda x} \hspace{15pt} (x \geq 0) \\
E[X] &= \frac{1}{\lambda} \\
Var(X) &= \frac{1}{\lambda^{2}} 
\end{align}

In other words, if events follow a Poisson process, the average waiting time from one event to the next is $1/\lambda$.
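A minimal simulation sketch of this fact (an addition, not from the original): for a Poisson process with rate $\lambda$, the simulated inter-event times have mean $1/\lambda$ and variance $1/\lambda^{2}$.

```python
# Addition: inter-event times of a rate-lam Poisson process are Exp(lam).
import numpy as np

rng = np.random.default_rng(0)
lam = 5  # average frequency (events / s)
waits = rng.exponential(scale=1 / lam, size=100_000)
print(waits.mean())  # ~0.2  = 1/lam
print(waits.var())   # ~0.04 = 1/lam**2
```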

The following shows, as an exponential distribution, the distribution of the time from one firing of the example nerve cell to the next.

exponential.py

```python
## Exponential distribution --------------------------------------------
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import expon

# rate parameter
mu = 5

# possible waiting times (between the 1% and 99% quantiles)
x = np.linspace(expon.ppf(0.01, loc=0, scale=1/mu), expon.ppf(0.99, loc=0, scale=1/mu), 100)

# visualize
fig, ax = plt.subplots(1, 1)
ax.plot(x, expon.pdf(x, loc=0, scale=1/mu), 'r-', lw=5, alpha=0.5)
ax.set_xlabel('waiting time')
ax.set_ylabel('f(x)')
ax.set_title('exponential pdf')
plt.show()
```

exponential.png

Uniform distribution

The uniform distribution is the simplest kind of distribution: every value of the random variable $x$ within a specified range has the same density, making it the essential probability distribution when all possible outcomes are equally likely.

When $X \sim U[a, b]$, the following holds.


\begin{align}
f(x) &= \frac{1}{b - a} \hspace{15pt} if \hspace{15pt} a \leq x \leq b \\
f(x) &= 0  \hspace{45pt} otherwise \\
E[X] &= \frac{a + b}{2} \\
Var(X) &= \frac{(b - a)^{2}}{12}
\end{align}

Also, for means (not only of uniform distributions), "the mean of a sum is the sum of the means" and "the mean of a product is the product of the means" (the latter only when the variables are independent):


\begin{align}
E[X+Y] &= E[X] + E[Y] \\
E[XY] &= E[X]E[Y] \hspace{20pt} (if \hspace{5pt} X \perp Y)
\end{align}
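These two properties are easy to confirm by simulation (an addition, with two arbitrary independent uniform variables):

```python
# Addition: check E[X+Y] = E[X] + E[Y] and E[XY] = E[X]E[Y] (X, Y independent).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(3, 4, size=100_000)
y = rng.uniform(0, 1, size=100_000)
print(np.mean(x + y), np.mean(x) + np.mean(y))  # approximately equal
print(np.mean(x * y), np.mean(x) * np.mean(y))  # approximately equal
```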

As an example, here is the probability density function for an experimental result that takes a continuous value between 3 and 4, with every value in that range equally likely.

uniform.py

```python
## Uniform distribution ---------------------------------------------
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import uniform

# range: X is uniform on [a, b]
a = 3
b = 4
x = np.linspace(uniform.ppf(0.01, loc=a, scale=b - a), uniform.ppf(0.99, loc=a, scale=b - a), 100)

# visualize
fig, ax = plt.subplots(1, 1)
ax.plot(x, uniform.pdf(x, loc=a, scale=b - a), 'r-', alpha=0.6, lw=5)
ax.set_xlabel('X')
ax.set_ylabel('f(x)')
ax.set_title('uniform pdf')
plt.show()
```

uniform.png

Gamma distribution

The gamma distribution represents the distribution of the total waiting time until an event occurs $n$ times. Therefore, when $n = 1$, the gamma distribution coincides with the exponential distribution. It has two parameters, $\alpha = n$ and $\beta = \lambda$, where $\lambda$ is the average frequency of the event, as in the Poisson and exponential distributions.

When $X \sim Gamma(\alpha, \beta)$, the following holds.


\begin{align}
f(x|\alpha, \beta) &= \frac{\beta^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x} \hspace{20pt} (x \geq 0) \\
E[X] &= \frac{\alpha}{\beta} \\
Var[X] &= \frac{\alpha}{\beta^{2}} \\
\end{align}
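The "total waiting time" interpretation can be checked by simulation (an addition, not in the original): the sum of $n$ independent exponential waiting times with rate $\lambda$ behaves like $Gamma(\alpha = n, \beta = \lambda)$.

```python
# Addition: sum of n Exp(lam) waiting times has mean n/lam, variance n/lam**2.
import numpy as np

rng = np.random.default_rng(0)
n, lam = 7, 5
totals = rng.exponential(scale=1 / lam, size=(100_000, n)).sum(axis=1)
print(totals.mean(), n / lam)    # both ~1.4  = alpha/beta
print(totals.var(), n / lam**2)  # both ~0.28 = alpha/beta**2
```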

The $\Gamma(\cdot)$ appearing in the formula is called the **gamma function**, and it is in fact a generalization of the factorial $(!)$. The following formula holds:


\Gamma(n) = (n-1)! \hspace{15pt} (n = 1, 2, 3, \ldots)

The gamma distribution is used to model right-skewed data, and through the gamma function it is connected to the factorial, so it appears frequently in statistics.
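The factorial relationship is easy to verify (a small addition using `scipy.special.gamma`):

```python
# Addition: the gamma function generalizes the factorial, Gamma(n) = (n-1)!.
import math
from scipy.special import gamma as gamma_fn

for n in range(1, 6):
    print(n, gamma_fn(n), math.factorial(n - 1))  # the last two columns match
```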

Below is the gamma distribution with $\alpha = 7, \beta = 5$. Interpret it as the distribution of the total waiting time until the example nerve cell fires 7 times.

gamma.py

```python
## Gamma distribution --------------------------------------------
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import gamma

# number of events (shape parameter alpha)
n = 7

# rate parameter (beta)
mu = 5

# possible total waiting times
x = np.linspace(gamma.ppf(0.01, n, loc=0, scale=1/mu), gamma.ppf(0.99, n, loc=0, scale=1/mu), 100)

# visualize
fig, ax = plt.subplots(1, 1)
ax.plot(x, gamma.pdf(x, n, loc=0, scale=1/mu), 'r-', lw=5, alpha=0.5)
ax.set_xlabel('total waiting time until the neuron fires 7 times')
ax.set_ylabel('f(x)')
ax.set_title('gamma pdf')
plt.show()
```

gamma.png

Beta distribution

The beta distribution is used to represent a variable $X$ with $0 \leq X \leq 1$. In other words, it can model a probability itself.

When $X \sim Beta(\alpha, \beta)$, the following holds.


\begin{align}
f(x|\alpha, \beta) &= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)}x^{\alpha-1}(1 - x)^{\beta - 1} \hspace{10pt} (0 \leq x \leq 1) \\
E[X] &= \frac{\alpha}{\alpha + \beta} \\
Var[X] &= \frac{\alpha \beta}{(\alpha + \beta)^{2}(\alpha + \beta + 1)}
\end{align}

At first glance the formula looks complicated, but, perhaps surprisingly, when $\alpha = \beta = 1$ the beta distribution coincides with the uniform distribution $U(0,1)$.
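A quick check of this fact (an addition):

```python
# Addition: Beta(1, 1) has the same density as U(0, 1).
import numpy as np
from scipy.stats import beta, uniform

x = np.linspace(0.01, 0.99, 5)
print(beta.pdf(x, 1, 1))               # all ones
print(uniform.pdf(x, loc=0, scale=1))  # all ones as well
```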

As an example, let's look at the beta distribution with $\alpha = \beta = 0.5$. The beta distribution can take a wide variety of shapes depending on its parameters; in this case it becomes a characteristic valley-like (U-shaped) distribution.

beta.py

```python
## Beta distribution --------------------------------------------
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import beta

# parameters
a = 0.5
b = 0.5

# range
x = np.linspace(beta.ppf(0.01, a, b), beta.ppf(0.99, a, b), 100)

# visualize
fig, ax = plt.subplots(1, 1)
ax.plot(x, beta.pdf(x, a, b), 'r-', lw=5, alpha=0.5)
ax.set_xlabel('X')
ax.set_ylabel('f(x)')
ax.set_title('beta pdf')
plt.show()
```

beta.png

The beta distribution essentially means: "when random variables $X (0 \leq X \leq 1)$ following a uniform distribution are sorted in ascending order, the distribution of the $p$-th value from the bottom is the beta distribution $Beta(p, q)$." In statistics, however, its main use is in Bayesian updating as a natural conjugate prior, so what matters is not so much that interpretation as the variety of shapes the beta distribution can take (a small Bayesian-update sketch follows the figure below). Here is the beta distribution plotted for various combinations of parameters.

Diverse Beta Distributions
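Since the main statistical use mentioned above is Bayesian updating, here is a minimal sketch of it (an addition, assuming a simple coin-flip model): with a $Beta(a, b)$ prior on a success probability, observing $k$ successes in $n$ trials gives a $Beta(a + k, b + n - k)$ posterior.

```python
# Addition: beta-binomial conjugate update, prior Beta(a, b) -> posterior.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

a, b = 1, 1   # uniform prior on the success probability
k, n = 3, 10  # observed: 3 successes out of 10 trials

x = np.linspace(0, 1, 200)
plt.plot(x, beta.pdf(x, a, b), label='prior Beta(1, 1)')
plt.plot(x, beta.pdf(x, a + k, b + n - k), label='posterior Beta(4, 8)')
plt.xlabel('p')
plt.ylabel('f(p)')
plt.legend()
plt.show()
```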

Normal distribution

Finally, the normal distribution (Gaussian distribution). It underlies the **central limit theorem** (regardless of the population's distribution, the distribution of the sample mean's error around the true mean approaches a normal distribution given enough trials), and it is a probability distribution that neither statistics nor science can do without.

When $X \sim N(\mu, \sigma^{2})$, the following holds.


\begin{align}
f(x|\mu, \sigma^{2}) &= \frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp \left( -\frac{(x - \mu)^{2}}{2 \sigma^{2}} \right) \\
E[X] &= \mu \\
Var(X) &= \sigma^{2}
\end{align}

The normal distribution contains over 99% of the data $X$ (about 99.7%) within three standard deviations $\sigma$ of the mean. Therefore, a value outside the $3\sigma$ range is often judged to be an **outlier**.
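The coverage of the $k\sigma$ ranges can be checked directly (an addition):

```python
# Addition: fraction of N(0, 1) data within 1, 2, and 3 standard deviations.
from scipy.stats import norm

for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))  # ~0.683, ~0.954, ~0.997
```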

Also, for multiple independent normally distributed variables, the following holds.


\begin{align}
X_1 &\sim N(\mu_1, \sigma_1^{2}), \hspace{10pt} X_2 \sim N(\mu_2, \sigma_2^{2}), \hspace{10pt} X_1 \perp X_2 \\
X_1 + X_2 &\sim N(\mu_1 + \mu_2, \sigma_1^{2} + \sigma_2^{2})
\end{align}

That is, the sum of variables drawn from independent normal distributions follows a new normal distribution whose mean is the sum of the means and whose variance is the sum of the variances.
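A simulation sketch of this property (an addition, with arbitrary parameters):

```python
# Addition: the sum of two independent normals is normal with summed mean/variance.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(1.0, 2.0, size=100_000)  # N(1, 2^2)
x2 = rng.normal(3.0, 1.0, size=100_000)  # N(3, 1^2)
s = x1 + x2
print(s.mean())  # ~4.0 = 1 + 3
print(s.var())   # ~5.0 = 2^2 + 1^2
```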

As an example graph, here is the case $\mu = 0, \sigma^{2} = 1$, that is, the **standard normal distribution (z-distribution)**.

normal.py

```python
## Normal distribution --------------------------------------------
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm

# mean
mu = 0

# standard deviation
sd = 1

# range (between the 1% and 99% quantiles)
x = np.linspace(norm.ppf(0.01, loc=mu, scale=sd), norm.ppf(0.99, loc=mu, scale=sd), 100)

# visualize
fig, ax = plt.subplots(1, 1)
ax.plot(x, norm.pdf(x, loc=mu, scale=sd), 'r-', lw=5, alpha=0.5)
ax.set_xlabel('X')
ax.set_ylabel('f(x)')
ax.set_title('normal pdf')
plt.show()
```

normal.png

It is worth remembering that in the standard normal distribution, the edges of the range containing 95% of the data lie at **$\pm 1.96$**. Values beyond this are rare (about 2.5% of the total on each side); on the plus side they correspond to a deviation score of about 70 or more.

By the way, when calculating the deviation score, test scores are standardized to follow $N(50, 10^{2})$. Below is the formula for converting a test score $x_i$ into the deviation score $T_i$ when the score distribution is $N(\mu_x, \sigma_x^{2})$.


T_i = \frac{10(x_i - \mu_x)}{\sigma_x} + 50
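As a helper (an addition; the function name is my own), the formula translates directly into code:

```python
# Addition: convert raw scores from N(mu_x, sigma_x^2) to deviation scores N(50, 10^2).
import numpy as np

def deviation_score(x, mu, sigma):
    return 10 * (np.asarray(x) - mu) / sigma + 50

print(deviation_score(80, mu=60, sigma=20))  # 60.0
print(deviation_score(60, mu=60, sigma=20))  # 50.0 (the mean maps to 50)
```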

T distribution

Last is the t distribution. The t-test is a convenient way to test whether the means of two groups differ significantly, and it is often abused without being well understood ~~especially in the bio world~~, so let's review here what the t distribution actually is.

Let's assume the experimental data $X$ are obtained from a population that follows the normal distribution $N(\mu, \sigma^{2})$.


X \sim N(\mu, \sigma^{2})

At this time, since the experimental data $X = X_1, X_2, ..., X_n$ were obtained from the same probability distribution **independently of one another (independent and identically distributed; i.i.d.)**, the sample mean $\overline{X}$ can be transformed as follows using the properties of independent normal distributions:


\begin{align}
\Sigma_{i=1}^{n} X_i &\sim N(n\mu, n\sigma^{2}) \\
\frac{1}{n}\Sigma_{i=1}^{n} X_i &\sim N(\mu, \frac{\sigma^{2}}{n}) \\
\overline{X} &\sim N(\mu, \frac{\sigma^{2}}{n}) \\
\frac{\overline{X} - \mu}{\sigma /\sqrt{n}} &\sim N(0,1)
\end{align}

Then, if the standardized sample mean is greater than $1.96$ or less than $-1.96$, the edge values containing 95% of the data in the standard normal distribution, it can be said that the **sample mean $\overline{X}$ is significantly far from the population mean**. This is the **Z-test**.
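A minimal Z-test sketch (an addition, assuming the population $\sigma$ is known and using arbitrary example numbers):

```python
# Addition: one-sample two-sided Z-test done by hand.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu0, sigma, n = 0.0, 1.0, 30        # hypothesized mean, known sd, sample size
x = rng.normal(0.5, sigma, size=n)  # data actually centered at 0.5

z = (x.mean() - mu0) / (sigma / np.sqrt(n))
p = 2 * (1 - norm.cdf(abs(z)))      # two-sided p-value
print(z, p, abs(z) > 1.96)          # reject H0 at the 5% level if True
```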

In practice, the population standard deviation $\sigma$ is often unknown (so the Z-test cannot be applied), and instead we use **$S$, the population standard deviation estimated from the sample via the unbiased variance**. Normally the standard deviation is computed by summing the squared differences between the data and the mean, dividing by the number of data points $n$, and taking the square root; for the unbiased variance, we divide by $n-1$ instead of $n$.


S = \sqrt{\Sigma_{i}(X_i - \overline{X})^{2}/(n-1)}

With this simple change, **the expected value of the variance estimated from the sample matches the population variance**.
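A quick simulation (an addition) shows why dividing by $n-1$ matters:

```python
# Addition: dividing by n-1 (ddof=1) makes the variance estimate unbiased.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(0, 1, size=(100_000, 5))  # many samples of size 5, true var 1
print(samples.var(axis=1, ddof=0).mean())      # ~0.8 (biased: divides by n)
print(samples.var(axis=1, ddof=1).mean())      # ~1.0 (unbiased: divides by n-1)
```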

However, once we do this, the standardized $\overline{X}$ no longer follows a normal distribution. Instead, it follows a **t distribution with $n-1$ degrees of freedom**.

When $Y \sim t_{\gamma}$, the following holds.


\begin{align}
f(y) &= \frac{\Gamma(\frac{\gamma + 1}{2})}{\Gamma(\frac{\gamma}{2})\sqrt{\gamma \pi}}(1 + \frac{y^{2}}{\gamma})^{-\frac{\gamma + 1}{2}} \\
E[Y] &= 0 \hspace{10pt} if \hspace{10pt} \gamma > 1 \\
Var[Y] &= \frac{\gamma}{\gamma-2} \hspace{10pt} if \hspace{10pt} \gamma > 2 \\
\end{align} 

If the $t$ value computed from the experimental mean $\overline{x}$ falls outside the edge values containing 95% of the data in the t distribution determined by the degrees of freedom, the difference between the population mean and the sample mean can be judged significant. This is the **t-test**.
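A minimal one-sample t-test sketch (an addition, with made-up data): the manual $t$ statistic and p-value agree with `scipy.stats.ttest_1samp`.

```python
# Addition: one-sample t-test by hand vs. scipy.stats.ttest_1samp.
import numpy as np
from scipy.stats import t, ttest_1samp

rng = np.random.default_rng(0)
x = rng.normal(0.5, 1.0, size=16)  # sample data
mu0 = 0.0                          # hypothesized population mean

s = x.std(ddof=1)                  # S, based on the unbiased variance
t_val = (x.mean() - mu0) / (s / np.sqrt(len(x)))
p_val = 2 * (1 - t.cdf(abs(t_val), df=len(x) - 1))
print(t_val, p_val)
print(ttest_1samp(x, mu0))         # same t statistic and p-value
```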

As a graph, here is a t distribution with 15 degrees of freedom. The t distribution has a shape close to the normal distribution but with slightly heavier tails; as the degrees of freedom increase, it approaches the normal distribution.

t.py

```python
## t distribution --------------------------------------------
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import t

# degrees of freedom
df = 15

# range (between the 1% and 99% quantiles)
x = np.linspace(t.ppf(0.01, df), t.ppf(0.99, df), 100)

# visualize
fig, ax = plt.subplots(1, 1)
ax.plot(x, t.pdf(x, df), 'r-', lw=5, alpha=0.5)
ax.set_xlabel('y')
ax.set_ylabel('f(y)')
ax.set_title('t pdf')
plt.show()
```

t.png

In conclusion

We have summarized the probability distributions that often appear in statistics and data analysis, but of course there are many more types of probability distributions out there. If I have time, I would like to add more.
