[PYTHON] Summary of gamma distribution parameter specification method

Gamma distribution

The gamma distribution is used to model values that take positive continuous values. There are two parameters, but the definitions are slightly different depending on the library, and it is difficult to understand the average value and variance, so I feel like I'm always researching them, so I'll summarize them.

Probability density function

\begin{align}
f(x) &= \frac{1}{\Gamma(k) \theta^k} x^{k-1} e^{- \frac{x}{\theta}} \\
&= \frac{\lambda^k}{\Gamma(k)} x^{k-1} e^{- \lambda x}
\end{align}

k: shape parameter \theta: scale parameter Both are parameters with positive values. However, depending on the library, it may be expressed using $ \ lambda = \ frac {1} {\ theta} $.

It looks like this when plotted with some parameters. gamma_dists.png

Statistics

Average $ \ mu = k \ theta = \ frac {k} {\ lambda} $ Variance $ v = k \ theta ^ 2 = \ frac {k} {\ lambda ^ 2} $

On the contrary, if you want to determine the parameters from the mean and variance, use this. \theta = \frac{v}{\mu} \ (\lambda = \frac{\mu}{v}) k = \frac{\mu^2}{v}

Parameter specification method in each library

In summary, it looks like this. I wanted to write this table!

Library shape parameter scale parameter
numpy.random.gamma k \theta
scipy.stats.gamma a \theta
PyMC3(pm.Gamma) \alpha 1 / \beta
TensorFlow Probability (tfp.Gamma)) concentration 1/rate
Stan (gamma) \alpha 1 / \beta
R (rgamma) shape scale, 1/rate

numpy and scipy use scale parameter $ \ theta $, but in so-called stochastic programming languages such as PyMC3 and Stan TFP, they are specified by the reciprocal of $ \ theta $.

The reciprocal of $ \ theta $, $ \ lambda $, is called rate parameter, and the library that calls the parameter $ \ alpha, \ beta $ seems to adopt the definition by $ \ lambda $.

In PyMC3, it is also possible to specify the gamma distribution by mean (mu) and standard deviation (sigma). Also, in R, it seems that either shape or rate can be specified.

Confirmation of implementation

In order to confirm whether the above parameter list is correct, we obtained 10,000 random numbers from $ Gamma (2, 2) $ in each library, estimated the probability density function, and compared them.

import numpy as np
import scipy as sp
import pymc3 as pm
import tensorflow_probability as tfp
import pystan
import matplotlib.pyplot as plt
import seaborn as sns

shape = 2
scale = 2
rate = 1 / scale
n_sample = 10000

xx = np.linspace(0, 20)
# ground truth
gamma_pdf = sp.stats.gamma(a=shape, scale=scale).pdf

s_np = np.random.gamma(shape=shape, scale=scale, size=n_sample)
s_sp = sp.stats.gamma(a=shape, scale=scale).rvs(size=n_sample)
s_tfp = tfp.distributions.Gamma(concentration=shape, rate=rate).sample(n_sample).numpy()
s_pm = pm.Gamma.dist(alpha=shape, beta=rate).random(size=n_sample)

fig, ax = plt.subplots()
ax.plot(xx, gamma_pdf(xx), label='truth', lw=2, c='k')
sns.kdeplot(s_np, ax=ax, label='numpy', alpha=0.8)
sns.kdeplot(s_sp, ax=ax, label='scipy', alpha=0.8)
sns.kdeplot(s_tfp, ax=ax, label='TFP', alpha=0.8)
sns.kdeplot(s_pm, ax=ax, label='PyMC3', alpha=0.8)

The result is shown in the figure below, and it was confirmed that all libraries were implemented correctly.

gamma_pdf.png

Only Stan did not know how to obtain random numbers directly from the probability distribution, so instead I tried to estimate the parameters of the gamma distribution from the random numbers obtained above.

stan_code = '''

data {
  int N;
  vector<lower=0>[N] Y;
}

parameters {
  real<lower=0> shape;
  real<lower=0> rate;
}

model {
  Y ~ gamma(shape, rate);
}

'''

data = dict(N=n_sample, Y=s_np)
stan_model = pystan.StanModel(model_code=stan_code)
fit = stan_model.sampling(data=data)
print(fit)

The estimated values of the shape and rate parameters are 1.98 and 0.49, respectively, which are also the expected results.

Inference for Stan model: anon_model_6a5d60bed963727c801dc434b96a49a1.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
shape   1.98  8.3e-4   0.03   1.93   1.97   1.98    2.0   2.04   1016    1.0
rate    0.49  2.3e-4 7.4e-3   0.47   0.48   0.49   0.49    0.5   1020    1.0
lp__  -2.3e4    0.03   1.01 -2.3e4 -2.3e4 -2.3e4 -2.3e4 -2.3e4   1192    1.0

Designation of distribution by mean and standard deviation

It is convenient to prepare a function to calculate the shape and scale of the corresponding gamma distribution from the mean and standard deviation.

def calc_gamma_param(mu, sigma):
    return (mu / sigma)**2, sigma**2 / mu

mu, sigma = 4, 2
shape, scale = calc_gamma_param(mu, sigma)

def plot_gamma(xx, shape, scale):
    plt.plot(xx, sp.stats.gamma(a=shape, scale=scale).pdf(xx), label=f'shape={shape}, scale={scale}')
    
xx = np.linspace(0, 10)
plot_gamma(xx, shape, scale)
plt.legend()

Gamma distribution with mean 4 and standard deviation 2. Note that the mode (the most probable value) is smaller than the mean because the distribution has a long tail to the right.

gamma_mu_sigma.png

Bonus: Relationship with other probability distributions

When $ k = 1 $, the gamma distribution matches the exponential distribution of the parameter $ \ theta $. When $ k = \ frac {n} {2} (n = 1,2, \ dots), \ \ theta = 2 $, the gamma distribution matches the chi-square distribution with $ n $ degrees of freedom.

gamma_exp_chi2.png

Click here for the code.

from scipy import stats

xx = np.linspace(0, 10)
fig, ax = plt.subplots(1, 2, figsize=(10, 3))

shape, scale = 1, 3
ax[0].plot(xx, stats.gamma(a=shape, scale=scale).pdf(xx), label=f'Gamma({shape}, {scale})')
ax[0].plot(xx, stats.expon(scale=scale).pdf(xx), label=f'Exp({scale})')
ax[0].legend()

shape, scale = 3/2, 2 
ax[1].plot(xx, stats.gamma(a=shape, scale=scale).pdf(xx), label=f'Gamma({shape}, {scale})')
ax[1].plot(xx, stats.chi2(df=2*shape).pdf(xx), label=f'Chi2({int(2*shape)})')
ax[1].legend()

plt.savefig('gamma_exp_chi2.png')

Recommended Posts

Summary of gamma distribution parameter specification method
Summary of test method
Summary of Linux distribution types
Summary of SQLAlchemy connection method by DB
Axis option specification summary of Python "numpy.sum (...)"
Clustering of clustering method
[Python] Summary of table creation method using DataFrame (pandas)
[Summary of 27 languages] My number check digit calculation method
Numerical summary of data
Summary of pyenv usage
Summary of string operations
Summary of Python arguments
Verification of normal distribution
Summary of logrotate software logrotate