[PYTHON] Introduction to Statistical Modeling for Data Analysis GLM Likelihood-Ratio Test and Test Asymmetry

Perform a test to compare the estimated statistical models. This chapter deals with the ** likelihood ratio test **.

Statistical test framework

procedure

  1. Determine the data to use
  2. Design an appropriate statistical model corresponding to the purpose and data structure, and perform maximum likelihood estimation of the parameters.

In the test, a model with few parameters and a model with many parameters (simple model and complex model) are tested. They are called ** null hypothesis ** and ** alternative hypothesis **.

Neyman-Pearson test framework

The statistical model test examines whether the proposition "the null hypothesis is correct" can be denied.

  1. Treat the goodness of the model as ** test statistic **
  2. Assume that the null hypothesis is a "true model"
  3. Examine the variability (probability distribution) of the test statistic and determine the "probable range" of the test statistic value
  4. When the size of the "common range" is 95%, it can be said that a significance level of 5% is set.
  5. If the test statistic obtained by the alternative hypothesis model is out of the "common range", reject the null hypothesis and support the alternative hypothesis.

Likelihood ratio test example: Examine the difference in deviance

Statistical model to use:

-** Constant model : A model in which the average number of seeds $ \ lambda_i $ is constant and does not depend on body size $ x_i $ (slope $ \ beta_2 = 0 $; number of parameters $ k = 1 $) - x model **: A model in which the average number of seeds $ \ lambda_i $ depends on the body size $ x_i $ (slope $ \ beta_2 \ neq 0 $; number of parameters $ k = 2 $)

model Number of parametersk \log L ^ \ast deviance(-2\log L ^ \ast residual deviance AIC
Constant 1 -237.6 475.3 89.5 477.3
x 2 -235.4 470.8 85.0 474.8
full 100 -192.9 385.6 0.0 585.8

The likelihood ratio test uses the difference in deviance, which is the logarithm of the likelihood ratio multiplied by -2. In the case of this example,

\begin{eqnarray}
\frac{L_1 ^\ast}{L_2 ^\ast} &=& \frac{Maximum likelihood of a constant model:\exp(-237.6)}{Maximum likelihood of x model:\exp(-235.4)} \\
\Delta D_{1,2} &=& -2 * (\log L_1 ^ \ast - \log L_2 ^ \ast) \\
\Delta D_{1,2} &=& D_1 - D_2 \\
\Delta D_{1,2} &=& 4.5
\end{eqnarray}

It is evaluated that this deviance is improved by 4.5.

--Null hypothesis: constant model (number of parameters $ k = 1, \ beta_2 = 0 ) --Alternative hypothesis: x model ( k = 2, \ beta_2 \ neq 0 $)

The null hypothesis is \Delta D_{1,2}Rarely difference (rejection) \Delta D_{1,2}Is a common difference (cannot be rejected)
Is a true model Type I error (no problem)
Not a true model (no problem) Type II error

In the Neyman-Pearson test framework ** devoted solely to the examination of type I errors **

As a procedure

  1. Assume that a constant model that is a null hypothesis is correct (a true model)
  2. When a certain model is applied to the observed data, it becomes $ \ hat {\ beta_1} = 2.06 $, so this is considered to be the same as the true model.
  3. Create data from a true model and apply the x model each time to calculate $ \ Delta D_ {1, 2} $ and find the distribution of $ \ Delta D_ {1, 2} $
  4. The probability P that the difference between the deviances of the constant model and the x model is $ \ Delta D_ {1,2} \ geq 4.5 $ is obtained.
  5. If $ \ Delta D_ {1,2} = 4.5 $ is considered an impossible value, the null hypothesis is rejected and the alternative hypothesis is automatically adopted.

The probability $ P $ that the difference between the deviances of the constant model and the x model is $ \ Delta D_ {1,2} \ geq 4.5 $ is called the ** P value **.

--P value is "large": $ \ Delta D_ {1,2} = 4.5 $ is common → Null hypothesis cannot be rejected --P value is "small": $ \ Delta D_ {1,2} = 4.5 $ is very rare → Let's reject the null hypothesis, and the remaining x model is correct

Determine the ** significance level ** in advance (by yourself)

-$ P \ geq \ alpha : The null hypothesis cannot be rejected - P <\ alpha $: Null hypothesis can be rejected

Let.

Null hypothesis: Calculates the probability that $ \ Delta D_ {1,2}, which is a test statistic in the world where a constant model is a true model, will be greater than 4.5 (make a type I error).

Parametric bootstrap method

Above 3. A method in which the process of generating the data of is performed based on a simulation using random numbers. Suppose that the result of the constant model is stored in fit1 and the result of the x model is stored in fit2.

>>> model1 = smf.glm('y~1', data=d, family=sm.families.Poisson())
>>> model2 = smf.glm('y~x', data=d, family=sm.families.Poisson())
>>> fit1 = model1.fit()
>>> fit2 = model2.fit()

# fit1$deviance - fit2$deviance
>>> fit1.deviance - fit2.deviance
4.5139410788540459

From the average number of seeds estimated by a certain model $ \ lambda = \ exp (2.06) = 7.85 $ The data to be generated is "** 100 Poisson random numbers with an average of 7.85 **".

# d$y.rnd <- rpois(100, lambda=mean(d$y))
>>> d.y.rnd = sci.poisson.rvs(d.y.mean(), size=100)
# fit1 <- glm(y.rnd ~ 1, data=d, family=poisson)
>>> model1 = smf.glm('y.rnd ~ 1', data=d, family=sm.families.Poisson())
>>> fit1 = model1.fit()
# fit2 <- glm(y.rnd ~ x, data=d, family=poisson)
>>> model2 = smf.glm('y.rnd ~ x', data=d, family=sm.families.Poisson())
>>> fit2 = model2.fit()
# fit1$deviance - fit2$deviance
>>> fit1.deviance - fit2.deviance
0.63270346107955788

The difference in deviance is 1.92 for data whose mean value is a constant Poisson random number.

The x model with meaningless explanatory variables fits better than the constant model, which is the "true model"

By repeating this step about 1000 times, the ** distribution of test statistic ** and the “distribution of deviance difference $ \ Delta D_ {1, 2} $” in this example can be predicted.

def get_dd(d):
    sample_d = len(d)
    mean_y = d.y.mean()
    d.y.rnd = sci.poisson.rvs(mu=mean_y, size=sample_d)
    model1 = smf.glm('y.rnd ~ 1', data=d, family=sm.families.Poisson())
    model2 = smf.glm('y.rnd ~ x', data=d, family=sm.families.Poisson())
    fit1 = model1.fit()
    fit2 = model2.fit()
    return fit1.deviance - fit2.deviance


def pb(d, bootstrap=1000):
    return pandas.Series([get_dd(d) for _ in xrange(bootstrap)])

results = pb(d)
results.describe()

results[results >= 4.5].count()


results[results>= 4.5].count()
# 32

results.quantile(.95)
# 3.6070094508948025

results.hist()


From the above results, the "probability that the difference in deviance is greater than 4.5" is $ 32/1000 $, that is, $ P = 0.032 $. Also, if you find the difference in deviance $ \ Delta D_ {1,2} $ where $ P = 0.05 $. It becomes $ \ Delta D_ {1,2} \ leq 3.61 $.

From the above, There is a significant difference because "the P value of the deviance difference of 4.5 was 0.032, which is smaller than the significance level of 0.05". It is judged that "the null hypothesis (constant model) is rejected and the x model remains, so this is adopted".

Approximate calculation method using chi-square distribution

The probability distribution of the difference in deviance $ \ Delta D_ {1,2} $ can be approximated by the $ \ chi ^ 2 $ distribution with $ k_2 --k_1 = 2-1 = 1 $ degrees of freedom.

Apparently there is no ANOVA for GLM in stats models ...

The $ \ chi ^ 2 $ distribution approximation is ** an approximation calculation that is effective when the sample size is large **.

The PB method seems to be better for small samples with a small number of individuals investigated, or for data variations that are not Poisson distributions but homoscedastic normal distributions.

Summary

Both the likelihood ratio test and the model selection by AIC focus on the statistic of deviance.

--AIC: The purpose is to select a "model that makes good predictions" --Likelihood ratio test: The purpose is to safely reject the null hypothesis

It seems that there is no choice but to choose according to the purpose.

Recommended Posts

Introduction to Statistical Modeling for Data Analysis GLM Likelihood-Ratio Test and Test Asymmetry
Introduction to Statistical Modeling for Data Analysis GLM Model Selection
An introduction to statistical modeling for data analysis
Introduction to Statistical Modeling for Data Analysis Generalized Linear Models (GLM)
Introduction to Statistical Modeling for Data Analysis Expanding the range of applications of GLM
An introduction to statistical modeling for data analysis (Midorimoto) reading notes (in Python and Stan)
"Introduction to data analysis by Bayesian statistical modeling starting with R and Stan" implemented in Python
Organizing basic procedures for data analysis and statistical processing (4)
Organizing basic procedures for data analysis and statistical processing (2)
[Introduction to Data Scientists] Descriptive Statistics and Simple Regression Analysis ♬
How to use data analysis tools for beginners
[Introduction to minimize] Data analysis with SEIR model ♬
An introduction to voice analysis for music apps
[For beginners] How to study Python3 data analysis exam
Statistical hypothesis test of A/B test and required number of data
Reading Note: An Introduction to Data Analysis with Python
Data processing methods for mechanical engineers and non-computer engineers (Introduction 2)
[Technical book] Introduction to data analysis using Python -1 Chapter Introduction-
Data processing methods for mechanical engineers and non-computer engineers (Introduction 1)
[Introduction to cx_Oracle] (Part 6) DB and Python data type mapping
[Introduction to Udemy Python3 + Application] 42. for statement, break statement, and continue statement
20200329_Introduction to Data Analysis with Python Second Edition Personal Summary
[Introduction to Data Scientists] Basics of Python ♬ Functions and classes
[Introduction to Python] Combine Nikkei 225 and NY Dow csv data
[Python] Introduction to graph creation using coronavirus data [For beginners]
Python for Data Analysis Chapter 4
Python for Data Analysis Chapter 2
Point and Figure Data Modeling
Introduction to Python For, While
Tips for data analysis ・ Notes
Python for Data Analysis Chapter 3
Solving AOJ's Algorithm and Introduction to Data Structures in Python -Part1-
Introduction to Structural Equation Modeling (SEM), Covariance Structure Analysis with Python
Introduction to Data Analysis with Python P32-P43 [ch02 3.US Baby Names 1880-2010]
Introduction to Data Analysis with Python P17-P26 [ch02 1.usa.gov data from bit.ly]
Solving AOJ's Algorithm and Introduction to Data Structures in Python -Part2-
Solving AOJ's Algorithm and Introduction to Data Structures in Python -Part4-
How to make a unit test Part.1 Design pattern for introduction
Introduction to Bayesian Statistical Modeling with python ~ Trying Linear Regression with MCMC ~
Solving AOJ's Algorithm and Introduction to Data Structures in Python -Part3-
[Introduction to Data Scientists] Basics of Python ♬ Conditional branching and loops
[Introduction to Data Scientists] Basics of Python ♬ Functions and anonymous functions, etc.