[Python] Introduction to Statistical Modeling for Data Analysis: GLM Model Selection

What is a good model?

A model whose linear predictor is only an intercept ($\log \lambda = \beta_1$) is too simple. Increasing the number of parameters, e.g. $\log \lambda = \beta_1 + \beta_2 x + \dots + \beta_7 x^6$, improves the fit to the data.

The code for fitting this sixth-degree polynomial model is shown below.

>>> import numpy
>>> import pandas
>>> import matplotlib.pyplot as plt
>>> import statsmodels.api as sm
>>> import statsmodels.formula.api as smf

# data3a.csv: y = seed count, x = body size, f = treatment (C = control, T = fertilized)
>>> d = pandas.read_csv("data3a.csv")

>>> model = smf.glm('y ~ x + I(x**2) + I(x**3) + I(x**4) + I(x**5) + I(x**6)', data=d, family=sm.families.Poisson())
>>> result = model.fit()
>>> result.summary()

>>> params = result.params

>>> fit = lambda x: numpy.exp(params[0] + params[1] * x + params[2] * x ** 2 + params[3] * x ** 3 + params[4] * x ** 4 + params[5] * x ** 5 + params[6] * x ** 6)

>>> xx = numpy.linspace(min(d.x), max(d.x), 100)
>>> plt.plot(xx, fit(xx))

>>> plt.plot(d.x[d.f == 'T'], d.y[d.f == 'T'], 'ro')  # fertilized (f = T) in red
>>> plt.plot(d.x[d.f == 'C'], d.y[d.f == 'C'], 'bo')  # control (f = C) in blue
>>> plt.show()

**AIC**: "A model that makes good **predictions** is a good model."

So far, four types of models have been applied to the seed data:

- a model with no effects (constant model)
- a model with a body-size effect (x model)
- a model with a fertilization effect (f model)
- a model with both body-size and fertilization effects (x + f model)

# constant model
>>> plt.subplot(221)
>>> model = smf.glm('y ~ 1', data=d, family=sm.families.Poisson())
>>> result = model.fit()
>>> plt.plot(xx, [numpy.exp(result.params[0]) for _ in range(100)])

# f model
>>> plt.subplot(222)
>>> model = smf.glm('y ~ f', data=d, family=sm.families.Poisson())
>>> result = model.fit()
>>> plt.plot(xx, [numpy.exp(result.params[0] + result.params[1]) for _ in range(100)], 'r')  # f = T
>>> plt.plot(xx, [numpy.exp(result.params[0]) for _ in range(100)], 'b')  # f = C

# x model
>>> plt.subplot(223)
>>> model = smf.glm('y ~ x', data=d, family=sm.families.Poisson())
>>> result = model.fit()
>>> fit = lambda x: numpy.exp(result.params[0] + result.params[1] * x)
>>> plt.plot(xx, fit(xx))

# x + f model
>>> plt.subplot(224)
>>> model = smf.glm('y ~ x + f', data=d, family=sm.families.Poisson())
>>> result = model.fit()
>>> fit = lambda x: numpy.exp(result.params[0] + result.params[2] * x)  # f = C
>>> plt.plot(xx, fit(xx), 'b')
>>> fit = lambda x: numpy.exp(result.params[0] + result.params[1] + result.params[2] * x)  # f = T
>>> plt.plot(xx, fit(xx), 'r')
>>> plt.show()

Badness of fit of a statistical model: deviance

**Deviance** $D$, a statistic measuring badness of fit, is defined as a transformation of the maximum log-likelihood $\log L^\ast$:

D = -2 \log L^\ast

The `Deviance` shown in the earlier `result.summary()` output is the **residual deviance** (the model's deviance minus the minimum deviance).
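
For reference, statsmodels exposes these quantities directly on the fitted result as `GLMResults` attributes:

>>> result.llf            # maximum log-likelihood  log L*
>>> result.deviance       # residual deviance, the value shown in summary()
>>> result.null_deviance  # deviance of the Null model (intercept only)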

Minimum deviance

The minimum deviance is the deviance of the **full model**, the model with $\lambda_i = y_i$ (one parameter per observation). Its maximum log-likelihood is

\log L = \sum_i \log \frac{y_i^{y_i} \exp(-y_i)}{y_i!}

Computing this minimum deviance for the seed data:

# minimum deviance (= deviance of the full model)
>>> import scipy.stats as sct
>>> dpois = lambda x: sct.poisson.logpmf(x, x)  # log-likelihood of each y_i under lambda_i = y_i
>>> sum(dpois(d.y))
-192.8897525244958
>>> sum(dpois(d.y)) * (-2)
385.77950504899161

From the above, the residual deviance of the x model is

>>> result = smf.glm('y ~ x', data=d, family=sm.families.Poisson()).fit()  # refit the x model
>>> result.llf * (-2) - sum(dpois(d.y)) * (-2)
84.992996490729922
>>> result.summary()
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  100
Model:                            GLM   Df Residuals:                       98
Model Family:                 Poisson   Df Model:                            1
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -235.39
Date:                Mon, 06 May 2015   Deviance:                       84.993
Time:                        13:12:50   Pearson chi2:                     83.8
No. Iterations:                     7
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      1.2917      0.364      3.552      0.000         0.579     2.005
x              0.0757      0.036      2.125      0.034         0.006     0.145
==============================================================================

Maximum deviance

The maximum deviance, the counterpart of the minimum deviance, is the deviance of the **Null model**: the model whose linear predictor contains only the intercept.
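
As a quick sketch (reusing `dpois` from above), the Null-model residual deviance can be computed the same way; it should come out around 89.5, matching the constant-model row of the table below:

>>> null = smf.glm('y ~ 1', data=d, family=sm.families.Poisson()).fit()
>>> null.llf * (-2) - sum(dpois(d.y)) * (-2)  # maximum residual deviance, ~89.5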

In summary, the larger the residual deviance, the worse the fit; the smaller the residual deviance, the better the fit (possibly overfitting).

Model selection criteria: AIC

**AIC** (Akaike's information criterion) is a criterion that emphasizes prediction. With $k$ the number of maximum-likelihood-estimated parameters,

\begin{eqnarray}
\mathrm{AIC} &=& -2\{(\text{maximum log-likelihood}) - (\text{number of ML-estimated parameters})\} \\
&=& -2(\log L^\ast - k) \\
&=& D + 2k
\end{eqnarray}
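
This identity can be checked on the x-model fit using statsmodels attributes (`df_model` counts the non-intercept coefficients, so $k$ = `df_model + 1` here):

>>> k = result.df_model + 1  # 2 parameters: intercept and x
>>> -2 * (result.llf - k)    # AIC computed by hand
>>> result.aic               # agrees with the value above
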
| model | number of parameters $k$ | $\log L^\ast$ | deviance ($-2\log L^\ast$) | residual deviance | AIC |
|:---|---:|---:|---:|---:|---:|
| constant | 1 | -237.6 | 475.3 | 89.5 | 477.3 |
| f | 2 | -237.6 | 475.3 | 89.5 | 479.3 |
| x | 2 | -235.4 | 470.8 | 85.0 | 474.8 |
| x + f | 3 | -235.3 | 470.6 | 84.8 | 476.6 |
| full | 100 | -192.9 | 385.8 | 0.0 | 585.8 |

From the table, the x model has the smallest AIC.
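
The first four rows of the table can be reproduced with a short loop, a sketch along the lines of the fits above:

>>> for formula in ['y ~ 1', 'y ~ f', 'y ~ x', 'y ~ x + f']:
...     r = smf.glm(formula, data=d, family=sm.families.Poisson()).fit()
...     print(formula, round(r.llf, 1), round(-2 * r.llf, 1), round(r.deviance, 1), round(r.aic, 1))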

Personal summary

In general, adding explanatory variables (increasing the number of parameters) improves the fit to the training data but can make prediction worse (between nested models, the bias differs by a constant amount per added parameter). → If you add a variable that genuinely explains the data, the bias penalty is unchanged while the fit improves by more than it → AIC decreases.

The difference between the maximum log-likelihood $\log L^\ast$ and the mean log-likelihood (the bias) grows by 1 per added parameter on average, so starting from the constant model ($k = 1$), adding a parameter ($k = 1 \to 2$) lowers AIC only if it improves $\log L^\ast$ by more than 1. For example, going from the constant model to the x model improves $\log L^\ast$ by $-235.4 - (-237.6) = 2.2 > 1$, so AIC drops from 477.3 to 474.8. (I found the book's phrasing here hard to parse...)
