Python "statsmodels" is generally stable and linear regression calculation OLS is also indebted, but if you take a closer look? ?? ?? In some cases, Here, I will list two methods for logistic regression and note the points to note.
The target is **statsmodels (0.6.1)**, the latest stable release at the time of writing.
A Google search brings up stackoverflow.com Q&A posts showing how to use the Logit model, and the statsmodels documentation also covers it properly under "Regression with Discrete Dependent Variable", so I checked the behavior by following the documentation.
The example analyzes the Spector and Mazzeo data, taken from W. Greene, "Econometric Analysis", Prentice Hall, 5th edition, 2003, an econometrics textbook. The explained variable is the post grade, and the explanatory variables are psi, tuce, and gpa. Put in everyday terms (my own interpretation of what I found online): estimate whether grades improve in the second semester from the first-semester report card (gpa), the score on a summer nationwide mock exam (tuce), and participation in a summer course (psi). Following the statsmodels documentation, I did the following.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Load the data from Spector and Mazzeo (1980)
spector_data = sm.datasets.spector.load()
spector_data.exog = sm.add_constant(spector_data.exog)
# Follow the statsmodels ipython notebook
logit_mod = sm.Logit(spector_data.endog, spector_data.exog)
logit_res = logit_mod.fit(disp=0)
print('Parameters: ', logit_res.params)
print(logit_res.summary())
As shown above, sm.Logit() computed the fit without any problem. The output of summary() is as follows.
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 32
Model: Logit Df Residuals: 28
Method: MLE Df Model: 3
Date: Tue, 01 Sep 2015 Pseudo R-squ.: 0.3740
Time: 22:20:41 Log-Likelihood: -12.890
converged: True LL-Null: -20.592
LLR p-value: 0.001502
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const -13.0213 4.931 -2.641 0.008 -22.687 -3.356
x1 2.8261 1.263 2.238 0.025 0.351 5.301
x2 0.0952 0.142 0.672 0.501 -0.182 0.373
x3 2.3787 1.065 2.234 0.025 0.292 4.465
==============================================================================
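As a quick sanity check on the fitted model (my own addition rather than part of the original walkthrough, assuming the results object's predict() accepts the exog matrix as in the 0.6.x API), the predicted probabilities can be compared with the observed outcomes:

# Predicted probability of an improved grade for each of the 32 students (illustrative check, not in the docs example)
pred_prob = logit_res.predict(spector_data.exog)
# Crude classification check by thresholding at 0.5
pred_class = (pred_prob > 0.5).astype(float)
print('Fraction classified correctly:', np.mean(pred_class == spector_data.endog))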
Logistic regression is a generalized linear model (GLM) with the logit link function, so the same model can also be written as follows.
# logit_mod = sm.Logit(spector_data.endog, spector_data.exog)
glm_model = sm.GLM(spector_data.endog, spector_data.exog, family=sm.families.Binomial())
# logit_res = logit_mod.fit(disp=0)
glm_reslt = glm_model.fit()
# print(logit_res.summary())
print(glm_reslt.summary())
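As a side note, the link can also be spelled out explicitly. This is only a sketch, assuming the 0.6.x convention that the link classes live under sm.families.links and can be passed to the family constructor; since logit is already the Binomial default, the fit should be identical:

# Same GLM with the logit link stated explicitly (assumes sm.families.links.logit is accepted here)
binom_logit = sm.families.Binomial(link=sm.families.links.logit)
glm_explicit = sm.GLM(spector_data.endog, spector_data.exog, family=binom_logit)
print(glm_explicit.fit().summary())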
Either way, the family option of sm.GLM() declares the distribution to be Binomial, and the link function is logit (the Binomial family's default). The output of summary() is as follows.
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 32
Model: GLM Df Residuals: 28
Model Family: Binomial Df Model: 3
Link Function: logit Scale: 1.0
Method: IRLS Log-Likelihood: -12.890
Date: Tue, 01 Sep 2015 Deviance: 25.779
Time: 22:21:20 Pearson chi2: 27.3
No. Iterations: 7
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const -13.0213 4.931 -2.641 0.008 -22.687 -3.356
x1 2.8261 1.263 2.238 0.025 0.351 5.301
x2 0.0952 0.142 0.672 0.501 -0.182 0.373
x3 2.3787 1.065 2.234 0.025 0.292 4.465
==============================================================================
The output format differs from the Logit() case. At first I assumed that Logit() and GLM() did the same processing internally and only the wrappers differed (aliases). However, looking closely at the Method field of the output, Logit() reports MLE while GLM() reports IRLS. Are these actually different things? After some worrying and a look at the Wikipedia article, I noted that:
- IRLS (iteratively reweighted least squares) is itself one of the methods for maximizing the likelihood.
- The estimated parameters and the reported statistics are exactly the same for both fits.
From this, I presume the internal processing is the same and only the wrappers differ.
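A minimal numerical check of that presumption, reusing the logit_res and glm_reslt objects from above (params holds the estimated coefficients and llf the log-likelihood in both result classes):

# Both fits should expose the same coefficients and log-likelihood
print(np.allclose(logit_res.params, glm_reslt.params))  # expect True
print(logit_res.llf, glm_reslt.llf)  # both summaries above report -12.890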
To get a feel for how well the regression fits, I decided to compute the residuals. Following the way one would do it for OLS:
>>> resid1 = logit_res.resid
AttributeError: 'LogitResults' object has no attribute 'resid'
>>> resid2 = glm_reslt.resid
AttributeError: 'GLMResults' object has no attribute 'resid'
Both raise an AttributeError. Since the notion of a "residual" differs from the linear model, the attribute apparently does not get the same name. After doggedly searching the documentation starting from the class names of the result objects, I arrived at the following answer.
>>> resid1 = logit_res.resid_dev # for Logit model
>>> resid1
array([-0.23211021, -0.35027122, -0.64396264, -0.22909819, 1.06047795,
-0.26638437, -0.23178275, -0.32537884, -0.48538752, 0.85555565,
-0.22259715, -0.64918082, -0.88199929, 1.81326864, -0.94639849,
-0.24758297, -0.3320177 , -0.28054444, -1.33513084, 0.91030269,
-0.35592175, 0.44718924, -0.74400503, -1.95507406, 0.59395382,
1.20963752, 0.95233204, -0.85678568, 0.58707192, 0.33529199,
-1.22731092, 2.09663887])
>>> resid2 = glm_reslt.resid_deviance # for GLM model
>>> resid2
array([-0.23211021, -0.35027122, -0.64396264, -0.22909819, 1.06047795,
-0.26638437, -0.23178275, -0.32537884, -0.48538752, 0.85555565,
-0.22259715, -0.64918082, -0.88199929, 1.81326864, -0.94639849,
-0.24758297, -0.3320177 , -0.28054444, -1.33513084, 0.91030269,
-0.35592175, 0.44718924, -0.74400503, -1.95507406, 0.59395382,
1.20963752, 0.95233204, -0.85678568, 0.58707192, 0.33529199,
-1.22731092, 2.09663887])
It is rather odd that the attribute name differs by class even though the contents are the same. (Perhaps the two were written by different programmers.) Still, the output values match exactly, which supports the view that the underlying regression calculation is the same.
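A one-line confirmation of that exact match, using the attributes found above:

# Maximum absolute difference between the two deviance-residual vectors; expect essentially 0.0
print(np.abs(logit_res.resid_dev - glm_reslt.resid_deviance).max())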
**statsmodels (0.6.1)** is still a pre-1.0 release, and I would like to see these interfaces tidied up in future versions to make them easier to understand.