[PYTHON] Points to note when performing logistic regression with Statsmodels

Python "statsmodels" is generally stable and linear regression calculation OLS is also indebted, but if you take a closer look? ?? ?? In some cases, Here, I will list two methods for logistic regression and note the points to note.

The target is **statsmodels (0.6.1)**, the latest stable release at the time of writing.


How to use Logit Model

A Google search turns up stackoverflow.com Q&As showing how to use the Logit model. The statsmodels documentation also covers it properly under "Regression with Discrete Dependent Variable", so I verified the behavior by following the documentation.

The example analyzes the Spector and Mazzeo data, sourced from W. Greene, "Econometric Analysis", Prentice Hall, 5th edition, 2003, an econometrics textbook. The task is to predict the explained variable (the post grade) from the explanatory variables psi, tuce, and gpa. Restating the information found online in plainer terms (at my own discretion): estimate whether a student's grade will improve in the second semester, given the first-semester report card (gpa), the score on a summer nationwide mock exam (tuce), and participation in a summer course (psi). I proceeded as follows, according to the statsmodels documentation.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Load the data from Spector and Mazzeo (1980)
spector_data = sm.datasets.spector.load()
spector_data.exog = sm.add_constant(spector_data.exog)

# Follow the statsmodels ipython notebook
logit_mod = sm.Logit(spector_data.endog, spector_data.exog)
logit_res = logit_mod.fit(disp=0)   # disp=0 suppresses the convergence messages
print('Parameters: ', logit_res.params)
print(logit_res.summary())

As shown above, sm.Logit() computed the fit without any problem. The summary() output is as follows.

                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                   32
Model:                          Logit   Df Residuals:                       28
Method:                           MLE   Df Model:                            3
Date:                Tue, 01 Sep 2015   Pseudo R-squ.:                  0.3740
Time:                        22:20:41   Log-Likelihood:                -12.890
converged:                       True   LL-Null:                       -20.592
                                        LLR p-value:                  0.001502
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const        -13.0213      4.931     -2.641      0.008       -22.687    -3.356
x1             2.8261      1.263      2.238      0.025         0.351     5.301
x2             0.0952      0.142      0.672      0.501        -0.182     0.373
x3             2.3787      1.065      2.234      0.025         0.292     4.465
==============================================================================
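For reference, the fitted result can also produce predicted probabilities via predict(); a minimal sketch (the 0.5 cutoff used to turn probabilities into 0/1 predictions is my own arbitrary choice, not something from the documentation):

# Predicted probability of a good post grade for each of the 32 students
probs = logit_res.predict(spector_data.exog)

# Classify with an (arbitrary) 0.5 threshold
predicted = (probs >= 0.5).astype(int)
print(predicted)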

How to use GLM (Generalized Linear Models)

Logistic regression is a generalized linear model (GLM) that uses the logit link function, so the same fit can also be written as follows.

# logit_mod = sm.Logit(spector_data.endog, spector_data.exog)
glm_model = sm.GLM(spector_data.endog, spector_data.exog, family=sm.families.Binomial())
# logit_res = logit_mod.fit(disp=0)
glm_result = glm_model.fit()

# print(logit_res.summary())
print(glm_result.summary())

The family option of sm.GLM() specifies that the distribution is Binomial. The logit link function is then used automatically, since it is the default for Binomial. The summary() output is as follows.

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                   32
Model:                            GLM   Df Residuals:                       28
Model Family:                Binomial   Df Model:                            3
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -12.890
Date:                Tue, 01 Sep 2015   Deviance:                       25.779
Time:                        22:21:20   Pearson chi2:                     27.3
No. Iterations:                     7                                         
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const        -13.0213      4.931     -2.641      0.008       -22.687    -3.356
x1             2.8261      1.263      2.238      0.025         0.351     5.301
x2             0.0952      0.142      0.672      0.501        -0.182     0.373
x3             2.3787      1.065      2.234      0.025         0.292     4.465
==============================================================================
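Incidentally, the link function can also be specified explicitly instead of relying on the Binomial default. A minimal sketch; note that the lowercase class name sm.families.links.logit is what this generation of statsmodels uses, and later releases rename it, so treat the exact spelling as version-dependent:

# Equivalent to the default: Binomial with an explicit logit link
glm_model2 = sm.GLM(spector_data.endog, spector_data.exog,
                    family=sm.families.Binomial(link=sm.families.links.logit))
glm_result2 = glm_model2.fit()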

The output format differs from the Logit() case. At first I assumed that Logit() and GLM() ran the same processing internally and differed only in the wrapper (aliasing). However, looking closely at the Method field of the output, Logit reports "MLE" while GLM reports "IRLS". Are they doing something different after all? While puzzling over this with the help of the Wikipedia article, I noted the following:

- IRLS (iteratively reweighted least squares) is itself one of the methods for computing the maximum likelihood estimate.
- The estimated parameters and the reported statistics are exactly the same for both.

From this, I presume that the internal processing is essentially the same and only the wrappers differ.
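This presumption is easy to check numerically. A minimal sketch comparing the two fitted results, reusing logit_res and glm_result from the code above:

# The coefficients and standard errors from the two fits should agree
# to numerical precision, as the summary tables above suggest
print(np.allclose(logit_res.params, glm_result.params))   # expected: True
print(np.allclose(logit_res.bse, glm_result.bse))         # expected: True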

Comparison of the two methods: residual calculation

To get a picture of how well the regression fits, I decided to compute the residuals. Following the way it is done with OLS:

>>> resid1 = logit_res.resid
AttributeError: 'LogitResults' object has no attribute 'resid'

>>> resid2 = glm_result.resid
AttributeError: 'GLMResults' object has no attribute 'resid'

Both raise an AttributeError. Since the concept of a "residual" differs from that of the linear model, these classes may simply not carry an attribute of that name. After a rather desperate search through the documentation, starting from the class names of the result objects, I arrived at the following correct answer.

>>> resid1 = logit_res.resid_dev   # for Logit model
>>> resid1
array([-0.23211021, -0.35027122, -0.64396264, -0.22909819,  1.06047795,
       -0.26638437, -0.23178275, -0.32537884, -0.48538752,  0.85555565,
       -0.22259715, -0.64918082, -0.88199929,  1.81326864, -0.94639849,
       -0.24758297, -0.3320177 , -0.28054444, -1.33513084,  0.91030269,
       -0.35592175,  0.44718924, -0.74400503, -1.95507406,  0.59395382,
        1.20963752,  0.95233204, -0.85678568,  0.58707192,  0.33529199,
       -1.22731092,  2.09663887])
>>> resid2 = glm_result.resid_deviance   # for GLM model
>>> resid2
array([-0.23211021, -0.35027122, -0.64396264, -0.22909819,  1.06047795,
       -0.26638437, -0.23178275, -0.32537884, -0.48538752,  0.85555565,
       -0.22259715, -0.64918082, -0.88199929,  1.81326864, -0.94639849,
       -0.24758297, -0.3320177 , -0.28054444, -1.33513084,  0.91030269,
       -0.35592175,  0.44718924, -0.74400503, -1.95507406,  0.59395382,
        1.20963752,  0.95233204, -0.85678568,  0.58707192,  0.33529199,
       -1.22731092,  2.09663887])

It is rather strange that the attribute name differs between the classes even though the contents are identical. (Perhaps they were written by different programmers.)

In any case, the numerical values are exactly the same, which supports the view that the underlying regression computation is the same.
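As a final cross-check, the deviance residuals of a binary model can be recomputed by hand from the fitted probabilities, using the standard definition d_i = sign(y_i - p_i) * sqrt(-2 * (y_i * log(p_i) + (1 - y_i) * log(1 - p_i))). A minimal sketch of this verification (my own code, not from the statsmodels documentation):

y = spector_data.endog
p = logit_res.predict(spector_data.exog)

# Signed square root of each observation's contribution to the deviance
resid_manual = np.sign(y - p) * np.sqrt(-2 * (y * np.log(p) + (1 - y) * np.log(1 - p)))
print(np.allclose(resid_manual, logit_res.resid_dev))   # expected: True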

**statsmodels (0.6.1)** may still effectively be a beta, but I would like to see these points organized a little more in future versions to make the library easier to understand.
