Python "statsmodels" is generally stable and linear regression calculation OLS is also indebted, but if you take a closer look? ?? ?? In some cases, Here, I will list two methods for logistic regression and note the points to note.
The target is **statsmodels (0.6.1)**, the latest stable release at the time of writing.
A Google search brings up stackoverflow.com Q&A posts showing how to use the Logit model, and the statsmodels documentation also covers it properly under "Regression with Discrete Dependent Variable", so I checked the behavior by following the documentation.
The example analyzes the Spector and Mazzeo data, taken from W. Greene, "Econometric Analysis", Prentice Hall, 5th edition, 2003, an econometrics textbook. The explained variable is the post grade, and the explanatory variables are psi, tuce, and gpa. Put in everyday terms (my own interpretation of what I found online): estimate whether grades improve in the second semester from the first-semester report card (gpa), the score on a summer nationwide mock exam (tuce), and participation in a summer course (psi). Following the statsmodels documentation, I did the following.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Load the data from Spector and Mazzeo (1980)
spector_data = sm.datasets.spector.load()
spector_data.exog = sm.add_constant(spector_data.exog)
# Follow the statsmodels ipython notebook
logit_mod = sm.Logit(spector_data.endog, spector_data.exog)
logit_res = logit_mod.fit(disp=0)
print('Parameters: ', logit_res.params)
print(logit_res.summary())
As shown above, sm.Logit() computed the fit without any problem. The output of summary() is as follows.
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 32
Model: Logit Df Residuals: 28
Method: MLE Df Model: 3
Date: Tue, 01 Sep 2015 Pseudo R-squ.: 0.3740
Time: 22:20:41 Log-Likelihood: -12.890
converged: True LL-Null: -20.592
LLR p-value: 0.001502
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const -13.0213 4.931 -2.641 0.008 -22.687 -3.356
x1 2.8261 1.263 2.238 0.025 0.351 5.301
x2 0.0952 0.142 0.672 0.501 -0.182 0.373
x3 2.3787 1.065 2.234 0.025 0.292 4.465
==============================================================================
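As a quick sanity check on the fitted model (my own addition rather than part of the original walkthrough, assuming the results object's predict() accepts the exog matrix as in the 0.6.x API), the predicted probabilities can be compared with the observed outcomes:

# Predicted probability of an improved grade for each of the 32 students (illustrative check, not in the docs example)
pred_prob = logit_res.predict(spector_data.exog)
# Crude classification check by thresholding at 0.5
pred_class = (pred_prob > 0.5).astype(float)
print('Fraction classified correctly:', np.mean(pred_class == spector_data.endog))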
Logistic regression is a generalized linear model (GLM) with the logit link function, so the same model can also be written as follows.
# logit_mod = sm.Logit(spector_data.endog, spector_data.exog)
glm_model = sm.GLM(spector_data.endog, spector_data.exog, family=sm.families.Binomial())
# logit_res = logit_mod.fit(disp=0)
glm_reslt = glm_model.fit()
# print(logit_res.summary())
print(glm_reslt.summary())
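As a side note, the link can also be spelled out explicitly. This is only a sketch, assuming the 0.6.x convention that the link classes live under sm.families.links and can be passed to the family constructor; since logit is already the Binomial default, the fit should be identical:

# Same GLM with the logit link stated explicitly (assumes sm.families.links.logit is accepted here)
binom_logit = sm.families.Binomial(link=sm.families.links.logit)
glm_explicit = sm.GLM(spector_data.endog, spector_data.exog, family=binom_logit)
print(glm_explicit.fit().summary())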
Either way, the family option of sm.GLM() declares the distribution to be Binomial, and the link function is logit (the Binomial family's default). The output of summary() is as follows.
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 32
Model: GLM Df Residuals: 28
Model Family: Binomial Df Model: 3
Link Function: logit Scale: 1.0
Method: IRLS Log-Likelihood: -12.890
Date: Tue, 01 Sep 2015 Deviance: 25.779
Time: 22:21:20 Pearson chi2: 27.3
No. Iterations: 7
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const -13.0213 4.931 -2.641 0.008 -22.687 -3.356
x1 2.8261 1.263 2.238 0.025 0.351 5.301
x2 0.0952 0.142 0.672 0.501 -0.182 0.373
x3 2.3787 1.065 2.234 0.025 0.292 4.465
==============================================================================
The output format differs from the Logit() case. At first I assumed that Logit() and GLM() did the same processing internally and only the wrappers differed (aliases). However, looking closely at the Method field of the output, Logit() reports MLE while GLM() reports IRLS. Are these actually different things? After some worrying and a look at the Wikipedia article, I noted that:
- IRLS (iteratively reweighted least squares) is itself one of the methods for maximizing the likelihood.
- The estimated parameters and the reported statistics are exactly the same for both fits.
From this, I presume the internal processing is the same and only the wrappers differ.
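A minimal numerical check of that presumption, reusing the logit_res and glm_reslt objects from above (params holds the estimated coefficients and llf the log-likelihood in both result classes):

# Both fits should expose the same coefficients and log-likelihood
print(np.allclose(logit_res.params, glm_reslt.params))  # expect True
print(logit_res.llf, glm_reslt.llf)  # both summaries above report -12.890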
To get a feel for how well the regression fits, I decided to compute the residuals. Following the way one would do it for OLS:
>>> resid1 = logit_res.resid
AttributeError: 'LogitResults' object has no attribute 'resid'
>>> resid2 = glm_reslt.resid
AttributeError: 'GLMResults' object has no attribute 'resid'
Both raise an AttributeError. Since the notion of a "residual" differs from the linear model, the attribute apparently does not get the same name. After doggedly searching the documentation starting from the class names of the result objects, I arrived at the following answer.
>>> resid1 = logit_res.resid_dev # for Logit model
>>> resid1
array([-0.23211021, -0.35027122, -0.64396264, -0.22909819, 1.06047795,
-0.26638437, -0.23178275, -0.32537884, -0.48538752, 0.85555565,
-0.22259715, -0.64918082, -0.88199929, 1.81326864, -0.94639849,
-0.24758297, -0.3320177 , -0.28054444, -1.33513084, 0.91030269,
-0.35592175, 0.44718924, -0.74400503, -1.95507406, 0.59395382,
1.20963752, 0.95233204, -0.85678568, 0.58707192, 0.33529199,
-1.22731092, 2.09663887])
>>> resid2 = glm_reslt.resid_deviance # for GLM model
>>> resid2
array([-0.23211021, -0.35027122, -0.64396264, -0.22909819, 1.06047795,
-0.26638437, -0.23178275, -0.32537884, -0.48538752, 0.85555565,
-0.22259715, -0.64918082, -0.88199929, 1.81326864, -0.94639849,
-0.24758297, -0.3320177 , -0.28054444, -1.33513084, 0.91030269,
-0.35592175, 0.44718924, -0.74400503, -1.95507406, 0.59395382,
1.20963752, 0.95233204, -0.85678568, 0.58707192, 0.33529199,
-1.22731092, 2.09663887])
It is rather odd that the attribute name differs by class even though the contents are the same. (Perhaps the two were written by different programmers.) Still, the output values match exactly, which supports the view that the underlying regression calculation is the same.
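A one-line confirmation of that exact match, using the attributes found above:

# Maximum absolute difference between the two deviance-residual vectors; expect essentially 0.0
print(np.abs(logit_res.resid_dev - glm_reslt.resid_deviance).max())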
**statsmodels (0.6.1)** is still a pre-1.0 release, and I would like to see these interfaces tidied up in future versions to make them easier to understand.