[Python] Introduction to Statistical Modeling for Data Analysis: Generalized Linear Models (GLM)

Continuing from the previous post (Chapter 2), this time we cover Chapter 3: statistical models that incorporate **explanatory variables**.

3.1 Example: When the average number of seeds varies from individual to individual

The difference from last time is that two attributes of each individual are used as explanatory variables: the body size $x_i$ and the fertilization treatment (do nothing: **C**, add fertilizer: **T**).

Reading the data

Download data3a.csv from the Chapter 3 section of the book's support site and use it. The loaded data is a pandas DataFrame.

# d <- read.csv("data3a.csv")
>>> import pandas
>>> d = pandas.read_csv("data3a.csv")
>>> d
     y      x  f
0    6   8.31  C
1    6   9.44  C
2    6   9.50  C
3   12   9.07  C
...(Omitted)...
98   7  10.86  T
99   9   9.97  T

[100 rows x 3 columns]

# d$x
>>> d.x
0      8.31
1      9.44
2      9.50
...(Omitted)...
98    10.86
99     9.97
Name: x, Length: 100, dtype: float64


# d$y
>>> d.y
0      6
1      6
2      6
...(Omitted)...
98     7
99     9
Name: y, Length: 100, dtype: int64

# d$f
>>> d.f = d.f.astype('category')
>>> d.f
0     C
1     C
2     C
3     C
...(Omitted)...
98    T
99    T
Name: f, Length: 100, dtype: category
Categories (2, object): [C < T]

In R, column f would be a **factor**; in Python (pandas), it is a category.
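As a side note, pandas exposes the category levels through the .cat accessor (a minimal sketch against the d loaded above):

>>> d.f.cat.categories   # the two levels, C and T
Index(['C', 'T'], dtype='object')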

To check the data types:

# class(d) (I couldn't find an equivalent check for the DataFrame itself)
# class(d$y), class(d$x), class(d$f)
>>> d.y.dtype
dtype('int64')
>>> d.x.dtype
dtype('float64')
>>> d.f.dtype
dtype('O')

For an overview of the DataFrame and its Series, use describe() (same as last time).

# summary(d)
>>> d.describe()
                y           x
count  100.000000  100.000000
mean     7.830000   10.089100
std      2.624881    1.008049
min      2.000000    7.190000
25%      6.000000    9.427500
50%      8.000000   10.155000
75%     10.000000   10.685000
max     15.000000   12.400000

>>> d.f.describe()
count     100
unique      2
top         T
freq       50
Name: f, dtype: object
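The class counts themselves (R's table(d$f)) come from value_counts (a minimal sketch; per the describe output above, the data contain 50 of each class):

# table(d$f)
>>> d.f.value_counts()   # 50 each of C and T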

Visualization

I couldn't find a way to color-code the scatter plot in a single line the way R does.

# plot(d$x, d$y, pch=c(21, 19)[d$f])
>>> import matplotlib.pyplot as plt
>>> plt.plot(d.x[d.f == 'C'], d.y[d.f == 'C'], 'bo')  # control (C) in blue
[<matplotlib.lines.Line2D at 0x1184d0490>]
>>> plt.plot(d.x[d.f == 'T'], d.y[d.f == 'T'], 'ro')  # fertilized (T) in red
[<matplotlib.lines.Line2D at 0x111b58dd0>]
>>> plt.show()
>>> d.boxplot(column='y', by='f')
<matplotlib.axes._subplots.AxesSubplot at 0x11891a9d0>
>>> plt.show()

(Figures: scatter plot of y against x, color-coded by f, and a boxplot of y grouped by f)

What these figures suggest:

- The number of seeds $y$ appears to increase as the body size $x$ increases.
- The fertilization treatment $f$ appears to have no effect.

Statistical model of Poisson regression

Let's create a statistical model that seems able to express the seed number data, which are count data.

Statistical model dependent on body size $ x_i $ for individual $ i $

**Explanatory variable**: $x_i$, **Response variable**: $y_i$

Assume that the probability $p(y_i \mid \lambda_i)$ that the seed count of individual $i$ is $y_i$ follows the Poisson distribution

p(y_i|\lambda_i) = \frac{\lambda_i^{y_i} \exp(-\lambda_i)}{y_i!}

Here, suppose that the mean seed count $\lambda_i$ of individual $i$ is

\begin{eqnarray}
\lambda_i &=& \exp(\beta_1 + \beta_2 x_i) \\
\log \lambda_i &=& \beta_1 + \beta_2 x_i 
\end{eqnarray}

$\beta_1$ is called the **intercept** and $\beta_2$ the **slope**; the right-hand side $\beta_1 + \beta_2 x_i$ is called the **linear predictor**.

Also, when (some function of $\lambda_i$) = (linear predictor), the function on the left-hand side is called the **link function**; here it is $\log$, the **log link function**.
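As a quick numerical illustration of the log link (a minimal sketch with made-up coefficients, not the estimates obtained below):

>>> import numpy
>>> beta1, beta2 = 1.3, 0.08               # hypothetical intercept and slope
>>> x = 10.0                               # some body size
>>> eta = beta1 + beta2 * x                # linear predictor
>>> lam = numpy.exp(eta)                   # mean, via the inverse of the log link
>>> numpy.allclose(numpy.log(lam), eta)    # applying the link recovers the predictor
True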

Poisson regression

Poisson regression finds the $\beta_1$ and $\beta_2$ that maximize the following log-likelihood:


\log L(\beta_1, \beta_2) = \sum_i \log \frac{\lambda_i^{y_i}\exp(-\lambda_i)}{y_i !}

In R, this is a one-shot with glm. In Python, it takes a bit more work with the statsmodels library (reference).

# fit <- glm(y ~ x, data=d, family=poisson)
>>> import statsmodels.api as sm
>>> import statsmodels.formula.api as smf
>>> model = smf.glm('y ~ x', data=d, family=sm.families.Poisson())
>>> results = model.fit()
>>> print(results.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  100
Model:                            GLM   Df Residuals:                       98
Model Family:                 Poisson   Df Model:                            1
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -235.39
Date:                Mon, 01 Jun 2015   Deviance:                       84.993
Time:                        23:06:45   Pearson chi2:                     83.8
No. Iterations:                     7
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          1.2917      0.364      3.552      0.000         0.579     2.005
x              0.0757      0.036      2.125      0.034         0.006     0.145
==============================================================================
>>> results.llf
-235.38625076986077

In R this row is labeled Intercept; in the Python output above, const corresponds to $\beta_1$ (the intercept) and x to $\beta_2$ (the slope).

In other words, the maximum likelihood estimates are $\beta_1 = 1.2917$ and $\beta_2 = 0.0757$.
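The maximized log-likelihood reported as results.llf can also be re-derived by hand from the fitted means (a minimal sketch, assuming scipy is available):

>>> import numpy
>>> from scipy.stats import poisson
>>> mu = results.fittedvalues            # lambda_i = exp(beta_1 + beta_2 * x_i)
>>> numpy.sum(poisson.logpmf(d.y, mu))   # should match results.llf above
-235.38625076986077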

The std err column is the **standard error**, which expresses the uncertainty of $\beta_1$ and $\beta_2$: if the same number of data points were resampled by the same survey method, the maximum likelihood estimates would change, and the standard error represents the degree of that variation.

z is a statistic called the **z value**: the maximum likelihood estimate divided by its standard error.

This z can be used to estimate a **Wald confidence interval**. Here, neither confidence interval includes 0 among the plausible values of the estimator ([see](http://hosho.ees.hokudai.ac.jp/~kubo/log/2009/0101.html#08)).
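A rough Wald interval is the estimate plus or minus about 1.96 standard errors; statsmodels also computes it directly (a minimal sketch using the results object above):

>>> results.params - 1.96 * results.bse   # approximate lower Wald bounds
>>> results.params + 1.96 * results.bse   # approximate upper Wald bounds
>>> results.conf_int()                    # the [95.0% Conf. Int.] column of the summary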

Prediction by Poisson regression model

Using the Poisson regression estimates, we predict the mean seed count $\lambda$ at body size $x$ as

\lambda = \exp(1.29 + 0.0757x)

and plot this curve.

# plot(d$x, d$y, pch=c(21, 19)[d$f])
# xx <- seq(min(d$x), max(d$x), length=100)
# lines(xx, exp(1.29 + 0.0757 * xx), lwd = 2)
>>> import numpy
>>> plt.plot(d.x[d.f == 'C'], d.y[d.f == 'C'], 'bo')
>>> plt.plot(d.x[d.f == 'T'], d.y[d.f == 'T'], 'ro')
>>> xx = [d.x.min() + i * ((d.x.max() - d.x.min()) / 100.) for i in range(100)]
>>> yy = [numpy.exp(1.29 + 0.0757 * i) for i in xx]
>>> plt.plot(xx, yy)
>>> plt.show()

(Figure: the scatter plot with the fitted Poisson regression curve overlaid)
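Instead of hard-coding the rounded coefficients, the fitted results object can generate the same curve (a minimal sketch reusing results from the fit above):

>>> xx = numpy.linspace(d.x.min(), d.x.max(), 100)
>>> yy = results.predict(pandas.DataFrame({'x': xx}))   # exp(beta_1 + beta_2 * xx)
>>> plt.plot(xx, yy)
>>> plt.show()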

3.5 Statistical model with factor type explanatory variables

Consider a statistical model that incorporates fertilizer application $ f_i $ as an explanatory variable.

In a GLM, a factor-type explanatory variable is encoded as a **dummy variable**. That is, the mean of the model

\begin{eqnarray}
\lambda_i &=& \exp(\beta_1 + \beta_3 d_i) \\
d_i &=& \left\{ \begin{array}{ll}
    0 & (f_i = C) \\
    1 & (f_i = T)
  \end{array} \right.
\end{eqnarray}

is written using the dummy variable $d_i$.
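The formula interface builds this dummy variable automatically (that is what the f[T.T] row in the summary below refers to), but it can also be constructed by hand (a minimal sketch):

>>> dummy = (d.f == 'T').astype(int)   # d_i = 0 for C, 1 for T
>>> dummy.sum()                        # 50 individuals received fertilizer
50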

# fit.f <- glm(y ~ f, data=d, family=poisson)
>>> model = smf.glm('y ~ f', data=d, family=sm.families.Poisson())
>>> results = model.fit()
>>> results.summary()
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  100
Model:                            GLM   Df Residuals:                       98
Model Family:                 Poisson   Df Model:                            1
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -237.63
Date:                Thu, 04 Jun 2015   Deviance:                       89.475
Time:                        17:20:11   Pearson chi2:                     87.1
No. Iterations:                     7
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      2.0516      0.051     40.463      0.000         1.952     2.151
f[T.T]         0.0128      0.071      0.179      0.858        -0.127     0.153
==============================================================================
>>> results.llf
-237.62725696068682

From this result, if individual $i$'s treatment $f_i$ is C,

\lambda_i = \exp(2.05 + 0.0128 \times 0) = \exp(2.05) = 7.77

and if it is T,

\lambda_i = \exp(2.05 + 0.0128 \times 1) = \exp(2.0628) = 7.87

so the model predicts that the mean seed count increases only slightly when fertilizer is applied.
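A quick numerical check of these two predictions (a minimal sketch):

>>> import numpy
>>> numpy.exp(2.05 + 0.0128 * 0)   # mean for C individuals, about 7.77
>>> numpy.exp(2.05 + 0.0128 * 1)   # mean for T individuals, about 7.87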

The maximized log-likelihood (llf) is smaller than that of the body-size model above, so the fit is worse.

3.6 Statistical model in which the explanatory variables are quantitative + factor type

Consider a statistical model that incorporates both individual body size $ x_i $ and fertilization treatment $ f_i $.

\log \lambda_i = \beta_1 + \beta_2 x_i + \beta_3 d_i

>>> model = smf.glm('y ~ x + f', data=d, family=sm.families.Poisson())
>>> result = model.fit()
>>> result.summary()
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  100
Model:                            GLM   Df Residuals:                       97
Model Family:                 Poisson   Df Model:                            2
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -235.29
Date:                Thu, 04 Jun 2015   Deviance:                       84.808
Time:                        17:31:34   Pearson chi2:                     83.8
No. Iterations:                     7
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      1.2631      0.370      3.417      0.001         0.539     1.988
f[T.T]        -0.0320      0.074     -0.430      0.667        -0.178     0.114
x              0.0801      0.037      2.162      0.031         0.007     0.153
==============================================================================
>>> result.llf
-235.29371924249369

Since this model has the largest maximum log-likelihood of the three, it fits better than the two models above.
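Putting the three maximized log-likelihoods side by side (a minimal sketch that refits each model):

>>> for formula in ('y ~ x', 'y ~ f', 'y ~ x + f'):
...     res = smf.glm(formula, data=d, family=sm.families.Poisson()).fit()
...     print(formula, res.llf)
...
y ~ x -235.38625076986077
y ~ f -237.62725696068682
y ~ x + f -235.29371924249369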

Why the log link function is easy to interpret: effects are multiplied

By transforming the quantitative + factor-type statistical model above,

\begin{eqnarray}
\log \lambda_i &=& \beta_1 + \beta_2 x_i + \beta_3 d_i \\
\lambda_i &=& \exp(\beta_1 + \beta_2 x_i + \beta_3 d_i) \\
\lambda_i &=& \exp(\beta_1) \times \exp(\beta_2 x_i) \times \exp(\beta_3 d_i) \\
\lambda_i &=& \exp(\text{constant}) \times \exp(\text{effect of body size}) \times \exp(\text{effect of fertilization})
\end{eqnarray}

we can see that the individual effects are multiplied together.
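Numerically this is just the rule $\exp(a+b) = \exp(a)\exp(b)$, checked here with the estimated coefficients (a minimal sketch):

>>> import numpy
>>> b1, b2, b3 = 1.2631, 0.0801, -0.0320   # estimates from the summary above
>>> x_val, d_val = 10.0, 1                 # a fertilized (T) individual of body size 10
>>> lhs = numpy.exp(b1 + b2 * x_val + b3 * d_val)
>>> rhs = numpy.exp(b1) * numpy.exp(b2 * x_val) * numpy.exp(b3 * d_val)
>>> numpy.allclose(lhs, rhs)
True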

When the mean is the linear predictor itself, with no transformation, the link is called the **identity link function**.

Within the GLM framework, the combination of a normal distribution and the identity link function is called a **linear model** (**LM**) or **general linear model**.
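In statsmodels this corresponds to the Gaussian family, whose default link is the identity; it should give the same coefficients as ordinary least squares (a minimal sketch, not part of the original analysis):

>>> lm = smf.glm('y ~ x', data=d, family=sm.families.Gaussian()).fit()   # identity link
>>> ols = smf.ols('y ~ x', data=d).fit()
>>> lm.params   # should agree with ols.params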
