Continuing from last time (Chapter 2), this is Chapter 3: statistical models that incorporate **explanatory variables**.
The difference from last time is that two attributes of each individual are used as explanatory variables: body size $x_i$ and the fertilization treatment (no treatment: **C**, fertilizer added: **T**).
Download data3a.csv from Chapter 3 of [here]() and use it.
The data is loaded as a pandas DataFrame.
# d <- read.csv("data3a.csv")
>>> import pandas
>>> d = pandas.read_csv("data3a.csv")
>>> d
y x f
0 6 8.31 C
1 6 9.44 C
2 6 9.50 C
3 12 9.07 C
...(Omitted)...
98 7 10.86 T
99 9 9.97 T
[100 rows x 3 columns]
# d$x
>>> d.x
0 8.31
1 9.44
2 9.50
...(Omitted)...
98 10.86
99 9.97
Name: x, Length: 100, dtype: float64
# d$y
>>> d.y
0 6
1 6
2 6
...(Omitted)...
98 7
99 9
Name: y, Length: 100, dtype: int64
# d$f
>>> d.f = d.f.astype('category')
>>> d.f
0 C
1 C
2 C
3 C
...(Omitted)...
98 T
99 T
Name: f, Length: 100, dtype: category
Categories (2, object): [C < T]
In R, column f would be a **factor**; in Python (pandas), the corresponding type is category.
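The category levels can also be inspected directly; a small sketch using standard pandas accessors:

>>> d.f.cat.categories   # the levels, like levels() in R
>>> d.f.value_counts()   # number of rows per level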
To check the data types, look at each column's dtype (a whole-frame version follows below):
# class(d$y), class(d$x), class(d$f)
>>> d.y.dtype
dtype('int64')
>>> d.x.dtype
dtype('float64')
>>> d.f.dtype
dtype('O')
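For the whole frame at once, d.dtypes is probably the closest analog to calling class() on each column in R:

>>> d.dtypes   # y: int64, x: float64, f: object (category after the conversion above)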
For an overview of the DataFrame, use describe(), just as for a Series last time.
# summary(d)
>>> d.describe()
y x
count 100.000000 100.000000
mean 7.830000 10.089100
std 2.624881 1.008049
min 2.000000 7.190000
25% 6.000000 9.427500
50% 8.000000 10.155000
75% 10.000000 10.685000
max 15.000000 12.400000
>>> d.f.describe()
count 100
unique 2
top T
freq 50
Name: f, dtype: object
I couldn't find a way to color-code the scatter plot in a single line as in R, so the two groups are plotted separately (though see the sketch after the box plot).
# plot(d$x, d$y, pch=c(21, 19)[d$f])
>>> import matplotlib.pyplot as plt
>>> plt.plot(d.x[d.f == 'C'], d.y[d.f == 'C'], 'bo')
[<matplotlib.lines.Line2D at 0x1184d0490>]
>>> plt.plot(d.x[d.f == 'T'], d.y[d.f == 'T'], 'ro')
[<matplotlib.lines.Line2D at 0x111b58dd0>]
>>> plt.show()
>>> d.boxplot(column='y', by='f')
<matplotlib.axes._subplots.AxesSubplot at 0x11891a9d0>
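For the record, something close to a one-liner does seem possible with plt.scatter by passing a per-point color sequence built from f (a sketch; the color mapping is my own choice):

>>> plt.scatter(d.x, d.y, c=d.f.map({'C': 'blue', 'T': 'red'}))
>>> plt.show()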
What we can see from these figures:

- The number of seeds $y$ seems to increase as body size $x_i$ increases.
- The fertilization treatment $f$ seems to have no effect.
Let's build a statistical model that can express this seed-count data, which is count data.

- **Explanatory variable**: $x_i$
- **Response variable**: $y_i$
Assume that the probability $p(y_i \mid \lambda_i)$ that individual $i$ has $y_i$ seeds follows the Poisson distribution
p(y_i|\lambda_i) = \frac{\lambda _i ^{y_i} \exp (-\lambda_i)}{y_i !}
Here, suppose the mean seed count $\lambda_i$ of individual $i$ is
\begin{eqnarray}
\lambda_i &=& \exp(\beta_1 + \beta_2 x_i) \\
\log \lambda_i &=& \beta_1 + \beta_2 x_i
\end{eqnarray}
$\beta_1$ is called the **intercept**, $\beta_2$ the **slope**, and the right-hand side $\beta_1 + \beta_2 x_i$ the **linear predictor**.
Also, when (some function of $\lambda_i$) = (linear predictor), the function on the left-hand side is called the **link function**; here it is the log link.
Poisson regression finds the $\beta_1, \beta_2$ that maximize the following log-likelihood:
\log L(\beta_1, \beta_2) = \sum_i \log \frac{\lambda_i^{y_i}\exp(-\lambda_i)}{y_i !}
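This log-likelihood is easy to evaluate directly as a sanity check; a minimal sketch assuming scipy is available (the coefficients plugged in are the estimates obtained below):

>>> import numpy
>>> from scipy.stats import poisson
>>> def log_likelihood(beta1, beta2):
...     lam = numpy.exp(beta1 + beta2 * d.x)  # lambda_i = exp(beta1 + beta2 * x_i)
...     return poisson.logpmf(d.y, lam).sum()
...
>>> log_likelihood(1.2917, 0.0757)  # should come out near -235.39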
In R, this is a one-liner with glm. In Python, we work through it step by step with the statsmodels library.
# fit <- glm(y ~ x, data=d, family=poisson)
>>> import statsmodels.api as sm
>>> import statsmodels.formula.api as smf
>>> model = smf.glm('y ~ x', data=d, family=sm.families.Poisson())
>>> results = model.fit()
>>> print(results.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 100
Model: GLM Df Residuals: 98
Model Family: Poisson Df Model: 1
Link Function: log Scale: 1.0
Method: IRLS Log-Likelihood: -235.39
Date:         Mon, 01 Jun 2015   Deviance:                       84.993
Time: 23:06:45 Pearson chi2: 83.8
No. Iterations: 7
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 1.2917 0.364 3.552 0.000 0.579 2.005
x 0.0757 0.036 2.125 0.034 0.006 0.145
==============================================================================
>>> results.llf
-235.38625076986077
What R labels Intercept appears here as const: const corresponds to $\beta_1$ (intercept) and x to $\beta_2$ (slope).
In other words, the maximum likelihood estimates are $\beta_1 = 1.2917$ and $\beta_2 = 0.0757$.
std err is the **standard error**, which expresses the variability of $\beta_1$ and $\beta_2$: if a data set of the same size were collected again by the same survey method, the maximum likelihood estimates would come out differently, and the standard error quantifies that spread.
z is a statistic called the **z value**: the maximum likelihood estimate divided by its standard error.
This z can be used to estimate a **Wald confidence interval**; here, the range of plausible values for each estimate does not include 0 [see](http://hosho.ees.hokudai.ac.jp/~kubo/log/2009/0101.html#08).
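These quantities can also be pulled out of the fitted object programmatically; a sketch assuming the results object above:

>>> results.params      # maximum likelihood estimates beta_1, beta_2
>>> results.bse         # standard errors
>>> results.tvalues     # z values (estimate / standard error)
>>> results.conf_int()  # Wald confidence intervals, as in the summary table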
Using the Poisson regression estimates, let's plot the predicted mean seed count $\lambda$ as a function of body size $x$:

\lambda = \exp(1.29 + 0.0757 x)
# plot(d$x, d$y, pch=c(21, 19)[d$f])
# xx <- seq(min(d$x), max(d$x), length=100)
# lines(xx, exp(1.29 + 0.0757 * xx), lwd = 2)
>>> import numpy
>>> plt.plot(d.x[d.f == 'C'], d.y[d.f == 'C'], 'bo')
>>> plt.plot(d.x[d.f == 'T'], d.y[d.f == 'T'], 'ro')
>>> xx = [d.x.min() + i * ((d.x.max() - d.x.min()) / 100.) for i in range(100)]
>>> yy = [numpy.exp(1.29 + 0.0757 * i) for i in xx]
>>> plt.plot(xx, yy)
>>> plt.show()
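Instead of hard-coding rounded coefficients, the fitted model can generate the same curve via results.predict; a sketch assuming the results object for the y ~ x model above:

>>> xx = numpy.linspace(d.x.min(), d.x.max(), 100)
>>> plt.plot(xx, results.predict(pandas.DataFrame({'x': xx})))
>>> plt.show()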
Next, consider a statistical model that uses the fertilization treatment $f_i$ as the explanatory variable.
In a GLM, a factor-type explanatory variable is encoded as a **dummy variable**; that is, the model's mean is written as
\begin{eqnarray}
\lambda_i &=& \exp(\beta_1 + \beta_3 d_i) \\
d_i &=& \left\{ \begin{array}{ll}
0 & (f_i = C) \\
1 & (f_i = T)
\end{array} \right.
\end{eqnarray}
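To see this dummy coding concretely, pandas can expand the factor column by hand (patsy does the equivalent internally when it sees f in the formula):

>>> pandas.get_dummies(d.f, prefix='f').head()  # 0/1 columns f_C and f_T; the model uses d_i = f_T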
# fit.f <- glm(y ~ f, data=d, family=poisson)
>>> model = smf.glm('y ~ f', data=d, family=sm.families.Poisson())
>>> results = model.fit()
>>> print(results.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 100
Model: GLM Df Residuals: 98
Model Family: Poisson Df Model: 1
Link Function: log Scale: 1.0
Method: IRLS Log-Likelihood: -237.63
Date:         Thu, 04 Jun 2015   Deviance:                       89.475
Time: 17:20:11 Pearson chi2: 87.1
No. Iterations: 7
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 2.0516 0.051 40.463 0.000 1.952 2.151
f[T.T] 0.0128 0.071 0.179 0.858 -0.127 0.153
==============================================================================
>>> results.llf
-237.62725696068682
From this result, if individual $i$'s $f_i$ is C, then

\lambda_i = \exp(2.05 + 0.0128 \times 0) = \exp(2.05) \approx 7.77

and if it is T, then

\lambda_i = \exp(2.05 + 0.0128 \times 1) = \exp(2.0628) \approx 7.87

so fertilization is predicted to increase the mean seed count only slightly.
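The same two predictions can be read off the fitted model directly; a sketch assuming the results object for the y ~ f model above:

>>> results.predict(pandas.DataFrame({'f': ['C', 'T']}))  # roughly [7.77, 7.87]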
However, the maximum log-likelihood (llf) is smaller than that of the body-size model, so this model fits worse.
Now consider a statistical model that incorporates both body size $x_i$ and fertilization treatment $f_i$:
\log \lambda_i = \beta_1 + \beta_2 x_i + \beta_3 d_i
>>> model = smf.glm('y ~ x + f', data=d, family=sm.families.Poisson())
>>> result = model.fit()
>>> result.summary()
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 100
Model: GLM Df Residuals: 97
Model Family: Poisson Df Model: 2
Link Function: log Scale: 1.0
Method: IRLS Log-Likelihood: -235.29
Date:         Thu, 04 Jun 2015   Deviance:                       84.808
Time: 17:31:34 Pearson chi2: 83.8
No. Iterations: 7
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 1.2631 0.370 3.417 0.001 0.539 1.988
f[T.T] -0.0320 0.074 -0.430 0.667 -0.178 0.114
x 0.0801 0.037 2.162 0.031 0.007 0.153
==============================================================================
>>> result.llf
-235.29371924249369
The maximum log-likelihood here is the largest of the three, so this model fits better than the two statistical models above.
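A compact way to line up all three fits, reusing the imports above:

>>> for formula in ['y ~ x', 'y ~ f', 'y ~ x + f']:
...     res = smf.glm(formula, data=d, family=sm.families.Poisson()).fit()
...     print(formula, res.llf)  # -235.39, -237.63, -235.29 respectively
...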
Transforming this quantitative + factor-type statistical model,

\begin{eqnarray}
\log \lambda_i &=& \beta_1 + \beta_2 x_i + \beta_3 d_i \\
\lambda_i &=& \exp(\beta_1 + \beta_2 x_i + \beta_3 d_i) \\
\lambda_i &=& \exp(\beta_1) \times \exp(\beta_2 x_i) \times \exp(\beta_3 d_i) \\
\lambda_i &=& \exp(\text{constant}) \times \exp(\text{body size effect}) \times \exp(\text{fertilization effect})
\end{eqnarray}

we can see that the individual effects act multiplicatively on the mean.
When the mean is linked to the linear predictor with no transformation at all, the link is called the **identity link function**.
Among GLMs, the combination of a normal distribution and an identity link function is called a **linear model** (LM) or **general linear model**.
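For reference, the same statsmodels interface fits such a model: the Gaussian family uses the identity link by default, so this is ordinary linear regression expressed as a GLM (a sketch; whether a normal distribution suits count data is a separate question):

>>> lm = smf.glm('y ~ x', data=d, family=sm.families.Gaussian()).fit()
>>> print(lm.summary())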