Continuing from last time (Chapter 2), this is Chapter 3: statistical models that incorporate **explanatory variables**.
The difference from last time is that two attributes of each individual are used as explanatory variables: body size $x_i$ and the fertilization treatment (no treatment: **C**, fertilizer added: **T**).
Download data3a.csv from Chapter 3 of [here]() and use it.
The data is loaded as a pandas DataFrame.
# d <- read.csv("data3a.csv")
>>> import pandas
>>> d = pandas.read_csv("data3a.csv")
>>> d
y x f
0 6 8.31 C
1 6 9.44 C
2 6 9.50 C
3 12 9.07 C
...(Omitted)...
98 7 10.86 T
99 9 9.97 T
[100 rows x 3 columns]
# d$x
>>> d.x
0 8.31
1 9.44
2 9.50
...(Omitted)...
98 10.86
99 9.97
Name: x, Length: 100, dtype: float64
# d$y
>>> d.y
0 6
1 6
2 6
...(Omitted)...
98 7
99 9
Name: y, Length: 100, dtype: int64
# d$f
>>> d.f = d.f.astype('category')
>>> d.f
0 C
1 C
2 C
3 C
...(Omitted)...
98 T
99 T
Name: f, Length: 100, dtype: category
Categories (2, object): [C < T]
In R, column f would be a **factor**; in Python (pandas), the corresponding type is category.
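The category levels can also be inspected directly; a small sketch using standard pandas accessors:

>>> d.f.cat.categories   # the levels, like levels() in R
>>> d.f.value_counts()   # number of rows per level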
To check the data types, look at each column's dtype (a whole-frame version follows below):
# class(d$y), class(d$x), class(d$f)
>>> d.y.dtype
dtype('int64')
>>> d.x.dtype
dtype('float64')
>>> d.f.dtype
dtype('O')
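For the whole frame at once, d.dtypes is probably the closest analog to calling class() on each column in R:

>>> d.dtypes   # y: int64, x: float64, f: object (category after the conversion above)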
For an overview of the DataFrame, use describe(), just as for a Series last time.
# summary(d)
>>> d.describe()
y x
count 100.000000 100.000000
mean 7.830000 10.089100
std 2.624881 1.008049
min 2.000000 7.190000
25% 6.000000 9.427500
50% 8.000000 10.155000
75% 10.000000 10.685000
max 15.000000 12.400000
>>> d.f.describe()
count 100
unique 2
top T
freq 50
Name: f, dtype: object
I couldn't find a way to color-code the scatter plot in a single line as in R, so the two groups are plotted separately (though see the sketch after the box plot).
# plot(d$x, d$y, pch=c(21, 19)[d$f])
>>> import matplotlib.pyplot as plt
>>> plt.plot(d.x[d.f == 'C'], d.y[d.f == 'C'], 'bo')
[<matplotlib.lines.Line2D at 0x1184d0490>]
>>> plt.plot(d.x[d.f == 'T'], d.y[d.f == 'T'], 'ro')
[<matplotlib.lines.Line2D at 0x111b58dd0>]
>>> plt.show()
>>> d.boxplot(column='y', by='f')
<matplotlib.axes._subplots.AxesSubplot at 0x11891a9d0>
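For the record, something close to a one-liner does seem possible with plt.scatter by passing a per-point color sequence built from f (a sketch; the color mapping is my own choice):

>>> plt.scatter(d.x, d.y, c=d.f.map({'C': 'blue', 'T': 'red'}))
>>> plt.show()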
What we can see from these figures:

- The number of seeds $y$ seems to increase as body size $x_i$ increases.
- The fertilization treatment $f$ seems to have no effect.
Let's build a statistical model that can express this seed-count data, which is count data.

- **Explanatory variable**: $x_i$
- **Response variable**: $y_i$
Assume that the probability $p(y_i \mid \lambda_i)$ that individual $i$ has $y_i$ seeds follows the Poisson distribution
p(y_i|\lambda_i) = \frac{\lambda _i ^{y_i} \exp (-\lambda_i)}{y_i !}
Here, suppose the mean seed count $\lambda_i$ of individual $i$ is
\begin{eqnarray}
\lambda_i &=& \exp(\beta_1 + \beta_2 x_i) \\
\log \lambda_i &=& \beta_1 + \beta_2 x_i
\end{eqnarray}
$\beta_1$ is called the **intercept**, $\beta_2$ the **slope**, and the right-hand side $\beta_1 + \beta_2 x_i$ the **linear predictor**.
Also, when (some function of $\lambda_i$) = (linear predictor), the function on the left-hand side is called the **link function**; here it is the log link.
Poisson regression finds the $\beta_1, \beta_2$ that maximize the following log-likelihood:
\log L(\beta_1, \beta_2) = \sum_i \log \frac{\lambda_i^{y_i}\exp(-\lambda_i)}{y_i !}
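This log-likelihood is easy to evaluate directly as a sanity check; a minimal sketch assuming scipy is available (the coefficients plugged in are the estimates obtained below):

>>> import numpy
>>> from scipy.stats import poisson
>>> def log_likelihood(beta1, beta2):
...     lam = numpy.exp(beta1 + beta2 * d.x)  # lambda_i = exp(beta1 + beta2 * x_i)
...     return poisson.logpmf(d.y, lam).sum()
...
>>> log_likelihood(1.2917, 0.0757)  # should come out near -235.39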
In R, this is a one-liner with glm. In Python, we work through it step by step with the statsmodels library.
# fit <- glm(y ~ x, data=d, family=poisson)
>>> import statsmodels.api as sm
>>> import statsmodels.formula.api as smf
>>> model = smf.glm('y ~ x', data=d, family=sm.families.Poisson())
>>> results = model.fit()
>>> print(results.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 100
Model: GLM Df Residuals: 98
Model Family: Poisson Df Model: 1
Link Function: log Scale: 1.0
Method: IRLS Log-Likelihood: -235.39
Date:         Mon, 01 Jun 2015   Deviance:                       84.993
Time: 23:06:45 Pearson chi2: 83.8
No. Iterations: 7
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 1.2917 0.364 3.552 0.000 0.579 2.005
x 0.0757 0.036 2.125 0.034 0.006 0.145
==============================================================================
>>> results.llf
-235.38625076986077
What R labels Intercept appears here as const: const corresponds to $\beta_1$ (intercept) and x to $\beta_2$ (slope).
In other words, the maximum likelihood estimates are $\beta_1 = 1.2917$ and $\beta_2 = 0.0757$.
std err is the **standard error**, which expresses the variability of $\beta_1$ and $\beta_2$: if a data set of the same size were collected again by the same survey method, the maximum likelihood estimates would come out differently, and the standard error quantifies that spread.
z is a statistic called the **z value**: the maximum likelihood estimate divided by its standard error.
This z can be used to estimate a **Wald confidence interval**; here, the range of plausible values for each estimate does not include 0 [see](http://hosho.ees.hokudai.ac.jp/~kubo/log/2009/0101.html#08).
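These quantities can also be pulled out of the fitted object programmatically; a sketch assuming the results object above:

>>> results.params      # maximum likelihood estimates beta_1, beta_2
>>> results.bse         # standard errors
>>> results.tvalues     # z values (estimate / standard error)
>>> results.conf_int()  # Wald confidence intervals, as in the summary table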
Using the Poisson regression estimates, let's plot the predicted mean seed count $\lambda$ as a function of body size $x$:

\lambda = \exp(1.29 + 0.0757 x)
# plot(d$x, d$y, pch=c(21, 19)[d$f])
# xx <- seq(min(d$x), max(d$x), length=100)
# lines(xx, exp(1.29 + 0.0757 * xx), lwd = 2)
>>> import numpy
>>> plt.plot(d.x[d.f == 'C'], d.y[d.f == 'C'], 'bo')
>>> plt.plot(d.x[d.f == 'T'], d.y[d.f == 'T'], 'ro')
>>> xx = [d.x.min() + i * ((d.x.max() - d.x.min()) / 100.) for i in range(100)]
>>> yy = [numpy.exp(1.29 + 0.0757 * i) for i in xx]
>>> plt.plot(xx, yy)
>>> plt.show()
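Instead of hard-coding rounded coefficients, the fitted model can generate the same curve via results.predict; a sketch assuming the results object for the y ~ x model above:

>>> xx = numpy.linspace(d.x.min(), d.x.max(), 100)
>>> plt.plot(xx, results.predict(pandas.DataFrame({'x': xx})))
>>> plt.show()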
Next, consider a statistical model that uses the fertilization treatment $f_i$ as the explanatory variable.
In a GLM, a factor-type explanatory variable is encoded as a **dummy variable**; that is, the model's mean is written as
\begin{eqnarray}
\lambda_i &=& \exp(\beta_1 + \beta_3 d_i) \\
d_i &=& \left\{ \begin{array}{ll}
0 & (f_i = C) \\
1 & (f_i = T)
\end{array} \right.
\end{eqnarray}
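To see this dummy coding concretely, pandas can expand the factor column by hand (patsy does the equivalent internally when it sees f in the formula):

>>> pandas.get_dummies(d.f, prefix='f').head()  # 0/1 columns f_C and f_T; the model uses d_i = f_T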
# fit.f <- glm(y ~ f, data=d, family=poisson)
>>> model = smf.glm('y ~ f', data=d, family=sm.families.Poisson())
>>> results = model.fit()
>>> print(results.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 100
Model: GLM Df Residuals: 98
Model Family: Poisson Df Model: 1
Link Function: log Scale: 1.0
Method: IRLS Log-Likelihood: -237.63
Date:         Thu, 04 Jun 2015   Deviance:                       89.475
Time: 17:20:11 Pearson chi2: 87.1
No. Iterations: 7
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 2.0516 0.051 40.463 0.000 1.952 2.151
f[T.T] 0.0128 0.071 0.179 0.858 -0.127 0.153
==============================================================================
>>> results.llf
-237.62725696068682
From this result, if individual $i$'s $f_i$ is C, then

\lambda_i = \exp(2.05 + 0.0128 \times 0) = \exp(2.05) \approx 7.77

and if it is T, then

\lambda_i = \exp(2.05 + 0.0128 \times 1) = \exp(2.0628) \approx 7.87

so fertilization is predicted to increase the mean seed count only slightly.
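The same two predictions can be read off the fitted model directly; a sketch assuming the results object for the y ~ f model above:

>>> results.predict(pandas.DataFrame({'f': ['C', 'T']}))  # roughly [7.77, 7.87]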
However, the maximum log-likelihood (llf) is smaller than that of the body-size model, so this model fits worse.
Now consider a statistical model that incorporates both body size $x_i$ and fertilization treatment $f_i$:
\log \lambda_i = \beta_1 + \beta_2 x_i + \beta_3 d_i
>>> model = smf.glm('y ~ x + f', data=d, family=sm.families.Poisson())
>>> result = model.fit()
>>> result.summary()
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 100
Model: GLM Df Residuals: 97
Model Family: Poisson Df Model: 2
Link Function: log Scale: 1.0
Method: IRLS Log-Likelihood: -235.29
Date:         Thu, 04 Jun 2015   Deviance:                       84.808
Time: 17:31:34 Pearson chi2: 83.8
No. Iterations: 7
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 1.2631 0.370 3.417 0.001 0.539 1.988
f[T.T] -0.0320 0.074 -0.430 0.667 -0.178 0.114
x 0.0801 0.037 2.162 0.031 0.007 0.153
==============================================================================
>>> result.llf
-235.29371924249369
The maximum log-likelihood here is the largest of the three, so this model fits better than the two statistical models above.
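A compact way to line up all three fits, reusing the imports above:

>>> for formula in ['y ~ x', 'y ~ f', 'y ~ x + f']:
...     res = smf.glm(formula, data=d, family=sm.families.Poisson()).fit()
...     print(formula, res.llf)  # -235.39, -237.63, -235.29 respectively
...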
Transforming this quantitative + factor-type statistical model,

\begin{eqnarray}
\log \lambda_i &=& \beta_1 + \beta_2 x_i + \beta_3 d_i \\
\lambda_i &=& \exp(\beta_1 + \beta_2 x_i + \beta_3 d_i) \\
\lambda_i &=& \exp(\beta_1) \times \exp(\beta_2 x_i) \times \exp(\beta_3 d_i) \\
\lambda_i &=& \exp(\text{constant}) \times \exp(\text{body size effect}) \times \exp(\text{fertilization effect})
\end{eqnarray}

we can see that the individual effects act multiplicatively on the mean.
When the mean is linked to the linear predictor with no transformation at all, the link is called the **identity link function**.
Among GLMs, the combination of a normal distribution and an identity link function is called a **linear model** (LM) or **general linear model**.
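For reference, the same statsmodels interface fits such a model: the Gaussian family uses the identity link by default, so this is ordinary linear regression expressed as a GLM (a sketch; whether a normal distribution suits count data is a separate question):

>>> lm = smf.glm('y ~ x', data=d, family=sm.families.Gaussian()).fit()
>>> print(lm.summary())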