Statsmodels is a statistical analysis software that runs on a programming language called Python. Python must be installed on your PC to run the statsmodels sample. If you haven't installed it yet, please refer to Installing Jupyter notebook. Jupyter notebook is very useful for running stats models.

Most of it is based on statsmodels reference.

Generalized Estimating Equations

The generalized estimation equation is a model that aims to estimate the average effect of observations of panels, clusters, and iterative measurement data in a population of data that is correlated within the cluster and not correlated outside the cluster. , Generalized linear model. Although it is said that there is a correlation within the cluster, it can be said that the cause of the correlation has not been identified or cannot be identified. The average structure and the correlation structure can be handled separately. GLM

--The dependent variable $ y_i $ follows the exponential family. --Each dependent variable is independent.

Under the assumption

--Assuming a probability distribution of the random variable $ y_i $. -Set a linear regression model with the expected value of $ y_i $ of $ \ mu $. --Create a likelihood function for the assumed probability distribution of $ y_i $ and find $ \ beta (\ hat {\ beta}) $ to maximize it. --The maximum likelihood estimator $ \ hat {\ beta} $ has consistency and asymptotic normality.

Estimate the model as.

In GEE

--Relax the assumption that the dependent variable $ y_i $ follows an exponential family. --Relax the assumption that the dependent variable is independent. --Correspond to an unknown distribution using pseudo-likelihood. --Introduce working correlation matrix $ R_i (\ alpha) $ to represent the correlation structure. --The generalized estimation equation is divided into two stages and solved by numerical calculation.

Estimate the model in this order. Therefore, it is possible to deal with the diversity of data, but it also requires the amount of information.

Example: Number of seizures in epilepsy patients

Thall and Vail (1990) analyzed the 8-week baseline period and the number of seizures every two consecutive weeks in 59 epilepsy patients. No treatment is given to the patient during the baseline period. After the baseline period, patients are randomly prescribed placebo or progabide. Post-prescription observations are performed for 4x2 = 8 weeks.

Data details: Use: Epilepsy Format: 236 rows and 9 columns

y: Number of seizures in 2 weeks trt: Treatment, placebo, or progabide base: Number of seizures during the 8-week baseline period age: Subject's age, age V4: 0/1 indicator variable for 4 periods subject: Subject number, 1 to 59. period: period, 1 to 4. lbase: Logarithmic mean of seizures during baseline period standardized to zero lage: Logarithmic mean of age, standardized to zero

Epileptic seizures correspond to recurrent events in survival analysis. Recurrence event data is obtained by repeatedly measuring the occurrence of seizures in the same subject, and can be regarded as a type of longitudinal data or repeated measures data. It is considered that there is no correlation in the number of epileptic seizures in different subjects, but it is natural to think that there is a correlation in the results if the measurements are made for the same subject.

A total of 4 observations were made at intervals of 8 weeks before drug administration and 2 weeks after drug administration, and the number of seizures was examined for each interval. Only the age of the subject is a covariate.

First, initialize and get the data.

import statsmodels.formula.api as smf
data = sm.datasets.get_rdataset('epil', package='MASS').data

Descriptive statistics

Examine the state of the data.

data.head(10)

y	trt	base	age	V4	subject	period	lbase	lage
0	5	placebo	11	31	0	1	1	-0.756354	0.114204
1	3	placebo	11	31	0	1	2	-0.756354	0.114204
2	3	placebo	11	31	0	1	3	-0.756354	0.114204
3	3	placebo	11	31	1	1	4	-0.756354	0.114204
4	3	placebo	11	30	0	2	1	-0.756354	0.081414
5	5	placebo	11	30	0	2	2	-0.756354	0.081414
6	3	placebo	11	30	0	2	3	-0.756354	0.081414
7	3	placebo	11	30	1	2	4	-0.756354	0.081414
8	2	placebo	6	25	0	3	1	-1.362490	-0.100908
9	4	placebo	6	25	0	3	2	-1.362490	-0.100908

Let's compare the distribution of the number of epileptic seizures after prescribing a drug that does not contain the active ingredient placebo and a GABA-like antiepileptic drug called progabide. The number of data is 112 for the former and 124 for the latter.

data[data.trt!="placebo"].y.hist()
data[data.trt=="placebo"].y.hist()

data[data.trt!="placebo"].y.count(),data[data.trt=="placebo"].y.count()

(124, 112)

Let's make a frequency chart of the logarithm of the number of seizures and the logarithm of the age during the standardized baseline period.

data.lbase.hist()

data.lage.hist()

Next, we get descriptive statistics. Mean and standard deviation of the number of seizures during the baseline period

data.base.mean(),data.base.std()

(31.220338983050848, 26.705051109822254)

Mean and standard deviation of the number of seizures after progabide

data[data.trt!="placebo"].y.mean(),data[data.trt!="placebo"].y.std()

(7.959677419354839, 13.92978863002629)

Mean and standard deviation of seizures during baseline period in patients treated with progabide

data[data.trt!="placebo"].base.mean(),data[data.trt!="placebo"].base.std()

(31.612903225806452, 27.638405492528324)

Mean and standard deviation of the number of seizures after placebo

data[data.trt=="placebo"].y.mean(),data[data.trt=="placebo"].y.std()

(8.580357142857142, 10.369426989352696)

Mean and standard deviation of seizures during baseline period for placebo-treated patients

data[data.trt=="placebo"].base.mean(),data[data.trt=="placebo"].base.std()

(30.785714285714285, 25.749111266541444)

The logarithm of the subject's age and the number of seizures during the baseline period is plotted.

plt.scatter(data.lage,data.lbase)
rg=sm.OLS(data.lbase,data.lage).fit()
plt.plot(data.lage,rg.fittedvalues)

Statistical inference

Estimate the model. It tries to follow the Poisson distribution for the number of seizures.

Differences in working correlation matrix

independence()

For the working correlation matrix, we assume that the paired data for repeated measurements of each patient are independent. "subject" indicates groups.

fam = sm.families.Poisson()
ind = sm.cov_struct.Independence()
mod = smf.gee("y ~ age + trt + base", "subject", data,
              cov_struct=ind, family=fam)

res = mod.fit()
print(res.summary())

GEE Regression Results                              
===================================================================================
Dep. Variable:                           y   No. Observations:                  236
Model:                                 GEE   No. clusters:                       59
Method:                        Generalized   Min. cluster size:                   4
                      Estimating Equations   Max. cluster size:                   4
Family:                            Poisson   Mean cluster size:                 4.0
Dependence structure:         Independence   Num. iterations:                     2
Date:                     Sat, 02 Nov 2019   Scale:                           1.000
Covariance type:                    robust   Time:                         13:06:54
====================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            0.5730      0.361      1.589      0.112      -0.134       1.280
trt[T.progabide]    -0.1519      0.171     -0.888      0.375      -0.487       0.183
age                  0.0223      0.011      1.960      0.050    2.11e-06       0.045
base                 0.0226      0.001     18.451      0.000       0.020       0.025
==============================================================================
Skew:                          3.7823   Kurtosis:                      28.6672
Centered skew:                 2.7597   Centered kurtosis:             21.9865
==============================================================================

Next, take the mean value of the number of seizures for each subject and the logarithm of the variance and blot it.

yg = mod.cluster_list(np.asarray(data["y"]))
ymn = [x.mean() for x in yg]
yva = [x.var() for x in yg]
plt.grid(True)
plt.plot(np.log(ymn), np.log(yva), 'o')
plt.plot([-2, 6], [-2, 6], '-', color='grey')
plt.xlabel("Log variance", size=17)
plt.ylabel("Log mean", size=17)

exchangeable() In this case, it is assumed that the paired data of each patient's repeated measurements for the working correlation matrix have the same correlation.

fam = sm.families.Poisson()
ex = sm.cov_struct.Exchangeable()
mod = smf.gee("y ~ age + trt + base", "subject", data,
              cov_struct=ex, family=fam)

res = mod.fit()
print(res.summary())

 GEE Regression Results                              
===================================================================================
Dep. Variable:                           y   No. Observations:                  236
Model:                                 GEE   No. clusters:                       59
Method:                        Generalized   Min. cluster size:                   4
                      Estimating Equations   Max. cluster size:                   4
Family:                            Poisson   Mean cluster size:                 4.0
Dependence structure:         Exchangeable   Num. iterations:                     2
Date:                     Thu, 31 Oct 2019   Scale:                           1.000
Covariance type:                    robust   Time:                         07:23:48
====================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            0.5730      0.361      1.589      0.112      -0.134       1.280
trt[T.progabide]    -0.1519      0.171     -0.888      0.375      -0.487       0.183
age                  0.0223      0.011      1.960      0.050    2.11e-06       0.045
base                 0.0226      0.001     18.451      0.000       0.020       0.025
==============================================================================
Skew:                          3.7823   Kurtosis:                      28.6672
Centered skew:                 2.7597   Centered kurtosis:             21.9865
==============================================================================

fig = res.plot_isotropic_dependence()
fig.get_axes()[0].set_ylim(0, 1)

rslts = res.params_sensitivity(0., 0.9, 5)
dp = [x.cov_struct.dep_params for x in rslts]
age_params = np.asarray([x.params[2] for x in rslts])
age_se = np.asarray([x.bse[2] for x in rslts])
plt.grid(True)
plt.plot(dp, age_params, '-', color='orange', lw=4)
plt.plot(dp, age_params + age_se, '-', color='grey')
plt.plot(dp, age_params - age_se, '-', color='grey')
plt.axvline(res.cov_struct.dep_params, ls='--', color='black')
plt.xlabel("Autocorrelation parameter", size=14)
plt.ylabel("Age effect", size=14)

plt.plot(res.fittedvalues, res.resid, 'o', alpha=0.5)
plt.xlabel("Fitted values", size=17)
plt.ylabel("Residuals", size=17)

Autoregressive() In this case, it is assumed that the paired data of each patient's repeated measurements for the working correlation matrix are autocorrelated.

fam = sm.families.Poisson()
au = sm.cov_struct.Autoregressive()
mod = smf.gee("y ~ age + trt + base", "subject", data,
              cov_struct=au, family=fam)

res = mod.fit()
print(res.summary())

GEE Regression Results                              
===================================================================================
Dep. Variable:                           y   No. Observations:                  236
Model:                                 GEE   No. clusters:                       59
Method:                        Generalized   Min. cluster size:                   4
                      Estimating Equations   Max. cluster size:                   4
Family:                            Poisson   Mean cluster size:                 4.0
Dependence structure:       Autoregressive   Num. iterations:                    11
Date:                     Sat, 02 Nov 2019   Scale:                           1.000
Covariance type:                    robust   Time:                         13:08:52
====================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            0.4123      0.422      0.977      0.329      -0.415       1.240
trt[T.progabide]    -0.0704      0.164     -0.430      0.667      -0.391       0.250
age                  0.0274      0.013      2.108      0.035       0.002       0.053
base                 0.0223      0.001     18.252      0.000       0.020       0.025
==============================================================================
Skew:                          3.9903   Kurtosis:                      30.1753
Centered skew:                 2.7597   Centered kurtosis:             21.9865
==============================================================================

fig = res.plot_isotropic_dependence()
fig.get_axes()[0].set_ylim(0, 1)

rslts = res.params_sensitivity(0., 0.9, 5)
dp = [x.cov_struct.dep_params for x in rslts]
age_params = np.asarray([x.params[2] for x in rslts])
age_se = np.asarray([x.bse[2] for x in rslts])
plt.grid(True)
plt.plot(dp, age_params, '-', color='orange', lw=4)
plt.plot(dp, age_params + age_se, '-', color='grey')
plt.plot(dp, age_params - age_se, '-', color='grey')
plt.axvline(res.cov_struct.dep_params, ls='--', color='black')
plt.xlabel("Autocorrelation parameter", size=14)
plt.ylabel("Age effect", size=14)

We tested whether administration of the study drug reduced the number of seizures compared to placebo, changing the working correlation matrix, but the results from this clinical trial were significant when using the generalized estimation equation. I can't get it.

Example: Epileptic seizure (Leppik et.al.1987)

Let's look at the case where the results of clinical trials are evaluated using the general estimation equation using the data of epileptic seizures of the same 59 people.

First, change the structure of the data and put the number of seizures during the baseline period in $ y_ {ij} . $log E(Y_{ij})=log(T_{ij})+\beta_0+\beta_1x_{ij1}+\beta_2x_{ij2}+\beta_3x_{ij1}x_{ij2}$$ Here, $ x_ {ij1} $ represents the period before (0) and after (1) of randomization as a binary value. $ x_ {ij2} $ indicates the treatment group of placebo (0) and study drug (1) in binary. $ T_ {ij} $ is the length of the 8-week baseline period and the 2-week interval for each post-treatment measurement period.

import pandas as pd
yy=pd.DataFrame(np.arange(59)+1,columns=['subject'])
yy['period']=np.log(8)
yy['y']=np.NaN
yy['trt']=0#np.NaN
yy['btrt']=0
data['btrt']=1
data['period']=np.log(2)

ii=0
for i in range(len(data)):
    if data.subject.iloc[i]==yy.subject.iloc[ii]:
        yy.iloc[ii,2]=data.base.iloc[i]
        yy.iloc[ii,3]=data.trt.iloc[i]
        yy.iloc[ii,0]=data.subject.iloc[i]
        ii+=1
        if ii>=59:
            break
yy=pd.concat([data,yy],axis=0,join='inner')

for i in range(len(yy)):
    if yy.iloc[i,1]=='progabide':
        yy.iloc[i:i+1,1]=1
    if yy.iloc[i,1]=='placebo':
        yy.iloc[i:i+1,1]=0

mod = smf.gee("y ~ btrt + trt ", "subject", yy,
              cov_struct=ex, family=fam,offset='period')

res = mod.fit()
print(res.summary())

  GEE Regression Results                              
===================================================================================
Dep. Variable:                           y   No. Observations:                  295
Model:                                 GEE   No. clusters:                       59
Method:                        Generalized   Min. cluster size:                   5
                      Estimating Equations   Max. cluster size:                   5
Family:                            Poisson   Mean cluster size:                 5.0
Dependence structure:         Exchangeable   Num. iterations:                     6
Date:                     Sat, 02 Nov 2019   Scale:                           1.000
Covariance type:                    robust   Time:                         23:30:28
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.3240      0.168      7.883      0.000       0.995       1.653
btrt           0.0560      0.106      0.530      0.596      -0.151       0.263
trt            0.0705      0.213      0.331      0.740      -0.346       0.487
==============================================================================
Skew:                          3.3538   Kurtosis:                      15.0311
Centered skew:                 2.0882   Centered kurtosis:              6.6169
==============================================================================

No results are obtained that the study group of clinical trials is significant. It looks a little different there. Add the age of the subject, which is a covariate.

import pandas as pd
yy=pd.DataFrame(np.arange(59)+1,columns=['subject'])
yy['period']=0
yy['y']=np.NaN
yy['trt']=0#np.NaN
yy['btrt']=0
yy['age']=0
data = sm.datasets.get_rdataset('epil', package='MASS').data
data['btrt']=1

ii=0
for i in range(len(data)):
    if data.subject.iloc[i]==yy.subject.iloc[ii]:
        yy.iloc[ii,2]=data.base.iloc[i]
        #yy.iloc[ii,3]=data.trt.iloc[i]
        yy.iloc[ii,0]=data.subject.iloc[i]
        yy.iloc[ii,5]=data.age.iloc[i]
        ii+=1
        if ii>=59:
            break
#print(yy.head(200))
yy=pd.concat([data,yy],axis=0,join='inner')

for i in range(len(yy)):
    if yy.iloc[i,1]=='progabide':
        yy.iloc[i,1]=1
    if yy.iloc[i,1]=='placebo':
        yy.iloc[i,1]=0
yy[yy.trt==0].count()

mod = smf.gee("y ~ age + btrt + trt ", "subject", yy,
              cov_struct=ind, family=fam,offset='period')

res = mod.fit()
print(res.summary())

GEE Regression Results                              
===================================================================================
Dep. Variable:                           y   No. Observations:                  295
Model:                                 GEE   No. clusters:                       59
Method:                        Generalized   Min. cluster size:                   5
                      Estimating Equations   Max. cluster size:                   5
Family:                            Poisson   Mean cluster size:                 5.0
Dependence structure:         Independence   Num. iterations:                     2
Date:                     Sat, 02 Nov 2019   Scale:                           1.000
Covariance type:                    robust   Time:                         23:48:30
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.9937      0.614      6.503      0.000       2.790       5.197
age           -0.0198      0.021     -0.954      0.340      -0.060       0.021
btrt          -4.3316      0.154    -28.050      0.000      -4.634      -4.029
trt           -0.1014      0.335     -0.303      0.762      -0.758       0.555
==============================================================================
Skew:                          2.6891   Kurtosis:                      13.3017
Centered skew:                 0.3066   Centered kurtosis:              3.7456
==============================================================================

The point that it does not become significant remains unchanged.

References:

Analysis of recurrence event data Wiki notebooks for GEE repeated measures of epileptic seizure counts