Statsmodels is a statistical analysis software that runs on a programming language called Python. Python must be installed on your PC to run the statsmodels sample. If you haven't installed it yet, please refer to Installing Jupyter notebook. Jupyter notebook is very useful for running stats models.
Most of it is based on statsmodels reference.
The generalized estimation equation is a model that aims to estimate the average effect of observations of panels, clusters, and iterative measurement data in a population of data that is correlated within the cluster and not correlated outside the cluster. , Generalized linear model. Although it is said that there is a correlation within the cluster, it can be said that the cause of the correlation has not been identified or cannot be identified. The average structure and the correlation structure can be handled separately. GLM
--The dependent variable $ y_i $ follows the exponential family. --Each dependent variable is independent.
Under the assumption
--Assuming a probability distribution of the random variable $ y_i $. -Set a linear regression model with the expected value of $ y_i $ of $ \ mu $. --Create a likelihood function for the assumed probability distribution of $ y_i $ and find $ \ beta (\ hat {\ beta}) $ to maximize it. --The maximum likelihood estimator $ \ hat {\ beta} $ has consistency and asymptotic normality.
Estimate the model as.
In GEE
--Relax the assumption that the dependent variable $ y_i $ follows an exponential family. --Relax the assumption that the dependent variable is independent. --Correspond to an unknown distribution using pseudo-likelihood. --Introduce working correlation matrix $ R_i (\ alpha) $ to represent the correlation structure. --The generalized estimation equation is divided into two stages and solved by numerical calculation.
Estimate the model in this order. Therefore, it is possible to deal with the diversity of data, but it also requires the amount of information.
Thall and Vail (1990) analyzed the 8-week baseline period and the number of seizures every two consecutive weeks in 59 epilepsy patients. No treatment is given to the patient during the baseline period. After the baseline period, patients are randomly prescribed placebo or progabide. Post-prescription observations are performed for 4x2 = 8 weeks.
Data details: Use: Epilepsy Format: 236 rows and 9 columns
y: Number of seizures in 2 weeks trt: Treatment, placebo, or progabide base: Number of seizures during the 8-week baseline period age: Subject's age, age V4: 0/1 indicator variable for 4 periods subject: Subject number, 1 to 59. period: period, 1 to 4. lbase: Logarithmic mean of seizures during baseline period standardized to zero lage: Logarithmic mean of age, standardized to zero
Epileptic seizures correspond to recurrent events in survival analysis. Recurrence event data is obtained by repeatedly measuring the occurrence of seizures in the same subject, and can be regarded as a type of longitudinal data or repeated measures data. It is considered that there is no correlation in the number of epileptic seizures in different subjects, but it is natural to think that there is a correlation in the results if the measurements are made for the same subject.
A total of 4 observations were made at intervals of 8 weeks before drug administration and 2 weeks after drug administration, and the number of seizures was examined for each interval. Only the age of the subject is a covariate.
First, initialize and get the data.
import statsmodels.formula.api as smf
data = sm.datasets.get_rdataset('epil', package='MASS').data
Examine the state of the data.
data.head(10)
y trt base age V4 subject period lbase lage
0 5 placebo 11 31 0 1 1 -0.756354 0.114204
1 3 placebo 11 31 0 1 2 -0.756354 0.114204
2 3 placebo 11 31 0 1 3 -0.756354 0.114204
3 3 placebo 11 31 1 1 4 -0.756354 0.114204
4 3 placebo 11 30 0 2 1 -0.756354 0.081414
5 5 placebo 11 30 0 2 2 -0.756354 0.081414
6 3 placebo 11 30 0 2 3 -0.756354 0.081414
7 3 placebo 11 30 1 2 4 -0.756354 0.081414
8 2 placebo 6 25 0 3 1 -1.362490 -0.100908
9 4 placebo 6 25 0 3 2 -1.362490 -0.100908
Let's compare the distribution of the number of epileptic seizures after prescribing a drug that does not contain the active ingredient placebo and a GABA-like antiepileptic drug called progabide. The number of data is 112 for the former and 124 for the latter.
data[data.trt!="placebo"].y.hist()
data[data.trt=="placebo"].y.hist()
data[data.trt!="placebo"].y.count(),data[data.trt=="placebo"].y.count()
(124, 112)
Let's make a frequency chart of the logarithm of the number of seizures and the logarithm of the age during the standardized baseline period.
data.lbase.hist()
data.lage.hist()
Next, we get descriptive statistics. Mean and standard deviation of the number of seizures during the baseline period
data.base.mean(),data.base.std()
(31.220338983050848, 26.705051109822254)
Mean and standard deviation of the number of seizures after progabide
data[data.trt!="placebo"].y.mean(),data[data.trt!="placebo"].y.std()
(7.959677419354839, 13.92978863002629)
Mean and standard deviation of seizures during baseline period in patients treated with progabide
data[data.trt!="placebo"].base.mean(),data[data.trt!="placebo"].base.std()
(31.612903225806452, 27.638405492528324)
Mean and standard deviation of the number of seizures after placebo
data[data.trt=="placebo"].y.mean(),data[data.trt=="placebo"].y.std()
(8.580357142857142, 10.369426989352696)
Mean and standard deviation of seizures during baseline period for placebo-treated patients
data[data.trt=="placebo"].base.mean(),data[data.trt=="placebo"].base.std()
(30.785714285714285, 25.749111266541444)
The logarithm of the subject's age and the number of seizures during the baseline period is plotted.
plt.scatter(data.lage,data.lbase)
rg=sm.OLS(data.lbase,data.lage).fit()
plt.plot(data.lage,rg.fittedvalues)
Estimate the model. It tries to follow the Poisson distribution for the number of seizures.
independence()
For the working correlation matrix, we assume that the paired data for repeated measurements of each patient are independent. "subject" indicates groups.
fam = sm.families.Poisson()
ind = sm.cov_struct.Independence()
mod = smf.gee("y ~ age + trt + base", "subject", data,
cov_struct=ind, family=fam)
res = mod.fit()
print(res.summary())
GEE Regression Results
===================================================================================
Dep. Variable: y No. Observations: 236
Model: GEE No. clusters: 59
Method: Generalized Min. cluster size: 4
Estimating Equations Max. cluster size: 4
Family: Poisson Mean cluster size: 4.0
Dependence structure: Independence Num. iterations: 2
Date: Sat, 02 Nov 2019 Scale: 1.000
Covariance type: robust Time: 13:06:54
====================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------
Intercept 0.5730 0.361 1.589 0.112 -0.134 1.280
trt[T.progabide] -0.1519 0.171 -0.888 0.375 -0.487 0.183
age 0.0223 0.011 1.960 0.050 2.11e-06 0.045
base 0.0226 0.001 18.451 0.000 0.020 0.025
==============================================================================
Skew: 3.7823 Kurtosis: 28.6672
Centered skew: 2.7597 Centered kurtosis: 21.9865
==============================================================================
Next, take the mean value of the number of seizures for each subject and the logarithm of the variance and blot it.
yg = mod.cluster_list(np.asarray(data["y"]))
ymn = [x.mean() for x in yg]
yva = [x.var() for x in yg]
plt.grid(True)
plt.plot(np.log(ymn), np.log(yva), 'o')
plt.plot([-2, 6], [-2, 6], '-', color='grey')
plt.xlabel("Log variance", size=17)
plt.ylabel("Log mean", size=17)
exchangeable() In this case, it is assumed that the paired data of each patient's repeated measurements for the working correlation matrix have the same correlation.
fam = sm.families.Poisson()
ex = sm.cov_struct.Exchangeable()
mod = smf.gee("y ~ age + trt + base", "subject", data,
cov_struct=ex, family=fam)
res = mod.fit()
print(res.summary())
GEE Regression Results
===================================================================================
Dep. Variable: y No. Observations: 236
Model: GEE No. clusters: 59
Method: Generalized Min. cluster size: 4
Estimating Equations Max. cluster size: 4
Family: Poisson Mean cluster size: 4.0
Dependence structure: Exchangeable Num. iterations: 2
Date: Thu, 31 Oct 2019 Scale: 1.000
Covariance type: robust Time: 07:23:48
====================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------
Intercept 0.5730 0.361 1.589 0.112 -0.134 1.280
trt[T.progabide] -0.1519 0.171 -0.888 0.375 -0.487 0.183
age 0.0223 0.011 1.960 0.050 2.11e-06 0.045
base 0.0226 0.001 18.451 0.000 0.020 0.025
==============================================================================
Skew: 3.7823 Kurtosis: 28.6672
Centered skew: 2.7597 Centered kurtosis: 21.9865
==============================================================================
fig = res.plot_isotropic_dependence()
fig.get_axes()[0].set_ylim(0, 1)
rslts = res.params_sensitivity(0., 0.9, 5)
dp = [x.cov_struct.dep_params for x in rslts]
age_params = np.asarray([x.params[2] for x in rslts])
age_se = np.asarray([x.bse[2] for x in rslts])
plt.grid(True)
plt.plot(dp, age_params, '-', color='orange', lw=4)
plt.plot(dp, age_params + age_se, '-', color='grey')
plt.plot(dp, age_params - age_se, '-', color='grey')
plt.axvline(res.cov_struct.dep_params, ls='--', color='black')
plt.xlabel("Autocorrelation parameter", size=14)
plt.ylabel("Age effect", size=14)
plt.plot(res.fittedvalues, res.resid, 'o', alpha=0.5)
plt.xlabel("Fitted values", size=17)
plt.ylabel("Residuals", size=17)
Autoregressive() In this case, it is assumed that the paired data of each patient's repeated measurements for the working correlation matrix are autocorrelated.
fam = sm.families.Poisson()
au = sm.cov_struct.Autoregressive()
mod = smf.gee("y ~ age + trt + base", "subject", data,
cov_struct=au, family=fam)
res = mod.fit()
print(res.summary())
GEE Regression Results
===================================================================================
Dep. Variable: y No. Observations: 236
Model: GEE No. clusters: 59
Method: Generalized Min. cluster size: 4
Estimating Equations Max. cluster size: 4
Family: Poisson Mean cluster size: 4.0
Dependence structure: Autoregressive Num. iterations: 11
Date: Sat, 02 Nov 2019 Scale: 1.000
Covariance type: robust Time: 13:08:52
====================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------
Intercept 0.4123 0.422 0.977 0.329 -0.415 1.240
trt[T.progabide] -0.0704 0.164 -0.430 0.667 -0.391 0.250
age 0.0274 0.013 2.108 0.035 0.002 0.053
base 0.0223 0.001 18.252 0.000 0.020 0.025
==============================================================================
Skew: 3.9903 Kurtosis: 30.1753
Centered skew: 2.7597 Centered kurtosis: 21.9865
==============================================================================
fig = res.plot_isotropic_dependence()
fig.get_axes()[0].set_ylim(0, 1)
rslts = res.params_sensitivity(0., 0.9, 5)
dp = [x.cov_struct.dep_params for x in rslts]
age_params = np.asarray([x.params[2] for x in rslts])
age_se = np.asarray([x.bse[2] for x in rslts])
plt.grid(True)
plt.plot(dp, age_params, '-', color='orange', lw=4)
plt.plot(dp, age_params + age_se, '-', color='grey')
plt.plot(dp, age_params - age_se, '-', color='grey')
plt.axvline(res.cov_struct.dep_params, ls='--', color='black')
plt.xlabel("Autocorrelation parameter", size=14)
plt.ylabel("Age effect", size=14)
We tested whether administration of the study drug reduced the number of seizures compared to placebo, changing the working correlation matrix, but the results from this clinical trial were significant when using the generalized estimation equation. I can't get it.
Let's look at the case where the results of clinical trials are evaluated using the general estimation equation using the data of epileptic seizures of the same 59 people.
First, change the structure of the data and put the number of seizures during the baseline period in $ y_ {ij}
import pandas as pd
yy=pd.DataFrame(np.arange(59)+1,columns=['subject'])
yy['period']=np.log(8)
yy['y']=np.NaN
yy['trt']=0#np.NaN
yy['btrt']=0
data['btrt']=1
data['period']=np.log(2)
ii=0
for i in range(len(data)):
if data.subject.iloc[i]==yy.subject.iloc[ii]:
yy.iloc[ii,2]=data.base.iloc[i]
yy.iloc[ii,3]=data.trt.iloc[i]
yy.iloc[ii,0]=data.subject.iloc[i]
ii+=1
if ii>=59:
break
yy=pd.concat([data,yy],axis=0,join='inner')
for i in range(len(yy)):
if yy.iloc[i,1]=='progabide':
yy.iloc[i:i+1,1]=1
if yy.iloc[i,1]=='placebo':
yy.iloc[i:i+1,1]=0
mod = smf.gee("y ~ btrt + trt ", "subject", yy,
cov_struct=ex, family=fam,offset='period')
res = mod.fit()
print(res.summary())
GEE Regression Results
===================================================================================
Dep. Variable: y No. Observations: 295
Model: GEE No. clusters: 59
Method: Generalized Min. cluster size: 5
Estimating Equations Max. cluster size: 5
Family: Poisson Mean cluster size: 5.0
Dependence structure: Exchangeable Num. iterations: 6
Date: Sat, 02 Nov 2019 Scale: 1.000
Covariance type: robust Time: 23:30:28
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.3240 0.168 7.883 0.000 0.995 1.653
btrt 0.0560 0.106 0.530 0.596 -0.151 0.263
trt 0.0705 0.213 0.331 0.740 -0.346 0.487
==============================================================================
Skew: 3.3538 Kurtosis: 15.0311
Centered skew: 2.0882 Centered kurtosis: 6.6169
==============================================================================
No results are obtained that the study group of clinical trials is significant. It looks a little different there. Add the age of the subject, which is a covariate.
import pandas as pd
yy=pd.DataFrame(np.arange(59)+1,columns=['subject'])
yy['period']=0
yy['y']=np.NaN
yy['trt']=0#np.NaN
yy['btrt']=0
yy['age']=0
data = sm.datasets.get_rdataset('epil', package='MASS').data
data['btrt']=1
ii=0
for i in range(len(data)):
if data.subject.iloc[i]==yy.subject.iloc[ii]:
yy.iloc[ii,2]=data.base.iloc[i]
#yy.iloc[ii,3]=data.trt.iloc[i]
yy.iloc[ii,0]=data.subject.iloc[i]
yy.iloc[ii,5]=data.age.iloc[i]
ii+=1
if ii>=59:
break
#print(yy.head(200))
yy=pd.concat([data,yy],axis=0,join='inner')
for i in range(len(yy)):
if yy.iloc[i,1]=='progabide':
yy.iloc[i,1]=1
if yy.iloc[i,1]=='placebo':
yy.iloc[i,1]=0
yy[yy.trt==0].count()
mod = smf.gee("y ~ age + btrt + trt ", "subject", yy,
cov_struct=ind, family=fam,offset='period')
res = mod.fit()
print(res.summary())
GEE Regression Results
===================================================================================
Dep. Variable: y No. Observations: 295
Model: GEE No. clusters: 59
Method: Generalized Min. cluster size: 5
Estimating Equations Max. cluster size: 5
Family: Poisson Mean cluster size: 5.0
Dependence structure: Independence Num. iterations: 2
Date: Sat, 02 Nov 2019 Scale: 1.000
Covariance type: robust Time: 23:48:30
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 3.9937 0.614 6.503 0.000 2.790 5.197
age -0.0198 0.021 -0.954 0.340 -0.060 0.021
btrt -4.3316 0.154 -28.050 0.000 -4.634 -4.029
trt -0.1014 0.335 -0.303 0.762 -0.758 0.555
==============================================================================
Skew: 2.6891 Kurtosis: 13.3017
Centered skew: 0.3066 Centered kurtosis: 3.7456
==============================================================================
The point that it does not become significant remains unchanged.
References:
Analysis of recurrence event data Wiki notebooks for GEE repeated measures of epileptic seizure counts
Recommended Posts