[Chapter 6] Econometrics (Yuhikaku) Chapter End Problem (Empirical), Answer in Python

I worked through end-of-chapter problem 6-10 [Empirical] of Econometrics in Python. I am still fairly new to Python, so please point out anything that could be improved.

Panel data analysis is easy with the linearmodels library. Here I first apply the fixed-effects transformation by hand with pandas, without linearmodels, and estimate the model by OLS. After that, I redo the analysis with linearmodels.

Chapter End Problem 6-10 (1)


#Basic data analysis library
import pandas as pd
import numpy as np
from scipy import stats

#Graph drawing
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

#Statistical model
from statsmodels.formula.api import ols
import statsmodels.api as stm 

#For panel data
from linearmodels.panel.data import PanelData
from linearmodels.panel import PanelOLS, PooledOLS

#Read the Stata data
df_1 = pd.read_stata('yamaguchi.dta')

df_1 #Data confirmation

To reproduce Table 6-5 of the textbook, the unused explanatory variables nucrate, numhh, and hhtype are dropped. The textbook estimates the models using only the data from 2000 onward, but since the data are available, the 1990 and 1995 observations are used here as well.


df_1 = df_1.drop(['nucrate','numhh','hhtype'],axis=1)
df_1  #Show data frame

Chapter 6 is about panel data analysis, and a glance at the data frame suggests that observations at several points in time (year) are available for several units (pref). To confirm how many units are observed and at which points in time, use the nunique and unique methods, which return the number of unique values and the array of unique values, respectively.


#Check the number of observation targets
df_1['pref'].nunique() 
#47

#Confirm the observation time
df_1['year'].unique() 
#array([1990, 1995, 2000, 2005, 2010], dtype=int16)

This confirms that the units are the 47 prefectures and that observations are available every five years from 1990 to 2010.
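
Beyond counting units and years, it can be useful to check that every unit is observed in every year (a balanced panel) and that no (pref, year) pair appears twice. The following is a minimal sketch on a hypothetical toy data frame, not the yamaguchi.dta data; the values are made up for illustration.

```python
import pandas as pd

# Toy panel with made-up values: 3 units observed in 2 years each
toy = pd.DataFrame({
    "pref": [1, 1, 2, 2, 3, 3],
    "year": [2000, 2005, 2000, 2005, 2000, 2005],
    "emprate": [0.50, 0.52, 0.61, 0.63, 0.55, 0.58],
})

n_units = toy["pref"].nunique()               # number of distinct units
years = sorted(toy["year"].unique())          # distinct observation years

# A panel is balanced if every unit appears in every year
obs_per_unit = toy.groupby("pref")["year"].nunique()
is_balanced = bool((obs_per_unit == len(years)).all())

# Each (pref, year) pair should appear at most once
has_duplicates = bool(toy.duplicated(["pref", "year"]).any())

print(n_units, years, is_balanced, has_duplicates)
```

The same checks should carry over to df_1 by replacing toy with the real data frame.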

The advantage of panel data is that the fixed-effects transformation removes variables that are constant over time within each unit, which avoids omitted variable bias from such variables. Similarly, the time-effects transformation removes variables that are common across units but change over time. In this analysis, unobserved cultural differences between prefectures could cause omitted variable bias if the regression were run on the raw data. Cultural differences are unlikely to change much over time, so the fixed-effects transformation should avoid this bias.
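
To make the argument concrete, here is a small simulation sketch. It assumes a hypothetical data-generating process in which an unobserved, time-invariant unit effect (alpha) is correlated with the regressor; all names and parameter values are illustrative and unrelated to the textbook data. Pooled OLS picks up the unit effect, while the within (fixed-effects) slope stays close to the true value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_units, n_periods, beta = 47, 5, 0.5   # beta is the true slope (assumed)

unit = np.repeat(np.arange(n_units), n_periods)
alpha = rng.normal(size=n_units)[unit]        # unobserved, constant over time
x = alpha + rng.normal(size=unit.size)        # regressor correlated with alpha
y = beta * x + 2.0 * alpha + 0.1 * rng.normal(size=unit.size)
sim = pd.DataFrame({"unit": unit, "x": x, "y": y})

def slope(a, b):
    """OLS slope of b on a (with an intercept)."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (a * a).sum())

# Pooled OLS ignores alpha and is biased upward in this setup
pooled = slope(sim["x"], sim["y"])

# Within estimator: subtract unit means first, which removes alpha
demeaned = sim[["x", "y"]] - sim.groupby("unit")[["x", "y"]].transform("mean")
within = slope(demeaned["x"], demeaned["y"])

print(f"true: {beta}, pooled: {pooled:.3f}, within: {within:.3f}")
```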

The fixed-effects transformation requires the time average of the observations for each unit, $\bar{Y}_{i} = \frac{1}{T} \sum_{t=1}^{T} Y_{i,t}$, so the observations are aggregated by prefecture with the groupby method.
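
For reference, the transformation can be written out as follows. This is a sketch for a single regressor; the symbols $\beta$, $X_{i,t}$, $\alpha_{i}$ (the time-invariant unit effect) and $u_{i,t}$ are notation assumed here, not taken from the original post. Averaging the model over time and subtracting the average from the original equation removes $\alpha_{i}$.

```latex
\begin{align}
Y_{i,t} &= \beta X_{i,t} + \alpha_{i} + u_{i,t} \\
\bar{Y}_{i} &= \beta \bar{X}_{i} + \alpha_{i} + \bar{u}_{i} \\
Y_{i,t} - \bar{Y}_{i} &= \beta \left( X_{i,t} - \bar{X}_{i} \right) + \left( u_{i,t} - \bar{u}_{i} \right)
\end{align}
```

The last line contains no intercept and no $\alpha_{i}$, which is why the OLS on the transformed data is run without a constant term.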

#Compute the time average of each variable for each prefecture
df_1_mean = df_1.groupby('pref').mean()

#Combine data frames
df_1_merged = df_1.merge(df_1_mean,
                         on = 'pref',
                         suffixes = ('_origin','_mean'))

Now all the pieces needed for the fixed-effects transformation are in place.

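#Fixed-effects transformation: subtract each *_mean column from its *_origin column.
#shift(periods=-6, axis=1) moves the *_mean columns under the matching *_origin columns.
df_1_fet = df_1_merged - df_1_merged.shift(periods = -6,axis=1)

#Drop the columns that became NaN after the shift
df_1_fet = df_1_fet.dropna(axis=1)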
df_1_fet
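
As an alternative to merging and shifting columns by position, the same demeaning can be sketched with groupby(...).transform("mean"), which returns the group means aligned row by row with the original frame. The example uses a hypothetical toy data frame with made-up values; the column names mirror those in the post but the numbers do not come from yamaguchi.dta.

```python
import pandas as pd

# Toy panel with made-up values: 2 units, 2 years each
toy = pd.DataFrame({
    "pref": [1, 1, 2, 2],
    "year": [2000, 2005, 2000, 2005],
    "emprate": [0.50, 0.54, 0.60, 0.70],
    "caprate": [0.10, 0.20, 0.30, 0.50],
})

value_cols = ["emprate", "caprate"]

# transform("mean") keeps the original row order and index,
# so the subtraction lines up by column name rather than by position
group_means = toy.groupby("pref")[value_cols].transform("mean")
toy_fet = toy[value_cols] - group_means

print(toy_fet)
```

Because the columns are matched by name, this version does not depend on how many columns the data frame has.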

All that remains is to run OLS, which yields the second column of Table 6-5.

#OLS estimation
result = ols(formula = 'emprate_origin ~  -1 + caprate_origin ',
            data = df_1_fet).fit()
#View results
result.summary().tables[1]

          coef    stderr  t       P>|t|  [0.025  0.975]
caprate   0.5760  0.057   10.093  0.000  0.464

The estimate differs slightly from the textbook value because the 1990 and 1995 data are included and because of rounding, but it corresponds to the second column of Table 6-5.

The easy way with linearmodels

So far I have applied the fixed-effects transformation by hand to get the data into a form that OLS can handle. With the PanelOLS class of the linearmodels library, fixed-effects and time-effects estimation on panel data takes a single line, which is convenient.

First, convert the data frame to panel data: set a hierarchical index and pass the result to the PanelData class of linearmodels.

#Give a hierarchical index
df_1_panel = df_1.set_index(['pref','year'])

#Convert to panel data
df_1_panel = PanelData(df_1_panel)

Now you are ready to reproduce Table 6-5.

In the formula argument of PanelOLS.from_formula, add the following terms as needed.
・ Fixed effects → EntityEffects
・ Time effects → TimeEffects
・ Constant term → 1

#Table 6-5, 1st column
result_1 = PanelOLS.from_formula(formula = 'emprate ~ 1 + caprate',
                                 data = df_1_panel).fit(cov_type='clustered',cluster_entity=True,cluster_time=True)

#Table 6-5, 2nd column
result_2 = PanelOLS.from_formula(formula = 'emprate ~ caprate + EntityEffects', 
                                 data = df_1_panel).fit(cov_type='clustered',cluster_entity=True,cluster_time=True)

#Table 6-5, 3rd column
result_3 = PanelOLS.from_formula(formula = 'emprate ~ caprate + TimeEffects', 
                                 data = df_1_panel).fit(cov_type='clustered',cluster_entity=True,cluster_time=True)

#Table 6-5, 4th column
result_4 = PanelOLS.from_formula(formula = 'emprate ~ caprate + EntityEffects + TimeEffects', 
                                 data = df_1_panel).fit(cov_type='clustered',cluster_entity=True,cluster_time=True)

#Table 6-5, 5th column
result_5 = PanelOLS.from_formula(formula = 'emprate ~ caprate + age + agehus + empratehus + urate + TimeEffects',
                                 data = df_1_panel).fit(cov_type='clustered',cluster_entity=True,cluster_time=True)

#Table 6-5, 6th column
result_6 = PanelOLS.from_formula(formula = 'emprate ~ caprate + age + agehus + empratehus + urate + EntityEffects + TimeEffects',
                                 data = df_1_panel).fit(cov_type='clustered',cluster_entity=True,cluster_time=True)

Checking the results completes the reproduction of Table 6-5.

result_1.summary.tables[1]
result_2.summary.tables[1]
#The estimate matches the one from the manual fixed-effects transformation without PanelOLS.
result_3.summary.tables[1]
result_4.summary.tables[1]
result_5.summary.tables[1]
result_6.summary.tables[1]

Chapter End Problem 6-10 (2)

LSDV estimation assigns a dummy variable to each unit and each point in time; it is a different way of computing the same estimates as fixed-effects and time-effects estimation. To create the dummy variables, use C() in a statsmodels formula or get_dummies() from pandas.

#Estimate the 6th column of Table 6-5 with LSDV
result_lsdv = ols(formula = 'emprate ~ -1 + C(pref) + C(year) + caprate + age + agehus + empratehus + urate',
                  data = df_1).fit(cov_type='HC3',use_t=True)

#View results
pd.DataFrame(result_lsdv.summary().tables[1],
            columns = ['','coef','stderr','t','P>|t|','[0.025','0.975]']).iloc[52:,:]


            coef     stderr  t        P>|t|  [0.025  0.975]
caprate     -0.0783  0.132   -0.593   0.553  -0.338
age         -0.1735  0.013   -13.443  0.000  -0.199
agehus      0.2213   0.009   25.187   0.000  0.204
empratehus  1.3598   0.194   7.001    0.000  0.978
urate       -0.8024  0.584   -1.375   0.170  -1.948

Compared with the 6th column of Table 6-5 estimated with PanelOLS, the coefficient estimates are exactly the same. This confirms that LSDV estimation is equivalent to fixed-effects and time-effects estimation.
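
The equivalence can also be checked numerically on synthetic data. The sketch below is an illustration under stated assumptions: a balanced toy panel with made-up values, dummies built with pandas get_dummies, and plain least squares from NumPy instead of statsmodels. The slope from the dummy-variable regression should agree with the slope after two-way demeaning up to floating-point error.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_units, n_periods = 6, 4

# Balanced toy panel (made-up data)
toy = pd.DataFrame({
    "unit": np.repeat(np.arange(n_units), n_periods),
    "time": np.tile(np.arange(n_periods), n_units),
})
toy["x"] = rng.normal(size=len(toy))
toy["y"] = 0.3 * toy["x"] + toy["unit"] + 0.5 * toy["time"] + rng.normal(size=len(toy))

# LSDV: all unit dummies, time dummies with one category dropped
unit_d = pd.get_dummies(toy["unit"], prefix="u", dtype=float)
time_d = pd.get_dummies(toy["time"], prefix="t", drop_first=True, dtype=float)
X = np.column_stack([toy["x"].to_numpy(), unit_d.to_numpy(), time_d.to_numpy()])
coef, *_ = np.linalg.lstsq(X, toy["y"].to_numpy(), rcond=None)
beta_lsdv = float(coef[0])

# Two-way within transformation (this simple formula assumes a balanced panel)
def two_way_demean(s):
    unit_mean = s.groupby(toy["unit"]).transform("mean")
    time_mean = s.groupby(toy["time"]).transform("mean")
    return s - unit_mean - time_mean + s.mean()

x_dm = two_way_demean(toy["x"])
y_dm = two_way_demean(toy["y"])
beta_within = float((x_dm * y_dm).sum() / (x_dm ** 2).sum())

print(beta_lsdv, beta_within)
```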

Note that the standard errors here are not robust to the cluster structure; they are robust only to heteroskedasticity. As a result, the absolute t-values appear to be overstated. In this case no coefficient changes its significance, but in general a coefficient that is not significant under cluster-robust standard errors may appear significant under standard errors that are robust only to heteroskedasticity.
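
The difference between the two kinds of standard errors can be illustrated with a hand-rolled sandwich estimator. This is a rough sketch under several assumptions: synthetic data with a regressor and errors that are both correlated within clusters, the HC0 variant rather than the HC3 used above, and no finite-sample corrections, so the numbers will not match what statsmodels or linearmodels report.

```python
import numpy as np

rng = np.random.default_rng(2)
n_clusters, n_per = 50, 20
cluster = np.repeat(np.arange(n_clusters), n_per)

# Regressor and error share a cluster-level component (made-up design)
x = rng.normal(size=n_clusters)[cluster]
u = rng.normal(size=n_clusters)[cluster] + 0.5 * rng.normal(size=cluster.size)
y = 1.0 + u                                   # true slope on x is zero

X = np.column_stack([np.ones_like(x), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Heteroskedasticity-robust (HC0): sum of outer products of per-row scores
scores = X * resid[:, None]
se_hc = np.sqrt(np.diag(XtX_inv @ (scores.T @ scores) @ XtX_inv))

# Cluster-robust: sum the scores within each cluster before the outer product
cluster_scores = np.zeros((n_clusters, X.shape[1]))
np.add.at(cluster_scores, cluster, scores)
se_cl = np.sqrt(np.diag(XtX_inv @ (cluster_scores.T @ cluster_scores) @ XtX_inv))

print("HC0 se of slope:    ", se_hc[1])
print("cluster se of slope:", se_cl[1])
```

In this design the heteroskedasticity-robust standard error of the slope is smaller than the cluster-robust one, so the corresponding t-value is larger in absolute value.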

Chapter End Problem 6-10 (3)

The second column of Table 6-5 reports the fixed-effects estimates. Here the coefficient of determination from that estimation is compared with the one from LSDV estimation.

#LSDV estimation
result_2lsdv = ols(formula = ' emprate ~ -1 + C(pref) + caprate',
                   data = df_1).fit(cov_type='HC3',use_t=True)

#View results
result_2.summary.tables[0]

#View results
result_2lsdv.summary().tables[0]

Results of fixed-effects estimation
R-squared  R-squared (Between)  R-squared (Within)  R-squared (Overall)
0.1264     0.6823               0.1264              0.6674

Three variants of the coefficient of determination are displayed alongside the headline R-squared. As the textbook also notes, it is important to check which definition a reported coefficient of determination is based on. With X the vector of explanatory variables and y the dependent variable, the definitions are as follows.
・ Between → coefficient of determination when the unit average of y is regressed on the unit average of X
・ Within → coefficient of determination in the model after the fixed-effects transformation
・ Overall → coefficient of determination when y is regressed on X

Results of LSDV estimation
R-squared
0.650

The coefficient of determination (Within) from the fixed-effects transformation is 0.1264, whereas the one from LSDV estimation is 0.650. The two are very different. The LSDV figure also counts the variation in emprate explained by the prefecture dummies, while the Within figure measures only the variation around each prefecture's mean.
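
A small synthetic example shows where the gap comes from. The sketch assumes made-up data with large unit effects and uses the fact that the LSDV regression and the regression on unit-demeaned data have the same residuals; only the total sum of squares in the denominator differs. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n_units, n_periods = 47, 5
unit = np.repeat(np.arange(n_units), n_periods)

alpha = 3.0 * rng.normal(size=n_units)[unit]  # large unit effects (assumed)
x = rng.normal(size=unit.size)
y = 0.2 * x + alpha + rng.normal(size=unit.size)
toy = pd.DataFrame({"unit": unit, "x": x, "y": y})

# Within regression on unit-demeaned data
x_dm = toy["x"] - toy.groupby("unit")["x"].transform("mean")
y_dm = toy["y"] - toy.groupby("unit")["y"].transform("mean")
beta = float((x_dm * y_dm).sum() / (x_dm ** 2).sum())
ssr = float(((y_dm - beta * x_dm) ** 2).sum())   # residual sum of squares

# Within R-squared: relative to variation around each unit's mean
r2_within = 1.0 - ssr / float((y_dm ** 2).sum())

# LSDV R-squared: same residuals, relative to variation around the grand mean
tss_total = float(((toy["y"] - toy["y"].mean()) ** 2).sum())
r2_lsdv = 1.0 - ssr / tss_total

print(f"within R2: {r2_within:.3f}, LSDV R2: {r2_lsdv:.3f}")
```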

Chapter End Problem 6-10 (4)


#LSDV estimation
result_4lsdv = ols(formula = 'emprate ~ -1 + C(year) + caprate',
                   data = df_1).fit(cov_type='HC3',use_t=True)

#View results
result_4.summary.tables[0]

#View results
result_4lsdv.summary().tables[0]

Results of time-effects estimation
R-squared  R-squared (Between)  R-squared (Within)  R-squared (Overall)
4.874e-05  0.0423               0.0118              0.0415

Results of LSDV estimation
R-squared
0.317

The coefficient of determination from the time-effects transformation is the (Within) value in the table above, 0.0118, whereas the one from LSDV estimation is 0.317. Again, the two are very different.

References

In preparing this answer, I referred to the following materials.

Yoshihiko Nishiyama, Mototsugu Shintani, Daiji Kawaguchi, Ryo Okui, "Econometrics", Yuhikaku Publishing, 2019

MacKinnon and White (1985), "Some Heteroskedasticity Consistent Covariance Matrix Estimators with Improved Finite Sample Properties", Journal of Econometrics, vol. 29, issue 3, 305-325

statsmodels

linearmodels

QuantEcon
