[Chapter 8] Econometrics (Yuhikaku) Chapter End Problem, Answer by python

Here are my Python answers to the end-of-chapter problems of Chapter 8 of [Econometrics](https://www.amazon.co.jp/Econometrics-New-Liberal-Arts-Selection/dp/4641053855/ref=asc_df_4641053855_nodl/?tag=jpgo-22&linkCode=df0&hvadid=342654745883&hvpos=&hvnetw=g&hvrand=3840609228244811524&hvpone=&hvptwo=&hvqmt=&hvdev=m&hvdvcmdl=&hvlocint=&hvlocphy=1009363&hvtargid=pla-796815219068&psc=1&th=1&psc=1) (Yuhikaku).

I wrote these answers mainly by consulting the official documentation and GitHub repositories of statsmodels and linearmodels. I am by no means a Python expert, so if you find any mistakes, please point them out.

The end-of-chapter problems of Chapter 8 ask you to reproduce Tables 8-1 and 8-5.

Empirical analysis example: reproducing Table 8-1

Reproducing Table 8-1 boils down to the question of what to do when the dependent variable is binary. We compare three approaches: the linear probability model, the probit model, and the logit model.

Linear probability model


import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.formula.api import logit, probit

#Read and confirm data
df = pd.read_stata('piaac.dta')
df.head()

In the analysis of Table 8-1, the sample is restricted to women and [lfs] is regressed on [educ], [age], [couple], and [child]. Keep only the columns and records we need.


df_1 = df[df['gender']== 'Female']
df_1 = df_1[['lfs','educ','age','couple','child']]

The problem here is that the dependent variable [lfs] is a qualitative variable. First, let's check what values it takes.


# Check the values ['lfs'] takes
df_1['lfs'].unique()
#[Employed, Out of the labour force, Unemployed, Not known]
Categories (4, object): [Employed < Unemployed < Out of the labour force < Not known]

# Also check ['age']
df_1['age'].unique()
#[45, 29, 35, 48, 60, ..., 41, 37, 19, 58, 28]
Length: 50
Categories (50, int64): [16 < 17 < 18 < 19 ... 62 < 63 < 64 < 65]

lfs takes four values: Employed, Unemployed, Out of the labour force, and Not known. How to interpret these is inherently a delicate question, but Table 8-1 appears to count Not known as not working rather than dropping it. What to do when the dependent variable is not observed (that is, when sample selection becomes a problem) is deferred to the Heckit model below. Also, "Out of the labour force" literally means not in the labour force, but since the data cover only people aged 16 and over, these individuals could in principle work; they choose not to for some reason, such as low offered wages. Since we are considering the case where the dependent variable is binary, we create a dummy variable with Employed = 1 and everything else = 0 and use it in the analysis.


# Create the dummy variable column ['emp_dummy']
df_1.loc[df_1['lfs']=='Employed','emp_dummy']=1
df_1.loc[df_1['lfs']!='Employed','emp_dummy']=0
# Check the data type just in case
df_1.dtypes
#lfs          category
educ         category
age          category
couple        float32
child         float32
emp_dummy     float64
dtype: object

For some reason, educ and age are categorical... Convert them to numeric variables.

#Change data type
df_1[['educ','age']] = df_1[['educ','age']].apply(pd.to_numeric)

First, ignore the missing values and perform the analysis.


#OLS estimation
result_1_1 = ols(formula = 'emp_dummy ~ educ + age + couple + child',
                data = df_1).fit(cov_type='HC1',use_t=True)

To use degrees-of-freedom-corrected White standard errors, specify HC1 for cov_type; to use the ordinary White standard errors, specify HC0.
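For reference, a minimal sketch of the HC0 variant (the name result_1_hc0 is my own, just for illustration):

#Same regression with ordinary (non-degrees-of-freedom-corrected) White standard errors
result_1_hc0 = ols(formula = 'emp_dummy ~ educ + age + couple + child',
                   data = df_1).fit(cov_type='HC0',use_t=True)
result_1_hc0.summary()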


#Check the result
result_1_1.summary()
coef std err t P>|t| [0.025 0.975]
Intercept 0.3759 0.105 3.566 0.000 0.169 0.583
educ 0.0195 0.006 3.312 0.001 0.008 0.031
age 0.0020 0.001 1.726 0.085 -0.000 0.004
couple -0.1178 0.031 -3.743 0.000 -0.180 -0.056
child 0.0137 0.013 1.067 0.286 -0.012 0.039

The results are exactly the same, so Table 8-1 in the textbook apparently also runs the analysis with the missing values simply dropped.
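As a quick check of my own (not in the textbook), the number of observations actually used by the formula interface, which drops rows containing NaN by default, can be compared with the number of complete cases:

#Observations used in the regression vs. complete cases in the data
print(result_1_1.nobs)
print(df_1[['emp_dummy','educ','age','couple','child']].dropna().shape[0])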

Bonus: Consider missing values

As a bonus, let's also run the analysis handling the missing values. How to treat them is a delicate issue in itself, but for now let's simply impute them with the mean.


#Fill in the missing values with the average value
df_1['educ'] = df_1['educ'].fillna(df_1['educ'].mean())
df_1['age'] = df_1['age'].fillna(df_1['age'].mean())
df_1['couple'] = df_1['couple'].fillna(df_1['couple'].mean())
df_1['child'] = df_1['child'].fillna(df_1['child'].mean())

#OLS estimation
result_1_nonan = ols(formula = 'emp_dummy ~ educ + age + couple + child',
                     data = df_1).fit(cov_type='HC1',use_t=True)
#Check the result
result_1_nonan.summary()
coef std err t P>|t| [0.025 0.975]
Intercept 0.0189 0.062 0.302 0.763 -0.104 0.141
educ 0.0450 0.004 10.951 0.000 0.037 0.053
age 0.0032 0.001 3.895 0.000 0.002 0.005
couple -0.1035 0.022 -4.735 0.000 -0.146 -0.061
child -0.0006 0.009 -0.059 0.953 -0.019 0.018

The values changed slightly. It seems that in the textbook analysis all records containing missing values are simply dropped.

Probit model

From here on, following the textbook, all records containing missing values are dropped.


#Data shaping
df = pd.read_stata('piaac.dta')
df_2 = df[df['gender']== 'Female']
df_2 = df_2[['lfs','educ','age','couple','child']]
df_2.loc[df_2['lfs']=='Employed','emp_dummy']=1
df_2.loc[df_2['lfs']!='Employed','emp_dummy']=0
df_2[['educ','age']] = df_2[['educ','age']].apply(pd.to_numeric)

With statsmodels, estimating a probit model is as simple as using "probit" in place of "ols".

# Probit model
result_1_2 = probit(formula = 'emp_dummy ~ educ + age + couple + child',
                    data = df_2).fit()
#Check the result
result_1_2.summary().tables[1]
coef std err z P>|z| [0.025 0.975]
Intercept -0.3443 0.287 -1.198 0.231 -0.908 0.219
educ 0.0537 0.016 3.340 0.001 0.022 0.085
age 0.0053 0.003 1.789 0.074 -0.001 0.011
couple -0.3391 0.097 -3.489 0.000 -0.530 -0.149
child 0.0397 0.032 1.256 0.209 -0.022 0.102

This time, the result is almost the same as Table 8-1.

Next come the marginal effects. With statsmodels, computing the marginal effects of the probit model takes nothing more than the "get_margeff" method.
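As far as I can tell from the documentation, get_margeff reports the average marginal effect by default (the at='overall' option): for the probit model with $P(Y_{i}=1\mid X_{i})=\Phi(X_{i}'\beta)$, the marginal effect is evaluated at each observation and then averaged,

\frac{\partial P(Y_{i}=1\mid X_{i})}{\partial X_{ik}}=\phi(X_{i}'\beta)\beta_{k},\qquad \widehat{AME}_{k}=\frac{1}{n}\sum_{i=1}^{n}\phi(X_{i}'\hat{\beta})\hat{\beta}_{k}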


#Marginal effect
result_1_2.get_margeff().summary().tables[1]
dy/dx std err z P>|z| [0.025 0.975]
educ 0.0197 0.006 3.372 0.001 0.008 0.031
age 0.0019 0.001 1.794 0.073 -0.000 0.004
couple -0.1243 0.035 -3.524 0.000 -0.193 -0.055
child 0.0145 0.012 1.258 0.209 -0.008 0.037

This is the same result as the textbook.

Logit model

Just as with probit, simply use "logit". The marginal effects can be computed in the same way with the "get_margeff" method.


#Logit model
result_1_4 = logit(formula = 'emp_dummy ~ educ + age + couple + child',
                    data = df_2).fit()

#Check the result
result_1_4.summary().tables[1]

#Marginal effect
result_1_4.get_margeff().summary().tables[1]
#Coefficients
coef std err z P>|z| [0.025 0.975]
Intercept -0.5781 0.471 -1.227 0.220 -1.502 0.345
educ 0.0869 0.026 3.307 0.001 0.035 0.138
age 0.0088 0.005 1.806 0.071 -0.001 0.018
couple -0.5587 0.163 -3.424 0.001 -0.879 -0.239
child 0.0716 0.058 1.242 0.214 -0.041 0.185

#Marginal effects
dy/dx std err z P>|z| [0.025 0.975]
educ 0.0195 0.006 3.345 0.001 0.008 0.031
age 0.0020 0.001 1.812 0.070 -0.000 0.004
couple -0.1255 0.036 -3.463 0.001 -0.197 -0.054
child 0.0161 0.013 1.244 0.213 -0.009 0.041

Bonus: Create a Probit model using GenericLikelihoodModel.

statsmodels has a handy class called "GenericLikelihoodModel". The official documentation says: "The GenericLikelihoodModel class eases the process by providing tools such as automatic numeric differentiation and a unified interface to scipy optimization functions. Using statsmodels, users can fit new MLE models simply by 'plugging-in' a log-likelihood function."

It seems convenient when you want to run maximum likelihood estimation with a hand-written likelihood function.

The example below is essentially the one in the official documentation with the dataset swapped, so please refer to the official documentation as well.


#import
from scipy import stats
from statsmodels.base.model import GenericLikelihoodModel

#Drop records that contain missing values
df_4 = df_2.dropna()

#Create a matrix of explanatory variables.
exog = df_4.loc[:,['educ','age','couple','child']]
exog = sm.add_constant(exog,prepend=True)
#Create a vector of explained variables.
endog = df_4.loc[:,'emp_dummy']

Assume the latent variable is determined by $Y_{i}^{*}=X_{i}'\beta+e_{i}$. The log-likelihood function of a general binary choice model is

\log L(\beta)=\sum_{i=1}^{n}\left[Y_{i}\log F(X_{i}'\beta)+(1-Y_{i})\log\left(1-F(X_{i}'\beta)\right)\right]

so all we have to do is plug the standard normal CDF in for F(·) and hand it to GenericLikelihoodModel.
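Since $1-\Phi(z)=\Phi(-z)$ for the standard normal CDF, both terms can be written compactly with $q_{i}=2Y_{i}-1$:

\log L(\beta)=\sum_{i=1}^{n}\log\Phi\left(q_{i}X_{i}'\beta\right)

which is exactly what the q = 2*endog - 1 line in the code below implements.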

class MyProbit(GenericLikelihoodModel):
    def loglike(self, params):
        exog = self.exog
        endog = self.endog
        q = 2 * endog - 1
        # Log-likelihood of the probit model
        return stats.norm.logcdf(q * np.dot(exog, params)).sum()

It seems that you can estimate with just this.


#Fits to probit models
result_mine = MyProbit(endog, exog).fit(maxiter=1000)

#Check the result
result_mine.summary().tables[1]
coef std err z P>|z| [0.025 0.975]
const -0.3444 0.287 -1.198 0.231 -0.908 0.219
educ 0.0537 0.016 3.340 0.001 0.022 0.085
age 0.0053 0.003 1.790 0.073 -0.001 0.011
couple -0.3391 0.097 -3.489 0.000 -0.530 -0.149
child 0.0397 0.032 1.256 0.209 -0.022 0.102

The results are the same as with the probit function of statsmodels. There is no officially merged module for Tobit models either, but with GenericLikelihoodModel it seems you can build one just by writing down the likelihood function.
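For example, here is a minimal sketch of a Tobit model left-censored at zero along the same lines. The class name MyTobit and the starting values are my own choices, and a more careful implementation would constrain sigma to be positive (e.g. by estimating log sigma):

class MyTobit(GenericLikelihoodModel):
    def loglike(self, params):
        # Split the parameter vector into regression coefficients and the scale
        beta, sigma = params[:-1], params[-1]
        xb = np.dot(self.exog, beta)
        y = self.endog
        # Censored observations (y <= 0) contribute log P(Y* <= 0),
        # uncensored ones the normal density of the residual
        ll = np.where(y > 0,
                      stats.norm.logpdf((y - xb) / sigma) - np.log(sigma),
                      stats.norm.logcdf(-xb / sigma))
        return ll.sum()

    def fit(self, start_params=None, maxiter=1000, **kwargs):
        # Register the extra scale parameter so it appears in the summary
        self.exog_names.append('sigma')
        if start_params is None:
            start_params = np.append(np.zeros(self.exog.shape[1]), 1.0)
        return super().fit(start_params=start_params, maxiter=maxiter, **kwargs)

Fitting works the same way as MyProbit above, given a left-censored dependent variable.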

Empirical analysis example: reproducing Table 8-5

Now for the Heckit model. As far as I can tell from the documentation, neither statsmodels nor linearmodels has a module that performs two-step or maximum likelihood estimation out of the box (I would be grateful if someone could correct me on this). However, although not officially merged, there is a module called [Heckman] in a statsmodels branch.

Use this.


import Heckman 

#Data reading and formatting

df = pd.read_stata('piaac.dta')
df_3 = df[df['gender']== 'Female']
df_3 = df_3[['educ','couple','child','age']]
df_3 = df_3.dropna()
df_3['wage'] = df['wage']
df_3[['educ','age']] = df_3[['educ','age']].apply(pd.to_numeric)
df_3['experience'] = df_3['age']-df_3['educ']-6
df_3['experience_2'] = (df_3['experience']**2)/100
df_3['logwage'] = np.log(df_3['wage'])

#Data confirmation
df_3

Some people's wages are not observed for some reason. Quite possibly they are not working in the first place because the offered wage is below their reservation wage. If the observed wages are simply regressed on the explanatory variables, sample selection is likely to be a problem.

OLS estimation

Anyway, in order to reproduce the table, let's first run a plain OLS regression.


#OLS estimation
result_5_1 = ols(formula = 'logwage ~ educ + experience + experience_2',
                 data = df_3).fit(cov_type='HC1',use_t=True)

#Check the result
result_5_1.summary()
coef std err t P>|t| [0.025 0.975]
Intercept 5.8310 0.236 24.726 0.000 5.368 6.294
educ 0.0862 0.014 6.074 0.000 0.058 0.114
experience 0.0069 0.009 0.762 0.446 -0.011 0.025
experience_2 -0.0126 0.015 -0.820 0.413 -0.043 0.018
No. Observations: 984

Of the 1,747 observations in total, only 984 are used.

Two-step Heckit
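As a reminder of the standard setup (my own loose summary of the textbook treatment): wages are observed only when a selection equation is satisfied, and under joint normality the conditional mean of observed log wages picks up an inverse Mills ratio term,

D_{i}=1\{Z_{i}'\gamma+u_{i}>0\},\qquad \log W_{i}=X_{i}'\beta+e_{i}

E[\log W_{i}\mid X_{i},D_{i}=1]=X_{i}'\beta+\sigma_{eu}\lambda(Z_{i}'\gamma),\qquad \lambda(z)=\frac{\phi(z)}{\Phi(z)}

The two-step estimator first estimates $\gamma$ by probit, then runs OLS of $\log W_{i}$ on $X_{i}$ and the estimated $\hat{\lambda}_{i}$.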


#Create a vector of explained variables.
endog = df_3['logwage']

#Create a matrix of explanatory variables for the selection expression.
exog_select = df_3[['educ','experience','experience_2','couple','child']]
exog_select = sm.add_constant(exog_select,prepend=True)

#Create a matrix of explanatory variables for the second-stage equation.
exog = df_3[['educ','experience','experience_2']]
exog = sm.add_constant(exog,prepend=True)

#Heckman's two-step estimation
result_5_2 = Heckman.Heckman(endog,exog,exog_select).fit(method='twostep')

#Check the result
result_5_2.summary()

Maximum likelihood Heckit


#Maximum likelihood estimation
result_5_4 = Heckman.Heckman(endog,exog,exog_select).fit(method='mle',
                                                         method_mle='nm',
                                                         maxiter_mle=2000)

#Check the result
result_5_4.summary()

References

In preparing this answer, I referred to the following materials.

Yoshihiko Nishiyama, Mototsugu Shintani, Daiji Kawaguchi, Ryo Okui, "Econometrics" (New Liberal Arts Selection), Yuhikaku, 2019

Bruce E. Hansen (2020), "Econometrics"

MacKinnon and White (1985), "Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties", Journal of Econometrics, vol. 29, issue 3, 305-325

statsmodels

linearmodels

QuantEcon
