"Introduction to effect verification Chapter 3 Analysis using propensity score" + α is tried in Python

This article is based on "Introduction to Effectiveness Verification-Causal Reasoning for Correct Comparison / Basics of Econometrics". This is a summary of what I learned and felt by reading "Chapter 3 Analysis Using Propensity Scores".

Basics

What is a propensity score?

-** Observed ** Probability that the covariate $ X $ will be assigned to the intervention group under the given conditions $ e (X) = P (Z = 1 | X) $ [^ px] --Not all covariates are observed

[^ px]: In this book, the propensity score is written as $ P (X) $, but for convenience, it is set as $ e (X) $.

Regression analysis v.s. Propensity score [^ ra-vs-ps]

[^ ra-vs-ps]: It's a strange comparison, but it can be read like this from this book.

--Regression analysis assumptions: - CIA(Conditional Independence Assumption): \\{Y^{(1)}, Y^{(0)}\\}\perp Z|X --Outcome and covariate linearity (omitted in this book?) --Propensity score assumption --CIA version: $ \ {Y ^ {(1)}, Y ^ {(0)} \} \ perp Z | e (X) $ --I think it's the same thing to explain the allocation with the covariate $ X $ and the propensity score $ e (X) $ to balance the covariates [^ balancing-score] -** SUTVA ** (Stable Unit Treatment Value Assumption) [^ sutva-krsk-phs] [^ sutva-pira-nino] [^ sutva-in-ra](Omitted in this book?)

CIA v.s. Strongly negligible allocation

--CIA is used in this book, but strictly speaking, ** strongly negligible assignment ** is correct [^ rosenbaum_1983] --A strongly negligible allocation adds the following two assumptions to the CIA: --The covariate $ X $ is only observed - 0 ――In short, I think that it is ** quite important ** because it is NG if there is no possibility that any sample will be assigned to both the intervention group and the non-intervention group. --For example, if all men are assigned to the intervention group, $ P (Z = 1 | men) = 1 $, so NG

Propensity score estimation

--Logistic regression is often used to estimate propensity scores, but machine learning is also OK. --The model includes not only covariates but also other variables that affect outcomes [^ tsugawa-ps2] --Probability calibration is required when using machine learning [^ rmizutaa] --Random forest and SVM? --On the other hand, if the loss function is logloss, it converges to the probability, so calibration is unnecessary [^ calibration-unnecessary]

Estimating intervention effect using propensity score

ATT v.s. ATE

Matching v.s. IPW [^ matching-vs-ipw]

[^ matching-vs-ipw]: There are other methods for estimating the effect of intervention using propensity scores other than matching and IPW, but this book deals with these two.

-** Matching **: --A pair of samples with similar propensity score values are taken from the intervention group and the non-intervention group, and one missing measure (counterfact) is complemented by the other.

image.png

ATT in matching

--Matching samples with similar propensity score values to the samples in the intervention group from the non-intervention group, and letting the average pair difference be ATT.

image.png

ATE in matching

--Similarly for samples in the non-intervention group, the average of the pair difference is ATE.

image.png

ATT in IPW

--Calculated by the following formula

\frac{\sum_{i=1}^{N}Z_{i}Y_{i}}{\sum_{i=1}^{N}Z_{i}}
-\frac{\sum_{i=1}^{N}\frac{(1-Z_{i})\hat{e}(X_{i})}{1-\hat{e}(X_{i})}Y_{i}}{\sum_{i=1}^{N}\frac{(1-Z_{i})\hat{e}(X_{i})}{1-\hat{e}(X_{i})}}

ATE in IPW

--Calculated by the following formula

\frac{\sum_{i=1}^{N}\frac{Z_{i}}{\hat{e}(X_{i})}Y_{i}}{\sum_{i=1}^{N}\frac{Z_{i}}{\hat{e}(X_{i})}}
-\frac{\sum_{i=1}^{N}\frac{1-Z_{i}}{1-\hat{e}(X_{i})}Y_{i}}{\sum_{i=1}^{N}\frac{1-Z_{i}}{1-\hat{e}(X_{i})}}

About common support

--Common support is the range in which the distribution of propensity scores of the intervention group and the non-intervention group overlap. --If you use a sample that does not support common support in the calculation of IPW, the reciprocal (weight) of the propensity score becomes too large to estimate well [^ tsugawa-ps1] -Is "out of common support" and "weight too large" equivalent? --Is it common to exclude samples that are not in common support? --This book introduces property score trim mig / clipping ――Isn't it possible to create a new bias by excluding non-common support? ――So it is important to design the experimental plan so that you can observe two comparable groups properly so that you do not have to exclude it ** ――The bias generated on that should be adjusted by the tendency score etc. ――Since there is no sample that can complement the counterfacts outside of common support, isn't it not satisfying the “strongly negligible allocation” in the first place? - P(Z=1|X)=0OrP(Z=1|X)=1Not because I'm using logistic regression

Evaluation of propensity score

――It is important whether the distribution of covariates in the intervention group and the non-intervention group is balanced between the two groups by the propensity score. -That is, make assumptions\\{Y^{(1)}, Y^{(0)}\\}\perp Z|e(X)From\\{Y^{(1)}, Y^{(0)}\\}\perp Z|XI think it means to return to --In this book, we evaluate by standardized mean difference (ASAM). --OK if ASAM is 0.1 or less (why?) ――If the old AUC is 0.8 or more, is the OK theory nonsense? [^ ps-auc] --AUC is OK if it is 0.8 or more [^ iwanami-ds3] --What does it mean that the AUC is small (close to 0.5): --Allocation does not depend on covariates ($ Z \ perp X $) → No need to adjust with propensity score → Does it make sense if AUC is not large to some extent? --What does it mean to have a large AUC (close to 1): ――The common support is small → The intervention effect cannot be estimated → Isn't it good if the AUC is large?

Practical edition

MineThatData E-Mail Analytics And Data Mining Challenge dataset[1]

--Will be added at a later date

LaLonde(1986)

--Refer to this manual for the explanation of the data. --Refer to here for the explanation of the paper.

Data read

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
df_cps1 = pd.read_stata('https://users.nber.org/~rdehejia/data/cps_controls.dta')
df_cps3 = pd.read_stata('https://users.nber.org/~rdehejia/data/cps_controls3.dta')
df_nsw = pd.read_stata('https://users.nber.org/~rdehejia/data/nsw_dw.dta')

Data set creation

--NSW: RCT dataset --CPS1 + NSW: Data set with NSW non-intervention group replaced by CPS1 --CPS3 + NSW: Data set with NSW non-intervention group replaced by CPS3

treat_label = {0: 'untreat', 1: 'treat'}
X_columns = ['age', 'education', 'black', 'hispanic', 'married', 'nodegree', 're74', 're75']
y_column = 're78'
z_column = 'treat'

CPS1+NSW

df_cps1_nsw = pd.concat([df_nsw[df_nsw[z_column] == 1], df_cps1], ignore_index=True)

CPS3+NSW

df_cps3_nsw = pd.concat([df_nsw[df_nsw[z_column] == 1], df_cps3], ignore_index=True)

Confirm distribution of intervention group and non-intervention group

NSW

--The data is well-balanced because it only sings RCT.

df_nsw[z_column].map(treat_label).value_counts().plot.bar(title='# of treatment in NSW')

image.png

_, ax = plt.subplots(3, 3, figsize=(16, 12))
for i, c in enumerate(X_columns + [y_column]):
    (df_nsw[c].groupby(df_nsw[z_column].map(treat_label))
     .plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))

image.png

CPS1+NSW

――It's quite unbalanced data

df_cps1_nsw[z_column].map(treat_label).value_counts().plot.bar(title='# of treatment in CPS1+NSW')

image.png

_, ax = plt.subplots(3, 3, figsize=(16, 12))
for i, c in enumerate(X_columns + [y_column]):
    (df_cps1_nsw[c].groupby(df_cps1_nsw[z_column].map(treat_label))
     .plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))

image.png

CPS3+NSW

--Not as much as NSW, but more balanced data than CPS1 + NSW

df_cps3_nsw[z_column].map(treat_label).value_counts().plot.bar(title='# of treatment in CPS3+NSW')

image.png

_, ax = plt.subplots(3, 3, figsize=(16, 12))
for i, c in enumerate(X_columns + [y_column]):
    (df_cps3_nsw[c].groupby(df_cps3_nsw[z_column].map(treat_label))
     .plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))

image.png

regression analysis

import statsmodels.api as sm

NSW

results = sm.OLS(df_nsw[y_column], sm.add_constant(df_nsw[[z_column] + X_columns])).fit()
results.summary().tables[1]

image.png

CPS1+NSW

results = sm.OLS(df_cps1_nsw[y_column], sm.add_constant(df_cps1_nsw[[z_column] + X_columns])).fit()
results.summary().tables[1]

image.png

CPS3+NSW

results = sm.OLS(df_cps3_nsw[y_column], sm.add_constant(df_cps3_nsw[[z_column] + X_columns])).fit()
results.summary().tables[1]

image.png

result

data set Intervention effect[$]
NSW 1676.3
CPS1+NSW 699.1
CPS3+NSW 1548.2

Game Start

--NSW intervention effect is about $ 1600 --Challenge how close you can get to $ 1600 using propensity score from biased state

Propensity score estimation

--Where did ʻI (re74 ^ 2) and ʻI (re75 ^ 2) in this book come from?

CPS1+NSW

--Estimate propensity score by logistic regression

ps_model = sm.Logit(df_cps1_nsw[z_column], sm.add_constant(df_cps1_nsw[X_columns])).fit()
ps_model.summary().tables[1]

image.png

df_cps1_nsw['ps'] = 1 / (1 + np.exp(-ps_model.fittedvalues))

--Check the distribution of propensity scores

(df_cps1_nsw.groupby(df_cps1_nsw[z_column].map(treat_label))['ps']
 .plot.hist(density=True, alpha=0.5, title=f'propensity score densities', legend=True))

image.png

--Calculate AUC

from sklearn.metrics import roc_auc_score
roc_auc_score(df_cps1_nsw[z_column], df_cps1_nsw['ps'])
0.9708918648513447

CPS3+NSW

--Estimate propensity score by logistic regression

ps_model = sm.Logit(df_cps3_nsw[z_column], sm.add_constant(df_cps3_nsw[X_columns])).fit()
ps_model.summary().tables[1]

image.png

df_cps3_nsw['ps'] = 1 / (1 + np.exp(-ps_model.fittedvalues))

--Check the distribution of propensity scores

(df_cps3_nsw.groupby(df_cps3_nsw[z_column].map(treat_label))['ps']
 .plot.hist(density=True, alpha=0.5, title=f'propensity score densities', legend=True))

image.png

--Calculate AUC

from sklearn.metrics import roc_auc_score
roc_auc_score(df_cps3_nsw[z_column], df_cps3_nsw['ps'])
0.8742266742266742

ATT estimation

CPS1+NSW

matching

--Define a function (recursive generator) to make a pair -Here was used as a reference.

from sklearn.neighbors import NearestNeighbors
def pairs_generator(s0: pd.Series, s1: pd.Series, threshold: np.float) -> pd.DataFrame:
    """
A recursive generator that creates pairs using the K-nearest neighbor method.
    
    Parameters
    ----------
    s0, s1 : pd.Series
    threshold : np.float
The threshold for the distance to pair. If even the nearest neighbor points exceed the threshold value, they will not be paired.
    
    Returns
    -------
    pairs : pd.DataFrame
Index of a pair made up of the s0 sample and the closest s1 sample.
Column l is the index of s0 and column r is the index of s1.
    
    Notes
    -----
The same sample is never used for multiple pairs.
    
    Examples
    --------
    s0 = df[df[z_column] == 0]['ps']
    s1 = df[df[z_column] == 1]['ps']
    threshold = df['ps'].std() * 0.1
    pairs = pd.concat(pairs_generator(s1, s0, threshold), ignore_index=True)
    """
    neigh_dist, neigh_ind = NearestNeighbors(1).fit(s0.values.reshape(-1, 1)).kneighbors(s1.values.reshape(-1, 1))
    pairs = pd.DataFrame(
        {'d': neigh_dist[:, 0], 'l': s0.iloc[neigh_ind[:, 0]].index},
        index=s1.index,
    ).query(f'd < {threshold}').groupby('l')['d'].idxmin().rename('r').reset_index()
    if len(pairs) > 0:
        yield pairs
        yield from pairs_generator(s0.drop(pairs['l']), s1.drop(pairs['r']), threshold)

--Make a pair

s1 = df_cps1_nsw[df_cps1_nsw[z_column] == 1]['ps']
s0 = df_cps1_nsw[df_cps1_nsw[z_column] == 0]['ps']
threshold = df_cps1_nsw['ps'].std() * 0.1
pairs1_cps1_nsw = pd.concat(pairs_generator(s1, s0, threshold), ignore_index=True)
pairs0_cps1_nsw = pd.concat(pairs_generator(s0, s1, threshold), ignore_index=True) #Used in ATE calculation

--Confirm a pair of intervention groups --Orange is a discarded sample ――Originally, it is desirable to make a pair so that the intersection as shown in the figure does not occur (I think)

ax = df_cps1_nsw.loc[pairs1_cps1_nsw.unstack()].plot.scatter(x='ps', y=z_column, label='matched', alpha=0.3, figsize=(16, 4), title='pairs on treated')
df_cps1_nsw.loc[~df_cps1_nsw.index.isin(pairs1_cps1_nsw.unstack())].plot.scatter(x='ps', y=z_column, label='unmatched', alpha=0.3, c='tab:orange', ax=ax)
for x in zip(df_cps1_nsw['ps'].iloc[pairs1_cps1_nsw['l']].values, df_cps1_nsw['ps'].iloc[pairs1_cps1_nsw['r']].values):
    plt.plot(x, [1, 0], c='tab:blue', alpha=0.3)

image.png

--Calculate ATT --Standard deviation is quite large

(pairs1_cps1_nsw['l'].map(df_cps1_nsw[y_column]) - pairs1_cps1_nsw['r'].map(df_cps1_nsw[y_column])).agg(['mean', 'std'])
mean    1929.083008
std     9618.346680
dtype: float64

IPW

--Calculate weight

df_cps1_nsw['w'] = df_cps1_nsw[z_column] / df_cps1_nsw['ps'] + (1 - df_cps1_nsw[z_column]) / (1 - df_cps1_nsw['ps'])

--Estimate ATT with weighted linear regression

att_model = sm.WLS(df_cps1_nsw[y_column], df_cps1_nsw[[z_column]].assign(untreat=1-df_cps1_nsw[z_column]), weights=df_cps1_nsw['w'] * df_cps1_nsw['ps']).fit()
att_model.summary().tables[1]

image.png

att_model.params['treat'] - att_model.params['untreat']
1180.4078220960955
IPW (common support)

--Extract samples in the range where the distributions of propensity scores overlap

df_cps1_nsw_cs = df_cps1_nsw[df_cps1_nsw['ps'].between(*df_cps1_nsw.groupby(z_column)['ps'].agg(['min', 'max']).agg({'min': 'max', 'max': 'min'}))].copy()

--Estimate ATT with weighted linear regression

att_model = sm.WLS(df_cps1_nsw_cs[y_column], df_cps1_nsw_cs[[z_column]].assign(untreat=1-df_cps1_nsw_cs[z_column]), weights=df_cps1_nsw_cs['w'] * df_cps1_nsw_cs['ps']).fit()
att_model.summary().tables[1]

image.png

att_model.params['treat'] - att_model.params['untreat']
1243.8640545001808

--Confirm standardized mean difference between two groups of covariates ――After adjustment, it is almost within the red dotted line. --Does it match the graph in this book? (Especially black before adjustment)

pd.DataFrame([{
    'unadjusted': np.subtract(*df_cps1_nsw.groupby(z_column)[c].mean()) / df_cps1_nsw[c].std(),
    'adjusted(matching)': np.subtract(*df_cps1_nsw.loc[pairs1_cps1_nsw.unstack()].groupby(z_column)[c].mean()) / df_cps1_nsw[c].std(),
    'adjusted(weighting)': np.subtract(*df_cps1_nsw.groupby(z_column).apply(lambda x: np.average(x[c], weights=x['w'] * x['ps']))) / df_cps1_nsw[c].std(),
    'adjusted(weighting in common support)': np.subtract(*df_cps1_nsw_cs.groupby(z_column).apply(lambda x: np.average(x[c], weights=x['w'] * x['ps']))) / df_cps1_nsw_cs[c].std(),
} for c in X_columns], index=X_columns).abs().plot(marker='o', alpha=0.5, grid=True, ylim=(0, 100), logy=True, title='covariates balance')
plt.axhline(y=0.1, linestyle='--', c='r')

image.png

--Confirm the distribution of covariates in matching --The average does not tell the truth, so look at the distribution ――Balancing is done properly compared to the original distribution

_, ax = plt.subplots(3, 3, figsize=(16, 12))
for i, c in enumerate(X_columns + [y_column]):
    (df_cps1_nsw.loc[pairs1_cps1_nsw.unstack()][c].groupby(df_cps1_nsw[z_column].map(treat_label))
     .plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))

image.png

--Confirm the distribution of covariates in IPW ――Balancing is done properly compared to the original distribution

_, ax = plt.subplots(3, 3, figsize=(16, 12))
df = df_cps1_nsw.sample(1000000, replace=True, weights=df_cps1_nsw['w'] * df_cps1_nsw['ps'], random_state=2020)
for i, c in enumerate(X_columns + [y_column]):
    (df[c].groupby(df_cps1_nsw[z_column].map(treat_label))
     .plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))

image.png

--Confirm the distribution of covariates in IPW (common support) ――Balancing is done properly compared to the original distribution

_, ax = plt.subplots(3, 3, figsize=(16, 12))
df = df_cps1_nsw_cs.sample(1000000, replace=True, weights=df_cps1_nsw_cs['w'] * df_cps1_nsw_cs['ps'], random_state=2020)
for i, c in enumerate(X_columns + [y_column]):
    (df[c].groupby(df_cps1_nsw_cs[z_column].map(treat_label))
     .plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))

image.png

CPS3+NSW

matching

--Make a pair

s1 = df_cps3_nsw[df_cps3_nsw[z_column] == 1]['ps']
s0 = df_cps3_nsw[df_cps3_nsw[z_column] == 0]['ps']
threshold = df_cps3_nsw['ps'].std() * 0.1
pairs1_cps3_nsw = pd.concat(pairs_generator(s1, s0, threshold), ignore_index=True)
pairs0_cps3_nsw = pd.concat(pairs_generator(s0, s1, threshold), ignore_index=True) #Used in ATE calculation

--Confirm a pair of intervention groups

ax = df_cps3_nsw.loc[pairs1_cps3_nsw.unstack()].plot.scatter(x='ps', y=z_column, label='matched', alpha=0.3, figsize=(16, 4), title='pairs on treated')
df_cps3_nsw.loc[~df_cps3_nsw.index.isin(pairs1_cps3_nsw.unstack())].plot.scatter(x='ps', y=z_column, label='unmatched', alpha=0.3, c='tab:orange', ax=ax)
for x in zip(df_cps3_nsw['ps'].iloc[pairs1_cps3_nsw['l']].values, df_cps3_nsw['ps'].iloc[pairs1_cps3_nsw['r']].values):
    plt.plot(x, [1, 0], c='tab:blue', alpha=0.3)

image.png

--Calculate ATT

(pairs1_cps3_nsw['l'].map(df_cps3_nsw[y_column]) - pairs1_cps3_nsw['r'].map(df_cps3_nsw[y_column])).agg(['mean', 'std'])
mean    1463.115845
std     8779.598633
dtype: float64

IPW

--Calculate weight

df_cps3_nsw['w'] = df_cps3_nsw[z_column] / df_cps3_nsw['ps'] + (1 - df_cps3_nsw[z_column]) / (1 - df_cps3_nsw['ps'])

--Estimate ATT with weighted linear regression

att_model = sm.WLS(df_cps3_nsw[y_column], df_cps3_nsw[[z_column]].assign(untreat=1-df_cps3_nsw[z_column]), weights=df_cps3_nsw['w'] * df_cps3_nsw['ps']).fit()
att_model.summary().tables[1]

image.png

att_model.params['treat'] - att_model.params['untreat']
1214.0711733143507
IPW (common support)

--Extract samples in the range where the distributions of propensity scores overlap

df_cps3_nsw_cs = df_cps3_nsw[df_cps3_nsw['ps'].between(*df_cps3_nsw.groupby(z_column)['ps'].agg(['min', 'max']).agg({'min': 'max', 'max': 'min'}))].copy()

--Estimate ATT with weighted linear regression

att_model = sm.WLS(df_cps3_nsw_cs[y_column], df_cps3_nsw_cs[[z_column]].assign(untreat=1-df_cps3_nsw_cs[z_column]), weights=df_cps3_nsw_cs['w'] * df_cps3_nsw_cs['ps']).fit()
att_model.summary().tables[1]

image.png

att_model.params['treat'] - att_model.params['untreat']
988.8657151185471

--Confirm standardized mean difference between two groups of covariates

pd.DataFrame([{
    'unadjusted': np.subtract(*df_cps3_nsw.groupby(z_column)[c].mean()) / df_cps3_nsw[c].std(),
    'adjusted(matching)': np.subtract(*df_cps3_nsw.loc[pairs1_cps3_nsw.unstack()].groupby(z_column)[c].mean()) / df_cps3_nsw[c].std(),
    'adjusted(weighting)': np.subtract(*df_cps3_nsw.groupby(z_column).apply(lambda x: np.average(x[c], weights=x['w'] * x['ps']))) / df_cps3_nsw[c].std(),
    'adjusted(weighting in common support)': np.subtract(*df_cps3_nsw_cs.groupby(z_column).apply(lambda x: np.average(x[c], weights=x['w'] * x['ps']))) / df_cps3_nsw_cs[c].std(),
} for c in X_columns], index=X_columns).abs().plot(marker='o', alpha=0.5, grid=True, ylim=(0, 100), logy=True, title='covariates balance')
plt.axhline(y=0.1, linestyle='--', c='r')

image.png

--Confirm the distribution of covariates in matching

_, ax = plt.subplots(3, 3, figsize=(16, 12))
for i, c in enumerate(X_columns + [y_column]):
    (df_cps3_nsw.loc[pairs1_cps3_nsw.unstack()][c].groupby(df_cps3_nsw[z_column].map(treat_label))
     .plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))

image.png

--Confirm the distribution of covariates in IPW

_, ax = plt.subplots(3, 3, figsize=(16, 12))
df = df_cps3_nsw.sample(1000000, replace=True, weights=df_cps3_nsw['w'] * df_cps3_nsw['ps'], random_state=2020)
for i, c in enumerate(X_columns + [y_column]):
    (df[c].groupby(df_cps3_nsw[z_column].map(treat_label))
     .plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))

image.png

--Confirm the distribution of covariates in IPW (common support)

_, ax = plt.subplots(3, 3, figsize=(16, 12))
df = df_cps3_nsw_cs.sample(1000000, replace=True, weights=df_cps3_nsw_cs['w'] * df_cps3_nsw_cs['ps'], random_state=2020)
for i, c in enumerate(X_columns + [y_column]):
    (df[c].groupby(df_cps3_nsw_cs[z_column].map(treat_label))
     .plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))

image.png

result

--The covariates are balanced, but the estimated intervention effect seems to be quite different from $ 1600.

data set matching IPW IPW(Common support)
CPS1+NSW 1929.1 1180.4 1243.9
CPS3+NSW 1463.1 1214.1 988.9

ATE estimation

CPS1+NSW

matching

--Check the pair

pairs_cps1_nsw = pd.concat([pairs1_cps1_nsw, pairs0_cps1_nsw.rename({'l': 'r', 'r': 'l'}, axis=1)], ignore_index=True)
ax = df_cps1_nsw.loc[pairs_cps1_nsw.unstack()].plot.scatter(x='ps', y=z_column, label='matched', alpha=0.3, figsize=(16, 4), title='pairs')
df_cps1_nsw.loc[~df_cps1_nsw.index.isin(pairs_cps1_nsw.unstack())].plot.scatter(x='ps', y=z_column, label='unmatched', alpha=0.3, c='tab:orange', ax=ax)
for x in zip(df_cps1_nsw['ps'].iloc[pairs_cps1_nsw['l']].values, df_cps1_nsw['ps'].iloc[pairs_cps1_nsw['r']].values):
    plt.plot(x, [1, 0], c='tab:blue', alpha=0.3)

image.png

--Calculate ATE

(pairs_cps1_nsw['l'].map(df_cps1_nsw[y_column]) - pairs_cps1_nsw['r'].map(df_cps1_nsw[y_column])).agg(['mean', 'std'])
mean    1836.516968
std     9634.636719
dtype: float64

IPW

--Estimate ATE with weighted linear regression

ate_model = sm.WLS(df_cps1_nsw[y_column], df_cps1_nsw[[z_column]].assign(untreat=1-df_cps1_nsw[z_column]), weights=df_cps1_nsw['w']).fit()
ate_model.summary().tables[1]

image.png

ate_model.params['treat'] - ate_model.params['untreat']
-6456.300804768925
IPW (common support)

--Estimate ATE with weighted linear regression

ate_model = sm.WLS(df_cps1_nsw_cs[y_column], df_cps1_nsw_cs[[z_column]].assign(untreat=1-df_cps1_nsw_cs[z_column]), weights=df_cps1_nsw_cs['w']).fit()
ate_model.summary().tables[1]

image.png

ate_model.params['treat'] - ate_model.params['untreat']
545.3149727568589

--Confirm standardized mean difference between two groups of covariates

pd.DataFrame([{
    'unadjusted': np.subtract(*df_cps1_nsw.groupby(z_column)[c].mean()) / df_cps1_nsw[c].std(),
    'adjusted(matching)': np.subtract(*df_cps1_nsw.loc[pairs_cps1_nsw.unstack()].groupby(z_column)[c].mean()) / df_cps1_nsw[c].std(),
    'adjusted(weighting)': np.subtract(*df_cps1_nsw.groupby(z_column).apply(lambda x: np.average(x[c], weights=x['w']))) / df_cps1_nsw[c].std(),
    'adjusted(weighting in common support)': np.subtract(*df_cps1_nsw_cs.groupby(z_column).apply(lambda x: np.average(x[c], weights=x['w']))) / df_cps1_nsw_cs[c].std(),
} for c in X_columns], index=X_columns).abs().plot(marker='o', alpha=0.5, grid=True, ylim=(0, 100), logy=True, title='covariates balance')
plt.axhline(y=0.1, linestyle='--', c='r')

image.png

--Confirm the distribution of covariates in matching

_, ax = plt.subplots(3, 3, figsize=(16, 12))
for i, c in enumerate(X_columns + [y_column]):
    (df_cps1_nsw.loc[pairs_cps1_nsw.unstack()][c].groupby(df_cps1_nsw[z_column].map(treat_label))
     .plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))

image.png

--Confirm the distribution of covariates in IPW

_, ax = plt.subplots(3, 3, figsize=(16, 12))
df = df_cps1_nsw.sample(1000000, replace=True, weights=df_cps1_nsw['w'], random_state=2020)
for i, c in enumerate(X_columns + [y_column]):
    (df[c].groupby(df_cps1_nsw[z_column].map(treat_label))
     .plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))

image.png

--Confirm the distribution of covariates in IPW (common support)

_, ax = plt.subplots(3, 3, figsize=(16, 12))
df = df_cps1_nsw_cs.sample(1000000, replace=True, weights=df_cps1_nsw_cs['w'], random_state=2020)
for i, c in enumerate(X_columns + [y_column]):
    (df[c].groupby(df_cps1_nsw_cs[z_column].map(treat_label))
     .plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))

image.png

CPS3+NSW

matching

--Check the pair

pairs_cps3_nsw = pd.concat([pairs1_cps3_nsw, pairs0_cps3_nsw.rename({'l': 'r', 'r': 'l'}, axis=1)], ignore_index=True)
ax = df_cps3_nsw.loc[pairs_cps3_nsw.unstack()].plot.scatter(x='ps', y=z_column, label='matched', alpha=0.3, figsize=(16, 4), title='pairs')
df_cps3_nsw.loc[~df_cps3_nsw.index.isin(pairs_cps3_nsw.unstack())].plot.scatter(x='ps', y=z_column, label='unmatched', alpha=0.3, c='tab:orange', ax=ax)
for x in zip(df_cps3_nsw['ps'].iloc[pairs_cps3_nsw['l']].values, df_cps3_nsw['ps'].iloc[pairs_cps3_nsw['r']].values):
    plt.plot(x, [1, 0], c='tab:blue', alpha=0.3)

image.png

--Calculate ATE

(pairs_cps3_nsw['l'].map(df_cps3_nsw[y_column]) - pairs_cps3_nsw['r'].map(df_cps3_nsw[y_column])).agg(['mean', 'std'])
mean    1486.608276
std     8629.639648
dtype: float64

IPW

--Estimate ATE with weighted linear regression

ate_model = sm.WLS(df_cps3_nsw[y_column], df_cps3_nsw[[z_column]].assign(untreat=1-df_cps3_nsw[z_column]), weights=df_cps3_nsw['w']).fit()
ate_model.summary().tables[1]

image.png

ate_model.params['treat'] - ate_model.params['untreat']
224.6762634665365
IPW (common support)

--Estimate ATE with weighted linear regression

ate_model = sm.WLS(df_cps3_nsw_cs[y_column], df_cps3_nsw_cs[[z_column]].assign(untreat=1-df_cps3_nsw_cs[z_column]), weights=df_cps3_nsw_cs['w']).fit()
ate_model.summary().tables[1]

image.png

ate_model.params['treat'] - ate_model.params['untreat']
801.0736196669177

--Confirm standardized mean difference between two groups of covariates

pd.DataFrame([{
    'unadjusted': np.subtract(*df_cps3_nsw.groupby(z_column)[c].mean()) / df_cps3_nsw[c].std(),
    'adjusted(matching)': np.subtract(*df_cps3_nsw.loc[pairs_cps3_nsw.unstack()].groupby(z_column)[c].mean()) / df_cps3_nsw[c].std(),
    'adjusted(weighting)': np.subtract(*df_cps3_nsw.groupby(z_column).apply(lambda x: np.average(x[c], weights=x['w']))) / df_cps3_nsw[c].std(),
    'adjusted(weighting in common support)': np.subtract(*df_cps3_nsw_cs.groupby(z_column).apply(lambda x: np.average(x[c], weights=x['w']))) / df_cps3_nsw_cs[c].std(),
} for c in X_columns], index=X_columns).abs().plot(marker='o', alpha=0.5, grid=True, ylim=(0, 100), logy=True, title='covariates balance')
plt.axhline(y=0.1, linestyle='--', c='r')

image.png

--Confirm the distribution of covariates in matching

_, ax = plt.subplots(3, 3, figsize=(16, 12))
for i, c in enumerate(X_columns + [y_column]):
    (df_cps3_nsw.loc[pairs_cps3_nsw.unstack()][c].groupby(df_cps3_nsw[z_column].map(treat_label))
     .plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))

image.png

--Confirm the distribution of covariates in IPW

_, ax = plt.subplots(3, 3, figsize=(16, 12))
df = df_cps3_nsw.sample(1000000, replace=True, weights=df_cps3_nsw['w'], random_state=2020)
for i, c in enumerate(X_columns + [y_column]):
    (df[c].groupby(df_cps3_nsw[z_column].map(treat_label))
     .plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))

image.png

--Confirm the distribution of covariates in IPW (common support)

_, ax = plt.subplots(3, 3, figsize=(16, 12))
df = df_cps3_nsw_cs.sample(1000000, replace=True, weights=df_cps3_nsw_cs['w'], random_state=2020)
for i, c in enumerate(X_columns + [y_column]):
    (df[c].groupby(df_cps3_nsw_cs[z_column].map(treat_label))
     .plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))

image.png

result

data set matching IPW IPW(Common support)
CPS1+NSW 1836.5 -6456.3 545.3
CPS3+NSW 1486.6 224.7 801.0

Impressions

--I can't get rid of the "bias" at all ...


  1. https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html ↩︎

Recommended Posts

"Introduction to effect verification Chapter 3 Analysis using propensity score" + α is tried in Python
Introduction to Effectiveness Verification Chapter 1 in Python
Introduction to effectiveness verification Chapter 3 written in Python
I wrote "Introduction to Effect Verification" in Python
Introduction to Effectiveness Verification Chapter 2 Written in Python
[Technical book] Introduction to data analysis using Python -1 Chapter Introduction-
Introduction to Effectiveness Verification Chapters 4 and 5 are written in Python
I tried to make a stopwatch using tkinter in python
[Python] PCA scratch in the example of "Introduction to multivariate analysis"
[Introduction to Python3 Day 13] Chapter 7 Strings (7.1-7.1.1.1)
[Introduction to Python3 Day 14] Chapter 7 Strings (7.1.1.1 to 7.1.1.4)
[Introduction to Python3 Day 15] Chapter 7 Strings (7.1.2-7.1.2.2)
Introduction to image analysis opencv python
[Introduction to Python3 Day 21] Chapter 10 System (10.1 to 10.5)
I tried fMRI data analysis with python (Introduction to brain information decoding)
I tried to implement PLSA in Python
[Introduction to Python] How to use class in Python?
I tried to implement permutation in Python
Introduction to Discrete Event Simulation Using Python # 1
[Introduction to Python3, Day 17] Chapter 8 Data Destinations (8.1-8.2.5)
[Introduction to Python3, Day 17] Chapter 8 Data Destinations (8.3-8.3.6.1)
I tried to implement PLSA in Python 2
I tried using Bayesian Optimization in Python
Log in to Slack using requests in Python
[Introduction to Python3 Day 19] Chapter 8 Data Destinations (8.4-8.5)
[Introduction to Python3 Day 18] Chapter 8 Data Destinations (8.3.6.2 to 8.3.6.3)
I tried to implement ADALINE in Python
I tried to implement PPO in Python
Introduction to Discrete Event Simulation Using Python # 2
How to use is and == in Python
Introduction to Vectors: Linear Algebra in Python <1>
Introduction to Time Series Analysis ~ Seasonal Adjustment Model ~ Implemented in R and Python
[What is an algorithm? Introduction to Search Algorithm] ~ Python ~
[Introduction to Python3 Day 12] Chapter 6 Objects and Classes (6.3-6.15)
tse --Introduction to Text Stream Editor in Python
[Introduction to Python3 Day 22] Chapter 11 Concurrency and Networking (11.1 to 11.3)
Perform entity analysis using spaCy / GiNZA in Python
[Introduction to Python3 Day 11] Chapter 6 Objects and Classes (6.1-6.2)
[Introduction to Python3 Day 23] Chapter 12 Become a Paisonista (12.1 to 12.6)
I tried to implement TOPIC MODEL in Python
[Environment construction] Dependency analysis using CaboCha in Python 2.7
[Introduction to Python3 Day 20] Chapter 9 Unraveling the Web (9.1-9.4)
I tried to implement selection sort in python
[Introduction to Udemy Python 3 + Application] 54. What is Docstrings?
[Introduction to element decomposition] Let's arrange time series analysis methods in R and python ♬
Python practice 100 knocks I tried to visualize the decision tree of Chapter 5 using graphviz
[Impression] [Data analysis starting from zero] Introduction to Python data science learned in business cases
[New Corona] Is the next peak in December? I tried trend analysis with Python!
[Introduction to Python3 Day 8] Chapter 4 Py Skin: Code Structure (4.1-4.13)
I tried to graph the packages installed in Python
I tried using TradeWave (BitCoin system trading in Python)
[Chapter 5] Introduction to Python with 100 knocks of language processing
I tried to implement a pseudo pachislot in Python
Reading Note: An Introduction to Data Analysis with Python
I tried to implement Dragon Quest poker in Python
I tried to implement GA (genetic algorithm) in Python
[Chapter 3] Introduction to Python with 100 knocks of language processing
How to exit when using Python in Terminal (Mac)
[Chapter 2] Introduction to Python with 100 knocks of language processing
I tried to summarize how to use pandas in python
Introduction to Linear Algebra in Python: A = LU Decomposition