[PYTHON] Estimating the effect of measures using propensity scores

Introduction

Statistical causal reasoning (hereinafter referred to as causal reasoning) is a framework for more accurately estimating the effects of measures and treatments in the fields of marketing and medical care. In particular,

--When using causal reasoning --Measures effect estimation method by inverse probability weighting method based on propensity score

I will briefly summarize about.

Recently, a very easy-to-understand introductory book on causal reasoning (Introduction to Effectiveness Verification-Causal Reasoning for Correct Comparison / Basics of Econometrics Since 11117-5)) came out, I feel that causal reasoning has also gained citizenship.

When to use causal reasoning

** When using causal reasoning, you cannot do AB testing, but you want to estimate the effect of interventions (marketing measures and treatments) more accurately **.

The AB test (also known as a Randomized Controlled Trial (RCT)) randomly assigns subjects to two groups, one group intervening and the other intervening. Instead, the intervention effect is estimated by comparing the results of both groups. By randomly assigning subjects to two groups and creating homogeneous groups in both groups except with or without intervention, the average effect of intervention can be estimated by a simple difference between the results of both groups. For example, as shown in the figure below, when the email is delivered to half of the randomly selected users and the email is not delivered to the other half of the users, the difference in the average number of reservations for each group (= average reservation). (Incremental number) will be the effect of the mail delivery measure. ABtest.JPG

However, because it is costly to perform an AB test, if the intervention target is determined by non-random logic, the intervention effect cannot be estimated by the above method.

For example, if you want to estimate the effectiveness of a marketing measure, but narrowing down the target audience for the measure to half, there is a high possibility that opportunity loss (decrease in sales, etc.) will occur. It is also applicable (in a broad sense) when it is ethical to randomly select a treatment target, such as when you want to estimate the effect of treatment.

** A simple comparison of the results of the intervention group and the non-intervention group when not performing A / B testing cannot rule out non-intervention effects (bias) and underestimate the intervention effect. There is a possibility of doing so. ** **

Let's take an email delivery measure as an example. Let's assume that you are implementing measures such as machine learning to predict whether or not a user will make a reservation and deliver an email to a user with a high predicted score. At this time, the user who is the target of mail delivery can be considered to have an attribute that makes it easy to make a reservation. At this time, can we say that the simple difference between the number of reservations made by email delivery targets and the number of reservations made by non-mail delivery targets is the effect of email delivery? The answer is probably NO. The reason is that emails are delivered to users who have attributes that are easy to make reservations in the first place, and it is expected that the number of reservations for mail delivery targets will be large accordingly, so the effect of email delivery will be due to the influence of attributes other than email delivery. This is because it is overestimated. biased_test.JPG

Also, when you want to estimate the effect of a TV commercial, the presence or absence of intervention is decided by the intention of the target person (it is up to the individual to see or not see the commercial, and the person who broadcasts the commercial decides who will watch the commercial. Even if it is not possible, a simple comparison between the intervention group and the non-intervention group may overestimate (or underestimate) the effectiveness of the intervention because the presence or absence of intervention is not randomly determined.

Even in the above cases, causal reasoning is the methodology for estimating the effect of intervention.

Inverse probability weighting method using propensity score

Here, we will explain the Inverse Probability Weighting (IPW), which is one of the methods of causal reasoning, using probabilistic scores.

** Probability score is the probability that subject $ i $ belongs to the intervention group (treatment group) **, and "the background information (attribute) of subject $ i $ looks like this, so subject $ i $ The probability that a member belongs to the treatment group (easiness of being assigned to the treatment group) is about this. " In the formula, the propensity score $ e_i $ of the target person $ i $ is

e_i = P(Z_i =1| X=x_i)

is. $ Z_i $ is a variable that represents the grouping of the subject $ i $ ($ Z_i = 1 $ belongs to the intervention group (treatment group), $ Z_i = 0 $ belongs to the control group (no intervention group)). , $ X_i $ represent the background information (called covariates) of the subject $ i $.

So far, I have explained that the effect of "average" intervention is estimated without any particular notice. The reason is that it is not possible to estimate the effect of intervention at the individual level (called the fundamental problem of causal reasoning). This is because it is not possible to know the number of reservations when an email was delivered to a user and the number of reservations when the email was not delivered at the same time. Therefore, we take an approach to estimate the average intervention effect by forming homogeneous groups except for the presence or absence of intervention.

Just as the mean value varies depending on the population to be aggregated, the average intervention effect also differs depending on the population of interest. The basic ones are the following three.

-** Average Treatment Effect (ATE) ** --Effect of average intervention on the entire population -** Average Treatment effect on the Treated (ATT) ** ――Effect of average intervention in the treatment group (group with intervention) -** Average Treatment effect on the Untreated (ATU) ** --Effect of average intervention in the control group (no intervention group)

If you write each in a formula,

ATE = E(Y_1) - E(Y_0) = \frac{\sum_{i=1}^{N} \frac{z_i}{e_i}y_i}{\sum_{j=1}^{N} \frac{z_j}{e_j}} - \frac{\sum_{i=1}^{N} \frac{1-z_i}{1-e_i}y_i}{\sum_{j=1}^{N} \frac{1-z_j}{1-e_j}}
ATT = E(Y_1 | Z=1) - E(Y_0 | Z=1) = \bar{y}_1 - \frac{\sum_{i=1}^{N} \frac{(1-z_i)e_i}{1-e_i}y_i}{\sum_{j=1}^{N} \frac{(1-z_j)e_j}{1-e_j}}
ATU = E(Y_1 | Z=0) - E(Y_0 | Z=0) = \frac{\sum_{i=1}^{N} \frac{z_i(1-e_i)}{e_i}y_i}{\sum_{j=1}^{N} \frac{z_j(1-e_j)}{e_j}} - \bar{y}_0

And $ y_i $ represents the result variable of the target person $ i $ (the amount for which the intervention effect is estimated. It is also called the outcome. In the example of mail delivery, the number of reservations).

All of ATE, ATT, and ATU are the differences between the "results when intervening in the population of interest (first term)" and the "results when not intervening in the population of interest (second term)".

ATE focuses on the entire population. The first term calculates the mean value of the result variable weighted by the reciprocal of the propensity score for the treatment group (group of $ z_i = 1 ), and the second term is the control group ( z_i = 0 $). The mean value of the result variable weighted by the reciprocal of (1-probability score) is calculated for the group). If the propensity score is interpreted as "subject's treatment group-likeness" and (1-propensity score) as "subject's control group-likeness", the first term is neutral by dividing the treatment group by "treatment group-likeness". It can be regarded as a neutral (?) Group by dividing the control group by the "control group-likeness" in the second term. In other words, we estimate the average intervention effect of the entire population regardless of the presence or absence of intervention.

** ATT ** focuses on the treatment group, "Elongation of the result variable due to intervention in the treatment group (= actual value of the result variable of the treatment group and the result variable when the treatment group is not intervened) Difference) ”. It is often used ** when estimating the cost-effectiveness of an intervention, because it allows you to estimate the direct benefits of the intervention. The first term $ \ bar {y} _1 $ is the average value (actual value) of the result variables of the treatment group. The second term calculates the mean value of the result variable weighted by the propensity score / (1-prone score) for the control group (group of $ z_i = 0 $). This can be interpreted as estimating the "result variable when the treatment group is not intervened" by dividing the result variable of the control group by the "control group-likeness" and multiplying by the "treatment group-likeness".

** ATU ** focuses on the control group. Difference) ”. By estimating the effect of intervention in a population that is not currently intervening, it is often used to ** determine whether to expand the scope of intervention in the future **. The first term calculates the mean value of the result variables weighted by (1-prone score) / propensity score for the treatment group (group of $ z_i = 1 $). This can be interpreted as estimating the "result variable when the control group is tentatively intervened" by dividing the result variable of the treatment group by the "treatment group-likeness" and multiplying by the "control-group-likeness". The second term $ \ bar {y} _0 $ is the mean value (actual value) of the result variable of the control group.

Procedure for estimating the effect

Now let's actually estimate the effect of the intervention. Here, Data of CM contact effect estimation of Iwanami DS vol.3 is used, and CM contact gives to the application usage. Implement the effect estimation in python. Here, we estimate how much the percentage of people who actually use the app has increased by broadcasting a commercial that promotes the use of the app. ATT is estimated in order to estimate the direct profit (= increase rate of users who use the app) from CM measures. The estimation procedure is as follows.

  1. Build a propensity score estimation model
  2. Check the estimation accuracy of the propensity score estimation model
  3. Calculate ATT

Below is the code.

--Import the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

import statsmodels.api as sm
from sklearn.calibration import calibration_curve
from sklearn.metrics import roc_auc_score

import itertools
import time

--Reading the data to be used

Let $ Z $ be the variable that represents the group of "watched / not watched CM", and the covariate $ X $ for TV viewing time, gender, age, etc.

data = pd.read_csv('./q_data_x.csv')
X = data[['TVwatch_day', 'age', 'sex', 'marry_dummy', 'child_dummy', 'inc', 'pmoney','area_kanto', 'area_tokai', 'area_keihanshin', 
          'job_dummy1', 'job_dummy2', 'job_dummy3', 'job_dummy4', 'job_dummy5', 'job_dummy6', 'job_dummy7',
          'fam_str_dummy1', 'fam_str_dummy2', 'fam_str_dummy3', 'fam_str_dummy4']] #Covariate
Z = data['cm_dummy'] #Group variables

--Estimation of propensity score

Estimate the propensity score for each user from the covariates. Here, we use the Logit function of the StatsModels library to build a propensity score estimation model by logistic regression.

exog = sm.add_constant(X) #Add section
logit_model = sm.Logit(endog=Z, exog=exog) #Logistic regression
logit_res = logit_model.fit()

--Confirmation of estimation accuracy of propensity score

Looking at the estimation accuracy with the code below, it was AUC = 0.792. It's a reasonable estimation accuracy.

ps = logit_res.predict(exog)
print('AUC = {:.3f}'.format(roc_auc_score(y_true=Z, y_score=ps)))
#output:=> AUC = 0.792

Then draw a calibration plot with the code below. Since the calibration plot is roughly on the 45 degree line, it is still a reasonable estimation accuracy.

_, ax1 = plt.subplots(figsize=(10, 5))

prob_true, prob_pred = calibration_curve(y_true=Z, y_prob=ps, n_bins=20)
ax1.plot(prob_pred, prob_true, marker='o', label='calibration plot')
ax1.plot([0,1], [0,1], linestyle='--', color='black', label='ideal')
ax1.legend(bbox_to_anchor=(1.2, 0.9), loc='upper left', borderaxespad=0)
ax1.set_xlabel('predicted probability')
ax1.set_ylabel('true probability')

ax2 = ax1.twinx()
ax2.hist(ps, bins=20, histtype='step', rwidth=0.9)
ax2.set_ylabel('count')
plt.tight_layout()
plt.show()

calibration_plot.PNG

--ATT estimation

Estimate ATT. The result variable to be estimated is an application usage dummy variable (variable that takes 1 if there is application use, 0 if not). Estimating ATT with the code below, the effect of CM was about 0.026 (± 0.013 is the 95% confidence interval calculated by the bootstrap method). In other words, it was found that the percentage of users who use the app will increase by about 2.6% due to the CM measures. (It has been confirmed that it matches the value described in Iwanami DS vol.3)

Y = data['gamedummy'] #Result variable (estimated target)
sample_size = len(data.loc[data['cm_dummy']==1])
ATT_list = []

for i in range(10000):
    idx1 = pd.Series(data.loc[data['cm_dummy']==1, 'gamedummy'].index).sample(n=sample_size, replace=True, random_state=i)
    idx0 = pd.Series(data.loc[data['cm_dummy']==0, 'gamedummy'].index).sample(n=sample_size, replace=True, random_state=i)
    
    Z_tmp = np.r_[Z[idx1], Z[idx0]]
    Y_tmp = np.r_[Y[idx1], Y[idx0]]
    ps_tmp = np.r_[ps[idx1], ps[idx0]]
    w01_tmp = (1-Z_tmp)*ps_tmp/(1-ps_tmp)
    
    E1 = np.mean(Y_tmp[Z_tmp==1])
    E0 = np.sum(Y_tmp * w01_tmp) / np.sum(w01_tmp)
    ATT = E1 - E0
    ATT_list.append(ATT)
    
print('ATT = {:.3f} ± {:.3f} (s.d.={:.3f})'.format(np.mean(ATT_list), np.std(ATT_list)*1.96, np.std(ATT_list)))
#output:=> ATT = 0.026 ± 0.013 (s.d.=0.006)

Confirmation of covariate distribution

We were able to estimate ATT with the above code, but let's take it one step further to see if we could correctly estimate the effect of the intervention.

In estimating ATT, it is important to correctly estimate the second term (the mean value of the outcome variables if no intervention was made in the treatment group) using the propensity score. For that purpose, is the "value of the result variable of the control group" adjusted well to the "value of the result variable when the treatment group is not intervened", that is, the control is adjusted by using the ** propensity score. It is necessary to confirm that the background information (covariates) of the users belonging to the group is the same as the background information (covariates) of the users belonging to the treatment group **. We have confirmed the estimation accuracy of the propensity score, but I think we should give priority to confirming whether the distribution of covariates is homogeneous. When checking, each covariate, like the outcome variable, is appropriately weighted with a propensity score.

Now, let's check if the distribution of covariates in each group is homogeneous by adjusting using the propensity score. Here, we use an index called Standardized Mean Difference (SMD). When written in a formula,

SMD = \frac{\bar{x}_1 - \bar{x}_0}{\sigma_{pool}}
\sigma_{pool}^2 = \frac{(N_1-1)\sigma_1^2+(N_0-1)\sigma_0^2}{N_1+N_2-2}

is. The variance of pooling the difference between the mean covariate $ \ bar {x} \ _1 $ of the treatment group and the mean $ \ bar {x} \ 0 $ of the control group $ \ sigma {pool} ^ 2 It can be calculated by dividing by the square root of $. $ \ Sigma_1 ^ 2 $ is the covariate distribution of the treatment group, $ \ sigma_0 ^ 2 $ is the covariate distribution of the control group, $ N_1 $ is the number of users in the treatment group, and $ N_0 $ is the number of users in the control group. ..

When the difference between the mean values of each group is small and the variance is large, SMD is an index showing that the distribution of covariates in both groups has a large overlap and the difference in distribution of covariates is small. If you want to judge the difference in distribution, you may want to perform the Kolmogorov-Smirnov test (KS test), but since this is a method to show that there is a statistically significant difference in distribution, " I think it is a misuse to use it to indicate that the distributions are homogeneous (there is no difference in distributions).

Now let's check the standardized mean difference (SMD) of each covariate. The code below first defines a function that calculates the SMD (standardized_mean_difference) and a function that calculates the SMD before and after adjustment using the propensity score (smd_on_the_treated). Then, for each covariate, the SMD before and after adjustment using the propensity score is plotted. This plot is called a love plot.

def standardized_mean_difference(X1, X0): #Function to calculate SMD
    N1 = len(X1)
    N0 = len(X0)
    s_pool = ((N1-1)*np.var(X1)+(N0-1)*np.var(X0))/(N1+N0-2)
    
    return (np.mean(X1)-np.mean(X0))/np.sqrt(s_pool)

def smd_on_the_treated(X, Z, ps): #Function to calculate SMD before and after adjustment using propensity score
    X1 = X[Z==1]
    X0 = X[Z==0]
    ps0 = ps[Z==0]
    X10 = X0*ps0/(1-ps0)
    
    smd_before = standardized_mean_difference(X1, X0)
    smd_after = standardized_mean_difference(X1, X10)
    
    return smd_before, smd_after

#Calculate SMD before and after adjustment using propensity score for each covariate
smd_list = []
for name in X.columns:
    smd_before, smd_after = smd_on_the_treated(X=X[name], Z=Z, ps=ps)
    smd_list.append([name, smd_before, smd_after])
smd_df = pd.DataFrame(smd_list, columns=['covariate', 'SMD(before)', 'SMD(after)'])

#For each covariate, plot the SMD before and after adjustment using the propensity score (create a love plot)
plt.figure(figsize=(5, 10))
plt.scatter(smd_df['SMD(before)'], smd_df['covariate'], label='before')
plt.scatter(smd_df['SMD(after)'], smd_df['covariate'], label='after')
plt.vlines([0], ymin=-1, ymax=X.shape[1])
plt.legend()
plt.xlabel('standardized mean difference')
plt.ylabel('covariate')
plt.grid(True)
plt.show()

love_plot.png

The blue dots indicate the SMD between the pre-adjusted covariates using the propensity score, and the orange dots indicate the SMD between the adjusted covariates. Looking at this, we can see that the SMD is greatly reduced by adjusting the TV viewing time (TVwatch_day) using the propensity score. However, there are some covariates in which SMD is rather increased by adjustment using propensity scores such as the presence or absence of children (child_dummy).

How to select covariates

The reason why the distribution of covariates is not homogeneous as described above is that the covariates that should be used for estimating the propensity score are not incorporated, and a hidden bias remains. In order to eliminate the hidden bias, it is necessary to satisfy the "strongly negligible allocation" condition, but since it is impossible to directly verify whether this condition is satisfied, the condition is indirectly satisfied. It seems that you often check and select a covariate. (The explanation of the "strongly negligible allocation" condition will be left to the technical books.)

The basic policy for covariate selection is to include as many covariates as possible that are related to the outcome and group variables. The selection procedure is

  1. Extract variables related to result variables --Example: Regression analysis is performed with the number of results as the objective variable and one of the covariate candidates as the explanatory variable, and the one with the highest coefficient of determination is extracted.
  2. Extract variables related to group variables --Example: Regression analysis is performed using the group variable as the objective variable and one of the covariate candidates as the explanatory variable, and the one with the highest coefficient of determination is extracted.
  3. Merge the set of covariates extracted in 1. and 2. --Prevent the top of variables related to result variables from falling --Because it is more important to include variables related to result variables than variables related to assignments --It seems that the estimated variance of the causal effect on the result variable will increase if a variable that is related to the allocation but not the result variable is included.
  4. Confirmation of explanatory power of allocation by covariates --Build a propensity score estimation model with the covariates selected up to 3. and check the estimation accuracy (AUC, etc.)
  5. Confirmation of covariate distribution --Use indicators such as standardized mean difference (SMD) to confirm that the distribution of covariates adjusted by propensity score is homogeneous in the treatment and control groups.

is. Regression analysis is performed in the examples of 1 and 2. However, since the purpose here is to extract variables related to result variables and group variables, if this purpose is achieved, regression is not necessarily achieved. No analysis is required. In 4, we check the estimation accuracy of the propensity score estimation model, but please note that it is not a matter of high accuracy. This is because it is necessary to select a covariate that is more closely related to the outcome variable than the group variable in order to keep the estimated variance of the causal effect on the outcome variable low. Therefore, I think that the variable selection method that raises the estimation accuracy of the propensity score by using a machine learning algorithm with good accuracy is different as a policy.

in conclusion

Statistical causal reasoning, especially the method of estimating the effect of measures by the inverse probability weighting method based on the propensity score, is summarized together with the Python code. He also touched on how to select covariates. However, it is difficult to judge whether the effect is estimated correctly, so be careful when using statistical causal reasoning! I have to say.

We would appreciate it if you could point out any incorrect descriptions.

reference

-Introduction to Effectiveness Verification-Causal Reasoning for Correct Comparison / Basics of Econometrics -Iwanami Data Science vol.3 -Statistical science of survey observation data Causal reasoning / selection bias / data fusion -Statistical causal inference and data analysis-Aiming for proper operation-

Recommended Posts

Estimating the effect of measures using propensity scores
[Nonparametric Bayes] Estimating the number of clusters using the Dirichlet process
Investigate the effect of outliers on correlation
Shortening the analysis time of Openpose using sound
Check the type of the variable you are using
Exclusive release of the django app using ngrok
Try using the collections module (ChainMap) of python3
Find the geometric mean of n! Using Python
Consideration of propensity score and effect estimation accuracy
Determine the number of classes using the Starges formula
I tried using the image filter of OpenCV
Check the status of your data using pandas_profiling
Scraping the winning data of Numbers using Docker
Calculation of the shortest path using the Monte Carlo method
Explanation of the concept of regression analysis using python Part 2
Cut a part of the string using a Python slice
Drawing on Jupyter using the plot function of pandas
The pain of gRPC using Python. November 2019. (Personal memo)
Explanation of the concept of regression analysis using Python Part 1
I tried using the API of the salmon data project
Let's analyze the emotions of Tweet using Chainer (2nd)
Explanation of the concept of regression analysis using Python Extra 1
Study from the beginning of Python Hour8: Using packages
Let's analyze the sentiment of Tweet using Chainer (1st)
The story of using circleci to build manylinux wheels