Potential Outcomes: Causal Inference Notes in Python, Part 1

This is a Japanese translation of this notebook.

from __future__ import division

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")
sns.set_palette("colorblind")

%matplotlib inline

import datagenerators as dg
import warnings
warnings.filterwarnings('ignore')

This post is an overview of how to use the [potential outcomes](https://en.wikipedia.org/wiki/Rubin_causal_model) framework to try to infer causal relationships in situations where we only have observational data, using the excellent CausalInference package. Its author has written a series of blog posts about its features.

Most datasets you can download are static, so in this post I will use my own functions to generate the data. This has two advantages: we can generate datasets with specific properties, and we can "intervene" directly in the data generating system to check whether our inferences are correct. All of these data generators produce i.i.d. samples from some distribution and return the results as a pandas dataframe. The functions that generate these datasets can be found in the accompanying file datagenerators.py on github here.

First, let's look at a motivating example.

Introduction

One day, a team leader notices that some members of their team are wearing cool hats, and that these team members tend to be less productive. Based on this, the team leader starts recording whether or not each team member wears a cool hat ($X=1$ for a cool hat, $X=0$ for no cool hat) and whether or not they are productive ($Y=1$ for productive, $Y=0$ for unproductive).

After a week of observation, the team leader has the following dataset:

observed_data_0 = dg.generate_dataset_0()

observed_data_0.head()
x y
0 0 0
1 1 1
2 0 0
3 0 1
4 0 0

The first question the team leader asks is: "Are people wearing cool hats more likely to be productive than those who aren't?" This means estimating the quantity:

P(Y=1|X=1) - P(Y=1|X=0)

And you can estimate it directly from the data:

def estimate_uplift(ds):
    """
    Estiamte the difference in means between two groups.
    
    Parameters
    ----------
    ds: pandas.DataFrame
        a dataframe of samples.
        
    Returns
    -------
    estimated_uplift: dict[Str: float] containing two items:
        "estimated_effect" - the difference in mean values of $y$ for treated and untreated samples.
        "standard_error" - 90% confidence intervals arround "estimated_effect"
        
        
    """
    base = ds[ds.x == 0]
    variant = ds[ds.x == 1]
    
    delta = variant.y.mean() - base.y.mean()
    delta_err = 1.96 * np.sqrt(
        variant.y.var() / variant.shape[0] + 
        base.y.var() / base.shape[0])
    
    return {"estimated_effect": delta, "standard_error": delta_err}

estimate_uplift(observed_data_0)
{'estimated_effect': -0.15738429037833684,
 'standard_error': 0.08639322378194732}

People with cool hats seem less productive.

You can also run a statistical test to make sure.

from scipy.stats import chi2_contingency

contingency_table = (
    observed_data_0
    .assign(placeholder=1)
    .pivot_table(index="x", columns="y", values="placeholder", aggfunc="sum")
    .values
)

_, p, _, _ = chi2_contingency(contingency_table, lambda_="log-likelihood")

# p-value
p
0.0005626147113456373

That is one small p-value. Statisticians would be proud.

We can use this information to make statements about what we expect to see when someone is wearing a cool hat. As long as we believe they are "drawn from the same distribution" as our previous observations, we expect the same correlation to hold.

The problem arises when the team leader tries to use this information as an argument for whether or not to force people to wear cool hats. If the team leader does this, it can fundamentally change the system we are sampling from, altering or even reversing the previously observed correlations.

The cleanest way to actually measure the impact of a change to the system is to run a Randomized Controlled Trial (https://en.wikipedia.org/wiki/Randomized_controlled_trial). Specifically, randomize who gets a cool hat and who doesn't, and look at the difference in the values of $y$ we obtain. This eliminates the effect of any Confounding Variables (https://en.wikipedia.org/wiki/Confounding) that might be affecting the metric we care about.

Because we generated the dataset from a known process (in this case a function I wrote), we can intervene in it directly to measure the effect via an A/B test.

def run_ab_test(datagenerator, n_samples=10000, filter_=None):
    """
    Generates n_samples from datagenerator with the value of X randomized
    so that 50% of the samples receive treatment X=1 and 50% receive X=0,
    and feeds the results into `estimate_uplift` to get an unbiased 
    estimate of the average treatment effect.
    
    Returns
    -------
    effect: dict
    """
    n_samples_a = int(n_samples / 2)
    n_samples_b = n_samples - n_samples_a
    set_X = np.concatenate([np.ones(n_samples_a), np.zeros(n_samples_b)]).astype(np.int64)
    ds = datagenerator(n_samples=n_samples, set_X=set_X)
    if filter_ is not None:
        ds = ds[filter_(ds)].copy()
    return estimate_uplift(ds)

run_ab_test(dg.generate_dataset_0)
{'estimated_effect': 0.18679999999999997,
 'standard_error': 0.019256613018207393}

Suddenly, the direction of the effect of wearing a cool hat seems to have reversed.

What's going on?

Note: In the above example, and in all the examples that follow, the samples are assumed to be i.i.d. and to obey the Stable Unit Treatment Value Assumption (SUTVA) (https://en.wikipedia.org/wiki/Rubin_causal_model#Stable_unit_treatment_value_assumption_(SUTVA)). Basically, this means that when one person chooses, or is forced, to wear a really cool hat, they have no influence on the choice or outcome of anyone else wearing a really cool hat. By construction, all the synthetic data generators I use have this property. In reality, it is yet another thing you have to assume to be true.

Definition of causality

The previous example demonstrates the old statistician's maxim:

**Correlation does not imply causation**

"Causality" is a vague and philosophical word. In the current context, we use it to mean "what is the effect of changing $ X $ on $ Y $?"

To be precise, $X$ and $Y$ are random variables, and the "effect" we want to know is how the distribution of $Y$ changes when we force $X$ to take a specific value. The act of forcing a variable to take a specific value is called an "intervention".

In the previous example, when we do not intervene in the system, we obtain the observational distribution of $Y$, conditioned on having observed $X$:

P(Y|X)

When we force people to wear cool hats, we are intervening, and the distribution of $Y$ is then given by the interventional distribution:

P(Y|\hbox{do}(X))

In general, the two are not the same.

The question these notes try to answer is how we can infer the interventional distribution when only observational data are available. This is a useful question because there are many situations where running an A/B test to directly measure the effect of an intervention is impractical, infeasible, or unethical. Even in those situations, we would like to be able to say something about what the effect of an intervention would be.

Potential outcomes

One way to approach this problem is to introduce two new random variables into the system: $Y_{0}$ and $Y_{1}$, known as the Potential Outcomes (http://www.stat.unipg.it/stanghellini/rubinjasa2005.pdf). We assume these variables exist and can be treated like any other random variable.

$Y$ is then defined as:

Y = \begin{cases} Y_{1} & \text{if } X = 1 \\ Y_{0} & \text{if } X = 0 \end{cases}

This shifts the problem from one about how distributions change under intervention to one about data drawn i.i.d. from some underlying distribution with missing values. Under certain assumptions about why the values are missing, there is well-developed theory about how to estimate them.
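As a small illustration of this "missing data" view (a purely hypothetical simulation, not one of the accompanying data generators), we can generate both potential outcomes for a few samples and see that only one of them is ever revealed by the observed data:

rng = np.random.RandomState(0)
n = 5

# hypothetical potential outcomes for each sample
y0 = rng.binomial(1, 0.5, size=n)
y1 = rng.binomial(1, 0.7, size=n)
x = rng.binomial(1, 0.5, size=n)

# the observed outcome reveals y1 when x == 1 and y0 when x == 0
y = np.where(x == 1, y1, y0)

df = pd.DataFrame({"y0": y0, "y1": y1, "x": x, "y": y})
df["unobserved"] = np.where(df.x == 1, "y0", "y1")  # the potential outcome we never see
print(df)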

Target

Often we don't care about the full interventional distribution $P(Y|\hbox{do}(X))$; an estimate of the difference in means between the two groups is enough. This quantity is known as the Average Treatment Effect (ATE) (https://en.wikipedia.org/wiki/Average_treatment_effect):

\Delta = E[Y_{1} - Y_{0}]

When we run an A/B test and compare the means of each group, this is exactly the quantity we are measuring.

If we try to estimate this quantity from the observational distribution, we get:

\Delta_{bad} = E[Y|X=1] - E[Y|X=0] \\ = E[Y_{1}|X=1] - E[Y_{0}|X=0] \\ \neq \Delta

In general this does not equal the true ATE, because:

E[Y_{i}|X=i] \neq E[Y_{i}]

Two related quantities are the ATT (average treatment effect on the treated) and the ATC (average treatment effect on the control, i.e. the untreated):

\Delta_{ATT} = E[Y_{1} - Y_{0}|X=1]

\Delta_{ATC} = E[Y_{1} - Y_{0}|X=0]

One way to interpret the ATC is as a measure of the effect of treating only the samples that would not naturally be treated, and vice versa for the ATT. Depending on your use case, these may be more natural measures of what you care about. All of them can be estimated using the techniques that follow.
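To make the inequality concrete, here is a small hypothetical simulation (purely illustrative, not one of the accompanying data generators) in which treatment assignment depends on a confounder, so the naive estimate differs from, and even has the opposite sign to, the true ATE:

rng = np.random.RandomState(1)
n = 100000

z = rng.binomial(1, 0.5, size=n)                   # confounder ("skill")
x = rng.binomial(1, np.where(z == 1, 0.2, 0.8))    # skilled people rarely get treated
y0 = rng.binomial(1, np.where(z == 1, 0.8, 0.3))   # untreated outcome depends on skill
y1 = rng.binomial(1, np.where(z == 1, 0.9, 0.5))   # treated outcome depends on skill
y = np.where(x == 1, y1, y0)

true_ate = (y1 - y0).mean()                  # uses both potential outcomes
naive = y[x == 1].mean() - y[x == 0].mean()  # uses only the observed data
print("true ATE:", true_ate, "naive estimate:", naive)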


Making assumptions

When we run an A/B test, we randomize the assignment of $X$. This has the effect of letting us choose which of the variables $Y_{1}$ or $Y_{0}$ is revealed to us, making the outcome independent of the value of $X$.

We write this as:

Y_{1}, Y_{0} \mathrel{\unicode{x2AEB}} X

That is, the distribution of $ X, Y_ {0}, Y_ {1} $ is factored as follows:

P(X, Y_{0}, Y_{1}) = P(X)P(Y_{0}, Y_{1})

If this independence is guaranteed, then:

E[Y_{1}|X=1] = E[Y_{1}]

If we want to estimate the ATE using observational data, we need to use other information we have about the samples. Specifically, we need to **assume** that we have enough additional information to completely explain the treatment choice of each subject.

Calling that additional information the random variable $ Z $, this assumption can be written as:

Y_{1}, Y_{0} \mathrel{\unicode{x2AEB}} X \, | \, Z

Or

P(X, Y_{0}, Y_{1}| Z) = P(X|Z)P(Y_{0}, Y_{1}|Z)

This means that the observed treatment $X$ that a sample receives is completely explained by $Z$. This is sometimes called the Ignorability assumption (https://en.wikipedia.org/wiki/Ignorability).

In our motivating cool-hat example, this means there is some other factor (let's call it "skill") that affects both a person's productivity and whether or not they wear a cool hat. In the example above, skilled people are more productive and less likely to wear cool hats. Together, these facts can explain why the effect of cool hats appeared to reverse when we ran the A/B test.

If we split the data by whether or not a person is skilled, we can see that within each subgroup there is a positive relationship between wearing a cool hat and productivity.

observed_data_0_with_confounders = dg.generate_dataset_0(show_z=True)

print(estimate_uplift(observed_data_0_with_confounders.loc[lambda df: df.z == 0]))
print(estimate_uplift(observed_data_0_with_confounders.loc[lambda df: df.z == 1]))
{'estimated_effect': 0.1647058823529412, 'standard_error': 0.15924048578984673}
{'estimated_effect': 0.14041297935103236, 'standard_error': 0.17328031714939449}

Unfortunately, because we can never observe both $Y_{0}$ and $Y_{1}$ for the same sample, we cannot test the assumption:

Y_{1}, Y_{0} \mathrel{\unicode{x2AEB}} X \, | \, Z

It can only be assessed using our knowledge of the system we are investigating.

The quality of any prediction we make depends on how well this assumption holds. Simpson's Paradox (http://www.degeneratestate.org/posts/2017/Oct/22/generating-examples-of-simpsons-paradox/) is an extreme example of the fact that, if $Z$ does not contain all the confounding variables, the inferences we make can be wrong. Facebook has published a good paper comparing different causal inference approaches against direct A/B tests, showing how effects are overestimated when conditional independence does not hold.

Once we have made this assumption, there are a number of techniques for approaching the problem. The rest of this post outlines some of the simpler ones, but keep in mind that this is still an area of active research.

Modeling the counterfactual

From the above, it is clear that if we knew $Y_{0}$ and $Y_{1}$ we could estimate the ATE. So why not try to model them directly?

Specifically, we can build the following estimators:

\hat{Y}_{0}(Z) = E[Y|Z, X=0]

\hat{Y}_{1}(Z) = E[Y|Z, X=1]

If we can model these two quantities well, we can estimate the ATE as:

\hat{\Delta} = \frac{1}{N}\sum_{i}\left(\hat{Y}_{1}(z_{i}) - \hat{Y}_{0}(z_{i})\right)
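As a sketch of what such an estimator could look like (my own illustration, not how the causalinference package used later works internally), we could fit one regression model to the treated samples and one to the untreated samples, then average the difference of their predictions over the whole dataset:

from sklearn.linear_model import LinearRegression

def modelled_ate(ds):
    """Estimate the ATE by fitting a separate outcome model for each treatment group."""
    treated = ds[ds.x == 1]
    untreated = ds[ds.x == 0]

    model_1 = LinearRegression().fit(treated[["z"]], treated.y)      # estimates E[Y|Z, X=1]
    model_0 = LinearRegression().fit(untreated[["z"]], untreated.y)  # estimates E[Y|Z, X=0]

    # predicted counterfactual outcomes for every sample
    y1_hat = model_1.predict(ds[["z"]])
    y0_hat = model_0.predict(ds[["z"]])

    return (y1_hat - y0_hat).mean()

# for example, applied to one of the datasets generated below:
# modelled_ate(dg.generate_dataset_1())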

The success of this approach depends on how well the potential outcomes can be modeled. To see it in action, let's use the following data generating process:

observed_data_1 = dg.generate_dataset_1()

observed_data_1.plot.scatter(x="z", y="y", c="x", cmap="rainbow", colorbar=False);

[Figure: scatter plot of y against z, colored by treatment x]

Before diving into modeling the counterfactual, let's look at the data. Looking at how $Y$ is distributed, there appears to be only a small difference between the two groups:

sns.kdeplot(observed_data_1.loc[lambda df: df.x == 0].y, label="untreated")
sns.kdeplot(observed_data_1.loc[lambda df: df.x == 1].y, label="treated")

[Figure: kernel density estimates of y for the treated and untreated groups]

This can be confirmed by looking at the average difference between the two groups.

print("Observed ATE: {estimated_effect:.3f} ({standard_error:.3f})".format(**estimate_uplift(observed_data_1)))
Observed ATE: 0.113 (0.103)

However, if we look at the distribution of the covariate $Z$, we can see that there is a difference between the groups.

sns.kdeplot(observed_data_1.loc[lambda df: df.x == 0].z, label="untreated")
sns.kdeplot(observed_data_1.loc[lambda df: df.x == 1].z, label="treated")

[Figure: kernel density estimates of z for the treated and untreated groups]

This is a concern if we think that $Z$ has some influence on $Y$: we need a way to disentangle the effect of $X$ on $Y$ from the effect of $Z$ on $Y$.

We can check the actual ATE with our simulated A/B test and confirm that it differs from the observed value:

print("Real ATE:  {estimated_effect:.3f} ({standard_error:.3f})".format(**run_ab_test(dg.generate_dataset_1)))
Real ATE:  -0.479 (0.026)

But what if you can't run this A / B test? You need to model your system.

The simplest type of model is a linear model. Specifically, assume the following:

Y_{0} = \alpha + \beta Z + \epsilon

Y_{1} = Y_{0} + \gamma

If this is accurate, then fitting the model

Y = \alpha + \beta Z + \gamma X

to the data by linear regression gives an estimate of the ATE (the coefficient $\gamma$).
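As a sketch of doing this by hand (assuming statsmodels is available; the causalinference call below is the route the post actually takes), we could fit the regression and read the ATE off the coefficient of $x$:

import statsmodels.api as sm

# design matrix with an intercept, the covariate z and the treatment x
X_design = sm.add_constant(observed_data_1[["z", "x"]])
ols_fit = sm.OLS(observed_data_1.y, X_design).fit()

# the coefficient on x is the estimate of gamma, i.e. the ATE under this model
print(ols_fit.params["x"], ols_fit.bse["x"])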

The causalinference package provides a simple interface for doing this.

from causalinference import CausalModel

cm = CausalModel(
    Y=observed_data_1.y.values, 
    D=observed_data_1.x.values, 
    X=observed_data_1.z.values)

cm.est_via_ols(adj=1)

print(cm.estimates)
Treatment Effect Estimates: OLS

                      Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE     -0.488      0.032    -15.122      0.000     -0.551     -0.425

causalinference returns an estimate of the ATE along with some statistical properties of that estimate. It is important to recognize that the confidence intervals reported are the confidence intervals assuming the model accurately describes the counterfactual; they say nothing about how well the model actually describes the counterfactual.

In this case the package has succeeded in identifying the correct ATE, which is good, but the data generating process was specifically designed to satisfy the model's assumptions. Let's look at a few cases where it can fail.

The first case is when the effect is not simply additive.

observed_data_2 = dg.generate_dataset_2()

observed_data_2.plot.scatter(x="z", y="y", c="x", cmap="rainbow", colorbar=False)

print("Observed ATE: {estimated_effect:.3f} ({standard_error:.3f})".format(**estimate_uplift(observed_data_2)))
print("Real ATE:  {estimated_effect:.3f} ({standard_error:.3f})".format(**run_ab_test(dg.generate_dataset_2)))
Observed ATE: 0.689 (0.104)
Real ATE:  0.563 (0.031)

[Figure: scatter plot of y against z for dataset 2, colored by treatment x]

cm = CausalModel(
    Y=observed_data_2.y.values, 
    D=observed_data_2.x.values, 
    X=observed_data_2.z.values)

cm.est_via_ols(adj=1)

print(cm.estimates)
Treatment Effect Estimates: OLS

                      Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE      0.246      0.088      2.805      0.005      0.074      0.419

This can often be overcome by using more powerful estimators. One simple, non-parametric approach is Matching (https://en.wikipedia.org/wiki/Matching_%28statistics%29). The idea is to find, for each treated sample, similar untreated samples and compare their values directly. Exactly what you mean by "similar" depends on your specific use case.

The causalinference package implements matching by selecting, for each unit, the most similar unit from the other treatment group (with replacement) and using the difference between these two units to compute the ATE. By default, the matching choice is made using nearest neighbours in the covariate space $Z$, with distances weighted by the inverse variance of each dimension.

There are options to change the number of units compared and the weighting of each dimension in the match. See the documentation (http://laurence-wong.com/software/matching) for more details.
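To make the idea concrete, here is a minimal sketch of one-nearest-neighbour matching with replacement (my own simplified illustration for a single covariate, not the package's exact algorithm, which includes additional options and bias corrections):

def matching_ate(ds):
    """A simplified 1-nearest-neighbour matching estimate of the ATE."""
    treated = ds[ds.x == 1]
    untreated = ds[ds.x == 0]

    def nearest_outcome(z_values, pool):
        # index of the closest unit in `pool` for each value of z
        idx = np.abs(pool.z.values[None, :] - z_values[:, None]).argmin(axis=1)
        return pool.y.values[idx]

    # imputed counterfactual outcomes
    y0_for_treated = nearest_outcome(treated.z.values, untreated)
    y1_for_untreated = nearest_outcome(untreated.z.values, treated)

    effects = np.concatenate([
        treated.y.values - y0_for_treated,
        y1_for_untreated - untreated.y.values,
    ])
    return effects.mean()

matching_ate(observed_data_2)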

The matching estimate can be calculated with the following code.

cm = CausalModel(
    Y=observed_data_2.y.values, 
    D=observed_data_2.x.values, 
    X=observed_data_2.z.values)

cm.est_via_matching()

print(cm.estimates)
Treatment Effect Estimates: Matching

                      Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE      0.480      0.148      3.239      0.001      0.190      0.770
           ATC      0.973      0.142      6.854      0.000      0.695      1.251
           ATT      0.036      0.211      0.169      0.866     -0.379      0.450

The confidence interval for the estimate contains the true ATE.

Covariate imbalance

A harder problem to deal with is covariate imbalance: when the covariate space contains regions with only treated or only untreated samples. Here we have to extrapolate the effect of the treatment, and the result depends heavily on the model we assume.

The example below demonstrates this.

observed_data_3 = dg.generate_dataset_3()

observed_data_3.plot.scatter(x="z", y="y", c="x", cmap="rainbow", colorbar=False)

# actual response curves
z = np.linspace(0,1,100)
y0 =  np.where(z >= 0.4, -4*(z - 0.4), 0)
y1 =  np.where(z < 0.6,  -4*(z - 0.6), 0) + 1
plt.plot(z,y0, "b")
plt.plot(z,y1, "r")


print("Observed ATE: {estimated_effect:.3f} ({standard_error:.3f})".format(**estimate_uplift(observed_data_3)))
print("Real ATE:  {estimated_effect:.3f} ({standard_error:.3f})".format(**run_ab_test(dg.generate_dataset_3)))
Observed ATE: 1.362 (0.080)
Real ATE:  2.449 (0.033)

[Figure: scatter plot of y against z for dataset 3, colored by treatment x, with the true response curves overlaid]

# OLS estimator
cm = CausalModel(
    Y=observed_data_3.y.values, 
    D=observed_data_3.x.values, 
    X=observed_data_3.z.values)

cm.est_via_ols()

print(cm.estimates)
Treatment Effect Estimates: OLS

                      Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE      1.969      0.055     35.707      0.000      1.861      2.078
           ATC      1.971      0.056     35.117      0.000      1.861      2.081
           ATT      1.968      0.073     26.923      0.000      1.825      2.112
# Matching estimator
cm = CausalModel(
    Y=observed_data_3.y.values, 
    D=observed_data_3.x.values, 
    X=observed_data_3.z.values)

cm.est_via_matching()

print(cm.estimates)
Treatment Effect Estimates: Matching

                      Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE      1.913      0.156     12.269      0.000      1.607      2.218
           ATC      1.965      0.277      7.093      0.000      1.422      2.509
           ATT      1.876      0.177     10.626      0.000      1.530      2.222

The OLS estimator fails to capture the true effect. The matching estimator improves things a bit, but there is not enough information in the data to extrapolate fully into the non-overlapping regions.

This example might seem contrived, but once you start looking at higher-dimensional covariates the problem becomes much more common.

causalinference provides a convenient tool for quickly assessing covariate overlap: the summary_stats property.

print(cm.summary_stats)
Summary Statistics

                       Controls (N_c=206)         Treated (N_t=294)             
       Variable         Mean         S.d.         Mean         S.d.     Raw-diff
--------------------------------------------------------------------------------
              Y       -0.151        0.438        1.210        0.467        1.362

                       Controls (N_c=206)         Treated (N_t=294)             
       Variable         Mean         S.d.         Mean         S.d.     Nor-diff
--------------------------------------------------------------------------------
             X0        0.267        0.185        0.681        0.205        2.116

Here, the normalized difference is defined as:

\text{Nor-diff} = \frac{\bar{X}_{T} - \bar{X}_{C}}{\sqrt{\left(s_{T}^{2} + s_{C}^{2}\right) / 2}}

This is not a rigorous statistical test, but it does give some indication of how much overlap there is in each covariate. Values greater than 1 suggest there is not much overlap.
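As a quick manual check (under my assumption that this is the formula the package uses), we can compute the same quantity directly from the data and compare it with the Nor-diff reported above:

treated_z = observed_data_3.loc[observed_data_3.x == 1, "z"]
control_z = observed_data_3.loc[observed_data_3.x == 0, "z"]

nor_diff = (treated_z.mean() - control_z.mean()) / np.sqrt(
    (treated_z.var() + control_z.var()) / 2)

nor_diff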

Propensity score

The propensity score is an estimate of how likely a subject is to receive the treatment, given the covariates:

\hat{p}(Z) = P(X|Z)

You can estimate it however you like, but once you have it, there are a number of things you can do with it.

Inverse propensity score weighting

Remember that the problem of causal inference is that we want to know the quantity $E[Y_{i}]$, but we only have access to samples from $E[Y_{i}|X=i]$.

The probability of potential outcome can be expanded as follows:

P(Y_{i}) = P(Y_{i}| X = i)P(X = i)

which suggests that it can be estimated as:

E[Y_{i}] = E[\frac{Y_{i}}{P(X=i|Z)}P(X=i|Z)] = E[\frac{Y_{i}}{P(X=i|Z)}|X=i, Z]

Therefore, we can recover the potential outcomes by weighting each point by its inverse propensity score. The result is the Inverse Propensity Score Weighted estimator (https://en.wikipedia.org/wiki/Inverse_probability_weighting):

\Delta_{IPS} = \frac{1}{N}\left(\sum_{i \in 1} \frac{y_{i}}{\hat{p}(z_{i})} - \sum_{i \in 0} \frac{y_{i}}{1 - \hat{p}(z_{i})}\right)

Let's see how it does on one of our previous datasets.

observed_data_1 = dg.generate_dataset_1()

observed_data_1.plot.scatter(x="z", y="y", c="x", cmap="rainbow", colorbar=False)

print("Observed ATE: {estimated_effect:.3f} ({standard_error:.3f})".format(**estimate_uplift(observed_data_1)))
print("Real ATE:  {estimated_effect:.3f} ({standard_error:.3f})".format(**run_ab_test(dg.generate_dataset_1)))
Observed ATE: 0.147 (0.102)
Real ATE:  -0.490 (0.025)

[Figure: scatter plot of y against z, colored by treatment x]

We can use the `est_propensity_s` and `est_propensity` methods of the CausalInference package to estimate the propensity score using logistic regression on the covariates.

cm = CausalModel(
    Y=observed_data_1.y.values, 
    D=observed_data_1.x.values, 
    X=observed_data_1.z.values)

cm.est_propensity_s()

propensity = cm.propensity["fitted"]

df = observed_data_1

df["ips"] = np.where(
    df.x == 1, 
    1 / propensity,
    1 / (1 - propensity))
df["ipsw"] = df.y * df.ips

ipse = (
      df[df.x == 1]["ipsw"].sum() 
    - df[df.x == 0]["ipsw"].sum()
) / df.shape[0]

ipse
-0.4613297031604915

For the data generator used in this example, the relationship can be described well by a simple logistic regression. However, if we used sklearn's logistic regression function (which applies regularization by default) to estimate the propensity score, we would get the wrong answer:


from sklearn.linear_model import LogisticRegression

lg = LogisticRegression()
X = df.z.values.reshape(-1,1)
y = df.x.values
lg.fit(X,y)

propensity = lg.predict_proba(X)[:,1]

df["ips"] = np.where(
    df.x == 1, 
    1 / propensity,
    1 / (1 - propensity))
df["ipsw"] = df.y * df.ips

ipse = (
      df[df.x == 1]["ipsw"].sum() 
    - df[df.x == 0]["ipsw"].sum()
) / df.shape[0]

ipse
-0.3291860760196773

This is better than our naive estimate, but still not correct.

Doubly robust weighted estimator

We can combine the inverse propensity score weighting estimator with the linear estimator of effect size to try to reduce the flaws of either model. This is done by running a weighted linear regression on the data, with each point weighted by its inverse propensity score. The result is the doubly robust weighted estimator.

The idea is that, because there is a bias in which samples get treated in observational data, the samples that were treated despite being unlikely to be treated are more informative and should be given more weight.
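A rough sketch of the idea (my own illustration, assuming statsmodels is available; not necessarily what causalinference's est_via_weighting does internally): estimate the propensity score with an unregularized logistic regression, then use the inverse propensity scores as weights in a weighted least squares fit of the linear model from before.

import statsmodels.api as sm

def doubly_robust_sketch(ds):
    """Weighted linear regression with inverse propensity score weights (illustrative only)."""
    # estimate the propensity score with an unregularized logistic regression
    z_design = sm.add_constant(ds[["z"]])
    propensity = sm.Logit(ds.x, z_design).fit(disp=0).predict(z_design)

    weights = np.where(ds.x == 1, 1 / propensity, 1 / (1 - propensity))

    # weighted least squares fit of y ~ z + x; the coefficient on x estimates the ATE
    y_design = sm.add_constant(ds[["z", "x"]])
    wls_fit = sm.WLS(ds.y, y_design, weights=weights).fit()
    return wls_fit.params["x"], wls_fit.bse["x"]

doubly_robust_sketch(observed_data_1)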

With causalinference, we can apply it as follows:

observed_data_1 = dg.generate_dataset_1()

observed_data_1.plot.scatter(x="z", y="y", c="x", cmap="rainbow", colorbar=False)

print("Observed ATE: {estimated_effect:.3f} ({standard_error:.3f})".format(**estimate_uplift(observed_data_1)))
print("Real ATE:  {estimated_effect:.3f} ({standard_error:.3f})".format(**run_ab_test(dg.generate_dataset_1)))
Observed ATE: 0.145 (0.106)
Real ATE:  -0.494 (0.026)

output_50_1.png

cm = CausalModel(
    Y=observed_data_1.y.values, 
    D=observed_data_1.x.values, 
    X=observed_data_1.z.values)

cm.est_propensity_s()
cm.est_via_weighting()

print(cm.estimates)
Treatment Effect Estimates: Weighting

                      Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE     -0.521      0.034    -15.344      0.000     -0.588     -0.455
cm.estimates
{'weighting': {'ate': -0.5214252844257842, 'ate_se': 0.03398288940994425}}

Unconfoundedness and propensity score

In the previous sections, we assumed that the outcomes and the treatment are independent given the covariates:

Y_{1}, Y_{0} \mathrel{\unicode{x2AEB}} X \, | \,Z

We can also make a slightly stronger assumption: that the outcomes are independent of the treatment, conditioned on the propensity score:

Y_{1}, Y_{0} \mathrel{\unicode{x2AEB}} X \, | \,\hat{p}(Z)

With this assumption, we reduce the dimensionality of the confounding variables, which makes it possible to use several techniques that might not work in higher-dimensional settings.

Trimming

We have seen that covariate imbalance can cause problems. A simple solution is to only make counterfactual predictions in regions where there is good overlap, or to "trim" points outside regions of good overlap. For high-dimensional data it can be difficult to define "good overlap"; using the propensity score to define overlap is one way around this.
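As a sketch of the idea (the propensity scores and the cutoff value here are hypothetical inputs; causalinference chooses its own cutoff, as shown below), trimming simply keeps the samples whose estimated propensity scores are not too extreme:

def trim_by_propensity(ds, propensity, cutoff=0.1):
    """Keep only samples whose estimated propensity score lies in (cutoff, 1 - cutoff)."""
    mask = (propensity > cutoff) & (propensity < 1 - cutoff)
    return ds[mask].copy()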

Let's look at a dataset with less overlap.

observed_data_3 = dg.generate_dataset_3()

observed_data_3.plot.scatter(x="z", y="y", c="x", cmap="rainbow", colorbar=False)

# actual response curves
z = np.linspace(0,1,100)
y0 =  np.where(z >= 0.4, -4*(z - 0.4), 0)
y1 =  np.where(z < 0.6,  -4*(z - 0.6), 0) + 1
plt.plot(z,y0, "b")
plt.plot(z,y1, "r")


print("Observed ATE: {estimated_effect:.3f} ({standard_error:.3f})".format(**estimate_uplift(observed_data_3)))
print("Real ATE:  {estimated_effect:.3f} ({standard_error:.3f})".format(**run_ab_test(dg.generate_dataset_3)))
Observed ATE: 1.361 (0.080)
Real ATE:  2.437 (0.033)

[Figure: scatter plot of y against z for dataset 3, colored by treatment x, with the true response curves overlaid]

CausalInference provides a way to trim the data based on the propensity score:

# OLS estimator
cm = CausalModel(
    Y=observed_data_3.y.values, 
    D=observed_data_3.x.values, 
    X=observed_data_3.z.values)

cm.est_propensity_s()
cm.trim_s()
cm.est_via_matching()

print(cm.estimates)
Treatment Effect Estimates: Matching

                      Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE      2.068      0.127     16.350      0.000      1.820      2.316
           ATC      2.249      0.229      9.843      0.000      1.802      2.697
           ATT      1.920      0.102     18.739      0.000      1.719      2.121

You can see the rest of the data as follows:

# mask out data ignored by the model
propensity = cm.propensity["fitted"]
cutoff = cm.cutoff
mask = (propensity > cutoff) &  (propensity < 1 - cutoff)

# plot the data
observed_data_3[mask].plot.scatter(x="z", y="y", c="x", cmap="rainbow", colorbar=False)

# actual response curves
z = np.linspace(0,1,100)
y0 =  np.where(z >= 0.4, -4*(z - 0.4), 0)
y1 =  np.where(z < 0.6,  -4*(z - 0.6), 0) + 1
plt.plot(z,y0, "b")
plt.plot(z,y1, "r")

# filter out data in regions we have trimmed when we calculate the true uplift
filter_ = lambda df: (df.z > 0.2) & (df.z < 0.7)

print("Observed ATE: {estimated_effect:.3f} ({standard_error:.3f})".format(**estimate_uplift(observed_data_3[mask])))
print("Real ATE:  {estimated_effect:.3f} ({standard_error:.3f})".format(
    **run_ab_test(dg.generate_dataset_3, filter_= filter_)))
Observed ATE: 1.708 (0.125)
Real ATE:  1.984 (0.031)

[Figure: scatter plot of the trimmed data, colored by treatment x, with the true response curves overlaid]

This works pretty well in such cases.

When we apply trimming, we are explicitly saying that causal inference is only possible for samples in some regions of the covariate space; for samples outside these regions, nothing can be said about the ATE.

Stratification

Another use for propensity scores is in stratification, or blocking, estimators. These group the data points into strata of similar propensity and estimate the ATE within each stratum.
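A rough manual version of the idea (illustrative only; the propensity scores passed in are assumed to have been estimated already) is to bin the samples by propensity score, estimate a difference in means within each bin, and average the within-bin estimates weighted by bin size:

def stratified_ate_sketch(ds, propensity, n_strata=5):
    """Estimate the ATE by averaging within-stratum differences in means (illustrative only)."""
    strata = pd.qcut(propensity, n_strata, labels=False)
    estimates, sizes = [], []
    for stratum in range(n_strata):
        subset = ds[strata == stratum]
        if subset.x.nunique() < 2:
            continue  # skip strata containing only one treatment group
        estimates.append(subset[subset.x == 1].y.mean() - subset[subset.x == 0].y.mean())
        sizes.append(len(subset))
    return np.average(estimates, weights=sizes)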

Again, CausalInference provides a nice interface for this. Use the stratify (for user-defined stratum boundaries) or stratify_s (to choose the boundaries automatically) method to determine the strata:

observed_data_1 = dg.generate_dataset_1()

cm = CausalModel(
    Y=observed_data_1.y.values, 
    D=observed_data_1.x.values, 
    X=observed_data_1.z.values)

cm.est_propensity_s()
cm.stratify_s()

print(cm.strata)
Stratification Summary

              Propensity Score         Sample Size     Ave. Propensity   Outcome
   Stratum      Min.      Max.  Controls   Treated  Controls   Treated  Raw-diff
--------------------------------------------------------------------------------
         1     0.093     0.260       108        18     0.163     0.182    -0.544
         2     0.266     0.327        23         9     0.296     0.292    -0.639
         3     0.327     0.411        14        17     0.368     0.376    -0.441
         4     0.412     0.555        31        31     0.473     0.490    -0.428
         5     0.557     0.769        45        80     0.661     0.667    -0.499
         6     0.770     0.910        18       106     0.838     0.859    -0.536

Then use the `est_via_blocking` method to combine the estimates for these strata into one overall ATE:

cm.est_via_blocking()
print(cm.estimates)
Treatment Effect Estimates: Blocking

                      Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE     -0.559      0.033    -16.722      0.000     -0.625     -0.494
           ATC     -0.571      0.039    -14.797      0.000     -0.646     -0.495
           ATT     -0.548      0.038    -14.304      0.000     -0.624     -0.473

This also works well.

Stratifying the data into groups by propensity score is useful when you have no prior knowledge of what constitutes a "similar" unit, but it is not the only way. If you know in advance that different groups of samples are likely to respond to the intervention in similar ways, it makes sense to split the samples into these groups, estimate the ATE within each group, and then pool the results to get an overall ATE.

Which method to use?

This covers most of the common methods for causal inference from observational data. The remaining question is: how do you decide which method to use? This is not a simple question. There are automated techniques, like this paper, but I haven't tried them.

Ultimately, to choose a technique you need to make some assumptions about how you construct the counterfactual. If you trust your data to have good overlap in covariate space, matching is a good approach. If it doesn't, you need to either use a model you trust to extrapolate successfully into the unexplored regions, or assume that something like the propensity score captures enough information to make ignorability hold.

To emphasize that all of these methods can fail, here is one more example. Unlike the previous examples, it has more than one covariate. Like all the previous data generators, this one also obeys the following assumption by design:

Y_{1}, Y_{0} \mathrel{\unicode{x2AEB}} X \, | \,Z

Let's blindly try the methods we've discussed so far and see what happens.

data_gen = dg.generate_exercise_dataset_2
ds = data_gen()

print("Observed ATE: {estimated_effect:.3f} ({standard_error:.3f})".format(**estimate_uplift(ds)))
print("Real ATE:  {estimated_effect:.3f} ({standard_error:.3f})".format(**run_ab_test(data_gen)))

zs = [c for c in ds.columns if c.startswith("z")]

cm = CausalModel(
    Y=ds.y.values, 
    D=ds.x.values, 
    X=ds[zs].values)

cm.est_via_ols()
cm.est_via_matching()
cm.est_propensity_s()
cm.est_via_weighting()

cm.stratify_s()
cm.est_via_blocking()

print(cm.estimates)
Observed ATE: -0.423 (0.711)
Real ATE:  4.570 (0.184)

Treatment Effect Estimates: OLS

                      Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE     -0.165      0.363     -0.455      0.649     -0.878      0.547
           ATC      0.438      0.402      1.088      0.276     -0.350      1.226
           ATT     -0.468      0.384     -1.218      0.223     -1.220      0.285

Treatment Effect Estimates: Matching

                      Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE      1.241      0.581      2.137      0.033      0.103      2.379
           ATC      1.314      0.695      1.890      0.059     -0.049      2.677
           ATT      1.204      0.667      1.804      0.071     -0.104      2.512

Treatment Effect Estimates: Weighting

                      Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE      3.190      0.478      6.667      0.000      2.252      4.128

Treatment Effect Estimates: Blocking

                      Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE     -0.003      0.378     -0.009      0.993     -0.745      0.738
           ATC     -0.003      0.378     -0.009      0.993     -0.745      0.738
           ATT     -0.003      0.378     -0.009      0.993     -0.745      0.738
y = []
yerr = []
x_label = []

for method, result in dict(cm.estimates).items():
    y.append(result["ate"])
    yerr.append(result["ate_se"])
    x_label.append(method)
    
x = np.arange(len(y))

plt.errorbar(x=x, y=y, yerr=yerr, linestyle="none", capsize=5, marker="o")
plt.xticks(x, x_label)
plt.title("Estimated Effect Size", fontsize=18)
plt.hlines(4.6, -0.5, 3.5, linestyles="dashed")
plt.xlim(-0.5,3.5);

[Figure: error-bar plot of the estimated effect size for each method, with the true ATE shown as a dashed line]

The dashed line shows the true ATE for this dataset.

Not only are the results of each method different, but all methods miss the true value.

This should serve as a warning about the limitations of this type of approach. It could be an interesting exercise for the reader to work out which properties of the dataset cause these techniques to miss the true value.

Structure of causal reasoning

Hopefully, by this point you will appreciate the importance of the ignorability assumption:

Y_{1}, Y_{0} \mathrel{\unicode{x2AEB}} X \, | \,Z

What I haven't talked about so far is how to choose $Z$ so that this assumption holds. Ultimately, that needs to come from domain knowledge of the system being studied. There is a powerful set of tools called Causal Graphical Models (https://en.wikipedia.org/wiki/Causal_graph) that can be used to encode knowledge about the system under study as a graphical model, and to reason about the conditional independence assumptions described above.

The next post in this series on causal inference will discuss these.

Another question raised by this post is whether adjusting for confounding variables is the only way to make causal inferences. It is not: in a later post we will look at techniques that can be used in other situations.

Code

The notes for this post can be found on github here.
