"Introduction to Effect Verification" Chapter 3 written in Python

Introduction

This post reproduces the source code of "Introduction to Effect Verification: Causal Inference for Correct Comparison / Basics of Econometrics" in Python.

There are already implementation examples by others who came before me, but I am leaving this as a memo of my own study.

This article covers Chapter 3. The code is also posted on GitHub. Variable names and processing basically follow the book.

Logistic regression

Logistic regression can be implemented with scikit-learn or statsmodels.

scikit-learn

With scikit-learn, categorical variables need to be converted to dummy variables first.

Logistic regression with sklearn


import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.get_dummies(biased_data[['recency', 'history', 'channel']])  # make channel a dummy variable
treatment = biased_data['treatment']

model = LogisticRegression().fit(X, treatment)
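
The propensity scores themselves are the predicted probabilities of treatment; a minimal usage sketch:

ps = model.predict_proba(X)[:, 1]  # second column is P(treatment = 1), i.e. the propensity score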

statsmodels

Use the glm or logit functions.

statsmodels does not require you to create dummies yourself, but it seems that some category values may not appear in the model. (channel = Multichannel is not used in the example below; presumably the formula interface treats it as the reference category.) If this bothers you, create the dummies yourself and the result will match sklearn's.

Logistic regression with statsmodels


import statsmodels.api as sm
from statsmodels.formula.api import glm, logit

model = glm('treatment ~ history + recency + channel', data=biased_data, family=sm.families.Binomial()).fit()
# model = logit('treatment ~ history + recency + channel', data=biased_data).fit()  # same result as above
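
You can check which terms actually entered the model by inspecting the fitted parameters. My understanding is that patsy's treatment coding absorbs one reference level into the intercept, which would explain the missing level:

print(model.params)  # channel[T.Multichannel] is absent; it serves as the reference level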

Estimation using propensity scores

"Propensity score matching" and "IPW", which are taken up as estimation methods using propensity scores, can be implemented at DoWhy.

Before running each analysis, define a common CausalModel instance for the causal inference and identify the estimand (the identified_estimand used by every estimate_effect call below). Note that the treatment variable needs to be cast to bool.

Preparation


from dowhy import CausalModel

biased_data = biased_data.astype({"treatment": 'bool'}, copy=False)  # cast treatment to bool

model = CausalModel(
    data=biased_data,
    treatment='treatment',
    outcome='spend',
    common_causes=['recency', 'history', 'channel']
)
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)  # estimand referenced by each estimate_effect call below

Propensity score matching

Nearest neighbor matching

Nearest neighbor matching, the method explained in the book, matches each treated sample 1:1 with the untreated sample whose propensity score is most similar. The book estimates only the ATT, but DoWhy can also estimate the ATE; note that ATE is the default. Since the internal implementation computes both anyway, it would be nice if both could be retrieved at once.

This ATE appears to be calculated from the ATT and the ATC (the expected effect of the intervention on the non-intervened samples).
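
My reading of the implementation (a sketch of the relation, not a quote from the source) is the sample-size-weighted average of the two: ATE = (N_treated * ATT + N_control * ATC) / (N_treated + N_control).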

Nearest neighbor matching


nearest_ate = model.estimate_effect(
    identified_estimand, 
    method_name="backdoor.propensity_score_matching",
    target_units='ate',
)
nearest_att = model.estimate_effect(
    identified_estimand, 
    method_name="backdoor.propensity_score_matching",
    target_units='att', 
)

print(f'ATE: {nearest_ate.value}')
print(f'ATT: {nearest_att.value}')

Stratification

Although not covered in the book, propensity score stratification is also implemented.

Looking at the source code, the data seems to be split into num_strata strata of equal size and the effect estimated within each stratum. clipping_threshold is compared against the number of treated and untreated cases in each stratum, and strata falling below it are excluded (probably to drop strata containing extremely few treated or untreated samples). A sketch of this logic follows below.
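
As a reading aid, here is a rough sketch of that logic as I understand it (illustrative only, not DoWhy's actual code; it assumes propensity scores are available, e.g. via nearest_ate.propensity_scores):

import numpy as np
import pandas as pd

num_strata, clipping_threshold = 50, 5
df = biased_data.assign(ps=nearest_ate.propensity_scores)
df['strata'] = pd.qcut(df['ps'], q=num_strata, labels=False, duplicates='drop')  # equal-sized strata by propensity score

effects, sizes = [], []
for _, stratum in df.groupby('strata'):
    treated, control = stratum[stratum['treatment']], stratum[~stratum['treatment']]
    if min(len(treated), len(control)) <= clipping_threshold:
        continue  # drop strata with too few treated or control samples
    effects.append(treated['spend'].mean() - control['spend'].mean())
    sizes.append(len(stratum))

print(np.average(effects, weights=sizes))  # ATE: stratum effects weighted by stratum size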

Stratification


stratification_ate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.propensity_score_stratification",
    target_units='ate',
    method_params={'num_strata':50, 'clipping_threshold':5},
)
stratification_att = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.propensity_score_stratification",
    target_units='att',
    method_params={'num_strata':50, 'clipping_threshold':5},
)

print(f'ATE: {stratification_ate.value}')
print(f'ATT: {stratification_att.value}')

IPW

IPW (inverse probability weighting) estimates the effect by weighting each sample with the inverse of the probability of the treatment status it actually received.

IPW


ipw_estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.propensity_score_weighting",
    target_units='ate',
)

print(f'ATE: {ipw_estimate.value}')
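
For reference, a minimal sketch of what the IPW estimator computes (illustrative, not DoWhy's actual code; it reuses the propensity scores from the estimate above):

import numpy as np

df = biased_data.assign(ps=ipw_estimate.propensity_scores)
treated, control = df[df['treatment']], df[~df['treatment']]
# Weight outcomes by inverse propensity: 1/ps for treated, 1/(1-ps) for controls
ate = np.average(treated['spend'], weights=1 / treated['ps']) \
      - np.average(control['spend'], weights=1 / (1 - control['ps']))
print(ate)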

(Bonus) Regression analysis

The regression analysis performed in Chapter 2 is also possible.

Regression analysis


estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression", #test_significance=True
)

print('Causal Estimate is ' + str(estimate.value))

Visualization of standardized mean difference

Visualizing standardized mean differences to check covariate balance is not supported, so, tedious as it is, we calculate them ourselves.

Propensity score matching

It seems you cannot retrieve which samples were matched by DoWhy's propensity score matching, so we implement the matching ourselves. The implementation is modeled on DoWhy's own matching code.

I did not know how the distance should be defined and computed it ad hoc, so do not rely on that part.

Standardized mean difference for propensity score matching


import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Compute ASAM (the absolute standardized mean difference of each covariate)
def calculate_asam(data, treatment_column, columns):
  treated_index = data[treatment_column]
  data = pd.get_dummies(data[columns])

  asam = data.apply(lambda c: abs(c[treated_index].mean() - c[~treated_index].mean()) / c.std())
  asam['distance'] = np.sqrt(np.sum(asam**2))  # unsure of this definition
  return asam

# Propensity score matching (1-nearest-neighbor)
def get_matching_data(data, treatment_column, propensity_score):
  data = data.copy()  # avoid mutating the caller's frame
  data['ps'] = propensity_score
  treated = data[data[treatment_column] == 1]
  control = data[data[treatment_column] == 0]

  # For each treated sample, find the control sample with the nearest propensity score
  control_neighbors = NearestNeighbors(n_neighbors=1, algorithm='ball_tree').fit(control['ps'].values.reshape(-1, 1))
  distances, indices = control_neighbors.kneighbors(treated['ps'].values.reshape(-1, 1))
  # A matching threshold (caliper) could be applied using distances (not supported here)

  matching_data = pd.concat([treated, control.iloc[indices.flatten()]])
  return matching_data

matching_data = get_matching_data(biased_data[['treatment', 'recency', 'channel', 'history']], 'treatment', nearest_ate.propensity_scores)
unadjusted_asam = calculate_asam(biased_data, 'treatment', ['recency', 'history', 'channel'])
matching_adjusted_asam = calculate_asam(matching_data, 'treatment', ['recency', 'history', 'channel'])

IPW

DescrStatsW makes it easy to calculate weighted statistics. However, its unweighted values seem to differ slightly from numpy's results, so handle it with care.
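
I have not pinned down the cause, but the degrees-of-freedom convention is the obvious suspect; a quick check (ddof=0 is DescrStatsW's documented default):

import numpy as np
from statsmodels.stats.weightstats import DescrStatsW

x = np.array([1.0, 2.0, 3.0, 4.0])
print(DescrStatsW(x).std)   # unweighted DescrStatsW, ddof=0 by default
print(np.std(x))            # numpy default is also ddof=0
print(np.std(x, ddof=1))    # sample standard deviation (ddof=1) differs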

IPW standardized mean difference


import numpy as np
import pandas as pd
from statsmodels.stats.weightstats import DescrStatsW

def calculate_asam_weighted(data, treatment_column, columns, propensity_score):
  data = pd.get_dummies(data[[treatment_column] + columns])
  data['ps'] = propensity_score
  # IPW weights: 1/ps for treated samples, 1/(1-ps) for controls
  data['weight'] = data[[treatment_column, 'ps']].apply(lambda x: 1 / x['ps'] if x[treatment_column] else 1 / (1 - x['ps']), axis=1)

  asam_dict = dict()
  for column_name, column_value in data.drop([treatment_column, 'ps', 'weight'], axis=1).items():  # iteritems() is deprecated
    treated_stats = DescrStatsW(column_value[data[treatment_column]], weights=data[data[treatment_column]]['weight'])
    control_stats = DescrStatsW(column_value[~data[treatment_column]], weights=data[~data[treatment_column]]['weight'])
    asam_dict[column_name] = abs(treated_stats.mean - control_stats.mean) / treated_stats.std

  asam = pd.Series(asam_dict)
  asam['distance'] = np.sqrt(np.sum(asam**2))  # unsure of this definition
  return asam

ipw_adjusted_asam = calculate_asam_weighted(biased_data, 'treatment', ['recency', 'history', 'channel'], ipw_estimate.propensity_scores)

Visualization

Visualization


%matplotlib inline
import matplotlib.pyplot as plt

plt.scatter(unadjusted_asam, unadjusted_asam.index, label='unadjusted')
plt.scatter(matching_adjusted_asam, matching_adjusted_asam.index, label='adjusted(matching)')
plt.scatter(ipw_adjusted_asam, ipw_adjusted_asam.index, label='adjusted(ipw)')
plt.vlines(0.1, 0, len(unadjusted_asam)-1, linestyles='dashed')  # 0.1 is a commonly used balance threshold
plt.legend()
plt.show()

[Figure: ASAM for each covariate, unadjusted vs. adjusted (matching) vs. adjusted (IPW), with a dashed line at 0.1]

IPW comes out somewhat better here, probably because nearest neighbor matching allows the same control sample to be matched repeatedly, so the matched data is biased toward a subset of samples.

Related

- I wrote "Introduction to Effect Verification" in Python
