[Python] [Causal discovery / causal inference] Running causal discovery (SAM) with deep learning

I recently studied **causal inference / causal discovery**, so this article describes a Python implementation example of SAM (Structural Agnostic Modeling) [2018], a **causal discovery** method based on deep learning / GANs. I would like to publish an implementation example of the best-known approach, Bayesian networks, in a separate article at a later date.

1. [Purpose](#1-purpose)
2. [What is a causal relationship?](#2-what-is-a-causal-relationship)
3. [What is SAM?](#3-what-is-sam)
4. [Implementation example](#4-implementation-example)
    - [4-2. Environment setup](#4-2-environment-setup)
    - [4-3. Running the code](#4-3-running-the-code)
5. [Conclusion](#5-conclusion)

1. Purpose

The purpose is to **discover the existence and direction of causal relationships** from data, as shown in the figure below. This task is called **causal discovery**.

Estimating the magnitude of these causal relationships (the arrows), on the other hand, is called **causal inference**.

2. What is a causal relationship?

First, let me briefly explain what a **causal relationship** is. If making variable X larger tends to make variable Y larger, we say **there is a causal relationship from variable X to variable Y**.

When you hear this term, some readers may wonder how it differs from correlation, so let me explain the difference. You may have heard the famous claim that **"the more chocolate a country consumes, the more Nobel laureates it produces"**. Looking at the figure, there does indeed seem to be a correlation between these two variables. Whether there is a **causal relationship** between them, however, is another matter.

In general, chocolate is a luxury good, so wealthier countries are likely to consume more of it. Wealthier countries also spend more money on education, which makes them more likely to produce Nobel laureates. Therefore, as shown in the figure below, a variable such as "GDP per capita", an indicator of wealth, is likely involved.

A correlation produced in this way is called a **spurious correlation**, and a variable that produces one is called a **confounding factor** (confounder). Variables whose correlation is spurious are **not said to have a causal relationship**. If you compute the correlation coefficient, however, it will be a fairly large positive value, so there **is** a correlation.

Let me summarize the terms again:

- Correlation: **when the value of variable X is large, variable Y also tends to be large**
- Causation: **making the value of variable X larger tends to make variable Y larger**
- Spurious correlation: variables X and Y are correlated but not causally related

The point about causation is that **making** X larger makes Y larger. In all three relationships the absolute value of the correlation coefficient takes a large value, so they cannot be distinguished by the correlation coefficient alone.
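The confounding story above can be simulated in a few lines. This is a toy sketch, not real data: `z` plays the role of a hypothetical confounder like "GDP per capita", and the coefficients are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical confounder Z (think "GDP per capita").
z = rng.normal(size=n)

# X ("chocolate consumption") and Y ("Nobel laureates") both depend on Z,
# but neither causes the other.
x = 2.0 * z + rng.normal(size=n)
y = 3.0 * z + rng.normal(size=n)

# Strong positive correlation despite the absence of any X -> Y edge.
r = np.corrcoef(x, y)[0, 1]
print(f"correlation(X, Y) = {r:.2f}")
```

Even though changing X by hand would leave Y untouched, the correlation coefficient comes out large, which is exactly why correlation alone cannot establish causation.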

3. What is SAM?

Building on the above, this section explains a method for discovering the existence and direction of **causal relationships**. In this article I give a very brief explanation of SAM (Structural Agnostic Modeling) [2018], a causal discovery method based on deep learning / GANs. I would like to cover the best-known approach, Bayesian networks, in another article.

3-1. Overview

SAM is a method that performs causal discovery using the GAN (Generative Adversarial Network) technique from deep learning. A GAN broadly consists of a generator and a discriminator, which play the following roles:

- Generator: takes noise as input and generates fake data that resembles the training data
- Discriminator: judges whether the input data is real or fake
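The two roles can be illustrated with a deliberately tiny toy example. To be clear, this is not SAM's actual architecture (which uses neural networks); the `generator` and `discriminator` functions below are hypothetical stand-ins that only show the division of labour.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" training data: samples from N(5, 1).
real_mean = 5.0
real = rng.normal(loc=real_mean, scale=1.0, size=1000)

def generator(noise, shift):
    # Maps input noise to fake samples; `shift` plays the role of a learnable parameter.
    return noise + shift

def discriminator(samples):
    # Scores how "real" a batch looks: the closer its mean is to the
    # real data's mean, the higher the score.
    return -abs(samples.mean() - real_mean)

noise = rng.normal(size=1000)
bad_fake = generator(noise, shift=0.0)   # centred at 0: easy to tell apart
good_fake = generator(noise, shift=5.0)  # centred at 5: fools this discriminator

print(discriminator(bad_fake), discriminator(good_fake))
```

In a real GAN both parts are neural networks trained against each other: the generator learns to raise its score, and the discriminator learns to lower the score of fakes.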

In SAM, **a matrix of size (number of explanatory variables) x (number of explanatory variables) representing cause and effect** is passed to the generator's forward function, and this matrix is trained together with the generator. Each entry of the matrix takes a value between 0 and 1, and the entries above a threshold are judged to be causal. The image below shows an example (threshold 0.9).

• The diagonal components are 0.
• The numbers above are random values, not trained ones.
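The thresholding step can be sketched as follows. The matrix here is random (as in the image above), and the variable names are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
cols = ["X1", "X2", "X3", "X4"]  # hypothetical explanatory variables

# Stand-in for the learned matrix: entry [i, j] in [0, 1] scores the edge Xi -> Xj.
m = rng.uniform(size=(4, 4))
np.fill_diagonal(m, 0.0)  # a variable is not treated as its own cause

threshold = 0.9
edges = [(cols[i], cols[j]) for i, j in zip(*np.where(m > threshold))]
print(edges)  # the pairs judged to be causal at this threshold
```

Raising the threshold keeps only the edges the model is most confident about; lowering it admits weaker candidate edges.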

This is a fairly rough explanation; if you want more detail, please refer to the following references.

4. Implementation example

This section describes a Python implementation example of SAM (Structural Agnostic Modeling) [2018].

Download **SAM_titanic.ipynb** and **titanic.csv** from the GitHub repository below. https://github.com/yuomori0127/SAM_titanic

4-2. Environment setup

I used **Google Colab** as the environment; the servers are free to use. See the following article for how to use it: Summary of how to use Google Colab

Put the two files you downloaded earlier in any folder of **Google Drive**.

4-3. Running the code

Open **SAM_titanic.ipynb** from **Google Drive** in **Google Colab**.

• If I remember correctly, some packages need to be installed on the first run.

First, you need to reroll for a GPU. Run `!nvidia-smi` in the top cell (Shift + Enter) and reset the runtime until you are allocated a Tesla P100. It should take no more than five tries.

You can reset the runtime from the following menu.

Change the folder name in the code below (the `causal_book` part) to the folder on Google Drive where you placed the .ipynb and csv files.

```python
import os
os.chdir("./drive/My Drive/causal_book/")
```

If you select `Runtime` -> `Run all`, everything will be executed.

The following is a summary of the execution results (threshold 0.6).

• Only the matrix part of the code's output is shown.

The result is that the objective variable Survived is isolated (no explanatory variable has any causal relationship with it).
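Checking whether a variable is isolated can be done by reading the learned matrix directly. Below is a minimal sketch under assumptions: the matrix is a stand-in with random values (rows = causes, columns = effects), and the column names are an illustrative subset of the Titanic data.

```python
import numpy as np
import pandas as pd

cols = ["Pclass", "Sex", "Age", "Survived"]  # illustrative subset of columns
rng = np.random.default_rng(0)

# Stand-in for SAM's learned matrix (rows = causes, columns = effects).
m = pd.DataFrame(rng.uniform(size=(4, 4)), index=cols, columns=cols)
np.fill_diagonal(m.values, 0.0)

threshold = 0.6
scores = m["Survived"]  # how strongly each variable is scored as a cause of Survived
causes_of_survived = scores[scores > threshold].index.tolist()
print(causes_of_survived)  # an empty list would mean Survived is isolated
```

With the real learned matrix in place of the random one, an empty list here corresponds to the "Survived floats" result described above.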

5. Conclusion

I was able to perform causal discovery using deep learning / GANs, but several problems remain.

- **SAM is highly random, and the learning results are not stable at all.** The execution example above averages the results of 5 runs; the paper averaged 16 runs.
- **The output cannot be evaluated.** The paper evaluated results using average precision (AP) and Hamming distance, which presupposes a known ground truth; that is not possible in real-world operation. I wish there were something like perplexity.
- **Discrete values cannot be used as-is, so the explanatory variables produced by get_dummies end up with causal edges between them.** This may simply be my lack of study.

Considering real-world operation, there are many difficult points like these. This may be a field that continues to develop in the future.
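The run-averaging mentioned in the first point can be sketched as follows. The matrices here are random stand-ins for the output of 5 independent SAM runs:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vars, n_runs = 4, 5

# Stand-ins for the matrices produced by 5 independent SAM runs.
runs = []
for _ in range(n_runs):
    m = rng.uniform(size=(n_vars, n_vars))
    np.fill_diagonal(m, 0.0)
    runs.append(m)

# Average elementwise across runs, then threshold the averaged matrix once.
avg = np.mean(runs, axis=0)
edges = np.argwhere(avg > 0.6)  # edges surviving the threshold after averaging
print(avg.round(2))
```

Averaging before thresholding smooths out the run-to-run randomness, at the cost of running the (already slow) training several times.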