I am a beginner in causal reasoning.
Recently I had the chance to learn about recent methods for statistical causal discovery, and I realized that I could not follow the discussion at all without understanding the basic method, **LiNGAM**. So I will try to understand it by reading the references and following the procedure in Python step by step. There are two approaches to estimating **LiNGAM**; this time I use the independent component analysis (ICA) approach.
For the implementation I referred to the book on statistical causal discovery (Machine Learning Professional Series) [1], the LiNGAM paper [2], and the code in the original GitHub repository.
For example, suppose you have $N$ observations of four observed variables $x_1$, $x_2$, $x_3$, $x_4$. From that data we want to derive a causal graph that is linear and acyclic and whose error terms follow non-Gaussian distributions (see the figure below).
The model is written as follows. For $p$ observed variables $x_1, x_2, x_3, \ldots, x_p$, each variable is a linear combination of the other observed variables plus a non-Gaussian error term:
\begin{align}
x_i &= \sum_{j \neq i} b_{ij} x_j + e_i \tag{1}
\end{align}
Collecting the observed variables into the vector ${\bf x}$, the errors into the vector ${\bf e}$, and the coefficients $b_{ij}$ into the matrix ${\bf B}$, this becomes
\begin{align}
{\bf x} &= {\bf Bx} + {\bf e} \tag{2}
\end{align}
This time I will generate the following artificial data and try to recover the causal graph from it. The data is created according to the causal graph shown in the figure above.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import FastICA
from sklearn.linear_model import LassoLarsIC, LinearRegression
from scipy.optimize import linear_sum_assignment
from graphviz import Digraph
#STEP0 Artificial data generation
num_of_data = 1000
num_of_features = 4
_B_true = np.asarray([[0,0,0,0], [2,0,0,0], [1,3,0,0], [4,5,3,0]])
e1 = np.random.uniform(size=num_of_data)
e2 = np.random.uniform(size=num_of_data)
e3 = np.random.uniform(size=num_of_data)
e4 = np.random.uniform(size=num_of_data)
x1 = e1
x2 = _B_true[1,0] * x1 + e2
x3 = _B_true[2,0] * x1 + _B_true[2,1] * x2 + e3
x4 = _B_true[3,0] * x1 + _B_true[3,1] * x2 + _B_true[3,2] * x3 + e4
_X = np.asarray([x1, x2, x3, x4]).T
_columns = ["x1", "x2", "x3", "x4"]
#Shuffle the order of the observed variables
order_shuffled = [3,0,2,1]
X = _X[:,order_shuffled]
columns = [_columns[i] for i in order_shuffled]
P_os = np.zeros_like(_B_true)
for i,j in enumerate(order_shuffled):
    P_os[i,j] = 1
B_true = P_os@_B_true@P_os.T
DF_X = pd.DataFrame(X,columns=columns)
After creating the strictly lower triangular matrix B that is the true solution, the order of the observed variables is shuffled. Here the order is $(x_4, x_1, x_3, x_2)$. This is because in the real world we do not know the correct order of the observed variables.
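As a quick sanity check (this snippet is my addition, not part of the original post), we can confirm that the shuffled data still satisfies ${\bf x} = {\bf Bx} + {\bf e}$ with the shuffled `B_true`:
# Sanity check (my addition): the shuffled data should satisfy X = X @ B_true.T + E
E = np.asarray([e1, e2, e3, e4]).T[:, order_shuffled]
print(np.allclose(X, X @ B_true.T + E))  # expected: True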
Transforming equation $(2)$, we get:
\begin{align}
{\bf x} &= {\bf (I-B)}^{-1} {\bf e} \\
&= {\bf Ae} \tag{3}
\end{align}
Here, since the components of the error vector ${\bf e}$ are independent and non-Gaussian, this equation can be regarded as an ICA model whose mixing matrix is ${\bf A}$ and whose restoration (unmixing) matrix is ${\bf W = I-B}$. However, what ICA finds may differ from the original ${\bf W}$ in the order and scale of its rows. Therefore, using a permutation matrix ${\bf P}$ for the row order and a diagonal matrix ${\bf D}$ for the scale, the restoration matrix ${\bf W_{ICA}}$ estimated by ICA is
\begin{align}
{\bf W_{ICA}} &={\bf PDW} \\
&= {\bf PD(I-B)} \tag{4}
\end{align}
Now, let's actually compute ${\bf W_{ICA}}$ in Python. Since ICA itself is not the topic this time, I simply obtain it with FastICA from scikit-learn.
#STEP1 Estimate W_ICA (= PDW)
ICA = FastICA(random_state=0)
ICA.fit(X)
W_ICA = ICA.components_
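The rows of `W_ICA` come back in an arbitrary order and scale, which is exactly why the permutation and rescaling steps below are needed. As a quick check (my addition, not in the original code), the recovered sources should be nearly uncorrelated, and the matrix itself can be inspected:
# Quick check (my addition): the recovered sources should be nearly uncorrelated,
# and the rows of W_ICA are a row-permuted, row-scaled version of (I - B).
S = ICA.transform(X)
print(np.round(np.corrcoef(S.T), 2))  # approximately the identity matrix
print(np.round(W_ICA, 2))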
Transforming equation $(4)$, we get:
\begin{align}
{\bf P^{-1}W_{ICA}} &={\bf D(I-B)} \tag{5}
\end{align}
Here, because the diagonal of ${\bf B}$ is $0$ due to acyclicity, the diagonal of ${\bf I-B}$ is $1$; and since the diagonal of the scale matrix ${\bf D}$ is non-zero, the diagonal of ${\bf D(I-B)}$ is non-zero as well. Therefore, on the left side of equation $(5)$, the rows of ${\bf W_{ICA}}$ must be permuted so that no diagonal entry becomes $0$. Such a permutation matrix ${\bf \tilde{P}}$ is estimated as
\begin{align}
{\bf \hat{\tilde{P}} = \mathop{\rm arg~min}\limits_{\bf \tilde{P}} \sum_{i=1}^p \frac{1}{|(\bf {\tilde{P} \hat{W}_{ICA}})_{ii}| }} \tag{6}
\end{align}
Finding such a permutation matrix can be regarded as a linear assignment problem, so it can be solved with the Hungarian algorithm or the like. As for the implementation, since the Hungarian algorithm is not the topic here either, I simply solve it with scipy's `linear_sum_assignment`.
#STEP2 Estimate P_tilde_hat
epsilon = 1e-16
cost = 1 / (np.abs(W_ICA) + epsilon)
row_ind, col_ind = linear_sum_assignment(cost)
P_tilde_hat = np.zeros_like(W_ICA)
for i,j in zip(row_ind,col_ind):
    P_tilde_hat[i,j] = 1
P_tilde_hat = np.linalg.inv(P_tilde_hat)  # inverse of a permutation matrix (same as its transpose)
A small value ε is added to the denominator so that the reciprocal does not blow up when an entry of ${\bf W_{ICA}}$ is $0$.
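Before moving on, it is worth checking (my addition, not in the original code) that the permutation really puts non-zero entries on the diagonal:
# Check (my addition): after permutation, no diagonal entry should be (near) zero.
print(np.diag(P_tilde_hat @ W_ICA))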
As noted above, the diagonal of ${\bf I-B}$ is $1$, so the diagonal of ${\bf D(I-B)}$ coincides with that of the diagonal matrix ${\bf D}$. Therefore, from equation $(5)$, ${\bf D}$ equals the diagonal part of ${\bf P^{-1}W_{ICA}}$, and its estimator is
\begin{align}
{\bf{\hat D} = diag(\hat{\tilde{P}} \hat{W}_{ICA})} \tag{7}
\end{align}
Written in python, it looks like this.
#STEP3 D_hat estimation
D_hat = np.diag(np.diag(P_tilde_hat@W_ICA))
The restoration matrix ${\bf W}$ is then estimated from equation $(5)$ as
\begin{align}
{\bf \hat W = \hat D^{-1} \hat{\tilde{P}} \hat{W}_{ICA}} \tag{8}
\end{align}
and the coefficient matrix ${\bf B}$ is estimated as
\begin{align}
{\bf \hat B = I - \hat W} \tag{9}
\end{align}
In Python:
#STEP4 W_hat,B_hat estimation
W_hat = np.linalg.inv(D_hat)@P_tilde_hat@W_ICA
I = np.eye(num_of_features)
B_hat = I - W_hat
At this stage we have essentially recovered the LiNGAM model of equation $(2)$, but the causal order needed to draw the graph is still unknown. We therefore have to find a permutation of the observed variables in equation $(2)$ that turns ${\bf B}$ into a strictly lower triangular matrix. That is, we look for a permutation matrix ${\bf \ddot{P}}$ such that ${\bf \ddot{P} \hat{B} \ddot{P}^{\mathrm{T}}}$ is as close as possible to a strictly lower triangular matrix.
Such a permutation matrix could be found by brute-force search, but the number of candidate permutations is $p!$, so this quickly becomes infeasible. Reference [3] therefore proposes a method to find it efficiently.
In that method, the $p(p+1)/2$ entries of ${\bf \hat{B}}$ with the smallest absolute values are first replaced by $0$. Then we test whether the observed variables can be reordered so that the matrix becomes strictly lower triangular; if so, we stop, and if not, the next smallest entry of ${\bf \hat{B}}$ is also replaced by $0$ and the test is repeated.
The lower-triangularity test (which at the same time yields the causal order) works as follows: ① find a row $i$ of ${\bf \hat{B}}$ whose entries are all $0$ (if no such row exists, the test fails); ② append $i$ to the causal order list; ③ delete row $i$ and column $i$ from ${\bf \hat{B}}$ and return to ① with the reduced matrix.
The completed causal order list runs from ancestors to descendants. Now for the Python implementation. Honestly, this part was too complicated for me to work out on my own, so I leaned heavily on the original code...
#STEP5 Estimate the causal order
pos_list = np.vstack(np.unravel_index(np.argsort(np.abs(B_hat), axis=None), B_hat.shape)).T
initial_zero_num = num_of_features * (num_of_features + 1) // 2
for i, j in pos_list[:initial_zero_num]:  # set the p(p+1)/2 smallest-|.| entries to 0
    B_hat[i, j] = 0
Mtx_lowtri = B_hat
causal_order = []
original_index = np.arange(num_of_features)
for i in range(num_of_features):  # strict lower triangularity test
    idx_row_all_zero_list = np.where(np.sum(np.abs(Mtx_lowtri), axis=1) == 0)[0]
    if len(idx_row_all_zero_list) == 0:
        print("Not a strict lower triangular matrix")
        break
    idx_row_all_zero = idx_row_all_zero_list[0]
    causal_order.append(original_index[idx_row_all_zero])
    original_index = np.delete(original_index, idx_row_all_zero, axis=0)
    mask = np.delete(np.arange(len(Mtx_lowtri)), idx_row_all_zero, axis=0)
    Mtx_lowtri = Mtx_lowtri[mask][:, mask]
P_doubledot = np.zeros_like(B_hat)
for i,j in enumerate(causal_order):
    P_doubledot[j,i] = 1
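We can verify (my addition, not in the original code) that reordering the variables by the estimated causal order indeed yields a strictly lower triangular matrix:
# Check (my addition): permuting rows and columns by the causal order should give
# a strictly lower triangular matrix (upper triangle including the diagonal is zero).
B_perm = B_hat[np.ix_(causal_order, causal_order)]
print(np.allclose(np.triu(B_perm), 0))  # expected: True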
With this data, a causal order giving a lower triangular matrix was found on the first pass, so the code above does not loop. See the original implementation for a complete version; a rough sketch of the loop is shown below.
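The following is a minimal sketch (my paraphrase of the procedure described above, not the original implementation) of the looping version: keep replacing the next smallest-magnitude entry of ${\bf \hat{B}}$ with $0$ until a causal order is found.
def find_causal_order(B):
    # Return a causal order if B can be permuted to a strictly lower triangular
    # matrix, otherwise None.
    B = np.abs(B)
    original_index = np.arange(len(B))
    order = []
    for _ in range(len(B)):
        zero_rows = np.where(B.sum(axis=1) == 0)[0]
        if len(zero_rows) == 0:
            return None
        k = zero_rows[0]
        order.append(original_index[k])
        keep = np.delete(np.arange(len(B)), k)
        original_index = original_index[keep]
        B = B[keep][:, keep]
    return order

def estimate_causal_order(B_hat, p):
    B = B_hat.copy()
    pos = np.vstack(np.unravel_index(np.argsort(np.abs(B), axis=None), B.shape)).T
    n_zeros = p * (p + 1) // 2
    for i, j in pos[:n_zeros]:       # zero the p(p+1)/2 smallest-|.| entries
        B[i, j] = 0
    order = find_causal_order(B)
    for i, j in pos[n_zeros:]:       # if that fails, zero one more entry at a time
        if order is not None:
            break
        B[i, j] = 0
        order = find_causal_order(B)
    return order
Calling `estimate_causal_order(B_hat, num_of_features)` should reproduce the causal order obtained in STEP5 for this data.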
If there is enough data, the coefficient matrix obtained in STEP5 can be used as is, but if not, the model can apparently be improved by pruning as a finishing step.
The book proposes Adaptive Lasso as one pruning method. Adaptive Lasso weights the Lasso regularization term with weights ${\bf w}$ so that variable selection becomes consistent, and it has been proposed to use the reciprocals of the coefficients estimated by ordinary linear regression as these weights. Adaptive Lasso is applied to each observed variable following the causal order obtained in STEP5, which gives the final solution.
In the implementation, the coefficients are first estimated by linear regression, the Lasso inputs are scaled by the absolute values of those coefficients (the reciprocals of the penalty weights), and the coefficients returned by Lasso are scaled by the same factors to bring them back to the original scale.
#STEP6 Pruning (Adaptive Lasso)
B_pred = np.zeros_like(B_hat, dtype='float64')
lr = LinearRegression()
for i in range(1, num_of_features):
    # adaptive lasso
    predictors = causal_order[:i]  # explanatory variables: everything earlier in the causal order
    target = causal_order[i]       # target variable
    lr.fit(X[:, predictors], X[:, target])
    weight = np.abs(lr.coef_)  # scaling the inputs by |OLS coef| corresponds to penalty weights 1/|OLS coef|
    reg = LassoLarsIC(criterion='bic')
    reg.fit(X[:, predictors] * weight, X[:, target])
    B_pred[causal_order[i], causal_order[:i]] = reg.coef_ * weight  # scale back to the original coefficients
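To see how well the pruned estimate matches the ground truth (my addition, not in the original post), the two matrices can be compared directly:
# Compare the pruned estimate with the true coefficient matrix (my addition).
print(np.round(B_pred, 2))
print(B_true)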
Finally, the obtained model is visualized as in the figure below: from left to right, the true graph, the graph obtained in STEP5, and the graph after the pruning of STEP6. This time the data is simple and plentiful, so the result is already good at STEP5.
Graphviz was used for visualization. I used the following graph creation code.
def Make_Graph(columns, B):
    Graph = Digraph(format="png")
    for i, j in enumerate(columns):
        for k, l in enumerate(columns):
            if B[i, k] != 0:
                Graph.edge(l, j, color="black", label=" " + str(B[i, k].round(3)))
    return Graph
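For example, the pruned graph can be rendered like this (the output file name is my choice, not from the original post):
# Example usage (my addition): render the pruned graph and write it to lingam_result.png
graph_pred = Make_Graph(columns, B_pred)
graph_pred.render("lingam_result")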
So that is how this causal-reasoning beginner tried to work through LiNGAM, comparing the book [1] and the paper [2] and borrowing liberally from the original code.
**[1]** Shohei Shimizu. Statistical Causal Discovery (Machine Learning Professional Series). Kodansha.
**[2]** Shimizu et al. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research 7 (2006): 2003-2030.
**[3]** Hoyer et al. New permutation algorithms for causal discovery using ICA. In Proceedings of the International Conference on Independent Component Analysis and Blind Signal Separation (2006).