A memorandum about correlation [Python]

Introduction

This article is a memorandum of some attempts made in the process of considering the correlation of data with error variables and confounding factors.

First, I will show the structure of this article.

  1. Causal structure of data ([Data Generation](#data-generation))
  2. Correlation Matrix / Partial Correlation Matrix 1 ([Correlation / Partial Correlation [1]](#correlation--partial-correlation-1))
  3. Sparse estimation ([Graphical Lasso](#graphical-lasso))
  4. Correlation Matrix / Partial Correlation Matrix 2 ([Correlation / Partial Correlation [2]](#correlation--partial-correlation-2))
  5. Regression analysis ([Regression analysis](#regression-analysis))
  6. Summary ([Summary](#summary))

Data Generation

Regarding data scale, topics such as nominal scales, ordinal scales, and categorical variables are not covered, and regarding the nature and structure of the data, bias, unobserved confounding factors, and directed cycles are not addressed here. All of the multivariate data handled is assumed to be continuous, with each variable following a normal distribution, and the "granularity" of the data is assumed to be uniform.

import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
import scipy as sp
from scipy import linalg
from scipy.stats import multivariate_normal
from sklearn.linear_model import LinearRegression, HuberRegressor, TheilSenRegressor, RANSACRegressor
from sklearn.svm import SVR
from sklearn.covariance import GraphicalLasso
from sklearn.preprocessing import StandardScaler 
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_squared_log_error

The data is generated with the following directed acyclic causal structure. Random numbers following a normal or uniform distribution are added to each node as error variables, rather than as the effect of unobserved factors.

np.random.seed(0)
data_ = np.array([[20, 1, 0.3242, 34, 1, 0.10, 52, 5], [52, 0, 0.6134, 63, 1, 0.35, 38, 8], 
                  [38, 0, 0.7136, 57, 0, 0.20, 64, 7], [63, 1, 0.4592, 49, 0, 0.80, 95, 2]])
mu = np.mean(data_, axis = 0)
cov = np.cov(data_, rowvar = 0, bias = 0)

w_9 = np.random.multivariate_normal(mu, cov, 1000)[:, 0][:, np.newaxis] + np.random.uniform(1.3, size = 1000)[:, np.newaxis]
w_8 = 0.32 * w_9 + np.random.multivariate_normal(mu, cov, 1000)[:, 3][:, np.newaxis]
w_7 = np.random.multivariate_normal(mu, cov, 1000)[:, 1][:, np.newaxis] + np.random.uniform(1.6, size = 1000)[:, np.newaxis]
w_6 = 0.50 * w_7 + np.random.multivariate_normal(mu, cov, 1000)[:, 4][:, np.newaxis]
w_5 = 0.53 * w_7 + np.random.multivariate_normal(mu, cov, 1000)[:, 5][:, np.newaxis]
w_4 = 0.48 * w_5 + 0.40 * w_6 + np.random.uniform(1.5, size = 1000)[:, np.newaxis]
w_3 = 0.28 * w_4 + np.random.uniform(1.7, size = 1000)[:, np.newaxis]
w_2 = 0.20 * w_5 + np.random.uniform(3.2, size = 1000)[:, np.newaxis]
w_1 = 0.31 * w_2 + np.random.uniform(1.5, size = 1000)[:, np.newaxis]
w_0 = 0.290 * w_1 + 0.327 * w_4 + 0.491 * w_8 + 0.136 * w_9 + 10.0 + np.random.multivariate_normal(mu, cov, 1000)[:, 2][:, np.newaxis]

# Replace non-positive values of w_0 with a positive draw from N(mean, std) of w_0,
# falling back to the mean if the draw is also non-positive
mu = np.mean(w_0)
sigma = np.std(w_0)
rows = np.where(w_0 <= 0)[0]
for i in rows:
    normal = np.random.normal(mu, sigma)
    w_0[i] = normal if normal > 0 else mu

data_ = pd.DataFrame(np.concatenate([w_0, w_1, w_2, w_3, w_4, w_5, w_6, w_7, w_8, w_9], axis = 1))

Correlation / Partial Correlation [1]

Ideally, the correlation matrix would show that w_5 and w_6 are correlated through w_7, and the partial correlation matrix would show that this correlation is spurious. At the same time, I am also interested in how the correlation between w_0 and w_8 changes between the correlation matrix and the partial correlation matrix. For an uncomplicated structure such as the relationship among w_0, w_8, and w_9, even if the confounding factor w_9 exists, as long as there is no causal relationship between w_0 and w_8, computing the partial correlation matrix should show that w_0 and w_8 are unrelated. In the data handled here, however, there is a causal relationship between w_0 and w_8, and w_9 is a confounding factor for both. Let us see how the correlation between w_0 and w_8 is expressed in that case, and also check how accurately the correlations between w_0 and each of w_1, w_4, w_8, and w_9 are captured.
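As a reference for the function below, the partial correlation between two variables given all of the others can be written in terms of the precision matrix $\Omega = \Sigma^{-1}$ (a standard identity, not specific to this article):

$$
\rho_{ij \mid \mathrm{rest}} = -\frac{\Omega_{ij}}{\sqrt{\Omega_{ii}\,\Omega_{jj}}}
$$

The following function computes exactly this from the inverse of the sample covariance matrix, then resets the diagonal to 1.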

def partial_corr_matrix(data):
    # Partial correlation from the inverse covariance (precision) matrix
    temp_cov = data.cov()
    omega = np.linalg.inv(temp_cov)
    D = np.diag(np.power(np.diag(omega), -0.5))
    temp_pcorr = -np.dot(np.dot(D, omega), D) + 2 * np.eye(temp_cov.shape[0])  # reset the diagonal to 1
    mtx_pcorr = pd.DataFrame(temp_pcorr, columns = temp_cov.columns, index = temp_cov.index)
    return mtx_pcorr


partial_correlation = partial_corr_matrix(data_)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (18, 9))
heatmap = sns.heatmap(data_.corr(), annot = True, cmap = 'Blues', square = True, ax = ax1)
heatmap_ = sns.heatmap(partial_correlation, annot = True, cmap = 'Blues', square = True, ax = ax2)

Correlation(left) VS Partial Correlation(right)

|                     | w_0, w_8 | w_0, w_9 | w_0, w_1 | w_0, w_4 | w_5, w_6 | w_8, w_9 |
|---------------------|----------|----------|----------|----------|----------|----------|
| Correlation         | 0.96     | 0.66     | -0.031   | 0.014    | 0.36     | 0.43     |
| Partial correlation | 1        | 1        | 0.28     | 0.24     | -0.6     | -1       |

One interpretation is that, because both w_5 and w_6, which are affected by the confounding factor w_7, in turn affect w_4, the relationships between the variables are not measured well when the covariance matrix is computed in the process of calculating the partial correlation matrix. For further consideration, in Correlation / Partial Correlation [2] a loop structure and outliers are added to the data.

Graphical Lasso

Graphical lasso is a sparse estimation method that computes a sparse inverse covariance matrix under the assumption that the data follow a multivariate normal distribution. It combines a Gaussian graphical model with sparse estimation, and it considers the conditional independence of each pair of variables given all the other variables. The sparse inverse covariance (precision) matrix is estimated by building the log-likelihood from the covariance matrix computed from the data and solving a minimization problem for the loss function with an $L1$ regularization term added. Graphical lasso is presumably not well suited to the data handled here, but I will try it anyway.
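Concretely, writing $S$ for the sample covariance matrix and $\alpha$ for the regularization strength (the alpha parameter passed to GraphicalLasso below), the precision matrix $\Theta$ is estimated by solving

$$
\hat{\Theta} = \mathop{\mathrm{arg\,min}}_{\Theta \succ 0} \left\{ \mathrm{tr}(S\Theta) - \log\det\Theta + \alpha \lVert \Theta \rVert_1 \right\},
$$

where the $L1$ norm is usually taken over the off-diagonal elements of $\Theta$.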

ss = StandardScaler()
data_ss = ss.fit_transform(data_)
gl = GraphicalLasso(alpha = 0.1, max_iter = 100, )
gl.fit(data_ss)
gl.get_params()
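# Optional sketch (not in the original article): instead of fixing alpha = 0.1 above,
# the regularization strength could be selected by cross-validation with GraphicalLassoCV.
from sklearn.covariance import GraphicalLassoCV

glcv = GraphicalLassoCV(cv = 5)
glcv.fit(data_ss)
print(glcv.alpha_)   # alpha chosen by cross-validation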

# correlation
diag = np.zeros(shape = (10, 10))
for i in range(diag.shape[0]):
    for j in range(diag.shape[1]):
        if i == j:
            diag[i,j] = 1 / np.sqrt(gl.covariance_[i,j])
corr = np.dot(np.dot(diag, gl.covariance_), diag)

# partial correlation
omega = np.linalg.inv(gl.covariance_)
diag = np.diag(np.power(np.diag(omega), -0.5))
partial_corr = -np.dot(np.dot(diag, omega), diag)
partial_corr += 2 * np.eye(gl.covariance_.shape[0])

fig_gl, ax_gl = plt.subplots(figsize = (9, 9))
heatmap_gl = sns.heatmap(corr, annot = True, cmap = 'Blues', square = True, ax = ax_gl)
fig_gl.savefig('graph_gl_corr')

fig_gl_, ax_gl_ = plt.subplots(figsize = (9, 9))
heatmap_gl_ = sns.heatmap(partial_corr, annot = True, cmap = 'Blues', square = True, ax = ax_gl_)
fig_gl_.savefig('graph_gl_partial_corr')

Correlation(left) VS Partial Correlation(right)

Graphical lasso is not an algorithm designed to handle noise or confounding, so it does not appear to be a suitable choice for this data.

Correlation / Partial Correlation [2]

A loop structure is created among w_0, w_4, and w_6, and outliers are added to w_1, w_4, and w_9.

np.random.seed(0)
data_ = np.array([[20, 1, 0.3242, 34, 1, 0.10, 52, 5], [52, 0, 0.6134, 63, 1, 0.35, 38, 8], 
                  [38, 0, 0.7136, 57, 0, 0.20, 64, 7], [63, 1, 0.4592, 49, 0, 0.80, 95, 2]])
mu = np.mean(data_, axis = 0)
cov = np.cov(data_, rowvar = 0, bias = 0)

#w_11 = np.random.multivariate_normal(mu, cov, 1000)[:, 6][:, np.newaxis] 
#w_10 = 0.27 * w_11 + np.random.multivariate_normal(mu, cov, 1000)[:, 7][:, np.newaxis] 
w_9 = np.random.multivariate_normal(mu, cov, 1000)[:, 0][:, np.newaxis] + np.random.uniform(1.3, size = 1000)[:, np.newaxis]

noise_upper = np.random.normal(200, 30, 20)
noise_row = np.random.randint(0, 1000, 20)
w_9[noise_row] = noise_upper[:, np.newaxis]

w_8 = 0.32 * w_9 + np.random.multivariate_normal(mu, cov, 1000)[:, 3][:, np.newaxis]
w_7 = np.random.multivariate_normal(mu, cov, 1000)[:, 1][:, np.newaxis] + np.random.uniform(1.6, size = 1000)[:, np.newaxis]
# w_0 from the previous data generation is reused here so that w_0 -> w_6 -> w_4 -> w_0 forms a loop
w_6 = 0.50 * w_7 + np.random.multivariate_normal(mu, cov, 1000)[:, 4][:, np.newaxis] + 0.53 * w_0
w_5 = 0.53 * w_7 + np.random.multivariate_normal(mu, cov, 1000)[:, 5][:, np.newaxis]
w_4 = 0.48 * w_5 + 0.40 * w_6 + np.random.uniform(1.5, size = 1000)[:, np.newaxis]

noise_lower = np.random.normal(0.6, 0.2, 10)
noise_upper = np.random.normal(30, 2, 30)
noise_row = np.random.randint(0, 1000, 40)
w_4[noise_row] = np.concatenate([noise_lower, noise_upper])[:, np.newaxis]

w_3 = 0.28 * w_4 + np.random.uniform(1.7, size = 1000)[:, np.newaxis]
w_2 = 0.20 * w_5 + np.random.uniform(3.2, size = 1000)[:, np.newaxis]
w_1 = 0.31 * w_2 + np.random.uniform(1.5, size = 1000)[:, np.newaxis]

noise_upper = np.random.normal(30, 4, 20)
noise_upper_ = np.random.normal(50, 3, 20)
noise_row = np.random.randint(0, 1000, 40)
w_1[noise_row] = np.concatenate([noise_upper, noise_upper_])[:, np.newaxis]

w_0 = 0.290 * w_1 + 0.327 * w_4 + 0.491 * w_8 + 0.136 * w_9 + 10.0 + np.random.multivariate_normal(mu, cov, 1000)[:, 2][:, np.newaxis] #+ 0.426 * w_11

# As before, replace non-positive values of w_0 with a positive draw (or the mean)
mu = np.mean(w_0)
sigma = np.std(w_0)
rows = np.where(w_0 <= 0)[0]
for i in rows:
    normal = np.random.normal(mu, sigma)
    w_0[i] = normal if normal > 0 else mu

data_ = pd.DataFrame(np.concatenate([w_0, w_1, w_2, w_3, w_4, w_5, w_6, w_7, w_8, w_9], axis = 1))

# The same partial correlation function as in Correlation / Partial Correlation [1]
def partial_corr_matrix(data):
    temp_cov = data.cov()
    omega = np.linalg.inv(temp_cov)
    D = np.diag(np.power(np.diag(omega), -0.5))
    temp_pcorr = -np.dot(np.dot(D, omega), D) + 2 * np.eye(temp_cov.shape[0])
    mtx_pcorr = pd.DataFrame(temp_pcorr, columns = temp_cov.columns, index = temp_cov.index)
    return mtx_pcorr


partial_correlation = partial_corr_matrix(data_)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (18, 9))
heatmap = sns.heatmap(data_.corr(), annot = True, cmap = 'Blues', square = True, ax = ax1)
heatmap_ = sns.heatmap(partial_correlation, annot = True, cmap = 'Blues', square = True, ax = ax2)

Correlation(left) VS Partial Correlation(right)

Since w_0 and w_4 are connected in a loop structure, the result is that w_4 appears correlated with w_8 and with w_9; the correlation matrix shows positive correlations while the partial correlation matrix shows negative ones. In the correlation matrix, w_9 indirectly shows a positive correlation with w_3 through the relationships among w_3, w_4, and w_9. In the partial correlation matrix, w_3 and w_9 are uncorrelated, and w_4 and w_9 are negatively correlated. There are places where the correlation matrix is consistent with the causal structure but the partial correlation matrix is not, and vice versa, and there are also places where the correlation matrix shows a positive correlation while the partial correlation matrix shows a negative one.

There is not enough evidence for a thorough discussion, but one interpretation from this section is that both the correlation matrix and the partial correlation matrix do not handle the following situations well: when error variables above a certain size are present, when two or more variables $x_1, ..., x_n$ that are directly or indirectly affected by a common factor (variable) are themselves causally related or correlated, when two or more of $x_1, ..., x_n$ affect the same other variable, and when a loop structure exists.

Regression analysis

Since w_0 is constructed so that it can be expressed by a linear expression, regression analysis is performed. If you can assume normality, homoscedasticity, and independence for the error distribution, and linearity for the relationship between the objective variable and the explanatory variables, there is strong motivation to perform regression analysis. I also tried HuberRegressor and SVR with scikit-learn, but LinearRegression is sufficient, because the data handled has no bias such as selection bias or unobserved confounding factors. For SVR, a linear kernel that weights the input vectors is used; accuracy could also be obtained with a polynomial kernel, which handles multiple features and allows the degrees of freedom to be adjusted, or with a Gaussian kernel. Huber regression is used for robust regression, but this time the error distribution has neither asymmetry nor large noise, so it is not considered a suitable choice. Since there is no motivation such as making the objective variable follow a normal distribution, handling outliers, or an exponential relationship between the objective variable and the explanatory variables, no logarithmic transformation is performed.

X_train, X_valid, y_train, y_valid = train_test_split(data_.drop(0, axis = 1), data_[0], test_size = 0.4, random_state = 0)
# Re-standardize the regenerated data before splitting
data_ss = pd.DataFrame(ss.fit_transform(data_))
X_train_ss, X_valid_ss, y_train_ss, y_valid_ss = train_test_split(data_ss.drop(0, axis = 1), data_ss[0], test_size = 0.4, random_state = 0)
y_valid = y_valid.sort_index(ascending = True)
X_valid = X_valid.sort_index(ascending = True)
y_valid_ss = y_valid_ss.sort_index(ascending = True)
X_valid_ss = X_valid_ss.sort_index(ascending = True)

plt.hist(np.ravel(y_train), label = 'training data', color = 'c')
plt.hist(np.ravel(y_valid), label = 'valid data', color = 'k')
plt.legend()
plt.savefig('graph_hist')
#Non-standardized data
linear = LinearRegression()
linear.fit(X_train, y_train)
# <SVR>
# svr_linear = SVR(kernel = 'linear', C = 1.0, epsilon = 0.1, gamma = 'scale')
# svr_linear.fit(X_train, np.ravel(y_train))
# scores = svr_linear.score(X_valid, y_valid)
# print('R^2 scores : %s' %scores)
# <HuberRegressor>
# hr = HuberRegressor(alpha = 0.0001, epsilon = 1.35, fit_intercept = True, max_iter = 100, tol = 1e-05, warm_start = False)
# hr.fit(X_train, y_train)
pred = linear.predict(X_valid)
plt.plot(np.ravel(y_valid), label = 'true valid', color = 'k')
plt.plot(np.ravel(pred), label = 'pred valid', color = 'w')

diff = np.log1p(pred) - np.log1p(y_valid)
diff = np.where(np.isnan(diff), 0, diff)
RMSLE = np.sqrt(np.mean(diff ** 2))

#Standardized data
linear_ss = LinearRegression()
linear_ss.fit(X_train_ss, y_train_ss)
# <SVR>
# svr_linear_ss = SVR(kernel = 'linear', C = 1.0, epsilon = 0.1, gamma = 'scale')
# svr_linear_ss.fit(X_train_ss, np.ravel(y_train_ss))
# scores = svr_linear_ss.score(X_valid_ss, y_valid_ss)
# print('R^2 scores : %s' %scores)
# <HuberRegressor>
# hr_ss = HuberRegressor(alpha = 0.0001, epsilon = 1.35, fit_intercept = True, max_iter = 100, tol = 1e-05, warm_start = False)
# hr_ss.fit(X_train_ss, y_train_ss)
pred_ss = linear_ss.predict(X_valid_ss)
plt.plot(np.ravel(y_valid_ss), label = 'true valid', color = 'k')
plt.plot(np.ravel(pred_ss), label = 'pred valid', color = 'w')

coef = pd.DataFrame(X_train.columns.values, columns = ['feature'])
coef['coef'] = linear.coef_.T
coef['coef_ss'] = linear_ss.coef_.T
print('MSE : %s' %mean_squared_error(y_valid, pred, squared = True))
print('RMSLE : %s' %RMSLE)
>> MSE : 0.02983134938758614
>> RMSLE : 0.03423681936352853

coef.sort_values(by = 'coef', ascending = False)
>>
feature     coef
	8	0.480333
	4	0.322978
	1	0.284168
	9	0.133841
	6	0.039120
	5	0.016435
	3	-0.005714
	2	-0.009774
	7	-0.020549
print('MSE : %s' %mean_squared_error(y_valid_ss, pred_ss, squared = True))
>> MSE : 0.000242675476861591

coef.sort_values(by = 'coef_ss', ascending = False)
>>
feature     coef_ss
	8	0.652601
	9	0.332685
	1	0.194995
	4	0.109338
	6	0.020749
	5	0.000665
	3	-0.000551
	2	-0.000566
	7	-0.001129

Check the residuals.

#Non-standardized data
plt.scatter(pred, pred - y_valid, color = 'blue') 
plt.hlines(y = 0, xmin = -10, xmax = 150, color = 'c')
plt.xlabel('Prediction')
plt.ylabel('Residual error')
plt.grid()

#Standardized data
plt.scatter(pred_ss, pred_ss - y_valid_ss, color = 'blue') 
plt.hlines(y = 0, xmin = -10, xmax = 150, color = 'c')
plt.xlabel('Prediction')
plt.ylabel('Residual error')
plt.grid()

Non Standardized(left) VS Standardized(right)

#Non-standardized data
plt.hist(pred - y_valid)

#Standardized data
plt.hist(pred_ss - y_valid_ss)

Non Standardized(left) VS Standardized(right)

The residuals of the non-standardized and standardized data are scattered within roughly $±0.5$ and $±0.05$ of $y = 0$, respectively, and the residuals can be checked for normality. The scatter of the residuals is not irregular but roughly symmetric about $y = 0$. By MSE, the standardized data gives higher prediction accuracy on the validation data than the non-standardized data, but when the contributions of the coefficients to the objective variable are checked in descending order, the non-standardized coefficients recover the order 8, 4, 1, 9, which is more accurate than that of the standardized data, which in turn is more predictive on the validation data. Even with error variables and outliers added to the data, the accuracy is reasonable as long as the data does not mix multiple scales or contain unobserved confounders. Even when more error variables and outliers are randomly assigned and made larger, the contributions of the coefficients to the objective variable are estimated appropriately for this data; on the other hand, the accuracy of the model is not stable when there are unobserved variables.
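As a quick sketch that is not part of the original code, the normality of the residuals could be checked with a Q-Q plot and the Shapiro-Wilk test from scipy.stats; pred and y_valid are reused from the non-standardized model, and the output file name is hypothetical.

from scipy import stats

residuals = np.ravel(pred - y_valid)
stats.probplot(residuals, dist = 'norm', plot = plt)   # Q-Q plot against a normal distribution
plt.savefig('graph_qq_residuals')

stat, p_value = stats.shapiro(residuals)
print('Shapiro-Wilk statistic : %s, p-value : %s' %(stat, p_value))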

When the emphasis is on prediction rather than causality, and for data where differences in multiple scales and granularities must be considered, regression analysis can be performed with statsmodels.formula.api. For multicollinearity, one approach is to check the condition number (Cond. No.) and judge the multicollinearity to be weak if it is close to 1; otherwise, compute the variance inflation factor with variance_inflation_factor from statsmodels.stats.outliers_influence. For example, when there is multicollinearity between two variables, the points are arranged along a straight line, so the regression plane is determined only on that line and the inverse of the covariance matrix does not exist, and the equations cannot be solved. As a result, the standard errors of the coefficients become large, the $t$ values become small and affect the $p$ values, and the coefficient of determination takes a large value. With statsmodels.formula.api, you can easily check things such as the $p$ values, $z$ values, and confidence intervals of the parameters to judge substantive effects, the covariance type, and a normality test of the residuals (the Omnibus test).
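As a minimal sketch of the approach described above (this is not code from the original article; the column names w0 to w9 are assigned here only so that the formula API can be used):

import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = data_.copy()
df.columns = ['w%d' % i for i in range(10)]   # hypothetical names for the formula API

# The OLS summary reports t/p values, confidence intervals, Cond. No. and the Omnibus normality test
model = smf.ols('w0 ~ w1 + w2 + w3 + w4 + w5 + w6 + w7 + w8 + w9', data = df).fit()
print(model.summary())

# Variance inflation factor for each explanatory variable
X = df.drop('w0', axis = 1).values
vif = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
print(vif)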

Summary

I considered correlation and regression analysis for data with error variables and confounding factors. Although this is only a memorandum, I feel the content of the article is somewhat inconsistent with its subject. If I can prepare data suitable for an article, estimating causality with structural equation modeling would be more interesting.
