[PYTHON] Rethink the correlation coefficient

When you want to examine the relationship between two continuous variables, start by drawing a **scatter plot**. As covered previously, tools such as matplotlib and R are useful for drawing scatter plots.

Linear regression revisited

Let's revisit the example we covered in the earlier article on linear regression and the correlation coefficient.

import numpy as np
import matplotlib.pyplot as plt

#Two continuous variables
v1 = np.array([24, 27, 29, 34, 42, 43, 51])
v2 = np.array([236, 330, 375, 392, 460, 525, 578])

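def phi(x):  # Polynomial basis functions: 4 terms (1, x, x^2, x^3)
    return [1, x, x**2, x**3]

def f(w, x):  # Fitted value: dot product of the weights and the basis functions
    return np.dot(w, phi(x))

# Least-squares fit of v1 as a cubic polynomial in v2 (normal equations)
PHI = np.array([phi(x) for x in v2])
w = np.linalg.solve(np.dot(PHI.T, PHI), np.dot(PHI.T, v1))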

ylist = np.arange(200, 600, 10)
xlist = [f(w, x) for x in ylist]

plt.xlim(20, 55)
plt.ylim(200, 600)
plt.xlabel('Age')
plt.ylabel('Price')
plt.plot(v1, v2, 'o', color="blue")
plt.plot(xlist, ylist, color="red")
plt.savefig("image.png")  # Save before show(); the figure may be empty afterwards
plt.show()

[Figure: scatter plot of Age (v1) against Price (v2), with the fitted cubic curve in red]

The summary statistics for the continuous variables can be computed as follows. These also appeared in past articles.

| Item | Function | Value |
| --- | --- | --- |
| Mean of v2 | np.average(v2) | 413.714285714 |
| Variance of v2 | np.var(v2) | 11725.3469388 |
| Standard deviation of v2 | np.std(v2) | 108.283641141 |
| Correlation coefficient between v1 and v2 | np.corrcoef(v1, v2)[0, 1] | 0.96799293 |
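As a quick check, the statistics listed above can be reproduced with the following sketch. It uses only the NumPy functions named in the list. The variable names `mean_v2`, `var_v2`, `std_v2`, and `r` are chosen here for illustration.

```python
import numpy as np

v1 = np.array([24, 27, 29, 34, 42, 43, 51])
v2 = np.array([236, 330, 375, 392, 460, 525, 578])

mean_v2 = np.average(v2)
var_v2 = np.var(v2)   # population variance (NumPy's default, ddof=0)
std_v2 = np.std(v2)   # population standard deviation (ddof=0)

# np.corrcoef returns a 2x2 correlation matrix;
# the coefficient between v1 and v2 is the off-diagonal entry.
r = np.corrcoef(v1, v2)[0, 1]

print(mean_v2, var_v2, std_v2, r)
```

Note that `np.var` and `np.std` divide by N by default. Pass `ddof=1` if you want the N - 1 versions instead.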

When Y (= v2) tends to increase as X (= v1) increases, the two variables are said to be positively correlated. That is the case here.

Linear relationship and correlation coefficient

When a change in one variable is accompanied by a monotonic change in the other at a roughly constant rate, as in this example, the relationship is called a **linear relationship**.

To be precise, this correlation coefficient is called **Pearson's product-moment correlation coefficient**. Other correlation coefficients exist, but "correlation coefficient" on its own usually means Pearson's.

In the [Cartesian coordinate system](http://ja.wikipedia.org/wiki/%E7%9B%B4%E4%BA%A4%E5%BA%A7%E6%A8%99%E7%B3%BB) of the scatter plot, the upper right region is called the 1st quadrant. Similarly, the upper left is the 2nd quadrant, the lower left is the 3rd quadrant, and the lower right is the 4th quadrant. Take the point of means (the mean of X, the mean of Y) as the origin. In the 1st and 3rd quadrants, the deviations of X and Y from their means then have the same sign, so their product is positive. If most points fall in those two quadrants, the sum of the products of the deviations is large and positive.
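The quadrant argument can be checked on the sample data with a short sketch. It assumes the quadrants are defined relative to the point of means rather than the plot's axes. The variable names are illustrative.

```python
import numpy as np

v1 = np.array([24, 27, 29, 34, 42, 43, 51])
v2 = np.array([236, 330, 375, 392, 460, 525, 578])

# Deviations from the means: the point of means acts as the origin.
dx = v1 - v1.mean()
dy = v2 - v2.mean()

# Count the points in each quadrant around the point of means.
first = np.sum((dx > 0) & (dy > 0))
second = np.sum((dx < 0) & (dy > 0))
third = np.sum((dx < 0) & (dy < 0))
fourth = np.sum((dx > 0) & (dy < 0))

# Points in quadrants 1 and 3 contribute positive products.
sum_of_products = np.sum(dx * dy)

print(first, second, third, fourth)
print(sum_of_products)
```

For this data every point lies in the 1st or 3rd quadrant, so every product of deviations is positive.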

The **covariance** is a number that indicates the strength and direction of the linear relationship between two continuous variables. It is given by the following equation.

Cov(X, Y) = \frac {\sum (Y_i - \overline{Y})(X_i - \overline{X})} {N - 1}
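The covariance formula can be written out term by term in plain Python. This is a minimal sketch. The function name `covariance` is chosen here for illustration and is not from the original article.

```python
def covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by N - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = ((y - mean_y) * (x - mean_x) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

v1 = [24, 27, 29, 34, 42, 43, 51]
v2 = [236, 330, 375, 392, 460, 525, 578]

cov_xy = covariance(v1, v2)
print(cov_xy)
```

NumPy's `np.cov(v1, v2)[0, 1]` also divides by N - 1 by default, so it should agree with this function.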

The product-moment correlation coefficient is obtained by dividing the covariance by the product of the standard deviations σ of X and Y.

r_{xy} = \frac {Cov(X, Y)} {\sigma_X \sigma_Y}
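The sketch below computes r from the covariance and the standard deviations and compares it with `np.corrcoef`. One assumption needs stating: the standard deviations must use the same N - 1 divisor as the covariance, which is why `ddof=1` is passed to `np.std`.

```python
import numpy as np

v1 = np.array([24, 27, 29, 34, 42, 43, 51])
v2 = np.array([236, 330, 375, 392, 460, 525, 578])

# Sample covariance (np.cov divides by N - 1 by default).
cov_xy = np.cov(v1, v2)[0, 1]

# Sample standard deviations, using the same N - 1 divisor.
sigma_x = np.std(v1, ddof=1)
sigma_y = np.std(v2, ddof=1)

r = cov_xy / (sigma_x * sigma_y)

print(r)
print(np.corrcoef(v1, v2)[0, 1])  # should match r
```

The divisor cancels between numerator and denominator. Using N throughout gives the same r, as long as the choice is consistent.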

Summary

This article reorganized and supplemented the earlier discussion of the correlation coefficient. To test the null hypothesis that there is no linear relationship between two variables, we test the product-moment correlation coefficient. The null hypothesis assumes independence: the population correlation is 0, and the value of one variable does not affect the value of the other. How far the sample data depart from this independent state is then used to judge whether the population correlation coefficient is 0.
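The article does not name a specific test. One common choice is a t-test on r, using t = r√(N − 2) / √(1 − r²) with N − 2 degrees of freedom. The sketch below assumes that test and a two-sided 5% significance level. The critical value 2.571 for 5 degrees of freedom is taken from a standard t table. All names are illustrative.

```python
import math

v1 = [24, 27, 29, 34, 42, 43, 51]
v2 = [236, 330, 375, 392, 460, 525, 578]

def pearson_r(xs, ys):
    """Pearson's r from sums of squared deviations and cross products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

n = len(v1)
r = pearson_r(v1, v2)

# Test statistic under the null hypothesis that the population correlation is 0.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Assumption: two-sided test at the 5% level, df = n - 2 = 5 (t-table value).
T_CRITICAL = 2.571
reject_null = abs(t_stat) > T_CRITICAL

print(r, t_stat, reject_null)
```

If SciPy is available, `scipy.stats.pearsonr(v1, v2)` returns the coefficient together with a p-value for the same null hypothesis.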

References

Introduction to Social Statistics http://www.amazon.co.jp/dp/4595313705

Let's implement Bayesian linear regression http://gihyo.jp/dev/serial/01/machine-learning/0014
