[PYTHON] Examine the relationship between two variables (1)

Examine the correlation coefficient

The first step in data analysis is collecting the data of interest. Statistics is concerned with the overall picture and tendencies of the data rather than with individual values, and the basis for understanding them is a distribution model and summary statistics such as the mean and variance.

Examining the correlation coefficient is one of the basics. I have covered linear regression and the correlation coefficient several times before. Here, let's dig further into the details.

Collect data and make a diagram

For a certain sport, I recorded the grip strength and ball-throw results of ten high school students.

Student  Grip strength  Ball throw
A        26             16
B        26             11
C        26             14
D        27             16
E        28             18
F        29             16
G        32             18
H        29             21
I        24             14
J        26             19

I keep repeating this, but statistics starts with data collection. In this example, I recorded the sports performance of each student and put it in a table. However, the relationship between the two measurements is not clear from the table alone, so I will draw a figure. A scatter plot gives a rough idea of the relationship between the two variables x and y.

Now, let's draw a scatter plot using what we have learned so far.

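import numpy as np
import matplotlib.pyplot as plt

# Grip strength (X) and ball throw (Y) for students A to J
X = np.array([26, 26, 26, 27, 28, 29, 32, 29, 24, 26])
Y = np.array([16, 11, 14, 16, 18, 16, 18, 21, 14, 19])

plt.plot(X, Y, 'o', color="blue")  # 'o' draws markers only, giving a scatter plot
plt.xlabel("Grip strength")
plt.ylabel("Ball throw")
plt.savefig("image.png")  # save before show(); saving afterwards can write a blank figure
plt.show()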

[Figure: scatter plot of grip strength (x) against ball throw (y), saved as image.png]

The plot suggests a positive correlation.

Find the correlation coefficient

I have explained this before as well, but the next step is to quantify the strength of the relationship between the two variables x and y. The measure for this is the correlation coefficient.

To calculate the correlation coefficient concretely, start from the following formula for the covariance.

Cov(x,y) = \frac{1}{N} \sum_{k=1}^{N} x_k y_k - \overline{x} \, \overline{y}
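As a quick check of the covariance formula, here is a minimal sketch using the data from the table. The variable names `cov_formula` and `cov_numpy` are only for illustration. Note that `np.cov` divides by N - 1 by default, so `bias=True` is passed to match the 1/N definition used here.

```python
import numpy as np

# Data from the table: grip strength (X) and ball throw (Y)
X = np.array([26, 26, 26, 27, 28, 29, 32, 29, 24, 26])
Y = np.array([16, 11, 14, 16, 18, 16, 18, 21, 14, 19])

N = len(X)

# Covariance by the formula: mean of the products minus product of the means
cov_formula = np.sum(X * Y) / N - X.mean() * Y.mean()

# np.cov returns a 2x2 covariance matrix; [0, 1] is Cov(x, y).
# bias=True normalizes by N instead of the default N - 1.
cov_numpy = np.cov(X, Y, bias=True)[0, 1]

print(cov_formula, cov_numpy)
```

Both values should come out at approximately 3.11, up to floating-point rounding.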

Do you remember it? Dividing the covariance by the product of the two standard deviations gives the correlation coefficient.

r(x,y) = \frac {Cov(x,y)} {\sigma(x)\sigma(y)}
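The definition can be written out almost literally in NumPy. This is a sketch that assumes \sigma means the population standard deviation (dividing by N), which is what `np.std` computes by default and what matches the 1/N covariance above.

```python
import numpy as np

# Data from the table: grip strength (X) and ball throw (Y)
X = np.array([26, 26, 26, 27, 28, 29, 32, 29, 24, 26])
Y = np.array([16, 11, 14, 16, 18, 16, 18, 21, 14, 19])

# Covariance with the 1/N normalization
cov_xy = np.mean(X * Y) - X.mean() * Y.mean()

# np.std uses ddof=0 by default, i.e. the population standard deviation
r = cov_xy / (np.std(X) * np.std(Y))

print(f"{r:.4f}")  # → 0.5321
```

As long as the covariance and both standard deviations use the same normalization (all 1/N, or all 1/(N - 1)), the factor cancels and r is the same.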

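Multiplying the numerator and the denominator by N^2 lets us work with plain sums. For this data, N = 10, the sum of x is 273, the sum of y is 163, the sum of xy is 4481, the sum of x^2 is 7499, and the sum of y^2 is 2731. Therefore

r(x,y) = \frac {10 \times 4481 - 273 \times 163} {\sqrt{(10 \times 7499 - 273^2)(10 \times 2731 - 163^2)}} = \frac {311} {\sqrt{461 \times 741}} \approx 0.53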
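The sums that appear in the hand calculation can be reproduced with plain Python, with no NumPy needed. The names `sum_xy`, `sum_xx`, and so on are only illustrative.

```python
import math

# Data from the table: grip strength (X) and ball throw (Y)
X = [26, 26, 26, 27, 28, 29, 32, 29, 24, 26]
Y = [16, 11, 14, 16, 18, 16, 18, 21, 14, 19]

n = len(X)
sum_x = sum(X)
sum_y = sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_xx = sum(x * x for x in X)
sum_yy = sum(y * y for y in Y)

print(n, sum_x, sum_y, sum_xy, sum_xx, sum_yy)  # → 10 273 163 4481 7499 2731

# Same expression as the hand calculation
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
r = numerator / denominator

print(f"{r:.2f}")  # → 0.53
```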

Is this manual calculation correct? Let's write some code to check.

# np.corrcoef returns a 2x2 correlation matrix; [0, 1] is the X-Y entry
corr = np.corrcoef(X, Y)[0, 1]
print(f"The correlation coefficient between X and Y is {corr:.12g}")
#=> The correlation coefficient between X and Y is 0.532109266822

That is NumPy for you: the answer comes from just one line of code, and it agrees with the manual calculation.
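It may help to see why the `[0,1]` index is needed. `np.corrcoef(X, Y)` returns a full correlation matrix rather than a single number, as this small sketch shows (the name `R` is only for illustration).

```python
import numpy as np

# Data from the table: grip strength (X) and ball throw (Y)
X = np.array([26, 26, 26, 27, 28, 29, 32, 29, 24, 26])
Y = np.array([16, 11, 14, 16, 18, 16, 18, 21, 14, 19])

R = np.corrcoef(X, Y)

print(R.shape)  # → (2, 2)

# Diagonal entries are the correlation of each variable with itself
print(np.allclose(np.diag(R), 1.0))  # → True

# The matrix is symmetric, so R[0, 1] and R[1, 0] both hold r(x, y)
print(np.isclose(R[0, 1], R[1, 0]))  # → True
```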

Summary

I have started reviewing the relationship between two variables, which can be called the basis of statistics. This time, as a starting point, I drew a scatter plot and calculated the correlation coefficient.
