The first step in any data analysis is collecting the data of interest. Statistics is concerned only with the overall picture and its tendencies, and we come to know these through a distribution model and summary statistics (mean and variance).
Examining the correlation coefficient is one of the basics. I've written a lot about Linear Regression and the Correlation Coefficient before. Let's dig a little deeper here and look at the details.
For a certain sport, I recorded the grip strength and ball-throw results of ten high school students.
Student | Grip strength | Ball throw |
---|---|---|
A | 26 | 16 |
B | 26 | 11 |
C | 26 | 14 |
D | 27 | 16 |
E | 28 | 18 |
F | 29 | 16 |
G | 32 | 18 |
H | 29 | 21 |
I | 24 | 14 |
J | 26 | 19 |
I keep repeating this, but statistics starts with data collection. In this example, I recorded and tabulated the sports performance of each individual student. However, the relationship is not clear from the table alone, so I will draw a figure. A scatter plot gives a rough idea of the relationship between the two variables x (grip strength) and y (ball throw).
Now, let's draw a scatter plot using what we have learned so far.
import numpy as np
import matplotlib.pyplot as plt
X = np.array( [26, 26, 26, 27, 28, 29, 32, 29, 24, 26] )
Y = np.array( [16, 11, 14, 16, 18, 16, 18, 21, 14, 19] )
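# Scatter plot of grip strength (x) against ball throw (y)
plt.plot(X, Y, 'o', color="blue")
plt.xlabel("Grip strength")
plt.ylabel("Ball throw")
# Save before show(): saving after the window is closed can write a blank image
plt.savefig("image.png")
plt.show()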
The points trend upward to the right, so there seems to be a positive correlation.
As I have explained before, the next step is to quantify the strength of the relationship between the two variables x and y. The correlation coefficient does exactly this.
To calculate the correlation coefficient concretely, first find the covariance with the following formula.
Cov(x,y) = \frac{1}{N} \sum_{k=1}^N x_k y_k - \overline{x}\,\overline{y}
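As a minimal sketch, the covariance formula can be evaluated directly on the table data. The variable names `cov_formula` and `cov_numpy` are my own, and the comparison with `np.cov` assumes `bias=True`, because `np.cov` divides by N - 1 by default while the formula here divides by N.

```python
import numpy as np

# Same data as the table: grip strength (X) and ball throw (Y)
X = np.array([26, 26, 26, 27, 28, 29, 32, 29, 24, 26])
Y = np.array([16, 11, 14, 16, 18, 16, 18, 21, 14, 19])

N = len(X)

# Cov(x, y) = (1/N) * sum(x_k * y_k) - mean(x) * mean(y)
cov_formula = np.sum(X * Y) / N - X.mean() * Y.mean()

# np.cov returns a 2x2 covariance matrix; bias=True divides by N instead of N - 1
cov_numpy = np.cov(X, Y, bias=True)[0, 1]

print(cov_formula, cov_numpy)
```

Both values should come out at about 3.11, which is the numerator of the correlation coefficient before any scaling.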
Do you remember it? Dividing the covariance by the product of the standard deviations σ(x) and σ(y) gives the correlation coefficient.
r(x,y) = \frac {Cov(x,y)} {\sigma(x)\sigma(y)}
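Multiplying the numerator and the denominator by N^2 and substituting N = 10, \sum x = 273, \sum y = 163, \sum xy = 4481, \sum x^2 = 7499 and \sum y^2 = 2731 from the table, we get
r(x,y) = \frac {10 \times 4481 - 273 \times 163} {\sqrt{(10 \times 7499 - 273^2)(10 \times 2731 - 163^2)}} = \frac {311} {\sqrt{461 \times 741}} \approx 0.53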
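The sums used in this substitution can be reproduced with a few lines of code. This is only a sketch that repeats the hand calculation step by step; the names `sum_x`, `sum_xy`, `r_manual` and so on are introduced here for illustration.

```python
import math
import numpy as np

# Same data as the table: grip strength (X) and ball throw (Y)
X = np.array([26, 26, 26, 27, 28, 29, 32, 29, 24, 26])
Y = np.array([16, 11, 14, 16, 18, 16, 18, 21, 14, 19])

N = len(X)
sum_x = int(X.sum())          # 273
sum_y = int(Y.sum())          # 163
sum_xy = int((X * Y).sum())   # 4481
sum_xx = int((X ** 2).sum())  # 7499
sum_yy = int((Y ** 2).sum())  # 2731

# Numerator and denominator of r(x, y), both already multiplied by N^2
numerator = N * sum_xy - sum_x * sum_y
denominator = math.sqrt((N * sum_xx - sum_x ** 2) * (N * sum_yy - sum_y ** 2))

r_manual = numerator / denominator
print(round(r_manual, 2))  # 0.53
```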
Is this manual calculation correct? Let's check it against NumPy's built-in function.
corr = np.corrcoef(X, Y)[0,1]
print(f"The correlation coefficient between X and Y is {corr:.12f}")
#=> The correlation coefficient between X and Y is 0.532109266822
That's NumPy for you. A single line of code gives the answer, and it agrees with the manual calculation.
With this post I have started reviewing the relationship between two variables, which can be called the foundation of statistics. As a starting point, I drew a scatter plot and calculated the correlation coefficient.
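The `[0,1]` index is needed because `np.corrcoef` returns a matrix rather than a single number. The following sketch, with the variable name `R` chosen here for illustration, shows the structure of that matrix for the two variables.

```python
import numpy as np

# Same data as the table: grip strength (X) and ball throw (Y)
X = np.array([26, 26, 26, 27, 28, 29, 32, 29, 24, 26])
Y = np.array([16, 11, 14, 16, 18, 16, 18, 21, 14, 19])

# For two 1-D inputs, corrcoef returns a 2x2 correlation matrix
R = np.corrcoef(X, Y)
print(R.shape)  # (2, 2)

# Diagonal entries are each variable's correlation with itself (1.0);
# the off-diagonal entries R[0, 1] and R[1, 0] both hold r(x, y)
print(R)
```

Because the matrix is symmetric, `[0,1]` and `[1,0]` give the same value.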