[PYTHON] [Machine learning] Understand from mathematics why the correlation coefficient ranges from -1 to 1.

1. Purpose

If you want to try machine learning, anyone can implement it relatively easily with scikit-learn and similar libraries. However, if you want to achieve results at work or raise your own level, an explanation like "I don't know the background, but I got this result" is clearly weak.

This time, I would like to write about the "correlation coefficient", which is often used in preprocessing. Many people know that the correlation coefficient lies between -1 and 1, but can you explain **"why it takes a value between -1 and 1"**?

This article has two aims. After a brief introduction to the correlation coefficient in section 2, section 3 covers "set the theory aside and first visualize the correlation coefficient with Python", and section 4 onward covers "understand the background from the mathematics".

2. What is the correlation coefficient?

The correlation coefficient is an index that measures the strength of the linear relationship between two random variables, and takes a real value between -1 and 1. Source: [Wikipedia](https://ja.wikipedia.org/wiki/%E7%9B%B8%E9%96%A2%E4%BF%82%E6%95%B0)

Roughly speaking, when the correlation coefficient is positive, the larger the value of one variable, the larger the other variable tends to be; when it is negative, the larger the value of one variable, the smaller the other variable tends to be.

◆ Correlation coefficient and rough guide to correlation strength

This is only a rule of thumb, but in general the following guidelines are used. [Source](https://sci-pursuit.com/math/statistics/correlation-coefficient.html)

キャプチャ1.PNG

◆ Notes

It's easy to get confused here, but be aware that a weak correlation does **not mean there is no relationship between the two variables**. As stated in the definition, the correlation coefficient is **an index that measures the strength of the linear relationship between two variables**, so **a relationship that is not linear cannot be judged from the correlation coefficient**.

Let's look at a concrete example. The two variables below are clearly related, following something like a quadratic curve. However, their correlation coefficient is -0.447, so if you only compute the correlation coefficient mechanically, you would judge the correlation to be relatively weak and might overlook a relationship that really does exist between the two variables.

キャプチャ2.PNG

So it is important to remember that **"the correlation coefficient is only an index for measuring linear relationships"** and to **"visualize the variables whenever possible so that you do not overlook a true relationship"**.
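The point can be checked with a few lines of NumPy. This is a minimal sketch on hypothetical data (not the data behind the figure above): `x` is spread symmetrically around 0 and `y` is exactly `x ** 2`, so `y` is completely determined by `x`, yet the linear correlation is essentially zero.

```python
import numpy as np

# Hypothetical data: x is symmetric around 0, y is an exact quadratic function of x.
x = np.linspace(-1.0, 1.0, 101)
y = x ** 2

# Pearson correlation coefficient, read from the 2x2 correlation matrix.
r = np.corrcoef(x, y)[0, 1]
print(f"correlation between x and x**2: {r:.3f}")
```

A scatter plot of the same data would show the parabola immediately, which is why plotting alongside the coefficient is worthwhile.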

◆ Where to use the correlation coefficient

In machine learning, the correlation coefficient is mainly used in preprocessing. More specifically, it is used to examine which explanatory variable to use for the objective variable (= feature selection).

There are two main use cases.

**(1) Select items that are highly correlated with the objective variable as explanatory variables**

Naturally, when building a model you need to choose explanatory variables that are related to the objective variable. (Putting completely unrelated variables into the model only lowers its accuracy.) The correlation coefficient is used as one index of this "relatedness": calculate the correlation coefficients and select the variables judged to be strongly correlated as explanatory variables.

**(2) If two explanatory variables are highly correlated with each other, delete one of them**

This is easier to understand with a concrete example. It's a fictitious setting, but **suppose you want to build a model that measures the technical skill of staff who specialize in shoe shining**. Technical skill is the objective variable and there are many candidate explanatory variables, two of which are **"years of service" and "staff ID"**.

キャプチャ3.PNG

As you might expect, the longer someone has served, the smaller their staff ID (they joined long ago), and the shorter their service, the larger their staff ID (they joined recently). **There is almost certainly a strong negative correlation.**

In such a case, including both the staff ID and the years of service raises the computational cost and may have an unwanted effect on model building, so remove one of them from the explanatory variables.
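A minimal sketch of this check with pandas follows. Everything in it is an assumption for illustration: the column names, the made-up values (including the extra `customers_per_day` column), and the `0.9` cutoff are hypothetical, and which of the two columns to drop is a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical staff data: IDs are issued in order of joining,
# so longer-serving staff have smaller IDs.
rng = np.random.default_rng(0)
staff = pd.DataFrame({
    "staff_id": np.arange(1, 21),                 # 1, 2, ..., 20
    "years_of_service": np.arange(20, 0, -1),     # 20, 19, ..., 1
    "customers_per_day": rng.integers(5, 30, size=20),
})

# Correlation between the two candidate explanatory variables.
r = staff.corr().loc["staff_id", "years_of_service"]
print(f"staff_id vs years_of_service: {r:.3f}")

# Assumed cutoff: treat |r| above this as "carrying the same information".
THRESHOLD = 0.9
if abs(r) > THRESHOLD:
    features = staff.drop(columns=["staff_id"])
else:
    features = staff

print(list(features.columns))
```

Here `staff_id` is the one dropped because years of service has a more direct interpretation, but the correlation coefficient itself does not say which column to keep.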

3. Try computing the correlation coefficient with Python

(1) Import of required libraries

Import the following required to obtain the correlation coefficient.

import seaborn as sns

(2) Data preparation

Use iris data.

df = sns.load_dataset("iris")

(3) Display of correlation coefficient

It can be output as a heat map as shown below.

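sns.heatmap(df.corr(numeric_only=True), vmax=1, vmin=-1, center=0, annot=True)

The correlation coefficients themselves are computed by `df.corr()` and then drawn as a heat map. The iris data contains a non-numeric `species` column, so `numeric_only=True` is passed to exclude it; without it, recent pandas versions (2.x) raise an error. With the heat map you can check intuitively whether each correlation is strong or weak, instead of reading the numbers one by one.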

キャプチャ4.PNG

4. Understand from mathematics why the correlation coefficient takes a value from -1 to 1.

Now for the main subject. Until now I had simply accepted that the correlation coefficient "takes a value from -1 to 1" without questioning it, but why does it take a value from -1 to 1?

In conclusion, **the correlation coefficient is equal to $\cos θ$, where $θ$ is the angle formed by the two deviation vectors**.

I would like to explain this.

(1) Prior knowledge

◆ About cosθ

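For the inner product of two vectors $x$ and $y$ with angle $θ$ between them, the following holds. Because $\cos θ$ is a cosine, it always lies in the range -1 to 1.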

x \cdot y = \|x\|\|y\|\cos θ

(2) Correlation coefficient formula

The correlation coefficient is defined as follows.

Intuitively, the covariance $σ_{xy}$ expresses numerically how the two variables vary together, but its magnitude depends on the units of the data, so on its own it is hard to tell whether a value is large or small. Dividing by the standard deviations $σ_x$ and $σ_y$ normalizes it (= removes the units).

r_{xy} := \frac{σ_{xy}}{σ_xσ_y}
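The definition can be checked numerically. The sketch below uses hypothetical random data and assumes the divide-by-$N$ (population) covariance and standard deviation; it computes $σ_{xy} / (σ_x σ_y)$ by hand and compares the result with `np.corrcoef`.

```python
import numpy as np

# Hypothetical data: y depends linearly on x plus noise.
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 0.6 * x + rng.normal(size=200)

# Covariance: mean of the products of deviations (divide by N).
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Standard deviations: NumPy's default ddof=0 also divides by N.
sigma_x = x.std()
sigma_y = y.std()

r_manual = cov_xy / (sigma_x * sigma_y)
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"manual: {r_manual:.6f}  np.corrcoef: {r_numpy:.6f}")
```

Using the divide-by-$(N-1)$ versions for both the covariance and the standard deviations gives the same ratio, because the factors cancel.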

(3) Inner product of vectors

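Using the identity from (1) Prior knowledge, and letting $x$ and $y$ be the deviation vectors (each value minus the mean of its variable), we can transform as follows.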

\begin{align}
x \cdot y &= \|x\|\|y\|\cos θ\\
\cos θ &= \frac{x \cdot y}{\|x\|\|y\|}\\
&= \frac{\frac{x \cdot y}{N}}{\frac{\|x\|}{\sqrt{N}}\frac{\|y\|}{\sqrt{N}}} \quad (\text{numerator and denominator both divided by the number of data points } N)
\end{align}

With $x$ and $y$ taken as deviation vectors, this expression is exactly the covariance of the two variables divided by the product of their standard deviations, as shown below.

キャプチャ6.PNG
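Written out term by term, the substitution looks like this. The notation is an assumption for illustration: $x_i, y_i$ are the raw data points, $\bar{x}, \bar{y}$ their means, the deviation vectors are $x = (x_1 - \bar{x}, \dots, x_N - \bar{x})$ and $y = (y_1 - \bar{y}, \dots, y_N - \bar{y})$, and the covariance and standard deviations are the divide-by-$N$ versions.

```latex
\begin{align}
\frac{x \cdot y}{N} &= \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y}) = σ_{xy}\\
\frac{\|x\|}{\sqrt{N}} &= \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2} = σ_x\\
\frac{\|y\|}{\sqrt{N}} &= \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \bar{y})^2} = σ_y\\
\cos θ &= \frac{σ_{xy}}{σ_xσ_y} = r_{xy}
\end{align}
```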

As a result, we arrive at the same expression as the definition of the correlation coefficient given in (2).

In other words, the correlation coefficient between the two variables is equal to $\cos θ$ for the angle $θ$ between their deviation vectors $x$ and $y$. And since $\cos θ$ always lies in the range -1 to 1, the correlation coefficient also lies in the range -1 to 1.
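The equality can be confirmed numerically. This is a minimal sketch on hypothetical random data: it subtracts the means to form the deviation vectors, computes the cosine of the angle between them, and compares it with `np.corrcoef`. It also computes the cosine between the raw, uncentered vectors to show that the centering step matters.

```python
import numpy as np

# Hypothetical data with a negative linear relationship and a nonzero mean.
rng = np.random.default_rng(0)
a = 10.0 + rng.normal(size=50)
b = -0.8 * a + rng.normal(size=50)

# Deviation vectors: subtract each variable's mean.
dx = a - a.mean()
dy = b - b.mean()

# cos(theta) between the deviation vectors, from x . y = ||x|| ||y|| cos(theta).
cos_theta = np.dot(dx, dy) / (np.linalg.norm(dx) * np.linalg.norm(dy))

# For comparison: cosine between the raw vectors (no centering).
cos_raw = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

r = np.corrcoef(a, b)[0, 1]
print(f"cos(theta), deviation vectors: {cos_theta:.6f}")
print(f"np.corrcoef:                   {r:.6f}")
print(f"cos(theta), raw vectors:       {cos_raw:.6f}")
```

Only the cosine computed from the deviation vectors is guaranteed to match the correlation coefficient; the raw-vector cosine is a different quantity in general.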

(4) Summary

As described above, the correlation coefficient is defined by the same expression as $\cos θ$ for the angle $θ$ between the two deviation vectors, and $\cos θ$ lies in the range -1 to 1, so the correlation coefficient also lies in the range -1 to 1.

5. Summary

How was it? In my opinion, a very complicated explanation right from the start is hard to follow and keeps you from moving forward, so it is very important to set the theory aside at first and simply try building a machine learning model (computing correlation coefficients along the way).

However, once you are used to that, I feel it is very important to understand from the mathematical background what the correlation coefficient really means.

I hope it helps you to deepen your understanding.
