[Python] [Machine learning] Understanding decorrelation from the mathematics

1. Introduction

(1) Purpose

Thanks to scikit-learn and similar libraries, almost anyone can implement machine learning relatively easily these days. However, if you want to produce results at work or raise your own level, **"I don't know the background, but I got this result"** is clearly a weak position to be in.

This time, we will focus on **decorrelation**.

Following on from my previous articles, the purpose of this one is to answer questions such as "I've heard of decorrelation, but why do you do it?", "How do you use it?", and "What processing does decorrelation actually perform, mathematically?"

(2) Structure

First, Chapter 2 gives an overview of decorrelation, and Chapter 3 actually performs it. Finally, Chapter 4 explains how to understand decorrelation mathematically.

2. What is decorrelation?

As the name implies, it means **eliminating the correlation between variables**. But what is wrong with variables being highly correlated in the first place?

(1) Why decorrelation is necessary

The short answer: **because the variance of the partial regression coefficients becomes large, and the accuracy of the model tends to become unstable**.

...which probably doesn't mean much on its own, so let me explain a little more.

For example, the formula for a regression model is commonly expressed as:

$y = a_1 x_1 + a_2 x_2 + \cdots + b$

Multiple regression analysis consists of plugging the actual data into $y$ (the objective variable) and $x_1, x_2, \ldots$ (the explanatory variables) and solving for the partial regression coefficients $a_1, a_2, \ldots$ and the constant term $b$.
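As a minimal sketch of this (using made-up synthetic data, not the dataset that appears later in this article), scikit-learn's `LinearRegression` finds exactly these quantities:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data generated from y = 2*x1 + 3*x2 + 1 plus a little noise
X = rng.normal(size=(200, 2))
y = 2 * X[:, 0] + 3 * X[:, 1] + 1 + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
print(model.coef_)       # partial regression coefficients a1, a2 (close to 2 and 3)
print(model.intercept_)  # constant term b (close to 1)
```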

Deriving the variance of a partial regression coefficient (intuitively, a measure of how widely the fitted coefficient tends to scatter) gets messy if written out in full, so here is just the conclusion: **the factor "(1 − (correlation coefficient)²)" appears in the denominator of the formula for the variance of a partial regression coefficient**.

In other words, **the larger the correlation, the smaller that denominator, and as a result the larger the variance of the partial regression coefficient; that is, the coefficient can swing over a wide range of values and the accuracy of the model becomes unstable**.

That is the theory.
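To make the theory concrete, here is a small simulation sketch of my own (not part of the original argument): we fit the same model on many freshly drawn datasets, once with uncorrelated explanatory variables and once with highly correlated ones, and compare how much the fitted coefficient scatters.

```python
import numpy as np

rng = np.random.default_rng(42)

def coef_variance(correlation, n_trials=500, n_samples=100):
    """Empirical variance of the fitted coefficient a1 in y = x1 + x2 + noise."""
    cov = [[1.0, correlation], [correlation, 1.0]]
    coefs = []
    for _ in range(n_trials):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n_samples)
        y = X[:, 0] + X[:, 1] + rng.normal(size=n_samples)
        # Least-squares fit with a constant term appended
        a, *_ = np.linalg.lstsq(np.column_stack([X, np.ones(n_samples)]), y, rcond=None)
        coefs.append(a[0])
    return np.var(coefs)

print(coef_variance(0.0))   # small: uncorrelated explanatory variables
print(coef_variance(0.95))  # much larger: highly correlated variables
```

The ratio of the two printed variances comes out close to $1/(1-0.95^2) \approx 10$.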

(2) So should highly correlated variables simply be deleted...?

As we saw in (1), high correlation between variables is a problem, but that does not mean one of the correlated variables should simply be deleted.

The reason: **"high correlation between two variables" only means that "the variables are close to a linear relationship"**.

→ So **if you casually delete one of them, you may also throw away other important information that the variable actually carries**.

(3) Then what should we do?

This is where decorrelation comes in.

We build the model after first eliminating the correlation between the variables.

This is probably still hard to picture, so let's actually try it.

3. Trying decorrelation

As a concrete example, I will use Kaggle's kickstarter-projects dataset, which I always use. https://www.kaggle.com/kemical/kickstarter-projects

This chapter is long, but **the essential decorrelation step is only (vii)**, so it's a good idea to look there first.

※ Important notes ※
- This time I could not find an explanatory variable that genuinely needed decorrelation, so I decorrelate variables that have nothing to do with the model construction.

Please treat this chapter simply as a way to see that "decorrelation is done in this way".

- There was a site that performed exactly this decorrelation on the kickstarter-projects dataset I always use in my articles, so I used it as a reference. https://ds-blog.tbtech.co.jp/entry/2019/04/27/Kaggle%E3%81%AB%E6%8C%91%E6%88%A6%E3%81%97%E3%82%88%E3%81%86%EF%BC%81_%EF%BD%9E%E3%82%B3%E3%83%BC%E3%83%89%E8%AA%AC%E6%98%8E%EF%BC%92%EF%BD%9E

(i) Import

# Import numpy and pandas
import numpy as np
import pandas as pd
import seaborn as sns

# Import datetime to process the date columns
import datetime

(ii) Reading the data


df = pd.read_csv("ks-projects-201801.csv")

(iii) Checking the number of records

The following shows that the dataset has shape (378661, 15).

df.shape

(iv) Data shaping

◆ Campaign duration in days

I will omit the details, but since the data contains the start and end times of each crowdfunding campaign, I convert these into a "number of campaign days".

# Parse the date columns and compute the campaign duration in days
df['deadline'] = pd.to_datetime(df["deadline"])
df["launched"] = pd.to_datetime(df["launched"])
df["days"] = (df["deadline"] - df["launched"]).dt.days

◆ About the objective variable

I will omit the details here as well: the objective variable "state" has categories other than success ("successful") and failure ("failed"), but this time I will use only the successful and failed records.

df = df[(df["state"] == "successful") | (df["state"] == "failed")]

Then replace success with 1 and failure with 0.

df["state"] = df["state"].replace("failed",0)
df["state"] = df["state"].replace("successful",1)

(v) Handling missing values

Skipping straight to the conclusion: since "usd pledged" is the variable we will use from here on, missing values are handled only for this variable (filled with the column mean).

df["usd pledged"] = df["usd pledged"].fillna(df["usd pledged"].mean())

(vi) Checking the correlation coefficients

Let's check the correlation of each variable.

sns.heatmap(df.corr())
(Figure: correlation heatmap of the numeric columns)

Now, let's decorrelate "pledged" and "usd pledged", which are highly correlated with each other.

(vii) Decorrelation

The decorrelation itself only requires the code below. It may not be clear yet what the code means; for now, take it at face value, and then see Chapter 4, "Understanding decorrelation from the mathematics".

# Store only the pledged and usd pledged columns in df_corr
df_corr = pd.DataFrame({'pledged' : df['pledged'], 'usdpledged' : df['usd pledged']})

# Compute the variance-covariance matrix
cov = np.cov(df_corr, rowvar=0) 

# Store the eigenvectors of the covariance matrix in S
_, S = np.linalg.eig(cov)           

# Decorrelate the data (.T denotes transpose)
pledged_decorr = np.dot(S.T, df_corr.T).T 

This completes the decorrelation. As a check, let's compute the correlation coefficient between pledged and usd pledged.

print('Correlation coefficient: {:.3f}'.format(np.corrcoef(pledged_decorr[:, 0], pledged_decorr[:, 1])[0,1])) 

This displays "Correlation coefficient: 0.000". The decorrelation succeeded!
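If you don't have the Kaggle CSV at hand, the same steps can be reproduced on synthetic data. The following self-contained sketch (the column contents are made up; only the procedure matches the step above) builds two strongly correlated columns and decorrelates them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated columns, standing in for pledged / usd pledged
x1 = rng.normal(size=1000)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=1000)
data = np.column_stack([x1, x2])
print(f'Before: {np.corrcoef(data[:, 0], data[:, 1])[0, 1]:.3f}')  # close to 1

# Same steps as above: covariance matrix -> eigenvectors -> projection
cov = np.cov(data, rowvar=False)
_, S = np.linalg.eig(cov)
decorrelated = data @ S  # equivalent to np.dot(S.T, data.T).T

r = np.corrcoef(decorrelated[:, 0], decorrelated[:, 1])[0, 1]
print(f'After: {r:.3f}')  # 0.000 up to floating-point error
```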

4. Understanding decorrelation from the mathematics

(1) Premise

In this chapter, let's see how decorrelation is actually handled mathematically. As mentioned at the beginning, understanding decorrelation requires thinking in terms of matrices and eigenvalues/eigenvectors.

If you find it difficult, feel free to skip it. The explanation here is not exhaustive, but it conveys the rough idea.

(2) Specific example

Suppose we have several explanatory variables, $\boldsymbol{x_1}, \boldsymbol{x_2}, \ldots, \boldsymbol{x_n}$.

The variance-covariance matrix of these variables can be written as follows:

$$\Sigma = \begin{pmatrix} V(x_1) & \mathrm{Cov}(x_1, x_2) & \cdots & \mathrm{Cov}(x_1, x_n) \\ \mathrm{Cov}(x_2, x_1) & V(x_2) & \cdots & \mathrm{Cov}(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(x_n, x_1) & \mathrm{Cov}(x_n, x_2) & \cdots & V(x_n) \end{pmatrix}$$

The off-diagonal entries are the covariances of each pair of variables, and the diagonal entries are the variances of the individual variables.

**Diagonalizing** this variance-covariance matrix transforms it as follows (where $S$ is the matrix whose columns are the eigenvectors of $\Sigma$, and $\lambda_1, \ldots, \lambda_n$ are its eigenvalues):

$$S^{\top} \Sigma S = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}$$

...which may not look like much at first. The important point is that **all the off-diagonal entries, the covariances, are now 0**.

**Diagonalization drives every covariance to 0, and that is exactly what decorrelation means**.

So why does a covariance of 0 between variables mean they are uncorrelated?

To see that, recall the formula for the correlation coefficient. If the correlation coefficient is $r$, it is defined as follows.

$$r = \frac{\mathrm{Cov}(x, y)}{\sigma_x \, \sigma_y} \quad \text{(covariance divided by the standard deviations of } x \text{ and } y\text{)}$$

From this formula we can see that **a covariance of 0 makes the numerator 0, so the correlation coefficient is also 0**.

That is how **diagonalization sets the covariance between the variables to 0, which makes the correlation coefficients 0 and thereby achieves decorrelation**.
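The argument above can be checked numerically. This sketch diagonalizes a made-up 2×2 variance-covariance matrix with its eigenvectors and confirms that the covariances (the off-diagonal entries) become 0:

```python
import numpy as np

# A made-up variance-covariance matrix: variances 4 and 9, covariance 2.4
cov = np.array([[4.0, 2.4],
                [2.4, 9.0]])

# Columns of S are the eigenvectors of the covariance matrix
eigvals, S = np.linalg.eig(cov)

# Diagonalization: S^T Sigma S has the eigenvalues on the diagonal
D = S.T @ cov @ S
print(np.round(D, 10))  # off-diagonal entries are 0
```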

5. Conclusion

How was it? My view is that when you are starting out, it is perfectly fine to think, "I can't interpret extremely complicated code from day one, so never mind the accuracy; I'll just implement a basic end-to-end workflow with scikit-learn and the like."

However, once you are used to that, I feel it becomes very important to understand, from the mathematical background, how these methods work behind the scenes.

Some parts may be hard to follow, but I hope this helps deepen your understanding.
