[Python] [Machine learning] Understanding decorrelation from the mathematics

1. Introduction

(1) Purpose

Thanks to scikit-learn and similar libraries, almost anyone can implement machine learning relatively easily these days. However, if you want to produce results at work or raise your own level, **"I don't know the background, but I got this result"** is clearly a weak position to be in.

This time, we will focus on **decorrelation**.

Following on from my previous articles, the purpose of this one is to answer questions such as "I've heard of decorrelation, but why do you do it?", "How do you use it?", and "What processing does decorrelation actually perform, mathematically?"

(2) Structure

First, Chapter 2 gives an overview of decorrelation, and Chapter 3 actually performs it. Finally, Chapter 4 explains how to understand decorrelation mathematically.

2. What is decorrelation?

As the name implies, it means **eliminating the correlation between variables**. But what is wrong with variables being highly correlated in the first place?

(1) Why decorrelation is necessary

The short answer: **because the variance of the partial regression coefficients becomes large, and the accuracy of the model tends to become unstable**.

...which probably doesn't mean much on its own, so let me explain a little more.

For example, the formula for a regression model is commonly expressed as:

$y = a_1 x_1 + a_2 x_2 + \cdots + b$

Multiple regression analysis consists of plugging the actual data into $y$ (the objective variable) and $x_1, x_2, \ldots$ (the explanatory variables) and solving for the partial regression coefficients $a_1, a_2, \ldots$ and the constant term $b$.
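As a minimal sketch of this (using made-up synthetic data, not the dataset that appears later in this article), scikit-learn's `LinearRegression` finds exactly these quantities:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data generated from y = 2*x1 + 3*x2 + 1 plus a little noise
X = rng.normal(size=(200, 2))
y = 2 * X[:, 0] + 3 * X[:, 1] + 1 + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
print(model.coef_)       # partial regression coefficients a1, a2 (close to 2 and 3)
print(model.intercept_)  # constant term b (close to 1)
```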

Deriving the variance of a partial regression coefficient (intuitively, a measure of how widely the fitted coefficient tends to scatter) gets messy if written out in full, so here is just the conclusion: **the factor "(1 − (correlation coefficient)²)" appears in the denominator of the formula for the variance of a partial regression coefficient**.

In other words, **the larger the correlation, the smaller that denominator, and as a result the larger the variance of the partial regression coefficient; that is, the coefficient can swing over a wide range of values and the accuracy of the model becomes unstable**.

That is the theory.
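To make the theory concrete, here is a small simulation sketch of my own (not part of the original argument): we fit the same model on many freshly drawn datasets, once with uncorrelated explanatory variables and once with highly correlated ones, and compare how much the fitted coefficient scatters.

```python
import numpy as np

rng = np.random.default_rng(42)

def coef_variance(correlation, n_trials=500, n_samples=100):
    """Empirical variance of the fitted coefficient a1 in y = x1 + x2 + noise."""
    cov = [[1.0, correlation], [correlation, 1.0]]
    coefs = []
    for _ in range(n_trials):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n_samples)
        y = X[:, 0] + X[:, 1] + rng.normal(size=n_samples)
        # Least-squares fit with a constant term appended
        a, *_ = np.linalg.lstsq(np.column_stack([X, np.ones(n_samples)]), y, rcond=None)
        coefs.append(a[0])
    return np.var(coefs)

print(coef_variance(0.0))   # small: uncorrelated explanatory variables
print(coef_variance(0.95))  # much larger: highly correlated variables
```

The ratio of the two printed variances comes out close to $1/(1-0.95^2) \approx 10$.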

(2) So should highly correlated variables simply be deleted...?

As we saw in (1), high correlation between variables is a problem, but that does not mean one of the correlated variables should simply be deleted.

The reason: **"high correlation between two variables" only means that "the variables are close to a linear relationship"**.

→ So **if you casually delete one of them, you may also throw away other important information that the variable actually carries**.

(3) Then what should we do?

This is where decorrelation comes in.

We build the model after first eliminating the correlation between the variables.

This is probably still hard to picture, so let's actually try it.

3. Trying decorrelation

As a concrete example, I will use Kaggle's kickstarter-projects dataset, which I always use. https://www.kaggle.com/kemical/kickstarter-projects

This chapter is long, but **the essential decorrelation step is only (vii)**, so it's a good idea to look there first.

※ Important notes ※
- This time I could not find an explanatory variable that genuinely needed decorrelation, so I decorrelate variables that have nothing to do with the model construction.

Please treat this chapter simply as a way to see that "decorrelation is done in this way".

- There was a site that performed exactly this decorrelation on the kickstarter-projects dataset I always use in my articles, so I used it as a reference. https://ds-blog.tbtech.co.jp/entry/2019/04/27/Kaggle%E3%81%AB%E6%8C%91%E6%88%A6%E3%81%97%E3%82%88%E3%81%86%EF%BC%81_%EF%BD%9E%E3%82%B3%E3%83%BC%E3%83%89%E8%AA%AC%E6%98%8E%EF%BC%92%EF%BD%9E

(i) Import

# Import numpy and pandas
import numpy as np
import pandas as pd
import seaborn as sns

# Import datetime to process the date columns
import datetime

(ii) Reading the data


df = pd.read_csv("ks-projects-201801.csv")

(iii) Checking the number of records

The following shows that the dataset has shape (378661, 15).

df.shape

(iv) Data shaping

◆ Campaign duration in days

I will omit the details, but since the data contains the start and end times of each crowdfunding campaign, I convert these into a "number of campaign days".

# Parse the date columns and compute the campaign duration in days
df['deadline'] = pd.to_datetime(df["deadline"])
df["launched"] = pd.to_datetime(df["launched"])
df["days"] = (df["deadline"] - df["launched"]).dt.days

◆ About the objective variable

I will omit the details here as well: the objective variable "state" has categories other than success ("successful") and failure ("failed"), but this time I will use only the successful and failed records.

df = df[(df["state"] == "successful") | (df["state"] == "failed")]

Then replace success with 1 and failure with 0.

df["state"] = df["state"].replace("failed",0)
df["state"] = df["state"].replace("successful",1)

(v) Handling missing values

Skipping straight to the conclusion: since "usd pledged" is the variable we will use from here on, missing values are handled only for this variable (filled with the column mean).

df["usd pledged"] = df["usd pledged"].fillna(df["usd pledged"].mean())

(vi) Checking the correlation coefficients

Let's check the correlation of each variable.

sns.heatmap(df.corr())
(Figure: correlation heatmap of the numeric columns)

Now, let's decorrelate "pledged" and "usd pledged", which are highly correlated with each other.

(vii) Decorrelation

The decorrelation itself only requires the code below. It may not be clear yet what the code means; for now, take it at face value, and then see Chapter 4, "Understanding decorrelation from the mathematics".

# Store only the pledged and usd pledged columns in df_corr
df_corr = pd.DataFrame({'pledged' : df['pledged'], 'usdpledged' : df['usd pledged']})

# Compute the variance-covariance matrix
cov = np.cov(df_corr, rowvar=0) 

# Store the eigenvectors of the covariance matrix in S
_, S = np.linalg.eig(cov)           

# Decorrelate the data (.T denotes transpose)
pledged_decorr = np.dot(S.T, df_corr.T).T 

This completes the decorrelation. As a check, let's compute the correlation coefficient between pledged and usd pledged.

print('Correlation coefficient: {:.3f}'.format(np.corrcoef(pledged_decorr[:, 0], pledged_decorr[:, 1])[0,1])) 

This displays "Correlation coefficient: 0.000". The decorrelation succeeded!
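If you don't have the Kaggle CSV at hand, the same steps can be reproduced on synthetic data. The following self-contained sketch (the column contents are made up; only the procedure matches the step above) builds two strongly correlated columns and decorrelates them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated columns, standing in for pledged / usd pledged
x1 = rng.normal(size=1000)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=1000)
data = np.column_stack([x1, x2])
print(f'Before: {np.corrcoef(data[:, 0], data[:, 1])[0, 1]:.3f}')  # close to 1

# Same steps as above: covariance matrix -> eigenvectors -> projection
cov = np.cov(data, rowvar=False)
_, S = np.linalg.eig(cov)
decorrelated = data @ S  # equivalent to np.dot(S.T, data.T).T

r = np.corrcoef(decorrelated[:, 0], decorrelated[:, 1])[0, 1]
print(f'After: {r:.3f}')  # 0.000 up to floating-point error
```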

4. Understanding decorrelation from the mathematics

(1) Premise

In this chapter, let's see how decorrelation is actually handled mathematically. As mentioned at the beginning, understanding decorrelation requires thinking in terms of matrices and eigenvalues/eigenvectors.

If you find it difficult, feel free to skip it. The explanation here is not exhaustive, but it conveys the rough idea.

(2) Specific example

Suppose we have several explanatory variables, $\boldsymbol{x_1}, \boldsymbol{x_2}, \ldots, \boldsymbol{x_n}$.

The variance-covariance matrix of these variables can be written as follows:

$$\Sigma = \begin{pmatrix} V(x_1) & \mathrm{Cov}(x_1, x_2) & \cdots & \mathrm{Cov}(x_1, x_n) \\ \mathrm{Cov}(x_2, x_1) & V(x_2) & \cdots & \mathrm{Cov}(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(x_n, x_1) & \mathrm{Cov}(x_n, x_2) & \cdots & V(x_n) \end{pmatrix}$$

The off-diagonal entries are the covariances of each pair of variables, and the diagonal entries are the variances of the individual variables.

**Diagonalizing** this variance-covariance matrix transforms it as follows (where $S$ is the matrix whose columns are the eigenvectors of $\Sigma$, and $\lambda_1, \ldots, \lambda_n$ are its eigenvalues):

$$S^{\top} \Sigma S = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}$$

...which may not look like much at first. The important point is that **all the off-diagonal entries, the covariances, are now 0**.

**Diagonalization drives every covariance to 0, and that is exactly what decorrelation means**.

So why does a covariance of 0 between variables mean they are uncorrelated?

To see that, recall the formula for the correlation coefficient. If the correlation coefficient is $r$, it is defined as follows.

$$r = \frac{\mathrm{Cov}(x, y)}{\sigma_x \, \sigma_y} \quad \text{(covariance divided by the standard deviations of } x \text{ and } y\text{)}$$

From this formula we can see that **a covariance of 0 makes the numerator 0, so the correlation coefficient is also 0**.

That is how **diagonalization sets the covariance between the variables to 0, which makes the correlation coefficients 0 and thereby achieves decorrelation**.
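The argument above can be checked numerically. This sketch diagonalizes a made-up 2×2 variance-covariance matrix with its eigenvectors and confirms that the covariances (the off-diagonal entries) become 0:

```python
import numpy as np

# A made-up variance-covariance matrix: variances 4 and 9, covariance 2.4
cov = np.array([[4.0, 2.4],
                [2.4, 9.0]])

# Columns of S are the eigenvectors of the covariance matrix
eigvals, S = np.linalg.eig(cov)

# Diagonalization: S^T Sigma S has the eigenvalues on the diagonal
D = S.T @ cov @ S
print(np.round(D, 10))  # off-diagonal entries are 0
```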

5. Conclusion

How was it? My view is that when you are starting out, it is perfectly fine to think, "I can't interpret extremely complicated code from day one, so never mind the accuracy; I'll just implement a basic end-to-end workflow with scikit-learn and the like."

However, once you are used to that, I feel it becomes very important to understand, from the mathematical background, how these methods work behind the scenes.

Some parts may be hard to follow, but I hope this helps deepen your understanding.
