2. Multivariate analysis spelled out in Python 1-2. Simple regression analysis (algorithm)

We will derive the simple regression equation ourselves, using only pandas and NumPy for the basic numerical calculations (plus matplotlib for plotting).

**⑴ Import the libraries**

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

**⑵ Import the data and check the contents**

df = pd.read_csv("https://raw.githubusercontent.com/karaage0703/machine-learning-study/master/data/karaage_data.csv")
print(df.head())


**Least squares method**

The goal of simple regression analysis is to find the two constants contained in the regression equation: the regression coefficient $a$ and the intercept $b$. To obtain an accurate simple regression equation, the constants $a$ and $b$ must be determined so that the overall error, that is, the residual $y - \hat{y}$, becomes as small as possible.

Consider how to define this **total of the residuals**. Simply summing them as

e_{1}+e_{2}+e_{3}+\cdots+e_{n}

looks plausible, but it is incorrect. The measured values are scattered on both the positive and negative sides of the regression line, so the plus and minus terms cancel each other out and the total comes to 0.

Therefore, by squaring each individual residual, the signs are eliminated and every term can be treated simply as a magnitude (a distance):

Q = {e_{1}}^{2}+{e_{2}}^{2}+{e_{3}}^{2}+\cdots+{e_{n}}^{2}

$Q$ is thus defined as the total squared distance from the regression line. The slope $a$ of the regression line is the value that minimizes $Q$, and once $a$ is obtained, the intercept $b$ on the $y$ axis follows naturally. This method is called the **least squares method**.

We will solve the simple regression equation based on the least squares method.
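As a side note, the quantity $Q$ can be written directly as a small helper; the following is a minimal sketch (the function name sum_squared_residuals is just for illustration) that evaluates $Q$ for candidate values of $a$ and $b$ given NumPy arrays or pandas Series. The least-squares solution is simply the pair $(a, b)$ that makes this value smallest.

# Sum of squared residuals Q for candidate values of a and b
def sum_squared_residuals(a, b, x, y):
    residuals = y - (a * x + b)   # e_i = y_i - (a * x_i + b)
    return (residuals ** 2).sum()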

**⑶ Calculate the mean of each variable x and y**

mean_x = df['x'].mean()
mean_y = df['y'].mean()


**⑷ Calculate the deviations of each variable x and y**

A deviation is the difference between an individual value and the mean. Calculate $x_{i} - \bar{x}$ for the variable $x$ and $y_{i} - \bar{y}$ for the variable $y$; one deviation is computed for each data point.

# Deviations of x
dev_x = []
for i in df['x']:
    dx = i - mean_x
    dev_x.append(dx)
# Deviations of y
dev_y = []
for j in df['y']:
    dy = j - mean_y
    dev_y.append(dy)
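For reference, the same deviations can be obtained without an explicit loop, because pandas applies arithmetic element-wise; a quick vectorized sketch (yielding Series rather than Python lists):

# Vectorized alternative: subtract the scalar mean element-wise
dev_x_vec = df['x'] - mean_x
dev_y_vec = df['y'] - mean_y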


**⑸ Calculate the variance of the variable x**

Calculate the variance using the deviations obtained in ⑷. The variance is essentially the mean of the squared deviations: square each deviation, take the sum, and divide by the number of data points minus 1 (the unbiased variance).

# Sum of squared deviations of x
ssdev_x = 0
for i in dev_x:
    d = i ** 2
    ssdev_x += d
# Variance (unbiased: divide by n - 1)
var_x = ssdev_x / (len(df) - 1)
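For reference, the same value can be obtained in one line. Pandas' Series.var() uses the unbiased n − 1 denominator by default, so it should agree with the loop above; a quick cross-check sketch:

# Vectorized cross-check of the unbiased variance of x
var_x_vec = ((df['x'] - mean_x) ** 2).sum() / (len(df) - 1)
print(var_x_vec, df['x'].var())   # Series.var() defaults to ddof=1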


**⑹ Calculate the covariance**

The covariance $s_{xy}$ is one of the indexes showing the strength of the relationship between two variables, and it is defined by the following equation:

s_{xy} = \frac{1}{n - 1} \displaystyle \sum_{i = 1}^n {(x_i - \overline{x})(y_{i} - \overline{y})}

Consider the data as one pair of values per individual. Given the $n$ pairs $(x_{1}, y_{1}), (x_{2}, y_{2}), \ldots, (x_{n}, y_{n})$, multiply the deviation of $x$ by the deviation of $y$ for each pair, and divide the sum of these products by the number of data points minus 1.

# Sum of products of deviations
spdev = 0
for i,j in zip(df['x'], df['y']):
    spdev += (i - mean_x) * (j - mean_y)
# Covariance (unbiased: divide by n - 1)
cov = spdev / (len(df) - 1)
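As with the variance, pandas has a built-in that should give the same result, since Series.cov() also divides by n − 1 by default; a quick cross-check sketch:

# Vectorized cross-check of the covariance
print(cov, df['x'].cov(df['y']))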


**⑺ Calculate the regression coefficient a**

Here is the formula for the regression coefficient given by the least squares method:

a = \frac{s_{xy}}{s_x^2}

The regression coefficient $a$ is obtained by dividing the covariance $s_{xy}$ obtained in ⑹ by the variance $s_x^2$ of the variable $x$ obtained in ⑸.
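As a brief aside, this formula can be derived from $Q = \sum_{i=1}^{n} (y_i - a x_i - b)^2$. Setting $\frac{\partial Q}{\partial b} = 0$ gives $b = \bar{y} - a\bar{x}$, and substituting this into $\frac{\partial Q}{\partial a} = 0$ gives

a = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^{2}} = \frac{s_{xy}}{s_x^{2}}

The $n - 1$ factors in the covariance and the variance cancel in the ratio, which is why the sample statistics can be used directly.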

a = cov / var_x


**⑻ Calculate the intercept b**

Rearranging the simple regression equation $y = ax + b$ gives $b = y - ax$. Since the least-squares regression line always passes through the point $(\bar{x}, \bar{y})$, this becomes $b = \bar{y} - a\bar{x}$: substitute the mean values $\bar{x}, \bar{y}$ obtained in ⑶ and the regression coefficient $a$ obtained in ⑺.

b = mean_y - (a * mean_x)
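Since NumPy is already imported, np.polyfit with degree 1 offers a convenient cross-check: it performs the same least-squares fit and returns the coefficients from the highest degree down, i.e. the slope followed by the intercept. A quick sketch:

# Cross-check: np.polyfit with deg=1 performs the same least-squares fit
a_check, b_check = np.polyfit(df['x'], df['y'], 1)
print(a_check, b_check)   # should match a and b computed above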


**As shown above, the simple regression equation was obtained from the least squares formulas. It matches the result calculated earlier with the machine learning library scikit-learn. Let us therefore also calculate the coefficient of determination ourselves to confirm it.**

**⑼ Calculate the coefficient of determination and check the accuracy of the regression equation**

Create predicted values using the regression equation and find their variance. The question is what proportion of the variance of the measured values $y$ they account for, in other words, how much of the original variable $y$ the regression explains; with $z$ denoting the predicted values, this is the ratio $s_z^2 / s_y^2$.

# Create predicted values z from the regression equation
df['z'] = (a * df['x']) + b
print(df)

# Variance of the predicted values z
mean_z = df['z'].mean()
ssdev_z = 0
for i in df['z']:
    j = (i - mean_z)**2
    ssdev_z += j
var_z = ssdev_z / (len(df) - 1)
print("Variance of predicted values:", var_z)

# Variance of the measured values y
ssdev_y = 0
for i in dev_y:
    j = i ** 2
    ssdev_y += j
var_y = ssdev_y / (len(df) - 1)
print("Variance of measured value y:", var_y)

# Coefficient of determination (R squared)
R2 = var_z / var_y
print("Coefficient of determination R^2:", R2)

It was confirmed that the coefficient of determination also matches the result calculated by scikit-learn earlier.

**⑽ Show the regression line along with the scatter plot**

plt.plot(df['x'], df['y'], "o")  # Scatter plot of the measured values
plt.plot(df['x'], df['z'], "r")  # Regression line (predicted values)
plt.show()


So far, you have learned the algorithm behind simple regression analysis. In the real world, however, a phenomenon can rarely be explained by a single factor; behind any given phenomenon, various factors are usually intertwined to a greater or lesser degree. Next, you will learn multiple regression analysis, which deals with three or more variables.
