[PYTHON] Simple regression analysis by the least squares method

Linear analysis by the least squares method is called multiple regression analysis when there are multiple explanatory variables, and simple regression analysis when there is only one. Simple regression analysis, in other words, fits the linear function $ y = ax + b $.

Since this is regression analysis, we want to find the straight line that fits the data, that is, the optimum values of $ a $ and $ b $. The $ x $ and $ y $ values are given as data.

Looking into how to do this, there are various approaches, such as using a library like scikit-learn, or computing covariances and solving it properly. But since it is just the least squares method, shouldn't it be doable more simply? So I gave it a try.
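(For reference, the library route would look something like this. A minimal sketch assuming scikit-learn is installed; this is my illustration for comparison, not the method of this post.)

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 10

# scikit-learn expects the explanatory variables as a 2-D array
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(model.coef_[0], model.intercept_)  # -> slope a and intercept b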

We want to find $ a $ and $ b $, so write them as the vector

  A = \left(
    \begin{array}{c}
      a \\
      b \\
    \end{array}
  \right)

The data groups $ X $ and $ Y $ could also be represented as plain vectors, but since we want to set this up as a matrix (simultaneous equations), we write $ X $ as follows.

  X = \left(
    \begin{array}{cc}
      x_1 & 1 \\
      x_2 & 1 \\
      \vdots & \vdots \\
      x_n & 1
    \end{array}
  \right)

That is the whole trick. In other words, we have the simultaneous equations

  XA = Y \\
  \left(
    \begin{array}{cc}
      x_1 & 1 \\
      x_2 & 1 \\
      \vdots & \vdots \\
      x_n & 1
    \end{array}
  \right)
  \left(
    \begin{array}{c}
      a \\
      b \\
    \end{array}
  \right)
  =
  \left(
    \begin{array}{c}
      y_1 \\
      y_2 \\
      \vdots \\
      y_n
    \end{array}
  \right)

A quick dimension check confirms this: $ (N \times 2)(2 \times 1) = N \times 1 $. $ X $ is not square (there are more equations than unknowns), so it has no ordinary inverse, but using the generalized inverse matrix $ X^\dagger $,

  A = X^\dagger Y

the system is solved in one shot.
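Why does the pseudoinverse give the least-squares fit? A quick note (my addition, assuming $ X $ has full column rank, i.e. the $ x_i $ are not all identical): in that case $ X^\dagger $ has the closed form below, so $ A = X^\dagger Y $ is exactly the solution of the normal equations, the minimizer of $ \|XA - Y\|^2 $.

  X^\dagger = (X^\top X)^{-1} X^\top, \qquad
  A = (X^\top X)^{-1} X^\top Y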

This makes it easy to implement. Below is the test code.

import numpy as np
import random
import matplotlib.pyplot as plt

def linear_regression():

    # Make the answer first: the true line y = 2x + 10
    a = 2
    b = 10
    x_true = np.arange(0.0, 50.0, 1.0)
    y_true = a * x_true + b

    # Create the data by randomly shifting each point away from the true line
    xd = np.zeros(len(x_true))
    yd = np.zeros(len(y_true))
    for i in range(len(xd)):
        yd[i] = y_true[i] + random.randint(-10, 10)       # noise in [-10, 10]
    for i in range(len(xd)):
        xd[i] = x_true[i] + random.randint(-10, 10) / 10  # noise in [-1, 1]
    print(xd)
    print(yd)

    # Data group matrix: x values plus a column of ones (the ones pick up the intercept b)
    X = np.c_[xd, np.ones(len(xd))]
    print(X)

    # Least squares method: just multiply by the generalized inverse from the left
    A = np.linalg.pinv(X) @ yd
    y_est = x_true * A[0] + A[1]

    fig, ax = plt.subplots()
    ax.scatter(xd, yd, label='data')
    ax.plot(x_true, y_true, label='true')
    ax.plot(x_true, y_est, label='linear regression')
    ax.legend()
    ax.grid()
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    fig.tight_layout()
    plt.show()

    return

if __name__ == '__main__':
    linear_regression()
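As a sanity check (my addition, not part of the original post), NumPy also has a dedicated least-squares solver, np.linalg.lstsq, which gives the same $ A $ without forming the pseudoinverse explicitly:

import numpy as np

xd = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
yd = 2 * xd + 10
X = np.c_[xd, np.ones(len(xd))]

# Solves min ||X A - yd||^2 directly
A, residuals, rank, sv = np.linalg.lstsq(X, yd, rcond=None)
print(A)  # -> approximately [ 2. 10.]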

The result is as follows:

[Figure: test.png, scatter of the noisy data with the 'true' line and the fitted 'linear regression' line]

Here, 'true' is the original straight line. Since the data points were shifted randomly away from it, they do not lie on the line exactly, but the fitted line matches it well. For a problem like this, it seems regression analysis can be done easily without relying on libraries.
