[PYTHON] Simple regression analysis by the least squares method

Linear analysis by the least squares method is called multiple regression analysis when there are multiple explanatory variables, and simple regression analysis when there is only one. Simple regression analysis, in other words, fits the linear function $ y = ax + b $.

Since this is regression analysis, we want to find the straight line that fits the data, that is, the optimum values of $ a $ and $ b $. The $ x $ and $ y $ values are given as data.

Looking into how to do this, there are various approaches, such as using a library like scikit-learn, or computing covariances and solving it properly. But since it is just the least squares method, shouldn't it be doable more simply? So I gave it a try.
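(For reference, the library route would look something like this. A minimal sketch assuming scikit-learn is installed; this is my illustration for comparison, not the method of this post.)

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 10

# scikit-learn expects the explanatory variables as a 2-D array
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(model.coef_[0], model.intercept_)  # -> slope a and intercept b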

We want to find $ a $ and $ b $, so write them as the vector

  A = \left(
    \begin{array}{c}
      a \\
      b \\
    \end{array}
  \right)

The data groups $ X $ and $ Y $ could also be represented as plain vectors, but since we want to set this up as a matrix (simultaneous equations), we write $ X $ as follows.

  X = \left(
    \begin{array}{cc}
      x_1 & 1 \\
      x_2 & 1 \\
      \vdots & \vdots \\
      x_n & 1
    \end{array}
  \right)

That is the whole trick. In other words, we have the simultaneous equations

  XA = Y \\
  \left(
    \begin{array}{cc}
      x_1 & 1 \\
      x_2 & 1 \\
      \vdots & \vdots \\
      x_n & 1
    \end{array}
  \right)
  \left(
    \begin{array}{c}
      a \\
      b \\
    \end{array}
  \right)
  =
  \left(
    \begin{array}{c}
      y_1 \\
      y_2 \\
      \vdots \\
      y_n
    \end{array}
  \right)

A quick dimension check confirms this: $ (N \times 2)(2 \times 1) = N \times 1 $. $ X $ is not square (there are more equations than unknowns), so it has no ordinary inverse, but using the generalized inverse matrix $ X^\dagger $,

  A = X^\dagger Y

the system is solved in one shot.
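Why does the pseudoinverse give the least-squares fit? A quick note (my addition, assuming $ X $ has full column rank, i.e. the $ x_i $ are not all identical): in that case $ X^\dagger $ has the closed form below, so $ A = X^\dagger Y $ is exactly the solution of the normal equations, the minimizer of $ \|XA - Y\|^2 $.

  X^\dagger = (X^\top X)^{-1} X^\top, \qquad
  A = (X^\top X)^{-1} X^\top Y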

This makes it easy to implement. Below is the test code.

import numpy as np
import random
import matplotlib.pyplot as plt

def linear_regression():

    # Make the answer first: the true line y = 2x + 10
    a = 2
    b = 10
    x_true = np.arange(0.0, 50.0, 1.0)
    y_true = a * x_true + b

    # Create the data by randomly shifting each point away from the true line
    xd = np.zeros(len(x_true))
    yd = np.zeros(len(y_true))
    for i in range(len(xd)):
        yd[i] = y_true[i] + random.randint(-10, 10)       # noise in [-10, 10]
    for i in range(len(xd)):
        xd[i] = x_true[i] + random.randint(-10, 10) / 10  # noise in [-1, 1]
    print(xd)
    print(yd)

    # Data group matrix: x values plus a column of ones (the ones pick up the intercept b)
    X = np.c_[xd, np.ones(len(xd))]
    print(X)

    # Least squares method: just multiply by the generalized inverse from the left
    A = np.linalg.pinv(X) @ yd
    y_est = x_true * A[0] + A[1]

    fig, ax = plt.subplots()
    ax.scatter(xd, yd, label='data')
    ax.plot(x_true, y_true, label='true')
    ax.plot(x_true, y_est, label='linear regression')
    ax.legend()
    ax.grid()
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    fig.tight_layout()
    plt.show()

    return

if __name__ == '__main__':
    linear_regression()
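As a sanity check (my addition, not part of the original post), NumPy also has a dedicated least-squares solver, np.linalg.lstsq, which gives the same $ A $ without forming the pseudoinverse explicitly:

import numpy as np

xd = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
yd = 2 * xd + 10
X = np.c_[xd, np.ones(len(xd))]

# Solves min ||X A - yd||^2 directly
A, residuals, rank, sv = np.linalg.lstsq(X, yd, rcond=None)
print(A)  # -> approximately [ 2. 10.]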

The result is as follows:

[Figure: test.png, scatter of the noisy data with the 'true' line and the fitted 'linear regression' line]

Here, 'true' is the original straight line. Since the data points were shifted randomly away from it, they do not lie on the line exactly, but the fitted line matches it well. For a problem like this, it seems regression analysis can be done easily without relying on libraries.
