# Overview

Describes the least squares method. First, calculate with concrete numerical values, then find the general formula, and finally implement it in Python.

# What is the least squares method?

Regression analysis is a method of deriving the formula that best fits the sample data and predicting the numerical value of the new data, which is one of the typical methods. Derive a relational expression that minimizes the error between the sample data and the value obtained from the expression.

# Concrete example

Suppose you predict the growth of radish. Currently, on the 4th day, we measure how many centimeters of leaves are coming out every day.

Day 1 Day 2 Day 3 Day 4
0.3 1.9 2.8 4.3

I made a graph of this data. The $x$ axis is the number of days and the $y$ axis is the leaf length.

Let the hypothetical function be a linear function. The equation of the linear function is

y = ax + b


So, calculate what the values of $a$ and $b$ should be to make a nice function. As an image, find a linear function that fits the data as closely as possible, as shown in the red graph.

First, from the hypothesis function $y = ax + b$, the value of $y$ on the 0th to 4th days (the number of days is $x$) is calculated, and the error from the actual growth data is calculated. (Day 0 is the day when the seeds are buried)

Days(x) Day 0 Day 1 Day 2 Day 3 Day 4
Record 0.0cm 0.3cm 1.9cm 2.8cm 4.3cm
Calculated from the formula(y) b a+b 2a+b 3a+b 4a+b
error(y-Record) b-0 (a+b)-0.3 (2a+b)-1.9 (3a+b)-2.8 (4a+b)-4.3

The error is the length of the pink part of the graph below.

Square and then add up so that the error is not negative. Assuming that the total error is $J$, the calculation is as follows.

J = b^2+((a+b)-0.3)^2+((2a+b)-1.9)^2+((3a+b)-2.8)^2+((4a+b)-4.3)^2\\
\\
=30a^2+20ab-59.4a+5b^2-18.6b+30.03



Find $a$ and $b$ that minimize this $J$ and you're done.

Partial differentiation of $J$ as a function of $a$ is 0, and partial differentiation of $J$ as a function of $b$ is 0. If two simultaneous equations are solved, $J$ is the minimum. The following $a$ and $b$ are calculated.

See here for the minimum value of a two-variable two-dimensional function. http://mathtrain.jp/quadratic

The above equation is partially differentiated with $a$ ($\ frac {∂J} {∂a}$), and the lower equation is partially differentiated with $b$ ($\ frac {∂J} {∂b). }$).

\left\{
\begin{array}{ll}
0 = 60a+20b-59.4 \\
0 = 20a+10b-18.6
\end{array}
\right.


When you solve the simultaneous equations,

\left\{
\begin{array}{ll}
a = 1.11 \\
b=-0.36
\end{array}
\right.


Next door

y = 1.11x - 0.36


I was able to derive the function. After drawing with a graph

It feels very good !!

#### radish_plot.py


import matplotlib.pyplot as plt
import numpy as np
# data set
plt.plot([0,0.3,1.9,2.8,4.3], "bo")

plt.xlabel("X:days")
plt.ylabel("Y:length")
plt.xlim([0,4])
plt.xticks(np.arange(1,5,1))
plt.ylim([0,4])
plt.yticks(np.arange(0,6,1))

# Graph
x = np.arange(0,10,0.01)
y = 1.11*x -0.36
plt.plot(x,y,"r-")
plt.show()


# General formula

In the concrete example, the hypothetical function is a linear function, but a quadratic function or a cubic function may be more suitable. Maybe it's usually a more complex relation.

And this may be considered a better formula. Also, the growth of radish may be affected not only by the number of days, but also by the hours of sunshine and the amount of water.

Therefore, let's consider a concrete example by converting it into a general formula.

If there are $x_ {1}$, $x_ {2}$ factors that determine the length of the leaves, such as the number of days, the hours of sunshine, the amount of water to be given, and $\ cdots$$n$, respectively. , $x_ {3}$ $\ cdots$ $x_ {n}$. Let the coefficient of $x_ {n}$ be $\ theta_ {n}$.

The function and hypothesis function that you want to finally derive can be expressed by the following formula. ($X_ {1}$ doesn't mean $x$ squared, it's a function that depends on $x$. It might be helpful to look at the last Python code.)

h_{\theta}(x) = \theta_{0}x_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} +\cdots + \theta_{n}x_{n}



$\ Theta_ {0}$ is the $b$ part of the concrete example and is considered as $x_ {0} = 1$. Suddenly, $h_ {\ theta} (x)$ came out, but $y$ in the concrete example is the length of the leaf.

By associating the concrete example function $y = 1.11x --0.36$ with the above hypothetical function, it becomes as follows.

y = -0.36 × 1 + 1.11 × x
h_{\theta}(x) = \theta_{0} × x_{0} + \theta_{1} × x_{1}

Suppose there are multiple sample data for this hypothesis function, and the data is as follows.

x_{0}
(1 fixed)
x_{1}
(Days)
x_{2}
(Daylight hours h)
x_{3}
(Moisture amount ml)
\cdots y
(Leaf length cm)
1 1 14 150 \cdots 0.3
1 2 14 100 \cdots 1.9
1 3 14.1 160 \cdots 2.8
1 4 14.1 160 \cdots 4.3

The data in the $j$ column of the $i$ line can be represented as $x_ {j} ^ {(i)}$, $y ^ {(i)}$, and when mapped, it looks like this. It is a shape. It looks like an exponent, but it only shows the data.

x_{0}
(1 fixed)
x_{1}
(Days)
x_{2}
(Daylight hours h)
x_{3}
(Moisture amount ml)
\cdots y
(Leaf length cm)
x_{0}^{(1)} x_{1}^{(1)} x_{2}^{(1)} x_{3}^{(1)} \cdots y^{(1)}
x_{0}^{(2)} x_{1}^{(2)} x_{2}^{(2)} x_{3}^{(2)} \cdots y^{(2)}
x_{0}^{(3)} x_{1}^{(3)} x_{2}^{(3)} x_{3}^{(3)} \cdots y^{(3)}
x_{0}^{(4)} x_{1}^{(4)} x_{2}^{(4)} x_{3}^{(4)} \cdots y^{(4)}

When the data in the first column is assigned to $h_ {\ theta} (x)$, it is expressed as $h_ {\ theta} (x ^ {(1)})$.

In this way, the total error with the hypothetical function can be rewritten as the following function. $m$ is the number of datasets. In the example of the above table, it is 4.

J(\theta_{0},\theta_{1}, \cdots,\theta_{n})= \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^2


The total error $J (\ theta_ {0}, \ theta_ {1}, \ cdots, \ theta_ {n})$ is the smallest $\ theta_ {0}, \ theta_ {1}, \ cdots, \ theta_ The goal is to derive a set of {n}$.

From here, let's replace it with a matrix. A matrix is a powerful item that can calculate data at once.

First, in order to represent the hypothesis function $h_ {\ theta} (x)$ as a matrix, $\ theta_ {0}$, $\ theta_ {1}$, $\ cdots$ $\ theta_ {n}$, $X_ {0}$, $x_ {1}$, $\ cdots$ $x_ {n}$ is represented by a matrix.

\theta=\begin{bmatrix}
\theta_{0} \\
\theta_{1} \\
\vdots\\
\theta_{n}
\end{bmatrix}
,
x=\begin{bmatrix}
x_{0} \\
x_{1} \\
\vdots\\
x_{n}
\end{bmatrix}


Placed this way, $h_ {\ theta} (x)$ can be represented by the product of the transposed matrix of $\ theta$ and the matrix $x$.

h_{\theta}(x)= \theta^T x\\


See here for the transposition and stacking of matrices. http://www.sist.ac.jp/~kanakubo/research/hosoku/trans_gyoretu.html

The data set $x ^ {(i)}$ in the $i$ column is represented by a matrix, and the matrix $X$ given all the datasets is represented by the transposed matrix of each column. $y ^ {(i)}$ is also represented by a matrix.

x^{(i)}=\begin{bmatrix}
x_{0}^{(i)} \\
x_{1}^{(i)} \\
\vdots\\
x_{n}^{(i)}
\end{bmatrix}
,
X=\begin{bmatrix}
(x^{(1)})^T \\
(x^{(2)})^T  \\
\vdots\\
(x^{(m)})^T
\end{bmatrix}
=
\begin{bmatrix}
x_{0}^{(1)} & x_{1}^{(1)} & \cdots & x_{n}^{(1)} \\
x_{0}^{(2)} & x_{1}^{(2)} & \cdots & x_{n}^{(2)} \\
\vdots & \vdots & & \vdots\\
x_{0}^{(m)} & x_{1}^{(m)} & \cdots & x_{n}^{(m)} \\
\end{bmatrix}
,
y=\begin{bmatrix}
y^{(1)} \\
y^{(2)} \\
\vdots\\
y^{(m)}
\end{bmatrix}


It looks confusing, but I just converted the above table into a matrix.

And the story blows away at once.

When placed in this way, $\ theta$, which minimizes the total error $J (\ theta)$, can be obtained by solving the following determinant.

\theta =
(X^tX)^{-1}X^ty



See here for why this formula is used. http://mathtrain.jp/leastsquarematrix

Also, $X ^ {-1}$ is called the inverse matrix of $X$, refer to here. http://mathtrain.jp/inversematrix

Put the dataset in this formula to find $\ theta$, and put it in $\ theta_ {0}, \ theta_ {1}, \ cdots, \ theta_ {n}$ of $h (\ theta)$ to minimize it. The function expression (model function) by the square method is completed.

# I wrote it in Python

I wrote this least squares method in Python. You'll be asked for $\ theta$ in just one line! !!

First is the dataset. Random numbers are added to the function $y = 3 + 2 \ cos (x) + \ frac {1} {2} x$ to generate an error.

#### linalg_lstsq.py


x = arange(-3, 10, 0.1)
y = 3 + 2 * np.cos(x) + (1/2) * x + np.random.normal(0.0, 1.0, len(x))


In the general formula, $x_ {1}$ is $\ cos (x)$ and $x_ {2}$ is $x$. If the coefficients to be calculated can be calculated so that $\ theta_ {0}$ is $3$, $\ theta_ {1}$ is $2$, and $\ theta_ {2}$ is $\ frac {1} {2}$. It's OK.

Create a matrix $X$ with all the datasets.

#### linalg_lstsq.py


n = 3
X = np.zeros((len(x), n), float)
X[:,0] = 1
X[:,1] = np.cos(x)
X[:,2] = x


The expression for $\ theta$ is this one line using linalg.lstsq in the Numpy library.

#### linalg_lstsq.py


(theta, residuals, rank, s) = linalg.lstsq(X, y)


https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html

As a result, it became like this.

#### linalg_lstsq.py


import numpy as np
import matplotlib.pyplot as plt

# data-set
x = np.arange(-3, 10, 0.1)
y = 3 + 2 * np.cos(x) + (1/2) * x + np.random.normal(0.0, 1.0, len(x))

n = 3
X = np.zeros((len(x), n), float)
X[:,0] = 1
X[:,1] = np.cos(x)
X[:,2] = x

# Least square method
(theta, residuals, rank, s) = np.linalg.lstsq(X, y)

# data-set plot
plt.plot(x, y, 'b.')
# h(theta) plot
plt.plot(x, theta[0] + theta[1] * X[:,1] + theta[2] * X[:,2], 'r-')

plt.title("linalg.lstsq")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")

plt.show()

print('θ[0]: %s' % theta[0])
print('θ[1]: %s' % theta[1])
print('θ[2]: %s' % theta[2])


I was able to get a fairly close value, and even if you look at the graph, it fits! !!

# end

Since I have just started studying machine learning, I am learning the standard algorithms and ideas. I can't write anything brand new, but I hope it helps someone.