About the Normal Equation of Linear Regression

Preface

This topic came up in the Linear Regression with Multiple Variables part of a certain online machine learning course: the Normal Equation.

Dr. Andrew Ng (hereinafter Dr. Ng) said something like "deriving it is a pain" (loose translation) and showed only the result, so I dug into it a little deeper.

Along the way some questions came up, so I am sharing them here. I may not fully understand this myself yet, so corrections and comments are welcome.

[2015/07/24 23:10] Added verification code and made major corrections.

Normal Equation

First of all, a review.

$X$ is the $m \times (n+1)$ matrix holding the features of the entire training set ($m$, the number of rows, is the number of examples; $n$ is the number of features). $y$ is the $m$-dimensional vector listing the "correct values" of the training data. $\theta$ is the $(n+1)$-dimensional vector of parameters of the hypothesis function ($h_{\theta}(x) = \sum_i \theta_i x_i$, $0 \le i \le n$). In addition, $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2$ is called the loss (objective) function [^1] of linear regression, and the goal of linear regression is to minimize it.

[^1]: This formula is the so-called squared-error function. In other words, this is the least-squares method.
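
Not part of the lecture materials, but as a concrete reference for these definitions, here is a minimal NumPy sketch of the loss function $J(\theta)$; the variable names `X`, `y`, `theta` and the tiny data set are my own.

```python
import numpy as np

def cost(theta, X, y):
    # J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2
    m = len(y)
    residuals = X.dot(theta) - y
    return residuals.dot(residuals) / (2 * m)

# tiny example: 3 examples, 1 feature plus the constant column x_0 = 1
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(cost(np.array([0.0, 1.0]), X, y))  # exact fit, so J = 0.0
```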

To minimize it, we take the partial derivative of $J(\theta)$ with respect to each of $\theta_0, \dots, \theta_n$ and find the $\theta$ at which they are all 0. For example, partially differentiating with respect to some $\theta_j$:

\begin{eqnarray}
\frac{\partial}{\partial \theta_j}J(\theta) &=& \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} \\
&=& \frac{1}{m}\left\{\sum_{k}\theta_k\sum_{i}x_j^{(i)}x_k^{(i)} - \sum_{i}x_j^{(i)}y^{(i)}\right\}
\end{eqnarray}

Collecting these equations for $j = 0, \dots, n$ and multiplying both sides by $m$, they can be rewritten in matrix form [^2]:

m\left[\begin{array}{c}
\frac{\partial}{\partial \theta_0}J(\theta)\\
\frac{\partial}{\partial \theta_1}J(\theta)\\
\vdots\\
\frac{\partial}{\partial \theta_n}J(\theta)
\end{array}\right] = X^TX \theta - X^Ty

[^2]: Sorry, I skipped the intermediate steps here. Writing them out one by one is tedious (^-^;

Setting this to ${\bf 0}$ and solving for $\theta$:

\begin{eqnarray}
X^TX \theta &=& X^Ty\\
\theta &=& (X^TX)^{-1} X^Ty
\end{eqnarray}

This is the "Normal Equation" Dr. Ng refers to. By computing the right-hand side of the last equation, we obtain the optimal parameter $\theta$ (the one that minimizes the squared error).
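
As a quick sanity check of the formula, here is a sketch of my own (not from the lecture); `np.linalg.inv` is used just to mirror the formula literally, and better alternatives are discussed below.

```python
import numpy as np

np.random.seed(0)
m, n = 100, 3
X = np.hstack([np.ones((m, 1)), np.random.rand(m, n)])   # prepend the x_0 = 1 column
true_theta = np.array([1.0, 2.0, -0.5, 0.3])
y = X.dot(true_theta) + 0.01 * np.random.randn(m)         # small noise

# Normal Equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
print(theta)   # should come out close to true_theta
```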

Generalized inverse matrix

$(X^TX)^{-1}$ does not always exist for a given $X$. However, here too Dr. Ng says in the lecture, "Don't worry, because this is very rare." He also explains that even if such a case does occur, using, for example, `pinv(X'*X) * X'*y` in Octave works fine. `X'` is the transpose of the matrix `X` (i.e. $X^T$ for $X$).

Octave's `pinv` computes the "pseudoinverse", also called the pseudo-inverse or [generalized inverse](https://ja.wikipedia.org/wiki/%E6%93%AC%E4%BC%BC%E9%80%86%E8%A1%8C%E5%88%97) matrix (the latter term is used below). If the argument matrix is invertible, it returns (a matrix that matches) the inverse matrix; otherwise it returns a "generalized inverse" satisfying certain properties.

The details of the generalized inverse are omitted here, but in simple terms it is a "matrix with properties similar to the inverse matrix". It can be defined not only for non-regular (= non-invertible) square matrices, but also for non-square matrices (i.e. $m \times n$ matrices with $m \ne n$).

So if $X^TX$ is invertible, `pinv` simply returns $(X^TX)^{-1}$ with no problem; otherwise it returns a properly computed "generalized inverse". Either way, the computation completes without error and yields a reasonable result.
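
Here is a small NumPy sketch of that behavior as I understand it (the example matrices are my own):

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])   # invertible
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])   # singular: second row = 2 * first row

print(np.allclose(np.linalg.pinv(A), np.linalg.inv(A)))   # True: pinv matches the inverse
print(np.linalg.pinv(S))                                  # still returns a generalized inverse
# np.linalg.inv(S) would raise LinAlgError ("Singular matrix") instead
```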

More about the generalized inverse matrix

As I wrote earlier, the generalized inverse is also defined for $m \times n$ matrices ($m \ne n$). In particular, if $m > n$ (= there are more rows than columns = the matrix is "tall") and the columns of $X$ are linearly independent, it can actually be written as:

X^- = (X^TX)^{-1}X^T

Here we are thinking about (multivariate) linear regression, where the number of examples is generally very large compared to the number of features. In other words, the matrix should be exactly this kind of tall matrix with more rows than columns.
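
A quick numerical check of this identity (my own sketch; a random tall matrix has independent columns with probability 1):

```python
import numpy as np

np.random.seed(1)
X = np.random.rand(10, 4)   # m > n, full column rank (almost surely)

lhs = np.linalg.pinv(X)                      # X^-
rhs = np.linalg.inv(X.T.dot(X)).dot(X.T)     # (X^T X)^{-1} X^T
print(np.allclose(lhs, rhs))                 # True, up to floating-point error
```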

Now let's take a look at the Normal Equation again.

\theta = (X^TX)^{-1} X^Ty

Look at the part of the right-hand side other than $y$: it has exactly the same form, doesn't it? In other words, shouldn't we be able to write it like this?

\theta = X^-y

In Octave code, `pinv(X) * y`. Wouldn't this be enough?

Going further, consider the following equation for a moment.

X\theta = y

If $X$ is a square, invertible matrix, this can be solved as $\theta = X^{-1}y$. If not, there is no such straightforward method (there may be no exact solution in the first place, or the solution may not be uniquely determined). However, using $\theta = X^-y$, which replaces the inverse with the generalized inverse, we obtain the solution that minimizes the error (or the norm). In other words, this expression means finding the minimum-error solution of $X\theta = y$.
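
To illustrate the "minimum error" point, a small sketch of my own: the residual of $\theta = X^-y$ should be no larger than the residual of any other $\theta$, here a randomly perturbed one.

```python
import numpy as np

np.random.seed(2)
X = np.random.rand(10, 4)   # tall, so X theta = y generally has no exact solution
y = np.random.rand(10)

theta = np.linalg.pinv(X).dot(y)              # theta = X^- y
best = np.linalg.norm(X.dot(theta) - y)       # its residual ||X theta - y||

other = theta + 0.1 * np.random.randn(4)      # some other parameter vector
print(best <= np.linalg.norm(X.dot(other) - y))   # True
```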

Octave also provides another way to solve $X\theta = y$ directly: it can be written as `X \ y` using the backslash operator. This computes $X^{-1}y$ if $X$ is an invertible square matrix, and otherwise it appears to perform the same kind of calculation as $X^-y$.

Now let's recall the original shape of the Normal Equation.

\begin{eqnarray}
X^TX \theta &=& X^Ty\\
(X^TX) \theta &=& (X^Ty)
\end{eqnarray}

This, too, is in the form $\bigcirc\,\theta = \bigcirc$, isn't it? That is, a form that can be solved with the backslash operator as `○ \ ○`. Applied here, that gives `(X'*X) \ (X'*y)`.

In other words, there are four candidate ways to write Octave code that solves the Normal Equation:

1. `pinv(X'*X) * X'*y`
2. `pinv(X) * y`
3. `X \ y`
4. `(X'*X) \ (X'*y)`

Of these, only the first, `pinv(X'*X) * X'*y`, is introduced in the lecture. Why aren't the 2nd, 3rd, and 4th mentioned?

Verification 1

So, I wrote the actual code and verified it.

solve_X_y.m


% solve_X_y.m
X = rand(10, 4)
# X =
#
#    0.033536   0.816107   0.996677   0.958327
#    0.683542   0.116498   0.614316   0.884338
#    0.734337   0.769245   0.696212   0.245270
#    0.216938   0.013297   0.885327   0.906086
#    0.630620   0.733668   0.820551   0.784664
#    0.138834   0.838178   0.216751   0.638286
#    0.100739   0.893597   0.891867   0.239482
#    0.362333   0.404999   0.018274   0.922847
#    0.102606   0.442110   0.744582   0.452299
#    0.590709   0.274452   0.459526   0.656588

y = rand(10, 1)
# y = 
# 
#    0.48518
#    0.13242
#    0.60525
#    0.31265
#    0.59250
#    0.47161
#    0.95971
#    0.44011
#    0.60115
#    0.75571

# calculation
# [1]
pinv(X' * X) * X' * y
# ans =
# 
#    0.1861915
#    0.5484641
#    0.2473279
#   -0.0031611

# [2]
pinv(X) * y
# ans =
# 
#    0.1861915
#    0.5484641
#    0.2473279
#   -0.0031611

# [3]
X \ y
# ans =
# 
#    0.1861915
#    0.5484641
#    0.2473279
#   -0.0031611

# [4]
(X'*X) \ (X'*y)
# ans =
# 
#    0.1861915
#    0.5484641
#    0.2473279
#   -0.0031611

# time measurement (n = 10)
# [1]
tic();
for k=1:10000;
    X = rand(40, 10);
    y = rand(40, 1);
    pinv(X' * X) * X' * y;
end;
toc()
# Elapsed time is 1.26513 seconds.

# [2]
tic();
for k=1:10000;
    X = rand(40, 10);
    y = rand(40, 1);
    pinv(X) * y;
end;
toc()
# Elapsed time is 1.16283 seconds.

# [3]
tic();
for k=1:10000;
    X = rand(40, 10);
    y = rand(40, 1);
    X \ y;
end;
toc()
# Elapsed time is 0.902037 seconds.

# [4]
tic();
for k=1:10000;
    X = rand(40, 10);
    y = rand(40, 1);
    (X'*X) \ (X'*y);
end;
toc()
# Elapsed time is 0.689348 seconds.

# time measurement (n = 30)
# [1]
tic();
for k=1:10000;
    X = rand(100, 30);
    y = rand(100, 1);
    pinv(X' * X) * X' * y;
end;
toc()
# Elapsed time is 5.79588 seconds.

# [2]
tic();
for k=1:10000;
    X = rand(100, 30);
    y = rand(100, 1);
    pinv(X) * y;
end;
toc()
# Elapsed time is 7.11547 seconds.

# [3]
tic();
for k=1:10000;
    X = rand(100, 30);
    y = rand(100, 1);
    X \ y;
end;
toc()
# Elapsed time is 3.64188 seconds.

# [4]
tic();
for k=1:10000;
    X = rand(100, 30);
    y = rand(100, 1);
    (X'*X) \ (X'*y);
end;
toc()
# Elapsed time is 1.37039 seconds.

First of all, as far as the results go, [1], [2], [3], and [4] certainly compute the same values [^3].

[^3]: On closer inspection, they actually compute slightly different values, with discrepancies on the order of 1e-14. Presumably floating-point error due to the different computation paths and methods.

As for the timing, it is natural that the larger $n$ (= the number of columns of the matrix = the number of features), the longer the computation takes. Note that [1] `pinv(X'*X) * X'*y` has a shorter execution time than [2] `pinv(X) * y`: the formula looks more complicated, but the amount of computation is actually smaller. The reason is that if $X$ is an $m \times n$ matrix ($m > n$), then $X^TX$ is an $n \times n$ square matrix that is almost always invertible, and its inverse can be computed quickly (the cost of the multiplications by the $n \times m$ matrix $X^T$ is also relatively low). In contrast, computing the generalized inverse of $X$, which is not square to begin with (and is the larger $m \times n$ matrix), costs considerably more.

However, the results show that [3] `X \ y` is even faster. Moreover, [4] `(X'*X) \ (X'*y)` is much faster still. [4] beats [3] for the same reason as above.

By the way, since `rand()` is used, the values of X and y of course change on every run, but the calculation results and execution times come out roughly the same each time. At this rate `(X'*X) \ (X'*y)` looks like the best choice, but...

Verification 2

Earlier I verified things with code using random matrices. Now let's try a matrix of deliberately arranged numbers.

rank_deficient.m


X = [1 1 2 3;
     1 2 3 5;
     1 3 5 8;
     1 5 8 13;
     1 8 13 21;
     1 13 21 34]

y = [8; 13; 21; 34; 55; 89]

# check rank of matrix
rank(X)
# => 3

# calculation
# [1]
pinv(X'*X) * X'*y
# ans =
# 
#    3.1974e-13
#    3.3333e-01
#    1.3333e+00
#    1.6667e+00

# [2]
pinv(X) * y
# ans =
# 
#    0.00000
#    0.33333
#    1.33333
#    1.66667

# [3]
X \ y
# ans =
# 
#   -1.3628e-14
#    3.3333e-01
#    1.3333e+00
#    1.6667e+00

# [4]
(X'*X) \ (X'*y)
# warning: matrix singular to machine precision, rcond = 4.97057e-18
# ans =
# 
#   -1.8859e-13
#    3.3333e-01
#    1.3333e+00
#    1.6667e+00


# Square Matrix
X = X(1:4, 1:4)
# X =
# 
#     1    1    2    3
#     1    2    3    5
#     1    3    5    8
#     1    5    8   13

y = y(1:4)
# y =
# 
#     8
#    13
#    21
#    34

# calculation
# [1]
pinv(X'*X) * X'*y
# ans =
# 
#    1.8119e-13
#    3.3333e-01
#    1.3333e+00
#    1.6667e+00

# [2]
pinv(X) * y
# ans =
# 
#   -7.1054e-15
#    3.3333e-01
#    1.3333e+00
#    1.6667e+00

# [3]
X \ y
# warning: matrix singular to machine precision, rcond = 0
# ans =
# 
#   -7.3807e-15
#    3.3333e-01
#    1.3333e+00
#    1.6667e+00

# [4]
(X'*X) \ (X'*y)
# warning: matrix singular to machine precision, rcond = 1.26207e-17
# ans =
# 
#    1.5742e-14
#    3.3333e-01
#    1.3333e+00
#    1.6667e+00

The second through fourth columns of the matrix $X$ are consecutive Fibonacci numbers. That is, $x_2 + x_3 = x_4$ (writing the $i$-th column as $x_i$). Since $X$ is an $m \times 4$ matrix ($m \ge 4$) with only 3 linearly independent columns (the 4th column is a linear combination of the other columns), the rank of the matrix is 3 [^4].

[^4]: A matrix X with m rows and n columns is said to be "full rank" when rank(X) = min(m, n). If rank(X) < min(m, n), it is said to be "rank deficient".

In this case, the $n \times n$ matrix $X^TX$ is not a regular matrix (it has no inverse). Also, when $m = n$, $X$ itself is a square matrix, but it is not invertible either.

As a result, a warning was output when computing [4] (`(X'*X) \ (X'*y)`) in the $m > n$ case, and when computing [3] and [4] (`X \ y`, `(X'*X) \ (X'*y)`) in the $m = n$ case. Solutions using the `\` operator apparently have this kind of problem.

On the other hand, the solutions using the generalized inverse are computed without any warnings. That is because `pinv` is designed from the start to "perform valid computations even when the matrix is not regular (or not even square)".
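
A quick NumPy check of this point, using the same Fibonacci-style matrix transcribed from the Octave code above:

```python
import numpy as np

X = np.array([[1, 1, 2, 3],
              [1, 2, 3, 5],
              [1, 3, 5, 8],
              [1, 5, 8, 13],
              [1, 8, 13, 21],
              [1, 13, 21, 34]], dtype=float)

print(np.linalg.matrix_rank(X))            # 3: the 4th column = 2nd + 3rd column
print(np.linalg.matrix_rank(X.T.dot(X)))   # also 3 (< 4), so X^T X has no inverse
```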

Consideration from the result

I think Dr. Ng introduced `pinv(X'*X) * X'*y` because it is closest to the derived expression (it follows directly from the original transformation) and it can be computed faster than `pinv(X) * y`.

He probably did not mention `X \ y` or `(X'*X) \ (X'*y)` because these calculations run into trouble when the matrix to the left of the operator is a non-regular square matrix (or a "rank-deficient" matrix), a worry that does not arise with `pinv()` [^5].

However, that situation only arises when there is a problem with the choice of data (features) in the first place, so if you select only features that are independent of each other, I don't think it is a bad idea to use the simpler and faster `(X'*X) \ (X'*y)`.

So the trade-off is this: either put effort into analyzing the data for that purpose (pay the cost of that work) and get the faster computation (`(X'*X) \ (X'*y)`), or do a reasonably fast, error-free computation without any special handling of the data (`pinv(X'*X) * X'*y`).

[^5]: Before revising this article I had asked, "If there are no problems, isn't `X \ y` better?" But this is probably the biggest problem. I suppose I ended up answering my own question.

Bonus: For other languages

When studying material like this (a lecture where the language and environment are specified, or a textbook), I often try rewriting the code in other languages to deepen my understanding. So I also verified the code from this article with Julia (v0.3.x / 0.4.0-dev) and Python (v2.7.x / 3.x) + NumPy.

Python has no operator corresponding to `X \ y`; instead there is a function called `numpy.linalg.solve(X, y)`, but it only works when X is a square matrix.
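
The closest NumPy counterpart to Octave's `X \ y` that I am aware of is `numpy.linalg.lstsq`, which accepts a non-square X and returns a least-squares solution; a small sketch of my own (not something covered in the lecture):

```python
import numpy as np

np.random.seed(3)
X = np.random.rand(10, 4)
y = np.random.rand(10)

# lstsq returns (solution, residuals, rank, singular values);
# newer NumPy versions may warn unless rcond is passed explicitly
theta, residuals, rank, sv = np.linalg.lstsq(X, y)
print(np.allclose(theta, np.linalg.pinv(X).dot(y)))   # True for this full-rank X
```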

Julia version

It works with both Julia v0.3.x and v0.4.0 [^6]:

solve_X_y.jl


X = rand(10, 4)
# 10x4 Array{Float64,2}:
#  0.71148    0.968352  0.0952939  0.796324 
#  0.915128   0.128326  0.630086   0.0635579
#  0.351199   0.131409  0.934867   0.501701 
#  0.165645   0.874088  0.173725   0.976326 
#  0.765261   0.790716  0.760362   0.204496 
#  0.544099   0.156464  0.041718   0.507071 
#  0.764964   0.852837  0.230312   0.134783 
#  0.0738597  0.75529   0.693856   0.0107293
#  0.621861   0.56881   0.66972    0.163911 
#  0.9471     0.453841  0.466836   0.10744  

y = rand(10, 1)
# 10x1 Array{Float64,2}:
#  0.389321 
#  0.436261 
#  0.308189 
#  0.734617 
#  0.410237 
#  0.4969   
#  0.0708882
#  0.0840005
#  0.944711 
#  0.14718  

# calculation
# [1]
pinv(X' * X) * X' * y
# 4x1 Array{Float64,2}:
#   0.169937 
#  -0.0365938
#   0.273122 
#   0.55004  

# [2]
pinv(X) * y
# 4x1 Array{Float64,2}:
#   0.169937 
#  -0.0365938
#   0.273122 
#   0.55004  

# [3]
X \ y
# 4x1 Array{Float64,2}:
#   0.169937 
#  -0.0365938
#   0.273122 
#   0.55004  

# [4]
(X'*X) \ (X'*y)
# 4x1 Array{Float64,2}:
#   0.169937 
#  -0.0365938
#   0.273122 
#   0.55004  

# time measurement (n = 10)
# [1]
@time for k=1:10000
    X = rand(40, 10)
    y = rand(40, 1)
    pinv(X' * X) * X' * y
end
# elapsed time: 1.087745051 seconds (283600016 bytes allocated, 17.28% gc time)

# [2]
@time for k=1:10000
    X = rand(40, 10)
    y = rand(40, 1)
    pinv(X) * y
end
# elapsed time: 1.278193773 seconds (334800016 bytes allocated, 17.29% gc time)

# [3]
@time for k=1:10000
    X = rand(40, 10)
    y = rand(40, 1)
    X \ y
end
# elapsed time: 1.014968744 seconds (324320000 bytes allocated, 20.29% gc time)

# [4]
@time for k=1:10000
    X = rand(100, 30)
    y = rand(100, 1)
    (X'*X) \ (X'*y)
end
# elapsed time: 0.163586767 seconds (62720032 bytes allocated, 41.51% gc time)

# time measurement (n = 30)
# [1]
@time for k=1:10000
    X = rand(100, 30)
    y = rand(100, 1)
    pinv(X' * X) * X' * y
end
# elapsed time: 5.820615493 seconds (1557840000 bytes allocated, 19.02% gc time)

# [2]
@time for k=1:10000
    X = rand(100, 30)
    y = rand(100, 1)
    pinv(X) * y
end
# elapsed time: 7.518744844 seconds (1914480016 bytes allocated, 16.51% gc time)

# [3]
@time for k=1:10000
    X = rand(100, 30)
    y = rand(100, 1)
    X \ y
end
# elapsed time: 3.455976006 seconds (1292000000 bytes allocated, 22.67% gc time)

# [4]
@time for k=1:10000
    X = rand(100, 30)
    y = rand(100, 1)
    (X'*X) \ (X'*y)
end
# elapsed time: 0.777771618 seconds (407840016 bytes allocated, 32.71% gc time)

[^6]: Julia's matrix/vector handling and notation were taken from MATLAB/Octave, so the code worked almost unchanged.

My impression is that it is not as fast as I expected, but I think that is because I wrote the `for` loops directly at the top level. It would surely be faster if rewritten as functions or in a more performance-conscious style. Still, `(X'*X) \ (X'*y)` is fast enough to overwhelm the others! But...

rank_deficient.jl


X = [1 1 2 3;
     1 2 3 5;
     1 3 5 8;
     1 5 8 13;
     1 8 13 21;
     1 13 21 34]

y = [8; 13; 21; 34; 55; 89]

# check rank of matrix
rank(X)
# => 3

# calculation
# [1]
pinv(X'*X) * X'*y
# 4-element Array{Float64,1}:
#  -7.10543e-15
#   0.333333   
#   1.33333    
#   1.66667    

# [2]
pinv(X) * y
# 4-element Array{Float64,1}:
#   3.55271e-15
#   0.333333   
#   1.33333    
#   1.66667    

# [3]
X \ y
# 4-element Array{Float64,1}:
#  -4.35117e-15
#   2.0        
#   3.0        
#   0.0        

# [4]
(X'*X) \ (X'*y)
# 4-element Array{Float64,1}:
#   3.22113e-13
#  -1.50024    
#  -0.500244   
#   3.50024    


# Square Matrix
X = X[1:4, 1:4]
# 4x4 Array{Int64,2}:
#  1  1  2   3
#  1  2  3   5
#  1  3  5   8
#  1  5  8  13

y = y[1:4]
# 4-element Array{Int64,1}:
#   8
#  13
#  21
#  34

# calculation
# [1]
pinv(X'*X) * X'*y
# 4-element Array{Float64,1}:
#  8.52651e-14
#  0.333333   
#  1.33333    
#  1.66667    

# [2]
pinv(X) * y
# 4-element Array{Float64,1}:
#  3.55271e-15
#  0.333333   
#  1.33333    
#  1.66667    

# x[3]
# X \ y
# @> SingularException(4)

# x[4]
# (X'*X) \ (X'*y)
# @> SingularException(4)

Now for the rank-deficient case. For the tall matrix, the results of [3] and [4] (`X \ y`, `(X'*X) \ (X'*y)`) are significantly different from the ones using `pinv()`. And in the square-matrix case, an error saying the matrix is not regular (`SingularException`) is raised and the computation fails. (>_<) (Incidentally, I will just briefly note that the results differed slightly between Julia v0.3.x and v0.4.0.)

Python + NumPy version

Also works with both Python v2.7.x / 3.x, results are similar:

solve_X_y.py


import numpy as np
X = np.random.rand(10, 4)
# array([[ 0.61009055,  0.71722947,  0.48465025,  0.15660522],
#        [ 0.02424431,  0.49947237,  0.60493258,  0.8988653 ],
#        [ 0.65048106,  0.69667863,  0.52860957,  0.65003537],
#        [ 0.56541266,  0.25463788,  0.74047536,  0.64691215],
#        [ 0.03052439,  0.47651739,  0.01667898,  0.7613639 ],
#        [ 0.87725831,  0.47684888,  0.44039111,  0.39706053],
#        [ 0.58302851,  0.20919564,  0.97598994,  0.19268083],
#        [ 0.35987338,  0.98331404,  0.06299533,  0.76193058],
#        [ 0.625453  ,  0.70985323,  0.62948802,  0.627458  ],
#        [ 0.64201569,  0.22264827,  0.71333221,  0.53305839]])

y = np.random.rand(10, 1)
# array([[ 0.99674247],
#        [ 0.66282312],
#        [ 0.68295932],
#        [ 0.14330449],
#        [ 0.17467666],
#        [ 0.90896029],
#        [ 0.65385071],
#        [ 0.00748736],
#        [ 0.93824979],
#        [ 0.91696375]])

# calculation
# [1]
np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)
# array([[ 0.32591078],
#        [ 0.46479763],
#        [ 0.6684976 ],
#        [-0.26695783]])

# [2]
np.linalg.pinv(X).dot(y)
# array([[ 0.32591078],
#        [ 0.46479763],
#        [ 0.6684976 ],
#        [-0.26695783]])

# x[3]
# np.linalg.solve(X, y)
# @> LinAlgError

# [4]
np.linalg.solve(X.T.dot(X), X.T.dot(y))
# array([[ 0.32591078],
#        [ 0.46479763],
#        [ 0.6684976 ],
#        [-0.26695783]])

# time measurement (n = 10)
from timeit import timeit
# [1]
def test_a():
    X = np.random.rand(40, 10)
    y = np.random.rand(40, 1)
    np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)

timeit("test_a()", setup="from __main__ import test_a", number=10000)
# 1.1948060989379883

# [2]
def test_b():
    X = np.random.rand(40, 10)
    y = np.random.rand(40, 1)
    np.linalg.pinv(X).dot(y)

timeit("test_b()", setup="from __main__ import test_b", number=10000)
# 1.2698009014129639

# [4]
def test_c():
    X = np.random.rand(40, 10)
    y = np.random.rand(40, 1)
    np.linalg.solve(X.T.dot(X), X.T.dot(y))

timeit("test_c()", setup="from __main__ import test_c", number=10000)
# 0.4645709991455078

# time measurement (n = 30)
# [1]
def test_d():
    X = np.random.rand(100, 30)
    y = np.random.rand(100, 1)
    np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)

timeit("test_d()", setup="from __main__ import test_d", number=10000)
# 4.615994930267334

# [2]
def test_e():
    X = np.random.rand(100, 30)
    y = np.random.rand(100, 1)
    np.linalg.pinv(X).dot(y)

timeit("test_e()", setup="from __main__ import test_e", number=10000)
# 5.413921117782593

# [4]
def test_f():
    X = np.random.rand(100, 30)
    y = np.random.rand(100, 1)
    np.linalg.solve(X.T.dot(X), X.T.dot(y))

timeit("test_f()", setup="from __main__ import test_f", number=10000)
# 0.9642360210418701

With NumPy's ndarray, `*` is the element-wise product, and matrix products must use the `.dot()` method. The already complicated expressions become even more complicated... but it's fast! `solve()` is fast too! But...
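
As a side note going beyond the Python versions used above: on Python 3.5+ with a sufficiently recent NumPy, the `@` operator performs matrix multiplication, which makes these expressions a little shorter (just a sketch):

```python
import numpy as np

X = np.random.rand(10, 4)
y = np.random.rand(10, 1)

# same as np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)
```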

rank_deficient.py


import numpy as np
X = np.array([
        [1, 1, 2, 3],
        [1, 2, 3, 5],
        [1, 3, 5, 8],
        [1, 5, 8, 13],
        [1, 8, 13, 21],
        [1, 13, 21, 34]])

y = np.array([[8], [13], [21], [34], [55], [89]])

# check rank of matrix
np.linalg.matrix_rank(X)
# => 3

# calculation
# [1]
np.linalg.pinv(X.T.dot(X)).dot(X.T.dot(y))
# array([[  2.27373675e-13],
#        [  3.33333333e-01],
#        [  1.33333333e+00],
#        [  1.66666667e+00]])

# [2]
np.linalg.pinv(X).dot(y)
# array([[  3.55271368e-15],
#        [  3.33333333e-01],
#        [  1.33333333e+00],
#        [  1.66666667e+00]])

# [4]
np.linalg.solve(X.T.dot(X), X.T.dot(y))
# array([[ -8.12048841e-14],
#        [ -5.00000000e+00],
#        [ -4.00000000e+00],
#        [  7.00000000e+00]])

# Square Matrix
X = X[0:4]
# array([[ 1,  1,  2,  3],
#        [ 1,  2,  3,  5],
#        [ 1,  3,  5,  8],
#        [ 1,  5,  8, 13]])

y = y[0:4]
# array([[ 8],
#        [13],
#        [21],
#        [34]])

# calculation
# [1]
np.linalg.pinv(X.T.dot(X)).dot(X.T.dot(y))
# array([[ -1.13686838e-13],
#        [  3.33333333e-01],
#        [  1.33333333e+00],
#        [  1.66666667e+00]])

# [2]
np.linalg.pinv(X).dot(y)
# array([[  4.44089210e-15],
#        [  3.33333333e-01],
#        [  1.33333333e+00],
#        [  1.66666667e+00]])

# [4]
np.linalg.solve(X.T.dot(X), X.T.dot(y))
# array([[ -1.47008842e-14],
#        [  1.00000000e+00],
#        [  2.00000000e+00],
#        [  1.00000000e+00]])

Now for the rank-deficient case. With NumPy, `solve()` raises no error, but the calculation results are still very different from the ones using `pinv()`.
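
Out of curiosity (this is not part of the original verification), `numpy.linalg.lstsq` on the same rank-deficient data should detect the rank drop and return the minimum-norm least-squares solution, i.e. essentially the same answer as `pinv(X).dot(y)` above:

```python
import numpy as np

X = np.array([[1, 1, 2, 3],
              [1, 2, 3, 5],
              [1, 3, 5, 8],
              [1, 5, 8, 13],
              [1, 8, 13, 21],
              [1, 13, 21, 34]], dtype=float)
y = np.array([8, 13, 21, 34, 55, 89], dtype=float)

theta, residuals, rank, sv = np.linalg.lstsq(X, y)
print(rank)    # 3
print(theta)   # expected to be close to [0, 1/3, 4/3, 5/3], matching pinv(X).dot(y)
```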
