[PYTHON] Simple Regression Analysis with High School Mathematics: Verification of Moore's Law

Introduction

Simple regression analysis can be performed easily with scikit-learn's linear_model.LinearRegression() model. Here, however, after reviewing the principle of simple regression analysis, we will carry it out from scratch without using LinearRegression(). Specifically, using a list of CPU data, I would like to show empirically that Moore's Law, "the integration density of semiconductor circuits doubles every one and a half to two years," holds.
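For comparison, this is what the scikit-learn one-liner looks like; a minimal sketch with made-up toy data, not used in the rest of this article.

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: x must be 2D (n_samples, n_features) for scikit-learn
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

model = LinearRegression().fit(x, y)
print(model.coef_[0], model.intercept_)  # slope a and intercept b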

Principle of simple regression analysis

Simple regression analysis predicts an objective variable (y) from a single explanatory variable (x), expressing the relationship between them as a linear equation y = ax + b. For example, suppose we have multiple pairs of x and y as sample data, such as students' study time (x) and grades (y). Simple regression analysis is what lets us predict the grade (y) corresponding to a newly observed study time (x) based on these data.

Performing a simple regression analysis is equivalent to finding the straight line y = ax + b. So how do we find the slope a and the y-intercept b? This is where the method of least squares comes into play: a and b can be obtained by minimizing the squared errors.

Expressing the desired regression line as follows,

\hat{y}= ax+b

The least squares method is equivalent to finding a and b that minimize E below.

E = \sum_{i=1}^{N}  (y_i - \hat{y_i})^2 = \sum_{i=1}^{N}  (y_i - (ax_i + b))^2 ...(1)
(Here y_i is the y-coordinate of each sample data point, and \hat{y_i} = ax_i + b is the corresponding predicted value on the line.)

(As an aside, why not simply sum the differences between the data points and the predicted values and minimize that, as below?

\sum_{i=1}^{N}  (y_i - \hat{y_i})

As you can see after a moment's thought, data points lying above and below the predicted line cancel each other out, so the total difference comes out smaller than the actual scatter; that is why the differences need to be squared.)
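A two-line check makes the cancellation concrete (toy numbers, purely for illustration):

import numpy as np

residuals = np.array([3.0, -3.0])   # one point above the line, one below
print(residuals.sum())              # 0.0: the plain differences cancel out
print((residuals**2).sum())         # 18.0: squaring keeps the errors visible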

Now, back to the main subject. How can we find the pair of a and b that minimizes E in (1) above? Equation (1) can be regarded as a quadratic function of the two variables a and b (x_i and y_i are known, so they can be treated as constants).

That is, if we partially differentiate Eq. (1) with respect to a and b, set both derivatives to 0, and solve for the pair of a and b, we obtain the a and b that minimize E.
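(Before grinding through it by hand, here is a quick symbolic spot check; a sketch using sympy on made-up toy numbers, purely to confirm the approach.)

import sympy as sp

a, b = sp.symbols('a b')
xs = [1, 2, 3]
ys = [2, 4, 7]
E = sum((y - (a*x + b))**2 for x, y in zip(xs, ys))       # eq. (1) on toy data
print(sp.solve([sp.diff(E, a), sp.diff(E, b)], [a, b]))   # the minimizing (a, b)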

Now let's do the calculation by hand.

 \frac{\partial E}{\partial a} = \sum_{i=1}^{N}  2(y_i - (ax_i+b))(-x_i)
=2(-\sum_{i=1}^{N} x_iy_i + a\sum_{i=1}^{N}x_i^2+b\sum_{i=1}^{N}x_i)=0
 \frac{\partial E}{\partial b} = \sum_{i=1}^{N}  2(y_i - (ax_i+b))(-1)
=2(-\sum_{i=1}^{N} y_i + a\sum_{i=1}^{N}x_i+b\sum_{i=1}^{N}1)=0

That is,

a\sum_{i=1}^{N}x_i^2+b\sum_{i=1}^{N}x_i=\sum_{i=1}^{N} x_iy_i ...(2)
a\sum_{i=1}^{N}x_i+bN=\sum_{i=1}^{N} y_i ...(3)

Here, if we set

A =\sum_{i=1}^{N}x_i^2,\quad B =\sum_{i=1}^{N}x_i,\quad C =\sum_{i=1}^{N}x_iy_i,\quad D =\sum_{i=1}^{N}y_i

then (2) and (3) become

aA+bB=C ...(2)'
aB+bN=D ...(3)'

Let's solve these simultaneous equations step by step (below, the limits on Σ are omitted for readability).

(2)'\times N - (3)'\times B gives\\
a(AN-B^2)=CN-BD\\
\therefore a = \frac{CN-BD}{AN-B^2}=\frac{N\sum x_iy_i-\sum x_i\sum y_i}{N\sum x_i^2-(\sum x_i)^2}=\frac{\bar{xy}-\bar{x}\bar{y}}{\bar{x^2}-\bar{x}^2} \quad (\text{numerator and denominator divided by } N^2) ...(4)

Similarly,

(2)'\times B - (3)'\times A gives\\
b(B^2-AN)=BC-AD\\
\therefore b = \frac{BC-AD}{B^2-AN}=\frac{\sum x_i\sum x_iy_i-\sum x_i^2\sum y_i}{(\sum x_i)^2-N\sum x_i^2}=\frac{\sum x_i^2\sum y_i-\sum x_i\sum x_iy_i}{N\sum x_i^2-(\sum x_i)^2}=\frac{\bar{x^2}\bar{y}-\bar{x}\bar{xy}}{\bar{x^2}-\bar{x}^2} \quad (\text{numerator and denominator divided by } N^2) ...(5)

Now we have the pair of a and b that minimizes E. Substituting the x- and y-coordinates of the sample data points into equations (4) and (5) immediately gives a and b. Finally, from (4) and (5), the regression line we are looking for

\hat{y}=ax+b=\frac{\bar{xy}-\bar{x}\bar{y}}{\bar{x^2}-\bar{x}^2}x+\frac{\bar{x^2}\bar{y}-\bar{x}\bar{xy}}{\bar{x^2}-\bar{x}^2} ...(6)

can be expressed in this form.
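As a quick numerical check of (4) and (5), the hand-derived formulas should agree with numpy's built-in degree-1 polynomial fit; a sketch with toy data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

denom = (x**2).mean() - x.mean()**2
a = ((x*y).mean() - x.mean()*y.mean()) / denom                # eq. (4)
b = ((x**2).mean()*y.mean() - x.mean()*(x*y).mean()) / denom  # eq. (5)

print(a, b)
print(np.polyfit(x, y, 1))  # should print the same slope and intercept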

Moore's Law

So far we have walked through the principle of simple regression analysis at some length. Now, using simple regression analysis, I would like to check whether Moore's Law, "the integration density of semiconductor circuits doubles every one and a half to two years," holds empirically. First, let's obtain a CSV of the list showing the historical transition of CPU transistor counts by scraping.
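One possible way to do this is with pandas.read_html; the following is a minimal sketch assuming the list comes from the English Wikipedia "Transistor count" page (the table index and column names depend on the page and may need adjusting):

import pandas as pd

url = 'https://en.wikipedia.org/wiki/Transistor_count'
tables = pd.read_html(url)  # parses every HTML table on the page into DataFrames
cpu = tables[0]             # assumption: the CPU table comes first; inspect `tables` to be sure
cpu.to_csv('processor.csv', sep='\t', index=False)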

Preparation

Once you have the data, let's start coding. First, let's take a look at the acquired data.

import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('processor.csv', sep='\t')
df

Data like the following is output.

Next, let's extract only the release year of the CPU and the number of transistors and output it as df1.

df1 = pd.concat([df['Date ofintroduction'], df['MOS transistor count']], axis=1)
df1

It looks like the following. Among the transistor count data, entries with "?" cannot be used, so let's exclude them.

df2 = df1[df1['MOS transistor count'] != '?'].copy()  # .copy() avoids SettingWithCopyWarning when adding columns later
df2.head(10)

The data is now clean, as shown below.

Looking at the data, annotation brackets [] remain after the release year and the transistor count. The commas separating the transistor counts into groups of three digits are also in the way, so let's remove them with the following operations.

# Pattern for removing non-digit characters (the thousands-separator commas)
decimal = re.compile(r'[^\d]')
# For the release year, keep only the part to the left of the annotation '[' and add it to df2 as a 'year' column
df2['year'] = df2['Date ofintroduction'].apply(lambda x: int(x.split('[')[0]))
# For the transistor count, extract the first run of digits and commas, strip the commas, and add it to df2 as a 'transistor count' column
df2['transistor count'] = df2['MOS transistor count'].apply(lambda x: int(decimal.sub('', re.match(r'[\d,]+', x).group())))
#output
df2

The result is as follows: you can see that the "year" and "transistor count" columns have been newly added and the data is clean.
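A quick way to confirm that the new columns really are numeric (a small optional sanity check):

print(df2[['year', 'transistor count']].dtypes)      # both should now be int64
print(df2[['year', 'transistor count']].describe())  # the ranges should look plausible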

Plot

Let's draw a scatter plot with "year" on the horizontal axis and "transistor count" on the vertical axis, based on the cleaned data.

#Scatter plot of sample data
X = df2['year']
Y = df2['transistor count']

plt.scatter(X, Y)
plt.xlabel('year')
plt.ylabel('transistor count')
plt.show()

You can see that the number of transistors is increasing exponentially.

Now, let's take the logarithm of the number of transistors and plot it again.

#Logarithmic scatter plot of sample data
Y = np.log(Y)  # natural log of the transistor count
plt.scatter(X, Y)
plt.xlabel('year')
plt.ylabel('log(transistor count)')
plt.show()

You can see a beautifully straight-line relationship.
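As an aside, the same picture can be obtained without overwriting Y by log-scaling the axis instead; an equivalent sketch, differing only in the axis labeling:

plt.scatter(df2['year'], df2['transistor count'])
plt.yscale('log')  # logarithmic vertical axis instead of transforming the data
plt.xlabel('year')
plt.ylabel('transistor count')
plt.show()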

Now let's apply simple regression analysis to this plot by putting equation (6), derived in "Principle of simple regression analysis", into code.

#Logarithmic scatter plot of sample data
Y = np.log(df2['transistor count'])  # recompute from the raw counts so Y is not log-transformed twice
plt.scatter(X, Y, alpha=.8)

#Derivation of regression line
denom = (X**2).mean() - (X.mean())**2  # common denominator of a and b
a = ( (X*Y).mean() - X.mean()*Y.mean() ) / denom
b = ( Y.mean()*(X**2).mean() - X.mean()*(X*Y).mean() ) / denom

#Regression line formula
pred_Y = a*X + b

# R-squared calculation
SSE_l = (Y - pred_Y).dot(Y - pred_Y)      # sum of squared errors around the regression line
SSE_m = (Y - Y.mean()).dot(Y - Y.mean())  # sum of squared errors around the mean

r2 = 1 - SSE_l / SSE_m

plt.plot(X, pred_Y, 'k', label=r'$\hat{y}=ax+b$')  # raw string so the backslash is not treated as an escape
plt.title('$R^2 = %s$' % round(r2, 2))
plt.ylabel('log(transistor count)')
plt.xlabel('year')
plt.legend()
plt.show()

The result is as follows (R-squared = 0.91).

(To add a word about R-squared here: R-squared, also called the coefficient of determination, is an index showing how well the regression line works as a prediction model. It can be expressed by the following formula (SqE stands for squared error).

R^2 = 1 - \frac{\sum SqE_{line}}{\sum SqE_{mean}}= 1 - \frac{\sum (y_i - \hat{y_i})^2}{\sum (y_i - \bar{y})^2}

This formula compares the variability of the data points around the mean line of the sample data with their variability around the regression line. For example, when ΣSqE_line equals ΣSqE_mean, R² is 0 and the regression line does no better than the mean line of the data, meaning it is not functioning as a prediction model at all (a regression line should do better than the mean, which is merely a representative value of the data). On the other hand, when ΣSqE_line is 0, all the data points lie exactly on the regression line, meaning the model predicts the data points perfectly; in that case R² is 1.)
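As a sanity check, for simple regression R² equals the square of the Pearson correlation coefficient between x and y, so the following should match the value in the plot title:

print(np.corrcoef(X, Y)[0, 1] ** 2)  # squared correlation coefficient = R^2 for simple regression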

Verification of Moore's Law

Now, let's roughly verify Moore's Law using the regression line we obtained. As we have seen, the relationship between the transistor count (tc) and the regression line can be expressed as follows.

\log(tc) = ax + b \\
\therefore tc = \exp(ax + b)

Therefore, if the transistor count in some year x_2 is exactly twice the transistor count in year x_1, then

2 = \frac{\exp(ax_2 + b)}{\exp(ax_1 + b)} = \exp(a(x_2-x_1))\\
\therefore \log 2 = a(x_2-x_1)\\
\therefore x_2 - x_1 = \frac{\log 2}{a}

In other words, log2 / a, computed from the slope a of the regression line, is the number of years it takes for the transistor count to double. Let's compute it right away.

print("time to double:", round(np.log(2)/a,1), "years")

output


time to double: 2.1 years

It took about two years for the number of transistors to double. Moore, that's amazing.
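For reference, the implied year-over-year multiplier follows directly from the same slope, as a small follow-up to the script above:

print("annual growth factor:", round(np.exp(a), 2))  # exp(a) = the factor the count grows by each year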
