When you want to do **simple regression analysis** in Python, that is, when you want to find a **regression line**, there are several libraries that make it easy. Here, the methods and code for obtaining the regression line (the parameters $a$ and $b$ of $y = ax + b$) are compared and organized for six libraries: numpy, scipy, scikit-learn, statsmodels, sympy, and tensorflow.
In addition, as a bonus, we also describe how to forcibly extract the regression-equation parameters from the regression line that the data visualization library seaborn draws automatically.
The code has been verified on Google Colab (Python 3.6.9).
First, import the various libraries. Then generate the sample data (a set of $8$ points) from $y = 0.8x + 10 + \varepsilon$, i.e. the linear function $0.8x + 10$ plus a noise term $\varepsilon$ drawn from the normal distribution $N(0, 6^2)$.
Data preparation
import numpy as np
import pandas as pd
import scipy.optimize # the optimize submodule must be imported explicitly
import sklearn.linear_model
import statsmodels.formula.api # the formula.api submodule must be imported explicitly
import sympy
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(10) # fix the random seed
n = 8
x_range = (10,60)
x = np.round(np.linspace(*x_range,n),2)
e = np.random.normal(0,6,n)
y = np.round(0.8 * x + 10 + e,2)
df = pd.DataFrame({'x':x,'y':y})
display(df)
The execution result is as follows.
We will plot this data with matplotlib.
Data plot
fig,ax = plt.subplots(1,1)
ax.scatter(x,y)
ax.set_ylim(0,80) #Y-axis drawing range
ax.grid() #Turn on the grid
ax.set_axisbelow(True) #Move the grid to the back
plt.show() #Draw in output cell
The execution result is as follows.
For the data plotted above (input $x$, output $y$), we now find the parameters $a$ and $b$ of the regression equation $y = ax + b$.
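Incidentally, all of the libraries below minimize the residual sum of squares, so (except for the iterative TensorFlow example, which only approaches it) they compute the same closed-form least-squares solution:

$$ a = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad b = \bar{y} - a\,\bar{x} $$

where $\bar{x}$ and $\bar{y}$ are the sample means.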
Use statsmodels.formula.api.ols(...) to find the regression equation.
- Official reference: https://www.statsmodels.org/stable/generated/statsmodels.formula.api.ols.html
statsmodels
print(f'> statsmodels {statsmodels.__version__}')
c = statsmodels.formula.api.ols(formula='y ~ x',data=df).fit()
print(f'a = {c.params["x"]:.2f}')
print(f'b = {c.params["Intercept"]:.2f}')
Execution result
> statsmodels 0.10.2
a = 0.72
b = 13.27
Also, following the above, if you use print(c.summary()), you can check information such as the **coefficient of determination** and the **confidence intervals of the parameters**.
c.summary()
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.856
Model: OLS Adj. R-squared: 0.832
Method: Least Squares F-statistic: 35.69
Date: Tue, 29 Dec 2020 Prob (F-statistic): 0.000987
Time: 12:32:03 Log-Likelihood: -23.986
No. Observations: 8 AIC: 51.97
Df Residuals: 6 BIC: 52.13
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 13.2717 4.676 2.838 0.030 1.830 24.713
x 0.7230 0.121 5.974 0.001 0.427 1.019
==============================================================================
Omnibus: 5.195 Durbin-Watson: 2.078
Prob(Omnibus): 0.074 Jarque-Bera (JB): 1.759
Skew: -1.142 Prob(JB): 0.415
Kurtosis: 3.242 Cond. No. 91.3
==============================================================================
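These values can also be retrieved programmatically from the fitted results object instead of reading them off the summary; a small supplementary sketch (not part of the original code):
print(f'R-squared = {c.rsquared:.3f}') # coefficient of determination
ci = c.conf_int(alpha=0.05)            # 95% confidence intervals as a DataFrame
print(ci.loc['x'])                     # interval for the slope a
print(ci.loc['Intercept'])             # interval for the intercept b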
Find the regression equation using sklearn.linear_model.LinearRegression(). Even for simple regression, the input must be given as a matrix (ndim = 2); in this case X.shape cannot be (8,) and must be reshaped to (8, 1).
- Official reference: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
scikit-learn
print(f'> sklearn {sklearn.__version__}')
X = df['x'].values.reshape(len(df),1) # X.shape -> (8,1)
y = df['y'].values # y.shape -> (8,)
lr = sklearn.linear_model.LinearRegression()
lr.fit(X,y)
print(f'a = {lr.coef_[0]:.2f}')
print(f'b = {lr.intercept_:.2f}')
Execution result
> sklearn 0.22.2.post1
a = 0.72
b = 13.27
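As a supplementary note, the fitted LinearRegression model can also report the coefficient of determination and produce fitted values, for example:
print(f'R^2 = {lr.score(X,y):.3f}') # coefficient of determination
y_fit = lr.predict(X)               # fitted values for the training inputs
print(np.round(y_fit,2))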
Find the regression equation using numpy.polyfit(...).
- Official reference: https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html
numpy
print(f'> numpy {np.__version__}')
x = df['x'].values
y = df['y'].values
c = np.polyfit(x,y,1)
print(f'a = {c[0]:.2f}')
print(f'b = {c[1]:.2f}')
Execution result
> numpy 1.19.4
a = 0.72
b = 13.27
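The numpy documentation recommends the newer numpy.polynomial API for new code; as a supplementary sketch, the same fit can be written roughly as follows (the converted coefficients come back in ascending order, i.e. [b, a]):
p = np.polynomial.Polynomial.fit(x, y, 1) # degree-1 (linear) fit in a scaled domain
b1, a1 = p.convert().coef                 # convert back to the unscaled domain
print(f'a = {a1:.2f}')
print(f'b = {b1:.2f}')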
Use scipy.optimize.leastsq(...) or scipy.optimize.curve_fit(...) to find the regression equation.
- Official reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.leastsq.html
- Official reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
scipy.optimize.leastsq(...)
print(f'> scipy {scipy.__version__}')
x = df['x'].values
y = df['y'].values
c = scipy.optimize.leastsq(lambda c,x,y:y-(c[0]*x+c[1]),(0,0),args=(x,y))
print(f'a = {c[0][0]:.2f}')
print(f'b = {c[0][1]:.2f}')
scipy.optimize.curve_fit(...)
print(f'> scipy {scipy.__version__}')
x = df['x'].values
y = df['y'].values
c = scipy.optimize.curve_fit(lambda x,a,b:a*x+b,x,y)
print(f'a = {c[0][0]:.2f}')
print(f'b = {c[0][1]:.2f}')
The execution result is the same for both, and is as follows.
Execution result
> scipy 1.4.1
a = 0.72
b = 13.27
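As an aside not used in the code above, scipy also provides scipy.stats.linregress, which returns the slope, intercept, and related statistics directly; a minimal sketch:
import scipy.stats
res = scipy.stats.linregress(x, y)  # simple linear regression
print(f'a = {res.slope:.2f}')
print(f'b = {res.intercept:.2f}')
print(f'R^2 = {res.rvalue**2:.3f}') # squared correlation coefficient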
sympy is a computer algebra library rather than a numerical one, so it can be used to obtain an analytical solution based on the least squares method. In the code, J is the residual sum of squares (the quantity minimized by the least squares method):

$$ J(a, b) = \sum_{i=1}^{n} \bigl( y_i - (a x_i + b) \bigr)^2 $$

dJa and dJb are the analytical partial derivatives of $J(a, b)$ with respect to $a$ and $b$, and eq1 and eq2 are the equations obtained by setting them equal to zero:

$$ \frac{\partial J}{\partial a} = -2 \sum_{i=1}^{n} x_i \bigl( y_i - (a x_i + b) \bigr) = 0, \qquad \frac{\partial J}{\partial b} = -2 \sum_{i=1}^{n} \bigl( y_i - (a x_i + b) \bigr) = 0 $$

Then sympy.solve((eq1, eq2), (a, b)) solves these simultaneous equations.
sympy
print(f'> sympy {sympy.__version__}')
rx = df['x'].values
ry = df['y'].values
a, b = sympy.symbols('a,b')
n = sympy.symbols('n', integer=True, positive=True)
i = sympy.Idx('i')
x = sympy.IndexedBase('x')
y = sympy.IndexedBase('y')
xi = [ (x[i+1],rx[i]) for i in range(len(rx)) ]
yi = [ (y[i+1],ry[i]) for i in range(len(ry)) ]
J = sympy.Sum((y[i]-(a*x[i]+b))**2,(i,1,n)) #Residual sum of squares
dJa = sympy.diff(J,a) # partial derivative with respect to a
dJb = sympy.diff(J,b) # partial derivative with respect to b
dJa = dJa.subs(n,len(df)).doit().subs(xi).subs(yi) # substitute the data values xi, yi
dJb = dJb.subs(n,len(df)).doit().subs(xi).subs(yi) # substitute the data values xi, yi
eq1, eq2 = sympy.Eq(dJa,0), sympy.Eq(dJb,0) # set the derivatives equal to zero
r = sympy.solve((eq1,eq2),(a,b)) # solve the simultaneous equations
print(f'a = {float(r[a]):.2f}') # the data is already substituted, so the solution is numeric
print(f'b = {float(r[b]):.2f}')
Execution result
> sympy 1.1.1
a = 0.72
b = 13.27
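Alternatively, without differentiating by hand, the same solution can be written via the normal equations using sympy's Matrix class; a supplementary sketch under that approach:
A = sympy.Matrix([[v, 1] for v in rx]) # design matrix with columns [x_i, 1]
bv = sympy.Matrix([[v] for v in ry])   # observation vector (column)
sol = (A.T * A).inv() * (A.T * bv)     # normal equations: (A^T A)^{-1} A^T y
print(f'a = {float(sol[0]):.2f}')
print(f'b = {float(sol[1]):.2f}')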
Regression analysis is performed by gradient descent using tf.GradientTape(...), a low-level TensorFlow (2.x) API. To get good results, you need to tune the learning rate and the number of iterations.
tensorflow
print(f'> tensorflow {tf.__version__}')
x = df['x'].values
y = df['y'].values
a = tf.Variable(0.)
b = tf.Variable(0.)
for i in range(50000):
    with tf.GradientTape() as tape:
        y_pred = a*x+b
        loss = tf.math.reduce_mean(tf.square(y_pred-y))
    gradients = tape.gradient(loss,[a,b])
    a.assign_sub(0.0005 * gradients[0]) # learning rate 0.0005
    b.assign_sub(0.0005 * gradients[1])
    if (i + 1) % 10000 == 0:
        print(f'Step:{i+1} a = {a.numpy():.2f} b = {b.numpy():.2f}')
print(f'a = {a.numpy():.2f}')
print(f'b = {b.numpy():.2f}')
Execution result
> tensorflow 2.4.0
Step:10000 a = 0.77 b = 11.07
Step:20000 a = 0.73 b = 12.90
Step:30000 a = 0.72 b = 13.21
Step:40000 a = 0.72 b = 13.26
Step:50000 a = 0.72 b = 13.27
a = 0.72
b = 13.27
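Incidentally, if an iterative fit is not needed, TensorFlow can also solve the same least-squares problem in closed form with tf.linalg.lstsq; a minimal supplementary sketch:
X2 = tf.stack([x, tf.ones_like(x)], axis=1) # design matrix with columns [x_i, 1]
y2 = tf.reshape(tf.constant(y), (-1, 1))    # observations as a column vector
w = tf.linalg.lstsq(X2, y2)                 # least-squares solution [[a], [b]]
a_ls, b_ls = w.numpy().ravel()
print(f'a = {a_ls:.2f}')
print(f'b = {b_ls:.2f}')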
The visualization library seaborn.regplot(...) draws a regression line and its $95\%$ confidence interval by default, as follows. Unfortunately, however, there seems to be no way to refer to the parameters of that regression line or of the confidence interval directly. Instead, you can get the plot data of the drawn regression line and calculate the parameters of the equation from it.
seaborn
print(f'> seaborn {sns.__version__}')
fig,ax = plt.subplots(1,1)
d = sns.regplot(x='x', y='y', data=df, ax=ax) # draw; keep the Axes to extract the plot data later
ax.set_ylim(0,80)
ax.grid()
ax.set_axisbelow(True)
plt.show()
xr = d.get_lines()[0].get_xdata() #X data of the regression line in the figure
yr = d.get_lines()[0].get_ydata() #Y data of regression line in figure
a = (yr[-1]-yr[0])/(xr[-1]-xr[0])
b = yr[0]-a*xr[0]
print(f'a = {a:.2f}')
print(f'b = {b:.2f}')
The execution result is as follows.
Execution result
> seaborn 0.11.0
a = 0.72
b = 13.27
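As a quick sanity check (a supplementary addition), the values recovered from the plot can be compared against np.polyfit:
a_chk, b_chk = np.polyfit(df['x'].values, df['y'].values, 1) # cross-check with np.polyfit
print(f'check: a = {a_chk:.2f}, b = {b_chk:.2f}')            # should match the values above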
―― "Basic coating for Python data analysis / machine learning Introduction to pandas library utilization" @ Impress --"TensorFlow2 Programming Implementation Handbook" @ Shuwa System -"Curve fitting by least squares method. Compare using 3 functions" @ Code 7 ward