Introduction to Simple Regression Analysis with Python (Comparison of 6 Libraries of Numerical Calculation/Computer Algebra System)

Overview

When you want to do ** simple regression analysis ** in Python, that is, when you want to find a regression line **, there are several libraries that can easily do that. Here, the regression line (parameters $ a $ and $ b $ of $ y = ax + b $) is set using 6 libraries of "numpy", "scipy", "scikit-learn", "statsmodels", "sympy", and "tensorflow". We have compared and organized the method / code for obtaining.

In addition, as a bonus, we also described how to forcibly obtain the regression equation parameters from the regression line automatically drawn by the data visualization library "seaborn".

The operation of the code is confirmed by Google Colab. (Python 3.6.9).

Preparation

Import various libraries. Then, for the linear function $ 8x + 10 $, the random number $ \ varepsilon $ according to the normal distribution $ N (0,6 ^ 2) $ is added, and the sample data ($ 8 $) is added from $ y = 8x + 10 + \ varepsilon $. (Set of pieces) is generated.

Data preparation


import numpy as np
import pandas as pd
import scipy
import sklearn.linear_model
import statsmodels
import sympy
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(10) #Random sheet fixing
n = 8
x_range = (10,60)
x = np.round(np.linspace(*x_range,n),2)
e = np.random.normal(0,6,n)
y = np.round(0.8 * x + 10 + e,2)
df = pd.DataFrame({'x':x,'y':y})
display(df)

The execution result is as follows.
2020-12-29_21h15_28.png

We will plot this data with matplotlib.

Data plot


fig,ax = plt.subplots(1,1)
ax.scatter(x,y)
ax.set_ylim(0,80)       #Y-axis drawing range
ax.grid()               #Turn on the grid
ax.set_axisbelow(True)  #Move the grid to the back
plt.show()              #Draw in output cell

The execution result is as follows.
2020-12-29_21h18_29.png

For the data plotted above (input $ x $, output $ y $), find the parameter $ a $ of the regression equation $ y = ax + b $ for $ b $.

Regression analysis: stats models

Use statsmodels.formula.api.ols (...) to find the regression equation.

--Official reference: https://www.statsmodels.org/stable/generated/statsmodels.formula.api.ols.html

statsmodels


print(f'> statsmodels {statsmodels.__version__}')
c = statsmodels.formula.api.ols(formula='y ~ x',data=df).fit()
print(f'a = {c.params["x"]:.2f}')
print(f'b = {c.params["Intercept"]:.2f}')

Execution result


> statsmodels 0.10.2
a = 0.72
b = 13.27

Also, following the above, if you use print (c.summary ()), you can check information such as ** coefficient of determination ** and ** parameter confidence interval **.

c.summary()


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.856
Model:                            OLS   Adj. R-squared:                  0.832
Method:                 Least Squares   F-statistic:                     35.69
Date:                Tue, 29 Dec 2020   Prob (F-statistic):           0.000987
Time:                        12:32:03   Log-Likelihood:                -23.986
No. Observations:                   8   AIC:                             51.97
Df Residuals:                       6   BIC:                             52.13
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     13.2717      4.676      2.838      0.030       1.830      24.713
x              0.7230      0.121      5.974      0.001       0.427       1.019
==============================================================================
Omnibus:                        5.195   Durbin-Watson:                   2.078
Prob(Omnibus):                  0.074   Jarque-Bera (JB):                1.759
Skew:                          -1.142   Prob(JB):                        0.415
Kurtosis:                       3.242   Cond. No.                         91.3
==============================================================================

Regression analysis: scikit-learn edition

Find the regression equation using sklearn.linear_model.LinearRegression (). Even in the case of simple regression, the input must be given as a matrix (ndim = 2). In this case, X.shape cannot be(8,)and must be transformed into(8,1).

--Official Reference: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

scikit-learn


print(f'> sklearn {sklearn.__version__}')
X = df['x'].values.reshape(len(df),1) # X.shape -> (8,1)
y = df['y'].values                    # y.shape -> (8,)
lr = sklearn.linear_model.LinearRegression()
lr.fit(X,y)
print(f'a = {lr.coef_[0]:.2f}')
print(f'b = {lr.intercept_:.2f}')

Execution result


> sklearn 0.22.2.post1
a = 0.72
b = 13.27

Regression analysis: numpy edition

Find the regression equation using numpy.polyfit (...).

--Official reference: https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html

numpy


print(f'> numpy {np.__version__}')
x = df['x'].values
y = df['y'].values
c = np.polyfit(x,y,1)
print(f'a = {c[0]:.2f}')
print(f'b = {c[1]:.2f}')

Execution result


> numpy 1.19.4
a = 0.72
b = 13.27

Regression analysis: scipy

Use scipy.optimize.leastsq (...) or scipy.optimize.curve_fit (...) to find the regression equation.

--Official reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.leastsq.html

--Official reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html

py:scipy.optimize.leastsq(...)


print(f'> scipy {scipy.__version__}')
x = df['x'].values
y = df['y'].values
c = scipy.optimize.leastsq(lambda c,x,y:y-(c[0]*x+c[1]),(0,0),args=(x,y))
print(f'a = {c[0][0]:.2f}')
print(f'b = {c[0][1]:.2f}')

py:scipy.optimize.curve_fit(...)


print(f'> scipy {scipy.__version__}')
x = df['x'].values
y = df['y'].values
c = scipy.optimize.curve_fit(lambda x,a,b:a*x+b,x,y)
print(f'a = {c[0][0]:.2f}')
print(f'b = {c[0][1]:.2f}')

The execution results are the same for all, and are as follows.

Execution result


> scipy 1.4.1
a = 0.72
b = 13.27

Regression analysis: sympy edition

sympy is a library of computer algebra, not numerical calculations. Therefore, it is possible to find an analytical solution based on the least squares method.

J in the code is the sum of squares of the residuals (minimized in the least squares method) as follows:

J(a,b) = \sum_{i=1}^{n}\left( y - ( ax+b ) \right)^2

Also, dJa and dJb are analytical solutions of the partial differential of $ J (a, b) $, respectively. Also, eq1 and eq2 are equations with them equal to zero, as follows:

\frac{\partial J(a,b)}{\partial a}=0,\ \ \ \frac{\partial J(a,b)}{\partial b}=0

Then, sympy.solve ((eq1, eq2), (a, b)) solves these simultaneous equations.

sympy


print(f'> sympy {sympy.__version__}')
rx = df['x'].values
ry = df['y'].values

a, b = sympy.symbols('a,b')
n = sympy.symbols('n', integer=True, positive=True)
i = sympy.Idx('i')
x = sympy.IndexedBase('x')
y = sympy.IndexedBase('y')

xi = [ (x[i+1],rx[i]) for i in range(len(rx)) ]
yi = [ (y[i+1],ry[i]) for i in range(len(ry)) ]

J = sympy.Sum((y[i]-(a*x[i]+b))**2,(i,1,n)) #Residual sum of squares

dJa = sympy.diff(J,a) #Partial differential with a
dJb = sympy.diff(J,b) #Partial derivative with b

dJa = dJa.subs(n,len(df)).doit().subs(xi).subs(yi) # xi,Substitution of yi
dJb = dJb.subs(n,len(df)).doit().subs(xi).subs(yi) # xi,Substitution of yi

eq1, eq2 = sympy.Eq(dJa,0), sympy.Eq(dJb,0) #Solve simultaneous equations

r = sympy.solve((eq1,eq2),(a,b))
print(f'a = {float(r[a].subs(xi,yi)):.2f}')
print(f'b = {float(r[b].subs(xi,yi)):.2f}')

Execution result


> sympy 1.1.1
a = 0.72
b = 13.27

Regression analysis: tensorflow

Regression analysis is performed by gradient descent using the TensorFlow (2.X) low-layer function tf.GradientTape (...). To get good results, you need to adjust the learning rate and the number of iterations.

tensorflow


print(f'> tensorflow {tf.__version__}')

x = df['x'].values
y = df['y'].values

a = tf.Variable(0.)
b = tf.Variable(0.)

for i in range(50000):
  with tf.GradientTape() as tape:
    y_pred = a*x+b
    loss = tf.math.reduce_mean(tf.square(y_pred-y))
  gradients = tape.gradient(loss,[a,b])
  a.assign_sub(0.0005 * gradients[0]) #Learning rate 0.0005
  b.assign_sub(0.0005 * gradients[1])

  if (i + 1) % 10000 == 0:
    print(f'Step:{i+1} a = {a.numpy():.2f} b = {b.numpy():.2f}')

print(f'a = {a.numpy():.2f}')
print(f'b = {b.numpy():.2f}')

Execution result


> tensorflow 2.4.0
Step:10000 a = 0.77 b = 11.07
Step:20000 a = 0.73 b = 12.90
Step:30000 a = 0.72 b = 13.21
Step:40000 a = 0.72 b = 13.26
Step:50000 a = 0.72 b = 13.27
a = 0.72
b = 13.27

Regression analysis: seaborn edition (bonus)

The visualization library seaborn.regplot (...) draws regression lines and $ 95 % $ confidence intervals by default as follows:

2020-12-29_22h50_31.png

But unfortunately, there seems to be no way to refer to those regression lines and confidence interval parameters (Reference). There, you can get the plot data of the regression line and calculate the parameters of the equation from it.

seaborn


print(f'> seaborn {sns.__version__}')
fig,ax = plt.subplots(1,1)
d = sns.regplot(x='x', y='y', data=df, ax=ax) #Acquisition of plot data
ax.set_ylim(0,80)
ax.grid() 
ax.set_axisbelow(True)
plt.show()

xr = d.get_lines()[0].get_xdata() #X data of the regression line in the figure
yr = d.get_lines()[0].get_ydata() #Y data of regression line in figure
a = (yr[-1]-yr[0])/(xr[-1]-xr[0])
b = yr[0]-a*xr[0]
print(f'a = {a:.2f}')
print(f'b = {b:.2f}')

The execution result is as follows.

Execution result


> seaborn 0.11.0
a = 0.72
b = 13.27

reference

―― "Basic coating for Python data analysis / machine learning Introduction to pandas library utilization" @ Impress --"TensorFlow2 Programming Implementation Handbook" @ Shuwa System -"Curve fitting by least squares method. Compare using 3 functions" @ Code 7 ward

Recommended Posts

Introduction to Simple Regression Analysis with Python (Comparison of 6 Libraries of Numerical Calculation/Computer Algebra System)
Calculate the regression coefficient of simple regression analysis with python
IPynb scoring system made with TA of Introduction to Programming (Python)
From the introduction of JUMAN ++ to morphological analysis of Japanese with Python
Machine learning with python (2) Simple regression analysis
Thorough comparison of three Python morphological analysis libraries
Simple comparison of Python libraries that operate Excel
[Chapter 5] Introduction to Python with 100 knocks of language processing
Reading Note: An Introduction to Data Analysis with Python
[Chapter 3] Introduction to Python with 100 knocks of language processing
[Chapter 2] Introduction to Python with 100 knocks of language processing
[Chapter 4] Introduction to Python with 100 knocks of language processing
Simple regression analysis in Python
[Introduction to Data Scientists] Descriptive Statistics and Simple Regression Analysis ♬
20200329_Introduction to Data Analysis with Python Second Edition Personal Summary
[Raspi4; Introduction to Sound] Stable recording of sound input with python ♪
First simple regression analysis in Python
Introduction to Python Numerical Library NumPy
[Python] PCA scratch in the example of "Introduction to multivariate analysis"
Introduction to Python Basics of Machine Learning (Unsupervised Learning / Principal Component Analysis)
Introduction to Structural Equation Modeling (SEM), Covariance Structure Analysis with Python
Introduction to Data Analysis with Python P32-P43 [ch02 3.US Baby Names 1880-2010]
Introduction to Data Analysis with Python P17-P26 [ch02 1.usa.gov data from bit.ly]
Introduction to Bayesian Statistical Modeling with python ~ Trying Linear Regression with MCMC ~
Introduction to image analysis opencv python
Logistic regression analysis Self-made with python
[Introduction to Python3 Day 21] Chapter 10 System (10.1 to 10.5)
Introduction and usage of Python bottle ・ Try to set up a simple web server with login function
[Introduction to Python] How to sort the contents of a list efficiently with list sort
I tried fMRI data analysis with python (Introduction to brain information decoding)
[Introduction to Python] What is the method of repeating with the continue statement?
[Introduction to Udemy Python 3 + Application] 10. Numerical values
Introduction to Python Image Inflating Image inflating with ImageDataGenerator
Easy introduction of speech recognition with Python
[EDA] Introduction of Sweetviz (comparison with + pandas-profiling)
[Introduction to Python] Let's use foreach with Python
[Python] Introduction to CNN with Pytorch MNIST
Introduction to Vectors: Linear Algebra in Python <1>
Comparison of matrix transpose speeds with Python
[Introduction to Data Scientists] Basics of Python ♬
Summary of how to read numerical data with python [CSV, NetCDF, Fortran binary]
I tried to make a simple mail sending application with tkinter of Python
[Introduction to Python] How to get the index of data with a for statement
[Python] Easy introduction to machine learning with python (SVM)
Introduction to Artificial Intelligence with Python 1 "Genetic Algorithm-Theory-"
[Introduction to Udemy Python 3 + Application] 26. Copy of dictionary
Library comparison summary to generate PDF with Python
Markov Chain Chatbot with Python + Janome (1) Introduction to Janome
Markov Chain Chatbot with Python + Janome (2) Introduction to Markov Chain
Introduction to Artificial Intelligence with Python 2 "Genetic Algorithm-Practice-"
Performance comparison of face detector with Python + OpenCV
[Introduction to Udemy Python 3 + Application] 19. Copy of list
Introduction to Tornado (1): Python web framework started with Tornado
How to specify attributes with Mock of python
Introduction to formation flight with Tello edu (Python)
[Introduction to minimize] Data analysis with SEIR model ♬
Automating simple tasks with Python Table of contents
Introduction to Python with Atom (on the way)
Introduction to Generalized Linear Models (GLM) with Python
Static analysis of Python code with GitLab CI
[Introduction to Udemy Python3 + Application] 9. First, print with print