You can look up explanations of simple regression analysis as many times as you like, but I thought that actually writing a program myself would deepen my understanding, so I would like to try it with Python.
For reference, there are plenty of explanations of simple regression analysis out there; in short, it fits a straight line y = ax + b to the data so that one variable (y) can be predicted from the other (x).
The test environment uses Jupyter Notebook (I don't even remember when I installed it). The versions used are as follows.
The version of the notebook server is: 6.0.0
Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0]
pandas
pandas is a library for transforming and analyzing data. We will use it to read the data. The version used is as follows.
import pandas as pd
print(pd.__version__)
# 0.24.2
This time, I will use the height (x) and weight (y) data (sample.csv) for 48 people.
sample.csv
x,y
152,57
173,78
172,83
178,58
166,63
175,66
158,66
163,74
157,64
165,68
176,68
165,60
147,63
153,63
146,47
156,49
145,59
181,66
160,74
140,55
152,55
165,56
170,65
159,51
151,52
167,51
177,82
155,63
159,45
170,66
154,56
163,60
161,70
165,70
150,57
158,53
163,67
186,69
168,68
170,74
155,60
159,49
170,87
163,50
166,58
161,69
159,60
171,71
Read the sample.csv file and try to output the first 3 lines. It seems that it can be read as follows.
df = pd.read_csv('sample.csv')
df.head(3)
     x   y
0  152  57
1  173  78
2  172  83
When the data is read with pandas.read_csv, it seems to be stored in a type called DataFrame.
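As a quick sanity check (my own aside, not an original step), the number of rows and the column types can also be confirmed; assuming the 48-row sample.csv above, the shape should be (48, 2).
print(df.shape)   # expected (48, 2): 48 people, columns x and y
print(df.dtypes)  # both columns should be integer types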
Store each column data in variables x and y.
x = df.x
y = df.y
matplotlib
matplotlib is a graph drawing library. The version used is as follows.
import matplotlib
matplotlib.__version__
# '3.1.0'
import matplotlib.pyplot as plt
plt.plot(x, y)
plt.show()
A graph with all the points connected by a line was displayed. What I expected was a graph showing only points, so I'll modify it as follows.
import matplotlib.pyplot as plt
plt.plot(x, y, 'o')
plt.show()
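As a small extra touch (not in the original code), axis labels and a title make the scatter plot easier to read; the units of cm and kg here are my assumption.
import matplotlib.pyplot as plt
plt.plot(x, y, 'o')
plt.xlabel('height (cm)')  # assumed unit
plt.ylabel('weight (kg)')  # assumed unit
plt.title('height vs. weight')
plt.show()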
scikit-learn
scikit-learn is a machine learning library for scientific computing, built on the Python packages NumPy and SciPy. The version used is as follows.
import sklearn
print(sklearn.__version__)
# 0.21.2
It seems that simple regression analysis can be performed easily by using scikit-learn's LinearRegression.
Instantiate a linear regression model (LinearRegression) and train (fit) the data.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x, y)
# ValueError: Expected 2D array, got 1D array instead:
I thought it would work, but I got an error. It seems I passed a 1D array where a 2D array was expected. Let's change how x and y are stored and train again.
x = df[['x']]
y = df[['y']]
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x, y)
# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
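To see why the first attempt failed, comparing the shapes is a quick check (my own aside, not in the original): df.x is a 1D Series, while df[['x']] is a 2D DataFrame.
print(df.x.shape)       # (48,)   -> 1D, rejected by fit()
print(df[['x']].shape)  # (48, 1) -> 2D, accepted by fit()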
It seems it was accepted this time. This part also works if you keep the original definitions of x and y (the 1D Series) and convert them as follows.
# Convert to a numpy ndarray with .values, then reshape to n rows x 1 column with reshape(-1, 1)
model.fit(x.values.reshape(-1,1), y.values.reshape(-1,1))
Let's predict it.
plt.plot(x, y, 'o')
plt.plot(x, model.predict(x), linestyle="solid")
plt.show()
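Just to see what a single prediction looks like (170 cm is an arbitrary value I picked, not from the data), the trained model can also be asked about one new height; note that predict() expects a 2D array.
new_height = [[170]]                           # 2D: one row, one feature
predicted_weight = model.predict(new_height)
print(predicted_weight)                        # 2D array holding one predicted weight (kg)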
As a result of predicting the objective variable (y) from the explanatory variable (x), an upward-sloping line (weight increases as height increases) is drawn.
It seems that the coef_ and intercept_ attributes hold the "slope" and "intercept" of this straight line, respectively, so if you output them, you can get the equation of the straight line.
print('y = %.2fx + %.2f' % (model.coef_ , model.intercept_))
# y = 0.52x + -20.94
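As a cross-check (my own addition, not in the original article), the same slope and intercept can be computed directly from the closed-form least-squares formulas; the result should agree with coef_ and intercept_ above.
import numpy as np
x_arr = df.x.values
y_arr = df.y.values
# least-squares estimates for a straight line y = a*x + b
a = np.sum((x_arr - x_arr.mean()) * (y_arr - y_arr.mean())) / np.sum((x_arr - x_arr.mean()) ** 2)
b = y_arr.mean() - a * x_arr.mean()
print('y = %.2fx + %.2f' % (a, b))  # should match the scikit-learn result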
From the above, by finding a (the slope) and b (the intercept), we can predict y (weight) from x (height); in other words, we have carried out a simple regression analysis.