You can look up explanations of simple regression analysis as many times as you like, but I thought that actually writing a program myself would deepen my understanding, so I would like to try it with Python.
For reference, there are plenty of explanations of simple regression analysis out there; in short, it fits a straight line y = ax + b to the data so that one variable (y) can be predicted from the other (x).
The test environment uses Jupyter Notebook (I don't even remember when I installed it). The versions used are as follows.
The version of the notebook server is: 6.0.0
Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0]
pandas
pandas is a library for transforming and analyzing data. We will use it to read the data. The version used is as follows.
import pandas as pd
print(pd.__version__)
# 0.24.2
This time, I will use the height (x) and weight (y) data (sample.csv) for 48 people.
sample.csv
x,y
152,57
173,78
172,83
178,58
166,63
175,66
158,66
163,74
157,64
165,68
176,68
165,60
147,63
153,63
146,47
156,49
145,59
181,66
160,74
140,55
152,55
165,56
170,65
159,51
151,52
167,51
177,82
155,63
159,45
170,66
154,56
163,60
161,70
165,70
150,57
158,53
163,67
186,69
168,68
170,74
155,60
159,49
170,87
163,50
166,58
161,69
159,60
171,71
Read the sample.csv file and try to output the first 3 lines. It seems that it can be read as follows.
df = pd.read_csv('sample.csv')
df.head(3)
     x   y
0  152  57
1  173  78
2  172  83
When the data is read with pandas.read_csv, it seems to be stored in a type called DataFrame.
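As a quick sanity check (my own aside, not an original step), the number of rows and the column types can also be confirmed; assuming the 48-row sample.csv above, the shape should be (48, 2).
print(df.shape)   # expected (48, 2): 48 people, columns x and y
print(df.dtypes)  # both columns should be integer types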
Store each column data in variables x and y.
x = df.x
y = df.y
matplotlib
matplotlib is a graph drawing library. The version used is as follows.
import matplotlib
matplotlib.__version__
# '3.1.0'
import matplotlib.pyplot as plt
plt.plot(x, y)
plt.show()
A graph with all the points connected by a line was displayed. What I expected was a graph showing only points, so I'll modify it as follows.
import matplotlib.pyplot as plt
plt.plot(x, y, 'o')
plt.show()
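As a small extra touch (not in the original code), axis labels and a title make the scatter plot easier to read; the units of cm and kg here are my assumption.
import matplotlib.pyplot as plt
plt.plot(x, y, 'o')
plt.xlabel('height (cm)')  # assumed unit
plt.ylabel('weight (kg)')  # assumed unit
plt.title('height vs. weight')
plt.show()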
scikit-learn
scikit-learn is a machine learning library for scientific computing, built on the Python packages NumPy and SciPy. The version used is as follows.
import sklearn
print(sklearn.__version__)
# 0.21.2
It seems that simple regression analysis can be performed easily by using scikit-learn's LinearRegression.
Instantiate a linear regression model (LinearRegression) and train (fit) the data.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x, y)
# ValueError: Expected 2D array, got 1D array instead:
I thought it would work, but I got an error. It seems I passed a 1D array where a 2D array was expected. Let's change how x and y are stored and train again.
x = df[['x']]
y = df[['y']]
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x, y)
# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
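To see why the first attempt failed, comparing the shapes is a quick check (my own aside, not in the original): df.x is a 1D Series, while df[['x']] is a 2D DataFrame.
print(df.x.shape)       # (48,)   -> 1D, rejected by fit()
print(df[['x']].shape)  # (48, 1) -> 2D, accepted by fit()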
It seems it was accepted this time. This part also works if you keep the original definitions of x and y (the 1D Series) and convert them as follows.
# Convert to a numpy ndarray with .values, then reshape to n rows x 1 column with reshape(-1, 1)
model.fit(x.values.reshape(-1,1), y.values.reshape(-1,1))
Let's predict it.
plt.plot(x, y, 'o')
plt.plot(x, model.predict(x), linestyle="solid")
plt.show()
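Just to see what a single prediction looks like (170 cm is an arbitrary value I picked, not from the data), the trained model can also be asked about one new height; note that predict() expects a 2D array.
new_height = [[170]]                           # 2D: one row, one feature
predicted_weight = model.predict(new_height)
print(predicted_weight)                        # 2D array holding one predicted weight (kg)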
As a result of predicting the objective variable (y) from the explanatory variable (x), an upward-sloping line (weight increases as height increases) is drawn.
It seems that the coef_ and intercept_ attributes hold the "slope" and "intercept" of this straight line, respectively, so if you output them, you can get the equation of the straight line.
print('y = %.2fx + %.2f' % (model.coef_ , model.intercept_))
# y = 0.52x + -20.94
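As a cross-check (my own addition, not in the original article), the same slope and intercept can be computed directly from the closed-form least-squares formulas; the result should agree with coef_ and intercept_ above.
import numpy as np
x_arr = df.x.values
y_arr = df.y.values
# least-squares estimates for a straight line y = a*x + b
a = np.sum((x_arr - x_arr.mean()) * (y_arr - y_arr.mean())) / np.sum((x_arr - x_arr.mean()) ** 2)
b = y_arr.mean() - a * x_arr.mean()
print('y = %.2fx + %.2f' % (a, b))  # should match the scikit-learn result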
From the above, by finding a (the slope) and b (the intercept), we can predict y (weight) from x (height); in other words, we have carried out a simple regression analysis.