Machine learning with Python (2): Simple regression analysis

In the previous article I summarized the overall classification of machine learning methods. Starting with this article, I will walk through concrete implementations one at a time.

The previous article is here: https://qiita.com/U__ki/items/4ae54da6cadec8a84d1b

Implementation of simple regression analysis

The theme this time is "What TV size fits the size of the room?" Some of you may be planning to move this spring and wondering how to pick a TV that suits the new room. In this article I use simple regression analysis to estimate a suitable TV size from the room size.

While I was at it, I decided to create the sample data with pandas and then run the analysis on that data.

csv file creation with pandas

First, I took the recommended TV size for each room size from this article (https://www.olive-hitomawashi.com/lifestyle/2019/10/post-294.html). The values are listed below. (If you have this table, you don't really need this article.)

[TV size]
6 tatami mats: 24 inches
8 tatami mats: 32 inches
10 tatami mats: 40 inches
12 tatami mats: 50 inches

Output this data as a csv file using pandas.

create_csv.py


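# Create the CSV file with pandas
import pandas as pd

# Room size in tatami mats and the recommended TV size in inches
df = pd.DataFrame(
    [[6, 24],
     [8, 32],
     [10, 40],
     [12, 50]],
    columns=["room_size", "tv_inch"],
)
# index=False keeps the row index out of the file
df.to_csv("room_tv.csv", index=False)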

df is short for DataFrame. Setting index=False leaves the row index out of the CSV. Running the script creates a new file called room_tv.csv in the folder the script is run from.

If the following file has been created in that folder, it worked.

room_tv.csv


room_size,tv_inch
6,24
8,32
10,40
12,50

Now you have the CSV file used in the rest of this article.
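To see what index=False actually changes, here is a small sketch. It assumes the same four rows as above and uses an in-memory string instead of a file, so the variable names with_index, without_index, and loaded are only for illustration.

```python
import io
import pandas as pd

df = pd.DataFrame(
    [[6, 24], [8, 32], [10, 40], [12, 50]],
    columns=["room_size", "tv_inch"],
)

# With no path argument, to_csv returns the CSV text instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # header begins with an empty index column
print(without_index.splitlines()[0])  # header has only the two data columns

# Reading the text back gives integer columns that are ready for arithmetic.
loaded = pd.read_csv(io.StringIO(without_index))
print(loaded.dtypes)
```

Without index=False, the file gets an extra unnamed first column holding 0, 1, 2, 3, which would then have to be dropped again after loading.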

Simple regression analysis

Next comes the main topic of this article: simple regression analysis.

Simple regression analysis consists of the following three steps.
・ **Determine the model**
・ **Set the evaluation function**
・ **Minimize the evaluation function (determine the slope)**

Model determination

First, load the CSV file.

main.py


import pandas as pd
df=pd.read_csv("room_tv.csv")

If you are using Jupyter Notebook, evaluating df displays the data created earlier as a table. [Figure: the DataFrame as displayed in Jupyter Notebook]

Next, let's look at the data. In Python, Matplotlib is a simple way to plot it.

main.py


import matplotlib.pyplot as plt

x=df["room_size"]  # room size (tatami mats)
y=df["tv_inch"]    # TV size (inches)

# Scatter plot of the raw data
plt.scatter(x,y)
plt.show()

Here, x is the size of the room and y is the size of the TV.

[Figure: scatter plot of room size (x) against TV size (y)]

The data looks easy to work with. A linear function should be enough this time (**model determination**).

The data could be used as it is, but fitting y=ax+b to it involves two unknowns, a and b. That calculation is possible, but **centering the data** removes one of them.

**Centering the data**: take the mean of all the data and subtract it from each data point. With pandas, computing the mean and subtracting it from every value is extremely easy. After centering, only y = ax needs to be considered.

main.py


# Get the mean of each variable
xm=x.mean()
ym=y.mean()

# Center the data by subtracting the mean from every value
xc=x-xm
yc=y-ym

# Plot again
plt.scatter(xc,yc)
plt.show()

With the code above, the plot changes as follows. [Figure: scatter plot of the centered data]

The plot looks much like the previous one, but because the intercept (b) no longer needs to be considered, the calculations that follow become considerably easier. With the model

\hat{y}=ax

we only need to find a.
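One way to see why centering removes the intercept is the short derivation below. It is a sketch that assumes the full model \hat{y}=ax+b is fitted by minimizing the sum of squared errors (the criterion introduced in the next section); the symbols \bar{x} and \bar{y} for the means are notation added here.

\begin{align}
\frac{\partial}{\partial b}\sum_{n=1}^{N}(y_n-ax_n-b)^2&=-2\sum_{n=1}^{N}(y_n-ax_n-b)=0\\
b&=\bar{y}-a\bar{x}\\
\hat{y}-\bar{y}&=a(x-\bar{x})
\end{align}

In other words, the fitted line always passes through the point of means, and after centering that point is the origin.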

Determination of evaluation function

We want to choose the formula so that the error between the predicted values (from machine learning) and the measured values is as small as possible. The quantity used to measure this is called the evaluation function; in data science it means the same thing as a "loss function". I will keep the explanation brief here: we use the squared error and look for the parameter that makes it small. With y as the measured value and \hat{y} (y-hat) as the predicted value, the squared error can be written as

\begin{align}
L&=(y_1-\hat{y}_1)^2+(y_2-\hat{y}_2)^2+\cdots+(y_N-\hat{y}_N)^2\\
 &=\sum_{n=1}^{N}(y_n-\hat{y}_n)^2
\end{align}

Here, *L* is called the evaluation function.
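As a minimal sketch of this evaluation function, the snippet below computes L for a few candidate slopes on the centered data from the table above. The helper name squared_error is my own for illustration and does not come from any library.

```python
import pandas as pd

# Data from the article's table (room size in tatami mats, TV size in inches).
x = pd.Series([6, 8, 10, 12], dtype=float)
y = pd.Series([24, 32, 40, 50], dtype=float)

# Center both variables so that the model is y_hat = a * x.
xc = x - x.mean()
yc = y - y.mean()

def squared_error(a, x_values, y_values):
    """Sum of squared differences between y_values and the prediction a * x_values."""
    y_hat = a * x_values
    return float(((y_values - y_hat) ** 2).sum())

# Compare a few candidate slopes; a smaller L means a better fit.
for a in (3.0, 4.0, 5.0):
    print(a, squared_error(a, xc, yc))
```

Of these three candidates, a = 4.0 gives the smallest L, which already hints at where the minimum lies.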

Minimization of evaluation function

Since \hat{y}_n=ax_n, the evaluation function is a quadratic function of a, and the coefficient of a^2 is a sum of squares, so the parabola opens upward. The point where its slope is 0 is therefore the minimum, where the squared error is smallest.

Finding the point where the slope of a quadratic function is 0 is high school mathematics: differentiate and solve for "= 0".

Because the variable is a, we solve

\frac{\partial}{\partial a}(L)=0

Substituting \hat{y}_n=ax_n into the formula for L and expanding (x_n and y_n are the centered values) gives

\begin{align}
L&=\sum_{n=1}^{N}y_n^2-2(\sum_{n=1}^{N}x_ny_n)a+(\sum_{n=1}^{N}x_n^2)a^2\\
&=c_0-2c_1a+c_2a^2
\end{align}

Substituting this into the condition above gives

\begin{align}
\frac{\partial}{\partial a}(c_0-2c_1a+c_2a^2)&=-2c_1+2c_2a=0\\
a&=\frac{c_1}{c_2}=\frac{\sum_{n=1}^{N}x_ny_n}{\sum_{n=1}^{N}x_n^2}
\end{align}


Let's turn this into code.

main.py



# Compute the element-wise products that appear in the formula
xx=xc*xc
xy=xc*yc

# Compute the slope a
a=xy.sum()/xx.sum()

# Plot the centered data and the fitted line on the same centered axis
plt.scatter(xc,yc, label="y")
plt.plot(xc,a*xc, label="y_hat", color="green")
plt.legend()
plt.show()

The resulting plot is shown below. [Figure: centered data points with the fitted line y_hat]

Because the line is drawn only over the range of the measured values, the segment is short, but we have obtained a.
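As a sanity check on the value of a, the sketch below compares the closed-form slope with numpy.polyfit fitted to the uncentered data. NumPy is not used elsewhere in this article, so treat it as an optional cross-check rather than part of the method.

```python
import numpy as np

# Data from the article's table.
x = np.array([6.0, 8.0, 10.0, 12.0])
y = np.array([24.0, 32.0, 40.0, 50.0])

# Closed-form slope on the centered data.
xc = x - x.mean()
yc = y - y.mean()
a = (xc * yc).sum() / (xc * xc).sum()

# Reference fit of y = slope * x + intercept on the original data.
slope, intercept = np.polyfit(x, y, 1)

print(a, slope, intercept)
```

The two slopes agree (about 4.3 for this data), and the intercept returned by polyfit matches ym - a * xm, which is the value that centering lets us skip.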

This gives the slope a. However, because the data was centered, remember when actually applying the model to subtract the mean of x from the input, multiply the result by a, and finally add the mean of y:

\hat{y}=a(x-\bar{x})+\bar{y}
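The step of undoing the centering can be sketched as follows. The function name predict_tv_inch is hypothetical and the data is typed in directly rather than read from room_tv.csv, so that the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({"room_size": [6, 8, 10, 12], "tv_inch": [24, 32, 40, 50]})
x, y = df["room_size"], df["tv_inch"]

# Means and slope, computed exactly as in the article.
xm, ym = x.mean(), y.mean()
xc, yc = x - xm, y - ym
a = (xc * yc).sum() / (xc * xc).sum()

def predict_tv_inch(room_size):
    """Predict the TV size in inches for a room size given in tatami mats."""
    # Center the input, apply the slope, then shift back by the mean of y.
    return a * (room_size - xm) + ym

print(predict_tv_inch(7))
```

For a 7-mat room this gives about 27.9 inches, which sits between the 24 and 32 inches recommended for 6 and 8 mats.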

Finally, here is all of the code in one place.

main.py



import pandas as pd
import matplotlib.pyplot as plt

df=pd.read_csv("room_tv.csv")
x=df["room_size"]
y=df["tv_inch"]

# Mean values
xm=x.mean()
ym=y.mean()

# Centering
xc=x-xm
yc=y-ym

# Slope from the closed-form solution
xx=xc*xc
xy=xc*yc

a=xy.sum()/xx.sum()

# Plot the centered data and the fitted line
plt.scatter(xc,yc, label="y")
plt.plot(xc,a*xc, label="y_hat", color="green")
plt.legend()
plt.show()


Bonus

main.py



# Summary statistics of the data
df.describe()

This prints summary statistics for each column (count, mean, standard deviation, minimum, quartiles, and maximum). It is useful when working with larger, more complex datasets. [Figure: output of df.describe()]

Finally

The code was short, but it carries a lot of meaning. Next time I would like to cover multiple regression analysis. Going forward, I plan to post not only programming topics such as coding and error handling, but also articles related to neuroscience.
