Simple regression analysis in Python

Introduction

**What is regression analysis?** A method for estimating how much an explanatory variable x (the cause) affects an objective variable y (the result). Simple regression analysis is used when there is a single explanatory variable x; multiple regression analysis is used when there are two or more.

**Theoretical model of the regression equation**

y = α + βx + u

Objective variable = intercept + slope × explanatory variable + error term
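
To make the model concrete, here is a minimal sketch of how the intercept α and slope β can be estimated by ordinary least squares in plain Python. The function name `fit_simple_ols` and the data are made up for illustration; they are not from the article's data set.

```python
def fit_simple_ols(xs, ys):
    """Return (alpha, beta) that minimize the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # beta = sum of cross deviations / sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    # The fitted line always passes through the point of means
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Hypothetical data lying exactly on y = 1 + 2x (no error term)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
alpha, beta = fit_simple_ols(xs, ys)
print(alpha, beta)
```

With real data the points do not lie exactly on a line, and the error term u absorbs the difference between each observation and the fitted line.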

Simple regression analysis can be done in Excel, but this time I did it in Python for practice. (I wrote this after checking the reference materials and getting guidance from a university professor, but there may still be mistakes. I would appreciate it if you could point them out :pray:)

Simple regression analysis

**What I want to verify** This time, I examine how much an increase or decrease in the number of direct flights from China, South Korea, Taiwan, and Hong Kong affects the number of visitors to Japan. The objective variable is the number of visitors to Japan from these Asian countries and regions, and the only explanatory variable is the number of direct flights from them. Exchange rates, natural disasters, public safety, and other factors also plausibly affect the number of visitors, so multiple regression analysis would suit this question better; I would like to try that next time.

**Data used**

- Ministry of Land, Infrastructure, Transport and Tourism, Japan Tourism Agency, "Accommodation Travel Statistics Survey", 2015-2018 (http://www.mlit.go.jp/kankocho/siryou/toukei/shukuhakutoukei.html)
- Ministry of Land, Infrastructure, Transport and Tourism, "International Flight Status", 2015-2018 summer and winter timetables (https://www.mlit.go.jp/koku/koku_fr19_000005.html)

I merged the two data sets above into the following Excel sheet.

(Figure: dataset.png)

The sheet summarizes the number of direct flights from each Asian country or region and the number of visitors to Japan by prefecture. A value of 0 is entered for areas that have no direct flights or no airport at all.

Data reading and variable creation

Use pandas to read the data and store it in a DataFrame. Put the number of direct flights in x and the number of visitors to Japan in y.

linear-regression.py


import pandas as pd

# Recent pandas versions reject an encoding argument for read_excel, so it is omitted
df = pd.read_excel('2016_summer_original.xlsx', sheet_name='Sheet2')
x = df[['Korea']]
y = df[['Number of visitors to Japan']]
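
The double brackets in `df[['Korea']]` matter: they return a two-dimensional DataFrame, which is the shape scikit-learn expects for the explanatory variable. The sketch below shows the difference on a small made-up table; the numbers are hypothetical and only the column names follow the article.

```python
import pandas as pd

# Hypothetical three-prefecture table; the values are invented for illustration
df_demo = pd.DataFrame({
    "Korea": [0, 14, 35],
    "Number of visitors to Japan": [1200, 56000, 143000],
})

x_2d = df_demo[["Korea"]]  # DataFrame with one column: shape (3, 1)
x_1d = df_demo["Korea"]    # Series: shape (3,)
print(x_2d.shape, x_1d.shape)
```

Passing the one-dimensional Series as x to `LinearRegression.fit` would raise an error asking for a 2D array, so the DataFrame form is the safer choice.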

Simple regression analysis with scikit-learn

Perform simple regression analysis using scikit-learn and graph with matplotlib.

linear-regression.py


import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Fit the regression model
model_lr = LinearRegression()
model_lr.fit(x, y)

# Plot the observations as points and the fitted regression line on top
plt.plot(x, y, 'o')
plt.plot(x, model_lr.predict(x), linestyle="solid")
plt.show()
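
The plot shows the line but not its equation. As a minimal sketch, the slope and intercept can be read from the fitted model's `coef_` and `intercept_` attributes. The data here is made up so that the expected line is known in advance; `x_demo`, `y_demo`, and `model_demo` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on y = 100 + 50x
x_demo = np.array([[0.0], [2.0], [4.0], [6.0]])  # 2D: one column of x values
y_demo = 100.0 + 50.0 * x_demo[:, 0]             # 1D target

model_demo = LinearRegression().fit(x_demo, y_demo)
slope = float(model_demo.coef_[0])
intercept = float(model_demo.intercept_)
print(f"y = {slope:.2f}x + {intercept:.2f}")
```

Because y in the article is a one-column DataFrame rather than a 1D array, `model_lr.coef_` is a 2D array and `model_lr.intercept_` is a one-element array there, so the values would be read as `model_lr.coef_[0][0]` and `model_lr.intercept_[0]`.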

Also print the regression summary table with statsmodels.

linear-regression.py


import statsmodels.api as sm

# Add an intercept column, fit OLS, and show the regression summary
x_add_const = sm.add_constant(x)
model_sm = sm.OLS(y, x_add_const).fit()

print(model_sm.summary())

The following is the execution result. It finishes in an instant :laughing:

(Figures: korea-LR.png, korea-DS.png)

Interpretation of results

Let's compare the results of China and Hong Kong in the summer of 2015.

**China** (figures: china_LR.png, china_DS.png)

- Model: y = 942.76x + 21142.86
- P>|t|: 0.000
- R-squared: 0.405

**Hong Kong** (figures: hongkong_LR.png, hongkong_DS.png)

- Model: y = 961.33x + 4053.08
- P>|t|: 0.000
- R-squared: 0.654

At a minimum, check the significance of the result and the explanatory power of the fitted equation using the P value and R² in the summary output. The P value is the probability, assuming the null hypothesis is true (the opposite of what you want to claim; here, that the slope is zero), of observing a result at least as extreme as the one actually obtained. If it is below 5%, the result is conventionally treated as statistically significant. The coefficient of determination R² measures how well the estimated regression line fits the observed data; the closer it is to 1, the better the fit. In the figures above, the closer the blue dots are to the orange line, the better the fit.
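
As a rough illustration of what R² measures, the sketch below computes it directly from observed and predicted values as 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared` and the numbers are made up for this example.

```python
def r_squared(ys, y_hats):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


ys = [1.0, 2.0, 3.0, 4.0]

# Predictions that match every observation: the best possible fit
perfect = r_squared(ys, ys)

# Predicting the mean for every observation: no better than no model at all
mean_only = r_squared(ys, [2.5, 2.5, 2.5, 2.5])

print(perfect, mean_only)
```

The two extremes give 1 and 0 respectively, which is why values closer to 1 indicate points lying closer to the regression line.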

According to the regression model, each additional direct flight from China is associated with about 943 more visitors. The P value is reported as 0.000, so the result is significant, but with R² = 0.405 the explanatory power of the equation is low. For Hong Kong, each additional flight is associated with about 961 more visitors; the result is also significant, and with R² = 0.654 the equation explains the data better.
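
To see how the slope is read, the two fitted equations reported above can be evaluated directly. This is only a sketch: the helper `predict_visitors` is a hypothetical name, and 10 flights is an arbitrary input chosen for illustration, not a value taken from the data.

```python
def predict_visitors(flights, slope, intercept):
    """Evaluate the fitted line y = slope * x + intercept."""
    return slope * flights + intercept


# Coefficients as reported in the article's results
china_10 = predict_visitors(10, 942.76, 21142.86)
china_11 = predict_visitors(11, 942.76, 21142.86)
hongkong_10 = predict_visitors(10, 961.33, 4053.08)

# One extra flight changes the prediction by exactly the slope
china_gain = china_11 - china_10

print(round(china_10, 2), round(hongkong_10, 2), round(china_gain, 2))
```

The intercepts differ a lot between the two models, so similar slopes do not imply similar predicted visitor counts. Given the modest R² for China in particular, such predictions should be treated as rough.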

Conclusion

The published direct flight data covers only 2015 to 2018, so the analysis is limited to that period, and because the data is not monthly, continuous changes cannot be analyzed, which was a pity. I did not cover it in the article because it is not the main topic, but collecting and preprocessing the data was harder than the analysis itself :sweat_smile: Next time, I would like to try multiple regression analysis.

Reference material

- I tried to explain how to analyze data with Python for beginners [#1 How to perform simple regression analysis with Scikit-learn](https://medium.com/@yamasaKit/scikit-learn%E3%81%A7%E5%8D%98%E5%9B%9E%E5%B8%B0%E5%88%86%E6%9E%90%E3%82%92%E8%A1%8C%E3%81%86%E6%96%B9%E6%B3%95-f6baa2cb761e)
- How to read the results of simple regression analysis [Excel data analysis tool] [Regression analysis series 2] (Video)
- [Shinichi Kurihara and Atsushi Maruyama, "Statistics Picture Book", Ohmsha](https://www.amazon.co.jp/%E7%B5%B1%E8%A8%88%E5%AD%A6%E5%9B%B3%E9%91%91-%E6%A0%97%E5%8E%9F-%E4%BC%B8%E4%B8%80/dp/427422080X)
