[Python] An SE new to data analysis learns with the data science unit, vol. 1

Introduction

Hello! This is Nakagawa from the Lumada Data Science Lab., Hitachi, Ltd.

At the Lumada Data Science Lab., we actively accept in-house SEs as trainees and train them as data scientists, with the aim of improving the quality of proposals to our customers. In the training, the trainees regularly take on data analysis exercises and discuss the questions that arise with lab members who are engaged in data analysis work. In this article, I would like to introduce the content of one such data analysis exercise and the discussion around it.

By sharing solutions to the problems that people who are just starting out in data analysis tend to run into, and by sharing techniques that are useful to those already doing data analysis work, I hope this article becomes an opportunity to think about what data analysis is.

Trainee profile

- Mr. Matsushita (male, 9 years with the company)
- Engaged in social-security-related SE work in the Public Systems Division
- Experienced in Java and C development, but new to data analysis
- An active mid-career SE who loves traveling abroad and drinking

Contents of the exercise

Here, we would like to introduce the specific contents of the exercises that Mr. Matsushita summarized.

Theme

- Perform multiple regression analysis using Python and scikit-learn. (The development environment is Jupyter Notebook, which is convenient for data analysis in Python.)
- scikit-learn comes with bundled datasets (https://scikit-learn.org/stable/datasets/index.html#boston-house-prices-dataset) that let you try data analysis and machine learning right away.
- This time, we work on the Boston house prices dataset from among those bundled datasets.
- The data analysis follows the CRISP-DM process.

An effective framework for carrying out data analysis is [CRISP-DM (CRoss-Industry Standard Process for Data Mining)](https://mineracaodedados.files.wordpress.com/2012/04/the-crisp-dm-model-the-new-blueprint-for-data-mining-shearer-colin.pdf). It is a data mining methodology and process model that divides the work into the following phases and iterates through them in a PDCA-like cycle, from understanding the customer's business issues through actual modeling and its evaluation to deployment to the business (business improvement).

  1. Understanding the business

  2. Understanding the data

  3. Data preparation

  4. Modeling

  5. Evaluation

  6. Deployment

We also worked on this theme in this order.

1. Understanding the business

Clarify the business issues and set the goal of the data analysis. This time the problem to be solved has already been decided, and it is as follows.

Goal: Create and evaluate a numerical prediction model for Boston house prices

2. Understanding the data

Check the data to be analyzed and decide whether it can be used as it is or whether it needs to be processed. Specifically, check whether any of the data cannot be used for analysis as it is, for example because of many missing values or outliers, and if so, decide on a processing policy such as removal or imputation.

Data reading

Import the library you want to use and load the data for data analysis.

# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

# Load the dataset
from sklearn import datasets
boston_data = datasets.load_boston()

# Store the explanatory variables
boston = pd.DataFrame(boston_data.data, columns=boston_data.feature_names)

# Store the objective variable (house price)
boston_medv = pd.DataFrame(boston_data.target)

# Check the data
boston.info()
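
Note: load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2. If you are following along with a recent scikit-learn, the same data can be loaded from the original source, as suggested in the scikit-learn deprecation notice. A minimal sketch, assuming network access to the CMU StatLib server:

# Alternative loading for newer scikit-learn versions (load_boston was removed in 1.2)
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

# Each record is spread over two physical lines in the raw file
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
boston = pd.DataFrame(np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]]),
                      columns=feature_names)
boston_medv = pd.DataFrame(raw_df.values[1::2, 2])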

The variables of the Boston house prices dataset that are the subject of this data analysis are as follows.

Column name   Contents
CRIM          Per-capita crime rate by town
ZN            Percentage of residential land zoned for lots over 25,000 square feet
INDUS         Percentage of non-retail business area per town
CHAS          Charles River dummy variable (1 if the tract borders the river, 0 otherwise)
NOX           Nitric oxide concentration (parts per 10 million)
RM            Average number of rooms per dwelling
AGE           Percentage of dwellings built before 1940
DIS           Weighted distances to five Boston employment centers
RAD           Index of accessibility to radial highways
TAX           Property tax rate per $10,000
PTRATIO       Pupil-teacher ratio by town
B             1000(Bk - 0.63)^2, where Bk is the proportion of African-American residents by town
LSTAT         Percentage of low-income earners in the population
MEDV          Median house price in units of $1,000 (objective variable)

Checking for missing values

Check the Boston house prices dataset for missing values.

# Check for missing values
boston.isnull().sum()

Output result

#Number of missing values
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
dtype: int64

It was confirmed that there are no missing values in this data.

Checking for outliers and abnormal values

For outliers and abnormal values, we first need to consider what should be treated as an outlier or abnormal value in the first place. To do that, we need to understand the business or phenomenon being analyzed and the background of the data, such as how it was measured. With that background in mind, outliers and abnormal values are judged, for example, from the following viewpoints (one common mechanical rule is sketched after the list).

- Is it a value that is clearly impossible given the nature of the variable (e.g., an age that is negative or a character string)?
- Is it a value that is impossible given the nature of the business or phenomenon (e.g., an applicant's age that does not satisfy the age restriction for applying for a loan)?
- If a probability distribution can be assumed for the variable, does the value deviate significantly from it (e.g., extreme values on both sides appearing frequently when a normal distribution is assumed)?
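
For reference, one common mechanical rule (not applied in this exercise) is to flag values that fall outside 1.5 times the interquartile range. A minimal sketch, using the boston DataFrame loaded above; the helper name is ours:

# Count, for each column, the values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR
def iqr_outlier_counts(df, k=1.5):
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    return ((df < q1 - k * iqr) | (df > q3 + k * iqr)).sum()

print(iqr_outlier_counts(boston))

Such counts are only a starting point; whether the flagged values are really abnormal still has to be judged against the business background described above.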

First, check the variation in values using a box plot.

# Check for outliers with box plots (visualized in a tiled layout)
fig, axs = plt.subplots(ncols=5, nrows=3, figsize=(13, 8))
for i, col in enumerate(boston.columns):
    sns.boxplot(boston[col], ax=axs[i//5, i%5])

# Adjust the spacing between subplots
fig.subplots_adjust(wspace=0.2, hspace=0.5)

# Remove the unused (empty) subplots
fig.delaxes(axs[2, 4])
fig.delaxes(axs[2, 3])

Output result

01_boxplot.png

Looking at the box plots, the values of CRIM, ZN, CHAS, RM, DIS, PTRATIO, B, and LSTAT vary widely, and there may be outliers. Next, use histograms to look at the values and distribution of each variable.

# Check the distributions with histograms (visualized in a tiled layout)
fig, axs = plt.subplots(ncols=5, nrows=3, figsize=(13, 8))
for i, col in enumerate(boston.columns):
    sns.distplot(boston[col], bins=20, kde_kws={'bw':1}, ax=axs[i//5, i%5])

# Adjust the spacing between subplots
fig.subplots_adjust(wspace=0.2, hspace=0.5)

# Remove the unused (empty) subplots
fig.delaxes(axs[2, 4])
fig.delaxes(axs[2, 3])

Output result

02_distplot.png

Checking the histograms while keeping the nature of each variable in mind, the values all look plausible, so we will not treat any of them as outliers or abnormal values here. CHAS has a distinctive shape in both the box plot and the histogram because it is a dummy variable that flags whether a tract is along the river with 0 or 1.

3. Data preparation

According to the policy decided in "Understanding the data", process the data so that it can be fed into the next modeling step. For example, perform processing such as the following (a brief illustrative sketch follows the list).

- Impute missing values, outliers, and abnormal values with the mean or mode, or exclude them
- Bin numerical data into meaningful units and convert it into categorical data
- Flag (one-hot encode) label data so that it can be handled as numbers
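
None of this processing is needed for this dataset, but as an illustration of the operations listed above, here is a minimal sketch using pandas; the tiny DataFrame and its column names are made up for the example:

# Illustrative only -- the DataFrame and column names below are hypothetical
df = pd.DataFrame({'age': [23, None, 45, 31], 'grade': ['A', 'B', None, 'A']})

# Impute missing values: mean for numeric columns, mode for categorical columns
df['age'] = df['age'].fillna(df['age'].mean())
df['grade'] = df['grade'].fillna(df['grade'].mode()[0])

# Bin a numeric column into meaningful categories
df['age_band'] = pd.cut(df['age'], bins=[0, 30, 50, 100], labels=['young', 'middle', 'senior'])

# Flag (one-hot encode) label data so it can be handled as numbers
df = pd.get_dummies(df, columns=['grade'])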

As confirmed in "Understanding the data", there are no missing values, outliers, or abnormal values in this dataset, so we proceed to the next phase as is.

4. Modeling

Build a model using a method suited to the conditions of the data analysis. Select the variables to feed into the model, and split the data into training and test sets to build and evaluate the model. This time we select linear regression analysis (multiple regression) as the modeling algorithm. Incidentally, scikit-learn provides a cheat sheet that summarizes which algorithm or modeling method to choose under which conditions.

Variable selection

Select the variables to feed into the model. This is the process of searching for an effective combination while reducing the number of variables actually used. Reducing the number of variables has the following benefits:

- Lower computational cost and shorter processing time
- Less overfitting and better generalization (prediction performance on unknown data)

The following methods are available for variable selection (a brief sketch of the latter two follows the list).

- Filter Method: rank the variables by an evaluation metric and select the top ones
- Wrapper Method: actually build models with combinations of variables and choose the combination that performs best
- Embedded Method: select variables within the machine learning algorithm itself
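
For reference, the latter two approaches can also be tried directly in scikit-learn. A minimal sketch (not used in this exercise), assuming the boston and boston_medv DataFrames defined above; the number of features to keep and the alpha value are arbitrary choices for illustration:

# Wrapper Method: recursive feature elimination around a linear model
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression, Lasso

X, y = boston, boston_medv.values.ravel()
rfe = RFE(estimator=LinearRegression(), n_features_to_select=8).fit(X, y)
print("RFE keeps:", list(X.columns[rfe.support_]))

# Embedded Method: Lasso regularization drives some coefficients to exactly zero
lasso = Lasso(alpha=0.5, max_iter=10000).fit(X, y)
print("Lasso keeps:", list(X.columns[lasso.coef_ != 0]))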

This time, as an example, we follow the Filter Method and check the strength of the correlation between variables. A heatmap quantifies the correlations, and a pair plot is useful because it visualizes the histogram of each variable and the relationship of every pair of variables.

# Quantify the correlations between quantitative variables with a seaborn heatmap
boston["MEDV"] = boston_medv
plt.figure(figsize=(11, 11))
sns.heatmap(boston.corr(), cmap="summer", annot=True, fmt='.2f', square=True, linewidths=.5)
plt.ylim(0, boston.corr().shape[0])
plt.show()

# Visualize the correlations between quantitative variables graphically with a seaborn pair plot
sns.pairplot(boston)
plt.show()

Output result

03_heatmap.png

04_pairplot.png

Looking at the heatmap, RAD and TAX have a strong positive correlation with each other, so of these two variables we keep TAX, which has the stronger negative correlation with MEDV, and drop RAD.

Splitting the data for training and testing

It is common to use part of the data to train the model and the rest to verify the predictive performance of the trained model. This time we use 50% of the data for training and 50% for testing.

# Store the explanatory variables in Xm and the objective variable in Ym
Xm = boston.drop(['MEDV', 'RAD'], axis=1)
Ym = boston.MEDV

# Import the function that splits the data into train and test sets
from sklearn.model_selection import train_test_split

# The rows assigned to X_train and X_test are chosen at random
# test_size=0.5 sends 50% of the data to the test set
X_train, X_test = train_test_split(Xm, test_size=0.5, random_state=1234)
Y_train, Y_test = train_test_split(Ym, test_size=0.5, random_state=1234)
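
Splitting X and Y in two separate calls only keeps the rows aligned because the same random_state produces the same shuffle in both calls. The more common idiom passes both at once and receives all four pieces in a single call:

# Equivalent, more common form of the split
X_train, X_test, Y_train, Y_test = train_test_split(Xm, Ym, test_size=0.5, random_state=1234)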

Model fitting

Now actually feed in the data and fit a linear regression model (multiple regression model).

# Import scikit-learn's linear regression model and fit it with the training data
from sklearn import linear_model
model_lr = linear_model.LinearRegression()
model_lr.fit(X_train, Y_train)

# Use the trained model to get predictions for the explanatory variables of the test data
predict_lr = model_lr.predict(X_test)

# Regression coefficients
print(model_lr.coef_)
# Intercept
print(model_lr.intercept_)

Output result

#Regression coefficient
[-2.79020004e-02  5.37476665e-02 -1.78835462e-01  3.58752530e+00
 -2.01893649e+01  2.15895260e+00  1.95781711e-02 -1.66948371e+00
  6.47894480e-03 -9.66999954e-01  3.62212576e-03 -6.65471265e-01]
#Intercept
48.68643686655955
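
The coefficient array alone is hard to read, so it can help to pair each coefficient with its variable name, for example as follows:

# Pair each regression coefficient with its explanatory variable name
print(pd.Series(model_lr.coef_, index=X_train.columns).sort_values())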

5. Evaluation

Evaluate the accuracy and performance of the created model to determine whether the goal can be achieved. In addition, the model is tuned as necessary based on the evaluation results.

Model evaluation

To evaluate the accuracy of the model, we measure the error and the strength of the correlation between the model's predicted values and the correct values, using the following indices.

- MAE (Mean Absolute Error): the mean of the absolute errors
- MSE (Mean Squared Error): the mean of the squared errors
- RMSE (Root Mean Squared Error): the square root of MSE
- Coefficient of determination ($R^2$): expresses the squared strength of the correlation between predictions and actual values
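
For reference, with $y_i$ the observed values, $\hat{y}_i$ the predicted values, $\bar{y}$ their mean, and $n$ the number of test samples, these indices are defined as

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,\qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,\qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}},$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

where $R^2$ is written in the form used by scikit-learn's r2_score and LinearRegression.score.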

#Evaluation
# MAE
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(Y_test, predict_lr)
print("MAE:{}".format(mae))

# MSE
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(Y_test, predict_lr)
print("MSE:{}".format(mse))

# RMSE
rmse = np.sqrt(mse)
print("RMSE:{}".format(rmse))

#Coefficient of determination
print("R^2:{}".format(model_lr.score(X_test, Y_test)))
#Evaluation results
MAE:3.544918694530246
MSE:23.394317851152568
RMSE:4.83676729346705
R^2:0.7279094372333412

We have created a model with a coefficient of determination of approximately 0.73.

6. Deployment

Apply the results of the data analysis and its evaluation to the business to solve the business issues. This time the goal was to create and evaluate a model, but in actual business we would use the created model to improve operations and develop systems.

Discussion

Members of the Lumada Data Science Lab. answer the candid questions that the trainee came up with through hands-on data analysis.

I had trouble deciding what to remove as outliers. How should I decide? I couldn't figure it out just by looking at the box plots and histograms.

The purpose is to exclude records that take clearly impossible values or that arose under special conditions, because such data can inadvertently distort the modeling results. First of all, it is important to look at the data carefully. Plotting box plots and histograms, as Mr. Matsushita did, is a common way to notice problems; for example, if the values of a variable are unusually skewed, a plot makes that easy to see. Digging further into the records that take such values may also reveal the phenomenon behind them and what the values actually mean.

At what correlation coefficient do you judge that the correlation between variables is strong? Or can it be checked in another way? I'm not sure about the criteria.

A correlation coefficient of 0.7 or more is generally said to indicate a strong correlation, but in practice it depends on the domain, so it is very important to discuss the criteria with the customer. In this case we focused on the pair of explanatory variables with a correlation coefficient of 0.9 or more, but it is important not only to look at the correlation coefficient itself but also to check how the points are spread in a scatter plot.

Is the coefficient of determination $R^2$ what cross_val_score computes in k-fold cross-validation?

Information on how the library is implemented can be found in the API Reference (https://scikit-learn.org/stable/modules/classes.html), which is worth consulting. The metric computed by cross_val_score is specified with the scoring argument: you can specify a score by name or pass a scoring function. If it is not specified, the score method implemented by the modeling algorithm is used, and LinearRegression implements $R^2$.
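
As a minimal sketch of what this looks like in code, assuming the Xm and Ym used in the exercise:

# 5-fold cross-validation; omitting scoring would fall back to LinearRegression's
# own score method, which is also R^2, so scoring='r2' is written out here only for clarity
from sklearn.model_selection import cross_val_score
scores = cross_val_score(linear_model.LinearRegression(), Xm, Ym, cv=5, scoring='r2')
print(scores, scores.mean())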

Should k-fold cross-validation always be performed? Are there cases where cross-validation is not performed?

k-fold cross-validation is one method among several; as a general rule, it is important to perform some kind of validation. Hold-out validation, k-fold cross-validation, leave-one-out cross-validation, and so on are chosen according to conditions such as the amount of data and its variability. The purpose of validation is to detect the model overfitting the data used for training and to improve the prediction performance (generalization) on unknown data. If the statistical behavior of the population is known and its distribution is clear, you might instead use all the data to estimate the parameters of that distribution.

What should I use to identify the main factor (the explanatory variable that most affects the objective variable)? The regression coefficient? The correlation coefficient? There seem to be many ways to do it, and I'm not sure which to choose.

In the case of a multiple regression model, as long as the explanatory variables are independent (there is no multicollinearity), it is enough to compare the standardized regression coefficients, which account for differences in the scale and units of the variables. Multicollinearity itself requires careful checking of the relationships among the variables, for example with VIF (the Variance Inflation Factor), with an approach that regresses one explanatory variable on the others to evaluate its effect, or with principal component regression, an approach that first synthesizes uncorrelated variables and then regresses on them.
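
As one concrete way to check multicollinearity, here is a minimal sketch of a VIF calculation using statsmodels (not used elsewhere in this article), applied to the X_train from the exercise; a VIF above 10 is a commonly cited warning sign:

# Variance Inflation Factor for each explanatory variable
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X_train)  # VIF is usually computed with an intercept term included
vif = pd.Series([variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
                index=X_train.columns)
print(vif)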

I would like to see how other people implement data analysis, for reference. Is there a good way to do that?

Reading Qiita goes without saying, but there is also a site called Kaggle that hosts data science competitions, where notebooks (data analysis programs) for various problems, including Boston housing, are published and actively discussed. Just reading these is a great way to learn what other data analysts are doing. It is also worth reading an introductory book to get a basic grounding in the underlying statistics.

Impressions of the exercise

Since scikit-learn provides many data analysis methods, we were able to build the model more smoothly than I had imagined. Even so, I often struggled to decide on an analysis policy even with prepared sample data like this, so I imagine that data analysis in actual business requires even more trial and error. I now understand a little of what is meant by the saying that "data analysis is 90% preprocessing up to the modeling stage." This time the focus was mainly on mastering the methods, but I hope that through these exercises we can grasp the essence of the methods and eventually make data-analysis-based proposals that convince our customers.

In conclusion

This time, our trainee Mr. Matsushita worked on the Boston house prices dataset. In the discussion, I think we had meaningful exchanges about outliers, correlation, and how to think about validation. The Lumada Data Science Lab. will continue to post articles about its practical training, so please look forward to the next one. Thank you for reading this far.
