[Python] Preprocessing in machine learning 3: Missing values, outliers, and imbalanced data

Aidemy 2020/10/29

Introduction

Hello, this is Yope! I am a liberal arts student, but I was interested in the possibilities of AI, so I went to the AI-specialized school "Aidemy" to study. I would like to share the knowledge I gained there, so I am summarizing it on Qiita. I am very happy that many people read my previous summary article. Thank you! This is the third post on machine learning preprocessing. Nice to meet you.

What to learn this time
・Handling of missing values
・Handling of outliers
・Handling of imbalanced data

Handling of missing values

About missing values

・A __missing value__ is __empty data__, represented by "NaN".
・If NaN is present in the data, __the overall mean and standard deviation cannot be calculated__ (computations involving NaN simply return NaN).
・If __all__ rows containing even one missing value are deleted, the data is wasted or becomes biased.

Mechanism of missing values

・As seen in the previous section, data containing missing values cannot be analyzed unless it is processed properly. Appropriate preprocessing of the missing values is therefore necessary, but the method differs depending on the mechanism by which the missing values were generated.

・There are the following three mechanisms that generate missing values.

1. __MCAR: The probability of a value being lost (becoming NaN) is unrelated to the data itself (it occurs randomly)__ ex) Some data was lost due to a computer malfunction.
2. __MAR: The probability of a value being lost can be inferred from items other than that item__ ex) In a questionnaire of men and women, if the "gender" item is female, the probability that the "age" item is missing is high.
3. __NMAR: The probability of a value being lost depends on the value of the item itself__ ex) In a questionnaire of men and women, the older the respondent actually is, the higher the probability that the "age" item is missing.

Dealing with missing values

・The appropriate countermeasure differs depending on the mechanism that generated the missing values.

・For MCAR, __"listwise deletion"__, which deletes every row containing a missing value, is an option. If this leaves too little data, an __"imputation method"__ (described later) can be used instead.
・For MAR, use an __"imputation method"__. Listwise deletion is inappropriate because, in the example above, much of the female data would be deleted and the dataset would become biased.
・For NMAR, it is difficult to handle the missing values properly, so __basically the data is re-collected__.

Imputation methods

・There are roughly two types of imputation methods. Details are described later.

  1. __Single imputation__: Fill in the missing values once to create a complete dataset, and use it to build a model.
  2. __Multiple imputation__: Create multiple complete datasets by filling in the missing values several times, build a model on each, and finally integrate the models into one.

Visualization of missing values

・By visualizing the data, find out where the NaNs are.
・If you want to know how many NaNs each column contains, use pandas' __"Data.isnull().sum()"__.
・If you want to visualize the missing values across the entire dataset, use the __matrix function__ of the __missingno package__.

・Code![Screenshot 2020-10-29 15.04.51.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/80bcfbae-f632-584d-74ff-9e37c2c9965b.png)

・Result (white parts are missing values)![Screenshot 2020-10-29 15.05.17.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/948b7b71-7cf9-be7d-f645-e7eea9335423.png)
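・Since the code in this post exists only as screenshots, here is a minimal text sketch of the same steps; the DataFrame and its columns are hypothetical.

```python
import numpy as np
import pandas as pd
import missingno as msno

# Hypothetical DataFrame containing NaNs
df = pd.DataFrame({
    "age":    [23, np.nan, 45, 31, np.nan],
    "height": [160.0, 172.5, np.nan, 155.0, 168.0],
    "weight": [50.0, 65.0, 70.0, np.nan, np.nan],
})

# Number of NaNs in each column
print(df.isnull().sum())

# Visualize missing values across the whole dataset
# (white stripes in the plot are the missing values)
msno.matrix(df)
```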

Single imputation

・When filling in missing values with __single imputation__, there are three main methods, depending on the purpose.

・__Mean imputation__: Substitute the mean of the item for the missing NaN values. Because many values equal to the mean are added, the overall variance shrinks, so this cannot be used when variance and error must be taken into account.
・__Stochastic regression imputation__: Not covered in detail here, but it can take variance and error into account.
・__Hot deck method__: Fill in a missing value by having the data row that contains it (the recipient) receive the value from another data row whose data is close to it (the donor).
・The closeness of the data is determined using the __"nearest neighbor method"__.

Hot deck method

・The hot deck method is written and executed as follows.
knnimpute.knn_impute_few_observed(matrix, missing_mask, k)

・For the arguments, pass the data converted to np.matrix as "matrix". For "missing_mask", pass a matrix indicating where the missing values are, created with __np.isnan(matrix)__. "k" specifies the number of neighboring points taken into account by the "nearest neighbor" method (kNN).

・kNN is a method covered in "Supervised Learning (Classification)": when classifying a particular data point, it takes the k nearest points into account.

・Also, since the data was converted to np.matrix, convert it back to the original data format afterwards if necessary.

・Code![Screenshot 2020-10-29 15.11.54.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/db15c25d-1ae3-c0b5-6711-fbaf689c50f5.png)
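・As a minimal sketch of hot deck imputation with the knnimpute package (the data here is hypothetical, and plain np.array input is assumed):

```python
import numpy as np
import knnimpute

# Hypothetical data containing missing values (np.nan)
matrix = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.0],
    [0.9, np.nan, 2.9],
    [5.0, 6.0, 7.0],
])

# True where a value is missing
missing_mask = np.isnan(matrix)

# Fill each missing value from the k=3 nearest rows (the donors)
completed = knnimpute.knn_impute_few_observed(matrix, missing_mask, k=3)
print(completed)
```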

About multiple imputation

・With single imputation, the filled-in value is only a predicted value, so there is the drawback that the analysis results of a model built on it are not necessarily accurate.
・Multiple imputation, on the other hand, fills in the data multiple times to create multiple datasets, and by integrating their analysis results into one, __a final result can be produced__.

・The model used in multiple imputation can be expressed as follows using mice from statsmodels.
mice.MICE(formula, optimizer, imp_data)

・For the arguments, pass the formula of the variables the model predicts as "formula". For example, a linear multiple regression model that predicts the variable "distance" from "time" and "speed" is expressed as follows. 'distance ~ time + speed'
・Specify the analysis model in "optimizer", and pass a __"mice.MICEData(data)"__ object in "imp_data" so that the data can be handled by MICE.

・The imputed analysis result is obtained by calling the __fit()__ method on the MICE model created this way (this executes the multiple imputation).
・The first argument of fit() specifies __the number of trials per imputation__, and the second argument specifies __the number of datasets to create__.

・You can check this analysis result with the __summary()__ method.

・Code![Screenshot 2020-10-29 15.13.33.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/a40802e3-fe89-1c45-b1a3-c5bf17e2ea36.png)
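・A minimal sketch using statsmodels' actual MICE API, where the second argument is the model class (e.g. sm.OLS); the data is synthetic:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation import mice

# Synthetic data: 'distance' has some missing values
rng = np.random.default_rng(0)
time = rng.uniform(1, 10, 100)
speed = rng.uniform(20, 60, 100)
distance = time * speed + rng.normal(0, 5, 100)
distance[::7] = np.nan
df = pd.DataFrame({"distance": distance, "time": time, "speed": speed})

# Wrap the data so MICE can handle it
imp_data = mice.MICEData(df)

# Linear multiple regression predicting distance from time and speed
model = mice.MICE("distance ~ time + speed", sm.OLS, imp_data)

# 10 update cycles per imputation, 10 imputed datasets, pooled into one result
results = model.fit(10, 10)
print(results.summary())
```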

About outliers

Problems caused by outliers

・__Outliers__ are data points that are __significantly distant from the other data__.
・If outliers are mixed in, problems occur such as __"the analysis results are not accurate" and "model training is slow"__.
・Therefore, it is necessary to __detect outliers at the preprocessing stage and exclude them__.

Visualization of outliers

・Visualizing the data makes it easy to see whether there are outliers.
・Use seaborn's __boxplot(x, y, whis)__ for visualization. It draws a "box plot", with outliers marked by "♦︎".
・The arguments "x" and "y" specify the x-axis and y-axis data, and "whis" specifies the criterion for treating a point as an outlier.

・When the data is two-dimensional, __jointplot(x, y, data)__ can also be used. Since the points are drawn as in a scatter plot, outliers can be detected visually.
・Pass a DataFrame to the "data" argument.

・Code (only the y-axis is specified, since it is enough to see how far values deviate vertically): Screenshot 2020-10-29 15.14.57.png

・Result![Screenshot 2020-10-29 15.15.24.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/eed4d5ce-e5ff-4c74-7105-230a2789e22b.png)
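・A minimal sketch of both plots, with made-up data containing a few extreme values:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up data with a few extreme values mixed in
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(50, 5, 100), [95, 110]])
df = pd.DataFrame({"value": values})

# Box plot: only y is specified; outliers appear as diamonds
sns.boxplot(y=df["value"], whis=1.5)
plt.show()

# For two-dimensional data: scatter plot with marginal histograms
df2 = pd.DataFrame({"x": rng.normal(0, 1, 100), "y": rng.normal(0, 1, 100)})
sns.jointplot(x="x", y="y", data=df2)
plt.show()
```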

Outlier detection [LOF]

・Once it is known that outliers exist, the next step is to __actually detect them__.
・Detection requires a criterion for __"which values count as outliers"__. This time we use scikit-learn's __"LOF"__ method, which detects outliers using a criterion set in advance.
・LOF judges outliers as follows. (1) A point with __few data points nearby__ is considered an outlier. (2) Whether a point is "nearby" is judged using its __k nearest points__. (3) A point whose data density by this judgment is __low relative to its surroundings__ is judged to be an outlier.

・LOF can be used with __LocalOutlierFactor(n_neighbors=k)__.
・For n_neighbors, specify the number of neighboring points k.

・Then fit the data with the __fit_predict(data)__ method, as with a normal model, and detect the outliers.
・A DataFrame can be passed as data as is.
・An array such as __array([1, 1, -1, 1, ...])__ is returned; the elements that are "-1" are regarded as outliers.
・Accordingly, __data[variable storing the return value == -1]__ extracts the rows regarded as outliers.

・Code![Screenshot 2020-10-29 15.21.47.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/4423b970-0dfd-ecd5-da49-847dffe1f166.png)

・Result (output of outliers)![Screenshot 2020-10-29 15.22.04.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/db1e7d72-ee73-cb5a-c9f0-d315fbbeaf1f.png)
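・A minimal sketch with made-up two-dimensional data:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

# Made-up 2D data with two obvious outliers appended
rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(100, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5]])
data = pd.DataFrame(np.vstack([normal, outliers]), columns=["x", "y"])

# LOF considering k=20 neighboring points
clf = LocalOutlierFactor(n_neighbors=20)
predictions = clf.fit_predict(data)  # array([1, 1, ..., -1]); -1 = outlier

# Extract the rows regarded as outliers
print(data[predictions == -1])
```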

Outlier detection [Isolation Forest]

・There is another outlier detection method, different from LOF, called __"Isolation Forest"__.
・Isolation Forest detects outliers by __repeatedly splitting the data at random and using the depth at which each point becomes isolated__: outliers are isolated after only a few splits.
・Since Isolation Forest does not depend on distance or density, the computation is not complex and it is __memory-efficient__. It also has the advantage that the computation scales easily to large datasets.

・Similar to LOF, Isolation Forest can predict outliers using the __IsolationForest()__ function.

・Create the classification model with __IsolationForest()__, train it with the __fit()__ method, and predict with predict(). As with LOF, a "-1" in the return value indicates an outlier, so extract them with __data[predictions == -1]__.

・Code: Screenshot 2020-10-29 15.23.45.png

・Result![Screenshot 2020-10-29 15.22.54.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/24e06303-4aff-5fa1-df0a-dc90bc387055.png)
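・A minimal sketch, reusing the made-up data from the LOF example:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Made-up 2D data with two obvious outliers appended
rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(100, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5]])
data = pd.DataFrame(np.vstack([normal, outliers]), columns=["x", "y"])

# Build the model, train it with fit(), then predict()
clf = IsolationForest(random_state=0)
clf.fit(data)
predictions = clf.predict(data)  # -1 = outlier, 1 = inlier

print(data[predictions == -1])
```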

About imbalanced data

Problems caused by imbalanced data

・Data in which __a specific value is extremely frequent or infrequent__ is called __"imbalanced data"__.
・For example, suppose that, as in binary classification, the data takes the values "0" or "1", and 999 of 1000 records are "1" while one is "0". If you always predict __"1" you will be correct 99.9% of the time, but "0" can hardly ever be predicted__.
・Therefore, imbalanced data also __needs to be adjusted appropriately at the data preprocessing stage__.

・There are the following three adjustment methods (sketches of each follow in the sections below).
・__Oversampling__: Increase the infrequent data to balance the dataset.
・__Undersampling__: Reduce the frequent data to balance the dataset.
・__SMOTE-ENN__: Increase the minority data (SMOTE) and reduce the majority data (ENN) to balance the dataset.

・When adjusting imbalanced data, the first thing to do is confirm whether the data is actually imbalanced, that is, to __visualize the count (frequency) of each value__.
・The counts can be displayed as follows. __Data['column'].value_counts()__

・Code (data on whether a person bought a car)![Screenshot 2020-10-29 15.25.29.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/b12aa6ff-a80c-949c-cdb3-85cc22a6289a.png)
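・A minimal sketch with a made-up "bought a car" column:

```python
import pandas as pd

# Made-up binary data: bought a car (1) or not (0)
data = pd.DataFrame({"bought_car": [1] * 95 + [0] * 5})

# Frequency of each value; a large gap indicates imbalanced data
print(data["bought_car"].value_counts())
# 1    95
# 0     5
```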

Oversampling

・Oversampling is a method of increasing the infrequent data to balance it against the more frequent data.
・There are several ways to increase the data; the simplest is to __randomly duplicate existing data__.
・For this padding, use imbalanced-learn's __RandomOverSampler(ratio='minority')__, which is used when adjusting imbalanced data.
・After defining the RandomOverSampler, actually execute it with __.fit_sample(X, y)__.
・For this, prepare the explanatory variables X and the objective variable y, plus two variables to store the resampled results.

・Code![Screenshot 2020-10-29 16.15.15.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/af63bf2f-e5e1-fd9e-ec9f-077bdc5c2fa2.png)
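・A minimal sketch with made-up data. Note that in recent versions of imbalanced-learn the argument is named sampling_strategy rather than ratio, and fit_sample has been renamed fit_resample:

```python
import pandas as pd
from imblearn.over_sampling import RandomOverSampler

# Made-up imbalanced data: X = explanatory variables, y = objective variable
X = pd.DataFrame({"age": [20, 35, 41, 50, 28, 33, 62, 45],
                  "income": [3, 5, 6, 8, 4, 5, 9, 7]})
y = pd.Series([1, 1, 1, 1, 1, 1, 0, 0], name="bought_car")

# Randomly duplicate minority-class rows until the classes balance
ros = RandomOverSampler(sampling_strategy="minority")
X_resampled, y_resampled = ros.fit_resample(X, y)

print(pd.Series(y_resampled).value_counts())
```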

Undersampling

・__Undersampling__ is a method of reducing the frequent data to balance the dataset.
・You can reduce the data with __RandomUnderSampler()__. As for the ratio specified in the argument, it is set to 'majority' this time because we are dealing with the most frequent class.

・Code![Screenshot 2020-10-29 16.13.02.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/79204ef5-3ef9-8163-8aac-390d1890a9fa.png)
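・A minimal sketch, using the same made-up data and the same recent-API caveat as the oversampling example:

```python
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

X = pd.DataFrame({"age": [20, 35, 41, 50, 28, 33, 62, 45],
                  "income": [3, 5, 6, 8, 4, 5, 9, 7]})
y = pd.Series([1, 1, 1, 1, 1, 1, 0, 0], name="bought_car")

# Randomly drop majority-class rows until the classes balance
rus = RandomUnderSampler(sampling_strategy="majority")
X_resampled, y_resampled = rus.fit_resample(X, y)

print(pd.Series(y_resampled).value_counts())
```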

SMOTE-ENN

・SMOTE is used for oversampling (padding out the data) and ENN for undersampling (deleting data).
・SMOTE uses the __nearest neighbor method (kNN)__ to infer the data points to add, and ENN likewise infers the data points to delete using kNN.
・To use SMOTE and ENN together, use __SMOTEENN(smote=, enn=)__. It is basically the same as RandomOverSampler and RandomUnderSampler, but the numbers of neighbor points used by kNN (k_neighbors, n_neighbors) must be specified in the arguments.

・Code![Screenshot 2020-10-29 16.15.41.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/34354d74-0869-5e88-d0cc-520309db02e7.png)
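・A minimal sketch with made-up data (95 majority vs. 15 minority samples):

```python
import numpy as np
import pandas as pd
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

# Made-up imbalanced 2D data
rng = np.random.default_rng(0)
X = pd.DataFrame(
    np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(3, 1, (15, 2))]),
    columns=["f1", "f2"],
)
y = pd.Series([0] * 95 + [1] * 15, name="label")

# SMOTE synthesizes minority samples from its k_neighbors nearest neighbors;
# ENN deletes samples misclassified by an n_neighbors vote
sme = SMOTEENN(
    smote=SMOTE(k_neighbors=5),
    enn=EditedNearestNeighbours(n_neighbors=3),
)
X_resampled, y_resampled = sme.fit_resample(X, y)

print(pd.Series(y_resampled).value_counts())
```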

Summary

・A __missing value__ is empty data represented by "NaN".
・__Outliers__ are data points significantly distant from the other data.
・__Imbalanced data__ is data in which a specific value is extremely frequent or infrequent.
・These all interfere with machine learning and must be handled by preprocessing.

・Missing values are filled in by substituting values for NaN with __"imputation methods"__.
・Outliers are obtained with __"LOF" and "Isolation Forest"__ and excluded.
・Imbalanced data is balanced by __"oversampling", "undersampling"__, or both.

That's all for this time. Thank you for reading to the end.
