[PYTHON] Data cleaning: How to handle missing values and outliers

This article covers how to handle outliers and missing values during data cleaning. The code is assumed to run in a Jupyter notebook.

Data preparation

First, prepare the data. You can easily create sample data with scikit-learn's make_classification function.

Reference: http://overlap.hatenablog.jp/entry/2015/10/08/022246
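
The data file itself is not included in this article, so the following is only a sketch of how a comparable 2d_data.csv might be produced. The sample count, the random seed, and the positions and sizes of the injected NaN values and outliers are all assumptions chosen for illustration; they are not the values behind the data shown below.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

# Generate 20 samples with 2 informative features and 2 classes.
# All parameter values here are illustrative assumptions.
X_raw, y_raw = make_classification(
    n_samples=20,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=0,
)

# Put the label in column 0 and the two features in columns 1 and 2.
dirty = np.column_stack([y_raw, X_raw]).astype(float)

# Inject a few outliers and missing values at arbitrary positions.
dirty[4, 1:] *= 100      # push one row far away from the rest
dirty[16, 2] = 300.0     # a single extreme value
dirty[7, 2] = np.nan     # missing value in the second feature
dirty[11, 1] = np.nan    # missing value in the first feature

# Save without header or index so it can be read with header=None.
pd.DataFrame(dirty).to_csv('2d_data.csv', header=False, index=False)
```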

Now let's read the data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

data = pd.read_csv('2d_data.csv', header = None)

The data prepared this time looks like this:

data = data.to_numpy()  # as_matrix() was removed in pandas 1.0
data

#output

array([[  1.00000000e+00,  -7.42847611e+00,   1.50990301e+00],
       [  0.00000000e+00,   2.98069292e+00,   1.96082119e+00],
       [  0.00000000e+00,   3.98825476e+00,   4.63638899e+00],
       [  1.00000000e+00,  -5.78943741e+00,  -4.62161424e+00],
       [  1.00000000e+00,  -4.89444674e+02,  -3.26972997e+02],
       [  1.00000000e+00,  -1.93394930e+00,  -4.72763616e-02],
       [  0.00000000e+00,  -1.61177146e+00,   5.93220121e+00],
       [  1.00000000e+00,  -6.67015188e+00,              nan],
       [  1.00000000e+00,  -2.93141529e+00,  -1.04474622e-01],
       [  0.00000000e+00,  -7.47618437e-02,   1.07000182e+00],
       [  1.00000000e+00,  -2.69179269e+00,   4.16877367e+00],
       [  0.00000000e+00,              nan,   3.45343849e+00],
       [  0.00000000e+00,  -1.35413500e+00,   3.75165665e+00],
       [  1.00000000e+00,  -6.22947550e+00,  -1.20943430e+00],
       [  0.00000000e+00,   2.77859414e+00,   7.58210258e+00],
       [  1.00000000e+00,  -5.71957792e+00,  -2.43509341e-01],
       [  0.00000000e+00,   9.28321714e-01,   3.20852039e+02],
       [  0.00000000e+00,   8.50475089e+01,   2.90895510e+00],
       [  1.00000000e+00,  -6.02948927e+00,  -1.83119942e+00],
       [  0.00000000e+00,   1.11602534e+00,   3.35360162e+00]])

You can see outliers and missing values (NaN). Let's split the data into X (the features) and y (the labels).

X = data[:,1:3]
y = data[:,0].astype(int)
X.shape, y.shape

#output
((20, 2), (20,))

Now let's plot the first and second columns of the X data.

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='Blues');

https://diveintocode.gyazo.com/5b4fd10d1edcaea456b6bf34fe54589a

You can clearly see the outliers by plotting.
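
A plot is not the only way to spot outliers. As a rough numeric check, you can compare the minimum and maximum of each column with its median while ignoring NaN. The sketch below uses a few rows copied from the data above; the name X_demo is hypothetical.

```python
import numpy as np

# A few rows copied from the X data above (X_demo is an illustrative name).
X_demo = np.array([
    [-7.42847611, 1.50990301],
    [-489.444674, -326.972997],
    [-6.67015188, np.nan],
    [np.nan, 3.45343849],
    [0.928321714, 320.852039],
    [85.0475089, 2.90895510],
])

# The nan* variants skip missing values instead of returning NaN.
col_min = np.nanmin(X_demo, axis=0)
col_max = np.nanmax(X_demo, axis=0)
col_median = np.nanmedian(X_demo, axis=0)

print("min:   ", col_min)
print("median:", col_median)
print("max:   ", col_max)
```

A minimum or maximum that is orders of magnitude away from the median suggests that the column contains outliers.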

How to fill in missing values

You can check for NaN with np.isnan(). It returns True wherever a value is missing.

np.isnan(X[:, 0]),np.isnan(X[:, 1])

#output

(array([False, False, False, False, False, False, False, False, False,
        False, False,  True, False, False, False, False, False, False,
        False, False], dtype=bool),
 array([False, False, False, False, False, False, False,  True, False,
        False, False, False, False, False, False, False, False, False,
        False, False], dtype=bool))
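
With a larger array, scanning the boolean output by eye is impractical. A minimal sketch of summarizing the mask instead, using a small array built from a few rows of the data above (the name X_demo is hypothetical):

```python
import numpy as np

# A few rows copied from the X data above (X_demo is an illustrative name).
X_demo = np.array([
    [-7.42847611, 1.50990301],
    [-6.67015188, np.nan],
    [np.nan, 3.45343849],
    [2.98069292, 1.96082119],
])

nan_mask = np.isnan(X_demo)

# Number of missing values in each column.
print(nan_mask.sum(axis=0))                 # [1 1]

# Indices of the rows that contain at least one missing value.
print(np.where(nan_mask.any(axis=1))[0])    # [1 2]
```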

Now, let's make X1 and y1 by excluding the rows with missing values.

# Keep only the rows where neither column is NaN
not_nan = ~np.isnan(X[:, 0]) & ~np.isnan(X[:, 1])
X1 = X[not_nan]
y1 = y[not_nan]

X1, y1
#output

(array([[ -7.42847611e+00,   1.50990301e+00],
        [  2.98069292e+00,   1.96082119e+00],
        [  3.98825476e+00,   4.63638899e+00],
        [ -5.78943741e+00,  -4.62161424e+00],
        [ -4.89444674e+02,  -3.26972997e+02],
        [ -1.93394930e+00,  -4.72763616e-02],
        [ -1.61177146e+00,   5.93220121e+00],
        [ -2.93141529e+00,  -1.04474622e-01],
        [ -7.47618437e-02,   1.07000182e+00],
        [ -2.69179269e+00,   4.16877367e+00],
        [ -1.35413500e+00,   3.75165665e+00],
        [ -6.22947550e+00,  -1.20943430e+00],
        [  2.77859414e+00,   7.58210258e+00],
        [ -5.71957792e+00,  -2.43509341e-01],
        [  9.28321714e-01,   3.20852039e+02],
        [  8.50475089e+01,   2.90895510e+00],
        [ -6.02948927e+00,  -1.83119942e+00],
        [  1.11602534e+00,   3.35360162e+00]]),
 array([1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0]))
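
Dropping rows is only one option. If you would rather fill in the missing values, one common approach is to replace each NaN with a statistic of its column. The sketch below uses the column median, computed with np.nanmedian, because the median is less affected by outliers than the mean. That choice, and the names X_demo and X_filled, are assumptions for illustration.

```python
import numpy as np

# A few rows copied from the X data above (X_demo is an illustrative name).
X_demo = np.array([
    [-7.42847611, 1.50990301],
    [-6.67015188, np.nan],
    [np.nan, 3.45343849],
    [2.98069292, 1.96082119],
    [-489.444674, -326.972997],
])

# Median of each column, ignoring NaN.
col_median = np.nanmedian(X_demo, axis=0)

# Replace each NaN with the median of its column; other values are unchanged.
X_filled = np.where(np.isnan(X_demo), col_median, X_demo)

print(X_filled)
```

Filling keeps all rows and their labels, which matters when the dataset is as small as this one.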

How to remove outliers

Here, any value whose absolute value is 10 or more is treated as an outlier. First, check which values fall below that threshold.

np.abs(X1[:, 0]) < 10, np.abs(X1[:, 1]) < 10

This returns one boolean mask per column: True where the absolute value is less than 10, and False where it is 10 or more.

(array([ True,  True,  True,  True, False,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True, False,  True,  True], dtype=bool),
 array([ True,  True,  True,  True, False,  True,  True,  True,  True,
         True,  True,  True,  True,  True, False,  True,  True,  True], dtype=bool))

Now let's remove the outliers.

# Take the absolute value first, then compare with the threshold
X2 = X1[(np.abs(X1[:, 0]) < 10) & (np.abs(X1[:, 1]) < 10)]
y2 = y1[(np.abs(X1[:, 0]) < 10) & (np.abs(X1[:, 1]) < 10)]

Let's plot it.

plt.scatter(X2[:, 0], X2[:, 1], c=y2, s=50, cmap='Blues');

https://diveintocode.gyazo.com/eeeb8ea787239a0e11ff77ea7e15f989

The plot confirms that the outliers are gone!
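
The threshold of 10 was chosen by looking at the plot. As a sketch of a more data-driven alternative (not the method used above), the interquartile-range (IQR) rule flags values that lie more than 1.5 times the IQR below the first quartile or above the third quartile. The multiplier 1.5 is the conventional default, and the names iqr_mask, X_demo, and X_clean are hypothetical.

```python
import numpy as np

def iqr_mask(values, k=1.5):
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)

# A few rows copied from X1 above (X_demo is an illustrative name).
X_demo = np.array([
    [-7.42847611, 1.50990301],
    [2.98069292, 1.96082119],
    [3.98825476, 4.63638899],
    [-5.78943741, -4.62161424],
    [-489.444674, -326.972997],
    [-1.93394930, -0.0472763616],
    [0.928321714, 320.852039],
    [85.0475089, 2.90895510],
])

# Keep a row only if both of its columns pass the IQR check.
keep = iqr_mask(X_demo[:, 0]) & iqr_mask(X_demo[:, 1])
X_clean = X_demo[keep]

print(X_clean.shape)    # (5, 2)
```

Because the quartiles are estimated from the data itself, this rule can behave poorly when a large share of the points are outliers, so it is worth checking the result with a plot as above.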
