Understanding the state of missing data: Python vs. R

Introduction

When you receive data for analysis, the first task is to get an overview of its contents: load it into a table and check what the features are and what type each one has. At the same time, you should investigate missing values. Because the presence of missing data affects how the data can be handled, first check whether any values are missing, then check how many. In this article, we look at how to do this in Python and R.

(The programming environment is Jupyter Notebook with Python 3.5.2, and Jupyter Notebook with IRkernel (R 3.2.3).)

Checking missing-data status in Python

We use the "Titanic" dataset provided by Kaggle. Many readers will have seen it already: the task is to classify passengers as "survived" / "did not survive" based on their attributes. As we will see below, it contains missing values.

First, load the data into a pandas DataFrame in Python.

import pandas as pd

def load_data(fn1='./data/train.csv', fn2='./data/test.csv'):
    train = pd.read_csv(fn1)
    test = pd.read_csv(fn2)

    return train, test

train, test = load_data()
train.head()

(Figure: preview of train.csv, titanic_fig1.png)

As the figure above shows, 'NaN' already appears in the Cabin column within the first five rows of the data. Let's look at test.csv as well.

(Figure: preview of test.csv, titanic_fig2.png)

Similarly, 'NaN' values line up in the Cabin column. Incidentally, the datasets are not large: train has shape (891, 12) and test has shape (418, 11).

Next, let's check which features (columns) contain missing values.

# check if NA exists in each column
train.isnull().any(axis=0)

# output
'''
PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool
'''

isnull() tests, element by element, whether each value is NA, and any() aggregates multiple values into a single boolean. Since any() takes an axis option, set axis=0 to get one result per "column" (i.e., scanning down the "rows"); it can be omitted because it is the default. Let's look at test in the same way.
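The effect of the axis option can be seen on a tiny hand-made DataFrame (not the Titanic data; this is just an illustrative sketch):

```python
import pandas as pd
import numpy as np

# one NaN in column 'a', row 1
df = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})

print(df.isnull().any(axis=0))  # per column: a -> True, b -> False
print(df.isnull().any(axis=1))  # per row: row 0 -> False, row 1 -> True
```

axis=0 collapses each column down to one boolean, which is what we want when asking "does this feature contain any NA?".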

# check if NA exists in each column
test.isnull().any()

# output
'''
PassengerId    False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare            True
Cabin           True
Embarked       False
dtype: bool
'''

As shown above, the train data has missing values in ['Age', 'Cabin', 'Embarked'], and the test data in ['Age', 'Fare', 'Cabin']. Based on this, you might decide, for example, "as a prototype of the target classifier, let's build a model that does not use the features with missing values ('Age', 'Fare', 'Cabin', 'Embarked')", while noting that "'Age' (age) does seem likely to affect the classification (survived or not), though".
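That prototype policy of dropping the missing-value features could be sketched as follows (stand-in data with hypothetical values; only the column names come from the dataset):

```python
import pandas as pd
import numpy as np

# stand-in rows mimicking the Titanic columns discussed above
train = pd.DataFrame({
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Age": [22.0, np.nan],
    "Fare": [7.25, np.nan],
    "Cabin": [None, "C85"],
    "Embarked": ["S", None],
})

# drop every feature that may contain missing values
na_cols = ["Age", "Fare", "Cabin", "Embarked"]
prototype = train.drop(na_cols, axis=1)
print(prototype.columns.tolist())  # ['Survived', 'Pclass']
```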

Next, count the missing values. First, the train data.

# count NA samples
train.isnull().sum()

# output
'''
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
'''

Similarly, test data.

# count NA samples
test.isnull().sum()

# output
'''
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
'''

From these results, we can see the following.

- 'Age' has missing values at a certain rate in both train and test.
- 'Cabin' has many missing values in both train and test.
- 'Embarked' has only 2 missing values, in train only.
- 'Fare' has only 1 missing value, in test only.

From here you can set a policy for handling the data, e.g. "for now, let's drop 'Cabin' from the features and build a model" or "'Embarked' has only two missing rows, so just dropna() them".

Checking missing-data status in R

Now let's do the same thing in R. First, read the files into data frames.

# Read without converting strings to factors
train <- read.csv("./data/train.csv", header=T, stringsAsFactors=F)
test <- read.csv("./data/test.csv", header=T, stringsAsFactors=F)

header is an option that specifies how the header line is handled, and stringsAsFactors specifies whether character strings are converted to factor type. Reading the files as above gives the following train data frame.

(Figure: preview of train.csv in R, titanic_fig3.png)

The Age of 'Moran, Mr. James' (PassengerId = 6) is NA. Next, check each column for missing values.

# any() usage:
is_na_train <- sapply(train, function(y) any(is.na(y)))
is_na_test <- sapply(test, function(y) any(is.na(y)))

Here any() is used in the same way as in Python. Next, count the missing values.

# count na
na_count_train <- sapply(train, function(y) sum(is.na(y)))
na_count_train

# output
# PassengerId   0
# Survived      0
# Pclass        0
# Name          0
# Sex           0
# Age         177
# SibSp         0
# Parch         0
# Ticket        0
# Fare          0
# Cabin         0
# Embarked      0

Notice anything? The result differs from the one obtained with Python above. Look at the test data as well.

# count na
na_count_test <- sapply(test, function(y) sum(is.na(y)))

# output
# PassengerId   0
# Pclass        0
# Name          0
# Sex           0
# Age          86
# SibSp         0
# Parch         0
# Ticket        0
# Fare          1
# Cabin         0
# Embarked      0

Here too the counts are much lower (especially for 'Cabin') than the NA counts obtained in Python. Why?

In fact, the reason for this difference (Python vs. R) is that blanks ("") are treated differently.

(Figure: train.csv, titanic_fig4.png. In Python, NaN already appeared in the red-framed cells at read time.)

pandas' read_csv() converts blank fields ("") to NaN as it reads the file, so isnull() counts them as missing, whereas R's read.csv() keeps "" as an ordinary string, and is.na() does not treat it as NA. This is why the NA counts come out lower in R.
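This read-time behavior can be reproduced without the Titanic files. By default, pandas turns an empty CSV field into NaN on load (inline data, just for illustration):

```python
import io
import pandas as pd

# the second row's Cabin field is blank
csv_text = "PassengerId,Cabin\n1,C85\n2,\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df["Cabin"].isnull().sum())  # 1 -- the blank became NaN on read
```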

In the Titanic data, 'Cabin' holds cabin IDs, so a blank presumably means there is simply no record (though this is speculation). For blanks like these, the analysis flow will most likely branch away from the processing applied to actual 'Cabin' values, so it is preferable for the program to count blanks as NA. Let's therefore change the R script so that it treats blanks ("") as NA, as the Python code does.

# Reread data with na.string option
train <- read.csv("./data/train.csv", header=T, stringsAsFactors=F, 
    na.strings=(c("NA", "")))
test <- read.csv("./data/test.csv", header=T, stringsAsFactors=F,
    na.strings=(c("NA", "")))

By specifying the na.strings option of read.csv() as na.strings=c("NA", ""), blanks ("") are converted to NA. With that, the NAs are counted as follows.

# Counting NA
na_count_train <- sapply(train, function(y) sum(is.na(y)))
na_count_test <- sapply(test, function(y) sum(is.na(y)))

Output result:

# --- Train dataset ---
# PassengerId   0
# Survived      0
# Pclass        0
# Name          0
# Sex           0
# Age         177
# SibSp         0
# Parch         0
# Ticket        0
# Fare          0
# Cabin       687
# Embarked      2

# --- Test dataset ---
# PassengerId   0
# Pclass        0
# Name          0
# Sex           0
# Age          86
# SibSp         0
# Parch         0
# Ticket        0
# Fare          1
# Cabin       327
# Embarked      0

This matches the Python results. So we have seen that the definition of NA differs between Python (pandas) and R. By comparison, R draws stricter distinctions between null / NaN / NA. In pandas, blanks (at read time) / NA / NaN are all reported as missing by isnull(), but in practice this seems to cause no problems. (Put uncharitably, pandas' treatment is "ambiguous".)
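pandas' lumping-together can be seen directly: None and float NaN are both reported as null by isnull(), while an in-memory empty string is not (it is only blank *fields in a file* that become NaN at read time):

```python
import pandas as pd
import numpy as np

s = pd.Series([np.nan, None, ""])
print(s.isnull().tolist())  # [True, True, False]
```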

The following is quoted from the Python pandas documentation (http://pandas.pydata.org/pandas-docs/version/0.18.1/missing_data.html).

Note: The choice of using NaN internally to denote missing data was largely for simplicity and performance reasons. It differs from the MaskedArray approach of, for example, scikits.timeseries. We are hopeful that NumPy will soon be able to provide a native NA type solution (similar to R) performant enough to be used in pandas.

Blanks are not mentioned there, but pandas states that it is implemented this way largely for simplicity and performance. I myself have never encountered a case where missing data had to be strictly divided into blank / NA / NaN, so I want to remember the handling described in this article: how Python behaves, and how to convert blanks to NA in R.

(For reference, regarding blank-to-NA conversion: I confirmed that fread() in the R package {data.table} performs the same conversion with the same option, na.strings.)

Finally

Kaggle's Titanic is a tutorial-like competition, but a look at the Leaderboard shows scores ranging widely from excellent to mediocre. One key to raising the score is presumably tuning the classifier's parameters; another is the method used to impute the missing values, especially 'Age'. There are still some days before the deadline (2016-12-31), so I would like to take this opportunity to try the Titanic competition again. (The top group has reached an accuracy of 1.0; how do they do it...)
