Summary of pre-processing practices for Python beginners (Pandas dataframe)

I'm new to Python. There are plenty of articles that explain individual Pandas data frame operations, but I could not find one that explains the key points and purposes of preprocessing, so I wrote this as a learning memo.

Assumed reader

-- Python beginners (myself included)
-- Those who have just started using Pandas

What you can do after reading this article

-- You will understand both the purpose of preprocessing and the concrete steps to take first when you load data into a data frame with the pandas library.
-- In particular, you will be able to carry out the first processing steps after reading a CSV file.

Premise

-- The code in this article assumes that the two lines of code shown below have been run first. Replace df with your own data frame as appropriate.
-- The data is modeled on the Titanic passenger data often used in introductory statistics material, but the values shown are fictional and were made up for this article.
-- This article does not cover how to create or read a data frame, or how to edit rows and columns. I plan to publish that at a later date.

import pandas as pd
df = pd.read_csv("hogehoge/test.csv", usecols = ['PassengerId','Sex','Age'], header = 1)

Main article I | Overview of data

1. Visual confirmation

-- Visually check the contents of the data using the head and tail methods.
-- Check the column and row names using the columns and index attributes.
-- Purpose: Confirm that the right file was read and that the data was loaded as expected.

# List the first / last two rows. The argument is the number of rows to show (5 if omitted).
print(df.head(2))
print(df.tail(2))
print("Column name:", df.columns)
print("Row name (index):", df.index)

"""
Displayed as ↓:
# head
   PassengerId     Sex   Age
0            1  female  23.0
1            2    male  48.0

# tail
     PassengerId     Sex   Age
998          999  female  41.0
999         1000    male  15.0

Column name: Index(['PassengerId', 'Sex', 'Age'], dtype='object')

Row name (index): RangeIndex(start=0, stop=1000, step=1)

"""

-- From this result, for example, the following can be confirmed:
  -- Sex is stored as a string.
  -- The row names are returned as a RangeIndex, so the rows only have a serial-number index (no specific names), and there are 1000 rows.
  -- RangeIndex(start=0, stop=1000, step=1) means "integers starting at 0, in steps of 1, up to but not including 1000", so the rows are indexed from 0 to 999, which makes 1000 rows.
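The row count read off the RangeIndex can also be checked directly. The following is a minimal sketch, assuming a tiny made-up frame in place of the article's df (all values here are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the article's df (made-up values).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Sex": ["female", "male", "male", "female"],
    "Age": [23.0, 48.0, None, 15.0],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# The same row count, taken from the index.
print(len(df.index))

# Column names as a plain Python list.
print(list(df.columns))
```

Comparing `df.shape` with the number of rows you expect from the source file is a quick way to notice a truncated or wrongly parsed read.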

2. Data type confirmation

-- Use the dtypes attribute.
  -- An attribute is written as `.hoge` after the data frame, like a method but without parentheses.
-- Purpose: Depending on the library used, calculations on mixed data types may cause errors, so identify the types here in order to unify them later (described below).

print(df.dtypes)

"""
It will be displayed as below
PassengerId      int64
Sex             object
Age            float64
"""

-- From this result, you could raise the following issues, for example:
  -- 1) Sex is stored as a string such as male or female. Wouldn't it be better to assign a dummy value such as 0/1 so that it can be used in calculations?
  -- 2) Age is float (floating point) while PassengerId is int (integer). If both are used in calculations, it may be better to unify them to one type.
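When a frame has many columns, reading the dtypes output by eye gets tedious. As a rough sketch (again using a hypothetical made-up frame), `select_dtypes` can list which columns are numeric and which are not:

```python
import pandas as pd

# Hypothetical stand-in for the article's df (made-up values).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Sex": ["female", "male", "male"],
    "Age": [23.0, 48.0, None],
})

# Columns that already hold numbers (int or float).
numeric = df.select_dtypes(include="number").columns.tolist()

# Columns that need conversion before they can be used in calculations.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric)
print("non-numeric:", non_numeric)
```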

3. Confirmation and replacement of missing values (NaN)

-- Use the isnull method combined with the any method.
  -- Combining them detects "columns that contain at least one NaN".
-- Purpose: Missing values can distort calculation results, so find them here and deal with them later (described below).

print(df.isnull().any())

"""
The result will be displayed as below
PassengerId    False
Sex            False
Age             True
dtype: bool

"""

-- The takeaway here is that "NaN exists in the Age column, so it needs to be dealt with."
-- How to deal with it (delete the rows that contain NaN, replace NaN with 0, delete the Age column itself, and so on) depends on the case.
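The three options mentioned above can be sketched as follows. This is only an illustration on a hypothetical made-up frame; which option is appropriate depends on your data. Counting NaN with `isnull().sum()` first shows how much data each option would affect.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the article's df (made-up values).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Sex": ["female", "male", "male", "female"],
    "Age": [23.0, np.nan, 41.0, np.nan],
})

# How many NaN values does each column contain?
nan_counts = df.isnull().sum()
print(nan_counts)

# Option 1: delete the rows where Age is NaN.
dropped_rows = df.dropna(subset=["Age"])

# Option 2: replace NaN in Age with 0.
filled_zero = df.fillna({"Age": 0})

# Option 3: delete the Age column itself.
dropped_col = df.drop(columns=["Age"])
```

Each call returns a new data frame and leaves df unchanged, so the results can be compared before committing to one approach.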

4. Confirmation of basic statistics

-- Check the basic statistics using the describe method.
  -- It reports the count (number of non-missing values), arithmetic mean, standard deviation, minimum, maximum, and quartiles of each numeric column.
-- Purpose: Get an overview of the data to be analyzed and check for outliers.

print(df.describe())
"""
       PassengerId         Age
count  1000.000000  884.000000
mean    446.000000   29.699118
std     257.353842   14.526497
min       1.000000    3.100000
25%     215.500000   20.125000
50%     430.000000   27.000000
75%     703.500000   39.000000
max    1000.000000   80.000000
"""

-- Suggestions obtained:
  -- The min of Age is 3.1, but head / tail suggested that ages are recorded as whole numbers (even though the type is floating point). Could this 3.1 be a data-entry mistake for 31? This needs to be confirmed.
-- Be careful how you read the statistics:
  -- The statistics for PassengerId (passenger number) are meaningless.
  -- The Sex column is object type, so it is automatically excluded.
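Both points can be followed up in code. The sketch below, on a hypothetical made-up frame, pulls out rows whose Age is not a whole number (candidates for an entry mistake such as 3.1) and summarizes the Sex column, which describe skips by default, with `value_counts`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the article's df (made-up values).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Sex": ["female", "male", "male", "female"],
    "Age": [23.0, 3.1, 41.0, np.nan],
})

# Rows where Age is present but not a whole number.
# notnull() is needed because NaN % 1 is NaN, which also compares as "!= 0".
suspicious = df[df["Age"].notnull() & (df["Age"] % 1 != 0)]
print(suspicious)

# Frequency of each value in the non-numeric Sex column.
sex_counts = df["Sex"].value_counts()
print(sex_counts)
```

Whether a fractional age is really an error is something only the data source can answer; the code just narrows down which rows to ask about.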

Main article II | Perform basic processing

1. Handle missing values

-- In this case, decide for example: "Set NaN in Age to 0. When calculating the average age later, analyze only the values other than 0", and convert NaN to 0.
-- With loc, select "the Age column of the rows where the value of Age is NaN" and assign 0.

#Convert the column where NaN was found in Main article I.
df.loc[df['Age'].isnull(), 'Age'] = 0

#Check that the processing worked
print(df.isnull().any())

"""
It will be displayed as follows. Compare with the output in section 3 of Main article I.
PassengerId    False
Sex            False
Age            False
dtype: bool
"""
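Why the zeros must be excluded later is easiest to see with numbers. A minimal sketch with a hypothetical four-value Age series:

```python
import numpy as np
import pandas as pd

# Hypothetical ages: two known values and two missing values.
age = pd.Series([20.0, 40.0, np.nan, np.nan], name="Age")

# mean() skips NaN by default, so only 20 and 40 are averaged.
mean_skipping_nan = age.mean()

# After replacing NaN with 0, the zeros pull the average down.
filled = age.fillna(0)
mean_with_zeros = filled.mean()

# Excluding the zeros again recovers the original average.
mean_excluding_zeros = filled[filled != 0].mean()

print(mean_skipping_nan, mean_with_zeros, mean_excluding_zeros)
# → 30.0 15.0 30.0
```

Note that this rule only works if 0 can never be a genuine value in the column, which is an assumption to check for your own data.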

2. Unify data types

-- Based on Main article I, unify the data types.
  -- Convert the data type column by column using the astype method.
  -- In this case, (1) change PassengerId to float64, and (2) assign 0/1 to Sex as a dummy variable (and also make it float64).

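#PassengerId type change
df['PassengerId'] = df['PassengerId'].astype('float64')

#Sex dummy value assignment (0 for male and 1 for female) & float64
#map avoids chained assignment such as df.Sex[df.Sex=='male'] = 0, which may not update df.
#Note: any value other than 'male' / 'female' becomes NaN.
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1}).astype('float64')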

#Check if the process was done correctly
print(df.dtypes)

"""
It should look like this:
PassengerId    float64
Sex            float64
Age            float64

"""
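When a column has more than two categories, a single 0/1 column is no longer enough. One alternative, shown here only as a sketch on a hypothetical series and not as the method used in this article, is `pd.get_dummies`, which creates one 0/1 column per category:

```python
import pandas as pd

# Hypothetical stand-in for the Sex column (made-up values).
sex = pd.Series(["female", "male", "male", "female"], name="Sex")

# One column per category, stored as float64 to match the other columns.
dummies = pd.get_dummies(sex, dtype="float64")
print(dummies)
```

For a two-category column such as Sex, the two resulting columns carry the same information, so keeping just one of them is equivalent to the 0/1 assignment above.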

In conclusion

-- I have summarized the basic pre-processing flow and procedure. Whatever data you analyze, you will almost certainly need this kind of pre-processing. I would appreciate your feedback. I am a beginner too, so I will keep studying.
-- 3/27 postscript: I actually tried this pre-processing procedure on the Titanic data in a separate article. Please have a look if you like!

