Python application: Data cleansing # 2: Data cleansing with DataFrame

CSV

Read CSV with Pandas

CSV is a file format that lists values separated by commas. It is commonly used because it is very easy to handle in data analysis.

To read a CSV file with Pandas, use the read_csv() function.

read_csv("file path of the CSV file", header=)

The header argument specifies which row number to use as the column names; specify header=None if the file has no header row.

For example, read a wine dataset that has no column-name information from a directory. Since we cannot tell what each numerical value represents as-is, add column names (columns) that describe the content of the values.

import pandas as pd

df = pd.read_csv("./4050_data_cleansing_data/wine.csv", header=None)

# Add column names so we can tell what each value represents
df.columns = ["", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium","Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins","Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline"]
df

The ./ in the file path indicates the current directory. The current directory is the working directory where Python is currently running.
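
As a side note (an addition, not from the original text), read_csv can also assign the column names at read time via its names parameter, which implies header=None. A minimal sketch using a hypothetical in-memory stand-in for a headerless CSV file:

```python
import io

import pandas as pd

# Hypothetical headerless CSV content, standing in for a file like wine.csv
raw = io.StringIO("1,14.23,1.71\n1,13.20,1.78\n")

# names= assigns the column names while reading, so no header row is assumed
df = pd.read_csv(raw, names=["class", "Alcohol", "Malic acid"])
print(df.columns.tolist())  # ['class', 'Alcohol', 'Malic acid']
```

This avoids the separate df.columns assignment shown above when the names are known up front.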

Creating CSV using CSV library

Create a CSV data file using the csv library that comes standard with Python 3.

Output the city, year, and season of the past 10 Olympic Games as a CSV data file.

import csv

# Use a with statement to open csv0.csv in write mode ("w") as the variable csvfile
with open("./4050_data_cleansing_data/csv0.csv", "w") as csvfile:
    # Pass the variable csvfile and the line-feed code "\n" (lineterminator) to csv.writer
    writer = csv.writer(csvfile, lineterminator="\n")
    # Add one row at a time with writerow(list)
    writer.writerow(["city", "year", "season"])
    writer.writerow(["Nagano", 1998, "winter"])
    writer.writerow(["Sydney", 2000, "summer"])
    writer.writerow(["Salt Lake City", 2002, "winter"])
    writer.writerow(["Athens", 2004, "summer"])
    writer.writerow(["Torino", 2006, "winter"])
    writer.writerow(["Beijing", 2008, "summer"])
    writer.writerow(["Vancouver", 2010, "winter"])
    writer.writerow(["London", 2012, "summer"])
    writer.writerow(["Sochi", 2014, "winter"])
    writer.writerow(["Rio de Janeiro", 2016, "summer"])

# Output: open csv0.csv in read mode ("r") as the variable csvfile and print it
with open("./4050_data_cleansing_data/csv0.csv", "r") as csvfile:
    print(csvfile.read())

When executed, a CSV data file called csv0.csv is created and its contents are displayed. To find out where the CSV data file was created, run the following:

import os
print(os.getcwd())

Creating CSV with Pandas

You can also create CSV data using Pandas, without the csv library. Pandas is the more convenient choice when converting data in Pandas DataFrame format to CSV.

To create a CSV file with Pandas, use the to_csv() method.

df.to_csv("name of the CSV file to create")

As an example, output the same data as in the previous sample, the city, year, and season of the past 10 Olympic Games, from a DataFrame to a CSV data file.

import pandas as pd

data = {"city": ["Nagano", "Sydney", "Salt Lake City", "Athens", "Torino", "Beijing", "Vancouver", "London", "Sochi", "Rio de Janeiro"], 
        "year": [1998, 2000, 2002, 2004, 2006, 2008, 2010, 2012, 2014, 2016],
        "season": ["winter", "summer", "winter", "summer", "winter", "summer", "winter", "summer", "winter", "summer"]}

df = pd.DataFrame(data)

df.to_csv("4050_data_cleansing_data/csv1.csv")
# This creates a file called csv1.csv in the 4050_data_cleansing_data directory.
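
One caveat worth noting (an addition, not from the original): to_csv writes the row index as an extra first column by default, and passing index=False omits it. Calling to_csv without a path returns the CSV text instead of writing a file, which the sketch below uses:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Nagano", "Sydney"],
                   "year": [1998, 2000],
                   "season": ["winter", "summer"]})

# index=False leaves the 0, 1, ... row labels out of the output
csv_text = df.to_csv(index=False)
print(csv_text)
```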

Missing value

Listwise deletion / pairwise deletion

Empty cells in the read data are recognized as the missing value NaN (Not a Number). To remove missing NaN values and improve data quality, use the dropna() method.

First, create a table of random values and intentionally make part of it missing.

import numpy as np
from numpy import nan as NA
import pandas as pd

sample_data_frame = pd.DataFrame(np.random.rand(10, 4))

# Intentionally make some of the data missing
sample_data_frame.iloc[1, 0] = NA
sample_data_frame.iloc[2, 2] = NA
sample_data_frame.iloc[5:, 3] = NA

sample_data_frame

Listwise deletion

Deleting an entire row (or column) that contains missing data (NaN) is called listwise deletion.

Using the dropna() method with no arguments removes every row that contains at least one NaN. If axis=1 is specified as an argument, every column that contains at least one NaN is removed instead.

sample_data_frame.dropna()
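
To illustrate the axis=1 behavior described above, here is a minimal sketch on the same kind of synthetic table:

```python
import numpy as np
import pandas as pd
from numpy import nan as NA

df = pd.DataFrame(np.random.rand(10, 4))
df.iloc[1, 0] = NA   # NaN in column 0
df.iloc[2, 2] = NA   # NaN in column 2
df.iloc[5:, 3] = NA  # NaNs in column 3

# axis=1 drops every column containing at least one NaN;
# only column 1 has no missing values, so one column remains
print(df.dropna(axis=1).shape)  # (10, 1)
```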

Pairwise deletion

If listwise deletion removes so many rows that too little data remains, another option is to use only the data that is available. Keeping the columns with few missing values (for example, columns 0 and 1) and deleting only the rows that contain NaN within those columns is called pairwise deletion.

sample_data_frame[[0, 1]].dropna()

Missing value completion

Besides removing missing values, you can also improve the quality of the data by assigning substitute data to the missing values.

To substitute (fill in) alternative data for the missing value NaN, use the fillna() method.

Again, create a table of random values and intentionally make part of it missing.

import numpy as np
from numpy import nan as NA
import pandas as pd

sample_data_frame = pd.DataFrame(np.random.rand(10, 4))

# Intentionally make some of the data missing
sample_data_frame.iloc[1, 0] = NA
sample_data_frame.iloc[2, 2] = NA
sample_data_frame.iloc[5:, 3] = NA
# fillna() assigns the value given as its argument to every NaN. Here we fill with 0.

sample_data_frame.fillna(0)
# Specifying method="ffill" fills each NaN with the value from the previous row.

sample_data_frame.fillna(method="ffill")

Completion of missing values (mean value substitution method)

Filling in missing values with the mean value of their column (or row) is called the mean value substitution method.

The mean is calculated with the mean() method.

import numpy as np
from numpy import nan as NA
import pandas as pd

sample_data_frame = pd.DataFrame(np.random.rand(10, 4))

# Intentionally make some of the data missing
sample_data_frame.iloc[1, 0] = NA
sample_data_frame.iloc[2, 2] = NA
sample_data_frame.iloc[5:, 3] = NA

#Substitute the mean value of the column for the NaN part using fillna
sample_data_frame.fillna(sample_data_frame.mean())

Data aggregation

Calculation of statistics for each key

Statistics can be divided into representative values and measures of dispersion. A representative value is a value that expresses the basic characteristics of the data: for example, the mean, the maximum, and the minimum.

import pandas as pd

df = pd.read_csv("./4050_data_cleansing_data/wine.csv", header=None)
df.columns=["", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium","Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins","Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline"]

# Calculate the mean value of the "Alcohol" column of the DataFrame df
df["Alcohol"].mean()
#Output result
13.000617977528091
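
To compute a statistic per key rather than over the whole column, mean() can be combined with groupby(). A minimal sketch on hypothetical data (in the UCI wine dataset, the unnamed first column is typically the class label, which could serve as such a key):

```python
import pandas as pd

# Hypothetical mini-dataset with a "class" key column
df = pd.DataFrame({"class": [1, 1, 2, 2],
                   "Alcohol": [14.0, 13.0, 12.0, 13.0]})

# The mean of "Alcohol" is computed separately for each class
print(df.groupby("class")["Alcohol"].mean())
```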

Duplicate data

If the data contains duplicates, delete them to improve data quality. Prepare a DataFrame that actually contains duplicate data, and try extracting and deleting the duplicated rows.

import pandas as pd
from pandas import DataFrame

dupli_data = DataFrame({"col1":[1, 1, 2, 3, 4, 4, 6, 6]
                       ,"col2":["a", "b", "b", "b", "c", "c", "b", "b"]})
dupli_data
The duplicated() method generates Series-type data that is True for duplicated rows, which extracts the duplicate data.

#Extract duplicate data
dupli_data.duplicated()
#Output result
0    False
1    False
2    False
3    False
4    False
5     True
6    False
7     True
dtype: bool

dtype stands for "Data Type" and indicates the data type of the element.

Use the drop_duplicates() method to remove the duplicate data.

dupli_data.drop_duplicates()
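
As an aside (not in the original text), drop_duplicates also accepts subset= to judge duplicates on selected columns only, and keep= to choose which occurrence survives. A sketch on the same data:

```python
import pandas as pd

dupli_data = pd.DataFrame({"col1": [1, 1, 2, 3, 4, 4, 6, 6],
                           "col2": ["a", "b", "b", "b", "c", "c", "b", "b"]})

# Duplicates are judged on col2 only, and the last occurrence of each
# value is kept instead of the first
result = dupli_data.drop_duplicates(subset=["col2"], keep="last")
print(result)
```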

Mapping

Mapping is the process of taking data that share a common key and looking up the value corresponding to that key in another table. Let's actually prepare a DataFrame and try the mapping process.

import pandas as pd
from pandas import DataFrame

attri_data1 = {"ID": ["100", "101", "102", "103", "104", "106", "108", "110", "111", "113"]
        ,"city": ["Tokyo", "Osaka", "Kyoto", "Hokkaido", "Tokyo", "Tokyo", "Osaka", "Kyoto", "Hokkaido", "Tokyo"]
        ,"birth_year" :[1990, 1989, 1992, 1997, 1982, 1991, 1988, 1990, 1995, 1981]
        ,"name" :["Hiroshi", "Akiko", "Yuki", "Satoru", "Steeve", "Mituru", "Aoi", "Tarou", "Suguru", "Mitsuo"]}
attri_data_frame1 = DataFrame(attri_data1)

attri_data_frame1

Create dictionary-type data that associates each city with its area name.

city_map ={"Tokyo":"Kanto"
          ,"Hokkaido":"Hokkaido"
          ,"Osaka":"Kansai"
          ,"Kyoto":"Kansai"}
city_map

Using the city column of the attri_data_frame1 prepared above as the key, look up the corresponding area-name data in city_map and add it as a new column. This is the mapping process. If you are familiar with Excel, think of something like the VLOOKUP function. Perform the mapping with the map() function and add region to attri_data_frame1 as a new column.

attri_data_frame1["region"] = attri_data_frame1["city"].map(city_map)
attri_data_frame1

Looking at the output, you can see that the area names have been added in the region column. Elements whose corresponding data does not exist in city_map are filled with NaN.
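
A minimal sketch of that NaN behavior, using a hypothetical city that is absent from the map:

```python
import pandas as pd

city = pd.Series(["Tokyo", "Nagoya"])   # "Nagoya" has no entry in the map
city_map = {"Tokyo": "Kanto"}

# map() looks each value up in the dictionary; misses become NaN
region = city.map(city_map)
print(region.isna().tolist())  # [False, True]
```

Unmapped entries can then be handled like any other missing values, for example with fillna().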

Bin split

Bin splitting is the process of roughly dividing numerical data into categories: for example, dividing ages into 0-9 years old, 10-19 years old, 20-29 years old, and so on. Prepare a list of bin edges in advance and process the data with the pandas cut() function.

import pandas as pd
from pandas import DataFrame

attri_data1 = {"ID": [100,101,102,103,104,106,108,110,111,113]
        ,"city": ["Tokyo", "Osaka", "Kyoto", "Hokkaido", "Tokyo", "Tokyo", "Osaka", "Kyoto", "Hokkaido", "Tokyo"]
        ,"birth_year" :[1990, 1989, 1992, 1997, 1982, 1991, 1988, 1990, 1995, 1981]
        ,"name" :["Hiroshi", "Akiko", "Yuki", "Satoru", "Steeve", "Mituru", "Aoi", "Tarou", "Suguru", "Mitsuo"]}
attri_data_frame1 = DataFrame(attri_data1)

Specify the granularity of the division in the list and perform the bin division. Here we focus on birth_year.

# Create a list of bin edges
birth_year_bins = [1980, 1985, 1990, 1995, 2000]

# Perform the bin split
birth_year_cut_data = pd.cut(attri_data_frame1.birth_year, birth_year_bins)
birth_year_cut_data
#Output result
0    (1985, 1990]
1    (1985, 1990]
2    (1990, 1995]
3    (1995, 2000]
4    (1980, 1985]
5    (1990, 1995]
6    (1985, 1990]
7    (1985, 1990]
8    (1990, 1995]
9    (1980, 1985]
Name: birth_year, dtype: category
Categories (4, interval[int64]): [(1980, 1985] < (1985, 1990] < (1990, 1995] < (1995, 2000]]

"()" Does not include that value, and "[]" means that it does. For example, (1985, 1990] does not include 1985, but includes 1990.

To count the number of elements in each bin, use the value_counts() method.

pd.value_counts(birth_year_cut_data)
#Output result
(1985, 1990]    4
(1990, 1995]    3
(1980, 1985]    2
(1995, 2000]    1
Name: birth_year, dtype: int64

You can also give each bin a name.

group_names = ["first1980", "second1980", "first1990", "second1990"]
birth_year_cut_data = pd.cut(attri_data_frame1.birth_year, birth_year_bins, labels=group_names)
pd.value_counts(birth_year_cut_data)
#Output result
second1980    4
first1990     3
first1980     2
second1990    1
Name: birth_year, dtype: int64

It is also possible to specify the number of bins in advance: passing an integer as the second argument of the cut() function splits the range of the values into that many equal-width bins.

pd.cut(attri_data_frame1.birth_year, 2)
#Output result
0      (1989.0, 1997.0]
1    (1980.984, 1989.0]
2      (1989.0, 1997.0]
3      (1989.0, 1997.0]
4    (1980.984, 1989.0]
5      (1989.0, 1997.0]
6    (1980.984, 1989.0]
7      (1989.0, 1997.0]
8      (1989.0, 1997.0]
9    (1980.984, 1989.0]
Name: birth_year, dtype: category
Categories (2, interval[float64]): [(1980.984, 1989.0] < (1989.0, 1997.0]]
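
Since an integer argument to cut() produces equal-width bins, the element counts per bin can differ. When each bin should instead hold roughly the same number of elements, pd.qcut splits at quantiles. A minimal sketch on hypothetical values:

```python
import pandas as pd

years = [1981, 1982, 1988, 1989, 1990, 1991, 1992, 1993, 1995, 1997]

# qcut chooses the bin edges from quantiles (here the median),
# giving two bins with five elements each
q = pd.qcut(years, 2)
print(pd.Series(q).value_counts())
```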
