Aidemy　2020/9/21

Introduction

Hello, it is Yope! I am a liberal arts student, but I was interested in the possibilities of AI, so I went to the AI-specialized school "Aidemy" to study. I would like to share the knowledge gained here with you, and I am summarizing it on Qiita. Last time was my first post, but I am very happy that many people have read it. Thank you! This is the second post of data cleansing. Nice to meet you.

This article is a summary of what you learned in "Aidemy" "in your own words". It may contain mistakes and misunderstandings. Please note.

What to learn this time ・ About CSV -Processing for missing values (NaN) -Process data for each key in the dictionary

1　CSV

Load CSV with Pandas

-CSV is a data file that is often used in data analysis. First, load it using Pandas. __ · pd.read_csv ("data file name", header = specification of column to be header) __

None unless the header is specified.

import pandas as pd
# wine.loading csv
df = pd.read_csv("./4050_data_cleansing_data/wine.csv", header=None)
#Set the column name
df.columns = [abridgement]
#output
df

Creating CSV with CSV library

-Create CSV using the "CSV library" built into Python. __ ・ with open ("file name", "w") as csvfile: __, in it writer=csv.writer(csvfile, lineterminator="\n") Create (writer method that takes csvfile and line feed code as arguments) and use it Enter data with __writer.writerow ([data]) __.

import csv
with open("file name","w") as csvfile:
  writer=csv.writer(csvfile, lineterminator="\n")
  writer.writerow(["version","year","gen"]) #The first line is the column name
  writer.writerow(["Red-green",1996,1])
  writer.writerow(["Gold and silver",1999,2])
  writer.writerow(["RS",2002,3])

#Read
with open("file name","r") as csvfile:
  print(csvfile.read())
# version,year,gen
#Red-green,1996,1
#Gold and silver,1999,2
# RS,2003,3

Creating CSV with Pandas

・ This is easier when converting DataFrame to CSV. __ · DataFrame data .to_csv ("filename") __

data={"version":["Red-green","Gold and silver","RS"],"year":[1996,1999,2002],"gen":[1,2,3]}
df=pd.DataFrame(data)
df.to_csv("file name")

2 Processing for missing values (NaN)

Delete rows and columns containing NaN

-The deletion of rows and columns containing NaN is called __ "listwise deletion" __. Listwise deletion is done with the dropna () function. If no argument is specified, the row is deleted, and if axis = 1 is specified, the column is deleted. __ · DataFrame data .dropna () __

-If the number of data is too small when using listwise deletion, it is better to leave the column with few missing and delete the row containing NaN from it. This is called __ "pairwise delete" __. __ · DataFrame data [[column to leave]] .dropna () __

data={"version":["Red-green","Gold and silver","RS"],"year":[1996,1999],"gen":[1]}
df=pd.DataFrame(data)
#Since there is little data in listwise deletion, version,Remove rows containing NaN, leaving only the year column
df[[0,1]].dropna()
#version year
#Red green 1996
#Gold and silver 1999

Missing value completion (1)

-Complement by substituting alternative data instead of deleting. Use the fillna () function. Set "alternative data" or "method =" ffill "" (apply the value of the previous line) to the argument. __ · DataFrame data .fillna () __

data={"version":["Red-green","Gold and silver","RS"],"year":[1996,1999],"gen":[1]}
df=pd.DataFrame(data)
#Replace NaN with previous data
df.fillna(method="ffill")
#version year gen
#Red-green 1996 1
#Gold and silver 1999 1
#RS  1999 1

Missing value completion (2)

-By using the mean () function as an argument of fillna (), the substitute data can be replaced with "mean value of the column". __ · DataFrame data .fillna (DataFrame data .mean ()) __

data={"version":["Red-green","Gold and silver","RS"],"year":[1996,1999],"gen":[1]}
df=pd.DataFrame(data)
#Replace NaN with previous data
df.fillna(method="ffill")
#version year gen
#Red-green 1996 1
#Gold and silver 1999 1
#RS  1997.5 1

3 Process data for each key in the dictionary

Calculate for each key

-The average and maximum values for each key in a dictionary or CSV (DataFrame) can be calculated using statistical functions such as mean () and max (). __ · Dictionary and df [key]. Statistical function __

df = pd.read_csv("./4050_data_cleansing_data/wine.csv", header=None)
df.columns = ["","Alcohol",(abridgement)]
#　"Alcohol"Average value calculation
df["Alcohol"].mean()
# 13.000617977528091

Extraction and deletion of duplicate data

-Use the duplicated () method to extract the exact same data. Use the drop_duplicates () method to delete. __ · Duplicate data .duplicated () __ __ · Duplicate data .drop_duplicates () __

data={"version":["Red-green","Gold and silver","RS","Red-green"],"year":[1996,1999,2002,1996],"gen":[1,2,3,1]}
df=pd.DataFrame(data)
#Delete duplicate data
df.drop_duplicates()
# version,year,gen
#Red-green,1996,1
#Gold and silver,1999,2
# RS,2003,3

Create a new key that corresponds to an existing key

-Creating a new key corresponding to an existing key is called __ "mapping" __. __ · DataFrame data [new key] = DataFrame data [reference source key] .map (dictionary type correspondence table) __

data={"version":["Red-green","Gold and silver","RS"],"year":[1996,1999,2002],"gen":[1,2,3]}
df=pd.DataFrame(data)
#Map after creating a dictionary-type correspondence table
version_map={"Red-green":"Kanto","Gold and silver":"Johto","RS":"Hoenn"}
df["region"] = df["version"].map(version_map)
df
# version,year,gen,region
#Red-green,1996,1,Kanto
#Gold and silver,1999,2,Johto
# RS,2003,3,Hoenn

Set the range of numerical data and categorize

-Set a range such as "0-5" and "6-10", and divide or aggregate the data for each range. Such processing is called "bin division". Bin division is done with pd.cut (). __ · pd.cut (Data [key to specify range], [list of range], labels = [list of names for each range]) __ #bin split -There is also a method to specify the number of divisions and automatically divide the range. __ · pd.cut (data [key to specify range], number of divisions) __ __ · pd.value_counts (bin divided data) __ # Aggregation of the number of data for each divided range

data={"version":["Red-green","Gold and silver","RS","DP"],"year":[1996,1999,2002,2006],"gen":[1,2,3,4]}
df=pd.DataFrame(data)
#Specify the range by year, and create the range list first.
range = [1996,2000,2007]
range_name = ["1996-2000","2000-2007"]
cut_df = pd.cut(df["year"], range, labels=range_name)
pd.value_counts(cut_df)
#1996-2000  2
#2000-2007  2

Summary

-A data file called CSV is often used for data analysis. CSV is created using the standard python library or Pandas. -Missing values are processed by deleting them with dropna () or complementing them with fillna (). -In dictionaries and DataFlame, you can calculate value by referring to keys, delete duplicate data (drop_duplicates), create new keys (map), and divide by specifying a range (pd.cut). ..

that's all. Thank you for reading this far.

[PYTHON] Data cleansing 2 Data cleansing using DataFrame