[Python] Pre-processing tricks

Sample data creation

Create a DataFrame from iris data


import pandas as pd
from sklearn.datasets import load_iris


iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

Create a DataFrame from a dictionary

import pandas as pd
# Use a name other than "input" so the built-in function is not shadowed
data = {'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]}
df = pd.DataFrame(data)

Reading data

import pandas as pd

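# Excel
df = pd.read_excel('file name.xlsx')

# CSV
# col_names is a list of column names that replaces the file's own header row.
# header must be an int (or None), not True; header=0 marks the first row as the header.
df = pd.read_csv('filename.csv', sep=',', names=col_names, header=0, low_memory=False)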
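
How `names` and `header` interact in `read_csv` is easy to get wrong, so here is a minimal sketch. The CSV text and the column names are made up for illustration, and `io.StringIO` stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'filename.csv'
csv_text = "x,y\n1,2\n3,4\n"
col_names = ['first', 'second']  # assumed replacement column names

# header=0: the first row is a header and is replaced by col_names
renamed = pd.read_csv(io.StringIO(csv_text), names=col_names, header=0)

# header=None: every row, including the first one, is treated as data
raw = pd.read_csv(io.StringIO(csv_text), names=col_names, header=None)

print(renamed)
print(raw)
```

With `header=None` the original header row ends up as the first data row, which is usually not what you want when the file already has a header.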

Data inspection

Statistics

df.describe(include='all')

Pair plot


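import seaborn as sns

# The iris DataFrame has no "target" column yet, so add the class labels first
df['target'] = iris.target
sns.pairplot(df, vars=iris.feature_names, hue='target')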

Null check

df.isnull().sum()
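
Besides the raw counts, the share of missing values per column is often useful. A minimal sketch with a made-up DataFrame (the column names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with some missing values
sample = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', 'y', None, 'z'],
})

# Number of missing values in each column
null_counts = sample.isnull().sum()

# Fraction of missing values in each column (mean of the boolean mask)
null_ratio = sample.isnull().mean()

print(null_counts)
print(null_ratio)
```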

Number of unique values in each column (distinct count)

df.nunique()

Frequency

# Count how often each value appears in one column
df['Column name'].value_counts()
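
As a small illustration of frequency counts, the sketch below uses a made-up Series and also shows the `normalize=True` option, which returns shares instead of counts.

```python
import pandas as pd

# Hypothetical column of categorical values
colors = pd.Series(['red', 'blue', 'red', 'red'], name='color')

# Absolute frequency of each value, most frequent first
counts = colors.value_counts()

# Relative frequency (the shares sum to 1)
shares = colors.value_counts(normalize=True)

print(counts)
print(shares)
```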

Histogram

df['Column name'].plot.hist(bins=40)

Sort

# Sort rows by index
df.sort_index()
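
Sorting by the values of a column is the usual companion to sorting by index. A minimal sketch with a made-up DataFrame (the names `scores`, `name` and `score` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data with a shuffled index
scores = pd.DataFrame(
    {'name': ['b', 'a', 'c'], 'score': [2, 3, 1]},
    index=[2, 0, 1],
)

# Sort rows by index label
by_index = scores.sort_index()

# Sort rows by a column, largest value first
by_score = scores.sort_values('score', ascending=False)

print(by_index)
print(by_score)
```

Both methods return a new DataFrame and leave the original unchanged.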

Data processing

One Hot Encoding

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['Senior citizens', 'adult', 'adult', "Toddler", "Toddler"], 'B': [2020,2020,2021,2021,1993],
                   'C': [1.0, 2.0, 1.0, np.nan, np.inf], "D":[0,1,2,3,4]})


pd.get_dummies(df, columns=["A", "B"])


# One-hot encode a column, dropping the first category to avoid a redundant column
df = pd.get_dummies(df, columns=["Column name"], drop_first=True)

# Keep only the rows that meet a condition
df = df[df['Column name'] == value]

# Label names containing the word "curry" as 1 and all other names as 0
train['curry'] = train['name'].apply(lambda x: 1 if 'curry' in x else 0)
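
To make the effect of `drop_first=True` concrete, the sketch below encodes a small made-up column with and without the option. The DataFrame `ages` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical categorical column
ages = pd.DataFrame({'A': ['adult', 'Toddler', 'Senior citizens', 'adult']})

# One dummy column per category
full = pd.get_dummies(ages, columns=['A'])

# The first category (in sorted order) is dropped; it is implied
# when all remaining dummy columns are 0
reduced = pd.get_dummies(ages, columns=['A'], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one category removes the perfectly collinear column, which matters for linear models but is usually unnecessary for tree-based models.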



Handling of DataFrame

# Combine DataFrames vertically (stack rows and renumber the index)
pd.concat([df1, df2, df3], axis=0, ignore_index=True)

# Combine DataFrames horizontally (align rows by index)
pd.concat([df1, df2, df3], axis=1)
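
The difference between the two directions, and what `ignore_index=True` changes, can be seen on two tiny made-up frames:

```python
import pandas as pd

# Hypothetical frames used only for illustration
df1 = pd.DataFrame({'x': [1, 2]})
df2 = pd.DataFrame({'x': [3, 4]})

# Vertical: rows are stacked; without ignore_index the labels 0, 1 repeat
kept_index = pd.concat([df1, df2], axis=0)
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

# Horizontal: columns are placed side by side, rows matched by index label
side = pd.concat([df1, df2.rename(columns={'x': 'y'})], axis=1)

print(kept_index.index.tolist())
print(stacked)
print(side)
```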

Handling of columns


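# Rename a column
df = df.rename(columns={'Name before change': 'Name after change'})

# Add a column (assign() takes keyword arguments, so use item assignment
# when the column name is not a valid Python identifier)
df['Column name'] = value

# Delete a column
df = df.drop('Column name', axis=1)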
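
The three column operations can also be chained. The following is a minimal sketch on a made-up DataFrame; the column names `old`, `new`, `double` and `tmp` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame with a column to rename and a column to drop
items = pd.DataFrame({'old': [1, 2], 'tmp': [0, 0]})

result = (
    items
    .rename(columns={'old': 'new'})              # rename a column
    .assign(double=lambda d: d['new'] * 2)       # add a derived column
    .drop(columns='tmp')                         # delete a column
)

print(result)
```

`assign` only works here because `double` is a valid Python identifier; each step returns a new DataFrame, so `items` itself is not modified.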

Handling of NULL (NaN)

# Delete rows that contain at least one NULL
df = df.dropna(how='any')

# Replace NULL with a given value, column by column
df = df.fillna({'Column name': value})
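
A small made-up example shows what the two approaches do to the same data. The column names and fill values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in both columns
raw = pd.DataFrame({
    'price': [100.0, np.nan, 300.0],
    'color': ['red', None, None],
})

# Only rows without any missing value survive
dropped = raw.dropna(how='any')

# Each column gets its own replacement value
filled = raw.fillna({'price': 0.0, 'color': 'unknown'})

print(dropped)
print(filled)
```

Dropping keeps only the first row here, while filling keeps all three rows.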

One Hot Decode

animals = pd.DataFrame({"monkey":[0,1,0,0,0],"rabbit":[1,0,0,0,0],"fox":[0,0,1,0,0]})


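# Return the name of the first column whose value is 1
# (implicitly returns None for rows that contain no 1)
def get_animal(row):
    for c in animals.columns:
        if row[c] == 1:
            return c

animals.apply(get_animal, axis=1)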
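
As an alternative to the row-wise function, the same decoding can be sketched without `apply` by using `idxmax`. This is a different technique from the one above, not a drop-in replacement: rows without any 1 become NaN rather than None, and it assumes each row has at most one 1.

```python
import pandas as pd

animals = pd.DataFrame({
    "monkey": [0, 1, 0, 0, 0],
    "rabbit": [1, 0, 0, 0, 0],
    "fox":    [0, 0, 1, 0, 0],
})

# idxmax returns the column label of the largest value in each row.
# For an all-zero row it would return the first column, so those rows
# are masked out with where().
decoded = animals.idxmax(axis=1).where(animals.sum(axis=1) == 1)

print(decoded)
```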

Output


# CSV output (index=False omits the row index from the file)
df.to_csv('file name.csv', index=False)
