[PYTHON] A memorandum of method often used when analyzing data with pandas (for beginners)

Introduction

We have summarized the methods that are often used when analyzing data with pandas. We will make corrections as needed.

1. Creating a DataFrame

#Create a DataFrame and store the list
pd_sample = pd.DataFrame(list)

#Create a DataFrame and store multiple lists in a dictionary
pd_sample = pd.DataFrame({"Column A":List A, "Column B":List B})

#Create an empty DataFrame and join another DataFrame in subsequent processing
pd_sampleA = pd.DataFrame()
pd_sampleA = pd.concat([pd_sampleA, pd_sampleB])

2. Data join (union)

#pandas.concat()function
pd_sampleAll = pd.concat([pd_sampleA, pd_sampleB], ignore_index=True)

pd_sampleA

Column A Column B Column C
0 a b c

pd_sampleB

Column C Column D
0 y z

pd_sampleAll = pd.concat([pd_sampleA, pd_sampleB], ignore_index=True)

Column A Column B Column C Column D
0 a b c NaN
1 NaN NaN y z

3. Join data

#pandas.merge()function
pd_sampleAll = pd.merge(pd_sampleA, pd_sampleB, on="Column label", how="left")

#pandas.DataFrame merge()Method
pd_sampleAll = pd_sampleA.merge(pd_sampleB, on="Column label", how="left")

#Combine multiple columns into a key
pd_sampleAll = pd.merge(pd_sampleA, pd_sampleB, on=["Column A", "Column B"], how="left")

#Join when key column names are different
pd_sampleAll = pd.merge(pd_sampleA, pd_sampleB, left_on="Column A", right_on="Column AAA", how="left")

pd_sampleA

Column A Column B Column C
0 a b c
1 aaa bbb ccc

pd_sampleB

Column A Column D
0 a x
1 aa y

** pd_sampleAll = pd.merge (pd_sampleA, pd_sampleB, on = "column A", how = "left") **

Column A Column B Column C Column D
0 a b c x
1 aaa bbb ccc NaN

4. Data extraction

#Specify column name pandas.Extract DataFrame
pd_sampleA = pd_sampleAll[["Column A", "Column B"]]

#pandas.loc()Extract by specifying the position of row label and column label with function
#Series type is returned when a single row is fetched, and DataFrame type is returned when multiple rows are fetched.
#In addition to the line name, you can also specify a conditional expression that returns a boolean type in the line label.
#You can also slice multiple columns and rows
pd_sampleA = pd_sampleAll.loc["Row label", "Column label"]
pd_sampleA = pd_sampleAll["Column label"].loc["Row label"]

#pandas.iloc()Extract by specifying the position of row number and column number with the function
#Series type is returned when a single row is fetched, and DataFrame type is returned when multiple rows are fetched.
#You can also slice multiple columns and rows
pd_sampleA = pd_sampleAll.iloc["line number", "Column index"]

pd_sampleAll

Column A Column B Column C Column D
0 a b c d
1 aaa bbb ccc ddd

** pd_sampleA = pd_sampleAll.loc [pd_sampleAll ["Column A"] == "aaa", ["Column B", "Column C"]] **

Column B Column C
1 bbb ccc

5. Confirmation of missing values

#pandas.Series.isnull()Get the row where the missing value exists in the method
pd_sample = pd_sample["Column A"].isnull()

#isnull()Any to method()Check for the presence of columns containing missing values using
pd_sample["Column A"].isnull().any(axis=0)

#pandas.DataFrame.dropna()Delete a row that is missing a particular column in the method
pd_sample = pd_sample.dropna(subset=["Column A"])

#pandas.DataFrame.fillna()Replace missing values with arbitrary values in the method
pd_sample = pd_sample.fillna({"Column A": "AAA", "Column B": "BBB"}))

6. Date conversion

#pandas.to_datetime()String in function(object type)To datetime type
pd_sample["date"] = pd.to_datetime(pd_sample["date"])

#Get date, hour, minute, second, day of the week, etc. using dt accessor for datetime type
pd_sample["Month"] = pd_sample["date"].dt.month

#datetime type strftime()Convert date and time to a string in any format with the method
pd_sample["Year / month"] = pd_sample["date"].dt.strftime("%Y%m")

#pandas.to_timedelta()Get the number of days difference between two dates with a function
import datetime
pd_sample["Number of days elapsed"] = pd.to_timedelta(Timestamp 1-Timestamp 2)

#Get the monthly difference between two dates using the relativedelta object of the dateutil module
#The difference between the two dates in the return value(years=+1, months=+10, days=+8)Is stored in the form of
from dateutil.relativedelta import relativedelta
delta = relativedelta(Timestamp 1,Timestamp 2)
pd_sample["Elapsed months"] = delta.years*12 + delta.months

7. Aggregation

#pandas.groupby()Aggregate by specifying the column you want to aggregate with the function and the aggregation method
#By default the group label is index, so if you don't want to do that, as_index=Specify False.
pd_count = pd_sample.groupby("The column you want to put together").count()
pd_sum = pd_sample.groupby(["Column 1 you want to summarize", "Column 2 you want to summarize"], as_index=False).sum()[["Column 1 you want to sum", "Column 2 you want to sum"]]

#pandas.unique()Get unique data count excluding duplicates with function
pd_unique = pd.unique(pd_sample["Column A"])
pd_unique = pd_sample["Column A"].unique()

8. Other methods

#pandas.agg()The function outputs the mean, median, maximum, and minimum values for a particular column
pd_sample = pd_sample.agg(["mean", "median", "max", "min"])["Columns you want to aggregate"]

#pandas.DataFrame.describe()Outputs the number of data items, mean value, standard deviation, minimum value, quartile, median value, and maximum value with the function
#By default, only numeric columns (integer type int, floating point type float) are selected
pd_sample.describe

#Manipulate strings using str accessor for DateFrame type
pd_sample["Column A"] = pd_sample["Column A"].str.replace(" ", "")

#pandas.Where in DataFrame()Assign a value according to the condition in the method
#If True, the calling object is applied, if False, the second argument is applied
pd_sample["Column A"] = pd_sample["Column A"].where(pd_sample["Column B"]>3, 0)

#pandas.DataFrame mask()Assign a value according to the condition in the method
#If True, the second argument is applied, if False, the calling object is applied.
pd_sample["Column A"] = pd_sample["Column A"].mask(pd_sample["Column B"]>3, 0)

Recommended Posts

A memorandum of method often used when analyzing data with pandas (for beginners)
A memorandum of method often used in machine learning using scikit-learn (for beginners)
A collection of methods used when aggregating data with pandas
A memorandum of trouble when formatting data
Basics of pandas for beginners ② Understanding data overview
Summary of Pandas methods used when extracting data [Python]
[Pandas] I tried to analyze sales data with Python [For beginners]
Manage the overlap when drawing scatter plots with a large amount of data (Matplotlib, Pandas, Datashader)
Summary of methods often used in pandas
Memorandum of introduction of EXODUS, a data model of the finite element method (FEM)
[Pandas_flavor] Add a method of Pandas DataFrame
Processing memos often used in pandas (beginners)
Example of efficient data processing with PANDAS
Best practices for messing with data with pandas
Introduction of drawing code for figures with a certain degree of perfection of meteorological data
A collection of Numpy, Pandas Tips that are often used in the field
[Introduction to Python] How to get the index of data with a for statement
Pandas basics for beginners ③ Histogram creation with matplotlib
Make holiday data into a data frame with pandas
[Linux command] A memorandum of frequently used commands
When reading a csv file with read_csv of pandas, the first column becomes index
Implementation of clustering k-shape method for time series data [Unsupervised learning with python Chapter 13]
A collection of code often used in personal Python
Be careful when reading data with pandas (specify dtype)
Analysis of measurement data ①-Memorandum of understanding for scipy fitting-
A collection of Excel operations often used in Python
A memorandum when making a surveillance camera with Raspberry Pi
Precautions when operating with string for TmeStampType of PySpark
I tried using the frequently used seaborn method with as few arguments as possible [for beginners]
Can be used with AtCoder! A collection of techniques for drawing short code in Python!
[Python] Extracts data frames that do not match a specific column with other data frames of Pandas
[For recording] Pandas memorandum
Data visualization with pandas
Shuffle data with pandas
Recommendation of Jupyter Notebook, a coding environment for data scientists
A network diagram was created with the data of COVID-19.
Code snippets often used when processing videos with Google Colaboratory
Notes on handling large amounts of data with python + pandas
Code snippets often used when using BigQuery with Google Colab
[For beginners] Summary of standard input in Python (with explanation)
A study method for beginners to learn time series analysis
Ingenuity to handle data with Pandas in a memory-saving manner
Pandas basics for beginners ④ Handling of date and time items
Easy-to-understand [Pandas] practice / data confirmation method for high school graduates
(For beginners) Try creating a simple web API with Django
[For beginners] A word summary of popular programming languages (2018 version)
Created a method to downsample for unbalanced data (for binary classification)
Turn an array of strings with a for statement (Python3)
An overview of DELG, a new method for extracting image features that attracts attention with Kaggle
[For IT beginners] What to do when the rev command cannot be used with Git Bash