[PYTHON] Processing memos often used in pandas (beginners)

Introduction

This article is a memo of data analysis by python using pandas. I will briefly describe the basics. The variables to which the data read by pandas are assigned are as follows.

What is pandas

Python data analysis library

pandas import

Import example (Hereafter, usage example is described in pd)

import pandas as pd

Read CSV file and Excel file

an_data = pd.read_csv("<CSV file name>")
an_data = pd.read_excel("<Excel file name>")

Display the read data for 5 lines

Useful when you want to check the contents of the data

an_data.head()

Data join

Describes the process when combining separately read data. First data name: an_data1 Second data name: an_data2

Union

Combine the data vertically. (If you set ignore_index = True, the index will be reassigned)

concat_data = pd.concat([an_data1, an_data2], ignore_index=True)

join

Combine the data horizontally.

merge_data = pd.merge(an_data1, an_data2[[<Combine specific keys>]], on="<Join key>", how=left)

How to handle date data

Change the data recognized as a character string to date data

an_data["date"] = pd.to_datetime(an_data["date"])

Extract only year and month

an_data["Year / month"] = an_data["date"].dt.strftime("%Y%m")

Mysterious numbers are read for data such as Excel that has the following notation. Make it possible to handle mysterious numbers as date data.

--Original data that becomes a mysterious number --May 1, 2020 --When the above data is pd.to_datetime (), the following numbers are output (this is an example, so it is different from the actual value). - 456789

ch_time = pd.to_timedelta(an_data.loc[flg_num, "date"].astype("float"), unit="D") + pd.to_datetime("1900/01/01")

If you process one level higher, it will be separated by a hyphen like YYYY-MM-DD. It is troublesome if it is mixed with a slash-separated date, so unify the format (YYYY-MM-DD).

ch_time = pd.to_datetime(an_data.loc[~flg_num, "date"])

Find a value without data

Addition of index containing null data

an_data.isnull().sum()

Contains null data or outputs as boolean (axis is 0 for vertical, 1 for horizontal)

an_data.isnull().any(axis=0)

Data output by describe

The number of data, mean, standard deviation, and quintuplet summary are output.

an_data.describe()

Data type output

<Read data>.dtypes

Example of data summarization by groupby

With the as_index = False option, the index that is automatically created with the name specified by groupby is not created.

an_data.groupby(["<Specific column>"]).sum()["<Another column>"]

It is possible to specify multiple specific columns and different columns.

an_data.groupby(["<Specific column A>", "<Specific column B>"]).sum()["<Another column A">, "<Another column b>"]
tmp = an_data.groupby(["<Specific column>"]).count()["<Another column>"]

You can output average etc. by using agg

tmp = an_data.groupby("<Specific column>").agg("mean", "median", "max", "min")["<The column you want to aggregate>"]

pivot_table

Create a new table

an_data = pd.pivot_table(an_data, index="<Column name you want to index>", columns"<The column name you want to be column>", values=["<Column values you want to output to the table>"], aggfunc="", fill_value=0

sort

ascending = True is in ascending order

an_data.sort_values(by=["<Main sort target column>"], ascending=True)

Remove whitespace

an_data["test_val"] = an_data["test_val"].str.replace("  ", "")
an_data["test_val"] = an_data["test_val"].str.replace(" ", "")

Data confirmation by unique

print(len(pd.unique(an_data["val"])))
print(pd.unique(an_data["val"]))

Check the number of parameters whose value is a numerical value

flg_num = an_data["<Column name>"].astype("str").str.isdigit()
flg_num.sum()

loc Label designation

an_data.loc[:, [<column01>, <column02>]]
an_data.loc[[True, False, True, False], [False, False, True]]

iloc Numbering

an_data.iloc[[0:2], [2:4]]
an_data.iloc[[False, True, True], [False, True, False]]

Rename label

Specifying inplace = True will change the original data

an_data.rename(columns={"<Name before change>":"<Name after change>"}, index={"<Name before change>":"<Name after change>"}, inplace=True)

Reindex

Since the index value is reassigned, the index is not covered.

an_data = an_data.reset_index(drop=False)

Drop processing

an_data = an_data.drop("<Items you want to drop>", axis=1)

where Use where to change the value of the item that "does not" match the condition In the following example, if the value of is not 5 or less, the value of is changed to 1.

an_data["<Target item>"] = an_data["<Target item>"].where(an_data["<Comparison items>"]<5, 1)

relativedelta

delta = relative_delta(dtime1, dtime2)
The content of delta is the difference between dtime1 and dtime2 (dtime1 is the latest date))

Removal of missing values

an_data = an_data.dropna(subsent=["<Column name where the missing value exists>"])

The replacement of the missing part is as follows (filled with 0)

an_data = an_data.fillna(0)

Confirmation is possible at

an_data.isna().sum()

csv output

an_data.to_csv("", index=False)

example graph output

Import example of library for graph output (Hereafter, usage example is described in plt) In addition, the second line describes an example when using it with jupyter notebook.

import matplotlib.pyplot as plt
%matplotlib inline

Read the data to be plotted (listed in the order of horizontal axis, vertical axis, option) At least you should put a label

plt.plot(an_data.index, an_data['<Items on the vertical axis>'], label='label-1')
plt.plot(an_data.index, an_data['<Items on the vertical axis>'], label='label-2')
plt.plot(an_data.index, an_data['<Items on the vertical axis>'], label='label-3')

Plot the read data on a graph

plt.legend()

Recommended Posts

Processing memos often used in pandas (beginners)
Grammar summary often used in pandas
Summary of methods often used in pandas
Settings often used in Jupyter
Processing summary 2 often done in Pandas (data reference, editing operation)
I tried to summarize the code often used in Pandas
Pandas basics for beginners ① Reading & processing
Pandas basics for beginners ⑧ Digit processing
Disk-related commands often used in Ubuntu (memories)
A collection of Numpy, Pandas Tips that are often used in the field
Summary of what was used in 100 Pandas knocks (# 1 ~ # 32)
Techniques often used in python short coding (Notepad)
Code often used in Python / Django apps [prefectures]
A memorandum of method often used when analyzing data with pandas (for beginners)
A memorandum of method often used in machine learning using scikit-learn (for beginners)
A collection of code often used in personal Python
A collection of Excel operations often used in Python
File processing in Python
Learn Pandas in 10 minutes
Processing datasets with pandas (1)
Processing datasets with pandas (2)
Multithreaded processing in python
UnicodeDecodeError in pandas read_csv
Text processing in Python
Queue processing in Python
Code snippets often used when processing videos with Google Colaboratory
Commands often used in the development environment during Python implementation
[Explanation for beginners] Introduction to convolution processing (explained in TensorFlow)
[Explanation for beginners] Introduction to pooling processing (explained in TensorFlow)
I tried to summarize the commands often used in business