[Python] Fully understand pandas in 10 minutes

Pandas in 10 minutes

Introduction

This article is a hand-typed walkthrough of, and commentary on, the official pandas tutorial "10 minutes to pandas".

It is based on the following page: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

Environment

For now, just import the libraries.

import numpy as np
import pandas as pd
np
pd

You are all set if each module is displayed without an error.

If an error occurs

If you get **ModuleNotFoundError: No module named 'pandas'**, install pandas first.


---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-59ab05e21164> in <module>
      1 import numpy as np
----> 2 import pandas as pd

ModuleNotFoundError: No module named 'pandas'

Command: python -m pip install pandas


1. Create an object

You can easily create one-dimensional data by passing a list to the Series class.


# A simple sequence of values
s = pd.Series(data=[1, 3, 5, np.nan, 6, 8])
s
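As a minimal sketch of what the Series above holds: because the list contains np.nan, which is a float, pandas stores every value as float64, and since no index was given, the labels default to 0 through 5.

```python
import numpy as np
import pandas as pd

s = pd.Series(data=[1, 3, 5, np.nan, 6, 8])

print(s.dtype)         # float64, because NaN is a float
print(list(s.index))   # [0, 1, 2, 3, 4, 5]
print(s.isna().sum())  # 1 (one missing value)
```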

You can use date_range() to create a sequence of dates covering a specific period.


#Data for 6 days from January 1, 2020
dates = pd.date_range("20200101", periods=6)
dates

By specifying the **index** argument of the pandas [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas-dataframe) class, you can set the row index.

#Specify data from January 1, 2020 for row index
#Enter a random number for each value
df = pd.DataFrame(np.random.randn(6, 4), index=dates)
df

You can also set the column names by specifying the **columns** argument of the DataFrame class.

#Set column name ABCD
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

When you pass a dictionary to the DataFrame class, the dictionary keys become the column names.

df2 = pd.DataFrame(
    {
        "A": 1.,
        "B": pd.Timestamp("20200101"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2
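Here is a small sketch of how the dictionary is turned into a table, using a trimmed-down version of df2 (the selection of columns is my own choice for illustration). Scalar values such as 1.0 and "foo" are repeated for every row, and the number of rows comes from the array-like values.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {
        "A": 1.0,  # scalar, repeated for every row
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "F": "foo",  # scalar, repeated for every row
    }
)

print(df2.shape)          # (4, 4)
print(list(df2.columns))  # ['A', 'C', 'D', 'F']
print(df2["A"].tolist())  # [1.0, 1.0, 1.0, 1.0]
print(df2["F"].tolist())  # ['foo', 'foo', 'foo', 'foo']
```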

You can check the data type of each column by referring to the **dtypes** attribute.

df2.dtypes

If you are using Jupyter Notebook or JupyterLab, column names are offered by tab completion.

df2.<TAB>

2. View data

By using the [head() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head) of the DataFrame class, you can display the beginning of the data.

df.head(2)

Similarly, by using tail() of the DataFrame class, you can display the end of the data.

df.tail(2)

By referring to the **index** attribute of the DataFrame class, you can display the row index of the data.

df.index
df2.index

By using to_numpy() of the DataFrame class, you can convert the data into an array that is easy to work with in NumPy.

df.to_numpy()
df2.to_numpy()
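One thing worth checking after the conversion is the dtype of the resulting array. The sketch below uses two tiny frames of my own (not the df and df2 above): when every column is a float the array is float64, but when float and string columns are mixed, the only dtype that can hold both is object.

```python
import pandas as pd

# All-float frame: the result is a float64 array
floats = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
arr = floats.to_numpy()
print(arr.dtype)  # float64
print(arr.shape)  # (2, 2)

# Mixed float and string columns: the common dtype is object
mixed = pd.DataFrame({"A": [1.0, 2.0], "F": ["foo", "bar"]})
print(mixed.to_numpy().dtype)  # object
```

Note that the row and column labels are not part of the returned array.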

By using describe() of the DataFrame class, you can get quick summary statistics for the columns of your data.

df2.describe()
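To show what the summary looks like, here is a minimal sketch with a small all-numeric frame of my own (the values are made up for illustration). The result is itself a DataFrame whose rows are the statistics.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]})
stats = df.describe()

print(list(stats.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(stats.loc["mean", "A"])  # 2.5
print(stats.loc["max", "B"])   # 40.0
```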

By referring to the T attribute of the DataFrame class, you can access the data with its rows and columns swapped.

df.T

You can also get the transposed data with transpose() of the DataFrame class.

df.transpose()
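A quick sketch to confirm that the two spellings give the same result. I am assuming fixed values from np.arange instead of random numbers so that the output is reproducible.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

print(df.shape)                     # (6, 4)
print(df.T.shape)                   # (4, 6)
print(df.T.equals(df.transpose()))  # True
print(list(df.T.index))             # ['A', 'B', 'C', 'D']
```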

By using [sort_index()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html#pandas-dataframe-sort-index) of the DataFrame class, you can sort the whole table by its row labels or column labels.

df.sort_index()

Set the **axis** argument to 0 or "index" to sort by row labels, or to 1 or "columns" to sort by column labels (default 0). Also, if you pass False for the **ascending** argument, the sort order becomes descending (default True).


df.sort_index(axis=0, ascending=False)
df.sort_index(axis=1, ascending=False)
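The effect of axis and ascending is easier to see on a frame whose labels start out unsorted. The frame below is a hypothetical one of my own, with row labels y, z, x and columns B, A, C.

```python
import pandas as pd

df = pd.DataFrame(
    {"B": [1, 2, 3], "A": [4, 5, 6], "C": [7, 8, 9]},
    index=["y", "z", "x"],
)

# axis=0 (default): reorder the rows by their labels
print(list(df.sort_index().index))                         # ['x', 'y', 'z']
print(list(df.sort_index(axis=0, ascending=False).index))  # ['z', 'y', 'x']

# axis=1: reorder the columns by their names
print(list(df.sort_index(axis=1).columns))                   # ['A', 'B', 'C']
print(list(df.sort_index(axis=1, ascending=False).columns))  # ['C', 'B', 'A']

# The values travel with their labels
print(df.sort_index().loc["x"].tolist())  # [3, 6, 9]
```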

By using [sort_values()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html#pandas-dataframe-sort-values) of the DataFrame class, you can sort by the values in a column or in a row.

df.sort_values(by="B")
df.sort_values(by="2020-01-01", axis=1)
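Since the frame above is random, here is a reproducible sketch with made-up values and hypothetical row labels r1 to r3. With the default axis, `by` names a column and the rows are reordered. With axis=1, `by` names a row label and the columns are reordered.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [3, 1, 2], "B": [9, 7, 8]},
    index=["r1", "r2", "r3"],
)

# Default axis=0: reorder the rows by the values in column "A"
by_col = df.sort_values(by="A")
print(list(by_col.index))    # ['r2', 'r3', 'r1']
print(by_col["B"].tolist())  # [7, 8, 9]

# axis=1: reorder the columns by the values found in row "r1"
wide = pd.DataFrame({"A": [3], "B": [1], "C": [2]}, index=["r1"])
by_row = wide.sort_values(by="r1", axis=1)
print(list(by_row.columns))  # ['B', 'C', 'A']
```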

(Added on 2020-03-07)

3. Select data

Simple data acquisition

You can get the specified column with **df["A"]** or **df.A**.

df["A"]
df.A


If you pass a Python slice to **[]**, you can select rows.

# Display the first 3 rows
df[0:3]


You can also slice by a range of index labels.


#Displayed from January 2, 2020 to January 4, 2020
df['20200102':'20200104'] 

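The two kinds of slice treat the end point differently, which is easy to miss. In the sketch below (fixed values from np.arange are my assumption, to keep the output reproducible), the positional slice stops before the end position, while the label slice includes the end label.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Positional slice: positions 0, 1 and 2 (the end position 3 is excluded)
by_position = df[0:3]
print(len(by_position))       # 3
print(by_position.index[-1])  # 2020-01-03 00:00:00

# Label slice: January 2, 3 and 4 (the end label is included)
by_label = df["20200102":"20200104"]
print(len(by_label))          # 3
print(by_label.index[-1])     # 2020-01-04 00:00:00
```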

Select data by label

By passing index labels (dates in this case) to loc of the DataFrame class, you can select rows by label. A single label returns that row as a Series.


df.loc[dates]
df.loc[dates[0]]


You can select multiple columns by using loc.


df.loc[:, ["A", "B"]]


An error occurs if you leave out the leading colon, because loc then looks for "A" and "B" among the row labels.

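A minimal sketch of that error. To keep the outcome unambiguous, I am assuming a small frame with hypothetical string row labels r1 to r3 rather than the date index used above.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
    index=["r1", "r2", "r3"],  # hypothetical row labels for illustration
)

# With the colon: all rows, columns A and B
print(df.loc[:, ["A", "B"]].shape)  # (3, 2)

# Without the colon: the list is looked up among the row labels
try:
    df.loc[["A", "B"]]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```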

By combining loc with slices, you can select multiple rows and multiple columns at once.

df.loc['20200102':'20200104', ['A', 'B']]


By giving loc a single row label and a single column label, you can get a single value.

df.loc[dates[0], 'A']


By using at, you can get a single value faster.

df.at[dates[0], 'A']


Select data by position (https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#selection-by-position)

By using iloc of the DataFrame class, you can select data by integer position.

df.iloc[3]
df.iloc[3:5, 0:2]
df.iloc[[1, 2, 4], [0, 2]]


If you pass iloc a bare slice (:) with both the start and end positions omitted, you get all rows or all columns along that axis.

df.iloc[1:3, :]
df.iloc[:, 1:3]

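The iloc calls above can be checked against known values. This sketch assumes the frame is filled with 0 to 23 from np.arange instead of random numbers, so each row r holds 4r, 4r+1, 4r+2, 4r+3.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# A single integer selects one row (as a Series)
print(df.iloc[3].tolist())  # [12, 13, 14, 15]

# Slices exclude the end position, as in plain Python
print(df.iloc[3:5, 0:2].to_numpy().tolist())  # [[12, 13], [16, 17]]

# Lists pick arbitrary rows and columns
print(df.iloc[[1, 2, 4], [0, 2]].to_numpy().tolist())  # [[4, 6], [8, 10], [16, 18]]

# A bare colon keeps the whole axis
print(df.iloc[1:3, :].shape)  # (2, 4)
print(df.iloc[:, 1:3].shape)  # (6, 2)
```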

By passing only integers to iloc of the DataFrame class, you can select a single value.

df.iloc[1, 1]


As with at, you can get a single value faster by using iat.

df.iat[1, 1]

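To tie the four accessors together, here is a small sketch (again assuming fixed values from np.arange rather than random numbers). loc and at address a cell by label, iloc and iat address it by position, and each pair returns the same value.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label-based access to the cell in the first row, column "A"
print(df.loc[dates[0], "A"])  # 0
print(df.at[dates[0], "A"])   # 0

# Position-based access to the cell in the second row, second column
print(df.iloc[1, 1])  # 5
print(df.iat[1, 1])   # 5
```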

Select data by condition

(I ran out of steam here. There is still this much left... is this really 10 minutes? :thinking:)

4. Missing data
5. Operations
6. Merge
7. Grouping
8. Reshaping
9. Time Series
10. Categoricals
11. Plotting
12. Data Input and Output
13. Gotchas
