[PYTHON] Learn Pandas in 10 minutes

The pandas 0.15.2 documentation includes a tutorial called "10 Minutes to pandas". Working through it helped me organize what I knew. If you work through it seriously you will not finish in 10 minutes, so these are just notes on the parts that looked useful.

First, import pandas and NumPy.

# Import libraries
import pandas as pd
import numpy as np

Create a DataFrame

There are several ways to create a DataFrame, so here they are in order. The first way is to build a matrix with NumPy and attach an index (row labels) and column labels to it.

Create the index.

# Create an index
dates = pd.date_range("20130101", periods=6)
dates

<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None

Create a DataFrame and attach the index and column labels.

# Create a DataFrame
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

 	A 	B 	C 	D
2013-01-01 	0.705624 	-0.793903 	0.843425 	0.672602
2013-01-02 	-1.211129 	2.077101 	-1.795861 	0.028060
2013-01-03 	0.706086 	0.385631 	0.967568 	0.271894
2013-01-04 	2.152279 	-0.493576 	1.184289 	-1.193300
2013-01-05 	0.455767 	0.787551 	0.239406 	1.627586
2013-01-06 	-0.639162 	-0.052620 	0.288010 	-2.205777
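Because np.random.randn draws new values on every run, your numbers will differ from the table above. The following is a minimal sketch, assuming you want reproducible output and a NumPy version that provides np.random.default_rng. The seed value 0 and the name df_seeded are arbitrary choices for illustration, not part of the original tutorial.

```python
import numpy as np
import pandas as pd

# Seed 0 is an arbitrary choice; any fixed seed gives repeatable values.
rng = np.random.default_rng(0)

dates = pd.date_range("20130101", periods=6)
df_seeded = pd.DataFrame(
    rng.standard_normal((6, 4)),  # 6 rows x 4 columns of standard normal draws
    index=dates,
    columns=list("ABCD"),
)
print(df_seeded.shape)
```

Running the script twice with the same seed produces the same table.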


The second way is to build a DataFrame from a dict, where each key becomes a column label and each value becomes that column (think of it as one Series per column). This way each column can have its own dtype.

df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
df2

 	A 	B 	C 	D 	E 	F
0 	1 	2013-01-02 	1 	3 	test 	foo
1 	1 	2013-01-02 	1 	3 	train 	foo
2 	1 	2013-01-02 	1 	3 	test 	foo
3 	1 	2013-01-02 	1 	3 	train 	foo


df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object


Viewing a DataFrame

Next is how to look at the data in the form you want.

You can display only the index, only the columns, or only the underlying NumPy data.

df.index

<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None


df.columns

Index([u'A', u'B', u'C', u'D'], dtype='object')

df.values

array([[ 0.705624  , -0.79390348,  0.84342517,  0.67260162],
       [-1.21112884,  2.0771009 , -1.79586146,  0.02806019],
       [ 0.70608621,  0.38563092,  0.9675681 ,  0.27189394],
       [ 2.15227868, -0.49357565,  1.18428903, -1.19329976],
       [ 0.45576744,  0.78755094,  0.23940583,  1.62758649],
       [-0.63916155, -0.05261954,  0.28800958, -2.20577674]])
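As a small self-contained sketch of the three accessors, the example below uses a fixed table (the name small and its values are made up for illustration). It uses to_numpy(), which assumes pandas 0.24 or later; on the 0.15.2 release these notes are based on, .values is the way to get the array.

```python
import pandas as pd

# A tiny fixed table so the results are easy to check by eye.
small = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]}, index=["r1", "r2"])

row_labels = list(small.index)       # row labels
column_labels = list(small.columns)  # column labels
arr = small.to_numpy()               # underlying data as a NumPy array

print(row_labels)
print(column_labels)
print(arr.shape)
```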

describe() shows summary statistics for every column at once, which is convenient.

df.describe()

 	A 	B 	C 	D
count 	6.000000 	6.000000 	6.000000 	6.000000
mean 	0.361578 	0.318364 	0.287806 	-0.133156
std 	1.177066 	1.034585 	1.087978 	1.368150
min 	-1.211129 	-0.793903 	-1.795861 	-2.205777
25% 	-0.365429 	-0.383337 	0.251557 	-0.887960
50% 	0.580696 	0.166506 	0.565717 	0.149977
75% 	0.705971 	0.687071 	0.936532 	0.572425
max 	2.152279 	2.077101 	1.184289 	1.627586
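To make the rows of that summary easier to read, here is a minimal sketch on a column whose statistics can be checked by hand. The name small and its values are hypothetical.

```python
import pandas as pd

# Four evenly spaced values: the mean and the median are both 2.5.
small = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})

stats = small.describe()

# describe() returns a DataFrame, so single statistics can be read with .loc.
print(stats.loc["count", "x"])
print(stats.loc["mean", "x"])
print(stats.loc["50%", "x"])
```

Since the result is itself a DataFrame, you can pick out only the statistics you need instead of printing the whole summary.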

Transpose the DataFrame (swap rows and columns).

df.T

2013-01-01 00:00:00 	2013-01-02 00:00:00 	2013-01-03 00:00:00 	2013-01-04 00:00:00 	2013-01-05 00:00:00 	2013-01-06 00:00:00
A 	0.705624 	-1.211129 	0.706086 	2.152279 	0.455767 	-0.639162
B 	-0.793903 	2.077101 	0.385631 	-0.493576 	0.787551 	-0.052620
C 	0.843425 	-1.795861 	0.967568 	1.184289 	0.239406 	0.288010
D 	0.672602 	0.028060 	0.271894 	-1.193300 	1.627586 	-2.205777
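A short sketch of what the transpose does to the labels, using a made-up table named small: the column labels become the index, and the index becomes the column labels.

```python
import pandas as pd

small = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["a", "b", "c"])

flipped = small.T  # rows and columns swapped

print(flipped.shape)
print(list(flipped.index))
print(list(flipped.columns))
print(flipped.loc["y", "b"])  # same cell as small.loc["b", "y"]
```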

You can sort along either axis. For example, sort the column labels in descending order.

df.sort_index(axis=1, ascending=False)

 	D 	C 	B 	A
2013-01-01 	0.672602 	0.843425 	-0.793903 	0.705624
2013-01-02 	0.028060 	-1.795861 	2.077101 	-1.211129
2013-01-03 	0.271894 	0.967568 	0.385631 	0.706086
2013-01-04 	-1.193300 	1.184289 	-0.493576 	2.152279
2013-01-05 	1.627586 	0.239406 	0.787551 	0.455767
2013-01-06 	-2.205777 	0.288010 	-0.052620 	-0.639162

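Next, sort the rows by the values in column "B", in ascending order. This is the pandas 0.15.2 API; DataFrame.sort was deprecated in 0.17 and removed in 0.20 in favor of sort_values(by='B').

df.sort(columns='B')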

 	A 	B 	C 	D
2013-01-01 	0.705624 	-0.793903 	0.843425 	0.672602
2013-01-04 	2.152279 	-0.493576 	1.184289 	-1.193300
2013-01-06 	-0.639162 	-0.052620 	0.288010 	-2.205777
2013-01-03 	0.706086 	0.385631 	0.967568 	0.271894
2013-01-05 	0.455767 	0.787551 	0.239406 	1.627586
2013-01-02 	-1.211129 	2.077101 	-1.795861 	0.028060
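The two kinds of sort can be sketched side by side on a small fixed table. This assumes pandas 0.17 or later, where sorting by values is spelled sort_values. The name scores and its contents are made up for illustration.

```python
import pandas as pd

scores = pd.DataFrame(
    {"A": [3, 1, 2], "B": [0.5, -1.0, 2.0]},
    index=["x", "y", "z"],
)

# Sort rows by the values in column B (ascending by default).
by_b = scores.sort_values(by="B")
print(list(by_b.index))

# Sort the column labels themselves in descending order.
by_cols = scores.sort_index(axis=1, ascending=False)
print(list(by_cols.columns))
```

Both calls return a new DataFrame and leave scores unchanged.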

Selecting data

Data can be extracted in several ways, for example by taking only part of the index.

With .loc you can select by a range of index labels and a list of column labels at the same time. Note that a label slice includes both endpoints.

df.loc['20130102':'20130104',['A','B']]

 	A 	B
2013-01-02 	-1.211129 	2.077101
2013-01-03 	0.706086 	0.385631
2013-01-04 	2.152279 	-0.493576
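The inclusive end of a label slice is easy to miss, so here is a minimal sketch comparing .loc with position-based .iloc. The name frame and its integer contents are hypothetical; the index mirrors the dates used above.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label-based: the slice includes 2013-01-04.
by_label = frame.loc["20130102":"20130104", ["A", "B"]]

# Position-based: the end position 4 is excluded, as in normal Python slicing.
by_position = frame.iloc[1:4, 0:2]

print(by_label.shape)
print(by_label.equals(by_position))
```

Both selections return the same three rows and two columns.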

You can group rows by the values in any column, and then aggregate each group directly.

# Create a DataFrame
df = pd.DataFrame({"A" : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   "B" : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
                   "C" : np.random.randn(8),
                   "D" : np.random.randn(8)})

df

 	A 	B 	C 	D
0 	foo 	one 	1.130975 	1.235940
1 	bar 	one 	-0.140004 	-2.714958
2 	foo 	two 	1.526578 	-0.165415
3 	bar 	three 	-1.049092 	-0.037484
4 	foo 	two 	-1.182303 	0.288754
5 	bar 	two 	0.530652 	1.204125
6 	foo 	one 	0.678477 	-0.273343
7 	foo 	three 	0.929624 	0.169822

# Group by column A, then sum each group
df.groupby('A').sum()

 	C 	D
A 		
bar 	-0.658445 	-1.548317
foo 	3.083350 	1.255758
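The output above contains only the numeric columns C and D. As a hedged sketch for newer pandas releases, where sum() may also try to aggregate string columns such as B, you can select the numeric columns explicitly before summing. The name sales and its values are made up so the totals can be checked by hand.

```python
import pandas as pd

sales = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "two"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [10.0, 20.0, 30.0, 40.0],
})

# Group by A, keep only the numeric columns, then sum within each group.
totals = sales.groupby("A")[["C", "D"]].sum()

print(totals.loc["foo", "C"])  # 1.0 + 3.0
print(totals.loc["bar", "D"])  # 20.0 + 40.0
```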


Creating a pivot table

First, create a DataFrame to turn into a pivot table.

# Create a DataFrame
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] *2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df

 	A 	B 	C 	D 	E
0 	one 	A 	foo 	0.575699 	-1.669032
1 	one 	B 	foo 	0.592889 	-2.526196
2 	two 	C 	foo 	-2.229949 	-0.703339
3 	three 	A 	bar 	0.801380 	-1.638983
4 	one 	B 	bar 	-0.135691 	-0.302586
5 	one 	C 	bar 	0.317401 	1.169608
6 	two 	A 	foo 	0.064460 	-0.109437
7 	three 	B 	foo 	-0.605017 	1.043246
8 	one 	C 	foo 	-0.365220 	0.850535
9 	one 	A 	bar 	1.033552 	0.226002
10 	two 	B 	bar 	-0.260542 	0.352249
11 	three 	C 	bar 	0.518531 	1.407827

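It can be converted to a pivot table fairly easily. Here columns A and B become the row index, the values of C become the columns, and each cell holds the mean of D (the default aggregation). Combinations that do not occur in the data show up as NaN.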

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

 	C 	bar 	foo
A 	B 		
one 	A 	1.033552 	0.575699
 	B 	-0.135691 	0.592889
 	C 	0.317401 	-0.365220
three 	A 	0.801380 	NaN
 	B 	NaN 	-0.605017
 	C 	0.518531 	NaN
two 	A 	NaN 	0.064460
 	B 	-0.260542 	NaN
 	C 	NaN 	-2.229949
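Two behaviors of pivot_table are easier to see on fixed data: duplicate (A, B, C) combinations are averaged, because the default aggregation is the mean, and missing combinations become NaN. A minimal sketch, with a made-up table named records:

```python
import numpy as np
import pandas as pd

records = pd.DataFrame({
    "A": ["one", "one", "two", "two"],
    "B": ["x", "x", "x", "y"],
    "C": ["foo", "foo", "bar", "foo"],
    "D": [1.0, 3.0, 5.0, 7.0],
})

table = pd.pivot_table(records, values="D", index=["A", "B"], columns=["C"])

# ("one", "x", "foo") occurs twice, so the cell is the mean of 1.0 and 3.0.
print(table.loc[("one", "x"), "foo"])

# ("one", "x", "bar") never occurs, so the cell is NaN.
print(np.isnan(table.loc[("one", "x"), "bar"]))
```

If you need a different aggregation, pivot_table accepts an aggfunc argument, for example aggfunc="sum".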

Summary

Even one quick read-through pays off: when you later need one of these operations, you will remember that it exists, which is a big help.

References

pandas 0.15.2 documentation http://pandas.pydata.org/pandas-docs/stable/index.html

10 Minutes to pandas http://pandas.pydata.org/pandas-docs/stable/10min.html
