To commemorate the release of the Japanese version of Python for Data Analysis Things that are likely to be used often or places that are likely to get clogged
It's been so long, so the table of contents
Use concat to set axis = 1
s1 = pd.Series([1,2,3,4,5], index=[0,1,2,3,4])
s2 = pd.Series([10,20,30,40,50], index=[10,11,12,13,14])
s3 = pd.Series([100,200,300,400,500], index=[0,2,4,6,8])
print pd.concat([s1, s2, s3], axis=1)
"""
0 1 2
0 1 NaN 100
1 2 NaN NaN
2 3 NaN 200
3 4 NaN NaN
4 5 NaN 300
6 NaN NaN 400
8 NaN NaN 500
10 NaN 10 NaN
11 NaN 20 NaN
12 NaN 30 NaN
13 NaN 40 NaN
14 NaN 50 NaN
"""
Various are prepared and confused It's annoying, but I think the reference by ix is the easiest to understand.
#Prepare DataFrame
df = DataFrame(arange(30).reshape((6, 5)), index=list("ASDFGH"), columns=list("abcde"))
print df
"""
a b c d e
A 0 1 2 3 4
S 5 6 7 8 9
D 10 11 12 13 14
F 15 16 17 18 19
G 20 21 22 23 24
H 25 26 27 28 29
"""
Specify the column name for column reference When specifying more than one, specify in list format If there is only one column to specify, the instance variable is prepared by the column name, so you can specify it.
#Specify column name
print df["a"]
"""
A 0
S 5
D 10
F 15
G 20
H 25
Name: a, dtype: int64
"""
#Specify multiple column names
print df[["a", "c", "e"]]
"""
a c e
A 0 2 4
S 5 7 9
D 10 12 14
F 15 17 19
G 20 22 24
H 25 27 29
"""
#Specify the instance variable of the column name
print df.a
"""
A 0
S 5
D 10
F 15
G 20
H 25
"""
Designated by slicing etc. Specifying by name (index label) attached to the line will be described later.
print df[0:3]
"""
a b c d e
A 0 1 2 3 4
S 5 6 7 8 9
D 10 11 12 13 14
"""
#Negative index specification
print df[-2:]
"""
a b c d e
G 20 21 22 23 24
H 25 26 27 28 29
"""
Can be referenced by slicing the ix attribute In fact, if you remember this much, you can use it in the same format at any time and it is convenient Specify by [row, column] like NumPy 2D array slicing Can also be specified by index label However, in the case of slicing by index label, note that the index label behind is included in the range.
#Designated by index
print df.ix[0, 3]
"""
3
"""
#Slicing by index
print df.ix[0:3, 0:2]
"""
a b
A 0 1
S 5 6
D 10 11
"""
#Specify separate rows or columns with multiple indexes
print df.ix[[0, 2, 4], [0, 2, 4]]
"""
a c e
A 0 2 4
D 10 12 14
G 20 22 24
"""
#Designated by index label
print df.ix["D", "a"]
"""
10
"""
#Range specification by index label
print df.ix["A":"D", "a":"b"]
"""
a b
A 0 1
S 5 6
D 10 11
"""
#Specify separate rows or columns with multiple index labels
print df.ix[["A", "H", "S"], ["c", "b", "a"]]
"""
c b a
A 2 1 0
H 27 26 25
S 7 6 5
"""
Since I did DataFrame, I also did Series Unlike DataFrame, you can also specify by index from the beginning. It is possible to specify by index label, so I don't think there is any particular clutter. Slicing is also possible
#Prepare Series
s1 = pd.Series(range(6), index=list("abcdef"))
print s1
"""
a 0
b 1
c 2
d 3
e 4
f 5
dtype: int64
"""
#Designated by index
print s1[0:3]
"""
a 0
b 1
c 2
dtype: int64
"""
#Designated by index label
print ["b":"e"]
"""
b 1
c 2
d 3
e 4
dtype: int64
"""
If you evaluate a Series or DataFrame like NumPy's ndarray, the bool Series or DataFrame that was evaluated for each element will be returned.
print df > 5
"""
a b c d e
A False False False False False
S False True True True True
D True True True True True
F True True True True True
G True True True True True
H True True True True True
"""
print s1 > 3
"""
a False
b False
c False
d False
e True
f True
dtype: bool
"""
You can mask the original Series or DataFrame with what you get back
#DataFrame mask
print df[df > 5]
"""
a b c d e
A NaN NaN NaN NaN NaN
S NaN 6 7 8 9
D 10 11 12 13 14
F 15 16 17 18 19
G 20 21 22 23 24
H 25 26 27 28 29
"""
#Series mask
print s1[s1 > 3]
"""
e 4
f 5
dtype: int64
"""
Operations are performed between the same index labels Those without the same index label will be NaN If you do not want to make it NaN, you can calculate after assigning 0 to NaN with fill_value.
#Prepare Series
s1 = pd.Series(range(6), index=list("abcdef"))
s2 = pd.Series([10,20,30,40,50], index=list("asdfg"))
print s1
"""
a 0
b 1
c 2
d 3
e 4
f 5
dtype: int64
"""
print s2
"""
a 10
s 20
d 30
f 40
g 50
dtype: int64
"""
# s1 +may be s2
print s1.add(s2)
"""
a 10
b NaN
c NaN
d 33
e NaN
f 45
g NaN
s NaN
dtype: float64
"""
#Operate after replacing NaN with 0
print s1.add(s2, fill_value=0)
"""
a 10
b 1
c 2
d 33
e 4
f 45
g 50
s 20
dtype: float64
"""
#Prepare DataFrame
df1 = DataFrame(arange(30).reshape((6,5)), index=list("ASDFGH"), columns=list("abcde"))
df2 = DataFrame(arange(42).reshape((7,6)), index=list("ASDFGHJ"), columns=list("abcdef"))
print df1
"""
a b c d e
A 0 1 2 3 4
S 5 6 7 8 9
D 10 11 12 13 14
F 15 16 17 18 19
G 20 21 22 23 24
H 25 26 27 28 29
"""
print df2
"""
a b c d e f
A 0 1 2 3 4 5
S 6 7 8 9 10 11
D 12 13 14 15 16 17
F 18 19 20 21 22 23
G 24 25 26 27 28 29
H 30 31 32 33 34 35
J 36 37 38 39 40 41
"""
# df1 +may be df2
print df1.add(df2)
"""
a b c d e f
A 0 2 4 6 8 NaN
D 22 24 26 28 30 NaN
F 33 35 37 39 41 NaN
G 44 46 48 50 52 NaN
H 55 57 59 61 63 NaN
J NaN NaN NaN NaN NaN NaN
S 11 13 15 17 19 NaN
"""
#Operate after replacing NaN with 0
print df1.add(df2, fill_value=0)
"""
a b c d e f
A 0 2 4 6 8 5
D 22 24 26 28 30 17
F 33 35 37 39 41 23
G 44 46 48 50 52 29
H 55 57 59 61 63 35
J 36 37 38 39 40 41
S 11 13 15 17 19 11
"""
If one of the index labels of the data to be calculated is completely included in the index label of the other data, fill_value may be performed with reindex before the calculation.
#Taking df1 and df2 above as an example
#The index label for df1 is completely contained in df2
print df1.index.isin(df2.index).all(), df1.columns.isin(df2.columns).all()
"""
True True
"""
print df1.reindex(index=df2.index, columns=df2.columns, fill_value=0) + df2
"""
a b c d e f
A 0 2 4 6 8 5
S 11 13 15 17 19 11
D 22 24 26 28 30 17
F 33 35 37 39 41 23
G 44 46 48 50 52 29
H 55 57 59 61 63 35
J 36 37 38 39 40 41
"""
Multiple index labels can be attached to one row or column Use the sortlevel method to sort by the label of the hierarchical index
# Series
s = Series(range(10), index=[list("aaabbbccdd"),[1,2,3,2,1,3,1,2,1,2]])
print s
"""
a 1 0
2 1
3 2
b 2 3
1 4
3 5
c 1 6
2 7
d 1 8
2 9
dtype: int64
"""
#See outer index label
print s["a"]
"""
1 0
2 1
3 2
dtype: int64
"""
#See inner index label
print s[:, 1]
"""
a 0
b 4
c 6
d 8
dtype: int64
"""
#Change index label hierarchy
#Swap the index label of the 0th layer and the index label of the 1st layer
# s.swaplevel(1, 0)May be
print s.swaplevel(0, 1)
"""
1 a 0
2 a 1
3 a 2
2 b 3
1 b 4
3 b 5
1 c 6
2 c 7
1 d 8
2 d 9
dtype: int64
"""
#Sort by outer index label
print s.swaplevel(0,1).sortlevel(0)
"""
1 a 0
b 4
c 6
d 8
2 a 1
b 3
c 7
d 9
3 a 2
b 5
dtype: int64
"""
stack is a column-to-row pivot conversion unstack is a row-to-column pivot conversion Since it is a row-to-column conversion, it can only be used with a DataFrame or Series with a hierarchical index label.
#Series object with hierarchical index label
print s
"""
a 1 0
2 1
3 2
b 2 3
1 4
3 5
c 1 6
2 7
d 1 8
2 9
dtype: int64
"""
#Convert rows to columns
print s.unstack()
"""
1 2 3
a 0 1 2
b 4 3 5
c 6 7 NaN
d 8 9 NaN
"""
print df
"""
a b c d e
A 0 1 2 3 4
S 5 6 7 8 9
D 10 11 12 13 14
F 15 16 17 18 19
G 20 21 22 23 24
H 25 26 27 28 29
"""
#Column to row
print df.stack()
"""
A a 0
b 1
c 2
d 3
e 4
S a 5
b 6
c 7
d 8
e 9
D a 10
b 11
c 12
d 13
e 14
F a 15
b 16
c 17
d 18
e 19
G a 20
b 21
c 22
d 23
e 24
H a 25
b 26
c 27
d 28
e 29
dtype: int64
"""
#Row to column
print df.unstack()
"""
a A 0
S 5
D 10
F 15
G 20
H 25
b A 1
S 6
D 11
F 16
G 21
H 26
c A 2
S 7
D 12
F 17
G 22
H 27
d A 3
S 8
D 13
F 18
G 23
H 28
e A 4
S 9
D 14
F 19
G 24
H 29
dtype: int64
"""
You can change the name by overriding the index for rows and the columns attribute for columns. Alternatively, use the rename method to pass the name before and after the change in dictionary format. Using the rename method returns a copy of the data and does not change the original data
print df
"""
a b c d e
A 0 1 2 3 4
S 5 6 7 8 9
D 10 11 12 13 14
F 15 16 17 18 19
G 20 21 22 23 24
H 25 26 27 28 29
"""
#Rename line
df.index = ["AAA", "SSS", "DDD", "FFF", "GGG", "HHH"]
#Rename column
df.columns=["one", "two", "three", "four", "five"]
print df
"""
one two three four five
AAA 0 1 2 3 4
SSS 5 6 7 8 9
DDD 10 11 12 13 14
FFF 15 16 17 18 19
GGG 20 21 22 23 24
HHH 25 26 27 28 29
"""
#use rename
print df.rename(index={"AAA":"a", "SSS":"s", "DDD":"d", "FFF":"f", "GGG":"g", "HHH":"h"},
columns={"one":1, "two":2, "three":3, "four":4, "five":5})
"""
1 2 3 4 5
a 0 1 2 3 4
s 5 6 7 8 9
d 10 11 12 13 14
f 15 16 17 18 19
g 20 21 22 23 24
h 25 26 27 28 29
"""
You can name the index label itself The name you give can be used when converting axes such as swaplevel and unstack.
#Rename index
df.index.names = ["UPPER"]
#Rename the column
df.columns.names = ["lower"]
print df
"""
lower one two three four five
UPPER
AAA 0 1 2 3 4
SSS 5 6 7 8 9
DDD 10 11 12 13 14
FFF 15 16 17 18 19
GGG 20 21 22 23 24
HHH 25 26 27 28 29
"""
(A bin is a column of a histogram) Data values can be split into bins using the pd.cut method
#Age and gender data
df = DataFrame([[20,"F"],[22,"M"],[25,"M"],[27,"M"],[21,"F"],[23,"M"],[37,"F"],[31,"M"],[61,"F"],[45,"M"],[41,"F"],[32,"M"]], columns=["age", "sex"])
print df
"""
age sex
0 20 F
1 22 M
2 25 M
3 27 M
4 21 F
5 23 M
6 37 F
7 31 M
8 61 F
9 45 M
10 41 F
11 32 M
"""
#Value when splitting into bins
bins = [18, 25, 35, 60, 100]
#Bin name
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]
#Binning
print pd.cut(df.age, bins, labels=group_names)
"""
Categorical:
[Youth, Youth, Youth, YoungAdult, Youth, Youth, nan, YoungAdult, nan, nan, nan, YoungAdult]
Levels (4): Index(['Youth', 'YoungAdult', 'MiddleAged', 'Senior'], dtype=object)
"""
#Add bin column to df
df["bin"] = pd.cut(df.age, bins, labels=group_names)
print df
"""
age sex bin
0 20 F Youth
1 22 M Youth
2 25 M Youth
3 27 M YoungAdult
4 21 F Youth
5 23 M Youth
6 37 F MiddleAged
7 31 M YoungAdult
8 61 F Senior
9 45 M MiddleAged
10 41 F MiddleAged
11 32 M YoungAdult
"""
If you want to convert only a specific column to an index, use the set_index method of DataFrame.
print df
"""
age sex bin
0 20 F Youth
1 22 M Youth
2 25 M Youth
3 27 M YoungAdult
4 21 F Youth
5 23 M Youth
6 37 F MiddleAged
7 31 M YoungAdult
8 61 F Senior
9 45 M MiddleAged
10 41 F MiddleAged
11 32 M YoungAdult
"""
#Convert bin column to index
print df.set_index("bin")
"""
age sex
bin
Youth 20 F
Youth 22 M
Youth 25 M
YoungAdult 27 M
Youth 21 F
Youth 23 M
MiddleAged 37 F
YoungAdult 31 M
Senior 61 F
MiddleAged 45 M
MiddleAged 41 F
YoungAdult 32 M
"""
#Give a list as an argument to convert multiple columns to a hierarchical index
print df.set_index(["bin", "sex"])
"""
age
bin sex
Youth F 20
M 22
M 25
YoungAdult M 27
Youth F 21
M 23
MiddleAged F 37
YoungAdult M 31
Senior F 61
MiddleAged M 45
F 41
YoungAdult M 32
"""
#sort
print df.set_index(["bin", "sex"]).sortlevel("bin")
"""
age
bin sex
MiddleAged F 37
F 41
M 45
Senior F 61
YoungAdult M 27
M 31
M 32
Youth F 20
F 21
M 22
M 25
M 23
"""
DataFrame take can be used to retrieve only the required rows Random sampling can be performed by randomly extracting columns using this.
#Specify the row to sample
print df.take([1, 3, 5])
"""
age sex bin
1 22 M Youth
3 27 M YoungAdult
5 23 M Youth
"""
#Randomly sample 5
sampler = np.random.permutation(len(df))
print df.take(sampler[:5])
"""
age sex bin
8 61 F Senior
2 25 M Youth
7 31 M YoungAdult
1 22 M Youth
0 20 F Youth
"""
You can use pandas' get_dummies method to convert categorical data (qualitative data) to dummy variables. Are used when quantifying type I
#Make the gender column a dummy variable
dum = pd.get_dummies(df["sex"])
print dum
"""
F M
0 1 0
1 0 1
2 0 1
3 0 1
4 1 0
5 0 1
6 1 0
7 0 1
8 1 0
9 0 1
10 1 0
11 0 1
"""
#Add to df
df = pd.concat((df, dum), axis=1)
df = df.drop("sex", axis=1)
print df
"""
age bin F M
0 20 Youth 1 0
1 22 Youth 0 1
2 25 Youth 0 1
3 27 YoungAdult 0 1
4 21 Youth 1 0
5 23 Youth 0 1
6 37 MiddleAged 1 0
7 31 YoungAdult 0 1
8 61 Senior 1 0
9 45 MiddleAged 0 1
10 41 MiddleAged 1 0
11 32 YoungAdult 0 1
"""
Grouping by value can be done by using groupby Use the apply method of the groupby object to apply the same processing to each group
print df.groupby("bin")
"""
<pandas.core.groupby.DataFrameGroupBy object at 0x106af9a90>
"""
#Calculate mean for each group
f = lambda x: x.mean()
print df.groupby("bin").apply(f)
"""
age F M
bin
MiddleAged 41.0 0.666667 0.333333
Senior 61.0 1.000000 0.000000
YoungAdult 30.0 0.000000 1.000000
Youth 22.2 0.400000 0.600000
"""
Example: Bin analysis Group value of bins and get statistics for each bin
frame = DataFrame({"data1": np.random.randn(1000), "data2":np.random.randn(1000)})
factor = pd.cut(frame.data1, 4) #Divide the value range of data1 evenly into 4 bins
#Define a function to apply
def get_stats(group):
return {"min": group.min(), "max": group.max(), "count": group.count(), "mean": group.mean()}
#Produce a statistic of the value of data2 for each bin of data1
grouped = frame.data2.groupby(factor)
print grouped.apply(get_stats).unstack()
"""
count max mean min
data1
(-3.205, -1.673] 44 2.544910 -0.108558 -1.984124
(-1.673, -0.147] 414 3.420818 0.003078 -3.168528
(-0.147, 1.379] 447 2.563359 -0.018062 -3.175148
(1.379, 2.905] 95 2.498611 0.140166 -2.439985
"""
The same function can be applied to each column of the GroupBy object using the agg method
df["random"] = [random.choice((0,1)) for i in range(len(df))]
print df
"""
age sex random
0 20 F 1
1 22 M 1
2 25 M 1
3 27 M 0
4 21 F 1
5 23 M 1
6 37 F 1
7 31 M 0
8 61 F 1
9 45 M 1
10 41 F 0
11 32 M 0
"""
print df.groupby("sex").agg([np.max, np.min, np.mean])
"""
age random
amax amin mean amax amin mean
sex
F 61 20 36.000000 1 0 0.800000
M 45 22 29.285714 1 0 0.571429
"""
Recommended Posts