[PYTHON] pandas memo

To commemorate the release of the Japanese version of Python for Data Analysis Things that are likely to be used often or places that are likely to get clogged

It's been so long, so the table of contents

Combine multiple Series into a DataFrame

Use concat to set axis = 1

s1 = pd.Series([1,2,3,4,5], index=[0,1,2,3,4])
s2 = pd.Series([10,20,30,40,50], index=[10,11,12,13,14])
s3 = pd.Series([100,200,300,400,500], index=[0,2,4,6,8])
print pd.concat([s1, s2, s3], axis=1)
"""
     0   1    2
0    1 NaN  100
1    2 NaN  NaN
2    3 NaN  200
3    4 NaN  NaN
4    5 NaN  300
6  NaN NaN  400
8  NaN NaN  500
10 NaN  10  NaN
11 NaN  20  NaN
12 NaN  30  NaN
13 NaN  40  NaN
14 NaN  50  NaN
"""

DataFrame index reference

Various are prepared and confused It's annoying, but I think the reference by ix is the easiest to understand.

#Prepare DataFrame
df = DataFrame(arange(30).reshape((6, 5)), index=list("ASDFGH"), columns=list("abcde"))
print df
"""
    a   b   c   d   e
A   0   1   2   3   4
S   5   6   7   8   9
D  10  11  12  13  14
F  15  16  17  18  19
G  20  21  22  23  24
H  25  26  27  28  29
"""

Reference to column

Specify the column name for column reference When specifying more than one, specify in list format If there is only one column to specify, the instance variable is prepared by the column name, so you can specify it.

#Specify column name
print df["a"]
"""
A     0
S     5
D    10
F    15
G    20
H    25
Name: a, dtype: int64
"""

#Specify multiple column names
print df[["a", "c", "e"]]
"""
    a   c   e
A   0   2   4
S   5   7   9
D  10  12  14
F  15  17  19
G  20  22  24
H  25  27  29
"""

#Specify the instance variable of the column name
print df.a
"""
A     0
S     5
D    10
F    15
G    20
H    25
"""

Row reference

Designated by slicing etc. Specifying by name (index label) attached to the line will be described later.

print df[0:3]
"""
    a   b   c   d   e
A   0   1   2   3   4
S   5   6   7   8   9
D  10  11  12  13  14
"""

#Negative index specification
print df[-2:]
"""
    a   b   c   d   e
G  20  21  22  23  24
H  25  26  27  28  29
"""

Reference by specifying the range of rows and columns at the same time (reference by ix)

Can be referenced by slicing the ix attribute In fact, if you remember this much, you can use it in the same format at any time and it is convenient Specify by [row, column] like NumPy 2D array slicing Can also be specified by index label However, in the case of slicing by index label, note that the index label behind is included in the range.

#Designated by index
print df.ix[0, 3]
"""
3
"""

#Slicing by index
print df.ix[0:3, 0:2]
"""
    a   b
A   0   1
S   5   6
D  10  11
"""

#Specify separate rows or columns with multiple indexes
print df.ix[[0, 2, 4], [0, 2, 4]]
"""
    a   c   e
A   0   2   4
D  10  12  14
G  20  22  24
"""

#Designated by index label
print df.ix["D", "a"]
"""
10
"""

#Range specification by index label
print df.ix["A":"D", "a":"b"]
"""
    a   b
A   0   1
S   5   6
D  10  11
"""

#Specify separate rows or columns with multiple index labels
print df.ix[["A", "H", "S"], ["c", "b", "a"]]
"""
    c   b   a
A   2   1   0
H  27  26  25
S   7   6   5
"""

Series index reference

Since I did DataFrame, I also did Series Unlike DataFrame, you can also specify by index from the beginning. It is possible to specify by index label, so I don't think there is any particular clutter. Slicing is also possible

#Prepare Series
s1 = pd.Series(range(6), index=list("abcdef"))
print s1
"""
a    0
b    1
c    2
d    3
e    4
f    5
dtype: int64
"""

#Designated by index
print s1[0:3]
"""
a    0
b    1
c    2
dtype: int64
"""

#Designated by index label
print ["b":"e"]
"""
b    1
c    2
d    3
e    4
dtype: int64
"""

Masking by bool value

If you evaluate a Series or DataFrame like NumPy's ndarray, the bool Series or DataFrame that was evaluated for each element will be returned.

print df > 5
"""
       a      b      c      d      e
A  False  False  False  False  False
S  False   True   True   True   True
D   True   True   True   True   True
F   True   True   True   True   True
G   True   True   True   True   True
H   True   True   True   True   True
"""

print s1 > 3
"""
a    False
b    False
c    False
d    False
e     True
f     True
dtype: bool
"""

You can mask the original Series or DataFrame with what you get back

#DataFrame mask
print df[df > 5]
"""
    a   b   c   d   e
A NaN NaN NaN NaN NaN
S NaN   6   7   8   9
D  10  11  12  13  14
F  15  16  17  18  19
G  20  21  22  23  24
H  25  26  27  28  29
"""

#Series mask
print s1[s1 > 3]
"""
e    4
f    5
dtype: int64
"""

Operations between Series and DataFrame

Operations are performed between the same index labels Those without the same index label will be NaN If you do not want to make it NaN, you can calculate after assigning 0 to NaN with fill_value.

#Prepare Series
s1 = pd.Series(range(6), index=list("abcdef"))
s2 = pd.Series([10,20,30,40,50], index=list("asdfg"))

print s1
"""
a    0
b    1
c    2
d    3
e    4
f    5
dtype: int64
"""

print s2
"""
a    10
s    20
d    30
f    40
g    50
dtype: int64
"""

# s1 +may be s2
print s1.add(s2)
"""
a    10
b   NaN
c   NaN
d    33
e   NaN
f    45
g   NaN
s   NaN
dtype: float64
"""

#Operate after replacing NaN with 0
print s1.add(s2, fill_value=0)
"""
a    10
b     1
c     2
d    33
e     4
f    45
g    50
s    20
dtype: float64
"""

#Prepare DataFrame
df1 = DataFrame(arange(30).reshape((6,5)), index=list("ASDFGH"), columns=list("abcde"))
df2 = DataFrame(arange(42).reshape((7,6)), index=list("ASDFGHJ"), columns=list("abcdef"))

print df1
"""
    a   b   c   d   e
A   0   1   2   3   4
S   5   6   7   8   9
D  10  11  12  13  14
F  15  16  17  18  19
G  20  21  22  23  24
H  25  26  27  28  29
"""

print df2
"""
    a   b   c   d   e   f
A   0   1   2   3   4   5
S   6   7   8   9  10  11
D  12  13  14  15  16  17
F  18  19  20  21  22  23
G  24  25  26  27  28  29
H  30  31  32  33  34  35
J  36  37  38  39  40  41
"""

# df1 +may be df2
print df1.add(df2)
"""
    a   b   c   d   e   f
A   0   2   4   6   8 NaN
D  22  24  26  28  30 NaN
F  33  35  37  39  41 NaN
G  44  46  48  50  52 NaN
H  55  57  59  61  63 NaN
J NaN NaN NaN NaN NaN NaN
S  11  13  15  17  19 NaN
"""

#Operate after replacing NaN with 0
print df1.add(df2, fill_value=0)
"""
    a   b   c   d   e   f
A   0   2   4   6   8   5
D  22  24  26  28  30  17
F  33  35  37  39  41  23
G  44  46  48  50  52  29
H  55  57  59  61  63  35
J  36  37  38  39  40  41
S  11  13  15  17  19  11
"""

If one of the index labels of the data to be calculated is completely included in the index label of the other data, fill_value may be performed with reindex before the calculation.

#Taking df1 and df2 above as an example
#The index label for df1 is completely contained in df2
print df1.index.isin(df2.index).all(), df1.columns.isin(df2.columns).all()
"""
True True
"""

print df1.reindex(index=df2.index, columns=df2.columns, fill_value=0) + df2
"""
    a   b   c   d   e   f
A   0   2   4   6   8   5
S  11  13  15  17  19  11
D  22  24  26  28  30  17
F  33  35  37  39  41  23
G  44  46  48  50  52  29
H  55  57  59  61  63  35
J  36  37  38  39  40  41
"""

Hierarchical index

Multiple index labels can be attached to one row or column Use the sortlevel method to sort by the label of the hierarchical index

# Series
s = Series(range(10), index=[list("aaabbbccdd"),[1,2,3,2,1,3,1,2,1,2]])
print s
"""
a  1    0
   2    1
   3    2
b  2    3
   1    4
   3    5
c  1    6
   2    7
d  1    8
   2    9
dtype: int64
"""

#See outer index label
print s["a"]
"""
1    0
2    1
3    2
dtype: int64
"""

#See inner index label
print s[:, 1]
"""
a    0
b    4
c    6
d    8
dtype: int64
"""

#Change index label hierarchy
#Swap the index label of the 0th layer and the index label of the 1st layer
# s.swaplevel(1, 0)May be
print s.swaplevel(0, 1)
"""
1  a    0
2  a    1
3  a    2
2  b    3
1  b    4
3  b    5
1  c    6
2  c    7
1  d    8
2  d    9
dtype: int64
"""

#Sort by outer index label
print s.swaplevel(0,1).sortlevel(0)
"""
1  a    0
   b    4
   c    6
   d    8
2  a    1
   b    3
   c    7
   d    9
3  a    2
   b    5
dtype: int64
"""

stack method and unstack method

stack is a column-to-row pivot conversion unstack is a row-to-column pivot conversion Since it is a row-to-column conversion, it can only be used with a DataFrame or Series with a hierarchical index label.

#Series object with hierarchical index label
print s
"""
a  1    0
   2    1
   3    2
b  2    3
   1    4
   3    5
c  1    6
   2    7
d  1    8
   2    9
dtype: int64
"""

#Convert rows to columns
print s.unstack()
"""
   1  2   3
a  0  1   2
b  4  3   5
c  6  7 NaN
d  8  9 NaN
"""

print df
"""
    a   b   c   d   e
A   0   1   2   3   4
S   5   6   7   8   9
D  10  11  12  13  14
F  15  16  17  18  19
G  20  21  22  23  24
H  25  26  27  28  29
"""

#Column to row
print df.stack()
"""
A  a     0
   b     1
   c     2
   d     3
   e     4
S  a     5
   b     6
   c     7
   d     8
   e     9
D  a    10
   b    11
   c    12
   d    13
   e    14
F  a    15
   b    16
   c    17
   d    18
   e    19
G  a    20
   b    21
   c    22
   d    23
   e    24
H  a    25
   b    26
   c    27
   d    28
   e    29
dtype: int64
"""

#Row to column
print df.unstack()
"""
a  A     0
   S     5
   D    10
   F    15
   G    20
   H    25
b  A     1
   S     6
   D    11
   F    16
   G    21
   H    26
c  A     2
   S     7
   D    12
   F    17
   G    22
   H    27
d  A     3
   S     8
   D    13
   F    18
   G    23
   H    28
e  A     4
   S     9
   D    14
   F    19
   G    24
   H    29
dtype: int64
"""

Naming

Row or column naming

You can change the name by overriding the index for rows and the columns attribute for columns. Alternatively, use the rename method to pass the name before and after the change in dictionary format. Using the rename method returns a copy of the data and does not change the original data

print df
"""
    a   b   c   d   e
A   0   1   2   3   4
S   5   6   7   8   9
D  10  11  12  13  14
F  15  16  17  18  19
G  20  21  22  23  24
H  25  26  27  28  29
"""

#Rename line
df.index = ["AAA", "SSS", "DDD", "FFF", "GGG", "HHH"]
#Rename column
df.columns=["one", "two", "three", "four", "five"]

print df
"""
     one  two  three  four  five
AAA    0    1      2     3     4
SSS    5    6      7     8     9
DDD   10   11     12    13    14
FFF   15   16     17    18    19
GGG   20   21     22    23    24
HHH   25   26     27    28    29
"""

#use rename
print df.rename(index={"AAA":"a", "SSS":"s", "DDD":"d", "FFF":"f", "GGG":"g", "HHH":"h"},
columns={"one":1, "two":2, "three":3, "four":4, "five":5})
"""
    1   2   3   4   5
a   0   1   2   3   4
s   5   6   7   8   9
d  10  11  12  13  14
f  15  16  17  18  19
g  20  21  22  23  24
h  25  26  27  28  29
"""

Naming the index label itself

You can name the index label itself The name you give can be used when converting axes such as swaplevel and unstack.

#Rename index
df.index.names = ["UPPER"]
#Rename the column
df.columns.names = ["lower"]

print df
"""
lower  one  two  three  four  five
UPPER                             
AAA      0    1      2     3     4
SSS      5    6      7     8     9
DDD     10   11     12    13    14
FFF     15   16     17    18    19
GGG     20   21     22    23    24
HHH     25   26     27    28    29
"""

Binning

(A bin is a column of a histogram) Data values can be split into bins using the pd.cut method

#Age and gender data
df = DataFrame([[20,"F"],[22,"M"],[25,"M"],[27,"M"],[21,"F"],[23,"M"],[37,"F"],[31,"M"],[61,"F"],[45,"M"],[41,"F"],[32,"M"]], columns=["age", "sex"])
print df
"""
   age sex
0   20   F
1   22   M
2   25   M
3   27   M
4   21   F
5   23   M
6   37   F
7   31   M
8   61   F
9   45   M
10  41   F
11  32   M
"""

#Value when splitting into bins
bins = [18, 25, 35, 60, 100]
#Bin name
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]
#Binning
print pd.cut(df.age, bins, labels=group_names)
"""
Categorical: 
[Youth, Youth, Youth, YoungAdult, Youth, Youth, nan, YoungAdult, nan, nan, nan, YoungAdult]
Levels (4): Index(['Youth', 'YoungAdult', 'MiddleAged', 'Senior'], dtype=object)
"""

#Add bin column to df
df["bin"] = pd.cut(df.age, bins, labels=group_names)
print df
"""
    age sex         bin
0    20   F       Youth
1    22   M       Youth
2    25   M       Youth
3    27   M  YoungAdult
4    21   F       Youth
5    23   M       Youth
6    37   F  MiddleAged
7    31   M  YoungAdult
8    61   F      Senior
9    45   M  MiddleAged
10   41   F  MiddleAged
11   32   M  YoungAdult
"""

Column and index conversion

If you want to convert only a specific column to an index, use the set_index method of DataFrame.

print df
"""
    age sex         bin
0    20   F       Youth
1    22   M       Youth
2    25   M       Youth
3    27   M  YoungAdult
4    21   F       Youth
5    23   M       Youth
6    37   F  MiddleAged
7    31   M  YoungAdult
8    61   F      Senior
9    45   M  MiddleAged
10   41   F  MiddleAged
11   32   M  YoungAdult
"""

#Convert bin column to index
print df.set_index("bin")
"""
            age sex
bin                
Youth        20   F
Youth        22   M
Youth        25   M
YoungAdult   27   M
Youth        21   F
Youth        23   M
MiddleAged   37   F
YoungAdult   31   M
Senior       61   F
MiddleAged   45   M
MiddleAged   41   F
YoungAdult   32   M
"""

#Give a list as an argument to convert multiple columns to a hierarchical index
print df.set_index(["bin", "sex"])
"""
                age
bin        sex     
Youth      F     20
           M     22
           M     25
YoungAdult M     27
Youth      F     21
           M     23
MiddleAged F     37
YoungAdult M     31
Senior     F     61
MiddleAged M     45
           F     41
YoungAdult M     32
"""

#sort
print df.set_index(["bin", "sex"]).sortlevel("bin")
"""
                age
bin        sex     
MiddleAged F     37
           F     41
           M     45
Senior     F     61
YoungAdult M     27
           M     31
           M     32
Youth      F     20
           F     21
           M     22
           M     25
           M     23
"""

Random sampling

DataFrame take can be used to retrieve only the required rows Random sampling can be performed by randomly extracting columns using this.

#Specify the row to sample
print df.take([1, 3, 5])
"""
   age sex         bin
1   22   M       Youth
3   27   M  YoungAdult
5   23   M       Youth
"""

#Randomly sample 5
sampler = np.random.permutation(len(df))
print df.take(sampler[:5])
"""
   age sex         bin
8   61   F      Senior
2   25   M       Youth
7   31   M  YoungAdult
1   22   M       Youth
0   20   F       Youth
"""

Make categorical data a dummy variable

You can use pandas' get_dummies method to convert categorical data (qualitative data) to dummy variables. Are used when quantifying type I

#Make the gender column a dummy variable
dum = pd.get_dummies(df["sex"])
print dum
"""
    F  M
0   1  0
1   0  1
2   0  1
3   0  1
4   1  0
5   0  1
6   1  0
7   0  1
8   1  0
9   0  1
10  1  0
11  0  1
"""

#Add to df
df = pd.concat((df, dum), axis=1)
df = df.drop("sex", axis=1)
print df
"""
    age         bin  F  M
0    20       Youth  1  0
1    22       Youth  0  1
2    25       Youth  0  1
3    27  YoungAdult  0  1
4    21       Youth  1  0
5    23       Youth  0  1
6    37  MiddleAged  1  0
7    31  YoungAdult  0  1
8    61      Senior  1  0
9    45  MiddleAged  0  1
10   41  MiddleAged  1  0
11   32  YoungAdult  0  1
"""

grouping

Grouping by value can be done by using groupby Use the apply method of the groupby object to apply the same processing to each group

print df.groupby("bin")
"""
<pandas.core.groupby.DataFrameGroupBy object at 0x106af9a90>
"""

#Calculate mean for each group
f = lambda x: x.mean()
print df.groupby("bin").apply(f)
"""
             age         F         M
bin                                 
MiddleAged  41.0  0.666667  0.333333
Senior      61.0  1.000000  0.000000
YoungAdult  30.0  0.000000  1.000000
Youth       22.2  0.400000  0.600000
"""

Example: Bin analysis Group value of bins and get statistics for each bin

frame = DataFrame({"data1": np.random.randn(1000), "data2":np.random.randn(1000)})
factor = pd.cut(frame.data1, 4)  #Divide the value range of data1 evenly into 4 bins

#Define a function to apply
def get_stats(group):
    return {"min": group.min(), "max": group.max(), "count": group.count(), "mean": group.mean()}

#Produce a statistic of the value of data2 for each bin of data1
grouped = frame.data2.groupby(factor)
print grouped.apply(get_stats).unstack()
"""
                  count       max      mean       min
data1                                                
(-3.205, -1.673]     44  2.544910 -0.108558 -1.984124
(-1.673, -0.147]    414  3.420818  0.003078 -3.168528
(-0.147, 1.379]     447  2.563359 -0.018062 -3.175148
(1.379, 2.905]       95  2.498611  0.140166 -2.439985
"""

The same function can be applied to each column of the GroupBy object using the agg method

df["random"] = [random.choice((0,1)) for i in range(len(df))]
print df
"""
    age sex  random
0    20   F       1
1    22   M       1
2    25   M       1
3    27   M       0
4    21   F       1
5    23   M       1
6    37   F       1
7    31   M       0
8    61   F       1
9    45   M       1
10   41   F       0
11   32   M       0
"""

print df.groupby("sex").agg([np.max, np.min, np.mean])
"""
      age                   random                
     amax  amin       mean    amax  amin      mean
sex                                               
F      61    20  36.000000       1     0  0.800000
M      45    22  29.285714       1     0  0.571429
"""