[Python] [Pandas 1.0.1 Memorial] A fierce battle record with the Cookbook

background

What and how

About the structure of this post


pandas logo

\small{\mathsf{ Date:Feb \ 05, 2020 \ Version: 1.0.1 }}

Pandas Official Site Cookbook

Idioms

If-then ...

Use if-then / if-then-else on one column and assign it to one or more other columns.

In [1]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                           'BBB': [10, 20, 30, 40],
                           'CCC': [100, 50, -30, -50]})

In [2]: df
Out[2]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

An if-then on one column

df.loc[df.AAA >= 5, 'BBB'] = -1

Example of if-then assigned to two columns

df.loc[df.AAA >= 5, ['BBB', 'CCC']] = 555

Change the condition to update the other rows

df.loc[df.AAA < 5, ['BBB', 'CCC']] = 2000

Or apply the condition through a mask with where

In [9]: df_mask = pd.DataFrame({'AAA': [True] * 4,
                                'BBB': [False] * 4,
                                'CCC': [True, False] * 2})

In [10]: df.where(df_mask, -1000)
Out[10]: 
   AAA   BBB   CCC
0    4 -1000  2000
1    5 -1000 -1000
2    6 -1000   555
3    7 -1000 -1000

**[Code Reading]**

  • When creating the data frame, [True] * 4 expands to [True, True, True, True].
  • The contents of df_mask are as follows.
     AAA    BBB    CCC
0   True  False   True
1   True  False  False
2   True  False   True
3   True  False  False
  • Where the first argument (the mask) is True, where keeps the original element; where it is False, the element is replaced by the second argument (-1000). A small cross-check follows below.
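
A small cross-check (my addition, not from the original post): DataFrame.mask is the logical inverse of where, so negating the mask gives the same result. A minimal sketch, starting from a fresh copy of the original frame:

```python
import pandas as pd

df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                   'BBB': [10, 20, 30, 40],
                   'CCC': [100, 50, -30, -50]})
df_mask = pd.DataFrame({'AAA': [True] * 4,
                        'BBB': [False] * 4,
                        'CCC': [True, False] * 2})

# where keeps values where the mask is True; mask replaces values where its condition is True
a = df.where(df_mask, -1000)
b = df.mask(~df_mask, -1000)
assert a.equals(b)
```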

Use np.where for an if-then-else :thumbsup:

# np.where(condition, true, false)
df['logic'] = np.where(df['AAA'] > 5, 'high', 'low')
AAA BBB CCC logic
0 4 2000 2000 low
1 5 555 555 low
2 6 555 555 high
3 7 555 555 high

Splitting

In [16]: df
Out[16]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [17]: df[df.AAA <= 5]
Out[17]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50

In [18]: df[df.AAA > 5]
Out[18]: 
   AAA  BBB  CCC
2    6   30  -30
3    7   40  -50

Split a dataframe with a Boolean condition :clipboard: :thumbsup:

id sex age visits points
0 128 1 36 6 1980
1 324 0 29 9 -50
2 287 0 41 12 -239
3 423 1 33 33 1570
4 589 0 19 10 10

Let's divide this by a Boolean value.
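
The frame above is shown only as output in the post; a minimal sketch of how it could be constructed (the values are copied from the table, the construction itself is my addition):

```python
import pandas as pd

df = pd.DataFrame({'id': [128, 324, 287, 423, 589],
                   'sex': [1, 0, 0, 1, 0],
                   'age': [36, 29, 41, 33, 19],
                   'visits': [6, 9, 12, 33, 10],
                   'points': [1980, -50, -239, 1570, 10]})
```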

**[Code Reading]**

#Check the Boolean value of m
m = df['sex'] != 0
m
0     True
1    False
2    False
3     True
4    False
Name: sex, dtype: bool
#Extract with condition m, extract with negation of m
female, male = df[m], df[~m]

female

id sex age visits points
0 128 1 36 6 1980
3 423 1 33 33 1570

male

id sex age visits points
1 324 0 29 9 -50
2 287 0 41 12 -239
4 589 0 19 10 10

Building criteria

   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50
# (condition A) & (condition B); using the Python keyword 'and' here would raise an error
df.loc[(df['BBB'] < 25) & (df['CCC'] >= -40), 'AAA']
0    4
1    5
Name: AAA, dtype: int64
# (condition A) | (condition B); using the Python keyword 'or' here would raise an error
df.loc[(df['BBB'] > 25) | (df['CCC'] >= -40), 'AAA']
0    4
1    5
2    6
3    7
Name: AAA, dtype: int64
# You can also update the data using & and |
df.loc[(df['BBB'] > 25) | (df['CCC'] >= 75), 'AAA'] = 0.1
AAA BBB CCC
0 0.1 10 100
1 5 20 50
2 0.1 30 -30
3 0.1 40 -50

Use the argsort function to select the row with the data closest to a particular value

df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                   'BBB': [10, 20, 30, 40],
                   'CCC': [100, 50, -30, -50]})

In [26]: df
Out[26]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [27]: aValue = 43.0

In [28]: df.loc[(df.CCC - aValue).abs().argsort()]
Out[28]: 
   AAA  BBB  CCC
1    5   20   50
0    4   10  100
2    6   30  -30
3    7   40  -50

**[Code Reading]**

#Simple calculation
(df.CCC - aValue)
0    57.0
1     7.0
2   -73.0
3   -93.0
Name: CCC, dtype: float64
#Make the result of a simple calculation an absolute value
(df.CCC - aValue).abs()
0    57.0
1     7.0
2    73.0
3    93.0
Name: CCC, dtype: float64
# argsort returns the positions that would sort the absolute differences, closest value first
(df.CCC - aValue).abs().argsort()
0    1
1    0
2    2
3    3
Name: CCC, dtype: int64

Use binary operators to dynamically filter the list of criteria


In [29]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                            'BBB': [10, 20, 30, 40],
                            'CCC': [100, 50, -30, -50]})

In [30]: df
Out[30]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [31]: Crit1 = df.AAA <= 5.5
In [32]: Crit2 = df.BBB == 10.0
In [33]: Crit3 = df.CCC > -40.0
#Hard-coded
In [34]: AllCrit = Crit1 & Crit2 & Crit3

In [35]: import functools
In [36]: CritList = [Crit1, Crit2, Crit3]
# Combine all the criteria with functools.reduce
In [37]: AllCrit = functools.reduce(lambda x, y: x & y, CritList)

In [38]: df[AllCrit]
Out[38]: 
   AAA  BBB  CCC
0    4   10  100

**[Code Reading]**

#Check each condition as a data frame.
tmp = pd.DataFrame({'CritList_0':CritList[0].tolist(),
                    'CritList_1':CritList[1].tolist(),
                    'CritList_2':CritList[2].tolist(),
                    'AllCrit':AllCrit.tolist()
                   }, index=[0,1,2,3])
  • In AllCrit, the logical AND of all the conditions, only the row with index 0 is True. (An alternative way to combine the criteria is sketched after the table below.)
CritList_0 CritList_1 CritList_2 AllCrit
0 True True True True
1 True False True False
2 False False True False
3 False False False False
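
As an aside (a hedged sketch, not in the original cookbook text): NumPy can combine the same list of criteria without a lambda, via np.logical_and.reduce.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                   'BBB': [10, 20, 30, 40],
                   'CCC': [100, 50, -30, -50]})
CritList = [df.AAA <= 5.5, df.BBB == 10.0, df.CCC > -40.0]

# logical_and.reduce ANDs the boolean Series element-wise, like the functools.reduce above
AllCrit = np.logical_and.reduce(CritList)
print(df[AllCrit])
#    AAA  BBB  CCC
# 0    4   10  100
```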

Selection

Dataframes

Use both row labels and value conditions :thumbsup:

In [39]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                            'BBB': [10, 20, 30, 40],
                            'CCC': [100, 50, -30, -50]}) 

In [40]: df
Out[40]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [41]: df[(df.AAA <= 6) & (df.index.isin([0, 2, 4]))]
Out[41]: 
   AAA  BBB  CCC
0    4   10  100
2    6   30  -30

Use loc for slicing on labels and iloc for positional slicing

AAA BBB CCC
foo 4 10 100
bar 5 20 50
boo 6 30 -30
kar 7 40 -50
In [43]: df.loc['bar':'kar']  # Label
Out[43]: 
     AAA  BBB  CCC
bar    5   20   50
boo    6   30  -30
kar    7   40  -50

# Generic
In [44]: df.iloc[0:3]
Out[44]: 
     AAA  BBB  CCC
foo    4   10  100
bar    5   20   50
boo    6   30  -30

**Important points learned**

  • Position-oriented (Python slicing style: the end point is excluded)
  • Label-oriented (non-Python slice style: the end point is included)

Ambiguity arises when the index consists of integers that do not start at zero or are not incremental

In [46]: data = {'AAA': [4, 5, 6, 7],
                 'BBB': [10, 20, 30, 40],
                 'CCC': [100, 50, -30, -50]}

#Note: Index starts at 1
In [47]: df2 = pd.DataFrame(data=data, index=[1, 2, 3, 4])  
In [48]: df2.iloc[1:3]  #Position oriented
Out[48]: 
   AAA  BBB  CCC
2    5   20   50
3    6   30  -30

In [49]: df2.loc[1:3]  #Label oriented
Out[49]: 
   AAA  BBB  CCC
1    4   10  100
2    5   20   50
3    6   30  -30

Select rows by negating a condition with the (~) operator [NOT]

In [50]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                            'BBB': [10, 20, 30, 40],
                            'CCC': [100, 50, -30, -50]})

In [51]: df
Out[51]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [52]: df[~((df.AAA <= 6) & (df.index.isin([0, 2, 4])))]
Out[52]: 
   AAA  BBB  CCC
1    5   20   50
3    7   40  -50

**[Code Reading]**

# The same selection without the `~`:
df[((df.AAA <= 6) & (df.index.isin([0, 2, 4])))]
AAA BBB CCC
0 4 10 100
2 6 30 -30
Applying `~` negates this selection, so the rows with indexes 1 and 3 are returned.

New columns

Efficiently and dynamically create new columns using the applymap function :thumbsup:

In [53]: df = pd.DataFrame({'AAA': [1, 2, 1, 3],
                            'BBB': [1, 1, 2, 2],
                            'CCC': [2, 1, 3, 1]})


In [54]: df
Out[54]: 
   AAA  BBB  CCC
0    1    1    2
1    2    1    1
2    1    2    3
3    3    2    1

In [55]: source_cols = df.columns   #Get existing column name

In [56]: new_cols = [str(x) + "_cat" for x in source_cols]

In [57]: categories = {1: 'Alpha', 2: 'Beta', 3: 'Charlie'}

In [58]: df[new_cols] = df[source_cols].applymap(categories.get)

In [59]: df
Out[59]: 
   AAA  BBB  CCC  AAA_cat BBB_cat  CCC_cat
0    1    1    2    Alpha   Alpha     Beta
1    2    1    1     Beta   Alpha    Alpha
2    1    2    3    Alpha    Beta  Charlie
3    3    2    1  Charlie    Beta    Alpha

**[Code Reading]**

#Get an existing column name
source_cols = df.columns
source_cols
Index(['AAA', 'BBB', 'CCC'], dtype='object')
# Create a list of new names by appending '_cat' to each existing column name
new_cols = [str(x) + "_cat" for x in source_cols]
new_cols
['AAA_cat', 'BBB_cat', 'CCC_cat']
The new columns ['AAA_cat', 'BBB_cat', 'CCC_cat'] are filled by mapping the values of the existing columns through the categories dictionary (applymap(categories.get)). A per-column alternative is sketched below.
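
A hedged aside (my addition, not from the original post): the same mapping can be written per column with Series.map, which some readers may find clearer than applymap with dict.get.

```python
import pandas as pd

df = pd.DataFrame({'AAA': [1, 2, 1, 3],
                   'BBB': [1, 1, 2, 2],
                   'CCC': [2, 1, 3, 1]})
categories = {1: 'Alpha', 2: 'Beta', 3: 'Charlie'}

# Map each source column's values through the dictionary
for col in ['AAA', 'BBB', 'CCC']:
    df[col + '_cat'] = df[col].map(categories)
```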

Keep other columns when using min() with groupby

In [60]: df = pd.DataFrame({'AAA': [1, 1, 1, 2, 2, 2, 3, 3],
                            'BBB': [2, 1, 3, 4, 5, 1, 2, 3]})

In [61]: df
Out[61]: 
   AAA  BBB
0    1    2
1    1    1
2    1    3
3    2    4
4    2    5
5    2    1
6    3    2
7    3    3

Method 1: idxmin() gets the index of the minimum value

In [62]: df.loc[df.groupby("AAA")["BBB"].idxmin()]
Out[62]: 
   AAA  BBB
1    1    1
5    2    1
6    3    2

Method 2: Sort and then get each first

In [63]: df.sort_values(by="BBB").groupby("AAA", as_index=False).first()
Out[63]: 
   AAA  BBB
0    1    1
1    2    1
2    3    2

**[Code Reading]**

  • Method 1
# Group 'BBB' by the 'AAA' column
gb = df.groupby("AAA")["BBB"]
# Getting group '2' of the 'AAA' column shows the following; the minimum value within this group is 1
gb.get_group(2)
3    4
4    5
5    1
Name: BBB, dtype: int64
  • Method 2
# Sort by the 'BBB' column
sv = df.sort_values(by="BBB")
sv
	AAA	BBB
1	1	1
5	2	1
0	1	2
6	3	2
2	1	3
7	3	3
3	2	4
4	2	5
# Group by the 'AAA' column
gb = sv.groupby("AAA", as_index=False)
# For group '2' of the 'AAA' column, the first row after sorting holds the minimum value 1
gb.get_group(2)
	AAA	BBB
5	2	1
3	2	4
4	2	5
# Take the first row of each group
gb.first()
	AAA	BBB
0	1	1
1	2	1
2	3	2

MultiIndexing

Create a MultiIndex from a labeled frame

In [64]: df = pd.DataFrame({'row': [0, 1, 2],
                           'One_X': [1.1, 1.1, 1.1],
                           'One_Y': [1.2, 1.2, 1.2],
                           'Two_X': [1.11, 1.11, 1.11],
                           'Two_Y': [1.22, 1.22, 1.22]})

In [65]: df
Out[65]: 
   row  One_X  One_Y  Two_X  Two_Y
0    0    1.1    1.2   1.11   1.22
1    1    1.1    1.2   1.11   1.22
2    2    1.1    1.2   1.11   1.22

#Labeled index
In [66]: df = df.set_index('row')

In [67]: df
Out[67]: 
     One_X  One_Y  Two_X  Two_Y
row                            
0      1.1    1.2   1.11   1.22
1      1.1    1.2   1.11   1.22
2      1.1    1.2   1.11   1.22

# Split the existing column names on '_' and turn them into a hierarchical index (MultiIndex) of tuples
In [68]: df.columns = pd.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])

In [69]: df
Out[69]: 
     One        Two      
       X    Y     X     Y
row                      
0    1.1  1.2  1.11  1.22
1    1.1  1.2  1.11  1.22
2    1.1  1.2  1.11  1.22

# Stack level 0 of the column hierarchy into the rows and reset that index level
In [70]: df = df.stack(0).reset_index(1)

In [71]: df
Out[71]: 
    level_1     X     Y
row                    
0       One  1.10  1.20
0       Two  1.11  1.22
1       One  1.10  1.20
1       Two  1.11  1.22
2       One  1.10  1.20
2       Two  1.11  1.22

# Rename the columns (note that the label 'level_1' was added automatically)
In [72]: df.columns = ['Sample', 'All_X', 'All_Y']

In [73]: df
Out[73]: 
    Sample  All_X  All_Y
row                     
0      One   1.10   1.20
0      Two   1.11   1.22
1      One   1.10   1.20
1      Two   1.11   1.22
2      One   1.10   1.20
2      Two   1.11   1.22

**[Code Reading]**

# Split the existing column names on '_' with a list comprehension
[tuple(c.split('_')) for c in df.columns]
[('One', 'X'), ('One', 'Y'), ('Two', 'X'), ('Two', 'Y')]
#Hierarchical indexes are represented in tuple format
df.columns = pd.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
df.columns
MultiIndex([('One', 'X'),
            ('One', 'Y'),
            ('Two', 'X'),
            ('Two', 'Y')],
           )
#Pivot from column to row
df = df.stack(0)
		X	Y
row			
0	One	1.10	1.20
	Two	1.11	1.22
1	One	1.10	1.20
	Two	1.11	1.22
2	One	1.10	1.20
	Two	1.11	1.22

# Reset index level 1 ('One', 'Two') into a column
df.reset_index(1)
    level_1     X     Y
row
0       One  1.10  1.20
0       Two  1.11  1.22
1       One  1.10  1.20
1       Two  1.11  1.22
2       One  1.10  1.20
2       Two  1.11  1.22


Arithmetic

The code below has some changes from the official website.

In [74]: cols = pd.MultiIndex.from_tuples([(x, y) for x in ['A', 'B', 'C'] for y in ['O', 'I']])

In [75]: df = pd.DataFrame(np.arange(12).reshape((2,6)), index=['n', 'm'], columns=cols)

In [76]: df
Out[76]: 
	A		B		C
	O	I	O	I	O	I
n	0	1	2	3	4	5
m	6	7	8	9	10	11

In [77]: df = df.div(df['C'], level=1)

In [78]: df
Out[78]: 

	A				B				C
	O	I			O	I			O	I
n	0.0	0.200000	0.5	0.600000	1.0	1.0
m	0.6	0.636364	0.8	0.818182	1.0	1.0

**[Code Reading]**

For row n, column B becomes $[B][O] = 2 \div 4 = 0.5$ and $[B][I] = 3 \div 5 = 0.600000$: df.div(df['C'], level=1) divides each column by the 'C' sub-column that shares the same level-1 label (O or I).

A B C
O I O I O I
n 0.0 0.200000 0.5 0.600000 1.0 1.0
m 0.6 0.636364 0.8 0.818182 1.0 1.0

Slice

coords = [('AA', 'one'), ('AA', 'six'), ('BB', 'one'), ('BB', 'two'), ('BB', 'six')]
index = pd.MultiIndex.from_tuples(coords)
df = pd.DataFrame([11, 22, 33, 44, 55], index, ['MyData'])
MyData
AA one 11
six 22
BB one 33
two 44
six 55

To choose from the first level and the first axis

#First axis
df.xs('BB', level=0, axis=0)
MyData
one 33
two 44
six 55

To select the 'six' label from the second level, along the first axis

# Second level (level=1), first axis
df.xs('six', level=1, axis=0)
MyData
AA 22
BB 55
In [85]: import itertools
In [86]: index = list(itertools.product(['Ada', 'Quinn', 'Violet'],
                                        ['Comp', 'Math', 'Sci']))
In [87]: headr = list(itertools.product(['Exams', 'Labs'], ['I', 'II']))
In [88]: indx = pd.MultiIndex.from_tuples(index, names=['Student', 'Course'])
In [89]: cols = pd.MultiIndex.from_tuples(headr)   # Notice these are un-named
In [90]: data = [[70 + x + y + (x * y) % 3 for x in range(4)] for y in range(9)]
In [91]: df = pd.DataFrame(data, indx, cols)

In [92]: df
Out[92]: 
               Exams     Labs    
                   I  II    I  II
Student Course                   
Ada     Comp      70  71   72  73
        Math      71  73   75  74
        Sci       72  75   75  75
Quinn   Comp      73  74   75  76
        Math      74  76   78  77
        Sci       75  78   78  78
Violet  Comp      76  77   78  79
        Math      77  79   81  80
        Sci       78  81   81  81

In [93]: All = slice(None)

In [94]: df.loc['Violet']
Out[94]: 
       Exams     Labs    
           I  II    I  II
Course                   
Comp      76  77   78  79
Math      77  79   81  80
Sci       78  81   81  81

In [95]: df.loc[(All, 'Math'), All]
Out[95]: 
               Exams     Labs    
                   I  II    I  II
Student Course                   
Ada     Math      71  73   75  74
Quinn   Math      74  76   78  77
Violet  Math      77  79   81  80

In [96]: df.loc[(slice('Ada', 'Quinn'), 'Math'), All]
Out[96]: 
               Exams     Labs    
                   I  II    I  II
Student Course                   
Ada     Math      71  73   75  74
Quinn   Math      74  76   78  77

In [97]: df.loc[(All, 'Math'), ('Exams')]
Out[97]: 
                 I  II
Student Course        
Ada     Math    71  73
Quinn   Math    74  76
Violet  Math    77  79

In [98]: df.loc[(All, 'Math'), (All, 'II')]
Out[98]: 
               Exams Labs
                  II   II
Student Course           
Ada     Math      73   74
Quinn   Math      76   77
Violet  Math      79   80

**[Code Reading]**

All = slice(None)  # What does this mean?
# slice(start, stop, step) -- so all three are None
All
slice(None, None, None)
print(type(All))
# <class 'slice'>  -- a slice object
  • Trying `All = None` instead raises an error, so the selector has to be a slice object.
  • Here, `All` means "no restriction on this level". (A pd.IndexSlice sketch follows after this block.)
indx
MultiIndex([(   'Ada', 'Comp'),
            (   'Ada', 'Math'),
            (   'Ada',  'Sci'),
            ( 'Quinn', 'Comp'),
            ( 'Quinn', 'Math'),
            ( 'Quinn',  'Sci'),
            ('Violet', 'Comp'),
            ('Violet', 'Math'),
            ('Violet',  'Sci')],
           names=['Student', 'Course'])
cols
MultiIndex([('Exams',  'I'),
            ('Exams', 'II'),
            ( 'Labs',  'I'),
            ( 'Labs', 'II')],
           )
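
A related shortcut (my addition, not part of the original post): pandas provides pd.IndexSlice, which plays the same role as the hand-written slice(None). A minimal, self-contained sketch:

```python
import itertools
import pandas as pd

index = list(itertools.product(['Ada', 'Quinn', 'Violet'], ['Comp', 'Math', 'Sci']))
headr = list(itertools.product(['Exams', 'Labs'], ['I', 'II']))
data = [[70 + x + y + (x * y) % 3 for x in range(4)] for y in range(9)]
df = pd.DataFrame(data,
                  pd.MultiIndex.from_tuples(index, names=['Student', 'Course']),
                  pd.MultiIndex.from_tuples(headr))

idx = pd.IndexSlice
# The same selection as df.loc[(All, 'Math'), (All, 'II')] above
print(df.loc[idx[:, 'Math'], idx[:, 'II']])
```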

Sorting

Use MultiIndex to sort by a specific column or an ordered list of columns

df.sort_values(by=('Labs', 'II'), ascending=False)
Exams Labs
I II I II
Student Course
Violet Sci 78 81 81 81
Math 77 79 81 80
Comp 76 77 78 79
Quinn Sci 75 78 78 78
Math 74 76 78 77
Comp 73 74 75 76
Ada Sci 72 75 75 75
Math 71 73 75 74
Comp 70 71 72 73

Levels

Prepend a level to a MultiIndex :clipboard: :thumbsup:

# How to add a level 'Firstlevel: Foo' to this DataFrame?
pd.concat([df], keys=['Foo'], names=['Firstlevel'])
Exams Labs
I II I II
Firstlevel Student Course
Foo Ada Comp 70 71 72 73
Math 71 73 75 74
Sci 72 75 75 75
Quinn Comp 73 74 75 76
Math 74 76 78 77
Sci 75 78 78 78
Violet Comp 76 77 78 79
Math 77 79 81 80
Sci 78 81 81 81

To flatten the hierarchical column index :clipboard:

df.columns = df.columns.get_level_values(0)
Exams Exams Labs Labs
Student Course
Ada Comp 70 71 72 73
Math 71 73 75 74
Sci 72 75 75 75
Quinn Comp 73 74 75 76
Math 74 76 78 77
Sci 75 78 78 78
Violet Comp 76 77 78 79
Math 77 79 81 80
Sci 78 81 81 81

Missing data

Reverse the time series, then forward-fill

In [100]: df = pd.DataFrame(np.random.randn(6, 1),
                            index=pd.date_range('2013-08-01', periods=6, freq='B'),
                            columns=list('A'))

In [101]: df.loc[df.index[3], 'A'] = np.nan

In [102]: df
Out[102]: 
                   A
2013-08-01  0.721555
2013-08-02 -0.706771
2013-08-05 -1.039575
2013-08-06       NaN
2013-08-07 -0.424972
2013-08-08  0.567020

In [103]: df.reindex(df.index[::-1]).ffill()
Out[103]: 
                   A
2013-08-08  0.567020
2013-08-07 -0.424972
2013-08-06 -0.424972
2013-08-05 -1.039575
2013-08-02 -0.706771
2013-08-01  0.721555

**[Code Reading]**

df.reindex(df.index[::-1]).ffill()
df.reindex(df.index[::-1]).fillna(method='ffill')  #Synonymous with above

The above reindexes the frame in reverse order and fills the gaps with fillna(method='ffill'). fillna is more versatile than ffill() because you can choose method='ffill' (forward) or method='bfill' (backward) to fill the gaps.
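
A hedged cross-check (my addition): on the original, non-reversed frame, bfill() fills each NaN from the next later timestamp, so the filled values match the reversed-ffill result above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 1),
                  index=pd.date_range('2013-08-01', periods=6, freq='B'),
                  columns=list('A'))
df.loc[df.index[3], 'A'] = np.nan

filled_reversed = df.reindex(df.index[::-1]).ffill()
filled_bfill = df.bfill()
# Same filled values, just kept in the original ascending order
assert filled_bfill['A'].tolist() == filled_reversed.sort_index()['A'].tolist()
```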

Reset a cumulative sum at NaN :clipboard: :thumbsup:

v = pd.Series([1, 1, 1, np.nan, 1, 1, 1, 1, np.nan, 1], dtype=float)
n = v.isnull()
a = ~n
c = a.astype(float).cumsum()
index = c[n].index  # positions of the NaN values (this line is needed for the next step)
d = pd.Series(np.diff(np.hstack(([0.], c[n]))), index=index)
v[n] = -d
result = v.cumsum()
result
    0    1.0
    1    2.0
    2    3.0
    3    0.0
    4    1.0
    5    2.0
    6    3.0
    7    4.0
    8    0.0
    9    1.0
    dtype: float64

**[Code Reading]**

(1) Create data containing NaN values: v = pd.Series([1, 1, 1, np.nan, 1, 1, 1, 1, np.nan, 1], dtype=float)
(2) n flags the NaN positions as Boolean values: n = v.isnull()
(3) a inverts the Boolean values of n: a = ~n
(4) Accumulate based on the Boolean values of a: c = a.astype(float).cumsum()
(5) Get the index of the NaN positions from the accumulated data: index = c[n].index
(6) Take the differences of 0 followed by c[n]: d = pd.Series(np.diff(np.hstack(([0.], c[n]))), index=index)
(7) Assign the values from (6), negated, into the NaN positions of v: v[n] = -d
(8) Take the cumulative sum of (7); the negative values reset the running total to 0: result = v.cumsum()

\begin{array}{|c|c|c|c|c|c|c|r|c|} \hline
  &  (1)  &  (2)   &  (3)  & (4) &(5)&(6)& (7)  & (8) \\ \hline 
0 &   1.0 &  False & True  & 1.0 &   &   & 1.0  &  1.0 \\ \hline
1 &   1.0 &  False & True  & 2.0 &   &   & 1.0  &  2.0 \\ \hline
2 &   1.0 &  False & True  & 3.0 &   &   & 1.0  &  3.0 \\ \hline
3 &   NaN &  True  & False & 3.0 & 3 & 3 & -3.0 &  0.0 \\ \hline
4 &   1.0 &  False & True  & 4.0 &   &   & 1.0  &  1.0 \\ \hline
5 &   1.0 &  False & True  & 5.0 &   &   & 1.0  &  2.0 \\ \hline
6 &   1.0 &  False & True  & 6.0 &   &   & 1.0  &  3.0 \\ \hline
7 &   1.0 &  False & True  & 7.0 &   &   & 1.0  &  4.0 \\ \hline
8 &   NaN &  True  & False & 7.0 & 8 & 4 & -4.0 &  0.0 \\ \hline
9 &   1.0 &  False & True  & 8.0 &   &   & 1.0 &  1.0 \\ \hline
\end{array}

Grouping

Unlike the agg function, the apply function returns a dataframe that can access all columns

df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(), 
                    'size': list('SSMMMLL'), 
                    'weight': [8, 10, 11, 1, 20, 12, 12], 
                    'adult': [False] * 5 + [True] * 2})
In [105]: df
Out[105]: 
  animal size  weight  adult
0    cat    S       8  False
1    dog    S      10  False
2    cat    M      11  False
3   fish    M       1  False
4    dog    M      20  False
5    cat    L      12   True
6    cat    L      12   True

#List the sizes of the heaviest animals.
In [106]: df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()])
Out[106]: 
animal
cat     L
dog     M
fish    M
dtype: object

**[Code Reading]**

# Disassembling this:
df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()])
# Get the 'size' element at the index where 'weight' is largest
df['size'][df['weight'].idxmax()]
   'M'
# Applied per group, this returns the size of each group's heaviest animal
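
A hedged aside (my addition, not part of the original post): the same per-group answer can be obtained without apply by combining groupby idxmax with loc.

```python
import pandas as pd

df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
                   'size': list('SSMMMLL'),
                   'weight': [8, 10, 11, 1, 20, 12, 12],
                   'adult': [False] * 5 + [True] * 2})

# Index of the heaviest row within each animal group, then look up its size
heaviest = df.groupby('animal')['weight'].idxmax()
print(df.loc[heaviest, ['animal', 'size']])
#   animal size
# 5    cat    L
# 4    dog    M
# 3   fish    M
```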

Use the get_group function

In [107]: gb = df.groupby(['animal'])
# Use 'get_group' to inspect the contents of a GroupBy object
In [108]: gb.get_group('cat')
Out[108]: 
  animal size  weight  adult
0    cat    S       8  False
2    cat    M      11  False
5    cat    L      12   True
6    cat    L      12   True

**[Code Reading]**

#Access to the object acquired by groupby
gb[['animal', 'size', 'weight']].get_group('cat')
animal size weight
0 cat S 8
2 cat M 11
5 cat L 12
6 cat L 12

Apply to different items in the group

In [109]: def GrowUp(x):
              avg_weight = sum(x[x['size'] == 'S'].weight * 1.5)
              avg_weight += sum(x[x['size'] == 'M'].weight * 1.25)
              avg_weight += sum(x[x['size'] == 'L'].weight)
              avg_weight /= len(x)
              return pd.Series(['L', avg_weight, True], index=['size', 'weight', 'adult'])

In [110]: expected_df = gb.apply(GrowUp)

In [111]: expected_df
Out[111]: 
       size   weight  adult
animal                     
cat       L  12.4375   True
dog       L  20.0000   True
fish      L   1.2500   True

**[Code Reading]**

#Calculation of S size
cat_s = sum(df[(df['animal'] == 'cat') & (df['size'] == 'S')].weight * 1.5)
cat_s
12.0
#Calculation of M size
cat_m = sum(df[(df['animal'] == 'cat') & (df['size'] == 'M')].weight * 1.25)
cat_m
13.75
#Calculation of L size
cat_l = sum(df[(df['animal'] == 'cat') & (df['size'] == 'L')].weight)
cat_l
24
# Average over the S, M and L totals
cat_avg = (cat_s + cat_m + cat_l) / len(df[df['animal'] == 'cat'])
cat_avg
12.4375
# The result is returned as a pd.Series

Apply a function over an expanding window of the data

The code below has some numerical changes from the official website.

# S = pd.Series([i / 100.0 for i in range(1, 11)])
S = pd.Series([i for i in range(1, 11)])

S
    0     1
    1     2
    2     3
    3     4
    4     5
    5     6
    6     7
    7     8
    8     9
    9    10
    dtype: int64
def cum_ret(x, y):
    return x * (1 + y)

def red(x):
    p = functools.reduce(cum_ret, x, 1.0)
    return p
# expanding() applies the function over an ever-growing window (like expanding().mean(), .std(), ...)
S.expanding().apply(red, raw=True)

0           2.0
1           6.0
2          24.0
3         120.0
4         720.0
5        5040.0
6       40320.0
7      362880.0
8     3628800.0
9    39916800.0
dtype: float64

**[Code Reading]**

x = 1

1: The red function is called; functools.reduce starts with the initializer 1.0.

2: cum_ret(x, y) is called with x = 1.0, y = 1 and returns x * (1 + y) = 1.0 * (1.0 + 1.0) = 2.0.

3: p = 2.0

x = 2

p = 6.0 ((1*2) * (1+2))

x = 3

p = 24.0 ((1*2*3) * (1+3))

x = 4

p = 120.0 ((1*2*3*4) * (1+4))

Replace values with the per-group mean of the non-negative values, keeping values less than 0 as they are

In [116]: df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, -1, 1, 2]})
	A	B
0	1	1
1	1	-1
2	2	1
3	2	2

In [117]: gb = df.groupby('A')
In [118]: def replace(g):
              mask = g < 0
              return g.where(mask, g[~mask].mean())

In [119]: gb.transform(replace)
Out[119]: 
     B
0  1.0
1 -1.0
2  1.5
3  1.5

Replace values less than 0 with the group mean :clipboard: :thumbsup:

df = pd.DataFrame({'A' : [1, 1, 1, 2, 2, 2], 'B' : [2, 3, -1, 1, 2, 6]})
gb = df.groupby('A')
def replace(g):
    mask = g < 0
    g.loc[mask] = g[~mask].mean()
    return g
gb.transform(replace)

	B
0	2.0
1	3.0
2	2.5
3	1.0
4	2.0
5	6.0

Sort groups by aggregated data

In [120]: df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2,
                             'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
                             'flag': [False, True] * 3})

In [121]: code_groups = df.groupby('code')

In [122]: agg_n_sort_order = code_groups[['data']].transform(sum).sort_values(by='data')

In [123]: sorted_df = df.loc[agg_n_sort_order.index]

In [124]: sorted_df
Out[124]: 
  code  data   flag
1  bar -0.21   True
4  bar -0.59  False
0  foo  0.16  False
3  foo  0.45   True
2  baz  0.33  False
5  baz  0.62   True

**[Code Reading]**

# Compute the per-group total of 'data' with transform, then sort by it
agg_n_sort_order = code_groups[['data']].transform(sum).sort_values(by='data')
	data
1	-0.80
4	-0.80
0	0.61
3	0.61
2	0.95
5	0.95
#Then find and extract the index
agg_n_sort_order.index
Int64Index([1, 4, 0, 3, 2, 5], dtype='int64')

Create multiple summary columns

In [125]: rng = pd.date_range(start="2014-10-07", periods=10, freq='2min')

In [126]: ts = pd.Series(data=list(range(10)), index=rng)

In [127]: def MyCust(x):
              if len(x) > 2:
                  return x[1] * 1.234
              return pd.NaT 

In [128]: mhc = {'Mean': np.mean, 'Max': np.max, 'Custom': MyCust}

In [129]: ts.resample("5min").apply(mhc)
Out[129]: 
Mean    2014-10-07 00:00:00        1
        2014-10-07 00:05:00      3.5
        2014-10-07 00:10:00        6
        2014-10-07 00:15:00      8.5
Max     2014-10-07 00:00:00        2
        2014-10-07 00:05:00        4
        2014-10-07 00:10:00        7
        2014-10-07 00:15:00        9
Custom  2014-10-07 00:00:00    1.234
        2014-10-07 00:05:00      NaT
        2014-10-07 00:10:00    7.404
        2014-10-07 00:15:00      NaT
dtype: object

In [130]: ts
Out[130]: 
2014-10-07 00:00:00    0
2014-10-07 00:02:00    1
2014-10-07 00:04:00    2
2014-10-07 00:06:00    3
2014-10-07 00:08:00    4
2014-10-07 00:10:00    5
2014-10-07 00:12:00    6
2014-10-07 00:14:00    7
2014-10-07 00:16:00    8
2014-10-07 00:18:00    9
Freq: 2T, dtype: int64

**[Code Reading]**

# Resample the time series into '5min' bins
ts_re = ts.resample('5min')
#View the contents of the object
for t in ts_re:
    print(t)
(Timestamp('2014-10-07 00:00:00', freq='5T'),   2014-10-07 00:00:00    0
                                                2014-10-07 00:02:00    1
                                                2014-10-07 00:04:00    2
                                                Freq: 2T, dtype: int64)
(Timestamp('2014-10-07 00:05:00', freq='5T'),   2014-10-07 00:06:00    3
                                                2014-10-07 00:08:00    4
                                                Freq: 2T, dtype: int64)
(Timestamp('2014-10-07 00:10:00', freq='5T'),   2014-10-07 00:10:00    5
                                                2014-10-07 00:12:00    6
                                                2014-10-07 00:14:00    7
                                                Freq: 2T, dtype: int64)
(Timestamp('2014-10-07 00:15:00', freq='5T'),   2014-10-07 00:16:00    8
                                                2014-10-07 00:18:00    9
                                                Freq: 2T, dtype: int64)
   ts                   data   mean  max  MyCust  x[1]
0  2014-10-07 00:00:00  0,1,2  1     2    1.234   1
1  2014-10-07 00:05:00  3,4    3.5   4    NaT     False
2  2014-10-07 00:10:00  5,6,7  6     7    7.404   6
3  2014-10-07 00:15:00  8,9    8.5   9    NaT     False

Create a value count column and reassign it to the DataFrame

In [131]: df = pd.DataFrame({'Color': 'Red Red Red Blue'.split(),
                             'Value': [100, 150, 50, 50]})

In [132]: df
Out[132]: 
  Color  Value
0   Red    100
1   Red    150
2   Red     50
3  Blue     50

In [133]: df['Counts'] = df.groupby(['Color']).transform(len)

In [134]: df
Out[134]: 
  Color  Value  Counts
0   Red    100       3
1   Red    150       3
2   Red     50       3
3  Blue     50       1

Shift a group of column values based on the index

In [135]: df = pd.DataFrame({'line_race': [10, 10, 8, 10, 10, 8],
                             'beyer': [99, 102, 103, 103, 88, 100]},
                             index=['Last Gunfighter', 'Last Gunfighter', 'Last Gunfighter', 
                                    'Paynter', 'Paynter', 'Paynter'])

In [136]: df
Out[136]: 
                 line_race  beyer
Last Gunfighter         10     99
Last Gunfighter         10    102
Last Gunfighter          8    103
Paynter                 10    103
Paynter                 10     88
Paynter                  8    100

In [137]: df['beyer_shifted'] = df.groupby(level=0)['beyer'].shift(1)

In [138]: df
Out[138]: 
                 line_race  beyer  beyer_shifted
Last Gunfighter         10     99            NaN
Last Gunfighter         10    102           99.0
Last Gunfighter          8    103          102.0
Paynter                 10    103            NaN
Paynter                 10     88          103.0
Paynter                  8    100           88.0

Select the row with the maximum value from each group

df = pd.DataFrame({'host': ['other', 'other', 'that', 'this', 'this'],
                   'service': ['mail', 'web', 'mail', 'mail', 'web'],
                   'no': [1, 2, 1, 2, 1]}).set_index(['host', 'service'])

In [140]: mask = df.groupby(level=0).agg('idxmax')

In [141]: df_count = df.loc[mask['no']].reset_index()

In [142]: df_count
Out[142]: 
    host service  no
0  other     web   2
1   that    mail   1
2   this    mail   2

**[Code Reading]**

no
host service
other mail 1
web 2
that mail 1
this mail 2
web 1
# View the contents of the mask grouped by level=0
mask = df.groupby(level=0).agg('idxmax')
		no
host 	
other 	(other, web)
that 	(that, mail)
this 	(this, mail)

For each group, idxmax returns the index label (a tuple, since the index is a MultiIndex) of the row with the maximum value. The rows extracted with this mask then have their index reset. (An alternative without agg is sketched below.)
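
A hedged alternative (my addition): selecting the 'no' column first gives the same labels without agg.

```python
import pandas as pd

df = pd.DataFrame({'host': ['other', 'other', 'that', 'this', 'this'],
                   'service': ['mail', 'web', 'mail', 'mail', 'web'],
                   'no': [1, 2, 1, 2, 1]}).set_index(['host', 'service'])

# idxmax per group returns the (host, service) label of the row with the largest 'no'
mask = df.groupby(level=0)['no'].idxmax()
print(df.loc[mask].reset_index())
#     host service  no
# 0  other     web   2
# 1   that    mail   1
# 2   this    mail   2
```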

Grouping like itertools.groupby

In [143]: df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=['A'])

In [144]: df['A'].groupby((df['A'] != df['A'].shift()).cumsum()).groups
Out[144]: 
{1: Int64Index([0], dtype='int64'),
 2: Int64Index([1], dtype='int64'),
 3: Int64Index([2], dtype='int64'),
 4: Int64Index([3, 4, 5], dtype='int64'),
 5: Int64Index([6], dtype='int64'),
 6: Int64Index([7, 8], dtype='int64')}

In [145]: df['A'].groupby((df['A'] != df['A'].shift()).cumsum()).cumsum()
Out[145]: 
0    0
1    1
2    0
3    1
4    2
5    3
6    0
7    1
8    2
Name: A, dtype: int64

**[Code Reading]**

#The above data frame
df
	A
0	0
1	1
2	0
3	1
4	1
5	1
6	0
7	1
8	1
  • Running the same data through itertools.groupby shows the following.
data = [0, 1, 0, 1, 1, 1, 0, 1, 1]
gb = itertools.groupby(data)
for key, group in gb:
    print(f'{key}: {list(group)}')
0: [0]
1: [1]
0: [0]
1: [1, 1, 1]
0: [0]
1: [1, 1]
  • GroupBy.groups returns group names and labels in dictionary format.
tmp = df.groupby('A')
gb = tmp.groups
gb.get(1)
Int64Index([1, 3, 4, 5, 7, 8], dtype='int64')
gb.get(0)
Int64Index([0, 2, 6], dtype='int64')

(1) Sample data: df['A']
(2) Shifted data: df['A'].shift()
(3) True where (1) and (2) differ: df['A'] != df['A'].shift()
(4) Cumulative sum of (3): (df['A'] != df['A'].shift()).cumsum()
(5) Contents of the object grouped by the cumulative key: gb = df['A'].groupby((df['A'] != df['A'].shift()).cumsum())
(6) Cumulative sum within each group: df['A'].groupby((df['A'] != df['A'].shift()).cumsum()).cumsum()

\begin{array}{|c|c|c|c|c|c|c|} \hline
 & (1) & (2) & (3) & (4) & (5) & (6) \\ \hline
0 & 0 &	NaN & True & 1 & 0 & 0  \\ \hline
1 & 1 &	0.0 & True & 2 & 1 & 1 \\ \hline
2 & 0 &	1.0 & True & 3 & 0 & 0 \\ \hline
3 & 1 & 0.0 & True & 4 & 1,1,1 & 1  \\ \hline
4 & 1 & 1.0 & False & 4 &  & 2 \\ \hline
5 & 1 & 1.0 & False & 4 &  & 3 \\ \hline
6 & 0 &	1.0 & True & 5 & 0 & 0 \\ \hline
7 & 1 &	0.0 & True & 6 & 1,1 & 1 \\ \hline
8 & 1 &	1.0 & False & 6 &  & 2 \\ \hline
\end{array}

Expanding data

Rolling calculation based on value rather than count :clipboard:

df = pd.DataFrame({'RollBasis':[1,1,1,2,3,5,8,10,12,13],
                   'ToRoll':[1,4,-5,2,-4,-2,0,-13,-2,-5]})

def f(x):
    ser = df.ToRoll[(df.RollBasis >= x) & (df.RollBasis < x+5)]
    return ser.sum()

df['Rolled'] = df.RollBasis.apply(f)
df
	RollBasis	ToRoll	Rolled
0	1	1	-4
1	1	4	-4
2	1	-5	-4
3	2	2	-4
4	3	-4	-6
5	5	-2	-2
6	8	0	-15
7	10	-13	-20
8	12	-2	-7
9	13	-5	-5

**[Code Reading]**

RollBasis ToRoll
0 1 1
1 1 4
2 1 -5
3 2 2
4 3 -4
5 5 -2
6 8 0
7 10 -13
8 12 -2
9 13 -5
  • The function f sums the 'ToRoll' values over the rows where 'RollBasis' is at least x and less than x + 5.

$ x=1: (RollBasis \geq 1) \ \& \ (RollBasis < 1+5) \rightarrow sum([1,4,-5,2,-4,-2]) \rightarrow -4 $
$ \vdots $
$ x=4: (RollBasis \geq 4) \ \& \ (RollBasis < 4+5) \rightarrow sum([-2,0]) \rightarrow -2 $

Rolling average over time interval

df = pd.DataFrame({'B': range(5)})

df.index = [pd.Timestamp('20130101 09:00:00'),
            pd.Timestamp('20130101 09:00:02'),
            pd.Timestamp('20130101 09:00:03'),
            pd.Timestamp('20130101 09:00:05'),
            pd.Timestamp('20130101 09:00:06')]

df
2013-01-01 09:00:00	0
2013-01-01 09:00:02	1
2013-01-01 09:00:03	2
2013-01-01 09:00:05	3
2013-01-01 09:00:06	4

df.rolling(2, min_periods=1).sum()
				B
2013-01-01 09:00:00	0.0
2013-01-01 09:00:02	1.0
2013-01-01 09:00:03	3.0
2013-01-01 09:00:05	5.0
2013-01-01 09:00:06	7.0

df.rolling('2s', min_periods=1).sum()
				B
2013-01-01 09:00:00	0.0
2013-01-01 09:00:02	1.0
2013-01-01 09:00:03	3.0
2013-01-01 09:00:05	3.0
2013-01-01 09:00:06	7.0

**[Code Reading]**

# rolling(window size, min_periods=minimum number of observations)
df.rolling(2, min_periods=1).sum()
# 0+1=1, 1+2=3, 2+3=5, 3+4=7 In units of 2 lines
df.rolling('2s', min_periods=1).sum()
# 0+1=1, 1+2=3, 3+0=3, 3+4=7
# Totals when the time series is resampled at 2-second intervals
df.resample('2s').sum()
		B
2013-01-01 09:00:00	0
2013-01-01 09:00:02	3
2013-01-01 09:00:04	3
2013-01-01 09:00:06	4

Calculate the cumulative average from the total in time series units with the rolling function

df = pd.DataFrame({'enddate':pd.to_datetime(['2019-03-01','2019-03-01','2019-03-01','2019-03-02','2019-03-02','2019-03-02','2019-03-03','2019-03-03','2019-03-04']),
                   'bloom':[0.342, 0.235, 0.456, 0.389, 0.453, 0.367, 0.369, 0.428, 0.55],
                   'unbloom':[0.658, 0.765, 0.544, 0.611, 0.547, 0.633, 0.631, 0.572, 0.45]})
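
The post stops after creating this frame. A minimal sketch of one possible continuation (my assumption, not shown in the source): if the goal is the per-day total followed by a running (cumulative) average, it could look like this.

```python
import pandas as pd

df = pd.DataFrame({'enddate': pd.to_datetime(['2019-03-01', '2019-03-01', '2019-03-01',
                                              '2019-03-02', '2019-03-02', '2019-03-02',
                                              '2019-03-03', '2019-03-03', '2019-03-04']),
                   'bloom': [0.342, 0.235, 0.456, 0.389, 0.453, 0.367, 0.369, 0.428, 0.55],
                   'unbloom': [0.658, 0.765, 0.544, 0.611, 0.547, 0.633, 0.631, 0.572, 0.45]})

# Per-day totals, then a cumulative (expanding) average over the days
daily = df.groupby('enddate')[['bloom', 'unbloom']].sum()
cum_avg = daily.expanding().mean()
print(cum_avg)
```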

Splitting

Create a list of data frames and split them based on the values contained in the row.

In [146]: df = pd.DataFrame(data={'Case': ['A', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'A'],
                                  'Data': np.random.randint(10, 50, (9,))})

In [147]: dfs = list(zip(*df.groupby((1 * (df['Case'] == 'B')).cumsum()
                         .rolling(window=3, min_periods=1).median())))[-1]

In [148]: dfs[0]
Out[148]: 
  Case      Data
0	A	24
1	A	34
2	A	33
3	B	11
In [149]: dfs[1]
Out[149]: 
  Case      Data
4	A	31
5	A	35
6	B	13

In [150]: dfs[2]
Out[150]: 
Case	Data
7	A	17
8	A	33

**[Code Reading]**

Case Data
0 A 24
1 A 34
2 A 33
3 B 11
4 A 31
5 A 35
6 B 13
7 A 17
8 A 33
dfs = list(zip(*df.groupby((1 * (df['Case'] == 'B')).cumsum().rolling(window=3, min_periods=1).median())))[-1]
  • I will disassemble this one liner.
# Rows where the 'Case' column equals 'B' become 1
1 * (df['Case'] == 'B')
>
0    0
1    0
2    0
3    1
4    0
5    0
6    1
7    0
8    0
Name: Case, dtype: int64
# Take the cumulative sum
1 * (df['Case'] == 'B').cumsum()
>
0    0
1    0
2    0
3    1
4    1
5    1
6    2
7    2
8    2
Name: Case, dtype: int64
# Take the rolling median of the cumulative values over windows of 3 rows
(1 * (df['Case'] == 'B')).cumsum().rolling(window=3, min_periods=1).median()
>
0    0.0
1    0.0
2    0.0
3    0.0
4    1.0
5    1.0
6    1.0
7    2.0
8    2.0
Name: Case, dtype: float64
# The previous step produced the grouping key; group by it, then unpack and zip the resulting group objects into a list
tmp2 = list(zip(*df.groupby((1 * (df['Case'] == 'B')).cumsum().rolling(window=3, min_periods=1).median())))
>
[(0.0, 1.0, 2.0), (  Case  Data
  0    A    24
  1    A    34
  2    A    33
  3    B    11,   Case  Data
  4    A    31
  5    A    35
  6    B    13,   Case  Data
  7    A    17
  8    A    33)]
# Without unpacking with *, the contents look like this, so you can see what the unpacking changes
tmp = list(zip(df.groupby((1 * (df['Case'] == 'B')).cumsum().rolling(window=3, min_periods=1).median())))
>
[((0.0,   Case  Data
   0    A    24
   1    A    34
   2    A    33
   3    B    11),), ((1.0,   Case  Data
   4    A    31
   5    A    35
   6    B    13),), ((2.0,   Case  Data
   7    A    17
   8    A    33),)]
# The zipped list is [(0.0, 1.0, 2.0), (the DataFrames shown below)], so [-1] takes the last element: the tuple of DataFrames
dfs = list(zip(*df.groupby((1 * (df['Case'] == 'B')).cumsum().rolling(window=3, min_periods=1).median())))[-1]
>
(  Case  Data
 0    A    24
 1    A    34
 2    A    33
 3    B    11,   Case  Data
 4    A    31
 5    A    35
 6    B    13,   Case  Data
 7    A    17
 8    A    33)

Pivot

Partial sums and subtotals

In [151]: df = pd.DataFrame(data={'Province': ['ON', 'QC', 'BC', 'AL', 'AL', 'MN', 'ON'],
                                  'City': ['Toronto', 'Montreal', 'Vancouver',
                                  'Calgary', 'Edmonton', 'Winnipeg', 'Windsor'],
                                  'Sales': [13, 6, 16, 8, 4, 3, 1]})


In [152]: table = pd.pivot_table(df, values=['Sales'], index=['Province'],
                                 columns=['City'], aggfunc=np.sum, margins=True)

In [153]: table.stack('City')
Out[153]: 
                    Sales
Province City            
AL       All         12.0
         Calgary      8.0
         Edmonton     4.0
BC       All         16.0
         Vancouver   16.0
...                   ...
All      Montreal     6.0
         Toronto     13.0
         Vancouver   16.0
         Windsor      1.0
         Winnipeg     3.0

[20 rows x 1 columns]

**[Code Reading]**

The given data frame has the following contents.

Province City Sales
0 ON Toronto 13
1 QC Montreal 6
2 BC Vancouver 16
3 AL Calgary 8
4 AL Edmonton 4
5 MN Winnipeg 3
6 ON Windsor 1
# Convert to a cross-tab with the pivot_table function
table = pd.pivot_table(df, values=['Sales'], index=['Province'], columns=['City'], aggfunc=np.sum, margins=True)
         Sales
City   Calgary Edmonton Montreal Toronto Vancouver Windsor Winnipeg   All
Province
AL         8.0      4.0      NaN     NaN       NaN     NaN      NaN  12.0
BC         NaN      NaN      NaN     NaN      16.0     NaN      NaN  16.0
MN         NaN      NaN      NaN     NaN       NaN     NaN      3.0   3.0
ON         NaN      NaN      NaN    13.0       NaN     1.0      NaN  14.0
QC         NaN      NaN      6.0     NaN       NaN     NaN      NaN   6.0
All        8.0      4.0      6.0    13.0      16.0     1.0      3.0  33.0

The stack function transforms the data in the column direction into the row direction.
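
A small hedged check (my addition): unstack is the inverse of stack, so moving the City level back into the columns reproduces the pivot table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Province': ['ON', 'QC', 'BC', 'AL', 'AL', 'MN', 'ON'],
                   'City': ['Toronto', 'Montreal', 'Vancouver', 'Calgary',
                            'Edmonton', 'Winnipeg', 'Windsor'],
                   'Sales': [13, 6, 16, 8, 4, 3, 1]})
table = pd.pivot_table(df, values=['Sales'], index=['Province'],
                       columns=['City'], aggfunc=np.sum, margins=True)

# stack moves the City level into the rows; unstack moves it back into the columns
stacked = table.stack('City')
restored = stacked.unstack('City')
```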

R frequency table like plyr

In [154]: grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]

In [155]: df = pd.DataFrame({'ID': ["x%d" % r for r in range(10)],
                             'Gender': ['F', 'M', 'F', 'M', 'F',
                                        'M', 'F', 'M', 'M', 'M'],
                            'ExamYear': ['2007', '2007', '2007', '2008', '2008',
                                         '2008', '2008', '2009', '2009', '2009'],
                            'Class': ['algebra', 'stats', 'bio', 'algebra',
                                      'algebra', 'stats', 'stats', 'algebra', 
                                      'bio', 'bio'],
                            'Participated': ['yes', 'yes', 'yes', 'yes', 'no',
                                             'yes', 'yes', 'yes', 'yes', 'yes'],
                            'Passed': ['yes' if x > 50 else 'no' for x in grades],
                            'Employed': [True, True, True, False,
                                         False, False, False, True, True, False],
                            'Grade': grades})

In [156]: df.groupby('ExamYear').agg({'Participated': lambda x: x.value_counts()['yes'],
                                      'Passed': lambda x: sum(x == 'yes'),
                                      'Employed': lambda x: sum(x),
                                      'Grade': lambda x: sum(x) / len(x)})
Out[156]: 
          Participated  Passed  Employed      Grade
ExamYear                                           
2007                 3       2         3  74.000000
2008                 3       3         0  68.500000
2009                 3       2         2  60.666667

**[Code Reading]**

ID Gender ExamYear Class Participated Passed Employed Grade
0 x0 F 2007 algebra yes no True 48
1 x1 M 2007 stats yes yes True 99
2 x2 F 2007 bio yes yes True 75
3 x3 M 2008 algebra yes yes False 80
4 x4 F 2008 algebra no no False 42
5 x5 M 2008 stats yes yes False 80
6 x6 F 2008 stats yes yes False 72
7 x7 M 2009 algebra yes yes True 68
8 x8 M 2009 bio yes no True 36
9 x9 M 2009 bio yes yes False 78
  • I don't use SQL on a daily basis, but I put it in Sqlite3 and checked it with a SQL statement.
select ExamYear,
count(Participated = 'yes' or null) as participated,
count(Passed = 'yes' or null) as passed,
count(Employed = 'True' or null) as employed,
sum(cast(Grade as real))/count(cast(Grade as real)) as grades
from df
group by ExamYear 
;

"2007"	"3"	"2"	"3"	"74.0"
"2008"	"3"	"3"	"0"	"68.5"
"2009"	"3"	"2"	"2"	"60.6666666666667"

To create a year-month crosstab:

In [157]: df = pd.DataFrame({'value': np.random.randn(36)},
                             index=pd.date_range('2011-01-01', freq='M', periods=36))

In [158]: pd.pivot_table(df, index=df.index.month, columns=df.index.year,
                         values='value', aggfunc='sum')

Out[158]: 
        2011      2012      2013
1  -1.039268 -0.968914  2.565646
2  -0.370647 -1.294524  1.431256
3  -1.157892  0.413738  1.340309
4  -1.344312  0.276662 -1.170299
5   0.844885 -0.472035 -0.226169
6   1.075770 -0.013960  0.410835
7  -0.109050 -0.362543  0.813850
8   1.643563 -0.006154  0.132003
9  -1.469388 -0.923061 -0.827317
10  0.357021  0.895717 -0.076467
11 -0.674600  0.805244 -1.187678
12 -1.776904 -1.206412  1.130127

**[Code Reading]**

# The month index values 1-12 repeat across the 36 records, and so do the years; pivot_table aggregates over each unique month/year combination
df.index.month
>
Int64Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,
             6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
            11, 12],
           dtype='int64')
>
df.index.year
>
Int64Index([2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
            2011, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,
            2012, 2012, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,
            2013, 2013, 2013],
           dtype='int64')

I expected pivot_table to eat memory the way an Excel pivot does, but the result is actually smaller. On reflection that is natural, because the pivoted frame no longer holds the pre-conversion data.

df = pd.DataFrame({'value': np.random.randn(36)},
                  index=pd.date_range('2011-01-01', freq='M', periods=36))
df.__sizeof__()
576
df_pivot = pd.pivot_table(df, index=df.index.month, columns=df.index.year, values='value', aggfunc='sum')
df_pivot.__sizeof__()
384

Apply

Rolling apply to organize: turning embedded lists into a MultiIndex frame

In [159]: df = pd.DataFrame(data={'A': [[2, 4, 8, 16], [100, 200], [10, 20, 30]],
                           'B': [['a', 'b', 'c'], ['jj', 'kk'], ['ccc']]},
                           index=['I', 'II', 'III'])

In [160]: def SeriesFromSubList(aList):
              return pd.Series(aList)

In [161]: df_orgz = pd.concat({ind: row.apply(SeriesFromSubList)
                               for ind, row in df.iterrows()})

In [162]: df_orgz
Out[162]: 
         0    1    2     3
I   A    2    4    8  16.0
    B    a    b    c   NaN
II  A  100  200  NaN   NaN
    B   jj   kk  NaN   NaN
III A   10   20   30   NaN
    B  ccc  NaN  NaN   NaN

**[Code Reading]**

A B
I [2, 4, 8, 16] [a, b, c]
II [100, 200] [jj, kk]
III [10, 20, 30] [ccc]

Each cell of the frame above holds a list. apply converts each list to a pd.Series, and the per-row results are concatenated; rows whose lists are shorter than the longest one are padded with NaN.

for ind, row in df.iterrows():
    print('ind:'+str(ind))
    print(row)
    r = row.apply(SeriesFromSubList)
    print(r)
>
ind:I
A    [2, 4, 8, 16]
B        [a, b, c]
Name: I, dtype: object
   0  1  2     3
A  2  4  8  16.0
B  a  b  c   NaN
ind:II
A    [100, 200]
B      [jj, kk]
Name: II, dtype: object
     0    1
A  100  200
B   jj   kk
ind:III
A    [10, 20, 30]
B           [ccc]
Name: III, dtype: object
     0     1     2
A   10  20.0  30.0
B  ccc   NaN   NaN

Apply a rolling computation over multiple columns, where the function reduces each window to a scalar and the results are collected back into a pd.Series

In [163]: df = pd.DataFrame(data=np.random.randn(2000, 2) / 10000,
                            index=pd.date_range('2001-01-01', periods=2000),
                            columns=['A', 'B'])
In [164]: df
Out[164]: 
                   A         B
2001-01-01 -0.000144 -0.000141
2001-01-02  0.000161  0.000102
2001-01-03  0.000057  0.000088
2001-01-04 -0.000221  0.000097
2001-01-05 -0.000201 -0.000041
...              ...       ...
2006-06-19  0.000040 -0.000235
2006-06-20 -0.000123 -0.000021
2006-06-21 -0.000113  0.000114
2006-06-22  0.000136  0.000109
2006-06-23  0.000027  0.000030
[2000 rows x 2 columns]

In [165]: def gm(df, const):
              v = ((((df['A'] + df['B']) + 1).cumprod()) - 1) * const
              return v.iloc[-1]


In [166]: s = pd.Series({df.index[i]: gm(df.iloc[i:min(i + 51, len(df) - 1)], 5)
                         for i in range(len(df) - 50)})

In [167]: s
Out[167]: 
2001-01-01    0.000930
2001-01-02    0.002615
2001-01-03    0.001281
2001-01-04    0.001117
2001-01-05    0.002772
                ...   
2006-04-30    0.003296
2006-05-01    0.002629
2006-05-02    0.002081
2006-05-03    0.004247
2006-05-04    0.003928
Length: 1950, dtype: float64

**[Code Reading]**

# In this example, a 51-row (51-day) window of data is taken starting at each date.
# The gm function is applied to each window and its return value collected.
# The date and return value are stored for every window start, from the first date ('2001-01-01') to the last ('2006-05-04').
>
# What gm does: take the cumulative product of (df['A'] + df['B'] + 1), subtract 1, multiply by the constant 5, and return the last value.
gm(df.iloc[0:min(0 + 51, len(df) - 1)], 5)
0.000930

Apply rolling to multiple columns when the function is scalar

In [168]: rng = pd.date_range(start='2014-01-01', periods=100)

In [169]: df = pd.DataFrame({'Open': np.random.randn(len(rng)),
                             'Close': np.random.randn(len(rng)),
                             'Volume': np.random.randint(100, 2000, len(rng))},
                            index=rng)
In [170]: df
Out[170]: 
                Open     Close  Volume
2014-01-01 -1.611353 -0.492885    1219
2014-01-02 -3.000951  0.445794    1054
2014-01-03 -0.138359 -0.076081    1381
2014-01-04  0.301568  1.198259    1253
2014-01-05  0.276381 -0.669831    1728
...              ...       ...     ...
2014-04-06 -0.040338  0.937843    1188
2014-04-07  0.359661 -0.285908    1864
2014-04-08  0.060978  1.714814     941
2014-04-09  1.759055 -0.455942    1065
2014-04-10  0.138185 -1.147008    1453
[100 rows x 3 columns]

In [171]: def vwap(bars):
              return ((bars.Close * bars.Volume).sum() / bars.Volume.sum())
 

In [172]: window = 5

In [173]: s = pd.concat([(pd.Series(vwap(df.iloc[i:i + window]),
                          index=[df.index[i + window]]))
                         for i in range(len(df) - window)])

In [174]: s.round(2)
Out[174]: 
2014-01-06    0.02
2014-01-07    0.11
2014-01-08    0.10
2014-01-09    0.07
2014-01-10   -0.29
              ... 
2014-04-06   -0.63
2014-04-07   -0.02
2014-04-08   -0.03
2014-04-09    0.34
2014-04-10    0.29
Length: 95, dtype: float64

**[Code Reading]**

  • Apply the vwap function to each window of 5 rows (5 days) and collect the result.
  • Each result is labeled with the date `window` rows after the window start (df.index[i + window]).
[(pd.Series(vwap(df.iloc[i:i + window]), index=[df.index[i + window]])) for i in range(len(df) - window)]
# Decomposing the function for the first window (i = 0):
tmp = df.iloc[0:0 + window]
(tmp.Close*tmp.Volume).sum()
tmp.Volume.sum()
(tmp.Close*tmp.Volume).sum() / tmp.Volume.sum()
0.02

Time series :clipboard:

Time interval

# indexer_between_time(start, end)
rng = pd.date_range('1/1/2000', periods=24, freq='H')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
2000-01-01 00:00:00    1.298558
2000-01-01 01:00:00    0.333122
2000-01-01 02:00:00   -0.034170
2000-01-01 03:00:00   -2.396508
2000-01-01 04:00:00   -0.608591
2000-01-01 05:00:00    1.710535
2000-01-01 06:00:00   -0.657193
2000-01-01 07:00:00   -1.166563
2000-01-01 08:00:00    1.768663
2000-01-01 09:00:00    1.395805
2000-01-01 10:00:00   -0.111062
2000-01-01 11:00:00    0.840320
2000-01-01 12:00:00    0.129081
2000-01-01 13:00:00   -1.604446
2000-01-01 14:00:00   -1.257967
2000-01-01 15:00:00   -0.384715
2000-01-01 16:00:00   -1.185482
2000-01-01 17:00:00    2.222229
2000-01-01 18:00:00   -0.917695
2000-01-01 19:00:00   -2.158507
2000-01-01 20:00:00   -0.487902
2000-01-01 21:00:00   -1.667997
2000-01-01 22:00:00    0.202186
2000-01-01 23:00:00   -0.710070
Freq: H, dtype: float64
>
#When extracting from 10:00 to 14:00
ts.iloc[ts.index.indexer_between_time(datetime.time(10), datetime.time(14))]
2000-01-01 10:00:00   -0.111062
2000-01-01 11:00:00    0.840320
2000-01-01 12:00:00    0.129081
2000-01-01 13:00:00   -1.604446
2000-01-01 14:00:00   -1.257967
Freq: H, dtype: float64

About time series masks based on the index :clipboard:

index = pd.date_range('2013-1-1',periods=10,freq='15Min')
data = pd.DataFrame(data=[1,2,3,4,5,6,7,8,9,0], columns=['value'], index=index)
data
	value
2013-01-01 00:00:00	1
2013-01-01 00:15:00	2
2013-01-01 00:30:00	3
2013-01-01 00:45:00	4
2013-01-01 01:00:00	5
2013-01-01 01:15:00	6
2013-01-01 01:30:00	7
2013-01-01 01:45:00	8
2013-01-01 02:00:00	9
2013-01-01 02:15:00	0

data.index.indexer_between_time('01:15', '02:00')
array([5, 6, 7, 8])

data.iloc[data.index.indexer_between_time('1:15', '02:00')]
2013-01-01 01:15:00	6
2013-01-01 01:30:00	7
2013-01-01 01:45:00	8
2013-01-01 02:00:00	9

**[Code Reading]**

# If the start and end times are reversed, the complementary period (the times outside the range) is returned
data.iloc[data.index.indexer_between_time('02:00', '1:15')]
	value
2013-01-01 00:00:00	1
2013-01-01 00:15:00	2
2013-01-01 00:30:00	3
2013-01-01 00:45:00	4
2013-01-01 01:00:00	5
2013-01-01 01:15:00	6
2013-01-01 02:00:00	9
2013-01-01 02:15:00	0

Exclude weekends and create a date and time range that includes only specific times :clipboard:

rng = pd.date_range('20130101 09:00','20130110 16:00',freq='30T')

DatetimeIndex(['2013-01-01 09:00:00', '2013-01-01 09:30:00',
               '2013-01-01 10:00:00', '2013-01-01 10:30:00',
               '2013-01-01 11:00:00', '2013-01-01 11:30:00',
               '2013-01-01 12:00:00', '2013-01-01 12:30:00',
               '2013-01-01 13:00:00', '2013-01-01 13:30:00',
               ...
               '2013-01-10 11:30:00', '2013-01-10 12:00:00',
               '2013-01-10 12:30:00', '2013-01-10 13:00:00',
               '2013-01-10 13:30:00', '2013-01-10 14:00:00',
               '2013-01-10 14:30:00', '2013-01-10 15:00:00',
               '2013-01-10 15:30:00', '2013-01-10 16:00:00'],
              dtype='datetime64[ns]', length=447, freq='30T')

#Exclude unnecessary time
rng = rng.take(rng.indexer_between_time('09:30','16:00'))

DatetimeIndex(['2013-01-01 09:30:00', '2013-01-01 10:00:00',
               '2013-01-01 10:30:00', '2013-01-01 11:00:00',
               '2013-01-01 11:30:00', '2013-01-01 12:00:00',
               '2013-01-01 12:30:00', '2013-01-01 13:00:00',
               '2013-01-01 13:30:00', '2013-01-01 14:00:00',
               ...
               '2013-01-10 11:30:00', '2013-01-10 12:00:00',
               '2013-01-10 12:30:00', '2013-01-10 13:00:00',
               '2013-01-10 13:30:00', '2013-01-10 14:00:00',
               '2013-01-10 14:30:00', '2013-01-10 15:00:00',
               '2013-01-10 15:30:00', '2013-01-10 16:00:00'],
              dtype='datetime64[ns]', length=140, freq=None)

#Only on weekdays
rng = rng[rng.weekday<5]
DatetimeIndex(['2013-01-01 09:30:00', '2013-01-01 10:00:00',
               '2013-01-01 10:30:00', '2013-01-01 11:00:00',
               '2013-01-01 11:30:00', '2013-01-01 12:00:00',
               '2013-01-01 12:30:00', '2013-01-01 13:00:00',
               '2013-01-01 13:30:00', '2013-01-01 14:00:00',
               ...
               '2013-01-10 11:30:00', '2013-01-10 12:00:00',
               '2013-01-10 12:30:00', '2013-01-10 13:00:00',
               '2013-01-10 13:30:00', '2013-01-10 14:00:00',
               '2013-01-10 14:30:00', '2013-01-10 15:00:00',
               '2013-01-10 15:30:00', '2013-01-10 16:00:00'],
              dtype='datetime64[ns]', length=112, freq=None)

Look up a matrix in a data frame

The code below has changed some numbers from stackoverflow.com.

date_item = pd.to_datetime(['2019/1/4','2019/1/7','2019/1/7','2019/1/8','2019/1/9','2019/1/10','2019/1/10','2019/1/11','2019/1/15','2019/1/16','2019/1/16','2019/1/17','2019/1/18'])

DatetimeIndex(['2019-01-04', '2019-01-07', '2019-01-07', '2019-01-08',
               '2019-01-09', '2019-01-10', '2019-01-10', '2019-01-11',
               '2019-01-15', '2019-01-16', '2019-01-16', '2019-01-17',
               '2019-01-18'],
              dtype='datetime64[ns]', freq=None)

orders = pd.DataFrame({'Date':date_item,
                      'by':['buy','buy','sell','buy','sell','buy','buy','sell','buy','sell','buy','sell','buy'],
                      'cnt':[100, 200, 100, 50, 100, 100, 200, 50, 100, 50, 50, 50, 100],
                      'code':['1720','8086','4967','8086','8086','4967','1720','4967','1720','1720','8086','4967','8086'],
                      'prices':[1008, 1344, 7530, 1347, 1373, 7120, 1008, 7120, 995, 986, 1417, 6990, 1445]})

Prices

1720 8086 4967
2019-01-04 1008 1311 7370
2019-01-07 1033 1344 7530
2019-01-08 1025 1347 7130
2019-01-09 1034 1373 7460
2019-01-10 1008 1364 7120
2019-01-11 1000 1391 7030
2019-01-15 995 1400 6950
2019-01-16 986 1417 6920
2019-01-17 1002 1419 6990
2019-01-18 1010 1445 7050
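
The Prices frame is only shown as output in the post; a minimal sketch of how it could be built (the values are copied from the table above, the construction itself is my addition):

```python
import pandas as pd

prices = pd.DataFrame(
    {'1720': [1008, 1033, 1025, 1034, 1008, 1000, 995, 986, 1002, 1010],
     '8086': [1311, 1344, 1347, 1373, 1364, 1391, 1400, 1417, 1419, 1445],
     '4967': [7370, 7530, 7130, 7460, 7120, 7030, 6950, 6920, 6990, 7050]},
    index=pd.to_datetime(['2019-01-04', '2019-01-07', '2019-01-08', '2019-01-09',
                          '2019-01-10', '2019-01-11', '2019-01-15', '2019-01-16',
                          '2019-01-17', '2019-01-18']))
```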

Orders

Date by cnt code prices
0 2019-01-04 buy 100 1720 1008
1 2019-01-07 buy 200 8086 1344
2 2019-01-07 sell 100 4967 7530
3 2019-01-08 buy 50 8086 1347
4 2019-01-09 sell 100 8086 1373
5 2019-01-10 buy 100 4967 7120
6 2019-01-10 buy 200 1720 1008
7 2019-01-11 sell 50 4967 7120
8 2019-01-15 buy 100 1720 995
9 2019-01-16 sell 50 1720 986
10 2019-01-16 buy 50 8086 1417
11 2019-01-17 sell 50 4967 6990
12 2019-01-18 buy 100 8086 1445
# lookup(row_labels, col_labels)
prices.lookup(orders.Date, orders.code)
array([1008, 1344, 7530, 1347, 1373, 7120, 1008, 7030,  995,  986, 1417,
       6990, 1445])

Convert a data frame with days as rows and hours as columns into a continuous, chronologically ordered series :clipboard:

The code below has some numerical changes from the content posted on stackoverflow.com.

df = pd.DataFrame({'Date':['2020-01-01', '2020-01-02'],'h1':[23,45],'h2':[18,17],'h3':[11,16],'h4':[29,31],'h24':[45,55]})

	Date		h1	h2	h3	h4	h24
0	2020-01-01	23	18	11	29	45
1	2020-01-02	45	17	16	31	55

#Convert to the following format
	Date	value
0	2020-01-01	23
2	2020-01-01	18
4	2020-01-01	11
6	2020-01-01	29
8	2020-01-01	45
1	2020-01-02	45
3	2020-01-02	17
5	2020-01-02	16
7	2020-01-02	31
9	2020-01-02	55

** [Code Reading] **

# Convert from wide format (one column per hour) to long format
df = pd.melt(df, id_vars=['Date'])
>
	Date	variable	value
0	2020-01-01	h1	23
1	2020-01-02	h1	45
2	2020-01-01	h2	18
3	2020-01-02	h2	17
4	2020-01-01	h3	11
5	2020-01-02	h3	16
6	2020-01-01	h4	29
7	2020-01-02	h4	31
8	2020-01-01	h24	45
9	2020-01-02	h24	55
#Rename column
df = df.rename(columns={'variable': 'hour'})
>
	Date	hour	value
0	2020-01-01	h1	23
1	2020-01-02	h1	45
2	2020-01-01	h2	18
3	2020-01-02	h2	17
4	2020-01-01	h3	11
5	2020-01-02	h3	16
6	2020-01-01	h4	29
7	2020-01-02	h4	31
8	2020-01-01	h24	45
9	2020-01-02	h24	55
# Strip the 'h' prefix from the hour column and convert to 0-based hour numbers
df['hour'] = df['hour'].apply(lambda x: int(x.lstrip('h'))-1)
>
	Date	hour	value
0	2020-01-01	0	23
1	2020-01-02	0	45
2	2020-01-01	1	18
3	2020-01-02	1	17
4	2020-01-01	2	11
5	2020-01-02	2	16
6	2020-01-01	3	29
7	2020-01-02	3	31
8	2020-01-01	23	45
9	2020-01-02	23	55
#Creating column data that combines day and time
combined = df.apply(lambda x: pd.to_datetime(x['Date'], dayfirst=True) + datetime.timedelta(hours=int(x['hour'])), axis=1)
>
0   2020-01-01 00:00:00
1   2020-01-02 00:00:00
2   2020-01-01 01:00:00
3   2020-01-02 01:00:00
4   2020-01-01 02:00:00
5   2020-01-02 02:00:00
6   2020-01-01 03:00:00
7   2020-01-02 03:00:00
8   2020-01-01 23:00:00
9   2020-01-02 23:00:00
dtype: datetime64[ns]
# Replace the 'Date' column with the combined datetime values
df['Date'] = combined
	Date  	  	hour  value 
 0  2020-01-01 00:00:00   0  23 
 1  2020-01-02 00:00:00   0  45 
 2  2020-01-01 01:00:00   1  18 
 3  2020-01-02 01:00:00   1  17 
 4  2020-01-01 02:00:00   2  11 
 5  2020-01-02 02:00:00   2  16 
 6  2020-01-01 03:00:00   3  29 
 7  2020-01-02 03:00:00   3  31 
 8  2020-01-01 23:00:00  23  45 
 9  2020-01-02 23:00:00  23  55 
# Delete the 'hour' column
del df['hour']
>
#Sort by date
df = df.sort_values("Date")
>
Date	value
0	2020-01-01	23
2	2020-01-01	18
4	2020-01-01	11
6	2020-01-01	29
8	2020-01-01	45
1	2020-01-02	45
3	2020-01-02	17
5	2020-01-02	16
7	2020-01-02	31
9	2020-01-02	55
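Putting the steps above together, a compact sketch of the whole transformation (my consolidation of the same logic, not code from the original post):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2020-01-01', '2020-01-02'],
                   'h1': [23, 45], 'h2': [18, 17], 'h3': [11, 16],
                   'h4': [29, 31], 'h24': [45, 55]})

# wide -> long, then turn 'h1'..'h24' into hour offsets added onto the date
long_df = (pd.melt(df, id_vars=['Date'])
             .rename(columns={'variable': 'hour'}))
long_df['hour'] = long_df['hour'].str.lstrip('h').astype(int) - 1
long_df['Date'] = pd.to_datetime(long_df['Date']) + pd.to_timedelta(long_df['hour'], unit='h')
long_df = long_df.drop(columns='hour').sort_values('Date')
```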

Handling duplicates when reindexing a time series to a certain frequency: clipboard:

#A Time-series data (timestamps with random millisecond offsets, dummy data) -> 100 rows
intervals = np.random.randint(0,1000,size=100).cumsum()
df = pd.DataFrame({'time':[pd.Timestamp('20140101') + pd.offsets.Milli(i) for i in intervals ],
                'value' : np.random.randn(len(intervals))})

# df
			time	value
0	2014-01-01 00:00:00.499	0.567731
1	2014-01-01 00:00:01.232	-0.751466
2	2014-01-01 00:00:01.238	1.250118
3	2014-01-01 00:00:01.533	0.588161
4	2014-01-01 00:00:02.210	0.996543
...	...	...
95	2014-01-01 00:00:49.216	0.753233
96	2014-01-01 00:00:49.719	0.849922
97	2014-01-01 00:00:49.732	1.171001
98	2014-01-01 00:00:50.723	-0.443835
99	2014-01-01 00:00:51.389	-0.830960
100 rows × 2 columns

#B Regular time-series index (1-second intervals over one hour) -> 3601 entries
pd.date_range('20140101 00:00:00','20140101 01:00:00',freq='s')

DatetimeIndex(['2014-01-01 00:00:00', '2014-01-01 00:00:01',
               '2014-01-01 00:00:02', '2014-01-01 00:00:03',
               '2014-01-01 00:00:04', '2014-01-01 00:00:05',
               '2014-01-01 00:00:06', '2014-01-01 00:00:07',
               '2014-01-01 00:00:08', '2014-01-01 00:00:09',
               ...
               '2014-01-01 00:59:51', '2014-01-01 00:59:52',
               '2014-01-01 00:59:53', '2014-01-01 00:59:54',
               '2014-01-01 00:59:55', '2014-01-01 00:59:56',
               '2014-01-01 00:59:57', '2014-01-01 00:59:58',
               '2014-01-01 00:59:59', '2014-01-01 01:00:00'],
              dtype='datetime64[ns]', length=3601, freq='S')

#result
2014-01-01 00:00:00.000	NaN
2014-01-01 00:00:00.499	0.567731
2014-01-01 00:00:01.000	0.567731  <-Originally NaN
2014-01-01 00:00:01.232	-0.751466
2014-01-01 00:00:01.238	1.250118
2014-01-01 00:00:01.533	0.588161
2014-01-01 00:00:02.000	0.588161  <-Originally NaN
2014-01-01 00:00:02.210	0.996543
2014-01-01 00:00:02.652	0.322535
2014-01-01 00:00:03.000	0.322535  <-Originally NaN

** [Code Reading] **

  • The heading is confusing, but the point is to merge a second index B into the A time series (both on the same time axis) and use ffill() to fill the newly introduced gaps with the preceding values.
intervals
>
array([  499,  1232,  1238,  1533,  2210,  2652,  3512,  3598,  3663,
        3688,  3758,  3969,  4960,  5933,  6211,  7040,  7763,  8331,
        9329, 10229, 10436, 10671, 10727, 11695, 12249, 13033, 13867,
       13895, 14809, 15069, 16022, 16484, 16597, 17044, 17060, 17344,
       18124, 18629, 19256, 20022, 20620, 21080, 21148, 21603, 22164,
       22872, 23075, 23566, 24133, 24887, 24996, 25132, 25435, 26047,
       26287, 27168, 27228, 28129, 29118, 29539, 30339, 30465, 31396,
       31805, 32583, 33021, 33854, 34439, 34865, 35207, 35212, 35345,
       35865, 36586, 37325, 37422, 38411, 38986, 39227, 39768, 40316,
       41254, 42162, 42476, 43155, 43971, 44791, 44899, 45770, 46701,
       47568, 47901, 48242, 48720, 48940, 49216, 49719, 49732, 50723,
       51389])
>
#Create a B time series and combine it with an A time series index
rng = pd.date_range('20140101 00:00:00','20140101 01:00:00', freq='s')
new_range = pd.DatetimeIndex(np.append(rng, pd.Index(df.time)))
  • In the content posted on stackoverflow.com the two indexes are simply added (A + B), but that raises an error.
  • After checking, it could be handled with DatetimeIndex(np.append(A, B)).
# The combined A + B index is unsorted, so sort it chronologically
new_range = new_range.sort_values()
# Set 'time' as the index, then reindex onto the new combined index
df.set_index('time').reindex(new_range).head()
>
2014-01-01 00:00:00.000	NaN
2014-01-01 00:00:00.499	0.567731
2014-01-01 00:00:01.000	NaN
2014-01-01 00:00:01.232	-0.751466
2014-01-01 00:00:01.238	1.250118
>
# Fill the missing values with ffill()
df.set_index('time').reindex(new_range).ffill().head(10)
>
		value
2014-01-01 00:00:00.000	NaN
2014-01-01 00:00:00.499	0.567731
2014-01-01 00:00:01.000	0.567731
2014-01-01 00:00:01.232	-0.751466
2014-01-01 00:00:01.238	1.250118
2014-01-01 00:00:01.533	0.588161
2014-01-01 00:00:02.000	0.588161
2014-01-01 00:00:02.210	0.996543
2014-01-01 00:00:02.652	0.322535
2014-01-01 00:00:03.000	0.322535
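As a variation (my own note, not in the original post), the forward fill can also be done by `reindex` itself, as long as the original 'time' values are unique and the combined index is sorted:

```python
# One step instead of reindex().ffill(): reindex with method='ffill'.
df.set_index('time').reindex(new_range, method='ffill').head(10)
```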

Resampling

How can I divide a single value in a data frame by the monthly average? : clipboard :: thumbs up:

df = pd.DataFrame(np.random.randint(1000, 1200, (12,1)), columns=['value'],
                  index = pd.date_range('2014-01-01 00:15:00', freq='15min', periods=12))

df
		value 
 2014-01-01 00:15:00  1122 
 2014-01-01 00:30:00  1132 
 2014-01-01 00:45:00  1091 
 2014-01-01 01:00:00  1188 
 2014-01-01 01:15:00  1156 
 2014-01-01 01:30:00  1089 
 2014-01-01 01:45:00  1148 
 2014-01-01 02:00:00  1040 
 2014-01-01 02:15:00  1010 
 2014-01-01 02:30:00  1130 
 2014-01-01 02:45:00  1178 
 2014-01-01 03:00:00  1186 

df['normed'] = df.groupby(grouper).transform(lambda x: x / x.mean())

df
			value	normed
Date		
2014-01-01 00:15:00	1122	0.999555
2014-01-01 00:30:00	1132	1.008463
2014-01-01 00:45:00	1091	0.971938
2014-01-01 01:00:00	1188	1.058352
2014-01-01 01:15:00	1156	1.029844
2014-01-01 01:30:00	1089	0.970156
2014-01-01 01:45:00	1148	1.022717
2014-01-01 02:00:00	1040	0.926503
2014-01-01 02:15:00	1010	0.899777
2014-01-01 02:30:00	1130	1.006682
2014-01-01 02:45:00	1178	1.049443
2014-01-01 03:00:00	1186	1.056570

** [Code Reading] **

df.index.name = 'Date'
# Grouper groups rows by a given frequency or interval; here it is one month, freq='M'.
grouper = pd.Grouper(level=0, freq='M')
>
# Inspecting grouper via the groups property shows the grouping is monthly, keyed on the month-end date.
df.groupby(grouper).groups
{Timestamp('2014-01-31 00:00:00', freq='M'): 12}
>
# An easier-to-read method without Grouper was also posted. This is convenient!
df.groupby([df.index.year, df.index.month]).transform(lambda x: x/x.mean())
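Put together as one runnable sketch (random values, so the numbers differ from the table above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1000, 1200, (12, 1)), columns=['value'],
                  index=pd.date_range('2014-01-01 00:15:00', freq='15min', periods=12))

# Group by calendar month and divide each value by that month's mean.
grouper = pd.Grouper(level=0, freq='M')
df['normed'] = df.groupby(grouper)['value'].transform(lambda x: x / x.mean())

# Sanity check: the normalized values of each month average to 1.0.
df.groupby(grouper)['normed'].mean()
```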

Grouping using MultiIndex: clipboard:

data = pd.concat([pd.DataFrame([['A']*36, list(pd.date_range('1/1/2011', periods=36, freq='H')), list(np.random.rand(36))], index = ['Group', 'Time', 'Value']).T,
                  pd.DataFrame([['B']*36, list(pd.date_range('1/1/2011', periods=36, freq='H')), list(np.random.rand(36))], index = ['Group', 'Time', 'Value']).T,
                  pd.DataFrame([['C']*36, list(pd.date_range('1/1/2011', periods=36, freq='H')), list(np.random.rand(36))], index = ['Group', 'Time', 'Value']).T],
                  axis = 0).set_index(['Group', 'Time'])

data

			Value
Group	Time	
A	2011-01-01 00:00:00	0.306533
	2011-01-01 01:00:00	0.653431
	2011-01-01 02:00:00	0.0284997
	2011-01-01 03:00:00	0.852406
	2011-01-01 04:00:00	0.856705
...	...	...
C	2011-01-02 07:00:00	0.607041
	2011-01-02 08:00:00	0.291705
	2011-01-02 09:00:00	0.480091
	2011-01-02 10:00:00	0.890153
	2011-01-02 11:00:00	0.628454
108 rows × 1 columns

#result
Time        Group
2011-01-01  A        0.497803
            B        0.524934
            C        0.566736
2011-01-02  A        0.442884
            B        0.452267
            C        0.632567
Name: Value, dtype: float64

** [Code Reading] **

data['Value'] = data['Value'].astype(float)
# Group by two keys (Grouper A: by day on the 'Time' level, Grouper B: the 'Group' level) and take the mean.
daily_counts = data.groupby([pd.Grouper(freq='D', level='Time'), pd.Grouper(level='Group')])['Value'].mean()

How do you resample with a custom period?

dtrange = pd.date_range(datetime.datetime(2013,1,1), datetime.datetime(2013,2,20))
df = pd.DataFrame({'p1': np.random.rand(len(dtrange)) + 5,
                   'p2': np.random.rand(len(dtrange)) + 10},
                 index=dtrange)
		p1		p2
2013-01-01	5.428541	10.292222
2013-01-02	5.172898	10.077244
2013-01-03	5.414736	10.587493
2013-01-04	5.235626	10.567700
2013-01-05	5.659100	10.270270
2013-01-06	5.144520	10.033026
2013-01-07	5.340707	10.561371
2013-01-08	5.256909	10.647431
2013-01-09	5.491950	10.036358
2013-01-10	5.869493	10.010417

#result
df.groupby(date).mean()
		p1		p2
2013-01-01	5.401448	10.308353
2013-01-11	5.362789	10.463397
2013-01-21	5.531094	10.346834
2013-02-01	5.559112	10.551347
2013-02-11	5.551023	10.451666

** [Code Reading] **

# d is each date's offset in days from the start of its 10-day period (early/mid/late of the month).
d = df.index.day - np.clip((df.index.day - 1) // 10, 0, 2) * 10 - 1
d
Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9,  0,  1,  2,  3,  4,  5,  6,
             7,  8,  9,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10,  0,  1,  2,
             3,  4,  5,  6,  7,  8,  9,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
           dtype='int64')
# This gets the day-of-month of each index entry.
df.index.day
Int64Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
            18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,  1,  2,  3,
             4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
           dtype='int64')
# Subtract 1 so the day numbers start from 0 before bucketing into 10-day units.
df.index.day - 1
Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,  0,  1,  2,
             3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
           dtype='int64')
# Integer division by 10 gives the bucket index: 0 = early, 1 = mid, 2 = late (the 31st temporarily gives 3)
(df.index.day - 1) // 10
Int64Index([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
            2, 2, 2, 2, 2, 2, 2, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
            1, 1, 1, 1, 1, 1, 1],
           dtype='int64')
# np.clip(a, a_min, a_max, out=None, **kwargs)
# np.clip limits element values to a range; here they are clipped to 0..2 (so day 31 falls into the 'late' bucket).
np.clip((df.index.day - 1) // 10, 0, 2)
Int64Index([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
            2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
            1, 1, 1, 1, 1, 1, 1],
           dtype='int64')
# Convert the bucket number to a day offset: early -> 0, mid -> 10, late -> 20
np.clip((df.index.day - 1) // 10, 0, 2) * 10
Int64Index([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10, 10, 10, 10, 10, 10, 10,
            10, 10, 10, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20,  0,  0,  0,
             0,  0,  0,  0,  0,  0,  0, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10],
           dtype='int64')
#Check the calculation of d.
tmp = pd.DataFrame({'A':df.index.day.values,
                    'B':np.clip((df.index.day - 1) // 10, 0, 2) * 10},
                  index=np.arange(51))
tmp['Diff'] = tmp['A'] - tmp['B']
tmp.head()
	A	B	Diff
0	1	0	1
1	2	0	2
2	3	0	3
3	4	0	4
4	5	0	5
>
tmp['Diff'] = (tmp['A'] - tmp['B']) - 1
	A	B	Diff
0	1	0	0
1	2	0	1
2	3	0	2
3	4	0	3
4	5	0	4
>
# Create the period-start timestamps 2013-01-01, 2013-01-11, 2013-01-21, 2013-02-01, 2013-02-11
# Converting d to timedelta format and subtracting it from the original index snaps each date to the start of its 10-day period.
date = df.index.values - np.array(d, dtype='timedelta64[D]')
>
array(['2013-01-01T00:00:00.000000000', '2013-01-01T00:00:00.000000000',
       '2013-01-01T00:00:00.000000000', '2013-01-01T00:00:00.000000000',
       '2013-01-01T00:00:00.000000000', '2013-01-01T00:00:00.000000000',
       '2013-01-01T00:00:00.000000000', '2013-01-01T00:00:00.000000000',
       '2013-01-01T00:00:00.000000000', '2013-01-01T00:00:00.000000000',
       '2013-01-11T00:00:00.000000000', '2013-01-11T00:00:00.000000000',
       '2013-01-11T00:00:00.000000000', '2013-01-11T00:00:00.000000000',
       '2013-01-11T00:00:00.000000000', '2013-01-11T00:00:00.000000000',
       '2013-01-11T00:00:00.000000000', '2013-01-11T00:00:00.000000000',
       '2013-01-11T00:00:00.000000000', '2013-01-11T00:00:00.000000000',
       '2013-01-21T00:00:00.000000000', '2013-01-21T00:00:00.000000000',
       '2013-01-21T00:00:00.000000000', '2013-01-21T00:00:00.000000000',
       '2013-01-21T00:00:00.000000000', '2013-01-21T00:00:00.000000000',
       '2013-01-21T00:00:00.000000000', '2013-01-21T00:00:00.000000000',
       '2013-01-21T00:00:00.000000000', '2013-01-21T00:00:00.000000000',
       '2013-01-21T00:00:00.000000000', '2013-02-01T00:00:00.000000000',
       '2013-02-01T00:00:00.000000000', '2013-02-01T00:00:00.000000000',
       '2013-02-01T00:00:00.000000000', '2013-02-01T00:00:00.000000000',
       '2013-02-01T00:00:00.000000000', '2013-02-01T00:00:00.000000000',
       '2013-02-01T00:00:00.000000000', '2013-02-01T00:00:00.000000000',
       '2013-02-01T00:00:00.000000000', '2013-02-11T00:00:00.000000000',
       '2013-02-11T00:00:00.000000000', '2013-02-11T00:00:00.000000000',
       '2013-02-11T00:00:00.000000000', '2013-02-11T00:00:00.000000000',
       '2013-02-11T00:00:00.000000000', '2013-02-11T00:00:00.000000000',
       '2013-02-11T00:00:00.000000000', '2013-02-11T00:00:00.000000000',
       '2013-02-11T00:00:00.000000000'], dtype='datetime64[ns]')
>
# Group by this 'date' array and take the mean.
df.groupby(date).mean()
			p1	p2
2013-01-01	5.401448	10.308353
2013-01-11	5.362789	10.463397
2013-01-21	5.531094	10.346834
2013-02-01	5.559112	10.551347
2013-02-11	5.551023	10.451666
  • Create a DatetimeIndex based on the criteria you want to aggregate, and aggregate based on that index.
  • This one-liner was hard to follow as a beginner, so it was worth taking it apart step by step.
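To make the one-liner easier to reuse, the same logic can be wrapped in a small helper (a sketch of my own; the name `dekad_start` is not from the original post):

```python
import numpy as np
import pandas as pd

def dekad_start(index):
    """Snap each date to the start of its 10-day period (the 1st, 11th or 21st)."""
    offset = index.day - np.clip((index.day - 1) // 10, 0, 2) * 10 - 1
    return index - pd.to_timedelta(offset, unit='D')

df.groupby(dekad_start(df.index)).mean()
```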

Aggregate trading hours without adding new days

#Create with 1-minute data
dates = pd.date_range('2014-01-01','2014-01-11', freq='T')[0:-1]
#Only on business days
dates = dates[dates.dayofweek < 5]
s = pd.DataFrame(np.random.randn(dates.size), dates)
s
					0
2014-01-01 00:00:00	-2.593328
2014-01-01 00:01:00	0.173850
2014-01-01 00:02:00	0.781819
2014-01-01 00:03:00	0.734917
2014-01-01 00:04:00	-1.323457
...	...
2014-01-10 23:55:00	0.158127
2014-01-10 23:56:00	1.205610
2014-01-10 23:57:00	-0.757652
2014-01-10 23:58:00	0.350570
2014-01-10 23:59:00	0.886426
11520 rows × 1 columns

#Resample the date and time of the data frame in 30-minute increments and take the average
s.groupby(lambda d: d.date()).resample('30min').mean()
						0
2014-01-01	2014-01-01 00:00:00	-0.152332
		2014-01-01 00:30:00	-0.172909
		2014-01-01 01:00:00	0.110629
		2014-01-01 01:30:00	-0.070501
		2014-01-01 02:00:00	0.058135
		...	...	...
2014-01-10	2014-01-10 21:30:00	-0.158843
		2014-01-10 22:00:00	0.204738
		2014-01-10 22:30:00	-0.057380
		2014-01-10 23:00:00	0.212008
		2014-01-10 23:30:00	-0.021065
384 rows × 1 columns

** [Code Reading] **

The data are grouped by day and resampled in 30-minute bins, taking the mean of each bin. The full range has 14,400 one-minute rows; keeping only business days leaves 11,520 rows, and the 30-minute aggregation reduces them to 384 rows.

Group daily data by month. : clipboard :: thumbs up:

import random
df = pd.DataFrame({'string': [random.choice(('about', 'contact', 'home', 'blog')) for _ in range(120)],
                 'visits':np.random.randint(0,2000, (120,))},
               index = pd.date_range('2001-01-01', freq='1d', periods=120))

re_df = df.groupby('string').resample('M').sum()

			visits
string		
about	2001-01-31	4879
		2001-02-28	6713
		2001-03-31	5747
		2001-04-30	9286
blog	2001-01-31	9164
		2001-02-28	8318
		2001-03-31	4770
		2001-04-30	6571
contact	2001-01-31	6883
		2001-02-28	6177
		2001-03-31	10228
		2001-04-30	3531
home	2001-01-31	5496
		2001-02-28	5261
		2001-03-31	7083
		2001-04-30	11383

** [Code Reading] **

  • random.choice(sequence) picks one of the given strings at random, so each of the 120 days gets one of the four page names.
  • The data are grouped by page name ('string'), then resampled monthly ('M') and summed.

Add monthly composition to a new column

re_df['ratio'] = re_df.groupby(level=1).transform(lambda x: x / x.sum())
>
				visits	ratio
string			
about	2001-01-31	4879	0.184657
		2001-02-28	6713	0.253617
		2001-03-31	5747	0.206519
		2001-04-30	9286	0.301778
blog	2001-01-31	9164	0.346832
		2001-02-28	8318	0.314254
		2001-03-31	4770	0.171410
		2001-04-30	6571	0.213545
contact	2001-01-31	6883	0.260503
		2001-02-28	6177	0.233367
		2001-03-31	10228	0.367543
		2001-04-30	3531	0.114751
home	2001-01-31	5496	0.208008
		2001-02-28	5261	0.198761
		2001-03-31	7083	0.254528
		2001-04-30	11383	0.369926
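As a quick sanity check (my addition), the composition ratios within each month should sum to 1:

```python
# Each month's shares across the four pages add up to 1.0.
re_df.groupby(level=1)['ratio'].sum()
```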

Merge

In [177]: rng = pd.date_range('2000-01-01', periods=6)
In [178]: df1 = pd.DataFrame(np.random.randn(6, 3), index=rng, columns=['A', 'B', 'C'])
In [179]: df2 = df1.copy()
In [180]: df = df1.append(df2, ignore_index=True)

In [181]: df
Out[181]: 
           A         B         C
0  -0.870117 -0.479265 -0.790855
1   0.144817  1.726395 -0.464535
2  -0.821906  1.597605  0.187307
3  -0.128342 -1.511638 -0.289858
4   0.399194 -1.430030 -0.639760
5   1.115116 -2.012600  1.810662
6  -0.870117 -0.479265 -0.790855
7   0.144817  1.726395 -0.464535
8  -0.821906  1.597605  0.187307
9  -0.128342 -1.511638 -0.289858
10  0.399194 -1.430030 -0.639760
11  1.115116 -2.012600  1.810662

** [Code Reading] **

  • Six rows of data are created, copied with copy(), and the two identical frames are concatenated vertically.
  • Depending on the join, `ignore_index=True` may be required.
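For reference, `DataFrame.append` was deprecated in later pandas releases; `pd.concat` gives the same vertical concatenation (a minimal sketch):

```python
# Equivalent concatenation without append.
df = pd.concat([df1, df2], ignore_index=True)
```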

DataFrame self-join

In [182]: df = pd.DataFrame(data={'Area': ['A'] * 5 + ['C'] * 2,
                           'Bins': [110] * 2 + [160] * 3 + [40] * 2,
                           'Test_0': [0, 1, 0, 1, 2, 0, 1],
                           'Data': np.random.randn(7)})

In [183]: df
Out[183]: 
	Area	Bins	Test_0	Data
0	A	110	0	0.632955
1	A	110	1	1.485463
2	A	160	0	-1.193891
3	A	160	1	-0.324484
4	A	160	2	1.293263
5	C	40	0	-0.476979
6	C	40	1	-0.467655

In [184]: df['Test_1'] = df['Test_0'] - 1

In [185]: pd.merge(df, df, left_on=['Bins', 'Area', 'Test_0'],
                   right_on=['Bins', 'Area', 'Test_1'],
                   suffixes=('_L', '_R'))

Out[185]: 
Area	Bins	Test_0_L	Data_L	Test_1_L	Test_0_R	Data_R	Test_1_R
0	A	110	0	0.632955	-1	1	1.485463	0
1	A	160	0	-1.193891	-1	1	-0.324484	0
2	A	160	1	-0.324484	0	2	1.293263	1
3	C	40	0	-0.476979	-1	1	-0.467655	0

** [Code Reading] **

# Create Test_1 = Test_0 - 1 so that merging Test_0 (left) on Test_1 (right) pairs each test with the next one in the same Area/Bins
df['Test_1'] = df['Test_0'] - 1
>
	Area	Bins	Test_0	Data	Test_1
0	A	110	0	0.632955	-1
1	A	110	1	1.485463	0
2	A	160	0	-1.193891	-1
3	A	160	1	-0.324484	0
4	A	160	2	1.293263	1
5	C	40	0	-0.476979	-1
6	C	40	1	-0.467655	0

#### Combine by value-based condition :clipboard:
The code below has some changes from the code posted on stackoverflow.com.

```python
import operator as op

df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}])
df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}])

binOp   = op.lt
matches = np.where(binOp(df_a.a[:,None],df_b.c.values))

pd.concat([df.loc[idxs].reset_index(drop=True)
           for df, idxs in zip([df_a, df_b], matches)], axis=1)
```

df_a
	a	b
0	1	4
1	2	5
2	3	6
df_b

	c	d
0	2	7
1	3	8
# result
	a	b	c	d
0	1	4	2	7
1	1	4	3	8
2	2	5	3	8

[Code reading]

* First, breaking down `matches`:

# Check the contents of columns df.a and df.b
df_a.a[:,None]
array([[1],
       [2],
       [3]])
df_b.c.values
array([2, 3])

* The function-style standard operator `operator.lt(a, b)` performs `a < b`, so the call is equivalent to writing `a < b` directly.

op.lt(df_a.a[:,None],df_b.c.values)
array([[ True,  True],
       [False,  True],
       [False, False]])
# Written differently, the same comparison can be expressed as:
df_a.a[:,None] < df_b.c.values
array([[ True,  True],
       [False,  True],
       [False, False]])
>
# To check this content
df_a.a[:,None][0] < df_b.c.values[0]
array([ True])
df_a.a[:,None][0] < df_b.c.values[1]
array([ True])
df_a.a[:,None][1] < df_b.c.values[0]
array([False])
df_a.a[:,None][1] < df_b.c.values[1]
array([ True])
df_a.a[:,None][2] < df_b.c.values[0]
array([False])
df_a.a[:,None][2] < df_b.c.values[1]
array([False])
>
# In the case of np.where (condition), the index of the element that satisfies the condition is returned in tuple format.
np.where([[True, True], [False, True], [False, False]])
(array([0, 0, 1]), array([0, 1, 1]))
>
# Extract with the conditions obtained in'matches' earlier.
[df.loc[idxs].reset_index(drop=True) for df,idxs in zip([df_a,df_b],matches)]
[   a  b
 0  1  4
 1  1  4
 2  2  5,    c  d
 0  2  7
 1  3  8
 2  3  8]
>
# This is connected in the horizontal direction (axis = 1).
	a	b	c	d
0	1	4	2	7
1	1	4	3	8
2	2	5	3	8
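An alternative sketch of my own, assuming both frames are small enough for a full cross product: build the cross join with a constant key, then filter by the condition.

```python
# Cross join via a dummy key, then keep only the rows where a < c.
out = (df_a.assign(key=1)
            .merge(df_b.assign(key=1), on='key')
            .query('a < c')
            .drop(columns='key')
            .reset_index(drop=True))
```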

Plotting(Drawing)

In [186]: df = pd.DataFrame(
   .....:     {'stratifying_var': np.random.uniform(0, 100, 20),
   .....:      'price': np.random.normal(100, 5, 20)})
   .....: 

In [187]: df['quartiles'] = pd.qcut(
   .....:     df['stratifying_var'],
   .....:     4,
   .....:     labels=['0-25%', '25-50%', '50-75%', '75-100%'])
   .....: 

In [188]: df.boxplot(column='price', by='quartiles')
Out[188]: <matplotlib.axes._subplots.AxesSubplot at 0x7fc66b7f1710>

box_plot.png

[Code reading]

pandas.qcut(x, q, labels=None, ...)
# Split x into q quantile-based bins and attach the given labels.
df['quartiles'] = pd.qcut(df['stratifying_var'], 4, labels=['0-25%', '25-50%', '50-75%', '75-100%'])
>
	stratifying_var	price	quartiles
0	95.463259	110.468740	75-100%
1	61.567537	105.038334	50-75%
2	93.169189	99.502664	75-100%
3	32.881181	100.462400	25-50%
4	22.735506	95.821731	0-25%
5	85.662861	101.262124	75-100%
6	24.251856	102.351950	0-25%
7	26.323525	91.812003	25-50%
8	6.192982	111.425087	0-25%
9	25.520758	104.666583	25-50%
10	75.505473	104.450480	75-100%
11	30.620504	100.044772	25-50%
12	40.438555	102.697402	50-75%
13	18.171318	102.612876	0-25%
14	57.080747	101.067847	50-75%
15	44.066472	93.410125	50-75%
16	64.131972	103.707151	50-75%
17	33.548572	103.359709	25-50%
18	88.433754	99.416668	75-100%
19	21.660715	91.229785	0-25%
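For comparison (my addition): `pd.qcut` cuts at quantiles so each bin receives roughly the same number of rows, while `pd.cut` cuts the value range into equal-width intervals.

```python
# qcut: balanced counts per bin; cut: equal-width bins.
df['quartiles'] = pd.qcut(df['stratifying_var'], 4,
                          labels=['0-25%', '25-50%', '50-75%', '75-100%'])
df['width_bins'] = pd.cut(df['stratifying_var'], 4)
df['quartiles'].value_counts()   # 5 rows per label for these 20 rows
```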

Data IN/Out(Data input / output)

### Read multiple files to create a single DataFrame :thumbsup:

# Example A
In [189]: for i in range(3):
             data = pd.DataFrame(np.random.randn(10, 4))
             data.to_csv('file_{}.csv'.format(i))

In [190]: files = ['file_0.csv', 'file_1.csv', 'file_2.csv']
In [191]: result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

# Example B
In [192]: import glob
In [193]: import os

In [194]: files = glob.glob('file_*.csv')
In [195]: result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

[Code reading]

* In example A, sample files are created in a loop, then read back and concatenated.
* In example B, the file names are collected with the glob function, then the files are read and concatenated.

### Parsing multi-column date components

In [196]: i = pd.date_range('20000101', periods=10000)

In [197]: df = pd.DataFrame({'year': i.year, 'month': i.month, 'day': i.day})

In [198]: df.head()
Out[198]: 
   year  month  day
0  2000      1    1
1  2000      1    2
2  2000      1    3
3  2000      1    4
4  2000      1    5

In [199]: %timeit pd.to_datetime(df.year * 10000 + df.month * 100 + df.day, format='%Y%m%d')
ds = df.apply(lambda x: "%04d%02d%02d" % (x['year'], x['month'], x['day']), axis=1)
ds.head()
%timeit pd.to_datetime(ds)

9.98 ms +- 235 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
2.91 ms +- 57.5 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

[Code reading]

# When calculating and creating time series information using the values of the created data frame
df.year * 10000 + df.month * 100 + df.day
>
# When creating time series information with a function from the elements of the created data frame
 ds = df.apply(lambda x: "%04d%02d%02d" % (x['year'], x['month'], x['day']), axis=1)

The difference between the two approaches comes down to computational cost, as the %timeit results above show.

### Skip the lines between the header and the data

In [200]: data = """;;;;
        ;;;;
        ;;;;
        ;;;;
        ;;;;
        ;;;;
        ;;;;
        ;;;;
        ;;;;
        ;;;;
         date;Param1;Param2;Param4;Param5
             ;m²;°C;m²;m
        ;;;;
         01.01.1990 00:00;1;1;2;3
         01.01.1990 01:00;5;3;4;5
         01.01.1990 02:00;9;5;6;7
         01.01.1990 03:00;13;7;8;9
         01.01.1990 04:00;17;9;10;11
         01.01.1990 05:00;21;11;12;13
 """

Option 1: Explicitly pass the rows to skip

In [201]: from io import StringIO

In [202]: pd.read_csv(StringIO(data), sep=';', skiprows=[11, 12],
                      index_col=0, parse_dates=True, header=10)

Out[202]: 
                     Param1  Param2  Param4  Param5
date                                               
1990-01-01 00:00:00       1       1       2       3
1990-01-01 01:00:00       5       3       4       5
1990-01-01 02:00:00       9       5       6       7
1990-01-01 03:00:00      13       7       8       9
1990-01-01 04:00:00      17       9      10      11
1990-01-01 05:00:00      21      11      12      13

Option 2: Read the column names first, then read the data

In [203]: pd.read_csv(StringIO(data), sep=';', header=10, nrows=10).columns
Out[203]: Index(['date', 'Param1', 'Param2', 'Param4', 'Param5'], dtype='object')

In [204]: columns = pd.read_csv(StringIO(data), sep=';', header=10, nrows=10).columns

In [205]: pd.read_csv(StringIO(data), sep=';', index_col=0,
                      header=12, parse_dates=True, names=columns)

Out[205]: 
                     Param1  Param2  Param4  Param5
date                                               
1990-01-01 00:00:00       1       1       2       3
1990-01-01 01:00:00       5       3       4       5
1990-01-01 02:00:00       9       5       6       7
1990-01-01 03:00:00      13       7       8       9
1990-01-01 04:00:00      17       9      10      11
1990-01-01 05:00:00      21      11      12      13

[Code reading]

I don't think either option has a clear advantage regardless of the data size. In either case you have to find out how many rows to skip and which row is the header, so you either skip the appropriate number of rows when reading, or first open the file separately in an editor to check.
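One way to avoid opening the file by hand is to locate the header row programmatically (a sketch of my own, assuming the header is the first line that starts with 'date'):

```python
from io import StringIO
import pandas as pd

# Find the header row, then apply the same skiprows trick as Option 1.
lines = data.splitlines()
header_row = next(i for i, line in enumerate(lines) if line.strip().startswith('date'))

pd.read_csv(StringIO(data), sep=';', header=header_row,
            skiprows=[header_row + 1, header_row + 2],
            index_col=0, parse_dates=True)
```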


Computation (calculation)

### Correlation

In [211]: df = pd.DataFrame(np.random.random(size=(100, 5)))

In [212]: corr_mat = df.corr()

In [213]: mask = np.tril(np.ones_like(corr_mat, dtype=np.bool), k=-1)

In [214]: corr_mat.where(mask)
Out[214]: 
          0         1         2         3   4
0       NaN       NaN       NaN       NaN NaN
1 -0.018923       NaN       NaN       NaN NaN
2 -0.076296 -0.012464       NaN       NaN NaN
3 -0.169941 -0.289416  0.076462       NaN NaN
4  0.064326  0.018759 -0.084140 -0.079859 NaN

[Code reading]

mask
array([[False, False, False, False, False],
       [ True, False, False, False, False],
       [ True,  True, False, False, False],
       [ True,  True,  True, False, False],
       [ True,  True,  True,  True, False]])
>
mask2 = np.triu(np.ones_like(corr_mat, dtype=np.bool), k=-1)
mask2
array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [False,  True,  True,  True,  True],
       [False, False,  True,  True,  True],
       [False, False, False,  True,  True]])
>
corr_mat.where(mask2)
>
0	1	2	3	4
0	1.000000	-0.022615	0.047486	0.092043	-0.112379
1	-0.022615	1.000000	-0.336864	-0.025473	0.004696
2	NaN			-0.336864	1.000000	0.047746	-0.008458
3	NaN			NaN			0.047746	1.000000	0.133289
4	NaN			NaN			NaN			0.133289	1.000000

numpy.tril() -> Extract the lower triangle of a NumPy array (Lower triangle of an array).
numpy.triu() -> Extract the upper triangle of a NumPy array (Upper triangle of an array).
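A tiny example of the two helpers (arbitrary values, my addition):

```python
import numpy as np

a = np.arange(1, 10).reshape(3, 3)
np.tril(a)        # keep the lower triangle (diagonal included), zero out the rest
np.triu(a, k=1)   # keep only the part strictly above the diagonal
```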


Timedeltas(elapsed time)

In [218]: import datetime

In [219]: s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))

In [220]: s - s.max()
Out[220]: 
0   -2 days
1   -1 days
2    0 days
dtype: timedelta64[ns]

In [221]: s.max() - s
Out[221]: 
0   2 days
1   1 days
2   0 days
dtype: timedelta64[ns]

In [222]: s - datetime.datetime(2011, 1, 1, 3, 5)
Out[222]: 
0   364 days 20:55:00
1   365 days 20:55:00
2   366 days 20:55:00
dtype: timedelta64[ns]

In [223]: s + datetime.timedelta(minutes=5)
Out[223]: 
0   2012-01-01 00:05:00
1   2012-01-02 00:05:00
2   2012-01-03 00:05:00
dtype: datetime64[ns]

In [224]: datetime.datetime(2011, 1, 1, 3, 5) - s
Out[224]: 
0   -365 days +03:05:00
1   -366 days +03:05:00
2   -367 days +03:05:00
dtype: timedelta64[ns]

In [225]: datetime.timedelta(minutes=5) + s
Out[225]: 
0   2012-01-01 00:05:00
1   2012-01-02 00:05:00
2   2012-01-03 00:05:00
dtype: datetime64[ns]

[Code reading]

# 2 weeks later
s + datetime.timedelta(weeks=2)
>
0   2012-01-15
1   2012-01-16
2   2012-01-17
dtype: datetime64[ns]
# 2 weeks ago
s + datetime.timedelta(weeks=-2)
>
0   2011-12-18
1   2011-12-19
2   2011-12-20
dtype: datetime64[ns]

### Addition and subtraction of elapsed time and date

In [226]: deltas = pd.Series([datetime.timedelta(days=i) for i in range(3)])

In [227]: df = pd.DataFrame({'A': s, 'B': deltas})

In [228]: df
Out[228]: 
           A      B
0 2012-01-01 0 days
1 2012-01-02 1 days
2 2012-01-03 2 days

In [229]: df['New Dates'] = df['A'] + df['B']

In [230]: df['Delta'] = df['A'] - df['New Dates']

In [231]: df
Out[231]: 
           A      B  New Dates   Delta
0 2012-01-01 0 days 2012-01-01  0 days
1 2012-01-02 1 days 2012-01-03 -1 days
2 2012-01-03 2 days 2012-01-05 -2 days

In [232]: df.dtypes
Out[232]: 
A             datetime64[ns]
B            timedelta64[ns]
New Dates     datetime64[ns]
Delta        timedelta64[ns]
dtype: object

[Code reading]

# Perform weekly addition
w_deltas = pd.Series([datetime.timedelta(weeks=i) for i in range(3)])
w_deltas
0    0 days
1    7 days
2   14 days
dtype: timedelta64[ns]
>
df['W_Delta'] = df['A'] + w_deltas
>
A	B	New Dates	Delta	W_Delta
0	2012-01-01	0 days	2012-01-01	0 days	2012-01-01
1	2012-01-02	1 days	2012-01-03	-1 days	2012-01-09
2	2012-01-03	2 days	2012-01-05	-2 days	2012-01-17

Axis name alias

### To provide axis name aliases globally, you can define two functions:

def set_axis_alias(cls, axis, alias):
    if axis not in cls._AXIS_NUMBERS:
        raise Exception("invalid axis [%s] for alias [%s]" % (axis, alias))
    cls._AXIS_ALIASES[alias] = axis

def clear_axis_alias(cls, axis, alias):
    if axis not in cls._AXIS_NUMBERS:
        raise Exception("invalid axis [%s] for alias [%s]" % (axis, alias))
    cls._AXIS_ALIASES.pop(alias, None)

set_axis_alias(pd.DataFrame, 'columns', 'myaxis2')
df2 = pd.DataFrame(np.random.randn(3, 2), columns=['c1', 'c2'],
                   index=['i1', 'i2', 'i3'])

In [241]: df2.sum(axis='myaxis2')
Out[241]: 
i1   -0.461013
i2    2.040016
i3    0.904681
dtype: float64

In [242]: clear_axis_alias(pd.DataFrame, 'columns', 'myaxis2')

[Code reading]

# An error occurred in df2.sum (axis ='myaxis2').
df2.sum(axis=1)

This function is supposed to let you define your own axis alias, but in practice an error occurred and no result was obtained; the result above was obtained by specifying axis=1 instead.

Creating example data(Creating sample data)

In [243]: def expand_grid(data_dict):
              rows = itertools.product(*data_dict.values())
              return pd.DataFrame.from_records(rows, columns=data_dict.keys())

In [244]: df = expand_grid({'height': [60, 70],
                            'weight': [100, 140, 180],
                            'sex': ['Male', 'Female']})

In [245]: df
Out[245]: 
    height  weight     sex
0       60     100    Male
1       60     100  Female
2       60     140    Male
3       60     140  Female
4       60     180    Male
5       60     180  Female
6       70     100    Male
7       70     100  Female
8       70     140    Male
9       70     140  Female
10      70     180    Male
11      70     180  Female

* To create a data frame from all combinations of particular values, you can pass a dictionary where the keys are the column names and the values are lists of data values (this is what `expand_grid` above does).


[Code reading]

# * dict.values () expands dictionary-style values
d_dict = {'height': [60, 70], 'weight': [100, 140, 180], 'sex': ['Male', 'Female']}
print(*d_dict.values())
[60, 70] [100, 140, 180] ['Male', 'Female']
>
# All combinations Cartesian product (Descartes) itertools.product is an iterator
rows = itertools.product(*d_dict.values())
>
for r in rows:
    print(r)
>
(60, 100, 'Male')
(60, 100, 'Female')
(60, 140, 'Male')
(60, 140, 'Female')
(60, 180, 'Male')
(60, 180, 'Female')
(70, 100, 'Male')
(70, 100, 'Female')
(70, 140, 'Male')
(70, 140, 'Female')
(70, 180, 'Male')
(70, 180, 'Female')

Afterword

* Thank you for reading to the end. It has been about two and a half years since I started learning Python, and self-studying the official cookbook was a real struggle.
* The first time through, I started by preparing the same data as the official examples, but gave up because I could not come up with sample data partway through.
* The second time, I tried to write the code-reading parts behind collapsible sections while quoting the official text, but decided that this did not convey the feeling of the struggle.
* Please forgive the parts that are not fully explained while my understanding and description of the language are still incomplete. My honest impression after finishing is: **"The one who masters the index masters pandas."**
* For the explanations in the code-reading sections, I used MathJax for the first time, but whether it worked well is unknown.