[Python] [Pandas 1.0.1 Memorial] A fierce battle record with the Cookbook

background

What and how

About the structure of this post


pandas logo

\small{\mathsf{ Date:Feb \ 05, 2020 \ Version: 1.0.1 }}

Pandas Official Site Cookbook

Idioms

If-then ...

Use if-then / if-then-else on one column and assign it to one or more other columns.

In [1]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                           'BBB': [10, 20, 30, 40],
                           'CCC': [100, 50, -30, -50]})

In [2]: df
Out[2]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

An if-then on one column

df.loc[df.AAA >= 5, 'BBB'] = -1

Example of if-then assigned to two columns

df.loc[df.AAA >= 5, ['BBB', 'CCC']] = 555

Change the condition to update the other rows

df.loc[df.AAA < 5, ['BBB', 'CCC']] = 2000

Or apply the condition through a mask with where

In [9]: df_mask = pd.DataFrame({'AAA': [True] * 4,
                                'BBB': [False] * 4,
                                'CCC': [True, False] * 2})

In [10]: df.where(df_mask, -1000)
Out[10]: 
   AAA   BBB   CCC
0    4 -1000  2000
1    5 -1000 -1000
2    6 -1000   555
3    7 -1000 -1000

**[Code Reading]**

  • When creating the data frame, [True] * 4 expands to [True, True, True, True].
  • The contents of df_mask are as follows.
     AAA    BBB    CCC
0   True  False   True
1   True  False  False
2   True  False   True
3   True  False  False
  • Where the first argument (the mask) is True, where keeps the original element; where it is False, the element is replaced by the second argument (-1000). A small cross-check follows below.
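
A small cross-check (my addition, not from the original post): DataFrame.mask is the logical inverse of where, so negating the mask gives the same result. A minimal sketch, starting from a fresh copy of the original frame:

```python
import pandas as pd

df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                   'BBB': [10, 20, 30, 40],
                   'CCC': [100, 50, -30, -50]})
df_mask = pd.DataFrame({'AAA': [True] * 4,
                        'BBB': [False] * 4,
                        'CCC': [True, False] * 2})

# where keeps values where the mask is True; mask replaces values where its condition is True
a = df.where(df_mask, -1000)
b = df.mask(~df_mask, -1000)
assert a.equals(b)
```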

Use np.where for an if-then-else :thumbsup:

# np.where(condition, true, false)
df['logic'] = np.where(df['AAA'] > 5, 'high', 'low')
AAA BBB CCC logic
0 4 2000 2000 low
1 5 555 555 low
2 6 555 555 high
3 7 555 555 high

Splitting

In [16]: df
Out[16]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [17]: df[df.AAA <= 5]
Out[17]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50

In [18]: df[df.AAA > 5]
Out[18]: 
   AAA  BBB  CCC
2    6   30  -30
3    7   40  -50

Split a dataframe with a Boolean condition :clipboard: :thumbsup:

id sex age visits points
0 128 1 36 6 1980
1 324 0 29 9 -50
2 287 0 41 12 -239
3 423 1 33 33 1570
4 589 0 19 10 10

Let's divide this by a Boolean value.
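
The frame above is shown only as output in the post; a minimal sketch of how it could be constructed (the values are copied from the table, the construction itself is my addition):

```python
import pandas as pd

df = pd.DataFrame({'id': [128, 324, 287, 423, 589],
                   'sex': [1, 0, 0, 1, 0],
                   'age': [36, 29, 41, 33, 19],
                   'visits': [6, 9, 12, 33, 10],
                   'points': [1980, -50, -239, 1570, 10]})
```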

**[Code Reading]**

#Check the Boolean value of m
m = df['sex'] != 0
m
0     True
1    False
2    False
3     True
4    False
Name: sex, dtype: bool
#Extract with condition m, extract with negation of m
female, male = df[m], df[~m]

female

id sex age visits points
0 128 1 36 6 1980
3 423 1 33 33 1570

male

id sex age visits points
1 324 0 29 9 -50
2 287 0 41 12 -239
4 589 0 19 10 10

Building criteria

   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50
# (condition A) & (condition B); using the Python keyword 'and' here would raise an error
df.loc[(df['BBB'] < 25) & (df['CCC'] >= -40), 'AAA']
0    4
1    5
Name: AAA, dtype: int64
# (condition A) | (condition B); using the Python keyword 'or' here would raise an error
df.loc[(df['BBB'] > 25) | (df['CCC'] >= -40), 'AAA']
0    4
1    5
2    6
3    7
Name: AAA, dtype: int64
# You can also update the data using & and |
df.loc[(df['BBB'] > 25) | (df['CCC'] >= 75), 'AAA'] = 0.1
AAA BBB CCC
0 0.1 10 100
1 5 20 50
2 0.1 30 -30
3 0.1 40 -50

Use the argsort function to select the row with the data closest to a particular value

df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                   'BBB': [10, 20, 30, 40],
                   'CCC': [100, 50, -30, -50]})

In [26]: df
Out[26]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [27]: aValue = 43.0

In [28]: df.loc[(df.CCC - aValue).abs().argsort()]
Out[28]: 
   AAA  BBB  CCC
1    5   20   50
0    4   10  100
2    6   30  -30
3    7   40  -50

**[Code Reading]**

#Simple calculation
(df.CCC - aValue)
0    57.0
1     7.0
2   -73.0
3   -93.0
Name: CCC, dtype: float64
#Make the result of a simple calculation an absolute value
(df.CCC - aValue).abs()
0    57.0
1     7.0
2    73.0
3    93.0
Name: CCC, dtype: float64
# argsort returns the positions that would sort the absolute differences, closest value first
(df.CCC - aValue).abs().argsort()
0    1
1    0
2    2
3    3
Name: CCC, dtype: int64

Use binary operators to dynamically filter the list of criteria


In [29]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                            'BBB': [10, 20, 30, 40],
                            'CCC': [100, 50, -30, -50]})

In [30]: df
Out[30]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [31]: Crit1 = df.AAA <= 5.5
In [32]: Crit2 = df.BBB == 10.0
In [33]: Crit3 = df.CCC > -40.0
#Hard-coded
In [34]: AllCrit = Crit1 & Crit2 & Crit3

In [35]: import functools
In [36]: CritList = [Crit1, Crit2, Crit3]
# Combine all the criteria with functools.reduce
In [37]: AllCrit = functools.reduce(lambda x, y: x & y, CritList)

In [38]: df[AllCrit]
Out[38]: 
   AAA  BBB  CCC
0    4   10  100

**[Code Reading]**

#Check each condition as a data frame.
tmp = pd.DataFrame({'CritList_0':CritList[0].tolist(),
                    'CritList_1':CritList[1].tolist(),
                    'CritList_2':CritList[2].tolist(),
                    'AllCrit':AllCrit.tolist()
                   }, index=[0,1,2,3])
  • In AllCrit, the logical AND of all the conditions, only the row with index 0 is True. (An alternative way to combine the criteria is sketched after the table below.)
CritList_0 CritList_1 CritList_2 AllCrit
0 True True True True
1 True False True False
2 False False True False
3 False False False False
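
As an aside (a hedged sketch, not in the original cookbook text): NumPy can combine the same list of criteria without a lambda, via np.logical_and.reduce.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                   'BBB': [10, 20, 30, 40],
                   'CCC': [100, 50, -30, -50]})
CritList = [df.AAA <= 5.5, df.BBB == 10.0, df.CCC > -40.0]

# logical_and.reduce ANDs the boolean Series element-wise, like the functools.reduce above
AllCrit = np.logical_and.reduce(CritList)
print(df[AllCrit])
#    AAA  BBB  CCC
# 0    4   10  100
```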

Selection

Dataframes

Use both row labels and value conditions :thumbsup:

In [39]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                            'BBB': [10, 20, 30, 40],
                            'CCC': [100, 50, -30, -50]}) 

In [40]: df
Out[40]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [41]: df[(df.AAA <= 6) & (df.index.isin([0, 2, 4]))]
Out[41]: 
   AAA  BBB  CCC
0    4   10  100
2    6   30  -30

Use loc for slicing on labels and iloc for positional slicing

AAA BBB CCC
foo 4 10 100
bar 5 20 50
boo 6 30 -30
kar 7 40 -50
In [43]: df.loc['bar':'kar']  # Label
Out[43]: 
     AAA  BBB  CCC
bar    5   20   50
boo    6   30  -30
kar    7   40  -50

# Generic
In [44]: df.iloc[0:3]
Out[44]: 
     AAA  BBB  CCC
foo    4   10  100
bar    5   20   50
boo    6   30  -30

**Important points learned**

  • Position-oriented (Python slicing style: the end point is excluded)
  • Label-oriented (non-Python slice style: the end point is included)

Ambiguity arises when the index consists of integers that do not start at zero or are not incremental

In [46]: data = {'AAA': [4, 5, 6, 7],
                 'BBB': [10, 20, 30, 40],
                 'CCC': [100, 50, -30, -50]}

#Note: Index starts at 1
In [47]: df2 = pd.DataFrame(data=data, index=[1, 2, 3, 4])  
In [48]: df2.iloc[1:3]  #Position oriented
Out[48]: 
   AAA  BBB  CCC
2    5   20   50
3    6   30  -30

In [49]: df2.loc[1:3]  #Label oriented
Out[49]: 
   AAA  BBB  CCC
1    4   10  100
2    5   20   50
3    6   30  -30

Select rows by negating a condition with the (~) operator [NOT]

In [50]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                            'BBB': [10, 20, 30, 40],
                            'CCC': [100, 50, -30, -50]})

In [51]: df
Out[51]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [52]: df[~((df.AAA <= 6) & (df.index.isin([0, 2, 4])))]
Out[52]: 
   AAA  BBB  CCC
1    5   20   50
3    7   40  -50

**[Code Reading]**

# The same selection without the `~`:
df[((df.AAA <= 6) & (df.index.isin([0, 2, 4])))]
AAA BBB CCC
0 4 10 100
2 6 30 -30
Applying `~` negates this selection, so the rows with indexes 1 and 3 are returned.

New columns

Efficiently and dynamically create new columns using the applymap function :thumbsup:

In [53]: df = pd.DataFrame({'AAA': [1, 2, 1, 3],
                            'BBB': [1, 1, 2, 2],
                            'CCC': [2, 1, 3, 1]})


In [54]: df
Out[54]: 
   AAA  BBB  CCC
0    1    1    2
1    2    1    1
2    1    2    3
3    3    2    1

In [55]: source_cols = df.columns   #Get existing column name

In [56]: new_cols = [str(x) + "_cat" for x in source_cols]

In [57]: categories = {1: 'Alpha', 2: 'Beta', 3: 'Charlie'}

In [58]: df[new_cols] = df[source_cols].applymap(categories.get)

In [59]: df
Out[59]: 
   AAA  BBB  CCC  AAA_cat BBB_cat  CCC_cat
0    1    1    2    Alpha   Alpha     Beta
1    2    1    1     Beta   Alpha    Alpha
2    1    2    3    Alpha    Beta  Charlie
3    3    2    1  Charlie    Beta    Alpha

**[Code Reading]**

#Get an existing column name
source_cols = df.columns
source_cols
Index(['AAA', 'BBB', 'CCC'], dtype='object')
# Create a list of new names by appending '_cat' to each existing column name
new_cols = [str(x) + "_cat" for x in source_cols]
new_cols
['AAA_cat', 'BBB_cat', 'CCC_cat']
The new columns ['AAA_cat', 'BBB_cat', 'CCC_cat'] are filled by mapping the values of the existing columns through the categories dictionary (applymap(categories.get)). A per-column alternative is sketched below.
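
A hedged aside (my addition, not from the original post): the same mapping can be written per column with Series.map, which some readers may find clearer than applymap with dict.get.

```python
import pandas as pd

df = pd.DataFrame({'AAA': [1, 2, 1, 3],
                   'BBB': [1, 1, 2, 2],
                   'CCC': [2, 1, 3, 1]})
categories = {1: 'Alpha', 2: 'Beta', 3: 'Charlie'}

# Map each source column's values through the dictionary
for col in ['AAA', 'BBB', 'CCC']:
    df[col + '_cat'] = df[col].map(categories)
```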

Keep other columns when using min() with groupby

In [60]: df = pd.DataFrame({'AAA': [1, 1, 1, 2, 2, 2, 3, 3],
                            'BBB': [2, 1, 3, 4, 5, 1, 2, 3]})

In [61]: df
Out[61]: 
   AAA  BBB
0    1    2
1    1    1
2    1    3
3    2    4
4    2    5
5    2    1
6    3    2
7    3    3

Method 1: idxmin() gets the index of the minimum value

In [62]: df.loc[df.groupby("AAA")["BBB"].idxmin()]
Out[62]: 
   AAA  BBB
1    1    1
5    2    1
6    3    2

Method 2: Sort and then get each first

In [63]: df.sort_values(by="BBB").groupby("AAA", as_index=False).first()
Out[63]: 
   AAA  BBB
0    1    1
1    2    1
2    3    2

**[Code Reading]**

  • Method 1
# Group 'BBB' by the 'AAA' column
gb = df.groupby("AAA")["BBB"]
# Getting group '2' of the 'AAA' column shows the following; the minimum value within this group is 1
gb.get_group(2)
3    4
4    5
5    1
Name: BBB, dtype: int64
  • Method 2
# Sort by the 'BBB' column
sv = df.sort_values(by="BBB")
sv
	AAA	BBB
1	1	1
5	2	1
0	1	2
6	3	2
2	1	3
7	3	3
3	2	4
4	2	5
# Group by the 'AAA' column
gb = sv.groupby("AAA", as_index=False)
# For group '2' of the 'AAA' column, the first row after sorting holds the minimum value 1
gb.get_group(2)
	AAA	BBB
5	2	1
3	2	4
4	2	5
# Take the first row of each group
gb.first()
	AAA	BBB
0	1	1
1	2	1
2	3	2

MultiIndexing

Create a MultiIndex from a labeled frame

In [64]: df = pd.DataFrame({'row': [0, 1, 2],
                           'One_X': [1.1, 1.1, 1.1],
                           'One_Y': [1.2, 1.2, 1.2],
                           'Two_X': [1.11, 1.11, 1.11],
                           'Two_Y': [1.22, 1.22, 1.22]})

In [65]: df
Out[65]: 
   row  One_X  One_Y  Two_X  Two_Y
0    0    1.1    1.2   1.11   1.22
1    1    1.1    1.2   1.11   1.22
2    2    1.1    1.2   1.11   1.22

#Labeled index
In [66]: df = df.set_index('row')

In [67]: df
Out[67]: 
     One_X  One_Y  Two_X  Two_Y
row                            
0      1.1    1.2   1.11   1.22
1      1.1    1.2   1.11   1.22
2      1.1    1.2   1.11   1.22

# Split the existing column names on '_' and turn them into a hierarchical index (MultiIndex) of tuples
In [68]: df.columns = pd.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])

In [69]: df
Out[69]: 
     One        Two      
       X    Y     X     Y
row                      
0    1.1  1.2  1.11  1.22
1    1.1  1.2  1.11  1.22
2    1.1  1.2  1.11  1.22

# Stack level 0 of the column hierarchy into the rows and reset that index level
In [70]: df = df.stack(0).reset_index(1)

In [71]: df
Out[71]: 
    level_1     X     Y
row                    
0       One  1.10  1.20
0       Two  1.11  1.22
1       One  1.10  1.20
1       Two  1.11  1.22
2       One  1.10  1.20
2       Two  1.11  1.22

# Rename the columns (note that the label 'level_1' was added automatically)
In [72]: df.columns = ['Sample', 'All_X', 'All_Y']

In [73]: df
Out[73]: 
    Sample  All_X  All_Y
row                     
0      One   1.10   1.20
0      Two   1.11   1.22
1      One   1.10   1.20
1      Two   1.11   1.22
2      One   1.10   1.20
2      Two   1.11   1.22

**[Code Reading]**

# Split the existing column names on '_' with a list comprehension
[tuple(c.split('_')) for c in df.columns]
[('One', 'X'), ('One', 'Y'), ('Two', 'X'), ('Two', 'Y')]
#Hierarchical indexes are represented in tuple format
df.columns = pd.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
df.columns
MultiIndex([('One', 'X'),
            ('One', 'Y'),
            ('Two', 'X'),
            ('Two', 'Y')],
           )
#Pivot from column to row
df = df.stack(0)
		X	Y
row			
0	One	1.10	1.20
	Two	1.11	1.22
1	One	1.10	1.20
	Two	1.11	1.22
2	One	1.10	1.20
	Two	1.11	1.22

# Reset index level 1 ('One', 'Two') into a column
df.reset_index(1)
    level_1     X     Y
row
0       One  1.10  1.20
0       Two  1.11  1.22
1       One  1.10  1.20
1       Two  1.11  1.22
2       One  1.10  1.20
2       Two  1.11  1.22


Arithmetic

The code below has some changes from the official website.

In [74]: cols = pd.MultiIndex.from_tuples([(x, y) for x in ['A', 'B', 'C'] for y in ['O', 'I']])

In [75]: df = pd.DataFrame(np.arange(12).reshape((2,6)), index=['n', 'm'], columns=cols)

In [76]: df
Out[76]: 
	A		B		C
	O	I	O	I	O	I
n	0	1	2	3	4	5
m	6	7	8	9	10	11

In [77]: df = df.div(df['C'], level=1)

In [78]: df
Out[78]: 

	A				B				C
	O	I			O	I			O	I
n	0.0	0.200000	0.5	0.600000	1.0	1.0
m	0.6	0.636364	0.8	0.818182	1.0	1.0

**[Code Reading]**

For row n, column B becomes $[B][O] = 2 \div 4 = 0.5$ and $[B][I] = 3 \div 5 = 0.600000$: df.div(df['C'], level=1) divides each column by the 'C' sub-column that shares the same level-1 label (O or I).

A B C
O I O I O I
n 0.0 0.200000 0.5 0.600000 1.0 1.0
m 0.6 0.636364 0.8 0.818182 1.0 1.0

Slice

coords = [('AA', 'one'), ('AA', 'six'), ('BB', 'one'), ('BB', 'two'), ('BB', 'six')]
index = pd.MultiIndex.from_tuples(coords)
df = pd.DataFrame([11, 22, 33, 44, 55], index, ['MyData'])
MyData
AA one 11
six 22
BB one 33
two 44
six 55

To choose from the first level and the first axis

#First axis
df.xs('BB', level=0, axis=0)
MyData
one 33
two 44
six 55

To select the 'six' label from the second level, along the first axis

# Second level (level=1), first axis
df.xs('six', level=1, axis=0)
MyData
AA 22
BB 55
In [85]: import itertools
In [86]: index = list(itertools.product(['Ada', 'Quinn', 'Violet'],
                                        ['Comp', 'Math', 'Sci']))
In [87]: headr = list(itertools.product(['Exams', 'Labs'], ['I', 'II']))
In [88]: indx = pd.MultiIndex.from_tuples(index, names=['Student', 'Course'])
In [89]: cols = pd.MultiIndex.from_tuples(headr)   # Notice these are un-named
In [90]: data = [[70 + x + y + (x * y) % 3 for x in range(4)] for y in range(9)]
In [91]: df = pd.DataFrame(data, indx, cols)

In [92]: df
Out[92]: 
               Exams     Labs    
                   I  II    I  II
Student Course                   
Ada     Comp      70  71   72  73
        Math      71  73   75  74
        Sci       72  75   75  75
Quinn   Comp      73  74   75  76
        Math      74  76   78  77
        Sci       75  78   78  78
Violet  Comp      76  77   78  79
        Math      77  79   81  80
        Sci       78  81   81  81

In [93]: All = slice(None)

In [94]: df.loc['Violet']
Out[94]: 
       Exams     Labs    
           I  II    I  II
Course                   
Comp      76  77   78  79
Math      77  79   81  80
Sci       78  81   81  81

In [95]: df.loc[(All, 'Math'), All]
Out[95]: 
               Exams     Labs    
                   I  II    I  II
Student Course                   
Ada     Math      71  73   75  74
Quinn   Math      74  76   78  77
Violet  Math      77  79   81  80

In [96]: df.loc[(slice('Ada', 'Quinn'), 'Math'), All]
Out[96]: 
               Exams     Labs    
                   I  II    I  II
Student Course                   
Ada     Math      71  73   75  74
Quinn   Math      74  76   78  77

In [97]: df.loc[(All, 'Math'), ('Exams')]
Out[97]: 
                 I  II
Student Course        
Ada     Math    71  73
Quinn   Math    74  76
Violet  Math    77  79

In [98]: df.loc[(All, 'Math'), (All, 'II')]
Out[98]: 
               Exams Labs
                  II   II
Student Course           
Ada     Math      73   74
Quinn   Math      76   77
Violet  Math      79   80

**[Code Reading]**

All = slice(None)  # What does this mean?
# slice(start, stop, step) -- so all three are None
All
slice(None, None, None)
print(type(All))
# <class 'slice'>  -- a slice object
  • Trying `All = None` instead raises an error, so the selector has to be a slice object.
  • Here, `All` means "no restriction on this level". (A pd.IndexSlice sketch follows after this block.)
indx
MultiIndex([(   'Ada', 'Comp'),
            (   'Ada', 'Math'),
            (   'Ada',  'Sci'),
            ( 'Quinn', 'Comp'),
            ( 'Quinn', 'Math'),
            ( 'Quinn',  'Sci'),
            ('Violet', 'Comp'),
            ('Violet', 'Math'),
            ('Violet',  'Sci')],
           names=['Student', 'Course'])
cols
MultiIndex([('Exams',  'I'),
            ('Exams', 'II'),
            ( 'Labs',  'I'),
            ( 'Labs', 'II')],
           )
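
A related shortcut (my addition, not part of the original post): pandas provides pd.IndexSlice, which plays the same role as the hand-written slice(None). A minimal, self-contained sketch:

```python
import itertools
import pandas as pd

index = list(itertools.product(['Ada', 'Quinn', 'Violet'], ['Comp', 'Math', 'Sci']))
headr = list(itertools.product(['Exams', 'Labs'], ['I', 'II']))
data = [[70 + x + y + (x * y) % 3 for x in range(4)] for y in range(9)]
df = pd.DataFrame(data,
                  pd.MultiIndex.from_tuples(index, names=['Student', 'Course']),
                  pd.MultiIndex.from_tuples(headr))

idx = pd.IndexSlice
# The same selection as df.loc[(All, 'Math'), (All, 'II')] above
print(df.loc[idx[:, 'Math'], idx[:, 'II']])
```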

Sorting

Use MultiIndex to sort by a specific column or an ordered list of columns

df.sort_values(by=('Labs', 'II'), ascending=False)
Exams Labs
I II I II
Student Course
Violet Sci 78 81 81 81
Math 77 79 81 80
Comp 76 77 78 79
Quinn Sci 75 78 78 78
Math 74 76 78 77
Comp 73 74 75 76
Ada Sci 72 75 75 75
Math 71 73 75 74
Comp 70 71 72 73

Levels

Prepend a level to a MultiIndex :clipboard: :thumbsup:

# How to add a level 'Firstlevel: Foo' to this DataFrame?
pd.concat([df], keys=['Foo'], names=['Firstlevel'])
Exams Labs
I II I II
Firstlevel Student Course
Foo Ada Comp 70 71 72 73
Math 71 73 75 74
Sci 72 75 75 75
Quinn Comp 73 74 75 76
Math 74 76 78 77
Sci 75 78 78 78
Violet Comp 76 77 78 79
Math 77 79 81 80
Sci 78 81 81 81

To flatten the hierarchical column index :clipboard:

df.columns = df.columns.get_level_values(0)
Exams Exams Labs Labs
Student Course
Ada Comp 70 71 72 73
Math 71 73 75 74
Sci 72 75 75 75
Quinn Comp 73 74 75 76
Math 74 76 78 77
Sci 75 78 78 78
Violet Comp 76 77 78 79
Math 77 79 81 80
Sci 78 81 81 81

Missing data

Reverse the time series, then forward-fill

In [100]: df = pd.DataFrame(np.random.randn(6, 1),
                            index=pd.date_range('2013-08-01', periods=6, freq='B'),
                            columns=list('A'))

In [101]: df.loc[df.index[3], 'A'] = np.nan

In [102]: df
Out[102]: 
                   A
2013-08-01  0.721555
2013-08-02 -0.706771
2013-08-05 -1.039575
2013-08-06       NaN
2013-08-07 -0.424972
2013-08-08  0.567020

In [103]: df.reindex(df.index[::-1]).ffill()
Out[103]: 
                   A
2013-08-08  0.567020
2013-08-07 -0.424972
2013-08-06 -0.424972
2013-08-05 -1.039575
2013-08-02 -0.706771
2013-08-01  0.721555

**[Code Reading]**

df.reindex(df.index[::-1]).ffill()
df.reindex(df.index[::-1]).fillna(method='ffill')  #Synonymous with above

The above reindexes the frame in reverse order and fills the gaps with fillna(method='ffill'). fillna is more versatile than ffill() because you can choose method='ffill' (forward) or method='bfill' (backward) to fill the gaps.
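
A hedged cross-check (my addition): on the original, non-reversed frame, bfill() fills each NaN from the next later timestamp, so the filled values match the reversed-ffill result above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 1),
                  index=pd.date_range('2013-08-01', periods=6, freq='B'),
                  columns=list('A'))
df.loc[df.index[3], 'A'] = np.nan

filled_reversed = df.reindex(df.index[::-1]).ffill()
filled_bfill = df.bfill()
# Same filled values, just kept in the original ascending order
assert filled_bfill['A'].tolist() == filled_reversed.sort_index()['A'].tolist()
```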

Reset a cumulative sum at NaN :clipboard: :thumbsup:

v = pd.Series([1, 1, 1, np.nan, 1, 1, 1, 1, np.nan, 1], dtype=float)
n = v.isnull()
a = ~n
c = a.astype(float).cumsum()
index = c[n].index  # positions of the NaN values (this line is needed for the next step)
d = pd.Series(np.diff(np.hstack(([0.], c[n]))), index=index)
v[n] = -d
result = v.cumsum()
result
    0    1.0
    1    2.0
    2    3.0
    3    0.0
    4    1.0
    5    2.0
    6    3.0
    7    4.0
    8    0.0
    9    1.0
    dtype: float64

**[Code Reading]**

(1) Create data containing NaN values: v = pd.Series([1, 1, 1, np.nan, 1, 1, 1, 1, np.nan, 1], dtype=float)
(2) n flags the NaN positions as Boolean values: n = v.isnull()
(3) a inverts the Boolean values of n: a = ~n
(4) Accumulate based on the Boolean values of a: c = a.astype(float).cumsum()
(5) Get the index of the NaN positions from the accumulated data: index = c[n].index
(6) Take the differences of 0 followed by c[n]: d = pd.Series(np.diff(np.hstack(([0.], c[n]))), index=index)
(7) Assign the values from (6), negated, into the NaN positions of v: v[n] = -d
(8) Take the cumulative sum of (7); the negative values reset the running total to 0: result = v.cumsum()

\begin{array}{|c|c|c|c|c|c|c|r|c|} \hline
  &  (1)  &  (2)   &  (3)  & (4) &(5)&(6)& (7)  & (8) \\ \hline 
0 &   1.0 &  False & True  & 1.0 &   &   & 1.0  &  1.0 \\ \hline
1 &   1.0 &  False & True  & 2.0 &   &   & 1.0  &  2.0 \\ \hline
2 &   1.0 &  False & True  & 3.0 &   &   & 1.0  &  3.0 \\ \hline
3 &   NaN &  True  & False & 3.0 & 3 & 3 & -3.0 &  0.0 \\ \hline
4 &   1.0 &  False & True  & 4.0 &   &   & 1.0  &  1.0 \\ \hline
5 &   1.0 &  False & True  & 5.0 &   &   & 1.0  &  2.0 \\ \hline
6 &   1.0 &  False & True  & 6.0 &   &   & 1.0  &  3.0 \\ \hline
7 &   1.0 &  False & True  & 7.0 &   &   & 1.0  &  4.0 \\ \hline
8 &   NaN &  True  & False & 7.0 & 8 & 4 & -4.0 &  0.0 \\ \hline
9 &   1.0 &  False & True  & 8.0 &   &   & 1.0 &  1.0 \\ \hline
\end{array}

Grouping

Unlike the agg function, the apply function returns a dataframe that can access all columns

df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(), 
                    'size': list('SSMMMLL'), 
                    'weight': [8, 10, 11, 1, 20, 12, 12], 
                    'adult': [False] * 5 + [True] * 2})
In [105]: df
Out[105]: 
  animal size  weight  adult
0    cat    S       8  False
1    dog    S      10  False
2    cat    M      11  False
3   fish    M       1  False
4    dog    M      20  False
5    cat    L      12   True
6    cat    L      12   True

#List the sizes of the heaviest animals.
In [106]: df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()])
Out[106]: 
animal
cat     L
dog     M
fish    M
dtype: object

**[Code Reading]**

# Disassembling this:
df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()])
# Get the 'size' element at the index where 'weight' is largest
df['size'][df['weight'].idxmax()]
   'M'
# Applied per group, this returns the size of each group's heaviest animal
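
A hedged aside (my addition, not part of the original post): the same per-group answer can be obtained without apply by combining groupby idxmax with loc.

```python
import pandas as pd

df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
                   'size': list('SSMMMLL'),
                   'weight': [8, 10, 11, 1, 20, 12, 12],
                   'adult': [False] * 5 + [True] * 2})

# Index of the heaviest row within each animal group, then look up its size
heaviest = df.groupby('animal')['weight'].idxmax()
print(df.loc[heaviest, ['animal', 'size']])
#   animal size
# 5    cat    L
# 4    dog    M
# 3   fish    M
```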

Use the get_group function

In [107]: gb = df.groupby(['animal'])
# Use 'get_group' to inspect the contents of a GroupBy object
In [108]: gb.get_group('cat')
Out[108]: 
  animal size  weight  adult
0    cat    S       8  False
2    cat    M      11  False
5    cat    L      12   True
6    cat    L      12   True

**[Code Reading]**

#Access to the object acquired by groupby
gb[['animal', 'size', 'weight']].get_group('cat')
animal size weight
0 cat S 8
2 cat M 11
5 cat L 12
6 cat L 12

Apply to different items in the group

In [109]: def GrowUp(x):
              avg_weight = sum(x[x['size'] == 'S'].weight * 1.5)
              avg_weight += sum(x[x['size'] == 'M'].weight * 1.25)
              avg_weight += sum(x[x['size'] == 'L'].weight)
              avg_weight /= len(x)
              return pd.Series(['L', avg_weight, True], index=['size', 'weight', 'adult'])

In [110]: expected_df = gb.apply(GrowUp)

In [111]: expected_df
Out[111]: 
       size   weight  adult
animal                     
cat       L  12.4375   True
dog       L  20.0000   True
fish      L   1.2500   True

**[Code Reading]**

#Calculation of S size
cat_s = sum(df[(df['animal'] == 'cat') & (df['size'] == 'S')].weight * 1.5)
cat_s
12.0
#Calculation of M size
cat_m = sum(df[(df['animal'] == 'cat') & (df['size'] == 'M')].weight * 1.25)
cat_m
13.75
#Calculation of L size
cat_l = sum(df[(df['animal'] == 'cat') & (df['size'] == 'L')].weight)
cat_l
24
# Average over the S, M and L totals
cat_avg = (cat_s + cat_m + cat_l) / len(df[df['animal'] == 'cat'])
cat_avg
12.4375
# The result is returned as a pd.Series

Apply a function over an expanding window of the data

The code below has some numerical changes from the official website.

# S = pd.Series([i / 100.0 for i in range(1, 11)])
S = pd.Series([i for i in range(1, 11)])

S
    0     1
    1     2
    2     3
    3     4
    4     5
    5     6
    6     7
    7     8
    8     9
    9    10
    dtype: int64
def cum_ret(x, y):
    return x * (1 + y)

def red(x):
    p = functools.reduce(cum_ret, x, 1.0)
    return p
# expanding() applies the function over an ever-growing window (like expanding().mean(), .std(), ...)
S.expanding().apply(red, raw=True)

0           2.0
1           6.0
2          24.0
3         120.0
4         720.0
5        5040.0
6       40320.0
7      362880.0
8     3628800.0
9    39916800.0
dtype: float64

**[Code Reading]**

x = 1

1: The red function is called; functools.reduce starts with the initializer 1.0.

2: cum_ret(x, y) is called with x = 1.0, y = 1 and returns x * (1 + y) = 1.0 * (1.0 + 1.0) = 2.0.

3: p = 2.0

x = 2

p = 6.0 ((1*2) * (1+2))

x = 3

p = 24.0 ((1*2*3) * (1+3))

x = 4

p = 120.0 ((1*2*3*4) * (1+4))

Replace values with the per-group mean of the non-negative values, keeping values less than 0 as they are

In [116]: df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, -1, 1, 2]})
	A	B
0	1	1
1	1	-1
2	2	1
3	2	2

In [117]: gb = df.groupby('A')
In [118]: def replace(g):
              mask = g < 0
              return g.where(mask, g[~mask].mean())

In [119]: gb.transform(replace)
Out[119]: 
     B
0  1.0
1 -1.0
2  1.5
3  1.5

Replace values less than 0 with the group mean :clipboard: :thumbsup:

df = pd.DataFrame({'A' : [1, 1, 1, 2, 2, 2], 'B' : [2, 3, -1, 1, 2, 6]})
gb = df.groupby('A')
def replace(g):
    mask = g < 0
    g.loc[mask] = g[~mask].mean()
    return g
gb.transform(replace)

	B
0	2.0
1	3.0
2	2.5
3	1.0
4	2.0
5	6.0

Sort groups by aggregated data

In [120]: df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2,
                             'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
                             'flag': [False, True] * 3})

In [121]: code_groups = df.groupby('code')

In [122]: agg_n_sort_order = code_groups[['data']].transform(sum).sort_values(by='data')

In [123]: sorted_df = df.loc[agg_n_sort_order.index]

In [124]: sorted_df
Out[124]: 
  code  data   flag
1  bar -0.21   True
4  bar -0.59  False
0  foo  0.16  False
3  foo  0.45   True
2  baz  0.33  False
5  baz  0.62   True

**[Code Reading]**

# Compute the per-group total of 'data' with transform, then sort by it
agg_n_sort_order = code_groups[['data']].transform(sum).sort_values(by='data')
	data
1	-0.80
4	-0.80
0	0.61
3	0.61
2	0.95
5	0.95
#Then find and extract the index
agg_n_sort_order.index
Int64Index([1, 4, 0, 3, 2, 5], dtype='int64')

Create multiple summary columns

In [125]: rng = pd.date_range(start="2014-10-07", periods=10, freq='2min')

In [126]: ts = pd.Series(data=list(range(10)), index=rng)

In [127]: def MyCust(x):
              if len(x) > 2:
                  return x[1] * 1.234
              return pd.NaT 

In [128]: mhc = {'Mean': np.mean, 'Max': np.max, 'Custom': MyCust}

In [129]: ts.resample("5min").apply(mhc)
Out[129]: 
Mean    2014-10-07 00:00:00        1
        2014-10-07 00:05:00      3.5
        2014-10-07 00:10:00        6
        2014-10-07 00:15:00      8.5
Max     2014-10-07 00:00:00        2
        2014-10-07 00:05:00        4
        2014-10-07 00:10:00        7
        2014-10-07 00:15:00        9
Custom  2014-10-07 00:00:00    1.234
        2014-10-07 00:05:00      NaT
        2014-10-07 00:10:00    7.404
        2014-10-07 00:15:00      NaT
dtype: object

In [130]: ts
Out[130]: 
2014-10-07 00:00:00    0
2014-10-07 00:02:00    1
2014-10-07 00:04:00    2
2014-10-07 00:06:00    3
2014-10-07 00:08:00    4
2014-10-07 00:10:00    5
2014-10-07 00:12:00    6
2014-10-07 00:14:00    7
2014-10-07 00:16:00    8
2014-10-07 00:18:00    9
Freq: 2T, dtype: int64

**[Code Reading]**

# Resample the time series into '5min' bins
ts_re = ts.resample('5min')
#View the contents of the object
for t in ts_re:
    print(t)
(Timestamp('2014-10-07 00:00:00', freq='5T'),   2014-10-07 00:00:00    0
                                                2014-10-07 00:02:00    1
                                                2014-10-07 00:04:00    2
                                                Freq: 2T, dtype: int64)
(Timestamp('2014-10-07 00:05:00', freq='5T'),   2014-10-07 00:06:00    3
                                                2014-10-07 00:08:00    4
                                                Freq: 2T, dtype: int64)
(Timestamp('2014-10-07 00:10:00', freq='5T'),   2014-10-07 00:10:00    5
                                                2014-10-07 00:12:00    6
                                                2014-10-07 00:14:00    7
                                                Freq: 2T, dtype: int64)
(Timestamp('2014-10-07 00:15:00', freq='5T'),   2014-10-07 00:16:00    8
                                                2014-10-07 00:18:00    9
                                                Freq: 2T, dtype: int64)
   ts                   data   mean  max  MyCust  x[1]
0  2014-10-07 00:00:00  0,1,2  1     2    1.234   1
1  2014-10-07 00:05:00  3,4    3.5   4    NaT     False
2  2014-10-07 00:10:00  5,6,7  6     7    7.404   6
3  2014-10-07 00:15:00  8,9    8.5   9    NaT     False

Create a value count column and reassign it to the DataFrame

In [131]: df = pd.DataFrame({'Color': 'Red Red Red Blue'.split(),
                             'Value': [100, 150, 50, 50]})

In [132]: df
Out[132]: 
  Color  Value
0   Red    100
1   Red    150
2   Red     50
3  Blue     50

In [133]: df['Counts'] = df.groupby(['Color']).transform(len)

In [134]: df
Out[134]: 
  Color  Value  Counts
0   Red    100       3
1   Red    150       3
2   Red     50       3
3  Blue     50       1

Shift a group of column values based on the index

In [135]: df = pd.DataFrame({'line_race': [10, 10, 8, 10, 10, 8],
                             'beyer': [99, 102, 103, 103, 88, 100]},
                             index=['Last Gunfighter', 'Last Gunfighter', 'Last Gunfighter', 
                                    'Paynter', 'Paynter', 'Paynter'])

In [136]: df
Out[136]: 
                 line_race  beyer
Last Gunfighter         10     99
Last Gunfighter         10    102
Last Gunfighter          8    103
Paynter                 10    103
Paynter                 10     88
Paynter                  8    100

In [137]: df['beyer_shifted'] = df.groupby(level=0)['beyer'].shift(1)

In [138]: df
Out[138]: 
                 line_race  beyer  beyer_shifted
Last Gunfighter         10     99            NaN
Last Gunfighter         10    102           99.0
Last Gunfighter          8    103          102.0
Paynter                 10    103            NaN
Paynter                 10     88          103.0
Paynter                  8    100           88.0

Select the row with the maximum value from each group

df = pd.DataFrame({'host': ['other', 'other', 'that', 'this', 'this'],
                   'service': ['mail', 'web', 'mail', 'mail', 'web'],
                   'no': [1, 2, 1, 2, 1]}).set_index(['host', 'service'])

In [140]: mask = df.groupby(level=0).agg('idxmax')

In [141]: df_count = df.loc[mask['no']].reset_index()

In [142]: df_count
Out[142]: 
    host service  no
0  other     web   2
1   that    mail   1
2   this    mail   2

**[Code Reading]**

no
host service
other mail 1
web 2
that mail 1
this mail 2
web 1
# View the contents of the mask grouped by level=0
mask = df.groupby(level=0).agg('idxmax')
		no
host 	
other 	(other, web)
that 	(that, mail)
this 	(this, mail)

For each group, idxmax returns the index label (a tuple, since the index is a MultiIndex) of the row with the maximum value. The rows extracted with this mask then have their index reset. (An alternative without agg is sketched below.)
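
A hedged alternative (my addition): selecting the 'no' column first gives the same labels without agg.

```python
import pandas as pd

df = pd.DataFrame({'host': ['other', 'other', 'that', 'this', 'this'],
                   'service': ['mail', 'web', 'mail', 'mail', 'web'],
                   'no': [1, 2, 1, 2, 1]}).set_index(['host', 'service'])

# idxmax per group returns the (host, service) label of the row with the largest 'no'
mask = df.groupby(level=0)['no'].idxmax()
print(df.loc[mask].reset_index())
#     host service  no
# 0  other     web   2
# 1   that    mail   1
# 2   this    mail   2
```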

Grouping like itertools.groupby

In [143]: df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=['A'])

In [144]: df['A'].groupby((df['A'] != df['A'].shift()).cumsum()).groups
Out[144]: 
{1: Int64Index([0], dtype='int64'),
 2: Int64Index([1], dtype='int64'),
 3: Int64Index([2], dtype='int64'),
 4: Int64Index([3, 4, 5], dtype='int64'),
 5: Int64Index([6], dtype='int64'),
 6: Int64Index([7, 8], dtype='int64')}

In [145]: df['A'].groupby((df['A'] != df['A'].shift()).cumsum()).cumsum()
Out[145]: 
0    0
1    1
2    0
3    1
4    2
5    3
6    0
7    1
8    2
Name: A, dtype: int64

**[Code Reading]**

#The above data frame
df
	A
0	0
1	1
2	0
3	1
4	1
5	1
6	0
7	1
8	1
  • Running the same data through itertools.groupby shows the following.
data = [0, 1, 0, 1, 1, 1, 0, 1, 1]
gb = itertools.groupby(data)
for key, group in gb:
    print(f'{key}: {list(group)}')
0: [0]
1: [1]
0: [0]
1: [1, 1, 1]
0: [0]
1: [1, 1]
  • GroupBy.groups returns group names and labels in dictionary format.
tmp = df.groupby('A')
gb = tmp.groups
gb.get(1)
Int64Index([1, 3, 4, 5, 7, 8], dtype='int64')
gb.get(0)
Int64Index([0, 2, 6], dtype='int64')

(1) Sample data: df['A']
(2) Shifted data: df['A'].shift()
(3) True where (1) and (2) differ: df['A'] != df['A'].shift()
(4) Cumulative sum of (3): (df['A'] != df['A'].shift()).cumsum()
(5) Contents of the object grouped by the cumulative key: gb = df['A'].groupby((df['A'] != df['A'].shift()).cumsum())
(6) Cumulative sum within each group: df['A'].groupby((df['A'] != df['A'].shift()).cumsum()).cumsum()

\begin{array}{|c|c|c|c|c|c|c|} \hline
 & (1) & (2) & (3) & (4) & (5) & (6) \\ \hline
0 & 0 &	NaN & True & 1 & 0 & 0  \\ \hline
1 & 1 &	0.0 & True & 2 & 1 & 1 \\ \hline
2 & 0 &	1.0 & True & 3 & 0 & 0 \\ \hline
3 & 1 & 0.0 & True & 4 & 1,1,1 & 1  \\ \hline
4 & 1 & 1.0 & False & 4 &  & 2 \\ \hline
5 & 1 & 1.0 & False & 4 &  & 3 \\ \hline
6 & 0 &	1.0 & True & 5 & 0 & 0 \\ \hline
7 & 1 &	0.0 & True & 6 & 1,1 & 1 \\ \hline
8 & 1 &	1.0 & False & 6 &  & 2 \\ \hline
\end{array}

Expanding data

Rolling calculation based on value rather than count :clipboard:

df = pd.DataFrame({'RollBasis':[1,1,1,2,3,5,8,10,12,13],
                   'ToRoll':[1,4,-5,2,-4,-2,0,-13,-2,-5]})

def f(x):
    ser = df.ToRoll[(df.RollBasis >= x) & (df.RollBasis < x+5)]
    return ser.sum()

df['Rolled'] = df.RollBasis.apply(f)
df
	RollBasis	ToRoll	Rolled
0	1	1	-4
1	1	4	-4
2	1	-5	-4
3	2	2	-4
4	3	-4	-6
5	5	-2	-2
6	8	0	-15
7	10	-13	-20
8	12	-2	-7
9	13	-5	-5

**[Code Reading]**

RollBasis ToRoll
0 1 1
1 1 4
2 1 -5
3 2 2
4 3 -4
5 5 -2
6 8 0
7 10 -13
8 12 -2
9 13 -5
  • The function f sums the 'ToRoll' values over the rows where 'RollBasis' is at least x and less than x + 5.

$ x=1: (RollBasis \geq 1) \ \& \ (RollBasis < 1+5) \rightarrow sum([1,4,-5,2,-4,-2]) \rightarrow -4 $
$ \vdots $
$ x=4: (RollBasis \geq 4) \ \& \ (RollBasis < 4+5) \rightarrow sum([-2,0]) \rightarrow -2 $

Rolling average over time interval

df = pd.DataFrame({'B': range(5)})

df.index = [pd.Timestamp('20130101 09:00:00'),
            pd.Timestamp('20130101 09:00:02'),
            pd.Timestamp('20130101 09:00:03'),
            pd.Timestamp('20130101 09:00:05'),
            pd.Timestamp('20130101 09:00:06')]

df
2013-01-01 09:00:00	0
2013-01-01 09:00:02	1
2013-01-01 09:00:03	2
2013-01-01 09:00:05	3
2013-01-01 09:00:06	4

df.rolling(2, min_periods=1).sum()
				B
2013-01-01 09:00:00	0.0
2013-01-01 09:00:02	1.0
2013-01-01 09:00:03	3.0
2013-01-01 09:00:05	5.0
2013-01-01 09:00:06	7.0

df.rolling('2s', min_periods=1).sum()
				B
2013-01-01 09:00:00	0.0
2013-01-01 09:00:02	1.0
2013-01-01 09:00:03	3.0
2013-01-01 09:00:05	3.0
2013-01-01 09:00:06	7.0

**[Code Reading]**

# rolling(window size, min_periods=minimum number of observations)
df.rolling(2, min_periods=1).sum()
# 0+1=1, 1+2=3, 2+3=5, 3+4=7 In units of 2 lines
df.rolling('2s', min_periods=1).sum()
# 0+1=1, 1+2=3, 3+0=3, 3+4=7
# Totals when the time series is resampled at 2-second intervals
df.resample('2s').sum()
		B
2013-01-01 09:00:00	0
2013-01-01 09:00:02	3
2013-01-01 09:00:04	3
2013-01-01 09:00:06	4

Calculate the cumulative average from the total in time series units with the rolling function

df = pd.DataFrame({'enddate':pd.to_datetime(['2019-03-01','2019-03-01','2019-03-01','2019-03-02','2019-03-02','2019-03-02','2019-03-03','2019-03-03','2019-03-04']),
                   'bloom':[0.342, 0.235, 0.456, 0.389, 0.453, 0.367, 0.369, 0.428, 0.55],
                   'unbloom':[0.658, 0.765, 0.544, 0.611, 0.547, 0.633, 0.631, 0.572, 0.45]})
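
The post stops after creating this frame. A minimal sketch of one possible continuation (my assumption, not shown in the source): if the goal is the per-day total followed by a running (cumulative) average, it could look like this.

```python
import pandas as pd

df = pd.DataFrame({'enddate': pd.to_datetime(['2019-03-01', '2019-03-01', '2019-03-01',
                                              '2019-03-02', '2019-03-02', '2019-03-02',
                                              '2019-03-03', '2019-03-03', '2019-03-04']),
                   'bloom': [0.342, 0.235, 0.456, 0.389, 0.453, 0.367, 0.369, 0.428, 0.55],
                   'unbloom': [0.658, 0.765, 0.544, 0.611, 0.547, 0.633, 0.631, 0.572, 0.45]})

# Per-day totals, then a cumulative (expanding) average over the days
daily = df.groupby('enddate')[['bloom', 'unbloom']].sum()
cum_avg = daily.expanding().mean()
print(cum_avg)
```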

Splitting

Create a list of data frames and split them based on the values contained in the row.

In [146]: df = pd.DataFrame(data={'Case': ['A', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'A'],
                                  'Data': np.random.randint(10, 50, (9,))})

In [147]: dfs = list(zip(*df.groupby((1 * (df['Case'] == 'B')).cumsum()
                         .rolling(window=3, min_periods=1).median())))[-1]

In [148]: dfs[0]
Out[148]: 
  Case      Data
0	A	24
1	A	34
2	A	33
3	B	11
In [149]: dfs[1]
Out[149]: 
  Case      Data
4	A	31
5	A	35
6	B	13

In [150]: dfs[2]
Out[150]: 
Case	Data
7	A	17
8	A	33

**[Code Reading]**

Case Data
0 A 24
1 A 34
2 A 33
3 B 11
4 A 31
5 A 35
6 B 13
7 A 17
8 A 33
dfs = list(zip(*df.groupby((1 * (df['Case'] == 'B')).cumsum().rolling(window=3, min_periods=1).median())))[-1]
  • I will disassemble this one liner.
# Rows where the 'Case' column equals 'B' become 1
1 * (df['Case'] == 'B')
>
0    0
1    0
2    0
3    1
4    0
5    0
6    1
7    0
8    0
Name: Case, dtype: int64
# Take the cumulative sum
1 * (df['Case'] == 'B').cumsum()
>
0    0
1    0
2    0
3    1
4    1
5    1
6    2
7    2
8    2
Name: Case, dtype: int64
# Take the rolling median of the cumulative values over windows of 3 rows
(1 * (df['Case'] == 'B')).cumsum().rolling(window=3, min_periods=1).median()
>
0    0.0
1    0.0
2    0.0
3    0.0
4    1.0
5    1.0
6    1.0
7    2.0
8    2.0
Name: Case, dtype: float64
# The previous step produced the grouping key; group by it, then unpack and zip the resulting group objects into a list
tmp2 = list(zip(*df.groupby((1 * (df['Case'] == 'B')).cumsum().rolling(window=3, min_periods=1).median())))
>
[(0.0, 1.0, 2.0), (  Case  Data
  0    A    24
  1    A    34
  2    A    33
  3    B    11,   Case  Data
  4    A    31
  5    A    35
  6    B    13,   Case  Data
  7    A    17
  8    A    33)]
# Without unpacking with *, the contents look like this, so you can see what the unpacking changes
tmp = list(zip(df.groupby((1 * (df['Case'] == 'B')).cumsum().rolling(window=3, min_periods=1).median())))
>
[((0.0,   Case  Data
   0    A    24
   1    A    34
   2    A    33
   3    B    11),), ((1.0,   Case  Data
   4    A    31
   5    A    35
   6    B    13),), ((2.0,   Case  Data
   7    A    17
   8    A    33),)]
# The zipped list is [(0.0, 1.0, 2.0), (the DataFrames shown below)], so [-1] takes the last element: the tuple of DataFrames
dfs = list(zip(*df.groupby((1 * (df['Case'] == 'B')).cumsum().rolling(window=3, min_periods=1).median())))[-1]
>
(  Case  Data
 0    A    24
 1    A    34
 2    A    33
 3    B    11,   Case  Data
 4    A    31
 5    A    35
 6    B    13,   Case  Data
 7    A    17
 8    A    33)

Pivot

Partial sums and subtotals

In [151]: df = pd.DataFrame(data={'Province': ['ON', 'QC', 'BC', 'AL', 'AL', 'MN', 'ON'],
                                  'City': ['Toronto', 'Montreal', 'Vancouver',
                                  'Calgary', 'Edmonton', 'Winnipeg', 'Windsor'],
                                  'Sales': [13, 6, 16, 8, 4, 3, 1]})


In [152]: table = pd.pivot_table(df, values=['Sales'], index=['Province'],
                                 columns=['City'], aggfunc=np.sum, margins=True)

In [153]: table.stack('City')
Out[153]: 
                    Sales
Province City            
AL       All         12.0
         Calgary      8.0
         Edmonton     4.0
BC       All         16.0
         Vancouver   16.0
...                   ...
All      Montreal     6.0
         Toronto     13.0
         Vancouver   16.0
         Windsor      1.0
         Winnipeg     3.0

[20 rows x 1 columns]

**[Code Reading]**

The given data frame has the following contents.

Province City Sales
0 ON Toronto 13
1 QC Montreal 6
2 BC Vancouver 16
3 AL Calgary 8
4 AL Edmonton 4
5 MN Winnipeg 3
6 ON Windsor 1
# Convert to a cross-tab with the pivot_table function
table = pd.pivot_table(df, values=['Sales'], index=['Province'], columns=['City'], aggfunc=np.sum, margins=True)
         Sales
City   Calgary Edmonton Montreal Toronto Vancouver Windsor Winnipeg   All
Province
AL         8.0      4.0      NaN     NaN       NaN     NaN      NaN  12.0
BC         NaN      NaN      NaN     NaN      16.0     NaN      NaN  16.0
MN         NaN      NaN      NaN     NaN       NaN     NaN      3.0   3.0
ON         NaN      NaN      NaN    13.0       NaN     1.0      NaN  14.0
QC         NaN      NaN      6.0     NaN       NaN     NaN      NaN   6.0
All        8.0      4.0      6.0    13.0      16.0     1.0      3.0  33.0

The stack function transforms the data in the column direction into the row direction.
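
A small hedged check (my addition): unstack is the inverse of stack, so moving the City level back into the columns reproduces the pivot table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Province': ['ON', 'QC', 'BC', 'AL', 'AL', 'MN', 'ON'],
                   'City': ['Toronto', 'Montreal', 'Vancouver', 'Calgary',
                            'Edmonton', 'Winnipeg', 'Windsor'],
                   'Sales': [13, 6, 16, 8, 4, 3, 1]})
table = pd.pivot_table(df, values=['Sales'], index=['Province'],
                       columns=['City'], aggfunc=np.sum, margins=True)

# stack moves the City level into the rows; unstack moves it back into the columns
stacked = table.stack('City')
restored = stacked.unstack('City')
```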

R frequency table like plyr

In [154]: grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]

In [155]: df = pd.DataFrame({'ID': ["x%d" % r for r in range(10)],
                             'Gender': ['F', 'M', 'F', 'M', 'F',
                                        'M', 'F', 'M', 'M', 'M'],
                            'ExamYear': ['2007', '2007', '2007', '2008', '2008',
                                         '2008', '2008', '2009', '2009', '2009'],
                            'Class': ['algebra', 'stats', 'bio', 'algebra',
                                      'algebra', 'stats', 'stats', 'algebra', 
                                      'bio', 'bio'],
                            'Participated': ['yes', 'yes', 'yes', 'yes', 'no',
                                             'yes', 'yes', 'yes', 'yes', 'yes'],
                            'Passed': ['yes' if x > 50 else 'no' for x in grades],
                            'Employed': [True, True, True, False,
                                         False, False, False, True, True, False],
                            'Grade': grades})

In [156]: df.groupby('ExamYear').agg({'Participated': lambda x: x.value_counts()['yes'],
                                      'Passed': lambda x: sum(x == 'yes'),
                                      'Employed': lambda x: sum(x),
                                      'Grade': lambda x: sum(x) / len(x)})
Out[156]: 
          Participated  Passed  Employed      Grade
ExamYear                                           
2007                 3       2         3  74.000000
2008                 3       3         0  68.500000
2009                 3       2         2  60.666667

**[Code Reading]**

ID Gender ExamYear Class Participated Passed Employed Grade
0 x0 F 2007 algebra yes no True 48
1 x1 M 2007 stats yes yes True 99
2 x2 F 2007 bio yes yes True 75
3 x3 M 2008 algebra yes yes False 80
4 x4 F 2008 algebra no no False 42
5 x5 M 2008 stats yes yes False 80
6 x6 F 2008 stats yes yes False 72
7 x7 M 2009 algebra yes yes True 68
8 x8 M 2009 bio yes no True 36
9 x9 M 2009 bio yes yes False 78
  • I don't use SQL on a daily basis, but I put it in Sqlite3 and checked it with a SQL statement.
select ExamYear,
count(Participated = 'yes' or null) as participated,
count(Passed = 'yes' or null) as passed,
count(Employed = 'True' or null) as employed,
sum(cast(Grade as real))/count(cast(Grade as real)) as grades
from df
group by ExamYear 
;

"2007"	"3"	"2"	"3"	"74.0"
"2008"	"3"	"3"	"0"	"68.5"
"2009"	"3"	"2"	"2"	"60.6666666666667"

To create a year-month crosstab:

In [157]: df = pd.DataFrame({'value': np.random.randn(36)},
                             index=pd.date_range('2011-01-01', freq='M', periods=36))

In [158]: pd.pivot_table(df, index=df.index.month, columns=df.index.year,
                         values='value', aggfunc='sum')

Out[158]: 
        2011      2012      2013
1  -1.039268 -0.968914  2.565646
2  -0.370647 -1.294524  1.431256
3  -1.157892  0.413738  1.340309
4  -1.344312  0.276662 -1.170299
5   0.844885 -0.472035 -0.226169
6   1.075770 -0.013960  0.410835
7  -0.109050 -0.362543  0.813850
8   1.643563 -0.006154  0.132003
9  -1.469388 -0.923061 -0.827317
10  0.357021  0.895717 -0.076467
11 -0.674600  0.805244 -1.187678
12 -1.776904 -1.206412  1.130127

**[Code Reading]**

# The month index values 1-12 repeat across the 36 records, and so do the years; pivot_table aggregates over each unique month/year combination
df.index.month
>
Int64Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,
             6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
            11, 12],
           dtype='int64')
>
df.index.year
>
Int64Index([2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
            2011, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,
            2012, 2012, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,
            2013, 2013, 2013],
           dtype='int64')

I expected pivot_table to eat memory the way an Excel pivot does, but the result is actually smaller. On reflection that is natural, because the pivoted frame no longer holds the pre-conversion data.

df = pd.DataFrame({'value': np.random.randn(36)},
                  index=pd.date_range('2011-01-01', freq='M', periods=36))
df.__sizeof__()
576
df_pivot = pd.pivot_table(df, index=df.index.month, columns=df.index.year, values='value', aggfunc='sum')
df_pivot.__sizeof__()
384

Apply

Rolling apply to organize: turning embedded lists into a MultiIndex frame

In [159]: df = pd.DataFrame(data={'A': [[2, 4, 8, 16], [100, 200], [10, 20, 30]],
                           'B': [['a', 'b', 'c'], ['jj', 'kk'], ['ccc']]},
                           index=['I', 'II', 'III'])

In [160]: def SeriesFromSubList(aList):
              return pd.Series(aList)

In [161]: df_orgz = pd.concat({ind: row.apply(SeriesFromSubList)
                               for ind, row in df.iterrows()})

In [162]: df_orgz
Out[162]: 
         0    1    2     3
I   A    2    4    8  16.0
    B    a    b    c   NaN
II  A  100  200  NaN   NaN
    B   jj   kk  NaN   NaN
III A   10   20   30   NaN
    B  ccc  NaN  NaN   NaN

**[Code Reading]**

A B
I [2, 4, 8, 16] [a, b, c]
II [100, 200] [jj, kk]
III [10, 20, 30] [ccc]

Each cell of the frame above holds a list. apply converts each list to a pd.Series, and the per-row results are concatenated; rows whose lists are shorter than the longest one are padded with NaN.

for ind, row in df.iterrows():
    print('ind:'+str(ind))
    print(row)
    r = row.apply(SeriesFromSubList)
    print(r)
>
ind:I
A    [2, 4, 8, 16]
B        [a, b, c]
Name: I, dtype: object
   0  1  2     3
A  2  4  8  16.0
B  a  b  c   NaN
ind:II
A    [100, 200]
B      [jj, kk]
Name: II, dtype: object
     0    1
A  100  200
B   jj   kk
ind:III
A    [10, 20, 30]
B           [ccc]
Name: III, dtype: object
     0     1     2
A   10  20.0  30.0
B  ccc   NaN   NaN

Apply a rolling computation over multiple columns, where the function reduces each window to a scalar and the results are collected back into a pd.Series

In [163]: df = pd.DataFrame(data=np.random.randn(2000, 2) / 10000,
                            index=pd.date_range('2001-01-01', periods=2000),
                            columns=['A', 'B'])
In [164]: df
Out[164]: 
                   A         B
2001-01-01 -0.000144 -0.000141
2001-01-02  0.000161  0.000102
2001-01-03  0.000057  0.000088
2001-01-04 -0.000221  0.000097
2001-01-05 -0.000201 -0.000041
...              ...       ...
2006-06-19  0.000040 -0.000235
2006-06-20 -0.000123 -0.000021
2006-06-21 -0.000113  0.000114
2006-06-22  0.000136  0.000109
2006-06-23  0.000027  0.000030
[2000 rows x 2 columns]

In [165]: def gm(df, const):
              v = ((((df['A'] + df['B']) + 1).cumprod()) - 1) * const
              return v.iloc[-1]


In [166]: s = pd.Series({df.index[i]: gm(df.iloc[i:min(i + 51, len(df) - 1)], 5)
                         for i in range(len(df) - 50)})

In [167]: s
Out[167]: 
2001-01-01    0.000930
2001-01-02    0.002615
2001-01-03    0.001281
2001-01-04    0.001117
2001-01-05    0.002772
                ...   
2006-04-30    0.003296
2006-05-01    0.002629
2006-05-02    0.002081
2006-05-03    0.004247
2006-05-04    0.003928
Length: 1950, dtype: float64

**[Code Reading]**

# In this example, a 51-row (51-day) window of data is taken starting at each date.
# The gm function is applied to each window and its return value collected.
# The date and return value are stored for every window start, from the first date ('2001-01-01') to the last ('2006-05-04').
>
# What gm does: take the cumulative product of (df['A'] + df['B'] + 1), subtract 1, multiply by the constant 5, and return the last value.
gm(df.iloc[0:min(0 + 51, len(df) - 1)], 5)
0.000930

Apply rolling to multiple columns when the function is scalar

In [168]: rng = pd.date_range(start='2014-01-01', periods=100)

In [169]: df = pd.DataFrame({'Open': np.random.randn(len(rng)),
                             'Close': np.random.randn(len(rng)),
                             'Volume': np.random.randint(100, 2000, len(rng))},
                            index=rng)
In [170]: df
Out[170]: 
                Open     Close  Volume
2014-01-01 -1.611353 -0.492885    1219
2014-01-02 -3.000951  0.445794    1054
2014-01-03 -0.138359 -0.076081    1381
2014-01-04  0.301568  1.198259    1253
2014-01-05  0.276381 -0.669831    1728
...              ...       ...     ...
2014-04-06 -0.040338  0.937843    1188
2014-04-07  0.359661 -0.285908    1864
2014-04-08  0.060978  1.714814     941
2014-04-09  1.759055 -0.455942    1065
2014-04-10  0.138185 -1.147008    1453
[100 rows x 3 columns]

In [171]: def vwap(bars):
              return ((bars.Close * bars.Volume).sum() / bars.Volume.sum())
 

In [172]: window = 5

In [173]: s = pd.concat([(pd.Series(vwap(df.iloc[i:i + window]),
                          index=[df.index[i + window]]))
                         for i in range(len(df) - window)])

In [174]: s.round(2)
Out[174]: 
2014-01-06    0.02
2014-01-07    0.11
2014-01-08    0.10
2014-01-09    0.07
2014-01-10   -0.29
              ... 
2014-04-06   -0.63
2014-04-07   -0.02
2014-04-08   -0.03
2014-04-09    0.34
2014-04-10    0.29
Length: 95, dtype: float64

**[Code Reading]**

  • Apply the vwap function to each window of 5 rows (5 days) and collect the result.
  • Each result is labeled with the date `window` rows after the window start (df.index[i + window]).
[(pd.Series(vwap(df.iloc[i:i + window]), index=[df.index[i + window]])) for i in range(len(df) - window)]
# Decomposing the function for the first window (i = 0):
tmp = df.iloc[0:0 + window]
(tmp.Close*tmp.Volume).sum()
tmp.Volume.sum()
(tmp.Close*tmp.Volume).sum() / tmp.Volume.sum()
0.02

Time series :clipboard:

Time interval

# indexer_between_time(start, end)
rng = pd.date_range('1/1/2000', periods=24, freq='H')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
2000-01-01 00:00:00    1.298558
2000-01-01 01:00:00    0.333122
2000-01-01 02:00:00   -0.034170
2000-01-01 03:00:00   -2.396508
2000-01-01 04:00:00   -0.608591
2000-01-01 05:00:00    1.710535
2000-01-01 06:00:00   -0.657193
2000-01-01 07:00:00   -1.166563
2000-01-01 08:00:00    1.768663
2000-01-01 09:00:00    1.395805
2000-01-01 10:00:00   -0.111062
2000-01-01 11:00:00    0.840320
2000-01-01 12:00:00    0.129081
2000-01-01 13:00:00   -1.604446
2000-01-01 14:00:00   -1.257967
2000-01-01 15:00:00   -0.384715
2000-01-01 16:00:00   -1.185482
2000-01-01 17:00:00    2.222229
2000-01-01 18:00:00   -0.917695
2000-01-01 19:00:00   -2.158507
2000-01-01 20:00:00   -0.487902
2000-01-01 21:00:00   -1.667997
2000-01-01 22:00:00    0.202186
2000-01-01 23:00:00   -0.710070
Freq: H, dtype: float64
>
#When extracting from 10:00 to 14:00
ts.iloc[ts.index.indexer_between_time(datetime.time(10), datetime.time(14))]
2000-01-01 10:00:00   -0.111062
2000-01-01 11:00:00    0.840320
2000-01-01 12:00:00    0.129081
2000-01-01 13:00:00   -1.604446
2000-01-01 14:00:00   -1.257967
Freq: H, dtype: float64

About time series masks based on the index :clipboard:

index = pd.date_range('2013-1-1',periods=10,freq='15Min')
data = pd.DataFrame(data=[1,2,3,4,5,6,7,8,9,0], columns=['value'], index=index)
data
	value
2013-01-01 00:00:00	1
2013-01-01 00:15:00	2
2013-01-01 00:30:00	3
2013-01-01 00:45:00	4
2013-01-01 01:00:00	5
2013-01-01 01:15:00	6
2013-01-01 01:30:00	7
2013-01-01 01:45:00	8
2013-01-01 02:00:00	9
2013-01-01 02:15:00	0

data.index.indexer_between_time('01:15', '02:00')
array([5, 6, 7, 8])

data.iloc[data.index.indexer_between_time('1:15', '02:00')]
2013-01-01 01:15:00	6
2013-01-01 01:30:00	7
2013-01-01 01:45:00	8
2013-01-01 02:00:00	9

**[Code Reading]**

# If the start and end times are reversed, the complementary period (the times outside the range) is returned
data.iloc[data.index.indexer_between_time('02:00', '1:15')]
	value
2013-01-01 00:00:00	1
2013-01-01 00:15:00	2
2013-01-01 00:30:00	3
2013-01-01 00:45:00	4
2013-01-01 01:00:00	5
2013-01-01 01:15:00	6
2013-01-01 02:00:00	9
2013-01-01 02:15:00	0

Exclude weekends and create a date and time range that includes only specific times :clipboard:

rng = pd.date_range('20130101 09:00','20130110 16:00',freq='30T')

DatetimeIndex(['2013-01-01 09:00:00', '2013-01-01 09:30:00',
               '2013-01-01 10:00:00', '2013-01-01 10:30:00',
               '2013-01-01 11:00:00', '2013-01-01 11:30:00',
               '2013-01-01 12:00:00', '2013-01-01 12:30:00',
               '2013-01-01 13:00:00', '2013-01-01 13:30:00',
               ...
               '2013-01-10 11:30:00', '2013-01-10 12:00:00',
               '2013-01-10 12:30:00', '2013-01-10 13:00:00',
               '2013-01-10 13:30:00', '2013-01-10 14:00:00',
               '2013-01-10 14:30:00', '2013-01-10 15:00:00',
               '2013-01-10 15:30:00', '2013-01-10 16:00:00'],
              dtype='datetime64[ns]', length=447, freq='30T')

#Exclude unnecessary time
rng = rng.take(rng.indexer_between_time('09:30','16:00'))

DatetimeIndex(['2013-01-01 09:30:00', '2013-01-01 10:00:00',
               '2013-01-01 10:30:00', '2013-01-01 11:00:00',
               '2013-01-01 11:30:00', '2013-01-01 12:00:00',
               '2013-01-01 12:30:00', '2013-01-01 13:00:00',
               '2013-01-01 13:30:00', '2013-01-01 14:00:00',
               ...
               '2013-01-10 11:30:00', '2013-01-10 12:00:00',
               '2013-01-10 12:30:00', '2013-01-10 13:00:00',
               '2013-01-10 13:30:00', '2013-01-10 14:00:00',
               '2013-01-10 14:30:00', '2013-01-10 15:00:00',
               '2013-01-10 15:30:00', '2013-01-10 16:00:00'],
              dtype='datetime64[ns]', length=140, freq=None)

#Only on weekdays
rng = rng[rng.weekday<5]
DatetimeIndex(['2013-01-01 09:30:00', '2013-01-01 10:00:00',
               '2013-01-01 10:30:00', '2013-01-01 11:00:00',
               '2013-01-01 11:30:00', '2013-01-01 12:00:00',
               '2013-01-01 12:30:00', '2013-01-01 13:00:00',
               '2013-01-01 13:30:00', '2013-01-01 14:00:00',
               ...
               '2013-01-10 11:30:00', '2013-01-10 12:00:00',
               '2013-01-10 12:30:00', '2013-01-10 13:00:00',
               '2013-01-10 13:30:00', '2013-01-10 14:00:00',
               '2013-01-10 14:30:00', '2013-01-10 15:00:00',
               '2013-01-10 15:30:00', '2013-01-10 16:00:00'],
              dtype='datetime64[ns]', length=112, freq=None)

Look up a matrix in a data frame

The code below has changed some numbers from stackoverflow.com.

date_item = pd.to_datetime(['2019/1/4','2019/1/7','2019/1/7','2019/1/8','2019/1/9','2019/1/10','2019/1/10','2019/1/11','2019/1/15','2019/1/16','2019/1/16','2019/1/17','2019/1/18'])

DatetimeIndex(['2019-01-04', '2019-01-07', '2019-01-07', '2019-01-08',
               '2019-01-09', '2019-01-10', '2019-01-10', '2019-01-11',
               '2019-01-15', '2019-01-16', '2019-01-16', '2019-01-17',
               '2019-01-18'],
              dtype='datetime64[ns]', freq=None)

orders = pd.DataFrame({'Date':date_item,
                      'by':['buy','buy','sell','buy','sell','buy','buy','sell','buy','sell','buy','sell','buy'],
                      'cnt':[100, 200, 100, 50, 100, 100, 200, 50, 100, 50, 50, 50, 100],
                      'code':['1720','8086','4967','8086','8086','4967','1720','4967','1720','1720','8086','4967','8086'],
                      'prices':[1008, 1344, 7530, 1347, 1373, 7120, 1008, 7120, 995, 986, 1417, 6990, 1445]})

Prices

1720 8086 4967
2019-01-04 1008 1311 7370
2019-01-07 1033 1344 7530
2019-01-08 1025 1347 7130
2019-01-09 1034 1373 7460
2019-01-10 1008 1364 7120
2019-01-11 1000 1391 7030
2019-01-15 995 1400 6950
2019-01-16 986 1417 6920
2019-01-17 1002 1419 6990
2019-01-18 1010 1445 7050
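
The Prices frame is only shown as output in the post; a minimal sketch of how it could be built (the values are copied from the table above, the construction itself is my addition):

```python
import pandas as pd

prices = pd.DataFrame(
    {'1720': [1008, 1033, 1025, 1034, 1008, 1000, 995, 986, 1002, 1010],
     '8086': [1311, 1344, 1347, 1373, 1364, 1391, 1400, 1417, 1419, 1445],
     '4967': [7370, 7530, 7130, 7460, 7120, 7030, 6950, 6920, 6990, 7050]},
    index=pd.to_datetime(['2019-01-04', '2019-01-07', '2019-01-08', '2019-01-09',
                          '2019-01-10', '2019-01-11', '2019-01-15', '2019-01-16',
                          '2019-01-17', '2019-01-18']))
```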

Orders

Date by cnt code prices
0 2019-01-04 buy 100 1720 1008
1 2019-01-07 buy 200 8086 1344
2 2019-01-07 sell 100 4967 7530
3 2019-01-08 buy 50 8086 1347
4 2019-01-09 sell 100 8086 1373
5 2019-01-10 buy 100 4967 7120
6 2019-01-10 buy 200 1720 1008
7 2019-01-11 sell 50 4967 7120
8 2019-01-15 buy 100 1720 995
9 2019-01-16 sell 50 1720 986
10 2019-01-16 buy 50 8086 1417
11 2019-01-17 sell 50 4967 6990
12 2019-01-18 buy 100 8086 1445
# lookup(row_labels, col_labels)
prices.lookup(orders.Date, orders.code)
array([1008, 1344, 7530, 1347, 1373, 7120, 1008, 7030,  995,  986, 1417,
       6990, 1445])

Convert a data frame with days as rows and hours as columns into a continuous, chronologically ordered series :clipboard:

The code below has some numerical changes from the content posted on stackoverflow.com.

df = pd.DataFrame({'Date':['2020-01-01', '2020-01-02'],'h1':[23,45],'h2':[18,17],'h3':[11,16],'h4':[29,31],'h24':[45,55]})

	Date		h1	h2	h3	h4	h24
0	2020-01-01	23	18	11	29	45
1	2020-01-02	45	17	16	31	55

#Convert to the following format
	Date	value
0	2020-01-01	23
2	2020-01-01	18
4	2020-01-01	11
6	2020-01-01	29
8	2020-01-01	45
1	2020-01-02	45
3	2020-01-02	17
5	2020-01-02	16
7	2020-01-02	31
9	2020-01-02	55

** [Code Reading] **

# Convert from wide format (one column per hour) to long format
df = pd.melt(df, id_vars=['Date'])
>
	Date	variable	value
0	2020-01-01	h1	23
1	2020-01-02	h1	45
2	2020-01-01	h2	18
3	2020-01-02	h2	17
4	2020-01-01	h3	11
5	2020-01-02	h3	16
6	2020-01-01	h4	29
7	2020-01-02	h4	31
8	2020-01-01	h24	45
9	2020-01-02	h24	55
#Rename column
df = df.rename(columns={'variable': 'hour'})
>
	Date	hour	value
0	2020-01-01	h1	23
1	2020-01-02	h1	45
2	2020-01-01	h2	18
3	2020-01-02	h2	17
4	2020-01-01	h3	11
5	2020-01-02	h3	16
6	2020-01-01	h4	29
7	2020-01-02	h4	31
8	2020-01-01	h24	45
9	2020-01-02	h24	55
# Strip the 'h' prefix from the hour column and convert to 0-based hour numbers
df['hour'] = df['hour'].apply(lambda x: int(x.lstrip('h'))-1)
>
	Date	hour	value
0	2020-01-01	0	23
1	2020-01-02	0	45
2	2020-01-01	1	18
3	2020-01-02	1	17
4	2020-01-01	2	11
5	2020-01-02	2	16
6	2020-01-01	3	29
7	2020-01-02	3	31
8	2020-01-01	23	45
9	2020-01-02	23	55
#Creating column data that combines day and time
combined = df.apply(lambda x: pd.to_datetime(x['Date'], dayfirst=True) + datetime.timedelta(hours=int(x['hour'])), axis=1)
>
0   2020-01-01 00:00:00
1   2020-01-02 00:00:00
2   2020-01-01 01:00:00
3   2020-01-02 01:00:00
4   2020-01-01 02:00:00
5   2020-01-02 02:00:00
6   2020-01-01 03:00:00
7   2020-01-02 03:00:00
8   2020-01-01 23:00:00
9   2020-01-02 23:00:00
dtype: datetime64[ns]
# Replace the 'Date' column with the combined datetime values
df['Date'] = combined
	Date  	  	hour  value 
 0  2020-01-01 00:00:00   0  23 
 1  2020-01-02 00:00:00   0  45 
 2  2020-01-01 01:00:00   1  18 
 3  2020-01-02 01:00:00   1  17 
 4  2020-01-01 02:00:00   2  11 
 5  2020-01-02 02:00:00   2  16 
 6  2020-01-01 03:00:00   3  29 
 7  2020-01-02 03:00:00   3  31 
 8  2020-01-01 23:00:00  23  45 
 9  2020-01-02 23:00:00  23  55 
# Delete the 'hour' column
del df['hour']
>
#Sort by date
df = df.sort_values("Date")
>
Date	value
0	2020-01-01	23
2	2020-01-01	18
4	2020-01-01	11
6	2020-01-01	29
8	2020-01-01	45
1	2020-01-02	45
3	2020-01-02	17
5	2020-01-02	16
7	2020-01-02	31
9	2020-01-02	55
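Putting the steps above together, a compact sketch of the whole transformation (my consolidation of the same logic, not code from the original post):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2020-01-01', '2020-01-02'],
                   'h1': [23, 45], 'h2': [18, 17], 'h3': [11, 16],
                   'h4': [29, 31], 'h24': [45, 55]})

# wide -> long, then turn 'h1'..'h24' into hour offsets added onto the date
long_df = (pd.melt(df, id_vars=['Date'])
             .rename(columns={'variable': 'hour'}))
long_df['hour'] = long_df['hour'].str.lstrip('h').astype(int) - 1
long_df['Date'] = pd.to_datetime(long_df['Date']) + pd.to_timedelta(long_df['hour'], unit='h')
long_df = long_df.drop(columns='hour').sort_values('Date')
```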

Handling duplicates when reindexing a time series to a certain frequency: clipboard:

#A Time-series data (timestamps with random millisecond offsets, dummy data) -> 100 rows
intervals = np.random.randint(0,1000,size=100).cumsum()
df = pd.DataFrame({'time':[pd.Timestamp('20140101') + pd.offsets.Milli(i) for i in intervals ],
                'value' : np.random.randn(len(intervals))})

# df
			time	value
0	2014-01-01 00:00:00.499	0.567731
1	2014-01-01 00:00:01.232	-0.751466
2	2014-01-01 00:00:01.238	1.250118
3	2014-01-01 00:00:01.533	0.588161
4	2014-01-01 00:00:02.210	0.996543
...	...	...
95	2014-01-01 00:00:49.216	0.753233
96	2014-01-01 00:00:49.719	0.849922
97	2014-01-01 00:00:49.732	1.171001
98	2014-01-01 00:00:50.723	-0.443835
99	2014-01-01 00:00:51.389	-0.830960
100 rows × 2 columns

#B Regular time-series index (1-second intervals over one hour) -> 3601 entries
pd.date_range('20140101 00:00:00','20140101 01:00:00',freq='s')

DatetimeIndex(['2014-01-01 00:00:00', '2014-01-01 00:00:01',
               '2014-01-01 00:00:02', '2014-01-01 00:00:03',
               '2014-01-01 00:00:04', '2014-01-01 00:00:05',
               '2014-01-01 00:00:06', '2014-01-01 00:00:07',
               '2014-01-01 00:00:08', '2014-01-01 00:00:09',
               ...
               '2014-01-01 00:59:51', '2014-01-01 00:59:52',
               '2014-01-01 00:59:53', '2014-01-01 00:59:54',
               '2014-01-01 00:59:55', '2014-01-01 00:59:56',
               '2014-01-01 00:59:57', '2014-01-01 00:59:58',
               '2014-01-01 00:59:59', '2014-01-01 01:00:00'],
              dtype='datetime64[ns]', length=3601, freq='S')

#result
2014-01-01 00:00:00.000	NaN
2014-01-01 00:00:00.499	0.567731
2014-01-01 00:00:01.000	0.567731  <-Originally NaN
2014-01-01 00:00:01.232	-0.751466
2014-01-01 00:00:01.238	1.250118
2014-01-01 00:00:01.533	0.588161
2014-01-01 00:00:02.000	0.588161  <-Originally NaN
2014-01-01 00:00:02.210	0.996543
2014-01-01 00:00:02.652	0.322535
2014-01-01 00:00:03.000	0.322535  <-Originally NaN

** [Code Reading] **

  • The heading is confusing, but the point is to merge a second index B into the A time series (both on the same time axis) and use ffill() to fill the newly introduced gaps with the preceding values.
intervals
>
array([  499,  1232,  1238,  1533,  2210,  2652,  3512,  3598,  3663,
        3688,  3758,  3969,  4960,  5933,  6211,  7040,  7763,  8331,
        9329, 10229, 10436, 10671, 10727, 11695, 12249, 13033, 13867,
       13895, 14809, 15069, 16022, 16484, 16597, 17044, 17060, 17344,
       18124, 18629, 19256, 20022, 20620, 21080, 21148, 21603, 22164,
       22872, 23075, 23566, 24133, 24887, 24996, 25132, 25435, 26047,
       26287, 27168, 27228, 28129, 29118, 29539, 30339, 30465, 31396,
       31805, 32583, 33021, 33854, 34439, 34865, 35207, 35212, 35345,
       35865, 36586, 37325, 37422, 38411, 38986, 39227, 39768, 40316,
       41254, 42162, 42476, 43155, 43971, 44791, 44899, 45770, 46701,
       47568, 47901, 48242, 48720, 48940, 49216, 49719, 49732, 50723,
       51389])
>
#Create a B time series and combine it with an A time series index
rng = pd.date_range('20140101 00:00:00','20140101 01:00:00', freq='s')
new_range = pd.DatetimeIndex(np.append(rng, pd.Index(df.time)))
  • In the content posted on stackoverflow.com the two indexes are simply added (A + B), but that raises an error.
  • After checking, it could be handled with DatetimeIndex(np.append(A, B)).
# The combined A + B index is unsorted, so sort it chronologically
new_range = new_range.sort_values()
# Set 'time' as the index, then reindex onto the new combined index
df.set_index('time').reindex(new_range).head()
>
2014-01-01 00:00:00.000	NaN
2014-01-01 00:00:00.499	0.567731
2014-01-01 00:00:01.000	NaN
2014-01-01 00:00:01.232	-0.751466
2014-01-01 00:00:01.238	1.250118
>
# Fill the missing values with ffill()
df.set_index('time').reindex(new_range).ffill().head(10)
>
		value
2014-01-01 00:00:00.000	NaN
2014-01-01 00:00:00.499	0.567731
2014-01-01 00:00:01.000	0.567731
2014-01-01 00:00:01.232	-0.751466
2014-01-01 00:00:01.238	1.250118
2014-01-01 00:00:01.533	0.588161
2014-01-01 00:00:02.000	0.588161
2014-01-01 00:00:02.210	0.996543
2014-01-01 00:00:02.652	0.322535
2014-01-01 00:00:03.000	0.322535
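As a variation (my own note, not in the original post), the forward fill can also be done by `reindex` itself, as long as the original 'time' values are unique and the combined index is sorted:

```python
# One step instead of reindex().ffill(): reindex with method='ffill'.
df.set_index('time').reindex(new_range, method='ffill').head(10)
```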

Resampling

How can I divide a single value in a data frame by the monthly average? : clipboard :: thumbs up:

df = pd.DataFrame(np.random.randint(1000, 1200, (12,1)), columns=['value'],
                  index = pd.date_range('2014-01-01 00:15:00', freq='15min', periods=12))

df
		value 
 2014-01-01 00:15:00  1122 
 2014-01-01 00:30:00  1132 
 2014-01-01 00:45:00  1091 
 2014-01-01 01:00:00  1188 
 2014-01-01 01:15:00  1156 
 2014-01-01 01:30:00  1089 
 2014-01-01 01:45:00  1148 
 2014-01-01 02:00:00  1040 
 2014-01-01 02:15:00  1010 
 2014-01-01 02:30:00  1130 
 2014-01-01 02:45:00  1178 
 2014-01-01 03:00:00  1186 

df['normed'] = df.groupby(grouper).transform(lambda x: x / x.mean())

df
			value	normed
Date		
2014-01-01 00:15:00	1122	0.999555
2014-01-01 00:30:00	1132	1.008463
2014-01-01 00:45:00	1091	0.971938
2014-01-01 01:00:00	1188	1.058352
2014-01-01 01:15:00	1156	1.029844
2014-01-01 01:30:00	1089	0.970156
2014-01-01 01:45:00	1148	1.022717
2014-01-01 02:00:00	1040	0.926503
2014-01-01 02:15:00	1010	0.899777
2014-01-01 02:30:00	1130	1.006682
2014-01-01 02:45:00	1178	1.049443
2014-01-01 03:00:00	1186	1.056570

** [Code Reading] **

df.index.name = 'Date'
# Grouper groups rows by a given frequency or interval; here it is one month, freq='M'.
grouper = pd.Grouper(level=0, freq='M')
>
# Inspecting grouper via the groups property shows the grouping is monthly, keyed on the month-end date.
df.groupby(grouper).groups
{Timestamp('2014-01-31 00:00:00', freq='M'): 12}
>
# An easier-to-read method without Grouper was also posted. This is convenient!
df.groupby([df.index.year, df.index.month]).transform(lambda x: x/x.mean())
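Put together as one runnable sketch (random values, so the numbers differ from the table above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1000, 1200, (12, 1)), columns=['value'],
                  index=pd.date_range('2014-01-01 00:15:00', freq='15min', periods=12))

# Group by calendar month and divide each value by that month's mean.
grouper = pd.Grouper(level=0, freq='M')
df['normed'] = df.groupby(grouper)['value'].transform(lambda x: x / x.mean())

# Sanity check: the normalized values of each month average to 1.0.
df.groupby(grouper)['normed'].mean()
```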

Grouping using MultiIndex: clipboard:

data = pd.concat([pd.DataFrame([['A']*36, list(pd.date_range('1/1/2011', periods=36, freq='H')), list(np.random.rand(36))], index = ['Group', 'Time', 'Value']).T,
                  pd.DataFrame([['B']*36, list(pd.date_range('1/1/2011', periods=36, freq='H')), list(np.random.rand(36))], index = ['Group', 'Time', 'Value']).T,
                  pd.DataFrame([['C']*36, list(pd.date_range('1/1/2011', periods=36, freq='H')), list(np.random.rand(36))], index = ['Group', 'Time', 'Value']).T],
                  axis = 0).set_index(['Group', 'Time'])

data

			Value
Group	Time	
A	2011-01-01 00:00:00	0.306533
	2011-01-01 01:00:00	0.653431
	2011-01-01 02:00:00	0.0284997
	2011-01-01 03:00:00	0.852406
	2011-01-01 04:00:00	0.856705
...	...	...
C	2011-01-02 07:00:00	0.607041
	2011-01-02 08:00:00	0.291705
	2011-01-02 09:00:00	0.480091
	2011-01-02 10:00:00	0.890153
	2011-01-02 11:00:00	0.628454
108 rows × 1 columns

#result
Time        Group
2011-01-01  A        0.497803
            B        0.524934
            C        0.566736
2011-01-02  A        0.442884
            B        0.452267
            C        0.632567
Name: Value, dtype: float64

** [Code Reading] **

data['Value'] = data['Value'].astype(float)
# Group by two keys (Grouper A: by day on the 'Time' level, Grouper B: the 'Group' level) and take the mean.
daily_counts = data.groupby([pd.Grouper(freq='D', level='Time'), pd.Grouper(level='Group')])['Value'].mean()

How do you resample with a custom period?

dtrange = pd.date_range(datetime.datetime(2013,1,1), datetime.datetime(2013,2,20))
df = pd.DataFrame({'p1': np.random.rand(len(dtrange)) + 5,
                   'p2': np.random.rand(len(dtrange)) + 10},
                 index=dtrange)
		p1		p2
2013-01-01	5.428541	10.292222
2013-01-02	5.172898	10.077244
2013-01-03	5.414736	10.587493
2013-01-04	5.235626	10.567700
2013-01-05	5.659100	10.270270
2013-01-06	5.144520	10.033026
2013-01-07	5.340707	10.561371
2013-01-08	5.256909	10.647431
2013-01-09	5.491950	10.036358
2013-01-10	5.869493	10.010417

#result
df.groupby(date).mean()
		p1		p2
2013-01-01	5.401448	10.308353
2013-01-11	5.362789	10.463397
2013-01-21	5.531094	10.346834
2013-02-01	5.559112	10.551347
2013-02-11	5.551023	10.451666

** [Code Reading] **

# d is each date's offset in days from the start of its 10-day period (early/mid/late of the month).
d = df.index.day - np.clip((df.index.day - 1) // 10, 0, 2) * 10 - 1
d
Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9,  0,  1,  2,  3,  4,  5,  6,
             7,  8,  9,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10,  0,  1,  2,
             3,  4,  5,  6,  7,  8,  9,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
           dtype='int64')
# This gets the day-of-month of each index entry.
df.index.day
Int64Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
            18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,  1,  2,  3,
             4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
           dtype='int64')
# Subtract 1 so the day numbers start from 0 before bucketing into 10-day units.
df.index.day - 1
Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,  0,  1,  2,
             3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
           dtype='int64')
# Integer division by 10 gives the bucket index: 0 = early, 1 = mid, 2 = late (the 31st temporarily gives 3)
(df.index.day - 1) // 10
Int64Index([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
            2, 2, 2, 2, 2, 2, 2, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
            1, 1, 1, 1, 1, 1, 1],
           dtype='int64')
# np.clip(a, a_min, a_max, out=None, **kwargs)
# np.clip limits element values to a range; here they are clipped to 0..2 (so day 31 falls into the 'late' bucket).
np.clip((df.index.day - 1) // 10, 0, 2)
Int64Index([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
            2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
            1, 1, 1, 1, 1, 1, 1],
           dtype='int64')
# Convert the bucket number to a day offset: early -> 0, mid -> 10, late -> 20
np.clip((df.index.day - 1) // 10, 0, 2) * 10
Int64Index([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10, 10, 10, 10, 10, 10, 10,
            10, 10, 10, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20,  0,  0,  0,
             0,  0,  0,  0,  0,  0,  0, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10],
           dtype='int64')
#Check the calculation of d.
tmp = pd.DataFrame({'A':df.index.day.values,
                    'B':np.clip((df.index.day - 1) // 10, 0, 2) * 10},
                  index=np.arange(51))
tmp['Diff'] = tmp['A'] - tmp['B']
tmp.head()
	A	B	Diff
0	1	0	1
1	2	0	2
2	3	0	3
3	4	0	4
4	5	0	5
>
tmp['Diff'] = (tmp['A'] - tmp['B']) - 1
	A	B	Diff
0	1	0	0
1	2	0	1
2	3	0	2
3	4	0	3
4	5	0	4
>
# Create the period-start timestamps 2013-01-01, 2013-01-11, 2013-01-21, 2013-02-01, 2013-02-11
# Converting d to timedelta format and subtracting it from the original index snaps each date to the start of its 10-day period.
date = df.index.values - np.array(d, dtype='timedelta64[D]')
>
array(['2013-01-01T00:00:00.000000000', '2013-01-01T00:00:00.000000000',
       '2013-01-01T00:00:00.000000000', '2013-01-01T00:00:00.000000000',
       '2013-01-01T00:00:00.000000000', '2013-01-01T00:00:00.000000000',
       '2013-01-01T00:00:00.000000000', '2013-01-01T00:00:00.000000000',
       '2013-01-01T00:00:00.000000000', '2013-01-01T00:00:00.000000000',
       '2013-01-11T00:00:00.000000000', '2013-01-11T00:00:00.000000000',
       '2013-01-11T00:00:00.000000000', '2013-01-11T00:00:00.000000000',
       '2013-01-11T00:00:00.000000000', '2013-01-11T00:00:00.000000000',
       '2013-01-11T00:00:00.000000000', '2013-01-11T00:00:00.000000000',
       '2013-01-11T00:00:00.000000000', '2013-01-11T00:00:00.000000000',
       '2013-01-21T00:00:00.000000000', '2013-01-21T00:00:00.000000000',
       '2013-01-21T00:00:00.000000000', '2013-01-21T00:00:00.000000000',
       '2013-01-21T00:00:00.000000000', '2013-01-21T00:00:00.000000000',
       '2013-01-21T00:00:00.000000000', '2013-01-21T00:00:00.000000000',
       '2013-01-21T00:00:00.000000000', '2013-01-21T00:00:00.000000000',
       '2013-01-21T00:00:00.000000000', '2013-02-01T00:00:00.000000000',
       '2013-02-01T00:00:00.000000000', '2013-02-01T00:00:00.000000000',
       '2013-02-01T00:00:00.000000000', '2013-02-01T00:00:00.000000000',
       '2013-02-01T00:00:00.000000000', '2013-02-01T00:00:00.000000000',
       '2013-02-01T00:00:00.000000000', '2013-02-01T00:00:00.000000000',
       '2013-02-01T00:00:00.000000000', '2013-02-11T00:00:00.000000000',
       '2013-02-11T00:00:00.000000000', '2013-02-11T00:00:00.000000000',
       '2013-02-11T00:00:00.000000000', '2013-02-11T00:00:00.000000000',
       '2013-02-11T00:00:00.000000000', '2013-02-11T00:00:00.000000000',
       '2013-02-11T00:00:00.000000000', '2013-02-11T00:00:00.000000000',
       '2013-02-11T00:00:00.000000000'], dtype='datetime64[ns]')
>
# Group by this 'date' array and take the mean.
df.groupby(date).mean()
			p1	p2
2013-01-01	5.401448	10.308353
2013-01-11	5.362789	10.463397
2013-01-21	5.531094	10.346834
2013-02-01	5.559112	10.551347
2013-02-11	5.551023	10.451666
  • Create a DatetimeIndex based on the criteria you want to aggregate, and aggregate based on that index.
  • This one-liner was hard to follow as a beginner, so it was worth taking it apart step by step.
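To make the one-liner easier to reuse, the same logic can be wrapped in a small helper (a sketch of my own; the name `dekad_start` is not from the original post):

```python
import numpy as np
import pandas as pd

def dekad_start(index):
    """Snap each date to the start of its 10-day period (the 1st, 11th or 21st)."""
    offset = index.day - np.clip((index.day - 1) // 10, 0, 2) * 10 - 1
    return index - pd.to_timedelta(offset, unit='D')

df.groupby(dekad_start(df.index)).mean()
```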

Aggregate trading hours without adding new days

#Create with 1-minute data
dates = pd.date_range('2014-01-01','2014-01-11', freq='T')[0:-1]
#Only on business days
dates = dates[dates.dayofweek < 5]
s = pd.DataFrame(np.random.randn(dates.size), dates)
s
					0
2014-01-01 00:00:00	-2.593328
2014-01-01 00:01:00	0.173850
2014-01-01 00:02:00	0.781819
2014-01-01 00:03:00	0.734917
2014-01-01 00:04:00	-1.323457
...	...
2014-01-10 23:55:00	0.158127
2014-01-10 23:56:00	1.205610
2014-01-10 23:57:00	-0.757652
2014-01-10 23:58:00	0.350570
2014-01-10 23:59:00	0.886426
11520 rows × 1 columns

#Resample the date and time of the data frame in 30-minute increments and take the average
s.groupby(lambda d: d.date()).resample('30min').mean()
						0
2014-01-01	2014-01-01 00:00:00	-0.152332
		2014-01-01 00:30:00	-0.172909
		2014-01-01 01:00:00	0.110629
		2014-01-01 01:30:00	-0.070501
		2014-01-01 02:00:00	0.058135
		...	...	...
2014-01-10	2014-01-10 21:30:00	-0.158843
		2014-01-10 22:00:00	0.204738
		2014-01-10 22:30:00	-0.057380
		2014-01-10 23:00:00	0.212008
		2014-01-10 23:30:00	-0.021065
384 rows × 1 columns

** [Code Reading] **

The data are grouped by day and resampled in 30-minute bins, taking the mean of each bin. The full range has 14,400 one-minute rows; keeping only business days leaves 11,520 rows, and the 30-minute aggregation reduces them to 384 rows.

Group daily data by month. : clipboard :: thumbs up:

import random
df = pd.DataFrame({'string': [random.choice(('about', 'contact', 'home', 'blog')) for _ in range(120)],
                 'visits':np.random.randint(0,2000, (120,))},
               index = pd.date_range('2001-01-01', freq='1d', periods=120))

re_df = df.groupby('string').resample('M').sum()

			visits
string		
about	2001-01-31	4879
		2001-02-28	6713
		2001-03-31	5747
		2001-04-30	9286
blog	2001-01-31	9164
		2001-02-28	8318
		2001-03-31	4770
		2001-04-30	6571
contact	2001-01-31	6883
		2001-02-28	6177
		2001-03-31	10228
		2001-04-30	3531
home	2001-01-31	5496
		2001-02-28	5261
		2001-03-31	7083
		2001-04-30	11383

** [Code Reading] **

  • random.choice(sequence) picks one of the given strings at random, so each of the 120 days gets one of the four page names.
  • The data are grouped by page name ('string'), then resampled monthly ('M') and summed.

Add monthly composition to a new column

re_df['ratio'] = re_df.groupby(level=1).transform(lambda x: x / x.sum())
>
				visits	ratio
string			
about	2001-01-31	4879	0.184657
		2001-02-28	6713	0.253617
		2001-03-31	5747	0.206519
		2001-04-30	9286	0.301778
blog	2001-01-31	9164	0.346832
		2001-02-28	8318	0.314254
		2001-03-31	4770	0.171410
		2001-04-30	6571	0.213545
contact	2001-01-31	6883	0.260503
		2001-02-28	6177	0.233367
		2001-03-31	10228	0.367543
		2001-04-30	3531	0.114751
home	2001-01-31	5496	0.208008
		2001-02-28	5261	0.198761
		2001-03-31	7083	0.254528
		2001-04-30	11383	0.369926
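As a quick sanity check (my addition), the composition ratios within each month should sum to 1:

```python
# Each month's shares across the four pages add up to 1.0.
re_df.groupby(level=1)['ratio'].sum()
```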

Merge

In [177]: rng = pd.date_range('2000-01-01', periods=6)
In [178]: df1 = pd.DataFrame(np.random.randn(6, 3), index=rng, columns=['A', 'B', 'C'])
In [179]: df2 = df1.copy()
In [180]: df = df1.append(df2, ignore_index=True)

In [181]: df
Out[181]: 
           A         B         C
0  -0.870117 -0.479265 -0.790855
1   0.144817  1.726395 -0.464535
2  -0.821906  1.597605  0.187307
3  -0.128342 -1.511638 -0.289858
4   0.399194 -1.430030 -0.639760
5   1.115116 -2.012600  1.810662
6  -0.870117 -0.479265 -0.790855
7   0.144817  1.726395 -0.464535
8  -0.821906  1.597605  0.187307
9  -0.128342 -1.511638 -0.289858
10  0.399194 -1.430030 -0.639760
11  1.115116 -2.012600  1.810662

** [Code Reading] **

  • Six rows of data are created, copied with copy(), and the two identical frames are concatenated vertically.
  • Depending on the join, `ignore_index=True` may be required.
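For reference, `DataFrame.append` was deprecated in later pandas releases; `pd.concat` gives the same vertical concatenation (a minimal sketch):

```python
# Equivalent concatenation without append.
df = pd.concat([df1, df2], ignore_index=True)
```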

DataFrame self-join

In [182]: df = pd.DataFrame(data={'Area': ['A'] * 5 + ['C'] * 2,
                           'Bins': [110] * 2 + [160] * 3 + [40] * 2,
                           'Test_0': [0, 1, 0, 1, 2, 0, 1],
                           'Data': np.random.randn(7)})

In [183]: df
Out[183]: 
	Area	Bins	Test_0	Data
0	A	110	0	0.632955
1	A	110	1	1.485463
2	A	160	0	-1.193891
3	A	160	1	-0.324484
4	A	160	2	1.293263
5	C	40	0	-0.476979
6	C	40	1	-0.467655

In [184]: df['Test_1'] = df['Test_0'] - 1

In [185]: pd.merge(df, df, left_on=['Bins', 'Area', 'Test_0'],
                   right_on=['Bins', 'Area', 'Test_1'],
                   suffixes=('_L', '_R'))

Out[185]: 
Area	Bins	Test_0_L	Data_L	Test_1_L	Test_0_R	Data_R	Test_1_R
0	A	110	0	0.632955	-1	1	1.485463	0
1	A	160	0	-1.193891	-1	1	-0.324484	0
2	A	160	1	-0.324484	0	2	1.293263	1
3	C	40	0	-0.476979	-1	1	-0.467655	0

** [Code Reading] **

# Create Test_1 = Test_0 - 1 so that merging Test_0 (left) on Test_1 (right) pairs each test with the next one in the same Area/Bins
df['Test_1'] = df['Test_0'] - 1
>
	Area	Bins	Test_0	Data	Test_1
0	A	110	0	0.632955	-1
1	A	110	1	1.485463	0
2	A	160	0	-1.193891	-1
3	A	160	1	-0.324484	0
4	A	160	2	1.293263	1
5	C	40	0	-0.476979	-1
6	C	40	1	-0.467655	0

#### Combine by value-based condition :clipboard:
The code below has some changes from the code posted on stackoverflow.com.

```python
import operator as op

df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}])
df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}])

binOp   = op.lt
matches = np.where(binOp(df_a.a[:,None],df_b.c.values))

pd.concat([df.loc[idxs].reset_index(drop=True)
           for df, idxs in zip([df_a, df_b], matches)], axis=1)
```

df_a
	a	b
0	1	4
1	2	5
2	3	6
df_b

	c	d
0	2	7
1	3	8
# result
	a	b	c	d
0	1	4	2	7
1	1	4	3	8
2	2	5	3	8

[Code reading]

* First, breaking down `matches`:

# Check the contents of columns df.a and df.b
df_a.a[:,None]
array([[1],
       [2],
       [3]])
df_b.c.values
array([2, 3])

* The function-style standard operator `operator.lt(a, b)` performs `a < b`, so the call is equivalent to writing `a < b` directly.

op.lt(df_a.a[:,None],df_b.c.values)
array([[ True,  True],
       [False,  True],
       [False, False]])
# Written differently, the same comparison can be expressed as:
df_a.a[:,None] < df_b.c.values
array([[ True,  True],
       [False,  True],
       [False, False]])
>
# To check this content
df_a.a[:,None][0] < df_b.c.values[0]
array([ True])
df_a.a[:,None][0] < df_b.c.values[1]
array([ True])
df_a.a[:,None][1] < df_b.c.values[0]
array([False])
df_a.a[:,None][1] < df_b.c.values[1]
array([ True])
df_a.a[:,None][2] < df_b.c.values[0]
array([False])
df_a.a[:,None][2] < df_b.c.values[1]
array([False])
>
# In the case of np.where (condition), the index of the element that satisfies the condition is returned in tuple format.
np.where([[True, True], [False, True], [False, False]])
(array([0, 0, 1]), array([0, 1, 1]))
>
# Extract with the conditions obtained in'matches' earlier.
[df.loc[idxs].reset_index(drop=True) for df,idxs in zip([df_a,df_b],matches)]
[   a  b
 0  1  4
 1  1  4
 2  2  5,    c  d
 0  2  7
 1  3  8
 2  3  8]
>
# This is connected in the horizontal direction (axis = 1).
	a	b	c	d
0	1	4	2	7
1	1	4	3	8
2	2	5	3	8
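An alternative sketch of my own, assuming both frames are small enough for a full cross product: build the cross join with a constant key, then filter by the condition.

```python
# Cross join via a dummy key, then keep only the rows where a < c.
out = (df_a.assign(key=1)
            .merge(df_b.assign(key=1), on='key')
            .query('a < c')
            .drop(columns='key')
            .reset_index(drop=True))
```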

Plotting(Drawing)

In [186]: df = pd.DataFrame(
   .....:     {'stratifying_var': np.random.uniform(0, 100, 20),
   .....:      'price': np.random.normal(100, 5, 20)})
   .....: 

In [187]: df['quartiles'] = pd.qcut(
   .....:     df['stratifying_var'],
   .....:     4,
   .....:     labels=['0-25%', '25-50%', '50-75%', '75-100%'])
   .....: 

In [188]: df.boxplot(column='price', by='quartiles')
Out[188]: <matplotlib.axes._subplots.AxesSubplot at 0x7fc66b7f1710>

box_plot.png

[Code reading]

pandas.qcut(x, q, labels=None, ...)
# Split x into q quantile-based bins and attach the given labels.
df['quartiles'] = pd.qcut(df['stratifying_var'], 4, labels=['0-25%', '25-50%', '50-75%', '75-100%'])
>
	stratifying_var	price	quartiles
0	95.463259	110.468740	75-100%
1	61.567537	105.038334	50-75%
2	93.169189	99.502664	75-100%
3	32.881181	100.462400	25-50%
4	22.735506	95.821731	0-25%
5	85.662861	101.262124	75-100%
6	24.251856	102.351950	0-25%
7	26.323525	91.812003	25-50%
8	6.192982	111.425087	0-25%
9	25.520758	104.666583	25-50%
10	75.505473	104.450480	75-100%
11	30.620504	100.044772	25-50%
12	40.438555	102.697402	50-75%
13	18.171318	102.612876	0-25%
14	57.080747	101.067847	50-75%
15	44.066472	93.410125	50-75%
16	64.131972	103.707151	50-75%
17	33.548572	103.359709	25-50%
18	88.433754	99.416668	75-100%
19	21.660715	91.229785	0-25%
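For comparison (my addition): `pd.qcut` cuts at quantiles so each bin receives roughly the same number of rows, while `pd.cut` cuts the value range into equal-width intervals.

```python
# qcut: balanced counts per bin; cut: equal-width bins.
df['quartiles'] = pd.qcut(df['stratifying_var'], 4,
                          labels=['0-25%', '25-50%', '50-75%', '75-100%'])
df['width_bins'] = pd.cut(df['stratifying_var'], 4)
df['quartiles'].value_counts()   # 5 rows per label for these 20 rows
```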

Data IN/Out(Data input / output)

### Read multiple files to create a single DataFrame :thumbsup:

# Example A
In [189]: for i in range(3):
             data = pd.DataFrame(np.random.randn(10, 4))
             data.to_csv('file_{}.csv'.format(i))

In [190]: files = ['file_0.csv', 'file_1.csv', 'file_2.csv']
In [191]: result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

# Example B
In [192]: import glob
In [193]: import os

In [194]: files = glob.glob('file_*.csv')
In [195]: result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

[Code reading]

* In example A, sample files are created in a loop, then read back and concatenated.
* In example B, the file names are collected with the glob function, then the files are read and concatenated.

### Parsing multi-column date components

In [196]: i = pd.date_range('20000101', periods=10000)

In [197]: df = pd.DataFrame({'year': i.year, 'month': i.month, 'day': i.day})

In [198]: df.head()
Out[198]: 
   year  month  day
0  2000      1    1
1  2000      1    2
2  2000      1    3
3  2000      1    4
4  2000      1    5

In [199]: %timeit pd.to_datetime(df.year * 10000 + df.month * 100 + df.day, format='%Y%m%d')
ds = df.apply(lambda x: "%04d%02d%02d" % (x['year'], x['month'], x['day']), axis=1)
ds.head()
%timeit pd.to_datetime(ds)

9.98 ms +- 235 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
2.91 ms +- 57.5 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

[Code reading]

# When calculating and creating time series information using the values of the created data frame
df.year * 10000 + df.month * 100 + df.day
>
# When creating time series information with a function from the elements of the created data frame
 ds = df.apply(lambda x: "%04d%02d%02d" % (x['year'], x['month'], x['day']), axis=1)

The difference between the two approaches comes down to computational cost, as the %timeit results above show.

### Skip the lines between the header and the data

In [200]: data = """;;;;
        ;;;;
        ;;;;
        ;;;;
        ;;;;
        ;;;;
        ;;;;
        ;;;;
        ;;;;
        ;;;;
         date;Param1;Param2;Param4;Param5
             ;m²;°C;m²;m
        ;;;;
         01.01.1990 00:00;1;1;2;3
         01.01.1990 01:00;5;3;4;5
         01.01.1990 02:00;9;5;6;7
         01.01.1990 03:00;13;7;8;9
         01.01.1990 04:00;17;9;10;11
         01.01.1990 05:00;21;11;12;13
 """

Option 1: Explicitly pass the rows to skip

In [201]: from io import StringIO

In [202]: pd.read_csv(StringIO(data), sep=';', skiprows=[11, 12],
                      index_col=0, parse_dates=True, header=10)

Out[202]: 
                     Param1  Param2  Param4  Param5
date                                               
1990-01-01 00:00:00       1       1       2       3
1990-01-01 01:00:00       5       3       4       5
1990-01-01 02:00:00       9       5       6       7
1990-01-01 03:00:00      13       7       8       9
1990-01-01 04:00:00      17       9      10      11
1990-01-01 05:00:00      21      11      12      13

Option 2: Read the column names first, then read the data

In [203]: pd.read_csv(StringIO(data), sep=';', header=10, nrows=10).columns
Out[203]: Index(['date', 'Param1', 'Param2', 'Param4', 'Param5'], dtype='object')

In [204]: columns = pd.read_csv(StringIO(data), sep=';', header=10, nrows=10).columns

In [205]: pd.read_csv(StringIO(data), sep=';', index_col=0,
                      header=12, parse_dates=True, names=columns)

Out[205]: 
                     Param1  Param2  Param4  Param5
date                                               
1990-01-01 00:00:00       1       1       2       3
1990-01-01 01:00:00       5       3       4       5
1990-01-01 02:00:00       9       5       6       7
1990-01-01 03:00:00      13       7       8       9
1990-01-01 04:00:00      17       9      10      11
1990-01-01 05:00:00      21      11      12      13

[Code reading]

I don't think either option has a clear advantage regardless of the data size. In either case you have to find out how many rows to skip and which row is the header, so you either skip the appropriate number of rows when reading, or first open the file separately in an editor to check.
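One way to avoid opening the file by hand is to locate the header row programmatically (a sketch of my own, assuming the header is the first line that starts with 'date'):

```python
from io import StringIO
import pandas as pd

# Find the header row, then apply the same skiprows trick as Option 1.
lines = data.splitlines()
header_row = next(i for i, line in enumerate(lines) if line.strip().startswith('date'))

pd.read_csv(StringIO(data), sep=';', header=header_row,
            skiprows=[header_row + 1, header_row + 2],
            index_col=0, parse_dates=True)
```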


Computation (calculation)

### Correlation

In [211]: df = pd.DataFrame(np.random.random(size=(100, 5)))

In [212]: corr_mat = df.corr()

In [213]: mask = np.tril(np.ones_like(corr_mat, dtype=np.bool), k=-1)

In [214]: corr_mat.where(mask)
Out[214]: 
          0         1         2         3   4
0       NaN       NaN       NaN       NaN NaN
1 -0.018923       NaN       NaN       NaN NaN
2 -0.076296 -0.012464       NaN       NaN NaN
3 -0.169941 -0.289416  0.076462       NaN NaN
4  0.064326  0.018759 -0.084140 -0.079859 NaN

[Code reading]

mask
array([[False, False, False, False, False],
       [ True, False, False, False, False],
       [ True,  True, False, False, False],
       [ True,  True,  True, False, False],
       [ True,  True,  True,  True, False]])
>
mask2 = np.triu(np.ones_like(corr_mat, dtype=np.bool), k=-1)
mask2
array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [False,  True,  True,  True,  True],
       [False, False,  True,  True,  True],
       [False, False, False,  True,  True]])
>
corr_mat.where(mask2)
>
0	1	2	3	4
0	1.000000	-0.022615	0.047486	0.092043	-0.112379
1	-0.022615	1.000000	-0.336864	-0.025473	0.004696
2	NaN			-0.336864	1.000000	0.047746	-0.008458
3	NaN			NaN			0.047746	1.000000	0.133289
4	NaN			NaN			NaN			0.133289	1.000000

numpy.tril() -> Extract the lower triangle of a NumPy array (Lower triangle of an array).
numpy.triu() -> Extract the upper triangle of a NumPy array (Upper triangle of an array).
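A tiny example of the two helpers (arbitrary values, my addition):

```python
import numpy as np

a = np.arange(1, 10).reshape(3, 3)
np.tril(a)        # keep the lower triangle (diagonal included), zero out the rest
np.triu(a, k=1)   # keep only the part strictly above the diagonal
```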


Timedeltas(elapsed time)

In [218]: import datetime

In [219]: s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))

In [220]: s - s.max()
Out[220]: 
0   -2 days
1   -1 days
2    0 days
dtype: timedelta64[ns]

In [221]: s.max() - s
Out[221]: 
0   2 days
1   1 days
2   0 days
dtype: timedelta64[ns]

In [222]: s - datetime.datetime(2011, 1, 1, 3, 5)
Out[222]: 
0   364 days 20:55:00
1   365 days 20:55:00
2   366 days 20:55:00
dtype: timedelta64[ns]

In [223]: s + datetime.timedelta(minutes=5)
Out[223]: 
0   2012-01-01 00:05:00
1   2012-01-02 00:05:00
2   2012-01-03 00:05:00
dtype: datetime64[ns]

In [224]: datetime.datetime(2011, 1, 1, 3, 5) - s
Out[224]: 
0   -365 days +03:05:00
1   -366 days +03:05:00
2   -367 days +03:05:00
dtype: timedelta64[ns]

In [225]: datetime.timedelta(minutes=5) + s
Out[225]: 
0   2012-01-01 00:05:00
1   2012-01-02 00:05:00
2   2012-01-03 00:05:00
dtype: datetime64[ns]

[Code reading]

# 2 weeks later
s + datetime.timedelta(weeks=2)
>
0   2012-01-15
1   2012-01-16
2   2012-01-17
dtype: datetime64[ns]
# 2 weeks ago
s + datetime.timedelta(weeks=-2)
>
0   2011-12-18
1   2011-12-19
2   2011-12-20
dtype: datetime64[ns]

### Addition and subtraction of elapsed time and date

In [226]: deltas = pd.Series([datetime.timedelta(days=i) for i in range(3)])

In [227]: df = pd.DataFrame({'A': s, 'B': deltas})

In [228]: df
Out[228]: 
           A      B
0 2012-01-01 0 days
1 2012-01-02 1 days
2 2012-01-03 2 days

In [229]: df['New Dates'] = df['A'] + df['B']

In [230]: df['Delta'] = df['A'] - df['New Dates']

In [231]: df
Out[231]: 
           A      B  New Dates   Delta
0 2012-01-01 0 days 2012-01-01  0 days
1 2012-01-02 1 days 2012-01-03 -1 days
2 2012-01-03 2 days 2012-01-05 -2 days

In [232]: df.dtypes
Out[232]: 
A             datetime64[ns]
B            timedelta64[ns]
New Dates     datetime64[ns]
Delta        timedelta64[ns]
dtype: object

[Code reading]

# Perform weekly addition
w_deltas = pd.Series([datetime.timedelta(weeks=i) for i in range(3)])
w_deltas
0    0 days
1    7 days
2   14 days
dtype: timedelta64[ns]
>
df['W_Delta'] = df['A'] + w_deltas
>
A	B	New Dates	Delta	W_Delta
0	2012-01-01	0 days	2012-01-01	0 days	2012-01-01
1	2012-01-02	1 days	2012-01-03	-1 days	2012-01-09
2	2012-01-03	2 days	2012-01-05	-2 days	2012-01-17

Axis name alias

### To provide axis name aliases globally, you can define two functions:

def set_axis_alias(cls, axis, alias):
    if axis not in cls._AXIS_NUMBERS:
        raise Exception("invalid axis [%s] for alias [%s]" % (axis, alias))
    cls._AXIS_ALIASES[alias] = axis

def clear_axis_alias(cls, axis, alias):
    if axis not in cls._AXIS_NUMBERS:
        raise Exception("invalid axis [%s] for alias [%s]" % (axis, alias))
    cls._AXIS_ALIASES.pop(alias, None)

set_axis_alias(pd.DataFrame, 'columns', 'myaxis2')
df2 = pd.DataFrame(np.random.randn(3, 2), columns=['c1', 'c2'],
                   index=['i1', 'i2', 'i3'])

In [241]: df2.sum(axis='myaxis2')
Out[241]: 
i1   -0.461013
i2    2.040016
i3    0.904681
dtype: float64

In [242]: clear_axis_alias(pd.DataFrame, 'columns', 'myaxis2')

[Code reading]

# An error occurred in df2.sum (axis ='myaxis2').
df2.sum(axis=1)

This function is supposed to let you define your own axis alias, but in practice an error occurred and no result was obtained; the result above was obtained by specifying axis=1 instead.

Creating example data(Creating sample data)

In [243]: def expand_grid(data_dict):
              rows = itertools.product(*data_dict.values())
              return pd.DataFrame.from_records(rows, columns=data_dict.keys())

In [244]: df = expand_grid({'height': [60, 70],
                            'weight': [100, 140, 180],
                            'sex': ['Male', 'Female']})

In [245]: df
Out[245]: 
    height  weight     sex
0       60     100    Male
1       60     100  Female
2       60     140    Male
3       60     140  Female
4       60     180    Male
5       60     180  Female
6       70     100    Male
7       70     100  Female
8       70     140    Male
9       70     140  Female
10      70     180    Male
11      70     180  Female

* To create a data frame from all combinations of particular values, you can pass a dictionary where the keys are the column names and the values are lists of data values (this is what `expand_grid` above does).


[Code reading]

# * dict.values () expands dictionary-style values
d_dict = {'height': [60, 70], 'weight': [100, 140, 180], 'sex': ['Male', 'Female']}
print(*d_dict.values())
[60, 70] [100, 140, 180] ['Male', 'Female']
>
# All combinations Cartesian product (Descartes) itertools.product is an iterator
rows = itertools.product(*d_dict.values())
>
for r in rows:
    print(r)
>
(60, 100, 'Male')
(60, 100, 'Female')
(60, 140, 'Male')
(60, 140, 'Female')
(60, 180, 'Male')
(60, 180, 'Female')
(70, 100, 'Male')
(70, 100, 'Female')
(70, 140, 'Male')
(70, 140, 'Female')
(70, 180, 'Male')
(70, 180, 'Female')

Afterword

* Thank you for reading to the end. It has been about two and a half years since I started learning Python, and self-studying the official cookbook was a real struggle.
* The first time through, I started by preparing the same data as the official examples, but gave up because I could not come up with sample data partway through.
* The second time, I tried to write the code-reading parts behind collapsible sections while quoting the official text, but decided that this did not convey the feeling of the struggle.
* Please forgive the parts that are not fully explained while my understanding and description of the language are still incomplete. My honest impression after finishing is: **"The one who masters the index masters pandas."**
* For the explanations in the code-reading sections, I used MathJax for the first time, but whether it worked well is unknown.