CookBook
This covers the ʻCookbook` at the end of the `User Guide` on the official pandas site; content from stackoverflow.com is also included where relevant (the selection criteria are my own judgment).
$\small{\mathsf{ Date:Feb \ 05, 2020 \ Version: 1.0.1 }}$
In [1]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
'BBB': [10, 20, 30, 40],
'CCC': [100, 50, -30, -50]})
In [2]: df
Out[2]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
df.loc[df.AAA >= 5, 'BBB'] = -1
df.loc[df.AAA >= 5, ['BBB', 'CCC']] = 555
df.loc[df.AAA < 5, ['BBB', 'CCC']] = 2000
In [9]: df_mask = pd.DataFrame({'AAA': [True] * 4,
'BBB': [False] * 4,
'CCC': [True, False] * 2})
In [10]: df.where(df_mask, -1000)
Out[10]:
AAA BBB CCC
0 4 -1000 2000
1 5 -1000 -1000
2 6 -1000 555
3 7 -1000 -1000
**[Code Reading]**
- When creating a data frame, [True] * 4 becomes [True, True, True, True].
AAA | BBB | CCC | |
---|---|---|---|
0 | True | False | True |
1 | True | False | False |
2 | True | False | True |
3 | True | False | False |
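As a side note (my own addition, not from the original article): `where` keeps the values where the mask is `True` and replaces the rest, so the complementary `mask` method gives the same result when its condition is inverted.

```python
import pandas as pd

df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                   'BBB': [10, 20, 30, 40],
                   'CCC': [100, 50, -30, -50]})
df_mask = pd.DataFrame({'AAA': [True] * 4,
                        'BBB': [False] * 4,
                        'CCC': [True, False] * 2})
a = df.where(df_mask, -1000)   # keep where True, replace where False
b = df.mask(~df_mask, -1000)   # replace where True, so invert the mask
a.equals(b)                    # True
```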
# np.where(condition, true, false)
df['logic'] = np.where(df['AAA'] > 5, 'high', 'low')
AAA | BBB | CCC | logic | |
---|---|---|---|---|
0 | 4 | 2000 | 2000 | low |
1 | 5 | 555 | 555 | low |
2 | 6 | 555 | 555 | high |
3 | 7 | 555 | 555 | high |
In [16]: df
Out[16]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
In [17]: df[df.AAA <= 5]
Out[17]:
AAA BBB CCC
0 4 10 100
1 5 20 50
In [18]: df[df.AAA > 5]
Out[18]:
AAA BBB CCC
2 6 30 -30
3 7 40 -50
id | sex | age | visits | points | |
---|---|---|---|---|---|
0 | 128 | 1 | 36 | 6 | 1980 |
1 | 324 | 0 | 29 | 9 | -50 |
2 | 287 | 0 | 41 | 12 | -239 |
3 | 423 | 1 | 33 | 33 | 1570 |
4 | 589 | 0 | 19 | 10 | 10 |
Let's split this DataFrame using a Boolean mask.
**[Code Reading]**
#Check the Boolean value of m
m = df['sex'] != 0
m
0 True
1 False
2 False
3 True
4 False
Name: sex, dtype: bool
#Extract with condition m, extract with negation of m
female, male = df[m], df[~m]
female
id | sex | age | visits | points | |
---|---|---|---|---|---|
0 | 128 | 1 | 36 | 6 | 1980 |
3 | 423 | 1 | 33 | 33 | 1570 |
male
id | sex | age | visits | points | |
---|---|---|---|---|---|
1 | 324 | 0 | 29 | 9 | -50 |
2 | 287 | 0 | 41 | 12 | -239 |
4 | 589 | 0 | 19 | 10 | 10 |
id | sex | age | visits | points | |
---|---|---|---|---|---|
0 | 128 | 1 | 36 | 6 | 1980 |
1 | 324 | 0 | 29 | 9 | -50 |
2 | 287 | 0 | 41 | 12 | -239 |
3 | 423 | 1 | 33 | 33 | 1570 |
4 | 589 | 0 | 19 | 10 | 10 |
# (condition A) & (condition B) -- using Python's `and` here would result in an error
df.loc[(df['BBB'] < 25) & (df['CCC'] >= -40), 'AAA']
0 4
1 5
Name: AAA, dtype: int64
# (condition A) | (condition B) -- using Python's `or` here would result in an error
df.loc[(df['BBB'] > 25) | (df['CCC'] >= -40), 'AAA']
0 4
1 5
2 6
3 7
Name: AAA, dtype: int64
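As a quick illustration (my own sketch, not from the original): combining two boolean Series with Python's `and`/`or` raises a `ValueError` because the truth value of a Series is ambiguous, while the bitwise operators work element-wise.

```python
import pandas as pd

df = pd.DataFrame({'BBB': [10, 20, 30, 40], 'CCC': [100, 50, -30, -50]})
try:
    (df['BBB'] < 25) and (df['CCC'] >= -40)   # `and` needs a single truth value
except ValueError as e:
    print(e)                                   # "The truth value of a Series is ambiguous..."
(df['BBB'] < 25) & (df['CCC'] >= -40)          # element-wise AND works
```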
# You can also update the data using `&` / `|` conditions
df.loc[(df['BBB'] > 25) | (df['CCC'] >= 75), 'AAA'] = 0.1
AAA | BBB | CCC | |
---|---|---|---|
0 | 0.1 | 10 | 100 |
1 | 5 | 20 | 50 |
2 | 0.1 | 30 | -30 |
3 | 0.1 | 40 | -50 |
df = pd.DataFrame({'AAA': [4, 5, 6, 7],
'BBB': [10, 20, 30, 40],
'CCC': [100, 50, -30, -50]})
In [26]: df
Out[26]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
In [27]: aValue = 43.0
In [28]: df.loc[(df.CCC - aValue).abs().argsort()]
Out[28]:
AAA BBB CCC
1 5 20 50
0 4 10 100
2 6 30 -30
3 7 40 -50
**[Code Reading]**
#Simple calculation
(df.CCC - aValue)
0 57.0
1 7.0
2 -73.0
3 -93.0
Name: CCC, dtype: float64
#Make the result of a simple calculation an absolute value
(df.CCC - aValue).abs()
0 57.0
1 7.0
2 73.0
3 93.0
Name: CCC, dtype: float64
#Return the order of positions that would sort the absolute differences
(df.CCC - aValue).abs().argsort()
0 1
1 0
2 2
3 3
Name: CCC, dtype: int64
In [29]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
'BBB': [10, 20, 30, 40],
'CCC': [100, 50, -30, -50]})
In [30]: df
Out[30]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
In [31]: Crit1 = df.AAA <= 5.5
In [32]: Crit2 = df.BBB == 10.0
In [33]: Crit3 = df.CCC > -40.0
#Hard-coded
In [34]: AllCrit = Crit1 & Crit2 & Crit3
In [35]: import functools
In [36]: CritList = [Crit1, Crit2, Crit3]
#Create all conditions (integrated) with the reduce function
In [37]: AllCrit = functools.reduce(lambda x, y: x & y, CritList)
In [38]: df[AllCrit]
Out[38]:
AAA BBB CCC
0 4 10 100
**[Code Reading]**
#Check each condition as a data frame.
tmp = pd.DataFrame({'CritList_0':CritList[0].tolist(),
'CritList_1':CritList[1].tolist(),
'CritList_2':CritList[2].tolist(),
'AllCrit':AllCrit.tolist()
}, index=[0,1,2,3])
`AllCrit`, which is the logical AND (product) of all the conditions, is `True` only for the row with index `0`.
CritList_0 | CritList_1 | CritList_2 | AllCrit | |
---|---|---|---|---|
0 | True | True | True | True |
1 | True | False | True | False |
2 | False | False | True | False |
3 | False | False | False | False |
In [39]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
'BBB': [10, 20, 30, 40],
'CCC': [100, 50, -30, -50]})
In [40]: df
Out[40]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
In [41]: df[(df.AAA <= 6) & (df.index.isin([0, 2, 4]))]
Out[41]:
AAA BBB CCC
0 4 10 100
2 6 30 -30
AAA | BBB | CCC | |
---|---|---|---|
foo | 4 | 10 | 100 |
bar | 5 | 20 | 50 |
boo | 6 | 30 | -30 |
kar | 7 | 40 | -50 |
In [43]: df.loc['bar':'kar'] # Label
Out[43]:
AAA BBB CCC
bar 5 20 50
boo 6 30 -30
kar 7 40 -50
# Generic
In [44]: df.iloc[0:3]
Out[44]:
AAA BBB CCC
foo 4 10 100
bar 5 20 50
boo 6 30 -30
**Important points learned**
In [46]: data = {'AAA': [4, 5, 6, 7],
'BBB': [10, 20, 30, 40],
'CCC': [100, 50, -30, -50]}
#Note: Index starts at 1
In [47]: df2 = pd.DataFrame(data=data, index=[1, 2, 3, 4])
In [48]: df2.iloc[1:3] #Position oriented
Out[48]:
AAA BBB CCC
2 5 20 50
3 6 30 -30
In [49]: df2.loc[1:3] #Label oriented
Out[49]:
AAA BBB CCC
1 4 10 100
2 5 20 50
3 6 30 -30
In [50]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
'BBB': [10, 20, 30, 40],
'CCC': [100, 50, -30, -50]})
In [51]: df
Out[51]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
In [52]: df[~((df.AAA <= 6) & (df.index.isin([0, 2, 4])))]
Out[52]:
AAA BBB CCC
1 5 20 50
3 7 40 -50
**[Code Reading]**
# For comparison, the result without the `~` negation
df[((df.AAA <= 6) & (df.index.isin([0, 2, 4])))]
AAA | BBB | CCC | |
---|---|---|---|
0 | 4 | 10 | 100 |
2 | 6 | 30 | -30 |
In [53]: df = pd.DataFrame({'AAA': [1, 2, 1, 3],
'BBB': [1, 1, 2, 2],
'CCC': [2, 1, 3, 1]})
In [54]: df
Out[54]:
AAA BBB CCC
0 1 1 2
1 2 1 1
2 1 2 3
3 3 2 1
In [55]: source_cols = df.columns #Get existing column name
In [56]: new_cols = [str(x) + "_cat" for x in source_cols]
In [57]: categories = {1: 'Alpha', 2: 'Beta', 3: 'Charlie'}
In [58]: df[new_cols] = df[source_cols].applymap(categories.get)
In [59]: df
Out[59]:
AAA BBB CCC AAA_cat BBB_cat CCC_cat
0 1 1 2 Alpha Alpha Beta
1 2 1 1 Beta Alpha Alpha
2 1 2 3 Alpha Beta Charlie
3 3 2 1 Charlie Beta Alpha
**[Code Reading]**
#Get an existing column name
source_cols = df.columns
source_cols
Index(['AAA', 'BBB', 'CCC'], dtype='object')
#Create a list of new column names by appending `_cat` to each existing name
new_cols = [str(x) + "_cat" for x in source_cols]
new_cols
['AAA_cat', 'BBB_cat', 'CCC_cat']
categories
{1: 'Alpha', 2: 'Beta', 3: 'Charlie'}  #The dictionary used to map the values in the existing columns
In [60]: df = pd.DataFrame({'AAA': [1, 1, 1, 2, 2, 2, 3, 3],
'BBB': [2, 1, 3, 4, 5, 1, 2, 3]})
In [61]: df
Out[61]:
AAA BBB
0 1 2
1 1 1
2 1 3
3 2 4
4 2 5
5 2 1
6 3 2
7 3 3
In [62]: df.loc[df.groupby("AAA")["BBB"].idxmin()]
Out[62]:
AAA BBB
1 1 1
5 2 1
6 3 2
In [63]: df.sort_values(by="BBB").groupby("AAA", as_index=False).first()
Out[63]:
AAA BBB
0 1 1
1 2 1
2 3 2
**[Code Reading]**
# Group by the 'AAA' column
gb = df.groupby("AAA")["BBB"]
# Getting group '2' of the 'AAA' column shows the rows below; the minimum 'BBB' value within that group is '1'
gb.get_group(2)
3 4
4 5
5 1
Name: BBB, dtype: int64
- Method 2
# Sort by the 'BBB' column
sv = df.sort_values(by="BBB")
sv
AAA BBB
1 1 1
5 2 1
0 1 2
6 3 2
2 1 3
7 3 3
3 2 4
4 2 5
# Group by the 'AAA' column
gb = sv.groupby("AAA", as_index=False)
# For group '2' of the 'AAA' column, the minimum 'BBB' value is '1'
gb.get_group(2)
AAA BBB
5 2 1
3 2 4
4 2 5
#Get the first row of each group
gb.first()
AAA BBB
0 1 1
1 2 1
2 3 2
In [64]: df = pd.DataFrame({'row': [0, 1, 2],
'One_X': [1.1, 1.1, 1.1],
'One_Y': [1.2, 1.2, 1.2],
'Two_X': [1.11, 1.11, 1.11],
'Two_Y': [1.22, 1.22, 1.22]})
In [65]: df
Out[65]:
row One_X One_Y Two_X Two_Y
0 0 1.1 1.2 1.11 1.22
1 1 1.1 1.2 1.11 1.22
2 2 1.1 1.2 1.11 1.22
#Labeled index
In [66]: df = df.set_index('row')
In [67]: df
Out[67]:
One_X One_Y Two_X Two_Y
row
0 1.1 1.2 1.11 1.22
1 1.1 1.2 1.11 1.22
2 1.1 1.2 1.11 1.22
#Split the existing column names on '_' and convert the resulting tuples to a hierarchical column index
In [68]: df.columns = pd.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
In [69]: df
Out[69]:
One Two
X Y X Y
row
0 1.1 1.2 1.11 1.22
1 1.1 1.2 1.11 1.22
2 1.1 1.2 1.11 1.22
#Stack the first column level into the rows, then reset that index level into a column
In [70]: df = df.stack(0).reset_index(1)
In [71]: df
Out[71]:
level_1 X Y
row
0 One 1.10 1.20
0 Two 1.11 1.22
1 One 1.10 1.20
1 Two 1.11 1.22
2 One 1.10 1.20
2 Two 1.11 1.22
#Rename the columns (note that the label 'level_1' was added automatically)
In [72]: df.columns = ['Sample', 'All_X', 'All_Y']
In [73]: df
Out[73]:
Sample All_X All_Y
row
0 One 1.10 1.20
0 Two 1.11 1.22
1 One 1.10 1.20
1 Two 1.11 1.22
2 One 1.10 1.20
2 Two 1.11 1.22
**[Code Reading]**
#Split the existing column names on '_' using a list comprehension
[tuple(c.split('_')) for c in df.columns]
[('One', 'X'), ('One', 'Y'), ('Two', 'X'), ('Two', 'Y')]
#Hierarchical indexes are represented in tuple format
df.columns = pd.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
df.columns
MultiIndex([('One', 'X'),
('One', 'Y'),
('Two', 'X'),
('Two', 'Y')],
)
#Pivot from column to row
df = df.stack(0)
X Y
row
0 One 1.10 1.20
Two 1.11 1.22
1 One 1.10 1.20
Two 1.11 1.22
2 One 1.10 1.20
Two 1.11 1.22
#Reset index level 1 ('One', 'Two') into a column
df.reset_index(1)
    level_1     X     Y
row
0       One  1.10  1.20
0       Two  1.11  1.22
1       One  1.10  1.20
1       Two  1.11  1.22
2       One  1.10  1.20
2       Two  1.11  1.22
## Arithmetic
* The code below has some changes from the official website.
In [74]: cols = pd.MultiIndex.from_tuples([(x, y) for x in ['A', 'B', 'C'] for y in ['O', 'I']])
In [75]: df = pd.DataFrame(np.arange(12).reshape((2,6)), index=['n', 'm'], columns=cols)
In [76]: df
Out[76]:
A B C
O I O I O I
n 0 1 2 3 4 5
m 6 7 8 9 10 11
In [77]: df = df.div(df['C'], level=1)
In [78]: df
Out[78]:
A B C
O I O I O I
n 0.0 0.200000 0.5 0.600000 1.0 1.0
m 0.6 0.636364 0.8 0.818182 1.0 1.0
**[Code Reading]**
Each value is divided by the corresponding column C value at the same inner level; for row n of column B this gives $[B][O] = 2 \div 4 = 0.5$ and $[B][I] = 3 \div 5 = 0.6$.
A | B | C | ||||
---|---|---|---|---|---|---|
O | I | O | I | O | I | |
n | 0.0 | 0.200000 | 0.5 | 0.600000 | 1.0 | 1.0 |
m | 0.6 | 0.636364 | 0.8 | 0.818182 | 1.0 | 1.0 |
coords = [('AA', 'one'), ('AA', 'six'), ('BB', 'one'), ('BB', 'two'), ('BB', 'six')]
index = pd.MultiIndex.from_tuples(coords)
df = pd.DataFrame([11, 22, 33, 44, 55], index, ['MyData'])
MyData | ||
---|---|---|
AA | one | 11 |
six | 22 | |
BB | one | 33 |
two | 44 | |
six | 55 |
#First index level (level=0)
df.xs('BB', level=0, axis=0)
MyData | |
---|---|
one | 33 |
two | 44 |
six | 55 |
#Second index level (level=1)
df.xs('six', level=1, axis=0)
MyData | |
---|---|
AA | 22 |
BB | 55 |
In [85]: import itertools
In [86]: index = list(itertools.product(['Ada', 'Quinn', 'Violet'],
['Comp', 'Math', 'Sci']))
In [87]: headr = list(itertools.product(['Exams', 'Labs'], ['I', 'II']))
In [88]: indx = pd.MultiIndex.from_tuples(index, names=['Student', 'Course'])
In [89]: cols = pd.MultiIndex.from_tuples(headr) # Notice these are un-named
In [90]: data = [[70 + x + y + (x * y) % 3 for x in range(4)] for y in range(9)]
In [91]: df = pd.DataFrame(data, indx, cols)
In [92]: df
Out[92]:
Exams Labs
I II I II
Student Course
Ada Comp 70 71 72 73
Math 71 73 75 74
Sci 72 75 75 75
Quinn Comp 73 74 75 76
Math 74 76 78 77
Sci 75 78 78 78
Violet Comp 76 77 78 79
Math 77 79 81 80
Sci 78 81 81 81
In [93]: All = slice(None)
In [94]: df.loc['Violet']
Out[94]:
Exams Labs
I II I II
Course
Comp 76 77 78 79
Math 77 79 81 80
Sci 78 81 81 81
In [95]: df.loc[(All, 'Math'), All]
Out[95]:
Exams Labs
I II I II
Student Course
Ada Math 71 73 75 74
Quinn Math 74 76 78 77
Violet Math 77 79 81 80
In [96]: df.loc[(slice('Ada', 'Quinn'), 'Math'), All]
Out[96]:
Exams Labs
I II I II
Student Course
Ada Math 71 73 75 74
Quinn Math 74 76 78 77
In [97]: df.loc[(All, 'Math'), ('Exams')]
Out[97]:
I II
Student Course
Ada Math 71 73
Quinn Math 74 76
Violet Math 77 79
In [98]: df.loc[(All, 'Math'), (All, 'II')]
Out[98]:
Exams Labs
II II
Student Course
Ada Math 73 74
Quinn Math 76 77
Violet Math 79 80
**[Code Reading]**
All = slice(None)  #What does this mean?
# slice(start, stop, step) -- so all three arguments are None
All
slice(None, None, None)
print(type(All))
# <class 'slice'> #Slice object
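As an aside (not from the original article), `pd.IndexSlice` gives a more readable way to build these slice objects for MultiIndex `.loc` lookups.

```python
import pandas as pd

idx = pd.IndexSlice
# equivalent to df.loc[(All, 'Math'), (All, 'II')] from the example above
df.loc[idx[:, 'Math'], idx[:, 'II']]
```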
indx
MultiIndex([( 'Ada', 'Comp'),
( 'Ada', 'Math'),
( 'Ada', 'Sci'),
( 'Quinn', 'Comp'),
( 'Quinn', 'Math'),
( 'Quinn', 'Sci'),
('Violet', 'Comp'),
('Violet', 'Math'),
('Violet', 'Sci')],
names=['Student', 'Course'])
cols
MultiIndex([('Exams', 'I'),
('Exams', 'II'),
( 'Labs', 'I'),
( 'Labs', 'II')],
)
df.sort_values(by=('Labs', 'II'), ascending=False)
Exams | Labs | ||||
---|---|---|---|---|---|
I | II | I | II | ||
Student | Course | ||||
Violet | Sci | 78 | 81 | 81 | 81 |
Math | 77 | 79 | 81 | 80 | |
Comp | 76 | 77 | 78 | 79 | |
Quinn | Sci | 75 | 78 | 78 | 78 |
Math | 74 | 76 | 78 | 77 | |
Comp | 73 | 74 | 75 | 76 | |
Ada | Sci | 72 | 75 | 75 | 75 |
Math | 71 | 73 | 75 | 74 | |
Comp | 70 | 71 | 72 | 73 |
#How do you add a new first level 'Firstlevel: Foo' to this DataFrame's index?
pd.concat([df], keys=['Foo'], names=['Firstlevel'])
Exams | Labs | |||||
---|---|---|---|---|---|---|
I | II | I | II | |||
Firstlevel | Student | Course | ||||
Foo | Ada | Comp | 70 | 71 | 72 | 73 |
Math | 71 | 73 | 75 | 74 | ||
Sci | 72 | 75 | 75 | 75 | ||
Quinn | Comp | 73 | 74 | 75 | 76 | |
Math | 74 | 76 | 78 | 77 | ||
Sci | 75 | 78 | 78 | 78 | ||
Violet | Comp | 76 | 77 | 78 | 79 | |
Math | 77 | 79 | 81 | 80 | ||
Sci | 78 | 81 | 81 | 81 |
df.columns = df.columns.get_level_values(0)
Exams | Exams | Labs | Labs | ||
---|---|---|---|---|---|
Student | Course | ||||
Ada | Comp | 70 | 71 | 72 | 73 |
Math | 71 | 73 | 75 | 74 | |
Sci | 72 | 75 | 75 | 75 | |
Quinn | Comp | 73 | 74 | 75 | 76 |
Math | 74 | 76 | 78 | 77 | |
Sci | 75 | 78 | 78 | 78 | |
Violet | Comp | 76 | 77 | 78 | 79 |
Math | 77 | 79 | 81 | 80 | |
Sci | 78 | 81 | 81 | 81 |
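A common alternative (my own sketch, not from the original) that keeps both levels when flattening is to join them; this assumes it is applied to the original MultiIndex columns, before the `get_level_values(0)` line above.

```python
# [('Exams', 'I'), ('Exams', 'II'), ('Labs', 'I'), ('Labs', 'II')]
df.columns = ['_'.join(col) for col in df.columns]
# -> ['Exams_I', 'Exams_II', 'Labs_I', 'Labs_II']
```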
In [100]: df = pd.DataFrame(np.random.randn(6, 1),
index=pd.date_range('2013-08-01', periods=6, freq='B'),
columns=list('A'))
In [101]: df.loc[df.index[3], 'A'] = np.nan
In [102]: df
Out[102]:
A
2013-08-01 0.721555
2013-08-02 -0.706771
2013-08-05 -1.039575
2013-08-06 NaN
2013-08-07 -0.424972
2013-08-08 0.567020
In [103]: df.reindex(df.index[::-1]).ffill()
Out[103]:
A
2013-08-08 0.567020
2013-08-07 -0.424972
2013-08-06 -0.424972
2013-08-05 -1.039575
2013-08-02 -0.706771
2013-08-01 0.721555
**[Code Reading]**
df.reindex(df.index[::-1]).ffill()
df.reindex(df.index[::-1]).fillna(method='ffill') #Synonymous with above
The above reindexes in reverse order and fills the gaps with `fillna(method='ffill')`. The `fillna` function is more versatile than `ffill()` because you can choose `method='ffill'` (forward) or `method='bfill'` (backward) to fill in the blanks.
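A minimal sketch (my own, not from the original) of the difference between forward and backward filling:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
s.fillna(method='ffill')   # 1.0, 1.0, 3.0 -- propagate the previous value forward
s.fillna(method='bfill')   # 1.0, 3.0, 3.0 -- pull the next value backward
```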
v = pd.Series([1, 1, 1, np.nan, 1, 1, 1, 1, np.nan, 1], dtype=float)
n = v.isnull()
a = ~n
c = a.astype(float).cumsum()
index = c[n].index  # positions of the NaN values (needed for the next step)
d = pd.Series(np.diff(np.hstack(([0.], c[n]))), index=index)
v[n] = -d
result = v.cumsum()
result
0 1.0
1 2.0
2 3.0
3 0.0
4 1.0
5 2.0
6 3.0
7 4.0
8 0.0
9 1.0
dtype: float64
**[Code Reading]**
(1) Create data containing NaN values: v = pd.Series([1, 1, 1, np.nan, 1, 1, 1, 1, np.nan, 1], dtype=float)
(2) n marks the NaN values as Booleans: n = v.isnull()
(3) a inverts the Boolean values of n: a = ~n
(4) Take the cumulative sum based on the Boolean values of a: c = a.astype(float).cumsum()
(5) Get the index of the NaN positions from the accumulated data: index = c[n].index
(6) Take the step-wise differences of c[n], starting from 0: d = pd.Series(np.diff(np.hstack(([0.], c[n]))), index=index)
(7) Assign the values obtained in (6) to the NaN positions of v as negative values: v[n] = -d
(8) Taking the cumulative sum of (7), the negative values reset the running total to 0: result = v.cumsum()
\begin
{array}{|c|c|c|c|c|c|c|r|c|} \hline
& (1) & (2) & (3) & (4) &(5)&(6)& (7) & (8) \\ \hline
0 & 1.0 & False & True & 1.0 & & & 1.0 & 1.0 \\ \hline
1 & 1.0 & False & True & 2.0 & & & 1.0 & 2.0 \\ \hline
2 & 1.0 & False & True & 3.0 & & & 1.0 & 3.0 \\ \hline
3 & NaN & True & False & 3.0 & 3 & 3 & -3.0 & 0.0 \\ \hline
4 & 1.0 & False & True & 4.0 & & & 1.0 & 1.0 \\ \hline
5 & 1.0 & False & True & 5.0 & & & 1.0 & 2.0 \\ \hline
6 & 1.0 & False & True & 6.0 & & & 1.0 & 3.0 \\ \hline
7 & 1.0 & False & True & 7.0 & & & 1.0 & 4.0 \\ \hline
8 & NaN & True & False & 7.0 & 8 & 4 & -4.0 & 0.0 \\ \hline
9 & 1.0 & False & True & 8.0 & & & 1.0 & 1.0 \\ \hline
\end{array}
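A shorter alternative (my own sketch, not from the original) that produces the same result by restarting the cumulative sum at every NaN:

```python
import numpy as np
import pandas as pd

v = pd.Series([1, 1, 1, np.nan, 1, 1, 1, 1, np.nan, 1], dtype=float)
# each NaN starts a new group; cumsum within the group, then put 0 at the NaN positions
v.groupby(v.isnull().cumsum()).cumsum().fillna(0)
```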
df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
'size': list('SSMMMLL'),
'weight': [8, 10, 11, 1, 20, 12, 12],
'adult': [False] * 5 + [True] * 2})
In [105]: df
Out[105]:
animal size weight adult
0 cat S 8 False
1 dog S 10 False
2 cat M 11 False
3 fish M 1 False
4 dog M 20 False
5 cat L 12 True
6 cat L 12 True
#List the sizes of the heaviest animals.
In [106]: df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()])
Out[106]:
animal
cat L
dog M
fish M
dtype: object
**[Code Reading]**
#If you break this down
df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()])
# Get the 'size' element at the index of the maximum 'weight' value
df['size'][df['weight'].idxmax()]
'M'
#So, applied per group, you get the size of the heaviest animal in each group
In [107]: gb = df.groupby(['animal'])
# 'get_group'Use to check the contents of the GroupBy object
In [108]: gb.get_group('cat')
Out[108]:
animal size weight adult
0 cat S 8 False
2 cat M 11 False
5 cat L 12 True
6 cat L 12 True
**[Code Reading]**
#Access to the object acquired by groupby
gb[['animal', 'size', 'weight']].get_group('cat')
animal | size | weight | |
---|---|---|---|
0 | cat | S | 8 |
2 | cat | M | 11 |
5 | cat | L | 12 |
6 | cat | L | 12 |
In [109]: def GrowUp(x):
avg_weight = sum(x[x['size'] == 'S'].weight * 1.5)
avg_weight += sum(x[x['size'] == 'M'].weight * 1.25)
avg_weight += sum(x[x['size'] == 'L'].weight)
avg_weight /= len(x)
return pd.Series(['L', avg_weight, True], index=['size', 'weight', 'adult'])
In [110]: expected_df = gb.apply(GrowUp)
In [111]: expected_df
Out[111]:
size weight adult
animal
cat L 12.4375 True
dog L 20.0000 True
fish L 1.2500 True
**[Code Reading]**
#Calculation of S size
cat_s = sum(df[(df['animal'] == 'cat') & (df['size'] == 'S')].weight * 1.5)
cat_s
12.0
#Calculation of M size
cat_m = sum(df[(df['animal'] == 'cat') & (df['size'] == 'M')].weight * 1.25)
cat_m
13.75
#Calculation of L size
cat_l = sum(df[(df['animal'] == 'cat') & (df['size'] == 'L')].weight)
cat_l
24
# Average over the S, M, and L totals
cat_avg = (cat_s + cat_m + cat_l) / len(df[df['animal'] == 'cat'])
cat_avg
12.4375
#The obtained result is returned as a pd.Series
The code below has some numerical changes from the official website.
# S = pd.Series([i / 100.0 for i in range(1, 11)])
S = pd.Series([i for i in range(1, 11)])
S
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
dtype: int64
def cum_ret(x, y):
return x * (1 + y)
def red(x):
p = functools.reduce(cum_ret, x, 1.0)
return p
# expanding() applies the function to all data seen so far (like mean(), std(), ...)
S.expanding().apply(red, raw=True)
0 2.0
1 6.0
2 24.0
3 120.0
4 720.0
5 5040.0
6 40320.0
7 362880.0
8 3628800.0
9 39916800.0
dtype: float64
**[Code Reading]**
x = 1: red(x) is called with the initial value 1.0; cum_ret(x, y) returns x * (1 + y) = 1.0 * (1 + 1) = 2.0, so p = 2.0
x = 2: p = 2.0 * (1 + 2) = 6.0
x = 3: p = 6.0 * (1 + 3) = 24.0
x = 4: p = 24.0 * (1 + 4) = 120.0
In [116]: df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, -1, 1, 2]})
A B
0 1 1
1 1 -1
2 2 1
3 2 2
In [117]: gb = df.groupby('A')
In [118]: def replace(g):
mask = g < 0
return g.where(mask, g[~mask].mean())
In [119]: gb.transform(replace)
Out[119]:
B
0 1.0
1 -1.0
2 1.5
3 1.5
df = pd.DataFrame({'A' : [1, 1, 1, 2, 2, 2], 'B' : [2, 3, -1, 1, 2, 6]})
gb = df.groupby('A')
def replace(g):
mask = g < 0
g.loc[mask] = g[~mask].mean()
return g
gb.transform(replace)
B
0 2.0
1 3.0
2 2.5
3 1.0
4 2.0
5 6.0
In [120]: df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2,
'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
'flag': [False, True] * 3})
In [121]: code_groups = df.groupby('code')
In [122]: agg_n_sort_order = code_groups[['data']].transform(sum).sort_values(by='data')
In [123]: sorted_df = df.loc[agg_n_sort_order.index]
In [124]: sorted_df
Out[124]:
code data flag
1 bar -0.21 True
4 bar -0.59 False
0 foo 0.16 False
3 foo 0.45 True
2 baz 0.33 False
5 baz 0.62 True
**[Code Reading]**
# Within each 'code' group, broadcast the group total of 'data', then sort by it
agg_n_sort_order = code_groups[['data']].transform(sum).sort_values(by='data')
data
1 -0.80
4 -0.80
0 0.61
3 0.61
2 0.95
5 0.95
#Then find and extract the index
agg_n_sort_order.index
Int64Index([1, 4, 0, 3, 2, 5], dtype='int64')
In [125]: rng = pd.date_range(start="2014-10-07", periods=10, freq='2min')
In [126]: ts = pd.Series(data=list(range(10)), index=rng)
In [127]: def MyCust(x):
if len(x) > 2:
return x[1] * 1.234
return pd.NaT
In [128]: mhc = {'Mean': np.mean, 'Max': np.max, 'Custom': MyCust}
In [129]: ts.resample("5min").apply(mhc)
Out[129]:
Mean 2014-10-07 00:00:00 1
2014-10-07 00:05:00 3.5
2014-10-07 00:10:00 6
2014-10-07 00:15:00 8.5
Max 2014-10-07 00:00:00 2
2014-10-07 00:05:00 4
2014-10-07 00:10:00 7
2014-10-07 00:15:00 9
Custom 2014-10-07 00:00:00 1.234
2014-10-07 00:05:00 NaT
2014-10-07 00:10:00 7.404
2014-10-07 00:15:00 NaT
dtype: object
In [130]: ts
Out[130]:
2014-10-07 00:00:00 0
2014-10-07 00:02:00 1
2014-10-07 00:04:00 2
2014-10-07 00:06:00 3
2014-10-07 00:08:00 4
2014-10-07 00:10:00 5
2014-10-07 00:12:00 6
2014-10-07 00:14:00 7
2014-10-07 00:16:00 8
2014-10-07 00:18:00 9
Freq: 2T, dtype: int64
**[Code Reading]**
#Resample the time series object into '5min' bins
ts_re = ts.resample('5min')
#View the contents of the object
for t in ts_re:
print(t)
(Timestamp('2014-10-07 00:00:00', freq='5T'), 2014-10-07 00:00:00 0
2014-10-07 00:02:00 1
2014-10-07 00:04:00 2
Freq: 2T, dtype: int64)
(Timestamp('2014-10-07 00:05:00', freq='5T'), 2014-10-07 00:06:00 3
2014-10-07 00:08:00 4
Freq: 2T, dtype: int64)
(Timestamp('2014-10-07 00:10:00', freq='5T'), 2014-10-07 00:10:00 5
2014-10-07 00:12:00 6
2014-10-07 00:14:00 7
Freq: 2T, dtype: int64)
(Timestamp('2014-10-07 00:15:00', freq='5T'), 2014-10-07 00:16:00 8
2014-10-07 00:18:00 9
Freq: 2T, dtype: int64)
ts | data | mean | max | MyCust | x[1] | |
---|---|---|---|---|---|---|
0 | 2014-10-07 00:00:00 | 0,1,2 | 1 | 2 | 1.234 | 1 |
1 | 2014-10-07 00:05:00 | 3,4 | 3.5 | 4 | NaT | False |
2 | 2014-10-07 00:10:00 | 5,6,7 | 6 | 7 | 7.404 | 6 |
3 | 2014-10-07 00:15:00 | 8,9 | 8.5 | 9 | NaT | False |
In [131]: df = pd.DataFrame({'Color': 'Red Red Red Blue'.split(),
'Value': [100, 150, 50, 50]})
In [132]: df
Out[132]:
Color Value
0 Red 100
1 Red 150
2 Red 50
3 Blue 50
In [133]: df['Counts'] = df.groupby(['Color']).transform(len)
In [134]: df
Out[134]:
Color Value Counts
0 Red 100 3
1 Red 150 3
2 Red 50 3
3 Blue 50 1
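An equivalent (my own sketch, not from the original) using the built-in `'size'` aggregation instead of `len`:

```python
df['Counts'] = df.groupby('Color')['Value'].transform('size')
```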
In [135]: df = pd.DataFrame({'line_race': [10, 10, 8, 10, 10, 8],
'beyer': [99, 102, 103, 103, 88, 100]},
index=['Last Gunfighter', 'Last Gunfighter', 'Last Gunfighter',
'Paynter', 'Paynter', 'Paynter'])
In [136]: df
Out[136]:
line_race beyer
Last Gunfighter 10 99
Last Gunfighter 10 102
Last Gunfighter 8 103
Paynter 10 103
Paynter 10 88
Paynter 8 100
In [137]: df['beyer_shifted'] = df.groupby(level=0)['beyer'].shift(1)
In [138]: df
Out[138]:
line_race beyer beyer_shifted
Last Gunfighter 10 99 NaN
Last Gunfighter 10 102 99.0
Last Gunfighter 8 103 102.0
Paynter 10 103 NaN
Paynter 10 88 103.0
Paynter 8 100 88.0
df = pd.DataFrame({'host': ['other', 'other', 'that', 'this', 'this'],
'service': ['mail', 'web', 'mail', 'mail', 'web'],
'no': [1, 2, 1, 2, 1]}).set_index(['host', 'service'])
In [140]: mask = df.groupby(level=0).agg('idxmax')
In [141]: df_count = df.loc[mask['no']].reset_index()
In [142]: df_count
Out[142]:
host service no
0 other web 2
1 that mail 1
2 this mail 2
**[Code Reading]**
no | ||
---|---|---|
host | service | |
other | mail | 1 |
 | web | 2 |
that | mail | 1 |
this | mail | 2 |
 | web | 1 |
# View the contents of the mask values grouped by level=0
mask = df.groupby(level=0).agg('idxmax')
no
host
other (other, web)
that (that, mail)
this (this, mail)
For each group, idxmax returns the index (a tuple) of the row holding the maximum value. The rows extracted with this mask are then given a fresh index by reset_index().
In [143]: df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=['A'])
In [144]: df['A'].groupby((df['A'] != df['A'].shift()).cumsum()).groups
Out[144]:
{1: Int64Index([0], dtype='int64'),
2: Int64Index([1], dtype='int64'),
3: Int64Index([2], dtype='int64'),
4: Int64Index([3, 4, 5], dtype='int64'),
5: Int64Index([6], dtype='int64'),
6: Int64Index([7, 8], dtype='int64')}
In [145]: df['A'].groupby((df['A'] != df['A'].shift()).cumsum()).cumsum()
Out[145]:
0 0
1 1
2 0
3 1
4 2
5 3
6 0
7 1
8 2
Name: A, dtype: int64
**[Code Reading]**
#The above data frame
df
A
0 0
1 1
2 0
3 1
4 1
5 1
6 0
7 1
8 1
data = [0, 1, 0, 1, 1, 1, 0, 1, 1]
gb = itertools.groupby(data)
for key, group in gb:
print(f'{key}: {list(group)}')
0: [0]
1: [1]
0: [0]
1: [1, 1, 1]
0: [0]
1: [1, 1]
tmp = df.groupby('A')
gb = tmp.groups
gb.get(1)
Int64Index([1, 3, 4, 5, 7, 8], dtype='int64')
gb.get(0)
Int64Index([0, 2, 6], dtype='int64')
(1) Sample data: df['A']
(2) Shifted data: df['A'].shift()
(3) True where (1) and (2) are not equal: df['A'] != df['A'].shift()
(4) Cumulative sum of (3): (df['A'] != df['A'].shift()).cumsum()
(5) Contents of the object grouped by the cumulative values: gb = df['A'].groupby((df['A'] != df['A'].shift()).cumsum())
(6) Cumulative sum within each group: df['A'].groupby((df['A'] != df['A'].shift()).cumsum()).cumsum()
\begin
{array}{|c|c|c|c|c|c|c|} \hline
& (1) & (2) & (3) & (4) & (5) & (6) \\ \hline
0 & 0 & NaN & True & 1 & 0 & 0 \\ \hline
1 & 1 & 0.0 & True & 2 & 1 & 1 \\ \hline
2 & 0 & 1.0 & True & 3 & 0 & 0 \\ \hline
3 & 1 & 0.0 & True & 4 & 1,1,1 & 1 \\ \hline
4 & 1 & 1.0 & False & 4 & & 2 \\ \hline
5 & 1 & 1.0 & False & 4 & & 3 \\ \hline
6 & 0 & 1.0 & True & 5 & 0 & 0 \\ \hline
7 & 1 & 0.0 & True & 6 & 1,1 & 1 \\ \hline
8 & 1 & 1.0 & False & 6 & & 2 \\ \hline
\end{array}
df = pd.DataFrame({'RollBasis':[1,1,1,2,3,5,8,10,12,13],
'ToRoll':[1,4,-5,2,-4,-2,0,-13,-2,-5]})
def f(x):
ser = df.ToRoll[(df.RollBasis >= x) & (df.RollBasis < x+5)]
return ser.sum()
df['Rolled'] = df.RollBasis.apply(f)
df
RollBasis ToRoll Rolled
0 1 1 -4
1 1 4 -4
2 1 -5 -4
3 2 2 -4
4 3 -4 -6
5 5 -2 -2
6 8 0 -15
7 10 -13 -20
8 12 -2 -7
9 13 -5 -5
**[Code Reading]**
RollBasis | ToRoll | |
---|---|---|
0 | 1 | 1 |
1 | 1 | 4 |
2 | 1 | -5 |
3 | 2 | 2 |
4 | 3 | -4 |
5 | 5 | -2 |
6 | 8 | 0 |
7 | 10 | -13 |
8 | 12 | -2 |
9 | 13 | -5 |
$ x=1: (RollBasis \geq 1) \ \& \ (RollBasis < 1+5) \rightarrow sum([1,4,-5,2,-4,-2]) \rightarrow -4 $
$ \vdots $
$ x=5: (RollBasis \geq 5) \ \& \ (RollBasis < 5+5) \rightarrow sum([-2,0]) \rightarrow -2 $
df = pd.DataFrame({'B': range(5)})
df.index = [pd.Timestamp('20130101 09:00:00'),
pd.Timestamp('20130101 09:00:02'),
pd.Timestamp('20130101 09:00:03'),
pd.Timestamp('20130101 09:00:05'),
pd.Timestamp('20130101 09:00:06')]
df
2013-01-01 09:00:00 0
2013-01-01 09:00:02 1
2013-01-01 09:00:03 2
2013-01-01 09:00:05 3
2013-01-01 09:00:06 4
df.rolling(2, min_periods=1).sum()
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:05 5.0
2013-01-01 09:00:06 7.0
df.rolling('2s', min_periods=1).sum()
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:05 3.0
2013-01-01 09:00:06 7.0
**[Code Reading]**
# rolling(window width, minimum number of observations)
df.rolling(2, min_periods=1).sum()
# 0+1=1, 1+2=3, 2+3=5, 3+4=7 -- summed over windows of 2 rows
df.rolling('2s', min_periods=1).sum()
# 0+1=1, 1+2=3, 3+0=3, 3+4=7 -- summed over 2-second time windows
#For comparison, resampling the time series at 2-second intervals and taking the sum
df.resample('2s').sum()
B
2013-01-01 09:00:00 0
2013-01-01 09:00:02 3
2013-01-01 09:00:04 3
2013-01-01 09:00:06 4
df = pd.DataFrame({'enddate':pd.to_datetime(['2019-03-01','2019-03-01','2019-03-01','2019-03-02','2019-03-02','2019-03-02','2019-03-03','2019-03-03','2019-03-04']),
'bloom':[0.342, 0.235, 0.456, 0.389, 0.453, 0.367, 0.369, 0.428, 0.55],
'unbloom':[0.658, 0.765, 0.544, 0.611, 0.547, 0.633, 0.631, 0.572, 0.45]})
In [146]: df = pd.DataFrame(data={'Case': ['A', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'A'],
'Data': np.random.randint(10,50,(9,))})
In [147]: dfs = list(zip(*df.groupby((1 * (df['Case'] == 'B')).cumsum()
.rolling(window=3, min_periods=1).median())))[-1]
In [148]: dfs[0]
Out[148]:
Case Data
0 A 24
1 A 34
2 A 33
3 B 11
In [149]: dfs[1]
Out[149]:
Case Data
4 A 31
5 A 35
6 B 13
In [150]: dfs[2]
Out[150]:
Case Data
7 A 17
8 A 33
**[Code Reading]**
Case | Data | |
---|---|---|
0 | A | 24 |
1 | A | 34 |
2 | A | 33 |
3 | B | 11 |
4 | A | 31 |
5 | A | 35 |
6 | B | 13 |
7 | A | 17 |
8 | A | 33 |
dfs = list(zip(*df.groupby((1 * (df['Case'] == 'B')).cumsum().rolling(window=3, min_periods=1).median())))[-1]
# Set rows where the 'Case' column is 'B' to 1
1 * (df['Case'] == 'B')
>
0 0
1 0
2 0
3 1
4 0
5 0
6 1
7 0
8 0
Name: Case, dtype: int64
#Take the cumulative sum
1 * (df['Case'] == 'B').cumsum()
>
0 0
1 0
2 0
3 1
4 1
5 1
6 2
7 2
8 2
Name: Case, dtype: int64
#Take the rolling median of the cumulative values over a 3-row window
(1 * (df['Case'] == 'B')).cumsum().rolling(window=3, min_periods=1).median()
>
0 0.0
1 0.0
2 0.0
3 0.0
4 1.0
5 1.0
6 1.0
7 2.0
8 2.0
Name: Case, dtype: float64
#The previous step gives the key values for grouping; group by them, then unpack the resulting (key, frame) pairs with zip(*...) and turn them into a list.
tmp2 = list(zip(*df.groupby((1 * (df['Case'] == 'B')).cumsum().rolling(window=3, min_periods=1).median())))
>
[(0.0, 1.0, 2.0), ( Case Data
0 A 24
1 A 34
2 A 33
3 B 11, Case Data
4 A 31
5 A 35
6 B 13, Case Data
7 A 17
8 A 33)]
#Without the unpack operator (*), the contents look like the following, so you can see the difference unpacking makes.
tmp = list(zip(df.groupby((1 * (df['Case'] == 'B')).cumsum().rolling(window=3, min_periods=1).median())))
>
[((0.0, Case Data
0 A 24
1 A 34
2 A 33
3 B 11),), ((1.0, Case Data
4 A 31
5 A 35
6 B 13),), ((2.0, Case Data
7 A 17
8 A 33),)]
#dfs is a list of the form [(0.0, 1.0, 2.0), (the data frames below)], so [-1] takes the last element.
dfs = list(zip(*df.groupby((1 * (df['Case'] == 'B')).cumsum().rolling(window=3, min_periods=1).median())))[-1]
>
( Case Data
0 A 24
1 A 34
2 A 33
3 B 11, Case Data
4 A 31
5 A 35
6 B 13, Case Data
7 A 17
8 A 33)
In [151]: df = pd.DataFrame(data={'Province': ['ON', 'QC', 'BC', 'AL', 'AL', 'MN', 'ON'],
'City': ['Toronto', 'Montreal', 'Vancouver',
'Calgary', 'Edmonton', 'Winnipeg', 'Windsor'],
'Sales': [13, 6, 16, 8, 4, 3, 1]})
In [152]: table = pd.pivot_table(df, values=['Sales'], index=['Province'],
columns=['City'], aggfunc=np.sum, margins=True)
In [153]: table.stack('City')
Out[153]:
Sales
Province City
AL All 12.0
Calgary 8.0
Edmonton 4.0
BC All 16.0
Vancouver 16.0
... ...
All Montreal 6.0
Toronto 13.0
Vancouver 16.0
Windsor 1.0
Winnipeg 3.0
[20 rows x 1 columns]
**[Code Reading]**
The given data frame has the following contents.
Province | City | Sales | |
---|---|---|---|
0 | ON | Toronto | 13 |
1 | QC | Montreal | 6 |
2 | BC | Vancouver | 16 |
3 | AL | Calgary | 8 |
4 | AL | Edmonton | 4 |
5 | MN | Winnipeg | 3 |
6 | ON | Windsor | 1 |
#Convert to tabular format with the pivot_table function.
table = pd.pivot_table(df, values=['Sales'], index=['Province'], columns=['City'], aggfunc=np.sum, margins=True)
Sales | ||||||||
---|---|---|---|---|---|---|---|---|
Province | Calgary | Edmonton | Montreal | Toronto | Vancouver | Windsor | Winnipeg | All |
AL | 8 | 4 | nan | nan | nan | nan | nan | 12 |
BC | nan | nan | nan | nan | 16 | nan | nan | 16 |
MN | nan | nan | nan | nan | nan | nan | 3 | 3 |
ON | nan | nan | nan | 13 | nan | 1 | nan | 14 |
QC | nan | nan | 6 | nan | nan | nan | nan | 6 |
All | 8 | 4 | 6 | 13 | 16 | 1 | 3 | 51 |
The stack function transforms the data in the column direction into the row direction.
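A tiny illustration (my own, not from the article) of what `stack` does:

```python
import pandas as pd

small = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['a', 'b'])
small.stack()
# a  x    1
#    y    3
# b  x    2
#    y    4
# dtype: int64
```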
In [154]: grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]
In [155]: df = pd.DataFrame({'ID': ["x%d" % r for r in range(10)],
'Gender': ['F', 'M', 'F', 'M', 'F',
'M', 'F', 'M', 'M', 'M'],
'ExamYear': ['2007', '2007', '2007', '2008', '2008',
'2008', '2008', '2009', '2009', '2009'],
'Class': ['algebra', 'stats', 'bio', 'algebra',
'algebra', 'stats', 'stats', 'algebra',
'bio', 'bio'],
'Participated': ['yes', 'yes', 'yes', 'yes', 'no',
'yes', 'yes', 'yes', 'yes', 'yes'],
'Passed': ['yes' if x > 50 else 'no' for x in grades],
'Employed': [True, True, True, False,
False, False, False, True, True, False],
'Grade': grades})
In [156]: df.groupby('ExamYear').agg({'Participated': lambda x: x.value_counts()['yes'],
'Passed': lambda x: sum(x == 'yes'),
'Employed': lambda x: sum(x),
'Grade': lambda x: sum(x) / len(x)})
Out[156]:
Participated Passed Employed Grade
ExamYear
2007 3 2 3 74.000000
2008 3 3 0 68.500000
2009 3 2 2 60.666667
**[Code Reading]**
ID | Gender | ExamYear | Class | Participated | Passed | Employed | Grade | |
---|---|---|---|---|---|---|---|---|
0 | x0 | F | 2007 | algebra | yes | no | True | 48 |
1 | x1 | M | 2007 | stats | yes | yes | True | 99 |
2 | x2 | F | 2007 | bio | yes | yes | True | 75 |
3 | x3 | M | 2008 | algebra | yes | yes | False | 80 |
4 | x4 | F | 2008 | algebra | no | no | False | 42 |
5 | x5 | M | 2008 | stats | yes | yes | False | 80 |
6 | x6 | F | 2008 | stats | yes | yes | False | 72 |
7 | x7 | M | 2009 | algebra | yes | yes | True | 68 |
8 | x8 | M | 2009 | bio | yes | no | True | 36 |
9 | x9 | M | 2009 | bio | yes | yes | False | 78 |
select ExamYear,
count(Participated = 'yes' or null) as participated,
count(Passed = 'yes' or null) as passed,
count(Employed = 'True' or null) as employed,
sum(cast(Grade as real))/count(cast(Grade as real)) as grades
from df
group by ExamYear
;
"2007" "3" "2" "3" "74.0"
"2008" "3" "3" "0" "68.5"
"2009" "3" "2" "2" "60.6666666666667"
In [157]: df = pd.DataFrame({'value': np.random.randn(36)},
index=pd.date_range('2011-01-01', freq='M', periods=36))
In [158]: pd.pivot_table(df, index=df.index.month, columns=df.index.year,
values='value', aggfunc='sum')
Out[158]:
2011 2012 2013
1 -1.039268 -0.968914 2.565646
2 -0.370647 -1.294524 1.431256
3 -1.157892 0.413738 1.340309
4 -1.344312 0.276662 -1.170299
5 0.844885 -0.472035 -0.226169
6 1.075770 -0.013960 0.410835
7 -0.109050 -0.362543 0.813850
8 1.643563 -0.006154 0.132003
9 -1.469388 -0.923061 -0.827317
10 0.357021 0.895717 -0.076467
11 -0.674600 0.805244 -1.187678
12 -1.776904 -1.206412 1.130127
**[Code Reading]**
# The index has 36 monthly entries (months 1-12 repeated for each year); pivot_table aggregates them over the unique month/year combinations.
df.index.month
>
Int64Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12],
dtype='int64')
>
df.index.year
>
Int64Index([2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,
2012, 2012, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,
2013, 2013, 2013],
dtype='int64')
I expected pivot_table to eat memory the way an Excel pivot does, but the result is actually smaller. On reflection this is natural, because the pivoted frame no longer carries the pre-conversion data.
df = pd.DataFrame({'value': np.random.randn(36)},
index=pd.date_range('2011-01-01', freq='M', periods=36))
df.__sizeof__()
576
df_pivot = pd.pivot_table(df, index=df.index.month, columns=df.index.year, values='value', aggfunc='sum')
df_pivot.__sizeof__()
384
In [159]: df = pd.DataFrame(data={'A': [[2, 4, 8, 16], [100, 200], [10, 20, 30]],
'B': [['a', 'b', 'c'], ['jj', 'kk'], ['ccc']]},
index=['I', 'II', 'III'])
In [160]: def SeriesFromSubList(aList):
return pd.Series(aList)
In [161]: df_orgz = pd.concat({ind: row.apply(SeriesFromSubList)
for ind, row in df.iterrows()})
In [162]: df_orgz
Out[162]:
0 1 2 3
I A 2 4 8 16.0
B a b c NaN
II A 100 200 NaN NaN
B jj kk NaN NaN
III A 10 20 30 NaN
B ccc NaN NaN NaN
**[Code Reading]**
A | B | |
---|---|---|
I | [2, 4, 8, 16] | [a, b, c] |
II | [100, 200] | [jj, kk] |
III | [10, 20, 30] | [ccc] |
Each cell of the created data frame holds a list; each list is converted to a pd.Series and the results are concatenated. Rows shorter than the widest row are padded with NaN.
for ind, row in df.iterrows():
print('ind:'+str(ind))
print(row)
r = row.apply(SeriesFromSubList)
print(r)
>
ind:I
A [2, 4, 8, 16]
B [a, b, c]
Name: I, dtype: object
0 1 2 3
A 2 4 8 16.0
B a b c NaN
ind:II
A [100, 200]
B [jj, kk]
Name: II, dtype: object
0 1
A 100 200
B jj kk
ind:III
A [10, 20, 30]
B [ccc]
Name: III, dtype: object
0 1 2
A 10 20.0 30.0
B ccc NaN NaN
In [163]: df = pd.DataFrame(data=np.random.randn(2000, 2) / 10000,
index=pd.date_range('2001-01-01', periods=2000),
columns=['A', 'B'])
In [164]: df
Out[164]:
A B
2001-01-01 -0.000144 -0.000141
2001-01-02 0.000161 0.000102
2001-01-03 0.000057 0.000088
2001-01-04 -0.000221 0.000097
2001-01-05 -0.000201 -0.000041
... ... ...
2006-06-19 0.000040 -0.000235
2006-06-20 -0.000123 -0.000021
2006-06-21 -0.000113 0.000114
2006-06-22 0.000136 0.000109
2006-06-23 0.000027 0.000030
[2000 rows x 2 columns]
In [165]: def gm(df, const):
v = ((((df['A'] + df['B']) + 1).cumprod()) - 1) * const
return v.iloc[-1]
In [166]: s = pd.Series({df.index[i]: gm(df.iloc[i:min(i + 51, len(df) - 1)], 5)
for i in range(len(df) - 50)})
In [167]: s
Out[167]:
2001-01-01 0.000930
2001-01-02 0.002615
2001-01-03 0.001281
2001-01-04 0.001117
2001-01-05 0.002772
...
2006-04-30 0.003296
2006-05-01 0.002629
2006-05-02 0.002081
2006-05-03 0.004247
2006-05-04 0.003928
Length: 1950, dtype: float64
**[Code Reading]**
#In this example, a window of up to 51 rows (roughly 50 days) is extracted for each start date.
#The gm function is applied to each window to obtain a return value.
#The start date and the return value are saved for every date from the first of the target period ('2001-01-01') to the last (2006-05-04).
>
#What the gm function does: take the cumulative product of (df['A'] + df['B'] + 1), subtract 1, multiply by the constant 5, and return the last element.
gm(df.iloc[0:min(0 + 51, len(df) - 1)], 5)
0.000930
In [168]: rng = pd.date_range(start='2014-01-01', periods=100)
In [169]: df = pd.DataFrame({'Open': np.random.randn(len(rng)),
'Close': np.random.randn(len(rng)),
'Volume': np.random.randint(100, 2000, len(rng))},
index=rng)
In [170]: df
Out[170]:
Open Close Volume
2014-01-01 -1.611353 -0.492885 1219
2014-01-02 -3.000951 0.445794 1054
2014-01-03 -0.138359 -0.076081 1381
2014-01-04 0.301568 1.198259 1253
2014-01-05 0.276381 -0.669831 1728
... ... ... ...
2014-04-06 -0.040338 0.937843 1188
2014-04-07 0.359661 -0.285908 1864
2014-04-08 0.060978 1.714814 941
2014-04-09 1.759055 -0.455942 1065
2014-04-10 0.138185 -1.147008 1453
[100 rows x 3 columns]
In [171]: def vwap(bars):
return ((bars.Close * bars.Volume).sum() / bars.Volume.sum())
In [172]: window = 5
In [173]: s = pd.concat([(pd.Series(vwap(df.iloc[i:i + window]),
index=[df.index[i + window]]))
for i in range(len(df) - window)])
In [174]: s.round(2)
Out[174]:
2014-01-06 0.02
2014-01-07 0.11
2014-01-08 0.10
2014-01-09 0.07
2014-01-10 -0.29
...
2014-04-06 -0.63
2014-04-07 -0.02
2014-04-08 -0.03
2014-04-09 0.34
2014-04-10 0.29
Length: 95, dtype: float64
**[Code Reading]**
[(pd.Series(vwap(df.iloc[i:i + window]),index=[df.index[i + window]])) for i in range(len(df) - window)]
#Decomposing the function for the first window (i = 0) gives the following.
tmp = df.iloc[0:0 + window]
(tmp.Close*tmp.Volume).sum()
tmp.Volume.sum()
(tmp.Close*tmp.Volume).sum() / tmp.Volume.sum()
0.02
# indexer_between_time(start, end)
rng = pd.date_range('1/1/2000', periods=24, freq='H')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
2000-01-01 00:00:00 1.298558
2000-01-01 01:00:00 0.333122
2000-01-01 02:00:00 -0.034170
2000-01-01 03:00:00 -2.396508
2000-01-01 04:00:00 -0.608591
2000-01-01 05:00:00 1.710535
2000-01-01 06:00:00 -0.657193
2000-01-01 07:00:00 -1.166563
2000-01-01 08:00:00 1.768663
2000-01-01 09:00:00 1.395805
2000-01-01 10:00:00 -0.111062
2000-01-01 11:00:00 0.840320
2000-01-01 12:00:00 0.129081
2000-01-01 13:00:00 -1.604446
2000-01-01 14:00:00 -1.257967
2000-01-01 15:00:00 -0.384715
2000-01-01 16:00:00 -1.185482
2000-01-01 17:00:00 2.222229
2000-01-01 18:00:00 -0.917695
2000-01-01 19:00:00 -2.158507
2000-01-01 20:00:00 -0.487902
2000-01-01 21:00:00 -1.667997
2000-01-01 22:00:00 0.202186
2000-01-01 23:00:00 -0.710070
Freq: H, dtype: float64
>
#When extracting from 10:00 to 14:00
ts.iloc[ts.index.indexer_between_time(datetime.time(10), datetime.time(14))]
2000-01-01 10:00:00 -0.111062
2000-01-01 11:00:00 0.840320
2000-01-01 12:00:00 0.129081
2000-01-01 13:00:00 -1.604446
2000-01-01 14:00:00 -1.257967
Freq: H, dtype: float64
index = pd.date_range('2013-1-1',periods=10,freq='15Min')
data = pd.DataFrame(data=[1,2,3,4,5,6,7,8,9,0], columns=['value'], index=index)
data
value
2013-01-01 00:00:00 1
2013-01-01 00:15:00 2
2013-01-01 00:30:00 3
2013-01-01 00:45:00 4
2013-01-01 01:00:00 5
2013-01-01 01:15:00 6
2013-01-01 01:30:00 7
2013-01-01 01:45:00 8
2013-01-01 02:00:00 9
2013-01-01 02:15:00 0
data.index.indexer_between_time('01:15', '02:00')
array([5, 6, 7, 8])
data.iloc[data.index.indexer_between_time('1:15', '02:00')]
013-01-01 01:15:00 6
2013-01-01 01:30:00 7
2013-01-01 01:45:00 8
2013-01-01 02:00:00 9
**[Code Reading]**
#If the time specification is reversed, the period to be excluded is extracted.
data.iloc[data.index.indexer_between_time('02:00', '1:15')]
value
2013-01-01 00:00:00 1
2013-01-01 00:15:00 2
2013-01-01 00:30:00 3
2013-01-01 00:45:00 4
2013-01-01 01:00:00 5
2013-01-01 01:15:00 6
2013-01-01 02:00:00 9
2013-01-01 02:15:00 0
rng = pd.date_range('20130101 09:00','20130110 16:00',freq='30T')
DatetimeIndex(['2013-01-01 09:00:00', '2013-01-01 09:30:00',
'2013-01-01 10:00:00', '2013-01-01 10:30:00',
'2013-01-01 11:00:00', '2013-01-01 11:30:00',
'2013-01-01 12:00:00', '2013-01-01 12:30:00',
'2013-01-01 13:00:00', '2013-01-01 13:30:00',
...
'2013-01-10 11:30:00', '2013-01-10 12:00:00',
'2013-01-10 12:30:00', '2013-01-10 13:00:00',
'2013-01-10 13:30:00', '2013-01-10 14:00:00',
'2013-01-10 14:30:00', '2013-01-10 15:00:00',
'2013-01-10 15:30:00', '2013-01-10 16:00:00'],
dtype='datetime64[ns]', length=447, freq='30T')
#Exclude unnecessary time
rng = rng.take(rng.indexer_between_time('09:30','16:00'))
DatetimeIndex(['2013-01-01 09:30:00', '2013-01-01 10:00:00',
'2013-01-01 10:30:00', '2013-01-01 11:00:00',
'2013-01-01 11:30:00', '2013-01-01 12:00:00',
'2013-01-01 12:30:00', '2013-01-01 13:00:00',
'2013-01-01 13:30:00', '2013-01-01 14:00:00',
...
'2013-01-10 11:30:00', '2013-01-10 12:00:00',
'2013-01-10 12:30:00', '2013-01-10 13:00:00',
'2013-01-10 13:30:00', '2013-01-10 14:00:00',
'2013-01-10 14:30:00', '2013-01-10 15:00:00',
'2013-01-10 15:30:00', '2013-01-10 16:00:00'],
dtype='datetime64[ns]', length=140, freq=None)
#Only on weekdays
rng = rng[rng.weekday<5]
DatetimeIndex(['2013-01-01 09:30:00', '2013-01-01 10:00:00',
'2013-01-01 10:30:00', '2013-01-01 11:00:00',
'2013-01-01 11:30:00', '2013-01-01 12:00:00',
'2013-01-01 12:30:00', '2013-01-01 13:00:00',
'2013-01-01 13:30:00', '2013-01-01 14:00:00',
...
'2013-01-10 11:30:00', '2013-01-10 12:00:00',
'2013-01-10 12:30:00', '2013-01-10 13:00:00',
'2013-01-10 13:30:00', '2013-01-10 14:00:00',
'2013-01-10 14:30:00', '2013-01-10 15:00:00',
'2013-01-10 15:30:00', '2013-01-10 16:00:00'],
dtype='datetime64[ns]', length=112, freq=None)
The code below has changed some numbers from stackoverflow.com.
date_item = pd.to_datetime(['2019/1/4','2019/1/7','2019/1/7','2019/1/8','2019/1/9','2019/1/10','2019/1/10','2019/1/11','2019/1/15','2019/1/16','2019/1/16','2019/1/17','2019/1/18'])
DatetimeIndex(['2019-01-04', '2019-01-07', '2019-01-07', '2019-01-08',
'2019-01-09', '2019-01-10', '2019-01-10', '2019-01-11',
'2019-01-15', '2019-01-16', '2019-01-16', '2019-01-17',
'2019-01-18'],
dtype='datetime64[ns]', freq=None)
orders = pd.DataFrame({'Date':date_item,
'by':['buy','buy','sell','buy','sell','buy','buy','sell','buy','sell','buy','sell','buy'],
'cnt':[100, 200, 100, 50, 100, 100, 200, 50, 100, 50, 50, 50, 100],
'code':['1720','8086','4967','8086','8086','4967','1720','4967','1720','1720','8086','4967','8086'],
'prices':[1008, 1344, 7530, 1347, 1373, 7120, 1008, 7120, 995, 986, 1417, 6990, 1445]})
Prices
1720 | 8086 | 4967 | |
---|---|---|---|
2019-01-04 | 1008 | 1311 | 7370 |
2019-01-07 | 1033 | 1344 | 7530 |
2019-01-08 | 1025 | 1347 | 7130 |
2019-01-09 | 1034 | 1373 | 7460 |
2019-01-10 | 1008 | 1364 | 7120 |
2019-01-11 | 1000 | 1391 | 7030 |
2019-01-15 | 995 | 1400 | 6950 |
2019-01-16 | 986 | 1417 | 6920 |
2019-01-17 | 1002 | 1419 | 6990 |
2019-01-18 | 1010 | 1445 | 7050 |
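The construction of the `prices` frame is not shown in the article; a sketch that rebuilds it from the table above (values copied from the table) could look like this:

```python
import pandas as pd

dates = pd.to_datetime(['2019-01-04', '2019-01-07', '2019-01-08', '2019-01-09',
                        '2019-01-10', '2019-01-11', '2019-01-15', '2019-01-16',
                        '2019-01-17', '2019-01-18'])
prices = pd.DataFrame({'1720': [1008, 1033, 1025, 1034, 1008, 1000, 995, 986, 1002, 1010],
                       '8086': [1311, 1344, 1347, 1373, 1364, 1391, 1400, 1417, 1419, 1445],
                       '4967': [7370, 7530, 7130, 7460, 7120, 7030, 6950, 6920, 6990, 7050]},
                      index=dates)
```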
Orders
Date | by | cnt | code | prices | |
---|---|---|---|---|---|
0 | 2019-01-04 | buy | 100 | 1720 | 1008 |
1 | 2019-01-07 | buy | 200 | 8086 | 1344 |
2 | 2019-01-07 | sell | 100 | 4967 | 7530 |
3 | 2019-01-08 | buy | 50 | 8086 | 1347 |
4 | 2019-01-09 | sell | 100 | 8086 | 1373 |
5 | 2019-01-10 | buy | 100 | 4967 | 7120 |
6 | 2019-01-10 | buy | 200 | 1720 | 1008 |
7 | 2019-01-11 | sell | 50 | 4967 | 7120 |
8 | 2019-01-15 | buy | 100 | 1720 | 995 |
9 | 2019-01-16 | sell | 50 | 1720 | 986 |
10 | 2019-01-16 | buy | 50 | 8086 | 1417 |
11 | 2019-01-17 | sell | 50 | 4967 | 6990 |
12 | 2019-01-18 | buy | 100 | 8086 | 1445 |
lookup(row_labels, col_labels)
prices.lookup(orders.Date, orders.code)
array([1008, 1344, 7530, 1347, 1373, 7120, 1008, 7030, 995, 986, 1417,
6990, 1445])
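Note: `DataFrame.lookup` was deprecated in pandas 1.2 and later removed; an equivalent sketch (my own, under that assumption) uses the positional indexers directly:

```python
# map each order's date/code to row/column positions, then index the underlying array
rows = prices.index.get_indexer(orders.Date)
cols = prices.columns.get_indexer(orders.code)
prices.to_numpy()[rows, cols]
```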
The code below has some numerical changes from the content posted on stackoverflow.com.
df = pd.DataFrame({'Date':['2020-01-01', '2020-01-02'],'h1':[23,45],'h2':[18,17],'h3':[11,16],'h4':[29,31],'h24':[45,55]})
Date h1 h2 h3 h4 h24
0 2020-01-01 23 18 11 29 45
1 2020-01-02 45 17 16 31 55
#Convert to the following format
Date value
0 2020-01-01 23
2 2020-01-01 18
4 2020-01-01 11
6 2020-01-01 29
8 2020-01-01 45
1 2020-01-02 45
3 2020-01-02 17
5 2020-01-02 16
7 2020-01-02 31
9 2020-01-02 55
**[Code Reading]**
#Convert from wide format to long format
df = pd.melt(df, id_vars=['Date'])
>
Date variable value
0 2020-01-01 h1 23
1 2020-01-02 h1 45
2 2020-01-01 h2 18
3 2020-01-02 h2 17
4 2020-01-01 h3 11
5 2020-01-02 h3 16
6 2020-01-01 h4 29
7 2020-01-02 h4 31
8 2020-01-01 h24 45
9 2020-01-02 h24 55
#Rename column
df = df.rename(columns={'variable': 'hour'})
>
Date hour value
0 2020-01-01 h1 23
1 2020-01-02 h1 45
2 2020-01-01 h2 18
3 2020-01-02 h2 17
4 2020-01-01 h3 11
5 2020-01-02 h3 16
6 2020-01-01 h4 29
7 2020-01-02 h4 31
8 2020-01-01 h24 45
9 2020-01-02 h24 55
#Strip the 'h' prefix from the hour column and convert to 0-based integer hours
df['hour'] = df['hour'].apply(lambda x: int(x.lstrip('h'))-1)
>
Date hour value
0 2020-01-01 0 23
1 2020-01-02 0 45
2 2020-01-01 1 18
3 2020-01-02 1 17
4 2020-01-01 2 11
5 2020-01-02 2 16
6 2020-01-01 3 29
7 2020-01-02 3 31
8 2020-01-01 23 45
9 2020-01-02 23 55
#Creating column data that combines day and time
combined = df.apply(lambda x: pd.to_datetime(x['Date'], dayfirst=True) + datetime.timedelta(hours=int(x['hour'])), axis=1)
>
0 2020-01-01 00:00:00
1 2020-01-02 00:00:00
2 2020-01-01 01:00:00
3 2020-01-02 01:00:00
4 2020-01-01 02:00:00
5 2020-01-02 02:00:00
6 2020-01-01 03:00:00
7 2020-01-02 03:00:00
8 2020-01-01 23:00:00
9 2020-01-02 23:00:00
dtype: datetime64[ns]
# 'Date'Replace column data
df['Date'] = combined
Date hour value
0 2020-01-01 00:00:00 0 23
1 2020-01-02 00:00:00 0 45
2 2020-01-01 01:00:00 1 18
3 2020-01-02 01:00:00 1 17
4 2020-01-01 02:00:00 2 11
5 2020-01-02 02:00:00 2 16
6 2020-01-01 03:00:00 3 29
7 2020-01-02 03:00:00 3 31
8 2020-01-01 23:00:00 23 45
9 2020-01-02 23:00:00 23 55
# 'hour'Delete column
del df['hour']
>
#Sort by date
df = df.sort_values("Date")
>
Date value
0 2020-01-01 23
2 2020-01-01 18
4 2020-01-01 11
6 2020-01-01 29
8 2020-01-01 45
1 2020-01-02 45
3 2020-01-02 17
5 2020-01-02 16
7 2020-01-02 31
9 2020-01-02 55
#A: create time-series data (date/time plus millisecond offsets, dummy data) -> 100 rows
intervals = np.random.randint(0,1000,size=100).cumsum()
df = pd.DataFrame({'time':[pd.Timestamp('20140101') + pd.offsets.Milli(i) for i in intervals ],
'value' : np.random.randn(len(intervals))})
# df
time value
0 2014-01-01 00:00:00.499 0.567731
1 2014-01-01 00:00:01.232 -0.751466
2 2014-01-01 00:00:01.238 1.250118
3 2014-01-01 00:00:01.533 0.588161
4 2014-01-01 00:00:02.210 0.996543
... ... ...
95 2014-01-01 00:00:49.216 0.753233
96 2014-01-01 00:00:49.719 0.849922
97 2014-01-01 00:00:49.732 1.171001
98 2014-01-01 00:00:50.723 -0.443835
99 2014-01-01 00:00:51.389 -0.830960
100 rows × 2 columns
#B: time-series index (date/time at 1-second intervals) -> 3601 entries
pd.date_range('20140101 00:00:00','20140101 01:00:00',freq='s')
DatetimeIndex(['2014-01-01 00:00:00', '2014-01-01 00:00:01',
'2014-01-01 00:00:02', '2014-01-01 00:00:03',
'2014-01-01 00:00:04', '2014-01-01 00:00:05',
'2014-01-01 00:00:06', '2014-01-01 00:00:07',
'2014-01-01 00:00:08', '2014-01-01 00:00:09',
...
'2014-01-01 00:59:51', '2014-01-01 00:59:52',
'2014-01-01 00:59:53', '2014-01-01 00:59:54',
'2014-01-01 00:59:55', '2014-01-01 00:59:56',
'2014-01-01 00:59:57', '2014-01-01 00:59:58',
'2014-01-01 00:59:59', '2014-01-01 01:00:00'],
dtype='datetime64[ns]', length=3601, freq='S')
#result
2014-01-01 00:00:00.000 NaN
2014-01-01 00:00:00.499 0.567731
2014-01-01 00:00:01.000 0.567731 <-Originally NaN
2014-01-01 00:00:01.232 -0.751466
2014-01-01 00:00:01.238 1.250118
2014-01-01 00:00:01.533 0.588161
2014-01-01 00:00:02.000 0.588161 <-Originally NaN
2014-01-01 00:00:02.210 0.996543
2014-01-01 00:00:02.652 0.322535
2014-01-01 00:00:03.000 0.322535 <-Originally NaN
**[Code Reading]**
intervals
>
array([ 499, 1232, 1238, 1533, 2210, 2652, 3512, 3598, 3663,
3688, 3758, 3969, 4960, 5933, 6211, 7040, 7763, 8331,
9329, 10229, 10436, 10671, 10727, 11695, 12249, 13033, 13867,
13895, 14809, 15069, 16022, 16484, 16597, 17044, 17060, 17344,
18124, 18629, 19256, 20022, 20620, 21080, 21148, 21603, 22164,
22872, 23075, 23566, 24133, 24887, 24996, 25132, 25435, 26047,
26287, 27168, 27228, 28129, 29118, 29539, 30339, 30465, 31396,
31805, 32583, 33021, 33854, 34439, 34865, 35207, 35212, 35345,
35865, 36586, 37325, 37422, 38411, 38986, 39227, 39768, 40316,
41254, 42162, 42476, 43155, 43971, 44791, 44899, 45770, 46701,
47568, 47901, 48242, 48720, 48940, 49216, 49719, 49732, 50723,
51389])
>
#Create a B time series and combine it with an A time series index
rng = pd.date_range('20140101 00:00:00','20140101 01:00:00', freq='s')
new_range = pd.DatetimeIndex(np.append(rng, pd.Index(df.time)))
#The combined A + B index is not in order, so sort it by time
new_range = new_range.sort_values()
#Set 'time' as the index and reindex onto the new combined index
df.set_index('time').reindex(new_range).head()
>
2014-01-01 00:00:00.000 NaN
2014-01-01 00:00:00.499 0.567731
2014-01-01 00:00:01.000 NaN
2014-01-01 00:00:01.232 -0.751466
2014-01-01 00:00:01.238 1.250118
>
#Fill the missing values with ffill()
df.set_index('time').reindex(new_range).ffill().head(10)
>
value
2014-01-01 00:00:00.000 NaN
2014-01-01 00:00:00.499 0.567731
2014-01-01 00:00:01.000 0.567731
2014-01-01 00:00:01.232 -0.751466
2014-01-01 00:00:01.238 1.250118
2014-01-01 00:00:01.533 0.588161
2014-01-01 00:00:02.000 0.588161
2014-01-01 00:00:02.210 0.996543
2014-01-01 00:00:02.652 0.322535
2014-01-01 00:00:03.000 0.322535
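A related approach (my own sketch, not from the original) uses `pd.merge_asof`, which matches each grid timestamp with the most recent observation at or before it; unlike the reindex above it keeps only the grid timestamps, not the union of both indexes.

```python
# assumes the `df` with 'time'/'value' columns from above, already sorted by time
grid = pd.DataFrame({'time': pd.date_range('20140101 00:00:00', '20140101 01:00:00', freq='s')})
pd.merge_asof(grid, df, on='time', direction='backward').head()
```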
df = pd.DataFrame(np.random.randint(1000, 1200, (12,1)), columns=['value'],
index = pd.date_range('2014-01-01 00:15:00', freq='15min', periods=12))
df
value
2014-01-01 00:15:00 1122
2014-01-01 00:30:00 1132
2014-01-01 00:45:00 1091
2014-01-01 01:00:00 1188
2014-01-01 01:15:00 1156
2014-01-01 01:30:00 1089
2014-01-01 01:45:00 1148
2014-01-01 02:00:00 1040
2014-01-01 02:15:00 1010
2014-01-01 02:30:00 1130
2014-01-01 02:45:00 1178
2014-01-01 03:00:00 1186
df['normed'] = df.groupby(grouper).transform(lambda x: x / x.mean())
df
value normed
Date
2014-01-01 00:15:00 1122 0.999555
2014-01-01 00:30:00 1132 1.008463
2014-01-01 00:45:00 1091 0.971938
2014-01-01 01:00:00 1188 1.058352
2014-01-01 01:15:00 1156 1.029844
2014-01-01 01:30:00 1089 0.970156
2014-01-01 01:45:00 1148 1.022717
2014-01-01 02:00:00 1040 0.926503
2014-01-01 02:15:00 1010 0.899777
2014-01-01 02:30:00 1130 1.006682
2014-01-01 02:45:00 1178 1.049443
2014-01-01 03:00:00 1186 1.056570
**[Code Reading]**
df.index.name = 'Date'
#Use Grouper to group by a specific period or interval. Here it is monthly: freq='M'.
grouper = pd.Grouper(level=0, freq='M')
>
#Looking at the groups property of the grouped object, you can see the group key is the month-end timestamp.
df.groupby(grouper).groups
{Timestamp('2014-01-31 00:00:00', freq='M'): 12}
>
#An easier-to-read method without Grouper is also possible. This is convenient!
df.groupby([df.index.year, df.index.month]).transform(lambda x: x/x.mean())
data = pd.concat([pd.DataFrame([['A']*36, list(pd.date_range('1/1/2011', periods=36, freq='H')), list(np.random.rand(36))], index = ['Group', 'Time', 'Value']).T,
pd.DataFrame([['B']*36, list(pd.date_range('1/1/2011', periods=36, freq='H')), list(np.random.rand(36))], index = ['Group', 'Time', 'Value']).T,
pd.DataFrame([['C']*36, list(pd.date_range('1/1/2011', periods=36, freq='H')), list(np.random.rand(36))], index = ['Group', 'Time', 'Value']).T],
axis = 0).set_index(['Group', 'Time'])
data
Value
Group Time
A 2011-01-01 00:00:00 0.306533
2011-01-01 01:00:00 0.653431
2011-01-01 02:00:00 0.0284997
2011-01-01 03:00:00 0.852406
2011-01-01 04:00:00 0.856705
... ... ...
C 2011-01-02 07:00:00 0.607041
2011-01-02 08:00:00 0.291705
2011-01-02 09:00:00 0.480091
2011-01-02 10:00:00 0.890153
2011-01-02 11:00:00 0.628454
108 rows × 1 columns
#result
Time Group
2011-01-01 A 0.497803
B 0.524934
C 0.566736
2011-01-02 A 0.442884
B 0.452267
C 0.632567
Name: Value, dtype: float64
**[Code Reading]**
data['Value'] = data['Value'].astype(float)
#Group by two keys (Grouper A: by day, Grouper B: 'Group') and take the mean.
daily_counts = data.groupby([pd.Grouper(freq='D', level='Time'), pd.Grouper(level='Group')])['Value'].mean()
dtrange = pd.date_range(datetime.datetime(2013,1,1), datetime.datetime(2013,2,20))
df = pd.DataFrame({'p1': np.random.rand(len(dtrange)) + 5,
'p2': np.random.rand(len(dtrange)) + 10},
index=dtrange)
p1 p2
2013-01-01 5.428541 10.292222
2013-01-02 5.172898 10.077244
2013-01-03 5.414736 10.587493
2013-01-04 5.235626 10.567700
2013-01-05 5.659100 10.270270
2013-01-06 5.144520 10.033026
2013-01-07 5.340707 10.561371
2013-01-08 5.256909 10.647431
2013-01-09 5.491950 10.036358
2013-01-10 5.869493 10.010417
#result
df.groupby(date).mean()
p1 p2
2013-01-01 5.401448 10.308353
2013-01-11 5.362789 10.463397
2013-01-21 5.531094 10.346834
2013-02-01 5.559112 10.551347
2013-02-11 5.551023 10.451666
**[Code Reading]**
#This d is the offset of each date from the start of its 10-day period (beginning, middle, end of month).
d = df.index.day - np.clip((df.index.day - 1) // 10, 0, 2) * 10 - 1
d
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6,
7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 1, 2,
3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
dtype='int64')
#This gets the date of the index of the data.
df.index.day
Int64Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 1, 2, 3,
4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
dtype='int64')
#Subtract 1 so the days fall into 10-day bins starting at 0
df.index.day - 1
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
dtype='int64')
#Creating indexes for early, mid, and late
(df.index.day - 1) // 10
Int64Index([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1],
dtype='int64')
# np.clip(a, a_min, a_max, out=None, **kwargs)
# np.clip limits element values to a given range; here they are clipped to the range 0 to 2
np.clip((df.index.day - 1) // 10, 0, 2)
Int64Index([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1],
dtype='int64')
# Convert the bucket number to a day offset: early = 0, mid = 10, late = 20.
np.clip((df.index.day - 1) // 10, 0, 2) * 10
Int64Index([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 10, 10, 10, 10, 10, 10,
10, 10, 10, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10],
dtype='int64')
#Check the calculation of d.
tmp = pd.DataFrame({'A':df.index.day.values,
'B':np.clip((df.index.day - 1) // 10, 0, 2) * 10},
index=np.arange(51))
tmp['Diff'] = tmp['A'] - tmp['B']
tmp.head()
A B Diff
0 1 0 1
1 2 0 2
2 3 0 3
3 4 0 4
4 5 0 5
>
tmp['Diff'] = (tmp['A'] - tmp['B']) - 1
A B Diff
0 1 0 0
1 2 0 1
2 3 0 2
3 4 0 3
4 5 0 4
>
# Create the time keys 2013-01-01, 2013-01-11, 2013-01-21, 2013-02-01, 2013-02-11
# Converting d to timedelta format and subtracting it from the original timestamps snaps each date to the start of its ten-day period.
date = df.index.values - np.array(d, dtype='timedelta64[D]')
>
array(['2013-01-01T00:00:00.000000000', '2013-01-01T00:00:00.000000000',
'2013-01-01T00:00:00.000000000', '2013-01-01T00:00:00.000000000',
'2013-01-01T00:00:00.000000000', '2013-01-01T00:00:00.000000000',
'2013-01-01T00:00:00.000000000', '2013-01-01T00:00:00.000000000',
'2013-01-01T00:00:00.000000000', '2013-01-01T00:00:00.000000000',
'2013-01-11T00:00:00.000000000', '2013-01-11T00:00:00.000000000',
'2013-01-11T00:00:00.000000000', '2013-01-11T00:00:00.000000000',
'2013-01-11T00:00:00.000000000', '2013-01-11T00:00:00.000000000',
'2013-01-11T00:00:00.000000000', '2013-01-11T00:00:00.000000000',
'2013-01-11T00:00:00.000000000', '2013-01-11T00:00:00.000000000',
'2013-01-21T00:00:00.000000000', '2013-01-21T00:00:00.000000000',
'2013-01-21T00:00:00.000000000', '2013-01-21T00:00:00.000000000',
'2013-01-21T00:00:00.000000000', '2013-01-21T00:00:00.000000000',
'2013-01-21T00:00:00.000000000', '2013-01-21T00:00:00.000000000',
'2013-01-21T00:00:00.000000000', '2013-01-21T00:00:00.000000000',
'2013-01-21T00:00:00.000000000', '2013-02-01T00:00:00.000000000',
'2013-02-01T00:00:00.000000000', '2013-02-01T00:00:00.000000000',
'2013-02-01T00:00:00.000000000', '2013-02-01T00:00:00.000000000',
'2013-02-01T00:00:00.000000000', '2013-02-01T00:00:00.000000000',
'2013-02-01T00:00:00.000000000', '2013-02-01T00:00:00.000000000',
'2013-02-01T00:00:00.000000000', '2013-02-11T00:00:00.000000000',
'2013-02-11T00:00:00.000000000', '2013-02-11T00:00:00.000000000',
'2013-02-11T00:00:00.000000000', '2013-02-11T00:00:00.000000000',
'2013-02-11T00:00:00.000000000', '2013-02-11T00:00:00.000000000',
'2013-02-11T00:00:00.000000000', '2013-02-11T00:00:00.000000000',
'2013-02-11T00:00:00.000000000'], dtype='datetime64[ns]')
>
# Group by this 'date' key and take the mean.
df.groupby(date).mean()
p1 p2
2013-01-01 5.401448 10.308353
2013-01-11 5.362789 10.463397
2013-01-21 5.531094 10.346834
2013-02-01 5.559112 10.551347
2013-02-11 5.551023 10.451666
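As a quick check of the grouping key (a sketch, assuming the df and date computed above), the number of days falling into each ten-day group can be counted; the late-January group absorbs the 31st and therefore has 11 days.
```python
# Count the days in each ten-day group (assumes 'date' from above).
counts = pd.Series(date).value_counts().sort_index()
print(counts)
# Expected: 10, 10, 11, 10, 10 (late January also contains the 31st)
```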
#Create sample data at one-minute frequency
dates = pd.date_range('2014-01-01','2014-01-11', freq='T')[0:-1]
#Only on business days
dates = dates[dates.dayofweek < 5]
s = pd.DataFrame(np.random.randn(dates.size), dates)
s
0
2014-01-01 00:00:00 -2.593328
2014-01-01 00:01:00 0.173850
2014-01-01 00:02:00 0.781819
2014-01-01 00:03:00 0.734917
2014-01-01 00:04:00 -1.323457
... ...
2014-01-10 23:55:00 0.158127
2014-01-10 23:56:00 1.205610
2014-01-10 23:57:00 -0.757652
2014-01-10 23:58:00 0.350570
2014-01-10 23:59:00 0.886426
11520 rows × 1 columns
#Resample the date and time of the data frame in 30-minute increments and take the average
s.groupby(lambda d: d.date()).resample('30min').mean()
0
2014-01-01 2014-01-01 00:00:00 -0.152332
2014-01-01 00:30:00 -0.172909
2014-01-01 01:00:00 0.110629
2014-01-01 01:30:00 -0.070501
2014-01-01 02:00:00 0.058135
... ... ...
2014-01-10 2014-01-10 21:30:00 -0.158843
2014-01-10 22:00:00 0.204738
2014-01-10 22:30:00 -0.057380
2014-01-10 23:00:00 0.212008
2014-01-10 23:30:00 -0.021065
384 rows × 1 columns
**[Code Reading]** The data are grouped by calendar day and then resampled into 30-minute bins, taking the mean of each bin. The original one-minute range contains 14,400 rows; restricting to business days leaves 11,520 rows, and aggregating those into 30-minute bins yields 384 rows.
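The row counts quoted above can be verified with a short sketch using the same date range:
```python
import pandas as pd

dates = pd.date_range('2014-01-01', '2014-01-11', freq='T')[0:-1]
print(len(dates))                  # 14400 one-minute rows over 10 days
business = dates[dates.dayofweek < 5]
print(len(business))               # 11520 = 8 business days x 1440 minutes
print(len(business) // 1440 * 48)  # 384  = 8 days x 48 half-hour bins
```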
import random
df = pd.DataFrame({'string': [random.choice(('about', 'contact', 'home', 'blog')) for _ in range(120)],
'visits':np.random.randint(0,2000, (120,))},
index = pd.date_range('2001-01-01', freq='1d', periods=120))
re_df = df.groupby('string').resample('M').sum()
visits
string
about 2001-01-31 4879
2001-02-28 6713
2001-03-31 5747
2001-04-30 9286
blog 2001-01-31 9164
2001-02-28 8318
2001-03-31 4770
2001-04-30 6571
contact 2001-01-31 6883
2001-02-28 6177
2001-03-31 10228
2001-04-30 3531
home 2001-01-31 5496
2001-02-28 5261
2001-03-31 7083
2001-04-30 11383
**[Code Reading]**
# Group by the date level (level=1) so each page's monthly visits are divided by that month's total across all pages.
re_df['ratio'] = re_df.groupby(level=1).transform(lambda x: x / x.sum())
>
visits ratio
string
about 2001-01-31 4879 0.184657
2001-02-28 6713 0.253617
2001-03-31 5747 0.206519
2001-04-30 9286 0.301778
blog 2001-01-31 9164 0.346832
2001-02-28 8318 0.314254
2001-03-31 4770 0.171410
2001-04-30 6571 0.213545
contact 2001-01-31 6883 0.260503
2001-02-28 6177 0.233367
2001-03-31 10228 0.367543
2001-04-30 3531 0.114751
home 2001-01-31 5496 0.208008
2001-02-28 5261 0.198761
2001-03-31 7083 0.254528
2001-04-30 11383 0.369926
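As a quick sanity check (a sketch, assuming re_df from above), the ratios within each month should sum to 1 across the four pages:
```python
# Within each month (index level 1), the four pages' ratios sum to 1.
print(re_df.groupby(level=1)['ratio'].sum())
```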
In [177]: rng = pd.date_range('2000-01-01', periods=6)
In [178]: df1 = pd.DataFrame(np.random.randn(6, 3), index=rng, columns=['A', 'B', 'C'])
In [179]: df2 = df1.copy()
In [180]: df = df1.append(df2, ignore_index=True)
In [181]: df
Out[181]:
A B C
0 -0.870117 -0.479265 -0.790855
1 0.144817 1.726395 -0.464535
2 -0.821906 1.597605 0.187307
3 -0.128342 -1.511638 -0.289858
4 0.399194 -1.430030 -0.639760
5 1.115116 -2.012600 1.810662
6 -0.870117 -0.479265 -0.790855
7 0.144817 1.726395 -0.464535
8 -0.821906 1.597605 0.187307
9 -0.128342 -1.511638 -0.289858
10 0.399194 -1.430030 -0.639760
11 1.115116 -2.012600 1.810662
**[Code Reading]**
df2 is created with copy(), and appending it with ignore_index=True renumbers the combined index as 0..11 instead of repeating 0..5 (see the sketch below).
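A minimal, self-contained sketch of what ignore_index=True changes when appending the copy:
```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(6, 3), columns=['A', 'B', 'C'])
df2 = df1.copy()
print(df1.append(df2).index.tolist())                     # [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
print(df1.append(df2, ignore_index=True).index.tolist())  # [0, 1, 2, ..., 11]
```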
In [182]: df = pd.DataFrame(data={'Area': ['A'] * 5 + ['C'] * 2,
'Bins': [110] * 2 + [160] * 3 + [40] * 2,
'Test_0': [0, 1, 0, 1, 2, 0, 1],
'Data': np.random.randn(7)})
In [183]: df
Out[183]:
Area Bins Test_0 Data
0 A 110 0 0.632955
1 A 110 1 1.485463
2 A 160 0 -1.193891
3 A 160 1 -0.324484
4 A 160 2 1.293263
5 C 40 0 -0.476979
6 C 40 1 -0.467655
In [184]: df['Test_1'] = df['Test_0'] - 1
In [185]: pd.merge(df, df, left_on=['Bins', 'Area', 'Test_0'],
right_on=['Bins', 'Area', 'Test_1'],
suffixes=('_L', '_R'))
Out[185]:
Area Bins Test_0_L Data_L Test_1_L Test_0_R Data_R Test_1_R
0 A 110 0 0.632955 -1 1 1.485463 0
1 A 160 0 -1.193891 -1 1 -0.324484 0
2 A 160 1 -0.324484 0 2 1.293263 1
3 C 40 0 -0.476979 -1 1 -0.467655 0
**[Code Reading]**
# Add a Test_1 column that is Test_0 minus 1, then self-merge on (Bins, Area) with the left Test_0 matched to the right Test_1; this pairs each row with the row holding the next test number in the same group.
df['Test_1'] = df['Test_0'] - 1
>
Area Bins Test_0 Data Test_1
0 A 110 0 0.632955 -1
1 A 110 1 1.485463 0
2 A 160 0 -1.193891 -1
3 A 160 1 -0.324484 0
4 A 160 2 1.293263 1
5 C 40 0 -0.476979 -1
6 C 40 1 -0.467655 0
####Combine by value-based condition:clipboard:
The code below is slightly modified from a post on stackoverflow.com.
```python
import operator as op
df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}])
df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}])
binOp = op.lt
matches = np.where(binOp(df_a.a[:,None],df_b.c.values))
pd.concat([df.loc[idxs].reset_index(drop=True)   # .loc replaces the removed .ix accessor
           for df, idxs in zip([df_a, df_b], matches)], axis=1)
```
df_a
a b
0 1 4
1 2 5
2 3 6
df_b
c d
0 2 7
1 3 8
# result
a b c d
0 1 4 2 7
1 1 4 3 8
2 2 5 3 8
[Code reading]
* First, let's take apart how 'matches' is computed.
# Check the contents of columns df.a and df.b
df_a.a[:,None]
array([[1],
[2],
[3]])
df_b.c.values
array([2, 3])
* The function-style standard operator operator.lt(a, b) is equivalent to a < b:
op.lt(df_a.a[:,None],df_b.c.values)
array([[ True, True],
[False, True],
[False, False]])
# Written differently, the same result can be obtained with the < operator directly
df_a.a[:,None] < df_b.c.values
array([[ True, True],
[False, True],
[False, False]])
>
# To check this content
df_a.a[:,None][0] < df_b.c.values[0]
array([ True])
df_a.a[:,None][0] < df_b.c.values[1]
array([ True])
df_a.a[:,None][1] < df_b.c.values[0]
array([False])
df_a.a[:,None][1] < df_b.c.values[1]
array([ True])
df_a.a[:,None][2] < df_b.c.values[0]
array([False])
df_a.a[:,None][2] < df_b.c.values[1]
array([False])
>
# In the case of np.where (condition), the index of the element that satisfies the condition is returned in tuple format.
np.where([[True, True], [False, True], [False, False]])
(array([0, 0, 1]), array([0, 1, 1]))
>
# Extract rows from each frame using the index arrays obtained in 'matches' above.
[df.loc[idxs].reset_index(drop=True) for df,idxs in zip([df_a,df_b],matches)]
[ a b
0 1 4
1 1 4
2 2 5, c d
0 2 7
1 3 8
2 3 8]
>
# This is connected in the horizontal direction (axis = 1).
a b c d
0 1 4 2 7
1 1 4 3 8
2 2 5 3 8
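As a possibly simpler alternative (a sketch of my own, not from the original post), the same conditional join can be written as a cross join on a dummy key followed by filtering:
```python
# Cross join via a dummy key, then keep only the rows where a < c.
out = (df_a.assign(_key=1)
            .merge(df_b.assign(_key=1), on='_key')
            .drop(columns='_key'))
out = out[out['a'] < out['c']].reset_index(drop=True)
print(out)
#    a  b  c  d
# 0  1  4  2  7
# 1  1  4  3  8
# 2  2  5  3  8
```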
In [186]: df = pd.DataFrame(
.....: {'stratifying_var': np.random.uniform(0, 100, 20),
.....: 'price': np.random.normal(100, 5, 20)})
.....:
In [187]: df['quartiles'] = pd.qcut(
.....: df['stratifying_var'],
.....: 4,
.....: labels=['0-25%', '25-50%', '50-75%', '75-100%'])
.....:
In [188]: df.boxplot(column='price', by='quartiles')
Out[188]: <matplotlib.axes._subplots.AxesSubplot at 0x7fc66b7f1710>
[Code reading]
pandas.qcut(x, q, labels=None, ...)
# Split x into q quantile-based bins and attach the given labels.
df['quartiles'] = pd.qcut(df['stratifying_var'], 4, labels=['0-25%', '25-50%', '50-75%', '75-100%'])
>
stratifying_var price quartiles
0 95.463259 110.468740 75-100%
1 61.567537 105.038334 50-75%
2 93.169189 99.502664 75-100%
3 32.881181 100.462400 25-50%
4 22.735506 95.821731 0-25%
5 85.662861 101.262124 75-100%
6 24.251856 102.351950 0-25%
7 26.323525 91.812003 25-50%
8 6.192982 111.425087 0-25%
9 25.520758 104.666583 25-50%
10 75.505473 104.450480 75-100%
11 30.620504 100.044772 25-50%
12 40.438555 102.697402 50-75%
13 18.171318 102.612876 0-25%
14 57.080747 101.067847 50-75%
15 44.066472 93.410125 50-75%
16 64.131972 103.707151 50-75%
17 33.548572 103.359709 25-50%
18 88.433754 99.416668 75-100%
19 21.660715 91.229785 0-25%
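Since qcut bins by quantiles, each label should receive roughly the same number of rows; a quick check (sketch, assuming the df above):
```python
# With 20 rows and 4 quantile bins, each label holds 5 rows.
print(df['quartiles'].value_counts().sort_index())
```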
###Read multiple files to create a single DataFrame:thumbsup:
# Example A
In [189]: for i in range(3):
data = pd.DataFrame(np.random.randn(10, 4))
data.to_csv('file_{}.csv'.format(i))
In [190]: files = ['file_0.csv', 'file_1.csv', 'file_2.csv']
In [191]: result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
# Example B
In [192]: import glob
In [193]: import os
In [194]: files = glob.glob('file_*.csv')
In [195]: result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
[Code reading]
*In Example A, sample files are created in a loop, then read back and concatenated. *In Example B, the file names are collected with the glob function, and the files are read and concatenated.
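One caveat worth noting (my own note, not in the original): glob does not guarantee file order, and read_csv re-reads the index that to_csv saved as an extra column. A slightly more careful version might look like this:
```python
import glob
import pandas as pd

# Sort the file names for a reproducible order and reuse the saved index column.
files = sorted(glob.glob('file_*.csv'))
result = pd.concat((pd.read_csv(f, index_col=0) for f in files), ignore_index=True)
```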
###Parsing multi-column date components
In [196]: i = pd.date_range('20000101', periods=10000)
In [197]: df = pd.DataFrame({'year': i.year, 'month': i.month, 'day': i.day})
In [198]: df.head()
Out[198]:
year month day
0 2000 1 1
1 2000 1 2
2 2000 1 3
3 2000 1 4
4 2000 1 5
In [199]: %timeit pd.to_datetime(df.year * 10000 + df.month * 100 + df.day, format='%Y%m%d')
ds = df.apply(lambda x: "%04d%02d%02d" % (x['year'],
x['month'], x['day']), axis=1)
ds.head()
%timeit pd.to_datetime(ds)
9.98 ms +- 235 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
2.91 ms +- 57.5 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
[Code reading]
# Build the timestamps by arithmetic on the numeric columns of the data frame
df.year * 10000 + df.month * 100 + df.day
>
# Build the timestamps by formatting each row into a string with apply()
ds = df.apply(lambda x: "%04d%02d%02d" % (x['year'], x['month'], x['day']), axis=1)
The point of the comparison is the difference in computation cost between the two approaches (note that the second timing does not include the cost of the apply() call itself).
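Another option worth knowing (not shown in the quoted example): pd.to_datetime can assemble datetimes directly from columns named year, month and day.
```python
# Assemble datetimes straight from the year/month/day columns.
ds2 = pd.to_datetime(df[['year', 'month', 'day']])
print(ds2.head())
```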
###Skip the line between the header and the data
In [200]: data = """;;;;
;;;;
;;;;
;;;;
;;;;
;;;;
;;;;
;;;;
;;;;
;;;;
date;Param1;Param2;Param4;Param5
;m²;°C;m²;m
;;;;
01.01.1990 00:00;1;1;2;3
01.01.1990 01:00;5;3;4;5
01.01.1990 02:00;9;5;6;7
01.01.1990 03:00;13;7;8;9
01.01.1990 04:00;17;9;10;11
01.01.1990 05:00;21;11;12;13
"""
In [201]: from io import StringIO
In [202]: pd.read_csv(StringIO(data), sep=';', skiprows=[11, 12],
                      index_col=0, parse_dates=True, header=10)
Out[202]:
Param1 Param2 Param4 Param5
date
1990-01-01 00:00:00 1 1 2 3
1990-01-01 01:00:00 5 3 4 5
1990-01-01 02:00:00 9 5 6 7
1990-01-01 03:00:00 13 7 8 9
1990-01-01 04:00:00 17 9 10 11
1990-01-01 05:00:00 21 11 12 13
In [203]: pd.read_csv(StringIO(data), sep=';', header=10, nrows=10).columns
Out[203]: Index(['date', 'Param1', 'Param2', 'Param4', 'Param5'], dtype='object')
In [204]: columns = pd.read_csv(StringIO(data), sep=';', header=10, nrows=10).columns
In [205]: pd.read_csv(StringIO(data), sep=';', index_col=0,
header=12, parse_dates=True, names=columns)
Out[205]:
Param1 Param2 Param4 Param5
date
1990-01-01 00:00:00 1 1 2 3
1990-01-01 01:00:00 5 3 4 5
1990-01-01 02:00:00 9 5 6 7
1990-01-01 03:00:00 13 7 8 9
1990-01-01 04:00:00 17 9 10 11
1990-01-01 05:00:00 21 11 12 13
[Code reading]
Neither approach has a clear advantage, regardless of the size of the data. Either way you have to find out how many rows to skip and which row holds the header, so you either skip the appropriate number of rows when reading, or open the file in an editor first to check (see the sketch below for one way to locate the header automatically).
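If you would rather not count the rows by hand, one possible approach (a hypothetical helper, not from the cookbook) is to scan for the line that starts the header:
```python
from io import StringIO

# Locate the header row automatically instead of hard-coding 10 and 12.
header_row = next(i for i, line in enumerate(StringIO(data)) if line.startswith('date;'))
columns = pd.read_csv(StringIO(data), sep=';', header=header_row, nrows=1).columns
parsed = pd.read_csv(StringIO(data), sep=';', index_col=0, parse_dates=True,
                     header=header_row + 2, names=columns)
print(parsed.head())
```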
#Computation ###Correlation
In [211]: df = pd.DataFrame(np.random.random(size=(100, 5)))
In [212]: corr_mat = df.corr()
In [213]: mask = np.tril(np.ones_like(corr_mat, dtype=np.bool), k=-1)
In [214]: corr_mat.where(mask)
Out[214]:
0 1 2 3 4
0 NaN NaN NaN NaN NaN
1 -0.018923 NaN NaN NaN NaN
2 -0.076296 -0.012464 NaN NaN NaN
3 -0.169941 -0.289416 0.076462 NaN NaN
4 0.064326 0.018759 -0.084140 -0.079859 NaN
[Code reading]
mask
array([[False, False, False, False, False],
[ True, False, False, False, False],
[ True, True, False, False, False],
[ True, True, True, False, False],
[ True, True, True, True, False]])
>
mask2 = np.triu(np.ones_like(corr_mat, dtype=np.bool), k=-1)
mask2
array([[ True, True, True, True, True],
[ True, True, True, True, True],
[False, True, True, True, True],
[False, False, True, True, True],
[False, False, False, True, True]])
>
corr_mat.where(mask2)
>
0 1 2 3 4
0 1.000000 -0.022615 0.047486 0.092043 -0.112379
1 -0.022615 1.000000 -0.336864 -0.025473 0.004696
2 NaN -0.336864 1.000000 0.047746 -0.008458
3 NaN NaN 0.047746 1.000000 0.133289
4 NaN NaN NaN 0.133289 1.000000
numpy.tril() -> Lower triangle of an array.
numpy.triu() -> Upper triangle of an array.
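A tiny illustration of the two helpers on a 3x3 boolean array (a sketch):
```python
import numpy as np

# tril keeps elements on and below the k-th diagonal, triu on and above it.
a = np.ones((3, 3), dtype=bool)
print(np.tril(a, k=-1).astype(int))
# [[0 0 0]
#  [1 0 0]
#  [1 1 0]]
print(np.triu(a, k=-1).astype(int))
# [[1 1 1]
#  [1 1 1]
#  [0 1 1]]
```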
In [218]: import datetime
In [219]: s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))
In [220]: s - s.max()
Out[220]:
0 -2 days
1 -1 days
2 0 days
dtype: timedelta64[ns]
In [221]: s.max() - s
Out[221]:
0 2 days
1 1 days
2 0 days
dtype: timedelta64[ns]
In [222]: s - datetime.datetime(2011, 1, 1, 3, 5)
Out[222]:
0 364 days 20:55:00
1 365 days 20:55:00
2 366 days 20:55:00
dtype: timedelta64[ns]
In [223]: s + datetime.timedelta(minutes=5)
Out[223]:
0 2012-01-01 00:05:00
1 2012-01-02 00:05:00
2 2012-01-03 00:05:00
dtype: datetime64[ns]
In [224]: datetime.datetime(2011, 1, 1, 3, 5) - s
Out[224]:
0 -365 days +03:05:00
1 -366 days +03:05:00
2 -367 days +03:05:00
dtype: timedelta64[ns]
In [225]: datetime.timedelta(minutes=5) + s
Out[225]:
0 2012-01-01 00:05:00
1 2012-01-02 00:05:00
2 2012-01-03 00:05:00
dtype: datetime64[ns]
[Code reading]
# 2 weeks later
s + datetime.timedelta(weeks=2)
>
0 2012-01-15
1 2012-01-16
2 2012-01-17
dtype: datetime64[ns]
# 2 weeks ago
s + datetime.timedelta(weeks=-2)
>
0 2011-12-18
1 2011-12-19
2 2011-12-20
dtype: datetime64[ns]
###Addition and subtraction of elapsed time and date
In [226]: deltas = pd.Series([datetime.timedelta(days=i) for i in range(3)])
In [227]: df = pd.DataFrame({'A': s, 'B': deltas})
In [228]: df
Out[228]:
A B
0 2012-01-01 0 days
1 2012-01-02 1 days
2 2012-01-03 2 days
In [229]: df['New Dates'] = df['A'] + df['B']
In [230]: df['Delta'] = df['A'] - df['New Dates']
In [231]: df
Out[231]:
A B New Dates Delta
0 2012-01-01 0 days 2012-01-01 0 days
1 2012-01-02 1 days 2012-01-03 -1 days
2 2012-01-03 2 days 2012-01-05 -2 days
In [232]: df.dtypes
Out[232]:
A datetime64[ns]
B timedelta64[ns]
New Dates datetime64[ns]
Delta timedelta64[ns]
dtype: object
[Code reading]
# Perform weekly addition
w_deltas = pd.Series([datetime.timedelta(weeks=i) for i in range(3)])
w_deltas
0 0 days
1 7 days
2 14 days
dtype: timedelta64[ns]
>
df['W_Delta'] = df['A'] + w_deltas
>
A B New Dates Delta W_Delta
0 2012-01-01 0 days 2012-01-01 0 days 2012-01-01
1 2012-01-02 1 days 2012-01-03 -1 days 2012-01-09
2 2012-01-03 2 days 2012-01-05 -2 days 2012-01-17
#Axis name alias ###To provide axis name aliases globally, you can define two functions:
def set_axis_alias(cls, axis, alias):
if axis not in cls._AXIS_NUMBERS:
raise Exception("invalid axis [%s] for alias [%s]" % (axis, alias))
cls._AXIS_ALIASES[alias] = axis
def clear_axis_alias(cls, axis, alias):
if axis not in cls._AXIS_NUMBERS:
raise Exception("invalid axis [%s] for alias [%s]" % (axis, alias))
cls._AXIS_ALIASES.pop(alias, None)
set_axis_alias(pd.DataFrame, 'columns', 'myaxis2')
df2 = pd.DataFrame(np.random.randn(3, 2), columns=['c1', 'c2'],
index=['i1', 'i2', 'i3'])
In [241]: df2.sum(axis='myaxis2')
Out[241]:
i1 -0.461013
i2 2.040016
i3 0.904681
dtype: float64
In [242]: clear_axis_alias(pd.DataFrame, 'columns', 'myaxis2')
[Code reading]
# df2.sum(axis='myaxis2') raised an error in my environment.
df2.sum(axis=1)
These functions are supposed to let you define your own axis alias, but in practice an error occurred and no result was obtained; the output above was produced by specifying axis=1 instead.
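Note that both helpers rely on private pandas attributes that are not part of the public API, which may explain the error; a quick way to check whether your pandas version still exposes them (a sketch):
```python
# These are private attributes and may be missing in other pandas versions.
print(hasattr(pd.DataFrame, '_AXIS_NUMBERS'), hasattr(pd.DataFrame, '_AXIS_ALIASES'))
```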
#Creating sample data
*To create a data frame from every combination of some given values, build a dictionary whose keys are the column names and whose values are lists of the data values.
In [243]: def expand_grid(data_dict):
rows = itertools.product(*data_dict.values())
return pd.DataFrame.from_records(rows, columns=data_dict.keys())
In [244]: df = expand_grid({'height': [60, 70],
'weight': [100, 140, 180],
'sex': ['Male', 'Female']})
In [245]: df
Out[245]:
height weight sex
0 60 100 Male
1 60 100 Female
2 60 140 Male
3 60 140 Female
4 60 180 Male
5 60 180 Female
6 70 100 Male
7 70 100 Female
8 70 140 Male
9 70 140 Female
10 70 180 Male
11 70 180 Female
[Code reading]
# *dict.values() unpacks the dictionary's values as separate arguments
d_dict = {'height': [60, 70], 'weight': [100, 140, 180], 'sex': ['Male', 'Female']}
print(*d_dict.values())
[60, 70] [100, 140, 180] ['Male', 'Female']
>
# itertools.product returns an iterator over all combinations (the Cartesian product)
rows = itertools.product(*d_dict.values())
>
for r in rows:
print(r)
>
(60, 100, 'Male')
(60, 100, 'Female')
(60, 140, 'Male')
(60, 140, 'Female')
(60, 180, 'Male')
(60, 180, 'Female')
(70, 100, 'Male')
(70, 100, 'Female')
(70, 140, 'Male')
(70, 140, 'Female')
(70, 180, 'Male')
(70, 180, 'Female')
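One detail worth keeping in mind (my own note, not from the cookbook): itertools.product returns a one-shot iterator, so once the for-loop above has consumed it, nothing is left to build a DataFrame from.
```python
import itertools

rows = itertools.product([60, 70], ['Male', 'Female'])
for r in rows:
    pass                   # consume the iterator
print(len(list(rows)))     # 0 -- nothing left for DataFrame.from_records
```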
#Afterword
*Thank you for reading to the end. It has been about two and a half years since I started learning the Python language, and self-studying the official cookbook was a struggle.
*The first time, I started by preparing the same data as the official examples, but I gave up partway because I could not come up with sample data.
*The second time, I tried to write the code-reading parts with the folding feature while quoting the official text, but decided that it did not convey the sense of struggle.
*Please forgive the parts that are not fully explained while my understanding and description of the language are still incomplete. My honest impression after finishing is: **"The one who controls the index controls Pandas."**
*For the explanations in the code reading, I used MathJax for the first time, but whether it worked is unclear.