When processing time series data, it is convenient to set index to datetime in pandas, but when extracting a week from index, the last week such as 2019/12/30 and 12/31 will be the first week of 2019 This is a memorandum that I investigated how to fix this.
The last week of December and the first week of January of the following year will be the same week. There seem to be various ways to number the week, but pandas seems to be ISO compliant European style, and it is recognized as the first week from the weekday of the week.
However, since 2019/12/30, 31 will be df.index.yaer-> 2019, df.index.week-> 1, it will be recognized as the first week of 2019, so data aggregation on a weekly basis It is inconvenient for processing such as.
The following datetime format index is shown as an example.
df.index
DatetimeIndex(['2019-12-29 22:00:00+00:00', '2019-12-29 22:00:00+00:00',
'2019-12-29 22:00:00+00:00', '2019-12-29 22:00:00+00:00',
'2019-12-29 22:00:00+00:00', '2019-12-29 22:08:00+00:00',
'2019-12-29 23:00:00+00:00', '2019-12-30 01:47:00+00:00',
'2019-12-30 02:48:00+00:00', '2019-12-30 02:48:00+00:00',
'2019-12-30 12:34:00+00:00', '2019-12-30 14:51:00+00:00',
'2019-12-30 14:53:00+00:00', '2019-12-30 14:56:00+00:00',
'2019-12-31 04:50:00+00:00', '2019-12-31 13:41:00+00:00',
'2019-12-31 14:42:00+00:00', '2019-12-31 14:45:00+00:00',
'2019-12-31 15:56:00+00:00', '2019-12-31 15:56:00+00:00',
'2019-12-31 15:58:00+00:00', '2019-12-31 15:58:00+00:00'],
dtype='datetime64[ns, UTC]', name='date', freq=None)
If you set MultiIndex for this data frame as shown below, you can see that 12/29 is 52 weeks, while 12/30 and 31 are the first week of 2019.
df_w = df.set_index([df.index.year, df.index.month,
df.index.week, df.index])
df_w.index.names = ['year', 'month', 'week', 'date']
df_w.sort_index(inplace=True)
df_w.index
MultiIndex([(2019, 12, 1, '2019-12-30 01:47:00+00:00'),
(2019, 12, 1, '2019-12-30 02:48:00+00:00'),
(2019, 12, 1, '2019-12-30 02:48:00+00:00'),
(2019, 12, 1, '2019-12-30 12:34:00+00:00'),
(2019, 12, 1, '2019-12-30 14:51:00+00:00'),
(2019, 12, 1, '2019-12-30 14:53:00+00:00'),
(2019, 12, 1, '2019-12-30 14:56:00+00:00'),
(2019, 12, 1, '2019-12-31 04:50:00+00:00'),
(2019, 12, 1, '2019-12-31 13:41:00+00:00'),
(2019, 12, 1, '2019-12-31 14:42:00+00:00'),
(2019, 12, 1, '2019-12-31 14:45:00+00:00'),
(2019, 12, 1, '2019-12-31 15:56:00+00:00'),
(2019, 12, 1, '2019-12-31 15:56:00+00:00'),
(2019, 12, 1, '2019-12-31 15:58:00+00:00'),
(2019, 12, 1, '2019-12-31 15:58:00+00:00'),
(2019, 12, 52, '2019-12-29 22:00:00+00:00'),
(2019, 12, 52, '2019-12-29 22:00:00+00:00'),
(2019, 12, 52, '2019-12-29 22:00:00+00:00'),
(2019, 12, 52, '2019-12-29 22:00:00+00:00'),
(2019, 12, 52, '2019-12-29 22:00:00+00:00'),
(2019, 12, 52, '2019-12-29 22:08:00+00:00'),
(2019, 12, 52, '2019-12-29 23:00:00+00:00')],
names=['year', 'month', 'week', 'date'])
The reference site has the solution I'm looking for and I applied it to my case.
df.reset_index(inplace=True)
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["week"] = df["date"].dt.week
df["week"] = df["date"].apply(
lambda x: 52 if x.year == 2019 and x.month==12 and x.day in [30, 31] else x.week)
df_w = df.set_index([df["year"], df["month"],
df["week"], df["date"]])
df_w.index.names = ['year', 'month', 'week', 'date']
df_w.sort_index(inplace=True)
df_w.index
MultiIndex([(2019, 12, 52, '2019-12-29 22:00:00+00:00'),
(2019, 12, 52, '2019-12-29 22:00:00+00:00'),
(2019, 12, 52, '2019-12-29 22:00:00+00:00'),
(2019, 12, 52, '2019-12-29 22:00:00+00:00'),
(2019, 12, 52, '2019-12-29 22:00:00+00:00'),
(2019, 12, 52, '2019-12-29 22:08:00+00:00'),
(2019, 12, 52, '2019-12-29 23:00:00+00:00'),
(2019, 12, 52, '2019-12-30 01:47:00+00:00'),
(2019, 12, 52, '2019-12-30 02:48:00+00:00'),
(2019, 12, 52, '2019-12-30 02:48:00+00:00'),
(2019, 12, 52, '2019-12-30 12:34:00+00:00'),
(2019, 12, 52, '2019-12-30 14:51:00+00:00'),
(2019, 12, 52, '2019-12-30 14:53:00+00:00'),
(2019, 12, 52, '2019-12-30 14:56:00+00:00'),
(2019, 12, 52, '2019-12-31 04:50:00+00:00'),
(2019, 12, 52, '2019-12-31 13:41:00+00:00'),
(2019, 12, 52, '2019-12-31 14:42:00+00:00'),
(2019, 12, 52, '2019-12-31 14:45:00+00:00'),
(2019, 12, 52, '2019-12-31 15:56:00+00:00'),
(2019, 12, 52, '2019-12-31 15:56:00+00:00'),
(2019, 12, 52, '2019-12-31 15:58:00+00:00'),
(2019, 12, 52, '2019-12-31 15:58:00+00:00')],
names=['year', 'month', 'week', 'date'])
This makes it possible to aggregate weekly for each year.
The solution to the same problem was very helpful. Thanks! Pandas - wrong week extracted week from date
Recommended Posts