Introduction

When processing time series data, it is convenient to set index to datetime in pandas, but when extracting a week from index, the last week such as 2019/12/30 and 12/31 will be the first week of 2019 This is a memorandum that I investigated how to fix this.

problem

The last week of December and the first week of January of the following year will be the same week. There seem to be various ways to number the week, but pandas seems to be ISO compliant European style, and it is recognized as the first week from the weekday of the week.

Reference: List of Week Calendar and Week Numbers for 2019

However, since 2019/12/30, 31 will be df.index.yaer-> 2019, df.index.week-> 1, it will be recognized as the first week of 2019, so data aggregation on a weekly basis It is inconvenient for processing such as.

The following datetime format index is shown as an example.

df.index

DatetimeIndex(['2019-12-29 22:00:00+00:00', '2019-12-29 22:00:00+00:00',
               '2019-12-29 22:00:00+00:00', '2019-12-29 22:00:00+00:00',
               '2019-12-29 22:00:00+00:00', '2019-12-29 22:08:00+00:00',
               '2019-12-29 23:00:00+00:00', '2019-12-30 01:47:00+00:00',
               '2019-12-30 02:48:00+00:00', '2019-12-30 02:48:00+00:00',
               '2019-12-30 12:34:00+00:00', '2019-12-30 14:51:00+00:00',
               '2019-12-30 14:53:00+00:00', '2019-12-30 14:56:00+00:00',
               '2019-12-31 04:50:00+00:00', '2019-12-31 13:41:00+00:00',
               '2019-12-31 14:42:00+00:00', '2019-12-31 14:45:00+00:00',
               '2019-12-31 15:56:00+00:00', '2019-12-31 15:56:00+00:00',
               '2019-12-31 15:58:00+00:00', '2019-12-31 15:58:00+00:00'],
              dtype='datetime64[ns, UTC]', name='date', freq=None)

If you set MultiIndex for this data frame as shown below, you can see that 12/29 is 52 weeks, while 12/30 and 31 are the first week of 2019.

df_w = df.set_index([df.index.year, df.index.month,
                     df.index.week, df.index])
df_w.index.names = ['year', 'month', 'week', 'date']
df_w.sort_index(inplace=True)
df_w.index

MultiIndex([(2019, 12,  1, '2019-12-30 01:47:00+00:00'),
            (2019, 12,  1, '2019-12-30 02:48:00+00:00'),
            (2019, 12,  1, '2019-12-30 02:48:00+00:00'),
            (2019, 12,  1, '2019-12-30 12:34:00+00:00'),
            (2019, 12,  1, '2019-12-30 14:51:00+00:00'),
            (2019, 12,  1, '2019-12-30 14:53:00+00:00'),
            (2019, 12,  1, '2019-12-30 14:56:00+00:00'),
            (2019, 12,  1, '2019-12-31 04:50:00+00:00'),
            (2019, 12,  1, '2019-12-31 13:41:00+00:00'),
            (2019, 12,  1, '2019-12-31 14:42:00+00:00'),
            (2019, 12,  1, '2019-12-31 14:45:00+00:00'),
            (2019, 12,  1, '2019-12-31 15:56:00+00:00'),
            (2019, 12,  1, '2019-12-31 15:56:00+00:00'),
            (2019, 12,  1, '2019-12-31 15:58:00+00:00'),
            (2019, 12,  1, '2019-12-31 15:58:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:08:00+00:00'),
            (2019, 12, 52, '2019-12-29 23:00:00+00:00')],
           names=['year', 'month', 'week', 'date'])

Remedy

The reference site has the solution I'm looking for and I applied it to my case.

Release index once
Extract the week with dt.week
Forced to change to 52 weeks
Reset index

df.reset_index(inplace=True)
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["week"] = df["date"].dt.week
df["week"] = df["date"].apply(
    lambda x: 52 if x.year == 2019 and x.month==12 and x.day in [30, 31] else x.week)
df_w = df.set_index([df["year"], df["month"],
                     df["week"], df["date"]])
df_w.index.names = ['year', 'month', 'week', 'date']
df_w.sort_index(inplace=True)
df_w.index

MultiIndex([(2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:08:00+00:00'),
            (2019, 12, 52, '2019-12-29 23:00:00+00:00'),
            (2019, 12, 52, '2019-12-30 01:47:00+00:00'),
            (2019, 12, 52, '2019-12-30 02:48:00+00:00'),
            (2019, 12, 52, '2019-12-30 02:48:00+00:00'),
            (2019, 12, 52, '2019-12-30 12:34:00+00:00'),
            (2019, 12, 52, '2019-12-30 14:51:00+00:00'),
            (2019, 12, 52, '2019-12-30 14:53:00+00:00'),
            (2019, 12, 52, '2019-12-30 14:56:00+00:00'),
            (2019, 12, 52, '2019-12-31 04:50:00+00:00'),
            (2019, 12, 52, '2019-12-31 13:41:00+00:00'),
            (2019, 12, 52, '2019-12-31 14:42:00+00:00'),
            (2019, 12, 52, '2019-12-31 14:45:00+00:00'),
            (2019, 12, 52, '2019-12-31 15:56:00+00:00'),
            (2019, 12, 52, '2019-12-31 15:56:00+00:00'),
            (2019, 12, 52, '2019-12-31 15:58:00+00:00'),
            (2019, 12, 52, '2019-12-31 15:58:00+00:00')],
           names=['year', 'month', 'week', 'date'])

This makes it possible to aggregate weekly for each year.

Reference site

The solution to the same problem was very helpful. Thanks! Pandas - wrong week extracted week from date

[PYTHON] Correct the week set by index in datetime format

Introduction

problem

Remedy

Reference site