[PYTHON] Extract periods that match a particular pattern from pandas time series qualitative data

memo.

Extract periods that match a particular pattern from pandas time series qualitative data

import numpy as np
import pandas as pd

np.random.seed(99)
df = pd.DataFrame(index=pd.date_range('2020-01-01', periods=60, freq='D'),
                  data=np.array(list('Sunny cloudy rain snow'))[np.random.randint(0, 4, 60)],
                  columns=['weather'], dtype='string')
df
weather
2020-01-01 Cloudy
2020-01-02 Snow
2020-01-03 Cloudy
2020-01-04 fine
2020-01-05 Cloudy
2020-01-06 fine
2020-01-07 Rain
2020-01-08 fine
2020-01-09 Cloudy
2020-01-10 fine
2020-01-11 Cloudy
2020-01-12 snow
2020-01-13 rain
2020-01-14 snow
2020-01-15 snow
2020-01-16 cloudy
2020-01-17 Cloudy
2020-01-18 fine
2020-01-19 cloudy
2020-01-20 snow
2020-01-21 rain
2020-01-22 snow
2020-01-23 fine
2020-01-24 snow
2020-01-25 rain
2020-01-26 fine
2020-01-27 fine
2020-01-28 fine
2020-01-29 fine
2020-01-30 fine
2020-01-31 cloudy
2020-02-01 Snow
2020-02-02 Snow
2020-02-03 Snow
2020-02-04 Rain
2020-02-05 Snow
2020-02-06 Snow
2020-02-07 Rain
2020-02-08 Cloudy
2020-02-09 rain
2020-02-10 fine
2020-02-11 fine
2020-02-12 cloudy
2020-02-13 rain
2020-02-14 cloudy
2020-02-15 cloudy
2020-02-16 fine
2020-02-17 Cloudy
2020-02-18 snow
2020-02-19 cloudy
2020-02-20 snow
2020-02-21 snow
2020-02-22 rain
2020-02-23 rain
2020-02-24 rain
2020-02-25 snow
2020-02-26 cloudy
2020-02-27 fine
2020-02-28 fine
2020-02-29 fine

It is a record when you want to extract the period showing a specific weather pattern when you have the above data.

Extract sunny days

Judgment on a daily basis is easy.

df[df['weather'] == 'Fine']
weather
2020-01-04 fine
2020-01-06 fine
2020-01-08 fine
2020-01-10 fine
2020-01-18 fine
2020-01-23 fine
2020-01-26 fine
2020-01-27 fine
2020-01-28 fine
2020-01-29 fine
2020-01-30 fine
2020-02-10 fine
2020-02-11 fine
2020-02-16 fine
2020-02-27 fine
2020-02-28 fine
2020-02-29 fine

Extract the period of "sunny for 2 days or more"

I just want to finish it.

df[(df['weather'] == 'Fine') & (df['weather'].shift(-1, fill_value='') == 'Fine')]
weather
2020-01-26 fine
2020-01-27 fine
2020-01-28 fine
2020-01-29 fine
2020-02-10 fine
2020-02-27 fine
2020-02-28 fine

This has extracted the day of "the day offineand the next day of fine", and failed to extract the "period offine". The last day of the period, specifically January 30, February 11, and February 29, is missed in this example. The day after the day when the result of the conditional expression is True must also be True.

So the following is the correct answer.

cond = (df['weather'] == 'Fine') & (df['weather'].shift(-1, fill_value='') == 'Fine')
cond |= cond.shift(fill_value=False)
out_df = df[cond]

out_df
weather
2020-01-26 fine
2020-01-27 fine
2020-01-28 fine
2020-01-29 fine
2020-01-30 fine
2020-02-10 fine
2020-02-11 fine
2020-02-27 fine
2020-02-28 fine
2020-02-29 fine

cond |= cond.shift(fill_value=False)Iscond = cond | cond.shift(fill_value=False)is what it means.

By the way, in order to assign a group number for each period, you can see if it will be fine the next day, so use diff () to

ngroup = cond.diff()[cond].cumsum()

out_df.assign(ngroup=ngroup)
Weather ngroup
2020-01-26 fine 1
2020-01-27 fine 1
2020-01-28 fine 1
2020-01-29 fine 1
2020-01-30 fine 1
2020-02-10 fine 2
2020-02-11 fine 2
2020-02-27 fine 3
2020-02-28 fine 3
2020-02-29 fine 3

Extract the period of "sunny for 3 days or more"

It is an application of the previous section.

cond = ((df['weather'] == 'Fine') & (df['weather'].shift(-1, fill_value='') == 'Fine')
        & (df['weather'].shift(-2, fill_value='') == 'Fine'))
cond |= cond.shift(1, fill_value=False) | cond.shift(2, fill_value=False)
out_df = df[cond]

out_df
weather
2020-01-26 fine
2020-01-27 fine
2020-01-28 fine
2020-01-29 fine
2020-01-30 fine
2020-02-27 fine
2020-02-28 fine
2020-02-29 fine

Extract the period of "sunny for 5 days or more"

How to write as follows

cond = ((df['weather'] == 'Fine')
        & (df['weather'].shift(-1, fill_value='') == 'Fine')
        & (df['weather'].shift(-2, fill_value='') == 'Fine')
        & (df['weather'].shift(-3, fill_value='') == 'Fine')
        & (df['weather'].shift(-4, fill_value='') == 'Fine'))
cond |= (cond.shift(1, fill_value=False) | cond.shift(2, fill_value=False)
         | cond.shift(3, fill_value=False) | cond.shift(4, fill_value=False))
out_df = df[cond]

out_df

Is painful to go this far, so use scipy.ndimage.binary_closing () to

from scipy.ndimage import binary_closing

days = 5
cond = binary_closing((df['weather'] != 'Fine').to_numpy(dtype=bool),
                      np.ones(days, dtype=bool), border_value=True)
df[~cond]
weather
2020-01-26 fine
2020-01-27 fine
2020-01-28 fine
2020-01-29 fine
2020-01-30 fine

Extract the first two days of the "sunny for more than two days" period

If you want to get rain [rainy] sunny sunny sunny rain when there is rain fine sunny sunny sunny rain.

Applying the diff () used when assigning the group number,

cond = (df['weather'] == 'Fine') & (df['weather'].shift(-1, fill_value='') == 'Fine')
cond &= cond.diff()
cond |= cond.shift(fill_value=False)
out_df = df[cond]

out_df
weather
2020-01-26 fine
2020-01-27 fine
2020-02-10 fine
2020-02-11 fine
2020-02-27 fine
2020-02-28 fine

Extract the last two days of the "sunny for more than two days" period

If you want to get rain, sunny, sunny, sunny, rain when you have rain, fine, sunny, sunny, rain.

It is an application of the previous section.

cond = (df['weather'] == 'Fine') & (df['weather'].shift(-1, fill_value='') == 'Fine')
cond = ~cond & cond.diff()
cond |= cond.shift(-1, fill_value=False)
out_df = df[cond]

out_df
weather
2020-01-29 fine
2020-01-30 fine
2020-02-10 fine
2020-02-11 fine
2020-02-28 fine
2020-02-29 fine

Extract the period of "2 days in a row and sunny (excluding 3 days or more)"

It's easy to think of it as an application of the previous section, and add "not sunny the day before" and "not sunny the day after tomorrow" as the conditions. .. ..

cond = ((df['weather'] == 'Fine') & (df['weather'].shift(-1, fill_value='') == 'Fine')
        & (df['weather'].shift(1, fill_value='') != 'Fine')
        & (df['weather'].shift(-2, fill_value='') != 'Fine'))
cond |= cond.shift(fill_value=False)
out_df = df[cond]

out_df
weather
2020-02-10 fine
2020-02-11 fine

For the time being, you can also do this.

is_sunny = (df['weather'] == 'Fine').to_numpy(dtype=int)
# array([0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
#        1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
#        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

base = 2**np.arange(4)
#Weather base
# 2020-02-09 rain* 1
# 2020-02-10 fine* 2
# 2020-02-11 fine* 4
# 2020-02-12 cloudy* 8
# 0*1 + 1*2 + 1*4 + 0*8 = 6
target_number = 6
cond = np.convolve(is_sunny, base) == target_number
out_df = df[cond[2:-1] | cond[3:]]

Extract the period from "from a sunny day to the next rainy day"

First, extract only the sunny and rainy days, and get the sunny day and the next rainy day. Then use cumsum () to separate the groups and extract.

is_sun_rain = df['weather'].isin(set('Sunny rain'))

subdf = df[is_sun_rain]
subcond = ((subdf['weather'] == 'Fine')
           & (subdf['weather'].shift(-1, fill_value='') == 'rain'))
subcond |= subcond.shift(fill_value=False)

cond = ((is_sun_rain & subcond).cumsum() % 2).astype('boolean')
cond |= cond.shift(fill_value=False)
out_df = df[cond]

ngroup = cond.diff()[cond].cumsum()
out_df.assign(ngroup=ngroup)
Weather ngroup
2020-01-06 fine 1
2020-01-07 rain 1
2020-01-10 fine 2
2020-01-11 cloudy 2
2020-01-12 snow 2
2020-01-13 rain 2
2020-01-18 fine 3
2020-01-19 cloudy 3
2020-01-20 snow 3
2020-01-21 rain 3
2020-01-23 fine 4
2020-01-24 snow 4
2020-01-25 rain 4
2020-01-30 fine 5
2020-01-31 cloudy 5
2020-02-01 snow 5
2020-02-02 Snow 5
2020-02-03 snow 5
2020-02-04 rain 5
2020-02-11 fine 6
2020-02-12 cloudy 6
2020-02-13 rain 6
2020-02-16 fine 7
2020-02-17 cloudy 7
2020-02-18 snow 7
2020-02-19 cloudy 7
2020-02-20 snow 7
2020-02-21 snow 7
2020-02-22 rain 7

Extract the period from a sunny day to the next rainy day with a cloudy day in between

It is an application of the previous section.

is_sun_rain = df['weather'].isin(set('Sunny rain'))

subdf = df[is_sun_rain]
subcond = ((subdf['weather'] == 'Fine')
           & (subdf['weather'].shift(-1, fill_value='') == 'Cloudy')
           & (subdf['weather'].shift(-2, fill_value='') == 'rain'))
subcond |= subcond.shift(2, fill_value=False)

cond = ((is_sun_rain & subcond).cumsum() % 2).astype('boolean')
cond |= cond.shift(fill_value=False)

out_df = df[cond]

ngroup = cond.diff()[cond].cumsum()
out_df.assign(ngroup=ngroup)
Weather ngroup
2020-01-10 fine 1
2020-01-11 cloudy 1
2020-01-12 snow 1
2020-01-13 rain 1
2020-01-18 fine 2
2020-01-19 cloudy 2
2020-01-20 snow 2
2020-01-21 rain 2
2020-01-30 fine 3
2020-01-31 cloudy 3
2020-02-01 snow 3
2020-02-02 Snow 3
2020-02-03 snow 3
2020-02-04 rain 3
2020-02-11 fine 4
2020-02-12 cloudy 4
2020-02-13 rain 4

Or

arr, _ = df['weather'].factorize()
# 0: 'Cloudy', 1: 'snow', 2: 'Fine', 3: 'rain'
subarr, = (arr != 1).nonzero()
subcond = np.convolve(arr[subarr], [1, 10, 100]) == 203

flags = np.zeros(arr.size, dtype=int)
flags[subarr[subcond[2:]]] = 1
flags[subarr[subcond[:-2]]+1] = -1
cond = flags.cumsum().astype(bool)
out_df = df[cond]

ngroup = flags[cond].cumsum()
out_df.assign(ngroup=ngroup)

Recommended Posts

Extract periods that match a particular pattern from pandas time series qualitative data
Features that can be extracted from time series data
Library tsfresh that automatically extracts features from time series data
A Python program that aggregates time usage from icalendar data
[numpy] Create a moving window matrix from multidimensional time series data
<Pandas> How to handle time series data in a pivot table
Extract lines that match the conditions from a text file with python
Find the index of items that match the conditions in the pandas data frame / series
Get time series data from k-db.com in Python
Extract data from a web page with Python
Predict from various data in Python using Facebook Prophet, a time series prediction tool
[Python] Extracts data frames that do not match a specific column with other data frames of Pandas
Extract data from S3
A memo that reads data from dashDB with Python & Spark
Graph time series data in Python using pandas and matplotlib
A story about clustering time series data of foreign exchange