[PYTHON] A note when looking for an alternative to pandas rolling for moving windows

In technical analysis of time series data, it is common to compute the mean, maximum, or minimum over a moving window. With pandas this is easy to write: specify the moving window with rolling and call the mean, max, or min method. This article is a note from when I was looking for something faster than pandas.

Use pandas rolling

First, create a time series of random numbers as a numpy array and as a pandas Series, as shown below.

import numpy as np
import pandas as pd
a = np.random.randint(100, size=100000)
s = pd.Series(a)

The mean in the moving window (so-called simple moving average) can be written as follows using the mean method for rolling.

period = 10  # window length in samples
%timeit smean = s.rolling(period).mean()

The execution time was

100 loops, best of 3: 5.47 ms per loop

Next are the maximum and minimum values in the moving window.

%timeit smax = s.rolling(period).max()
%timeit smin = s.rolling(period).min()

100 loops, best of 3: 5.51 ms per loop
100 loops, best of 3: 5.53 ms per loop

The execution time is almost the same as the moving average.
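For reference when comparing against other libraries later, here is a minimal sketch of what rolling actually returns. The short example series is made up for illustration. With the default settings the window is trailing (the current sample and the ones before it), and the first period - 1 results are NaN because the window is not yet full.

```python
import numpy as np
import pandas as pd

# Small made-up series, only to show the window alignment
s = pd.Series([1, 2, 3, 4, 5])
period = 3

smean = s.rolling(period).mean()
smax = s.rolling(period).max()

print(smean.tolist())  # [nan, nan, 2.0, 3.0, 4.0]
print(smax.tolist())   # [nan, nan, 3.0, 4.0, 5.0]
```

Any replacement for rolling therefore only needs to agree with pandas from index period - 1 onward; how the first few samples are treated is a separate choice.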

Use scipy's filter function

Since the moving average is a so-called FIR filter, you can use scipy's lfilter function.

from scipy.signal import lfilter
%timeit amean = lfilter(np.ones(period)/period, 1, a)

This computes an FIR filter with all weights set to 1 / period. The execution time was

1000 loops, best of 3: 980 µs per loop

That is more than 5 times faster than pandas, as you would expect from scipy.
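The timing alone does not show that the two results agree, so here is a small sketch that checks it. The array size and the seed are arbitrary choices of mine, not part of the measurement above. lfilter starts from a zero initial state, so only the samples from index period - 1 onward are expected to match pandas.

```python
import numpy as np
import pandas as pd
from scipy.signal import lfilter

rng = np.random.default_rng(0)          # arbitrary seed for reproducibility
a = rng.integers(0, 100, size=1000)     # smaller array than in the benchmark
period = 10

smean = pd.Series(a).rolling(period).mean().to_numpy()
amean = lfilter(np.ones(period) / period, 1, a)

# Full windows: both methods should give the same moving average
valid_match = np.allclose(smean[period - 1:], amean[period - 1:])

# Partial windows: pandas returns NaN, lfilter returns the partial sum / period
head_is_nan = np.isnan(smean[:period - 1]).all()
head_is_partial_sum = np.allclose(amean[:period - 1],
                                  np.cumsum(a[:period - 1]) / period)

print(valid_match, head_is_nan, head_is_partial_sum)
```

If the leading values matter, they could be overwritten with NaN after filtering to mimic the pandas output.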

Next I wanted the maximum and minimum values. I could not find a function made exactly for that, and the one that looked usable was order_filter. For each sample, this function returns the value of the specified rank within the specified window. Pass the window mask array as the argument domain and the rank as the argument rank. Because the window is centered on each time-series sample, make the mask 2 * period - 1 elements long and put 1 only in the first period elements, so that the window covers the current sample and the period - 1 samples before it. Use rank = 0 for the minimum value and rank = period - 1 for the maximum value.
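To make the role of the mask concrete, here is a small sketch on a short made-up array that compares order_filter against pandas. The sample values are mine and only serve as an illustration. As far as I understand, order_filter pads with zeros outside the array, so the first period - 1 outputs are not comparable with the NaN values from pandas and are left out of the comparison.

```python
import numpy as np
import pandas as pd
from scipy.signal import order_filter

a = np.array([3, 1, 4, 1, 5, 9, 2, 6])  # made-up sample data
period = 3

# Mask of length 2 * period - 1: ones cover the current sample
# and the period - 1 samples before it, zeros cover the future side.
domain = np.concatenate((np.ones(period), np.zeros(period - 1)))

amax = order_filter(a, domain, period - 1)  # highest rank -> maximum
amin = order_filter(a, domain, 0)           # lowest rank  -> minimum

s = pd.Series(a)
smax = s.rolling(period).max().to_numpy()
smin = s.rolling(period).min().to_numpy()

# Compare only where the pandas window is full
max_matches = np.array_equal(amax[period - 1:], smax[period - 1:])
min_matches = np.array_equal(amin[period - 1:], smin[period - 1:])
print(max_matches, min_matches)
```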

from scipy.signal import order_filter
domain = np.concatenate((np.ones(period), np.zeros(period-1)))
%timeit amax = order_filter(a, domain, period-1)
%timeit amin = order_filter(a, domain, 0)

The execution result is as follows.

10 loops, best of 3: 102 ms per loop
10 loops, best of 3: 102 ms per loop

This time it is almost 20 times slower than pandas. Even a scipy function does not help here. The likely reason is that each window is sorted every time so that a value of arbitrary rank can be returned. If you only need the maximum and minimum values, you should use a function dedicated to that.
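One candidate for such a dedicated function is the pair maximum_filter1d and minimum_filter1d in scipy.ndimage. I have not timed them for this note, so the following is only a sketch that checks they give the same values as pandas. The origin value is my own choice to shift the centered window so that it ends at the current sample, and the default boundary mode is reflect, so the first period - 1 outputs differ from pandas and are excluded from the comparison.

```python
import numpy as np
import pandas as pd
from scipy.ndimage import maximum_filter1d, minimum_filter1d

rng = np.random.default_rng(0)          # arbitrary seed for reproducibility
a = rng.integers(0, 100, size=1000)
period = 10

# Shift the window left so it covers a[i - period + 1] .. a[i]
origin = (period - 1) // 2

amax = maximum_filter1d(a, size=period, origin=origin)
amin = minimum_filter1d(a, size=period, origin=origin)

s = pd.Series(a)
smax = s.rolling(period).max().to_numpy()
smin = s.rolling(period).min().to_numpy()

# Compare only where the pandas window is full
max_matches = np.array_equal(amax[period - 1:], smax[period - 1:])
min_matches = np.array_equal(amin[period - 1:], smin[period - 1:])
print(max_matches, min_matches)
```

Whether this is actually faster than pandas would need to be measured with %timeit in the same way as above.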
