Skip to content

DataArray.rolling().mean() is way slower than it should be #1993

Closed
@shoyer

Description

@shoyer

Code Sample, a copy-pastable example if possible

From @RayPalmerTech in pydata/bottleneck#186:

import numpy as np
import pandas as pd
import time
import bottleneck as bn
import xarray
import matplotlib.pyplot as plt

N = 30000200 # Number of datapoints
Fs = 30000 # sample rate
T=1/Fs # sample period
duration = N/Fs # duration in s
t = np.arange(0,duration,T) # time vector
DATA = np.random.randn(N,)+5*np.sin(2*np.pi*0.01*t) # Example noisy sine data and window size
w = 330000

def using_bottleneck_mean(data,width):
  return bn.move_mean(a=data,window=width,min_count = 1)

def using_pandas_rolling_mean(data,width):
  return np.asarray(pd.DataFrame(data).rolling(window=width,center=True,min_periods=1).mean()).ravel()

def using_xarray_mean(data,width):
  return xarray.DataArray(data,dims='x').rolling(x=width,min_periods=1, center=True).mean()

start=time.time()
A = using_bottleneck_mean(DATA,w)
print('Bottleneck: ', time.time()-start, 's')
start=time.time()
B = using_pandas_rolling_mean(DATA,w)
print('Pandas: ',time.time()-start,'s')
start=time.time()
C = using_xarray_mean(DATA,w)
print('Xarray: ',time.time()-start,'s')

This results in:

Bottleneck:  0.0867006778717041 s
Pandas:  0.563546895980835 s
Xarray:  25.133142709732056 s

Somehow xarray is way slower than pandas and bottleneck, even though it's using bottleneck under the hood!

Problem description

Profiling shows that the majority of time is spent in xarray.core.rolling.DataArrayRolling._setup_windows. Monkey-patching that method with a dummy rectifies the issue:

xarray.core.rolling.DataArrayRolling._setup_windows = lambda *args: None

Now we obtain:

Bottleneck:  0.06775331497192383 s
Pandas:  0.48262882232666016 s
Xarray:  0.1723031997680664 s

The solution is to make setting up windows done lazily (in __iter__), instead of doing it in the constructor.

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.3.final.0 python-bits: 64 OS: Linux OS-release: 4.4.96+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

xarray: 0.10.2
pandas: 0.22.0
numpy: 1.14.2
scipy: 0.19.1
netCDF4: None
h5netcdf: None
h5py: 2.7.1
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.1.2
cartopy: None
seaborn: 0.7.1
setuptools: 36.2.7
pip: 9.0.1
conda: None
pytest: None
IPython: 5.5.0
sphinx: None

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions