Description
What happened?
Running an in-memory groupby operation took much longer than expected. Turning off flox fixed this - but I don't think that's the idea ;-)
What did you expect to happen?
I expected flox to be at least on par with our naive implementation.
Minimal Complete Verifiable Example
import numpy as np
import xarray as xr
arr = np.random.randn(10, 10, 365*30)
time = xr.date_range("2000", periods=30*365, calendar="noleap")
da = xr.DataArray(arr, dims=("y", "x", "time"), coords={"time": time})
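# using max
print("max:")
xr.set_options(use_flox=True)
# flox enabled globally, engine left at its default
%timeit da.groupby("time.year").max("time")
# flox enabled globally, engine passed explicitly
%timeit da.groupby("time.year").max("time", engine="flox")
xr.set_options(use_flox=False)
# flox disabled: xarray's own groupby implementation
%timeit da.groupby("time.year").max("time")
# as reference: select and reduce each year by hand
%timeit [da.sel(time=str(year)).max("time") for year in range(2000, 2030)]
# using mean
print("mean:")
xr.set_options(use_flox=True)
# flox enabled globally, engine left at its default
%timeit da.groupby("time.year").mean("time")
# flox enabled globally, engine passed explicitly
%timeit da.groupby("time.year").mean("time", engine="flox")
xr.set_options(use_flox=False)
# flox disabled: xarray's own groupby implementation
%timeit da.groupby("time.year").mean("time")
# as reference: select and reduce each year by hand
%timeit [da.sel(time=str(year)).mean("time") for year in range(2000, 2030)]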
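The example above relies on the IPython `%timeit` magic. For anyone who wants to rerun it as a plain script, here is a minimal sketch of a stand-in built on the standard-library `timeit` module. The helper name `time_call` and the dummy `workload` are hypothetical and only there to keep the snippet runnable without xarray; in practice you would pass something like `lambda: da.groupby("time.year").max("time")` instead. The numbers it prints will not match `%timeit` exactly, since `%timeit` picks its loop count automatically.

```python
import timeit


def time_call(func, repeat=7, number=10):
    """Return (mean, best) seconds per call over `repeat` runs of `number` calls."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    return sum(per_call) / len(per_call), min(per_call)


# Hypothetical stand-in workload so the sketch runs without xarray installed.
# For the example above, pass e.g.
#   lambda: da.groupby("time.year").max("time")
def workload():
    return sum(range(1000))


mean_s, best_s = time_call(workload)
print(f"{mean_s * 1e3:.3f} ms per call (mean), {best_s * 1e3:.3f} ms (best)")
```

Toggling `xr.set_options(use_flox=...)` between calls to `time_call` should then give the same comparison as the IPython version.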
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.
Relevant log output
max:
158 ms ± 4.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
28.1 ms ± 318 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
11.5 ms ± 52.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
mean:
95.6 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
34.8 ms ± 2.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
15.2 ms ± 232 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
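To put the gap in numbers, the short sketch below computes the ratio between the slowest and fastest timing listed for each reduction. The log shows three timings per reduction and does not label which call produced which, so the snippet only compares the extremes and makes no claim about the mapping to individual `%timeit` lines.

```python
# Timings copied from the log above, in milliseconds, in the order listed.
timings_ms = {
    "max": [158, 28.1, 11.5],
    "mean": [95.6, 34.8, 15.2],
}

ratios = {}
for name, values in timings_ms.items():
    # Ratio of the slowest to the fastest listed timing.
    ratios[name] = max(values) / min(values)
    print(f"{name}: slowest/fastest = {ratios[name]:.1f}x")
# prints:
# max: slowest/fastest = 13.7x
# mean: slowest/fastest = 6.3x
```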
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: f8127fc
python: 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-69-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.1
xarray: main
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.10.1
netCDF4: 1.6.3
pydap: installed
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.14.2
cftime: 1.6.2
nc_time_axis: 1.4.1
PseudoNetCDF: 3.2.2
iris: 3.4.1
bottleneck: 1.3.7
dask: 2023.3.2
distributed: 2023.3.2.1
matplotlib: 3.7.1
cartopy: 0.21.1
seaborn: 0.12.2
numbagg: 0.2.2
fsspec: 2023.3.0
cupy: None
pint: 0.20.1
sparse: 0.14.0
flox: 0.6.10
numpy_groupies: 0.9.20
setuptools: 67.6.1
pip: 23.0.1
conda: None
pytest: 7.2.2
mypy: None
IPython: 8.12.0
sphinx: None