ds.mean bugs with cftime objects #5897

Open
@aulemahal

What happened:
Given a dataset that has a variable with cftime objects along dimension A, averaging (mean) leads to buggy behaviour:

  1. Averaging over 'A' drops the variable instead of averaging it.
  2. Averaging over any other dimension will fail if that variable is on the dask backend.

What you expected to happen:

  1. I expected the average to fail in the case of a dask-backed cftime variable, given that this code exists:

    elif _contains_cftime_datetimes(array):
        if is_duck_dask_array(array):
            raise NotImplementedError(
                "Computing the mean of an array containing "
                "cftime.datetime objects is not yet implemented on "
                "dask arrays."
            )
        offset = min(array)
        timedeltas = datetime_to_numeric(array, offset, datetime_unit="us")
        mean_timedeltas = _mean(timedeltas, axis=axis, skipna=skipna, **kwargs)
        return _to_pytimedelta(mean_timedeltas, unit="us") + offset

     And I expected the average to work (not drop the var) in the case of the numpy backend.

  2. I expected the fact that dask is used to be irrelevant to the result. I expected the mean to conserve the cftime variable as-is since it doesn't include the averaged dimension.
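
For readers unfamiliar with the quoted cftime branch, here is a minimal sketch of the same idea: subtract the earliest value, average the resulting offsets in microseconds, and add the offset back. It uses the standard library's `datetime` as a stand-in for cftime objects, and `mean_datetime` is a hypothetical helper written for illustration, not an xarray function.

```python
from datetime import datetime, timedelta


def mean_datetime(values):
    """Average datetimes via their offsets from the earliest value.

    Hypothetical helper; stdlib datetimes stand in for cftime objects.
    """
    offset = min(values)
    # Offsets from the earliest value, expressed as integer microseconds
    micros = [(v - offset) // timedelta(microseconds=1) for v in values]
    mean_micros = sum(micros) / len(micros)
    # Convert the averaged offset back to a datetime
    return offset + timedelta(microseconds=mean_micros)


# Ten daily values starting 2021-10-31, mirroring var1 in the example of this issue
dates = [datetime(2021, 10, 31) + timedelta(days=i) for i in range(10)]
result = mean_datetime(dates)
print(result)  # 2021-11-04 12:00:00
```

Working with offsets from the minimum keeps the intermediate numbers small and non-negative, which is presumably why the quoted code takes `min(array)` first.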

Minimal Complete Verifiable Example:

import xarray as xr

ds = xr.Dataset({
    'var1': (('time',), xr.cftime_range('2021-10-31', periods=10, freq='D')),
    'var2': (('x',), list(range(10)))
})
# var1 contains cftime objects
# var2 contains integers
# They do not share dims

ds.mean('time')  # var1 has disappeared instead of being averaged

ds.mean('x') # Everything ok

dsc = ds.chunk({})

dsc.mean('time')  # var1 has disappeared. I would expect this line to fail.

dsc.mean('x')  # Raises NotImplementedError. I would expect this line to run flawlessly.

Anything else we need to know?:
A culprit is #5393, but maybe the bug is older? I think the change introduced there causes issue (2) above.

In duck_array_ops.py the mean operation is declared numeric_only, which is inconsistent with an implementation that supports means of datetime objects. This setting causes issue (1) above.

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: fdabf3b
python: 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46)
[GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.14.12-arch1-1
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: fr_CA.utf8
LOCALE: ('fr_CA', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 0.19.1.dev89+gfdabf3be
pandas: 1.3.4
numpy: 1.21.3
scipy: 1.7.1
netCDF4: 1.5.7
pydap: installed
h5netcdf: 0.11.0
h5py: 3.4.0
Nio: None
zarr: 2.10.1
cftime: 1.5.1
nc_time_axis: 1.4.0
PseudoNetCDF: installed
rasterio: 1.2.10
cfgrib: 0.9.9.1
iris: 3.1.0
bottleneck: 1.3.2
dask: 2021.10.0
distributed: 2021.10.0
matplotlib: 3.4.3
cartopy: 0.20.1
seaborn: 0.11.2
numbagg: 0.2.1
fsspec: 2021.10.1
cupy: None
pint: 0.17
sparse: 0.13.0
setuptools: 58.2.0
pip: 21.3.1
conda: None
pytest: 6.2.5
IPython: 7.28.0
sphinx: None
