Description
What happened:
When interpolating a dataset with >2000 dask variables, a lot of time is spent in da.unify_chunks, because da.unify_chunks forces all variables and coordinates to dask arrays.
xarray, on the other hand, forces coordinates to pd.Index, even if the coordinates were dask arrays when the dataset was first created.
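To illustrate the conversion described above, here is a minimal sketch. It assumes the function in question is dask's unify_chunks (reachable as da.core.unify_chunks); the arrays x and coord and the index label "i" are made up for illustration. The point is only that a NumPy input comes back as a dask array, rechunked to match the other inputs:

```python
import numpy as np
import dask.array as da

# A chunked dask "variable" and an eager NumPy "coordinate" (illustrative names).
x = da.ones(4, chunks=2)
coord = np.arange(4)

# unify_chunks takes (array, index) pairs; both arrays share the dimension "i".
chunks, arrays = da.core.unify_chunks(x, "i", coord, "i")

# The NumPy coordinate has been wrapped in a dask array along the way.
coord_became_dask = isinstance(arrays[1], da.Array)
print(coord_became_dask, arrays[1].chunks)
```

With thousands of variables sharing the same eager coordinates, this wrapping happens over and over, which is where the time seems to go.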
What you expected to happen:
If the coords of the dataset were initialized as dask arrays, they should stay lazy.
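A small check of the current behaviour, as a sketch (variable names are only illustrative): the data stays a dask array, while the dimension coordinate that was passed in as a dask array ends up backed by a pd.Index:

```python
import dask.array as da
import pandas as pd
import xarray as xr

# Both the data and the "time" coordinate start out as dask arrays.
time = da.array([0, 1])
arr = xr.DataArray(data=da.array([3, 4]), dims=["time"], coords={"time": time})

# The data keeps its dask backing, the dimension coordinate does not.
data_is_lazy = isinstance(arr.data, da.Array)
coord_is_lazy = isinstance(arr["time"].data, da.Array)
index_is_pandas = isinstance(arr.indexes["time"], pd.Index)
print(data_is_lazy, coord_is_lazy, index_is_pandas)
```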
Minimal Complete Verifiable Example:
import xarray as xr
import numpy as np
import dask.array as da

# Build 2000 distinct variable names.
a = np.arange(0, 2000)
b = np.core.defchararray.add("long_variable_name", a.astype(str))

# The time coordinate starts out as a lazy dask array.
coords = dict(time=da.array([0, 1]))
data_vars = dict()
for v in b:
    data_vars[v] = xr.DataArray(
        name=v,
        data=da.array([3, 4]),
        dims=["time"],
        coords=coords,
    )
ds0 = xr.Dataset(data_vars)

# Interpolate every variable onto a new dask-backed time axis.
ds0 = ds0.interp(
    time=da.array([0, 0.5, 1]),
    assume_sorted=True,
    kwargs=dict(fill_value=None),
)
Anything else we need to know?:
Some thoughts:
- Why can't coordinates be lazy?
- Can we use dask.dataframe.Index instead of pd.Index when creating IndexVariables?
- There's no time saved converting to dask arrays in missing.interp_func. But some time could be saved if we could convert them to dask arrays in xr.Dataset.interp before the variable loop starts.
- Can we still store the dask array in IndexVariable and use a to_dask_array()-method to quickly get it?
- Initializing the dataarrays will still be slow though since it still has to force the dask array to pd.Index.
Environment:
Output of xr.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
libhdf5: 1.10.4
libnetcdf: None
xarray: 0.16.2
pandas: 1.1.5
numpy: 1.17.5
scipy: 1.4.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2020.12.0
distributed: 2020.12.0
matplotlib: 3.3.2
cartopy: None
seaborn: 0.11.1
numbagg: None
pint: None
setuptools: 51.0.0.post20201207
pip: 20.3.3
conda: 4.9.2
pytest: 6.2.1
IPython: 7.19.0
sphinx: 3.4.0