Fancy indexing a Dataset with dask DataArray triggers multiple computes

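It appears that a dask-backed boolean array (and presumably any indexing array) is evaluated many more times than necessary when it is used to index a Dataset with multiple variables. In the example below the indexer is computed once per data variable, plus one extra time. Is this intentional? Here is an example that demonstrates this: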

```python

import dask.array as da
import numpy as np
import xarray as xr

# Use a custom array type to know when data is being evaluated
class Array:
    def __init__(self, x):
        self.shape = (x.shape[0],)
        self.ndim = x.ndim
        self.dtype = 'bool'
        self.x = x
        
    def __getitem__(self, idx):
        if idx[0].stop > 0:
            print('Evaluating')
        return (self.x > .5).__getitem__(idx)

# Control case -- this shows that the print statement is only reached once
da.from_array(Array(np.random.rand(100))).compute();
# Evaluating

# This usage somehow results in two evaluations of this one array?
ds = xr.Dataset(dict(
    a=('x', da.from_array(Array(np.random.rand(100))))
))
ds.sel(x=ds.a)
# Evaluating
# Evaluating
# <xarray.Dataset>
# Dimensions:  (x: 51)
# Dimensions without coordinates: x
# Data variables:
#     a        (x) bool dask.array<chunksize=(51,), meta=np.ndarray>

# The array is evaluated an extra time for each new variable
ds = xr.Dataset(dict(
    a=('x', da.from_array(Array(np.random.rand(100)))),
    b=(('x', 'y'), da.random.random((100, 10))),
    c=(('x', 'y'), da.random.random((100, 10))),
    d=(('x', 'y'), da.random.random((100, 10))),
))
ds.sel(x=ds.a)
# Evaluating
# Evaluating
# Evaluating
# Evaluating
# Evaluating
# <xarray.Dataset>
# Dimensions:  (x: 48, y: 10)
# Dimensions without coordinates: x, y
# Data variables:
#     a        (x) bool dask.array<chunksize=(48,), meta=np.ndarray>
#     b        (x, y) float64 dask.array<chunksize=(48, 10), meta=np.ndarray>
#     c        (x, y) float64 dask.array<chunksize=(48, 10), meta=np.ndarray>
#     d        (x, y) float64 dask.array<chunksize=(48, 10), meta=np.ndarray>
```

Given that indexing with a dask array is already not lazy, why does the same predicate array need to be computed more than once?
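In the meantime, one possible workaround is to materialize the predicate once yourself and index with the in-memory result. This is only a sketch under some assumptions: the mask fits in memory, positional indexing with `isel` is acceptable, and the names `mask` and `subset` are made up for illustration. It does not address the underlying behavior in xarray.

```python
import dask.array as da
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
ds = xr.Dataset(
    dict(
        a=("x", da.from_array(rng.random(100) > 0.5, chunks=50)),
        b=(("x", "y"), da.random.random((100, 10), chunks=(50, 10))),
        c=(("x", "y"), da.random.random((100, 10), chunks=(50, 10))),
    )
)

# Compute the boolean predicate a single time; `.values` returns a NumPy array
mask = ds.a.values  # hypothetical name, assumed to fit in memory

# Index every variable with the in-memory mask, so the dask-backed
# predicate is not handed to each variable separately
subset = ds.isel(x=mask)
print(subset.sizes)
```

Because `mask` is a plain NumPy array, the indexer itself no longer carries a dask graph that each variable could trigger.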

@tomwhite originally pointed this out in https://github.com/pystatgen/sgkit/issues/299.

Fancy indexing a Dataset with dask DataArray triggers multiple computes #4663
