Fancy indexing a Dataset with dask DataArray triggers multiple computes #4663

Closed
@eric-czech

It appears that a dask-backed boolean array (and presumably any dask array used as an indexer) is computed many more times than necessary when it is used to index a Dataset with multiple variables. Is this intentional? The following example demonstrates the behavior:

import numpy as np
import dask.array as da
import xarray as xr

# Use a custom array type to know when data is being evaluated
class Array:
    
    def __init__(self, x):
        self.shape = (x.shape[0],)
        self.ndim = x.ndim
        self.dtype = 'bool'
        self.x = x
        
    def __getitem__(self, idx):
        if idx[0].stop > 0:
            print('Evaluating')
        return (self.x > .5).__getitem__(idx)

# Control case -- this shows that the print statement is only reached once
da.from_array(Array(np.random.rand(100))).compute();
# Evaluating

# This usage somehow results in two evaluations of this one array?
ds = xr.Dataset(dict(
    a=('x', da.from_array(Array(np.random.rand(100))))
))
ds.sel(x=ds.a)
# Evaluating
# Evaluating
# <xarray.Dataset>
# Dimensions:  (x: 51)
# Dimensions without coordinates: x
# Data variables:
#     a        (x) bool dask.array<chunksize=(51,), meta=np.ndarray>

# The array is evaluated an extra time for each new variable
ds = xr.Dataset(dict(
    a=('x', da.from_array(Array(np.random.rand(100)))),
    b=(('x', 'y'), da.random.random((100, 10))),
    c=(('x', 'y'), da.random.random((100, 10))),
    d=(('x', 'y'), da.random.random((100, 10))),
))
ds.sel(x=ds.a)
# Evaluating
# Evaluating
# Evaluating
# Evaluating
# Evaluating
# <xarray.Dataset>
# Dimensions:  (x: 48, y: 10)
# Dimensions without coordinates: x, y
# Data variables:
#     a        (x) bool dask.array<chunksize=(48,), meta=np.ndarray>
#     b        (x, y) float64 dask.array<chunksize=(48, 10), meta=np.ndarray>
#     c        (x, y) float64 dask.array<chunksize=(48, 10), meta=np.ndarray>
#     d        (x, y) float64 dask.array<chunksize=(48, 10), meta=np.ndarray>

Given that slicing is already not lazy, why does the same predicate array need to be computed more than once?

@tomwhite originally pointed this out in https://github.com/pystatgen/sgkit/issues/299.
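
A possible workaround, sketched here as an assumption rather than anything confirmed in this thread, is to materialize the boolean indexer once as a NumPy array and pass that in-memory mask to `isel`, so that no dask graph is attached to the indexer when it is applied to each variable. The dataset below is a simplified stand-in for the one above (plain dask arrays, no custom `Array` class), and it assumes the mask fits comfortably in memory:

```python
import numpy as np
import dask.array as da
import xarray as xr

# Simplified stand-in for the dataset in the report (hypothetical data).
rng = np.random.default_rng(0)
ds = xr.Dataset(
    dict(
        a=("x", da.from_array(rng.random(100) > 0.5, chunks=100)),
        b=(("x", "y"), da.random.random((100, 10), chunks=(100, 10))),
        c=(("x", "y"), da.random.random((100, 10), chunks=(100, 10))),
    )
)

# Compute the predicate exactly once; .values returns a NumPy array.
mask = ds["a"].values

# Index every variable with the already-computed mask.
subset = ds.isel(x=mask)
```

The data variables in `subset` remain dask-backed; only the indexer itself has been evaluated eagerly.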
