It appears that boolean arrays (or, presumably, any indexer array) are evaluated many more times than necessary when used to select from multiple variables in a Dataset. Is this intentional? Here is an example that demonstrates this:
```python
import dask.array as da
import numpy as np
import xarray as xr

# Use a custom array type to know when data is being evaluated
class Array:
    def __init__(self, x):
        self.shape = (x.shape[0],)
        self.ndim = x.ndim
        self.dtype = 'bool'
        self.x = x

    def __getitem__(self, idx):
        if idx[0].stop > 0:
            print('Evaluating')
        return (self.x > .5).__getitem__(idx)

# Control case -- this shows that the print statement is only reached once
da.from_array(Array(np.random.rand(100))).compute();
# Evaluating
```
```python
# This usage somehow results in two evaluations of this one array?
ds = xr.Dataset(dict(
    a=('x', da.from_array(Array(np.random.rand(100))))
))
ds.sel(x=ds.a)
# Evaluating
# Evaluating
# <xarray.Dataset>
# Dimensions:  (x: 51)
# Dimensions without coordinates: x
# Data variables:
#     a        (x) bool dask.array<chunksize=(51,), meta=np.ndarray>
```
```python
# The array is evaluated an extra time for each new variable
ds = xr.Dataset(dict(
    a=('x', da.from_array(Array(np.random.rand(100)))),
    b=(('x', 'y'), da.random.random((100, 10))),
    c=(('x', 'y'), da.random.random((100, 10))),
    d=(('x', 'y'), da.random.random((100, 10))),
))
ds.sel(x=ds.a)
# Evaluating
# Evaluating
# Evaluating
# Evaluating
# Evaluating
# <xarray.Dataset>
# Dimensions:  (x: 48, y: 10)
# Dimensions without coordinates: x, y
# Data variables:
#     a        (x) bool dask.array<chunksize=(48, 10), meta=np.ndarray>
#     b        (x, y) float64 dask.array<chunksize=(48, 10), meta=np.ndarray>
#     c        (x, y) float64 dask.array<chunksize=(48, 10), meta=np.ndarray>
#     d        (x, y) float64 dask.array<chunksize=(48, 10), meta=np.ndarray>
```
Given that slicing is already not lazy, why does the same predicate array need to be computed more than once?
@tomwhite originally pointed this out in https://github.com/pystatgen/sgkit/issues/299.
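For what it's worth, materializing the predicate once before indexing appears to sidestep the repeated evaluation. Below is a workaround sketch (not a fix) using the same custom array type as above; the global `count` variable is an addition here, used purely to tally how many times the underlying data is actually read:

```python
import dask.array as da
import numpy as np
import xarray as xr

count = 0  # tallies how many times the underlying data is materialized

class Array:
    def __init__(self, x):
        self.shape = (x.shape[0],)
        self.ndim = x.ndim
        self.dtype = 'bool'
        self.x = x

    def __getitem__(self, idx):
        global count
        if idx[0].stop > 0:
            count += 1
        return (self.x > .5).__getitem__(idx)

ds = xr.Dataset(dict(
    a=('x', da.from_array(Array(np.random.rand(100)))),
    b=(('x', 'y'), da.random.random((100, 10))),
    c=(('x', 'y'), da.random.random((100, 10))),
))

# Compute the boolean mask up front, then index with the in-memory result
mask = ds.a.compute()
result = ds.sel(x=mask)
print(count)
```

Since `mask` is already backed by a numpy array when `sel` runs, xarray has no dask-backed indexer left to compute per variable, so the custom array is only read during the explicit `compute()` call.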