Skip to content

Performance problem when doing computation between two arrays with discontinuous indexes  #3755

Open
@vincentchabot

Description

@vincentchabot

MCVE Code Sample

import xarray as xr 
import numpy as np 

# Creating array
ds = xr.Dataset()
ds["longitude"] = np.arange(0.1,2000.1,1)
ds["latitude"] = range(3000)
ds["step"] = range(50)
ds["field"] = (("step","latitude","longitude"),np.random.randn(50,3000,2000))
ds.to_netcdf("big_array.nc")

# Create another array 
ds = xr.Dataset()
ds["longitude"] = np.arange(500.1,600.1,1) # Coordinate are a continuous subset of the first array
ds["latitude"] = np.arange(510,660)
ds["id"] = range(50)
ds["field"] = (("longitude","latitude","id"),np.random.randn(100,150,50))
ds.to_netcdf("slicing.nc")

# Create another array with "discontinuity" in longitude dimension 
ds = xr.Dataset()
ds["longitude"] = list(np.arange(500.1,598.1,1)) +[622.1,640.1]
ds["latitude"] = range(510,660)
ds["id"] = range(10)
ds["mask"] = (("longitude","latitude","id"),np.random.randn(100,150,10))
ds.to_netcdf("no_slicing.nc")

# Load the Three arrays 

da = xr.open_dataset("big_array.nc")
db = xr.open_dataset("slicing.nc").isel(id=0)
dc = xr.open_dataarray("no_slicing.nc").isel(id=0)

%timeit da*db
# 32.3 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit da*dc
#2.13 s ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

def slicing_operation(dc,da):
    """
    Slicing when knowing that dc is a subpart of da
    """
    import operator
    min_lat = np.max([dc.latitude.values.min(),da.latitude.values.min()])
    max_lat = np.min([dc.latitude.values.max(),da.latitude.values.max()])
    index_lat_field = operator.and_(da.latitude >= min_lat,da.latitude <= max_lat)
    min_lon = np.max([dc.longitude.values.min(),da.longitude.values.min()])
    max_lon = np.min([dc.longitude.values.max(),da.longitude.values.max()])
    index_lon_field = operator.and_(da.longitude >= min_lon,da.longitude <= max_lon)
    
    # Extending dc such that it covers all longitude of da 
    # Creating an empty array. 
    dempty = xr.Dataset()
    dempty["longitude"] = da.longitude.isel(longitude=index_lon_field).values
    dempty["latitude"] = da.latitude.isel(latitude=index_lat_field).values
    dc_ext = dc.broadcast_like(dempty)
    # Performing multiplication and align 
    res,_ = xr.align((dc_ext *da),dc)
    return res

%timeit slicing_operation(dc,da)
#43.9 ms ± 1.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

#Checking if we get the same results
res_0 = dc*da
res = slicing_operation(dc,da)
print(abs(res_0 - res).max())

Problem Description

A performance problem occured when performing operation between da and dc.
Computing da *db takes ~32 ms
While db and dc have the same shape, opeartion dc * da takes ~2.13s (so x 60 in cpu times).
This seems linked to the fact tha dc includes discontinuous indexes of da.

Indeed, when extending dc (such that its longitude are continuous when compared at da) the computation time take ~44 ms.
For me it seems that when doing da * dc da dataset is fully loaded in memory first, while only the part involved in computation is loaded when doing da *db.
Indeed when doing :

%time da.load()

we get the following result:

CPU times: user 0 ns, sys: 1.88 s, total: 1.88 s

Wall time: 1.89 s

On top of that, da *db scales with the number of id selected while it is not the case for da *dc .

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None

python: 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:33:48)

[GCC 7.3.0]

python-bits: 64

OS: Linux

OS-release: 3.10.0-957.27.2.el7.x86_64

machine: x86_64

processor: x86_64

byteorder: little

LC_ALL: None

LANG: en_US.UTF-8

LOCALE: en_US.UTF-8

libhdf5: 1.10.5

libnetcdf: 4.7.3

xarray: 0.14.1

pandas: 0.25.3

numpy: 1.17.4

scipy: 1.4.1

netCDF4: 1.5.3

pydap: None

h5netcdf: None

h5py: 2.10.0

Nio: None

zarr: None

cftime: 1.0.4.2

nc_time_axis: None

PseudoNetCDF: None

rasterio: None

cfgrib: 0.9.7.6

iris: None

bottleneck: 1.3.1

dask: 2.9.1

distributed: 2.9.1

matplotlib: 3.1.2

cartopy: None

seaborn: 0.9.0

numbagg: None

setuptools: 44.0.0.post20200106

pip: 19.3.1

conda: None

pytest: 5.3.2

IPython: 7.11.1

sphinx: 2.3.1

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions