Description
MCVE Code Sample
import xarray as xr
import numpy as np
# Creating array
ds = xr.Dataset()
ds["longitude"] = np.arange(0.1,2000.1,1)
ds["latitude"] = range(3000)
ds["step"] = range(50)
ds["field"] = (("step","latitude","longitude"),np.random.randn(50,3000,2000))
ds.to_netcdf("big_array.nc")
# Create another array
ds = xr.Dataset()
ds["longitude"] = np.arange(500.1,600.1,1) # Coordinate are a continuous subset of the first array
ds["latitude"] = np.arange(510,660)
ds["id"] = range(50)
ds["field"] = (("longitude","latitude","id"),np.random.randn(100,150,50))
ds.to_netcdf("slicing.nc")
# Create another array with "discontinuity" in longitude dimension
ds = xr.Dataset()
ds["longitude"] = list(np.arange(500.1,598.1,1)) +[622.1,640.1]
ds["latitude"] = range(510,660)
ds["id"] = range(10)
ds["mask"] = (("longitude","latitude","id"),np.random.randn(100,150,10))
ds.to_netcdf("no_slicing.nc")
# Load the Three arrays
da = xr.open_dataset("big_array.nc")
db = xr.open_dataset("slicing.nc").isel(id=0)
dc = xr.open_dataarray("no_slicing.nc").isel(id=0)
%timeit da*db
# 32.3 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit da*dc
#2.13 s ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
def slicing_operation(dc,da):
"""
Slicing when knowing that dc is a subpart of da
"""
import operator
min_lat = np.max([dc.latitude.values.min(),da.latitude.values.min()])
max_lat = np.min([dc.latitude.values.max(),da.latitude.values.max()])
index_lat_field = operator.and_(da.latitude >= min_lat,da.latitude <= max_lat)
min_lon = np.max([dc.longitude.values.min(),da.longitude.values.min()])
max_lon = np.min([dc.longitude.values.max(),da.longitude.values.max()])
index_lon_field = operator.and_(da.longitude >= min_lon,da.longitude <= max_lon)
# Extending dc such that it covers all longitude of da
# Creating an empty array.
dempty = xr.Dataset()
dempty["longitude"] = da.longitude.isel(longitude=index_lon_field).values
dempty["latitude"] = da.latitude.isel(latitude=index_lat_field).values
dc_ext = dc.broadcast_like(dempty)
# Performing multiplication and align
res,_ = xr.align((dc_ext *da),dc)
return res
%timeit slicing_operation(dc,da)
#43.9 ms ± 1.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#Checking if we get the same results
res_0 = dc*da
res = slicing_operation(dc,da)
print(abs(res_0 - res).max())
Problem Description
A performance problem occured when performing operation between da and dc.
Computing da *db
takes ~32 ms
While db and dc have the same shape, opeartion dc * da
takes ~2.13s (so x 60 in cpu times).
This seems linked to the fact tha dc includes discontinuous indexes of da.
Indeed, when extending dc (such that its longitude are continuous when compared at da) the computation time take ~44 ms.
For me it seems that when doing da * dc
da dataset is fully loaded in memory first, while only the part involved in computation is loaded when doing da *db
.
Indeed when doing :
%time da.load()
we get the following result:
CPU times: user 0 ns, sys: 1.88 s, total: 1.88 s
Wall time: 1.89 s
On top of that, da *db
scales with the number of id
selected while it is not the case for da *dc
.
Output of xr.show_versions()
python: 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:33:48)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-957.27.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.3
xarray: 0.14.1
pandas: 0.25.3
numpy: 1.17.4
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: None
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: 0.9.7.6
iris: None
bottleneck: 1.3.1
dask: 2.9.1
distributed: 2.9.1
matplotlib: 3.1.2
cartopy: None
seaborn: 0.9.0
numbagg: None
setuptools: 44.0.0.post20200106
pip: 19.3.1
conda: None
pytest: 5.3.2
IPython: 7.11.1
sphinx: 2.3.1