Description
We have a dataset stored across multiple netCDF files. We are getting very slow performance with open_mfdataset, and I would like to improve this.
Each individual netCDF file looks like this:
%time ds_single = xr.open_dataset('float_trajectories.0000000000.nc')
ds_single
CPU times: user 14.9 ms, sys: 48.4 ms, total: 63.4 ms
Wall time: 60.8 ms
<xarray.Dataset>
Dimensions: (npart: 8192000, time: 1)
Coordinates:
* time (time) datetime64[ns] 1993-01-01
* npart (npart) int32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
z (time, npart) float32 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
vort (time, npart) float32 -9.71733e-10 -9.72858e-10 -9.73001e-10 ...
u (time, npart) float32 0.000545563 0.000544884 0.000544204 ...
v (time, npart) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
x (time, npart) float32 180.016 180.047 180.078 180.109 180.141 ...
y (time, npart) float32 -79.9844 -79.9844 -79.9844 -79.9844 ...
As shown above, a single data file opens in ~60 ms.
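For comparison, raw I/O does not seem to be the bottleneck: at ~60 ms per file, opening all 49 files one at a time should cost on the order of 3 seconds. Here is a quick sketch of that baseline (the glob pattern assumes the file naming shown above):

import glob
import time

import xarray as xr

files = sorted(glob.glob('float_trajectories.*.nc'))
t0 = time.time()
datasets = [xr.open_dataset(f) for f in files]  # open each file individually, no alignment
print('opened %d files in %.1f s' % (len(datasets), time.time() - t0))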
When I call open_mfdataset on 49 files (each with a different time value but the same npart coordinate), here is what happens:
%time ds = xr.open_mfdataset('*.nc')
ds
CPU times: user 1min 31s, sys: 25.4 s, total: 1min 57s
Wall time: 2min 4s
<xarray.Dataset>
Dimensions: (npart: 8192000, time: 49)
Coordinates:
* npart (npart) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
* time (time) datetime64[ns] 1993-01-01 1993-01-02 1993-01-03 ...
Data variables:
z (time, npart) float64 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
vort (time, npart) float64 -9.717e-10 -9.729e-10 -9.73e-10 -9.73e-10 ...
u (time, npart) float64 0.0005456 0.0005449 0.0005442 0.0005437 ...
v (time, npart) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
x (time, npart) float64 180.0 180.0 180.1 180.1 180.1 180.2 180.2 ...
y (time, npart) float64 -79.98 -79.98 -79.98 -79.98 -79.98 -79.98 ...
It takes over 2 minutes to open the dataset. Specifying concat_dim='time' does not improve performance.
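For reference, the exact variant I tried (on newer xarray versions this call also needs combine='nested'):

%time ds = xr.open_mfdataset('*.nc', concat_dim='time')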
Here is the %prun output for the open_mfdataset command:
748994 function calls (724222 primitive calls) in 142.160 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
49 62.455 1.275 62.458 1.275 {method 'get_indexer' of 'pandas.index.IndexEngine' objects}
49 47.207 0.963 47.209 0.963 base.py:1067(is_unique)
196 7.198 0.037 7.267 0.037 {operator.getitem}
49 4.632 0.095 4.687 0.096 netCDF4_.py:182(_open_netcdf4_group)
240 3.189 0.013 3.426 0.014 numeric.py:2476(array_equal)
98 1.937 0.020 1.937 0.020 {numpy.core.multiarray.arange}
4175/3146 1.867 0.000 9.296 0.003 {numpy.core.multiarray.array}
49 1.525 0.031 119.144 2.432 alignment.py:251(reindex_variables)
24 1.065 0.044 1.065 0.044 {method 'cumsum' of 'numpy.ndarray' objects}
12 1.010 0.084 1.010 0.084 {method 'sort' of 'numpy.ndarray' objects}
5227/4035 0.660 0.000 1.688 0.000 collections.py:50(__init__)
12 0.600 0.050 3.238 0.270 core.py:2761(insert)
12691/7497 0.473 0.000 0.875 0.000 indexing.py:363(shape)
110728 0.425 0.000 0.663 0.000 {isinstance}
12 0.413 0.034 0.413 0.034 {method 'flatten' of 'numpy.ndarray' objects}
12 0.341 0.028 0.341 0.028 {numpy.core.multiarray.where}
2 0.333 0.166 0.333 0.166 {pandas._join.outer_join_indexer_int64}
1 0.331 0.331 142.164 142.164 <string>:1(<module>)
It looks like most of the time is being spent in reindex_variables. I understand why this happens: xarray needs to check that the indexes are the same in order to concatenate the datasets together.
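To make the cost concrete, here is a rough stand-alone reproduction of the per-file overhead; this is only a sketch of what pandas does under the hood, with sizes matching the npart index above:

import numpy as np
import pandas as pd

npart = pd.Index(np.arange(1, 8192001))   # same length as the npart coordinate
target = pd.Index(np.arange(1, 8192001))  # the index of the next file to align
indexer = target.get_indexer(npart)       # hash-table lookup, roughly once per file
print(npart.is_unique)                    # uniqueness check, the other hotspot above
assert (indexer == np.arange(8192000)).all()

Each get_indexer call on an 8,192,000-element index takes on the order of a second (62.455 s / 49 calls in the profile above), which accounts for most of the two minutes.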
Is there any obvious way I could improve the load time? For example, can I give a hint to xarray that this reindex_variables step is not necessary, since I know that the npart coordinate is identical in every file?
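Something like the following is what I have in mind as a workaround (a sketch against a recent xarray API, untested on the real data): drop the npart coordinate in a preprocess step so there is no index left for reindex_variables to align, then re-attach it after concatenation.

import glob

import numpy as np
import xarray as xr

def drop_npart(ds):
    # with no npart index, reindex_variables has nothing to align
    return ds.drop_vars('npart')

files = sorted(glob.glob('float_trajectories.*.nc'))
ds = xr.open_mfdataset(files, preprocess=drop_npart,
                       combine='nested', concat_dim='time')
ds = ds.assign_coords(npart=np.arange(1, ds.sizes['npart'] + 1))  # restore the coordinate

It would be nice, though, to have a built-in way to express this.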