Description
This is a follow-up from #1521.
When invoking open_mfdataset, the user very frequently knows in advance that all of the coords that aren't
on the concat_dim are already aligned, and may be willing to blindly trust that assumption in exchange for a huge performance boost.
My production data: 200 NetCDF files on a not-very-performant NFS file system, concatenated on the "scenario" dimension:
```
xarray.open_mfdataset('cube.*.nc', engine='h5netcdf', concat_dim='scenario')

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * attribute    (attribute) object 'THEO/Value'
    currency     (instr_id) object 'ZAR' 'EUR' 'EUR' 'EUR' 'EUR' 'EUR' 'GBP' ...
  * fx_id        (fx_id) object 'GBP' 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' ...
  * instr_id     (instr_id) object 'S01626556_ZAE000204921' '537805_1275' ...
  * timestep     (timestep) datetime64[ns] 2016-12-31
    type         (instr_id) object 'American' 'Bond Future' 'Bond Future' ...
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 19.6 s, sys: 981 ms, total: 20.6 s
Wall time: 24.4 s
```
If I skip loading and comparing the non-index coords from all 200 files:
```
xarray.open_mfdataset('cube.*.nc', engine='h5netcdf', concat_dim='scenario', coords='all')

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * attribute    (attribute) object 'THEO/Value'
  * fx_id        (fx_id) object 'GBP' 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' ...
  * instr_id     (instr_id) object 'S01626556_ZAE000204921' '537805_1275' ...
  * timestep     (timestep) datetime64[ns] 2016-12-31
    currency     (scenario, instr_id) object dask.array<shape=(500001, 10765), chunksize=(2501, 10765)>
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
    type         (scenario, instr_id) object dask.array<shape=(500001, 10765), chunksize=(2501, 10765)>
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 12.7 s, sys: 305 ms, total: 13 s
Wall time: 14.8 s
```
If I also skip loading and comparing the index coords from all 200 files:
```
cube = xarray.open_mfdataset(sh.resolve_env(f'{dynamic}/mtf/{cubename}/nc/cube.*.nc'), engine='h5netcdf',
                             concat_dim='scenario',
                             drop_variables=['attribute', 'fx_id', 'instr_id', 'timestep', 'currency', 'type'])

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
Dimensions without coordinates: attribute, fx_id, instr_id, timestep
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 7.31 s, sys: 61 ms, total: 7.37 s
Wall time: 9.05 s
```
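For completeness, the dropped variables can then be copied back from any single file. A rough sketch of this manual workaround, reusing the variable names from the example above (`cube.000.nc` is a stand-in for any one of the input files):

```python
import xarray

# The variables that are known to be identical across all files.
aligned = ['attribute', 'fx_id', 'instr_id', 'timestep', 'currency', 'type']

# Skip them entirely while concatenating...
ds = xarray.open_mfdataset('cube.*.nc', engine='h5netcdf',
                           concat_dim='scenario', drop_variables=aligned)

# ...then copy them back from a single file, blindly trusting that they
# are aligned in all the other files.
with xarray.open_dataset('cube.000.nc', engine='h5netcdf') as first:
    ds = ds.assign_coords(**{name: first[name].load() for name in aligned})
```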
Proposed design
Add a new optional parameter to open_mfdataset, assume_aligned=None.
It can be set to a list of variable names or to "all", and it requires concat_dim
to be explicitly set.
It causes open_mfdataset to use the first occurrence of every such variable and blindly skip loading the subsequent ones.
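Hypothetical usage (the assume_aligned parameter does not exist in xarray today):

```python
import xarray

# Trust that every variable not on the concat dim is aligned across files.
ds = xarray.open_mfdataset('cube.*.nc', engine='h5netcdf',
                           concat_dim='scenario', assume_aligned='all')
```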
Algorithm
- Perform the first invocation of the underlying open_dataset exactly as happens today.
- If assume_aligned is not None: for each subsequent NetCDF file, figure out which variables would need to be aligned and compared (as opposed to concatenated), and add them to a drop_variables list.
- If assume_aligned != "all": drop_variables &= assume_aligned
- Pass the increasingly long drop_variables list to the underlying open_dataset (see the sketch below).
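A minimal sketch of the intent, written as a standalone wrapper rather than as a patch to the real open_mfdataset internals, and simplified to derive the skip list from the first file only (whereas the algorithm above would grow drop_variables as each new file is opened):

```python
import xarray

def open_mfdataset_assume_aligned(paths, concat_dim, assume_aligned=None,
                                  **open_kwargs):
    """Sketch only: skip loading variables the caller declares aligned."""
    first = xarray.open_dataset(paths[0], **open_kwargs)

    # Variables that don't depend on concat_dim would be aligned and
    # compared across files rather than concatenated; these are the
    # candidates for skipping.
    skippable = {name for name, var in first.variables.items()
                 if concat_dim not in var.dims}
    if assume_aligned is None:
        skippable = set()              # current behaviour: compare everything
    elif assume_aligned != 'all':
        skippable &= set(assume_aligned)

    drop = list(skippable)
    datasets = [first.drop_vars(drop)] + [
        xarray.open_dataset(path, drop_variables=drop, **open_kwargs)
        for path in paths[1:]]
    combined = xarray.concat(datasets, dim=concat_dim)

    # Re-attach the skipped variables from the first file only.
    return combined.merge(first[drop])
```

With assume_aligned=['currency', 'type'], only those two variables would be skipped after the first file; everything else would still be loaded and checked exactly as today, so assume_aligned=None keeps the current behaviour.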