Description
This is a follow-up from #1521.
When invoking open_mfdataset, the user very frequently knows in advance that all of the coords that aren't
on the concat_dim are already aligned, and may be willing to blindly trust that assumption in exchange for a huge performance boost.
My production data: 200 NetCDF files on a not-very-performant NFS file system, concatenated on the "scenario" dimension:
```
xarray.open_mfdataset('cube.*.nc', engine='h5netcdf', concat_dim='scenario')

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * attribute    (attribute) object 'THEO/Value'
    currency     (instr_id) object 'ZAR' 'EUR' 'EUR' 'EUR' 'EUR' 'EUR' 'GBP' ...
  * fx_id        (fx_id) object 'GBP' 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' ...
  * instr_id     (instr_id) object 'S01626556_ZAE000204921' '537805_1275' ...
  * timestep     (timestep) datetime64[ns] 2016-12-31
    type         (instr_id) object 'American' 'Bond Future' 'Bond Future' ...
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 19.6 s, sys: 981 ms, total: 20.6 s
Wall time: 24.4 s
```
If I skip loading and comparing the non-index coords from all 200 files:
```
xarray.open_mfdataset('cube.*.nc', engine='h5netcdf', concat_dim='scenario', coords='all')

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * attribute    (attribute) object 'THEO/Value'
  * fx_id        (fx_id) object 'GBP' 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' ...
  * instr_id     (instr_id) object 'S01626556_ZAE000204921' '537805_1275' ...
  * timestep     (timestep) datetime64[ns] 2016-12-31
    currency     (scenario, instr_id) object dask.array<shape=(500001, 10765), chunksize=(2501, 10765)>
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
    type         (scenario, instr_id) object dask.array<shape=(500001, 10765), chunksize=(2501, 10765)>
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 12.7 s, sys: 305 ms, total: 13 s
Wall time: 14.8 s
```
If I also skip loading and comparing the index coords from all 200 files:
```
cube = xarray.open_mfdataset(sh.resolve_env(f'{dynamic}/mtf/{cubename}/nc/cube.*.nc'), engine='h5netcdf',
                             concat_dim='scenario',
                             drop_variables=['attribute', 'fx_id', 'instr_id', 'timestep', 'currency', 'type'])

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
Dimensions without coordinates: attribute, fx_id, instr_id, timestep
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 7.31 s, sys: 61 ms, total: 7.37 s
Wall time: 9.05 s
```
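For completeness, the dropped variables can then be copied back from any single file. A rough sketch of this manual workaround, reusing the variable names from the example above (`cube.000.nc` is a stand-in for any one of the input files):

```python
import xarray

# The variables that are known to be identical across all files.
aligned = ['attribute', 'fx_id', 'instr_id', 'timestep', 'currency', 'type']

# Skip them entirely while concatenating...
ds = xarray.open_mfdataset('cube.*.nc', engine='h5netcdf',
                           concat_dim='scenario', drop_variables=aligned)

# ...then copy them back from a single file, blindly trusting that they
# are aligned in all the other files.
with xarray.open_dataset('cube.000.nc', engine='h5netcdf') as first:
    ds = ds.assign_coords(**{name: first[name].load() for name in aligned})
```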
Proposed design
Add a new optional parameter to open_mfdataset, assume_aligned=None.
It can be set to a list of variable names or to "all", and it requires concat_dim
to be explicitly set.
It causes open_mfdataset to use the first occurrence of every such variable and blindly skip loading the subsequent ones.
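Hypothetical usage (the assume_aligned parameter does not exist in xarray today):

```python
import xarray

# Trust that every variable not on the concat dim is aligned across files.
ds = xarray.open_mfdataset('cube.*.nc', engine='h5netcdf',
                           concat_dim='scenario', assume_aligned='all')
```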
Algorithm
- Perform the first invocation of the underlying open_dataset exactly as happens today.
- If assume_aligned is not None: for each subsequent NetCDF file, figure out which variables would need to be aligned and compared (as opposed to concatenated), and add them to a drop_variables list.
- If assume_aligned != "all": drop_variables &= assume_aligned
- Pass the increasingly long drop_variables list to the underlying open_dataset (see the sketch below).
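A minimal sketch of the intent, written as a standalone wrapper rather than as a patch to the real open_mfdataset internals, and simplified to derive the skip list from the first file only (whereas the algorithm above would grow drop_variables as each new file is opened):

```python
import xarray

def open_mfdataset_assume_aligned(paths, concat_dim, assume_aligned=None,
                                  **open_kwargs):
    """Sketch only: skip loading variables the caller declares aligned."""
    first = xarray.open_dataset(paths[0], **open_kwargs)

    # Variables that don't depend on concat_dim would be aligned and
    # compared across files rather than concatenated; these are the
    # candidates for skipping.
    skippable = {name for name, var in first.variables.items()
                 if concat_dim not in var.dims}
    if assume_aligned is None:
        skippable = set()              # current behaviour: compare everything
    elif assume_aligned != 'all':
        skippable &= set(assume_aligned)

    drop = list(skippable)
    datasets = [first.drop_vars(drop)] + [
        xarray.open_dataset(path, drop_variables=drop, **open_kwargs)
        for path in paths[1:]]
    combined = xarray.concat(datasets, dim=concat_dim)

    # Re-attach the skipped variables from the first file only.
    return combined.merge(first[drop])
```

With assume_aligned=['currency', 'type'], only those two variables would be skipped after the first file; everything else would still be loaded and checked exactly as today, so assume_aligned=None keeps the current behaviour.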