Non-HTTPS remote URLs no longer work as input for open_zarr #4691

Closed
@charlesbluca

What happened:

On xarray 0.16.2 and later, passing a non-HTTPS remote URL path (e.g. gs://...) as input to open_zarr() raises a KeyError or a GroupNotFoundError, depending on the value of consolidated:

>>> import xarray as xr
>>> xr.open_zarr("gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/od550aer/gn/", consolidated=True)
KeyError: '.zmetadata'
>>> xr.open_zarr("gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/od550aer/gn/", consolidated=False)
GroupNotFoundError: group not found at path ''

What you expected to happen:

With versions 0.16.1 and earlier, passing a non-HTTPS remote URL path to open_zarr() as input would successfully open the remote store, provided that a package to handle the specific filesystem was available in the environment and the proper storage options were supplied.

Minimal Complete Verifiable Example:

Same as above, but with decode_times=False to circumvent a cftime dependency:

import xarray as xr

xr.open_zarr(
    "gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/od550aer/gn/",
    consolidated=True,
    decode_times=False,
)

Anything else we need to know?:

From briefly debugging the code, it looks like this error results from open_zarr() now calling open_dataset(engine="zarr") to open the Zarr store.

In this function, the remote URL path is now passed through _normalize_path(), where it is not recognized as a remote URL (this check is done by is_remote_uri(), which only checks for HTTP(S) URLs). It is instead interpreted as a relative path on the local filesystem, where it does not exist.
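To make the failure mode concrete, here is a minimal sketch of the behavior described above. It is not xarray's actual source: is_remote_uri_sketch() and normalize_path_sketch() are hypothetical stand-ins, and the HTTP(S)-only regex and the abspath-based normalization are assumptions based on my reading of the code.

```python
import os
import re


def is_remote_uri_sketch(path: str) -> bool:
    # Assumption: approximates a check that only recognizes HTTP(S) URLs.
    return bool(re.search(r"^https?://", path))


def normalize_path_sketch(path: str) -> str:
    # Hypothetical stand-in for _normalize_path(): anything not flagged as
    # remote is treated as a local path and made absolute.
    if is_remote_uri_sketch(path):
        return path
    return os.path.abspath(os.path.expanduser(path))


gs_url = "gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/od550aer/gn/"
https_url = "https://example.com/store.zarr"  # placeholder URL for illustration

# The gs:// URL is rewritten into a path under the current working directory.
print(normalize_path_sketch(gs_url))
# The https:// URL is passed through unchanged.
print(normalize_path_sketch(https_url))
```

Under these assumptions, the gs:// URL no longer reaches the Zarr backend as a URL at all, which would explain why the store appears to be missing.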

I'm not sure if this is meant to be the expected behavior, as the documentation on reading datasets in the cloud does not show an example using a URL path as input and only suggests using a MutableMapping. However, this is a use case that worked before 0.16.2 and now no longer works.

I think this could be resolved by expanding is_remote_uri() to check for other common remote URIs (e.g. gs:, s3:, etc.).
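As a rough sketch of what such an expansion might look like, the check could accept any URL scheme followed by "://" rather than a fixed list. The name is_remote_uri_expanded() and the regex are hypothetical, not a proposed final implementation.

```python
import re

# Hypothetical pattern: a URI scheme (a letter followed by letters, digits,
# "+", "." or "-") immediately followed by "://".
_SCHEME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def is_remote_uri_expanded(path: str) -> bool:
    # Treat any "scheme://" prefix as remote, e.g. gs://, s3://, https://.
    return bool(_SCHEME_PATTERN.match(path))


for candidate in [
    "gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/od550aer/gn/",
    "s3://bucket/store.zarr",  # placeholder bucket name
    "https://example.com/store.zarr",  # placeholder URL
    "data/store.zarr",
    "C:\\data\\store.zarr",
]:
    print(candidate, is_remote_uri_expanded(candidate))
```

One caveat with a generic pattern like this is that file:// URLs would also be classified as remote, so a real fix might need to special-case them. Windows drive paths such as C:\data are not matched, since the pattern requires "://".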

Environment:

Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.1 | packaged by conda-forge | (default, Dec  9 2020, 01:07:06) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: English_United States.1252
libhdf5: None
libnetcdf: None

xarray: 0.16.2
pandas: 1.1.5
numpy: 1.19.4
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.6.1
cftime: 1.3.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 51.0.0.post20201207
pip: 20.3.1
conda: None
pytest: None
IPython: 7.19.0
sphinx: None

Labels: topic-zarr (Related to zarr storage library)