
Commit 432c414

Merge branch 'main' into groupby-remove-index-variable

* main:
  Split out distributed writes in zarr docs (pydata#9132)
  Update zendoo badge link (pydata#9133)
  Support duplicate dimensions in `.chunk` (pydata#9099)
  Bump the actions group with 2 updates (pydata#9130)
  adjust repr tests to account for different platforms (pydata#9127) (pydata#9128)

2 parents 6c60cf7 + 3fd162e commit 432c414

File tree: 12 files changed (+155 −175 lines)

.github/workflows/ci-additional.yaml

Lines changed: 4 additions & 4 deletions

@@ -130,7 +130,7 @@ jobs:
           python -m mypy --install-types --non-interactive --cobertura-xml-report mypy_report xarray/

       - name: Upload mypy coverage to Codecov
-        uses: codecov/codecov-action@v4.4.1
+        uses: codecov/codecov-action@v4.5.0
         with:
           file: mypy_report/cobertura.xml
           flags: mypy
@@ -184,7 +184,7 @@ jobs:
           python -m mypy --install-types --non-interactive --cobertura-xml-report mypy_report xarray/

       - name: Upload mypy coverage to Codecov
-        uses: codecov/codecov-action@v4.4.1
+        uses: codecov/codecov-action@v4.5.0
         with:
           file: mypy_report/cobertura.xml
           flags: mypy39
@@ -245,7 +245,7 @@ jobs:
           python -m pyright xarray/

       - name: Upload pyright coverage to Codecov
-        uses: codecov/codecov-action@v4.4.1
+        uses: codecov/codecov-action@v4.5.0
         with:
           file: pyright_report/cobertura.xml
           flags: pyright
@@ -304,7 +304,7 @@ jobs:
           python -m pyright xarray/

       - name: Upload pyright coverage to Codecov
-        uses: codecov/codecov-action@v4.4.1
+        uses: codecov/codecov-action@v4.5.0
         with:
           file: pyright_report/cobertura.xml
           flags: pyright39

.github/workflows/ci.yaml

Lines changed: 1 addition & 1 deletion

@@ -159,7 +159,7 @@ jobs:
           path: pytest.xml

       - name: Upload code coverage to Codecov
-        uses: codecov/codecov-action@v4.4.1
+        uses: codecov/codecov-action@v4.5.0
         with:
           file: ./coverage.xml
           flags: unittests

.github/workflows/pypi-release.yaml

Lines changed: 2 additions & 2 deletions

@@ -88,7 +88,7 @@ jobs:
           path: dist
       - name: Publish package to TestPyPI
         if: github.event_name == 'push'
-        uses: pypa/gh-action-pypi-publish@v1.8.14
+        uses: pypa/gh-action-pypi-publish@v1.9.0
         with:
           repository_url: https://test.pypi.org/legacy/
           verbose: true
@@ -111,6 +111,6 @@ jobs:
           name: releases
           path: dist
       - name: Publish package to PyPI
-        uses: pypa/gh-action-pypi-publish@v1.8.14
+        uses: pypa/gh-action-pypi-publish@v1.9.0
         with:
           verbose: true

.github/workflows/upstream-dev-ci.yaml

Lines changed: 1 addition & 1 deletion

@@ -146,7 +146,7 @@ jobs:
         run: |
           python -m mypy --install-types --non-interactive --cobertura-xml-report mypy_report
       - name: Upload mypy coverage to Codecov
-        uses: codecov/codecov-action@v4.4.1
+        uses: codecov/codecov-action@v4.5.0
         with:
           file: mypy_report/cobertura.xml
           flags: mypy

README.md

Lines changed: 11 additions & 11 deletions

@@ -7,7 +7,7 @@
 [![Available on pypi](https://img.shields.io/pypi/v/xarray.svg)](https://pypi.python.org/pypi/xarray/)
 [![Formatted with black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/python/black)
 [![Checked with mypy](http://www.mypy-lang.org/static/mypy_badge.svg)](http://mypy-lang.org/)
-[![Mirror on zendoo](https://zenodo.org/badge/DOI/10.5281/zenodo.598201.svg)](https://doi.org/10.5281/zenodo.598201)
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.11183201.svg)](https://doi.org/10.5281/zenodo.11183201)
 [![Examples on binder](https://img.shields.io/badge/launch-binder-579ACA.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFkAAABZCAMAAABi1XidAAAB8lBMVEX///9XmsrmZYH1olJXmsr1olJXmsrmZYH1olJXmsr1olJXmsrmZYH1olL1olJXmsr1olJXmsrmZYH1olL1olJXmsrmZYH1olJXmsr1olL1olJXmsrmZYH1olL1olJXmsrmZYH1olL1olL0nFf1olJXmsrmZYH1olJXmsq8dZb1olJXmsrmZYH1olJXmspXmspXmsr1olL1olJXmsrmZYH1olJXmsr1olL1olJXmsrmZYH1olL1olLeaIVXmsrmZYH1olL1olL1olJXmsrmZYH1olLna31Xmsr1olJXmsr1olJXmsrmZYH1olLqoVr1olJXmsr1olJXmsrmZYH1olL1olKkfaPobXvviGabgadXmsqThKuofKHmZ4Dobnr1olJXmsr1olJXmspXmsr1olJXmsrfZ4TuhWn1olL1olJXmsqBi7X1olJXmspZmslbmMhbmsdemsVfl8ZgmsNim8Jpk8F0m7R4m7F5nLB6jbh7jbiDirOEibOGnKaMhq+PnaCVg6qWg6qegKaff6WhnpKofKGtnomxeZy3noG6dZi+n3vCcpPDcpPGn3bLb4/Mb47UbIrVa4rYoGjdaIbeaIXhoWHmZYHobXvpcHjqdHXreHLroVrsfG/uhGnuh2bwj2Hxk17yl1vzmljzm1j0nlX1olL3AJXWAAAAbXRSTlMAEBAQHx8gICAuLjAwMDw9PUBAQEpQUFBXV1hgYGBkcHBwcXl8gICAgoiIkJCQlJicnJ2goKCmqK+wsLC4usDAwMjP0NDQ1NbW3Nzg4ODi5+3v8PDw8/T09PX29vb39/f5+fr7+/z8/Pz9/v7+zczCxgAABC5JREFUeAHN1ul3k0UUBvCb1CTVpmpaitAGSLSpSuKCLWpbTKNJFGlcSMAFF63iUmRccNG6gLbuxkXU66JAUef/9LSpmXnyLr3T5AO/rzl5zj137p136BISy44fKJXuGN/d19PUfYeO67Znqtf2KH33Id1psXoFdW30sPZ1sMvs2D060AHqws4FHeJojLZqnw53cmfvg+XR8mC0OEjuxrXEkX5ydeVJLVIlV0e10PXk5k7dYeHu7Cj1j+49uKg7uLU61tGLw1lq27ugQYlclHC4bgv7VQ+TAyj5Zc/UjsPvs1sd5cWryWObtvWT2EPa4rtnWW3JkpjggEpbOsPr7F7EyNewtpBIslA7p43HCsnwooXTEc3UmPmCNn5lrqTJxy6nRmcavGZVt/3Da2pD5NHvsOHJCrdc1G2r3DITpU7yic7w/7Rxnjc0kt5GC4djiv2Sz3Fb2iEZg41/ddsFDoyuYrIkmFehz0HR2thPgQqMyQYb2OtB0WxsZ3BeG3+wpRb1vzl2UYBog8FfGhttFKjtAclnZYrRo9ryG9uG/FZQU4AEg8ZE9LjGMzTmqKXPLnlWVnIlQQTvxJf8ip7VgjZjyVPrjw1te5otM7RmP7xm+sK2Gv9I8Gi++BRbEkR9EBw8zRUcKxwp73xkaLiqQb+kGduJTNHG72zcW9LoJgqQxpP3/Tj//c3yB0tqzaml05/+orHLksVO+95kX7/7qgJvnjlrfr2Ggsyx0eoy9uPzN5SPd86aXggOsEKW2Prz7du3VID3/tzs/sSRs2w7ovVHKtjrX2pd7ZMlTxAYfBAL9jiDwfLkq55Tm7ifhMlTGPyCAs7RFRhn47JnlcB9RM5T97ASuZXIcVNuUDIndpDbdsfrqsOppeXl5Y+XVKdjFCTh+zGaVuj0d9zy05PPK3QzBamxdwtTCrzyg/2Rvf2EstUjordGwa/kx9mSJLr8mLLtCW8HHGJc2R5hS219IiF6PnTusOqcMl57gm0Z8kanKMAQg0qSyuZfn7zItsbGyO9QlnxY0eCuD1XL2ys/MsrQhltE7Ug0uFOzufJFE2PxBo/YAx8XPPdDwWN0MrDRYIZF0mSMKCNHgaIVFoBbNoLJ7tEQDKxGF0kcLQimojCZopv0OkNOyWCCg9XMVAi7ARJzQdM2QUh0gmBozjc3Skg6dSBRqDGYSUOu66Zg+I2fNZs/M3/f/Grl/XnyF1Gw3VKCez0PN5IUfFLqvgUN4C0qNqYs5YhPL+aVZYDE4IpUk57oSFnJm4FyCqqOE0jhY2SMyLFoo56zyo6becOS5UVDdj7Vih0zp+tcMhwRpBeLyqtIjlJKAIZSbI8SGSF3k0pA3mR5tHuwPFoa7N7reoq2bqCsAk1HqCu5uvI1n6JuRXI+S1Mco54YmYTwcn6Aeic+kssXi8XpXC4V3t7/ADuTNKaQJdScAAAAAElFTkSuQmCC)](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/weather-data.ipynb)
 [![Twitter](https://img.shields.io/twitter/follow/xarray_dev?style=social)](https://twitter.com/xarray_dev)

@@ -46,15 +46,15 @@ provide a powerful and concise interface. For example:

 - Apply operations over dimensions by name: `x.sum('time')`.
 - Select values by label instead of integer location:
-  `x.loc['2014-01-01']` or `x.sel(time='2014-01-01')`.
+  `x.loc['2014-01-01']` or `x.sel(time='2014-01-01')`.
 - Mathematical operations (e.g., `x - y`) vectorize across multiple
-  dimensions (array broadcasting) based on dimension names, not shape.
+  dimensions (array broadcasting) based on dimension names, not shape.
 - Flexible split-apply-combine operations with groupby:
-  `x.groupby('time.dayofyear').mean()`.
+  `x.groupby('time.dayofyear').mean()`.
 - Database like alignment based on coordinate labels that smoothly
-  handles missing values: `x, y = xr.align(x, y, join='outer')`.
+  handles missing values: `x, y = xr.align(x, y, join='outer')`.
 - Keep track of arbitrary metadata in the form of a Python dictionary:
-  `x.attrs`.
+  `x.attrs`.

 ## Documentation
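Read together, the bullet list in this hunk is a tour of the core API. A minimal illustration, for reference only (the array `temps` and its values are hypothetical and not part of this commit):

    import numpy as np
    import pandas as pd
    import xarray as xr

    # A small labeled array: two named dimensions, a datetime coordinate,
    # and arbitrary metadata riding along in .attrs.
    temps = xr.DataArray(
        np.random.randn(4, 3),
        dims=("time", "space"),
        coords={"time": pd.date_range("2014-01-01", periods=4)},
        attrs={"units": "degC"},
    )

    temps.sum("time")                       # operate over a dimension by name
    temps.loc["2014-01-01"]                 # label-based selection ...
    temps.sel(time="2014-01-01")            # ... or the equivalent .sel form
    temps.groupby("time.dayofyear").mean()  # split-apply-combine
    a, b = xr.align(temps[:3], temps[1:], join="outer")  # label-based alignment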

@@ -73,12 +73,12 @@ page](https://docs.xarray.dev/en/stable/contributing.html).
 ## Get in touch

 - Ask usage questions ("How do I?") on
-  [GitHub Discussions](https://github.com/pydata/xarray/discussions).
+  [GitHub Discussions](https://github.com/pydata/xarray/discussions).
 - Report bugs, suggest features or view the source code [on
-  GitHub](https://github.com/pydata/xarray).
+  GitHub](https://github.com/pydata/xarray).
 - For less well defined questions or ideas, or to announce other
-  projects of interest to xarray users, use the [mailing
-  list](https://groups.google.com/forum/#!forum/xarray).
+  projects of interest to xarray users, use the [mailing
+  list](https://groups.google.com/forum/#!forum/xarray).

 ## NumFOCUS

@@ -114,7 +114,7 @@ Licensed under the Apache License, Version 2.0 (the "License"); you
 may not use this file except in compliance with the License. You may
 obtain a copy of the License at

-<https://www.apache.org/licenses/LICENSE-2.0>
+<https://www.apache.org/licenses/LICENSE-2.0>

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,

doc/user-guide/io.rst

Lines changed: 91 additions & 84 deletions
@@ -741,6 +741,65 @@ instance and pass this, as follows:
 .. _Google Cloud Storage: https://cloud.google.com/storage/
 .. _gcsfs: https://github.com/fsspec/gcsfs

+.. _io.zarr.distributed_writes:
+
+Distributed writes
+~~~~~~~~~~~~~~~~~~
+
+Xarray will natively use dask to write in parallel to a zarr store, which should
+satisfy most moderately sized datasets. For more flexible parallelization, we
+can use ``region`` to write to limited regions of arrays in an existing Zarr
+store.
+
+To scale this up to writing large datasets, first create an initial Zarr store
+without writing all of its array data. This can be done by first creating a
+``Dataset`` with dummy values stored in :ref:`dask <dask>`, and then calling
+``to_zarr`` with ``compute=False`` to write only metadata (including ``attrs``)
+to Zarr:
+
+.. ipython:: python
+    :suppress:
+
+    ! rm -rf path/to/directory.zarr
+
+.. ipython:: python
+
+    import dask.array
+
+    # The values of this dask array are entirely irrelevant; only the dtype,
+    # shape and chunks are used
+    dummies = dask.array.zeros(30, chunks=10)
+    ds = xr.Dataset({"foo": ("x", dummies)}, coords={"x": np.arange(30)})
+    path = "path/to/directory.zarr"
+    # Now we write the metadata without computing any array values
+    ds.to_zarr(path, compute=False)
+
+Now, a Zarr store with the correct variable shapes and attributes exists that
+can be filled out by subsequent calls to ``to_zarr``.
+``region`` can be set to ``"auto"``, which opens the existing store and
+determines the correct alignment of the new data with the existing dimensions,
+or to an explicit mapping from dimension names to Python ``slice`` objects
+indicating where the data should be written (in index space, not label space),
+e.g.,
+
+.. ipython:: python
+
+    # For convenience, we'll slice a single dataset, but in the real use-case
+    # we would create them separately, possibly even from separate processes.
+    ds = xr.Dataset({"foo": ("x", np.arange(30))}, coords={"x": np.arange(30)})
+    # Any of the following region specifications are valid
+    ds.isel(x=slice(0, 10)).to_zarr(path, region="auto")
+    ds.isel(x=slice(10, 20)).to_zarr(path, region={"x": "auto"})
+    ds.isel(x=slice(20, 30)).to_zarr(path, region={"x": slice(20, 30)})
+
+Concurrent writes with ``region`` are safe as long as they modify distinct
+chunks in the underlying Zarr arrays (or use an appropriate ``lock``).
+
+As a safety check to make it harder to inadvertently override existing values,
+if you set ``region`` then *all* variables included in a Dataset must have
+dimensions included in ``region``. Other variables (typically coordinates)
+need to be explicitly dropped and/or written in separate calls to ``to_zarr``
+with ``mode='a'``.

 Zarr Compressors and Filters
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
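The new section above boils down to one metadata-only write followed by disjoint ``region`` writes. A minimal end-to-end sketch of that pattern, assuming a local store; the ``write_block`` helper and the use of ``concurrent.futures`` are illustrative choices, not part of the docs:

    from concurrent.futures import ProcessPoolExecutor

    import dask.array
    import numpy as np
    import xarray as xr

    path = "path/to/directory.zarr"

    def write_block(start: int, stop: int) -> None:
        # Each worker writes a slice covering whole 10-element chunks, so
        # the writes touch disjoint Zarr chunks and need no lock.
        block = xr.Dataset(
            {"foo": ("x", np.arange(start, stop))},
            coords={"x": np.arange(start, stop)},
        )
        block.to_zarr(path, region={"x": slice(start, stop)})

    if __name__ == "__main__":
        # Step 1: lay out the store (shapes, dtypes, attrs) without data.
        dummies = dask.array.zeros(30, chunks=10)
        ds = xr.Dataset({"foo": ("x", dummies)}, coords={"x": np.arange(30)})
        ds.to_zarr(path, compute=False)
        # Step 2: fill disjoint regions from separate processes.
        with ProcessPoolExecutor() as pool:
            list(pool.map(write_block, [0, 10, 20], [10, 20, 30]))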

@@ -767,37 +826,6 @@ For example:
 Not all native zarr compression and filtering options have been tested with
 xarray.

-.. _io.zarr.consolidated_metadata:
-
-Consolidated Metadata
-~~~~~~~~~~~~~~~~~~~~~
-
-Xarray needs to read all of the zarr metadata when it opens a dataset.
-In some storage mediums, such as with cloud object storage (e.g. `Amazon S3`_),
-this can introduce significant overhead, because two separate HTTP calls to the
-object store must be made for each variable in the dataset.
-By default Xarray uses a feature called
-*consolidated metadata*, storing all metadata for the entire dataset with a
-single key (by default called ``.zmetadata``). This typically drastically speeds
-up opening the store. (For more information on this feature, consult the
-`zarr docs on consolidating metadata <https://zarr.readthedocs.io/en/latest/tutorial.html#consolidating-metadata>`_.)
-
-By default, xarray writes consolidated metadata and attempts to read stores
-with consolidated metadata, falling back to use non-consolidated metadata for
-reads. Because this fall-back option is so much slower, xarray issues a
-``RuntimeWarning`` with guidance when reading with consolidated metadata fails:
-
-    Failed to open Zarr store with consolidated metadata, falling back to try
-    reading non-consolidated metadata. This is typically much slower for
-    opening a dataset. To silence this warning, consider:
-
-    1. Consolidating metadata in this existing store with
-       :py:func:`zarr.consolidate_metadata`.
-    2. Explicitly setting ``consolidated=False``, to avoid trying to read
-       consolidate metadata.
-    3. Explicitly setting ``consolidated=True``, to raise an error in this case
-       instead of falling back to try reading non-consolidated metadata.

 .. _io.zarr.appending:

 Modifying existing Zarr stores
@@ -856,59 +884,6 @@ order, e.g., for time-stepping a simulation:
     )
     ds2.to_zarr("path/to/directory.zarr", append_dim="t")

-Finally, you can use ``region`` to write to limited regions of existing arrays
-in an existing Zarr store. This is a good option for writing data in parallel
-from independent processes.
-
-To scale this up to writing large datasets, the first step is creating an
-initial Zarr store without writing all of its array data. This can be done by
-first creating a ``Dataset`` with dummy values stored in :ref:`dask <dask>`,
-and then calling ``to_zarr`` with ``compute=False`` to write only metadata
-(including ``attrs``) to Zarr:
-
-.. ipython:: python
-    :suppress:
-
-    ! rm -rf path/to/directory.zarr
-
-.. ipython:: python
-
-    import dask.array
-
-    # The values of this dask array are entirely irrelevant; only the dtype,
-    # shape and chunks are used
-    dummies = dask.array.zeros(30, chunks=10)
-    ds = xr.Dataset({"foo": ("x", dummies)}, coords={"x": np.arange(30)})
-    path = "path/to/directory.zarr"
-    # Now we write the metadata without computing any array values
-    ds.to_zarr(path, compute=False)
-
-Now, a Zarr store with the correct variable shapes and attributes exists that
-can be filled out by subsequent calls to ``to_zarr``.
-Setting ``region="auto"`` will open the existing store and determine the
-correct alignment of the new data with the existing coordinates, or as an
-explicit mapping from dimension names to Python ``slice`` objects indicating
-where the data should be written (in index space, not label space), e.g.,
-
-.. ipython:: python
-
-    # For convenience, we'll slice a single dataset, but in the real use-case
-    # we would create them separately possibly even from separate processes.
-    ds = xr.Dataset({"foo": ("x", np.arange(30))}, coords={"x": np.arange(30)})
-    # Any of the following region specifications are valid
-    ds.isel(x=slice(0, 10)).to_zarr(path, region="auto")
-    ds.isel(x=slice(10, 20)).to_zarr(path, region={"x": "auto"})
-    ds.isel(x=slice(20, 30)).to_zarr(path, region={"x": slice(20, 30)})
-
-Concurrent writes with ``region`` are safe as long as they modify distinct
-chunks in the underlying Zarr arrays (or use an appropriate ``lock``).
-
-As a safety check to make it harder to inadvertently override existing values,
-if you set ``region`` then *all* variables included in a Dataset must have
-dimensions included in ``region``. Other variables (typically coordinates)
-need to be explicitly dropped and/or written in a separate calls to ``to_zarr``
-with ``mode='a'``.

 .. _io.zarr.writing_chunks:

 Specifying chunks in a zarr store
@@ -978,6 +953,38 @@ length of each dimension by using the shorthand chunk size ``-1``:
 The number of chunks on Tair matches our dask chunks, while there is now only a single
 chunk in the directory stores of each coordinate.

+.. _io.zarr.consolidated_metadata:
+
+Consolidated Metadata
+~~~~~~~~~~~~~~~~~~~~~
+
+Xarray needs to read all of the zarr metadata when it opens a dataset.
+In some storage mediums, such as with cloud object storage (e.g. `Amazon S3`_),
+this can introduce significant overhead, because two separate HTTP calls to the
+object store must be made for each variable in the dataset.
+By default Xarray uses a feature called
+*consolidated metadata*, storing all metadata for the entire dataset with a
+single key (by default called ``.zmetadata``). This typically drastically speeds
+up opening the store. (For more information on this feature, consult the
+`zarr docs on consolidating metadata <https://zarr.readthedocs.io/en/latest/tutorial.html#consolidating-metadata>`_.)
+
+By default, xarray writes consolidated metadata and attempts to read stores
+with consolidated metadata, falling back to use non-consolidated metadata for
+reads. Because this fall-back option is so much slower, xarray issues a
+``RuntimeWarning`` with guidance when reading with consolidated metadata fails:
+
+    Failed to open Zarr store with consolidated metadata, falling back to try
+    reading non-consolidated metadata. This is typically much slower for
+    opening a dataset. To silence this warning, consider:
+
+    1. Consolidating metadata in this existing store with
+       :py:func:`zarr.consolidate_metadata`.
+    2. Explicitly setting ``consolidated=False``, to avoid trying to read
+       consolidated metadata.
+    3. Explicitly setting ``consolidated=True``, to raise an error in this case
+       instead of falling back to try reading non-consolidated metadata.
+

 .. _io.iris:

 Iris
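For reference, the three remedies in the relocated warning text correspond to calls along these lines (a sketch; the store path is assumed):

    import xarray as xr
    import zarr

    # 1. Consolidate an existing store in place, then open it normally.
    zarr.consolidate_metadata("path/to/directory.zarr")
    # 2. Skip consolidated metadata entirely (slower on object stores).
    ds = xr.open_zarr("path/to/directory.zarr", consolidated=False)
    # 3. Require consolidated metadata, raising an error if it is missing.
    ds = xr.open_zarr("path/to/directory.zarr", consolidated=True)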

doc/whats-new.rst

Lines changed: 2 additions & 2 deletions

@@ -22,7 +22,8 @@ v2024.06.1 (unreleased)

 New Features
 ~~~~~~~~~~~~
-
+- Allow chunking for arrays with duplicated dimension names (:issue:`8759`, :pull:`9099`).
+  By `Martin Raspaud <https://github.com/mraspaud>`_.

 Breaking changes
 ~~~~~~~~~~~~~~~~
@@ -73,7 +74,6 @@ Bug fixes
   support arbitrary kwargs such as ``order`` for polynomial interpolation (:issue:`8762`).
   By `Nicolas Karasiak <https://github.com/nkarasiak>`_.

-
 Documentation
 ~~~~~~~~~~~~~
 - Add link to CF Conventions on packed data and sentence on type determination in the I/O user guide (:issue:`9041`, :pull:`9045`).

xarray/namedarray/core.py

Lines changed: 6 additions & 1 deletion

@@ -812,7 +812,12 @@ def chunk(
         chunks = either_dict_or_kwargs(chunks, chunks_kwargs, "chunk")

         if is_dict_like(chunks):
-            chunks = {self.get_axis_num(dim): chunk for dim, chunk in chunks.items()}
+            # This method of iteration allows for duplicated dimension names, GH8579
+            chunks = {
+                dim_number: chunks[dim]
+                for dim_number, dim in enumerate(self.dims)
+                if dim in chunks
+            }

         chunkmanager = guess_chunkmanager(chunked_array_type)
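Why the one-liner had to change: ``get_axis_num`` resolves a dimension *name* to a single axis, so with duplicated names only the first matching axis could ever receive a chunk spec, whereas enumerating ``self.dims`` visits every axis. A standalone sketch of the difference, using plain tuples and dicts rather than xarray internals (``tuple.index`` stands in for ``get_axis_num``):

    dims = ("x", "x")  # duplicated dimension name, as in GH8579
    chunks = {"x": 2}

    # Old mapping: keyed per *name*, so only the first "x" axis is found.
    old = {dims.index(dim): chunk for dim, chunk in chunks.items()}
    assert old == {0: 2}  # axis 1 is silently left unchunked

    # New mapping: keyed per *axis*, with the name looked up for each axis.
    new = {axis: chunks[dim] for axis, dim in enumerate(dims) if dim in chunks}
    assert new == {0: 2, 1: 2}  # both "x" axes get a chunk size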

xarray/tests/test_dask.py

Lines changed: 7 additions & 0 deletions

@@ -638,6 +638,13 @@ def counting_get(*args, **kwargs):

         assert count[0] == 1

+    def test_duplicate_dims(self):
+        data = np.random.normal(size=(4, 4))
+        arr = DataArray(data, dims=("x", "x"))
+        chunked_array = arr.chunk({"x": 2})
+        assert chunked_array.chunks == ((2, 2), (2, 2))
+        assert chunked_array.chunksizes == {"x": (2, 2)}
+
     def test_stack(self):
         data = da.random.normal(size=(2, 3, 4), chunks=(1, 3, 4))
         arr = DataArray(data, dims=("w", "x", "y"))
