Skip to content

added 'storage_transformers' to valid_encodings #7540

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

JMorado
Copy link

@JMorado JMorado commented Feb 16, 2023

This PR adds "storage_transformers" to the valid encodings of a variable, thus allowing Zarr stores to be written using the
new sharding storage transformers.

Example of usage:

import xarray as xr
import numpy as np
from zarr._storage.v3_storage_transformers import ShardingStorageTransformer
from zarr._storage.v3 import DirectoryStoreV3

# Dummy dataset
da = xr.DataArray(np.random.randn(1000, 1000), dims=("x", "y"))
ds = xr.Dataset(dict(dummy_var=da))
ds = ds.chunk({"x": 100, "y": 100})

# Sharded store and sharding transformer
store = DirectoryStoreV3("/home/user/dummy-sharded-store.zarr")
transformer = ShardingStorageTransformer("indexed", chunks_per_shard=(5, 1))

# Write sharded store
ds.to_zarr(
    store,
    encoding={"dummy_var": {"storage_transformers": [transformer]}},
    zarr_version=3,
)

@github-actions github-actions bot added io topic-backends topic-zarr Related to zarr storage library labels Feb 16, 2023
@dcherian dcherian requested a review from jhamman February 17, 2023 17:57
Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @JMorado! Would you mind adding a test that utilizes this encoding parameter?

@JMorado
Copy link
Author

JMorado commented Mar 4, 2023

Hi @jhamman,

I have added a test and corrected a problem that I had previously missed. It turns out that a lock must be used to ensure the correct writing of a sharded zarr store. Everything seems to be working as expected now. Here are some comments:

  • Currently, a lock is passed to ArrayWriter even if only one variable is being sharded. While I believe that locks would not be needed to write non-sharded variables, I don't think it is possible to restrict their usage to specific variables, since the zarr store is written as a whole. Any thoughts on this?
  • Related to the previous point, I think the usage of locks could even be made shard-specific, relaxing the writing of different shard files. I also believe this is not currently possible, but let me know if there is a solution.
  • For the test, importing ShardingStorageTransformer is currently a bit awkward, but it can be neated in the future when ShardingStorageTransformer becomes available from the top-level zarr namespace.

Let me know what you think.

@dcherian dcherian requested a review from jhamman March 5, 2023 05:49
@rabernat
Copy link
Contributor

rabernat commented Mar 8, 2023

It's great to see this PR get started in Xarray! Thanks @JMorado!

From the perspective of a Zarr developer, the sharding feature is still highly experimental. The API may change significantly. While the sharding code is released in the sense that it is available deep in Zarr, it is not really considered part of the public API yet.

So perhaps it's a bit too early to be doing this?

@rabernat
Copy link
Contributor

rabernat commented Mar 8, 2023

Regarding locks, I think we need to think hard about the best way to deal with this across the stack. There are a couple of different options:

  • Current status: just use a global lock on the entire array--super inefficient
  • A bit better: use per-variable locks
  • Even better: have locks at the shard level. This would allow concurrent writing of shards
  • Alternative which accomplishes the same thing: expose different virtual chunks when reading vs. writing. When writing, the writer library (e.g. Xarray or Dask) would see the shards as the chunks (with a lower layer of the stack handling breaking the shard down into chunks). When reading, the individual, smaller chunks would be accessible.

Note that there are still some deep inefficiencies in the way zarr-python writes shards (see zarr-developers/zarr-python#1338). I think we should be optimizing things at the Zarr level first, before implementing workarounds in Xarray.

@dcherian
Copy link
Contributor

dcherian commented Mar 8, 2023

Does it makes sense to create a new backend in a new project to enable experimentation?

@jhamman
Copy link
Member

jhamman commented Jan 23, 2025

I think we can close this in favor of #9948. Thanks @JMorado for the early attempt here!

@jhamman jhamman closed this Jan 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants