
Added consolidated metadata to spec #309

Open · wants to merge 5 commits into main

Conversation

@TomAugspurger commented Aug 23, 2024

This PR adds a new optional field to the core metadata for consolidating the metadata of all child nodes under a single object. The motivation is similar to consolidated metadata in Zarr V2: without consolidated metadata, the time to load metadata for an entire hierarchy scales linearly with the number of nodes. This can be costly, especially for large hierarchies served over HTTP from a remote storage system (like Blob Storage).

The primary points to discuss:

  1. This currently proposes storing the consolidated metadata in the root zarr.json (kind="inline"). For *very* large hierarchies, this will bloat the size of the root zarr.json, slowing down operations that just want to open the metadata for the root. (A sketch of the proposed layout follows this list.)
  2. This overlaps slightly with Explicitly listing groups/arrays inside group metadata? #284, which proposes to store the child paths (but not the full metadata) in the zarr.json. That might be a nice alternative to consolidated metadata for very large hierarchies (you could do an initial read to get the list of nodes, and then load the metadata for each child concurrently).
  3. Like Zarr v2, consolidated metadata introduces the possibility of "inconsistent" metadata between the consolidated and non-consolidated forms. Should the spec take any stance on how to handle this? I've currently worded things to say that readers should always use the consolidated metadata if it's present.
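
For concreteness, a root zarr.json with inline consolidated metadata would look roughly like the sketch below (a sketch only; the field names follow the examples later in this thread, and the must_understand flag is discussed further down):

// zarr.json (sketch of kind="inline" consolidated metadata)
{
    "zarr_format": 3,
    "node_type": "group",
    "consolidated_metadata": {
        "kind": "inline",
        "must_understand": false,
        "metadata": {
            "child": { "zarr_format": 3, "node_type": "group", ... },
            "child/array": { "zarr_format": 3, "node_type": "array", ... }
        }
    }
}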

Closes #136

xref zarr-developers/zarr-python#2113, which is a completed implementation in zarr-python, and zarrs/zarrs#55, a draft implementation in zarrs.

@joshmoore (Member) commented Aug 27, 2024

Going back to #136 (comment) (to ZEP or not to ZEP), I think this could potentially use some discussion around whether there is an opportunity to make the consolidated metadata one implementation of a wider interface.

The most general abstraction I've been considering is "metadata loader" (similar to the storage transformer that was initially proposed for sharding). Basically, could we have extensible (i.e. you can write your own) ways of looking up metadata that would be registered in the zarr.json?

I owe you a better description but to get them down on paper before I get inundated again, here are some similar requirements/features that I could see:

  • load metadata from .zmetadata (to be backwards compatible)
  • load metadata from ro-crate-metadata.json currently under investigation in NGFF land
  • load metadata from a database, binary or even encrypted file
  • hell, even yaml.
  • store consolidated metadata per hierarchy level (rather than all at the top), suggested by @DennisHeimbigner
  • ...

In my mind, if one of these alternative mechanisms is active, it could be the sole source of information, rather than duplicating the metadata. Some of these are more useful for user attributes rather than the entire metadata, and certainly several may be more difficult without the concept of the root metadata, but it feels like this is an opportunity for us to go beyond the original consolidated metadata workaround.

@TomAugspurger (Author)

Thanks Josh! I like the idea of metadata loading being an extension point.

We'll want to hash out some details before merging this, since a top-level consolidated_metadata key in the zarr.json would clash with this goal. Roughly, I'm hoping we can have something like

// zarr.json
{
  "metadata_loader": {
    "name": "consolidated_metadata",
    "kind": "inline",
    "metadata": { ... }
  }
}

We could have the name be inline_consolidated_metadata and drop kind if we wanted. The key thing is having a field in the serialized representation we could use to know how we should go about loading the rest of the metadata.

I'll play with that a bit to see how it feels in zarr-python.

@d-v-b (Contributor) commented Aug 28, 2024

Thanks for working on this! What do you think about multiple instances of metadata consolidation in the same zarr hierarchy? E.g., Group A contains Group B, Group B generates consolidated metadata, then Group A generates consolidated metadata, which ends up containing 2 models of the hierarchy rooted at Group B. Because the consolidated metadata is part of the Group metadata, consolidating recursively will end up with a lot of repeated content. A few options:

  • just let this happen -- nest consolidated metadata at your own risk
  • don't let this happen -- disallow nested metadata consolidation
  • explicitly exclude the consolidated_metadata key from the metadata that gets consolidated. We could achieve this by defining the "metadata" of an array / group to be "everything except the consolidated_metadata key, if present", and then metadata consolidation is defined as an aggregation of said array / group metadata. Or we just say that the consolidated_metadata key should be skipped 🤷 .

The last option seems best to me; curious to hear other perspectives.

@TomAugspurger (Author)

Coming back to @joshmoore's "metadata loader" concept, the thing we need to solve now is what sort of metadata to put in the Zarr Group object to indicate "here's how you find more stuff about this hierarchy". And I'll focus that a bit more to "listing files on blob storage is messy / slow, so I want to avoid that if possible".

I'm going to make a couple of statements that I think are true:

  1. The user has some sort of Group object
  2. That Group object was loaded using some kind of Store, which knows how to interact with the "outside world" that's providing the metadata (files on local disk / object store, a database, objects in memory, etc.)

It's that interface with the store that's tripping me up when trying to generalize this. Why do we need to standardize that in the Zarr Group object? By the time the user has loaded some zarr Group, they have already specified their connection to the metadata store, right? That's how they loaded the group in the first place.

With something like the current proposal, we have a way for the Store concept (reading local / network files, a JSON API endpoint, etc.) to provide additional metadata for the common case of "also give me information about my children".

In short, I think my claim is that the "metadata loader" concept is mostly orthogonal to the consolidated metadata concept. Consolidated metadata is all about reducing the number of trips to the "Store", by giving a place for a Group to hold information about its children. I think I'm comfortable that a top-level consolidated_metadata key in the Group object will not conflict with the desire to load metadata from other sources.


One thing that probably should be addressed now / soon is the idea of storing subsets of the metadata. I'd propose something like a depth field on the consolidated_metadata object: an integer stating how far down the hierarchy has been consolidated (sketched after this list). I think the two most common values will be

  1. None / not specified: Everything is consolidated. Loading any arbitrarily nested child group can be done without additional I/O
  2. 1: indicating that just my immediate children have been consolidated. Loading an immediate child can be done without I/O, but loading any of its children will require I/O.
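
A minimal sketch (the depth field here is only proposed, not part of the current text):

// zarr.json (sketch of the proposed depth field)
{
    "consolidated_metadata": {
        "depth": 1,          # only immediate children are consolidated
        "metadata": { ... }
    }
}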

@TomAugspurger (Author)

Sorry for missing your questions, @d-v-b.

What do you think about multiple instances of metadata consolidation in the same zarr hierarchy?

I like the framing (which I think was yours originally?) of thinking about consolidated metadata as a cache. Since you can have arbitrarily nested Groups / Arrays, IMO you should be able to have arbitrary consolidated metadata at any level of the hierarchy and the spec shouldn't do anything to dissuade that.

As for the question about potentially duplicating data, I think that we won't have that in zarr-python. Working through your example:

Group A contains Group B, Group B generates consolidated metadata, then Group A generates consolidated metadata, which ends up containing 2 models of the hierarchy rooted at Group B. Because the consolidated metadata is part of the Group metadata, consolidating recursively will end up with a lot of repeated content

I think that we'll have "repeated content" across the files. But we won't have repeated content within a single zarr.json file (i.e. I don't think that the nested nodes should repeat their child consolidated_metadata; I call this out below under "note"):

Given an example like

a/     # group
  b/   # group
    x  # array
    y  # array

The zarr.json for group b will look like

{
    "zarr_format": 3,
    "node_type": "group",
    "consolidated_metadata": {
        "metadata": {
            "x": {...},
            "y": {...}
        },
        ...
    }
}

And the zarr.json for a will look like

{
    "zarr_format": 3,
    "node_type": "group",
    "consolidated_metadata": {
        "metadata": {
            "b": {...},  # note: this probably shouldn't have consolidated_metadata
            "b/x": {...},
            "b/y": {...},
        },
        ...
    }
}

A note that this might be different from how zarr v2 treated consolidated metadata. In zarr-developers/zarr-python#2113 (comment), I think I found that zarr v2 only supported consolidated metadata at the root of a Store, not nested paths within a store. I think this might be an area where we want to diverge from zarr v2.
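
For reference, zarr v2's consolidated metadata is a separate .zmetadata document at the store root, shaped roughly like this (a sketch of the v2 convention; the keys map node paths to their metadata documents):

// .zmetadata (zarr v2, at the root of the store)
{
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": { "zarr_format": 2 },
        "b/.zgroup": { "zarr_format": 2 },
        "b/x/.zarray": { ... },
        "b/x/.zattrs": { ... }
    }
}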

@TomAugspurger (Author)

I think this should be about ready to go.

@LDeakin I saw that you at least started on this at zarrs/zarrs#55. Anything come up in your implementation that might adjust what you want to see in the spec?

@LDeakin (Member) commented Oct 2, 2024

@LDeakin I saw that you at least started on this at LDeakin/zarrs#55. Anything come up in your implementation that might adjust what you want to see in the spec?

Nope, the spec looks good, and no issues popped up. Thanks for writing it.

@joshmoore (Member)

Sorry for the slow response, @TomAugspurger. To be clear lest it not come through here, I'm super excited to have CM going into the spec!

TomAugspurger commented:
It's that interface with the store that's tripping me up when trying to generalize this. Why do we need to standardize that in the Zarr Group object? By the time the user has loaded some zarr Group, they have already specified their connection to the metadata store, right? That's how they loaded the group in the first place.

I definitely agree that this is a tricky (though critical) part of the bootstrapping. The use case I tend to be most concerned with is everything under attributes (i.e., "metadata" is an overloaded term), which is more tractable. Other use cases include: some of the group info is there; and none of the group info is there (i.e. it's solely a redirect). Is it possible that you're saying there's no use case for the some-info case?

In short, I think my claim is that the "metadata loader" concept is mostly orthogonal to the consolidated metadata concept. Consolidated metadata is all about reducing the number of trips to the "Store", by giving a place for a Group to hold information about its children.

Can you help me see why it's orthogonal? My hope was that that would be one of the things this interface could allow a user to do.

I think I'm comfortable that a top-level consolidated_metadata key in the Group object will not conflict with the desire to load metadata from other sources.

Except then it's already done in one specific way, no? With the metadata, e.g., currently being duplicated, no?

One thing that probably should be addressed now / soon is the idea of storing subsets of the metadata. I'd propose something like a depth field on the consolidated_metadata object, which is an integer stating how far down has been consolidated. I think the two most common will be...

Agreed that this is another "parameter" for the consolidation, and I imagine there will be more.

I think this should be about ready to go.

Not that I necessarily agree, but what are the remaining (related) steps from your point-of-view? From my side, I'd be for trying to encourage some more feedback from implementers (huge kudos to @LDeakin)

@TomAugspurger (Author)

Thanks @joshmoore.

I definitely agree that this is a tricky (though critical) part of the bootstrapping[...]

Sorry, I don't quite understand this set of questions and the "some" / "no" info concepts. Do you have an example of what you have in mind?

Can you help me see why it's orthogonal? My hope was that that would be one of the things this interface could allow a user to do.

To me it just goes back to the original motivation for consolidated metadata: A way for a Group in a Zarr hierarchy to know a bit about its children (to avoid potentially costly repeat trips to a remote Store for common scenarios like "load all the metadata for arrays in this Group / dataset"). How exactly you load that initial Group (your "metadata loader" concept) doesn't really bear on the question of "where do I store information about children", right?

Except then it's already done in one specific way, no? With the metadata, e.g., currently being duplicated, no?

Answered above, I think. All we're specifying here is how to present information about your children; no constraints on how that information gets there. In particular, the spec takes no opinion on how the consolidated metadata is actually stored. A database presenting a Zarr API could construct consolidated metadata on the fly if it wanted (and so no duplication).

Agreed that this is another "parameter" for the consolidation, and I imagine there will be more.

This is the only one I can imagine now. If you have any others in mind, let me know. I'm not worried about making a depth field backwards compatible, but if there are breaking changes we have in mind it'd be best to address them now.

what are the remaining (related) steps from your point-of-view?

Nothing else. We're merging the zarr-python implementation today.


Overall, I'd reiterate that this PR has a narrow scope: a standardized way for Groups to know about their children. Effectively, a cache for what you'd get from listing nodes under some prefix, which we already know is needed for serving Zarr data over object storage in a reasonable time.

@joshmoore (Member)

Ok. Bear with me @TomAugspurger, I find myself unfortunately playing bad cop across the repo. First some immediate responses to your points and then I'll take a step back and try to explain.


I don't quite understand this set of questions and the "some" / "no" info concepts. Do you have an example of what you have in mind?

Looking back, I think I misunderstood you. Let's hold off on this for the moment.

How exactly you load that initial Group (your "metadata loader" concept) doesn't really bear on the question of "where do I store information about children", right?

I thought it did, but you may have found a hole in the plan. Ultimately, I'm trying to find an underlying abstraction for what you've built. But more on that in a second.

A database presenting a Zarr API could construct consolidated metadata on the fly if it wanted (and so no duplication).

I think this helps me see how the Store is becoming intertwined. Maybe "MetadataLoader" is the wrong metaphor, but more "MetadataTransformer" to go with the "StorageTransformer". (If, however, CM can be achieved with the StorageTransformer, 👍) The difference in my mind is the loader/mapper/transformer is config to tell us what keys to look up (whether in databases, filesystems, or buckets) rather than the existing JSON objects.

(and so no duplication).

except the current definition would still have 2 keys for the same information even if the storage and/or retrieval has been optimized, right?

we have in mind it'd be best to address them now.

This concerns me, more to that below.

We're merging the zarr-python implementation today.

And in fact, now it is merged. Congrats on that and all the work, but to be fair I will ignore it for this discussion: though a great validation of the work here, it's the cart before the horse.


Ok, finally, to back up:

Overall, I'd reiterate that this PR has a narrow scope: a standardized way for Groups to know about their children.

I understand the goal to get CM in ASAP. However, this is introducing a key into the metadata that could be with us for many years. Spec work is often about looking beyond the immediate scope, especially if we are asking others to implement. So a few, partially overlapping strategies to try not to get in your way:

Broadening the scope

This is what I was trying above and with my suggestion (somewhere?!) of going for a ZEP. By identifying an abstraction that would cover more use cases and that we would all feel comfortable having in https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#extension-points, I was hoping we would have more buy-in for the change. (Obviously, that's a big ask.)

Lowering risk

The addition of extensions to Zarr v3 is still pretty untested. What you are trying to do should be made possible: dev wants to hack out a new feature, implement it and move on. 👍 Your use of "must_understand" does the minimum required by the text of the spec, but what it doesn't do is allow us to change anything about this proposal down the road. The only way forward I can really see is to move to "Yet-Another-Key" if we realize that CM is broken. @rabernat brought up https://stac-extensions.github.io/ as a model that we should match but if you look at https://github.com/radiantearth/stac-spec/blob/master/extensions/README.md#extending-stac, you'll see that some semantic for separating suggestions like yours (there, via key prefixes ex:...) would be needed.

Preparing for change

If we see the need for an evolution of the configuration, it might make sense to introduce versioning internally.

@LDeakin (Member) commented Oct 10, 2024

@joshmoore

Your use of "must_understand" does the minimum required by the text of the spec, but what it doesn't do is allow us to change anything about this proposal down the road

This is not entirely true for consolidated_metadata. If a breaking change were introduced, an implementation can fail to understand it and parse it as an unknown metadata field that it need not understand. It can fall back to reading non-consolidated metadata until it is updated to support the new revision.

If we see the need for an evolution of the configuration, it might make sense to introduce versioning internally.

If the configuration changes, an implementation can support that without needing an explicit version key.

@TomAugspurger (Author) commented Oct 10, 2024

(and so no duplication).

except the current definition would still have 2 keys for the same information even if the storage and/or retrieval has been optimized, right?

That's up to whatever storage is providing the Zarr metadata. In the case of the consolidated metadata produced by zarr-python, which would typically be written to a plain file / object store, yes. But that hypothetical database could store it however it wants.

Stepping back: why is duplicating the metadata a problem? That's fundamental to this. Maybe substitute "cache" for "duplicate" and see whether it still causes any concerns.

So a few, partially overlapping strategies to try not to get in your way:

"Broadening the scope" and "Lowering the risk" seem pretty much directly in tension with each other :) I'll push back pretty strongly on broadening the scope; this is deliberately scoped to narrowly solve the "open group and immediately inspect its children" use case, and I think that's a good thing.

As for lowering the risk, I think we can point to zarr-python 2.x's consolidated metadata. It seems to have worked pretty well, and this spec is essentially identical (aside from moving it to the single zarr.json object, along with the attributes). We've already got a lot of history showing that something like this is useful and practical.

The addition of extensions to Zarr v3 is still pretty untested[...]

I think stac's extension mechanism is much nicer than what zarr v3 provides today. When I first started on consolidated metadata, I actually started on the extension mechanism. JSON schemas for the core metadata plus a top-level zarr_extensions array would get you most of the way there. I think that's worth doing. But I don't think it needs to block this work, right? That's something that could be done independently. I can write that up as an issue, based on my experience with STAC, if you'd like.

If we see the need for an evolution of the configuration, it might make sense to introduce versioning internally.

I think that would be solved by the zarr spec having a better extension mechanism (previous point) or by the consolidated_metadata object having some kind of version key. I'm fine with either, but would lean slightly towards making a better extension mechanism. It seemed a bit strange to me to have a piece of the core metadata evolve separately from the rest of the metadata, which is why I think an extension system like STAC's makes more sense.

@joshmoore (Member)

General 💯 for thinking through the future-proofing with you guys, so super brief comments for 00:00 CET:

If a breaking change were introduced, an implementation can fail...

My gut reaction is to say we're putting too much on the back of must_understand, but I see what you're saying: if this element is essentially optional then implementations can (MUST) be tolerant of changes. It doesn't feel like an optimal strategy though.

"Broadening the scope" and "Lowering the risk" seem pretty much directly in tension with each other :)

Largely, but you could of course draft a broader-scoped extension externally first, too ;)

As for lowering the risk, I think we can point to zarr-python 2.x's consolidated metadata... a lot of history showing that something like this is useful and practical.

But not across implementations, and I have the feeling it took a good deal of heuristics in consuming libraries like xarray to get the creation/detection in place. It's a given that we need a way to tune metadata loading for different cases, but I would hope we could also improve on it.

JSON schemas for the core metadata plus a top-level zarr_extensions array would get you most of the way there

nods. Certainly something that doesn't give the sense that we might need a series of 3.x versions.

But I don't think it needs to block this work, right?

If we think we can sufficiently lower the risk then agreed, but as currently written my instinct is we're not there yet (though perhaps @LDeakin's suggestion will find traction).

a better extension mechanism

❤️

@TomAugspurger (Author) commented Oct 11, 2024

It's a given that we need a way to tune metadata loading for different cases, but I would hope we could also improve on it.

Maybe, but I try not to get too caught up in hypotheticals. We have a very real use case where the layout of Zarr metadata on a remote file system causes issues. So we have a tailored solution for that. Why do we need to complicate it to handle other stuff (stuff that at least I don't have a clear grasp on)? If we come up with something better that does... something related to metadata loading, then great. Let's add that to the spec. And it can use consolidated_metadata to store information about its children (or not, since it's optional).


And just to reiterate again: this is a small change. A change to the core metadata, sure, but it's completely optional. And it really isn't inventing any new concepts, both because the use-case was discovered (and solved!) by users of zarr-python, and because all of the complicated objects going in the metadata object are already in the spec since we're just storing a mapping of key: node.

If that's not enough de-risking, then can you lay out exactly what you're looking for?

@TomAugspurger (Author)

#316 has stuff on the extension side. Happy to discuss more there, but I don't think the limitations of Zarr's extension story should bear on this PR.

@joshmoore (Member)

The reason I think we're in this discussion is that as things stand this adds a feature (i.e., if we were under semver a minor release). The two paths I see are:

  • make this less than a minor release but that embroils you with the extension story; or
  • get more buy-in from implementers so we have the confidence we need, i.e. a ZEP.

Let me try to give one scenario of its impact:

  • implementation A understands CM and creates a dataset with the configuration
  • implementation B does not grok CM, opens the dataset, modifies metadata
  • implementation A opens the dataset and does not see the modification

This is just a part of caches being hard but I think previously was less of an issue since:

  • there were fewer writing implementations; and
  • it was a user-driven option, i.e. had to be passed to the library to activate.

Users were then responsible for cleaning things. But that no longer holds, does it?

@d-v-b (Contributor) commented Apr 11, 2025

However, I’d like to highlight a different angle—specifically, the impact of metadata consolidation on configuration management when Zarr products are stored as files in a conventional filesystem (i.e. not managed via a database or storage service).

Most usage of consolidated metadata is in an unmanaged storage service. People generally use this feature on high-latency cloud storage backends like s3 or gcs, which are "plain" object stores. In fact, if you do have a managed storage service, there's probably no need for a separate consolidated metadata document because you could implement the same functionality dynamically in your managed storage service. After all, consolidated metadata is effectively a caching layer on top of zarr-aware IO operations.

As for your other two concerns, I think it's an important piece of context that consolidated metadata is typically used for data that is written once, and read many times. If the zarr hierarchy is in fact immutable, then there are no changes to track, or issues of desynchronization between the consolidated metadata and the actual metadata. I think using consolidated metadata in a context where a zarr hierarchy is being actively updated by multiple users is probably not a good idea.

But it's also true that zarr doesn't provide good tools for ensuring immutability (anyone with the right permissions can open and modify any zarr metadata). So a hierarchy that should be immutable may not in fact remain so, and the only way to find a discrepancy between the actual hierarchy and the consolidated image of that hierarchy is by directly checking.

Has the alternative approach—used by some existing applications—been considered, where instead of merging metadata into one file, metadata from individual groups is collected into a dedicated directory? This would still allow grouping but preserve separation and traceability.

I think I would need to see a concrete example of this. Where would this directory be stored relative to the zarr hierarchy, and what would be in it?

@mzundo commented Apr 11, 2025

Thanks for the detailed response—really helpful context.

Most usage of consolidated metadata is in an unmanaged storage service…

To clarify, the use case I’m referring to is not about accessing data through infrastructure APIs. Instead, it’s focused on scientists and engineers exchanging Zarr products as files—including both data and metadata—and using software to locally process, generate, or modify them. While the data might originate from or eventually return to an infrastructure, the key concern is what happens in between—during manual or tool-based manipulation outside of managed environments.

If the Zarr hierarchy is in fact immutable…

I agree that consolidation by merging individual metadata files works well in an immutable context. My concern is supporting hybrid workflows where data might be consolidated for efficiency when stored on infrastructure, but also accessed or modified locally. In such cases, the merging approach can lead to desynchronisation or inconsistencies, especially when changes are made to the hierarchy without regenerating the consolidated metadata.

Zarr doesn’t provide good tools for ensuring immutability…

Exactly—this is one of the reasons why workflows relying on strict immutability can be fragile in practice. Without enforced guarantees, even minor edits can break consistency between group-level metadata and its consolidated view.

I think I would need to see a concrete example of this…

A concrete example is provided by the Pangeo.io ERA5 dataset, which uses a namespace-style approach for metadata consolidation. Each group’s metadata is stored in a separate JSON file, named after the group (e.g., group_name.json), and all are stored in a dedicated group at the root level (typically _meta/). This structure logically looks like:

[screenshot of the _meta/ directory layout]
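
Reconstructing that layout from the description above (names illustrative):

store/              # root of the hierarchy
  _meta/            # dedicated metadata group
    group_a.json    # metadata for group_a
    group_b.json
  group_a/          # data
  group_b/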

@d-v-b (Contributor) commented Apr 11, 2025

A concrete example is provided by the Pangeo.io ERA5 dataset, which uses a namespace-style approach for metadata consolidation. Each group’s metadata is stored in a separate JSON file, named after the group (e.g., group_name.json), and all are stored in a dedicated group at the root level (typically _meta/). This structure logically looks like:

[screenshot of the _meta/ directory layout]

I think this centralized metadata layout was defined in an early version of the zarr v3 spec, but was ultimately rejected in favor of the current decentralized layout. I don't know the exact reasons for that decision, but I think it was motivated by a desire to keep individual arrays and groups self-describing. For arrays and groups to be self-describing, their metadata documents are stored locally, instead of aggregated in a directory at the root of the hierarchy.

To clarify, the use case I’m referring to is not about accessing data through infrastructure APIs. Instead, it’s focused on scientists and engineers exchanging Zarr products as files—including both data and metadata—and using software to locally process, generate, or modify them. While the data might originate from or eventually return to an infrastructure, the key concern is what happens in between—during manual or tool-based manipulation outside of managed environments.

I think on local storage, IO operations are fast enough that there is very little need for consolidated metadata. Especially with the performance improvements we have made to zarr-python 3 for group indexing, I think consolidated metadata on the local file system offers very little value.

@mzundo commented Apr 11, 2025

I think this centralized metadata layout was defined in an early version of the Zarr v3 spec but was ultimately rejected in favor of the current decentralized layout. I don’t know the exact reasons for that decision, but I think it was motivated by a desire to keep individual arrays and groups self-describing. For arrays and groups to be self-describing, their metadata documents are stored locally, instead of aggregated in a directory at the root of the hierarchy.

I actually agree that keeping groups self-describing is very important. That said, maybe there’s a middle ground—where consolidated metadata could work by reference (e.g. by path) to the individual group metadata, rather than duplicating all the content. This might catch two birds with one stone: preserving self-description while enabling centralised access when needed.

I think on local storage, IO operations are fast enough that there is very little need for consolidated metadata. Especially with the performance improvements we have made to zarr-python 3 for group indexing, I think consolidated metadata on the local file system offers very little value.

Makes sense. My main concern is less about performance and more about managing inconsistencies. If a reference-based consolidation approach were adopted, that issue could be mitigated. But if duplication is necessary, then I think it would be helpful for the spec to explicitly define a rule—for example: in case of discrepancies, the local (group-level) metadata is considered the source of truth. This would allow tools to resolve conflicts programmatically by rebuilding the consolidated view as needed.

@rabernat (Contributor) commented Apr 11, 2025

@mzundo -- these are all great points. I want to point out that the new Icechunk storage engine addresses basically all of your concerns:

  • Strict serializable version history
  • ACID transactions ensure consistent updates
  • Transparent diffs for both metadata and chunks for every snapshot
  • Consolidation of all metadata into a single file. No redundant storage of information, no inconsistency possible.

AND it does this all at the store level, meaning that no application code has to change for interacting with Zarr datasets.

We'd be happy to help you get started over at https://github.com/earth-mover/icechunk.

@TomAugspurger (Author) commented Apr 11, 2025

Agreed that icechunk probably fits your needs better.

This PR is primarily standardizing what's already out there for Zarr v2, updated to Zarr v3's conventions.

But if duplication is necessary, then I think it would be helpful for the spec to explicitly define a rule—for example: in case of discrepancies, the local (group-level) metadata is considered the source of truth

I think https://github.com/zarr-developers/zarr-specs/pull/309/files#diff-39bde03e0d9bdcd6ea478965c1f6ec50785744ce98ec8f3a1d5c56218041f17fR851 addresses that.

@mzundo commented Apr 11, 2025

@mzundo -- these are all great points. I want to point out that the new Icechunk storage engine addresses basically all of your concerns

In my case, I’m working within an existing setup based on vanilla Zarr v2, with plans to eventually move to Zarr v3. There’s a strong preference to minimise changes, so I’ve been trying to understand what improvements or features are already planned or supported in Zarr v3. If some of these concerns were addressed natively in the spec, it would be much easier to consider transitioning—rather than adopting an entirely separate storage engine.

@TomAugspurger (Author)

I personally don't plan to push for changing consolidated metadata from how it behaved in zarr-python 2.x.

All the concerns around things like consistency and atomic updates are complex, and are better-handled by icechunk.

@mzundo commented Apr 11, 2025

But if duplication is necessary, then I think it would be helpful for the spec to explicitly define a rule—for example: in case of discrepancies, the local (group-level) metadata is considered the source of truth

I think https://github.com/zarr-developers/zarr-specs/pull/309/files#diff-39bde03e0d9bdcd6ea478965c1f6ec50785744ce98ec8f3a1d5c56218041f17fR851 addresses that.

Thanks for the pointer!

The change you referenced defines the reader’s behaviour—i.e. that if consolidated metadata is available, it should be used. However, it doesn’t explicitly cover what should happen in the presence of discrepancies between consolidated and group-level metadata.

Consider the case where a library opens a Zarr store, reads both the consolidated and local metadata, and detects a mismatch. Beyond (possibly) reporting an error, it would be helpful for the spec to clearly define the source of truth.

Following the principle of self-describing groups, I would argue that the group-level metadata should take precedence, not the consolidated version.

@mzundo commented Apr 11, 2025

I personally don't plan to push for changing consolidated metadata from how it behaved in zarr-python 2.x.

I’ve always seen the .zmetadata duplication approach in Zarr v2 as a performance-driven workaround/hack, rather than a clean, uniquely defined data model.

Wouldn’t it be worth considering a hybrid approach—for example, consolidation via the root JSON, but by reference rather than by value? That way, performance could still be improved without compromising clarity, traceability, or the principle of self-describing groups. It could even be made optional, controlled by a flag, e.g. consolidation="reference" or consolidation="value", to allow flexibility depending on the use case.
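
To make the "reference" flavour concrete, here is a purely hypothetical sketch (no such kind value or path mapping exists in the spec or any implementation; it is invented for illustration):

// zarr.json (hypothetical consolidation by reference)
{
    "zarr_format": 3,
    "node_type": "group",
    "consolidated_metadata": {
        "kind": "reference",
        "metadata": {
            "b": "b/zarr.json",      # paths to the authoritative documents
            "b/x": "b/x/zarr.json"
        }
    }
}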

@TomAugspurger (Author) commented Apr 11, 2025

However, it doesn’t explicitly cover what should happen in the presence of discrepancies between consolidated and group-level metadata.

Consider the case where a library opens a Zarr store, reads both the consolidated and local metadata, and detects a mismatch.

That's outside the scope of the spec IMO. zarr-python, for instance, won't want to read non-consolidated metadata (reading all the non-consolidated nodes would defeat the point of consolidated metadata!). So there's no opportunity to detect a conflict.

consolidation via the root JSON, but by reference rather than by value?

I'm not quite sure what this would look like. What the consolidated metadata provides in zarr-python 2.x, and what's proposed here, lets you get the metadata of the entire hierarchy in a single read / HTTP request.

@mzundo commented Apr 14, 2025

That's outside the scope of the spec IMO. zarr-python, for instance, won't want to read non-consolidated metadata (reading all the non-consolidated nodes would defeat the point of consolidated metadata!). So there's no opportunity to detect a conflict.

My understanding was that a key improvement introduced by the Zarr v3 specification was to abstract the format definition (the data model) from any specific implementation, removing the tight coupling with Python — both in terms of data types and how information is interpreted.

One of the main reasons we are interested in Zarr is precisely because we may not want (or be able) to use Python. To enable such use cases, the data format must be intrinsically unambiguous and self-consistent. That is: it should either prevent inconsistency by design or clearly define expected behaviour in cases of potential duplication or conflict.

Only by doing this can Zarr become a truly language-agnostic, portable format — one that behaves the same across implementations, without relying on the assumptions/behaviour of any particular library.

Your reply illustrates the problem: you refer to the behavior of a specific implementation (zarr-python) to justify why the spec doesn’t need to define how conflicts (e.g. between consolidated and non-consolidated metadata) are handled. But that actually confirms my concern: if the expected behavior is not defined by the spec, then different implementations will (reasonably) behave differently.

In other words, the spec should not leave core behaviors to be inferred from a single implementation’s choices. The behavior of a Zarr-compliant reader must be defined by the specification — not by what zarr-python happens to do.

@TomAugspurger (Author)

@mzundo could you propose a specific change to the wording of the spec? And specifically why https://github.com/zarr-developers/zarr-specs/pull/309/files#diff-39bde03e0d9bdcd6ea478965c1f6ec50785744ce98ec8f3a1d5c56218041f17fR851 doesn't handle your concerns? If consolidated metadata is there, readers should use it.

Your reply illustrates the problem: you refer to the behavior of a specific implementation (zarr-python)

Don't read too much into that, I just happen to help maintain zarr-python. With that hat on, I can say we wouldn't want a change to the spec that would make consolidated metadata useless (since detecting a conflict on read requires reading both consolidated and non-consolidated metadata). @LDeakin can speak for zarrs, but I suspect that would be a non-starter there too.

@rabernat (Contributor)

The core problem here is that the Zarr spec, and the requirements which motivated it, simply do not address consistency issues in a consistent (no pun intended) way. The spec describes the static format on disk, no more and no less. It does not explain how to create, update, append, delete, etc. All of these issues are left to implementations.

Consolidated metadata is, IMO, a poorly designed workaround (a hack, as you said) which exacerbates, rather than solves, Zarr's consistency problems. I think the best we can do is stipulate in the extension wording that implementations which use consolidated metadata must ensure that they write consolidated metadata in a way that is consistent with the true state of the store.

Our solution at Earthmover was to keep the spec unchanged, and to address consistency at the store level, by creating a specialized key-value store built for Zarr that supports ACID transactions. This is Icechunk.

A future Zarr spec (e.g. Zarr 4) might choose to build consistency into its requirements and potentially leverage Icechunk's innovations to create a more powerful core format.

@d-v-b (Contributor) commented Apr 14, 2025

why the spec doesn’t need to define how conflicts (e.g. between consolidated and non-consolidated metadata) are handled.

I don't think we need guidance from the zarr spec on how to handle conflict between consolidated and non-consolidated metadata. Consolidated metadata only has value when the non-consolidated metadata is inaccessible; the non-consolidated metadata is the ground truth, and should be used when available.

@TomAugspurger (Author)

I think the best we can do is stipulate in the extension wording that implementations which use consolidated metadata must ensure that they write consolidated metadata in a way that is consistent with the true state of the store.

Consolidated metadata is, IMO, a poorly designed workaround (a hack, as you said) which exacerbates, rather than solves, Zarr's consistency problems

As your "exacerbates" notes: Zarr itself can't say much of anything about ACID-style consistency, and that affects the core of the spec, not just consolidated metadata. If others are OK with it, I'd be fine with a section in the core Zarr spec noting the issues Zarr faces with consistency (depending on the Store) and maybe linking to icechunk as a project that does address this.

And for this PR, I think we can focus on the other type of "consistency" that's unique to consolidated metadata: that the duplicated metadata in the consolidated metadata can get out of sync with the non-consolidated metadata, indefinitely. I'm happy to add or modify language beyond "use consolidated if it's present". On the write side, I guess we can say something around consolidating the "true state of the store". That's pretty close to "do it right" IMO, but I'm probably closer to the implementation than most...

Or we just close the PR if we think formalizing consolidated metadata in a world with icechunk just isn't sensible. Perhaps most users of consolidated metadata would be better off adopting icechunk (it's what I'd recommend in most cases).

@DennisHeimbigner

I believe the idea of making icechunk an explicit part of the spec is fraught with problems. There are other possibilities vis-à-vis consolidated metadata: lazy evaluation, for example. What I am suggesting is that you might want to consider a wide range of alternatives to consolidated metadata.

@mzundo commented Apr 15, 2025

d-v-b wrote:

I don't think we need guidance from the zarr spec on how to handle conflict between consolidated and non-consolidated metadata. Consolidated metadata only has value when the non-consolidated metadata is inaccessible; the non-consolidated metadata is the ground truth, and should be used when available.

The point IMO is not about "guidance" but about being normative with respect to which function each element of the proposed data model fulfils, and how.

The current Zarr v3 specification (section: Consolidated Metadata) states:

"Consolidated Metadata is optional. If present, then readers should use the consolidated metadata. When not present, readers should use the non-consolidated metadata located in the Store to load the data."

While this prioritizes the use of consolidated metadata, it leaves key questions unanswered when both consolidated and non-consolidated metadata are present. This is not just a matter of performance optimization or tooling convenience; it affects the foundational semantics of the data model.


d-v-b wrote:

“Consolidated metadata only has value when the non-consolidated metadata is inaccessible; the non-consolidated metadata is the ground truth, and should be used when available.”

and I fully agree about this (!), however the spec currently states that

readers should use the consolidated metadata if present

without clarifying whether this is a performance recommendation or a normative rule.

The core issue is about being normative and unambiguous about:

  • the roles each element of the data model fulfills, and
  • which metadata source is authoritative when both are present.

After reviewing current wording and interpretations, it’s unclear whether the spec intends:

A) To recommend consolidated metadata only for better performance (which is reasonable and useful),
OR
B) To declare consolidated metadata as the canonical source of truth for all group and array metadata, overriding local files (which is in contradiction with the decision taken to leave metadata locally assigned to each group so they can be self-descriptive).


Proposed Clarification (Spec Language Suggestion)

Consolidated Metadata is optional. If present, then readers must treat the consolidated metadata as a cached index only, while the unconsolidated metadata remains the authoritative source of truth

@mzundo commented Apr 15, 2025

@mzundo could you propose a specific change to the wording of the spec? And specifically why https://github.com/zarr-developers/zarr-specs/pull/309/files#diff-39bde03e0d9bdcd6ea478965c1f6ec50785744ce98ec8f3a1d5c56218041f17fR851 doesn't handle your concerns? If consolidated metadata is there, readers should use it.

I made a wording proposal in my post above.

@TomAugspurger (Author)

Thanks @mzundo.

Consolidated Metadata is optional. If present, then readers must treat the consolidated metadata as a cached index only, while the unconsolidated metadata remains the authoritative source of truth

For implementors, what are the practical consequences of this language change? What sort of behavior does it rule out or require that the current language doesn't? I do view it as a cache (nobody has proposed removing the non-consolidated metadata nodes while writing the consolidated nodes). I guess I don't see the difference between saying "read consolidated metadata if it's present" and your proposal.

@mzundo commented Apr 15, 2025

Thanks @mzundo.

Consolidated Metadata is optional. If present, then readers must treat the consolidated metadata as a cached index only, while the unconsolidated metadata remains the authoritative source of truth

For implementors, what are the practical consequences of this language change? What sort of behavior does it rule out or require that the current language doesn't? I do view it as a cache (nobody has proposed removing the non-consolidated metadata nodes while writing the consolidated nodes). I guess I don't see the difference between saying "read consolidated metadata if it's present" and your proposal.

Hi, thanks for your comments; we can use them to improve my proposal.

What I did was the following three things:

  1. change "should" (which is non-normative) to "must", or better "shall" (the normative verb used in standards).
  2. make the role of consolidated metadata abstract: do not specify "read", which is one specific use of the consolidated metadata, but rather simply say that it has to be "treated" as a cached version (emphasising its role as a performance-related index), which covers any application: reading, writing, searching.
  3. explicitly state that the canonical truth is the unconsolidated metadata

I would slightly touch up my proposal to:

Consolidated Metadata is optional. If present, the consolidated metadata shall be treated only as a cached metadata index, while the unconsolidated metadata remains the authoritative source of truth.

We could then give "guidelines" in a separate note addressing metadata coherency management (without mandating them), explaining the rationale. For example we could write:

USAGE NOTE 1: The presence of the consolidated metadata in a Zarr store (especially when implemented in a filesystem) is meant to allow more performant access to metadata without having to traverse the whole Zarr store structure.
USAGE NOTE 2: To support metadata integrity and Zarr store self-consistency it is recommended that any reader/writer implements options to a) check the consistency and b) rebuild the consolidated from the non-consolidated metadata.

PS I'm thinking about a system where performance might not be an issue and that does not use consolidation (neither generating nor reading it), but that exchanges Zarr stores with another system that instead relies on consolidation. The wording above does not "mandate" using consolidated metadata if present at any cost; e.g. the system above will receive Zarr stores with consolidated metadata but will just ignore it. This leaves complete freedom to implementors of systems/software/stores to use consolidated metadata without mandating behaviour, while still allowing data exchange and coherency.

@DennisHeimbigner

While I have to admit that I have not read the most recent version of the consolidated metadata proposal, I do have one question. Is it legal to use consolidated metadata alone, or does it always have to be paired with normal, unconsolidated metadata? In the latter situation, the cost of keeping the consolidated and unconsolidated metadata consistent is incurred at write time. This of course assumes that the dataset metadata is not modified by external agents. So that cost is not all that much in the overall costs.

@TomAugspurger (Author)

This is just an extension to the core spec. It can't change anything about the rest of the spec, and so to be considered a valid Zarr hierarchy you'd still need the non-consolidated metadata.

I'd encourage you to review the PR if you haven't @DennisHeimbigner. IIRC, you'd requested that consolidated metadata be standardized outside of zarr-python.

@joshmoore (Member)

TomAugspurger commented:
(nobody has proposed removing the non-consolidated metadata nodes while writing the consolidated nodes)

I thought I had (more or less), but as you pointed out, that's basically a different proposal.

Going back to DennisHeimbigner's earlier comment:

What I am suggesting is that you might want to consider a wide range of alternatives to consolidated metadata.

I'd like to re-raise a thought -- it feels like there are a handful of characteristics that we might be talking about which can be pieced together for different behaviors:

  • full consolidated metadata file as cache: An additional .zmetadata file is created which contains a copy of all the existing metadata. (This is the v2 behavior.)
  • full inline consolidated metadata only: All metadata is only in the top-level zarr.json. This would be must_understand=True. (Likely requires a reference from lower zarr.json files to the top-level)
  • child metadata cached in group: This is a proposal from @DennisHeimbigner during previous Zarr community call (or maybe ZEP) conversations. (Also could make use of @TomAugspurger's depth=1 flag.)
  • all metadata in separate hierarchy: All metadata files are in a duplicate metadata hierarchy as @mzundo suggested. (This was specified in v3-alpha, though it would likely need to work in reverse saying all data is in another directory.)

The result on the implementation behavior would essentially be "which path(s) do I go to in order to load the metadata?" and (though I'm definitely naive here) I'd hope something sufficiently generic could still be of use when backed by e.g. icechunk.

Under the ZEP9 extension semantics, it would of course be fine for each of these behaviors to be a separate extension. And since @TomAugspurger has been very clear about his intents here, I'm up for moving this to a new issue, but it does feel like there's enough overlap that some of these flags/options could be worked into a single extension rather than being multiple.
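
Purely as a hypothetical illustration of combining those flags into one extension (only kind and metadata resemble the current proposal; depth comes from the earlier suggestion in this thread, and the other values are invented):

// zarr.json (hypothetical single extension combining the options above)
{
    "consolidated_metadata": {
        "kind": "inline",    # hypothetically also "external", "hierarchy", ...
        "depth": 1,          # from the depth proposal above
        "metadata": { ... }
    }
}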

@mzundo commented Apr 22, 2025

The post by @joshmoore opens up some really interesting possibilities—here are my two cents.

I believe it’s important to:

  1. Support expandability in a clean, incremental/layered way—the syntax for extensions (e.g. for performance) should be foreseen by the standard.
  2. Keep data fully self-descriptive: metadata should not depend on context like filesystem layout or positional info. For example, storing something that only makes sense in a hierarchical file tree shouldn’t break when moved to a flat DB-like store.
  3. Explicitly identify extensions or variants via clearly defined metadata fields (this is also a subset of point 2).
  4. Allow direct data exchange between libraries/readers without requiring extension support unless absolutely necessary—i.e., default to must_understand=false for optional performance-related extensions (unlike compressors, which require specific decoding support).
  5. Ideally, avoid duplication; where duplication exists (e.g. .zmetadata), the format should clearly specify the canonical source to allow regeneration and integrity checking.

Two possible approaches for metadata definition:

A) Fixed Core + Extensions
• Define a core that must always exist.
• Any extensions are additive and can be redundant.
• Example: in Zarr v2, the local metadata (per group/array) is mandatory, while the consolidated .zmetadata is optional and duplicates it for performance or convenience.

B) Variant Declaration
• Use a top-level variable (e.g. metadata_style) to declare the metadata layout strategy. Some made-up examples:
• "Default" → Metadata lives locally within each group.
• "Value_Consolidated" → All metadata is duplicated into a single top-level .zmetadata file (as in Zarr v2).
• "Reference_Consolidation" → .zmetadata contains only references/paths to distributed metadata files.
• "Hierarchy_Consolidation" → All metadata is relocated under a dedicated group (e.g. /meta).

⚠️ Ambiguity warning: if using Hierarchy_Consolidation, we’d need to avoid identically named metadata files (like zarr.json) to prevent conflicts—some naming scheme like group1.json, group2.json, etc., would help.

Considerations

The Variant approach (B) is arguably cleaner and more efficient (no duplication), but it would require all tools to fully support each variant to read/write/update stores correctly. This reduces out-of-the-box compatibility and increases implementation complexity.

On the other hand, metadata is small compared to the actual data, and a key use case is partial data access in distributed or parallel environments. Having self-contained group-level metadata allows each process to independently handle its chunk of the store without loading or parsing consolidated metadata.

Conclusion

I’d favour Approach A:
• Keep mandatory local metadata at the group/array level.
• Support optional metadata extensions on top.
• But also explicitly declare the metadata layout variant (as in B) to ensure clarity and forward-compatibility.

This lets lightweight or legacy readers operate with minimal assumptions, while enabling more advanced tools to build optimised metadata layers in a controlled and future-proof way.

Let me know what you think—or if there’s already a ZEP going in this direction, I’d be happy to contribute!
