Added consolidated metadata to spec #309
Conversation
Going back to #136 (comment) (to ZEP or not to ZEP), I think this could potentially use some discussion around whether there is an opportunity to make the consolidated metadata one implementation of a wider interface. The most general abstraction I've been considering is "metadata loader" (similar to the storage transformer that was initially proposed for sharding). Basically, could we have extensible (i.e. you can write your own) ways of looking up metadata that would be registered in the zarr.json? I owe you a better description but to get them down on paper before I get inundated again, here are some similar requirements/features that I could see:
In my mind, if one of these alternative mechanisms is active, it could be the sole source of information, rather than duplicating the metadata. Some of these are more useful for user attributes rather than the entire metadata, and certainly several may be more difficult without the concept of the root metadata, but it feels like this is an opportunity for us to go beyond the original consolidated metadata workaround. |
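To make the loader idea concrete, here is a purely hypothetical sketch of what registering one in `zarr.json` could look like; the `metadata_loader` key, the loader name, and the configuration fields are all invented for illustration and are not part of any proposal:

```json
{
  "zarr_format": 3,
  "node_type": "group",
  "metadata_loader": {
    "name": "sqlite-index",
    "configuration": {
      "uri": "metadata.db"
    }
  }
}
```

Under this framing, such a loader could then be the sole source of metadata rather than a duplicate of it.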
Thanks Josh! I like the idea of metadata loading being an extension point. We'll want to hash out some details before merging this, since a top-level `consolidated_metadata` key is a change to the core metadata. We could adjust the name; I'll play with that a bit to see how it feels in zarr-python. |
Thanks for working on this! What do you think about multiple instances of metadata consolidation in the same zarr hierarchy? E.g., Group A contains Group B, Group B generates consolidated metadata, then Group A generates consolidated metadata, which ends up containing 2 models of the hierarchy rooted at Group B. Because the consolidated metadata is part of the Group metadata, consolidating recursively will end up with a lot of repeated content. A few options:
The latter option seems best to me; curious to hear other perspectives. |
Coming back to @joshmoore's "metadata loader" concept, the thing we need to solve now is what sort of metadata to put in the Zarr Group object to indicate "here's how you find more stuff about this hierarchy". And I'll focus that a bit more to "listing files on blob storage is messy / slow, so I want to avoid that if possible". I'm going to make a couple of statements that I think are true:
It's that interface with the store that's tripping me up when trying to generalize this. Why do we need to standardize that in the Zarr Group object? By the time the user has loaded some zarr Group, they have already specified their connection to the metadata store, right? That's how they loaded the group in the first place. With something like the current proposal, we have a way for the Group to carry information about its children without further trips to the store.

In short, I think my claim is that the "metadata loader" concept is mostly orthogonal to the consolidated metadata concept. Consolidated metadata is all about reducing the number of trips to the "Store", by giving a place for a Group to hold information about its children. I think I'm comfortable that a top-level `consolidated_metadata` key is the right place for that.

One thing that probably should be addressed now / soon is the idea of storing subsets of the metadata. I'd propose something like a field recording which subset of the metadata was consolidated.
|
Sorry for missing your questions @d-v-b.
I like the framing (which I think was yours originally?) of thinking about consolidated metadata as a cache. Since you can have arbitrarily nested Groups / Arrays, IMO you should be able to have arbitrary consolidated metadata at any level of the hierarchy and the spec shouldn't do anything to dissuade that. As for the question about potentially duplicating data, I think that we won't have that in zarr-python. Working through your example:
I think that we'll have "repeated content" across the files. But we won't have repeated content within a single `zarr.json`. Given an example like the one above (a group A containing a group B):
The zarr.json for group A would contain the metadata for every node under it, keyed by paths relative to A.
And the zarr.json for B would contain only the metadata for the nodes under B.
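A sketch of what A's `zarr.json` could look like, assuming (for illustration) that B contains an array `C`; the `kind` and `must_understand` fields follow the proposal, and the child metadata is abbreviated:

```json
{
  "zarr_format": 3,
  "node_type": "group",
  "consolidated_metadata": {
    "kind": "inline",
    "must_understand": false,
    "metadata": {
      "B": { "zarr_format": 3, "node_type": "group" },
      "B/C": { "zarr_format": 3, "node_type": "array" }
    }
  }
}
```

Note that the entry for `B` carries no nested `consolidated_metadata` of its own, so nothing is repeated within this single `zarr.json`; `B`'s own `zarr.json` would hold a consolidated view of just its subtree.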
A note that this might be different from how zarr v2 treated consolidated metadata. In zarr-developers/zarr-python#2113 (comment), I think I found that zarr v2 only supported consolidated metadata at the root of a Store, not nested paths within a store. I think this might be an area where we want to diverge from zarr v2. |
I think this should be about ready to go. @LDeakin I saw that you at least started on this at zarrs/zarrs#55. Anything come up in your implementation that might adjust what you want to see in the spec? |
Nope, the spec looks good, and no issues popped up. Thanks for writing it. |
Sorry for the slow response, @TomAugspurger. To be clear lest it not come through here, I'm super excited to have CM going into the spec!
I definitely agree that this is a tricky (though critical) part of the bootstrapping. The use case I tend to be most concerned with is everything under
Can you help me see why it's orthogonal? My hope was that that would be one of the things this interface could allow a user to do.
Except then it's already done in one specific way, no? With the metadata, e.g., currently being duplicated.
Agreed that this is another "parameter" for the consolidation, and I imagine there will be more.
Not that I necessarily agree, but what are the remaining (related) steps from your point-of-view? From my side, I'd be for trying to encourage some more feedback from implementers (huge kudos to @LDeakin) |
Thanks @joshmoore.
Sorry, I don't quite understand this set of questions and the "some" / "no" info concepts. Do you have an example of what you have in mind?
To me it just goes back to the original motivation for consolidated metadata: A way for a Group in a Zarr hierarchy to know a bit about its children (to avoid potentially costly repeat trips to a remote Store for common scenarios like "load all the metadata for arrays in this Group / dataset"). How exactly you load that initial Group (your "metadata loader" concept) doesn't really bear on the question of "where do I store information about children", right?
Answered above, I think. All we're specifying here is how to present information about your children; no constraints on how that information gets there. In particular, the spec takes no opinion on how the consolidated metadata is actually stored. A database presenting a Zarr API could construct consolidated metadata on the fly if it wanted (and so no duplication).
This is the only one I can imagine now. If you have any other in mind let me know. I'm not worried about making a
Nothing else. We're merging the zarr-python implementation today. Overall, I'd reiterate that this PR has a narrow scope: a standardized way for Groups to know about their children. Effectively, a cache for what you'd get from listing nodes under some prefix, which we already know is needed for serving Zarr data over object storage in a reasonable time. |
Ok. Bear with me @TomAugspurger, I find myself unfortunately playing bad cop across the repo. First some immediate responses to your points and then I'll take a step back and try to explain.
Looking back, I think I misunderstood you. Let's hold off on this for the moment.
I thought it did, but you may have found a hole in the plan. Ultimately, I'm trying to find an underlying abstraction for what you've built. But more on that in a second.
I think this helps me see how the Store is becoming intertwined. Maybe "MetadataLoader" is the wrong metaphor, but more "MetadataTransformer" to go with the "StorageTransformer". (If, however, CM can be achieved with the StorageTransformer, 👍) The difference in my mind is that the loader/mapper/transformer is config to tell us what keys to look up (whether in databases, filesystems, or buckets) rather than the existing JSON objects.
except the current definition would still have 2 keys for the same information even if the storage and/or retrieval has been optimized, right?
This concerns me, more to that below.
And in fact, now it is merged. Congrats on that and all the work, but to be fair I think I will ignore it for this discussion: though a great validation of the work here, it's the cart before the horse. Ok, finally, to back up:
I understand the goal to get CM in ASAP. However, this is introducing a key into the metadata that could be with us for many years. Spec work is often about looking beyond the immediate scope, especially if we are asking others to implement. So a few, partially overlapping strategies to try not to get in your way:

**Broadening the scope**

This is what I was trying above, and with my suggestion (somewhere?!) of going for a ZEP. By identifying an abstraction that would cover more use cases, and that we would all feel comfortable having in https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#extension-points, I was hoping we would have more buy-in for the change. (Obviously, that's a big ask.)

**Lowering risk**

The addition of extensions to Zarr v3 is still pretty untested. What you are trying to do should be made possible: dev wants to hack out a new feature, implement it and move on. 👍 Your use of `must_understand` helps here.

**Preparing for change**

If we see the need for an evolution of the configuration, it might make sense to introduce versioning internally. |
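For illustration only, internal versioning might look like this; the `version` key is hypothetical and not part of the current proposal:

```json
{
  "consolidated_metadata": {
    "kind": "inline",
    "version": 1,
    "metadata": {}
  }
}
```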
This is not entirely true for
If the configuration changes, an implementation can support that without needing an explicit version key. |
That's up to whatever storage is providing the Zarr metadata. In the case of the consolidated metadata produced by zarr-python, which would typically be written to a plain file / object store, yes. But that hypothetical database could store it however it wants. Stepping back: why is duplicating the metadata a problem? That's fundamental to this. Maybe substitute "cache" for "duplicate" and see whether it still causes any concerns.
"Broadening the scope" and "Lowering the risk" seem pretty much directly in tension with each other :) I'll push back pretty strongly on broadening the scope; this is deliberately scoped to narrowly solve the "open group and immediately inspect its children" use case, and I think that's a good thing. As for lowing the risk, I think we point to zarr-python 2.x's consolidated metadata. It seems to have worked pretty well, and this spec is essentially identical (aside from moving it to the single
I think stac's extension mechanism is much nicer than what zarr v3 provides today. When I first started on consolidated metadata, I actually started on the extension mechanism: JSON schemas for the core metadata plus a top-level list of extension schema URLs (like STAC's `stac_extensions`).
I think that would be solved by the zarr spec having a better extension mechanism (previous point) or by the |
General 💯 for thinking through the future-proofing with you guys, so super brief comments for 00:00 CET:
My gut reaction is to say we're putting too much on the back of `must_understand`.
Largely, but you could of course draft a broader-scoped extension externally first, too ;)
But not across implementations, and I have the feeling it took a good deal of heuristics in consuming libraries like xarray to get the creation/detection in place. That a way to tune the metadata loading for different cases is needed is a given, but I would hope we could also improve on it.
*nods* Certainly something that doesn't give the sense that we might need a series of 3.x versions.
If we think we can sufficiently lower the risk then agreed, but as currently written my instinct is we're not there yet (though perhaps @LDeakin's suggestion will find traction).
❤️ |
Maybe, but I try not to get too caught up in hypotheticals. We have a very real use case where the layout of Zarr metadata on a remote file system causes issues. So we have a tailored solution for that. Why do we need to complicate it to handle other stuff (stuff that at least I don't have a clear grasp on)? If we come up with something better that does... something related to metadata loading, then great. Let's add that to the spec. And it can use a different `kind`.

And just to reiterate again: this is a small change. A change to the core metadata, sure, but it's completely optional. And it really isn't inventing any new concepts, both because the use-case was discovered (and solved!) by users of zarr-python, and because all of the complicated objects going in the `consolidated_metadata` object are just the existing array and group metadata.

If that's not enough de-risking, then can you lay out exactly what you're looking for? |
#316 has stuff on the extension side. Happy to discuss more there, but I don't think the limitations of Zarr's extension story should bear on this PR. |
The reason I think we're in this discussion is that as things stand this adds a feature (i.e., if we were under semver, a minor release). The two paths I see are:
Let me try to give one scenario of its impact:
This is just a part of caches being hard, but I think it previously was less of an issue since:
Users were then responsible for cleaning things. But that no longer holds, does it? |
Most usage of consolidated metadata is in an unmanaged storage service. People generally use this feature on high-latency cloud storage backends like s3 or gcs, which are "plain" object stores. In fact, if you do have a managed storage service, there's probably no need for a separate consolidated metadata document because you could implement the same functionality dynamically in your managed storage service. After all, consolidated metadata is effectively a caching layer on top of zarr-aware IO operations. As for your other two concerns, I think it's an important piece of context that consolidated metadata is typically used for data that is written once, and read many times. If the zarr hierarchy is in fact immutable, then there are no changes to track, or issues of desynchronization between the consolidated metadata and the actual metadata. I think using consolidated metadata in a context where a zarr hierarchy is being actively updated by multiple users is probably not a good idea. But it's also true that zarr doesn't provide good tools for ensuring immutability (anyone with the right permissions can open and modify any zarr metadata). So a hierarchy that should be immutable may not in fact remain so, and the only way to find a discrepancy between the actual hierarchy and the consolidated image of that hierarchy is by directly checking.
I think I would need to see a concrete example of this. Where would this directory be stored relative to the zarr hierarchy, and what would be in it? |
Thanks for the detailed response—really helpful context.
To clarify, the use case I’m referring to is not about accessing data through infrastructure APIs. Instead, it’s focused on scientists and engineers exchanging Zarr products as files—including both data and metadata—and using software to locally process, generate, or modify them. While the data might originate from or eventually return to an infrastructure, the key concern is what happens in between—during manual or tool-based manipulation outside of managed environments.
I agree that consolidation by merging individual metadata files works well in an immutable context. My concern is supporting hybrid workflows where data might be consolidated for efficiency when stored on infrastructure, but also accessed or modified locally. In such cases, the merging approach can lead to desynchronisation or inconsistencies, especially when changes are made to the hierarchy without regenerating the consolidated metadata.
Exactly—this is one of the reasons why workflows relying on strict immutability can be fragile in practice. Without enforced guarantees, even minor edits can break consistency between group-level metadata and its consolidated view.
A concrete example is provided by the Pangeo.io ERA5 dataset, which uses a namespace-style approach for metadata consolidation. Each group's metadata is stored in a separate JSON file, named after the group (e.g., group_name.json), and all are stored in a dedicated group at the root level (typically _meta/). This structure logically looks like the sketch below. |
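A rough sketch of that layout, mapping store keys to their contents; the exact names are illustrative, following the description above:

```json
{
  "_meta/root.json": "metadata for the root group",
  "_meta/group_a.json": "metadata for group /group_a",
  "group_a/temperature/0.0.0": "chunk data for an array in group_a"
}
```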
I think this centralized metadata layout was defined in an early version of the zarr v3 spec, but was ultimately rejected in favor of the current decentralized layout. I don't know the exact reasons for that decision, but I think it was motivated by a desire to keep individual arrays and groups self-describing. For arrays and groups to be self-describing, their metadata documents are stored locally, instead of aggregated in a directory at the root of the hierarchy.
I think on local storage, IO operations are fast enough that there is very little need for consolidated metadata. Especially with the performance improvements we have made to zarr-python 3 for group indexing, I think consolidated metadata on the local file system offers very little value. |
I actually agree that keeping groups self-describing is very important. That said, maybe there's a middle ground—where consolidated metadata could work by reference (e.g. by path) to the individual group metadata, rather than duplicating all the content (see the sketch below). This might kill two birds with one stone: preserving self-description while enabling centralised access when needed.
Makes sense. My main concern is less about performance and more about managing inconsistencies. If a reference-based consolidation approach were adopted, that issue could be mitigated. But if duplication is necessary, then I think it would be helpful for the spec to explicitly define a rule—for example: in case of discrepancies, the local (group-level) metadata is considered the source of truth. This would allow tools to resolve conflicts programmatically by rebuilding the consolidated view as needed. |
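A minimal sketch of such reference-based consolidation, assuming a hypothetical `kind` value and `ref` field (neither is part of this PR):

```json
{
  "consolidated_metadata": {
    "kind": "refs",
    "must_understand": false,
    "metadata": {
      "B": { "ref": "B/zarr.json" },
      "B/C": { "ref": "B/C/zarr.json" }
    }
  }
}
```

A reader would still need one request per referenced document, so this preserves self-description at the cost of the single-read property discussed elsewhere in this thread.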
@mzundo -- these are all great points. I want to point out that the new Icechunk storage engine addresses basically all of your concerns
AND it does this all at the store level, meaning that no application code has to change for interacting with Zarr datasets. We'd be happy to help you get started over at https://github.com/earth-mover/icechunk. |
Agreed that icechunk probably fits your needs better. This PR is primarily standardizing what's already out there for Zarr v2, updated to Zarr v3's conventions.
I think https://github.com/zarr-developers/zarr-specs/pull/309/files#diff-39bde03e0d9bdcd6ea478965c1f6ec50785744ce98ec8f3a1d5c56218041f17fR851 addresses that. |
In my case, I’m working within an existing setup based on vanilla Zarr v2, with plans to eventually move to Zarr v3. There’s a strong preference to minimise changes, so I’ve been trying to understand what improvements or features are already planned or supported in Zarr v3. If some of these concerns were addressed natively in the spec, it would be much easier to consider transitioning—rather than adopting an entirely separate storage engine. |
I personally don't plan to push for changing consolidated metadata from how it behaved in zarr-python 2.x. All the concerns around things like consistency and atomic updates are complex, and are better-handled by icechunk. |
Thanks for the pointer! The change you referenced defines the reader’s behaviour—i.e. that if consolidated metadata is available, it should be used. However, it doesn’t explicitly cover what should happen in the presence of discrepancies between consolidated and group-level metadata. Consider the case where a library opens a Zarr store, reads both the consolidated and local metadata, and detects a mismatch. Beyond (possibly) reporting an error, it would be helpful for the spec to clearly define the source of truth. Following the principle of self-describing groups, I would argue that the group-level metadata should take precedence, not the consolidated version. |
I’ve always seen the .zmetadata duplication approach in Zarr v2 as a performance-driven workaround/hack, rather than a clean, uniquely defined data model. Wouldn’t it be worth considering a hybrid approach—for example, consolidation via the root JSON, but by reference rather than by value? That way, performance could still be improved without compromising clarity, traceability, or the principle of self-describing groups. It could even be made optional, controlled by a flag. |
That's outside the scope of the spec IMO. zarr-python, for instance, won't want to read non-consolidated metadata (reading all the non-consolidated nodes would defeat the point of consolidated metadata!). So there's no opportunity to detect a conflict.
I'm not quite sure what this would look like. What the consolidated metadata provides in zarr-python 2.x, and what's proposed here, lets you get the consolidated metadata of the entire hierarchy in a single read / HTTP request. |
My understanding was that a key improvement introduced by the Zarr v3 specification was to abstract the format definition (the data model) from any specific implementation, removing the tight coupling with Python — both in terms of data types and how information is interpreted. One of the main reasons we are interested in Zarr is precisely because we may not want (or be able) to use Python. To enable such use cases, the data format must be intrinsically unambiguous and self-consistent. That is: it should either prevent inconsistency by design or clearly define expected behaviour in cases of potential duplication or conflict. Only by doing this can Zarr become a truly language-agnostic, portable format — one that behaves the same across implementations, without relying on the assumptions/behaviour of any particular library. Your reply illustrates the problem: you refer to the behavior of a specific implementation (zarr-python) to justify why the spec doesn’t need to define how conflicts (e.g. between consolidated and non-consolidated metadata) are handled. But that actually confirms my concern: if the expected behavior is not defined by the spec, then different implementations will (reasonably) behave differently. In other words, the spec should not leave core behaviors to be inferred from a single implementation’s choices. The behavior of a Zarr-compliant reader must be defined by the specification — not by what zarr-python happens to do. |
@mzundo could you propose a specific change to the wording of the spec? And specifically why https://github.com/zarr-developers/zarr-specs/pull/309/files#diff-39bde03e0d9bdcd6ea478965c1f6ec50785744ce98ec8f3a1d5c56218041f17fR851 doesn't handle your concerns? If consolidated metadata is there, readers should use it.
Don't read too much into that, I just happen to help maintain zarr-python. With that hat on, I can say we wouldn't want a change to the spec that would make consolidated metadata useless (since detecting a conflict on read requires reading both consolidated and non-consolidated metadata). @LDeakin can speak for zarrs, but I suspect that would be a non-starter there too. |
The core problem here is that the Zarr spec, and the requirements which motivated it, simply do not address consistency issues in a consistent (no pun intended) way. The spec describes the static format on disk, no more and no less. It does not explain how to create, update, append, delete, etc. All of these issues are left to implementations. Consolidated metadata is, IMO, a poorly designed workaround (a hack, as you said) which exacerbates, rather than solves, Zarr's consistency problems. I think the best we can do is stipulate in the extension wording that implementations which use consolidated metadata must ensure that they write consolidated metadata in a way that is consistent with the true state of the store. Our solution at Earthmover was to keep the spec unchanged, and to address consistency at the store level, by creating a specialized key-value store built for Zarr that supports ACID transactions. This is Icechunk. A future Zarr spec (e.g. Zarr 4) might choose to build consistency into its requirements and potentially leverage Icechunk's innovations to create a more powerful core format. |
I don't think we need guidance from the zarr spec on how to handle conflict between consolidated and non-consolidated metadata. Consolidated metadata only has value when the non-consolidated metadata is inaccessible; the non-consolidated metadata is the ground truth, and should be used when available. |
As your "exacerbates" notes: Zarr itself can't say much of anything about ACID-style consistency, and that affects the core of the spec, not just consolidated metadata. If others are OK with it, I'd be fine with a section in the core Zarr spec noting the issues Zarr faces with consistency (depending on the Store) and maybe linking to icechunk as a project that does address this. And for this PR, I think we can focus on the other type of "consistency" that's unique to consolidated metadata: that the duplicated metadata in the consolidated metadata can get out of sync with the non-consolidated metadata, indefinitely. I'm happy to add or modify language beyond "use consolidated if it's present". On the write side, I guess we can say something around consolidating the "true state of the store". That's pretty close to "do it right" IMO, but I'm probably closer to the implementation than most... Or we just close the PR if we think formalizing consolidated metadata in a world with icechunk just isn't sensible. Perhaps most users of consolidated metadata would be better off adopting icechunk (it's what I'd recommend in most cases). |
I believe the idea of making icechunk an explicit part of the spec is fraught with problems. |
d-v-b wrote:
The point IMO is not about "guidance" but about being normative w.r.t. which role each element of the proposed data model fulfils and how. The current Zarr v3 specification (section: Consolidated Metadata) states: "Consolidated Metadata is optional. If present, then readers should use the consolidated metadata. When not present, readers should use the non-consolidated metadata located in the Store to load the data." While this prioritizes the use of consolidated metadata, it leaves key questions unanswered when both consolidated and non-consolidated metadata are present. This is not just a matter of performance optimization or tooling convenience; it affects the foundational semantics of the data model.

d-v-b wrote:
and I fully agree with this (!), however the spec currently states that readers should use the consolidated metadata if present, without clarifying whether this is a performance recommendation or a normative rule. The core issue is about being normative and unambiguous about:
After reviewing the current wording and interpretations, it's unclear whether the spec intends:

A) To recommend consolidated metadata only for better performance (which is reasonable and useful), or
B) To make consolidated metadata normative and authoritative whenever it is present.

**Proposed Clarification (Spec Language Suggestion)**

Consolidated Metadata is optional. If present, then readers must treat the consolidated metadata as a cached index only, while the unconsolidated metadata remains the authoritative source of truth |
I made a wording proposal in my post above |
Thanks @mzundo.
For implementors, what are the practical consequences of this language change? What sort of behavior does it rule out or require that the current language doesn't? I do view it as a cache (nobody has proposed removing the non-consolidated metadata nodes while writing the consolidated nodes). I guess I don't see the difference between saying "read consolidated metadata if it's present" and your proposal. |
Hi, thanks for your comments; we can use them to improve my proposal. What I've done is the following three things:
I would slightly touch up my proposal to: Consolidated Metadata is optional. If present, the consolidated metadata shall be treated only as a cached metadata index, while the unconsolidated metadata remains the authoritative source of truth.

We could then give "guidelines" in a separate note addressing metadata coherency management (without mandating them), explaining the rationale. For example we could write:

USAGE NOTE 1: The presence of the consolidated metadata in a Zarr store (especially when implemented in a filesystem) is meant to allow more performant access to metadata without having to traverse the whole Zarr store structure.

PS: I'm thinking about a system where performance might not be an issue and that is not using consolidation (neither generating nor reading it), but that exchanges Zarr stores with another system that instead relies on consolidation. The wording above does not "mandate" using consolidated metadata if present at any cost; e.g. the system above will receive Zarr stores with consolidated data but will just ignore it. This leaves complete freedom to implementors of systems/SW/stores to use consolidated metadata without mandating behaviour, while still allowing data exchange and coherency |
While I have to admit that I have not read the most recent version of the consolidated metadata proposal, |
This is just an extension to the core spec. It can't change anything about the rest of the spec, and so to be considered a valid Zarr hierarchy you'd still need the non-consolidated metadata. I'd encourage you to review the PR if you haven't @DennisHeimbigner. IIRC, you'd requested that consolidated metadata be standardized outside of zarr-python. |
I thought I had (more or less), but as you pointed out, that's basically a different proposal. Going back to DennisHeimbigner's comment last week:
I'd like to re-raise a thought -- it feels like there are a handful of characteristics that we might be talking about which can be pieced together for different behaviors:
The result on the implementation behavior would essentially be "which path(s) do I go to load the metadata?" and (though I'm definitely naive here) I'd hope something sufficiently generic could still be of use when backed by e.g. icechunk. Under the ZEP9 extension semantics, it would of course be fine for each of these behaviors to be a separate extension. And since @TomAugspurger has been very clear about his intents here, I'm up for moving this to a new issue, but it does feel like there's enough overlap that some of these flags/options could be worked into a single extension rather than being multiple. |
The post by @joshmoore opens up some really interesting possibilities—here are my two cents. I believe it’s important to:
⸻

**Two possible approaches for metadata definition:**

A) Fixed Core + Extensions
B) Variant Declaration

⸻

**Considerations**

The Variant approach (B) is arguably cleaner and more efficient (no duplication), but it would require all tools to fully support each variant to read/write/update stores correctly. This reduces out-of-the-box compatibility and increases implementation complexity. On the other hand, metadata is small compared to the actual data, and a key use case is partial data access in distributed or parallel environments. Having self-contained group-level metadata allows each process to independently handle its chunk of the store without loading or parsing consolidated metadata.

⸻

**Conclusion**

I'd favour Approach A: this lets lightweight or legacy readers operate with minimal assumptions, while enabling more advanced tools to build optimised metadata layers in a controlled and future-proof way.

⸻

Let me know what you think—or if there's already a ZEP going in this direction, I'd be happy to contribute! |
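As a rough sketch of the difference: approach A keeps the core metadata fixed and layers optional extensions on top (as this PR does), while approach B would declare the metadata model up front. The `metadata_variant` key below is invented purely for illustration:

```json
{
  "zarr_format": 3,
  "node_type": "group",
  "metadata_variant": "consolidated-by-reference"
}
```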
This PR adds a new optional field to the core metadata for consolidating all the metadata of all child nodes under a single object. The motivation is similar to consolidated metadata in Zarr V2: without consolidated metadata, the time to load metadata for an entire hierarchy scales linearly with the number of nodes. This can be costly, especially for large hierarchies served over HTTP from a remote storage system (like Blob Storage).
The primary points to discuss:
1. The consolidated metadata is stored inline in the root `zarr.json` (`kind="inline"`). For *very* large hierarchies, this will bloat the size of the root `zarr.json`, slowing down operations that just want to open the metadata for the root.
2. A future `kind` could store just the list of node paths in the root `zarr.json`. That might be a nice alternative to consolidated metadata for very large hierarchies (you could do an initial read to get the list of nodes, and then load the metadata for each child concurrently).

Closes #136
xref zarr-developers/zarr-python#2113, which is a completed implementation in zarr-python, and zarrs/zarrs#55, a draft implementation in zarrs.
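On the second point above, a hypothetical sketch of a node-listing alternative; the `kind` value and `nodes` field are not defined by this PR:

```json
{
  "consolidated_metadata": {
    "kind": "listed-nodes",
    "must_understand": false,
    "nodes": ["a", "a/b", "a/b/c"]
  }
}
```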