Explore integration with Icechunk data engine #5

Open
@aufdenkampe

My vision for this package is that it would work seamlessly with a local and/or remote high-performance data catalog and store (i.e., a data engine). Presently, the Icechunk cloud-native transactional tensor storage engine is the most promising option, as it was recently open-sourced by EarthMover as the source code behind their ArrayLake services.

An ideal workflow would be:

  • The user requests a dataset from a well-known data repository for a specific area of interest.
    • These well-known data repos will be cataloged here in a YAML file, and optionally referenced with Kerchunk or VirtualiZarr.
  • This package first checks whether that dataset has already been fetched and saved to a local Icechunk instance.
  • If not, it fetches the dataset from the source repository and saves it locally in its native format.
  • If the user expects to reuse the data, they can choose to convert the dataset into an analysis-ready, cloud-optimized (ARCO) Zarr v3 dataset within Icechunk.
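
To make the check-then-fetch steps above concrete, here is a minimal sketch in Python. It is not a proposed implementation: a plain cache directory stands in for the local Icechunk instance, the `CATALOG` dict stands in for the YAML catalog, and every name in it (`CATALOG`, `cache_key`, `get_dataset`, the `fetch` callback, the example URL) is hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable

# Hypothetical catalog entry; the real package would load this from its YAML catalog.
CATALOG = {
    "example-dataset": {"url": "https://example.org/data", "suffix": "nc"},
}

# Area of interest as (west, south, east, north); an assumed convention.
BBox = tuple[float, float, float, float]


def cache_key(dataset_id: str, bbox: BBox) -> str:
    """Build a stable short key from the dataset id and area of interest."""
    payload = json.dumps({"id": dataset_id, "bbox": list(bbox)}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


def get_dataset(
    dataset_id: str,
    bbox: BBox,
    cache_dir: str,
    fetch: Callable[[str, BBox], bytes],
) -> Path:
    """Return a local path for the dataset, fetching only on a cache miss."""
    entry = CATALOG[dataset_id]
    key = cache_key(dataset_id, bbox)
    target = Path(cache_dir) / f"{dataset_id}-{key}.{entry['suffix']}"

    # Step 1: check whether this dataset and area were already saved locally.
    if target.exists():
        return target

    # Step 2: otherwise fetch from the source repository and keep the native format.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(fetch(entry["url"], bbox))
    return target
```

Passing `fetch` in as a callback keeps the cache logic independent of any particular source repository; the optional conversion to Zarr within Icechunk would be a separate step applied to the returned path.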
