Virtualizarr-data-pipelines is a GitHub template repository intended to help users create and manage Virtualizarr/Icechunk stores on AWS in a consistent, scalable way.
The goal is to let users focus their expertise on how to parse and concatenate archival files, without having to think too much about the surrounding infrastructure code.
First, create your own repository from the template. You'll use this repository to build and configure your own dataset-specific pipeline.
Once you have your own repo, the first step is building your own processor class. There is a sample processor.py in the repo that uses an in-memory Icechunk store and a fake virtual dataset to demonstrate how a processor works. Replace it with your own processor.py file. Your class should follow the VirtualizarrProcessor protocol and implement the following methods (an illustrative sketch follows the list below):
- initialize_store: This method should create your new Icechunk store and use a seed file to initialize the structure that subsequent files can be appended to.
- append: This method should take a file URI, use a Virtualizarr parser to parse it, and append the resulting ManifestStore or virtual dataset to the Icechunk store along one or more dimensions.
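Purely as an illustration, here is a hedged sketch of a processor that follows this protocol. It assumes VirtualiZarr's open_virtual_dataset function and its Icechunk writer accessor (named .virtualize in older releases, .vz in newer ones), plus Icechunk's Repository/session API; the in-memory storage, append dimension, and class name are placeholders you would replace for your dataset.

```python
# processor.py -- illustrative sketch only; adapt names and paths to your dataset.
# Assumes VirtualiZarr's open_virtual_dataset and its Icechunk writer accessor
# (called .virtualize in older releases, .vz in newer ones) and Icechunk's
# Repository/session API. The storage choice and append_dim are placeholders.
import icechunk
import virtualizarr


class MyDatasetProcessor:
    """Example implementation of the VirtualizarrProcessor protocol."""

    def __init__(self) -> None:
        # Swap in icechunk.s3_storage(...) for a real deployment.
        self.storage = icechunk.in_memory_storage()

    def initialize_store(self, seed_uri: str) -> None:
        """Create the Icechunk repo and write the seed virtual dataset."""
        repo = icechunk.Repository.create(self.storage)
        session = repo.writable_session("main")
        vds = virtualizarr.open_virtual_dataset(seed_uri)
        # Accessor name depends on your VirtualiZarr version.
        vds.virtualize.to_icechunk(session.store)
        session.commit(f"initialize store from {seed_uri}")

    def append(self, file_uri: str) -> None:
        """Parse one archival file and append it along the time dimension."""
        repo = icechunk.Repository.open(self.storage)
        session = repo.writable_session("main")
        vds = virtualizarr.open_virtual_dataset(file_uri)
        vds.virtualize.to_icechunk(session.store, append_dim="time")
        session.commit(f"append {file_uri}")
```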
You can specify the dependencies for your processor module in its pyproject.toml.
Create tests for your module in the tests directory. The template repo includes sample fixtures for an in-memory Icechunk store and some basic tests for the sample processor module that you can use as a guide.
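Purely as an illustration (the fixture, paths, and class names here are hypothetical; check the template's own tests directory for the real ones), a test module might look like:

```python
# tests/test_processor.py -- illustrative sketch; fixture, path, and class names
# are assumptions, not the template's actual test code.
import pytest

from processor import MyDatasetProcessor  # hypothetical class from the sketch above


@pytest.fixture
def processor():
    # Uses in-memory Icechunk storage, so tests never touch S3.
    return MyDatasetProcessor()


def test_initialize_store(processor):
    processor.initialize_store("tests/data/seed_file.nc")  # placeholder path
    # Assert on whatever invariant matters for your dataset, e.g. that the
    # expected variables and dimensions exist in the new store.


def test_append(processor):
    processor.initialize_store("tests/data/seed_file.nc")
    processor.append("tests/data/next_file.nc")  # placeholder path
    # Assert that the append dimension grew by the expected amount.
```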
The virtualizarr-data-pipelines CDK infrastructure will use this module to create Docker images and Lambda functions for initializing the Icechunk store and for consuming SQS messages about files and appending them to the store.
Virtualizarr Data Pipelines is only responsible for creating a store and processing file notifications fed to its queue. You'll be responsible for getting messages into this queue. For existing archival data in S3, the simplest approach is to enable S3 Inventory on the bucket and use Athena to query the inventories and push messages onto the queue in manageably sized batches.
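As a hedged sketch (the queue URL and message body schema below are assumptions; they must match whatever your deployed pipeline actually expects), a small script to push URIs onto the queue in batches might look like:

```python
# enqueue_inventory.py -- illustrative sketch. The queue URL and message body
# format are assumptions; use whatever schema your deployed pipeline expects.
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/my-pipeline-queue"  # placeholder


def enqueue(s3_uris: list[str]) -> None:
    """Send S3 URIs to the pipeline queue, 10 per batch (the SQS batch limit)."""
    for i in range(0, len(s3_uris), 10):
        batch = s3_uris[i : i + 10]
        sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                {"Id": str(n), "MessageBody": json.dumps({"uri": uri})}
                for n, uri in enumerate(batch)
            ],
        )


if __name__ == "__main__":
    # In practice these URIs would come from an Athena query over your
    # S3 Inventory tables rather than a hard-coded list.
    enqueue(["s3://my-archive-bucket/data/file_0001.nc"])  # placeholder
```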
For S3 buckets where new data is continually added, you can enable an SNS topic for new-object notifications, which the Virtualizarr Data Pipelines queue can subscribe to.
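As a hedged example using boto3 (the bucket name and topic ARN are placeholders, and the SNS topic's access policy must allow S3 to publish to it), configuring the bucket to publish new-object events might look like:

```python
# notify_on_new_objects.py -- illustrative sketch using boto3. The bucket name
# and topic ARN are placeholders; the SNS topic's access policy must also
# allow s3.amazonaws.com to publish to it.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="my-archive-bucket",
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                "TopicArn": "arn:aws:sns:us-west-2:123456789012:my-archive-new-data",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```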

Virtualizarr Data Pipelines uses a strongly-typed settings file that lets you configure things like bucket names and external SNS topics used by the CDK infrastructure when you deploy it. Many of the settings have defaults, but you can also override values with a .env file; a sample file is provided as an example.
This is where you specify things like the SNS topic you created to feed your queue, or the S3 bucket where your archival dataset lives.
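Purely to illustrate the defaults-plus-.env-override pattern (the template defines its own settings class, and the field names below are hypothetical), such a settings class might look like this if built on pydantic-settings:

```python
# Illustration of the defaults-plus-override pattern only; the template defines
# its own settings class, and its field names may differ from these.
from pydantic_settings import BaseSettings, SettingsConfigDict


class PipelineSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    # Hypothetical fields -- check the template's settings module for real names.
    archive_bucket: str = "my-archive-bucket"   # S3 bucket holding archival files
    source_sns_topic_arn: str | None = None     # external SNS topic feeding the queue


settings = PipelineSettings()  # values in .env override the defaults above
```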
# set up the local development environment
./scripts/setup.sh
# run the test suite
uv run pytest
# synthesize and deploy the CDK stacks using the sample settings file
uv run --env-file .env.sample cdk synth
uv run --env-file .env.sample cdk deploy