Virtualizarr-data-pipelines is a GitHub template repository intended to help users create and manage Virtualizarr/Icechunk stores on AWS in a consistent, scalable way.
The goal is to let users focus their expertise on how to parse and concatenate archival files, without having to think too much about the surrounding infrastructure code.
First, create your own repository from the template. You'll use this repository to build and configure your own dataset-specific pipeline.
Once you have your own repo, the first step is building your own processor class. There is a sample processor.py in the repo that uses an in-memory Icechunk store and a fake virtual dataset to demonstrate how a processor works. Replace it with your own processor.py file. Your class should follow the VirtualizarrProcessor protocol and implement the following methods (an illustrative sketch follows the list below):
- initialize_store: This method should create your new Icechunk store and use a seed file to initialize the structure that subsequent files can be appended to.
- append: This method should take a file URI, use a Virtualizarr parser to parse it, and append the resulting ManifestStore or virtual dataset to the Icechunk store along one or more dimensions.
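Purely as an illustration, here is a hedged sketch of a processor that follows this protocol. It assumes VirtualiZarr's open_virtual_dataset function and its Icechunk writer accessor (named .virtualize in older releases, .vz in newer ones), plus Icechunk's Repository/session API; the in-memory storage, append dimension, and class name are placeholders you would replace for your dataset.

```python
# processor.py -- illustrative sketch only; adapt names and paths to your dataset.
# Assumes VirtualiZarr's open_virtual_dataset and its Icechunk writer accessor
# (called .virtualize in older releases, .vz in newer ones) and Icechunk's
# Repository/session API. The storage choice and append_dim are placeholders.
import icechunk
import virtualizarr


class MyDatasetProcessor:
    """Example implementation of the VirtualizarrProcessor protocol."""

    def __init__(self) -> None:
        # Swap in icechunk.s3_storage(...) for a real deployment.
        self.storage = icechunk.in_memory_storage()

    def initialize_store(self, seed_uri: str) -> None:
        """Create the Icechunk repo and write the seed virtual dataset."""
        repo = icechunk.Repository.create(self.storage)
        session = repo.writable_session("main")
        vds = virtualizarr.open_virtual_dataset(seed_uri)
        # Accessor name depends on your VirtualiZarr version.
        vds.virtualize.to_icechunk(session.store)
        session.commit(f"initialize store from {seed_uri}")

    def append(self, file_uri: str) -> None:
        """Parse one archival file and append it along the time dimension."""
        repo = icechunk.Repository.open(self.storage)
        session = repo.writable_session("main")
        vds = virtualizarr.open_virtual_dataset(file_uri)
        vds.virtualize.to_icechunk(session.store, append_dim="time")
        session.commit(f"append {file_uri}")
```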
You can specify the dependencies for your processor module in its pyproject.toml.
Create tests for your module in the tests directory. The template repo includes sample fixtures for an in-memory Icechunk store and some basic tests for the sample processor module that you can use as a guide.
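Purely as an illustration (the fixture, paths, and class names here are hypothetical; check the template's own tests directory for the real ones), a test module might look like:

```python
# tests/test_processor.py -- illustrative sketch; fixture, path, and class names
# are assumptions, not the template's actual test code.
import pytest

from processor import MyDatasetProcessor  # hypothetical class from the sketch above


@pytest.fixture
def processor():
    # Uses in-memory Icechunk storage, so tests never touch S3.
    return MyDatasetProcessor()


def test_initialize_store(processor):
    processor.initialize_store("tests/data/seed_file.nc")  # placeholder path
    # Assert on whatever invariant matters for your dataset, e.g. that the
    # expected variables and dimensions exist in the new store.


def test_append(processor):
    processor.initialize_store("tests/data/seed_file.nc")
    processor.append("tests/data/next_file.nc")  # placeholder path
    # Assert that the append dimension grew by the expected amount.
```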
The virtualizarr-data-pipelines CDK infrastructure will use this module to create Docker images and Lambda functions for initializing the Icechunk store and for consuming SQS messages about files and appending them to the store.
Virtualizarr Data Pipelines is only responsible for creating a store and processing file notifications fed to its queue. You'll be responsible for getting messages into this queue. For existing archival data in S3, the simplest approach is to enable S3 Inventory on the bucket and use Athena to query the inventories and push messages onto the queue in manageably sized batches.
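As a hedged sketch (the queue URL and message body schema below are assumptions; they must match whatever your deployed pipeline actually expects), a small script to push URIs onto the queue in batches might look like:

```python
# enqueue_inventory.py -- illustrative sketch. The queue URL and message body
# format are assumptions; use whatever schema your deployed pipeline expects.
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/my-pipeline-queue"  # placeholder


def enqueue(s3_uris: list[str]) -> None:
    """Send S3 URIs to the pipeline queue, 10 per batch (the SQS batch limit)."""
    for i in range(0, len(s3_uris), 10):
        batch = s3_uris[i : i + 10]
        sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                {"Id": str(n), "MessageBody": json.dumps({"uri": uri})}
                for n, uri in enumerate(batch)
            ],
        )


if __name__ == "__main__":
    # In practice these URIs would come from an Athena query over your
    # S3 Inventory tables rather than a hard-coded list.
    enqueue(["s3://my-archive-bucket/data/file_0001.nc"])  # placeholder
```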
For S3 buckets where new data is continually added, you can enable an SNS topic for new-object notifications, which the Virtualizarr Data Pipelines queue can subscribe to.
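As a hedged example using boto3 (the bucket name and topic ARN are placeholders, and the SNS topic's access policy must allow S3 to publish to it), configuring the bucket to publish new-object events might look like:

```python
# notify_on_new_objects.py -- illustrative sketch using boto3. The bucket name
# and topic ARN are placeholders; the SNS topic's access policy must also
# allow s3.amazonaws.com to publish to it.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="my-archive-bucket",
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                "TopicArn": "arn:aws:sns:us-west-2:123456789012:my-archive-new-data",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```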

Virtualizarr Data Pipelines uses a strongly-typed settings file that lets you configure things like bucket names and external SNS topics used by the CDK infrastructure when you deploy it. Many of the settings have defaults, but you can also override values with a .env file; a sample file is provided as an example.
This is where you specify things like the SNS topic you created to feed your queue, or the S3 bucket where your archival dataset lives.
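Purely to illustrate the defaults-plus-.env-override pattern (the template defines its own settings class, and the field names below are hypothetical), such a settings class might look like this if built on pydantic-settings:

```python
# Illustration of the defaults-plus-override pattern only; the template defines
# its own settings class, and its field names may differ from these.
from pydantic_settings import BaseSettings, SettingsConfigDict


class PipelineSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    # Hypothetical fields -- check the template's settings module for real names.
    archive_bucket: str = "my-archive-bucket"   # S3 bucket holding archival files
    source_sns_topic_arn: str | None = None     # external SNS topic feeding the queue


settings = PipelineSettings()  # values in .env override the defaults above
```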
# set up the local development environment
./scripts/setup.sh
# run the test suite
uv run pytest
# synthesize and deploy the CDK stacks using the sample settings file
uv run --env-file .env.sample cdk synth
uv run --env-file .env.sample cdk deploy