Replies: 1 comment
> This has been open long enough; we will start implementing this next year.
Summary
Move the ISISDATA and ISISTESTDATA stores from distributed downloads to a centralized data store behind an HTTPS endpoint.
Motivation
To comply with efforts to transition USGS/NASA data to the cloud, we moved to Amazon S3 to host a subset of ISISDATA. S3 has many benefits for hosting static content in the cloud, but Amazon charges every time data is downloaded from S3. Therefore, we decided to leverage data hosted by other public sources such as NAIF, JAXA, and ESA. This way, we can minimize costs to the USGS by not hosting redundant information. This came with some setbacks:
There have been attempts to rectify some of these problems:
We can rectify these issues more permanently by moving to a centralized solution, using what we learned about AWS in the previous implementation both to support easier downloads from a single source and to keep costs down for the USGS.
Proposed Solution / Explanation
Terms used in this explanation:

- **S3**: AWS solution for storing key/value pairs. Although it looks like a directory system, it is not a full-featured filesystem. It is useful for storing files in a publicly accessible way using a structure similar to a filesystem. Amazon charges every time data is moved from S3 to some endpoint, like when downloading data. S3 objects are stored in groups called buckets (e.g., ISISDATA is stored in a bucket called `isis_data`).
- **CloudFront**: AWS solution for creating a Content Delivery Network (CDN). A common use case is to cache an S3 bucket. This allows for a fast HTTPS connection to an S3 bucket without paying on every download, only when the bucket is updated (e.g., when the ISISDATA public URL is updated to have new LRO kernels).
- **EFS**: AWS solution for hosting a shareable drive with no maximum size, as it grows elastically with the size of your data. Unlike S3 buckets, there is no easy way to expose it publicly. This is useful for mounting internally to live services that need fast access to the data (e.g., SpiceServer).
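To make the CDN-in-front-of-S3 idea concrete, here is a minimal sketch of how a client could resolve an ISISDATA-relative path to an HTTPS URL on a CDN endpoint rather than downloading from the bucket directly. The hostname and helper function are hypothetical; the real endpoint would be whatever CloudFront distribution fronts the `isis_data` bucket.

```python
# Sketch: resolving an ISISDATA-relative path to an HTTPS download URL.
# CDN_BASE is a placeholder, not the real USGS endpoint.
from urllib.parse import quote

CDN_BASE = "https://isisdata.example.usgs.gov"  # hypothetical CloudFront endpoint

def isisdata_url(relative_path: str) -> str:
    """Build the public HTTPS URL for a file in the ISISDATA store."""
    return f"{CDN_BASE}/{quote(relative_path)}"

print(isisdata_url("base/kernels/lsk/naif0012.tls"))
```

Because the CDN caches the bucket, any HTTPS client (browser, `curl`, a download script) can fetch such a URL without the per-download S3 transfer charge falling on each request.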
How these components solve the problem.
I propose we update the process to do the following in order:
How this will impact ISIS users
- Leverage existing USGS metadata used to search for kernels when running `spiceinit` or ALE's `isd_generate` to determine which kernels are used in the software. Kernels are not included in the public bucket if they are not accessed by the software to generate camera models or ISDs. If we use SpiceQL's inventory system for this, it would also eliminate the need to duplicate kernels in different mission folders, since SpiceQL's database is agnostic to filesystem structure.
- Other clients will continue to work, as you will no longer download directly from an S3 bucket. The script will be simplified but still distributed for users who have grown accustomed to using it.
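The filesystem-agnostic inventory idea above can be sketched as follows. This is an illustration only: the schema, mission names, and paths are hypothetical and do not reflect SpiceQL's actual database format. The point is that lookups key on mission and kernel type, so a kernel shared by several missions needs only one entry, wherever it lives on disk.

```python
# Hypothetical path-agnostic kernel inventory, in the spirit of the
# SpiceQL inventory idea described above. Entries are placeholders.
INVENTORY = {
    ("lro", "lsk"): ["base/kernels/lsk/naif0012.tls"],
    ("lro", "ck"):  ["lro/kernels/ck/lrolc_2010001_2010031_v01.bc"],
}

def kernels_for(mission: str, kernel_type: str) -> list[str]:
    """Return kernel paths for a mission/type pair, wherever they live."""
    return INVENTORY.get((mission, kernel_type), [])
```

A lookup like `kernels_for("lro", "lsk")` returns the shared leapsecond kernel from `base/` without LRO needing its own copy, which is how such a design avoids duplicating kernels across mission folders.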
Problems with missing files should occur less often.
Downloading ISISTESTDATA will follow the same process as ISISDATA, since both can be hosted in the same place.
To have kernels included in the system, an update to SpiceQL will be necessary; updating the kernel database requires a change to SpiceQL's configs.
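As an illustration only, registering a new mission's kernels might involve a per-mission config entry along these lines. The actual schema of SpiceQL's configs may differ, and the mission name and regex here are placeholders:

```json
{
  "newmission": {
    "ck": {
      "reconstructed": {
        "kernels": ["newmission_.*\\.bc"]
      }
    }
  }
}
```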
Drawbacks
Alternatives
Unresolved Questions
Future Possibilities