
Automated Data Persistence

Chris Bunch edited this page Aug 19, 2013 · 11 revisions

Introduction

Perhaps the most requested feature in AppScale is the ability to persist your data across cloud and cluster deployments. AppScale 1.10.0 brings that feature to you, for VirtualBox, EC2, Eucalyptus, and Google Compute Engine deployments. This post details how we support data persistence in general, as well as specifics of persisting your data across each supported cluster or cloud.

In General...

Saving your data is typically a hard problem. But why is that? The answer is simple - saving the state of an entire system can be as complicated as saving the state of every machine in that system! Thankfully, App Engine and AppScale both make this normally difficult problem a lot easier for you. For starters, the App Engine programming model forces a more-or-less stateless web server onto you: you save all persistent state into the Datastore, and anything in memcache can be reconstituted from the Datastore if needed. AppScale's implementation of these services is also mostly stateless - all state that we need to persist resides in three places:

  1. Cassandra - this NoSQL datastore is used within AppScale to implement support for the Datastore and Blobstore APIs, so all user and application data is stored here.
  2. ZooKeeper - since Cassandra supports only row-level transactions (not sufficient for the type of transactions that App Engine supports), we use this service to provide locking. With this, AppScale can implement App Engine transaction semantics.
  3. The source / war files for the App Engine apps that are hosted in this AppScale deployment. Note that we could choose not to store these, and instead require users to upload their apps every time they start up AppScale.

We begin by instructing Cassandra to store all its data in /opt/appscale/cassandra and ZooKeeper to store its data in /opt/appscale/zookeeper. Similarly, we store the App Engine apps that users upload in /opt/appscale/apps. That turns the problem of how to persist data in AppScale deployments into the problem of "how do we persist the /opt/appscale directory" - a much simpler problem! Let's dive into how we do it anywhere you can run AppScale.
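As a rough illustration of why this layout helps, the sketch below records the three directories named above and works out the single directory that contains all of them. The names STATE_DIRS, persistence_root, and is_persisted are hypothetical and are not part of AppScale; the code only manipulates path strings and never touches the filesystem.

```python
import posixpath

# The three state locations named in the text above.
# STATE_DIRS is a hypothetical name used for illustration only.
STATE_DIRS = {
    "cassandra": "/opt/appscale/cassandra",
    "zookeeper": "/opt/appscale/zookeeper",
    "apps": "/opt/appscale/apps",
}


def persistence_root(state_dirs):
    """Return the deepest directory that contains every state directory."""
    return posixpath.commonpath(list(state_dirs.values()))


def is_persisted(path, root):
    """Return True if path lies under root (string check only, no disk access)."""
    path = posixpath.normpath(path)
    root = posixpath.normpath(root)
    return posixpath.commonpath([path, root]) == root


root = persistence_root(STATE_DIRS)
print(root)
print(is_persisted("/opt/appscale/apps/guestbook", root))
print(is_persisted("/tmp/cache", root))
```

Because every piece of state shares one parent directory, whatever mechanism preserves that one directory preserves everything listed above; anything written elsewhere (such as the hypothetical /tmp/cache path) would not be covered.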

VirtualBox

Amazon EC2 and Eucalyptus

Google Compute Engine

Conclusion and Future Work

This covers how AppScale automatically backs up data to cloud storage systems and uses it in future deployments. One area that we'd love to look into in the future is periodically backing up your data stored in /opt/appscale to a cloud storage service like Amazon S3 or Google Cloud Storage and restoring from that (instead of EBS or PD). We'd also like to consider the performance impact of storing your Google App Engine applications in the Datastore / Blobstore itself, so that they are automatically replicated across machines (and so that the AppController doesn't have to worry about storing and locating your apps). We'd love to have an extra set of eyes looking over this, so feel free to join us in #appscale on freenode.net and let us know what you think!
