| 1 | +# HLD of Metadata Backend |
| 2 | +This document presents a high-level design **(HLD)** of the meta-data back-end for Motr. |
| 3 | +The main purposes of this document are: |
| 4 | + 1. To be inspected by Motr architects and peer designers to ascertain that the high-level design is aligned with Motr architecture and other designs, and contains no defects. |
| 5 | + 2. To be a source of material for Active Reviews of Intermediate Design **(ARID)** and detailed level design **(DLD)** of the same component. |
| 6 | + 3. To serve as a design reference document. The intended audience of this document consists of Motr customers, architects, designers, and developers. |
| 7 | + |
| 8 | + |
| 9 | + ## Introduction |
| 10 | + The meta-data back-end (BE) is a module that presents an interface to transactional local meta-data storage. BE users manipulate and access meta-data structures in memory, and BE maps this memory to persistent storage. Users group meta-data updates into transactions. BE guarantees that transactions are atomic in the face of process failures. |
| 11 | + |
| 12 | +BE provides support for a few frequently used data structures: doubly linked list, B-tree, and extent map. |
| 13 | + |
| 14 | + |
| 15 | + ## Dependencies |
| 16 | + - a storage object *(stob)* is a container for unstructured data, accessible through the `m0_stob` interface. BE uses stobs to store meta-data on a persistent store. BE accesses persistent store only through the `m0_stob` interface and assumes that every completed stob write survives any node failure. It is up to a stob implementation to guarantee this. |
| 17 | + - a segment is a stob mapped to an extent in process address space. Each address in the extent uniquely corresponds to an offset in the stob and vice versa. The stob is divided into blocks of fixed size. The memory extent is divided into pages of fixed size. Page size is a multiple of the block size (it follows that stob size is a multiple of page size). At a given moment in time, some pages are up-to-date (their contents are the same as those of the corresponding stob blocks) and some are dirty (their contents were modified relative to the stob blocks). In the initial implementation, all pages are up-to-date when the segment is opened. In later versions, pages will be loaded dynamically on demand. The memory extent to which a segment is mapped is called segment memory. |
| 18 | + - a region is an extent within segment memory. A (meta-data) update is a modification of some region. |
| 19 | + - a transaction is a collection of updates. The user adds an update to a transaction by capturing the update's region. The user explicitly closes a transaction. BE guarantees that a closed transaction is atomic with respect to process crashes that happen after the transaction close call returns. That is, after such a crash, either all or none of the transaction updates will be present in the segment memory when the segment is next opened. If a process crashes before a transaction is closed, BE guarantees that none of the transaction updates will be present in the segment memory. |
| 20 | + - a credit is a measure of a group of updates. A credit is a pair (nr, size), where nr is the number of updates and size is the total size in bytes of modified regions. |
| 21 | + |
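The credit definition above can be illustrated with a small sketch. The names `example_credit`, `example_credit_add_region`, `example_credit_sum`, and `example_credit_fits` are hypothetical and are not taken from the BE interface. The component-wise sum and the "fits" check are assumptions about how a pair (nr, size) would naturally be combined and compared.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical credit type: a pair (nr, size) as defined in the list above. */
struct example_credit {
	uint64_t nr;   /* number of updates (captured regions) */
	uint64_t size; /* total size in bytes of the modified regions */
};

/* Account for one more update that modifies a region of region_size bytes. */
static inline void example_credit_add_region(struct example_credit *c,
					     size_t region_size)
{
	c->nr   += 1;
	c->size += region_size;
}

/* Combine two credits component-wise (assumed semantics). */
static inline struct example_credit
example_credit_sum(struct example_credit a, struct example_credit b)
{
	struct example_credit r = { a.nr + b.nr, a.size + b.size };
	return r;
}

/* Return non-zero if `used` does not exceed `reserved` in either component. */
static inline int example_credit_fits(struct example_credit used,
				      struct example_credit reserved)
{
	return used.nr <= reserved.nr && used.size <= reserved.size;
}
```

Under these assumptions, a user would estimate a credit for the updates it intends to make, and a transaction's captured regions would be expected to fit within that estimate.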
| 22 | + ## Requirements |
| 23 | + |
| 24 | +* `R.M0.MDSTORE.NUMA`: allocator respects NUMA topology. |
| 25 | +* `R.M0.REQH.10M`: performance goal of 10M transactions per second on a 16-core system with a battery-backed memory. |
| 26 | +* `R.M0.MDSTORE.LOOKUP`: Lookup of a value by key is supported. |
| 27 | +* `R.M0.MDSTORE.ITERATE`: Iteration through records is supported. |
| 28 | +* `R.M0.MDSTORE.CAN-GROW`: The linear size of the address space can grow dynamically. |
| 29 | +* `R.M0.MDSTORE.SPARSE-PROVISIONING`: sparse provisioning is supported, including pre-allocation. |
| 30 | +* `R.M0.MDSTORE.COMPACT`, `R.M0.MDSTORE.DEFRAGMENT`: used container space can be compacted and de-fragmented. |
| 31 | +* `R.M0.MDSTORE.FSCK`: scavenger is supported. |
| 32 | +* `R.M0.MDSTORE.PERSISTENT-MEMORY`: The log and dirty pages are (optionally) in a persistent memory. |
| 33 | +* `R.M0.MDSTORE.SEGMENT-SERVER-REMOTE`: backing containers can be either local or remote. |
| 34 | +* `R.M0.MDSTORE.ADDRESS-MAPPING-OFFSETS`: offset structure is friendly to container migration and merging. |
| 35 | +* `R.M0.MDSTORE.SNAPSHOTS`: snapshots are supported. |
| 36 | +* `R.M0.MDSTORE.SLABS-ON-VOLUMES`: slab-based space allocator. |
| 37 | +* `R.M0.MDSTORE.SEGMENT-LAYOUT`: Any object layout for a meta-data segment is supported. |
| 38 | +* `R.M0.MDSTORE.DATA.MDKEY`: Data objects carry a meta-data key for sorting (like the reiser4 key assignment does). |
| 39 | +* `R.M0.MDSTORE.RECOVERY-SIMPLER`: There is a possibility of doing a recovery twice. There is also a possibility to use either object-level mirroring or logical transaction mirroring. |
| 40 | +* `R.M0.MDSTORE.CRYPTOGRAPHY`: optionally meta-data records are encrypted. |
| 41 | +* `R.M0.MDSTORE.PROXY`: proxy meta-data server is supported. A client and a server are almost identical. |
| 42 | + |
| 43 | +## Design Highlights |
| 44 | +BE transaction engine uses write-ahead redo-only logging. Concurrency control is delegated to BE users. |
| 45 | + |
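As an illustration of write-ahead redo-only logging, the following is a minimal sketch and not the BE implementation. All names (`redo_rec`, `redo_log`, `redo_capture`, `redo_close`, `redo_recover`) and the fixed sizes are hypothetical. The sketch assumes a single open transaction at a time and keeps the log in an in-memory array for brevity, whereas a real log would reside on persistent storage. The log stores only the new contents of captured regions. Recovery replays the records of closed transactions and discards the rest, so no undo information is needed.

```c
#include <stdint.h>
#include <string.h>

#define SEG_SIZE 64u /* hypothetical segment size in bytes */
#define LOG_MAX   8u /* hypothetical log capacity in records */
#define REG_MAX  16u /* hypothetical maximum region size in bytes */

/* One redo record: the new contents of a captured region. */
struct redo_rec {
	uint32_t offset;        /* region offset within the segment */
	uint32_t len;           /* region length in bytes */
	uint8_t  data[REG_MAX]; /* post-update contents of the region */
};

struct redo_log {
	struct redo_rec rec[LOG_MAX];
	uint32_t        nr;        /* records appended so far */
	uint32_t        committed; /* records that belong to closed transactions */
};

/* Capture: copy the region's new contents from segment memory into the log. */
static inline int redo_capture(struct redo_log *log, const uint8_t *seg_mem,
			       uint32_t offset, uint32_t len)
{
	if (log->nr == LOG_MAX || len > REG_MAX || offset > SEG_SIZE - len)
		return -1;
	struct redo_rec *r = &log->rec[log->nr++];
	r->offset = offset;
	r->len    = len;
	memcpy(r->data, seg_mem + offset, len);
	return 0;
}

/* Close: all records captured so far now belong to a closed transaction. */
static inline void redo_close(struct redo_log *log)
{
	log->committed = log->nr;
}

/* Recovery: replay committed records onto the stob image, drop the rest. */
static inline void redo_recover(struct redo_log *log, uint8_t *stob)
{
	for (uint32_t i = 0; i < log->committed; i++)
		memcpy(stob + log->rec[i].offset, log->rec[i].data,
		       log->rec[i].len);
	log->nr = log->committed;
}
```

In this sketch, updates captured after the last `redo_close()` never reach the stob image during recovery, which mirrors the guarantee that updates of a transaction that was not closed before a crash are absent from the segment.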
| 46 | +## Functional Specification |
| 47 | +BE provides an interface to make in-memory structures transactionally persistent. A user opens a (previously created) segment. An area of virtual address space is allocated to the segment. The user then reads and writes the memory in this area, using BE-provided interfaces together with normal memory access operations. When a memory address is read for the first time, its contents are loaded from the segment (the initial BE implementation loads the entire segment stob into memory when the segment is opened). Modifications to segment memory are grouped in transactions. After a transaction is closed, BE asynchronously writes the updated memory to the segment stob. |
| 48 | + |
| 49 | +When a segment is closed (perhaps implicitly as a result of a failure) and re-opened, the same virtual address space area is allocated to it. This guarantees that it is safe to store pointers to segment memory in segment memory. Because of this property, a user can place in segment memory any in-memory structures that rely on pointers: linked lists, trees, hash tables, strings, etc. Some in-memory structures, notably locks, are meaningless on storage, but for simplicity (to avoid allocation and maintenance of a separate set of volatile-only objects), can nevertheless be placed in the segment. When such a structure is modified (e.g., a lock is taken or released), the modification is not captured in any transaction and, hence, is not written to the segment stob. |
| 50 | + |
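The one-to-one correspondence between segment memory addresses and stob offsets can be sketched as follows. The type `example_seg` and the helper functions are hypothetical illustrations, not BE interfaces. The sketch assumes the segment is mapped as a single contiguous extent starting at `base`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical segment descriptor: a stob mapped at a fixed address. */
struct example_seg {
	uint8_t *base; /* start of segment memory, the same on every open */
	size_t   size; /* size of the mapped extent in bytes */
};

/* Return non-zero if addr lies within segment memory. */
static inline int example_seg_contains(const struct example_seg *seg,
				       const void *addr)
{
	uintptr_t a = (uintptr_t)addr;
	uintptr_t b = (uintptr_t)seg->base;

	return a >= b && a - b < seg->size;
}

/* Translate an address in segment memory to the stob offset. */
static inline size_t example_seg_offset(const struct example_seg *seg,
					const void *addr)
{
	assert(example_seg_contains(seg, addr));
	return (size_t)((uintptr_t)addr - (uintptr_t)seg->base);
}

/* Translate a stob offset back to the address in segment memory. */
static inline void *example_seg_addr(const struct example_seg *seg,
				     size_t offset)
{
	assert(offset < seg->size);
	return seg->base + offset;
}
```

Because `base` is the same each time the segment is opened, a pointer stored inside segment memory refers to the same stob offset after a re-open, which is why pointer-based structures can be kept in the segment without translation.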
| 51 | +BE-exported objects (domain, segment, region, transaction, linked list, and B-tree) support the Motr non-blocking server architecture. |
| 52 | + |
| 53 | +## Use Cases |
| 54 | +### Scenarios |
| 55 | + |
| 56 | +|Scenario | Description | |
| 57 | +|---------|-------------| |
| 58 | +|Scenario | `[usecase.component.name]` | |
| 59 | +|Relevant quality attributes| [e.g., fault tolerance, scalability, usability, re-usability]| |
| 60 | +|Stimulus| [an incoming event that triggers the use case]| |
| 61 | +|Stimulus source | [system or external world entity that caused the stimulus]| |
| 62 | +|Environment | [part of the system involved in the scenario]| |
| 63 | +|Artifact | [change to the system produced by the stimulus]| |
| 64 | +|Response | [how the component responds to the system change]| |
| 65 | +|Response measure |[qualitative and (preferably) quantitative measures of response that must be maintained]| |
| 66 | +|Questions and Answers | |