834 changes: 834 additions & 0 deletions 2035.diff

Large diffs are not rendered by default.

12 changes: 6 additions & 6 deletions doc/ADDB_Monitoring.md
@@ -40,7 +40,7 @@ Reporting of statistics is required, which is similar to df, vmstat, top, etc. T
Statistics belong to two categories:

1. Stats which are readily available, e.g., balloc will generate addb records about free space in a container periodically.
1. Stats which are not readily available.
2. Stats which are not readily available.

These stats summary ADDB records can be produced on any node, client or server. If produced on a client, they are sent to the endpoint where the addb service is running (using the current mechanism) and also to the endpoint where the stats service is running; if produced on a server, they are written to the addb stob and also sent to the endpoint where the stats service is running.

@@ -57,7 +57,7 @@ Monitors will be used to detect exceptional conditions. Periodic posting is not
## Logical Specification
ADDB monitors are represented as follows:

```
```C
struct m0_addb_monitor {
void (*am_watch) (const struct m0_addb_monitor *mon,
const struct m0_addb_rec *rec,
@@ -131,8 +131,8 @@ Following steps show how an addb monitor collects statistical information on a p

1. Create ADDB monitor, add it to the global list of monitors.
2. Define the type of addb record that it will generate.
1. Continuously compute statistics from the monitored addb records.
1. Periodically send this statistical information as addb records to the endpoint where the stats service is running, and also to the endpoint where the addb service is running if the node is a client, or to the addb stob if the node is a server.
3. Continuously compute statistics from the monitored addb records.
4. Periodically send this statistical information as addb records to the endpoint where the stats service is running, and also to the endpoint where the addb service is running if the node is a client, or to the addb stob if the node is a server (see the sketch below).
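
To make steps 2–4 concrete, here is a minimal, self-contained sketch of a monitor that keeps a running mean of a monitored value and posts a summary every few records. The record and monitor structures and the `summary_post()` helper are simplified stand-ins for illustration only, not the actual Motr ADDB types or API.

```C
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins for the real ADDB record and monitor types. */
struct addb_rec { uint64_t value; };

struct stat_monitor {
        uint64_t count;   /* number of records seen so far          */
        uint64_t sum;     /* running sum of the monitored value     */
        uint64_t period;  /* post a summary every 'period' records  */
};

/* Hypothetical sink: in Motr this would post a summary ADDB record to the
 * stats service endpoint (and the addb service or stob, as applicable). */
static void summary_post(const struct stat_monitor *m)
{
        printf("summary: samples=%llu mean=%llu\n",
               (unsigned long long)m->count,
               (unsigned long long)(m->count ? m->sum / m->count : 0));
}

/* Analogue of the am_watch callback: invoked for every monitored record. */
static void stat_watch(struct stat_monitor *m, const struct addb_rec *rec)
{
        m->sum   += rec->value;          /* step 3: keep computing stats  */
        m->count += 1;
        if (m->count % m->period == 0)
                summary_post(m);         /* step 4: periodic posting      */
}

int main(void)
{
        struct stat_monitor m = { .period = 4 };

        for (uint64_t i = 1; i <= 8; ++i) {
                struct addb_rec r = { .value = i * 10 };
                stat_watch(&m, &r);
        }
        return 0;
}
```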

**Exceptional conditions monitoring**

@@ -145,8 +145,8 @@ Following steps are to be taken:
**Building a cluster wide global & local state in memory on a node where stats service is running**

1. Create in-memory state structure of the cluster on this node.
1. Receive statistical summary addb records from all the nodes.
1. Update the state with the information in these latest addb records.
2. Receive statistical summary addb records from all the nodes.
3. Update the state with the information in these latest addb records.
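
A minimal sketch of the in-memory cluster state built in steps 1–3, assuming a fixed node table and a simplified summary payload; newer summary records overwrite older ones per node. All names here are hypothetical placeholders, not Motr structures.

```C
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 16

/* Simplified summary payload carried by a stats ADDB record. */
struct node_summary {
        uint64_t timestamp;    /* when the summary was produced      */
        uint64_t free_space;   /* e.g. balloc free space on the node */
};

/* In-memory cluster state on the node running the stats service. */
struct cluster_state {
        struct node_summary nodes[MAX_NODES];
};

/* Step 3: record the latest summary from a node, ignoring records
 * older than what is already stored. */
static void cluster_state_update(struct cluster_state *cs,
                                 unsigned node_idx,
                                 const struct node_summary *s)
{
        if (node_idx < MAX_NODES &&
            s->timestamp >= cs->nodes[node_idx].timestamp)
                cs->nodes[node_idx] = *s;
}

int main(void)
{
        struct cluster_state cs    = { 0 };
        struct node_summary  older = { .timestamp = 10, .free_space = 100 };
        struct node_summary  newer = { .timestamp = 20, .free_space =  80 };

        cluster_state_update(&cs, 3, &older);
        cluster_state_update(&cs, 3, &newer);   /* newer summary wins */
        assert(cs.nodes[3].free_space == 80);
        return 0;
}
```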

**Query for some state information to the stats service**

19 changes: 9 additions & 10 deletions doc/CORTX-MOTR-ARCHITECTURE.md
@@ -72,7 +72,7 @@
+ Asynchronous network and storage interface
+ Same on "client" (libmotr) and server (not quite yet).

# Object Layout
# Object Layout #
+ Object is an array of blocks. Arbitrary scatter-gather IO with overwrite. Object has layout.
+ Default layout is parity de-clustered network raid: N+K+S striping.
+ Layout takes hardware topology into account: distribute units to support fault-tolerance.
@@ -86,7 +86,7 @@
+ Fast scalable repairs of device failure.
+ There are other layouts: composite.

# Index Layout
# Index Layout #
+ An index is a container of key-value pairs:
+ GET(key) -> val, PUT(key, val), DEL(key), NEXT(key) -> (key, val)
+ used to store meta-data: (key: "/etc/passwd:length", value: 8192)
@@ -97,15 +97,15 @@

![image](./Images/7_Index_Layout.png)

# Data Flow S3 Redux
# Data Flow S3 Redux #
+ libmotr calculates cob identities and offsets within cobs
+ ioservice maps cob offset to device offset through ad (allocation data) index
+ mapping is done independently for each object and each parity group (aka stripe)
+ parity blocks are calculated by libmotr

![image](./Images/8_Dataflow_S3_Redux.png)
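
As a back-of-the-envelope illustration of the first bullet above, the sketch below maps a logical object offset to (parity group, data unit, cob index, cob offset) for a plain N+K layout with round-robin unit placement. The real libmotr mapping also applies the parity-declustering permutation, so this shows only the basic arithmetic; all names are hypothetical.

```C
#include <stdint.h>
#include <stdio.h>

/* Simplified N+K parity-group geometry (illustration only). */
struct layout {
        uint32_t N;          /* data units per parity group   */
        uint32_t K;          /* parity units per parity group */
        uint64_t unit_size;  /* bytes per unit                */
};

/* Map a logical object offset to parity group, data unit within the group,
 * target cob index and offset within that cob, assuming plain round-robin
 * placement (no pdclust permutation). */
static void map_offset(const struct layout *l, uint64_t off,
                       uint64_t *group, uint32_t *unit,
                       uint32_t *cob, uint64_t *cob_off)
{
        uint64_t unit_nr = off / l->unit_size;   /* global data unit number */

        *group   = unit_nr / l->N;
        *unit    = (uint32_t)(unit_nr % l->N);
        *cob     = *unit;                        /* round-robin placement   */
        *cob_off = *group * l->unit_size + off % l->unit_size;
}

int main(void)
{
        struct layout l = { .N = 4, .K = 2, .unit_size = 1048576 };
        uint64_t group, cob_off;
        uint32_t unit, cob;

        /* Offset 5 MiB + 4 KiB lands in group 1, unit 1, cob 1. */
        map_offset(&l, 5 * 1048576 + 4096, &group, &unit, &cob, &cob_off);
        printf("group=%llu unit=%u cob=%u cob_off=%llu\n",
               (unsigned long long)group, unit, cob,
               (unsigned long long)cob_off);
        return 0;
}
```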

# Data Flow with meta-data
# Data Flow with meta-data #
+ 2, 2': rpc from a client to services (async)
+ 3, 7: various meta-data lookups on the service
+ {4,8}.n: meta-data storage requests (btree operations)
@@ -133,7 +133,7 @@

![image](./Images/10_DTM.png)

# DTM Implementation Overview
# DTM Implementation Overview #
+ Track distributed transactions for each operation (send transaction identifier)
+ Each service, before executing the operation, writes its description into FOL: file operations log
+ In case of a service or a client failure, surviving nodes look through their logs and determine incomplete transactions.
@@ -187,8 +187,7 @@

![image](./Images/13_FDMI_Example_Plugin.png)

# Inverse meta-data

# Inverse meta-data
+ block allocation
+ pdclust structure
+ key distribution
@@ -225,7 +224,7 @@
+ always on (post-mortem analysis, first incident fix)
+ simulation (change configuration, larger system, load mix)

```
```sh
2020-02-20-14:36:13.687531192 alloc size: 40, addr: @0x7fd27c53eb20

| node <f3b62b87d9e642b2:96a4e0520cc5477b>
@@ -241,7 +240,7 @@
| stob-io-launch 2020-02-20-14:36:13.666152841, <100000000adf11e:3>, count: 8, bvec-nr: 8, ivec-nr: 8, offset: 65536
```

# ADDB: Monitoring and Profiling
# ADDB: Monitoring and Profiling
![image](./Images/14_ADDB_Monitoring_and_Profiling.png)


@@ -255,4 +254,4 @@
+ combine workloads
![image](./Images/15_ADDB_Advanced_Use_Case.png)

# Questions
# Questions
26 changes: 13 additions & 13 deletions doc/HLD-FOP-State-Machine.md
@@ -23,7 +23,7 @@ See [4] and [5] for the description of fop architecture.
* fop state machine (fom) is a state machine [6] that represents the current state of the fop's [r.fop]ST execution on a node. fom is associated with the particular fop and implicitly includes this fop as part of its state.
* a fom state transition is executed by a handler thread[r.lib.threads]. The association between the fom and the handler thread is short-living: a different handler thread can be selected to execute the next state transition.

## Requirements
## Requirements
* `[r.non-blocking.few-threads]` : Motr service should use a relatively small number of threads: a few per processor [r.lib.processors].
* `[r.non-blocking.easy]`: non-blocking infrastructure should be easy to use and non-intrusive.
* `[r.non-blocking.extensibility]`: addition of new "cross-cut" functionality (e.g., logging, reporting) potentially including blocking points and affecting multiple fop types should not require extensive changes to the data structures for each fop type involved.
@@ -35,15 +35,15 @@ See [4] and [5] for the description of fop architecture.
## Design Highlights
A set of data structures similar to those maintained by a typical thread or process scheduler in an operating system kernel (or a user-level library thread package) is used for non-blocking fop processing: prioritized run-queues of fom-s ready for the next state transition and wait-queues of fom-s parked waiting for events to happen.
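
A toy sketch of those scheduler data structures: a per-locality run-queue of ready fom-s and a wait-queue of parked fom-s, with a fom moved back to the run-queue when its event fires. The structures and functions below are hypothetical simplifications, not the Motr fom interface.

```C
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified fom and locality (not the Motr interface). */
struct fom {
        struct fom *next;
        int         phase;       /* how far fop processing has progressed */
};

struct locality {
        struct fom *runq;        /* fom-s ready for a state transition */
        struct fom *waitq;       /* fom-s parked waiting for events    */
};

static void enqueue(struct fom **q, struct fom *f)
{
        f->next = *q;
        *q = f;
}

/* Unlink 'f' from queue 'q'; no-op if it is not there. */
static void dequeue(struct fom **q, struct fom *f)
{
        for (struct fom **p = q; *p != NULL; p = &(*p)->next)
                if (*p == f) {
                        *p = f->next;
                        f->next = NULL;
                        return;
                }
}

/* A fom is about to wait for an event: park it on the wait-queue. */
static void fom_wait(struct locality *loc, struct fom *f)
{
        dequeue(&loc->runq, f);
        enqueue(&loc->waitq, f);
}

/* The event happened: the fom is ready for its next state transition. */
static void fom_wakeup(struct locality *loc, struct fom *f)
{
        dequeue(&loc->waitq, f);
        enqueue(&loc->runq, f);
}

int main(void)
{
        struct locality loc = { NULL, NULL };
        struct fom      f   = { NULL, 0 };

        enqueue(&loc.runq, &f);   /* a new fom starts out runnable     */
        fom_wait(&loc, &f);       /* parks while waiting for an event  */
        assert(loc.runq == NULL && loc.waitq == &f);
        fom_wakeup(&loc, &f);     /* event fired: runnable again       */
        assert(loc.runq == &f && loc.waitq == NULL);
        return 0;
}
```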

## Functional Specification
## Functional Specification
A fop belongs to a fop type. Similarly, a fom belongs to a fom type. The latter is part of the corresponding fop type. The fom type specifies the machine's states as well as its transition function. A mandatory part of fom state is a phase, indicating how far fop processing has progressed. Each fom goes through standard phases, described in [7], as well as some fop-type specific phases.

The fop-type implementation provides an enumeration of non-standard phases and a state-transition function for the fom.

Care is taken to guarantee that at least one handler thread is runnable, i.e., not blocked in the kernel, at any time. Typically, a state transition is triggered by some event, e.g., the arrival of an incoming fop, availability of a resource, or completion of a network or storage communication. When a fom is about to wait for an event to happen, the source of the future event is registered with the fom infrastructure. When the event happens, the appropriate state transition function is invoked.

## Logical Specification
### Locality
## Logical Specification
### Locality
For the present design, server computational resources are partitioned into localities. A typical locality includes a sub-set of available processors [r.lib.cores] and a collection of allocated memory areas [r.lib.memory-partitioning]. The fom scheduling algorithm tries to confine processing of a particular fom to a specific locality (called the home locality of the fom), establishing affinity of resources and optimizing cache hierarchy utilization. For example, the inclusion of all cores sharing processor caches in the same locality allows a fom to be processed on any of said cores without incurring a penalty of cache misses.

**Run-queue**
@@ -94,7 +94,7 @@ Two possible strategies to deal with this are:
**Long Term Scheduling**
The network request scheduler (NRS) has its own queue of fop-s waiting for execution. Together with the request handler queues, this comprises a two-level scheduling mechanism for long-term scheduling decisions.

## Conformance
## Conformance
* `[r.non-blocking.few-threads]`: the thread-per-request model is abandoned. A locality has only a few threads, typically 1–3 per core.
* `[r.non-blocking.easy]`: fom processing is split into a relatively small number of relatively large non-blocking phases.
* `[r.non-blocking.extensibility]`: a "cross-cut" functionality adds new state to the common part of fom. This state is automatically present in all fom-s.
@@ -103,7 +103,7 @@ The network request scheduler (NRS) has its queue of fop-s waiting for the execu
* `[r.non-blocking.resources]`: resource enqueuing interface (right_get()) supports asynchronous completion notification (see [3])`[r.resource.enqueue.async]`.
* `[r.non-blocking.other-block]`: this requirement is discharged by enter-block/leave-block pairs described in the handler thread subsection above.

## Dependencies
## Dependencies
* fop: fops are used by **Mero**
* library:
* `[r.lib.threads]`: library supports threading
@@ -118,16 +118,16 @@ The network request scheduler (NRS) has its queue of fop-s waiting for the execu
* resources:
* `[r.resource.enqueue.async]`: asynchronous resource enqueuing is supported.

## Security Model
## Security Model
Security checks (authorization and authentication) are done in one of the standard fom phases (see [7]).

## Refinement
## Refinement
The data structures, their relationships, concurrency control, and liveness issues follow quite straightforwardly from the logical specification above.

# State
## State
See [7] for the description of fom state machine.

## Use Cases
## Use Cases

**Scenarios**

@@ -183,7 +183,7 @@ Scenario 4
|Response| handler threads wait on a per-locality condition variable until the locality run-queue is non-empty again. |
|Response Measure|

## Failures
## Failures
- Failure of a fom state transition: this lands the fom in the standard FAILED phase;
- Dead-lock: dealing with dead-locks (including ones involving activity in multiple address spaces) is outside of the scope of the present design. It is assumed that general mechanisms of dead-lock avoidance (resource ordering, &c.) are used.
- Time-out: if a fom stays on the wait-list for too long, it is forced into the FAILED state.
@@ -204,14 +204,14 @@ An important question is how db5 accesses are handled in a non-blocking model.
- Advantages: purity and efficiency of the non-blocking model are maintained. db5 foot-print is confined and distributed across localities.
- Disadvantages: db5 threads of different localities will compete for shared db5 data, including cache-hot b-tree index nodes leading to worse cache utilization and cache-line ping-ponging (on the positive side, higher level b-tree nodes are rarely modified and so can be shared by multiple cores).

### Scalability
### Scalability
The point of the non-blocking model is to improve server scalability by:

- Reducing cache foot-print, by replacing thread stacks with smaller fom-s.
- Reducing scheduler overhead by using state machines instead of blocking and waking threads.
- Improving cache utilization by binding fom-s to home localities.

## References
## References
- [0] The C10K problem
- [1] LKML Archive Query: debate on 700 threads vs. asynchronous code
- [2] Why Events Are A Bad Idea (for High-concurrency Servers)
18 changes: 9 additions & 9 deletions doc/HLD-Resource-Management-Interface.md
@@ -1,4 +1,4 @@
# HLD Resource Management Interface
# HLD Resource Management Interface
This document presents a high level design **(HLD)** of scalable resource management interfaces for Motr.
The main purposes of this document are:
1. To be inspected by M0 architects and peer designers to ascertain that high level design is aligned with M0 architecture and other designs, and contains no defects.
@@ -7,10 +7,10 @@ This document presents a high level design **(HLD)** of scalable resource manage

The intended audience of this document consists of M0 customers, architects, designers, and developers.

## Introduction
## Introduction ##
Motr functionality, both internal and external, is often specified in terms of resources. A resource is part of the system or its environment for which a notion of ownership is well-defined.

## Definitions
## Definitions ##
- A resource is part of the system or its environment for which a notion of ownership is well-defined. Resource ownership is used for two purposes:

- concurrency control: resource owners can manipulate the resource, and the ownership transfer protocol ensures that owners do not step on each other. That is, resources provide a traditional distributed locking mechanism.
@@ -26,7 +26,7 @@ Motr functionality, both internal and external, is often specified in terms of r
- A usage credit can be associated with a lease, which is a time interval for which the credit is granted. The usage credit automatically cancels at the end of the lease. A lease can be renewed.
- One possible conflict resolution policy would revoke all already granted conflicting credits before granting the new credit. Revocation is effected by sending conflict call-backs to the owners of the credit. The owners are expected to react by canceling their cached credits.

## Requirements
## Requirements ##
- `[R.M0.LAYOUT.LAYID.RESOURCE]`: layids are handled as a distributed resource (similarly to fids).
- `[R.M0.RESOURCE]`: scalable hierarchical resource allocation is supported
- `[R.M0.RESOURCE.CACHEABLE]`: resources can be cached by clients
@@ -59,15 +59,15 @@
- `[r.resource.power]`: (electrical) power consumed by a device is a resource.


## Design Highlights
## Design Highlights ##
- hierarchical resource names. Resource name assignment can be simplified by introducing variable length resource identifiers.
- conflict-free schedules: no observable conflicts. Before a resource usage credit is canceled, the owner must re-integrate all changes made to the local copy of the resource. Conflicting usage credits can be granted only after all changes are re-integrated. Yet, the ordering between actual re-integration network requests and cancellation requests can be arbitrary, subject to server-side NRS policy.
- resource management code is split into two parts:
1. generic code that implements functionality independent of a particular resource type (request queuing, resource ordering, etc.).
2. per-resource type code that implements type-specific functionality (conflict resolution, etc.).
- an important distinction from a more traditional design (as exemplified by the Vax Cluster or Lustre distributed lock managers) is that there is no strict separation of rôles between "resource manager" and "resource user": the same resource owner can request usage credits from and grant usage credits to other resource owners. This reflects the more dynamic nature of Motr resource control flow, with its hierarchical and peer-to-peer caches.

## Functional Specification
## Functional Specification ##
The external resource management interface is centered around the following data types:
* a resource type
* a resource owner
@@ -93,7 +93,7 @@ The external resource management interface consists of the following calls:
On successful completion, the granted credit is held. notify_callback is invoked by the resource manager when the cached resource credit has to be revoked to satisfy a conflict resolution or some other policy.
- `credit_put(resource_credit)`: release held credit
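
A rough usage sketch of the acquire/release pattern described by these calls. Only part of the interface is visible in this hunk, so the `credit_get()` signature, the `resource_credit` type, and the callback wiring below are assumptions for illustration, not the actual Motr resource API.

```C
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the interface described above. */
struct resource_credit {
        const char *resource;   /* what the credit covers            */
        int         held;       /* non-zero while the credit is held */
};

/* Invoked by the resource manager when a cached credit must be revoked
 * to resolve a conflict or satisfy some other policy. */
static void notify_callback(struct resource_credit *cr)
{
        printf("revoking credit on %s: re-integrate changes and cancel\n",
               cr->resource);
        cr->held = 0;
}

/* Assumed shape of credit_get(): obtain and hold a usage credit. */
static int credit_get(struct resource_credit *cr,
                      void (*notify)(struct resource_credit *))
{
        (void)notify;           /* a real owner would remember the callback */
        cr->held = 1;
        return 0;
}

/* Release a held credit. */
static void credit_put(struct resource_credit *cr)
{
        cr->held = 0;
}

int main(void)
{
        struct resource_credit cr = { .resource = "extent [0, 4096)" };

        if (credit_get(&cr, notify_callback) == 0) {
                /* ... use the resource under protection of the credit ... */
                credit_put(&cr);
        }
        return 0;
}
```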

## Logical Specification
## Logical Specification ##

A resource owner maintains:

@@ -106,7 +106,7 @@ Examples:
- a queue of incoming pending credits. This is a queue of incoming requests for usage credits, which were sent to this resource owner and are not yet granted, due to whatever circumstances (unresolved conflict, long-term resource scheduling decision, etc.);
- a queue of outgoing pending credits. This is a queue of usage credits that users asked this resource owner to obtain, but that are not yet obtained.

### Conformance
### Conformance ###

- `[R.M0.LAYOUT.LAYID.RESOURCE]`, `[r.resource.fid]`, `[r.resource.inode-number]`: layout, file and other identifiers are implemented as a special resource type. These identifiers must be globally unique. A typical identifier allocator operates as follows:
- originally, a dedicated "management" node runs a resource owner that owns all identifiers (i.e., owns the [0, 0xffffffffffffffff] extent in identifiers name-space).
@@ -140,7 +140,7 @@
- `[r.resource.cluster-configuration]`: cluster configuration is a resource.
- `[r.resource.power]`: (electrical) power consumed by a device is a resource.

### Resource Type Methods
### Resource Type Methods ###
Implementations of these methods are provided by each resource type.

See examples below: