This repository was archived by the owner on May 3, 2024. It is now read-only.
doc/ADDB_Monitoring.md (6 additions & 6 deletions)
Statistics belong to two categories:
1. Stats which are readily available, e.g., balloc will generate addb records about free space in a container periodically.
2. Stats which are not readily available.
These stats summary ADDB records can be produced on any node, client or server. If produced on a client, they are sent to the endpoint where the addb service is running (using the current mechanism) and also to the endpoint where the stats service is running; if produced on a server, they are written to the addb stob and also sent to the endpoint where the stats service is running.
1. Create ADDB monitor, add it to the global list of monitors.
2. Define the type of addb record that it will generate.
3. Continuously compute statistics from the monitored addb records.
4. Periodically send this statistical information, as addb records, to the endpoint where the stats service is running, and also to the endpoint where the addb service is running if the node is a client, or to the addb stob if the node is a server.
**Exceptional conditions monitoring**
**Building a cluster-wide global & local state in memory on the node where the stats service is running**
1. Create in-memory state structure of the cluster on this node.
2. Receive statistical summary addb records from all the nodes.
3. Update the state with the information in these latest addb records.
**Query for some state information to the stats service**
doc/HLD-FOP-State-Machine.md (12 additions & 12 deletions)
See [4] and [5] for the description of fop architecture.
## Design Highlights
A set of data structures similar to one maintained by a typical thread or process scheduler in an operating system kernel (or a user-level library thread package) is used for non-blocking fop processing: prioritized run-queues of fom-s ready for the next state transition and wait-queues of fom-s parked waiting for events to happen.
## Functional Specification
A fop belongs to a fop type. Similarly, a fom belongs to a fom type. The latter is part of the corresponding fop type. fom type specifies machine states as well as its transition function. A mandatory part of fom state is a phase, indicating how far the fop processing progressed. Each fom goes through standard phases, described in [7], as well as some fop-type specific phases.
The fop-type implementation provides an enumeration of non-standard phases and state-transition function for the fom.
Care is taken to guarantee that at least one handler thread is runnable, i.e., not blocked in the kernel, at any time. Typically, a state transition is triggered by some event, e.g., the arrival of an incoming fop, availability of a resource, or completion of network or storage communication. When a fom is about to wait for an event to happen, the source of the future event is registered with the fom infrastructure. When the event happens, the appropriate state transition function is invoked.
## Logical Specification

### Locality
For the present design, server computational resources are partitioned into localities. A typical locality includes a sub-set of available processors [r.lib.cores] and a collection of allocated memory areas [r.lib.memory-partitioning]. The fom scheduling algorithm tries to confine processing of a particular fom to a specific locality (called the home locality of the fom), establishing affinity of resources and optimizing cache hierarchy utilization. For example, the inclusion of all cores sharing processor caches in the same locality allows a fom to be processed on any of said cores without incurring a penalty of cache misses.
**Run-queue**
**Long Term Scheduling**
The network request scheduler (NRS) has its queue of fop-s waiting for the execution. Together with request handler queues, this comprises a two-level scheduling mechanism for long-term scheduling decisions.
## Conformance
* `[r.non-blocking.few-threads]`: the thread-per-request model is abandoned. A locality has only a few threads, typically some small number (1–3) of threads per core.
* `[r.non-blocking.easy]`: fom processing is split into a relatively small number of relatively large non-blocking phases.
* `[r.non-blocking.extensibility]`: a "cross-cut" functionality adds new state to the common part of fom. This state is automatically present in all fom-s.
* `[r.non-blocking.other-block]`: this requirement is discharged by enter-block/leave-block pairs described in the handler thread subsection above.
## Dependencies
* fop: fops are used by **Mero**
* library:
  * `[r.lib.threads]`: library supports threading
* resources:
  * `[r.resource.enqueue.async]`: asynchronous resource enqueuing is supported.
## Security Model
Security checks (authorization and authentication) are done in one of the standard fom phases (see [7]).
## Refinement
The data structures, their relationships, concurrency control, and liveness issues follow quite straightforwardly from the logical specification above.
## State
See [7] for the description of fom state machine.
## Use Cases
**Scenarios**
Scenario 4
|Response| handler threads wait on a per-locality condition variable until the locality run-queue is non-empty again. |
|Response Measure|
## Failures
- Failure of a fom state transition: this lands the fom in the standard FAILED phase;
- Dead-lock: dealing with the dead-lock (including ones involving activity in multiple address spaces) is outside of the scope of the present design. It is assumed that general mechanisms of dead-lock avoidance (resource ordering, &c.) are used.
- Time-out: if a fom is staying on the wait-list for too long, it is forced into the FAILED state.
An important question is how db5 accesses are handled in a non-blocking model.
- Advantages: purity and efficiency of the non-blocking model are maintained. db5 foot-print is confined and distributed across localities.
- Disadvantages: db5 threads of different localities will compete for shared db5 data, including cache-hot b-tree index nodes leading to worse cache utilization and cache-line ping-ponging (on the positive side, higher level b-tree nodes are rarely modified and so can be shared by multiple cores).
### Scalability
The point of the non-blocking model is to improve server scalability by:
- Reducing cache foot-print, by replacing thread stacks with smaller fom-s.
- Reducing scheduler overhead by using state machines instead of blocking and waking threads.
- Improving cache utilization by binding fom-s to home localities.
## References
- [0] The C10K problem
- [1] LKML Archive Query: debate on 700 threads vs. asynchronous code
- [2] Why Events Are A Bad Idea (for High-concurrency Servers)
doc/HLD-Resource-Management-Interface.md (9 additions & 9 deletions)
# HLD Resource Management Interface
This document presents a high level design **(HLD)** of scalable resource management interfaces for Motr.
The main purposes of this document are:
1. To be inspected by M0 architects and peer designers to ascertain that high level design is aligned with M0 architecture and other designs, and contains no defects.
The intended audience of this document consists of M0 customers, architects, designers, and developers.
## Introduction
Motr functionality, both internal and external, is often specified in terms of resources. A resource is part of the system or its environment for which a notion of ownership is well-defined.
## Definitions
- A resource is part of the system or its environment for which a notion of ownership is well-defined. Resource ownership is used for two purposes:
  - concurrency control: resource owners can manipulate the resource, and the ownership transfer protocol assures that owners do not step on each other. That is, resources provide a traditional distributed locking mechanism.
- A usage credit can be associated with a lease, which is a time interval for which the credit is granted. The usage credit automatically cancels at the end of the lease. A lease can be renewed.
- One possible conflict resolution policy would revoke all already granted conflicting credits before granting the new credit. Revocation is effected by sending conflict call-backs to the owners of the credit. The owners are expected to react by canceling their cached credits.
## Requirements
- `[R.M0.LAYOUT.LAYID.RESOURCE]`: layids are handled as a distributed resource (similarly to fids).
- `[R.M0.RESOURCE]`: scalable hierarchical resource allocation is supported
- `[R.M0.RESOURCE.CACHEABLE]`: resources can be cached by clients
- `[r.resource.power]`: (electrical) power consumed by a device is a resource.
## Design Highlights
- hierarchical resource names. Resource name assignment can be simplified by introducing variable length resource identifiers.
- conflict-free schedules: no observable conflicts. Before a resource usage credit is canceled, the owner must re-integrate all changes made to the local copy of the resource. Conflicting usage credits can be granted only after all changes are re-integrated. Yet, the ordering between actual re-integration network requests and cancellation requests can be arbitrary, subject to server-side NRS policy.
- resource management code is split into two parts:
  1. generic code that implements functionality independent of a particular resource type (request queuing, resource ordering, etc.).
  2. per-resource type code that implements type-specific functionality (conflict resolution, etc.).
- an important distinction with a more traditional design (as exemplified by the Vax Cluster or Lustre distributed lock managers) is that there is no strict separation of rôles between "resource manager" and "resource user": the same resource owner can request usage credits from and grant usage credits to other resource owners. This reflects the more dynamic nature of Motr resource control flow, with its hierarchical and peer-to-peer caches.
## Functional Specification
The external resource management interface is centered around the following data types:
* a resource type
* a resource owner
The external resource management interface consists of the following calls:
On successful completion, the granted credit is held. notify_callback is invoked by the resource manager when the cached resource credit has to be revoked to satisfy a conflict resolution or some other policy.
- `credit_put(resource_credit)`: release a held credit
## Logical Specification
A resource owner maintains:
- a queue of incoming pending credits. This is a queue of incoming requests for usage credits, which were sent to this resource owner and are not yet granted, due to whatever circumstances (unresolved conflict, long-term resource scheduling decision, etc.);
- a queue of outgoing pending credits. This is a queue of usage credits that users asked this resource owner to obtain, but that are not yet obtained.
### Conformance
- `[R.M0.LAYOUT.LAYID.RESOURCE]`, `[r.resource.fid]`, `[r.resource.inode-number]`: layout, file, and other identifiers are implemented as a special resource type. These identifiers must be globally unique. A typical identifier allocator operates as follows:
  - originally, a dedicated "management" node runs a resource owner that owns all identifiers (i.e., owns the [0, 0xffffffffffffffff] extent in identifiers name-space).
- `[r.resource.cluster-configuration]`: cluster configuration is a resource.
- `[r.resource.power]`: (electrical) power consumed by a device is a resource.
### Resource Type Methods
Implementations of these methods are provided by each resource type.