
Commit c95d612

Move documentation to main branch (Seagate#2070)
Addresses changes in PR Seagate#1563

Signed-off-by: hessio <[email protected]>
1 parent f7e6439 commit c95d612

20 files changed: +211 -0 lines

doc/CORTX-MOTR-ARCHITECTURE.md

Lines changed: 1 addition & 0 deletions

@@ -75,6 +75,7 @@
 # Object Layout #
 + Object is an array of blocks. Arbitrary scatter-gather IO with overwrite. Object has layout.
 + Default layout is parity de-clustered network raid: N+K+S striping.
++ More details about [parity declustering](doc/pdclust/index.rst)
 + Layout takes hardware topology into account: distribute units to support fault-tolerance.

 ![image](./Images/6_Object_Layout.png)

doc/pdclust/RAID.drawio

Lines changed: 1 addition & 0 deletions
(draw.io diagram source: one line of XML, omitted)

doc/pdclust/RAID.png

4.76 KB

doc/pdclust/index.rst

Lines changed: 201 additions & 0 deletions
@@ -0,0 +1,201 @@
==================================
Parity de-clustering documentation
==================================

:author: Nikita Danilov <[email protected]>
:state: INIT
:copyright: Seagate
:distribution: unlimited

:abstract: This document describes how and why parity de-clustering is used by motr.
Stakeholders
============

+----------+----------------------+----------------------------+----------------+
| alias    | full name            | email                      | rôle           |
+==========+======================+============================+================+
| nikita   | Nikita Danilov       | [email protected]  | author,        |
|          |                      |                            | architect      |
+----------+----------------------+----------------------------+----------------+
Introduction
============

*Parity de-clustering* (or pdclust for short) is a type of motr layout, but
let's define what a layout is first. In motr every object (where applications
store blocks of data) and index (where applications store key-value records)
has a layout. For an object (indices will be covered later), the layout
determines where in the distributed storage system object blocks are or should
be stored. The layout is stored as one of the object attributes, together with
other object meta-data. Ignoring, for the time being, the question of how
exactly the location of a block in the storage system is specified, the object
IO operations use layouts. A read operation queries the object layout about
the location of object blocks and then fetches the data from the locations
returned by the layout. A write operation queries the layout about where
(overwritten) blocks are located and where (new) blocks should be located, and
updates the specified locations with the user-provided data.
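
To make that flow concrete, here is a minimal toy sketch of layout-driven IO;
the ``Layout`` class, its trivial striping rule, and the in-memory "devices"
are all illustrative assumptions, not the motr API:

.. code-block:: python

   # Toy model (hypothetical names throughout): a layout maps
   # (object, block index) -> (device, device block), and reads and writes
   # simply follow that map.
   class Layout:
       def __init__(self, ndevices: int):
           self.ndevices = ndevices

       def locate(self, obj: str, block: int) -> tuple[int, int]:
           # Trivial striping rule: block i goes to device i mod ndevices.
           return block % self.ndevices, block // self.ndevices

   devices = [dict() for _ in range(4)]        # four in-memory "devices"

   def object_write(layout, obj, offset, blocks):
       for i, data in enumerate(blocks):
           dev, blk = layout.locate(obj, offset + i)
           devices[dev][(obj, blk)] = data     # update the located block

   def object_read(layout, obj, offset, nblocks):
       return [devices[dev][(obj, blk)]
               for dev, blk in (layout.locate(obj, offset + i)
                                for i in range(nblocks))]

   lay = Layout(ndevices=4)
   object_write(lay, "o1", 0, [b"a", b"b", b"c"])
   assert object_read(lay, "o1", 0, 3) == [b"a", b"b", b"c"]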

In addition to driving IO operations, a layout also comes with certain
fault-tolerance characteristics and mechanisms. It might specify that object
data are stored *redundantly*, that is, the original data can still be
recovered if some system component hosting some part of the data fails, and it
might come equipped with some fault-tolerance processes.

The layout of a particular object is an instance of some *layout type*.
Currently motr has only one layout type fully implemented: pdclust; other
layout types (compressed, encrypted, de-duplicated, composite, *etc*.) are
planned.
Parity de-clustering
====================

Parity de-clustering comes under many names in motr: "SNS" (server network
striping), network RAID, erasure coding. Network RAID probably gives the most
accessible initial intuition about what parity de-clustering is: it's like a
normal device RAID, but across multiple network nodes.

.. image:: RAID.png

Recall how a typical RAID system works. Given an object to be stored, select a
certain *unit size* and divide the object into data units of this size. Then
aggregate consecutive N data units and calculate for them K units of parity.
Together, N data units and K parity units constitute a *parity group*. The
most important property of a parity group is that, given any N units (out of
N+K), all N+K units can be recovered. This is achieved by carefully designed
parity calculation functions. There are many ways to achieve this; motr uses
the most widely known: Reed-Solomon codes. This ability to recover lost units
is what provides the fault-tolerance of pdclust. It is said that a parity
de-clustered layout has *striping pattern* N+K (there are more components to
the striping pattern, to be described later), N > 0, K >= 0. Parity blocks are
allocated, filled and managed by the motr IO code and are not visible to the
user.

Some examples of striping patterns:

- N+0: RAID-0,
- 1+1: mirroring,
- 1+K: (K+1)-way replication,
- N+1: RAID-5. In this case, the parity unit is the XOR-sum of the data units
  (see the sketch after this list),
- N+2: RAID-6.
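
A minimal sketch of the N+1 (RAID-5) row above, where the parity unit is the
XOR-sum of the data units; the unit contents and N = 3 are arbitrary examples,
and K >= 2 requires Reed-Solomon or a similar code rather than plain XOR:

.. code-block:: python

   from functools import reduce

   def xor(a: bytes, b: bytes) -> bytes:
       """XOR two equal-sized units byte by byte."""
       return bytes(x ^ y for x, y in zip(a, b))

   data_units = [b"unit-A", b"unit-B", b"unit-C"]   # N = 3 data units
   parity = reduce(xor, data_units)                 # K = 1 parity unit

   # Any one lost unit can be rebuilt from the surviving N units:
   rebuilt = reduce(xor, [data_units[0], data_units[2], parity])
   assert rebuilt == data_units[1]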

Once the support for units and parity groups is in place, IO is conceptually
simple: the layout knows the location of all units (data and parity). To read
a data unit, just read it directly from the location provided by the layout.
In case this read fails, for whatever reason, read the remaining units of the
parity group and reconstruct the lost data unit from them; this is called a
*degraded read*. Write is more complex. The simplest case is when the entire
parity group of N data units is overwritten. In this case, the write
calculates K parity units and writes all N+K units to their locations (as
determined by the layout). When a write overwrites only a part of the parity
group, a read-modify-write cycle is necessary. In case of failure, a *degraded
write* is performed: up to K unit writes can fail, but the write is still
successful.
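
For XOR parity, the read-modify-write cycle does not need to read the whole
group: the new parity can be derived from the old parity plus the old and new
contents of just the overwritten unit. A sketch of that step, with made-up
unit contents (this parity-delta shortcut is standard RAID practice; the text
above does not say whether motr uses it):

.. code-block:: python

   def xor(a: bytes, b: bytes) -> bytes:
       return bytes(x ^ y for x, y in zip(a, b))

   peer       = b"PEER"          # the untouched data unit of a 2+1 group
   old_data   = b"OLD!"          # current contents of the overwritten unit
   new_data   = b"NEW!"          # user-provided replacement
   old_parity = xor(old_data, peer)

   # parity' = parity ^ old_data ^ new_data; the peer unit is never read.
   new_parity = xor(xor(old_parity, old_data), new_data)
   assert new_parity == xor(new_data, peer)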

Example
=======

Consider a very simple storage system. Ignore all the actual complexity of
arranging hardware, cables, attaching devices to servers, racks, *etc*. At the
most basic level, the system consists of a certain number of storage devices.
Units can be read off and written to devices. Devices can fail
(independently).

There is the problem of how the units of a parity de-clustered file should be
scattered over these devices. There are multiple factors:

- for a given parity group, it's clearly preferable to store each unit (data
  and parity) on a separate device. This way, if a device fails, at most one
  unit of each group is lost;
- larger K gives better fault-tolerance;
- storage overhead is proportional to the K/N ratio (see the short example
  after this list);
- because full-group overwrites are most efficient, it's better to keep the
  unit size small (then a larger fraction of writes will be full-group);
- to utilise as many disc spindles as possible for each operation, it's
  better to keep the unit size small;
- to have efficient network transfers, it's better to have a large unit size;
- to have efficient storage transfers, it's better to have a large unit size;
- the cost of computing parity is O(K^2);
- to minimise the amount and complexity of internal meta-data that the system
  must maintain, the map from object units to their locations should be
  *computable* (*i.e.*, it should be possible to calculate the location of a
  unit by a certain function);
- to apply various storage and IO optimisations (copy-on-write, sequential
  block allocation, *etc*.), the map from object units to their locations
  should be constructed dynamically.
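
A quick numeric illustration of the overhead factor above (the patterns are
arbitrary examples):

.. code-block:: python

   # Storage overhead of an N+K pattern: K parity bytes per N data bytes.
   for n, k in [(1, 1), (4, 2), (8, 2), (10, 4)]:
       print(f"{n}+{k}: {k / n:.0%} overhead")
   # 1+1: 100% overhead (mirroring); 4+2: 50%; 8+2: 25%; 10+4: 40%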

.. image:: pool.png

Failures
========

Again, consider a very simple storage system with a certain number (P) of
storage devices, without any additional structure, and with striping pattern
N+K. Suppose a very simple round-robin block allocation is used:

.. image:: layout-rr.png

A device fails:

.. image:: layout-rr-failure.png

At a conceptual level (without at this time considering the mechanisms used),
let's understand what would be involved in the *repair* of this failure. To
reconstruct the units lost in the failure (again, ignoring for the moment the
details of when they are reconstructed and how the reconstructed data is
used), one needs, by the definition of the parity group, to read all remaining
units of all affected parity groups.

.. image:: layout-rr-affected.png

Suppose that the number of devices (P) is large (10^2--10^5) and the number of
units is very large (10^15). Ponder for a minute: what's wrong with the
picture above?

The problem is that the number of units that must be read off a surviving
device during repair is different for different devices. During a repair some
devices will be bottlenecks and some will be idle. With a large P, most of the
devices will be idle and won't participate in the repair. As a result, the
duration of repair (which is the interval of critical vulnerability in which
the system has reduced fault-tolerance) does not reduce as P grows large. But
the probability of a failure does grow with P, so overall system reliability
would decrease as P grows. One can do better.
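
A minimal simulation of this skew (P, N, K, the group count, and the failed
device are arbitrary example values). Because P is a multiple of N+K here,
round-robin placement always puts a group on the same N+K devices, so only the
three devices that share groups with the failed one do any repair reads:

.. code-block:: python

   from collections import Counter

   P = 8                  # devices in the pool
   N, K = 3, 1            # striping pattern N+K
   G = N + K              # parity group size
   GROUPS = 1000          # parity groups laid out round-robin
   failed = 0             # the device that fails

   load = Counter({d: 0 for d in range(P) if d != failed})
   for g in range(GROUPS):
       devices = [(g * G + u) % P for u in range(G)]   # round-robin placement
       if failed in devices:                           # group is affected
           for d in devices:
               if d != failed:
                   load[d] += 1          # surviving unit read during repair

   print(dict(load))   # {1: 500, 2: 500, 3: 500, 4: 0, 5: 0, 6: 0, 7: 0}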

Uniformity
==========

To get better fault-tolerance, two more requirements should be added to our
list:

- units of an object are uniformly distributed across all devices,
- the fraction of parity groups shared by any 2 devices is the same. This
  means that when a device fails, each surviving device should read
  (approximately) the same number of units during repair.

.. image:: layout-uniform.png

A simple round-robin unit placement does not satisfy these uniformity
requirements, but after a simple modification it does.

Let's call a collection of N+K striped units that exactly cover some number of
"rows" on a pool of P devices *a tile*.

.. image:: tile.png

For each tile, permute its columns according to a certain permutation selected
independently for each tile.

.. image:: permutation.png

This new layout of units satisfies the basic fault-tolerance requirement that
no two units of the same parity group are on the same device (convince
yourself).

It also satisfies the uniformity requirement (at least statistically, for a
large number of tiles).

.. image:: permutation-uniform.png
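
Repeating the earlier simulation with per-tile column permutations (same
illustrative parameters; ``random.shuffle`` stands in for whichever
permutation family the implementation actually uses) shows the repair load
spreading over every surviving device:

.. code-block:: python

   import random
   from collections import Counter
   from math import lcm

   P = 8                          # devices in the pool
   N, K = 3, 1
   G = N + K                      # parity group size
   TILES = 2000
   random.seed(0)

   ROWS = lcm(G, P) // P          # rows per tile: whole groups exactly fit
   GROUPS_PER_TILE = ROWS * P // G
   failed = 0

   load = Counter()
   for t in range(TILES):
       perm = list(range(P))
       random.shuffle(perm)       # independent permutation for each tile
       for g in range(GROUPS_PER_TILE):
           devices = [perm[(g * G + u) % P] for u in range(G)]
           if failed in devices:
               for d in devices:
                   if d != failed:
                       load[d] += 1

   # Every surviving device now carries roughly 6000/7 (about 860) reads.
   print(sorted(load.items()))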

Uniformity has some very important consequences. All devices participate
equally in the repair. But the total amount of data read during repair is
fixed (it is (N+K-1)*device_size). Therefore, as P grows, each device reads a
smaller and smaller fraction of its size. Therefore, as the system grows,
repair completes more quickly.
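
Spelling the arithmetic out (a back-of-the-envelope restatement of the claim
above): the failed device held device_size worth of units, each lost unit is
repaired by reading the N+K-1 remaining units of its group, and uniformity
spreads that total evenly over the P-1 survivors:

.. math::

   \text{read per surviving device} \approx
       \frac{(N + K - 1) \cdot \text{device\_size}}{P - 1}

With N+K fixed, this fraction of a device shrinks as P grows, which is exactly
why repair completes more quickly in a larger system.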
Lines changed: 1 addition & 0 deletions
(draw.io diagram source: one line of XML, omitted)

doc/pdclust/layout-rr-affected.png

16.1 KB
Lines changed: 1 addition & 0 deletions
(draw.io diagram source: one line of XML, omitted)

doc/pdclust/layout-rr-failure.png

15.7 KB

doc/pdclust/layout-rr.drawio

Lines changed: 1 addition & 0 deletions
(draw.io diagram source: one line of XML, omitted)

doc/pdclust/layout-rr.png

12.7 KB
