==================================
Parity de-clustering documentation
==================================

:author: Nikita Danilov <[email protected]>
:state: INIT
:copyright: Seagate
:distribution: unlimited

:abstract: This document describes how and why parity de-clustering (pdclust)
           is used by motr.

Stakeholders
============

+----------+----------------------+----------------------------+----------------+
| alias    | full name            | email                      | rôle           |
+==========+======================+============================+================+
| nikita   | Nikita Danilov       | [email protected] | author,        |
|          |                      |                            | architect      |
+----------+----------------------+----------------------------+----------------+

Introduction
============

*Parity de-clustering* (or pdclust for short) is a type of motr layout, but
let's first define what a layout is. In motr every object (where applications
store blocks of data) and index (where applications store key-value records)
has a layout. For an object (indices will be covered later), the layout
determines where in the distributed storage system object blocks are, or
should be, stored. The layout is stored as one of the object attributes,
together with other object meta-data. Ignoring, for the time being, the
question of how exactly the location of a block in the storage system is
specified, the object IO operations use layouts. A read operation queries the
object layout for the location of object blocks and then fetches the data from
the locations returned by the layout. A write operation queries the layout for
the locations of the (overwritten) blocks and the intended locations of the
(new) blocks, and updates the specified locations with the user-provided data.
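
To make the rôle of a layout concrete, here is a minimal sketch of the mapping
it provides to the IO code. This is an illustration only: the names below are
hypothetical Python, not the actual motr C interface.

.. code-block:: python

    from typing import NamedTuple

    class Location(NamedTuple):
        """Where one unit of an object lives in the storage system."""
        device: int   # identifier of a storage device
        offset: int   # position of the unit on that device

    class Layout:
        """Toy layout: maps an object's unit index to a location."""
        def locate(self, unit: int) -> Location:
            raise NotImplementedError

    def read_unit(layout: Layout, store, unit: int) -> bytes:
        # A read asks the layout where the unit is and fetches it from there.
        return store.get(layout.locate(unit))

    def write_unit(layout: Layout, store, unit: int, data: bytes) -> None:
        # A write asks the layout where the unit should go and updates it.
        store.put(layout.locate(unit), data)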

In addition to driving IO operations, a layout also comes with certain
fault-tolerance characteristics and mechanisms. It might specify that object
data are stored *redundantly*, that is, that the original data can still be
recovered if some system component hosting a part of the data fails, and it
might come equipped with some fault-tolerance processes.

The layout of a particular object is an instance of some *layout
type*. Currently motr has only one layout type fully implemented: pdclust;
other layout types (compressed, encrypted, de-duplicated, composite, *etc*.)
are planned.

Parity de-clustering
====================

Parity de-clustering goes under many names in motr: "SNS" (server network
striping), network RAID, erasure coding. Network RAID probably gives the most
accessible initial intuition about what parity de-clustering is: it's like
normal device RAID, but spread across multiple network nodes.

.. image:: RAID.png

Recall how a typical RAID system works. Given an object to be stored, select a
certain *unit size* and divide the object into data units of this size. Then
aggregate N consecutive data units and calculate for them K units of
parity. Together, the N data units and the K parity units constitute a *parity
group*. The most important property of a parity group is that given any N of
its units (out of N+K), all N+K units can be recovered. This is achieved by
carefully designed parity calculation functions. There are many ways to
achieve this; motr uses the most widely known: Reed-Solomon codes. This
ability to recover lost units is what provides the fault-tolerance of
pdclust. It is said that a parity de-clustered layout has *striping pattern*
N+K (there are more components to a striping pattern, to be described later),
N > 0, K >= 0. Parity blocks are allocated, filled and managed by the motr IO
code and are not visible to the user.

Some examples of striping patterns:

  - N+0: RAID-0,
  - 1+1: mirroring,
  - 1+K: (K+1)-way replication,
  - N+1: RAID-5. In this case, the parity unit is the XOR-sum of the data
    units (illustrated below),
  - N+2: RAID-6.
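
To make the recovery property concrete, below is a small self-contained sketch
of the simplest parity function: the XOR-sum used by the N+1 (RAID-5-like)
pattern. The general N+K case in motr uses Reed-Solomon codes instead; the
point here is only that any one lost unit can be rebuilt from the survivors.

.. code-block:: python

    from functools import reduce

    def xor_units(units):
        """Byte-wise XOR of equally sized units (the N+1 parity function)."""
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), units)

    # Build an N+1 parity group: N = 3 data units plus K = 1 parity unit.
    data = [b"unit-0..", b"unit-1..", b"unit-2.."]
    group = data + [xor_units(data)]

    # Lose any single unit (say, the one on a failed device) ...
    lost = 1
    survivors = [u for i, u in enumerate(group) if i != lost]

    # ... and recover it from the N surviving units.
    assert xor_units(survivors) == group[lost]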

Once the support for units and parity groups is in place, IO is conceptually
simple: the layout knows the location of every unit (data and parity). To read
a data unit, just read it directly from the location provided by the
layout. If this read fails, for whatever reason, read the remaining units of
the parity group and reconstruct the lost data unit from them; this is called
a *degraded read*. Write is more complex. The simplest case is when the entire
parity group of N data units is overwritten. In this case, the write
calculates the K parity units and writes all N+K units to their locations (as
determined by the layout). When a write overwrites only a part of the parity
group, a read-modify-write cycle is necessary. In case of failure, a *degraded
write* is performed: up to K unit writes can fail, but the write is still
successful.
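
The read path described above might look as follows. This is a hedged sketch
only: the layout and device interfaces are hypothetical, and, for brevity, the
reconstruction shown is the XOR-based one from the previous example (i.e., it
assumes K = 1).

.. code-block:: python

    from functools import reduce

    def xor_units(units):
        """Byte-wise XOR; rebuilds any one missing unit of an N+1 group."""
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), units)

    def read_data_unit(layout, devices, group: int, unit: int) -> bytes:
        """Read one data unit, falling back to a degraded read on failure."""
        try:
            dev, off = layout.locate(group, unit)
            return devices[dev].read(off)            # normal read
        except IOError:
            # Degraded read: fetch the remaining units of the parity group
            # (data and parity) and reconstruct the lost unit from them.
            rest = []
            for u in range(layout.N + layout.K):
                if u != unit:
                    dev, off = layout.locate(group, u)
                    rest.append(devices[dev].read(off))
            return xor_units(rest)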

Example
=======

Consider a very simple storage system. Ignore all the actual complexity of
arranging hardware, cables, attaching devices to servers, racks, *etc*. At the
most basic level, the system consists of a certain number of storage
devices. Units can be read from and written to the devices. Devices can fail
(independently).

The problem is how the units of a parity de-clustered file should be scattered
over these devices. There are multiple factors to weigh:

  - for a given parity group, it's clearly preferable to store each unit (data
    and parity) on a separate device. This way, if a device fails, at most
    one unit of each group is lost;
  - larger K gives better fault-tolerance;
  - storage overhead is proportional to the K/N ratio (see the example below);
  - because full-group overwrites are the most efficient, it's better to keep
    the unit size small (then a larger fraction of writes will be full-group);
  - to utilise as many disc spindles as possible for each operation, it's
    better to keep the unit size small;
  - to have efficient network transfers, it's better to have a large unit size;
  - to have efficient storage transfers, it's better to have a large unit size;
  - the cost of computing parity is O(K^2);
  - to minimise the amount and complexity of internal meta-data that the
    system must maintain, the map from object units to their locations should
    be *computable* (*i.e.*, it should be possible to calculate the location
    of a unit by a certain function);
  - to apply various storage and IO optimisations (copy-on-write, sequential
    block allocation, *etc*.), the map from object units to their locations
    should be constructed dynamically.

.. image:: pool.png
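
To give a feel for the overhead and fault-tolerance trade-off listed above,
here is a tiny worked example (the patterns are chosen arbitrarily for
illustration):

.. code-block:: python

    # Storage overhead and fault-tolerance of a few striping patterns.
    for name, n, k in [("1+1 (mirror)", 1, 1), ("4+2", 4, 2), ("8+2", 8, 2)]:
        print(f"{name}: tolerates {k} failure(s), overhead {k / n:.0%}")

    # 1+1 (mirror): tolerates 1 failure(s), overhead 100%
    # 4+2: tolerates 2 failure(s), overhead 50%
    # 8+2: tolerates 2 failure(s), overhead 25%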

Failures
========

Again, consider a very simple storage system, with a certain number (P) of
storage devices without any additional structure, and with striping pattern
N+K. Suppose a very simple round-robin block allocation is used:

.. image:: layout-rr.png
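
In code, such a round-robin placement could be as simple as the following
sketch (a hypothetical illustration, not the motr implementation): consecutive
units of the object are laid out across the P devices row by row.

.. code-block:: python

    def round_robin(group: int, unit: int, n: int, k: int, p: int):
        """Place unit number `unit` (0 <= unit < n + k) of parity group
        `group` on the next device, wrapping around after device p - 1 and
        starting a new "row" of units."""
        index = group * (n + k) + unit     # global unit number of the object
        return index % p, index // p       # (device, row on that device)

    # With P = 6 and a 3+1 pattern, group 0 occupies devices 0-3 of row 0,
    # group 1 occupies devices 4, 5 of row 0 and devices 0, 1 of row 1, etc.
    print([round_robin(g, u, 3, 1, 6) for g in (0, 1) for u in range(4)])

As long as N+K <= P, each unit of a group still lands on a distinct device;
the trouble, described in the next paragraphs, is how the groups sharing a
device are distributed over the other devices.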

A device fails:

.. image:: layout-rr-failure.png

At a conceptual level (without at this time considering the mechanisms used),
let's understand what would be involved in the *repair* of this failure. To
reconstruct the units lost in the failure (again, ignoring for the moment the
details of when they are reconstructed and how the reconstructed data is
used), one needs, by the definition of the parity group, to read all the
remaining units of all affected parity groups.

.. image:: layout-rr-affected.png

Suppose that the number of devices (P) is large (10^2--10^5) and the number of
units is very large (10^15). Ponder for a minute: what's wrong with the
picture above?

The problem is that the number of units that must be read off a surviving
device during repair differs from device to device. During a repair some
devices will be bottlenecks and others will be idle. With a large P, most of
the devices will idle and won't participate in the repair at all. As a result,
the duration of repair (which is the interval of critical vulnerability in
which the system has reduced fault-tolerance) does not decrease as P grows
large. But the probability of a failure does grow with P, so overall system
reliability would decrease as P grows. One can do better.

Uniformity
==========

To get better fault-tolerance, two more requirements should be added to our
list:

  - units of an object are uniformly distributed across all devices;
  - the fraction of parity groups shared by any 2 devices is the same. This
    means that when a device fails, each surviving device should read
    (approximately) the same number of units during repair.

.. image:: layout-uniform.png

A simple round-robin unit placement does not satisfy these uniformity
requirements, but after a simple modification it does.

Let's call a collection of N+K-striped units that exactly covers some number
of "rows" on a pool of P devices *a tile*.

.. image:: tile.png

For each tile, permute its columns according to a certain permutation selected
independently for each tile.

.. image:: permutation.png
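
A minimal sketch of this scheme is shown below. It is illustrative Python, not
the motr implementation; in particular, picking the permutation with a
pseudo-random generator seeded by the tile number is only an assumption made
here for simplicity.

.. code-block:: python

    import random
    from math import gcd

    def declustered(group: int, unit: int, n: int, k: int, p: int):
        """Round-robin placement followed by a per-tile column permutation.

        A tile is the smallest whole number of rows containing only whole
        parity groups; its height is (n + k) / gcd(n + k, p) rows.
        """
        index = group * (n + k) + unit            # global unit number
        column, row = index % p, index // p       # plain round-robin position
        tile = row // ((n + k) // gcd(n + k, p))  # which tile this row is in
        # Permute the P columns with a permutation chosen per tile.  Seeding a
        # PRNG with the tile number is just an easy way to select the
        # permutation independently (and computably) for every tile.
        perm = random.Random(tile).sample(range(p), p)
        return perm[column], row                  # (device, row on device)

Because the permutation is derived from the tile number alone, the map from
units to devices stays *computable*, which is one of the requirements listed
earlier.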

This new layout of units satisfies the basic fault-tolerance requirement that
no two units of the same parity group are on the same device (convince
yourself).

It also satisfies the uniformity requirement (at least statistically, for a
large number of tiles).

.. image:: permutation-uniform.png

Uniformity has some very important consequences. All devices participate
equally in the repair. But the total amount of data read during a repair is
fixed (it is (N+K-1)*device_size). Therefore, as P grows, each device reads a
smaller and smaller fraction of its size and, as the system grows, repair
completes quicker.
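
For example (a back-of-the-envelope estimate, assuming perfectly uniform
placement): each of the P-1 surviving devices reads roughly
(N+K-1)*device_size/(P-1) during repair. With an 8+2 pattern that is about 9
device-sizes of data in total; spread over P = 100 devices each survivor reads
about 9% of a device, and over P = 1000 devices under 1%, so repair time
shrinks roughly in proportion to 1/P.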