Better erasure coding for scale-out & cloud storage

LRcC(6,2,2) example layout
LRcC(6,2,2) example layout

Microsoft Azure uses a different style of erasure coding for their cloud storage than what I have encountered in the past. Their erasure coding technique was documented in a paper presented at USENIX ATC’12 (for more info check out their Erasure coding in Windows Azure Storage paper).

The new erasure coding can be optimized for rebuild read or storage space overhead. can at times correct for more errors than equivalent, more traditional, Reed-Solomon (RS) erasure coding schemes.

Azure Local Reconstruction Codes

The new scheme is called Local Reconstruction Codes, not to be confused with Longitudinal Redundancy Codes, both of which can be abbreviated as LRC. In this post, I will use the abbreviation LRcC for local reconstruction codes although in the paper  (and in these graphics) they used LRC as the abbreviation.

In a typical RS erasure coding scheme, let’s say 12+3, there are 12 data segments and 3 parity segments. In this RS(12,3) scheme, a storage system can recover from up to three data segment failures and still not lose data by using the remaining data segments and the three parity segments.

In the Azure LRcC coding scheme, let’s say 12+2+2, there are 12 data segments, 2 local parity segments, and 2 global parity segments. The data segments are split up into two groups and one local parity segment is used for each group. In an LRcC(12,2,2) scheme, a storage system can recover from up to three arbitrary failures and still not lose data, but it can also recover in most cases from 4 segment failures without losing data.

Ok, you add a parity segment and now you can recover from more failures. An RS(12,4)  coding scheme could recover from any 4 arbitrary segment failures. So that’s not that great. What’s great about the LRcC(12,2,2) scheme is that

What’s great about the LRcC(12,2,2) scheme is that it can reconstruct any single data segment failure by only reading 5 data and 1 parity segments.

Lots of transient errors in cloud storage

That doesn’t seem like much of an advantage until you start considering what this means for scale-out or cloud storage. In these systems, having a storage node/drive be temporarily out of service can occur for a number of reasons and occurs much more frequently than a drive or a controller failure in a more traditional storage system. In this case, for an LRcC(12,2,2) scheme, it would rebuild the missing segment in 6 reads whereas an RS(12,3) scheme would take a minimum of 12 reads.

Temporary or transient outages can occur due to networking failures, upgrade activity, or storage node overloading. In these situations, a coding scheme that can reconstruct a data segment in less IO activity can be a real advantage. And transient outages occur more frequently when you have more storage nodes in your system. Cloud storage takes all of this to an extreme level but many scale-out file systems support 100s of nodes as well.

LRcC coding schemes look different

Describing how an LRcC coding scheme works takes some math but if you are familiar with RAID 5 & RAID 6 you already know most of what you need to know. (The graphic at the start of this post shows an LRcC(6,2,2) scheme).

In the LRcC(12,2,2) scheme discussed earlier, the 12 data segments are split into two groups of 6 data segments each. and the local parity is essentially an RAID5 parity segment for the group.  So an LRcC(12,2,2) looks like this

A,A,A,A,A,A & P(A); B,B,B,B,B,B & P(B); GP1(A,B) and GP2(A,B), 

Where: the 6 A’s are data segments in one local group, and P(A) is its local parity segment; the 6 B’s are data segments in the other local group and P(B) is its local parity segment; GP1(A,B) is the first global parity segment computed over all local data segments (A,B); and GP2(A,B) is the second global parity segment computed over all local data segments (A,B).

The global parity segments can be construed as a similar to the dual parity in a RAID 6, although this is not mathematically correct. In this LRcC(12,2,2) coding scheme:

  • For a 1 segment failure, i.e., where an A (or B) data segment is lost, can be recovered by reading the remaining A segments plus its local parity P(A);
  • For any 2 segment failures, 12 of the 14 remaining data and parity segments will need to be read to reconstruct the failed segments.
  • For any 3 segment failure, all (13) remaining data and  parity segments will need to be read to reconstruct the failing segments.
  • For a recoverable 4 segment failure, all (12) remaining data and parity segments will need to be read to reconstruct the failing segments.

Not sure I can explain how the “recoverable 4 segment failures” work but consider this, if the 4 parity segments are lost, then it’s relatively easy to reconstruct all of them from the data segments themselves. If at most 1 of the 4 failures is in a group, then the rest of the data segments and its associated parity can be used to reconstruct that groups data and parity segments which leaves 3 lost segments that have to be reconstructed from the remaining 3 data and parity segments. So 3 failing segments, with up to 3 non-failing segments. This looks surprising like a RS(n,3), with 3 data segments lost and is readily recoverable. I think the problem with 4 segment outages being non-recoverable in an LRcC(12,2,2) is when the lost segments are split across both groups. The paper states that the LRcC(12,2,2) scheme can handle 86% of all 4 segment failures without data loss.

Advantages of LRcC vs. RS

So, the main advantage of LRcC coding schemes is that reconstruction of the most likely failures (1 segment outage) can occur with 1/2 the rebuild reads as an equivalent RS code. The busier a system gets, the more of an advantage this would seem to provide.

LRcC(622)space-overhead
LRcC vs. RS reconstruction overhead vs. space overhead

In the paper, they have space vs. reconstruction overhead tradeoff charts, –depicting LRcC against RS coding schemes and other standard storage redundancy schemes. For most RS coding schemes you can consume less storage space and have the same rebuild (reconstruction reads) with an LRcC than with an equivalent fault tolerant RS coding scheme. Conversely, for the same storage overhead you can have less reconstruction overhead for an LRcC than with an equivalent RS coding scheme.

The paper also shows some modeled performance comparisons for RS(12,4) vs LRcC(12,2,2) in lightly and heavily loaded storage systems under small (4KB) and large (4MB) IO block workloads for a single segment outage.

LRcC perf charts Performance comparisons of LRcC(12,2,2) vs. RS(12,4) codings schemes for a single segment failure
Performance comparisons of LRcC(12,2,2) vs. RS(12,4) codings schemes for a single segment failure

In lightly loaded storage with small IO blocks the latency advantages are not that significant but when using small IO under heavy load or using large IO under light or heavy load, the reconstruction performance advantages are significant.

This seems a bit unfair as the RS(12,4) can support any arbitrary 4 segment failure while the LRcC(12,2,2) can only support most 4 segment failures without data loss. But I believe the performance comparisons would be exactly the same for an RS(12,3) coding scheme.

~~~~

Comments?

[Thanks to Greg Schulz (@storageIO, StorageIO.com) for the idea for this post]

Picture credit(s): All graphics come from the original Microsoft Research paper

8 Replies to “Better erasure coding for scale-out & cloud storage”

Comments are closed.