Atmos GeoProtect vs RAID

The Night Lights of Europe (as seen from space) by woodleywonderworks (cc) (from flickr)

Yesterday, twitterland was buzzing about EMC's latest enhancement to their Atmos cloud storage platform, called GeoProtect. This new capability improves cloud data protection by supporting erasure-coded data protection rather than just pure object replication.

Erasure coding has been used in storage for over a decade, and some of the common algorithms are Reed-Solomon, Cauchy Reed-Solomon, EVENODD coding, etc. All of these algorithms provide a way to split customer data into data fragments plus parity (the encoding) so that some number of data or parity fragments can be erased (or lost) while the customer data can still be reconstructed. For example, an R-S encoding scheme we used in the past (called RAID 6+) had 13 data fragments and 2 parity fragments. Such an encoding scheme could survive the simultaneous failure of any two drives and still supply (reconstruct) customer data.
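To make the fragments-plus-parity idea concrete, here is a toy sketch using a single XOR parity fragment. It tolerates only one lost fragment (real Reed-Solomon style codes like the 13+2 scheme above tolerate more), and it only illustrates the encode/rebuild mechanics, not any vendor's implementation.

```python
# Toy erasure code: split data into k fragments plus one XOR parity
# fragment; any ONE missing fragment can be rebuilt from the others.
# Real schemes (Reed-Solomon, EVENODD) tolerate multiple losses; this
# only shows the data-fragments-plus-parity idea.
from functools import reduce

def xor_all(fragments):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), fragments)

def encode(data: bytes, k: int):
    frag_len = -(-len(data) // k)                        # ceiling division
    padded = data.ljust(frag_len * k, b"\0")
    frags = [padded[i*frag_len:(i+1)*frag_len] for i in range(k)]
    return frags + [xor_all(frags)]                      # k data + 1 parity

def rebuild(fragments, lost_index):
    return xor_all([f for i, f in enumerate(fragments) if i != lost_index])

frags = encode(b"customer data worth protecting", k=4)
assert rebuild(frags, lost_index=2) == frags[2]          # lost fragment recovered
```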

But how does RAID differ from something like GeoProtect?

  • RAID is typically implemented within a single storage array, not across storage arrays
  • RAID is typically limited to a small number of alternative configurations of data disks and parity disks, which cannot be altered in the field, and
  • RAID currently doesn't typically support more than two disk failures while still being able to recover customer data (see Are RAID's days numbered? below)

As I understand it, GeoProtect currently supports only two different encoding schemes, which provide for different levels of fragment failure while still protecting customer data. And with GeoProtect you are protecting data across Atmos nodes, and potentially across different geographic locations, not just within a storage array. Also, with Atmos this is all policy driven: data coming into the system can use any object replication policy or either of the two GeoProtect policies supported today.

The nice thing about R-S encoding, though, is that it doesn't have to be fixed at two encoding schemes. And since it's all software, new coding schemes could easily be released over time, possibly someday becoming something a user could dial up or down at their whim.

But that would look much more like what Cleversafe has been offering in their SliceStor product. With Cleversafe the user can specify exactly how much redundancy they want, and the system takes care of everything else. In addition, Cleversafe has implemented a finer-grained approach (with many more fragments), and data and parity are intermingled in each stored fragment.

It's not a big stretch for Atmos to go from two GeoProtect configurations to four or more. It's unclear to me what the right number would be, but once you get past three or so, it might be easier to just code a generic R-S routine that can handle any configuration the customer wants, though I may be oversimplifying the mathematics here.

Nonetheless, in future versions of Atmos I wouldn't be surprised if the way data is protected could change over time through policy management. Specifically, while data is being frequently accessed it could use object replication or a less storage-efficient encoding to speed up access, but once access frequency diminishes (or enough time passes), the data could be protected with a more storage-efficient encoding scheme, reducing its footprint in the cloud while still offering similar resiliency to data loss.
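If protection did become that dynamic, a policy engine might look something like the sketch below. This is purely hypothetical: none of these names or thresholds come from the Atmos API; it just illustrates the hot-data-replicated, cold-data-erasure-coded idea.

```python
# Hypothetical policy chooser (names and thresholds are my own, not the
# Atmos API): frequently accessed objects keep full replicas for fast
# access; cold objects move to a more storage-efficient erasure-coded layout.
from datetime import datetime, timedelta

def protection_policy(last_access: datetime, accesses_last_30d: int) -> str:
    hot = (accesses_last_30d > 10
           or datetime.now() - last_access < timedelta(days=30))
    if hot:
        return "replicate:3"     # three full copies, 3.0x storage footprint
    return "erasure:9+3"         # 9 data + 3 coded fragments, ~1.33x footprint

print(protection_policy(datetime.now() - timedelta(days=90), accesses_last_30d=0))
```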

Full disclosure: I have worked for Cleversafe in the past, and although I currently do work with EMC, I have done no work for EMC's Atmos team.

Storage strategic inflection points

EMC vs S&P 500 Stock price chart - 20 yrs from Yahoo Finance

Both EMC and Spectra Logic celebrated 30 years in business this month, and it got me to thinking. Both companies started at the same time, but one is a ~$14B revenue ('09 projected) behemoth and the other a successful but relatively mid-size storage company (Spectra Logic is private and does not report revenues). What's the big difference between these two? As far as I can tell both companies have been ably run for some time now by very smart people. Why is one two or more orders of magnitude bigger than the other? Recognizing strategic inflection points is key.

So what is a strategic inflection point? Andy Grove, who may have coined the term, calls a strategic inflection point a point “… where the old strategic picture dissolves and gives way to the new.” In my view EMC has been more successful at recognizing storage strategic inflection points than Spectra Logic, and this explains a major part of its success.

EMC’s history in brief

In listening this week to Joe Tucci's talk at EMC Analyst Days, he talked about the rather humble beginnings of EMC. It started out selling furniture and memory for mainframes (I think), but Joe said it really took off in 1991, almost 12 years after it was founded. It seems they latched onto a DRAM-based, SSD-like storage technology and converted it to use disk as a RAID storage device, first in the mainframe and later in the open systems arena. RAID killed off the big (14″ platter) disk devices that had dominated storage at that time, and once started it could not be stopped. Whether by luck or smarts (probably a little of both), EMC's push into RAID storage made them what they are today.

It was interesting to see how this played out in the storage market. RAID used smaller disks: first 8″, then 5.25″ and now 3.5″. When first introduced, manufacturing costs for RAID storage were so low that one couldn't help but make a profit selling against the big disk devices with their 14″ platters. The more successful RAID became, the more available and reliable the smaller disks became, which led to a virtuous cycle culminating in the highly reliable 3.5″ disk devices available today. I'm not sure Joe was at EMC at the time, but if he was he would probably have called that transition from big-platter disks to RAID a “strategic inflection point” for the storage industry.

Most of EMC's competitors and customers would probably say that aggressive marketing also helped propel EMC to the top of the storage heap. I am not sure which came first, the recognition of a strategic inflection point like RAID or the EMC marketing machine, but together they gave EMC a decided advantage that reconstructed the storage industry.

Spectra Logic’s history in brief

As far as I can tell Spectra Logic has been in the backup software business for a long time and later started supporting tape technology, for which they are well known today. Spectra Logic has disk storage systems as well, but they seem better known for their tape and backup technology.

The big changes in tape technology over the past 30 years have been tape cartridges and robotics. Although tape cartridges were introduced by IBM (for the IBM 3480 in 1985), the first true tape automation was introduced by Storage Technology Corp. (with the STK 4400 in 1987). Storage Technology rode the wave of the robotics revolution throughout the late 80’s into the mid 90’s and was very successful for a time. Spectra Logic’s entry into tape robotics was sometime later (1995) but by the time they got onboard it was a very successful and mature technology.

Nonetheless, the revolution in tape technology and operations brought on by these two advances probably held off the decline in tape for a decade or two, yet it could not ultimately stem the tide away from tape that is apparent today (see my post on Repositioning of tape). Spectra Logic has recently introduced a new tape library.

Another strategic inflection point that helped EMC

Proprietary “open” Unix systems had started to emerge in the late 80's and early 90's, and by the mid 90's they were beginning to host most new and sophisticated applications. The FC interface also emerged in the early-to-mid 90's as a replacement for HIPPI technology from the HPC world, and for a while it battled it out against SSA technology from IBM, but by 1997 it had emerged victorious. Once FC and the follow-on higher level protocols (resulting in SANs) were available, proprietary Unix systems had the IO architecture to support any application the enterprise needed, and the two took off feeding on each other. This was yet another strategic inflection point. I am not sure EMC was the first entry into this market, but they were surely the biggest and, as such, quickly came to dominate it. In my mind EMC's real accelerated growth can be tied to this timeframe.

EMC’s future bets today

Again, today, EMC seems to be in the fray for the next inflection. Their latest bets are on virtualization technology with VMware, NAND-SSD storage and cloud storage. They bet large on the VMware acquisition and it's working well for them. They were the largest company and the earliest to market with NAND-SSD technology in the broad market space and seem to enjoy a commanding lead. Atmos was not the first cloud storage offering out there, but once again EMC was one of the largest companies to go after this market.

One can't help but admire a company that swings for the bleachers every time it gets a chance at bat. Not every swing goes out of the park, but when they get ahold of one, sometimes they can change whole industries.

Protecting the Yottabyte archive

blinkenlights by habi (cc) (from flickr)

In a previous post I discussed what it would take to store 1YB of data in 2015 for the National Security Agency (NSA). Due to length, that post did not discuss many other aspects of the 1YB archive such as ingest, index, data protection, etc. I will attempt to cover each of these in turn; this post covers some of the data protection aspects of the 1YB archive and its catalog/index.

RAID protecting 1YB of data

Protecting the 1YB archive will require some sort of parity protection. RAID data protection could certainly be used and may need to be extended to removable media (RAID for tape), but that would require somewhere in the neighborhood of 10-20% additional storage (RAID 5 across a 10- to 5-wide tape drive stripe). With Reed-Solomon encoding and RAID 6 we could possibly take this down to 5-10% additional storage (RAID 6 across a 40- to 20-wide tape drive stripe). Other forms of ECC (such as turbo codes) might also be usable in a RAID-like configuration, giving even better reliability with less additional storage.
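A quick way to sanity-check those overhead figures is to compute the share of raw capacity handed over to parity for each stripe width. This is a rough sketch; it ignores spares, metadata and rebuild space.

```python
# Share of raw capacity devoted to parity for an N-wide stripe with p
# parity members; a rough approximation that ignores spares and metadata.
def parity_share(stripe_width: int, parity_members: int) -> float:
    return parity_members / stripe_width

for width, p in [(10, 1), (5, 1), (40, 2), (20, 2)]:
    print(f"{p} parity across {width} drives: "
          f"{parity_share(width, p):.0%} of raw capacity")
# 10%, 20%, 5%, 10% -- matching the 10-20% and 5-10% ranges above
```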

But RAID-like protection also applies to the data catalog and indexes required to access the 1YB archive, and likewise to the online data itself while it's being ingested, indexed, or read back. For the remainder of this post I ignore the RAID overhead, but suffice it to say an additional ~10% of storage for parity will not change this discussion much.

Also, in the original post I envisioned a multi-tier storage hierarchy where the lowest tier always held a copy of any files residing in the upper tiers. This provides some RAID 1-like redundancy for any online data. It could be pretty useful too: if a file is of high interest, it has probably been accessed recently and therefore resides in the upper storage tiers, so multiple copies of interesting files could exist.

Catalog and indexes backups for 1YB archive

IMHO, RAID or other parity protection is different from data backup. Backup is generally used as a last line of defense against hardware failure, software failure or user error (deleting the wrong data). It's certainly possible that the lowest-tier data is stored on some sort of WORM (write once, read many) media, meaning it cannot be overwritten, which eliminates one class of user error.

But this presumes the catalog is available and the media is locatable, which means the catalog has to be preserved and protected from user error and HW and SW failures. I wrote about whether cloud storage needs backup in a prior post and feel strongly that the 1YB archive would require backups as well.

In general, backup today is done by copying the data to some other storage and keeping that storage offsite from the original data center. At this scale, most likely the 2.1×10**21 bytes of catalog (see the original post) and index data would be copied to some form of removable media. The catalog is most important, as the other two indexes could potentially be rebuilt from the catalog and the original data. Assuming we are unwilling to re-index the data, with LTO-6 tape cartridges the catalog and index backups would take 1.3×10**9 LTO-6 cartridges (at 1.6×10**12 bytes/cartridge).

To back up this amount of data once per month would take a gaggle of tape drives. There are ~2.6×10**6 seconds/month and each LTO-6 drive can transfer 5.4×10**8 bytes/sec, or 1.4×10**15 bytes/drive-month, but we need to back up 2.1×10**21 bytes of data, so we need ~1.5×10**6 tape transports. Now tapes do not operate 100% of the time, because when a cartridge becomes full it has to be changed out for an empty one, but this amounts to a rounding error at these numbers.

To figure out the tape robotics needed to service 1.5×10**6 transports we could use the T-Finity tape library just announced by Spectra Logic. The T-Finity supports 500 tape drives and 122,000 tape cartridges, so we would need 3.0×10**3 libraries to handle the drive workload and about 1.1×10**4 libraries to store the cartridge set required; thus 11,000 T-Finity libraries would suffice. Presumably, using LTO-7 these numbers could be cut roughly in half: ~5,500 libraries, ~7.5×10**5 transports, and 6.6×10**8 cartridges.
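The arithmetic above is easy to lose track of, so here it is in one place. The inputs are the figures quoted in this post (LTO-6 capacity and transfer rate, T-Finity drive and slot counts), not vendor-verified numbers.

```python
# Backup sizing arithmetic using the figures quoted above.
catalog_bytes  = 2.1e21        # catalog + index data to protect
lto6_capacity  = 1.6e12        # bytes per LTO-6 cartridge
lto6_rate      = 5.4e8         # bytes/sec per LTO-6 drive
secs_per_month = 2.6e6

cartridges = catalog_bytes / lto6_capacity                    # ~1.3e9
transports = catalog_bytes / (lto6_rate * secs_per_month)     # ~1.5e6
libs_for_drives = transports / 500        # T-Finity: 500 drives per library
libs_for_slots  = cartridges / 122_000    # T-Finity: 122,000 slots per library

print(f"{cartridges:.1e} cartridges, {transports:.1e} transports")
print(f"{libs_for_drives:.1e} libraries (drive-limited), "
      f"{libs_for_slots:.1e} libraries (slot-limited)")
```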

Other removable media exist, most notably the Prostor RDX. However, RDX roadmap information for the next generation is not readily available and high-end robotics do not currently support RDX. So for the moment tape seems the only viable removable backup medium for the catalog and index of the 1YB archive.

Mirroring the data

Another approach to protecting the data is to mirror the catalog and index data. This involves copying the data to another online storage repository, which doubles the storage required (to 4.2×10**21 bytes). Replication doesn't easily protect against user error but is an option worthy of consideration.

Networking infrastructure needed

Whether mirroring or backing up to tape, moving this amount of data will require substantial networking infrastructure. Assume that in 2015 we have 32GFC (32 Gb/sec Fibre Channel) interfaces; each interface could potentially transfer 3.2GB/s, or 3.2×10**9 bytes/sec. Mirroring or backing up 2.1×10**21 bytes over one month will then take ~2.5×10**5 32GFC interfaces. We should probably have twice that much networking so that no one link becomes a bottleneck, so 5×10**5 32GFC interfaces should work.

As for switches, the current Brocade DCX supports 768 8GFC ports, and presumably similar port counts will be available in 2015 to support 32GFC. If we assume at least 2 ports per link, the 5×10**5 links need ~1×10**6 ports, or ~1,300 fully populated DCX switches. This doesn't account for multi-layer switches and other sophisticated switch topologies, but that could be accommodated with another factor of 2, or ~2,600 switches.
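Here is the same back-of-the-envelope network sizing in code, using the assumptions above (32GFC at 3.2GB/s, 2 ports per link, 768 ports per DCX); it is a sketch of this post's arithmetic, not a fabric design.

```python
# Network sizing for the monthly catalog/index copy, per the assumptions above.
bytes_to_move  = 2.1e21
secs_per_month = 2.6e6
gfc32_rate     = 3.2e9                 # bytes/sec per 32GFC interface
dcx_ports      = 768                   # ports per fully populated DCX

interfaces = bytes_to_move / (gfc32_rate * secs_per_month)   # ~2.5e5
links      = 2 * interfaces            # double up so no single link bottlenecks
ports      = 2 * links                 # assume 2 switch ports per link
switches   = ports / dcx_ports         # ~1,300; x2 again for multi-layer topologies

print(f"{interfaces:.1e} interfaces, {links:.1e} links, {switches:.0f} switches")
```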

Hot backups require journals

This all assumes we can do catalog and index backups once per month and take the whole month to do them. Storage today normally has to be quiesced (via snapshot or some other mechanism) to be backed up in a consistent state. While it's not impossible to back up data that is concurrently being updated, it is more difficult. In that case, one needs to maintain a journal of the updates going on while the data is being backed up and then apply the journaled changes to the backed-up data.
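A minimal sketch of that journaled "hot" backup idea follows. All of the structures here are hypothetical stand-ins: copy the live catalog while updates continue, log each update that lands during the copy, then replay the journal against the copy so it reaches a consistent point.

```python
# Minimal sketch of a journaled "hot" backup (all names hypothetical).
catalog = {"objA": 1, "objB": 2}        # stand-in for the live catalog
journal = []                            # updates that arrive mid-backup

def update(key, value):                 # every live update is also journaled
    catalog[key] = value
    journal.append((key, value))

backup = dict(catalog)                  # long-running bulk copy starts here
update("objB", 3)                       # ...updates keep arriving meanwhile
update("objC", 4)

for key, value in journal:              # replay journaled changes onto the copy
    backup[key] = value

assert backup == catalog                # the copy is now consistent
```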

For the moment I am not going to determine the storage requirements for the journal needed to cover a month of catalog transactions, as this depends on the change rate of the catalog data. It will necessarily be a function of the index/ingest rate of the 1YB archive, to be covered in a future post.

Stay tuned, I am just having too much fun to stop.

Are RAID's days numbered?

HP/EVA drive shelves in the HP/EVA lab in Colo. Springs
An older article by Robin Harris of StorageMojo that I recently came across said RAID 5 would be dead in 2009. In essence, it said that as drives reach 1TB or more, the time it takes to rebuild a failed drive forces a move to RAID 6.

Another older article I came across said RAID is dead, all hail the storage robot. It argued that when it comes to drive sizes there needs to be more flexibility, with support for different capacity drives in a RAID group. Data Robotics' Drobo products now support this capability, which we discuss below.

I am here to tell you that RAID is not dead, not even on life support, and without it the storage industry would seize up and die. One must first realize that RAID as a technology is just a way to group together a bunch of disks and protect the data on those disks. RAID comes in a number of flavors, which include definitions for:

  • RAID 0 – no protection
  • RAID 1 – mirrored data protection
  • RAID 2 through 5 – single parity protection
  • RAID 6 and DP – dual parity protection

The rebuild time problem with RAID

The problem with drive rebuild time is that rebuilding a 1TB or larger disk drive can take hours if not days, depending on how busy the storage system and the RAID group are. And of course as 1.5TB and 2TB drives come online this just keeps getting longer. Rebuilds can be sped up by having larger single-parity RAID groups (more disk spindles in the RAID stripe), by using DP, which actually has two RAID groups cross-coupled (meaning more disk spindles), or by using RAID 6, which often has more spindles in the RAID group.
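For a rough feel of the numbers, a back-of-the-envelope rebuild time is just drive capacity divided by the effective rebuild rate, which drops sharply on a busy array. The rates in the sketch below are illustrative assumptions, not measurements of any particular system.

```python
# Back-of-the-envelope rebuild time: capacity / effective rebuild rate.
# The rebuild rates are illustrative assumptions (idle vs. busy array).
TB = 1e12

def rebuild_hours(capacity_bytes: float, rebuild_mb_per_sec: float) -> float:
    return capacity_bytes / (rebuild_mb_per_sec * 1e6) / 3600

for cap in (1 * TB, 2 * TB):
    for rate in (100, 10):                      # assumed MB/s: idle vs. busy
        print(f"{cap/TB:.0f}TB at {rate}MB/s: "
              f"{rebuild_hours(cap, rate):.1f} hours")
# 1TB: ~2.8h idle vs. ~28h busy; 2TB roughly doubles both
```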

Regardless of how you cut it, there is some upper limit to the number of spindles that can be used to rebuild a failed drive: the number of active spindles in the storage subsystem. You could conceivably incorporate all of these drives into a single RAID 5 or 6 group (albeit a very large one).

The downside of such a large RAID group is that data overwrites could cause a performance bottleneck on the parity drives. That is, whenever a block is overwritten in a RAID 2-6 group, the parity for that data block (usually located on one or more other drives) has to be read, recalculated and rewritten back to the same location. The write can be buffered and lazily destaged, but the data is not actually protected until the parity is on disk someplace.
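Here is the small-write penalty in miniature, a sketch of the standard parity update rather than any particular array's implementation: overwriting one data block means reading the old data and old parity, recomputing parity, and writing both back.

```python
# RAID small-write penalty: updating one block costs two reads and two
# writes, because new_parity = old_parity XOR old_data XOR new_data.
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

old_data   = b"\x11" * 4
new_data   = b"\x2a" * 4
old_parity = b"\x5c" * 4                    # parity of the rest of the stripe

# two reads (old data, old parity) ...
new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
# ... then two writes (new data, new parity): 4 I/Os for one block update
print(new_parity.hex())
```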

One way around this problem is to use a log-structured file system. Log-structured file systems never rewrite data in place, so there is no overwrite penalty, neatly eliminating the problem.

Alas, not everyone uses log-structured file systems for backend storage. So for the rest of the storage industry the write penalty is real and needs to be managed effectively so it doesn't become a performance problem. One way to manage it is to limit RAID group size to a small number of drives.

So the dilemma is this: to provide reasonable drive rebuild times you want a wide (large) RAID group with as many drives as possible in it, but to minimize the (over-)write penalty you want as narrow (small) a RAID group as possible. How can we solve this dilemma?

Parity declustering

Parity Declustering figure from Holland&Gibson 1992 paper

In the declustered parity scheme described by Holland and Gibson in their 1992 paper, parity and stripe data can be spread across more drives than a single RAID 5 or 6 group would use. They show an 8-drive system (see figure) where stripe data (three data block sets) and parity data (one parity block set) are rotated around the group of 8 physical drives in the array. In this way all 7 remaining drives are used to service a failed 8th drive: some blocks are rebuilt from one set of 3 drives and other blocks from a different set of 3 drives. Working through the failed drive's block set, the rebuild touches all 7 remaining drives, but none of them is busy for every block. This should shrink drive rebuild time considerably by utilizing more spindles.
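A toy way to see the effect is to lay one stripe on every 4-drive subset of an 8-drive pool and then count the rebuild reads a failed drive imposes on the survivors. This is not the exact layout from the paper, just a sketch of the declustering idea: the rebuild load spreads evenly across all surviving drives instead of landing on one small group.

```python
# Toy declustered layout: every 4-drive subset of 8 drives hosts one
# stripe (3 data + 1 parity). When a drive fails, its rebuild reads are
# spread evenly across ALL surviving drives.
from itertools import combinations

DRIVES, STRIPE_WIDTH = 8, 4
stripes = list(combinations(range(DRIVES), STRIPE_WIDTH))   # 70 stripes

failed = 0
load = {d: 0 for d in range(DRIVES) if d != failed}
for stripe in stripes:
    if failed in stripe:                 # this stripe lost a member...
        for d in stripe:
            if d != failed:
                load[d] += 1             # ...read its 3 survivors to rebuild it

print(load)   # every surviving drive contributes equally (15 reads each)
```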

Because parity declustering distributes the parity across a number of disk drives as well as the data, no one disk holds the parity for all drives. This eliminates the hot-parity-drive phenomenon, normally dealt with by using smaller RAID group sizes.

The mixed drive capacity problem with RAID today

The other problem with RAID today is that it assumes a homogeneous set of disk drives in the storage array, so that the same blocks/tracks/block sets can be set up as a RAID stripe across those disks and used to compute parity. Now, the original RAID paper by Patterson, Gibson, and Katz never explicitly stated a requirement for all the disk drives to be the same capacity, but it is easiest to implement RAID that way. With drives of diverse capacity and performance you would normally want them in separate RAID groups. You could create a RAID group sized to the smallest capacity drive, but by doing so you waste all the excess storage in the larger disks.

Now one solution to the above would be the declustered parity approach mentioned above, but in the end you would need at least N drives of the same capacity for whatever your stripe size (N) was going to be. And if you had that many drives, why not just use RAID 5 or 6?

Another solution, popularized by Drobo, is to carve up the various disk drives into RAID group segments. So if you had 4 drives of 100GB, 200GB, 400GB and 800GB, you could carve out 4 RAID groups: a 100GB RAID 5 group across 4 drives; another 100GB RAID 5 group across 3 drives; a 200GB RAID 1 mirror across the largest 2 drives; and a 400GB RAID 0 on the largest drive. These could be configured as 4 LUNs or Windows drive letters and used any way you wish.
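A greedy sketch of that kind of carving is below. To be clear, this is my own reconstruction of the idea, not Drobo's actual algorithm: slice the mixed-capacity drives into equal-size segments and protect each slice with whatever the number of participating drives allows.

```python
# Greedy sketch of sub-drive carving for mixed-capacity drives (my own
# reconstruction of the idea, not Drobo's actual algorithm).
def carve(capacities_gb):
    remaining = list(capacities_gb)
    groups = []
    while any(remaining):
        live = [i for i, c in enumerate(remaining) if c > 0]
        slice_gb = min(remaining[i] for i in live)       # smallest leftover sets slice size
        kind = "RAID5" if len(live) >= 3 else "RAID1" if len(live) == 2 else "RAID0"
        groups.append((kind, slice_gb, len(live)))
        for i in live:
            remaining[i] -= slice_gb
    return groups

print(carve([100, 200, 400, 800]))
# [('RAID5', 100, 4), ('RAID5', 100, 3), ('RAID1', 200, 2), ('RAID0', 400, 1)]
```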

But is this RAID?

I would say “yes”. Although this is at the sub-drive level, it still looks like RAID storage, using parity and data blocks across stripes of data. All that's been done is to take the unit of a drive and make it some portion of a drive instead. Marketing aside, I think it's an interesting concept and it works well for a few drives of mixed capacity (just the market space Drobo is going after).

For larger configurations with intermixed drives I like parity declustering. It has the benefits of bigger RAID groups without the problem of increased activity for overwrites. Given today's drive capacities I might still lean towards a dual parity scheme within the parity-declustered stripe, but that doesn't seem difficult to incorporate.

So when people ask if RAID’s days are numbered – my answer is a definite NO!