I ranked my blog posts using a ratio of hits to post age and have identified the top 10 most popular posts for 2011 (so far):
vSphere 5 storage enhancements – We discuss some of the more interesting storage-oriented vSphere 5 announcements, which included a new DAS storage appliance, a host-based (software) replication service, storage DRS and other capabilities.
Intel’s 320 SSD 8MB problem – We discuss a recent bug (since fixed) which left the Intel 320 SSD drive with only 8MB of storage; we presumed the bug was in the wear-leveling/block-mapping logic of the drive controller.
Is FC dead?! – With 40GbE FCoE just around the corner, 10GbE cards coming down in price, and Brocade’s poor YoY quarterly storage revenue results, we discuss the potential implications for FC infrastructure and its future in the data center.
I would have to say #3, 5, and 9 were the most fun for me to do. Not sure why, but #10 probably generated the most Twitter traffic. Why the others were so popular is hard for me to understand.
The problem with SSDs is that they typically all fail at some level of data writes, called the write endurance specification.
As such, if you purchase multiple drives from the same vendor, put them in a RAID group, and write to them evenly, they can all approach that limit at roughly the same time, which can lead to multiple near-simultaneous failures.
Say the SSD write endurance is 250TB (you can only write 250TB to the SSD before write failure), and you populate a RAID 5 group with them in a 3-data-drive + 1-parity-drive configuration. As it’s RAID 5, parity rotates around to each of the drives, roughly equalizing the parity write activity.
Now every write to the RAID group is actually two writes, one for data and one for parity. Thus, the 250TB of write endurance per SSD, which ought to yield 1000TB of write endurance for the RAID group, is reduced to something more like 125TB*4 or 500TB. Specifically,
Each write to a RAID 5 data drive also generates a parity write to the (rotating) parity drive,
As each parity write is written to a different drive, any one drive can absorb at most 125TB of data writes and 125TB of parity writes before it uses up its write endurance spec.
So for the 4-drive RAID group, only 500TB of data, evenly spread across the group, can be written before the group’s write endurance is used up.
As for RAID 6, the arithmetic is similar except that you lose more SSD life, as you write parity twice. E.g., for a 6-data-drive + 2-parity-drive RAID 6 group with similar write endurance, you get 83.3TB of data writes and 166.7TB of parity writes per drive, which for the 8-drive group works out to roughly 667TB of data writes before the RAID group’s write endurance lifetime is used up.
For RAID 1 with 2 SSDs in the group, every write is mirrored to both drives, so the group can absorb only 250TB of total data writes before both drives reach their write endurance limit.
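To make the arithmetic concrete, here is a minimal Python sketch of the reasoning above. The 250TB endurance figure and the drive counts are just the assumptions used in the examples, and the model deliberately ignores any write amplification inside the SSD itself.

```python
def raid_group_data_endurance(per_drive_tb, n_data, n_parity, redundant_writes_per_data_write):
    """Rough data-write endurance for a RAID group of identical SSDs.

    Assumes data and (rotating) parity/mirror writes are spread evenly
    across all drives, which is the worst case for simultaneous wear-out.
    """
    n_drives = n_data + n_parity
    total_endurance_tb = n_drives * per_drive_tb
    # every 1TB of host data consumes (1 + redundant writes) TB of drive writes
    write_amplification = 1 + redundant_writes_per_data_write
    return total_endurance_tb / write_amplification

# the examples from the text, assuming 250TB write endurance per SSD
print(raid_group_data_endurance(250, 3, 1, 1))  # RAID 5, 3+1 -> 500.0 TB
print(raid_group_data_endurance(250, 6, 2, 2))  # RAID 6, 6+2 -> ~666.7 TB
print(raid_group_data_endurance(250, 1, 1, 1))  # RAID 1, 1+1 -> 250.0 TB
```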
But the “real” problem is much worse
If I am writing the last TB to my RAID group and have managed to spread the data writes evenly across the group, one drive will go out right away, most likely with the current parity drive throwing a write error. BUT the real problem occurs during the rebuild.
With 256GB SSDs in the RAID 5 group and a 100MB/s read rate, reading the 3 surviving drives in parallel to rebuild the fourth will take ~43 minutes. However, that assumes the good SSDs are doing nothing but rebuild IO. Most systems limit rebuild IO to no more than 1/2 to 1/4 of the drive activity in the RAID group (possibly much less). As such, a more realistic rebuild time is anywhere from about 85 to 170 minutes or more.
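The same back-of-the-envelope rebuild calculation in code, using the assumed 256GB capacity and 100MB/s read rate; the throttling fractions are illustrative, not any particular array’s settings.

```python
def rebuild_minutes(capacity_gb, read_mb_per_s, rebuild_io_fraction=1.0):
    """Approximate time to rebuild one failed drive.

    Surviving drives are read in parallel, so the rebuild is gated by
    reading one drive's worth of capacity; rebuild_io_fraction is the
    share of drive bandwidth the system allows the rebuild to use.
    """
    seconds = (capacity_gb * 1000) / (read_mb_per_s * rebuild_io_fraction)
    return seconds / 60

print(rebuild_minutes(256, 100))        # ~42.7 min, rebuild gets all the bandwidth
print(rebuild_minutes(256, 100, 0.5))   # ~85.3 min at half bandwidth
print(rebuild_minutes(256, 100, 0.25))  # ~170.7 min at a quarter bandwidth
```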
Now, because the rebuild takes a long time, data will continue to be written to the RAID group while it runs. And as we saw above, the remaining drives in the RAID group are likely to be near the end of their write endurance already.
Thus, it’s quite possible that another SSD in the RAID group will fail while the first drive is being rebuilt, resulting in catastrophic data loss (2 bad drives in a RAID 5 group, 3 in a RAID 6 group).
RAID 1 groups with SSDs are probably even more prone to this issue: when the first drive fails, the second should follow closely behind.
Yes, but is this probable?
First, we are talking TBs of data here. The likelihood that a RAID group’s worth of drives would all have had the same amount of data written to them, to within a matter of hours of rebuild time, is somewhat low. That being said, the lower the write endurance of the drives, the more equal the SSDs’ remaining endurance at the creation of the RAID group, and the longer it takes to rebuild failing SSDs, the higher the probability of this type of catastrophic failure.
In any case, the problem is highly likely to occur in RAID 1 groups using similar SSDs, as the drives are always written in pairs.
But for RAID 5 or 6, it all depends on how well data striping across the RAID group equalizes data written to the drives.
For hard disks, equalizing IO was a good thing, and customers or storage systems all tried to balance IO activity across drives in a RAID group. So, ironically, with good (manual or automated) data striping the problem becomes more likely.
Automated storage tiering using SSDs is not as easy to fathom with respect to write endurance catastrophes. Here a storage system automatically moves the hottest data (highest IO activity) to SSDs and the coldest data down to hard disks. In this fashion, they eliminate any manual tuning activity but they also attempt to minimize any skew to the workload across the SSDs. Thus, automated storage tiering, if it works well, should tend to spread the IO workload across all the SSDs in the highest tier, resulting in similar multi-SSD drive failures.
However, with some vendors’ automated storage tiering, the data is actually copied rather than moved (that is, the data resides both on disk and on SSD). In this scenario, losing an SSD RAID group or two might severely constrain performance, but it does not result in data loss. It’s hard to tell which vendors do which, but customers should be able to find out.
So what’s an SSD user to do?
Using RAID 4 for SSDs seems to make sense. The reason we went to RAID 5 and 6 was to avoid a hot (parity write) drive or two, but at SSD speeds, having a hot parity drive is probably not a problem. (There is some debate on this; we may lose some SSD performance by doing so.) Of course the RAID 4 parity drive will wear out very quickly, but paradoxically having a skewed write workload within the RAID group will increase SSD data availability.
Mixing SSD ages within RAID groups as much as possible. That way a single level of accumulated write wear will not impact multiple drives at once.
Turning off LUN data striping within an SSD RAID group so data IO can be more skewed.
Monitoring write endurance levels for your SSDs, so you can proactively replace them long before they fail.
Keeping good backups and/or replicas of SSD data.
I learned the other day that most enterprise SSDs provide some sort of write endurance meter that can be seen at least at the drive level. I would suggest that all storage vendors make this sort of information widely available in their management interfaces. Sophisticated vendors could use such information to analyze the SSDs being used for a RAID group and suggest which SSDs to use to maximize data availability.
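To illustrate the kind of drive-level monitoring I mean, here is a rough sketch that shells out to smartctl and looks for a wear-level attribute. The attribute names vary by SSD vendor (Media_Wearout_Indicator, Wear_Leveling_Count, etc.) and the 20% alert threshold is arbitrary, so treat the whole thing as an assumption to adapt to your own drives and arrays.

```python
import subprocess

# attribute names vary by SSD vendor; these are just common examples
WEAR_ATTRIBUTES = ("Media_Wearout_Indicator", "Wear_Leveling_Count", "Percent_Lifetime_Remain")

def wear_level(device):
    """Return the normalized value (100 = new, lower = more worn) of the
    first recognized wear attribute reported by smartctl, or None."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] in WEAR_ATTRIBUTES:
            return int(fields[3])   # the normalized VALUE column
    return None

value = wear_level("/dev/sda")
if value is not None and value < 20:
    print("warning: SSD wear level low, plan a proactive replacement")
```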
But in any event, for now at least, I would avoid RAID 1 using SSDs.
I believe I have covered this ground before, but apparently it needs reiterating: cloud storage without backup cannot be considered a viable solution. Replication only works well if you never delete or logically erase data from the primary copy. Once that’s done, the data is soon lost in all replica locations as well.
I am not sure what happened with the Sidekick data, whether a finger check deleted it or some other problem occurred, but from what I see looking in from the outside, there were no backups, no offline copies, no fallback copies of the data that weren’t part of the central node and its network of replicas. When that’s the case, disaster is sure to ensue.
At the moment the blame game is going around to find out who is responsible, and I hear that some of the data may be being restored. But that’s not the point. Having no backups outside the original storage infrastructure/environment is the problem. Replicas are never enough; backups have to live elsewhere to count as backups.
If they had had backups, the duration of the outage would have been the time it took to restore the data, and some customer data written since the last backup would have been lost, but that would have been it. It wouldn’t be the first time backups had to be used and it won’t be the last. But with no backups at all, you have a massive customer data loss that cannot be recovered from.
This is unacceptable. It gives IT a bad name, puts a dark cloud over cloud computing and storage, and makes the IT staff of Sidekick/Danger look bad or, worse, incompetent.
All of you cloud providers need to take heed. You can do better. Backup software/services can be used to back up this data, and we will all be better served because of it.
Some are saying that the backups just weren’t accessible, but until the whole story comes out I will withhold judgement. I’d just be glad to see another potential data loss prevented.
I was talking with a cloud storage vendor the other day and they made an interesting comment: cloud storage doesn’t need to back up data?! They told me that they, like most cloud storage providers, replicate customer file data so that there are always at least two (or more) copies of customer data residing in the cloud at different sites, zones or locations. But does having multiple copies of file data eliminate the need to back up data?
Most people back up data to prevent data loss from hardware/software/system failures and from finger checks, i.e., user error. Nowadays, I back up my business data to an external hard disk nightly, add some family data and back all of this up to external removable media once a week, and once a month take a full backup of all user data on my family Macs (photos, music, family stuff, etc.) to external removable media which is then stored offsite.
Over my professional existence (30+ years) I have lost personal data to a hardware/software/system failure maybe a dozen times. These events have become much rarer in recent history (thank you, drive vendors). But about once a month I screw something up and delete or overwrite a file I need to keep. Most often I restore it from the hard drive, but occasionally I use the removable media to retrieve the file.
I am probably not an exception with respect to finger checks. People make mistakes. How cloud storage providers handle restoring file data deleted through user error will be a significant determinant of service quality for most novice and all professional users.
Now in my mind there are a couple of ways cloud storage providers can deal with this problem.
Support data backup, NDMP, or something similar which takes a copy of the data off the cloud and manages it elsewhere. This approach has worked for the IT industry for over 50 years now and still appeals to many of us.
Never “really” delete file data. By this I mean always keeping replicated copies of all data ever written to the cloud. How a customer accesses such “not really deleted” data is open to debate, but suffice it to say some form of file versioning might work.
“Delay” file deletion. Don’t delete a file when the user requests it, but rather wait until some external event, interval, or management policy kicks in to “actually” delete the file from the cloud. Again, some form of versioning may be required to access “delay deleted” data.
Never deleting a file is probably the easiest solution to the problem, but the cloud storage bill would quickly grow out of control. Delaying file deletion is probably a better compromise, but deciding which event, interval, or policy to use to trigger “actually deleting data” and free up storage space is crucial.
Luckily, most people realize they have made a finger check fairly quickly (although they may be reluctant to admit it), so waiting a week, month, or quarter before actually deleting file data would go a long way toward solving this problem. Mainframers may recall generation datasets (files), where one specified the number of generations (versions) of a file to keep and, when that limit was exceeded, the oldest version was deleted. A space threshold could also trigger deletion of old file versions, e.g., whenever the cloud reaches 60% of capacity it starts deleting old file versions. Any or all of these could be applied to different classes of data by management policy.
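Here is a toy sketch of what such a “delayed delete” policy might look like. The interface, the quarter-long retention period, and the five-generation limit are hypothetical illustrations of the knobs discussed above, not anyone’s actual implementation.

```python
import time

RETENTION_SECONDS = 90 * 24 * 3600   # e.g., keep "deleted" files for a quarter
MAX_GENERATIONS = 5                  # mainframe-style generation (version) limit

class DelayedDeleteStore:
    """Toy model: files are versioned on write and only hidden on delete;
    actual space reclamation happens later, by policy."""

    def __init__(self):
        self.versions = {}    # name -> list of (timestamp, data)
        self.deleted_at = {}  # name -> timestamp of the user's "delete"

    def write(self, name, data):
        self.versions.setdefault(name, []).append((time.time(), data))
        # enforce the generation limit: oldest versions go first
        self.versions[name] = self.versions[name][-MAX_GENERATIONS:]
        self.deleted_at.pop(name, None)

    def delete(self, name):
        # the user thinks it's gone, but we only remember when they asked
        self.deleted_at[name] = time.time()

    def restore(self, name, generation=-1):
        # undo a finger check: bring back the requested version
        self.deleted_at.pop(name, None)
        return self.versions[name][generation][1]

    def reclaim(self, now=None):
        # policy kicks in here: actually delete files whose retention expired
        now = now or time.time()
        for name, when in list(self.deleted_at.items()):
            if now - when > RETENTION_SECONDS:
                self.versions.pop(name, None)
                self.deleted_at.pop(name, None)
```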
Of course, all of this is pretty much what a sophisticated backup package does today. Backup software retains old file data for a defined timeframe, typically on some other media or storage than where the data is normally kept. Backup storage space/media can be reclaimed on a periodic basis, such as reusing backup media every quarter or retaining only a quarter’s worth of data in a VTL. Backup software removes the management of file versioning from the storage vendor and places it in the hands of the backup vendor. In any case, many of the same policies for dealing with deleted file versions discussed above can apply.
Nonetheless, in my view cloud storage providers must do something to support restoration of deleted file data. File replication is a necessary and great solution to deal with hardware/software/system failures but user error is much more likely. Not supplying some method to restore files when mistakes happen is unthinkable.