At SFD10 a couple of weeks back, while we were visiting with Nimble Storage, they mentioned that their latest all-flash storage array was going to support triple-parity RAID.
And last week at a NetApp-SolidFire analyst event, someone mentioned the new ONTAP 9 triple-parity RAID-TEC™ for larger SSDs. Also heard at the meeting was that a 15.3TB SSD would take on the order of 12 hours to rebuild.
Need for better protection
When Nimble discussed the need for triple parity RAID they mentioned the report from Google I talked about recently (see my Surprises from 4 years of SSD experience at Google post). In that post, the main surprise was the number of read errors they had seen from the SSDs they deployed throughout their data center.
I think the need for triple-parity RAID and larger (15TB and up) SSDs will become more common over time. There’s no reason to think that the SSD vendors will stop at 15TB. And if it takes 12 hours to rebuild a 15TB one, I think it’s probably something like ~30 hours to rebuild a 30TB one, which is just a generation or two away.
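As a rough sanity check on those rebuild times, here is a minimal sketch that assumes rebuild time scales linearly with capacity at the rate implied by the 15.3TB / 12 hour figure. The rate is derived from that one data point and is not a vendor specification. Straight linear scaling lands a bit under 24 hours for 30TB, so a ~30 hour guess leaves some room for rebuild rates not keeping pace with capacity.

```python
def rebuild_hours(capacity_tb: float, rate_tb_per_hour: float) -> float:
    """Hours to rebuild one drive at a fixed effective rebuild rate."""
    return capacity_tb / rate_tb_per_hour


# Effective rate implied by the quoted figure: 15.3TB in about 12 hours.
# This is an assumption derived from one data point, not a published spec.
rate = 15.3 / 12.0

for capacity in (15.3, 30.0, 60.0):
    hours = rebuild_hours(capacity, rate)
    print(f"{capacity:5.1f} TB -> {hours:5.1f} hours")
```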
[Full disclosure: I helped develop the underlying hardware for VSM 1-3 and also way back, worked on HSC for StorageTek libraries.]
Virtual Storage Manager System 6 (VSM6) is here. Not exactly sure when VSM5 or VSM5E were released but it seems like an awful long time in Internet years. The new VSM6 migrates the platform to Solaris software and hardware while expanding capacity and improving performance.
Oracle StorageTek VSM is a virtual tape system for mainframe (System z) environments. It provides a multi-tiered storage system which includes both physical disk and (optional) tape storage for long term big data requirements for z/OS applications.
VSM6 emulates up to 256 virtual IBM tape transports but actually moves data to and from VSM Virtual Tape Storage Subsystem (VTSS) disk storage and backend real tape transports housed in automated tape libraries. As VSM data ages, it can be migrated out to physical tape such as a StorageTek SL8500 Modular [Tape] Library system that is attached behind the VSM6 VTSS or system controller.
VSM6 offers a number of replication solutions for DR to keep data in multiple sites in synch and to copy data to offsite locations. In addition, real tape channel extension can be used to extend the VSM storage to span onsite and offsite repositories.
One can cluster together up to 256 VSM VTSSs into a tapeplex which is then managed under one pane of glass as a single large data repository using HSC software.
What’s new with VSM6?
The new VSM6 hardware increases volatile cache to 128GB from 32GB (in VSM5). Non-volatile cache goes up as well, now supporting up to ~440MB, up from 256MB in the previous version. Power, cooling and weight all seem to have also gone up (the wrong direction??) vis a vis VSM5.
The new VSM6 removes the ESCON option of previous generations and moves to 8 FICON and 8 GbE Virtual Library Extension (VLE) links. FICON channels are used for both host access (frontend) and real tape drive access (backend). VLE was introduced in VSM5 and offers a ZFS based commodity disk tier behind the VSM VTSS for storing data that requires longer residency on disk. Also, VSM supports a tapeless or disk-only solution for high performance requirements.
System capacity moves from 90TB (gosh that was a while ago) to now support up to 1.2PB of data. I believe much of this comes from supporting the new T10000C tape cartridge and drive (5TB uncompressed). With the ability of VSM to cluster more VSM systems to the tapeplex, system capacity can now reach over 300PB.
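The "over 300PB" figure appears to follow from the cluster limit mentioned earlier. A quick sketch of that arithmetic, assuming the 1.2PB is per VTSS and that all 256 VTSSs in a tapeplex are fully populated (both are my reading, not something Oracle stated in this form):

```python
VTSS_CAPACITY_PB = 1.2        # per-system capacity quoted for VSM6
MAX_VTSS_PER_TAPEPLEX = 256   # cluster limit quoted for a tapeplex

# Assumes every VTSS in the tapeplex is at its maximum capacity.
tapeplex_pb = VTSS_CAPACITY_PB * MAX_VTSS_PER_TAPEPLEX
print(f"Tapeplex capacity: {tapeplex_pb:.1f} PB")
```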
Somewhere along the way VSM started supporting triple redundancy for the VTSS disk storage which provides better availability than RAID6. Not sure why they thought this was important but it does deal with increasing disk failures.
Oracle stated that VSM6 supports up to 1.5GB/Sec of throughput. Presumably this is landing data on disk or transferring the data to backend tape but not both. There doesn’t appear to be any standard benchmarking for these sorts of systems so I will take their word for it.
Why would anyone want one?
Well it turns out plenty of mainframe systems use tape for a number of things such as data backup, HSM, and big data batch applications. Once you get past the sunk costs for tape transports, automation, cartridges and VSMs, VSM storage can be a pretty competitive data storage solution for the mainframe environment.
The fact that most mainframe environments grew up with tape and have long ago invested in transports, automation and new cartridges probably makes VSM6 an even better buy. But tape is also making a comeback in open systems with LTO-5 and now LTO-6 coming out and with Oracle’s 5TB T10000C cartridge and IBM’s 4TB 3592 JC cartridge.
Not to mention Linear Tape File System (LTFS) as a new tape format that provides a file system for tape data which has brought renewed interest in all sorts of tape storage applications.
Lately, when I have talked with long-time StorageTek mainframe tape customers, they have all asked the same thing: when is VSM6 coming out, and when will Oracle get their act in gear and start supporting us again? Hopefully this signals a new emphasis on this market. Although who is losing and who is winning in the mainframe tape market is the subject of much debate, there is no doubt that the lack of any update to VSM has hurt Oracle’s StorageTek tape business.
Something tells me that Oracle may have fixed this problem. We hope that we start to see some more timely VSM enhancements in the future, for their sake and especially for their customers.
I have been thinking about writing a post on “Is Flash Dead?” for a while now. Well at least since talking with IBM research a couple of weeks ago on their new memory technologies that they have been working on.
As we have discussed before, NAND flash memory has some serious limitations as it’s shrunk below 11nm or so. For instance, write endurance plummets, memory retention times are reduced and cell-to-cell interactions increase significantly.
These issues are not that much of a problem with today’s flash at 20nm or so. But to continue to follow Moore’s law and drop the price of NAND flash on a $/Gb basis, it will need to shrink below 16nm. At that point or soon thereafter, current NAND flash technology will no longer be viable.
Other non-NAND based non-volatile memories
That’s why IBM and others are working on different types of non-volatile storage such as PCM (phase change memory), MRAM (magnetic RAM), FeRAM (Ferroelectric RAM) and others. All these have the potential to improve general reliability characteristics beyond where NAND Flash is today and where it will be tomorrow as chip geometries shrink even more.
IBM seems to be betting on MRAM or racetrack memory technology because it has near DRAM performance, extremely low power and can store far more data in the same amount of space. It sort of reminds me of delay line memory where bits were stored on a wire line and read out as they passed across a read/write circuit. Only in the case of racetrack memory, the delay line is etched in a silicon circuit indentation with the read/write head implemented at the bottom of the cleft.
Graphene as the solution
Then along comes Graphene based Flash Memory. Graphene can apparently be used as a substitute for the storage layer in a flash memory cell. According to the report, the graphene stores data using less power and with better stability over time. Both crucial problems with NAND flash memory as it’s shrunk below today’s geometries. The research is being done at UCLA and is supported by Samsung, a significant manufacturer of NAND flash memory today.
Current demonstration chips are much larger than would be useful. However, given graphene’s material characteristics, the researchers believe there should be no problem scaling it down below where NAND Flash would start exhibiting problems. The next iteration of research will be to see if their scaling assumptions can hold when device geometry is shrunk.
The other problem is getting graphene, a new material, into current chip production. Current materials used in chip manufacturing lines are very tightly controlled and building hybrid graphene devices to the same level of manufacturing tolerances and control will take some effort.
So don’t look for Graphene Flash Memory to show up anytime soon. But given that 16nm chip geometries are only a couple of years out and 11nm, a couple of years beyond that, it wouldn’t surprise me to see Graphene based Flash Memory introduced in about 4 years or so. Then again, I am no materials expert, so don’t hold me to this timeline.
It appears that the system uses 200K disk drives to support the 120PB of storage. The disk drives are packed in a new wider rack and are water cooled. According to the news report the new wider drive trays hold more drives than current drive trays available on the market.
For instance, HP has a hot pluggable, 100 SFF (small form factor 2.5″) disk enclosure that sits in 3U of standard rack space. 200K SFF disks would take up about 154 full racks, not counting the interconnect switching that would be required. Unclear whether water cooling would increase the density much but I suppose a wider tray with special cooling might get you more drives per floor tile.
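The rack estimate can be reproduced with a few lines. This is a sketch under stated assumptions: the usable rack height is not given above, and a value of 39U (my assumption) is what reproduces the ~154 figure, while a full 42U rack would need fewer. The average drive size is simply 120PB divided by 200K drives.

```python
import math


def racks_needed(drives: int, drives_per_enclosure: int,
                 enclosure_u: int, usable_u_per_rack: int) -> int:
    """Full racks needed to house the drives, ignoring switching."""
    enclosures = math.ceil(drives / drives_per_enclosure)
    return math.ceil(enclosures * enclosure_u / usable_u_per_rack)


DRIVES = 200_000

# 100 SFF drives per 3U enclosure, as in the HP example.
# The usable rack heights below are assumptions, not quoted figures.
for usable_u in (39, 42):
    print(f"{usable_u}U usable -> {racks_needed(DRIVES, 100, 3, usable_u)} racks")

# Average capacity per drive implied by 120PB over 200K drives (decimal units).
avg_drive_gb = 120e15 / DRIVES / 1e9
print(f"Average drive size: {avg_drive_gb:.0f} GB")
```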
There was no mention of interconnect, but today’s drives use either SAS or SATA. SAS interconnects for 200K drives would require many separate SAS busses. With a SAS expander addressing 255 drives or other expanders, one would need at least 4 SAS busses, but this would put ~64K drives on each bus and would not perform well. Something more like 64-128 drives per bus would perform much better, and each drive would need dual pathing. If we use 100 drives per SAS string, that’s 2000 SAS drive strings or at least 4000 SAS busses (dual port access to the drives).
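The bus counts above can be checked with a short sketch. I am assuming the "~64K drives per bus" figure comes from two levels of 255-way expanders (255 x 255), which is my reading of the arithmetic rather than anything from the report.

```python
import math

DRIVES = 200_000
EXPANDER_FANOUT = 255      # devices addressable per expander, as quoted
DRIVES_PER_STRING = 100    # the more practical string size suggested above
PATHS_PER_DRIVE = 2        # dual-ported drives

# Assumption: two levels of expanders give 255 * 255 addressable drives.
max_drives_per_bus = EXPANDER_FANOUT ** 2
min_busses = math.ceil(DRIVES / max_drives_per_bus)

strings = math.ceil(DRIVES / DRIVES_PER_STRING)
busses = strings * PATHS_PER_DRIVE

print(f"Max drives per bus: {max_drives_per_bus}")
print(f"Minimum busses at that fan-out: {min_busses}")
print(f"Strings at {DRIVES_PER_STRING} drives each: {strings}")
print(f"Busses with dual pathing: {busses}")
```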
Shared storage cluster – where GPFS front end nodes access shared storage across the backend. This is generally SAN storage system(s). But given the requirements for high density, it doesn’t seem likely that the 120PB storage system uses SAN storage in the backend.
Network based cluster – here the GPFS front end nodes talk over a LAN to a cluster of NSD (network shared disk) servers which can have access to all or some of the storage. My guess is this is what will be used in the 120PB storage system.
Shared Network based clusters – this looks just like a bunch of NSD servers but provides access across multiple NSD clusters.
Given the above, ~100 drives per NSD server means another 1U per 100 drives, or (given HP drive density) 4U per 100 drives, which works out to 1000 drives and 10 IO servers per 40U rack (not counting switching). At this density it takes ~200 racks for 120PB of raw storage and its NSD nodes, or 2000 NSD nodes in all.
Unclear how many GPFS front end nodes would be needed on top of this but even if it were 1 GPFS frontend node for every 5 NSD nodes, we are talking another 400 GPFS frontend nodes and at 1U per server, another 10 racks or so (not counting switching).
If my calculations are correct we are talking over 210 racks with switching thrown in to support the storage. According to IBM’s discussion on the Storage challenges for petascale systems, it probably provides ~6TB/sec of data transfer which should be easy with 200K disks but may require even more SAS busses (maybe ~10K vs. the 2K discussed above).
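Pulling the rack and node estimates together, here is a minimal sketch of the arithmetic. The ratios (100 drives per NSD server, 1 GPFS node per 5 NSD nodes, 40U racks) are the guesses made above, not published figures, and switching is ignored throughout.

```python
import math

DRIVES = 200_000
DRIVES_PER_NSD = 100     # one NSD server per 100-drive enclosure (guess)
ENCLOSURE_U = 3          # HP-style 100-drive SFF enclosure
SERVER_U = 1             # 1U per NSD or GPFS server
RACK_U = 40
NSD_PER_GPFS = 5         # assumed front end to NSD ratio
TARGET_BYTES_PER_SEC = 6e12  # ~6TB/sec aggregate transfer rate

nsd_nodes = DRIVES // DRIVES_PER_NSD
groups_per_rack = RACK_U // (ENCLOSURE_U + SERVER_U)   # enclosure + server pairs
storage_racks = math.ceil(nsd_nodes / groups_per_rack)

gpfs_nodes = nsd_nodes // NSD_PER_GPFS
gpfs_racks = math.ceil(gpfs_nodes * SERVER_U / RACK_U)

total_racks = storage_racks + gpfs_racks
mb_per_sec_per_drive = TARGET_BYTES_PER_SEC / DRIVES / 1e6

print(f"NSD nodes: {nsd_nodes}, storage racks: {storage_racks}")
print(f"GPFS nodes: {gpfs_nodes}, GPFS racks: {gpfs_racks}")
print(f"Total racks (no switching): {total_racks}")
print(f"Per-drive bandwidth needed: {mb_per_sec_per_drive:.0f} MB/s")
```

At about 30MB/s per drive the aggregate rate looks comfortable for the disks themselves, which is why the interconnect is the more interesting constraint.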
IBM GPFS is used behind the scenes in IBM’s commercial SONAS storage system but has been around as a cluster file system designed for HPC environments for over 15 years now.
Given this many disk drives something needs to be done about protecting against drive failure. IBM has been talking about declustered RAID algorithms for their next generation HPC storage system which spreads the parity across more disks and as such, speeds up rebuild time at the cost of reducing effective capacity. There was no mention of effective capacity in the report but this would be a reasonable tradeoff. A 200K drive storage system should have a drive failure every 10 hours, on average (assuming a 2 million hour MTBF). Let’s hope they get drive rebuild time down much below that.
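The failure interval follows directly from the MTBF. A small sketch, assuming independent drive failures at a constant rate (the usual simplification behind an MTBF figure):

```python
def hours_between_failures(mtbf_hours: float, drive_count: int) -> float:
    """Expected hours between drive failures across the whole population,
    assuming independent failures at a constant rate."""
    return mtbf_hours / drive_count


interval = hours_between_failures(2_000_000, 200_000)
failures_per_year = 24 * 365 / interval

print(f"One failure every {interval:.0f} hours on average")
print(f"Roughly {failures_per_year:.0f} drive failures per year")
```

Any rebuild that takes longer than this interval means that, on average, more than one rebuild is in flight at a time.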
The system is expected to hold around a trillion files. Not sure but even at 1024 bytes of metadata per file, this number of files would chew up ~1PB of metadata storage space.
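The metadata estimate is simple multiplication; the 1024 bytes per file is the guess made above, not a GPFS specification.

```python
FILES = 10 ** 12                 # about a trillion files
METADATA_BYTES_PER_FILE = 1024   # assumed, as in the estimate above

metadata_bytes = FILES * METADATA_BYTES_PER_FILE
metadata_pb = metadata_bytes / 1e15   # decimal petabytes

print(f"Metadata: {metadata_pb:.3f} PB")
print(f"Share of 120PB raw: {metadata_pb / 120:.2%}")
```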
GPFS provides ILM (information life cycle management, or data placement based on information attributes) using automated policies and supports external storage pools outside the GPFS cluster storage. ILM within the GPFS cluster supports file placement across different tiers of storage.
All the discussion up to now revolved around homogeneous backend storage but it’s quite possible that multiple storage tiers could also be used. For example, a high density but slower storage tier could be combined with a low density but faster storage tier to provide a more cost effective storage system. Although, it’s unclear whether the application (real world modeling) could readily utilize this sort of storage architecture nor whether they would care about system cost.
Nonetheless, presumably an external storage pool would be a useful adjunct to any 120PB storage system for HPC applications.
Can it be done?
Let’s see, 400 GPFS nodes, 2000 NSD nodes, and 200K drives. Seems like the hardware would be readily doable (not sure why they needed water cooling but hopefully they obtained better drive density that way).
It would seem that a 20X multiplier times a current Isilon cluster or even a 10X multiple of a currently supported SONAS system would take some software effort to work together, but seems entirely within reason.
Of course, IBM Almaden is working on a project to support Hadoop over GPFS which might not be optimum for real world modeling but would nonetheless support the node count being talked about here.
I wish there was some real technical information on the project out on the web but I could not find any. Much of this is informed conjecture based on current GPFS system and storage hardware capabilities. But hopefully, I haven’t traveled too far astray.
We were talking with Pure Storage last week, another SSD startup which just emerged out of stealth mode today. Somewhat like SolidFire which we discussed a month or so ago, Pure Storage uses only SSDs to provide primary storage. In this case, they are supporting a FC front end, with an all-SSD backend, and implementing internal data deduplication and compression, to try to address the needs of enterprise tier 1 storage.
Pure Storage is in final beta testing with their product and plan to GA sometime around the end of the year.
Pure Storage hardware
Their system is built around MLC SSDs, which are available from many vendors, but with a strategic investment from Samsung they currently use that vendor’s storage. As we know, MLC has write endurance limitations, but Pure Storage was built from the ground up knowing they were going to use this technology and has built its IP to counteract these issues.
The system is available in one or two controller configurations, with an Infiniband interconnect between the controllers, 6Gbps SAS backend, 48GB of DRAM per controller for caching purposes, and NV-RAM for power outages. Each controller has 12-cores supplied by 2-Intel Xeon processor chips.
With the first release they are limiting the controllers to one or two (HA option) but their storage system is capable of clustering together many more, maybe even up to 8-controllers using the Infiniband back end.
Each storage shelf provides 5.5TB of raw storage using 2.5″ 256GB MLC SSDs. It looks like each controller can handle up to 2 storage shelves, with the HA (dual controller) option supporting 4 drive shelves for up to 22TB of raw storage.
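The shelf numbers are consistent with 22 SSDs per shelf, the drive count mentioned later in this post. A quick sketch, assuming binary units (1TB = 1024GB), which is what makes 22 x 256GB come out to exactly 5.5TB:

```python
SSD_GB = 256
SSDS_PER_SHELF = 22          # drive count mentioned later in this post
SHELVES_PER_CONTROLLER = 2
CONTROLLERS_HA = 2

# Assumes 1TB = 1024GB; with decimal units a shelf would be ~5.6TB.
shelf_tb = SSD_GB * SSDS_PER_SHELF / 1024
ha_raw_tb = shelf_tb * SHELVES_PER_CONTROLLER * CONTROLLERS_HA

print(f"Per shelf: {shelf_tb} TB raw")
print(f"Dual controller, 4 shelves: {ha_raw_tb} TB raw")
```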
Pure Storage Performance
Although these numbers are not independently verified, the company says that with a single controller (and 1 storage shelf) they can do 200K sustained 4K random read IOPS, 2GB/sec bandwidth, 140K sustained write IOPS, or 500MB/s of write bandwidth. A dual controller system (with 2 storage shelves) can achieve 300K random read IOPS, 3GB/sec bandwidth, 180K write IOPS or 1GB/sec of write bandwidth. They also claim that they can do all this IO with under 1 msec. latency.
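To relate the IOPS and bandwidth claims, here is a rough conversion for the single controller read case. It assumes "4K" means 4096 bytes and that the shelf holds 22 SSDs (the count mentioned later in this post); the write block size was not stated, so writes are left out.

```python
def iops_to_mb_per_sec(iops: int, block_bytes: int) -> float:
    """Data rate in decimal MB/s implied by an IOPS figure at a block size."""
    return iops * block_bytes / 1e6


READ_IOPS = 200_000
BLOCK_BYTES = 4096       # assumes "4K" means 4096 bytes
SSDS_PER_SHELF = 22      # drive count mentioned later in this post

read_mb_s = iops_to_mb_per_sec(READ_IOPS, BLOCK_BYTES)
iops_per_ssd = READ_IOPS / SSDS_PER_SHELF

print(f"200K x 4K random reads: {read_mb_s:.1f} MB/s")
print(f"Average per SSD: {iops_per_ssd:.0f} IOPS")
```

The 4K random read load works out to well under the quoted 2GB/sec, which suggests the bandwidth figure was measured with larger transfers.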
One of the things they pride themselves on is consistent performance. They have built their storage such that they can deliver this consistent performance even under load conditions.
Given the amount of SSDs in their system this isn’t screaming performance but is certainly up there with many enterprise class systems sporting over 1000 disks. The random write performance is not bad considering this is MLC. On the other hand the sequential write bandwidth is probably their weakest spec and reflects their use of MLC flash.
One key to Pure Storage (and SolidFire for that matter) is their use of inline data compression and deduplication. By using these techniques and basing their system storage on MLC, Pure Storage believes they can close the price gap between disk and SSD storage systems.
The problem with data reduction technologies is that not all environments can benefit from them and they both require lots of CPU power to perform well. Pure Storage believes they have the horsepower (with 12 cores per controller) to support these services and are focusing their sales activities on those (VMware, Oracle, and SQL server) environments which have historically proven to be good candidates for data reduction.
In addition, they perform a lot of optimizations in their backend data layout to prolong the life of MLC storage. Specifically, they use a write chunk size that matches the underlying MLC SSDs’ page width so as not to waste endurance with partial data writes. Also they migrate old data to new locations occasionally to maintain “data freshness” which can be a problem with MLC storage if the data is not touched often enough. Probably other stuff as well, but essentially they are tuning their backend use to optimize endurance and performance of their SSD storage.
Furthermore, they have created a new RAID 3D scheme which provides an adaptive parity scheme based on the number of available drives that protects against any dual SSD failure. They provide triple parity, dual parity for drive failures and another parity for unrecoverable bit errors within a data payload. In most cases, a failed drive will not induce an immediate rebuild but rather a reconfiguration of data and parity to accommodate the failing drive and rebuild it onto new drives over time.
At the moment, they don’t have snapshots or data replication but they said these capabilities are on their roadmap for future delivery.
In the meantime, all SSD storage systems seem to be coming out of the woodwork. We mentioned SolidFire, but WhipTail is another one and I am sure there are plenty more in stealth waiting for the right moment to emerge.
I was at a conference about two months ago where I predicted that all SSD systems would be coming out with little of the engineering development of storage systems of yore. Based on the performance available from a single SSD, one wouldn’t need 100s of SSDs to generate 100K IOPS or more. Pure Storage is doing this level of IO with only 22 MLC SSDs and a high-end, but essentially off-the-shelf controller.
Just imagine what one could do if you threw some custom hardware at it…
Was talking with someone yesterday about one of my favorite topics, data storage for virtual desktop infrastructure (VDI) deployments. In my mind there are a few advanced storage features that help considerably with VDI implementations:
Deduplication – almost every one of your virtual desktops will share 75-90% of their O/S disk data with every other virtual desktop. Having sub-file/sub-block deduplication can be a godsend for all this replicated data and reduce O/S storage requirements considerably.
0 storage snapshots/clones – another solution to the duplication of O/S data is to use some sort of space conserving snapshots. For example, one creates a master (gold) disk image and makes 100s if not 1000s of snapshots of it, taking almost no additional space.
Highly available/highly reliable storage – when you have a lone desktop dependent on DAS for its O/S, it doesn’t impact a lot of users if that device fails. However, when you have 100s to 1000s of users dependent on DAS device(s) for their O/S software, any DAS failure could impact all of them at the same time. As such, one needs to move off DAS and invest in highly reliable and available external storage of some kind to sustain reasonable uptime for your user community.
Those seem to me to be the most important attributes for VDI storage but there are a couple more features/facilities which can also help:
NAS systems with NFS – VDI deployments will generate lots of VMDKs for all the user desktop C: drives. Although this can be managed with block level storage as separate LUNs or multi-VMDK LUNs, who wants to configure 100 to 1000 LUNs? NFS files can perform just as well and are much easier to create on the fly and thus, for VDI it’s hard to beat NFS storage.
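To put rough numbers on the deduplication point in the list above, here is a minimal sketch. The desktop count and image size are hypothetical values chosen for illustration, and the model is idealized: the shared portion of the O/S image is stored once and only the unique remainder is stored per desktop.

```python
def deduped_gb(desktops: int, image_gb: float, shared_fraction: float) -> float:
    """Idealized capacity after deduplication: one copy of the shared data
    plus each desktop's unique data."""
    shared = image_gb * shared_fraction
    unique = image_gb * (1.0 - shared_fraction)
    return shared + desktops * unique


DESKTOPS = 1000      # hypothetical deployment size
IMAGE_GB = 20.0      # hypothetical O/S disk image size

raw_gb = DESKTOPS * IMAGE_GB
for shared_fraction in (0.75, 0.90):
    stored = deduped_gb(DESKTOPS, IMAGE_GB, shared_fraction)
    print(f"{shared_fraction:.0%} shared: {stored:,.0f} GB stored "
          f"vs {raw_gb:,.0f} GB raw ({raw_gb / stored:.1f}x reduction)")
```

Under these assumptions the 75-90% sharing range cuts O/S storage to roughly a quarter or a tenth of the raw requirement.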
Boot storm enhancements – Another problem with VDI is that everyone gets to work 8am Monday and proceeds to boot up their (virtual) machines, which drives an awful lot of IO to their virtual C: drives. Deduplication and 0 storage snapshots can help manage the boot storm as long as these characteristics are retained throughout system cache, i.e., deduplication exists in cache as well as on backend disk. But there are other approaches to the problem as well, available from various vendors to better manage boot storms.
Anti-Virus scan enhancements – Similar to boot storms, A-V scans also typically happen around the same time for many desktop users and can be just as bad for virtual C: drive performance. Again, deduplication or 0 storage snapshots can help (with above caveats) but some vendor storage can offload these activities from the desktop altogether. Last week’s VMworld release of VMware’s vShield Edge (see VMworld 2010 review) also supports some A-V scan enhancements. Any of these approaches should be able to help.
Regular “dumb” block storage will always work but it will require a lot more raw storage, performance will suffer just when everybody gets back to work, and the administrative burden will be much higher.
I may seem biased but enterprise class reliability and availability with some of the advanced storage features described above can help make your deployment of VDI that much better for you and all your knowledge workers.
Recent press reports about a bidding war for 3PAR bring into focus the expanding need for enterprise class data storage subsystems. What exactly is enterprise storage?
Defining enterprise storage is fraught with problems but I will take a shot. Enterprise class data storage has:
Enhanced reliability, high availability and serviceability – meaning it hardly ever fails, it keeps operating (on redundant components) when it does fail, and repairing the storage when the rare failure occurs can be accomplished without disrupting ongoing storage services
Extreme data integrity – goes beyond just RAID storage, meaning that these systems lose data very infrequently, provide the latest data written to a location when read and will tell you when data cannot be accessed.
Automated I/O performance – meaning sophisticated caching algorithms that try to keep ahead of sequential I/O streams, buffer actively read data, and buffer write data in non-volatile cache before destaging to disk or other media.
Multiple types of storage – meaning the system supports SATA, SAS and/or FC disk drives and SSDs or Flash storage
PBs of storage – meaning behind one enterprise class storage (sub-)system one can support over 1PB of storage
Sophisticated functionality – meaning the system supports multiple forms of offsite replication, thin provisioning, storage tiering, point-in-time copies, data cloning, administration GUIs/CLIs, etc.
Compatibility with all enterprise O/Ss – meaning the storage has been tested and is on hardware compatibility lists for every major operating system in use by the enterprise today.
As for storage protocol, it seems best to leave this off the list. I wanted to just add block storage, but enterprises today probably have as much if not more external file storage (CIFS or NFS) as they have block storage (FC or iSCSI). And the proportion in file systems seems to be growing (see IDC report referenced below).
In addition, while I don’t like the non-determinism of iSCSI or file access protocols, this doesn’t seem to stop such storage from putting up pretty impressive performance numbers (see our performance dispatches). Anything that can crack 100K I/O or file operations per second probably deserves to call itself enterprise storage as long as it meets the other requirements. So, maybe I should add high-performance storage to the list above.
Why the sudden interest in enterprise storage?
Enterprise storage has been around arguably since the 2nd half of last century (for mainframe systems) but lately has become even more interesting as applications deploy to the cloud and server virtualization (from VMware, Microsoft Hyper-V and others) takes over the data center.
Cloud storage and cloud computing services are lowering the entry points for storage and processing, enabling application deployments which were heretofore unaffordable. These new cloud applications consume storage at increasing rates and don’t seem to be slowing down any time soon. Arguably, some cloud storage is not enterprise storage but as service levels go up for these applications, providers must ultimately turn to enterprise storage.
In addition, server virtualization transforms the enterprise data center from a single application per server to easily 5 or more applications per physical server. This trend is raising server utilization, driving more I/O, and requiring higher capacity. Such “multi-application” storage almost always requires high availability, reliability and performance to work well, generating even more demand for enterprise data storage systems.
Margins on enterprise storage are good, some would say very good. While raw disk storage can be had at under $0.50/GB, enterprise class storage is often 10 or more times that price. Now that has to cover redundant hardware, software/firmware engineering and other characteristics, but this still leaves pretty good margins.
In my mind, Dell would see enterprise storage as a natural extension of their current enterprise server business. They already sell to and support these customers; including enterprise class storage just adds another product to the mix. Developing enterprise storage from scratch is probably a 4-7 year journey with the right people; buying 3PAR puts them in the market today with a competitive product.
HP is already in the enterprise storage market today, with their XP and EVA storage subsystems. However, having their own 3PAR enterprise class storage may get them better margins than their current XP storage OEMed from HDS. But I think Chuck Hollis’s post on HP’s counter bid for 3PAR may have revealed another side to this discussion – sometimes M&A is as much about constraining your competition as it is about adding new capabilities to a company.