Model of graphene structure by CORE-Materials (cc) (from Flickr)
I have been thinking about writing a post on “Is Flash Dead?” for a while now, at least since talking with IBM research a couple of weeks ago about the new memory technologies they have been working on.
As we have discussed before, NAND flash memory has some serious limitations as it’s shrunk below 11nm or so. For instance, write endurance plummets, memory retention times are reduced and cell-to-cell interactions increase significantly.
These issues are not that much of a problem with today’s flash at 20nm or so. But to continue to follow Moore’s law and drop the price of NAND flash on a $/Gb basis, it will need to shrink below 16nm. At that point or soon thereafter, current NAND flash technology will no longer be viable.
Other non-NAND based non-volatile memories
That’s why IBM and others are working on different types of non-volatile storage such as PCM (phase change memory), MRAM (magnetoresistive RAM), FeRAM (ferroelectric RAM) and others. All of these have the potential to improve general reliability characteristics beyond where NAND flash is today and where it will be tomorrow as chip geometries shrink even more.
IBM seems to be betting on MRAM or racetrack memory technology because it has near DRAM performance, extremely low power and can store far more data in the same amount of space. It sort of reminds me of delay line memory where bits were stored on a wire line and read out as they passed across a read/write circuit. Only in the case of racetrack memory, the delay line is etched in a silicon circuit indentation with the read/write head implemented at the bottom of the cleft.
Graphene as the solution
Then along comes Graphene based Flash Memory. Graphene can apparently be used as a substitute for the storage layer in a flash memory cell. According to the report, the graphene layer stores data using less power and with better stability over time, both crucial problems with NAND flash memory as it’s shrunk below today’s geometries. The research is being done at UCLA and is supported by Samsung, a significant manufacturer of NAND flash memory today.
Current demonstration chips are much larger than would be useful. However, given graphene’s material characteristics, the researchers believe there should be no problem scaling it down below where NAND Flash would start exhibiting problems. The next iteration of research will be to see if their scaling assumptions can hold when device geometry is shrunk.
The other problem is getting graphene, a new material, into current chip production. Current materials used in chip manufacturing lines are very tightly controlled and building hybrid graphene devices to the same level of manufacturing tolerances and control will take some effort.
So don’t look for Graphene Flash Memory to show up anytime soon. But given that 16nm chip geometries are only a couple of years out and 11nm, a couple of years beyond that, it wouldn’t surprise me to see Graphene based Flash Memory introduced in about 4 years or so. Then again, I am no materials expert, so don’t hold me to this timeline.
Susitna Glacier, Alaska by NASA Goddard Photo and Video (cc) (from Flickr)
Talk about big data, Technology Review reported this week that IBM is building a 120PB storage system for some unnamed customer. Details are sketchy and I cannot seem to find any announcement of this on IBM.com.
Hardware
It appears that the system uses 200K disk drives to support the 120PB of storage. The disk drives are packed in a new wider rack and are water cooled. According to the news report the new wider drive trays hold more drives than current drive trays available on the market.
For instance, HP has a hot pluggable, 100 SFF (small form factor 2.5″) disk enclosure that sits in 3U of standard rack space. 200K SFF disks would take up about 154 full racks, not counting the interconnect switching that would be required. Unclear whether water cooling would increase the density much but I suppose a wider tray with special cooling might get you more drives per floor tile.
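The enclosure-to-rack arithmetic above is simple enough to sketch. This is a back-of-the-envelope calculation using the HP 100-drive/3U enclosure figure; the 39U of usable rack space is my own assumption:

```python
# Rough rack count for 200K SFF drives in 100-drive, 3U enclosures.
# The 39U usable-space-per-rack figure is assumed (leaves room for power etc.).
drives = 200_000
drives_per_enclosure = 100
u_per_enclosure = 3
usable_u_per_rack = 39

enclosures = drives // drives_per_enclosure   # 2000 enclosures
total_u = enclosures * u_per_enclosure        # 6000U of rack space
racks = -(-total_u // usable_u_per_rack)      # ceiling division, ~154 racks
```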
There was no mention of interconnect, but today’s drives use either SAS or SATA. SAS interconnects for 200K drives would require many separate SAS busses. With an SAS expander addressing 255 drives or other expanders, one could in theory get by with as few as 4 SAS busses, but at ~50K drives per bus this would not perform well. Something more like 64-128 drives per bus would perform much better, and each drive would need dual pathing. If we use 100 drives per SAS string, that’s 2000 SAS drive strings, or at least 4000 SAS busses (dual port access to the drives).
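The SAS string math works out as follows (a sketch, assuming the 100-drives-per-string and dual-porting figures discussed above):

```python
# SAS bus count for 200K drives at a performance-friendly string length.
drives = 200_000
drives_per_string = 100       # within the 64-128 range, rounded for arithmetic
sas_strings = drives // drives_per_string   # 2000 SAS strings
sas_busses = sas_strings * 2                # dual-ported drives double the bus count
```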
GPFS supports a few different cluster configurations:

Shared storage cluster – where GPFS front end nodes access shared storage across the backend. This is generally SAN storage system(s). But given the requirements for high density, it doesn’t seem likely that the 120PB storage system uses SAN storage in the backend.
Networked based cluster – here the GPFS front end nodes talk over a LAN to a cluster of NSD (network shared disk) servers which can have access to all or some of the storage. My guess is this is what will be used in the 120PB storage system.
Shared Network based clusters – this looks just like a bunch of NSD servers but provides access across multiple NSD clusters.
Given the above, ~100 drives per NSD server means another 1U per 100 drives, or (given HP drive density) 4U per 100 drives. That’s 1000 drives and 10 NSD servers per 40U rack (not counting switching). At this density it takes ~200 racks for 120PB of raw storage plus NSD nodes, or 2000 NSD nodes in total.
Unclear how many GPFS front end nodes would be needed on top of this but even if it were 1 GPFS frontend node for every 5 NSD nodes, we are talking another 400 GPFS frontend nodes and at 1U per server, another 10 racks or so (not counting switching).
If my calculations are correct we are talking over 210 racks with switching thrown in to support the storage. According to IBM’s discussion on the Storage challenges for petascale systems, it probably provides ~6TB/sec of data transfer which should be easy with 200K disks but may require even more SAS busses (maybe ~10K vs. the 2K discussed above).
Software
IBM GPFS is used behind the scenes in IBM’s commercial SONAS storage system but has been around as a cluster file system designed for HPC environments for over 15 years now.
Given this many disk drives, something needs to be done about protecting against drive failure. IBM has been talking about declustered RAID algorithms for their next generation HPC storage system, which spread the parity across more disks and, as such, speed up rebuild time at the cost of reducing effective capacity. There was no mention of effective capacity in the report but this would be a reasonable tradeoff. A 200K drive storage system should have a drive failure every 10 hours, on average (assuming a 2 million hour MTBF). Let’s hope they get drive rebuild time down much below that.
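The failure-interval estimate is just per-drive MTBF divided by the drive count:

```python
# Mean time between drive failures across the whole system,
# assuming a 2 million hour per-drive MTBF (an assumed figure).
mtbf_hours = 2_000_000
drives = 200_000
hours_between_failures = mtbf_hours / drives   # 10.0 hours, on average
```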
The system is expected to hold around a trillion files. Even at only 1024 bytes of metadata per file, this number of files would chew up ~1PB of metadata storage space.
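The metadata estimate works out like so (the 1024 bytes/file is an illustrative figure, not a GPFS specification):

```python
# Metadata space for a trillion files at ~1KB of metadata each.
files = 10**12
bytes_per_file = 1024                            # assumed metadata size
metadata_pb = files * bytes_per_file / 10**15    # ~1 PB (1.024 PB exactly)
```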
GPFS provides ILM (information life cycle management, or data placement based on information attributes) using automated policies and supports external storage pools outside the GPFS cluster storage. ILM within the GPFS cluster supports file placement across different tiers of storage.
All the discussion up to now revolved around homogeneous backend storage but it’s quite possible that multiple storage tiers could also be used. For example, a high density but slower storage tier could be combined with a low density but faster storage tier to provide a more cost effective storage system. Although, it’s unclear whether the application (real world modeling) could readily utilize this sort of storage architecture nor whether they would care about system cost.
Nonetheless, presumably an external storage pool would be a useful adjunct to any 120PB storage system for HPC applications.
Can it be done?
Let’s see, 400 GPFS nodes, 2000 NSD nodes, and 200K drives. Seems like the hardware would be readily doable (not sure why they needed water cooling but hopefully they obtained better drive density that way).
Luckily GPFS supports Infiniband which can support 10,000 nodes within a single subnet. Thus an Infiniband interconnect between the GPFS and NSD nodes could easily support a 2400 node cluster.
The only real question is can a GPFS software system handle 2000 NSD nodes and 400 GPFS nodes with trillions of files over 120PB of raw storage.
As a comparison here are some recent examples of scale out NAS systems:
It would seem that a 20X multiple of a current Isilon cluster or even a 10X multiple of a currently supported SONAS system would take some software effort to pull together, but seems entirely within reason.
On the other hand, Yahoo supports a 4000-node Hadoop cluster that seems to work just fine. So from a feasibility perspective, a 2400 node GPFS-NSD system seems like just a walk in the park.
Of course, IBM Almaden is working on a project to support Hadoop over GPFS which might not be optimum for real world modeling but would nonetheless support the node count being talked about here.
——
I wish there was some real technical information on the project out on the web but I could not find any. Much of this is informed conjecture based on current GPFS system and storage hardware capabilities. But hopefully, I haven’t traveled too far astray.
I have written before about the lack of long term archives for digital data, mostly focused on disappearing formats. But this device, if it works, has the potential to solve the other problem (discussed here), namely that no storage media around today can last that long.
The new single layer DVD (4.7GB max) has a chemically stable, inorganic recording layer which is a heat resistant matrix of materials which can retain data while surviving temperatures of up to 500°C (932°F).
Unlike normal DVDs which record data using organic dyes within the DVD, M-Disc data is recorded on this stone-like layer embedded inside the DVD. By doing so, Millenniata has created the modern day equivalent of etching information in stone.
According to the vendor, M-Disc archive-ability was independently validated by the US DOD at their China Lake facilities. While the DOD didn’t say the M-Disc DVD has a 1000 year life, they did say that under their testing the M-Disc was the only DVD which did not lose data. The DOD tested DVDs from Mitsubishi, Verbatim, Delkin, MAM-A and Taiyo Yuden (JVC) in addition to the M-Disc.
The other problems with long term archives involve data formats and program availability that could read such formats from long ago. Although Millenniata have no solution for this, something like a format repository with XML descriptions might provide the way forward to a solution.
Given the nature of their DVD recording surface, special purpose DVD writers, with lasers 5X the intensity of normal DVD lasers, need to be used. But once recorded, any DVD reader is able to read the data off the disk.
Pricing for the media was suggested to be about equivalent per disk for archive quality DVDs. Pricing for the special DVD writers was not disclosed.
They did indicate they were working on a similar product for BluRay disks which would take the single layer capacity up to 26GBs.
OCZ just released a new version of their enterprise class Z-drive SSD storage with pretty impressive performance numbers (up to 500K IOPS [probably read] with 2.8GB/sec read data transfer).
Bootability
These new drives are bootable SCSI devices and connect directly to a server’s PCIe bus. They come in half height and full height card form factors and support 800GB to 3.2TB (full height) or 300GB to 1.2TB (half height) raw storage capacities.
OCZ also offers their Velo PCIe SSD series which are not bootable and as such, require an IO driver for each operating system. However, the Z-drive has more intelligence which provides a SCSI device and as such, can be used anywhere.
Naturally this comes at the price of additional hardware and overhead. All of which could impact performance but given their specified IO rates, it doesn’t seem to be a problem.
Unclear how many other PCIe SSDs exist today that offer bootability, but it certainly puts these drives in a different class than previous generation PCIe SSDs, such as those available from FusionIO and other vendors, that require IO drivers.
MLC NAND
One concern with new Z-drives might be their use of MLC NAND technology. Although OCZ’s press release said the new drives would be available in either SLC or MLC configurations, current Z-drive spec sheets only indicate MLC availability.
As discussed previously (see eMLC & eSLC and STEC’s MLC posts), MLC supports less write endurance (program-erase and write cycles) than SLC NAND cells. Normally the difference is on the order of 10X less before NAND cell erase/write failure.
I also noticed there was no write endurance specification on their spec sheet for the new Z-drives. Possibly, at these capacities it may not matter but, in our view, a write endurance specification should be supplied for any SSD drive, and especially for enterprise class ones.
Z-drive series
OCZ offers two versions of their Z-drive, the R and C series, both of which offer the same capacities and high performance, but as far as I could tell the R series appears to have more enterprise class availability and functionality. Specifically, this drive has power fail protection for writes (capacitance power backup) as well as better SMART support (with “enterprise attributes”). Both seem to be missing from the C series drives.
We hope the enterprise attribute SMART support provides write endurance monitoring and reporting, but no definition of these attributes was easily findable.
Also the R series power backup, called DataWrite Assurance Technology would be a necessary component for any enterprise disk device. This essentially saves data written to the device but not to the NAND just yet from disappearing during a power outage/failure.
Given the above, we would certainly opt for the R series drive in any enterprise configuration.
Storage system using Z-drives
Just consider what one can do with a gaggle of Z-drives in a standard storage system. For example, with 5 Z-drives in a server, it could potentially support 2.5M IOPS and 14GB/sec of data transfer, with some resulting loss of performance due to front-end emulation. Moreover, at 3.2TB per drive, even in a RAID 5 4+1 configuration the storage system would support 12.8TB of user capacity. One could conceivably do away with any DRAM cache in such a system and still provide excellent performance.
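The aggregate numbers for such a hypothetical 5 × Z-drive box work out as follows (using OCZ’s quoted per-drive specs; front-end emulation overhead is ignored):

```python
# Aggregate performance and capacity for a hypothetical 5 x Z-drive server.
z_drives = 5
iops_per_drive = 500_000
gb_per_sec_per_drive = 2.8
tb_per_drive = 3.2

total_iops = z_drives * iops_per_drive               # 2.5M IOPS
total_gb_per_sec = z_drives * gb_per_sec_per_drive   # ~14 GB/sec
raid5_user_tb = (z_drives - 1) * tb_per_drive        # 12.8TB usable in 4+1
```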
What the cost for such a system would be is another question. But with MLC NAND it shouldn’t be too obscene.
On the other hand serviceability might be a concern as it would be difficult to swap out a failed drive (bad SSD/PCIe card) while continuing IO operations. This could be done with some special hardware but it’s typically not present in standard, off the shelf servers.
—-
All in all a very interesting announcement from OCZ. The likelihood that a single server will need this sort of IO performance from a lone drive is not that high (except maybe for massive website front ends) but putting a bunch of these in a storage box is another matter. Such a configuration would make one screaming storage system with minimal hardware changes and only a modest amount of software development.
Apparently the problem with Intel’s SSDs occurs when power is suddenly removed from the device. The end result is that the SSD’s capacity is restricted to 8MB from 40GB or more.
I have seen these sorts of problems before. It probably has something to do with table updating activity associated with SSD wear leveling.
Wear leveling
NAND wear leveling looks very similar to virtual memory addressing and maps storage block addresses to physical NAND pages. Essentially something similar to a dynamic memory page table is maintained that shows where the current block is located in the physical NAND space, if present. Typically, there are multiple tables involved, one for spare pages, another for mapping current block addresses to NAND page location and offset, one for used pages, etc. All these tables have to be in some sort of non-volatile storage so they persist after power is removed.
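A toy sketch of the kind of mapping involved (purely illustrative; real flash translation layers are far more elaborate, and crucially must persist and update their tables atomically so a power loss can’t leave them inconsistent):

```python
# Toy flash translation layer: maps logical block addresses (LBAs) to
# physical NAND pages, much like a dynamic page table. Illustrative only.
class ToyFTL:
    def __init__(self, num_pages):
        self.spare = list(range(num_pages))  # free NAND pages
        self.lba_map = {}                    # logical block -> physical page
        self.used = []                       # stale pages awaiting erase

    def write(self, lba):
        page = self.spare.pop(0)             # NAND always writes to a fresh page
        old = self.lba_map.get(lba)
        if old is not None:
            self.used.append(old)            # old copy reclaimed later by GC
        self.lba_map[lba] = page
        return page

ftl = ToyFTL(4)
ftl.write(0)   # first write of block 0 lands on a fresh page
ftl.write(0)   # rewriting block 0 relocates it; the old page becomes stale
```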
Updating such tables and maintaining their integrity is a difficult endeavor. More than likely some sort of table update is not occurring in an ACID fashion.
Intel’s fix
Intel has replicated the problem and promises a firmware fix. In my experience this is entirely possible. Most probably customer data has not been lost (although this is not a certainty), it’s just not accessible at the moment. And Intel has reminded everyone that as with any storage device everyone should be taking periodic backups to other devices, SSDs are no exception.
I am certain that Intel and others are enhancing their verification and validation (V&V) activities to better probe and test the logic behind wear leveling fault tolerance, at least with respect to power loss. Of course, redesigning the table update algorithm to be more robust, reliable, and fault tolerant is a long range solution to these sorts of problems but may take longer than a just a bug fix.
The curse of complexity
But all this raises a critical question: as one puts more and more complexity outboard into the drive, are we inducing more risk?
It’s a perennial problem in the IT industry. Software bugs are highly correlated to complexity and thereby, are ubiquitous, difficult to eliminate entirely, and often escape any and all efforts to eradicate them before customer shipments. However, we can all get better at reducing bugs, i.e., we can make them less frequent, less impactful, and less visible.
What about disks?
All that being said, rotating media is not immune to the complexity problem. Disk drives have different sorts of complexity, e.g., here block addressing is mostly static and mapping updates occur much less frequently (for defect skipping) rather than constantly as with NAND, whenever data is written. As such, problems with power loss impacting table updates are less frequent and less severe with disks. On the other hand, stiction, vibration, and HDI are all very serious problems with rotating media but SSDs have a natural immunity to these issues.
—-
Any new technology brings both advantages and disadvantages with it. NAND based SSD advantages include high speed, low power, and increased ruggedness but the disadvantages involve cost and complexity. We can sometimes tradeoff cost against complexity but we cannot eliminate it entirely.
Moreover, while we cannot eliminate the complexity of NAND wear leveling today, we can always test it better. That’s probably the most significant message coming out of today’s issue. Any product SSD testing has to take into account the device’s intrinsic complexity and exercise that well, under adverse conditions. Power failure is just one example, I can think of dozens more.
Ultrastar SSD400 4 (c) 2011 Hitachi Global Storage Technologies (from their website)
The problem with SSDs is that they typically all fail at some level of data writes, called the write endurance specification.
As such, if you purchase multiple drives from the same vendor and put them in a RAID group, this can sometimes cause multiple, near-simultaneous failures.
Say the SSD write endurance is 250TB (you can only write 250TB to the SSD before write failure), and you populate a RAID 5 group with them in a 3-data drive + 1-parity drive configuration. As it’s RAID 5, parity rotates around to each of the drives, roughly equalizing the parity write activity.
Now every write to the RAID group is actually two writes, one for data and one for parity. Thus, the 250TB of write endurance per SSD, which should result in 1000TB write endurance for the RAID group is reduced to something more like 125TB*4 or 500TB. Specifically,
Each write to a RAID 5 data drive is replicated to the RAID 5 parity drive,
As each parity write is written to a different drive, the drive currently holding parity can contain at most 125TB of data writes and 125TB of parity writes before it uses up its write endurance spec.
So for the 4-drive RAID group, once 500TB of data has been written, evenly spread across the group, no more can be written.
As for RAID 6, it looks almost the same except that you lose more SSD life, as you write parity twice. E.g. for a 6 data drive + 2 parity drive RAID 6 group with similar write endurance, you should get 83.3TB of data writes and 166.7TB of parity writes per drive, which for an 8 drive parity group is 666.4TB of data writes before the RAID group’s write endurance lifetime is used up.
For RAID 1 with 2 SSDs in the group, as each drive mirrors writes to the other, every host write is written twice, so you can only get 250TB of total data writes per RAID group.
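All three cases fall out of one formula: with perfectly even striping, group data endurance is the total group endurance divided by the number of physical writes per host write. A quick sketch (an illustrative model, not a vendor formula; the small difference from the 666.4TB figure above is just rounding of 83.3TB):

```python
# Group-level data write endurance under perfectly even striping.
def raid_group_data_endurance(per_drive_tb, n_drives, parity_writes):
    """Host data (TB) writable before per-drive endurance is exhausted."""
    total_group_endurance = per_drive_tb * n_drives
    physical_writes_per_host_write = 1 + parity_writes
    return total_group_endurance / physical_writes_per_host_write

raid5_3plus1 = raid_group_data_endurance(250, 4, 1)  # 500.0 TB
raid6_6plus2 = raid_group_data_endurance(250, 8, 2)  # ~666.7 TB
raid1_mirror = raid_group_data_endurance(250, 2, 1)  # 250.0 TB
```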
But the “real” problem is much worse
If I am writing the last TB to my RAID group and if I have managed to spread the data writes evenly across the RAID group, one drive will go out right away. Most likely the current parity drive will throw a write error. BUT the real problem occurs during the rebuild.
With 256GB SSDs in the RAID 5 group and a 100MB/s read rate, reading the 3 remaining drives in parallel to rebuild the fourth will take ~43 minutes. However, that assumes all the good SSDs are idle except for rebuild IO. Most systems limit drive rebuild IO to no more than 1/2 to 1/4 of the drive activity (possibly much less) in the RAID group. As such, a more realistic rebuild time can be anywhere from 86 to 169 minutes or more.
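The rebuild-time arithmetic, for reference (idealized; real rebuild rates depend on how the controller throttles rebuild IO):

```python
# Idealized rebuild time for a 256GB SSD read at 100MB/s, then throttled.
capacity_mb = 256 * 1000           # 256GB in MB (decimal)
read_mb_per_sec = 100
ideal_minutes = capacity_mb / read_mb_per_sec / 60   # ~43 minutes
throttled_half = ideal_minutes * 2                   # rebuild at 1/2 of drive IO
throttled_quarter = ideal_minutes * 4                # rebuild at 1/4 of drive IO
```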
Now because rebuild time takes a long time, data must continue to be written to the RAID group. But as we are aware, most of the remaining drives in the RAID group are likely to be at the end of their write endurance already.
Thus, it’s quite possible that another SSD in the RAID group will fail while the first drive is rebuilt.
Resulting in a catastrophic data loss (2 bad drives in a RAID 5, 3 drives in a RAID 6 group).
RAID 1 groups with SSDs are probably even more prone to this issue. When the first drive fails, the second should follow closely behind.
Yes, but is this probable?
First, we are talking TBs of data here. The likelihood that a RAID group’s worth of drives would all have the same amount of data written to them, within a matter of hours of rebuild time, is somewhat low. That being said, the lower the write endurance of the drives, the more equal the SSD write endurance at the creation of the RAID group, and the longer it takes to rebuild failing SSDs, the higher the probability of this type of catastrophic failure.
In any case, the problem is highly likely to occur with RAID 1 groups using similar SSDs as the drives are always written in pairs.
But for RAID 5 or 6, it all depends on how well data striping across the RAID group equalizes data written to the drives.
For hard disks equalizing IO activity across drives in a RAID group was a good thing, and customers or storage systems all tried to do it. So with good (manual or automated) data striping, this problem becomes more likely.
Automated storage tiering using SSDs is not as easy to fathom with respect to write endurance catastrophes. Here a storage system automatically moves the hottest data (highest IO activity) to SSDs and the coldest data down to hard disks. In this fashion, they eliminate any manual tuning activity but they also attempt to minimize any skew to the workload across the SSDs. Thus, automated storage tiering, if it works well, should tend to spread the IO workload across all the SSDs in the highest tier, resulting in similar multi-SSD drive failures.
However, with some vendors’ automated storage tiering, the data is actually copied and not moved (that is, the data resides both on disk and SSD). In this scenario losing an SSD RAID group or two might severely constrain performance, but does not result in data loss. It’s hard to tell which vendors do which, but customers should be able to find out.
So what’s an SSD user to do?
Using RAID 4 for SSDs seems to make sense. The reason we went to RAID 5 and 6 was to avoid hot (parity write) drive(s) but with SSD speeds, having a hot parity drive or two is probably not a problem. (Some debate on this, we may lose some SSD performance by doing this…). Of course the RAID 4 parity drive will die very soon, but paradoxically having a skewed workload within the RAID group will increase SSD data availability.
Mixing SSDs age within RAID groups as much as possible. That way a single data load level will not impact multiple drives.
Turning off LUN data striping within a SSD RAID group so data IO can be more skewed.
Monitoring write endurance levels for your SSDs, so you can proactively replace them long before they will fail
Keeping good backups and/or replicas of SSD data.
I learned the other day that most enterprise SSDs provide some sort of write endurance meter that can be seen at least at the drive level. I would suggest that all storage vendors make this sort of information widely available in their management interfaces. Sophisticated vendors could use such information to analyze the SSDs being used for a RAID group and suggest which SSDs to use to maximize data availability.
But in any event, for now at least, I would avoid RAID 1 using SSDs.
Day 2 saw releases for new VMAX and VPLEX capabilities hinted at yesterday in Joe’s keynote. Namely,
VMAX announcements
VMAX now supports
Native FCoE with 10GbE support – VMAX now directly supports FCoE, 10GbE iSCSI and SRDF
Enhanced Federated Live Migration – now supports other multi-pathing software, specifically MPIO in addition to PowerPath, with more multi-pathing solutions soon to come
Support for RSA’s external key management (RSA DPM) for their internal VMAX data security/encryption capability.
It was mentioned more than once that the latest Enginuity release 5875 is being adopted at almost 4x the rate of the prior generation code. The latest release came out earlier this year and provided a number of key enhancements to VMAX capabilities not the least of which was sub-LUN migration across up to 3 storage tiers called FAST VP.
Another item of interest was that FAST VP is driving a lot of flash sales; it seems it’s leading to another level of flash adoption. According to EMC, almost 80-90% of customers can get by with 3% of their capacity in flash and still gain all the benefits of flash performance at significantly less cost.
VPLEX announcements
VPLEX announcements included:
VPLEX Geo – a new asynchronous VPLEX cluster-to-cluster communications methodology which can have the alternate active VPLEX cluster up to 50msec latency away
VPLEX Witness – a virtual machine which provides adjudication between the two VPLEX clusters just in case the two clusters had some sort of communications breakdown. Witness can run anywhere with access to both VPLEX clusters and is intended to be outside the two fault domains where the VPLEX clusters reside.
VPLEX new hardware – using the latest Intel microprocessors,
VPLEX now supports NetApp ALUA storage – the latest generation of NetApp storage.
VPLEX now supports thin-to-thin volume migration- previously VPLEX had to re-inflate thinly provisioned volumes but with this release there is no need to re-inflate prior to migration.
VPLEX Geo
The new Geo product in conjunction with VMware and Hyper-V allows for quick migration of VMs across distances that support up to 50msec of latency. There are some current limitations with respect to specific VMware VM migration types that can be supported, but Microsoft Hyper-V Live Migration support is readily available at full 50msec latencies. Note, we are not talking about distance here but latency as the limiting factor to how far the VPLEX clusters can be apart.
Recall that VPLEX has three distinct use cases:
Infrastructure availability which provides fault tolerance for your storage and system infrastructure
Application and data mobility which means that applications can move from data center to data center and still access the same data/LUNs from both sites. VPLEX maintains cache and storage coherency across the two clusters automatically.
Distributed data collaboration which means that data can be shared and accessed across vast distances. I have discussed this extensively in my post on Data-at-a-Distance (DaaD), VPLEX surfaces at EMCWorld.
Geo is the third product version for VPLEX: VPLEX Local supports virtualization within a data center; VPLEX Metro supports two VPLEX clusters up to 10msec of latency apart, generally metropolitan-wide distances; and Geo moves to asynchronous cache coherence technologies. Finally, coming sometime later is VPLEX Global, which eliminates the restriction of two VPLEX clusters or data centers and can support 3-way or more VPLEX clusters.
Along with Geo, EMC showed some new partnerships, such as with Silver Peak, Ciena and others, used to reduce bandwidth requirements and cost for their Geo asynchronous solution. Also announced at the show were some new VPLEX partnerships with Quantum StorNext and others which address DaaD solutions.
Other announcements today
Cloud tiering appliance – The new appliance is a renewed RainFinity solution which provides policy based migration to and from the cloud for unstructured data. Presumably the user identifies file aging criteria which can be used to trigger cloud migration for Atmos supported cloud storage. Also the new appliance can support archiving file data to the Data Domain Archiver product.
Google enterprise search connector to VNX – Showing a Google search appliance (GSA) to index VNX stored data. Thus bringing enterprise class and scaleable search capabilities for VNX storage.
A bunch of other announcements today at EMCWorld but these seemed most important to me.
Was invited to the SNIA tech center to witness the CDMI (Cloud Data Management Interface) plugfest going on down in Colorado Springs.
It was somewhat subdued. I always imagine racks of servers, with people crawling all over them with logic analyzers, laptops and other electronic probing equipment. But alas, software plugfests are generally just a bunch of people with laptops, ethernet/wifi connections all sitting around a big conference table.
The team was working to define an errata sheet for CDMI v1.0 to be completed prior to ISO submission for official standardization.
What’s CDMI?
CDMI is an interface standard for clients talking to cloud storage servers and provides a standardized way to access all such services. With CDMI you can create a cloud storage container, define its attributes, and deposit and retrieve data objects within that container. Mezeo announced support for CDMI v1.0 a couple of weeks ago at SNW in Santa Clara.
CDMI provides for attributes to be defined at the cloud storage server, container or data object level such as: standard redundancy degree (number of mirrors, RAID protection), immediate redundancy (synchronous), infrastructure redundancy (across same storage or different storage), data dispersion (physical distance between replicas), geographical constraints (where it can be stored), retention hold (how soon it can be deleted/modified), encryption, data hashing (having the server provide a hash used to validate end-to-end data integrity), latency and throughput characteristics, sanitization level (secure erasure), RPO, and RTO.
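To make this concrete, here’s roughly what a CDMI container-create request looks like on the wire. This is a sketch paraphrased from the freely available CDMI v1.0 spec; the exact metadata names should be checked against the spec, and no request is actually sent here:

```python
import json

# Sketch of a CDMI v1.0 "create container" request. The version header and
# the application/cdmi-container content type follow the spec; the metadata
# names below are illustrative and should be verified against the spec text.
headers = {
    "X-CDMI-Specification-Version": "1.0",
    "Content-Type": "application/cdmi-container",
    "Accept": "application/cdmi-container",
}
body = json.dumps({
    "metadata": {
        "cdmi_data_redundancy": "2",          # e.g. two replicas
        "cdmi_geographic_placement": ["US"],  # where replicas may live
    }
})
# An HTTP PUT of `body` with `headers` to http://<cdmi-server>/MyContainer/
# would ask the server to create the container with those attributes.
```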
A CDMI client is free to implement compression and/or deduplication as well as other storage efficiency characteristics on top of CDMI server characteristics. Probably something I am missing here but seems pretty complete at first glance.
SNIA has defined a reference implementation of a CDMI v1.0 server [and I think a client] which can be downloaded from their CDMI website. [After filling out the “information on me” page, SNIA sent me an email with the download information but I could only recognize the CDMI server in the download information, not the client (although it could have been there). The CDMI v1.0 specification is freely available as well.] The reference implementation can be used to test your own CDMI clients if you wish. It is JAVA based and apparently runs on Linux systems but shouldn’t be too hard to run elsewhere (one CDMI server at the plugfest was running on a Mac laptop).
Plugfest participants
There were a number of people from both big and small organizations at SNIA’s plugfest.
Mark Carlson from Oracle was there and seemed to be leading the activity. He said I was free to attend but couldn’t say anything about what was and wasn’t working. I didn’t have the heart to tell him that, from my limited time there, I couldn’t tell what was working or not anyway. But everything seemed to be working just fine.
Carlson said that SNIA’s CDMI reference implementations had been downloaded 164 times, with the majority of downloads coming from China, the USA, and India, in that order. But he said there were people in just about every geo looking at it. He also said this was the first annual CDMI plugfest, although they had CDMI v0.8 running at other shows (e.g., SNIA SDC) before.
David Slik, from NetApp’s Vancouver Technology Center, was there showing off his demo CDMI Ajax client and laptop CDMI server. He was able to use the Ajax client to access all the CDMI capabilities of the cloud data object he was presenting and to display the binary contents of an object. Then he showed me that the exact same data object (file) could be easily accessed by just typing the proper URL into any browser; it turned out the binary was a GIF file.
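What makes this dual access possible is that the same URL serves two representations: ask for the CDMI object type and you get JSON with metadata plus the encoded value; ask with no CDMI headers (as a browser does) and you get the raw bytes. A sketch of the two requests, with hypothetical URLs:

```python
# Two ways to fetch the same stored object, as in Slik's demo.
# The host, container and object names are hypothetical.
url = "http://cloud.example.com/cdmi/my_container/photo.gif"

# CDMI-style read: returns a JSON wrapper containing the object's
# metadata and its value (encoded, e.g. base64)
cdmi_read = {
    "method": "GET",
    "url": url,
    "headers": {
        "X-CDMI-Specification-Version": "1.0",
        "Accept": "application/cdmi-object",
    },
}

# Browser-style read: no CDMI headers, so the server just returns
# the raw bytes -- in the demo's case, a GIF image
browser_read = {
    "method": "GET",
    "url": url,
    "headers": {},
}

print(cdmi_read["headers"]["Accept"])
```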
The other thing Slik showed me was a cloud data object that had been created via a “cron job” referencing a satellite image website and depositing the data directly into cloud storage, entirely at the server level. Slik said that CDMI also specifies a cloud-storage-to-cloud-storage protocol which could be used to move cloud data from one cloud storage provider to another without having to retrieve the data back to the user. Such a capability would be ideal for exporting user data from one cloud provider and importing it to another over their high speed backbones, rather than having to transmit the data to and from the user’s client.
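As I understand it, this works by having the create request name a source URI instead of carrying the data itself, so the server pulls the bytes directly. A hedged sketch of what such a server-side copy request might look like, with hypothetical providers and paths:

```python
import json

# Sketch of a server-side copy: the create request carries a source
# URI rather than the data, so the destination server fetches the
# bytes itself and they never pass through the client. Both cloud
# provider URLs below are made up for illustration.
copy_request = {
    "method": "PUT",
    "url": "http://cloud-b.example.com/cdmi/backups/image.dat",
    "headers": {
        "X-CDMI-Specification-Version": "1.0",
        "Content-Type": "application/cdmi-object",
    },
    "body": json.dumps({
        "copy": "http://cloud-a.example.com/cdmi/images/image.dat"
    }),
}
print(copy_request["url"])
```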
Slik was also instrumental in the SNIA XAM interface standards for archive storage. He said that CDMI is much more lightweight than XAM: there is no requirement for a runtime library whatsoever, and it depends only on standard HTTP as the underlying protocol. From his viewpoint, CDMI is almost XAM 2.0.
Gary Mazzaferro from AlloyCloud was talking as if CDMI would eventually take over not just cloud storage management but local data management as well. He called CDMI a strategic standard that could potentially be implemented in OSs, hypervisors and even embedded systems to provide a standardized interface for all data management, cloud or local storage. When I asked what happens to SMI-S in this future, he said they would co-exist as independent but cooperative management schemes for local storage.
Not sure how far this goes. I asked if he envisioned a bootable CDMI driver. He said yes, a BIOS CDMI driver is something that will come once CDMI is more widely adopted.
Other people I talked with at the plugfest consider CDMI the new web file services protocol, akin to NFS as the LAN file services protocol. In comparison, they see Amazon S3 as similar to CIFS (SMB1 & SMB2): a proprietary cloud storage protocol that will nonetheless be widely adopted and available.
There were a few people from startups at the plugfest, working on various client and server implementations. Not sure they wanted to be identified nor for me to mention what they were working on. Suffice it to say the potential for CDMI is pretty hot at the moment as is cloud storage in general.
But what about cloud data consistency?
I had to ask how the CDMI standard deals with eventual consistency – it doesn’t. The crowd chimed in: relaxed consistency is inherent in any distributed service. A distributed service really has three characteristics, Consistency, Availability and Partition tolerance (CAP), and you can elect to have any two of them, but must give up the third. Sort of like the Heisenberg uncertainty principle applied to data.
They all said that consistency is mainly a CDMI client issue outside the purview of the standard, associated with server SLAs, replication characteristics and other data attributes. As such, CDMI does not define any specification for eventual consistency.
Although, Slik said the standard does guarantee that if you modify an object and then request a copy of it from the same location during the same internet session, it will be the one you last modified. Seems like long odds in my experience. It’s unclear how CDMI, with relaxed consistency, can ever take the place of primary storage in the data center, but maybe it’s not intended to.
—–
Nonetheless, what I saw was impressive: cloud storage from multiple vendors, all being accessed from the same client, using the same protocols. And if that isn’t simple enough for you, just use your browser.
If CDMI becomes popular, it certainly has the potential to be the new web file system.