An announcement this week by VMware on their vSphere 5 Virtual Storage Appliance has brought back the concept of shared DAS (see vSphere 5 storage announcements).
Over the years, a few products, such as Seanodes and Condor Storage (which may no longer exist), have tried to make a market out of sharing DAS across a cluster of servers.
Arguably, Hadoop HDFS (see Hadoop – part 1), Amazon S3/cloud storage services and most scale out NAS systems all support similar capabilities. Such systems consist of a number of servers with direct attached storage, accessible by other servers or the Internet as one large, contiguous storage/file system address space.
Why share DAS? The simple fact is that DAS is cheap, its capacity is increasing, and it’s ubiquitous.
Shared DAS system capabilities
VMware has limited their DAS virtual storage appliance to a 3 ESX node environment, possibly for lots of reasons. But there is no such restriction for Seanodes Exanodes clusters.
On the other hand, VMware has specifically targeted SMB data centers for this facility. In contrast, Seanodes has focused on both HPC and SMB markets for their shared internal storage which provides support for a virtual SAN on Linux, VMware ESX, and Windows Server operating systems.
Although VMware Virtual Storage Appliance and Seanodes do provide rudimentary SAN storage services, they do not supply advanced capabilities of enterprise storage such as point-in-time copies, replication, data reduction, etc.
But, some of these facilities are available outside their systems. For example, VMware's vSphere 5 will support a host-based replication service, and VMware has offered software-based snapshots for some time now. Similar services exist or can be purchased for Windows and presumably Linux. Also, cloud storage providers have offered a smattering of these capabilities in their offerings from the start.
Although distributed DAS storage has the potential for high performance, it seems to me that these systems should perform worse than an equivalent amount of processing power and storage in a dedicated storage array. But my biases might be showing.
On the other hand, Hadoop and scale-out NAS systems are capable of screaming performance when put together properly. Recent SPECsfs2008 results for EMC's Isilon scale-out NAS system have demonstrated very high performance, and Hadoop's claim to fame is high-performance analytics. But you have to throw a lot of nodes at the problem.
In the end, all it takes is software. Virtualizing servers, sharing DAS, and implementing advanced storage features, any of these can be done within software alone.
However, service levels, high availability and fault tolerance requirements have historically necessitated a physical separation between storage and compute services. Nonetheless, if you really need screaming application performance and software based fault tolerance/high availability will suffice, then distributed DAS systems with co-located applications like Hadoop or some scale out NAS systems are the only game in town.
To try to partition this space just a bit, there is unstructured data analysis and structured data analysis. Hadoop is used to analyze un-structured data (although Hadoop is used to parse and structure the data).
On the other hand, for structured data there are a number of other options currently available. Namely:
EMC Greenplum – a relational database that is available as a software-only distribution as well as, now, a hardware appliance. Greenplum supports both row- and column-oriented data structuring and has support for policy-based data placement across multiple storage tiers. There is also a packaged solution that consists of Greenplum software and a Hadoop distribution running on a Greenplum appliance.
HP Vertica – a column oriented, relational database that is available currently in a software only distribution. Vertica supports aggressive data compression and provides high throughput query performance. They were early supporters of Hadoop integration providing Hadoop MapReduce and Pig API connectors to provide Hadoop access to data in Vertica databases and job scheduling integration.
IBM Netezza – a relational database system based on a proprietary hardware analysis engine configured in a blade system. Netezza is the second oldest solution on this list (see Teradata for the oldest). Since the acquisition by IBM, Netezza now provides their highest performing solution on IBM blade hardware, but all of their systems depend on purpose-built FPGA chips designed to perform high-speed queries across relational data. Netezza has a number of partners and/or homegrown solutions that provide specialized analysis for specific verticals such as retail, telecom, finserv and others. Also, Netezza provides tight integration with various Oracle functionality, but there doesn't appear to be much direct integration with Hadoop on their website.
ParAccel – a column based, relational database that is available in a software only solution. ParAccel offers a number of storage deployment options including an all in-memory database, DAS database or SSD database. In addition, ParAccel offers a Blended Scan approach providing a two tier database structure with DAS and SAN storage. There appears to be some integration with Hadoop indicating that data stored in HDFS and structured by MapReduce can be loaded and analyzed by ParAccel.
Teradata – a relational database based on proprietary, purpose-built appliance hardware. Teradata recently came out with an all-SSD solution which provides very high performance for database queries. The company was started in 1979, has been very successful in the retail, telecom and finserv verticals, and offers a number of special-purpose applications supporting data analysis for these and other verticals. There appears to be some integration with Hadoop but it's not prominent on their website.
Probably missing a few other solutions but these appear to be the main ones at the moment.
In any case, both Hadoop and most of its software-only, structured-data competition are based on a massively parallelized, shared-nothing set of Linux servers. The two hardware-based solutions listed above (Teradata and Netezza) also operate in a massively parallel processing mode to load and analyze data. Such solutions provide scale-out performance at a reasonable cost to support very large databases (PBs of data).
Now that EMC owns Greenplum and HP owns Vertica, we are likely to see more appliance-based packaging options for both of these offerings. EMC has taken the lead here and has already announced Greenplum-specific appliance packages.
One lingering question about these solutions is why customers don't use current traditional database systems (Oracle, DB2, Postgres, MySQL) to do this analysis. The answer seems to lie in the fact that these traditional solutions are not massively parallelized. Thus, doing this analysis on TBs or PBs of data would take too long. Moreover, the cost to support data analysis with traditional database solutions over PBs of data would be prohibitive. For these reasons, and the fact that compute power has become so cheap nowadays, structured data analytics for large databases has migrated to these special purpose, massively parallelized solutions.
I was talking with a local start up called SolidFire the other day with an interesting twist on SSD storage. They were targeting cloud service providers with a scale-out, cluster based SSD iSCSI storage system. Apparently a portion of their team had come from Lefthand (now owned by HP) another local storage company and the rest came from Rackspace, a national cloud service provider.
Their storage system is a scale-out cluster of storage nodes that can range from 3 to a theoretical maximum of 100 nodes (the validated node count is unclear). Each node comes equipped with two 2.4GHz, 6-core Intel processors and ten 300GB SSDs for a total of 3TB of raw storage per node. Each node also has 8GB of non-volatile DRAM for write buffering and 72GB of read cache.
The system also uses two 10GbE links for host-to-storage IO and inter-cluster communications, and supports iSCSI LUNs. Another two 1GbE links are used for management communications.
SolidFire states that they can sustain 50K IO/sec per node. (This looks conservative from my viewpoint, but they didn't state any specific R:W ratio or block size for this performance number.)
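Just for a sense of scale, here's a quick back-of-the-envelope sketch of how those per-node numbers would aggregate across a cluster; the linear-scaling assumption is mine, not a SolidFire benchmark.

```python
# Back-of-the-envelope cluster scaling from the per-node figures above.
# Linear scaling to large node counts is an assumption, not a benchmark result.
NODE_RAW_TB = 3          # 10 x 300GB SSDs per node
NODE_IOPS = 50_000       # SolidFire's stated sustained IO/sec per node

for nodes in (3, 10, 100):
    print(f"{nodes:>3} nodes: ~{nodes * NODE_RAW_TB:>3} TB raw, "
          f"~{nodes * NODE_IOPS:,} IO/sec (if scaling were linear)")
```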
They are targeting cloud service providers and as such the management interface was designed from the start as a RESTful API but they also have a web GUI built out of their API. Cloud service providers will automate whatever they can and having a RESTful API seems like the right choice.
QoS and data reliability
The cluster supports 100K iSCSI LUNs and each LUN can have a QoS SLA associated with it. According to SolidFire one can specify a minimum/maximum/burst level for IOPS and a maximum or burst level for throughput at a LUN granularity.
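To make the min/max/burst idea concrete, here's a sketch of what a per-LUN QoS spec and a naive admission check might look like. The field names and the check are hypothetical illustrations of the model described above, not SolidFire's actual API.

```python
# Hypothetical per-LUN QoS spec illustrating the min/max/burst model described
# above. Field names are illustrative only, not SolidFire's actual API.
lun_qos = {
    "lun_id": 4711,
    "iops": {"min": 500, "max": 5_000, "burst": 10_000},
    "throughput_mbs": {"max": 200, "burst": 400},
}

def admit(new_lun, committed_min_iops, cluster_iops_capacity):
    """Naive admission check: accept a new LUN only if the sum of all
    guaranteed minimum IOPS still fits within the cluster's capacity."""
    return committed_min_iops + new_lun["iops"]["min"] <= cluster_iops_capacity

print(admit(lun_qos, committed_min_iops=140_000, cluster_iops_capacity=150_000))
```

A check like this is also one obvious way a system could guard against the overcommitment concern raised below.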
With LUN-based QoS, one can divide cluster performance into many levels of support for the multiple customers of a cloud provider. Given these unique QoS capabilities, it should be relatively easy for cloud providers to support multiple customers on the same storage, providing very fine-grained multi-tenancy capabilities.
This could potentially lead to system overcommitment, but presumably they have some way to detect when overcommitment is near and prevent it from occurring.
Data reliability is supplied through replication across nodes which they call Helix(tm) data protection. In this way if an SSD or node fails, it’s relatively easy to reconstruct the lost data onto another node’s SSD storage. Which is probably why the minimum number of nodes per cluster is set at 3.
Aside from the QoS capabilities, the other interesting twist from a customer perspective is that they are trying to price an all-SSD storage solution at the $/GB of normal enterprise disk storage. They believe their node with 3TB raw SSD storage supports 12TB of “effective” data storage.
They are able to do this by offering the storage efficiency features of enterprise storage in an all-SSD configuration. Specifically, they provide:
Thin provisioned storage – which allows physical storage to be oversubscribed across multiple LUNs, since provisioned space is not consumed until it is actually written.
Data compression – which searches for underlying redundancy in a chunk of data and compresses it out of the storage.
Data deduplication – which searches multiple blocks and multiple LUNs to see what data is duplicated and eliminates duplicative data across blocks and LUNs.
Space efficient snapshots and clones – which allow users to take point-in-time copies that consume little space, useful for backups and test-dev requirements.
Tape data compression gets anywhere from 2:1 to 3:1 reduction in storage space for typical data loads. Whether SolidFire's system can reach these numbers is another question. However, tape uses hardware compression, and the traditional problem with software data compression is that it takes a lot of processing power and/or time to perform well. As such, SolidFire has configured their node hardware to dedicate a CPU core to each physical drive (two 6-core processors for the 10 SSDs in a node).
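For a feel of what software compression buys and what it costs in CPU time, the snippet below compresses a sample chunk with Python's zlib and reports the ratio; actual results will obviously vary with the data and with the per-drive CPU budget.

```python
import time
import zlib

# Highly repetitive sample data compresses well; random data would not.
chunk = b"customer_id,order_id,status,amount\n" * 50_000

start = time.perf_counter()
compressed = zlib.compress(chunk, 6)   # compression level 6, a typical trade-off
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"{len(chunk)} -> {len(compressed)} bytes, "
      f"{len(chunk) / len(compressed):.1f}:1 in {elapsed_ms:.1f} ms")
```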
Deduplication savings are somewhat trickier to predict but ultimately depends on the data being stored in a system and the algorithm used to deduplicate it. For user home directories, typical deduplication levels of 25-40% are readily attainable. SolidFire stated that their deduplication algorithm is their own patented design and uses a small fixed block approach.
The savings from thin provisioning ultimately depends on how much physical data is actually consumed on a storage LUN but in typical environments can save 10-30% of physical storage by pooling non-written or free storage across all the LUNs configured on a storage system.
Space savings from point-in-time copies like snapshots and clones depend on data change rates and how long it's been since the copy was made. But with space-efficient copies and a short period of existence (as used for backups or temporary copies in test-development environments), such copies should take little physical storage.
Whether all of this can create a 4:1 multiplier from raw to effective data storage is another question, but they also have an eScanner tool which can estimate the savings one can achieve in one's own data center. Apparently the eScanner can be used by anyone to scan real customer LUNs, and it will compute how much SolidFire storage would be required to support the scanned volumes.
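Here's a back-of-the-envelope way to see how a raw-to-effective multiplier in that neighborhood could be built up from the individual efficiency features. The per-feature factors below are assumptions drawn from the typical ranges discussed above, not SolidFire's numbers.

```python
# Illustrative effective-capacity estimate; the per-feature factors are
# assumptions based on typical savings ranges, not SolidFire figures.
raw_tb = 3.0                             # raw SSD capacity per node

compression = 2.0                        # assume 2:1 software compression
dedup = 1.0 / (1.0 - 0.30)               # assume ~30% of data deduplicated away
thin_provisioning = 1.0 / (1.0 - 0.20)   # assume ~20% of provisioned space never written

effective_tb = raw_tb * compression * dedup * thin_provisioning
print(f"~{effective_tb:.1f} TB effective from {raw_tb:.0f} TB raw "
      f"({effective_tb / raw_tb:.1f}:1)")
```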
There are a few items left on their current road map to be delivered later, namely remote replication or mirroring. But for now this looks to be a pretty complete package of iSCSI storage functionality.
SolidFire is currently signing up customers for Early Access but plan to go GA sometime around the end of the year. No pricing was disclosed at this time.
I was at SNIA's BoD meeting the other week and stated my belief that SSDs will ultimately lead to the commoditization of storage. By that I meant that it would be relatively easy to configure enough SSD hardware to create a 100K IO/sec or 1GB/sec system without having to manage 1000 disk drives. Lo and behold, SolidFire comes out the next week. Of course, I said this would happen over the next decade – so I am off by 9.99 years…
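As a quick sanity check on that claim, here's the device math using rough numbers; the ~180 IO/sec per 15K RPM drive is my assumption, while the 50K IO/sec per node is SolidFire's stated figure.

```python
import math

# Devices needed to hit 100K IO/sec. The per-HDD figure is a rough assumption
# for a 15K RPM drive; the per-node figure is SolidFire's stated number.
TARGET_IOPS = 100_000
HDD_IOPS = 180
SSD_NODE_IOPS = 50_000

print("15K RPM disks needed:", math.ceil(TARGET_IOPS / HDD_IOPS))           # ~556
print("SSD storage nodes needed:", math.ceil(TARGET_IOPS / SSD_NODE_IOPS))  # 2
```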
I was talking with another analyst the other day by the name of John Koller of Kai Consulting who specializes in the medical space and he was talking about the rise of electronic pathology (e-pathology). I hadn’t heard about this one.
He said that just like radiology had done in the recent past, pathology investigations are moving to make use of digital formats.
What does that mean?
The biopsies taken today for cancer and disease diagnosis, which involve one or more specimens of tissue examined under a microscope, will now be digitized, and the digital files will be inspected instead of the original slide.
Apparently microscopic examinations typically use a 1×3 inch slide, the whole of which can be devoted to tissue matter. To be able to do a pathological examination, one has to digitize the whole slide, under magnification, at various depths within the tissue. According to Koller, any tissue is essentially a 3D structure, and pathological exams must inspect different depths (slices) within this sample to form a diagnosis.
I was struck by the need for different slices of the same specimen. I hadn’t anticipated that but whenever I look in a microscope, I am always adjusting the focal length, showing different depths within the slide. So it makes sense, if you want to understand the pathology of a tissue sample, multiple views (or slices) at different depths are a necessity.
So what does a slide take in storage capacity?
Koller said, an uncompressed, full slide will take about 300GB of space. However, with compression and the fact that most often the slide is not completely used, a more typical space consumption would be on the order of 3 to 5GB per specimen.
As for volume, Koller indicated that a medium hospital facility (~300 beds) typically does around 30K radiological studies a year but does about 10X that in pathological studies. So at 300K pathological examinations a year, at 3 to 5GB each, we are talking about something on the order of 900TB to 1.5PB of digitized specimen images a year for a mid-sized hospital.
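Running the arithmetic on Koller's figures (a sketch using the per-specimen sizes and study counts quoted above):

```python
# Annual pathology image storage for a mid-sized (~300 bed) hospital,
# computed from the figures quoted above.
radiology_studies_per_year = 30_000
pathology_studies_per_year = radiology_studies_per_year * 10    # ~10X radiology
gb_per_specimen_low, gb_per_specimen_high = 3, 5                 # compressed slide sizes

low_tb = pathology_studies_per_year * gb_per_specimen_low / 1000
high_tb = pathology_studies_per_year * gb_per_specimen_high / 1000
print(f"~{low_tb:.0f} TB to {high_tb:.0f} TB of digitized specimens per year")
```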
Why move to e-pathology?
It can open up a whole myriad of telemedicine offerings similar to the radiological study services currently available around the globe. Today, non-electronic pathology involves sending specimens off to a local lab for examination by medical technicians under a microscope. But with e-pathology, the specimen gets digitized (where: the hospital, the lab?) and then the digital files can be sent anywhere around the world, wherever someone qualified is available to scrutinize them.
At a recent analyst event, we were discussing big data, and aside from the analytics component and other markets, the vendor mentioned that content archives are starting to explode. Given where e-pathology is heading, I can understand why.
I was at another conference the other day where someone showed a chart that said the world will create 35ZB (10**21) of data and content in 2020 from 800EB (10**18) in 2009.
Every time I see something like this I cringe. Yes, lots of data is being created today, but what does that tell us about corporate data growth? Not much, I'd wager.
That being said, I have a couple of questions I would ask of the people who estimated this:
How much is personal data and how much is corporate data?
Did you factor in how entertainment data growth rates will change over time?
These two questions are crucial.
Entertainment dominates data growth
Just as personal entertainment is becoming the major consumer of national bandwidth (see study [requires login]), it’s clear to me that the majority of the data being created today is for personal consumption/entertainment – video, music, and image files.
I look at my own office: our corporate data (office files, PDFs, text, etc.) represents ~14% of the data we keep. Images, music, video and audio take up the remainder of our data footprint. Is this data growing? Yes, faster than I would like, but the corporate data is only averaging ~30% YoY growth while the overall data growth for our shop is averaging ~116% YoY. [As I interrupt this activity to load up another 3.3GB of photos and videos from our camera.]
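To see how quickly those two rates diverge, here's a small projection assuming our shop's ~14% corporate share and the 30%/116% YoY rates hold for a few years; purely illustrative.

```python
# Illustrative projection: corporate share of our shop's total data footprint,
# assuming ~14% of it is corporate today and the growth rates above hold.
corporate, total = 14.0, 100.0      # normalized units
for year in range(1, 6):
    corporate *= 1.30               # ~30% YoY corporate data growth
    total *= 2.16                   # ~116% YoY overall data growth
    print(f"year {year}: corporate is {100 * corporate / total:.1f}% of the total")
```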
Moreover, although some media content is of significant external interest to select companies today (Media and Entertainment, social media photo/video sharing sites, mapping/satellite, healthcare, etc.), most corporations don't deal with lots of video, music or audio data. Thus, I personally see the 30% figure as a more realistic growth rate for corporate data than 116%.
Will entertainment data growth flatten?
Will we see a drop in entertainment data growth rates over time? Undoubtedly.
Two factors will reduce the growth of this data.
What happens to entertainment data recording formats? I believe media recording formats are starting to level out. The issue here is one of fidelity to nature, in terms of how closely a digital representation matches reality as we perceive it. For example, most digital projection systems in movie theaters today run from ~2 to 8TB per feature-length motion picture, which seems to indicate that at some point further gains in fidelity (more pixels per frame) may not be worth it. Similar considerations will ultimately slow down other media encoding formats.
When will all the people who can create content be doing so? Recent data indicates that more than 2B people will be on the internet this year, or ~28% of the world's population. But at some point we must reach saturation in internet penetration, and when that happens data growth rates should also start to level out. Let's say, for argument's sake, that the 800EB figure for 2009 was correct, and assume there were 1.5B internet users in 2009. That works out to a data and content footprint of about 533GB, or ~0.5TB, per internet user, which seems high but certainly doable.
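The per-user arithmetic, as a quick sketch (the 1.5B user count for 2009 is my assumption):

```python
# Per-user data footprint implied by the 2009 estimate.
total_2009_eb = 800            # EB of data and content created in 2009
internet_users_2009 = 1.5e9    # assumed internet user count for 2009

per_user_gb = total_2009_eb * 1e9 / internet_users_2009   # 1 EB = 1e9 GB
print(f"~{per_user_gb:.0f} GB (~{per_user_gb / 1000:.1f} TB) per internet user")
```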
Once these two factors level off, we should see world data and content growth rates plummet. Nonetheless, internet user population growth could be driving data growth rates for some time to come.
The scary part is that the 35ZB represents only a ~41% compound annual growth rate over the period against the baseline 2009 data and content creation level.
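For reference, the implied compound annual growth rate works out as follows:

```python
# Implied compound annual growth rate from 800EB (2009) to 35ZB (2020).
start_eb, end_eb, years = 800, 35_000, 11
cagr = (end_eb / start_eb) ** (1 / years) - 1
print(f"~{cagr:.0%} per year")   # roughly 41%
```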
But I must assume this estimate doesn't consider much growth in the number of digital content creators, otherwise these numbers should go up substantially. In the last week, I ran across someone who said there would be 6B internet users by the end of the decade (can't seem to recall where, but it was a TEDx video). I find that a little hard to believe, but it was based on the assumption that most people will have smart phones with cellular data plans by that time. If that is the case, 35ZB seems awfully short of the mark.
A previous post blows this discussion completely away with just one application (see Yottabytes by 2015 for the NSA; a Yottabyte (YB) is 10**24 bytes of data), and I had already discussed an Exabyte-a-day and 3.3 Exabytes-a-day in prior posts. [Note: those YB by 2015 are all audio (phone) recordings, but if we start using Skype Video, FaceTime and other video communications technologies, can Nonabytes (10**27) be far behind… BOOM!]
I started out thinking that 35ZB by 2020 wasn’t pertinent to corporate considerations and figured things had to flatten out, then convinced myself that it wasn’t large enough to accommodate internet user growth, and then finally recalled prior posts that put all this into even more perspective.
EMC announced today a couple of new twists on the flash/SSD storage end of the product spectrum. Specifically,
They now support all flash/no-disk storage systems. Apparently they have been getting requests to eliminate disk storage altogether. Probably government IT but maybe some high-end enterprise customers with low-power, high performance requirements.
They are going to roll out enterprise MLC flash. It’s unclear when it will be released but it’s coming soon, different price curve, different longevity (maybe), but brings down the cost of flash by ~2X.
EMC is going to start selling server-side Flash, using FAST-like caching algorithms to knit the storage to the server-side Flash. It's unclear what server Flash they will be using, but it sounds a lot like a Fusion-IO type of product. How well the server cache and the storage cache talk to each other is another matter. Chuck Hollis said EMC decided to redraw the boundary between storage and server, and now there is a dotted line that spans the SAN/NAS boundary and carves out a piece of the server for what amounts to on-server caching.
Interesting to say the least. How well it's tied to the rest of the FAST suite is critical. What happens when one or the other loses power? As Flash is non-volatile, no data would be lost, but the currency of the data for shared storage may be another question. Also, having multiple servers in the environment may require cache coherency across the servers and storage participating in this data network!?
Some teaser announcements from Joe’s keynote:
VPLEX asynchronous, active-active support for two-datacenter access to the same data across sites over 1700Km apart, Pittsburgh to Dallas.
New record scalability and capacity for the Isilon NL appliance, which can now support a 15PB file system with trillions of files in it. One gene sequencer says a typical assay generates 500M objects/files…
Embracing Hadoop open source products, so that EMC will support a Hadoop distro in an appliance or software-only solution.
Pat G also showed an EMC Greenplum appliance searching an 8B-row database to find out how many products had been shipped to a specific zip code…
I heard storage beers last night was quite the party; sorry I couldn't make it, but I did end up at the HDS customer reception, which was standing room only and provided all the food and drink I could consume.
Saw quite a lot of old friends too numerous to mention here but they know who they are.
As for technology on display, there was some pretty impressive stuff.
Virident tachIOn SSD
One product that caught my eye was from Virident: their tachIOn SSD. I called it a storage subsystem on a board. I had never talked with them before, but they have been around for a while, originally using NOR storage and now focused on NAND.
Their product is a fully RAIDed storage device using flash aware RAID 5 parity locations, their own wear leveling and other SSD control software and logic with replaceable NAND modules.
Playing with this device, I felt like I was swapping the drives of the future. Each NAND module stack has a separate controller and supports high parallelism. Talking with Shridar Subramanian, VP of marketing, he said the product is capable of over 200K IOPS running a 70% read:30% write workload at full capacity.
They have a capacitor-backed DRAM buffer which can flush the memory buffer to NAND after a power failure. The card plugs into a PCIe slot, uses less than 25W of power, and comes in capacities of 300-800GB. It requires a software driver; they currently only support Linux and VMware (a Linux variant), but Windows and other OSs are on the way.
Other SSDs/NAND storage
Their story was a familiar refrain throughout the floor: lots of SSD/NAND technology coming out, in various form factors. I saw one system using SSDs from Viking Modular Systems that fit into a DRAM DIMM slot and supported a number of SSDs behind a SAS-like controller, also requiring a SW driver.
Of course TMS, Fusion-IO, Micron, Pliant and others were touting their latest SSD/NAND based solutions and technology. For some reason there were lots of SSDs at this show.
Naturally, all the other storage vendors were there too: Dell, HDS, HP, EMC, NetApp and IBM. IBM was showing off Watson, their new AI engine that won at Jeopardy.
And then there was cloud, …
Cloud was a hot topic as well. Saw one player in the corner I have talked about before, StorSimple, which is a cloud gateway provider. They said they are starting to see some traction in the enterprise. Apparently enterprises are starting to adopt cloud – who knew?
Throw in a few storage caching devices, …
Then of course there were the data caching products, which ranged from the relaunched DataRAM XcelASAN to Marvell's new DragonFLY card. DragonFLY provides a cache on a PCIe card while DataRAM's is an FC caching appliance; all pretty interesting.
… and what’s organic storage?
And finally, Scality came out of the shadows with what they are calling an organic object storage device. The product reminded me of Bycast (now with NetApp) and Archivas (now with HDS) in that it has a RAIN architecture with mirrored data behind an object store interface. I asked what makes them different, and Jerome Lecat, CEO, said they are relentlessly focused on performance and claims they can retrieve an object in under 40msec. My kind of product. I think they deserve a deeper dive sometime later.
Probably missed a few other vendors, but these are my initial impressions. For some reason I felt right at home swapping NAND drive modules…
Had a talk the other week with a storage executive about SSD and NAND cost trends. It seemed that everyone thought the $/GB for SSDs was going to cross below that of enterprise-class disk sometime in 2013. But it appeared that NAND costs weren't coming down as fast as anticipated, and now this crossover is going to take longer than expected.
A couple of other things are going on in the enterprise disk market that are also having an effect on the relative advantage of SSDs over disks. Probably most concerning to the SSD market is enterprise storage's new penchant for sub-LUN tiering.
Automated sub-LUN storage tiering
The major storage vendors all currently support some form of automated storage tiering for SSD storage (NetApp’s Flash Cache does this differently but the impact on NAND storage requirements is arguably similar). Presumably, such tiering should take better advantage of any amount of SSD/NAND storage available to a storage system.
Prior to automated sub-LUN storage tiering, one had to move a whole LUN to SSDs to take advantage of their speed. However, I/O requests are not necessarily at the same intensity for all blocks of a LUN. So one would typically end up with an SSD LUN with a relatively few blocks being heavily accessed while the vast majority of its blocks were not hit much at all. We paid the high price of SSD LUNs gladly to get the high performance for those few blocks that really needed it.
However, with sub-LUN tiering or NAND caching, one no longer has to move all the blocks of a LUN into NAND storage to gain its benefits. One can now just have the system identify those select blocks which need high performance and move those blocks and those blocks only to NAND storage. The net impact of sub-LUN tiering or NAND caching is that one should require less overall NAND storage to obtain the same performance as one had previously with SSDs alone.
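A small sketch of why this works: if block accesses are heavily skewed (a Zipf-like distribution with an assumed exponent, purely for illustration), only a small fraction of a LUN's blocks absorb most of the I/O, and only that fraction needs to sit on NAND.

```python
# Assume a Zipf-like access skew: the block at popularity rank r gets weight
# 1/r**1.2. Compute what fraction of blocks absorbs 90% of the I/O.
# The skew exponent is an assumption for illustration; real workloads vary.
num_blocks = 100_000
weights = [1.0 / r**1.2 for r in range(1, num_blocks + 1)]   # hottest block first
total_io = sum(weights)

covered, hot_blocks = 0.0, 0
for w in weights:
    covered += w
    hot_blocks += 1
    if covered >= 0.9 * total_io:
        break

print(f"{hot_blocks / num_blocks:.1%} of blocks absorb 90% of the I/O")
```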
On the other hand, some would say that making the performance advantages of NAND be available at a lower overall cost might actually increase the overall amount of NAND shipments. Also with automated sub-LUN tiering in place, this removes all the complexity needed previously to identify which LUNs needed higher performance. Reducing such complexity should increase SSD or NAND market penetration.
Nonetheless, I feel that given today's price differential of SSDs over enterprise disk, the people buying SSDs today have a very well-defined need for speed and would have paid the price anyway for SSD storage. Anything we do to make satisfying that need with less SSD or NAND storage possible should reduce the amount of SSDs shipped today.
But getting back to that price crossover point, as the relative price of NAND on $/GB comes down, having an easy way to take advantage of its better performance should increase its market adoption, even faster than price would do alone.