EMCworld2014 day 1 – EMC acquires DSSD

What does 100TB of flash need with a new ASIC? And why would you implement a real-time analytics data engine using an object storage interface on flash?

It seems the new company purchased by EMC, called DSSD, is up to its eyebrows in ASIC design, implementing a lightning-fast object store to deal with the needs of real-time analytics. Somewhere today there was a slide on the overheads involved: the standard POSIX file stack adds about 25 µsec of OS overhead, and then there is the typical 300 µsec of SSD overhead to perform an IO. But as we learned a couple of weeks ago at SFD5, with Diablo Technologies' MCS and SanDisk's new UltraDIMM they have reduced the SSD overhead to 5 µsec by using memory channels, and now that OS overhead is 5X the overhead of the storage itself.
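To see why that flips the bottleneck, here's a minimal back-of-the-envelope sketch in Python using just the latencies quoted above (the variable names are mine, not anyone's API):

```python
# Rough per-IO latency budget, using the numbers quoted above (all in microseconds).
os_overhead = 25      # typical POSIX file stack overhead
ssd_latency = 300     # typical SSD IO service time
mcs_latency = 5       # memory-channel flash (UltraDIMM class) service time

print(f"SSD era: OS is {os_overhead / (os_overhead + ssd_latency):.0%} of the IO path")
print(f"Memory-channel era: OS is {os_overhead / (os_overhead + mcs_latency):.0%} of the IO path")
print(f"OS overhead is now {os_overhead / mcs_latency:.0f}x the storage overhead")
```

With a 300 µsec device the OS stack is roughly 8% of the IO path; at 5 µsec it becomes over 80% of it, which is exactly the problem DSSD appears to be attacking.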

So what's one to do? In the case of Diablo's MCS, software converts DAS IO into memory channel IO, and an ASIC converts the memory channel IO back into SATA disk IO.

Not sure what DSSD does, but if I were to design a new ASIC for the memory channel I would want something that speaks a memory interface and scales out to 100TBs of flash. At the software layer, maybe we could talk object storage interfaces to the applications.

Going to learn more throughout the day…

[Learned on day 2 that it's more likely shared PCIe SSD storage. OK, it's not 5 µsec latency storage, but it's still faster than networked storage.]


Latest SPC-2 performance results – chart of the month

Spider chart of the top 10 SPC-2 MB/second results, broken out by workload (LFP, LDQ and VOD)

In the figure above you can see one of the charts from our latest performance dispatch on SPC-1 and SPC-2 benchmark results. The above chart shows SPC-2 throughput results sorted in aggregate MB/sec order, with all three workloads broken out for more information.

Just last quarter I was saying it didn't appear as if any all-flash system could do well on SPC-2's throughput-intensive workloads. Well, I was wrong (again): with an aggregate MBPS™ of ~33.5GB/sec, Kaminario's all-flash K2 took the SPC-2 MBPS results to a whole different level, almost doubling the nearest competitor in this category (Oracle ZFS ZS3-4).

Ok, Howard Marks (deepstorage.net), my GreyBeardsOnStorage podcast co-host and long-time friend, had warned me that SSDs had the throughput to be winners at SPC-2, but that they would probably cost too much to be viable. I didn't believe him at the time. How wrong could I be?

As for cost, both Howard and I misjudged this one. The K2 came in at just under $1M USD, whereas the #2 Oracle system was under $400K. But there were five other top 10 SPC-2 MBPS systems over $1M, so the K2 all-flash system's price was about average for the top 10.

Ok, if cost and high throughput aren't the problem, why haven't we seen more all-flash system SPC-2 benchmarks? I tend to think that most flash systems are optimized for OLTP-like update activity and not sequential throughput. The K2 is obviously one exception. But I think we need to go a little deeper into the numbers to understand just what it was doing so well.

The details

The LFP (large file processing) reported MBPS metric is the average over 1MB and 256KB data transfer sizes of streaming activity at 100% write, 100% read and 50%:50% read-write. In K2's detailed SPC-2 report, one can see that the K2 averaged ~26GB/sec for the 100% write workload, ~38GB/sec for the 100% read workload and ~32GB/sec for the 50:50 read:write workload.

On the other hand, the LDQ workload appears to be entirely sequential read-only, but the report shows that it is made up of two workloads, one using 1MB data transfers and the other using 64KB data transfers, with various numbers of streams fired up to generate stress. The surprising item in K2's LDQ run is that it did much better on the 64KB data streams than on the 1MB data streams, an average of 41GB/sec vs. 32GB/sec. This probably says something about an internal data transfer bottleneck for large transfers someplace in the architecture.

The VOD workload also appears to be sequential read-only. The report doesn't indicate a data transfer size, but given K2's actual results, averaging ~31GB/sec, it would seem to be on the order of 1MB.

So what we can tell is that K2's write throughput is worse than its read throughput (~1/3rd worse) and that relatively smaller sequential reads do better than relatively larger sequential reads (~1/4 better). But I must add that even at the relatively "slower" write throughput, the K2 would still have beaten the next best disk-only storage system by ~10GB/sec.
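As a quick sanity check on those fractions, here's a minimal sketch that just redoes the arithmetic from the per-workload averages quoted above:

```python
# Approximate K2 per-workload throughput averages quoted above (GB/sec).
lfp_write, lfp_read = 26, 38      # LFP: 100% write vs. 100% read streams
ldq_1mb, ldq_64kb = 32, 41        # LDQ: 1MB streams vs. 64KB streams

write_penalty = (lfp_read - lfp_write) / lfp_read
small_read_gain = (ldq_64kb - ldq_1mb) / ldq_64kb

print(f"Writes run ~{write_penalty:.0%} below reads (roughly 1/3rd worse)")
print(f"64KB streams run ~{small_read_gain:.0%} above 1MB streams (roughly 1/4 better)")
```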

Where are the other all-flash SPC-2 benchmarks?

Prior to K2 there was only one other all-flash system submission (TMS RamSan-630) for SPC-2. I suspect that writing 26GB/sec to an all-flash system would be hazardous to its health, and maybe other all-flash storage system vendors don't want to encourage this type of activity.

Just for the record, the K2 SPC-2 result has been submitted for "review" (as of 18Mar2014) and may be modified before it is finally "accepted". However, the review process typically doesn't impact performance results as much as other report items. So, officially, we will need to await final acceptance before we can truly believe these numbers.

Comments?

~~~~

The complete SPC  performance report went out in SCI’s February 2014 newsletter.  But a copy of the report will be posted on our dispatches page sometime next quarter (if all goes well).  However, you can get the latest storage performance analysis now and subscribe to future free newsletters by just using the signup form above right.

Even more performance information and OLTP, Email and Throughput ChampionCharts for Enterprise, Mid-range and SMB class storage systems are also available in SCI's SAN Buying Guide, available for purchase from our website.

As always, we welcome any suggestions or comments on how to improve our SPC  performance reports or any of our other storage performance analyses.

Holograms, not just for storage anymore

A recent article I read (Holograms put storage capacity in a spin) discusses a novel approach to holographic data storage, this time using magnetic spin waves to encode holographic information on magnetic memory.

It turns out holograms can be made with any wave-like phenomenon, and optical holograms aren't the only way to go. Magnetic (spin?) waves can also be used to create and read holograms.

These holograms are made in magnetic semiconductor material rather than photographic material. And because the wave nature of magnetic spin operates at a lower frequency than optics there is the potential for even greater densities than corresponding optical holographic storage.

A new memory emerges

The device is called a Magnonic Holographic Memory and it seems to work by applying spin waves through a magnetic substrate and reading (sensing) the resulting interference patterns below the device.

According to the paper, the device is theoretically capable of reading the magnetic (spin) state of hundreds of thousands of nano-magnetic bits in parallel (let's see, that would be about 100KB of information read in parallel), which must have something to do with the holographic nature of the readout, I would guess.

I haven't the foggiest notion how all this works, but it seems to be a fallout of some earlier spintronics work the researchers were doing. The paper showed a set of three holograms read out of a grid. And the prototype device seems to require a grid (almost core-like) of magnetic material on top of the substrate, which acts as the write head. It's not clear if there was a duplicate of this grid below the material to read the spin waves, but something had to be there.

The researchers indicated some future directions to shrink the device, primarily by shrinking what appears to be the write head, and maybe the read heads, even further. It's also not clear what the magnetic substrate being read and written consists of, and whether it can be shrunk any further.

The researchers said that although spin wave holographics cannot compete with optical holographic storage in terms of propagation delays and seem to be noisier, spin wave holographics do appear to be much more appropriate for nm-scale direct integration with electronic circuits.

Is this a new generation of solid state storage?

Photo Credits: Spinning Top by RusselStreet

Bringing compute to storage

Researchers at MIT (see Storage system for ‘big data’ dramatically speeds access to information) have come up with a novel storage cluster using FPGAs and flash chips to create a new form of database machine.

In their system they have an FPGA that supports limited computational offload/acceleration along with flash controller functionality for a set of flash chips. They call their system the BlueDBM or Blue Database Machine.

Their storage device is used as a PCIe flash card in a host PC. But in their implementation each of the PCIe flash cards is interconnected via an FPGA serial link. This approach creates a distributed controller across all the PCIe flash cards in the host servers and allows any host PC to access any of the flash card data at high speed.

They claim that node-to-node access latencies are on the order of 60-80 microseconds and that their distributed controller can sustain 70% of theoretical system bandwidth. In their prototype 4-node system, their performance testing shows that it's an order of magnitude faster than Microsoft Research's CORFU (Cluster of Raw Flash Units).

Why FPGAs?

There are two novel aspects to their system: 1) the computational offload capabilities provided by the FPGA in front of the flash, and 2) their implementation of a distributed controller across the storage nodes using the FPGA serial network.

Both of these characteristics depend on the FPGA. Using FPGAs also keeps system cost down, and the FPGAs have a readily available, internally supported serial link that could be used.

But by using an FPGA, the computational capabilities are more limited, and re-configuring (re-programming) the storage cluster's compute capabilities takes more time. If they used a more general purpose CPU in front of the flash chips they could support a much richer computational offload next to the storage chips. For example, in their prototype the FPGAs supported 'word-counting' offload functionality.

Nonetheless, as most flash storage these days already has a fairly sophisticated controller, it's not much of a stretch to bump this compute power up to something a bit more programmable and make its functionality more available via APIs. I suppose to gain equivalent performance this would need to use PCIe flash cards.

Where they would get the internal card-to-card serial link with general purpose CPUs may be a concern, which brings up another question.

The distributed controller gives them what exactly?

I believe that with a serial link based distributed controller they don’t need a full networking stack to access the PCIe flash storage on other nodes. This should save both access time and compute power.

In follow on work, the MIT researchers plan to implement a Linux based, distributed file system across the BlueDBM. This should give them a more normal storage stack for their system. How this may interact with the computational offload capabilities is another question.

I would have to say the reduction in access latency is what they were after with the distributed controller and they seem to have achieved it, as noted above. I suppose something similar could be done with multiple PCIe cards in the same host but with the potential to grow from 4 to 20 nodes, the BlueDBM starts to look more interesting.

What sort of application could use such a device?

They talked about performing near real-time analysis of scientific data or modeling all the particles in a simulation of the universe.  But just about any application that required extremely low access time with limited data services could potentially take advantage of their storage system. High Frequency Trading comes to mind.

As for big data applications, I haven’t heard of any big data deployments that use SSDs for basic storage let alone PCIe flash cards. I don’t believe there’s going to be a lot of big data analytics that has need for this fast a storage system.

~~~~

Utilizing excess compute power in a storage controller has been an ongoing dream for a long time. Aside from running VMs and a couple of other specialized services, such as A-V scanning, within a storage controller, there hasn't been a lot of this type of functionality ever released for use inside a storage controller. With software defined storage coming online, it may not even make that much sense anymore.

The MIT researchers' BlueDBM solution is somewhat novel, but unless they can more easily generalize the computational offload it doesn't seem as if it will become a very popular way to go for analytics applications.

As for their reduction in access latencies, that might have some legs if they can put more storage capacity behind it and continue to support similar access latencies. But they will need to provide a more normal access method to it. The distributed Linux file system might be just the ticket to get this off into the market.

Comments?

Photo Credits: Lightening by Jolene

SpecSFS2008 results NFS throughput vs. flash size – Chart of the Month

Scatter plot of SPECsfs2008 NFS throughput results against flash capacity

The above chart was sent out in our December newsletter and represents yet another attempt to understand how flash/SSD use is impacting storage system performance. This chart's interesting twist is to try to categorize the use of flash in hybrid (disk-SSD) systems vs. flash-only/all-flash storage systems.

First, we categorize SSD/Flash-only systems (blue diamonds on the chart) as any storage system that has as much or more flash storage capacity than SPECsfs2008 exported file system capacity. While not strictly true (there is one such system with only ~99% of its exported capacity in flash), it is a reasonable approximation. Any other system that has some flash identified in its configuration is considered a Hybrid SSD&Disks system (red boxes on the chart).

Next, we plot the system’s NFS throughput on the vertical axis and the system’s flash capacity (in GB) on the horizontal axis. Then we charted a linear regression for each set of data.
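For the curious, here is a minimal sketch of that categorization and regression (the field names and sample values are hypothetical; the real analysis uses the full SPECsfs2008 submission data):

```python
import numpy as np

# Each entry: (flash_capacity_GB, exported_capacity_GB, NFS_throughput_ops_sec) - hypothetical values.
submissions = [
    (1200.0,  1000.0, 160000.0),   # flash >= exported capacity -> flash-only
    (4000.0,  3500.0, 210000.0),   # flash-only
    (500.0,  40000.0, 250000.0),   # some flash, less than exported capacity -> hybrid
    (900.0,  80000.0, 440000.0),   # hybrid
]

flash_only = [(f, t) for f, e, t in submissions if f >= e]
hybrid     = [(f, t) for f, e, t in submissions if 0 < f < e]

def regress(points):
    """Least-squares line of NFS throughput vs. flash capacity: returns (slope, intercept)."""
    x, y = np.array([p[0] for p in points]), np.array([p[1] for p in points])
    return np.polyfit(x, y, 1)

print("flash-only throughput per flash GB (slope):", regress(flash_only)[0])
print("hybrid throughput per flash GB (slope):", regress(hybrid)[0])
```

The slope of each fitted line is effectively the NFS throughput gained per GB of flash for that category, which is the comparison the chart is making.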

What troubles me with this chart is that hybrid systems are getting much more NFS throughput performance out of their flash capacity than flash-only systems. One would think that flash-only systems would generate more throughput per flash GB than hybrid systems because of the slow access times from disk. But the data shows this is wrong?!

We understand that NFS throughput operations are mostly metadata file calls and not data transfers so one would think that the relatively short random IOPS would favor flash only systems. But that’s not what the data shows.

What the data seems to tell me is that judicious use of flash and disk storage in combination can be better than either alone or at least flash alone.  So maybe those short random IOPS should be served out of SSD and the relatively longer, more sequential like data access (which represents only 28% of the operations that constitute NFS throughput) should be served out of disk.  And as the metadata for file systems is relatively small in capacity, this can be supported with a small amount of SSD, leveraging that minimal flash capacity for the greater good (or more NFS throughput).
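That "judicious mix" could be as simple a policy as the following sketch (the thresholds and names are entirely hypothetical, just to make the idea concrete):

```python
def place_request(is_metadata: bool, io_size_bytes: int, is_sequential: bool) -> str:
    """Route an NFS operation to SSD or disk, per the judicious-mix idea above."""
    if is_metadata:
        return "ssd"     # metadata is small and randomly accessed, so keep it on flash
    if is_sequential or io_size_bytes >= 256 * 1024:
        return "disk"    # longer, more sequential transfers stream fine off disk
    return "ssd"         # short random data IOs also benefit from flash

print(place_request(is_metadata=True, io_size_bytes=512, is_sequential=False))      # ssd
print(place_request(is_metadata=False, io_size_bytes=1 << 20, is_sequential=True))  # disk
```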

I would be remiss if I didn't mention that there are relatively few (7) flash-only systems in the SPECsfs2008 benchmarks and the regression fit is very poor (R**2 = ~0.14), which means that this could change substantially with more flash-only submissions. However, it's looking pretty flat from my perspective, and it would take an awful lot of flash-only systems showing much higher NFS throughput per flash GB to make a difference in the regression equation.

Nonetheless, I am beginning to see a pattern here: SSD/Flash is good for some things and disk continues to be good for others. Smart storage system developers would do well to realize this fact. Also, as a side note, I am beginning to see some rationale for why there aren't more flash-only SPECsfs2008 results.

Comments?

~~~~

The complete SPECsfs2008 performance report went out in SCI’s December 2013 newsletter.  But a copy of the report will be posted on our dispatches page sometime this quarter (if all goes well).  However, you can get the latest storage performance analysis now and subscribe to future free newsletters by just using the signup form above right.

Even more performance information and ChampionCharts for NFS and CIFS/SMB storage systems are also available in SCI's NAS Buying Guide, available for purchase from our website.

As always, we welcome any suggestions or comments on how to improve our SPECsfs2008  performance reports or any of our other storage performance analyses.


Storage changes in vSphere 5.5 announced at VMworld 2013

Pat Gelsinger, VMworld2013 keynote, vSphere 5.5 storage changes

VMworld2013 is going on in San Francisco this week. The big news is the roll out of network virtualization in NSX and vCloud Hybrid Service (vCHS), but there were a few tidbits in the storage arena worth discussing.

  • Virtual SAN public beta – VSAN was released as a public beta and customers can now download a copy of VSAN from www.vsanbeta.com. VSAN constructs a pool of storage out of locally attached disks and flash across two or more hosts, using the flash as a read-write cache for the local disks. With VSAN, customers can elect to have multiple tiers of storage supported within a single VSAN pool, as well as support different availability (replication) levels and some other, select characteristics. VSAN can easily scale in performance and capacity by just adding more hosts that have local storage. Now all those stranded local storage and flash server-level resources can be used as a VM storage pool. VMware stated that they see VSAN as useful for tier 2/tier 3 application storage and/or backup-archive storage uses. However, they showed one chart with a View Planner application simulation using a 3-host VSAN (presumably with lots of SSD and disk storage) compared against an all-flash array (vendor unknown). In this benchmark the VSAN exactly matched the all-flash external storage in performance (VMs supported). [Late update] Lots of debate on what VSAN means to enterprise storage, but it appears to be limited in scope and mainly focused on SMB applications. Chad Sakac did a (really) lengthy post on EMC's perspective on VSAN and Software Defined Storage; if you want to know more, check it out.
  • Virsto – VMware announced GA of Virsto, which takes any external storage and creates a new global storage pool out of it. Apparently, it maps a log-structured file system across the external SAN storage. By doing this it sequentializes all the random write IO coming off of ESX hosts (see the sketch after this list). It supports thin provisioning, snapshots and read-write clones. One could see this as almost a write cache for VM IO activity, but read IOs are also, by definition, spread (extremely wide striped) across the storage pool, which should improve read performance as well. You configure external storage as normal and present those LUNs to Virsto, which then converts that storage pool into "vDisks" that can be configured as VM storage. Probably more to see here, but it's available today. Before the acquisition, one had to install Virsto into each physical host that was going to define VMs using Virsto vDisks. It's unclear how much Virsto has been integrated into the hypervisor, but over time one would assume that, like VSAN, this would be buried underneath the hypervisor and be available to any vSphere host.
  • vSphere Flash Read Cache – customers with PCIe flash cards and vCenter Ops Manager can now use them to support a read cache for data access. vSphere Flash Read Cache is apparently vMotion aware, such that as you move VMs from one ESX host to another the read cache buffer will move with them. Flash Read Cache is transparent to the VMs and can be assigned on a VMDK basis.
  • vSphere 5.5 low-latency support – it's unclear what VMware actually did, but they now claim vSphere 5.5 supports low-latency applications, like FinServ apps. They claim to have reduced the "jitter" or variability in IO latency that was present in previous versions of vSphere. Presumably they shortened the IO and networking paths through the hypervisor, which should help. I suppose if you have a VMDK which ends up on SSD storage someplace, one can have a more predictable response time. But the critical question is how much overhead the hypervisor IO path adds to the base O/S. With all-flash arrays now sporting latencies under 100 µsec, adding another 10 or 100 µsec can make a big difference. In VMware's quest to virtualize any and all mission critical apps, low-latency apps are one of the last bastions of physical server apps left to conquer. Consider this a step to accommodate them.
  • vVols – VMware keeps talking about vVols as an attempt to extend their VSAN "policy driven control plane" functionality out to networked storage, but there's still no GA yet. The (VASA 2 or vVol) specs seem to have been out for a while now, and I have heard from at least two "major" vendors that they have support in place today, but VMware still isn't announcing formal availability. It's unclear what the hold-up is, but maybe the specs are more in a state of flux than what's depicted externally.
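To make the Virsto log-structuring idea above a bit more concrete, here's a toy sketch of how a log-structured mapping layer turns random vDisk writes into sequential backend writes (this is my own illustration, not Virsto's actual implementation):

```python
class LogStructuredPool:
    """Toy log-structured layer: random vDisk writes become sequential appends to a backend log."""
    def __init__(self):
        self.log = []     # backend storage pool, written strictly sequentially
        self.map = {}     # (vdisk_id, block_number) -> position in the log

    def write(self, vdisk_id, block_number, data):
        self.map[(vdisk_id, block_number)] = len(self.log)   # remember the newest location
        self.log.append(data)                                # append only, never write in place

    def read(self, vdisk_id, block_number):
        return self.log[self.map[(vdisk_id, block_number)]]

pool = LogStructuredPool()
pool.write("vm1-vdisk0", 1042, b"random write A")   # arrives at a random vDisk offset...
pool.write("vm1-vdisk0", 7, b"random write B")      # ...but both land sequentially in the log
print(pool.read("vm1-vdisk0", 1042))                # b'random write A'
```

The mapping table is what lets reads find the latest copy of a block even though writes never go back and update data in place.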

Most of this week was spent talking about NSX, VMware's network virtualization, and vCloud Hybrid Service. When they flashed the list of NSX partners on the screen, Cisco was absent. Not sure what this means, but perhaps there's some concern that NSX will take revenue away from Cisco.

As for vCHS, apparently this is a VMware-run public cloud, with two data centers (now expanding to three) in the US, that customers can use to support their own hybrid cloud services. VMware announced that SAVVIS is now offering vCHS services as well as VMware, with data centers in NY and Chicago. There was some talk about vCHS offering object storage services like Amazon's S3, but nothing specific about when. [Late update] Pat did mention that a future offering will provide DR-as-a-Service using vCHS as a target for SRM. That seems to match what Microsoft is planning for Azure and Hyper-V DR.

That’s about it as far as I can tell. Didn’t hear any other news on storage changes in vSphere 5.5. But this is the year of network virtualization. Can’t wait to see what they roll out next year.

HP Tech Day – StoreServ Flash Optimizations

Attended HP Tech Field Day late last month in Disneyland. Must say the venue was the best ever for HP, and getting in on Nth Generation Conference was a plus. Sorry it has taken so long for me to get around to writing about it.

We spent a day going over HP's new converged storage, software defined storage and other storage topics. HP has segmented the Software Defined Data Center (SDDC) storage requirements into cost-optimized Software Defined Storage and SLA-optimized Service Refined Storage. Under Software Defined Storage they talked about their StoreVirtual product line, which is an outgrowth of the LeftHand Networks VSA, first introduced in 2007. This June, they extended SDS to include their StoreOnce VSA product to go after SMB and ROBO backup storage requirements.

We also discussed some of HP’s OpenStack integration work to integrate current HP block storage into OpenStack Cinder. They discussed some of the integrations they plan for file and object store as well.

However what I mostly want to discuss in this post is the session discussing how HP StoreServ 3PAR had optimized their storage system for flash.

They showed an SPC-1 chart depicting various storage systems' IOPS levels and response times as they ramped from 10% to 100% of their IOPS rate. StoreServ 3PAR's latest entry showed a considerable band of IOPS (25K to over 250K) all within a sub-msec response time range, which was pretty impressive since at the time no other storage system seemed able to do this across its whole range of IOPS. (A more recent SPC-1 result from HDS, with an all-flash VSP with Hitachi Accelerated Flash, was also able to accomplish this [sub-msec response time throughout the whole benchmark], only in their case it reached over 600K IOPS – read about this in our latest performance report in our newsletter, sign up above right.) The flash optimizations HP discussed included:

  • Adaptive Read – As I understood it, this changes the size of backend reads to match the size requested by the front end. For disk systems, one often sees that a host read of, say, 4KB causes a read of 16KB from the backend, on the assumption that the host will request additional data after the block is read off of disk; 90% of the time spent on a disk read goes to getting the head to the correct track, and once there it takes almost no effort to read more data. With flash, however, there is no real effort to get to the proper location to read a block of data, so there is no advantage to reading more data than the host requests; if the host comes back for more, one can immediately read from the flash again. (A minimal sketch of this idea follows the list.)
  • Adaptive Write – Similar to adaptive read, adaptive write only writes the changed data to flash. So if a host writes a 4KB block, then 4KB is written to flash. This doesn't help much for RAID 5 because of parity updates, but for RAID 1 (mirroring) it saves on flash writes, which ultimately lengthens flash life.
  • Adaptive Offload (destage) – This changes the frequency of destaging or flushing cache depending on the level of write activity. Slower destaging allows written (dirty) data to accumulate in cache if there's not much write activity going on, which means RAID 5 parity may not need to be updated as often, as one could potentially accumulate a whole stripe's worth of data in cache. In low-activity situations such destaging could occur every 200 msec, whereas with high write activity destaging could occur as fast as every 3 msec.
  • Multi-tenant IO processing – For disk drives with sequential reads, one wants the largest stripes possible (due to the head positioning penalty), but for SSDs one wants the smallest stripe sizes possible. The other problem with large stripe sizes is that devices stay busy for the duration of the longer IOs while performing the stripe writes (and reads). StoreServ modified the stripe size for SSDs to be 32KB so that other IO activity need not wait as long to get its turn in the (IO device) queue. The other advantage shows up during SSD rebuilds: with a 32KB stripe size one can intersperse more IO activity on the devices involved in the rebuild without impacting rebuild performance.
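Here's a minimal sketch of the Adaptive Read idea as I understood it (the sizes and function name are mine, not HP's code): on disk the backend over-reads to amortize the seek, while on flash it reads only what the host asked for.

```python
def backend_read_size(host_request_bytes: int, media: str) -> int:
    """Decide how much to read from the backend for a given host read."""
    if media == "disk":
        # Seek/rotation dominates a disk read, so prefetch a larger chunk (e.g. 16KB for a 4KB host read).
        return max(host_request_bytes, 16 * 1024)
    # Flash has no positioning penalty; re-reading later is cheap, so fetch only what was asked for.
    return host_request_bytes

print(backend_read_size(4 * 1024, "disk"))    # 16384
print(backend_read_size(4 * 1024, "flash"))   # 4096
```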

Of course, the other major advantage HP StoreServ's 3PAR architecture provides for flash is its intrinsic wide striping across a storage pool. This way all the SSDs can be used optimally and equally to service customer IOs.

I am certain there were other optimizations HP made to support SSDs in StoreServ storage, but these are the ones they were willing to talk publicly about.

No mention of when Memristor SSDs will be available, but stay tuned; HP let slip that sooner or later Memristor storage will be in HP storage and servers.

Comments?

Photo Credits: (c) 2013 Silverton Consulting, Inc

Has latency become the key metric? SPC-1 LRT results – chart of the month

I was at EMCworld a couple of months back where they were showing off a preview of the next version of VNX storage, which was trying to achieve a million IOPS with under a millisecond latency. Then I attended NetApp's analyst summit, and the discussion at their Flash seminar was how latency was changing the landscape of data storage and how flash latencies were going to enable totally new applications.

One executive at NetApp mentioned that IOPS was never the real problem. As an example, he mentioned one large oil & gas firm that had a peak IOPS of 35K.

Also, there was some discussion at NetApp of trying to come up with a way of segmenting customer applications by latency requirements.  Aside from high frequency trading applications, online payment processing and a few other high-performance database activities, there wasn’t a lot that could easily be identified/quantified today.

IO latencies have been coming down for years now. Sophisticated disk-only storage systems have been lowering latencies for a decade or more. But since the introduction of SSDs it's been a whole new ballgame. For proof, all one has to do is examine the top 10 SPC-1 LRT (least response time, measured with workloads at 10% of peak activity) results.

Top 10 SPC-1 LRT results, SSD system response times


In looking over the top 10 SPC-1 LRT benchmarks (see figure above) one can see a general pattern. These systems mostly use SSD or flash storage, except for the TMS-400, TMS 320 (IBM FlashSystems) and Kaminario's K2-D, which primarily use DRAM as their storage (with backup storage behind it).

Hybrid disk-flash systems seem to start with an LRT of around 0.9 msec (not on the chart above).  These can be found with DotHill, NetApp, and IBM.

Similarly, you almost have to get to as "slow" as 0.93 msec before you can find any disk-only storage systems. But most disk-only storage comes in with a latency of 1 msec or more. Between 1 and 2 msec LRT we see storage from EMC, HDS, HP, Fujitsu, IBM, NetApp and others.

There was a time when the storage world was convinced that to get really good response times you had to have a purpose-built storage system like TMS or Kaminario, or stripped-down functionality like IBM's Power 595. But it seems that the general purpose HDS HUS, IBM Storwize, and even Huawei OceanStore are all capable of providing excellent latencies with all-SSD storage behind them. And all seem to perform at least in the same ballpark as the purpose-built TMS RamSan-620 SSD storage system. These general purpose storage systems have just about every advanced feature imaginable, with the exception of mainframe attach.

It seems nowadays that there is a trifurcation of latency results going on, based on underlying storage:

  • DRAM-only systems at 0.4 msec down to ~0.1 msec.
  • SSD/flash-only storage at 0.7 msec down to 0.2 msec.
  • Disk-only storage at 0.93 msec and above.
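A rough sketch of that three-way split, using the LRT boundaries listed above (the cutoffs are approximate and the DRAM and SSD ranges overlap in practice):

```python
def storage_class_from_lrt(lrt_msec: float) -> str:
    """Guess the underlying storage technology from an SPC-1 LRT result (rough heuristic)."""
    if lrt_msec <= 0.4:
        return "DRAM-only"
    if lrt_msec <= 0.7:
        return "SSD/flash-only"
    return "disk-only or hybrid"

for lrt in (0.15, 0.35, 0.55, 0.93, 1.5):
    print(f"{lrt} msec -> {storage_class_from_lrt(lrt)}")
```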

The hybrid storage systems are attempting to mix the economics of disk with the speed of flash storage, and seem to be contending with all of these single-technology storage solutions.

It's a new IO latency world today. SSD-only storage systems are now available from every major storage vendor and many of them are showing pretty impressive latencies. Now, with fully functional storage latency below 0.5 msec, what's the next hurdle for IT?

Comments?

Image: EAB 2006 by TMWolf
