Bringing compute to storage

Researchers at MIT (see Storage system for ‘big data’ dramatically speeds access to information) have come up with a novel storage cluster using FPGAs and flash chips to create a new form of database machine.

In their system they have an FPGA that supports limited computational offload/acceleration along with flash controller functionality for a set of flash chips. They call their system the BlueDBM or Blue Database Machine.

Their storage device is used as a PCIe flash card in a host PC. But in their implementation, each of the PCIe flash cards is interconnected via an FPGA serial link. This approach creates a distributed controller across all the PCIe flash cards in the host servers and allows any host PC to access any flash card's data at high speed.

They claim that node-to-node access latencies are on the order of 60-80 microseconds and that their distributed controller can sustain 70% of theoretical system bandwidth.  Performance testing of their prototype 4-node system shows it's an order of magnitude faster than Microsoft Research's CORFU (Cluster of Raw Flash Units).

Why FPGAs?

There are two novel aspects to their system: 1) the computational offload capabilities provided by the FPGA in front of the flash, and 2) their implementation of a distributed controller across the storage nodes using the FPGA serial network.

Both of these characteristics depend on the FPGA. Using FPGAs also keeps system cost down, and the FPGAs come with a readily available, internally supported serial link that could be used.

But by using an FPGA, the computational capabilities are more limited and re-configuring (re-programming) the storage cluster's compute capabilities takes more time. If they used a more general-purpose CPU in front of the flash chips they could support a much richer computational offload next to the storage chips.  For example, in their prototype the FPGAs supported 'word-counting' offload functionality.
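To make the offload idea concrete, here's a minimal sketch of what a host-side API for pushing a word-count style kernel down to a storage node might look like. All of the names here (the OffloadJob structure, submit_offload, the kernel name) are hypothetical illustrations, not the researchers' actual interface; the point is simply that the host ships a small job descriptor and gets back a small result instead of reading gigabytes of raw pages.

```python
# Hypothetical sketch of a host-side computational-offload API.
# None of these names come from the BlueDBM work; they just illustrate
# the "ship the computation to the data" pattern the FPGA enables.

from dataclasses import dataclass

@dataclass
class OffloadJob:
    kernel: str        # FPGA-resident kernel name, e.g. "word_count"
    start_block: int   # first flash block to scan
    block_count: int   # number of blocks to scan
    args: dict         # kernel parameters

def submit_offload(node_id: int, job: OffloadJob) -> dict:
    """Stand-in for a real driver/ioctl call: ship the job descriptor to
    the storage node's FPGA and get back a small result, not raw pages."""
    # Placeholder result; a real implementation would talk to the device.
    return {"node": node_id, "kernel": job.kernel, "count": 0}

# Count occurrences of a word across a 1M-block extent on node 2,
# returning just a count rather than gigabytes of raw data.
result = submit_offload(2, OffloadJob("word_count", 0, 1_000_000, {"word": "gene"}))
print(result)
```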

Nonetheless, as most flash storage these days already has a fairly sophisticated controller, it's not much of a stretch to bump this compute power up to something a bit more programmable and make its functionality more available via APIs.  I suppose that to gain equivalent performance this would need to use PCIe flash cards.

Where they would get the internal card-to-card serial link with general-purpose CPUs may be a concern, which brings up another question.

The distributed controller gives them what exactly?

I believe that with a serial-link-based distributed controller they don't need a full networking stack to access the PCIe flash storage on other nodes. This should save both access time and compute power.
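A rough latency budget shows why skipping the stack matters. All the per-component numbers below are my own illustrative assumptions; only the 60-80 microsecond node-to-node figure above comes from the researchers.

```python
# Back-of-the-envelope latency budget (component numbers are assumptions
# for illustration, not BlueDBM measurements).

flash_read_us    = 50    # typical NAND page read time, order of magnitude
serial_hop_us    = 15    # assumed FPGA serial-link hop + controller overhead
network_stack_us = 100   # assumed TCP/IP + kernel + NIC round-trip overhead

direct_path_us  = flash_read_us + serial_hop_us     # ~65 us, in the same
network_path_us = flash_read_us + network_stack_us  # ballpark as 60-80 us

print(f"serial-link path  : ~{direct_path_us} us")
print(f"network-stack path: ~{network_path_us} us  (roughly 2X worse)")
```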

In follow-on work, the MIT researchers plan to implement a Linux-based, distributed file system across the BlueDBM. This should give them a more normal storage stack for their system. How this may interact with the computational offload capabilities is another question.

I would have to say the reduction in access latency is what they were after with the distributed controller, and they seem to have achieved it, as noted above. I suppose something similar could be done with multiple PCIe cards in the same host, but with the potential to grow from 4 to 20 nodes, the BlueDBM starts to look more interesting.

What sort of application could use such a device?

They talked about performing near-real-time analysis of scientific data or modeling all the particles in a simulation of the universe.  But just about any application that requires extremely low access times and only limited data services could potentially take advantage of their storage system. High frequency trading comes to mind.

As for big data applications, I haven't heard of any big data deployments that use SSDs for basic storage, let alone PCIe flash cards. I don't believe there's going to be a lot of big data analytics that needs this fast a storage system.

~~~~

Utilizing excess compute power in a storage controller has been an ongoing dream for a long time. Yet aside from running VMs and a couple of specialized services such as A-V scanning, not much of this type of functionality has ever been released for use inside a storage controller. With software-defined storage coming online, it may not even make that much sense anymore.

MIT research’s BlueDBM solution is somewhat novel but unless they can more easily generalize the computational offload it doesn’t seem as if it will become a very popular way to go for analytics applications.

As for their reduction in access latencies, that might have some legs if they can put more storage capacity behind it and continue to support similar access latencies. But they will need to provide a more normal access method to it. The distributed Linux file system might be just the ticket to get this into the market.

Comments?

Photo Credits: Lightening by Jolene

Super Talent releases a 4-SSD, RAIDDrive PCIe card

RAIDDrive UpStream (c) 2012 Super Talent (from their website)

Not exactly sure what is happening, but PCIe cards are coming out containing multiple SSD drives.

For example, the recently announced Super Talent RAIDDrive UpStream card contains 4 embedded SAS SSDs that push storage capacity up to almost a TB of MLC NAND.  There is an optional SLC version, but no specs were provided on it.

It looks like the card uses an LSI RAID controller and a SandForce NAND controller.  Unlike the other RAIDDrive cards, which support RAID5, the UpStream can be configured with RAID 0, 1 or 1E (essentially RAID 1, but striped across even or odd drive counts) and currently supports capacities of 220GB, 460GB or 960GB total.
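As a quick illustration of how the RAID level affects usable capacity, here's a small sketch. The drive count and per-SSD size are my own assumptions for illustration (Super Talent doesn't publish the per-SSD breakdown); the point is just that RAID 0 keeps all the raw capacity while RAID 1/1E mirror away half of it.

```python
# Rough usable-capacity arithmetic for the RAID levels the UpStream supports.
# Drive count and per-drive size are illustrative assumptions, not Super
# Talent specifications.

def usable_gb(raid_level: str, drives: int, per_drive_gb: float) -> float:
    raw = drives * per_drive_gb
    if raid_level == "0":            # striping, no redundancy
        return raw
    if raid_level in ("1", "1E"):    # mirroring (1E stripes the mirrors
        return raw / 2               # across even or odd drive counts)
    raise ValueError(f"unsupported RAID level: {raid_level}")

for level in ("0", "1", "1E"):
    print(f"RAID {level}: {usable_gb(level, drives=4, per_drive_gb=240):.0f} GB usable")
# RAID 0 keeps all ~960GB usable; RAID 1 and 1E keep roughly half.
```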

Just like the rest of the RAIDDrive product line, the UpStream card is PCIe x8 connected and requires host software (drivers) for Windows, NetWare, Solaris and other OSs but not for “most Linux distributions”.  Once the software is up, the RAIDDrive can be configured and then accessed just like any other “super fast” DAS device.

Super Talent’s data sheet states UpStream performance at are 1GB/sec Read and 900MB/Sec writes. However, I didn’t see any SNIA SSD performance test results so it’s unclear how well performance holds up over time and whether these performance levels can be independently verified.

It seems just a year ago that I was reviewing Virident's PCIe SSD along with a few others at Spring SNW.  At the time, I thought there were a lot of PCIe NAND cards being shown at the show.  Given Super Talent's card and the many other vendors sporting PCIe SSDs today, there are probably going to be a lot more this time.

No pricing information was available.

~~~~

Comments?

Why EMC is doing Project Lightning and Thunder

Picture of atmospheric lightning striking ground near a building at night
rayo 3 by El Garza (cc) (from Flickr)

Although technically Project Lightning and Thunder represent some interesting offshoots of EMC software, hardware and system prowess, I wonder why they would decide to go after this particular market space.

There are plenty of alternative offerings in the PCIe NAND memory card space.  Moreover, the PCIe card caching functionality, while interesting, is not that hard to replicate, and such software capability is not a serious barrier to entry for HP, IBM, NetApp and many, many others.  And the margins cannot be that great.

So why get into this low margin business?

I can see a couple of reasons why EMC might want to do this.

  • Believing in the commoditization of storage performance.  I have had this debate with a number of analysts over the years, but there remain many out there who firmly believe that storage performance will become a commodity sooner rather than later.  By entering the PCIe NAND card IO buffer space, EMC can create a beachhead in this movement that helps them build market awareness, higher manufacturing volumes, and support expertise.  As such, when the inevitable happens and high margins for enterprise storage start to deteriorate, EMC will be able to capitalize on this hard-won operational effectiveness.
  • Moving up the IO stack.  From an application's IO request to the disk device that actually services it is a long journey with multiple places to make money.  Currently, EMC has a significant share of everything that happens after the fabric switch, whether it is FC, iSCSI, NFS or CIFS.  What they don't have is a significant share in the switch infrastructure or anywhere on the other (host) side of that interface stack.  Yes, they have Avamar, Networker, Documentum, and other software that helps manage, secure and protect IO activity, together with other significant investments in RSA and VMware.  But these represent adjacent market spaces rather than primary IO stack endeavors.  Lightning represents a hybrid software/hardware solution that moves EMC up the IO stack to inside the server.  As such, it represents yet another opportunity to profit from all the IO going on in the data center.
  • Making big data more effective.  The fact that Hadoop doesn't really need or use high-end storage has not been lost on most storage vendors.  With Lightning, EMC has a storage enhancement offering that can readily improve Hadoop cluster processing.  Something like Lightning's caching software could easily be tailored to enhance HDFS file access and thus speed up cluster processing.  If Hadoop and big data are to be the next big consumer of storage, then speeding cluster processing will certainly help, and profiting by doing this only makes sense.
  • Believing that SSDs will transform storage. To many of us the age of disks is waning.  SSDs, in some form or another, will be the underlying technology for the next age of storage.  The densities, performance and energy efficiency of current NAND-based SSD technology are commendable, but they will only get better over time.  The capabilities brought about by such technology will certainly transform the storage industry as we know it, if they haven't already.  But where SSD technology actually emerges is still being played out in the marketplace.  Many believe that when industry transitions like this happen it's best to be engaged everywhere change is likely to occur, hoping that at least some of those bets will succeed. Perhaps PCIe SSD cards won't take over all server IO activity, but if they do, not being there or being late will certainly hurt a company's chances to profit from it.

There may be more reasons I missed here, but these seem to be the main ones.  Of the above, I think the last one, SSDs ruling the next transition, is most important to EMC.

They have been successful in the past during other industry transitions.  If anything, their acquisitions show a similar pattern of buying into transitions they don't already own: witness Data Domain, RSA, and VMware.  So I suspect the view in EMC is that doubling down on SSDs will enable them to ride out the next storm and be in a profitable place for the next change, whatever that might be.

And following Lightning, Project Thunder

Similarly, Project Thunder seems to represent EMC doubling their bet on SSDs yet again.  Just about every month I talk to another storage startup coming to market with another new take on storage using every form of SSD imaginable.

However, Project Thunder as envisioned today is not storage, but rather some form of external shared memory.  I have heard this before, in the IBM mainframe space about 15-20 years ago.  At that time shared external memory was going to handle all mainframe IO processing and the only storage left was going to be bulk archive or migration storage – a big threat to the non-IBM mainframe storage vendors at the time.

One problem then was that the shared DRAM memory of the time was way more expensive than sophisticated disk storage, and the price wasn't coming down fast enough to counteract increased demand.  The other problem was that making shared memory work with all the existing mainframe applications was not easy.  IBM at least had control over the OS, HW and most of the larger applications at the time.  Yet they still struggled to make it usable and effective; there's probably a lesson here for EMC.

Fast forward 20 years and NAND-based SSDs are the right hardware technology to make inexpensive shared memory happen.  In addition, the road map for NAND and other SSD technologies looks poised to continue the capacity increases and price reductions necessary to compete effectively with disk in the long run.

However, the challenges then and now seem to have as much to do with the software that makes shared external memory universally effective as with the hardware technology to implement it.  Providing a new storage tier in Linux, Windows and/or VMware is easier said than done. Most recent successes have usually been offshoots of SCSI (iSCSI, FCoE, etc.).  Nevertheless, if it was good for mainframes then, it's certainly good for Linux, Windows and VMware today.

And that seems to be where Thunder is heading, I think.

Comments?



SATA Express combines PCIe and SATA

SATA Express plug configuration (c) SATA-IO (from SATA-IO.org website)

SATA-IO recently announced a new SATA Express specification (better described in the presentation) that will provide a SATA device interface directly connected to a server's PCIe bus.

The new working specification offers either 8Gbps or 16Gbps depending on the number of PCIe lanes being used and provides a new PCIe/SATA-IO plug configuration.
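For a rough sense of where those totals come from, here's a quick lane calculation. Assuming PCIe 3.0 lanes (8GT/s with 128b/130b encoding) is my own guess; the announcement only quotes the 8Gbps and 16Gbps figures.

```python
# Back-of-the-envelope PCIe lane math (the PCIe 3.0 assumption is mine;
# the SATA-IO announcement only quotes the 8Gbps/16Gbps totals).

raw_gt_per_sec = 8.0        # PCIe 3.0 signaling rate per lane
encoding       = 128 / 130  # 128b/130b line-encoding efficiency

per_lane_gbps = raw_gt_per_sec * encoding   # ~7.9 Gbps usable per lane

for lanes in (1, 2):
    print(f"{lanes} lane(s) -> ~{lanes * per_lane_gbps:.0f} Gbps")
# 1 lane -> ~8 Gbps, 2 lanes -> ~16 Gbps, matching the quoted figures
```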

While this may be a boon to normal SATA disk drives, I see the real advantage lying in an easier interface for PCIe-based NAND storage cards or hybrid disk drives.

New generation of PCIe SSDs based on SATA Express

For example, previously if you wanted to produce a PCIe NAND storage card, you either had to surround it with IO drivers to provide storage/cache interfaces (as FusionIO does) or provide enough smarts on the card to emulate an IO controller along with the backend storage device (see my post on OCZ's new bootable PCIe Z-drive).  With the new SATA Express interface, one no longer needs to provide any additional smarts on the PCIe card, as long as it supports SATA Express.

It would seem that SATA Express would be the best of all worlds.

  • If you wanted a directly accessed SATA SSD you could plug it into your SATA-IO controller.
  • If you wanted networked SATA SSDs you could plug it into your storage array.
  • If you wanted even better performance than either of those two alternatives you could plug the SATA SSD directly into the PCIe bus with the PCIe/SATA-IO interface.

Of course, supporting SATA Express will take additional smarts on the part of any SATA-IO device, but with all new SATA devices supporting the new interface, the cost differential should shrink substantially.

SATA-IO 3.2

The PCIe/SATA-IO plug design is just a concept now, but SATA-IO expects to have the specification finalized by year end, with product availability near the end of 2012.  The SATA-IO organization has designated the SATA Express standard to be part of SATA 3.2.

One other new capability is being introduced with SATA 3.2, specifically a µSATA specification designed to provide storage for embedded system applications.

The prior generation SATA 3.1, coming out in products soon, includes the mSATA interface specification for mobile device storage and the USM SATA interface specification for consumer electronics storage.   And as most should recall, SATA 3.0 provided 6Gbps data transfer rates for SATA storage devices.

—-

Can “SAS Express” be far behind?

Comments?

OCZ’s latest Z-Drive R4 series PCIe SSD

OCZ_Z-Drive_RSeries (from http://www.ocztechnology.com/ocz-z-drive-r4-r-series-pci-express-ssd.html)

OCZ just released a new version of their enterprise-class Z-drive SSD storage with pretty impressive performance numbers (up to 500K IOPS [probably reads] with 2.8GB/sec read data transfer).

Bootability

These new drives are bootable SCSI devices and connect directly to a server's PCIe bus. They come in half-height and full-height card form factors and support 800GB to 3.2TB (full height) or 300GB to 1.2TB (half height) raw storage capacities.

OCZ also offers their Velo PCIe SSD series, which is not bootable and, as such, requires an IO driver for each operating system. The Z-drive, however, has more onboard intelligence, presents itself as a SCSI device and, as such, can be used anywhere.

Naturally this comes at the price of additional hardware and overhead, all of which could impact performance, but given the specified IO rates it doesn't seem to be a problem.

It's unclear how many other PCIe SSDs today offer bootability, but it certainly puts these drives in a different class than previous-generation PCIe SSDs, such as those available from FusionIO and other vendors, that require IO drivers.

MLC NAND

One concern with the new Z-drives might be their use of MLC NAND technology.  Although OCZ's press release said the new drives would be available in either SLC or MLC configurations, the current Z-drive spec sheets only indicate MLC availability.

As discussed previously (see the eMLC & eSLC and STEC's MLC posts), MLC supports less write endurance (program-erase and write cycles) than SLC NAND cells.  Normally the difference is on the order of 10X fewer cycles before NAND cell erase/write failure.
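To get a feel for what a 10X endurance gap means at these capacities, here's a rough lifetime estimate. The P/E cycle counts, write workload and write amplification below are illustrative assumptions typical of the era's parts, not OCZ specifications:

```python
# Rough write-endurance arithmetic. Cycle counts, workload and write
# amplification are illustrative assumptions, not OCZ spec-sheet numbers.

capacity_tb         = 3.2      # full-height Z-drive raw capacity
mlc_pe_cycles       = 5_000    # assumed MLC program/erase endurance
slc_pe_cycles       = 50_000   # assumed SLC endurance (~10X MLC)
host_writes_tb_day  = 10       # assumed sustained host write workload
write_amplification = 2.0      # assumed controller write amplification

def lifetime_years(pe_cycles: int) -> float:
    total_writable_tb = capacity_tb * pe_cycles
    nand_writes_per_day = host_writes_tb_day * write_amplification
    return total_writable_tb / nand_writes_per_day / 365

print(f"MLC: ~{lifetime_years(mlc_pe_cycles):.1f} years")
print(f"SLC: ~{lifetime_years(slc_pe_cycles):.1f} years")
# The 10X cycle difference translates directly into a 10X lifetime difference.
```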

I also noticed there was no write endurance specification on the spec sheet for the new Z-drives.  Possibly, at these capacities it may not matter, but in our view a write endurance specification should be supplied for any SSD, and especially for enterprise-class ones.

Z-drive series

OCZ offers two versions of their Z-drive, the R and C series, both of which offer the same capacities and high performance, but as far as I could tell the R series appears to have more enterprise-class availability and functionality. Specifically, this drive has power-fail protection for writes (capacitance power backup) as well as better SMART support (with "enterprise attributes"). Both seem to be missing from the C series drives.

We hope the enterprise-attribute SMART provides write endurance monitoring and reporting, but no definition of these attributes was easily findable.

Also, the R series power backup, called DataWrite Assurance Technology, would be a necessary component for any enterprise storage device.  It essentially keeps data that has been written to the device, but not yet committed to NAND, from disappearing during a power outage/failure.

Given the above, we would certainly opt for the R series drive in any enterprise configuration.

Storage system using Z-drives

Just consider what one can do with a gaggle of Z-drives in a standard storage system.  For example, with 5 Z-drives in a server, it could potentially support 2.5M IOPS and 14GB/sec of data transfer, with some resulting loss of performance due to front-end emulation.  Moreover, at 3.2TB per drive, even in a RAID5 4+1 configuration the storage system would support 12.8TB of user capacity. One could conceivably do away with any DRAM cache in such a system and still provide excellent performance.
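The arithmetic behind those figures is easy to sanity-check; the per-drive numbers are the OCZ claims quoted above, and the RAID5 4+1 usable capacity is just four data drives' worth of the five:

```python
# Sanity check of the 5-drive back-of-the-envelope numbers, using the
# per-drive figures OCZ quotes (500K IOPS, 2.8GB/sec reads, 3.2TB raw).

drives           = 5
iops_per_drive   = 500_000
read_gb_per_sec  = 2.8
raw_tb_per_drive = 3.2

total_iops    = drives * iops_per_drive            # 2,500,000 IOPS
total_bw      = drives * read_gb_per_sec           # 14 GB/sec
raid5_user_tb = (drives - 1) * raw_tb_per_drive    # 4+1: one drive of parity

print(f"{total_iops:,} IOPS, {total_bw:.0f} GB/sec, {raid5_user_tb:.1f} TB usable")
# -> 2,500,000 IOPS, 14 GB/sec, 12.8 TB usable (before front-end overheads)
```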

What the cost for such a system would be is another question. But with MLC NAND it shouldn’t be too obscene.

On the other hand, serviceability might be a concern, as it would be difficult to swap out a failed drive (bad SSD/PCIe card) while continuing IO operations. This could be done with some special hardware, but it's typically not present in standard, off-the-shelf servers.

—-

All in all, a very interesting announcement from OCZ.  The likelihood that a single server will need this sort of IO performance from a lone drive is not that high (except maybe for massive website front ends), but putting a bunch of these in a storage box is another matter.  Such a configuration would make one screaming storage system with minimal hardware changes and only a modest amount of software development.

Comments?

EMC World news: Day 1, 1st half

EMC World keynote stage, storage, vblocks, and cloud...

EMC announced today a couple of new twists on the flash/SSD storage end of the product spectrum.  Specifically,

  • They now support all-flash/no-disk storage systems. Apparently they have been getting requests to eliminate disk storage altogether. Probably government IT, but maybe also some high-end enterprise customers with low-power, high-performance requirements.
  • They are going to roll out enterprise MLC flash.  It's unclear when it will be released, but it's coming soon: a different price curve, different longevity (maybe), and it brings down the cost of flash by ~2X.
  • EMC is going to start selling server-side flash, using FAST-like caching algorithms to knit the storage to the server-side flash.  It's unclear what server flash they will be using, but it sounds a lot like a Fusion-IO type of product.  How well the server cache and the storage cache talk to each other is another matter.  Chuck Hollis said EMC decided to redraw the boundary between storage and server: now there is a dotted line that spans the SAN/NAS boundary and carves out a piece of the server for what amounts to on-server caching.

Interesting, to say the least.  How well it's tied to the rest of the FAST suite is critical. What happens when one or the other loses power? As flash is non-volatile, no data would be lost, but the currency of the data in shared storage may be another question.  Also, having multiple servers in the environment may require cache coherence across the servers and storage participating in this data network!?

Some teaser announcements from Joe’s keynote:

  • VPLEX asynchronous, active-active support for two-datacenter access to the same data over 1700km apart, Pittsburgh to Dallas (see the quick latency estimate after this list).
  • New Isilon record scalability and capacity with the NL appliance, which can now support a 15PB file system with trillions of files in it.  One gene sequencer says a typical assay generates 500M objects/files…
  • Embracing Hadoop open source products, so that EMC will support a Hadoop distro in an appliance or software-only solution.
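A quick estimate of why that VPLEX configuration has to be asynchronous at that distance (as flagged in the first bullet): even at the speed of light in fiber, 1700km imposes tens of milliseconds of round-trip delay if every write waits on the remote site. The 1700km figure is from the keynote; the fiber speed and path-overhead factor are my assumptions.

```python
# Why active-active over 1700km pretty much has to be asynchronous.
# The 1700km figure is from the keynote; everything else is an assumption.

distance_km     = 1700
fiber_km_per_ms = 200    # light in fiber covers roughly 200km per millisecond
path_overhead   = 1.5    # assumed routing/equipment detour factor

one_way_ms    = distance_km * path_overhead / fiber_km_per_ms
round_trip_ms = 2 * one_way_ms

print(f"one-way   : ~{one_way_ms:.0f} ms")
print(f"round trip: ~{round_trip_ms:.0f} ms per remotely-acknowledged write")
# A round trip in the tens of milliseconds per write would crater latency
# if done synchronously, hence the asynchronous design.
```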

Pat G also showed an EMC Greenplum appliance searching an 8B-row database to find out how many products had been shipped to a specific zip code…