SCI SPECsfs2008 NFS throughput per node – Chart of the month

SCISFS150928-001
As SPECsfs2014 still only has (SPECsfs-sourced) reference benchmarks, we have been showing some of our seldom seen SPECsfs2008 charts in our quarterly SPECsfs performance reviews. The above chart was sent out in last month's Storage Intelligence Newsletter and shows NFS throughput operations per second per node.

In the chart, we only include NFS SPECsfs2008 benchmark results for configurations with more than 2 nodes, and we divide the maximum NFS throughput operations per second achieved by the node count to compute NFS ops/sec/node.
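To make the metric concrete, here is a minimal Python sketch of that computation. The system names and numbers below are illustrative placeholders, not the actual SPECsfs2008 submission data.

```python
# Hypothetical (illustrative) SPECsfs2008 results: (max NFS throughput ops/sec, node count)
results = {
    "System A": (1_200_000, 8),
    "System B": (512_000, 4),
    "System C": (3_100_000, 24),
}

# Keep only configurations with more than 2 nodes, then normalize by node count
per_node = {
    name: ops / nodes
    for name, (ops, nodes) in results.items()
    if nodes > 2
}

for name, opn in sorted(per_node.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {opn:,.0f} NFS ops/sec/node")
```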

HDS VSP G1000 with 8 4100 file modules (nodes) and HDS HUS (VM) with 4 4100 file modules (nodes) came in at #1 and #2, respectively, for ops/sec/node, each attaining ~152K NFS throughput operations/sec. per node. The #3 competitor was the Huawei OceanStor N8500 Cluster NAS with 24 nodes, which achieved ~128K NFS throughput operations/sec./node. In 4th and 5th place were the EMC VNX VG8/VNX5700 with 5 X-blades and the Dell Compellent FS8600 with 4 appliances, each of which reached ~124K NFS throughput operations/sec. per node. It falls off significantly from there, with two groups at ~83K and ~65K NFS ops/sec./node.

Although not shown above, it’s interesting that there are many well-known scale-out NAS solutions in the SPECsfs2008 results with over 50 nodes that do much worse than the top 10 above, at <10K NFS throughput ops/sec/node. Fortunately, most scale-out NAS nodes cost quite a bit less than the systems above.

But for my money, one can be well served with a more sophisticated, enterprise-class NAS system that can do >10X the NFS throughput operations per second per node of a scale-out system. That is, if you don’t have to deploy 10PB or more of NAS storage.

More information on SPECsfs2008/SPECsfs2014 performance results as well as our NFS and CIFS/SMB ChampionsCharts™ for file storage systems can be found in our just updated NAS Buying Guide available for purchase on our web site.

Comments?

~~~~

The complete SPECsfs2008 performance report went out in SCI’s September newsletter.  A copy of the report will be posted on our dispatches page sometime this quarter (if all goes well).  However, you can get the latest storage performance analysis now and subscribe to future free monthly newsletters by just using the signup form above right.

As always, we welcome any suggestions or comments on how to improve our SPECsfs performance reports or any of our other storage performance analyses.

 

Latest SPECsfs2008 results, over 1 million NFS ops/sec – chart-of-the-month

Column chart showing the top 10 NFS throughput operations per second for SPECsfs2008
(SCISFS111221-001) (c) 2011 Silverton Consulting, All Rights Reserved

[We are still catching up on our charts for the past quarter but this one brings us up to date through last month]

There’s just something about a million SPECsfs2008(r) NFS throughput operations per second that kind of excites me (weird, I know).  Yes, it takes over 44 nodes of Avere FXT 3500 with over 6TB of DRAM cache, 140 nodes of EMC Isilon S200 with almost 7TB of DRAM cache and 25TB of SSDs, or at least 16 nodes of NetApp FAS6240 in Data ONTAP 8.1 cluster mode with 8TB of FlashCache to get to that level.

Nevertheless, a million NFS throughput operations is something worth celebrating.  It’s not often one achieves a 2X improvement in performance over a previous record.  Something significant has changed here.

The age of scale-out

We have reached a point where scaling systems out can provide linear performance improvements, at least up to a point.  For example, the EMC Isilon and NetApp FAS6240 showed close to linear speedup in performance as they added nodes, indicating (to me at least) there may be more headroom if they just throw more storage nodes at the problem.  Then again, maybe they saw some drop-off and didn’t wish to show the world, or the costs became prohibitive and they had to stop someplace.  On the other hand, Avere only benchmarked a 44-node system with their current hardware (FXT 3500); they must have figured winning the crown was enough.

However, I would like to point out that throwing just any hardware at these systems doesn’t necessarily increase performance.  Previously (see my CIFS vs NFS corrected post), we had shown the linear regression of NFS throughput against spindle count, and although the regression coefficient was good (~R**2 of 0.82), it wasn’t perfect. And of course we eliminated any SSDs from that prior analysis. (We probably should consider eliminating any system with more than a TB of DRAM as well, but this was before the 44-node Avere result was out.)
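For readers who want to reproduce that kind of fit on their own data, here is a minimal sketch of the regression, assuming you have already pulled throughput and spindle counts from the published SPECsfs2008 results; the numbers below are placeholders, not the actual data set behind the ~0.82 figure.

```python
import numpy as np
from scipy import stats

# Placeholder data: spindle count vs. max NFS throughput ops/sec for a handful of
# hypothetical disk-only submissions (SSD/huge-DRAM systems excluded up front).
spindles   = np.array([112, 224, 448, 672, 960, 1344])
throughput = np.array([40_000, 85_000, 160_000, 230_000, 310_000, 420_000])

fit = stats.linregress(spindles, throughput)

print(f"ops/sec per spindle (slope): {fit.slope:.1f}")
print(f"intercept: {fit.intercept:.0f}")
print(f"R**2: {fit.rvalue**2:.2f}")   # the post cites ~0.82 for the real data set
```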

Speaking of disk drives, the FAS6240 system nodes each had 72 x 450GB 15Krpm disks, the Isilon nodes each had 24 x 300GB 10Krpm disks, and each Avere node had 15 x 600GB 7.2Krpm SAS disks.  However, the Avere system also had 4 Solaris ZFS file storage systems behind it, each of which had another 22 x 3TB (7.2Krpm, I think) disks.  Given all that, the 16-node NetApp system, the 140-node Isilon and the 44-node Avere systems had a total of 1152, 3360 and 748 disk drives, respectively.   Of course, this doesn’t count the system disks for the Isilon and Avere systems nor any of the SSDs or FlashCache in the various configurations.
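Those drive totals are just the per-node counts multiplied out (plus the Avere back-end); a quick sketch of the arithmetic, using only the counts quoted above:

```python
# Drive-count tally from the configurations described above
netapp = 16 * 72                 # 16 FAS6240 nodes x 72 disks  = 1152
isilon = 140 * 24                # 140 S200 nodes x 24 disks    = 3360
avere  = 44 * 15 + 4 * 22        # 44 FXT 3500 nodes x 15 disks
                                 #   + 4 ZFS back-end systems x 22 disks = 748

print(netapp, isilon, avere)     # -> 1152 3360 748
```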

I would say that with this round of SPECsfs2008 benchmarks, scale-out NAS systems have arrived.  It’s too bad that neither NetApp nor Avere released comparable CIFS benchmark results, which would have helped in my perennial discussion of CIFS vs. NFS.

But there’s always next time.

~~~~

The full SPECsfs2008 performance report went out to our newsletter subscribers last December.  A copy of the full report will be up on the dispatches page of our site sometime later this month (if all goes well). However, you can see our full SPECsfs2008 performance analysis now and subscribe to our free monthly newsletter to receive future reports directly by just sending us an email or using the signup form above right.

For a more extensive discussion of file and NAS storage performance covering top 30 SPECsfs2008 results and NAS storage system features and functionality, please consider purchasing our NAS Buying Guide available from SCI’s website.

As always, we welcome any suggestions on how to improve our analysis of SPECsfs2008 results or any of our other storage system performance discussions.

Comments?

IBM’s 120PB storage system

Susitna Glacier, Alaska by NASA Goddard Photo and Video (cc) (from Flickr)

Talk about big data, Technology Review reported this week that IBM is building a 120PB storage system for some unnamed customer.  Details are sketchy and I cannot seem to find any announcement of this on IBM.com.

Hardware

It appears that the system uses 200K disk drives to support the 120PB of storage.  The disk drives are packed in a new wider rack and are water cooled.  According to the news report the new wider drive trays hold more drives than current drive trays available on the market.

For instance, HP has a hot-pluggable, 100 SFF (small form factor, 2.5″) disk enclosure that sits in 3U of standard rack space.  200K SFF disks would take up about 154 full racks, not counting the interconnect switching that would be required.  It’s unclear whether water cooling would increase the density much, but I suppose a wider tray with special cooling might get you more drives per floor tile.

There was no mention of interconnect, but today’s drives use either SAS or SATA.  SAS interconnects for 200K drives would require many separate SAS busses. With a SAS expander addressing 255 drives or other expanders, one would need at least 4 SAS busses, but that would put ~64K drives on each bus and would not perform well. Something more like 64-128 drives per bus would perform much better, and each drive would need dual pathing. If we use 100 drives per SAS string, that’s 2000 SAS drive strings, or at least 4000 SAS busses (for dual-port access to the drives).
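As a rough back-of-the-envelope check on the rack and bus counts above, here is a minimal sketch; the 13-enclosures-per-rack figure is my own assumption (leaving some rack space for power and cabling), not something from the report.

```python
import math

drives = 200_000

# Rack estimate: 100 SFF drives per 3U enclosure, assuming ~13 enclosures fit
# per rack once power/cabling space is accounted for (my assumption).
drives_per_rack = 100 * 13
racks = math.ceil(drives / drives_per_rack)        # ~154 racks of disk alone

# SAS estimate: ~100 drives per SAS string, dual-ported for redundancy.
strings = drives // 100                             # 2000 drive strings
busses  = strings * 2                               # 4000 SAS busses (dual path)

print(racks, strings, busses)                       # -> 154 2000 4000
```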

The report mentioned GPFS as the underlying software which supports three cluster types today:

  • Shared storage cluster – where GPFS front-end nodes access shared storage across the backend. This is generally SAN storage system(s).  But given the requirements for high density, it doesn’t seem likely that the 120PB storage system uses SAN storage in the backend.
  • Network-based cluster – here the GPFS front-end nodes talk over a LAN to a cluster of NSD (Network Shared Disk) servers, which can have access to all or some of the storage. My guess is this is what will be used in the 120PB storage system.
  • Shared Network based clusters – this looks just like a bunch of NSD servers but provides access across multiple NSD clusters.

Given the above, ~100 drives per NSD server means an extra 1U per 100 drives, or (given HP drive density) 4U per 100 drives, which works out to 1000 drives and 10 NSD servers per 40U rack (not counting switching).  At this density it takes ~200 racks for 120PB of raw storage plus its NSD nodes, or 2000 NSD nodes in total.

It’s unclear how many GPFS front-end nodes would be needed on top of this, but even at 1 GPFS front-end node for every 5 NSD nodes, we are talking another 400 GPFS front-end nodes and, at 1U per server, another 10 racks or so (not counting switching).

If my calculations are correct, we are talking over 210 racks with switching thrown in to support the storage.  According to IBM’s discussion on the storage challenges for petascale systems, it probably provides ~6TB/sec of data transfer, which should be easy with 200K disks but may require even more SAS busses (maybe ~10K vs. the 2K discussed above).
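Pulling the NSD and rack arithmetic together, under the same assumptions as above (100 drives per NSD server, 3U HP-class enclosures, 40U usable per rack), a quick sketch:

```python
drives            = 200_000
drives_per_server = 100          # one NSD server fronting each 100-drive enclosure
u_per_block       = 3 + 1        # 3U enclosure + 1U NSD server
usable_u_per_rack = 40

blocks_per_rack = usable_u_per_rack // u_per_block       # 10 blocks per rack
drives_per_rack = blocks_per_rack * drives_per_server    # 1000 drives per rack

storage_racks = drives // drives_per_rack                # ~200 racks
nsd_nodes     = drives // drives_per_server              # 2000 NSD servers

gpfs_nodes    = nsd_nodes // 5                           # 1 front-end per 5 NSD nodes -> 400
gpfs_racks    = gpfs_nodes // usable_u_per_rack          # ~10 more racks at 1U/server

print(storage_racks, nsd_nodes, gpfs_nodes, storage_racks + gpfs_racks)
# -> 200 2000 400 210 (before any switching racks)
```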

Software

IBM GPFS is used behind the scenes in IBM’s commercial SONAS storage system but has been around as a cluster file system designed for HPC environments for over 15 years now.

Given this many disk drives, something needs to be done about protecting against drive failure.  IBM has been talking about declustered RAID algorithms for their next generation HPC storage system, which spread the parity across more disks and, as such, speed up rebuild time at the cost of reduced effective capacity. There was no mention of effective capacity in the report, but this would be a reasonable tradeoff.  A 200K drive storage system should have a drive failure every 10 hours, on average (assuming a 2 million hour MTBF).  Let’s hope they get drive rebuild time down much below that.

The system is expected to hold around a trillion files.  I’m not sure, but even at 1024 bytes of metadata per file, this number of files would chew up ~1PB of metadata storage space.
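Both of those figures fall straight out of the headline numbers; a minimal sketch of the arithmetic (the 2M-hour MTBF and 1KB-per-file metadata size are the assumptions stated above):

```python
# Mean time between drive failures across the whole population
drive_mtbf_hours = 2_000_000      # assumed per-drive MTBF
drives           = 200_000
hours_between_failures = drive_mtbf_hours / drives    # -> 10.0 hours, on average

# Metadata footprint for ~1 trillion files at ~1KB of metadata each
files          = 10**12
metadata_bytes = files * 1024                          # ~1.0e15 bytes
metadata_pb    = metadata_bytes / 10**15               # ~1 PB

print(hours_between_failures, metadata_pb)             # -> 10.0 1.024
```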

GPFS provides ILM (information life cycle management, or data placement based on information attributes) using automated policies and supports external storage pools outside the GPFS cluster storage.  ILM within the GPFS cluster supports file placement across different tiers of storage.

All the discussion up to now has revolved around homogeneous backend storage, but it’s quite possible that multiple storage tiers could also be used.  For example, a high density but slower storage tier could be combined with a low density but faster storage tier to provide a more cost effective storage system.  Although it’s unclear whether the application (real world modeling) could readily utilize this sort of storage architecture, or whether they would care about system cost.

Nonetheless, presumably an external storage pool would be a useful adjunct to any 120PB storage system for HPC applications.

Can it be done?

Let’s see: 400 GPFS nodes, 2000 NSD nodes, and 200K drives. Seems like the hardware would be readily doable (not sure why they needed water cooling, but hopefully they obtained better drive density that way).

Luckily GPFS supports InfiniBand, which can support 10,000 nodes within a single subnet.  Thus an InfiniBand interconnect between the GPFS and NSD nodes could easily support a 2400-node cluster.

The only real question is whether a GPFS software system can handle 2000 NSD nodes and 400 GPFS nodes with trillions of files over 120PB of raw storage.

As a comparison, here are some recent examples of scale-out NAS systems:

It would seem that a 20X multiple of a current Isilon cluster, or even a 10X multiple of a currently supported SONAS system, would take some software effort to get working together, but it seems entirely within reason.

On the other hand, Yahoo supports a 4000-node Hadoop cluster and it seems to work just fine.  So from a feasibility perspective, a ~2400-node GPFS-NSD system seems like a walk in the park compared to what Hadoop already does.

Of course, IBM Almaden is working on a project to support Hadoop over GPFS, which might not be optimal for real world modeling but would nonetheless support the node count being talked about here.

——

I wish there were some real technical information on the project out on the web, but I could not find any. Much of this is informed conjecture based on current GPFS system and storage hardware capabilities. But hopefully I haven’t traveled too far astray.

Comments?

 

Shared DAS

Code Name "Thumper" by richardmasoner (cc) (from Flickr)
Code Name "Thumper" by richardmasoner (cc) (from Flickr)

An announcement this week by VMware of their vSphere 5 Virtual Storage Appliance has brought back the concept of shared DAS (see vSphere 5 storage announcements).

Over the years, there have been a few products, such as Seanodes and Condor Storage (may not exist now) that have tried to make a market out of sharing DAS across a cluster of servers.

Arguably, Hadoop HDFS (see Hadoop – part 1), Amazon S3/cloud storage services and most scale out NAS systems all support similar capabilities. Such systems consist of a number of servers with direct attached storage, accessible by other servers or the Internet as one large, contiguous storage/file system address space.

Why share DAS? The simple fact is that DAS is cheap, its capacity is increasing, and it’s ubiquitous.

Shared DAS system capabilities

VMware has limited their DAS virtual storage appliance to a 3 ESX node environment, possibly for lots of reasons.  But there is no such restriction for Seanodes Exanodes clusters.

On the other hand, VMware has specifically targeted SMB data centers for this facility.  In contrast, Seanodes has focused on both HPC and SMB markets for their shared internal storage which provides support for a virtual SAN on Linux, VMware ESX, and Windows Server operating systems.

Although VMware Virtual Storage Appliance and Seanodes do provide rudimentary SAN storage services, they do not supply advanced capabilities of enterprise storage such as point-in-time copies, replication, data reduction, etc.

But some of these facilities are available outside their systems. For example, VMware with vSphere 5 will support a host-based replication service and has had software-based snapshots for some time now. Similar services exist or can be purchased for Windows and presumably Linux.  Also, cloud storage providers have provided a smattering of these capabilities in their offerings from the start.

Performance?

Although distributed DAS storage has the potential for high performance, it seems to me that these systems should perform worse than an equivalent amount of processing power and storage in a dedicated storage array.  But my biases might be showing.

On the other hand, Hadoop and scale-out NAS systems are capable of screaming performance when put together properly.  Recent SPECsfs2008 results for the EMC Isilon scale-out NAS system have demonstrated very high performance, and Hadoop’s claim to fame is high performance analytics. But you have to throw a lot of nodes at the problem.

—–

In the end, all it takes is software. Virtualizing servers, sharing DAS, and implementing advanced storage features: any of these can be done in software alone.

However, service levels, high availability and fault tolerance requirements have historically necessitated a physical separation between storage and compute services. Nonetheless, if you really need screaming application performance and software based fault tolerance/high availability will suffice, then distributed DAS systems with co-located applications like Hadoop or some scale out NAS systems are the only game in town.

Comments?

EMC to buy Isilon Systems

Isilon X series nodes (c) 2010 Isilon from Isilon's website

I understand the rationale behind EMC’s purchase of Isilon scale-out NAS technology for big data applications.  More and more data is being created every day, and most of it is unstructured.  How can one begin to support the multiple PBs of file data coming online in the next couple of years without scale-out NAS?  Scale-out NAS has the advantage that, within the same architecture, one can scale from TBs to PBs of file storage by just adding storage and/or accessor nodes.  Sounds great.

Isilon for backup storage?

But what’s surprising to me is the use of Isilon NL-Series storage in more mundane applications like database backup.  A couple of weeks ago I wrote a post on how Oracle RMAN compressed backups don’t dedupe very well.  The impetus for that post was that a very large enterprise customer I was talking with had just started deploying Isilon NAS systems in their backup environment to handle non-dedupable data.  The customer was backing up PBs of storage, a good portion of which was non-dedupable, and as such, they planned to use Isilon systems to store this data.

I had never seen scale-out NAS systems used for backup storage, so I was intrigued to find out why.  Essentially, this customer was in the throes of replacing tape, and between deduplication appliances and Isilon storage they believed they had the solutions to eliminate tape forever from their backup systems.

All this raises the question of where EMC puts Isilon – with Celerra and other storage platforms, with Atmos and other cloud services, or with Data Domain and other backup systems?  It seems one could almost break out the three Isilon storage systems and split them across these three business groups, but given Isilon’s flexibility it probably belongs in storage platforms.

However, I would think that BRS would have an immediate market requirement for Isilon’s NL-Series storage to complement its other backup systems.  I guess we will know shortly where EMC puts it – until then it’s anyone’s guess.

IBM Scale out NAS (SONAS) v1.1.1

IBM SONAS from IBM's Flickr stream (c) IBM

We have discussed other scale-out NAS products on the market, such as Symantec’s FileStore, IBRIX reborn as HP networked storage, and why SO/CFS, why now (scale out/cluster file systems), in previous posts, but haven’t talked about IBM’s high-end scale out NAS (SONAS) product before. There was an announcement yesterday of a new SONAS version, so I thought it an appropriate time to cover it.

As you may know, SONAS packages up IBM’s well-known GPFS system services and surrounds them with pre-packaged hardware and clustering software that supports a high availability cluster of nodes serving native CIFS and NFS clients.

One can see SONAS is not much to look at from the outside, but internally it comes with three different server components:

  • Interface nodes – which provide native CIFS, NFS and now with v1.1.1 HTTP interface protocols to the file store.
  • Storage nodes – which supply backend storage device services.
  • Management nodes – which provide for administration of the SONAS storage system.

The standard SONAS system starts with a fully integrated hardware package within one rack, with 2 management nodes, 2 to 6 interface nodes, and 2 storage pods (one storage pod consists of 2 storage nodes and 60 to 240 attached disk drives).  The starter system can then be expanded with either a single interface rack with up to 30 interface nodes and/or multiple storage racks with 2 storage pods in each rack.
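To get a feel for how that scales, here is a small, purely illustrative sizing sketch based on the packaging rules quoted above; the per-drive capacity and the rack counts used in the example are my assumptions, not IBM specifications.

```python
# Rough SONAS sizing sketch based on the packaging rules described above.
DRIVES_PER_POD_MAX = 240         # 2 storage nodes + up to 240 drives per pod
PODS_PER_STORAGE_RACK = 2

def raw_capacity_tb(storage_racks: int, drive_tb: float = 0.6) -> float:
    """Raw capacity for a given number of storage racks.

    drive_tb defaults to 0.6 (the 600GB SAS drives mentioned in the announcement);
    substitute larger SATA drives as appropriate -- an assumption, not a spec.
    """
    pods = storage_racks * PODS_PER_STORAGE_RACK
    return pods * DRIVES_PER_POD_MAX * drive_tb

# Example: the base rack (2 pods) fully populated, then 4 additional storage racks
print(raw_capacity_tb(1))        # base rack:              ~288 TB raw
print(raw_capacity_tb(1 + 4))    # with 4 expansion racks: ~1440 TB raw
```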

With v1.1.1, a new hardware option has been provided, specifically the new IBM SONAS gateway for IBM’s XIV storage.  With this new capability SONAS storage nodes can now be connected to an IBM XIV storage subsystem using 8GFC interfaces through a SAN switch.

Some other new functionality released in SONAS V1.1.1 include:

  • New policy engine – used for internal storage tiering and for external/hierarchical storage through IBM’s Tivoli Storage Manager (TSM) product. Recall that SONAS supports both SAS and SATA disk drives, and now one can use policy management to migrate files between internal storage tiers.  Also, with the new TSM interface, data can now be migrated out of SONAS and onto tape or any of the other over 600 storage devices supported by TSM’s Hierarchical Storage Management (HSM) product.
  • Asynch replication – used for disaster recovery/business continuance.  SONAS uses standard Linux-based RSYNC capabilities to replicate file systems from one SONAS cluster to another.  SONAS replication only copies changed portions of files within the file systems being replicated and uses SSH data transfer to encrypt data-in-flight between the two SONAS systems (see the sketch below for the general rsync-over-SSH pattern).
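SONAS drives this internally, but the generic rsync-over-SSH pattern it describes looks roughly like the following; this is an illustration only, the host names and paths are made up, and it is not how SONAS actually packages the function.

```python
import subprocess

def replicate(src_dir: str, remote_host: str, dest_dir: str) -> None:
    """Push incremental file-system changes to a remote cluster over SSH.

    rsync's delta-transfer algorithm only sends changed portions of files, and
    tunneling it through ssh encrypts the data in flight -- the same general
    pattern the async replication description above relies on.
    """
    subprocess.run(
        [
            "rsync",
            "-az",              # archive mode, compress in transit
            "--delete",         # mirror deletions on the target
            "-e", "ssh",        # tunnel the transfer over SSH
            src_dir,
            f"{remote_host}:{dest_dir}",
        ],
        check=True,
    )

# Hypothetical example: replicate an exported file system to a DR-site cluster
replicate("/export/projects/", "dr-site-cluster", "/export/projects/")
```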

There were some other minor enhancements in this announcement, namely higher capacity SAS drive support (now 600GB), NIS authentication support, increased cache per interface node (now up to 128GB), and the already mentioned new HTTP support.

In addition, IBM stated that a single interface node can pump out 900MB/sec (out of cache) and 6 interface nodes can sustain over 5GB/sec (presumably also from cache).  SONAS can currently scale up to 30 interface nodes, though this doesn’t appear to be an architectural limitation, just what has been validated by IBM so far.
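Taking those two numbers at face value, the cached throughput scales close to linearly; a tiny sketch of the check (the 5GB/sec value is just the lower bound IBM quoted):

```python
# Scaling check on the quoted interface-node throughput figures
single_node_mb_s = 900
nodes            = 6
ideal_gb_s       = single_node_mb_s * nodes / 1000      # 5.4 GB/sec if perfectly linear
claimed_gb_s     = 5.0                                   # "over 5GB/sec" per the announcement

print(f"scaling efficiency >= {claimed_gb_s / ideal_gb_s:.0%}")   # >= ~93%
```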

Can’t wait to see this product show up in SPECsfs2008 performance benchmarks to see how it compares to other SO and non-SO file system products.