AI benchmark for storage, MLperf Storage

MLperf released their first round of storage benchmark submissions early this month. There’s plenty of interest in how much storage is required to keep GPUs busy for AI work. As a result, MLperf has been busy working with storage vendors to create a benchmark suitable for comparing storage systems under a “simulated” AI workload.

For the v0.5 version, they have released two simulated DNN training workloads: one for image segmentation (3D-Unet [146 MB/sample]) and the other for BERT NLP (2.5 KB/sample).

The GPU being simulated is an NVIDIA V100. What they are showing with their benchmark is a compute system (with GPUs) reading data directly from a storage system.

By using simulated (GPU) compute, the benchmark doesn’t need physical GPU hardware to run. However, the veracity of the benchmark is somewhat harder to depend on.

But if one considers the reported benchmark metric, # supported V100s, as a relative number across the storage submissions, one is on more solid footing. Using it as a real number of V100s that could be physically supported is perhaps invalid.

The other constraint from the benchmark was keeping the simulated (V100) GPUs at 90% busy. The MLperf storage benchmark reports samples/second and MBPS metrics as well as the # of simulated (V100) GPUs supported (@90% utilization).

In the bar chart we show the top 10 # of simulated V100 GPUs for image segmentation storage submissions. DDN AI400X2 had 5 submissions in this category.

The interesting comparison is probably between DDN’s #1 and #3 submissions.

  • The #1 submission had a smaller amount of data (24X3.5TB = 64TB of flash), used 200Gbps InfiniBand with 16 compute nodes and supported 160 simulated V100s.
  • The #3 submission had more data (24X13.9TB = 259TB of flash), used 400Gbps InfiniBand with 1 compute node and supported only 40 simulated V100s.

It’s not clear why the same storage, with less flash and slower interfaces, would support 4X as many simulated GPUs as the same storage with more flash and faster interfaces.

I can only conclude that the number of compute nodes makes a significant difference in simulated GPUs supported.

One can see a similar example of this phenomenon with the Nutanix #2 and #6 submissions above. Here the exact same storage was used for two submissions, one with 5 compute nodes and the other with just 1, but the one with more compute nodes supported 5X the # of simulated V100 GPUs.

Lucky for us, the #3-#10 submissions in the above chart all used one compute node and as such are more directly comparable.

So, if we take #3-#5 in the chart above, as the top 3 submissions (using 1 compute node), we can see that the #3 DDN AI400X2 could support 40 simulated V100s, the #4 Weka IO storage cluster could support 20 simulated V100s and the #5 Micron NVMe SSD could support 17 simulated V100s.
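One rough way to put the multi-node and single-node results on the same footing is to divide each submission’s simulated V100 count by its compute node count. The Python sketch below does that using only the figures quoted in this post. The per-node figure is my own normalization, not a metric MLperf reports, and the names in the code are made up for illustration.

```python
# Image segmentation (3D-Unet) figures quoted in this post.
# Each entry: (label, compute nodes, simulated V100s supported)
submissions = [
    ("#1 DDN AI400X2", 16, 160),
    ("#3 DDN AI400X2", 1, 40),
    ("#4 Weka IO", 1, 20),
    ("#5 Micron NVMe SSD", 1, 17),
]


def v100s_per_node(nodes, v100s):
    """Simulated V100s supported per compute node (illustrative metric)."""
    return v100s / nodes


per_node = {label: v100s_per_node(n, g) for label, n, g in submissions}

for label, value in per_node.items():
    print(f"{label}: {value:.1f} simulated V100s per compute node")
```

On this view the #1 DDN submission drove 10 simulated V100s per compute node while the #3 DDN submission drove 40 from its single node, which is consistent with the compute node count, rather than the storage, setting the headline number.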

The Micron SSD used an NVMe (PCIe Gen4) interface while the other two storage systems used 400Gbps InfiniBand and 100Gbps Ethernet, respectively. This tells us that interface speed, while it may matter at some point, doesn’t play a significant role in determining the # simulated V100s.

Both the DDN AI400X2 and Weka IO storage systems are sophisticated storage systems that support many protocols for file access. Presumably the Micron SSD local storage was directly mapped to a Linux file system.

The only other MLperf storage benchmark that had submissions was for BERT, a natural language model.

In the chart, we show the # of simulated V100 GPUs on the vertical axis. We see the same impact here of having multiple compute nodes in the #1 DDN solution supporting 160 simulated V100s. But in this case, all the remaining systems used 1 compute node.

Comparing the #2-4 BERT submissions, both the #2 and #4 are DDN AI400X2 storage systems. The #2 system had faster interfaces and more data storage than the #4 system and supported 40 simulated GPUs vs the other only supporting 10 simulated V100s.

Once again, Weka IO storage system came in at #3 (2nd place in the 1 compute node systems) and supported 24 simulated V100s.

A couple of suggestions for MLperf:

  • There should be different classes of submissions: one class for only 1 compute node and the other for any number of compute nodes.
  • I would up level the simulated GPU configurations to A100 rather than V100s, which would only be one generation behind best in class GPUs.
  • I would include a standard definition for a compute node. I believe these were all the same, but if the number of compute nodes can have a bearing on the number of V100s supported, the compute node hardware/software should be locked down across submissions.
  • We assume that the protocol used to access the storage over InfiniBand or Ethernet was standard NFS and not something like GPUDirect Storage or other RDMA variants. As the GPUs were simulated this is probably correct, but if not, it should be specified.
  • I would describe the storage configurations with more detail, especially for software defined storage systems. Storage nodes for these systems can vary significantly in storage as well as compute cores/memory sizes which can have a significant bearing on storage throughput.

To their credit, this is MLperf’s first report on their new storage benchmark and I like what I see here. With the information provided, one can at least start to see some true comparisons of storage systems under AI workloads.

In addition to the new MLperf storage benchmark, MLperf released new inferencing benchmarks which included updates to older benchmark NN models as well as a brand new GPT-J inferencing benchmark. I’ll report on these next time.

~~~~

Comments?

Is hardware innovation accelerating – hardware vs. software innovation (round 6)

There’s something happening in the IT industry that has not happened in a couple of decades or so: hardware innovation is back. We’ve been covering bits and pieces of it in our hardware vs. software innovation series (see Open source ASICs – HW vs. SW innovation [round 5] post).

Hardware innovation never really went away, Intel, AMD, Apple and others had always worked on new compute chips. DRAM and NAND also have taken giant leaps over the last two decades. These were all major hardware suppliers. But special purpose chips, non CPU compute engines, and hardware accelerators had been relegated to the dustbins of history as the CPU giants kept assimilating their functionality into the next round of CPU chips.

And then something happened. It kind of made sense for GPUs to be their own electronics as these were SIMD architectures, intrinsically different from the SISD, standard von Neumann X86 and ARM CPU architectures.

But for some reason it didn’t stop there. We first started seeing some inklings of new hardware innovation in the AI space with a number of special purpose DL NN accelerators coming online over the last 5 years or so (see Google TPU, SC20-Cerebras, GraphCore GC2 IPU chip, AI at the Edge Mythic and Syntiant’s IPU chips, and neuromorphic chips from BrainChip, Intel, IBM, others). Again, one could look at these as taking the SIMD model of GPUs into a slightly different direction. It’s probably one reason that GPUs were so useful for AI-ML-DL but further accelerations were now possible.

But it hasn’t stopped there either. In the last year or so we have seen SPUs (Nebulon Storage), DPUs (Fungible, NVIDIA Networking, others), and computational storage (NGD Systems, ScaleFlux Storage, others) all come online and become available to the enterprise. And most of these are for more normal workload environments, i.e., not AI-ML-DL workloads.

I thought at first these were just FPGAs implementing different logic but now I understand that many of these include ASICs as well. Most of these incorporate a standard von Neumann CPU (mostly ARM) along with special purpose hardware to speed up certain types of processing (such as low latency data transfer, encryption, compression, etc.).

What happened?

It’s pretty easy to understand why non-von Neumann computing architectures should come about. Witness all those new AI-ML-DL chips that have become available. And why these would be implemented outside the normal X86-ARM CPU environment.

But SPUs, DPUs and computational storage all have typical von Neumann CPUs (mostly ARM) as well as other special purpose logic on them.

Why?

I believe there are a few reasons, but the main two are that Moore’s law (every 2 years halving the size of transistors, effectively doubling transistor counts in the same area) is slowing down and Dennard scaling (as you reduce the size of transistors their power consumption goes down and speed goes up) has almost stopped. Both of these have caused major CPU chip manufacturers to focus on adding cores to boost performance rather than just adding more transistors to the same core to increase functionality.

This hasn’t stopped adding instruction functionality to each CPU, but it has slowed considerably. And single (core) processor speeds (GHz) have reached a plateau.

But what it has stopped is having the real estate available on a CPU chip to absorb lots of additional hardware functionality. Which had been the case since the 1980’s.

I was talking with a friend who used to work on math co-processors, like the 8087, 80287, & 80387, that performed floating point arithmetic. But after the 486, floating point logic was completely integrated into the CPU chip itself, killing off the co-processor business.

Hardware design is getting easier & chip fabrication is becoming a commodity

We wrote a post a couple of weeks back talking about an open foundry (see HW vs. SW innovation round 5 noted above) that would take a hardware design and manufacture the ASICs for you for free (or at little cost). This says that the tool chain to perform chip design is becoming more standardized and much less complex. Does this mean that it takes less than 18 months to create an ASIC? I don’t know but it seems so.

But the really interesting aspect of this is that world class foundries are now available outside the major CPU developers. And these foundries, for a fair but high price, would be glad to fabricate a thousand or a million chips for you.

Yes, your basic state of the art fab probably costs $12B plus these days. But all that has meant is that A) they will take any chip design and manufacture it, B) they need to keep the factory volume up by manufacturing chips in order to amortize the fab’s high price and C) they have to keep their technology competitive or chip manufacturing will go elsewhere.

So chip fabrication is not quite a commodity. But there are enough state of the art fabs in existence to make it seem so.

But it’s also physics

The extremely low latencies that are available with NVMe storage and higher speed networking (100GbE & above) are demanding a lot more processing power to keep up with. And just the physics of how long it takes to transfer data across a distance (aka racks) is starting to consume too much overhead and impact other work that could be done.

When we start measuring IO latencies in under 50 microseconds, there are just not a lot of CPU instructions and task switching that can go on anymore. Yes, you could devote a whole core or two to this process and keep up with it. But wouldn’t the data center be better served keeping that core busy with normal work and offloading that low-latency, realtime (like) work to a hardware accelerator that could be executing on the network rather than behind a NIC?
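To put rough numbers on that budget, here is a back-of-the-envelope Python sketch. The 3 GHz core clock and the 1500 byte packet size are my assumptions, picked only for illustration. The 50 microsecond latency and 100GbE line rate come from the discussion above.

```python
CLOCK_HZ = 3.0e9        # assumed core clock (hypothetical)
IO_LATENCY_S = 50e-6    # the 50 microsecond IO latency discussed above
LINE_RATE_BPS = 100e9   # 100GbE line rate
PACKET_BYTES = 1500     # assumed packet size (hypothetical)

# Clock cycles a single core has for the whole life of one 50 usec IO.
cycles_per_io = CLOCK_HZ * IO_LATENCY_S

# Time between back-to-back packets at line rate, and the cycles
# a single core would have to handle each one.
packet_time_s = PACKET_BYTES * 8 / LINE_RATE_BPS
cycles_per_packet = CLOCK_HZ * packet_time_s

print(f"cycles per 50 usec IO: {cycles_per_io:,.0f}")
print(f"packet time at 100GbE: {packet_time_s * 1e9:.0f} ns")
print(f"cycles per packet:     {cycles_per_packet:,.0f}")
```

Under these assumptions a core has about 150,000 cycles for an entire 50 microsecond IO, but only about 360 cycles per packet if it has to keep up with a saturated 100GbE link, which is not much room for interrupts and task switches.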

So real time processing has become more demanding, or rather the amount of time available to execute CPU instructions, switch tasks and process the data that must be handled in realtime to keep up with faster line speeds is becoming shorter.

So that explains DPUs, smart NICs, & SPUs. What about the other hardware accelerator cards?

  • AI-ML-DL is becoming such an important and data AND compute intensive workload that just like GPUs before them, TPUs & IPUs are becoming a necessary evil if we want to service those workloads effectively and expeditiously.
  • Computational storage is becoming more widespread because although data compression can be easily done at the CPU, it can be done faster (less data needs to be transferred back and forth) at the smart drive.

My guess is we haven’t seen the end of this at all. When you open up the possibility of having a long term business model focused on hardware accelerators, there would seem to be a lot of stuff that needs to be done and could be done faster and more effectively outside the core CPU.

There was a point over the last decade where software was destined to “eat the world”. I get a lot of flak for saying that was BS and that hardware innovation is really eating the world. Now that hardware innovation’s back, it seems to be a little of both.

Comments?

Scratch file use in HPC @ORNL, a statistical analysis

I attended SC17 (Supercomputing Conference) this past week and received a copy of the accompanying research proceedings. There are a number of interesting papers in the proceedings, and one I came across, Scientific User Behavior and Data Sharing Trends in a Peta Scale File System by Seung-Hwan Lim, et al. from Oak Ridge National Laboratory (ORNL), on the use of files at the Oak Ridge Leadership Computing Facility (OLCF), was very interesting.

The paper statistically describes the use of scratch files in a multi-PB file system (Lustre) at OLCF from January 2015 to August 2016. The OLCF supports over 32PB of storage and has a peak aggregate bandwidth of over 1TB/s. Spider II (the current Lustre file system) consists of 288 Lustre Object Storage Servers, all interconnected and connected to all the supercomputing clusters of servers via an InfiniBand network. Spider II supports all scratch storage requirements for active/queued jobs for the Titan (#4 in Top 500 [super computer clusters worldwide] list) and other clusters at ORNL.

ORNL uses an HPSS (High Performance Storage System) archive for permanent storage but uses the Spider II file system for all scratch files generated and used during supercomputing applications.  ORNL is expecting Spider III (2018-2023) to host 10 billion files.

Scratch files are purged from Spider II after 90 days of no access. The paper is based on metadata analysis captured during the scratch purging process for 500 days of access.
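For readers who want a feel for what a purge-by-last-access policy looks like, here is a minimal Python sketch that walks a directory tree and lists files not accessed within 90 days. This is not ORNL’s purge tooling, which the paper does not describe in code; the function name and parameters are hypothetical, and it assumes the file system records access times.

```python
import os
import time

PURGE_AFTER_DAYS = 90  # purge threshold quoted above


def purge_candidates(root, now=None, max_idle_days=PURGE_AFTER_DAYS):
    """Return sorted paths under root not accessed in max_idle_days."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # seconds in a day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # st_atime is the last access time reported by the file system.
            if os.stat(path).st_atime < cutoff:
                stale.append(path)
    return sorted(stale)
```

A real purge would also have to cope with billions of files, which is why metadata captured during the purge scan makes a convenient data source for the kind of analysis the paper performs.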

The paper displays a number of statistics and metrics on the use of Spider II:

  • Less than 3% of projects have a directory depth >15, the maximum directory depth was recorded at 432, with most projects having a shallow (<10) directory depth.
  • A project typically has 10X the files of an individual researcher: the median file count per researcher is 2,000 files, while the median project has 20,000 files.
  • Storage system performance is actively managed by many projects. For instance, 20 out of 35 science domains manually managed their Lustre cluster configuration to improve throughput.
  • File count continues to grow and reached a peak of 1B files during the time being analyzed.
  • On average only 3% of files were accessed read-only, 10% of files were updated (read-write) and 76% of files were untouched during a given week. However, median and maximum file ages were 138 and 214 days respectively, which means that these scratch files can continue to be accessed over the course of 200+ days.

There was more information in the paper, but one missing item, and a concern, is statistics on scratch file size distribution.

Nonetheless, it paints an interesting picture of scratch file use in HPC application/supercluster environments today.

Comments?

QoM1610: Will NVMe over Fabric GA in enterprise AFA by Oct’2017

NVMe over fabric (NVMeoF) was a hot topic at Flash Memory Summit last August. Facebook and others were showing off their JBOF (see my Facebook moving to JBOF post) but there were plenty of other NVMeoF offerings at the show.

NVMeoF hardware availability

When Brocade announced their Gen6 switches they made a point of saying that both their Gen5 and Gen6 switches currently support NVMeoF protocols. In addition to Brocade’s support, in Dec 2015 Qlogic announced support for NVMeoF for select HBAs. Also, in July 2016, Emulex announced support for NVMeoF in their HBAs.

From an Ethernet perspective, Qlogic has a NVMe Direct NIC which supports NVMe protocol offload for iSCSI. But even without NVMe Direct, Ethernet 40GbE & 100GbE with  iWARP, RoCEv1-v2, iSCSI SER, or iSCSI RDMA all could readily support NVMeoF on Ethernet. The nice thing about NVMeoF for Ethernet is not only do you get support for iSCSI & FCoE, but CIFS/SMB and NFS as well.

InfiniBand and Omni-Path Architecture already support native RDMA, so they should already support NVMeoF.

So hardware/firmware is already available for any enterprise AFA customer who wants NVMeoF for their data center storage.

NVMeoF Software

Intel claims that ~90% of the software driver functionality of NVMe is the same for NVMeoF. The primary differences between the two seem to be the NVMeoF discovery and queueing mechanisms.

There are two fabric methods that can be used to implement NVMeoF data and command transfers: capsule mode where NVMe commands and data are encapsulated in normal fabric packets or fabric dependent mode where drivers make use of native fabric memory transfer mechanisms (RDMA, …) to transfer commands and data.

A (Linux) host driver for NVMeoF is currently available from Seagate. As a result, support for NVMeoF for Linux is currently under development and not far from release in the next kernel (I think). (Mellanox has a tutorial on how to compile a Linux kernel with NVMeoF driver support.)

With Linux coming out, Microsoft Windows and VMware can’t be far behind. However, I could find nothing online, aside from base NVMe support, for either platform.

NVMeoF target support is another matter but with NICs/HBAs & switch hardware/firmware and drivers presently available, proprietary storage system target drivers are just a matter of time.

Boot support is a major concern. I could find no information on BIOS support for booting off of a NVMeoF AFA. Arguably, one may not need boot support for NVMeoF AFAs as they are probably not a viable target for storing App code or OS software.

From what I could tell, normal fabric multi-pathing support should work fine with NVMeoF. This should allow for HA NVMeoF storage, a critical requirement for enterprise AFA storage systems these days.

NVMeoF advantages/disadvantages

Chelsio and others have shown that NVMeoF adds ~8μsec of additional overhead beyond native NVMe SSDs, which if true would warrant implementation on all NVMe AFAs. This may or may not impact max IOPS depending on the scalability of NVMeoF.

For instance, servers (PCIe bus hardware) typically limit the number of private NVMe SSDs to 255 or less. With an NVMeoF, one could potentially have 1000s of shared NVMe SSDs accessible to a single server. With this scale, one could have a single server attached to a scale-out NVMeoF AFA (cluster) that could supply ~4X the IOPS that a single server could perform using private NVMe storage.

Base level NVMe SSD support and protocol stacks are starting to be available for most flash vendors and operating systems such as Linux, FreeBSD, VMware, Windows, and Solaris. If Intel’s claim of 90% common software between NVMe and NVMeoF drivers is true, then it should be a relatively easy development project to provide host NVMeoF drivers.

The need for special Ethernet hardware that supports RDMA may delay some storage vendors from implementing NVMeoF AFAs quickly. The lack of BIOS boot support may be a minor irritant in comparison.

NVMeoF forecast

AFA storage systems, as far as I can tell, are all about selling high IOPS and very-low latency IOs. It would seem that NVMeoF would offer early adopter AFA storage vendors a significant performance advantage over slower paced competition.

In previous QoM/QoW posts we have established that there are about 13 new enterprise storage systems that come out each year. Probably 80% of these will be AFA, given the current market environment.

Of the 10.4 AFA systems coming out over the next year, ~20% of these systems pride themselves on being the lowest latency solutions in the market, and thus command high margins. One would think these systems would be the first to adopt NVMeoF. But, most of these systems have their own, proprietary flash modules and do not use standard (NVMe) SSDs and can use their own proprietary interface to their proprietary flash storage. This will delay any implementation for them until they can convert their flash storage to NVMe which may take some time.

On the other hand, most (70%) of the other AFA systems, which currently use SAS/SATA SSDs, could boost their IOPS and drastically reduce their IO response times by implementing NVMe SSDs and NVMeoF. But converting SAS/SATA backends to NVMe will take time and effort.

But there are a select few (~10%) AFA systems that already use NVMe SSDs, and these few would seem to have a fast track towards implementing NVMeoF. The fact that NVMeoF is supported over all fabrics and all storage interface protocols makes it even easier.
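The market sizing behind this forecast is simple arithmetic, sketched below in Python using only the estimates quoted above (13 new systems a year, 80% AFA, and the 20%/70%/10% split). The segment labels and variable names are mine.

```python
NEW_SYSTEMS_PER_YEAR = 13   # new enterprise storage systems per year
AFA_SHARE = 0.80            # share expected to be all-flash arrays

# Share of new AFA systems by backend type, as estimated above.
segments = {
    "proprietary flash modules": 0.20,
    "SAS/SATA SSD backends": 0.70,
    "NVMe SSD backends": 0.10,
}

afa_systems = NEW_SYSTEMS_PER_YEAR * AFA_SHARE
counts = {name: afa_systems * share for name, share in segments.items()}

print(f"new AFA systems per year: {afa_systems:.1f}")
for name, count in counts.items():
    print(f"  {name}: {count:.2f}")
```

So the fast-track group amounts to roughly one new NVMe-based AFA system per year, which is why the forecast hinges on a single vendor shipping.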

Moreover, NVMeoF has been under discussion since the summer of 2015, which tells me that astute AFA vendors have already had 18+ months to develop it. With NVMeoF host drivers & hardware available since Dec. 2015, the hardware and software exist to test and validate against.

I believe that NVMeoF will be GA’d within the next 12 months by at least one enterprise AFA system. So my QoM1610 forecast for NVMeoF is YES, with a 0.83 probability.

Comments?

(QoM16-002): Will Intel Omni-Path GA in scale out enterprise storage by February 2016 – NO 0.91 probability

Question of the month (QoM) for February is: Will Intel Omni-Path (Architecture, OPA) GA in scale out enterprise storage by February 2016?

In this forecast, enterprise storage means the major and startup vendors supplying storage to data center customers.

What is OPA?

OPA is Intel’s replacement for InfiniBand and starts out at 100Gbps. It’s intended more for high performance computing (HPC), to be used as an inter-cluster server interconnect or next generation fabric. Intel says it “will maintain consistency and compatibility with existing Intel True Scale Fabric and InfiniBand APIs by working through the open source OpenFabrics Alliance (OFA) software stack on leading Linux* distribution releases”. Seems like Intel is making it as easy as possible for vendors to adopt the technology.

SMB2.2 (CIFS) screams over InfiniBand

Microsoft MVP Summit 2010 by David McCarter (cc) (From Flickr)

I missed the MVP summit last month in Redmond, but I heard there was some more discussion of the Server Message Block v2.2 (SMB2.2, also known previously as CIFS) coming in Windows Server (R) 8.

The big news is SMB2.2 now supports RDMA and can use InfiniBand (announced at SNIA Developer Conference last fall). It also supports RDMA over Ethernet via RoCE (see my Intel buys Qlogic’s Infiniband post) and iWARP.

SMB2.2 over InfiniBand performance

As reported last fall at the SNIA Developer Conference, SMB2.2 using RDMA over InfiniBand reached over 3.7GB/sec with no server configuration changes using two QDR cards, and 160K IOPs (the IOPs are from an SQLIO run using 8KB IOs, not SPECsfs2008). The pre-beta SMB2.2 code was running on commodity server hardware using 32Gbps InfiniBand links. I couldn’t find any performance numbers with RoCE or iWARP but I would suspect that running on 10GbE these would be much slower than InfiniBand.
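To see how these numbers relate to the available link bandwidth, here is a quick Python calculation. It assumes one 32Gbps link per QDR card and that “8KB” means 8 KiB; neither detail is stated in the conference report, so treat the results as rough.

```python
LINK_GBPS = 32                # usable data rate per InfiniBand link, as quoted
NUM_LINKS = 2                 # assumes one link per QDR card
MEASURED_GBYTES_PER_S = 3.7   # reported SMB2.2 throughput
IOPS = 160_000                # reported SQLIO IOPs
IO_BYTES = 8 * 1024           # assumes "8KB" means 8 KiB

# Convert aggregate link rate from gigabits to gigabytes per second.
raw_gbytes_per_s = LINK_GBPS * NUM_LINKS / 8
link_utilization = MEASURED_GBYTES_PER_S / raw_gbytes_per_s

# Data rate implied by the small-block IOPs result.
iops_gbytes_per_s = IOPS * IO_BYTES / 1e9

print(f"raw link bandwidth: {raw_gbytes_per_s:.1f} GB/s")
print(f"link utilization:   {link_utilization:.0%}")
print(f"IOPs data rate:     {iops_gbytes_per_s:.2f} GB/s")
```

Under these assumptions the 3.7GB/sec result uses a bit under half of the 8GB/sec the two links could carry, and the 160K IOPs run moves about 1.3GB/sec, so the small-block test is far from bandwidth bound.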

Hints are that performance gets even better with the released versions of the code coming out in Windows Server 8.

SMB2.2 gets even faster than NFS

We have noted in the past that SMB (CIFS) on average, shows better throughput (IOPS) performance than NFS in SPECsfs2008 results (for example, see our latest Chart-of-the-Month post on SPECsfs results). However, those results were all at best SMB2 or even SMB1 results, and commonly using Ethernet links.

NFS already supports InfiniBand but I am unsure whether it makes use of RDMA. Nevertheless, the significant speed up shown here for SMB2.2 will potentially take SPECsfs2008 SMB2.2 performance up to a whole new level.

Why InfiniBand?

As you may recall, InfiniBand is primarily deployed as a server to server interface and used extensively in the past for high performance computing environments. However nowadays, we find storage clusters, such as EMC Isilon, HP X9000 (Ibrix), IBM XIV and others using InfiniBand for their inter-node communications. The use of InfiniBand in these storage clusters is probably due primarily to its superior latency over Ethernet.

But InfiniBand has another advantage, fast data throughput: when using RDMA it can transfer data faster than almost any other networking protocol alive today. SMB2.2 takes advantage of this throughput boost by using RDMA only for large blocks of data and avoiding it for smaller blocks of data. Not sure what the cutoff is, but this would certainly help in large SQL database queries, disk copies, and any other large file data transfer operations.

Of course with 56Gbps FDR InfiniBand available today and faster transfer rates coming (see IBTA roadmap), there appears to be every reason to believe that superior throughput performance will continue at least for the foreseeable future. Better latency is also certain to be retained.

Now that Intel’s pushing it, Mellanox is continuing to push InfiniBand and storage clusters are using it more frequently, we may start to see more storage protocols supporting it.

We thought that FC only had Ethernet to worry about. With SMB2.2 moving to InfiniBand and NFS already supporting it, can a fully functional FCoIB be far behind?

Intel acquires InfiniBand fabric technology from Qlogic

[InfiniBand interconnected] Isilon Packaging by ChrisDag (cc) (from Flickr)

Intel announced today that they are going to acquire the InfiniBand (IB) fabric technology business from Qlogic.

From many analysts’ perspective, IB is one of the only technologies out there that can efficiently interconnect a cluster of commodity servers into a supercomputing system.

What’s InfiniBand?

Recall that IB is one of three reigning data center fabric technologies available today, the others being 10GbE and 16 Gb/s FC. IB is currently available in DDR, QDR and FDR modes of operation, that is 5Gb/s, 10Gb/s or 14Gb/s per single lane, respectively, according to the IB update (see IB trade association (IBTA) technology update). Systems can aggregate multiple IB lanes in units of 4 or 12 paths (see wikipedia IB article), such that an IB QDRx4 supports 40Gb/s and an IB FDRx4 currently supports 56Gb/s.
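The lane arithmetic is easy to tabulate. The short Python sketch below multiplies the per-lane rates quoted above by the two aggregation widths; the dictionary and function names are just for illustration.

```python
# Per-lane signaling rates in Gb/s, as quoted above.
LANE_GBPS = {"DDR": 5, "QDR": 10, "FDR": 14}
WIDTHS = (4, 12)  # lanes aggregated per link


def link_rate_gbps(mode, lanes):
    """Aggregate link rate for an IB mode at a given lane count."""
    return LANE_GBPS[mode] * lanes


for mode in LANE_GBPS:
    for lanes in WIDTHS:
        print(f"IB {mode}x{lanes}: {link_rate_gbps(mode, lanes)} Gb/s")
```

This reproduces the 40Gb/s QDRx4 and 56Gb/s FDRx4 figures, and shows that a 12-lane FDR link would come to 168Gb/s.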

The IBTA pitch cited above showed that IB is the most widely used interface for the top supercomputing systems and supports the most power efficient interconnect available (although how that’s calculated is not described).

Where else does IB make sense?

One thing IB has going for it is low latency through the use of RDMA or remote direct memory access.  That same report says that an SSD directly connected through a FC takes about ~45 μsec to do a read whereas the same SSD directly connected through IB using RDMA would only take ~26 μsec.

However, RDMA technology is now also coming out on 10GbE through RDMA over Converged Ethernet (RoCE, pronounced “rocky”). But IBTA claims that IB RDMA has a 0.6 μsec latency and that RoCE has 1.3 μsec. Although at these speeds 0.7 μsec doesn’t seem to be a big thing, it more than doubles the latency.

Nonetheless, Intel’s purchase is an interesting play. I know that Intel is focusing on supporting an ExaFLOP HPC computing environment by 2018 (see their release). But IB is already a pretty active technology in the HPC community and doesn’t seem to need their support.

In addition, IB has been gradually making inroads into enterprise data centers via storage products like the Oracle Exadata Storage Server using the 40 Gb/s IB QDRx4 interconnects. There are a number of other storage products out that use IB as well, from EMC Isilon, SGI, Voltaire, and others.

Of course where IB can mostly be found today is in computer to computer interconnects, and just about every server vendor out today, including Dell, HP, IBM, and Oracle, supports IB interconnects on at least some of their products.

Who’s left standing?

With Qlogic out I guess this leaves Cisco (de-emphasized lately), Flextronics, Mellanox, and Intel as the only companies that supply IB switches. Mellanox, Intel (from Qlogic) and Voltaire supply the HCA (host channel adapter) cards which provide the server interface to the switched IB network.

Probably a logical choice for Intel to go after some of this technology just to keep it moving forward and if they want to be seriously involved in the network business.

IB use in Big Data?

On the other hand, it’s possible that Hadoop and other big data applications could conceivably make use of IB speeds and as these are mainly vast clusters of commodity systems it would be a logical choice.

There is some interesting research on the advantages of IB in HDFS (Hadoop) system environments (see Can high performance interconnects boost Hadoop distributed file system performance) out of Ohio State University.  This research essentially says that Hadoop HDFS can perform much better when you combine IB with IPoIB (IP over IB, see OpenFabrics Alliance article) and SSDs.  But SSDs alone do not provide as much benefit.   (Although my reading of the performance charts seems to indicate it’s not that much better than 10GbE with TOE?).

It’s possible other Big data analytics engines are considering using IB as well.  It would seem to be a logical choice if you had even more control over the software stack.

~~~~

Comments?

IBM’s 120PB storage system

Susitna Glacier, Alaska by NASA Goddard Photo and Video (cc) (from Flickr)

Talk about big data, Technology Review reported this week that IBM is building a 120PB storage system for some unnamed customer.  Details are sketchy and I cannot seem to find any announcement of this on IBM.com.

Hardware

It appears that the system uses 200K disk drives to support the 120PB of storage.  The disk drives are packed in a new wider rack and are water cooled.  According to the news report the new wider drive trays hold more drives than current drive trays available on the market.

For instance, HP has a hot pluggable, 100 SFF (small form factor 2.5″) disk enclosure that sits in 3U of standard rack space.  200K SFF disks would take up about 154 full racks, not counting the interconnect switching that would be required.  Unclear whether water cooling would increase the density much but I suppose a wider tray with special cooling might get you more drives per floor tile.
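The rack estimate can be reproduced with a few lines of Python. The sketch assumes a 40U rack (the rack height used later in this post) that only takes whole 3U enclosures; the per-drive capacity line is my own addition, derived from the 120PB and 200K drive figures.

```python
import math

DRIVES = 200_000
CAPACITY_PB = 120
DRIVES_PER_ENCLOSURE = 100   # HP SFF enclosure cited above
ENCLOSURE_U = 3
RACK_U = 40                  # assumed usable rack height

enclosures = math.ceil(DRIVES / DRIVES_PER_ENCLOSURE)
enclosures_per_rack = RACK_U // ENCLOSURE_U   # whole enclosures only
racks = math.ceil(enclosures / enclosures_per_rack)

# Average raw capacity each drive must supply (1 PB = 1000 TB).
tb_per_drive = CAPACITY_PB * 1000 / DRIVES

print(f"{enclosures} enclosures, {enclosures_per_rack} per rack, {racks} racks")
print(f"{tb_per_drive:.1f} TB per drive")
```

With 13 enclosures per rack this gives the 154 racks quoted above, and it implies drives of about 600GB each.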

There was no mention of interconnect, but today’s drives use either SAS or SATA. SAS interconnects for 200K drives would require many separate SAS busses. With a SAS expander addressing 255 drives or other expanders, one would need at least 4 SAS busses, but this would have ~64K drives per bus and would not perform well. Something more like 64-128 drives per bus would have much better performance. Each drive would also need dual pathing, so if we use 100 drives per SAS string, that’s 2000 SAS drive strings or at least 4000 SAS busses (dual port access to the drives).
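Here is the SAS bus arithmetic as a small Python sketch. Reading “~64K drives per bus” as two levels of 255-way expanders (255 x 255) is my interpretation of the numbers above, so take that constant as an assumption.

```python
import math

DRIVES = 200_000
EXPANDER_FANOUT = 255        # devices addressable per SAS expander
DRIVES_PER_STRING = 100      # the more practical string size used above
PATHS_PER_DRIVE = 2          # dual pathing

# Assumed: two levels of expanders, giving roughly 64K drives per bus.
max_drives_per_bus = EXPANDER_FANOUT ** 2
min_busses = math.ceil(DRIVES / max_drives_per_bus)

strings = DRIVES // DRIVES_PER_STRING
dual_path_busses = strings * PATHS_PER_DRIVE

print(f"max drives per bus: {max_drives_per_bus}")
print(f"minimum busses:     {min_busses}")
print(f"100-drive strings:  {strings}")
print(f"dual-path busses:   {dual_path_busses}")
```

That is 4 busses at the theoretical addressing limit versus 4000 at a string size that would actually perform, a thousandfold gap.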

The report mentioned GPFS as the underlying software which supports three cluster types today:

  • Shared storage cluster – where GPFS front end nodes access shared storage across the backend. This is generally SAN storage system(s).  But given the requirements for high density, it doesn’t seem likely that the 120PB storage system uses SAN storage in the backend.
  • Network based cluster – here the GPFS front end nodes talk over a LAN to a cluster of NSD (network shared disk) servers which can have access to all or some of the storage. My guess is this is what will be used in the 120PB storage system.
  • Shared Network based clusters – this looks just like a bunch of NSD servers but provides access across multiple NSD clusters.

Given the above, ~100 drives per NSD server means another 1U extra per 100 drives or (given HP drive density) 4U per 100 drives, which works out to 1000 drives and 10 IO servers per 40U rack (not counting switching).  At this density it takes ~200 racks to hold the 120PB of raw storage along with its 2000 NSD nodes.

Unclear how many GPFS front end nodes would be needed on top of this but even if it were 1 GPFS frontend node for every 5 NSD nodes, we are talking another 400 GPFS frontend nodes and at 1U per server, another 10 racks or so (not counting switching).
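The node and rack counts from the last two paragraphs can be sketched as follows. All the defaults come from the estimates above (100 drives per NSD server, 4U per drive-enclosure-plus-server unit, 40U racks, 1 front end node per 5 NSD nodes); they are guesses, not published figures.

```python
import math


def rack_layout(drives=200_000, drives_per_nsd=100, u_per_nsd_unit=4,
                rack_u=40, nsd_per_frontend=5, frontend_u=1):
    """Estimate node and rack counts, not counting switching."""
    nsd_nodes = math.ceil(drives / drives_per_nsd)
    units_per_rack = rack_u // u_per_nsd_unit  # 3U enclosure + 1U server
    storage_racks = math.ceil(nsd_nodes / units_per_rack)
    frontend_nodes = math.ceil(nsd_nodes / nsd_per_frontend)
    frontend_racks = math.ceil(frontend_nodes * frontend_u / rack_u)
    return {
        "nsd_nodes": nsd_nodes,
        "storage_racks": storage_racks,
        "frontend_nodes": frontend_nodes,
        "frontend_racks": frontend_racks,
        "total_racks": storage_racks + frontend_racks,
    }


print(rack_layout())
```

With those defaults this gives 2000 NSD nodes in 200 racks plus 400 front end nodes in 10 racks, or 210 racks before switching.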

If my calculations are correct we are talking over 210 racks with switching thrown in to support the storage.  According to IBM’s discussion on the Storage challenges for petascale systems, it probably provides ~6TB/sec of data transfer which should be easy with 200K disks but may require even more SAS busses (maybe ~10K vs. the 2K discussed above).
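To see why ~6TB/sec looks easy for the drives but harder for the busses, here is a minimal sketch of the per-drive and per-bus load. It assumes decimal units and a perfectly even spread of traffic, both of which are simplifications on my part.

```python
TB = 10**12
MB = 10**6


def load_per_unit_mb_s(total_tb_s, units):
    """Average MB/sec each unit must carry if traffic is spread evenly."""
    return total_tb_s * TB / units / MB


print(load_per_unit_mb_s(6, 200_000))  # per drive: 30.0
print(load_per_unit_mb_s(6, 2_000))    # per bus with 2K busses: 3000.0
print(load_per_unit_mb_s(6, 10_000))   # per bus with 10K busses: 600.0
```

About 30MB/sec per drive is modest, but 3GB/sec per bus is a lot to ask, which is why something closer to 10K busses seems more plausible.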

Software

IBM GPFS is used behind the scenes in IBM’s commercial SONAS storage system but has been around as a cluster file system designed for HPC environments for over 15 years now.

Given this many disk drives, something needs to be done about protecting against drive failure.  IBM has been talking about declustered RAID algorithms for their next generation HPC storage system, which spread the parity across more disks and as such, speed up rebuild time at the cost of reducing effective capacity. There was no mention of effective capacity in the report but this would be a reasonable tradeoff.  A 200K drive storage system should have a drive failure every 10 hours, on average (assuming a 2 million hour MTBF).  Let’s hope they get drive rebuild time down much below that.
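The failure interval is just the per-drive MTBF divided by the drive count. A minimal sketch, assuming independent failures at a constant rate (a common simplification that real drive populations don't strictly follow):

```python
def hours_between_failures(drive_mtbf_hours, drives):
    """Expected hours between drive failures across the whole population."""
    return drive_mtbf_hours / drives


interval = hours_between_failures(2_000_000, 200_000)
print(interval)             # 10.0 hours between failures
print(24 * 365 / interval)  # 876.0 failures per year
```

That is roughly 876 drive replacements a year, so rebuilds would be a near-constant background activity.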

The system is expected to hold around a trillion files.  Not sure but even at 1024 bytes of metadata per file, this number of files would chew up ~1PB of metadata storage space.
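Here is the metadata estimate as a quick sketch. The 1024 bytes per file is the guess used above, not a GPFS figure, and the result is shown in both decimal and binary petabytes since it isn't clear which the 120PB number uses.

```python
def metadata_bytes(files, bytes_per_file=1024):
    """Total metadata footprint, assuming a fixed size per file."""
    return files * bytes_per_file


total = metadata_bytes(10**12)  # one trillion files
print(total / 10**15)           # decimal PB: 1.024
print(total / 2**50)            # binary PiB: ~0.91
```

Either way, metadata alone is on the order of 1% of the 120PB of raw capacity.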

GPFS provides ILM (information life cycle management, or data placement based on information attributes) using automated policies and supports external storage pools outside the GPFS cluster storage.  ILM within the GPFS cluster supports file placement across different tiers of storage.

All the discussion up to now revolved around homogeneous backend storage but it’s quite possible that multiple storage tiers could also be used.  For example, a high density but slower storage tier could be combined with a low density but faster storage tier to provide a more cost effective storage system.  Although, it’s unclear whether the application (real world modeling) could readily utilize this sort of storage architecture or whether they would care about system cost.

Nonetheless, presumably an external storage pool would be a useful adjunct to any 120PB storage system for HPC applications.

Can it be done?

Let’s see, 400 GPFS nodes, 2000 NSD nodes, and 200K drives. Seems like the hardware would be readily doable (not sure why they needed water cooling but hopefully they obtained better drive density that way).

Luckily GPFS supports InfiniBand, which can support 10,000 nodes within a single subnet.  Thus an InfiniBand interconnect between the GPFS and NSD nodes could easily support a 2400 node cluster.

The only real question is whether a GPFS software system can handle 2000 NSD nodes and 400 GPFS nodes with trillions of files over 120PB of raw storage.

As a comparison here are some recent examples of scale out NAS systems:

It would seem that a 20X multiplier times a current Isilon cluster or even a 10X multiple of a currently supported SONAS system would take some software effort to work together, but seems entirely within reason.

On the other hand, Yahoo supports a 4000-node Hadoop cluster and it seems to work just fine.  So from a feasibility perspective, a 2500 node GPFS-NSD system would seem just a walk in the park for Hadoop.

Of course, IBM Almaden is working on a project to support Hadoop over GPFS which might not be optimum for real world modeling but would nonetheless support the node count being talked about here.

——

I wish there were some real technical information on the project out on the web but I could not find any. Much of this is informed conjecture based on current GPFS system and storage hardware capabilities. But hopefully, I haven’t traveled too far astray.

Comments?