Scratch file use in HPC @ORNL, a statistical analysis

I attended SC17 (the Supercomputing Conference) this past week and received a copy of the accompanying research proceedings. There are a number of interesting papers in the proceedings, and I came across one, Scientific User Behavior and Data Sharing Trends in a Peta Scale File System by Seung-Hwan Lim, et al. from Oak Ridge National Laboratory (ORNL), on the use of files at the Oak Ridge Leadership Computing Facility (OLCF), which was very interesting.

The paper statistically describes the use of scratch files in a multi-PB file system (Lustre) at OLCF from January 2015 to August 2016. The OLCF supports over 32PB of storage with a peak aggregate bandwidth of over 1TB/s, and Spider II (the current Lustre file system) consists of 288 Lustre Object Storage Servers, all interconnected and connected to all of the supercomputing cluster servers via an InfiniBand network. Spider II supports all scratch storage requirements for active/queued jobs for Titan (#4 on the Top 500 list of supercomputer clusters worldwide) and other clusters at ORNL.

ORNL uses an HPSS (High Performance Storage System) archive for permanent storage but uses the Spider II file system for all scratch files generated and used by supercomputing applications. ORNL is expecting Spider III (2018-2023) to host 10 billion files.

Scratch files are purged from Spider II after 90 days of no access. The paper is based on metadata analysis captured during the scratch purging process over 500 days of activity.
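
As an illustration of how such a purge policy might work, here's a minimal sketch that flags files whose last access time is older than 90 days. This is not ORNL's actual purge tooling, and the path is purely illustrative.

```python
# Hedged sketch of a 90-day scratch purge scan -- not ORNL's actual tooling.
# Walks a directory tree and yields files whose last access time (atime) is
# older than the purge threshold.
import os
import time

PURGE_AGE_DAYS = 90

def purge_candidates(root, now=None):
    """Yield file paths under `root` not accessed in the last 90 days."""
    now = now or time.time()
    cutoff = now - PURGE_AGE_DAYS * 24 * 3600
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_atime < cutoff:
                    yield path
            except FileNotFoundError:
                continue  # file removed while we were scanning

# Example (illustrative path):
# for path in purge_candidates("/path/to/scratch"):
#     print(path)
```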

The paper displays a number of statistics and metrics on the use of Spider II:

  • Less than 3% of projects have a directory depth >15; the maximum directory depth recorded was 432, with most projects having a shallow (<10) directory depth.
  • A project typically has 10X the files that an individual researcher has: the median file count per researcher is 2,000 files, while the median project has 20,000 files.
  • Storage system performance is actively managed by many projects. For instance, 20 out of 35 science domains manually managed their Lustre cluster configuration to improve throughput.
  • File count continues to grow and reached a peak of 1B files during the period analyzed.
  • On average, only 3% of files were accessed read-only, 10% of files were updated (read-write) and 76% of files were untouched during any given week. However, the median and maximum file ages were 138 and 214 days respectively, which means that these scratch files can continue to be accessed over the course of 200+ days.

There was more information in the paper, but one missing item, statistics on scratch file size distribution, is a concern.

Nonetheless, it paints an interesting picture of scratch file use in HPC application/supercluster environments today.

Comments?

There’s a new cluster filesystem on the block, Elastifile

At SFD12 last month we talked with the team from Elastifile. They are a new startup out of Israel working on a better cluster file system.

Elastifile was designed to support 1000s of nodes, 100,000s of users/clients and 1000s of data containers (file systems/mount points), together with an effectively unlimited (64-bit) number of files and directories and up to exabytes (10**18 bytes) in capacity. They also offer a 100% SSD file store capability. I encourage you to view the videos of their presentations at SFD12 to learn more.

Elastifile features

Elastifile supports data compression and optionally deduplication, with NAND/flash (e.g., low-/high-endurance) storage tiering, cloud storage tiering and multi-site storage. They also provide NFSv3/v4, SMB, AWS S3 and HDFS as native access protocols for their file storage.

They also offer non-disruptive hardware/software upgrades, n-way (2- or 3-way) data and metadata redundancy, self-healing capabilities, snapshots, and synchronous/asynchronous data replication or mirroring. Further, they provide multi-tenancy and QoS support.

Elastifile can be used in hyperconverged mode as well as in a dedicated storage server mode. For backend storage, they support heterogeneous, physical (block, I think?) storage systems as well as direct access storage in cluster nodes.

Internals matter

Elastifile’s architecture supports accessor, owner and data nodes. But these can all be colocated on the same server or segregated across different servers.

Owner nodes own all the metadata objects for a file or directory and cache the metadata working set in their memory. Ownership of file or directory metadata may change in the case of hardware failures.

Elastifile supports a dynamic write data path, which means they determine, in real time, where to write file data rather than having the data locations identified beforehand. They call this distributed write anywhere semantics.

Notably, they don't do data caching (with NVMe it doesn't make sense); however, as noted above, they do use metadata caching.

Internally, Elastifile uses variable length objects for both file data and metadata.

  • File data is composed of three object types: a file metadata (FileMD) object, mapping data objects, and file data objects. FileMDs hold the normal file metadata (name, file size, create, access & modify ToDs, etc.) as well as pointers to all of the file's mapping object IDs (OIDs). Mapping objects exist for each 0.5MB of file data and consist of a 128-element table, each element mapping 4KB of file address space to a data object (OID); see the sketch after this list. Each data object holds the 4KB of compressed file data and journal log entries.
  • Directory metadata is composed of a directory metadata (DirMD) object and directory listing objects. Directory listing objects map file/directory names to FileMD or DirMD OIDs and are accessed via an extensible hash table; each contains the list of file and directory names within the directory.
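
To make the mapping concrete, here's a small sketch of how a byte offset could be resolved to a mapping object and table slot under the scheme described above (0.5MB mapping objects, each a 128-entry table of 4KB extents). The constant and function names are mine, not Elastifile's.

```python
# Illustrative sketch of Elastifile-style file address mapping, per the
# description above; names and API are hypothetical, not Elastifile's.

BLOCK_SIZE = 4 * 1024          # each table entry maps 4KB of file address space
ENTRIES_PER_MAPPING = 128      # 128 entries x 4KB = 0.5MB per mapping object

def locate(offset: int) -> tuple[int, int]:
    """Return (mapping_object_index, table_slot) for a byte offset in a file."""
    block = offset // BLOCK_SIZE                 # which 4KB block of the file
    return block // ENTRIES_PER_MAPPING, block % ENTRIES_PER_MAPPING

# Byte offset 1,000,000 lands in the 2nd mapping object (index 1), slot 116
print(locate(1_000_000))   # -> (1, 116)
```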

The Elastifile software architecture consists of three layers:

  • A protocol layer which terminates file system access protocols and translates requests into internal requests. The hashing and data compression of file data occur at this level.
  • A metadata layer which provides file system/directory name mapping to objects for owned files/directories and maintains file/directory metadata updates/journals/checkpoints.
  • A data layer which provides transaction consistency and n-way redundant persistent data storage for (file or metadata) objects.

Metadata operations are persisted via journaled transactions, which are distributed across the cluster. For instance, the journal entries for a mapping object update are written to the same data object (OID) as the actual file data, i.e., the 4KB compressed data object.

There's plenty of discussion on how they manage consistency for their metadata across cluster nodes. Elastifile invented and uses Bizur, a key-value, consensus-based DB. Their chief architect, Ezra Hoch (@EzraHoch), has a blog post and a paper on Bizur with more information.
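
For a flavor of what a key-value consensus store buys you, here's a toy sketch of per-key majority-quorum reads and writes. This is not Bizur's actual protocol (Bizur adds per-bucket leader election and ballot numbers); it only illustrates the general idea of agreeing on each key independently rather than through a single replicated log.

```python
# Toy per-key quorum store -- illustrative only, not Bizur.

class Replica:
    def __init__(self):
        self.store = {}                      # key -> (version, value)

    def write(self, key, version, value):
        cur_version, _ = self.store.get(key, (0, None))
        if version > cur_version:
            self.store[key] = (version, value)
        return True                          # ack

    def read(self, key):
        return self.store.get(key, (0, None))

class QuorumKV:
    def __init__(self, replicas):
        self.replicas = replicas
        self.majority = len(replicas) // 2 + 1

    def put(self, key, value):
        # learn the highest version from a majority, then write version+1
        versions = [r.read(key)[0] for r in self.replicas[:self.majority]]
        new_version = max(versions) + 1
        acks = sum(r.write(key, new_version, value) for r in self.replicas)
        return acks >= self.majority

    def get(self, key):
        # return the highest-versioned value seen by a majority of replicas
        return max(r.read(key) for r in self.replicas[:self.majority])[1]

kv = QuorumKV([Replica() for _ in range(3)])
kv.put("/projects/dirMD", "oid-1234")
print(kv.get("/projects/dirMD"))   # -> 'oid-1234'
```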

~~~~

New file systems generally take many years to mature and get out into the market, cluster file systems even longer. Elastifile, started in 2013 by some very smart engineers, is already on the market, just 4 years later. That's impressive enough, but with their list of advanced functionality plus cloud storage tiering and multi-site operations all shipping in the current product, it's mind-blowing.

One lingering question is, does a market exist for another cluster file system? All flash is interesting, but most current CFSs support this and ship it today. Cloud storage tiering is interesting and a long-term need, but some CFSs already have this and others are no doubt implementing it as we speak. A CFS's use of objects for internal data and metadata management is not new; it may make the internals cleaner but doesn't really provide a lot of customer benefit.

Exascale raw capacity, support for 100K users, 1000s of nodes, 1000s of file systems and an effectively unlimited number of files/directories is interesting. But most CFSs claim this level of support already, although it is more aspirational for some. And proving support at this scale is difficult, if not impossible.

On the other hand, Bizur is really neat. Its primary benefit is during recovery from hardware failures. For a CFS with 1000s of nodes, failures likely occur quite often. So Bizur’s advantage here may pay significant customer dividends.

Is that enough to take a new CFS to market?

To see what other SFD12 bloggers have written on Elastifile, please see:

Exablox, bring your own disk storage

We talked with Exablox a month or so ago at Storage Field Day 10 (SFD10), where they discussed their unique storage solution and new software functionality. If you're not familiar with Exablox, they sell a OneBlox appliance with drive slots, but no data drives.

The OneBlox appliance provides Linux-based, scale-out, distributed object storage software with a file system in front of it. They support SMB and NFS access protocols and have inline deduplication, data compression and continuous snapshot capabilities. You supply the (SATA or SAS) drives, making it a bring-your-own-drive (BYOD) storage offering.

Their OneSystem management solution is available on a subscription basis; it usually runs in the cloud as a web-accessed service used to monitor and manage your Exablox cluster(s). However, for those customers that want it, OneSystem is also available as a Docker container, which you can run on any Docker-compatible system.

Faster Docker initialization through Slacker snapshots & NFS storage

I just got back from EMCWorld2016 this week, but on the way there and back I was perusing the FAST'16 papers. One of the papers I read (see Slacker: Fast Distribution with Lazy Docker Containers, p. 181) discussed performance problems with initializing Docker container micro-services and how they could be solved using persistent, intelligent NFS storage.

It appears that Docker container initialization spends a lot of time provisioning and initializing a local file system for each container. Docker containers typically use an AUFS (Another Union File System) storage driver, which relies on another file system (like ext4) as its underlying storage, which in turn sits on either DAS or external storage.

When using persistent and intelligent NFS storage, Docker can take advantage of storage system snapshots and cloning to improve container initialization significantly. In the paper, the researchers used Tintri as the underlying persistent, enterprise-class NFS storage, but I believe the functionality being exploited is available with most enterprise-class NAS systems and, as such, is readily available with other storage subsystems.
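
Here's a toy sketch of why clone-based provisioning helps: an eager approach copies every block of the image before the container can start, while a copy-on-write clone is a metadata-only operation that reads blocks lazily from a shared golden image. This is not Slacker's code nor any particular array's API; the classes are purely illustrative.

```python
# Toy illustration of eager-copy vs. copy-on-write clone provisioning.
# Not Slacker's implementation or any storage system's actual API.

class GoldenImage:
    """A shared, read-only container image kept on the NFS server."""
    def __init__(self, blocks):
        self.blocks = blocks

class EagerCopy:
    """Copies every block up front -- startup cost grows with image size."""
    def __init__(self, image):
        self.blocks = [bytes(b) for b in image.blocks]    # full data copy

class CowClone:
    """Clone is metadata-only; blocks come from the base until overwritten."""
    def __init__(self, image):
        self.base = image
        self.overlay = {}                  # holds only this container's writes
    def read(self, i):
        return self.overlay.get(i, self.base.blocks[i])   # lazy, shared reads
    def write(self, i, data):
        self.overlay[i] = data             # copy-on-write

golden = GoldenImage([b"\x00" * 4096] * 25_600)   # ~100MB image
container_fs = CowClone(golden)                   # near-instant "provisioning"
```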

(QoM16-002): Will Intel Omni-Path GA in scale out enterprise storage by February 2016 – NO 0.91 probability

The question of the month (QoM) for February is: Will Intel Omni-Path (Architecture, OPA) GA in scale-out enterprise storage by February 2016?

In this forecast, enterprise storage means the major and startup vendors supplying storage to data center customers.

What is OPA?

OPA is Intel’s replacement for InfiniBand and starts out at 100Gbps. It’s intended more for high performance computing (HPC), to be used as an inter-cluster server interconnect or next generation fabric. Intel says it “will maintain consistency and compatibility with existing Intel True Scale Fabric and InfiniBand APIs by working through the open source OpenFabrics Alliance (OFA) software stack on leading Linux* distribution releases”. Seems like Intel is making it as easy as possible for vendors to adopt the technology.

SCI SPECsfs2008 NFS throughput per node – Chart of the month

[Chart: SPECsfs2008 maximum NFS throughput operations per second per node]
As SPECsfs2014 still only has (SPECsfs-sourced) reference benchmarks, we have been showing some of our seldom-seen SPECsfs2008 charts in our quarterly SPECsfs performance reviews. The above chart was sent out in last month's Storage Intelligence Newsletter and shows NFS throughput operations per second per node.

In the chart, we only include NFS SPECsfs2008 benchmark results from configurations with more than 2 nodes, and we divided the maximum NFS throughput operations per second achieved by the node count to compute NFS ops/sec/node.
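
The normalization itself is simple; here's a minimal sketch using made-up submission numbers rather than actual SPECsfs2008 results.

```python
# Sketch of the ops/sec/node normalization -- the figures below are invented
# examples, not actual SPECsfs2008 submissions.
results = [
    {"system": "Example NAS A", "max_nfs_ops": 1_216_000, "nodes": 8},
    {"system": "Example NAS B", "max_nfs_ops": 3_072_000, "nodes": 24},
    {"system": "Example NAS C", "max_nfs_ops": 620_000, "nodes": 5},
]

for r in sorted(results, key=lambda r: r["max_nfs_ops"] / r["nodes"], reverse=True):
    print(f"{r['system']}: {r['max_nfs_ops'] / r['nodes']:,.0f} NFS ops/sec/node")
```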

HDS VSP G1000 with 8 4100 file modules (nodes) and HDS HUS (VM) with 4 4100 file modules (nodes) came in at #1 and #2 respectively for ops/sec/node, each attaining ~152K NFS throughput operations/sec. per node. The #3 competitor was Huawei OceanStor N8500 Cluster NAS with 24 nodes, which achieved ~128K NFS throughput operations/sec./node. In 4th and 5th place were EMC VNX VG8/VNX5700 with 5 X-blades and Dell Compellent FS8600 with 4 appliances, each of which reached ~124K NFS throughput operations/sec. per node. It falls off significantly from there, with two groups at ~83K and ~65K NFS ops/sec./node.

Although not shown above, it’s interesting that there are many well known scale-out NAS solutions in SPECsfs2008 results with over 50 nodes that do much worse than the top 10 above, at <10K NFS throughput ops/sec/node. Fortunately, most scale-out NAS nodes cost quite a bit less than the above.

But for my money, one can be well served with a more sophisticated, enterprise-class NAS system which can do >10X the NFS throughput operations per second per node of a scale-out system. That is, if you don't have to deploy 10PB or more of NAS storage.

More information on SPECsfs2008/SPECsfs2014 performance results as well as our NFS and CIFS/SMB ChampionsCharts™ for file storage systems can be found in our just updated NAS Buying Guide available for purchase on our web site.

Comments?

~~~~

The complete SPECsfs2008 performance report went out in SCI’s September newsletter.  A copy of the report will be posted on our dispatches page sometime this quarter (if all goes well).  However, you can get the latest storage performance analysis now and subscribe to future free monthly newsletters by just using the signup form above right.

As always, we welcome any suggestions or comments on how to improve our SPECsfs  performance reports or any of our other storage performance analyses.

 

When 64 nodes are not enough

Why would VMware, with years of ESX development behind them, want to develop a whole new virtualization system for Docker and other container frameworks? Especially since they already have compatible Docker support in their current product line.

The main reason I can think of is that a 64-node cluster may be limiting to some container services, and VMware ESX/vSphere supporting 1000s of nodes in a single cluster seems pretty unlikely. So given that more and more cloud services are being deployed across 1000s of nodes using container frameworks, VMware had to do something or say goodbye to a potentially lucrative use case for virtualization.

Yes, over time VMware may indeed extend vSphere clusters to 128 or even 256 nodes, but by then the world will have moved beyond VMware for these services, and where will VMware be then? Left behind.

Photon to the rescue

With the new Photon system, VMware has an answer for anyone that needs 1,000 to 10,000 server cluster environments. Now these customers can easily deploy their services on the VMware Photon Platform, which was developed off of ESX but doesn't have ESX's cluster limitations.

Thus, the need for Photon is now. Customers can easily deploy container frameworks that span 1000s of nodes. Of course it won't be as easy to manage as a 64-node vSphere cluster, but it will be more easily automated, easier to deploy and easier to scale when necessary, especially beyond 64 nodes.

The claim is that the new Photon will be able to support multiple container frameworks without modification.

So what's stopping you from taking on the Amazons, Googles, and Apples of the world's data centers?

  • Maybe storage, but then there's ScaleIO and the other software-defined storage solutions that are there to support local DAS clusters of almost incredible size.
  • Maybe networking. I am not sure just where NSX is in the scheme of things; maybe it's capable of handling 1000s of nodes and maybe not, but networking could be a clear limitation on how many nodes can be deployed in this sort of environment.

Where does this leave vSphere? Probably a continuation of the current trajectory: making it easier and more efficient to run VMware clusters and, over time, extending any current limitations. So for the moment there are two development streams based off of ESX, each being enhanced for its own market.

How much of ESX survived is an open question, but it's likely that Photon will never see the familiar VMware services and operations that are readily available to vSphere clusters.

Comments?

Photo Credit(s): A first look into Dockerfile system

VMware VSAN 6.0 all-flash & hybrid IO performance at SFD7

We visited with VMware’s VSAN team during last Storage Field Day (SFD7, session available here). The presentation was wide ranging but the last two segments dealt with recent changes to VSAN and at the end provided some performance results for both a hybrid VSAN and an all-Flash VSAN.

Some new features in VSAN 6.0 include:

  • More scalability, up to 64 hosts in a cluster and up to 200 VMs per host
  • New higher performance snapshots & clones
  • Rack awareness for better availability
  • Hardware based checksum for T10 DIF (data integrity feature)
  • Support for blade servers with JBODs
  • All-flash configurations
  • Higher IO performance

Even in the all-flash configuration there are two tiers of storage: a write cache tier and a capacity tier of SSDs. These are configured with two different classes of SSDs (high-endurance/low-capacity and low-endurance/high-capacity).

At the end of the session Christos Karamanolis (@Xtosk), Principal Architect for VSAN showed us some performance charts on VSAN 6.0 hybrid and all-flash configurations.

Hybrid VSAN performance

On the chart we see two plots showing IOmeter performance as VSAN scales across multiple nodes (hosts): on the left we have a 100% read workload and on the right a 70% read:30% write workload.

The hybrid VSAN configuration has 4 10Krpm disks and one 400GB SSD on each host and ranges from 8 to 64 hosts. The bars on the chart show IOmeter IOPS and the line shows the average response time (or IO latency) for each VSAN host configuration. I am not a big fan of IOmeter, as it's overly simplified, but that's what VMware used.

The results show that in the 100% read case the hybrid, 64-host VSAN 6.0 cluster was able to sustain ~3.8M IOPS, or over 60K IOPS per host. For the mixed 70:30 R:W workload, VSAN 6.0 was able to sustain ~992K IOPS, or ~15.5K IOPS per host.

We see a pretty drastic IOPS degradation (~1/4 the 100% read performance) in the plot on the right, when they added write activity to the mix. But with VSAN's mirrored data protection, each VM write represents at least two VSAN backend writes; at a 70:30 IOmeter R:W mix this would be ~694K read and ~298K write frontend IOPS, and with mirroring the writes represent ~595K writes to backend storage.

Then of course, there's destage activity (data written to SSD needs to be read off SSD and written to HDD), which also multiplies internal IO operations for every external write IOP. Let's say all that activity multiplies each external write by 6 (3 for each mirror: 1 to the write cache SSD, 1 to read it back and 1 to write to HDD). Multiplying that by the ~298K external write IOPS adds up to a total of ~1.8M write-derived IOPS and ~0.7M read-derived IOPS, or ~2.5M IOPS in all, but this is still far away from the ~3.8M IOPS for 100% read activity. I am probably missing another IO or two in the write path (maybe virtual-to-physical mapping structures need to be updated) or have failed to account for more inter-cluster IO activity in support of the writes.
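
Here's the same back-of-the-envelope arithmetic in code form. The multipliers are the assumptions made above (2-way mirroring, ~3 internal IOs per mirror copy to cover destage), not VMware-published internals.

```python
# Back-of-the-envelope VSAN hybrid write-amplification estimate, using the
# assumptions from the text above (not VMware-published figures).

external_iops = 992_000
reads  = external_iops * 0.70        # ~694K frontend read IOPS
writes = external_iops * 0.30        # ~298K frontend write IOPS

mirror_copies = 2                    # each VM write lands on two hosts
ios_per_copy  = 3                    # SSD write + SSD read-back + HDD write (destage)

backend_write_ios = writes * mirror_copies * ios_per_copy   # ~1.8M
backend_read_ios  = reads                                   # ~0.7M
total_backend_ios = backend_write_ios + backend_read_ios    # ~2.5M

print(f"{backend_write_ios / 1e6:.1f}M write-derived + "
      f"{backend_read_ios / 1e6:.1f}M read-derived = "
      f"{total_backend_ios / 1e6:.1f}M backend IOPS")
```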

In addition, we see the IO latency was roughly flat across the 100% read workload at ~2.25msec. and got slightly worse over the 70:30 R:W workload, ranging from ~2.5msec. at 4 hosts to a little over 3.0msec. with 64 hosts. Not sure why this got worse, but as hosts are scaled up it could induce more inter-cluster overhead.

[Chart: VSAN 6.0 hybrid performance with one vs. two disk groups per host]

In the chart to the right, we can see similar performance data for systems with one or two disk groups. The message here is that with two disk groups on a host (2X the disk and SSD resources per host), one can potentially double the performance of the system, to 116K IOPS/host on 100% read and 31K IOPS/host on a 70:30 R:W workload.

All-flash VSAN performance

[Chart: VSAN 6.0 all-flash performance, 8 hosts, single vs. dual disk groups]

Here we can see performance data for an 8-host, all-flash VSAN configuration. In this case the chart on the left was a single disk group and the chart on the right a dual disk group, all-flash configuration on each of the 8 hosts. The hosts were configured with one 400GB and three 800GB SSDs per disk group.

The various bars on the charts represent different VM working set sizes, 100, 200, 400 & 600GB for the single disk group chart and 100, 200, 400, 800 & 1200GB for dual disk group configurations. For the dual disk group, the 1200GB working set size is much bigger than a cache tier on each host.

The chart text is a bit confusing: the title of each plot says 70% read but the text under the two plots says 100% read. I must assume these were 70:30 R:W workloads. Looking at the 8 hosts with a 400GB VM working set size, the all-flash VSAN 6.0 single disk group cluster was able to achieve ~37.5K IOPS/host, and with two disk groups the all-flash VSAN 6.0 was able to achieve ~68.75K IOPS/host. Both at least double the corresponding hybrid performance.

Response times degrade for both the single and dual disk groups as we increase the working set sizes. It's pretty hard to see on the two charts, but latency seems to range from 1.8msec to 2.2msec for the single disk group and 1.8msec to 2.5msec for the dual disk group. The two charts are somewhat misleading because they didn't use the exact same working set sizes for the two performance runs, but just taking the 100|200|400GB working set sizes, for the single disk group it looks like the latency went from ~1.8msec. to ~2.0msec and for the dual disk group from ~1.8msec to ~2.3msec. Why the higher degradation for the dual disk group is anyone's guess.

The other thing that doesn't make much sense is that as you increase the working set size, the number of IOPS goes down, and it goes down worse for the dual disk group than the single. Once again taking just the 100|200|400GB working set sizes, this ranges from ~350K IOPS to ~300K IOPS (~15% drop) for the single disk group and ~700K IOPS to ~550K IOPS (~22% drop) for the dual disk group.
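
For reference, the percentage drops quoted above follow directly from the approximate chart values:

```python
# Approximate IOPS drop as the working set grows from 100GB to 400GB
# (values read off the charts, so rough).
def pct_drop(start, end):
    return (start - end) / start * 100

print(f"single disk group: {pct_drop(350_000, 300_000):.1f}% drop")  # ~14-15%
print(f"dual disk group:   {pct_drop(700_000, 550_000):.1f}% drop")  # ~21-22%
```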

Increasing working set sizes should cause additional backend IO, as cache effectiveness should be proportionately less as you increase the working set size, which I think goes a long way toward explaining the degradation in IOPS as working set size increases. But I would have thought the degradation would have been proportionally similar for both the single and dual disk groups. The fact that the dual disk group did 7% worse seems to indicate more overhead associated with dual disk groups than single disk groups, or perhaps they were running up against some host controller limits (a single controller supporting both disk groups).

~~~~

At the time (3 months ago) this was the first world-wide look at all-flash VSAN 6.0 performance. The charts are a bit more visible in the video than in my photos (?) and if you want to just see and hear Christos’s performance discussion check out ~11 minutes into the final video segment.

For more information you can also read these other SFD7 blogger posts on VMware’s session: