We had talked with Intel at Storage Field Day 20 (SFD20), about a month ago. At the virtual event, Intel’s focus was on their Optane PMEM (persistent memory) technology. Kelsey Prantis (@kelseyprantis), Senior Software Engineering Manager, Intel was on the show and gave an introduction into Intel’s DAOS (Distributed Architecture Object Storage, DAOS.io) a new HPC (high performance computing, super computers) file system they developed from scratch to use leading edge, Intel technologies, and Optane PMEM was one of them.
Kelsey has worked on LUSTRE and other HPC file systems for a long time now and came into the company from the acquisition of Whamcloud. Currently, she manages the development team working on DAOS. DAOS is a new HPC object storage file system which is completely open source (available on GitHub).
DAOS was designed from the start to take advantage of NVMe SSDs and Optane PMEM. With PMEM, current servers can support up to 20TB of memory. Besides the large memory sizes, Optane PMEM also offers non-volatile memory and byte addressability (just like DRAM). These two characteristics opens up new functionality that allows DAOS to move beyond legacy, block oriented, storage architectures that have been the only storage solution for HPC (and the enterprise) for decades now.
What’s different about DAOS
DAOS uses PMEM for all metadata and for storing small files. HPC IO has always focused on heavy bandwidth (IO using large blocks) oriented but lately newer applications have emerged, such as AI/ML/DL, data analytics and others, that use smaller files/blocks. Indeed, most new HPC clusters and supercomputers are deploying almost as many GPUs as CPUs in their configurations to support AI activities.
The problem is that these newer applications typically consume much smaller files. Matt mentioned one HPC client he worked with was processing small batches of seismic data, to predict, in real time, earthquakes that were happening around the world.
By using PMEM for metadata and small files, DAOS can be much more responsive to file requests (open, close, delete, status) as well as provide higher performing IO for small files. All this leads to a much better performing system for the new HPC workloads as well as great sustainable performance for the more traditional large file workloads.
DAOS provides a cluster storage system that can be configured with from 1 (no data protection), but more normally 3 nodes (with data protection) at a minimum to 512 nodes (lab tested). Data protection in DAOS is currently based on mirroring data and can use from 0 to the number of nodes in a cluster as data mirrors.
DAOS system nodes are homogeneous. That is they all come with the same amount of PMEM and NVMe SSDs. Note, DAOS doesn’t support disk drives. Kelsey mentioned DAOS node hardware can be tailored to suit any particular application environment. But they typically require an average of 6% of overall DAOS system capacity in PMEM for metadata and small file activity.
DAOS current supports their own API, POSIX, HDFS5, MPIIO and Apache Spark storage protocols. Kelsey mentioned that standard POSIX uses a pessimistic conflict resolution mode which leads to performance bottlenecks during parallel access. In contrast, DAOS’s versos of POSIX uses optimistic conflict resolution, which means DAOS starts writes assuming there’s no conflict, but if one occurs it handles the conflict in real time. Of course with all the metadata byte addressable and in PMEM this doesn’t take up a lot of (IO) time.
As mentioned earlier, DAOS data protection uses mirror-replicas. However, unlike most other major file systems, DAOS mirroring can be done at the object level. DAOS internally is an object store. Data organization on DAOS starts at the pool level, underneath that is data containers, and then under that are objects. Any object in DAOS can have its own mirroring configuration. DAOS is working towards supporting Erasure Coding as another form of data protection for a future release.
There’s a new storage benchmark that was developed specifically for HPC, called the IO500. The IO500 benchmark simulates a number of different HPC workloads, measures performance for each of them, and computes an (aggregate) performance score to rank HPC storage systems.
IO500 ranks system performance using two lists: one is for any sized configuration that typically range from 50 to 1000s of nodes and their other list limits the configuration to 10 nodes. The first performance ranking can sometimes be gamed by throwing more hardware into a cluster. The 10 node rankings are much harder to game this way and from our perspective, show a fairer comparison of system performance.
As presented (virtually) at ISC 2020, DAOS took the top spot on the IO500 any size configuration list and performed better than 2X the next best solution. And on the IO500 10 node list, Intel’s DAOS configuration, Texas Advanced Computing (TAC) DAOS configuration, and Argonne Nat Labs DAOS configuration took the top 3 spots and had 3X better performance than the next best, non-DAOS storage system.
The Argonne National Labs has already stated that they will be using DAOS in their new HPC system to be deployed in the near future. Early specifications for storage at the new Argonne Lab required support for 230PB of data and 25TB/sec of bandwidth.
The podcast ran ~43 minutes. Kelsey was great to talk with and very knowledgeable about HPC systems and HPC IO in particular. Matt has worked at Argonne in the past so understood these systems better than I. Sadly, we lost Matt’s end of the conversation about 1/2 way into the recording. Both Matt and I thought that DAOS represents the birth of a new generation of HPC storage. Listen to the podcast to learn more.
Kelsey Prantis, Senior Software Engineering Manager, Intel
Kelsey Prantis heads the Extreme Storage Architecture and Development division at Intel Corporation. She leads the development of Distributed Asynchronous Object Storage (DAOS), an open-source, low-latency and high IOPS object store designed from the ground up for massively distributed Non-Volatile Memory (NVM).
She joined Intel in 2012 with the acquisition of Whamcloud, where she led the development of the Intel Manager for Lustre* product.
Prior to Whamcloud, she was a software developer at personal genomics and biotechnology company 23andMe.
Prantis holds a Bachelor’s degree in Computer Science from Rochester Institute of Technology