Category Archives: HPC storage

43: GreyBeards talk Tier 0 again with Yaniv Romem CTO/Founder & Josh Goldenhar VP Products of Excelero

In this episode, we talk with another next gen, Tier 0 storage provider. This time our guests are Yaniv Romem, CTO/Founder, & Josh Goldenhar (@eeschwa), VP Products, from Excelero, another new storage startup out of Israel. Both Howard and I talked with Excelero at SFD12 (videos here) last month in San Jose. I was very impressed with their raw performance and wrote a popular RayOnStorage blog post on their system (see my 4M IO/sec@227µsec 4KB Read… post) based on our discussions during SFD12.

As we have discussed previously, Tier 0, next generation flash arrays provide very high performing storage at very low latencies with modest to non-existent advanced storage services. They are intended to replace server-based, direct-access SSD storage with a more shared, scalable storage solution.

Our last podcast guest, E8 Storage, sells a hardware Tier 0 appliance. Excelero, in contrast, is a software defined, Tier 0 solution intended to run on any commodity, off the shelf server hardware with high end networking and (low to high end) NVMe SSDs.

Indeed, what impressed me most with their 4M IO/sec was that the target storage system had almost 0 CPU utilization. (Read the post to learn how they did this.) Excelero mentioned that they were able to generate high (11M random 4KB) IO/sec on an Intel Core i7, desktop-class CPU. Their one requirement in a storage server is plenty of PCIe lanes. They don’t even need dual socket storage servers; single socket CPUs work just fine as long as the PCIe lanes are there.

Excelero software

Their intent is to bring Tier 0 capabilities out to all big storage environments. By providing a software-only solution, it can easily be OEMed by cluster file system or HPC system vendors to generate the amazing IO performance their clients need.

That’s also one of the reasons that they went with high end Ethernet networking rather than just InfiniBand, which would have limited their market to mostly HPC environments. Excelero’s client software uses RoCE/RDMA hardware to perform IO operations with the storage server.

The other thing that little to no target storage server CPU utilization per IO operation gives them is the ability to scale up to 1000s of hosts or storage servers without hitting any storage system bottlenecks. Another concern eliminated by minimal target server CPU utilization is the noisy neighbor problem: there’s no target CPU processing to be shared. Yet another advantage is that bandwidth is limited only by storage server PCIe lanes and networking. A final advantage of their approach is that they can support any of the current and upcoming storage class memory devices supporting NVMe (e.g., Intel Optane SSDs).

The storage services they offer include RAID 0, 1 and 10 and a client-side logical volume manager which supports multi-pathing. Logical volumes can span up to 128 storage servers, but can be accessed by almost any number of hosts. And there doesn’t appear to be a specific limit on the number of logical volumes you can have.
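To make the client-side volume manager idea a bit more concrete, here’s a rough sketch (ours, not Excelero’s code; the stripe size, target names and layout are made up for illustration) of how a logical volume offset might be mapped, RAID 0-style, onto chunks spread across several storage servers:

```python
# Illustrative sketch only -- not Excelero's implementation. Stripe size,
# target names and layout are hypothetical.
from dataclasses import dataclass

STRIPE_SIZE = 128 * 1024  # hypothetical 128KB stripe unit

@dataclass
class Extent:
    server: str   # storage server holding this chunk
    device: str   # NVMe SSD on that server
    offset: int   # byte offset within that device

def map_offset(logical_offset: int, targets: list) -> Extent:
    """Translate a logical volume offset into a (server, device, offset) extent,
    round-robin across the storage servers backing the volume."""
    stripe_index = logical_offset // STRIPE_SIZE
    within_stripe = logical_offset % STRIPE_SIZE
    server, device = targets[stripe_index % len(targets)]
    # Each target holds only every len(targets)-th stripe unit of the volume.
    device_offset = (stripe_index // len(targets)) * STRIPE_SIZE + within_stripe
    return Extent(server, device, device_offset)

targets = [("server-a", "nvme0n1"), ("server-b", "nvme0n1"), ("server-c", "nvme1n1")]
print(map_offset(5 * 1024 * 1024 + 42, targets))   # lands on server-b in this layout
```

A real client-side volume manager would of course layer RAID 1/10 mirroring, multi-pathing and failover on top of this kind of mapping.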

 

They support two different protocols across the 40GbE/100GbE networks: standard NVMe over Fabrics or RDDA (Excelero’s patented, proprietary Remote Direct Disk Array access). RDDA is what mainly provides the almost non-existent target storage server CPU utilization, but even with standard NVMe over Fabrics they maintain low target CPU utilization. One proviso: with NVMe over Fabrics, they add shared volume functionality to support RAID device locking and additional fault tolerance capabilities.

On Excelero’s roadmap are thin provisioning, snapshots, compression and deduplication. However, they did mention that adding advanced storage functionality like this will impede performance. Currently, their distributed volume locking and configuration metadata is not normally accessed during an IO, but when you add thin provisioning, snapshots and data reduction, this metadata needs to become more sophisticated and will necessitate some amount of access during and after an IO operation.

Excelero’s client software runs as a Linux kernel-mode client and they don’t currently support VMware or Hyper-V. But they do support KVM as a hypervisor and would be willing to support the others if APIs were published or made available.

They also have an internal OpenStack Cinder driver, but it’s not part of the OpenStack release yet. They’re waiting for snapshot support to be available before they push this into the main code base. Ditto for Docker Engine support, though this is more of a beta capability today.

Excelero customer experience

One customer (NASA Ames/Moffett Field) deployed a single 2TB NVMe SSD in each of 128 hosts and combined them into a single 256TB logical volume shared and accessed by all 128 hosts.

Another customer configured Excelero behind a clustered file system and was able to generate 30M randomized IO/sec at 200µsec latencies but, more importantly, 140GB/sec of bandwidth. It turns out high bandwidth is important to many big data applications that have to roll lots of data into their analytics clusters, process it, output results, and then do it all over again. Bandwidth limitations can impact the success of these types of applications.

By being software only, they can be used in a standalone storage server or as a hyper-converged solution where applications and storage are co-resident on the same server. As noted above, they currently support Linux OSs for their storage and client software and support any x86 Intel processor, any RDMA-capable NIC, and any NVMe SSD.

Excelero GTM

Excelero is focused on the top 200 customers, which include hyper-scale providers like Facebook, Google, Microsoft and others. But hyper-scale customers have huge software teams and really just one or a few very large/complex applications, so they can create and optimize Tier 0 storage for themselves.

It’s really the customers just below the hyper-scaler class that have similar needs for high IO/sec at low latency or high IO bandwidth (or both), but have 100s to 1000s of applications and can’t afford to optimize them all for Tier 0 flash. If Excelero solves sharing Tier 0 flash storage in a more general way, say as a block storage device, they solve it for any application. And if the customer insists, they could put a clustered file system or even an object store (who would want this?) on top of this shared Tier 0 flash storage system.

These customers may currently be using NVMe SSDs within their servers as DAS devices. But with Excelero these resources can be shared across the data center. They think of themselves as a top-of-rack NVMe storage system.

On their website they have listed a few of their current customers, and they’re pretty large and impressive.

NVMe competition

Aside from E8 Storage, there are a few other competitors in Tier 0 storage. One recently announced a move to an NVMe flash storage solution and another killed their shipping solution. We talked about what all this means to Excelero and their market at the end of the podcast. Suffice it to say, they’re not worried.

The podcast runs ~50 minutes. Josh and Yaniv were very knowledgeable about Tier 0 and storage market dynamics and were a delight to talk with. Listen to the podcast to learn more.


Yaniv Romem CTO and Founder, Excelero

Yaniv Romem has been a technology evangelist at disruptive startups for the better part of 20 years. His passions are in the domains of high performance distributed computing, storage, databases and networking.
Yaniv has been a founder at several startups such as Excelero, Xeround and Picatel in these domains. He has served in CTO and VP Engineering roles for the most part.


Josh Goldenhar, Vice President Products, Excelero

Josh has been responsible for product strategy and vision at leading storage companies for over two decades. His experience puts him in a unique position to understand the needs of our customers.
Prior to joining Excelero, Josh was responsible for product strategy and management at EMC (XtremIO) and DataDirect Networks. Previous to that, his experience and passion was in large scale, systems architecture and administration with companies such as Cisco Systems. He’s been a technology leader in Linux, Unix and other OS’s for over 20 years. Josh holds a Bachelor’s degree in Psychology/Cognitive Science from the University of California, San Diego.

42: GreyBeards talk next gen, tier 0 flash storage with Zivan Ori, CEO & Co-founder E8 Storage.

In this episode, we talk with Zivan Ori (@ZivanOri), CEO and Co-founder of E8 Storage, a new storage startup out of Israel. E8 Storage provides a tier 0, next generation all flash array storage solution for HPC and high end environments that need extremely high IO performance with high availability and modest data services. We first saw E8 Storage at last year’s Flash Memory Summit (FMS 2016) and have wanted to talk with them ever since.

Tier 0 storage

The Greybeards discussed new tier 0 solutions in our annual yearend industry review podcast. As we saw it then, tier 0 provides lightning fast (~100s of µsec) read and write IO operations and millions of IO/sec. There are not a lot of applications that need this level of speed and quantity of IOs, but for those that do, Tier 0 storage is their only solution.

In the past, Tier 0 was essentially SSDs sitting on a PCIe bus, isolated to a single server. But today, with the emergence of NVMe protocols and SSDs, 40/50/100GbE NICs and switches, and RDMA protocols, this sort of solution can be shared across racks of servers.

There were a few shared Tier 0 solutions available in the past, but their challenge was that they all used proprietary hardware. With today’s new hardware and protocols, these new Tier 0 systems often perform as well as or much better than the old generation, but with off the shelf hardware.

E8 came to the market (emerged from stealth and GA’d in September of 2016) after NVMe protocols, SSDs and RDMA were available in commodity hardware and has taken advantage of all these new capabilities.

E8 Storage system hardware & software

E8 Storage offers a 2U HA appliance with 24 hot-pluggable NVMe SSDs connected to it and support for 8 client or host ports. The hardware appliance has two controllers, two power supplies, and two batteries. The batteries are used to hold up the DRAM write cache until it can be flushed to internal storage in the event of a power failure. They don’t do any DRAM read caching because the performance of the NVMe SSDs is more than fast enough.

The 24 NVMe SSDs are all dual ported for fault tolerance and provide hot-pluggable replacement for better servicing in the field. One E8 Storage system can supply up to 180TB of usable, shared NVMe flash storage.

E8 Storage uses RDMA (RoCE) NICs between client servers and their storage system, which support 40GbE, 50GbE or 100GbE networking.

E8 does not do data reduction (thin provisioning, data deduplication or data compression) on their storage, so usable capacity = effective capacity. Their belief is that these services consume a lot of compute/IO, limiting IO/sec and increasing response times, and as the price of NVMe SSD capacity comes down over time these activities become less useful.

They also have client software that provides a fault tolerant initiator for their E8 storage. This client software supports MPIO and failover across controllers in the event of a controller outage. The client software currently runs on just about any flavor of Linux available today and E8 is working to port this to other OSs based on customer requests.

Storage provisioning and management is through a RESTful API, CLI or web based GUI management portal. Hardware support is supplied by E8 Storage and they offer a 3 year warranty on their system with the ability to extend this to 5 years, if needed.

One problem with today’s standard NVMe over Fabrics solutions is that they lack any failover capabilities and really have no support for data protection. By developing their own client software, E8 provides fault tolerance and data protection for Tier 0 storage. They currently support RAID 0 and 5, and RAID 6 is in development.
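The data protection itself is the familiar parity math, just computed by the client. A toy illustration of RAID 5-style parity (our sketch, not E8 Storage’s code):

```python
# Toy RAID 5-style parity illustration -- our sketch, not E8 Storage's code.
def xor_blocks(blocks):
    """XOR equal-sized chunks together byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]   # data chunks spread across three SSDs
parity = xor_blocks(data)            # parity chunk written to a fourth SSD

# Lose any one chunk and it can be rebuilt from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```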

Performance

Everyone wants native DAS-NVMe SSD storage, but unlike server Tier 0 solutions, E8 Storage’s 180TB of NVMe capacity can be shared across up to 100 servers (they currently have 96 servers talking to a single E8 Storage appliance at one customer). By moving this capacity out to a shared storage device it can be made more fault tolerant, more serviceable and be amortized over more servers. However, the problem with doing this has always been the lack of DAS-like performance.

Talking to Zivan, he revealed that a single E8 Storage system is capable of 5M IO/sec, and at that rate, the system delivers an average response time of 300µsec, while at a more reasonable 4M IO/sec, the system delivers ~120µsec response times. He said they can saturate a 100GbE network by operating at 10M IO/sec. He didn’t say what the response time was at 10M IO/sec, but with network saturation, response times probably go up dramatically.

The other thing that Zivan mentioned was that the system delivered these response times with a very small variance (standard deviation). I believe he mentioned 1.5 to 3% standard deviations, which at 120µsec works out to roughly 2 to 4µsec, and even at 300µsec only about 5 to 9µsec. We have never seen this level of response time, response time variance and IO/sec in a single shared storage system before.
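For anyone who wants to check that arithmetic (ours, not E8’s), converting a relative standard deviation into absolute jitter is a one-liner:

```python
# Our arithmetic, not E8's: absolute jitter implied by a relative standard deviation.
for mean_latency_us in (120, 300):
    for rel_stddev in (0.015, 0.03):          # the 1.5% and 3% figures mentioned
        print(f"{mean_latency_us}µs mean at {rel_stddev:.1%} σ -> "
              f"±{mean_latency_us * rel_stddev:.1f}µs")
```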

E8 Storage

Zivan and many of his team previously came from IBM XIV storage. As such, they have been involved in developing and supporting enterprise class storage systems for quite a while now. So, E8 Storage knows what it takes to create products that can survive in 7X24, high end, highly active and demanding environments.

E8 Storage currently has customers in production in the US. They are seeing primary interest in their system from the HPC, FinServ, and Retail industries, but any large customer could have the need for something like this. They sell their storage for $2 to $3/GB.

The podcast runs ~42 minutes, and Zivan was easy to talk with and has a good grasp of the storage industry technologies.  Listen to the podcast to learn more.

Zivan Ori CEO & Co-Founder, E8 Storage

Mr. Zivan Ori is the co-founder and CEO of E8 Storage. Before founding E8 Storage, Mr. Ori held the position of IBM XIV R&D Manager, responsible for developing the IBM XIV high-end, grid-scale storage system, and served as Chief Architect at Stratoscale, a provider of hyper-converged infrastructure.

Prior to IBM XIV, Mr. Ori headed Software Development at Envara (acquired by Intel) and served as VP R&D at Onigma (acquired by McAfee).

39: Greybeards talk deep storage/archive with Matt Starr, CTO Spectra Logic

In this episode, we talk with Matt Starr (@StarrFiles),  CTO of Spectra Logic, the deep storage experts. Matt has been around a long time and Ray’s shared many a meal with Matt as we’re both in NW Denver. Howard has a minor quibble with Spectra Logic over the use of his company’s name (DeepStorage) in their product line but he’s also known Matt for awhile now.

The Pearl

Matt and Spectra Logic have a number of customers with multi-PB to over an EB of data in their repositories, and how to take care of these ever expanding storage stashes is an ongoing concern. One of the solutions Spectra Logic offers is Black Pearl Deep Storage, which provides an object storage, RESTful interface front end to a storage tiering/archive backend that uses flash, (spin-down) disk, (LTFS) tape (libraries) and the (AWS) cloud as backend storage.

Major portions of the Black Pearl are open sourced and available on GitHub. I see several (DS3) SDKs for Java, Python, C, and others. Open sourcing the product provides an easy way for client customization. In fact, one customer was using Ceph and modified their Ceph backup client to send a copy of data off to the Pearl.

We talk a bit about the Black Pearl’s data integrity. It uses a checksum, computed over the object at creation time, which is then verified anytime the object is retrieved, copied, moved or migrated, and can be validated periodically (scrubbed) even when it has not been touched.
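The pattern is simple to picture: compute a digest when the object lands, store it with the object, and recompute and compare on every retrieval or scrub. A generic sketch (illustrative only; this is neither Black Pearl’s implementation nor the DS3 API):

```python
# Generic end-to-end integrity sketch -- illustrative only; not Spectra Logic code.
import hashlib

def store(repo: dict, name: str, data: bytes) -> None:
    """Compute a checksum at object-creation time and keep it with the object."""
    repo[name] = {"data": data, "sha256": hashlib.sha256(data).hexdigest()}

def fetch(repo: dict, name: str) -> bytes:
    """Recompute and compare the checksum on retrieval, copy, migration or scrub."""
    entry = repo[name]
    if hashlib.sha256(entry["data"]).hexdigest() != entry["sha256"]:
        raise IOError(f"integrity check failed for {name}")
    return entry["data"]

repo = {}
store(repo, "observation-0001.dat", b"raw telescope samples ...")
assert fetch(repo, "observation-0001.dat") == b"raw telescope samples ..."
```

A periodic scrub is just the same compare run over every object on a schedule, without anyone having asked for the data.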

Super Computing’s interesting (storage) problems

Matt just returned from SC16 (the Supercomputing Conference 2016) in Salt Lake City last month. At the conference there were plenty of multi-PB customers looking for better storage alternatives.

One customer Matt mentioned was the Square Kilometre Array, the world’s largest radio telescope, which will be transmitting 700TB/hour, over 1EB per year. All that data has to land somewhere, and for this quantity (>1EB) of data, tape becomes a necessary choice.

Matt likened Spectra’s  archive solutions to warehouses vs. factories. For the factory floor,  you need responsive (AFA or hybrid) primary storage but for the warehouse, you just want cheap, bulk storage (capacity).

The podcast runs long, over 51 minutes, and reveals a different world from the GreyBeards’ everyday enterprise environments, specifically customers that have extra large data repositories and how they manage to survive under the data deluge. Matt’s an articulate spokesperson for Spectra Logic and their archive solutions and we could have talked about >1EB data repositories for hours. Listen to the podcast to learn more.

Matt Starr, CTO, Spectra Logic

Matt Starr’s tenure with Spectra Logic spans 24 years and includes experience in service, hardware design, software development, operating systems, electronic design and management. As CTO, he is responsible for helping define the company’s product vision, and serves as the executive representative for the voice of the market. He leads Spectra’s efforts in high-performance computing, private cloud and other vertical markets.

Matt served as the lead engineering architect for the design and production of Spectra’s TSeries tape library family. Spectra Logic has secured more than 50 patents under Matt’s direction, establishing the company as the innovative technology leader in the data storage industry. He holds a BS in electrical engineering from the University of Colorado at Colorado Springs.

38: GreyBeards talk with Rob Peglar, Senior VP and CTO, Symbolic IO

In this episode, we talk with Rob Peglar (@PeglarR), Senior VP and CTO of Symbolic IO, a computationally defined storage vendor. Rob has been around almost as long as the GreyBeards (~40 years) and most recently was with Micron and prior to that, EMC Isilon. Rob is also on the board of SNIA.

Symbolic IO emerged from stealth earlier this year and intends to be shipping products by late this year/early next. Rob joined Symbolic IO in July of 2016.

What’s computational storage?

It’s all about symbolic representation of bits. Symbolic IO has  come up with a way to encode bit streams into unique symbols that offer significant savings in memory space, beyond standard data compression techniques.

All that would be just fine if it sat at the end of a storage interface, and we would probably just call it a new form of data reduction. But Symbolic IO also incorporates persistent memory (NV-DIMMs, and in the future 3D XPoint, ReRAM, others) and provides this symbolic data inside a server, directly through its processor data cache, in (decoded) raw data form.

Symbolic IO provides a translation layer between persistent memory and processor cache that decodes the symbolic representation of the data in persistent memory for data reads on the way into data cache and encodes the symbolic representation of the raw data for data writes on the way out of cache to persistent memory.

Rob says that the mathematics are there to show that Symbolic IO’s data reduction is significant and that the decode/encode functionality can be done in a matter of a few clock cycles per cache (line) access on modern (Intel) processors.

The system continually monitors the data it sees to determine what the optimum encoding should be and can change its symbolic table to provide more memory savings for new data written to persistent memory.

All this reminds the GreyBeards of Huffman encoding algorithms for data compression (which one of us helped deploy on a previous [unnamed] storage product). Huffman encoding transforms ASCII (8-bit) characters into variable-length bit streams.
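For those who never ran into it, a minimal Huffman coder looks like the sketch below. This is the classic textbook algorithm, not Symbolic IO’s encoding, but it shows the same idea of trading fixed 8-bit characters for shorter codes on frequent symbols:

```python
# Minimal Huffman coding example -- the classic algorithm, not Symbolic IO's scheme.
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build a prefix-free, variable-length code: frequent symbols get shorter bit strings."""
    # Each heap entry: [weight, [symbol, code], [symbol, code], ...]
    heap = [[w, [ch, ""]] for ch, w in Counter(text).items()]
    heapq.heapify(heap)
    if len(heap) == 1:                        # degenerate single-symbol input
        return {heap[0][1][0]: "0"}
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]           # prepend a bit as we merge subtrees
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return {ch: code for ch, code in heap[0][1:]}

codes = huffman_codes("abracadabra")
encoded = "".join(codes[ch] for ch in "abracadabra")
print(codes, len(encoded), "bits vs", 8 * len("abracadabra"), "bits in ASCII")
```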

Symbolic IO will offer 3 products:

  • IRIS™ Compute, which provides persistent memory storage, accessed using something like the Linux pmem library, and includes Symbolic StoreModules™ (persistent memory hardware);
  • IRIS Vault, an appliance with its own IRIS-infused Linux OS (Symbolic’s SymCE™) plus Symbolic IO StoreModules, that can run any Linux application without change against the persistent memory; it offers full data security, next generation snapshot-/clone-like capabilities with BLINK™ full storage backups, and enhanced physical security with the removable IRIS Advanced EYE ASIC; and
  • IRIS Store, which extends IRIS Vault and IRIS Compute above with more tiers of storage, using Symbolic IO StoreModules as Tier 1, PCIe (flash) storage as Tier 2 and external SSD storage as Tier 3.

For more information on Symbolic IO’s three products, we would encourage you to read their website (linked above).

The podcast runs long, over 47 minutes, and was wide ranging, discussing some of the history of processor/memory/information technologies. It was very easy to talk with Rob and both Howard and I have known Rob for years, across multiple vendors & organizations.  Listen to the podcast to learn more.

Rob Peglar, Senior VP and CTO, Symbolic IO

Rob Peglar is the Senior Vice President and Chief Technology Officer of Symbolic IO. Rob is a seasoned technology executive with 39 years of data storage, network and compute-related experience, is a published author and is active on many industry boards, providing insight and guidance. He brings a vast knowledge of strategy and industry trends to Symbolic IO. Rob is also on the Board of Directors for the Storage Networking Industry Association (SNIA) and an advisor for the Flash Memory Summit. His role at Symbolic IO will include working with the management team to help drive the future product portfolio, executive-level forecasting and customer/partner interaction from early-stage negotiations through implementation and deployment.

Prior to joining Symbolic IO, Rob was the Vice President, Advanced Storage at Micron Technology, where he led next-generation technology and architecture enablement efforts of Micron’s Storage Business Unit, driving storage solution development with strategic customers and partners. Previously he was the CTO, Americas for EMC where he led the entire CTO functions for the Americas. He has also held senior level positions at Xiotech Corporation, StorageTek and ETA Systems.

Rob’s extensive experience in data management, analytics, high-performance computing, non-volatile memory, distributed cluster architectures, filesystems, I/O performance optimization, cloud storage, replication and archiving, networking, and virtualization makes him a sought after industry expert and board member. He was named an EMC Elect in 2014, 2015 and 2016. He was one of 25 senior executives worldwide selected for the CRN ‘Storage Superstars’ Award in 2010.

33: GreyBeards talk HPC storage with Frederic Van Haren, founder HighFens & former Sr. Director of HPC at Nuance

In episode 33 we talk with Frederic Van Haren (@fvha), founder of HighFens, Inc. (@HighFens), a new HPC consultancy, and former Senior Director of HPC at Nuance Communications. Howard and I got a chance to talk with Frederic at a recent HPE storage deep dive event, I met up with him again during SFD10, where he was talking on behalf of Kaminario, and he was also at the HPE Discover conference last week.

Nuance is the backend speech recognition engine for a number of popular service offerings. Nuance looks very similar to a lot of other hyper-scale customers and ultimately, we feel, may be the way of the future for all IT over the coming decades. Nuance’s data storage journey during Frederic’s tenure with the company holds many lessons for all of us in the storage industry.

Nuance currently has ~6PB usable (~16PB raw) of speech wave files as well as uncountable text and other files, all inside IBM Spectrum Scale (GPFS). They have both lots of big files and lots of small files. These days, Spectrum Scale is processing 2-3M files/second. They have doubled capacity for each of the last 9 years, and today handle a billion new files a month. GPFS stripes data across storage, provides data protection, migration, snapshotting and storage tiering across a diverse mix of storage. At the end of the podcast we discussed some open source alternatives to Spectrum Scale, but at the time Nuance started down this path, GPFS was found to be the only thing that could do the job. This proved to be a great solution, as they have completely swapped out the underlying storage at least 3 times and all their users were none the wiser.

The first storage that Frederic talked about was Coraid (no longer in business) and their ATA over Ethernet storage solution. This used SuperMicro chassis with 24 SATA drives/shelf, and they bought 40 shelves. Over time this grew to 1000s of SATA drives and was easily scalable but hard to manage, as it was pretty dumb storage. In fact, they had to deploy video cameras, focused on the drive shelves, to detect when drives failed!

Over time, Nuance came to the realization that they had to do something more manageable and brought in HPE MSA storage to replace their Coraid storage. The MSA was a great solution for them: it had 96 SAS drives, was able to support both faster “SCRATCH” storage using fast SAS 300GB/15K RPM drives and slower “STATIC” storage with slower SATA 760GB/7.2K RPM drives, and was much more manageable than the Coraid solution.

Although MSA storage worked great, after a while Nuance’s sprawling FC environment, which was doubling yearly, caused them to rethink their storage once again. This led them to swap out all their HPE MSA storage for HPE 3PAR, to consolidate their FC network and storage footprint.

For metadata, Nuance uses a 76-node Hadoop cluster for sophisticated search queries, as doing an ls on the GPFS file system would take days. Their file metadata is essentially a textual, row by row database, and they use queries over the Hadoop cluster to determine things like which files have American English, spoken by females, with 8KHz recordings. Not sure when, but eventually Nuance deployed HPE Vertica SQL over Hadoop for their metadata engine and dropped the average query time from 12 minutes to 73 seconds(!).
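The payoff of treating file metadata as a database is that a question like that becomes a query instead of a filesystem walk. A toy illustration (hypothetical schema, with SQLite from the Python standard library standing in for Vertica SQL over Hadoop):

```python
# Toy metadata-as-a-database illustration -- hypothetical schema; SQLite stands in
# for the Vertica SQL-over-Hadoop layer described above.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE wave_files (
    path TEXT, language TEXT, speaker_gender TEXT, sample_rate_hz INTEGER)""")
db.executemany("INSERT INTO wave_files VALUES (?, ?, ?, ?)", [
    ("/gpfs/audio/000001.wav", "en-US", "female", 8000),
    ("/gpfs/audio/000002.wav", "en-GB", "male",   16000),
    ("/gpfs/audio/000003.wav", "en-US", "female", 8000),
])

# "Which files have American English, spoken by females, recorded at 8kHz?"
rows = db.execute("""SELECT path FROM wave_files
                     WHERE language = 'en-US'
                       AND speaker_gender = 'female'
                       AND sample_rate_hz = 8000""").fetchall()
print([r[0] for r in rows])
```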

Nuance, because of their extreme growth and more open environment to storage innovation, has become a favorite for storage startups and major vendors to do Proofs of Concept (PoCs) on new storage offerings. One PoC Nuance did was for Kaminario storage. There is a standard metric that says a CPU core requires so many IOPS, so that as CPU cores increase, you need to supply more IOPS. They went with Kaminario for their test-dev environment and more performance intensive storage. Nuance appreciates Kaminario’s reliability, high availability and highly predictable performance. (See the SFD10 video feed for Frederic’s session.)

We talked a bit about how speech recognition’s Hidden Markov Model statistical approach is heavily dependent on CPU cores. Traditionally, if you want to do a recognition task, you assign it to one core and wait until it’s done, a serial process dependent on the # of CPU cores you have available. This turned out to be quite a problem, as you had to scale CPU cores if you wanted to do more concurrent speech recognition activities. Then came GPUs, and you could do speech recognition work on a GPU core. With the new GPU cards, instead of a server having ~16 CPU cores, you could have a server with multiple graphics cards having 3000+ GPU cores each. This scaled a lot easier. Machine learning and deep neural nets have the potential to parallelize this so that it will scale even better.
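The core-bound scheduling Frederic describes is easy to picture as a worker pool: one recognition job per core at a time, so concurrency is capped by how many workers you have. A toy sketch (ours, not Nuance’s code; the 0.1s sleep stands in for an HMM decode):

```python
# Toy illustration of core-bound recognition scheduling -- not Nuance's code.
from concurrent.futures import ProcessPoolExecutor
import time

def recognize(utterance_id: int) -> str:
    time.sleep(0.1)                      # stand-in for an HMM decode pinned to one core
    return f"transcript for utterance {utterance_id}"

if __name__ == "__main__":
    jobs = range(32)
    # Concurrency is capped by max_workers: with 16 "cores" you finish ~16x faster
    # than serial; with thousands of GPU cores the same pool scales much further.
    with ProcessPoolExecutor(max_workers=16) as pool:
        results = list(pool.map(recognize, jobs))
    print(len(results), "utterances recognized")
```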

In the end, HPC trials, tribulations and ways of doing business are starting to become mainstream. I was recently talking to one vendor who said most HPC groups start out in isolation to support one application, but over time they either subsume corporate IT, get absorbed into corporate IT, or continue as a standalone group (while waiting for one of the other two to happen).

The podcast runs ~41 minutes and  covers a lot of ground about one HPC organization’s evolution of their storage environment over time, what was driving some of that evolution and the tools they chose to master it.  Listen to the podcast to learn more.

Frederic Van Haren, founder HighFens, Inc.

Frederic Van Haren is the Chief Technology Officer @Highfens and known for his insights in the HPC and storage industry. He has over 20 years of experience in High Tech providing technical leadership and strategic direction in Telecom and Speech markets. Frederic spent the last decade at  Nuance Communications building large HPC environments from the ground up. He is frequently invited to speak at events to provide his insights on the HPC and storage markets. He has played leading roles as President of a variety of technology user groups promoting the use of innovative technology. As an Engineer he enjoys working with the engineering teams from technology vendors providing feedback on new and upcoming products.

Frederic lives in Massachusetts, USA but grew up in the northern part of Belgium, where he received his Master’s in Electrical Engineering, Electronics and Automation.