The rise of MinIO object storage

MinIO presented at SFD21 a couple of weeks back (see videos here). They had a great session, as always, with Jonathan and AB leading the charge. We’ve had a couple of GreyBeardsOnStorage podcasts with AB as well (listen and see GreyBeards talk open source S3… and GreyBeards talk Data Persistence …). We first talked with MinIO last year at SFD19, where AB made a great impression on the bloggers (see videos here).

Their customers run the gamut from startups to the F500. AB said that ~58% of the F500 have MinIO installed and over 8% of the F500 have added capacity over the last year. AB said they have a big presence in finance (e.g., the 10 largest banks run MinIO), and the auto and space/defense sectors have also adopted their product.

One reason for the latter two sectors (auto & space/defense) is the small size of MinIO’s binary, ~50MB. And my guess for why the rest of those customers have adopted MinIO is that it’s S3 API compatible, it’s open source, and it’s relatively inexpensive.

Object storage trends

Customers running in the cloud have a love-hate relationship with object storage: they love that it scales but hate what it costs. There are numerous on-prem object storage alternatives from traditional and non-traditional storage vendors, but most are deployed on appliances.

With appliances, customers have to order, wait for delivery, rack-configure-set up and, after maybe weeks to months, they finally have object storage on prem. But MinIO, a purely software, open source solution, can be tried by merely downloading a couple of (Docker) containers and deployed/activated in under an hour.

As mentioned above, MinIO is API compatible with AWS S3, which helps with adoption. Moreover, now that it’s an integral part of VMware (see their new Data Persistence Platform), it can be enabled in seconds on your standard enterprise VMware cluster with Tanzu.
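
To make the S3 compatibility point concrete, here’s a minimal sketch using the standard AWS boto3 SDK pointed at a MinIO server assumed to be running locally (e.g., started from their Docker container). The endpoint, access key, and secret key below are placeholders for whatever the server was configured with, not MinIO defaults or real credentials.

```python
# Minimal sketch: using the standard AWS S3 SDK (boto3) against a MinIO endpoint.
# Assumes a MinIO server is already running on localhost:9000; the credentials
# below are placeholders, not real values.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",      # MinIO endpoint instead of AWS
    aws_access_key_id="MINIO_ACCESS_KEY",      # placeholder credential
    aws_secret_access_key="MINIO_SECRET_KEY",  # placeholder credential
)

s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello object storage")
print(s3.get_object(Bucket="demo-bucket", Key="hello.txt")["Body"].read())
```

The point is that an application already written against S3 should only need an endpoint and credential change to run against MinIO.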

The other trend is that the edge needs storage, and lots of it. The main drivers of massive edge storage requirements are telcos deploying 5G and the auto industry’s self-driving cars. But this is just a start: industrial IoT will be generating reams of sensor log data at the edge, and it will need to be stored somewhere. And what better place to store all this data than object storage? All of this is driving more adoption of object storage, with MinIO picking up the lion’s share of deployments.

In addition, MinIO recently ported their software to run on ARM. AB said this was to support the expanding community of hobbyists and developers driving edge innovation.

And then there was Kubernetes. Everyone in the industry (with the possible exception of Google) is surprised by the adoption of K8s. Google essentially gifted ~$1B of R&D on how to scale apps to the world of IT, and now any startup, anywhere, can scale as well as Google can. And scaling is the “killer app” for the SW industry.

But performance isn’t bad either

Jonathan made mention of MinIO performance benchmarks (see the MinIO 24 node disk and MinIO 32 node NVMe SSD reports). Their disk data shows average read and write performance of 16.3GB/s and 9.4GB/s, respectively, and their NVMe SSD data shows average read and write performance of 183.2GB/s and 171.3GB/s, respectively. The disk numbers are very good for object storage, but the SSD numbers are spectacular.
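
Some quick arithmetic puts those cluster numbers in per-node terms (the cluster sizes and throughput figures are the ones cited above; the division is mine):

```python
# Back-of-the-envelope per-node throughput from the cited cluster benchmark figures.
disk_nodes, disk_read_gbs, disk_write_gbs = 24, 16.3, 9.4
nvme_nodes, nvme_read_gbs, nvme_write_gbs = 32, 183.2, 171.3

print(f"Disk cluster: ~{disk_read_gbs / disk_nodes:.1f} GB/s read, "
      f"~{disk_write_gbs / disk_nodes:.1f} GB/s write per node")
print(f"NVMe cluster: ~{nvme_read_gbs / nvme_nodes:.1f} GB/s read, "
      f"~{nvme_write_gbs / nvme_nodes:.1f} GB/s write per node")
# Disk cluster: ~0.7 GB/s read, ~0.4 GB/s write per node
# NVMe cluster: ~5.7 GB/s read, ~5.4 GB/s write per node
```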

It turns out that modern, cloud native apps don’t need quick access to data as much as high data throughput. Modern apps have moved to processing data in memory rather than off of storage, which means they move (large) chunks of data to memory, crunch on it there, and then spit it back out to storage. This operating mode seems to scale better (in the cloud at least) than having a high priced storage system servicing a blizzard of IO requests from everywhere.

Other vendors had offered SSD object storage before, but it never took off. Nowadays, with NVMe SSDs, MinIO is starting to see healthcare, finance, and AI/ML workloads all deploying NVMe SSD object storage. Yes, for large storage repositories (object storage’s traditional strongpoint), i.e., 5PB to 100PB, disk can’t be beat, but where blistering high throughput is needed, NVMe SSD object storage is the way to go.

Open source vs. open core

AB mentioned that MinIO’s business model is 100% open source vs. many other vendors that use open source but whose business model is open core. The distinction is that open core vendors use open source as base functionality and then build proprietary, charged-for software features/functions on top of it.

But open source vendors like MinIO offer all their functionality under open source licenses (Apache License V2.0, GNU AGPL v3 and other FOSS licenses); if you want to use it commercially, build products with it embedded inside, or have enterprise class support, you purchase a commercial license.

As presented at SFD21; their website home page has updated numbers, which are reflected below.

The pure open source model has some natural advantages:

  1. It’s a great lead gen solution, because anyone, worldwide, 7X24X365, can download the software and start using it (see Docker Hub or MinIO’s download page).
  2. It’s a great hiring pool. Anyone who has contributed to the MinIO open source is potentially a great technical hire. MinIO’s stats say they have 685 contributors, 19 in just the last month, for the MinIO base code (see MinIO’s GitHub repo).
  3. It’s a great development organization. With ~20 commits a week over the last year, there’s a lot going on to add functionality and fix bugs. But that’s the new world of software development. Given all this activity, release frequencies increase, to ~4 releases a month (see the GitHub repo insights above).
  4. It’s a great testing pool, with ~480M Docker pulls (using a Docker container to run a standard, already configured MinIO server, mc, console, etc.) and ~18K enterprises running their solution; that’s an awful lot of users. With open source, a lot of eyes (contributors) make all problems visible, but what’s more typical, from my perspective, is that the more users deploy your product, the more bugs they find.

Indeed, with the VMware’s Data Persistence Platform, Tanzu customers can use MinIO’s object storage at the click of a button (or three).

Of course, open source has downsides too. Anyone can access packages directly (from the GitHub repo and elsewhere) and use your software. And of course, they can clone, fork and modify your source code to add any functionality they want to it. Historically, open source subscription licensing models don’t generate as much revenue as appliance purchases do. And finally, open source, because it’s created by geeks, is typically difficult to deploy, configure, and use.

But can they meet the requirements of an enterprise world?

Because most open source is difficult to use, the enterprise has generally shied away from it. But that’s where there have been a lot of changes at MinIO.

MinIO has always had “mc” (the MinIO [admin] client), which offers a number of administrative services via a programmatically controlled API interface. But they have recently come out with a GUI offering, the MinIO Console, which has similar functionality to their mc API. They demoed the Console in their SFD21 sessions (see videos above).

Supporting 18K enterprise users, even if only 8% are using it a lot, can be a challenge, but supporting almost half a billion Docker pulls (even if only a quarter of these are complete MinIO deployments) can be hell on earth. The surprising thing is that MinIO’s commercial license promises customers direct-to-engineer support.

At their SFD21 sessions, AB stated they were getting ~2.7 new problems (tickets) a day. I assume these are just what’s coming in from commercially licensed users and not the general public (using their open source licensed offerings). AB said their average resolution time for these tickets was under 15 minutes.

Enter SubNet, the MinIO Subscription Network and their secret (not open source?) weapon to scale enterprise class support. Their direct-to-engineer support model involves a much more collaborative approach to solving customer problems than your typical enterprise support with level 1, 2 & 3 support engineers. They demoed SubNet briefly at SFD21, but it deserves a much longer discussion/demonstration.

What little we saw (at SFD21) looked almost like a Slack PM dialog between customer and engineer, but with unlimited downloads and realtime interaction.

MinIO also supports a very active Slack discussion group with ~11K users. Here anyone can ask a question, and it will get answered by anyone. MinIO’s Slack has two channels (General, and GitHub for notifications). It seems like MinIO is using Slack as crowdsourced level 1 support.

But in the long run, continuing to offer “direct-to-engineer” levels of support may require adding a whole lot more engineers. AB seems prepared to do just that.

~~~~

MinIO is an interesting open source, S3 API compatible object storage solution that seems to run just about anywhere, is freely deployable with enterprise class support available (at a price), and has high throughput performance. What’s not to like?

New PCM could supply 36PB of memory to CPUs

Read an article this past week on how quantum geometry can enable a new form of PCM (phase change memory) that is based on stacks of metallic layers (SciTech Daily article: Berry curvature memory: quantum geometry enables information storage in metallic layers). That article referred to a Nature article (Berry curvature memory through electrically driven stacking transitions) behind a paywall, but I found a pre-print of it, Berry curvature memory through electrically driven stacking transitions.

Figure 1 (from the pre-print): Signatures of two different electrically-driven phase transitions in WTe2, showing the possible stacking orders (1T’, Td,↑, Td,↓) and their Berry curvature distributions, the dual-gate h-BN capped WTe2 device, and the Type I (doping-driven, rectangular) vs. Type II (E-field-driven, butterfly-shape) conductance hysteresis at 80 K.

The number one challenge in IT today is that data just keeps growing: 2+ exabytes today and much more tomorrow.

All that information takes storage, bandwidth and ultimately some form of computation to take advantage of it. While computation, bandwidth, and storage density all keep going up, at some point the energy required to read, write, transmit and compute over all these Exabytes of data will become a significant burden to the world.

PCM and other forms of NVM, such as Intel’s Optane PMEM, have brought a step change in how much data can be stored close to server CPUs today. And as Optane PMEM doesn’t require refresh, it has also reduced the energy required to store and sustain that data relative to DRAM. I have no doubt that density, energy consumption and performance will continue to improve for these devices over the coming years, if not decades.

In the meantime, researchers are actively pursuing different classes of material that could replace or improve on PCM with even less power, better performance and higher densities. Berry Curvature Memory is the first I’ve seen that has several significant advantages over PCM today.

Berry Curvature Memory (BCM)

I spent some time trying to gain an understanding of Berry curvatures. As far as I can gather, it’s a quantum-mechanical geometric effect that quantifies the topological characteristics of the entanglement of electrons in a crystal. Suffice it to say, it’s something that can be measured as an electro-magnetic field and that provides phase transitions (on-off) in a metallic crystal at the topological level.

In the case of BCM, they used three to five atomically thin mono-layers of WTe2 (tungsten ditelluride), a Type II Weyl semi-metal that exhibits superconductivity, high magneto-resistance, and the ability to alter interlayer sliding through the use of terahertz (THz) radiation.

Fig. 4 (from the pre-print): Layer-parity selective Berry curvature memory behavior in the Td,↑ to Td,↓ stacking transition. Nonlinear Hall measurements show the Berry curvature dipole flips sign in trilayer WTe2 but remains unchanged in four-layer WTe2; symmetry analysis and first principles calculations confirm this layer-parity selective memory behavior.

It appears that using BCM as a memory has several advantages:
  • Altering a memory cell takes “a few meV/unit cell, two orders of magnitude less than conventional bond rearrangement in phase change materials” (PCM). In layman’s terms, it takes ~100X less energy to change a bit than PCM.
  • Altering a memory cell uses terahertz (THz) radiation, i.e., pulses of light or other electromagnetic radiation with durations on the order of picoseconds or less. This is ~1000X faster than the PCM that exists today.
  • Constructing a BCM memory cell takes between 13 and 16 atoms of W and Te, arranged in 3 to 5 layers of atomically thin WTe2 semi-metal.

While it’s hard to see in the figure above, the way this memory works is that the inner layer slides left to right with respect to the picture, and it’s this realignment of atoms between the three or five layers that gives rise to the changes in the Berry curvature phase space and provides the on-off switching.

To get from the lab to product is a long road, but the fact that it has density, energy and speed advantages measured in multiple orders of magnitude certainly bodes well for its potential to disrupt current PCM technologies.

Potential problems with BCM

Nonetheless, even though it exhibits superior performance characteristics with respect to PCM, there are a number of possible issues that could limit its use.

One concern (on my part) is that the inner-layer sliding may induce some sort of fatigue. Although I’ve heard that mechanical fatigue at the atomic level is not nearly as much of a concern as it is in larger (greater than atomic scale) structures, I must assume the sliding would induce some stress and, as such, limit the (write cycle) endurance of BCM.

Another possible concern is how to shrink the spot size of the THz radiation required to write only a small area of the material. Yes, one memory cell can be measured by the width of 3 atoms, but the next question is how far away the next memory cell needs to be placed. The laser used in BCM was focused down to ~1.5 μm, which is ~1,000X bigger than the BCM memory cell width (~1.5 nm).

Yet another potential problem is that current BCM must be embedded in a continuous flow of liquid nitrogen (@80K). It’s unclear how strict a requirement this temperature is for BCM to function, but there are no mainstream computers nowadays that require this level of cooling.

Figure 3 (from the pre-print): Td,↑ to Td,↓ stacking transitions with preserved crystal orientation in the Type II hysteresis. In-situ SHG intensity, Raman spectra and polarization measurements show that the Type II phase transition switches between the Td,↑ and Td,↓ stacking orders (possibly via an intermediate 1T’ stacking) while the crystal orientation is preserved.

Finally, from my perspective, the question is whether such a memory can be stacked vertically, with a higher number of layers. Yes, there are three to five layers of WTe2 used in BCM, but can you put another three to five layers on top of that, and then another? Although the researchers used three, four and five layer configurations, it appears that while the layer count changed the amplitude of the Berry curvature effect, it didn’t seem to add more states to the transition. If we were to add more layers of WTe2, would we be able to discern, say, 16 different states (like QLC NAND today)?

~~~~

So there’s a ways to go to productize BCM. But, aside from eliminating the low-temperature requirements, everything else looks pretty doable, at least to me.

I think it would open up a whole new dimension of applications, if we had say 60TB of memory to compute with, don’t you think?

Comments?

[Updated the title from 60TB to PB to 36PB as I understood how much memory PMEM can provide today…, the Eds.]


Storage that provides 100% performance at 99% full

A couple of weeks back we were talking with Qumulo at Storage Field Day 20 (SFD20), and they mentioned that they were able to provide 100% performance at 99% full. Please see their video session from SFD20 (which can be seen here). I was a bit incredulous of this, seeing as how every other modern storage system’s performance degrades long before it gets to 99% of capacity.

So I asked them to explain how this was possible. But before we get to that, a little background on modern storage systems is warranted.

The perils of log structured file systems

Most modern storage systems use a log structured file system: when they write data, they append it to a sequential log and use a virtual addressing scheme to record where the data for each address is located, creating a (data) log of written blocks.

However, when data is overwritten, it leaves gaps in these data logs. These gaps need to be somehow recycled (squeezed out) in order to be able to be consumed as storage capacity. This recycling process is commonly called “garbage collection”.

Garbage collection does its work by reading heavily gapped log files and re-writing the old, but still current, data into a new log. This frees up those gaps to be reused. But garbage collection like this takes reading and writing of logs to free up space.

Now as log structured file systems get (70-80-90%) full, they need to spend more and more system time and effort (= performance) garbage collecting. This takes system (IO) performance away from normal host IO activity, which is why I didn’t believe that Qumulo could offer 100% IO performance at 99% full.
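
To make that overhead concrete, here’s a toy sketch of a log-structured block store. It’s purely illustrative (not Qumulo’s or any vendor’s actual code): overwrites leave dead records behind in old segments, and reclaiming them means re-reading and re-writing the live data, which competes with host IO.

```python
# Toy log-structured block store: writes append to the current segment, and garbage
# collection copies still-live records forward so an old segment can be freed.
class LogStore:
    def __init__(self):
        self.segments = [[]]      # append-only log, split into segments
        self.index = {}           # block_id -> (segment_no, slot) of the current copy

    def write(self, block_id, data):
        seg_no = len(self.segments) - 1
        self.segments[seg_no].append((block_id, data))
        self.index[block_id] = (seg_no, len(self.segments[seg_no]) - 1)

    def garbage_collect(self, seg_no):
        # Copy the still-current records into a fresh segment, then drop the old one.
        # This read + re-write work is what steals performance as the store fills up.
        self.segments.append([])
        for slot, (block_id, data) in enumerate(self.segments[seg_no]):
            if self.index.get(block_id) == (seg_no, slot):    # still the live copy?
                self.write(block_id, data)
        self.segments[seg_no] = None                          # space reclaimed

store = LogStore()
store.write("blk1", "v1")
store.write("blk1", "v2")   # overwrite leaves a dead "v1" record behind in segment 0
store.garbage_collect(0)
```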

But there was always another way to supply storage virtualization (read: snapshotting) besides log files. Yes, it might involve more metadata (table) management, but what it costs in extra metadata it gives back by requiring no garbage collection.

How Qumulo does without garbage collection

Qumulo has a scaled block store as the back end of their file and object cluster store. And yes, it’s still a virtualized block store, BUT it’s not a log structured file store.

It seems that there’s a virtual-to-physical mapping table that is used by Qumulo to determine the physical address of any virtual block in the file system. And files are allocated to virtual blocks directly through the use of B-tree metadata. These B-trees indicate which virtual blocks are in use by a file and its snapshots.

If a host overwrites a data block, the old block can be freed (if not being used in a snapshot) and placed on a free block list, and a new block is allocated in its place. The file’s allocated-blocks B-tree is updated to reflect the new block, and that’s it.
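
Here’s an equally toy sketch of that alternative: a virtual-to-physical mapping plus a free block list, where an overwrite simply allocates a fresh physical block, repoints the mapping, and (if no snapshot still references it) returns the old block to the free list, with no background log rewriting. This is an illustration of the general technique as I understand it, not Qumulo’s actual B-tree implementation.

```python
# Toy mapped block store: overwrites swap a pointer and recycle the old block in place,
# so there is no log to garbage collect later.
class MappedStore:
    def __init__(self, num_blocks):
        self.physical = [None] * num_blocks
        self.free_list = list(range(num_blocks))   # all physical blocks start free
        self.v2p = {}                              # virtual block -> physical block

    def write(self, vblock, data, old_in_snapshot=False):
        new_p = self.free_list.pop()               # allocate a fresh physical block
        self.physical[new_p] = data
        old_p = self.v2p.get(vblock)
        self.v2p[vblock] = new_p                   # repoint the mapping
        if old_p is not None and not old_in_snapshot:
            self.free_list.append(old_p)           # old block is immediately reusable

store = MappedStore(num_blocks=8)
store.write(0, "v1")
store.write(0, "v2")   # overwrite: old block goes straight back on the free list
```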

For snapshots, Qumulo uses something they call a “write-out-of-place” process when data that a snapshot points to is overwritten. Again, it appears as if snapshots are some extra metadata associated with a file’s B-tree that defines the data in the snapshot.

The problem comes in when a file is deleted. If it’s a big enough file (TB-PB?), there could be millions to billions of blocks that have to be freed up. This would take entirely too long for a delete command, so this is done in the background. Qumulo calls this “reclaim delete“. So a delete of a big file unlinks the block B-tree from the directory and puts it on this reclaim delete work queue to free up these blocks later. Similarly, when a big snapshot is deleted, Qumulo performs a background process called “reclaim snapshot” for snapshot unique blocks.

As can be seen from this screen shot of Qumulo’s session at SFD20 (it’s very hard to see given the coloration of the chart), reclaim delete and reclaim snapshot are being done concurrently (in the background) with normal system IO. What’s interesting to note here is that reclaim IO (delete and snapshot) is going on all the time during the customer’s actual work. Why the write throughput drops significantly during the 27th-29th of July is hard to understand. But in the one case where it’s most serious (middle of July 28th), reclaim IO also drops significantly. If reclaim IO were impacting write performance, I would have expected it to have gone higher when write throughput went lower. But that’s not the case. From what I can see above, reclaim IO has no impact on read or write throughput at this customer.

So essentially, by using a backing block store that does no garbage collection (not using a log structured file system), Qumulo is able to offer 100% system IO performance at 99% full – woah.

DNA storage using nicks

Read an article the other day in Scientific American (“Punch card” DNA …) which was reporting on a Nature Magazine Article (DNA punch cards for storing data… ). The articles discussed a new approach to storing (and encoding) data into DNA sequences.

We have talked about DNA storage over the years (most recently, see our Random access DNA object storage post) so it’s been under study for almost a decade.

In prior research on DNA storage, scientists encoded data directly into the nucleotides used to store genetic information. As you may recall, there are two complementary nucleotide pairs, A-T (adenine-thymine) and G-C (guanine-cytosine), that constitute the genetic code in a DNA strand. One could use one of these pairs to encode a 1 bit and the other to encode a 0 bit, and just lay them out along a DNA strand.
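
As a sketch of that straightforward base-pair encoding (one bit per position, with the complementary strand implied), something like the following. This is just an illustration of the idea; real DNA storage codecs add redundancy and avoid problematic sequences.

```python
# Illustrative one-bit-per-base-pair encoding: A (paired with T) for 1, G (paired with C) for 0.
def bits_to_strand(bits):
    return "".join("A" if b == "1" else "G" for b in bits)   # sense strand only

def strand_to_bits(strand):
    return "".join("1" if base == "A" else "0" for base in strand)

strand = bits_to_strand("10110010")
assert strand_to_bits(strand) == "10110010"
print(strand)   # AGAAGGAG
```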

The main problem with nucleotide encoding of data in DNA is that it’s slow to write and read and very error prone (storing data in separate DNA nucleotides is a lossy form of data storage). Researchers have now come up with a better way.

Using DNA nicks to store bits

One could encode information in DNA by utilizing the topology of a DNA strand. Each DNA strand is actually made up of a sugar phosphate backbone with a nucleotide (A, C, G or T) hanging off of it, and then a hydrogen bond to its nucleotide complement (T, G, C or A, respectively), which is attached to another sugar phosphate backbone.

It appears that one can deform the sugar phosphate backbone at certain positions and retain an intact DNA strand. It’s in this deformation that the researchers encode bits, and they call this a “DNA nick”.

Writing DNA nick data

The researchers have taken a standard DNA strand (from E. coli) and identified unique sites on it that they can nick to encode data. They have identified multiple (mostly unique) sites for nick data along this DNA, which the scientists call “registers” but we would call sectors or segments. Each DNA sector can contain a certain amount of nick data, say 5 to 10 bits. The selected DNA strand has enough unique sectors to record 80 bits (10 bytes) of data. Not quite a punch card (80 bytes of data), but it’s early yet.

Each register or sector is made up of 450 base (nucleotide) pairs. As DNA has two separate strands connected together, the researchers can increase DNA nick storage density by writing both strands, creating a sort of two-sided punch card. They use nicks on this other, alternate (“anti-sense”) side of the DNA strand for the value “2”. We would have thought they would have used the absence of a nick in this alternate strand as being “3”, but they seem to just use it as another way to indicate “0”.

The researchers found an enzyme they could use to nick a specific position on a DNA strand, the PfAgo (Pyrococcus furiosus Argonaute) enzyme. The enzyme can be designed to nick distinct locations and registers (sectors) along the DNA strand. They designed 1024 (2**10) versions of this enzyme to create all possible 10 bit data patterns for each sector on the DNA strand.
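
To illustrate how the register scheme adds up to 80 bits, here’s a sketch that splits a bit string into 10-bit register values, each of which would select one of the 1024 enzyme variants. The register count, helper names, and layout are my own illustration, not the paper’s actual protocol.

```python
# Illustrative encoding of 80 bits into 8 registers of 10 bits each; each register's
# value (0..1023) would pick which of the 1024 enzyme variants to add to the solution.
REGISTERS, BITS_PER_REGISTER = 8, 10

def encode_registers(data_bits):
    assert len(data_bits) == REGISTERS * BITS_PER_REGISTER   # 80 bits total
    return [int(data_bits[r * BITS_PER_REGISTER:(r + 1) * BITS_PER_REGISTER], 2)
            for r in range(REGISTERS)]

print(encode_registers("1" * 10 + "0" * 70))   # [1023, 0, 0, 0, 0, 0, 0, 0]
```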

Writing DNA nick data is done via adding the proper enzyme combinations to a solution with the DNA strand. All sector writes are done in parallel and it takes about 40 minutes.

Also, the same PfAgo enzyme sequence is able to write (nick) multiple DNA strands without additional effort. So we can replicate the data as many times as there are DNA strands in the solution, in effect replicating the DNA nick data for disaster recovery.

Reading DNA nick data

Reading the DNA nick data is a bit more complicated.

In Figure 1, the read process starts by denaturing the double-stranded DNA (dsDNA) into single strands (ssDNA) and then splitting the single strands up based on register (sector) length, which are then sequenced. The specific register (sector) sequences are identified in the sequence data and can then be read/decoded and placed into the 80 bit string. The current read process is destructive of the DNA strand (read once).

There was no information on the read time, but my guess is it takes hours to perform. Another (faster) approach uses a “two-dimensional (2D) solid-state nanopore membrane” that can read the nick information directly from a DNA strand without the dsDNA-to-ssDNA steps. This approach is also non-destructive, so the same DNA strand could be read multiple times.

Other storage characteristics of nicked DNA

Given the register nature of the nicked DNA data organization, it appears that data can be read and written randomly, rather than sequentially. So nicked DNA storage is by definition, a random access device.

Although not discussed in the paper, it appears as if the DNA nicked data can be modified. That is, the same DNA strand could have its data modified (written multiple times).

The researchers claim that nicked DNA storage is so reliable that there is no need for error correction. I’m skeptical, but it does appear to be more reliable than previous generations of DNA storage encoding. However, there is a possibility that during destructive read out we could lose a register or two. Yes, one would know that the register bits are lost, which is good. But some level of ECC could be used to reconstruct any lost register bits, with some reduction in data density.

The one significant advantage of DNA storage has always been its exceptional data density, or bits stored per volume. Nicked storage reduces this volumetric density significantly: at 10 bits per ~450 base pairs (plus some additional base pairs required for register spacing), nicked DNA storage reduces DNA volumetric density by at least a factor of 45X. Current DNA storage is capable of storing 215M GB per gram, or 215 PB/gram. Reducing this by, let’s say, 100X would still be a significant storage density at ~2PB/gram.
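
A quick sanity check on that density arithmetic (the figures are the ones quoted above; the calculation is mine):

```python
# 215M GB/gram is 215 PB/gram for nucleotide encoding; nicked DNA stores ~10 bits per
# ~450+ base pairs, a ~45x penalty versus ~1 bit per base pair, rounded up here to 100x
# to allow for register spacing overhead.
nucleotide_pb_per_gram = 215
penalty = 100
print(nucleotide_pb_per_gram / penalty, "PB/gram")   # 2.15 PB/gram
```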

Comments?


Gaming is driving storage innovation at WDC

I was at SFD19 a couple of weeks ago, and Western Digital supplied the afternoon sessions on their technology (see videos here). Phil Bullinger gave a great session on HDDs and the data center market. Carl Che did a session on HDD technology and discussed how 5G is going to ramp up demand for video streaming and IoT data requirements. And of course one of the sessions was on their SSD and NAND technologies.

But the one session that was pretty new and interesting to me was their discussion of gaming and how it’s driving system innovation. Eric Spaneut, VP of Client Computing, was the main speaker for the session, but they also had Leah Schoeb, Sr. Developer Manager at AMD, discuss the gaming market and its impact on systems technology.

There were over 100M viewers of the League of Legends World Championships, with a peak viewership of 44M viewers. To put that in perspective, the 2020 Super Bowl had 102M viewers. So gaming championships today are almost as big as the Super Bowl in viewership.

Gaming demands higher performing systems

Gaming users are driving higher compute processors/core counts, better graphics cards, faster networking and better storage. Gamers are building/buying high end desktop systems that cost $30K or more, dwarfing the cost of most data center server hardware.

Their gaming rigs are typically liquid cooled, have LEDs all over, and are encased in glass. I could never understand why my crypto mining graphics cards had LEDs all over them. The reason was they were intended for gaming systems, not crypto mines.

Besides all the other components in these rigs, they are also buying special purpose storage. Yes, storage capacity requirements are growing for games, but performance and thermal/cooling have also become major considerations.

Western Digital has dedicated a storage line to gaming called WD Black. It includes both HDDs and SSDs (internal NVMe and external USB/SATA attached) at the moment. But Leah mentioned that gaming systems are quickly moving away from HDDs onto SSDs.

Thermal characteristics matter

Of WDC’s internal NVMe SSDs (WD Black SN750s), one version comes with a heatsink attached. It turns out SSD IO performance can be throttled back due to heat. The heatsink allows the SSD to operate at higher temperatures and offer more bandwidth than the one without. Presumably, it allows the electronics to stay cooler and thus keep running at peak performance.

I believe their WD Black HDDs have internal fans in them to keep them cool. And of course they all come in black with LEDs surrounding them.

Storage can play an important part in the “gaming experience” for users once you get beyond network bottlenecks for downloading. For downloading, most any storage performs well enough; however, for game loading, playing, editing videos and other gaming tasks, NVMe SSDs offer a significant performance boost over SATA SSDs and HDDs.

But not all gaming is done on high-end gaming desktop systems. Today a lot of gaming is done on dedicated consoles or in the cloud. Cloud based gaming is mostly just live streaming of video to a client device, whether it be a phone, tablet, console, etc. Live game streaming is almost exactly like video on demand but with more realtime input/output and more compute cores/graphic engines to perform the gaming activity and to generate the screens in “real” time. So having capacity and performance to support multiple streams AND the performance needed to create the live, real time experience takes a lot of server compute & graphics hardware, networking AND storage.

~~~~

So wherever gamers go, storage is becoming more critical in their environment. Both WDC and AMD see this market as strategic and growing, with requirements unique enough to demand special purpose products. They both are responding with dedicated hardware and product lines tailored to gaming needs.

Photo credit(s): All graphics in this post are from WDC’s gaming session video stream

StorPool, fast storage for fast times

At Storage Field Day 18 (SFD18) a couple of weeks ago, we heard from a new company, StorPool, that provides ultra fast software defined storage for MSPs and other cloud providers. You can watch the videos of their sessions here.

Didn’t know what to make of them at first, but when they started demoing their performance, we all woke up. They ran all-read and mixed read-write IO workloads that almost blew away any other proprietary/non-proprietary storage I’ve seen before.

[Updated 12Mar2019] What they were trying to achieve was to match the performance of a Windows Server 2019 Hyper-V benchmark which hit 13.8M IOPS using 12 nodes, each with 384GB DRAM, 1.5TB Optane DC persistent memory, 32TB (4X8TB) of NVMe SSDs and Mellanox 25Gbps RDMA Ethernet, with each VM running on the server that had its VHDX file stored.

Their demo showed a 70:30 R:W random 4KB mixed workload that achieved 1M IOPS with a read latency of 140 microsec. and a write latency of 100 microsec. (end to end, at the VM level). [Updated 12Mar2019] They were able to match the performance of the published benchmark without the 1.5TB of Optane memory, without the 25Gbps RDMA Ethernet, and without having the VMs and their storage running on the same nodes. They were able to show this performance running StorPool, KVM and CentOS 7 across 12 nodes running both VMs and storage services.
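
Some back-of-the-envelope numbers on that demo (the IOPS, block size, and node count are the figures presented; the arithmetic is mine):

```python
# 1M 4KB IOPS across a 12 node cluster at a 70:30 read:write mix.
iops, block_kb, nodes, read_frac = 1_000_000, 4, 12, 0.70

total_gbs = iops * block_kb / (1024 * 1024)          # aggregate GB/s
print(f"aggregate throughput: ~{total_gbs:.1f} GB/s "
      f"({read_frac * total_gbs:.1f} GB/s reads, {(1 - read_frac) * total_gbs:.1f} GB/s writes)")
print(f"per node: ~{iops / nodes:,.0f} IOPS")
# aggregate throughput: ~3.8 GB/s (2.7 GB/s reads, 1.1 GB/s writes)
# per node: ~83,333 IOPS
```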

They also showed information on a pgbench benchmark, which I was not familiar with. The chart had response times on the horizontal axis and TPS performance on the vertical axis.

What’s even more amazing is that even with the great performance they still offer reasonable data services such as CoW snapshots, asynchronous replication (with changed block tracking), thin provisioning, end to end data integrity, and iSCSI support.

Their target market is mostly MSPs and large customers moving to private cloud configurations. They mentioned deep support for OpenNebula, [updated 12Mar2019] OpenStack, OnApp and Kubernetes, which means each virtual disk is a volume/LUN. They support VMware and Windows Server/Hyper-V through iSCSI.

~~~~

The fact that they have a proprietary protocol is not that great, but if they can generate the IOPS and response times they showed here with snapshots, thin provisioning and asynch replication, I’m OK with it. [Updated 12Mar2019] The fact that they were able to match the performance of the more expensive system with standard Ethernet, no Optane memory, and all VMs running remote made a significant impression on me.

Want to learn more, check out these other discussions on StorPool (and other SFD18 vendors):

SFD18 – as intense as it gets by Max Mortillaro (@DarkAvenger), and

Podcast #3 review the SFD18 presenters by Chris M. Evans (@ChrisMEvans) and Matt Leib (@MBLeib).

[Updated 12Mar2019: Boyan Krosnov sent me an email indicating that the post had some mistakes, which were corrected via the updates above. Editors]

Screaming IOP performance with StarWind’s new NVMeoF software & Optane SSDs

Was at SFD17 last week in San Jose, where we heard from StarWind SAN (@starwindsan) about the latest NVMeoF storage system they have been working on. Videos of their presentation are available here. StarWind is this amazing company from Ukraine that has been developing software defined storage.

They have developed their own NVMe SPDK for Windows Server. Intel doesn’t currently offer SPDK for Windows today, so they developed their own. They also developed their own initiator (CentOS Linux) for NVMeoF. The target system was a multicore server running Windows Server with a single Optane SSD that they used to test their software.

Extreme IOP performance consumes cores

During their development activity they tested various configurations. At the start of their development they used a Windows Server with their NVMeoF target device driver. With this configuration on a bare metal server, they found that they could max out the Optane SSD at 550K 4K random write IOPS at 0.6msec to a single Optane drive.

When they moved this code directly to run under a Hyper-V environment, they were able to come close to this performance at 518K 4K write IOPS at 0.6msec. However, this level of IO activity pegged 100% of 8 cores on their 40 core server.

More IOPs/core performance in user mode

Next they decided to optimize their driver code and move as much as possible into user space and out of kernel space, while continuing to use Hyper-V. With this level of code, they were able to achieve the same performance as bare metal, ~551K 4K random write IOPS at 0.6msec RT and 2.26 GB/sec, while now pegging only 2 cores. They expect to release this initiator and target software in mid-October 2018!

They converted this functionality to run under ESX/VMware and were able to see much the same 2 cores pegged, ~551K 4K random write IOPS at 0.6msec RT and 2.26 GB/sec. They will have the ESXi version of their target driver code available sometime later this year.

Their initiator was running CentOS on another server. When they decided to test how far they could push their initiator, they were able to drive 4 Optane SSDs at up to ~1.9M 4K random write IOP performance.

At SFD17, I asked what they could have done at 100 usec RT, and Max said about 450K IOPS. This is still surprisingly good performance. With 4 Optane SSDs and consuming ~8 cores, you could achieve 1.8M IOPS and ~7.4GB/sec. Doubling the Optane SSDs, one could achieve ~3.6M IOPS and ~14.8GB/sec, with sufficient initiator and target cores.

Optane based super computer?

The ORNL Summit supercomputer, the current number one supercomputer in the world, has a sustained throughput of 2.5 TB/sec over 18.7K server nodes. You could do much the same with 337 CentOS initiator nodes, 337 Windows Server nodes and ~1350 Optane SSDs.
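
Here’s the arithmetic behind that node-count estimate (Summit’s throughput and the ~7.4GB/sec per-target figure are from above; everything else is just division):

```python
# How many StarWind target servers at ~7.4 GB/s each would match Summit's ~2.5 TB/s?
summit_gbs = 2500              # 2.5 TB/s expressed in GB/s
per_target_gbs = 7.4           # ~1.8M 4K IOPS from 4 Optane SSDs per target
targets = summit_gbs / per_target_gbs
print(f"~{targets:.0f} target nodes, ~{targets * 4:.0f} Optane SSDs")
# ~338 target nodes, ~1351 Optane SSDs
```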

That estimate assumes StarWind’s initiator and target NVMeoF systems can scale, but they’ve already shown they can do 1.8M IOPS across 4 Optane SSDs on a single initiator server, and I assume a single target server with 4 Optane SSDs and at least 8 cores to service the IO. Multiplying this by 4 or 400 shouldn’t be much of a concern, except for the increasing networking bandwidth.

Of course, with Starwind’s Virtual SAN, there’s no data management, no data protection and probably very little in the way of logical volume management. And the ORNL Summit supercomputer is accessing data as files in a massive file system. The StarWind Virtual SAN is a block device.

But if I wanted to rule the supercomputing world, in a somewhat smallish data center, I might be tempted to put together 400 of StarWind NVMeoF target storage nodes with 4 Optane SSDs each. And convert their initiator code to work on IBM Spectrum Scale nodes and let her rip.

Comments?