Data Science storage with NetApp’s Python Toolkit

I’ve got a book someplace (yet to be read completely) titled Data Science with Python. At Storage Field Day 21 last month, NetApp discussed a number of their product offerings, one of which was their Python SDK to manage NetApp storage for data scientists and AI researchers (see videos of their sessions here).

I’m not a data science expert, but a Python SDK for storage management makes so much sense to me that I just had to take a look. Their GitHub repo is available online and they call it the NetApp Data Science Toolkit.


The challenge for data science and AI researchers is that it’s all about the data. How do you find the data, gain access to it, clean it, and process it quickly, so you can do it all over again? Having a Python SDK that allows rudimentary storage volume configuration, access, snapshotting, etc. can make these sorts of pipelines self-service, rather than going back and forth with operations to get volumes configured, mounted, and services established.

NetApp Data Science Toolkit

The NetApp Data Science Toolkit can be pip installed into anything running Python 3.5 or later, and can be used either via a command line utility or as a library of Python functions. The command line utility and the Python calls appear to be functionally equivalent.

pip3 install netapp-ontap pandas tabulate requests boto3

The Toolkit must be configured for your environment and NetApp storage, but once that’s done you’re ready to rock and roll.

MLOps pipeline from Google

The command line is invoked with

./ntap_dsutil.py

Following that command are subcommands and parameters specifying which ONTAP operation you want to perform and how it is to be done. Python function calls appear to follow the same parameterization as the CLI.

The CLI and Python function calls can run on macOS or any Linux distribution. There’s a paper that discusses how to use the SDK to accelerate AI pipelines, as well as another ReadMe that describes its use in Kubernetes with NetApp’s Trident CSI plugin.

The functionality supports NetApp AFF, FAS, Cloud Volumes ONTAP and ONTAP Select systems running ONTAP 9.7 or later. For a current list of available ONTAP functions, check out the toolkit. But as an overview, these ONTAP functions are available:

  • For Volume Management – cloning, creating, listing all, deleting or mounting a volume,
  • For Snapshot Management – creating, deleting, listing and restoring snapshots (of volumes)
  • For Data Fabric Management – listing all cloud sync relationships, triggering a cloud sync operation, multi-thread pulling a bucket down from S3 storage (into a NetApp volume directory), pulling a single object down from S3 into a file, pushing the contents of a directory to bucket on S3 and pushing a file into an object on S3.
  • For Advanced Data Fabric Management – listing all SnapMirror relationships and triggering a sync operation for an existing SnapMirror relationship.

This is a pretty comprehensive list of NetApp ONTAP storage functionality. Having all this under the control of Python or a CLI for a data scientist or AI researcher seems pretty impressive.
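To make that concrete, below is a minimal sketch of what a self-service clone-and-snapshot workflow could look like from a notebook or script. The import and function names are assumptions modeled on the toolkit’s CLI subcommands (clone volume, create snapshot, mount volume); check the NetApp Data Science Toolkit GitHub repo for the actual library API before relying on it.

# Hypothetical sketch of a self-service data science storage workflow.
# Function names/signatures below are assumptions modeled on the toolkit's
# CLI subcommands; consult the toolkit's GitHub repo for the real API.
from ntap_dsutil import cloneVolume, createSnapshot, mountVolume  # assumed imports

# Snapshot the "gold" training data set before experimenting
createSnapshot(volumeName="training_data", snapshotName="baseline_v1")

# Clone the volume so this experiment gets its own writable copy
cloneVolume(sourceVolumeName="training_data", newVolumeName="exp42_training_data")

# Mount the clone locally so pandas/PyTorch jobs can read it directly
mountVolume(volumeName="exp42_training_data", mountpoint="/mnt/exp42")

Wrapped in a few notebook cells like this, provisioning a per-experiment copy of the data becomes something the researcher can do without filing a ticket with the storage team.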

Of course, not every option for all those functions is supported, but it’s just a start (V1.1 of the toolkit). I’m sure there’s more to come, especially if customers demand it.

However, it would be nice to have an ONTAP simulator available with the toolkit that could be used to test out your Python code and CLI commands before using real NetApp storage. This would be very useful for those of us lacking our own test ONTAP storage, just hanging around on prem or in the cloud.

As Python becomes the language of choice for AI and now data science, it seems only natural that storage and data protection companies would start releasing Python SDKs/APIs for their product functionality. That way AI and data science researchers could embed any storage functionality they needed directly into their Python code or Jupyter Notebook application.

Having a Python SDK for NetApp ONTAP storage means using data storage for your MLOps or data science pipelines is that much easier.

Great move by NetApp. Ok where’s the rest of the industry?


New DRAM can be layered on top of CPU cores

At the last IEDM (IEEE International Electron Devices Meeting), there were two sessions devoted to a new type of DRAM cell that consists of 2 transistors and no capacitors (2T0C) and that can be built in layers on top of a microprocessor without disturbing the microprocessor silicon. I couldn’t access the actual research (behind paywalls), but one of the research groups was from Belgium (IMEC) and the other from the US (Notre Dame and R.I.T.). This was written up in a couple of teaser articles in the tech press (see the IEEE Spectrum tech talk article).

DRAM today is built using 1 transistor and 1 capacitor (1T1C). And it appears that capacitors and the logic used for microprocessors aren’t very compatible. As such, most DRAM lives outside the CPU (or microprocessor core) chip and is attached over a memory bus.

New 2T0C DRAM bit cell: data is written by applying current to the WBL and WWL, and bits are read by seeing if a current can pass through the RWL/RBL.

Memory busses have gotten faster in order to allow faster access to DRAM, but this too is starting to reach fundamental physical limits, and DRAM memory sizes aren’t scaling like they used to.

Wouldn’t it be nice if there were a new type of DRAM that could be easily built closer to, or even layered on top of, a CPU chip, with faster direct access from/to CPU cores through inter-chip electronics?

Oxide based 2T0C DRAM

DRAM was designed from the start with 1T1C so that it could hold a charge. With a charge in place it could be read out quickly and refreshed periodically without much of a problem.

The researchers found that at certain sizes (and with proper dopants) small transistors can also hold a (small) charge without needing any capacitor.

By optimizing the chemistry used to produce those transistors they were able to make 2T0C transistors hold memory values. And given the fabrication ease of these new transistors, they can easily be built on top of CPU cores, at a low enough temperature so as not to disturb the CPU core logic.

But, given these characteristics, the new 2T0C DRAM can also be built up in layers, just like 3D NAND and unlike current DRAM technologies.

Today 3D NAND is being built at over 64 layers, with flash NAND roadmaps showing double or quadruple that number of layers on the horizon. Researchers presenting at IEDM were able to fabricate an 8 layer 2T0C DRAM on top of a microprocessor and provide direct, lightning fast access to it.

The other thing about the new DRAM technology is that it doesn’t need to be refreshed as often. Current DRAM must be refreshed every 64 msec. This new 2T0C technology has a much longer retention time and currently only needs to be refreshed every 400 seconds, and much longer retention times are technically feasible.
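A quick back-of-the-envelope calculation, using just the retention figures quoted above, shows the scale of that difference:

# Rough comparison of refresh frequency, using the retention figures above.
# It ignores per-refresh energy and latency details; the point is just the
# order-of-magnitude gap in how often each technology must refresh.
dram_1t1c_interval_s = 0.064   # conventional DRAM: refresh every 64 msec
dram_2t0c_interval_s = 400.0   # reported 2T0C retention: refresh every ~400 sec

reduction = dram_2t0c_interval_s / dram_1t1c_interval_s
print(f"2T0C needs to refresh ~{reduction:,.0f}X less often than 1T1C DRAM")
# -> 2T0C needs to refresh ~6,250X less often than 1T1C DRAM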

Some examples of processing needing more memory:

  • AI/ML and the memory wall – Deep learning models are getting so big that memory size is starting to become a limiting factor in AI model effectiveness. And this is just with DRAM today. Optane and other SCM can start to address some of this problem, but the problem doesn’t go away; AI DL models are just getting more complex. I recently read an article where Google trained a trillion-parameter language model.
  • In memory databases – SAP HANA is just one example, but there are other startups as well as traditional database providers that are starting to use huge amounts of memory to process data at lightning fast speeds. Data only seems to grow, not shrink.

Yes, Optane and other SCM today can solve some of these problems. But having a 3D scalable DRAM memory that can be built right on top of the CPU cores, with longer hold times and faster direct access, can be a game changer.

It’s unclear whether we will see all DRAM move to the new 2T0C format, but if it can scale well in the Z direction, has better access times, and has longer retention, it’s hard to see why this wouldn’t displace all current 1T1C DRAM over time. However, given the $Bs of R&D spent on new and current 1T1C DRAM fabrication technology, it’s going to be a tough and long battle.

Now if the new 2T0C DRAM could only move from 1 bit per cell to multiple bits per cell, like SLC to MLC NAND, the battle would heat up considerably.


The rise of MinIO object storage

MinIO presented at SFD21 a couple of weeks back (see videos here). They had a great session, as always, with Jonathan and AB leading the charge. We’ve had a couple of GreyBeardsOnStorage podcasts with AB as well (listen and see GreyBeards talk open source S3… and GreyBeards talk Data Persistence …). We first talked with MinIO last year at SFD 19, where AB made a great impression on the bloggers (see videos here).

Their customers run the gamut from startups to the F500. AB said that ~58% of the F500 have MinIO installed and over 8% of the F500 have added capacity over the last year. They have a big presence in finance (e.g., the 10 largest banks run MinIO), and the auto and space/defense sectors have adopted their product as well.

One reason for the latter two sectors (auto & space/defense) is the size of MinIO’s binary, 50MB. And my guess for why the rest of those customers have adopted MinIO is that it’s S3 API compatible, it’s open source, and it’s relatively inexpensive.

Object storage trends

Customers running in the cloud have a love-hate relationship with object storage: they love that it scales but hate what it costs. There are numerous on prem object storage alternatives from traditional and non-traditional storage vendors, but most are deployed on appliances.

With appliances, customers have to order, wait for delivery, then rack, configure and set up the gear, and after maybe weeks to months they finally have object storage on prem. But MinIO, a pure software, open source solution, can be tried by merely downloading a couple of (Docker) containers and deployed/activated in under an hour.

As mentioned above, MinIO is API compatible with AWS S3 which helps with adoption. Moreover, now that it’s an integral part of VMware (see their new Data Persistence Platform), it can be enabled in seconds on your standard enterprise VMware cluster with Tanzu.
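Because it speaks the S3 API, standard S3 tooling generally works against a MinIO endpoint unchanged. As a minimal sketch (the endpoint URL and credentials below are placeholders for whatever your own MinIO deployment uses), pointing boto3 at MinIO looks like this:

# Minimal sketch: using standard S3 tooling (boto3) against a MinIO server.
# The endpoint URL and credentials are placeholders for your own deployment.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.example.local:9000",  # your MinIO endpoint
    aws_access_key_id="MINIO_ACCESS_KEY",
    aws_secret_access_key="MINIO_SECRET_KEY",
)

s3.create_bucket(Bucket="training-data")
s3.upload_file("dataset.tar.gz", "training-data", "dataset.tar.gz")

for obj in s3.list_objects_v2(Bucket="training-data").get("Contents", []):
    print(obj["Key"], obj["Size"])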

The other trend is that the edge needs storage, and lots of it. The main drivers of massive edge storage requirements are telcos deploying 5G and the auto industry’s self-driving cars. But this is just a start; industrial IoT will be generating reams of sensor log data at the edge, and it will need to be stored somewhere. And what better place to store all this data than on object storage? All this is driving more adoption of object storage, with MinIO picking up the lion’s share of deployments.

In addition, MinIO recently ported their software to run on ARM. AB said this was to support the expanding hobbyist and developer community driving edge innovation.

And then there was Kubernetes. Everyone in the industry (with the possible exception of Google) is surprised by the adoption of K8S. Google essentially gifted ~$1Bs in R&D on how to scale apps to the world of IT, and now any startup, anywhere, can scale as well as Google can. And scaling is the “killer app” for the SW industry.

But performance isn’t bad either

Jonathan mentioned MinIO performance benchmarks (see the MinIO 24 node disk and MinIO 32 node NVMe SSD reports). Their disk data shows average read and write throughput of 16.3GB/s and 9.4GB/s, respectively, and their NVMe SSD data shows average read and write throughput of 183.2GB/s and 171.3GB/s, respectively. The disk numbers are very good for object storage, but the SSD numbers are spectacular.
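As a rough sanity check on those numbers (straight division only, ignoring network topology and erasure coding overheads), the per-node throughput works out as follows:

# Back-of-the-envelope per-node throughput from the benchmark totals above.
# Straight division only; real per-node numbers depend on network layout,
# erasure coding and drive configuration.
disk_read_gbs, disk_write_gbs, disk_nodes = 16.3, 9.4, 24
nvme_read_gbs, nvme_write_gbs, nvme_nodes = 183.2, 171.3, 32

print(f"Disk cluster: {disk_read_gbs/disk_nodes:.2f} GB/s read, "
      f"{disk_write_gbs/disk_nodes:.2f} GB/s write per node")
print(f"NVMe cluster: {nvme_read_gbs/nvme_nodes:.2f} GB/s read, "
      f"{nvme_write_gbs/nvme_nodes:.2f} GB/s write per node")
# -> Disk: roughly 0.7 GB/s read, 0.4 GB/s write per node
# -> NVMe: roughly 5.7 GB/s read, 5.4 GB/s write per node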

It turns out that modern, cloud native apps don’t need quick access to data as much as high data throughput. Modern apps have moved to processing data in memory rather than off of storage, which means they move (large) chunks of data to memory, crunch on it there, and then spit it back out to storage. This type of operating mode seems to scale better (in the cloud at least) than having a high priced storage system servicing a blizzard of IO requests from everywhere.

Other vendors had offered SSD object storage before, but it never took off. Nowadays, with NVMe SSDs, MinIO is starting to see healthcare, finance, and AI/ML workloads all deploying NVMe SSD object storage. Yes, for large storage repositories (object storage’s traditional strongpoint), i.e., 5PB to 100PB, disk can’t be beat, but where blistering high throughput is needed, NVMe SSD object storage is the way to go.

Open source vs. open core

AB mentioned that MinIO’s business model is 100% open source, vs. many other vendors that use open source but whose business model is open core. The distinction is that open core vendors use open source as base functionality and then build proprietary, charged-for software features/functions on top of it.

Open source vendors like MinIO offer all their functionality under open source licenses (Apache License V2.0, GNU AGPL v3 and other FOSS licenses), but if you want to use it commercially, build products with it embedded inside, or have enterprise class support, you purchase a commercial license.

The numbers below are as presented at SFD21; their website home page has since been updated.

The pure open source model has some natural advantages:

  1. It’s a great lead gen solution because anyone, worldwide, 7X24X365, can download the software and start using it (see Docker Hub or MinIO’s download page).
  2. It’s a great hiring pool. Anyone who has contributed to the MinIO open source is potentially a great technical hire. MinIO’s stats say they have 685 contributors, 19 in just the last month, for the MinIO base code (see MinIO’s GitHub repo).
  3. It’s a great development organization. With ~20 commits a week over the last year, there’s a lot going on to add functionality and fix bugs. But that’s the new world of software development. Given all this activity, release frequencies increase, to ~4 releases a month (see GitHub repo insights above).
  4. It’s a great testing pool, with ~480M Docker pulls (using a Docker container to run a standard, already configured MinIO server, mc, console, etc.) and ~18K enterprises running their solution; that’s an awful lot of users. With open source, a lot of eyes (contributors) make all problems visible, but what’s more typical, from my perspective, is that the more users deploy your product, the more bugs they find.

Indeed, with VMware’s Data Persistence Platform, Tanzu customers can use MinIO’s object storage at the click of a button (or three).

Of course, open source has downsides too. Anyone can access the packages directly (from the GitHub repo and elsewhere) and use your software. And of course, they can clone, fork and modify your source code to add any functionality they want to it. Historically, open source subscription licensing models don’t generate as much revenue as appliance purchases do. And finally, open source, because it’s created by geeks, is typically difficult to deploy, configure, and use.

But can they meet the requirements of an Enterprise world?

Because most open source is difficult to use, the enterprise has generally shied away from it. But that’s where there have been a lot of changes at MinIO.

MinIO always had “mc” (the minio [admin] client), which offered a number of administrative services via an API-driven, programmatically controlled interface, but they have recently come out with a GUI offering, the MinIO Console, which has similar functionality to the mc API. They demoed the Console in their SFD21 sessions (see videos above).

Supporting 18K enterprise users, even if only 8% are using it a lot, can be a challenge, but supporting almost half a billion Docker pulls (even if only 1/4th of these are complete MinIO deployments) can be hell on earth. The surprising thing is that MinIO’s commercial license promises customers direct-to-engineer support.

At their SFD21 sessions, AB stated they were getting ~2.7 new problems (tickets) a day. I assume these are just what’s coming in from commercially licensed users and not the general public (using their open source licensed offerings). AB said their average resolution time for these tickets was under 15 minutes.

Enter SubNet, the MinIO Subscription Network and their secret (not open source?) weapon to scale enterprise class support. Their direct-to-engineer support model involves a much more collaborative approach to solving customer problems than your typical enterprise support with level 1, 2 & 3 support engineers. They demoed SubNet briefly at SFD21, but it deserves a much longer discussion/demonstration.

What little we saw (at SFD21) looked almost like a Slack-style PM dialog between customer and engineer, but with unlimited downloads and realtime interaction.

MinIO also supports a very active Slack discussion group with ~11K users. Here anyone can ask a question and it will get answered by anyone. MinIO’s Slack has 2 channels (General, and GitHub for notifications). It seems like MinIO is using Slack as crowdsourced level 1 support.

But in the long run, continuing to offer “direct-to-engineer” levels of support may require adding a whole lot more engineers. AB seems prepared to do just that.

~~~~

MinIO is an interesting open source, S3 API compatible object storage solution that seems to run just about anywhere, is freely deployable with enterprise class support available (at a price), and has high throughput performance. What’s not to like?

Undersea datacenter in our future?

Read about Microsoft’s Project Natick Phase 2 this past week. Microsoft submerged a steel-encased tube filled with servers and storage in the UK for 2 years and just took it out of the water this past July. We’ve written before about underwater and in-space data centers (see our IT in space post).

Project Natick’s Phase 2 underwater data center had 12 racks with 864 servers and 27.5PB of disk storage, and was connected to the nearby Orkney Islands’ power grid (250KW) and networking infrastructure. The Orkney Islands are located off the NE coast of Scotland and their power grid is 100% renewable, using tidal, solar and wind power. During the data center test, Orkney was able to power the data center and the islands, and still provide power back to the Scottish power grid.

More reliable underwater

According to early reports, the servers in the underwater data center had 1/8th the failures that a control data center on land had. Microsoft attributes the enhanced server reliability to the use of a 100% nitrogen atmosphere (at 1 atmosphere of pressure) rather than normal air, and to the lack of any humans to jostle the equipment/disturb the environment.

It’s also likely that the temperature variability on the sea floor was measurably less than that of a normal data center on the surface of the earth. If so, that could also help explain the better reliability.

Why underwater?

It’s all about cooling modern servers (and storage). According to NREL (the USA’s National Renewable Energy Lab), most data centers operate at a PUE (power usage effectiveness) of 1.8, that is, they use 180% of the power required by the servers, storage and networking equipment. The extra 80% is used mainly for cooling the electronics, but also includes lighting, HVAC, and other essential services for humans. NREL says that high efficiency data centers can achieve a PUE of 1.2.

The PUE for the Project Natick Phase 2 data center was reported to be 1.07. The only additional electricity needed beyond the IT equipment would probably be the power for cooling.
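To put those PUE numbers in perspective, here’s a quick calculation of how much of Natick’s 250KW feed would actually reach the IT gear at each efficiency level (simplified: it assumes the full feed is drawn at the stated PUE):

# How much of a 250 kW feed reaches the IT gear at different PUE levels.
# Simplified: assumes the whole feed is drawn and PUE = total power / IT power.
feed_kw = 250
for label, pue in [("typical DC", 1.8), ("efficient DC", 1.2), ("Natick Phase 2", 1.07)]:
    it_kw = feed_kw / pue
    print(f"{label:15s} PUE {pue:4.2f}: {it_kw:5.1f} kW to IT, {feed_kw - it_kw:5.1f} kW overhead")
# typical DC      PUE 1.80: 138.9 kW to IT, 111.1 kW overhead
# efficient DC    PUE 1.20: 208.3 kW to IT,  41.7 kW overhead
# Natick Phase 2  PUE 1.07: 233.6 kW to IT,  16.4 kW overhead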

Cooling for the Project Natick Phase 2 data center used seawater pumped through the back of server racks. The data center was placed on the seafloor at 35m (117ft) deep.

It kind of looked like a submarine. According to Microsoft, the data center was contracted for, built and deployed in under 90 days. The intent was to have the data center be smaller than a standard ISO shipping container. The data center was driven on top of an 18 wheeler from where it was built to the Orkney Islands, including ferry crossings. It was placed on a triangular support, towed out to sea and deposited on the seafloor.

While 864 servers and 27.5PB of storage seem like a lot to most of us, for Microsoft Azure it’s too small to be used as a regional zone. But for (large) edge deployments, something this size or (10X) smaller might be just the thing.

Microsoft notes that 1/2 the world’s population lives within 200km (120mi) of the ocean. So there’s a ready supply of people and businesses that could take advantage of any underwater data center.

And of course, such a structure, when laid on the bottom of the ocean floor, could create an artificial reef (if left in place long enough). Artificial reefs have been made out of ocean oil rigs, sunken war ships and large chunks of steel/concrete, so an underwater data center could do so just as well. And maybe the heat coming from the data center cooling pumps would foster even more coral life.

Microsoft plans for Project Natick Phase 3 to be a full Azure AZ deployed underwater, which would include about 12 Phase 2 pressurized data center units.


New PCM could supply 36PB of memory to CPUs

Read an article this past week on how quantum geometry can enable a new form of PCM (phase change memory) based on stacks of metallic layers (SciTech Daily article: Berry curvature memory: quantum geometry enables information storage in metallic layers). That article referred to a Nature article (Berry curvature memory through electrically driven stacking transitions) behind a paywall, but I found a pre-print of it.

Figure 1 (from the pre-print): Signatures of two different electrically-driven phase transitions in WTe2, showing the possible stacking orders (monoclinic 1T’, polar orthorhombic Td,↑ or Td,↓) and their Berry curvature distributions, a schematic of the dual-gate h-BN capped WTe2 device, and the two distinct conductance hysteresis behaviors measured at 80 K: Type I (rectangular, doping-driven) and Type II (butterfly-shaped, E-field-driven).

The number one challenge in IT today is that data just keeps growing: 2+ exabytes today and much more tomorrow.

All that information takes storage, bandwidth and ultimately some form of computation to take advantage of it. While computation, bandwidth, and storage density all keep going up, at some point the energy required to read, write, transmit and compute over all these Exabytes of data will become a significant burden to the world.

PCM and other forms of NVM, such as Intel’s Optane PMEM, have brought a step change in how much data can be stored close to server CPUs today. And as Optane PMEM doesn’t require refresh, it has also reduced the energy required to store and sustain that data relative to DRAM. I have no doubt that density, energy consumption and performance will continue to improve for these devices over the coming years, if not decades.

In the mean time, researchers are actively pursuing different classes of material that could replace or improve on PCM with even less power, better performance and higher densities. Berry Curvature Memory is the first I’ve seen that has several significant advantages over PCM today.

Berry Curvature Memory (BCM)

I spent some time trying to gain an understanding of Berry curvatures. As near as I can gather, it’s a quantum-mechanical geometric effect that quantifies the topological characteristics of the entanglement of electrons in a crystal. Suffice it to say, it’s something that can be measured as an electro-magnetic field that provides phase transitions (on-off) in a metallic crystal at the topological level.

In the case of BCM, they used three to five atomically thin mono-layers of WTe2 (tungsten ditelluride), a Type II Weyl semi-metal that exhibits superconductivity, high magneto-resistance, and the ability to alter interlayer sliding through the use of terahertz (THz) radiation.

Fig. 4 (from the pre-print): Layer-parity selective Berry curvature memory behavior in the Td,↑ to Td,↓ stacking transition, measured via the nonlinear Hall effect. The sign of the nonlinear Hall signal (proportional to the Berry curvature dipole) flips in trilayer WTe2 but remains unchanged in four-layer crystals, consistent with symmetry analysis and first-principles calculations.

It appears that by using BCM in a memory:
  • To alter a memory cell takes “a few meV/unit cell, two orders of magnitude less than conventional bond rearrangement in phase change materials” (PCM), which in layman’s terms says it takes ~100X less energy to change a bit than in PCM (see the rough arithmetic just after this list).
  • Altering a memory cell uses terahertz (THz) radiation, that is, pulses of light or other electromagnetic radiation whose period is on the order of a picosecond or less. This is ~1000X faster than the PCM that exists today.
  • To construct a BCM memory cell takes between 13 and 16 atoms of W and Te, arranged in 3 to 5 layers of atomically thin WTe2 semi-metal.
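Putting rough numbers on those claims (these just restate the multipliers quoted above relative to today’s PCM; they are not measured device specs):

# Order-of-magnitude restatement of the claimed advantages vs. today's PCM.
pcm_write_energy = 1.0     # normalized PCM write energy per bit
pcm_write_time = 1.0       # normalized PCM write time per bit

bcm_write_energy = pcm_write_energy / 100    # "two orders of magnitude less"
bcm_write_time = pcm_write_time / 1000       # THz (picosecond-scale) switching

print(f"BCM write energy: ~{bcm_write_energy:.0%} of PCM's")   # ~1%
print(f"BCM write time:   ~{bcm_write_time:.1%} of PCM's")     # ~0.1%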

While it’s hard to see in the figure above, the way this memory works is that the inner layer slides left to right with respect to the picture, and it’s this realignment of atoms between the three or five layers that gives rise to the changes in the Berry curvature phase space, providing on-off switching.

To get from the lab to a product is a long road, but the fact that it has density, energy and speed advantages measured in multiple orders of magnitude certainly bodes well for its potential to disrupt current PCM technologies.

Potential problems with BCM

Nonetheless, even though it exhibits superior performance characteristics with respect to PCM, there are a number of possible issues that could limit its use.

One concern (on my part) is that the interlayer sliding may induce some sort of fatigue. Although I’ve heard that mechanical fatigue at the atomic level is not nearly as much of a concern as it is in larger (greater than atomic scale) structures, I must assume this induces some stress and, as such, limits the write-cycle endurance of BCM.

Another possible concern is how to shrink the spot size of the THz radiation so that it only writes a small area of the material. Yes, one memory cell can be measured by the width of 3 atoms, but the next question is how far away the next memory cell needs to be placed. The laser used in BCM was focused down to ~1.5 μm, which is 1,000X bigger than the BCM memory cell width (~1.5 nm).

Yet another potential problem is that current BCM must be kept in a continuous flow of liquid nitrogen (at 80K). It’s unclear how strict a requirement this temperature is for BCM to function, but no mainstream computers today require this level of cooling.

Figure 3 (from the pre-print): Td,↑ to Td,↓ stacking transitions with preserved crystal orientation in the Type II (E-field-driven) hysteresis. In-situ SHG intensity, Raman spectra and polarization-resolved SHG measurements on four- and five-layer Td-WTe2 devices all indicate ferroelectric switching between upward and downward polarization stackings (possibly via a 1T’ intermediate state) while the crystal orientation is preserved.

Finally, from my perspective, there’s the question of whether such a memory can be stacked vertically, with a higher number of layers. Yes, there are three to five layers of WTe2 used in BCM, but can you put another three to five layers on top of that, and then another? Although the researchers used three, four and five layer configurations, it appears that while the layer count changed the amplitude of the Berry curvature effect, it didn’t seem to add more states to the transition. If we were to add more layers of WTe2, would we be able to discern, say, 16 different states (like QLC NAND today)?

~~~~

So there’s a ways to go to productize BCM. But, aside from eliminating the low-temperature requirements, everything else looks pretty doable, at least to me.

I think it would open up a whole new dimension of applications, if we had say 60TB of memory to compute with, don’t you think?

Comments?

[Updated the title from 60TB to PB to 36PB as I understood how much memory PMEM can provide today…, the Eds.]


Storage that provides 100% performance at 99% full

A couple of weeks back we were talking with Qumulo at Storage Field Day 20 (SFD20) and they mentioned that they were able to provide 100% performance at 99% full (see their SFD20 video session here). I was a bit incredulous of this, seeing as how every other modern storage system’s performance degrades long before it gets to 99% capacity.

So I asked them to explain how this was possible. But before we get to that, a little background on modern storage systems is warranted.

The perils of log structured file systems

Most modern storage systems use a log structured file system: when they write data, they append it to a sequential log and use a virtual addressing scheme to record where the data for each address is located, creating a (data) log of written blocks.

However, when data is overwritten, it leaves gaps in these data logs. These gaps need to somehow be recycled (squeezed out) so that the space can be consumed as storage capacity again. This recycling process is commonly called “garbage collection”.

Garbage collection does its work by reading heavily gapped log files and re-writing the old, but still current, data into a new log. This frees up those gaps to be reused. But garbage collection like this takes reading and writing of logs to free up space.

Now as log structured file systems get (70-80-90%) full, they need to spend more and more system time and effort (= performance) garbage collecting. This takes system (IO) performance away from normal host IO activity, which is why I didn’t believe that Qumulo could offer 100% IO performance at 99% full.

But there was always another way to supply storage virtualization (read: snapshotting) besides log files. Yes, it might involve more metadata (table) management, but what it costs in extra metadata, it gives back by requiring no garbage collection.

How Qumulo does without garbage collection

Qumulo has a scaled block store as the back end of their file and object cluster store. And yes, it’s still a virtualized block store, BUT it’s not a log structured file store.

It seems that there’s a virtual-to-physical mapping table that is used by Qumulo to determine the physical address of any virtual block in the file system. And files are allocated to virtual blocks directly through the use of B-tree metadata. These B-trees indicate which virtual blocks are in use by a file and its snapshots.

If a host overwrites a data block, the old block can be freed (if it’s not being used in a snapshot) and placed on a freed block list, and a new block is allocated in its place. The file’s allocated-blocks B-tree is updated to reflect the new block, and that’s it.
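As a toy illustration of why this avoids garbage collection entirely (my own sketch of the idea described above, not Qumulo’s actual data structures), an overwrite just swaps a pointer in the block map and returns the old block to a free list:

# Toy model of overwrite handling with a free list and a per-file block map
# (standing in for Qumulo's B-tree metadata). Not Qumulo code; just the idea:
# freed blocks go straight back on the free list, so no log compaction / GC
# pass is ever needed to reclaim space.
class ToyBlockStore:
    def __init__(self, num_blocks):
        self.free_list = list(range(num_blocks))   # physical blocks available
        self.file_map = {}                         # (file, offset) -> physical block

    def write(self, file, offset, snapshot_blocks=frozenset()):
        old = self.file_map.get((file, offset))
        new = self.free_list.pop()                 # allocate a fresh block
        self.file_map[(file, offset)] = new        # update the "B-tree"
        if old is not None and old not in snapshot_blocks:
            self.free_list.append(old)             # reclaim immediately, no GC
        return new

store = ToyBlockStore(num_blocks=8)
store.write("data.csv", 0)
store.write("data.csv", 0)     # overwrite: old block is instantly reusable
print(len(store.free_list))    # 7 -> capacity never gets "stranded" in old logs

Contrast that with a log structured store, where an overwritten block just becomes a gap in an old log segment that a garbage collector must later read and rewrite to reclaim.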

For snapshots, Qumulo uses something they call a “write-out-of-place” process when data that a snapshot points to is overwritten. Again, it appears as if snapshots are some extra metadata associated with a file’s B-tree that defines the data in the snapshot.

The problem comes in when a file is deleted. If it’s a big enough file (TB-PB?), there could be millions to billions of blocks that have to be freed up. This would take entirely too long for a delete command, so it’s done in the background. Qumulo calls this “reclaim delete”. So a delete of a big file unlinks the block B-tree from the directory and puts it on this reclaim delete work queue to free up its blocks later. Similarly, when a big snapshot is deleted, Qumulo performs a background process called “reclaim snapshot” for snapshot-unique blocks.

As can be seen (it’s very hard to see given the coloration of the chart) from this screen shot of Qumulo’s session at SFD20, reclaim delete and reclaim snapshot are being done concurrently (in the background) with normal system IO. What’s interesting to note here is that reclaim IO (delete and snapshot) is going on all the time during the customer’s actual work. Why the write throughput drops significantly during the 27th-29th of July is hard to understand, but in the one case where it’s most serious (the middle of July 28th), reclaim IO also drops significantly. If reclaim IO were impacting write performance, I would have expected it to go higher when write throughput went lower. But that’s not the case. From what I can see above, reclaim IO has no impact on read or write throughput at this customer.

So essentially, by using a backing block store that does no garbage collection (not using a log structured file system), Qumulo is able to offer 100% system IO performance at 99% full – woah.

Can we back up a PB?

Tradition says no way. IT backup history says not on your life. Common sense would say never in a million years.

Most organizations with a PB of data or more depend on remote replication to protect against a data center outage or massive loss of data. This of course costs ~2X your original data center, and for some organizations one copy is not enough, so make that ~3X.

I don’t know what PB scale data storage costs these days, but I can’t believe it’s under a couple million $USD in hw and sw costs, and probably at least another million or so in OpEx/year. Multiply that by 2 or 3X and you’re now talking real money.

How could backup help?

Well for one you wouldn’t need replicas, so that would cut your hw & sw acquisition costs by a factor of 2 or 3. But backup storage is not free either. So you’d probably need to add back 30-50% of the original data center in hw & sw costs for backups.

You certainly wouldn’t need as many admins. And power for backup storage should also be substantially less. So maybe your OpEx would only be 1.5X in total for the original PB and its backups.

But what could possibly back up a PB of data?

We were talking with Igneous at Cloud Field Day 8 (CFD8, see their video here) a couple of weeks back and they said they could, and do, back up PBs of data for customers today. A while back, we also talked with them on a GreyBeards on Storage podcast.

The problems with backing up a PB seem insurmountable. First you have to be able to scan a PB of data. This means looking into multiple file systems on many different hardware platforms, across potentially multiple data centers, and that’s just to get a baseline of what all needs to be backed up.

Then at some point you actually have to store all that data on backup storage. So, to gain some cost advantage, you’d want to compress and deduplicate a PB of data, so that the first full backup wouldn’t take a full PB of backup storage.

Then of course you have to transfer a PB of data to your backup storage, in something that wouldn’t take months to perform. And that just gets you the first full backup.

Next comes the daily scan of what’s changed. This has to re-scan your PB of data to find the 100TB or so that’s changed over the last 24 hrs. Sometime after that scan completes, all that 100TB or so of changed data needs to be compressed, deduped and transferred again to backup storage.

And if that’s not enough, you have to do it all over again, every day, from now on, almost forever. And data continues to grow. So 1PB today is likely to be 2PB or more in 12 months (it’s great to be in the storage business).

So those are the challenges. How can it be done, effectively, day in and day out, reliably enough that IT can depend on their data being backed up?

Igneous to the rescue…

First, Igneous came out of stealth a while back (listen to our podcast) with a couple of unique capabilities needed for massive data repository discovery and analysis. That is, they built a unique engine to scan and index PB scale data repositories. This was so they could provide administrators better visibility into their PB scale data repositories. But this isn’t about that product, it’s about backup.

But some of the capabilities they needed to support that product helped them perform backups as well. For instance, their scan needed to handle PBs of data. They came up with AdaptiveSCAN, which doesn’t use the standard NFS and SMB data transfer paths to gain access to file metadata. Opening a file over NFS or SMB takes quite a lot of NFS or SMB transactions, but to access metadata only, one doesn’t have to use all those NFS and SMB capabilities; it can be done with much less overhead, even when using NFS or SMB.

Of course, having a way to scan billions of files was a major accomplishment, but then where do you put all that metadata? And how can you access it effectively to support backing up a PB data repository? They needed some serious data indexing capabilities, and so came up with InfiniteINDEX.

Now a trillion-item index seems a bit much, even for PB scale repositories. But my guess is they have eyes on taking their PB scale backups and going after even bigger fish. That is, offering backups for EB scale data repositories. And that might just take a trillion-item index.

Next, moving PBs or even TBs of data quickly is no small trick. As the development team at Igneous mostly came from unstructured data providers, they also understood and have access to the APIs for most storage vendors (NetApp, Dell-EMC Isilon, Pure FlashBlade, Qumulo, etc.). As such, where available, they utilized those native vendor storage API calls to help them move data, rather than having to open an NFS or SMB file and read it.

Of course, even doing all that, moving 100TBs of data around or scanning PB sized data repositories is going to take a lot of processing and IO bandwidth to do in a reasonable period of time. 

So another capability they developed is massive parallelism. That is, being able to distribute scan, indexing or data movement work out to multiple systems. In that fashion, it can be accomplished in significantly less wall clock time.
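As a toy illustration of the idea (my own sketch, not Igneous’s implementation): a changed-file scan looks only at metadata (mtime) rather than opening files, and the directory tree can be split across worker processes so wall clock time drops roughly with the number of workers.

# Toy parallel incremental scan: find files changed since the last backup by
# checking metadata (mtime) only, with top-level directories farmed out to a
# pool of workers. Illustrative sketch, not Igneous's actual scanner.
import os, time
from concurrent.futures import ProcessPoolExecutor

def changed_files(root, since_epoch):
    """Walk one subtree and return paths modified after the last backup."""
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime > since_epoch:   # metadata only
                    changed.append(path)
            except OSError:
                pass                                       # file vanished mid-scan
    return changed

if __name__ == "__main__":
    last_backup = time.time() - 24 * 3600                  # "since yesterday"
    top = "/mnt/bigfs"                                      # placeholder mount point
    subtrees = [os.path.join(top, d) for d in os.listdir(top)]
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = pool.map(changed_files, subtrees, [last_backup] * len(subtrees))
    to_backup = [p for sub in results for p in sub]
    print(f"{len(to_backup)} files changed since last backup")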

Well, with all that, they pretty much had the guts of a backup application for PB data repositories, but they still didn’t have the glue to put it all together. Recently they announced just that: Igneous DataProtect, a full scale backup application for PBs of data.

I suppose I haven’t done justice to all of what they have developed or talked about at their session, so I would suggest viewing their talk at CFD8 and listening to our GBoS podcast to learn more. They did demo their product at CFD8 but I believe it was a canned demo.

I didn’t think I’d see the day when some vendor would offer backup services for PBs of data let alone be shooting for more, but there you have it. Igneous means to take your PB scale data repositories and make them as easy to operate as TB scale data repositories. They call that democratizing data.

Comments?

See these other CFD8 bloggers write ups on Igneous.

CFD8  – Igneous Follow Up  by Nate Avery (@Nathaniel_Avery)

Picture credit(s): All from screen saves during Igneous’s session at CFD8

DNA storage using nicks

Read an article the other day in Scientific American (“Punch card” DNA …) which was reporting on a Nature Magazine Article (DNA punch cards for storing data… ). The articles discussed a new approach to storing (and encoding) data into DNA sequences.

We have talked about DNA storage over the years (most recently, see our Random access DNA object storage post) so it’s been under study for almost a decade.

In prior research on DNA storage, scientists encoded data directly into the nucleotides used to store genetic information. As you may recall, there are two complementary nucleotide pairs, A-T (adenine-thymine) and G-C (guanine-cytosine), that constitute the genetic code in a DNA strand. One could use one of these pairs to encode a 1 bit and the other for a 0 bit, and just lay them out along a DNA strand.

The main problem with nucleotide encoding of data in DNA is that it’s slow to write and read, and very error prone (storing data in individual DNA nucleotides is a lossy form of data storage). Researchers have now come up with a better way.

Using DNA nicks to store bits

One could instead encode information in DNA by utilizing the topology of a DNA strand. Each DNA strand is actually made up of a sugar-phosphate backbone with a nucleotide (A, C, G or T) hanging off of it, and then a hydrogen bond to its nucleotide complement (T, G, C or A, respectively), which is attached to another sugar-phosphate backbone.

It appears that one can deform the sugar-phosphate backbone at certain positions and retain an intact DNA strand. It’s in this deformation that the researchers are encoding bits, and they call this a “DNA nick”.

Writing DNA nick data

The researchers have taken a standard DNA strand (from E. coli) and identified unique sites on it that they can nick to encode data. They have identified multiple (mostly unique) sites for nick data along this DNA, which the scientists call “registers” but we would call sectors or segments. Each DNA sector can contain a certain amount of nick data, say 5 to 10 bits. The selected DNA strand has enough unique sectors to record 80 bits (10 bytes) of data. Not quite a punch card (80 bytes of data), but it’s early yet.

Each register or sector is made up of 450 base (nucleotide) pairs. As DNA has two separate strands connected together, the researchers can increase DNA nick storage density by writing both strands, creating a sort of two-sided punch card. They use nicks on this other, alternate (“anti-sense”) strand for the value “2”. We would have thought they would have used the absence of a nick on this alternate strand as a “3”, but they seem to just use it as another way to indicate “0”.

The researchers found an enzyme, PfAgo (Pyrococcus furiosus Argonaute), that they could use to nick a specific position on a DNA strand. The enzyme can be designed to nick distinct locations and registers (sectors) along the DNA strand. They designed 1024 (2**10) versions of this enzyme to create all possible 10-bit data patterns for each sector on the DNA strand.
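As a toy model of the encoding scheme (my own sketch based on the description above, not the researchers’ actual protocol), each register holds 10 bits, so 10 bytes of data maps onto 8 registers, and each register’s value selects which of the 2**10 possible nick patterns (enzyme combinations) to apply:

# Toy model of nick encoding: 10 bytes (80 bits) split into 8 registers of
# 10 bits each; each register's value picks the nick pattern (which of the
# 2**10 possible enzyme combinations) to apply. Illustrative only.
def encode_to_registers(data: bytes, bits_per_register: int = 10):
    assert len(data) * 8 % bits_per_register == 0
    bitstring = "".join(f"{byte:08b}" for byte in data)
    return [
        int(bitstring[i:i + bits_per_register], 2)       # register value 0..1023
        for i in range(0, len(bitstring), bits_per_register)
    ]

def decode_from_registers(registers, bits_per_register: int = 10) -> bytes:
    bitstring = "".join(f"{val:0{bits_per_register}b}" for val in registers)
    return bytes(int(bitstring[i:i + 8], 2) for i in range(0, len(bitstring), 8))

regs = encode_to_registers(b"HELLO WORL")     # exactly 10 bytes = 80 bits
print(regs)                                   # 8 register values, each < 1024
assert decode_from_registers(regs) == b"HELLO WORL"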

Writing DNA nick data is done via adding the proper enzyme combinations to a solution with the DNA strand. All sector writes are done in parallel and it takes about 40 minutes.

Also, the same PfAgo enzyme sequence is able to write (nick) multiple DNA strands without additional effort. So we can replicate the data as many times as there are DNA strands in the solution, effectively replicating the DNA nick data for disaster recovery.

Reading DNA nick data

Reading the DNA nick data is a bit more complicated.

In Figure 1, the read process starts by denaturing the DNA (splitting the double strands, dsDNA, into single strands, ssDNA) and then splitting the single strands up based on register or sector length, which are then sequenced. The specific register (sector) sequences are identified in the sequence data and can then be read/decoded and placed into the 80-bit string. The current read process is destructive of the DNA strand (read once).

There was no information on the read time but my guess is it takes hours to perform. Another (faster) approach uses a “two-dimensional (2D) solid-state nanopore membrane” that can read the nick information directly from a DNA string without dsDNA-ssDNA steps. Also this approach is non-destructive, so the same DNA strand could be read multiple times.

Other storage characteristics of nicked DNA

Given the register nature of the nicked DNA data organization, it appears that data can be read and written randomly, rather than sequentially. So nicked DNA storage is by definition, a random access device.

Although not discussed in the paper, it appears as if the DNA nick data can be modified. That is, the same DNA strand could have its data modified (written multiple times).

The researchers claim that nicked DNA storage is so reliable that there is no need for error correction. I’m skeptical, but it does appear to be more reliable than previous generations of DNA storage encoding. However, there is a possibility that during the destructive read out we could lose a register or two. Yes, one would know that those register bits are lost, which is good. But some level of ECC could be used to reconstruct any lost register bits, with some reduction in data density.

The one significant advantage of DNA storage has always been its exceptional data density, or bits stored per volume. Nicked storage reduces this volumetric density significantly: at 10 bits per 450 or so base pairs (plus some additional base pairs required for register spacing), nicked DNA storage reduces DNA storage volumetric density by at least a factor of 45X. Current DNA storage is capable of storing 215M GB per gram, or 215 PB/gram. Reducing this by, let’s say, 100X would still be a significant storage density at ~2PB/gram.
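The density arithmetic behind those figures (using the numbers quoted above and assuming roughly 1 bit per base pair for conventional nucleotide encoding) is straightforward:

# Density comparison using the figures quoted above. Assumes conventional
# nucleotide encoding stores ~1 bit per base pair; nick encoding stores
# 10 bits per ~450 bp register (ignoring extra spacing base pairs).
bits_per_bp_conventional = 1.0
bits_per_bp_nicked = 10 / 450

density_penalty = bits_per_bp_conventional / bits_per_bp_nicked
print(f"Nick encoding is ~{density_penalty:.0f}X less dense per base pair")   # ~45X

conventional_pb_per_gram = 215            # 215M GB/gram = 215 PB/gram
for penalty in (45, 100):
    print(f"At {penalty}X penalty: ~{conventional_pb_per_gram/penalty:.2f} PB/gram")
# -> ~4.78 PB/gram at 45X, ~2.15 PB/gram at 100X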

Comments?
