Facebook’s (Meta) Kangaroo, a better cache for billions of small objects

Read an article this week in Blocks and Files, Facebook’s Kangaroo jumps over flash limitations which spiked my interest and I went and searched for more info on this and found a fb blog post, Kangaroo: A new flash cache optimized for tiny objects which sent me to an ACM SOSP (Syposium on O/S Principles) best paper of 2021, Caching billions of tiny objects on flash.

First, as you may recall flash has inherent limitations when it comes to writing. The more writes to a flash device the more NAND cells start to fail over time. Flash devices are only rated for some amount (of standard, ~4KB) block writes, For example, the Micron 5300 Max SSD only supports 3-5 (4KB blocks) DWPD (drive writes per day). So, a 2TB Micron Max 5300 SSD can only sustain from ~1.5 to 2.4B 4KB block writes per day. Now that seems more than sufficient for most work but when somebody like fb, using the SSD as a object cache, writes a few billion or more 100B(yte) objects and does this day in or day out, can consume an SSD in no time. Especially if they are writing one 100 B object per block

So there’s got to be a better way to cache small objects into bigger blocks. Their paper talks of two prior approaches:

  • Log structured storage – here multiple 100B objects are stored in a single a 4KB block and iwritten out with one IO rather than 40. This works fairly well but the index ,which maps an object key, to a log location, takes up a lot of memory space. If your caching ~3B 100B objects in a logs and each object index takes 16 bytes that’s a data space of 48GB.
  • Associative set storage – here each object is hashed into a set of (one or more) storage blocks and is stored there. In this case there’s no DRAM index but you do need a quick way to determine if an object is in the set storage or not. This can be done with bloom filters (see: wikipedia article on bloom filters). So if each associative set stores 400 objects and one needs to store 3B objects one needs a 30 MB of bloom filters (assuming 4bytes each). The only problem with associative sets is that when one adds an element to a set. the set has to be rewritten. So if over time you add 400 objects to a set you are writing that set 400 times. All of which eats into the DWPD budget for the flash storage.

In Kangaroo, fb engineers have combined the best of both of these together and added a small DRAM cache.

How does it work?

Their 1st tier is a DRAM cache, which is ~1% of the capacity of the whole object cache. Objects are inserted into the DRAM cache first and are evicted in a least recently used fashion, that is object’s that have not been used in the longest time are moved out of this cache and are written to the next layer (not quite but get to that in a moment).

Their 2nd tier is a log structured system, at ~5% of cache capacity. They call this a KLog and it consists of a ring of 4KB blocks on SSD, with a DRAM index telling where each object is located on the ring.. Objects come in and are buffered together into a 4KB block and are written to the next empty slot in the ring with its DRAM index updated accordingly. Objects are evicted from Klog in such a way that a group of them, that would be located in the same associative set and are LRU, can all be evicted at the same time. They have structured the Klog DRAM index so that it makes finding all these objects easy. Also any log structured system needs to deal with garbage collection, Let’s say you evict 5 objects in a 4K block, that leaves 35 that are still good. Garbage collection will read a number of these partially full blocks and mash all the good objects together leaving free space for new objects that need to be cached.

The 3rd and final tier is a set associative store, they call the Kset that uses bloom filters to show object presence. For this tier, an object’s key is hashed to find a block to put it in, the block is read and the object inserted and the block rewritten. Objects are evicted out of the set associative store based on LRU within a block. The bloom filters are used to determine if the object exists in an set associative block.

There are a few items missing from the above description. As can be seen in Figure 3B above, Kangaroo can jettison objects that are LRUed out of DRAM instead of adding them to the Klog. The paper suggests this can be done purely at random, say only admit, into the Klog, 95% of the objects at random being LRUed from DRAM. The jettison threshold for Klog to Kset is different. Here they will jettison single object sets. That is if there were only one object that would be evicted and written to a set, it’s jettisoned rather than saved in the Kset. The engineers call this a Kset threshold of 2 (indicating minimum number of objects in a single set that can be moved to Kset)..

While understanding an objects LRU is fairly easy if you have a DRAM index element for each block, it’s much harder when there’s no individual object index available, as in Kset.

To deal with tracking LRU in the Kset, fb engineers created a RRIParoo index with a DRAM index portion and a flash resident index portion.

  • RRIParoo’s DRAM index is effectively a 40 byte bit map which contains one bit per object, corresponding to its location in the block. A bit on in this DRAM bitmap indicates that the corresponding object has been referenced since the last time the flash resident index has been re-written. .
  • RRIParoo’s flash resident index contains 3 bit integers, each one corresponding to an object in the block. This integer represents how many clock ticks, it has been since the corresponding object has been referenced. When the need arises to add an object to a full block, the object clock counters in that block’s RRIP flash index are all incremented until one has gotten to the oldest time frame b’111′ or 7. It is this object that is evicted.

New objects are given an arbitrary clock tick count say b’001′ or 1 (as shown in Fig. 6, in the paper they use b’110′ or 6), which is not too high to be evicted right away but not too low to be considered highly referenced.

How well does Kangaroo perform

According to the paper using the same flash storage and DRAM, it can reduce cache miss ratio by 29% over set associative or log structured cache’s alone. They tested this by using simulations of real world activity on their fb social network trees.

The engineers did some sensitivity testing using various Kangaroo algorithm parameters to see how sensitive read miss rates were to Klog admission percentage, RRIParoo flash index element (clock tick counter) size, Klog capacity and Kset admission threshold.

Kangaroo performance read miss rate sensitivity to various algorithm parameters

Applications of the technology

Obviously this is great for Twitter and facebook/meta as both of these deal with vast volumes of small data objects. But databases, Kafka data streams, IoT data, etc all deal with small blocks of data and can benefit from better caching that Kangaroo offers.

Storage could also use something similar only in this case, a) the objects aren’t small and b) the cache is all in memory. DRAM indexes for storage caching, especially when we have TBs of DRAM cache, can be still be significant, especially if an index element is kept for each block in cache. So the technique could also be deployed for large storage caches as well.

Then again, similar techniques could be used to provide caching for multiple tiers of storage. Say DRAM cache, SSD Log cache and SSD associative set cache for data blocks with the blocks actually stored on large disks or QLC/PLC SSDs.

Photo credit(s):

NASA’s journey to the cloud – part 1

Read an article the other day, NASA Turns to the Cloud for Help With Next-Generation Earth Missions about how NASA was had started to migrate all their data to the cloud and intended to store all new data there as well. The hope is that researchers would no longer need to download NASA data but rather could access it directly using cloud compute resources.

It turns out that newer earth science satellites are generating so much data that hosting all this data is becoming a challenge and with the quantities being discussed, researchers downloading the data, to perform research in their own environments may take days.

Until recently, earth science data has been hosted and downloadable from NASA, ESA and other space organization sites. For example, see NASA’s GHCR DAAC (Global Hydrometerological Resource Center Distributed Active Archive Center), ESA EarthOnline, JAXA GPM website, etc. Generally one could download a time series of data from any of their prior and current earth/planetary science missions without too much trouble.

The Land Processes Distributed Active Archive Center (LP DAAC) archives and distributes Global Forest Cover Change (GFCC) data products through the NASA Making Earth System Data Records for Use in Research Environments (MEaSUREs) (https://earthdata.nasa.gov/community/community-data-system-programs/measures-projects) Program….

But NASA’s newest earth science satellites will be generating lot’s of data. For instance, the SWOT (Surface Water and Ocean Topography) mission data load will be 20TB/day and the NISAR (NASA-Indian Synthetic Aperture Radar) mission data load will be 80TB/day. And it’s only getting worse as more missions with newer instruments come online.

NASA estimates that, over time, they will store 247PB of data in their EarthData Cloud. At the moment, they have already migrated some (all of ASF [Alaska Satellite Facility] DAAC and some of PO.DAAC [Physical Ocean]) of their Earth Science data to AWS (us-west-2) and over time all of it will migrate there.

NASA will eat any egress charges for EOSDIS data and are also paying any and all hosting fees to storage the data in AWS. Unclear whether they are using standard S3 or S3-Intelligent Tiering. And presumably they are using S3 replication to ensure they don’t lose DAAC data in the cloud, but I don’t see any evidence of that in the literature I’ve read. Of course this doubles the storage costs for their 247PB of DAAC data.

Access to all this data is available to anyone with an EarthData login. There you can register for a profile to access NASA earth sciences data.

NASA’s EarthData also offers a number of AWS cloud based services to help one access this data:

  • EarthData search – filtered search facility to access NASA EarthData by platform (e.g. satellite), instrument (e.g. camera/visual data), organization (e.g. NASA/JPL), etc.
  • EarthData Common Metadata Repository – API driven metadata repository that ” catalogs all data and service metadata records for NASA’s EOSDIS (Earth Observing System Data and Information System) system” data, that can be accessed by anyone, which includes programatic access to EarthData search.
  • EarthData Harmony – which is a EarthData Jupyter notebook examples and API documentation to perform research on earth science data in the EarthData cloud.

One reason to movie EOSDIS DAAC data to the cloud is to allow researchers to not have to download data to run their analysis. By using in cloud EC2 compute instances, they can run their research in AWS with direct , high speed access to the EarthData.

Of course, the researcher would need to purchase their EC2 compute facility directly from AWS. w. NASA publishes a sort of AWS pricing primer for researchers to use AWS EC2 compute to do research directly on the data in the cloud. Also NASA offers a series of tutorials on how to use the AWS cloud for doing research on NASA DAAC data.

Where to from here?

I find this all somewhat discouraging. Yes it’s the Gov’t but one needs to wonder what the overall costs of hosting NASA DAAC data on the AWS cloud will be over the long haul. Most organizations use the cloud to prototype and scale up services but once these services have stabilized, theymigrate them back to onprem/CoLoinfrastructure. See for example, Dropbox’s move away from the [AWS] cloud for ~600PB of data.

I get it, the public cloud allows for nearly infinite data scaleability. But cloud storage costs is not cheap, especially when you are talking about 100s of PBs. And in today’s world, with a whole bunch of open source solutions for object storage and services, one can almost recreate any cloud service in your own data center, at much lower price.

Sure it will still take IT infrastructure and personnel to put it all together. But NASA doesn’t seem to be lacking in infrastructure or IT personnel. Even if you are enamored with AWS services and software infrastructure, one can always run AWS Outpost in your data centers. And DAAC services seem to be pretty stable over time. Yes new satellites will generate more data, but the data load is understood and very predictable. So one should be able to anticipate all this and have infrastructure in place to deal with it.

Yes, having the ability to run analysis in the cloud directly on the data sitting also in the cloud is useful, especially not having to download TB of data. But these costs can also be significant and they are born by the researcher not NASA.

Another grip is why use AWS alone. The other cloud providers all have similar object storage and compute capabilities. It seems wiser to me to set up the EarthData service such that, different DAACs reside in different clouds. This would he more complex and harder to administer and use but I believe in the long run would lead to better more effective services at a more reasonable price.

Going to the cloud doesn’t have to be a one way endeavor. After using the cloud for a while, NASA should have a better idea of the costs of doing so and at that time understand better what it can and cannot afford to do on its own.

It will be interesting to see what ESA, JAXA, CERN and other big science organizations do as they are all in the same bind, data seems to be growing unbounded.

Picture Credit(s):

CTERA, Cloud NAS on steroids

We attended SFD22 last week and one of the presenters was CTERA, (for more information please see SFD22 videos of their session) discussing their enterprise class, cloud NAS solution.

We’ve heard a lot about cloud NAS systems lately (see our/listen to our GreyBeards on Storage podcast with LucidLink from last month). Cloud NAS systems provide a NAS (SMB, NFS, and S3 object storage) front-end system that uses the cloud or onprem object storage to hold customer data which is accessed through the use of (virtual or hardware) caching appliances.

These differ from file synch and share in that Cloud NAS systems

  • Don’t copy lots or all customer data to user devices, the only data that resides locally is metadata and the user’s or site’s working set (of files).
  • Do cache working set data locally to provide faster access
  • Do provide NFS, SMB and S3 access along with user drive, mobile app, API and web based access to customer data.
  • Do provide multiple options to host user data in multiple clouds or on prem
  • Do allow for some levels of collaboration on the same files

Although admittedly, the boundary lines between synch and share and Cloud NAS are starting to blur.

CTERA is a software defined solution. But, they also offer a whole gaggle of hardware options for edge filers, ranging from smart phone sized, 1TB flash cache for home office user to a multi-RU media edge server with 128TB of hybrid disk-SSD solution for 8K video editing.

They have HC100 edge filers, X-Series HCI edge servers, branch in a box, edge and Media edge filers. These later systems have specialized support for MacOS and Adobe suite systems. For their HCI edge systems they support Nutanix, Simplicity, HyperFlex and VxRail systems.

CTERA edge filers/servers can be clustered together to provide higher performance and HA. This way customers can scale-out their filers to supply whatever levels of IO performance they need. And CTERA allows customers to segregate (file workloads/directories) to be serviced by specific edge filer devices to minimize noisy neighbor performance problems.

CTERA supports a number of ways to access cloud NAS data:

  • Through (virtual or real) edge filers which present NFS, SMB or S3 access protocols
  • Through the use of CTERA Drive on MacOS or Windows desktop/laptop devices
  • Through a mobile device app for IOS or Android
  • Through their web portal
  • Through their API

CTERA uses a, HA, dual redundant, Portal service which is a cloud (or on prem) service that provides CTERA metadata database, edge filer/server management and other services, such as web access, cloud drive end points, mobile apps, API, etc.

CTERA uses S3 or Azure compatible object storage for its backend, source of truth repository to hold customer file data. CTERA currently supports 36 on-prem and in cloud object storage services. Customers can have their data in multiple object storage repositories. Customer files are mapped one to one to objects.

CTERA offers global dedupe, virus scanning, policy based scheduled snapshots and end to end encryption of customer data. Encryption keys can be held in the Portals or in a KMIP service that’s connected to the Portals.

CTERA has impressive data security support. As mentioned above end-to-end data encryption but they also support dark sites, zero-trust authentication and are DISA (Defense Information Systems Agency) certified.

Customer data can also be pinned to edge filers, Moreover, specific customer (director/sub-directorydirectories) data can be hosted on specific buckets so that data can:

  • Stay within specified geographies,
  • Support multi-cloud services to eliminate vendor lock-in

CTERA file locking is what I would call hybrid. They offer strict consistency for file locking within sites but eventual consistency for file locking across sites. There are performance tradeoffs for strict consistency, so by using a hybrid approach, they offer most of what the world needs from file locking without incurring the performance overhead of strict consistency across sites. For another way to do support hybrid file locking consistency check out LucidLink’s approach (see the GreyBeards podcast with LucidLink above).

At the end of their session Aron Brand got up and took us into a deep dive on select portions of their system software. One thing I noticed is that the portal is NOT in the data path. Once the edge filers want to access a file, the Portal provides the credential verification and points the filer(s) to the appropriate object and the filers take off from there.

CTERA’s customer list is very impressive. It seems that many (50 of WW F500) large enterprises are customers of theirs. Some of the more prominent include GE, McDonalds, US Navy, and the US Air Force.

Oh and besides supporting potentially 1000s of sites, 100K users in the same name space, and they also have intrinsic support for multi-tenancy and offer cloud data migration services. For example, one can use Portal services to migrate cloud data from one cloud object storage provider to another.

They also mentioned they are working on supplying K8S container access to CTERA’s global file system data.

There’s a lot to like in CTERA. We hadn’t heard of them before but they seem focused on enterprise’s with lots of sites, boatloads of users and massive amounts of data. It seems like our kind of storage system.

Comments?

Storywrangler, ranking tweet ngrams over time

Read a couple of articles the past few weeks on a project in Vermont that has randomly selected 10% of all tweets (150 Billion) since the beginning of Twitter (2008) and can search and rank this tweet corpus for ngrams (1-, 2-, & 3-word phrases). All of these articles were reporting on a Science Advances article: Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter.

Why we need Storywrangler

The challenge with all social media is that it is transient, here now, (mostly) gone tomorrow. That is once posted, if it’s liked/re-posted/re-tweeted it can exist in echoes of the original on the service for some time, and if not, it dies out very quickly never to be seen (externally ever) again. While each of us could potentially see every tweet we have ever created (when this post is published it should be my 5387th tweet on my twitter account) but most of us cannot see this history for others.

All that makes viewing what goes on on social media impossible which leads to a lot of mis-understanding and makes it difficult to analyze. It would be great if we had a way of looking at social media activity in more detail to understand it better.

I wrote about this before (see my Computational anthropology & archeology post) and if anything, the need for such capabilities has become even more important in today’s society.

If only there was a way to examine the twitter-verse. What’s mainly lacking is a corpus of all tweets that have ever been tweeted. A way to slice, dice, search, and rank this text data would be a godsend to understanding (twitter and maybe social) history, in real time.

Storywrangler, has a randomized version of 10% of all tweets since twitter started. And it provides ngram searching and ranking over a specified time interval. It’s not everything but it’s a start.

Storywrangler currently has over 1 trillion (1- to 3- word) ngrams and they support ngram rankings for over 150 different languages.

Google books ngram viewer

The idea for the Storywrangler project came from Google’s books ngram viewer. Google’s ngram viewer has a corpus of Google books, over a time period (from 1800 to 2019) and allows one to search for ngrams (1- to 5-word phrases) over any time period they support.

Google’s ngram viewer charts ngrams with a vertical axis that is the % of all ngrams in their book corpus. One can see the rise and fall of ngrams, e.g., “atomic power”. The phrase “atomic power” peaked in Google books around 1960 at a height of 0.000260% of all 2 word ngrams. The time period level of granularity is a year.

The nice thing about Google books ngram data is you can download their book ngram data yourself. The data is of the form of tab separated list of rows with ngram text (1 to 5 words), year, how many times it occurred that year, on how many pages, on how many books on each row. Google books ngram data is generally about 2 years old.

Unclear just how much data is in Google’s books ngram database but for instance in the 1 gram English fiction list, they show a sample of two rows (the 3,000,000 and 3,000,001 rows) which are the 1978 and 1979 book counts for the word “circumvallate”.

Storywrangler tweet ngram viewer

The usage tab on the Storywrangler website provides a search engine that one can use to input N-grams that you want to search the corpus for and can visualize how their rank changes over time. For example, one can do a similar search on the “atomic power” ngram only for tweets.

From Storywrangler search one can see that peak tweet use of “Atomic Power” and “ATOMIC POWER” occurred somewhere in July of 2020 (only way to see the month is to hover over that line) and it’s rank reached somewhere around ~10,000 highest used tweet 2 word ngram during that time.

It’s interesting to see that ngram books and ngram twitter don’t seem to have any correlation. For example the prior best ranking for atomic power (~200Kth highest) was in June of 2015. There was no similar peak for book ngrams of the phrase.

For Storywrangler you can download a JSON or CSV version of the charts displayed. It’s not the complete ngram history that Google book ngram viewer provides. Storywrangler data is generally about 2 days old.

The other nice thing about Storywrangler is under the real-time tab it will show you ngram rankings at 15 minute intervals for whatever timeline you wish to see. Also under the trending tab it will show you the changing ranks for the top 5 ngrams over a selected time period. And the languagetab will do tracking for tweet language use for select languages. The common tab will track the ranking of most common ngrams (pretty boring mostly articles/prepositions) over time. And for any of these searches one can turn on or off retweet counting, which can help to eliminate bot activity.

Storywrangler provides a number of other statistics for ngrams other than just ranking such as odds (of occurring) and frequency (of occurrence). And one can also track rank change, old (years) rank vs. current (year) rank, rank (turbulence) divergence.

~~~~

Comments?

Photo Credit(s):

Kasten, Kafka & the quest to protect data #CFD11

We attended Cloud Field Day 11 (CFD11) last week and among the vendors at the show was Kasten by Veeam talking about their extensive Kubernetes data protection/DR/migration capabilities. All that was worthy of its own blog post but somewhere near the end of their session, they started discussing how they plan to backup Apache Kafka message traffic

We have discussed Kafka before (see our Data in motion post). For those who have read that post or are familiar with Kafka, you can ignore the Kafka primer section below and just for the record, I’m no Kafka expert.

Kafka primer

Kafka is a massively scalable message bus and real time stream processing system. Messages or records come in from Kafka producers and are sent to Kafka consumers or subscribers for processing. Kafka supplies a number of guarantees one of which is that messages are processed once and only once.

Messages or records are key-value pairs that are generated continuously and are processed by Kafka apps. Message streams are split into topics.

Kafka connectors allow message traffic to be extracted/generated from external databases and other systems . Kafka stream processors use Kafka streaming primitives and stream flow graphs to construct applications that process unbounded, continuously updated data sets.

Topics in Kafka can have multi-producers and multi-subscribers. Each topic can be further split across partitions which can be processed in different servers or brokers in a Kafka cluster. It’s this partitioning that allows Kafka to scale from processing 1 message to millions of messages per second.

Consumers/subscribers can also be producers, so a message that comes in and is processed by a subscriber could create other messages on topics that need to be processed. Messages are placed on topic partitions in the order they are received.

Messages can be batched and added to a partition or not. Recently Kafka added support for sticky partitions. Normally messages are assigned to partitions based on key hashing but sometimes messages have null keys and in this case Kafka assigned them to partitions in a round robin fashion. Sticky partitioning strategy amends that approach to use batches of messages rather than single messages when adding null key messages to partitions.

Kafka essentially provides a realtime streaming message/event processing system that scales very well.

Data protection for Kafka streams without Kasten

This is our best guess as to what this looks like, so bear with us.

Sometime after messages are processed by a subscriber they can be sent to a logging system. If that system records those messages to external storage, they would be available to be copied off-line and backed up.

How far behind real time, Kafka logs happen to be can vary significantly based on the incoming/outgoing message traffic, resources available for logging and the overall throughput of your logging process/storage. Could it be an hour or two behind, possibly but I doubt it, could it be 1 second behind, also unlikely. Somewhere in between those two extremes seems reasonable

Log processing messages just like any topic could be partitioned. So any single log would only have messages on that partition. So there would need to be some post process to stitch all these log partitions together to get a consistent message stream.

In any case, there is potentially a wide gulf between the message data that is being processed and the logs which hold them. But then they need to be backed up as well so there’s another gap introduced between the backed up data what is happening in real time This is especially true the more messages that are being processed in your Kafka cluster.

(Possible) data protection for Kafka streams with Kasten

For starters, it wasn’t clear what Kasten presented was a current product offering, beta offering or just a germ of an idea. So bear that in mind. See the videos of their CFD11 session here for their take on this.

Kafka supports topic replication. What this means is that any topic can be replicated to other servers in the Kafka cluster. Topics can have 0 to N replications and one server is always designated leader (or primary) for a topic and other replicas are secondary. The way it’s supposed to work is that the primary topic partition server will not acknowledge or formally accept any message until it is replicated to all other topic replicas.

What Kasten is proposing is to use topic replicas and take hot snapshots of replicated topic partitions at the secondary server. That way there should be a minimal impact on primary topic message processing. Once snapped, topic partition message data can be backed up and sent offsite for DR.

We have a problem with this chart. Our understanding is that if we take the numbers to be a message id or sequence number, each partition should be getting different messages rather than the the same messages. Again we are not Kafka experts. The Eds.

However, even though Kasten plans to issue hot snapshots, one after another, for each partition, there is still a small time difference between each snap request. As a result, the overall state of the topic partitions when snapped may be slightly inconsistent. In fact, there is a small possibility that when you stitch all the partition snapshots together for a topic, some messages may be missing.

For example, say a topic is replicated and when looking at the replicated topic partition some message say Msg[2001] was delayed (due to server load) in being replicated for partition 0, while Kasten was hot snapping replica partition 15 which already held Msg[2002]. In this case the snapshots missed Msg[2001]. Thus hot snapshots in aggregate, can be missing one or (potentially) more messages.

Kasten called this a crash consistent backup but it’s not the term we would use (not sure what the term would be but it’s worse than crash consistent). But to our knowledge, this was the first approach that Kasten (or anyone else) has described that comes close to providing data protection for Kafka messaging.

As another alternative, Kasten suggested one could make the backed up set of partitions better would by post processing all the snapshots to find the last message point where all prior messages were available and jettisoning any followon messages. In this way, after post processing, the set of data derived from the hot snaps would be crash consistent up to that message point..

It seems to me, the next step to create an application consistent Kafka messaging backup would require Kafka to provide some way to quiesce a topic message stream. Once quiesced and after some delay to accept all in flight messages, hot snaps could be taken. The resultant snapshots in aggregate would have all the messages to the last one accepted.

Unclear whether quiescing a Kafka topic stream, even for a matter of seconds to minutes, is feasible for a system that processes 1000s to millions of messages per second.

Comments?

Photo Credit(s):

Swarm learning for distributed & confidential machine learning

Read an article the other week about researchers in Germany working with a form of distributed machine learning they called swarm learning (see: AI with swarm intelligence: a novel technology for cooperative analysis …) which was reporting on a Nature magazine article (see: Swarm Learning for decentralized and confidential clinical machine learning).

The problem of shared machine learning is particularly accute with medical data. Many countries specifically call out patient medical information as data that can’t be shared between organizations (even within country) unless specifically authorized by a patient.

So these organizations and others are turning to use distributed machine learning as a way to 1) protect data across nodes and 2) provide accurate predictions that uses all the data even though portions of that data aren’t visible. There are two forms of distributed machine learning that I’m aware of federated and now swarm learning.

The main advantages of federated and swarm learning is that the data can be kept in the hospital, medical lab or facility without having to be revealed outside that privileged domain BUT the [machine] learning that’s derived from that data can be shared with other organizations and used in aggregate, to increase the prediction/classification model accuracy across all locations.

How distributed machine learning works

Distributed machine learning starts with a common model that all nodes will download and use to share learnings. At some agreed to time (across the learning network), all the nodes use their latest data to re-train the common model and share new training results (essentially weights used in the neural network layers) with all other members of the learning network.

Shared learnings would be encrypted with TLS plus some form of homomorphic encryption that allowed for calculations over the encrypted data.

In both federated and swarm learning, the sharing mechanism was facilitated by a privileged block chain (apparently Etherium for swarm). All learning nodes would use this blockchain to share learnings and download any updates to the common model after sharing.

Federated vs. Swarm learning

The main difference between federated and swarm learning is that with federated learning there is a central authority that updates the model(s) and with swarm learning that processing is replaced by a smart contract executing within the blockchain. Updating model(s) is done by each node updating the blockchain with shared data and then once all updates are in, it triggers a smart contract to execute some Etherium VM code which aggregates all the learnings and constructs a new model (or at least new weights for the model). Thus no node is responsible for updating the model, it’s all embedded into a smart contract within the Etherium block chain. .

Buthow does the swarm (or smart contract) update the common model’s weights. The Nature article states that they used either a straight average or a weighted average (weighted by “weight” of a node [we assume this is a function of the node’s re-training dataset size]) to update all parameters of the common model(s).

Testing Swarm vs. Centralized vs. Individual (node) model learning

In the Nature paper, the researchers compared a central model, where all data is available to retrain the models, with one utilizing swarm learning. To perform the comparison, they had all nodes contribute 20% of their test data to a central repository, which ran the common swarm updated model against this data to compute an accuracy metric for the swarm. The resulting accuracy of the central vs swarm learning comparison look identical.

They also ran the comparison of each individual node (just using the common model and then retraining it over time without sharing this information to the swarm versus using the swarm learning approach. In this comparison the swarm learning approach alway seemed to have as good as if not better accuracy and much narrower dispersion.

In the Nature paper, the researchers used swarm learning to manage the machine learning model predictions for detecting COVID19, Leukemia, Tuberculosis, and other lung diseases. All of these used public data, which included PBMC (peripheral blood mono-nuclear cells) transcription data, whole blood transcription data, and X-ray images.

Swarm learning also provides the ability to onboard new nodes in the network. Which would supply the common model and it’s current weights to the new node and add it to the shared learning smart contract.

The code for the swarm learning can be downloaded from HPE (requires an HPE passport login [it’s free]). The code for the models and data processing used in the paper are available from github. All this seems relatively straight forward, one could use the HPE Swarm Learning Library to facilitate doing this or code it up oneself.

Photo Credit(s):

Storageless data!?

I (virtually) attended SFD21 earlier this year and a company called Hammerspace presented discussing their vision for storageless data (see videos of their session at SFD21).

We’ve talked them before but now they have something to offer the enterprise – data mobility or storageless data.

The white board after David Flynn’s session at SFD8

In essence, customers want to be able to run their workloads wherever it makes the most sense, on prem, in private cloud, and in the public cloud among other places. Historically, it’s been relatively painless to transfer an application’s binary from one to another data center, to a managed service provider or to the public cloud.

And with VMware Cloud Foundation, Kubernetes, Docker and Linux operating everywhere, the runtime environment and other OS services that applications depend on are pretty much available in any of those locations. So now customers have 2 out of 3, what’s left?

It’s all about the Data

Data can take a very long time to move around a data center, let alone across the web between locations. MBs and even GBs of data may be relatively painless to move, but TBs of data can be take days, and moving PBs of data is suicidal.

For instance, when we signed up for a globally accessible file synch and share storage service, I probably had 75GB or so of data I wanted managed. It took literally several days of time to upload this. Yes, I didn’t have data center class internet access, but even that might have only sped this up 2-5X. Ok, now try this with 1TB or more and it’s pretty much going to take days, and you can easily multiple that by 10 to do a PB or more. And that’s if it happens to continue to perform the transfer without disruption.

So what’s Hammerspace storageless data got to do with any of this.

Hammerspace’s idea

It’s been sort of a ground truth of storage, since I’ve been in the industry (40+ years now), that not all random IO data is accessed at the same frequency. That is, some data is accessed a lot and other data accessed hardly at all. That’s why DRAM caching of data can be so important to a host or storage system.

Similarly for sequential access, if you can get the first blocks of data to the host and then stream the rest in time, a storage system can appear to read fast.

Now I won’t go into all the tricks of doing good data caching, (the secret sauce to every vendor’s enterprise storage), but if you can appear to cache data well, you don’t actually need to transfer all the data associated with an application to a location it’s running in, you can appear as if all the data is there, when actually only some of it is present.

Essentially, Hammerspace creates a global file system for your data, across any locations you wish to use it, with great caching, optimized data transfer and with real storage behind it. Servers running your applications mount a Hammerspace file system/share that stitches together all the file storage behind it, across all the locations it’s operating in.

An application request goes to Hammerspace and if the data is not present there, Hammerspace goes and fetches and caches blocks of data as fast as it can. This will let the application start performing IO while the rest of the data is being cached and if allowed, moved to the new location.

Storage can be not managed by Hammerspace, read-write managed by Hammerspace or read-only managed by Hammerspace. For customers who want the whole Hammerspace storageless data functionality they would use read-write mode. For those who just want to access data elsewhere read-only would suffice. Customers who want to continue to access data directly but want read access globally, would use the read-only mode.

Once read-write storage is assigned to Hammerspace grabs all the file metadata information on the storage system. Once this process completes, customers no longer access this file data directly, but rather must access it through Hammerspace. At that point, this data is essentially storageless and can be accessed wherever Hammerspace services are available.

How does Hammerspace do it

Behind the scenes is a lot of technology. Some of which is discussed in the SFD21 sessions (see video’s above). Hammerspace is not in the data path but rather in the control path of data access. But it does orchestrate data movement, and it does route data IO requests from an application to where the data (currently) resides.

Hammerspace also supports Service Level Objectives (SLOs) for performance, geolocation, security, data protection options,, etc. These can be used to keep data in particular regions, to encrypt data (using KMIP), ensure high performance, high data availability, etc.

Hammerspace can manage data across 32 separate sites. It takes a couple of hours to deploy. per site. Each site has a Hammerspace metadata service with standalone access to all data within that site. For example, standalone access could be used, in the event of a network loss.

At the moment, they support eventual consistency and don’t support a global lock service. Rather, Hammerspace uses a conflict resolution service in the event data is overwritten by two or more applications. For any file that was being updated in two or more locations, that file would be flagged as in conflict, Hammerspace would provide snapshots of the various versions of the file(s) and it would require some sort of manual intervention to resolve the conflict. Each location would have (temporary) access to the data it had written directly, but at some point the conflict would need resolution.

They also support NFS and SMB file access for the front end and use object storage services for backend data. Data is copied on demand to the local site’s storage when accessed based on the SLO policies in effect for it. During data movement it is copied up, temporarily into objects on AWS, Microsoft Azure, or GCP, and then copied down to the location it’s being moved to. I believe this temporary object data is encrypted and compressed. Hammerspace support KMIP key providers.

Pricing for Hammerspace is on a managed capacity basis. But anyone can use Hammerspace for up to 10TB for free. Hammerspace is available in AWS marketplace for configuration there.

~~~~

Well it’s been a long time coming, but it appears to be here. Any customers wanting hybrid-cloud operations or global access to their data would be remiss to not check out Hammerspace.

[Edited after posting, The Eds.]