Data and code versioning for MLOps

Read an interesting article (Ex-Apple engineers raise … data storage startup) and research paper (Git is for data) about a group of ML engineers from Apple forming a new “data storage” startup targeted at MLOps teams just like the ones they worked with at Apple. It turns out that MLOps has some very unique data requirements that go way beyond just data storage.

The paper discusses some of the unusual data requirements for MLOps such as:

  • Infrequent updates – yes, there are some MLOps datasets where updates are streamed in, but the vast majority of MLOps datasets are updated on a slower cadence. The authors think monthly works for most MLOps teams
  • Small changes/lots of copies – the changes to MLOps data are relatively small compared to the overall dataset size and usually consist of data additions, record deletions, label updates, etc. But unlike most data, MLOps data are often subsetted or extracted into smaller datasets used for testing, experimentation and other “off-label” activities.
  • Variety of file types – depending on the application domain, MLOps file types range all over the place, but there are often a lot of CSV files in combination with text, images, audio, and semi-structured data (DICOM, FASTQ, sensor streams, etc.). However, within a single domain, MLOps file types are pretty much all the same.
  • Variety of file directory trees – this is very MLOps team and model dependent. Usually there are train/validate/test splits to every MLOps dataset but what’s underneath each of these can vary a lot and needs to be user customizable.
  • Data often requires pre-processing to be cleansed and made into something appropriate and more usable by ML models.
  • Code and data must co-evolve together, over time – as data changes, the code that uses it changes. Adding more data may not cause changes to code, but models are constantly under scrutiny to improve performance, accuracy or remove biases. Bias elimination often requires data changes but code changes may also be needed.

It’s that last requirement, that MLOps data and code must co-evolve and thus need to be versioned together, that’s most unusual. Data-code co-evolution is needed for reproducibility, rollback and QA, but also for many other reasons as well.

In the paper they show a typical MLOps data pipeline.

Versioning can also provide data (and code) provenance, identifying the origin of data (and code). MLOps teams undergoing continuous integration need to know where data and code came from and who changed them. And as most MLOps teams collaborate in the development, they also need a way to identify data and code conflicts when multiple changes occur to the same artifact.

Source version control

Code has had this versioning problem forever and the solution became revision control systems (RCS) or source version control (SVC) systems. The most popular solutions for code RCS are Git (software) and GitHub (SaaS). Both provide repositories and source code version control (clone, checkout, diff, add/merge, commit, etc.) as well as a number of other features that enable teams of developers to collaborate on code development.

The only thing holding Git/GitHub back from being the answer to MLOps data and code version control is that they don’t handle large (>1MB) files very well.

The solution seems to be adding better data handling capabilities to Git or GitHub. And that’s what XetHub has created for Git.

XetHub’s “Git is for Data” paper (see link above) explains in much detail how they provide a better data layer for Git, but it boils down to using Git for code versioning and as a metadata database for their deduplicating data store. They use Merkle trees to track the chunks of data in a deduped dataset.

How XetHub works

XetHub supports variable (dedupe) chunking for its data store. This allows them to use relatively small files, checked into Git, to provide the metadata that points to the current (and all previous) versions of data files checked into the system.

Their mean chunk size is ~4KB. Data chunks are stored in their data store. But the manifest for dataset versions is effectively stored in the Git repository.

The paper shows how using a deduplicated data store can support data versioning.

XetHub uses a content addressable store (CAS) to store the file data chunk(s) as objects or BLOBs. The key to getting good IO performance out of such a system is to have small chunks but large objects.
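To make the CAS idea concrete, here’s a minimal sketch (not XetHub’s implementation) of a content addressable store where each chunk is keyed by the SHA-256 hash of its contents; the manifest at the end stands in for the small metadata file that would get checked into Git.

```python
import hashlib

class ContentAddressableStore:
    """Minimal CAS sketch: chunks are keyed by a hash of their content,
    so identical chunks are stored exactly once (dedupe comes for free)."""

    def __init__(self):
        self.objects = {}                    # key (hex digest) -> chunk bytes

    def put(self, chunk: bytes) -> str:
        key = hashlib.sha256(chunk).hexdigest()
        self.objects.setdefault(key, chunk)  # storing the same content twice is a no-op
        return key

    def get(self, key: str) -> bytes:
        return self.objects[key]

# A file version is then just an ordered list of chunk keys (a manifest),
# which is small enough to check into Git as metadata.
cas = ContentAddressableStore()
manifest = [cas.put(c) for c in (b"chunk-one", b"chunk-two", b"chunk-one")]
assert b"".join(cas.get(k) for k in manifest) == b"chunk-onechunk-twochunk-one"
```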

They map data chunks to files using CDMTs (content defined Merkle trees). Each chunk of data resides in at least two different CDMTs, one associated with the file version and the other associated with the data storage elements.

XetHub’s variable chunking is done using a statistical approach and multiple checksums, but they also offer specialized chunking for one file type, CSV. Even with their general purpose variable chunking method, they can offer a ~9X dedupe ratio for text data (embeddings).
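For illustration only, here’s a toy content-defined chunking sketch using a Rabin-Karp style rolling hash. XetHub’s actual statistical/multi-checksum method differs; the window, mask and minimum chunk size below are arbitrary assumptions chosen to give roughly 4KB chunks.

```python
import hashlib, os

B, MOD = 257, (1 << 31) - 1       # rolling hash base and modulus (arbitrary choices)

def content_defined_chunks(data: bytes, window: int = 48,
                           mask: int = (1 << 12) - 1, min_size: int = 512):
    """Declare a chunk boundary wherever a rolling hash of the last `window`
    bytes matches a bit pattern, giving an expected chunk size of roughly
    mask+1 bytes (~4KB here). Because the boundary test only looks at a local
    window, an edit early in a file only disturbs nearby chunks; later chunks
    realign and still dedupe against the previous version."""
    chunks, start, h = [], 0, 0
    b_pow = pow(B, window, MOD)
    for i, byte in enumerate(data):
        h = (h * B + byte) % MOD
        if i >= window:
            h = (h - data[i - window] * b_pow) % MOD   # slide the window forward
        if i - start + 1 >= min_size and (h & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# A 4-byte insert near the front should leave most chunk hashes unchanged.
original = os.urandom(256 * 1024)
edited = original[:100] + b"EDIT" + original[100:]
before = {hashlib.sha256(c).hexdigest() for c in content_defined_chunks(original)}
after = {hashlib.sha256(c).hexdigest() for c in content_defined_chunks(edited)}
print(f"chunks unchanged after the edit: {len(before & after)} of {len(before)}")
```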

They end up using Git commands for code and data but provide hooks (Git filters) to support data cloning, add/checkin, commits, etc. So they can take advantage of all the capabilities of Git that have grown up over the years to support collaborative code development, but use these for data as well as code.

In addition to normal Git services for code and data, XetHub also offers a read-only, NFSv3 file system interface to XetHub datasets. Doing this eliminates having to reconstitute and copy TBs of data from their code-data repo to user workstations. With NFSv3 front end access to XetHub data, users can easily incorporate data access for experimentation, testing and other uses.

Results from using XetHub

XetHub showed some benchmarks comparing their solution to Git LFS, another Git based large data storage solution. For their benchmark, they used CORD-19 (see the ArXiv paper and Kaggle CORD-19 dataset), a corpus of all COVID-19 papers since COVID started. The corpus is updated daily and released periodically, and they used the last 50 versions (up to June 2022) of the research corpus for their benchmark.

Each version of the CORD-19 corpus consists of JSON files (research reports, up to 700K each) and 2 large CSV files, one with paper information and the other with paper (word?) embeddings (a more usable version of the paper text/tables used for ML modeling).

For CORD-19, XetHub was able to store all 2.45TB of research reports and CSV files in only 287GB of Git (metadata) and datastore data, a dedupe factor of 8.7X. With XetHub’s specialized CSV chunking (Xet w/ CSV chunking above), the 50 CORD-19 versions can be stored in 87GB, a 28.8X dedupe ratio. And of that 87GB, only 82GB is data and the rest, ~5GB, is metadata (of which 1.7GB is the Merkle tree).

In the paper, they also showed the cost of branching this data by extracting and adding one version which consisted of a 75-25% (random) split of a version. This split was accomplished by changing only the two (paper metadata and paper word embeddings) CSV files. Adding this single split version to their code-data repository/datastore only took an additional 11GB of space. An aligned split (only partitioning on a CSV record boundary, unclear but presumably with CSV chunking) only added 185KB.

XetHub potential enhancements

XetHub envisions many enhancements to their solution, including adding other file type specific chunking strategies, adding a “time series” view to their NFS frontend to view code/data versions over time, finer granularity data provenance (at the record level rather than at the change level), and RW NFS access to data. Further, XetHub’s dedupe metadata (in the Git repo) only grows over time; supporting updates and deletes to dedupe metadata would help reduce data requirements.

Read the paper to find out more.


Living forever – the end of evolution part-3

Read an article yesterday on researchers who had been studying various mammals, trying to determine the number of DNA mutations they accumulate at about the time they die. The researchers found that naked mole rats die after about 800 mutations; see the Nature article Somatic mutation rates scale with lifespan across mammals and the Telegraph article reporting on the research, Mystery of why humans die around 80 may finally be solved.

Similarly, at around 3500 mutations humans die, at around 3000 mutations dogs die and at around 1500 mutations mice die. But the really interesting thing is that DNA mutation rates and mammal lifespan are highly (negatively) correlated. That is, higher mutation rates correspond to mammals with shorter life spans.

(Figure from the Nature paper) C shows linear regression of somatic substitution burden (corrected for analysable genome size) on individual age for dog, human, mouse and naked mole-rat samples. Samples from the same individual are shown in the same colour. Regression was performed using mean mutation burdens per individual. Shaded areas indicate 95% confidence intervals of the regression line. A shows microscopic images of sample mammalian cells and the DNA strands examined, and B shows the distribution of different types of DNA mutations (substitutions or indels [insertions/deletions of DNA]).

The Telegraph article seems to imply that at 800 mutations all mammals die. But the Nature Article clearly indicates that death is at different mutation counts for each different type of mammal.

Such research shows one way to live forever. We have talked about similar topics in the distant past; see … the end of evolution part 1 & part 2.

But in any case it turns out that one of the leading factors that explains the average age of a mammal at death is its DNA mutation rate. Again, mammals with lower DNA mutation rates live longer on average and mammals with higher DNA mutation rates live shorter lives on average.

Moral of the story

If you want to live longer, reduce your DNA mutation rate.

(Figure from the Nature paper) c shows a zero-intercept LME regression of somatic mutation rate on inverse lifespan (1/lifespan), presented on the scale of untransformed lifespan (x axis). For simplicity, the y axis shows mean mutation rates per species, although rates per crypt were used in the regression. The darker shaded area indicates the 95% CI of the regression line, and the lighter shaded area marks a twofold deviation from the line. Point estimate and 95% CI of the regression slope (k), FVE and range of end-of-lifespan burden are indicated.

All astronauts are subject to significant forms of cosmic radiation which can’t help but accelerate DNA mutations. So one would have to say that the risk of being an astronaut is that you will die younger.

Moon and Martian colonists will also have the same problem. People traveling, living and working there will have an increased risk of dying young. And of course anyone that works around radiation has the same risk.

Note, the mutation counts/mutation rates that seem to govern life span are averages. Some individuals have lower mutation rates than their species average and some (no doubt) have higher rates. These should have longer and shorter lives on average, respectively.

Given this variability in DNA mutation rates, I would propose that space agencies use, as one selection criterion, the astronaut’s/colonist’s DNA mutation rate, so that humans who have lower than average DNA mutation rates have a higher priority of being selected to become astronauts/extra-earth colonists. Using this research and assaying astronauts for their DNA mutation counts as they come back to earth, one could theoretically determine the impact on their average life span.

In addition, most life extension research is focused on rejuvenating cellular or organism functionality, mainly through the use of young blood, other select nutrients, stem cells that target specific organs, etc. For example, see MIT Scientists Say They’ve Invented a Treatment That Reverses Hearing Loss, which involves taking human cells, transforming them into stem cells (at a certain maturity) and injecting them into the ear drum.

Living forever

In prior posts on this topic (see parts 1 & 2 linked above) we suggested that with DNA computation and DNA storage (see, or rather listen to, our GBoS podcast with the CTO of Catalog) now becoming viable, one could potentially come up with a DNA program that could:

  • Store an individual’s DNA in some very reliable and long-lived encoding (inside a cell or external to the cell), and
  • Craft a DNA program that could periodically be activated (a cellular crontab) to access the stored DNA for the individual (in the cell would be easiest) and use this copy to replace/correct any DNA mutations throughout the individual’s cells.

And we would need a very reliable and correct copy of that person’s DNA (using SHA256 hashing, CRCs, ECC, parity and every other way to ensure the DNA as captured is stored correctly forever). And the earlier we obtained the DNA copy for an individual human, the better.

Also, we would need a copy of the program (and probably the DNA) to be present in every cell in a human for this to work effectively.

However, if we could capture a good copy of a person’s DNA early in their life, we could perhaps, sometime later, incorporate a DNA code/program into the individual that uses this copy to sweep through the person’s body (at that point in time) and correct any mutations that have accumulated to date. Ultimately, one could schedule this activity to occur like an annual checkup.

So yeah, life extension research can continue along the lines it’s going, producing a bunch of point solutions for cellular/organism malfunction, OR it can focus on correctly copying and storing DNA forever and creating a DNA program that can correct DNA defects in every individual cell, using the stored DNA.

End of evolution

Yes, mammals, and that means any human, could live forever this way. But it would signify the start of the end of evolution for the human species. That is, from whenever we captured their DNA copy, evolution (by mutating DNA) of that individual and any offspring of that individual could no longer take place. And if enough humans do this throughout their lifespan, it means the end of evolution for humanity as a species.

This assumes that evolution (which is natural variation driven by genetic mutation & survival of the fittest) requires DNA variation (essentially mutation) to drive the species forward.

~~~~

So my guess is either we can live forever and stagnate as a species OR live normal lifespans and evolve as a species into something better over time. I believe nature has made its choice.

The surprising thing is that we are at a point in humanity’s existence where we can conceive of doing away with this natural process – evolution – forever.


Deepmind does code – part 1: the data

1st, let me express my and my fellow coders’/programmers’ disappointment that Deepmind would take on coding. There are many other white collar work domains that need to be conquered before coding.

2nd, let me apologize for the lack of blog posts lately, all I can say is, business is picking up.

Saw an article over the last couple of weeks on Deepmind creating AlphaCode, an artificial intelligence coding application, which they used to enter coding contests, achieving an average 1238 rating, or better than 54% of code contest participants.

I can’t recall where I first saw the news but Deepmind has a pretty decent blog post on AlphaCode and they have published a pre-print of their research paper on AlphaCode as well. I plan on discussing AlphaCode in detail over a couple of posts. This will be the first installment, on where they got the data to train their models.

AlphaCode is a transformer-based language model (see the Wikipedia Transformer (machine learning model) article) that translates a code competition problem statement into code, that is, a program that, when executed, solves the problem statement. In order to train AlphaCode, Deepmind first needed to obtain lots of source code.

It’s all about the (training) data

The first step in Deep Learning model generation is gathering data to train the model. Now where would Google’s Deepmind go to gather coding data – well GitHub, a public repository of all things software, of course.

They used GitHub data to pre-train their model(s) but also scraped code (problem statements & test cases) from published code contests to fine-tune their model.

Deepmind has released their fine-tuning CodeContests training data for AlphaCode on GitHub, so as to support other organizations in creating AI models for coding.

GitHub source to the (pre-training) rescue

There are a couple of problems with using GitHub source code for training:

  • GitHub code is in whatever source code language the author felt most appropriate to use.
  • GitHub code is not guaranteed to work correctly.
  • GitHub code is not guaranteed to be completed code.
  • GitHub code represents a wide range of coding skill.
  • GitHub code doesn’t always come with a problem statement.

But the use of GitHub in their pre-training data set is intended to give their transformer-based language model some capability to understand (learn) what coding is all about, what a proper syntax would be, what a proper coding sequence would be, etc.

The AlphaCode team took a snapshot of selected GitHub source repos. They only scraped repos that contained C++, C#, Go, Java, JavaScript, Lua, PHP, Python, Ruby, Rust, Scala, and TypeScript languages. They also dropped from the pre-training data any source code files larger than 1MB or with any lines longer than 1000 characters, to avoid using machine generated code. They also stripped all the white space out of the selected source code files and compared them to eliminate duplicated code.
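As a rough illustration of that filtering and whitespace-insensitive dedupe step (not Deepmind’s actual pipeline; the directory name and extension list are assumptions, while the 1MB/1000-character limits come from the paper):

```python
import hashlib
from pathlib import Path

# Rough mapping of the 12 languages to file extensions (assumed, not from the paper).
KEEP_SUFFIXES = {".cpp", ".cs", ".go", ".java", ".js", ".lua", ".php",
                 ".py", ".rb", ".rs", ".scala", ".ts"}

def keep_file(path: Path) -> bool:
    """Apply the paper's filters: language by extension, files <= 1MB,
    and no line longer than 1000 characters."""
    if path.suffix not in KEEP_SUFFIXES or path.stat().st_size > 1_000_000:
        return False
    text = path.read_text(errors="ignore")
    return all(len(line) <= 1000 for line in text.splitlines())

def dedupe(paths):
    """Drop files whose whitespace-stripped content hashes to the same value."""
    seen, unique = set(), []
    for p in paths:
        stripped = "".join(p.read_text(errors="ignore").split())
        digest = hashlib.sha256(stripped.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(p)
    return unique

corpus = dedupe(p for p in Path("scraped_repos").rglob("*")
                if p.is_file() and keep_file(p))
print(f"{len(corpus)} unique source files kept for pre-training")
```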

Their final pre-training dataset was 715GB of data over 86 million source files.

Although unstated, we would guess that the AlphaCode team used the GitHub repo’s README.md file as a surrogate for the solution description. It’s unclear what else could have been used, unless they generated it automatically by extracting semantic content or summarizing the README.md files.

Excerpt from Deepmind’s competitive code contest source code & problem statements README.md file

The (pre-)training data can be used to train a transformer-based language model. Such models are used today to provide language translation. In AlphaCode’s case, they wanted to create a code transformer-based model that translates a specification of a coding problem into source code that solves that problem.

For language translation models, training uses text files, in different languages, that represent the same law or information and that, notably, are human generated translations.

One challenge with using internet scraped data for training is that it can easily contain actual solutions, verbatim, for the problems the model is trying to solve. In order to avoid copying these solutions entirely, they decided to split their data into a training set, validation set and test set on a time basis. This way the training data used source code/problem statements only from a period of time prior to the validation set, and likewise the validation data came only from a period prior to the test data.
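A minimal sketch of that kind of temporal split is below; the record fields and cutoff dates are made up for illustration.

```python
from datetime import date

def temporal_split(examples, train_end: date, valid_end: date):
    """Split problem/solution records by contest date so validation problems
    post-date all training problems and test problems post-date both, which
    keeps published solutions from leaking backwards in time."""
    train = [e for e in examples if e["contest_date"] <= train_end]
    valid = [e for e in examples if train_end < e["contest_date"] <= valid_end]
    test  = [e for e in examples if e["contest_date"] > valid_end]
    return train, valid, test

examples = [
    {"problem": "A", "contest_date": date(2021, 3, 1)},
    {"problem": "B", "contest_date": date(2021, 9, 1)},
    {"problem": "C", "contest_date": date(2022, 2, 1)},
]
train, valid, test = temporal_split(examples, date(2021, 6, 1), date(2021, 12, 31))
print(len(train), len(valid), len(test))   # 1 1 1
```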

To show that this approach (using a time point to split the data) worked, they trained a 1B parameter AlphaCode transformer on two different training-validation datasets: one where the validation data was selected at random (the normal approach to selecting validation data), the “random” split, and the other where the validation data only occurred some time after the training data, the “temporal” split. The 1B AlphaCode transformer was able to properly code 0.8% of the problems in a 13K sample of the 86M source files/problem statements on the random split, but 0% on the temporal split.

So much for pre-training, let’s discuss fine tuning

AlphaCode was going to get nowhere with a 0% solve rate (ok, this was based on a 13K sample and only a 1B parameter model), but they realized that GitHub code was only going to get them so far (ok, conjecture on my part).

Fine-tuning beyond the pre-training (GitHub derived) data was needed, so the AlphaCode team turned to code competition source code/problem statement data.

Most code contests publish source code submissions as well as the problem statements and sample test cases. By scraping these, Deepmind was able to obtain a very well annotated dataset they could use to fine-tune their AlphaCode transformer model.

They again used a temporal split for training/validation/test data. But they were also able to add metadata to their data that indicated whether the code solved the problem statement.

Code competitions also publish tests for the problem statement. Having the tests, a human can validate whether their code at least works against them. Code contests also have a set of more (sophisticated) hidden tests that they use internally to validate code submissions.

This test data will become important later on in the model’s operation, which will be discussed in a future post, but suffice it to say that AlphaCode uses the public tests (and mutations of these) to validate AlphaCode generated source code before submitting it.

This fine-tuning dataset is available in the GitHub repo (linked to above) that Deepmind has created/curated for others to work with.

Another nicety of this fine-tuning data is they have proper, human created, problem statements to work from rather than README.md surrogates.

In part-2 we plan to describe the transformer-based model that was created for AlphaCode and at some point, discuss how they used testing in their code submissions.

Once again, all my information comes from Deepmind’s pre-print on their AlphaCode project (linked to above).

Any comments, please don’t hesitate to let me know.


Dell EMC PowerStore X and the Edge – TFDxDell

This past summer I attended a virtual TFDxDell event where there were a number of sessions discussing Dell EMC technologies for the enterprise. One session sort of struck a nerve, the Dell EMC PowerStore session, and I have finally figured out what interested me most in their talk: their PowerStore X appliances and AppsON technology.

What are AppsON and the PowerStore X appliance?

Essentially PowerStore X with AppsON has an onboard ESXi hypervisor which allows customers to run vSphere VMs inside the storage system with direct vVol (I assume) access to PowerStore data storage without having to go out over a (storage) network.

PowerStore X ESXi is a little behind the most recent VMware vSphere releases (at least 30 days) but it’s current enough for most shops. In non-PowerStore X appliances, PowerStoreOS runs as containers but in PowerStore X, PowerStoreOS storage functionality runs as VMs, just like any other VMs running on its ESXi hypervisor.

Moreover, PowerStore X can still service IOs from other non-PowerStore X resident VMs or bare metal applications running in the environment. In this way you get all the data services of an enterprise class storage system that also runs VMs.

With PowerStore OS 2.0 they have added scale out to AppsON. That is any PowerStore X (1000X, 3000X, 5000X or 7000X) appliance, in a PowerStore X cluster, can have their VMs move from one appliance to another using vSphere vMotion. This means that as your PowerStore X storage clusters grow, you can rebalance VM application workloads across the cluster. A PowerStore X cluster can contain up to 4 PowerStore X appliances.

PowerStore’s heritage goes back quite a ways at Dell and EMC. Prior versions of EMC Unity storage and some of its progenitors had the ability to run applications on the storage itself. But by running an ESXi hypervisor on PowerStore X appliances, it takes all this to a whole new level.

Why would anyone want AppsON?

It’s taken me some time to understand why anyone would want to use AppsON, and I have concluded that the edge might be the best environment to deploy it.

Recent VMware enhancements have reduced minimum node configurations for edge environments to 2 servers. It’s unclear to me whether a single PowerStore X appliance with AppsON counts as one server or two but, for the moment, let’s assume it’s just one. This means that a minimum VMware vSphere edge deployment could use 1 PowerStore X and 1 standalone ESXi server.

In such an environment, customers could run their data intensive VMs directly on the PowerStore X and some of their non-data intensive VMs on the standalone server. But the flexibility exists to vMotion VMs from one to the other as demand dictates.

But does the edge need storage?

Yes, some do. For instance, take 5G. It enables a whole new class of mobile services and many of them can be quite data intensive. 5G is being deployed around the world as mini-data centers in cell towers. It’s unclear whether these data centers run vSphere, but I’m sure VMware is trying their hardest to make that happen. With vSphere running your 5G mini-datacenter, PowerStore X could make a smart addition.

Then there’s all the smart cars, which are creating TBs of sensor data every time they take to the road. You’re probably not going to have a PowerStore appliance in your smart car (at least anytime soon) but they just might have one at the local service station.

And maybe given all the smart devices in your home, smart cars, smart appliances, smart robots, etc., there’s going to be a whole lot of data generated from your smart home. Having something like PowerStore X in your smart home’s mini-data center would offer a place to hold all that data and to do some processing (compressing maybe) before sending it up to the cloud.

~~~~

We have just two more questions for Dell EMC,

  1. Shouldn’t the base PowerStore appliance be called PowerStore K?
  2. Shouldn’t customers be allowed to run their own K8s container apps on their PowerStore K just as easily as running VMs in their PowerStore X?

Legal Disclosure: TechFieldDay and Dell provided gifts to all participants (including me) for the TFDxDell event.

Photo credit(s):

  • From Dell EMC slides presented at TFDxDell event
  • From Dell EMC slides presented at TFDxDell event
  • From Dell EMC slides presented at TFDxDell event

Facebook’s (Meta) Kangaroo, a better cache for billions of small objects

Read an article this week in Blocks and Files, Facebook’s Kangaroo jumps over flash limitations, which spiked my interest, so I went and searched for more info on this and found a fb blog post, Kangaroo: A new flash cache optimized for tiny objects, which sent me to an ACM SOSP (Symposium on O/S Principles) best paper of 2021, Caching billions of tiny objects on flash.

First, as you may recall, flash has inherent limitations when it comes to writing. The more writes to a flash device, the more NAND cells start to fail over time. Flash devices are only rated for some number of (standard, ~4KB) block writes. For example, the Micron 5300 MAX SSD only supports 3-5 DWPD (drive writes per day, in 4KB blocks). So, a 2TB Micron 5300 MAX SSD can only sustain ~1.5 to 2.4B 4KB block writes per day. Now that seems more than sufficient for most work, but when somebody like fb, using the SSD as an object cache, writes a few billion or more 100B(yte) objects and does this day in and day out, they can consume an SSD in no time. Especially if they are writing one 100B object per block.
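The arithmetic behind those numbers, as a quick sketch (drive capacity and DWPD ratings as quoted above):

```python
drive_bytes = 2 * 10**12                        # 2TB drive
block_bytes = 4 * 1024                          # standard 4KB block
blocks_per_drive = drive_bytes // block_bytes   # ~500M blocks

for dwpd in (3, 5):
    writes_per_day = dwpd * blocks_per_drive
    print(f"{dwpd} DWPD -> ~{writes_per_day / 1e9:.1f}B 4KB block writes/day")

# Writing one 100-byte object per 4KB block burns an entire block write per
# object, so a few billion objects a day alone approaches the drive's budget.
```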

So there’s got to be a better way to cache small objects into bigger blocks. Their paper talks of two prior approaches:

  • Log structured storage – here multiple 100B objects are stored in a single 4KB block and written out with one IO rather than 40. This works fairly well, but the index, which maps an object key to a log location, takes up a lot of memory space. If you’re caching ~3B 100B objects in a log and each object index takes 16 bytes, that’s 48GB of DRAM index space.
  • Associative set storage – here each object is hashed into a set of (one or more) storage blocks and is stored there. In this case there’s no DRAM index, but you do need a quick way to determine if an object is in the set storage or not. This can be done with bloom filters (see the Wikipedia article on bloom filters). So if each associative set stores 400 objects and one needs to store 3B objects, one needs ~30MB of bloom filters (assuming 4 bytes each). The only problem with associative sets is that when one adds an element to a set, the set has to be rewritten. So if, over time, you add 400 objects to a set, you end up writing that set 400 times, all of which eats into the DWPD budget for the flash storage. (See the sketch after this list.)
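Here’s a minimal sketch of the set associative idea with a bloom filter per set (set count, filter size and hash count are arbitrary, not fb’s):

```python
import hashlib

class BloomFilter:
    """Tiny bloom filter: k hash probes into an m-bit array. It may return a
    false positive but never a false negative, so a 'no' avoids a flash read."""
    def __init__(self, m_bits: int = 1024, k: int = 3):
        self.m, self.k, self.bits = m_bits, k, 0

    def _probes(self, key: bytes):
        for i in range(self.k):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, key: bytes):
        for p in self._probes(key):
            self.bits |= 1 << p

    def might_contain(self, key: bytes) -> bool:
        return all((self.bits >> p) & 1 for p in self._probes(key))

class SetAssociativeStore:
    """Objects hash to one of n_sets 4KB 'blocks'; every insert rewrites the
    whole block, which is what eats into the flash DWPD budget."""
    def __init__(self, n_sets: int = 1024):
        self.sets = [dict() for _ in range(n_sets)]
        self.blooms = [BloomFilter() for _ in range(n_sets)]

    def _set_for(self, key: bytes) -> int:
        return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % len(self.sets)

    def put(self, key: bytes, value: bytes):
        s = self._set_for(key)
        self.sets[s][key] = value      # in real life: read the 4KB block, modify, rewrite
        self.blooms[s].add(key)

    def get(self, key: bytes):
        s = self._set_for(key)
        if not self.blooms[s].might_contain(key):
            return None                # definitely absent: skip the flash read entirely
        return self.sets[s].get(key)

store = SetAssociativeStore()
store.put(b"object-1", b"x" * 100)
assert store.get(b"object-1") == b"x" * 100 and store.get(b"object-2") is None
```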

In Kangaroo, fb engineers have combined the best of both of these together and added a small DRAM cache.

How does it work?

Their 1st tier is a DRAM cache, which is ~1% of the capacity of the whole object cache. Objects are inserted into the DRAM cache first and are evicted in a least recently used fashion; that is, objects that have not been used in the longest time are moved out of this cache and written to the next layer (not quite, but we’ll get to that in a moment).

Their 2nd tier is a log structured system, at ~5% of cache capacity. They call this the KLog and it consists of a ring of 4KB blocks on SSD, with a DRAM index telling where each object is located on the ring. Objects come in, are buffered together into a 4KB block and are written to the next empty slot in the ring, with the DRAM index updated accordingly. Objects are evicted from the KLog in such a way that a group of them, that would be located in the same associative set and are LRU, can all be evicted at the same time. They have structured the KLog DRAM index so that it makes finding all these objects easy. Also, any log structured system needs to deal with garbage collection. Let’s say you evict 5 objects in a 4K block; that leaves 35 that are still good. Garbage collection will read a number of these partially full blocks and mash all the good objects together, leaving free space for new objects that need to be cached.
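A stripped down sketch of the KLog idea is below (4KB blocks appended to a ring plus a per-object DRAM index; eviction, the per-set grouping index and garbage collection are all omitted, and objects still sitting in the write buffer aren’t indexed yet in this toy version):

```python
BLOCK_SIZE = 4096

class KLog:
    """Buffer ~40 small objects into a 4KB block, append each full block to the
    next slot on a ring of flash blocks, and keep a DRAM index mapping every
    key to (slot, offset, length) so a lookup costs at most one flash read."""
    def __init__(self, n_blocks: int = 1024):
        self.ring = [None] * n_blocks      # flash resident 4KB blocks
        self.head = 0                      # next ring slot to write
        self.index = {}                    # DRAM: key -> (slot, offset, length)
        self.buffer = bytearray()          # objects waiting to fill a block
        self.pending = []                  # (key, offset, length) for the buffer

    def put(self, key: str, value: bytes):
        if len(self.buffer) + len(value) > BLOCK_SIZE:
            self._flush()
        self.pending.append((key, len(self.buffer), len(value)))
        self.buffer += value

    def _flush(self):
        slot = self.head
        self.ring[slot] = bytes(self.buffer)           # one 4KB flash write for ~40 objects
        for key, offset, length in self.pending:
            self.index[key] = (slot, offset, length)
        self.head = (self.head + 1) % len(self.ring)   # advance around the ring
        self.buffer, self.pending = bytearray(), []

    def get(self, key: str):
        if key not in self.index:
            return None
        slot, offset, length = self.index[key]
        block = self.ring[slot]                        # the single flash read
        return None if block is None else block[offset:offset + length]
```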

The 3rd and final tier is a set associative store, which they call the Kset, that uses bloom filters to indicate object presence. For this tier, an object’s key is hashed to find a block to put it in; the block is read, the object inserted and the block rewritten. Objects are evicted out of the set associative store based on LRU within a block. The bloom filters are used to determine if an object exists in a set associative block.

There are a few items missing from the above description. As can be seen in Figure 3B above, Kangaroo can jettison objects that are LRUed out of DRAM instead of adding them to the KLog. The paper suggests this can be done purely at random, say only admitting into the KLog 95% of the objects being LRUed from DRAM. The jettison threshold for KLog to Kset is different. Here they will jettison single object sets. That is, if there were only one object that would be evicted and written to a set, it’s jettisoned rather than saved in the Kset. The engineers call this a Kset threshold of 2 (indicating the minimum number of objects destined for a single set that can be moved to the Kset).

While understanding an object’s LRU status is fairly easy if you have a DRAM index element for each object, it’s much harder when there’s no individual object index available, as in the Kset.

To deal with tracking LRU in the Kset, fb engineers created a RRIParoo index with a DRAM index portion and a flash resident index portion.

  • RRIParoo’s DRAM index is effectively a 40 byte bit map which contains one bit per object, corresponding to its location in the block. A bit on in this DRAM bitmap indicates that the corresponding object has been referenced since the last time the flash resident index was re-written.
  • RRIParoo’s flash resident index contains 3 bit integers, each one corresponding to an object in the block. This integer represents how many clock ticks it has been since the corresponding object was referenced. When the need arises to add an object to a full block, the object clock counters in that block’s RRIP flash index are all incremented until one has gotten to the oldest time frame, b’111′ or 7. It is this object that is evicted.

New objects are given an arbitrary clock tick count, say b’001′ or 1 (as shown in Fig. 6; in the paper they use b’110′ or 6), which is not so high that they are evicted right away but not so low that they are considered highly referenced.
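Here’s a rough sketch of that RRIP style bookkeeping for one Kset block, simplified from the description above (the insert value, block capacity and referenced-bit handling are my assumptions):

```python
MAX_RRIP = 0b111       # 7 == oldest, evict me next
INSERT_RRIP = 0b001    # new objects start neither hot nor about to be evicted
                       # (the paper's Fig. 6 reportedly uses b'110')

class RRIPBlock:
    """Eviction state for one set associative block: a 3-bit clock counter per
    object kept with the block on flash, plus a referenced bit per object kept
    in DRAM (the RRIParoo split described above)."""
    def __init__(self, capacity: int = 40):
        self.capacity = capacity
        self.objects = {}          # key -> value (stands in for the 4KB block)
        self.rrip = {}             # key -> 3-bit counter (flash resident)
        self.referenced = set()    # keys touched since the flash index was rewritten

    def get(self, key):
        if key in self.objects:
            self.referenced.add(key)       # cheap DRAM-only update on a cache hit
            return self.objects[key]
        return None

    def put(self, key, value):
        if key not in self.objects and len(self.objects) >= self.capacity:
            self._evict_one()
        self.objects[key] = value
        self.rrip[key] = INSERT_RRIP

    def _evict_one(self):
        # Fold the DRAM referenced bits back into the flash counters, then age
        # every object until one reaches MAX_RRIP and evict it.
        for key in self.referenced:
            self.rrip[key] = 0
        self.referenced.clear()
        while not any(v == MAX_RRIP for v in self.rrip.values()):
            for key in self.rrip:
                self.rrip[key] = min(self.rrip[key] + 1, MAX_RRIP)
        victim = next(k for k, v in self.rrip.items() if v == MAX_RRIP)
        del self.objects[victim], self.rrip[victim]
```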

How well does Kangaroo perform?

According to the paper, using the same flash storage and DRAM, Kangaroo can reduce the cache miss ratio by 29% over set associative or log structured caches alone. They tested this using simulations of real world activity on their fb social network trees.

The engineers did some sensitivity testing using various Kangaroo algorithm parameters to see how sensitive read miss rates were to Klog admission percentage, RRIParoo flash index element (clock tick counter) size, Klog capacity and Kset admission threshold.

Kangaroo performance read miss rate sensitivity to various algorithm parameters

Applications of the technology

Obviously this is great for Twitter and facebook/meta, as both of these deal with vast volumes of small data objects. But databases, Kafka data streams, IoT data, etc. all deal with small blocks of data and can benefit from the better caching that Kangaroo offers.

Storage could also use something similar, only in this case a) the objects aren’t small and b) the cache is all in memory. DRAM indexes for storage caching, especially when we have TBs of DRAM cache, can still be significant, especially if an index element is kept for each block in cache. So the technique could also be deployed for large storage caches as well.

Then again, similar techniques could be used to provide caching for multiple tiers of storage. Say DRAM cache, SSD Log cache and SSD associative set cache for data blocks with the blocks actually stored on large disks or QLC/PLC SSDs.


NASA’s journey to the cloud – part 1

Read an article the other day, NASA Turns to the Cloud for Help With Next-Generation Earth Missions, about how NASA has started to migrate all their data to the cloud and intends to store all new data there as well. The hope is that researchers would no longer need to download NASA data but rather could access it directly using cloud compute resources.

It turns out that newer earth science satellites are generating so much data that hosting all this data is becoming a challenge, and with the quantities being discussed, downloading the data to perform research in researchers’ own environments may take days.

Until recently, earth science data has been hosted on and downloadable from NASA, ESA and other space organization sites. For example, see NASA’s GHRC DAAC (Global Hydrometeorology Resource Center Distributed Active Archive Center), ESA EarthOnline, JAXA GPM website, etc. Generally one could download a time series of data from any of their prior and current earth/planetary science missions without too much trouble.

The Land Processes Distributed Active Archive Center (LP DAAC) archives and distributes Global Forest Cover Change (GFCC) data products through the NASA Making Earth System Data Records for Use in Research Environments (MEaSUREs) (https://earthdata.nasa.gov/community/community-data-system-programs/measures-projects) Program….

But NASA’s newest earth science satellites will be generating lots of data. For instance, the SWOT (Surface Water and Ocean Topography) mission data load will be 20TB/day and the NISAR (NASA-ISRO Synthetic Aperture Radar) mission data load will be 80TB/day. And it’s only getting worse as more missions with newer instruments come online.

NASA estimates that, over time, they will store 247PB of data in their EarthData Cloud. At the moment, they have already migrated some (all of ASF [Alaska Satellite Facility] DAAC and some of PO.DAAC [Physical Ocean]) of their Earth Science data to AWS (us-west-2) and over time all of it will migrate there.

NASA will eat any egress charges for EOSDIS data and is also paying any and all hosting fees to store the data in AWS. It’s unclear whether they are using standard S3 or S3 Intelligent-Tiering. And presumably they are using S3 replication to ensure they don’t lose DAAC data in the cloud, but I don’t see any evidence of that in the literature I’ve read. Of course this doubles the storage costs for their 247PB of DAAC data.

Access to all this data is available to anyone with an EarthData login. There you can register for a profile to access NASA earth sciences data.

NASA’s EarthData also offers a number of AWS cloud based services to help one access this data:

  • EarthData search – filtered search facility to access NASA EarthData by platform (e.g. satellite), instrument (e.g. camera/visual data), organization (e.g. NASA/JPL), etc.
  • EarthData Common Metadata Repository – API driven metadata repository that “catalogs all data and service metadata records for NASA’s EOSDIS (Earth Observing System Data and Information System) system” data, can be accessed by anyone, and includes programmatic access to EarthData search (see the query sketch after this list).
  • EarthData Harmony – EarthData Jupyter notebook examples and API documentation for performing research on earth science data in the EarthData cloud.
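For instance, a programmatic CMR query might look something like the sketch below (endpoint and parameter names reflect my reading of the public CMR search API; verify them against the CMR documentation before relying on this):

```python
import requests

# Ask the Common Metadata Repository for collections matching a keyword.
resp = requests.get(
    "https://cmr.earthdata.nasa.gov/search/collections.json",
    params={"keyword": "SWOT", "page_size": 5},
    timeout=30,
)
resp.raise_for_status()
for entry in resp.json()["feed"]["entry"]:
    # Field names ("short_name", "title") assumed from the JSON response format.
    print(entry.get("short_name"), "-", entry.get("title"))
```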

One reason to move EOSDIS DAAC data to the cloud is to allow researchers to avoid downloading data to run their analysis. By using in-cloud EC2 compute instances, they can run their research in AWS with direct, high speed access to the EarthData.

Of course, the researcher would need to purchase their EC2 compute facility directly from AWS. NASA publishes a sort of AWS pricing primer for researchers who want to use AWS EC2 compute to do research directly on the data in the cloud. Also, NASA offers a series of tutorials on how to use the AWS cloud for doing research on NASA DAAC data.

Where to from here?

I find this all somewhat discouraging. Yes, it’s the Gov’t, but one needs to wonder what the overall costs of hosting NASA DAAC data on the AWS cloud will be over the long haul. Most organizations use the cloud to prototype and scale up services, but once these services have stabilized, they migrate them back to onprem/CoLo infrastructure. See for example, Dropbox’s move away from the [AWS] cloud for ~600PB of data.

I get it, the public cloud allows for nearly infinite data scalability. But cloud storage is not cheap, especially when you are talking about 100s of PBs. And in today’s world, with a whole bunch of open source solutions for object storage and services, one can almost recreate any cloud service in one’s own data center, at a much lower price.

Sure it will still take IT infrastructure and personnel to put it all together. But NASA doesn’t seem to be lacking in infrastructure or IT personnel. Even if you are enamored with AWS services and software infrastructure, one can always run AWS Outpost in your data centers. And DAAC services seem to be pretty stable over time. Yes new satellites will generate more data, but the data load is understood and very predictable. So one should be able to anticipate all this and have infrastructure in place to deal with it.

Yes, having the ability to run analysis in the cloud directly on the data sitting in the cloud is useful, especially not having to download TBs of data. But these costs can also be significant and they are borne by the researcher, not NASA.

Another gripe is why use AWS alone. The other cloud providers all have similar object storage and compute capabilities. It seems wiser to me to set up the EarthData service such that different DAACs reside in different clouds. This would be more complex and harder to administer and use, but I believe in the long run it would lead to better, more effective services at a more reasonable price.

Going to the cloud doesn’t have to be a one way endeavor. After using the cloud for a while, NASA should have a better idea of the costs of doing so and at that time understand better what it can and cannot afford to do on its own.

It will be interesting to see what ESA, JAXA, CERN and other big science organizations do as they are all in the same bind, data seems to be growing unbounded.


CTERA, Cloud NAS on steroids

We attended SFD22 last week and one of the presenters was CTERA (for more information please see the SFD22 videos of their session), discussing their enterprise class, cloud NAS solution.

We’ve heard a lot about cloud NAS systems lately (see/listen to our GreyBeards on Storage podcast with LucidLink from last month). Cloud NAS systems provide a NAS (SMB, NFS, and S3 object storage) front-end system that uses cloud or onprem object storage to hold customer data, which is accessed through the use of (virtual or hardware) caching appliances.

These differ from file synch and share in that Cloud NAS systems

  • Don’t copy lots of or all customer data to user devices; the only data that resides locally is metadata and the user’s or site’s working set (of files).
  • Do cache working set data locally to provide faster access
  • Do provide NFS, SMB and S3 access along with user drive, mobile app, API and web based access to customer data.
  • Do provide multiple options to host user data in multiple clouds or on prem
  • Do allow for some levels of collaboration on the same files

Although admittedly, the boundary lines between synch and share and Cloud NAS are starting to blur.

CTERA is a software defined solution. But they also offer a whole gaggle of hardware options for edge filers, ranging from a smart phone sized, 1TB flash cache for a home office user to a multi-RU media edge server with 128TB of hybrid disk-SSD for 8K video editing.

They have HC100 edge filers, X-Series HCI edge servers, branch in a box, edge and Media edge filers. These latter systems have specialized support for MacOS and Adobe suite systems. For their HCI edge systems they support Nutanix, SimpliVity, HyperFlex and VxRail systems.

CTERA edge filers/servers can be clustered together to provide higher performance and HA. This way customers can scale-out their filers to supply whatever levels of IO performance they need. And CTERA allows customers to segregate (file workloads/directories) to be serviced by specific edge filer devices to minimize noisy neighbor performance problems.

CTERA supports a number of ways to access cloud NAS data:

  • Through (virtual or real) edge filers which present NFS, SMB or S3 access protocols
  • Through the use of CTERA Drive on MacOS or Windows desktop/laptop devices
  • Through a mobile device app for IOS or Android
  • Through their web portal
  • Through their API

CTERA uses an HA, dual redundant Portal service, which is a cloud (or on prem) service that provides the CTERA metadata database, edge filer/server management and other services, such as web access, cloud drive end points, mobile apps, API, etc.

CTERA uses S3 or Azure compatible object storage for its backend, source of truth repository to hold customer file data. CTERA currently supports 36 on-prem and in cloud object storage services. Customers can have their data in multiple object storage repositories. Customer files are mapped one to one to objects.

CTERA offers global dedupe, virus scanning, policy based scheduled snapshots and end to end encryption of customer data. Encryption keys can be held in the Portals or in a KMIP service that’s connected to the Portals.

CTERA has impressive data security support. As mentioned above end-to-end data encryption but they also support dark sites, zero-trust authentication and are DISA (Defense Information Systems Agency) certified.

Customer data can also be pinned to edge filers. Moreover, specific customer (directory/sub-directory) data can be hosted on specific buckets so that data can:

  • Stay within specified geographies,
  • Support multi-cloud services to eliminate vendor lock-in

CTERA file locking is what I would call hybrid. They offer strict consistency for file locking within sites but eventual consistency for file locking across sites. There are performance tradeoffs for strict consistency, so by using a hybrid approach they offer most of what the world needs from file locking without incurring the performance overhead of strict consistency across sites. For another way to support hybrid file locking consistency, check out LucidLink’s approach (see the GreyBeards podcast with LucidLink above).

At the end of their session Aron Brand got up and took us into a deep dive on select portions of their system software. One thing I noticed is that the Portal is NOT in the data path. When the edge filers want to access a file, the Portal provides the credential verification and points the filer(s) to the appropriate object, and the filers take off from there.

CTERA’s customer list is very impressive. It seems that many (50 of WW F500) large enterprises are customers of theirs. Some of the more prominent include GE, McDonalds, US Navy, and the US Air Force.

Oh, and besides supporting potentially 1000s of sites and 100K users in the same name space, they also have intrinsic support for multi-tenancy and offer cloud data migration services. For example, one can use Portal services to migrate cloud data from one cloud object storage provider to another.

They also mentioned they are working on supplying K8S container access to CTERA’s global file system data.

There’s a lot to like in CTERA. We hadn’t heard of them before but they seem focused on enterprise’s with lots of sites, boatloads of users and massive amounts of data. It seems like our kind of storage system.

Comments?

Storywrangler, ranking tweet ngrams over time

Read a couple of articles the past few weeks on a project in Vermont that has randomly selected 10% of all tweets (150 Billion) since the beginning of Twitter (2008) and can search and rank this tweet corpus for ngrams (1-, 2-, & 3-word phrases). All of these articles were reporting on a Science Advances article: Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter.

Why we need Storywrangler

The challenge with all social media is that it is transient: here now, (mostly) gone tomorrow. That is, once posted, if it’s liked/re-posted/re-tweeted it can exist in echoes of the original on the service for some time, and if not, it dies out very quickly, never to be seen (externally, ever) again. While each of us could potentially see every tweet we have ever created (when this post is published it should be my 5387th tweet on my twitter account), most of us cannot see this history for others.

All that makes viewing what goes on on social media impossible, which leads to a lot of misunderstanding and makes it difficult to analyze. It would be great if we had a way of looking at social media activity in more detail to understand it better.

I wrote about this before (see my Computational anthropology & archeology post) and if anything, the need for such capabilities has become even more important in today’s society.

If only there was a way to examine the twitter-verse. What’s mainly lacking is a corpus of all tweets that have ever been tweeted. A way to slice, dice, search, and rank this text data would be a godsend to understanding (twitter and maybe social) history, in real time.

Storywrangler, has a randomized version of 10% of all tweets since twitter started. And it provides ngram searching and ranking over a specified time interval. It’s not everything but it’s a start.

Storywrangler currently has over 1 trillion (1- to 3- word) ngrams and they support ngram rankings for over 150 different languages.

Google books ngram viewer

The idea for the Storywrangler project came from Google’s books ngram viewer. Google’s ngram viewer has a corpus of Google books, over a time period (from 1800 to 2019) and allows one to search for ngrams (1- to 5-word phrases) over any time period they support.

Google’s ngram viewer charts ngrams with a vertical axis that is the % of all ngrams in their book corpus. One can see the rise and fall of ngrams, e.g., “atomic power”. The phrase “atomic power” peaked in Google books around 1960 at a height of 0.000260% of all 2 word ngrams. The time period level of granularity is a year.

The nice thing about Google books ngram data is you can download their book ngram data yourself. The data is a tab separated list of rows, each row containing the ngram text (1 to 5 words), the year, how many times it occurred that year, on how many pages, and in how many books. Google books ngram data is generally about 2 years old.
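For example, pulling the yearly counts for one ngram out of a downloaded file might look like the sketch below (the file name is hypothetical and the column order follows the description above; newer dataset versions drop the page count, so check the file you actually download):

```python
import csv

def counts_for(ngram: str, path: str) -> dict:
    """Scan a Google books ngram TSV file and return {year: match_count} for one
    ngram. Columns assumed: ngram, year, match_count, page_count, volume_count."""
    counts = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if row and row[0] == ngram:
                counts[int(row[1])] = int(row[2])
    return counts

yearly = counts_for("atomic power", "googlebooks-eng-fiction-2gram-sample.tsv")
print(sorted(yearly.items())[-5:])   # last five years of counts for the phrase
```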

It’s unclear just how much data is in Google’s books ngram database, but for instance in the 1-gram English fiction list, they show a sample of two rows (the 3,000,000th and 3,000,001st rows), which are the 1978 and 1979 book counts for the word “circumvallate”.

Storywrangler tweet ngram viewer

The usage tab on the Storywrangler website provides a search engine that one can use to input N-grams that you want to search the corpus for and can visualize how their rank changes over time. For example, one can do a similar search on the “atomic power” ngram only for tweets.

From a Storywrangler search one can see that peak tweet use of “Atomic Power” and “ATOMIC POWER” occurred somewhere in July of 2020 (the only way to see the month is to hover over that line) and its rank reached somewhere around the ~10,000th highest used 2-word tweet ngram during that time.

It’s interesting to see that book ngrams and twitter ngrams don’t seem to have any correlation. For example, the prior best ranking for atomic power (~200Kth highest) was in June of 2015; there was no similar peak for book ngrams of the phrase.

For Storywrangler you can download a JSON or CSV version of the charts displayed. It’s not the complete ngram history that Google book ngram viewer provides. Storywrangler data is generally about 2 days old.
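As a sketch, once you’ve downloaded the CSV for a chart you could re-plot the rank series with a few lines of pandas; the file name and column names below (“date”, “rank”) are placeholders, so check the header of the file you actually download.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names -- inspect the downloaded CSV's header first.
df = pd.read_csv("storywrangler_atomic_power.csv", parse_dates=["date"])
ax = df.plot(x="date", y="rank", logy=True, legend=False)
ax.invert_yaxis()                           # rank 1 is the top, so flip the axis
ax.set_ylabel("tweet 2-gram rank")
ax.set_title('"atomic power" rank over time')
plt.show()
```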

The other nice thing about Storywrangler is that under the real-time tab it will show you ngram rankings at 15 minute intervals for whatever timeline you wish to see. Also, under the trending tab it will show you the changing ranks for the top 5 ngrams over a selected time period. And the language tab will track tweet language use for select languages. The common tab will track the ranking of the most common ngrams (pretty boring, mostly articles/prepositions) over time. And for any of these searches one can turn on or off retweet counting, which can help to eliminate bot activity.

Storywrangler provides a number of other statistics for ngrams other than just ranking such as odds (of occurring) and frequency (of occurrence). And one can also track rank change, old (years) rank vs. current (year) rank, rank (turbulence) divergence.

~~~~

Comments?
