Object storage – Silverton Consulting

MinIO Strong Consistency and object gateways #CFD23

Posted on June 18, 2025June 18, 2025 by Ray in Data consistency, Object storage

I attended Cloud Field Day 23 (#CFD23) a couple of weeks back and MinIO presented on their AIStor for one of the CFD sessions (see videos here). During the session the speaker mentioned that object storage gateways were not strongly consistent.

Object storage has been around a long time but AWS S3 took it from a niche solution to a major storage type. Originally S3 was “eventually consistent” that is if you wrote an object and the system returned to you, that object wouldn’t necessarily be all there (that is stored in the object storage system) until sometime “later” .

It took AWS from Mar’2006 until Dec’2020 to offer “strong consistency” for objects that is when the system returned to a user after having created a new object, that object was guaranteed to have been written to the object storage before the return happened.

What the MinIO presenter was saying is that object storage gateways didn’t offer strong consistency for objects being written to them. Most object storage gateways are built ontop of enterprise class file systems. For these systems to be not strongly consistent seemed pretty unusual to me so I asked them to clarify why they said that.

The speaker turned to Anand Babu Periasamy (AB), CEO & Co-Founder, MinIO to answer my question. Essentially AB answered me that it was all about the size of the volumes (or LUNs) that underpin the objects and buckets in the file object storage gateway system.

For our post we will set aside the few object storage systems based solely ontop of block storage (are there any of these) and will just consider the object storage gateways built ontop of file systems.

In S3, there are Single (object) PUT requests and multi-part (object) uploads to create objects. According to MinIO as objects and their buckets get bigger, object gateways don’t maintain strong consistency. Bucket sizes for S3 Objects storage can be on the order of many PBs.

As for the Objects themselves, AWS S3 supports a max object size of 5TiB, using multipart uploads and Single PUT maximum of 5GiB. According to AWS, during S3 multi-part uploads, the object is not readable until after the last multi-part upload is replied to, only then is the full (5TiB) object available and thereby offers strong consistency over multi-part uploads. For Single PUT object creations, the object is strongly consistent after the PUT is returned to.

Here’s single PUT and mult-part upload object size constraints for select object storage gateways:

NetApp ONTAP offers object storage protocols for their Data ONTAP storage systems. The maximum NetApp S3 put size is 5TiB but anything over 5GiB will be flagged to encourage multi-part uploads
Hitachi VSP One Object storage, Object maximum S3 PUT size is 5GiB and an object is limited to a maximum of 5TiB
Dell PowerScale ObjectScale maximum S3 PUT size is 5GiB, and the maximum S3 object size is 4.4TiB.
VAST Data supports S3 objects and it’s maximum S3 PUT size is 5GiB, and the maximum object size is 5TiB.
WekaFS maximum S3 PUT size is 5GiB and maximum S3 object size is 5TiB.
Scality (also presented at CFD23, videos here) RING and Artesca maximum S3 PUT size is 5GiB, and implies that they support 10K parts in a multi-part upload so a maximum object size of ~50TiB
Qumulo (also presented at CFD23, videos here) maximum S3 PUT size is 5GiB, and for S3 compatibility their maximum object size is 5TiB. But they also state that they can actually support an object of ~50TiB (10K multipart uploads * 5GiB)
MinIO S3 maximum object size is 50TiB and their maximum S3 PUT size is 5TiB.

I’m probably missing a couple of other storage systems that support S3 objects but you get the drift. It’s important to note that just about every object system gateway offers ful S3 compatible single PUT size of 5GiB (NetApp & MinIO offer 5TiB though!). And they all seem to offer 5TiB object sizes (with few exceptions Dell PowerScale ObjectScale’s 4.4TiB, Scality, Qumulo & MinIO 50TiB).

The question needs to be asked is if a customer uses multi-part uploads of 5GiB for a 5TiB object are they guaranteed strong consistency in an object storage gateway

According to everything I see the answer seem to be yes. Although outside of AWS S3 and MinIO I don’t see anything that specifically states they are strongly consistent.

However, most of object storage systems map objects to files and single file creates are strongly consistent for these systems. So for a single (5GiB) PUT the answer would be obviously YES. And most of these systems also support much larger files than 5GiB, so I would suspect that multi-part uploads would also support strong consistency as they would be mapped to a multi-write/single file in these systems. So my answer to the question of whether multi-part uploads are strongly consistent for object storage gateways is a guarded YES.

Perhaps what MinIO was meaning to say was that if you do a 5TiB part for a 50TiB multipart upload will it be strongly consistent. I would have to answer NO for anyone of these systems other than Qumulo, Scality & MinIO because they don’t support anything that large. But this is way (10X for object sizes & 1000X for cumulative multi-part upload part sizes) beyond AWS S3’s current capabilities, so I don’t see how this is relevant for those customers wishing to retain AWS S3 compatibility

And I guess I don’t see the relevance of bucket size to strong consistency for an object. I suppose there may be problems as you populate a bucket with big objects and run out of room during a (multi-part) upload/PUT. But it’s clear to me that when that occurred, the object storage gateway would indicate that the object store failed and then consistency doesn’t matter.

So I have to disagree with my friends at MinIO, regarding the lack of strong consistency with the proviso that all these systems support strong consistency for AWS S3 object sizes.

Comments?

Data and code versioning For MLops

Posted on January 30, 2023January 30, 2023 by Ray in Artificial Intelligence, Data, Data ownership, Data versioning, Machine Learning, MLOps, Object storage

Read an interesting article (Ex-Apple engineers raise … data storage startup) and research paper (Git is for data) about a of group of ML engineers from Apple forming a new “data storage” startup targeted at MLOps teams just like Apple. It turns out that MLops has some very unique data requirements that go way beyond just data storage.

The paper discusses some of the unusual data requirements for MLOps such as:

Infrequent updates – yes there are some MLOps datasets where updates are streamed in but the vast majority of MLOps datasets are updated on a slower cadence. The authors think monthly works for most MLOps teams
Small changes/lots of copies – The changes to MLOps data are relatively small compared to the overall dataset size and usually consist of data additions, record deletions, label updates, etc. But uncommon to most data, MLOps data are often subsetted or extracted into smaller datasets used for testing, experimentation and other “off-label” activities.
Variety of file types – depending on the application domain, MLOps file types range all over the place. But there’s often a lot of CSV files in combination with text, images, audio, and semi-structured data (DICOM, FASTQ, sensor streams, etc.). However within a single domain, MLOps file types are pretty much all the same.
Variety of file directory trees – this is very MLOps team and model dependent. Usually there are train/validate/test splits to every MLOps dataset but what’s underneath each of these can vary a lot and needs to be user customizable.
Data often requires pre-processing to be cleansed and made into something appropriate and more useable by ML models
Code and data must co-evolve together, over time – as data changes, the code that uses them change. Adding more data may not cause changes to code but models are constantly under scrutiny to improve performance, accuracy or remove biases. Bias elimination often requires data changes but code changes may also be needed.

It’s that last requirement, MLOps data and code must co-evolve and thus, need to be versioned together that’s most unusual. Data-code co-evolution is needed for reproducibility, rollback and QA but also for many other reasons as well.

In the paper they show a typical MLOps data pipeline.

Versioning can also provide data (and code) provenance, identifying the origin of data (and code). MLOps teams undergoing continuous integration need to know where data and code came from and who changed them. And as most MLOps teams collaborate in the development, they also need a way to identify data and code conflicts when multiple changes occur to the same artifact.

Source version control

Code has had this versioning problem forever and the solution became revision control systems (RCS) or source version control (SVC) systems. The most popular solutions for code RCS are Git (software) and GitHub (SaaS). Both provide repositories and source code version control (clone, checkout, diff, add/merge, commit, etc.) as well as a number of other features that enable teams of developers to collaborate on code development.

The only thing holding Git/GitHub back from being the answer to MLOps data and code version control is that they don’t handle large (>1MB) files very well.

The solution seems to be adding better data handling capabilities to Git or GitHub. And that’s what XetHub has created for Git.

XetHub’s “Git is Data” paper (see link above) explains what they do in much detail, as to how they provide a better data layer to Git, but it boils down to using Git for code versioning and as a metadata database for their deduplicating data store. They are using a Merkle trees to track the chunks of data in a deduped dataset.

How XetHub works

XetHub support (dedupe) variable chunking capabilities for their data store. This allows them to use relatively small files checked into Git to provide the metadata to point to the current (and all) previous versions of data files checked into the system.

Their mean chunk size is ~4KB. Data chunks are stored in their data store. But the manifest for dataset versions is effectively stored in the Git repository.

The paper shows how using a deduplicated data store can support data versioning.

XetHub uses a content addressable store (CAS) to store the file data chunk(s) as objects or BLOBs. The key to getting good IO performance out of such a system is to have small chunks but large objects.

They map data chunks to files using a CDMT (content defined merkle tree[s]). Each chunk of data resides in at least two different CDMTs, one associated with the file version and the other associated with the data storage elements.

XetHub’s variable chunking approach is done using a statistical approach and multiple checksums but they also offer one specialized file type chunking for CSV files. As it is, even with their general purpose variable chunking method, they can offer ~9X dedupe ratio for text data (embeddings).

They end up using Git commands for code and data but provide hooks (Git filters) to support data cloning, add/checkin, commits, etc.). So they can take advantage of all the capabilities of Git that have grown up over the years to support code collaborative development but use these for data as well as code.

In addition to normal Git services for code and data, XetHub also offers a read-only, NFSv3 file system interface to XetHub datases. Doing this eliminates having to reconstitute and copy TB of data from their code-data repo to user workstations. With NFSv3 front end access to XetHub data, users can easily incorporate data access for experimentation, testing and other uses.

Results from using XetHub

XetHub showed some benchmarks comparing their solution to GIT LFS, another Git based large data storage solution. For their benchmark, they used the CORD-19 (and ArXiv paper, and Kaggle CORD-I9 dataset) which is a corpus of all COVID-19 papers since COVID started. The corpus is updated daily, released periodically and they used the last 50 versions (up to June 2022) of the research corpus for their benchmark.

Each version of the CORD-19 corpus consists of JSON files (research reports, up to 700K each) and 2 large CSV files one with paper information and the other paper (word?) embeddings (a more useable version of the paper text/tables used for ML modeling).

For CORD-19, XetHub are able to store all the 2.45TB of research reports and CSV files in only 287GB of Git (metadata) and datastore data, or with a dedupe factor 8.7X. With XetHub’s specialized CSV chunking (Xet w/ CSV chunking above), the CORD-19 50 versions can be stored in 87GB or with a 28.8X dedupe ratio. And of that 87GB, only 82GB is data and the rest ~5GB is metadata (of which 1.7GB is the merle tree).

In the paper, they also showed the cost of branching this data by extracting and adding one version which consisted of a 75-25% (random) split of a version. This split was accomplished by changing only the two (paper metadata and paper word embeddings) CSV files. Adding this single split version to their code-data repository/datastore only took an additional 11GB of space An aligned split (only partitioning on a CSV record boundary, unclear but presumably with CSV chunking), only added 185KB.

XETHUB Potential Enhancements

XetHub envisions many enhancements to their solution, including adding other specific file type chunking strategies, adding a “time series” view to their NFS frontend to view code/data versions over time, finer granularity data provenance (at the record level rather than at the change level), and RW NFS access to data. Further, XetHub’s dedupe metadata (on the Git repo) only grows over time, supporting updates and deletes to dedupe metadata would help reduce data requirements.

Read the paper to find out more.

Picture/Graphic credit(s):

From Wikipedia Version Control article, By Revision_controlled_project_visualization.svg: *Subversion_project_visualization.svg: Traced by User:Stannered, original by en:User:Sami Keroladerivative work: Moxfyre (talk)derivative work: Echion2 (talk) – Revision_controlled_project_visualization.svg, CC BY-SA 3.0
From XetHub’s Git is Data paper.
From XetHub’s Git is Data paper Figure 1.
From XetHub’s Git is Data paper Figure 3.
From XetHub’s Git is Data paper Figure 4.

CTERA, Cloud NAS on steroids

Posted on August 13, 2021August 13, 2021 by Ray in Cloud services, Cloud storage, data access, Data consistency, Data security, Distributed computing, File Storage, Mobile computing, Object storage, storage scalability, Strategic Inflection Points

We attended SFD22 last week and one of the presenters was CTERA, (for more information please see SFD22 videos of their session) discussing their enterprise class, cloud NAS solution.

We’ve heard a lot about cloud NAS systems lately (see our/listen to our GreyBeards on Storage podcast with LucidLink from last month). Cloud NAS systems provide a NAS (SMB, NFS, and S3 object storage) front-end system that uses the cloud or onprem object storage to hold customer data which is accessed through the use of (virtual or hardware) caching appliances.

These differ from file synch and share in that Cloud NAS systems

Don’t copy lots or all customer data to user devices, the only data that resides locally is metadata and the user’s or site’s working set (of files).
Do cache working set data locally to provide faster access
Do provide NFS, SMB and S3 access along with user drive, mobile app, API and web based access to customer data.
Do provide multiple options to host user data in multiple clouds or on prem
Do allow for some levels of collaboration on the same files

Although admittedly, the boundary lines between synch and share and Cloud NAS are starting to blur.

CTERA is a software defined solution. But, they also offer a whole gaggle of hardware options for edge filers, ranging from smart phone sized, 1TB flash cache for home office user to a multi-RU media edge server with 128TB of hybrid disk-SSD solution for 8K video editing.

They have HC100 edge filers, X-Series HCI edge servers, branch in a box, edge and Media edge filers. These later systems have specialized support for MacOS and Adobe suite systems. For their HCI edge systems they support Nutanix, Simplicity, HyperFlex and VxRail systems.

CTERA edge filers/servers can be clustered together to provide higher performance and HA. This way customers can scale-out their filers to supply whatever levels of IO performance they need. And CTERA allows customers to segregate (file workloads/directories) to be serviced by specific edge filer devices to minimize noisy neighbor performance problems.

CTERA supports a number of ways to access cloud NAS data:

Through (virtual or real) edge filers which present NFS, SMB or S3 access protocols
Through the use of CTERA Drive on MacOS or Windows desktop/laptop devices
Through a mobile device app for IOS or Android
Through their web portal
Through their API

CTERA uses a, HA, dual redundant, Portal service which is a cloud (or on prem) service that provides CTERA metadata database, edge filer/server management and other services, such as web access, cloud drive end points, mobile apps, API, etc.

CTERA uses S3 or Azure compatible object storage for its backend, source of truth repository to hold customer file data. CTERA currently supports 36 on-prem and in cloud object storage services. Customers can have their data in multiple object storage repositories. Customer files are mapped one to one to objects.

CTERA offers global dedupe, virus scanning, policy based scheduled snapshots and end to end encryption of customer data. Encryption keys can be held in the Portals or in a KMIP service that’s connected to the Portals.

CTERA has impressive data security support. As mentioned above end-to-end data encryption but they also support dark sites, zero-trust authentication and are DISA (Defense Information Systems Agency) certified.

Customer data can also be pinned to edge filers, Moreover, specific customer (director/sub-directorydirectories) data can be hosted on specific buckets so that data can:

Stay within specified geographies,
Support multi-cloud services to eliminate vendor lock-in

CTERA file locking is what I would call hybrid. They offer strict consistency for file locking within sites but eventual consistency for file locking across sites. There are performance tradeoffs for strict consistency, so by using a hybrid approach, they offer most of what the world needs from file locking without incurring the performance overhead of strict consistency across sites. For another way to do support hybrid file locking consistency check out LucidLink’s approach (see the GreyBeards podcast with LucidLink above).

At the end of their session Aron Brand got up and took us into a deep dive on select portions of their system software. One thing I noticed is that the portal is NOT in the data path. Once the edge filers want to access a file, the Portal provides the credential verification and points the filer(s) to the appropriate object and the filers take off from there.

CTERA’s customer list is very impressive. It seems that many (50 of WW F500) large enterprises are customers of theirs. Some of the more prominent include GE, McDonalds, US Navy, and the US Air Force.

Oh and besides supporting potentially 1000s of sites, 100K users in the same name space, and they also have intrinsic support for multi-tenancy and offer cloud data migration services. For example, one can use Portal services to migrate cloud data from one cloud object storage provider to another.

They also mentioned they are working on supplying K8S container access to CTERA’s global file system data.

There’s a lot to like in CTERA. We hadn’t heard of them before but they seem focused on enterprise’s with lots of sites, boatloads of users and massive amounts of data. It seems like our kind of storage system.

Comments?

The rise of MinIO object storage

Posted on February 3, 2021February 3, 2021 by Ray in Cloud storage, Crowdsourcing, Distributed computing, IoT, NVMe storage, Object storage, Storage performance, Strategic Inflection Points

MinIO presented at SFD21 a couple of weeks back (see videos here). They had a great session, as always with Jonathan and AB leading the charge. We’ve had a couple of GreyBeardsOnStorage podcasts with AB as well (listen and see GreyBeards talk open source S3… and GreyBeards talk Data Persistence …). We first talked with MinIO last year at SFD 19 where AB made a great impression on the bloggers (see videos here)

Their customers run the gamut from startups to F500. AB said that ~58% of the F500 have MinIO installed and over 8% of the F500 have added capacity over the last year. AB said they have a big presence in Finance, e.g., the 10 largest banks run MinIO, also the auto and Space/Defense sectors have adopted their product.

One reason for the later two sectors (auto & space/defense) is the size of MinIO’s binary, 50MB. And my guess for why the rest of those customers have adopted MinIO is because it’s S3 API compatible, it’s open source, and it’s relatively inexpensive.

Object storage trends

Customers running in the cloud have a love-hate relationship with object storage, they love that it scales but hate what it costs. There are numerous on prem object storage alternatives from traditional and non-traditional storage vendors, but most are deployed on appliances.

With appliances, customers have to order, wait for delivery, rack-configure-set up and after maybe weeks to months finally they have object storage on prem. But with MinIO a purely software, open source solution, it can be tried by merely downloading a couple of (Docker) containers and deployed/activated in under an hour..

As mentioned above, MinIO is API compatible with AWS S3 which helps with adoption. Moreover, now that it’s an integral part of VMware (see their new Data Persistence Platform), it can be enabled in seconds on your standard enterprise VMware cluster with Tanzu.

The other trend is that the edge needs storage, and lots of it. The main drivers of massive edge storage requirement are TelCos deploying 5G and auto industry’s self-driving cars. But this is just a start, industrial IoT will be generating reams of sensor log data at the edge, it will need to be stored somewhere. And what better place to store all this data, but on object storage. Furthermore, all this is driving more adoption of object storage, with MinIO picking up the lion’s share of deployments.

In addition, MinIO recently ported their software to run on ARM. AB said this was to support the expanding hobbyist and developers community driving edge innovation.

And then there was Kubernetes. Everyone in the industry (with the possible exception of Google) is surprised by the adoption of K8S. Google essentially gifted ~$1Bs in R&D on how to scale apps to the world of IT, and now any startup, anywhere, can scale with as well as Google can. And scaling is the “killer app” for the SW industry.

But performance isn’t bad either

Jonathan made mention of MinIO performance (see MinIO 24 node disk and MinIo 32 node NVMe SSD reports) benchmarks. Their disk data shows avg read and write performance of 16.3GB/s and 9.4GB/s, respectively and their NVMe SSD average read and write performance of 183.2GB/s and 171.3GB/s, respectively. The disk numbers are very good for object storage, but the SSD numbers are spectacular.

It turns out that modern, cloud native apps don’t need quick access to data as much as high data throughput. Modern apps have moved to a processing data in memory rather than off of storage, which means they move (large) chunks of data to memory and crunch on it there, and then spit it back out to storage This type of operating mode seems to scale better (in the cloud at least) than having a high priced storage system servicing a blizzard of IO requests from everywhere.

Other vendors had offered SSD object storage before but it never took off. But nowadays, with NVMe SSDs, MinIO is seeing starting to see healthcare, finance, and any AI/ML workloads all deploying NVMe SSD object storage. Yes for large storage repositories, (object storage’s traditional strongpoint), ie, 5PB to 100PB, disk can’t be beat but where blistering high throughput, is needed, NVMe SSD object storage is the way to go.

Open source vs. open core

AB mentioned that MinIO business model is 100% open source vs. many other vendors that use open source but whose business model is open core. The distinction is that open core vendors use open source as base functionality and then build proprietary, charged for, software features/functions on top of this.

But open source vendors, like MinIO offer all their functionality under an open source license (Apache SM License V2.0, GNU AGPL v3 Open Source license and other FOSS licenses), but if you want to use it commercially, build products with it embedded inside, or have enterprise class support, one purchases a commercial license.

As presented at SFD21, but their website home page has updated numbers reflected below

The pure open source model has some natural advantages:

It’s a great lead gen solution because anyone, worldwide, 7X24X365 can download the software and start using it, (see Docker Hub or MinIO’s download page
It’s a great hiring pool. Anyone, who has contributed to the MinIO open source is potentially a great technical hire. MinIO stats says they have 685 contributors, 19 in just the last month for MinIO base code (see MinIO’s GitHub repo).
It’s a great development organization. With ~20 commits a weekover the last year, there’s a lot going on to add functionality/fix bugs. But that’s the new world of software development. Given all this activity, release frequencies increase, ~4 releases a month ((see GitHub repo insights above).
It’s a great testing pool with, ~480M Docker Pulls (using a Docker container to run a standard, already configured MinIO server, mc, console, etc.) and ~18K enterprises running their solution, that’s an awful lot of users. With open source a lot of eye’s or contributors make all problems visible, but what’s more typical, from my perspective, is the more users that deploy your product, the more bugs they find.

Indeed, with the VMware’s Data Persistence Platform, Tanzu customers can use MinIO’s object storage at the click of a button (or three).

Of course, open source has downsides too. Anyone can access packages directly (from GitHub repo and elsewhere) and use your software. And of course, they can clone, fork and modify your source code, to add any functionality they want to it. Historically, open source subscription licensing models don’t generate as much revenues as appliance purchases do. And finally, open source, because it’s created by geeks, is typically difficult to deploy, configure, and use.

But can they meet the requirements of an Enterprise world

Because most open source is difficult to use, the enterprise has generally shied away from it. But that’s where there’s been a lot of changes to MinIO.

MinIO always had a “mc” (minio [admin] client) that offered a number of administrative services via an API, programmatically controlled interface. but they have recently come out with a GUI offering, the minIO console, which has a similarly functionality to their mc APU. They demoed the console on their SFD21 sessions (see videos above).

Supporting 18K enterprise users, even if only 8% are using it a lot, can be a challenge, but supporting almost a half a billion docker pulls (even if only 1/4th of these is a complete minIO deployment) can be hell on earth. The surprising thing is that MinIO’s commercial license promises customers direct-to-engineer support.

At their SFD21 sessions, AB stated they were getting ~2.7 new (tickets) problems a day. I assume these are what’s just coming in from commercial licensed users and not the general public (using their open source licensed offerings). AB said their average resolution time for these tickets was under 15 minutes.

Enter SubNet, the MinIO Subscription Network and their secret (not open source?) weapon to scale enterprise class support. Their direct-to-engineer support model involves a much, more collaborative approach to solving customer problems then you typical enterprise support with level 1, 2 & 3 support engineers. They demoed SubNet briefly at SFD21, but it could deserve a much longer discussion/demostration.

What little we saw (at SFD21) was that it looked almost like slack-PM dialog between customer and engineer but with unlimited downloads and realtime interaction.

MinIO also supports a very active Slack discussion group with ~11K users. Here anyone can ask a question and it will get answered by anyone. MinIO’s Slack has 2 channels: (Ggeneral and GitHub for notifications). It seems like MinIO is using Slack as a crowdsourced level 1 support.

But in the long run, to continue to offer “direct-to-engineer” levels of support, may require adding a whole lot more engineers. But AB seems prepared to do just that.

~~~~

MinIO is an interesting open source S3 API compatible, object storage solution that seems to run just about anywhere, is freely deployable with enterprise class support available (at a price) and has high throughput performance. What’s not to like.

Storage that provides 100% performance at 99% full

Posted on August 24, 2020August 24, 2020 by Ray in Distributed computing, File Storage, Object storage, Storage performance, System effectiveness

A couple of weeks back we were talking with Qumulo at Storage Field Day 20 (SFD20) and they made mention that they were able to provide 100% performance at 99% full. Please see their video session during SFD20 (which can be seen here). I was a bit incredulous of this seeing as how every other modern storage system performance degrades long before they get to 99% capacity.

So I asked them to explain how this was possible. But before we get to that a little background on modern storage systems would be warranted.

The perils of log structured file systems

Most modern storage systems use a log structured file system where when they write data they write it to a sequential log and use a virtual addressing scheme to show where the data is located for that address, creating a (data) log of written blocks.

However, when data is overwritten, it leaves gaps in these data logs. These gaps need to be somehow recycled (squeezed out) in order to be able to be consumed as storage capacity. This recycling process is commonly called “garbage collection”.

Garbage collection does its work by reading heavily gapped log files and re-writing the old, but still current, data into a new log. This frees up those gaps to be reused. But garbage collection like this takes reading and writing of logs to free up space.

Now as log structured file systems get (70-80-90%) full, they need to spend more and more system time and effort (=performance) garbage collecting . This takes system (IO) performance away from normal host IO activity. Which is why I didn’t believe that Qumulo could offer 100% IO performance at 99% full.

But there was always another way to supply storage virtualization (read snapshotting) besides log files. Yes it might involve more metadata (table) management, but what it takes in more metadata, it gives back by requiring no garbage collection.

How Qumulo does without garbage collection

Qumulo has a scaled block store for a back end of their file and object cluster store. And yes it’s still a virtualized block store BUT it’s not a log structured file store.

It seems that there’s a virtual-to-physical mapping table that is used by Qumulo to determine the physical address of any virtual block in the file system. And files are allocated to virtual blocks directly through the use of B-tree metadata. These B-trees indicate which virtual blocks are in use by a file and its snapshots

If a host overwrites a data block. The block can be freed (if not being used in a snapshot) and placed on a freed block list and a new block is allocated in its place. The file’s allocated blocks b-tree is updated to reflect the new block and that’s it.

For snapshots, Qumulo uses something they call “write-out-of-place” process when data that a snapshot points to is overwritten. Again, it appears as if snapshots are some extra metadata associated with a file’s B-tree that defines the data in the snapshot.

The problem comes in when a file is deleted. If it’s a big enough file (TB-PB?), there could be millions to billions of blocks that have to be freed up. This would take entirely too long for a delete command, so this is done in the background. Qumulo calls this “reclaim delete“. So a delete of a big file unlinks the block B-tree from the directory and puts it on this reclaim delete work queue to free up these blocks later. Similarly, when a big snapshot is deleted, Qumulo performs a background process called “reclaim snapshot” for snapshot unique blocks.

As can be seen (it’s very hard to see given the coloration of the chart) from this screen shot of Qumulo’s session at SFD20, reclaim delete and reclaim snapshot are being done concurrently (in the background) with normal system IO. What’s interesting to note here is that reclaim IO (delete and snapshots) are going on all the time during the customers actual work. Why the write throughput drops significantly doing the the 27-29 of July is hard to understand. But the one case where it’s most serious (middle of July 28) reclaim IO also drops significantly. If reclaim IO were impacting write performance I would have expected it to have gone higher when write throughput went lower. But that’s not the case. From what I can see in the above reclaim IO has no impact on read or write throughput at this customer.

So essentially, by using a backing block store that does no garbage collection (not using a log structured file system), Qumulo is able to offer 100% system IO performance at 99% full – woah.

Can we back up a PB?

Posted on July 30, 2020July 30, 2020 by Ray in Cloud storage, Data discovery, Data efficiency, data protection, Data transmission, File Storage, Object storage, Storage Backup

Tradition says no way. IT backup history says not on your life. Common sense would say never in a million years.

Most organizations with PB of data or more, depend on remote replication to protect against data center outage or massive loss of data. This of course costs ~2X your original data center. And for some organizations one copy is not enough, so ~3X .

I don’t know what a PB scale data storage costs these days but I can’t believe it’s under a couple Million $ USD in hw and sw costs and probably at least another Million or so in OpEx/year. Multiply that by 2 or 3X and you’re now talking real money.

How could backup help?

Well for one you wouldn’t need replicas, so that would cut your hw & sw acquisition costs by a factor of 2 or 3. But backup storage is not free either. So you’d probably need to add back 30-50% of the original data center in hw & sw costs for backups.

You certainly wouldn’t need as many admins. And power for backup storage should also be substantially less. So maybe your OpEx would only be 1.5X in total for the original PB and its backups.

But what could possibly back up a PB of data?

We were talking with Igneous at Cloud Field Day 8 (CFD8, see their video here) a couple of weeks back and they said they could and do backup PBs of data for customers today. A while back, e also talked with them on a GreyBeards on Storage podcast.

The problems with backing up a PB seem insurmountable. First you have to be able to scan a PB of data. This means looking into multiple file systems on many different hardware platforms, across potentially multiple data centers, and that’s just to get a baseline of what all needs to be backed up.

Then at some point you actually have to store all that data on backup storage. So, to gain some cost advantage, you’d want to compress and deduplicate a PB of data, so that the first full backup wouldn’t take a full PB of backup storage.

Then of course you have to transfer a PB of data to your backup storage, in something that wouldn’t take months to perform. And that just gets you the first full backup.

Next, comes the daily scan of what’s changed. This has to re-scan your PB of data to find that 100TB or so, that’s changed over the last 24 hrs. Sometime after that scan completes, then all that 100TB or so of changed data needs to be compressed, deduped and transferred again to backup storage

And if that’s not enough, you have to do it all over again, every day, from now on, almost forever. And data continues to grow. So 1PB today is likely to be 2PB of more in 12 months (it’s great to be in the storage business).

So those are the challenges. How can it be done, effectively, day in and day out, enough so that IT can depend on their data being backed up.

Igneous to the rescue…

First, Igneous came out of stealth a while back (listen to our podcast) with a couple of unique capabilities needed for massive data repository discovery and analysis. That is they built a unique engine to scan and index PB scale data repositories. This was so they couldd provide administrators better visibility into their PB scale data repositories. But this isn’t about that product, it’s about backup.

But some of the capabilities they needed to support that product helped them perform backups as well. For instance, their scan needed to handle PBs of data. They came up with AdaptiveSCAN, which didn’t use standard NFS and SMB data transfer protocols to gain access to file metadata. To open a file on NFS or SMB takes quite a lot of NFS or SMB transactions. But to access metadata only, one doesn’t have to use all those NFS and SMB capabilities, it can be done with much less overhead even when using NFS or SMB.

Of course having a way to scan Billions of files was a major accomplishment, but then where do you put all that metadata. And how can you access it effectively to support backup up a PB data repository. So they needed some serious data indexing capabilities and so came up with InfiniteINDEX

Now a trillion item index, seems a bit much, even for PB scale repositories. But my guess is they have eyes on taking their PB scale backups and going after even bigger fish,. That is offering backups for EB scale data repository. And that might just take a trillion item index

Next, there’s moving PB or even TB of data quickly is no small trick. As the development team at Igneous mostly came from unstructured data providers, they also understood and have access to APIs for most storage vendors (NetApp, Dell-EMC Isilon, Pure FlashBlade, Qumulo, etc.). As such, where available, they utilized those native vendor storage API calls to help them move data rather than having to Open an NFS or SMB file and Read it.

Of course, even doing all that, moving 100TBs of data around or scanning PB sized data repositories is going to take a lot of processing and IO bandwidth to do in a reasonable period of time.

So another capability they developed is massive parallelism. That is being able to distribute scan, indexing or data movement work, out to multiple systems. In that fashion it can be accomplished in significantly less wall clock time.

Well with all that, they pretty much had the guts of a backup application system for PB data repositories but they still didn’t have the glue to put it all together. But recently they announced just that a Igneous’s DataProtect, a full scale backup application for PB of data.

I suppose I haven’t done justice to all of what they have developed or talked about at their session, so I would suggest viewing their talk at CFD8 and listening to our GBoS podcast to learn more. They did demo their product at CFD8 but I believe it was a canned demo.

I didn’t think I’d see the day when some vendor would offer backup services for PBs of data let alone be shooting for more, but there you have it. Igneous means to take your PB scale data repositories and make them as easy to operate as TB scale data repositories. They call that democratizing data.

Comments?

See these other CFD8 bloggers write ups on Igneous.

CFD8 – Igneous Follow Up by Nate Avery (@Nathaniel_Avery)

Picture credit(s): All from screen saves during Igneous’s session at CFD8

Hitachi Vantara HCP, hits it out of the park #datacenternext

Posted on June 30, 2018August 23, 2018 by Ray in Cloud storage, Data longevity, Distributed computing, Object storage, Strategic Inflection Points

We talked with Hitachi Vantara this past week at a special Tech Field Day extra event (see videos here). This was an all day affair and was a broad discussion of Hitachi’s infrastructure portfolio.

There was much of interest in the days session but one in particular caught my eye and that was the session on Hitachi Vantara’s Content Platform (HCP).

Hitachi has a number of offerings surrounding their content platform, including:

HCP, on premises object store:
HCP Anywhere, enterprise file synch and share using HCP,
HCP Content Intelligence, compliance and content search for HCP object storage, and
HCP Data Ingestor, file gateway to HCP object storage.

I already knew about these offerings but had no idea how successful HCP has been over the years. inng to Hitachi Vantara, HCP has over 4000 installations worldwide with over 2000 customers and is currently the number 1 on premises, object storage solution in the world.

For instance, HCP is installed in 4 out of the 5 largest banks, insurance companies, and TelCos worldwide. HCP Anywhere has over a million users with over 15K in Hitachi alone. Hitachi Vantara has some customers using HCP installations that support 4000-5000 object ingests/sec.

HCP software supports geographically disbursed erasure coding, data compression, deduplication, and encryption of customer object data.

HCP development team has transitioned to using micro services/container based applications and have developed their Foundry Framework to make this easier. I believe the intent is to ultimately redevelop all HCP solutions using Foundry.

Hitachi mentioned a couple of customers:

US Government National Archives which uses HCP behind Pentaho to preserve presidential data and metadata for 100 years, and uses all open APIs to do so
UK Rabo Bank which uses HCP to support compliance monitoring across a number of data feeds
US Ground Support which uses Pentaho, HCP, HCP Content Intelligence and HCP Anywhere to support geospatial search to ascertain boats at sea and what they are doing/shipping.

There’s a lot more to HCP and Hitachi Vantara than summarized here and I would suggest viewing the TFD videos and check out the link above for more information.

Comments?

Want to learn more, see these other TFD bloggers posts:

Hitachi is reshaping its IT division by Andrew Mauro (@Andrew_Mauro)

Western Digital at SFD15: ActiveScale object storage

Posted on May 5, 2018 by Ray in data protection, Distributed computing, Object storage, Storage architecture, Storage archive, Storage availability

Phill Bullinger and his staff from Western Digital presented at Storage Field Day 15 (SFD15) on a number of their enterprise products including Tegile and IntelliFlash but the one that caught my interest was their ActiveScale object store acquired from Amplidata back in 2015.

ActiveScale is an onprem, object storage system that provides cloud-like economics for customer data.

ActiveScale Hardware

ActiveScale systems can both scale up and scale out within a single site. ActiveScale systems have both storage and system nodes. Storage nodes perform erasure coding and System nodes are control points and metadata managers for the object store.

ActiveScale comes in two appliance configurations that contain both storage and system nodes and storage required. The two appliances are:

ActiveScale P100 is a 7U 720TB pod system and A full rack of P100s can read 8GB/sec and can have 17-9s data availability. The P100 can scale up to 2.1PB in a single rack and up to 18PB in the same namespace. The P100 is a higher performing solution with better performing storage and system nodes
ActiveScale X100 is a 42U rack scale solution that holds up to 588 12TB drives or 5.8PB per rack. The X100 can scale up to 9 racks or 52PB in the same namespace. The X100 is a denser configuration with only 6 storage nodes and as such, has a better $/GB than the P100 above.

As WDC is both the supplier of the ActiveScale appliance and a supplier of disk storage they can be fairly aggressive with pricing on appliance systems.

Data integrity in ActiveScale

They make a point of saying that ActiveScale object metadata and data are stored separately. By separating data and metadata, they claim to be more resilient to system failures. Object metadata is 3 way replicated, in a replicated database, residing in system nodes. Other object systems often store metadata and object data in the same way.

Object data can be erasure coded. That is, object data is chunked, erasure coding protected and then spread across multiple disk drives for data protection. ActiveScale erasure coding is called BitSpread. With BitSpread customers identify the number of disk drives to spread object data across and the number of drive failures the system should recover from without data loss.

A typical BitSpread configuration splits object data into 18 chunks and spreads these chunks across storage columns. A storage column is from 6-18 storage nodes. There’s no pre-allocated space in BitSpread. Object data chunks are allocated to disk storage based on current capacity and performance of the system, within redundancy constraints.

In addition, ActiveScale has a background task called BitDynamics that scans erasure coded chunks and does a mathematical health check on the object data. If a chunk is bad, the object data chunk can be recovered and re-erasure coded back to proper health.

WDC performance testing shows that BitDynamics has 0 performance degradation when performing re-erasure coding. Indeed, they took out 98 drives in an ActiveScale cluster and BitDynamics re-coded all that data onto other disk drives and detected no performance impact. No indication how long re-encoding 98 disk drives of data took nor the % of object store capacity utilization at the time of the test but presumably there’s a report someplace to back this up

Unlike many public cloud based object storage systems, ActiveScale is strongly consistent. That is object puts (writes) are not responded back to the entity doing the put, until the object metadata and object data are properly and safely recorded in the object store.

ActiveScale also supports 3 site erasure coding. GeoSpread is their approach to erasure coding across sites. In this case, object metadata is replicated across 3 system nodes across the sites. Object data and erasure coded information is split into 20 chunks which are then spread across the three sites. This way if any one site goes down, the other two sites have sufficient metadata, object data chunks and erasure coded information to reconstruct the data.

ActiveScale 5.2 now supports asynch replication. That is any one ActiveScale cluster can replicate to any other ActiveScale cluster located continent distances away.

Unclear how GeoSpread and asynch replication would interact together, but my guess is that each of the 3 GeoSpread sites could be asynchronously replicated to 3 other sites for maximum redundancy.

Both GeoSpread and ActiveScale replication impact performance, depending on how far the sites are from one another and the speed and bandwidth of the links between sites.

ActiveScale markets

ActiveScale’s biggest market is media and entertainment (M&E), mostly used for media archive or tape replacement/augmentation. WDC showed one customer case study for the Montreaux Jazz Festival, which migrated 49 years of performance videos up to ActiveScale and can now stream any performance, on request, without delay. Montreax media is GeoSpread across 3 sites in France. Another option is to perform transcoding on the object media in realtime and stream the transcoded media.

Another large market is Bio/Life Sciences. Medical & biological scanners are transitioning to higher resolution scans which take more data space. And this sort of medical information needs to be kept a long time

Data analytics on ActiveScale

One other emerging market is data analytics. With the new S3A (S3 adapter), Hadoop clusters can now support object storage as a 2nd tier. One problem with data analytics is that they have lots of data and storing it in triplicate, costs an awful lot.

In big data world, datasets can get very large very quickly. Indeed PB sizes data sets aren’t that unusual. And with triple replication (in native HDFS). When HDFS runs out of space you have to delete data. Before S3A, the only way you could increase storage you had to scale out (with compute and storage and networking) in order to add capacity.

Using Hadoop’s S3A, ActiveScale’s can provide cold archive for data analytics. From a Hadoop user/application perspective, S3A ActiveScale storage looks like just another directory under HDFS (Hadoop Data File System). You can run MapReduce or other Hadoop application directly against object buckets. But a more realistic approach is to move inactive or cold data from an disk resident HDFS directory to a S3A directory

HDFS and MapReduce are tightly coupled and were designed to have data close to where computation happens. So, as long as the active data or working set data is on HDFS disk storage or directly in memory the rest of the (inactive) data could all be placed on S3A object storage. Inactive data is normally historical data no longer being actively analyzed while newer data would be actively analyzed. Older, inactive data can be manually or automatically archived off to S3A. With HIVE you can partition your database to have active data in HDFS disk storage and inactive data in S3A.

Another approach is if the active, working set data can all fit directly in memory then the data can reside on S3A object storage. This way the data is read from S3A storage into memory, analyzed there and output be done back to object store or HDFS disk. Because the data is only read (loaded) once, there’s only a minimal performance penalty to use S3A storage.

Western Digital is an active contributor to Hadoop S3A and have recently added performance improvements to S3A, such as better caching, partial object reading, and core XML performance tuning options.

~~~~
If your interested in learning more about Western Digital ActiveScale, check out the videos referenced earlier and their website.

Also you may be interested in these other posts on the WD sessions at SFD15:

The A is for Active, The S is for Scale by Dan Firth (@PenguinPunk)

Comments?