NASA’s journey to the cloud – part 1

Read an article the other day, NASA Turns to the Cloud for Help With Next-Generation Earth Missions, about how NASA has started to migrate all their data to the cloud and intends to store all new data there as well. The hope is that researchers will no longer need to download NASA data but rather can access it directly using cloud compute resources.

It turns out that newer earth science satellites are generating so much data that hosting it all is becoming a challenge. And with the quantities being discussed, downloading the data to perform research in a researcher's own environment may take days.

Until recently, earth science data has been hosted by and downloadable from NASA, ESA and other space organization sites. For example, see NASA's GHRC DAAC (Global Hydrometeorology Resource Center Distributed Active Archive Center), ESA EarthOnline, the JAXA GPM website, etc. Generally one could download a time series of data from any of their prior and current earth/planetary science missions without too much trouble.

The Land Processes Distributed Active Archive Center (LP DAAC) archives and distributes Global Forest Cover Change (GFCC) data products through the NASA Making Earth System Data Records for Use in Research Environments (MEaSUREs) (https://earthdata.nasa.gov/community/community-data-system-programs/measures-projects) Program….

But NASA's newest earth science satellites will be generating lots of data. For instance, the SWOT (Surface Water and Ocean Topography) mission data load will be 20TB/day and the NISAR (NASA-ISRO Synthetic Aperture Radar) mission data load will be 80TB/day. And it's only getting worse as more missions with newer instruments come online.

NASA estimates that, over time, they will store 247PB of data in their EarthData Cloud. At the moment, they have already migrated some of their Earth Science data (all of the ASF [Alaska Satellite Facility] DAAC and some of PO.DAAC [Physical Oceanography DAAC]) to AWS (us-west-2), and over time all of it will migrate there.
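To put those daily rates and the 247PB estimate in perspective, here's a quick back-of-envelope calculation. The 20TB/day and 80TB/day figures come from the article above; the rest is simple arithmetic:

```python
# Rough yearly data volumes implied by the mission daily rates quoted above.
DAYS_PER_YEAR = 365

missions = {"SWOT": 20, "NISAR": 80}  # TB/day, from the article

for name, tb_per_day in missions.items():
    pb_per_year = tb_per_day * DAYS_PER_YEAR / 1000  # 1 PB = 1000 TB
    print(f"{name}: {tb_per_day} TB/day ≈ {pb_per_year:.1f} PB/year")

# SWOT:  20 TB/day ≈  7.3 PB/year
# NISAR: 80 TB/day ≈ 29.2 PB/year
# These two missions alone add ~36 PB/year toward the 247 PB estimate.
```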

NASA will eat any egress charges for EOSDIS data and is also paying any and all hosting fees to store the data in AWS. It's unclear whether they are using standard S3 or S3 Intelligent-Tiering. And presumably they are using S3 replication to ensure they don't lose DAAC data in the cloud, but I don't see any evidence of that in the literature I've read. Of course, replication would double the storage costs for their 247PB of DAAC data.

Access to all this data is available to anyone with an EarthData login. There you can register for a profile to access NASA earth sciences data.

NASA’s EarthData also offers a number of AWS cloud based services to help one access this data:

  • EarthData search – filtered search facility to access NASA EarthData by platform (e.g. satellite), instrument (e.g. camera/visual data), organization (e.g. NASA/JPL), etc.
  • EarthData Common Metadata Repository – an API-driven metadata repository that "catalogs all data and service metadata records for NASA's EOSDIS (Earth Observing System Data and Information System) system" data, which can be accessed by anyone and includes programmatic access to EarthData search (a minimal query example follows this list).
  • EarthData Harmony – EarthData Jupyter notebook examples and API documentation for performing research on earth science data in the EarthData cloud.
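As a rough illustration of what programmatic access looks like, here's a minimal sketch of querying the CMR search API for collections. The endpoint and parameter names reflect my reading of the public CMR documentation, so treat them as assumptions and check the current EarthData docs before relying on them:

```python
import requests

# Query NASA's Common Metadata Repository (CMR) for collections matching a keyword.
# Endpoint and parameter names are my best understanding of the public CMR search
# API; verify against the current EarthData/CMR documentation.
CMR_COLLECTIONS = "https://cmr.earthdata.nasa.gov/search/collections.json"

resp = requests.get(
    CMR_COLLECTIONS,
    params={"keyword": "surface water ocean topography", "page_size": 5},
    timeout=30,
)
resp.raise_for_status()

for entry in resp.json()["feed"]["entry"]:
    # Each entry describes one dataset (collection) and where it is archived.
    print(entry.get("short_name"), "-", entry.get("title"))
```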

One reason to move EOSDIS DAAC data to the cloud is so that researchers no longer have to download the data to run their analyses. By using in-cloud EC2 compute instances, they can run their research in AWS with direct, high speed access to the EarthData.

Of course, the researcher would need to purchase their EC2 compute facility directly from AWS. NASA publishes a sort of AWS pricing primer for researchers who want to use AWS EC2 compute to do research directly on the data in the cloud. NASA also offers a series of tutorials on how to use the AWS cloud for doing research on NASA DAAC data.

Where to from here?

I find this all somewhat discouraging. Yes, it's the Gov't, but one needs to wonder what the overall costs of hosting NASA DAAC data on the AWS cloud will be over the long haul. Most organizations use the cloud to prototype and scale up services, but once these services have stabilized, they migrate them back to on-prem/CoLo infrastructure. See, for example, Dropbox's move away from the [AWS] cloud for ~600PB of data.

I get it, the public cloud allows for nearly infinite data scalability. But cloud storage is not cheap, especially when you are talking about 100s of PBs. And in today's world, with a whole bunch of open source solutions for object storage and services, one can almost recreate any cloud service in your own data center, at a much lower price.
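To get a feel for the numbers, here's a back-of-envelope sketch of what 247PB might cost to host in S3. The $/GB-month figures are my approximations of AWS list prices (which vary by tier, region and volume), not anything NASA has published:

```python
# Very rough back-of-envelope for hosting 247 PB in S3.
# The $/GB-month figures are approximate, assumed list prices; treat them as
# illustrative assumptions, not quotes.
CAPACITY_PB = 247
GB_PER_PB = 1_000_000

tiers = {
    "S3 Standard (assumed)":    0.021,    # $/GB-month
    "S3 Standard-IA (assumed)": 0.0125,   # $/GB-month
}

for label, price in tiers.items():
    monthly = CAPACITY_PB * GB_PER_PB * price
    print(f"{label}: ~${monthly/1e6:.1f}M/month, ~${monthly*12/1e6:.0f}M/year")

# S3 Standard (assumed):    ~$5.2M/month, ~$62M/year
# S3 Standard-IA (assumed): ~$3.1M/month, ~$37M/year
# Double these if the data is also replicated to a second region.
```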

Sure it will still take IT infrastructure and personnel to put it all together. But NASA doesn't seem to be lacking in infrastructure or IT personnel. Even if you are enamored with AWS services and software infrastructure, you can always run AWS Outposts in your own data centers. And DAAC services seem to be pretty stable over time. Yes, new satellites will generate more data, but the data load is understood and very predictable. So one should be able to anticipate all this and have infrastructure in place to deal with it.

Yes, having the ability to run analysis in the cloud directly on data that also sits in the cloud is useful, especially not having to download TBs of data. But these compute costs can also be significant and they are borne by the researcher, not NASA.

Another gripe is why use AWS alone. The other cloud providers all have similar object storage and compute capabilities. It seems wiser to me to set up the EarthData service such that different DAACs reside in different clouds. This would be more complex and harder to administer and use, but I believe in the long run it would lead to better, more effective services at a more reasonable price.

Going to the cloud doesn’t have to be a one way endeavor. After using the cloud for a while, NASA should have a better idea of the costs of doing so and at that time understand better what it can and cannot afford to do on its own.

It will be interesting to see what ESA, JAXA, CERN and other big science organizations do, as they are all in the same bind: their data seems to be growing without bound.


Societal growth depends on IT

Read an interesting article the other day in ScienceDaily (IT played a key role in growth of ancient civilizations) and a Phys.Org article (Information drove development of early states), both of which were reporting on a Nature article (Scale and information processing thresholds in Holocene social evolution) that discussed how the growth of societies in ancient times was directly correlated with the information processing capabilities they possessed. In these articles IT meant writing, accounting, currency, etc., relatively primitive forms of IT, but IT nonetheless.

Seshat: Global History Databank

What the researchers were able to do was to use the Seshat: Global History Databank which “systematically collects what is currently known about the social and political organization of human societies and how civilizations have evolved over time” and use the data to analyze the use of IT by societies.

We have talked about Seshat before (see our Data Analysis of History post).

The Seshat databank holds information on 30 natural geographic areas (NGAs) and ~400 societies, covering their history from 4000 BCE to 1900 CE.

Seshat has a ~100 page Code Book that identifies what kinds of information to collect on each society and how it is to be estimated, identified, listed, etc., to normalize the data in their databank. The Code Book provides essential guidelines on how to gather the ~1500 variables collected on societies.

IT drives society growth

The researchers used the Seshat DB and ran a statistical principal component analysis (PCA) of the data to try to ascertain what drove society’s growth.

PCA (see the wikipedia Principal Component Analysis article) essentially produces a set of composite components from the inter-relationships among the original variables. Each component explains a percentage (%Var) of the variance across all the original variables. PCA can be one, two, three or N-dimensional.

The researchers took 51 Seshat society variables, combined them into 9 (societal) complexity characteristics (CCs), and did a PCA of those CCs across all the (285) societies' information available at the time.
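For those who want to see the mechanics, here's a minimal sketch of that kind of analysis using scikit-learn. The data below is random stand-in data, not Seshat values, so only the shape of the computation matches the paper:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sketch of the kind of analysis the paper describes: PCA over the 9 complexity
# characteristics (CCs) for ~285 societies. The array below is random stand-in
# data; the real analysis uses the Seshat databank values for each society.
rng = np.random.default_rng(0)
cc_matrix = rng.normal(size=(285, 9))  # rows = societies, cols = CCs

pca = PCA(n_components=2)
scores = pca.fit_transform(StandardScaler().fit_transform(cc_matrix))

print("Variance explained:", pca.explained_variance_ratio_)
print("PC2 loadings (how much each CC contributes):", pca.components_[1])
# In the paper, positive PC2 loadings correspond to IT-like CCs (writing, texts,
# money, infrastructure) and negative loadings to scale CCs (polity area/population).
```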

Fig. 2 says that the average PC1 component of all societies is driven by the changes (increases and decreases) in PC2 components. Decreases of PC2 depend on those elements of PC2 which are negative and increases in PC2 depend on those elements of PC2 which are positive.

The elements in PC2 that provide the largest positive impacts are writing (.31), texts (.24), money (.28), infrastructure (.12) and gvrnmnt (.06). The elements in PC2 that provide the largest negative impacts are PolTerr (polity area, -0.35), CapPop (capital population, -0.27), PolPop (polity population, -0.25) and levels (?, -0.15). Below is another way to look at this data.

The positive PC2 CC’s are tracked with the red line and the negative PC2 CC’s are tracked with the blue line. The black line is the summation of the blue and red lines and is effectively equal to the blue line in Fig 2 above.

The researchers suggest that the inflection points (in Fig. 2, and in the black line in Fig. 3) represent societal information processing thresholds. Once these IT thresholds have been passed, they change the direction that PC2 takes on after that point.

In Fig. 4 they have disaggregated the information averaged in Figs. 2 & 3 and show PC2 and PC1 trajectories for all 285 societies tracked in the Seshat DB. Over time, as PC1 goes more positive, societies start to converge on effectively the same level of PC2. At earlier times, societies tend to be more heterogeneous, with varying PC2 (and PC1) values.

Essentially, societies' IT processing characteristics tend to start out highly differentiated, but over time as societies grow, IT processing capabilities tend to converge and lead to the same levels of societal growth.

Classifying societies by IT

The Kardashev scale (see the wikipedia Kardashev scale article) identifies levels or types of civilizations by their energy consumption. For example, the Kardashev scale lists the types of civilizations as follows:

  • Type I Civilization can use and control all the energy available on its planet.
  • Type II Civilization can use and control all the energy available in its planetary system (its star and all the planets/other objects in orbit around it).
  • Type III Civilization can use and control all the energy available in its galaxy.

I can't help but think that a more accurate scale for a civilization, society or polity's level would be a scale based on its information processing power.

We could call this the Shin scale (named after the primary author of the Nature paper or the Shin-Price-Wolpert-Shimao-Tracy-Kohler scale). The Shin scale would list societies based on their IT levels.

  • Type A Societies have non-existent IT (writing, money, texts & infrastructure), which severely limits their population and territorial size.
  • Type B Societies have primitive forms of IT (writing, money, texts & infrastructure, ~MB (10**6) of data), which allows these societies to expand to their natural boundaries (with a pop of ~10M).
  • Type C Societies have normal (2020) levels of IT (world wide Internet with billions of connected smart phones, millions of servers, ZB (10**21) of data, etc.), which allows societies to expand beyond their natural boundaries across the whole planet (pop of ~10B).
  • Type D Societies have high levels of IT (speculation here, but quintillion connected smart dust devices, trillion (10**12) servers, 10**36 bytes of data), which allows societies to expand beyond their home planet (pop of ~10T).
  • Type E Societies have even higher levels of IT (more speculation here, 10**36 smart molecules, quintillion (10**18) servers, 10**51 bytes of data), which allows societies to expand beyond their home planetary system (pop of ~10Q).

I'd list Type F societies here but I can't think of anything smaller than a molecule that could potentially be smart — perhaps this signifies a lack of imagination on my part.

Comments?


Where should IoT data be processed – part 1

I was at FlashMemorySummit 2019 (FMS2019) this week and there was a lot of talk about computational storage (see our GBoS podcast with Scott Shadley, NGD Systems). There was also a lot of discussion about IoT and the need for data processing done at the edge (or in near-edge computing centers/edge clouds).

At the show, I was talking with Tom Leyden of Excelero and he mentioned there was a real need for some insight on how to determine where IoT data should be processed.

For our discussion let’s assume a multi-layered IoT architecture, with 1000s of sensors at the edge, 100s of near-edge processing/multiplexing stations, and 1 to 3 core data center or cloud regions. Data comes in from the sensors, is sent to near-edge processing/multiplexing and then to the core data center/cloud.

Data size

Dans la nuit des images (Grand Palais) by dalbera (cc) (from flickr)

When deciding where to process data, one key aspect is the size of the data. This is typically in GB or TB, but in today's world it can be PB as well. This lone parameter has multiple impacts and can affect many other considerations, such as the cost and time to transfer the data, the cost of data storage, the amount of time to process the data, etc. All of these sub-factors depend on the size of the data to be processed.

Data size can be the largest single determinant of where to process the data. If we are talking about GB of data, it could probably be processed anywhere from the sensor edge, to near-edge station, to core. But if we are talking about TB the processing requirements and time go up substantially and are unlikely to be available at the sensor edge, and may not be available at the near-edge station. And PB take this up to a whole other level and may require processing only at the core due to the infrastructure requirements.

Processing criticality

Human or machine safety may depend on quick processing of sensor data, e.g. in a self-driving car, on a factory floor, in flood gauges, etc. In these cases, some amount of data processing (sufficient to ensure human/machine safety) needs to be done at the lowest point in the hierarchy that has the processing power to perform this activity.

This could be in the self-driving car or in the factory automation that controls a mechanism. Similar situations would probably apply for any robots and auto pilots. Anywhere an IoT sensor array is used to control an entity that could jeopardize the life of humans or the safety of machines, safety-level processing needs to be done at the lowest level in the hierarchy.

If processing doesn't involve safety, then it could potentially be done at the near-edge stations or at the core.

Processing time and infrastructure requirements

Although we talked about this in data size above, infrastructure requirements must also play a part in where data is processed. Yes sensors are getting more intelligent and the same goes for near-edge stations. But if you’re processing the data multiple times, say for deep learning, it’s probably better to do this where there’s a bunch of GPUs and some way of keeping the data pipeline running efficiently. The same applies to any data analytics that distributes workloads and data across a gaggle of CPU cores, storage devices, network nodes, etc.

There’s also an efficiency component to this. Computational storage is all about how some workloads can better be accomplished at the storage layer. But the concept applies throughout the hierarchy. Given the infrastructure requirements to process the data, there’s probably one place where it makes the most sense to do this. If it takes a 100 CPU cores to process the data in a timely fashion, it’s probably not going to be done at the sensor level.

Data information funnel

We make the assumption that raw data comes in through sensors, and more processed data is sent to higher layers. This would mean at a minimum, some sort of data compression/compaction would need to be done at each layer below the core.

We were at a conference a while back where they talked about updating deep learning neural networks. It's possible that each near-edge station could perform a mini deep learning training cycle and periodically share their learning with the core, which could then send this information back down to the lowest level to be used (see our Swarm Intelligence @ #HPEDiscover post).

All this means that there's a minimal level of data processing that needs to go on throughout the hierarchy.
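Here's a minimal sketch of that swarm/federated idea: each near-edge station trains on its local data and only the model weights move up the hierarchy. The model and data are toy stand-ins of my own, purely to illustrate the flow:

```python
import numpy as np

# Minimal sketch of the "mini training at the near-edge, share learning with the
# core" idea. Each station fits a tiny linear model on its local sensor data and
# only the model weights (not the raw data) are sent upward and averaged.
def local_update(weights, X, y, lr=0.01, epochs=10):
    for _ in range(epochs):
        grad = X.T @ (X @ weights - y) / len(y)   # gradient of mean squared error
        weights = weights - lr * grad
    return weights

rng = np.random.default_rng(1)
global_w = np.zeros(4)

# Pretend data at 3 near-edge stations (stand-ins for real sensor feeds).
stations = [(rng.normal(size=(100, 4)), rng.normal(size=100)) for _ in range(3)]

for round_ in range(5):
    local_ws = [local_update(global_w.copy(), X, y) for X, y in stations]
    global_w = np.mean(local_ws, axis=0)   # the core averages and redistributes

print("Aggregated model weights after 5 rounds:", global_w)
```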

Pipe availability


The availability of a networking access point may also have some bearing on where data is processed. For example, a self driving car could generate TB of data a day, but access to a high speed, inexpensive data pipe to send that data may be limited to a service bay and/or a garage connection.

So some processing may need to be done between access point connections. This will need to take place at lower levels. That way, there would be no need to send the data while the car is out on the road but rather it could be sent whenever it’s attached to an access point.

Compliance/archive requirements

Any sensor data probably needs to be stored for a long time and as such will need access to a long term archive. Depending on the extent of this data, it may help dictate where processing is done. That is, if all the raw data needs to be held, then maybe the processing of that data can be deferred until it's already at the core and on its way to archive.

However, any safety oriented data processing needs to be done at the lowest level and may need to be reprocessed higher up in the hierarchy. This would be done to ensure proper safety decisions were made. And needless to say, all this data would need to be held.

~~~~

I started this post with 40 or more factors but that was overkill. In the above, I tried to summarize the 6 critical factors which I would use to determine where IoT data should be processed.
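To make the factors concrete, here's a toy decision helper that strings them together. The thresholds and tier names are placeholders of my own choosing, for illustration only:

```python
# Toy decision helper pulling the factors above together. The thresholds are
# placeholders purely for illustration; any real deployment would tune them.
def where_to_process(data_gb, safety_critical, needs_gpu_cluster,
                     pipe_available, must_archive_raw):
    if safety_critical:
        return "sensor edge (at minimum, the safety-relevant subset)"
    if needs_gpu_cluster or data_gb > 100_000:      # ~100 TB and up
        return "core data center / cloud"
    if not pipe_available:
        return "near-edge station (buffer and reduce until a pipe is available)"
    if must_archive_raw:
        return "defer to core, process on the way to archive"
    return "near-edge station" if data_gb > 100 else "sensor edge"

print(where_to_process(data_gb=50_000, safety_critical=False,
                       needs_gpu_cluster=True, pipe_available=True,
                       must_archive_raw=True))
```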

My intent is, in a part 2 to this post, to work through some examples. If there's any one example that you feel may be instructive, please let me know.

Also, if there are other factors that you would use to determine where to process IoT data, let me know.

A tale of two storage companies – NetApp and Vantara (HDS-Insight Grp-Pentaho)

It was the worst of times. The industry changes had been gathering for a decade almost and by this time were starting to hurt.

The cloud was taking over all new business and some of the old. Flash’s performance was making high performance easy and reducing storage requirements commensurately. Software defined was displacing low and midrange storage, which was fine for margins but injurious to revenues.

Both companies held user events in Vegas over the last month: NetApp Insight 2017 last week and the Hitachi NEXT 2017 conference two weeks ago.

As both companies respond to industry trends, they provide an interesting comparison to watch companies in transition.

Company role

  • NetApp's underlying theme is to change the world with data, and they want to help companies do this.
  • Vantara’s philosophy is data and processing is ultimately moving into the Internet of things (IoT) and they want to be wherever the data takes them.

Hitachi Vantara is a brand new company that combines Hitachi Data Systems, Hitachi Insight Group and Pentaho (an analytics acquisition) into one organization to go after the IoT market. Pentaho will continue as a separate brand/subsidiary, but HDS and Insight Group cease to exist as separate companies/subsidiaries and are now inside Vantara.

NetApp sees transitions occurring in the way IT conducts business but ultimately, a continuing and ongoing role for IT. NetApp’s ultimate role is as a data service provider to IT.

Customer problem

  • Vantara believes the main customer issue is the need to digitize the business. Because competition is emerging everywhere, the only way for a company to succeed against this interminable onslaught is to digitize everything. That is, digitize your manufacturing/service production, sales, marketing, maintenance, any and all customer touch points, across your whole value chain, and do it as rapidly as possible. If you don't, your competition will.
  • NetApp sees customers today have three potential concerns: 1) how to modernize current infrastructure; 2) how to take advantage of (hybrid) cloud; and 3) how to build out the next generation data center. Modernization is needed to free capital and expense from traditional IT for use in Hybrid cloud and next generation data centers. Most organizations have all three going on concurrently.

Vantara sees the threat of startups, regional operators and more advanced digitized competitors as existential for today’s companies. The only way to keep your business alive under these onslaughts is to optimize your value delivery. And to do that, you have to digitize every step in that path.

NetApp views the threat to IT as originating from LoB/shadow IT originating applications born and grown in the cloud or other groups creating next gen applications using capabilities outside of IT.

Product direction

  • NetApp is looking mostly towards the cloud. At their conference they announced a new Azure NFS service powered by NetApp. They already had Cloud ONTAP and NPS, two current cloud offerings: software defined storage in the cloud and a co-lo hardware offering directly attached to the public cloud (Azure & AWS), respectively.
  • Vantara is looking towards IoT. At their conference they announced Lumada 2.0, an Industrial IoT (IIoT) product framework using plenty of Hitachi software functionality and intended to bring data and analytics under one software umbrella.

NetApp is following a path laid down years past when they devised the data fabric. Now, they are integrating and implementing data fabric across their whole product line. With the ultimate goal that wherever your data goes, the data fabric will be there to help you with it.

Vantara is broadening their focus from IT products and solutions to IoT. It's not so much abandoning present day IT as looking forward to the day when present day IT is just one cog in an ever expanding, completely integrated digital entity which the new organization becomes.

They both had other announcements, NetApp announced ONTAP 9.3, Active IQ (AI applied to predictive service) and FlexPod SF ([H]CI with SolidFire storage) and Vantara announced a new IoT turnkey appliance running Lumada and a smart data center (IoT) solution.

Who’s right?

They both are.

Digitization is the future, the sooner organizations realize and embrace this, the better for their long term health. Digitization will happen with or without organizations and when it does, it will result in a significant re-ordering of today’s competitive landscape. IoT is one component of organizational digitization, specifically outside of IT data centers, but using IT resources.

In the mean time, IT must become more effective and efficient. This means it has to modernize to free up resources to support (hybrid) cloud applications and supply the infrastructure needed for next gen applications.

One could argue that Vantara is positioning themselves for the long term and NetApp is positioning themselves for the short term. But that denies the possibility that IT will have a role in digitization. In the end both are correct and both can succeed if they deliver on their promise.

Comments?

 

Facebook moving to JBOF (just a bunch of flash)

At Flash Memory Summit (FMS 2016) this past week, Vijay Rao, Director of Technology Strategy at Facebook gave a keynote session on some of the areas that Facebook is focused on for flash storage. One thing that stood out as a significant change of direction was a move to JBOFs in their datacenters.

As you may recall, Facebook was an early adopter of (FusionIO’s) server flash cards to accelerate their applications. But they are moving away from that technology now.

Insane growth at Facebook

Why? Vijay started his talk with some of the growth they have seen over the years in photos, videos, messages, comments, likes, etc. Each was depicted as an animated bubble chart, with a timeline on the horizontal axis and a growth measurement in % on the vertical axis, with the size of the bubble being the actual quantity of each element.

Although the user activity growth rates all started out small at different times and grew at different rates during their individual timelines, by the end of each video they were all at almost 90-100% growth in 4Q15 (I assume this is a yearly growth rate, but could be wrong).

Vijay had similar slides showing the growth of their infrastructure, i.e.,  compute, storage and networking. But although infrastructure grew less quickly than user activity (messages/videos/photos/etc.), they all showed similar trends and ended up (as far as I could tell) at ~70% growth.

Object store and hybrid clouds at Cloudian

Out of Japan comes another object storage provider called Cloudian. We talked with Cloudian at Storage Field Day 7 (SFD7) last month in San Jose (see the videos of their presentations here).

Cloudian history

Cloudian has been out on the market since March of 2011 but we haven't heard much about them, probably because their focus has been East Asia. The same day that the Tōhoku Earthquake and Tsunami hit, the company announced Cloudian, an Amazon S3 Compliant Multi-Tenant Cloud Storage solution.

Their timing couldn’t have been better. Japanese IT organizations were beating down their door over the next two years for a useable and (earthquake and tsunami) resilient storage solution.

Cloudian spent the next 2 years hardening their object storage system, HyperStore, and now they are ready to take on the rest of the world.

Currently Cloudian has about 20PB of storage under management and is shipping a HyperStore Appliance or a software-only distribution of their solution. Cloudian's solutions support S3 and NFS access protocols.

Their solution uses Cassandra, a highly scalable, distributed NoSQL database that came out of Facebook, for their metadata database. This provides a scalable, distributed metadata repository for object metadata storage and lookup.

Cloudian creates virtual storage pools on backend storage which can be optimized for small objects, replication or erasure coding and can include automatic tiering to any Amazon S3 and Glacier compatible cloud storage. I would guess this is how they qualify for Hybrid Cloud status.

The HyperStore appliance

Cloudian creates a HyperStore P2P ring structure. Each appliance has Cloudian management console services as well as the HyperStore engine which supports three different data stores: Cassandra, Replicas, and Erasure coding. Unlike Scality, it appears as if one HyperStore Ring must exist in a region. But it can be split across data centers. Unclear what their definition of a “region” is.

HyperStore hardware comes in entry level (HSA-2024: 24TB/1U), capacity optimized (HSA-2048: 48TB/1U) and performance optimized (HSA-2060: all flash, 60TB/2U) configurations.

Replication with Dynamic Consistency

The other thing that Cloudian supports is different levels of consistency for replicated data. Most object stores support eventual consistency (see Eventual Data Consistency and Cloud Storage post).  HyperStore supports 3 (well maybe 5) different levels of consistency:

  1. One – object written to one replica and committed there before responding to client
  2. Quorum – object written to N/2+1 replicas before responding to client
    1. Local Quorum – replicas are written to N/2+1 nodes in the same data center before responding to client
    2. Each Quorum – replicas are written to N/2+1 nodes in each data center before responding to client.
  3. All – all replicas must have received and committed the object write before responding to client

There are corresponding read consistency levels as well. Object writes have a "coordinator" node which handles this consistency. The implication is that consistency could be established on a per-object basis. It's unclear to me whether read and write dynamic consistency can differ.
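To make the N/2+1 arithmetic concrete, here's a small sketch of how many replica acknowledgements each consistency level implies. This is just the arithmetic from the list above, not Cloudian's actual implementation:

```python
# How many replica acknowledgements each HyperStore-style consistency level
# implies, for N replicas spread across data centers. This is only the N/2+1
# arithmetic from the list above.
def required_acks(level, replicas_per_dc):
    total = sum(replicas_per_dc.values())
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return total // 2 + 1
    if level == "LOCAL_QUORUM":      # quorum within the coordinator's data center
        return replicas_per_dc["dc1"] // 2 + 1
    if level == "EACH_QUORUM":       # quorum in every data center
        return sum(n // 2 + 1 for n in replicas_per_dc.values())
    if level == "ALL":
        return total
    raise ValueError(level)

dcs = {"dc1": 3, "dc2": 3}           # e.g. 3 replicas in each of 2 data centers
for lvl in ["ONE", "QUORUM", "LOCAL_QUORUM", "EACH_QUORUM", "ALL"]:
    print(lvl, "->", required_acks(lvl, dcs), "acks")
```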

Apparently small objects are also stored in the  Cassandra datastore.  That way HyperStore optimizes for object size. Also, HyperStore nodes can be added to a ring and the system will auto balance the data across the old and new nodes automatically.

Cloudian also supports object versioning, ACLs, and QoS services.

~~~

I was a bit surprised by Cloudian. I thought I knew all the object storage solutions out on the market. But then again they made their major strides in Asia and as an on-premises Amazon S3 solution, rather than a generic object store.

For more on Cloudian from SFD7 see:

Cloudian – Storage Field Day 7 Preview by @VirtualizedGeek (Keith Townsend)

Is object storage outpacing structured and unstructured data growth?

NASA Blue Marble 2007 West by NASA Goddard Photo (cc) (from flickr)

At the Pacific Crest conference this week there was some lively discussion about the differences in the rates of data growth.  Some believe that object storage is growing much faster than structured and unstructured data.  For proof they point to the growth in Amazon S3 data objects and Microsoft Azure data objects.

  • Azure data objects quadrupled between June 2011 and June 2012 from 0.93T to over 4.0T objects.  Recently at the Microsoft Build Conference they indicated they are now storing over 8T objects which is doubling every six months. (See here and here).
  • Amazon S3 has also been growing, in June of 2012 they had over 1T objects and in April of 2013 they were storing over 2T objects. (See here).

For comparison purposes an Amazon S3 object is not equivalent in size to an Azure data object. I believe Amazon S3 objects are significantly larger (10 to 1000X larger) than an Azure data object (but I have no proof for this statement).

Nonetheless, Azure and S3 object storage growth rates are going off the charts.
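Just how far off the charts? Here's the implied annualized growth from the numbers quoted above (simple compounding, nothing more):

```python
# Implied annual growth rates from the object counts quoted above.
# Dates and counts are from the post; the annualization is simple compounding.
def annual_growth(start, end, months):
    return (end / start) ** (12 / months) - 1

print(f"Azure 0.93T -> 4T over 12 months: {annual_growth(0.93, 4, 12):.0%}/yr")
print(f"Azure doubling every 6 months:    {annual_growth(1, 2, 6):.0%}/yr")
print(f"S3 1T -> 2T over ~10 months:      {annual_growth(1, 2, 10):.0%}/yr")

# Azure 0.93T -> 4T over 12 months: 330%/yr
# Azure doubling every 6 months:    300%/yr
# S3 1T -> 2T over ~10 months:      130%/yr
```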

Comparing object storage growth to structured-unstructured data growth

How does the growth in objects compare to the growth in structured and unstructured storage? Most analysts claim that data is growing by 40-50% per year, and most of that is unstructured. However, I would contend that when you dig deeper into the unstructured aggregate, you find vastly different growth trajectories.

Historically, unstructured used to mean file data as well as object data, and it's only recently that anyone considered tracking them differently. But if you start splitting out object data from the aggregate, how fast is file data growing?

The key is file data growth

Latest IDC numbers tell us that NAS market revenue is declining while open-SAN (NAS and non-mainframe SAN) revenues were up slightly for 2Q2013 (see here for more information). Realize that revenue numbers aren't necessarily equal to data growth, and the NAS category doesn't include unified storage (combined NAS and SAN), which is how most enterprise vendors sell file storage these days. The other consideration is that flash's performance is potentially reducing storage overprovisioning, and data reduction technologies (dedupe, compression, thin provisioning, etc.) are increasing capacity utilization, which is driving down storage growth.

The other thing is that the amount of data in structured and unstructured forms is probably orders of magnitude larger than object data.

So object storage is starting at much lower capacities. But Amazon S3 and Azure data objects are also only a part of the object storage space. Most pure object storage solutions only reach their stride at 1PB or larger and may grow significantly from there.

Given all the foregoing what’s my take on the various growth rates of structured, unstructured and object storage, when in aggregate data is growing by 40-50% per year?

Assuming a baseline of 50% data growth rate, my best guess (and that’s all it is) is that,

  • Structured data growth accounts for 15% of overall data growth
  • Unstructured data growth accounts for 25% of overall data growth
  • Object storage accounts for 10% of overall data growth

You could easily convince me that object storage is more like 5% today and divide the remainder across structured and unstructured.

So how much data is this?

IDC claimed that the world created and replicated 2.8ZB of data in 2012 and predicts 4ZB of data will be created/replicated in 2013 (a ~43% growth rate). So of the incremental ~1.2ZB of data growth in 2013, ~0.36ZB will be structured, ~0.6ZB will be unstructured-file data and ~0.24ZB will be unstructured-object storage data.
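For the record, here's the arithmetic behind that split (my guessed percentages from above applied to the ~1.2ZB year-over-year increase):

```python
# The arithmetic behind the split above.
total_2012, total_2013 = 2.8, 4.0          # ZB, per IDC
new_data_2013 = total_2013 - total_2012    # ~1.2 ZB of incremental data growth

# My guessed shares of the ~50% overall growth rate, from the list above.
shares = {"structured": 15, "unstructured-file": 25, "unstructured-object": 10}

for kind, pct in shares.items():
    zb = new_data_2013 * pct / sum(shares.values())
    print(f"{kind}: ~{zb:.2f} ZB of 2013's data growth")

# structured: ~0.36 ZB, unstructured-file: ~0.60 ZB, unstructured-object: ~0.24 ZB
```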

At first blush, the object storage component looks much too large until you start thinking about all the media, satellite and mobile data being created these days. And then it seems about right to me.

What do you think?


The shrinking low-end

I was updating my features list for my SAN Buying Guide the other day when I noticed that low-end storage systems were getting smaller.

That is, NetApp, HDS and others had recently reduced the number of drives they support on their newest low-end storage systems (e.g., see the specs for HDS HUS-110 vs. AMS2100 and NetApp FAS2220 vs. FAS2040). And as the number of drives determines system capacity, the size of their SMB storage was shrinking.

But what about the data deluge?

With the data explosion going on, data growth in most IT organizations is something like 65%. But these problems seem to be primarily in larger organizations or in data warehouse databases used for operational analytics. In the case of analytics, these are typically done on database machines or Hadoop clusters and don't use low-end storage.

As for larger organizations, the most recent storage systems all seem to be flat to growing in capacity, not shrinking. So, the shrinking capacity we are seeing in new low-end storage doesn’t seem to be an issue in these other market segments.

What else could explain this?

I believe the introduction of SSDs is changing the drive requirements for low-end storage.  In the past, prior to SSDs, organizations would often over provision their storage to generate better IO performance.

But with most low-end systems now supporting SSDs, over-provisioning is no longer an economical way to increase performance. As such, for those needing higher IO performance, the most economical solution (CapEx and OpEx) is to buy a small amount of SSD capacity with the remainder in disk capacity.

That and the finding that maybe SMB data centers don’t need as much disk storage as was originally thought.
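A rough illustration of why SSDs remove the IOPS-driven need for lots of disks (the per-device IOPS figures are ballpark assumptions of mine, not vendor specs):

```python
# Rough illustration of why SSDs remove the need to over-provision disks for IOPS.
# The per-device IOPS figures below are ballpark assumptions, not vendor specs.
target_iops = 20_000

hdd_iops = 150        # assumed ~10K RPM disk drive
ssd_iops = 25_000     # assumed enterprise SATA/SAS SSD

hdds_needed = -(-target_iops // hdd_iops)   # ceiling division
ssds_needed = -(-target_iops // ssd_iops)

print(f"HDD-only: ~{hdds_needed} drives just to hit {target_iops} IOPS")
print(f"With SSD: {ssds_needed} SSD(s) for IOPS, plus only as many HDDs as capacity requires")

# HDD-only: ~134 drives just to hit 20000 IOPS
# With SSD: 1 SSD(s) for IOPS, plus only as many HDDs as capacity requires
```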

The downturn begins

So this is the first downturn in capacity to come along in my long history with data storage.  Never before have I seen capacities shrink in new versions of storage systems designed for the same market space.

But if SSDs are driving the reduction in SMB storage systems, shouldn’t we start to see the same trends in mid-range and enterprise class systems?

But disk enclosure re-tooling may be holding these system capacities flat. It takes time, effort and expense to re-implement disk enclosures for storage systems. And as the reductions we are seeing in the low-end are not that significant, maybe it's just not worth it for these other systems – just yet.

But it would be useful to see something that showed the median capacity shipped per storage subsystem. I suppose weighted averages are available from something like IDC disk system shipments and overall capacity shipped. But there's no real way to derive a median from these measures, and I think that's the only stat that might show how this trend is being felt in other market segments.

Comments?

Image credit: Photo of Dell EqualLogic PSM4110 Blade Array disk drawer, taken at Dell Storage Forum 2012