As you may recall, Facebook was an early adopter of FusionIO server flash cards to accelerate their applications. But they are moving away from that technology now.
Insane growth at Facebook
Why? Vijay started his talk with some of the growth they have seen over the years in photos, videos, messages, comments, likes, etc. Each was depicted as an animated bubble chart, with a timeline on the horizontal axis and a growth measurement in % on the vertical axis, the size of the bubble being the actual quantity of each element.
Although the user activity growth rates all started out small at different times and grew at different rates over their individual timelines, by the end of each video they were all at almost 90-100% growth in 4Q15 (I assume this is a yearly growth rate, but could be wrong).
Vijay had similar slides showing the growth of their infrastructure, i.e., compute, storage and networking. But although infrastructure grew less quickly than user activity (messages/videos/photos/etc.), they all showed similar trends and ended up (as far as I could tell) at ~70% growth.
Cloudian has been out on the market since March of 2011 but we haven’t heard much about them, probably because their focus has been East Asia. The same day the Tōhoku Earthquake and Tsunami hit, the company announced Cloudian, an Amazon S3 compliant, multi-tenant cloud storage solution.
Their timing couldn’t have been better. Japanese IT organizations were beating down their door over the next two years for a useable and (earthquake and tsunami) resilient storage solution.
Cloudian spent the next 2 years hardening their object storage system, HyperStore, and now they are ready to take on the rest of the world.
Currently Cloudian has about 20PB of storage under management and are shipping a HyperStore Appliance or a software only distribution of their solution. Cloudian’s solutions support S3 and NFS access protocols.
Their solution uses Cassandra, the highly scalable, distributed NoSQL database that came out of Facebook, for their metadata database. This provides a scalable, shared-nothing database for object metadata storage and lookup.
Cloudian creates virtual storage pools on backend storage which can be optimized for small objects, replication or erasure coding and can include automatic tiering to any Amazon S3 and Glacier compatible cloud storage. I would guess this is how they qualify for Hybrid Cloud status.
The HyperStore appliance
Cloudian creates a HyperStore P2P ring structure. Each appliance has Cloudian management console services as well as the HyperStore engine which supports three different data stores: Cassandra, Replicas, and Erasure coding. Unlike Scality, it appears as if one HyperStore Ring must exist in a region. But it can be split across data centers. Unclear what their definition of a “region” is.
HyperStore hardware comes in entry level (HSA-2024: 24TB/1U), capacity optimized (HSA-2048: 48TB/1U), and performance optimized (HSA-2060: all flash, 60TB/2U) configurations.
Replication with Dynamic Consistency
The other thing that Cloudian supports is different levels of consistency for replicated data. Most object stores support eventual consistency (see Eventual Data Consistency and Cloud Storage post). HyperStore supports 3 (well maybe 5) different levels of consistency:
One – object written to one replica and committed there before responding to client
Quorum – object written to N/2+1 replicas before responding to client
Local Quorum – replicas are written to N/2+1 nodes in same data center before responding to client
Each Quorum – replicas are written to N/2+1 nodes in each data center before responding to client.
All – all replicas must have received and committed the object write before responding to client
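To make the arithmetic behind those levels concrete, here's a minimal Python sketch of how the acknowledgment counts work out for a write. This is not Cloudian's actual API – the function name and structure are my own – just the standard Cassandra-style quorum math, assuming N total replicas:

```python
# Sketch of Cassandra-style write consistency levels (illustrative only,
# not Cloudian's actual implementation or API).

def required_acks(level, n_replicas, dc_replicas=None):
    """Return how many replica acknowledgments a write needs before the
    coordinator can respond to the client."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return n_replicas // 2 + 1
    if level == "LOCAL_QUORUM":
        # dc_replicas: replica count in the coordinator's data center
        return dc_replicas // 2 + 1
    if level == "EACH_QUORUM":
        # dc_replicas: list of replica counts, one per data center
        return sum(d // 2 + 1 for d in dc_replicas)
    if level == "ALL":
        return n_replicas
    raise ValueError(level)

# With 3 replicas, a QUORUM write needs 2 acks; ALL needs all 3.
print(required_acks("QUORUM", 3))  # -> 2
print(required_acks("ALL", 3))     # -> 3
```

The trade-off is the usual one: ONE is fastest but weakest, ALL is strongest but stalls on any slow or down replica, and the quorum variants sit in between.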
There are corresponding read consistency levels as well. Object writes have a “coordinator” node which handles this consistency. The implication is that consistency could be established on a per-object basis. It’s unclear to me whether read and write dynamic consistency can differ.
Apparently small objects are also stored in the Cassandra datastore. That way HyperStore optimizes for object size. Also, HyperStore nodes can be added to a ring and the system will auto balance the data across the old and new nodes automatically.
Cloudian also supports object versioning, ACLs, and QoS services.
I was a bit surprised by Cloudian. I thought I knew all the object storage solutions out on the market. But then again they made their major strides in Asia and as an on-premises Amazon S3 solution, rather than a generic object store.
At the Pacific Crest conference this week there was some lively discussion about the differences in the rates of data growth. Some believe that object storage is growing much faster than structured and unstructured data. For proof they point to the growth in Amazon S3 data objects and Microsoft Azure data objects.
Azure data objects quadrupled between June 2011 and June 2012 from 0.93T to over 4.0T objects. Recently at the Microsoft Build Conference they indicated they are now storing over 8T objects which is doubling every six months. (See here and here).
Amazon S3 has also been growing: in June of 2012 they stored over 1T objects and by April of 2013 over 2T objects. (See here).
For comparison purposes an Amazon S3 object is not equivalent in size to an Azure data object. I believe Amazon S3 objects are significantly larger (10 to 1000X larger) than an Azure data object (but I have no proof for this statement).
Nonetheless, Azure and S3 object storage growth rates are going off the charts.
Comparing object storage growth to structured-unstructured data growth
How does the growth in objects compare to the growth in structured and unstructured storage? Most analysts claim that data is growing by 40-50% per year, and most of that is unstructured. However, I would contend that when you dig deeper into the unstructured aggregate, you find vastly different growth trajectories.
Historically, unstructured used to mean file data as well as object data, and it’s only recently that anyone considered tracking them differently. But if you start splitting out object data from the aggregate, how fast is file data growing?
The key is file data growth
Latest IDC numbers tell us that NAS market revenue is declining while open-SAN (NAS and non-mainframe SAN) revenue was up slightly for 2Q2013 (see here for more information). Realize that revenue numbers aren’t necessarily equal to data growth, and the NAS figure doesn’t include unified storage (combined NAS and SAN), which is how most enterprise vendors sell file storage these days. The other consideration is that flash’s performance is potentially reducing storage overprovisioning, and data reduction technologies (dedupe, compression, thin provisioning, etc.) are increasing capacity utilization, both of which drive down storage growth.
The other thing is that the amount of data in structured and unstructured forms is probably orders of magnitude larger than object data.
So object storage is starting at much lower capacities. But Amazon S3 and Azure data objects are also only a part of the object storage space. Most pure object storage solutions only reach their stride at 1PB or larger and may grow significantly from there.
Given all the foregoing, what’s my take on the various growth rates of structured, unstructured and object storage, when in aggregate data is growing by 40-50% per year?
Assuming a baseline of 50% data growth rate, my best guess (and that’s all it is) is that:
Object storage accounts for 10% of overall data growth
You could easily convince me that object storage is more like 5% today and divide the remainder across structured and unstructured.
So how much data is this?
IDC claimed that the world created and replicated 2.8ZB of data in 2012 and predicts 4ZB of data will be created/replicated in 2013 (~43% growth rate). So of the ~1.2ZB of incremental data created in 2013, ~0.36ZB will be structured, ~0.6ZB will be unstructured-file data and ~0.24ZB will be unstructured-object storage data.
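A quick sanity check of that arithmetic in Python (the ZB splits are just my guesses from above, not IDC's):

```python
# Sanity check of the IDC-based arithmetic above (all units in ZB).
created_2012 = 2.8
created_2013 = 4.0

growth = (created_2013 - created_2012) / created_2012
print(f"growth rate: {growth:.0%}")       # -> growth rate: 43%

new_2013 = created_2013 - created_2012    # ~1.2 ZB of new data in 2013
structured, file_data, object_data = 0.36, 0.60, 0.24  # my guesses
assert abs((structured + file_data + object_data) - new_2013) < 1e-9
```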
At first blush, the object storage component looks much too large until you start thinking about all the media, satellite and mobile data being created these days. And then it seems about right to me.
I was updating my features list for my SAN Buying Guide the other day when I noticed that low-end storage systems were getting smaller.
That is, NetApp, HDS and others had recently reduced the number of drives they supported on their newest low-end storage systems (e.g., see specs for HDS HUS-110 vs. AMS2100 and NetApp FAS2220 vs. FAS2040). And as the number of drives determines system capacity, the size of their SMB storage was shrinking.
But what about the data deluge?
With the data explosion going on, data growth in most IT organizations is something like 65% per year. But these problems seem to be primarily in larger organizations or in data warehouse databases used for operational analytics. In the case of analytics, the work is typically done on database machines or Hadoop clusters and doesn’t use low-end storage.
As for larger organizations, the most recent storage systems all seem to be flat to growing in capacity, not shrinking. So, the shrinking capacity we are seeing in new low-end storage doesn’t seem to be an issue in these other market segments.
What else could explain this?
I believe the introduction of SSDs is changing the drive requirements for low-end storage. In the past, prior to SSDs, organizations would often over provision their storage to generate better IO performance.
But with most low-end systems now supporting SSDs, over provisioning is no longer an economical way to increase performance. As such, for those needing higher IO performance the most economical solution (CapEx and OpEx) is to buy a small amount of SSD capacity alongside the remaining disk capacity.
That and the finding that maybe SMB data centers don’t need as much disk storage as was originally thought.
The downturn begins
So this is the first downturn in capacity to come along in my long history with data storage. Never before have I seen capacities shrink in new versions of storage systems designed for the same market space.
But if SSDs are driving the reduction in SMB storage systems, shouldn’t we start to see the same trends in mid-range and enterprise class systems?
But disk enclosure re-tooling may be holding these system capacities flat. It takes time, effort and expense to re-implement disk enclosures for storage systems. And as the reductions we are seeing in the low-end are not that significant, maybe it’s just not worth it for these other systems – just yet.
But it would be useful to see something that showed the median capacity shipped per storage subsystem. I suppose weighted averages are available from something like IDC disk system shipments and overall capacity shipped. But there’s no real way to derive a median from these measures, and I think that’s the only stat that might show how this trend is being felt in other market segments.
Image credit: Photo of Dell EqualLogic PSM4110 Blade Array disk drawer, taken at Dell Storage Forum 2012
We have talked about this before but more facts have come to light regarding the explosion of mobile data traffic, signaling a substantive change in how the world accesses information. In the Rise of mobile and the death of the rest the focus was on the business risks and opportunities coming from the rise of mobile computing.
The MIT article had one chart (see slide 15 above in Mary’s deck) that showed mobile internet traffic in November 2012 was 13% of all internet traffic, up from a base level of ~1% in December 2009. So in roughly 3 years, mobile traffic’s share of internet bandwidth grew more than 13X.
Just in case you needed more convincing, in another article in MIT Technical Review (this one on spectrum sharing) Cisco was quoted as saying that mobile traffic would grow 18X by 2016.
If mobile’s winning, who’s losing?
That has got to be desktop computing. In fact, another chart (slide 16 in Mary’s deck) showed a comparison of India’s internet traffic tracking desktop vs. mobile devices, from December 2008 to November 2012. In this chart India’s mobile internet traffic exceeded desktop traffic sometime in mid-2012.
But I think the one chart that tells this story better (see slide 25) shows that smartphone and tablet shipments exceeded desktops and laptops in 2010. The other interesting thing is that one can also see the gradual decline in desktop and laptop shipments since then.
Where’s the revenue streams?
The funny thing about Mary’s presentation is that she was tracking mobile app and mobile advertising (see slide 17) as a rising revenue opportunity, expected to reach $19B in 2012. In my post on the rise of mobile, I assumed that mobile advertising would not be a successful model for mobile revenue streams – I was wrong.
Mary’s presentation also showed some of the impact of mobile on other markets and foretells the future impacts mobile will have. One telling example is standalone camera sales vs. mobile camera shipments (see slide 32), which crossed over in 2008, where standalone camera sales peaked at ~150M units. The same thing happened with standalone personal navigation devices (PNDs) (see slide 34), which peaked at 13M units in 2009, while Waze (a mobile navigation app) unit shipments exceeded PND unit shipments in Q1 2012.
The remainder of the presentation (at least what I read) seemed to define a new life-style option she called Asset-Light which was all about shedding physical assets like wallets, paperback books, TV and other screens, fixed LAN connectivity and moving to a completely mobile world where everything you need is on your tablet with access to the internet via WiFi or LTE.
Mobile is here, better get ready and figure out how to do business with it or consider this a great time to curtail your growth prospects.
The original study (see LIDAR at Angamuco) cited in the piece above was a result of the Legacies of Resilience project sponsored by Colorado State University (CSU) and goes into some detail about the data processing and archeological use of the LIDAR maps.
LIDAR sends a laser pulse from an airplane/satellite to the ground and measures how long it takes to reflect back to the receiver. With that information and “some” data processing, these measurements can be converted to an X, Y, & Z coordinate system or detailed map of the ground.
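The timing-to-distance conversion is the easy part; here's a toy Python version. Real LIDAR processing also folds in the aircraft's GPS/IMU position and the beam's scan angle, which I've omitted:

```python
# Convert a LIDAR pulse's round-trip time into range (distance to target).
# Simplified: real systems also use GPS/IMU position and beam angle to
# produce georeferenced X, Y, Z points.
C = 299_792_458.0  # speed of light in m/s

def range_from_echo(round_trip_seconds):
    # The pulse travels down and back, so the one-way distance is half.
    return C * round_trip_seconds / 2.0

# A return arriving ~6.67 microseconds after the pulse left puts the
# ground roughly 1km below the aircraft.
print(range_from_echo(6.67e-6))  # -> ~999.8 m
```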
The archeologists in the study used LIDAR to create a detailed map of the empire’s main city at a resolution of +/- 0.25m (~10in). They mapped ~207 square kilometers (80 square miles) at this level of detail. In 4 days of airplane LIDAR mapping, they were able to gather more information about the area than they had accumulated over 25 years of field work. Seems like digital archeology was just born.
So how much data?
I wanted to find out just how much data this was, but neither the article nor the study told me anything about the size of the LIDAR map. However, assuming this is a flat area (which it wasn’t) and that the +/-0.25m resolution represents a point every 625sqcm, then the area being mapped should represent a minimum of ~3.3 billion points in a LIDAR point cloud.
Given the above, I estimate the 207sqkm LIDAR grid point cloud represents a minimum of ~172GB of data. There are LIDAR compression tools available, but even at 50% reduction, it’s still ~85GB for the 207sqkm.
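For the curious, here's how those numbers fall out. The ~52 bytes per grid point is my assumption (roughly three double-precision coordinates plus some attribute data), picked to be consistent with the ~172GB estimate; the study doesn't state a per-point size:

```python
# Reproducing the point-cloud size estimate. The 52 bytes/point figure
# is my assumption (three 8-byte coordinates plus attributes), not a
# number from the study.
area_sqkm = 207
sqcm_per_sqkm = 100_000 ** 2                 # 1 km = 100,000 cm
points = area_sqkm * sqcm_per_sqkm / 625     # one point per 25cm x 25cm cell
print(f"{points / 1e9:.2f} billion points")  # -> 3.31 billion points

bytes_per_point = 52                         # assumption
size_gb = points * bytes_per_point / 1e9
print(f"~{size_gb:.0f} GB")                  # -> ~172 GB
```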
My understanding is that the raw LIDAR data would be even bigger than this and the study applied a number of filters against the LIDAR map data to extract different types of features which of course would take even more space. And that’s just one ancient city complex.
With all the above, the size of LIDAR raw data, grid point fields, and multiple filtered views is approaching significance (in storage terms). Moving and processing all this data must also be a problem. As evidence, the flights for the LIDAR runs over Angamuco, Mexico occurred in January 2011 and they were able to analyze the data sometime that summer, ~6 months later. That seems a bit long from my perspective; maybe the data processing/analysis could use some help.
Indiana Jones meets Hadoop
That was the main subject of the second paper mentioned above, done by researchers at the San Diego Supercomputing Center (SDSC). They essentially ran a benchmark comparing MapReduce/Hadoop on a relatively small cluster of 4 to 8 commodity nodes against an HPC cluster (28 Sun x4600M2 servers with 8 quad-core processors per node and anywhere from 256GB to 512GB [on only 8 nodes] of DRAM, running a C++ implementation of the algorithm).
The results of their benchmarks were that the HPC cluster beat the Hadoop cluster only when all of the LIDAR data could fit in memory (on a DRAM per core basis), after that the Hadoop cluster performed just as well in elapsed wall clock time. Of course from a cost perspective the Hadoop cluster was much more economical.
The 8-node, Hadoop cluster was able to “grid” a 150M LIDAR derived point cloud at the 0.25m resolution in just a bit over 10 minutes. Now this processing step is just one of the many steps in LIDAR data analysis but it’s probably indicative of similar activity occurring earlier and later down the (data) line.
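To illustrate, the "gridding" step essentially bins irregular ground returns into regular cells. Here's a toy pure-Python version of that core idea (the SDSC benchmark used MapReduce and C++; the function below is mine, not theirs):

```python
# Toy version of LIDAR "gridding": bin irregular (x, y, z) returns into
# a regular 0.25m grid, keeping the mean elevation per cell.
from collections import defaultdict

def grid_points(points, cell=0.25):
    """points: iterable of (x, y, z) tuples in meters.
    Returns {(col, row): mean elevation} for occupied cells."""
    cells = defaultdict(lambda: [0.0, 0])   # cell -> [z sum, count]
    for x, y, z in points:
        key = (int(x // cell), int(y // cell))
        acc = cells[key]
        acc[0] += z
        acc[1] += 1
    return {k: total / n for k, (total, n) in cells.items()}

pts = [(0.1, 0.1, 10.0), (0.2, 0.2, 12.0), (0.6, 0.1, 8.0)]
print(grid_points(pts))  # -> {(0, 0): 11.0, (2, 0): 8.0}
```

The MapReduce formulation is the same shape: the map phase emits (cell, z) pairs and the reduce phase averages per cell, which is why the problem parallelizes so naturally across a Hadoop cluster.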
Let’s see: at 172GB per 207sqkm, with the earth’s surface at 510Msqkm, a similar resolution LIDAR grid point cloud of the entire earth’s surface would be about 0.5EB (exabytes, 10**18 bytes). It’s just great to be in the storage business.
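Checking that extrapolation (using the same per-city density assumed above):

```python
# Extrapolating the Angamuco point-cloud density to the whole earth.
gb_per_sqkm = 172 / 207       # from the per-city estimate above
earth_sqkm = 510e6
total_eb = gb_per_sqkm * earth_sqkm / 1e9   # GB -> EB
print(f"~{total_eb:.1f} EB")  # -> ~0.4 EB, i.e., roughly half an exabyte
```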
Attended #HDSday yesterday in San Jose. Listened to what seemed like the majority of the executive team. The festivities were MCed by Asim Zaheer, VP Corp and Product Marketing, a long time friend and employee who came to HDS with the acquisition of Archivas five or so years ago. Some highlights of the day’s sessions are included below.
The first presenter was Jack Domme, HDS CEO, and his message was that there is a new, more aggressive HDS, focused on executing and growing the business.
Jack said there will be almost half a ZB of data by 2015 and ~80% of that will be unstructured. HDS firmly believes that much of this growing body of data today lives in silos, locked into application environments, and can’t truly become information until it is liberated from those boxes. Getting information out of unstructured data is one of the key problems facing the IT industry.
To that end, Jack talked about the three clouds appearing on the horizon:
infrastructure cloud – cloud as we know and love it today, where infrastructure services can be paid for on a per use basis and where data and applications move seamlessly across various infrastructure boundaries.
content cloud – this is somewhat new, but here we take on the governance, analytics and management of the millions to billions of pieces of content, using the infrastructure cloud as a basic service.
information cloud – the end game, where any and all data streams can be analyzed in real time to provide information and insight to the business.
Jack mentioned the example of when Japan had their earthquake earlier this year they automatically stopped all the trains operating in the country to prevent further injury and accidents, until they could assess the extent of track damage. Now this was a specialized example in a narrow vertical but the idea is that the information cloud does that sort of real-time analysis of data streaming in all the time.
For much of the rest of the day the executive team filled out the details that surrounded Jack’s talk.
For example, Randy DeMont, Executive VP & GM Global Sales, Services and Support, talked about the new, more focused sales team. One that has moved to concentrate on better opportunities and expanded to take on new verticals/new emerging markets.
Then Brian Householder, SVP WW Marketing and Business Development got up and talked about some of the key drivers to their growth:
The current economic climate has everyone doing more with less. Hitachi VSP and storage virtualization are in a unique position to obtain more value out of current assets – not a rip-and-replace strategy. With VSP one layers better management on top of current infrastructure, which helps get more done with the same equipment.
Focus on the channel and verticals are starting to pay off. More than 50% of HDS revenues now come from indirect channels. Also, healthcare and life sciences are starting to emerge as a crucial vertical for HDS.
Scalability of their storage solutions is significant. It used to be that a PB was a good-sized data center, but these days we are starting to talk about multiple PBs and even much more. I think earlier Jack mentioned that in the next couple of years HDS will see its first 1EB customer.
Mike Gustafson, SVP & GM NAS (former CEO of BlueArc), got up and talked about the long and significant partnership between the two companies regarding their HNAS product. He mentioned that ~30% of BlueArc’s revenue came from HDS. He also talked about some of the verticals that BlueArc had done well in, such as eDiscovery and Media and Entertainment. Now these verticals will become new focus areas for HDS storage as well.
John Mansfield, SVP Global Solutions Strategy and Development, came up and talked about the successes they have had in the product arena. Apparently there are over 2000 VSPs installed (the product was announced just a year ago), and over 50% of the new systems are going in with virtualization. When asked later what has led to the acceleration in virtualization adoption, the consensus view was that server virtualization and, in general, doing more with less (storage efficiency) were driving increased use of this capability.
Hicham Abdessamad, SVP Global Services, got up and talked about what has been happening in the services end of the business. Apparently there has been a serious shift in HDS services revenue from break-fix over to professional services (PS). Such service offerings now include taking over customer data center infrastructure and leasing it back to the customer for a monthly fee. Hicham reiterated that ~68% of all IT initiatives fail, while 44% of those that succeed are completed over time and/or over budget. HDS is providing professional services to help turn this around. His main problem is finding experienced personnel to deliver these services.
After this there was a Q&A panel with John Mansfield’s team: Roberto Bassilio, VP Storage Platforms and Product Management; Sean Moser, VP Software Products; and Scan Putegnat, VP File and Content Services, CME. There were a number of questions, one of which was on the floods in Thailand and their impact on HDS’s business.
Apparently, the flood problems are causing supply disruptions in the consumer end of the drive market and are not having serious repercussions for their enterprise customers. But they did mention that they were nudging customers to purchase the right form factor (LFF?) disk drives while the supply problems work themselves out.
Also, there was some indication that HDS would be going after more SSD and/or NAND flash capabilities similar to other major vendors in their space. But there was no clarification of when or exactly what they would be doing.
After lunch the GMs of all the Geographic regions around the globe got up and talked about how they were doing in their particular arena.
Jeff Henry, SVP & GM Americas, talked about their success in the F500 and some of the emerging markets in Latin America. In fact, they have been so successful in Brazil, they had to split the country into two regions.
Niels Svenningsen, SVP & GM EMEA, talked about the emerging markets in his area of the globe, primarily eastern Europe, Russia and Africa. He mentioned that many believe Africa will be the next area to take off, as Asia did in the last couple of decades of the last century. Apparently there are a billion people in Africa today.
Kevin Eggleston, SVP & GM APAC, talked about the high rate of server and storage virtualization, the explosive growth and heavy adoption of cloud pay-as-you-go services. His major growth areas were India and China.
The rest of the afternoon was NDA presentations on future roadmap items.
All in all a good overview of HDS’s business over the past couple of quarters and their vision for tomorrow. It was a long day and there was probably more than I could absorb in the time we had together.
And of course the buildings going up at Ground Zero are all smart buildings as well, containing sensors embedded in the structure, the infrastructure, and anywhere else that matters.
But what does this mean in terms of data?
Data requirements will explode as the smart home and other sensor clouds build out. For example, even if a smart thermostat only issues a message every 15 minutes and the message is only 256 bytes, the data from the 130 million households in the US alone would be an additional ~3.2TB/day. And that’s just one sensor per household.
If you add the smart power meter, lawn sensor, intrusion/fire/chemical sensor, and, god forbid, the refrigerator and freezer product sensors to the mix, that’s another ~16TB/day of incoming data.
And that’s just assuming a 256 byte payload per sensor every 15 minutes. The intrusion sensors could easily be a combination of multiple, real time exterior video feeds as well as multi-point intrusion/motion/fire/chemical sensors which would generate much, much more data.
But we still have smart roads/bridges, smart cars/trucks, smart skyscrapers, smart port facilities, smart railroads, smart boats/ferries, etc. to come. I could go on but the list seems long enough already. Each of these could generate another ~19TB/day data stream, if not more. Some of these infrastructure entities/devices are much more complex than a house, and there are a lot more cars on the road than houses in the US.
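Spelling out the household arithmetic above in Python (same assumptions: a 256-byte message every 15 minutes from each of 130 million US households):

```python
# The smart-home sensor data arithmetic, spelled out.
households = 130_000_000
msg_bytes = 256
msgs_per_day = 24 * 4   # one message every 15 minutes

per_sensor_tb = households * msg_bytes * msgs_per_day / 1e12
print(f"one sensor per household: ~{per_sensor_tb:.1f} TB/day")   # ~3.2

extra = 5  # power meter, lawn, intrusion, fridge, freezer sensors
print(f"five more sensors: ~{extra * per_sensor_tb:.0f} TB/day")  # ~16
print(f"all six together: ~{6 * per_sensor_tb:.0f} TB/day")       # ~19
```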
It’s great to be in the (cloud) storage business
All that data has to be stored somewhere and that place is going to be the cloud. The Honeywell smart thermostat uses Opower’s cloud storage and computing infrastructure specifically designed to support better power management for heating and cooling the home. Following this approach, it’s certainly feasible that more cloud services would come online to support each of the smart entities discussed above.
Naturally, using this data to provide real time understanding of the infrastructure it monitors will require big data analytics. Hadoop and its counterparts are the only platforms around today that are up to this task.
So cloud computing, cloud storage, and big data analytics have yet another part to play. This time in the upcoming sensor cloud that will envelope the world and all of its infrastructure.