3.3 Exabytes-a-day?!

Dans la nuit des images (Grand Palais) by dalbera (cc) (from flickr)

NetworkWorld today reported on an EMC-funded IDC study which says the world will create 1.2 Zettabytes (ZB, 10**21 bytes) of data in 2010. By my calculations this is 3.3 Exabytes-a-day (XB, 10**18 bytes), 2.3PB (10**15 bytes) a minute or 38TB (10**12 bytes) a second. This seems high, but I discussed how we could get there in last year's Exabyte-a-day post. What interested me most, though, was the claim that about 35% more information is created than can be stored. I'm not sure I understand this claim (deduplication perhaps?).
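
Here's the quick arithmetic behind those numbers (decimal units assumed, i.e., 10**21 bytes per ZB):

```python
# Back-of-the-envelope check of the 1.2ZB/year figure (decimal units assumed).
ZB = 10**21
EB, PB, TB = 10**18, 10**15, 10**12

per_year = 1.2 * ZB
per_day = per_year / 365          # ~3.3 EB/day
per_minute = per_day / (24 * 60)  # ~2.3 PB/minute
per_second = per_minute / 60      # ~38 TB/second

print(f"{per_day / EB:.1f} EB/day")        # 3.3
print(f"{per_minute / PB:.1f} PB/minute")  # 2.3
print(f"{per_second / TB:.0f} TB/second")  # 38
```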

Aside from deduplication, this must mean data is being created, sent across the Internet, held nowhere except in flight, and discarded soon after. I assume this data is associated with things like VOIP phone calls and video chats/conferences, only some portion of which is ever recorded and stored. (Although that will soon no longer be true for audio; see my Yottabytes by 2015 post.)

But 35% would indicate ~1 out of every 3 bytes of data is discarded shortly after creation. IDC also expects this factor to grow, not shrink, "… to over 60% over the next few years." So 3 out of 5 bytes of data will be available only in real time and discarded thereafter.

Why this portion should be growing more rapidly than data being stored is hard to fathom. Again, video and voice over the Internet must be a significant part of the reason.

Storing voice data

I don’t know about most people, but I record only a few of my more important calls, and those calls happen to be longer on average than my normal calls. Does this mean that 35% of my call data volume is not stored? Maybe. All my business calls are done over the Internet nowadays, so this data is being created and shipped across the net, used while the call is occurring, but never stored other than in flight or by the call participants. So non-recorded calls easily qualify as data created but not stored. Even so, while I may listen to ~33% of the recorded calls afterwards, I ultimately overwrite all of them, keeping only the ones that fit on the recorder’s flash device. Hence, in the end even the voice data I do keep is retained only until I need the space to record more.

Not sure how this is treated in the IDC study, but it seems to me to be yet another class of data; maybe call it transient data. I can see similarities to transient data in company backups, log files, database dumps, etc. Most of this data is stored for a limited time, only to be erased or recorded over later. How IDC classified such data I cannot tell.

But will transient data grow?

As for video, I currently do no video conferencing so have no information on this. But I am considering moving to another communication platform that supplies video chat and makes it less intrusive to record calls. While demoing this new capability I rapidly consumed over 200MB of storage for call recordings. (I need to cap this somehow before it gets out of hand.) In any case, I believe recording convenience should make such data more storable over time, not less.

So while I may agree that 1 out of 3 bytes of data created today is not stored, I definitely don’t think that ratio will grow over time, and certainly not to 60%. My only caveat is that there is a limit to the amount of data the world can readily store at any one time, and this will ultimately drive all of us to delete data we would rather keep.

But maybe all this just points to a more interesting question: how much data does the world create that is kept for a year, a decade, or a century? But that will need to await another post…

Telescope VLBI data storage appetite

I read a news release the other day about a new space discovery called a “micro-quasar”. The scientists were awaiting analysis of a VLBI (very long baseline interferometry) study involving 20 telescopes to confirm their findings. It was said that analyzing that much data takes a while. So how much data is this?

Another paper in IEEE Spectrum described an Earth-sized telescope using an e-VLBI test done in May of 2008. At that time, each of 7 radio telescopes around the world fed their observations into one supercomputer which analyzed the data. The article stated that each telescope delivered 1Gb/s of data and that the supercomputer could analyze up to 100TB of data per observation.

Assuming:

  • A single observation has all telescopes observing one point for a 24-hour period
  • As the world turns, that point will be visible (and then not visible) to half the telescopes situated around the globe.

Consequently, with 16 telescopes the combined array should generate ~70TB per day (or per observation); with 20 telescopes, closer to ~86TB. If this network of 20 telescopes can be kept busy half the time, it should generate around 15.7PB of data a year.
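
For what it's worth, here's one way to reproduce those figures. The assumptions are mine, not the paper's: 1Gb/s per telescope, half the array seeing the target at any time, and a rough 10-bits-per-byte conversion to allow for network overhead.

```python
# Rough reconstruction of the VLBI numbers above (my assumptions, not the paper's math):
# 1Gb/s per telescope, each telescope sees the target half the day,
# and ~10 bits per stored byte to allow for network/framing overhead.
GBPS = 10**9                      # bits per second per telescope
SECONDS_PER_DAY = 24 * 3600
BITS_PER_BYTE_ROUGH = 10          # pessimistic conversion including overhead

def tb_per_day(telescopes):
    bits = telescopes * GBPS * SECONDS_PER_DAY / 2   # half the array visible at a time
    return bits / BITS_PER_BYTE_ROUGH / 10**12       # decimal TB

print(tb_per_day(16))                    # ~69 TB/day  (the ~70TB above)
print(tb_per_day(20))                    # ~86 TB/day
print(tb_per_day(20) * 365 / 2 / 1000)   # ~15.8 PB/year if busy half the time
```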

Prior to e-VLBI, observation data was sent via tapes or other magnetic media to a central repository, and it took weeks to gather the data for one observation. With the new e-VLBI system all this can now be done in real time. According to the paper, getting the data to the supercomputer was a substantial undertaking and required multiple “network providers”. However, storing this data was not discussed.

Storing 15.7PB of data a year must not be much of a technological problem anymore. We’ve previously written about the 1.5PB a year from CERN and the 7.7PB a year from smart metering, so another 15.7PB/year doesn’t seem that out of place. I am beginning to think that exabyte-a-day post was conservative and that today’s data deluge is larger than that. Can YB of data be that far away?

64GB iPad is not enough

Apple iPad (wi-fi) (from apple.com)

I currently don’t have an iPad, but I have seen the videos and played with one at the local BestBuy, and IMHO 64GB is not enough for a laptop killer. My current desktop TAR file backup of documents, pictures and music is ~61.5GB and shows signs of cracking the 64GB barrier sometime next quarter. Of course that’s compressed and doesn’t count the myriad applications and the O/S needed for a desktop/laptop replacement.

I have a similar problem on my 8GB iPhone.  I occasionally tweak the photos and music iTunes sync parameters to be able to get the latest photos or genius mix I want.  But the iPhone (bless its heart) is not a laptop killer.

The iPad has both streaming video and non-streaming video available from iTunes. A quick look at download sizes for videos on iTunes shows that the relatively recent “Blind Side” takes up about 1.8GB. Download a dozen movies and you’re over one third full without any photos, music, O/S or applications. Start loading up your vast photo library and music collection and you’ll run out of iPad storage quickly.
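
The math is simple enough:

```python
# How fast movies eat a 64GB iPad (decimal GB, download size from iTunes at the time).
ipad_gb = 64
movie_gb = 1.8                      # e.g. "Blind Side" download size
movies = 12
print(movies * movie_gb)            # 21.6GB
print(movies * movie_gb / ipad_gb)  # ~0.34 -- over a third full, before photos, music or apps
```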

But that’s just for fun. What about adding all the other Pages|Numbers|Keynote|PDF|Office documents you need just to work? It just can’t hold it all. I would say these don’t take up as much storage as the media stuff; maybe 10% of my backup data is office work.

Not sure how you even move data to the iPad on a project basis, but I assume there is a way to download these files (the iTunes Pages|Numbers|Keynote sync option page?). I suppose worst case they could be emailed. In any event, an iPad laptop killer needs to be able to work with data that’s created elsewhere.

If one runs out of iPad storage, what can one do?

Put yourself on a [media&work] information diet and cut back. Maybe only the top 100 genius playlists, the 3-, 4- and 5-star photos, the latest half dozen movie purchases, and data only for the most current/active projects. Whatever you do, how much iPad storage to devote to fun (media) versus office (work) could be flexible, managed in real time on a daily/per-trip sync basis.

Now if you’re using your iPad only for fun, this may not be as much of a problem. But keep in mind, all those videos add up quickly. Given all this, I guess I’m waiting for a 256GB iPad before I get one.

Smart metering’s data storage appetite

European smart meter in use (from en.wikipedia.org/wiki/Smart_meter) (cc)

A couple of years back I was talking with a storage person from PG&E who was concerned about the storage performance aspects of installing smart meters in California. I saw a website devoted to another California electric company installing 1.4M smart meters that send information to the electric company every 15 minutes. Given that this must be only a small portion of California, it represents ~134M electricity recording transactions per day and seems entirely doable. But even at only 128 bytes per transaction, that’s ~17GB a day of electric metering data ingested for this company’s service area. Naturally, this power company wants to extend smart metering to gas usage as well, which should not quite double the data load.

According to US census data there were ~129M households in 2008. At that same 15-minute interval, smart metering for the whole US would generate 12B transactions a day and, at 128 bytes per transaction, would represent ~1.5TB/day. Of course that’s only households and only electricity usage.

That same census website indicates there were 7.7M businesses in the US in 2007. Smart metering these businesses at the same interval would take an additional ~740M transactions a day, or ~95GB of data. But fifteen-minute intervals may be too long for some companies (and their power suppliers), so maybe the interval should be dropped to every minute for businesses. At one-minute intervals, businesses would add 1.4TB of electricity metering data to the household 1.5TB, for a total of ~3TB of data/day.
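
For those following along, here's a quick sketch of that base load arithmetic (128 bytes per reading assumed throughout, decimal TB):

```python
# Daily smart metering load, assuming 128-byte transactions (decimal units).
TXN_BYTES = 128
READS_PER_DAY_15MIN = 24 * 4      # 96 reads/day at 15-minute intervals
READS_PER_DAY_1MIN = 24 * 60      # 1440 reads/day at 1-minute intervals

households = 129e6                # ~129M US households (2008 census)
businesses = 7.7e6                # ~7.7M US businesses (2007 census)

household_bytes = households * READS_PER_DAY_15MIN * TXN_BYTES
business_bytes = businesses * READS_PER_DAY_1MIN * TXN_BYTES

print(household_bytes / 1e12)                     # ~1.6 TB/day (the ~1.5TB above)
print(business_bytes / 1e12)                      # ~1.4 TB/day
print((household_bytes + business_bytes) / 1e12)  # ~3 TB/day total
```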

Storage multiplication tables:

  • That 3TB a day must be backed up, so that’s at least another 3TB a day of backup load (deduplication notwithstanding).
  • That 3TB of data must be processed offline as well as online, so that’s another 3TB a day of data copies.
  • That 3TB of data is probably considered part of the power company’s critical infrastructure and as such, must be mirrored to some other data center which is another 3TB a day of mirrored data.

So with this relatively “small” base data load of 3TB a day, we are creating an additional 9TB/day of copies. Over the course of a year this 12TB/day generates ~4.4PB of data. A study done by StorageTek in the late ’90s showed that on average data was copied 6 times, so the 3 copies above may be conservative. If those study results held true today for metering data, it would generate ~7.7PB/year.
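
The multiplication works out like so:

```python
# Multiplying the base load by its copies (backup, offline processing, remote mirror).
base_tb_per_day = 3
copies = 3                          # backup + offline copy + mirror
total_per_day = base_tb_per_day * (1 + copies)                  # 12 TB/day
print(total_per_day * 365 / 1000)                               # ~4.4 PB/year

storagetek_copies = 6               # late-'90s StorageTek finding: ~6 copies on average
print(base_tb_per_day * (1 + storagetek_copies) * 365 / 1000)   # ~7.7 PB/year
```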

To paraphrase Senator E. Dirksen: a petabyte here, a petabyte there, and pretty soon you’re talking real storage.

In prior posts we discussed the 1.5PB of data generated by CERN each year, the expectation that the world would generate an exabyte (XB) a day of data in 2009, and NSA’s need to capture and analyze a yottabyte (YB) a year of voice data by 2015. Here we show how another 4-8PB of storage could be created each year just by rolling out smart electricity metering to US businesses and homes.

As more and more aspects of home and business become digitized, more data is created each day, and it all must be stored someplace – data storage. Other technology arenas may also benefit from this digitization of life, leisure, and the economy, but today we would contend that storage benefits most from this trend. Why storage benefits more than other technological domains we must defer to some future post.

Cloud Storage Gateways Surface

Who says there are no clouds today by akakumo (cc) (from Flickr)

One problem holding back general purpose cloud storage has been the lack of a “standard” way to get data in and out of the cloud.  Most cloud storage providers supply a REST interface, an object file interface or other proprietary ways to use their facilities.  The problem with this is that they all require some form of code development on the part of the cloud storage customer in order to make use of these interfaces.

It would be much easier if cloud storage could just talk standard access protocols such as iSCSI, FCoE, FC, NFS, CIFS or FTP. Then any data center could use the cloud with a NIC/HBA/CNA and just configure the cloud storage as a bunch of LUNs or file systems/mount points/shares. FCoE and FC might be difficult to use due to timeouts or other QoS (quality of service) issues, but iSCSI and the file-level protocols should be able to support cloud storage access without such concerns.

So which cloud storage providers support these protocols today? Nirvanix supplies CloudNAS to access their facilities via NFS, CIFS and FTP, ParaScale supports NFS and FTP, while Amazon S3 and Rackspace CloudFiles do not seem to support any of these interfaces. There are probably other general purpose cloud storage providers I am missing here, but these will suffice for now. Wouldn’t it be better if some independent vendor supplied one way to talk to all of these storage environments?

How can gateways help?

For one example, Nasuni recently emerged from stealth mode, releasing a beta version of a cloud storage gateway that supports file access to a number of providers. Currently, Nasuni supports the CIFS file protocol as a front end for Amazon S3, Iron Mountain ASP, Nirvanix, and (coming soon) Rackspace CloudFiles.

However, Nasuni is more than just a file protocol converter for cloud storage. It also supplies a data cache, file snapshot services, data compression/encryption, and other cloud storage management tools. Specifically:

  • Cloud data cache – their gateway maintains a disk cache of frequently accessed data that can be accessed directly without having to go out to the cloud storage.  File data is chunked by the gateway and flushed out of cache to the backend provider. How such a disk cache is maintained coherently across multiple gateway nodes was not discussed.
  • File snapshot services – their gateway supports a point-in-time copy of file data used for backup and other purposes. The snapshot is created on a time schedule and provides an incremental backup of cloud file data. Presumably these snapshot chunks are also stored in the cloud.
  • Data compression/encryption services – their gateway compresses file chunks and then encrypts them before sending them to the cloud. Encryption keys can optionally be maintained by the customer or automatically maintained by the gateway (a rough sketch of this chunk/compress/encrypt flow appears after this list).
  • Cloud storage management services – the gateway configures the cloud storage services needed to define volumes, monitors cloud and network performance and provides a single bill for all cloud storage used by the customer.
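
To make the chunk/compress/encrypt flow a bit more concrete, here's a minimal, purely illustrative sketch of what such a gateway pipeline might look like. This is not Nasuni's code: zlib and Fernet stand in for whatever compression/encryption a real gateway uses, and upload_to_cloud(), push_file() and the chunk size are hypothetical choices of mine.

```python
# Illustrative sketch of a gateway-style chunk -> compress -> encrypt pipeline.
# Not any vendor's implementation; upload_to_cloud() is a placeholder for a
# provider-specific PUT (S3, Nirvanix, etc.).
import zlib
from cryptography.fernet import Fernet

CHUNK_SIZE = 4 * 1024 * 1024          # 4MB chunks (arbitrary choice)
key = Fernet.generate_key()           # in practice, customer- or gateway-managed
cipher = Fernet(key)

def upload_to_cloud(name, blob):
    """Placeholder for a provider-specific upload call."""
    print(f"uploading {name}: {len(blob)} bytes")

def push_file(path):
    """Chunk a file, compress and encrypt each chunk, then ship it to the cloud."""
    with open(path, "rb") as f:
        index = 0
        while chunk := f.read(CHUNK_SIZE):
            compressed = zlib.compress(chunk)
            encrypted = cipher.encrypt(compressed)
            upload_to_cloud(f"{path}.chunk{index}", encrypted)
            index += 1
```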

By chunking the files and caching them, data read from the cloud should be accessible much faster than normal cloud file access. Also, by providing a form of snapshot, cloud data should be easier to back up and subsequently restore. Although Nasuni’s website didn’t provide much information on the snapshot service, such capabilities have been around for a long time and found very useful in other storage systems.

Nasuni is provided as a software-only solution. Once installed and activated on your server hardware, it’s billed as a service, charged on top of any cloud storage you use. You sign up for supported cloud storage providers through Nasuni’s service portal.

How well all this works is open for discussion. We have discussed caching appliances before, both from EMC and others. Two issues have emerged from those discussions: maintaining cache coherence across nodes is non-trivial, and the economics of a caching appliance are subject to some debate. However, cloud gateways are more than just caching appliances, and as a way of advancing cloud storage adoption, such gateways can only help.

Full disclosure: I currently do no business with Nasuni.

Caching DaaD for federated data centers

Internet Splat Map by jurvetson (cc) (from flickr)

Today, I attended a webinar where Pat Gelsinger, President of Information Infrastructure at EMC, discussed their concept for a new product based on the Yotta Yotta technology they acquired a few years back. Yotta Yotta’s product was a distributed, coherent caching appliance with FC front-end ports, an InfiniBand internal appliance network, and both FC and WAN back-end links.

What one did with Yotta Yotta nodes was place them in front of your block storage, connect them together locally via InfiniBand and remotely via a WAN technology of your choice, and then access any data behind the appliances from any attached location. They also provided very quick transfer of bulk data between remote nodes. So their technology allowed very rapid data transmission over standard WAN interfaces/distances and provided a distributed cache, across those same distances, of the data behind the appliances.

I like caching appliances as much as anyone. They became prominent in the late ’70s and early ’80s, mostly because caching was hard to do within the storage subsystems of the day, but they went away a long time ago. Nowadays you can barely purchase a lone disk drive without a cache in it. So what’s different?

Introducing DaaD

Today we have SSDs and much cheaper processing power. I wrote about new caching appliances like DataRam‘s XcelaSAN in a Cache appliances rise from the dead post I did after last year’s SNW. But EMC’s going after a slightly broader domain – the world. The caching appliance that EMC is discussing is really intended to support distributed data access, or as I like to call it, Data-at-a-Distance (DaaD).

How can this work? Data is stored on subsystems at various locations around the world. A DaaD appliance is inserted in front of each of these and connected over the WAN. Some or all of that data is then re-configured (at the block or, more likely, LUN level) to be accessible at a distance from each DaaD data center. As each data center reads and writes data from/to its remote brethren, some portion of that data is cached locally in the DaaD appliance and the rest is only available by going to the remote site (with considerably higher latency).

This works moderately well for well behaved, read-intensive workloads where 80% of the IO is to 20% of the data (most of which is cached locally). But block writes present a particularly nasty problem, as any data write has to be propagated to all cached copies before being acknowledged.

It’s possible write propagation could be done by invalidating the data in cache (so any subsequent read would need to re-access the data from the original host). Nevertheless, to even know which DaaD nodes have a cached copy of a particular block, one needs to maintain a dictionary of all globally identifiable blocks held in any DaaD cache node at every moment in time. Any such table would change often and would need to be updated very carefully, deadlock free and atomically, with non-failable transactions – therein lies one of the technological hurdles. Doing this quickly without impacting performance is another.
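
To see why that bookkeeping is the hard part, here's a toy write-invalidate directory. It is only an illustration, not EMC's or Yotta Yotta's design, and every name in it (CacheDirectory, the node and block IDs) is made up; a real DaaD product would have to make this distributed, fault tolerant and deadlock free across WAN latencies.

```python
# Toy write-invalidate directory for a distributed block cache (illustration only).
from collections import defaultdict

class CacheDirectory:
    def __init__(self):
        self.holders = defaultdict(set)   # block id -> set of nodes caching it

    def record_read(self, block, node):
        """Note that 'node' now holds a cached copy of 'block'."""
        self.holders[block].add(node)

    def write(self, block, writer):
        """On a write, invalidate every other node's cached copy before acking."""
        stale = self.holders[block] - {writer}
        for node in stale:
            self.invalidate(node, block)   # in reality: a WAN round trip per node
        self.holders[block] = {writer}

    def invalidate(self, node, block):
        print(f"invalidate block {block} on node {node}")

directory = CacheDirectory()
directory.record_read("lun1:0x42", "denver")
directory.record_read("lun1:0x42", "london")
directory.write("lun1:0x42", "denver")     # london's cached copy gets invalidated
```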

So, simple enough: EMC takes Yotta Yotta’s technology, updates it for today’s processors, networking, and storage, and releases it as a data center federation enabler. What can one do with a federated data center? Well, that’s another question, it involves VMotion, and it must be a subject for a future post …

Describing Dedupe

Hard Disk 4 by Alpha six (cc) (from flickr)

Deduplication is a mechanism to reduce the amount of data stored on disk for backup, archive or even primary storage. In any storage, data is often duplicated, and any system that eliminates storing duplicate data will utilize storage more efficiently.

Essentially, deduplication systems identify duplicate data and store only one copy of it, using pointers to incorporate the duplicate data at the right point in the data stream. Such services can be provided at the source, at the target, or even at the storage subsystem/NAS system level.

The easiest way to understand deduplication is to view a data stream as a book, which consists of two parts: a table of contents and the actual chapters of text (or data). The stream’s table of contents provides chapter titles but, more importantly (to us), identifies a page number for each chapter. A deduplicated data stream looks like a book where chapters can be duplicated within the same book or even across books, and the table of contents can point to any book’s chapter when duplicated. A deduplication service inputs the data stream, searches for duplicate chapters, deletes them, and updates the table of contents accordingly.

There’s more to this, of course. For example, chapters or duplicate data segments must be tagged with how often they are referenced, so that shared data is not lost when one of the streams referencing it is modified or deleted. Also, one way to determine whether data is duplicated is to take one or more hashes of it and compare them to other data hashes; but to work quickly, those hashes must be kept in a searchable index.
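
A bare-bones sketch of the hash-index idea looks something like this. Fixed-size chunks and SHA-256 are chosen for simplicity, and the names (dedupe, reconstitute, chunk_store) are mine; real products often use variable-size chunking and keep reference counts per chunk.

```python
# Minimal chunk-hash deduplication sketch: store each unique chunk once and keep a
# "table of contents" of hashes per stream. Fixed-size chunking is a simplification.
import hashlib

CHUNK_SIZE = 8 * 1024
chunk_store = {}                  # hash -> chunk data (the unique chunks)

def dedupe(stream: bytes):
    toc = []                      # the stream's table of contents (ordered hashes)
    for i in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:
            chunk_store[digest] = chunk    # unique chunk: store it
        toc.append(digest)                 # duplicate or not: just reference it
    return toc

def reconstitute(toc):
    return b"".join(chunk_store[d] for d in toc)
```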

Types of deduplication

  • Source deduplication involves a repository, a client application, and an operation which copies client data to the repository.  Client software chunks the data, hashes the data chunks, and sends these hashes over to the repository.  On the receiving end, the repository determines which hashes are duplicates and then tells the client to send only the unique data.  The repository stores the unique data chunks and the data stream’s table of contents.
  • Target deduplication involves performing deduplication inline, in-parallel, or post-processing by chunking the data stream as it’s received, hashing the chunks, determining which chunks are unique, and storing only the unique data. Inline refers to doing such processing while receiving data at the target system, before the data is stored on disk. In-parallel refers to doing a portion of this processing while receiving data, i.e., portions of the data stream will be deduplicated while other portions are being received. Post-processing refers to data that is completely staged to disk before being deduplicated later.
  • Storage subsystem/NAS system deduplication looks a lot like post-processing, target deduplication. For NAS systems, deduplication looks at a file of data after it is closed. For general storage subsystems, the process looks at blocks of data after they are written. Whether either system detects duplicate data below these levels is implementation dependent.

Deduplication overhead

Deduplication incurs most of its overhead while deduplicating the data stream, essentially during or after the data is written, which is why target deduplication has so many options: some optimize ingestion while others optimize storage use. There is very little additional overhead for reconstituting (or un-deduplicating) the data on read-back, as retrieving the unique and/or duplicated data segments can be done quickly. There may be some minor performance loss from the lack of sequentiality, but that impacts only data throughput and not by much.

Where dedupe makes sense

Deduplication was first implemented for backup data streams, because any backup regime that takes full backups on a monthly or even weekly basis duplicates lots of data. For example, if one takes a full backup of 100TB every week and, let’s say, new unique data created each week is ~15%, then at week 0, 100TB is stored for both the deduplicated and un-deduplicated versions; at week 1 it takes 115TB to store the deduplicated data but 200TB for the non-deduplicated data; at week 2 it takes ~132TB to store the deduplicated data but 300TB for the non-deduplicated data; and so on. As each full backup completes it takes another 100TB of un-deduplicated storage but significantly less deduplicated storage. After 8 full backups the un-deduplicated storage would require 800TB but the deduplicated storage only ~265TB.
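
The arithmetic in that example works out like so (compounding ~15% new data per week):

```python
# Dedupe vs. no dedupe for weekly 100TB full backups with ~15% new data per week.
full_backup_tb = 100
weekly_growth = 0.15

deduped = full_backup_tb
for week in range(1, 8):
    undeduped = full_backup_tb * (week + 1)   # another full 100TB every week
    deduped *= 1 + weekly_growth              # only the new unique data accumulates
    print(f"week {week}: ~{deduped:.0f}TB deduplicated vs. {undeduped}TB without")
# week 7 (the 8th full backup): ~266TB deduplicated vs. 800TB un-deduplicated
```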

Deduplication can also work for secondary or even primary storage. Most IT shops with 1000’s of users duplicate lots of data. For example, interim files are sent from one employee to another for review, reports are sent en masse to teams, emails are blasted to all employees, etc. Consequently, any storage (sub)system that can deduplicate data would utilize backend storage more efficiently.

Full disclosure: I have worked for many deduplication vendors in the past.

Cleversafe’s new hardware

Cleversafe new dsNet(tm) Rack (from Cleversafe.com)

Yesterday, Cleversafe announced new Slicestor(r) 2100 and 2200 hardware using 2TB SATA drives. The 2100’s standard 1U package supports 8TB of raw data and the new 2U 2200 package supports 24TB. In addition, a new Accesser(r) 2100 supports 8GB of ECC RAM and 2 GigE or 10GbE ports for data access.

In addition to the new server hardware, Cleversafe also announced an integrated rack with up to 18 Slicestor 2200s, 2 Accesser 2100s, 1 Omnience (management node), a 48-port Ethernet switch, and PDUs. This new rack configuration comes pre-cabled and can easily be installed to support an immediate 432TB of raw capacity. It’s expected that customers with multiple sites could order 1 or more racks for a quick installation of Cleversafe storage services.

Cleversafe currently offers iSCSI block services, direct object storage interface and file services interfaces (over iSCSI).  They are finding some success in the media and entertainment space as well as federal and state government data centers.

The federal and state government agencies seem especially interested in Cleversafe for its data security capabilities.  They offer cloud data security via their SecureSlice(tm) technology which encrypts data slices and uses key masking to obscure the key.  With SecureSlice, the only way to decrypt the data is to have enough slices to reconstitute the data.

Also, the new Accesser and Slicestor server hardware now uses a flash drive-on-motherboard unit to hold the operating system and Cleversafe software. This leaves the data drives holding only customer data, reduces Accesser power requirements, and improves both Slicestor and Accesser reliability.

In a previous post we discussed EMC Atmos’s GeoProtect capabilities; although not quite as sophisticated as Cleversafe’s, EMC does offer a sort of data dispersion across sites/racks. However, it appears that GeoProtect is currently limited to two distinct configurations. In contrast, Cleversafe allows the user to select the number of Slicestors used to store data and the threshold required to reconstitute the data. Doing this allows the user to almost dial up or down the availability and reliability they want for their data.
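
To see how that dial works, here's an illustrative availability calculation for an n-slice, k-threshold dispersal scheme. The widths, thresholds and per-node availability below are my assumptions, not Cleversafe's figures, and data_availability() is just a helper of mine.

```python
# Rough availability math for dispersed storage: data is readable whenever at
# least k of the n Slicestors holding slices are reachable (illustration only).
from math import comb

def data_availability(n, k, node_avail):
    """Probability that at least k of n nodes are up, given per-node availability."""
    return sum(comb(n, i) * node_avail**i * (1 - node_avail)**(n - i)
               for i in range(k, n + 1))

# e.g. 16 slices, any 10 needed to reconstitute, each node up 99% of the time:
print(data_availability(16, 10, 0.99))   # very close to 1 (many nodes can fail)
# raising the threshold to 14 of 16 lowers availability to roughly 0.999:
print(data_availability(16, 14, 0.99))
```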

Cleversafe performs well enough to saturate a single Accesser GigE iSCSI link.  Accessers maintain a sort of preferred routing table which indicates which Slicestors currently have the best performance. By accessing the quickest Slicestors first to reconstitute data, performance can be optimized.  Specifically, for the typical multi-site Cleversafe implementation, knowing current Slicestor to Accesser performance can improve data reconstitution performance considerably.

Full disclosure: I have done work for Cleversafe in the past.