Is cloud computing/storage decentralizing IT, again?

IBM Card Sorter by Pargon (cc) (From Flickr)
IBM Card Sorter by Pargon (cc) (From Flickr)

Since IT began, over the course of years, computing services have run through massive phases of decentralization out to departments followed by consolidation back to the data center.  In the early years of computing, from the 50s to the 60s, the only real distributed solution to mainframe or big iron data processing was sophisticated card sorters.

Consolidation-decentralization Wars

But back in the 70s the consolidation-decentralization wars were driven by the availability of mini-computers competing with mainframes for applications and users.  During the 80s, the PC emerged to become the dominant decentralizer taking applications away from mainframes and big servers and in the 90s it was small, off-the-shelf linux servers and continuing use of high-powered PCs that took applications out from data center control.

In those days it seemed that most computing decentralization was driven by the ease of developing applications for these upstarts and the relative low-cost of the new platforms.

Server virtualization, the final solution

Since 2000, another force has come to solve the consolidation quandry – server virtualization.  With server virtualization such as from VMware, Citrix and others, IT has once again driven massive consolidation outlying departmental computing services to bring them all, once again, under one roof, centralizing IT control.  Virtualization provided an optimum answer to the one issue that decentralization could never seem to address – utilization efficiency.  With most departmental servers being used at 5-10% utilization, virtualzation offered demonstrable cost savings when consolidated onto data center hardware.

Cloud computing/storage mutiny

But with the insurrection that is cloud computing and cloud storage once again, departments can easily acquire storage and computing resources on demand and utilization is no longer an issue because it’s a “pay only for what you use” solution. And they don’t even need to develop their own applications because SaaS providers can supply most of their application needs using cloud computing and cloud storage resources alone.

Virtualization was a great solution to the poor utilization of systems and storage resources. But with the pooling available with cloud computing and storage, utilization effectiveness occurs outside the bounds of the todays data center.  As such, with cloud services utilization effectiveness in $/MIP or $/GB can be approximately equivalent to any highly virtualized data center infrastructure (perhaps even better).  Thus, cloud services can provide these very same utilization enhancements at reduced costs out to any departmental user without the need for centralized data center services.

Other decentralization issues that cloud solves

Traditionally, the other problems with departmental computing services were lack of security and the unmanageability distributed service both of which held back some decentralization efforts but these are partially being addressed with cloud infrastructure today.  Insecurity continues to plague cloud computing but some cloud storage gateways (see Cirtas Surfaces and other cloud storage gateway posts) are beginning to use encryption and other cryptographic techniques to address these issues.  How this is solved for cloud computing is another question (see Securing the cloud – Part B).

Cloud computing and storage can be just as diffuse and difficult to manage as a proliferation of PCs or small departmental linux servers.  However, such unmanage-ability is a very different issue, one intrinsic to decentralization and much harder to address.  Although it’s fairly easy to get a bill for any cloud services, it’s unclear whether IT will be able to see all of them to manage it.  Also, nothing seems able to stop some department from signing up for SalesForce.com or even to use Amazon EC2 to support an application they need.  The only remedy, as far as I can see to this problem, is adherence to strict corporate policy and practice.  So unmanageability remains an ongoing issue for decentralized computing for some time to come.

—-

Nonetheless, it seems as if decentralization via the cloud is back, at least until the next wave of consolidation hits.  My guess for the next driver of consolidation is to make application development much easier and quicker to accomplish for centralized data center infrastructure – application frameworks anyone?

Comments?

SOHO backup options

© 2010 RDX Storage Alliance. All Rights Reserved. (From their website)
© 2010 RDX Storage Alliance. All Rights Reserved. (From their website)

I must admit, even though I have disparaged DVD archive life (see CDs and DVDs longevity questioned) I still backup my work desktops/family computers to DVD and DVDdl disks.  It’s cheap (on sale 100 DVDs cost about $30 and DVDdl ~2.5 that much) and it’s convenient (no need for additional software, outside storage fees, or additional drives).  For offsite backups I take the monthly backups and store them in a safety deposit box.

But my partner (and wife) said “Your time is worth something, every time you have to swap DVDs you could be doing something else.” (… like helping around the house.)

She followed up by saying “Couldn’t you use something that was start it and forget it til it was done.”

Well this got me to thinking (as well as having multiple media errors in my latest DVDdl full backup), there’s got to be a better way.

The options for SOHO (small office/home office) Offsite backups look to be as follows: (from sexiest to least sexy)

  • Cloud storage for backup – Mozy, Norton BackupGladinetNasuni, and no doubt many others can provide secure, cloud based backup of desktop, laptop data for Macs and Window systems.  Some of these would require a separate VM or server to connect to the cloud while others would not.  Using the cloud might require the office systems to be left on at nite but that would be a small price to pay to backup your data offsite.   Benefits to cloud storage approaches are that it would get the backups offsite, could be automatically scheduled/scripted to take place off-hours and would require no (or minimal) user intervention to perform.  Disadvantages to this approach is that the office systems would need to be left powered on, backup data is out of your control and bandwidth and storage fees would need to be paid.
  • RDX devices – these are removable NFS accessed disk storage which can support from 40GB to 640GB per cartridge. The devices claim 30yr archive life, which should be fine for SOHO purposes.  Cost of cartridges is probably RDX greatest issue BUT, unlike DVDs you can reuse RDX media if you want to.   Benefits are that RDX would require minimal operator intervention for anything less than 640GB of backup data, backups would be faster (45MB/s), and the data would be under your control.  Disadvantages are the cost of the media (640GB Imation RDX cartridge ~$310) and drives (?), data would not be encrypted unless encrypted at the host, and you would need to move the cartridge data offsite.
  • LTO tape – To my knowledge there is only one vendor out there that makes an iSCSI LTO tape and that is my friends at Spectra Logic but they also make a SAS (6Gb/s) attached LTO-5 tape drive.  It’s unclear which level of LTO technology is supported with the iSCSI drive but even one or two generations down would work for many SOHO shops.  Benefits of LTO tape are minimal operator intervention, long archive life, enterprise class backup technology, faster backups and drive data encryption.  Disadvantages are the cost of the media ($27-$30 for LTO-4 cartridges), drive costs(?), interface costs (if any) and the need to move the cartridges offsite.  I like the iSCSI drive because all one would need is a iSCSI initiator software which can be had easily enough for most desktop systems.
  • DAT tape – I thought these were dead but my good friend John Obeto informed me they are alive and well.  DAT drives support USB 2.0, SAS or parallel SCSI interfaces. Although it’s unclear whether they have drivers for Mac OS/X, Windows shops could probably use them without problem. Benefits are similar to LTO tape above but not as fast and not as long a archive life.  Disadvantages are cartridge cost (320GB DAT cartridge ~$37), drive costs (?) and one would have to move the media offsite.
  • (Blu-ray, Blu-ray dl), DVD, or DVDdl – These are ok but their archive life is miserable (under 2yrs for DVDs at best, see post link above). Benefits are they’res very cheap to use, lowest cost removable media (100GB of data would take ~22 DVDs or 12 DVDdls which at $0.30/ DVD or $0.75 for DVDdl thats  ~$6.60 to $9 per backup), and lowest cost drive (comes optional on most desktops today). Disadvantages are high operator intervention (to swap out disks), more complexity to keep track of each DVDs portion of the backup, more complex media storage (you have a lot more of it), it takes forever (burning 7.7GB to a DVDdl takes around an hour or ~2.1MB/sec.), data encryption would need to be done at the host, and one has to take the media offsite.  I don’t have similar performance data for using Blu-ray  for backups other than Blu-ray dl media costs about $11.50 each (50GB).

Please note this post only discusses Offsite backups. Many SOHOs do not provide offsite backup (risky??) and for online backups I use a spare disk drive attached to every office and family desktop.

Probably other alternatives exist for offsite backups, not the least of which is NAS data replication.  I didn’t list this as most SOHO customers are unlikely to have a secondary location where they could host the replicated data copy and the cost of a 2nd NAS box would need to be added along with the bandwidth between the primary and secondary site.  BUT for those sophisticated SOHO customers out there already using a NAS box for onsite shared storage maybe data replication might make sense. Deduplication backup appliances are another possibility but suffer similar disadvantages to NAS box replication and are even less likely to be already used by SOHO customers.

—-

Ok where to now.  Given all this I M hoping to get a Blu-ray dl writer in my next iMac.  Let’s see that would cut my DVDdl swaps down by ~3.2X for single layer and ~6.5X for dl Blu-ray.  I could easily live with that until I quadrupled my data storage, again.

Although an iSCSI LTO-5 tape transport would make a real nice addition to the office…

Comments?

Top 10 storage technologies over the last decade

Aurora's Perception or I Schrive When I See Technology by Wonderlane (cc) (from Flickr)
Aurora's Perception or I Schrive When I See Technology by Wonderlane (cc) (from Flickr)

Some of these technologies were in development prior to 2000, some were available in other domains but not in storage, and some were in a few subsystems but had yet to become popular as they are today.  In no particular order here are my top 10 storage technologies for the decade:

  1. NAND based SSDs – DRAM and other technology solid state drives (SSDs) were available last century but over the last decade NAND Flash based devices have dominated SSD technology and have altered the storage industry forever more.  Today, it’s nigh impossible to find enterprise class storage that doesn’t support NAND SSDs.
  2. GMR head– Giant Magneto Resistance disk heads have become common place over the last decade and have allowed disk drive manufacturers to double data density every 18-24 months.  Now GMR heads are starting to transition over to tape storage and will enable that technology to increase data density dramatically
  3. Data DeduplicationDeduplication technologies emerged over the last decade as a complement to higher density disk drives as a means to more efficiently backup data.  Deduplication technology can be found in many different forms today, ranging from file and block storage systems, backup storage systems, to backup software only solutions.
  4. Thin provisioning – No one would argue that thin provisioning emerged last century but it took the last decade to really find its place in the storage pantheon.  One almost cannot find a data center class storage device that does not support thin provisioning today.
  5. Scale-out storage – Last century if you wanted to get higher IOPS from a storage subsystem you could add cache or disk drives but at some point you hit a subsystem performance wall.  With scale-out storage, one can now add more processing elements to a storage system cluster without having to replace the controller to obtain more IO processing power.  The link reference talks about the use of commodity hardware to provide added performance but scale-out storage can also be done with non-commodity hardware (see Hitachi’s VSP vs. VMAX).
  6. Storage virtualizationserver virtualization has taken off as the dominant data center paradigm over the last decade but a counterpart to this in storage has also become more viable as well.  Storage virtualization was originally used to migrate data from old subsystems to new storage but today can be used to manage and migrate data over PBs of physical storage dynamically optimizing data placement for cost and/or performance.
  7. LTO tape When IBM dominated IT in the mid to late last century, the tape format dejour always matched IBM’s tape technology.  As the decade dawned, IBM was no longer the dominant player and tape technology was starting to diverge into a babble of differing formats.  As a result, IBM, Quantum, and HP put their technology together and created a standard tape format, called LTO, which has become the new dominant tape format for the data center.
  8. Cloud storage Unclear just when over the last decade cloud storage emerged but it seemed to be a supplement to cloud computing that also appeared this past decade.  Storage service providers had existed earlier but due to bandwidth limitations and storage costs didn’t survive the dotcom bubble. But over this past decade both bandwidth and storage costs have come down considerably and cloud storage has now become a viable technological solution to many data center issues.
  9. iSCSI SCSI has taken on many forms over the last couple of decades but iSCSI has the altered the dominant block storage paradigm from a single, pure FC based SAN to a plurality of technologies.  Nowadays, SMB shops can have block storage without the cost and complexity of FC SANs over the LAN networking technology they already use.
  10. FCoEOne could argue that this technology is still maturing today but once again SCSI has taken opened up another way to access storage. FCoE has the potential to offer all the robustness and performance of FC SANs over data center Ethernet hardware simplifying and unifying data center networking onto one technology.

No doubt others would differ on their top 10 storage technologies over the last decade but I strived to find technologies that significantly changed data storage that existed in 2000 vs. today.  These 10 seemed to me to fit the bill better than most.

Comments?

Data processing logistics

IBM System/370 Model 145 By jovike (cc) (from Flickr)
IBM System/370 Model 145 By jovike (cc) (from Flickr)

Chuck Hollis wrote a great post on “information logistics” as a new paradigm for IT centers to have to consider as they deploy applications around the globe and into the cloud.  The problem is that there’s lot’s of data to move around in order to make all this work.

Supercomputing’s Solution

Big data/super computing groups have been thinking about this problem for a long time and have some solutions that might help but it all harken’s back to batch processing and JCL (job control language) of the last century.  In my comment to Chuck’s post I mentioned the University of Wisconsin’s Condor(r) Project which can be used to schedule data transmission and data processing across distributed server nodes in a network, but there are others namely the Globus ToolKit 4 (GT4)  which creates a data grid to support collaborative research on PB of data currently being used by CERN for LHC data, EU for their data grid and others.  We have discussed Condor in our Free Cloud Storage and Cloud Computing post and GT4 in our 15PB a year created by CERN post.

These super computing projects were designed to move data around so that analysis could be done locally with results shared within the community.  However, at least with GT4, they replicate data at a number of nodes, which may not be storage efficient but does provide quicker access for data analysis.  In CERN, there are a hierarchy of nodes which participate in a GT4 data grid and the data is replicated between tiers and within peer nodes just to have better access to it.

In olden days, …

With JCL someone would code up a sequence of batch steps, each of which could be conditional on previous steps that would manipulate data into some transient and at the end, final form.  Sometimes JCL would invoke another job (set of JCL) for a follow on step if everything in this job worked as planned.  The JCL would wait in a queue until the data and execution resources were available for it.

This could mean mounting removable media, creating disk storage “datasets”, or waiting until other jobs were down with datasets being needed, jobs would execute in a priority sequence, and scheduling options could include using different hosts (servers) that would coordinate to provide job execution services.   For all I know, z/OS still supports JCL for batch processing, but it’s been a long time since I have used JCL.

Cloud computing and storage services

Where does that bring us for today. Cloud computing and Cloud storage are bringing this execution paradigm back into vogue. But instead of batch jobs, we are talking about virtual machines, or web applications or anything else that can be packaged up and run generically on anybody’s hardware and storage.

The only problem is that there are only application specific ways to control these execution activities.  I am thinking here of web services that hand off web handling to any web server that happens to have cycles to support it.  Similarly, database machines seem capable of handing off queries to any database server that has idle ergs to process with.  There are myriad others like this but they all seem specific to one application domain.  Nothing exists that is generic or can cross many application domains.

That’s where something like Condor, GT4 or god forbid, JCL can make some sense.  In essence, all of these approaches are application independent.  By doing so, they can be used for any number of applications to take advantage of cloud computing and cloud storage services.

Just had to get this out.  Chuck’s post had me thinking about JCL again and there had to be another solution.

Enterprise data storage defined and why 3PAR?

More SNW hall servers and storage
More SNW hall servers and storage

Recent press reports about a bidding war for 3PAR bring into focus the expanding need for enterprise class data storage subsystems.  What exactly is enterprise storage?

Defining enterprise storage is frought with problems but I will take a shot.  Enterprise class data storage has:

  • Enhanced reliability, high availability and serviceability – meaning it hardly ever fails, it keeps operating (on redundant components) when it does fail, and repairing the storage when the rare failure occurs can be accomplished without disrupting ongoing storage services
  • Extreme data integrity – goes beyond just RAID storage, meaning that these systems lose data very infrequently, provide the latest data written to a location when read and will tell you when data cannot be accessed.
  • Automated I/O performance – meaning sophisticated caching algorithms that try to keep ahead of sequential I/O streams, buffer actively read data, and buffer write data in non-volatile cache before destaging to disk or other media.
  • Multiple types of storage – meaning the system supports SATA, SAS and/or FC disk drives and SSDs or Flash storage
  • PBs of storage – meaning behind one enterprise class storage (sub-)system one can support over 1PB of storage
  • Sophisticated functionality – meaning the system supports multiple forms of offsite replication, thin provisioning, storage tiering, point-in-time copies, data cloning, administration GUIs/CLIs, etc.
  • Compatibility with all enterprise O/Ss – meaning the storage has been tested and is on hardware compatibility lists for every major operating system in use by the enterprise today.

As for storage protocol, it seems best to leave this off the list.  I wanted to just add block storage, but enterprises today probably have as much if not more external file storage (CIFS or NFS) as they have block storage (FC or iSCSI).  And the proportion in file systems seems to be growing (see IDC report referenced below).

In addition, while I don’t like the non-determinism of iSCSI or file access protocols, this doesn’t seem to stop such storage from putting up pretty impressive performance numbers (see our performance dispatches).  Anything that can crack 100K I/O or file operations per second probably deserves to call themselves enterprise storage as long as they meet the other requirements.  So, maybe I should add high-performance storage to the list above.

Why the sudden interest in enterprise storage?

Enterprise storage has been around arguably since the 2nd half of last century (for mainframe systems) but lately has become even more interesting as applications deploy to the cloud and server virtualization (from VMware, Microsoft Hyper-V and others) takes over the data center.

Cloud storage and cloud computing services are lowering the entry points for storage and processing, enabling application deployments which were heretofore unaffordable.  These new cloud applications consume storage at increasing rates and don’t seem to be slowing down any time soon.  Arguably, some cloud storage is not enterprise storage but as service levels go up for these applications, providers must ultimately turn to enterprise storage.

In addition, server virtualization transforms the enterprise data center from a single application per server to easily 5 or more applications per physical server.  This trend is raising server utilization, driving more I/O, and requiring higher capacity.  Such “multi-application” storage almost always requires high availability, reliability and performance to work well, generating even more demand for enterprise data storage systems.

Despite all the demand, world wide external storage revenues dropped 12% last year according to IDC.  Now the economy had a lot to do with this decline but another factor reducing external storage revenue is the ongoing drop in the price of storage on a $/GB basis.  To this point, that same IDC report stated that external storage capacity increased 33% last year.

Why Dell & HP wants 3PAR storage?

Margins on enterprise storage are good, some would say very good.  While raw disk storage can be had at under $0.50/GB, enterprise class storage is often 10 or more times that price.  Now that has to cover redundant hardware, software/firmware engineering and other characteristics, but this still leaves pretty good margins.

In my mind, Dell would see enterprise storage as a natural extension of their current enterprise server business.  They already sell and support these customers, including enterprise class storage just adds another product to the mix.  Developing enterprise storage from scratch is probably a 4-7 year journey with the right people, buying 3PAR puts them in the market today with a competitive product.

HP is already in the enterprise storage market today, with their XP and EVA storage subsystems.  However, having their own 3PAR enterprise class storage may get them better margins than their current XP storage OEMed from HDS.  But I think Chuck Hollis’s post on HP’s counter bid for 3PAR may have revealed another side to this discussion – sometime M&A is as much about constraining your competition as it is about adding new capabilities to a company.

——

What do you think?

Cloud storage, CDP & deduplication

Strange Clouds by michaelroper (cc) (from Flickr)
Strange Clouds by michaelroper (cc) (from Flickr)

Somebody needs to create a system that encompasses continuous data protection, deduplication and cloud storage.  Many vendors have various parts of such a solution but none to my knowledge has put it all together.

Why CDP, deduplication and cloud storage?

We have written about cloud problems in the past (eventual data consistency and what’s holding back the cloud) despite all that, backup is a killer app for cloud storage.  Many of us would like to keep backup data around for a very long time. But storage costs govern how long data can be retained.  Cloud storage with its low cost/GB/month can help minimize such concerns.

We have also blogged about dedupe in the past (describing dedupe) and have written in industry press and our own StorInt dispatches on dedupe product introductions/enhancements.  Deduplication can reduce storage footprint and works especially well for backup which often saves the same data over and over again.  By combining deduplication with cloud storage we can reduce the data transfers and data stored on the cloud, minimizing costs even more.

CDP is more troublesome and yets still worthy of discussion.  Continuous data protection has always been sort of a step child in the backup business.  As a technologist, I understand it’s limitations (application consistency) and understand why it has been unable to take off effectively (false starts).   But, in theory at some point CDP will work, at some point CDP will use the cloud, at some point CDP will embrace deduplication and when that happens it could be the start of an ideal backup environment.

Deduplicating CDP using cloud storage

Let me describe the CDP-Cloud-Deduplication appliance that I envision.  Whether through O/S, Hypervisor or storage (sub-)system agents, the system traps all writes (forks the write) and sends the data and meta-data in real time to another appliance.  Once in the CDP appliance, the data can be deduplicated and any unique data plus meta data can be packaged up, buffered, and deposited in the cloud.  All this happens in an ongoing fashion throughout the day.

Sometime later, a restore is requested. The appliance looks up the appropriate mapping for the data being restored, issues requests to read the data from the cloud and reconstitutes (un-deduplicates) the data before copying it to the restoration location.

Problems?

The problems with this solution include:

  • Application consistency
  • Data backup timeframes
  • Appliance throughput
  • Cloud storage throughput

By tieing the appliance to a storage (sub-)system one may be able to get around some of these problems.

One could configure the appliance throughput to match the typical write workload of the storage.  This could provide an upper limit as to when the data is at least duplicated in the appliance but not necessarily backed up (pseudo backup timeframe).

As for throughput, if we could somehow understand the average write and deduplication rates we could configure the appliance and cloud storage pipes accordingly.  In this fashion, we could match appliance throughput to the deduplicated write workload (appliance and cloud storage throughput)

Application consistency is more substantial concern.  For example, copying every write to a file doesn’t mean one can recover the file.  The problem is at some point the file is actually closed and that’s the only time it is in an application consistent state.  Recovering to a point before or after this, leaves a partially updated, potentially corrupted file, of little use to anyone without major effort to transform it into a valid and consistent file image.

To provide application consistency, one needs to somehow understand when files are closed or applications quiesced.  Application consistency needs would argue for some sort of O/S or hypervisor agent rather than storage (sub-)system interface.  Such an approach could be more cognizant of file closure or application quiesce, allowing a synch point could be inserted in the meta-data stream for the captured data.

Most backup software has long mastered application consistency through the use of application and/or O/S APIs/other facilities to synchronize backups to when the application or user community is quiesced.  CDP must take advantage of the same facilities.

Seems simple enough, tie cloud storage behind a CDP appliance that supports deduplication.  Something like this could be packaged up in a cloud storage gateway or similar appliance.  Such a system could be an ideal application for cloud storage and would make backups transparent and very efficient.

What do you think?

5 killer apps for $0.10/TB/year

iblioteca José Vasconcelos / Vasconcelos Library by * CliNKer * (from flickr) (cc)
iblioteca José Vasconcelos / Vasconcelos Library by * CliNKer * (from flickr) (cc)

Cloud storage keeps getting more viable and I see storage pricing going down considerably over time.  All of which got me thinking what could be done with a dime per TB per year storage ($0.10/TB/yr).  Now most cloud providers charge 10 cents or more per GB per month so this is at least 12,000 times less expensive but it’s inevitable at some point in time.

So here are my 5 killer apps for $0.10/TB/yr cloud storage:

  1. Photo record of life – something akin to glasses which would record a wide angle, high mega-pixel video record of everything I looked at, for every second of my waking life.  I think at a photo shot every second for 12hrs/day 365days/yr would be about ~16M photos and at 4 MB per photo this would be about ~64TB per person year.  For my 4 person family this would cost ~$26/year for each year of family life and for a 40 year family time span, the last payment for this would be ~$1040 or an average payment of $520/year.
  2. Audio recording of life – something akin to a always on bluetooth headset which would record an audio feed to go with the semi-video or photo record above.  By being an always on bluetooth headset it would automatically catch cell phone as well as  spoken conversations but it would need to plug to landlines as well.  As discussed in my YB by 2015 archive post, one minute of MP3 audio recording takes up roughly a MB of storage.  Lets say I converse with someone ~33% of my waking day.  So this would be about 4 hrs of MP3 audio/day 365days/yr or about 21TB per year per person.  For my family this would cost or ~$8.40/year for storage and for a 40 year family life span my last payment would be ~$336 or an average of $168/yr.
  3. Home security cameras – with ethernet based security cameras, it wouldn’t be hard to record a 360 degree outside as well as inside points of entry coverage video.  The quantities for the photo record of my life would suffice for here as well but one doesn’t need to retain the data for a whole year perhaps a rolling 30 day record would suffice but it would be recorded for 24 hours. Assuming 8 cameras outside and inside,  this could be stored in about 10TB of storage per camera, or  about 80TB of storage or $8/year but would not increase over time.
  4. No more deletes/version everything – if storage were cheap enough we would never delete data.  Normal data change activity is in the 5 to 10% per week rate, but this does not account for duplicating deleted data.  So let’s say we would need to store an additional 20% of your primary/active data per week for deleted data.  For a 1TB primary storage working set, a ~20% deletion rate per week would be 10TB of deleted data per year per person and for my family ~$4/yr and my last yearly payment would be ~$160.  If we were to factor in data growth rates of ~20%/year, this would go up substantially averaging ~$7.3k/yr over 40 years.
  5. Customized search engines – if storage AND bandwidth were cheap enough it would be nice to have my own customized search engine. Such a capability would follow all my web clicks, spawning a search spider for every website I traverse and provide customized “deep” searching for every web page I view.   Such an index might take 50% of the size of a page and on average my old website used ~18KB per page, so at 50% this index would require 9KB. Assuming, I look at ~250 web pages per business day of which maybe ~170 are unique and each unique page probably links to 2 more unique pages, which links to two more, which links to two more, … If we go 10 pages deep, then for 170 pages viewed, an average branching factor of 2,  we would need to index ~174K pages/day and for a year, this would represent about represent about 0.6TB of page index.  For my household, a customized search engine would cost  ~$0.25 of additional storage per year and for 40 years my last payment would be $10.

I struggled with coming with ideas that would cost between $10 and $500 a year as every other storage use came out significantly less than $1/year for a family of four.  This seems to say that there might be plenty of applications in the range of under a $10 per TB per year, still 1200X current cloud storage costs.

Any other applications out there that could take  advantage of a dime/TB/year?

More cloud storage gateways come out

Strange Clouds by michaelroper (cc) (from Flickr)
Strange Clouds by michaelroper (cc) (from Flickr)

Multiple cloud storage gateways either have been announced or are coming out in the next quarter or so. We have talked before about Nasuni’s file cloud storage gateway appliance, but now that more are out one can have a better appreciation of the cloud gateway space.

StorSimple

Last week I was talking with StorSimple that just introduced their cloud storage gateway which provides a iSCSI block protocol interface to cloud storage with an onsite data caching.  Their appliance offers a cloud storage cache residing on disk and/or optional flash storage (SSDs) and provides iSCSI storage speeds for highly active working set data residing on the cache or cloud storage speeds for non-working set data.

Data is deduplicated to minimize storage space requirements.  In addition data sent to the cloud is compressed and encrypted. Both deduplication and compression can reduce WAN bandwidth requirements considerably.    Their appliance also offers snapshots and “cloud clones”.  Cloud clones are complete offsite (cloud) copies of a LUN which can then be maintained in synch with the gateway LUNs by copying daily change logs and applying the logs.

StorSimple works with Microsoft’s Azure, AT&T, EMC Atmos, Iron Mountan and Amazon’s S3 cloud storage providers.   A single appliance can support multiple cloud storage providers segregated on a LUN basis.  Although how cross-LUN deduplication works across multiple cloud storage providers was not discussed.

Their product can be purchased as a hardware appliance with a few 100GB of NAND/Flash storage up to a 150TB of SATA storage.  It also can be purchased as a virtual appliance at lower cost but also much lower performance.

Cirtas

In addition to StorSimple, I have talked with Cirtas which has yet to completely emerge from stealth but what’s apparent from their website is that the Cirtas appliance provides “storage protocols” to server systems, and can store data directly on storage subsystems or on cloud storage.

Storage protocols could mean any block storage protocol which could be FC and/or iSCSI but alternatively, it might mean file protocols I can’t be certain.  Having access to independent, standalone storage arrays may mean that  clients can use their own storage as a ‘cloud data cache’.  Unclear how Cirtas talks to their onsite backend storage but presumably this is FC and/or iSCSI as well.  And somehow some of this data is stored out on the cloud.

So from our perspective it looks somewhat similar to StorSimple with the exception that it uses external storage subsystems for its cloud data cache for Cirtas vs. internal storage for StorSimple.  Few other details were publicly available as this post went out.

Panzura

Although I have not talked directly with Panzura they seem to offer a unique form of cloud storage gateway, one that is specific to some applications.  For example, the Panzura SharePoint appliance actually “runs” part of the SharePoint application (according to their website) and as such, can better ascertain which data should be local versus stored in the cloud.  It seems to have  both access to cloud storage as well as local independent storage appliances.

In addition to a SharePoint appliance they offer a “”backup/DR” target that apparently supports NDMP, VTL, iSCSI, and NFS/CIFS protocols to store (backup) data on the cloud. In this version they show no local storage behind their appliance by which I assume that backup data is only stored in the cloud.

Finally, they offer a “file sharing” appliance used to share files across multiple sites where files reside both locally and in the cloud.  It appears that cloud copies of shared files are locked/WORM like but I can’t be certain.  Having not talked to Panzura before, much of their product is unclear.

In summary

We now have both a file access and at least one iSCSI block protocol cloud storage gateway, currently available, publicly announced, i.e., Nasuni and StorSimple.  Cirtas, which is in the process of coming out, will support a “storage protocol” access to cloud storage and Panzura offers it all (SharePoint direct, iSCSI, CIFS, NFS, VTL & NDMP cloud storage access protocols).  There are other gateways just focused on backup data, but I reserve the term cloud storage gateways for those that provide some sort of general purpose storage or file protocol access.

However, Since last weeks discussion of eventual consistency, I am becoming a bit more concerned about cloud storage gateways and their capabilities.  This deserves some serious discussion at the cloud storage provider level and but most assuredly, at the gateway level.  We need some sort of generic statement that says they guarantee immediate consistency for data at the gateway level even though most cloud storage providers only support “eventual consistency”.  Barring that, using cloud storage for anything that is updated frequently would be considered unwise.

If anyone knows of another cloud storage gateway I would appreciate a heads up.  In any case, the technology is still young yet and I would say that this isn’t the last gateway to come out but it feels like these provide coverage for just about any file or block protocol one might use to access cloud storage.