Can we back up a PB?

Tradition says no way. IT backup history says not on your life. Common sense would say never in a million years.

Most organizations with a PB of data or more depend on remote replication to protect against data center outage or massive data loss. This of course costs ~2X your original data center. And for some organizations one replica is not enough, so make that ~3X.

I don't know what PB scale data storage costs these days, but I can't believe it's under a couple of million dollars (USD) in hw and sw costs, plus probably at least another million or so per year in OpEx. Multiply that by 2 or 3X and you're now talking real money.

How could backup help?

Well for one you wouldn’t need replicas, so that would cut your hw & sw acquisition costs by a factor of 2 or 3. But backup storage is not free either. So you’d probably need to add back 30-50% of the original data center in hw & sw costs for backups.

You certainly wouldn’t need as many admins. And power for backup storage should also be substantially less. So maybe your OpEx would only be 1.5X in total for the original PB and its backups.

But what could possibly back up a PB of data?

We were talking with Igneous at Cloud Field Day 8 (CFD8, see their video here) a couple of weeks back and they said they could, and do, back up PBs of data for customers today. A while back, we also talked with them on a GreyBeards on Storage podcast.

The problems with backing up a PB seem insurmountable. First you have to be able to scan a PB of data. This means looking into multiple file systems on many different hardware platforms, across potentially multiple data centers, and that’s just to get a baseline of what all needs to be backed up.

Then at some point you actually have to store all that data on backup storage. So, to gain some cost advantage, you’d want to compress and deduplicate a PB of data, so that the first full backup wouldn’t take a full PB of backup storage.

Then of course you have to transfer a PB of data to your backup storage, in something that wouldn’t take months to perform. And that just gets you the first full backup.

Next comes the daily scan of what's changed. This has to re-scan your PB of data to find the 100TB or so that's changed over the last 24 hrs. Sometime after that scan completes, all that changed data needs to be compressed, deduped and transferred to backup storage.

And if that's not enough, you have to do it all over again, every day, from now on, almost forever. And data continues to grow, so 1PB today is likely to be 2PB or more in 12 months (it's great to be in the storage business).

So those are the challenges. How can it be done effectively, day in and day out, reliably enough that IT can depend on their data being backed up?

Igneous to the rescue…

First, Igneous came out of stealth a while back (listen to our podcast) with a couple of unique capabilities needed for massive data repository discovery and analysis. That is, they built a unique engine to scan and index PB scale data repositories, so they could provide administrators better visibility into them. But this post isn't about that product, it's about backup.

But some of the capabilities they needed to support that product helped them perform backups as well. For instance, their scan needed to handle PBs of data. They came up with AdaptiveSCAN, which doesn't use the standard NFS and SMB data transfer path to gain access to file metadata. Opening a file over NFS or SMB takes quite a lot of protocol transactions. But to access metadata only, one doesn't need all of those NFS and SMB capabilities; it can be done with much less overhead, even when using NFS or SMB.
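To give a feel for what a metadata-only scan looks like (a minimal sketch, assuming a locally mounted export; this is not Igneous's AdaptiveSCAN code and the mount point is hypothetical), the walk below collects file sizes and mtimes from directory entries and stat calls without ever opening file data:

```python
import os

def scan_metadata(root):
    """Walk a mounted file tree and collect per-file metadata without opening
    any file contents -- only directory reads and stat calls are issued."""
    inventory = {}
    stack = [root]
    while stack:
        path = stack.pop()
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    st = entry.stat(follow_symlinks=False)   # metadata only, no open()/read()
                    inventory[entry.path] = (st.st_size, st.st_mtime)
    return inventory

if __name__ == "__main__":
    baseline = scan_metadata("/mnt/nas_export")   # hypothetical NFS/SMB mount point
    print(f"indexed {len(baseline)} files")
```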

Of course, having a way to scan billions of files was a major accomplishment, but then where do you put all that metadata? And how can you access it effectively to support backing up a PB data repository? They needed some serious data indexing capabilities and so came up with InfiniteINDEX.
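Igneous hasn't published InfiniteINDEX's internals, but conceptually the index has to persist billions of those metadata records and answer "what changed since the last backup?" cheaply. A purely illustrative sketch using SQLite (my own toy schema, not theirs):

```python
import sqlite3

def open_index(db_path):
    """Open (or create) a file-metadata index."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS files (
                       path  TEXT PRIMARY KEY,
                       size  INTEGER,
                       mtime REAL)""")
    return con

def record_scan(con, inventory):
    """Upsert the latest scan results (path -> (size, mtime)) into the index."""
    con.executemany(
        "INSERT INTO files(path, size, mtime) VALUES (?, ?, ?) "
        "ON CONFLICT(path) DO UPDATE SET size=excluded.size, mtime=excluded.mtime",
        [(p, sz, mt) for p, (sz, mt) in inventory.items()])
    con.commit()

def changed_since(con, last_backup_time):
    """Return files whose mtime is newer than the last successful backup."""
    cur = con.execute("SELECT path FROM files WHERE mtime > ?", (last_backup_time,))
    return [row[0] for row in cur]
```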

Now, a trillion-item index seems a bit much, even for PB scale repositories. But my guess is they have eyes on taking their PB scale backups and going after even bigger fish, that is, offering backups for EB scale data repositories. And that might just take a trillion-item index.

Next, moving PBs or even TBs of data quickly is no small trick. As the development team at Igneous mostly came from unstructured data providers, they also understood and have access to APIs for most storage vendors (NetApp, Dell-EMC Isilon, Pure FlashBlade, Qumulo, etc.). As such, where available, they utilize those native vendor storage API calls to help move data rather than having to open an NFS or SMB file and read it.

Of course, even doing all that, moving 100TBs of data around or scanning PB sized data repositories is going to take a lot of processing and IO bandwidth to do in a reasonable period of time. 

So another capability they developed is massive parallelism, that is, being able to distribute scan, indexing or data movement work out to multiple systems. In that fashion, the work can be accomplished in significantly less wall clock time.
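As a toy illustration of that fan-out (a sketch only, not Igneous's implementation), a coordinator could shard the top-level directories of an export across a pool of worker processes; in their architecture the workers would be separate systems rather than local processes:

```python
import os
from concurrent.futures import ProcessPoolExecutor

def scan_shard(directory):
    """Scan one subtree; in a real system this would run on a separate node."""
    count = 0
    for _dirpath, _dirnames, filenames in os.walk(directory):
        count += len(filenames)
    return directory, count

def parallel_scan(root, workers=8):
    """Fan the per-directory scans out across a pool of workers."""
    shards = [e.path for e in os.scandir(root) if e.is_dir(follow_symlinks=False)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(scan_shard, shards))

if __name__ == "__main__":
    for shard, nfiles in parallel_scan("/mnt/nas_export").items():   # hypothetical mount
        print(shard, nfiles)
```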

Well, with all that, they pretty much had the guts of a backup application for PB data repositories, but they still didn't have the glue to put it all together. Recently they announced just that: Igneous DataProtect, a full scale backup application for PBs of data.

I suppose I haven’t done justice to all of what they have developed or talked about at their session, so I would suggest viewing their talk at CFD8 and listening to our GBoS podcast to learn more. They did demo their product at CFD8 but I believe it was a canned demo.

I didn’t think I’d see the day when some vendor would offer backup services for PBs of data let alone be shooting for more, but there you have it. Igneous means to take your PB scale data repositories and make them as easy to operate as TB scale data repositories. They call that democratizing data.

Comments?

See these other CFD8 bloggers' write-ups on Igneous.

CFD8 – Igneous Follow Up by Nate Avery (@Nathaniel_Avery)

Picture credit(s): All from screen saves during Igneous’s session at CFD8

Rubrik has a better idea for VMware backup

Cluster nodes
Rubrik has been around since January 2014 and just GA'd in April of last year. They recently presented at Tech Field Day 10 (TFD10, videos here) with Chris Wahl, Technical Evangelist, Arvin "Nitro" Nithrakashyap, Co-Founder, and Bipul Sinha, Co-Founder, in attendance.

I have known Chris Wahl since November of 2013, from our time together on Storage Field Day 4 (SFD4). Howard and I (the “Greybeards”) also interviewed Chris Wahl for Rubrik on a Greybeards on Storage podcast.

Veeam’s upcoming V8 virtues

[Not] Vamoosing VMworld

We were at Storage Field Day 5 (SFD5, see the videos here) last month and had a briefing on Veeam’s upcoming V8 release.

They also told us (news to me) that they are leaving VMworld [I sit corrected: I have been informed after this went to press that Veeam is not leaving VMworld 2014 and never said anything about it at the session – my mistake, I take full responsibility, sorry for any confusion] (sigh, now who's going to have THE after-conference KILLER PARTY at VMworld) and moving to [but they did say they are definitely starting up] their own VeeamON conference at The Cosmopolitan in Las Vegas on October 6, 7 & 8 this year. If their VMworld parties are any indication, the conference at the Cosmo should be a fun and rewarding time for all. Pre-registration is open and they have a call out for papers.

Doug Hazelman (@VMDoug), Rick Vanover (@RickVanover) and Luca Dell’Oca (@dellock6) all presented although Luca’s session was under strict NDA to be revealed later. I think sometime later this summer.

Doug mentioned that after 6 years, Veeam now has over 100,000 customers worldwide. One of their more popular early innovations was the ability to run a VM directly off of a backup, and sometime over the past couple of years they have moved from a VMware-only backup & replication solution to also supporting Microsoft Hyper-V (more news to me).

V8’s virtues

Veeam V8 will add some interesting capabilities to the Veeam product solutions:

  • (VMware only) Built-in backups from storage snapshots – (Enterprise Plus edition only) Backup from VMware snapshots can sometimes impact app performance, especially when it comes time to commit changes. But starting with V7, Veeam offers backup utilizing VMware's Change Block Tracking (CBT) and takes backups directly from storage snapshots for HP 3PAR StoreServ and HP (Lefthand) StoreVirtual/StoreVirtual VSA and, in the soon to be available V8, NetApp FAS (Data ONTAP 8.1 or above, 7- or cluster-mode, clones too) storage systems. First, Veeam does its application level processing (VSS operations under Windows Server); after that completes, it tells VMware to take a (VMware) snapshot; when that completes, it tells the storage to take a (storage) snapshot; and when that completes, it releases the VMware snapshot (see the sketch after this list). All this allows them to utilize VMware CBT as well as storage snapshots, which makes it up to 20 times faster than normal VMware snapshot backups, since they can back up directly from the storage snapshot using the Veeam proxy. Also, because the VMware snapshot is so short lived, there is little overhead for committing any changes. And there is no need to use a proxy ESX server to do this, i.e., promote the VMware snapshot to a LUN, add it to an ESX host, resignature, add the VM, and do all the backups, which, of course, destroys CBT. This works for FC, iSCSI and NFS data stores. With NetApp storage you can also take the (VSS) application consistent snapshot and copy it to SnapVault.
  • Veeam Explorer (recovery) for storage snapshots – (Free backup edition) Recovery from (HP in V7 & NetApp in V8) storage snapshots is yet another feature and provides item (e.g., emails, contacts, email folders for Exchange), granular (VM level or file level) or full (volume) recovery from storage based snapshots, regardless of how those storage snapshots were created.
  • Veeam Explorer for SQL Server (V8 only) – (unsure what license is required) Similar to the Explorer for snapshots discussed above, this allows a Veeam admin to do item level recovery for an SQL database. This includes recovery from Veeam backup repositories as well as storage snapshots. It means that you could restore a ROW of an SQL table, an SQL TABLE, or a whole SQL database. Now, DBAs have always had these sorts of abilities, which required using log services. But allowing a Veeam admin to do these sorts of activities seems like putting a gun in the hands of a child (or maybe a bazooka in the hands of an untrained civilian).
  • Veeam Explorer for Active Directory (V8 only) – (unsure what license is required) You've seen what's available above; just consider these same capabilities applied to Active Directory. This means you can restore a password hash, user, group or organizational unit (OU). I don't know about you, but this seems more akin to a howitzer in the hands of a civilian.
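The ordering described in the first bullet above is easier to see laid out as code. Below is a minimal, runnable sketch of that sequence; every function name is a hypothetical stand-in, not Veeam's or VMware's actual API:

```python
# Hypothetical stand-ins for the VMware / array / repository calls involved.
def vss_freeze(vm): print(f"VSS freeze {vm}")
def vss_thaw(vm): print(f"VSS thaw {vm}")
def take_vmware_snapshot(vm): print(f"VMware snapshot of {vm}"); return f"{vm}-vmsnap"
def release_vmware_snapshot(vm, snap): print(f"release {snap}")
def take_storage_snapshot(lun): print(f"array snapshot of {lun}"); return f"{lun}-arraysnap"
def backup_changed_blocks(snap, repo): print(f"copy CBT-changed blocks from {snap} to {repo}")

def backup_from_storage_snapshot(vm, lun, repo):
    """Order of operations for a backup from a storage snapshot (illustrative only)."""
    vss_freeze(vm)                          # 1. application-consistent quiesce (VSS)
    vmsnap = take_vmware_snapshot(vm)       # 2. short-lived VMware snapshot (keeps CBT usable)
    arraysnap = take_storage_snapshot(lun)  # 3. array snapshot (3PAR / StoreVirtual / NetApp FAS)
    release_vmware_snapshot(vm, vmsnap)     # 4. release VMware snapshot -> tiny commit overhead
    vss_thaw(vm)
    backup_changed_blocks(arraysnap, repo)  # 5. proxy reads CBT-changed blocks from the array snapshot

backup_from_storage_snapshot("exchange-vm", "datastore-lun-07", "veeam-repo")
```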

They showed an example of a competitive situation, running V8 (in beta?) with NetApp snapshot-based backups versus some unnamed competition. They were able to complete a full backup in 1/4 the time of their competition (2 hrs. vs. 8 hrs.) and completed incremental backups in 35 min. vs. 2 hrs. for the competition.

“Thar be dragons there …”

Ok, maybe I am a little more paranoid than the average IT guy/gal. But in my (old world, greybeards) view, SQL databases belong in the realm of DBAs and Active Directory databases belong to domain controller admins. Messing around with production versions of SQL DBs or AD DBs seems hazardous to a data center's health. We're not just talking files anymore here, guys.

In Veeam's defense, these new Explorer recovery tools are probably only going to be used to do something that needs to be done right away, to get things back operating again, and would not be used unless there's a real need/emergency to do so. Otherwise, let the DBAs and security admins do it with their log recovery tools. And another thing: they have had similar capabilities for Exchange emails, folders, contacts, etc. and no one's shot their foot off yet, so why the concern?

Nonetheless, I feel strongly that these tools ought to be placed under lock and key and the key put in a safe with the combination under a glass case labeled IN CASE OF EMERGENCY, BREAK GLASS.

Comments?

SCI’s Latest ESRP (v3) Performance Analysis for Over 5K mailboxes – chart of the month

Bar chart showing ESRP Top 10 total database backup throughput results
(SCIESRP120125-001) (c) 2012 Silverton Consulting, All Rights Reserved

This chart comes from our analysis of Microsoft Exchange Solution Reviewed Program (ESRP) v3 (Exchange 2010) performance results for the over 5000 mailbox category, a report sent out to SCI Storage Intelligence Newsletter subscribers last month.

The total database backup throughput reported here is calculated as the MB read per second per database times the number of databases in a storage configuration. ESRP currently reports two metrics for database backups: the first, used in our calculation above, is backup throughput on a per-database basis; the second is backup throughput on a per-server basis. I find neither of these that useful from a storage system perspective, so we have calculated a new metric for our ESRP analysis which represents the total Exchange database backup throughput per storage system.
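To make the derived metric concrete, the calculation is simply per-database throughput multiplied by database count; the numbers below are made-up examples, not from any actual submission:

```python
def total_db_backup_throughput(mb_per_sec_per_db, num_databases):
    """Total database backup throughput (MB/sec) for a storage configuration."""
    return mb_per_sec_per_db * num_databases

# Hypothetical submission: 40 MB/sec per database across 24 databases
print(total_db_backup_throughput(40.0, 24))   # -> 960.0 MB/sec
```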

Finding three submissions (#1, #3 & #8 above) for the same storage system (HDS USP-V) is somewhat unusual in any top 10 chart and, as such, provides a unique opportunity to understand how to optimize storage for Exchange 2010.

For example, the first thing I noticed when looking at the details behind the above chart is that disk speed doesn't speed up database throughput. The #1, #3 and #8 submissions above (HDS USP-V using Dynamic, or thin, provisioning) had 7200rpm, 15Krpm and 7200rpm disks respectively, with 512 disks each.

So what were the significant differences between the USP-V submissions (#1, #3 and #8) aside from mailbox counts and disk speed:

  • Mailboxes per server differed from 7000 to 6000 to 4700 respectively, with the top 10 median = 7500
  • Mailboxes per database differed from 583 to 1500 to 392, with the top 10 median = 604
  • Number of databases per host (Exchange server) differed from 12 to 4 to 12, with the top 10 median = 12
  • Disk size differed from 2TB to 600GB to 2TB, with the top 10 median = 2TB
  • Mailbox size differed from 1GB to 1GB to 3GB, with the top 10 median = 1.0 GB
  • % storage capacity used by Exchange databases differed from 27.4% to 80.0% to 55.1%, with the top 10 median = 60.9%

If I had to guess, the reason the HDS USP-V system with faster disks didn't back up as well as the #1 system is that its mailbox databases spanned multiple physical disks. For instance, in the case of the (#3) 15Krpm/600GB FC disk system, it took at least 3 disks to hold a 1.5TB mailbox database. For the #1 performing 7200rpm/2TB SATA disk system, a single disk could hold almost four 583GB databases. The slowest performer (#8), also with 7200rpm/2TB SATA disks, could hold at most one 1.2TB mailbox database per disk.
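The arithmetic behind that guess, using the mailbox counts, mailbox sizes and disk sizes listed above, can be sketched quickly:

```python
def databases_per_disk(mailboxes_per_db, mailbox_size_gb, disk_size_gb):
    """How many mailbox databases fit on one physical disk (fractional)."""
    db_size_gb = mailboxes_per_db * mailbox_size_gb
    return disk_size_gb / db_size_gb

# Figures from the three HDS USP-V submissions discussed above
print(databases_per_disk(583, 1, 2000))   # #1: ~3.4 -> almost four 583GB DBs per 2TB disk
print(databases_per_disk(1500, 1, 600))   # #3: ~0.4 -> a 1.5TB DB spans at least 3 disks
print(databases_per_disk(392, 3, 2000))   # #8: ~1.7 -> about one 1.2TB DB per 2TB disk
```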

One other thing that might be a factor between the #1 and #3 submissions is that the data being backed up per host was significantly different.  Specifically for a host in the #1 HDS USP-V solution they only backed up  ~4TB but for a host in the #3 submission they had to backup ~9TB.   However, this doesn’t help explain the #8 solution, which only had to backup 5.5TB per host.

How thin provisioning and average space utilization might have messed all this up is another question entirely. RAID 10 was used for all the USP-V configurations, with a 2d+2d disk configuration per RAID group. The LDEV configuration in the RAID groups was pretty different, i.e., for #1 & #8 there were two LDEVs, one 2.99TB and the other 0.58TB, whereas for #3 there was one LDEV of 1.05TB. These LDEVs were then added to Dynamic Provisioning pools for database and log files. (There might be something in how the LDEVs were mapped to V-VOL groups but I can't seem to figure it out.)

There's probably something else I'm missing here, but I believe a detailed study of these three HDS USP-V systems' ESRP performance would illustrate some principles on how to accelerate Exchange storage performance.

I suppose the obvious points to come away with here are to keep your Exchange databases  smaller than your physical disk sizes and don’t overburden your Exchange servers.

~~~~

The full ESRP performance report went out to our newsletter subscribers this past January. A copy of the full report will be up on the dispatches page of our website sometime next month (if all goes well). However, you can get the full ESRP performance analysis now and subscribe to future newsletters by just sending us an email or using the signup form above right.

For a more extensive discussion of block or SAN storage performance covering SPC-1&-2 (top 30) and ESRP (top 20) results please consider purchasing SCI’s SAN Storage Buying Guide available on our website.

As always, we welcome any suggestions on how to improve our analysis of ESRP results or any of our other storage performance analyses.

Comments?

 

A day and a half with HP Storage

A photo of bloggers and HP personnel waiting to go on the lab tour
Bloggers and HP people waiting to tour lab

[long post 945 wds] HP held their (annual?) HP Tech Days in Fort Collins, Colorado this last week. We had presentations from a number of HP product managers and got to meet a number of new and old bloggers there.

In attendance from the blogosphere were: Alastair Cooke (@DemitasseNZ), Brian Knudtson (@bknudtson), Howard Marks (@DeepStorageNet), John Obeto (@JohnObeto), Jeff Powers (@Geekazine), Rich Schandler (@recklessop), Derek Schauland (@webjunkie), Justin Vashisht (@3cVGuy), and Matt Vogt (@MattVogt).

Craig Nunes, VP of Marketing, HP Storage, got up and led off the day's discussion talking about recent results. HP disk storage is up 11% for the quarter, 3PAR is growing by triple digits (QoQ, maybe YoY?) and channel sales are growing by 10%. HP storage is gaining market share, growing 3% for the quarter. Also, HP is #2 in shipped backup appliances (1H11). The current focus for HP storage is in three areas:

  • Invest in established platforms, MSA and EVA (with a 100K customers)
  • Invest in converged storage aimed at new data centers, 3PAR, Lefthand, IBRIX and StoreOnce.
  • Invest in converged systems knocking down barriers between servers, storage and networking with Virtual Systems.

Craig spent most of his time talking about converged storage. HP’s converged storage includes:

  • built-in autonomic storage automating operations with one pane of glass and an orchestration layer on top to oversee everything.
  • scale out storage providing simpler ways to grow storage.
  • built on standardized platforms using off the shelf server platform technology

Craig ended up discussing HP’s Virtual System, their response to VCE’s Vblock, NetApp’s FlexPod and Dell’s vStart Bundle.   HP’s Virtual System was announced earlier last year and has been doing well in the market.

Brad Katz, Product Manager, got up next and talked about Lefthand storage solutions. Lefthand's portfolio now ranges from the Virtual Storage Appliance (VSA) all the way up to the P4800 SAN storage blade, with the P4300 and P4500 rackmountable storage systems between those two. Lefthand systems provide clustered, scale-out IP SAN and NAS storage. Cluster data is striped across all disks in all storage nodes.

The VSA runs as a virtual machine and utilizes any ESX  (direct or SAN attached) storage.  The P4800 operates as a storage blade in an HP blade server and uses storage in the blade system.  The two rackmount systems P4300 and P4500 connect to SAS attached, external disk shelves.

HP's Steve Johnson, at the front of the room discussing slide on StoreOnce
Steve Johnson on StoreOnce

Steve Johnson and Mat Jacoby talked next about the StoreOnce deduplicating backup appliance product line. StoreOnce is an HP R&D Labs home-grown deduplication technology that provides balanced ingest/restore rates and memory-efficient deduplication. The current product line spans the D2D25xx, D2D41xx, D2D43xx and the recently announced B6200 backup storage blade.

StoreOnce uses a variable block size (4K average chunk size) and a sparse index, which saves on server memory; together these lead to great deduplication rates. Most deduplication functionality is memory intensive, making it hard to scale without increasing memory or using different dedupe engines across a product line. StoreOnce's sparse indexing fixed that issue and, as such, HP can use the same deduplication engine across the entire product line.
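To make the chunking idea concrete, here's a generic content-defined chunking and dedupe sketch in Python. This is not HP's algorithm: a real sparse index keeps only a sampled fraction of chunk hashes in memory and exploits container locality, whereas this toy keeps a full in-memory index:

```python
import os
import hashlib

MIN_CHUNK = 1024            # minimum chunk size
AVG_MASK = 4096 - 1         # boundary test gives roughly a 4K average chunk size

def chunk(data: bytes):
    """Content-defined chunking: cut wherever a rolling byte hash hits the mask."""
    chunks, start, rolling = [], 0, 0
    for i, b in enumerate(data):
        rolling = ((rolling * 31) + b) & 0xFFFFFFFF
        if i + 1 - start >= MIN_CHUNK and (rolling & AVG_MASK) == AVG_MASK:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def dedupe_write(data: bytes, index: dict, container: list):
    """Store only chunks whose hash hasn't been seen; return bytes actually stored."""
    stored = 0
    for c in chunk(data):
        h = hashlib.sha1(c).digest()
        if h not in index:              # new chunk -> keep it
            index[h] = len(container)
            container.append(c)
            stored += len(c)
    return stored

index, container = {}, []
payload = os.urandom(200_000)
print(dedupe_write(payload, index, container))                    # first backup stores ~200KB
print(dedupe_write(payload + b"changed tail", index, container))  # rerun stores only the changed tail chunk
```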

HP's JR (Jim Richardson) at the front of the room discussing 3PAR's advantages
JR talking about 3PAR advantages

Jim Richardson or JR, a 3PAR SE from the start, got up and discussed 3PAR.  Early on, 3PAR brought to the market three characteristics that differentiated it from other enterprise storage products:

  • Multi-tenancy – today's cloud service providers and just about anyone running enterprise storage needs to support mixed workloads on shared storage. 3PAR's ASIC allows data to be placed on any storage node and be serviced at direct access speeds to better support these multi-application environments.
  • Thin provisioning – although certainly not the first to support thin provisioning (Iceberg was the first), 3PAR did much to popularize it.  Once again the ASIC provides automated support for thin provisioning.  
  • Autonomic functionality – optimization of storage performance across nodes and tiers of storage was also helped by their ASIC's ability to transfer data without involving processor interaction. Also, 3PAR tried to take the drudgery out of administration by automatically wide striping and making provisioning easier.

Jim Hankins and Chris Duffy came up next and talked about the X9000 IBRIX storage system. IBRIX has intrinsic scale-out NAS support and provides automatic failover across dual processing nodes called couplets. The B6200 backup system (see above) is based on IBRIX technology. IBRIX supports a 15PB single namespace that is segmented across cluster couplets. IBRIX also comes in a gateway configuration using shared SAN storage behind it.

A picture of a X5000 without skins, and a couple of CRUs taken out
HP X5000 NAS system

Robert Thompson got up and talked about the X5000, a Windows Storage Server (WSS) based NAS product. It is the industry's first two-node file system with active/active clustering in a box. As the product runs Windows Server, one can run anti-virus or other server applications directly on the storage, and the system is customer maintainable. Robert pulled out every replaceable unit in the system. Apparently the E5000, HP Storage's Exchange appliance, is also based on the same hardware. The two servers in the storage system are clustered together using MSCS.

A photo of an intelligent data center floor tile with remotely controlled mechanical louvres to control air flow.
HPer showing off intelligent floor tiles

In the afternoon we went on a lab tour and got to see some of HP’s storage and data center cooling technology on display.

On the second day, Mike Koponen got up and discussed HP’s Virtual System (or Vblock competitor) and Aboubacar Diare gave some of his opinions on VMware VAAI & VASA integration from his testing perspective.  Finally, Calvin Zito wrapped up the two day event and everyone (except me and a few others) went on a brewery tour.

~~~~

All in all, we had a good time with HP.  Too bad, I didn’t get to go on the New Belgium Brewery tour, perhaps next time.

Comments?

 

 

Latest Microsoft ESRP v3 (Exchange 2010) 1K to 5K mailbox performance results – chart of the month

SCIESRP110726-004 (c) 2011 Silverton Consulting, All Rights Reserved

Microsoft specifies two different metrics on sequential read rates for database backup activity in their Exchange Solution Reviewed Program (ESRP) reports:

  • MB read/sec per database
  • MB read/sec total per server

Our problem with these metrics is that they don't say much about the storage system's performance. Some ESRP submissions have a single database while others have 100s of databases. And the same applies to servers, although 20 servers seems to be about the max we have seen. So, as one can see, MB/s/DB or MB/s/server can vary all over the place depending on the Exchange configuration one uses, even for the exact same storage system.

In the above chart, we have attempted to move beyond some of these problems and use the information supplied in the ESRP reports to aggregate DB backups across all databases. As such, we have derived a new metric called "total database backup". (Pretty simple actually: just multiply MB/s/DB by the number of databases in the Exchange configuration.)

A couple of problems with our approach.

  • Current ESRP reports typically utilize a shadow storage system and shadow Exchange servers which host 50% of the databases and email activity. So what I am showing for those ESRP reports is what two storage systems can accomplish, not one.
  • Another potential way to get the same result would be to use the number of servers times the MB/sec/server metric. (But try as I might these two approaches didn’t work to get the same answer so I am using the computation above – must be the way I am recording the number of [shadow] servers).
  • Although ESRP reports the average MB/sec/database to backup a single database it’s not clear that these measurements were taken while backing up all active databases at the same time, especially for those submissions with 100s of databases.

Probably the last is the most problematic critique of our new measure, but it may not be that harmful for smaller configurations. Nonetheless, we produced the above chart and published it in last month's review of ESRP results for the 1001 to 5000 mailbox category.

One item we discussed in our report was that the number of disk drives didn't seem to correlate well with high positions on this chart. The number ten position (Fujitsu ETERNUS JX40) used over 140 disks, the number two position (Dell PowerEdge R510) had only 12 disk drives, and the number one solution (HP E5700) consisted of 56 drives, close to the average for this category.

One striking finding using this measure is that performance varies considerably from the top providing over 1600 MB/sec of database backup to the lowest of the group providing only ~800 MB/sec of backup performance. What with Exchange 2010 and lagged DAGs, some people feel that backup activity is no longer needed but we would disagree. We continue to believe that taking backups of Exchange data still makes a whole lot of sense and shouldn’t go away, ever.

It’s our hope that this or some similar follow-on metric will remove some of the Exchange configuration parameters from confounding ESRP reported storage system performance results.  We realize that this quixotic quest may never be entirely successful nevertheless we perform this duty in the hope that it will benefit today and future storage performance analysts everywhere.

Comments?

—–

The full ESRP report went out to our newsletter subscribers last month.  A copy of the full report will be up on the dispatches page of our website later next month. However, you can get this information now and subscribe to future newsletters to receive these reports even earlier by just emailing us at SubscribeNews@SilvertonConsulting.com?Subject=Subscribe_to_NewsletterR or using the signup form above and to the right.

As always, we welcome any suggestions on how to improve our analysis of ESRP or any of our other storage system performance discussions.

EMCWorld day 3 …

Sometime this week EMC announced a new generation of Isilon NearLine storage which now includes HGST 3TB SATA disk drives.  With the new capacity the multi-node (144) Isilon cluster using the 108NL nodes can support 15PB of file data in a single file system.

Some of the booths along the walk to the solutions pavilion highlight EMC innovation winners. Two that caught my interest included:

  • Constellation computing – not quite sure how to define this, but it's distributed computing along with distributed data creation. The intent is to move the data processing to the source of the data creation and keep the data there. This might be very useful for applications that have many data sources and where data processing capabilities can be moved out to the nodes where the data was created. Seems highly scalable, but may depend on the ability to carve up the processing to work on the local data. I can see where compression, encryption, indexing and some statistical summarization could be done at the data creation site before it's sent elsewhere. Sort of like a sensor mesh with processing nodes attached to the sensors, configured as a sensor-processing grid. Only one thing concerned me: there didn't seem to be any central repository or control to this computing environment. Probably what they intended, as the distributed solution is more adaptable and more scalable than a centrally controlled environment.
  • Developing world healthcare cloud – seemed to be all about delivering healthcare to the bottom of the pyramid.  They won EMC’s social innovation award and are working with a group in Rwanda to try to provide better healthcare to remote villages.  It’s built around OpenMRS as a backend medical record archive hosted on EMC DC powered Iomega NAS storage and uses Google’s OpenDataKit to work with the data on mobile and laptop devices.  They showed a mobile phone which could be used to create, record and retrieve healthcare information (OpenMRS records) remotely and upload it sometime later when in range of a cell tower.  The solution also supports the download of a portion of the medical center’s health record database (e.g., a “cohort” slice, think a village’s healthcare records) onto a laptop, usable offline by a healthcare provider to update and record  patient health changes onsite and remotely.  Pulling all the technology together and delivering this as an application stack usable on mobile and laptop devices with minimal IT sophistication, storage and remote/mobile access are where the challenges lie.

Went to Sanjay’s (EMC’s CIO) keynote on EMC IT’s journey to IT-as-a-Service. As you can imagine it makes extensive use of VMware’s vSphere, vCloud, and vShield capabilities primarily in a private cloud infrastructure but they seem agnostic to a build-it or buy-it approach. EMC is about 75% virtualized today, and are starting to see significant and tangible OpEx and energy savings. They designed their North Carolina data center around the vCloud architecture and now are offering business users self service portals to provision VMs and business services…

Only caught the first section of BJ’s (President of BRS) keynote but he said recent analyst data (think IDC?) said that EMC was the overall leader (>64% market share) in purpose built backup appliances (Data Domain, Disk Library, Avamar data stores, etc.).  Too bad I had to step out but he looked like he was on a roll.

Comments?

EMC Data Domain products enter the archive market

(c) 2011 Silverton Consulting, Inc., All Rights Reserved

In another assault on the tape market, EMC announced today a new Data Domain 860 Archiver appliance. This new system supports both short-term and long-term retention of backup data. This attacks one of the last bastions of significant tape use – long-term data archives.

Historically, a cheap version of archive had been the long-term retention of full backup tapes. As such, if one needed to keep data around for 5 years, one would keep all their full backup tape sets offsite, in a vault somewhere, for 5 years. They could then rotate the tapes (bring them back into scratch use) after the 5 years elapsed. One problem with this: tape technology advances to a new generation more like every 2-3 years, and as such, a 5-year old tape cartridge would be at least one generation back before it could be re-used. But current tape technology always reads 2 generations back and writes at least one generation back, so this use would still be feasible. I would say that many tape users did something like this to create a "pseudo-archive".

On the other hand, there exist many specific archive point products focused on one or a few application arenas, such as email, records, or database archives, which would extract specific data items and place them into an archive. These did not generally apply outside one or a few application domains but were used to support stringent compliance requirements. The advantage of these application-based archive systems is that the data was actually removed from primary storage, out of any data protection activities, and placed permanently in "archive storage" only. Such data would be subject to strict retention policies and, as such, would be inviolate (couldn't be modified) and could not be deleted until formally expired.

Enter the Data Domain 860 Archiver. This system supports up to 24 disk shelves, each one of which can be dedicated to either short- or long-term data retention. Backup file data is moved within the appliance, by automated policy, from short- to long-term storage. Up to 4 disk shelves can be dedicated to short-term storage, with the remainder considered long-term archive units.

When a long-term archive unit (disk shelf) fills up with backup data, it is "sealed", i.e., it is given all the metadata required to reconstruct its file system and deduplication domain and thus would not require the use of other disk shelves to access its data. In this way one creates a standalone unit that contains everything needed to recover the data. Not unlike a full backup tape set, which can be used in a standalone fashion to restore data.
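Conceptually, "sealing" just means the shelf carries its own copy of whatever metadata is needed to read its contents back without the rest of the appliance. As a purely illustrative sketch (my own toy format, nothing to do with Data Domain's actual on-disk layout), sealing a unit might look like writing a self-describing manifest alongside the data:

```python
import json
import hashlib
import pathlib

def seal_archive_unit(unit_dir: str):
    """Write a manifest into the unit so it can be restored standalone:
    the file namespace plus a per-file checksum catalog local to this unit."""
    unit = pathlib.Path(unit_dir)
    manifest = {"files": {}}
    for f in sorted(unit.rglob("*")):
        if f.is_file() and f.name != "MANIFEST.json":
            data = f.read_bytes()
            manifest["files"][str(f.relative_to(unit))] = {
                "bytes": len(data),
                "sha256": hashlib.sha256(data).hexdigest(),
            }
    (unit / "MANIFEST.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```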

Today, the Data Domain 860 Archiver only supports file access and DD Boost data access. As such, the backup software is responsible for deleting data that has expired. Such data will then be absent from any backups taken and, as policy automation copies the backups to long-term archive units, it will be missing from there as well.

While Data Domain's Archiver can't remove data from ongoing backup streams the way application-based archive products can, it does look exactly like what could be achieved with tape-based archives today.

One can also replicate base Data Domain or Archiver appliances to an Archiver unit to achieve offsite data archives.

—-

Full disclosure: I currently work with EMC on projects specific to other products but am not currently working on anything associated with this product.

Tape, your move…