Data retention – Page 2 – Silverton Consulting

EMCWorld day 3 …

Posted on May 11, 2011May 12, 2011 by Ray in Data, Data grid, Data retention, Distributed computing, File Storage, Storage architecture, Storage Backup, Systems

Sometime this week EMC announced a new generation of Isilon NearLine storage which now includes HGST 3TB SATA disk drives. With the new capacity the multi-node (144) Isilon cluster using the 108NL nodes can support 15PB of file data in a single file system.

Some of the booths along the walk to the solutions pavilion highlight EMC innovation winners. Two that caught my interest included:

Constellation computing – not quite sure how to define this but it’s distributed computing along with distributed data creation. The intent is to move the data processing to the source of the data creation and keep the data there. This might be very useful for applications that have many data sources and where data processing capabilities can be moved out to the nodes where the data was created. Seems highly scaleable but may depend on the ability to carve up the processing to work on the local data. I can see where compression, encryption, indexing and some statistical summarization can be done at the data creation site before it’s sent elsewhere. Sort of like both a sensor mesh with a processing nodes attached to the sensors configured as a sensor-proccessing grid. Only one thing concerned me, there didn’t seem to be any central repository or control to this computing environment. Probably what they intended, as the distributed solution is more adaptable and more scaleable than a centrally controlled environment.
Developing world healthcare cloud – seemed to be all about delivering healthcare to the bottom of the pyramid. They won EMC’s social innovation award and are working with a group in Rwanda to try to provide better healthcare to remote villages. It’s built around OpenMRS as a backend medical record archive hosted on EMC DC powered Iomega NAS storage and uses Google’s OpenDataKit to work with the data on mobile and laptop devices. They showed a mobile phone which could be used to create, record and retrieve healthcare information (OpenMRS records) remotely and upload it sometime later when in range of a cell tower. The solution also supports the download of a portion of the medical center’s health record database (e.g., a “cohort” slice, think a village’s healthcare records) onto a laptop, usable offline by a healthcare provider to update and record patient health changes onsite and remotely. Pulling all the technology together and delivering this as an application stack usable on mobile and laptop devices with minimal IT sophistication, storage and remote/mobile access are where the challenges lie.

Went to Sanjay’s (EMC’s CIO) keynote on EMC IT’s journey to IT-as-a-Service. As you can imagine it makes extensive use of VMware’s vSphere, vCloud, and vShield capabilities primarily in a private cloud infrastructure but they seem agnostic to a build-it or buy-it approach. EMC is about 75% virtualized today, and are starting to see significant and tangible OpEx and energy savings. They designed their North Carolina data center around the vCloud architecture and now are offering business users self service portals to provision VMs and business services…

Only caught the first section of BJ’s (President of BRS) keynote but he said recent analyst data (think IDC?) said that EMC was the overall leader (>64% market share) in purpose built backup appliances (Data Domain, Disk Library, Avamar data stores, etc.). Too bad I had to step out but he looked like he was on a roll.

Comments?

SNIA CDMI plugfest for cloud storage and cloud data services

Posted on April 20, 2011April 20, 2011 by Ray in Cloud services, Data, data access, Data integrity, data protection, Data retention, Data security, Disaster Recovery, Information economy, Systems

Plug by Samuel M. Livingston (cc) (from Flickr)

Was invited to the SNIA tech center to witness the CDMI (Cloud Data Managament Initiative) plugfest that was going on down in Colorado Springs.

It was somewhat subdued. I always imagine racks of servers, with people crawling all over them with logic analyzers, laptops and other electronic probing equipment. But alas, software plugfests are generally just a bunch of people with laptops, ethernet/wifi connections all sitting around a big conference table.

The team was working to define an errata sheet for CDMI v1.0 to be completed prior to ISO submission for official standardization.

What’s CDMI?

CDMI is an interface standard for clients talking to cloud storage servers and provides a standardized way to access all such services. With CDMI you can create a cloud storage container, define it’s attributes, and deposit and retrieve data objects within that container. Mezeo had announced support for CDMI v1.0 a couple of weeks ago at SNW in Santa Clara.

CDMI provides for attributes to be defined at the cloud storage server, container or data object level such as: standard redundancy degree (number of mirrors, RAID protection), immediate redundancy (synchronous), infrastructure redundancy (across same storage or different storage), data dispersion (physical distance between replicas), geographical constraints (where it can be stored), retention hold (how soon it can be deleted/modified), encryption, data hashing (having the server provide a hash used to validate end-to-end data integrity), latency and throughput characteristics, sanitization level (secure erasure), RPO, and RTO.

A CDMI client is free to implement compression and/or deduplication as well as other storage efficiency characteristics on top of CDMI server characteristics. Probably something I am missing here but seems pretty complete at first glance.

SNIA has defined a reference implementations of a CDMI v1.0 server [and I think client] which can be downloaded from their CDMI website. [After filling out the “information on me” page, SNIA sent me an email with the download information but I could only recognize the CDMI server in the download information not the client (although it could have been there). The CDMI v1.0 specification is freely available as well.] The reference implementation can be used to test your own CDMI clients if you wish. They are JAVA based and apparently run on Linux systems but shouldn’t be too hard to run elsewhere. (one CDMI server at the plugfest was running on a Mac laptop).

Plugfest participants

There were a number people from both big and small organizations at SNIA’s plugfest.

Mark Carlson from Oracle was there and seemed to be leading the activity. He said I was free to attend but couldn’t say anything about what was and wasn’t working. Didn’t have the heart to tell him, I couldn’t tell what was working or not from my limited time there. But everything seemed to be working just fine.

Carlson said that SNIA’s CDMI reference implementations had been downloaded 164 times with the majority of the downloads coming from China, USA, and India in that order. But he said there were people in just about every geo looking at it. He also said this was the first annual CDMI plugfest although they had CDMI v0.8 running at other shows (i.e, SNIA SDC) before.

David Slik, from NetApp’s Vancouver Technology Center was there showing off his demo CDMI Ajax client and laptop CDMI server. He was able to use the Ajax client to access all the CDMI capabilities of the cloud data object he was presenting and displayed the binary contents of an object. Then he showed me the exact same data object (file) could be easily accessed by just typing in the proper URL into any browser, it turned out the binary was a GIF file.

The other thing that Slik showed me was a display of a cloud data object which was created via a “Cron job” referencing to a satellite image website and depositing the data directly into cloud storage, entirely at the server level. Slik said that CDMI also specifies a cloud storage to cloud storage protocol which could be used to move cloud data from one cloud storage provider to another without having to retrieve the data back to the user. Such a capability would be ideal to export user data from one cloud provider and import the data to another cloud storage provider using their high speed backbone rather than having to transmit the data to and from the user’s client.

Slik was also instrumental in the SNIA XAM interface standards for archive storage. He said that CDMI is much more light weight than XAM, as there is no requirement for a runtime library whatsoever and only depends on HTTP standards as the underlying protocol. From his viewpoint CDMI is almost XAM 2.0.

Gary Mazzaferro from AlloyCloud was talking like CDMI would eventually take over not just cloud storage management but also local data management as well. He called the CDMI as a strategic standard that could potentially be implemented in OSs, hypervisors and even embedded systems to provide a standardized interface for all data management – cloud or local storage. When I asked what happens in this future with SMI-S he said they would co-exist as independent but cooperative management schemes for local storage.

Not sure how far this goes. I asked if he envisioned a bootable CDMI driver? He said yes, a BIOS CDMI driver is something that will come once CDMI is more widely adopted.

Other people I talked with at the plugfest consider CDMI as the new web file services protocol akin to NFS as the LAN file services protocol. In comparison, they see Amazon S3 as similar to CIFS (SMB1 & SMB2) in that it’s a proprietary cloud storage protocol but will also be widely adopted and available.

There were a few people from startups at the plugfest, working on various client and server implementations. Not sure they wanted to be identified nor for me to mention what they were working on. Suffice it to say the potential for CDMI is pretty hot at the moment as is cloud storage in general.

But what about cloud data consistency?

I had to ask about how the CDMI standard deals with eventual consistency – it doesn’t. The crowd chimed in, relaxed consistency is inherent in any distributed service. You really have three characteristics Consistency, Availability and Partitionability (CAP) for any distributed service. You can elect to have any two of these, but must give up the third. Sort of like the Hiesenberg uncertainty principal applied to data.

They all said that consistency is mainly a CDMI client issue outside the purview of the standard, associated with server SLAs, replication characteristics and other data attributes. As such, CDMI does not define any specification for eventual consistency.

Although, Slik said that the standard does guarantee if you modify an object and then request a copy of it from the same location during the same internet session, that it be the one you last modified. Seems like long odds in my experience. Unclear how CDMI, with relaxed consistency can ever take the place of primary storage in the data center but maybe it’s not intended to.

—–

Nonetheless, what I saw was impressive, cloud storage from multiple vendors all being accessed from the same client, using the same protocols. And if that wasn’t simple enough for you, just use your browser.

If CDMI can become popular it certainly has the potential to be the new web file system.

Comments?

When will disks become extinct?

Posted on March 8, 2011April 10, 2012 by Ray in Data retention, Market dynamics, Storage, Storage architecture, Storage density, storage economics, Storage longevity, Storage performance, Strategic Inflection Points, System effectiveness, Systems

A head assembly on a Seagate disk drive by Robert Scoble (cc) (from flickr)

Yesterday, it was announced that Hitachi General Storage Technologies (HGST) is being sold to Western Digital for $4.3B and after that there was much discussion in the tweeterverse about the end of enterprise disk as we know it. Also, last week I was at a dinner at an analyst meeting with Hitachi, where the conversation turned to when disks will no longer be available. This discussion was between Mr. Takashi Oeda of Hitachi RSD, Mr. John Webster of Evaluator group and myself.

Why SSDs will replace disks

John was of the opinion that disks would stop being economically viable in about 5 years time and will no longer be shipping in volume, mainly due to energy costs. Oeda-san said that Hitachi had predicted that NAND pricing on a $/GB basis would cross over (become less expensive than) 15Krpm disk pricing sometime around 2013. Later he said that NAND pricing had not come down as fast as projected and that it was going to take longer than anticipated. Note that Oeda-san mentioned density price cross over for only 15Krpm disk not 7200rpm disk. In all honesty, he said SATA disk would take longer, but he did not predict when

I think both arguments are flawed:

Energy costs for disk drives drop on a Watts/GB basis every time disk density increases. So the energy it takes to run a 600GB drive today will likely be able to run a 1.2TB drive tomorrow. I don’t think energy costs are going to be the main factor to drives disks out of the enterprise.
Density costs for NAND storage are certainly declining but cost/GB is not the only factor in technology adoption. Disk storage has cost more than tape capacity since the ’50s, yet they continue to coexist in the enterprise. I contend that disks will remain viable for at least the next 15-20 years over SSDs, primarily because disks have unique functional advantages which are vital to enterprise storage.

Most analysts would say I am wrong, but I disagree. I believe disks will continue to play an important role in the storage hierarchy of future enterprise data centers.

NAND/SSD flaws from an enterprise storage perspective

All costs aside, NAND based SSDs have serious disadvantages when it comes to:

Data retention – the problem with NAND data cells is that they can only be written so many times before they fail. And as NAND cells become smaller, this rate seems to be going the wrong way, i.e, today’s NAND technology can support 100K writes before failure but tomorrow’s NAND technology may only support 15K writes before failure. This is not a beneficial trend if one is going to depend on NAND technology for the storage of tomorrow.
Sequential access – although NAND SSDs perform much better than disk when it comes to random reads and less so, random writes, the performance advantage of sequential access is not that dramatic. NAND sequential access can be sped up by deploying multiple parallel channels but it starts looking like internal forms of wide striping across multiple disk drives.
Unbalanced performance – with NAND technology, reads operate quicker than writes. Sometimes 10X faster. Such unbalanced performance can make dealing with this technology more difficult and less advantageous than disk drives of today with much more balanced performance.

None of these problems will halt SSD use in the enterprise. They can all be dealt with through more complexity in the SSD or in the storage controller managing the SSDs, e.g., wear leveling to try to prolong data retention, multi-data channels for sequential access, etc. But all this additional complexity increases SSD cost, and time to market.

SSD vendors would respond with yes it’s more complex, but such complexity is a one time charge, mostly a one time delay, and once done, incremental costs are minimal. And when you come down to it, today’s disk drives are not that simple either with defect skipping, fault handling, etc.

So why won’t disk drives go away soon. I think other major concern in NAND/SSD ascendancy is the fact that the bulk NAND market is moving away from SLC (single level cell or bit/cell) NAND to MLC (multi-level cell) NAND due to it’s cost advantage. When SLC NAND is no longer the main technology being manufactured, it’s price will not drop as fast and it’s availability will become more limited.

Some vendors also counter this trend by incorporating MLC technology into enterprise SSDs. However, all the problems discussed earlier become an order of magnitude more severe with MLC NAND. For example, rather than 100K write operations to failure with SLC NAND today, it’s more like 10K write operations to failure on current MLC NAND. The fact that you get 2 to 3 times more storage per cell with MLC doesn’t help that much when one gets 10X less writes per cell. And the next generation of MLC is 10X worse, maybe getting on the order of 1000 writes/cell prior to failure. Similar issues occur for write performance, MLC writes are much slower than SLC writes.

So yes, raw NAND may become cheaper than 15Krpm Disks on a $/GB basis someday but the complexity to deal with such technology is also going up at an alarming rate.

Why disks will persist

Now something similar can be said for disk density, what with the transition to thermally assisted recording heads/media and the rise of bit-patterned media. All of which are making disk drives more complex with each generation that comes out. So what allows disks to persist long after $/GB is cheaper for NAND than disk:

Current infrastructure supports disk technology well in enterprise storage. Disks have been around so long, that storage controllers and server applications have all been designed around them. This legacy provides an advantage that will be difficult and time consuming to overcome. All this will delay NAND/SSD adoption in the enterprise for some time, at least until this infrastructural bias towards disk is neutralized.
Disk technology is not standing still. It’s essentially a race to see who will win the next generations storage. There is enough of an eco-system around disk that will keep pushing media, heads and mechanisms ever forward into higher densities, better throughput, and more economical storage.

However, any infrastructural advantage can be overcome in time. What will make this go away even quicker is the existance of a significant advantage over current disk technology in one or more dimensions. Cheaper and faster storage can make this a reality.

Moreover, as for the ecosystem discussion, arguably the NAND ecosystem is even larger than disk. I don’t have the figures but if one includes SSD drive producers as well as NAND semiconductor manufacturers the amount of capital investment in R&D is at least the size of disk technology if not orders of magnitude larger.

Disks will go extinct someday

So will disks become extinct, yes someday undoubtedly, but when is harder to nail down. Earlier in my career there was talk of super-paramagnetic effect that would limit how much data could be stored on a disk. Advances in heads and media moved that limit out of the way. However, there will come a time where it becomes impossible (or more likely too expensive) to increase magnetic recording density.

I was at a meeting a few years back where a magnetic head researcher predicted that such an end point to disk density increase would come in 25 years time for disk and 30 years for tape. When this occurs disk density increase will stand still and then it’s a certainty that some other technology will take over. Because as we all know data storage requirements will never stop increasing.

I think the other major unknown is other, non-NAND semiconductor storage technologies still under research. They have the potential for unlimited data retention, balanced performance and sequential performance orders of magnitude faster than disk and can become a much more functional equivalent of disk storage. Such technologies are not commercially available today in sufficient densities and cost to even threaten NAND let alone disk devices.

—-

So when do disks go extinct. I would say in 15 to 20 years time we may see the last disks in enterprise storage. That would give disks an almost an 80 year dominance over storage technology.

But in any event I don’t see disks going away anytime soon in enterprise storage.

Comments?

Personal medical record archive

Posted on February 3, 2011 by Ray in Cloud services, Data, Data availability, Data retention, Data security, Information economy, Strategy

MRI of my brain after surgery for Oligodendroglioma tumor by L_Family (cc) (From Flickr)

I was reading a book the other day and it suggested that sometime in the near future we will all have a personal medical record archive. Such an archive would be a formal record of every visit to a healthcare provider, with every x-ray, MRI, CatScan, doctor’s note, blood analysis, etc. that’s ever done to a person.

Such data would be our personal record of our life’s medical history usable by any future medical provider and accessible by us.

Who owns medical records?

Healthcare is unusual. For any other discipline like accounting, you provide information to the discipline expert and you get all the information you could possibly want back, to store, send to the IRS or or whatever, to do with it as you want. If you decide to pitch it, you can pretty much request a copy (at your cost) of anything for a certain number of years after the information was created.

But, in medicine, X-rays are owned and kept by the medical provider, same with MRIs, CT scans, etc. and you hardly ever get a copy. Occasionally, if the physician deems it useful for explicative reasons, you might get a grainy copy of an X-ray that shows a break or something but other than that and possible therapeutic instructions, typically nothing.

Getting Doctor’s notes is another question entirely. It’s mostly text records in some sort of database somewhere online to the medical unit. But, mainly what we get as patients, is a verbal diagnosis to take in and mull over.

Personal experience with medical records

I worked for an enlightened company a while back that had their own onsite medical practice providing all sorts of healthcare to their employees. Over time, new management decided this service was not profitable and terminated it. As they were winding down the operation, they offered to send patient medical information to any new healthcare provider or to us. Not having a new provider, I asked they send them to me.

A couple of weeks later, a big brown manilla envelope was delivered. Inside was a rather large, multy-page printout of notes taken by every medical provider I had visited throughout my tenure with this facility. What was missing from this assemblage was lab reports, x-rays and other ancillary data that was taken in conjunction with those office visits. I must say the notes were comprehensive and somewhat laden with medical terminology but they were all there to see.

Printouts were not very useful to me and probably wouldn’t be to any follow-on medical group caring for me. However the lack of x-rays, blood work, etc. might be a serious deficiency for any follow-on treatment. But, as far as I was concerned it was the first time any medical entity even offered me any information like this.

Making personal medical records useable, complete, and retrievable

To take this to the next level, and provide something useful for patients and follow-on healthcare, we need some sort of standardization of medical records across the healthcare industry. This doesn’t seem that hard, given where we are today and need not be that difficult. Standards for most medical data already exist, specifically,

DICOM or Digital Imaging and Communications in Medicine – is a standard file format used to digitally record X-Rays, MRIs, CT scans and more. Most digital medical imaging technology (except for ultrasound) out there today optionally records information in DICOM format. There just so happens to be an open source DICOM viewer that anyone can use to view these sorts of files if one is interested.
Ultrasound imaging – is typically rendered and viewed as a sort of movie and is often used for soft tissue imaging and prenatal care. I don’t know for sure but cannot find any standard like DICOM for ultrasound images. However, if they are truly movies, perhaps HD movie files would suffice for a standard ultrasound imaging file.
Audiograms, blood chemistry analysis, etc. – is provided by many technicians or labs and could all be easily represented as PDFs, scanned images, JPEG/MPEG recordings, etc. Doctors or healthcare providers often discuss salient items off these reports that are of specific interest to the patients condition. Such affiliated notes could all be in an associated text file or even a recording made of the doctor discussing the results of the analysis that somehow references the other artifact (“Blood chemistry analysis done on 2/14/2007 indicates …”).
Other doctor/healthcare provider notes – I find that everytime I visit a healthcare provider these days, they either take copious notes using WIFI connected laptops, record verbal notes to some voice recorder later transcribed into notes, or some combination of these. Any of such information could be provided in standard RTF (text files) or MPEG recordings and viewed as is.

How patients can access medical data

Most voice recordings or text notes could easily be emailed to the patient. As for DICOM images, ultrasound movies, etc., they could all be readily provided on DVDs or other removable media sent to the patient.

Another and possibly better alternative, is to have all this data uploaded to a healthcare provider’s designated URL, stored in a medical record cloud someplace, allowing patient access for viewing, downloading and/or copying. I envision something akin to a photo sharing site, upload-able by any healthcare provider but accessible for downloads by any authorized user/patient.

Medical information security

Any patient data stored in such a medical record cloud would need to be secured and possibly encrypted by a healthcare provider supplied pass code which could be used for downloading/decrypting by the patient. There are plenty of open source cryptographic algorithms which would suffice to encrypt this data (see GNU Privacy Guard for instance).

As for access passwords, possible some form of public key cryptography would suffice but it need not be that sophisticated. I prefer to use open source tools for these security mechanisms as then it would be readily available to the patient or any follow-on medical provider to access and decrypt the data.

Medical information retention period

The patient would have a certain amount of time to download these files. I lean towards months just to insure it’s done in a timely fashion but maybe it should be longer, something on the order of 7-years after a patients last visit might work. This would allow the patient sufficient time to retrieve the data and to supply it to any follow-on medical provider or stored it in their own, personal medical record archive. There are plenty of cloud storage providers I know, that would be willing to store such data at a fair, but high price, for any period of time desired.

Medical information access credentials

All the patient would need is an email and/or possible a letter that provides the accessing URL, access password and encryption passcode information for the files. Possibly such information could be provided in plaintext, appended to any bill that is cut for the visit which is sure to find its way to the patient or some financially responsible guardian/parent.

How do we get there

Bootstrapping this personal medical record archive shouldn’t be that hard. As I understand it, Electronic Medical Record (EMR) legislation in the US and elsewhere has provisions stating that any patient has a legal right to copies of any medical record that a healthcare provider has for them. If this is true, all we need do then is to institute some additional legislation that requires the healthcare provider to make those records available in a standard format, in a publicly accessible place, access controlled/encrypted via a password/passcode, downloadable by the patient and to provide the access credentials to the patient in a standard form. Once that is done, we have all the pieces needed to create the personal medical record archive I envision here.

—-

While such legislation may take some time, one thing we could all do now, at least in the US, is to request access to all medical records/information that is legally ours already. Once all the healthcare providers, start getting inundated with requests for this data, they might figure having some easy, standardized way to provide it would make sense. Then the healthcare organizations could get together and work to finalize a better solution/legislation needed to provide this in some standard way. I would think university hospitals could lead this endeavor and show us how it could be done.

Am I missing anything here?

EMC Data Domain products enter the archive market

Posted on January 18, 2011January 24, 2011 by Ray in Data, data protection, Data retention, Storage Backup, Storage longevity

(c) 2011 Silverton Consulting, Inc., All Rights Reserved

In another assault on the tape market, EMC announced today a new Data Domain ~~860~~ Archiver ~~appliance~~. This new system supports both short-term and long-term retention of backup data. This attacks one of the last bastions of significant tape use – long-term data archives.

Historically, a cheap version of archives had been the long-term retention of full backup tapes. As such, if one needed to keep data around for 5 years, one would keep all their full backup tape sets offsite, in a vault somewhere for 5 years. They could then rotate the tapes (bring them back into scratch use) after the 5 years elapsed. One problem with this – tape technology is advancing to a new generation of technology more like every 2-3 years and as such, a 5-year old tape cartridge would be at least one generation back before it could be re-used. But current tape technology always reads 2 generations and writes at least one generation back so this use would still be feasible. I would say that many tape users did something like this to create a “~~psuedo~~pseudo-archive”.

On the other hand, there exists many specific archive point products that focused on one or a few application arenas such as email, records, or database archives which would extract specific data items and place them into archive. These did not generally apply outside one or a few application domains but were used to support stringent compliance requirements. The advantage of these application based archive systems is that the data was actually removed from primary storage, out of any data protection activities and placed permanently in only “archive storage”. Such data would be subject to strict retention policies and as such, would be inviolate (couldn’t be modified) and could not be deleted until formally expired.

Enter the Data Domain 860 Archiver, this system supports up to 24 disk shelves, each one of which could either be dedicated to short- or long-term data retention. Backup file data is moved within the appliance by automated policy from short- to long-term storage. Up to 4-disk shelves can be dedicated to short-term storage with the remainder considered long-term archive units.

When a long-term archive unit (disk shelf) fills up with backup data it is “sealed”, i.e., it is given all the metadata required to reconstruct its file system and deduplication domain and thus, would not require the use of other disk shelves to access its data. In this way one creates a standalone unit that contains everything needed to recover the data. Not unlike a full backup tape set which can be used in a standalone fashion to restore data.

Today, the Data Domain 860 Archiver only supports file access and DD boost data access. By doing so, the backup software is responsible for deleting data that has expired. Such data will then be ~~absent~~ deleted from any backups taken and as policy automation copies the backups to long-term archive units it will be ~~missing~~ gone from there as well.

While Data Domain’s Archiver lacks removing the data from ongoing backup streams that application based archive products can achieve, it does look exactly like what could be achieved from tape based archives today.

One can also replicate base Data Domain or Archiver appliances to an Archiver unit to achieve offsite data archives.

—-

Full disclosure: I currently work with EMC on projects specific to other products but am not currently working on anything associated with this product.

Tape, your move…