MCS, UltraDIMMs and memory IO, the new path ahead – part 1

IMG_2338I was  at Storage Field Day 5 (SFD5) last month and got a chance to talk with SanDisk and Diablo Technologies. It turns out that SanDisk’s UltraDIMM product is based on Diablo Technologies MCS hardware.  So the two of them provided a pretty deep dive into the technology and where they want to go with it. Before we go any deeper the UltraDIMMs will be released to the field by IBM under the eXFlash name.

Diablo Technologies

The team at Diablo have been focusing on the x86 standard memory channel for a while now and lately have been trying out different sorts of technologies to connect as CPU memory. The first Memory Channel Storage (MCS) product converts Memory Channel IO to SATA IO. This allows any SATA device to be attached as memory and enjoy lightening fast, memory access times. Access times are clocked at 7µsec. Most PCIe Flash cards have an access latency at 50µsec or more, so this is 7X faster that PCIe Flash.  They also claim the MCS is capable of 20GB/sec. I know enterprise class storage systems that can’t do that. Also, the MCS utilizes 2 memory channels.

Diablo delivers a chip (that converts MemIO to SATA IO) and software that provides a block IO access to the MCS device. Customers of MCS supply their own SATA flash storage device and presumably package it all together in a DIMM compatible card.

But the main problem is that the whole MCS chip and SATA IO flash device has to fit in the form factor of a DIMM. And cannot draw any more power than a memory device can draw, ~10-15W with its corresponding thermal load.

But this seems plenty for a small flash drive.  The MCS is configured as a 4GB DDR3 DIMM.  There is a requirement to patch the BIOS so that it doesn’t run diagnostic memory tests on the MCS device and their software needs to be loaded to access the device as a block device. I believe they currently support Linux O/S with more O/Ss on the way.

Diablo has looked at other applications for their technology including providing an Memory IO accessed Ethernet NIC was mentioned. But it seems flash storage would be a great first application of their technology.  Not clear to me but SAS would also be something that could be done.

Whatever happens after NAND with the next generation semiconductor storage (see my The end of NAND is near post, it seems to me that accessing it as Memory IO would make an awful lot of sense. makes a lot of sense.  Using MCS as the access channel would seem to be a logical next step.

Part 1 of this story is on Diablo Technologies, Part 2 will be on SanDisk and I am not sure but maybe there will be a Part 3 on IBM eXFlash. So stay tuned.

Comments?

Releasing social and mobile data as a public good

I have been reading a book recently, called Uncharted: Big data as a lens on human culture by Erez Aiden and Jean-Baptiste Michel that discusses the use of Google’s Ngram search engine which counts phrases (Ngrams) used in all the books they have digitized. Ngram phrases are charted against other Ngrams and plotted in real time.

It’s an interesting concept and one example they use is “United States are” vs. “United States is” a 3-Ngram which shows that the singular version of the phrase which has often been attributed to emerge immediately after the Civil War actually was in use prior to the Civil War and really didn’t take off until 1880’s, 15 years after the end of the Civil War.

I haven’t finished the book yet but it got me to thinking. The authors petitioned Google to gain access to the Ngram data which led to their original research. But then they convinced Google after their original research time was up to release the information to the general public. Great for them but it’s only a one time event and happened to work this time with luck and persistance.

The world needs more data

But there’s plenty of other information or data out there where we could use to learn an awful lot about human social interaction and other attributes about the world that are buried away in corporate databases. Yes, sometimes this information is made public (like Google), or made available for specific research (see my post on using mobile phone data to understand people mobility in an urban environment) but these are special situations. Once the research is over, the data is typically no longer available to the general public and getting future or past data outside the research boundaries requires yet another research proposal.

And yet books and magazines are universally available for a fair price to anyone and are available in most research libraries as a general public good for free.  Why should electronic data be any different?

Social and mobile dta as a public good

What I would propose is that the Library of Congress and other research libraries around the world have access to all corporate data that documents interaction between humans, humans and the environment, humanity and society, etc.  This data would be freely available to anyone with library access and could be used to provide information for research activities that have yet to be envisioned.

Hopefully all of this data would be released, free of charge (or for some nominal fee) to these institutions after some period of time has elapsed. For example, if we were talking about Twitter feeds, Facebook feeds, Instagram feeds, etc. the data would be provided from say 7 years back on a reoccurring yearly or quarterly basis. Not sure if the delay time should be 7, 10 or 15 years, but after some judicious period of time, the data would be released and made publicly available.

There are a number of other considerations:

  • Anonymity – somehow any information about a person’s identity, actual location, or other potentially identifying characteristics would need to be removed from all the data.  I realize this may reduce the value of the data to future researchers but it must be done. I also realize that this may not be an easy thing to accomplish and that is why the data could potentially be sold for a fair price to research libraries. Perhaps after 35 to 100 years or so the identifying information could be re-incorporated into the original data set but I think this highly unlikely.
  • Accessibility – somehow the data would need to have an easily accessible and understandable description that would enable any researcher to understand the underlying format of the data. This description should probably be in XML format or some other universal description language. At a minimum this would need to include meta-data descriptions of the structure of the data, with all the tables, rows and fields completely described. This could be in SQL format or just XML but needs to be made available. Also the data release itself would then need to be available in a database or in flat file formats that could be uploaded by the research libraries and then accessed by researchers. I would expect that this would use some sort of open source database/file service tools such as MySQL or other database engines. These database’s represent the counterpart to book shelves in today’s libraries and has to be universally accessible and forever available.
  • Identifyability – somehow the data releases would need to be universally identifiable, not unlike the ISBN scheme currently in use for books and magazines and ISRC scheme used for recordings. This would allow researchers to uniquely refer to any data set that is used to underpin their research. This would also allow the world’s research libraries to insure that they purchase and maintain all the data that becomes available by using some sort of master worldwide catalog that would hold pointers to all this data that is currently being held in research institutions. Such a catalog entry would represent additional meta-data for the data release and would represent a counterpart to a online library card catalog.
  • Legality – somehow any data release would need to respect any local Data Privacy and Protection laws of the country where the data resides. This could potentially limit the data that is generated in one country, say Germany to be held in that country only. I would think this could be easily accomplished as long as that country would be willing to host all its data in its research institutions.

I am probably forgetting a dozen more considerations but this covers most of it.

How to get companies to release their data

One that quickly comes to mind is how to compel companies to release their data in a timely fashion. I believe that data such as this is inherently valuable to a company but that its corporate value starts to diminish over time and after some time goes to 0.

However, the value to the world of such data encounters an inverse curve. That is, the longer away we are from a specific time period when that data was created, the more value it has for future research endeavors. Just consider what current researchers do with letters, books and magazine articles from the past when they are researching a specific time period in history.

But we need to act now. We are already over 7 years into the Facebook era and mobile phones have been around for decades now. We have probably already lost much of the mobile phone tracking information from the 80’s, 90’s, 00’s and may already be losing the data from the early ’10’s. Some social networks have already risen and gone into a long eclipse where historical data is probably their lowest concern. There is nothing that compels organizations to keep this data around, today.

Types of data to release

Obviously, any social networking data, mobile phone data, or email/chat/texting data should all be available to the world after 7 or more years.  Also the private photo libraries, video feeds, audio recordings, etc. should also be released if not already readily available. Less clear to me are utility data, such as smart power meter readings, water consumption readings, traffic tollway activity, etc.

I would say that one standard to use might be if there is any current research activity based on private, corporate data, then that data should ultimately become available to the world. The downside to this is that companies may be more reluctant to grant such research if this is a criteria to release data.

But maybe the researchers themselves should be able to submit requests for data releases and that way it wouldn’t matter if the companies declined or not.

There is no way, anyone could possibly identify all the data that future researchers would need. So I would err on the side to be more inclusive rather than less inclusive in identifying classes of data to be released.

The dawn of Psychohistory

The Uncharted book above seems to me to represent a first step to realizing a science of Psychohistory as envisioned in Asimov’s Foundation Trilogy. It’s unclear whether this will ever be a true, quantified scientific endeavor but with appropriate data releases, readily available for research, perhaps someday in the future we can help create the science of Psychohistory. In the mean time, through the use of judicious, periodic data releases and appropriate research, we can certainly better understand how the world works and just maybe, improve its internal workings for everyone on the planet.

Comments?

Picture Credit(s): Amazon and Wikipedia 

Veeam’s upcoming V8 virtues

[Not] Vamoosing VMworld

We were at Storage Field Day 5 (SFD5, see the videos here) last month and had a briefing on Veeam’s upcoming V8 release.

They also told us (news to me) that they are leaving VMworld[I sit corrected, I have been informed after this went to press that Veeam is not leaving VMworld2014, and never said anything about it at the session – My mistake and I take full responsibility, sorry for any confusion] (sigh, now who’s going to have THE after conference, KILLER PARTY at VMworld) and moving to [but they did say they are definitely starting up] their own VeeamON conference at The Cosmopolitan in Las Vegas on October 6,7 & 8 this year. If their VMworld parties are any indication, the conference in the Cosmo should be a fun and rewarding time for all. Pre-registration is open and they have a call out for papers.

Doug Hazelman (@VMDoug), Rick Vanover (@RickVanover) and Luca Dell’Oca (@dellock6) all presented although Luca’s session was under strict NDA to be revealed later. I think sometime later this summer.

Doug mentioned that after 6 years, Veeam now has over 100,000 customers world wide.  One of their more popular, early innovations was the ability to run a VM directly off of a backup and sometime over the past couple of years they have moved from a VMware only backup & replication solution to also supporting Microsoft Hyper-V (more news to me).

V8’s virtues

Veeam V8 will add some interesting capabilities to the Veeam product solutions:

  • (VMware only) Built-in backups from storage snapshots – (Enterprise Plus edition only) Backup from VMware snapshots can sometimes impact app performance, especially when it comes time to commit changes. But with V7, Veeam now offers backup utilizing VMware’s Change Block Tracking (CBT)and taking backups from storage snapshots directly for HP 3PAR StoreServ, HP (Lefthand) StoreVirtual/StoreVirtual VSA and in soon to be available V8, NetApp FAS (Data ONTAP 8.1 or above, 7- or cluster-mode, clones too) storage systems. First Veeam does its application level processing (under Windows Server does VSS operations), after that completes tells VMware to take (a VMware) snapshot, when that completes they tell the storage to take a (storage) snapshot, when that completes they release the VMware snapshot. What all this does is allows them to utilize VMware CBT as well as storage snapshots which makes it up to 20 times faster than normal VMware snapshot backups. This way they can backup directly from the storage snapshot using the Veeam proxy. Also because the VMware snapshot is so short lived there is little overhead for committing any changes.  Also there is no need to use a proxy ESX server to do this, i.e., promote the VMware snapshot to a LUN, add it to an ESX, resignature, add the VM, and do all the backups, which, of course destroys CBT. This works for FC, iSCSI and NFS data stores. With NetApp storage you can also take the (VSS) application consistent snapshot and copy it to SnapVault.
  • Veeam Explorer (recovery) for storage snapshots – (Free backup edition) Recovery from (HP in V7 & NetApp in V8) storage snapshots is yet another feature and provides item (e.g., emails, contacts, email folders for Exchange), granular (VM level or file level) or full (volume) recovery from storage based snapshots, regardless of how those storage snapshots were created.
  • Veeam Explorer for SQL Server (V8 only) – (unsure what license is required) Similar to the Explorer for snapshots discussed above, this would allow a Veeam admin to do item level recovery for an SQL database. This also includes recovery from Veeam Backup repositories as well as storage snapshots. But this means that you could restore a ROW of an SQL table, an SQL TABLE as well as a whole SQL database. Now DBAs always had these sorts of abilities which required using Log services. But allowing a Veeam admin to do these sorts of activities seems like putting a gun in the hands of a child (or maybe a bazooka in the hands of an untrained civilian).
  • Veeam Explorer for Active Directory (V8 only) – (unsure what license is required) You’ve seen whats’ available above and just consider these same capabilities only applied to active directory. This means you can restore a password hash, user, group or organizational unit (OU). I don’t know about you but this seems more akin to a howitzer in the hands of a civilian.

They showed an example of competitive situation where running V8 (in beta?) with NetApp backups using snapshots versus some unnamed competition. They were able to complete a full backup in 1/4 the time of their competition (2hrs. vs. 8hrs.) and completed incremental backups in 35min. vs. 2hrs. for the competition.

“Thar be dragons there …”

Ok, maybe I am a little more paranoid than the average IT guy/gal. But in my (old world, greybeards) view, SQL databases belong in the realm of DBAs and Active Directory databases belong to domain controller admins. Messing around with production versions of SQL DBs or AD DBs seems hazardous to a data centers health. We’re not just talking files anymore here guys.

In Veeam’s defense, these new Explorer recovery tools are only probably going to be used to do something that needs to be done right away, to get things back operating again, and would not be used unless there’s a real need/emergency to do so. Otherwise let the DBA and security admins do it with their log recovery tools.  And another thing, they have had similar capabilities for Exchange emails, folders, contacts, etc. and no ones shot their foot off yet so why the concern.

Nonetheless, I feel strongly that these tools ought to be placed under lock and key and the key put in a safe with the combination under a glass case labeled IN CASE OF EMERGENCY, BREAK GLASS.

Comments.

EMCworld2014 day 1 – EMC acquires DSSD

What does 100TB of flash need with a new ASIC? And why would you implement a realtime analytics data engine using object storage interface on Flash?

It seems the new company purchased by EMC called DSSD is up to it’s eyebrows in ASIC design to implement a lightening fast object store to deal with the needs of real time analytics. Somewhere today there was a slide on the overheads of standard 25usec of OS overhead for the standard Posix file stack and then there is the typical 300 usec SSD overhead to perform an I/O. But as we learned a couple of weeks ago at SFD5 with Diablo-Technologies MCS and SanDisk new UltraDIMM they have reduced the SSD overhead by using memory channels to 5 µsec and now that OS overhead is 5X the overhead of the storage itself.

So what’s one to do. In the Case of Diablo-Tech’s MCS they converted by software from DAS IO to memory channel IO and via ASIC Memory channel IO back to SATA disk IO.

Not sure what DSSD does but if I was to design a new ASIC for memory channel I would want something that talks maybe  memory interface and scales it out to 100TBs of flash. At the software layer maybe we could talk object storage interfaces to the applications.

Going to learn more throughout the day…

[Learned on day 2 it’s more likely shared PCIe SSD storage.  Ok it’s not 5 µsec latency storage but still faster than networked storage.]