The data is the hybrid cloud

CRKtHnqVEAABeviI have been at NetApp Insight2015 conference the past two days and have been struck with one common theme. They have been talking since the get-go about the Data Fabric and how Clustered Data ONTAP (cDOT) is the foundation to the NetApp Data Fabric which spans on premises, private cloud, off premises public cloud and everything in between.

But the truth of the matter is that it’s data that real needs to span all these domains. Hybrid cloud really needs to have data movement everywhere. NetApp cDOT is just the enabler that helps move the data around much easier.

NetApp cDOT data services

From a cDOT perspective, NetApp has available today:

  • Cloud ONTAP – a software defined ONTAP storage service executing in the cloud, operating on cloud server provider hardware using DAS storage and providing ONTAP data services for your private cloud resident data.
  • ONTAP Edge – similar to Cloud ONTAP, but operating on premises with customer commodity server & DAS hardware and providing ONTAP data services.
  • NetApp Private Storage (NPS) – NetApp storage systems operating in a “near cloud” environment that is directly connected to cloud service providers that provides NetApp storage services with low latency/high IOPs storage to cloud compute applications.
  • NetApp cDOT on premises storage hardware – NetApp storage hardware with All Flash FAS as well as normal disk-only and hybrid FAS storage hardware supplying ONTAP data services to on premises applications.

NetApp Data Fabric

NetApp’s Data Fabric is built on top of ONTAP data services and allows a customer to use any of the above storage instances to host their private data. Which is great in and of itself, but when you realize that a customer can also move their data from anyone of these ONTAP storage instances to any other storage instance that’s when you see the power of the Data Fabric.

The Data Fabric depends mostly on storage efficient ONTAP SnapMirror data replication and ONTAP data cloning capabilities. These services can be used to replicate ONTAP data (LUNs/volumes) from one cDOT storage instance to another and then use ONTAP data cloning services to create accessible copies of this data at the new location. This could be on premises to near cloud, to public cloud or back again, all within the confines of ONTAP data services.

Data Fabric in action

Now I like the concept but they also showed an impressive demo of using cDOT and AltaVault (NetApp’s solution acquired last year from Riverbed, their SteelStor backup appliance) to perform an application consistent backup of a SQL database. But once they had this it went a little crazy.

They SnapMirrored this data from the on premises storage to a near cloud, NPS storage instance, then cloned the data from the mirrors and after that fired up applications running in Azure to process the data. Then they shut down the Azure application and fired up a similar application in AWS using the exact same NPS hosted data. Of course they then SnapMirrored the same backup data (I think from the original on premises storage) to Cloud ONTAP, just to show it could be done there as well.

Ok I get it, you can replicate (mirror) data from any cDOT storage instance (whether on premises or remote site or near cloud NPS or in the cloud using Cloud ONTAP or …). Once there you can clone this data and use it with applications running in any environment running with access to this data instance (such as AWS, Azure and cloud service providers).

And I like the fact that all this can be accomplished in NetApp’s Snap Center software. And I especially like the fact that the clones don’t take up any extra space and the replicant mirroring is done in a quick, space efficient (read deduped) manner

But, having to setup a replication or mirror association between cDOT on premises and cDOT at NPS or Cloud ONTAP and then having to clone the volumes at the target side seems superflous. What I really want to do is just copy or move the data and have it be at the target site without the mirror association in the middle. It’s almost like what I want is CLONE that operates across cDOT storage instances wherever they reside.

Well I’m an analyst and don’t have to implement any of this (thank god). But what NetApp seems to have done is to use their current tools and ONTAP data service capabilities to allow customer data to move anywhere it needs to be, in  customer controlled, space efficient, private and secure manner. Once hosted at the new site, applications have access to this data and customers still have all the ONTAP data services they had on premises but in cloud and near cloud locations.

Seems pretty impressive to me for all of a customer’s ONTAP data. But when you combine the Data Fabric with Foreign LUN Import (importing non-NetApp data into ONTAP storage) and FlexArray (storage virtualization under ONTAP) you can see how all the Data Fabric can apply to non-NetApp storage instances as well and then it becomes really interesting.

~~~~

There was a company that once said that “The Network is the Computer” but today, I think a better tag line is “The Data is the Hybrid Cloud”.

Comments?

 

Nanterro emerges from stealth with CNT based NRAM

512px-Types_of_Carbon_NanotubesNanterro just came out of stealth this week and bagged $31.5M in a Series E funding round. Apparently, Nanterro has been developing a new form of non-volatile RAM (NRAM), based on Carbon Nanotubes (CNT), which seems to work like an old T-bar switch, only in the NM sphere and using CNT for the wiring.

They were founded in 2001, and are finally  ready to emerge from stealth. Nanterro already has 175+ issued patents, with another 200 patents pending. The NRAM is currently in production at 7 CMOS fabs already and they are sampling 4Mb NRAM chips  to a number of customers.

NRAM vs. NAND

Performance of the NRAM is on a par with DRAM (~100 times faster than NAND), can be configured in 3D and supports MLC (multi-bits per cell) configurations.  NRAM also supports orders of magnitude more (assume they mean writes) accesses and stores data much longer than NAND.

The only question is the capacity, with shipping NAND on the order of 200Gb, NRAM is  about 2**14X behind NAND. Nanterre claims that their CNT-NRAM CMOS process can be scaled down to <5nm. Which is one or two generations below the current NAND scale factor and assuming they can pack as many bits in the same area, should be able to compete well with NAND.They claim that their NRAM technology is capable of Terabit capacities (assumed to be at the 5nm node).

The other nice thing is that Nanterro says the new NRAM uses less power than DRAM, which means that in addition to attaining higher capacities, DRAM like access times, it will also reduce power consumption.

It seems a natural for mobile applications. The press release claims it was already tested in space and there are customers looking at the technology for automobiles. The company claims the total addressable market is ~$170B USD. Which probably includes DRAM and NAND together.

CNT in CMOS chips?

Key to Nanterro’s technology was incorporating the use of CNT in CMOS processes, so that chips can be manufactured on current fab lines. It’s probably just the start of the use of CNT in electronic chips but it’s one that could potentially pay for the technology development many times over. CNT has a number of characteristics which would be beneficial to other electronic circuitry beyond NRAM.

How quickly they can ramp the capacity up from 4Mb seems to be a significant factor. Which is no doubt, why they went out for Series E funding.

So we have another new non-volatile memory technology.On the other hand, these guys seem to be a long ways away from the lab, with something that works today and the potential to go all the way down to 5nm.

It should interesting as the other NV technologies start to emerge to see which one generates sufficient market traction to succeed in the long run. Especially as NAND doesn’t seem to be slowing down much.

Comments?

Picture Credits: Wikimedia.com

Transporter, a private Dropbox in a tower

Move over DropboxBox and all you synch&share wannabees, there’s a new synch and share in town.

At SFD7 last month, we were visiting with Connected Data where CEO, Geoff Barrell was telling us all about what was wrong with today’s cloud storage solutions. In front of all the participants was this strange, blue glowing device. As it turns out, Connected Data’s main product is the File Transporter, which is a private file synch and share solution.

All the participants were given a new, 1TB Transporter system to take home. It was an interesting sight to see a dozen of these Transporter towers sitting in front of all the bloggers.

I was quickly, established a new account, installed the software, and activated the client service. I must admit, I took it upon myself to “claim” just about all of the Transporter towers as the other bloggers were still paying attention to the presentation.  Sigh, they later made me give back (unclaim) all but mine, but for a minute there I had about 10TB of synch and share space at my disposal.

Transporters rule

transporterB2So what is it. The Transporter is both a device and an Internet service, where you own the storage and networking hardware.

The home-office version comes as a 1 or 2TB 2.5” hard drive, in a tower configuration that plugs into a base module. The base module runs a secured version of Linux and their synch and share control software.

As tower power on, it connects to the Internet and invokes the Transporter control service. This service identifies the node, who owns it, and provides access to the storage on the Transporter to all desktops, laptops, and mobile applications that have access to it.

At initiation of the client service on a desktop/laptop it creates (by default) a new Transporter directory (folder). Files that are placed in this directory are automatically synched to the Transporter tower and then synchronized to any and all online client devices that have claimed the tower.

Apparently you can have multiple towers that are claimed to the same account. I personally tested up to 10 ;/ and it didn’t appear as if there was any substantive limit beyond that but I’m sure there’s some maximum count somewhere.

A couple of nice things about the tower. It’s your’s so you can move it to any location you want. That means, you could take it with you to your hotel or other remote offices and have a local synch point.

Also, initial synchronization can take place over your local network so it can occur as fast as your LAN can handle it. I remember the first time I up-synched 40GB to DropBox, it seemed to take weeks to complete and then took less time to down-synch for my laptop but still days of time. With the tower on my local network, I can synch my data much faster and then take the tower with me to my other office location and have a local synch datastore. (I may have to start taking mine to conferences. Howard (@deepstorage.net, co-host on our  GreyBeards on Storage podcast) had his operating in all the subsequent SFD7 sessions.

The Transporter also allows sharing of data. Steve immediately started sharing all the presentations on his Transporter service so the bloggers could access the data in real time.

They call the Transporter a private cloud but in my view, it’s more a private synch and share service.

Transporter heritage

The Transporter people were all familiar to the SFD crowd as they were formerly with  Drobo which was at a previous SFD sessions (see SFD1). And like Drobo, you can install any 2.5″ disk drive in your Transporter and it will work.

There’s workgroup and business class versions of the Transporter storage system. The workgroup versions are desktop configurations (looks very much like a Drobo box) that support up to 8TB or 12TB supporting 15 or 30 users respectively.  The also have two business class, rack mounted appliances that have up to 12TB or 24TB each and support 75 or 150 users each. The business class solution has onboard SSDs for meta-data acceleration. Similar to the Transporter tower, the workgroup and business class appliances are bring your own disk drives.

Connected Data’s presentation

transporterA1Geoff’s discussion (see SFD7 video) was a tour of the cloud storage business model. His view was that most of these companies are losing money. In fact, even Amazon S3/Glacier appears to be bleeding money, although this may not stop Amazon. Of course, DropBox and other synch and share services all depend on cloud storage for their datastores. So, the lack of a viable, profitable business model threatens all of these services in the long run.

But the business model is different when a customer owns the storage. Here the customer owns the actual storage cost. The only thing that Connected Data provides is the client software and the internet service that runs it. Pricing for the 1TB and 2TB transporters with disk drives are $150 and $240.

Having a Transporter

One thing I don’t like is the lack of data-at-rest encryption. They use TLS for data transfers across your LAN and the Internet. But the nice thing about having possession of the actual storage is that you can move it around. But the downside is that you may move it to less secure environments (like conference hotel rooms). And as with the any disk storage, someone can come up to the device and steel the disk. Whether the data would be easily recognizable is another question but having it be encrypted would put that question to rest. There’s some indication on the Transporter support site that encryption may be coming for the business class solution. But nothing was said about the Transporter tower.

On the Mac, the Transporter folder has the shared folders as direct links (real sub-folders) but the local data is under a Transporter Library soft link. It turns out to be a hidden file (“.Transporter Library”) under the Transporter folder. When you Control click on this file your are given the option to view deleted files. You can also do this with shared files as well.

One problem with synch and share services is once someone in your collaboration group deletes some shared files they are gone (over time) from all other group users. Even if some of them wanted them. Transporter makes it a bit easier to view these files and save them elsewhere. But I assume at some point they have to be purged to free up space.

When I first installed the Transporter, it showed up as a network node on my finder shared servers. But the latest desktop version (3.1.17) has removed this.

Also some of the bloggers complained about files seeing files “in flux” or duplicates of the shared files but with unusual file suffixes appended to them, such as ” filename124224_f367b3b1-63fa-4d29-8d7b-a534e0323389.jpg”. Enrico (@ESignoretti) opened up a support ticket on this and it’s supposedly been fixed in the latest desktop and was a temporary filename used only during upload and should have been deleted-renamed after the upload was completed. I just uploaded 22MB with about 40 files and didn’t see any of this.

I really want encryption as I wanted one transporter in a remote office and another in the home office with everything synched locally and then I would hand carry the remote one to the other location. But without encryption this isn’t going to work for me. So I guess I will limit myself to just one and move it around to wherever I want to my data to go.

Here are some of the other blog posts by SFD7 participants on Transporter:

Storage field day 7 – day 2 – Connected Data by Dan Firth (@PenguinPunk)

File Transporter, private Synch&Share made easy by Enrico Signoretti (@ESignoretti)

Transporter – Storage Field Day 7 preview by Keith Townsend (@VirtualizedGeek)

Comments?

Data virtualization surfaces

There’s a new storage startup out of stealth, called Primary Data and it’s implementing data (note, not storage) virtualization.

They already have $60M in funding with some pretty highpowered talent from Fusion IO, namely David Flynn, Rick White and Steve Wozniak (the ‘Woz’)  (also of Apple fame).

There have been a number of attempts at creating a virtualization layers for data namely ViPR (See my post ViPR virtues, vexations but no storage virtualization) but Primary Data is taking a different tack to the problem.

Data virtualization explained

Data hypervisor, software defined storage, data plane, control plane
(c) 2012 Silverton Consulting, Inc. All rights reserved

Essentially they want to separate the data plane from the control plane (See my Data Hypervisor post and comments for another view on this).

  • The data plane consists of those storage system activities that actually perform IO or read and writes.
  • The control plane is those storage system activities that do everything else that has to be done by a storage system, including provisioning, monitoring, and managing the storage.

Separating the data plane from the control plane offers a number of advantages. EMC ViPR does this but it’s data plane is either standard storage systems like VMAX, VNX, Isilon etc, or software defined storage solutions. Primary Data wants to do it all.

Their meta data or control plane engine is called a Data Director which holds information about the data objects that are stored in the Primary Data system, runs a data policy management engine and handles data migration.

Primary Data relies on purpose-built, Data Hypervisor (client) software that talks to Data Directors to understand where data objects reside and how to go about accessing them. But once the metadata information is transferred to the client SW, then IO activity can go directly between the host and the storage system in a protocol independent fashion.

[The graphic above is from my prior post and I assumed the data hypervisor (DH) would be co-located with the data but Primary Data has rightly implemented this as a separate layer in host software.]

Data Hypervisor protocol independence?

As I understand it this means that customers could use file storage, object storage or block storage to support any application requirement. This also means that file data (objects) could be migrated to block storage and still be accessed as file data. But the converse is also true, i.e., block data (objects) could be migrated to file storage and still be accessed as block data. You need to add object, DAS, PCIe flash and cloud storage to the mix to see where they are headed.

All data in Primary Data’s system are object encapsulated and all data objects are catalogued within a single, global namespace that spans file, block, object and cloud storage repositories

Data objects can reside on Primary storage systems, external non-Primary data aware file or block storage systems, DAS, PCIe Flash, and even cloud storage.

How does Data Virtualization compare to Storage Virtualization?

There are a number of differences:

  1. Most storage virtualization solutions are in the middle of the data path and because of this have to be fairly significant, highly fault-tolerant solutions.
  2. Most storage virtualization solutions don’t have a separate and distinct meta-data engine.
  3. Most storage virtualization systems don’t require any special (data hypervisor) software running on hosts or clients.
  4. Most storage virtualization systems don’t support protocol independent access to data storage.
  5. Most storage virtualization systems don’t support DAS or server based, PCIe flash for permanent storage. (Yes this is not supported in the first release but the intent is to support this soon.)
  6. Most storage virtualization systems support internal storage that resides directly inside the storage virtualization system hardware.
  7. Most storage virtualization systems support an internal DRAM cache layer which is used to speed up IO to internal and external storage and is in addition to any caching done at the external storage system level.
  8. Most storage virtualization systems only support external block storage.

There are a few similarities as well:

  1. They both manage data migration in a non-disruptive fashion.
  2. They both support automated policy management over data placement, data protection, data performance, and other QoS attributes.
  3. They both support multiple vendors of external storage.
  4. They both can support different host access protocols.

Data Virtualization Policy Management

A policy engine runs in the Data Directors and provides SLAs for data objects. This would include performance attributes, protection attributes, security requirements and cost requirements.  Presumably, policy specifications for data protection would include RAID level, erasure coding level and geographic dispersion.

In Primary Data, backup becomes nothing more than object snapshots with different protection characteristics, like offsite full copy. Moreover, data object migration can be handled completely outboard and without causing data access disruption and on an automated policy basis.

Primary Data first release

Primary Data will be initially deployed as an integrated data virtualization solution which includes an all flash NAS storage system and a standard NAS system. Over time, Primary Data will add non-Primary Data external storage and internal storage (DAS, SSD, PCIe Flash).

The Data Policy Engine and Data Migrator functionality will be separately charged for software solutions. Data Directors are sold in pairs (active-passive) and can be non-disruptively upgraded. Storage (directors?) are also sold separately.

Data Hypervisor (client) software is available for most styles of Linux, Openstack and coming for ESX. Windows SMB support is not split yet (control plane/data plane) but Primary data does support SMB. I believe the Data Hypervisor software will also be released in an upcoming version of the Linux kernel.

They are currently in testing. No official date for GA but they did say they would announce pricing in 2015.

~~~~

Comments?

Disclosure: We have done work for Primary Data over the past year.

Photo Credits:

  1. Screen shot of beta test system supplied by Primary Data
  2. Graphic created by SCI for prior Data Hypervisor post

Protest intensity, world news database and big data – chart of the month

Read an article the other day on the analysis of the Arab Spring (Did the Arab Spring really spark a wave of global protests, in Foreign Policy) using a Google Ideas sponsored project, the GDELT ProjectTime domain run chart showing protest intensity every month for the last 30 years, with running average (Global Database of Events, Language and Tone) file of  events extracted from worldwide media sources.  The GDELT database uses sophisticated language processing to extract “event” data from news media streams and supplies this information in database form.  The database can be analyzed  to identify  trends in world events and possibly to better understand what led up to events that occur on our planet.

GDELT Project

The GDELT database records over 300 categories of events that are geo-referenced to city/mountaintop and time-referenced. The event data dates back to 1979.  The GDELT data captures 60 attributes of any event that occurs, generating a giant spreadsheet of event information with location, time, parties, and myriad other attributes all identified, and cross-referenceable.

Besides the extensive spreadsheet of world event attribute data the GDELT project also supplies a knowledge graph oriented view of its event data. The GDELT knowledge graph “compiles a list of every person, organization, company, location and over 230 themes and emotions from every news report” that can then be used to create network diagrams/graphs to be better able to visualize interactions between events. 

For example see the Global Conversation in Foreign Policy, for a network diagram of every person mentioned in the news during 6 months of 2013.  You can zoom in or out to see how people identified in news reports are connected during the six months. So if you we’re interested, in let’s say the Syrian civil war, one could easily see at a glance any news item that mentioned Syria or was located in Syria since 1979 to now. Wow!

Arab Spring and Worldwide Protest

Getting back to the chart-of-the-month, the graphic above shows the “protest intensity” by month for the last 30 years with a running average charted in black using GDELT data.  (It’s better seen in the FP article/link above or just click on it for an expanded view. ).

One can see from the chart that there was a significant increase in protest activity after January 2011, which corresponds to the beginning of the Arab Spring.  But the amazing inference from the chart above is that this increase has continued ever since. This shows that the Arab Spring has had a lasting contribution that has significantly increased worldwide protest activity.

This is just one example of the types of research available with the GDELT data.

~~~~

I have talked in the past about how (telecom, social media and other) organizations should deposit their corporate/interaction data sets in some public repository for the better good of humanity so that any researcher could use it (see my Data of the world, lay down your chains post for more on this). The GDELT Project is Google Ideas doing this on a larger scale than I ever thought feasible. Way to go.

Comments?

 Image credits: (c) 2014 ForeignPolicy.com, All Rights Reserved

 

 

Computational Anthropology & Archeology

7068119915_732dd1ef63_zRead an article this week from Technology Review on The Emerging Science of Computational Anthropology. It was about the use of raw social media feeds to study the patterns of human behavior and how they change over time. In this article, they had come up with some heuristics that could be used to identify when people are local to an area and when they are visiting or new to an area.

Also, this past week there was an article in the Economist about Mining for Tweets of Gold about the startup DataMinr that uses raw twitter feeds to supply information about what’s going on in the world today. Apparently DataMinr is used by quite a few financial firms, news outlets, and others and has a good reputation for discovering news items that have not been reported yet. DataMinr is just one of a number of commercial entities doing this sort of analysis on Twitter data.

A couple of weeks ago I wrote a blog post on Free Social and Mobile Data as a Public Good. In that post I indicated that social and mobile data should be published, periodically in an open format, so that any researcher could examine it around the world.

Computational Anthropology

Anthropology is the comparative study of human culture and condition, both past and present. Their are many branches to the study of  Anthropology including but not limited to physical/biological, social/cultural, archeology and linguistic anthropologies. Using social media/mobile data to understand human behavior, development and culture would fit into the social/cultural branch of anthropology.

I have also previously written about some recent Computational Anthropological research (although I didn’t call it that), please see my Cheap phones + big data = better world and Mobile phone metadata underpins a new science posts. The fact is that mobile phone metadata can be used to create a detailed and deep understanding of a societies mobility.  A better understanding of human mobility in a region can be used to create more effective mass transit, more efficient  road networks, transportation and reduce pollution/energy use, among other things.

Social media can be used in a similar manner but it’s more than just location information, and some of it is about how people describe events and how they interact through text and media technologies. One research paper discussed how tweets could be used to detect earthquakes in real time (see: Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors).

Although the location information provided by mobile phone data is more important to governments and transportation officials, it appears as if social media data is more important to organizations seeking news, events, or sentiment trending analysis.

Sources of the data today

Recently, Twitter announced that it would make its data available to a handful of research organizations (see: Twitter releasing trove of user data …).

On the other hand Facebook and LinkedIn seems a bit more restrictive in allowing access to their data. They have a few data scientists on staff but if you want access to their data you have to apply for it and only a few are accepted.

Although Google, Twitter, Facebook, LinkedIn and Telecoms represent the lions share of social/mobile data out there today, there are plenty of others sources of information that could potentially be useful that come to mind. Notwithstanding the NSA, currently there is limited research accessibility to the actual texts of mobile phone texts/messaging, and god forbid, emails.  Although privacy concerns are high, I believe ultimately this needs to change.

Imagine if some researchers had access to all the texts of a high school student body. Yes much of it would be worthless but some of it would tell a very revealing story about teenage relationships, interest and culture among other things. And having this sort of information over time could reveal the history of teenage cultural change. Much of this would have been previously available through magazines but today texts would represent a much more granular level of this information.

Computational Archeology

Archeology is just anthropology from a historical perspective, i.e, it is the study of the history of cultures, societies and life.  Computational Archeology would apply to the history of the use of computers, social media, telecommunications, Internet/WWW, etc.

There are only few resources that are widely available for this data such as the Internet Archive. But much of the history of WWW, social media, telecom, etc. use is in current and defunct organizations that aside from Twitter, continue to be very stingy with their data.

Over time all such data will be lost or become inaccessible unless something is done to make it available to research organizations. I believe sooner or later humanity will wise up to the loss of this treasure trove of information and create some sort of historical archive for this data and require companies to supply this data over time.

Comments?

Photo Credit(s): State of the Linked Open Data Cloud (LOD), September 2011 by Duncan Hull

Releasing social and mobile data as a public good

I have been reading a book recently, called Uncharted: Big data as a lens on human culture by Erez Aiden and Jean-Baptiste Michel that discusses the use of Google’s Ngram search engine which counts phrases (Ngrams) used in all the books they have digitized. Ngram phrases are charted against other Ngrams and plotted in real time.

It’s an interesting concept and one example they use is “United States are” vs. “United States is” a 3-Ngram which shows that the singular version of the phrase which has often been attributed to emerge immediately after the Civil War actually was in use prior to the Civil War and really didn’t take off until 1880’s, 15 years after the end of the Civil War.

I haven’t finished the book yet but it got me to thinking. The authors petitioned Google to gain access to the Ngram data which led to their original research. But then they convinced Google after their original research time was up to release the information to the general public. Great for them but it’s only a one time event and happened to work this time with luck and persistance.

The world needs more data

But there’s plenty of other information or data out there where we could use to learn an awful lot about human social interaction and other attributes about the world that are buried away in corporate databases. Yes, sometimes this information is made public (like Google), or made available for specific research (see my post on using mobile phone data to understand people mobility in an urban environment) but these are special situations. Once the research is over, the data is typically no longer available to the general public and getting future or past data outside the research boundaries requires yet another research proposal.

And yet books and magazines are universally available for a fair price to anyone and are available in most research libraries as a general public good for free.  Why should electronic data be any different?

Social and mobile dta as a public good

What I would propose is that the Library of Congress and other research libraries around the world have access to all corporate data that documents interaction between humans, humans and the environment, humanity and society, etc.  This data would be freely available to anyone with library access and could be used to provide information for research activities that have yet to be envisioned.

Hopefully all of this data would be released, free of charge (or for some nominal fee) to these institutions after some period of time has elapsed. For example, if we were talking about Twitter feeds, Facebook feeds, Instagram feeds, etc. the data would be provided from say 7 years back on a reoccurring yearly or quarterly basis. Not sure if the delay time should be 7, 10 or 15 years, but after some judicious period of time, the data would be released and made publicly available.

There are a number of other considerations:

  • Anonymity – somehow any information about a person’s identity, actual location, or other potentially identifying characteristics would need to be removed from all the data.  I realize this may reduce the value of the data to future researchers but it must be done. I also realize that this may not be an easy thing to accomplish and that is why the data could potentially be sold for a fair price to research libraries. Perhaps after 35 to 100 years or so the identifying information could be re-incorporated into the original data set but I think this highly unlikely.
  • Accessibility – somehow the data would need to have an easily accessible and understandable description that would enable any researcher to understand the underlying format of the data. This description should probably be in XML format or some other universal description language. At a minimum this would need to include meta-data descriptions of the structure of the data, with all the tables, rows and fields completely described. This could be in SQL format or just XML but needs to be made available. Also the data release itself would then need to be available in a database or in flat file formats that could be uploaded by the research libraries and then accessed by researchers. I would expect that this would use some sort of open source database/file service tools such as MySQL or other database engines. These database’s represent the counterpart to book shelves in today’s libraries and has to be universally accessible and forever available.
  • Identifyability – somehow the data releases would need to be universally identifiable, not unlike the ISBN scheme currently in use for books and magazines and ISRC scheme used for recordings. This would allow researchers to uniquely refer to any data set that is used to underpin their research. This would also allow the world’s research libraries to insure that they purchase and maintain all the data that becomes available by using some sort of master worldwide catalog that would hold pointers to all this data that is currently being held in research institutions. Such a catalog entry would represent additional meta-data for the data release and would represent a counterpart to a online library card catalog.
  • Legality – somehow any data release would need to respect any local Data Privacy and Protection laws of the country where the data resides. This could potentially limit the data that is generated in one country, say Germany to be held in that country only. I would think this could be easily accomplished as long as that country would be willing to host all its data in its research institutions.

I am probably forgetting a dozen more considerations but this covers most of it.

How to get companies to release their data

One that quickly comes to mind is how to compel companies to release their data in a timely fashion. I believe that data such as this is inherently valuable to a company but that its corporate value starts to diminish over time and after some time goes to 0.

However, the value to the world of such data encounters an inverse curve. That is, the longer away we are from a specific time period when that data was created, the more value it has for future research endeavors. Just consider what current researchers do with letters, books and magazine articles from the past when they are researching a specific time period in history.

But we need to act now. We are already over 7 years into the Facebook era and mobile phones have been around for decades now. We have probably already lost much of the mobile phone tracking information from the 80’s, 90’s, 00’s and may already be losing the data from the early ’10’s. Some social networks have already risen and gone into a long eclipse where historical data is probably their lowest concern. There is nothing that compels organizations to keep this data around, today.

Types of data to release

Obviously, any social networking data, mobile phone data, or email/chat/texting data should all be available to the world after 7 or more years.  Also the private photo libraries, video feeds, audio recordings, etc. should also be released if not already readily available. Less clear to me are utility data, such as smart power meter readings, water consumption readings, traffic tollway activity, etc.

I would say that one standard to use might be if there is any current research activity based on private, corporate data, then that data should ultimately become available to the world. The downside to this is that companies may be more reluctant to grant such research if this is a criteria to release data.

But maybe the researchers themselves should be able to submit requests for data releases and that way it wouldn’t matter if the companies declined or not.

There is no way, anyone could possibly identify all the data that future researchers would need. So I would err on the side to be more inclusive rather than less inclusive in identifying classes of data to be released.

The dawn of Psychohistory

The Uncharted book above seems to me to represent a first step to realizing a science of Psychohistory as envisioned in Asimov’s Foundation Trilogy. It’s unclear whether this will ever be a true, quantified scientific endeavor but with appropriate data releases, readily available for research, perhaps someday in the future we can help create the science of Psychohistory. In the mean time, through the use of judicious, periodic data releases and appropriate research, we can certainly better understand how the world works and just maybe, improve its internal workings for everyone on the planet.

Comments?

Picture Credit(s): Amazon and Wikipedia 

Cloud based database startups are heating up

IBM recently agreed to purchase Cloudant an online database service using a NoSQL database called CouchDB. Apparently this is an attempt by IBM to take on Amazon and others that support cloud based services using a NoSQL database backend to store massive amounts of data.

In other news, Dassault Systems, a provider of 3D and other design tools has invested $14.2M in NuoDB, a cloud-based NewSQL compliant database service provider. Apparently Dassault intends to start offering its design software as a service offering using NuoDB as a backend database.

We have discussed NewSQL and NoSQL database’s before (see NewSQL and the curse of old SQL databases post) and there are plenty available today. So, why the sudden interest in cloud based database services. I tend to think there are a couple of different trends playing out here.

IBM playing catchup

In the IBM case there’s just so much data going to the cloud these days that IBM just can’t have a hand in it, if it wants to continue to be a major IT service organization.  Amazon and others are blazing this trail and IBM has to get on board or be left behind.

The NoSQL or no relational database model allows for different types of data structuring than the standard tables/rows of traditional RDMS databases. Specifically, NoSQL databases are very useful for data that can be organized in a tree (directed graph), graph (non-directed graph?) or key=value pairs. This latter item is very useful for Hadoop, MapReduce and other big data analytics applications. Doing this in the cloud just makes sense as the data can be both gathered and tanalyzed in the cloud without having anything more than the results of the analysis sent back to a requesting party.

IBM doesn’t necessarily need a SQL database as it already has DB2. IBM already has a cloud-based DB2 service that can be implemented by public or private cloud organizations.  But they have no cloud based NoSQL service today and having one today can make a lot of sense if IBM wants to branch out to more cloud service offerings.

Dassault is broadening their market

As for the cloud based, NuoDB NewSQL database, not all data fits the tree, graph, key=value pair structuring of NoSQL databases. Many traditional applications that use databases today revolve around SQL services and would be hard pressed to move off RDMS.

Also, one ongoing problem with NoSQL databases is that they don’t really support ACID transaction processing and as such, often compromise on data consistency in order to support highly parallelizable activities. In contrast, a SQL database supports rigid transaction consistency and is just the thing for moving something like a traditional OLTP processing application to the cloud.

I would guess, how NuoDB handles the high throughput needed by it’s cloud service partners while still providing ACID transaction consistency is part of its secret sauce.

But what’s behind it, at least some of this interest may just be the internet of things (IoT)

The other thing that seems to be driving a lot of the interest in cloud based databases is the IoT. As more and more devices become internet connected, they will start to generate massive amounts of data. The only way to capture and analyze this data effectively today is with NoSQL and NewSQL database services. By hosting these services in the cloud, analyzing/processing/reporting on this tsunami of data becomes much, much easier.

Storing and analyzing all this IoT data should make for an interesting decade or so as the internet of things gets built out across the world.  Cisco’s CEO, John Chambers recently said that the IoT market will be worth $19T and will have 50B internet connected devices by 2020. Seems a bit of a stretch seeings as how they just predicted (June 2013) to have 10B devices attached to the internet by the middle of last year, but who am I to disagree.

There’s much more to be written about the IoT and its impact on data storage, but that will need to wait for another time… stay tuned.

Comments?

Photo Credit(s): database 2 by Tim Morgan