Data virtualization surfaces

There’s a new storage startup out of stealth called Primary Data, and it’s implementing data (note, not storage) virtualization.

They already have $60M in funding and some pretty high-powered talent from Fusion IO, namely David Flynn, Rick White and Steve Wozniak (the ‘Woz’, also of Apple fame).

There have been a number of attempts at creating a virtualization layer for data, such as ViPR (see my post ViPR virtues, vexations but no storage virtualization), but Primary Data is taking a different tack to the problem.

Data virtualization explained

Data hypervisor, software defined storage, data plane, control plane
(c) 2012 Silverton Consulting, Inc. All rights reserved

Essentially they want to separate the data plane from the control plane (See my Data Hypervisor post and comments for another view on this).

  • The data plane consists of those storage system activities that actually perform IO, i.e., reads and writes.
  • The control plane consists of those storage system activities that do everything else a storage system has to do, including provisioning, monitoring, and managing the storage.

Separating the data plane from the control plane offers a number of advantages. EMC ViPR does this, but its data plane is either standard storage systems like VMAX, VNX, Isilon etc., or software defined storage solutions. Primary Data wants to do it all.

Their metadata or control plane engine is called a Data Director, which holds information about the data objects stored in the Primary Data system, runs a data policy management engine and handles data migration.

Primary Data relies on purpose-built Data Hypervisor (client) software that talks to Data Directors to understand where data objects reside and how to go about accessing them. But once the metadata information is transferred to the client software, IO activity can go directly between the host and the storage system in a protocol independent fashion.
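To make the split a little more concrete, here is a minimal Python sketch of how I picture it. This is my own illustration, not Primary Data's code: the class names (DataDirector, DataHypervisorClient), the methods (register, locate, read) and the in-memory "backends" are all hypothetical stand-ins.

```python
# Hypothetical sketch of a control plane / data plane split.
# None of these names come from Primary Data; they only illustrate the idea.

class DataDirector:
    """Control plane: knows where each object lives, never touches the data."""

    def __init__(self):
        self._catalog = {}   # object name -> name of the backend holding it
        self.lookups = 0     # how many metadata requests we have served

    def register(self, name, backend_name):
        self._catalog[name] = backend_name

    def locate(self, name):
        self.lookups += 1
        return self._catalog[name]


class DataHypervisorClient:
    """Host-side client: asks the director once, then does IO directly."""

    def __init__(self, director, backends):
        self._director = director
        self._backends = backends    # backend name -> dict acting as storage
        self._layout_cache = {}      # object name -> backend name

    def read(self, name):
        # Control plane step: only needed the first time we see an object.
        if name not in self._layout_cache:
            self._layout_cache[name] = self._director.locate(name)
        # Data plane step: go straight to the backend, bypassing the director.
        backend = self._backends[self._layout_cache[name]]
        return backend[name]


director = DataDirector()
backends = {
    "flash_nas": {"report.doc": b"hot data"},
    "cloud": {"archive.tar": b"cold data"},
}
director.register("report.doc", "flash_nas")
director.register("archive.tar", "cloud")

client = DataHypervisorClient(director, backends)
first = client.read("report.doc")
second = client.read("report.doc")   # no second trip to the director
```

The point of the sketch is only that the director is consulted for metadata and stays out of the IO path afterwards; how the real product caches or invalidates that metadata is not something I know.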

[The graphic above is from my prior post and I assumed the data hypervisor (DH) would be co-located with the data but Primary Data has rightly implemented this as a separate layer in host software.]

Data Hypervisor protocol independence?

As I understand it this means that customers could use file storage, object storage or block storage to support any application requirement. This also means that file data (objects) could be migrated to block storage and still be accessed as file data. But the converse is also true, i.e., block data (objects) could be migrated to file storage and still be accessed as block data. You need to add object, DAS, PCIe flash and cloud storage to the mix to see where they are headed.

All data in Primary Data’s system are object encapsulated and all data objects are catalogued within a single, global namespace that spans file, block, object and cloud storage repositories.

Data objects can reside on Primary Data storage systems, external non-Primary Data aware file or block storage systems, DAS, PCIe Flash, and even cloud storage.

How does Data Virtualization compare to Storage Virtualization?

There are a number of differences:

  1. Most storage virtualization solutions are in the middle of the data path and because of this have to be fairly significant, highly fault-tolerant solutions.
  2. Most storage virtualization solutions don’t have a separate and distinct meta-data engine.
  3. Most storage virtualization systems don’t require any special (data hypervisor) software running on hosts or clients.
  4. Most storage virtualization systems don’t support protocol independent access to data storage.
  5. Most storage virtualization systems don’t support DAS or server based, PCIe flash for permanent storage. (Yes this is not supported in the first release but the intent is to support this soon.)
  6. Most storage virtualization systems support internal storage that resides directly inside the storage virtualization system hardware.
  7. Most storage virtualization systems support an internal DRAM cache layer which is used to speed up IO to internal and external storage and is in addition to any caching done at the external storage system level.
  8. Most storage virtualization systems only support external block storage.

There are a few similarities as well:

  1. They both manage data migration in a non-disruptive fashion.
  2. They both support automated policy management over data placement, data protection, data performance, and other QoS attributes.
  3. They both support multiple vendors of external storage.
  4. They both can support different host access protocols.

Data Virtualization Policy Management

A policy engine runs in the Data Directors and provides SLAs for data objects. This would include performance attributes, protection attributes, security requirements and cost requirements.  Presumably, policy specifications for data protection would include RAID level, erasure coding level and geographic dispersion.
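As a rough sketch of what such a policy might look like in practice, the Python below matches a per-object policy against a set of storage tiers and picks the cheapest one that qualifies. The Tier and Policy fields, the place function and all the numbers are my own assumptions for illustration; I have not seen Primary Data's actual policy format.

```python
from dataclasses import dataclass

# Hypothetical policy model: field names and values are illustrative only.

@dataclass(frozen=True)
class Tier:
    name: str
    latency_ms: float     # typical access latency
    copies: int           # number of protected copies kept
    cost_per_gb: float    # relative cost

@dataclass(frozen=True)
class Policy:
    max_latency_ms: float
    min_copies: int
    max_cost_per_gb: float

def place(policy, tiers):
    """Return the cheapest tier that satisfies the policy, or None."""
    candidates = [
        t for t in tiers
        if t.latency_ms <= policy.max_latency_ms
        and t.copies >= policy.min_copies
        and t.cost_per_gb <= policy.max_cost_per_gb
    ]
    return min(candidates, key=lambda t: t.cost_per_gb, default=None)

TIERS = [
    Tier("pcie_flash", 0.1, 1, 5.00),
    Tier("all_flash_nas", 1.0, 2, 2.00),
    Tier("disk_nas", 10.0, 2, 0.50),
    Tier("cloud", 100.0, 3, 0.05),
]

hot = place(Policy(max_latency_ms=2.0, min_copies=2, max_cost_per_gb=3.00), TIERS)
archive = place(Policy(max_latency_ms=500.0, min_copies=3, max_cost_per_gb=0.10), TIERS)
impossible = place(Policy(max_latency_ms=0.5, min_copies=3, max_cost_per_gb=1.00), TIERS)
```

A real engine would presumably re-evaluate placement as policies or tiers change and then drive the data migrator, but that part is outside this sketch.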

In Primary Data, backup becomes nothing more than object snapshots with different protection characteristics, like offsite full copy. Moreover, data object migration can be handled completely outboard and without causing data access disruption and on an automated policy basis.

Primary Data first release

Primary Data will be initially deployed as an integrated data virtualization solution which includes an all flash NAS storage system and a standard NAS system. Over time, Primary Data will add non-Primary Data external storage and internal storage (DAS, SSD, PCIe Flash).

The Data Policy Engine and Data Migrator functionality will be separately charged software solutions. Data Directors are sold in pairs (active-passive) and can be non-disruptively upgraded. Storage (directors?) are also sold separately.

Data Hypervisor (client) software is available for most styles of Linux and OpenStack, and is coming for ESX. Windows SMB support is not split yet (control plane/data plane) but Primary Data does support SMB. I believe the Data Hypervisor software will also be released in an upcoming version of the Linux kernel.

They are currently in testing. No official date for GA but they did say they would announce pricing in 2015.

~~~~

Comments?

Disclosure: We have done work for Primary Data over the past year.

Photo Credits:

  1. Screen shot of beta test system supplied by Primary Data
  2. Graphic created by SCI for prior Data Hypervisor post

Protest intensity, world news database and big data – chart of the month

[Chart: time domain run chart showing protest intensity every month for the last 30 years, with running average]

Read an article the other day on the analysis of the Arab Spring (Did the Arab Spring really spark a wave of global protests, in Foreign Policy) using a Google Ideas sponsored project, the GDELT Project (Global Database of Events, Language and Tone), a file of events extracted from worldwide media sources. The GDELT database uses sophisticated language processing to extract “event” data from news media streams and supplies this information in database form. The database can be analyzed to identify trends in world events and possibly to better understand what led up to events that occur on our planet.

GDELT Project

The GDELT database records over 300 categories of events that are geo-referenced to city/mountaintop and time-referenced. The event data dates back to 1979.  The GDELT data captures 60 attributes of any event that occurs, generating a giant spreadsheet of event information with location, time, parties, and myriad other attributes all identified, and cross-referenceable.

Besides the extensive spreadsheet of world event attribute data the GDELT project also supplies a knowledge graph oriented view of its event data. The GDELT knowledge graph “compiles a list of every person, organization, company, location and over 230 themes and emotions from every news report” that can then be used to create network diagrams/graphs to be better able to visualize interactions between events. 

For example, see the Global Conversation in Foreign Policy for a network diagram of every person mentioned in the news during 6 months of 2013. You can zoom in or out to see how people identified in news reports are connected during the six months. So if you were interested in, let’s say, the Syrian civil war, you could easily see at a glance any news item that mentioned Syria or was located in Syria since 1979 to now. Wow!

Arab Spring and Worldwide Protest

Getting back to the chart-of-the-month, the graphic above shows the “protest intensity” by month for the last 30 years with a running average charted in black using GDELT data. (It’s better seen in the FP article/link above or just click on it for an expanded view.)
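For anyone who wants to try something similar on their own extract of the data, a running average like the black line is easy to compute. The Python below is a minimal sketch using a trailing window; the monthly values are made up for illustration and are not GDELT data, and I don't know what window length the FP chart actually used.

```python
def running_average(values, window):
    """Trailing mean over (up to) the `window` most recent values."""
    averaged = []
    total = 0.0
    for i, value in enumerate(values):
        total += value
        if i >= window:
            # Drop the value that just fell out of the window.
            total -= values[i - window]
        averaged.append(total / min(i + 1, window))
    return averaged

# Made-up monthly "protest intensity" values, not real GDELT numbers.
monthly = [2.0, 2.0, 2.0, 5.0, 5.0, 5.0]
smoothed = running_average(monthly, window=3)
```

With a step up in the raw series like the one above, the smoothed line climbs gradually over the following window, which is why a sustained rise in the running average is more convincing than a single spike.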

One can see from the chart that there was a significant increase in protest activity after January 2011, which corresponds to the beginning of the Arab Spring.  But the amazing inference from the chart above is that this increase has continued ever since. This shows that the Arab Spring has had a lasting contribution that has significantly increased worldwide protest activity.

This is just one example of the types of research available with the GDELT data.

~~~~

I have talked in the past about how (telecom, social media and other) organizations should deposit their corporate/interaction data sets in some public repository for the better good of humanity so that any researcher could use it (see my Data of the world, lay down your chains post for more on this). The GDELT Project is Google Ideas doing this on a larger scale than I ever thought feasible. Way to go.

Comments?

 Image credits: (c) 2014 ForeignPolicy.com, All Rights Reserved

 

 

Computational Anthropology & Archeology

Read an article this week from Technology Review on The Emerging Science of Computational Anthropology. It was about the use of raw social media feeds to study the patterns of human behavior and how they change over time. In this article, they had come up with some heuristics that could be used to identify when people are local to an area and when they are visiting or new to an area.

Also, this past week there was an article in the Economist, Mining for Tweets of Gold, about the startup DataMinr, which uses raw Twitter feeds to supply information about what’s going on in the world today. Apparently DataMinr is used by quite a few financial firms, news outlets, and others and has a good reputation for discovering news items that have not been reported yet. DataMinr is just one of a number of commercial entities doing this sort of analysis on Twitter data.

A couple of weeks ago I wrote a blog post on Free Social and Mobile Data as a Public Good. In that post I indicated that social and mobile data should be published, periodically in an open format, so that any researcher could examine it around the world.

Computational Anthropology

Anthropology is the comparative study of human culture and condition, both past and present. There are many branches to the study of Anthropology including but not limited to physical/biological, social/cultural, archeology and linguistic anthropologies. Using social media/mobile data to understand human behavior, development and culture would fit into the social/cultural branch of anthropology.

I have also previously written about some recent Computational Anthropological research (although I didn’t call it that); please see my Cheap phones + big data = better world and Mobile phone metadata underpins a new science posts. The fact is that mobile phone metadata can be used to create a detailed and deep understanding of a society’s mobility. A better understanding of human mobility in a region can be used to create more effective mass transit and more efficient road networks and transportation, and to reduce pollution/energy use, among other things.

Social media can be used in a similar manner but it’s more than just location information, and some of it is about how people describe events and how they interact through text and media technologies. One research paper discussed how tweets could be used to detect earthquakes in real time (see: Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors).

Although the location information provided by mobile phone data is more important to governments and transportation officials, it appears as if social media data is more important to organizations seeking news, events, or sentiment trending analysis.

Sources of the data today

Recently, Twitter announced that it would make its data available to a handful of research organizations (see: Twitter releasing trove of user data …).

On the other hand, Facebook and LinkedIn seem a bit more restrictive in allowing access to their data. They have a few data scientists on staff but if you want access to their data you have to apply for it and only a few are accepted.

Although Google, Twitter, Facebook, LinkedIn and Telecoms represent the lion’s share of social/mobile data out there today, plenty of other potentially useful sources of information come to mind. Notwithstanding the NSA, currently there is limited research accessibility to the actual texts of mobile phone texts/messaging, and god forbid, emails. Although privacy concerns are high, I believe ultimately this needs to change.

Imagine if some researchers had access to all the texts of a high school student body. Yes, much of it would be worthless but some of it would tell a very revealing story about teenage relationships, interests and culture, among other things. And having this sort of information over time could reveal the history of teenage cultural change. Much of this would have been previously available through magazines but today texts would represent a much more granular level of this information.

Computational Archeology

Archeology is just anthropology from a historical perspective, i.e., it is the study of the history of cultures, societies and life. Computational Archeology would apply to the history of the use of computers, social media, telecommunications, Internet/WWW, etc.

There are only a few widely available resources for this data, such as the Internet Archive. But much of the history of WWW, social media, telecom, etc. use is held by current and defunct organizations that, aside from Twitter, continue to be very stingy with their data.

Over time all such data will be lost or become inaccessible unless something is done to make it available to research organizations. I believe sooner or later humanity will wise up to the loss of this treasure trove of information and create some sort of historical archive for this data and require companies to supply this data over time.

Comments?

Photo Credit(s): State of the Linked Open Data Cloud (LOD), September 2011 by Duncan Hull

Releasing social and mobile data as a public good

I have been reading a book recently, called Uncharted: Big data as a lens on human culture by Erez Aiden and Jean-Baptiste Michel that discusses the use of Google’s Ngram search engine which counts phrases (Ngrams) used in all the books they have digitized. Ngram phrases are charted against other Ngrams and plotted in real time.

It’s an interesting concept, and one example they use is “United States are” vs. “United States is”, a pair of 3-grams. The chart shows that the singular version of the phrase, whose emergence is often attributed to the period immediately after the Civil War, was actually in use prior to the Civil War and didn’t really take off until the 1880s, 15 years after the end of the Civil War.
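The counting behind a chart like that is simple to sketch. The Python below is my own toy version, not Google's implementation: it splits text into word 3-grams and tallies chosen phrases by year. The two-sentence "corpus" is invented purely to exercise the code, and the function names are hypothetical.

```python
from collections import Counter

def ngrams(text, n):
    """Return the word n-grams of `text`, lowercased, as strings."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def count_by_year(corpus, phrases, n=3):
    """corpus: iterable of (year, text). Returns {phrase: Counter(year -> hits)}."""
    counts = {phrase: Counter() for phrase in phrases}
    for year, text in corpus:
        for gram in ngrams(text, n):
            if gram in counts:
                counts[gram][year] += 1
    return counts

# Invented sample text, only here to show the mechanics.
corpus = [
    (1850, "The United States are a union and the United States is young"),
    (1890, "The United States is a nation"),
]
result = count_by_year(corpus, ["united states are", "united states is"])
```

A real Ngram viewer would also normalize each count by the total number of n-grams published that year, so that the growth in the number of books doesn't masquerade as growth in a phrase.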

I haven’t finished the book yet but it got me to thinking. The authors petitioned Google to gain access to the Ngram data, which led to their original research. But then, after their original research time was up, they convinced Google to release the information to the general public. Great for them, but it’s only a one time event and happened to work this time with luck and persistence.

The world needs more data

But there’s plenty of other information or data out there, buried away in corporate databases, that we could use to learn an awful lot about human social interaction and other attributes of the world. Yes, sometimes this information is made public (like Google), or made available for specific research (see my post on using mobile phone data to understand people mobility in an urban environment) but these are special situations. Once the research is over, the data is typically no longer available to the general public and getting future or past data outside the research boundaries requires yet another research proposal.

And yet books and magazines are universally available for a fair price to anyone and are available in most research libraries as a general public good for free.  Why should electronic data be any different?

Social and mobile data as a public good

What I would propose is that the Library of Congress and other research libraries around the world have access to all corporate data that documents interaction between humans, humans and the environment, humanity and society, etc.  This data would be freely available to anyone with library access and could be used to provide information for research activities that have yet to be envisioned.

Hopefully all of this data would be released, free of charge (or for some nominal fee) to these institutions after some period of time has elapsed. For example, if we were talking about Twitter feeds, Facebook feeds, Instagram feeds, etc. the data would be provided from say 7 years back on a reoccurring yearly or quarterly basis. Not sure if the delay time should be 7, 10 or 15 years, but after some judicious period of time, the data would be released and made publicly available.

There are a number of other considerations:

  • Anonymity – somehow any information about a person’s identity, actual location, or other potentially identifying characteristics would need to be removed from all the data.  I realize this may reduce the value of the data to future researchers but it must be done. I also realize that this may not be an easy thing to accomplish and that is why the data could potentially be sold for a fair price to research libraries. Perhaps after 35 to 100 years or so the identifying information could be re-incorporated into the original data set but I think this highly unlikely.
  • Accessibility – somehow the data would need to have an easily accessible and understandable description that would enable any researcher to understand the underlying format of the data. This description should probably be in XML format or some other universal description language. At a minimum this would need to include meta-data descriptions of the structure of the data, with all the tables, rows and fields completely described. This could be in SQL format or just XML but needs to be made available. Also the data release itself would then need to be available in a database or in flat file formats that could be uploaded by the research libraries and then accessed by researchers. I would expect that this would use some sort of open source database/file service tools such as MySQL or other database engines. These databases represent the counterpart to book shelves in today’s libraries and have to be universally accessible and forever available.
  • Identifiability – somehow the data releases would need to be universally identifiable, not unlike the ISBN scheme currently in use for books and magazines and the ISRC scheme used for recordings. This would allow researchers to uniquely refer to any data set that is used to underpin their research. This would also allow the world’s research libraries to ensure that they purchase and maintain all the data that becomes available by using some sort of master worldwide catalog that would hold pointers to all this data that is currently being held in research institutions. Such a catalog entry would represent additional meta-data for the data release and would represent a counterpart to an online library card catalog.
  • Legality – somehow any data release would need to respect any local Data Privacy and Protection laws of the country where the data resides. This could potentially limit the data that is generated in one country, say Germany to be held in that country only. I would think this could be easily accomplished as long as that country would be willing to host all its data in its research institutions.
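To give a feel for what the accessibility and identifiability items might boil down to, here is a small Python sketch that builds an XML descriptor for a data release using the standard library. Everything in it is an assumption on my part: the element and attribute names (data_release, table, field), the identifier "EXAMPLE-0001" and the sample schema are made up, and no such standard exists today as far as I know.

```python
import xml.etree.ElementTree as ET

def describe_release(release_id, country, tables):
    """Build a hypothetical XML descriptor for one data release.

    tables: {table_name: [(field_name, field_type), ...]}
    """
    root = ET.Element("data_release", id=release_id, country=country)
    for table_name, fields in tables.items():
        table = ET.SubElement(root, "table", name=table_name)
        for field_name, field_type in fields:
            ET.SubElement(table, "field", name=field_name, type=field_type)
    return root

# Made-up identifier and schema, for illustration only.
release = describe_release(
    "EXAMPLE-0001",
    "DE",
    {"messages": [("sent_at", "timestamp"), ("body", "text")]},
)
xml_text = ET.tostring(release, encoding="unicode")
```

The release identifier plays the role an ISBN plays for a book, the country attribute is where a legality rule (e.g., data must stay in Germany) could be checked, and the table/field entries are the minimum a researcher would need to load the flat files into a database.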

I am probably forgetting a dozen more considerations but this covers most of it.

How to get companies to release their data

One consideration that quickly comes to mind is how to compel companies to release their data in a timely fashion. I believe that data such as this is inherently valuable to a company but that its corporate value starts to diminish over time and after some time goes to 0.

However, the value to the world of such data encounters an inverse curve. That is, the longer away we are from a specific time period when that data was created, the more value it has for future research endeavors. Just consider what current researchers do with letters, books and magazine articles from the past when they are researching a specific time period in history.

But we need to act now. We are already over 7 years into the Facebook era and mobile phones have been around for decades now. We have probably already lost much of the mobile phone tracking information from the 80’s, 90’s, 00’s and may already be losing the data from the early ’10’s. Some social networks have already risen and gone into a long eclipse where historical data is probably their lowest concern. There is nothing that compels organizations to keep this data around, today.

Types of data to release

Obviously, any social networking data, mobile phone data, or email/chat/texting data should all be available to the world after 7 or more years.  Also the private photo libraries, video feeds, audio recordings, etc. should also be released if not already readily available. Less clear to me are utility data, such as smart power meter readings, water consumption readings, traffic tollway activity, etc.

I would say that one standard to use might be that if there is any current research activity based on private, corporate data, then that data should ultimately become available to the world. The downside to this is that companies may be more reluctant to grant such research if this is a criterion for releasing data.

But maybe the researchers themselves should be able to submit requests for data releases and that way it wouldn’t matter if the companies declined or not.

There is no way anyone could possibly identify all the data that future researchers would need. So I would err on the side of being more inclusive rather than less inclusive in identifying classes of data to be released.

The dawn of Psychohistory

The Uncharted book above seems to me to represent a first step to realizing a science of Psychohistory as envisioned in Asimov’s Foundation Trilogy. It’s unclear whether this will ever be a true, quantified scientific endeavor but with appropriate data releases, readily available for research, perhaps someday in the future we can help create the science of Psychohistory. In the meantime, through the use of judicious, periodic data releases and appropriate research, we can certainly better understand how the world works and just maybe, improve its internal workings for everyone on the planet.

Comments?

Picture Credit(s): Amazon and Wikipedia 

Cloud based database startups are heating up

IBM recently agreed to purchase Cloudant, an online database service using a NoSQL database called CouchDB. Apparently this is an attempt by IBM to take on Amazon and others that support cloud based services using a NoSQL database backend to store massive amounts of data.

In other news, Dassault Systems, a provider of 3D and other design tools, has invested $14.2M in NuoDB, a cloud-based NewSQL compliant database service provider. Apparently Dassault intends to start offering its design software as a service offering using NuoDB as a backend database.

We have discussed NewSQL and NoSQL databases before (see NewSQL and the curse of old SQL databases post) and there are plenty available today. So, why the sudden interest in cloud based database services? I tend to think there are a couple of different trends playing out here.

IBM playing catchup

In the IBM case there’s just so much data going to the cloud these days that IBM has to have a hand in it if it wants to continue to be a major IT service organization. Amazon and others are blazing this trail and IBM has to get on board or be left behind.

The NoSQL or non-relational database model allows for different types of data structuring than the standard tables/rows of traditional RDBMS databases. Specifically, NoSQL databases are very useful for data that can be organized in a tree (directed graph), graph (non-directed graph?) or key=value pairs. This latter item is very useful for Hadoop, MapReduce and other big data analytics applications. Doing this in the cloud just makes sense as the data can be both gathered and analyzed in the cloud without having anything more than the results of the analysis sent back to a requesting party.
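For readers who think in code, here is a loose Python illustration of those three shapes of data using plain dictionaries. It is not the API of CouchDB or any other NoSQL product, and the keys, documents and the helper function are made up for the example.

```python
# Plain-Python stand-ins for three NoSQL data shapes (illustrative only).

# 1. key=value pairs: an opaque value looked up by a single key.
kv = {"sensor:42:temp": "21.5"}

# 2. Tree (document): nested data retrieved as one unit, no joins needed.
doc = {
    "_id": "order-1",
    "customer": {"name": "Ann"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

# 3. Graph: nodes and the nodes they connect to, as adjacency sets.
graph = {"ann": {"bob"}, "bob": {"ann", "cy"}, "cy": {"bob"}}

def within_two_hops(graph, start):
    """Everyone reachable from `start` in one or two hops, excluding start."""
    one_hop = graph[start]
    two_hop = set().union(*(graph[node] for node in one_hop)) if one_hop else set()
    return (one_hop | two_hop) - {start}

total_qty = sum(item["qty"] for item in doc["items"])
friends_of_friends = within_two_hops(graph, "ann")
```

Expressing the same order or friendship data in tables/rows would take several tables and joins, which is the structuring difference the paragraph above is getting at.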

IBM doesn’t necessarily need a SQL database as it already has DB2. IBM already has a cloud-based DB2 service that can be implemented by public or private cloud organizations.  But they have no cloud based NoSQL service today and having one today can make a lot of sense if IBM wants to branch out to more cloud service offerings.

Dassault is broadening their market

As for the cloud based NuoDB NewSQL database, not all data fits the tree, graph, key=value pair structuring of NoSQL databases. Many traditional applications that use databases today revolve around SQL services and would be hard pressed to move off RDBMS.

Also, one ongoing problem with NoSQL databases is that they don’t really support ACID transaction processing and as such, often compromise on data consistency in order to support highly parallelizable activities. In contrast, a SQL database supports rigid transaction consistency and is just the thing for moving something like a traditional OLTP processing application to the cloud.
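To show what that transaction consistency buys you, here is a small Python sketch using the standard library's sqlite3 module as a stand-in SQL database (it has nothing to do with NuoDB). The accounts table, the names and the amounts are invented; the point is only that a failed transfer leaves no half-applied update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back on an exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            # This debit violates the CHECK constraint if src is overdrawn,
            # which undoes the credit above as well.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

ok = transfer(conn, "alice", "bob", 30)
after_ok = balances(conn)
failed = transfer(conn, "alice", "bob", 500)
after_failed = balances(conn)
```

In the failed case the credit to bob has already been applied when the debit to alice is rejected, and the rollback is what keeps the books balanced. That all-or-nothing behavior is what an OLTP application expects from its database.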

I would guess that how NuoDB handles the high throughput needed by its cloud service partners while still providing ACID transaction consistency is part of its secret sauce.

But what’s behind at least some of this interest may just be the internet of things (IoT)

The other thing that seems to be driving a lot of the interest in cloud based databases is the IoT. As more and more devices become internet connected, they will start to generate massive amounts of data. The only way to capture and analyze this data effectively today is with NoSQL and NewSQL database services. By hosting these services in the cloud, analyzing/processing/reporting on this tsunami of data becomes much, much easier.

Storing and analyzing all this IoT data should make for an interesting decade or so as the internet of things gets built out across the world. Cisco’s CEO, John Chambers, recently said that the IoT market will be worth $19T and will have 50B internet connected devices by 2020. Seems a bit of a stretch seeing as how they had just predicted (June 2013) that 10B devices would be attached to the internet by the middle of last year, but who am I to disagree.

There’s much more to be written about the IoT and its impact on data storage, but that will need to wait for another time… stay tuned.

Comments?

Photo Credit(s): database 2 by Tim Morgan 

 

Bringing compute to storage

Researchers at MIT (see Storage system for ‘big data’ dramatically speeds access to information) have come up with a novel storage cluster using FPGAs and flash chips to create a new form of database machine.

In their system they have an FPGA that supports limited computational offload/acceleration along with flash controller functionality for a set of flash chips. They call their system the BlueDBM or Blue Database Machine.

Their storage device is used as a PCIe flash card on a host PC. But in their implementation each of the PCIe flash cards is interconnected via an FPGA serial link. This approach creates a distributed controller across all the PCIe flash cards in the host servers and allows any host PC to access any of the flash card data at high speed.

They claim that node to node access latencies are on the order of 60-80 microseconds and their distributed controller can sustain 70% of theoretical system bandwidth.  In their prototype 4-node system their performance testing shows that it’s an order of magnitude faster than Microsoft Research’s CORFU (Cluster of Raw Flash Units).

Why FPGAs?

There are two novel aspects of their system: 1) the computational offload capabilities provided by the FPGA in front of the flash and 2) their implementation of a distributed controller across the storage nodes using the FPGA serial network.

Both of these characteristics depend on the FPGA. Also, by using FPGAs, system cost would be lower, and the FPGAs have a readily available, internally supported serial link that could be used.

But by using an FPGA, the computational capabilities are more limited and reconfiguring (re-programming) the storage cluster’s compute capabilities will take more time. If they used a more general purpose CPU in front of the flash chips they could support a much richer computational offload next to the storage chips. For example, in their prototype the FPGAs supported ‘word-counting’ offload functionality.
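The payoff of offloading something like word counting is mostly in how little data has to cross the link to the host. The Python below is a toy model of that trade-off, not BlueDBM's interface: the FlashNode class, its methods and the 8-byte result size are all assumptions of mine.

```python
class FlashNode:
    """Toy storage node that tracks how many bytes it ships to the host."""

    def __init__(self, pages):
        self.pages = pages        # list of bytes objects standing in for flash pages
        self.bytes_sent = 0

    def read_all(self):
        """No offload: every page crosses the link to the host."""
        for page in self.pages:
            self.bytes_sent += len(page)
            yield page

    def count_word(self, word):
        """Offload: count next to the flash, ship back only the result."""
        hits = sum(page.split().count(word) for page in self.pages)
        self.bytes_sent += 8      # assume the result fits in one 8-byte integer
        return hits


PAGES = [b"flash is fast", b"disk is slow but flash is not"]

# Host-side counting: pull everything, then count.
host_node = FlashNode(PAGES)
host_count = sum(page.split().count(b"flash") for page in host_node.read_all())

# Offloaded counting: ask the node to do it.
offload_node = FlashNode(PAGES)
offload_count = offload_node.count_word(b"flash")
```

Both paths give the same answer, but the offloaded one moves a fixed handful of bytes no matter how much flash sits behind the node. The catch, as noted above, is that every new kind of question needs new logic next to the storage, which is harder to do on an FPGA than on a general purpose CPU.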

Nonetheless, as most flash storage these days already has a fairly sophisticated controller, it’s not much of a stretch to bump this compute power up to something a bit more programmable and make its functionality more available via APIs. I suppose to gain equivalent performance this would need to use PCIe flash cards.

Where they would get the internal card to card serial link with general purpose CPUs may be a concern, which brings up another question.

The distributed controller gives them what exactly?

I believe that with a serial link based distributed controller they don’t need a full networking stack to access the PCIe flash storage on other nodes. This should save both access time and compute power.

In follow on work, the MIT researchers plan to implement a Linux based, distributed file system across the BlueDBM. This should give them a more normal storage stack for their system. How this may interact with the computational offload capabilities is another question.

I would have to say the reduction in access latency is what they were after with the distributed controller and they seem to have achieved it, as noted above. I suppose something similar could be done with multiple PCIe cards in the same host but with the potential to grow from 4 to 20 nodes, the BlueDBM starts to look more interesting.

What sort of application could use such a device?

They talked about performing near real-time analysis of scientific data or modeling all the particles in a simulation of the universe.  But just about any application that required extremely low access time with limited data services could potentially take advantage of their storage system. High Frequency Trading comes to mind.

As for big data applications, I haven’t heard of any big data deployments that use SSDs for basic storage let alone PCIe flash cards. I don’t believe there’s going to be a lot of big data analytics that has need for this fast a storage system.

~~~~

Utilizing excess compute power in a storage controller has been an ongoing dream for a long time. Aside from running VMs and a couple of other specialized services such as A-V scanning within a storage controller there hasn’t been a lot of this type of functionality  ever released for use inside a storage controller. With software defined storage coming online, it may not even make that much sense anymore.

MIT research’s BlueDBM solution is somewhat novel but unless they can more easily generalize the computational offload it doesn’t seem as if it will become a very popular way to go for analytics applications.

As for their reduction in access latencies, that might have some legs if they can put more storage capacity behind it and continue to support similar access latencies. But they will need to provide a more normal access method to it. The distributed Linux file system might be just the ticket to get this off into the market.

Comments?

Photo Credits: Lightening by Jolene

DS3, the BlackPearl and the way forward for … tape

Spectra Logic Summit 2013, Nathan Thompson, CEO talking about Spectra Logic's history

Just got back from an analyst summit with Spectra Logic. They announced a new interface to tape called Deep Simple Storage Service (DS3) and an appliance that implements this interface, named the BlackPearl. The intent is to broaden the use of tape to include today’s more web services oriented application environments.

The main problem addressed by the new interface is how to map an essentially sequential, high throughput, removable media device with long latency to first byte onto an essentially small file, get and put environment. The second question is whether there is a market for such services. I think Spectra Logic has answered the first set of questions and is about to embark on a journey to answer the second set of questions.

The new interface – it’s all about simplifying tape

The DS3 interface answers the first set of questions. With DS3, Spectra Logic has extended Amazon’s S3 interface to expose some of the sequentiality and removability of tape to the object storage world.

As you should recall, Amazon S3 is a RESTful, web interface that uses HTTP type GET and PUT commands to move data to and from the S3 storage service.  The data you are moving is considered an object and the object name or identifier is unique across the storage service. When you “PUT” an object you get to add key-value pairs of information called meta-data to the object. When you “GET” an object you retrieve the data from the storage service. The other thing one needs to be aware of is that you get and put objects into “BUCKET”s.
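To make the object model concrete, here is a minimal in-memory sketch in Python of the PUT/GET semantics described above. It is not Amazon's API or any real client library: the `ObjectStore` class and its method names are hypothetical, and a real S3 client would issue HTTP requests rather than update a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes
    metadata: dict = field(default_factory=dict)


class ObjectStore:
    """Toy stand-in for an S3-like service (hypothetical, in-memory only)."""

    def __init__(self):
        self._buckets = {}  # bucket name -> {object key -> StoredObject}

    def create_bucket(self, bucket):
        self._buckets.setdefault(bucket, {})

    def put(self, bucket, key, data, metadata=None):
        # PUT stores the data plus optional key-value metadata under a key.
        self._buckets[bucket][key] = StoredObject(data, dict(metadata or {}))

    def get(self, bucket, key):
        # GET returns the stored data for the named object.
        return self._buckets[bucket][key].data

    def get_metadata(self, bucket, key):
        # Returns a copy of the metadata attached at PUT time.
        return dict(self._buckets[bucket][key].metadata)


store = ObjectStore()
store.create_bucket("archive")
store.put("archive", "run-001.dat", b"sensor readings", {"project": "demo"})
print(store.get("archive", "run-001.dat"))
print(store.get_metadata("archive", "run-001.dat"))
```

The point of the sketch is only that every operation names a bucket and an object key; there are no directories, offsets or partial updates to worry about.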

With DS3, Spectra Logic has added essentially four new commands to the S3 protocol, which are:

  • Bulk Put – this provides a list of objects that one wants to “PUT” into a DS3 storage service and the response from the DS3 storage service is an ordered list of which objects to PUT in sequence and which DS3 storage server node (essentially an IP address) to send the data.
  • Bulk Get – this supplies a list of objects that one wants to GET from a DS3 storage service and the response is an ordered list of the sequence to get those objects and the node address to use for those object gets.
  • Export Bucket – this identifies a BUCKET that you wish to remove from a DS3 storage service.  Presumably the response would be where the bucket can be found,  the number of pieces of media to expect, and some identification of the media serial numbers that constitute a bucket on the DS3 storage service.
  • Import Bucket – this identifies a new bucket which will be imported into a DS3 storage service and will supply some necessary information such as how many pieces of media to expect and the serial numbers of the media.  Presumably the response will be a location which can be used to import the media.
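The value of the bulk commands is that the server, not the client, decides the order of transfers. The sketch below shows one way a server could turn an unordered Bulk Get request into an ordered plan, assuming it keeps a catalog of which tape and offset hold each object. The catalog layout, the `plan_bulk_get` function and the node addresses are all hypothetical illustrations, not Spectra Logic's actual implementation or wire format.

```python
from collections import namedtuple

# Hypothetical catalog entry: which tape holds the object and where on it.
Location = namedtuple("Location", ["tape", "offset"])

CATALOG = {
    "a.mov": Location("TAPE02", 900),
    "b.mov": Location("TAPE01", 500),
    "c.mov": Location("TAPE02", 100),
    "d.mov": Location("TAPE01", 50),
}

# Hypothetical mapping of tapes to the server node that will stream them.
NODE_FOR_TAPE = {"TAPE01": "10.0.0.11", "TAPE02": "10.0.0.12"}


def plan_bulk_get(names):
    """Order requested objects so each tape is read once, front to back."""
    ordered = sorted(names, key=lambda n: (CATALOG[n].tape, CATALOG[n].offset))
    return [(name, NODE_FOR_TAPE[CATALOG[name].tape]) for name in ordered]


plan = plan_bulk_get(["a.mov", "b.mov", "c.mov", "d.mov"])
for name, node in plan:
    print(name, "->", node)
```

A client that follows the returned order avoids forcing the tape drive to seek back and forth, which is where most of tape's latency penalty comes from.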

With these four simple commands and an appropriate DS3 client, DS3 server and DS3 storage backend one now has everything they need to support a removable media object store. I could see real value for export/import like this on the “rare occasion” when a  cloud service provider goes out of business.

The DS3 interface will be publicly available and the intent is to supply both Spectra Logic developed clients and ISV/partner developed DS3 clients, so as to provide removable media object stores for all sorts of other applications.

Spectra is providing developer tools and documentation so that anyone can write a DS3 client. To that end, the DS3 developer portal is up (couldn’t find a link this AM but will update this post when I find it) and available free of charge to anyone today (I believe you need to register to gain access to the documentation). They have a DS3 server simulator that DS3 client developers can use to test out and validate their client software. They also have a try & buy service for client developers.

Essentially, the combination of DS3 clients, DS3 servers and DS3 backend storage create a really deep archive for object data. It’s not intended for primary or secondary storage access but it’s big, cheap, and power/space efficient storage that can be very effective if used for archive data.

BlackPearl, the first DS3 Server

Their second announcement is the first implementation of a DS3 server, which Spectra Logic calls BlackPearl(™). The BlackPearl connects to one or more Spectra Logic tape libraries as a backend store, which together essentially provide a DS3 object storage archive. The DS3 server talks to DS3 clients on the front end. BlackPearl uses SAS or FC connected tape transports, which can be any transport currently supported by Spectra Logic tape libraries, including IBM TS1140, LTO-4, -5 and -6.

In addition to BlackPearl, Spectra Logic is releasing the first DS3 client for Hadoop. In this case, the DS3 client implements a new version of the Hadoop DistCp (distributed copy) command which can be used to create a copy of an HDFS directory tree onto a DS3 storage service.

Current BlackPearl hardware is a standard 2U server with four 400GB SSDs inside, which act as a sort of speed matching buffer between the object interface and the SAS/FC tape interface.

We only saw a configuration with one BlackPearl in operation (GA of BlackPearl is expected this December). But the plan is to support multiple BlackPearl appliances to talk with the same DS3 backend storage. In that case, there will be a shared database and (tape) resource scheduler across all the appliances in the cluster.

Yes, but what about the market?

It’s a gutsy move for someone like Spectra Logic to define a new open interface to deep storage. The fact that the appliance exists outside the tape library itself and could potentially support any removable media offers interesting architectural capabilities. The current (beta) implementation lacked some sophistication but the expectation is that much of this will be resolved by GA or over time through incremental enhancements.

Pricing is appealing. A configuration of BlackPearl appliance(s) with a T950 Spectra Logic tape library using LTO drives, which supports ~2.4PB of uncompressed archive data, has a purchase price of ~$0.10/GB. This compares especially well with current Amazon Glacier pricing of $0.01/GB/month: for the price of 10 months of Glacier storage you could own your own DS3 storage service.
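The break-even arithmetic is simple enough to spell out. The sketch below uses only the two prices and the capacity quoted above; the function name is my own, and decimal units (1PB = 1,000,000GB) are assumed.

```python
PURCHASE_PER_GB = 0.10        # ~$/GB, T950 with LTO drives (quoted above)
GLACIER_PER_GB_MONTH = 0.01   # $/GB/month, Glacier price (quoted above)
CAPACITY_GB = 2.4e6           # ~2.4PB, assuming 1PB = 1,000,000GB


def break_even_months(purchase_per_gb, rental_per_gb_month):
    """Months of rental that add up to the one-time purchase price."""
    return purchase_per_gb / rental_per_gb_month


months = break_even_months(PURCHASE_PER_GB, GLACIER_PER_GB_MONTH)
system_cost = PURCHASE_PER_GB * CAPACITY_GB

print(f"Break-even after about {months:.0f} months")
print(f"Implied purchase price for the full system: about ${system_cost:,.0f}")
```

This comparison covers purchase price only. It leaves out power, floor space, staff and media refresh on the owned side, and any retrieval charges on the Glacier side, so the real break-even point will differ.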

At larger capacities it is even cheaper: a BlackPearl using a T950 with TS1140 tape drives, supporting 6.4PB, comes in at $0.09/GB. Other configurations are available and in general bigger configurations are cheaper on $/GB and smaller ones more expensive. The configurations are specified by Spectra Logic to include all the media, tape drives and BlackPearl systems needed to support an archive object store.

As for markets, Spectra Logic already has beta interest from a large well known web services customer and a number of media & entertainment customers.

In the long run, Spectra Logic believes that if they can simplify access to tape for an application it’s well qualified to support (deep archive), this will enable new applications to take advantage of tape that weren’t even dreamed of before. By opening up an Object Store interface to tape, anyone currently using S3 is a potential customer.

Amazon announced earlier this year that they have over 2 trillion objects in their S3. And as far as I can tell (see my post Who’s the next winner in storage?) they are growing with no end in sight.

~~~~

Comments?

 

DR preparedness in real time

As many may have seen there has been serious flooding throughout the front range of Colorado.  At the moment the flooding hasn’t impacted our homes or offices but there’s nothing like a recent, nearby disaster to focus one’s thoughts on how prepared we are to handle a similar situation.

 

What we did when serious flooding became a possibility

As I thought about what I should be doing last night with flooding in nearby counties, I moved my computers, printer and some other stuff from the basement office to an upstairs area in case of basement flooding. I also moved my “Time Machine” backup disk upstairs, which holds the iMac’s backups (hourly for last 24 hrs, daily for past month and weekly backups [for as many weeks as can be held on a 2TB disk drive]). I have often depended on Time Machine backups to recover files I inadvertently overwrote, so it’s good to have around.
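For anyone curious how a retention schedule like that thins out backups over time, here is a rough Python sketch. It is my own approximation of the hourly/daily/weekly policy described above, not Apple's implementation; the function name and the choice to keep the newest backup in each day or ISO week are assumptions.

```python
from datetime import datetime, timedelta


def backups_to_keep(timestamps, now):
    """Keep all backups from the last 24h, one per day for 30 days, then one per week."""
    keep = set()
    seen_days, seen_weeks = set(), set()
    for ts in sorted(timestamps, reverse=True):  # newest first
        age = now - ts
        if age <= timedelta(hours=24):
            keep.add(ts)                         # hourly tier: keep everything
        elif age <= timedelta(days=30):
            if ts.date() not in seen_days:       # daily tier: newest of each day
                seen_days.add(ts.date())
                keep.add(ts)
        else:
            week = tuple(ts.isocalendar())[:2]   # (ISO year, ISO week number)
            if week not in seen_weeks:           # weekly tier: newest of each week
                seen_weeks.add(week)
                keep.add(ts)
    return sorted(keep)


# Simulate hourly backups over roughly the last 60 days.
now = datetime(2013, 9, 13, 8, 0)
timestamps = [now - timedelta(hours=h) for h in range(1, 24 * 60)]
kept = backups_to_keep(timestamps, now)
print(f"{len(timestamps)} backups taken, {len(kept)} retained")
```

The weekly tier is what grows without bound, which is why the number of weeks you can go back depends on the size of the backup disk.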

I also charged up all our mobiles, laptops & iPads and made sure software and email were as up-to-date as possible.  I packed up my laptop & iPad, with my most recent monthly and weekly backups and some other recent work printouts into my backpack and left it upstairs ready to go at a moment’s notice.

The next day post-mortem

This morning, with less panic and more time to think, I realized the printer was probably the least of my concerns, but the internet and telecommunications equipment (phones & headset) should probably have been moved upstairs as well.

Although we have multiple mobile phones, (AT&T) reception is poor in the office and home. It would have been pretty difficult to conduct business here with the mobile alone if we needed to.  I use a cable provider for business phones but also have a land line for our home. So I (technically) have triple backup for telecom, although to use the mobile effectively, we would have had to leave the office.

Internet access

Internet is another matter though. We also use cable for internet and the modem that supplies office internet connects to a cable close to where it enters the house/office. All this is downstairs, in the basement. The modem is powered using basement plugs (although it does have a battery as well) and there’s a hard ethernet link between the cable modem and an Airport Express base station (also downstairs) which provides WiFi to the house and LAN for the house iMacs/PCs.

Considering what I could do to make this a bit more flood tolerant, I should have probably moved the cable modem and Airport Express upstairs connecting it to the TV cable and powering it using upstairs power. Airport Express WiFi would have provided sufficient Internet access to work but with the modem upstairs connecting an ethernet cable to a desktop would also have been a possibility.

I do have the hotspot/tethering option for my mobile phone but as discussed above, reception is not that great. As such, it may have not sufficed for the household, let alone a work computer.

Internet is available at our local library and at many nearby coffee shops.  So, worst case was to take my laptop and head to a coffee shop/library that still had power/WiFi and camp out all day, for potentially multiple days.

I could probably do better with Internet access. With the WiFi and tethering capabilities available on cellular iPads these days, if I just purchased one for the office with a suitable data plan, I could use the iPad as another hot spot, independent of my mobile. Of course, I would probably go with a different carrier so that reception issues could also be minimized (hoping where one [AT&T] is poor the other [Verizon?] carrier would be fine).

Data availability

Data access outside of the Time Machine disk and the various hard drive backups was another item I considered this morning.  I have monthly hard-drive backups, normally kept in a safety deposit box at a local bank.

The bank is in the same flood/fire plain that I am in, but they tell me it’s floodproof, fireproof and earthquake proof.  Call me paranoid but I didn’t see any fire suppression equipment visible in the vault. The vault door, although a large quantity of steel and other metals, didn’t seem to have waterproof seals surrounding it.  As for earthquakes, concrete walls and a steel door don’t mean it’s earthquake proof.  But then again, I am paranoid; it would probably survive much better than anything in our home/office.

Also, I keep weekly encrypted backups in the house, alternating between two hard disk drives and keep the most recent upstairs. So between the weeklies, monthlies, and Time Machine I have three distinct tiers of data backups. Of course, the latest monthly was sitting in the house waiting to be moved to the safety deposit box – not good news.

I also have a (manual) copy of work data on the laptop, current to the last hard backup (also at home). So of my three tiers of backup, the current copy of every single one of them was in the home/office.

I could do better. Looking at Dropbox and Box, for about $100/year for 100GB (that’s Dropbox; Box is ~40% cheaper) I could keep our important work and home data on cloud storage and have access to it from any Internet accessible location (including with mobile devices) with proper security credentials. Not sure how long it would take to seed this backup: we have about 20GB of family and work office documents and probably another 120GB or so of photos that I would want to keep around, or about 140GB of info.  This could provide 5-way redundancy, with Time Machine, weekly hard drive and monthly hard drive backups, and now Box/Dropbox for a fourth (office and home) backup, with the laptop being a fifth (office only) backup.  Seems like cheap insurance at the moment.

The other thing that Box/Dropbox would do for me is to provide a synch service with my laptop so that files changed on either device would synch to the cloud and then be copied to all other devices.  This would replace my current 4th tier of (work) backups with a more current, cloud backup. It would also eliminate the manual copy process performed during every backup to keep my laptop up to date.

I have some data security concerns with using cloud storage, but according to Dropbox they use Amazon S3 for their storage and AES-256 data encryption so that others can’t read your data. They use SSL to transfer data to the cloud.

Where all the keys are held is another matter and with all the hullabaloo with NSA, anything on the internet can be provided to the gov’t with a proper request. But the same could be said for my home computer and all my backups.

There are plenty of other solutions here, Google drive and Microsoft’s SkyDrive to name just a few. But from what I have heard Dropbox is best, especially if you have a large number of files.

The major downsides (besides the cost) are that when you power up your system it can take longer while Dropbox scans for out-of-synch files, and the time it takes to seed your Dropbox account. This is all dependent on your internet access, but according to a trusted source Dropbox seeding starts with the smallest files and works up to the larger ones over time. So there is a good likelihood your office files (outside of PPT) might make it to the cloud sooner than your larger media, databases, and other large files.  I figure we have ~140GB to be copied to the cloud. I promise to update the post with the time it took to copy this data to the cloud.
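In the meantime, a back-of-the-envelope estimate is easy to do. The Python sketch below converts the ~140GB figure into upload hours for a few upload speeds; the speeds are assumptions for illustration, not measurements of my connection, and the calculation ignores protocol overhead, throttling and any deduplication or compression the service may do. The file names and sizes in the ordering example are likewise made up.

```python
def seed_hours(total_gb, upload_mbps):
    """Hours to upload total_gb at a sustained rate of upload_mbps (megabits/s)."""
    megabits = total_gb * 1000 * 8   # decimal GB -> megabytes -> megabits
    return megabits / upload_mbps / 3600


for mbps in (1, 5, 20):   # assumed upload speeds
    print(f"{mbps} Mbit/s: about {seed_hours(140, mbps):.0f} hours")

# Smallest-first seeding, as described above: office documents go up
# long before the large media files. Sizes are hypothetical, in GB.
files = {"budget.xls": 0.002, "deck.ppt": 0.05, "photos.zip": 4.0}
upload_order = sorted(files, key=files.get)
print(upload_order)
```

Even under generous assumptions this is a job measured in days rather than hours on a typical home uplink, so it is something to start well before the next flood warning.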

Power and other emergency preparedness

Power is yet another concern.  I have not taken the leap to purchase a generator for the home/office. But now think this unwise. Although power has gotten a lot more reliable in our home/office over the years, there’s still a real possibility that there could be a disruption. The areas with serious flooding all around us are having power blackouts this morning and no telling when their power might get back on. So a generator purchase is definitely in my future.

Listening to the news today, there was talk of emergency personnel notifying people that they had 30 minutes to evacuate their houses.  So, next time there is a flood/fire warning in the area I think I will take some time to pack up more than my laptop. Perhaps some other stuff like clothing and medicines that will help us survive and continue to work.

Food and water are also serious considerations. In Florida for hurricane preparedness  they suggest filling up your bathtubs with water or having 1 gallon of water per person per day set aside in case of emergency – didn’t do this last night but should have.  Florida’s family emergency preparedness plan also suggests enough water for 5-7 days.

I think we have enough dry food around the house to sustain us for a number of days (maybe not 7 though). If we consider what’s in the freezer and fridge, that probably goes up to a couple of weeks or so, assuming we can keep it cold.

Cooking food is another concern. We have propane and camp stoves, which would provide a rudimentary ability to cook outdoors if necessary, as well as an old charcoal grill and a bag of charcoal in our car-camping stuff. That should suffice for a couple of days but probably not a week.

As for important documents they are in that safety deposit box in our flood plain. (May need to rethink that). Wills and other stuff are also in the hands of appropriate family members and lawyers so that’s taken care of.

Another item on their list of things to have for a hurricane is flashlights and fresh batteries. These are all available in our camping stuff but would be difficult to access at a moment’s notice. So a couple of rechargeable flashlights that were easier to access might be a reasonable investment. The Florida plan further suggests you have a battery operated radio. I happen to have an old one upstairs with the batteries removed – just need to make sure to have some fresh batteries around someplace.

They don’t mention gassing up your car. But we do that as a matter of course anytime harsh weather is forecast.

I think this is about it for now. Probably other stuff I didn’t think of. I have a few fresh fire extinguishers around the home/office but have no pumps. May need to add that to the list…

~~~~

Comments?

Photo Credits: September 12 [2013], around 4:30pm [Water in Smiley Creek – Boulder Flood]