Cloud-based database startups are heating up

IBM recently agreed to purchase Cloudant, an online database service built on the NoSQL database CouchDB. Apparently this is an attempt by IBM to take on Amazon and others that offer cloud-based services with a NoSQL backend to store massive amounts of data.

In other news, Dassault Systèmes, a provider of 3D and other design tools, has invested $14.2M in NuoDB, a cloud-based NewSQL database service provider. Apparently Dassault intends to start offering its design software as a service, using NuoDB as the backend database.

We have discussed NewSQL and NoSQL databases before (see our NewSQL and the curse of old SQL databases post) and there are plenty available today. So why the sudden interest in cloud-based database services? I tend to think there are a couple of different trends playing out here.

IBM playing catchup

In the IBM case, there's just so much data going to the cloud these days that IBM has to have a hand in it if it wants to continue to be a major IT service organization. Amazon and others are blazing this trail and IBM has to get on board or be left behind.

The NoSQL, or non-relational, database model allows for different types of data structuring than the standard tables/rows of traditional RDBMS databases. Specifically, NoSQL databases are very useful for data that can be organized as a tree (directed graph), a graph (undirected graph) or key-value pairs. This latter form is very useful for Hadoop, MapReduce and other big data analytics applications. Doing this in the cloud just makes sense, as the data can be both gathered and analyzed in the cloud without having anything more than the results of the analysis sent back to the requesting party.
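
To make the contrast concrete, here is a toy Python sketch (purely illustrative, not tied to any particular product's schema) of the same sensor reading structured three ways: as a relational row, as a CouchDB-style document (tree), and as the key-value pair shape that MapReduce-style analytics consume directly.

```python
# One sensor reading, three ways it might be structured (toy illustration only)

# 1. Relational row -- fixed columns, as in a traditional RDBMS table
row = ("sensor-42", "2014-03-01T12:00:00Z", 21.7)

# 2. Document / tree -- nested JSON of the kind CouchDB-style stores hold
document = {
    "_id": "sensor-42",
    "location": {"site": "plant-3", "rack": "B7"},
    "readings": [{"ts": "2014-03-01T12:00:00Z", "temp_c": 21.7}],
}

# 3. Key-value pair -- the flat shape MapReduce jobs shuffle and aggregate
key, value = "sensor-42|2014-03-01T12:00:00Z", 21.7
```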

IBM doesn’t necessarily need a SQL database as it already has DB2. IBM already has a cloud-based DB2 service that can be implemented by public or private cloud organizations.  But they have no cloud based NoSQL service today and having one today can make a lot of sense if IBM wants to branch out to more cloud service offerings.

Dassault is broadening its market

As for the cloud-based NuoDB NewSQL database, not all data fits the tree, graph, or key-value structuring of NoSQL databases. Many traditional applications that use databases today revolve around SQL services and would be hard pressed to move off an RDBMS.

Also, one ongoing problem with NoSQL databases is that they don't really support ACID transaction processing and, as such, often compromise on data consistency in order to support highly parallelizable activities. In contrast, a SQL database supports strict transaction consistency and is just the thing for moving something like a traditional OLTP application to the cloud.
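
As a concrete illustration of what ACID transaction consistency buys you, here is a minimal funds-transfer sketch using Python's built-in sqlite3 module (a stand-in only; it says nothing about how NuoDB itself implements transactions). Either both updates commit together or neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('checking', 100), ('savings', 0)")
conn.commit()

def transfer(amount):
    """Move funds between accounts; both updates become visible together or not at all."""
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = 'checking'", (amount,))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = 'savings'", (amount,))
        # Consistency rule enforced inside the transaction: no overdrafts
        (bal,) = conn.execute("SELECT balance FROM accounts WHERE id = 'checking'").fetchone()
        if bal < 0:
            raise ValueError("insufficient funds")
        conn.commit()      # atomic commit of both updates
    except Exception:
        conn.rollback()    # neither update survives
        raise

transfer(30)
print(conn.execute("SELECT * FROM accounts ORDER BY id").fetchall())
# [('checking', 70), ('savings', 30)]
```

An eventually consistent NoSQL store would typically leave this kind of multi-record invariant up to the application to enforce.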

I would guess that how NuoDB handles the high throughput needed by its cloud service partners while still providing ACID transaction consistency is part of its secret sauce.

But what's behind all this? At least some of the interest may just be the internet of things (IoT)

The other thing that seems to be driving a lot of the interest in cloud-based databases is the IoT. As more and more devices become internet connected, they will start to generate massive amounts of data. The only way to capture and analyze this data effectively today is with NoSQL and NewSQL database services. By hosting these services in the cloud, analyzing/processing/reporting on this tsunami of data becomes much, much easier.

Storing and analyzing all this IoT data should make for an interesting decade or so as the internet of things gets built out across the world. Cisco's CEO, John Chambers, recently said that the IoT market will be worth $19T and will have 50B internet-connected devices by 2020. That seems a bit of a stretch, seeing as Cisco predicted (in June 2013) that there would be 10B devices attached to the internet by the middle of last year, but who am I to disagree.

There’s much more to be written about the IoT and its impact on data storage, but that will need to wait for another time… stay tuned.

Comments?

Photo Credit(s): database 2 by Tim Morgan 

 

Bringing compute to storage

Researchers at MIT (see Storage system for ‘big data’ dramatically speeds access to information) have come up with a novel storage cluster using FPGAs and flash chips to create a new form of database machine.

In their system, an FPGA provides limited computational offload/acceleration along with flash controller functionality for a set of flash chips. They call their system BlueDBM, or the Blue Database Machine.

Their storage device is used as a PCIe flash card in a host PC, but in their implementation each of the PCIe flash cards is interconnected via an FPGA serial link. This approach creates a distributed controller across all the PCIe flash cards in the host servers and allows any host PC to access any flash card's data at high speed.

They claim that node-to-node access latencies are on the order of 60-80 microseconds and that their distributed controller can sustain 70% of theoretical system bandwidth. Performance testing of their prototype 4-node system shows that it's an order of magnitude faster than Microsoft Research's CORFU (Cluster of Raw Flash Units).

Why FPGAs?

There are two novel aspects to their system: 1) the computational offload capability provided by the FPGA in front of the flash, and 2) the distributed controller implemented across the storage nodes using the FPGA serial network.

Both of these characteristics depend on the FPGA. Using FPGAs also keeps system cost down, and the FPGAs had a readily available, internally supported serial link that could be used.

But by using an FPGA, the computational capabilities are more limited and re-configuring (re-programming) the storage cluster's compute capabilities takes more time. If they used a more general purpose CPU in front of the flash chips they could support a much richer computational offload next to the storage chips. For example, in their prototype the FPGAs supported 'word-counting' offload functionality.

Nonetheless, as most flash storage these days already has a fairly sophisticated controller, it's not much of a stretch to bump this compute power up to something a bit more programmable and make its functionality more available via APIs. I suppose to gain equivalent performance this would need to use PCIe flash cards.
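
To show the general idea of pushing a query down to the storage node and returning only the result, here is a toy Python sketch. The StorageNode class and cluster_word_count helper are hypothetical illustrations of the offload pattern, not the BlueDBM interface.

```python
class StorageNode:
    """Toy stand-in for a 'smart' flash node that can run a count next to its own data."""
    def __init__(self, blocks):
        self.blocks = blocks                          # text blocks stored on this node

    def count_words(self, term):
        # The offload: scan locally, return only a tiny result, never the raw data
        return sum(block.split().count(term) for block in self.blocks)

def cluster_word_count(nodes, term):
    """Host-side aggregation across nodes, analogous to querying a distributed controller."""
    return sum(node.count_words(term) for node in nodes)

nodes = [StorageNode(["the quick brown fox", "the lazy dog"]),
         StorageNode(["the end"])]
print(cluster_word_count(nodes, "the"))               # -> 3
```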

Where they would get the internal card-to-card serial link with general purpose CPUs may be a concern, which brings up another question.

The distributed controller gives them what exactly?

I believe that with a serial link based distributed controller they don’t need a full networking stack to access the PCIe flash storage on other nodes. This should save both access time and compute power.

In follow on work, the MIT researchers plan to implement a Linux based, distributed file system across the BlueDBM. This should give them a more normal storage stack for their system. How this may interact with the computational offload capabilities is another question.

I would have to say the reduction in access latency is what they were after with the distributed controller and they seem to have achieved it, as noted above. I suppose something similar could be done with multiple PCIe cards in the same host but with the potential to grow from 4 to 20 nodes, the BlueDBM starts to look more interesting.

What sort of application could use such a device?

They talked about performing near real-time analysis of scientific data or modeling all the particles in a simulation of the universe.  But just about any application that required extremely low access time with limited data services could potentially take advantage of their storage system. High Frequency Trading comes to mind.

As for big data applications, I haven't heard of any big data deployments that use SSDs for basic storage, let alone PCIe flash cards. I don't believe there's going to be a lot of big data analytics that needs this fast a storage system.

~~~~

Utilizing excess compute power in a storage controller has been an ongoing dream for a long time. Aside from running VMs and a couple of other specialized services, such as A-V scanning, there hasn't been a lot of this type of functionality ever released for use inside a storage controller. With software defined storage coming online, it may not even make that much sense anymore.

The MIT researchers' BlueDBM solution is somewhat novel, but unless they can more easily generalize the computational offload it doesn't seem as if it will become a very popular way to go for analytics applications.

As for their reduction in access latencies, that might have some legs if they can put more storage capacity behind it and continue to support similar access latencies. But they will need to provide a more normal access method to it. The distributed Linux file system might be just the ticket to get this off into the market.

Comments?

Photo Credits: Lightening by Jolene

EMCworld 2013 Day 2

The first session of the day was with Joe Tucci, EMC Chairman and CEO. He talked about the trends transforming IT today: Mobile, Cloud, Big Data and Social Networking. He then discussed IDC's 1st, 2nd and 3rd computing platform framework, where the first was mainframe, the second was client-server and the third is mobile. Each of these platforms had winners and losers. EMC definitely wants to be one of the winners in the coming age of mobile and is charting multiple paths to get there.

Mainly they will use Pivotal, VMware, RSA and their software defined storage (SDS) product to go after the 3rd platform applications.  Pivotal becomes the main enabler to help companies gain value out of the mobile-social networking-cloud computing data deluge.  SDS helps provide the different pathways for companies to access all that data. VMware provides the software defined data center (SDDC) where SDS, server virtualization and software defined networking (SDN) live, breathe and interoperate to provide services to applications running in the data center.

Joe then talked about the federation of EMC companies: EMC, VMware, RSA and now Pivotal. He sees these four brands as almost standalone entities whose identities will remain distinct and separate for a long time to come.

Joe mentioned the internet of things, or the sensor cloud, as opening up new opportunities for data gathering and analysis that dwarf what's coming from mobile today. He quoted IDC estimates that by 2020 there will be 200B devices connected to the internet; today there are just 2 to 3B devices connected.

Pivotal’s debut

Paul Maritz, Pivotal CEO, got up and took us through the Pivotal story. Essentially they have three components: a data fabric, an application development fabric and a cloud fabric. He believes mobile and the internet of things will open up new opportunities for organizations to gain value from their data wherever it may lie, going well beyond what's available today. These activities center around consumer-grade technologies which 1) store and reason over very large amounts of data; 2) use rapid application development; and 3) operate at scale in an entirely automated fashion.

He mentioned that humans are a serious risk to continuous availability. Automation is the answer to the human problem for the “always on”, consumer grade technologies needed in the future.

Parts of Pivotal come from VMware, Greenplum and EMC, with some available today as specific components. However, by year end they will come out with Pivotal One, which will be the first framework with the data, app development and cloud fabrics coupled together.

Paul called Pivotal Labs the special forces of his services organization, helping leading tech companies pull together the awesome apps needed for the technology of tomorrow, built on extreme programming, agile development and very technically astute individuals. Also, CETAS was mentioned as an analytics-as-a-service group currently providing analytics capabilities to gaming companies doing log analysis, but Paul believes there's a much broader market coming.

Paul also showed some impressive numbers on their new Pivotal HD/HAWQ offering, which handled many more queries than Hive and Cloudera Impala. In essence, parts of Pivotal are available today, but later this year the whole cloud-app dev-big data framework will be released for the first time.

Next up was a media-analyst event where David Goulden, EMC President and COO, gave a talk on where EMC has come from and where they are headed from a business perspective.

Then he and Joe did a Q&A with the combined media and analyst community.  The questions were mostly on the financial aspects of the company rather than their technology, but there will be a more focused Q&A session tomorrow with the analyst community.

Joe was asked about Vblock status. He said last quarter they announced it had reached a $1B revenue run rate, which he said was the fastest in the industry. Joe mentioned EMC is all about choice, such as the different Vblock product offerings, the VSPEX offerings and now ViPR, providing more choice in storage.

Sometime today Joe also mentioned that they don't really do custom hardware anymore. He said that of the 13,000 engineers they currently have, ~500 are hardware engineers, and that they have only one internally designed ASIC in currently shipping product.

Then Paul got up and did a Q&A on Pivotal. He believes there's definitely an opportunity in providing services surrounding big data and specifically mentioned CETAS' analytics-as-a-service offering as well as the Pivotal Labs professional services organization. Paul hopes that Pivotal will be a $1B revenue company in 5 years. They already have $300M, so it's well on its way.

Next, there was a very interesting, visually stimulating media and analyst session from Jer Thorp, co-founder of The Office for Creative Research. About the best way to describe him is as a data visualization scientist.

He took a NASA Kepler research paper with very dry data and brought it to life. He also did a number of analyses of public Twitter data, showing Twitter user travel patterns, a Twitter "good morning" analysis, retweets of NYT articles, etc. He also showed a video depicting people on airplanes around the world; it's a little known fact, he said, but over a million people are in the air at any given moment of the day.

Jer talked about the need for data ethics and an informed data ownership discussion with people about the breadcrumbs they leave around in the mobile connected world of today. If you get a chance, you should definitely watch his session.

Next, Juergen Urbanski, CTO of T-Systems, got up and talked about the importance of Hadoop to what they are trying to do. He mentioned that in 5 years, 80% of all new data will land on Hadoop first. He showed how Hadoop is entirely different from what went before and will take T-Systems in vastly new directions.

Next up in the EMCworld main hall was Pat Gelsinger, VMware CEO, with his keynote on VMware. The story was all about the Software Defined Data Center (SDDC) and the components needed to make it happen. He said data was the fourth factor of production, behind land, capital and labor.

Pat said that networking was becoming a barrier to the realization of the SDDC and that they had been working on it for some time prior to the Nicira acquisition. But now they are hard at work merging the organic VMware development with Nicira to create VMware NSX, a new software defined networking layer that will be deployed as part of the SDDC.

Pat also talked a little bit about how ViPR and other software defined storage solutions will provide the ease of use they are looking for to be able to deploy VMs in seconds.

Pat demoed a solution specifically designed for Hadoop clusters and was able to configure a Hadoop cluster with about 4 clicks and have it start deploying. It was going to take 4-6 minutes to get fully provisioned, so they had a couple of clusters already configured, ran a pseudo Hadoop benchmark on one using visual recognition, and showed how vCenter could be used to monitor the cluster in real time.

Pat mentioned that there are over 500,000 physical servers running Hadoop. Needless to say VMware sees this as a prime opportunity for new and enhanced server virtualization capabilities.

That’s about it for the major keynotes and media sessions from today.

Tomorrow looks to be another fun day.

Cheap phones + big data = better world

Facebook friend carrousel by antjeverena (cc) (from flickr)

Read an article today on the MIT Technology Review website (Big data from cheap phones) that shows how cheap phones, call detail records (CDRs) and other phone logs can be used to help fight disease and understand disaster impacts.

Cheap phones generate big data

In one example, researchers took cell phone data from Kenya and used it to plot people's movements throughout the country. What they were looking for were people who frequented malaria hot spots, so that they could try to intervene in the transmission of the disease. Researchers discovered one region (cell tower) with many people frequenting a particularly bad location for malaria. It turned out the region they identified had a large plantation with many migrant workers, and these workers moved around a lot. To reduce transmission of the disease, public health authorities could target this region with more bed nets or try to reduce infestation at the source of the disease. In either case, people's mobility was easier to see with cell phone data than by actually putting people on the ground and counting where people go or come from.
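
The core of this kind of analysis is just grouping call detail records by subscriber and tower. A toy Python sketch (with made-up records, not the researchers' actual pipeline) might look like this:

```python
from collections import Counter

# Hypothetical call-detail-record sightings: (subscriber_id, cell_tower_id)
cdrs = [
    ("a", "tower-12"), ("a", "hotspot-3"), ("b", "tower-12"),
    ("c", "hotspot-3"), ("c", "tower-40"), ("b", "tower-40"),
]
HOTSPOTS = {"hotspot-3"}   # towers covering known malaria hot spots

# Which subscribers were seen at a hotspot, and which other towers do they frequent?
exposed = {sub for sub, tower in cdrs if tower in HOTSPOTS}
origin_towers = Counter(tower for sub, tower in cdrs
                        if sub in exposed and tower not in HOTSPOTS)

# Regions whose population frequents the hotspot -- candidates for bed nets, etc.
print(origin_towers.most_common())
```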

In another example, researchers took cell phone data from Haiti before and after the earthquake and were able to calculate how many people were in the region hardest hit. They were also able to identify how many people left the region and where they went. As a follow on to this, researchers were able to show in real time how many people had fled the subsequent cholera epidemic.

Gaining access to cheap phone data

Most of this call detail record data is limited to specific researchers for very specialized activities requested by the host countries. But recently Orange released 2.5 billion cell phone call and text data records, covering five months of activity for five million of its customers in Ivory Coast. They released the data to the public under some specific restrictions in order to see what data scientists could do with it. The papers detailing these activities will be published at an MIT Data for Development conference.

~~~~

Big data's contribution to a better world is just beginning, but from what we see here there's real value in data that already exists, if only that data were made more widely available.

Comments?

Object Storage Summit wrap up

Attended ExecEvent’s first Next-Gen Object Storage Summit in Miami this past week.  Learned a lot there and met a lot of the players and movers in this space.  Here is a summary of what happened during the summit.

Janae starting a debate on Object Storage

Spent most of the morning of the first day discussing some parameters of object storage in general. Janae got up and talked about 4 major adopters of object storage:

  1. Rapid Responders – these customers have data in long term storage that just keeps building and needs to be kept on scalable storage. They believe someday they will need access to it but have no idea when; when they want it, though, they want it fast. Rapid responder adoption is based on this unpredictability of access. As such, having the data on scalable disk object storage makes sense. Some examples include black operations sites with massive surveillance feeds, which may be needed fast sometime after initial analysis, and medical archives.
  2. Distributed (content) Enterprises – geographically distributed enterprises with users around the globe that need shared access to data. Distributed enterprises often have 100 or so users dispersed around the globe who want shared access, and object storage can disperse the data to provide local caching across the world for better data and metadata latency. Media and Entertainment are key customers in this space, but design shops that follow the sun have the same problem.
  3. Private Cloud(y) – data centers adopt the cloud for a number of reasons, but sometimes it's just mandated. In these cases, direct control over cloud storage with the economics of the major web service providers can be an alluring proposition. Some object storage solutions roll in with cloud-like economics plus on-premises control and responsiveness, the best of both worlds. Enterprise IT organizations forced to move to the cloud are in this category.
  4. Big Hadoop(ers) – lots of data to analyze but with no understanding of when it will be analyzed. Some Hadoopers can schedule analytics, but most don't know what they will want until they finish the last analysis. In these cases, having direct access to all the data on an object store can cut setup time considerably.

There were other aspects of Janae's session but these seemed of most interest. We spent the rest of the morning getting an overview of Scality's customers. At the end of the morning we debated aspects of object storage. I thought Jean-Luc from Data Direct Networks had the best view of this when he said object storage is, at its core, data storage that has scalability, resilience, performance and distribution.

The afternoon sessions were deep dives with the sponsors of the Object Summit.

  • Nexsan talked about their Assureon product line (EverTrust acquisition). SHA1 and MD5 hashes are made of every object; as objects are replicated to other sites, the hashes are checked to ensure the data hasn't been corrupted, and they are periodically re-checked (every 90 days) to verify the data is still correct. If an object is corrupted, another replica is obtained and re-instated (see the sketch after this list). In addition, Assureon has some unique immutable access logs that provide an almost "chain of custody" for objects in the system. Finally, Assureon uses a Microsoft Windows Agent that is Windows Certified and installs without disruption, allowing any user (or administrator) to identify files, directories, or file systems to be migrated to the object store.
  • Cleversafe was up next and talked about their market success with their distributed dsNet® object store and provided some proof points. [Full disclosure: I have recently been under contract with Cleversafe]. For instance, today they have over 15 billion objects under management and deployments with over 70PB in production, and they have shipped over 170PB of dsNet storage to customers around the world. Cleversafe has many patents covering their information dispersal algorithms and performance optimization. Some of their sites are Federal government installations, with a few web-intensive clients as well, the most notable being Shutterfly, the photo sharing site. Although dsNet is inherently geographically distributed, all these "sites" could easily be configured across 1 to 3 locations or more for simpler DR-like support.
  • Quantum talked about their Lattus product, built on top of Amplidata's technology. Lattus uses 36TB storage nodes, controller nodes that provide erasure coding for geographical data integrity, and NAS gateway nodes. The NAS gateway provides CIFS and NFS access to objects. The Lattus-C deployment is a forever disk archive for cloud-like deployments; it erasure codes the objects in the system, which are then dispersed across up to 3 sites (today, with 4-site dispersal under test). On their roadmap, Lattus-M will be a managed file system offering that operates in conjunction with their StorNext product with ILM-like policy management. Farther out on the roadmap is Lattus-H, which offers an object repository for Hadoop clusters that can gain rapid access to data for analysis.
  • Scality talked about their success in major multi-tenant environments that need rock-solid reliability and great performance. Their big customers are major web providers that supply email services. Scality is a software product that builds a ring of object storage nodes supplying the backend storage where the email data is held. Scality is priced per end-user capacity stored. Today the product supports RESTful interfaces, CDMI (think email storage interface) and the Scality File System (based on FUSE, a POSIX-compliant Linux file system); an NFS interface is coming early next year. With the Scality Ring, nodes can go down but the data is still available with rapid response time, and nodes can be replicated or spread across multiple locations.
  • Data Direct Networks (DDN) is coming at the problem from the High Performance Computing market and has a very interesting, scalable solution with extreme performance. DDN products are featured in many academic labs and large web 2.0 environments. The WOS object storage supports just about any interface you want: Java, PHP, Python, RESTful, NFS/CIFS, S3 and others. They claim very high performance, something on the order of 350MB/sec read and 250MB/sec write (I think per node) of object data transfers. Nodes come in 240TB units and one can have up to 256 nodes in a WOS system. One customer uses a WOS node to land local sensor streams and then ships the data to other locations for analysis.
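
For what it's worth, the kind of periodic fixity check Nexsan described (hash at ingest, re-hash later, repair from a replica on mismatch) boils down to something like this minimal Python sketch, an illustration of the general technique rather than Assureon's implementation:

```python
import hashlib

def fingerprints(data: bytes):
    """Compute the SHA-1 and MD5 digests stored alongside an object at ingest."""
    return hashlib.sha1(data).hexdigest(), hashlib.md5(data).hexdigest()

def verify(data: bytes, stored_sha1: str, stored_md5: str) -> bool:
    """Periodic integrity check: recompute and compare.
    A mismatch would trigger repair from another replica."""
    sha1, md5 = fingerprints(data)
    return sha1 == stored_sha1 and md5 == stored_md5

obj = b"archived object contents"
sha1, md5 = fingerprints(obj)          # recorded when the object is stored
assert verify(obj, sha1, md5)          # re-run on a schedule, e.g. every 90 days
```
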
View from the Summit balcony, 2nd day

The next day was spent with Nexsan and DDN talking about their customer base and some of their success stories. We spent the remainder of the morning talking about the startup world which surrounds some object storage technology and the inhibiters to broader adoption of the technology.

In the end, there's a lot of education needed to jump-start this marketplace: education about both the customer problems that can be solved with object stores and the product differences that exist today. I argued (forcefully) that what's needed to accelerate adoption is some standard interface protocol that all object storage systems could utilize. Such a standard protocol would enable a more rapid ecosystem build-out and ultimately more enterprise adoption.

One key surprise to me was that the problems their customers are seeing are problems all IT customers will have some day. Jean-Luc called it the democratization of HPC problems. Big Data is driving object storage requirements into the enterprise in a big way…

Comments?

Big science/big data ENCODE project decodes “Junk DNA”

Project ENCODE (ENCyclopedia Of DNA Elements) results were recently announced. The ENCODE project was done by a consortium of over 400 researchers from 32 institutions and has deciphered the functionality of so-called junk DNA in the human genome. They have determined that junk DNA is actually used to regulate gene expression; in other words, junk DNA really provides the on-off switches for protein-encoding DNA. ENCODE project results were published by Nature, Scientific American, the New York Times and others.

The paper in Nature, ENCODE Explained, is probably the best introduction to the project. But probably the best resources on the project's computational aspects are these pieces at Nature: The making of ENCODE: lessons for big-data projects by Ewan Birney and ENCODE: the human encyclopedia by Brendan Maher.

I have been following the Bioinformatics/DNA scene for some time now. (Please see Genome Informatics …, DITS, Codons, & Chromozones …, DNA Computing …, DNA Computing … – part 2).  But this is perhaps the first time it has all come together to explain the architecture of DNA and potentially how it all works together to define a human.

Project ENCODE results

It seems like there were at least four major results from the project.

  • Junk DNA is actually programming for protein production in a cell. Scientists previously estimated that <3% of human DNA's over 3 billion base pairs encode for proteins. Recent ENCODE results indicate that at least 9%, and potentially as much as 50%, of human DNA regulates when that protein-encoding DNA is used.
  • Regulatory DNA undergoes a lot of evolutionary drift; that is, it seems to be heavily modified across species. For instance, protein-encoding genes seem to be fairly static and differ very little between species. On the other hand, regulatory DNA varies widely between these very same species. One downside to all this evolutionary variation is that regulatory DNA also seems to be the location of many inherited diseases.
  • Project ENCODE has further narrowed the "known unknowns" of human DNA. For instance, about 80% of human DNA is transcribed by RNA. Which means, on top of the <3% protein-encoding DNA and the ~9-50% regulatory DNA already identified, there is another 68% to 27% of DNA that does something important to help cells transform DNA into life-giving proteins. What that residual DNA does is TBD and is the subject of the next phase of the ENCODE project (see below).
  • There is cell-specific regulatory DNA. That is, there is regulatory DNA that is specifically activated if it's a bone cell, skin cell, liver cell, etc. Such cell-specific regulatory DNA helps to generate the cells necessary to create each of our organs and regulate their functions. I suppose this was a foregone conclusion, but it's proven now.

There is promoter regulatory DNA, located ahead of and in close proximity to the proteins being encoded, and enhancer/inhibitor regulatory DNA, located a long DNA distance away from the proteins it regulates.

It seems we are seeing two different evolutionary time frames represented in the promoter vs. enhancer/inhibitor regulatory DNA. Whereas promoter DNA seems closely associated with protein-encoding DNA, the enhancer DNA seems more like patches or hacks that fixed problems in the original promoter-protein encoding DNA sequences, sort of like patch Tuesday DNA that fixes problems with the original regulation activity.

While I am excited about the Project ENCODE results, I find the big science/big data aspects somewhat more interesting.

Genome Big Science/Big Data at work

Some stats from the ENCODE Project:

  • Almost 1650 experiments on around 180 cell types were conducted to generate data for the ENCODE project.   All told almost 12,000 files were analyzed from these experiments.
  • 15TB of data were used in the project
  • ENCODE project internal Wiki had 18.5K page edits and almost 250K page views.

With this much work going on around the world, data quality control was a necessary, ongoing consideration. It took until about halfway into the project before they figured out how to define and assess data quality from experiments. What emerged was a set of published data standards (see the data quality page on the UCSC website) used to determine whether experimental data were to be accepted or rejected as input to the project. In the end, they retrospectively applied the data quality standards to the earlier experiments and had to jettison some that were scientifically important but exhibited low data quality.

There was a separation between the data generation team (experimenters) and the data analysis team. The data quality guidelines represented a key criterion governing the interactions between these two teams.

Apparently the real analysis began when they started layering the base level experiments on top of one another.  This layering activity led to researchers further identifying the interactions and associations between regulatory DNA and protein encoding DNA.

All the data from the ENCODE project have been released and are available to anyone interested. They have also provided search and browser capabilities for the data. All this can be found on the UCSC website mentioned above. Further, from this same site one can download the software tools used to analyze, browse and search the data if necessary.

This multi-year project had an interesting management team that created a “spine of leadership”.  This team consisted of a few leading scientists and a few full time scientifically aware project officers that held the project together, pushed it along and over time delivered the results.

There were also a set of elaborate rules that were crafted so that all the institutions, researchers and management could interact without friction.  This included rules guiding data quality (discussed above), codes of conduct, data release process, etc.

What, no Hadoop?

What I didn’t find was any details on the backend server, network or storage used by the project or the generic data analysis tools.  I suspect Hadoop, MapReduce, HBase, etc. were somehow involved but could find no reference to this.

I expected, with the different experiments and the wide variety of data fusion going on, that there would be some MapReduce scripting to transcribe the data so it could be further analyzed by other project tools. Alas, I didn't find any information about such tools in the 30+ research papers published in the last week or so.
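
For the record, the sort of MapReduce scripting I had in mind is a simple Hadoop Streaming mapper that re-keys experiment records by genomic region so a reducer can layer many experiments on the same region. This is purely speculative; the field layout and bin size below are hypothetical, not anything taken from the ENCODE pipeline.

```python
#!/usr/bin/env python3
# Hypothetical Hadoop Streaming mapper: re-key experiment records by genomic region
# so downstream reducers can layer results from many experiments on the same region.
import sys

BIN_SIZE = 10_000   # assumed bin width in base pairs

for line in sys.stdin:
    try:
        cell_type, chrom, pos, signal = line.rstrip("\n").split("\t")
        region = f"{chrom}:{int(pos) // BIN_SIZE}"
        print(f"{region}\t{cell_type}\t{signal}")   # key<TAB>value for the shuffle phase
    except ValueError:
        continue   # skip malformed records
```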

It looks like the genomic analysis tools used in the ENCODE project are all open source. They use the OpenHelix project deliverables. But even a search of the project didn't reveal any Hadoop references.

~~~~

The ENCODE pilot project (2003-2007) cost ~$53M, the full ENCODE project's recent results cost somewhere around $130M, and they are now looking at the next stage of the ENCODE project, estimated to cost ~$123M. Of course there are 1000s more human cell types that need to be examined and ~30% more DNA that needs to be figured out. But this all seems relatively straightforward now that the ENCODE project has laid out an architectural framework for human DNA.

Anyone out there who knows more about the data processing/data analytics side of the ENCODE project, please drop me a line. I would love to hear more about it, or you can always comment here.

Comments?

Image: From Project Encode, Credits: Darryl Leja (NHGRI), Ian Dunham (EBI)

New cloud storage and Hadoop managed service offering from Spring SNW

Strange Clouds by michaelroper (cc) (from Flickr)

Last week I posted my thoughts on Spring SNW in Dallas, but there were two more items that keep coming back to me (aside from the tornados). The first was a new cloud storage startup called Symform and the other was an announcement from SunGard about their new Hadoop managed services offering.

Symform

Symform offers an interesting alternative in cloud storage that avoids the build-out of large multi-site data centers and instead uses your desktop storage as a sort of crowd-sourced storage cloud, a kind of bit-torrent cloud storage.

You may recall I discussed such peer-to-peer cloud storage and computing services in a posting a couple of years ago. It seems Symform has taken this task on, at least for storage.

A customer downloads software (Windows or Mac), which is installed and executes on your desktop. The first thing you have to do, after providing security credentials, is identify which directories will be moved to the cloud; the second is to say whether you wish to contribute to Symform's cloud storage and where that contributed storage is located. Symform maintains a cloud management data center which records all the metadata about your cloud-resident data and everyone's contributed storage space.

Symform cloud data is split up into 64MB blocks and encrypted (AES-256) using a randomly generated key (known only to Symform). Each block is then broken up into 64 fragments, with 32 parity fragments added using erasure coding, and the resulting 96 fragments are written to 96 different locations. With this arrangement, the system could potentially lose 31 fragments out of the 96 and still reconstitute your 64MB of data. The metadata supporting all this activity sits in Symform's data center.
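
Assuming a standard erasure code in which roughly any 64 of the 96 fragments are enough to rebuild a block, you can estimate block durability with a simple binomial calculation. The per-contributor availability figure below is my assumption, not a Symform number:

```python
import math

DATA_FRAGMENTS, PARITY_FRAGMENTS = 64, 32
TOTAL = DATA_FRAGMENTS + PARITY_FRAGMENTS     # 96 fragments per 64MB block
p_up = 0.95                                   # assumed availability of any one contributor

# Probability that at least DATA_FRAGMENTS of the TOTAL fragments are reachable,
# i.e. the block can be reconstituted (fragment failures assumed independent).
p_block_ok = sum(
    math.comb(TOTAL, k) * p_up**k * (1 - p_up)**(TOTAL - k)
    for k in range(DATA_FRAGMENTS, TOTAL + 1)
)
print(f"P(block recoverable) ~= {p_block_ok:.10f}")
```

Even with a fairly pessimistic 95% per-contributor availability, the chance of having fewer than 64 of 96 fragments reachable is vanishingly small, which presumably is what lets Symform rely on intermittently available desktops.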

It's unclear to me what you have to provide as far as ongoing access to your contributed storage. I would guess you need to provide 7X24 access to this storage, but the 32 parity fragments are there to cover possible network/power failures outside your control.

Cloud storage performance is an outcome of the many fragments being dispersed throughout their worldwide storage cloud. It's similar to a bit-torrent stream, with all 96 locations participating in reconstituting your 64MB of data. Of course, not all 96 locations have to be active, just some subset of more than 64 fragments, but it's still cloud storage, so data access latency is on the order of internet time (many seconds). Nonetheless, once data transfer begins, throughput performance can be pretty high, which means your data should arrive shortly thereafter.

Pricing seemed comparable to other cloud storage services, with a monthly base access fee and a storage amount fee over that. But you can receive significant discounts if you contribute storage, and your first 200GB is free as long as you contribute 200GB of storage space to the Symform cloud.

Sungard’s new Apache Hadoop managed service

Hadoop Logo (from http://hadoop.apache.org website)

We are well aware of SunGard's business continuity/disaster recovery (BC/DR) services, an IT mainstay for decades now. But sometime within the last decade or so SunGard has been expanding outside this space by moving into managed availability services.

Apparently this began when SunGard noticed that the number of new web apps being deployed each year exceeded the number of new client-server apps. Then along came virtualization, which reduced the need for lots of server and storage hardware for BC/DR.

As evidence of this trend, last year SunGard announced a new enterprise-class computing cloud service. But in last week's announcement, SunGard teamed up with EMC Greenplum to supply an enterprise-ready Apache Hadoop managed service offering.

Recall that EMC Greenplum offers its own supported Apache Hadoop distribution, Greenplum HD. SunGard is basing their service on this distribution. But there's more.

In conjunction with Hadoop, SunGard adds Greenplum appliances. With this configuration SunGard can load Hadoop-processed and structured data into a Greenplum relational database for high performance data analytics. Once there, standard SQL analytics and queries can be used to analyze the data.
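
Since Greenplum is PostgreSQL-compatible at the wire-protocol level, querying the loaded data from Python could look something like the sketch below. The connection details and table are hypothetical, just to show that ordinary SQL tooling applies once Hadoop output lands in the relational store.

```python
import psycopg2  # Greenplum speaks the PostgreSQL wire protocol

# Hypothetical connection details and table, for illustration only
conn = psycopg2.connect(host="greenplum.example.com", dbname="analytics", user="analyst")
cur = conn.cursor()

# Suppose a Hadoop job has already loaded structured click events into a fact table;
# from here it is ordinary SQL analytics.
cur.execute("""
    SELECT date_trunc('day', event_time) AS day, count(*) AS clicks
    FROM web_clicks
    GROUP BY 1
    ORDER BY 1
""")
for day, clicks in cur.fetchall():
    print(day, clicks)
```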

With these services SunGard is attempting to provide a unified analytics service that spans all structured, semi-structured and unstructured data.

~~~~

There was probably more to Spring SNW, but given my limited time on the exhibition floor and in vendor discussions, these items and my previously published post are what seemed of most interest to me.

No-power sensors surface due to computational energy efficiency trends

Koomeys_law_graph,_made_by_Koomey (cc) (from wikipedia.org)

Read an article today in MIT's Technology Review, The computing trend that will change everything, about the trend in energy consumption per unit of computation.

Along with Moore's law dictating that transistor density doubles every 18 to 24 months, there is Koomey's law, which states that computational power efficiency, or computations per watt, doubles every 1.57 years.

Koomey's law has made today's smart phones and tablets possible. If your current laptop were computing at the power efficiency of 1991 computers, its batteries would last ~2.5 seconds.

No-power sensors?!

But this computing efficiency trend is giving rise to no-power sensors/devices, or computational sensors without batteries. These new sensors gather electrical energy from "ambient radio waves" in the air, and by doing so harvest enough electricity to power computations; as such, they don't need batteries.

Such devices can gather ~50μwatts of power from a TV transmitter just 2.5 miles away. Most calculators only use ~5μwatts and digital thermometers around 1μwatt, so 50μwatts is enough to do a reasonable amount of sensing work.

But the exciting part is that as Koomey's law continues, the amount of work that 50μwatts or even 5μwatts supports doubles again every 1.6 years. For example, the computational power of today's laptops will consume only infinitesimal amounts of power in ~two decades' time. Thus, the no-power sensors of 2034 will be very smart indeed.
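
The arithmetic behind that claim is straightforward; here is a small Python back-of-the-envelope calculation of the Koomey's-law gain over two decades (the 50μwatt budget is the TV-transmitter harvesting figure mentioned above):

```python
DOUBLING_PERIOD_YEARS = 1.57      # Koomey's law: computations per watt double about this often
YEARS_AHEAD = 20                  # "~two decades' time"
HARVESTED_BUDGET_UW = 50          # microwatts harvested from a TV transmitter 2.5 miles away

efficiency_gain = 2 ** (YEARS_AHEAD / DOUBLING_PERIOD_YEARS)
print(f"Efficiency gain over {YEARS_AHEAD} years: ~{efficiency_gain:,.0f}x")

# The same 50 microwatt harvest would then support roughly this much of today's computation:
equivalent_uw_today = HARVESTED_BUDGET_UW * efficiency_gain
print(f"A {HARVESTED_BUDGET_UW}uW sensor in {YEARS_AHEAD} years ~= "
      f"{equivalent_uw_today / 1e6:.1f} watts' worth of today's computing")
```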

“Any sufficiently advanced technology is indistinguishable from magic”, Arthur C. Clarke

Data transmission efficiency not keeping up

Nonetheless, the fact that computational efficiency is doubling every 1.6 years doesn’t mean the data transmission efficiency is doing the same.  Which means that for the foreseeable future, data transmission may remain a crucial bottleneck for no-power sensors.

However, computational increases can somewhat compensate for data transmission limitations by more efficient encoding, compression, etc. But there are limits as to what can be accomplished within any data transmission technology.

Nanodata

Thus, for the foreseeable future, although sensors will be able to do lots more computation, what they transmit to the outside world may remain limited, giving rise to smart, no-power sensors providing very minuscule data packages.

One term coined to describe such limited external data transmission from no-power computationally intense sensors is nanodata.   Because of their ability to exist outside the power grid, it is very likely that the future sensor cloud or internet-of-things will be primarily comprised of such nanodata devices.

~~~~
I was at SNW last week and there was some discussion of “little data” or data in corporate databases, in contrast with big data.  But nanodata is something I had never heard of before today.

So now we have big data, little data, and nanodata. Seems like we are missing a few steps here…