Cloud-based database startups are heating up

IBM recently agreed to purchase Cloudant, an online database service built on CouchDB, a NoSQL database. Apparently this is an attempt by IBM to take on Amazon and others that offer cloud-based services using a NoSQL database backend to store massive amounts of data.

In other news, Dassault Systèmes, a provider of 3D and other design tools, has invested $14.2M in NuoDB, a provider of a cloud-based NewSQL database service. Apparently Dassault intends to start offering its design software as a service, using NuoDB as the backend database.

We have discussed NewSQL and NoSQL databases before (see the NewSQL and the curse of old SQL databases post) and there are plenty available today. So why the sudden interest in cloud-based database services? I tend to think there are a couple of different trends playing out here.

IBM playing catchup

In IBM's case, there's just so much data moving to the cloud these days that IBM can't afford not to have a hand in it if it wants to continue to be a major IT services organization. Amazon and others are blazing this trail and IBM has to get on board or be left behind.

The NoSQL, or non-relational, database model allows for different types of data structuring than the standard tables/rows of traditional RDBMS databases. Specifically, NoSQL databases are very useful for data that can be organized as a tree (a directed graph), a graph (an undirected graph) or key=value pairs. This latter form is very useful for Hadoop, MapReduce and other big data analytics applications. Doing this in the cloud just makes sense, as the data can be both gathered and analyzed in the cloud, with nothing more than the results of the analysis sent back to the requesting party.
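To make those three data shapes concrete, here's a minimal Python sketch (with made-up data): a tree as a nested document, a graph as an adjacency map, and key=value pairs feeding a MapReduce-style reduce step.

```python
# Toy illustration (made-up data) of the three data shapes NoSQL stores
# handle more naturally than fixed relational tables/rows.
from collections import defaultdict

# Tree: a nested document, e.g. the kind of record CouchDB/Cloudant stores
order_doc = {
    "_id": "order-1001",
    "customer": {"name": "Acme Corp", "region": "west"},
    "items": [{"sku": "A-17", "qty": 3}, {"sku": "B-42", "qty": 1}],
}

# Graph: nodes and edges, here as a simple adjacency map
follows = {"alice": ["bob", "carol"], "bob": ["carol"], "carol": []}

# Key=value pairs: the natural input/output shape for MapReduce-style analytics
clicks = [("page/home", 1), ("page/about", 1), ("page/home", 1)]

# A tiny "reduce" step: sum values by key, the core of many Hadoop jobs
totals = defaultdict(int)
for key, value in clicks:
    totals[key] += value

print(dict(totals))  # {'page/home': 2, 'page/about': 1}
```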

IBM doesn't necessarily need another SQL database, as it already has DB2, and it already offers a cloud-based DB2 service that can be deployed by public or private cloud organizations. But it has no cloud-based NoSQL service today, and having one makes a lot of sense if IBM wants to branch out into more cloud service offerings.

Dassault is broadening its market

As for the cloud-based NuoDB NewSQL database, not all data fits the tree, graph or key=value structuring of NoSQL databases. Many traditional applications that use databases today revolve around SQL services and would be hard pressed to move off an RDBMS.

Also, one ongoing problem with NoSQL databases is that they don't really support ACID transaction processing and, as such, often compromise on data consistency in order to support highly parallelizable activities. In contrast, a SQL database supports strict transactional consistency and is just the thing for moving something like a traditional OLTP application to the cloud.
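To show what that strict consistency buys you, here's a minimal sketch of an OLTP-style transfer using Python's built-in sqlite3 module as a stand-in for any ACID SQL database (the accounts table and amounts are made up): either both legs of the transfer commit or neither does.

```python
import sqlite3

# Stand-in OLTP example: move funds between two accounts atomically.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("checking", 100), ("savings", 0)])
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            balance = conn.execute("SELECT balance FROM accounts WHERE id = ?",
                                   (src,)).fetchone()[0]
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
    except ValueError:
        pass  # the whole transaction was rolled back; no partial update survives

transfer(conn, "checking", "savings", 250)  # fails and rolls back entirely
print(conn.execute("SELECT * FROM accounts ORDER BY id").fetchall())
# [('checking', 100), ('savings', 0)] -- balances unchanged
```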

I would guess that how NuoDB handles the high throughput needed by its cloud service partners, while still providing ACID transaction consistency, is part of its secret sauce.

But what's behind all this interest? At least some of it may just be the internet of things (IoT)

The other thing that seems to be driving a lot of the interest in cloud-based databases is the IoT. As more and more devices become internet connected, they will start to generate massive amounts of data. The only way to capture and analyze this data effectively today is with NoSQL and NewSQL database services. By hosting these services in the cloud, analyzing/processing/reporting on this tsunami of data becomes much, much easier.

Storing and analyzing all this IoT data should make for an interesting decade or so as the internet of things gets built out across the world. Cisco's CEO, John Chambers, recently said that the IoT market will be worth $19T and will have 50B internet-connected devices by 2020. That seems a bit of a stretch, seeing as Cisco just predicted (June 2013) that there would be 10B devices attached to the internet by the middle of last year, but who am I to disagree.

There’s much more to be written about the IoT and its impact on data storage, but that will need to wait for another time… stay tuned.

Comments?

Photo Credit(s): database 2 by Tim Morgan 

 

NewSQL and the curse of Old SQL database systems

database 2 by Tim Morgan (cc) (from Flickr)

There was some Twitter traffic yesterday on how Facebook was locked into using MySQL (see the article here) and, as such, was having to shard its MySQL database across thousands of database partitions and memcached servers in order to keep up with the processing load.

The article indicated that this was painful, costly and time consuming, and said Facebook would be better served by moving to something else. One answer was to replace MySQL with recently emerging NewSQL database technology.

One problem with old SQL database systems is that they were never architected to scale beyond a single server. As such, multi-server transactional operation was always a short-term fix to the underlying system, not a design goal. Sharding emerged as one way to distribute the data across multiple RDBMS servers.

What’s sharding?

Relational database tables are sharded by partitioning them via a key. By hashing this key, one can partition a busy table across a number of servers and use the hash function to look up where to process/access the table data. An alternative to hashing is to use a lookup function to determine which server has the table data you need and process it there.
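As a rough sketch (the list of shard servers and key names are hypothetical), hash-based shard routing looks something like this:

```python
import hashlib

SHARD_SERVERS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(key: str) -> str:
    """Map a shard key (e.g. a user id) to one of N database servers."""
    # Use a stable hash so every client routes the same key to the same shard.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARD_SERVERS[h % len(SHARD_SERVERS)]

# Every read/write for a given user lands on a single server...
print(shard_for("user:12345"))  # e.g. 'db-shard-2'

# ...but the application layer, not the database, now has to do the routing:
#   run_sql(shard_for("user:12345"),
#           "SELECT * FROM orders WHERE user_id = 12345")  # run_sql is hypothetical
```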

In any case, sharding causes a number of new problems. Namely,

  • Cross-shard joins – anytime you need data from more than one shard server you lose the advantages of distributing data across nodes. Thus, cross-shard joins need to be avoided to retain performance (see the sketch after this list).
  • Load balancing shards – to spread workload you need to split the data by processing activity. But knowing ahead of time what the table processing will look like is hard, and one week's processing may vary considerably from the next week's load. As such, it's hard to load balance shard servers.
  • Non-consistent shards – by spreading transactions across multiple database servers and partitions, transactional consistency can no longer be guaranteed. While for some applications this may not be a concern, traditional RDBMS applications depend on it.
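To see why cross-shard joins hurt, here's a rough sketch using made-up, in-memory "shards" standing in for separate database servers: the join that a single-server RDBMS would do in one SQL statement now requires touching every shard and stitching the results together in application code.

```python
# Made-up, in-memory "shards" standing in for separate database servers.
shards = [
    {"users": {1: "alice"}, "orders": [(1, 30)]},           # shard 0
    {"users": {2: "bob"},   "orders": [(2, 75), (2, 10)]},  # shard 1
]

def orders_with_user_names(shards):
    users, orders = {}, []
    for shard in shards:              # touch EVERY shard, not just one
        users.update(shard["users"])
        orders.extend(shard["orders"])
    # the "join" now runs in application code, not in the database
    return [(users[uid], total) for uid, total in orders]

print(orders_with_user_names(shards))
# [('alice', 30), ('bob', 75), ('bob', 10)]
```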

These are just some of the issues with sharding and I am certain there are more.

What about Hadoop projects and their alternatives?

One possibility is to use Hadoop and its distributed database solutions. However, Hadoop systems were not intended to be used for transaction processing. Nonetheless, Cassandra and HyperTable (see my post on Hadoop – Part 2) can be used for transaction processing, and Cassandra at least can be tailored to almost any consistency level. But neither Cassandra nor HyperTable is really meant to support high-throughput, consistent transaction processing.
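As an illustration of that tunable consistency, here's a minimal sketch assuming the DataStax Python driver and a hypothetical iot keyspace with a readings table; the consistency level is chosen per statement, trading consistency against throughput.

```python
from datetime import datetime

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Sketch only: assumes a running Cassandra cluster and a hypothetical
# 'iot' keyspace containing a 'readings' table.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect("iot")

# Fast, weakly consistent write: one replica's acknowledgement is enough.
fast_write = SimpleStatement(
    "INSERT INTO readings (device_id, ts, value) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.ONE)
session.execute(fast_write, ("sensor-42", datetime.utcnow(), 21.7))

# Stronger read: a majority of replicas must agree before returning.
quorum_read = SimpleStatement(
    "SELECT value FROM readings WHERE device_id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
rows = session.execute(quorum_read, ("sensor-42",))
```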

Also, the other non-Hadoop distributed database solutions support data analytics, and most are not positioned as transaction processing systems (see Big Data – Part 3). Teradata might be considered the lone exception here; it can be a very capable transaction-oriented database system in addition to its data warehouse operations, but it's probably not widely distributed or scalable above a certain threshold.

The problems with most of the Hadoop and non-Hadoop systems above mainly revolve around the lack of support for ACID transactions, i.e., atomic, consistent, isolated and durable transaction processing. In fact, most of the above solutions relax one or more of these characteristics to provide a scalable transaction processing model.

NewSQL to the rescue

There are some newly emerging database systems, called "NewSQL" databases, that are designed from the ground up to operate in distributed environments. Specifically:

  • Clustrix – is a MySQL-compatible replacement, delivered as a hardware appliance, that can be distributed across a number of nodes while retaining full ACID transaction compliance.
  • GenieDB – is a layered database offering both NoSQL and SQL interfaces that is consistent (atomic), available and partition tolerant (CAP) but not fully ACID compliant. It offers plugins for MySQL and popular content management systems that allow MySQL and/or CMSs to run on GenieDB clusters with minimal modification.
  • NimbusDB – is a client/cloud-based SQL service that distributes copies of data across multiple nodes and offers a majority of SQL99 standard services.
  • VoltDB – is a fully SQL-compatible, ACID-compliant, distributed, in-memory database system offered as a software-only solution running on 64-bit CentOS, but compatible with any POSIX-compliant, 64-bit Linux platform.
  • Xeround – is a cloud-based, MySQL-compatible replacement delivered as a service offering (on Amazon, Rackspace and others) that provides ACID-compliant transaction processing across distributed nodes.

I might be missing some, but these seem to be the main ones today. Each of the above takes a different tack to offering distributed SQL services, and some relax ACID compliance in order to do so. But for all of them distributed, scale-out performance is key, and they all offer purpose-built, distributed, transactional relational database services.

—–

RDBMS technology has evolved over the past several decades and has had at least ~35 years of running major transactional systems. But today's hardware architectures, together with web-scale performance requirements, stretch these systems beyond their original design envelope. As such, NewSQL database systems have emerged to replace old SQL technology with a new, intrinsically distributed system architecture providing high-performing, scalable transactional database services for today and the foreseeable future.

Comments?