NewSQL and the curse of Old SQL database systems

database 2 by Tim Morgan (cc) (from Flickr)
database 2 by Tim Morgan (cc) (from Flickr)

There was some twitter traffic yesterday on how Facebook was locked into using MySQL (see article here) and as such, was having to shard their MySQL database across 1000s of database partitions and memcached servers in order to keep up with the processing load.

The article indicated  that this was painful, costly and time consuming. Also they said Facebook would be better served moving to something else. One answer was to replace MySQL with recently emerging, NewSQL database technology.

One problem with old SQL database systems is they were never architected to scale beyond a single server.  As such, multi-server transactional operations was always a short-term fix to the underlying system, not a design goal. Sharding emerged as one way to distribute the data across multiple RDBMS servers.

What’s sharding?

Relational database tables are sharded by partitioning them via a key.  By hashing this key one can partition a busy table across a number of servers and use the hash function to lookup where to process/access table data.   An alternative to hashing is to use a search lookup function to determine which server has the table data you need and process it there.

In any case, sharding causes a number of new problems. Namely,

  • Cross-shard joins – anytime you need data from more than one shard server you lose the advantages of distributing data across nodes. Thus, cross-shard joins need to be avoided to retain performance.
  • Load balancing shards – to spread workload you need to split the data by processing activity.  But, knowing ahead of time what the table processing will look like is hard and one weeks processing may vary considerably from the next weeks load. As such, it’s hard to load balance shard servers.
  • Non-consistent shards – by spreading transactions across multiple database servers and partitions, transactional consistency can no longer be guaranteed.  While for some applications this may not be a concern, traditional RDBMS activity is consistent.

These are just some of the issues with sharding and I am certain there are more.

What about Hadoop projects and its alternatives?

One possibility is to use Hadoop and its distributed database solutions.  However, Hadoop systems were not intended to be used for transaction processing. Nonetheless, Cassandra and HyperTable (see my post on Hadoop – Part 2) can be used for transaction processing and at least Casandra can be tailored to any consistency level. But both Cassandra and HyperTable are not really meant to support high throughput, consistent transaction processing.

Also, the other, non-Hadoop distributed database solutions support data analytics and most are not positioned as transaction processing systems (see Big Data – Part 3).  Although Teradata might be considered the lone exception here and can be a very capable transaction oriented database system in addition to its data warehouse operations. But it’s probably not widely distributed or scaleable above a certain threshold.

The problems with most of the Hadoop and non-Hadoop systems above mainly revolve around the lack of support for ACID transactions, i.e., atomic, consistent, isolated, and durable transaction processing. In fact, most of the above solutions relax one or more of these characteristics to provide a scaleable transaction processing model.

NewSQL to the rescue

There are some new emerging database systems that are designed from the ground up to operate in distributed environments called “NewSQL” databases.  Specifically,

  • Clustrix – is a MySQL compatible replacement, delivered as a hardware appliance that can be distributed across a number of nodes that retains fully ACID transaction compliance.
  • GenieDB – is a NoSQL and SQL based layered database that is consistent (atomic), available and partition tolerant (CAP) but not fully ACID compliant, offers a MySQL and popular content management systems plugins that allow MySQL and/or CMSs to execute using GenieDB clusters with minimal modification.
  • NimbusDB – is a client-cloud based SQL service which distributes copies of data across multiple nodes and offers a majority of SQL99 standard services.
  • VoltDB – is a fully SQL compatible, ACID compliant, distributed, in-memory database system offered as a software only solution executing on 64bit CentOS system but is compatible with any POSIX-compliant, 64bit Linux platform.
  • Xeround –  is a cloud based, MySQL compatible replacement delivered as a (Amazon, Rackspace and others) service offering that provides ACID compliant transaction processing across distributed nodes.

I might be missing some, but these seem to be the main ones today.  All the above seem to take a different tack to offer distributed SQL services.  Some of the above relax ACID compliance in order to offer distributed services. But for all of them distributed scale out performance is key and they all offer purpose built, distributed transactional relational database services.

—–

RDBMS technology has evolved over the last century and have had at least ~35 years of running major transactional systems. But todays hardware architecture together with web scale performance requirements stretch these systems beyond their original design envelope.  As such, NewSQL database systems have emerged to replace old SQL technology, with a new, intrinsically distributed system architecture providing high performing, scaleable transactional database services for today and the foreseeable future.

Comments?

Hadoop – part 1

Hadoop Logo (from http://hadoop.apache.org website)
Hadoop Logo (from http://hadoop.apache.org website)

BIGData is creating quite a storm around IT these days and at the bottom of big data is an Apache open source project called Hadoop.

In addition, over the last month or so at least three large storage vendors have announced tie-ins with Hadoop, namely EMC (new distribution and product offering), IBM ($100M in research) and NetApp (new product offering).

What is Hadoop and why is it important

Ok, lot’s of money, time and effort are going into deploying and supporting Hadoop on storage vendor product lines. But why Hadoop?

Essentially, Hadoop is a batch processing system for a cluster of nodes that provides the underpinnings of most BIGData analytic activities because it bundle two sets of functionality most needed to deal with large unstructured datasets. Specifically,

  • Distributed file system – Hadoop’s Distributed File System (HDFS) operates using one (or two) meta-data servers (NameNode) with any number of data server nodes (DataNode).  These are essentially software packages running on server hardware which supports file system services using a client file access API. HDFS supports a WORM file system, that splits file data up into segments (64-128MB each), distributes segments to data-nodes, stored on local (or networked) storage and keeps track of everything.  HDFS provides a global name space for its file system, uses replication to insure data availability, and provides widely distributed access to file data.
  • MapReduce processing – Hadoop’s MapReduce functionality is used to parse unstructured files into some sort of structure (map function) which can then be later sorted, analysed and summarized (reduce function) into some worthy output.  MapReduce uses the HDFS to access file segments and to store reduced results.  MapReduce jobs are deployed over a master (JobTracker) and slave (TaskTracker) set of nodes. The JobTracker schedules jobs and allocates activities to TaskTracker nodes which execute the map and reduce processes requested. These activities execute on the same nodes that are used for the HDFS.

Hadoop explained

It just so happens that HDFS is optimized for sequential read access and can handle files that are 100TBs or larger. By throwing more data nodes at the problem, throughput can be scaled up enormously so that a 100TB file could literally be read in minutes to hours rather than weeks.

Similarly, MapReduce parcels out the processing steps required to analyze this massive amount of  data onto these same DataNodes, thus distributing the processing of the data, reducing time to process this data to minutes.

HDFS can also be “rack aware” and as such, will try to allocate file segment replicas to different racks where possible.  In addition, replication can be defined on a file basis and normally uses 3 replicas of each file segment.

Another characteristic of Hadoop is that it uses data locality to run MapReduce tasks on nodes close to where the file data resides.  In this fashion, networking bandwidth requirements are reduced and performance approximates local data access.

MapReduce programming is somewhat new, unique, and complex and was an outgrowth of Google’s MapReduce process.  As such, there have been a number of other Apache open source projects that have sprouted up namely, Cassandra, Chukya, Hbase, HiveMahout, and Pig to name just a few that provide easier ways to automatically generate MapReduce programs.  I will try to provide more information on these and other popular projects in a follow on post.

Hadoop fault tolerance

When an HDFS node fails, the NameNode detects a missing heartbeat signal and once detected, the NameNode will recreate all the missing replicas that resided on the failed DataNode.

Similarly, the MapReduce JobTracker node can detect that a TaskTracker has failed and allocate this work to another node in the cluster.  In this fashion, work will complete even in the face of node failures.

Hadoop distributions, support and who uses them

Alas, as in any open source project having a distribution that can be trusted and supported can take much of the risk out of using them and Hadoop is no exception. Probably the most popular distribution comes from Cloudera which contains all the above named projects and more and provides support.  Recently EMC announced that they will supply their own distribution and support of Hadoop as well. Amazon and other cloud computing providers also support Hadoop on their clusters but use other distributions (mostly Cloudera).

As to who uses Hadoop, it seems just about everyone on the web today is a Hadoop user, from Amazon to Yahoo including EBay Facebook, Google and Twitter to highlight just a few popular ones. There is a list on the Apache’s Hadoop website which provides more detail if interested.  The list indicates some of the Hadoop configurations and shows anywhere from a 18 node cluster to over 4500 nodes with multiple PBs of data storage.  Most of the big players are also active participants in the various open source projects around Hadoop and much of the code came from these organizations.

—-

I have been listening to the buzz on Hadoop for the last month and finally decided it was time I understood what it was.  This is my first attempt – hopefully, more to follow.

Comments?