New EMC Data Domain Global Deduplication Array and other enhancements
EMC recently announced updates to their DD880 appliance and appliance software, along with a new dual-controller deduplication storage system called the Global Deduplication Array.
Global Deduplication Array (GDA)
EMC’s GDA pairs two DD880 appliances to offer twice the throughput and capacity of the newly enhanced DD880 appliance. This product currently comes only with Symantec’s OST support for NetBackup and Backup Exec. Later this year, EMC will offer similar capabilities for their NetWorker backup product. Note that this system does not support NAS or VTL configurations because it requires special software at the backup server.
The GDA uses a new version of EMC’s OST plugin that moves some of the deduplication processing to the backup server. As such, this functionality improves backup server throughput while at the same time improving GDA data ingestion. Backup server throughput is improved by copying less data to the GDA backup target. The new OST plugin pre-digests the data for deduplication, and the new process works as follows:
- OST plugin starts by breaking the backup data into super-chunks (~1MB), hashes the super-chunks, and then sends the list of hashes over to the GDA.
- GDA takes the content list, identifies the new or unique super-chunk data and returns to the OST plugin a list of only the new super-chunks to be sent across.
- OST plugin then sends only the new super-chunk data across to the proper GDA controller.
This is not quite deduplication at the backup server because the hard work of data lookup is done at the GDA. Once the super-chunk data is transmitted to a GDA DD880 controller, the appliance breaks it up into deduplication chunks (~8KB), identifies which chunks are unique and which are duplicates, and saves only the unique data.
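The exchange described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not EMC's implementation: the `Controller` class and the `backup`, `digest` and `split` functions are hypothetical names, SHA-256 stands in for a hash that was not named in the announcement, and chunks are cut at fixed sizes purely for simplicity.

```python
import hashlib

SUPER_CHUNK = 1024 * 1024   # ~1MB super-chunks, as described above
DEDUP_CHUNK = 8 * 1024      # ~8KB deduplication chunks


def digest(data: bytes) -> str:
    # Assumption: SHA-256; the announcement does not name the hash used.
    return hashlib.sha256(data).hexdigest()


def split(data: bytes, size: int) -> list[bytes]:
    # Assumption: fixed-size splitting, for illustration only.
    return [data[i:i + size] for i in range(0, len(data), size)]


class Controller:
    """Hypothetical stand-in for one GDA controller."""

    def __init__(self) -> None:
        self.known_super = set()   # super-chunk hashes already ingested
        self.store = {}            # small-chunk hash -> chunk data

    def missing(self, hashes: list[str]) -> list[str]:
        # Step 2: report which super-chunks are new to this controller.
        return [h for h in hashes if h not in self.known_super]

    def ingest(self, super_chunk: bytes) -> None:
        # Final stage: split into ~8KB chunks and keep only the unique ones.
        self.known_super.add(digest(super_chunk))
        for chunk in split(super_chunk, DEDUP_CHUNK):
            self.store.setdefault(digest(chunk), chunk)


def backup(data: bytes, target: Controller) -> int:
    """Plugin side of the exchange; returns the number of bytes sent."""
    supers = {digest(s): s for s in split(data, SUPER_CHUNK)}  # step 1
    wanted = target.missing(list(supers))                      # step 2
    sent = 0
    for h in wanted:                                           # step 3
        target.ingest(supers[h])
        sent += len(supers[h])
    return sent
```

In this sketch, backing up the same data a second time sends no super-chunk data at all, because every hash in the list is already known to the controller. Only the hash list crosses the wire, which is where the backup server throughput gain would come from.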
The new OST plugin also supports application-level load balancing across Ethernet links to improve link throughput. By doing this at the application level, it no longer depends on Ethernet link trunking, which was impossible to use with different Ethernet hardware and had other configuration limitations.
As the GDA has multiple controllers supporting a single backup stream, performance can now scale out more easily than before. The GDA currently supports one- or two-controller configurations. With two controllers, the deduplication processing and storage are split to provide load-balanced operations. The two controllers in a GDA use a shared-nothing type of clustering. Such capabilities easily lend themselves to increasing beyond dual-controller configurations to support true multi-controller performance.
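How the plugin picks the "proper" controller was not described. One simple way to split work across shared-nothing controllers is to route each super-chunk by its hash, so that identical data always lands on the same controller and no lookup state needs to be shared. The sketch below assumes exactly that; `controller_for` and `route` are hypothetical names, and the modulo rule is an illustrative choice rather than anything EMC has confirmed.

```python
import hashlib


def controller_for(super_chunk_hash: str, controllers: int = 2) -> int:
    # Hypothetical routing rule: hash value modulo the controller count.
    return int(super_chunk_hash, 16) % controllers


def route(super_chunks: list[bytes], controllers: int = 2) -> dict[int, list[str]]:
    # Build a per-controller list of super-chunk hashes to offer.
    plan = {i: [] for i in range(controllers)}
    for chunk in super_chunks:
        h = hashlib.sha256(chunk).hexdigest()  # assumed hash function
        plan[controller_for(h, controllers)].append(h)
    return plan
```

A plain modulo rule like this reassigns most hashes when the controller count changes, so a design that grows beyond two controllers would probably need something more stable, such as consistent hashing.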
The GDA supports up to two DD880 appliances with a maximum of 285TB raw capacity, and a ~12TB/hour ingestion rate using up to 270 simultaneous backup streams.
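As a quick sanity check, the headline numbers can be broken down per controller and per stream. The calculation below assumes decimal units (1TB = 10^12 bytes) and that all 270 streams run concurrently at the peak rate, so the per-stream figure is an average, not a measured value.

```python
RAW_TB = 285.0             # maximum raw capacity of a two-controller GDA
INGEST_TB_PER_HOUR = 12.0  # approximate peak ingestion rate
STREAMS = 270              # maximum simultaneous backup streams
CONTROLLERS = 2

raw_per_controller_tb = RAW_TB / CONTROLLERS
ingest_per_controller_tb_h = INGEST_TB_PER_HOUR / CONTROLLERS

# Assumption: decimal units, 1TB = 10**12 bytes and 1MB = 10**6 bytes.
aggregate_mb_per_s = INGEST_TB_PER_HOUR * 10**12 / 3600 / 10**6
per_stream_mb_per_s = aggregate_mb_per_s / STREAMS

print(f"{raw_per_controller_tb} TB raw per controller")
print(f"{ingest_per_controller_tb_h} TB/hour per controller")
print(f"{aggregate_mb_per_s:.0f} MB/s aggregate")
print(f"{per_stream_mb_per_s:.1f} MB/s per stream on average")
```

That works out to 142.5TB raw and roughly 6TB/hour per controller, about 3.3GB/s in aggregate, and a little over 12MB/s per stream when all 270 streams are active.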
EMC also announced new hardware and software functionality enhancements to their Data Domain appliances. Specifically:
- Doubled DD880 capacity – supports up to 12 shelves of drives, doubling raw capacity to 142.5TB and logical capacity to more than 7PB of backup data.
- Data encryption – supports software encryption of the data store on all Data Domain appliances; data is encrypted after deduplication and compression, as a final step before it’s stored on disk.
- One-to-many replication – supports replicating data from one Data Domain appliance to multiple appliances which, combined with the existing many-to-one and cascaded replication, increases the range of configurations that can be served.
- Low-bandwidth replication – supports a more compute-intensive but bandwidth-saving replication option for remote sites with limited bandwidth.
Data encryption is a separately licensable, software-only option that can be configured to support stronger or weaker security and uses a single key for a Data Domain appliance. Encryption and replication interoperate to encrypt replicated data with the key of the replication target so that data is automatically recoverable at the target site with the target key. No performance numbers were provided for encrypted data throughput, but by selecting which encryption algorithm to use, one can trade off throughput against security.
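Encrypting as the last step, after deduplication and compression, matters because encrypted data looks random and no longer compresses. The short Python sketch below illustrates the effect. Note that `toy_encrypt` is a deliberately simple XOR keystream written for this illustration; it is not secure, and it is not the algorithm Data Domain uses. The key and sample data are likewise made up.

```python
import hashlib
import zlib


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    # Toy XOR keystream (SHA-256 in counter mode). NOT secure; illustration only.
    # Applying it twice with the same key returns the original data.
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


data = b"backup record " * 10_000      # highly repetitive sample data
key = b"hypothetical-appliance-key"     # made-up key for the example

compress_then_encrypt = toy_encrypt(zlib.compress(data), key)
encrypt_then_compress = zlib.compress(toy_encrypt(data, key))

print(len(data), len(compress_then_encrypt), len(encrypt_then_compress))
```

With the order used by the appliance (compress, then encrypt), the stored result stays a small fraction of the original size. With the order reversed, the compressor gets nothing to work with and the output is about as large as the input.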
The new replication options open up new topologies for the enterprise customer, e.g., many-to-one-to-many replication, where multiple remote offices are replicated to a central hub which, with cascaded replication, can then be mirrored to multiple DR sites. EMC said that replication is a very popular option, so adding more configuration flexibility and bandwidth features should make it even more attractive.
Low-bandwidth replication is for remote office environments in many locations that lack high-bandwidth network connectivity. One example cited was oil platforms in very remote areas around the globe.
EMC’s Data Domain team continues to advance their technology and feature set. The GDA seems to be a way to increase performance using mostly software changes. But it has the potential to create a whole new multi-controller configuration of products that could be used to dramatically improve performance by adding more hardware to a single configuration.
A PDF version of this can be found at EMC 2010 April 14 Announcement of Data Domain GDA and other enhancements
Silverton Consulting, Inc. is a Storage, Strategy & Systems consulting services company based in the USA, offering products and services to the data storage community.