There’s a new storage startup out of stealth called Primary Data, and it’s implementing data (note, not storage) virtualization.
They already have $60M in funding and some pretty high-powered talent from Fusion-io, namely David Flynn, Rick White and Steve Wozniak (“the Woz”, also of Apple fame).
There have been a number of attempts at creating a virtualization layer for data, notably EMC ViPR (see my post ViPR virtues, vexations but no storage virtualization), but Primary Data is taking a different tack to the problem.
Data virtualization explained
Essentially they want to separate the data plane from the control plane (See my Data Hypervisor post and comments for another view on this).
- The data plane consists of those storage system activities that actually perform IO, i.e., reads and writes.
- The control plane consists of those storage system activities that do everything else a storage system has to do, including provisioning, monitoring, and managing the storage.
Separating the data plane from the control plane offers a number of advantages. EMC ViPR does this, but its data plane is either standard storage systems like VMAX, VNX, Isilon, etc., or software-defined storage solutions. Primary Data wants to do it all.
Their metadata or control plane engine is called a Data Director. It holds information about the data objects stored in the Primary Data system, runs a data policy management engine and handles data migration.
Primary Data relies on purpose-built Data Hypervisor (client) software that talks to Data Directors to learn where data objects reside and how to access them. But once the metadata is transferred to the client software, IO activity can flow directly between the host and the storage system in a protocol-independent fashion.
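To make the split concrete, here’s a minimal Python sketch of the idea. The class names (DataDirector, DataHypervisor, BlockBackend) and every detail of the interfaces are my own invention for illustration, not Primary Data’s actual API: the client asks the Data Director once for an object’s location (control plane), caches that metadata, and thereafter performs IO directly against the backend (data plane).

```python
class DataDirector:
    """Control plane (hypothetical): holds the object catalog and answers
    metadata queries, but never touches the data itself."""

    def __init__(self):
        self.catalog = {}  # object id -> (backend name, address)

    def register(self, obj_id, backend, address):
        self.catalog[obj_id] = (backend, address)

    def locate(self, obj_id):
        return self.catalog[obj_id]


class BlockBackend:
    """Stand-in for any storage system reachable on the data path."""

    def __init__(self, blocks):
        self.blocks = blocks  # address -> bytes

    def read(self, address):
        return self.blocks[address]


class DataHypervisor:
    """Client-side data plane (hypothetical): one metadata round trip to the
    director, then direct IO to the backend on every subsequent access."""

    def __init__(self, director, backends):
        self.director = director
        self.backends = backends   # backend name -> backend object
        self.cache = {}            # cached metadata, so IO skips the director

    def read(self, obj_id):
        if obj_id not in self.cache:
            self.cache[obj_id] = self.director.locate(obj_id)  # control plane
        backend, address = self.cache[obj_id]
        return self.backends[backend].read(address)            # direct data path
```

A usage pass: register an object with the director, then read it twice; the second read never consults the director because the client already holds the metadata.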
[The graphic above is from my prior post and I assumed the data hypervisor (DH) would be co-located with the data but Primary Data has rightly implemented this as a separate layer in host software.]
Data Hypervisor protocol independence?
As I understand it, this means customers could use file, object or block storage to support any application requirement. It also means that file data (objects) could be migrated to block storage and still be accessed as file data. The converse is also true: block data (objects) could be migrated to file storage and still be accessed as block data. Add object, DAS, PCIe flash and cloud storage to the mix and you can see where they are headed.
All data in Primary Data’s system are object encapsulated, and all data objects are catalogued within a single, global namespace that spans file, block, object and cloud storage repositories.
Data objects can reside on Primary storage systems, external non-Primary data aware file or block storage systems, DAS, PCIe Flash, and even cloud storage.
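As a rough illustration of why the global namespace makes this work, here is a toy catalog in Python. The tier names, paths and the migrate() helper are all my own assumptions, not Primary Data’s design: migration rewrites where an object’s data lives, while its namespace entry, and therefore how clients address it, stays put.

```python
# Hypothetical single global namespace over mixed backends.
# Each entry records the tier the object's data currently lives on.
catalog = {
    "/vm/disk0":     {"tier": "pcie-flash", "key": "blk-0042"},
    "/home/ray.txt": {"tier": "nas",        "key": "file-17"},
    "/archive/q3":   {"tier": "cloud",      "key": "s3-obj-9"},
}


def migrate(path, new_tier, new_key):
    """Move an object's data to a new tier. The namespace entry survives,
    so clients keep using the same name (and access protocol) afterward."""
    entry = catalog[path]
    entry["tier"], entry["key"] = new_tier, new_key
```

For example, `migrate("/home/ray.txt", "object", "s3-obj-44")` would push a file off NAS onto an object store, yet a client still opens `/home/ray.txt` as a file.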
How does Data Virtualization compare to Storage Virtualization?
There are a number of differences:
- Most storage virtualization solutions are in the middle of the data path and because of this have to be fairly significant, highly fault-tolerant solutions.
- Most storage virtualization solutions don’t have a separate and distinct meta-data engine.
- Most storage virtualization systems don’t require any special (data hypervisor) software running on hosts or clients.
- Most storage virtualization systems don’t support protocol independent access to data storage.
- Most storage virtualization systems don’t support DAS or server-based PCIe flash as permanent storage. (Granted, Primary Data doesn’t support this in its first release either, but intends to soon.)
- Most storage virtualization systems support internal storage that resides directly inside the storage virtualization system hardware.
- Most storage virtualization systems support an internal DRAM cache layer which is used to speed up IO to internal and external storage and is in addition to any caching done at the external storage system level.
- Most storage virtualization systems only support external block storage.
There are a few similarities as well:
- They both manage data migration in a non-disruptive fashion.
- They both support automated policy management over data placement, data protection, data performance, and other QoS attributes.
- They both support multiple vendors of external storage.
- They both can support different host access protocols.
Data Virtualization Policy Management
A policy engine runs in the Data Directors and provides SLAs for data objects. This would include performance attributes, protection attributes, security requirements and cost requirements. Presumably, policy specifications for data protection would include RAID level, erasure coding level and geographic dispersion.
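One way to picture such a policy engine, with entirely made-up tier characteristics and SLA fields (Primary Data hasn’t published this level of detail), is as a matcher that picks the cheapest storage tier satisfying an object’s performance and protection requirements:

```python
# Hypothetical tier table: latency, relative cost and protection scheme
# are invented for illustration only.
TIERS = {
    "pcie-flash":    {"latency_ms": 0.1, "cost": 10, "protection": "raid"},
    "all-flash-nas": {"latency_ms": 1.0, "cost": 5,  "protection": "raid"},
    "cloud":         {"latency_ms": 50,  "cost": 1,  "protection": "geo-dispersed"},
}


def place(sla):
    """Return the cheapest tier meeting the SLA's latency bound and,
    if specified, its required protection scheme."""
    candidates = [
        name for name, t in TIERS.items()
        if t["latency_ms"] <= sla["max_latency_ms"]
        and sla.get("protection") in (None, t["protection"])
    ]
    return min(candidates, key=lambda n: TIERS[n]["cost"])
```

So an SLA demanding sub-2ms latency would land on the all-flash NAS tier (cheaper than PCIe flash), while one demanding geo-dispersed protection would land in the cloud; a real engine would re-evaluate placements continuously and trigger migrations when an SLA changes.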
In Primary Data, backup becomes nothing more than object snapshots with different protection characteristics, like offsite full copy. Moreover, data object migration can be handled completely outboard and without causing data access disruption and on an automated policy basis.
Primary Data first release
Primary Data will initially be deployed as an integrated data virtualization solution that includes an all-flash NAS storage system and a standard NAS system. Over time, Primary Data will add non-Primary Data external storage and internal storage (DAS, SSD, PCIe flash).
The Data Policy Engine and Data Migrator functionality will be sold as separately charged software. Data Directors are sold in pairs (active-passive) and can be non-disruptively upgraded. Storage (directors?) is also sold separately.
Data Hypervisor (client) software is available for most flavors of Linux and OpenStack, with ESX support coming. Windows SMB support is not yet split (control plane/data plane), but Primary Data does support SMB. I believe the Data Hypervisor software will also be released in an upcoming version of the Linux kernel.
They are currently in beta testing. There’s no official date for GA, but they did say they would announce pricing in 2015.
Disclosure: We have done work for Primary Data over the past year.
- Screen shot of beta test system supplied by Primary Data
- Graphic created by SCI for prior Data Hypervisor post