Western Digital at SFD15: ActiveScale object storage

Phill Bullinger and his staff from Western Digital presented at Storage Field Day 15 (SFD15) on a number of their enterprise products including Tegile and IntelliFlash but the one that caught my interest was their ActiveScale object store acquired from Amplidata back in 2015.

ActiveScale is an onprem, object storage system that provides cloud-like  economics for customer data.

ActiveScale Hardware

ActiveScale systems can both scale up and scale out within a single site. ActiveScale systems have both  storage and system nodes. Storage nodes perform erasure coding and System nodes are control points and metadata managers for the object store.

ActiveScale comes in two appliance configurations that contain both storage and system nodes and storage required.  The two appliances are:

  • ActiveScale P100 is a 7U 720TB pod system and A full rack of P100s can read 8GB/sec and can have 17-9s data availability. The P100 can scale up to 2.1PB in a single rack and up to 18PB in the same namespace. The P100 is a higher performing solution with better performing storage and system nodes
  • ActiveScale X100 is a 42U rack scale solution that holds up to 588 12TB drives or 5.8PB per rack. The X100 can scale up to 9 racks or 52PB in the same namespace. The X100 is a denser configuration with only 6 storage nodes and as such, has a better $/GB than the P100 above.

As WDC is both the supplier of the ActiveScale appliance and a supplier of disk storage they can be fairly aggressive with pricing on appliance systems.

Data integrity in ActiveScale

They make a point of saying that ActiveScale object metadata and data are stored separately. By separating data and metadata, they claim to be  more resilient to system failures. Object metadata is 3 way replicated, in a replicated database, residing in system nodes. Other object systems often store metadata and object data in the same way.

Object data can be erasure coded. That is, object data is chunked, erasure coding protected and then spread across multiple disk drives for data protection. ActiveScale erasure coding is called BitSpread. With BitSpread customers identify the number of disk drives to spread object data across and the number of drive failures the system should recover from without data loss.

A typical BitSpread configuration splits object data into 18 chunks and spreads these chunks across storage columns. A storage column is from 6-18 storage nodes. There’s no pre-allocated space in BitSpread. Object data chunks are allocated to disk storage based on current capacity and performance of the system, within redundancy constraints.

In addition, ActiveScale has a background task called BitDynamics that scans  erasure coded chunks and does a mathematical health check on the object data. If a chunk is bad, the object data chunk can be recovered and re-erasure coded back to proper health.

WDC performance testing shows that BitDynamics has 0 performance degradation when performing re-erasure coding. Indeed, they took out 98 drives in an ActiveScale cluster and BitDynamics re-coded all that data onto other disk drives and detected no performance impact. No indication how long  re-encoding 98 disk drives of data took nor the % of object store capacity utilization at the time of the test but presumably there’s a report someplace to back this up

Unlike many public cloud based object storage systems, ActiveScale is strongly consistent. That is object puts (writes) are not responded back to the entity doing the put,  until the object metadata and object data are properly and safely recorded in the object store.

ActiveScale also supports 3 site erasure coding. GeoSpread is their approach to erasure coding across sites. In this case, object metadata is replicated across 3 system nodes across the sites. Object data and erasure coded information is split into 20 chunks which are then spread across the three sites.  This way if any one site goes down, the other two sites have sufficient metadata, object data chunks and erasure coded information to reconstruct the data.

ActiveScale 5.2 now supports asynch replication. That is any one ActiveScale cluster can replicate to any other ActiveScale cluster located continent distances away.

Unclear how GeoSpread and asynch replication would interact together, but my guess is that each of the 3 GeoSpread sites could be asynchronously replicated to 3 other sites for maximum redundancy.

Both GeoSpread and ActiveScale replication impact performance,  depending on how far the sites are from one another and the speed and bandwidth of the links between sites.

ActiveScale markets

ActiveScale’s biggest market is media and entertainment (M&E), mostly used for media archive or tape replacement/augmentation. WDC showed one customer case study for the Montreaux Jazz Festival, which migrated 49 years of performance videos up to ActiveScale and can now stream any performance, on request, without delay. Montreax media is GeoSpread across 3 sites in France. Another option is to perform transcoding on the object media in realtime and stream the transcoded media.

Another large market is Bio/Life Sciences. Medical & biological scanners are transitioning to higher resolution scans which take more data space. And this sort of medical information needs to be kept a long time

Data analytics on ActiveScale

One other emerging market is data analytics. With the new S3A (S3 adapter), Hadoop clusters can now support object storage as a 2nd tier. One problem with data analytics is that they have lots of data and storing it in triplicate, costs an awful lot.

In big data world, datasets can get very large very quickly. Indeed PB sizes data sets aren’t that unusual. And with triple replication (in native HDFS). When HDFS runs out of space you have to delete data. Before S3A, the only way you could increase storage you had to scale out (with compute and storage and networking) in order to add capacity.

Using Hadoop’s S3A, ActiveScale’s can provide cold archive for data analytics.  From a Hadoop user/application perspective, S3A ActiveScale storage looks like just another directory under HDFS (Hadoop Data File System). You can run MapReduce or other Hadoop application directly against object buckets. But a more realistic approach is to move inactive or cold data from an disk resident HDFS directory to a S3A directory

HDFS and MapReduce are tightly coupled and were designed to have data close to where computation happens. So,  as long as the active data or working set data is on HDFS disk storage or directly in memory the rest of the (inactive) data could all be placed on S3A object storage. Inactive data is normally historical data no longer being actively analyzed while newer data would be actively analyzed. Older, inactive data can be manually or automatically archived off to S3A. With HIVE you can partition your database to have active data in HDFS disk storage and inactive data in S3A.

Another approach is if the active, working set data can all fit directly in memory then the data can reside on S3A object storage. This way the data is read from S3A storage into memory, analyzed there and output be done back to object store or HDFS disk. Because the data is only read (loaded) once, there’s only a minimal performance penalty to use S3A storage.

Western Digital is an active contributor to Hadoop S3A and have recently added performance improvements to S3A, such as better caching, partial object reading, and core XML performance tuning options.

~~~~
If your interested in learning more about Western Digital ActiveScale, check out the videos referenced earlier and their website.

Also you may be interested in these other posts on the WD sessions at SFD15:

The A is for Active, The S is for Scale by Dan Firth (@PenguinPunk)

Comments?

BlockStack, a Bitcoin secured global name space for distributed storage

At USENIX ATC conference a couple of weeks ago there was a presentation by a number of researchers on their BlockStack global name space and storage system based on the blockchain based Bitcoin network. Their paper was titled “Blockstack: A global naming and storage system secured by blockchain” (see pg. 181-194, in USENIX ATC’16 proceedings).

Bitcoin blockchain simplified

Blockchain’s like Bitcoin have a number of interesting properties including completely distributed understanding of current state, based on hashing and an always appended to log of transactions.

Blockchain nodes all participate in validating the current block of transactions and some nodes (deemed “miners” in Bitcoin) supply new blocks of transactions for validation.

All blockchain transactions are sent to each node and blockchain software in the node timestamps the transaction and accumulates them in an ordered append log (the “block“) which is then hashed, and each new block contains a hash of the previous block (the “chain” in blockchain) in the blockchain.

The miner’s block is then compared against the non-miners node’s block (hashes are compared) and if equal then, everyone reaches consensus (agrees) that the transaction block is valid. Then the next miner supplies a new block of transactions, and the process repeats. (See wikipedia’s article for more info).

All blockchain transactions are owned by a cryptographic address. Each cryptographic address has a public and private key associated with it.
Continue reading “BlockStack, a Bitcoin secured global name space for distributed storage”

Primary data’s path to better data storage presented at SFD8

IMG_5606rz A couple of weeks ago we met with Primary Data, Lance Smith, CEO, David Flynn, CTO and Kaycee Lai, SVP Product & Sales who were presenting at Storage Field Day 8 (SFD8, videos of their sessions available here). Primary Data has just emerged out of stealth late last year and has ~$60M in funding. Also they have Steve Wozniak (of Apple fame) as Chief Scientist, but he wasn’t at the SFD8 session 🙁

Primary Data seems out to change the world. At first I thought this was just another form of storage virtualization but they are laser focused on data virtualization or what they call data mobility. It differs from pure storage virtualization by being outside the data path.  (I have written about data virtualization before as well as the data hypervisor a long time ago). Nowadays they seem to be using the tag line of data in motion.

Why move data?

David has a theory behind the proliferation of startup storage companies. The spectrum behind capacity and performance has gotten immense, over time, which has provided an opening for a number of companies to address these widening needs.

David believes that caching at the storage system or in the servers is an attempt to address this issue by “loaning” the data from the storage silo to the cache. This is trying to supply a lower cost $/IOP for the data. Similar considerations are apparent at the other side where customer’s use archive or backup services to take advantage of much cheaper $/GB storage.

However, given the difficulty of moving data around in present day storage environments, customer data has become essentially immobile. Primary Data is trying to bring about a data mobility revolution and allow data to move over this spectrum of performance and capacity of storage with ease. Doing so easily, will provide significant benefits as customers can more fully take advantage of the various levels of performance and capacity in their data center storage environments.

Primary Data architecture

IMG_5607Primary Data is providing data mobility by using their meta-data service called the DataSphere appliance and their client software running on host servers called the Data Portal. Their offering can be best explained in three layers:

  • Data virtualization layer – provides continuity of identity and continuity of access across multiple physical storage systems. That is the same data (identity continuity) can be accessed wherever it resides (access continuity) by server applications. Such access and identity must transcend access protocols and interfaces. The Data Portal client software intercepts the server data activity and does control plane activity using the DataSphere appliance and performs IO directly using the physical storage.
  • Objective based data management – supplies a data affinity service. That is data can have a temporary location relationship with physical storage depending on the current performance (R:W, IOPS, bandwidth, latency) and protection (durability, availability, disaster recoverability, security, copy-ability, version-ability) characteristics of the data. These data objectives are matched to the capabilities or service catalog of the storage infrastructure and data objectives can change over time
  • Analytics in the loop – detects the performance and other characteristics of the storage and data in real-time. That is by monitoring the storage IO activity Primary Data can determine the actual performance attribute of the storage. Similarly, by monitoring the applications IO characteristics over time the system can determine the performance objectives of its data. The system also takes advantage of SMI-S to define some of the other characteristics of the storage systems.

How does Primary Data work?

Primary Data has taken advantage of parallel NFS extensions (pNFS) in NFSv4 to externalize and separate the storage control plane from the IO data plane. This works well for native Linux where the main developer of the Linux file system stack is on their payroll.IMG_5608rz

In Windows they put a filter driver in front of SMB to split off the control from data IO plane. Something similar is done for VMware ESX servers to supply the control-data plane split but in this case there is a software defined Data Portal that goes along with the DataSphere Service client that can do it all within the same ESX server. Another alternative exists and that is to use the Data Portal appliance as a storage virtualization service but then the IO data path goes through the portal.

According to their datasheet they currently support data virtualization services for NetApp cDOT and 7-mode, EMC Isilon OneFS7.2, and Nexenta 4.x&5.0 but plan on more.

They are not quite GA yet, but are close.

Comments?

 

 

 

EMCWorld2015 Day 2&3 news

Some additional news from EMCWorld2015 this week:

IMG_4527 IMG_4528 IMG_4531EMC announced directed availability for DSSD, their Rack scale shared Flash storage solution using a PCIe3 (switched) fabric with 36 dual ported, flash modules, which hold 512 NAND chips for 144TB NAND flash storage. On the stage floor they had a demonstration pitting a  40 node Hadoop cluster with DAS against a 15 node Hadoop cluster using the DSSD, both running HIVE and working on the same Query. By the time the 40node/DAS solution got to about 2% of the query completion the 15node/DSSD based cluster had finished the query without breaking a sweat. They then ran an even more complex query and it took no time at all.

They also simulated a copy of a 4TB file (~32K-128K IOs) from memory to memory and it took literally seconds, then copied it to SSD that took considerably longer (didn’t catch how long but much longer than memory), and then they showed the same file copy to DSSD and it only took seconds, almost looked exactly a smidgen slower than the memory to memory copy.

They said the PCIe fabric (no indication what the driver was) provided much more parallelism to the dual ported flash storage that the system was almost able to complete the 4TB copy at memory to memory speeds. It was all pretty impressive, albeit a simulation of the real thing.

EMC indicated that they designed the flash modules themselves and expect to double capacity of the DSSD to 288TB shortly. They showed the controller board that had a mezzanine board over a part of it, but together had 12 major chips on it which I assume had something to do with the PCIe fabric. They said there were two controllers in the system for high availability and the 144TB DSSD was deployed in 5U of space.

I can see how this would play well for real time analytics, high frequency trading and HPC environments but there’s more to shared storage than just speed. Cost wasn’t mentioned neither was the software driver but with the ease with which it worked on the Hive query, I can only assume at some lever it must look something like a DAS device but with memory access times… NVMe anyone?

Project CoprHD was announced which open sourced EMC’s ViPR Controller software. Many ViPR customers were asking for EMC to open source ViPR controller, apparently their listening. Hopefully this will enable some participation from non-EMC storage vendors to allow their storage to be brought under the management of ViPR Controller. I believe the intent is to have an EMC hardened/supported version of Project CoprHD or ViPR Controller to coexist with the open source project version which anyone can download and modify for themselves.

A Non-production, downloadable version of ScaleIO was also announced. The test-dev version is a free download with unlimited capacity, full functionality and available for an unlimited time but only for non-production use.  Another of the demos onstage this morning was Chad configuring storage across a ScaleIO cluster and using its QoS services to limit the impact of a specific workload. There was talk that ScaleIO was available previously as a free download but it took a bunch of effort to find and download. They have removed all these prior hindrances and soon, if not today it’s freely available for anyone. ScaleIO runs on VMware and other hypervisors (maybe bare metal as well). So if you wanted to get your feet wet with software defined storage, this sounds like the perfect opportunity.

ECS is being added to EMC’s Data Lake foundation. Not exactly sure what are all the components in the data lake solution but previously the only Data Lake storage was Isilon based. This week EMC added Elastic Cloud Storage to the picture. Recall that Elastic Cloud Storage comes in either a software only or hardware appliance deployment and provides object storage.

I missed Project Liberty before but it’s a virtual VNX appliance, software only version.  I assume this is intended for ROBO deployments or very low end business environments. Presumably it runs on VMware and has some sort of storage limitations. It seems, more and more of EMC products are coming out in virtual appliance versions.

Project Falcon was also announced which is a virtual Data Domain appliance, software only solution, targeted for ROBO environments and other small enterprises. The intent is to have an onramp for DataDomain backup storage.  I assume runs under VMware.

Project Caspian – rolling out CloudScaling orchestration/automation for OpenStack deployments. On the big stage today, Chad and Jeremy demonstrated Project Caspian on a VCE VxRACK deploying racks of servers under OpenStack control. They were able within a couple of clicks define and deploy openstack on bare metal hardware and deploy applications to the OpenStack servers. They had a monitoring screen which showed the OpenStack server activity (transactions) in real time and showed an over commit of the rack and how easy it was to add a new rack with more servers. All this seemed to take but a few clicks. The intent is not to create another OpenStack distribution but to provide an orchestration/automation/monitoring layer of software on top of OpenStack to “industrialize OpenStack” for enterprise users. Looked pretty impressive to me.

I would have to say the DSSD box was most impressive. It would have been interesting to get an upclose look at the box with some more specifications but they didn’t have one on the Expo floor.

VMware VVOLs potential performance problems

We discussed vSphere 6 VVOLs (Virtual Volumes) on this month’s GreyBeardsonStorage (GBoS) podcast with Howard Marks (@DeepStorageNet) and Satyam Vaghani (@SatyamVaghani, “Father of VVOLs”, CoFounder & CTO of PernixData).

VVOLs queue depth problem?

One performance problem from my perspective is that all VVOL FC IO is now funeled through a single Protocol Endpoint (PE) LUN for a single storage system. There may be some potential queue depth issues, but Satyam and Howard both said that queue depths have been greatly increased over the last decade or so and this shouldn’t be a problem, as long as you’re configured properly.

What about VVOL PEs on ALUA storage?

In an ALUA (Asymmetrical Logical Unit Access) Active/Passive, dual controller storage system, a set of LUNs is assigned to  one controller, the “active” side of an Active/Passive ALUA storage system. Many ALUA vendors now support “Active/Active” configurations such that 1/2 the LUNs are assigned to one side and the other 1/2  assigned to the other sider, for an Active/Passive & Passive/Active pair or Active/Active configuration.

So, ALUA storage systems have a LUN “allegiance” to a controller. If this continues to be the case under VVOLs,  then a PE would only be processed by one side of an ALUA dual controller system, effectively reducing the horse power to process VVOL IO to 1/2 of an ALUA storage system.

Now just because there is a LUN allegiance in ALUA storage doesn’t necessarily mean that all internal IO processing for a LUN is done on only one controller. But historically that has been the case. For instance, during an ALUA system non-disruptive code update, an “active” ALUA side must “failover” its LUNs to the other side to provide continuous IO activity, while the formerly active ALUA side taken down and updated with new code.

Potential solutions to ALUA PE performance?

One way to get around the VVOL ALUA performance problem is to have multiple PEs in a single storage system for the same vSphere Cluster VVOLs. I don’t know anything that would inhibit a storage system from supporting multiple PEs today, they already need to support multiple PEs for multiple vSphere clusters. Also, a VMware vSphere cluster must support multiple PEs for multiple storage systems.

I am also not aware of any VASA 2.0 requirement that restricts the number of PEs for a storage system’s support of a single vSphere cluster. But I could be mistaken here. So there should be nothing to inhibit multiple PEs from the same ALUA storage system to the same vSphere cluster.

Of course, this means an ALUA storage VVOLs would need to be divided across ALUA PEs.

Another solution is to eliminate any LUN allegiance for ALUA controllers. This requires shared memory between controllers to hold IO state and this is what non-ALUA storage does already.

~~~~

It’s just like Howard said on the GBoS podcast, “there’s going to be good and bad implementations of VVOLs” and telling the difference between the two will need to be done.

Comments?

 

Photo Credit(s): Passport Please by Oren Levine

MCS, UltraDIMMs and memory IO, the new path ahead – part 1

IMG_2338I was  at Storage Field Day 5 (SFD5) last month and got a chance to talk with SanDisk and Diablo Technologies. It turns out that SanDisk’s UltraDIMM product is based on Diablo Technologies MCS hardware.  So the two of them provided a pretty deep dive into the technology and where they want to go with it. Before we go any deeper the UltraDIMMs will be released to the field by IBM under the eXFlash name.

Diablo Technologies

The team at Diablo have been focusing on the x86 standard memory channel for a while now and lately have been trying out different sorts of technologies to connect as CPU memory. The first Memory Channel Storage (MCS) product converts Memory Channel IO to SATA IO. This allows any SATA device to be attached as memory and enjoy lightening fast, memory access times. Access times are clocked at 7µsec. Most PCIe Flash cards have an access latency at 50µsec or more, so this is 7X faster that PCIe Flash.  They also claim the MCS is capable of 20GB/sec. I know enterprise class storage systems that can’t do that. Also, the MCS utilizes 2 memory channels.

Diablo delivers a chip (that converts MemIO to SATA IO) and software that provides a block IO access to the MCS device. Customers of MCS supply their own SATA flash storage device and presumably package it all together in a DIMM compatible card.

But the main problem is that the whole MCS chip and SATA IO flash device has to fit in the form factor of a DIMM. And cannot draw any more power than a memory device can draw, ~10-15W with its corresponding thermal load.

But this seems plenty for a small flash drive.  The MCS is configured as a 4GB DDR3 DIMM.  There is a requirement to patch the BIOS so that it doesn’t run diagnostic memory tests on the MCS device and their software needs to be loaded to access the device as a block device. I believe they currently support Linux O/S with more O/Ss on the way.

Diablo has looked at other applications for their technology including providing an Memory IO accessed Ethernet NIC was mentioned. But it seems flash storage would be a great first application of their technology.  Not clear to me but SAS would also be something that could be done.

Whatever happens after NAND with the next generation semiconductor storage (see my The end of NAND is near post, it seems to me that accessing it as Memory IO would make an awful lot of sense. makes a lot of sense.  Using MCS as the access channel would seem to be a logical next step.

Part 1 of this story is on Diablo Technologies, Part 2 will be on SanDisk and I am not sure but maybe there will be a Part 3 on IBM eXFlash. So stay tuned.

Comments?

Fall SNWUSA 2013

Here’s my thoughts on SNWUSA which occurred this past week in the Long Beach Convention Center.

First, it was a great location. I saw a number of users I haven’t seen at SNWUSA ever before, some of which I have known for years from other (non-storage) venues.

Second, the exhibit hall was scantly populated. There were no major storage vendors at the show at all. Gold sponsors included NEC, Riverbed, & Sepaton, representing the largest exhibiters presenn. Making up the next (Contributing) tier were Western Digital, Toshiba, Active Archive Alliance, and LTO consortium with a smattering of smaller companies.  Finally, there were another 12 vendors with kiosks around the floor, with the largest there being Veeam Software.

I suspect VMWorld Europe happening the same time in Barcelona might have had something to do with the sparse exhibit floor but the trend has been present for the past few shows.

That being said there were still a few surprises in store, at least for me.  Two of the most interesting ones were:

  • Coho Data who came out of stealth with a scale out, RAIN (Redundant array of independent nodes) based storage cluster, with distributed, mirrored customer data across nodes and software defined networking. They currently support NFS for VMware with a management UI reminiscent of IOS 7 sans touch support. The product comes as a series of nodes with SSDs, disk storage and SDN. The SDN allows Coho Data to relocate front-end (client) connections to where the customer data lies. The distributed, mirrored backend storage provides redundancy in the case of a node/disk failure, at which time the system understands what data is now at risk and rebuilds the now-mirorless data onto other nodes. It reminds me a lot of Bycast/Archivas like architectures, with SDN and NFS support. I suppose the reason they are supporting VMware VMDKs is that the files are fairly large and thus easier to supply.
  • Cloud Physics was not exhibiting but they sponsored a break. As such, they were there talking with analysts and the press about their product. Their product installs as a VMware VM service and propagates VMware management agents to ESX servers which then pipe information back to their app about how your VMware environment is running, how VMs are performing, how your network and storage are performing for the VMs running, etc. This data is then sent to the cloud, where it’s anonymized. In the cloud, customers can use apps (called Cards) to analyze this data in the cloud, which can help them understand problem areas, predict what configuration changes can do for them, show them how VMs are performing, etc. It essentially is logging all this information to the cloud and providing ways to analyze the data to optimize your VMware environment.

Coming in just behind these two was Jeda Networks with their Software Defined Storage Network (SDSN). They use commodity (OpenFlow compatible) 10GbE switches to support a software FCoE storage SAN. Jeda Networks say that over the past two years,  most 10GbE switch hardware have started to support DCB in hardware and with that in place, plus OpenFlow compatibility, they can provide a SDSN on top of them just by emulating a control layer for FCoE switches. Of course one would still need FCoE storage and CNAs but with that in place one could use much cheaper switches to support FCoE.

CloudPhysics has a subscription based pricing model which offers three tiers:

  • Free where you get their Vapp, the management agents and a defined set of Free Card Apps for no cost;
  • Standard level where you get all the above plus a set of Card Apps which provide more VMware managability for $50/ESX server/Month; and
  • Enterprise level where you get all the above plus all the Card Apps presently available for $150/ESX server/Month.

Jeda networks and Coho Data are still developing their pricing and had none they were willing to disclose.

One of the CloudPhysics Card apps could predict how certain VMs would benefit from host based (PCIe or SSD) IO caching. They had a chart which showed working set inflection points for (I think) one VM running an OLTP application.  I have asked for this chart to discuss further in a future post.  But although CloudPhysics has the data to produce such a chart, the application shows three potential break points where say adding 500MB, 2000MB or 10000MB of SSD cache can speed up application performance by 10%, 30% or 50% (numbers here made up for example purposes and not off the chart they showed me).

A few other companies made announcements at the show. For example, Sepaton announced their new VirtuoSO, scale out hybrid reduplication appliance.

That’s about it. I would have to say that SNW needs to rethink their business model, frequency of stows or what they are trying to do at their conferences. However, on the plust side, most of the users I talked with came away with a lot of information and thought the show was worthwhile and I came away with a few surprises.

~~~~

Comments?