Storage architecture – Silverton Consulting

Skyrmion and chiral bobber solitons for racetrack storage

Posted on July 7, 2018 by Ray in R&D measures, SSD storage, Storage architecture, Strategic Inflection Points, System effectiveness, Visionary leadership

Read an article this week in Science Daily (Magnetic skyrmions: Not the only one of their class; …) about new magnetic structures that could lend themselves to creating a new type of moving, non-volatile storage. (There’s more information in the press release and the Nature paper [DOI: 10.1038/s41565-018-0093-3], behind a paywall).

Skyrmions and chiral bobbers are both considered magnetic solitons, types of magnetic structures only tens of nm wide, that can move around in a sort of racetrack configuration.

Delay line memories

Early in computing history, there was a type of memory called a delay line memory which used various mechanisms (mercury, magneto-resistance, capacitors, etc.) arranged along a circular line such as a wire, and had moving pulses of memory that raced around it.

One problem with delay line memory was that it was accessed sequentially, unlike core memory which could be accessed randomly. When using delay lines to change a bit, one had to wait until the bit came under the read/write head. It usually took microseconds for a bit to rotate around the memory line, and delay line memories had a capacity of a few thousand bits (256-512 bytes per line, in today’s vernacular).
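
To make that sequential access penalty concrete, here’s a minimal sketch of the arithmetic. The line length and per-bit shift time are made-up illustrative values (assumptions on my part), not figures from any particular machine.

```python
def average_wait_us(bits_on_line: int, bit_time_us: float) -> float:
    """Average time for a randomly chosen bit to reach the read/write head.

    Assumes the wanted bit is equally likely to be 0..N-1 positions away
    and that the line shifts one position every bit_time_us microseconds.
    """
    return (bits_on_line - 1) / 2 * bit_time_us


def worst_case_wait_us(bits_on_line: int, bit_time_us: float) -> float:
    """The wanted bit has only just passed the head: N-1 shifts to go."""
    return (bits_on_line - 1) * bit_time_us


# Hypothetical 4096 bit (512 byte) line shifting one bit every 0.5 usec.
print(average_wait_us(4096, 0.5), worst_case_wait_us(4096, 0.5))
```

The same arithmetic would apply to any racetrack style memory: the wait scales with the number of bits on the track times the time to shift one position.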

Delay lines predate computers and had been used for decades to delay any electronic or acoustic signal before retransmission.

A new racetrack

Solitons are being investigated to be used in a new form of delay line memory, called racetrack memory. Skyrmions had been discovered a while ago but the existence of chiral bobbers was only theoretical until researchers discovered them in their lab.

Previously, the thought was that one would encode digital data with only skyrmions and spaces. But the discovery of chiral bobbers, and the fact that they can co-exist with skyrmions, means that chiral bobbers and skyrmions can be used together in a racetrack fashion to record digital data. And the fact that both can move or migrate through a material makes them ideal for racetrack storage.

Unclear whether chiral bobbers and skyrmions only have two states or more but the more the merrier for storage. I am assuming that bit density or reliability is increased by having chiral bobbers in the chain rather than spaces.

Unlike disk devices with both rotating media and moving read-write heads, the motion of skyrmion-chiral bobber racetrack storage is controlled by a very weak pulse of current and requires no moving/mechanical parts prone to wear/tear. Moreover, as a solid state device, racetrack memory is not sensitive to induced/organic vibration or shock. So, theoretically these devices should have higher reliability than disk devices.

There was no information comparing the new racetrack memory reliability to NAND or 3D XPoint/PCM SSDs, but there may be some advantage here as well. I suppose one would need to understand how to miniaturize the read-erase-write head to the right form factor for nm racetracks to understand how it compares.

And I didn’t see anything describing how long it takes to rotate through bits on a skyrmion-chiral bobber racetrack. Of course, this would depend on the number of bits on a racetrack, but some indication of how long it takes one bit to move one position on the racetrack would be helpful to see what its rotational latency might be.

~~~~

At the moment, reading and writing skyrmions and the newly discovered chiral bobbers takes a lot of advanced equipment and is only done in major labs. As such, I don’t see a skyrmion-chiral bobber racetrack storage device arriving on my desktop anytime soon. But the fact that there’s a long way to go before we run out of magnetic storage options, even if it is on a chip rather than magnetic media, is comforting to know, even if we don’t ever come up with an economical way to produce it.

I wonder if you could synchronize rotational timing across a number of racetrack devices. That way you could be reading/erasing/writing a whole byte, word, double word, etc. at a time, rather than a single bit.
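
Here’s a toy model of that idea, just to show what I mean. It’s purely a sketch: the RacetrackBank class is hypothetical, each track is simulated with a Python deque, and one track holds one bit position of every byte. Real devices would obviously look nothing like this.

```python
from collections import deque


class RacetrackBank:
    """Toy model: `width` racetracks shifted in lockstep under fixed heads."""

    def __init__(self, data: bytes, width: int = 8):
        # Track i carries bit i of every stored byte.
        self.tracks = [deque((b >> i) & 1 for b in data) for i in range(width)]

    def shift(self, positions: int = 1) -> None:
        # Every track moves together, so the heads always see the same word.
        for track in self.tracks:
            track.rotate(-positions)

    def read_word(self) -> int:
        # One bit from under each head, reassembled into a byte.
        return sum(track[0] << i for i, track in enumerate(self.tracks))


bank = RacetrackBank(b"Hi!")
print(chr(bank.read_word()))  # first byte under the heads
bank.shift()
print(chr(bank.read_word()))  # next byte, after one synchronized shift
```

With the tracks in lockstep, the rotational wait is paid once per word rather than once per bit.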

Comments?

Photo Credit(s): From Experimental observation of chiral magnetic bobbers in B20 Type FeGe paper

From Timeline of computer history Magnetoresistive delay lines

Huawei presents OceanStor architecture at SFD15

Posted on May 21, 2018 by Ray in Block Storage, Clustered storage, Data compression, Data reduction, IOPS, LRT, NVMe storage, SPC-1, SSD storage, Storage architecture, Storage Features, Storage performance

At Storage Field Day 15 (SFD15) we had a few sessions with Huawei on some of their latest storage technology. One of the sessions I was particularly interested in was an architectural deep dive on OceanStor Dorado (enterprise class, block storage) with Chun Liu (see video here).

Their latest OceanStor Dorado 18000F storage system, due out soon, can scale up to 16 controllers in a cluster, supporting all flash storage configurations. The new Dorado 18000F block storage system supports inline compression and deduplication for data reduction.

The latest SPC-1 performance showed 800K IOPS at 500usec response time with dedupe and inline compression turned on. Although, it’s unclear whether SPC-1 data is deduplicable or compressible. So this may have hurt them with no corresponding advantage in capacity or cost.

System architecture

Chun had one chart that said historically, as you add storage system features, you often lose 70-80% performance. However, with their implementation using shards of metadata/other data structures and not using (as much) serialization, they have managed to add features without serious performance impact. In fact with the latest architecture, using RAID-TP (3 parity), inline compression, inline deduplication and metro cluster, they lose only about 20% of their baseline system performance. Although, if the metro cluster they’re using is synchronous replication, it must not be that far away.

They have a pretty standard protocol layer at the top, with replication, snapshot and LUN management below that and a cache layer next. Then it gets interesting: they have a distributed object router layer, with deduplication/compression and metadata management underneath that, and then the data layout. Infrastructure (backend) is at the bottom, and inter-cluster communications span the cluster of controllers. Every enclosure has 2 controllers and inter-cluster communications is over switched PCIe. SSDs can be NVMe or SAS.

IO without serialization

They support a log structured file system on the back end but not just one log. Their internal architecture is a shared nothing approach which shards metadata, fingerprint databases, logs, and other data. Each of these shards is assigned a CPU core/thread affinity and, as long as nothing goes wrong, the storage code operates on shards with no serialization required.

To maximize IO performance they use a lightweight thread (LWT) compute model that’s non-preemptive. They partition all data structures into fine shards, and each metadata shard is assigned a core/thread affinity. That way they can share nothing across compute threads, resulting in lock free execution. The LWT runs beginning to end, without preemption, to complete any data updates required and minimize any contention.
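
A minimal sketch of how I picture the shared nothing, run to completion idea. None of this is Huawei’s code: the Shard and ShardedStore classes are hypothetical, CRC32 is just a stand-in routing hash, and the "cores" are simulated by draining each shard’s inbox in turn.

```python
import zlib
from collections import deque


class Shard:
    """Owns its metadata outright; only its own worker ever touches it."""

    def __init__(self):
        self.metadata = {}
        self.inbox = deque()

    def run(self):
        # Each task runs beginning to end with no preemption, so no lock
        # is needed around self.metadata.
        while self.inbox:
            key, value = self.inbox.popleft()
            self.metadata[key] = value


class ShardedStore:
    def __init__(self, n_shards: int):
        self.shards = [Shard() for _ in range(n_shards)]

    def shard_for(self, key: bytes) -> int:
        # Deterministic routing: the same key always lands on the same shard
        # (and so, in a real system, the same core/thread).
        return zlib.crc32(key) % len(self.shards)

    def submit(self, key: bytes, value) -> None:
        self.shards[self.shard_for(key)].inbox.append((key, value))

    def run_all(self) -> None:
        for shard in self.shards:  # stand-in for one pinned worker per shard
            shard.run()

    def lookup(self, key: bytes):
        return self.shards[self.shard_for(key)].metadata.get(key)


store = ShardedStore(4)
store.submit(b"lun7:block42", "physical-addr-1")
store.run_all()
print(store.lookup(b"lun7:block42"))
```

The point of the design is that two updates can only conflict if they route to the same shard, and within a shard they are processed one after the other anyway.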

IO flow

Write flow: the system receives data in cache, mirrors it to the adjacent controller’s cache and then responds back to the host. Controller cache is battery backed up, non volatile storage.

The cache data is then compressed and, with deduplication active, fingerprinted. Data fingerprints are used to determine which fingerprint database shard (and subsequent core/thread) to route the data to for further processing. They also compare any matched fingerprinted data to the unique data already stored, because of their “weak” fingerprint hash. If the data is unique, it’s routed to the LUN mapping shard (and subsequent core/thread) to calculate a physical address to write the data. Sometime later the data is routed to RAID aggregation and written out to backend SSDs.

Read flow: when the request comes in, they check the LUN map shard (core/thread) and if it’s pointing to a fingerprint index they know it’s a deduped block and then read that data to respond to the read request.
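
The write and read flows above can be sketched roughly as follows. This is my own illustration, not Huawei’s implementation: DedupStore is a hypothetical class, CRC32 stands in for their “weak” fingerprint, zlib stands in for their compression, and I’m assuming the fingerprint is taken over the uncompressed data.

```python
import zlib


class DedupStore:
    def __init__(self, n_shards: int = 4):
        # One fingerprint table per shard: fingerprint -> list of block ids.
        self.fp_shards = [dict() for _ in range(n_shards)]
        self.lun_map = {}  # (lun, lba) -> block id
        self.blocks = {}   # block id -> compressed data
        self.next_id = 0

    def write(self, lun: int, lba: int, data: bytes) -> int:
        fp = zlib.crc32(data)                            # weak fingerprint
        shard = self.fp_shards[fp % len(self.fp_shards)]  # route by fingerprint
        for block_id in shard.get(fp, []):
            # A weak hash can collide, so compare the actual bytes.
            if zlib.decompress(self.blocks[block_id]) == data:
                break                                    # duplicate found
        else:
            block_id = self.next_id                      # unique data: store it
            self.next_id += 1
            self.blocks[block_id] = zlib.compress(data)
            shard.setdefault(fp, []).append(block_id)
        self.lun_map[(lun, lba)] = block_id
        return block_id

    def read(self, lun: int, lba: int) -> bytes:
        # The LUN map points at the (possibly shared) stored block.
        return zlib.decompress(self.blocks[self.lun_map[(lun, lba)]])


store = DedupStore()
first = store.write(0, 0, b"x" * 4096)
second = store.write(0, 1, b"x" * 4096)  # same contents, different address
print(first == second, len(store.blocks))
```

The byte compare is the price of using a cheap fingerprint: it costs an extra read on a match, but the hash itself is much faster to compute than a cryptographic one.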

Other optimizations

They have some specially designed, optimized code paths. For example, standard RAID TP algorithms perform RAID protection at 2.3GB/s or 4.5GB/s but Huawei OceanStor Dorado 18000F can perform triple RAID calculations at 6.5GB/s. Similarly, standard LZ4 data compression algorithms can compress data at ~507MB/s (on email) but Huawei’s data compression algorithm can perform compression (on email) at ~979MB/s. Ditto for CRC16 (used to check block integrity). Traditional CRC16 algorithms operate at ~2.3GB/s but Huawei can sustain ~7.2GB/s.

For data on SSDs, they identify data with a short life span (quickly overwritten) and try to coalesce this short lived data onto their own flash pages. That way all the data in a short life span flash page gets freed up together, and the page can then be overwritten without having to move old, non-deleted (long lived) data to new blocks. They claim to have reduced write amplification (non-new data block writes) by 60% this way.
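
A toy illustration of why that grouping helps. The block layouts below are invented for the example and say nothing about Huawei’s actual page sizes or their 60% figure.

```python
def pages_to_relocate(blocks):
    """Count long lived pages that garbage collection must copy elsewhere
    in order to reclaim every block holding expired short lived pages."""
    moved = 0
    for block in blocks:
        if "short" in block:               # block has reclaimable space...
            moved += block.count("long")   # ...but live data must move first
    return moved


# Four hypothetical flash blocks of four pages each, same data either way.
mixed = [["short", "long", "short", "long"]] * 4
separated = [["short"] * 4] * 2 + [["long"] * 4] * 2

print(pages_to_relocate(mixed), pages_to_relocate(separated))
```

Every relocated page is a write the host never asked for, which is exactly what write amplification measures.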

Also LUNs can be configured as throughput optimized or IOPs optimized. Unclear how, but it probably has something to do with cache management and backend layout.

~~~~

Overall, I was impressed with their capabilities to reduce serialization bottlenecks. Back in the old days, when I was looking for how to optimize code, we always seemed to be spending 30-50% of CPU compute spinning on locks, waiting to obtain a lock before the system could continue the code execution.

It never occurred to me we didn’t have to use locks at all.

For more information, please read these other SFD15 blogger posts on Huawei:

Dorado – All about Speed – Storage Gaga, Chin-Fah Heoh (@StorageGaga)
Huawei – Probably Not What You Expected, Dan Firth (@PenguinPunk)

Western Digital at SFD15: ActiveScale object storage

Posted on May 5, 2018 by Ray in data protection, Distributed computing, Object storage, Storage architecture, Storage archive, Storage availability

Phil Bullinger and his staff from Western Digital presented at Storage Field Day 15 (SFD15) on a number of their enterprise products, including Tegile and IntelliFlash, but the one that caught my interest was their ActiveScale object store, acquired from Amplidata back in 2015.

ActiveScale is an onprem, object storage system that provides cloud-like economics for customer data.

ActiveScale Hardware

ActiveScale systems can both scale up and scale out within a single site. ActiveScale systems have both storage and system nodes. Storage nodes perform erasure coding and System nodes are control points and metadata managers for the object store.

ActiveScale comes in two appliance configurations that contain both the storage and system nodes and the storage required. The two appliances are:

ActiveScale P100 is a 7U 720TB pod system. A full rack of P100s can read 8GB/sec and can have 17-9s data availability. The P100 can scale up to 2.1PB in a single rack and up to 18PB in the same namespace. The P100 is a higher performing solution with better performing storage and system nodes.
ActiveScale X100 is a 42U rack scale solution that holds up to 588 12TB drives or 5.8PB per rack. The X100 can scale up to 9 racks or 52PB in the same namespace. The X100 is a denser configuration with only 6 storage nodes and as such, has a better $/GB than the P100 above.

As WDC is both the supplier of the ActiveScale appliance and a supplier of disk storage they can be fairly aggressive with pricing on appliance systems.

Data integrity in ActiveScale

They make a point of saying that ActiveScale object metadata and data are stored separately. By separating data and metadata, they claim to be more resilient to system failures. Object metadata is 3 way replicated, in a replicated database, residing in system nodes. Other object systems often store metadata and object data in the same way.

Object data can be erasure coded. That is, object data is chunked, erasure coding protected and then spread across multiple disk drives for data protection. ActiveScale erasure coding is called BitSpread. With BitSpread customers identify the number of disk drives to spread object data across and the number of drive failures the system should recover from without data loss.

A typical BitSpread configuration splits object data into 18 chunks and spreads these chunks across storage columns. A storage column is from 6-18 storage nodes. There’s no pre-allocated space in BitSpread. Object data chunks are allocated to disk storage based on current capacity and performance of the system, within redundancy constraints.
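
For a feel of what spreading chunks across drives costs in raw capacity, here’s a small sketch. It assumes an ideal erasure code where any (total minus tolerated failures) chunks are enough to rebuild the object. The 5 failure tolerance in the example is my own illustrative pick; the session only mentioned the 18 chunk spread.

```python
from fractions import Fraction


def raw_per_usable(total_chunks: int, tolerated_failures: int) -> Fraction:
    """Raw capacity consumed per unit of object data, for an ideal code
    that survives `tolerated_failures` lost chunks out of `total_chunks`."""
    data_chunks = total_chunks - tolerated_failures
    if data_chunks <= 0:
        raise ValueError("must tolerate fewer failures than there are chunks")
    return Fraction(total_chunks, data_chunks)


# Hypothetical policy: 18 chunks, any 5 of which can be lost.
print(float(raw_per_usable(18, 5)))
# Three full copies, seen the same way: 3 "chunks", any 2 can be lost.
print(float(raw_per_usable(3, 2)))
```

The comparison with plain replication is the interesting part: more failures tolerated for well under half the raw capacity.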

In addition, ActiveScale has a background task called BitDynamics that scans erasure coded chunks and does a mathematical health check on the object data. If a chunk is bad, the object data chunk can be recovered and re-erasure coded back to proper health.

WDC performance testing shows that BitDynamics has 0 performance degradation when performing re-erasure coding. Indeed, they took out 98 drives in an ActiveScale cluster, BitDynamics re-coded all that data onto other disk drives and they detected no performance impact. No indication how long re-encoding 98 disk drives of data took nor the % of object store capacity utilization at the time of the test, but presumably there’s a report someplace to back this up.

Unlike many public cloud based object storage systems, ActiveScale is strongly consistent. That is, object puts (writes) are not acknowledged to the entity doing the put until the object metadata and object data are properly and safely recorded in the object store.

ActiveScale also supports 3 site erasure coding. GeoSpread is their approach to erasure coding across sites. In this case, object metadata is replicated across 3 system nodes across the sites. Object data and erasure coded information is split into 20 chunks which are then spread across the three sites. This way if any one site goes down, the other two sites have sufficient metadata, object data chunks and erasure coded information to reconstruct the data.
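
The site loss arithmetic can be sketched as below. I’m assuming the chunks are spread as evenly as possible and that the code is ideal (any k chunks rebuild the object); WDC didn’t give the actual data/parity split, so this only bounds it.

```python
def spread(n_chunks: int, n_sites: int) -> list:
    """Spread chunks across sites as evenly as possible."""
    base, extra = divmod(n_chunks, n_sites)
    return [base + (1 if i < extra else 0) for i in range(n_sites)]


def max_data_chunks(n_chunks: int, n_sites: int) -> int:
    """Most chunks an object may need for reconstruction if losing the
    fullest single site must still leave enough chunks behind."""
    return n_chunks - max(spread(n_chunks, n_sites))


print(spread(20, 3), max_data_chunks(20, 3))
```

In other words, with 20 chunks over 3 sites, the object has to be recoverable from whatever the two surviving sites hold, and the rest of the chunks are effectively the redundancy paying for site protection.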

ActiveScale 5.2 now supports asynch replication. That is, any one ActiveScale cluster can replicate to any other ActiveScale cluster located continental distances away.

Unclear how GeoSpread and asynch replication would interact together, but my guess is that each of the 3 GeoSpread sites could be asynchronously replicated to 3 other sites for maximum redundancy.

Both GeoSpread and ActiveScale replication impact performance, depending on how far the sites are from one another and the speed and bandwidth of the links between sites.

ActiveScale markets

ActiveScale’s biggest market is media and entertainment (M&E), mostly used for media archive or tape replacement/augmentation. WDC showed one customer case study for the Montreux Jazz Festival, which migrated 49 years of performance videos up to ActiveScale and can now stream any performance, on request, without delay. Montreux media is GeoSpread across 3 sites in France. Another option is to perform transcoding on the object media in realtime and stream the transcoded media.

Another large market is Bio/Life Sciences. Medical & biological scanners are transitioning to higher resolution scans which take more data space. And this sort of medical information needs to be kept a long time.

Data analytics on ActiveScale

One other emerging market is data analytics. With the new S3A (S3 adapter), Hadoop clusters can now support object storage as a 2nd tier. One problem with data analytics is that it involves lots of data, and storing it in triplicate costs an awful lot.

In the big data world, datasets can get very large very quickly. Indeed, PB sized data sets aren’t that unusual, and with triple replication (in native HDFS) the raw capacity needed is three times that. When HDFS runs out of space you have to delete data. Before S3A, the only way to add capacity was to scale out (with compute, storage and networking).

Using Hadoop’s S3A, ActiveScale can provide cold archive for data analytics. From a Hadoop user/application perspective, S3A ActiveScale storage looks like just another directory under HDFS (Hadoop Distributed File System). You can run MapReduce or other Hadoop applications directly against object buckets. But a more realistic approach is to move inactive or cold data from a disk resident HDFS directory to an S3A directory.

HDFS and MapReduce are tightly coupled and were designed to have data close to where computation happens. So, as long as the active data or working set data is on HDFS disk storage or directly in memory the rest of the (inactive) data could all be placed on S3A object storage. Inactive data is normally historical data no longer being actively analyzed while newer data would be actively analyzed. Older, inactive data can be manually or automatically archived off to S3A. With HIVE you can partition your database to have active data in HDFS disk storage and inactive data in S3A.
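
As a back of the envelope sketch of what that tiering buys, here is the raw capacity arithmetic. The 1.4x object store overhead and the 20% hot fraction are assumptions I picked for illustration; only the 3 copies for native HDFS comes from the discussion above.

```python
def raw_capacity_tb(dataset_tb: float, hot_fraction: float,
                    hdfs_copies: int = 3,
                    object_overhead: float = 1.4) -> float:
    """Raw TB needed when the hot fraction stays on replicated HDFS disk
    and the rest is archived to an erasure coded object store."""
    hot = dataset_tb * hot_fraction * hdfs_copies
    cold = dataset_tb * (1 - hot_fraction) * object_overhead
    return hot + cold


# Hypothetical 1PB (1000TB) dataset.
print(raw_capacity_tb(1000, 1.0))  # everything kept on HDFS
print(raw_capacity_tb(1000, 0.2))  # only the working set kept on HDFS
```

The smaller the working set relative to the history you keep, the bigger the gap between the two numbers.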

Another approach applies if the active, working set data can all fit directly in memory. In that case the data can reside on S3A object storage. This way the data is read from S3A storage into memory, analyzed there, and the output is written back to the object store or HDFS disk. Because the data is only read (loaded) once, there’s only a minimal performance penalty to use S3A storage.

Western Digital is an active contributor to Hadoop S3A and have recently added performance improvements to S3A, such as better caching, partial object reading, and core XML performance tuning options.

~~~~
If you’re interested in learning more about Western Digital ActiveScale, check out the videos referenced earlier and their website.

Also you may be interested in these other posts on the WD sessions at SFD15:

The A is for Active, The S is for Scale by Dan Firth (@PenguinPunk)

Comments?

BlockStack, a Bitcoin secured global name space for distributed storage

Posted on July 16, 2016 by Ray in Cloud storage, Data grid, Data integrity, Object storage, Storage architecture, storage scalability

At USENIX ATC conference a couple of weeks ago there was a presentation by a number of researchers on their BlockStack global name space and storage system based on the blockchain based Bitcoin network. Their paper was titled “Blockstack: A global naming and storage system secured by blockchain” (see pg. 181-194, in USENIX ATC’16 proceedings).

Bitcoin blockchain simplified

Blockchains like Bitcoin have a number of interesting properties, including a completely distributed understanding of current state, based on hashing and an always appended to log of transactions.

Blockchain nodes all participate in validating the current block of transactions and some nodes (deemed “miners” in Bitcoin) supply new blocks of transactions for validation.

All blockchain transactions are sent to each node, and blockchain software in the node timestamps the transactions and accumulates them in an ordered append log (the “block”), which is then hashed. Each new block contains a hash of the previous block (the “chain” in blockchain).

The miner’s block is then compared against the non-miner nodes’ blocks (hashes are compared) and if they are equal, everyone reaches consensus (agrees) that the transaction block is valid. Then the next miner supplies a new block of transactions, and the process repeats. (See Wikipedia’s article for more info).
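
The chaining part is easy to sketch. This is a bare bones illustration of hash linked blocks only: it leaves out mining/proof of work, timestamps, signatures and networking, and the block layout is my own invention rather than Bitcoin’s actual format.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    # Hash a canonical serialization so every node computes the same value.
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()


def append_block(chain: list, transactions: list) -> None:
    # Each new block records the hash of the block before it.
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "transactions": transactions})


def chain_is_valid(chain: list) -> bool:
    # Any node can re-hash each block and compare against its successor.
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))


chain = []
append_block(chain, ["alice pays bob 1"])
append_block(chain, ["bob pays carol 1"])
append_block(chain, ["carol pays dave 1"])
print(chain_is_valid(chain))

chain[0]["transactions"][0] = "alice pays mallory 1"  # tamper with history
print(chain_is_valid(chain))
```

Changing any old transaction changes that block’s hash, which no longer matches what the next block recorded, and that’s what makes the log effectively append only.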

All blockchain transactions are owned by a cryptographic address. Each cryptographic address has a public and private key associated with it.

Primary Data’s path to better data storage presented at SFD8

Posted on November 5, 2015 by Ray in Block Storage, DAS, data access, data mobility, File Storage, NFS, Storage architecture, Storage virtualization, Strategic Inflection Points

A couple of weeks ago we met with Primary Data’s Lance Smith, CEO, David Flynn, CTO, and Kaycee Lai, SVP Product & Sales, who were presenting at Storage Field Day 8 (SFD8, videos of their sessions available here). Primary Data emerged out of stealth late last year and has ~$60M in funding. Also they have Steve Wozniak (of Apple fame) as Chief Scientist, but he wasn’t at the SFD8 session 🙁

Primary Data seems out to change the world. At first I thought this was just another form of storage virtualization but they are laser focused on data virtualization or what they call data mobility. It differs from pure storage virtualization by being outside the data path. (I have written about data virtualization before as well as the data hypervisor a long time ago). Nowadays they seem to be using the tag line of data in motion.

Why move data?

David has a theory behind the proliferation of startup storage companies. The spectrum behind capacity and performance has gotten immense, over time, which has provided an opening for a number of companies to address these widening needs.

David believes that caching at the storage system or in the servers is an attempt to address this issue by “loaning” the data from the storage silo to the cache. This is trying to supply a lower cost $/IOP for the data. Similar considerations are apparent at the other side, where customers use archive or backup services to take advantage of much cheaper $/GB storage.

However, given the difficulty of moving data around in present day storage environments, customer data has become essentially immobile. Primary Data is trying to bring about a data mobility revolution and allow data to move over this spectrum of performance and capacity of storage with ease. Doing so easily, will provide significant benefits as customers can more fully take advantage of the various levels of performance and capacity in their data center storage environments.

Primary Data architecture

Primary Data is providing data mobility by using their meta-data service called the DataSphere appliance and their client software running on host servers called the Data Portal. Their offering can be best explained in three layers:

Data virtualization layer – provides continuity of identity and continuity of access across multiple physical storage systems. That is the same data (identity continuity) can be accessed wherever it resides (access continuity) by server applications. Such access and identity must transcend access protocols and interfaces. The Data Portal client software intercepts the server data activity and does control plane activity using the DataSphere appliance and performs IO directly using the physical storage.
Objective based data management – supplies a data affinity service. That is, data can have a temporary location relationship with physical storage depending on the current performance (R:W, IOPS, bandwidth, latency) and protection (durability, availability, disaster recoverability, security, copy-ability, version-ability) characteristics of the data. These data objectives are matched to the capabilities or service catalog of the storage infrastructure, and data objectives can change over time.
Analytics in the loop – detects the performance and other characteristics of the storage and data in real-time. That is by monitoring the storage IO activity Primary Data can determine the actual performance attribute of the storage. Similarly, by monitoring the applications IO characteristics over time the system can determine the performance objectives of its data. The system also takes advantage of SMI-S to define some of the other characteristics of the storage systems.
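
To illustrate the objective matching idea in the second layer, here’s a minimal sketch. Everything in it is hypothetical: the Tier fields, the catalog entries and the "cheapest tier that meets the objectives" rule are my assumptions, not Primary Data’s actual policy engine.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tier:
    name: str
    iops: int            # what the storage can deliver
    latency_ms: float
    cost_per_gb: float


def place(objective_iops: int, objective_latency_ms: float,
          catalog: List[Tier]) -> Optional[str]:
    """Pick the cheapest tier whose capabilities satisfy the data's objectives."""
    ok = [t for t in catalog
          if t.iops >= objective_iops and t.latency_ms <= objective_latency_ms]
    return min(ok, key=lambda t: t.cost_per_gb).name if ok else None


catalog = [
    Tier("flash", 100_000, 0.5, 1.00),
    Tier("disk", 2_000, 8.0, 0.10),
    Tier("archive", 100, 50.0, 0.02),
]

print(place(50_000, 1.0, catalog))  # a busy database
print(place(10, 100.0, catalog))    # the same data, once it has gone cold
```

When the data’s objectives change over time, re-running the match gives a new home for it, and that move is what the data virtualization layer is supposed to make painless.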

How does Primary Data work?

Primary Data has taken advantage of parallel NFS extensions (pNFS) in NFSv4 to externalize and separate the storage control plane from the IO data plane. This works well for native Linux where the main developer of the Linux file system stack is on their payroll.

In Windows they put a filter driver in front of SMB to split off the control from data IO plane. Something similar is done for VMware ESX servers to supply the control-data plane split but in this case there is a software defined Data Portal that goes along with the DataSphere Service client that can do it all within the same ESX server. Another alternative exists and that is to use the Data Portal appliance as a storage virtualization service but then the IO data path goes through the portal.

According to their datasheet they currently support data virtualization services for NetApp cDOT and 7-mode, EMC Isilon OneFS7.2, and Nexenta 4.x&5.0 but plan on more.

They are not quite GA yet, but are close.

Comments?

EMCWorld2015 Day 2&3 news

Posted on May 6, 2015 by Ray in Block Storage, Cloud services, Cloud storage, Clustered storage, Information economy, Object storage, PCIe, Storage, Storage architecture, Storage performance, Strategic Inflection Points, System effectiveness

Some additional news from EMCWorld2015 this week:

EMC announced directed availability for DSSD, their Rack scale shared Flash storage solution using a PCIe3 (switched) fabric with 36 dual ported, flash modules, which hold 512 NAND chips for 144TB NAND flash storage. On the stage floor they had a demonstration pitting a 40 node Hadoop cluster with DAS against a 15 node Hadoop cluster using the DSSD, both running HIVE and working on the same Query. By the time the 40node/DAS solution got to about 2% of the query completion the 15node/DSSD based cluster had finished the query without breaking a sweat. They then ran an even more complex query and it took no time at all.

They also simulated a copy of a 4TB file (~32K-128K IOs) from memory to memory and it took literally seconds. Then they copied it to SSD, which took considerably longer (didn’t catch how long but much longer than memory), and then they showed the same file copy to DSSD and it only took seconds; it looked just a smidgen slower than the memory to memory copy.

They said the PCIe fabric (no indication what the driver was) provided so much more parallelism to the dual ported flash storage that the system was almost able to complete the 4TB copy at memory to memory speeds. It was all pretty impressive, albeit a simulation of the real thing.

EMC indicated that they designed the flash modules themselves and expect to double capacity of the DSSD to 288TB shortly. They showed the controller board that had a mezzanine board over a part of it, but together had 12 major chips on it which I assume had something to do with the PCIe fabric. They said there were two controllers in the system for high availability and the 144TB DSSD was deployed in 5U of space.

I can see how this would play well for real time analytics, high frequency trading and HPC environments but there’s more to shared storage than just speed. Cost wasn’t mentioned, and neither was the software driver, but with the ease with which it worked on the Hive query, I can only assume at some level it must look something like a DAS device but with memory access times… NVMe anyone?

Project CoprHD was announced, which open sourced EMC’s ViPR Controller software. Many ViPR customers were asking for EMC to open source ViPR Controller, and apparently they’re listening. Hopefully this will enable some participation from non-EMC storage vendors to allow their storage to be brought under the management of ViPR Controller. I believe the intent is to have an EMC hardened/supported version of Project CoprHD or ViPR Controller to coexist with the open source project version which anyone can download and modify for themselves.

A Non-production, downloadable version of ScaleIO was also announced. The test-dev version is a free download with unlimited capacity, full functionality and available for an unlimited time but only for non-production use. Another of the demos onstage this morning was Chad configuring storage across a ScaleIO cluster and using its QoS services to limit the impact of a specific workload. There was talk that ScaleIO was available previously as a free download but it took a bunch of effort to find and download. They have removed all these prior hindrances and soon, if not today it’s freely available for anyone. ScaleIO runs on VMware and other hypervisors (maybe bare metal as well). So if you wanted to get your feet wet with software defined storage, this sounds like the perfect opportunity.

ECS is being added to EMC’s Data Lake foundation. Not exactly sure what are all the components in the data lake solution but previously the only Data Lake storage was Isilon based. This week EMC added Elastic Cloud Storage to the picture. Recall that Elastic Cloud Storage comes in either a software only or hardware appliance deployment and provides object storage.

I missed Project Liberty before but it’s a virtual VNX appliance, software only version. I assume this is intended for ROBO deployments or very low end business environments. Presumably it runs on VMware and has some sort of storage limitations. It seems, more and more of EMC products are coming out in virtual appliance versions.

Project Falcon was also announced, which is a virtual Data Domain appliance, software only solution, targeted for ROBO environments and other small enterprises. The intent is to have an onramp for Data Domain backup storage. I assume it runs under VMware.

Project Caspian – rolling out CloudScaling orchestration/automation for OpenStack deployments. On the big stage today, Chad and Jeremy demonstrated Project Caspian on a VCE VxRACK deploying racks of servers under OpenStack control. They were able, within a couple of clicks, to define and deploy OpenStack on bare metal hardware and deploy applications to the OpenStack servers. They had a monitoring screen which showed the OpenStack server activity (transactions) in real time and showed an over commit of the rack and how easy it was to add a new rack with more servers. All this seemed to take but a few clicks. The intent is not to create another OpenStack distribution but to provide an orchestration/automation/monitoring layer of software on top of OpenStack to “industrialize OpenStack” for enterprise users. Looked pretty impressive to me.

I would have to say the DSSD box was most impressive. It would have been interesting to get an upclose look at the box with some more specifications but they didn’t have one on the Expo floor.

VMware VVOLs potential performance problems

Posted on February 16, 2015 by Ray in Block Storage, FC, Server virtualization, Storage, Storage architecture, Storage performance

We discussed vSphere 6 VVOLs (Virtual Volumes) on this month’s GreyBeardsonStorage (GBoS) podcast with Howard Marks (@DeepStorageNet) and Satyam Vaghani (@SatyamVaghani, “Father of VVOLs”, CoFounder & CTO of PernixData).

VVOLs queue depth problem?

One performance problem from my perspective is that all VVOL FC IO is now funneled through a single Protocol Endpoint (PE) LUN for a single storage system. There may be some potential queue depth issues, but Satyam and Howard both said that queue depths have been greatly increased over the last decade or so and this shouldn’t be a problem, as long as you’re configured properly.

What about VVOL PEs on ALUA storage?

In an ALUA (Asymmetrical Logical Unit Access) Active/Passive, dual controller storage system, a set of LUNs is assigned to one controller, the “active” side of an Active/Passive ALUA storage system. Many ALUA vendors now support “Active/Active” configurations such that 1/2 the LUNs are assigned to one side and the other 1/2 assigned to the other side, for an Active/Passive & Passive/Active pair or Active/Active configuration.

So, ALUA storage systems have a LUN “allegiance” to a controller. If this continues to be the case under VVOLs, then a PE would only be processed by one side of an ALUA dual controller system, effectively reducing the horsepower available to process VVOL IO to 1/2 of an ALUA storage system.

Now just because there is a LUN allegiance in ALUA storage doesn’t necessarily mean that all internal IO processing for a LUN is done on only one controller. But historically that has been the case. For instance, during an ALUA system non-disruptive code update, an “active” ALUA side must “failover” its LUNs to the other side to provide continuous IO activity, while the formerly active ALUA side is taken down and updated with new code.

Potential solutions to ALUA PE performance?

One way to get around the VVOL ALUA performance problem is to have multiple PEs in a single storage system for the same vSphere Cluster VVOLs. I don’t know of anything that would inhibit a storage system from supporting multiple PEs today; they already need to support multiple PEs for multiple vSphere clusters. Also, a VMware vSphere cluster must support multiple PEs for multiple storage systems.

I am also not aware of any VASA 2.0 requirement that restricts the number of PEs for a storage system’s support of a single vSphere cluster. But I could be mistaken here. So there should be nothing to inhibit multiple PEs from the same ALUA storage system to the same vSphere cluster.

Of course, this means an ALUA storage system’s VVOLs would need to be divided across its ALUA PEs.
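To make the multiple-PE idea concrete, here is a toy Python sketch that compares a single PE owned by one controller against one PE per controller. Everything in it is an assumption for illustration: the names (`assign_vvols`, `PE-0`, `ctrl-A`) are hypothetical, and the round-robin split is just one simple policy. How VVOLs actually get bound to PEs is up to the storage system and its VASA provider.

```python
from collections import Counter


def assign_vvols(vvols, pe_to_controller):
    """Spread VVOLs round-robin across the available PEs (hypothetical policy).

    pe_to_controller maps each PE name to the controller that owns it,
    mimicking ALUA LUN allegiance. Returns {vvol: (pe, controller)}.
    """
    pes = sorted(pe_to_controller)
    assignment = {}
    for i, vvol in enumerate(vvols):
        pe = pes[i % len(pes)]
        assignment[vvol] = (pe, pe_to_controller[pe])
    return assignment


def controller_load(assignment):
    """Count how many VVOLs end up being serviced by each controller."""
    return Counter(controller for _pe, controller in assignment.values())


vvols = [f"vvol-{n}" for n in range(8)]

# One PE: every VVOL's IO lands on the controller that owns that PE.
single_pe = assign_vvols(vvols, {"PE-0": "ctrl-A"})

# Two PEs, one per controller: the VVOLs are divided across both sides.
dual_pe = assign_vvols(vvols, {"PE-0": "ctrl-A", "PE-1": "ctrl-B"})

print(controller_load(single_pe))
print(controller_load(dual_pe))
```

With a single PE all eight VVOLs sit behind one controller; with a PE per controller they split four and four. A real system would likely want to balance on IO activity rather than VVOL count, but the count is enough to show the effect.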

Another solution is to eliminate any LUN allegiance for ALUA controllers. This requires shared memory between controllers to hold IO state and this is what non-ALUA storage does already.

~~~~

It’s just like Howard said on the GBoS podcast, “there’s going to be good and bad implementations of VVOLs” and customers will need to be able to tell the difference between the two.

Comments?

Photo Credit(s): Passport Please by Oren Levine

MCS, UltraDIMMs and memory IO, the new path ahead – part 1

Posted on May 22, 2014 by Ray in Block Storage, DAS, SSD storage, Storage, Storage architecture, Storage performance

I was at Storage Field Day 5 (SFD5) last month and got a chance to talk with SanDisk and Diablo Technologies. It turns out that SanDisk’s UltraDIMM product is based on Diablo Technologies’ MCS hardware. So the two of them provided a pretty deep dive into the technology and where they want to go with it. Before we go any deeper, note that the UltraDIMMs will be released to the field by IBM under the eXFlash name.

Diablo Technologies

The team at Diablo have been focusing on the x86 standard memory channel for a while now and lately have been trying out different sorts of technologies to connect as CPU memory. The first Memory Channel Storage (MCS) product converts Memory Channel IO to SATA IO. This allows any SATA device to be attached as memory and enjoy lightning-fast, memory access times. Access times are clocked at 7µsec. Most PCIe Flash cards have an access latency of 50µsec or more, so this is 7X faster than PCIe Flash. They also claim the MCS is capable of 20GB/sec. I know enterprise class storage systems that can’t do that. Also, the MCS utilizes 2 memory channels.
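For what it’s worth, the 7X figure is just the ratio of the two quoted latencies. A quick Python check, using only the 7µsec and 50µsec numbers above (the variable names are mine, and the operations-per-second line is a simple upper bound that assumes one request outstanding at a time, not a vendor claim):

```python
# Access latencies quoted in the post, in microseconds.
mcs_latency_us = 7.0
pcie_flash_latency_us = 50.0

# How many times faster an MCS access is than a PCIe Flash access.
speedup = pcie_flash_latency_us / mcs_latency_us

# Upper bound on back-to-back accesses per second with a single
# outstanding request (assumption: no overlap, no other overhead).
mcs_serial_ops_per_sec = 1_000_000 / mcs_latency_us
pcie_serial_ops_per_sec = 1_000_000 / pcie_flash_latency_us

print(f"speedup: {speedup:.1f}x")
print(f"MCS serial ops/sec: {mcs_serial_ops_per_sec:,.0f}")
print(f"PCIe Flash serial ops/sec: {pcie_serial_ops_per_sec:,.0f}")
```

Since 50µsec is a floor for most PCIe Flash cards, the real ratio would be 7X or better.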

Diablo delivers a chip (that converts MemIO to SATA IO) and software that provides a block IO access to the MCS device. Customers of MCS supply their own SATA flash storage device and presumably package it all together in a DIMM compatible card.

But the main problem is that the whole MCS chip and SATA flash device have to fit in the form factor of a DIMM, and cannot draw any more power than a memory device can draw, ~10-15W with its corresponding thermal load.

But this seems plenty for a small flash drive. The MCS is configured as a 4GB DDR3 DIMM. There is a requirement to patch the BIOS so that it doesn’t run diagnostic memory tests on the MCS device and their software needs to be loaded to access the device as a block device. I believe they currently support Linux O/S with more O/Ss on the way.

Diablo has looked at other applications for their technology; a Memory IO accessed Ethernet NIC was mentioned. But it seems flash storage would be a great first application of their technology. It’s not clear to me, but SAS would also seem to be something that could be done.

Whatever happens after NAND with the next generation semiconductor storage (see my The end of NAND is near post), it seems to me that accessing it as Memory IO would make an awful lot of sense. Using MCS as the access channel would seem to be a logical next step.

Part 1 of this story is on Diablo Technologies, Part 2 will be on SanDisk and I am not sure but maybe there will be a Part 3 on IBM eXFlash. So stay tuned.

Comments?