Peak server, the cloud & NetApp storage for AWS

I was at a conference a month or so ago and one speaker mentioned that the number of x86 servers being sold has peaked and is dropping. I can imagine a number of reasons for this, the main one being server virtualization. But this speaker had a different view: it was the cloud.

Peak server is here.

He said that three companies were purchasing over half the x86 servers sold these days. I feel there should be at least four (Google, Facebook, Amazon & Microsoft), and maybe five if you add in Apple.

Something has happened over the past year or so. Enterprise IT has continued along its merry way but the adoption of cloud services is starting to take off.

I have seen this before, with mainframes, then minicomputers, and now client-server. Minicomputers came out and were so easy to use and to develop/deploy applications on that people stopped creating new apps on the mainframe. Mainframes never died out, and probably have never really stopped shipping more MIPS every year. But the mainframe share of WW installed MIPS has been shrinking for decades and has never recovered.

Ultimately, the proprietary minicomputer was just a passing fad and only lasted about 25 years or so. It was wounded by the PC, and then killed off by proprietary Unix workstations.

Then it happened again; this time the new upstarts were Windows Server and Linux. Once again it was just easier to build apps on these new and cheaper servers than on any of the older Unix servers. Of course there’s still plenty of business in proprietary Unix servers, but again I would venture to say that their share of WW installed MIPS has been shrinking for a long time.

Nowadays, the cloud is mortally wounding the server market. Server virtualization is helping a lot, but it’s also enabling the cloud to eliminate many physical server sales, because new applications and new IT environments are being ported/moved/deployed onto the cloud.

Peak server means less enterprise networking, storage and server hardware

In this new, cloud world, customers need fewer servers, less networking and less enterprise-class storage. Yes, not every application is suited to cloud deployment, but that’s why there are still mainframes, still Unix servers, and a continuing need for standalone, physical or virtual x86 servers in the enterprise. But their share of MIPS will start shrinking soon if it hasn’t already.

OK, so the enterprise data center share of MIPS will start shrinking vis-à-vis cloud data centers. But what happens to networking and storage? My view is that networking becomes software defined, with a component that operates on special purpose hardware. Shipments of that hardware will increase, but the more complex, enterprise-class networking equipment will flatline and never see substantial growth again.

And up until yesterday I felt much the same about enterprise-class storage: software defined storage in my future, DAS and SSDs supplying the capacity, and the smarts living in software, if anywhere. Today, most of the cloud and many service providers have been moving off enterprise-class storage and onto DAS.

NetApp’s new enterprise storage in AWS

But yesterday I heard about NetApp private storage for the cloud. This is a configuration of NetApp storage installed in a CoLo facility with a “direct connection” to Amazon compute cloud. In this way, enterprise customers can maintain data stewardship/ownership/governance over their data while at the same time deploying applications onto AWS compute cloud.

Data stewardship seems to be one of the sticking points for enterprise customers adopting the cloud. With the (data) storage owned lock, stock & barrel by the enterprise, it seems much easier and less risky to deploy new and old applications to the cloud.

Whether this pans out and can provide enough value to cover the added expense of the enterprise class storage, only the market can decide. But this is the first time I can remember, where any vendor has articulated a role for enterprise class storage in the cloud. Let’s hope it works.

Image: PDP8/s by ajmexico

Windows Server 2012 R2 storage changes announced at TechEd

Microsoft TechEd USA is this week, and they announced a number of changes to the storage services that come with Windows Server 2012 R2:

  • Azure DRaaS – Microsoft is attempting to democratize DR by supporting a new DR-as-a-Service (DRaaS).  They now have an Azure service that operates in conjunction with Windows Server 2012 R2 to provide orchestration and automation for DR site failover and failback to/from remote sites.  Windows Server 2012 R2 uses Hyper-V Replica to replicate data across to the other site. Azure DRaaS supports DR plans (scripts) that identify groups of Hyper-V VMs which need to be brought up and their sequencing. VMs within a script group are brought up in parallel, but different groups are brought up in sequence.  You can have multiple DR plans; just select the one to execute. You must have access to Azure to use this service. Azure DR plans can pause for manual activities and can invoke PowerShell scripts for more fine-tuned control.  There’s also quite a lot of setup that must be done, e.g. configuring Hyper-V hosts, VMs and networking at both primary and secondary locations.  Network IP injection is done via mapping primary to secondary site IP addresses. Azure DRaaS really just provides the orchestration of failover or failback activity. Moreover, it looks like Azure DRaaS is going to be offered by service providers as well as private companies. Currently, Azure DRaaS has no support for SAN/NAS replication, but they are working with vendors to supply an SRM-like API to provide this.
  • Hyper-V Replica changes – Replica support has been changed from a single fixed asynchronous replication interval (5 minutes) to being able to select one of 3 intervals: 15 seconds; 5 minutes; or 30 minutes.
  • Storage Spaces Automatic Tiering – With SSDs and regular rotating disks in your DAS (or JBOD) configuration, Windows Server 2012 R2 supports automatic storage tiering. At Spaces configuration time one dedicates a certain portion of SSD storage to tiering.  A scheduled Windows Server 2012 R2 task then scans the previous period’s file activity and identifies which file segments (1MB in size) should be on SSD and which should not. Over time, file segments are moved to the appropriate tier and performance should improve.  This only applies to file data, and files can be pinned to a particular tier for more fine-grained control. (A toy sketch of this placement decision appears after this list.)
  • Storage Spaces Write-Back cache – Another alternative is to dedicate a certain portion of SSDs in a Space to write caching. When enabled, writes to a Space will be cached first in SSD and then destaged out to rotating disk.  This should speed up write performance.  Both write back cache and storage tiering can be enabled for the same Space. But your SSD storage must be partitioned between the two. Something about funneling all write activity to SSDs just doesn’t make sense to me?!
  • Storage Spaces dual parity – Spaces previously supported mirrored storage and single parity but now also offers dual parity for DAS.  Sort of like RAID6 in protection but they didn’t mention the word RAID at all.  Spaces dual parity does have a write penalty (parity update) and Microsoft suggests using it only for archive or heavy read IO.
  • SMB3.1 performance improvements of ~50% – SMB has been on a roll lately and R2 is no exception. Microsoft indicated that SMB direct using a RAM DISK as backend storage can sustain up to a million 8KB IOPS. Also, with an all-flash JBOD, using a mirrored Spaces for backend storage, SMB3.1 can sustain ~600K IOPS.  Presumably these were all read IOPS.
  • SMB3.1 logging improvements – Changes were made to SMB3.1 event logging to try to eliminate the need for detail tracing to support debug. This is an ongoing activity but one which is starting to bear fruit.
  • SMB3.1 CSV performance rebalancing – Now as one adds cluster nodes,  Cluster Shared Volume (CSV) control nodes will spread out across new nodes in order to balance CSV IO across the whole cluster.
  • SMB1 stack can be (finally) fully removed – If you are running Windows Server 2012 R2, you no longer need to install the SMB1 stack.  It can be completely removed. Of course, if you have some downlevel servers or clients you may want to keep SMB1 around a bit longer, but it’s no longer required for Server 2012 R2.
  • Hyper-V Live Migration changes – Live Migration can now take advantage of SMB direct and its SMB3 support of RDMA/RoCE to radically speed up data center live migration. Also, Live Migration can now optionally compress the data on the current Hyper-V host, send the compressed data across the LAN and then decompress it at the target host.  So with R2 you have three options to perform VM Live Migration: traditional, SMB direct or compressed.
  • Hyper-V IO limits – Hyper-V hosts can now limit the amount of IOPS consumed by each VM.  This can be hierarchically controlled, providing increased flexibility. For example, one can identify a group of VMs and set an IO limit for the whole group, but each individual VM can also have an IO limit, and the group limit can be smaller than the sum of the individual VM limits.
  • Hyper-V supports VSS backup for Linux VMs – Windows Server 2012 R2 has now added support for non-application consistent VSS backups for Linux VMs.
  • Hyper-V Replica Cascade Replication – In Windows Server 2012, Hyper-V replicas could be copied from one data center to another. Now with R2, those replicas at a secondary site can be copied to a third site, cascading the replication from the first to the second and then to the third data center, each with its own replication schedule.
  • Hyper-V VHDX file resizing – With Windows Server 2012 R2 VHDX file sizes can now be increased or reduced for both data and boot volumes.
  • Hyper-V backup changes – In previous generations of Windows Server, Hyper-V backups took two distinct snapshots, one instantaneously and the other at quiesce time and then the two were merged together to create a “crash consistent” backup. But with R2, VM backups only take a single snapshot reducing overhead and increasing backup throughput substantially.
  • NVMe support – Windows Server 2012 R2 now ships with an NVM Express (NVMe) driver for PCIe flash storage.  R2’s new NVMe driver has been tuned for low latency and high bandwidth and can be used for non-clustered storage spaces to improve write performance (in a Spaces write-back cache?).
  • CSV memory read-cache – Windows Server 2012 R2 can be configured to set aside some host memory for a CSV read cache.  This is different than the Spaces Write-Back cache.  CSV caching would operate in conjunction with any other caching done at the host OS or elsewhere.
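To make the tiering item above a bit more concrete, here is a toy sketch of the placement decision. Microsoft hasn’t published the optimizer’s exact heuristics, so the segment structure, names and greedy policy below are my own assumptions, not the actual Windows Server 2012 R2 algorithm.

```python
# Toy sketch of heat-based tier placement, loosely modeled on the Storage Spaces
# tiering description above. Segment size, field names and the greedy policy are
# assumptions; the real optimizer's internals are not public.
from collections import namedtuple

Segment = namedtuple("Segment", ["file", "offset_mb", "accesses", "pinned_tier"])

def place_segments(segments, ssd_capacity_segments):
    """Return {(file, offset_mb): tier}, putting the hottest 1MB segments on SSD."""
    placement = {}
    ssd_free = ssd_capacity_segments

    # Honor explicit pins first (files can be pinned to a tier for finer control).
    unpinned = []
    for seg in segments:
        if seg.pinned_tier:
            placement[(seg.file, seg.offset_mb)] = seg.pinned_tier
            if seg.pinned_tier == "ssd":
                ssd_free -= 1
        else:
            unpinned.append(seg)

    # Greedy: fill the remaining SSD capacity with the most-accessed segments.
    for seg in sorted(unpinned, key=lambda s: s.accesses, reverse=True):
        if ssd_free > 0 and seg.accesses > 0:
            placement[(seg.file, seg.offset_mb)] = "ssd"
            ssd_free -= 1
        else:
            placement[(seg.file, seg.offset_mb)] = "hdd"
    return placement

if __name__ == "__main__":
    segs = [Segment("db.mdf", i, accesses=1000 - i, pinned_tier=None) for i in range(8)]
    segs.append(Segment("pagefile.sys", 0, accesses=5, pinned_tier="ssd"))
    print(place_segments(segs, ssd_capacity_segments=4))
```

The scheduled task in R2 presumably does something along these lines period after period, which is why the write-back cache and tiering have to split the SSD capacity between them.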

That’s about it. Some of the MVPs had a preview of R2 up in Redmond, but all of this was to be announced at TechEd, New Orleans, this week.

~~~~

Image: Microsoft TechEd by BetsyWeber





Racetrack memory gets rolling

A recent MIT study showed how new technology can be used to control and write magnetized bits in nano-structures using voltage alone. The new technique also consumes much less power than using magnets or magnetism.

They envision a sort of nano-circuit, -wire or -racetrack with a series of transistor-like structures spaced at regular intervals above it.  Nano-bits would race around these nano-wires as a series of magnetized domains.  The new transistor-like devices would act as a sort of onramp for the bits as well as stop-lights/speed limits for the racetrack.

Magnetic based racetrack memory issues

The problem with using magnets to write the bits in a nano-racetrack is that magnetism casts a wide shadow and can impact adjacent racetracks, sort of like shingled writes (which we last discussed in Shingled magnetic recording disks).  The other problem has been finding a way to (magnetically) control the speed of the racing bits so they can be isolated and read or written effectively.

Magneto-ionic racetrack memory solutions

But MIT researchers have discovered a way to use voltage to change the magnetic orientation of a bit on a racetrack.  They also found a way, through the use of voltage, to precisely control the position of magnetic bits speeding around the track and to electronically isolate and select a bit.

What they have created is sort of a transistor for magnetized domains, using ion-rich materials.  Voltages can be used to attract or repel those ions, and the ions in turn interact with the flowing magnetic domains to speed up or slow down their movement.

Thus, the transistor-like device can be set to attract (or speed up) magnetized domains, slow them down or stop them, and can also be used to change the magnetic orientation of a domain.  MIT researchers call these magneto-ionic devices.

Racetrack memory redefined

So now we have a way to (electronically) seek to a bit on a racetrack, a way to precisely (electronically) select bits on the racetrack, and a way to precisely (electronically) write data on a racetrack.  And presumably, with an appropriate (magnetic) read head, a way to read this data.  As an added bonus, data once written on the racetrack apparently requires no additional power to stay magnetized.

So the transistor-like devices are a combination of write heads, motors and brakes for the racetrack memory.  I’m not sure, but if they can write, slow down and speed up magnetic domains, why can’t they read them as well? That way the transistor-like devices could serve as read heads too.

Why do they need more than one write head per track? It seems to me that one should suffice for a fairly long track, not unlike disk drives. I suppose more of them would make the track faster to write. But they would all have to operate in tandem, speeding up or stopping the racing bits on the track all together and then starting them all back up together again.  Maybe this way they can write a byte, a word or a chunk of data all at the same time.
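To see why head count matters, here is a minimal toy model (my own construction, nothing from the MIT work itself) of a racetrack as a circular shift register with a few fixed access points. Every access shifts the whole train of domains until the target bit sits under the nearest head, so more heads mean fewer shifts per access:

```python
# Toy racetrack: a circular train of magnetic domains with N evenly spaced
# access points. All domains shift together, so the cost of touching a bit is
# the distance to the nearest head. Track length and head counts are invented.
class Racetrack:
    def __init__(self, length, num_heads):
        self.bits = [0] * length
        self.length = length
        self.heads = [i * length // num_heads for i in range(num_heads)]
        self.shifts = 0  # total single-position domain moves

    def _shift_to(self, addr):
        def circ(a, b):
            d = abs(a - b)
            return min(d, self.length - d)   # shortest way around the loop
        self.shifts += min(circ(addr, h) for h in self.heads)

    def write(self, addr, value):
        self._shift_to(addr)
        self.bits[addr] = value

    def read(self, addr):
        self._shift_to(addr)
        return self.bits[addr]

if __name__ == "__main__":
    for heads in (1, 8, 64):
        track = Racetrack(length=4096, num_heads=heads)
        for a in range(0, 4096, 7):          # touch a spread of addresses
            track.write(a, 1)
        print(f"{heads:3d} head(s): {track.shifts} total domain shifts")
```

Running it shows the shift count dropping roughly in proportion to the number of heads, which is presumably the real motivation: latency and parallel byte/word writes, at the cost of more devices per track.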

In any event, it seems that racetrack memory took a (literally) quantum leap forward with this new research out of MIT.

Racetrack memory futures

IBM has been talking about racetrack memory for some time now, and this might be the last hurdle to overcome before getting there (we last discussed this in the A “few exabytes-a-day” from SKA post).

In addition, there don’t appear to be any write-cycle limits, bit-retention issues or whole-page erase requirements with this type of technology.  So as the underlying storage for a new sort of semiconductor storage device (SSD), it has significant inherent advantages.

Not to mention that it is all based on nano-scale device sizes, which means it can pack a lot of bits into very little volume or area.  So SSDs based on these racetrack memory technologies would be denser and faster and require less energy. What more could you want?

Image: Nürburgring 2012 by Juriën Minke

 

EMCworld 2013 Day 3

Rich Napolitano, President of the Unified Storage Division, got up and showed some technology demonstrations of what they had working in their labs, with some of his long-time engineers up on stage to run the demos.

  • First up was a dual-controller system with two 8-core processors per controller (32 cores in all) running against an all-SSD backend. The configuration was only shown briefly, but it looked like 96 SSDs, so an all-flash VNX array.  They used Iometer with random 8KB IO to drive almost 975K IOPS at sub-msec response time, and they hit 1M IOPS at just slightly above 1 msec response time. You could see the processor utilization of the 32 cores going up as the workload reached higher levels.  I couldn’t see precisely, but all the cores were running at ~70-80% busy at the 1M IOPS level, and it seemed like system performance was entering the knee of the curve.
  • Next up was the new VNX data app store demonstration, similar to the iPhone and Android app stores. EMC has identified a select set of apps that can be run directly on VNX hardware. The current demonstration had two versions of anti-virus, RecoverPoint Virtual Appliance (vRPA), (v?)VPLEX, CloudAccess and MySQL server.  The engineers showed how AV software could be installed and run on the VNX, and how vRPA could be installed to provide onboard replication services.
  • Then, they demonstrated a VNX virtual appliance (vVNX?) which was able to run on a white box server, which I think was running ESX.  In this case, vVNX was running with onboard DAS storage but had all the advanced functionality of VNX.
  • Finally, they showed a vVNX running in a cloud services environment. I’m not sure if this was VMware vCloud or some other compute cloud, but Rich stated that they will support many clouds.  With vVNX running in the cloud, accessing storage behind the compute engine, it’s unclear what the performance would be and how one would access the storage (file or iSCSI no doubt), but it did open up new possibilities as to where one could run VNX services.

It’s readily apparent that the next iteration of VNX software seems focused on taking advantage of multi-core processing (called MCx) to boost storage system performance, providing a virtualized environment within the VNX engine to run specialized data services and supplying a new vVNX functionality which can be deployed just about anywhere you would want.

That’s all for the public sessions, spent much of the rest of the day in NDA sessions.

I had a good time at EMCworld 2013, seeing old friends again and meeting new ones, and I thank EMC for inviting me.  For information on previous days at EMCworld 2013 please see my Day 1 and Day 2 posts.

EMCworld 2013 day 1

Lines for coffee at the Cafe were pretty long this morning, and I missed my opportunity to have breakfast in order to get some work done. But eventually I made my way to the press room and got some food and coffee.

Spent the morning in Analyst sessions mostly under NDA but it seems safe to say that EMC sees plenty of opportunity ahead.

The first session, a Q&A with BRS executives and customers, was enlightening, but the main message from the customers was that data protection is hard, legacy systems often can’t adjust quickly enough, and sometimes a completely new architecture is warranted. The executives were upbeat about current BRS business and where they were headed in the future.

The rest of the morning was with Jeremy Burton, EVP Product, Operations and Marketing, and John Roese, the new SVP and CTO of EMC (6 months on the job). Jeremy talked about an IDC insight that there’s a new world emerging: so-called 3rd platform applications based on mobile and consumer-grade technology, with literally billions of users and millions of apps built on mobile-cloud-big data-social infrastructure, complementing the 2nd platform built on LAN/WAN, client-server frameworks.

For an example of this environment Jeremy mentioned that AT&T provisions 12PB of storage a month.

What’s needed for this new platform is a new type of storage, built for the 3rd platform but taking advantage of current enterprise storage characteristics.  This is ViPR (more on that later).

John comes by way of Huawei, Nortel and myriad others and offers a broad insight into the way forward for EMC. It looks like a bright future ahead if they can do half of what John has outlined.

John talked about the intersections between the carrier market (or services), enterprise IT and the consumer market.  There is convergence between these regions, and at each of these intersections new technology is going to answer many of the problems that exist. For instance, in the carrier space:

  • The amount of information they gather is frightening; they know everything about you. Pivotal will be key here because it’s good at 1) correlating information across different information sources (most carriers have a whole bunch of disparate information stores); and 2) treating Big Data not just as a non-realtime problem but providing realtime analytics as well.
  • Capital costs are going down, but $/bit is going way down.  VMware & the software defined data center are the right way to drive down costs.  Today servers are ~50% virtualized, but networking is not virtualized at all.
  • Customers are dissatisfied with service providers (carriers).  Again, Pivotal is key here. One carrier customer was focused on customer churn and tried to figure out how to minimize it. They used GemFire’s high speed infrastructure to watch all transactions on the cell tower infrastructure, pick out dropped calls, send them to Greenplum, correlate them with the customer’s attributes (good or bad), and within 100msec supply an interaction with the customer to apologize and offer some services to make it better.
  • The internet is the new wild west: use at your own risk, with spoofed websites, email responses that could come from anyone, and chaos for security. RSA can become the trusted internet provider by looking at the internet holistically, combining information from many customers, then aggregating and sharing these interactions to determine the trust of every transaction. Trust is becoming a new big data problem.
  • Hybrid and public cloud is their biggest opportunity, but they don’t know how to attack it. VMware and SDDC will evolve to provide orchestrated movement from private to public and from closed to open.

The thinking seems pretty straightforward given what they are trying to accomplish and the framework he applied to EMC’s strategy going forward made a lot of sense.

Brian Gallagher did a keynote on enterprise storage new functions and features, which covered VMAX, VPLEX, RecoverPoint, and XtremIO/SF/SW. He mentioned the RecoverPoint virtual appliance and gave sort of a statement of direction on being able to move application functionality directly onto VMAX. He kind of demoed this with VPLEX running on VMAX.

He also talked about FAST’s speed of reaction versus the competition, and mentioned that FAST can provide information about storage tiering to up to 4 different VMAX arrays. He showed a comparison of VMAX 10K against another prime competitor that looked downright embarrassing.  And he talked about VMAX cloud edition.

After that, 1 on 1 meetings, all under strict NDA. But then came the big keynote with Jeremy again and David Goulden, President and COO, on ViPR. They have implemented software defined storage (SDS).  Last week I did a post on SDS trying to lay out some of the problems and promises of SDS (please see The promise of SDS post).

But what I missed was the data path transformation that ViPR can do to provide object and HDFS access to traditional and commodity storage systems.  ViPR starts out primarily in the control layer, providing automated provisioning and self-management across heterogeneous storage pools. With ViPR one can define virtual storage arrays and then configure virtual storage pools across those arrays regardless of the physical infrastructure underneath them.

More on ViPR in a separate post, but suffice it to say EMC has been working on this for a while now. How it’s positioned with VPLEX and the other storage virtualization capabilities in VMAX and other products is another matter. But it seems they are carving out a space for ViPR between and above the current storage solutions.

End of day one is in the Expo and then cocktail parties… stay tuned for day 2.

 

The antifragility of disk RAID groups, the fragility of SSDs and what to do about it

Hard Disk by Jeff Kubina (cc) (from Flickr)

[A long post today] I picked up the book Antifragile: Things That Gain from Disorder, by Nassim N. Taleb and, despite trying to put it away at least 3 times now, can’t stop turning back to it.  In his view, fragility is defined by having a negative (or bad) response to variation, volatility or randomness in general.  Antifragility is the exact opposite of fragility in that it has a positive (or good) response to more variation, volatility or randomness.  Somewhere between antifragility and fragility is robustness, which has neither a positive nor a negative response (is indifferent) to high volatility, variation or randomness.

Why disks are robust …

To me there are plenty of issues with disks. To name just a few:

  • They are energy hogs,
  • They are slow (at least in comparison to SSDs and flash memory), and
  • They are mechanical contrivances which can be harmed by excess shock/vibration.

But, aside from their capacity benefits, they have a tendency to fail at a normalized failure rate unless there is a particular (special) problem with media, a manufacturing batch, electronics or micro-programming.  I saw plenty of these other types of problems at StorageTek over the years, so I know there are many things that can disturb disk failure rate normalization. However, in general, absent some systematic cause of failure, disks fail at a predictable rate with a relatively wide distribution (although, being away from the engineering of storage systems, I have no statistics for the standard deviation of disk failures – it just feels right [Nassim would probably disavow me for saying that]).

The other aspect of disk robustness is that as they degrade over time, they seem to get slower and louder.  The former is predominantly due to defect skipping, an error recovery procedure for bad blocks.  And they get louder as bearings start to wear out, signaling imminent failure ahead.

With defect skipping, when a disk drive detects a bad block, it marks the block as bad and uses a spare block it has somewhere else on the disk for all subsequent writes. The new block is typically “far” away from the old block, so when reading multiple blocks the drive now has to seek to the new block and seek back to read them, increasing response time in the process.
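As a rough illustration of that penalty, here is a toy cost model; the block numbers, spare-area location and seek costs are all invented for the sketch and don’t reflect real drive firmware:

```python
# Toy defect-skip model: a remap table sends bad blocks to a distant spare area,
# so a sequential read that crosses a remapped block pays two extra seeks.
class DefectSkippingDisk:
    def __init__(self, spare_area_start=1_000_000):
        self.remap = {}                      # bad LBA -> spare block
        self.next_spare = spare_area_start   # spares live far from user data

    def mark_bad(self, lba):
        self.remap[lba] = self.next_spare
        self.next_spare += 1

    def sequential_read_cost(self, start_lba, nblocks, seek=10, xfer=1):
        """Rough cost: initial seek + streaming, plus a round trip per remapped block."""
        cost = seek + nblocks * xfer
        for lba in range(start_lba, start_lba + nblocks):
            if lba in self.remap:
                cost += 2 * seek             # hop out to the spare block and back
        return cost

disk = DefectSkippingDisk()
print(disk.sequential_read_cost(100, 64))    # healthy: 74 cost units
disk.mark_bad(120)
print(disk.sequential_read_cost(100, 64))    # one remapped block: 94 cost units
```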

The other phenomenon behind disk failures is the head crash. These seem to occur completely at random with disks from “mature processes”.

So, I believe disks from mature processes have a normalized failure rate with a reasonably wide standard deviation around their MTBF. As such, disk drives should be classified as robust.

… and RAID groups of disk drives are antifragile

But, while individual disk drives are merely robust, when you place such devices in a RAID group with others, the group as a whole survives failures better.  As long as the failure rate of the devices is randomized and there is a wide variance in that failure rate, when a RAID group encounters a single drive failure it is unlikely that a second, or third (RAID DP/6), drive will also fail while recovering from the first.  (Yes, as disk drives get larger the time to recover gets longer, increasing the probability of multiple drive failures, but absent systematic causes of drive failure, data loss should be rare.)

In a past life, we had multiple disk systems in a location subject to volcanic activity. Somehow, sulfuric fumes from the volcano had found their way into the computer room and were degrading the optical transceivers in our disk drives, causing drive failures.   The subsystem at the time had RAID 6 (dual parity), and over the course of a few weeks we had 20 or more disk drives die in these systems. The customer lost no data during this time, but only because the disk drive failures were randomly distributed over time with a wide dispersion.

So by Nassim’s definition, disk RAID groups are antifragile; they operate better with more randomness.

Why SSD and SSD RAID groups are fragile

Toshiba’s New 2.5″ SSD from SSD.Toshiba.com

SSDs have a number of good things going for them. For example:

  • They are blistering fast, at least compared to rotating disks.
  • They are relatively green storage devices, meaning they use less energy than rotating disks.
  • They are semiconductor devices and, as such, are relatively immune to shock and vibration.

Despite all that, given today’s propensity to use wear leveling, RAID groups composed of SSDs can exhibit fragility because all the SSDs will fail at approximately the same number of Program/Erase (P/E) cycles.

My assumption is that, because NAND wear-out is essentially an electro-chemical phenomenon, its failure rate, while normally distributed, probably has a very narrow variance.  Given the technology, NAND pages will fail after so many writes; it may be 10K, 30K or 100K (for MLC, eMLC, or SLC), but all the NAND pages from the same technology (manufactured on the same fab line) will likely fail at about the same number of P/E cycles. With wear leveling equalizing the P/E cycles across all pages in an SSD, this means that there is some number of writes that an SSD will endure and then go no farther.  (Again, I have no hard statistics to support this presumption, and Nassim will probabilistically not be pleased with me for saying this.)

As such, for a RAID group made up of wear-leveling SSDs, especially with data striping across the group, all the SSDs will probabilistically fail at almost the same time because they all will have had the same amount of data written to them.  This means that as we reach wear-out on one SSD in the group, assuming all the others were also fresh when the group was created, then all the other devices will be near wear-out as well.  As a result, when one SSD fails, the others in the RAID group will have a high probability of failure, leading to data loss.
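To put a rough number on that intuition, here is a back-of-the-envelope Monte Carlo. Every figure in it (drive count, mean lifetime, rebuild window, the two spreads) is invented purely to contrast a wide failure-time dispersion with a narrow one; with the wide spread a second death inside the rebuild window almost never happens, while with the narrow spread it happens in a large fraction of trials:

```python
# Monte Carlo sketch: single-parity group of 8 devices loses data if a second
# device dies within one rebuild window of the first. All numbers are made up.
import random

def group_lost(n_drives, mean_life_hrs, sigma_hrs, rebuild_hrs, rng):
    """True if the 2nd failure lands inside the rebuild window after the 1st."""
    deaths = sorted(rng.gauss(mean_life_hrs, sigma_hrs) for _ in range(n_drives))
    return (deaths[1] - deaths[0]) < rebuild_hrs

def loss_probability(sigma_hrs, trials=20000):
    rng = random.Random(42)
    losses = sum(group_lost(8, 40000, sigma_hrs, rebuild_hrs=24, rng=rng)
                 for _ in range(trials))
    return losses / trials

if __name__ == "__main__":
    print("disk-like (wide variance):   %.4f" % loss_probability(sigma_hrs=8000))
    print("SSD-like  (narrow variance): %.4f" % loss_probability(sigma_hrs=100))
```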

I have written about this before, see my Potential data loss using SSD RAID groups post for more information.

What we can do about the fragility of SSD RAID groups?

A couple of items come to mind that can be done to reduce the fragility of a RAID group of SSDs:

  • Intermix older and newer (fresher) SSDs in a single RAID group so they don’t all fail at the same time.
  • Don’t use data striping across RAID groups of SSDs; this would allow some devices to be written more than others and thereby introduce some randomness into the SSD failures in the group.
  • Don’t use RAID 1, as this will always cause the same number of writes to be written to each SSD in a pair.
  • Don’t use RAID 5 or other protection methodologies that spread parity writes across the group; using these would be akin to data striping in that all parity writes would be spread evenly across the group.
  • Consider using different technology SSDs in a RAID group; intermixing MLC, eMLC and SLC drives in a RAID group would have the effect of varying the SSD failure rates.
  • Move away from wear leveling to defect skipping; while doing so will cause some SSDs to fail earlier than today, their failure rates will be more randomly distributed.

The last one probably deserves some discussion.  There are many reasons for wear leveling, one of which is to speed up writes (by always having a fresh page to write); another is that NAND blocks cannot be updated in place, they need to be erased before being rewritten.  But another major reason is to distribute write activity across all NAND pages to equalize wear-out.

In order to speed up writing without wear leveling, one would need some sort of DRAM buffer to absorb the write activity and then later destage it to NAND when available.   The inability to update in place is more problematic, but could potentially be dealt with by using the same DRAM cache to read in the previous information and write back the updates.  Other solutions to this latter problem exist but seem to be more trouble than they are worth.

But for the aspect of wear leveling done to equalize NAND page wear-out, I believe there’s a less fragile solution.  If we were to institute some form of defect skipping with a certain amount of spare NAND pages, we could easily extend the life of an SSD, at least until we run out of spare pages.

Today, a considerable amount of spare capacity ships with most SSDs, over 10% in most enterprise class storage and more with consumer grade. With this much spare capacity, a single NAND logical block could be rewritten an awfully high number of times. For instance, using defect skipping with a 100GB MLC SSD at 10K write endurance, 10% spare pages and a 1MB page size, one single logical block address could be written ~100 million times (assuming no other pages were being written beyond their maximum).
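The arithmetic behind that ~100 million figure is simple enough to check (decimal GB assumed):

```python
# Back-of-the-envelope endurance under defect skipping: a hot logical page gets
# remapped to a fresh spare each time its current physical page wears out.
pe_cycles         = 10_000                 # MLC write endurance per page
capacity_mb       = 100_000                # 100GB, decimal
page_mb           = 1
spare_fraction    = 0.10

spare_pages       = int(capacity_mb * spare_fraction / page_mb)   # 10,000 spares
writes_to_one_lba = (1 + spare_pages) * pe_cycles                  # original + spares

print(f"{spare_pages:,} spare pages -> ~{writes_to_one_lba:,} writes to one LBA")
# -> 10,000 spare pages -> ~100,010,000 writes, i.e. roughly 100 million
```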

The main advantage is that SSD failure rates would now be more widely distributed. Yes, there would be more early-life failures, especially for SSDs that get hit a lot. But they would no longer fail in unison at some magical write level.

Making SSDs less fragile

While doing all the above may help a RAID group full of SSDs be less fragile, addressing the inherent fragility of an SSD itself is more problematic.  Nonetheless, some ideas do come to mind:

  • Randomly mix NAND chips from different fabs/vendors; SSDs that use this intermixture could have a more randomly distributed failure rate, which should increase the standard deviation of MTBF.
  • Use different NAND technologies in an SSD, say MLC for the bulk of the storage capacity and SLC for the defect-skip capacity (with no wear leveling). Doing this would elongate the lifetime of the average SSD and randomly distribute SSD failures based on write locality of reference, thereby increasing the standard deviation of MTBF.  Of course, this would also have the effect of speeding up heavily written blocks, which would now come out of SLC rather than slower MLC, making these SSDs even faster for the blocks that are written most frequently.
  • Use more random, less deterministic predictive maintenance. SSD predictive maintenance is used to limit the damage from a failing SSD by replacing it before death. By using less deterministic, more randomized algorithms (for instance, varying how close to wear-out we let an SSD get before signaling failure) we would increase the variance of failure.

This post is almost too long now but there are probably other ideas to increase the robustness of SSDs and PCIe Flash cards that deserve mention someplace. Maybe we can explore these in a subsequent post.

Comments?

[Full disclosure:  I have a number of desktops that use single disk drives (without RAID) that are backed up to other disk drives.  I own and use a laptop, iPads, and an iPhone that all use SSDs or NAND technology (without RAID). I own no disk or SSD storage subsystems.]

 

SNWUSA Spring 2013 summary

For starters, the parties were a bit more subdued this year, although I heard Wayne’s suite was hopping until 4am last night (not that I would ever be there that late).

And a trend seen the past couple of years was even more evident this year: many large vendors and vendor spokespeople went missing. I heard that there were a lot more analyst presentations at this SNW than at prior ones, although it was hard to quantify.  But it did seem that the analyst community was pulling double duty in presentations.

I would say that SNW still provides a good venue for storage marketing across all verticals. But these days many large vendors find success elsewhere, leaving the SNW Expo mostly to smaller vendors and niche products.  Nonetheless, there were a few big vendors (Dell, Oracle and HP) still in evidence. But EMC, HDS, IBM and NetApp were not showing on the floor.

I would have to say the theme of this year’s SNW was hybrid storage. Last fall the products that impressed me were either cloud storage gateways or all-flash arrays; this year there weren’t as many of those at the show, but hybrid storage has certainly arrived.

Best hybrid storage array of the show

It’s hard to pick just one hybrid storage vendor as my top pick, especially since there were at least 3 others talking to me under NDA, but from my perspective the hybrid vendor of the show had to be Tegile (pronounced, I think, as te’-jile). They seemed to have a fully functional system with snapshots, thin provisioning, deduplication and pretty good VMware support (the only time I have heard a vendor talk about VASA “stun” support for thin provisioned volumes).

They made the statement that SSD in their system is used as a cache, not a tier. This use is similar to NetApp’s Flash Cache and has proven to be a particularly well-performing approach to hybrid storage. (For more information on that, take a look at some of my NFS and recent SPC-1 benchmark review dispatches.) How well this is integrated with their home-grown dedupe logic is another question.

On the negative side, they seem to be lacking a true HA/dual-controller version, though perhaps they could use two separate systems with sync (I think) replication between them to cover this ground? They also claimed their dedupe has no performance penalty, a pretty bold claim that cries out for some serious lab validation and/or benchmarking to prove. They also offer an all-flash version of their storage (but then how can it be used as a cache?).

The marketing team seemed pretty knowledgeable about the market space, and they seem to be going after the mid-range storage space.

The product supports file (NFS and CIFS/SMB), iSCSI and FC with GigE, 10GbE and 8Gbps FC. They quote “effective” capacities with dedupe enabled but it can be disabled on a volume basis.

Overall, I was impressed by their marketing and the product (what little I saw).

Best storage tool of the show

Moving on to other product categories, it was hard to see anything that caught my eye. Perhaps I have just been to too many storage conferences, but I did get somewhat excited when I looked at SwiftTest.  Essentially they offer an application profiling, storage modeling and workload generation tool set.

The team seems to be branching out of their traditional vendor market focus and going after large service providers and F100 companies with large storage requirements.

Way back, when I was in Engineering, we were always looking for information on how customers actually used storage products. Well, what SwiftTest has is an appliance that instruments your application environment (through network taps/hardware port connections) to monitor your storage IO and create a statistical operational profile of your I/O environment. You then take that profile and play it against a storage configuration model to show how well it’s likely to perform.  And if that’s not enough, the same appliance can be used to drive a simulated version of the operational profile back onto a storage system.

It offers NFS (v2, v3, v4), CIFS/SMB (SMB1, SMB2, SMB3), FC, iSCSI, and HTTP/REST (what, no FCoE?). They mentioned an $80K price tag for the base appliance (one protocol?), but it grows pretty fast if you want all of them.  They also seem to have three levels of appliances (my guess is more performance and more protocols come with the bigger boxes).

Not sure where they top out, but simulating an operational profile can be quite complex, especially when you have to control data patterns to match the deduplication potential in customer data, drive Markov chains with probability representations of operational profiles, and actually execute the IO operations. They said something about their ports having dedicated CPU cores to ensure adequate performance, or something similar, but it still seems too little to hit high IO workloads.
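For what it’s worth, the Markov-chain part is the easiest piece to picture. Here is a minimal sketch of the general idea, with invented states and probabilities; SwiftTest’s actual profile format and replay engine are their own and not something I’ve seen:

```python
# Sketch of replaying a captured profile as a Markov chain: the profile is a
# matrix of transition probabilities between operation types, and the generator
# walks the chain to emit a synthetic op stream with the same mix. The NFS-ish
# states and numbers below are invented for illustration only.
import random

PROFILE = {
    "read":    {"read": 0.70, "write": 0.20, "getattr": 0.10},
    "write":   {"read": 0.30, "write": 0.55, "getattr": 0.15},
    "getattr": {"read": 0.50, "write": 0.10, "getattr": 0.40},
}

def generate_ops(profile, n_ops, start="read", seed=1):
    """Walk the chain to produce a synthetic op stream matching the profile."""
    rng = random.Random(seed)
    state, ops = start, []
    for _ in range(n_ops):
        ops.append(state)
        next_states, probs = zip(*profile[state].items())
        state = rng.choices(next_states, weights=probs)[0]
    return ops

if __name__ == "__main__":
    stream = generate_ops(PROFILE, 10000)
    for op in sorted(set(stream)):
        print(f"{op:8s} {stream.count(op) / len(stream):.2%}")
```

The hard parts the vendor has to layer on top (dedupe-aware data patterns, protocol fidelity, and actually pushing the IO at line rate) are exactly the ones that don’t fit in a toy like this.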

Like I said, when I was in engineering we were searching for this type of solution back in the late 90s, and we would probably have bought it in a moment if it had been available.

GoDaddy.com, the domain/web site services provider, was one of the customers that used the appliance to test storage configurations. They presented at SNW on some of their results, but I missed their session (the case study is available on SwiftTest’s website).

~~~~

In short, SNW had a diverse mixture of end-user customers but lacked a full complement of vendors to show off to them.   The ratio of vendors to customers has definitely shifted toward end-users the last couple of years and, if anything, has gotten more skewed to end-users (which paradoxically should appeal to more storage vendors?!).

I talked with lots of end-users, from companies like FedEx, Northrop Grumman and AOL to name just a few big ones. But there were plenty of smaller ones as well.

The show lasted three days and had sessions scheduled all the way to the end. I was surprised at the length and the fact that it started on Tuesday rather than Monday as in years past.  Apparently, SNIA and Computerworld are still tweaking the formula.

It seemed to me there were more cancelled sessions than in years past but again this was hard to quantify.

Some of the customers I talked with thought SNW should go to once a year and can’t understand why it’s still held twice a year.  Many mentioned VMworld as having taken SNW’s place as a showplace for storage vendors of all sizes and styles.  That, and the vendor-specific shows from EMC, IBM, Dell and others.

The fall show is moving to Long Beach, CA. Probably a further experiment to find a formula that works.  Let’s hope they succeed.

Comments?

 

Heating NAND brings it back to life

I read an article today in Ars Technica, NAND flash gets baked, lives longer, about researchers at Macronix who have come up with a technique that rejuvenates NAND bit cells by heating them.  The process releases the bit cell’s captive electrons and returns it to a fresh state.

As discussed previously in this blog (e.g., see The End of NAND is near, maybe… and What eMLC and eSLC do for SSD longevity), as NAND technology shrinks to smaller transistor dimensions, its longevity and durability decrease proportionally. Which means that the denser NAND chips coming out over the next few years will become increasingly short-lived.

With this new approach, and an awful lot of engineering, zapping a NAND bit cell with intense heat (800°C) can take dead NAND cells and bring them back to life. Apparently this rejuvenation process has been known for some time and has been used for phase change memory, but it had not been applied to NAND cells before.  Researchers had been heating batches of NAND cells for hours at 200°C, but that wouldn’t be very practical in production.

The new NAND memory cells are designed with a resistive heating element on top of them, which when enabled can heat the NAND bit cells beneath them. According to some news reports I’ve read this enables the NAND cell to go from 10K P/E cycles to 100K P/E cycles.  And the heat only needs to be applied in occasional pulses to keep cells operating within parameters.   As such, it can be used sparingly and not cost too much energy in the process.

Another side effect of heating is that erase cycles operate faster than at normal temperatures, which adds the possibility of heat-assisted NAND operation.   Erasure being one of the key bottlenecks to NAND write performance, anything that can speed it up would help.

Hot NAND may have some life in them after all.

~~~~

Comments?

Image: Blow Torch by xlibber