Intel’s Optane (3D XPoint) SSD specs in the wild

Read an article the other day in Ars Technica (Specs for 1st Intel 3DX SSD…) previewing the Intel Optane specs for their 375GB 3D XPoint (3DX) SSD. The device is an NVMe-compliant, PCIe Gen3 add-in card in a half-height, half-length, low-profile form factor.

Intel’s Optane SSD vs. the competition

A couple of items from the Intel Optane spec sheet of interest to me as a storage guru:

  • 30 drive writes per day (DWPD)/12.3 PBW – 3DX, at launch, was advertised as having 1000 times the endurance of (2D-MLC?) NAND. Current flash cards (see Samsung SSD PRO NVMe 256GB Flash card specs) offer about 200TBW (for the 256GB card) or 400TBW (for the 512GB card). The Samsung PRO is based on 3D (V-)NAND, so its endurance is much better than 2D-MLC at these densities. That being said, the Optane drive still has ~40X the write endurance of the 950 PRO on a per-GB basis. Not quite 1000X, but certainly significantly better. (A quick back-of-envelope check of these ratios follows the list.)
  • Sequential (bandwidth) performance (R/W) of 2400/2000 MB/sec – 3DX advertised 1000 times the performance of (2D-MLC, non-NVMe?) NAND. Current 3D (V-)NAND cards (see the Samsung SSD PRO above) offer (R/W) 2200/900 MB/sec for an NVMe device. Optane's read bandwidth is a slight improvement, but its write bandwidth is a 2.2X improvement over current competitive devices.
  • Random 4KB IOPS performance (R/W) of 550K/500K – Similar to the previous bullet, 3DX advertised 1000 times the performance of (2D-MLC, non-NVMe?) NAND. Current 3D (V-)NAND cards like the Samsung SSD PRO offer random 4KB IOPS performance (R/W) of 270K/85K IOPS (@4 threads). Optane's random 4KB read IOPS performance is ~2X the 950 PRO's, and its write performance is ~5.9X better.
  • IO latency of <10 µsec – 3DX advertised 10X better latency than the then-current (2D-MLC, non-NVMe) flash drives. According to StorageReview (Samsung 950 Pro M.2), the Samsung 950 PRO had a latency of ~22 µsec, so Optane has at least 2X better latency than the current competition.
  • Density of 375GB/HH-HL-LP – 3DX advertised 10X the density of (then current) DRAM. Today Micron offers a 4GiB DDR4/288-pin DIMM, which is probably 1/2 the size of the HH-HL card, so in the same space DRAM might reach 8GiB. That puts Optane at roughly 50X the density of today's DRAM.
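For what it's worth, here is a minimal back-of-envelope sketch of the ratios quoted above, using only the spec-sheet numbers cited in this post (the Optane 375GB card vs. the Samsung 950 PRO 256GB); the per-GB normalization of endurance is my own assumption about how the ~40X figure falls out.

```python
# Back-of-envelope ratios: Intel Optane 375GB card vs. Samsung 950 PRO 256GB,
# using only the numbers quoted in the bullets above.
optane = {"cap_gb": 375, "tbw": 12300, "read_mbs": 2400, "write_mbs": 2000,
          "read_kiops": 550, "write_kiops": 500, "latency_us": 10}
pro950 = {"cap_gb": 256, "tbw": 200, "read_mbs": 2200, "write_mbs": 900,
          "read_kiops": 270, "write_kiops": 85, "latency_us": 22}

# Normalize endurance per GB so the different capacities don't skew the ratio
endurance = (optane["tbw"] / optane["cap_gb"]) / (pro950["tbw"] / pro950["cap_gb"])

print(f"Endurance (TBW/GB): ~{endurance:.0f}X")                                      # ~42X
print(f"Read bandwidth:     ~{optane['read_mbs'] / pro950['read_mbs']:.1f}X")        # ~1.1X
print(f"Write bandwidth:    ~{optane['write_mbs'] / pro950['write_mbs']:.1f}X")      # ~2.2X
print(f"Read 4KB IOPS:      ~{optane['read_kiops'] / pro950['read_kiops']:.1f}X")    # ~2.0X
print(f"Write 4KB IOPS:     ~{optane['write_kiops'] / pro950['write_kiops']:.1f}X")  # ~5.9X
print(f"Latency:            ~{pro950['latency_us'] / optane['latency_us']:.1f}X better")  # ~2.2X
```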

Please note, when 3DX was launched, ~2 years ago, the then-current NAND technology was 2D-MLC and NVMe drives were only just coming to market. So comparing launch claims against today's 3D-NAND, NVMe drives is not a fair comparison.

Nevertheless, the Optane SSD performs considerably better than current competitive NVMe drives and has significantly better endurance than current 3D (V-)NAND flash drives. All of which is a great step in the right direction.

What about DRAM replacement?

At launch, 3DX was also touted as a higher density, potential replacement for DRAM. But so far we haven’t seen any specs for what 3DX NVM looks like on a memory bus. It has much better density than DRAM, but we would need to see 3DX memory access times under 50ns to have a future as a DRAM replacement. Optane’s NVMe SSD at 10 µsec. is about 200X too slow, but then again it’s not a memory device configuration nor is it attached to a memory bus.

Comments?

Photo Credit(s):  Intel Optane Spec sheet from Ars Technica Article,  DDR4 DRAM from Wikimedia user:Dsimic

Surprises from 4 years of SSD experience at Google

Flash field experience at Google 

In a FAST’16 article I recently read (Flash reliability in production: the expected and unexpected, see p. 67), researchers at Google reported on field experience with flash drives in their data centers, totaling many millions of drive days and covering MLC, eMLC and SLC drives with a minimum of 4 years of production use (3 years for eMLC). In some cases, they had 2 generations of the same drive in their field population. SSD reliability in the field is not what I would have expected and was a surprise to Google as well.

The SSDs seem to be used in a number of different application areas, mainly as SSDs with a custom-designed PCIe interface (FusionIO drives maybe?). Aside from the technology changes, there were some lithographic changes as well: from 50 to 34nm for SLC, from 50 to 43nm for MLC, and from 32 to 25nm for eMLC NAND technology.

5D storage for humanity’s archive

A group of researchers at the University of Southampton in the UK have invented a new type of optical recording, based on femtosecond laser pulses and silica/quartz media, that can store up to 300TB per (1″ diameter) disc platter, with thermal stability at up to 1000°C and a media life of up to 13.8B years at 190°C (effectively unlimited at room temperature). The claim is that the memory device could outlive humanity and maybe the universe.

The new media/recording technique was recently used to create copies of text files (the Holy Bible, pictured above). Other significant humanitarian, political and scientific treatises have also been stored on the new media. The new device has been nicknamed the “Superman Memory Crystal”, due to the glass (quartz) media's likeness to Superman's memory crystals.

We have written before on long term archives (see our Super Long Term Archive and Today's data and the 1000 year archive posts) but this one beats them all by many orders of magnitude.

Optical discs for Facebook cold storage

I heard last week that Facebook is implementing Blu-ray libraries for cold storage. Each Blu-ray disc holds ~100GB and they figure they can store 10,000 discs, or ~1PB, in a rack.

They bundle 12 discs in a cartridge and 36 cartridges in a magazine, placing 24 magazines in a cabinet along with Blu-ray drives and a robotic arm. The robot arm sits in the middle of the cabinet with the magazines/cartridges located on each side.

It's unclear what Amazon Glacier uses for its storage, but a retrieval time of 3-5 hours indicates removable media of some type. I haven't seen anything on Windows Azure offering a similar service, but Google has released Durable Reduced Availability (DRA) storage, which could potentially be hosted on removable media as well. I was unable to find any access time specifications for Google DRA.

Why the interest in cold storage?

The article mentioned that Facebook is testing the new technology first on its compliance data. After that, Facebook will start using it for cold photo storage. Facebook also said that it will be using different storage technologies for its cold storage repository, mentioning “bad flash” as another alternative.

Blu-ray supports both re-writeable and WORM (write once, read many times) media. WORM discs cannot be modified, only destroyed, which would be very useful for anyone's compliance data. The rewritable Blu-ray discs might be more effective for cold photo storage, but the fact that people on Facebook rarely delete photos says WORM would work well there too.

100GB is a pretty small storage bucket these days but for compliance documents, such as email, invoices, contracts, etc. it’s plenty large.

Can Blu Ray optical provide data center cold storage?

Facebook didn't discuss the specs of the robot arm they plan to use, but with ~10K discs it has a lot of work to do. Tape library robots move a single cartridge in about 11 seconds or so. If the optical robot could do as well (no information to the contrary), one robot arm could support ~4K disc swaps per day. But that assumes enterprise-class robotics and a 100% duty cycle; more likely 1/2 to 1/4 of that would be considered good for an off-the-shelf system like this. So maybe 1,000 to 2,000 disc swaps per day.

If we use 22 seconds per disc swap (two disc moves) and the 1,000 to 2,000 swaps per day estimated above, a single robot/rack could support 100 to 200TB of data writes per day (assuming robot speed was the only bottleneck). In the video (see about 30 minutes in) the robot didn't look all that fast compared to a tape library robot, but maybe I am biased.

Near as I can tell, a 12x Blu-ray drive can write at ~35MB/sec (a SATA drive writing a single-layer, 25GB disc; we assume this can be sustained for a 4-layer or dual-sided 2-layer 100GB disc). At that rate, writing a full 100GB disc would take ~48 minutes, and adding the 22 seconds of disc swap time, one SATA drive running 100% flat out could maybe write 30 discs per day, or ~3TB/day.

In the video, the Blu-ray drives appear to be located in an area above the disc magazines along each side. There appear to be two drives per column with 6 columns per side, so a maximum of 24 drives. With 24 drives, one rack could write about 72TB/day, or 720 discs per day, which fits within our 22 seconds per swap. At 72TB/day it's going to take ~14 days to fill up a cabinet. I could be off on the drive count; they didn't show the whole cabinet in the video, so it's possible they have 12 columns per side, 48 drives per cabinet and 144TB/day.
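Since the arithmetic above spans several paragraphs, here is a minimal sketch of the whole throughput model under the assumptions used in this post (35MB/sec sustained writes, 22 seconds per swap, 24 drives, 100GB discs, ~1PB rack); none of these figures come from Facebook.

```python
# Back-of-envelope Blu-ray cold-storage rack model, using the assumptions above.
DISC_GB = 100                  # capacity per disc
WRITE_MB_S = 35                # assumed sustained write rate per drive
SWAP_SEC = 22                  # robot time for one disc swap (two moves)
DRIVES = 24                    # assumed drive count per cabinet
RACK_DISCS = 10_000            # ~1PB per rack

write_sec = DISC_GB * 1000 / WRITE_MB_S                        # ~2857 sec (~48 min) per disc
discs_per_drive_day = 24 * 3600 / (write_sec + SWAP_SEC)       # ~30 discs/drive/day
rack_tb_day = DRIVES * discs_per_drive_day * DISC_GB / 1000    # ~72 TB/day
robot_hours = DRIVES * discs_per_drive_day * SWAP_SEC / 3600   # robot time spent swapping
fill_days = RACK_DISCS * DISC_GB / 1000 / rack_tb_day          # ~14 days to fill the rack

print(f"per-disc write time: {write_sec / 60:.0f} min")
print(f"discs/drive/day    : {discs_per_drive_day:.0f}")
print(f"rack throughput    : {rack_tb_day:.0f} TB/day ({DRIVES * discs_per_drive_day:.0f} discs/day)")
print(f"robot swap time    : {robot_hours:.1f} hours/day")
print(f"days to fill rack  : {fill_days:.0f}")
```

At these numbers the robot only spends ~4-5 hours a day swapping discs, well within even a derated duty cycle, so drive count rather than robot speed looks like the limiting factor.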

All this assumes 100% duty cycle on the drives which is unreasonable for an enterprise class tape drive let alone a consumer class BluRay drive. This is also write speed, I assume that read speed is the same or better. Also, I didn’t see any servers in the cabinet and I assume that something has to be reading, writing and controlling the optical library. So these other servers need to be somewhere close by, but they could easily be located in a separate rack somewhere near to the library.

So it all makes some amount of sense from a system throughput perspective. Given what we know about the drive speed, cartridge capacity and robot capabilities, it’s certainly possible that the system could sustain the disc swaps and data transfer necessary to provide data center cold storage archive.

And the software

But there's plenty of software that has to surround an optical library to make it useful. Somehow we would want to be able to identify a file as a candidate for cold storage, have it moved to some cold storage disc(s), cataloged, and then deleted from the non-cold storage repository. Of course, we probably want 2 or more copies to be written, and maybe these redundant copies should be written to different facilities, or at least different cabinets. The catalog to the cold storage repository is all-important and needs to be available 24X7, so it needs to be redundant/protected, updated with extreme care, and, from my perspective, kept on some sort of high-speed storage to handle archives of 3EB.
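To make that flow concrete, here is a toy sketch of the control path such software might follow; every name and policy here is hypothetical and the storage calls are stubs, it's only meant to show the ordering of steps (write the redundant copies, catalog them, and only then delete from the warm tier).

```python
# Toy cold-storage migration flow; all functions, names and policies are hypothetical.
import hashlib
import time

CATALOG = {}          # stand-in for the (redundant, 24x7) cold-storage catalog
COLD_COPIES = 2       # write at least two copies, ideally to different cabinets/sites

def is_cold_candidate(meta: dict) -> bool:
    """Example policy: compliance data, or anything untouched for a year."""
    return bool(meta.get("compliance")) or (time.time() - meta["last_access"]) > 365 * 86400

def write_to_cold_disc(data: bytes, cabinet: str) -> dict:
    """Stub for 'burn these bytes to a disc in this cabinet'; returns a location."""
    return {"cabinet": cabinet, "disc_id": hashlib.sha1(data).hexdigest()[:12]}

def migrate(name: str, data: bytes, meta: dict, cabinets: list) -> bool:
    if not is_cold_candidate(meta):
        return False
    locations = [write_to_cold_disc(data, c) for c in cabinets[:COLD_COPIES]]
    CATALOG[name] = {"sha256": hashlib.sha256(data).hexdigest(),
                     "locations": locations, "migrated": time.time()}
    # Only after the copies are written and cataloged is it safe to delete the
    # object from the warm/primary repository (stubbed out here).
    return True

migrate("invoice-2013.pdf", b"...bytes...", {"compliance": True, "last_access": 0},
        cabinets=["cabinet-A", "cabinet-B"])
print(CATALOG)
```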

What about OpenStack? Although there have been some rumblings by Oracle and others to provide tape support in OpenStack, nothing seems to be out yet. However, it’s not much of a stretch to see removable media support in OpenStack, if some large company were to put some effort into it.

Other cold storage alternatives

In the video, Facebook says they currently have 30PB of cold storage at one facility and are already in the process of building another. They said that they should have 150PB of cold storage online shortly and that each cold storage facility is capable of holding 3EB or 3,000PB of cold storage.

A couple of years back at Hitachi in Japan, we were shown a Blu-ray optical disc library using 50GB discs. This was just a prototype, but they were getting pretty serious about it then. We also saw an update on this at an analyst meeting at HDS a year or so later. So there's at least one storage company working on this technology.

Facebook seems to have decided they were better off developing their own approach. It's probably more dense/space efficient and maybe even more power efficient, but to tell for sure would take some spec comparisons which aren't available from Facebook or HDS just yet.

Why not magnetic tape?

I see these large storage repository sizes and wonder if Facebook might not be better off using magnetic tape. It has a much larger per-cartridge capacity and I believe magnetic tape (LTO or enterprise) would supply better volumetric (bytes/in**3) density than the Blu-ray cabinet shown in the video.

Facebook said that Blu-ray discs have a 50-year lifetime. I believe enterprise and LTO tape vendors say their cartridges have a 30-year lifetime, and that might be one consideration driving them to optical.

The reality is that new LTO technology comes out every 2-3 years or so, and new drives read only 2 generations back and write only the current and prior generation. With that quick a turnover, a data center would probably have to migrate data from old to new tape technology every decade or so, before old tape drives go out of warranty.

I have not seen any Blu-ray technology roadmaps, so it's hard to make a comparison, but to date, PC-based Blu-ray drives can typically read and write CDs, DVDs, and current Blu-ray discs (which covers probably 4 to 5 generations back). So optical has a better reputation for backward compatibility over time.

Tape technology roadmaps move so quickly because tape competes with disk, which doubles capacity every 18 months or so. I am sure tape drive and media vendors would be happy not to upgrade their technology so fast, but then disk storage would take over more and more tape storage applications.

If Blu-ray were to become a data center storage standard, as Facebook seems to want, I believe Blu-ray technology would fall under similar competitive pressure from both disk and tape to upgrade optical technology at a faster rate. When that happens, it will be interesting to see how quickly optical drives stop supporting the backward compatibility they currently provide.

Comments?

Photo Credit: [73/366] Grooves by Dwayne Bent [Ed. note, picture of a DVD, not a Blu-ray disc]

HP Tech Day – StoreServ Flash Optimizations

Attended HP Tech Field Day late last month in Disneyland. Must say the venue was the best ever for HP, and getting in on Nth Generation Conference was a plus. Sorry it has taken so long for me to get around to writing about it.

We spent a day going over HP's new converged storage, software defined storage and other storage topics. HP has segmented Software Defined Data Center (SDDC) storage requirements into cost-optimized Software Defined Storage and SLA-optimized Service Refined Storage. Under Software Defined Storage they talked about their StoreVirtual product line, which is an outgrowth of the LeftHand Networks VSA, first introduced in 2007. This June, they extended SDS to include their StoreOnce VSA product to go after SMB and ROBO backup storage requirements.

We also discussed some of HP's work to integrate their current block storage into OpenStack Cinder, as well as some of the integrations they plan for file and object storage.

However, what I mostly want to discuss in this post is the session on how HP StoreServ 3PAR has optimized its storage system for flash.

They showed an SPC-1 chart depicting various storage systems' IOPS levels and response times as they ramped from 10% to 100% of their IOPS rate. StoreServ 3PAR's latest entry showed a considerable band of IOPS (25K to over 250K) all within a sub-msec response time range. This was pretty impressive, since at the time no other storage system seemed able to do this across its whole range of IOPS. (A more recent SPC-1 result from HDS, for an all-flash VSP with Hitachi Accelerated Flash, was also able to accomplish this [sub-msec response time throughout the whole benchmark], only in their case it reached over 600K IOPS – read about this in our latest performance report in our newsletter, sign up above right.)

Some of the flash optimizations they discussed included:

  • Adaptive Read – As I understood it, this changes the size of backend reads to match the size requested by the front end. For disk systems, one often sees a host read of, say, 4KB cause a 16KB read from the backend, on the assumption that the host will request additional data after that block, and because ~90% of the time spent on a disk read goes to getting the head to the correct track; once there, it takes almost no extra effort to read more data. With flash, however, there is no effort needed to get to the proper location to read a block of data, so there is no advantage to reading more data than the host requests: if the host comes back for more, one can immediately read it from flash again.
  • Adaptive Write – Similar to adaptive read, adaptive write only writes the changed data to flash. So if a host writes a 4KB block, then only 4KB is written to flash. This doesn't help much for RAID 5 because of parity updates, but for RAID 1 (mirroring) it saves on flash writes, which ultimately lengthens flash life.
  • Adaptive Offload (destage) – This changes the frequency of destaging or flushing cache depending on the level of write activity. Slower destaging allows written (dirty) data to accumulate in cache when there's not much write activity going on, which means that with RAID 5 parity may not need to be updated as often, as one could potentially accumulate a whole stripe's worth of data in cache. In low-activity situations such destaging could occur every 200 msecs, whereas with high write activity destaging could occur as fast as every 3 msecs. (A toy sketch of this idea follows the list.)
  • Multi-tenant IO processing – For disk drives doing sequential reads, one wants the largest stripes possible (due to the head positioning penalty), but for SSDs one wants the smallest stripe sizes possible. The other problem with large stripe sizes is that devices stay busy for the duration of the longer IO while performing the stripe writes (and reads). StoreServ modified the stripe size for SSDs to be 32KB, so that other IO activity need not wait as long to get its turn in the (IO device) queue. The other advantage is that when doing SSD rebuilds with a 32KB stripe size, one can intersperse more IO activity for the devices involved in the rebuild without impacting rebuild performance.
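To illustrate the adaptive offload idea (this is not HP's actual algorithm, which they didn't detail), a controller could simply scale the destage interval between the two endpoints mentioned above based on incoming write load:

```python
# Toy adaptive-destage sketch: pick a cache-flush interval between the 3 msec
# (busy) and 200 msec (idle) endpoints quoted above. The write-rate threshold
# and the linear interpolation are illustrative assumptions only.
IDLE_INTERVAL_MS = 200.0     # destage slowly when writes are rare
BUSY_INTERVAL_MS = 3.0       # destage aggressively under heavy write load
MAX_WRITE_MBPS = 2000.0      # hypothetical write rate treated as "fully busy"

def destage_interval_ms(write_mbps: float) -> float:
    """Linearly scale the destage interval with observed write throughput."""
    load = min(max(write_mbps / MAX_WRITE_MBPS, 0.0), 1.0)
    return IDLE_INTERVAL_MS - load * (IDLE_INTERVAL_MS - BUSY_INTERVAL_MS)

for rate in (0, 100, 500, 1000, 2000):   # MB/sec of incoming writes
    print(f"{rate:5d} MB/s -> destage every {destage_interval_ms(rate):6.1f} ms")
```

The payoff of slow destaging under light load is that a full RAID 5 stripe can accumulate in cache, turning several read-modify-write parity updates into a single full-stripe write.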

Of course, the other major advantage HP StoreServ's 3PAR architecture provides for flash is its intrinsic wide striping across a storage pool. This way all the SSDs can be used optimally and equally to service customer IOs.

I am certain there were other optimizations HP made to support SSDs in StoreServ storage, but these are the ones they were willing to talk publicly about.

No mention of when Memristor SSDs would be available, but stay tuned; HP let slip that sooner or later Memristor storage will be in HP storage & servers.

Comments?

Photo Credits: (c) 2013 Silverton Consulting, Inc

Super long term archive

Read an article this past week in Scientific American about a new fused silica glass storage device from Hitachi Ltd., announced last September. The new media is recorded with lasers, burning dots that represent a binary 1 or leaving spaces that represent a binary 0.

As can be seen in the photos above, the data can readily be read by microscope which makes it pretty easy for some future civilization to read the binary data. However, knowing how to decode the binary data into pictures, documents and text is another matter entirely.

We have discussed the format problem before in our Today's data and the 1000 year archive as well as Digital Rosetta stone vs. 3D barcodes posts. This new technology would compete with the currently available M-disc long-term archivable DVD technology from Millenniata, which we have also talked about before.

Semi-perpetual storage archive!!

Hitachi tested the new fused silica glass storage media at 1000°C for several hours, which they say indicates that it can survive several hundred million years without degradation. At this level it can provide a 300 million year storage archive (M-disc only claims 1000 years). They are calling their new storage device “semi-perpetual” storage. If hundreds of millions of years is semi-perpetual, I gotta wonder what perpetual storage might look like.

At CD recording density, with higher densities possible

They were able to achieve CD-level recording density with a four-layer approach, which amounts to about 40Mb/sqin. DVD technology is on the order of 330Mb/sqin and Blu-ray is ~15Gb/sqin, but neither of those technologies claims even a million-year lifetime. Also, there is the possibility of even more layers, so the 40Mb/sqin could potentially double or quadruple.

But data formats change every few years nowadays

My problem with all this is the data format issue: we will need something like a digital Rosetta stone for every data format ever conceived in order to make this a practical digital storage device.

Alternatively, we could plan to use it more like an analogue storage device, with something like black-and-white or grey-scale photographs of the information to be retained imprinted in the media. That way, a simple microscope could be used to see the photo image. I suppose color photographs could be implemented using a different plate per color, similar to four-color magazine production processing. Text could be handled by just taking a black-and-white photo of a document and printing that in the media.

According to a post I read about the size of the collection at the Library of Congress, they currently have about 3PB of digital data in their collections, which in 650MB CD chunks would be about 4.6M CDs. So if there is an intent to copy this data onto the new semi-perpetual storage media for the year 300,002,012, we probably ought to start now.
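A quick sanity check of that CD count, using the figures above:

```python
collection_bytes = 3e15      # ~3PB of digital data, per the post cited above
cd_bytes = 650e6             # 650MB per CD
print(f"~{collection_bytes / cd_bytes / 1e6:.1f} million CD-sized chunks")   # ~4.6M
```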

Another tidbit to add to the discussion: at last month's Hitachi Data Systems Influencers Summit, HDS was showing off some of their recent lab work and had an optical jukebox on display that they claimed would be used for long term archive. I get the feeling that maybe they plan to commercialize this technology soon – stay tuned for more.

~~~~

Image: Hitachi.com website (c) 2012 Hitachi, Ltd.

The end of NAND is near, maybe…

In honor of today's Flash Memory Summit conference, I give my semi-annual amateur view of competing NAND technologies.

I was talking with a major storage vendor today and they said they were sampling sub-20nm NAND chips with P/E cycles of 300 and a data retention period of under a week at room temperature. With those specifications, these chips almost can't get out of the factory with any life left in them.

On the other hand, the only sub-20nm (19nm) NAND information I could find online was for the new Toshiba THNSNF SSDs, with toggle MLC NAND that guarantees data retention of 3 months at 40°C. I could not find any published P/E cycle specifications for the NAND in that drive, but presumably it is at most equivalent to their prior-generation 24nm NAND, or at worst somewhere below that generation's P/E cycles. (Of course, I couldn't find P/E cycle specifications for that drive either, but similar technology in other drives seems to offer 3000 native P/E cycles.)

Intel-Micron, SanDisk and others have all recently announced 20nm MLC NAND chips with P/E cycles around 3K to 5K.

Nevertheless, as NAND chips go beyond their rated P/E cycle counts, NAND bit errors increase. With a more powerful ECC algorithm in SSDs and NAND controllers, one can still correct the data coming off the NAND chips. However, at some point beyond 24-bit ECC this probably becomes unsustainable. (See the interesting post by NexGen on ECC capabilities as NAND die sizes shrink.)
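To see why ECC eventually runs out of headroom, here is a minimal sketch using the simplest possible model (independent bit errors, binomial tail); the codeword size and raw bit error rates are illustrative assumptions, not vendor numbers.

```python
# Probability that a codeword has more raw bit errors than the ECC can correct,
# under a simple independent-error (binomial) model. Illustrative numbers only.
from math import comb

def p_uncorrectable(n_bits: int, raw_ber: float, t: int) -> float:
    """P(more than t bit errors in n_bits) given a raw bit error rate."""
    p_ok = sum(comb(n_bits, k) * raw_ber**k * (1 - raw_ber)**(n_bits - k)
               for k in range(t + 1))
    return 1.0 - p_ok

N = 1024 * 8    # a hypothetical 1KB codeword
T = 24          # ECC able to correct up to 24 bits per codeword
for ber in (1e-5, 1e-4, 1e-3, 3e-3):
    print(f"raw BER {ber:.0e} -> uncorrectable codeword probability {p_uncorrectable(N, ber, T):.2e}")
```

As wear (and lithography shrinks) push the raw bit error rate up, the uncorrectable probability climbs from negligible to unacceptable over a fairly narrow range, which is the "unsustainable" point referred to above.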

Not sure how to bridge the gap between 3-5K P/E cycles and the 300 P/E cycles being seen by the storage vendor above, but this may be a function of prototype vs. production technology, and possibly the chips had other characteristics they were more interested in.

But given the declining endurance of NAND below 20nm, some industry players are investigating other solid state storage technologies to replace NAND, e.g., MRAM, FeRAM, PCM and ReRAM, all of which are current contenders, at least from a research perspective.

MRAM is currently available in small capacities from Everspin and elsewhere, but hasn't yet come anywhere near the densities of today's NAND technologies.

ReRAM is starting to emerge in low power applications as a substitute for SRAM/DRAM, but it’s still early yet.

I haven't heard much about FeRAM, other than researchers at Purdue last year having invented a new non-destructive-read FeRAM they call FeTRAM. Standard FeRAMs are already in commercial use, albeit in limited applications, from Ramtron and others, but density is still a hurdle and write performance is a problem.

Recently the PCM approach has heated up, as PCM technology is now commercially available, having been released by Micron. Yes, the technology has a long way to go to catch up with NAND densities (it's available at 45nm technology), but it's yet another start down a technology pathway to build volume and research ways to reduce cost, increase density and generally improve the technology. In the meantime, I hear it's an order of magnitude faster than NAND.

Racetrack memory, a form of MRAM that uses nanowires to store multiple bits, isn't standing still either. Last December, IBM announced they had demonstrated Racetrack memory chips in their labs. With this milestone, IBM has shown how a complete Racetrack memory chip could be fabricated on a CMOS technology line.

In the same press release on recent research results, IBM also announced a new technique to construct CMOS-compatible graphene devices on a chip. As we have previously reported, another approach to replacing standard NAND technology uses graphene transistors to replace the storage layer of NAND flash. Graphene NAND holds the promise of increasing density with much better endurance, retention and reliability than today's NAND.

So as of today, NAND is still the king of solid state storage technologies but there are a number of princelings and other emerging pretenders, all vying for its throne of tomorrow.

Comments?

Image: 20 nanometer NAND Flash chip by IntelFreePress

Million year optical disk

Read an article the other day about scientists creating an optical disk that would be readable in a million years or so. The article in Science Mag, titled A million-year hard disk, was intended to warn people in the far future about potential dangers being created today.

A while back I wrote about a 1000 year archive, which was predominantly about disappearing formats. At the time, I believed that given the growth in data density, information could easily be copied and saved over time, but the formats for that data would be long gone by the time someone tried to read it.

The million year optical disk eliminates the format problem by using pixelated images etched on media. Which works just dandy if you happen to have a microscope handy.

Why would you need a million year disk

The problem is how do you warn people in the far future not to mess with radioactive waste deposits buried below. If the waste is radioactive for a million years, you need something around to tell people to keep away from it.

Stone markers last for a few thousand years at best but get overgrown and wear down in time. For instance, my grandmother’s tombstone in Northern Italy has already been worn down so much that it’s almost unreadable. And that’s not even 80 yrs old yet.

But a sapphire hard disk that could easily be read with any serviceable microscope might do the job.

How to create a million year disk

This new disk is similar to the old StorageTek 100K year optical tape. Both would depend on microscopic impressions, something like bits physically marked on media.

For the optical disk the bits are created by etching a sapphire platter with platinum. Apparently the prototype costs €25K but they’re hoping the prices go down with production.

There are actually two 20cm (7.9in) wide disks that are molecularly fused together, and each disk can store 40K miniaturized pages that can hold text or images. They are doing accelerated life testing on the sapphire disks by bathing them in acid, to ensure a 10M year life for the media and message.

Presumably the images are grey tone (or in this case platinum tone). If I assume 100Kbytes per page that’s about 4GB, something around a single layer DVD disk in a much larger form factor.
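A quick check of that capacity estimate (the 100KB-per-page figure is this post's assumption, not from the article):

```python
pages = 40_000               # miniaturized pages per disk, per the article
bytes_per_page = 100_000     # assumed ~100KB per grey-tone page
print(f"~{pages * bytes_per_page / 1e9:.0f} GB vs. 4.7GB for a single-layer DVD")   # ~4 GB
```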

Why sapphire

It appears that sapphire is readily available from industrial processes and seems impervious to the wear that harms other materials. But that's what they are trying to prove.

It's unclear why they decided to “molecularly” fuse two platters together. It seems to me this could easily be a weak link in the technology over the course of a dozen millennia or so. On the other hand, more storage is always a good thing.

~~~~

In the end, creating dangers today that last millions of years requires some serious thought about how to warn future generations.

Image: Clock of the Long Now by Arenamontanus