New PCM could supply 36PB of memory to CPUs

Read an article this past week on how quantum geometry can enable a new form of PCM (phase change memory) that is based on stacks of metallic layers (SciTech Daily article: Berry curvature memory: quantum geometry enables information storage in metallic layers). That article referred to a Nature article (Berry curvature memory through electrically driven stacking transitions) that sits behind a paywall, but I found a pre-print of it under the same title.

Figure 1| Signatures of two different electrically-driven phase transitions in WTe2. a, Side view (b–c plane) of unit cell showing possible stacking orders in WTe2 (monoclinic 1T’, polar orthorhombic Td,↑ or Td,↓) and schematics of their Berry curvature distributions in momentum space. The spontaneous polarization and the Berry curvature dipole are labelled as P and D, respectively. The yellow spheres refer to W atoms while the black spheres represent Te atoms. b, Schematic of dual-gate h-BN capped WTe2 device. c, Electrical conductance G with rectangular-shape hysteresis (labeled as Type I) induced by external doping at 80 K. Pure doping was applied following Vt/dt = Vb/db under a scan sequence indicated by black arrows. d, Electrical conductance G with butterfly-shape switching (labeled as Type II) driven by electric field at 80 K. Pure E field was applied following -Vt/dt = Vb/db under a scan sequence indicated by black arrows. Positive E⊥ is defined along +c axis. Based on the distinct hysteresis observations in c and d, two different phase transitions can be induced by different gating configurations.

The number one challenge in IT today is that data just keeps growing: 2+ Exabytes today and much more tomorrow.

All that information takes storage, bandwidth and ultimately some form of computation to take advantage of it. While computation, bandwidth, and storage density all keep going up, at some point the energy required to read, write, transmit and compute over all these Exabytes of data will become a significant burden to the world.

PCM and other forms of NVM, such as Intel’s Optane PMEM, have brought a step change in how much data can be stored close to server CPUs today. And as Optane PMEM doesn’t require refresh, it has also reduced the energy required to store and sustain that data compared to DRAM. I have no doubt that density, energy consumption and performance will continue to improve for these devices over the coming years, if not decades.

In the meantime, researchers are actively pursuing different classes of material that could replace or improve on PCM with even less power, better performance and higher densities. Berry Curvature Memory is the first I’ve seen that has several significant advantages over PCM today.

Berry Curvature Memory (BCM)

I spent some time trying to gain an understanding of Berry Curvature. As far as I can gather, it’s a quantum-mechanical geometric effect that quantifies the topological characteristics of the entanglement of electrons in a crystal. Suffice it to say, it’s something that can be measured as an electromagnetic field that provides phase transitions (on-off) in a metallic crystal at the topological level.

In the case of BCM, they used three to five atomically thin mono-layers of WTe2 (Tungsten Ditelluride), a Type II Weyl semi-metal that exhibits superconductivity, high magneto-resistance, and the ability to alter interlayer sliding through the use of terahertz (THz) radiation.

It appears that using BCM in a memory offers several advantages:

Fig. 4| Layer-parity selective Berry curvature memory behavior in Td,↑ to Td,↓ stacking transition. a,
The nonlinear Hall effect measurement schematics. An applied current flow along the a axis results in the generation of nonlinear Hall voltage along the b axis, proportional to the Berry curvature dipole strength at the Fermi level. b, Quadratic amplitude of nonlinear transverse voltage at 2ω as a function of longitudinal current at ω. c, d, Electric field dependent longitudinal conductance (upper figure) and nonlinear Hall signal (lower figure) in trilayer WTe2 and four-layer WTe2 respectively. Though similar butterfly-shape hysteresis in longitudinal conductance are observed, the sign of the nonlinear Hall signal was observed to be reversed in the trilayer while maintaining unchanged in the four-layer crystal. Because the nonlinear Hall signal (V⊥,2ω / (V//,ω)2 ) is proportional to Berry curvature dipole strength, it indicates the flipping of Berry curvature dipole only occurs in trilayer. e, Schematics of layer-parity selective symmetry operations effectively transforming Td,↑ to Td,↓. The interlayer sliding transition between these two ferroelectric stackings is equivalent to an inversion operation in odd layer while a mirror operation respect to the ab plane in even layer. f, g, Calculated Berry curvature Ωc distribution in 2D Brillouin zone at the Fermi level for Td,↑ and Td,↓ in trilayer and four-layer WTe2. The symmetry operation analysis and first principle calculations confirm Berry curvature and its dipole sign reversal in trilayer while invariant in four-layer, leading to the observed layer-parity selective nonlinear Hall memory behavior.
  • To alter a memory cell takes “a few meV/unit cell, two orders of magnitude less than conventional bond rearrangement in phase change materials” (PCM). In layman’s terms, it takes 100X less energy to change a bit than PCM.
  • To alter a memory cell, it uses terahertz (THz) radiation: pulses of light or other electromagnetic radiation whose period is on the order of picoseconds or less. This is 1000X faster than other PCM that exist today.
  • To construct a BCM memory cell takes between 13 and 16 atoms of W and Te, arranged in 3 to 5 layers of atomically thin WTe2 semi-metal.

While it’s hard to see in the figure above, the way this memory works is that the inner layer slides left to right with respect to the picture, and it’s this realignment of atoms between the three or five layers that gives rise to the changes in the Berry Curvature phase space, which provide the on-off switching.

To get from the lab to product is a long road, but the fact that it has density, energy and speed advantages measured in multiple orders of magnitude certainly bodes well for its potential to disrupt current PCM technologies.

Potential problems with BCM

Nonetheless, even though it exhibits superior performance characteristics with respect to PCM, there are a number of possible issues that could limit its use.

One concern (on my part) is that the inner-layer sliding may induce some sort of fatigue. I’ve heard that mechanical fatigue at the atomic level is not nearly as much of a concern as it is in larger (beyond atomic scale) structures. Still, I must assume the sliding would induce some stress and, as such, limit the write-cycle endurance of BCM.

Another possible concern is how to shrink the size of the THz radiation spot so that it writes only a small area of the material. Yes, one memory cell can be measured by the width of 3 atoms, but the next question is how far away do I need to place the next memory cell. The laser used in BCM was focused down to ~1.5 μm. At this size it’s 1,000X bigger than the BCM memory cell width (~1.5 nm).

Yet another potential problem is that current BCM must be embedded in a continuous flow of liquid nitrogen (@80K). Unclear how much of a requirement this temperature is for BCM to function. But there are no computers nowadays that require this level of cooling.

Figure 3| Td,↑ to Td,↓ stacking transitions with preserved crystal orientation in Type II hysteresis. a,
in-situ SHG intensity evolution in Type II phase transition, driven by a pure E field sweep on a four-layer and a five-layer Td-WTe2 devices (indicated by the arrows). Both show butterfly-shape SHG intensity hysteresis responses as a signature of ferroelectric switching between upward and downward polarization phases. The intensity minima at turning points in four-layer and five-layer crystals show significant difference in magnitude, consistent with the layer dependent SHG contrast in 1T’ stacking. This suggests changes in stacking structures take place during the Type II phase transition, which may involve 1T’ stacking as the intermediate state. b, Raman spectra of both interlayer and intralayer vibrations of fully poled upward and downward polarization phases in the 5L sample, showing nearly identical characteristic phonons of polar Td crystals. c, SHG intensity of fully poled upward and downward polarization phases as a function of analyzer polarization angle, with fixed incident polarization along p direction (or b axis). Both the polarization patterns and lobe orientations of these two phases are almost the same and can be well fitted based on the second order susceptibility matrix of Pm space group (Supplementary Information Section I). These observations reveal the transition between Td,↑ and Td,↓ stacking orders is the origin of
Type II phase transition, through which the crystal orientations are preserved.

Finally, from my perspective, can such a memory be stacked vertically, with a higher number of layers? Yes, there are three to five layers of the WTe2 used in BCM, but can you put another three to five layers on top of that, and then another? The researchers used three, four and five layer configurations, and although this changed the amplitude of the Berry Curvature effect, it didn’t seem to add more states to the transition. If we were to add more layers of WTe2, would we be able to discern, say, 16 different states (like QLC NAND today)?

~~~~

So there’s a ways to go to productize BCM. But, aside from eliminating the low-temperature requirements, everything else looks pretty doable, at least to me.

I think it would open up a whole new dimension of applications, if we had say 60TB of memory to compute with, don’t you think?

Comments?

[Updated the title from 60TB to PB to 36PB as I understood how much memory PMEM can provide today…, the Eds.]

Photo Credit(s):

Facebook down to 1.08 PUE and counting for cold storage

Read a recent article in ArsTechnica about Facebook’s cold storage archive and their sustainable data centers (How Facebook puts petabytes of old cat pix on ice in the name of sustainability). In the article there was a statement that Facebook had achieved a 1.08 PUE (Power Usage Effectiveness) for one of these data centers. This means for every 100 Watts used to power up racks, Facebook needed to add 8 Watts for other overhead.
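As a quick check on what a PUE figure means, here is a minimal Python sketch of the ratio (total facility power divided by the power delivered to the racks). The function names are mine, purely for illustration.

```python
def pue(it_power_watts, overhead_watts):
    """Power Usage Effectiveness: total facility power / IT (rack) power."""
    return (it_power_watts + overhead_watts) / it_power_watts


def overhead_for(it_power_watts, pue_value):
    """Overhead power implied by a given PUE for a given IT load."""
    return it_power_watts * (pue_value - 1.0)


# 100 W of rack power plus 8 W of overhead gives the quoted 1.08 PUE.
print(round(pue(100, 8), 2))
# Going the other way: a 1.08 PUE on a 100 W load implies 8 W of overhead.
print(round(overhead_for(100, 1.08), 2))
```

The same function shows why a drop from, say, 1.14 to 1.08 matters: the overhead per 100 Watts of rack power falls from 14 Watts to 8.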

Just last year I wrote a paper for a client where I interviewed the CEO of an outsourced data center provider (DuPont Fabros Technology) whose state of the art new data centers were achieving a PUE of 1.14 to 1.18. For Facebook to run their cold storage data centers at 1.08 PUE is even better.

At the moment, Facebook has two cold storage data centers, one at Prineville, OR and the other at Forest City, NC (Forest City achieved the 1.08 PUE). The two cold data storage sites add to the other Facebook data centers that handle everything else in the Facebook universe.

MAID to the rescue

First off these are just cold storage data centers, over an EB of data, but still archive storage, racks and racks of it. How they decide something is cold or hot seems to depend on last use. For example, if a picture has been referenced recently then it’s warm, if not then it’s cold.

Second, they have taken MAID (massive array of idle disks) to a whole new data center level. That is, each 1U (Knox storage tray) shelf has 30 4TB drives and a rack has 16 of these storage trays, holding 1.92PB of data. At any one time, only one drive in each storage tray is powered up. The racks have dual servers and only one power shelf (due to the reduced power requirements).
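The rack numbers above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the constants come from the paragraph above, while the decimal units (1 EB = 1,000,000 TB) and the helper names are my assumptions.

```python
DRIVES_PER_TRAY = 30   # drives in each Knox storage tray
DRIVE_TB = 4           # capacity of each drive, in TB
TRAYS_PER_RACK = 16    # storage trays in a rack


def rack_capacity_tb():
    """Raw capacity of one rack in TB."""
    return DRIVES_PER_TRAY * DRIVE_TB * TRAYS_PER_RACK


def racks_for(capacity_tb):
    """Racks needed to hold capacity_tb of raw data (rounded up)."""
    per_rack = rack_capacity_tb()
    return -(-capacity_tb // per_rack)  # integer ceiling division


def powered_fraction(active_per_tray=1):
    """Fraction of drives spinning when only some drives per tray are on."""
    return active_per_tray / DRIVES_PER_TRAY


print(rack_capacity_tb())          # 1920 TB, i.e. 1.92PB per rack
print(racks_for(1_000_000))        # racks for 1 EB (decimal units assumed)
print(round(powered_fraction(), 3))
```

With one drive per tray powered up, only 16 of the 480 drives in a rack (about 3%) are spinning at any one time, which is where the reduced power requirement comes from.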

They also use pre-fetch hints provided by the Facebook application to cache user data. This means they will fetch some images ahead of time, when a user is paging through photos in a stream, in order to have them in cache when needed. After the user looks at or passes up a photo, it is jettisoned from cache and the next photo is pre-fetched. When the disks are no longer busy, they are powered down.
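The pre-fetch behavior described above can be sketched roughly as follows. This is a toy illustration, not Facebook’s implementation: the `PrefetchCache` class, its `view` method and the `fetch` callback are all hypothetical names, and a real system would fetch asynchronously and from hints rather than simple stream order.

```python
class PrefetchCache:
    """Toy model: serve the current photo, drop it, pre-fetch the next one."""

    def __init__(self, fetch):
        self._fetch = fetch   # hypothetical callable that reads from cold storage
        self._cache = {}      # photo id -> photo data waiting to be viewed

    def view(self, stream, index):
        photo_id = stream[index]
        # Take the photo out of cache if it was pre-fetched; popping it
        # means it is jettisoned as soon as the user has seen it.
        data = self._cache.pop(photo_id, None)
        if data is None:
            data = self._fetch(photo_id)  # cache miss: go to disk
        # Pre-fetch the next photo in the stream, if there is one.
        if index + 1 < len(stream):
            next_id = stream[index + 1]
            self._cache[next_id] = self._fetch(next_id)
        return data

    def cached_ids(self):
        return sorted(self._cache)


# Usage: record which photos are actually read from "disk".
reads = []

def fake_fetch(photo_id):
    reads.append(photo_id)
    return photo_id.upper()

stream = ["a", "b", "c"]
cache = PrefetchCache(fake_fetch)
for i in range(len(stream)):
    cache.view(stream, i)
print(reads)  # each photo is read from disk exactly once
```

In this sketch the cache never holds more than the one photo the user is about to see, so the disks behind it can go idle as soon as the user stops paging.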

Fewer power conversions lower PUE

Another thing Facebook is doing is reducing the number of power conversions that need to happen to power racks. In a typical data center power comes in at 480 Volts AC,  flows through the data center UPS and then is dropped down to 208 Volts AC at the PDU which flows to the rack power supply which is then converted to 12 Volts DC.  Each conversion of electricity generally sucks up power and in the end only 85% of the energy coming in reaches the rack’s servers and storage.

In Facebook’s data centers, 480 Volts AC is channeled directly to the racks, which have an in-rack battery backup/UPS, and the rack’s power bus converts the 480 Volt AC to 12 Volt DC or AC directly as needed. By cutting out the data center level UPS and the PDU energy conversion, they save lots of energy overhead which can be used to better power the racks.

Free air cooling helps

Facebook data centers like Prineville also make use of “fresh air cooling”, which mixes data center air with outside air that flows through “wetted media” to cool it and is then sent down to cool the racks by convection. This process keeps the rack servers and storage within the proper temperature range, but they probably run hotter than in most data centers this way. How much fresh air is brought in depends on outside temperature, but during most months, it works very well.

This is in contrast to standard data centers that use chillers, fans and pumps to keep the data center air moving, conditioned and cold enough to chill the equipment. All those fans, pumps and chillers can consume a lot of energy.

Renewable energy, too

Lately, Facebook has made obtaining renewable energy to power their data centers a high priority. One new data center close to the Arctic Circle was built there because of hydro-power, another in Iowa and one in Texas were built in locations with wind power.

All of this technology, open sourced

Facebook has open sourced all of its hardware and data center systems. That is, the specifications for all the hardware discussed above and more are available from the Open Compute Organization, including the storage specification(s), open rack specification(s) and data center specification(s) for these data centers.

So if you want to build your own cold storage archive that can achieve 1.08 PUE, just pick up their specs and have at it.

Comments?

Picture Credits: DataCenterKnowledge.Com

 

Nantero emerges from stealth with CNT based NRAM

Nantero just came out of stealth this week and bagged $31.5M in a Series E funding round. Apparently, Nantero has been developing a new form of non-volatile RAM (NRAM), based on Carbon Nanotubes (CNT), which seems to work like an old T-bar switch, only at the nm scale and using CNT for the wiring.

They were founded in 2001, and are finally ready to emerge from stealth. Nantero already has 175+ issued patents, with another 200 patents pending. The NRAM is currently in production at 7 CMOS fabs and they are sampling 4Mb NRAM chips to a number of customers.

NRAM vs. NAND

Performance of the NRAM is on a par with DRAM (~100 times faster than NAND), can be configured in 3D and supports MLC (multi-bits per cell) configurations.  NRAM also supports orders of magnitude more (assume they mean writes) accesses and stores data much longer than NAND.

The only question is the capacity. With shipping NAND on the order of 200Gb and NRAM sampling at 4Mb, NRAM is about 50,000X (nearly 2**16X) behind NAND. Nantero claims that their CNT-NRAM CMOS process can be scaled down to <5nm, which is one or two generations below the current NAND scale factor, and assuming they can pack as many bits in the same area, they should be able to compete well with NAND. They claim that their NRAM technology is capable of Terabit capacities (assumed to be at the 5nm node).
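For a sense of the capacity gap, here is a short Python calculation using the chip sizes quoted above (4Mb NRAM samples, ~200Gb shipping NAND, Terabit NRAM claimed). Treating Mb, Gb and Tb as decimal units is my assumption.

```python
import math

NRAM_BITS = 4 * 10**6       # 4Mb NRAM sample chips
NAND_BITS = 200 * 10**9     # ~200Gb shipping NAND
TERABIT = 10**12            # NRAM capacity claimed as possible

gap = NAND_BITS // NRAM_BITS       # how many times larger NAND is today
doublings = math.log2(gap)         # capacity doublings needed to catch up

print(gap)                         # 50000
print(round(doublings, 1))         # between 15 and 16 doublings

# Doublings needed to go from 4Mb to the claimed Terabit capacity.
print(round(math.log2(TERABIT / NRAM_BITS), 1))
```

Put another way, NRAM needs somewhere between 15 and 16 capacity doublings just to reach today’s NAND, and close to 18 to reach a Terabit.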

The other nice thing is that Nantero says the new NRAM uses less power than DRAM, which means that in addition to attaining higher capacities and DRAM-like access times, it will also reduce power consumption.

It seems a natural for mobile applications. The press release claims it was already tested in space and there are customers looking at the technology for automobiles. The company claims the total addressable market is ~$170B USD. Which probably includes DRAM and NAND together.

CNT in CMOS chips?

Key to Nantero’s technology was incorporating the use of CNT in CMOS processes, so that chips can be manufactured on current fab lines. It’s probably just the start of the use of CNT in electronic chips but it’s one that could potentially pay for the technology development many times over. CNT has a number of characteristics which would be beneficial to other electronic circuitry beyond NRAM.

How quickly they can ramp the capacity up from 4Mb seems to be a significant factor, which is no doubt why they went out for Series E funding.

So we have another new non-volatile memory technology. On the other hand, these guys seem to be a long way out of the lab, with something that works today and the potential to go all the way down to 5nm.

It should be interesting, as the other NV technologies start to emerge, to see which one generates sufficient market traction to succeed in the long run. Especially as NAND doesn’t seem to be slowing down much.

Comments?

Picture Credits: Wikimedia.com

NetApp Analyst Summit Customer Panel – how to survive a category 5 tornado

NetApp had three of their customer innovation winners come up on stage for a panel discussion with Dave Hitz moderating the discussion. All three had interesting deployments of NetApp storage systems:

  • Andrew Henderson from ING DIRECT talked about their need to deploy copies of the bank’s IT environment for test, development, optimization and security testing. This process took 12 weeks to accomplish the first time they tried and only created a single copy. They wanted to speed this up and be able to deploy 10 or more copies if necessary. Andrew looked at Microsoft Hyper-V, System Center and NetApp FlexClones and transformed this process to now generate a copy of the entire bank’s IT services in under 10 minutes. And since the new capabilities have been in place they have created over 400 copies of the bank (he called these bank-in-a-box) for various purposes.
  • Teresa Wahlert from Iowa Workforce Development Agency was up next and talked about their VDI implementation. Iowa cut their budget, which forced them to shut down a number of physical offices. But with VDI, VMware and NetApp storage, Workforce was able to disperse their services to over 3000 locations, now in prisons, libraries, and other venues where they had no presence before. They put out a general call for all the tired, dying PCs in Iowa government and used these to host VDI services. Now Workforce services are available 7X24 at these locations, pretty amazing for government work. Apparently they had tried VDI before and their previous storage couldn’t handle it. They moved to NetApp with FlashCache and it worked just fine. That’s when they rolled out VDI services to their customers and businesses. With NetApp they were able to implement VDI, reduce storage costs (via deduplication and other storage efficiency features) and increase department services.
  • Jeff Bell at Mercy Healthcare talked about the difficulties of rolling out electronic health records (EHR) and their challenges of integrating ~30 hospitals and ~400 medical clinics. They started with EHR fairly early, in 2006-2007, well before the latest governmental push. He mentioned Joplin, MO and last year’s category 5 tornado, which about wiped out their hospital there. He said within 2 hours after the disaster, Mercy Healthcare was printing out the EHR for the 183 patients present in the hospital at the time who had to be moved to other care facilities. The promise of EHR is that the information travels with the patient, can be recovered in the event of a disaster and is immediately available. It seems that at least at Mercy Healthcare, EHR is living up to its promise. In addition, they just built a new data center as they were running out of space, power and cooling at the old one. They installed new NetApp storage there and for the first few months had to run heaters to keep the data center live-able because the new power/cooling load was so far below what they had experienced previously. Looking back on what they had accomplished, Jeff was not so sure they would build a new data center again. With new cloud offerings coming out and the reduced power/cooling and increased density of NetApp storage, they could almost get by without another data center at all.

That’s about it from the customer session.

NetApp execs spent the rest of the day on innovation, mostly at NetApp but also in the IT industry in general.

There was lots of discussion on the new release of Data ONTAP 8.1.1 with its latest cluster mode features. NetApp positioned it as completing the transition to data/storage as an infrastructure that IT has been pushing for the last decade or so, following in the grand tradition of what IBM did for computing infrastructure with the 360 and what Cisco and others did for networking infrastructure in the mid 80’s.

Comments?

Tape still alive, well and growing at Spectra Logic

T-Finity library at SpectraLogic's test facility (c) 2011 Silverton Consulting, All Rights Reserved

Today I met with Spectra Logic execs and some of their Media and Entertainment (M&E) customers, and toured their manufacturing, test labs and briefing center.  The tour was a blast and the customers Kyle Knack from National Geographic (Nat Geo) Global Media, Toni Perez from Medcom (Panama based entertainment company) and Lee Coleman from Entertainment Tonight (ET) all talked about their use of the T-950 Spectra Logic tape libraries in the media ingest, editing and production processes.

Mr. Collins from ET spoke almost reverently about their T-950 and how it has enabled ET to access over 30 years of video interviews, movie segments and other media they can now use to put together clips on just about any entertainment subject imaginable.

He  talked specifically about the obit they did for Michael Jackson and how they were able to grab footage from an interview they did years ago and splice it together with more recent media to show a more complete story.  He also showed a piece on some early Eddie Murphy film footage and interviews they had done at the time which they used in a recent segment about his new movie.

All this was made possible by moving to digital file formats and placing digital media in their T-950 tape libraries.

Spectra Logic T-950 (I think) with TeraPack loaded in robot (c) 2011 Silverton Consulting, All Rights Reserved

Mr. Knack from Nat Geo Media said every bit of media they get anymore, automatically goes into the library archive and becomes the “original copy” of the media used in case other copies are corrupted or lost.  Nat Geo started out only putting important media in the library but found it just cost so much less to just store it in the tape archive that they decided it made more sense to just move all media to the tape library.

Typically they keep two copies in their tape library and important media is also copied to tape and shipped offsite (3 copies for this data).  They have a 4-frame T-950 with around 4000 slots and 14 drives (combination of LTO-4 and -5).  They use FC and FCoE storage for their primary storage and depend on 1000s of SATA drives for primary storage access.

He said they only use SSDs for some metadata support for their web site. He found that SATA drives can handle their big block sequential workloads and provide consistent throughput and, especially important to M&E companies, consistent latency.

3D printer at Spectra Logic (for mechanical parts fabrication) (c) 2011 Silverton Consulting, All Rights Reserved

Mr. Perez from MedCom had much the same story. They were in the process of moving off of proprietary video tape format (Sony Betacam) to LTO media and digital files. The process is still ongoing although they are more than halfway there for current production.

They still have a lot of old media in Betacam format which will take them years to convert to digital files, but they are at least starting this activity.  He said a recent move from one site to another revealed that many of the Betacam tapes were no longer readable.  Digital files on LTO tape should solve that problem for them when they finally get there.

Matt Starr, Spectra Logic CTO, talked about the history of tape libraries at Spectra Logic, which was founded in 1998 and has been laser focused on tape data protection and tape libraries.

I find it pleasantly surprising that a company today can just supply tape libraries with software and make an ongoing concern of it. Spectra Logic must be doing something right: revenue grew 30% YoY last year, they are outgrowing the (88K sq ft) office, lab, and manufacturing building they just moved into earlier this year, and they have just signed to occupy another building providing 55K sq ft more space.

T-Series robot returning TeraPack to shelf (c) 2011 Silverton Consulting, All Rights Reserved

Molly Rector, Spectra Logic CMO, talked about the shift in the market from peta-scale (10**15 bytes) storage repositories to exa-scale (10**18 bytes) ones.  Ms. Rector believed that today’s cloud storage environments can take advantage of these large tape-based archives to provide much more economical storage for their users without suffering any performance penalty.

At lunch with Matt Starr, Fred Moore (Horison Information Strategies), Mark Peters (Enterprise Strategy Group) and I were talking about HPSS (High Performance Storage System), developed in conjunction with IBM and 5 US national labs, which supports vast amounts of data residing across primary disk and tape libraries.

Matt said that there are about a dozen large HPSS sites (HPSS website shows at least 30 sites using it) that store a significant portion of the world’s 1ZB (10**21 bytes) of digital data created this past year (see my 3.3 exabytes of data a day!? post).  Later that day, talking with Nathan Thompson, Spectra Logic CEO, he said these large HPSS sites probably store ~10% of the world’s data, or 100EB.  I find it difficult to comprehend that much data at only ~12 sites, but the national labs do have lots of data on hand.
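To put those numbers in perspective, here is the arithmetic in Python. The 1ZB total, the ~10% share and the ~12 sites come from the conversation above; dividing evenly across sites is purely my simplification.

```python
ZB = 10**21   # bytes in a zettabyte
EB = 10**18   # bytes in an exabyte

world_bytes = 1 * ZB     # digital data created this past year
hpss_percent = 10        # share attributed to the large HPSS sites
large_sites = 12         # "about a dozen" sites

hpss_bytes = world_bytes * hpss_percent // 100
per_site_eb = hpss_bytes / large_sites / EB   # assumes an even split

print(hpss_bytes // EB)        # 100 (EB across all large sites)
print(round(per_site_eb, 1))   # roughly 8.3 EB per site
```

An average of more than 8EB per site is what makes the claim so hard to picture.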

Nowadays you can get a Spectra Logic T-Finity tape complex with 122K slots, using LTO-4/-5 or IBM TS1140 (enterprise class) tape drives.  A T-Finity this large has 4 rows of tape libraries, which use the ‘Skyway’ to transport a TeraPack of tape cartridges from one library row to another.   All Spectra Logic libraries are built around a tape cartridge package they call the TeraPack, which contains 10 LTO cartridges or (I think) 9 TS1140 tape cartridges (they are bigger than LTO tapes).  The TeraPack is used to import or export tapes from the library and is used for all the tape slots in the library.

The software used to control all this is called BlueScale and is used in their T50e, a small, 50 slot library all the way up to the 122K T-Finity tape complex.  There are some changes for configuration, robotics and other personalization for each library type but the UI looks exactly the same across any of their libraries. Moreover, BlueScale offers the same enterprise level of functionality (e.g., drive and media life management) services for all Spectra Logic tape libraries.

Day 1 for SpectraPRDay closed with the lab tour and dinner.  Day 2 will start discussing futures and will be under NDA so there won’t be much to talk about right away. But from what I can see, Spectra Logic seems to be breaking down the barriers inhibiting tape use and providing tape library systems that people almost revere.

I haven’t seen that sort of reaction about a tape library since the STK 4400 first came out last century.

—-

Comments?

Graphene Flash Memory

Model of graphene structure by CORE-Materials (cc) (from Flickr)

I have been thinking about writing a post on “Is Flash Dead?” for a while now.  Well at least since talking with IBM research a couple of weeks ago on their new memory technologies that they have been working on.

But then this new Technology Review article came out  discussing recent research on Graphene Flash Memory.

Problems with NAND Flash

As we have discussed before, NAND flash memory has some serious limitations as it’s shrunk below 11nm or so. For instance, write endurance plummets, memory retention times are reduced and cell-to-cell interactions increase significantly.

These issues are not that much of a problem with today’s flash at 20nm or so. But to continue to follow Moore’s law and drop the price of NAND flash on a $/Gb basis, it will need to shrink below 16nm.  At that point or soon thereafter, current NAND flash technology will no longer be viable.

Other non-NAND based non-volatile memories

That’s why IBM and others are working on different types of non-volatile storage such as PCM (phase change memory), MRAM (magnetic RAM), FeRAM (Ferroelectric RAM) and others.  All these have the potential to improve general reliability characteristics beyond where NAND Flash is today and where it will be tomorrow as chip geometries shrink even more.

IBM seems to be betting on MRAM or racetrack memory technology because it has near DRAM performance, extremely low power and can store far more data in the same amount of space. It sort of reminds me of delay line memory where bits were stored on a wire line and read out as they passed across a read/write circuit. Only in the case of racetrack memory, the delay line is etched in a silicon circuit indentation with the read/write head implemented at the bottom of the cleft.

Graphene as the solution

Then along comes Graphene based Flash Memory.  Graphene can apparently be used as a substitute for the storage layer in a flash memory cell.  According to the report, the graphene stores data using less power and with better stability over time.  Both crucial problems with NAND flash memory as it’s shrunk below today’s geometries.  The research is being done at UCLA and is supported by Samsung, a significant manufacturer of NAND flash memory today.

Current demonstration chips are much larger than would be useful.  However, given graphene’s material characteristics, the researchers believe there should be no problem scaling it down below where NAND Flash would start exhibiting problems.  The next iteration of research will be to see if their scaling assumptions can hold when device geometry is shrunk.

The other problem is getting graphene, a new material, into current chip production.  Current materials used in chip manufacturing lines are very tightly controlled and  building hybrid graphene devices to the same level of manufacturing tolerances and control will take some effort.

So don’t look for Graphene Flash Memory to show up anytime soon. But given that 16nm chip geometries are only a couple of years out and 11nm, a couple of years beyond that, it wouldn’t surprise me to see Graphene based Flash Memory introduced in about 4 years or so.  Then again, I am no materials expert, so don’t hold me to this timeline.

 

—-

Comments?

IBM’s 120PB storage system

Susitna Glacier, Alaska by NASA Goddard Photo and Video (cc) (from Flickr)

Talk about big data, Technology Review reported this week that IBM is building a 120PB storage system for some unnamed customer.  Details are sketchy and I cannot seem to find any announcement of this on IBM.com.

Hardware

It appears that the system uses 200K disk drives to support the 120PB of storage.  The disk drives are packed in a new wider rack and are water cooled.  According to the news report the new wider drive trays hold more drives than current drive trays available on the market.

For instance, HP has a hot pluggable, 100 SFF (small form factor 2.5″) disk enclosure that sits in 3U of standard rack space.  200K SFF disks would take up about 154 full racks, not counting the interconnect switching that would be required.  Unclear whether water cooling would increase the density much but I suppose a wider tray with special cooling might get you more drives per floor tile.
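The rack estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the 40U usable rack height is my assumption (it is the figure used later in this post), and the function name is made up for illustration.

```python
import math

def racks_for_drives(total_drives, drives_per_enclosure, enclosure_height_u, usable_rack_u):
    """Return (enclosures, full racks) needed to house total_drives."""
    enclosures = math.ceil(total_drives / drives_per_enclosure)
    # Only whole enclosures fit in a rack, so leftover rack units are wasted.
    enclosures_per_rack = usable_rack_u // enclosure_height_u
    racks = math.ceil(enclosures / enclosures_per_rack)
    return enclosures, racks

# 200K SFF drives in 100-drive, 3U enclosures, assuming 40U usable per rack.
print(racks_for_drives(200_000, 100, 3, 40))  # (2000, 154)
```

Only 13 of the 3U enclosures fit in 40U, which is where the 154 comes from. With a 42U rack the same calculation gives 143 racks.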

There was no mention of interconnect, but today’s drives use either SAS or SATA.  SAS interconnects for 200K drives would require many separate SAS busses. With a SAS expander addressing 255 drives or other expanders, one would need at least 4 SAS busses, but this would put ~64K drives on each bus and would not perform well.  Something more like 64-128 drives per bus would perform much better, and each drive would need dual pathing.  If we use 100 drives per SAS string, that’s 2000 SAS drive strings or at least 4000 SAS busses (dual port access to the drives).
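As a rough check on the SAS numbers, here is a small Python sketch. It takes the ~64K drives per bus limit from the text as given rather than deriving it from the SAS spec, and the function names are mine.

```python
import math

def min_buses(total_drives, max_drives_per_bus):
    """Fewest buses needed if each bus is loaded to its addressing limit."""
    return math.ceil(total_drives / max_drives_per_bus)

def strings_and_buses(total_drives, drives_per_string, paths_per_drive=2):
    """Drive strings needed, and buses needed when every string is dual pathed."""
    strings = math.ceil(total_drives / drives_per_string)
    return strings, strings * paths_per_drive

print(min_buses(200_000, 64_000))       # 4
print(strings_and_buses(200_000, 100))  # (2000, 4000)
```

Changing `drives_per_string` to 64 or 128 shows how quickly the bus count moves with the string length you pick.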

The report mentioned GPFS as the underlying software which supports three cluster types today:

  • Shared storage cluster – where GPFS front end nodes access shared storage across the backend. This is generally SAN storage system(s).  But given the requirements for high density, it doesn’t seem likely that the 120PB storage system uses SAN storage in the backend.
  • Network based cluster – here the GPFS front end nodes talk over a LAN to a cluster of NSD (network shared disk) servers which can have access to all or some of the storage. My guess is this is what will be used in the 120PB storage system.
  • Shared network based clusters – this looks just like a bunch of NSD servers but provides access across multiple NSD clusters.

Given the above, ~100 drives per NSD server means another 1U per 100 drives or (given HP drive density) 4U per 100 drives, which works out to 1000 drives and 10 IO servers per 40U rack (not counting switching).  At this density it takes ~200 racks to hold 120PB of raw storage along with its 2000 NSD nodes.

Unclear how many GPFS front end nodes would be needed on top of this but even if it were 1 GPFS frontend node for every 5 NSD nodes, we are talking another 400 GPFS frontend nodes and at 1U per server, another 10 racks or so (not counting switching).

If my calculations are correct we are talking over 210 racks with switching thrown in to support the storage.  According to IBM’s discussion on the Storage challenges for petascale systems, it probably provides ~6TB/sec of data transfer which should be easy with 200K disks but may require even more SAS busses (maybe ~10K vs. the 2K strings, or 4K busses, discussed above).
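Pulling the node and rack guesses together, the sketch below redoes the sums in Python. Every input is one of the assumptions made above (100 drives per NSD server, 4U per enclosure plus server, 40U racks, 1 front end node per 5 NSD nodes, ~6TB/sec aggregate), not a published IBM figure.

```python
import math

TOTAL_DRIVES = 200_000
DRIVES_PER_NSD = 100       # assumed drives behind each NSD server
UNIT_HEIGHT_U = 4          # 3U drive enclosure + 1U NSD server
RACK_U = 40                # assumed usable rack height
NSD_PER_FRONTEND = 5       # assumed ratio of NSD to GPFS front end nodes
FRONTEND_HEIGHT_U = 1
AGGREGATE_BYTES_PER_SEC = 6e12  # ~6TB/sec

nsd_nodes = TOTAL_DRIVES // DRIVES_PER_NSD
storage_racks = math.ceil(nsd_nodes * UNIT_HEIGHT_U / RACK_U)
frontend_nodes = nsd_nodes // NSD_PER_FRONTEND
frontend_racks = math.ceil(frontend_nodes * FRONTEND_HEIGHT_U / RACK_U)
total_racks = storage_racks + frontend_racks

# Average throughput each drive must sustain to reach the aggregate rate.
mb_per_sec_per_drive = AGGREGATE_BYTES_PER_SEC / TOTAL_DRIVES / 1e6

print(nsd_nodes, storage_racks, frontend_nodes, frontend_racks, total_racks)
print(mb_per_sec_per_drive)
```

The 210 racks is before any switching, and ~30MB/sec per drive is why 6TB/sec looks comfortable for 200K disks.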

Software

IBM GPFS is used behind the scenes in IBM’s commercial SONAS storage system but has been around as a cluster file system designed for HPC environments for over 15 years now.

Given this many disk drives something needs to be done about protecting against drive failure.  IBM has been talking about declustered RAID algorithms for their next generation HPC storage system which spread the parity across more disks and as such, speed up rebuild time at the cost of reducing effective capacity. There was no mention of effective capacity in the report but this would be a reasonable tradeoff.  A 200K drive storage system should have a drive failure every 10 hours, on average (assuming a 2 million hour MTBF).  Let’s hope they get drive rebuild time down much below that.
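The failure interval is simple division, sketched below. It assumes drives fail independently at a constant rate, which is the usual simplification behind a quoted MTBF and not something real drive populations strictly follow.

```python
def hours_between_failures(drive_count, mtbf_hours):
    """Expected hours between drive failures across the whole population."""
    return mtbf_hours / drive_count

interval = hours_between_failures(200_000, 2_000_000)
failures_per_year = 24 * 365 / interval

print(interval)           # 10.0
print(failures_per_year)  # 876.0
```

So any rebuild that takes longer than about 10 hours would, on average, still be running when the next drive fails.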

The system is expected to hold around a trillion files.  Not sure but even at 1024 bytes of metadata per file, this number of files would chew up ~1PB of metadata storage space.
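For what it’s worth, the metadata estimate works out as follows. The 1024 bytes per file is just the guess used above, not a GPFS figure.

```python
FILES = 10**12                # a trillion files
METADATA_BYTES_PER_FILE = 1024  # assumed, per the estimate above

metadata_bytes = FILES * METADATA_BYTES_PER_FILE
metadata_pb = metadata_bytes / 10**15       # decimal petabytes
share_of_raw = metadata_pb / 120            # fraction of 120PB raw

print(metadata_pb)
print(share_of_raw)
```

That is a little over 1PB, or a bit under 1% of the raw capacity.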

GPFS provides ILM (information life cycle management, or data placement based on information attributes) using automated policies and supports external storage pools outside the GPFS cluster storage.  ILM within the GPFS cluster supports file placement across different tiers of storage.

All the discussion up to now revolved around homogeneous backend storage but it’s quite possible that multiple storage tiers could also be used.  For example, a high density but slower storage tier could be combined with a low density but faster storage tier to provide a more cost effective storage system.  Although, it’s unclear whether the application (real world modeling) could readily utilize this sort of storage architecture or whether they would care about system cost.

Nonetheless, presumably an external storage pool would be a useful adjunct to any 120PB storage system for HPC applications.

Can it be done?

Let’s see, 400 GPFS nodes, 2000 NSD nodes, and 200K drives. Seems like the hardware would be readily doable (not sure why they needed water cooling but hopefully they obtained better drive density that way).

Luckily GPFS supports Infiniband which can support 10,000 nodes within a single subnet.  Thus an Infiniband interconnect between the GPFS and NSD nodes could easily support a 2400 node cluster.

The only real question is can a GPFS software system handle 2000 NSD nodes and 400 GPFS nodes with trillions of files over 120PB of raw storage.

As a comparison here are some recent examples of scale out NAS systems:

It would seem that a 20X multiplier times a current Isilon cluster or even a 10X multiple of a currently supported SONAS system would take some software effort to work together, but seems entirely within reason.

On the other hand, Yahoo supports a 4000-node Hadoop cluster and it seems to work just fine.  So from a feasibility perspective, a ~2400 node GPFS-NSD node system seems just a walk in the park for Hadoop.

Of course, IBM Almaden is working on a project to support Hadoop over GPFS which might not be optimum for real world modeling but would nonetheless support the node count being talked about here.

——

I wish there was some real technical information on the project out on the web but I could not find any. Much of this is informed conjecture based on current GPFS system and storage hardware capabilities. But hopefully, I haven’t traveled too far astray.

Comments?

 

SNIA illuminates storage power efficiency

Untitled by johnwilson1969 (cc) (from Flickr)

At SNW, a couple of weeks back, SNIA announced the coming out of their green storage initiative’s new SNIA Emerald Program and the first public draft release of their storage power efficiency test specification.  Up until now, other than SPC and some pronouncements from EPA, there hasn’t been much standardization activity on how to measure storage power efficiency.

SNIA’s Storage Power Efficiency Specification

As such, SNIA felt there was a need for an industry standard on how to measure storage power use.  SNIA’s specification supplies a taxonomy for storage systems that can be used to define and categorize various storage systems. Their extensive taxonomy should minimize problems like comparing consumer storage power use against data center storage power use.  Also, the specification identifies storage use attributes such as deduplication and thin provisioning or capacity optimization features that can impact power efficiency.

In addition, the specification has two appendices:

  • Appendix A specifies the valid power and environmental meters that are to be used to measure power efficiency of the system under test.
  • Appendix B specifies the benchmark tool that is used to drive the system under test while its power efficiency is being measured.

Essentially, there are two approved benchmark drivers used to drive IOs in the online storage category, Iometer and vdbench, both of which are freely available.  Iometer has been employed for quite a while now in vendor benchmarking activity.  In contrast, vdbench is a relative newcomer but I have worked with its author, Henk Vandenbergh, over many years now and he is a consummate performance analyst.  I look forward to seeing how Henk’s vdbench matures over time.

Given the spec’s taxonomy and the fact that it lists online, near-online, removable media, virtual media and adjunct storage device categories with multiple sub-categories for each, we will focus only on the online family of storage and save the rest for later.

SPC energy efficiency measures

As my readers should recall, the Storage Performance Council (SPC) also has benchmarks that measure energy use with their SPC-1/E and SPC-1C/E reports (see our SPC-1 IOPS per Watt post).  The interesting part about SPC-1/E results is that there are definite IOPS levels where storage power use undergoes significant transitions.

One can examine a SPC-1/E Executive Summary report and see power use at various IO intensity levels, i.e., 100%, 95%, 90%, 85%, 80%, 50%, 10% and 0% (or idle) for a storage subsystem under test.   SPC summarizes these detailed power measurements by defining profiles for “Low”, “Medium” and “Heavy” storage system use.  But the devil’s often in the details and having all the above measurements allows one to calculate whatever activity profile works best for you.
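To illustrate what rolling your own activity profile might look like, here is a minimal Python sketch that time-weights the power measured at each IO intensity level. The wattages and the hours per day are made-up placeholders, not taken from any SPC-1/E report or from SPC’s own profile definitions, and the function name is mine.

```python
def average_power(watts_by_intensity, hours_by_intensity):
    """Time-weighted average power for a custom daily activity profile."""
    total_hours = sum(hours_by_intensity.values())
    if total_hours <= 0:
        raise ValueError("profile must cover a positive number of hours")
    weighted = sum(
        watts_by_intensity[level] * hours
        for level, hours in hours_by_intensity.items()
    )
    return weighted / total_hours

# Placeholder watts at each IO intensity level (percent of max IOPS).
measured_watts = {100: 1000.0, 80: 950.0, 50: 900.0, 10: 850.0, 0: 800.0}

# Hypothetical day: 6 hours at 80%, 10 hours at 50%, 8 hours idle.
my_profile = {80: 6, 50: 10, 0: 8}

print(average_power(measured_watts, my_profile))
```

Swap in the wattages from a real Executive Summary and your own hours per level to get an average that matches how your shop actually runs.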

Unfortunately, only a few SPC-1/E reports have been submitted to date and it has yet to take off.

SNIA alternative power efficiency metrics

Enter SNIA’s Emerald program, which is supposed to be an easier and quicker way to measure storage power use.  In addition to the specification, SNIA has established a website (see above) to hold SNIA approved storage power efficiency results and a certification program for auditors that can be used to verify that vendor power efficiency testing meets all specification requirements.

What’s missing from the present SNIA power efficiency test specification are the following:

  • More strict IOPS level definitions – the specification refers to IO intensity but doesn’t provide an adequate definition from my perspective.  It says that subsystem response time cannot exceed 30msec and uses this to define 100% IO intensity for the workloads.  However, given this definition it could apply to random read, random write, or mixed workloads and there is no separate specification for sequential or random (and/or mixed) workloads.  This could be tightened up.
  • More IO intensity levels measured – the specification calls for power measurements at an IO intensity of 100% for all workloads and 25% for 70:30 R:W workloads for online storage.  However we would be more interested in also seeing 80% and 10%.  From a user perspective, 80% probably represents a heavy sustainable IO workload and 10% looks like a complete cache hit workload.  We would only measure these levels for the “Mixed workload” so as to minimize effort.
  • More write activity in “Mixed workloads” – the specification defines mixed workload as 70% read and 30% write random IO activity.  Given today’s O/S propensity to buffer read data, it would seem more prudent to use a 50:50 Read to Write mix.

Probably other items need more work as well, such as defining a standardized reporting format containing a detailed description of HW and SW of system under test, benchmark driver HW and SW, table for reporting all power efficiency metrics and inclusion of full benchmark report including input parameter specifications and all outputs, etc. but these are nits.

Finally, SNIA’s specification goes into much detail about capacity optimization testing which includes things like compression, deduplication, thin provisioning, delta-snapshotting, etc. with an intent to measure storage system power use when utilizing these capabilities.  This is a significant and complex undertaking to define how each of these storage features will be configured and used during power measurement testing.  Although SNIA should be commended for their efforts here, this seems too much to take on at the start.  We suggest capacity optimization testing definitions should be deferred to a later release and focus now on the more standard storage power efficiency measurements.

—-

I critique specifications at my peril.  Being wrong in the past has caused me to re-double efforts to ensure a correct interpretation of any specification.  However, if there’s something I have misconstrued or missed here that is worthy of note please feel free to comment.