Facebook down to 1.08 PUE and counting for cold storage

I read a recent article in Ars Technica about Facebook's cold storage archive and their sustainable data centers (How Facebook puts petabytes of old cat pix on ice in the name of sustainability). In the article there was a statement that Facebook had achieved a 1.08 PUE (Power Usage Effectiveness) for one of these data centers. This means that for every 100 Watts used to power the racks, Facebook needed to add only 8 Watts for other overhead.

Just last year I wrote a paper for a client in which I interviewed the CEO of an outsourced data center provider (DuPont Fabros Technology) whose state-of-the-art new data centers were achieving PUEs of 1.14 to 1.18. For Facebook to run their cold storage data centers at a 1.08 PUE is even better.

At the moment, Facebook has two cold storage data centers: one at Prineville, OR and the other at Forest City, NC (Forest City achieved the 1.08 PUE). The two cold storage sites complement the other Facebook data centers that handle everything else in the Facebook universe.

MAID to the rescue

First off, these are cold storage data centers only, holding over an EB of data, but it's still archive storage, racks and racks of it. Whether something is deemed cold or hot seems to depend on last use. For example, if a picture has been referenced recently then it's warm; if not, it's cold.

Second, they have taken MAID (massive array of idle disks) to a whole new, data-center-wide level. Each 1U shelf (a Knox storage tray) holds 30 4TB drives and a rack has 16 of these storage trays, holding 1.92PB of data. At any one time, only one drive in each storage tray is powered up. The racks have dual servers and only one power shelf (due to the reduced power requirements).
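To put the MAID approach in rough numbers, here's a quick back-of-the-envelope sketch in Python. The tray, rack and capacity figures come from the post; the per-drive active and idle wattages are my assumptions, not Facebook's published numbers:

```python
# Back-of-the-envelope MAID math for a cold storage rack.
# Tray/rack counts are from the post; per-drive wattages are assumptions.

DRIVES_PER_TRAY = 30          # Knox storage tray
TRAYS_PER_RACK = 16
DRIVE_CAPACITY_TB = 4

ACTIVE_W_PER_DRIVE = 7.0      # assumed spinning/seeking power per drive
IDLE_W_PER_DRIVE = 0.7        # assumed spun-down/standby power per drive

drives_per_rack = DRIVES_PER_TRAY * TRAYS_PER_RACK
capacity_pb = drives_per_rack * DRIVE_CAPACITY_TB / 1000

# MAID rule from the post: only one drive per tray is powered up at a time.
active_drives = TRAYS_PER_RACK
idle_drives = drives_per_rack - active_drives

maid_watts = active_drives * ACTIVE_W_PER_DRIVE + idle_drives * IDLE_W_PER_DRIVE
all_on_watts = drives_per_rack * ACTIVE_W_PER_DRIVE

print(f"Drives/rack: {drives_per_rack}, capacity: {capacity_pb:.2f} PB")
print(f"Drive power, MAID: ~{maid_watts:.0f} W vs. all spinning: ~{all_on_watts:.0f} W")
print(f"Fraction of drives active: {active_drives / drives_per_rack:.1%}")
```

With only ~3% of the drives spinning at any moment, it's easy to see why a single power shelf per rack suffices.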

They also use pre-fetch hints provided by the Facebook application to cache user data. This means they will fetch some images ahead of time when users are paging through photos in their stream, in order to have them in cache when needed. After the user looks at or passes over a photo, it is jettisoned from cache and the next photo is pre-fetched. When the disks are no longer busy, they are powered down.
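The gist of that caching behavior can be sketched in a few lines. This is purely illustrative, not Facebook's code, and the look-ahead depth is an arbitrary assumption:

```python
# Illustrative sketch of stream-style prefetching: as the user pages forward,
# fetch the next few photos into cache, drop the ones already viewed,
# and let drives that go idle spin back down.

PREFETCH_DEPTH = 3   # assumed look-ahead depth

def page_through(photo_ids, fetch):
    cache = {}
    for i, current in enumerate(photo_ids):
        # Prefetch the next few photos the user is likely to view next.
        for upcoming in photo_ids[i + 1:i + 1 + PREFETCH_DEPTH]:
            if upcoming not in cache:
                cache[upcoming] = fetch(upcoming)   # may spin up a cold drive
        # Serve the current photo, from cache if it was prefetched earlier.
        image = cache.pop(current, None) or fetch(current)
        yield current, image
        # Viewed photos have been popped; cache only holds the look-ahead window.

# Usage:
#   list(page_through(["p1", "p2", "p3", "p4"], fetch=lambda pid: f"<{pid} bytes>"))
```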

Fewer power conversions lower PUE

Another thing Facebook is doing is reducing the number of power conversions needed to feed the racks. In a typical data center, power comes in at 480 Volts AC, flows through the data center UPS, is stepped down to 208 Volts AC at the PDU, and then flows to the rack power supply, where it is converted to 12 Volts DC. Each conversion wastes some energy, and in the end only about 85% of the power coming in reaches the rack's servers and storage.

In Facebook's data centers, 480 Volts AC is channeled directly to the racks, which have an in-rack battery backup/UPS, and the rack's power bus converts the 480 Volts AC to 12 Volts DC (or AC) as needed. By cutting out the data center level UPS and the PDU conversion, they save a lot of energy overhead which can instead go to powering the racks.
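A simple way to see why cutting conversion stages matters is to multiply the per-stage efficiencies together. The individual stage efficiencies below are my rough assumptions, chosen so the conventional chain lands near the ~85% figure mentioned above:

```python
# Rough power-chain comparison; per-stage efficiencies are assumptions,
# picked so the conventional chain delivers ~85% of input power to the IT gear.

def chain_efficiency(stages):
    eff = 1.0
    for stage_eff in stages:
        eff *= stage_eff
    return eff

conventional = [0.94, 0.96, 0.94]   # central UPS, 480->208V PDU, rack PSU to 12V DC
facebook_style = [0.98, 0.95]       # in-rack battery/UPS, 480V direct to rack power bus

for name, stages in (("conventional", conventional), ("Facebook-style", facebook_style)):
    print(f"{name:15s}: {chain_efficiency(stages):.1%} of facility power reaches the racks")
```

Every stage you remove is efficiency you never have to win back, and that shows up directly in PUE.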

Free air cooling helps

Facebook data centers like Prineville also make use of "fresh air cooling" that mixes data center air with outside air and flows it through "wetted media" to cool it, which is then sent down to cool the racks by convection. This process keeps the rack servers and storage within the proper temperature range, though the facility probably runs hotter than most data centers this way. How much fresh air is brought in depends on outside temperature, but during most months it works very well.

This is in contrast to standard data centers that use chillers, fans and pumps to keep the data center air moving, conditioned and cold enough to chill the equipment. All those fans, pumps and chillers can consume a lot of energy.

Renewable energy, too

Lately, Facebook has made obtaining renewable energy to power their data centers a high priority. One new data center was sited close to the Arctic Circle because of available hydro-power; others in Iowa and Texas were built in locations with wind power.

All of this technology, open sourced

Facebook has open sourced all of its hardware and data center designs. That is, the specifications for all the hardware discussed above, and more, are available from the Open Compute Project, including the storage specification(s), open rack specification(s) and data center specification(s) for these data centers.

So if you want to build your own cold storage archive that can achieve 1.08 PUE, just pick up their specs and have at it.

Comments?

Picture Credits: DataCenterKnowledge.Com


Nantero emerges from stealth with CNT-based NRAM

Nantero just came out of stealth this week and bagged $31.5M in a Series E funding round. Apparently, Nantero has been developing a new form of non-volatile RAM (NRAM), based on carbon nanotubes (CNT), which seems to work like an old T-bar switch, only at the nanometer scale and using CNTs for the wiring.

They were founded in 2001 and are finally ready to emerge from stealth. Nantero already has 175+ issued patents, with another 200 patents pending. The NRAM is already in production at 7 CMOS fabs, and they are sampling 4Mb NRAM chips to a number of customers.

NRAM vs. NAND

Performance of the NRAM is on a par with DRAM (~100 times faster than NAND), it can be configured in 3D, and it supports MLC (multi-bits per cell) configurations. NRAM also supports orders of magnitude more write cycles (I assume that's what they mean by accesses) and retains data much longer than NAND.

The only question is capacity: with shipping NAND on the order of 200Gb, the 4Mb NRAM is roughly 50,000X (about 2**15 to 2**16) behind NAND. Nantero claims that their CNT-NRAM CMOS process can be scaled down to <5nm, which is one or two generations below the current NAND scale factor. Assuming they can pack as many bits in the same area, they should be able to compete well with NAND. They claim that their NRAM technology is capable of terabit capacities (presumably at the 5nm node).
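Here's the capacity-gap and scaling arithmetic spelled out. Only the 4Mb sample size and the ~200Gb NAND figure come from above; the 16nm starting node and the bits-scale-with-1/feature² rule are my assumptions:

```python
import math

# Rough capacity-gap and scaling arithmetic. Only the 4Mb sample chip and
# ~200Gb NAND figures come from the post; the rest are scaling assumptions.

nram_sample_bits = 4 * 2**20            # 4Mb sample chips
nand_shipping_bits = 200 * 2**30        # ~200Gb shipping NAND die

gap = nand_shipping_bits / nram_sample_bits
print(f"Capacity gap today: ~{gap:,.0f}X (about 2**{math.log2(gap):.0f})")

# If bits per die scale roughly as 1/(feature size)^2, shrinking an assumed
# ~16nm cell down to 5nm buys approximately:
scale_gain = (16 / 5) ** 2
print(f"Density gain from 16nm to 5nm: ~{scale_gain:.1f}X per die")
```

Scaling alone doesn't close a ~50,000X gap, of course; getting to terabit parts also means much larger arrays, 3D stacking and MLC, which is presumably where the Series E money goes.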

The other nice thing is that Nantero says the new NRAM uses less power than DRAM, which means that in addition to attaining higher capacities and DRAM-like access times, it should also reduce power consumption.

It seems a natural for mobile applications. The press release claims it has already been tested in space, and there are customers looking at the technology for automobiles. The company claims the total addressable market is ~$170B USD, which probably includes DRAM and NAND together.

CNT in CMOS chips?

Key to Nantero's technology was incorporating CNTs into CMOS processes, so that chips can be manufactured on current fab lines. This is probably just the start of the use of CNTs in electronic chips, but it's one that could potentially pay for the technology development many times over. CNTs have a number of characteristics that would benefit other electronic circuitry beyond NRAM.

How quickly they can ramp capacity up from 4Mb seems to be the significant factor, which is no doubt why they went out for Series E funding.

So we have another new non-volatile memory technology. On the other hand, these guys seem to be well beyond the lab, with something that works today and the potential to go all the way down to 5nm.

It should be interesting, as the other NV technologies start to emerge, to see which one generates sufficient market traction to succeed in the long run. Especially as NAND doesn't seem to be slowing down much.

Comments?

Picture Credits: Wikimedia.com

NetApp Analyst Summit Customer Panel – how to survive a category 5 tornado

NetApp had three of their customer innovation winners come up on stage for a panel discussion, with Dave Hitz moderating. All three had interesting deployments of NetApp storage systems:

  • Andrew Henderson from ING DIRECT talked about their need to deploy copies of the bank's IT environment for test, development, optimization and security testing. This process took 12 weeks the first time they tried it and only created a single copy. They wanted to speed this up and be able to deploy 10 or more copies if necessary. Andrew looked at Microsoft Hyper-V, System Center and NetApp FlexClones and transformed this process so it now generates a copy of the entire bank's IT services in under 10 minutes. Since the new capabilities have been in place, they have created over 400 copies of the bank (he called these bank-in-a-box) for various purposes.
  • Teresa Wahlert from the Iowa Workforce Development Agency was up next and talked about their VDI implementation. Iowa cut their budget, which forced them to shut down a number of physical offices. But with VDI, VMware and NetApp storage, Workforce was able to disperse their services to over 3000 locations, now in prisons, libraries, and other venues where they had no presence before. They put out a general call for all the tired, dying PCs in Iowa government and used these to host VDI services. Now Workforce services are available 7X24 at these locations, pretty amazing for government work. Apparently they had tried VDI before and their previous storage couldn't handle it. They moved to NetApp with FlashCache and it worked just fine. That's when they rolled out VDI services to their customers and businesses. With NetApp they were able to implement VDI, reduce storage costs (via deduplication and other storage efficiency features) and increase department services.
  • Jeff Bell at Mercy Healthcare talked about the difficulties of rolling out electronic health records (EHR) and the challenges of integrating ~30 hospitals and ~400 medical clinics. They started with EHR fairly early, in 2006-2007, well before the latest governmental push. He mentioned Joplin, MO and last year's category 5 tornado, which about wiped out their hospital there. He said that within 2 hours of the disaster, Mercy Healthcare was printing out the EHRs for the 183 patients in the hospital at the time who had to be moved to other care facilities. The promise of EHR is that the information travels with the patient, can be recovered in the event of a disaster and is immediately available. It seems that, at least at Mercy Healthcare, EHR is living up to its promise. In addition, they just built a new data center as they were running out of space, power and cooling at the old one. They installed new NetApp storage there and for the first few months had to run heaters to keep the data center livable because the new power/cooling load was so far below what they had experienced previously. Looking back on what they had accomplished, Jeff was not so sure they would build a new data center again. With new cloud offerings coming out and the reduced power/cooling needs and increased density of NetApp storage, they could almost get by without another data center at all.

That’s about it from the customer session.

NetApp execs spent the rest of the day on innovation, mostly at NetApp but also in the IT industry in general.

There was lots of discussion of the new release of Data ONTAP 8.1.1 with its latest cluster-mode features. NetApp positioned it as completing the transition to data/storage as an infrastructure that IT has been pushing for the last decade or so, following in the grand tradition of what IBM did for computing infrastructure with the 360 and what Cisco and others did for networking infrastructure in the mid-80s.

Comments?

Tape still alive, well and growing at Spectra Logic

T-Finity library at SpectraLogic's test facility (c) 2011 Silverton Consulting, All Rights Reserved

Today I met with Spectra Logic execs and some of their Media and Entertainment (M&E) customers, and toured their manufacturing floor, test labs and briefing center. The tour was a blast, and the customers, Kyle Knack from National Geographic (Nat Geo) Global Media, Toni Perez from Medcom (a Panama-based entertainment company) and Lee Coleman from Entertainment Tonight (ET), all talked about their use of T-950 Spectra Logic tape libraries in their media ingest, editing and production processes.

Mr. Coleman from ET spoke almost reverently about their T-950 and how it has enabled ET to access over 30 years of video interviews, movie segments and other media they can now use to put together clips on just about any entertainment subject imaginable.

He  talked specifically about the obit they did for Michael Jackson and how they were able to grab footage from an interview they did years ago and splice it together with more recent media to show a more complete story.  He also showed a piece on some early Eddie Murphy film footage and interviews they had done at the time which they used in a recent segment about his new movie.

All this was made possible by moving to digital file formats and placing digital media in their T-950 tape libraries.

Spectra Logic T-950 (I think) with TeraPack loaded in robot (c) 2011 Silverton Consulting, All Rights Reserved

Mr. Knack from Nat Geo Media said every bit of media they get now automatically goes into the library archive and becomes the "original copy" of the media, used in case other copies are corrupted or lost. Nat Geo started out only putting important media in the library, but found it costs so much less to store it in the tape archive that they decided it made more sense to move all media to the tape library.

Typically they keep two copies in their tape library, and important media is also copied to tape and shipped offsite (3 copies for this data). They have a 4-frame T-950 with around 4000 slots and 14 drives (a combination of LTO-4 and -5). They use FC and FCoE attached storage for primary storage and depend on 1000s of SATA drives behind it for primary storage access.

He said they only use SSDs for some metadata support for their web site. He found that SATA drives can handle their big-block sequential workloads and provide consistent throughput and, especially important to M&E companies, consistent latency.

3D printer at Spectra Logic (for mechanical parts fabrication) (c) 2011 Silverton Consulting, All Rights Reserved

Mr. Perez from MedCom had much the same story. They are in the process of moving off of a proprietary video tape format (Sony Betacam) to LTO media and digital files. The process is still ongoing, although they are more than halfway there for current production.

They still have a lot of old media in Betacam format, which will take them years to convert to digital files, but they are at least starting this activity. He said a recent move from one site to another revealed that many of the Betacam tapes were no longer readable. Digital files on LTO tape should solve that problem for them when they finally get there.

Matt Starr, Spectra Logic CTO, talked about the history of tape libraries at Spectra Logic, which was founded in 1998 and has been laser focused on tape data protection and tape libraries.

I find it pleasantly surprising that a company today can supply just tape libraries and software and make a going concern of it. Spectra Logic must be doing something right: revenue grew 30% YoY last year, they are outgrowing the current (88K sq ft) office, lab, and manufacturing building they moved into earlier this year, and they have just signed to occupy another building providing 55K sq ft of additional space.

T-Series robot returning TeraPack to shelf (c) 2011 Silverton Consulting, All Rights Reserved

Molly Rector, Spectra Logic CMO, talked about the shift in the market from peta-scale (10**15 bytes) storage repositories to exa-scale (10**18 bytes) ones. Ms. Rector believes that today's cloud storage environments can take advantage of these large tape-based archives to provide much more economical storage for their users without suffering any performance penalty.

At lunch with Matt Starr, Fred Moore (Horison Information Strategies), Mark Peters (Enterprise Strategy Group) and I were talking about HPSS (High Performance Storage System), developed by IBM in conjunction with 5 US national labs, which supports vast amounts of data residing across primary disk and tape libraries.

Matt said that there are about a dozen large HPSS sites (the HPSS website shows at least 30 sites using it) that store a significant portion of the world's 1ZB (10**21 bytes) of digital data created this past year (see my 3.3 exabytes of data a day!? post). Later that day, talking with Nathan Thompson, Spectra Logic CEO, he said these large HPSS sites probably store ~10% of the world's data, or 100EB. I find it difficult to comprehend that much data at only ~12 sites, but the national labs do have lots of data on hand.

Nowadays you can get a Spectra Logic T-Finity tape complex with 122K slots, using LTO-4/-5 or IBM TS1140 (enterprise class) tape drives. A T-Finity this large has 4 rows of tape libraries and uses the 'Skyway' to transport a TeraPack of tape cartridges from one library row to another. All Spectra Logic libraries are built around a tape cartridge package they call the TeraPack, which contains 10 LTO cartridges or (I think) 9 TS1140 tape cartridges (they are bigger than LTO tapes). The TeraPack is used to import and export tapes from the library, and all the tape slots in the library hold TeraPacks.
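For a sense of scale, here's a quick native-capacity estimate for a fully populated 122K-slot complex. The 1.5TB (LTO-5) and 4TB (TS1140) native cartridge capacities are published figures; the assumption that every slot holds the same media type is mine:

```python
# Native (uncompressed) capacity of a fully populated 122K-slot T-Finity,
# assuming every slot holds the same cartridge type.

SLOTS = 122_000
CARTRIDGE_TB = {"LTO-5": 1.5, "TS1140": 4.0}   # native cartridge capacities

for media, tb in CARTRIDGE_TB.items():
    total_pb = SLOTS * tb / 1000
    print(f"{media:7s}: ~{total_pb:,.0f} PB native across 122K slots")
```

Either way you slice it, that's hundreds of petabytes in a single tape complex, before compression.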

The software used to control all this is called BlueScale and it runs everything from the T50e, a small 50-slot library, all the way up to the 122K-slot T-Finity tape complex. There are some changes for configuration, robotics and other personalization for each library type, but the UI looks exactly the same across all of their libraries. Moreover, BlueScale offers the same enterprise level of functionality (e.g., drive and media life management) for all Spectra Logic tape libraries.

Day 1 of the Spectra Logic event closed with the lab tour and dinner. Day 2 will cover futures and will be under NDA, so there won't be much to talk about right away. But from what I can see, Spectra Logic seems to be breaking down the barriers inhibiting tape use and providing tape library systems that people almost revere.

I haven't seen that sort of reaction to a tape library since the STK 4400 first came out last century.

—-

Comments?

Graphene Flash Memory

Model of graphene structure by CORE-Materials (cc) (from Flickr)

I have been thinking about writing a post on "Is Flash Dead?" for a while now, at least since talking with IBM Research a couple of weeks ago about the new memory technologies they have been working on.

But then this new Technology Review article came out  discussing recent research on Graphene Flash Memory.

Problems with NAND Flash

As we have discussed before, NAND flash memory has some serious limitations as it’s shrunk below 11nm or so. For instance, write endurance plummets, memory retention times are reduced and cell-to-cell interactions increase significantly.

These issues are not that much of a problem with today’s flash at 20nm or so. But to continue to follow Moore’s law and drop the price of NAND flash on a $/Gb basis, it will need to shrink below 16nm.  At that point or soon thereafter, current NAND flash technology will no longer be viable.

Other non-NAND based non-volatile memories

That's why IBM and others are working on different types of non-volatile memory such as PCM (phase change memory), MRAM (magnetic RAM), FeRAM (ferroelectric RAM) and others. All of these have the potential to improve general reliability characteristics beyond where NAND flash is today and where it will be tomorrow as chip geometries shrink even more.

IBM seems to be betting on MRAM or racetrack memory technology because it has near DRAM performance, extremely low power and can store far more data in the same amount of space. It sort of reminds me of delay line memory where bits were stored on a wire line and read out as they passed across a read/write circuit. Only in the case of racetrack memory, the delay line is etched in a silicon circuit indentation with the read/write head implemented at the bottom of the cleft.

Graphene as the solution

Then along comes graphene-based flash memory. Graphene can apparently be used as a substitute for the storage layer in a flash memory cell. According to the report, the graphene stores data using less power and with better stability over time, both crucial problems with NAND flash memory as it's shrunk below today's geometries. The research is being done at UCLA and is supported by Samsung, a significant manufacturer of NAND flash memory today.

Current demonstration chips are much larger than would be useful.  However, given graphene’s material characteristics, the researchers believe there should be no problem scaling it down below where NAND Flash would start exhibiting problems.  The next iteration of research will be to see if their scaling assumptions can hold when device geometry is shrunk.

The other problem is getting graphene, a new material, into current chip production.  Current materials used in chip manufacturing lines are very tightly controlled and  building hybrid graphene devices to the same level of manufacturing tolerances and control will take some effort.

So don’t look for Graphene Flash Memory to show up anytime soon. But given that 16nm chip geometries are only a couple of years out and 11nm, a couple of years beyond that, it wouldn’t surprise me to see Graphene based Flash Memory introduced in about 4 years or so.  Then again, I am no materials expert, so don’t hold me to this timeline.


—-

Comments?

IBM’s 120PB storage system

Susitna Glacier, Alaska by NASA Goddard Photo and Video (cc) (from Flickr)

Talk about big data: Technology Review reported this week that IBM is building a 120PB storage system for some unnamed customer. Details are sketchy and I cannot seem to find any announcement of this on IBM.com.

Hardware

It appears that the system uses 200K disk drives to support the 120PB of storage.  The disk drives are packed in a new wider rack and are water cooled.  According to the news report the new wider drive trays hold more drives than current drive trays available on the market.

For instance, HP has a hot pluggable, 100 SFF (small form factor, 2.5″) disk enclosure that sits in 3U of standard rack space. 200K SFF disks would take up about 154 full racks, not counting the interconnect switching that would be required. It's unclear whether water cooling would increase the density much, but I suppose a wider tray with special cooling might get you more drives per floor tile.

There was no mention of the interconnect, but today's drives use either SAS or SATA. SAS interconnects for 200K drives would require many separate SAS busses. With SAS expanders, each addressing up to 255 drives or other expanders, one could in theory get by with as few as 4 SAS busses, but that would put tens of thousands of drives on each bus and would not perform well. Something more like 64-128 drives per bus would perform much better, and each drive would need dual pathing. If we use 100 drives per SAS string, that's 2000 SAS drive strings, or at least 4000 SAS busses (for dual-port access to the drives).
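Redoing that fan-out arithmetic in a few lines (the drives-per-string choices and the dual-pathing assumption follow the reasoning above, not anything IBM has disclosed):

```python
# Rough SAS fan-out arithmetic for a 200K-drive system. The drives-per-string
# options and dual pathing follow the post's reasoning, not IBM specs.

TOTAL_DRIVES = 200_000

for drives_per_string in (255, 128, 100, 64):
    strings = -(-TOTAL_DRIVES // drives_per_string)   # ceiling division
    busses = strings * 2                              # dual-ported drives
    print(f"{drives_per_string:3d} drives/string -> {strings:5d} strings, "
          f"{busses:5d} SAS busses with dual pathing")
```

At 100 drives per string that works out to the 2000 strings and 4000 busses mentioned above.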

The report mentioned GPFS as the underlying software which supports three cluster types today:

  • Shared storage cluster – where GPFS front-end nodes access shared storage across the backend. This is generally SAN storage system(s). But given the requirements for high density, it doesn't seem likely that the 120PB storage system uses SAN storage in the backend.
  • Network-based cluster – here the GPFS front-end nodes talk over a LAN to a cluster of NSD (network shared disk) servers, which can have access to all or some of the storage. My guess is this is what will be used in the 120PB storage system.
  • Shared Network based clusters – this looks just like a bunch of NSD servers but provides access across multiple NSD clusters.

Given the above, ~100 drives per NSD server means an extra 1U per 100 drives, or (given HP drive density) 4U in total per 100 drives. That's 1000 drives and 10 NSD servers per 40U rack (not counting switching). At this density it takes ~200 racks for 120PB of raw storage plus NSD nodes, and 2000 NSD nodes in all.

It's unclear how many GPFS front-end nodes would be needed on top of this, but even at 1 GPFS front-end node for every 5 NSD nodes, we are talking another 400 GPFS front-end nodes, and at 1U per server, another 10 racks or so (not counting switching).

If my calculations are correct, we are talking over 210 racks, with switching thrown in, to support the storage. According to IBM's discussion of the storage challenges for petascale systems, it probably provides ~6TB/sec of data transfer, which should be easy with 200K disks but may require even more SAS busses (maybe ~10K vs. the 2K strings discussed above).
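Pulling the density and node-count guesses into one place (all of this is conjecture built on the HP-style enclosure density and the 100-drives-per-NSD-server assumption above, not IBM's actual design):

```python
# Conjectural rack/node sizing for a 120PB, 200K-drive GPFS system, using the
# HP-style 100 SFF drives per 3U and 100 drives per NSD server assumed above.

TOTAL_DRIVES = 200_000
DRIVES_PER_ENCLOSURE, ENCLOSURE_U = 100, 3
NSD_SERVER_U = 1
DRIVES_PER_NSD = 100
USABLE_RACK_U = 40
NSD_PER_GPFS_NODE = 5               # 1 GPFS front-end node per 5 NSD nodes

u_per_100_drives = ENCLOSURE_U + NSD_SERVER_U            # 4U per 100 drives
drives_per_rack = USABLE_RACK_U // u_per_100_drives * DRIVES_PER_ENCLOSURE

storage_racks = -(-TOTAL_DRIVES // drives_per_rack)      # ceiling division
nsd_nodes = TOTAL_DRIVES // DRIVES_PER_NSD
gpfs_nodes = nsd_nodes // NSD_PER_GPFS_NODE
gpfs_racks = -(-gpfs_nodes * NSD_SERVER_U // USABLE_RACK_U)

print(f"Storage+NSD racks: ~{storage_racks}, NSD nodes: {nsd_nodes}")
print(f"GPFS front-end nodes: {gpfs_nodes} (~{gpfs_racks} more racks, before switching)")
```

That reproduces the ~200 storage racks, 2000 NSD nodes and 400 GPFS front-end nodes estimated above.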

Software

IBM GPFS is used behind the scenes in IBM's commercial SONAS storage system, but it has been around as a cluster file system designed for HPC environments for over 15 years now.

Given this many disk drives, something needs to be done about protecting against drive failure. IBM has been talking about declustered RAID algorithms for their next generation HPC storage system, which spread the parity across more disks and, as such, speed up rebuild time at the cost of reducing effective capacity. There was no mention of effective capacity in the report, but this would be a reasonable tradeoff. A 200K drive storage system should have a drive failure every 10 hours, on average (assuming a 2 million hour MTBF). Let's hope they get drive rebuild time down much below that.

The system is expected to hold around a trillion files. I'm not sure, but even at 1024 bytes of metadata per file, this number of files would chew up ~1PB of metadata storage space.
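Both of those estimates are easy to reproduce; the 2 million hour MTBF and ~1KB of metadata per file are the assumptions stated above:

```python
# Reproducing the drive-failure and metadata estimates. The 2M-hour MTBF and
# ~1KB of metadata per file are the assumptions stated in the post.

TOTAL_DRIVES = 200_000
MTBF_HOURS = 2_000_000
FILES = 1_000_000_000_000            # ~1 trillion files
METADATA_BYTES_PER_FILE = 1024

hours_between_failures = MTBF_HOURS / TOTAL_DRIVES
metadata_pb = FILES * METADATA_BYTES_PER_FILE / 1000**5

print(f"Expected drive failure every ~{hours_between_failures:.0f} hours")
print(f"Metadata footprint: ~{metadata_pb:.1f} PB at 1KB per file")
```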

GPFS provides ILM (information life cycle management, or data placement based on information attributes) using automated policies and supports external storage pools outside the GPFS cluster storage.  ILM within the GPFS cluster supports file placement across different tiers of storage.

All the discussion up to now has revolved around homogeneous backend storage, but it's quite possible that multiple storage tiers could also be used. For example, a high density but slower storage tier could be combined with a low density but faster storage tier to provide a more cost effective storage system. Although, it's unclear whether the application (real world modeling) could readily utilize this sort of storage architecture, or whether they would care about system cost.

Nonetheless, presumably an external storage pool would be a useful adjunct to any 120PB storage system for HPC applications.

Can it be done?

Let's see: 400 GPFS nodes, 2000 NSD nodes, and 200K drives. It seems like the hardware would be readily doable (not sure why they needed water cooling, but hopefully they obtained better drive density that way).

Luckily GPFS supports InfiniBand, which can support 10,000 nodes within a single subnet. Thus an InfiniBand interconnect between the GPFS and NSD nodes could easily support a 2400-node cluster.

The only real question is whether the GPFS software can handle 2000 NSD nodes and 400 GPFS nodes with a trillion files over 120PB of raw storage.

As a comparison, consider some recent scale-out NAS systems such as EMC Isilon clusters and IBM's SONAS.

It would seem that a 20X multiple of a current Isilon cluster, or even a 10X multiple of a currently supported SONAS system, would take some software effort to pull off, but it seems entirely within reason.

On the other hand, Yahoo supports a 4000-node Hadoop cluster that seems to work just fine. So from a feasibility perspective, a 2400-node GPFS/NSD system seems like a walk in the park compared to what Hadoop already handles.

Of course, IBM Almaden is working on a project to support Hadoop over GPFS, which might not be optimal for real world modeling but would nonetheless support the node counts being talked about here.

——

I wish there were some real technical information on the project out on the web, but I could not find any. Much of this is informed conjecture based on current GPFS and storage hardware capabilities. But hopefully I haven't traveled too far astray.

Comments?


SNIA illuminates storage power efficiency

Untitled by johnwilson1969 (cc) (from Flickr)

At SNW a couple of weeks back, SNIA announced the coming out of their green storage initiative's new SNIA Emerald Program and the first public draft release of their storage power efficiency test specification. Up until now, other than SPC and some pronouncements from the EPA, there hasn't been much standardization activity on how to measure storage power efficiency.

SNIA’s Storage Power Efficiency Specification

As such, SNIA felt there was a need for an industry standard on how to measure storage power use. SNIA's specification supplies a taxonomy that can be used to define and categorize various storage systems. This extensive taxonomy should minimize problems like comparing consumer storage power use against data center storage power use. Also, the specification identifies storage attributes, such as deduplication, thin provisioning and other capacity optimization features, that can impact power efficiency.

In addition, the specification has two appendices:

  • Appendix A specifies the valid power and environmental meters that are to be used to measure power efficiency of the system under test.
  • Appendix B specifies the benchmark tool that is used to drive the system under test while its power efficiency is being measured.

Essentially, there are two approved benchmark drivers used to drive IOs in the online storage category: Iometer and vdbench, both of which are freely available. Iometer has been employed for quite a while now in vendor benchmarking activity. In contrast, vdbench is a relative newcomer, but I have worked with its author, Henk Vandenbergh, over many years and he is a consummate performance analyst. I look forward to seeing how Henk's vdbench matures over time.

Given the spec’s taxonomy and the fact that it lists online, near-online, removable media, virtual media and adjunct storage device categories with multiple sub-categories for each, we will focus only on the online family of storage and save the rest for later.

SPC energy efficiency measures

As my readers should recall, the Storage Performance Council (SPC) also has benchmarks that measure energy use with their SPC-1/E and SPC-1C/E reports (see our SPC-1 IOPS per Watt post).  The interesting part about SPC-1/E results is that there are definite IOPS levels where storage power use undergoes significant transitions.

One can examine an SPC-1/E Executive Summary report and see power use at various IO intensity levels, i.e., 100%, 95%, 90%, 85%, 80%, 50%, 10% and 0% (or idle), for a storage subsystem under test. SPC summarizes these detailed power measurements by defining profiles for "Low", "Medium" and "Heavy" storage system use. But the devil's often in the details, and having all the above measurements allows one to calculate whatever activity profile works best for you.
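As an example of rolling your own profile from the SPC-1/E detail data, here's a short sketch that weights per-intensity power readings by whatever duty cycle fits your shop. The wattage values below are placeholders, not numbers from any actual report:

```python
# Sketch: build a custom power profile from SPC-1/E per-intensity power readings.
# The wattages here are placeholders; substitute values from a real Executive Summary.

watts_at_intensity = {    # % of max SPC-1 IOPS -> average watts (placeholder data)
    100: 520.0, 95: 510.0, 90: 500.0, 85: 492.0,
     80: 485.0, 50: 450.0, 10: 420.0,  0: 400.0,
}

# Your own duty cycle: hours per day spent at each intensity level (must total 24).
my_profile_hours = {80: 8, 50: 10, 10: 4, 0: 2}

avg_watts = sum(watts_at_intensity[lvl] * hrs
                for lvl, hrs in my_profile_hours.items()) / 24
print(f"Average draw: {avg_watts:.0f} W, ~{avg_watts * 24 / 1000:.1f} kWh/day")
```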

Unfortunately, only a few SPC-1/E reports have been submitted to date and it has yet to take off.

SNIA alternative power efficiency metrics

Enter SNIA's Emerald Program, which is supposed to be an easier and quicker way to measure storage power use. In addition to the specification, SNIA has established a website (see above) to hold SNIA-approved storage power efficiency results, and a certification program for auditors who can verify that vendor power efficiency testing meets all specification requirements.

What’s missing from the present SNIA power efficiency test specification are the following:

  • More strict IOPS level definitions – the specification refers to IO intensity but doesn't provide an adequate definition from my perspective. It says that subsystem response time cannot exceed 30msec and uses this to define 100% IO intensity for the workloads. However, given this definition it could apply to random read, random write, or mixed workloads, and there is no separate specification for sequential versus random (and/or mixed) workloads. This could be tightened up.
  • More IO intensity levels measured – the specification calls for power measurements at an IO intensity of 100% for all workloads and 25% for the 70:30 R:W workload for online storage. However, we would also be interested in seeing 80% and 10%. From a user perspective, 80% probably represents a heavy sustainable IO workload and 10% looks like a complete cache-hit workload. We would only measure these levels for the "mixed workload" so as to minimize effort.
  • More write activity in "mixed workloads" – the specification defines the mixed workload as 70% read and 30% write random IO activity. Given today's O/S propensity to buffer read data, it would seem more prudent to use a 50:50 read-to-write mix.

Other items probably need more work as well, such as defining a standardized reporting format containing a detailed description of the HW and SW of the system under test, the benchmark driver HW and SW, a table reporting all power efficiency metrics, and inclusion of the full benchmark report with input parameter specifications and all outputs, etc., but these are nits.

Finally, SNIA's specification goes into much detail about capacity optimization testing, which includes things like compression, deduplication, thin provisioning, delta-snapshotting, etc., with the intent of measuring storage system power use when utilizing these capabilities. This is a significant and complex undertaking, defining how each of these storage features will be configured and used during power measurement testing. Although SNIA should be commended for their efforts here, it seems too much to take on at the start. We suggest that capacity optimization testing definitions be deferred to a later release and that the focus for now stay on the more standard storage power efficiency measurements.

—-

I critique specifications at my peril. Being wrong in the past has caused me to redouble efforts to ensure a correct interpretation of any specification. However, if there's something I have misconstrued or missed here that is worthy of note, please feel free to comment.

SPC-1/E IOPS per watt – chart of the month

SPC*-1/E IOPs per Watt as of 27Aug2010

There weren't a lot of Storage Performance Council (SPC) benchmark submissions this past quarter, just a new SPC-1/E from HP StorageWorks on their 6400 EVA with SSDs and a new SPC-1 run for the Oracle Sun StorageTek 6780. Recall that SPC-1/E executes all the same tests as SPC-1 but adds testing with power monitoring equipment attached to measure power consumption at a number of performance levels.

With this chart we take another look at storage energy consumption (see my previous discussion on SSD vs. drive energy use). As shown above, we graph IOPS/watt for three different usage environments: Nominal, Medium, and High, as defined by SPC. These are contrived storage usage workloads meant to measure the variability in power consumed by a subsystem. SPC defines the workloads as follows:

  • Nominal usage is 16 hours of idle time and 8 hours of moderate activity
  • Medium usage is 6 hours of idle time, 14 hours of moderate activity, and 4 hours of heavy activity
  • High usage is 0 hours of idle time, 6 hours of moderate activity and 18 hours of heavy activity

As for activity, SPC defines moderate activity as 50% of the subsystem's maximum SPC-1 reported performance and heavy activity as 80% of its maximum performance.
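To make the profile arithmetic concrete, here's a rough back-calculation of IOPS/watt from those hour mixes. The max IOPS and per-state wattages below are placeholders, not the actual HP or Xiotech results, and SPC's official computation may differ in its details:

```python
# Rough back-calculation of profile IOPS/watt from the SPC hour mixes.
# max_iops and the per-state wattages are placeholders, not actual results.

PROFILES = {              # hours per day of (idle, moderate, heavy) activity
    "Nominal": (16, 8, 0),
    "Medium":  (6, 14, 4),
    "High":    (0, 6, 18),
}

max_iops = 20_000.0                                          # placeholder SPC-1 max
watts = {"idle": 200.0, "moderate": 450.0, "heavy": 520.0}   # placeholder watts
iops = {"idle": 0.0, "moderate": 0.5 * max_iops, "heavy": 0.8 * max_iops}

for name, (h_idle, h_mod, h_heavy) in PROFILES.items():
    hours = {"idle": h_idle, "moderate": h_mod, "heavy": h_heavy}
    daily_ios = sum(iops[s] * hours[s] * 3600 for s in hours)    # IOs per day
    daily_wh = sum(watts[s] * hours[s] for s in hours)           # watt-hours per day
    avg_iops = daily_ios / (24 * 3600)
    avg_watts = daily_wh / 24
    print(f"{name:7s}: {avg_iops / avg_watts:5.1f} IOPS/watt")
```

Run with real numbers from the SPC-1/E reports, this makes it obvious why idle power dominates the Nominal profile, which is exactly what bites the SSD system below.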

With that behind us, on to the chart. The HP 6400 EVA had eight 73GB SSDs for storage, while the two Xiotech submissions had 146GB/15Krpm and 600GB/15Krpm drives with no flash. As expected, the HP SSD subsystem delivered considerably more IOPS/watt at the high usage workload: ~2X the Xiotech with 600GB drives and ~2.3X the Xiotech with 146GB drives. The multipliers were slightly less for moderate usage but still substantial nonetheless.

SSD nominal usage power consumption

However, the nominal usage bears some explanation.  Here both Xiotech subsystems beat out the HP EVA SSD subsystem at nominal usage with the 600GB drive Xiotech box supporting ~1.3X the IOPS/watt of the HP SSD system. How can this be?  SSD idle power consumption is the culprit.

The HP EVA SSD subsystem consumed ~463.1W at idle, while the Xiotech 600GB drive subsystem consumed only ~23.5W and the Xiotech 146GB drive subsystem ~23.4W. I would guess that the drives, and perhaps the Xiotech subsystems, have considerable power savings algorithms that shed power when idle. For whatever reason, the SSDs and HP EVA don't seem to have anything like this. So nominal usage, with its 16 hours of idle time, penalizes the HP EVA SSD system, resulting in the poor IOPS/watt for nominal usage shown above.

Ray's reading: SSDs are not meant to be idled a lot, and disk drives, especially the ones Xiotech is using, have very sophisticated power management that maybe SSD vendors and/or HP should take a look at adopting.

The full SPC performance report will go up on SCI’s website next month in our dispatches directory.  However, if you are interested in receiving this sooner, just subscribe by email to our free newsletter and we will send you the current issue with download instructions for this and other reports.

As always, we welcome any suggestions on how to improve our analysis of SPC performance information so please comment here or drop us a line.