Surprises in disk reliability from Microsoft’s “free cooled” datacenters

At Usenix ATC’16 last week, there was a “best of the rest” session which repeated selected papers presented at FAST’16 earlier this year. One that caught my interest discussed disk reliability in free-cooled datacenters at Microsoft (Environmental conditions and disk reliability in free-cooled datacenters, see pp. 53-66).

The paper examines disk reliability for over 1M drives across 9 Microsoft datacenters, over periods of 1.5 to 4 years, as a function of how each datacenter was cooled.

How to cool datacenters

Free-cooled datacenters have become more fashionable of late because they reduce the energy needed to cool data centers. The paper classifies datacenter cooling styles into:

  • Chiller-based – which uses chilled water to carry heat out of the data center to the outside, where the heat is dissipated and the water is re-cooled via chillers and recirculated. These datacenters typically have a Power Usage Effectiveness (PUE) of 1.7.
  • Water-side economized – which improves the chiller-based approach by bypassing chillers (and turning them off) when cooling towers alone are sufficient to cool the water before it is returned to the datacenter. These datacenters have a PUE of 1.19.
  • Free-cooled – which uses large fans to blow cool, filtered outside air into the datacenter and similar fans to blow warm air back out, and thus has no chillers or cooling towers. When the outside air is too hot, evaporative cooling can be used to bring its temperature down; when the outside air is too cold, warm inside air can be recycled to warm it up. These datacenters have a PUE of 1.12.
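
To put those PUE figures in perspective, recall that PUE is total facility power divided by IT equipment power, so the non-IT overhead is the IT load times (PUE - 1). Here is a minimal Python sketch of that arithmetic. The PUE values are the ones quoted above; the 10 MW IT load and the function name `overhead_kw` are my own assumptions for illustration, not anything from the paper.

```python
# PUE values as quoted from the paper in the list above.
PUE_BY_COOLING = {
    "chiller-based": 1.7,
    "water-side economized": 1.19,
    "free-cooled": 1.12,
}


def overhead_kw(it_load_kw: float, pue: float) -> float:
    """Non-IT load (cooling, power distribution, etc.) implied by a PUE.

    PUE = total facility power / IT equipment power,
    so overhead = IT load * (PUE - 1).
    """
    return it_load_kw * (pue - 1.0)


if __name__ == "__main__":
    it_load_kw = 10_000.0  # hypothetical 10 MW IT load, not from the paper
    for style, pue in PUE_BY_COOLING.items():
        print(f"{style:>22}: {overhead_kw(it_load_kw, pue):8.0f} kW overhead")
```

For the same hypothetical 10 MW of IT load, the overhead drops from about 7 MW with chillers to about 1.2 MW with free cooling.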

The problem with free-cooled datacenters is that server equipment is subject to warmer temperatures and more variable relative humidity.

Microsoft hyper-scale datacenters

Data was gathered across 9 of Microsoft’s hyper-scale data centers: 2 were chiller cooled, 2 were water-side economized and the remaining 5 were free-cooled.

One of the free-cooled data centers was in a cool and dry (CD) climate and the 4 remaining free-cooled data centers were in hot and humid (HH) climates. The chart shows the temperature range and relative humidity range of different spots in the first hot and humid (HH1) datacenter.

The other interesting piece of information was that the (free-cooled) HH1 datacenter typically runs its servers cooler than the (chiller cooled) HD1 datacenter.

The bad news

Free-cooled hot and humid datacenters have much higher disk annual failure rates (AFR) than chiller or water-side economized datacenters. The paper didn’t discuss non-disk errors but did indicate that in HH1, disk errors accounted for 83% of all component failures.
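
For anyone who hasn’t worked with AFR, a common way to compute it is failures divided by drive-years in service, expressed as a percentage. The sketch below assumes that definition; the paper may normalize its numbers differently, and the function name and any inputs you feed it are hypothetical.

```python
def annual_failure_rate(failures: int, drive_days: float) -> float:
    """AFR as a percentage: failures per drive-year of operation.

    Assumes the common definition (failures / drive-years * 100);
    this is not necessarily how the paper computed its figures.
    """
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return 100.0 * failures / drive_years


if __name__ == "__main__":
    # Hypothetical fleet: 1,000 drives running a full year, 30 failures.
    print(f"AFR: {annual_failure_rate(30, 1000 * 365):.1f}%")
```

Counting drive-days rather than drives matters for a study like this one, because drives enter and leave the population at different times over the 1.5 to 4 year window.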

It appears from the data they have gathered that relative humidity is the main driver of these higher disk failures. Higher humidity seems to result in more disk controller and connectivity errors.

The good news

Disk drive location, with respect to air flow, can make a big difference in the relative humidity the drives encounter and, ultimately, in their AFR. They show that when disks sit behind the server electronics in the air flow, they encounter higher temperature but lower relative humidity.

These days most systems I see have disks in front and server electronics in the back. Cold air usually enters the front and exits the back. According to their research, this is the worst configuration for free-cooled data centers. Disk-server relative placement should be reversed, with the disks closer to the hot aisle and the server componentry closer to the cold aisle, if we want to reduce relative humidity exposure.

What this does to DIMM and other server component reliability is another question.

PUE, AFR and TCO

In their research (see first graphic above), they summarized TCO projections for chiller, water-side economized and free-cooled datacenters over 10, 15 and 20 year intervals, factoring in the increase in disk AFR for free-cooled. In every case, free-cooled datacenters have a lower relative TCO, even though they have a higher disk replacement rate due to worse AFR. This is due to the lower power and capital costs to build out and run free-cooled datacenters.
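
To get a feel for why lower power cost can outweigh a worse AFR, here is a rough Python sketch of the running-cost side of that trade-off. It is my own simplification, not the paper’s TCO model: it ignores capital cost entirely, and the function names are hypothetical. Any IT load, electricity price, disk count, disk price and AFR you plug in are your own assumptions.

```python
HOURS_PER_YEAR = 24 * 365


def energy_cost(it_load_kw: float, pue: float,
                usd_per_kwh: float, years: float) -> float:
    """Total facility energy cost: IT load scaled up by PUE."""
    return it_load_kw * pue * HOURS_PER_YEAR * usd_per_kwh * years


def disk_replacement_cost(n_disks: int, afr_pct: float,
                          usd_per_disk: float, years: float) -> float:
    """Expected spend on replacement disks at a constant AFR (in percent)."""
    return n_disks * (afr_pct / 100.0) * usd_per_disk * years


def running_cost(it_load_kw: float, pue: float, usd_per_kwh: float,
                 n_disks: int, afr_pct: float, usd_per_disk: float,
                 years: float) -> float:
    """Energy plus disk replacements; capital cost is deliberately left out."""
    return (energy_cost(it_load_kw, pue, usd_per_kwh, years)
            + disk_replacement_cost(n_disks, afr_pct, usd_per_disk, years))
```

Calling `running_cost` once with a chiller-style PUE and a lower AFR, and once with a free-cooled PUE and a higher AFR, shows how much extra disk replacement the energy savings can absorb under whatever prices you assume.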

~~~~

The paper had a lot more to say. Relative humidity causes corrosion, and over time, disks seem to be particularly vulnerable to this type of corrosion. Perhaps this is due to the extra vibration produced during seek activity (not something discussed in the paper).

I found the placement discussion very interesting, although the paper didn’t spend much time talking about it. Obviously, reversing the cold and hot aisles could have the same effect. But I wonder whether customers would be willing to have the hot aisle in front of the equipment.

Free-cooled datacenters are probably not yet as prevalent as the other types, but they are becoming more so, and as big companies build out more green-field datacenters they will become more prominent.

I do believe disk reliability still matters. Yes, AFA has taken over the storage world. But most data centers still have lots of disk, and hybrid storage arrays still represent the majority of storage that’s being shipped these days.

So, free-cooled datacenters can be cheap to build and run and, if you place your disk drives near the hot aisle, can be even cheaper to run.

Comments?

 

 
