At Flash Memory Summit (FMS 2016) this past week, Vijay Rao, Director of Technology Strategy at Facebook, gave a keynote session on some of the areas that Facebook is focused on for flash storage. One thing that stood out as a significant change of direction was a move to JBOFs in their datacenters.
As you may recall, Facebook was an early adopter of (FusionIO’s) server flash cards to accelerate their applications. But they are moving away from that technology now.
Insane growth at Facebook
Why? Vijay started his talk with some of the growth they have seen over the years in photos, videos, messages, comments, likes, etc. Each was depicted as an animated bubble chart, with a timeline on the horizontal axis and a growth measurement in % on the vertical axis, with the size of the bubble being the actual quantity of each element.
Although the user activity growth rates all started out small at different times and grew at different rates during their individual timelines, by the end of each animation they were all at almost 90-100% growth in 4Q15 (I assume this is a yearly growth rate, but I could be wrong).
Vijay had similar slides showing the growth of their infrastructure, i.e., compute, storage and networking. Although infrastructure grew less quickly than user activity (messages/videos/photos/etc.), all three showed similar trends and ended up (as far as I could tell) at ~70% growth.
Peak activity at Facebook is rare and precarious
Vijay next talked about what some of the peaks in activity looked like at Facebook. He had a number of charts which showed relatively flat, if not slightly increasing, daily volume of their user interactions during a 3-month period, and then a 10-100X peak in activity would come along, seemingly out of the blue. These peaks all coincided with major sporting events, terrorist attacks, natural disasters, etc.
The question, from Facebook’s perspective, is how you support an unpredictable but occasional 10-100X peak in activity without spending 100X or more on infrastructure to handle it. There’s little likelihood you can predict when these peaks will occur; you just know they will happen.
JBOFs to the rescue
A JBOF (just a bunch of flash) is a set of flash drives (SSDs, M.2s, etc.) that are joined together in one shelf (or group of shelves) that can connect to multiple hosts/servers.
Vijay didn’t go into the connection logic too much in his session but from the project documentation (Open Compute Project Lightning [deck]) and the demo units on the FMS 2016 floor, it’s using PCIe/NVMe connectivity between hosts and flash.
In the JBOF head, there’s a PCIe switch that connects all the flash individually to all the servers. There were two variants in the project deck, a 30X2 or 15X4 JBOF. The digit after the “X” indicates the number of PCIe lanes connected to each SSD, and the digits before the “X” indicate how many (30 or 15) SSDs were available in the JBOF.
A JBOF came in a 1U shelf, with SSDs in a horizontal configuration, in 3 rows of either 5 or 10 SSDs each. There were 2 JBOFs per enclosure.
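As a quick sanity check on that naming, here is a minimal Python sketch. It assumes the label is simply "<SSD count>X<lanes per SSD>" as described above; the function names are my own and not from the project deck. Under that reading, both variants present the same number of SSD-facing PCIe lanes, though this says nothing about how the switch itself is actually provisioned.

```python
# Hypothetical helpers (not from the Lightning deck): interpret a config
# label such as "30X2" as <number of SSDs>X<PCIe lanes per SSD>.

def parse_config(label):
    """Split a label like '30X2' into (ssd_count, lanes_per_ssd)."""
    ssds, lanes = label.upper().split("X")
    return int(ssds), int(lanes)

def ssd_facing_lanes(label):
    """Total PCIe lanes going to SSDs under the reading above."""
    ssds, lanes = parse_config(label)
    return ssds * lanes

for label in ("30X2", "15X4"):
    print(label, "->", ssd_facing_lanes(label), "lanes to SSDs")
```

Both labels come out to 60 lanes, which would be consistent with the two variants sharing the same switch and drive plane, but that last part is my inference rather than something stated in the session.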
The SSDs in the drive carrier could be 7mm 2.5″, 15mm 2.5″ or M.2.
Using M.2 carriers gives one the 30X2 JBOF configuration. There were at least 3 different flash/system vendors exhibiting at FMS 2016, with their own SSDs in Project Lightning JBOFs. It’s unclear whether you could mix the M.2 and non-M.2 SSD carriers in the same JBOF.
Not unlike JBODs, JBOFs provide direct access to all SSDs from all attached hosts. There didn’t appear to be any RAID or other data protection logic in the JBOF head or at the hosts. Their JBOF does require a PCIe retimer board at the host to deal with the PCIe clocking/timing issues. Each JBOF could be connected to up to 4 or, optionally, 8 hosts.
The rest of the JBOF included fans, power supplies and drive plane, PCIe switch and fan control boards. It didn’t seem to be as dense as it could have been (check out the drive carriers) but there may be thermal/power considerations to making it denser.
Facebook wants to independently scale compute and storage and not have to scale them together anymore. Doing this could provide a less expensive way to support their inevitable activity peaks.
I could foresee a number of ways to use JBOFs to manage activity peaks, some of which are static and others that are more dynamic. With a static allocation of, say, 3 out of 4 hosts using 3 SSDs each and the 4th host using 6 SSDs, they could direct peak workloads to the server with more SSDs.
For a more dynamic solution, they could just decide to evenly allocate the flash across the 4 hosts. But when peak activity hits, they could re-allocate and concentrate flash storage on those servers executing peak transactional activity.
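To make the two ideas concrete, here is a rough Python sketch of the bookkeeping only. Everything in it (host names, function names, the "keep one SSD per host" policy) is my own assumption for illustration. The session didn't cover how SSDs would actually be reassigned through the PCIe switch, and the sketch doesn't model that.

```python
# Illustrative only: models which host "owns" which SSD in a 15-SSD JBOF.
# Names and policies are hypothetical, not Facebook's method.

HOSTS = ["host0", "host1", "host2", "host3"]

# Static split from the text: three hosts with 3 SSDs, one with 6.
STATIC_COUNTS = {"host0": 3, "host1": 3, "host2": 3, "host3": 6}

def even_allocation(ssd_ids, hosts):
    """Spread SSDs round-robin across hosts (normal operation)."""
    alloc = {h: [] for h in hosts}
    for i, ssd in enumerate(ssd_ids):
        alloc[hosts[i % len(hosts)]].append(ssd)
    return alloc

def concentrate(alloc, hot_host, keep_per_host=1):
    """Return a new allocation moving all but keep_per_host SSDs
    from every other host to hot_host (peak operation)."""
    new_alloc = {h: list(ssds) for h, ssds in alloc.items()}
    for host, ssds in new_alloc.items():
        if host == hot_host:
            continue
        moved = ssds[keep_per_host:]
        del ssds[keep_per_host:]
        new_alloc[hot_host].extend(moved)
    return new_alloc

normal = even_allocation(list(range(15)), HOSTS)
peak = concentrate(normal, "host0")
print({h: len(s) for h, s in normal.items()})
print({h: len(s) for h, s in peak.items()})
```

With 15 SSDs and 4 hosts, the even split gives 4/4/4/3 SSDs per host, and concentrating on host0 while leaving one SSD on each of the others gives it 12. In practice the data on any moved SSD would also have to be something the receiving host can use, which is a much harder problem than the counting shown here.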
Consolidating (flash) storage and wanting to scale storage and compute independently is the same way that network storage emerged in the 1990s. Is this the start of a shift to networked storage in hyperscaler environments?
Photo Credit(s): From the Open Compute Project Lightning deck