The At Scale conference happened this past week in LA. Jay Parikh, Global Head of Engineering and Infrastructure at Facebook, kicked off the conference by talking about how Facebook is attempting to conquer some of its intrinsic problems as it scales up from over 1B users today. I was unable to attend the conference but watched a video of the keynote (on Facebook, of course).
The At Scale community is a group of large, hyper-scale web companies such as Google, Microsoft, Twitter, and of course Facebook, among a gaggle of others, that all have problems trying to scale up their infrastructure to handle more and more user activity. They had 1800 people registered for the At Scale 2015 conference on Monday, double last year’s count. The At Scale community is trying to push the innovation level of the industry faster, through a community of companies that need to work at hyper-scale.
Facebook’s video problem
At Facebook, the current hot problem impacting customer satisfaction seems to be video uploads and playback (downloads). The issues with Facebook’s video experience are multifaceted and range from the time it takes to successfully upload a video, to the bandwidth it takes to play back a video, to the system requirements for live streaming video to hundreds of thousands of users.
Facebook started as a text-only service and migrated to a photo-oriented service, but it is now quickly moving to a video-oriented user experience. And it doesn’t stop there: they can see on the horizon that augmented and virtual reality will become a significant driver of activity for Facebook users.
Daily video views were at 1B last year and are now at 4B per day. They also recently launched a new service, LiveMentions, a live streaming service for celebrities (real time video streams). Several celebrities were live streaming to 150K of their subscribers. So video has become, and will continue to be, the main consumer of bandwidth at Facebook.
Struggling to enhance the Facebook user’s video experience over the past year, they have come up with three key engineering principles that have helped them: Planning, Iteration and Performance.
Facebook is already operating a terabit-scale network, so getting a change to that network wrong is going to cause major problems around the world. As a result, Facebook engineering focused early on incorporating lots of instrumentation into their network and infrastructure services. This has allowed them to constantly monitor the activity of their users across their infrastructure to identify problems and solutions.
One metric Parikh talked about was “playback success rate”: the percentage of playbacks where the video starts to play in under 1 second for a Facebook user. One chart he showed was playback success rate colored over a world map, aggregated (averaged) at the country level. But with their instrumentation, Facebook was able to drill down to regions within a country and even cities within a region. This allows engineering to identify problems at almost any level of granularity they need.
One key takeaway on Planning is that if you have the instrumentation in place, have people to monitor and mine the data, and are willing to address the problems that crop up, then you can create a more flexible, efficient and effective environment and build a better product for your users.
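To make the drill-down idea concrete, here is a minimal sketch of how a playback success rate could be computed at country, region or city granularity from raw playback events. The event fields (country, region, city, start_delay_s), the sample data and the function name are all my own assumptions for illustration; the talk did not describe Facebook’s actual data model or tooling.

```python
from collections import defaultdict

# Hypothetical playback events; a start_delay_s of None means the video never started.
EVENTS = [
    {"country": "IN", "region": "MH", "city": "Mumbai", "start_delay_s": 0.5},
    {"country": "IN", "region": "MH", "city": "Pune", "start_delay_s": 2.0},
    {"country": "IN", "region": "KA", "city": "Bengaluru", "start_delay_s": 0.8},
    {"country": "IN", "region": "KA", "city": "Bengaluru", "start_delay_s": None},
    {"country": "US", "region": "CA", "city": "Los Angeles", "start_delay_s": 0.3},
]


def playback_success_rate(events, level, threshold_s=1.0):
    """Return {group_key: fraction of playbacks that started in under threshold_s}.

    `level` is a tuple of field names to group by, e.g. ("country",) or
    ("country", "region", "city"), so one function serves every granularity.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for event in events:
        key = tuple(event[field] for field in level)
        totals[key] += 1
        delay = event["start_delay_s"]
        if delay is not None and delay < threshold_s:
            successes[key] += 1
    return {key: successes[key] / totals[key] for key in totals}


by_country = playback_success_rate(EVENTS, ("country",))
by_city = playback_success_rate(EVENTS, ("country", "region", "city"))
```

With this sample data, the country-level average for "IN" hides the fact that one city is doing fine while another never meets the 1 second target, which is the kind of detail a country-level map would smooth over.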
Iteration is not just about feature deployment, but it’s also about the Facebook user experience. Their instrumentation had told them that they were doing OK on video uploads, but when they looked at the details, they saw that some customers were not having a satisfactory video upload experience. For instance, one Facebook engineer had to wait 82 hours to upload a video.
The Facebook world is populated with tens of thousands of unique devices with different memory, compute and storage. They had to devise approaches that could optimize the encoding for all the different devices, some of which was done on mobile phones.
They also had to try to optimize the network stack for different devices and mobile networking technologies. Parikh had another map showing network connectivity. Surprise: most of the world is not on LTE, and the vast majority of the world is on 2G and 3G cellular networks. So via iteration Facebook went about improving video upload by 1% here and 1% there, but with Facebook’s user base, these improvements impact millions of users. They used cross-functional teams to address the problems they uncovered.
However, video upload problems were not just in the device and connectivity realms. It turns out they had a big cancel upload button on the screen after the start of a video upload. This was sometimes clicked by mistake, and they found that almost 10% of users hit the cancel upload button. So they went through and re-examined the whole user experience to try to eliminate other hindrances to successful video uploads.
The key takeaway from the Performance segment of the talk was that performance has to be considered from the get-go of a new service or service upgrade. It is impossible to improve performance after the fact, especially for At Scale environments.
In my CS classes, the view was: make it work, and then make it work fast. What Facebook has found is that you never have the time after a product has shipped to make it fast. As soon as it works, they have to move on to the next problem.
As a result, if performance is not built in from the start, as a critical requirement/feature of the system architecture and design, it never gets addressed. Also, if all you focus on is making it work, then the design and all the code are built around feature functionality. Changing working functionality later to improve performance is an impossible task, because it typically means a re-architecture/re-design/re-implementation of the functionality.
For instance, Facebook used to do video encoding serially on a single server. It often took a long time (10 to 30 minutes). Engineering reimplemented their video encoding to partition the video and distribute the encoding across multiple servers. Doing this sped up encoding time considerably.
But they didn’t stop there. With such a diverse user networking environment, they felt that they could save bandwidth and better optimize user playback if they could reduce playback video size. They were able to take the machine learning/AI investments that Facebook has made and apply them to distributed video encoding. They were able to analyze the video scene by scene and opportunistically reduce bandwidth load and storage size but still maintain video playback quality. By implementing the new video encoding process they have achieved double digit reductions in bandwidth requirements for playback.
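As a rough illustration of the partition-and-distribute approach, here is a small sketch that splits a video into chunks, encodes the chunks in parallel and stitches the results back together in order. It is not Facebook’s implementation: the encode_chunk function is a trivial stand-in for a real encoder, a thread pool on one machine stands in for multiple servers, and the chunk size and function names are assumptions of mine.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(frames, chunk_size):
    """Partition the frame list into consecutive chunks of at most chunk_size frames."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def encode_chunk(chunk):
    """Stand-in for a real encoder: each chunk is encoded independently of the others."""
    return [f"enc({frame})" for frame in chunk]


def encode_serial(frames):
    """The old approach: one worker encodes the whole video start to finish."""
    return encode_chunk(frames)


def encode_distributed(frames, chunk_size=4, workers=4):
    """Encode chunks in parallel, then concatenate them in their original order."""
    chunks = split_into_chunks(frames, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so no re-sorting is needed.
        encoded_chunks = list(pool.map(encode_chunk, chunks))
    return [item for part in encoded_chunks for item in part]


frames = [f"frame{i}" for i in range(10)]
result = encode_distributed(frames, chunk_size=4, workers=3)
```

The design point is that the chunks must be encodable independently and reassembled in order; with a real codec that generally means cutting the video at points where no chunk depends on frames in another chunk.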
Another example of the importance of performance was the LiveMentions feature discussed above. Celebrities often record streams in places with poor networking infrastructure. So in order to ensure a good streaming experience, Facebook had to implement variable bit rate video upload to adjust upload bandwidth requirements based on the networking environment. Moreover, once a celebrity starts a live stream, all the fans in the world get notified. Then there’s a thundering herd (boot storms, anyone?) to start watching the video stream. In order to support this mass streaming, Facebook implemented stream blocking, which holds off the start of live stream viewing until they have cached enough of the video stream at their edge servers, worldwide. This guaranteed that all the fans had a good viewing experience once it started.
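The two ideas in this example can be sketched in a few lines. The code below is only my guess at the general shape, not anything shown in the keynote: the bitrate ladder, the 80% headroom factor, the buffering threshold and the names pick_upload_bitrate and EdgeStreamGate are all hypothetical.

```python
def pick_upload_bitrate(measured_kbps, ladder_kbps=(250, 500, 1000, 2500), headroom=0.8):
    """Variable bit rate upload: choose the highest bitrate that fits the measured uplink.

    Only a fraction (headroom) of the measured bandwidth is budgeted, and the
    lowest rung is used when nothing fits. All values are assumed, in kbps.
    """
    budget = measured_kbps * headroom
    fitting = [rate for rate in ladder_kbps if rate <= budget]
    return max(fitting) if fitting else min(ladder_kbps)


class EdgeStreamGate:
    """Stream blocking at one edge server: hold viewers until enough video is cached."""

    def __init__(self, min_buffered_s=10.0):
        self.min_buffered_s = min_buffered_s  # assumed threshold, in seconds of video
        self.buffered_s = 0.0
        self.waiting = []
        self.is_open = False

    def request_playback(self, viewer_id):
        """Return True if the viewer may start now; otherwise queue the viewer."""
        if self.is_open:
            return True
        self.waiting.append(viewer_id)
        return False

    def on_segment_cached(self, duration_s):
        """Record a newly cached segment; return the viewers released by it, if any."""
        self.buffered_s += duration_s
        if not self.is_open and self.buffered_s >= self.min_buffered_s:
            self.is_open = True
            released, self.waiting = self.waiting, []
            return released
        return []


gate = EdgeStreamGate(min_buffered_s=6.0)
early_viewer_started = gate.request_playback("fan-1")  # herd arrives before cache is warm
```

In this sketch the thundering herd is absorbed by the waiting list: early fans are held for a few seconds and then all released against video that is already cached at the edge, rather than each one pulling the stream from the origin.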
There were a couple more videos of the show sessions but I didn’t have time to review them. But Facebook sounds like a fun place to work, especially for infrastructure performance experts.