Learning to live with lattices or say goodbye to security

safe 'n green by Robert S. Donovan (cc) (from flickr)

I read an article the other day in Quanta Magazine, A Tricky Path to Quantum-Safe Encryption, about the problems that will hit current public key cryptography (PKC) schemes when quantum computing emerges over the next five to 30 years. With advances in quantum computing, our current PKC schemes, which depend on the difficulty of factoring large numbers, will be readily crackable. At that time, all current encrypted traffic used by banks, the NSA, the internet, etc. will no longer be secure.

NSA, NIST, & ETSI looking at the problem

So there’s a search on for quantum-resistant cryptography (see this release from ETSI [European Telecommunications Standards Institute], this presentation from NIST [{USA} National Institute of Standards & Technology], and this report from Schneier on Security on NSA’s [{USA} National Security Agency] plans for a post-quantum world). There are a number of alternatives being examined by all these groups, but the most promising at the moment depends on multi-dimensional (100s of dimensions) mathematical lattices.

Lattices?

According to Wikipedia, a lattice is a regularly spaced grid of points, easiest to picture in 2 or 3 dimensions. Apparently, for security reasons, they had to increase the number of dimensions significantly beyond 3.

A secret is somehow inscribed in a route (vector) through this 500-dimensional lattice between two points: an original point (the public key) in the lattice and another arbitrary point, somewhere nearby in the lattice. The hard problem, in the cryptographic sense, is that finding the route in a 500-dimensional lattice is a difficult task when you only have one of the points.
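To make the idea a bit more concrete, here is a toy 2-dimensional sketch. It assumes the textbook picture of lattice schemes, in which the private key is a "good" (short, nearly perpendicular) basis and the public key is a "bad" (long, skewed) basis for the same lattice. That picture, the function name and all the numbers are my own illustration, not something the article spells out. Rounding a slightly perturbed point back to the lattice works with the good basis and fails with the bad one.

```python
from fractions import Fraction

def babai_round(basis, target):
    """Round `target` to a lattice point using a 2x2 basis (rows are basis vectors)."""
    (a, b), (c, d) = basis
    det = a * d - b * c
    tx, ty = target
    # Solve target = x*(a, b) + y*(c, d) exactly, then round the coefficients.
    x = Fraction(tx * d - ty * c, det)
    y = Fraction(ty * a - tx * b, det)
    xi, yi = round(x), round(y)
    return (xi * a + yi * c, xi * b + yi * d)

# Hypothetical private key: short, nearly perpendicular basis vectors.
good_basis = ((4, 1), (-1, 5))
# Hypothetical public key: same lattice, but long and skewed vectors
# (rows are 3*g1 + 5*g2 and 4*g1 + 7*g2, a determinant-1 change of basis).
bad_basis = ((7, 28), (9, 39))

secret_point = (5, 17)   # 2*(4, 1) + 3*(-1, 5), a point in the lattice
target = (6, 16)         # the secret point nudged by a small error (1, -1)

print(babai_round(good_basis, target))  # (5, 17): the secret point is recovered
print(babai_round(bad_basis, target))   # (1, -5): a lattice point, but the wrong one
```

In 2 dimensions anyone could brute-force the nearby points. The security argument rests on this search becoming infeasible in hundreds of dimensions.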

But can it be efficient for digital computers of today to use?

So the various security groups have been working on devising efficient algorithms for multi-dimensional public key encryption over the past decade or so. But they have run into a problem.

Originally, the (public) keys for a 500-dimensional lattice PKC were on the order of MBs, so they restricted the lattice computations to use smaller keys, in effect reducing the complexity of the underlying lattice. But in the process they reduced the security of the lattice PKC scheme. So they are having to go back to longer keys and more complex lattices, and are trying to ascertain which approach leaves communications secure but is efficient enough to run on the digital computers and communications links of today.
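A quick back-of-the-envelope check shows why such keys land in the MB range. The sketch below assumes the public key is a full 500 x 500 basis matrix stored at 4 bytes per entry. Both the layout and the entry size are my assumptions, and real schemes differ.

```python
def basis_size_bytes(dimensions, bytes_per_entry):
    """Size of a square basis matrix: dimensions x dimensions entries."""
    return dimensions * dimensions * bytes_per_entry

# Assumed: 500 dimensions, 4 bytes per matrix entry.
size = basis_size_bytes(500, 4)
print(f"{size:,} bytes = {size / 1_000_000} MB")
```

Because the size grows with the square of the dimension, halving the dimension cuts the key to a quarter, which is presumably why shrinking the lattice was so tempting.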

Quantum computing

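The problem is that quantum computers provide a much faster way to perform certain calculations, like factoring a number. Shor’s algorithm lets a quantum computer factor a number in time polynomial in its number of digits, an exponential speedup over the best known methods on the digital computers of today. (The square-root speedup often quoted for quantum computing comes from Grover’s search algorithm, not from factoring.)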

It’s possible that similar quantum computing calculations for lattice routes between points could also be sped up by an equivalent factor. So even when we all move to lattice-based PKC, it’s still possible for quantum computers to crack the code; hopefully, it just takes longer.

So the mathematics behind PKC will need to change over the next 5 years or so as quantum computing becomes more of a reality. The hope is that this change will keep our communications secure, at least until the next revolution in computing comes along or quantum computing becomes even faster than envisioned today.

Comments?

At Scale conference keynote, Facebook video experience re-engineered

The At Scale conference happened this past week in LA. Jay Parikh, Global Head of Engineering and Infrastructure at Facebook, kicked off the conference by talking about how Facebook is attempting to conquer some of its intrinsic problems as it scales up from over 1B users today. I was unable to attend the conference but watched a video of the keynote (on Facebook of course).

The At Scale community is a group of large, hyper-scale web companies such as Google, Microsoft, Twitter, and of course Facebook, among a gaggle of others, that all have problems trying to scale up their infrastructure to handle more and more users’ activities. They had 1800 people registered for the At Scale 2015 conference on Monday, double last year’s count. The At Scale community is trying to push the innovation level of the industry faster, through a community of companies that need to work at hyper-scale.

Facebook’s video problem

At Facebook the current hot problem that’s impacting customer satisfaction seems to be video uploads and playback (downloads). The issues with Facebook’s video experience are multifaceted and range from the time it takes to successfully upload a video, to the bandwidth it takes to play back a video, to the system requirements to support live streaming video to 100,000s of users.

Facebook started as a text-only service, migrated to a photo-oriented service, and is now quickly moving to a video-oriented user experience. But it doesn’t stop there. They can see on the horizon that augmented and virtual reality will become a significant driver of activity for Facebook users?!

Daily video views were at 1B last year and are now at 4B video views/day. They also recently launched a new service, LiveMentions, a live streaming service for celebrities (real time video streams). Several celebrities were live streaming to 150K of their subscribers. So video has become, and will continue to be, the main consumer of bandwidth at Facebook.

Struggling to enhance the Facebook user’s video experience over the past year, they have come up with three key engineering principles that have helped them: Planning, Iteration and Performance.

Planning

Facebook is already operating a terabit scale network, so doing something wrong to its network is going to cause major problems around the world. As a result, Facebook engineering focused early on incorporating lots of instrumentation in their network and infrastructure services. This has allowed them to constantly monitor the activity of their users across their infrastructure to identify problems and solutions.

One metric Parikh talked about was “playback success rate”, the percentage of playbacks where the video starts to play in under 1 second for a Facebook user. One chart he showed was a playback success rate colored over a world map but aggregated (averaged) at the country level. But with their instrumentation, Facebook was able to drill down to regions within a country and even cities within a region. This allows engineering to identify problems at almost any level of granularity they need.
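As a rough sketch of what that drill-down could look like, the snippet below computes playback success rate at the country, region or city level from a handful of made-up events. The event layout, the sample data and the function name are all hypothetical. Only the "starts in under 1 second" definition comes from the talk.

```python
from collections import defaultdict

# Hypothetical playback events: (country, region, city, seconds until playback started)
EVENTS = [
    ("US", "CA", "Los Angeles", 0.6),
    ("US", "CA", "Los Angeles", 1.4),
    ("US", "CO", "Denver", 0.8),
    ("IN", "MH", "Mumbai", 2.1),
    ("IN", "MH", "Mumbai", 0.9),
    ("IN", "KA", "Bengaluru", 0.7),
]

def playback_success_rate(events, level, threshold_s=1.0):
    """Percent of playbacks that started in under `threshold_s` seconds.

    level: 1 = country, 2 = country + region, 3 = country + region + city.
    """
    succeeded = defaultdict(int)
    total = defaultdict(int)
    for *location, start_time in events:
        key = tuple(location[:level])
        total[key] += 1
        if start_time < threshold_s:
            succeeded[key] += 1
    return {key: 100.0 * succeeded[key] / total[key] for key in total}

for level in (1, 2, 3):
    print(level, playback_success_rate(EVENTS, level))
```

In this made-up data both countries average about 67%, yet the city-level view shows Denver and Bengaluru at 100% and Los Angeles and Mumbai at 50%, which is the kind of detail a country-level average hides.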

One key takeaway from Planning is that if you have the instrumentation in place, have people to monitor and mine the data, and are willing to address the problems that crop up, then you can create a more flexible, efficient and effective environment and build a better product for your users.

Iteration

Iteration is not just about feature deployment, but it’s also about the Facebook user experience. Their instrumentation had told them that they were doing ok on video uploads but it turns out that when they looked at the details, they saw that some customers were not having a satisfactory video upload experience. For instance, one Facebook engineer had to wait 82 hours to upload a video.

The Facebook world is populated with 10s of thousands of unique devices with different memory, compute and storage. They had to devise approaches that could optimize the encoding for all the different devices, some of which was done on mobile phones.

They also had to try to optimize the network stack for different devices and mobile networking technologies. Parikh had another map showing network connectivity. Surprise, most of the world is not on LTE, and a vast majority of the world is on 2G and 3G cellular networks. So via iteration Facebook went about improving video upload by 1% here and 1% there, but with Facebook’s user base, these improvements impact millions of users. They used cross-functional teams to address the problems they uncovered.

However, video upload problems were not just in the device and connectivity realms. It turns out they had a big cancel upload button on the screen after the start of a video upload. This was sometimes clicked by mistake, and they found that almost 10% of users hit cancel upload. So they went through and re-examined the whole user experience to try to eliminate other hindrances to successful video uploads.

Performance

The key take away from this segment of the talk was that performance has to be considered from the get go of a new service or service upgrade. It is impossible to improve performance after the fact, especially for At Scale environments.

In my CS classes, the view was make it work and then make it work fast.  What Facebook has found is that you never have the time after a product has shipped to make it fast. As soon as it works, they had to move on to the next problem.

As a result, if performance is not built in from the start, as a critical requirement/feature of a system architecture and design, it never gets addressed. Also, if all you focus on is making it work, then the design and all the code are built around feature functionality. Changing working functionality later to improve performance is an impossible task and typically represents a re-architecture/re-design/re-implementation of the functionality.

For instance, Facebook used to do video encoding serially on a single server. It often took a long time (10 to 30 minutes). Engineering reimplemented their video encoding to partition the video and distribute the encoding across multiple servers. Doing this sped up encoding time considerably.
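The partition-and-distribute idea can be sketched in a few lines. This is only an illustration under loud assumptions: zlib compression stands in for a real video encoder, a thread pool stands in for the multiple servers, and every name here is hypothetical rather than anything Facebook described.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def split_video(video, segment_size):
    """Partition the raw video bytes into fixed-size segments."""
    return [video[i:i + segment_size] for i in range(0, len(video), segment_size)]

def encode_segment(segment):
    """Stand-in for a real encoder: each segment is encoded independently."""
    return zlib.compress(segment)

def encode_distributed(video, segment_size, workers=4):
    """Encode all segments in parallel; pool.map returns results in input order."""
    segments = split_video(video, segment_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_segment, segments))

def decode(encoded_segments):
    """Reassemble the original bytes from the ordered encoded segments."""
    return b"".join(zlib.decompress(segment) for segment in encoded_segments)

video = bytes(range(256)) * 1000          # 256,000 bytes of fake "video"
encoded = encode_distributed(video, segment_size=64_000)
print(len(encoded), "segments encoded")
print("round trip ok:", decode(encoded) == video)
```

The key design point is that segments must be independently encodable and reassembled in order. A real encoder would also have to cut at sensible boundaries (such as scene or key-frame boundaries), which this sketch ignores.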

But they didn’t stop there. With such a diverse user networking environment, they felt that they could save bandwidth and better optimize user playback if they could reduce playback video size. They were able to take the machine learning/AI investments that Facebook has made and apply them to distributed video encoding. They were able to analyze the video scene by scene and opportunistically reduce bandwidth load and storage size while still maintaining video playback quality. By implementing the new video encoding process, they have achieved double-digit reductions in bandwidth requirements for playback.

Another example of the importance of performance was the LiveMentions feature discussed above. Celebrities often record streams in places with poor networking infrastructure. So in order to ensure a good streaming experience, Facebook had to implement variable bit rate video upload to adjust upload bandwidth requirements based on the networking environment. Moreover, once a celebrity starts a live stream, all the fans in the world get notified. Then there’s a thundering herd (boot storms, anyone?) to start watching the video stream. In order to support this mass streaming, Facebook implemented stream blocking, which holds off the start of live stream viewing until they have cached enough of the video stream at their edge servers worldwide. This guaranteed that all the fans had a good viewing experience once it started.
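Here is a minimal sketch of those two ideas as I understood them from the keynote. The bitrate ladder, the headroom factor, the 6 second buffer threshold and all the names are my assumptions, not Facebook’s actual design.

```python
def pick_upload_bitrate(measured_kbps, ladder=(250, 500, 1000, 2500), headroom=0.8):
    """Pick the highest bitrate that fits within a fraction of the measured bandwidth."""
    usable = measured_kbps * headroom
    candidates = [rate for rate in ladder if rate <= usable]
    # Fall back to the lowest rung when even that exceeds the usable bandwidth.
    return max(candidates) if candidates else min(ladder)

class EdgeStreamGate:
    """Hold viewers back until an edge server has cached enough of the stream."""

    def __init__(self, min_buffered_s):
        self.min_buffered_s = min_buffered_s
        self.buffered_s = 0.0
        self.waiting = []

    @property
    def ready(self):
        return self.buffered_s >= self.min_buffered_s

    def viewer_arrives(self, viewer):
        """Return True if the viewer can start now, else queue them."""
        if self.ready:
            return True
        self.waiting.append(viewer)
        return False

    def segment_cached(self, duration_s):
        """Record a newly cached segment; return viewers released by it."""
        self.buffered_s += duration_s
        if self.ready and self.waiting:
            released, self.waiting = self.waiting, []
            return released
        return []

print(pick_upload_bitrate(1500))   # good connection
print(pick_upload_bitrate(200))    # poor connection

gate = EdgeStreamGate(min_buffered_s=6)
for fan in ("fan_a", "fan_b"):
    gate.viewer_arrives(fan)       # the thundering herd shows up early
for _ in range(3):
    released = gate.segment_cached(2.0)
print("released together:", released)
```

Holding the herd until the cache is warm means the origin serves each segment to the edge once, rather than to every fan at the same instant.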

There were a couple more videos of the show sessions but I didn’t have time to review them.  But Facebook sounds like a fun place to work, especially for infrastructure performance experts.

~~~~

Comments?

When 64 nodes are not enough

Why would VMware, with years of ESX development behind them, want to develop a whole new virtualization system for Docker and other container frameworks? Especially since they already have compatible Docker support in their current product line.

The main reason I can think of is that a 64 node cluster may be limiting to some container services, and VMware ESX/vSphere supporting 1000s of nodes in a single cluster seems pretty unlikely. So given that more and more cloud services are being deployed across 1000s of nodes using container frameworks, VMware had to do something or say goodbye to a potentially lucrative use case for virtualization.

Yes, over time VMware may indeed extend vSphere clusters to 128 or even 256 nodes, but by then the world will have moved beyond VMware for these services, and where will VMware be then? Left behind.

Photon to the rescue

With the new Photon system, VMware has an answer for anyone that needs 1000 to 10,000 server cluster environments. Now these customers can easily deploy their services on a VMware Photon Platform, which was developed off of ESX but doesn’t have any of the cluster limitations of ESX.

Thus, the need for Photon was now. Customers can easily deploy container frameworks that span 1000s of nodes. Of course it won’t be as easy to manage as a 64 node vSphere cluster, but it will be easily automated, easier to deploy and easier to scale when necessary, especially beyond 64 nodes.

The claim is that the new Photon will be able to support multiple container frameworks without modification.

So what’s stopping you from taking on the Amazons, Googles, and Apples of the world’s data centers?

  • Maybe storage, but then there’s ScaleIO, and the other software defined storage solutions that are there to support local DAS clusters spanning almost incredible sizes of clusters.
  • Maybe networking. I am not sure just where NSX is in the scheme of things; maybe it’s capable of handling 1000s of nodes and maybe not, but networking could be a clear limitation on how many nodes can be deployed in this sort of environment.

Where does this leave vSphere? Probably on a continuation of the current trajectory, making it easier and more efficient to run VMware clusters and over time extending any current limitations. So for the moment there are two development streams based off of ESX, each being enhanced for its own market.

How much of ESX survived is an open question, but it’s likely that Photon will never see the familiar VMware services and operations that are readily available to vSphere clusters.

Comments?

Photo Credit(s): A first look into Dockerfile system