When people hear "Netflix engineering" or "Netflix tech stack," they usually think of microservices, AWS, and continuous deployment. Those are certainly worthy of their fame, yet they are only one side of Netflix infrastructure. AWS services power the "dynamic" parts of the Netflix experience: API requests for personalized recommendations, title discovery, UI customization, and so on.
Another critical part, video streaming, is handled by an in-house CDN called Open Connect, which serves around 15% of the world's Internet traffic (and over a third of US downstream Internet traffic at peak). Open Connect today is a widely distributed network of thousands of servers around the world, well connected to Internet exchange locations and Internet service providers. It is one of the critical components of the Netflix streaming experience, and it is quite different from our operations in AWS.
There have been many excellent talks about both how Netflix operates in the cloud and how it scales video delivery, so this proposal is NOT about those topics.
This proposal is about a system that my team engineers and operates. It leverages the power of our globally distributed CDN infrastructure to better manage interactive API requests between Netflix clients and AWS endpoints. Its goal is to make network requests faster, more reliable, and easier to control. The main goal of the talk, however, is not the problems and solutions themselves (which would require a good bit of networking knowledge), but the principles we followed to design, implement, and deploy the system: the principles that allowed us to beat the speed of light and make the Internet faster for Netflix customers, while keeping reliability risk, on-call burden, and engineering cost to a minimum.
Quick context on the problem and the solution: while video traffic is served from an Open Connect location that is typically close to the client, dynamic API requests still go all the way to AWS data centers in North America and Europe. Even at the speed of light, it takes quite a long time to get from Australia to North America over the open Internet. So here is the gist of the solution: proxy API traffic through our Open Connect servers and use a number of smart transport, network, and routing tricks to make those requests faster.
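To make the gist concrete, here is a minimal sketch (illustrative Python, not Netflix code) of an edge proxy that terminates the client connection nearby and forwards API calls to a distant origin over a single, already-established connection, so the long-haul handshake is not paid on every request. ORIGIN_HOST, EDGE_PORT, and the EdgeProxy handler are hypothetical names used only for the example.

```python
# A toy edge proxy: clients connect to a nearby server, which relays their
# requests to a far-away origin over one reused ("warm") connection.
import http.client
from http.server import BaseHTTPRequestHandler, HTTPServer

ORIGIN_HOST = "example.com"   # stand-in for the distant AWS API endpoint
EDGE_PORT = 8080              # port this edge proxy listens on

# One persistent connection to the origin, reused across client requests,
# so the cross-ocean TCP/TLS handshake is not repeated per request.
origin = http.client.HTTPSConnection(ORIGIN_HOST, timeout=10)

class EdgeProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            origin.request("GET", self.path, headers={"Host": ORIGIN_HOST})
            resp = origin.getresponse()
            body = resp.read()
        except (http.client.HTTPException, OSError):
            # The warm connection died; drop it and retry once
            # (request() reconnects automatically after close()).
            origin.close()
            origin.request("GET", self.path, headers={"Host": ORIGIN_HOST})
            resp = origin.getresponse()
            body = resp.read()
        self.send_response(resp.status)
        self.send_header("Content-Type", resp.getheader("Content-Type", "text/plain"))
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Clients near this proxy would send API calls to http://<edge-host>:8080/...
    HTTPServer(("", EDGE_PORT), EdgeProxy).serve_forever()
```

In practice the wins come from much lower in the stack (connection reuse, transport tuning, and smarter routing between the edge and AWS), but the request flow is the same: client to nearby proxy to far-away origin.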
The idea seems simple enough, and after a quick primer on the fundamentals of network protocols and Internet architecture it makes total sense. However, making sure we have observability and control over this traffic is where things get scary (or exciting, depending on whom you ask).
Netflix API load is no joke: AWS servers receive millions of requests per second, and that is the traffic we would migrate over to our infrastructure. On its way to the closest AWS location, a request would start on one of a wide variety of Netflix devices (including 10-year-old Blu-ray players), proceed over the last mile (should I mention microwaves or concrete walls?) into an ISP network, and then, proxied via one of our thousands of CDN servers worldwide, hit an endpoint backed by hundreds of microservices in AWS.
We would have to use DNS, Anycast, and BGP to steer traffic in the network: all the good stuff that allows the Internet to survive a nuclear attack, but that was not built with observability and control in mind.
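As a toy illustration (again not Netflix code) of why DNS-based steering offers so little per-request control: the client simply resolves one name and connects to whatever address its resolver hands back. The hostname below is a stand-in, and pick_edge_address is a made-up helper.

```python
# From the client's point of view, steering is just a name lookup.
import socket

STEERING_HOSTNAME = "example.com"  # stand-in for a real steering hostname

def pick_edge_address(hostname: str) -> str:
    # The answer usually comes from the ISP's resolver and depends on
    # geography, Anycast, and BGP routing; two nearby clients can get
    # different proxies, and the operator only learns which after the fact.
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return infos[0][4][0]

if __name__ == "__main__":
    print("API traffic from this client would go to:", pick_edge_address(STEERING_HOSTNAME))
```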
With thousands of device types on one end of the wire, hundreds of microservices on the other, and the Internet in the middle, many, MANY things can (and almost certainly will) break at any moment.
When things break, finding and fixing the root cause is not a trivial task even if you fully control your system. In our case, many moving parts are outside of our control (ISP steering changes, OS updates, client releases, AWS service deployments, A/B tests), so we are in for a big challenge: troubleshooting becomes an exercise in finding a needle in a haystack while being blind, deaf, and one-handed (I could go on, but let's stop here). Needless to say (pun intended), when you run a service that sits in the middle of every app request, you had better be quick, because you will be the first to get called when things go south.
So what if I told you that our whole API acceleration system is developed and operated by a team of 3 engineers? That we have only a handful of critical alerts that can page us (and most weekly on-calls end without a single page)? And that despite all that, production traffic started to flow through this system after only about half a year of work, that we run dozens of network routing experiments at a time, and that we have scaled the system to handle millions of RPS of service-critical Netflix API requests?
This is a talk about the approach to system design, monitoring, and operations that makes all of this possible. The approach that gives us confidence that our system is functioning properly and, when things break, guides us through the sea of the unknown (or tells us that it's not our sea). It is about the tools we had to build or tune, the metrics we chose, the engineering practices we followed, and the mistakes we made along the way.
While our PROBLEM DOMAIN is quite unique and might apply to only a few companies, the fundamental PRINCIPLES that allow us to make big changes and still sleep well at night can be applied universally, in a legacy mega-corp and a disruptive startup alike. This talk will show how you can learn from our experience and apply the same principles in your own system, so you can focus on the evolution of your service and stop being horrified by an on-call week.
Program committee comment:
Do you want to sleep well at night? Come to this talk and learn principles that can be applied across very different companies and make your life easier.