

We are proud to present Nimble: the evolution of the Netflix failover architecture that makes region evacuation an order of magnitude faster. At Netflix, our goal is to be there for our customers whenever they want to come and watch their favorite shows. A lot of the work we do centers around making our systems ever more available, and averting or limiting customer-facing outages.

One of the most important tools in our toolbox is to route traffic away from an AWS region that is unhealthy. Because Netflix continues to grow quickly, we are now at a point where even short or partial outages affect many of our customers. So it’s critical that we are able to route traffic away from a region quickly when needed.

This article describes how we re-imagined region failover from what used to take close to an hour to less than 10 minutes, all while remaining cost neutral. The history of region evacuation at Netflix is captured in three prior articles. While traffic failovers have been an important tool at our disposal for some time, Nimble takes us to the next level by optimizing the way in which we use existing capacity to migrate traffic. As part of our project requirements, we wanted minimal changes to core infrastructure, no disruptions to work schedules, and no onerous maintenance requirements dropped on other engineering teams at the company.

When we set out on this journey, we began by breaking down the time it took then to do a traffic failover, about 50 minutes:
- 5 minutes to decide whether we would push the failover button or not. Failover operations took time, and the operations made enormous amounts of AWS EC2 mutations that could potentially confuse the state of a healthy region. As a result, a failover was a somewhat risky, not only slow, path to take.

- 3–5 minutes to provision resources from AWS. This included predicting necessary scale-up and then scaling destination regions to absorb the traffic. This was nontrivial: Netflix services autoscale following diurnal patterns of traffic. Our clusters were (and are) not overprovisioned to the point where they could absorb the additional traffic they would see if we failed traffic from another region to them. As a result, failover needed to include a step of computing how much capacity was required for each of the services in our ecosystem (a simplified sketch of this calculation follows the list).

- 25 minutes for our services to start up. Boot an AWS instance, launch a service, download resources required to operate, make backend connections, register with Eureka, apply any morphing changes specified through our Archaius configuration management, register with AWS ELBs… our instances have it tough, and we could only do so much to coax them into starting faster under threat of receiving traffic from other regions.

- 10 minutes or more to proxy our traffic to destination regions. To compensate for DNS TTL delays, we used our Zuul proxies to migrate traffic via back-end tunnels between regions (an illustrative routing sketch also follows). This approach also allowed us to gauge the “readiness” of a region to take traffic, because instances generally need some time to reach optimal operation (e.g., via JITing).
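To make the capacity-computation step in the 3–5 minute provisioning item above concrete, here is a minimal, illustrative sketch in Java. The class name, the per-service throughput figures, and the headroom factor are assumptions for illustration, not our actual capacity planner; the point is the shape of the calculation: project each service's traffic in the destination region after it absorbs the evacuated region's load, then work out how many additional instances that implies.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative only: estimates how many extra instances each service in a
 * destination region would need in order to absorb traffic evacuated from a
 * failed region. The names and the simple proportional model are assumptions,
 * not the actual capacity planner.
 */
public class FailoverCapacityEstimator {

    /** Requests per second a single instance of each service can sustain (assumed). */
    private final Map<String, Double> rpsPerInstance;
    /** Headroom factor (e.g., 1.25) so the destination is not scaled to 100% utilization (assumed). */
    private final double headroom;

    public FailoverCapacityEstimator(Map<String, Double> rpsPerInstance, double headroom) {
        this.rpsPerInstance = rpsPerInstance;
        this.headroom = headroom;
    }

    /**
     * @param currentRps   current traffic per service in the destination region
     * @param evacuatedRps traffic per service that will arrive from the evacuated region
     * @param currentSize  current instance count per service in the destination region
     * @return additional instances to provision per service before shifting traffic
     */
    public Map<String, Integer> additionalInstances(Map<String, Double> currentRps,
                                                    Map<String, Double> evacuatedRps,
                                                    Map<String, Integer> currentSize) {
        Map<String, Integer> toAdd = new HashMap<>();
        for (String service : currentRps.keySet()) {
            double projected = currentRps.get(service)
                    + evacuatedRps.getOrDefault(service, 0.0);
            double perInstance = rpsPerInstance.getOrDefault(service, 100.0);
            int required = (int) Math.ceil(projected * headroom / perInstance);
            int delta = Math.max(0, required - currentSize.getOrDefault(service, 0));
            toAdd.put(service, delta);
        }
        return toAdd;
    }
}
```

Because Netflix services autoscale following diurnal patterns, the inputs to a calculation like this have to reflect the load expected at failover time rather than a static baseline.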
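The 25-minute startup step is easier to picture as a boot sequence. The sketch below is illustrative only: the helper methods are hypothetical placeholders for the steps described above (downloading resources, warming backend connections, registering with discovery and load balancers), and the Archaius dynamic property shows one real mechanism for letting runtime-tunable configuration gate when an instance starts advertising itself as ready.

```java
import com.netflix.config.DynamicBooleanProperty;
import com.netflix.config.DynamicPropertyFactory;

/**
 * Illustrative startup sequence only. The helper methods are hypothetical
 * placeholders for the steps described in the breakdown above; the Archaius
 * dynamic property is the one real API shown, used to gate when the instance
 * reports itself as ready to take traffic.
 */
public class ServiceStartup {

    // Assumed property name; read dynamically so operators can flip it without a redeploy.
    private static final DynamicBooleanProperty READY_TO_SERVE =
            DynamicPropertyFactory.getInstance()
                    .getBooleanProperty("service.startup.readyToServe", false);

    public static void main(String[] args) {
        downloadRuntimeResources();     // hypothetical: fetch artifacts and data needed to operate
        establishBackendConnections();  // hypothetical: prime connection pools to dependencies
        applyDynamicConfiguration();    // hypothetical: apply Archaius-managed overrides
        registerWithDiscovery();        // hypothetical: mark the instance UP in Eureka and the ELB

        // Only report healthy once every step above has completed and the
        // operator-controlled flag allows it.
        if (READY_TO_SERVE.get()) {
            System.out.println("Instance is ready to take traffic.");
        }
    }

    private static void downloadRuntimeResources() { /* ... */ }
    private static void establishBackendConnections() { /* ... */ }
    private static void applyDynamicConfiguration() { /* ... */ }
    private static void registerWithDiscovery() { /* ... */ }
}
```

Every one of these steps sits on the critical path before a fresh instance can safely take failover traffic, which is why startup accounted for the largest share of the 50 minutes.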

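The final proxying step relied on Zuul to move traffic between regions over the back end rather than waiting out client DNS TTLs. The filter below is a hedged sketch in the style of a Zuul 1 routing filter, not our production implementation: the failover flag and destination endpoint are hypothetical stand-ins that would in practice be driven by dynamic configuration, but it illustrates the core idea of overriding the route host so that requests landing in an evacuated region are forwarded to a healthy one.

```java
import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;

import java.net.MalformedURLException;
import java.net.URL;

/**
 * Illustrative only: a Zuul 1-style routing filter that sends requests arriving
 * in an evacuated region to a destination region over the back end, sidestepping
 * client DNS TTLs. The failover flag and destination URL are hypothetical.
 */
public class CrossRegionFailoverFilter extends ZuulFilter {

    // Hypothetical toggle and target; in practice these would come from dynamic configuration.
    private static volatile boolean failoverActive = false;
    private static volatile String destinationRegionVip = "https://api.us-west-2.example.internal";

    @Override
    public String filterType() {
        return "route"; // run when Zuul decides where to send the request
    }

    @Override
    public int filterOrder() {
        return 10;
    }

    @Override
    public boolean shouldFilter() {
        // Only reroute while a failover is in progress.
        return failoverActive;
    }

    @Override
    public Object run() {
        RequestContext ctx = RequestContext.getCurrentContext();
        try {
            // Override the origin: proxy this request to the healthy region's endpoint.
            ctx.setRouteHost(new URL(destinationRegionVip));
        } catch (MalformedURLException e) {
            throw new RuntimeException("Bad destination region URL", e);
        }
        return null;
    }
}
```

Proxying through the back end this way also provides a window to watch the destination region warm up (e.g., JIT compilation) before traffic is fully committed to it.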