Tuesday, August 23, 2016

Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure

Ramesh Govindan (University of Southern California/Google)
Ina Minei (Google)
Mahesh Kallahalla (Google)
Bikash Koley (Google)
Amin Vahdat (Google)

Google runs three types of networks:
  1. Data center networks: built from merchant-silicon switches, with a logically centralized control plane
  2. B4: a software-defined WAN that supports multiple traffic classes and uses centralized traffic engineering
  3. B2: a global WAN for user-facing traffic that employs decentralized traffic engineering
Google strives to maintain high availability in these networks; a typical availability target is no more than a few minutes of downtime per month. Scale, evolution velocity, and management complexity make that target hard to meet. Despite these challenges, Google delivers high availability in its networks and services.
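To make the "few minutes of downtime per month" target concrete, the short sketch below converts between a downtime budget and an availability fraction. The function names and the 30-day month are assumptions chosen for illustration; they do not come from the paper.

```python
# Illustrative arithmetic only: the 30-day month and both helper names
# are assumptions, not definitions taken from the paper.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def availability(downtime_minutes: float,
                 period_minutes: float = MINUTES_PER_30_DAY_MONTH) -> float:
    """Return the fraction of the period during which the network was up."""
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime must lie between 0 and the period length")
    return 1.0 - downtime_minutes / period_minutes


def downtime_budget(target: float,
                    period_minutes: float = MINUTES_PER_30_DAY_MONTH) -> float:
    """Return the minutes of downtime allowed by an availability target."""
    if not 0 <= target <= 1:
        raise ValueError("target must be a fraction between 0 and 1")
    return (1.0 - target) * period_minutes


if __name__ == "__main__":
    print(f"4 minutes down   -> {availability(4):.6%} available")
    print(f"99.99% available -> {downtime_budget(0.9999):.2f} minutes of budget")
```

Under these assumptions, a 99.99% target leaves a budget of about 4.3 minutes per month, so a single failure lasting tens of minutes would use up several months of budget.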

However, failures do occur, and Google documents them in post-mortem reports and root-cause analyses. In this paper the authors analyze 103 such reports and find that the failure rate is comparable across all three networks. The failures occur in the data, control, and management planes. About 80% of the failures last between 10 and 100 minutes, and nearly 90% have high impact, i.e., high packet loss or blackholing of entire data centers or parts thereof. Interestingly, when most of these failures occur, a management operation is found to be in progress in the vicinity.

They classify these failures by root cause and find that, for each of the data, control, and management planes, the failures trace back to a handful of categories, such as risk assessment failures, lack of consistency between control plane components, device resource overruns, link flaps, and incorrectly executed management operations. The paper gives examples of each.

Finally, they discuss high-availability design principles drawn from these failures. First, defense in depth: failures must be detected and handled across different layers and planes of the network, which can be achieved by containing the failure radius and developing fallback strategies such as a "red button" (software controls that let an operator trigger an immediate fallback). Second, fail open, which preserves the data plane when the control plane fails. Third, maintaining consistency across the data, control, and management planes, which helps ensure safe network evolution. Fourth, careful risk assessment, testing, and a unified management plane, which can prevent or avoid failures. Fifth, fast recovery, which is not possible without high-coverage monitoring systems and techniques for root-cause analysis.
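As a rough illustration of the fail-open principle, the toy sketch below keeps forwarding on the last-known-good table when the controller cannot be reached. Everything in it (the `FailOpenSwitch` class, the `fetch_table` callback, the use of `ConnectionError` to signal an unreachable controller) is hypothetical and is not meant to describe how Google's switches are implemented.

```python
from typing import Callable, Dict, Optional

# Hypothetical representation: destination prefix -> next hop.
ForwardingTable = Dict[str, str]


class FailOpenSwitch:
    """Toy switch agent that preserves its data plane on control plane failure."""

    def __init__(self, fetch_table: Callable[[], ForwardingTable]) -> None:
        # fetch_table stands in for a call to the controller (an assumption).
        self._fetch_table = fetch_table
        self._table: ForwardingTable = {}
        self.control_plane_healthy = False

    def sync(self) -> None:
        """Try to refresh forwarding state from the controller."""
        try:
            new_table = self._fetch_table()
        except ConnectionError:
            # Fail open: keep the last-known-good table instead of clearing it.
            self.control_plane_healthy = False
            return
        self._table = dict(new_table)
        self.control_plane_healthy = True

    def next_hop(self, prefix: str) -> Optional[str]:
        """Look up the next hop for a prefix, or None if no entry exists."""
        return self._table.get(prefix)


def make_flaky_controller() -> Callable[[], ForwardingTable]:
    """Return a fake controller that answers once and then becomes unreachable."""
    state = {"calls": 0}

    def fetch() -> ForwardingTable:
        state["calls"] += 1
        if state["calls"] > 1:
            raise ConnectionError("controller unreachable")
        return {"10.0.0.0/8": "port1"}

    return fetch


if __name__ == "__main__":
    switch = FailOpenSwitch(make_flaky_controller())
    switch.sync()  # succeeds and installs the table
    switch.sync()  # controller is down, table is kept
    print(switch.control_plane_healthy, switch.next_hop("10.0.0.0/8"))
```

The trade-off in such a design is that stale state may keep forwarding traffic after the topology has changed, which is one reason to pair fail open with the monitoring and fallback mechanisms listed above.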

They conclude that by applying these principles, together with the idea that the network should be continuously and incrementally evolved, they have managed to increase the availability of their networks even as the networks' scale and complexity have grown manyfold.


Q1) Is there a place in this work for formal methods?
A) Our paper discusses this: formal methods
   have not yet been applied to networks at large scale. However, there is
   recent work on data plane and control plane verification, so there is
   room for work on formal methods.


Q2) Has Google ever pressed the red button?
A) The answer is beyond my pay grade.