Thursday, August 25, 2016

Robotron: Top-down Network Management at Facebook Scale

Presenter: Yu-Wei Eric Sung
Co-Authors: Xiaozheng Tie, Starsky H.Y. Wong, Hongyi Zeng

Managing a network the size of Facebook's involves tasks such as router and switch turn-up and circuit migration, and these tasks usually require a lot of human interaction. This is challenging because a single task may involve low-level configuration across distributed devices, and the network may span multiple domains and sub-networks with different versions across them. As a result, management errors can cause serious problems for users and significantly impact Facebook. The challenge grows as the network gets bigger.

Facebook's network is set up with multiple points of presence (POPs) near users, which may or may not be able to satisfy an incoming user request. If a POP cannot satisfy the request, it is sent over the backbone to a datacenter (DC), and the response is sent back. Management tasks in this kind of network include adding or migrating circuits or routers in the backbone, and building clusters or upgrading capacity in the datacenter. Multiple generations of cluster architecture may also coexist, especially in datacenters, which makes these tasks particularly challenging as the network evolves.

Facebook implements top-down network management through a model of the network called FbNet. This modeling is extended to cover the entire network. Read and write APIs are provided for interacting with the model, and each write consists of atomic tasks. Storage consists of a single primary database and multiple secondary databases, some of which are locally available.
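To make the read/write split concrete, here is a minimal sketch of a model store with all-or-nothing writes to one primary and reads served from secondaries. It is not FbNet's actual API: the class `ModelStore`, its methods, the validation rule, and the synchronous replication are all assumptions made for illustration.

```python
import copy


class ModelStore:
    """Hypothetical toy model store: one primary, several read replicas."""

    def __init__(self, num_replicas=2):
        self.primary = {}
        self.replicas = [{} for _ in range(num_replicas)]

    def write(self, changes):
        """Apply a batch of (object_id, fields) changes all-or-nothing."""
        staged = copy.deepcopy(self.primary)
        for object_id, fields in changes:
            # Assumed validation rule: every object must declare a type.
            if not isinstance(fields, dict) or "type" not in fields:
                # Abort before touching the primary, so no partial write lands.
                raise ValueError(f"invalid object {object_id!r}")
            staged[object_id] = fields
        self.primary = staged
        # Replication is synchronous here purely to keep the sketch short.
        self.replicas = [copy.deepcopy(staged) for _ in self.replicas]

    def read(self, object_id, replica=0):
        """Serve reads from a secondary rather than from the primary."""
        return self.replicas[replica].get(object_id)


store = ModelStore()
store.write([
    ("sw1", {"type": "switch"}),
    ("lc1", {"type": "linecard", "switch": "sw1"}),
])
try:
    # The second object is invalid, so neither object should be stored.
    store.write([("sw2", {"type": "switch"}), ("bad", {})])
except ValueError:
    pass
```

After the failed batch, `sw2` is absent from every replica, which is the property an atomic write is meant to give: a design change either lands in full or not at all.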

Network design intent is conveyed by creating distinct objects. For example, for the 4-pop cluster used as an example, FbNet creates distinct objects for the network switch, the linecard, the physical interfaces, prefixes, circuits, and so on, with relationship fields pointing to other objects. Design intent is then translated into device configs, starting from cluster templates, which are validated and then converted into FbNet objects and, from those, into per-device config objects. All of these are vendor-agnostic. Finally, they are converted into vendor-specific configs and deployed.
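The object-and-relationship idea, and the last two steps of the pipeline (objects to per-device config, then to vendor-specific text), can be sketched roughly as follows. Everything here is hypothetical: the class names, the fields, the `vendor_a` and `vendor_b` labels, and both output syntaxes are invented for illustration and do not correspond to FbNet's schema or to any real vendor's configuration language.

```python
from dataclasses import dataclass


@dataclass
class Switch:
    name: str
    vendor: str  # hypothetical vendor label, e.g. "vendor_a"


@dataclass
class Linecard:
    slot: int
    switch: Switch  # relationship field pointing to the owning switch


@dataclass
class PhysicalInterface:
    port: int
    linecard: Linecard  # relationship field pointing to the linecard
    prefix: str

    @property
    def name(self):
        return f"{self.linecard.slot}/{self.port}"


def device_config(switch, interfaces):
    """Build a vendor-agnostic config object for one device."""
    return {
        "hostname": switch.name,
        "interfaces": [
            {"name": i.name, "address": i.prefix}
            for i in interfaces
            if i.linecard.switch is switch  # follow the relationship fields
        ],
    }


# Made-up syntaxes, one per hypothetical vendor.
RENDERERS = {
    "vendor_a": lambda i: f"interface {i['name']} address {i['address']}",
    "vendor_b": lambda i: f"set interface {i['name']} ip {i['address']}",
}


def render(config, vendor):
    """Turn a vendor-agnostic config object into vendor-specific text."""
    lines = [f"hostname {config['hostname']}"]
    lines += [RENDERERS[vendor](i) for i in config["interfaces"]]
    return "\n".join(lines)


sw = Switch("psw1", "vendor_a")
intf = PhysicalInterface(port=2, linecard=Linecard(slot=1, switch=sw),
                         prefix="192.0.2.0/31")
print(render(device_config(sw, [intf]), sw.vendor))
```

Keeping `device_config` free of vendor details means a design change only touches the objects; supporting another vendor only means adding a renderer.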

By monitoring the usage of FbNet, the authors observed that the FbNet model changes a lot over time: new models and relationships keep being added, rather than the model becoming stable. POP and DC designs change frequently, with a single design change affecting almost 1000 devices on average. Backbone designs also change, but a single change does not affect more than 10 devices on average. Similar results are seen for Robotron when looking at the number of configuration lines that change as designs change.

Robotron has evolved bottom-up and has been driven largely by operational experience, from the modeling of FbNet in 2008, to active and passive monitoring in 2011, to gradual deployment over time. Over this period, the authors gained key insights into the modeling process, the coupling between network design, config generation, and deployment, and the need for manual involvement at times. These insights came directly from experiences such as egress link saturation and a failure to deploy after designing a new rack.

Q: You talked about your network model: there are some fields, and some of them have values. You talked about the syntax, but not the semantics. How would you define the meaning so that formal analysis could be conducted on it?
A: There is no formal semantics. We would like to work more on that too.

Q: About your claim that this is the first paper of its kind: are you aware of NOMS, a conference for network management papers, including ones on carrier networks?
A: I am not very familiar with it, but will check it out.

Q: At the end of your talk, you mentioned you needed a manual override. Do you have any characterization of the types of situations where this was necessary?
A: One example: a version of the model in use did not model BGP policies correctly, and we couldn't wait for that to be fixed.
Q: Are there fundamentally some things Robotron cannot handle or is it just lack of anticipation?
A: It is usually lack of anticipation. But sometimes performance isn't guaranteed even when Robotron can handle something, so it becomes important to handle some things manually.