Authors: Aditya Akella (University of Wisconsin-Madison/Microsoft), Ratul Mahajan (Microsoft)
Much work has addressed the data plane and the control plane, but we lack comparable tools and frameworks for reasoning about the management plane ("the unlooked cousin in the networking"). As a result, even simple questions (e.g., how heterogeneous are network designs?) remain unanswered. The problem matters because a misbehaving management plane can harm both the control plane (e.g., by increasing configuration bugs) and the data plane (e.g., by decreasing performance).
In this paper, the authors present MPA (Management Plane Analytics), a systematic framework for characterizing the management plane.
The MPA framework has three high-level goals:
- Characterize a network's management practices using a set of metrics;
- Infer which practices matter for operational health, answering questions such as "does automating configuration help or not?";
- Build models of a network's operational health to inform practice, answering questions such as "is a design change robust to outages?".
The biggest challenge for MPA is that management practices and their impact are not directly logged, so the relevant information has to be inferred from the data that is available: configuration snapshots and trouble ticket logs. Unfortunately, such data sets are noisy, have holes in them, or do not contain enough samples. To overcome these limitations, the authors propose to leverage big data by aggregating management information across many networks.
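As a rough illustration of this kind of inference, the sketch below derives a change count from successive configuration snapshots and a ticket count from a ticket log, then pools both per network. It is not the authors' implementation: the record layouts, the network names, and the helpers `change_counts`, `ticket_counts`, and `pooled_table` are all hypothetical, and the data is made up.

```python
from collections import defaultdict

# Hypothetical inputs: (network, month, config text) per snapshot,
# and (network, month) per trouble ticket. All values are made up.
snapshots = [
    ("net-a", 1, "vlan 10\nacl web"),
    ("net-a", 2, "vlan 10\nacl web\nacl db"),
    ("net-a", 3, "vlan 10\nacl web\nacl db"),
    ("net-b", 1, "vlan 20"),
    ("net-b", 2, "vlan 30"),
    ("net-b", 3, "vlan 40"),
]
tickets = [("net-a", 2), ("net-b", 2), ("net-b", 3), ("net-b", 3)]


def change_counts(snapshots):
    """Count, per network, how many consecutive snapshots differ."""
    by_net = defaultdict(list)
    for net, month, text in snapshots:
        by_net[net].append((month, text))
    counts = {}
    for net, items in by_net.items():
        items.sort()  # order snapshots by month
        counts[net] = sum(
            1 for (_, before), (_, after) in zip(items, items[1:]) if before != after
        )
    return counts


def ticket_counts(tickets):
    """Count tickets per network as a crude proxy for operational health."""
    counts = defaultdict(int)
    for net, _month in tickets:
        counts[net] += 1
    return dict(counts)


def pooled_table(snapshots, tickets):
    """Join the inferred practice metric and the health proxy, one row per network."""
    changes = change_counts(snapshots)
    tix = ticket_counts(tickets)
    return {
        net: {"changes": changes[net], "tickets": tix.get(net, 0)}
        for net in sorted(changes)
    }


table = pooled_table(snapshots, tickets)
```

Pooling rows from many networks into one table is what gives enough samples to look for patterns that a single noisy, incomplete data set would not support.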
The authors applied MPA to management data from a set of online service providers. Using this data set, they answer several questions, including:
- "Does the size of the network correlate with operational health?" Short answer: No. There is no clear relationship between the size of the network and its operational health.
- "Is heterogeneity correlated with operational health?" Short answer: Yes. As heterogeneity increases, operational health tends to worsen.
- "Is change rate correlated with operational health?" Short answer: Yes. As change rate increases, operational health tends to worsen in homogeneous networks.
Remaining work mentioned by the authors:
- Come up with a new set of metrics that describe operational health;
- Improve the framework with better approaches to modeling noise and its impact, and model causal relations.
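The correlation questions above (size, heterogeneity, and change rate versus operational health) can be sketched with a rank correlation computed across networks. This is only an illustration under stated assumptions: Spearman's coefficient is our choice here, not necessarily the statistic the authors used, and the variable names and numbers are made up.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical per-network values: number of distinct device models
# (a heterogeneity proxy) and number of trouble tickets (a health proxy).
device_models = [1, 2, 3, 5, 8]
ticket_count = [4, 3, 9, 12, 20]

rho = spearman(device_models, ticket_count)
```

A coefficient near +1 here would mean that networks with more device models tend to have more tickets. It would not show that heterogeneity causes the tickets, which is why modeling causal relations is listed separately.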
Q: (Nina Taft) What "networks" do you consider?
A: Large service providers such as Google, Facebook, etc.
Q: (Andrew Ferguson) Are the tickets only for networking issues? Or also for cooling? Power?
A: Only networking-related data. More data would help for sure.