Thursday, August 25, 2016

The Good, the Bad, and the Differences: Better Network Diagnostics with Differential Provenance

Ang Chen (University of Pennsylvania)
Yang Wu (University of Pennsylvania)
Andreas Haeberlen (University of Pennsylvania)
Wenchao Zhou (Georgetown University)
Boon Thau Loo (University of Pennsylvania)

The intricate relationship between the events in a distributed system makes network diagnostics in such a system quite complex and tricky. Existing debuggers can provide information on what happened on the network by providing details like packet history and comprehensive explanation through data provenance. However, identifying the root cause of the problem from this huge set of information is still complicated. The paper presents a concept called differential provenance (DiffProf), that tracks the causal connections between network states and state changes and provides root cause analysis by reasoning about the differences between provenance tree for a reference event (good event) and the provenance tree for the symptom event (bad event).

The provenance graph is a directed graph that contains events or states as vertices and causality as the edges. The naive method of taking a difference between the two provenance trees can cause the diff tree to be larger than the two individual trees due to the butterfly effect of initial small differences getting propagated as drastic differences later on. The paper presents an initial algorithm for identifying the minimal diff tree and suggests many further refinements to make the algorithm more efficient and correct. The system states and events are represented as tuples, which are organized into tables. A set of derivation rules describe how these tuples could be derived. The initial algorithm includes rolling back the execution of the rolling back the execution of the symptom event to a divergent point by which the two provenance trees differ and then mutate the faulty node to the correct state. The problem with this approach is that, it only finds the initial differences and a large set of consequences are ignored. In order to preserve the meaningful unique information about the event which the symptom provenance tree represents, one leaf tuple in each tree is marked as a seed of that tree, which cannot be mutated. The algorithm is then refined to roll back until the divergent point and change the faulty node to an equivalent node and then repeat the same process to find completely equivalent trees, thereby identifying the root-causes of multiple faults that occurred.

DiffProf is implemented in C++ based on RapidNet and evaluated in two sets of case studies for SDN and Hadoop systems. For the multiple case studies evaluated, DiffProf finds one or two nodes as the faults, thereby exactly isolating the root cause of the faults. DiffProf is also efficient to answer all the queries within a minute. Apart from this, DiffProf also determines the root cause efficiently for complex networks, under heavy interference.

Q&A:

Q: What happens if the reference and the symptom are the same? In the starting case in your strawman, you are being routed to two different servers but you needed to go to the same servers. Is that covered?
A: We need to find a reference event that encodes the correct intent, in this case the reference event does exactly the same thing that the symptom event not give sufficient information, so we can’t handle this case.

Q: How much does this vary with the reference events, how carefully should those be chosen?
A: The number of root causes found is corresponding to the actual number of misconfigurations in the network and not the reference events. For instance, we found one or two misconfigurations because in the problems that we recreated from other papers, the scenarios themselves lead to one or two misconfigurations. That is why the packet goes to one of two of the nodes as a result.

Q: What kinds of faults can you figure out, like you mentioned ATPG. ATPG can only figure out the action faults and not the action faults, like for example traffic header is matching a rule which is not supposed to match, so how do you detect those anomalies? What is the coverage of the rules in all of the switches in the network?
A: One difference between ATPG and our work is that we do not focus specifically on the data plane. So we can handle more general cases than ATPG, for instance say it could be applicable to general distributed systems like MapReduce. But that said, it works when there is a reference event. So that is one restriction for the kind of faults that DiffProf could handle. Beyond that DiffProf mostly works for functional faults and not for tiny faults such as race conditions, where the predicted result is correct, but the timing perhaps is delayed - DiffProf does not work well for those kind of bugs. But we are having an ongoing research project on those kind of faults as well.