Presenter: Yang Wu
Authors: Yang Wu, Andreas Haeberlen, Wenchao Zhou, Boon Thau Loo
Network debuggers like ndb work by generating a backtrace: a causal chain of events leading from an observed problem back to a set of root causes. But generating a backtrace requires an observed event to start from, so these debuggers cannot answer questions about events that never happened, such as "why is the HTTP server not receiving traffic?"
This paper proposes a methodology for answering such "why-not queries" using negative provenance. Provenance is a concept from databases that models causal relationships between inputs and outputs. Provenance is often represented as a DAG, which can be computed in a straightforward way from programs written in declarative languages such as NDLog:
PacketSent :- FlowEntry, PacketReceived
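To make the idea concrete, here is a minimal sketch (not the paper's implementation) of how classic positive provenance can be recorded for the rule above: each derived tuple remembers the body tuples it was derived from, and walking those links backwards yields the provenance DAG.

```python
# Hypothetical sketch of positive provenance for the NDLog-style rule
#   PacketSent :- FlowEntry, PacketReceived
# All names here are illustrative, not the paper's API.

from dataclasses import dataclass

@dataclass(frozen=True)
class Tuple:
    name: str

# provenance maps each derived tuple to the derivations that support it
provenance: dict = {}

def derive(head: Tuple, body: list):
    """Record that `head` was derived because every tuple in `body` exists."""
    provenance.setdefault(head, []).append(list(body))

def explain(t: Tuple, depth: int = 0):
    """Walk the provenance DAG from an observed tuple back to base facts."""
    print("  " * depth + f"EXIST({t.name})")
    for body in provenance.get(t, []):
        for cause in body:
            explain(cause, depth + 1)

# Base facts have no provenance entries: they are the root causes.
flow = Tuple("FlowEntry")
pkt_in = Tuple("PacketReceived")

# Apply the rule: both body tuples exist, so the head is derived.
pkt_out = Tuple("PacketSent")
derive(pkt_out, [flow, pkt_in])

explain(pkt_out)
# EXIST(PacketSent)
#   EXIST(FlowEntry)
#   EXIST(PacketReceived)
```

This is the "positive" direction a tool like ndb already supports: starting from an observed tuple and tracing back through the derivations that produced it.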
This paper proposes an extended provenance model that includes negative information, and develops techniques for counterfactual reasoning over these representations. To give programmers simpler explanations, it also presents algorithms for compressing provenance graphs; the compressed graphs are 90% smaller on average and typically contain fewer than 20 nodes. A prototype implementation has been built and evaluated using Mininet.
This work is significant because it shows how to extend network debuggers to explain negative questions in addition to positive ones.
Q: Are there limits on the kinds of queries you can express using negative provenance? For example, why is the traffic not being load balanced?
A: Our current focus is on debugging logical properties and not quantitative properties.
Q: In how many of these examples would a forward trace catch the bugs as opposed to a backward trace?
A: In many situations you don't know which trace to issue and where to start. This is especially true in complex and dynamic environments.
Q: This sounds related to some older work from the knowledge-plane world: something goes wrong, and you have to figure out what happened. Someone has to start by identifying a problem. In a network like the Internet, you're not going to have global visibility; there's a big scaling problem, probably a machine-learning problem, etc.
A: We would be interested in understanding that work better.
Q: It seems like why-not queries could be understood in terms of classic notions of safety and liveness. For example, given the liveness property "if HTTP traffic is arriving at the ingress to the network, then it should eventually be delivered to a server," your system generates conditions that can be readily checked (using specific assumptions about the domain). Have you thought about modeling things this way to get a handle on the kinds of properties you can check?
A: Not yet. Regarding which properties we can check, as stated previously, our current focus is on logical properties and not quantitative properties.