Friday, August 26, 2016

Taking the Blame Game out of Data Centers Operations with NetPoirot

Presenter: Behnaz Arzani
Behnaz Arzani  (University of Pennsylvania)
Selim Ciraci (Microsoft)
Boon Thau Loo (University of Pennsylvania)
Assaf Schuster  (Technion - Israel Institute of Technology)
Geoff Outhred (Microsoft)

The paper presents NetPoirot which is a diagnostic tool that monitors aggregate TCP metrics and uses machine-learning based classification techniques to identify the cause of failures in the data center.

Data center can fail and it is often unclear what causes the failure, it could lie in the underlying network, client, or service-level application. A real world failure example is presented in the paper, VMs with applications running inside could trigger an operation in the hypervisor, which then sends a request to a remote service. Whenever the request/response latency increases an error occurs in the hypervisor which in turn causes the VM to panic and reboot, hence disrupting normal operations. The failure to diagnose this scenario motivates proposing a more effective debugging tool.

To achieve this goal, NetPoirot is proposed and it only measures aggregate TCP metrics to diagnose the failures. The key insight of this tool is TCP can observe the entire communication path and it sees faults no matter where it happened. There are two main parts of this tool, one is the monitoring agent which runs in the VM and periodically collects TCP metrics data, the second part is learning agent which takes the monitoring data in and uses decision tree model to classify the failures.

A prototype of NetPoirot has been developed and deployed in a production data center, and the authors have done extensive evaluation over a 6-month period. The speaker mentioned some lessons learned from this work: The first one is TCP sees everything even at a single endpoint, so it's possible to use TCP metrics to detect who is responsible for a failure, and surprisingly two features are enough for most applications to describe each failure observed by the TCP. The second one is the relationship between failures and TCP are nonlinear, and failures in a group (client/server/network) are similar which also implies individuate fault classification is difficult.

Q1: is it possible to game NetPoirot so that it makes mistakes?

A1: Potentially, it could be possible, but it could be thought of as an abnormal failure, so what one might do is anomaly detection to see how similar a failure to failures observed in the past to see if it has enough in common with them to be identified as a legit diagnosis. Although I have not done any research on this so I am not 100% what the result of doing that would be.

Q2: What are the features most prominently identified by PCA? given that there are only two features (derived from the original features) required to characterize each failure using TCP data.

A2: It varies from application to application, given that the approach is very application dependent. But for the example application we tried for example for client side failures we had time spend in zero window as the feature with the highest coefficient.