Tuesday, August 18, 2015

[Session 3.1, Experience Track 1] Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis

Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis

Authors: Chuanxiong Guo, Lihua Yuan, Dong Xiang, Yingnong Dang, Ray Huang, Dave Maltz, Zhaoyi Liu, Vin Wang, Bin Pang, Hua Chen, Zhi-Wei Lin, Varugis Kurien*
Microsoft, *Midfin Systems

Presenter: Chuanxiong Guo

This work focuses on network measurement and analysis in data center networks (DCNs). It tackles challenges in DCN operation, such as determining whether an incident is caused by the network and tracking SLAs. The work suggests that latency measurements between any two servers can address these challenges.

The system, dubbed Pingmesh, runs a Pingmesh agent on each server. The agents are controlled by a Pingmesh controller, which defines and manages the measurements. The system is built on top of existing infrastructure (Autopilot, Cosmos/SCOPE). Storage and analysis run at time scales ranging from 10 minutes to a day.
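
To make the controller's role concrete, here is a minimal sketch of how a controller might assign probe targets to agents. It assumes a naive full mesh in which every server probes every other server; the function name `build_pinglists` and the server names are hypothetical, and the talk summary does not describe how the real controller builds its lists.

```python
from itertools import permutations


def build_pinglists(servers):
    """Return {source: [targets]} so that every server probes every other one.

    Hypothetical sketch: a plain full mesh, not the production algorithm.
    """
    pinglists = {server: [] for server in servers}
    for src, dst in permutations(servers, 2):
        pinglists[src].append(dst)
    return pinglists


# Three hypothetical servers yield 3 * 2 = 6 directed probe pairs.
pinglists = build_pinglists(["srv-a", "srv-b", "srv-c"])
total_pairs = sum(len(targets) for targets in pinglists.values())
```

A full mesh grows quadratically with the number of servers, which is why the measurement volume reaches the scale discussed in the Q&A.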

The rest of the talk focused on the results collected and the lessons learned.
It started by exploring latency measurements, comparing two data centers.
A subsequent packet drop rate study showed packet drops at the NIC and the ToR switch. It compared drop rates across data centers and found that they differ (for both intra-rack and inter-rack drops), with a typical drop rate on the order of 10^-4 to 10^-5.
Using Pingmesh, it was possible to detect black holes (deterministic packet drops) and silent random packet drops.
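
The distinction between the two failure types can be illustrated with a small sketch. It assumes probe results arrive as hypothetical `(src, dst, ok)` tuples, treats a pair whose probes all fail as a black-hole candidate, and flags a pair whose loss rate exceeds an arbitrary threshold as a random-drop candidate. The function name, thresholds, and record format are assumptions for illustration, not the detection logic presented in the talk.

```python
from collections import defaultdict


def classify_pairs(probes, min_probes=5, random_drop_threshold=1e-3):
    """Label server pairs by their observed probe loss.

    probes: iterable of (src, dst, ok) tuples (hypothetical record format).
    Pairs with fewer than min_probes samples are skipped as inconclusive.
    """
    stats = defaultdict(lambda: [0, 0])  # (src, dst) -> [sent, lost]
    for src, dst, ok in probes:
        stats[(src, dst)][0] += 1
        if not ok:
            stats[(src, dst)][1] += 1

    labels = {}
    for pair, (sent, lost) in stats.items():
        if sent < min_probes:
            continue
        drop_rate = lost / sent
        if drop_rate == 1.0:
            # Every probe failed: deterministic loss.
            labels[pair] = "black-hole candidate"
        elif drop_rate > random_drop_threshold:
            # Some probes failed, well above the assumed baseline.
            labels[pair] = "random-drop candidate"
    return labels


sample = (
    [("a", "b", False)] * 5                            # all probes lost
    + [("a", "c", True)] * 9 + [("a", "c", False)]     # 10% loss
    + [("b", "c", True)] * 10                          # healthy
)
labels = classify_pairs(sample)
```

With the reported baseline of 10^-4 to 10^-5, a real threshold would need many more samples per pair than this toy input provides.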

Pingmesh is not without limitations. It cannot tell which spine switch is dropping packets. Also, it measures single-packet RTT, which works for detecting network reachability and packet-level latency issues but does not cover cases that need multiple round trips.

Pingmesh is an always-on service with full coverage. It scales and leaves room to evolve. As such, it is an interesting tool for tracking down problems in the network, which is one of the hard problems today. Plus, it has been running in production for a long time (4 years), which is its main advantage.


Q: What are the thresholds beyond which the overhead of measurements is too high?
A: Pingmesh has considerable processing overhead, but in terms of networking it uses just Kbps, versus the Gbps available on the servers, so it is negligible.
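
A back-of-the-envelope calculation shows why probe traffic lands in the Kbps range. All numbers below (peer count, probe interval, bytes per probe, link speed) are hypothetical values chosen for illustration, not figures from the talk.

```python
def probe_bandwidth_bps(num_peers, interval_s, bytes_per_probe):
    """Average probe traffic in bits per second for one server."""
    probes_per_second = num_peers / interval_s
    return probes_per_second * bytes_per_probe * 8


# Assumed: 1000 peers, each probed once every 10 s, 200 bytes per probe.
bps = probe_bandwidth_bps(num_peers=1000, interval_s=10, bytes_per_probe=200)
kbps = bps / 1e3

# Share of an assumed 10 Gbps server link.
link_fraction = bps / 10e9
```

Under these assumptions the probes consume 160 Kbps, a tiny fraction of the link.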

Q: In the past, projects like PlanetLab used similar measurement techniques to predict how systems will behave. Will you be able to do what-if analysis based on the huge amount of data you collected?
A: We are currently focused on using the system for troubleshooting.

Q: Do you really need 2M probes to find that a server went down?
A: You cannot predict what will happen next, so we try to make the measurements as comprehensive as possible, so that we can catch failures quickly.