Sunday, August 20, 2017

Session 6 paper 3: Resilient Datacenter Load Balancing in the Wild (Hong Zhang, Junxue Zhang, Wei Bai, Kai Chen, Mosharaf Chowdhury)

Data center networks are usually multi-rooted, so load balancing across the multiple paths is required. However, load balancing has a hard time dealing with three types of uncertainty:
1. Traffic dynamics: e.g., congestion caused by bursts.
2. Topology asymmetry: e.g., link cuts or heterogeneous devices.
3. Switch failures: including blackholes and silent random drops.

So a load balancer has two requirements:
1. It needs to sense uncertainties.
2. It needs to react to uncertainties.

Previous work all fails to sense switch failures, and each scheme has its own problems. Here are three examples showing the problems of previous solutions.

Example 1: The problem of being failure-ignorant. If random drops happen, a congestion-aware LB does not sense them. In fact, it diverts even more traffic to the failed switch, because that switch appears to carry very little traffic (most of it is dropped by the blackhole or random drops).
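To make Example 1 concrete, here is a minimal sketch. The path names, utilization numbers, loss rates, and the `max_loss` threshold are all made up for illustration and are not taken from the paper. It shows how a purely congestion-aware choice is attracted to the lossy path, while a choice that also looks at loss is not.

```python
# Hypothetical per-path view; names and numbers are made up for illustration.
paths = {
    "via_healthy_switch": {"utilization": 0.60, "loss_rate": 0.00},
    "via_faulty_switch": {"utilization": 0.10, "loss_rate": 0.02},
}


def congestion_aware_choice(paths):
    """Pick the least utilized path, ignoring loss entirely."""
    return min(paths, key=lambda name: paths[name]["utilization"])


def failure_aware_choice(paths, max_loss=0.01):
    """Skip paths whose loss rate exceeds max_loss, then pick the least utilized."""
    healthy = [name for name in paths if paths[name]["loss_rate"] <= max_loss]
    candidates = healthy or list(paths)  # fall back to all paths if none look healthy
    return min(candidates, key=lambda name: paths[name]["utilization"])


print(congestion_aware_choice(paths))  # via_faulty_switch
print(failure_aware_choice(paths))     # via_healthy_switch
```

The faulty path looks lightly loaded precisely because its packets are being dropped, which is why the first function picks it.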

Example 2: Flowlet switching fails to react to uncertainty in a timely manner. DCTCP traffic is much less bursty than normal TCP, so it is hard to find a gap large enough to separate flowlets.
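A small sketch of why smooth traffic yields few flowlets. It assumes the usual definition of a flowlet as a run of packets separated from the next run by an idle gap larger than a threshold. The timestamps and the 500 microsecond threshold are hypothetical values chosen only to show the effect.

```python
def split_into_flowlets(timestamps, gap):
    """Group sorted packet timestamps into flowlets.

    A new flowlet starts whenever the idle time since the previous
    packet is strictly greater than `gap`.
    """
    flowlets = []
    for t in timestamps:
        if flowlets and t - flowlets[-1][-1] <= gap:
            flowlets[-1].append(t)
        else:
            flowlets.append([t])
    return flowlets


GAP_US = 500  # hypothetical flowlet timeout, in microseconds

# Bursty sender: tight bursts separated by long idle periods.
bursty = [0, 1, 2, 600, 601, 602, 1300, 1301]
# Smooth sender: evenly paced packets, never idle for long.
smooth = [0, 100, 200, 300, 400, 500, 600, 700]

print(len(split_into_flowlets(bursty, GAP_US)))  # 3
print(len(split_into_flowlets(smooth, GAP_US)))  # 1
```

With only one flowlet, the smooth flow never gets a chance to switch paths, no matter how conditions change.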

Example 3: The problem of vigorous rerouting. It causes congestion mismatch: TCP adjusts its rate based on the current path's congestion condition, so when the flow changes to another path, that rate may no longer be right.

The challenge of LB is how to handle uncertainties gracefully. The authors propose Hermes, which has four properties: comprehensiveness, timeliness, transport friendliness, and deployability. It has two modules: a sensing module and a rerouting module. The sensing module senses both congestion and failures, and the rerouting module decides when and where to reroute traffic.

The sensing module uses ECN and RTT to sense congestion. It uses retransmissions and timeouts to infer failures: frequent timeouts mean a blackhole, and frequent retransmissions mean random drops. It also uses active probing to improve visibility. The baseline is to probe all paths. They reduce the overhead by probing only 2 random paths (following the power of two choices), plus the 1 previous best path.
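The sensing logic described above can be sketched roughly as follows. This is an illustration of the idea, not the authors' implementation: the `PathStats` fields, every threshold, the order of the checks, and the choice to flag congestion when either ECN or RTT is high are assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class PathStats:
    ecn_fraction: float         # fraction of ECN-marked packets
    rtt_us: float               # measured RTT in microseconds
    retransmit_fraction: float  # fraction of retransmitted packets
    timeouts: int               # recent timeout count


# All thresholds below are hypothetical.
ECN_HIGH = 0.4
RTT_HIGH_US = 300.0
RETX_HIGH = 0.01
TIMEOUT_LIMIT = 3


def classify(p):
    """Map raw path signals to a coarse label (check order is an assumption)."""
    if p.timeouts >= TIMEOUT_LIMIT:
        return "suspected blackhole"
    if p.retransmit_fraction > RETX_HIGH:
        return "suspected random drops"
    if p.ecn_fraction > ECN_HIGH or p.rtt_us > RTT_HIGH_US:
        return "congested"
    return "good"


def pick_probe_paths(all_paths, previous_best, rng):
    """Probe 2 random paths plus the previous best path, instead of all paths."""
    others = [path for path in all_paths if path != previous_best]
    chosen = rng.sample(others, min(2, len(others)))
    if previous_best is not None:
        chosen.append(previous_best)
    return chosen


print(classify(PathStats(0.05, 120.0, 0.0, 5)))    # suspected blackhole
print(classify(PathStats(0.05, 120.0, 0.03, 0)))   # suspected random drops
print(classify(PathStats(0.70, 450.0, 0.0, 0)))    # congested
print(pick_probe_paths(list(range(8)), previous_best=3, rng=random.Random(0)))
```

With 8 paths, each round probes 3 paths rather than 8, and the previous best path is always rechecked.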

The rerouting module is cautious. The authors give an example of a single flow in which a reroute first drops the rate to half, after which the rate increases again. So rerouting can be beneficial even with reordering, and a flow should be rerouted immediately if doing so can reduce its FCT. Hermes has three heuristics: (1) only reroute when the new path is much better (because sensing is inaccurate, a path sensed as slightly better may not actually be better), (2) avoid rerouting flows with a small remaining size, because such a flow may not benefit much from the higher rate but still has to experience the rate drop, and (3) avoid rerouting flows that already have a high sending rate.
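The three heuristics can be read as a conjunction of guards. The sketch below is one hedged way to write that down. The function name, the cost metric (lower is better), the assumption that the remaining flow size is known, and all three thresholds are hypothetical and not taken from the paper.

```python
# Hypothetical thresholds, chosen only for illustration.
MIN_IMPROVEMENT = 0.3          # new path must look at least 30% better
MIN_REMAINING_BYTES = 100_000  # do not bother rerouting nearly finished flows
MAX_RATE_FRACTION = 0.5        # fraction of link capacity considered "already fast"


def should_reroute(current_cost, candidate_cost, remaining_bytes, rate_fraction):
    """Return True only if all three cautious-rerouting guards pass.

    current_cost / candidate_cost: sensed path cost, lower is better.
    remaining_bytes: bytes the flow still has to send (assumed known here).
    rate_fraction: current sending rate as a fraction of link capacity.
    """
    # (1) Require a large margin, since the sensed costs are noisy.
    much_better = candidate_cost < current_cost * (1 - MIN_IMPROVEMENT)
    # (2) Small remaining size: the temporary rate drop would not pay off.
    enough_left = remaining_bytes >= MIN_REMAINING_BYTES
    # (3) Flows that are already sending fast are left alone.
    not_already_fast = rate_fraction < MAX_RATE_FRACTION
    return much_better and enough_left and not_already_fast


print(should_reroute(100, 50, 1_000_000, 0.2))  # True
print(should_reroute(100, 90, 1_000_000, 0.2))  # False: only slightly better
print(should_reroute(100, 50, 10_000, 0.2))     # False: little left to send
print(should_reroute(100, 50, 1_000_000, 0.8))  # False: already fast
```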

In the evaluation, the authors show that for some workloads CONGA is slightly better (17% lower average FCT), because a switch-based solution has better visibility into congestion. However, under an asymmetric topology, Hermes is better than CONGA and Presto, because CONGA has very few flowlets and Presto is congestion-oblivious. The authors also show that under switch failures, Hermes is better by up to 32% in FCT.

Q: How do you handle micro-bursts?
A: Handling micro-bursts is not the major goal. Our goal is like CONGA's: having a feedback loop to avoid global congestion. There is a tradeoff: if you want to avoid global congestion, you need a longer feedback loop. DRILL has a much smaller feedback loop, so it does not avoid global congestion.

Q: You ran the simulation at 2G; what about 10G and 25G?
A: The baseline runs at 10G. The 2G links are there to create asymmetry.
Q: What happens at 25G?
A: Interesting problem. One thing that might change is the transport protocol's behavior and how the LB interacts with it. At 25G, flowlets may have a different pattern. It would be good to dig into.

Q: In a distributed system, why do independent choices lead to a correct estimation?
A: We leverage the power of two choices. Each host randomly selects 2 paths to probe, so we can avoid the herding problem.