Wednesday, August 14, 2013

SIGCOMM2013: Reducing Web Latency: The Virtue of Gentle Aggression

Presenter: Tobias Flach
Co-authors: Tobias Flach, Nandita Dukkipati, Andreas Terzis, Barath Raghavan, Neal Cardwell, Yuchung Cheng, Ankur Jain, Shuai Hao, Ethan Katz-Bassett, and Ramesh Govindan

This work describes how the mean latency between clients and Google servers was reduced by 23%.
The authors analyzed billions of flows between Google web servers and clients. The main cause of increased latency is packet loss, which in 77% of cases is recovered through retransmission timeouts (RTOs), and these are slow (for 25% of transfers, the RTO is 10 times the RTT). The proposal to reduce latency is based on the observation that 35% of lossy bursts experience a single packet loss, and that the loss usually hits packets toward the end of a burst (twice as likely). The goal is to achieve near-optimal loss detection and recovery, in order to minimize the latency increase caused by loss.

The proposal focuses on multi-stage TCP transfers, where the client connects to a frontend server, and the frontend server connects to a backend server. The client-frontend path is controlled by ISPs, while the frontend-backend path is controlled by the provider (Google). To minimize loss-induced latency on the public segment, the authors propose a "Reactive" model in which the frontend server retransmits the last packet(s) of a burst to the client after waiting for two RTTs. This speeds up loss detection. For the frontend-backend segment, they propose a "Proactive" recovery model, where the backend sends a duplicate of every packet it sends to the frontend. As a middle way, a third model, "Corrective", combines the low overhead of Reactive with the fast recovery of Proactive by sending a few redundant segments that encode previously transmitted segments and can be used to recover a single packet loss. This speeds up recovery since no loss detection is required.
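
To make the Corrective idea more concrete, here is a minimal Python sketch of how one redundant segment can repair a single lost packet. It assumes a plain XOR parity over the segments of a burst, which is a common way to build such a code; the talk notes do not spell out the actual encoding, and the names make_parity and recover_missing are made up for this illustration.

```python
from functools import reduce
from typing import Dict, List, Optional, Tuple


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(segments: List[bytes]) -> bytes:
    """Build one redundant segment: the XOR of all segments in the burst.

    Shorter segments are zero-padded to the longest one. This is an
    assumption of the sketch; a real scheme would also carry lengths.
    """
    size = max(len(s) for s in segments)
    padded = [s.ljust(size, b"\x00") for s in segments]
    return reduce(xor_bytes, padded)


def recover_missing(
    received: Dict[int, bytes], parity: bytes, total: int
) -> Optional[Tuple[int, bytes]]:
    """Rebuild the one missing segment from the parity segment.

    `received` maps segment index -> payload. Returns (index, payload)
    if exactly one segment is missing, otherwise None: with zero losses
    there is nothing to do, and XOR parity cannot repair two or more.
    """
    missing = [i for i in range(total) if i not in received]
    if len(missing) != 1:
        return None
    payload = parity
    for seg in received.values():
        payload = xor_bytes(payload, seg.ljust(len(parity), b"\x00"))
    return missing[0], payload


if __name__ == "__main__":
    burst = [b"GET ", b"/ind", b"ex.h", b"tml!"]
    parity = make_parity(burst)
    # Simulate losing the last segment of the burst.
    arrived = {i: seg for i, seg in enumerate(burst[:-1])}
    print(recover_missing(arrived, parity, total=len(burst)))
```

The receiver rebuilds the missing segment as soon as the parity segment arrives, without waiting for a timeout or a retransmission, which is why no loss detection is needed. The price is the extra segment on every burst, including the ones that lose nothing.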

Reactive and Proactive together provided a latency decrease of 23%.
Corrective was evaluated in emulation. The tail latency was reduced by 20% when losses occurred, but there was a slight performance hit on loss-free connections.
Reactive and Proactive are currently IETF Internet-Drafts, and Reactive has been pushed upstream into Linux kernel 3.10.

Q&A: (based on the parts that I could hear. Will try to talk to the presenter to complete missing parts)

Q: Why handle congestion [...] ?
A: Because we don't want to change TCP

Q: Why do we have losses?
A: Congestion, ISP policies

Q: How do we know it's always the very last, why is that the case?
A: Same as above, traffic arrives in bursts, and the probability of losing packets increases towards the end of a burst.

Q: Why not apply FCP (previous talk) instead of Proactive?
A: Because it would be hard to deploy a new protocol like that overnight.