Thursday, August 24, 2017

SIGCOMM'17, Session 9 (Realities), Paper 1: Understanding and Mitigating Packet Corruption in Data Center Networks

Authors: Danyang Zhuo (University of Washington), Manya Ghobadi (Microsoft Research), Ratul Mahajan (Microsoft Research & Intentionet), Klaus-Tycho Förster (Aalborg University), Arvind Krishnamurthy (University of Washington), Thomas Anderson (University of Washington)

Presenter: Danyang Zhuo

[Link to the paper]

Several techniques for reducing packet loss have been studied, but they mostly focus on congestion, which is only one source of packet loss. Packet corruption, another source, has received little attention.

Danyang Zhuo et al. studied packet corruption in data center networks (DCNs). They monitored around 350K switch-to-switch optical links across 15 data centres of a major cloud provider over seven months.

Based on their measurements and analyses, they found that:
  • Packet corruption is a significant source of packet loss;
  • Packet corruption has distinct symptoms and root causes. Its characteristics differ from congestion losses; 
  • Corruption affects fewer links than congestion but imposes higher loss rates on them;
  • Unlike congestion, corruption rate on a link is stable over time and is not correlated with its utilization.

To mitigate packet corruption, Danyang Zhuo said they developed CorrOpt, which is partially deployed in Microsoft DCs. CorrOpt selects which corrupting links can be safely disabled while ensuring that each top-of-rack switch keeps a minimum number of paths to reach other switches.
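
The idea of disabling corrupting links subject to a path constraint can be sketched as follows. This is only a minimal illustration and not the algorithm from the paper: it assumes a hypothetical two-tier uplink topology (ToR, aggregation, spine), counts only ToR-to-spine paths, and uses a simple greedy pass that considers the worst links first. The names `count_paths_to_spine`, `select_links_to_disable`, `min_paths`, and the toy topology and loss rates are all made up for the example.

```python
def count_paths_to_spine(tor, links, disabled):
    """Count ToR -> aggregation -> spine paths that avoid disabled links."""
    uplinks = {}
    for lower, upper in links:
        if (lower, upper) not in disabled:
            uplinks.setdefault(lower, []).append(upper)
    return sum(len(uplinks.get(agg, [])) for agg in uplinks.get(tor, []))


def select_links_to_disable(links, tors, corrupting, min_paths):
    """Greedily disable corrupting links, worst loss rate first.

    A link is disabled only if every ToR still has at least
    `min_paths` paths to the spine layer afterwards.
    """
    disabled = set()
    for link in sorted(corrupting, key=corrupting.get, reverse=True):
        trial = disabled | {link}
        if all(count_paths_to_spine(t, links, trial) >= min_paths for t in tors):
            disabled = trial
    return disabled


# Hypothetical topology: 2 ToRs, 2 aggregation switches, 2 spines.
links = [
    ("T1", "A1"), ("T1", "A2"), ("T2", "A1"), ("T2", "A2"),
    ("A1", "S1"), ("A1", "S2"), ("A2", "S1"), ("A2", "S2"),
]
tors = ["T1", "T2"]
# Hypothetical corruption loss rates for three corrupting links.
corrupting = {("T1", "A1"): 1e-3, ("A2", "S1"): 1e-4, ("T1", "A2"): 1e-5}

result = select_links_to_disable(links, tors, corrupting, min_paths=2)
print(result)  # {('T1', 'A1')}
```

In this toy run only the worst link is disabled: removing either of the other two as well would leave T1 with fewer than two paths, so they stay up until a repair restores capacity.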

CorrOpt also recommends specific actions to repair disabled links:

  • Clean dirty connectors 
  • Replace damaged fibres 
  • Replace dying transceiver lasers

Danyang Zhuo said: Our proposed system (CorrOpt) has been deployed in Microsoft DCs and it could reduce corruption losses by three orders of magnitude.

There was a discussion after the talk.
Q1: What percentage of links experience corruption?
A1: I can't comment on that. I can only report the ratio relative to links with congestion.

Q2: What remedies are there for corruption other than taking links down, e.g. reducing the rate or changing the error code, …?
A2: What we found in our data centres is that the loss rate is not correlated with traffic, so I think reducing the rate wouldn’t remedy these hardware failures. We are targeting a deployable solution. As for error correction codes and end-to-end approaches, I don’t think people do that. I think 100G technology has it; 10G and 40G usually don’t have it on the switch.
Q2: How about changing the modulation?
A2: For data centre transceivers, there is no option to change the modulation.
Q2: Really?
A2: It is fixed, but for long haul I’m not sure (it may be changeable).

Q3: How often do you have multiple corrupting links on the same path?
A3: The details are in the paper. What we found is that for causes such as dirt on the fibre, the affected links tend to scatter across the entire DC, but failures such as faulty optical breakout cables or switch hardware problems affect multiple links on the same switch.