Thursday, August 25, 2016

RDMA over Commodity Ethernet At Scale

Presenter: Chuanxiong Guo
Co-Authors: Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitu Padhye, Marina Lipshteyn

TCP is still the dominant protocol for transmitting packets in datacenters, but it suffers from high latency and CPU overhead. RDMA, on the other hand, requires a lossless network, which it achieves using Priority-based Flow Control (PFC) to prevent buffer overflow.

In this talk, the authors discussed their experience deploying RDMA over commodity Ethernet (RoCEv2) in Microsoft's datacenters, and how they overcame some of the challenges they encountered.

PFC is a hop-by-hop flow control protocol that uses the priority each packet carries in its VLAN tag. The authors argue that this doesn't scale well: it breaks PXE boot, which is needed for OS updates, and there is no standard way to carry VLAN tags across layer 3 networks. They would much rather use the DSCP field in the IP header, removing VLAN-related issues completely.
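As a rough illustration, the two encodings differ mainly in where the 3-bit priority class lives: the PCP bits of the VLAN tag versus the DSCP bits of the IP header. The priority-to-DSCP mapping below (class N to code point 8·N, i.e. the CSn code points) is a common convention and an assumption here, not necessarily the mapping used in Microsoft's deployment:

```python
# Sketch: carrying a PFC priority class in the IP header's DSCP field
# instead of the VLAN tag. Mapping is illustrative (CSn code points).

def vlan_pcp_tci(priority, vlan_id):
    """Classic approach: priority rides in the 3 PCP bits of the
    16-bit VLAN TCI (PCP | DEI | VLAN ID)."""
    assert 0 <= priority <= 7 and 0 <= vlan_id <= 0xFFF
    return (priority << 13) | vlan_id  # DEI bit left as 0

def dscp_tos(priority):
    """DSCP approach: map the priority class to a DSCP code point and
    place it in the upper 6 bits of the IP ToS / Traffic Class byte."""
    assert 0 <= priority <= 7
    dscp = priority * 8   # e.g. class 3 -> DSCP 24 (CS3)
    return dscp << 2      # lower 2 bits of the byte are ECN

print(hex(vlan_pcp_tci(3, 100)))  # 0x6064
print(dscp_tos(3))                # 96 (ToS byte carrying CS3)
```

Because the DSCP field travels in the IP header, it survives layer 3 forwarding without any VLAN configuration on the path.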

In addition to this, they observed transport livelock despite very low packet drop rates: on every loss, the NIC restarted transmission of the whole message from sequence 0. To overcome this, they changed retransmission to resume from the first unacknowledged packet (go-back-N) instead of starting again from 0. They do not observe any deadlocks: since packets travel up and then down the Clos topology, they do not form loops, and hence cannot create the cyclic buffer dependencies that lead to deadlock.
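The difference between the two retransmission strategies can be shown with a toy loss model. The simulation below, with a 1000-packet message and a drop on every 256th transmission, is purely illustrative and not from the paper; with go-back-0 the sender never gets past the first drop point, while go-back-N finishes the transfer:

```python
# Toy model: go-back-0 restarts the whole message on every loss and can
# livelock under periodic drops; go-back-N resumes after the last ACK.

def retransmit_start_go_back_0(last_acked):
    return 0  # every loss restarts the message from sequence 0

def retransmit_start_go_back_n(last_acked):
    return last_acked + 1  # resume just past the last ACKed sequence

def packets_sent(total, drop_every, start_fn, limit=10**6):
    """Simulate sending `total` packets where every `drop_every`-th
    transmission is lost. Returns (transmissions used, last seq ACKed)."""
    sent, acked = 0, -1
    while acked < total - 1 and sent < limit:
        seq = start_fn(acked)
        while seq < total:
            sent += 1
            if sent % drop_every == 0:
                break  # this transmission was dropped; go retransmit
            acked = seq
            seq += 1
    return sent, acked

print(packets_sent(1000, 256, retransmit_start_go_back_n))  # (1003, 999)
print(packets_sent(1000, 256, retransmit_start_go_back_0))  # stuck at 254
```

Under go-back-0, progress per round never exceeds the inter-drop gap, so the sender retransmits the same prefix forever; go-back-N loses only the window after the last acknowledgment.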

Further, a malfunctioning NIC may block the entire network: if it stops draining its buffer, it emits a continuous stream of pause frames, and the pauses propagate from switch to switch in a domino effect. The authors install watchdogs that monitor for such sustained pause frames and stop the propagation.
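The watchdog idea can be sketched as a per-queue timer that trips when a priority stays paused for too long. The threshold, tick granularity, and the mitigation action below are illustrative assumptions following the general description in the talk, not the deployment's actual parameters:

```python
# Sketch of a PFC pause-storm watchdog: if a (port, priority) queue has
# been continuously paused past a threshold, stop honoring PFC on it and
# drop its packets instead of back-pressuring upstream switches.

PAUSE_STORM_THRESHOLD_MS = 100  # hypothetical threshold

class PauseWatchdog:
    def __init__(self, threshold_ms=PAUSE_STORM_THRESHOLD_MS):
        self.threshold_ms = threshold_ms
        self.paused_ms = {}  # (port, priority) -> continuous pause time

    def on_tick(self, port, priority, is_paused, tick_ms=10):
        key = (port, priority)
        if not is_paused:
            self.paused_ms[key] = 0  # pause ended; reset the timer
            return "ok"
        self.paused_ms[key] = self.paused_ms.get(key, 0) + tick_ms
        if self.paused_ms[key] >= self.threshold_ms:
            # Break the domino chain of pause frames for this queue.
            return "disable_pfc_and_drop"
        return "ok"

wd = PauseWatchdog()
for _ in range(9):
    wd.on_tick(port=1, priority=3, is_paused=True)
print(wd.on_tick(port=1, priority=3, is_paused=True))  # disable_pfc_and_drop
```

Dropping on the stuck queue trades losslessness on one broken path for liveness of the rest of the network, which is the point of the mechanism.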

A latency comparison between RoCEv2 and TCP shows that the former's 99.9th percentile latency is lower than even the 99th percentile of the latter. RoCEv2 also handles incast directly, since the network is lossless. On a network with 500 servers across two podsets, RDMA achieves 3 Tb/s of aggregate throughput. Similar measurements were performed while shuffling data: latency increases as shuffling occurs, and the authors observe that, in practice, it isn't possible to achieve both low latency and high throughput at once, since even a small amount of congestion raises latency.

In the future, the authors hope to explore RDMA for inter-datacenter communication, gain a deeper understanding of deadlocks, and develop more applications that use RDMA instead of TCP.

Q: You haven't talked about the classical problem of lossless networks (tree saturation): localized congestion that spreads out and blocks innocent-bystander traffic at other ports. Do you notice this?
A: We do experience this in our production networks, and we have several approaches to deal with it: DCQCN can be used, buffer sizes are tuned so that we can use dynamic buffering, and we have ways to deal with pause storms.

Q: Your network is supposed to be lossless. Why did you need retransmission despite that?
A: Lossless implies packets shouldn't be dropped due to congestion. But, there could be other kinds of packet drops like due to switch hardware issues or corruption problems. Hence, we need packet retransmission to fix those.
Q: Are those frequent?
A: The network is large scale, so we do need to handle them.

Q: Can you share details on workload on top of RDMA? Why are these latency sensitive and what are the latency requirements?
A: We currently have two types of workloads. The first is Bing traffic; indexing is very important there, and the requirement is to reduce latency to the millisecond level. The other type, usually storage, is throughput-centric and moves a lot of traffic, so there we need to reduce CPU overhead.

Q: Are there a lot of NIC firmware changes or OS-level changes?
A: To make the NIC work, the transport layer is in NIC firmware, with a driver on the host. But for mechanisms like PFC and the transport protocols, we avoid modifying NIC firmware ourselves; our providers supply us with the necessary NIC firmware.