Wednesday, August 24, 2016

NUMFabric: Fast and Flexible Bandwidth Allocation in Datacenters
Presenter: Kanthi Nagaraj
Co-authors: Dinesh Bharadia, Hongzi Mao, Sandeep Chinchali, Mohammad Alizadeh, Sachin Katti

In this paper, the authors design a mechanism that lets network operators allocate bandwidth among contending flows quickly and flexibly. They motivate the problem by pointing out that bandwidth allocation in datacenters is subject to different policies chosen by network operators. While several prior works optimize for a subset of these policies, no single mechanism lets operators optimize for an arbitrary policy based on workload requirements.

NUMFabric extends the classic Network Utility Maximization (NUM) framework so that it converges significantly faster and is much more robust. The key insight is to decouple the mechanism for maximizing network utilization from the mechanism for achieving the optimal relative bandwidth allocation across competing flows, which existing NUM algorithms do not do. NUMFabric does this with two logical layers: (1) a lower layer (which they call Swift transport) that performs weighted max-min fair rate allocation using dynamic flow weights, and (2) a higher layer (called xWI) that runs a distributed algorithm in which sources and switches exchange information in packet headers to iteratively compute the weights that flows should use to achieve the optimal bandwidth allocation.
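To make the lower layer's goal concrete, here is a minimal sketch of weighted max-min fair allocation using centralized progressive filling. It is not the Swift transport itself, which the talk describes as part of a distributed design. It only illustrates the allocation that a weighted max-min layer aims for once the weights are given. The function name, the dictionary layout, and the example links and flows are all assumptions made for illustration.

```python
def weighted_max_min(capacity, paths, weights):
    """Illustrative weighted max-min fair rates via progressive filling.

    capacity: {link: capacity}            (hypothetical layout)
    paths:    {flow: set of links used}
    weights:  {flow: positive weight}
    """
    for flow, links in paths.items():
        if not links:
            raise ValueError(f"flow {flow!r} traverses no links")

    remaining = dict(capacity)   # capacity not yet handed out on each link
    active = set(paths)          # flows whose rate is not yet fixed
    rates = {}

    while active:
        # Rate per unit of weight that each link could still give
        # to the active flows crossing it.
        share = {}
        for link, cap in remaining.items():
            total_w = sum(weights[f] for f in active if link in paths[f])
            if total_w > 0:
                share[link] = cap / total_w

        # The link with the smallest per-weight share saturates first.
        bottleneck = min(share, key=share.get)
        level = share[bottleneck]

        # Fix the rates of all active flows crossing the bottleneck.
        for f in [f for f in active if bottleneck in paths[f]]:
            rates[f] = weights[f] * level
            for link in paths[f]:
                remaining[link] -= rates[f]
            active.remove(f)

    return rates


# Two flows share one link of capacity 10 with weights 1 and 4.
single = weighted_max_min({"l1": 10.0},
                          {"a": {"l1"}, "b": {"l1"}},
                          {"a": 1.0, "b": 4.0})

# Flow "b" is also limited by a second, smaller link.
multi = weighted_max_min({"l1": 10.0, "l2": 4.0},
                         {"a": {"l1"}, "b": {"l1", "l2"}},
                         {"a": 1.0, "b": 1.0})
```

In the first example the flows split the link in proportion to their weights. In the second, flow "b" is capped by the smaller link and flow "a" picks up the leftover capacity on the shared link. Changing the weights changes the split, which is the lever the higher layer uses.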

The results show that, compared to gradient-descent-based solutions, NUMFabric converges to the optimal allocations 2.3× faster at the median and 2.7× faster at the 95th percentile.

Q: What objectives can NUMFabric not support?
A: It doesn’t support anything that classical NUM does not support. It supports all convex functions.

Q: How different is it from RCP?
A: They are very similar.

Q: To what extent is it tied to symmetric topology? Will it work for asymmetric topology?
A: It is not tied to any particular topology. We also ran some experiments with ISP-based topologies. One issue could be that we assume all events are synchronous, which might not be the case in an ad-hoc topology.

Q: How frequently do you calculate prices on switches?
A: It depends on how dynamic the workload is. One restriction is we must not make changes faster than the time it takes to converge.

Q: Do you have a full implementation of your system?
A: The proposal requires changes to switches, so we do not have a full implementation, only simulation results.

Q: Do you have a control knob for weight allocation?
A: We do not have any particular knobs.