Tuesday, August 13, 2013

SIGCOMM2013: Achieving High Utilization with Software-Driven WANs  

This is a report on the presentation given by Chi-Yao Hong on 2013-08-13. The paper's co-authors are Srikanth Kandula, Ratul Mahajan, and Ming Zhang (Microsoft Research), Vijay Gill and Mohan Nanduri (Microsoft), and Roger Wattenhofer (ETH Zurich).

Inter-DC WANs are a critical and expensive resource. In this talk, Chi-Yao discussed two key problems in inter-DC WANs, namely poor efficiency and poor sharing. He then explained the reasons behind them:
  1. lack of coordination
  2. local, greedy resource allocation
Chi-Yao presented a system called Software-Driven WAN (SWAN) that enables inter-DC WANs to carry significantly more traffic. Based on the current service demand and network topology, SWAN decides how much traffic each service can send. SWAN was evaluated using a testbed prototype and data-driven simulation. The results show that SWAN carries 60% more traffic than current practice.
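To make the idea of a controller deciding how much each service can send more concrete, here is a minimal sketch. It is not SWAN's algorithm: it assumes a single shared link and a max-min fair (water-filling) policy, neither of which is stated in this report, and the function name `max_min_fair` and the service names are made up for illustration.

```python
def max_min_fair(demands, capacity):
    """Split `capacity` among services by water-filling (an assumed policy).

    demands: dict mapping a hypothetical service name to its demand.
    Returns a dict mapping each service to its allocated rate.
    """
    alloc = {name: 0.0 for name in demands}
    remaining = float(capacity)
    active = {name for name, d in demands.items() if d > 0}

    while active and remaining > 1e-9:
        share = remaining / len(active)
        # Services whose unmet demand fits in an equal share get all they asked for.
        satisfied = {n for n in active if demands[n] - alloc[n] <= share}
        if not satisfied:
            # Nobody can be fully satisfied: split what is left equally and stop.
            for n in active:
                alloc[n] += share
            remaining = 0.0
            break
        for n in satisfied:
            remaining -= demands[n] - alloc[n]
            alloc[n] = float(demands[n])
        active -= satisfied
    return alloc


if __name__ == "__main__":
    # Three hypothetical services sharing one link of capacity 12.
    print(max_min_fair({"interactive": 2, "elastic": 8, "background": 10}, 12))
```

The point of the sketch is only that a central decision uses global knowledge of demand and capacity, in contrast to the local, greedy allocation named above.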

Chi-Yao concluded his talk with this observation: achieving high utilization is easy, but coupling it with flexible sharing and change management is hard.

Q&A session:

Q1. How do you compare SWAN with z-UPDATE?
A: SWAN targets a very different scenario: an inter-DC wide area network where bandwidth is expensive and we want to fully utilize all the network resources. z-UPDATE, on the other hand, targets intra-DC networks and therefore has a different set of constraints. For example, congestion during updates may not be as severe a problem inside data centers, since datacenter networks typically have highly provisioned capacity. Also, the transition delay from one state to another inside a DC is multiple orders of magnitude smaller than in inter-DC networks.

Q2. When performing a congestion-free update, you need to wait until all the updates within a step have been applied before you can move to the next step. Does this raise a concern about long update times?
A: That's a valid concern. One nice property we provide is that the updates within a step can be applied in any order, so the SWAN controller can send a batch of update commands (instead of sending them one by one in a fixed sequence) to minimize the time each step takes. In practice, we still need to query the forwarding state after sending the update commands to ensure that they have taken effect, and that acknowledgment is exactly the signal that the step is complete.
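The step-by-step procedure described in this answer can be sketched roughly as follows. This is an illustrative simulation, not SWAN's controller code: `FakeSwitch`, `send_update`, `query_state`, and `apply_plan` are hypothetical names, and the random delay simply stands in for a switch taking some time to install a rule.

```python
import random


class FakeSwitch:
    """Hypothetical switch: an update takes effect only after a few polls."""

    def __init__(self, name):
        self.name = name
        self.state = None
        self._pending = None
        self._polls_left = 0

    def send_update(self, new_state):
        self._pending = new_state
        self._polls_left = random.randint(0, 3)  # simulated install delay

    def query_state(self):
        if self._pending is not None:
            if self._polls_left == 0:
                self.state = self._pending
                self._pending = None
            else:
                self._polls_left -= 1
        return self.state


def apply_plan(switches, plan, max_polls=100):
    """Apply `plan` (a list of steps, each mapping switch name to new state).

    Commands within a step are sent as one batch in arbitrary order; the next
    step starts only after every switch in the step reports the new state.
    """
    completed = []
    for step_no, step in enumerate(plan):
        names = list(step)
        random.shuffle(names)  # order within a step does not matter
        for n in names:
            switches[n].send_update(step[n])
        waiting = set(names)
        for _ in range(max_polls):
            # Query forwarding state; keep only switches not yet updated.
            waiting = {n for n in waiting if switches[n].query_state() != step[n]}
            if not waiting:
                break
        else:
            raise TimeoutError(f"step {step_no} not confirmed: {sorted(waiting)}")
        completed.append(step_no)
    return completed


if __name__ == "__main__":
    sw = {n: FakeSwitch(n) for n in ("s1", "s2")}
    print(apply_plan(sw, [{"s1": "v1"}, {"s1": "v2", "s2": "v2"}]))
```

In this sketch the time for a step is bounded by the slowest switch in the batch rather than by the sum of all switches, which is the benefit the answer points to.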

Q3. Why is achieving high utilization important here?
A: You can deliver more traffic with the same amount of investment. Or you can save money by delaying the point at which you need to expand your network as services grow.