Wednesday, August 24, 2016

NUMFabric: Fast and Flexible Bandwidth Allocation in Datacenters
Presenter: Kanthi Nagaraj
Co-authors: Dinesh Bharadia, Hongzi Mao, Sandeep Chinchali, Mohammad Alizadeh, Sachin Katti

In this paper, the authors focus on designing a mechanism by which network operators can allocate bandwidth amongst contending flows in a fast and flexible manner. They motivate this problem by pointing out that bandwidth allocation in datacenters is subject to different policies as decided by network operators, and while several works try to optimize a subset of these policies, there is no single mechanism that allows operators to optimize for any given policy based on workload requirements.

NUMFabric extends the classic Network Utility Maximization (NUM) framework so that it converges significantly faster and is much more robust. The key insight here is to decouple the mechanism for maximizing network utilization from the mechanism for achieving the optimal relative bandwidth allocation across competing flows, something that existing NUM algorithms do not do. NUMFabric does this with two logical layers: (1) a lower layer (called Swift transport), which performs weighted max-min fair rate allocation using dynamic flow weights, and (2) a higher layer (called xWI), which employs a distributed algorithm in which sources and switches exchange information in packet headers to iteratively compute the weights that flows should use to achieve the optimal bandwidth allocation.
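To make the lower layer concrete, here is a minimal, centralized water-filling sketch of weighted max-min fair allocation, the objective Swift computes distributedly in-network; the links, routes, and weights below are hypothetical.

```python
# Hedged sketch: centralized weighted max-min fair allocation via water-filling.
# NUMFabric's Swift layer computes this objective distributedly; this only
# illustrates the allocation it converges to. All inputs are made up.

def weighted_max_min(flows, links):
    """flows: {flow: (weight, set_of_links)}; links: {link: capacity}."""
    rates = {}
    cap = dict(links)
    active = dict(flows)
    while active:
        # For each link, the fair share per unit weight of the flows crossing it.
        share = {
            l: cap[l] / sum(w for w, path in active.values() if l in path)
            for l in cap
            if any(l in path for _, path in active.values())
        }
        if not share:
            break
        bottleneck = min(share, key=share.get)
        s = share[bottleneck]
        # Freeze every flow crossing the bottleneck at weight * fair share.
        for f, (w, path) in list(active.items()):
            if bottleneck in path:
                rates[f] = w * s
                for l in path:
                    cap[l] -= rates[f]
                del active[f]
    return rates

# Two flows share link 'a'; flow f2 continues over link 'b'.
print(weighted_max_min({"f1": (2, {"a"}), "f2": (1, {"a", "b"})},
                       {"a": 30.0, "b": 40.0}))
# -> f1 gets 20, f2 gets 10 (weighted 2:1 on the shared bottleneck link 'a').
```

xWI's job is then to adjust the weights fed into this allocation until the resulting rates solve the operator's NUM objective.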

The results show that compared to gradient descent based solutions, NUMFabric converges to the optimal allocations 2.3× faster at the median and 2.7× faster at the 95th percentile.

Q: What objectives can NUMFabric not support?
A: It supports everything classical NUM supports, i.e., all convex utility functions.

Q: How different is it from RCP?
A: They are very similar.

Q: To what extent is it tied to symmetric topology? Will it work for asymmetric topology?
A: It is not tied to any particular topology. We also did some experiments with an ISP-based topology. One caveat is that we assume all events are synchronous, which might not be the case in an ad-hoc topology.

Q: How frequently do you calculate prices on switches?
A: It depends on how dynamic the workload is. One restriction is that we must not make changes faster than the time it takes to converge.

Q: Do you have a full implementation of your system?
A: The proposal requires changes to switches, so we don't have a full implementation, just simulation results.

Q: Do you have a control knob for weight allocation?
A: We do not have any particular knobs.
Scheduling Mix-flows in Commodity Datacenters with Karuna
Presenter: Li Chen
Co-authors: Kai Chen, Wei Bai, Mohammad Alizadeh

The key problem the authors are trying to solve is how to schedule a mixture of deadline and non-deadline flows such that most deadlines are met for the deadline-sensitive flows while, at the same time, flow completion times for non-deadline flows stay low. The authors argue that current solutions tend to optimize for one type of flow while hurting the other. For example, SJF improves flow completion times for non-deadline flows but increases the deadline miss rate for deadline-sensitive flows when those flows are large; at the other end of the spectrum, EDF improves the deadline meet rate but increases the flow completion times for non-deadline flows due to its use of strict prioritization.

The key insight the paper uses is to ensure that deadline-sensitive flows just meet their deadlines rather than finishing very early, and to distribute the bandwidth left over among the non-deadline flows. The implementation uses a multi-level priority queue, where deadline-sensitive flows are put in the highest-priority queue and non-deadline flows are put in lower-priority queues (using SJF as the priority scheme). To account for non-deadline flows of unknown size, they use PIAS to decide which priority queue to place a flow in.
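As a rough illustration of this queue structure, here is a hedged sketch of the assignment logic; the queue indices and PIAS-style demotion thresholds are hypothetical, not the paper's calibrated values.

```python
# Hedged sketch of Karuna-style queue assignment. The queue indices and the
# PIAS-style demotion thresholds below are hypothetical placeholders.

HIGHEST = 0  # deadline flows go here
PIAS_THRESHOLDS = [100 * 1024, 1024 * 1024]  # bytes sent before demotion (assumed)

def assign_queue(flow):
    if flow["deadline"] is not None:
        # Deadline flows: highest-priority queue; their rate is separately
        # controlled so they *just* meet the deadline, freeing bandwidth.
        return HIGHEST
    if flow["size"] is not None:
        # Non-deadline flows with known size: SJF-like priority --
        # smaller flows get higher-priority (lower-numbered) queues.
        return 1 if flow["size"] < 100 * 1024 else 2
    # Unknown size: PIAS-style aging -- start high, demote as bytes accumulate.
    sent = flow["bytes_sent"]
    for i, threshold in enumerate(PIAS_THRESHOLDS):
        if sent < threshold:
            return 1 + i
    return 1 + len(PIAS_THRESHOLDS)

print(assign_queue({"deadline": 0.01, "size": None, "bytes_sent": 0}))    # 0
print(assign_queue({"deadline": None, "size": 50_000, "bytes_sent": 0}))  # 1
print(assign_queue({"deadline": None, "size": None, "bytes_sent": 2**21}))# 3
```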

The evaluation was done using a small testbed of 16 servers and large-scale ns-3 simulations. The results show that the system achieves lower flow completion times for non-deadline flows, by up to 47.78% at heavy load compared to pFabric, while at the same time maintaining low (<5.8%) deadline miss rates.

Q: How long does large scale simulations take? Is your simulation code open source?
A: It takes a long time. The kernel module is open source; we will make the simulation code open source sometime in the future.

Q: Why not just use WFQ?
A: The demand for deadline traffic is dynamic, so we use dynamic WFQ.

Q: Do you differentiate between small deadlines and large deadlines?
A: We use whatever deadline values given by the application, and do not make any specific differentiation.

Q: Did you measure avg flow completion time for deadline flows? Is it worse than traditional solutions?
A: We trade-off avg flow completion time for deadline flows for smaller tail latency of non-deadline flows.

Q: Will your algorithm work for fat-tree topology?
A: It should work. Queue estimation might be an issue, but we can change the constant parameter in the queue-estimation part of the algorithm to accommodate topology changes.

Q: Since you are pushing very close to the deadlines for the deadline sensitive flows, do you take into account bursty deadline traffic or transient failures to make sure that even in these situations we do not miss deadlines?
A: Karuna is by design not optimal for deadline flows. We make a trade-off to account for non-deadline flows.
CODA: Toward Automatically Identifying and Scheduling COflows in the DArk
Presenter: Hong Zhang
Co-authors: Li Chen, Bairen Yi, Kai Chen, Mosharaf Chowdhury, Yanhui Geng

The key problem the authors are trying to solve in this paper is how to automatically identify and schedule coflows without any application modifications. The key motivation behind this work is the fact that, for coflows to work, all the distributed data-parallel applications in a shared cluster have to be modified to use the same coflow API, which the authors point out is infeasible to enforce in many cases.

CODA is the mechanism designed by the authors, which tries to identify coflows without modifying applications while ensuring that the coflow scheduler is robust enough that, even when coflow identification is imperfect (which will invariably be the case), the performance of the system is still significantly better than per-flow mechanisms.

CODA uses DBSCAN as its base algorithm to identify coflows. For this, it uses both explicit attributes, like arrival times and packet headers, and implicit attributes, like communication patterns. Further, to tolerate coflow identification errors, CODA delays the assignment of flows to particular coflows until they must be scheduled, reducing the number of stragglers, and couples this late binding with intra-coflow prioritization (prioritizing small flows inside a coflow) to reduce stragglers further.
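A minimal sketch of the identification step, using scikit-learn's DBSCAN on toy flow attributes; the features and their scaling are simplified stand-ins for CODA's explicit and implicit attributes and its learned distance metric.

```python
# Hedged sketch: clustering flows into coflows with DBSCAN, in the spirit of
# CODA's identifier. The two toy attributes and the weights are assumptions,
# not the paper's actual attribute set or distance function.
import numpy as np
from sklearn.cluster import DBSCAN

# One row per flow: [start_time_ms, application_master_id] (toy attributes).
flows = np.array([
    [100, 10], [102, 10], [101, 10],   # likely one coflow
    [500, 20], [503, 20],              # likely another
    [900, 30],                         # a singleton / noise
], dtype=float)

# Weight attributes so a few ms of arrival skew matters less than belonging
# to a different application master (weights are made up).
scaled = flows * np.array([1.0, 50.0])

labels = DBSCAN(eps=60.0, min_samples=2).fit_predict(scaled)
print(labels)  # e.g. [0 0 0 1 1 -1]: two coflows plus one unclustered flow
```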

For evaluation, the authors used both large-scale trace-driven simulations and a small-scale testbed with 40 servers. Key results include over 95% identification accuracy and, on traces obtained from Facebook, improvements in completion time of 2x on average (5x at the 95th percentile) compared to per-flow mechanisms.

Future work includes applying CODA to a more varied set of applications, extending CODA to take coflow dependencies into account, and studying how to perform error-tolerant scheduling in general.

Q: You use ML to classify applications for scheduling, but this comes with a danger: changing a few lines of code might make the algorithm misclassify, so programmers will now have to take this into account while making code changes.
A: As long as programmers do not change low-level APIs, it should not have much effect. We will explore this further, but we expect it not to be a very big issue.

Q: Is there a more principled approach to solving this problem rather than using an ML algorithm?
A: We use a very simple learning classification algorithm and try to understand which attributes are most effective by adding humans to the loop. Applying this to different frameworks still needs work, as the attributes might change.

2DFQ: Two-Dimensional Fair Queueing for Multi-Tenant Cloud Services
Presenter: Jonathan Mace
Co-authors: Peter Bodik, Madanlal Musuvathi, Rodrigo Fonseca, Krishnan Varadarajan

The key problem the paper is trying to solve is providing resource fairness to multiple tenants running in cloud-based systems. The paper makes the key observation that in cloud-based systems, multiple tenants tend to compete for resources within the same shared process. The paper further points out that resource allocation done by the operating system or hypervisor is not suitable for these scenarios, and that various fair schedulers like WFQ also achieve sub-optimal fairness.

With the above motivation in mind, the paper proposes a new fair scheduler, 2DFQ. There are two main goals that 2DFQ tries to achieve: (1) ensure that the scheduler is work conserving, and (2) ensure that the schedule it produces is not bursty. The key insight 2DFQ uses is to modify the WF2Q algorithm such that a request becomes eligible to be scheduled at different times on different worker threads, thus partitioning requests across threads. The paper further extends 2DFQ to scenarios where request costs are not known in advance, via pessimistic cost estimation.
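The following is a loose, hedged illustration of the partitioning idea only, not 2DFQ's actual eligibility rule: requests carry WFQ-style finish tags, and each worker thread is restricted (by assumed cost bands) to requests below a cost cap, so cheap requests never queue behind expensive ones.

```python
# Loose illustration (NOT the real 2DFQ eligibility condition): per-tenant
# virtual finish tags plus per-thread cost bands. Bands and counts are made up.
import heapq

BANDS = [10, 10, 100, float("inf")]  # max request cost eligible on each thread

class Scheduler:
    def __init__(self):
        self.vtime = {}   # per-tenant virtual finish tag
        self.queue = []   # (finish_tag, seq, tenant, cost)
        self.seq = 0

    def enqueue(self, tenant, cost, weight=1.0):
        start = self.vtime.get(tenant, 0.0)
        finish = start + cost / weight
        self.vtime[tenant] = finish
        heapq.heappush(self.queue, (finish, self.seq, tenant, cost))
        self.seq += 1

    def dequeue(self, thread):
        # Serve the smallest-finish-tag request this thread may handle.
        eligible = [r for r in self.queue if r[3] <= BANDS[thread]]
        if not eligible:
            return None
        r = min(eligible)
        self.queue.remove(r)
        heapq.heapify(self.queue)
        return r

s = Scheduler()
s.enqueue("A", cost=5); s.enqueue("B", cost=500); s.enqueue("A", cost=5)
print(s.dequeue(0))  # (5, 0, 'A', 5)   -- thread 0 serves only cheap requests
print(s.dequeue(0))  # (10, 2, 'A', 5)  -- cheap requests stay smooth
print(s.dequeue(3))  # (500, 1, 'B', 500) -- only thread 3 takes the big one
```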

For evaluation, the authors compared 2DFQ to WFQ and WF2Q. Key results include a one-to-two order of magnitude reduction in service lag variation under 2DFQ, and improvement in the 99th percentile latency of tenant requests by up to 198x.

Q: The results depend on the predictability of workloads, so how do you account for workloads whose predictability changes over time?
A: The estimates used by the algorithm decay relatively quickly over time.

Q: This algorithm is work conserving, so there is some pattern that will drive it to a worst-case scenario. How do you account for that?
A: You can sacrifice work conserving property so that this doesn’t happen, but it can adversely affect the performance in other more common scenarios.

Q: Do you have any complexity results in the paper?
A: We thought about this, but only in a practical sense: whatever the complexity is, it works fine in practice.

Q: Do you account for dependencies between different threads?
A: The assumption is that there is no dependency or communication between threads.

Q: How difficult would it be to incorporate dependencies between the threads?
A: We do not have a concrete solution for that right now.

Trumpet: Timely and Precise Triggers in Data Centers

Masoud Moshref (USC)
Minlan Yu (USC)
Ramesh Govindan (USC)
Amin Vahdat (Google, inc.)

Growth of data centers in scale, speed, and link utilization necessitates programmed, precise, and quick detection of events by the monitoring system. Such an active monitoring system aids network management tasks by maintaining availability, performance, and security. The paper presents Trumpet, an event monitoring system in which users provide a set of event definitions, which are then installed as triggers by the controller on end-hosts. The end-hosts evaluate the triggers against incoming packets and send the trigger evaluation results back to the controller, which aggregates them. End-hosts are chosen for trigger evaluation because they are easily programmable, have sufficient processing power for fine timescales, and already inspect every packet. Events are defined in terms of packet drops, delays, and duplicates; some examples are transient congestion, burst loss, load imbalance, traffic surges, and link failures.

Trumpet provides an event definition language that allows users to define network events by specifying a filter, which identifies the set of target packets over which a predicate is evaluated; users can also specify a spatio-temporal granularity for aggregation. Trumpet consists of two components: the Trumpet Event Manager (TEM) at the controller and a Trumpet Packet Manager (TPM) at each end-host. Users submit events to the TEM, which determines the set of end-hosts that should monitor them and installs triggers at the respective TPMs. The TPM is co-located with the software switch on a single core, which conserves CPU resources and avoids inter-core synchronization. The TPM has two phases: 1. match-and-scatter (incoming packets are matched to a 5-tuple flow and the statistics are stored) and 2. gather-test-and-report (at the specified time granularity, it gathers the statistics and evaluates the predicate). The match phase is optimized by means of caching and careful TLB usage. The second phase is performed as an off-path operation; but an improperly scheduled off-path operation can itself delay packet processing, so to bound this delay they run the off-path phase when the packet queue is low and the on-path phase otherwise.
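A hedged sketch of these two phases in Python; the trigger's (filter, predicate, granularity) shape follows the paper's event language, but the field names and toy thresholds are made up for illustration.

```python
# Hedged sketch of Trumpet's two end-host phases. All concrete names and
# thresholds below are assumptions, not the paper's API.
from collections import defaultdict

class Trigger:
    def __init__(self, filt, predicate, granularity_ms):
        self.filt = filt                    # which packets to track
        self.predicate = predicate          # evaluated over gathered stats
        self.granularity_ms = granularity_ms

# Phase 1 -- match-and-scatter: per-5-tuple statistics, updated on-path.
stats = defaultdict(lambda: {"bytes": 0, "pkts": 0})

def on_packet(pkt, triggers):
    for t in triggers:
        if t.filt(pkt):
            s = stats[pkt["five_tuple"]]
            s["bytes"] += pkt["len"]
            s["pkts"] += 1

# Phase 2 -- gather-test-and-report: run off-path every granularity_ms.
def gather_test_and_report(trigger, report):
    total = sum(s["bytes"] for s in stats.values())
    if trigger.predicate(total):
        report(total)        # satisfied triggers are reported to the TEM
    stats.clear()            # start the next epoch

# Toy event: report if more than 1000 bytes go to port 80 within 10 ms.
t = Trigger(lambda p: p["dport"] == 80, lambda total: total > 1000, 10)
on_packet({"five_tuple": ("10.0.0.1", "10.0.0.2", 1234, 80, 6),
           "dport": 80, "len": 1500}, [t])
gather_test_and_report(t, lambda total: print("event fired:", total))
```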

Trumpet is implemented in C, and the evaluation is done on a 10-core Intel machine with a 10G NIC. Trumpet can process 14.8 Mpps of 64-byte packets at 10G and 650-byte packets at 4x10G, while evaluating 4K triggers at 10ms granularity. At moderate packet rates, it can detect events at 1ms granularity. At full packet rate, it can process 16K triggers without delaying packets by more than 10 microseconds.

Q&A:

Q: What if you have a 10 ms congestion event in one of the switches inside the data center? Then you have to go through all the flows to find the endpoint of the congestion, and the problem becomes much harder, right? I am assuming you don't always know the exact route and don't have any path information.
A: That's a problem we have to solve. Certainly what you could do is, you could proactively install these events on all the servers and certainly the kinds of numbers that I presented will let you scale to that. There is an underlying assumption though, that perhaps the controller has some of the routing information that lets you find at least potential locations where you can install these events. It is not something which we have looked at, but it is a capability that would be necessary.

Q: You mentioned that Trumpet runs on the same CPU as the virtual switch; is it programmed into the virtual switch itself or separate?
A: If the software switch runs on multiple cores, we have not explored the performance impact of that, in particular with inter-core synchronization; I suspect something would be required. But we have thought about how you would scale Trumpet to larger bandwidth settings: one of the basic approaches would be to devote multiple CPUs and design your data structures so that they share nothing.

Tuesday, August 23, 2016

Dynamic Pricing and Traffic Engineering for Timely Inter-Datacenter Transfers



Virajith Jalaparti (Microsoft)
Ivan Bliznets (Microsoft)
Srikanth Kandula (Microsoft)
Brendan Lucier (Microsoft)
Ishai Menache (Microsoft)


Bandwidth on the wide area network is a costly and important resource. Providers like Google and Microsoft carry both user traffic (which requires low latency) and large amounts of business data (which require meeting deadlines). Various centralized traffic engineering (TE) techniques have been proposed, but these techniques depend on user input and knowledge of detailed traffic information, the priority class of requests, and latency and deadline requirements. Users do not provide these faithfully, as they are known to inflate priorities and deadlines in hopes of getting better service. In today's world, WAN customers do not have any guarantees on data transfers; in fact, providers have no mechanism in place for users to specify their requirements.

In a survey conducted by the authors, they find that WAN customers would even pay more for guaranteed deadlines. There is also a disconnect between pricing models and providers' costs: typically the 95th percentile usage of a link is charged, while static capacity costs depend on peak utilization.

In this paper, the authors present a system called Pretium that combines traffic engineering with dynamic pricing. The design goals of the system are:

  1. Each user should be presented a menu of prices up front, from which they can select a service level.
  2. Prices should align the preferences of individual users with the goals and costs of the platform.
  3. Prices should be dynamic, adapting to changes in demand and usage patterns.
  4. The pricing model should complement, not preclude, centralized traffic engineering.

Pretium serves both byte requests and rate requests. In Pretium, a customer specifies basic information and is immediately presented with a price quote for various service levels. The customer can then choose a level, or modify the request if nothing is appropriate.

The selected transfer is then managed by a central scheduler, tasked with upholding the service level guarantees. Pretium updates pricing in real time using a feedback loop. The authors show that the sum of the top 10% usages can serve as a proxy for the complex 95th-percentile usage cost model.
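A quick synthetic check of that proxy claim; the traffic samples below are random lognormal draws, purely for illustration.

```python
# Compare the 95th-percentile usage (the common billing model) with the sum
# of the top 10% of samples (Pretium's proxy) on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
usage = rng.lognormal(mean=3.0, sigma=1.0, size=1000)  # e.g. 5-min usage samples

p95 = np.percentile(usage, 95)
top10_sum = np.sort(usage)[-len(usage) // 10:].sum()

print(f"95th percentile: {p95:.1f}")
print(f"sum of top 10%:  {top10_sum:.1f}")
# The top-10% sum is linear and decomposable per request, which is the
# intuition behind using it in place of the 95th-percentile cost model.
```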

They measure social welfare, defined as the total value generated (over all requests served) minus operating costs, and evaluate Pretium on the topology and traffic observed in a large production WAN. They show that Pretium achieves more than 60% of the optimal welfare, which is 3.5x higher than a region-based fixed pricing scheme. Further, Pretium results in more than 2x higher profit for the provider. They demonstrate that the prices picked by Pretium adapt to load variations and that Pretium is robust to variations in network and request characteristics.

Q1) How are you doing, or planning to do, congestion pricing?

A) We use, to some extent, congestion pricing techniques in our paper.

Q2) What happens when you cannot guarantee the transfer? What sort of mechanisms are in place to handle multiple requests by a customer who is trying to game the system?

A) We presume that we are not going to run out of bandwidth; we also set aside some capacity in case we run into this situation.

If some requests are colluding, our system makes sure that you cannot gain benefits using this approach; you will only get benefits if you do not split requests.

Q3) You show that the optimal choice = value - cost; we don't know the value, but we know the cost and estimate the value. How is this different from maximum profit?

A) We maximize welfare, not profit. In a maximum-profit strategy, the provider would set prices high. Welfare is not greedy, and prices stay close to value. A maximum-profit strategy is very hard; with welfare, if the customer is happy, you get profit.

Q4) How do you plan to compare against multiple providers?

A) We have thought about it, and this is very complex to analyze. We would have to simplify some assumptions, and coming up with an experiment is complex as well.

One Sketch to Rule Them All: Rethinking Network Flow Monitoring with UnivMon

Zaoxing Liu (Johns Hopkins University)
Antonis Manousis (Carnegie Mellon University)
Gregory Vorsanger (Johns Hopkins University)
Vyas Sekar (Carnegie Mellon University)
Vladimir Braverman (Johns Hopkins University)


The ability to accurately estimate network metrics like flow size distribution, heavy hitters, super-spreaders, entropy measures, and traffic pattern changes is useful for network management tasks like traffic engineering, accounting, worm detection, and anomaly detection. The paper presents UnivMon, a framework that achieves high fidelity across a range of monitoring tasks while maintaining generality (late binding) and avoiding application-specific customized algorithms. UnivMon also aims to provide a "one-big-switch" abstraction for monitoring across various dimensions through network-wide coordination schemes.

UnivMon's data plane uses primitives based on universal sketching, which requires no prior knowledge of the metrics, thereby providing the required late binding through the use of a general sketch. For any incoming stream of packets, UnivMon generates log(n) sub-streams by running the original stream through log(n) independent hash functions. These sub-streams are then processed in parallel to identify the L2 heavy hitters and compute counters. The counters are then processed in an offline stage where the application metrics are estimated. To support network-wide monitoring, the UnivMon control plane assigns individual switches sketching manifests, which are created with the goals of correctness and load balancing by solving a constrained optimization problem. The counters from individual switch sketches are then aggregated to provide network-wide monitoring.
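A hedged sketch of the sub-stream construction: each level keeps only the packets whose first i 0/1 hashes all come out 1, and a plain counter stands in for the real L2 (Count Sketch) heavy-hitter structure.

```python
# Hedged sketch of UnivMon-style recursive sub-sampling. The hash construction
# is illustrative; a production sketch would use proper pairwise-independent
# hashes and a Count Sketch per level instead of exact counters.
import hashlib
from collections import Counter

LEVELS = 8

def h(level, key):
    """Per-level 0/1 hash (illustrative, not cryptographically motivated)."""
    return hashlib.md5(f"{level}:{key}".encode()).digest()[0] & 1

def universal_sketch(stream):
    levels = [Counter() for _ in range(LEVELS)]
    for key in stream:
        levels[0][key] += 1
        for i in range(1, LEVELS):
            # A packet survives to level i only if hashes 1..i all return 1,
            # halving the expected stream at each level.
            if all(h(j, key) == 1 for j in range(1, i + 1)):
                levels[i][key] += 1
            else:
                break
    return levels

stream = ["10.0.0.1"] * 50 + ["10.0.0.2"] * 30 + [f"10.0.1.{i}" for i in range(20)]
for i, lvl in enumerate(universal_sketch(stream)[:4]):
    print(f"level {i}: heavy hitters {lvl.most_common(2)}")
# The offline stage combines the per-level heavy hitters and counts to
# estimate many metrics (entropy, change detection, ...) from one structure.
```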

UnivMon's data plane is implemented in P4 in four stages: sampling, sketching, heavy-hitter computation, and metric estimation. The first two stages are implemented in the data plane; the heavy-hitter computation stage is split between the data plane (heavy-hitter identification) and the control plane (heavy-hitter frequency computation). UnivMon is evaluated using CAIDA backbone traces. The authors compare the performance of UnivMon against OpenSketch for the following application metrics: heavy hitters, change detection, and entropy estimation. They found UnivMon to be competitive with the custom per-application algorithms, with a very low maximum error gap. UnivMon's memory requirements were also found to grow only logarithmically. For network-wide monitoring, UnivMon provides an even distribution of resources.

Q&A:

Q: Can your universal monitor be used for reversible sketches?
A: Yes, we can achieve the same result with sampling-based techniques instead of using reversible keys. This will be similar to the approach for heavy hitters.

Q: IPv4 addresses, which have 32 bits, lead to smaller tables; what happens with IPv6?

A: Memory usage increases only logarithmically, so this won't be a problem.

Optimizing Bulk Transfers with Software-Defined Optical WAN


Xin Jin (Princeton)
Yiran Li (Tsinghua)
Da Wei (Tsinghua)
Siming Li (Stony Brook)
Jie Gao (Stony Brook)
Lei Xu (Sodero Networks)
Guangzhi Li (AT&T Labs)
Wei Xu (Tsinghua)
Jennifer Rexford (Princeton)


Wide-area data transfers are a major part of today's globally-distributed applications, and they account for a large share of WAN traffic. Network operators must carefully schedule these transfers, as they can impact not only business success but also application performance (e.g., search result quality with an outdated index).

In modern WANs, the network layer is built over an intelligent optical layer: operators can dynamically change their network topology by re-configuring the optical devices. In this paper, the authors present Owan, a traffic management system that optimizes wide-area bulk transfers with centralized joint control of the optical and network layers. They use SDN along with ROADMs (Reconfigurable Optical Add-Drop Multiplexers: devices that connect a WAN router to an optical switch; operators can change which two router ports are connected by changing optical-layer circuits, and modern ROADMs allow fast remote re-configuration, e.g., provisioning a circuit in tens to hundreds of milliseconds instead of minutes). Owan orchestrates bulk transfers in a centralized manner. It computes and implements the optical circuit configuration (the optical circuits that implement the network-layer topology) and the routing configuration (the paths and rate allocation for each transfer) to optimize the transfer completion time or the number of transfers that meet their deadlines.

They build an Owan prototype using commodity hardware, with nine sites emulating the Internet2 topology. They conduct extensive evaluations through both testbed experiments and large-scale simulations with data from an ISP network and an inter-DC network. Their results show that Owan improves the transfer completion time by up to 4.45x on average and 3.84x at the 95th percentile, compared to prior methods that only control the network layer. Furthermore, Owan allows up to 1.36x more transfers to meet their deadlines and up to 2.03x more bytes to finish before their deadlines.

Q1) You give software-defined control to optical networks and configure optical circuits. How do you handle regeneration in the optical network, and what about load balancing the traffic?

A) We build a regeneration graph that captures the regeneration constraints and transforms the problem into a shortest-path problem. This helps us handle regeneration in the topology.



Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure


Ramesh Govindan (University of Southern California/Google)
Ina Minei (Google)
Mahesh Kallahalla (Google)
Bikash Koley (Google)
Amin Vahdat (Google)

Google runs three types of networks:
  1. Data Center Networks: designed from merchant silicon switches, with a logically centralized control plane; 
  2. B4: A software-defined WAN that supports multiple traffic classes and uses centralized traffic engineering
  3. B2: A Global WAN for user-facing traffic that employs decentralized traffic engineering
They strive to maintain high availability in these networks; typical availability targets are no more than a few minutes of downtime per month. Due to Google's scale, evolution velocity, and management complexity, it is hard to maintain availability in these networks. Despite these challenges, Google delivers high availability in its networks and services.

However, they do experience failure events and document these failures in the form of post-mortem reports and root-cause analyses. In this paper they analyze 103 such reports and find that the failure rate is comparable across all three networks. These failures occur in the data, control, and management planes. They also find that 80% of the failures last between 10 and 100 minutes. Nearly 90% of the failures have high impact: high packet losses, or blackholes to entire data centers or parts thereof. An interesting bit is that when most of these failures happen, a management operation is usually found to be in progress in the vicinity.

They classify these failures by root cause and find that, for each of the data, control, and management planes, the failures can be root-caused to a handful of categories, such as risk assessment failures; lack of consistency between control plane components; device resource overruns; link flaps; incorrectly executed management operations; and so forth. They provide examples of these failures in the paper.

Finally, they discuss high-availability design principles drawn from these failures. The first is defense in depth, which is required to detect and react to failures across the different layers and planes of the network, and can be achieved by containing the failure radius and developing fallback strategies such as a red button (software controls that let an operator trigger an immediate fallback). Second, fail open preserves the data plane when the control plane fails. Third, maintaining consistency across the data, control, and management planes can ensure safe network evolution. Fourth, careful risk assessment, testing, and a unified management plane can prevent or avoid failures. Fifth, fast recovery from failures is not possible without high-coverage monitoring systems and techniques for root-cause analysis.

They conclude that by applying these principles, together with the idea that the network should be continuously and incrementally evolved, they have managed to increase the availability of their networks even as their scale and complexity have grown manyfold.


Q1) Is there a place in this work for formal methods?
A) Our paper discusses this; formal methods have not been applied at large-scale networks. However, there is recent work on data plane and control plane verification, and there is room for work in formal methods.


Q2) Has Google ever pressed the red button?
A) The answer is beyond my pay grade.

SNAP: Stateful Network-Wide Abstractions for Packet Processing

Mina Tahmasbi Arashloo (Princeton University)
Yaron Koral (Princeton University)
Michael Greenberg (Pomona College)
Jennifer Rexford (Princeton University)
David Walker (Princeton University)

While early implementations of SDN only supported forwarding rules and counters at each switch, recent SDN proposals support general-purpose persistent state at each switch (not just at the controller). However, programming against this model is hard, as network operators have to correctly program a stateful distributed system.

SNAP provides network operators with a high-level language that creates the illusion that the network is a single switch with local state. SNAP then compiles a program written in this language into multiple programs that run distributed across the real network, with local state at each switch.
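A hedged Python illustration of the "one big switch" idea (this is not SNAP's actual syntax): the operator's program reads and writes one global state variable, and a compiler decision, vastly simplified here, pins that state to a single physical switch, matching the Q&A below that co-located state lives on one switch.

```python
# Hedged illustration only: a stateful program written against "one big
# switch", plus a toy placement decision. Names and the example policy are
# invented; SNAP's real language and compiler are far richer.

seen = set()  # the program's global state: source IPs observed so far

def program(pkt):
    """A stateful-firewall-flavored fragment over the one-big-switch view."""
    if pkt["in_port"] == "trusted":
        seen.add(pkt["src"])          # state write
        return "forward"
    return "forward" if pkt["src"] in seen else "drop"  # state read

# Compiler's view (grossly simplified): choose one switch to host `seen`,
# then steer matching traffic through that switch.
placement = {"seen": "s3"}
print(f"state 'seen' placed on switch {placement['seen']}")
print(program({"in_port": "trusted", "src": "10.0.0.9"}))    # forward, learns
print(program({"in_port": "untrusted", "src": "10.0.0.9"}))  # forward (seen)
print(program({"in_port": "untrusted", "src": "10.0.0.7"}))  # drop
```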

SNAP was evaluated on 7 campus and ISP topologies with hundreds of switches and links. SNAP compiled about 20 programs for these topologies in 5 seconds to 6 minutes each.

This work is exciting because it shows a way to extend the various existing high-level SDN programming languages to SDN with state. It may also be interesting for high-level languages for other networking protocols, like the Border Gateway Protocol, which requires persistent state at each router.

Q: (Aurojit Panda) When a switch fails, what happens with its state?
A: SNAP does not currently support failure; the switch loses its state.

Q: Do situations arise where state has to be transferred between switches?
A: If a program accesses multiple state variables together, all of them are stored on the same switch.

Q1: If there is a bug in a SNAP program, will it take down the entire network?
A1: Yes, a bug might bring down the entire network.
Q2: Joint optimization leads to lack of robustness!

Q: Can you do load balancing?
A: Imagine that a switch keeps some state for each IP address. You can shard this state across several switches by the IP address. We have not done this yet.

Link to paper.

Programmable Packet Scheduling at Line Rate

Anirudh Sivaraman (MIT CSAIL)
Suvinay Subramanian (MIT CSAIL)
Mohammad Alizadeh (MIT CSAIL)
Sharad Chole (Cisco Systems)
Shang-Tse Chuang (Cisco Systems)
Anurag Agrawal (Barefoot Networks)
Hari Balakrishnan (MIT CSAIL)
Tom Edsall (Cisco Systems)
Sachin Katti (Stanford University)
Nick McKeown (Stanford University)

Existing switches come with a small set of built-in scheduling algorithms. These can be tuned with parameters, but network operators and researchers cannot implement completely new algorithms. Schedulers have resisted the push toward programmability because there were no abstractions for schedulers that could be programmed at line rate.

This paper presents an abstraction for schedulers that is programmable, and for which evidence suggests a line-rate implementation is feasible. In this abstraction, the scheduler maintains priority queues from which it dequeues and forwards the packet with the smallest rank (highest priority). For traffic shaping, the rank can encode the wall-clock time at which a packet should be sent, and the scheduler holds the packet until then. The scheduler is programmed by setting a packet's rank during the enqueue operation; this enables programmability, as the rank can be computed with arbitrary programs in the ingress pipeline.
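A hedged sketch of the abstraction: a priority queue in which enqueue computes a rank with an arbitrary program and dequeue pops the smallest rank. The STFQ-style rank program below is simplified from the kind of example the paper discusses; the packets and lengths are made up.

```python
# Hedged sketch of a rank-based programmable scheduler. Two toy rank programs
# show how one queue realizes different disciplines.
import heapq, itertools

class RankQueue:
    """Enqueue with a program-computed rank; dequeue the smallest rank."""
    def __init__(self):
        self._q, self._seq = [], itertools.count()
    def enqueue(self, pkt, rank):
        heapq.heappush(self._q, (rank, next(self._seq), pkt))
    def dequeue(self):
        return heapq.heappop(self._q)[2] if self._q else None

# Rank program 1: strict priority by traffic class.
q1 = RankQueue()
q1.enqueue("bulk-pkt", rank=10)
q1.enqueue("voice-pkt", rank=0)
print(q1.dequeue())  # voice-pkt first

# Rank program 2: STFQ-like fair queueing -- the rank is a per-flow virtual
# start time, so flows A and B interleave regardless of arrival order.
virtual_start = {}
def stfq_rank(flow, length):
    start = virtual_start.get(flow, 0)
    virtual_start[flow] = start + length
    return start

q2 = RankQueue()
q2.enqueue("A-pkt1", stfq_rank("A", 1500))
q2.enqueue("A-pkt2", stfq_rank("A", 1500))
q2.enqueue("B-pkt1", stfq_rank("B", 1500))
print([q2.dequeue() for _ in range(3)])  # ['A-pkt1', 'B-pkt1', 'A-pkt2']
```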

The authors evaluate this new scheduling abstraction by implementing various scheduling algorithms against it. They also present a hardware design of their scheduler abstraction that works at line rate for up to 1024 flows, and estimate that the design requires about 4% additional chip area.

This work is exciting, as it shows a way to provide programmability for yet another aspect of the switch data plane.

Q1: How can a network operator influence when a packet is forwarded?
A1: A packet's rank represents the wall clock time at which the packet should be forwarded.
Q2: How is your approach different from a calendar queue?
A2: When applied to traffic shaping, it is exactly like a calendar queue.

Q: How can you scale up to routers that forward over 6 TBit/s?
A: We don't have a concrete design, but we believe this to be possible because people know how to scale FIFOs, and because our scheduler is implemented in flip flops.

Link to paper.

Packet Transactions: High-Level Programming for Line-Rate Switches

Anirudh Sivaraman (MIT CSAIL), Alvin Cheung (University of Washington, Seattle), Mihai Budiu (VMWare Research), Changhoon Kim (Barefoot Networks), Mohammad Alizadeh (MIT CSAIL), Hari Balakrishnan (MIT CSAIL), George Varghese (Microsoft Research), Nick McKeown (Stanford University), Steve Licking (Barefoot Networks)

Presenter: Anirudh Sivaraman 

What does a programmable switch mean? The ability to express data plane algorithms like active queue management, congestion control, etc. Line rate means the switch must forward packets at the highest possible rate (1-10 Gbps per port, with 10-100s of ports). Practical programmable chips that run at line rate (e.g., Flexpipe, XPliant) are available on the market nowadays. However, the hardware is dedicated to stateless tasks, which hinders data plane algorithms that require state at the switch. Also, the language for programming the switch is low-level (match-action pipelines), and it is difficult to translate these algorithms into it. The development of a stateful instruction set for switches is also required.

Toward programming data plane algorithms at line rate, the authors make the following contributions: a high-level abstraction called packet transactions to describe data plane algorithms; a low-level abstraction called atoms, which supports the development of a stateful instruction set; and a compiler to translate packet transactions to atoms.

Packet transactions are blocks of imperative code (in a language called Domino) that act on packets one at a time, serially, thus simplifying the development of data plane algorithms (there are no concurrency issues to worry about, and development can be agnostic to switch hardware); an example transaction implementing a count function was presented. Under the hood, however, execution is pipelined (and thus packets can be acted upon concurrently). The authors define an 'atom': an action unit that can read and modify the packet and also maintain local state. A switch model is thus multiple pipelines of atoms, and the atoms constitute the switch instruction set. An example atom can read the packet and store a packet field in memory. While stateless operations can be pipelined with ease, many stateful operations require atomic operations in hardware that must complete within a clock cycle to sustain line rate.
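A hedged mock-up of the count transaction from the talk, written in Python rather than Domino: the programmer sees strictly serial, one-packet-at-a-time semantics, while the comments note which statements would need stateful versus stateless atoms.

```python
# Hedged illustration of the packet-transaction model (Python standing in for
# Domino). The state layout and field names are assumptions for illustration.

count = 0  # switch state persisting across packets (atom-local register)

def count_transaction(pkt):
    """Count example: appears to run to completion on one packet at a time."""
    global count
    count += 1            # read-modify-write -> requires a stateful atom
    pkt["count"] = count  # packet-field write -> a stateless atom suffices
    return pkt

for p in [{"src": "a"}, {"src": "b"}, {"src": "a"}]:
    print(count_transaction(p))
# The hardware pipeline may overlap packets; that is safe only because the
# compiler maps the read-modify-write of `count` onto a single-cycle atom.
```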

Compiling a packet transaction against a given atom yields a pipeline of atoms that implements the transaction at line rate (or rejects the input if no solution can run at line rate). Switch designers can thus run multiple iterations of the compiler to find an atom expressive enough for the data plane algorithms of interest (the authors provide a list of atoms in order of expressiveness). The presenter then gave a demo finding the atom structure for two algorithms, learning bloom filters and heavy hitters, using different atoms. For evaluation, the authors show the expressiveness of packet transactions, which have LOC counts comparable to software routers. They evaluate the number of atoms required to implement different data plane algorithms and find they can implement most of them with < 100 atoms, which requires about 1% additional chip area. In conclusion, the authors present abstractions and a compiler to implement data plane algorithms in switches at line rate.

Q: Are packet transactions a good abstraction, given that they require strong serializability?
A: Weaker concurrency models were considered, but it is easier to implement algorithms with transactions. A concrete analysis of different models is a future area of work.

Q: P4 is used to make OpenFlow more expressive. How does this system compare with P4?
A: A P4 program could be generated from packet transactions, which is an interesting proposition. 

Q: The Domino language does not support loops, which may be required for some algorithms.
A: Only loops with a constant number of iterations, which can be unrolled at compile time, can be handled by the system for now (this can be provided as syntactic sugar). Dynamic loop iterations cannot be performed.

ClickNP: Highly flexible and High-performance Network Processing with Reconfigurable Hardware

Bojie Li (USTC / Microsoft Research), Kun Tan (Microsoft Research), Layong (Larry) Luo (Microsoft), Yanqing Peng (SJTU / Microsoft Research), Renqian Luo (USTC / Microsoft Research), Ningyi Xu (Microsoft Research), Yongqiang Xiong (Microsoft Research), Peng Cheng (Microsoft Research), Enhong Chen (USTC)

Presenter: Bojie Li 

With the rising requirements on network functions (NFs) to perform various packet processing tasks, hardware NFs are not flexible enough. To counter this, virtual NFs are run on commodity servers. However, this software approach faces major scale-up challenges: the limited processing capacity of cores forces the use of large numbers of cores (100 cores for an NF at 40Gbps), and the latency of software NFs is inflated and unstable. Thus, the authors propose using FPGAs (found in modern NICs) to run NFs. FPGAs provide massive parallelism for packet processing, consume less power, and are cheap. The challenge of using FPGAs is programmability, as FPGAs are programmed in hardware description languages like Verilog, which are obscure for the software developers who build NFs.

The authors present ClickNP to ease FPGA programming for network functions. ClickNP allows programmers to develop NFs in a high-level language, supports modularization (like the Click modular router), synthesizes high-performance code, and provides functionality for joint CPU/FPGA processing.

NFs are developed in a multi-core programming model, but a difference in ClickNP is that cores share information via channels (rather than a shared memory model), preventing the shared-memory bottleneck. An 'element' is a building block that represents a single-threaded core, and the speaker described the components of an element. Two extensions were presented: an element can be executed on either CPU or FPGA cores. The architecture and runtime components of ClickNP were presented, with a packet logger as the example. The authors then presented several optimizations: parallelism across elements is improved by pipeline parallelism, while inside elements they propose delayed writes to remove read-write memory dependencies; unbalanced pipeline stages are tackled by offloading the slow path to another element.
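A hedged software sketch of the element/channel model: each element is a single-threaded loop, and elements communicate only through queues standing in for ClickNP channels (never shared memory). The two-element pipeline and its names are invented for illustration.

```python
# Hedged illustration only: Python threads and queues mimic ClickNP's
# single-threaded elements connected by channels.
import queue, threading

def element(run, inq, outq):
    """Wrap an element body as a single-threaded 'core' on two channels."""
    def loop():
        while True:
            item = inq.get()
            if item is None:            # shutdown token propagates downstream
                if outq is not None:
                    outq.put(None)
                return
            out = run(item)
            if outq is not None and out is not None:
                outq.put(out)
    return threading.Thread(target=loop)

c1, c2 = queue.Queue(), queue.Queue()   # channels between elements
parser = element(lambda pkt: {"len": len(pkt)}, c1, c2)
logger = element(lambda meta: print("logged:", meta), c2, None)

parser.start(); logger.start()
for pkt in (b"abc", b"hello"):
    c1.put(pkt)
c1.put(None)                            # drain and stop the pipeline
parser.join(); logger.join()
```

The design choice mirrors the Q&A below: channels avoid the shared-memory bottleneck at the cost of forbidding a shared-memory programming style.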

The authors then present the evaluation of ClickNP, implementing various components from the Click library (e.g., parser, checksum, AES). The NFs produced by ClickNP have high peak throughput and delay below 1ms. ClickNP also simplifies the development of NFs (each NF has tens of lines of code and was developed in about one week). The authors present two case studies, an IPSec gateway and an L4 load balancer built with ClickNP, and compare their performance with software NFs: the ClickNP NF provides 40 Gbps throughput (compared to 628 Mbps for the software NF) with low, stable latency (on the order of microseconds). They compare the resource utilization of ClickNP NFs against NFs handwritten in Verilog and find less than 2x overhead (not much of a concern, as FPGA resources are bound to increase in the future). In summary, ClickNP provides a practical platform for implementing NFs on FPGAs, with various advantages.

Q: There has been a lot of exciting work with P4. Can you provide a sense of how P4 fits with ClickNP?
A: Network Functions are generally complicated like encryption/decryption and stateful load balancers which are difficult to implement using P4. Integration with P4 is a future line of work.

Q: Why do ClickNP elements use channels instead of a shared memory model?
A: Primarily to avoid the bottleneck of shared memory, though using channels forces programmers to write programs without a shared memory model.




Monday, August 22, 2016

SIGCOMM 2016 - Opening Keynote, August 22nd, 2016 (Welcome to SIGCOMM)



Opening Keynote Agenda:
Welcome - General Chair
review process and best paper awards
Test of time and dissertation awards - Awards chair
SIGCOMM award winner

Some Statistics for SIGCOMM 2016:
383+ attendees
139+/73+ workshop/tutorial attendees
34+ countries
191+ students
122 travel award recipients

Overview of SIGCOMM
http://conferences.sigcomm.org/sigcomm/2016/index.php
41 full papers (2 from CCR)
5 workshops + 4 tutorials
21 posters, 12 main track poster, 18 demos and 8 industrial demos

Social Events:
welcome reception
N2women dinner
SIGCOMM awards dinner
Student dinner at Ataliba churrascaria
Conference banquet at Slaviero Ingleses

Highlights:
mentoring moments
topic review
community feedback session
live webcast of main conference sessions
Remote presentations with interactive Q/A.


From Awards Chair: Bruce Maggs 

Best Paper Award Winners:
1. Don't Mind the Gap: Bridging Network-wide Objectives and Device-level Configuration; Ryan Beckett (Princeton Univ.), Ratul Mahajan (Microsoft), Todd Millstein (UCLA), Jitendra Padhye (Microsoft)
2. Eliminating Channel Feedback in Next-Generation Cellular Networks; Deepak Vasisht (MIT), Swarun Kumar (CMU), Hariharan Rahul (MIT), Dina Katabi (MIT)
3. Inter-Technology Backscatter: Towards Internet Connectivity for Implanted Devices; Vikram Iyer (Univ. of Washington), Vamsi Talla (Univ. of Washington), Bryce Kellogg (Univ. of Washington), Shyamnath Gollakota (Univ. of Washington), Joshua Smith (Univ. of Washington)

Test of Time Award Winners
1. A First-Principles Approach to Understanding the Internet's Router-Level Topology, Proc SIGCOMM 2004, Lun Li, David Alderson, Walter Willinger, John Doyle
2. Link-Level Measurements from an 802.11b Mesh Network, Proc SIGCOMM 2004, Daniel Aguayo, John Bicket, Sanjit Biswas, Glenn Judd, Robert Morris

SIGCOMM dissertation Award:
1. Mosharaf Chowdhury (Advisor: Ion Stoica, U.C. Berkeley), Coflow: A Networking Abstraction for Distributed Data-Parallel Applications

Lifetime Achievement Award: Jim Kurose



Keynote Speaker: Jim Kurose

Professor Kurose shared ten insights for younger researchers in networking, covering research, education, mentoring, and service:
#1 Picking Research problems: Carefully
#2 Choosing, defining a research problem.
#3 Teaching
#4 Teaching: a Prediction
#5 Computer Science Education
#6 Service
#7 Mentoring: the process of doing research
#8 Learn how to write really well
#9 Learn how to speak really well
#10 Identify role models


Final Observations:
Networking research community: vibrant; topics include SDN, NFV, mobility, cybersecurity, data, IoT
Constant need to "prove" yourself
privileged to be doing what we do
The work we do is great; people matter.



Questions:
Q: Is it useful to look at other people's work in your field? And what could cause system failures using modern network technology?
A: It's great to see other people's work and ideas, and to share your own ideas as well. In terms of system failures, roughly 95% are mostly organizational failures. The interaction between humans and technology is critical.


Q: 20 years ago, in 1997, what was your approach to solving multicast?
A: Solve the problem in one controlled domain first, such as an intra-domain setting, which was much easier. Now, applying SDN/NFV concepts can bring us new ways to look at the multicast problem.


Q: What are other possible network research topics in the future, and which are important?
A: Some topics are more popular than others, and their life cycles differ; some are longer than others. Take SDN as an example: SDN has a long life cycle in both the research and industry worlds. When you are trying to solve a real-world problem, always think forward to the future.







Friday, August 21, 2015

End of SIGCOMM

This concludes Layer 9's coverage of SIGCOMM 2015. Thank you all for tuning in and to the contributors for their timely and accurate scribing! We understand that archived videos of the sessions will be available in the future. For now, slides from the topic preview sessions have been posted: http://yuba.stanford.edu/~huangty/sigcomm15_preview/

P4: Programming Protocol-Independent Packet Processors

Pat Bosshart, Glen Gibb, Martin Izzard and Dan Talayco (Barefoot Networks), Dan Daly (Intel), Nick McKeown (Stanford University), Jennifer Rexford, Cole Schlesinger and David Walker (Princeton University), Amin Vahdat (Google), George Varghese (Microsoft Research)

Paper link

Presenter : Changhoon Kim (Barefoot Networks)

With the advent of Software-Defined Networking (SDN) and the OpenFlow standard, programmers can program the control plane. An OpenFlow-enabled switch recognizes a predetermined set of header fields and processes packets using a small set of predefined actions. However, considering that protocols evolve rapidly (e.g., encapsulation protocols for network virtualization, and OpenFlow itself from version 1.0 to 1.5), the current approach cannot express how packets should be processed to best meet the needs of control applications.

Fortunately, protocol-independent switch ASICs are now available at a few terabits per second. Thus, programmers can program the data plane as well as the control plane. To exploit this flexible data plane, the presenter introduced P4, a high-level language for Programming Protocol-independent Packet Processors.

P4 is designed with three goals:
1. Protocol independence: network devices should not be tied to any specific network protocols.
2. Target independence: programmers should be able to describe packet-processing functionality independently of the specifics of the underlying hardware.
3. Reconfigurability in the field: programmers should be able to change the way switches process packets once they are deployed.

With P4, the programmer specifies how the forwarding plane processes packets, and a compiler transforms the imperative program into a table dependency graph that can be mapped to many specific target switches, including optimized hardware implementations.
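A hedged Python model of what protocol independence buys (this is not P4 syntax): the program, not the chip, declares header formats, a parse step, and a match-action table, so a brand-new header type can be handled. All names and the EtherType value are illustrative.

```python
# Hedged illustration only: the program defines headers, parsing, and a
# match-action table; everything here is a toy stand-in for real P4.

# Programmer-declared header formats: field name -> bit width.
ETH = [("dst", 48), ("src", 48), ("ethertype", 16)]
MY_TAG = [("tenant_id", 16)]          # a new protocol the chip never saw

def parse(bits, fmt):
    hdr, off = {}, 0
    for name, width in fmt:
        hdr[name] = bits[off:off + width]
        off += width
    return hdr, bits[off:]

# A match-action table: exact match on one field, programmable action.
tag_table = {("tenant_id", "0" * 15 + "1"): ("set_port", 7)}

def pipeline(bits):
    eth, rest = parse(bits, ETH)
    if eth["ethertype"] == format(0x88B5, "016b"):   # programmer-chosen tag
        tag, rest = parse(rest, MY_TAG)
        action = tag_table.get(("tenant_id", tag["tenant_id"]))
        if action and action[0] == "set_port":
            return {"egress_port": action[1]}
    return {"egress_port": 0}                        # default behavior

pkt = "0" * 96 + format(0x88B5, "016b") + "0" * 15 + "1"
print(pipeline(pkt))  # {'egress_port': 7}: the switch handles a new protocol
```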

In the talk, the presenter highlighted that this will accelerate innovation:

- Areas of innovation
* Reducing feature set
* Network monitoring, analysis, and diagnostics
* Tunnel-splicing gateways
* Load balancing
* Attack detection and mitigation
* Host-stack offloading

- This will also open up research opportunities
* Novel network and protocol designs
* development tools
* network verification, test, and simulation


Q1: At the beginning of the slides, you said that with P4 it is easy to debug problems in a switch. However, current hardware switches already have many solutions, like testing and feedback loops. What is different?
A1: Here, bugs mean software bugs that stem from an incorrect data plane description, such as ambiguity. With P4, compilers can help with this: if the programmer specifies an incorrect data plane description, the compiler provides error messages, which reduces the number of errors.

Q2: What is the difference between OpenDataPlane and P4?
A2: I am not familiar with it; let's talk more later.

Q3: What is the difference between OpenFlow and P4?
A3: OpenFlow is one of the P4 applications; OpenFlow itself can be expressed in P4.

Q4: Actions such as changing header fields can be applied in the middle of the pipeline in the data plane. In this case, does parsing the packet only once, before it enters the pipeline, cause problems?
A4: I did not mention this in the presentation, but P4 can specify deparse logic to handle this case.

Q5: After submitting the paper, did you add more new features?
A5: There is more functionality; since the paper was only 6 pages, we did not mention all of it. You can see the full feature set in the P4 specification.


A Primer on IPv4 Scarcity

Philipp Richter (TU Berlin/ICSI) , Mark Allman (ICSI) , Randy Bush (Internet Initiative Japan) , Vern Paxson (UC Berkeley/ICSI)


IPv4 scarcity has become reality: today only about 2% of IPv4 addresses remain available. The authors try to answer these questions:
What is IPv4 scarcity?
What factors affect it?
What challenges does it pose?
How are we dealing with it?
To answer these questions, the authors first outline the evolution of address space management as well as address space use patterns, identifying key factors behind the scarcity issue. Finally, they characterize the possible solution space for overcoming it.

The authors trace the evolution of IPv4 address management and check the degree to which allocation reflects actual use. They divide the history of IPv4 addressing into three phases. First is the "Early Registration" phase, in which address blocks were allocated quite informally, causing heavy internal fragmentation and waste of address space. Second is the "Needs-Based Provision" phase, in which the modern framework of Regional Internet Registries (RIRs) was established; the primary goal of this phase was conservation of address space, and to get address space, requesters had to justify their need for it. Last is the "Depletion and Exhaustion" phase, in which stricter allocation policies were applied to the small number of last remaining blocks.

Over these three phases, the address space has become almost fully exhausted. However, according to "Figure 4: Allocated and routed address blocks", large amounts of address space are not routed. More specifically, address ranges assigned prior to the existence of the RIRs show poor utilization, whereas RIR-allocated ranges show higher utilization, which suggests that the policies for allocating IPv4 addresses are effective.




The scarcity of IPv4 addresses makes people think of IP addresses as a virtual resource. People considered IPs to be "free" for the last 30 years, but now they are a valuable resource and have become goods exchanged on markets. IPv4 address space transfers arise both via RIR transfer policies and outside the RIRs, and network operators have already started buying and selling address blocks. However, defining the boundaries of what exactly an address transfer is, and what it is not, is not straightforward. Thus, the authors note, IP block transfers require care and further research.

Finally, the authors consider three possible solutions to overcome IPv4 scarcity. The first is to use the IPv6 address space. Considering the number of possible addresses in IPv6, this is the ultimate, natural solution; however, IPv6 is not compatible with IPv4, and complex transition mechanisms between IPv6 and IPv4 are required. The second is to multiplex the current IPv4 address space using address-sharing techniques like Carrier-Grade NAT (CGN). The final possible solution is more efficient use of the current IPv4 address space. According to Figure 4, about a third of all Internet address blocks are not routed, and thus not in (at least public) use. Making more efficient use of the address space will require adapting address management policies, guidelines, and technologies, including the difficult (both technically and politically) problem of re-assigning already-allocated address blocks.


Q1. Can you show the graph on slide 23 in units of IP address blocks?
A1. It is in the backup slides.

Q2. Is the same situation possible with IPv6?
A2. It is possible, but with IPv6 there has been no informal allocation and waste of addresses like in the first phase of IPv4.

Q3. Given the prices for transferring IP blocks in the slides, do you have evidence about the price of buying an IP address block?
A3. It is hard to find prices, since only a little data is publicized.


Q4. There are many unrouted IP addresses. How do you know whether they are used as private address space or as protected address space? How much do they account for?
A4. True, some companies or governments use them for this purpose. We cannot know the exact number; however, we assume it is relatively small.

Thursday, August 20, 2015

ASwatch: An AS Reputation System to Expose Bulletproof Hosting ASes


Session 11: Security, Privacy, and Censorship - Paper 2
Authors: Maria Konte (Georgia Institute of Technology), Roberto Perdisci (University of Georgia), Nick Feamster (Princeton University)

Paper:
http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p625.pdf

Public Review:
http://conferences.sigcomm.org/sigcomm/2015/pdf/reviews/284pr.pdf


Intro:
Cyber-criminals protect their network resources by hosting their services inside malicious ASes, referred to as bulletproof hosting ASes. Current defenses against these malicious ASes rely on traffic monitoring and existing AS reputation systems. The problem with these approaches is that they need lots of vantage points, have high false positives, are hard to use, and act too late to prevent the attack. The authors' approach is to monitor BGP routing and use a machine learning algorithm to identify these malicious ASes. It is the first attempt to deliver AS reputation based on routing monitoring, exclusively from public data.

System:
The system consists of two phases: a training phase and an operational phase. It uses confirmed cases of malicious and legitimate ASes as ground truth for training and extracts features based on the authors' domain knowledge. Example feature families include rewiring changes, BGP routing dynamics, and IP space fragmentation, which are explained in the following paragraphs. Using these features, the operational phase generates AS reputation reports. The domain knowledge includes:

Rewiring changes / link connectivity: malicious ASes tend to change connectivity more aggressively than legitimate ASes. To measure this, they take snapshots of connectivity. For example, the measurement works as follows:
 - Monitor the last x snapshots
 - Collect all providers
 - Measure the fraction of snapshots with each provider
The link connectivity is then represented by three features of this distribution.

BGP routing dynamics: malicious ASes' routing dynamics are driven by illicit operations; in contrast, legitimate ASes' dynamics are driven by policy changes and traffic engineering decisions.

Fragmentation and churn of advertised prefixes: malicious ASes rotate their advertised prefixes, e.g., to evade blacklisting, and they advertise a large number of non-contiguous prefixes.
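A hedged sketch of the snapshot-based connectivity features; which three summary statistics ASwatch actually derives from the provider-presence distribution is not detailed here, so the three below are stand-ins.

```python
# Hedged sketch of rewiring/link-connectivity feature extraction. The three
# summary statistics are assumptions, not the paper's exact feature set.
from collections import Counter

def connectivity_features(snapshots):
    """snapshots: list of provider sets, one per routing-table snapshot."""
    x = len(snapshots)
    counts = Counter(p for snap in snapshots for p in snap)
    fractions = [c / x for c in counts.values()]
    return {
        "n_providers": len(fractions),                 # providers ever seen
        "mean_presence": sum(fractions) / len(fractions),
        "min_presence": min(fractions),                # short-lived -> low
    }

stable_as   = [{"P1"}, {"P1"}, {"P1"}, {"P1"}]
rewiring_as = [{"P1"}, {"P2"}, {"P3"}, {"P4", "P5"}]
print(connectivity_features(stable_as))    # 1 provider, present in 100%
print(connectivity_features(rewiring_as))  # many providers, each short-lived
```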

Evaluation:
Using features like the above, they train classifiers and evaluate them with cross-validation. The accuracy is 93% true positives with 5% false positives. They also investigate which features are important by including/excluding each feature family separately and observing the change in performance. The result shows that the most important features, in terms of true positive rate, are the connectivity features; fragmentation and churn of advertised prefixes are less important than connectivity, but help to lower false positives.


Q: How does the algorithm adapt as people become aware of the work? There are security problems like tricking the algorithm.
A: The malicious behaviors themselves are very hard to change.
Follow-up: What if the criminals stop these misbehaviors to avoid detection? (taken offline)

Q: Have you looked into the patterns of these hosting ASes? Which providers are more frequent, and which are legitimate?
A: No, it is more complex than examining a single provider. Yes, some providers tend to have more malicious ASes; what happens more often is that a legitimate provider hosts malicious ASes without knowing it.

Q: Say one AS is malicious and does not want to be detected: BGP gives no guarantee that the path we analyze is the path traffic is actually going to take, so what can I do?
A: We didn't try to answer that question. What we were trying to do is see how connectivity and BGP updates can be used to detect malicious ASes.

Q: Do you have, or intend to provide, a one-paragraph recommendation to policy makers (the ISP community) so they can use these intuitions?
A: That's a good question; we haven't examined that path. There are similar open questions, like how to configure reliable AS protections.

Alibi Routing

Session 11: Security, Privacy and Censorship - Paper 1
Authors: Dave Levin (University of Maryland), Youndo Lee (University of Maryland), Luke Valenta (University of Pennsylvania), Zhihao Li (University of Maryland), Victoria Lai (University of Maryland), Cristian Lumezanu (NEC Labs), Neil Spring (University of Maryland), Bobby Bhattacharjee (University of Maryland)
Paper:
http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p611.pdf
Public Review:
http://conferences.sigcomm.org/sigcomm/2015/pdf/reviews/371pr.pdf

Intro:
Users have no control over routing paths and lack undeniable proof of where their packets travelled. This is problematic because the packets might go through a censoring country, causing collateral damage. Existing approaches like HTTPS and anonymity systems can help hide information, but they are still somewhat subject to censorship. The authors propose to avoid these geographic regions in the first place and to provide proof of the avoidance, which they call "provable avoidance routing".

System:
The problem seems intractable because you are trying to prove that something did not happen without enumerating everything that could have happened. The paper uses a very simple logic: if event A happening implies that event X did not happen, then, combined with proof that A actually happened, we have proof that X did not happen. Here A provides an "alibi". To cast routing into this logic, X is traversing a forbidden region and A is traversing a relay node. The key idea is to choose a relay so far away from the forbidden region that the latency difference between the two traversals is significant.

Using these two mutually exclusive events, the authors present "alibi routing", a peer-to-peer overlay routing system. The algorithm works in the following steps (a sketch of the step-2 test follows this list):
1. Users choose forbidden regions.
    These regions are user-specified, arbitrary sets of polygons (defined over lat/lon).
2. Users compute target regions (where alibis might be):
    Exclude locations where alibis cannot exist.
    Segment the world into a grid.
    Include a grid point if the shortest latency (min RTT) to reach the destination via the alibi node alone is significantly smaller than via the alibi node plus any node in the forbidden region.
3. Alibi routing recursively searches for peers within the target regions.
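A hedged sketch of the step-2 test, using great-circle distance as a crude stand-in for measured min-RTT; the delta safety factor and the coordinates are illustrative.

```python
# Hedged sketch of the target-region test. Real alibi routing uses latency
# measurements; geodesic distance is only a lower-bound proxy here.
from math import radians, sin, cos, asin, sqrt

def gc_km(a, b):
    """Haversine great-circle distance between (lat, lon) points, in km."""
    la1, lo1, la2, lo2 = map(radians, (*a, *b))
    h = sin((la2 - la1) / 2) ** 2 + cos(la1) * cos(la2) * sin((lo2 - lo1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def in_target_region(src, dst, g, forbidden, delta=0.5):
    direct = gc_km(src, g) + gc_km(g, dst)
    # Cheapest way to traverse BOTH the candidate alibi g and a forbidden point.
    detour = min(
        min(gc_km(src, f) + gc_km(f, g) + gc_km(g, dst),   # detour before g
            gc_km(src, g) + gc_km(g, f) + gc_km(f, dst))    # detour after g
        for f in forbidden
    )
    # Provable avoidance: even the cheapest forbidden detour must be clearly
    # slower than what is actually observed via g.
    return (1 + delta) * direct < detour

LONDON, NYC = (51.5, -0.1), (40.7, -74.0)
BEIJING = [(39.9, 116.4)]
print(in_target_region(LONDON, NYC, (64.1, -21.9), BEIJING))  # Reykjavik: True
print(in_target_region(LONDON, NYC, (37.5, 127.0), BEIJING))  # Seoul: False
```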

Evaluation/Result:
The authors evaluate their work by both simulation (20k nodes) and an implementation deployed on PlanetLab with 425 nodes.
The results show that:
- Most source-destination pairs can provably avoid forbidden regions.
- Failure typically arises when the target region is too small or non-existent.
- Failure is likely when the source or destination is very close to the forbidden region.
- There is little latency overhead.
- And more results are in the paper.

The conclusion is that provable avoidance is possible safely and efficiently. Alibi routing finds potential alibis successfully and at low cost.

The data and code are publicly available at alibi.cs.umd.edu


Q: Given the latency variability, this must cause some false positives. Is that acceptable?
A: This is what the delta parameter affects; the idea is that you can specify a smaller value, so that the target region hopefully still meets the minimum requirement to succeed. This is the tradeoff of success versus performance.

Q: The motivation is censorship, but there are lots of people inside the forbidden regions, and only a very small percentage of traffic from Europe crosses China. Is that fair?
A: We didn't evaluate that overhead, and we aren't sure how it could be made fair. But even as US-centric as the data is, the majority of the time it still succeeds.

Q: What if attackers increase the delay?
A: We don't know for sure. But attackers cannot trick us by decreasing the delay.


Adaptive Congestion Control for Unpredictable Cellular Networks

Yasir Zaki (New York University Abu Dhabi), Thomas Pötsch (University of Bremen), Jay Chen (New York University Abu Dhabi), Lakshminarayanan Subramanian (New York University), Carmelita Görg (University of Bremen)

Paper:
http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p509.pdf

Public review:
http://conferences.sigcomm.org/sigcomm/2015/pdf/reviews/308pr.pdf

Code:
http://yzaki.github.io/verus

It is well known that cellular networks have throughput performance problems. One of the reasons is the mismatch between TCP's assumptions and cellular network behavior.

The authors present Verus, an adaptive congestion control protocol, to mitigate the mismatch between TCP's features and unpredictable cellular links. Verus is an end-to-end congestion control protocol that uses delay measurements to react quickly to capacity changes in cellular networks, without explicitly attempting to predict the cellular channel dynamics. Verus learns a delay profile that captures the relationship between end-to-end packet delay and the outstanding window size over short epochs.
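A hedged sketch of the delay-profile idea; the EWMA smoothing and the pick-a-window rule below are placeholders, not Verus's actual update equations.

```python
# Hedged illustration: learn a window-size -> delay mapping over epochs, then
# choose the next window from the profile. All constants are assumptions.
EWMA = 0.25

class DelayProfile:
    def __init__(self):
        self.delay_for_window = {}  # window size -> smoothed delay estimate

    def update(self, window, measured_delay):
        old = self.delay_for_window.get(window, measured_delay)
        self.delay_for_window[window] = (1 - EWMA) * old + EWMA * measured_delay

    def window_for(self, target_delay):
        # Largest window whose estimated delay is still under the target.
        ok = [w for w, d in self.delay_for_window.items() if d <= target_delay]
        return max(ok) if ok else min(self.delay_for_window)

profile = DelayProfile()
# Epochs: (window used, delay observed in ms) -- synthetic numbers.
for w, d in [(10, 30), (20, 45), (40, 90), (20, 50), (10, 35)]:
    profile.update(w, d)
print(profile.window_for(target_delay=60))  # -> 20 under these toy samples
```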

The authors compare Verus with TCP Cubic and TCP Vegas: Verus better utilizes the bandwidth, while Cubic and Vegas exhibit higher delay. Against Sprout, the state-of-the-art technique, Verus provides nearly 30% higher throughput, but with higher delay.

Q&A

Q: For very short LTE latency, what would the TCP gain be?
A: We did some analysis with commercial LTE deployments; we do not think it would be much different. You would still be able to get the delay feedback and adapt.

Q: You could blast a UDP flow and mess up TCP fairness.
A: Verus was evaluated against other TCP variants on a tail-drop link.

Q: Why did you implement your protocol over UDP?
A: Convenience; it is not in the kernel.

Q: What if you implemented it in kernel space?
A: We might get better improvements, since we could get the signals faster.

Q: What if the flows are short?
A: If the flows are short, they do not live beyond slow start and perform similarly to TCP. The target was to look at longer-lived connections.

Q: Did you look at the convergence time?
A: No, we did not.