Tuesday, August 30, 2016

SIGCOMM 2016 - Session 7 (Networked Applications) - CS2P: Improving Video Bitrate Selection and Adaptation with Data-Driven Throughput Prediction

Presenter: Yi Sun (ICT/CAS)

Co-Authors: Xiaoqi Yin (CMU), Junchen Jiang (CMU), Vyas Sekar (CMU), Fuyuan Lin (ICT/CAS), Nanshu Wang (ICT/CAS), Tao Liu (iQIYI), and Bruno Sinopoli (CMU)

The volume of TCP-based adaptive video streaming is growing, and QoE is becoming more critical. To achieve good QoE, the bitrate adaptation algorithm needs to be smart. Most algorithms start at a basic bitrate and take a long time to reach the ideal bitrate because they cannot predict throughput, a problem that this paper addresses.

Accurate throughput prediction is tricky, as our abilities to do so are quite limited. The authors tried approaches such as the harmonic mean and support vector regression, but they do not work well. This paper's main contributions are the following:
  • a large-scale analysis of throughput variability and predictability, providing insights on how to predict throughput accurately;
  • CS2P, a cross-session stateful predictor for improving bitrate selection and adaptation via throughput modeling;
  • a practical implementation of a CS2P-based video player and a demonstration of its QoE improvements.
To analyze throughput variability across sessions and within sessions, the authors used a proprietary dataset with more than 20 million entries collected from iQIYI, a Chinese video provider. To capture the stateful characteristics in the evolution of session throughput, CS2P uses hidden Markov models. CS2P also groups similar sessions, based on the observation that many sessions share a similar throughput structure, and uses data from these sessions for cross-session prediction.
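
To illustrate the cross-session idea, here is a minimal sketch; the session features, names, and numbers are invented for illustration, and the real CS2P fits an HMM per cluster of similar sessions rather than using a single summary statistic:

```python
from statistics import harmonic_mean

# Hypothetical records: (isp, region, observed_throughput_kbps) from past sessions.
history = [
    ("ispA", "north", 1200), ("ispA", "north", 1500),
    ("ispA", "north", 900),  ("ispB", "south", 400),
]

def predict_initial_throughput(isp, region):
    """Cross-session prediction: pool throughput samples from past sessions
    sharing the same key features, and summarize with the harmonic mean
    (a common robust estimator for rate data)."""
    similar = [t for i, r, t in history if (i, r) == (isp, region)]
    if not similar:                       # no similar sessions observed yet
        return harmonic_mean(t for _, _, t in history)
    return harmonic_mean(similar)
```

A new session from ("ispB", "south") would be predicted at 400 kbps here, because all information comes from the one similar past session.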

CS2P improves most QoE metrics compared to other algorithms; for instance, it performs 3.2% better than the harmonic mean and 14% better than the buffer-based solution.

Q: Link type is not used as one of the metrics. How does the algorithm work considering different link types? (4G, Wi-Fi, ...)
A: Including link type in the candidate feature set would further enhance the accuracy of CS2P, because link type is critical for throughput; however, we do not have this information from the content provider.

Q: When doing throughput prediction, how far ahead can CS2P go, and how does it affect the QoE?
A: CS2P looks into different horizons in the future and the improvement over other algorithms is consistent. Before the start of a session, CS2P predicts the overall downloading time for the entire video.

Q: How does the prediction vary based on where the nodes are?
A: Because CS2P relies on cross-session prediction, the more sessions with similar features there are, the more accurate it will be. Unfortunately, in our pilot experiment we did not have many volunteers.

Q: Regarding the locality issue, how does it play out in a CDN with different node conditions?
A: The CDN server is in the candidate feature set. We have considered that in our experiments.

Friday, August 26, 2016

SIGCOMM 2016 - Session 7 (Networked Applications) - WebPerf: Evaluating What-If Scenarios for Cloud-hosted Web Applications

Presenter: Ramesh Govindan (USC)
Co-Authors: Yurong Jiang (USC), Lenin Ravindranath (Microsoft Research), Suman Nath (Microsoft Research)

Simple web applications, such as note taking, can have significant underlying complexity in the cloud. Moreover, since web applications are hosted in the cloud, latency is critical. This talk looks particularly at the latency experienced by these web applications in the cloud. In fact, developers find it very hard to optimize latency for such applications.

The paper argues that the reason is simple: cloud services offer a variety of hardware options for the front end, all of which impact the overall latency of the application. Latency is affected by almost every other component in the application, each of which has several configuration choices.
Across all these different configurations, it is not clear what the latency will be. Ideally, a developer would like a What-If capability that estimates what the new latency would be after a configuration change. However, such estimation poses several challenges, for instance the multiple dependencies among components.

This paper presents WebPerf, a What-If scenario estimator: it tells developers what happens when a configuration changes. WebPerf has four components: a dependency graph, a baseline performance, a profile dictionary, and a predictor that combines all of that information. More details on these components are described in the paper. WebPerf is also cheap, fast, and can be automated, because many of the profiles are application-independent. Another insight is that the dependency graph is independent of the scenario.
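
As a rough illustration of how a predictor might combine per-component profiles along a dependency graph, here is a toy sketch. The component names, latencies, and the sequential/parallel combination rule are all assumptions for the example, not WebPerf's actual model (which reasons over latency distributions rather than means):

```python
# Per-component mean latencies in ms (the "profile dictionary"), invented numbers.
profiles = {"frontend": 5, "blob": 20, "sql": 35, "cache": 2}

# Dependency graph: node -> (combination kind, children).
# "seq" children run one after another, "par" children run concurrently.
graph = {
    "frontend": ("seq", ["blob", "sql"]),
    "blob": ("seq", []), "sql": ("seq", []), "cache": ("seq", []),
}

def predict(node, overrides=None):
    """Estimate end-to-end latency; `overrides` answers a what-if question
    by substituting a component's profile with a hypothetical one."""
    overrides = overrides or {}
    kind, children = graph[node]
    own = overrides.get(node, profiles[node])
    if not children:
        return own
    child_lat = [predict(c, overrides) for c in children]
    return own + (max(child_lat) if kind == "par" else sum(child_lat))

baseline = predict("frontend")               # 5 + 20 + 35 = 60 ms
what_if  = predict("frontend", {"sql": 10})  # what if the SQL tier were faster?
```

The what-if query reuses the same dependency graph and only swaps one profile, which mirrors the insight that the graph is independent of the scenario.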

The last part of the talk presents the evaluation, more specifically how accurate WebPerf is and a quantification of its efficiency. To perform this evaluation, the authors ran WebPerf on six different applications of varying complexity. The results showed that, indeed, configuration choices can really impact latency and that WebPerf is very accurate.

Summarizing, this paper presents a tool that helps developers reason about the latency of web applications when the configuration of their components is changed.

Q: Given that the latency estimate is based on individual profiling of each component, and that Microsoft, Google, and Facebook already have many data sets available, how would WebPerf benefit from the existing data sets?
A: As described in the paper, the profiling of the components is done offline, and it is plausible to use those existing measurements to profile components.

Q: Is there a way to get the dependency graph without instrumenting the binaries, for use in cloud applications?
A: The paper discusses a few alternatives, such as monitoring network traffic. It turns out it is hard to replace binary instrumentation with more indirect methods, because you risk missing the synchronization points, and you only get indirect information about dependencies by observing latencies.

Q: Cloud providers do a lot of reactive management to respond to changes in the workload. The paper presented very stable results, which seems a bit unusual.
A: Most dynamic management is done on aggregate loads. The difference with WebPerf is that we do it for the performance of a single application, per developer.

Q: What are the next steps, in terms of how to attain a certain latency and which configuration changes to make to reach a desired goal?
A: In fact, that adds another layer of search on top, to answer such specific how-to questions.

Taking the Blame Game out of Data Centers Operations with NetPoirot

Presenter: Behnaz Arzani
Authors: Behnaz Arzani (University of Pennsylvania), Selim Ciraci (Microsoft), Boon Thau Loo (University of Pennsylvania), Assaf Schuster (Technion - Israel Institute of Technology), Geoff Outhred (Microsoft)

The paper presents NetPoirot, a diagnostic tool that monitors aggregate TCP metrics and uses machine-learning-based classification techniques to identify the cause of failures in the data center.

Data centers can fail, and it is often unclear what causes the failure: it could lie in the underlying network, the client, or the service-level application. A real-world failure example is presented in the paper: applications running inside VMs could trigger an operation in the hypervisor, which then sends a request to a remote service. Whenever the request/response latency increases, an error occurs in the hypervisor, which in turn causes the VM to panic and reboot, disrupting normal operations. The failure to diagnose this scenario motivated a more effective debugging tool.

To achieve this goal, NetPoirot measures only aggregate TCP metrics to diagnose failures. The key insight behind the tool is that TCP observes the entire communication path and sees faults no matter where they happen. The tool has two main parts: a monitoring agent, which runs in the VM and periodically collects TCP metrics, and a learning agent, which takes in the monitoring data and uses a decision-tree model to classify the failures.
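
To make the idea concrete, here is a toy, hand-rolled decision tree over aggregate TCP metrics. The feature names and thresholds are invented for illustration; NetPoirot learns its tree automatically from labeled failure data rather than using fixed rules:

```python
# A two-feature decision tree in the spirit of NetPoirot's classifier.
# Feature names and thresholds below are assumptions, not from the paper.
def classify(metrics):
    """metrics: dict of aggregate TCP counters sampled inside the VM.
    Returns which group to blame for the observed failure."""
    if metrics["zero_window_sec"] > 1.0:   # peer not draining its receive buffer
        return "server"                    # blame the remote service
    if metrics["retrans_ratio"] > 0.05:    # heavy retransmissions on the path
        return "network"
    return "client"
```

The real classifier is learned per application, which matches the answer below that the most prominent features vary from application to application.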

A prototype of NetPoirot has been developed and deployed in a production data center, and the authors have done extensive evaluation over a 6-month period. The speaker mentioned some lessons learned from this work. The first is that TCP sees everything, even at a single endpoint, so it is possible to use TCP metrics to detect who is responsible for a failure; surprisingly, two features are enough for most applications to describe each failure observed by TCP. The second is that the relationship between failures and TCP metrics is nonlinear, and failures within a group (client/server/network) look similar, which also implies that classifying individual faults is difficult.

Q1: Is it possible to game NetPoirot so that it makes mistakes?

A1: Potentially it could be possible, but such a failure could be thought of as an abnormal one, so what one might do is anomaly detection: check how similar a failure is to failures observed in the past, to see if it has enough in common with them to yield a legitimate diagnosis. Although I have not done any research on this, so I am not 100% sure what the result of doing that would be.

Q2: What are the features most prominently identified by PCA, given that only two features (derived from the original features) are required to characterize each failure using TCP data?

A2: It varies from application to application, given that the approach is very application-dependent. But for the example application we tried, for client-side failures the feature with the highest coefficient was the time spent in zero window.

Thursday, August 25, 2016


Huge appreciation to all volunteers, the conference organizing committee, and the chairs for this year's SIGCOMM. Obrigado and muito bem (thank you and well done)!!

Next year's SIGCOMM will be in LA, starting August 21st.

Happy birthday Keshav!

The Good, the Bad, and the Differences: Better Network Diagnostics with Differential Provenance

Ang Chen (University of Pennsylvania)
Yang Wu (University of Pennsylvania)
Andreas Haeberlen (University of Pennsylvania)
Wenchao Zhou (Georgetown University)
Boon Thau Loo (University of Pennsylvania)

The intricate relationships between events in a distributed system make network diagnostics in such a system quite complex and tricky. Existing debuggers can provide information on what happened in the network, offering details like packet histories and comprehensive explanations through data provenance. However, identifying the root cause of a problem from this huge set of information is still complicated. The paper presents a concept called differential provenance (DiffProf), which tracks the causal connections between network states and state changes and provides root-cause analysis by reasoning about the differences between the provenance tree of a reference event (good event) and the provenance tree of the symptom event (bad event).

The provenance graph is a directed graph that contains events or states as vertices and causality as edges. The naive method of taking the difference between the two provenance trees can produce a diff tree larger than the two individual trees, due to a butterfly effect: small initial differences propagate into drastic differences later on. The paper presents an initial algorithm for identifying the minimal diff tree and suggests many further refinements to make the algorithm more efficient and correct. System states and events are represented as tuples, which are organized into tables; a set of derivation rules describes how these tuples can be derived. The initial algorithm rolls back the execution of the symptom event to the point where the two provenance trees diverge, and then mutates the faulty node to the correct state. The problem with this approach is that it only finds the initial differences, and a large set of consequences is ignored. In order to preserve the meaningful, unique information about the event that each provenance tree represents, one leaf tuple in each tree is marked as the seed of that tree, which cannot be mutated. The algorithm is then refined to roll back to the divergence point, change the faulty node to an equivalent node, and repeat the same process until the trees are completely equivalent, thereby identifying the root causes of multiple faults.
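
The naive diffing step that the refined algorithm improves upon can be sketched as follows; the tree shapes and labels are invented for the example, real provenance trees are far larger, and the rollback/mutation machinery is not shown:

```python
# Naive provenance-tree diff (the strawman): walk the two trees in lockstep
# and report the subtrees where they first diverge.
# A tree is (label, [children]).
def diff(good, bad):
    """Return (good_subtree, bad_subtree) pairs at divergence points."""
    if good[0] != bad[0]:
        return [(good, bad)]            # labels differ: divergence found here
    out = []
    for g, b in zip(good[1], bad[1]):   # assumes aligned children for brevity
        out += diff(g, b)
    return out

good = ("route", [("rule:A", []), ("link:up", [])])
bad  = ("route", [("rule:B", []), ("link:up", [])])
# diff(good, bad) -> [(("rule:A", []), ("rule:B", []))]
```

The butterfly effect shows up when a divergence near the leaves causes every ancestor derivation to differ, which is why the paper rolls execution back instead of diffing the full trees directly.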

DiffProf is implemented in C++ on top of RapidNet and evaluated in two sets of case studies, on SDN and Hadoop systems. In the multiple case studies evaluated, DiffProf identifies one or two nodes as the faults, thereby exactly isolating the root cause. DiffProf is also efficient, answering all queries within a minute. Moreover, DiffProf determines the root cause efficiently even in complex networks under heavy interference.


Q: What happens if the reference and the symptom are the same? In the starting case in your strawman, you are being routed to two different servers but you needed to go to the same server. Is that covered?
A: We need to find a reference event that encodes the correct intent; if the reference event does exactly the same thing as the symptom event, it does not give sufficient information, so we cannot handle this case.

Q: How much does this vary with the reference events? How carefully should those be chosen?
A: The number of root causes found corresponds to the actual number of misconfigurations in the network, not to the reference events. For instance, we found one or two misconfigurations because the scenarios we recreated from other papers themselves lead to one or two misconfigurations. That is why the packet goes to one or two of the nodes as a result.

Q: What kinds of faults can you figure out? You mentioned ATPG; ATPG can only figure out action faults and not match faults, for example a traffic header matching a rule it is not supposed to match. How do you detect those anomalies? What is the coverage of the rules across all of the switches in the network?
A: One difference between ATPG and our work is that we do not focus specifically on the data plane, so we can handle more general cases than ATPG; for instance, it could be applicable to general distributed systems like MapReduce. That said, it works only when there is a reference event, which is one restriction on the kinds of faults DiffProf can handle. Beyond that, DiffProf mostly works for functional faults and not for timing faults such as race conditions, where the produced result is correct but the timing is perhaps delayed; DiffProf does not work well for those kinds of bugs. But we have an ongoing research project on those kinds of faults as well.

Values and Networks -- Steps Toward Exploring their Relationships

Presenter: Roland Bless

Co-author: Carsten Orwat

The current trend of the "Internet of Things" increases the dependency of society and individuals on networks. For example, freedom of opinion and expression and freedom of assembly are often exercised on the Internet. However, fundamental social values frequently conflict with each other on networks. A canonical example: freedom of expression and the right to privacy are often at odds with surveillance for homeland security.

The paper proposes three methodologies for value-oriented network design and a value-aware communication architecture. First, the paper proposes using specification and operationalization to provide technical solutions for conflicting requirements. Operationalization has drawbacks, such as leaving limited room for interpretation of, or exceptions to, rules. Second, the paper proposes institutionalization, which means establishing institutional frameworks, policies, and governance structures. Institutionalization has drawbacks too, such as regulatory capture by partial interests. Third, choice and markets can provide different products according to different values, but market failures are possible, which limit choices and bind customers to platforms. Technical support can help here through open standards.

The paper suggests thinking in terms of institutional frameworks. Technical implementations are only one part of the solution; future research should focus on methods that enable consideration of technical and institutional solutions at the same time.

Q: There was a similar proposal in SIGCOMM 2014 "Balancing Accountability and Privacy in the Network", can you compare and contrast?
A: The privacy achieved by that prior work is a good example of what is proposed in this work.

Q: The problem does not seem new to me; there are parallel ideas in, for example, financial systems.
A: In a medical emergency system, your privacy is not as important. What is proposed here depends on the use case: you may need different networks for different values. Accountability may need additional law to drive the technical design. Values can easily contradict each other, so you need to design for different values.

Crowdsurf: Empowering Informed Choices in the Web

Presenter: Hassan Metwalley

Authors: Stefano Traverso, Marco Mellia, Stanislav Miskovic, and Mario Baldi

Users are often unaware that their sensitive information is being sent to third parties when browsing the web. Hence, this paper proposes a system called CrowdSurf, which allows users and companies to regain control over the information they exchange with web services.

On the cloud side, CrowdSurf consists of a controller that collects traffic samples about the services users visit. It then processes users' contributed data with a data analyzer to infer which flows are harmful. On the client side, CrowdSurf implements a Firefox plugin, which uses rules in the form of regular expressions to determine whether certain flows should be blocked, redirected, allowed, modified, or simply logged.
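
The client-side rule matching can be pictured with a small sketch; the patterns, URLs, and actions below are made up for the example and are not taken from CrowdSurf's actual rule set:

```python
import re

# Illustrative CrowdSurf-style client rules: a regex over the request URL
# mapped to an action; first matching rule wins.
RULES = [
    (re.compile(r"https?://tracker\.example\.com/"), "block"),
    (re.compile(r"[?&]session_id="),                 "modify"),
]

def decide(url):
    """Return the action for an outgoing request, defaulting to allow."""
    for pattern, action in RULES:
        if pattern.search(url):
            return action
    return "allow"
```

In the real system these rules are pushed from the cloud-side analyzer, so the client benefits from flows observed across many contributing users.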

CrowdSurf is evaluated on a live trace from an Internet Service Provider and is shown to perform reliably, with little overhead on page load time.

Q: How do you deal with false positives?
A: We do have false positives, and we resolved them manually.

Q: Something that looks like a tracker may be required for site operation. Do you have a way to make exceptions?
A: Yes, users have the option to opt out of our service. For example, Google uses trackers to offer services, so a user can choose to allow trackers from Google.

SIGCOMM 2016 Session 11, SDN/NFV II: Dataplane Specialization for High-performance OpenFlow Software Switching (Molnár, Pongrácz, Enyedi, Kis, Csikor, Juhász, Korösi, Rétvári)

In this paper Gabor Retvari et al. present ESWITCH, which takes as input the forwarding behaviour of a switch expressed in OpenFlow and compiles it to efficient code that can be run on the switch.

Retvari starts with the expressiveness-performance tradeoff: while OpenFlow is great in terms of the flexibility it provides for implementing rules at different packet layers, this genericity comes at the cost of performance. Specifically, it is very hard to make OpenFlow perform well on x86 machines. This performance hit is due to the number of packet classification stages that need to be performed on the fast path; the classification is needed to identify the packets belonging to a flow so that the corresponding actions for the flow can be performed. With the arrival of server virtualization in datacenters, these tasks now need to be performed on commodity x86 machines, which do not have packet classification in hardware. To make up for the missing hardware support, software switches perform flow caching; the problem with flow caching, however, is that it over-generalizes, since it enforces the same caching semantics over the set of diverse applications that run in a datacenter.

He argues that instead of having a general-purpose fast path we need to build mechanisms which allow the switch to automatically adapt and specialize itself for the workload type.

In order to attack this problem, the team proposes ESWITCH, which takes as input an OpenFlow pipeline and uses template-based code generation to output specialized code for the switch. ESWITCH is based on the insight that most OpenFlow pipelines combine only a small selection of simple but general primitives into complex dataplane programs; therefore they can be reasonably represented using a few templates, where a template is a unit of specific packet-processing behaviour. ESWITCH is built upon Intel DPDK and implements packet parser, matcher, flow table, and action templates to provide the full set of forwarding functionality. It starts by deconstructing the input OpenFlow pipeline into basic packet-processing primitives, matches the processing stages with the pre-built templates, and then puts them together to build a pipeline.
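
A toy flavor of template-based specialization, using Python source strings where ESWITCH emits native x86 code via DPDK; the template set, pipeline, and field names here are invented for illustration:

```python
# Pre-written code templates, one per packet-processing primitive.
TEMPLATES = {
    "match_dmac": "if pkt['dmac'] != {v!r}: return 'drop'",
    "set_vlan":   "pkt['vlan'] = {v!r}",
    "output":     "return 'port:{v}'",
}

def compile_pipeline(stages):
    """Stitch the templates for the given (primitive, argument) stages
    into one specialized fast-path function."""
    body = "".join("    " + TEMPLATES[op].format(v=arg) + "\n"
                   for op, arg in stages)
    src = "def datapath(pkt):\n" + body
    ns = {}
    exec(src, ns)    # ESWITCH generates machine code instead of Python here
    return ns["datapath"]

fast_path = compile_pipeline([("match_dmac", "aa:bb"), ("output", 1)])
# fast_path({"dmac": "aa:bb"}) -> 'port:1'
```

Because the generated function contains only the checks this pipeline actually needs, there is no generic classification stage left on the fast path, which is the core of the performance argument.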

For the evaluation, the team considers a few use cases: layer 2 switching, layer 3 routing, a load balancer, and a telco access gateway, over a range of traffic mixes. For the first two use cases, the results show that as the number of flows increases, the performance of OVS with its flow caching goes down; ESWITCH, on the other hand, does not need to perform flow caching and maintains high packet rates over a range of traffic mixes, consistently reaching 12-14 Mpps, that is, within 1 Mpps of the benchmark (15 Mpps). With respect to latency, ESWITCH also maintains its superiority over OVS as the number of active flows increases. He further shows the robustness of ESWITCH as the number of updates per second increases.

In summary, for a switch to be truly programmable, the dataplane must also be programmable, and this is exactly what ESWITCH achieves.

The following questions were put forth after the talk:

Q: I was wondering if you have looked at specialization over different domains or have you compared in other domains?
A: OpenFlow is easy to specialize since it is well defined and there may be more domains where the idea of specialization can be generally applied but we haven't explored those yet.

Q: Do you generate the whole data plane after each update, or is it incremental?
A: We can update templates incrementally, but some templates, such as the direct code template, need to be compiled from scratch.

Q: The performance is great, but it does go down with the number of flows, so at some stage it may not be enough. One idea is to offload the forwarding to other places in the network to accelerate under high pressure, as in vCRIB; but your work puts more emphasis on the switch itself?
A: At the moment we haven't explored how this could be done with ESWITCH, but we don't see why it would be fundamentally difficult to achieve.

SIGCOMM 2016 Session 11, SDN/NFV II: PISCES: A Programmable, Protocol-Independent Software Switch (Shahbaz, Choi, Pfaff, Kim, Feamster, McKeown, Rexford)

Choi et al. propose PISCES, a compiler for a software switch that accepts custom protocol specifications in a high-level domain-specific language (DSL) and compiles them to actual C code.

Choi starts by highlighting the importance of programmability of forwarding behaviour, providing context from hardware switching: there is a current push towards programmable hardware chips and, similarly, in current-day data centers switching is mostly done in software to achieve a high degree of programmability.

Choi argues that while software switches allow quicker updates than a hardware switch, implementing new features in a software switch still requires deep knowledge of the hypervisor kernel, which most network operators may not have. Additionally, changes to the switch not only need to be implemented but also maintained over time, which imposes a significant management burden. This complexity is not suitable for altering the forwarding behaviour of a switch, and we need mechanisms to make this task easier for network operators.

Towards this end, the team proposes their compiler, PISCES, which takes the forwarding behaviour of a switch described in a domain-specific language (P4) and compiles it to target-switch-level C code. They show that code compiled by PISCES approaches the performance of hand-written code while reducing the program size by about 40x.

The platform chosen by the team for their prototype is OVS (Open vSwitch), in which they replace the parse, match, and action code with the code generated by the PISCES compiler. Swapping in the new OVS configuration simply requires a restart of the switch, and the entire process takes 20-30 seconds. To make OVS amenable to programs specified in P4, they make three key changes: the ability to perform arbitrary encapsulation/decapsulation through post-pipeline and inline editing modes, the ability to perform actions based on conditional statements, and the ability to compute checksums incrementally.
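
The third change, incremental checksum computation, is a standard trick (in the style of RFC 1624): when a single 16-bit header field changes, the Internet checksum can be patched without re-summing the whole header. A small sketch, with hand-picked example values:

```python
# Incremental ones-complement checksum update: C' = ~(~C + ~m + m'),
# where m is the old field value and m' the new one.
def csum16_update(old_csum, old_field, new_field):
    c = (~old_csum & 0xFFFF) + (~old_field & 0xFFFF) + new_field
    c = (c & 0xFFFF) + (c >> 16)   # fold the carry back in
    c = (c & 0xFFFF) + (c >> 16)   # a second fold covers the last carry
    return ~c & 0xFFFF

# Header words 0x4500, 0x0034, 0x1234 give checksum 0xA897; changing the
# last word to 0xABCD should give the same result as a full recomputation.
patched = csum16_update(0xA897, 0x1234, 0xABCD)
```

This avoids touching bytes that did not change, which matters on a fast path that processes millions of packets per second.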

They evaluate PISCES on two accounts: the complexity of describing a baseline switch, and the complexity of making changes to the switch compared to the OVS code. The results show that the number of lines of code needed to build a baseline switch is two orders of magnitude smaller than the OVS implementation, and making changes requires modifying fewer than 10 lines for a number of use cases.

They also compare the forwarding performance against a hand-coded OVS switch; the results show that with compiler optimizations, the performance of PISCES is within 2% of the OVS switch over a range of packet sizes.

In summary, PISCES allows quick development and maintenance of an OVS-based switch with very low overhead.

The following questions were put forth after the talk:

Q: What happens when you need functionality that is not in OVS, like rate limiting?
A: We also implement the P4 primitives, so if you need action on P4 primitives those are also supported in our work.

Q: Have you tried a different target switch than OVS?
A: We are trying different target switches, such as switches from Cisco and Barefoot's implementation of the software switch. We chose OVS because its forwarding model is close to P4's.

Q: Have you done full implementation of the specification of P4, if not then why?
A: We implemented everything except varying packet sizes and stateful processing; we weren't able to do a one-to-one mapping for stateful processing.

SIGCOMM 2016 Session 11, SDN/NFV II: OpenBox: A Software-Defined Framework for Developing, Deploying, and Managing Network Functions (Bremler-Barr, Harchol, Hay)

Yotam Harchol presents OpenBox which is a software defined framework for network wide deployment and management for network functions (NF).

Harchol points out that traditionally network functions have been implemented by middleboxes sold as monolithic boxes. He argues that while this model has benefits such as on-demand scaling, it still requires separate management of each individual network function (NF), where each NF may have its own management interface and may be administered by a different administrator. This makes it hard to efficiently manage different NFs and to evolve new ones; their work is therefore aimed at making this management task easier and making it easier to innovate and evolve new NFs.

To achieve this goal, Harchol argues that we need to decouple the control plane of an NF from its data plane so that control can be centralized. This allows the control functions to be carried out by a single entity in one place; however, it requires the dataplane to be programmable.

To achieve this design, the team proposes OpenBox, which is built on top of a new communication protocol. The key enabler behind OpenBox is that while different NFs may have different control logic, they have similar data planes. In the OpenBox universe, the different NFs are called OpenBox applications. OpenBox works by combining the functionality of different applications: each application is first decomposed into a processing graph, and the processing graphs of all applications are then merged into a single processing graph.
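
The graph-merging step can be pictured with a tiny sketch that collapses the shared prefix of two pipelines; the block names are invented, and OpenBox's real merge algorithm handles general processing graphs, not just linear pipelines:

```python
# Merge NF pipelines that share a prefix of identical processing blocks,
# so the shared work (e.g. parsing, classification) runs only once.
def merge(pipelines):
    """pipelines: list of block-name lists. Returns a prefix-sharing trie."""
    trie = {}
    for p in pipelines:
        node = trie
        for block in p:
            node = node.setdefault(block, {})
    return trie

firewall = ["parse", "classify", "drop_rules"]
cache    = ["parse", "classify", "cache_lookup"]
merged = merge([firewall, cache])
# {'parse': {'classify': {'drop_rules': {}, 'cache_lookup': {}}}}
```

Running the merged graph on a single instance is what removes the pipeline bottleneck described in the evaluation below.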

The data plane consists of OpenBox instances (OBIs), which perform the various packet processing tasks; these tasks are received from the centralized OpenBox controller. The control plane comprises the OpenBox controller, which sets the processing tasks and controls the provisioning and scaling of OBIs. Communication between the OpenBox controller and the OpenBox instances uses a new protocol that provides a set of messages and a set of processing blocks that can be joined together to build a network function.

To evaluate OpenBox, the team implemented three applications (a firewall, a web cache, and a load balancer) and compared them with a static pipeline-based NF implementation. They show that the throughput of a pipeline-based NF is limited by the throughput of the slowest NF; with OpenBox this restriction is removed, since all OpenBox applications are executed on the same OBI. The single OBI also means that the latency is lower compared to the traditional pipeline-based NF. Harchol also shows that OpenBox is scalable and reliable: if a service instance goes down, the controller can react and take the necessary steps to replace it.

The following questions were put forth after the talk:

Q: How does OpenBox handle fault tolerance, especially if a replica breaks down in the middle of processing a packet?
A: The current version of OpenBox by itself doesn't take care of this scenario. However, we have a mechanism for transactions which we have thought about but haven't implemented it yet.

Q: How would you take previous work on fault-tolerant middleboxes and add it to OpenBox?
A: We haven't yet gotten to fault tolerance but we have a few ideas that we can employ for it.

Q: How is your work different from previous work?
A: Previous work is mainly focused on placement and resource sharing on the same physical machine, but it doesn't decouple the control plane from the data plane; additionally, you have to use a specific language. With OpenBox you can implement different control planes with different dataplanes.

SIGCOMM 2016 - Session 8 - Wireless - Paper 3: "Braidio: An Integrated Active-Passive Radio for Mobile Devices with Asymmetric Energy Budgets".

Presenter: Pan Hu
Authors: Pan Hu, Pengyu Zhang, Mohammad Rostami, Deepak Ganesan

Battery capacities of mobile devices vary by up to three orders of magnitude between laptops and wearables. When such asymmetric battery lifetimes meet the usual symmetric power consumption of wireless radios, the lifetime of constrained portable devices is significantly limited.

The driving question of this paper is: can we design a power-proportional radio, i.e., one that consumes power in proportion to battery size? Unlike the classic active radio, which adopts a symmetric architecture at both the transmitter and receiver sides, passive radios such as backscatter adopt asymmetric architectures to achieve low power consumption at either the transmitter side or the receiver side.

This paper presents Braidio, a radically new radio design that is capable of dynamic carrier offload, i.e., the ability to dynamically switch the transmission carrier between the transmitter and the receiver.
In particular, Braidio adopts a passive receiver with a SAW filter and leverages antenna diversity for better reception. The first component suffers from reduced sensitivity while the second suffers from reduced robustness, both of which lead to a reduced reception range. To cope with this, Braidio uses the active radio as a safety net and automatically switches between modes, i.e., active, backscatter, and passive, to perform carrier offloading.
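
The mode-switching logic can be caricatured as follows; the signal threshold and energy-budget parameters are entirely invented, since the actual policy depends on measured link conditions and the hardware:

```python
# Toy carrier-offload policy in the spirit of Braidio: push the carrier
# (and hence the energy cost) onto the less-constrained endpoint, and fall
# back to the active radio when the link margin is too small.
def pick_mode(rssi_dbm, tx_budget_mj, rx_budget_mj):
    if rssi_dbm < -60:           # assumed passive-range cutoff: use safety net
        return "active"
    # backscatter spares the transmitter; passive receive spares the receiver
    return "backscatter" if tx_budget_mj < rx_budget_mj else "passive"
```

The point of the sketch is only the shape of the decision: asymmetry in energy budgets, not link quality alone, drives which side carries the burden.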

A prototype of Braidio is implemented, and experimental results show that Braidio can support transmitter–receiver power ratios from 1:2546 to 3546:1, enabling a huge dynamic range of asymmetry to suit a wide range of energy budgets between endpoints, while still achieving a reasonable maximum communication range of 3-6 meters.

Q1. What is the performance of Braidio on bitrate under different modes?
A: We fix a constant bitrate in the experiment.

The Deforestation of L2

Speaker: Mingjie Zhao 
Co-authors: James McCauley, Ethan J. Jackson, Barath Raghavan, Sylvia Ratnasamy, Scott Shenker 


The L2 Spanning Tree Protocol was designed in the 1980s as a plug-and-play protocol and has worked well through decades of network evolution. However, it has two important drawbacks:
1) It reduces the bisection bandwidth of the network to that of a single link
2) When a link on the spanning tree fails, the entire tree must be reconstructed
In this paper the authors present a new approach to L2, called the All conneXion Engine (AXE), that retains the original goal of plug-and-play, but can use all network links while providing extremely fast recovery from failures.


The speaker started by stating the goals of the paper as:
  1. Be plug and play
  2. Use all links for shortest path
  3. Have fast recovery from failure with no control plane

The authors make the assumptions that failure detection can be fast, that there is still a market for flood-and-learn L2, and that all links are point-to-point.
With these assumptions, AXE attempts to strike a balance between shortest path and fast recovery, while keeping things as simple as possible. AXE follows a basic flood/learn approach where:
  • When you see a packet, learn
  • When you don’t know what to do, flood


AXE does not need a tree to deal with loops, which means flooding keeps working in the presence of failures. The speaker explained that treeless flooding is made safe by duplicate packet deletion. There are multiple ways of doing this, but the authors use hash-based duplicate detection. This makes learning and responding to failures more subtle. AXE extends the packet header with a per-switch sequence number, a hop count (to protect against loops), a flooded flag and a learnable flag. In addition, switches keep two separate queues: a normal queue and a higher-priority queue for flooded packets, so that they are delivered quickly (to stop flooding quickly).


The speaker explained the algorithm as having two main steps:
  1. a learning phase (recording the port or hop count toward the source) and an unlearning phase (if the learned path is broken)
  2. output phase which does the following:

    • drop duplicate
    • flood if unknown
    • forward packet otherwise
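The flood/learn loop with hash-based duplicate detection can be sketched as follows. This is a toy illustration, not the paper's algorithm: the field names, the 1000-slot filter size, and the hash choice are all assumptions made for the sketch.

```python
import hashlib

class AxeSwitch:
    """Toy model of AXE-style flood/learn forwarding with hash-based
    duplicate detection (names and sizes are illustrative, not the
    paper's actual design)."""

    def __init__(self, filter_slots=1000):
        self.table = {}                     # learned: src address -> in-port
        self.dedup = [None] * filter_slots  # small hash-indexed filter

    def _seen_before(self, pkt):
        # Identify a flooded packet by its origin switch and that
        # switch's per-switch sequence number, then hash into a slot.
        key = f"{pkt['origin_switch']}:{pkt['seq']}".encode()
        slot = int.from_bytes(
            hashlib.blake2b(key, digest_size=4).digest(), "big"
        ) % len(self.dedup)
        if self.dedup[slot] == key:
            return True                     # duplicate copy
        self.dedup[slot] = key              # remember it
        return False

    def receive(self, pkt, in_port, ports):
        # Drop duplicate copies of flooded packets.
        if pkt.get("flooded") and self._seen_before(pkt):
            return []
        # When you see a packet, learn the reverse path.
        if pkt.get("learnable", True):
            self.table[pkt["src"]] = in_port
        out = self.table.get(pkt["dst"])
        if out is None:                     # when you don't know, flood
            pkt["flooded"] = True
            self._seen_before(pkt)          # remember our own flood
            return [p for p in ports if p != in_port]
        return [out]
```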

For the rest of the talk, the speaker showcased that AXE actually works by showing result for:

  • On a campus network and a cluster, failures cause very little flooding
  • The duplicate filter can be very small (1000 entries is enough)
  • Does it recover from overload? AXE is designed to handle overload (the speaker referred the audience to the paper)


Q: Can you prove the correctness of your protocol?
A: I am not sure how to do a formal proof, but we validated it through experiments

Q: what’s the overhead of extending the header?
A: We think the overhead is small because what we add is very small

Q: What's the percentage of the overhead?
A: We haven’t tested that

Q: You assume that you know there is a failure. What happens if a link fails, and you don't have keep-alives because you don't have control packets: how do you detect the failure and recover?
A: We can recompute the topology after failures (they decided to take the discussion offline)

SIGCOMM 2016 - Session 8 - Wireless - Paper 2: "Enabling Practical Backscatter Communication for On-body Sensors".

Presenter: Pengyu Zhang
Authors: Pengyu Zhang, Mohammad Rostami, Pan Hu, Deepak Ganesan

The driving question of this paper is the same as the last one: can implanted devices communicate directly with mobile devices such as smartphones, watches and tablets? Popular wireless radios, e.g., WiFi and Bluetooth, consume too much energy for low-power sensors. Therefore, this paper also focuses on backscatter as the go-to technique, but with a different approach to making it practical.

In particular, the authors focus on the fact that, in order to extract the weak backscattered signal, the system needs to deal with self-interference from the wireless carrier (WiFi or Bluetooth) without relying on built-in capability to cancel or reject the carrier interference. The proposed solution, FS-Backscatter, is motivated by a simple but effective intuition: if a backscatter tag can shift an incident WiFi or Bluetooth carrier to a clean WiFi or Bluetooth band, then the receiver sees a clean, carrier-interference-free backscattered signal in the shifted band.
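The intuition can be checked numerically: toggling the tag's antenna impedance at a rate Δf is equivalent to multiplying the incident carrier by a ±1 square wave, which moves the backscattered energy to f ± Δf, away from the interfering carrier. A toy sketch with scaled-down frequencies (the real system shifts a 2.4 GHz carrier by 20 MHz; the math is the same):

```python
import cmath
import math

def dft_mag(x, freq, fs):
    """Normalized DFT magnitude of samples x at the given frequency (Hz)."""
    n = len(x)
    return abs(sum(x[k] * cmath.exp(-2j * math.pi * freq * k / fs)
                   for k in range(n))) / n

# Illustrative numbers, not the paper's: an 80 Hz "carrier", a 20 Hz
# tag toggle, sampled at 1 kHz for one second.
fs, f_c, f_shift, n = 1000, 80, 20, 1000
t = [k / fs for k in range(n)]
carrier = [math.cos(2 * math.pi * f_c * ti) for ti in t]
# The tag toggling its antenna at f_shift multiplies the incident
# carrier by a +/-1 square wave at that rate.
square = [1.0 if math.sin(2 * math.pi * f_shift * ti) >= 0 else -1.0
          for ti in t]
backscattered = [c * s for c, s in zip(carrier, square)]
# Energy now sits at f_c + f_shift and f_c - f_shift, and almost
# nothing remains at the carrier frequency itself.
```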

In order to achieve a high data rate, the characteristics of packet-level and bit-level decoders are studied, and a low-power ring-oscillator-based clock generator is designed for the FS-Backscatter tag; it operates at tens of microwatts but also has temperature-induced frequency variations. The authors implement a prototype of FS-Backscatter and demonstrate that a 20 MHz frequency shift is enough to enable an FS-Backscatter tag to communicate with commercial WiFi and Bluetooth radios. The evaluation also shows that an FS-Backscatter tag operates at a power budget of 45 microwatts through the use of the ring-oscillator-based clock design, and is robust to frequency variations induced by environmental changes.

Q1. What are the differences between this work and the Interscatter work presented just now?
A: We share the same vision in that backscatter is a promising technique in supporting on-body/inbody sensors, and we take different approaches to provide practical backscatter solutions. It is expected that certain designs in Interscatter can be applied to improve the performance of FS-Backscatter.

Q2. What is the reason to shift the frequency by 20MHz?
A: We shift the signal by 20 MHz because that is the bandwidth of a WiFi channel. As long as the carrier is shifted to a clean channel, the receiver can see a clean, carrier-interference-free backscattered signal in the shifted band.

AC/DC TCP: Virtual Congestion Control Enforcement for Datacenter Networks

Presenter: Keqiang He (University of Wisconsin-Madison)
Coauthors: Eric Rozner (IBM Research)
     Kanak Agarwal (IBM)
     Yu (Jason) Gu (IBM)
     Wes Felter (IBM Research)
     John Carter (IBM)
    Aditya Akella (University of Wisconsin-Madison)

Congestion is not rare in datacenter networks, and 99.9th-percentile latency is orders of magnitude higher than the median, mainly because of queueing. Unfortunately, admins cannot control VM TCP stacks in multi-tenant datacenters. As a result, outdated or misconfigured TCP stacks could be running in VMs. This leads to two problems: large queueing latency, and TCP unfairness (e.g., ECN vs. non-ECN flows). Further, different congestion control algorithms could lead to unfairness.

AC/DC TCP (Administrator Control over Data Center TCP) implements TCP congestion control in the vSwitch and ensures that VM TCP stacks cannot impact the network. AC/DC TCP runs in the data plane of the vSwitch. In particular, it requires no changes to VMs or hardware, provides low latency by using state-of-the-art congestion control algorithms, improves TCP fairness, and enforces per-flow differentiation via congestion control.

The design of AC/DC TCP is as follows:
- Obtaining congestion control state: per-flow connection tracking
- DCTCP congestion control in the vSwitch: universal ECN marking, with ECN feedback piggybacked on existing TCP ACKs
- Enforcing congestion control: reuses RWND to bound the sender's window, and drops any excess packets from non-conforming flows
- Per-flow differentiation via congestion control
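The enforcement idea can be sketched as follows. This is a toy model, not the paper's implementation: it uses DCTCP's published update rule (alpha EWMA, multiplicative decrease by alpha/2), and the constants and method names are assumptions.

```python
class VSwitchDctcp:
    """Toy per-flow DCTCP state kept in the vSwitch, enforced by
    clamping the receive window (RWND) on ACKs headed back to the
    sender. Constants are illustrative."""

    def __init__(self, mss=1448, g=1 / 16):
        self.mss = mss
        self.g = g           # DCTCP's EWMA gain for the marked fraction
        self.alpha = 0.0     # smoothed fraction of ECN-marked packets
        self.cwnd = 10 * mss

    def on_ack_window(self, acked_pkts, ecn_marked_pkts):
        # DCTCP: alpha <- (1-g)*alpha + g*F; cwnd <- cwnd*(1 - alpha/2)
        frac = ecn_marked_pkts / max(acked_pkts, 1)
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if ecn_marked_pkts:
            self.cwnd = max(self.mss, int(self.cwnd * (1 - self.alpha / 2)))
        else:
            self.cwnd += self.mss   # additive increase per window
        return self.cwnd

    def rewrite_rwnd(self, rwnd_in_ack):
        # The VM's stack keeps its own cwnd; the vSwitch enforces the
        # DCTCP window by shrinking RWND in the ACK if needed, never
        # growing it.
        return min(rwnd_in_ack, self.cwnd)
```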

AC/DC TCP is implemented in OVS (Open vSwitch), and uses hardware support, such as TCP segmentation offload and checksum offload, for performance.

For evaluation, 17 servers and 6 switches were used, and AC/DC TCP was compared against CUBIC and DCTCP. The results demonstrate the following:
- Running DCTCP on top of AC/DC closely tracks the window size of native DCTCP.
- AC/DC has convergence properties comparable to DCTCP's and better than CUBIC's.
- AC/DC improves fairness when VMs use different congestion control algorithms.
- Less than 1% CPU overhead; each connection uses 320B to maintain congestion control state.
- In an incast scenario, AC/DC tracks the performance of DCTCP closely.
- Flow completion time: AC/DC performs as well as DCTCP and 36%-76% better than CUBIC.

Q: What led both papers to come up with the same / similar solution?
A: These are independent works and this is the most natural solution.

Q: Some applications want to have control over congestion control. Is there a mechanism that allows the VMs to take over?
A: Within datacenters, the throughputs of CUBIC and DCTCP are quite similar; our solution will not impact the performance of applications, and is closer to optimal for throughput.

Q: Can someone game the system by observing the changes you are making, and gain more share from the network?
A: Their packets will be dropped, so they will not gain much.

Q: In the incast experiment, why measured RTT instead of throughput?
A: Flow sizes are small in many incast scenarios. We haven't considered throughput.

Virtualized Congestion Control

Bryce Cronkite-Ratcliff (VMware / Stanford)
Aran Bergman (Technion)
Shay Vargaftik (Technion)
Madhusudhan Ravi (VMware)
Nick McKeown (Stanford)
Ittai Abraham (VMware)
Isaac Keslassy (VMware / Stanford / Technion)

Network operators are facing challenges to deploy new congestion control algorithms in a datacenter network. In particular, it is not trivial to deploy new algorithms such as DCTCP, TIMELY into a multi-tenant datacenter network where legacy tenants are not always able to upgrade to use new algorithms.

Instead of upgrading tenants, implementing a translation layer between legacy algorithms and new algorithms inside the vSwitch of a hypervisor can solve the problem. The hypervisor can easily modify flows from VMs by changing bits in the TCP header, buffering packets, faking ACK packets, or implementing a TCP proxy.

Enabling the ECN (Explicit Congestion Notification) bit is not always possible in legacy VMs. As a result, legacy non-ECN flows perform poorly because non-ECN packets can be dropped at a switch when competing with ECN packets. Translating non-ECN packets into ECN packets solves the problem: on the transmit path, the hypervisor marks packets as ECN-capable; on the receive path, upon receiving an ECE-marked ACK, the hypervisor shrinks the advertised window accordingly. Further, vCC also enables application-specific congestion control algorithms to be applied to particular flows.
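A minimal sketch of this translation, using a toy dictionary-based packet representation rather than real header parsing (the field names and the halve-on-ECE policy are assumptions for illustration, not vCC's exact mechanism):

```python
def vcc_translate_tx(pkt):
    """Transmit path: the hypervisor marks the guest's non-ECN packet
    as ECN-capable so switches mark it instead of dropping it."""
    out = dict(pkt)
    out["ip_ecn"] = "ECT0"   # advertise ECN capability on the wire
    return out

def vcc_translate_rx(ack, guest_rwnd, mss=1448):
    """Receive path: the guest cannot react to ECE, so the hypervisor
    absorbs the congestion signal by halving the window it advertises
    to the guest's peer, then clears ECE before delivery."""
    ack = dict(ack)
    if ack.pop("ece", False):
        guest_rwnd = max(mss, guest_rwnd // 2)   # react on the VM's behalf
    ack["rwnd"] = min(ack.get("rwnd", guest_rwnd), guest_rwnd)
    return ack, guest_rwnd
```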

Q: There are companies that do TCP header modifications. How is this work different from middleboxes?
A: We propose several techniques beyond TCP header modification for congestion control translation.

Q: Couldn't this be done outside the server, as a middlebox?
A: With middleboxes, you can only do TCP header modifications. You can do more with hypervisors.

Q: Is there a cleaner way to do this?
A: Yes, there could be. But it doesn't get too complicated if you do it this way.

Q: Overhead against Open vSwitch (OVS)?
A: We don't use OVS, and we don't have any numbers for OVS yet.

Q: What are the limitations of header modifications? What can't you do?
A: You can't cause the guest to send more aggressively than it normally would.

Q: What about UDP flows?
A: You might be able to, but it requires different techniques.

SIGCOMM 2016 - Session 8 - Wireless - Paper 1: "Inter-Technology Backscatter: Towards Internet Connectivity for Implanted Devices".

Presenter: Vikram Iyer
Authors: Vikram Iyer, Vamsi Talla, Bryce Kellogg, Shyamnath Gollakota and Joshua R. Smith

Smart medical devices such as smart contact lenses and implantable devices play a crucial part in treating and curing chronic diseases, and they have demanding requirements on network connectivity.

The driving question of this paper is: can these implanted devices communicate directly with mobile devices such as smartphones, watches and tablets? The key challenge is that these devices cannot use conventional radios such as WiFi, Bluetooth and ZigBee, because those radios consume too much power while the devices are extremely power constrained.

The solution proposed in this paper is to recycle radio-frequency signals from external devices, a.k.a. backscattering. Though many papers propose solutions using this technology, they tend to require extra hardware support. In this paper, the authors introduce interscatter, a novel backscatter communication system that works using only commodity devices by transforming transmissions from one technology to the other, over the air. For instance, interscatter can create 2–11 Mbps WiFi-standards-compliant signals by backscattering Bluetooth transmissions.

The basic idea of interscatter involves two steps. The first step is to create a single-tone transmission using a Bluetooth radio. The intuition is that Bluetooth uses two frequencies to encode the zero and one data bits; by transmitting a stream of constant ones or zeros, a single frequency tone can be created. To cope with Bluetooth's data whitening, the original payload bits can be chosen to reverse it, since the whitening sequence comes from a pseudo-random bit generator seeded with the channel number and is therefore known in advance.
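The whitening reversal can be sketched as follows. The LFSR below uses the x^7 + x^4 + 1 polynomial associated with Bluetooth LE's data whitening, but the exact seeding and bit ordering should be taken from the Bluetooth spec; the point of the sketch is only that a known, channel-seeded sequence can be pre-cancelled in the payload.

```python
def whitening_sequence(channel, nbits):
    """Whitening bits from a 7-bit LFSR with polynomial x^7 + x^4 + 1,
    seeded by the channel index (modeled on Bluetooth LE's data
    whitening; consult the spec for exact bit ordering)."""
    state = 0x40 | (channel & 0x3F)   # position 6 set to 1, channel below
    bits = []
    for _ in range(nbits):
        b = (state >> 6) & 1          # output taken from position 6
        bits.append(b)
        # Galois update: shift left, feed back into the x^4 and x^0 taps.
        state = ((state << 1) & 0x7F) ^ (0x11 if b else 0)
    return bits

def single_tone_payload(channel, nbits, tone_bit=1):
    """Pick payload bits so that, after XOR with the whitening
    sequence, the radio emits a constant bit stream, i.e. a single
    frequency tone."""
    return [w ^ tone_bit for w in whitening_sequence(channel, nbits)]
```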

The second step of interscatter is to generate 802.11b signals by backscattering the single tone from the Bluetooth device. Previous work on passive WiFi can achieve this goal, but also creates a mirror copy outside the ISM band. Interscatter proposes the first single-sideband backscatter architecture, which produces a frequency shift on only one side of the single-tone Bluetooth transmission. The intuition behind this architecture is to generate orthogonal signals so that the mirror-copy signal cancels out.
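The mirror-copy cancellation can be illustrated numerically: real mixing produces both sidebands, while quadrature mixing (two orthogonal, 90-degree-offset branches) shifts only one side. A scaled-down sketch with made-up frequencies, not the hardware's actual square-wave approximation:

```python
import cmath
import math

def mag_at(x, freq, fs):
    """Normalized DFT magnitude of samples x at the given frequency (Hz)."""
    n = len(x)
    return abs(sum(x[k] * cmath.exp(-2j * math.pi * freq * k / fs)
                   for k in range(n))) / n

# Toy numbers: shift a constant "single tone" by 100 Hz at 1 kHz sampling.
fs, f_shift, n = 1000, 100, 1000
t = [k / fs for k in range(n)]
tone = [1.0] * n   # the single tone arriving from the Bluetooth radio

# Ordinary (real) mixing creates BOTH sidebands: a mirror at -f_shift too.
double_sb = [v * math.cos(2 * math.pi * f_shift * ti)
             for v, ti in zip(tone, t)]
# Quadrature (I/Q) mixing synthesizes exp(j*2*pi*f_shift*t), shifting
# the tone to +f_shift only; the mirror copy cancels.
single_sb = [v * cmath.exp(2j * math.pi * f_shift * ti)
             for v, ti in zip(tone, t)]
```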

A prototype of the backscatter hardware is implemented on an FPGA platform. Evaluation shows that interscatter can generate 2–11 Mbps WiFi signals from Bluetooth transmissions. Several proof-of-concept applications, such as a contact-lens form-factor antenna and an implantable neural-recording interface antenna, are also implemented and evaluated.

Q1: Is it possible to charge these implanted devices wirelessly while performing interscattering?
A: Wireless charging is not the focus of our paper, but studying the feasibility of wireless charging and backscatter operating simultaneously would be interesting.

Q2. What is the next step for backscatter?
A: We hope for more involvement from the community in developing products using this technology. In terms of interscattering, our current applications are proofs of concept, and we are working on designing product-ready devices.

Q3. How small is the contact-lens prototype? 
A: The proof-of-concept prototype is smaller than WiFi/Bluetooth chips. (See Sec. 5.1)

Q4. Is this technology going to be open-sourced?
A: It is already licensed to a start-up. 

Eliminating Feedback in Next-Generation Cellular Networks

Best paper award winner.
Authors: Deepak Vasisht (presenting remotely), Swarun Kumar, Hariharan Rahul, Dina Katabi
This paper pioneers a solution to an important challenge for the deployment of next-generation cellular networks (5G MIMO and others): inferring channel parameters without explicit feedback from the clients. In the U.S., this task is greatly complicated by the widespread use of Frequency Division Duplexing (FDD), which means that the uplink and downlink connections (from and to the cell tower) use different frequency bands.

The authors’ solution to this problem is a transform that uses estimated path information to infer channel parameters from a phase-spatial power profile measured at the AP, following a simple observation: "while the channels change with frequencies, the underlying physical paths traversed by the signal stay the same" (p. 399). Therefore, the system can estimate the wireless channel on a frequency band different from the one where the measurements are collected, all within the 20 MHz bands available from commercial providers in the U.S.

The paper’s algorithm (sections 4 and 5) accounts for multiple complications in the estimation of channel parameters: phase variations related to the different wavelengths of each channel; a windowing effect, related to the spatially limited placement of antennas on cell towers; and superposition, following multi-path propagation. They solve each of these issues using a combination of mathematical techniques (including modified Fourier transforms).
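The observation above can be written as a forward model: the channel at any frequency is a sum over the physical paths, which themselves do not depend on frequency. The hard part the paper solves, estimating those paths from uplink measurements, is not reproduced here; the path gains and lengths below are made up for illustration.

```python
import cmath
import math

C = 3e8  # speed of light, m/s

def channel(freq, paths):
    """Channel as a sum over physical paths:
    h(f) = sum_k a_k * exp(-j * 2*pi * f * d_k / c).
    The paths (gain a_k, length d_k) are frequency-independent; the
    resulting channel is not."""
    return sum(a * cmath.exp(-2j * math.pi * freq * d / C)
               for a, d in paths)

# Toy two-path environment (illustrative numbers).
paths = [(1.0, 10.0), (0.4, 23.5)]
f_up, f_down = 1.7e9, 1.8e9   # FDD: uplink and downlink bands differ

h_up = channel(f_up, paths)      # what the tower can measure directly
h_down = channel(f_down, paths)  # what it needs for downlink beamforming
# Once the paths are estimated from the uplink, the downlink channel
# follows from the same model, even though h_up itself is a poor proxy.
h_down_from_paths = channel(f_down, paths)
```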

Q: Were the transmitters and receivers on the same elevation? Can you comment on three-dimensional aspects.
A: Yes. For testing the angular analysis, the two-dimensional testing is all that matters.

Q: When you estimate the antenna parameters, how many measurements do you take to minimize channel error?
A: No need for multiple measurements.

Q: You need uplink packets being sent before, which might not work for UDP?
A: When the link is established an uplink packet is sent, before any protocol kicks in.

Q: Can you comment on the analysis inaccuracies introduced?

A: As far as we can estimate it, the analysis works quite nicely.

Real-Time Distributed MIMO Systems

Authors: Deepak Vasisht, Swarun Kumar, Hariharan Rahul (presenting), Dina Katabi

MIMO greatly improves the WiFi performance in venues with a high concentration of clients (such as SIGCOMM). This paper’s most exciting contribution is demonstrating the feasibility of a full real-time, distributed MIMO wireless networking system (802.11). Such a system must meet some critical constraints: a full implementation of the 802.11 physical layer; meeting the timing constraints of 802.11; and supporting mobile clients in dynamic environments.

The system described in the paper introduces a new method to estimate the channel parameters (which are critical to implementing real-time MIMO), by using a mathematical model that improves state-of-the-art channel reciprocity estimation. Further, the authors’ system introduced explicit feedback from the automatic gain control (AGC)’s behavior into MIMO signaling and channel estimation. Finally, MegaMIMO allows for timing coordination between multiple devices.

Q: regarding the initial calculation. Does the estimate assume homogeneity across two consecutive packets?
A: we solve that by using time-separation between master- and slave-sent packets.

Q: what do you do when there are no packets sent on the uplink?
A: there are generally packets being sent, for instance ACK packets for TCP.

Q: evaluation had a fixed number of users, how does it scale?
A: We know from previous work that it scales well up to 10; in this paper we saw linear scaling to 4.

Q: what about calibration across different hardware versions?
A: calibration is done locally, no need to calibrate across devices.

Q: does it require changes to the client’s hardware/software?
A: no, changes required at the AP, not the client.

RDMA over Commodity Ethernet At Scale

Presenter: Chuanxiong Guo
Co-Authors: Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitu Padhye, Marina Lipshteyn

TCP is still the dominant protocol for transmitting packets, but it suffers from latency issues and CPU overhead. RDMA, on the other hand, needs a lossless network, which it obtains through priority-based flow control (PFC) that prevents buffer overflow.

In this talk, the authors discussed their experience deploying RDMA over commodity Ethernet (RoCEv2) in Microsoft's datacenters.  They also discuss how they overcame some of the challenges they encountered.

PFC is a hop-by-hop flow control protocol that uses the priority each packet carries in its VLAN tag. The authors argue that this doesn't scale well. It breaks PXE boot, which is needed for OS updates, and there is no standard way to carry VLAN tags in layer-3 networks. They would much rather use the DSCP field in the IP header, removing VLAN-related issues completely.

In addition to this, they observed transport livelock despite low packet drop rates. To overcome it, they chose to retransmit from the last sent packet rather than starting again from packet 0. They do not observe any deadlocks: since packets travel up and then down the topology, there are no loops, and hence there cannot be deadlocks.
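A toy loss model makes the cost of the two retransmission policies concrete (this is an illustration of the idea, not the NIC's actual transport logic; the "each packet lost exactly once" assumption is ours):

```python
def transmissions(msg_len, dropped, restart_from_zero):
    """Count packet sends needed to deliver packets 0..msg_len-1 when
    each packet in `dropped` is lost exactly once on first attempt."""
    sent, i, drops = 0, 0, set(dropped)
    while i < msg_len:
        sent += 1
        if i in drops:
            drops.discard(i)                    # this packet is lost only once
            i = 0 if restart_from_zero else i   # restart-from-0 vs. resume here
        else:
            i += 1
    return sent
```

With repeated losses the restart-from-0 policy keeps re-sending the whole prefix, which is how a modest drop rate can turn into livelock; resuming at the lost packet pays only one extra send per loss.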

Further, a malfunctioning NIC may block the entire network through a cascade of pause frames (buffers waiting on each other) with a domino effect. They install watchdogs that monitor for this pause-frame pattern and stop the process.

A latency comparison between RoCEv2 and TCP shows that the former's 99.9th percentile is better than even the 99th percentile of the latter. RoCEv2 also handles incast directly, since it is lossless. On a network with 500 servers across two podsets, RDMA achieves 3 Tb/s of throughput. Similar measurements were performed while shuffling data: latency increases as shuffling occurs, and the authors observe that, in practice, it isn't possible to achieve both low latency and high throughput at once, even when congestion is small.

In the future, the authors hope to explore RDMA for inter-datacenter communication, gain a deeper understanding of deadlocks, and develop more applications that use RDMA instead of TCP.

Q: You haven't talked about the classical problem of lossless network (tree-saturation) - localized congestion that spreads out and blocks other innocent bystander traffic at other ports. Do you notice this?
A: We do experience this in our production networks. We have several approaches to deal with this. DCQCN could be used, buffer sizes are tuned so that we can use Dynamic Buffers and we have ways to deal with pause storms.

Q: You conduct your experiments at livelock. Why did you need retransmission despite that?
A: Lossless implies packets shouldn't be dropped due to congestion. But, there could be other kinds of packet drops like due to switch hardware issues or corruption problems. Hence, we need packet retransmission to fix those.
Q : Are those frequent?
A : The network is large scale, so we do need to handle it.

Q: Can you share details on workload on top of RDMA? Why are these latency sensitive and what are the latency requirements?
A: We currently have 2 types of workloads - Bing traffic/indexing is very important there. The requirement there is to reduce latency to millisecond level. Other type of workload - usually for storage, they are throughput-centric workload, move a lot of traffic. Hence, we need to reduce CPU overhead there.

Q: Are there a lot of NIC firmware changes or OS-level changes?
A: To make the NIC work, the transport layer is implemented in NIC firmware, with a driver. But mechanisms like PFC and the transport protocols avoid NIC firmware changes. Our vendors provide us with the necessary NIC firmware.

ProjecToR: Agile Reconfigurable Data Center Interconnect

Presenter: Monia Ghobadi
Co-Authors: Ratul Mahajan, Amar Phanishayee, Nikhil Devanur, Janardhan Kulkarni, Gireeja Ranade, Pierre-Alexandre Branche, Houman Rastegarfar, Madeleine Glick

Currently, datacenter networks use fixed cables to connect ToR switches to one another. These links therefore have static capacity, and the only way to change a connection is to send someone to the field to physically change it. However, according to the authors, many rack switches are either under-utilized or over-utilized: many racks don't exchange much traffic, but some generate a lot. Hence, there is a need for reconfigurable interconnects that can adjust capacity dynamically.

Desirable properties of such interconnects include a way of augmenting the standard capacity by maintaining separate static and reconfigurable portions; a high fan-out, i.e., a large number of direct links to other racks, so that heavy traffic can be sent along direct links; and a low switching time, so traffic can be moved fast. The authors argue that ProjecToR provides all of these.

The key insight in ProjecToR is to remove all the cables and use light instead. The medium of transmission is free space, and the device used is a digital micromirror device (DMD) to direct the light, plus a magnifier to adjust reach. By changing the bit pattern uploaded to the DMD, light can be redirected elsewhere. The number of accessible locations where the light can be redirected is proportional to the total number of micromirrors, but some accessible locations need to be skipped to eliminate interference, resulting in about 18K accessible locations, or fan-out (all within ±3 degrees, though). To address this last point, the narrow angular reach, ProjecToR uses angled mirrors to extend the reach, and the mirror assembly is then designed according to the datacenter's requirements (for example, a disco-ball structure).

How feasible is a ProjecToR interconnect?
The authors built a small prototype and micro-benchmarked it. The prototype contained 3 ToR switches that used ProjecToR, with a source laser to send light, an FPGA to program the DMD, and mirrors to reflect light towards the receiving ToRs. The authors ran evaluations on this for over a day at 10-second intervals. The throughput of this setup matches that of a wired link. The switching time, measured by changing the destination to ToR 3 and measuring light intensity at both places to know when switching completed, was about 12 microseconds.

The topology can be configured using connections between lasers and photodetectors. But the switching time is still larger than the packet transmission time. So the system uses a dedicated default topology and then builds an opportunistic topology on the fly. The dedicated topology uses K-shortest-paths routing and a virtual queue, and is used for smaller flows. An opportunistic link can be created on the fly and is used immediately for elephant flows. Choosing active links for the opportunistic topology is similar to existing switch-scheduling problems. The system uses the Gale-Shapley algorithm and comes very close to an optimal offline scheduling algorithm.
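The matching step can be sketched with the classic Gale-Shapley stable-matching algorithm (the preference lists below are arbitrary stand-ins; in ProjecToR they would be derived from queued traffic demand between racks):

```python
def gale_shapley(src_prefs, dst_prefs):
    """Stable matching of sending ToRs (lasers) to receiving ToRs
    (photodetectors). Classic Gale-Shapley: senders propose in
    preference order; receivers keep their best proposal so far."""
    dst_rank = {d: {s: r for r, s in enumerate(prefs)}
                for d, prefs in dst_prefs.items()}
    free = list(src_prefs)            # senders still seeking a match
    next_prop = {s: 0 for s in src_prefs}
    engaged = {}                      # dst -> currently matched src
    while free:
        s = free.pop()
        if next_prop[s] >= len(src_prefs[s]):
            continue                  # s has proposed to everyone
        d = src_prefs[s][next_prop[s]]
        next_prop[s] += 1
        cur = engaged.get(d)
        if cur is None:
            engaged[d] = s            # d accepts its first proposal
        elif dst_rank[d][s] < dst_rank[d][cur]:
            engaged[d] = s            # d trades up; cur tries again
            free.append(cur)
        else:
            free.append(s)            # d rejects; s proposes elsewhere
    return {s: d for d, s in engaged.items()}
```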

Cost is estimated based on cost breakdown of the individual components.

Simulations were run on 128 ToRs with 16 lasers and photodetectors, using day-long traffic. ProjecToR's average flow completion time increases very slowly, because of its reconfigurability, even on a skewed traffic matrix. Firefly and fat-tree are up to 95% worse: the former because of its slow switching and low fan-out (multi-hop needed), while the fat-tree has no reconfigurability.

Q: Free-space optics are very vulnerable to vibrations and outdoor variations which is an issue in datacenters. Have you had any experience with this?
A: Firefly has shown that it can be tolerable within 5mm; we use an optical bench that is more tolerant. If you miss something at a photodetector, you might be able to redirect it and capture it.

Q: Have you discussed this with computational geometry people? They might be able to provide some insights.
A: I have done simulations with typical datacenter topologies and came up with the disco ball, but I haven't looked into reconfiguring the datacenter itself.

Q: Two practical points from optical communication perspective. Cross talk is one and link-recovery in the optical receiver is the other (transceiver also has a link detection phase and would be available only after milliseconds).
A: We eliminate adjacent angles to remove crosstalk (by skipping every 4 degrees). Prior work has shown that the latency of link detection might be brought down to nanoseconds.

Q: I am trying to compare free space to switched-spectrum approaches (you could get single-hop reconfigurability there too). I'm not sure why free space is more desirable.

A: One reason is scalability since building a large switch in closed form is complicated.

Robotron: Top-down Network Management at Facebook Scale

Presenter: Yu-Wei Eric Sung
Co-Authors: Xiaozheng Tie, Starsky H.Y.Wong, Hongyi Zheng

Managing networks of the size of Facebook's involves tasks like router/switch turn-up and circuit migration, which usually need a lot of human interaction. This is challenging since it can involve low-level configurations across distributed devices, and these networks can span multiple domains and sub-networks with different software versions. As a result, incorrect management can cause a lot of issues for users and significantly impact Facebook, too. This becomes particularly challenging as the network grows bigger.

Facebook's network is set up such that there are multiple POPs near the user, which may or may not be able to satisfy an incoming request. If the POP cannot satisfy the request, it is sent through a backbone to a datacenter (DC), and the response is sent back. Management tasks in this kind of network include adding or migrating circuits and routers in the backbone, and building clusters or upgrading capacity in the datacenter. It is also possible, especially in datacenters, for multiple generations of cluster architectures to co-exist. This makes these tasks particularly challenging as the network evolves.

Facebook implements this top-down network management through a model of the network called FbNet, which has been extended to model the entire network. There are read and write APIs to interact with the model, where writes consist of atomic tasks. Storage consists of a single primary database and multiple secondary databases, some of which are locally available.

Network design intent is conveyed through the creation of distinct objects. For example, for a cluster, FbNet creates distinct objects for the network switch, the linecard, the physical interfaces, prefixes, circuits, etc., with relationship fields pointing to other objects. Design intent is then translated into device configs: starting from templates for a cluster, which are validated, then converted into FbNet objects, and then into per-device config objects. All of these are vendor-agnostic. Finally, they are converted into vendor-specific configs and deployed.
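The objects-with-relationship-fields style and the config-generation walk can be sketched roughly as follows (class names, fields, and the emitted config syntax are all illustrative stand-ins, not Robotron's actual schema or templates):

```python
from dataclasses import dataclass, field

@dataclass
class Interface:
    name: str
    prefix: str        # e.g. "10.0.1.0/31"
    peer: str = ""     # relationship field: far-end interface, if any

@dataclass
class Linecard:
    slot: int
    interfaces: list = field(default_factory=list)

@dataclass
class Switch:
    name: str
    linecards: list = field(default_factory=list)

def to_device_config(sw):
    """Stand-in for template-driven generation: walk the
    vendor-agnostic objects and emit per-device config lines."""
    lines = [f"hostname {sw.name}"]
    for lc in sw.linecards:
        for intf in lc.interfaces:
            lines.append(f"interface {intf.name}")
            lines.append(f"  ip address {intf.prefix}")
            if intf.peer:
                lines.append(f"  description to {intf.peer}")
    return "\n".join(lines)
```

A vendor-specific backend would take the same objects and render them into each vendor's syntax, which is what keeps the upper layers vendor-agnostic.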

By monitoring the usage of FbNet, the authors observed that the FbNet model changes a lot over time, continuing to gain new models and relationships rather than stabilizing. POP and DC designs change a lot, with a single design change affecting almost 1000 devices on average. Backbones also change, but a change doesn't affect more than 10 devices on average. Similar results are seen when looking at the number of configuration lines that change as designs change.

Robotron has evolved bottom-up over time and has been driven largely by experience: FbNet was first modeled in 2008, active and passive monitoring was added in 2011, and deployment has expanded since. Over this period, the authors gained key insights into modeling, into the coupling between network design, config generation and deployment, and into the occasional need for manual involvement. These came directly from experiences involving egress link saturation and a failure to deploy after designing a new rack.

Q: You talked about your network model: there are some fields, and some of them have values. You talked about the syntax, but not the semantics. How would you define the meaning so that formal analysis could be conducted on it?
A: There is no formal semantics. We would like to work more on that too.

Q: About your claim that this is the first paper of its kind: there is a conference called NOMS for network-management papers, including carrier networks.
A: I am not very aware, but will check it out.

Q: At the end of your talk, you mentioned you needed a manual override. Do you have any characterization of the types of situations where this was necessary?
A: One example: a version of the model was in use that didn't model BGP policies correctly, and we couldn't wait for a fix.
Q: Are there fundamentally some things Robotron cannot handle or is it just lack of anticipation?
A: It is usually lack of anticipation. But, sometimes, performance isn't guaranteed even if Robotron can handle something, so it becomes important to manually handle some things.

Globally Synchronized Time Via Datacenter Networks

Presenter: Ki Suh Lee
Co-Authors: Han Wang, Vishal Shrivastav, Hakim Weatherspoon

Synchronized clocks are important for networked and distributed systems, since precise clocks help with monitoring, coordination, updates, etc. Broadly, clock synchronization protocols measure the offset, i.e., the time difference between two clocks; the precision is the maximum offset between any two clocks. Two servers exchange messages containing their timestamps, and using this information together with the round-trip time of the messages, one can usually compute the offset between the two servers.
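The standard four-timestamp exchange (as used in NTP-style protocols) computes the offset and delay roughly as follows; this is a generic sketch, not code from the paper:

```python
def offset_and_delay(t1, t2, t3, t4):
    """
    t1: client sends request      (client clock)
    t2: server receives request   (server clock)
    t3: server sends reply        (server clock)
    t4: client receives reply     (client clock)
    Assumes symmetric one-way delays; path asymmetry shows up as offset error.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2   # estimated server-minus-client offset
    delay = (t4 - t1) - (t3 - t2)          # round-trip time minus server hold time
    return offset, delay

# Server clock 5 units ahead, 1 unit of symmetric network delay each way,
# server holds the request for 1 unit before replying:
off, rtt = offset_and_delay(t1=0, t2=6, t3=7, t4=3)
print(off, rtt)  # 5.0 2
```

This also makes the next paragraph's point concrete: the offset estimate is only as good as the RTT measurement and the symmetry assumption behind it.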

However, measuring the RTT accurately is hard: clocks may be skewed, timestamping can be imprecise, and the network can induce variable delays on the messages exchanged between the two servers. PTP (Precision Time Protocol) tries to overcome this with hardware timestamping on PTP-enabled switches, yet it doesn't provide bounded precision and requires PTP-enabled devices throughout the network. Approaches like NTP suffer from much larger precision errors.

The authors propose DTP, which exploits the physical layer, where clocks are already synchronized between peers. This approach doesn't impact higher layers and has no network overhead. It achieves nanosecond-level precision, since the physical-layer clock updates every few nanoseconds.

A DTP-enabled device has a local counter that is incremented at every clock tick and adjusted upon receipt of a message from a peer. The DTP protocol overwrites /E/ blocks in the PHY layer with the time and sends these messages to its peer. Message exchange involves two phases: INIT and BEACON. Each peer sends an INIT message, which is echoed back to the sender as an ACK; from this, one can compute the one-way delay. A BEACON message contains the sender's local counter value; the receiver uses this value, the measured delay, and its own local counter to update its clock. The error in this process is at most 4 clock ticks (2 from the one-way delay measurement and 2 from the beacon exchange).
To synchronize externally, a DTP daemon can poll these counter values and correlate them with UTC values. DTP needs NIC and switch modifications, like PTP, but gives better precision bounds than PTP.
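A highly simplified model of the counter update described above, ignoring the physical-layer encoding entirely; class and method names are illustrative, not DTP's actual implementation:

```python
class DTPPort:
    """Simplified DTP peer: a free-running tick counter plus BEACON handling."""

    def __init__(self, start=0):
        self.counter = start
        self.owd = 0          # one-way delay in ticks, measured via INIT/ACK

    def tick(self):
        self.counter += 1     # incremented on every physical-layer clock cycle

    def measure_owd(self, rtt_ticks):
        # INIT is echoed back as an ACK; assuming symmetric paths, the
        # one-way delay is half the measured round trip.
        self.owd = rtt_ticks // 2

    def on_beacon(self, remote_counter):
        # Never move the counter backwards; adopt the peer's counter plus
        # the measured one-way delay if it is ahead of ours.
        self.counter = max(self.counter, remote_counter + self.owd)

a, b = DTPPort(start=100), DTPPort(start=90)
b.measure_owd(rtt_ticks=4)        # 2 ticks each way
b.on_beacon(remote_counter=a.counter)
print(b.counter)                  # 102: a's counter plus the one-way delay
```

The max() rule is why the peers converge rather than oscillate: each beacon can only pull a lagging counter forward, and the residual error stays within the few-tick bound quoted above.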

To evaluate DTP, the authors used an FPGA development board along with a few other modules and set up 12 identical servers in a tree topology, measuring the offset between peers in this 3-level network. They also evaluated PTP on the same topology. The evaluations showed that while the PTP offset is about tens to hundreds of nanoseconds when idle, it grows to nearly 100 microseconds under load. The DTP offset, on the other hand, is always within 4 clock ticks, i.e., within a few nanoseconds. Even across a datacenter path with 6 hops, the offset between peers is about 150 nanoseconds, and the end-to-end application-level error is no more than 200 ns.

Q: Do you think the ability to hack into the physical layer is generally useful for other reasons? How easy was it to get into the physical layer?
A: It is not very hard to insert something into the physical layer; once you understand it and know what you are doing, you can modify the bits to do other things.

Q: Does this translate into the wireless domain at all?
A: The wireless domain is a little bit different; there are collisions, and wireless doesn't guarantee that something is being sent continuously.
Q: What if it was all point-to-point?
A: yes (no interference)

Q: You make some assumptions about the frequency. What if this was meant to run on a different link with different frequency?
A: It depends on the relationship between the clock frequencies. If you control the counter increments relative to the frequency carefully, it might still work.

Q: This needs to be standardized to be used. Any plans?
A: Looking forward to making this a standard for others to use.

Q: This mechanism keeps it internally synced, but could drift from a global counter. What are your thoughts on external synchronization?
A: We use a connected FPGA board that is synchronized via GPS, and this acts as a global clock; you could potentially synchronize all internal clocks to that global clock.

Wednesday, August 24, 2016

Community Feedback Session

Chair presentation (slides available online):

Major announcement: change in CCR policy, lifting the page limit for papers including data to aid replication.

SIGCOMM 2017 will be in Los Angeles. Applications for the 2018 location are open until September, with a preference for European submissions.

The SIG ran a loss this year, which was deliberate, as the fund is in a healthy state ($1M of discretionary spending available). A large number of travel grants are awarded each year, and conferences can be sponsored by the SIG.

There is student support available for national summits and summer schools as well.

New SIGCOMM communication channels on Facebook and Twitter.

Nomination committee for next SIG officers will be active this fall.

Feedback session:
-What are your thoughts on remote participation?
Mixed feelings: some of the presentations were highly polished, but the hallway feedback is lost, and it is hard to replace the in-person experience for both the speaker and others interested.
Multi-room chats might help, but will be challenging to implement.
Remote Q&A enhanced the experience.

Suggestion: record and play video for all talks at a workshop to try this new format. Perhaps request 1-minute video with all submissions, all or most of which would be accepted.
Suggestion: allow for remote participation from outside the conference, not only from authors but also from the audience.
Suggestion: loyalty points for sponsors of the conference (example: Cisco, who sent personnel to Brazil).

Question: will page limit lifting be transferred to SIGCOMM as well?
Answer: we are usually reluctant to make changes to the well-run, long-standing conference, but if the CCR experiment goes well we will consider it.

Question: why is SIGCOMM held on the second or third week of August, quite inconvenient to some?
Answer: the specification has long been between Aug. 15 and Sept. 15. We encourage organizers to consider travel conditions and various scheduling conflicts. However, this range can be flexible and changes can be considered; a survey could be run to gauge people's preferences, so a fall date could be feasible.
Motivation for date: keeping clear of the teaching calendar and other important deadlines for our community. 

Commitment made to organize survey.

Question: has SIGCOMM considered rolling deadlines, which would greatly benefit graduate students?
Answer: three parts. There are other conferences in the area, and CCR is also a good alternative. There is PACM (Proceedings of the ACM), which constrains the deadlines to a more journal-like format. Also, there is a tradition of single-track, as SIGCOMM is the level-setting conference for this community.
Chair: “I tried very hard but could not succeed, perhaps new chair can look into it.”
Citation can be changed to CCR for all related publications. 

Challenge: need to find a PC that is willing to work with a rolling deadline and multiple meetings.

Suggestion: papers accepted to CCR, to include a video like the remote presentations we saw today.

Question: can we get some clarification on the number of submissions, and how many were "experience track" papers? They are not flagged in the proceedings.
Challenge: industry experience usually cannot be replicated, but there is a risk of accepting plainly descriptive papers.
Experience track is managed by TSC.

Question: can live streaming be added to the remote participation?
Answer: SIGCOMM has been live streamed for five years. 
Follow-up: quality could be better, should we consider a different provider?

Answer: we are not tied to a provider for next year, so please send me (the chair) any concerns you might have.