Friday, August 25, 2017

SIGCOMM'17 - Session 10 (Peering) - Detecting Peering Infrastructure Outages in the Wild

Presenter: Vasileios Giotsas
Co-Authors: Christoph Dietzel, Georgios Smaragdakis, Anja Feldmann, Arthur Berger, Emile Aben

The link to the paper.

In recent years the Internet infrastructure has changed considerably. One such change is known as the "flattening" of the Internet: inter-domain traffic increasingly flows directly between edge networks at peering infrastructures (IXPs and colocation facilities, CFs), bypassing transit providers (ISPs). As peering across hundreds or thousands of cloud datacenters becomes commonplace, IXPs and CFs gain popularity. Yet little effort has gone into understanding outages at these infrastructures and their impact. This paper addresses that gap: it proposes a system that provides near real-time outage detection at close to building-level granularity, and reports the experience of running it for a few years.

The authors propose Kepler, a system with three goals: detect outages (automatically, in a timely manner and at the building level), localize them (distinguishing the source of an outage from cascading failure effects), and track them (determining the duration, the shifts in routing paths and the geographic spread). After an initialization phase that parses BGP routes, Kepler watches for outage signals and then investigates those signals to confirm and locate the outage. To pinpoint outages precisely, it needs high-resolution co-location maps derived from BGP communities, which let it decorrelate the behavior of the affected ASes.
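The idea of turning location-tagged BGP routes into an outage signal can be illustrated with a small Python sketch. This is not Kepler's actual algorithm: the community-to-facility dictionary, the function names, the 50% threshold and the minimum number of affected ASes are all assumptions made up for illustration, and the AS numbers and prefixes are documentation-only values. The sketch compares a baseline of stable routes against a later snapshot and flags a facility when a large share of the routes tagged with it, spread over several ASes, lose that tag at once.

```python
from collections import defaultdict

# Hypothetical dictionary: BGP community (ASN, value) -> facility label.
# A real system would build this from operator documentation.
COMMUNITY_TO_FACILITY = {
    (64500, 1001): "facility-A",
    (64500, 1002): "facility-B",
}


def facilities_of(communities):
    """Return the facilities referenced by a route's communities."""
    return {COMMUNITY_TO_FACILITY[c] for c in communities
            if c in COMMUNITY_TO_FACILITY}


def outage_signals(baseline, current, threshold=0.5, min_ases=2):
    """Flag facilities whose tagged routes largely disappeared.

    baseline, current: dict mapping (asn, prefix) -> list of communities.
    A route missing from `current` is treated as withdrawn.
    threshold and min_ases are illustrative assumptions.
    """
    tagged = defaultdict(int)      # routes tagged with facility in baseline
    lost = defaultdict(int)        # of those, routes no longer tagged
    lost_ases = defaultdict(set)   # distinct ASes behind the lost routes

    for (asn, prefix), communities in baseline.items():
        now = facilities_of(current.get((asn, prefix), ()))
        for facility in facilities_of(communities):
            tagged[facility] += 1
            if facility not in now:
                lost[facility] += 1
                lost_ases[facility].add(asn)

    signals = {}
    for facility, total in tagged.items():
        fraction = lost[facility] / total
        # Requiring several ASes avoids blaming a facility for a change
        # made by a single network.
        if fraction >= threshold and len(lost_ases[facility]) >= min_ases:
            signals[facility] = fraction
    return signals


A, B = (64500, 1001), (64500, 1002)
baseline = {
    (64496, "192.0.2.0/24"): [A],
    (64497, "198.51.100.0/24"): [A],
    (64498, "203.0.113.0/24"): [A],
    (64498, "203.0.113.128/25"): [A],
    (64499, "192.0.2.128/25"): [B],
}
current = {
    (64496, "192.0.2.0/24"): [B],     # rerouted through facility-B
    (64498, "203.0.113.0/24"): [A],   # unchanged
    (64499, "192.0.2.128/25"): [B],   # unchanged
}

print(outage_signals(baseline, current))  # {'facility-A': 0.75}
```

In this toy input three of the four routes tagged with facility-A, coming from three different ASes, either vanish or move to facility-B, so only facility-A is flagged. A flagged facility is only a signal; separating the true outage source from cascading effects would still require the investigation step described above.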

In the evaluation, the authors showed that Kepler does a very good job of de-noising BGP routing activity and producing strong outage signals. It can also detect peering infrastructure outages in the wild and measure their impact.
This result is significant because i) most of today's outages are not detected automatically (they surface only when operators raise questions on mailing lists), and ii) monitoring or measuring these outages has not been achievable so far. Kepler addresses both problems.

To conclude, Kepler uses passive BGP monitoring to detect infrastructure-level outages in a timely and accurate manner, and provides hard evidence on peering infrastructure outages that can later improve accountability, transparency and resilience strategies.


Q1: On the world map where you show the communities, the coverage seems biased toward European countries. Can you comment on that?
A1: Yes, the majority of the communities are in Europe or America. The dataset is indeed biased by how we collected it, and there are also simply more IXPs in Europe.

Q2: For the accuracy and precision numbers, where does the ground truth on the number of outages come from?
A2: The information comes from mailing lists, social media posts, operators, and technical reports.

Q3: Any comments on the possibilities causing the outages?
A3: Most outages are caused by fiber cuts, which actually makes root-cause diagnosis very efficient.

Q4: In your dataset, how do you decide whether an event is maintenance or a failure?
A4: In the case of maintenance, it happens in a successive way. Whenever it affects a large part, it can be this case. Of course, sometimes it is not easy to distinguish.