Friday, August 25, 2017

SIGCOMM'17 - Session 11 (Routing) - Paper 3: The Impact of Router Outages on the AS-level Internet

Summary:

This paper uses large-scale measurement to understand how much the AS-level Internet depends on individual routers. The authors surveyed 149,560 routers across the Internet for 2.5 years; 59,175 of them (40%) experienced at least one reboot. The study identified specific routers that were single points of failure for the prefixes they advertised. A novel active probing technique is designed to identify router restarts. Networks are associated with routers using trace data, and global BGP activity is then correlated with router restarts. To infer single points of failure, outages are correlated with route withdrawals by testing whether their time windows overlap. The results show that only 4.0% of routers were correlated with complete withdrawals, involving 3,396 prefixes, where the routers appeared in traceroute paths towards the prefixes.
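The window-overlap test mentioned in the summary can be illustrated with a minimal sketch. This is not the authors' code: the `Window` type, the function names, and the use of closed intervals in epoch seconds are all assumptions made for illustration, and the paper may define its windows differently.

```python
from typing import Dict, List, NamedTuple


class Window(NamedTuple):
    """A closed time interval [start, end], in seconds (hypothetical type)."""
    start: float
    end: float


def windows_overlap(a: Window, b: Window) -> bool:
    """Return True when the two closed intervals share at least one instant."""
    return a.start <= b.end and b.start <= a.end


def correlated_withdrawals(outage: Window,
                           withdrawals: Dict[str, Window]) -> List[str]:
    """Return, sorted, the prefixes whose withdrawal window overlaps the outage.

    `withdrawals` maps a prefix to the window in which it was withdrawn
    (an assumed input shape, not taken from the paper).
    """
    return sorted(prefix for prefix, window in withdrawals.items()
                  if windows_overlap(outage, window))
```

For example, a router outage inferred in `Window(100, 200)` would be matched with a prefix withdrawn during `Window(150, 260)` but not with one withdrawn during `Window(300, 400)`.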


Q: What you identified is the percentage of routers that are single points of failure. You sort of hide 6 percent of them in the customer edge; presumably customers are harming themselves by having a single point of failure. For the other 40%, how much of the Internet gets taken out, i.e., at the ISP edge, on devices serving multiple customers? So the point isn't necessarily that 4% of the Internet can be taken out by these single points of failure, but that in fact some larger fraction is disabled. Do you have any sense of that?

A: It is not the case that 40% are outside of the customer edge. I think another 20% or so are hidden in the customer edge, so you are left with a small share, maybe around 10%, outside of the customer network. And you can imagine that if you are providing a tunnel service, the providers will say one of the cases will be sorted out.