Presented by: Thomas Holterbach
2.5 minutes: the worst-case convergence time of a router after a link failure. After a failure, the router needs to learn about the failure and then update its forwarding entries. It can take minutes for the router to scan its routing table. There are existing solutions for when the failed link is between the router's AS and a neighboring AS. However, no current solution works for remote link failures. Furthermore, remote outages are numerous: the authors analyzed bursts of withdrawals from RouteViews and RIPE RIS and found more than 3000 bursts in November 2016.
SWIFT can reduce the convergence time of BGP from 13s to 2s, can fast reroute around failures in 40ms, and is deployable on today's routers. SWIFT trades accuracy for speed by working at region granularity.
SWIFT monitors BGP streams, uses an algorithm to estimate the probability that a link has failed, and, after detecting a link failure, predicts future withdrawals. It uses two metrics. The withdrawal share is the fraction of withdrawals crossing a link. The path share is the proportion of prefixes with a path on a link that have been withdrawn. Their product is the fit score, which gives the likelihood that a specific link has failed.
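The two metrics and their product can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fit_scores`, the dictionary inputs, and the representation of a link as a pair of adjacent AS numbers are all hypothetical choices made for the example.

```python
from collections import Counter


def path_links(as_path):
    """Return the set of AS-level links (adjacent AS pairs) on a path."""
    return set(zip(as_path, as_path[1:]))


def fit_scores(withdrawn, still_routed):
    """Compute a fit score per link (hypothetical helper, not SWIFT's API).

    withdrawn:    dict prefix -> AS path in use before the withdrawal
    still_routed: dict prefix -> AS path for prefixes not withdrawn so far
    """
    withdrawn_on = Counter()  # withdrawn prefixes whose path crossed the link
    for path in withdrawn.values():
        withdrawn_on.update(path_links(path))

    routed_on = Counter()     # prefixes still routed over the link
    for path in still_routed.values():
        routed_on.update(path_links(path))

    total_withdrawn = len(withdrawn)
    scores = {}
    for link, w in withdrawn_on.items():
        withdrawal_share = w / total_withdrawn   # share of the burst on this link
        path_share = w / (w + routed_on[link])   # share of the link's prefixes withdrawn
        scores[link] = withdrawal_share * path_share
    return scores
```

A link scores high only if it explains most of the burst and most of the prefixes that used it have been withdrawn, which is why the two metrics are multiplied rather than used alone.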
To be fast, SWIFT can't wait for all the withdrawals to arrive, so the algorithm is run early in a withdrawal burst. Furthermore, outages can affect multiple links, so SWIFT returns a set of links that all have a high fit score.
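One way to turn per-link scores into a set of candidate links is to keep every link whose score is close to the best one. The talk summary does not say how SWIFT selects the set, so the `tolerance` parameter and the function name below are assumptions made purely for illustration.

```python
def likely_failed_links(scores, tolerance=0.1):
    """Return all links scoring within `tolerance` of the best fit score.

    scores: dict link -> fit score, computed early in the burst.
    The tolerance-based cutoff is an assumed selection rule, not SWIFT's.
    """
    if not scores:
        return set()
    best = max(scores.values())
    return {link for link, score in scores.items() if score >= best - tolerance}
```

Returning a set rather than a single link reflects the point above: early in a burst several links can look almost equally likely, and an outage may involve more than one of them.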
Their experimental results show that SWIFT never makes an extremely inaccurate prediction, while it correctly infers the failure of a majority of links.
The challenges for fast reroute are that any set of prefixes can be affected, failures can affect backup paths, and they want to be fast. Thus, SWIFT uses a hierarchical forwarding table to make the convergence fast. Entries in the table hold a primary next-hop, the AS-path the prefix is using, and the backup next-hop. SWIFT uses this to reroute the affected traffic to unaffected backup paths.
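The idea of the table can be sketched as follows. This is a simplified model under stated assumptions, not the actual data-plane encoding: the class and method names are hypothetical, lookups are exact-match on the prefix rather than longest-prefix match, and each backup next-hop is assumed to have been precomputed so that it avoids the failed link.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    primary_next_hop: str
    as_path: tuple          # AS path the prefix is currently using
    backup_next_hop: str    # assumed precomputed to avoid the links on as_path


class TwoStageTable:
    """Toy model: stage 1 maps a prefix to its entry, stage 2 holds the
    (small) set of links currently considered failed."""

    def __init__(self):
        self.entries = {}
        self.failed_links = set()

    def add_route(self, prefix, entry):
        self.entries[prefix] = entry

    def mark_failed(self, link):
        # A single update here reroutes every prefix whose path crosses
        # the link; no per-prefix rewrite is needed.
        self.failed_links.add(link)

    def next_hop(self, prefix):
        entry = self.entries[prefix]
        links = set(zip(entry.as_path, entry.as_path[1:]))
        if links & self.failed_links:
            return entry.backup_next_hop
        return entry.primary_next_hop
```

Because the per-prefix entries are left untouched, the number of updates after a failure scales with the number of inferred links rather than with the number of affected prefixes.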
Evaluation shows that SWIFT's fast reroute requires fewer than 100 forwarding updates, or about 40ms, to effectively reroute after a failure.
Furthermore, it is deployable on most existing routers, as many already support hierarchical forwarding tables. For routers that don't have that support, SWIFT can still be deployed using an extra SDN switch attached to the router.
Thus, SWIFT speeds up the learning of remote link failures, efficiently reroutes around those failures, and is deployable on existing routers.
Q: In the graph with false and true positives, what is the ground truth?
A: The ground truth comes from checking, after rerouting, whether those paths are later withdrawn. We also evaluated with a simulator to have the ground truth.
Q: What are the main factors behind the long convergence time?
A: Each router has a decision process to go through, which includes things like scanning through its routing table. This process could explain why it is slow.