Authors: Amin Vahdat, Keith Marzullo
This paper addresses fault recovery in data center networks.
A single failure can disconnect a large part of the network.
When a switch fails, it takes time for the network to propagate the updates and reconnect all parts of the network.
Failures happen very frequently (80%), and they are impactful and far-reaching.
They also have lasting effects: the network takes 10 sec to recover.
One solution is to add extra links, but the question is where to find the extra ports for those links.
1. Increase the number of ports per switch and double the link on each connection. This solution is expensive.
2. Build a bigger network: add more switches at the top level and add one more layer of switches. However, more switches make paths longer.
3. Free up ports by removing some of the existing links from the switches and reusing them for redundant links. This scheme has a scalability problem because the network loses some hosts.
The resulting topology is a multi-rooted tree with extra links at one or more levels (e.g., VL2).
The contribution of this paper is to characterize the tradeoffs among fault tolerance, scalability, and network size.
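To make the tradeoff concrete, here is a rough sketch of the arithmetic, not the paper's own model. It relies on the standard fat-tree host count (k^n / 2^(n-1) hosts for n levels of k-port switches) and on a simplifying assumption that is not stated in these notes: duplicating every link at one level c times divides the host count by c. The function names and the `dup_per_level` parameter are hypothetical.

```python
def fat_tree_hosts(k: int, levels: int) -> int:
    """Hosts supported by a fat tree built from k-port switches.

    Standard result: k**levels / 2**(levels - 1), e.g. k**3 / 4 for 3 levels.
    """
    return k ** levels // 2 ** (levels - 1)


def redundant_tree_hosts(k: int, levels: int, dup_per_level: list[int]) -> int:
    """Host count when ports are reused for redundant links (option 3).

    ASSUMPTION (illustrative only): connecting each switch pair at a level
    with c parallel links instead of 1 divides the host count by c.
    dup_per_level holds one c per switch-to-switch level; 1 means no redundancy.
    """
    hosts = fat_tree_hosts(k, levels)
    for c in dup_per_level:
        hosts //= c
    return hosts


def longest_path_switches(levels: int) -> int:
    """Switches on the longest host-to-host path: up to the root and back down."""
    return 2 * levels - 1


if __name__ == "__main__":
    k = 4
    # Same hardware family, three design points.
    print("3-level fat tree:", fat_tree_hosts(k, 3), "hosts,",
          longest_path_switches(3), "switches on longest path")
    print("3 levels, doubled links at one level:",
          redundant_tree_hosts(k, 3, [2, 1]), "hosts")
    print("4 levels, doubled links at one level:",
          redundant_tree_hosts(k, 4, [2, 1, 1]), "hosts,",
          longest_path_switches(4), "switches on longest path")
```

Under these assumptions, doubling links at one level of a 3-level tree halves the host count (option 3), while adding a fourth level restores the host count at the price of a longer path (option 2).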
The evaluation compares the results against OSPF.
Q: There is a lot of existing work on fault tolerance, so what is new here?
A: What is new is a slightly different point in the design space: they add a little latency instead of adding hardware. Some topologies designed for supercomputers are Aspen trees or a subset of Aspen trees, for example trees that split links and double the number of links at all levels. The goal here is to work out what the tradeoffs are and what the math is, so that designers don’t need to guess the cost they must pay for the level of fault tolerance they choose.