Resilient Overlay Networks

D. Andersen. H. Balakrishnan, F. Kaashoek, R. Morris, “Resilient Overly Networks,” 18th Symposium on Operating Systems Principles, (December 2001). [PDF]


Routing scalability in BGP comes at the cost of reduced fault-tolerance and slower fault-recovery procedures. BGP hides paths due to policy concerns, damps routing updates to avoid route flapping, and has or uses little information about live traffic. As a result, sometimes it takes many minutes to sense and recover from an outage or traffic overload. This paper proposes using an overlay (RON) to actively and passively probe and monitor traffic conditions on links to provide resilient and faster paths circumventing faults in the network, given that there is physical path redundancy.

A RON creates a clique of distributed overlay nodes that are tightly connected and continuously share information between themselves (using a link state protocol) through aggressive probing and monitoring to maintain high throughput virtual links (with minimal bandwidth usage). Whenever a usual physical path experiences excessive delay or completely goes out, a RON node uses RON virtual links to route packets through other RON nodes. As a result, RON can detect route faults faster and can provide alternative paths quicker than BGP. Other goals of RON include, application defined path selection (by allowing applications to define their own metric of the “goodness” of a link) and to provide a framework for implementing expressive routing policies.

A RON client interacts with RON using an API interface, called conduit. The entry RON node conduit classifies a packet, tags it, selects a path, encapsulates it in RON header, and forwards it to the next RON node (if the usual physical path is experiencing outage or performance hit). Intermediate RON nodes keep forwarding the packet by guided by their database of policies as well as tag-action mappings until the packet reaches the exit RON node where it is decapsulated and delivered to the destination host. Forwarding component of a RON node also uses caching based on flow id for faster processing of packets. Each RON node maintains three different metrics – loss rate, latency, and TCP throughput – for each link. In  addition, RON also allows creation of app-defined metrics and update mechanisms for further flexibility.

RON provides very high connectivity among its nodes and provides guarantees that there is always a path to reach from one RON node to another unless the RON graph is partitioned. The bootstrapping of RON nodes starts with a static list and dynamically updates and adds new RON nodes over time. The cost of keeping such tight coupling through periodic flooding is postulated to be negligible to the gain provided by maintaining them.

The authors evaluated the RON system on two RON networks, one with 12 nodes (RON1) and the other with 16 nodes (RON2) distributed over the Internet, each with N*(N-1) links. RON nodes were hosted two types of networks: highly reliable educaional and research network (Internet2) and commercial networks. All improvement regarding commercial paths came from commercial network by explicitly prohibiting them from using Internet2 paths. Major results are as follows:

  • RON1 managed to successfully detect and recover from 100% of total outages and all periods of sustained high loss rates of 30% or more; RON2 did the same with 60% success rate. The authors did not shed much light on why RON2 couldn’t do as well as RON1; they just mentioned that in the RON2 almost always unreachable node was completely isolated due to network disruption.
  • For a RON of 50 nodes, there is a 30Kbps overhead to keep its nodes informed and operational.
  • RON took 18s on the average to route around an outage or a failure. An emulated 3-node RON topology was used to show that there is a 13s turn-around time for RON facing a flooding attack; however, how it would’ve performed with larger number of nodes remains open.
  • RON doubled TCP throughput in 5% of the samples and reduced the loss probability by 0.05 or more in 5% of the samples (may or may not be different).
  • Most importantly, they found that in most cases going through a single RON node is good enough. The whole paper revolves around this observation and RON workflow is also influenced by this.


The idea presented in this paper is interesting from a conceptual point of view and really worth reading or thinking about. However, from a practical point of view, there are several reasons why RON is not going to make it:

  • First of all, RON will mess with the underlying traffic matrices of the ISPs, which might result in route instability; as RON starts flowing traffic in some routes that ISP traffic matrices do not want to use, ISPs will update their traffic matrices to counter RON and eventually there will be a cat-and-mouse chase between them.
  • Second, RON will introduce extra latency and bandwidth consumption in intermediate nodes and specific links due to probing, encapsulating/decapsulating etc; the evaluation provided in the paper is limited to only two scenarios which may or may not generalize to larger cases.
  • Third, RON depends on cliques between all the RON nodes and uses flooding between them; this will not scale.
  • Fourth, the authors do not discuss how someone will pick nodes where to put RON nodes; it seems there should be strategic configurations that will give more leverage than others.
  • In addition, there are NAT issues as pointed out by the authors.

Overall, it is a nice paper and a good conceptual exercise, but not practical for the current Internet ecosystem.

Leave a Reply

Your email address will not be published. Required fields are marked *