Category Archives: Reviews

Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises

C. Kim, M. Caesar, J. Rexford, “Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises,” ACM SIGCOMM Conference, (August 2008). [PDF]

Summary

SEATTLE is an enterprise network architecture that combines the scalability of IP networks with the simplicity of Ethernet without incurring their respective drawbacks. It uses a flat address space to provide plug-and-play functionality, and it ensures scalability and performance through shortest-path routing, a hash-based host-resolution mechanism, aggressive caching, and one-hop DHT lookups that avoid the flooding inherent in ARP and DHCP.

The authors present a nice overview of the existing technologies used in enterprise networks and argue why Ethernet, though widely adopted, cannot remain the method of choice in relatively large networks even though it has some nice properties (e.g., a flat address space for instant installation, manageability, etc.). There are several critical faults in Ethernet-based solutions:

  • Ethernet bridging requires flooding to locate previously unseen end hosts (i.e., hosts whose MAC-to-location mapping is unknown). This requires a large amount of state and incurs control overhead that grows linearly with the number of end hosts.
  • Ethernet arranges the broadcast domain as a spanning tree, which cannot scale to larger networks given low-latency and high-availability requirements.
  • In an Ethernet-based solution, ARP is used to resolve IP to MAC addresses, and it relies on broadcasting, so it is not scalable. The same goes for DHCP, commonly used for dynamic IP address allocation to end hosts.

Scalability issues in Ethernet can be addressed by arranging hosts into IP subnets and then using shortest-path routing over IP addresses. However, as the authors point out, IP is harder to manage and requires a considerable amount of resources. There are other solutions, such as VLANs, that allow logical grouping of hosts for better management while retaining many of Ethernet's advantages. However, VLANs still rely on IP to allow hosts in different VLANs to communicate.

Considering all the options, the authors take a middle ground and implement a one-hop, network-layer DHT at the switch level that stores the locations of all end hosts on top of a link-state routing protocol. The DHT also acts as a directory for address resolution and provides service discovery. SEATTLE forwards packets based on the destination host's MAC address, using the DHT as a mapper and resolver. To reduce control overhead during network failures or excessive churn, SEATTLE uses consistent hashing, and it balances load using the concept of virtual switches. Scalability is further supported by creating a multi-level hierarchy of one-hop DHTs. In addition, SEATTLE removes the broadcasting requirement from ARP and DHCP.
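To make the resolution path concrete, here is a minimal sketch (my own illustration under the description above, not code from the paper) of a one-hop DHT built with consistent hashing over the set of switches: every switch knows the full switch list via link-state routing, so it can compute the resolver for any MAC address locally and contact it directly, with no flooding. Names such as `resolver_for` and `publish` are hypothetical.

```python
import hashlib
from bisect import bisect_right

def _hash(key: str) -> int:
    """Map an arbitrary key (switch ID or host MAC) onto a numeric ring."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class OneHopDHT:
    """Toy consistent-hashing directory: every switch knows every other switch
    (learned via link-state), so any lookup takes a single overlay hop."""

    def __init__(self, switch_ids):
        self.ring = sorted((_hash(s), s) for s in switch_ids)
        self.tables = {s: {} for s in switch_ids}   # each switch's directory shard

    def resolver_for(self, key: str) -> str:
        """The switch whose hash is the successor of hash(key) on the ring."""
        points = [p for p, _ in self.ring]
        idx = bisect_right(points, _hash(key)) % len(self.ring)
        return self.ring[idx][1]

    def publish(self, mac: str, location: str):
        """Ingress switch stores (MAC -> attachment switch) at the resolver."""
        self.tables[self.resolver_for(mac)][mac] = location

    def resolve(self, mac: str):
        """Any switch computes the resolver locally and asks it directly."""
        return self.tables[self.resolver_for(mac)].get(mac)

dht = OneHopDHT(["sw1", "sw2", "sw3", "sw4"])
dht.publish("00:11:22:33:44:55", "sw3")      # host attached to switch sw3
print(dht.resolve("00:11:22:33:44:55"))      # -> 'sw3', no flooding involved
```

Because the switch set changes only when a switch joins or fails, consistent hashing limits re-registration churn to the keys owned by the affected switch, which is the property SEATTLE leans on to keep control overhead low.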

SEATTLE promotes traffic-driven, reactive location resolution and caching to avoid excessive load on the directory service. Cache updates in this architecture are based on explicit notifications, a heart-beat protocol, and some other clever workarounds (e.g., a MAC eviction list for updating remote hosts' caches). Thanks to these features, SEATTLE inherently supports mobility without having to do anything explicitly.

In addition to everything else, the proposed architecture is backward compatible in the sense that it exposes Ethernet-like semantics for immediate deployment. Under the hood, however, all broadcasting is replaced by unicast or DHT lookups, multicasting and access control are supported using a group concept, and bootstrapping functionalities are modified to avoid flooding.

The authors demonstrate the validity and performance of SEATTLE through extensive packet-level simulation driven by real enterprise network traces and through experimentation using XORP and the Click modular router.

Critique

SEATTLE’s focus on solving an old problem using a clean-slate design without throwing out old values and restrictions is admirable. The paper itself is very well-presented and hence easy to follow. One special attraction of this paper is the detailed discussion on the past and the state-of-the-art in enterprise networks.

The hierarchical arrangement of DHTs for scalability feels half-baked and is not well explained (an aside: most probably it is inspired by Caesar's previous work on ROFL). However, it is unlikely that many levels of hierarchy will ever be necessary (how many end hosts can there be in an enterprise network anyway?).

Security can be an issue if any of the end hosts is compromised, but one can probably assume that within the guarded perimeter of an enterprise network it can be handled somehow (at least with higher probability than in the Internet)!

Detailed Diagnosis in Enterprise Networks

S. Kandula, R. Mahajan, P. Verkaik, S. Agarwal, J. Padhye, P. Bahl, “Detailed Diagnosis in Enterprise Networks,” ACM SIGCOMM Conference, (August 2009). [PDF]

Summary

Diagnosis of generic and application-specific faults in networked systems is notoriously hard due to dynamic and complex interactions between components. This paper presents NetMedic, a system that enables detailed diagnosis in small enterprise networks by using rich (multi-variable) information exposed by the system itself and by the applications running on it. NetMedic formulates diagnosis as an inference problem and detects complex, intertwined faults using a simple history-based learning mechanism, without any explicit knowledge of the underlying semantics of the collected information and with little application-specific knowledge. It models the network as a dependency graph of fine-grained components and assigns weights to dependency edges based on the likelihood of one component affecting another. Using these edge weights, NetMedic ranks possible causes of faults to find the true culprit. This work is limited to automatic identification, i.e., automatic repair is considered out of scope.

NetMedic has two basic properties:

  1. Detail: It can diagnose both application-specific and generic problems with as much specificity as possible. Since most faults are application-specific, digging deep is considered essential to getting closer to the context of a fault.
  2. Application agnosticism: It can diagnose with minimal application-specific knowledge. This property allows NetMedic to function properly even when a new, unknown application is introduced.

Inference in NetMedic depends on the joint historical behavior of components impacting one another. For any two components, NetMedic looks back into their history to find past periods when the source component was in a state similar to its current one, and checks how the destination component behaved then, assigning a likelihood to the current situation in the hope that history will repeat itself. The authors argue that the exact values are not that important; what matters is assigning low weight to enough edges so that, in the end, it becomes easier to rank possible faults. This expectation seems pretty naive, but the authors insist that it is enough, without strong justification or real data. A rough sketch of this weighting idea follows.
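As an illustration of that history-based weighting (a sketch of the flavor I read into the paper, not the formula from §5.3.2; the function and thresholds below are made up), one can look for past time bins where the source component resembled its current state and check how often the destination component also resembled its current state in those bins:

```python
def edge_weight(src_history, dst_history, now, tol=0.1):
    """Likelihood that src is affecting dst, estimated from joint history:
    find past bins where src looked like it does now, and measure how often
    dst also looked like it does now in those bins (history repeating itself)."""
    src_now, dst_now = src_history[now], dst_history[now]
    similar = [t for t in range(now)
               if abs(src_history[t] - src_now) <= tol * (abs(src_now) + 1e-9)]
    if not similar:
        return 1.0   # no evidence either way: leave the edge fully suspect
    co_moved = [t for t in similar
                if abs(dst_history[t] - dst_now) <= tol * (abs(dst_now) + 1e-9)]
    return len(co_moved) / len(similar)

# toy example: CPU usage of one process (src) and response time of a server (dst)
cpu  = [10, 12, 80, 11, 78, 13, 79]
resp = [ 5,  6, 90,  5, 88,  6, 91]
print(edge_weight(cpu, resp, now=6))   # 1.0: whenever CPU spiked before, response time did too
```

The real system works with multi-variable state vectors and several correction heuristics, but the basic bet is the same: low weights on enough edges make the eventual ranking tractable.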

The workflow of NetMedic can be divided into three functional pieces:

  1. Capturing component state: NetMedic captures the state of each component as a multi-variable vector, as opposed to the scalar variables used in existing approaches. It stores the states in one-minute bins to be processed later when building the dependency graph.
  2. Generating the dependency graph: NetMedic uses pre-created templates to find the relationships between different types of components and stitches them together to complete the dependency graph. The use of templates requires human intervention whenever an unknown component without a template appears.
  3. Diagnosis: Finally, NetMedic computes the abnormality of component states, assigns weights to the edges of the dependency graph (details in §5.3.2), and ranks likely causes to identify the true causes of faults. The abnormality-detection process based on historical information introduces false positives. Given edge abnormality, NetMedic assigns weights to edges using multiple heuristics and threshold constants. At the ranking stage it uses a compound ranking function with two components: the local and the global impact of a cause (a toy sketch of this ranking step follows the list).
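A hypothetical sketch of the ranking step only (the blend of local and global impact below is an arbitrary illustration, not the paper's actual function): each candidate cause is scored by how strongly it appears to impact the observed victim and how broadly it impacts everything else, and candidates are returned in decreasing order of suspicion.

```python
def rank_causes(graph, edge_weights, victim):
    """Rank candidate causes of a fault observed at `victim`.
    graph[c] lists the components c has edges to; edge_weights[(c, d)] in [0, 1]
    is the likelihood that c is impacting d (from the history-based step)."""
    scores = {}
    for cause in graph:
        if cause == victim:
            continue
        local = edge_weights.get((cause, victim), 0.0)          # direct impact on the victim
        impacted = [edge_weights.get((cause, d), 0.0) for d in graph[cause]]
        global_impact = sum(impacted) / max(len(impacted), 1)   # average impact elsewhere
        scores[cause] = 0.7 * local + 0.3 * global_impact       # illustrative blend only
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

graph = {"app_config": ["web_server"], "disk": ["web_server", "db"],
         "web_server": [], "db": []}
weights = {("app_config", "web_server"): 0.9,
           ("disk", "web_server"): 0.2, ("disk", "db"): 0.1}
print(rank_causes(graph, weights, victim="web_server"))   # app_config ranked first
```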

The authors try to justify and validate their scheme using a limited, controlled evaluation, which is sorely lacking and does little to convince the reader.

Critique

Except for the facts that this is the first study of faults in enterprise networks (as claimed by the authors) and the first of its kind to use multi-variable states to infer faults, everything else is not very convincing. The model seems too simplistic, with heuristics and constants all over the place. The authors keep arguing that all of this works in real networks, even though they evaluated the system on two primitive, controlled networks. The way they use historical information (linear correlation) also seems naive for real-world scenarios, where interactions between two components are often influenced by many others.

Dependence on history also restricts NetMedic's ability to deal with previously unseen faults (considering that it is not a full-fledged learner). It is not clear how the authors handle this issue. Especially, when a new instance of NetMedic starts with no history at all, how long will it take to reach the reported level of performance, even in the controlled scenarios presented in the paper?

NetMedic seems to work when diagnosing a single fault (in a controlled environment), but it does not do well when more than one fault occurs simultaneously, even in the test scenarios. The reliability of the evaluation regarding multiple faults can also be questioned, since it generalizes from one synthetic test case (maybe more, not shown here) and claims NetMedic will perform similarly in other multi-fault scenarios. To do justice to the authors, it should be mentioned that it is extremely difficult to evaluate such systems without long-term deployment in real settings.

Congestion Control for High Bandwidth-Delay Product Networks

D. Katabi, M. Handley, C. Rohrs, “Congestion Control for High Bandwidth-Delay Product Networks,” ACM SIGCOMM Conference, (August 2002). [PDF]

Summary

Regardless of the gateway queueing scheme, TCP becomes inefficient and suffers from instability as the per-flow bandwidth-delay product increases. The main reason is that the existing TCP control mechanisms cannot respond fast enough to changing conditions in high bandwidth-delay product networks. This paper presents XCP (eXplicit Control Protocol), which outperforms TCP in both conventional and high bandwidth-delay product networks in terms of efficiency, fairness, scalability, and stability. XCP extends the concept of ECN (Explicit Congestion Notification) with richer notifications and builds upon CSFQ's core-stateless router concept (no per-flow state).

XCP is based on the following design rationales:

  1. Congestion should be signaled by explicit notifications instead of packet drops. Moreover, such notifications should be rich, carrying the degree of congestion.
  2. The aggressiveness of the sources should be adjusted to the delay in the feedback loop; the system must slow down as feedback delay increases.
  3. The dynamics of aggregate traffic should be independent of the number of flows to ensure robustness to congestion.

The biggest contribution of XCP is the separation of the efficiency-control mechanism from the fairness-control mechanism, which provides a flexible analytical framework as well as support for service differentiation (i.e., inherent QoS support). Every XCP router has an efficiency controller (EC) and a fairness controller (FC). Both controllers estimate the average RTT, and a single control decision is taken per average RTT. Special congestion headers are carried in every packet and updated by the congested links along the path, so the feedback delivered to the source reflects the most congested link. The EC computes an aggregate feedback proportional to the spare bandwidth of the link, aiming to maximize link utilization with fast (MIMD) responses, but it leaves the fair allocation of that aggregate to the FC. The FC apportions the feedback among individual packets to achieve max-min fairness using AIMD (as in TCP) and a bandwidth-shuffling mechanism. The authors show that XCP is robust to estimation errors and is parameter-insensitive, and a proof of stability is also presented.
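A toy rendering of one control interval may help here (a sketch of the shape of the computation, assuming the alpha = 0.4 and beta = 0.226 constants quoted in the paper; the per-flow apportioning below is only the AIMD intuition, whereas the real FC works per packet through the congestion header and adds bandwidth shuffling):

```python
ALPHA, BETA = 0.4, 0.226        # stability constants quoted in the paper

def aggregate_feedback(capacity, input_traffic, queue, avg_rtt):
    """Efficiency controller: aggregate feedback proportional to the spare
    bandwidth, with a correction term that drains any persistent queue."""
    spare = capacity - input_traffic
    return ALPHA * avg_rtt * spare - BETA * queue

def apportion(phi, flow_rates):
    """Fairness controller (simplified): hand out positive feedback equally
    across flows (additive increase) and take negative feedback away in
    proportion to each flow's current rate (multiplicative decrease)."""
    total = sum(flow_rates.values())
    if phi >= 0:
        return {f: phi / len(flow_rates) for f in flow_rates}
    return {f: phi * r / total for f, r in flow_rates.items()}

# The link is fully loaded and has a standing queue, so phi is negative,
# and the faster flow B absorbs most of the rate cut.
phi = aggregate_feedback(capacity=100.0, input_traffic=100.0, queue=20.0, avg_rtt=0.1)
print(round(phi, 2))                            # negative aggregate feedback
print(apportion(phi, {"A": 20.0, "B": 80.0}))   # B gets the larger decrease
```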

Through extensive packet-level ns-2 simulations, this paper shows that XCP outperforms TCP under any queueing mechanism, in both conventional and high bandwidth-delay product networks, in terms of bottleneck utilization, average queue size, and bottleneck packet drops across different scenarios (e.g., changing capacity, delay, and the number, length, and type of flows). The authors also consider the short- and long-term dynamics of XCP, its response to burstiness, and its feasibility as a QoS enabler. They further point out that XCP's broad operating range makes it suitable for many unconventional networks, including wireless, satellite, and DTN environments.

Thoughts

This paper is an excellent read, very well thought out and delivered. The evaluation section is especially nice, covering a huge array of questions that might arise (though the topologies are pretty simple). The paper also provides a proof of the protocol's stability that solidifies its claims. The concept of using extra information to make smarter decisions is not novel, but the way XCP separates the EC from the FC, and how they interact at the aggregate and individual-flow levels to ensure efficiency and fairness, is ingenious.

However, some aspects of the paper are questionable. To make XCP secure, per-flow state is required at the edge routers, and there is no evaluation of how expensive that might be. Some results are described differently from how they appear in the plots, and others are not discussed at all; for example, the average bottleneck queue grows roughly linearly with increasing bottleneck capacity, but this is never discussed. There is also a tendency to blame most discrepancies on the shuffling mechanism, even though it is an integral part of the XCP design.

Random Early Detection Gateways for Congestion Avoidance

S. Floyd, V. Jacobson, “Random Early Detection Gateways for Congestion Avoidance,” IEEE/ACM Transactions on Networking, (August 1993). [PDF]

Summary

Random Early Detection (RED) detects impending congestion based on the average queue size and notifies connections of congestion through binary feedback, either by dropping packets or by marking bits in their headers. To this end, RED uses two thresholds: once the average queue size exceeds the minimum threshold, RED starts marking/dropping packets with a probability that is a function of the average queue size; when it exceeds the maximum threshold, RED marks all incoming packets. The authors argue against dropping packets as long as connections respond to notifications, resorting to drops only as an extreme measure. They also promote FCFS/FIFO scheduling over FQ, citing the lower overhead of not having to maintain per-flow state.

The RED gateway consists of two separate algorithms: one determines the average queue size, which bounds the maximum allowable burstiness, and the other calculates the probability of marking packets when the average queue size is between the minimum and maximum thresholds. The authors leave the choice of thresholds to the system designer's desired average queue size; however, as rules of thumb, they advise setting the minimum threshold sufficiently high to maximize network power (the ratio of throughput to delay) and setting the maximum threshold at least twice as high as the minimum to avoid global synchronization and oscillation. Through observation and probabilistic calculation, they show that packets should be marked at roughly regular intervals, with the intermarking interval a uniformly distributed random variable, rather than in clusters, in order to keep the average queue size steady.
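Here is a minimal sketch of the marking decision as I understand it (an EWMA of the queue size, a base probability that grows linearly between the two thresholds, and the count-based adjustment that spreads markings out); the parameter values are illustrative, not recommendations from the paper:

```python
import random

class REDQueue:
    """Toy RED gateway: marks (or drops) arrivals probabilistically once the
    EWMA of the queue size crosses min_th, and marks everything above max_th."""

    def __init__(self, min_th=5, max_th=10, w_q=0.002, max_p=0.02):
        self.min_th, self.max_th = min_th, max_th
        self.w_q, self.max_p = w_q, max_p
        self.avg = 0.0      # exponentially weighted average of the queue size
        self.count = 0      # packets since the last mark (spaces marks apart)

    def on_arrival(self, queue_len):
        self.avg = (1 - self.w_q) * self.avg + self.w_q * queue_len
        if self.avg < self.min_th:
            self.count = 0
            return False                       # accept without marking
        if self.avg >= self.max_th:
            self.count = 0
            return True                        # mark/drop every arrival
        self.count += 1
        p_b = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
        p_a = p_b / max(1 - self.count * p_b, 1e-9)   # makes intermark gaps roughly uniform
        if random.random() < p_a:
            self.count = 0
            return True
        return False

red = REDQueue()
marks = sum(red.on_arrival(queue_len=8) for _ in range(5000))
print(marks)    # a modest number of marks, spread out rather than clustered
```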

Interesting features of RED include TCP compatibility, avoidance of global synchronization through randomization, and the ability to maintain an upper bound on the average queue size (which also lets RED handle bursty traffic better). Moreover, RED does not discriminate between different classes of connections, and there is no explicit reward for staying within one's fair share. RED provides no explicit mechanism against misbehaving users; instead, the authors argue that misbehaving connections face a higher probability of having their packets dropped/marked, since they have more packets in the queue than the rest.

The authors back up their claims through simulation and provide mathematical bounds on the average queue size as well as on the two thresholds. They also provide a so-called fast version of their control algorithm, which is supposed to be useful for in-router implementation due to its reduced number of operations.

Discussion

The main contribution of this paper is the use of randomization/probabilistic analysis in a clever way (i.e., in conjunction with two different thresholds) that by itself obviates some of the nagging problems of other solutions (e.g., bias against bursty traffic, global synchronization etc.). The paper itself could be shorter though.

This paper promotes FCFS, arguing that it shares delay among connections and reduces delay for particular connections, which is a completely different line of thinking from FQ. However, the method described for identifying misbehaving connections ends up being stateful, even though the authors shunned FQ precisely because of its per-flow state (which FQ uses to find offenders and to ensure max-min fairness). The definition of fairness here is vague, and it is not clear whether they strive for max-min fairness (there is no explicit mention, and since FCFS is used underneath, it seems unlikely).

RED leaves too many parameters to system administrators, and it is quite sensitive to those parameters. This tuning burden not only makes their simulation results less reliable (you can check only so many variations!), but it can also create problems in realistic deployments (too many options to choose from; on the other hand, multiple knobs allow fine-tuning in good hands).

Core-Stateless Fair Queueing: Achieving Approximately Fair Bandwidth Allocations in High Speed Networks

I. Stoica, S. Shenker, H. Zhang, “Core-Stateless Fair Queueing: Achieving Approximately Fair Bandwidth Allocations in High Speed Networks,” ACM SIGCOMM, (August 1998). [PDF]

Summary

Fair resource (bandwidth, buffer, etc.) allocation algorithms in routers (e.g., Fair Queueing) usually need to manage a large amount of state and buffering and have to schedule packets on a per-flow basis, which makes them economically infeasible to implement in production routers. This paper proposes an architecture that significantly reduces these overheads by completely eliminating per-flow state from core routers, yet manages to achieve approximately fair bandwidth allocation. The end result is a simplified queueing algorithm that provides performance comparable to its counterparts but is better suited for incremental deployment in the existing Internet.

The authors question the necessity, complexity, and adaptability of fair allocation mechanisms and base their contributions on two basic assumptions:

  1. Fair allocation is necessary, or at least important, for congestion control, and
  2. Complexity of then-existing fair allocation schemes is a substantial hindrance to their adoption.

At its core, CSFQ (Core-Stateless Fair Queueing) divides the Internet into islands of routers. Each island has edge routers that maintain per-flow state and label packets with their flows' estimated rates as they enter the island, whereas core routers process packets based on the labels attached by the edge routers, using FIFO queueing in conjunction with probabilistic dropping. The two primary challenges in this scheme are estimating arrival rates for labeling and estimating the fair share from aggregate measurements. CSFQ uses exponential averaging for both estimates and applies heuristics to correct possible estimation errors. Finally, when packets leave an island, their labels are rewritten with updated rates. Theoretical bounds and detailed algorithms can be found in the paper.
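The two estimation steps can be sketched as follows (the exponential-averaging form follows the paper's description; K, the traffic numbers, and the fixed fair-share value are placeholders of my own):

```python
import math
import random

def update_flow_rate(prev_rate, pkt_len, interarrival, K=0.1):
    """Edge router: exponentially averaged flow arrival rate,
    r_new = (1 - e^(-T/K)) * (l / T) + e^(-T/K) * r_old."""
    w = math.exp(-interarrival / K)
    return (1 - w) * (pkt_len / interarrival) + w * prev_rate

def keep_packet(label, fair_share):
    """Core router: forward a packet whose label (the flow's rate) exceeds the
    fair share only with probability fair_share / label; no per-flow state."""
    return random.random() < min(1.0, fair_share / label)

# Edge: one 1000-byte packet every millisecond is ~1 MB/s; the estimate converges there.
r = 0.0
for _ in range(1000):
    r = update_flow_rate(r, pkt_len=1000, interarrival=0.001)
print(round(r))                  # close to 1,000,000 bytes/s

# Core: with a fair share of 0.4 MB/s, this flow keeps roughly 40% of its packets.
kept = sum(keep_packet(label=r, fair_share=4e5) for _ in range(10000))
print(kept / 10000)              # ~0.4
```

In the paper the fair share itself is estimated iteratively from the aggregate arrival and forwarded rates at each core router; the fixed value above stands in for that step.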

Through simulations, CSFQ is shown to perform better than simplistic FIFO and RED, on par with FRED, and worse than the more complex and stateful DRR. However, the authors seem undecided about what to make of these outcomes!

In the end, the authors discuss problems with unfriendly/malicious flows, sketch two possible approaches (allocation and identification) to handle them, and end up proposing a penalty scheme for unfriendliness that combines the two: allocation first, then identification.

Critique

This paper is a great read in terms of concept, delivery, clarity, and mostly honesty of the authors. However, it is not comforting to find that the authors are undecided about almost everything discussed in the paper!

The simulation section is rigorous and impressive. Even though they mention several times that they couldn't include everything for lack of space, the authors end up providing a pretty thorough set of simulation results.

The use and derivation of the heuristic for adjusting the estimates seem arbitrary. There is no clear reasoning about how the authors came up with those particular constants and percentages.

One of the assumptions, presented as an architectural consideration, is that "all packets in the same flow must follow the same path within the core of an island." How valid is that?

Analysis and Simulation of a Fair Queueing Algorithm

A. Demers, S. Keshav, S. Shenker, “Analysis and Simulation of a Fair Queueing Algorithm,” Internetworking: Research and Experience, 1 (1990), pp. 3-26.

Summary

Congestion control in communication networks can be implemented at the source, where flow-control algorithms vary the sending rate, or at the gateway, through routing and queueing algorithms. However, implementations at the source must rely on the benevolence of users and hence are susceptible to malicious behavior. The authors argue that queueing algorithms can be more effective for congestion control, since they determine how packets from different sources interact and, in turn, shape the collective behavior of flow-control algorithms. They consider bandwidth, promptness, and buffer space as the three control axes whose effective allocation can ensure efficient and fair congestion control.

This paper presents a modified Fair Queueing (FQ) algorithm, based on Nagle's original proposal, with the aim of providing fair allocation of bandwidth, lower delay for benevolent sources, and isolation of malicious ones in heterogeneous networking environments. The major changes to Nagle's algorithm are support for variable-length packets, control over the promptness/delay of packet dispatching, and dynamic packet scheduling instead of static round-robin ordering. An allocation is considered fair if it satisfies the max-min criteria (a small allocation sketch follows the list below):

  1. No user receives more than its request,
  2. No other allocation scheme satisfying condition 1 has a higher minimum allocation, and
  3. Condition 2 remains recursively true as the minimal user is removed and the total resource is reduced accordingly.
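To make the criteria concrete, here is a small water-filling routine (my own illustration, not from the paper) that computes a max-min fair split of a single resource among users' requests:

```python
def max_min_allocation(requests, capacity):
    """Max-min fair split of `capacity` among `requests` (user -> demand):
    satisfy the smallest remaining demand whenever it fits within an equal
    share, otherwise split what is left equally among the remaining users."""
    alloc, remaining = {}, dict(requests)
    while remaining:
        share = capacity / len(remaining)
        user, demand = min(remaining.items(), key=lambda kv: kv[1])
        if demand <= share:
            alloc[user] = demand        # criterion 1: never exceed the request
            capacity -= demand
            del remaining[user]
        else:
            for u in remaining:         # everyone left gets an equal share
                alloc[u] = share
            return alloc
    return alloc

print(max_min_allocation({"a": 2, "b": 4, "c": 10}, capacity=12))
# {'a': 2, 'b': 4, 'c': 6}: a and b get their full requests, c takes the remainder
```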

Several possible definitions of the term "user" are considered: the source, the destination, the source-destination pair, etc. Eventually, the authors settle on the source-destination pair, citing a tradeoff between security and efficiency (without any clear explanation!).

The algorithm itself is a packet-level approximation of a bit-by-bit (fluid) round-robin transmission model using the following rule: "whenever a packet finishes transmission, the next one selected is the one with the earliest finish time." A preemptive version of the algorithm is also presented, which additionally supports a tolerance parameter allowing users consuming less than their fair share of bandwidth to see lower delay. The packet-dropping policy used in FQ is: "when the queue is full and a new packet arrives, the last packet from the user using the most buffer is dropped," which provides a small incentive for users to implement better flow control.
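The selection rule is easy to illustrate when every flow is continuously backlogged from time zero, because the round-number term in the finish-time formula then drops out and the tag reduces to a per-flow running sum of packet lengths. The following sketch (my own simplification of the non-preemptive algorithm) orders packets by those tags:

```python
import heapq

def fair_queue_order(flows):
    """Packet-level approximation of bit-by-bit round robin: each packet gets a
    finish tag F = F_prev_of_flow + length (all flows assumed backlogged at
    t = 0), and packets are transmitted in order of earliest finish tag."""
    heap = []
    for flow, sizes in flows.items():
        tag = 0
        for seq, size in enumerate(sizes):
            tag += size                          # F_i^a = F_i^(a-1) + P_i^a
            heapq.heappush(heap, (tag, flow, seq, size))
    order = []
    while heap:
        _, flow, seq, size = heapq.heappop(heap)
        order.append((flow, seq, size))
    return order

# A bulk flow with large packets and a telnet-like flow with small packets:
# all ten small packets finish no later than the first large packet's tag,
# instead of waiting behind three large packets as they would under FIFO.
print(fair_queue_order({"bulk": [1000, 1000, 1000], "telnet": [100] * 10}))
```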

Finally, through mathematical derivation and simulation, the authors establish the performance and fairness gain of modified FQ working in conjunction with different congestion control schemes.

Thoughts

The use of the source-destination pair as the user is not properly justified in the paper. The authors do mention the possible security concern arising from it (a malicious source hogging resources by creating many source-destination pairs), but then ignore it completely.

Also, the memory requirements for keeping state (too many entries in the case of source-destination pairs), the cost, and the deployability of FQ in gateways are not discussed.

Congestion Avoidance and Control

V. Jacobson, M. Karels, “Congestion Avoidance and Control,” ACM SIGCOMM Conference, (August 1988). [PDF]

Summary

This paper argues that many of the congestion-related problems (observed circa 1988 on LBL-to-UC Berkeley connections between 4.3 BSD machines) stem from flawed implementations of and bugs in the transport protocol, rather than from the protocol itself. The authors present examples of incorrect interpretations of the window-based transport protocol and propose several algorithms (e.g., RTT variation estimation, exponential retransmit timer backoff, slow start, a more aggressive receiver ACK policy, and dynamic window sizing on congestion) to ensure network stability, all grounded in the so-called packet conservation principle.

By packet conservation the authors mean that, for a connection in steady state, a new packet is not put into the network until an old packet leaves; i.e., the total number of packets in flight for a connection running stably with a full window of data is constant. They observe that congestion can occur only if this principle is violated, which can happen in one of three ways:

  1. The connection doesn’t get to the equilibrium, or
  2. A sender injects new packet before an old packet has exited, or
  3. The equilibrium can’t be reached because of resource limits along the path.

Based on this observation, the authors put forward measures to avoid, prevent, or recover from each of the three possible violations of the conservation principle.

Contributions

  • Slow start: The Internet transport protocol is self-clocking, meaning it adjusts to the connection speed based on the rate of received ACKs. When a connection starts or restarts after a failure, it may never reach equilibrium because there are no ACKs to begin with. To solve this problem, the paper introduces the concept of a congestion window and presents the slow-start algorithm, which smoothly ramps up to the connection speed, resulting in increased throughput (see the sketch after this list).
  • Round-trip timing: Assuming the rest of the protocol implementation is correct, the authors argue that a sender might inject a new packet before an old one has exited if its estimate of the round-trip time is wrong. They therefore propose using the RTT variation when calculating the retransmission timeout, with the aim of reducing spurious retransmissions.
  • Exponential retransmit timer backoff: The authors argue that, for multiple retransmissions of the same packet, only exponential backoff can avoid congestion in the general case.
  • Congestion avoidance: The paper assumes that packet loss in a network occurs mostly (with >99% probability) due to congestion. To counter this, there must be a control-feedback mechanism so that network elements can signal congestion and end hosts can act accordingly. The authors end up with (not surprisingly) a linear AIMD solution, using packet loss as the congestion signal rather than explicit congestion-indicator bits.
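A compact sketch of two of these pieces may be useful: the sender-side window dynamics (slow start up to a threshold, then linear growth, then multiplicative decrease on loss) and the mean-plus-deviation retransmit timer. The gains of 1/8 and 1/4 and the rto = srtt + 4 * rttvar form are the commonly cited ones; treat the whole class as an illustration rather than the paper's exact pseudocode:

```python
class TCPSenderSketch:
    """Illustrative sender state: slow start / congestion avoidance for the
    window, and a mean/deviation RTT estimator for the retransmit timer."""

    def __init__(self):
        self.cwnd = 1.0         # congestion window, in segments
        self.ssthresh = 64.0
        self.srtt = None        # smoothed round-trip time
        self.rttvar = 0.0       # smoothed mean deviation of the RTT

    def on_ack(self):
        if self.cwnd < self.ssthresh:
            self.cwnd += 1.0                # slow start: window doubles per RTT
        else:
            self.cwnd += 1.0 / self.cwnd    # congestion avoidance: +1 segment per RTT

    def on_loss(self):
        self.ssthresh = max(self.cwnd / 2.0, 1.0)   # multiplicative decrease
        self.cwnd = 1.0                             # and restart with slow start

    def on_rtt_sample(self, rtt):
        if self.srtt is None:
            self.srtt, self.rttvar = rtt, rtt / 2.0
        else:
            err = rtt - self.srtt
            self.srtt += err / 8.0                          # gain 1/8 on the mean
            self.rttvar += (abs(err) - self.rttvar) / 4.0   # gain 1/4 on the deviation
        return self.srtt + 4.0 * self.rttvar                # retransmit timeout

s = TCPSenderSketch()
for _ in range(6):
    s.on_ack()
print(s.cwnd)                                           # grows quickly below ssthresh
print(s.on_rtt_sample(0.100), s.on_rtt_sample(0.200))   # the timeout widens as samples vary
```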

Comments

The results presented in the paper are oriented toward the wired Internet, which is understandable given when it was published. However, the applicability of these results to wireless and ad-hoc networks is questionable; e.g., congestion is NOT the only source of packet loss in a wireless network.

One interesting prediction in this paper is that, as the Internet grows bigger, higher-order control functions will be needed to avoid congestion. Recent congestion avoidance algorithms do use cubic control functions!

Overall, this paper is well argued, with lots of comments and footnote annotations that help the reader follow what is going on. The authors also borrow and combine interesting results from different proposals to create a holistic solution, which by itself is an art of system building.

Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks

D-M Chiu, R. Jain, “Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks,” Computer Networks and ISDN Systems, 17 (1989), pp 1-14. [PDF]

Summary

"Low delay" and "high throughput", the two main operating objectives of a computer network, are frequently hampered by congestion. Once congestion occurs, it takes a considerable amount of time and network resources to get back to equilibrium (in this case, "non-congested state" is probably the more appropriate term). Congestion avoidance is a proactive measure that strives to avoid congestion just before it occurs, thus improving network utilization and performance. This paper presents a mathematical analysis of several existing linear increase/decrease congestion avoidance algorithms and compares them using multiple performance metrics in a simplified network model.

Model

The authors consider variations of decentralized increase/decrease congestion avoidance algorithms, each using a "binary feedback scheme". In such a scheme, the network resource sends back a one to the controllers when it is overloaded and a zero otherwise. The basic assumptions of the model are:

  1. Users sharing the same bottleneck will receive the same feedback,
  2. Feedback and control loops for all users are synchronous (i.e., they take actions in lockstep), and
  3. There is only one bottleneck resource in the network.

Once users/end hosts receive feedback from the bottleneck resource, they adjust their loads; the resulting control scheme is evaluated against the following criteria:

  1. Efficiency: Measured in terms of the utilization of the resource; both overload and underload are undesirable.
  2. Fairness: Each user should receive an equal (proportional!) share of the bottleneck. Properties of a suitable fairness function can be found in the paper.
  3. Distributedness: The control scheme should require only a minimal amount of feedback from the resource under consideration; no central processing is required.
  4. Convergence: The control scheme should converge within a reasonable level of tolerance.

Results

Through mathematical reasoning and geometric representation, the authors compare AIAD, AIMD, MIAD, and MIMD (A: additive, M: multiplicative, I: increase, D: decrease) control schemes and prove that only AIMD can optimally converge to both efficiency and fairness in a distributed manner with minimal feedback. They also derive the operating region and control parameters for the scheme. A toy simulation of this convergence result follows.
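The following two-user simulation (arbitrary parameters, synchronous binary feedback from a single bottleneck as in the model above) reproduces the flavor of that result: AIMD drives unequal rates toward an equal split, while MIMD preserves the initial ratio forever.

```python
def simulate(update, x1, x2, capacity=100.0, steps=1000):
    """Synchronous binary feedback: the bottleneck reports 'overloaded' whenever
    the summed load exceeds capacity, and both users apply the same update."""
    for _ in range(steps):
        overloaded = (x1 + x2) > capacity
        x1, x2 = update(x1, overloaded), update(x2, overloaded)
    return x1, x2

def aimd(x, overloaded, a=1.0, b=0.5):
    return x * b if overloaded else x + a      # additive increase, multiplicative decrease

def mimd(x, overloaded, bi=1.05, bd=0.5):
    return x * bd if overloaded else x * bi    # multiplicative in both directions

print(simulate(aimd, 10.0, 80.0))   # the two rates end up nearly equal
print(simulate(mimd, 10.0, 80.0))   # the initial 1:8 ratio never goes away
```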

Non-linear controls are not considered in this paper, for the sake of practicality and because of modeling issues. While the authors agree that non-linear control offers more flexibility, they argue that its greater sensitivity to changing conditions can result in instability.

Critique

At the end of the paper, the authors themselves raise several questions about different scenarios that can affect AIMD in multiple ways; e.g., how does delayed feedback affect the control, what is the marginal utility of additional bits of feedback, is it worthwhile to guess the current number of users, and what is the impact of asynchronous operation? However, they leave all of these as future work! I wonder what the correct answers (if any) to these questions are.

The paper itself is a wonderful read, with fairly easy, understandable math and explanatory geometric representations. However, it is not clear how the authors arrived at the massive logarithmic expression when calculating the conditions for optimal convergence to efficiency.

As far as I understand, CUBIC TCP for high-speed networks uses a non-linear (cubic) control function for congestion avoidance. Discussion of, or pointers to, other non-linear variants would be very welcome.

Interdomain Internet Routing

H. Balakrishnan, “Interdomain Internet Routing,” MIT Lecture Notes, (January 2009).

Summary

Connectivity between different autonomous systems (ASes) with heterogeneous characteristics was the primary objective in designing the Internet. Interdomain routing enables end-to-end communication and global connectivity in a distributed manner through the dissemination of reachability information among the ASes/ISPs of the Internet. This lecture provides an overview of the existing interdomain routing landscape: its components and the relationships between them, the currently used interdomain routing protocol (BGP4), and the underlying mechanisms as well as the economic forces that dictate its operation.

The Internet is a connected graph with ASes (uniquely identifiable and managed by different ISPs) as its nodes and interdomain links (between border gateways of adjacent ASes) as its edges. Each AS has a range of IP addresses that it advertises (after aggregation) to its neighbors, along with the reachability information it has received from its other neighbors. In steady state, each AS ends up with a global view in which it knows which AS to forward a packet to for a particular destination prefix. In practice, however, ISPs differ in size and reachability, which results in an ISP hierarchy with different relationships (e.g., peer-to-peer, provider-customer) along the hierarchy. Moreover, there is competition and economic tension between connected ISPs. The lecture discusses different ISP relationships (e.g., peering, transit) and related concepts and mechanisms (e.g., ranking, filtering) from an economic standpoint.

The main objective of an interdomain routing protocol is to provide reachability in a scalable manner without violating the autonomy (in terms of policies) of the ISPs involved. In the existing Internet, BGP (a path-vector protocol) provides scalable, policy-based routing under these competitive circumstances. As the lecture points out, BGP lacks security and trust and is prone to misconfiguration, implementation bugs, and human error; yet in terms of connectivity, BGP is quite resilient and fault-tolerant. It should be noted that the first BGP draft came out in 1989, one year after the DARPA design paper, when security was not the biggest concern.

BGP comes in two flavors: eBGP connects border routers of different ISPs, while iBGP connects the border routers within a single ISP in a full mesh to share reachability information. In larger ISPs, iBGP faces scalability issues; techniques to mitigate them (e.g., route reflection, hierarchical organization) are addressed in the lecture. In addition, BGP attributes and their roles in route selection, filtering, and path ranking are discussed. Finally, a brief overview of the biggest BGP challenges (e.g., prefix hijacking, convergence, and multi-homing issues) concludes the discussion.
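For flavor, here is a stripped-down sketch of the kind of economically driven ranking the lecture describes (highest local preference first, then shortest AS path, then lower MED); the attribute names are simplified, and real BGP decision processes apply several more tie-breakers:

```python
def best_route(routes):
    """Pick the preferred route to a prefix: LOCAL_PREF encodes economics
    (customer routes > peer routes > provider routes), then shorter AS path,
    then lower MED. Real routers continue with more tie-breakers after these."""
    return min(routes, key=lambda r: (-r["local_pref"], len(r["as_path"]), r["med"]))

routes_to_prefix = [
    {"via": "provider", "local_pref": 80,  "as_path": [701, 1239],         "med": 0},
    {"via": "peer",     "local_pref": 90,  "as_path": [7018, 1239],        "med": 0},
    {"via": "customer", "local_pref": 100, "as_path": [64512, 3356, 1239], "med": 0},
]
print(best_route(routes_to_prefix)["via"])   # 'customer': revenue beats path length
```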

Critique

Most policies in interdomain routing are driven by economic objectives, be it route selection, exporting reachability information, or traffic engineering. The best thing about this lecture is that it frames the discussion from an economic perspective instead of diving deep into the technical details of BGP's internal mechanisms.

One of the fascinating things about BGP is that even though each ISP is a black box to the rest of the world and there is little trust between ISPs, BGP still works, and works pretty well. Consequently, a significant literature in game theory and mechanism design has developed to try to understand and model the behavior and incentives of ISPs in the Internet. I am not really informed about its success, though!

Security concerns and ample opportunities for human error are the two weakest points of BGP's design and implementation. Even though S-BGP exists to add some security, it is not deployed, mostly for bureaucratic reasons. That is unfortunate, but true for any change to the core of the Internet. However, it is unclear why no vendor has yet created an expressive/declarative, type-safe language for BGP configuration. Keeping the underlying mechanisms the same while providing a safer human interface might eliminate a large number of misconfiguration errors.

To address various perceived deficiencies of BGP, tens (maybe even hundreds) of interdomain routing protocols have appeared over the years without any success. Does that prove BGP's success, or is everyone involved just too reluctant to spend the time, money, and effort to replace it?

Understanding BGP Misconfiguration

R. Mahajan, D. Wetherall, T. Anderson, “Understanding BGP Misconfiguration,” ACM SIGCOMM Conference, (August 2002).

Summary

Misconfigurations in BGP tables result in excessive routing load, connectivity disruption, and policy violations. They are widely prevalent in the Internet, yet there had been no systematic study of their characteristics. This paper presents an empirical study of BGP misconfigurations with the aim of determining their frequency and root causes. The authors analyzed routing-table advertisements over a three-week period from 23 vantage points using the RouteViews listener to detect candidate misconfigurations, and they separately polled the ISPs involved to verify what they observed. Their final verdict is that the perceived end-to-end connectivity of the Internet is surprisingly robust! (Does that sound familiar?) Apart from that, there are several numerical results, most of them conservative lower bounds on different configuration errors. The authors also propose several fixes to address the sources of misconfiguration.

The authors consider two globally visible types of BGP misconfiguration (misconfigurations here meaning inadvertent slips and design mistakes):

  1. Origin misconfiguration: Accidental injection of prefixes into the global tables, due to causes such as (i) aggregation rule errors, (ii) hijacks, and (iii) leakage of private routes. These are classified as (i) self deaggregation, (ii) related origin, and (iii) foreign origin.
  2. Export misconfiguration: AS paths that violate the policy of one of the ASes in the path.

The basic assumption used to declare a new prefix an artifact of error is that misconfigurations appear and disappear quickly, while policy changes happen on human timescales. To be precise, the authors consider only short-lived (less than a day) changes as candidate misconfigurations, and they double-check their observations through email surveys of the ISPs involved.
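A sketch of that filtering idea (my own rendering: the one-day cutoff mirrors the description above, but the data layout and function are made up, and the real study works on RouteViews table snapshots and update streams):

```python
DAY = 24 * 3600

def short_lived_announcements(updates, cutoff=DAY):
    """Flag (prefix, origin AS) pairs that are announced and then withdrawn
    within `cutoff` seconds: candidates for misconfiguration rather than a
    deliberate policy change. `updates` is an iterable of
    (timestamp, prefix, origin_as, kind) tuples sorted by time,
    with kind 'A' for announce and 'W' for withdraw."""
    announced, candidates = {}, []
    for ts, prefix, origin, kind in updates:
        key = (prefix, origin)
        if kind == "A":
            announced.setdefault(key, ts)
        elif kind == "W" and key in announced:
            if ts - announced.pop(key) < cutoff:
                candidates.append(key)
    return candidates

stream = [
    (0,        "10.1.0.0/16",  64500, "A"),
    (3 * 3600, "10.1.0.0/16",  64500, "W"),   # gone within three hours: suspicious
    (0,        "192.0.2.0/24", 64501, "A"),   # stays up: treated as intentional
]
print(short_lived_announcements(stream))      # [('10.1.0.0/16', 64500)]
```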

From their empirical study the authors discovered several intriguing results:

  • Origin misconfigurations account for 13% of incidents, but only 4% of them result in connectivity issues (i.e., disrupting connectivity is not easy). Moreover, 95% of the incidents are corrected in under 10 hours, and 95% of the connectivity problems are resolved in under an hour!
  • Export misconfigurations do not cause connectivity issues directly, but they bring in extra traffic that can cause congestion, leading to negligible indirect connectivity errors.
  • Overall, about 75% of new prefix announcements are the result of misconfiguration in one way or another.
  • Apart from human error, which the authors set out to demonstrate in the first place, software bugs play a significant role in BGP problems.

Based on the email survey results, the paper identifies several sources of both types of misconfiguration (e.g., bugs, filtering errors, stale configuration, hijacking, bad ACLs). To counter these sources, the authors also propose solutions such as UI improvements, a high-level configuration language, database consistency, and deployment of secure protocols (e.g., S-BGP).

Critique

One noticeable omission is that the authors do not propose any automated monitoring and identification system that could make life easier for network operators. Maybe the results presented here will provide insight into solving some of the issues, but manual methods will probably not help in identifying unknown phenomena. (Just saw that Neil added some links to ML-based studies, possibly for automated identification.)

Somewhere it is said that temporary route injection could be done to fend off DDoS attacks; a better explanation during discussion might help in that respect. The authors also remark that only a few BGP features are responsible for most of the errors (does it follow the Pareto principle?), but they do not point out which features are actually responsible. Is it even possible to figure those out? One more thing mentioned in the paper is the problem of silent failures. How do we make sure that routers do not fail silently? Do we use heartbeat protocols, or just log every action for later inspection?

There is not much to say about the inconsistent databases scattered throughout the Internet or the inability to incorporate extensions (e.g., S-BGP) into the network. Some things never change.

Feedback

While this paper aims to untangle the mysteries behind BGP misconfigurations, at its heart it is an excellent network measurement paper. The way the authors arrange the bits and pieces to build up the whole paper is incredible. (An aside: Ratul was also involved in the Rocketfuel paper, another excellent network measurement paper, which won the best student paper award at SIGCOMM 2002.)