
VL2: A Scalable and Flexible Data Center Network

A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, S. Sengupta, “VL2: A Scalable and Flexible Data Center Network,” ACM SIGCOMM, (August 2009). [PDF]

Summary

VL2 is a scalable data center architecture that provides uniform high capacity between servers, performance isolation between services, Ethernet layer 2 semantics in large data centers, and dynamic resource allocation across large server pools (aka agility). The authors pose agility as the main motivation behind their design, arguing that without it the server pool becomes fragmented and suffers from hot-spots, congestion, lack of isolation between services, and large management overhead. VL2’s design promotes changes in the end hosts and server protocol stacks to work around existing switching hardware limitations (by leaving the switches unmodified). VL2 uses flat addressing, Valiant Load Balancing (VLB), end host based address resolution, a Clos network topology, and makes use of existing protocols and technologies. There is a separation between the identity and the location of end hosts, and VL2 uses a directory service to keep track of the mappings.

One of the major contributions of this paper is the detailed traffic characteristics analysis of a large data center. The authors observed that

  • majority of flows are small,
  • more than 50% of the time each machine has about 10 concurrent flows, but at least 5% of the time it has greater than 80 concurrent flows,
  • traffic pattern is divergent and cannot easily be summarized,
  • there is no structure in traffic pattern,
  • most failures are small in size,
  • in 0.3% of failures all redundant (1:1 redundancy) components in a network device group become unavailable.

The design principles and the actual design of VL2 are directly influenced by these observations:

  • VL2 replaces conventional hierarchical topologies with a Clos network topology that ensures graceful degradation of bandwidth on failure and simpler routing.
  • VL2 uses randomization to deal with volatility and churn in the data center traffic matrix. Its use of VLB spreads traffic across multiple paths irrespective of the destination.
  • VL2 is based on existing routing and forwarding technologies (e.g., OSPF, ECMP etc.) to ensure that it can be implemented on existing hardware. Note that, OSPF is used to maintain switch-level topology, not to disseminate end host information.
  • Names and locators are separated into application-specific addresses (AAs) and location-specific addresses (LAs). AAs remain fixed for an entity, while LAs change as its location changes.
  • Finally, there is a two-tiered directory service (one tier optimized for performance, the other for consistency) that keeps track of the MAC, AA, and LA mappings for routing and forwarding purposes. A toy sketch of how the directory lookup and VLB fit together is shown after this list.
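To make the interplay between the directory service and VLB concrete, here is a minimal Python sketch of what a VL2-style host agent could do when sending a packet: resolve the destination AA to an LA via the directory, then bounce the packet off a randomly chosen intermediate switch. The addresses, switch names, and the dictionary-based directory are all hypothetical; the real system uses a replicated directory service and IP-in-IP encapsulation.

```python
import random

# Toy VL2-style directory: maps an application address (AA) to the location
# address (LA) of the ToR switch currently hosting that server (hypothetical values).
directory = {
    "10.0.0.5": "20.1.1.1",
    "10.0.0.9": "20.1.2.1",
}

# Hypothetical set of intermediate (spine) switches used for Valiant Load Balancing.
intermediate_switches = ["30.0.0.1", "30.0.0.2", "30.0.0.3"]

def send(dst_aa, payload):
    """Encapsulate a packet the way a VL2 agent might: identity-to-location lookup,
    then a random intermediate hop so traffic is spread regardless of destination."""
    dst_la = directory[dst_aa]                    # AA (identity) -> LA (location)
    via = random.choice(intermediate_switches)    # VLB: pick a random intermediate
    # Outer header goes to the intermediate switch, inner header to the destination ToR.
    return {"outer_dst": via, "inner_dst": dst_la, "dst_aa": dst_aa, "payload": payload}

print(send("10.0.0.5", b"hello"))
```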

Discussion

The nicest part of this paper is the study (the first of its kind, as the authors put it) of data center traffic over a long period of time. The observations are interesting, even though they might be tightly coupled with the type of services running on that particular data center. Unlike the Internet, different data centers might host niche applications/services; it will be interesting to see whether more such studies come out of other corporations.

Use of established protocols and ideas is another strong point of VL2. The authors synthesize the architecture from VLB, link-state routing, ECMP, and the separation of location and identity on top of a Clos network topology (which apparently has been in use in telecom networks for a while and deals with over-subscription problems). The evaluation section of the paper also covers a lot of ground. There is one interesting thing regarding fairness though: the authors use Jain’s fairness index to show how evenly flows are split, while relying on TCP for rate limiting and fair sharing (max-min fairness, supposedly). Can these two go together? (not sure, can be further discussed)

There are lots of similarities between VL2 and PortLand, along with some differences. As Amin Vahdat points out in his blog, the core difference is the philosophy of modifying the end hosts (in VL2) vs. modifying the switches/routers (in PortLand); everything else stems from this. There is a nice article on Network World that covers the basics of both architectures. In addition, there is the difference between a multi-rooted fat tree topology and a Clos network topology; how different they really are is not very clear, though.

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

R. N. Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, A. Vahdat, “PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric”, ACM SIGCOMM, (August 2009). [PDF]

Summary

PortLand is a scalable, fault-tolerant, and easily manageable layer 2 routing and forwarding protocol for data center environments that exploits knowledge of the underlying data center topology, completely avoids broadcast-based mechanisms, and stresses fast and easy virtual machine mobility within the data center. The authors outline several requirements for a future data center network and explain why existing layer 2 and layer 3 fabrics as well as VLAN-based solutions are not suitable. PortLand combines the advantages of such solutions using a pre-determined topology (fat tree), hierarchical pseudo-MAC (PMAC) addresses, and a logically centralized directory (the fabric manager, or FM) that resolves PMACs to actual MAC (AMAC) addresses to enable Ethernet-compatible plug-and-play networking.

The design philosophy of PortLand prefers changes in the internal nodes, keeping the end hosts unmodified. PortLand’s design revolves around the assumption that the baseline multi-rooted fat tree topology is unlikely to change over time. Information regarding the topology as well as other configuration data is stored as soft state in the logically centralized FM. There is an explicit separation of host location and identity in PortLand. Each node is assigned a hierarchical PMAC address that identifies its exact position in the fat tree (as opposed to a flat MAC identifier), with the property that all the nodes in a pod subtree share the same PMAC prefix. All packets are routed and forwarded using PMAC prefixes, with appropriate conversions between PMAC and AMAC addresses at the ingress and egress points to give the end hosts the illusion of unmodified MAC addresses. The mappings between PMAC, AMAC, and IP addresses are stored in and managed by the FM, which allows the FM to handle ARP and DHCP without broadcasting. For increased manageability, PortLand proposes a location discovery protocol (LDP) that automates the PMAC assignment process. Packets in PortLand are forwarded through the least common ancestor in the fat tree; hence, there is no possibility of a loop. The authors also consider different failure scenarios and propose explicit solutions.
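A rough sketch of the PMAC idea in Python is below. If I am reading the paper correctly, the 48-bit PMAC is laid out as pod.position.port.vmid (16/8/8/16 bits), and prefix-based forwarding only needs to look at the pod portion; treat the exact field widths here as illustrative.

```python
def encode_pmac(pod, position, port, vmid):
    """Pack a PortLand-style PMAC as pod.position.port.vmid.
    The 16/8/8/16-bit split follows my reading of the paper; take it as approximate."""
    assert pod < 2**16 and position < 2**8 and port < 2**8 and vmid < 2**16
    value = (pod << 32) | (position << 24) | (port << 16) | vmid
    # Render as a familiar-looking 48-bit MAC string.
    return ":".join(f"{(value >> shift) & 0xff:02x}" for shift in range(40, -8, -8))

def pod_prefix(pmac):
    """All hosts in the same pod share the leading bytes, so prefix-based forwarding
    in the fat tree only needs to match on the pod portion of the PMAC."""
    return pmac[:5]   # first two bytes, "xx:xx"

pmac = encode_pmac(pod=3, position=1, port=7, vmid=2)
print(pmac, pod_prefix(pmac))
```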

To evaluate PortLand, the authors created a testbed of 20 4-port NetFPGA PCI-card switches interconnecting 16 end hosts and used OpenFlow to control the switch forwarding tables. They present several convergence characteristics, scalability measurements, and VM migration performance numbers for PortLand (without comparing against competing architectures).

Critique

In the PortLand design, the FM can become a bottleneck since all the functionality relies on it. While the paper discusses several fault scenarios, it pretty much avoids discussing FM failure, saying only that the FM is distributed and replicated, hence robust. Some qualitative evaluation of FM failure could provide important insight into PortLand’s fault tolerance. The evaluation section in general is not very strong. The authors could have evaluated the overhead of changing PMAC addresses and the effect of multiple simultaneous VM migrations, among other things.

PortLand is highly customized for the multi-rooted fat tree topology. The qualitative comparison against SEATTLE is a bit unfair, as SEATTLE, being designed for general topologies, cannot afford such customized optimizations. Also, the evaluation section contains no quantitative comparison against any other architecture.

Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises

C. Kim, M. Caesar, J. Rexford, “Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises,” ACM SIGCOMM Conference, (August 2008). [PDF]

Summary

SEATTLE is an enterprise network architecture that combines the scalability of IP networks with the simplicity of Ethernet without incurring their respective drawbacks. It uses a flat address space to provide plug-and-play functionality and ensures scalability and performance through shortest-path routing, a hash-based host resolution mechanism, aggressive caching, and one-hop DHT lookups that avoid the flooding inherent in ARP and DHCP.

The authors present a nice overview of the existing technologies used in enterprise networks and argue why Ethernet, though widely adopted, cannot continue to be the method of choice in relatively large networks even though it has some nice properties (e.g., a flat address space for instant installation, manageability, etc.). There are several critical faults in Ethernet-based solutions:

  • Ethernet bridging requires flooding the network to locate previously unknown end hosts (i.e., end hosts whose MAC addresses are not yet in the forwarding tables). This requires a large amount of state and incurs control overhead that grows linearly with the number of end hosts.
  • Ethernet arranges the broadcast domain as a spanning tree, which cannot scale to larger networks given low-latency and high-availability requirements.
  • In an Ethernet solution, ARP is used to resolve IP to MAC addresses, which relies on broadcasting and hence is not scalable. The same goes for DHCP, commonly used for dynamic IP address allocation to end hosts.

Scalability issues in Ethernet can be solved by arranging hosts into IP subnets and then using shortest-path routing over IP addresses. However, as the authors point out, IP is harder to manage and requires a considerable amount of resources. There are other solutions, such as VLANs, that allow logical grouping of hosts for better management while providing many of Ethernet’s advantages. However, VLANs also rely on IP to allow hosts in different VLANs to communicate.

Considering all the options, the authors go for a middle ground and implement a one-hop, network-layer DHT at the switch level that stores the locations of all the end hosts, on top of a link-state routing protocol. The DHT also acts as a directory for address resolution and provides service discovery. SEATTLE forwards packets based on end host MAC addresses, using the DHT as a mapper and resolver. In order to reduce control overhead during network failures or excessive churn, SEATTLE uses consistent hashing, and it balances load using the concept of virtual switches. Further scalability is supported by creating a multi-level hierarchy of one-hop DHTs. In addition, SEATTLE removes the broadcasting requirement from ARP and DHCP.
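The one-hop DHT boils down to consistent hashing over the live switches known from link-state routing: every switch can compute, locally, which switch is responsible for a given host’s MAC address. A minimal Python sketch of that resolver selection (the switch names and hash function are my own choices, not SEATTLE’s) is:

```python
import hashlib

def h(x: str) -> int:
    """Hash a switch ID or host MAC onto a circular key space."""
    return int(hashlib.sha1(x.encode()).hexdigest(), 16)

def resolver_switch(mac: str, live_switches: list) -> str:
    """One-hop DHT lookup as I understand SEATTLE: the switch whose hash is the
    closest successor of hash(mac) on the ring stores that host's location.
    Consistent hashing means a switch failure only remaps the keys it owned."""
    key = h(mac)
    ring = sorted(live_switches, key=h)
    for s in ring:
        if h(s) >= key:
            return s
    return ring[0]   # wrap around the ring

switches = ["sw-a", "sw-b", "sw-c", "sw-d"]
print(resolver_switch("00:1a:2b:3c:4d:5e", switches))
```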

SEATTLE promotes traffic-driven, reactive location resolution and caching to avoid excessive load on the directory service. Cache updates in this architecture are based on explicit notifications, a heart-beat protocol, and some other clever workarounds (e.g., a MAC eviction list for updating remote hosts’ caches). Thanks to these features, SEATTLE inherently supports mobility without having to do anything explicitly.

On top of all this, the proposed architecture is backward compatible in the sense that it exposes Ethernet-like semantics for immediate deployment. However, under the hood, all broadcasting is replaced by unicast or DHT lookups, multicasting and access control are supported using the concept of groups, and bootstrapping functionality is modified to avoid flooding.

The authors demonstrate the validity and performance of SEATTLE using extensive packet-level simulations of real enterprise network traces and through experimentation using XORP and the Click modular router.

Critique

SEATTLE’s focus on solving an old problem using a clean-slate design without throwing out old values and restrictions is admirable. The paper itself is very well-presented and hence easy to follow. One special attraction of this paper is the detailed discussion on the past and the state-of-the-art in enterprise networks.

The hierarchical arrangement of DHTs for scalability is half-baked and not well explained (an aside: it is most probably inspired by Caesar’s previous work on ROFL). However, it is unlikely that too many levels in the hierarchy will be necessary (how many end hosts can there be in an enterprise network anyway?).

Security can be an issue if any of the end hosts is compromised, but one can probably assume that within the guarded perimeter of an enterprise network it can be handled somehow (at least with higher probability than in the Internet)!

Detailed Diagnosis in Enterprise Networks

S. Kandula, R. Mahajan, P. Verkaik, S. Agarwal, J. Padhye, P. Bahl, “Detailed Diagnosis in Enterprise Networks,” ACM SIGCOMM Conference, (August 2009). [PDF]

Summary

Diagnosing generic and application-specific faults in networked systems is known to be notoriously hard due to dynamic and complex interactions between components. This paper presents NetMedic, a system that enables detailed diagnosis in small enterprise networks by using rich (multi-variable) information exposed by the system itself and the applications running on it. NetMedic formulates diagnosis as an inference problem and detects complex, intertwined faults by using a simple history-based learning mechanism, without any explicit knowledge of the underlying semantics of the collected information and with little application-specific knowledge. It models the network as a dependency graph of fine-grained components and assigns values to dependency edges based on the likelihood of one component affecting another. Using these edge weights, NetMedic ranks possible causes of faults to find the true culprit. The work is limited to automatic identification, i.e., automatic repair is considered out of scope.

NetMedic has two basic properties:

  1. Detail: It can diagnose both application-specific and generic problems with as much specificity as possible. Since most faults are application-specific, digging deep is considered an integral attribute for getting closer to the context of a fault.
  2. Application agnosticism: It can diagnose with minimal application-specific knowledge. This property allows NetMedic to function properly even when new, unknown applications are introduced.

Inference in NetMedic depends on the joint historical information of components impacting one another. For any two components, NetMedic looks back into the history (from any period) to find states similar to the current ones and uses their correlation to assign a likelihood to the current situation, with the hope that history will repeat itself. The authors argue that the exact values are not that important; what matters is assigning low weight to enough edges so that, in the end, it is easy to rank possible faults. This type of expectation seems pretty naive, but the authors insist that it is enough, without strong justification or real data.

The workflow of NetMedic can be divided into three functional pieces:

  1. Capturing component state: NetMedic captures the state of each component as a multi-variable vector, as opposed to the scalar variables used in existing approaches. It stores the states in one-minute bins to be processed later when creating the dependency graph.
  2. Generating the dependency graph: NetMedic uses pre-created templates to find the relationships between different types of components and stitches them together to complete the dependency graph. The use of templates requires human intervention whenever there is an unknown component without a template.
  3. Diagnosis: Finally, NetMedic computes the abnormality of states, assigns values to the edges of the dependency graph (details in §5.3.2 of the paper), and ranks likely causes to identify the true causes of faults. The abnormality detection process based on historical information introduces false positives. Based on edge abnormality, NetMedic assigns weights to edges using multiple heuristics and threshold constants. Finally, at the ranking stage it uses a compound ranking function consisting of two components: the local and global impact of a cause. A toy sketch of the history-based weighting idea is given after this list.
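To make the history-based intuition concrete, here is a toy Python caricature (my own construction, not the actual scheme from §5.3.2): for a dependency edge from component S to component D, find past time bins in which S’s state vector looked similar to its current state, and use how abnormal D was in those bins as the edge weight.

```python
import numpy as np

def edge_weight(src_history, dst_history, src_now, k=10):
    """Toy caricature of history-based edge weighting, not NetMedic's actual scheme:
    find the k past bins where the source component's state vector was most similar
    to its current state, and score the edge by how abnormal the destination
    component was in those bins (higher score => more likely cause)."""
    # Distance from the current source state to every historical source state.
    dists = np.linalg.norm(src_history - src_now, axis=1)
    similar_bins = np.argsort(dists)[:k]
    # Abnormality of dst in those bins, relative to its overall behaviour.
    dst_mean = dst_history.mean(axis=0)
    dst_std = dst_history.std(axis=0) + 1e-9
    abnormality = np.abs((dst_history[similar_bins] - dst_mean) / dst_std).mean()
    return float(abnormality)

rng = np.random.default_rng(0)
src_hist = rng.normal(size=(200, 4))     # 200 one-minute bins, 4 state variables
dst_hist = rng.normal(size=(200, 3))
print(edge_weight(src_hist, dst_hist, src_now=src_hist[-1]))
```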

The authors try to justify/validate their scheme using a limited and controlled evaluation, which is extremely lacking and does little to convince the reader.

Critique

Except for the facts that this is the first study of faults in enterprise networks (as claimed by the authors) and the first of its kind to use multi-variable states to infer faults, everything else is not very convincing. The model seems too simplistic, with heuristics and constants all over the place. The authors keep arguing that all those things work in real networks, even though they evaluated the system on two primitive, controlled networks. The way they use historical information (linear correlation) also seems pretty naive for real-world scenarios, where the interaction between two components can often be influenced by many others.

Dependence on history also restricts NetMedic’s ability to deal with previously unseen faults (considering the fact that it is not a full-fledged learner). It is not clear how the authors handle this issue. In particular, when a new instance of NetMedic starts with no history at all, how long will it take to reach the reported level of performance, even in the controlled scenarios presented in the paper?

NetMedic seems to work when diagnosing a single fault (in a controlled environment), but it does not do well with more than one simultaneous fault even in the test scenarios. The reliability of the evaluation regarding multiple faults can also be questioned, since it generalizes from one synthetic test case (maybe more, not shown here) and claims NetMedic will perform similarly in other multi-fault scenarios. To do the authors justice, it should be mentioned that it is extremely difficult to evaluate such systems without long-term deployment in real scenarios.

Congestion Control for High Bandwidth-Delay Product Networks

D. Katabi, M. Handley, C. Rohrs, “Congestion Control for High Bandwidth-Delay Product Networks,” ACM SIGCOMM Conference, (August 2002). [PDF]

Summary

Regardless of the gateway queueing scheme, TCP becomes inefficient and suffers from instability as the per-flow product of bandwidth and delay increases. The main reason is that TCP’s feedback mechanism is neither fast nor expressive enough to respond to changing conditions in high bandwidth-delay product networks. This paper presents XCP (eXplicit Control Protocol), which outperforms TCP in both conventional and high bandwidth-delay product networks in terms of efficiency, fairness, scalability, and stability. XCP extends the concept of ECN (Explicit Congestion Notification) by using richer notifications and builds upon CSFQ’s core-stateless router concept (no per-flow state in the core).

XCP is based on the following design rationales:

  1. Congestion should have explicit notifications instead of using packet drops as a signal. Moreover, such explicit notifications should be rich (carrying the degree of congestion).
  2. Aggressiveness of the sources should be adjusted according to the delay in the feedback-loop; the system must slow down as feedback delay increases.
  3. Dynamics of aggregate traffic should be independent from the number of flows to ensure robustness to congestion.

The biggest contribution of XCP is the separation of the efficiency control mechanism from the fairness control mechanism, which provides a flexible analytical framework as well as support for service differentiation (i.e., inherent QoS support). Every XCP router has an efficiency controller (EC) and a fairness controller (FC). Both controllers estimate the average RTT, and a single control decision is taken per average RTT. XCP uses special congestion headers that are carried with the data packets and updated by the routers along the path, so the feedback returned to the source reflects the most congested (bottleneck) link. The EC computes aggregate feedback proportional to the spare bandwidth of the link, aiming to respond quickly and maximize link utilization (using MIMD), but it leaves the fair allocation of that aggregate feedback to the FC. The FC apportions the feedback among individual packets to achieve max-min fairness using AIMD (as in TCP) and a bandwidth shuffling mechanism. The authors show that XCP is robust to estimation errors and is parameter-insensitive. A proof of stability is also presented.
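As a concrete illustration of the EC, here is a small Python sketch of the aggregate feedback computation as I understand it from the paper: once per average RTT, the router computes phi = alpha * d * S - beta * Q, where d is the average RTT, S the spare bandwidth, and Q the persistent queue; alpha = 0.4 and beta = 0.226 come from the paper’s stability analysis. Treat the units and the wrapper function as illustrative rather than a faithful implementation.

```python
def xcp_aggregate_feedback(capacity_bps, input_rate_bps, avg_rtt_s,
                           persistent_queue_bytes, alpha=0.4, beta=0.226):
    """Aggregate feedback computed by the efficiency controller once per average RTT:
    phi = alpha * d * S - beta * Q. Positive phi asks flows to speed up (spare
    bandwidth available); negative phi asks them to slow down (standing queue)."""
    spare_bytes_per_rtt = (capacity_bps - input_rate_bps) / 8.0 * avg_rtt_s  # d * S, in bytes
    return alpha * spare_bytes_per_rtt - beta * persistent_queue_bytes

# Example: a 100 Mbps link carrying 80 Mbps, 50 ms average RTT, 20 KB standing queue.
phi = xcp_aggregate_feedback(100e6, 80e6, 0.05, 20_000)
print(f"aggregate feedback: {phi:.0f} bytes this control interval")
```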

Through extensive packet-level ns-2 simulations, the paper shows that XCP outperforms TCP with any queueing mechanism in conventional and high bandwidth-delay product networks in terms of bottleneck utilization, average queue size, and bottleneck packet drops across different scenarios (e.g., varying capacity, delay, and the number, length, and type of flows). The authors also consider the short- and long-term dynamics of XCP, its response to burstiness, and its feasibility as a QoS-enabling mechanism. They further point out that XCP’s broad operating range makes it suitable for many unconventional networks, including wireless, satellite, and DTN environments.

Thoughts

This paper is an excellent read, very well thought-out and delivered. The evaluation section is especially nice, covering a huge array of questions that might arise (the topologies are pretty simple though). The paper also provides a proof of the protocol’s stability that solidifies its claims. The concept of using extra information to make smarter decisions is not novel, but the way XCP differentiates between the EC and the FC, and how they interact at the aggregate and individual flow levels to ensure efficiency and fairness, is ingenious.

However, there are some aspects of the paper that are questionable. To make XCP secure, per-flow state is required at the edge routers; there is no evaluation of how expensive that might be. Some of the results are described in the paper differently than they appear in the plots, and in other cases results are not discussed at all. For example, the average bottleneck queue is seen to grow linearly with increasing bottleneck capacity but is never discussed. Also, there is a tendency to blame most discrepancies on the shuffling mechanism, even though it is an integral part of the XCP design.

Random Early Detection Gateways for Congestion Avoidance

S. Floyd, V. Jacobson, “Random Early Detection Gateways for Congestion Avoidance,” IEEE/ACM Transactions on Networking, (August 1993). [PDF]

Summary

Random Early Detection, or RED, detects impending congestion based on the average queue size and notifies connections of congestion through binary feedback, either by dropping packets or by marking bits in their headers. To this end, RED uses two thresholds: once the average queue size exceeds the minimum threshold, RED starts marking/dropping packets with a probability that is a function of the average queue size; when it exceeds the maximum threshold, RED marks all incoming packets. The authors argue against packet dropping as long as connections respond to notifications, and they resort to dropping only as an extreme measure. They also promote the use of FCFS/FIFO scheduling over FQ, citing the lower overhead of not having to keep per-flow state.

The RED gateway consists of two separate algorithms: one determines the average queue size, which in turn determines the maximum allowable burstiness, and the other calculates the probability of marking packets when the average queue size is between the minimum and maximum thresholds. The authors leave the choice of thresholds to the system designer’s desired average queue size; however, as rules of thumb they advise setting the minimum threshold sufficiently high to maximize network power (the ratio of throughput to delay) and setting the maximum threshold at least twice the minimum to avoid global synchronization and oscillation. Through observation and probabilistic calculation, it is shown that marks should be spread out at roughly uniform intervals rather than in clusters; the per-packet marking probability is therefore adjusted so that the gap between marks is a uniformly distributed random variable, which keeps the average queue size steady.
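Here is a compact Python sketch of the two algorithms (the EWMA of the queue size and the marking probability); the parameter values are similar in spirit to those used in the paper’s simulations, but the exact numbers and the helper structure are mine.

```python
import random

def red_mark(q_len, state, w_q=0.002, min_th=5, max_th=15, max_p=0.02):
    """Sketch of RED (queue sizes in packets): an EWMA of the queue size, plus a
    marking probability that grows linearly between the two thresholds and is
    corrected so that marks arrive at roughly uniform intervals, not in clusters."""
    # Algorithm 1: exponentially weighted moving average of the instantaneous queue size.
    state["avg"] = (1 - w_q) * state["avg"] + w_q * q_len
    avg = state["avg"]

    if avg < min_th:
        state["count"] = -1
        return False                      # below min threshold: never mark
    if avg >= max_th:
        state["count"] = 0
        return True                       # above max threshold: mark every packet

    # Algorithm 2: base probability interpolated between the thresholds, then scaled
    # by the number of unmarked packets since the last mark.
    state["count"] += 1
    p_b = max_p * (avg - min_th) / (max_th - min_th)
    p_a = p_b / max(1e-9, 1 - state["count"] * p_b)
    if random.random() < p_a:
        state["count"] = 0
        return True
    return False

state = {"avg": 0.0, "count": -1}
# Feed a persistently long queue so the average climbs through the thresholds.
marks = sum(red_mark(20, state) for _ in range(5000))
print("packets marked out of 5000:", marks)
```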

Interesting features of RED include TCP compatibility, avoidance of global synchronization through randomization, and the ability to maintain an upper bound on the average queue size (which also allows RED to handle bursty traffic better). Moreover, RED does not discriminate between different classes of connections, and there is no reward for staying within one’s fair share. RED provides no explicit mechanism against misbehaving users; instead, the authors argue that misbehaving connections face a higher probability of their packets being dropped/marked simply because they have more packets in the queue than the rest.

The authors back up their claims through simulation and provide mathematical bounds on the average queue size as well as on the two thresholds. They also provide a so-called fast version of their control algorithm, which is supposed to be useful for in-router implementation due to its reduced number of operations.

Discussion

The main contribution of this paper is the use of randomization/probabilistic analysis in a clever way (i.e., in conjunction with two different thresholds) that by itself obviates some of the nagging problems of other solutions (e.g., bias against bursty traffic, global synchronization etc.). The paper itself could be shorter though.

This paper promotes FCFS, saying that it shares delay among connections and reduces delay for particular connections, which is a thought process completely different from FQ’s. However, the method described for identifying misbehaving connections ends up being stateful, even though the authors shunned FQ precisely for its use of per-flow state (which FQ uses to find offenders and to ensure max-min fairness). The definition of fairness is vague, and it is not clear whether they strive for max-min fairness (there is no explicit mention; and since they use FCFS underneath, it’s unlikely).

RED leaves too many parameters to the system administrator, and it is quite sensitive to those parameters. Tuning these parameters not only makes their simulation results not-so-reliable (you can check only so many variations!), but can also create problems in realistic deployments (too many options to choose from; on the other hand, multiple knobs allow fine-tuning when in good hands).

Core-Stateless Fair Queueing: Achieving Approximately Fair Bandwidth Allocations in High Speed Networks

I. Stoica, S. Shenker, H. Zhang, “Core-Stateless Fair Queueing: Achieving Approximately Fair Bandwidth Allocations in High Speed Networks,” ACM SIGCOMM, (August 1998). [PDF]

Summary

Fair resource (bandwidth, buffer, etc.) allocation algorithms in routers (e.g., Fair Queueing) usually need to maintain a large amount of state, manage buffers, and perform packet scheduling on a per-flow basis, which makes them economically infeasible to implement in production routers. This paper proposes an architecture that significantly reduces these overheads by completely eliminating per-flow state from the core routers, yet manages to achieve approximately fair bandwidth allocation. The end result is a simplified queueing algorithm that provides performance comparable to its counterparts but is better suited for incremental deployment in the existing Internet.

The authors question the necessity, complexity, and adaptability of fair allocation mechanisms and base their contributions on two basic assumptions:

  1. Fair allocation is necessary, or at least important, for congestion control, and
  2. Complexity of then-existing fair allocation schemes is a substantial hindrance to their adoption.

At its core, CSFQ (Core-Stateless Fair Queueing) divides the whole Internet into islands of routers. Each island has edge routers that maintain per-flow state and label packets with their estimated flow rates as they enter the island, whereas core routers process packets based on the labels attached by the edge routers, using FIFO queueing in conjunction with probabilistic dropping. The two primary challenges in this mechanism are estimating arrival rates for labeling and estimating the fair share based on the aggregate arrival rate. CSFQ uses exponential averaging for both estimates and uses heuristics to fix possible discrepancies in the estimations. Finally, when packets leave an island, their labels are rewritten with updated rates. Theoretical bounds and detailed algorithms can be found in the paper.
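The two estimation steps can be sketched in a few lines of Python. The exponential-averaging form for the per-flow rate at the edge and the probabilistic drop rule at the core follow the paper’s equations as I remember them; the constants and the example numbers are illustrative only.

```python
import math, random

def update_rate(prev_rate, pkt_len, inter_arrival, K=0.1):
    """Edge-router flow-rate estimate via exponential averaging, in the spirit of CSFQ:
    new arrivals are weighted by exp(-T/K), where T is the packet inter-arrival time
    and K an averaging constant (values here are illustrative)."""
    w = math.exp(-inter_arrival / K)
    return (1 - w) * (pkt_len / inter_arrival) + w * prev_rate

def core_forward(label_rate, fair_share):
    """Core-router behaviour: drop probabilistically so a flow labelled with rate r
    gets through at roughly min(r, fair_share); only the label in the packet and the
    router's fair-share estimate are needed, no per-flow state."""
    drop_prob = max(0.0, 1.0 - fair_share / label_rate)
    return random.random() >= drop_prob   # True => forward the packet

# A flow labelled at 10 Mbps crossing a core link whose estimated fair share is 4 Mbps:
forwarded = sum(core_forward(10e6, 4e6) for _ in range(10_000))
print(f"forwarded ~{forwarded / 100:.0f}% of packets (expect about 40%)")
```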

Through simulations, CSFQ is shown to perform better than simplistic FIFO and RED, on par with FRED, and worse than the more complex and stateful DRR. However, the authors seem undecided about what to make of these outcomes!

In the end, the authors discuss problems with unfriendly/malicious flows, describe two possible approaches (allocation and identification) to handle them, and end up proposing a penalty scheme for unfriendliness that combines the allocation approach (first) with the identification approach (then).

Critique

This paper is a great read in terms of concept, delivery, clarity, and mostly honesty of the authors. However, it is not comforting to find that the authors are undecided about almost everything discussed in the paper!

The simulation section is rigorous and impressive. Even though they mention several times that they couldn’t include everything for lack of space, the authors end up providing a fairly comprehensive set of simulation results.

The use and derivation of the heuristics to adjust the estimates seem arbitrary. There is no clear reasoning about how the authors came up with those particular constants and percentages.

One of the assumptions, presented as an architectural consideration, is: “all packets in the same flow must follow the same path within the core of an island”. How valid is that?

Analysis and Simulation of a Fair Queueing Algorithm

A. Demers, S. Keshav, S. Shenker, “Analysis and Simulation of a Fair Queueing Algorithm,” Internetworking: Research and Experience, 1 (1990), pp. 3-26.

Summary

Congestion control in communication networks can be implemented at the source, where flow control algorithms vary the sending rate, or at the gateway through routing and queueing algorithms. However, implementations at the source must rely on the benevolence of the users; hence, they are susceptible to malicious behavior. The authors argue that queueing algorithms can be more effective in congestion control since they control how packets from different sources interact and, in turn, affect the collective behavior of flow control algorithms. They consider bandwidth, promptness, and buffer space as the three control axes, effective allocation of which can ensure efficient and fair congestion control.

This paper presents a modified Fair Queueing (FQ) algorithm, based on Nagle’s original proposal, with the aim of providing fair allocation of bandwidth, lower delay for benevolent sources, and isolation of malicious ones in heterogeneous networking environments. The major changes to Nagle’s algorithm are support for variable-length packets, control over the promptness/delay of packet dispatching, and dynamic packet scheduling instead of static round-robin ordering. An allocation is considered fair if it satisfies the max-min criteria:

  1. No user receives more than its request,
  2. No other allocation scheme satisfying condition 1 has higher minimum allocation, and
  3. Condition 2 remains recursively true as the minimal user is removed and the total resource is reduced accordingly.

Several possible definitions of the term user are provided: the source, the destination, the source-destination pair, etc. Eventually, the authors settle on the source-destination pair, citing a tradeoff between security and efficiency (without any clear explanation!).

The algorithm itself is a packet-level approximation of a bit-by-bit (fluid) round-robin transmission model, using the following rule: “whenever a packet finishes transmission, the next one selected is the one with the earliest finish time”. A preemptive version of the algorithm is also presented, which additionally supports a tolerance parameter that gives lower delay to users consuming less than their fair share of bandwidth. The packet dropping policy used in FQ is: “when the queue is full and a new packet arrives, the last packet from the user occupying the most buffer space is dropped”, which provides a small incentive for users to implement better flow control.
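A minimal Python sketch of the non-preemptive rule is shown below: each arriving packet is stamped with a virtual finish time computed from its flow’s previous finish time and the current round number, and the packet with the smallest finish time goes out next. The round-number bookkeeping here is deliberately crude (a single counter), so this is only an illustration of the idea, not the paper’s exact algorithm.

```python
import heapq

class FairQueue:
    """Packetized approximation of bit-by-bit round robin: each packet gets
    F = max(previous F of this flow, current round number R) + packet_length,
    and the packet with the smallest F is transmitted next."""
    def __init__(self):
        self.last_finish = {}        # per-flow (source-destination pair) state
        self.heap = []
        self.round = 0.0             # crude stand-in for the paper's R(t)

    def enqueue(self, flow, length):
        start = max(self.last_finish.get(flow, 0.0), self.round)
        finish = start + length
        self.last_finish[flow] = finish
        heapq.heappush(self.heap, (finish, flow, length))

    def dequeue(self):
        finish, flow, length = heapq.heappop(self.heap)
        self.round = finish          # advance the round number approximately
        return flow, length

fq = FairQueue()
for flow, length in [("A", 1500), ("B", 500), ("B", 500), ("A", 1500), ("B", 500)]:
    fq.enqueue(flow, length)
while fq.heap:
    print(fq.dequeue())              # B's small packets interleave with A's large ones
```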

Finally, through mathematical derivation and simulation, the authors establish the performance and fairness gains of the modified FQ working in conjunction with different congestion control schemes.

Thoughts

The use of the source-destination pair as the user is not properly justified in the paper. The authors do mention the possible security concern arising from it (a malicious source hogging resources by creating many source-destination pairs), but then ignore it completely.

Also, the memory requirements for keeping state (too much of it in the case of source-destination pairs), and the cost and deployability of FQ in gateways, are not discussed.

Congestion Avoidance and Control

V. Jacobson, M. Karels, “Congestion Avoidance and Control,” ACM SIGCOMM Conference, (August 1988). [PDF]

Summary

This paper argues that many of the congestion-related problems observed circa 1988 (on the LBL to UC Berkeley connections between 4.3 BSD machines) stem from wrong implementations of and bugs in the transport protocol, rather than from the protocol itself. The authors present examples of wrong interpretations of the window-based transport protocol and propose several algorithms (e.g., RTT variation estimation, exponential retransmit timer backoff, slow start, a more aggressive receiver ACK policy, and dynamic window sizing on congestion) for ensuring network stability based on the so-called packet conservation principle.

By packet conservation the authors mean that, for a connection in steady state, a new packet isn’t put into the network until an old packet leaves, i.e., the total number of packets in transit for a connection running at steady state with a full window of data is constant. They observe that congestion can occur only if this principle is violated, which can happen in one of three cases:

  1. The connection doesn’t get to the equilibrium, or
  2. A sender injects new packet before an old packet has exited, or
  3. The equilibrium can’t be reached because of resource limits along the path.

Based on this observation, the authors put forward measures to avoid, prevent, or recover from each of the three possible violations of the conservation principle.

Contributions

  • Slow start: The Internet transport protocol is self-clocking, meaning it adjusts to the connection speed based on the rate of received ACKs. When a connection starts, or restarts after a failure, it might never reach equilibrium since there are no ACKs in the first place. To solve this problem, the paper introduces the concept of a congestion window and presents the slow start algorithm, which smoothly ramps up to the connection speed, resulting in increased throughput.
  • Round-trip timing: Assuming the rest of the protocol implementation is correct, the authors argue that a sender might inject a new packet without an old one having left if its estimate of the round-trip time is incorrect. They therefore propose using the RTT variation when calculating the retransmission timeout interval, with the aim of reducing spurious retransmissions (a sketch of this estimator appears after this list).
  • Exponential retransmit timer backoff: The authors argue that, when the same packet must be retransmitted multiple times, only exponential backoff can avoid congestion in the general case.
  • Congestion avoidance: This paper assumes that packet loss in a network occurs mostly (with >99% probability) due to congestion. To counter congestion, there must be a control-feedback mechanism so that network elements can signal congestion and end hosts can act accordingly. The authors end up with (not surprisingly) a linear AIMD solution, where they use packet loss as the indication of congestion, without using explicit congestion indicator bits.
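Below is a small Python sketch of the round-trip timing idea: a smoothed RTT plus a smoothed mean deviation, combined into the retransmission timeout. The gains (1/8 and 1/4) and the deviation multiplier of 4 are the values commonly associated with this estimator; the paper itself works with scaled integer arithmetic (and originally a smaller multiplier), so treat this floating-point form as illustrative.

```python
def update_rtt_estimate(srtt, rttvar, measurement, g=0.125, h=0.25):
    """RTT estimation with mean deviation: keep a smoothed RTT and a smoothed mean
    deviation, and size the retransmit timeout from both so that normal RTT variance
    does not trigger spurious retransmissions."""
    err = measurement - srtt
    srtt = srtt + g * err                      # smoothed round-trip time
    rttvar = rttvar + h * (abs(err) - rttvar)  # smoothed mean deviation
    rto = srtt + 4 * rttvar                    # retransmission timeout
    return srtt, rttvar, rto

srtt, rttvar = 0.5, 0.1                        # seconds
for sample in [0.48, 0.52, 0.9, 0.47, 0.5]:    # one occasional delayed ACK
    srtt, rttvar, rto = update_rtt_estimate(srtt, rttvar, sample)
print(f"srtt={srtt:.3f}s rttvar={rttvar:.3f}s rto={rto:.3f}s")
```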

Comments

The results presented in the paper are oriented toward the wired Internet, which is understandable given when it was published. However, the applicability of these results to wireless and ad-hoc networks is questionable; e.g., congestion is NOT the only source of packet loss in a wireless network.

One interesting prediction in this paper is that as the Internet grows bigger, higher-order control functions will be needed to avoid congestion. Recent congestion avoidance algorithms do use cubic control functions!

Overall, this paper is well argued, with lots of comments/annotations in footnotes that aid understanding of what is going on. The authors also synthesize the whole solution by borrowing and combining interesting results from different proposals into a holistic design, which is itself an art of system building.

Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks

D-M Chiu, R. Jain, “Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks,” Computer Networks and ISDN Systems, 17 (1989), pp 1-14. [PDF]

Summary

“Low delay” and “high throughput”, the two main operating objectives of a computer network, are frequently hampered by congestion. Once congestion sets in, it takes a considerable amount of time and network resources to get back to equilibrium (in this case, “non-congested state” is probably the more appropriate term). Congestion avoidance is a proactive measure that strives to avoid congestion just before it occurs, thus improving network utilization and performance. This paper presents a mathematical analysis of several existing linear increase/decrease congestion avoidance algorithms and compares them using multiple performance metrics in a simplified network model.

Model

The authors consider variations of decentralized increase/decrease congestion avoidance algorithms, each using a “binary feedback scheme”. In such a feedback control mechanism, a network resource sends back a one to the controllers when it is overloaded and a zero otherwise. The basic assumptions of their model are:

  1. Users sharing the same bottleneck will receive the same feedback,
  2. Feedback and control loops for all users are synchronous (i.e., they take actions in lockstep), and
  3. There is only one bottleneck resource in the network.

Once users/end hosts receive feedback from the congested network resource, they take measures based on the following criteria:

  1. Efficiency: Measured in terms of the utilization of resource; both overload and underload are undesirable.
  2. Fairness: Each user should get an equal (proportional!) share of the bottleneck. Properties of a suitable fairness function can be found in the paper.
  3. Distributedness: The control scheme should require only a minimal amount of feedback from the resource under consideration; no central processing is required.
  4. Convergence: The control scheme should converge within a reasonable level of tolerance.

Results

Through mathematical reasoning and geometric representation, the authors compare the AIAD, AIMD, MIAD, and MIMD (A: additive, I: increase, M: multiplicative, D: decrease) control schemes and prove that only AIMD optimally converges to both efficiency and fairness in a distributed manner with minimal feedback. They also derive the operating region and control parameters for the scheme.
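A few lines of Python make the convergence result easy to see under the paper’s synchronous-feedback assumption: with additive increase and multiplicative decrease, two users starting from very different rates end up with (nearly) equal shares of the bottleneck, oscillating around the efficient operating point. The increase amount, decrease factor, and capacity below are arbitrary illustration values.

```python
def aimd_step(rates, capacity, a_inc=1.0, b_dec=0.5):
    """One synchronous control step of the linear AIMD scheme: if the shared bottleneck
    is overloaded every user multiplies its rate by b_dec, otherwise every user adds
    a_inc. Multiplicative decrease shrinks the gap between users; additive increase
    preserves it, so the system drifts toward fairness while tracking efficiency."""
    overloaded = sum(rates) > capacity      # the binary feedback bit
    if overloaded:
        return [r * b_dec for r in rates]
    return [r + a_inc for r in rates]

# Two users starting far apart sharing a 100-unit bottleneck.
rates = [5.0, 80.0]
for _ in range(200):
    rates = aimd_step(rates, capacity=100.0)
print([round(r, 1) for r in rates])         # the two rates end up nearly equal
```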

Non-linear controls are not considered in this paper, for the sake of practicality and to keep the model tractable. While the authors agree that non-linear control offers more flexibility, they argue that the greater sensitivity of non-linear schemes to changing conditions can result in instability.

Critique

At the end of the paper, the authors themselves raise several questions about scenarios that can affect AIMD in multiple ways; e.g., how does delayed feedback affect the control, what is the marginal utility of additional bits of feedback, is it worthwhile to guess the current number of users, and what is the impact of asynchronous operation? However, they leave all of them as future work! I wonder what the correct answers (if any) to these questions are.

The paper itself is a wonderful read, with fairly easy, understandable math and explanatory geometric representations. However, it is not clear how the authors came up with the massive logarithmic expression while calculating the conditions for optimal convergence to efficiency.

As far as I understand, CUBIC TCP for high-speed networks uses a non-linear (cubic) control function for its congestion avoidance mechanism. Discussion/info on other variations of non-linear models is very much welcome.