
XORs in the Air: Practical Wireless Network Coding

S. Katti, H. Rahul, W. Hu, D. Katabi, M. Medard, J. Crowcroft, “XORs in the Air: Practical Wireless Network Coding,” ACM SIGCOMM Conference, (September 2006). [PDF]

Summary

This paper presents COPE, a new forwarding architecture for wireless mesh networks that significantly improves overall throughput by exploiting the shared nature of the wireless medium using network coding. All nodes in COPE communicate using broadcast. Each node stores packets overheard from its neighbors in a buffer pool for a short amount of time and tells its neighbors which packets it holds using a bitmap. When a node transmits, it uses its knowledge of its neighbors' pools (when available) or an intelligent guess to XOR multiple packets together and send them as a single packet. The receiving node at the next hop decodes the packet by XORing it with the packets in its own pool. Packets are coded together only if they can be decoded at the next hop.
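
To make the encode/decode step concrete, here is a minimal sketch (packet contents and neighbor state are invented; this is not the authors' implementation) of a relay XORing two packets destined for different next hops, and of a receiver recovering its packet using one it already overheard:

    def xor_bytes(a: bytes, b: bytes) -> bytes:
        """XOR two equal-length payloads byte by byte."""
        return bytes(x ^ y for x, y in zip(a, b))

    # The relay has packet p1 queued for Alice and p2 queued for Bob.
    p1 = b"payload for Alice..."
    p2 = b"payload for Bob....."   # COPE prefers XORing packets of equal length

    # Reception reports say Alice already overheard p2 and Bob overheard p1,
    # so one coded broadcast serves both next hops.
    coded = xor_bytes(p1, p2)

    assert xor_bytes(coded, p2) == p1   # Alice re-XORs with p2 to recover p1
    assert xor_bytes(coded, p1) == p2   # Bob re-XORs with p1 to recover p2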

The authors use examples and prove theorems to characterize the Coding gain and the Coding+MAC gain (the gain from resource allocation becoming less skewed toward hub-type nodes) of COPE over traditional forwarding. They present three results:

  1. In the absence of opportunistic listening, COPE’s maximum coding gain is 2, and it is achievable;
  2. In the absence of opportunistic listening, COPE’s maximum Coding+MAC gain is 2, and it is achievable; and
  3. In the presence of opportunistic listening, COPE’s maximum coding gain is unbounded.

There are several implementation-related gotchas:

  • Packets are never delayed in the hope of a future coding opportunity; this avoids unnecessary queue buildup.
  • COPE prefers XORing packets of the same length and keeps multiple virtual queues per neighbor, one for each packet size!
  • It also tries to avoid TCP packet reordering and uses hop-by-hop ACKs and retransmissions.
  • When deciding whether to XOR one more packet into the mix, COPE estimates the probability that each next hop can decode (using ETX/ETT-style delivery probabilities); if that probability drops below a threshold (0.8 in the paper), the packet is not added. A sketch of this rule appears after this list.
  • Moreover, COPE implements a pseudo-broadcast consisting of multiple unicasts to work around the poor reliability and missing backoff of default 802.11 broadcast.
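
A minimal sketch of that XOR-or-not rule follows; the data structures and the per-neighbor probability lookup are hypothetical, and only the 0.8 threshold comes from the paper:

    THRESHOLD = 0.8  # decoding-probability threshold used in the paper

    def can_add(candidate, coded_set, have_prob):
        """Return True if XORing `candidate` into `coded_set` still leaves
        every intended next hop likely enough to decode its packet.
        have_prob[(node, pkt_id)] is the probability that `node` already has
        that packet (1.0 if a reception report said so, otherwise a guess
        derived from link delivery probabilities)."""
        new_set = coded_set + [candidate]
        for pkt in new_set:
            nexthop = pkt["nexthop"]
            p_decode = 1.0
            for other in new_set:
                if other is not pkt:
                    # The next hop must hold every *other* packet in the XOR.
                    p_decode *= have_prob.get((nexthop, other["id"]), 0.0)
            if p_decode < THRESHOLD:
                return False
        return True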

COPE was evaluated on a 20-node 802.11a wireless testbed running at a bit rate of 6 Mbps. The main results are:

  • For toy topologies, TCP flows get a smaller boost (no MAC gain) than UDP flows.
  • In a congested wireless medium, UDP flows get a 3-4x boost in overall throughput, approaching the Coding+MAC gain. With increasing load, however, the UDP gain decreases and appears to stabilize around 2x.
  • TCP, on the other hand, suffers from hidden-terminal-induced collisions and the resulting backoff, and cannot gain much from COPE. Without any hidden terminals, the maximum goodput improves by 38%.
  • Almost half the coded packets are XORs of only two packets, and the distribution tails off at a maximum of five packets.
  • Finally, if a COPE-based access network is connected to the Internet through a gateway, the throughput gain varies from 5% to 70% depending on the uplink-to-downlink traffic ratio.

Comments

This is an overall excellent paper that combines theory with practice and backs up its claims with a wide range of experimental evaluation. The paper is also very well written and easy to follow. However, several issues could use more explanation. COPE requires as many virtual queues as there are packet sizes (which the authors say is essentially two for Internet traffic) for each neighbor within hearing distance, and it has to look through all of them to find the optimal XOR strategy. The paper does not evaluate how large these queues grow or how long it takes to pick an optimal XORing strategy.

The authors say that the pseudo-broadcast mechanism ensures every node gets more than one chance to overhear. However, they neither mention nor evaluate that this might also result in a higher number of collisions and link-layer retransmissions.

ExOR: Opportunistic Multi-Hop Routing for Wireless Networks

S. Biswas, R. Morris, “ExOR: Opportunistic Multi-Hop Routing for Wireless Networks,” ACM SIGCOMM Conference, (August 2005). [PDF]

Summary

ExOR is an integrated routing and MAC technique that opportunistically exploits the broadcast nature of the wireless medium. Unlike traditional routing protocols, where the best sequence of nodes from a source to a destination is decided before packets are routed along that path, ExOR defers the next-hop decision until it knows which nodes actually received a packet, and lets the receiver closest to the destination become the next hop and retransmit the packet toward the destination. As a result, ExOR can try multiple long but lossy links concurrently, resulting in high expected progress per transmission.

Packets are sent in batches. The source node includes in each packet a list of candidate forwarders prioritized by closeness to the destination (ETX computed with only the forward delivery probability is used in this paper) and a batch map containing the sender's best guess of the highest-priority node to have received each packet. Receiving nodes buffer packets until the end of the batch transmission. Then, in priority order, each receiver broadcasts its received packets with updated batch maps. If a packet is overheard by a node farther along toward the destination than the sender guessed, that node replaces the original guess with its own id as the highest-priority receiver in the batch map. Through this gossip-like protocol, each node ends up with more or less the same batch map. Every forwarder transmits only those packets that have not been acknowledged by higher-priority nodes. This cycle continues until 90% of the packets have reached the destination; the remainder are delivered using a traditional routing protocol.
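
A toy sketch of the batch-map bookkeeping described above (node ids, priorities, and data structures are invented for illustration; this is not the authors' code):

    # Hypothetical node ids; lower index means higher priority (closer to dst).
    forwarders = ["dst", "n1", "n2", "n3", "src"]
    prio = {n: i for i, n in enumerate(forwarders)}

    def update_batch_map(batch_map, pkt_id, receiver):
        """A node claims a packet in the batch map only if it has higher
        priority than the node currently believed to hold it."""
        current = batch_map.get(pkt_id)
        if current is None or prio[receiver] < prio[current]:
            batch_map[pkt_id] = receiver
        return batch_map

    def packets_to_forward(batch_map, me, my_packets):
        """Forward only packets not yet claimed by a higher-priority node."""
        return [p for p in my_packets
                if prio.get(batch_map.get(p, me), prio[me]) >= prio[me]]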

ExOR was evaluated on a 32-node testbed, and the authors claim that it outperformed traditional routing for almost all node pairs, with an average 2x end-to-end throughput gain. The gain is more prominent on longer, lower-throughput routes (where there are more opportunities to exploit overhearing) than on shorter, higher-throughput routes, but ExOR leads traditional routing throughout the spectrum. The authors believe that the best batch size lies somewhere between 10 and 100 packets, but they provide no specific value or formula.

Two of the assumptions behind ExOR are that reception at different nodes is independent and that delivery probability falls off gradually with distance. Through simulation, the authors show that on longer routes the dependent-loss case lags the independent-loss case more and more; however, the maximum lag found in the simulation is 20%, which led the authors to postulate that loss independence is not really a requirement but can be exploited when present.

Critique

The concept is simple and straightforward, but the paper somehow managed to make it more complex than it really is. Figure 6 and its description in Section 3.8 only add to the confusion. The massive plots with minute detail make things worse (LaTeX placed them out of order, which made reading the paper all the more annoying). The evaluation also does not reveal much beyond showing that ExOR does increase throughput in their testbed. In addition, ExOR interacts badly with TCP and resorts to split connections, breaking end-to-end TCP semantics and possibly introducing overhead.

A Performance Comparison of Multi-Hop Wireless Ad Hoc Network Routing Protocols

J. Broch, D. Maltz, D. Johnson, Y-C Hu, J. Jetcheva, “A Performance Comparison of Multi-Hop Wireless Ad Hoc Network Routing Protocols,” ACM Mobicom Conference, (October 1998). [PDF]

Summary

This paper presents packet-level simulation results for four multi-hop wireless ad hoc network routing protocols (DSDV, TORA, DSR, and AODV) in networks of 50 mobile nodes. The authors extended the ns-2 network simulator to accurately model the MAC and physical layer behavior of the IEEE 802.11 standard, added a realistic wireless transmission channel model, and added node mobility to the simulator.

Protocols Considered

Destination-Sequenced Distance Vector (DSDV) is a hop-by-hop distance vector routing protocol in which each node maintains a routing table listing the "next hop" for each reachable destination and periodically broadcasts the table. The sequence number of an entry determines its freshness (the higher the better), and a better metric value breaks ties. In addition to periodic updates, a route update is triggered whenever a new sequence number is heard. Two variations of DSDV differ in how long they wait before broadcasting a triggered update; this paper uses DSDV-SQ, which triggers an update immediately.
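
The freshness rule is small enough to sketch directly from this description (a sketch, not code from the paper or the simulator):

    def should_replace(old, new):
        """DSDV table-update rule: prefer the entry with the higher (fresher)
        sequence number; on a tie, prefer the better (smaller) metric.
        Entries look like {"seq": 42, "metric": 3, "nexthop": "B"}."""
        if new["seq"] > old["seq"]:
            return True
        return new["seq"] == old["seq"] and new["metric"] < old["metric"]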

Temporally-Ordered Routing Algorithm (TORA) discovers routes on demand using a distributed link-reversal algorithm. Whenever a node needs a route to a particular destination, it broadcasts a QUERY packet. The recipient of the QUERY packet broadcasts an UPDATE packet, and as the UPDATE propagates through the network, a directed path toward the destination is created. TORA is implemented on top of IMEP (Internet MANET Encapsulation Protocol) for reliable, in-order delivery of control packets. Periodic BEACON and HELLO messages detect link status.

Dynamic Source Routing (DSR) is another on-demand routing protocol; it uses source routing rather than hop-by-hop routing, so each packet carries a complete, ordered list of the nodes it must traverse to reach the destination. The protocol has two mechanisms: Route Discovery finds the list of nodes needed to reach a destination after the sender broadcasts a ROUTE-REQUEST packet (route caching is used to improve performance), and Route Maintenance detects whether a source route has been invalidated by topology changes.

Ad Hoc On-Demand Distance Vector (AODV) combines the on-demand Route Discovery and Route Maintenance mechanisms of DSR with the hop-by-hop routing, sequence numbers, and periodic beacons of DSDV. Instead of using source routing, hop-by-hop state is created at intermediate nodes as the reply to a Route Discovery propagates back.

All in all, these four protocols differ in several design choices: on-demand vs. constant route computation, hop-by-hop vs. source routing, and the use of sequence numbers vs. "heights" vs. explicit node lists in source routes.

Evaluation

This paper considers three metrics to compare the protocols: packet delivery ratio (PDR), routing overhead, and path optimality (hop count). Major findings/confirmations are:

  • With increasing mobility, the PDR of DSDV-SQ drops significantly more than that of the other protocols. At lower mobility, all the protocols have similar PDRs.
  • For DSR and AODV-LL, PDR is independent of offered traffic load. DSDV-SQ fails to converge beyond a certain threshold; most of its losses are caused by its storing only a single route, which may turn out to be broken. TORA, on the other hand, performs well at lower loads but fails miserably once load goes up, due to its higher overhead.
  • DSDV-SQ maintains a constant routing overhead irrespective of mobility and load. DSR and AODV-LL react similarly to mobility and load, but AODV-LL has more overhead because it lacks the optimizations used in DSR and does not use source routing. TORA has high overhead even at small loads and completely loses applicability as load increases, because IMEP suffers congestion collapse.
  • Lower node movement speeds do not affect the overhead of DSDV-SQ (it performs periodic broadcasts anyway). In general, though, lower speeds mellow the effect of mobility on all the protocols in terms of both PDR and overhead.

Comments

This paper presents a good summary of different protocols for multi-hop routing in wireless networks and a reasonable evaluation over a very large parameter space. Even though the simulation models are more sophisticated than their contemporaries, several important factors in modeling wireless links are ignored. And using hop count as the metric for path optimality is not the best choice. Overall, this was a good paper in its time, but it should by now be superseded by experimental evaluation or more accurate simulation models.

A High Throughput Path Metric for Multi-Hop Wireless Routing

D. De Couto, D. Aguayo, J. Bicket, R. Morris, “A High Throughput Path Metric for Multi-Hop Wireless Routing,” ACM Mobicom Conference, (September 2003). [PDF]

Summary

This paper presents ETX (expected transmission count), a new metric to find high-throughput paths on multi-hop wireless networks (802.11b), which minimizes the expected total number of packet transmissions (including retransmissions) required to successfully deliver a packet to its ultimate destination. Contrary to the minimum hop-count (MHC) metric, ETX incorporates link loss ratios, asymmetry in loss ratios between the two directions, and interference among successive links. Measurements in a 29-node testbed show that ETX improves average throughput of multi-hop routes by up to a factor of two over MHC.

The authors motivate ETX design by pointing out several flaws in MHC:

  • MHC-based protocols implicitly assume that links either work (if probe packets go through) or do not work at all. However, in wireless networks there can be links with intermediate loss ratios that are good enough to deliver control (probe) packets but too lossy for data packets. Moreover, many links have asymmetric delivery ratios, which MHC cannot capture.
  • By minimizing hop count, MHC protocols maximize the distance covered by each hop, which decreases signal strength and increases loss ratio. In addition, in a dense network many routes can have the same MHC, and arbitrarily selecting one of them can significantly affect total transmission time.
  • Experimental results show that MHC performs well whenever the shortest route is also the fastest route, especially when there is a one-hop link with low loss ratio. With increasing number of hops, MHC throughput decreases almost linearly.

Somewhat better approaches than MHC, such as the product of per-link delivery ratios (PPLDR), the bottleneck loss ratio (BLR), and end-to-end delay (E2ED), also fail to account for the wide range of loss ratios, asymmetry in loss ratios, and interference between successive hops. PPLDR and BLR fail to account for inter-hop interference (e.g., they rate a perfect two-hop route above a one-hop link with a 1% loss ratio), and E2ED introduces route oscillation due to its high sensitivity to network load (as queues grow and shrink, E2ED can vary widely).

ETX measures the forward and reverse delivery ratios of a link and uses the inverse of their product as the expected number of transmissions on that link. The ETX of a route is the sum of the individual ETX values of its links. As a result, ETX accounts for asymmetry (by using both forward and reverse loss ratios), interference (by summing ETX along the path), and link loss ratios (by definition). Since delivery ratios are directly related to throughput, ETX can predict throughput better than MHC and the other approaches.
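
The metric itself is compact enough to sketch directly from this description (the example delivery ratios are mine, not from the paper):

    def link_etx(df: float, dr: float) -> float:
        """Expected transmissions on one link, from the forward and reverse
        delivery ratios measured with broadcast probes."""
        return 1.0 / (df * dr)

    def route_etx(links):
        """Route ETX is simply the sum of its links' ETX values."""
        return sum(link_etx(df, dr) for df, dr in links)

    # Two perfect hops (ETX 2.0) beat one hop with 50% delivery in each
    # direction (ETX 4.0), even though the latter has the smaller hop count.
    print(route_etx([(1.0, 1.0), (1.0, 1.0)]))  # 2.0
    print(route_etx([(0.5, 0.5)]))              # 4.0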

The rest of the paper provides the gory details of the modifications needed to make DSDV and DSR work with ETX. Apparently no one, not even the people who created DSDV and DSR, is 100% sure about various implementation details; the authors did whatever they saw fit to make them work.

In the evaluation section, the authors match the ETX-modified DSDV and DSR against their MHC counterparts and convincingly find ETX winning every round!

Critique

ETX is much better than MHC, but that was fairly obvious from the motivation section. A better evaluation would have been against PPLDR or E2ED. Incidentally, the same group later came up with an even better metric, ETT (estimated transmission time).

ETX performed poorly for large packet sizes. The authors argue that using small probe packets over-estimates link quality, and this becomes visible at larger packet sizes. Why they did not use larger probe packets (possibly the same size as data packets) and evaluate that setting is not clear.

It is also mentioned somewhere in the paper that ETX tends to reduce spectrum use, but that claim is never made clear.

Architecture and Evaluation of an Unplanned 802.11b Mesh Network

J. Bicket, D. Aguayo, S. Biswas, R. Morris, “Architecture and Evaluation of an Unplanned 802.11b Mesh Network,” ACM Mobicom Conference, (September 2005). [PDF]

Summary

Planning and deploying a large wireless mesh network in an urban area is complex, time-consuming, and expensive. This paper argues that an unplanned wireless mesh network built from commodity hardware can do nearly as well. The authors present their experience with such an unplanned deployment, Roofnet, and evaluate multiple characteristics of its architecture.

Primary design decisions for Roofnet are as follows:

  1. Unconstrained node placement: No planned topology or coverage. 37 nodes are spread pseudo-randomly over a four-square-kilometer urban area.
  2. Omni-directional antennas: No directional, high-fidelity point-to-point links. Nodes should be able to connect to any other node within reach.
  3. Multi-hop routing: No base-station- or access-point-centric single-hop connectivity. Nodes should find their way to the Internet over multiple hops through the network.
  4. Routing optimized for throughput: No optimization for rapid route repair as in a mobile network. Routes are optimized for throughput in a slowly changing network.

Architecture

Roofnet is deployed in a densely populated urban area where line-of-sight propagation between nodes is often obstructed. Most Roofnet nodes sit on top of three- or four-story buildings, with a few on taller ones. Each Roofnet node consists of a PC (running Linux, routing software built on Click, a DHCP server, and a web server), an 802.11b card (with RTS/CTS disabled), and a roof-mounted omni-directional antenna.

For Roofnet nodes to be self-configuring, several challenges must be addressed: address allocation, finding a gateway to the Internet, and choosing a good route to that gateway. Roofnet uses an internal addressing scheme, and each node NATs traffic from the local network behind it. Gateways are special Roofnet nodes that have a wired connection to the Internet and provide NAT service for TCP flows originating inside Roofnet.

Roofnet's routing protocol, Srcr, tries to find the highest-throughput route between any pair of Roofnet nodes. Each Srcr node maintains a partial database of link metrics between other pairs of nodes and runs Dijkstra's algorithm on that database. Nodes learn fresh link metrics by pushing, pulling (through flooding), or overhearing. Srcr chooses routes using the estimated transmission time (ETT) metric, which predicts the total time it would take to send a data packet along a route, taking into account each link's highest-throughput transmission bit-rate and its delivery probability at that bit-rate. Roofnet also has its own bit-rate measurement and selection algorithm, SampleRate, to choose among the 802.11b transmit bit-rates.
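
A sketch of one plausible reading of that metric follows; the packet size, the exact formula, and the example link values are assumptions, and only the idea of combining bit-rate with delivery probability comes from the description above:

    PKT_BITS = 1500 * 8  # assumed data-packet size in bits

    def link_ett(bitrate_bps: float, delivery_prob: float) -> float:
        """Airtime at the link's best bit-rate, inflated by the expected
        number of attempts needed at that bit-rate."""
        return (PKT_BITS / bitrate_bps) / delivery_prob

    def route_ett(links):
        """Srcr-style route cost: sum of per-link estimated transmission times."""
        return sum(link_ett(b, p) for b, p in links)

    # One 5.5 Mbps link with 90% delivery vs. two perfect 2 Mbps hops.
    print(route_ett([(5.5e6, 0.9)]))            # ~0.0024 s
    print(route_ett([(2e6, 1.0), (2e6, 1.0)]))  # 0.012 s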

Evaluation Results

  • Averaged over all pairs, Roofnet throughput is 627 kbps and latency is 39 ms. The authors also find that Roofnet nodes most likely operate at 5.5 Mbps.
  • Srcr mostly uses shorter links, almost all of which are faster than 2 Mbps, and it ignores most links slower than that.
  • Through simulation, the authors find that the network approaches all-pairs connectivity once the total number of nodes crosses 20, and that average throughput increases with density. They postulate that higher density provides more short links with higher transmission rates, which increases throughput.
  • In terms of robustness, the authors find that a few links (around a dozen) carry half the throughput, and that two nodes carry 43% of the total.
  • A comparison against a single-hop alternative shows that the multi-hop architecture can operate with fewer gateways, since nodes can take multiple hops to reach one (duh!). In the single-hop scenario, the minimum number of gateways for full Internet connectivity turns out to be five.
  • The distribution of throughput is dictated by hop count, with throughput decreasing significantly faster as hop count increases. The authors postulate that this is caused by packet loss from inter-hop interference. However, they also find that the RTS/CTS scheme, which is supposed to avoid such collisions, does not improve performance.

Critique

This paper performs an elaborate measurement study to propose unplanned mesh deployment as a serious alternative to engineered network deployment in an urban area. However elaborate, several things require further explanation. First of all, the Roofnet 'layer' and the Roofnet addressing scheme could use better explanation; it is not clear why a separate layer was needed or how it is used.

Secondly, a significant number of evaluations rely on a mathematical formula for end-to-end throughput which, the authors admit, has estimation error. However, there is no evaluation of how large that estimation mismatch is. If the introduced error is high, all of the derived results are affected. Since the total number of nodes is very low, even small changes in the number of nodes required for a given property are significant.

In addition, there is no evaluation of how SampleRate works, what it selected in practice, and what was used in the simulations. One might assume that the authors used 5.5 Mbps in simulation based on their findings from the basic performance analysis; a little clarification would have helped.

Finally, based on their results (e.g., two nodes carrying 43% of the throughput, a dozen links carrying 50%, very few links being heavily used), it seems plausible that a spanning tree formed. It would have been nice to know the properties of such a formation (even if only on pseudo-random graphs).

Modeling Wireless Links for Transport Protocols

Andrei Gurtov, Sally Floyd, “Modeling Wireless Links for Transport Protocols,” ACM SIGCOMM Computer Communications Review, Volume 34, Number 2, (April 2004). [PDF]

Summary

Intrinsic characteristics of wireless links (e.g., variable bandwidth, bit corruption, channel allocation delays, and asymmetry) significantly affect performance of transport protocols. In addition, different types of wireless links show noticeable variance in different characteristics. This paper summarizes the authors’ experiences in creating realistic simulation models for three different types of wireless links (cellular, WLAN, and satellite links) to evaluate the effect of link level mechanisms on end-to-end transport protocols.

The authors point out that most existing models are either not realistic enough to capture the characteristics of different types of wireless links; or realistic but unable to cover the large parameter space required to understand those characteristics; or so detailed that they obscure the underlying dynamics; or lacking in reproducibility. Their objective in this paper is to strike a balance between realism, generality, and detail.

This paper discusses six different link characteristics to consider when simulating wireless links and provides a plausible model for each:

  • Error losses and corruption: drop packets according to a per-packet, per-bit, or time-based probability.
  • Delay variation: suspend data transmission on the simulated link.
  • Packet reordering: either swap two packets in a queue, or delay one packet and let others pass.
  • On-demand resource allocation: add extra delay when a packet arrives at a queue that has been empty longer than the channel hold time.
  • Bandwidth variation: change the bandwidth of the link during the simulation.
  • Asymmetry in bandwidth and latency: configure the up- and down-links with different values.

In addition, the paper discusses modeling of queue management and the effects of mobility. For cellular and current WLAN links, queue management can be modeled with a Drop-Tail buffer of configurable maximum size in packets; for satellite links and for future cellular and WLAN links, RED is advised as more effective. Modeling mobility requires adequate definitions of the underlying mechanisms and depends on the specific scenario (intersystem vs. intrasystem).
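
As a tiny illustration of the error-loss entry in the list above, per-packet error losses can be modeled by dropping each packet independently with a fixed probability (a sketch, not the authors' ns-2 code):

    import random

    def lossy_link(packets, per_packet_loss=0.01, seed=1):
        """Drop each packet independently with a fixed probability."""
        rng = random.Random(seed)   # fixed seed keeps the run reproducible
        return [p for p in packets if rng.random() >= per_packet_loss]

    delivered = lossy_link(range(10_000))
    print(len(delivered))  # roughly 9,900 packets survive a 1% loss rate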

Comments

This paper presents a good survey of the issues to take care of while modeling wireless links of different types and proposes plausible solutions for each. The authors take no side on whether to change the link layer or the transport layer; rather, they propose combining the two based on suitability.

A Comparison of Mechanisms for Improving TCP Performance over Wireless Links

H. Balakrishnan, V. Padmanabhan, S. Seshan, R. H. Katz, “A Comparison of Mechanisms for Improving TCP Performance over Wireless Links,” IEEE/ACM Transactions on Networking, (December 1997). [PDF]

Summary

Unlike wired networks, congestion is not the only source of packet loss in wireless networks. Networks with wireless and lossy links also suffer from significant losses due to bit errors and handoffs. This paper performs an in-depth examination of the applicability of different mechanisms to improve TCP performance over wireless links in LAN and WAN environments.

The authors observe that there are two different approaches to improving TCP performance in such lossy systems:

  • Hide non-congestion-related losses from the TCP sender, so that no changes are required in the existing TCP implementations. In this approach, transport layer remains unaware of the losses due to link characteristics.
  • Make senders aware of lossy links so that senders do not invoke congestion control algorithms when losses are due to link characteristics. However, differentiating different types of losses is a hard problem in both cases.
  • In addition, a hybrid of the two (wireless-aware transport layer coexisting with link layer schemes) can also be used.

This paper classifies the many schemes into three basic groups based on their underlying philosophy:

  1. End-to-end proposals: These protocols try to handle losses through Selective ACKs (SACKs) and Explicit Loss Notification (ELN). Two types of SACK-related proposals are considered: standard SACKs, which carry ACKs for up to three recent non-contiguous blocks of data, and SMART, whose ACKs contain the cumulative ACK plus the sequence number of the packet that caused the receiver to generate the ACK.
  2. Split-connection proposals: Here the end-to-end connection is broken in two, one wireless (to the base station) and one wired (from the base station to the other end). Split-connection protocols suffer from several problems:
    • The wired part is bottlenecked by the wireless part;
    • Packets are processed twice at the base station;
    • End-to-end TCP semantics are broken;
    • A large amount of state must be kept at the intermediate point.
  3. Link-layer proposals: These protocols hide link-related losses from the transport layer using local retransmissions and forward error correction (FEC). TCP-aware link-layer variants are also considered, which avoid timer mismatches between the two layers and keep TCP's duplicate ACKs from triggering end-to-end retransmissions.

From the experiments, this paper provides the following conclusions:

  • A reliable link-layer protocol that is also TCP-aware (shielding the sender from duplicate ACKs) gives 10-30% higher throughput than a TCP-unaware link-layer protocol. In general, the TCP-aware link-layer protocol with SMART-style SACKs turned out to be the best mechanism.
  • Split-connection-based proposals cannot match the TCP-aware link-layer combination.
  • SACK-based schemes are very useful in the presence of lossy links, especially for bursty losses.
  • End-to-end schemes, while not as good as the better options, are still worth considering since they require no changes at intermediate nodes. The ELN scheme improves throughput by more than a factor of two over TCP Reno.

Critique

The authors have done a great job of comparing a large number of algorithms, covering the full spectrum of possible solutions for improving TCP performance over wireless links. They also point out several issues with SACKs and ELN in the general case. One thing seems missing, though: the paper shows the advantages of using SACKs and ELN separately, but what about their performance together? Since both were implemented, why they did not test the combination (even though they mention they ought to) remains a question mark.

MACAW: A Media Access Protocol for Wireless LANs

V. Bharghavan, A. Demers, S. Shenker, L. Zhang, “MACAW: A Media Access Protocol for Wireless LANs,” ACM SIGCOMM Conference, (August 1994). [PDF]

Summary

The medium in a wireless network is a shared and scarce resource; as a result, controlling access to this shared medium to maximize performance is one of the key issues in the wireless networking literature. This paper presents a medium access protocol, MACAW, for single-channel wireless LANs, built upon the preexisting MACA (Multiple Access, Collision Avoidance) protocol with the aim of increasing performance and fairness among contending wireless nodes. The design of MACAW is based on four observations:

  1. Relevant contention occurs at the receiver; sensing carrier at the sender (as in CSMA) is inappropriate
  2. Congestion is location dependent
  3. For fair allocation, collision (congestion) information must be shared among devices
  4. Information related to contention period must be synchronized among devices to promote fair contention

In this work, the authors also assume that there is neither interference nor capture (one signal overpowering another): a station successfully receives a packet if and only if there is exactly one active transmitter within its range. In addition, they assume that communication is symmetric.

MACA Briefing

MACA uses two types of short, fixed-length (32-byte) signaling packets: RTS (Request-to-Send) and CTS (Clear-to-Send). When a station wants to send something, it first sends an RTS. If the receiver is free, it replies with a CTS. Upon receiving the CTS, the sender immediately sends its data. Any station overhearing an RTS waits long enough for the CTS to pass through; any station overhearing a CTS waits for the length of the data transmission (the data length is carried in the signaling headers). As a result, the hidden-terminal problem is completely avoided and the exposed-terminal problem is simplified. If the sender does not receive a CTS within some period, it times out and schedules a retransmission using binary exponential backoff (BEB).

MACAW

MACAW design is based on the observations mentioned before as well as the limitations of MACA. A brief summary follows:

  • Backoff algorithm: MACAW replaces BEB with MILD (multiplicative increase, linear decrease) so that the backoff interval grows more gently (1.5x instead of 2x) and shrinks slowly (decreasing linearly rather than resetting to the minimum). To enable better congestion detection, MACAW shares backoff values among stations by carrying them in packet headers. A sketch contrasting the two backoff rules appears after this list.
  • Multiple stream model: MACAW keeps a separate queue for each stream in each node for increased fairness, and each queue runs its own backoff algorithm. However, all stations attempting to communicate with the same receiver should use the same backoff value.
  • Basic exchange: MACAW replaces RTS-CTS-DATA with RTS-CTS-DS-DATA-ACK, with the following extensions:
    1. ACK: An ACK at the end allows errors to be recovered at the link layer, which is much faster than transport-layer recovery. If an ACK is lost, the next RTS can elicit another ACK for the previous transmission.
    2. DS: This packet completes a three-way handshake between sender and receiver (similar in spirit to TCP's) so that everyone within hearing distance of the two stations knows that a data transmission is about to happen. Without the DS packet, stations vying for the shared medium cannot compete properly, and one can be starved because it never learns about the contention period. In short, DS enables synchronization.
    3. RRTS: An RRTS is essentially a proxy RTS, sent on behalf of a sender that is too far away to contend for the slot itself. However, there is one scenario in which even RRTS cannot guarantee fair contention.
    4. Multicast: Multicast is handled by sending data right after the RTS packet, without waiting for a CTS. It suffers from the same problems as CSMA, but the authors leave it as an open challenge.
  • Evaluation: The paper presents a simulation-based evaluation of MACAW, showing that MACAW is fairer and achieves higher throughput than MACA.
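
A minimal sketch of the two backoff rules referenced in the first bullet above; the bounds and the decrement step are assumptions for illustration, while the 1.5x factor comes from the summary:

    BO_MIN, BO_MAX = 2, 64   # assumed backoff bounds (in slots)

    def beb(backoff, collided):
        """Binary exponential backoff: double on collision, reset on success."""
        return min(backoff * 2, BO_MAX) if collided else BO_MIN

    def mild(backoff, collided):
        """MACAW's MILD: multiply by 1.5 on collision, step down on success."""
        return min(int(backoff * 1.5), BO_MAX) if collided else max(backoff - 1, BO_MIN)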

Comments

MACAW proposes significant improvements over MACA in terms of performance and fairness. But the authors leave several issues unresolved: the special RRTS scenario, multicast, the impact of mobility, etc. They also propose several extensions to MACAW as future work; what became of those?

To some extent, MACAW (sort of) approximates TCP: MILD is similar to AIMD, the DS packet is akin to TCP's three-way handshake, and the final ACK gives a sense of FIN. It would have been interesting to know what the authors thought about this.

Understanding TCP Incast Throughput Collapse in Datacenter Networks

Y. Chen, R. Griffith, J. Liu, A. Joseph, R. H. Katz, “Understanding TCP Incast Throughput Collapse in Datacenter Networks,” Workshop on Research in Enterprise Networks (WREN’09), (August 2009). [PDF]

Summary

This paper presents a diagnosis of the TCP incast collapse problem and proposes a framework for a solution that should be:

  1. generalized: not limited to particular network, traffic pattern, or workload;
  2. theoretically sound: the analytical model should be able to predict and explain experimental data;
  3. deployable in practice: should be implemented in kernels and evaluated using real workloads.

The main objective of the paper, as pointed out by the authors, is to understand the nature of the problem through extensive experimentation. To this end, they have done an excellent job by using different existing workloads, reproducing prior work in their own testbed, proposing a model (still crude), and using the model to explain observed phenomena.

The paper starts off with an exploratory approach. The authors reproduce the results of prior work (which they did to show the generality of the problem) and try multiple small tweaks (decreasing TCP's RTOmin, randomizing RTOmin, setting a smaller multiplier for the RTO exponential backoff, and randomizing that multiplier), finding that the most promising modification is reducing RTOmin (which confirms the solution proposed in prior work).

Through in-depth analysis, they identify three distinct regions of goodput behavior over different RTOmin values as the number of senders increases (regions not observed in prior work). The authors argue that any model should be able to capture these details:

  1. R1: Goodput collapse, which appears to reach its minimum at the same number of senders for different RTOmin values;
  2. R2: Goodput recovery, which takes goodput back to a local maximum at a larger number of senders for larger RTOmin values;
  3. R3: Goodput decline, which has the same slope of decrease for all RTOmin values.

Unlike previous studies, the authors find that disabling delayed ACKs has a negative effect on goodput irrespective of timer granularity and workload. They attribute this behavior to spurious retransmissions caused by overdriving the congestion window, and to the mismatch between a very low RTOmin value and the actual RTT in their testbed.

Finally, a yet-to-be-completed quantitative model is presented that, so far, captures regions R1 and R2. The main contribution of the model is that it identifies inter-packet wait time as a significant influence on goodput in addition to the RTO timer value: for large RTO values, reducing the RTO timer is a first-order mitigation of the incast problem, but for smaller RTO values, intelligently controlling inter-packet wait time becomes critical. The paper ends with a number of qualitative refinements intended to better model the observed phenomena.

Discussion

The authors have done a really excellent job of reproducing prior results and provide an extensive evaluation on real testbeds (no simulation! yay!) that digs up previously undiscovered factors and their impact on the incast problem. It also underscores that systems papers are dictated by the workloads/test cases they use, and a wrong choice of workloads can spoil great findings, if not ruin them completely. No one saw three regions before this paper; but who can guarantee that someone else won't see yet another region in some other workload? What would happen to the quantitative model sketched in the paper in that case?

The quantitative model itself is also not-so-good-looking. Hopefully, when the project is complete, the authors will come up with a model that does not look force-fitted to a particular empirical observation.

Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication

V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, G. A. Gibson, B. Mueller, “Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication,” ACM SIGCOMM Conference, (August 2009). [PDF]

The TCP Incast Collapse Problem

In data centers with high-fan-in, high-bandwidth, synchronized TCP workloads, receivers can experience a drastic reduction in goodput (application-level throughput) when multiple servers send large amounts of data back to a single requesting client over a bottleneck link. As the number of concurrent senders increases, goodput keeps dropping until it is orders of magnitude lower than the link capacity. This pathological response of TCP under extreme stress is known as the TCP incast collapse problem. Its preconditions include:

  1. High bandwidth, low-latency networks with small switch buffers;
  2. Clients issuing barrier-synchronized requests in parallel (i.e., a client does not issue the next request until all responses to the current one have arrived);
  3. Servers returning responses with small amounts of data that cannot fill the pipe.

When a server involved in a barrier-synchronized request experiences a timeout (due to packet loss in the small switch buffers), the client has to wait for RTOmin (normally 200 ms in most OSes) before that server retransmits; during this time the client's link stays largely idle, since the other responses, being small, do not occupy the pipe for long. As a result, in the worst cases goodput can drop to 1-10% of the client's link capacity.
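
The back-of-the-envelope arithmetic behind this collapse is easy to sketch; the numbers below are illustrative assumptions, not measurements from the paper:

    # Illustrative numbers, not measurements from the paper.
    link_capacity = 1e9 / 8        # 1 Gbps bottleneck link, in bytes per second
    response = 256 * 1024          # data returned per barrier-synchronized request
    rto_min = 0.200                # default 200 ms minimum RTO

    transfer_time = response / link_capacity       # ~2 ms when nothing is lost

    # One timeout stalls the whole barrier for RTOmin while the link sits idle.
    goodput_with_timeout = response / (transfer_time + rto_min)
    print(goodput_with_timeout / link_capacity)    # ~0.01, i.e., ~1% of capacity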

Proposed Solution(s)

The authors argue that the RTT (round-trip time) in a data center environment is orders of magnitude lower than the standard TCP RTOmin value (~1 ms vs. 200 ms), which was chosen for the WAN environment (with ~100 ms RTTs). Through simulations and experiments they show that the RTO, and RTOmin in particular, is the bottleneck here. They end up with the following fixes:

  • First, they lower the minimum RTO (RTOmin) to 20 µs for their workloads and generalize that RTOmin should be on the same timescale as the network latency (i.e., essentially removed).
  • Even then they observe some decrease in goodput at higher fan-in because multiple senders' timeouts, backoffs, and retransmissions become synchronized. So they add randomization to the RTO to desynchronize retransmissions; a sketch of this appears right after this list.
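
A sketch of the flavor of that randomization; the 50% jitter factor is an assumption, not necessarily the paper's exact constant:

    import random

    def randomized_rto(base_rto: float) -> float:
        """Scatter retransmission timers so that senders sharing a bottleneck
        do not all fire at once; the 50% jitter is an assumed constant."""
        return base_rto * (1.0 + random.uniform(0.0, 0.5))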

Later, the authors discuss implementation of fine-grained timers in kernels and modifications to the TCP stack to utilize those timers in RTT estimations and RTO calculations.

Finally, the implications of fine-grained timers for the WAN environment are considered, so that the proposed modifications can be integrated into standard TCP implementations. Removing RTOmin, or significantly decreasing the RTO, has two possible implications:

  1. Spurious retransmissions, when the network RTT experiences a spike. The authors argue, through hand-waving and some simulation, that the effect of a shorter RTO is negligible.
  2. The standard delayed-ACK threshold (40 ms, which is much higher than 20 µs) will cause a noticeable drop in throughput. They propose disabling delayed ACKs completely.

Comments

The high point of the paper is that it provides a relatively simple and deployable solution to an important problem. The proposed solutions, tested on a small real cluster and compared against simulations of much larger scenarios, behave consistently, which makes them believable.

It would be interesting to see a detailed evaluation of the overhead of the fine-grained timers proposed in the paper (this is left as future work). If the overhead is too high, the solution might become unusable in practice.