V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, G. A. Gibson, and B. Mueller, “Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication,” ACM SIGCOMM Conference, August 2009.
The TCP Incast Collapse Problem
In data centers with high-fan-in, high-bandwidth, synchronized TCP workloads, receivers can experience a drastic reduction in goodput (application throughput) when multiple servers try to send large amounts of data back to a single requesting client over a bottleneck link. As the number of concurrent senders increases, goodput keeps decreasing until it is orders of magnitude lower than the link capacity. This pathological response of TCP under extreme stress is known as the TCP incast collapse problem. Its preconditions include:
- High-bandwidth, low-latency networks with small switch buffers;
- Clients issuing barrier-synchronized requests in parallel (i.e., a client waits until all responses to the current request have arrived before issuing the next one);
- Servers returning responses with small amounts of data that cannot fill the pipe.
When a server involved in a barrier-synchronized request experiences a timeout (due to packet loss in the small switch buffers), it waits for at least RTOmin (commonly 200ms in mainstream OSes) before retransmitting; during this time the client's link remains idle, since the other responses, being small, do not occupy the pipe for long. As a result, in the worst cases, goodput can drop to 1-10% of the client's bandwidth capacity.
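The collapse is visible in back-of-envelope arithmetic: once a response stalls for an RTO, the link sits idle for far longer than the transfer itself takes. The numbers below (a 256 KB aggregate response on a 1 Gbps link, which takes roughly 2ms to transfer) are illustrative assumptions, not figures from the paper's testbed:

```python
def goodput_fraction(transfer_us, rto_ms, timeouts_per_request=1):
    """Fraction of link capacity achieved when each barrier-synchronized
    request stalls for one or more RTOs before it can complete."""
    idle_us = timeouts_per_request * rto_ms * 1000.0
    return transfer_us / (transfer_us + idle_us)

# ~2.1 ms to move 256 KB at 1 Gbps, stalled by one 200 ms RTO:
print(goodput_fraction(2100, 200))   # ~0.01, i.e. ~1% of link capacity

# The same transfer with a 200 us RTO barely notices the stall:
print(goodput_fraction(2100, 0.2))   # ~0.91
```

With the default 200ms RTOmin the idle time dominates the request, reproducing the 1-10% figure; with an RTO on the scale of the datacenter RTT the penalty nearly vanishes.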
The authors argue that the RTT (round-trip time) in data center environments is orders of magnitude lower than the standard TCP RTOmin value (1ms vs. 200ms), which was chosen for WAN environments (with ~100ms RTTs). Through simulations and experiments they show that the RTO, especially RTOmin, is the bottleneck here. They arrive at the following fixes:
- First, they lower the minimum RTO (RTOmin) to 20us for their workloads and generalize that RTOmin should be on the same timescale as the network latency (essentially removing it).
- They still observe some decrease in goodput at higher fan-in settings because multiple senders synchronize in timing out, backing off, and retransmitting. So they introduce randomization into the RTO to desynchronize retransmissions.
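One way to picture the desynchronization fix is to spread each sender's retransmission timer over a random interval. The sketch below is a simplified assumption about the form of the randomization (the paper's exact formula may differ); it only illustrates that identical base RTOs no longer fire at the same instant:

```python
import random

def randomized_rto(base_rto_us, rng=random):
    """Spread the timeout uniformly over [base, 2*base) so that senders
    sharing the same base RTO retransmit at different times.
    (An illustrative choice, not necessarily the paper's formula.)"""
    return base_rto_us * (1.0 + rng.random())

random.seed(7)
# Eight senders that would otherwise all retransmit at exactly 1000 us:
fire_times = sorted(randomized_rto(1000) for _ in range(8))
print(fire_times)  # eight distinct firing times spread across [1000, 2000) us
```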
The authors then discuss the implementation of fine-grained timers in the kernel and the modifications to the TCP stack needed to use those timers for RTT estimation and RTO calculation.
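Conceptually, the stack change amounts to keeping the standard estimator state in finer-grained units and lowering (or removing) the RTO floor. Below is a minimal RFC 6298-style sketch in microseconds with a configurable RTOmin; this is an illustration of the idea in Python, not the authors' kernel patch:

```python
class MicroRttEstimator:
    """RFC 6298-style SRTT/RTTVAR tracking kept in microseconds,
    with a configurable RTO floor (rto_min_us). Illustrative only."""

    def __init__(self, rto_min_us=0, alpha=1/8, beta=1/4):
        self.srtt = None        # smoothed RTT estimate (us)
        self.rttvar = None      # RTT variation estimate (us)
        self.rto_min_us = rto_min_us
        self.alpha, self.beta = alpha, beta

    def sample(self, rtt_us):
        if self.srtt is None:   # first measurement initializes both
            self.srtt = rtt_us
            self.rttvar = rtt_us / 2
        else:
            self.rttvar = (1 - self.beta) * self.rttvar \
                          + self.beta * abs(self.srtt - rtt_us)
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * rtt_us

    @property
    def rto_us(self):
        return max(self.rto_min_us, self.srtt + 4 * self.rttvar)

est = MicroRttEstimator(rto_min_us=0)    # floor removed
for rtt in [100, 120, 90, 110]:          # ~100 us datacenter RTTs
    est.sample(rtt)
print(est.rto_us)  # a few hundred microseconds, not 200 ms
```

With the conventional 200ms floor, the same ~100us RTT samples would still yield a 200ms RTO, which is exactly the problem the paper targets.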
Finally, the implications of fine-grained timers for the WAN environment are considered, so that the proposed modifications could be integrated into standard TCP implementations. Removing RTOmin, or significantly decreasing the RTO, has two possible implications:
- Spurious retransmissions when the network RTT experiences a spike. The authors argue, through some hand-waving and simulation, that the effect of a shorter RTO is negligible.
- The standard delayed-ACK threshold (40ms, which is much higher than 20us) would cause a noticeable drop in throughput. They propose disabling delayed ACKs completely.
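The delayed-ACK interaction is easy to quantify: if the receiver holds an ACK for up to 40ms while the sender's timer is in the microsecond range, the sender (with standard exponential backoff) retransmits several times before the ACK ever arrives. A small illustrative sketch:

```python
def spurious_retx_count(rto_us, ack_delay_us):
    """Count retransmissions fired (with exponential backoff) before a
    delayed ACK arriving at ack_delay_us finally cancels the timer."""
    elapsed, count = 0, 0
    while elapsed + rto_us < ack_delay_us:
        elapsed += rto_us
        count += 1
        rto_us *= 2  # standard exponential backoff on each timeout
    return count

print(spurious_retx_count(200, 40_000))      # fine-grained RTO: 7 spurious retransmissions
print(spurious_retx_count(200_000, 40_000))  # 200 ms RTOmin: 0, the delayed ACK always wins
```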
The high point of the paper is that it provides a relatively easy and deployable solution to an important problem. The proposed solutions, tested on a small real cluster and compared against simulations of much larger scenarios, behave consistently, which makes the results believable.
It would be interesting to see a detailed evaluation of the overhead of the fine-grained timers proposed in the paper (left as future work). If the overhead is too high, the solution might become unusable in practice.