Y. Chen, R. Griffith, J. Liu, A. Joseph, and R. H. Katz, “Understanding TCP Incast Throughput Collapse in Datacenter Networks,” Workshop on Research in Enterprise Networks (WREN ’09), August 2009. [PDF]
This paper presents a diagnosis of the TCP incast collapse problem and proposes a framework for a solution that should be:
- generalized: not limited to a particular network, traffic pattern, or workload;
- theoretically sound: the analytical model should predict and explain experimental data;
- practically deployable: implementable in real kernels and evaluated with real workloads.
The main objective of the paper, as the authors point out, is to understand the nature of the problem through extensive experimentation. To this end, they do an excellent job: they use several existing workloads, reproduce prior work in their own testbed, propose a (still crude) model, and use that model to explain the observed phenomena.
The paper starts off with an exploratory approach. The authors reproduce the results from prior work (which demonstrates the generality of the problem) and try multiple smaller tweaks (decreasing TCP's RTOmin, randomizing the RTOmin, setting a smaller multiplier for the RTO exponential backoff, and randomizing the multiplier value), only to find that the most promising modification is reducing RTOmin, which confirms the solution proposed in prior work.
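The tweaks can be sketched as follows. This is a minimal illustration, not the authors' implementation; the 200 ms default RTOmin and the randomization ranges are my own assumptions for the example:

```python
import random

def rto_schedule(rto_min_ms, multiplier=2.0, randomize_min=False,
                 randomize_mult=False, retries=5):
    """Return the sequence of retransmission timeouts (ms) for successive
    retries under exponential backoff, with the tweaks the paper explores."""
    # Tweak: randomize the starting RTOmin (range is an assumption).
    base = rto_min_ms * random.uniform(0.5, 1.5) if randomize_min else rto_min_ms
    timeouts, rto = [], base
    for _ in range(retries):
        timeouts.append(rto)
        # Tweak: randomize the backoff multiplier (range is an assumption).
        m = random.uniform(1.0, multiplier) if randomize_mult else multiplier
        rto *= m
    return timeouts

# Baseline: RTOmin = 200 ms with standard doubling backoff.
print(rto_schedule(200))          # 200, 400, 800, 1600, 3200 ms
# The most promising tweak per the paper: simply reduce RTOmin.
print(rto_schedule(1))            # 1, 2, 4, 8, 16 ms
# A smaller backoff multiplier keeps later retries shorter.
print(rto_schedule(200, multiplier=1.5))
```

The sketch makes the intuition visible: with a 200 ms RTOmin, even a single timeout stalls a flow far longer than a datacenter RTT, whereas a millisecond-scale RTOmin keeps every retry cheap.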
Through in-depth analysis, they identify three distinct regions of goodput behavior as the number of senders increases, observed across different RTOmin values (a structure not seen in prior work). The authors argue that any model should be able to capture these details:
- R1: Goodput collapse, which appears to reach its minimum at the same number of senders across different RTOmin values;
- R2: Goodput recovery, which brings goodput to a local maximum, occurring at larger numbers of senders for larger RTOmin values;
- R3: Goodput decline, which has the same slope of decrease for all RTOmin values.
In contrast to prior work, the authors find that disabling delayed ACKs has a negative effect on goodput irrespective of timer granularity and workload. They attribute this behavior to spurious retransmissions caused by overdriving the congestion window and by a mismatch between the very low RTOmin value and the RTT in the testbed.
Finally, a yet-to-be-completed quantitative model is presented that, so far, captures regions R1 and R2. The model's main contribution is identifying inter-packet wait time as a significant influence on goodput in addition to the RTO timer value: for large RTO timer values, reducing the RTO timer is a first-order mitigation of the incast problem, but for smaller RTO values, intelligently controlling inter-packet wait time becomes critical. The paper ends with a number of qualitative refinements to the model to better match the observed phenomena.
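The interplay between the RTO timer and inter-packet wait time can be illustrated with a back-of-the-envelope sketch. This is my own toy model, not the paper's; all of the numbers below are made up purely for illustration:

```python
def toy_goodput_mbps(data_mb, link_mbps, n_timeouts, rto_ms,
                     inter_packet_wait_ms, n_packets):
    """Toy estimate: goodput = data / (ideal time + timeout stalls + waits)."""
    ideal_s = data_mb * 8 / link_mbps                  # time to push bits on the wire
    stall_s = n_timeouts * rto_ms / 1000               # time lost stalled on RTOs
    wait_s = n_packets * inter_packet_wait_ms / 1000   # accumulated inter-packet gaps
    return data_mb * 8 / (ideal_s + stall_s + wait_s)

# Large RTO: timeout stalls dominate, so cutting the RTO is first-order.
print(toy_goodput_mbps(1, 1000, 5, 200, 0.1, 1000))   # low goodput
print(toy_goodput_mbps(1, 1000, 5, 1, 0.1, 1000))     # much better
# Small RTO: inter-packet wait now dominates the total time, so
# controlling it becomes the critical knob.
print(toy_goodput_mbps(1, 1000, 5, 1, 0.01, 1000))    # better still
```

Even this crude accounting reproduces the qualitative claim: once the RTO term shrinks below the accumulated inter-packet waits, further RTO reduction buys little, and wait-time control takes over as the dominant factor.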
The authors have done an excellent job reproducing the prior results and provide an extensive evaluation in real testbeds (no simulation! yay!) that dug up previously unnoticed factors and their impact on the incast problem. The paper also highlights that systems results are dictated by the workloads/test cases used: a poor choice of workloads can undermine, if not completely ruin, otherwise great findings. No one saw three regions before this paper; but who can guarantee that someone else won't see yet another increasing region under some other workload? What would happen to the quantitative model sketched in the paper in that case?
The quantitative model itself is not particularly elegant either. Hopefully, by the time the project is complete, the authors will arrive at a model that does not look force-fitted to a particular set of empirical observations.