Update: Camera-ready version is available here now!
Since introducing the coflow abstraction in 2012, we’ve been working hard to make it practical one step at a time. Over the years, we’ve worked on efficient coflow scheduling, removed clairvoyance requirements in coflow scheduling, and performed fair sharing among coexisting coflows. Throughout all these efforts, one requirement remained constant: all distributed data-parallel applications had to be modified to correctly use the same coflow API. Unfortunately, this is difficult, if not impossible, because of the sheer number of computing frameworks available today.
CODA makes a first attempt at addressing this problem head on. The intuition is simple: if we can automatically identify coflows from the network, applications need not be modified. There are at least two key challenges in achieving this goal. The first one is classifying coflows. Unlike traditional traffic classification, coflows cannot be identified using five-tuples. Instead, we must use domain-specific insights to capture the logical relationship between multiple flows. The second challenge is avoiding the impact of misclassification, because even the best classifier will sometimes be wrong. We must be able to schedule coflows even when some flows have been misclassified into wrong coflows. Our solution errs on the side of caution to perform error-tolerant coflow scheduling. Overall, CODA approximates Aalo very closely and significantly outperforms flow-based solutions.
Leveraging application-level requirements using coflows has recently been shown to improve application-level communication performance in data-parallel clusters. However, existing coflow-based solutions rely on modifying applications to extract coflows, making them inapplicable to many practical scenarios. In this paper, we present CODA, a first attempt at automatically identifying and scheduling coflows without any application-level modifications.
We employ an incremental clustering algorithm to perform fast, application-transparent coflow identification and complement it by proposing an error-tolerant coflow scheduler to mitigate occasional identification errors. Testbed experiments and large-scale simulations with realistic production workloads show that CODA can identify coflows with more than 90% accuracy, and CODA’s scheduler is robust to inaccuracies, enabling communication stages to complete 2.4X (5.1X) faster on average (95th percentile) in comparison to per-flow mechanisms. CODA’s performance is comparable to that of solutions requiring application-level modifications.
This work is a tight collaboration with Kai Chen and his students at HKUST: Hong Zhang, Bairen Yi, and Li Chen, which started with Hong sending me a short note asking whether I think the problem is worth addressing. Of course I did; it is the second future work listed in my dissertation!
This year the SIGCOMM PC accepted 39 papers out of 225 submissions with a 17.33% acceptance rate. This also happens to be my first SIGCOMM in an advisory role and the first one without having Ion onboard. I am thankful to Kai for giving me full access to his excellent, hard-working students. It’s been a gratifying experience to work with them, and I look forward to further collaborations. Finally, I’m excited to start my life as an advisor on a high note. Go Blue!