Update: Camera-ready version is available here now!
Last SIGCOMM we introduced the coflow scheduling problem and presented Varys, which addressed its clairvoyant variation, i.e., when all the information about individual coflows is known a priori and there are no cluster or task scheduling dynamics. In many cases, these assumptions do not hold very well, which left us with two primary challenges. First, how to schedule coflows when individual flow sizes, the total number of flows in a coflow, or their endpoints are unknown. Second, how to schedule coflows for jobs with more than one stage forming DAGs and where tasks can be scheduled in multiple waves. Aalo addresses both these problems by providing the first solution for the non-clairvoyant coflow scheduling problem: it can efficiently schedule coflows without any prior knowledge of coflow characteristics and can still approximate Varys’s performance very closely. This makes coflows practical in almost all scenarios we face in data-parallel communication.
Leveraging application-level requirements exposed through coflows has recently been shown to improve application-level communication performance in data-parallel clusters. However, existing efficient schedulers require a priori coflow information (e.g., flow size) and ignore cluster dynamics like pipelining, task failures, and speculative executions, which limit their applicability. Schedulers without prior knowledge compromise on performance to avoid head-of-line blocking. In this paper, we present Aalo that strikes a balance and efficiently schedules coflows without prior knowledge.
Aalo employs Discretized Coflow-Aware Least-Attained Service (D-CLAS) to separate coflows into a small number of priority queues based on how much they have already sent across the cluster. By performing prioritization across queues and by scheduling coflows in the FIFO order within each queue, Aalo’s non-clairvoyant scheduler can schedule diverse coflows and minimize their completion times. EC2 deployments and trace-driven simulations show that communication stages complete 1.93X faster on average and 3.59X faster at the 95th percentile using Aalo in comparison to per-flow mechanisms. Aalo’s performance is comparable to that of solutions using prior knowledge, and Aalo outperforms them in presence of cluster dynamics.
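To make the D-CLAS idea in the abstract more concrete, here is a minimal Python sketch of the two rules it describes: a coflow is placed in a priority queue based on how many bytes it has already sent, and coflows within the same queue are served in FIFO order. Everything specific in it is an assumption for illustration only: the `Coflow` class, the function names, the threshold values, and the use of strict priority across queues are not taken from this post and may differ from what Aalo actually does.

```python
from dataclasses import dataclass

MB = 2 ** 20

# Hypothetical queue boundaries in bytes (illustrative values, not from the post).
# A coflow that has sent fewer bytes than THRESHOLDS[i] belongs to queue i.
THRESHOLDS = [10 * MB, 100 * MB, 1000 * MB]


@dataclass
class Coflow:
    name: str
    arrival: int         # arrival sequence number, used for FIFO ordering
    bytes_sent: int = 0  # total bytes sent so far across all of its flows


def queue_index(bytes_sent, thresholds=THRESHOLDS):
    """Return the queue for a coflow; 0 is the highest priority."""
    for i, limit in enumerate(thresholds):
        if bytes_sent < limit:
            return i
    return len(thresholds)  # lowest-priority queue for the largest senders


def schedule_order(coflows, thresholds=THRESHOLDS):
    """Order coflows by queue (assumed strict priority), then FIFO by arrival."""
    return sorted(
        coflows,
        key=lambda c: (queue_index(c.bytes_sent, thresholds), c.arrival),
    )


coflows = [
    Coflow("A", arrival=0, bytes_sent=500 * MB),
    Coflow("B", arrival=1, bytes_sent=1 * MB),
    Coflow("C", arrival=2, bytes_sent=5 * MB),
    Coflow("D", arrival=3, bytes_sent=50 * MB),
]
order = [c.name for c in schedule_order(coflows)]
print(order)
```

Note that the sketch needs no knowledge of flow sizes or the number of flows in advance: the only input is the number of bytes each coflow has sent so far, which is why a scheme of this kind can work without prior information. A coflow that keeps sending is simply demoted to lower-priority queues as `bytes_sent` crosses each threshold.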
This is joint work with Ion, and the first time we have written a conference paper with just the two of us!
This year the SIGCOMM PC accepted 40 papers out of 262 submissions with a 15.27% acceptance rate. This also happens to be my last SIGCOMM as a student. Glad to be lucky enough to end on a high note!