Tag Archives: Coflow

Near Optimal Coflow Scheduling Accepted at SPAA’2019

Since the inception of coflow in 2012, the abstraction and the work surrounding it have been growing at a fast pace. In addition to systems building, we have seen a rise in theoretical analyses of the coflow scheduling problem; one of the most recent efforts even received the Best Student Paper Award at SIGCOMM'2018.

Today I want to share a recent development in the theoretical analysis of coflow scheduling in the context of general graphs. While datacenter networks can be abstracted as a non-blocking switch, this simplistic model no longer holds under failures and load imbalance. More importantly, inter-datacenter wide-area networks (WANs) are never non-blocking anyway. Naturally, we need a new analytical framework that allows for a broader variety of network topologies. In our recent work with colleagues at the University of Maryland and Google, we provide a randomized 2-approximation algorithm for the single-path and free-path models of coflow scheduling over the WAN, significantly improving over the state of the art.

The coflow scheduling problem has emerged as a popular abstraction in the last few years to study data communication problems within a data center [5]. In this basic framework, each coflow has a set of communication demands, and the goal is to schedule many coflows in a manner that minimizes the total weighted completion time. A coflow is said to complete when all its communication needs are met. This problem has been extremely well studied for the case of complete bipartite graphs that model a data center with full bisection bandwidth, and several approximation algorithms and effective heuristics have been proposed recently [1, 2, 29].

In this work, we study a slightly different model of coflow scheduling in general graphs (to capture traffic between datacenters [15, 29]) and develop practical and efficient approximation algorithms for it. Our main result is a randomized 2-approximation algorithm for the single-path and free-path models, significantly improving upon prior work. In addition, we demonstrate via extensive experiments that the algorithm is practical, easy to implement, and performs well in practice.
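For concreteness, the objective described above can be written down directly. With $w_c$ denoting coflow $c$'s weight and $C_f$ the completion time of flow $f$, we want to

\[
\text{minimize} \sum_{c} w_c \, C_c, \qquad \text{where } C_c = \max_{f \in c} C_f,
\]

since a coflow completes only when its last flow does.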

This is my first paper at SPAA, but all thanks go to Sheng Yang at UMD, whose dissertation will include this work, his advisor Samir, and our collaborator Manish at Google. My student Jie You (Jimmy) did the evaluation for this work, comparing it against existing solutions on different WAN topologies and workloads. It was a pleasant experience working with the theory folks and helping them better understand the practical constraints. I look forward to many more future collaborations!

Carbyne Accepted to Appear at OSDI’2016

Update: Camera-ready version is available here now!

With the wide adoption of distributed data-parallel applications, large-scale resource scheduling has become a constant source of innovation in recent years. There are tens of scheduling solutions that try to optimize for objectives such as user-level fairness, application-level performance, and cluster-level efficiency. However, given the well-known tradeoffs between fairness, performance, and efficiency, these solutions have traditionally focused on one primary objective (e.g., fairness in the case of DRF), treating the other objectives as best-effort, secondary goals.

In this paper, we revisit the tradeoff space and demonstrate that aggressively optimizing one primary objective while giving up the rest often does not pay off in the long run. Because a job cannot complete until all its tasks have completed, each job can altruistically yield some of its resources without hampering its own completion time. These altruistic resources can then be rescheduled among other jobs to significantly improve secondary objectives without hampering the primary one. The benefits of our approach are visible even for single-stage jobs, and they increase as jobs have more complex DAGs.
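To make the altruism intuition concrete, here is a toy sketch in Python (my own illustration, not Carbyne's actual algorithm; it assumes a single-stage job running one task per slot, and the numbers are made up):

```python
def yieldable_slot_time(task_durations, slots):
    """A job holding `slots` parallel slots cannot finish before its
    longest task does. Any slot-time beyond the job's total work is
    leftover that it can altruistically yield without finishing later."""
    completion_time = max(task_durations)        # bound by the longest task
    total_work = sum(task_durations)             # slot-time actually needed
    return slots * completion_time - total_work  # slot-time it can give away

# A job with tasks of 10, 2, and 2 time units on 3 slots finishes at t=10
# either way, so 16 of its 30 slot-time units can go to other jobs.
print(yieldable_slot_time([10, 2, 2], slots=3))  # -> 16
```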

Given the well-known tradeoffs between performance, fairness, and efficiency, modern cluster schedulers focus on aggressively optimizing a single objective, while ignoring the rest. However, short-term convergence to a selected objective often does not result in noticeable long-term benefits. Instead, we propose an altruistic, long-term approach, where jobs yield fractions of their allocated resources without impacting their own completion times.

We show that the leftover resources collected via the altruism of many jobs can then be rescheduled, essentially introducing a third degree of freedom in cluster scheduling, in addition to inter- and intra-job scheduling. We leverage this newfound flexibility in Carbyne, a scheduler that combines three existing schedulers to simultaneously optimize cluster-wide performance, fairness, and efficiency. Deployments and large-scale simulations on industrial benchmarks and production traces show that Carbyne closely approximates the state-of-the-art solutions for each of the three objectives simultaneously. For example, Carbyne provides 1.26X better efficiency and 1.59X lower average completion time than DRF, while providing similar fairness guarantees.

Altruistic scheduling has many more use cases; e.g., we had a similar observation for coflow scheduling in Varys.

This work started as a collaboration with Robert Grandl and Aditya Akella toward the end of 2015. Ganesh Ananthanarayanan from MSR later joined us to take it to the next level. After CODA, this is another piece related to a future work (the seventh) from my dissertation; infer whatever you want out of these two data points ;)

This year the OSDI PC accepted 47 out of 260 papers. This happens to be my first time submitting to OSDI. It's also my first paper with Ganesh, even though it happened after we both graduated from Berkeley; we sat opposite each other for four years back then! I also want to thank Aditya for letting me work closely with Robert; it's been great collaborating with them.

CODA Accepted to Appear at SIGCOMM’2016

Update: Camera-ready version is available here now!

Since introducing the coflow abstraction in 2012, we’ve been working hard to make it practical one step at a time. Over the years, we’ve worked on efficient coflow scheduling, removed clairvoyance requirements in coflow scheduling, and performed fair sharing among coexisting coflows. Throughout all these efforts, one requirement remained constant: all distributed data-parallel applications had to be modified to correctly use the same coflow API. Unfortunately, this is difficult, if not impossible, because of the sheer number of computing frameworks available today.

CODA makes a first attempt at addressing this problem head on. The intuition is simple: if we can automatically identify coflows from the network, applications need not be modified. There are at least two key challenges in achieving this goal. The first is classifying coflows: unlike traditional traffic classification, coflows cannot be identified using five-tuples. Instead, we must use domain-specific insights to capture the logical relationship between multiple flows. The second is avoiding the impact of misclassification, because even the best classifier will sometimes be wrong; we must be able to schedule coflows even when some flows have been misclassified into the wrong coflows. Our solution errs on the side of caution to perform error-tolerant coflow scheduling. Overall, CODA approximates Aalo very closely and significantly outperforms flow-based solutions.

Leveraging application-level requirements using coflows has recently been shown to improve application-level communication performance in data-parallel clusters. However, existing coflow-based solutions rely on modifying applications to extract coflows, making them inapplicable to many practical scenarios. In this paper, we present CODA, a first attempt at automatically identifying and scheduling coflows without any application-level modifications.

We employ an incremental clustering algorithm to perform fast, application-transparent coflow identification and complement it by proposing an error-tolerant coflow scheduler to mitigate occasional identification errors. Testbed experiments and large-scale simulations with realistic production workloads show that CODA can identify coflows with more than 90% accuracy, and CODA’s scheduler is robust to inaccuracies, enabling communication stages to complete 2.4X (5.1X) faster on average (95th percentile) in comparison to per-flow mechanisms. CODA’s performance is comparable to that of solutions requiring application-level modifications.
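As a rough illustration of the identification step, consider grouping flows whose start times fall close together. This greedy sketch is hypothetical and far simpler than CODA's identifier, which relies on richer domain-specific attributes and distance metrics (the 0.5-second window below is made up):

```python
def identify_coflows(flows, window=0.5):
    """Greedy incremental clustering on start times (seconds).
    `flows` is a list of (flow_id, start_time), sorted by start_time;
    a flow joins the current cluster if it starts within `window`
    of the previous flow, else it opens a new tentative coflow."""
    clusters, prev_start = [], None
    for flow_id, start in flows:
        if prev_start is None or start - prev_start > window:
            clusters.append([])
        clusters[-1].append(flow_id)
        prev_start = start
    return clusters

# Two bursts of flows become two tentative coflows.
flows = [("f1", 0.0), ("f2", 0.1), ("f3", 5.0), ("f4", 5.2)]
print(identify_coflows(flows))  # -> [['f1', 'f2'], ['f3', 'f4']]
```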

This work is a tight collaboration with Kai Chen and his students at HKUST (Hong Zhang, Bairen Yi, and Li Chen), which started with Hong sending me a short note asking whether I thought the problem was worth addressing. Of course I did; it is the second future work listed in my dissertation!

This year the SIGCOMM PC accepted 39 papers out of 225 submissions with a 17.33% acceptance rate. This also happens to be my first SIGCOMM in an advisory role and the first one without having Ion onboard. I am thankful to Kai for giving me full access to his excellent, hard-working students. It’s been a gratifying experience to work with them, and I look forward to further collaborations. Finally, I’m excited to start my life as an advisor on a high note. Go Blue!

HUG Accepted to Appear at NSDI’2016

Update: Camera-ready version is available here now!

With the advent of cloud computing and datacenter-scale applications, simultaneously dealing with multiple resources is the new norm. When multiple parties have multi-resource demands, dividing these resources fairly (for some notion of fairness) is a core challenge in the resource allocation literature. Dominant Resource Fairness (DRF), presented at NSDI'2011, was the first work to point out the ensuing challenges, extending traditional, single-resource max-min fairness to the multi-resource environment.

While DRF guarantees many nice properties to ensure fairness, it gives up work conservation; we show that its utilization can be arbitrarily low in the worst case. Over the past five years, many software systems and research proposals have built upon DRF under the assumption that DRF can trivially be made work-conserving without losing any of its properties.

After five years, we prove that not to be the case. We show that there is a fundamental tradeoff between guaranteed resources and work conservation: naively adding work conservation on top of DRF can actually decrease utilization instead of increasing it. We also characterize the maximum utilization attainable while maintaining DRF's optimal guarantees and propose a new algorithm, High Utilization with Guarantees (HUG), that generalizes single- and multi-resource max-min fairness, as well as the multi-tenant network sharing literature, under one unifying framework.

In this paper, we study how to optimally provide isolation guarantees in multi-resource environments, such as public clouds, where a tenant's demands on different resources (links) are correlated in order for her to make progress. Unlike prior work such as Dominant Resource Fairness (DRF) that assumes demands to be static and fixed, we consider elastic demands. Our approach generalizes canonical max-min fairness to the multi-resource setting with correlated demands, and extends DRF to elastic demands. We consider two natural optimization objectives: isolation guarantee from a tenant's viewpoint and system utilization (work conservation) from an operator's perspective. We prove that in non-cooperative environments like public cloud networks, there is a strong tradeoff between optimal isolation guarantee and work conservation when demands are elastic. Even worse, work conservation can decrease network utilization instead of improving it when demands are inelastic. We identify the root cause behind the tradeoff and present a provably optimal allocation algorithm, High Utilization with Guarantees (HUG), to achieve maximum attainable network utilization without sacrificing the optimal isolation guarantee, strategy-proofness, and other useful properties of DRF. In cooperative environments like private datacenter networks, HUG achieves both the optimal isolation guarantee and work conservation. Analyses, simulations, and experiments show that HUG provides better isolation guarantees, higher system utilization, and better tenant-level performance than its counterparts.
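To see what the optimal isolation guarantee looks like, here is a toy calculation (my own sketch, not HUG itself; it assumes each tenant's demand vector is normalized so its largest entry is 1, and the numbers are invented):

```python
def optimal_isolation_guarantee(demands):
    """demands[i][l]: tenant i's normalized demand on link l (max entry 1).
    If tenant i receives g * demands[i] on every link, the largest feasible
    progress g is limited by the most contended link."""
    num_links = len(demands[0])
    return 1.0 / max(sum(d[l] for d in demands) for l in range(num_links))

# Two tenants, two links, correlated demands: A needs <1, 0.5>, B <0.5, 1>.
print(optimal_isolation_guarantee([[1.0, 0.5], [0.5, 1.0]]))  # -> 0.666...
```

HUG's contribution lies in how the capacity left over after meeting these guarantees is used: handing it out naively invites strategic behavior by tenants, which is exactly where the tradeoff with work conservation shows up.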

This work is the result of a close collaboration with Zhenhua Liu with help from Ion Stoica and Ali Ghodsi.

This year the NSDI PC accepted 45 out of 228 submissions with a 19.74% acceptance rate. This work forms the basis of the last chapter of my dissertation, ensuring SIGCOMM or NSDI publications from all its primary chapters!

Aalo Accepted to Appear at SIGCOMM’2015

Update: Camera-ready version is available here now!

Last SIGCOMM, we introduced the coflow scheduling problem and presented Varys, which addressed its clairvoyant variation, i.e., when all the information about individual coflows is known a priori and there are no cluster or task scheduling dynamics. In many cases, these assumptions do not hold, leaving us with two primary challenges. First, how do we schedule coflows when individual flow sizes, the total number of flows in a coflow, or their endpoints are unknown? Second, how do we schedule coflows for jobs with more than one stage, which form DAGs and whose tasks can be scheduled in multiple waves? Aalo addresses both problems by providing the first solution to the non-clairvoyant coflow scheduling problem: it can efficiently schedule coflows without any prior knowledge of coflow characteristics and still approximate Varys's performance very closely. This makes coflows practical in almost all scenarios we face in data-parallel communication.

Leveraging application-level requirements exposed through coflows has recently been shown to improve application-level communication performance in data-parallel clusters. However, existing efficient schedulers require a priori coflow information (e.g., flow size) and ignore cluster dynamics like pipelining, task failures, and speculative executions, which limit their applicability. Schedulers without prior knowledge compromise on performance to avoid head-of-line blocking. In this paper, we present Aalo, which strikes a balance and efficiently schedules coflows without prior knowledge.

Aalo employs Discretized Coflow-Aware Least-Attained Service (D-CLAS) to separate coflows into a small number of priority queues based on how much they have already sent across the cluster. By performing prioritization across queues and by scheduling coflows in FIFO order within each queue, Aalo's non-clairvoyant scheduler can schedule diverse coflows and minimize their completion times. EC2 deployments and trace-driven simulations show that communication stages complete 1.93X faster on average and 3.59X faster at the 95th percentile using Aalo in comparison to per-flow mechanisms. Aalo's performance is comparable to that of solutions using prior knowledge, and Aalo outperforms them in the presence of cluster dynamics.
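The queue-assignment logic is simple enough to sketch in a few lines; the starting threshold and multiplier below are illustrative stand-ins rather than Aalo's exact defaults:

```python
import math

def queue_index(bytes_sent, first_threshold=10 * 2**20,
                multiplier=10, num_queues=10):
    """Demote a coflow to lower-priority queues as it sends more bytes.
    Queue boundaries grow exponentially; within each queue, coflows are
    served in FIFO order, and earlier queues have strictly higher priority."""
    if bytes_sent < first_threshold:
        return 0
    k = 1 + int(math.log(bytes_sent / first_threshold, multiplier))
    return min(k, num_queues - 1)

print(queue_index(1 * 2**20))    # a young coflow stays in queue 0
print(queue_index(500 * 2**20))  # a heavy coflow has been demoted to queue 2
```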

This is a joint work with Ion, the first time we have written a conference paper just between the two of us!

This year the SIGCOMM PC accepted 40 papers out of 262 submissions with a 15.27% acceptance rate. This also happens to be my last SIGCOMM as a student. Glad to be lucky enough to end on a high note!

Orchestra is the Default Broadcast Mechanism in Apache Spark

With its recent release, Apache Spark has promoted Cornet (the BitTorrent-like broadcast mechanism proposed in Orchestra, SIGCOMM'11) to be its default broadcast mechanism. It's great to see our research see the light of day in the real world! Many thanks to Reynold and others for making it happen.

MLlib, the machine learning library of Spark, will enjoy the biggest boost from this change because of the broadcast-heavy nature of many machine learning algorithms.
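Since Cornet is now the default, existing Spark programs pick it up without code changes. For reference, a minimal PySpark example of the broadcast API looks like this:

```python
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-demo")
lookup = sc.broadcast({"a": 1, "b": 2})  # shipped to executors, now via Cornet

# Each task reads the broadcast value locally instead of re-fetching it
# from the driver, which is what makes large ML models cheap to distribute.
total = sc.parallelize(["a", "b", "a"]).map(lambda k: lookup.value[k]).sum()
print(total)  # -> 4
sc.stop()
```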

Varys Developer Alpha Released!

We are glad to announce the first open-source release of Varys, an application-aware network scheduler for data-parallel clusters using the coflow abstraction. It’s a stripped-down dev-alpha release for the experts, so please be patient with it!

A quick overview of the system can be found at varys.net. Here is a 30-second summary:

Varys is an open source network manager/scheduler that aims to improve communication performance of Big Data applications. Its target applications/jobs include those written in Spark, Hadoop, YARN, BSP, and similar data-parallel frameworks.

Varys provides a simple API that allows data-parallel frameworks to express their communication requirements as coflows with minimal changes to the framework. Using coflows as the basic abstraction of network scheduling, Varys implements novel schedulers either to make applications faster or to make time-restricted applications complete within deadlines.
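To give a flavor of what expressing communication as coflows means, here is a hypothetical, in-memory stand-in for such an API in Python. The method names mirror the register/put/get/unregister pattern from the Varys paper, but the signatures and behavior here are illustrative only:

```python
import itertools

class CoflowClient:
    """Toy stand-in for a coflow API; a real client would talk to the
    Varys master, which schedules flows across the whole cluster."""
    _ids = itertools.count()

    def __init__(self):
        self._store = {}

    def register_coflow(self):
        return next(self._ids)  # driver: typically one coflow per stage

    def put(self, coflow_id, data_id, content):
        self._store[(coflow_id, data_id)] = content  # sender side

    def get(self, coflow_id, data_id):
        return self._store[(coflow_id, data_id)]     # receiver side

    def unregister_coflow(self, coflow_id):
        # signal completion so the scheduler can move on to other coflows
        self._store = {k: v for k, v in self._store.items()
                       if k[0] != coflow_id}

client = CoflowClient()
cid = client.register_coflow()
client.put(cid, "shuffle-0-1", b"payload")
print(client.get(cid, "shuffle-0-1"))  # -> b'payload'
client.unregister_coflow(cid)
```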

Primary features included in this initial release are:

  • Support for in-memory and on-disk coflows,
  • Efficient scheduling to minimize the average coflow completion times, and
  • In the deadline-sensitive mode, support for soft deadlines.

Here are some links if you want to check it out, contribute to making it better, or point someone else to it who can help us.

Project Website: http://varys.net
Git repository: https://github.com/coflow/varys
Relevant tools: https://github.com/coflow
Research papers:
1. Efficient Coflow Scheduling with Varys
2. Coflow: A Networking Abstraction for Cluster Applications

The project is still in its early stages and can use all the help it can get to become successful. We appreciate your feedback.

Varys accepted at SIGCOMM’2014: Coflow is coming!

Update 2: Varys Developer Alpha released!

Update 1: Latest version is available here now!

We introduced the coflow abstraction back in 2012 at HotNets-XI, and now we have something solid to show what coflows are good for. Our paper on efficient and deadline-sensitive coflow scheduling has been accepted to appear at this year’s SIGCOMM. By exploiting a little bit of application-level information, coflows enable significantly better performance for distributed data-intensive jobs. In the process, we have discovered, characterized, and proposed heuristics for a novel scheduling problem: “Concurrent Open Shop Scheduling with Coupled Resources.” Yeah, it’s a mouthful; but hey, how often does one stumble upon a new scheduling problem?

Communication in data-parallel applications often involves a collection of parallel flows. Traditional techniques to optimize flow-level metrics do not perform well in optimizing such collections, because the network is largely agnostic to application-level requirements. The recently proposed coflow abstraction bridges this gap and creates new opportunities for network scheduling. In this paper, we address inter-coflow scheduling for two different objectives: decreasing communication time of data-intensive jobs and guaranteeing predictable communication time. We introduce the concurrent open shop scheduling with coupled resources problem, analyze its complexity, and propose effective heuristics to optimize either objective. We present Varys, a system that enables data-intensive frameworks to use coflows and the proposed algorithms while maintaining high network utilization and guaranteeing starvation freedom. EC2 deployments and trace-driven simulations show that communication stages complete up to 3.16X faster on average and up to 2X more coflows meet their deadlines using Varys in comparison to per-flow mechanisms. Moreover, Varys outperforms non-preemptive coflow schedulers by more than 5X.
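One of the heuristics at Varys's core, Smallest-Effective-Bottleneck-First (SEBF), schedules the coflow whose slowest port would finish soonest. Below is a simplified sketch assuming a non-blocking fabric with uniform port capacities (the demand matrices are made up, and the real system pairs this ordering with MADD for per-flow rate allocation):

```python
def bottleneck_time(demand, port_capacity=1.0):
    """demand[i][j]: bytes sender i must move to receiver j. A coflow
    cannot finish before its most heavily loaded port does."""
    ingress = max(sum(row) for row in demand)       # per-sender load
    egress = max(sum(col) for col in zip(*demand))  # per-receiver load
    return max(ingress, egress) / port_capacity

def sebf_order(coflows):
    """Smallest effective bottleneck first, to lower the average CCT."""
    return sorted(coflows, key=bottleneck_time)

small = [[1, 0], [0, 1]]  # bottleneck: 1 time unit
big = [[4, 4], [4, 4]]    # bottleneck: 8 time units
print([bottleneck_time(c) for c in sebf_order([big, small])])  # -> [1.0, 8.0]
```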

This is a joint work with Yuan Zhong and Ion Stoica. But as always, it took a village! I’d like to thank Mohammad Alizadeh, Justine Sherry, Peter Bailis, Rachit Agarwal, Ganesh Ananthanarayanan, Tathagata Das, Ali Ghodsi, Gautam Kumar, David Zats, Matei Zaharia, the NSDI and SIGCOMM reviewers, and everyone else who patiently read many drafts or listened to my spiel over the years. It’s been in the oven for many years, and I’m glad that it was so well received.

I believe coflows are the future, and they are coming for the good of the realm (read: network). Varys is only the tip of the iceberg, and there are way too many interesting problems to solve (both for network and theory folks!). We are in the process of open sourcing Varys in the coming weeks and look forward to seeing it used with major data-intensive frameworks. Personally, I'm happy with how my dissertation research is shaping up :)

This year the SIGCOMM PC accepted 45 papers out of 237 submissions with an 18.99% acceptance rate. This is the highest acceptance rate since 1995 and the highest number of papers ever accepted (tying with SIGCOMM'86)! After years of everyone complaining about the acceptance rate, this year's PC has taken some action; probably a healthy step for the community.

Sinbad accepted to appear at SIGCOMM’2013

Update: Latest version is available here!

Our paper on leveraging flexibility in endpoint placement, aka Sinbad/Usher/Orchestrated File System (OFS), has been accepted for publication at this year’s SIGCOMM. We introduce constrained anycast in data-intensive clusters in the form of network-aware replica placement for cluster/distributed file systems. This is the second piece of my dissertation research, where the overarching goal is to push slightly more application semantics into the network to get disproportionately larger improvements. Sometimes things are unfair in a good way!

Many applications do not constrain the destinations of their network transfers. New opportunities emerge when such transfers contribute a large amount of network bytes. By choosing the endpoints to avoid congested links, completion times of these transfers, as well as those of others without similar flexibility, can be improved. In this paper, we focus on leveraging the flexibility in replica placement during writes to cluster file systems (CFSes), which account for almost half of the cross-rack traffic in data-intensive clusters. CFS writes only require placing replicas in multiple fault domains and a balanced use of storage, but the replicas can be in any subset of machines in the cluster.

We study CFS interactions with the cluster network, analyze optimizations for replica placement, and propose Sinbad, a system that identifies imbalance and adapts replica destinations to navigate around congested links. Sinbad does so with little impact on long-term storage load balance. Experiments on EC2 (trace-driven simulations) show that block writes complete 1.3X (1.58X) faster as the network becomes more balanced. As a collateral benefit, the end-to-end completion times of data-intensive jobs improve by up to 1.26X. Trace-driven simulations also show that Sinbad's improvement (1.58X) is close to that of a loose upper bound (1.89X) of the optimal.
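The core placement step can be sketched greedily: among racks that satisfy the fault-domain constraint, write to the ones behind the least loaded links. This is my own simplification (Sinbad also preserves long-term storage balance, which this ignores), and the link loads below are invented:

```python
def place_replicas(rack_link_load, num_replicas, exclude=()):
    """rack_link_load: rack -> estimated utilization of its downlink (0..1).
    Greedily pick the least congested racks, skipping any rack excluded
    by the fault-domain constraint (e.g., the writer's own rack)."""
    candidates = {r: u for r, u in rack_link_load.items() if r not in exclude}
    return sorted(candidates, key=candidates.get)[:num_replicas]

# Place 3 replicas while avoiding rack1, the writer's own fault domain.
load = {"rack1": 0.9, "rack2": 0.2, "rack3": 0.5, "rack4": 0.1}
print(place_replicas(load, 3, exclude=("rack1",)))
# -> ['rack4', 'rack2', 'rack3']
```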

This is a joint work with Srikanth Kandula and Ion Stoica. One cool fact is that we collaborated with Srikanth almost entirely over Skype for almost a year! I did take up a lot of his time from his summer interns, but most of them had their papers in too. There are a LOT of people to thank, including Yuan Zhong, Matei Zaharia, Gautam Kumar, Dave Maltz, Ganesh Ananthanarayanan, Ali Ghodsi, Raj Jain, and everyone else I'm forgetting right now. It took quite some time to put our ideas crisply in a few pages, but in the end the reception from the PC was great. We hope it will be useful in practice too.

This year the SIGCOMM PC accepted 38 papers out of 240 submissions with a 15.83% acceptance rate. This is the highest acceptance rate since 1996 (before Ion Stoica had any SIGCOMM paper) and the highest number of papers accepted since 1987 (before Scott Shenker had any paper in SIGCOMM)!

I’m a PhD candidate now!

I finally took the qualifying exam this morning and presented my proposal in front of professors Sylvia Ratnasamy, Ion Stoica, Scott Shenker, and Marti Hearst. It was good to hear them say I passed :)

My proposal was on using coflows to better manage the network for data-intensive cluster applications: how to manage each one individually, how to manage them together, and how to move everything else out of coflows' way.

My talk improved for the better through practice sessions with Yuan, David, Ali, Ganesh, Panda, Sameer, Anand, and Arka. I'd also like to thank Matei, Gautam, and TD, who helped me even though they were traveling. This is one of the great things about the AMPLab; my lab mates are awesome!

Looking forward to getting some work done in the coming months.