Tag Archives: Datacenter Networking

CODA Accepted to Appear at SIGCOMM’2016

Update: Camera-ready version is available here now!

Since introducing the coflow abstraction in 2012, we’ve been working hard to make it practical one step at a time. Over the years, we’ve worked on efficient coflow scheduling, removed clairvoyance requirements in coflow scheduling, and performed fair sharing among coexisting coflows. Throughout all these efforts, one requirement remained constant: all distributed data-parallel applications had to be modified to correctly use the same coflow API. Unfortunately, this is difficult, if not impossible, because of the sheer number of computing frameworks available today.

CODA makes a first attempt at addressing this problem head on. The intuition is simple: if we can automatically identify coflows from the network, applications need not be modified. There are at least two key challenges in achieving this goal. The first is identifying coflows. Unlike traditional traffic classification, coflows cannot be identified using five-tuples; instead, we must use domain-specific insights to capture the logical relationship among multiple flows. The second is containing the impact of misclassification, because even the best classifier will sometimes be wrong: we must be able to schedule coflows even when some flows have been misclassified into the wrong coflows. Our solution errs on the side of caution to perform error-tolerant coflow scheduling. Overall, CODA approximates Aalo very closely and significantly outperforms flow-based solutions.

Leveraging application-level requirements using coflows has recently been shown to improve application-level communication performance in data-parallel clusters. However, existing coflow-based solutions rely on modifying applications to extract coflows, making them inapplicable to many practical scenarios. In this paper, we present CODA, a first attempt at automatically identifying and scheduling coflows without any application-level modifications.

We employ an incremental clustering algorithm to perform fast, application-transparent coflow identification and complement it by proposing an error-tolerant coflow scheduler to mitigate occasional identification errors. Testbed experiments and large-scale simulations with realistic production workloads show that CODA can identify coflows with more than 90% accuracy, and CODA’s scheduler is robust to inaccuracies, enabling communication stages to complete 2.4X (5.1X) faster on average (95th percentile) in comparison to per-flow mechanisms. CODA’s performance is comparable to that of solutions requiring application-level modifications.
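To make the identification step concrete, here is a minimal sketch of attribute-based incremental clustering, assuming flows are grouped by start-time proximity and shared endpoints; the attributes, weights, and threshold are illustrative and are not CODA's actual algorithm or parameters.

```python
# Illustrative sketch of incrementally clustering flows into coflows.
# NOT CODA's actual algorithm; attributes, weights, and the distance
# threshold below are hypothetical.

from dataclasses import dataclass, field

@dataclass
class Flow:
    start_time: float   # seconds
    src_host: str
    dst_host: str

@dataclass
class Coflow:
    flows: list = field(default_factory=list)

def distance(flow, coflow):
    """Smaller distance = more likely to belong to the same coflow."""
    rep = coflow.flows[-1]  # compare against the most recent member
    time_gap = abs(flow.start_time - rep.start_time)
    shares_machine = int(flow.src_host == rep.src_host or
                         flow.dst_host == rep.dst_host)
    # Hypothetical weighting: temporal proximity plus shared endpoints.
    return time_gap - 0.5 * shares_machine

def identify(flows, threshold=1.0):
    """Assign each arriving flow to the closest coflow, or start a new one."""
    coflows = []
    for flow in sorted(flows, key=lambda f: f.start_time):
        best = min(coflows, key=lambda c: distance(flow, c), default=None)
        if best is not None and distance(flow, best) <= threshold:
            best.flows.append(flow)
        else:
            coflows.append(Coflow(flows=[flow]))
    return coflows
```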

This work is a tight collaboration with Kai Chen and his students at HKUST: Hong Zhang, Bairen Yi, and Li Chen. It started with Hong sending me a short note asking whether I thought the problem was worth addressing. Of course I did; it is the second future work listed in my dissertation!

This year the SIGCOMM PC accepted 39 papers out of 225 submissions with a 17.33% acceptance rate. This also happens to be my first SIGCOMM in an advisory role and the first one without having Ion onboard. I am thankful to Kai for giving me full access to his excellent, hard-working students. It’s been a gratifying experience to work with them, and I look forward to further collaborations. Finally, I’m excited to start my life as an advisor on a high note. Go Blue!

HUG Accepted to Appear at NSDI’2016

Update: Camera-ready version is available here now!

With the advent of cloud computing and datacenter-scale applications, simultaneously dealing with multiple resources is the new norm. When multiple parties have multi-resource demands, fairly dividing these resources (for some notion of fairness) is a core challenge in the resource allocation literature. Dominant Resource Fairness (DRF) in NSDI’2011 was the first work to point out the ensuing challenges and to extend traditional, single-resource max-min fairness to the multi-resource environment.

While DRF guarantees many nice properties to ensure fairness, it gives up work conservation; we show that its utilization can be arbitrarily low in the worst case. Over the past five years, many software and research proposals have built upon DRF under the assumption that DRF can trivially be made work-conserving without losing any of its properties.

After five years, we prove that not to be the case. We show that there is a fundamental tradeoff between guaranteed resources and work conservation, and that naively adding work conservation on top of DRF can actually decrease utilization instead of increasing it. We also characterize the maximum possible utilization attainable while maintaining DRF’s optimal guarantees, and we propose a new algorithm, High Utilization with Guarantees (HUG), that generalizes single- and multi-resource max-min fairness as well as the multi-tenant network sharing literature under a unifying framework.

In this paper, we study how to optimally provide isolation guarantees in multi-resource environments, such as public clouds, where a tenant’s demands on different resources (links) are correlated in order for her to make progress. Unlike prior work such as Dominant Resource Fairness (DRF) that assumes demands to be static and fixed, we consider elastic demands. Our approach generalizes canonical max-min fairness to the multi-resource setting with correlated demands, and extends DRF to elastic demands. We consider two natural optimization objectives: isolation guarantee from a tenant’s viewpoint and system utilization (work conservation) from an operator’s perspective. We prove that in non-cooperative environments like public cloud networks, there is a strong tradeoff between optimal isolation guarantee and work conservation when demands are elastic. Worse, work conservation can even decrease network utilization instead of improving it when demands are inelastic. We identify the root cause behind the tradeoff and present a provably optimal allocation algorithm, High Utilization with Guarantees (HUG), to achieve maximum attainable network utilization without sacrificing the optimal isolation guarantee, strategy-proofness, and other useful properties of DRF. In cooperative environments like private datacenter networks, HUG achieves both the optimal isolation guarantee and work conservation. Analyses, simulations, and experiments show that HUG provides better isolation guarantees, higher system utilization, and better tenant-level performance than its counterparts.
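To see the tension concretely, here is a toy two-tenant, two-link calculation; all numbers are made up, and the code sketches the intuition only rather than the HUG algorithm itself. Serving every tenant's guarantee can still leave a link mostly idle, and naively handing out the spare capacity does not help.

```python
# Toy two-tenant, two-link example (all numbers made up) illustrating why
# naive work conservation is at odds with isolation guarantees. This is a
# sketch of the intuition, not the HUG algorithm from the paper.

demands = {                 # traffic on (link1, link2) per unit of progress
    "A": (1.0, 0.5),
    "B": (1.0, 0.0),
}
capacity = (1.0, 1.0)
links = range(len(capacity))

# Optimal isolation guarantee: the largest equal progress x such that no
# link is oversubscribed (DRF-style max-min over correlated demands).
progress = min(capacity[l] / sum(d[l] for d in demands.values())
               for l in links if sum(d[l] for d in demands.values()) > 0)
alloc = {t: [progress * d[l] for l in links] for t, d in demands.items()}
used = [sum(alloc[t][l] for t in alloc) for l in links]

print(progress)   # 0.5 -> each tenant is guaranteed half of its bottleneck
print(used)       # [1.0, 0.25] -> link 2 is 75% idle

# Handing all of link 2's spare capacity to tenant A would not raise A's
# progress (A is bottlenecked on link 1) and, as the paper shows, can give
# tenants an incentive to inflate their demands. HUG instead caps each
# tenant's per-link share before filling spare capacity, preserving both
# the guarantee above and strategy-proofness.
```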

This work is the result of a close collaboration with Zhenhua Liu with help from Ion Stoica and Ali Ghodsi.

This year the NSDI PC accepted 45 out of 228 submissions with a 19.74% acceptance rate. This work forms the basis of the last chapter of my dissertation, ensuring SIGCOMM or NSDI publications from all its primary chapters!

Aalo Accepted to Appear at SIGCOMM’2015

Update: Camera-ready version is available here now!

Last SIGCOMM we introduced the coflow scheduling problem and presented Varys, which addressed its clairvoyant variation, i.e., when all the information about individual coflows is known a priori and there are no cluster or task scheduling dynamics. In many cases, these assumptions do not hold very well, leaving us with two primary challenges. First, how to schedule coflows when individual flow sizes, the total number of flows in a coflow, or their endpoints are unknown. Second, how to schedule coflows for jobs with more than one stage forming DAGs, where tasks can be scheduled in multiple waves. Aalo addresses both problems by providing the first solution to the non-clairvoyant coflow scheduling problem: it can efficiently schedule coflows without any prior knowledge of coflow characteristics and still approximates Varys’s performance very closely. This makes coflows practical in almost all scenarios we face in data-parallel communication.

Leveraging application-level requirements exposed through coflows has recently been shown to improve application-level communication performance in data-parallel clusters. However, existing efficient schedulers require a priori coflow information (e.g., flow size) and ignore cluster dynamics like pipelining, task failures, and speculative executions, which limit their applicability. Schedulers without prior knowledge compromise on performance to avoid head-of-line blocking. In this paper, we present Aalo that strikes a balance and efficiently schedules coflows without prior knowledge.

Aalo employs Discretized Coflow-Aware Least-Attained Service (D-CLAS) to separate coflows into a small number of priority queues based on how much they have already sent across the cluster. By performing prioritization across queues and by scheduling coflows in FIFO order within each queue, Aalo’s non-clairvoyant scheduler can schedule diverse coflows and minimize their completion times. EC2 deployments and trace-driven simulations show that communication stages complete 1.93X faster on average and 3.59X faster at the 95th percentile using Aalo in comparison to per-flow mechanisms. Aalo’s performance is comparable to that of solutions using prior knowledge, and Aalo outperforms them in the presence of cluster dynamics.
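As a rough sketch of the queueing logic, a coflow is demoted to lower-priority queues as the bytes it has sent grow. The number of queues and the exponentially spaced thresholds below are illustrative rather than Aalo's actual settings, and strict priority stands in here for the cross-queue prioritization described above.

```python
# Minimal sketch of D-CLAS-style queue assignment: a coflow moves to a
# lower-priority queue as its total bytes sent grows. Queue count and
# thresholds are hypothetical, not the values used in Aalo.

NUM_QUEUES = 5
FIRST_THRESHOLD = 10 * 1024 * 1024   # 10 MB (hypothetical)
MULTIPLIER = 10                      # exponential spacing (hypothetical)

def queue_of(bytes_sent: int) -> int:
    """Return the priority queue (0 = highest priority) for a coflow."""
    threshold = FIRST_THRESHOLD
    for q in range(NUM_QUEUES - 1):
        if bytes_sent < threshold:
            return q
        threshold *= MULTIPLIER
    return NUM_QUEUES - 1            # the largest coflows end up here

def schedule(coflows):
    """coflows: list of (coflow_id, bytes_sent, arrival_time).
    Strict priority across queues (a simplification), FIFO within a queue."""
    return sorted(coflows, key=lambda c: (queue_of(c[1]), c[2]))

jobs = [("big", 500 * 1024 * 1024, 0.0), ("small", 5 * 1024 * 1024, 1.0)]
print(schedule(jobs))   # "small" (queue 0) is served before "big" (queue 2)
```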

This is a joint work with Ion, the first time we have written a conference paper just between the two of us!

This year the SIGCOMM PC accepted 40 papers out of 262 submissions with a 15.27% acceptance rate. This also happens to be my last SIGCOMM as a student. Glad to be lucky enough to end on a high note!

Orchestra is the Default Broadcast Mechanism in Apache Spark

With its recent release, Apache Spark has promoted Cornet, the BitTorrent-like broadcast mechanism proposed in Orchestra (SIGCOMM’11), to be its default broadcast mechanism. It’s great to see our research make it into the real world! Many thanks to Reynold and others for making it happen.

MLlib, the machine learning library of Spark, will enjoy the biggest boost from this change because of the broadcast-heavy nature of many machine learning algorithms.

Presented Varys at SIGCOMM’2014

Update: Slides from my talk are online now!

Just got back from a vibrant SIGCOMM! I presented our recent work on application-aware network scheduling using the coflow abstraction (Varys). This is my fifth time attending and third time giving a talk. Great pleasure as always in meeting old friends and making new ones!

SIGCOMM was held in Chicago this year, and about 715 people attended.

Varys Developer Alpha Released!

We are glad to announce the first open-source release of Varys, an application-aware network scheduler for data-parallel clusters using the coflow abstraction. It’s a stripped-down dev-alpha release for the experts, so please be patient with it!

A quick overview of the system can be found at varys.net. Here is a 30-second summary:

Varys is an open source network manager/scheduler that aims to improve communication performance of Big Data applications. Its target applications/jobs include those written in Spark, Hadoop, YARN, BSP, and similar data-parallel frameworks.

Varys provides a simple API that allows data-parallel frameworks to express their communication requirements as coflows with minimal changes to the framework. Using coflows as the basic abstraction of network scheduling, Varys implements novel schedulers either to make applications faster or to make time-restricted applications complete within deadlines.
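For a flavor of what this looks like from a framework’s point of view, here is a hypothetical Python sketch modeled on the register/put/get/unregister coflow API described in the Varys paper; the stub client below is a stand-in for illustration, and the released (Scala) code’s actual signatures may differ.

```python
# Hypothetical sketch of how a framework might hand a shuffle to a coflow
# scheduler, modeled on the register/put/get/unregister API described in
# the Varys paper. VarysStub is an in-memory stand-in for illustration.

class VarysStub:
    """Illustrative stand-in for a Varys client; stores data in memory."""
    def __init__(self):
        self.store, self.next_id = {}, 0

    def register_coflow(self, num_flows):
        self.next_id += 1
        return self.next_id                      # real client talks to the Varys master

    def put(self, coflow_id, data_id, content):
        self.store[(coflow_id, data_id)] = content

    def get(self, coflow_id, data_id):
        # A real client fetches over the network, letting Varys schedule the flow.
        return self.store[(coflow_id, data_id)]

    def unregister_coflow(self, coflow_id):
        self.store = {k: v for k, v in self.store.items() if k[0] != coflow_id}

client = VarysStub()
num_mappers, num_reducers = 2, 2

# The driver registers one coflow for the entire shuffle stage.
cid = client.register_coflow(num_flows=num_mappers * num_reducers)

# Mappers publish partitions and reducers fetch them under the same coflow,
# so the scheduler can optimize the whole collection of flows as a unit.
for m in range(num_mappers):
    for r in range(num_reducers):
        client.put(cid, data_id=f"m{m}-r{r}", content=b"partition bytes")
for r in range(num_reducers):
    parts = [client.get(cid, data_id=f"m{m}-r{r}") for m in range(num_mappers)]

client.unregister_coflow(cid)
```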

Primary features included in this initial release are:

  • Support for in-memory and on-disk coflows,
  • Efficient scheduling to minimize the average coflow completion times, and
  • In the deadline-sensitive mode, support for soft deadlines.

Here are some links if you want to check it out, contribute to make it better, or pass it along to someone who can help us.

Project Website: http://varys.net
Git repository: https://github.com/coflow/varys
Relevant tools: https://github.com/coflow
Research papers
1. Efficient Coflow Scheduling with Varys
2. Coflow: A Networking Abstraction for Cluster Applications

The project is still in its early stages and can use all the help it can get. We appreciate your feedback.

Varys accepted at SIGCOMM’2014: Coflow is coming!

Update 2: Varys Developer Alpha released!

Update 1: The latest version is available here now!

We introduced the coflow abstraction back in 2012 at HotNets-XI, and now we have something solid to show what coflows are good for. Our paper on efficient and deadline-sensitive coflow scheduling has been accepted to appear at this year’s SIGCOMM. By exploiting a little bit of application-level information, coflows enable significantly better performance for distributed data-intensive jobs. In the process, we have discovered, characterized, and proposed heuristics for a novel scheduling problem: “Concurrent Open Shop Scheduling with Coupled Resources.” Yeah, it’s a mouthful; but hey, how often does one stumble upon a new scheduling problem?

Communication in data-parallel applications often involves a collection of parallel flows. Traditional techniques to optimize flow-level metrics do not perform well in optimizing such collections, because the network is largely agnostic to application-level requirements. The recently proposed coflow abstraction bridges this gap and creates new opportunities for network scheduling. In this paper, we address inter-coflow scheduling for two different objectives: decreasing communication time of data-intensive jobs and guaranteeing predictable communication time. We introduce the concurrent open shop scheduling with coupled resources problem, analyze its complexity, and propose effective heuristics to optimize either objective. We present Varys, a system that enables data-intensive frameworks to use coflows and the proposed algorithms while maintaining high network utilization and guaranteeing starvation freedom. EC2 deployments and trace-driven simulations show that communication stages complete up to 3.16X faster on average and up to 2X more coflows meet their deadlines using Varys in comparison to per-flow mechanisms. Moreover, Varys outperforms non-preemptive coflow schedulers by more than 5X.
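As a taste of those heuristics, here is a simplified sketch of ordering coflows by their bottleneck completion time, in the spirit of Varys’s Smallest-Effective-Bottleneck-First ordering; it ignores rate allocation, deadlines, and starvation freedom, and the data structures are illustrative.

```python
# Rough sketch of ordering coflows by bottleneck completion time, in the
# spirit of Varys's Smallest-Effective-Bottleneck-First heuristic. This
# simplified version ignores rate reallocation, deadlines, and starvation
# freedom; all data structures and numbers are illustrative.

def bottleneck_time(coflow, port_bandwidth):
    """coflow: {(src_port, dst_port): bytes}. Returns the time the most
    heavily loaded ingress/egress port needs to move this coflow's data."""
    load = {}
    for (src, dst), nbytes in coflow.items():
        load[src] = load.get(src, 0) + nbytes
        load[dst] = load.get(dst, 0) + nbytes
    return max(b / port_bandwidth[p] for p, b in load.items())

def sebf_order(coflows, port_bandwidth):
    """Serve the coflow with the smallest bottleneck first."""
    return sorted(coflows, key=lambda c: bottleneck_time(c, port_bandwidth))

# Example: two coflows over sender ports s1, s2 and receiver port r1,
# each port with 1 GB/s of bandwidth.
bw = {"s1": 1e9, "s2": 1e9, "r1": 1e9}
small = {("s1", "r1"): 1e8}                       # ~0.1 s bottleneck
large = {("s1", "r1"): 5e8, ("s2", "r1"): 5e8}    # ~1.0 s bottleneck at r1
print(sebf_order([large, small], bw) == [small, large])   # True
```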

This is a joint work with Yuan Zhong and Ion Stoica. But as always, it took a village! I’d like to thank Mohammad Alizadeh, Justine Sherry, Peter Bailis, Rachit Agarwal, Ganesh Ananthanarayanan, Tathagata Das, Ali Ghodsi, Gautam Kumar, David Zats, Matei Zaharia, the NSDI and SIGCOMM reviewers, and everyone else who patiently read many drafts or listened to my spiel over the years. It’s been in the oven for many years, and I’m glad that it was so well received.

I believe coflows are the future, and they are coming for the good of the realm (read: the network). Varys is only the tip of the iceberg, and there are way too many interesting problems to solve (both for networking and theory folks!). We are in the process of open sourcing Varys over the coming weeks and look forward to seeing it used with major data-intensive frameworks. Personally, I’m happy with how my dissertation research is shaping up :)

This year the SIGCOMM PC accepted 45 papers out of 237 submissions with an 18.99% acceptance rate. This is the highest acceptance rate since 1995 and the highest number of papers ever accepted (tying with SIGCOMM’86)! After years of everyone complaining about the acceptance rate, this year’s PC has taken some action; probably a healthy step for the community.

Coflow accepted at HotNets’2012

Update: Coflow camera-ready is available online! Tell us what you think!

Our position paper addressing the lack of a networking abstraction for cluster applications, “Coflow: A Networking Abstraction for Cluster Applications,” has been accepted at this year’s workshop on hot topics in networking (HotNets). We make the observation that thinking in terms of flows is good, but we can do better if we think in terms of collections of flows, using application semantics, while optimizing datacenter-scale applications.

Cluster computing applications — frameworks like MapReduce and user-facing applications like search platforms — have application-level requirements and higher-level abstractions to express them. However, there exists no networking abstraction that can take advantage of the rich semantics readily available from these data parallel applications.

We propose coflow, a networking abstraction to express the communication requirements of prevalent data parallel programming paradigms. Coflows make it easier for the applications to convey their communication semantics to the network, which in turn enables the network to better optimize common communication patterns.

This observation is a culmination of some of the work I’ve been fortunate to be part of in the last couple of years (especially Orchestra@SIGCOMM’11 and, more recently, Ahimsa@HotCloud’12 and FairCloud@SIGCOMM’12) as well as work done by others in the area (e.g., D3@SIGCOMM’11). Looking forward to constructive discussions in Redmond next month!

FTR, HotNets PC this year accepted 23 papers out of 120 submissions.

Ahimsa accepted at HotCloud’2012

Update: Camera-ready is available online! Do let us know what you think in the comments section.

Our exploratory paper on the complexity of a transfer, “Redefining Network Fairness to Support Data Parallelism,” has been accepted for publication at this year’s HotCloud workshop!

In Orchestra, we defined the notion of transfers in the context of cluster computing, and in FairCloud, we argued for fairness across multiple transfers. However, we have so far been considering transfers independently of the computations they enable. Gautam observed that not all transfers are created equal: when we scale the input to a computation up or down, the input to its transfers does not always scale linearly (e.g., partitioned transfers like shuffles in a MapReduce program scale linearly, whereas broadcast has a super-linear scaling factor). As a result, network fairness, when defined in terms of bandwidth, does not always match the simple goal of data parallelism: “given n times more resources, a data parallel application can expect to complete n times faster.” Ahimsa explores a notion of network fairness that can match this goal.
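Here is a toy back-of-the-envelope model, not taken from the paper, of why bandwidth-proportional sharing can fall short of that goal: adding machines shrinks a shuffle’s transfer time but not a broadcast’s, because the broadcast volume grows with the number of receivers. All sizes and the flat per-machine bandwidth are made up.

```python
# Toy model (not from the paper) of how adding machines affects two kinds
# of transfers. All sizes and the flat per-machine bandwidth are made up.

BW = 1e9                # bytes/s of network bandwidth per machine (assumed)
shuffle_bytes = 100e9   # total data moved by a shuffle, independent of cluster size
broadcast_src = 1e9     # data broadcast from one machine to every other machine

def transfer_times(n_machines):
    # The shuffle's total volume is fixed but is spread over n senders/receivers.
    shuffle_t = shuffle_bytes / (n_machines * BW)
    # The broadcast's volume grows with the number of receivers, so extra
    # machines do not shorten it: each receiver still needs the full copy.
    broadcast_t = broadcast_src / BW
    return shuffle_t, broadcast_t

for n in (10, 100):
    print(n, transfer_times(n))
# 10  -> (10.0 s shuffle, 1.0 s broadcast)
# 100 -> ( 1.0 s shuffle, 1.0 s broadcast)  # broadcast did not speed up
```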

This year, HotCloud accepted 24 out of 75 submissions, six of which have at least one Berkeley author :)

“Surviving Failures in Bandwidth-Constrained Datacenters” at SIGCOMM’2012

Update: Camera-ready version is in my publications page!

My internship work from last summer has been accepted for publication at SIGCOMM’2012 as well; yay!! In this piece of work, we try to allocate machines for datacenter applications with bandwidth and fault-tolerance constraints, which are at odds: allocation for bandwidth tries to put machines closer together, whereas a fault-tolerant allocation spreads machines out across multiple fault domains.

Datacenter networks have been designed to tolerate failures of network equipment and provide sufficient bandwidth. In practice, however, failures and maintenance of networking and power equipment often make tens to thousands of servers unavailable, and network congestion can increase service latency. Unfortunately, there exists an inherent tradeoff between achieving high fault tolerance and reducing bandwidth usage in the network core; spreading servers across fault domains improves fault tolerance, but requires additional bandwidth, while deploying servers together reduces bandwidth usage, but also decreases fault tolerance. We present a detailed analysis of a large-scale Web application and its communication patterns. Based on that, we propose and evaluate a novel optimization framework that achieves both high fault tolerance and significantly reduced bandwidth usage in the network core by exploiting the skewness in the observed communication patterns.
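As a toy illustration of that tension (with made-up numbers and a deliberately simplistic placement model), consider spreading a job’s eight servers across a varying number of fault domains:

```python
# Toy illustration (numbers made up) of the tradeoff described above:
# packing servers into one fault domain uses no core bandwidth but loses
# everything on a single failure; spreading them bounds the loss but pushes
# traffic through the core.

servers = 8
traffic_per_pair = 1.0   # arbitrary units between any two of the job's servers

def placement_stats(num_domains):
    per_domain = servers // num_domains
    worst_case_loss = per_domain / servers                  # one domain fails
    # Pairs whose endpoints sit in different domains traverse the core.
    cross_pairs = servers * (servers - per_domain) / 2
    return worst_case_loss, cross_pairs * traffic_per_pair

for domains in (1, 2, 4, 8):
    print(domains, placement_stats(domains))
# 1 domain : lose 100%  on a failure, 0 core traffic
# 8 domains: lose 12.5% on a failure, maximum core traffic
```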

During my Master’s, I worked on several variations of a similar problem called virtual network embedding in the context of network virtualization (INFOCOM’2009, Networking’2010, VISA’2010, ToN’2012).

This year, 32 out of 235 papers were accepted at SIGCOMM, seven of which have at least one Berkeley author.