Tag Archives: Big Data Systems

CODA Accepted to Appear at SIGCOMM’2016

Update: Camera-ready version is available here now!

Since introducing the coflow abstraction in 2012, we’ve been working hard to make it practical one step at a time. Over the years, we’ve worked on efficient coflow scheduling, removed clairvoyance requirements from coflow scheduling, and enabled fair sharing among coexisting coflows. Throughout all these efforts, one requirement remained constant: every distributed data-parallel application had to be modified to correctly use the same coflow API. Unfortunately, this is difficult, if not impossible, because of the sheer number of computing frameworks available today.

CODA makes a first attempt at addressing this problem head on. The intuition is simple: if we can automatically identify coflows from the network, applications need not be modified. There are at least two key challenges in achieving this goal. The first is classifying coflows: unlike traditional traffic classification, coflows cannot be identified using five-tuples; instead, we must use domain-specific insights to capture the logical relationship between multiple flows. The second is containing the impact of misclassification, because even the best classifier will sometimes be wrong. We must be able to schedule coflows even when some flows have been misclassified into the wrong coflows. Our solution errs on the side of caution to perform error-tolerant coflow scheduling. Overall, CODA approximates Aalo very closely and significantly outperforms flow-based solutions.

Leveraging application-level requirements using coflows has recently been shown to improve application-level communication performance in data-parallel clusters. However, existing coflow-based solutions rely on modifying applications to extract coflows, making them inapplicable to many practical scenarios. In this paper, we present CODA, a first attempt at automatically identifying and scheduling coflows without any application-level modifications.

We employ an incremental clustering algorithm to perform fast, application-transparent coflow identification and complement it by proposing an error-tolerant coflow scheduler to mitigate occasional identification errors. Testbed experiments and large-scale simulations with realistic production workloads show that CODA can identify coflows with more than 90% accuracy, and CODA’s scheduler is robust to inaccuracies, enabling communication stages to complete 2.4X (5.1X) faster on average (95th percentile) in comparison to per-flow mechanisms. CODA’s performance is comparable to that of solutions requiring application-level modifications.
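
To make the identification step concrete, here is a minimal sketch of the intuition, not CODA’s actual algorithm: incrementally cluster flows whose start times and endpoints suggest they belong to the same communication stage. All types, names, and thresholds below are hypothetical; the real system uses richer distance metrics over many more attributes.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical flow record: only attributes visible from the network.
case class Flow(srcHost: String, dstHost: String, startMillis: Long)

object CoflowIdentifier {
  // Flows starting within this window *may* belong to the same coflow.
  val StartTimeWindowMillis = 100L

  // Incrementally assign each arriving flow to an existing cluster
  // (candidate coflow) if it is "close" to one; otherwise start a new one.
  def identify(flows: Seq[Flow]): Seq[Seq[Flow]] = {
    val clusters = ArrayBuffer.empty[ArrayBuffer[Flow]]
    for (f <- flows.sortBy(_.startMillis)) {
      clusters.find { c =>
        // Distance heuristic: temporal proximity plus a shared endpoint.
        math.abs(f.startMillis - c.last.startMillis) <= StartTimeWindowMillis &&
          c.exists(p => p.srcHost == f.srcHost || p.dstHost == f.dstHost)
      } match {
        case Some(c) => c += f                      // join the closest match
        case None    => clusters += ArrayBuffer(f)  // new candidate coflow
      }
    }
    clusters.map(_.toSeq).toSeq
  }
}
```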

This work is a tight collaboration with Kai Chen and his students at HKUST: Hong Zhang, Bairen Yi, and Li Chen. It started with Hong sending me a short note asking whether I thought the problem was worth addressing. Of course I did; it is the second future work listed in my dissertation!

This year the SIGCOMM PC accepted 39 papers out of 225 submissions with a 17.33% acceptance rate. This also happens to be my first SIGCOMM in an advisory role and the first one without having Ion onboard. I am thankful to Kai for giving me full access to his excellent, hard-working students. It’s been a gratifying experience to work with them, and I look forward to further collaborations. Finally, I’m excited to start my life as an advisor on a high note. Go Blue!

Aalo Accepted to Appear at SIGCOMM’2015

Update: Camera-ready version is available here now!

Last SIGCOMM we introduced the coflow scheduling problem and presented Varys, which addressed its clairvoyant variation, i.e., when all information about individual coflows is known a priori and there are no cluster or task scheduling dynamics. In many cases, these assumptions do not hold very well, leaving us with two primary challenges. First, how do we schedule coflows when individual flow sizes, the total number of flows in a coflow, or their endpoints are unknown? Second, how do we schedule coflows for jobs with more than one stage (forming DAGs), whose tasks can be scheduled in multiple waves? Aalo addresses both problems by providing the first solution to the non-clairvoyant coflow scheduling problem: it can efficiently schedule coflows without any prior knowledge of coflow characteristics and still approximates Varys’s performance very closely. This makes coflows practical in almost all scenarios we face in data-parallel communication.

Leveraging application-level requirements exposed through coflows has recently been shown to improve application-level communication performance in data-parallel clusters. However, existing efficient schedulers require a priori coflow information (e.g., flow size) and ignore cluster dynamics like pipelining, task failures, and speculative executions, which limits their applicability. Schedulers without prior knowledge compromise on performance to avoid head-of-line blocking. In this paper, we present Aalo, which strikes a balance and efficiently schedules coflows without prior knowledge.

Aalo employs Discretized Coflow-Aware Least-Attained Service (D-CLAS) to separate coflows into a small number of priority queues based on how much they have already sent across the cluster. By prioritizing across queues and scheduling coflows in FIFO order within each queue, Aalo’s non-clairvoyant scheduler can schedule diverse coflows and minimize their completion times. EC2 deployments and trace-driven simulations show that communication stages complete 1.93X faster on average and 3.59X faster at the 95th percentile using Aalo in comparison to per-flow mechanisms. Aalo’s performance is comparable to that of solutions using prior knowledge, and Aalo outperforms them in the presence of cluster dynamics.
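
For the curious, here is a minimal sketch of the D-CLAS idea, assuming exponentially spaced queue thresholds; the constants below are illustrative, not Aalo’s tuned defaults. The key property is that a coflow’s priority depends only on how much it has already sent, so no prior knowledge is needed.

```scala
object DClas {
  val FirstThresholdBytes = 10L * 1024 * 1024 // illustrative: queue 0 holds coflows that sent < 10 MB
  val Multiplier = 10L                        // thresholds grow exponentially: 10 MB, 100 MB, 1 GB, ...
  val NumQueues = 10

  // Queue 0 is the highest priority. As a coflow sends more bytes across
  // the cluster, it is demoted to lower-priority queues, approximating
  // least-attained service without knowing flow sizes in advance.
  def queueOf(bytesSentSoFar: Long): Int = {
    var q = 0
    var threshold = FirstThresholdBytes
    while (q < NumQueues - 1 && bytesSentSoFar >= threshold) {
      q += 1
      threshold *= Multiplier
    }
    q
  }
}
```

Within each queue, coflows are served in FIFO order of arrival, while higher-priority queues are favored across queues; that is what lets short coflows finish quickly without starving long ones.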

This is a joint work with Ion, the first time we have written a conference paper just between the two of us!

This year the SIGCOMM PC accepted 40 papers out of 262 submissions with a 15.27% acceptance rate. This also happens to be my last SIGCOMM as a student. Glad to be lucky enough to end on a high note!

Orchestra is the Default Broadcast Mechanism in Apache Spark

With its recent release, Apache Spark has promoted Cornet, the BitTorrent-like broadcast mechanism proposed in Orchestra (SIGCOMM’11), to become its default broadcast mechanism. It’s great to see our research make its way into the real world! Many thanks to Reynold and others for making it happen.

MLlib, Spark’s machine learning library, will enjoy the biggest boost from this change because many machine learning algorithms are broadcast-heavy.
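
The nice part is that nothing changes in user code: Spark’s existing broadcast API simply uses Cornet under the hood. A standard usage example:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-example"))
    // Ship a lookup table to every executor once (now via Cornet),
    // instead of re-sending it with every task.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val total = sc.parallelize(Seq("a", "b", "a"))
      .map(key => lookup.value.getOrElse(key, 0))
      .reduce(_ + _)
    println(s"total = $total")
    sc.stop()
  }
}
```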

Presented Varys at SIGCOMM’2014

Update: Slides from my talk are online now!

Just got back from a vibrant SIGCOMM! I presented our recent work on application-aware network scheduling using the coflow abstraction (Varys). This is my fifth time attending and third time giving a talk. Great pleasure as always in meeting old friends and making new ones!

SIGCOMM was held in Chicago this year, and about 715 people attended.

Varys Developer Alpha Released!

We are glad to announce the first open-source release of Varys, an application-aware network scheduler for data-parallel clusters using the coflow abstraction. It’s a stripped-down dev-alpha release for the experts, so please be patient with it!

A quick overview of the system can be found at varys.net. Here is a 30-second summary:

Varys is an open source network manager/scheduler that aims to improve communication performance of Big Data applications. Its target applications/jobs include those written in Spark, Hadoop, YARN, BSP, and similar data-parallel frameworks.

Varys provides a simple API that allows data-parallel frameworks to express their communication requirements as coflows with minimal changes to the framework. Using coflows as the basic abstraction of network scheduling, Varys implements novel schedulers either to make applications faster or to make time-restricted applications complete within deadlines.
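
To give a flavor of the API, here is a paraphrase of the register/put/get/unregister-style interface described in the Varys paper; the exact shipped signatures may differ, so treat this as a sketch rather than the released API.

```scala
// A paraphrase of the coflow API from the Varys paper; exact signatures
// in the release may differ.
trait VarysClient {
  // A driver registers a coflow and shares the returned id with its tasks.
  def registerCoflow(numFlows: Int): String

  // Senders put data under the coflow; receivers fetch it by id.
  // Varys schedules the underlying flows collectively, as one coflow.
  def put(coflowId: String, dataId: String, payload: Array[Byte]): Unit
  def get(coflowId: String, dataId: String): Array[Byte]

  // The driver unregisters the coflow once the communication stage ends.
  def unregisterCoflow(coflowId: String): Unit
}
```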

Primary features included in this initial release are:

  • Support for in-memory and on-disk coflows,
  • Efficient scheduling to minimize the average coflow completion times, and
  • In the deadline-sensitive mode, support for soft deadlines.

Here are some links if you want to check it out, contribute to making it better, or point someone our way who can help.

Project Website: http://varys.net
Git repository: https://github.com/coflow/varys
Relevant tools: https://github.com/coflow
Research papers
1. Efficient Coflow Scheduling with Varys
2. Coflow: A Networking Abstraction for Cluster Applications

The project is still in its early stages and can use all the help it can get to become successful. We appreciate your feedback.

Sinbad Merged with HDFS Trunk at Facebook!

Today marks the end of my stay at Facebook. Over the last eight months, on and off, I have re-implemented, tested, evaluated, reproduced the results of, and merged Sinbad (SIGCOMM’13) with Facebook’s internal HDFS trunk. Several of the future directions promised in the paper have been pursued in the process, with varying results. It has been quite a fruitful stint for me; hopefully, it’ll be useful for Facebook in the coming days as it rolls out to more production clusters. After all, who doesn’t like faster writes!

I’m really glad to have had this opportunity, and there are many people to thank. It all started with a talk I gave last July at the end of my Facebook Fellowship, which resulted in Hairong asking me to implement Sinbad at Facebook. Every single member of the HDFS team has been extremely helpful since I started last October. I’d especially like to mention Hairong, Pritam, Dikang, Weiyan, Justin, Peter, Tom, and Eddie for spending many hours getting me up to speed, reviewing code, fixing bugs, helping with deployment, troubleshooting, and generally bearing with me. I’d also like to thank Lisa from the Networking team for helping me add APIs to their infrastructure so that Sinbad could collect rack-level information.

The whole experience was different from my previous internships at research labs for several reasons. First, because it wasn’t an internship, I was on my own schedule and could get my research done while doing this on the side. Second, it was nice to sit with the actual team running some of the largest HDFS clusters on earth. I learned a lot and found some new problems to explore in the future. Finally, of course, lots of free delicious food!

I’m glad I listened to Ion when Hairong asked me. Now, let’s hope Apache Hadoop/YARN picks it up as well.

Varys accepted at SIGCOMM’2014: Coflow is coming!

Update 2: Varys Developer Alpha released!

Update 1: The latest version is available here now!

We introduced the coflow abstraction back in 2012 at HotNets-XI, and now we have something solid to show what coflows are good for. Our paper on efficient and deadline-sensitive coflow scheduling has been accepted to appear at this year’s SIGCOMM. By exploiting a little bit of application-level information, coflows enable significantly better performance for distributed data-intensive jobs. In the process, we have discovered, characterized, and proposed heuristics for a novel scheduling problem: “Concurrent Open Shop Scheduling with Coupled Resources.” Yeah, it’s a mouthful; but hey, how often does one stumble upon a new scheduling problem?

Communication in data-parallel applications often involves a collection of parallel flows. Traditional techniques to optimize flow-level metrics do not perform well in optimizing such collections, because the network is largely agnostic to application-level requirements. The recently proposed coflow abstraction bridges this gap and creates new opportunities for network scheduling. In this paper, we address inter-coflow scheduling for two different objectives: decreasing communication time of data-intensive jobs and guaranteeing predictable communication time. We introduce the concurrent open shop scheduling with coupled resources problem, analyze its complexity, and propose effective heuristics to optimize either objective. We present Varys, a system that enables data-intensive frameworks to use coflows and the proposed algorithms while maintaining high network utilization and guaranteeing starvation freedom. EC2 deployments and trace-driven simulations show that communication stages complete up to 3.16X faster on average and up to 2X more coflows meet their deadlines using Varys in comparison to per-flow mechanisms. Moreover, Varys outperforms non-preemptive coflow schedulers by more than 5X.
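
To give a feel for the heuristics, here is a minimal sketch of the greedy ordering idea: schedule first the coflow whose most-bottlenecked port would finish soonest. The types and the uniform-bandwidth assumption below are illustrative simplifications, not the paper’s full algorithm.

```scala
// Illustrative types: per-port remaining demand of a coflow.
case class PortDemand(portId: String, remainingBytes: Long)
case class Coflow(id: String, demands: Seq[PortDemand])

object SmallestBottleneckFirst {
  // A coflow finishes no earlier than its most-loaded port allows, so the
  // bottleneck port dictates the coflow's completion time.
  def bottleneckSeconds(c: Coflow, portGbps: Double): Double = {
    val bytesPerSecond = portGbps * 1e9 / 8
    c.demands.map(_.remainingBytes / bytesPerSecond).max
  }

  // Greedy order: smallest effective bottleneck first, which mimics
  // shortest-job-first at the granularity of coflows.
  def order(coflows: Seq[Coflow], portGbps: Double = 10.0): Seq[Coflow] =
    coflows.sortBy(bottleneckSeconds(_, portGbps))
}
```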

This is a joint work with Yuan Zhong and Ion Stoica. But as always, it took a village! I’d like to thank Mohammad Alizadeh, Justine Sherry, Peter Bailis, Rachit Agarwal, Ganesh Ananthanarayanan, Tathagata Das, Ali Ghodsi, Gautam Kumar, David Zats, Matei Zaharia, the NSDI and SIGCOMM reviewers, and everyone else who patiently read many drafts or listened to my spiel over the years. It’s been in the oven for many years, and I’m glad that it was so well received.

I believe coflows are the future, and they are coming for the good of the realm, er, network. Varys is only the tip of the iceberg, and there are way too many interesting problems to solve (both for network and theory folks!). We are in the process of open sourcing Varys in the coming weeks and look forward to seeing it used with major data-intensive frameworks. Personally, I’m happy with how my dissertation research is shaping up :)

This year the SIGCOMM PC accepted 45 papers out of 237 submissions with an 18.99% acceptance rate. This is the highest acceptance rate since 1995 and the highest number of papers ever accepted (tying with SIGCOMM’86)! After years of everyone complaining about the acceptance rate, this year’s PC has taken some action; probably a healthy step for the community.

Sinbad accepted to appear at SIGCOMM’2013

Update: Latest version is available here!

Our paper on leveraging flexibility in endpoint placement, aka Sinbad/Usher/Orchestrated File System (OFS), has been accepted for publication at this year’s SIGCOMM. We introduce constrained anycast in data-intensive clusters in the form of network-aware replica placement for cluster/distributed file systems. This is the second piece of my dissertation research, where the overarching goal is to push slightly more application semantics into the network to get disproportionately larger improvements. Sometimes things are unfair in a good way!

Many applications do not constrain the destinations of their network transfers. New opportunities emerge when such transfers account for a large share of network bytes. By choosing the endpoints to avoid congested links, completion times of these transfers as well as those of others without similar flexibility can be improved. In this paper, we focus on leveraging the flexibility in replica placement during writes to cluster file systems (CFSes), which account for almost half of the cross-rack traffic in data-intensive clusters. CFS writes only require placing replicas in multiple fault domains and balancing the use of storage, but the replicas can be in any subset of machines in the cluster.

We study CFS interactions with the cluster network, analyze optimizations for replica placement, and propose Sinbad, a system that identifies imbalance and adapts replica destinations to navigate around congested links. Sinbad does so with little impact on long-term storage load balance. Experiments on EC2 (trace-driven simulations) show that block writes complete 1.3X (1.58X) faster as the network becomes more balanced. As a collateral benefit, the end-to-end completion times of data-intensive jobs improve by up to 1.26X. Trace-driven simulations also show that Sinbad’s improvement (1.58X) is close to that of a loose upper bound (1.89X) on the optimal.
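
The core placement intuition can be sketched in a few lines, assuming hypothetical types: among racks that still satisfy fault-domain and storage-balance constraints, pick the ones whose downlinks are currently least congested. This is a simplification of Sinbad’s actual algorithm.

```scala
// Hypothetical rack summary: current downlink utilization in [0, 1].
case class Rack(id: String, downlinkUtilization: Double)

object GreedyPlacement {
  // Pick `numReplicas` distinct racks, least-congested first. In the real
  // system, candidates are pre-filtered for fault domains and long-term
  // storage balance; here we assume `candidates` already satisfies both.
  def chooseRacks(candidates: Seq[Rack], numReplicas: Int): Seq[Rack] =
    candidates.sortBy(_.downlinkUtilization).take(numReplicas)
}
```

Because congestion shifts on short timescales, the choice is made at write time for each block rather than precomputed once.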

This is a joint work with Srikanth Kandula and Ion Stoica. One cool fact is that we collaborated with Srikanth almost entirely over Skype for almost a year! I did take up a lot of time he could have spent with his summer interns, but most of them got their papers in too. There are a LOT of people to thank, including Yuan Zhong, Matei Zaharia, Gautam Kumar, Dave Maltz, Ganesh Ananthanarayanan, Ali Ghodsi, Raj Jain, and everyone else I’m forgetting right now. It took quite some time to put our ideas crisply in a few pages, but in the end the reception from the PC was great. We hope it will be useful in practice too.

This year the SIGCOMM PC accepted 38 papers out of 240 submissions with a 15.83% acceptance rate. This is the highest acceptance rate since 1996 (before Ion Stoica had any SIGCOMM paper) and the highest number of papers accepted since 1987 (before Scott Shenker had any paper in SIGCOMM)!

Coflow accepted at HotNets’2012

Update: Coflow camera-ready is available online! Tell us what you think!

Our position paper addressing the lack of a networking abstraction for cluster applications, “Coflow: A Networking Abstraction for Cluster Applications,” has been accepted at the latest workshop on hot topics in networking. We make the observation that thinking in terms of flows is good, but we can do better by thinking in terms of collections of flows, using application semantics while optimizing datacenter-scale applications.

Cluster computing applications (frameworks like MapReduce and user-facing applications like search platforms) have application-level requirements and higher-level abstractions to express them. However, there exists no networking abstraction that can take advantage of the rich semantics readily available from these data-parallel applications.

We propose coflow, a networking abstraction to express the communication requirements of prevalent data-parallel programming paradigms. Coflows make it easier for applications to convey their communication semantics to the network, which in turn enables the network to better optimize common communication patterns.
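
As a concrete illustration (the types are ours, not an API from the paper), a coflow is just a named collection of flows with a collective objective:

```scala
// Illustrative types, not an API from the paper: a coflow is a collection
// of flows that succeed or fail together, plus a collective objective.
case class FlowSpec(src: String, dst: String, bytes: Long)

sealed trait Objective
case object MinimizeCompletionTime extends Objective     // finish the *last* flow early
case class MeetDeadline(millis: Long) extends Objective  // all flows done by a deadline

// The network can now optimize all flows of a stage as one unit instead
// of treating each five-tuple independently.
case class CoflowSpec(id: String, flows: Seq[FlowSpec], objective: Objective)
```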

This observation is a culmination of some of the work I’ve been fortunate to be part of in the last couple of years (especially Orchestra@SIGCOMM’11 and recently Ahimsa@HotCloud’12 and FairCloud@SIGCOMM’12) as well as work done by others in the area (e.g., D3@SIGCOMM’11). Looking forward to constructive discussions in Redmond next month!

FTR, HotNets PC this year accepted 23 papers out of 120 submissions.