Tag Archives: Big Data Systems

Jie Graduates. Congrats Dr. You!

Jie (Jimmy) has recently become my third PhD student to graduate, with a dissertation titled “Toward Practical Application-Aware Big Data Systems.” Over the course of his PhD, Jimmy contributed to most of our group’s major projects, and we will miss him going forward. He’s joining Meta.

Jimmy officially started his PhD at Michigan in Fall 2016, but he began working with me in Summer 2017. His first project was one of SymbioticLab’s earliest efforts on geo-distributed analytics, which took on many names over the years (Gaia/Terra/System H). While the core project didn’t go as expected, it created tools and artifacts that helped other projects Jimmy contributed to, which were published in SPAA and NSDI. After that, Jimmy learned that sometimes it’s better to come up with your own idea, and he successfully led the Kayak project end-to-end, providing a middle ground between KV- and RPC-based systems. This was done in collaboration with Xin and his group. After moving from high-latency to low-latency networks, Jimmy’s final project went in yet another direction. In Zeus, Jimmy and Jae-Won collaborated on understanding and optimizing the energy consumption of DNN training. I think Zeus is the best of his works and will have a lasting impact. While it was stressful as an advisor to see him change course so many times during his PhD, it was also fun to see him eventually find his footing on his own terms.

Jimmy is very inquisitive, which led to him exploring many different things during his PhD. He is also very good at taking feedback and improving himself, which he’s clearly demonstrated over the past five years. I’m sure he’ll ask many questions and explore many new things in the next chapter(s) of his career.

QOOP Accepted to Appear at OSDI’2018

Distributed data-parallel queries run on large collections of resources — within a single datacenter for traditional big data queries and across multiple datacenters for geo-distributed analytics. The total amount of resources available to a running query changes over time for a variety of reasons, including changes in its fair share as new jobs arrive, machine failures, changes in spot instance pricing, and fluctuations in available WAN bandwidth. Traditional big data systems rely on the cluster scheduler to deal with these fluctuations: a query planner assigns each query a fixed query plan that converts it into a distributed data-parallel job, which is then executed by an execution engine. Over the years, this has led to pushing more and more complexity into the cluster scheduler for better performance. But what if we could change the query plan itself?

In QOOP, we do exactly that and show that this can lead to simpler execution engines and cluster schedulers. Instead of pushing complexity down the stack, QOOP allows the query planner to re-plan queries on resource fluctuations using a simple greedy algorithm. We prove that this is expected to perform well and empirically validate the theoretical result.

Modern data processing clusters are highly dynamic — both in terms of the number of concurrently running jobs and their resource usage. To improve job performance, recent works have optimized the cluster scheduler and the jobs’ query planner, with a focus on picking the right execution plan — represented as a directed acyclic graph (DAG) — for a job in a resource-aware manner and on scheduling jobs in a DAG-aware manner. However, because existing solutions use a fixed DAG throughout the entire execution, the inability to adapt a DAG in reaction to resource changes often leads to large performance inefficiencies.

This paper argues for dynamic query re-planning, wherein we re-evaluate and re-plan a job’s DAG during its execution. We show that designing for re-planning requires fundamental changes to the interfaces between key layers of data analytics stacks today, i.e., the query planner, the execution engine, and the cluster scheduler. Instead of pushing more complexity into the scheduler or the query planner, we argue for a redistribution of responsibilities between the three components to simplify their designs. Under this redesign, we analytically show that a greedy algorithm for re-planning and execution alongside a simple max-min fair scheduler can offer provably competitive behavior even under adversarial resource changes. We prototype our algorithms atop Apache Hive and Tez. Using extensive experiments on a 20-node cluster, we show that our design can offer a median performance improvement of 1.47x compared to state-of-the-art alternatives.

While QOOP focused on small-scale clusters, its core ideas can be applied to other dynamic query re-planning scenarios — for example, in response to WAN bandwidth fluctuations in the context of geo-distributed analytics.
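
To make the re-planning loop concrete, here is a minimal Python sketch under a toy cost model of my own; it is not QOOP’s implementation (which sits atop Hive and Tez), but it shows the greedy policy: on every resource-change event, re-estimate each candidate DAG’s remaining completion time, reuse already-materialized stages, and switch plans only when the savings clear a margin.

```python
# Minimal sketch of greedy dynamic query re-planning in the spirit of QOOP.
# Hypothetical model and names; not the system's actual code.

from dataclasses import dataclass, field


@dataclass
class Plan:
    name: str
    cost_per_stage: dict   # stage name -> remaining work in that stage (task-seconds)
    max_parallelism: int   # how many slots the plan can usefully use at once


@dataclass
class QueryState:
    completed_stages: set = field(default_factory=set)  # stages whose output is already materialized


def remaining_time(plan, state, slots):
    """Leftover work divided by usable parallelism; completed stages are free to reuse."""
    leftover = sum(cost for stage, cost in plan.cost_per_stage.items()
                   if stage not in state.completed_stages)
    return leftover / max(min(slots, plan.max_parallelism), 1)


def replan(current, candidates, state, slots, switch_margin=0.9):
    """Greedy policy: switch only if the best candidate beats the current plan by a margin."""
    best = min(candidates, key=lambda p: remaining_time(p, state, slots))
    if remaining_time(best, state, slots) < switch_margin * remaining_time(current, state, slots):
        return best
    return current


# Toy usage: the job's fair share drops from 100 to 20 slots mid-query. The wide plan
# hides its extra work behind parallelism at 100 slots but not at 20, so we switch.
wide = Plan("wide", {"scan": 200, "join_a": 500, "join_b": 500}, max_parallelism=100)
narrow = Plan("narrow", {"scan": 200, "join_c": 300}, max_parallelism=25)
state = QueryState(completed_stages={"scan"})
print(replan(wide, [wide, narrow], state, slots=100).name)  # stays "wide"
print(replan(wide, [wide, narrow], state, slots=20).name)   # switches to "narrow"
```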

This work is my second OSDI collaboration with Aditya after Carbyne in OSDI 2016 and my third OSDI paper overall. This time I got to work with another incredible student, Kshiteej Mahajan, as well as a terrific theoretician, Shuchi Chawla. I look forward to many more fruitful collaborations between our groups in the future.

Two Papers Accepted at APNet’2018 and MAMA’2018 Workshops

A couple more workshop papers got accepted in the last few days!

  1. Pas de Deux: Shape the Circuits, and Shape the Apps Too! — APNet’18
  2. Fair Allocation of Heterogeneous and Interchangeable Resources — MAMA’18

The APNet one is about coflows on optical networks, following my collaborations with Hong and Kai at HKUST (SIGCOMM’16 and SIGCOMM’17), while the MAMA one is with Zhenhua and his very capable students Xiao and Tan, following up on my work with Zhenhua (NSDI’16).

Look forward to the longer versions soon.

Three Papers Accepted at HotCloud’2018 and GRADES-NDA’2018 Workshops

Over the course of the last few days, we heard back about the acceptance of three workshop papers:

  1. To Relay or Not to Relay for Inter-Cloud Transfers? — HotCloud’18
  2. Monarch: Gaining Command on Geo-Distributed Graph Analytics — HotCloud’18
  3. Bridging the GAP: Towards Approximate Graph Analytics — GRADES-NDA’18

The HotCloud ones deal with networking for and graph processing over geo-distributed analytics, while the GRADES-NDA one deals with approximate graph processing. The first of the three also happens to be the first submission and paper by my first-year student Fan Lai in collaboration with Harsha. The other two are led completely by my friend Anand at Berkeley.

TWO NSF Proposals Awarded as the Lead PI. Thanks NSF!

The first one is on rack-scale computing using RDMA-enabled networks with Barzan Mozafari at the University of Michigan, and the second is on theoretical and systems implications of long-term fairness in cluster computing with Zhenhua Liu (Stony Brook University).

Thanks NSF!

Combined with the recent awards on geo-distributed analytics from NSF and Google, I’m looking forward to very exciting days ahead. If you want to be a part of these exciting efforts, consider joining my group!

Carbyne Accepted to Appear at OSDI’2016

Update: Camera-ready version is available here now!

With the wide adoption of distributed data-parallel applications, large-scale resource scheduling has become a constant source of innovation in recent years. There are tens of scheduling solutions that try to optimize for objectives such as user-level fairness, application-level performance, and cluster-level efficiency. However, given the well-known tradeoffs between fairness, performance, and efficiency, these solutions have traditionally focused on one primary objective (e.g., fairness in case of DRF), and they consider other objectives as best effort, secondary goals.

In this paper, we revisit the tradeoff space and demonstrate that aggressively optimizing one primary objective while giving up the rest often does not yield noticeable long-term benefits. Because a job cannot complete until all its tasks have completed, each job can altruistically yield some of its resources without hampering its own completion time. These altruistic resources can then be rescheduled among other jobs to significantly improve secondary objectives without hampering the primary one. The benefits of our approach are visible even for single-stage jobs, and they increase as jobs have more complex DAGs.

Given the well-known tradeoffs between performance, fairness, and efficiency, modern cluster schedulers focus on aggressively optimizing a single objective, while ignoring the rest. However, short-term convergence to a selected objective often does not result in noticeable long-term benefits. Instead, we propose an altruistic, long-term approach, where jobs yield fractions of their allocated resources without impacting their own completion times.

We show that leftover resources collected via altruisms of many jobs can then be rescheduled, essentially introducing a third degree of freedom in cluster scheduling — in addition to inter- and intra-job scheduling. We leverage this new-found flexibility in Carbyne, a scheduler that combines three existing schedulers to simultaneously optimize cluster-wide performance, fairness, and efficiency. Deployments and large-scale simulations on industrial benchmarks and production traces show that Carbyne closely approximates the state-of-the-art solutions for each of the three objectives simultaneously. For example, Carbyne provides 1.26X better efficiency and 1.59X lower average completion time than DRF, while providing similar fairness guarantees.
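
For intuition, here is a minimal Python sketch of the altruism idea under a simplified model of my own (fair shares, per-job critical demand, and remaining work are assumed to be given); it is not Carbyne’s actual algorithm, but it shows how yielded resources become leftover slots that can be redistributed — here, shortest-remaining-work-first — to cut average completion times without delaying any job’s own finish.

```python
# Minimal sketch of altruistic scheduling in the spirit of Carbyne (hypothetical model,
# not the actual system). A job completes only when its last task finishes, so it can keep
# just what its critical path needs now and yield the rest of its fair share; the yielded
# slots are redistributed to the jobs closest to completion.

from dataclasses import dataclass


@dataclass
class Job:
    name: str
    fair_share: int        # slots the fair scheduler (e.g., DRF) would give this job
    critical_demand: int   # slots needed right now so the critical path is not delayed
    total_demand: int      # slots the job could usefully consume right now
    remaining_work: float  # leftover work, used to prioritize leftover slots (SRTF-style)


def altruistic_allocation(jobs):
    # Step 1: each job altruistically keeps only what its critical path needs.
    alloc = {j.name: min(j.fair_share, j.critical_demand) for j in jobs}
    leftover = sum(j.fair_share - alloc[j.name] for j in jobs)

    # Step 2: hand leftover slots to jobs with the least remaining work, up to their demand.
    for j in sorted(jobs, key=lambda job: job.remaining_work):
        extra = min(leftover, j.total_demand - alloc[j.name])
        alloc[j.name] += extra
        leftover -= extra
    return alloc


# Toy usage: job A is in a barrier-like stage and needs few slots now, so it yields;
# job B is almost done and soaks up the leftover slots to finish sooner.
jobs = [
    Job("A", fair_share=50, critical_demand=10, total_demand=60, remaining_work=500),
    Job("B", fair_share=50, critical_demand=50, total_demand=90, remaining_work=80),
]
print(altruistic_allocation(jobs))  # {'A': 10, 'B': 90}
```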

Altruistic scheduling has many more use cases; e.g., we had a similar observation for coflow scheduling in Varys.

This work started as a collaboration with Robert Grandl and Aditya Akella toward the end of 2015. Ganesh Ananthanarayanan from MSR later joined us to take it to the next level. After CODA, this is related to another future work (the seventh) from my dissertation; infer whatever you want out of these two data points ;)

This year the OSDI PC accepted 47 out of 260 papers. This happens to be my first time submitting to OSDI. It’s also my first paper with Ganesh, even though it happened after we both graduated from Berkeley; we sat opposite each other for four years back then! I also want to thank Aditya for letting me work closely with Robert; it’s been great collaborating with them.

EC-Cache Accepted to Appear at OSDI’2016

Update: Camera-ready version is available here now!

In-memory caching is the de facto solution for enabling low-latency analytics over large datasets. While caching objects, one must be careful to maximize the number of requests that can be served from memory in the presence of popularity skew, background load imbalance, and server failures. Traditional solutions use selective replication, i.e., adjusting the number of replicas based on object popularity, to address these issues. However, dynamically determining replication factors and reacting to hotspots and load imbalance quickly enough is often challenging.

We applied erasure coding to address the same set of challenges. Erasure coding is widely used in modern clusters to minimize storage usage, but it is not applied to object caching, nor is it considered a viable solution for load balancing. We show both analytically and empirically that, given the instruction sets in modern CPUs, erasure coding can be a viable solution to address the limitations of selective replication as long as we are willing to sacrifice some network bandwidth. Given the rise of high-capacity network topologies and technologies, we believe this to be a timely tradeoff.

Data-intensive clusters rely on in-memory object caching to maximize the number of requests that can be served from memory in the presence of popularity skew, background load imbalance, and server failures. In order to improve I/O latency and for load balancing, these caches typically employ selective replication, where the number of cached replicas of an object is proportional to its popularity. Selective replication, however, often falls short in practice, because it needs careful selection of replication factors and dynamic, load-aware replica placement.

EC-Cache is a load-balanced, high-performance cluster cache that uses erasure coding in a novel way to overcome the limitations of selective replication. EC-Cache employs two techniques: (a) splitting and erasure coding individual objects during writes, and (b) late binding, wherein obtaining any k out of (k + r) splits of an object is sufficient during reads. As compared to selective replication, EC-Cache improves load balancing by a factor of 3.3X and reduces the median and tail read latencies by more than 2X, while using the same amount of memory. EC-Cache does so using 10% additional bandwidth and a small increase in the amount of stored metadata. The benefits offered by EC-Cache are further amplified in the presence of background network load imbalance.
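
For intuition, here is a minimal Python sketch of the two techniques using a toy (k, r = 1) XOR parity code; EC-Cache itself uses Reed-Solomon codes over real servers, and all names below are mine, not the system’s. The write path splits and encodes an object; the read path uses late binding, reconstructing the object from whichever k of the k + r splits arrive first.

```python
# Minimal sketch of EC-Cache's write and read paths with a toy XOR parity code (k data
# splits + 1 parity split). Hypothetical helper names; not the system's actual code.

import os


def encode(obj: bytes, k: int):
    """Write path: split obj into k equal data splits plus one XOR parity split."""
    if len(obj) % k:
        obj += b"\0" * (k - len(obj) % k)  # pad; a real system records the original length
    size = len(obj) // k
    splits = [obj[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(size)
    for s in splits:
        parity = bytes(a ^ b for a, b in zip(parity, s))
    return splits + [parity]               # k + 1 splits, spread across cache servers


def decode(received: dict, k: int, split_size: int) -> bytes:
    """Read path (late binding): rebuild from any k of the k + 1 splits.
    `received` maps split index (0..k-1 are data, k is parity) to its bytes."""
    if all(i in received for i in range(k)):            # all data splits arrived
        return b"".join(received[i] for i in range(k))
    missing = next(i for i in range(k) if i not in received)
    acc = bytes(split_size)
    for i, s in received.items():                        # XOR parity with the present data
        if i != missing:                                 # splits to recover the missing one
            acc = bytes(a ^ b for a, b in zip(acc, s))
    return b"".join(received.get(i, acc) for i in range(k))


# Toy usage with k = 4: one split is slow (or its server failed), yet the read completes.
obj = os.urandom(4096)
splits = encode(obj, k=4)
fastest_four = {i: s for i, s in enumerate(splits) if i != 2}   # split 2 never shows up
assert decode(fastest_four, k=4, split_size=len(splits[0])) == obj
```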

The genesis of this work traces back to 2013, when I was just wrapping up Sinbad and heard Rashmi Vinayak give a talk about her work at Facebook. Although we planned to collaborate many times, we were always busy with other things and never managed to work together until very recently. During this time, the project evolved from disk-based storage to a general object store to its current form: an in-memory object cache. The very energetic Jack Kosaian tracked me down last year to work on something exciting, which also helped finally get the project going at full speed. I’ve been very fortunate with my collaborators.

This year the OSDI PC accepted 47 out of 260 papers. This happens to be my first time submitting to OSDI. It’s also my first paper with my own student. Jack has been a tremendous force that kept the project going and helped Rashmi proceed at a fast pace. I also want to thank Rashmi for mentoring Jack throughout the process. I love this advising business :) Go Blue!

Received SIGCOMM Doctoral Dissertation Award

About a week or so ago Bruce Maggs, SIGCOMM’s awards chair, kindly informed me over the phone that my dissertation on coflows has been selected for the 2015 ACM SIGCOMM Doctoral Dissertation Award. The committee for the award included Ratul Mahajan, Dina Papagiannaki, Laurent Vanbever (chair), and Minlan Yu, and the citation reads:

Chowdhury’s dissertation provides novel and application-aware networking abstractions which significantly improve the performance of networked applications running in the cloud.

It’s a great honor to join some great researchers who’ve won this award before me.

It goes without saying that this achievement wouldn’t have been possible without my great advisor Ion Stoica keeping me on course, my excellent collaborators over the years, and many others who paved my way in multiple ways. I am also grateful to some of my biggest supporters: Randy Katz, Mike Franklin, Srikanth Kandula, and Vyas Sekar. An equal amount of credit, if not more, goes to my family and friends, who cheered me on and supported me for so many years. I don’t say it often enough: thank you!