Infiniswap Accepted to Appear at NSDI’2017

As networks become faster, the difference between remote and local resources is blurring every day. How can we take advantage of these blurred lines? This is the key observation behind resource disaggregation and, to some extent, rack-scale computing. In this paper, we take our first stab at making memory disaggregation practical by exposing remote memory to unmodified applications. While there have been several proposals and feasibility studies in recent years, to the best of our knowledge, this is the first concrete step in making it real.

Memory-intensive applications suffer large performance losses when their working sets do not fully fit in memory. Yet, when paging out to disk, they cannot leverage otherwise unused remote memory, even in the presence of large imbalances in memory utilization across a cluster. Existing proposals for memory disaggregation call for new architectures, new hardware designs, and/or new programming models, making them infeasible in practice.

This paper describes the design and implementation of Infiniswap, a remote memory paging system designed specifically for an RDMA network. Infiniswap opportunistically harvests and transparently exposes unused memory to unmodified applications by dividing the swap space of each machine into many slabs and distributing them across many machines’ remote memory. Because RDMA operations bypass remote CPUs, Infiniswap leverages the power of many choices to perform decentralized slab placements and evictions.
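
To make the placement idea concrete, here is a minimal sketch of power-of-two-choices slab placement in Python. Everything in it (the machine names, the free-memory probe, the number of candidates) is an illustrative assumption, not Infiniswap’s actual kernel-level implementation:

```python
import random

# A sketch of decentralized, power-of-two-choices slab placement in the
# spirit of Infiniswap. NUM_CHOICES, free_memory(), and the machine names
# are assumptions for exposition.

NUM_CHOICES = 2  # probe a few random candidates instead of all machines

def place_slab(slab_id, machines, free_memory):
    """Probe NUM_CHOICES random remote machines and place the slab on the
    one advertising the most free memory. No central coordinator is
    involved, so many machines can place slabs concurrently."""
    candidates = random.sample(machines, NUM_CHOICES)
    return max(candidates, key=free_memory)  # slab pages go here via RDMA

# Toy usage: five machines with different amounts of free memory (GB).
free = {"m1": 4, "m2": 16, "m3": 1, "m4": 9, "m5": 12}
print(place_slab(slab_id=42, machines=list(free), free_memory=free.get))
```

Probing only a couple of random candidates keeps decisions decentralized while spreading load nearly as evenly as a central coordinator would.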

We have implemented and deployed Infiniswap on an RDMA cluster without any OS modifications and evaluated its effectiveness using multiple workloads running on unmodified VoltDB, Memcached, PowerGraph, GraphX, and Apache Spark. Using Infiniswap, throughputs of these applications improve by 7.1X (0.98X) to 16.3X (9.3X) over disk (Mellanox nbdX), and median and tail latencies improve by 5.5X (2X) to 58X (2.2X). Infiniswap does so with negligible remote CPU usage, whereas nbdX becomes CPU-bound. Infiniswap increases the overall memory utilization of a cluster and works well at scale.

This work started as a class project for EECS 582 in the Winter, when I gave the idea to Juncheng Gu and Youngmoon Lee, who made the pieces into a whole. Over the summer, Yiwen Zhang, an enterprising and excellent undergraduate, joined the project and helped us get it done in time.

This year the NSDI PC accepted 46 out of 255 papers. This happens to be my first paper with an all-blue cast! I want to thank Kang for giving me complete access to Juncheng and Youngmoon; it’s been great collaborating with them. I’m also glad that Yiwen has decided to start a Master’s and stay with us longer; more importantly, our team will remain intact for many more exciting followups in this emerging research area.

If this excites you, come join our group!

Two NSF Proposals Awarded as the Lead PI

The first one is on rack-scale computing using RDMA-enabled networks with Barzan Mozafari at the University of Michigan, and the second is on theoretical and systems implications of long-term fairness in cluster computing with Zhenhua Liu (Stony Brook University).

Thanks NSF!

Combined with the recent awards on geo-distributed analytics from NSF and Google, I’m looking forward to very exciting days ahead. If you want to be a part of these exciting efforts, consider joining my group!

Carbyne Accepted to Appear at OSDI’2016

Update: Camera-ready version is available here now!

With the wide adoption of distributed data-parallel applications, large-scale resource scheduling has become a constant source of innovation in recent years. There are tens of scheduling solutions that try to optimize for objectives such as user-level fairness, application-level performance, and cluster-level efficiency. However, given the well-known tradeoffs between fairness, performance, and efficiency, these solutions have traditionally focused on one primary objective (e.g., fairness in the case of DRF) and treated the others as best-effort, secondary goals.

In this paper, we revisit the tradeoff space and demonstrate that aggressively optimizing one primary objective while giving up the rest often does not matter much in the long run. Because a job cannot complete until all its tasks have completed, each job can altruistically yield some of its resources without hampering its own completion time. These altruistic resources can then be rescheduled among other jobs to significantly improve secondary objectives without hampering the primary one. The benefits of our approach are visible even for single-stage jobs, and they increase as jobs have more complex DAGs.
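
To illustrate the intuition with a toy model: a job’s completion time is set by its slowest remaining work, so any allocation beyond what is needed to keep that pace can be yielded. The sketch below is my simplified reading, not Carbyne’s actual estimator; the work/time model and names are assumptions:

```python
# A sketch of Carbyne-style altruism under a deliberately simple model:
# yield whatever allocation exceeds the rate needed to finish on time.

def leftover(allocation, remaining_work, completion_time):
    """Slots the job can altruistically yield right now.

    remaining_work: task-seconds the job still has to execute.
    completion_time: estimated seconds until the job's last task finishes
    under its current allocation; finishing any single task earlier than
    that helps nobody, since the job completes only when all tasks do.
    """
    min_needed = remaining_work / completion_time  # slots to keep pace
    return max(0.0, allocation - min_needed)

# Toy usage: 10 slots allocated, 120 task-seconds left, but a long critical
# path means the job takes 60 s regardless => it only needs 2 slots.
print(leftover(allocation=10, remaining_work=120, completion_time=60))  # 8.0
```

Those 8 yielded slots are exactly the leftover resources that can be rescheduled to improve secondary objectives.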

Given the well-known tradeoffs between performance, fairness, and efficiency, modern cluster schedulers focus on aggressively optimizing a single objective, while ignoring the rest. However, short-term convergence to a selected objective often does not result in noticeable long-term benefits. Instead, we propose an altruistic, long-term approach, where jobs yield fractions of their allocated resources without impacting their own completion times.

We show that leftover resources collected via altruisms of many jobs can then be rescheduled, essentially introducing a third degree of freedom in cluster scheduling — in addition to inter- and intra-job scheduling. We leverage this new-found flexibility in Carbyne, a scheduler that combines three existing schedulers to simultaneously optimize cluster-wide performance, fairness, and efficiency. Deployments and large-scale simulations on industrial benchmarks and production traces show that Carbyne closely approximates the state-of-the-art solutions for each of the three objectives simultaneously. For example, Carbyne provides 1.26X better efficiency and 1.59X lower average completion time than DRF, while providing similar fairness guarantees.

Altruistic scheduling has many more use cases; e.g., we had a similar observation for coflow scheduling in Varys.

This work started as a collaboration with Robert Grandl and Aditya Akella toward the end of 2015. Ganesh Ananthanarayanan from MSR later joined us to take it to the next level. After CODA, this is related to another future work (the seventh) from my dissertation; infer whatever you want out of these two data points ;)

This year the OSDI PC accepted 47 out of 260 papers. This happens to be my first time submitting to OSDI. It’s also my first paper with Ganesh, even though it happened after we both graduated from Berkeley; we sat opposite each other for four years back then! I also want to thank Aditya for letting me work closely with Robert; it’s been great collaborating with them.

EC-Cache Accepted to Appear at OSDI’2016

Update: Camera-ready version is available here now!

In-memory caching is the de facto solution for enabling low-latency analytics over large datasets. When caching objects, one must maximize the number of requests that can be served from memory in the presence of popularity skew, background load imbalance, and server failures. Traditional solutions use selective replication, i.e., adjusting the number of replicas based on object popularity, to address these issues. However, dynamically determining replication factors and reacting to hotspots and load imbalance fast enough is often challenging.
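
As a toy illustration of selective replication, the rule below scales an object’s replication factor with its request rate; the linear policy and its constants are assumptions for exposition, not any particular system’s scheme. Choosing these knobs, and re-evaluating them quickly as popularity shifts, is precisely the hard part noted above:

```python
# Toy selective replication: more popular objects get more replicas.
# The linear rule and the constants are illustrative assumptions.
def replicas(request_rate, base_rate=100.0, max_replicas=10):
    """Replication factor proportional to popularity, clamped to [1, max]."""
    return min(max_replicas, max(1, round(request_rate / base_rate)))

print(replicas(50), replicas(480), replicas(10_000))  # 1 5 10
```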

We applied erasure coding to address the same set of challenges. Erasure coding is widely used in modern clusters to minimize storage usage. However, it had not been applied to object caching, nor was it considered a viable solution for load balancing. We show both analytically and empirically that, given the instruction sets in modern CPUs, erasure coding can be a viable solution to address the limitations of selective replication as long as we are willing to sacrifice some network bandwidth. Given the rise of high-capacity network topologies and networking technologies, we believe this to be a timely tradeoff.

Data-intensive clusters rely on in-memory object caching to maximize the number of requests that can be served from memory in the presence of popularity skew, background load imbalance, and server failures. To improve I/O latency and load balancing, these caches typically employ selective replication, where the number of cached replicas of an object is proportional to its popularity. Selective replication, however, often falls short in practice, because it needs careful selection of replication factors and dynamic, load-aware replica placement.

EC-Cache is a load-balanced, high-performance cluster cache that uses erasure coding in a novel way to overcome the limitations of selective replication. EC-Cache employs two techniques: (a) splitting and erasure coding individual objects during writes, and (b) late binding during reads, wherein any k out of the (k + r) splits of an object are sufficient to reconstruct it. As compared to selective replication, EC-Cache improves load balancing by a factor of 3.3X and reduces the median and tail read latencies by more than 2X, while using the same amount of memory. EC-Cache does so using 10% additional bandwidth and a small increase in the amount of stored metadata. The benefits offered by EC-Cache are further amplified in the presence of background network load imbalance.
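
The read path is where late binding comes in. Below is a minimal sketch, assuming hypothetical fetch_split and decode helpers in place of the real cache RPCs and erasure decoder: request slightly more than k splits in parallel and return as soon as any k arrive, so one hot or slow server cannot delay the read:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# A sketch of erasure-coded reads with late binding. fetch_split() and
# decode() are hypothetical stand-ins for the cache's RPC and decoder.

def read_object(servers, k, fetch_split, decode, extra=1):
    """Request k + extra of the (k + r) splits; decode once any k finish."""
    pool = ThreadPoolExecutor(max_workers=k + extra)
    futures = [pool.submit(fetch_split, s) for s in servers[:k + extra]]
    splits = []
    for done in as_completed(futures):
        splits.append(done.result())
        if len(splits) == k:       # late binding: the first k to arrive win
            break
    pool.shutdown(wait=False)      # ignore stragglers; we have enough
    return decode(splits, k)
```

Reading k + extra splits instead of exactly k is where the modest extra bandwidth mentioned above goes.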

The genesis of this work traces back to 2013, when I was just wrapping up Sinbad and heard Rashmi Vinayak give a talk about her work at Facebook. While we planned to collaborate many times, we were always busy with other things and never managed to work together until very recently. During this time, the project evolved from disk-based storage to a general object store to its current form: an in-memory object cache. The very energetic Jack Kosaian tracked me down last year to work on something exciting, which also helped in finally getting the project going at full speed. I’ve been very fortunate with my collaborators.

This year the OSDI PC accepted 47 out of 260 papers. This happens to be my first time submitting to OSDI. It’s also my first paper with my own student. Jack has been a tremendous force that kept the project going and helped Rashmi proceed at a fast pace. I also want to thank Rashmi for mentoring Jack throughout the process. I love this advising business :) Go Blue!

Received SIGCOMM Doctoral Dissertation Award

About a week ago, Bruce Maggs, SIGCOMM’s awards chair, kindly informed me over the phone that my dissertation on coflows had been selected for the 2015 ACM SIGCOMM Doctoral Dissertation Award. The committee for the award included Ratul Mahajan, Dina Papagiannaki, Laurent Vanbever (chair), and Minlan Yu, and the citation reads:

Chowdhury’s dissertation provides novel and application-aware networking abstractions which significantly improve the performance of networked applications running in the cloud.

It’s a great honor to join the outstanding researchers who’ve won this award before me.

It goes without saying that this achievement wouldn’t have been possible without my great advisor Ion Stoica keeping me on course, my excellent collaborators over the years, and many others who paved my way in multiple ways. I am also grateful to some of my biggest supporters: Randy Katz, Mike Franklin, Srikanth Kandula, and Vyas Sekar. An equal amount of credit, if not more, goes to my family and friends, who cheered me on and supported me for so many years. I don’t say it often enough: thank you!

CODA Accepted to Appear at SIGCOMM’2016

Update: Camera-ready version is available here now!

Since introducing the coflow abstraction in 2012, we’ve been working hard to make it practical one step at a time. Over the years, we’ve worked on efficient coflow scheduling, removed clairvoyance requirements in coflow scheduling, and performed fair sharing among coexisting coflows. Throughout all these efforts, one requirement remained constant: all distributed data-parallel applications had to be modified to correctly use the same coflow API. Unfortunately, this is difficult, if not impossible, because of the sheer number of computing frameworks available today.

CODA makes a first attempt at addressing this problem head-on. The intuition is simple: if we can automatically identify coflows from the network, applications need not be modified. There are at least two key challenges in achieving this goal. The first is classifying coflows. Unlike traditional traffic classification, coflows cannot be identified using five-tuples. Instead, we must use domain-specific insights to capture the logical relationship between multiple flows. The second challenge is avoiding the impact of misclassification, because even the best classifier will sometimes be wrong. We must be able to schedule coflows even when some flows have been misclassified into the wrong coflows. Our solution errs on the side of caution to perform error-tolerant coflow scheduling. Overall, CODA approximates Aalo very closely and significantly outperforms flow-based solutions.
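
As a (heavily simplified) sketch of the identification side, the snippet below incrementally clusters flows using only start-time proximity. CODA’s actual classifier exploits much richer domain-specific attributes; the Flow fields and the threshold here are assumptions for exposition:

```python
# A toy incremental clustering of flows into coflow candidates, using only
# start-time proximity as the distance. The threshold is an assumption.

THRESHOLD = 0.1  # seconds; flows starting closer than this get grouped

class Flow:
    def __init__(self, flow_id, start_time):
        self.flow_id = flow_id
        self.start_time = start_time

def identify(flow, coflows):
    """Assign an arriving flow to the closest existing cluster, or start a
    new one. Runs per flow arrival, keeping identification incremental."""
    best = None
    for cluster in coflows:
        gap = abs(cluster[-1].start_time - flow.start_time)
        if gap < THRESHOLD and (best is None or gap < best[0]):
            best = (gap, cluster)
    if best:
        best[1].append(flow)
    else:
        coflows.append([flow])

coflows = []
for f in [Flow("a", 0.00), Flow("b", 0.02), Flow("c", 5.0)]:
    identify(f, coflows)
print([[fl.flow_id for fl in c] for c in coflows])  # [['a', 'b'], ['c']]
```

Whatever the classifier, some flows will inevitably land in the wrong cluster, which is why the scheduler itself must tolerate errors.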

Leveraging application-level requirements using coflows has recently been shown to improve application-level communication performance in data-parallel clusters. However, existing coflow-based solutions rely on modifying applications to extract coflows, making them inapplicable to many practical scenarios. In this paper, we present CODA, a first attempt at automatically identifying and scheduling coflows without any application-level modifications.

We employ an incremental clustering algorithm to perform fast, application-transparent coflow identification and complement it by proposing an error-tolerant coflow scheduler to mitigate occasional identification errors. Testbed experiments and large-scale simulations with realistic production workloads show that CODA can identify coflows with more than 90% accuracy, and CODA’s scheduler is robust to inaccuracies, enabling communication stages to complete 2.4X (5.1X) faster on average (95th percentile) in comparison to per-flow mechanisms. CODA’s performance is comparable to that of solutions requiring application-level modifications.

This work is a tight collaboration with Kai Chen and his students at HKUST: Hong Zhang, Bairen Yi, and Li Chen. It started with Hong sending me a short note asking whether I thought the problem was worth addressing. Of course I did; it is the second future work listed in my dissertation!

This year the SIGCOMM PC accepted 39 papers out of 225 submissions with a 17.33% acceptance rate. This also happens to be my first SIGCOMM in an advisory role and the first one without having Ion onboard. I am thankful to Kai for giving me full access to his excellent, hard-working students. It’s been a gratifying experience to work with them, and I look forward to further collaborations. Finally, I’m excited to start my life as an advisor on a high note. Go Blue!

HUG Accepted to Appear at NSDI’2016

Update: Camera-ready version is available here now!

With the advent of cloud computing and datacenter-scale applications, simultaneously dealing with multiple resources is the new norm. When multiple parties have multi-resource demands, fairly dividing these resources (for some notion of fairness) is a core challenge in the resource allocation literature. Dominant Resource Fairness (DRF) in NSDI’2011 was the first work to point out the ensuing challenges and to extend traditional, single-resource max-min fairness to the multi-resource environment.
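
For readers new to DRF, here is the canonical worked example as I recall it from the DRF paper: a tenant’s dominant share is its largest fractional use of any single resource, and DRF equalizes dominant shares across tenants:

```python
# The classic DRF example (as I recall it): 9 CPUs and 18 GB of memory.
capacity = {"cpu": 9.0, "mem": 18.0}
task_a = {"cpu": 1.0, "mem": 4.0}   # A's dominant resource is memory
task_b = {"cpu": 3.0, "mem": 1.0}   # B's dominant resource is CPU

def dominant_share(num_tasks, task):
    """Largest fraction of any one resource this many tasks would use."""
    return max(num_tasks * task[r] / capacity[r] for r in capacity)

# Giving A three tasks <3 CPU, 12 GB> and B two tasks <6 CPU, 2 GB> fits in
# capacity and equalizes dominant shares at 2/3.
print(dominant_share(3, task_a), dominant_share(2, task_b))  # ~0.667 ~0.667
```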

While DRF guarantees many nice properties to ensure fairness, it gives up work conservation; we show that utilization under DRF can be arbitrarily low in the worst case. Over the past five years, many software and research proposals have built upon DRF under the assumption that DRF can trivially be made work-conserving without losing any of its properties.

After five years, we prove that this is not the case. We show that there is a fundamental tradeoff between guaranteed resources and work conservation, and that trivially adding work conservation on top of DRF can actually decrease utilization instead of increasing it. We also characterize the maximum possible utilization achievable while maintaining DRF’s optimal guarantees and propose a new algorithm, High Utilization with Guarantees (HUG), that generalizes single- and multi-resource max-min fairness as well as the multi-tenant network sharing literature under a unifying framework.

In this paper, we study how to optimally provide isolation guarantees in multi-resource environments, such as public clouds, where a tenant’s demands on different resources (links) are correlated in order for her to make progress. Unlike prior work such as Dominant Resource Fairness (DRF) that assumes demands to be static and fixed, we consider elastic demands. Our approach generalizes canonical max-min fairness to the multi-resource setting with correlated demands, and extends DRF to elastic demands. We consider two natural optimization objectives: isolation guarantee from a tenant’s viewpoint and system utilization (work conservation) from an operator’s perspective. We prove that in non-cooperative environments like public cloud networks, there is a strong tradeoff between optimal isolation guarantee and work conservation when demands are elastic. Even worse, work conservation can decrease network utilization instead of improving it when demands are inelastic. We identify the root cause behind the tradeoff and present a provably optimal allocation algorithm, High Utilization with Guarantees (HUG), to achieve maximum attainable network utilization without sacrificing the optimal isolation guarantee, strategy-proofness, and other useful properties of DRF. In cooperative environments like private datacenter networks, HUG achieves both the optimal isolation guarantee and work conservation. Analyses, simulations, and experiments show that HUG provides better isolation guarantees, higher system utilization, and better tenant-level performance than its counterparts.
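
To make the two-stage idea concrete, here is a minimal sketch under my own simplified model (not the paper’s algorithm verbatim): tenant i’s demand is a correlation vector d_i over links, so progress g gives tenant i a share g * d_i[l] of link l. Stage 1 computes the optimal isolation guarantee; stage 2 hands out each link’s spare capacity without dipping below anyone’s guarantee, corresponding to the cooperative, work-conserving case:

```python
# A sketch of HUG's two stages under a simplified model; the equal-split
# rule for spare capacity in stage 2 is an illustrative assumption.

def stage1_guarantee(demands, num_links):
    """Stage 1: the largest progress g feasible for all tenants at once,
    i.e., the optimal isolation guarantee under correlated demands."""
    loads = [sum(d[l] for d in demands) for l in range(num_links)]
    return 1.0 / max(loads)

def stage2_shares(demands, num_links):
    """Stage 2: start from the guaranteed shares, then give each link's
    spare capacity to tenants that can use it, never reducing anyone's
    stage-1 guarantee (the cooperative, work-conserving case)."""
    g = stage1_guarantee(demands, num_links)
    shares = [[g * d[l] for l in range(num_links)] for d in demands]
    for l in range(num_links):
        spare = 1.0 - sum(s[l] for s in shares)
        wanting = [i for i, d in enumerate(demands) if d[l] > 0]
        if not wanting:
            continue
        for i in wanting:
            shares[i][l] += spare / len(wanting)  # split spare equally
    return g, shares

# Toy usage: two tenants, two links; tenant 0 needs both links equally,
# tenant 1 mostly needs link 0.
g, shares = stage2_shares([[1.0, 1.0], [1.0, 0.1]], num_links=2)
print(g)       # 0.5: each tenant's guarantee on its dominant link
print(shares)  # link 1's spare capacity has been handed back out
```

In the non-cooperative case, the paper shows that spare capacity must be handed out far more carefully (or withheld) to preserve strategy-proofness, which is exactly where the tradeoff with work conservation comes from.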

This work is the result of a close collaboration with Zhenhua Liu with help from Ion Stoica and Ali Ghodsi.

This year the NSDI PC accepted 45 out of 228 submissions with a 19.74% acceptance rate. This work forms the basis of the last chapter of my dissertation, ensuring SIGCOMM or NSDI publications from all its primary chapters!

Organized AMP Camp 6

Last week, November 19-20, I led the AMP Camp 6 organizing team to successfully arrange (most likely) the penultimate AMP Camp before the lab shuts down at the end of 2016. We had 200+ registered attendees, an all-star lineup of 14 speakers, and more than 40 volunteers. We also revamped the AMP Camp exercises syllabus, which now includes some of the more recent components of the Berkeley Data Analytics Stack (BDAS, pronounced “badass”).

This was the first time I organized such a large event, and I now have a new-found appreciation for the organizers of the many conferences and workshops I have attended in the past. I want to especially thank Kattt Atchley, Jey Kottalam, and Boban Zarkovich for helping me along the way, lab directors Mike Franklin and Ion Stoica for their full support, and all the speakers, volunteers, and attendees for making this event successful.

As I move on to the next stage of my career in a few weeks, I feel proud and fortunate to have been a part of such a stellar research lab.