Tag Archives: NSDI

ModelKeeper and Zeus Accepted to Appear at NSDI’2023

Deep learning, and machine learning in general, is taking over the world. It is, however, quite expensive to tune, train, and serve deep learning models. Naturally, improving the efficiency and performance of deep learning workflows has received significant attention (Salus, Tiresias, and Fluid, to name a few). Most existing works, including our prior works, focus on two primary ways to improve efficiency, and resource efficiency at that. The first is packing work as tightly as possible (placement); the second is scheduling over time; some apply both together. None focus on improving energy efficiency. ModelKeeper and Zeus are our efforts toward, respectively, improving resource efficiency by avoiding work that does not need to be done and improving energy efficiency instead of focusing solely on resource usage.

ModelKeeper

We know scheduling and placement can improve the efficiency of resource usage, but even with optimal algorithms one cannot reduce the amount of work that needs to be done in the general case. This simple observation led us to explore how we can reduce the amount of work needed to train DNN models. It turns out that instead of starting from random values and training until the model reaches its final weights, one can better initialize the model before training starts and short-circuit the process! By identifying similar models that have already been trained in the past, one can reduce the number of iterations a new model needs to converge.
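To make the idea concrete, here is a minimal sketch of the warm-start step under simplifying assumptions: given the new model and an already-trained parent with a similar architecture, copy over the weights of every layer whose name and shape match, and leave the rest randomly initialized. The function and data layout are illustrative, not ModelKeeper’s actual implementation.

```python
import numpy as np

def warm_start(new_model_weights, parent_weights):
    """Initialize a new model from a trained parent (illustrative only).

    Both arguments are dicts mapping layer names to numpy weight arrays.
    Layers whose names and shapes match inherit the parent's weights;
    everything else keeps its fresh (e.g., random) initialization.
    """
    initialized = {}
    for name, weight in new_model_weights.items():
        parent = parent_weights.get(name)
        if parent is not None and parent.shape == weight.shape:
            initialized[name] = parent.copy()   # reuse learned weights
        else:
            initialized[name] = weight          # keep the random init
    return initialized

# Toy usage: the conv layer matches the parent, the classifier head does not.
new = {"conv1": np.random.randn(32, 3, 3, 3), "fc": np.random.randn(10, 128)}
old = {"conv1": np.ones((32, 3, 3, 3)), "fc": np.ones((20, 128))}
warmed = warm_start(new, old)
```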

With growing deployment of machine learning (ML) models, ML developers are training or re-training increasingly more deep neural networks (DNNs). They do so to find the most suitable model that meets their accuracy requirement while satisfying the resource and timeliness constraints of the target environment. In large shared clusters, the growing number of neural architecture search (NAS) and training jobs often results in models sharing architectural similarities with others from the same or a different ML developer. However, existing solutions do not provide a systematic mechanism to identify and leverage such similarities.

We present ModelKeeper, the first automated training warmup system that accelerates DNN training by repurposing previously-trained models in a shared cluster. Our key insight is that initializing a training job’s model by transforming an already-trained model’s weights can jump-start it and reduce the total amount of training needed. However, models submitted over time can differ in their architectures and accuracy. Given a new model to train, ModelKeeper scalably identifies its architectural similarity with previously trained models, selects a parent model with high similarity and good model accuracy, and performs structure-aware transformation of weights to preserve maximal information from the parent model during the warmup of new model weights. Our evaluations across thousands of CV and NLP models show that ModelKeeper achieves 1.3×–4.3× faster training completion with little overhead and no reduction in model accuracy.
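A hedged sketch of the parent-selection step described above: score each previously trained model in the cluster’s model zoo by how much of the new model’s layer structure it shares, nudge the score toward higher-accuracy candidates, and return the best one. The similarity metric below is a simple stand-in, not the paper’s matching algorithm.

```python
def layer_signature(model_weights):
    """Summarize a model as a set of (layer-name prefix, weight shape) pairs."""
    return {(name.split(".")[0], tuple(w.shape)) for name, w in model_weights.items()}

def select_parent(new_model_weights, zoo):
    """Pick the model in `zoo` (name -> (weights, accuracy)) with the highest
    structural overlap with the new model, using accuracy as a tie-breaker."""
    target = layer_signature(new_model_weights)
    best_name, best_score = None, -1.0
    for name, (weights, accuracy) in zoo.items():
        overlap = len(target & layer_signature(weights)) / max(len(target), 1)
        score = overlap + 0.01 * accuracy   # small nudge toward accurate parents
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```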

Fan started the ModelKeeper project with Yinwei in late 2020 while Oort was making rounds and FedScale was in its infancy. With his internship at Meta in the middle and many other projects he’s been working on, the ModelKeeper submission was pushed back a couple of times. In hindsight, the extra time significantly improved the quality of the work. While the setting considered in this paper is cloud computing, ModelKeeper is likely going to be an integral part of the greater FedScale project now to speed up federated learning as well.

ModelKeeper is yet another collaboration between Harsha and myself. Hopefully, we will continue to collaborate more even after Harsha moves to USC in Winter 2023.

Zeus

With ever-increasing model sizes, the cost of DNN training is increasing rapidly. While the monetary cost is discussed often, there is an implicit energy cost of DNN training as well. For example, training the GPT-3 model consumes 1,287 megawatt-hours (MWh), which is equivalent to 120 years of electricity consumption for an average U.S. household. In this pioneering work, we take the first step in better understanding and then optimizing the energy consumption of DNN training. Specifically, we optimize the batch size and GPU power cap of recurring training jobs to provide a better tradeoff between energy consumed and accuracy attained.
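As a rough illustration of the knob, the sketch below scores each (batch size, GPU power cap) configuration of a recurring job by a weighted combination of its measured energy and time cost and returns the cheapest one. The cost form, the weights, and the numbers are placeholders for illustration rather than Zeus’s exact formulation.

```python
def pick_config(measurements, eta=0.5, max_power_w=300):
    """measurements: dict mapping (batch_size, power_cap_w) ->
    (energy_joules, time_seconds) observed for one recurring training job.

    Scores each configuration with a weighted energy/time cost:
        cost = eta * energy + (1 - eta) * max_power * time
    eta = 1 optimizes purely for energy; eta = 0 purely for time.
    """
    def cost(stats):
        energy, seconds = stats
        return eta * energy + (1 - eta) * max_power_w * seconds

    return min(measurements, key=lambda cfg: cost(measurements[cfg]))

# Toy usage with made-up numbers.
configs = {
    (128, 250): (5.0e6, 3600),
    (256, 200): (4.2e6, 4100),
    (512, 150): (3.9e6, 5200),
}
print(pick_config(configs, eta=0.7))
```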

Training deep neural networks (DNNs) is becoming increasingly more resource- and energy-intensive every year. Unfortunately, existing works primarily focus on optimizing DNN training for faster completion, often without considering the impact on energy efficiency.

In this paper, we observe that common practices to improve training performance can often lead to inefficient energy usage. More importantly, we demonstrate that there is a tradeoff between energy consumption and performance optimization. To this end, we propose an optimization framework, Zeus, to navigate this tradeoff by automatically configuring job- and GPU-level configurations of recurring DNN training jobs. Zeus uses an online exploration-exploitation approach in conjunction with just-in-time energy profiling, averting the need for expensive offline measurements, while adapting to workload changes and data drifts over time. Our evaluation shows that Zeus can improve the energy efficiency of DNN training by 18.7%-72.8% for diverse workloads.
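For flavor, here is a minimal exploration-exploitation loop in the spirit of the abstract: treat each candidate batch size as an arm, observe the cost of each recurrence, and mostly replay the best arm while occasionally exploring. This generic epsilon-greedy stand-in is an assumption on my part, not Zeus’s actual online optimizer.

```python
import random

def online_tuner(batch_sizes, run_job, rounds=50, epsilon=0.1):
    """run_job(batch_size) -> observed cost of one recurrence (lower is better)."""
    totals = {b: 0.0 for b in batch_sizes}
    counts = {b: 0 for b in batch_sizes}
    for _ in range(rounds):
        untried = [b for b in batch_sizes if counts[b] == 0]
        if untried:
            choice = untried[0]                     # profile every arm once
        elif random.random() < epsilon:
            choice = random.choice(batch_sizes)     # explore
        else:                                       # exploit the best arm so far
            choice = min(batch_sizes, key=lambda b: totals[b] / counts[b])
        cost = run_job(choice)
        totals[choice] += cost
        counts[choice] += 1
    return min(batch_sizes, key=lambda b: totals[b] / max(counts[b], 1))
```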

Zeus started sometime around Fall 2020/Winter 2021 with Jimmy. At the end of Winter, when Jimmy left for his internship, we had a basic idea of a problem and one motivating plot that’d eventually drive our efforts. With the arrival of Jae-Won in Fall 2021 as a first-year student and Jimmy back from Meta, we picked up the pace, which eventually led to its submission. Zeus is the first Treehouse project, and my first foray into energy-related anything. We had a lot to learn, but I was in the capable hands of Jimmy and Jae-Won, who learned and taught me much. And we haven’t even scratched the surface!

To work on many more exciting projects like these, join SymbioticLab!

Justitia Accepted to Appear at NSDI’2022

The need for higher throughput and lower latency is driving kernel-bypass networking (KBN) in datacenters. Of the two related trends in KBN, hardware-based KBN is especially challenging because, unlike software KBN such as DPDK, it does not provide any control once a request is posted to the hardware. RDMA, which is the most prevalent form of hardware KBN in practice, is completely opaque. RDMA NICs (RNICs), whether they are using InfiniBand, RoCE, or iWARP, have fixed-function scheduling algorithms programmed in them. Like any other networking component, they also suffer from performance isolation issues when multiple applications compete for RNIC resources. Justitia is our attempt at introducing software control in hardware KBN.

Kernel-bypass networking (KBN) is becoming the new norm in modern datacenters. While hardware-based KBN offloads all dataplane tasks to specialized NICs to achieve better latency and CPU efficiency than software-based KBN, it also takes away the operator’s control over network sharing policies. Providing policy support in multi-tenant hardware KBN brings unique challenges — namely, preserving ultra-low latency and low CPU cost, finding a well-defined point of mediation, and rethinking traffic shapers. We present Justitia to address these challenges with three key design aspects: (i) Split Connection with message-level shaping, (ii) sender-based resource mediation together with receiver-side updates, and (iii) passive latency monitoring. Using a latency target as its knob, Justitia enables multi-tenancy policies such as predictable latencies and fair/weighted resource sharing. Our evaluation shows Justitia can effectively isolate latency-sensitive applications at the cost of slightly decreased utilization and ensure that throughput and bandwidth of the rest are not unfairly penalized.
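To give a feel for sender-side mediation with message-level shaping, the toy pacer below splits large messages into fixed-size chunks and releases each chunk only when a token bucket allows it; a latency target would determine the pacing rate. It is a conceptual sketch in plain Python, not Justitia’s implementation.

```python
import time

class SenderPacer:
    """Token-bucket pacer that shapes large messages at message granularity."""

    def __init__(self, rate_bytes_per_s, chunk_bytes=1 << 20):
        self.rate = rate_bytes_per_s
        self.chunk = chunk_bytes
        self.tokens = 0.0
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def send(self, message_bytes, post_to_nic):
        """Split the message into chunks and post each one when tokens allow."""
        sent = 0
        while sent < message_bytes:
            size = min(self.chunk, message_bytes - sent)
            self._refill()
            while self.tokens < size:            # wait for the bucket to refill
                time.sleep((size - self.tokens) / self.rate)
                self._refill()
            self.tokens -= size
            post_to_nic(sent, size)              # hand one chunk to the RNIC
            sent += size
```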

Yiwen started working on this problem when we first observed RDMA isolation issues in Infiniswap. He even wrote a short paper in KBNets 2017 based on his early findings. Yue worked on it for quite a few months before she went to Princeton for her Ph.D. Brent has been helping us get this work into shape since the beginning. It’s been a long and arduous road; every time we fixed something, new reviewers didn’t like something else. Finally, an NSDI revision allowed us to directly address the most pressing concerns. Without commenting on how much the paper has improved after all these iterations, I can say that adding revisions to NSDI has saved us, especially Yiwen, a lot of frustration. For what it’s worth, Justitia now has the notorious distinction of my current record for accepted-after-N-submissions; it’s been so long that I’ve lost track of the exact value of N!

Kayak Accepted to Appear at NSDI’2021

As memory disaggregation, and resource disaggregation in general, becomes popular, one must make a call about whether to keep moving data from remote memory or to sometimes ship compute to the remote data instead. Nor is this problem unique to disaggregated datacenters: the notion of data locality and its associated challenges are rooted in the same observation. As the ratio between the compute and communication needs of an application changes, and as the speed of the network changes over time, the answer has changed many times. Oftentimes, the solutions boil down to understanding workload and network characteristics and making a call: ship compute or ship data for that workload on that network. But what if the workload is dynamic? Kayak dynamically decides between the two options, making the right call at the right time.

How cloud applications should interact with their data remains an active area of research. Over the last decade, many have suggested relying on a key-value (KV) interface to interact with data stored in remote storage servers, while others have vouched for the benefits of using remote procedure call (RPC). Instead of choosing one over the other, in this paper, we observe that an ideal solution must adaptively combine both of them in order to maximize throughput while meeting application latency requirements. To this end, we propose a new system called Kayak that proactively adjusts the rate of requests and the fraction of requests to be executed using RPC or KV, all in a fully decentralized and self-regulated manner. We theoretically prove that Kayak can quickly converge to the optimal parameters. We implement a system prototype of Kayak. Our evaluations show that Kayak achieves sub-second convergence and improves overall throughput by 32.5%-63.4% for compute-intensive workloads and up to 12.2% for non-compute-intensive and transactional workloads over the state-of-the-art.
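A simplified sketch of the knob being adjusted: keep a fraction of requests executed as RPC (shipping compute) versus KV (shipping data), nudge that fraction in whichever direction improved observed throughput, and back off the offered rate whenever tail latency violates the SLO. This hill-climbing controller is a stand-in for Kayak’s actual decentralized control loop.

```python
def adjust(x_rpc, rate, prev_tput, cur_tput, last_dir, p99_ms, slo_ms, step=0.05):
    """One control iteration; returns the updated (x_rpc, rate, direction).

    x_rpc: fraction of requests executed via RPC; the rest use the KV interface.
    """
    # Hill-climb on the RPC fraction: keep moving the same way if throughput
    # improved since the last iteration, otherwise reverse direction.
    direction = last_dir if cur_tput >= prev_tput else -last_dir
    x_rpc = min(1.0, max(0.0, x_rpc + direction * step))

    # Back off the request rate multiplicatively when the SLO is violated;
    # otherwise probe for more throughput additively.
    rate = rate * 0.8 if p99_ms > slo_ms else rate + 100
    return x_rpc, rate, direction
```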

The core idea behind Kayak is all due to Jie (Jimmy), who started the project in early 2019. The project picked up significant tailwind when Xin and his student Jingfeng joined us in the summer of 2020 to help us better understand the problem from an analytical perspective too, in addition to better positioning the work. I’m super excited about the possibilities and Kayak’s potential impact on resource disaggregation systems.

The NSDI PC this year accepted 40 out of 255 submissions for the Fall deadline, resulting in a lower acceptance rate than last year.

Sol and Pando Accepted to Appear at NSDI’2020

With the advent of edge analytics and federated learning, the need for distributed computation and storage is only going to increase in coming years. Unfortunately, existing solutions for analytics and machine learning have focused primarily on datacenter environments. When these solutions are applied to wide-area scenarios, their compute efficiency decreases and storage overhead increases. Neither is suitable for pervasive storage and computation throughout the globe. In this iteration of NSDI, we have two papers to address the compute and storage aspects of emerging wide-area computing workloads.

Sol

Sol is a federated execution engine that can execute low-latency computation across a variety of network conditions. The key insight here is that modern execution engines (e.g., those used by Apache Spark or TensorFlow) have implicit assumptions about low-latency and high-bandwidth networks. Consequently, in poor network conditions, the overhead of coordinating work outweighs the work being executed. The end result, interestingly enough, is CPU underutilization, because workers spend more time waiting for new work to be assigned by the centralized coordinator/master than doing the work. Our solution is API-compatible with Apache Spark so that any existing jobs (SQL, ML, or Streaming) can run on Sol with significant performance improvements in edge computing and federated learning scenarios.

The popularity of big data and AI has led to many optimizations at different layers of distributed computation stacks. Despite – or perhaps, because of – its role as the narrow waist of such software stacks, the design of the execution engine, which is in charge of executing every single task of a job, has mostly remained unchanged. As a result, the execution engines available today are ones primarily designed for low latency and high bandwidth datacenter networks. When either or both of the network assumptions do not hold, CPUs are significantly underutilized.

In this paper, we take a first-principles approach toward developing an execution engine that can adapt to diverse network conditions. Sol, our federated execution engine architecture, flips the status quo in two respects. First, to mitigate the impact of high latency, Sol proactively assigns tasks, but does so judiciously to be resilient to uncertainties. Second, to improve the overall resource utilization, Sol decouples communication from computation internally instead of committing resources to both aspects of a task simultaneously. Our evaluations on EC2 show that, compared to Apache Spark in resource-constrained networks, Sol improves SQL and machine learning jobs by 16.4× and 4.2× on average.
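To illustrate the idea of hiding coordination latency behind computation, the sketch below keeps a small queue of tasks pre-pushed to each worker so that round trips to a far-away coordinator overlap with useful work. The queue depth, function names, and threading model are illustrative assumptions, not Sol’s design.

```python
import queue
import threading

def run_worker(fetch_tasks, execute, pipeline_depth=4):
    """Overlap coordination latency with computation by prefetching tasks.

    fetch_tasks(n): blocking call to the (possibly distant) coordinator that
                    returns up to n tasks, or [] when the job is done.
    execute(task):  run one task locally.
    """
    local_q = queue.Queue(maxsize=pipeline_depth)

    def feeder():
        while True:
            tasks = fetch_tasks(pipeline_depth)   # ask for a pipeline's worth
            if not tasks:
                local_q.put(None)                 # sentinel: no more work
                return
            for t in tasks:
                local_q.put(t)                    # blocks when the pipeline is full

    threading.Thread(target=feeder, daemon=True).start()
    while True:
        task = local_q.get()
        if task is None:
            break
        execute(task)   # the CPU stays busy while the feeder waits on round trips
```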

This is Fan’s first major paper, and I’m very proud to see him open his book. I would like to thank Jimmy and Allen for their support in getting it done, and Harsha for the enormous amount of time and effort he put in to make Sol successful. I believe Sol will have significant impact on the emerging fields of edge analytics and federated learning.

Pando

Pando started with a simple idea when Harsha and I were discussing how to apply erasure coding to mutable data. It ended up being so much more. Not only have we designed an erasure-coded state machine, we have also identified the theoretical tradeoff limits for read latency, write latency, and storage overhead. In the process, we show that erasure coding is the way to get closer to the limits that replication-based systems cannot reach. Moreover, Pando can dynamically switch between both depending on the goals it has to achieve.

By replicating data across sites in multiple geographic regions, web services can maximize availability and minimize latency for their users. However, when sacrificing data consistency is not an option, we show that service providers today have to incur significantly higher cost to meet desired latency goals than the lowest cost theoretically feasible. We show that the key to addressing this sub-optimality is to 1) allow for erasure coding, not just replication, of data across data centers, and 2) mitigate the resultant increase in read and write latencies by rethinking how to enable consensus across the wide-area network. Our extensive evaluation mimicking web service deployments on the Azure cloud service shows that we enable near-optimal latency versus cost tradeoffs.
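As a back-of-the-envelope illustration of the latency versus storage tradeoff, the sketch below treats a k-of-n erasure-coded read as completing once the k fastest sites respond and charges n/k storage overhead; replication is the special case k = 1. It ignores writes and consensus entirely and is not Pando’s protocol, just a way to see why coding can approach limits that replication cannot.

```python
def coded_read_latency(site_rtts_ms, k):
    """A k-of-n read completes once the k fastest of the n sites respond."""
    return sorted(site_rtts_ms)[k - 1]

def storage_overhead(n, k):
    """Each object of size S occupies (n / k) * S bytes across sites."""
    return n / k

rtts = [12, 18, 25, 40, 55, 70]      # hypothetical RTTs (ms) to six data centers
for k in (1, 2, 3):                   # k = 1 corresponds to plain replication
    print(k, coded_read_latency(rtts, k), storage_overhead(len(rtts), k))
```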

While Muhammed is advised by Harsha, I have had the opportunity to work with him since 2016 (first on a project that failed; on this one, from 2017). Many of the intricacies of the protocol are outside my expertise, but I learned a lot from Harsha and Muhammed. I’m also glad to see that our original idea of mutable erasure-coded data has come to fruition in a much stronger form than what Harsha and I devised as the summer of 2017 was coming to an end. Btw, Pando now has the notorious distinction of my current record for accepted-after-N-submissions; fifth time’s the charm!

The NSDI PC this year accepted 48 out of 275 submissions for the Fall deadline, increasing the acceptance rate from last year’s poor showing.

Tiresias Accepted to Appear at NSDI’2019

With the advancement of AI in recent years, GPUs have emerged as a popular choice for training deep learning (DL) models on large datasets. To deal with ever-growing datasets, it is also common to run distributed deep learning over multiple GPUs in parallel. Achieving cost-effectiveness and high performance in these clusters relies on efficiently sharing resources between multiple users. Unfortunately, most GPU clusters in production rely on resource managers designed for traditional big data analytics. This results in suboptimal performance and strong, but unnecessary, constraints. Tiresias is our first attempt at designing a GPU cluster resource manager that relies on profiling to make good scheduling and placement decisions with little or no input from the users.

Distributed training of deep learning (DL) models on GPU clusters is becoming increasingly more popular. Existing cluster managers face some unique challenges from DL training jobs, such as unpredictable training times, an all-or-nothing execution model, and inflexibility in GPU sharing. Our analysis of a large GPU cluster in production shows that existing big data schedulers — coupled with consolidated job placement constraint, whereby GPUs for the same job must be allocated in as few machines as possible — cause long queueing delays and low overall performance.

We present Tiresias, a GPU cluster resource manager tailored for distributed DL training jobs, which efficiently schedules and places DL jobs to reduce their job completion times (JCT). Given that a DL job’s execution time is often unpredictable, we propose two scheduling algorithms — Discretized Two-Dimensional Gittins Index relies on partial information and Discretized Two-Dimensional LAS is information-agnostic — that aim to minimize the average JCT. Additionally, we describe when the consolidated placement constraint can be relaxed and present a placement algorithm to leverage these observations without any user input. Experiments on a cluster with 60 P100 GPUs — and large-scale trace-driven simulations — show that Tiresias improves the average JCT by up to 5.5X over an Apache YARN-based resource manager used in production. More importantly, Tiresias’s performance is comparable to that of solutions assuming perfect knowledge.
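A bare-bones sketch of the information-agnostic flavor (Discretized Two-Dimensional LAS): a job’s attained service is its GPU count multiplied by its executed time, and jobs drop to lower-priority queues as that product crosses fixed thresholds, with FIFO order within a queue. The thresholds and tie-breaking here are illustrative assumptions rather than the paper’s tuned design.

```python
def queue_index(num_gpus, executed_seconds, thresholds=(3600, 36000)):
    """Map a job to a priority queue by its attained 2D service (GPU-seconds)."""
    attained = num_gpus * executed_seconds
    for i, limit in enumerate(thresholds):
        if attained < limit:
            return i              # smaller index means higher priority
    return len(thresholds)        # lowest-priority queue

def pick_next(jobs):
    """jobs: list of dicts with 'gpus', 'executed_s', and 'arrival' fields.
    Serve the highest-priority queue first, FIFO (by arrival) within a queue."""
    return min(jobs, key=lambda j: (queue_index(j["gpus"], j["executed_s"]),
                                    j["arrival"]))
```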

This is Juncheng’s second NSDI paper after Infiniswap in NSDI’17, and a very proud moment for me as his advisor. I would like to thank all our collaborators. I would also like to thank Samir and Barna for inviting me to the TTIC Summer Workshop on Datacenter Scheduling, where I heard Mor’s talk on SOAP that inspired our use of Gittins Index-based scheduling in this context for the partial-information case. The application of LAS was inspired by my earlier work on information-agnostic coflow scheduling.

This year, the NSDI PC accepted 49 out of 332 submissions across the Spring (19/92) and Fall (30/240) deadlines, a somewhat lower acceptance rate than in recent years.

Infiniswap Accepted to Appear at NSDI’2017

Update: Camera-ready version is available here. Infiniswap code is now on GitHub!

As networks become faster, the difference between remote and local resources is blurring every day. How can we take advantage of these blurred lines? This is the key observation behind resource disaggregation and, to some extent, rack-scale computing. In this paper, we take our first stab at making memory disaggregation practical by exposing remote memory to unmodified applications. While there have been several proposals and feasibility studies in recent years, to the best of our knowledge, this is the first concrete step in making it real.

Memory-intensive applications suffer large performance loss when their working sets do not fully fit in memory. Yet, they cannot leverage otherwise unused remote memory when paging out to disks even in the presence of large imbalance in memory utilizations across a cluster. Existing proposals for memory disaggregation call for new architectures, new hardware designs, and/or new programming models, making them infeasible.

This paper describes the design and implementation of Infiniswap, a remote memory paging system designed specifically for an RDMA network. Infiniswap opportunistically harvests and transparently exposes unused memory to unmodified applications by dividing the swap space of each machine into many slabs and distributing them across many machines’ remote memory. Because RDMA operations bypass remote CPUs, Infiniswap leverages the power of many choices to perform decentralized slab placements and evictions.

We have implemented and deployed Infiniswap on an RDMA cluster without any OS modifications and evaluated its effectiveness using multiple workloads running on unmodified VoltDB, Memcached, PowerGraph, GraphX, and Apache Spark. Using Infiniswap, throughputs of these applications improve by 7.1X (0.98X) to 16.3X (9.3X) over disk (Mellanox nbdX), and median and tail latencies by 5.5X (2X) to 58X (2.2X). Infiniswap does so with negligible remote CPU usage, whereas nbdX becomes CPU-bound. Infiniswap increases the overall memory utilization of a cluster and works well at scale.
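A tiny sketch of the “power of many choices” placement mentioned above: to place a slab, probe a handful of machines chosen at random and pick the one reporting the most free memory, with no central coordinator involved. The probe count and the free-memory oracle are assumptions for illustration.

```python
import random

def place_slab(machines, free_memory, probes=2):
    """Decentralized slab placement via the power of (a few) random choices.

    machines:    list of machine IDs.
    free_memory: dict mapping machine ID -> free bytes, as reported by probes.
    """
    candidates = random.sample(machines, min(probes, len(machines)))
    return max(candidates, key=lambda m: free_memory[m])

# Toy usage.
cluster = ["m1", "m2", "m3", "m4"]
free = {"m1": 8 << 30, "m2": 2 << 30, "m3": 16 << 30, "m4": 1 << 30}
print(place_slab(cluster, free, probes=2))
```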

This work started as a class project for EECS 582 in the Winter when I gave the idea to Juncheng Gu and Youngmoon Lee, who made the pieces into a whole. Over the summer, Yiwen Zhang, an enterprising and excellent undergraduate, joined the project and helped us get it done in time.

This year the NSDI PC accepted 46 out of 255 papers. This happens to be my first paper with an all-blue cast! I want to thank Kang for giving me complete access to Juncheng and Youngmoon; it’s been great collaborating with them. I’m also glad that Yiwen has decided to start a Master’s and stay with us for longer, and more importantly, our team will remain intact for many more exciting followups in this emerging research area.

If this excites you, come join our group!

HUG Accepted to Appear at NSDI’2016

Update: Camera-ready version is available here now!

With the advent of cloud computing and datacenter-scale applications, simultaneously dealing with multiple resources is the new norm. When multiple parties have multi-resource demands, fairly dividing these resources (for some notion of fairness) is a core challenge in the resource allocation literature. Dominant Resource Fairness (DRF), which appeared in NSDI’2011, was the first work to point out the ensuing challenges; it extended traditional, single-resource max-min fairness to the multi-resource environment.

While DRF guarantees many nice properties to ensure fairness, it gives up work conservation; we show that its utilization can be arbitrarily low in the worst case. Over the past five years, many software and research proposals have built upon DRF under the assumption that DRF can trivially be made work-conserving without losing any of its properties.

After five years, we prove that not to be the case. We show that there is a fundamental tradeoff between guaranteed resources and work conservation, and that naively adding work conservation on top of DRF can actually decrease utilization instead of increasing it. We also characterize the maximum possible utilization while maintaining DRF’s optimal guarantees and propose a new algorithm, High Utilization with Guarantees (HUG), that generalizes single- and multi-resource max-min fairness as well as the multi-tenant network sharing literature under a unifying framework.

In this paper, we study how to optimally provide isolation guarantees in multi-resource environments, such as public clouds, where a tenant’s demands on different resources (links) are correlated in order for her to make progress. Unlike prior work such as Dominant Resource Fairness (DRF) that assumes demands to be static and fixed, we consider elastic demands. Our approach generalizes canonical max-min fairness to the multi-resource setting with correlated demands, and extends DRF to elastic demands. We consider two natural optimization objectives: isolation guarantee from a tenant’s viewpoint and system utilization (work conservation) from an operator’s perspective. We prove that in non-cooperative environments like public cloud networks, there is a strong tradeoff between optimal isolation guarantee and work conservation when demands are elastic. Even worse, work conservation can even decrease network utilization instead of improving it when demands are inelastic. We identify the root cause behind the tradeoff and present a provably optimal allocation algorithm, High Utilization with Guarantees (HUG), to achieve maximum attainable network utilization without sacrificing the optimal isolation guarantee, strategy-proofness, and other useful properties of DRF. In cooperative environments like private datacenter networks, HUG achieves both the optimal isolation guarantee and work conservation. Analyses, simulations, and experiments show that HUG provides better isolation guarantees, higher system utilization, and better tenant-level performance than its counterparts.
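For intuition about the guarantee side, the toy calculation below computes the largest common dominant share that a set of tenants with correlated (per-link) demands can all be guaranteed, which is the DRF-style baseline that HUG preserves while pushing utilization higher. It is a worked example under simplifying assumptions, not the paper’s algorithm.

```python
def drf_progress(demands, capacity):
    """Max-min on dominant shares: find the largest common dominant share s such
    that, when every tenant runs at share s, no link exceeds its capacity.
    Returns each tenant's progress (the fraction of its demand vector it runs)."""
    # dom_i: dominant share of tenant i's full demand vector.
    doms = [max(d / c for d, c in zip(dem, capacity)) for dem in demands]
    # Link k carries sum_i (s / dom_i) * demands[i][k]; cap it at capacity[k].
    s = min(
        c_k / sum(dem[k] / doms[i] for i, dem in enumerate(demands))
        for k, c_k in enumerate(capacity)
    )
    return [s / dom for dom in doms]

# Two tenants, two unit-capacity links, correlated demands.
print(drf_progress([[1.0, 0.5], [0.2, 1.0]], [1.0, 1.0]))   # ~[0.67, 0.67]
```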

This work is the result of a close collaboration with Zhenhua Liu with help from Ion Stoica and Ali Ghodsi.

This year the NSDI PC accepted 45 out of 228 submissions with a 19.74% acceptance rate. This work forms the basis of the last chapter of my dissertation, ensuring SIGCOMM or NSDI publications from all its primary chapters!

Spark has been accepted at NSDI’2012

Our paper “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing” has been accepted at NSDI’2012. This is Matei’s brainchild and joint work by a lot of people including, but not limited to, TD, Ankur, Justin, Murphy, and Professors Ion Stoica, Scott Shenker, and Michael Franklin. Unlike many other systems papers, Spark is actively developed and used by many people. You can also download and use it in no time to solve all your problems; well, at least the ones that require analyzing big data in little time. In this paper, we focus on the concept of resilient distributed datasets, or RDDs, and show how we can perform fast, in-memory iterative and interactive jobs with low-overhead fault tolerance.

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including current specialized programming models for iterative jobs like Pregel. We have implemented RDDs in a system called Spark, which we evaluate through a variety of benchmarks and user applications.
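For readers who have not tried it, here is the canonical flavor of the RDD API as exposed in PySpark: load a log file, keep only the error lines, cache them in memory, and reuse the cached RDD across queries. The path and the query strings are made up for illustration.

```python
from pyspark import SparkContext

sc = SparkContext(appName="log-mining")

lines = sc.textFile("hdfs://namenode/logs/app.log")           # hypothetical path
errors = lines.filter(lambda line: line.startswith("ERROR")).cache()

print(errors.count())                                   # first action caches the RDD
print(errors.filter(lambda l: "timeout" in l).count())  # reuses the in-memory RDD
```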

The NSDI’2012 PC accepted 30 out of 169 papers. In other news, Berkeley will have a big presence at NSDI this time with several other papers. Go Bears!!!