Tag Archives: GPU

Fluid Accepted to Appear at MLSys’2021

While training and inference of deep learning models have received significant attention in recent years (e.g., Tiresias, AlloX, and Salus from our group), hyperparameter tuning is often overlooked or lumped into the same bucket of optimizations as training. Existing hyperparameter tuning solutions, primarily from the ML research community, are mostly resource-agnostic. More importantly, even if they try to use up all available resources, existing solutions do not distinguish between the throughput of a GPU (how much work a GPU is doing) and its goodput (how much of that is ultimately useful work) during hyperparameter tuning. Fluid is our attempt at bridging the gap between hyperparameter tuning algorithms and the underlying cluster resources by improving both intra- and inter-GPU goodput in large clusters.

Current hyperparameter tuning solutions lack complementary execution engines that efficiently leverage distributed computation; by ignoring the possibility of intra- and inter-GPU sharing, they exhibit poor resource usage. In this paper, we present Fluid, a generalized hyperparameter tuning execution engine that coordinates between hyperparameter tuning jobs and cluster resources. Fluid schedules evaluation trials in such jobs using a water-filling approach to make the best use of resources at both intra- and inter-GPU granularities to speed up the tuning process. By abstracting a hyperparameter tuning job as a sequence of TrialGroups, Fluid can boost the performance of diverse hyperparameter tuning solutions. Our experiments show that Fluid can speed up synchronous BOHB by 200%, and BOHB and ASHA by 30%, while achieving similar final accuracy.
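To make the water-filling idea a bit more concrete, here is a minimal, self-contained sketch (illustrative names only, not Fluid's actual API) of dividing a pool of GPUs among trials that have different scaling limits. The allocator keeps raising every active trial's share until the trial hits its cap or the GPUs run out; fractional shares correspond to packing multiple trials onto one GPU, while shares above one correspond to multi-GPU data parallelism.

```python
# Minimal water-filling sketch: scaling_caps[i] is the most GPUs trial i
# can use productively (hypothetical input, e.g., from profiling).
def water_fill(num_gpus, scaling_caps):
    alloc = [0.0] * len(scaling_caps)
    remaining = float(num_gpus)
    active = [i for i, cap in enumerate(scaling_caps) if cap > 0]
    while remaining > 1e-9 and active:
        share = remaining / len(active)        # raise the "water level"
        still_active = []
        for i in active:
            take = min(share, scaling_caps[i] - alloc[i])
            alloc[i] += take
            remaining -= take
            if alloc[i] < scaling_caps[i] - 1e-9:
                still_active.append(i)         # trial can still absorb more
        active = still_active
    return alloc

# 4 GPUs, 3 trials: one tiny trial capped at half a GPU and two that can
# scale up to 4 GPUs each -> roughly [0.5, 1.75, 1.75].
print(water_fill(4, [0.5, 4.0, 4.0]))
```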

Fluid is a joint project between Peifeng and Jiachen, which started right after Salus and before Jiachen started her Ph.D.! I’m super excited about many future works in the Systems + AI area from SymbioticLab members.

Presented Keynote Speech at HotEdgeVideo’2020

Earlier this week, I presented a keynote speech on the state of resource management for deep learning at the HotEdgeVideo’2020 workshop, covering our recent works on systems support for AI (Tiresias, AlloX, and Salus) and discussing open challenges in this space.

In this talk, I highlighted three aspects of resource management in the context of deep learning that I think make this setting unique even after decades of resource management research on CPUs, networks, and even Big Data clusters.

  1. Short-term predictability and long-term uncertainty of deep learning workloads: Deep learning workloads (hyperparameter tuning, training, and inference) may have different objectives, but they all share a common trend. Although we cannot know how many iterations there will be in a job or how many requests will come to a model for inference, each iteration/request performs the same computation for a given job. This means we can profile a job and then exploit that information for long-term benefits by adapting classic information-agnostic and information-limited scheduling techniques (see the sketch after this list).
  2. Heterogeneous, interchangeable compute devices in deep learning clusters: Deep learning clusters are becoming increasingly diverse, with many generations of GPUs and new hardware accelerators coming out every month. The key resource management challenge here is that all these compute devices are interchangeable (they all can compute), but they do not compute at the same rate for all models. Some models are better suited to CPUs, some to GPUs, some to TPUs, and so on. We need to rethink resource management algorithms to account for resource interchangeability.
  3. Black-box hardware accelerators: Deep learning hardware devices are also black boxes. Even for GPUs, we do not have any control over their internals; apart from some high-level information that is publicly available, we don’t know anything about what happens inside. For newer, vendor-locked accelerators, details are even scarcer. Consequently, resource management solutions should be designed to assume black-box hardware from the get-go and then rely on profiling (by leveraging the iterative nature of deep learning) and short-term predictability to extract good performance using software techniques.
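As a small illustration of the first point (my own sketch, not something from the talk), exploiting short-term predictability can be as simple as timing a handful of iterations and handing the estimate to a scheduler; run_one_iteration below is a hypothetical stand-in for a single training step or inference request.

```python
import time

def profile_iteration_time(run_one_iteration, warmup=3, samples=10):
    """Estimate the steady-state time of one iteration of a given job."""
    for _ in range(warmup):          # discard warm-up effects (caching, autotuning)
        run_one_iteration()
    start = time.perf_counter()
    for _ in range(samples):
        run_one_iteration()
    return (time.perf_counter() - start) / samples   # seconds per iteration
```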

My slides from this talk are publicly available and elaborate on these points in more detail.

AlloX Accepted to Appear at EuroSys’2020

While GPUs are always in the news when it comes to deep learning clusters (e.g., Salus or Tiresias), we are in the midst of an emergence of many more computing devices (e.g., FPGAs and problem-specific accelerators) alongside traditional CPUs. All of them are compute devices, but one cannot expect the same speed from all of them for all types of computations. A natural question, therefore, is how to allocate such interchangeable resources in a hybrid cluster. AlloX is our attempt at a reasonable answer.

Modern deep learning frameworks support a variety of hardware, including CPU, GPU, and other accelerators, to perform computation. In this paper, we study how to schedule jobs over such interchangeable resources – each with a different rate of computation – to optimize performance while providing fairness among users in a shared cluster. We demonstrate theoretically and empirically that existing solutions and their straightforward modifications perform poorly in the presence of interchangeable resources, which motivates the design and implementation of AlloX. At its core, AlloX transforms the scheduling problem into a min-cost bipartite matching problem and provides dynamic fair allocation over time. We theoretically prove its optimality in an ideal, offline setting and show empirically that it works well in the online scenario by integrating it with Kubernetes. Evaluations on a small-scale CPU-GPU hybrid cluster and large-scale simulations highlight that AlloX can reduce the average job completion time significantly (by up to 95% when the system load is high) while providing fairness and preventing starvation.
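To give a rough feel for the transformation (a sketch of the underlying idea under stated assumptions, not AlloX's implementation), suppose profiling gives estimated processing times p[i][d] for job i on device type d. A job placed k-th from the end of a device's queue delays itself and the k-1 jobs after it, so it contributes k times its processing time to the sum of completion times; minimizing the average JCT for the current batch of jobs then reduces to a min-cost bipartite matching between jobs and (device, position) slots.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def allocate(p):
    """p[i][d]: estimated processing time of job i on device type d."""
    n_jobs, n_devs = p.shape
    # Slot (d, k) = "k-th position from the end of device d's queue";
    # occupying it costs k * p[i][d] toward the total completion time.
    cost = np.zeros((n_jobs, n_devs * n_jobs))
    for d in range(n_devs):
        for k in range(1, n_jobs + 1):
            cost[:, d * n_jobs + (k - 1)] = k * p[:, d]
    rows, cols = linear_sum_assignment(cost)   # min-cost bipartite matching
    schedule = []
    for job, slot in zip(rows, cols):
        device, k = divmod(slot, n_jobs)
        schedule.append((job, device, k + 1))  # k=1 means it runs last on its device
    return schedule

# 3 jobs, 2 device types (CPU, GPU) with different speeds: job 0 gets the
# GPU; jobs 2 and 1 share the CPU in shortest-first order.
p = np.array([[9.0, 3.0],
              [4.0, 4.0],
              [2.0, 6.0]])
print(allocate(p))
```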

AlloX has been in the oven for more than two years, and it is a testament to Tan and Xiao’s tenacity. It’s also my second paper with Zhenhua after HUG, our first joint work published from our recent NSF project, and my first paper at EuroSys. I believe the results in this paper will give rise to new analyses of many systems that have interchangeable resources, such as DRAM-NVM hybrid systems or the storage/caching hierarchy.

This year the EuroSys PC accepted 43 out of 234 submissions.

Salus Accepted to Appear at MLSys’2020

With the rising popularity of deep learning, GPUs have become increasingly prevalent in recent years. Modern GPUs are extremely powerful, with a lot of resources. A key challenge in this context is making sure that these devices are highly utilized. Although there has been a lot of research on improving GPU efficiency at the cluster level (e.g., our Tiresias in NSDI’19), little is known about how well individual GPUs are being utilized today. Worse, even if they are underutilized, little can be done because GPUs are opaque black boxes without any primitives for sharing them. Existing mechanisms for GPU sharing, such as NVIDIA MPS, are coarse-grained and cannot leverage application-specific information. Salus is our foray into the GPU sharing domain: it provides two key sharing primitives that allow one to develop a variety of algorithms and improve GPU efficiency for training, inference, and hyperparameter tuning workloads.

Unlike traditional resources such as CPU or the network, modern GPUs do not natively support fine-grained sharing primitives. Consequently, implementing common policies such as time-sharing and preemption is expensive. Worse, when a deep learning (DL) application cannot completely use a GPU’s resources, the GPU cannot be efficiently shared between multiple applications, leading to GPU underutilization.

We present Salus to achieve fine-grained GPU sharing among multiple DL applications by enabling two GPU sharing primitives: fast job switching and memory sharing. Salus is an efficient, consolidated execution service that exposes a GPU to different DL applications, and enforces fine-grained sharing by performing iteration scheduling and addressing associated memory management issues. We show that these primitives can then be used to implement flexible sharing policies. Our integration of Salus with TensorFlow and evaluation on popular DL jobs shows that Salus can improve the average completion time of DL training jobs by 3.19X, GPU utilization for hyper-parameter tuning by 2.38X, and GPU utilization of DL inference applications by 42X over not sharing the GPU and 7X over NVIDIA MPS with small overhead.
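For intuition, here is a toy sketch of iteration-granularity time sharing on a single GPU (not Salus's code, which lives inside the DL framework's execution service): each job is driven one iteration at a time, and the scheduler switches jobs only at iteration boundaries, where switching is cheap; different sharing policies amount to reordering the queue.

```python
from collections import deque

def run_iteration_scheduler(jobs):
    """jobs: dict of name -> iterator whose next() runs one training iteration."""
    queue = deque(jobs.items())
    while queue:
        name, it = queue.popleft()
        try:
            next(it)                  # execute exactly one iteration on the GPU
            queue.append((name, it))  # round-robin; swap in priorities or SRTF here
        except StopIteration:
            print(f"{name} finished")

# Stand-in "training" jobs that just count iterations.
def fake_training(num_iterations):
    for i in range(num_iterations):
        yield i                       # a real job would launch GPU kernels here

run_iteration_scheduler({"resnet": fake_training(3), "vgg": fake_training(2)})
```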

Salus has long been in the making and is the project that first got me into systems for AI and GPU resource management. Peifeng has been diligently working on it since 2017! While it took a long time, I’m excited that it has found a great home, and I’m looking forward to building on top of it. This is Peifeng’s first major paper, and the future is even brighter.

This year’s MLSys accepted 34 papers and remained as highly competitive as its previous iteration.

Tiresias Accepted to Appear at NSDI’2019

With the advancement of AI in recent years, GPUs have emerged as a popular choice for training deep learning (DL) models on large datasets. To deal with ever-growing datasets, it is also common to run distributed deep learning over multiple GPUs in parallel. Achieving cost-effectiveness and high performance in these clusters relies on efficiently sharing resources between multiple users. Unfortunately, most GPU clusters in production rely on resource managers designed for traditional big data analytics. This results in suboptimal performance and strong, but unnecessary, constraints. Tiresias is our first attempt at designing a GPU cluster resource manager that relies on profiling to make good scheduling and placement decisions with little or no input from the users.

Distributed training of deep learning (DL) models on GPU clusters is becoming increasingly more popular. Existing cluster managers face some unique challenges from DL training jobs, such as unpredictable training times, an all-or-nothing execution model, and inflexibility in GPU sharing. Our analysis of a large GPU cluster in production shows that existing big data schedulers — coupled with a consolidated job placement constraint, whereby GPUs for the same job must be allocated on as few machines as possible — cause long queueing delays and low overall performance.

We present Tiresias, a GPU cluster resource manager tailored for distributed DL training jobs, which efficiently schedules and places DL jobs to reduce their job completion times (JCT). Given that a DL job’s execution time is often unpredictable, we propose two scheduling algorithms — Discretized Two-Dimensional Gittins Index relies on partial information and Discretized Two-Dimensional LAS is information-agnostic — that aim to minimize the average JCT. Additionally, we describe when the consolidated placement constraint can be relaxed and present a placement algorithm to leverage these observations without any user input. Experiments on a cluster with 60 P100 GPUs — and large-scale trace-driven simulations — show that Tiresias improves the average JCT by up to 5.5X over an Apache YARN-based resource manager used in production. More importantly, Tiresias’s performance is comparable to that of solutions assuming perfect knowledge.
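To illustrate the information-agnostic variant (a simplified sketch with made-up thresholds, not Tiresias's implementation), Discretized Two-Dimensional LAS treats a job's attained service as the product of its GPU count and running time and demotes the job through a small set of priority queues as that product crosses fixed thresholds, scheduling FIFO within each queue.

```python
# Illustrative thresholds (in GPU-seconds) separating the priority queues.
THRESHOLDS = [3200, 32000]

def queue_index(num_gpus, seconds_run):
    attained = num_gpus * seconds_run        # 2D attained service
    for q, limit in enumerate(THRESHOLDS):
        if attained < limit:
            return q
    return len(THRESHOLDS)                   # lowest-priority queue

def schedule_order(jobs):
    """jobs: (name, num_gpus, seconds_run, arrival); highest priority first."""
    return sorted(jobs, key=lambda j: (queue_index(j[1], j[2]), j[3]))

jobs = [("a", 4, 100, 1.0), ("b", 1, 50000, 0.0), ("c", 8, 600, 2.0)]
print([name for name, *_ in schedule_order(jobs)])   # -> ['a', 'c', 'b']
```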

This is Juncheng’s second NSDI paper after Infiniswap in NSDI’17, and a very proud moment for me as his advisor. I would like to thank all our collaborators. I would also like to thank Samir and Barna for inviting me to the TTIC Summer Workshop on Datacenter Scheduling, where I heard Mor’s talk on SOAP that inspired our use of Gittins Index-based scheduling in this context for the partial-information case. The application of LAS was inspired by my earlier work on information-agnostic coflow scheduling.

This year, the NSDI PC accepted 49 out of 332 submissions across the Spring (19/92) and Fall (30/240) deadlines, for a somewhat lower acceptance rate compared to recent years.