Tag Archives: MLSys

Fluid Accepted to Appear at MLSys’2021

While training and inference of deep learning models have received significant attention in recent years (e.g., Tiresias, AlloX, and Salus from our group), hyperparameter tuning is often overlooked or lumped into the same bucket of optimizations as training. Existing hyperparameter tuning solutions, primarily from the ML research community, are mostly resource-agnostic. More importantly, even when they try to use all available resources, existing solutions do not distinguish between the throughput of a GPU (how much work a GPU is doing) and its goodput (how much of that work is ultimately useful) during hyperparameter tuning. Fluid is our attempt at bridging the gap between hyperparameter tuning algorithms and the underlying cluster resources by improving both intra- and inter-GPU goodput in large clusters.

Current hyperparameter tuning solutions lack complementary execution engines to efficiently leverage distributed computation; by ignoring the possibility of intra- and inter-GPU sharing, they suffer from poor resource usage. In this paper, we present Fluid, a generalized hyperparameter tuning execution engine that coordinates between hyperparameter tuning jobs and cluster resources. Fluid schedules evaluation trials in such jobs using a water-filling approach to make the best use of resources at both intra- and inter-GPU granularities and speed up the tuning process. By abstracting a hyperparameter tuning job as a sequence of TrialGroup, Fluid can boost the performance of diverse hyperparameter tuning solutions. Our experiments show that Fluid can speed up synchronous BOHB by 200%, and BOHB and ASHA by 30%, while achieving similar final accuracy.
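To make the water-filling idea a bit more concrete, here is a minimal, illustrative sketch, not Fluid's actual code: shares of a GPU pool are raised evenly across the trials in a group until a trial saturates its useful parallelism, and the leftover capacity spills over to the trials that can still scale. The Trial class and its max_parallelism field are hypothetical names introduced only for this example.

# Illustrative water-filling allocation of GPU shares to trials in a group.
# Not Fluid's implementation: `Trial` and `max_parallelism` are assumed names.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trial:
    """A tuning trial with a cap on how much parallelism it can usefully consume."""
    trial_id: str
    max_parallelism: float  # beyond this many GPUs, extra resources add little goodput


def water_fill(trials: List[Trial], total_gpus: float) -> Dict[str, float]:
    """Split total_gpus across trials: raise every share evenly, but never give
    a trial more than it can usefully consume; redistribute the surplus."""
    allocation = {t.trial_id: 0.0 for t in trials}
    remaining = total_gpus
    active = list(trials)

    while active and remaining > 1e-9:
        # Try to raise every still-unsaturated trial by an equal increment.
        share = remaining / len(active)
        still_active = []
        for t in active:
            headroom = t.max_parallelism - allocation[t.trial_id]
            grant = min(share, headroom)
            allocation[t.trial_id] += grant
            remaining -= grant
            if allocation[t.trial_id] < t.max_parallelism - 1e-9:
                still_active.append(t)  # can absorb more in the next round
        if len(still_active) == len(active):
            break  # nobody saturated this round, so shares are already equal
        active = still_active

    return allocation


if __name__ == "__main__":
    group = [Trial("a", 1), Trial("b", 4), Trial("c", 2)]
    shares = water_fill(group, total_gpus=4)
    print({k: round(v, 2) for k, v in shares.items()})
    # Roughly {'a': 1.0, 'b': 1.5, 'c': 1.5}: the small trial saturates first,
    # and the leftover capacity "fills up" the trials that can still scale.

The same loop works whether the shares are whole GPUs spread across machines or fractional slices of a single GPU, which is the intuition behind improving both inter- and intra-GPU goodput.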

Fluid is a joint project between Peifeng and Jiachen, which started right after Salus and before Jiachen started her Ph.D.! I’m super excited about many future works in the Systems + AI area from SymbioticLab members.

Salus Accepted to Appear at MLSys'2020

With the rising popularity of deep learning, GPUs have become increasingly sought after in recent years. Modern GPUs are extremely powerful and pack a lot of resources. A key challenge in this context is making sure that these devices are highly utilized. Although there has been a lot of research on improving GPU efficiency at the cluster level (e.g., our Tiresias in NSDI’19), little is known about how well individual GPUs are being utilized today. Worse, even if they are underutilized, little can be done because GPUs are opaque black boxes without any primitives for sharing them. Existing mechanisms for GPU sharing, such as NVIDIA MPS, are coarse-grained and cannot leverage application-specific information. Salus is our foray into the GPU sharing domain: it provides two key sharing primitives that allow one to develop a variety of algorithms and improve GPU efficiency for training, inference, and hyperparameter tuning workloads.

Unlike traditional resources such as CPU or the network, modern GPUs do not natively support fine-grained sharing primitives. Consequently, implementing common policies such as time-sharing and preemption is expensive. Worse, when a deep learning (DL) application cannot completely use a GPU’s resources, the GPU cannot be efficiently shared between multiple applications, leading to GPU underutilization.

We present Salus to enable two GPU sharing primitives, fast job switching and memory sharing, in order to achieve fine-grained GPU sharing among multiple DL applications. Salus is an efficient, consolidated execution service that exposes a GPU to different DL applications and enforces fine-grained sharing by performing iteration scheduling and addressing associated memory management issues. We show that these primitives can then be used to implement flexible sharing policies. Our integration of Salus with TensorFlow and evaluation on popular DL jobs shows that Salus can improve the average completion time of DL training jobs by 3.19X, GPU utilization for hyperparameter tuning by 2.38X, and GPU utilization of DL inference applications by 42X over not sharing the GPU and 7X over NVIDIA MPS, all with small overhead.
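To give a feel for what iteration scheduling buys, here is a minimal, illustrative sketch, not Salus's actual implementation: a scheduler owns a set of DL jobs and decides at every iteration boundary which job runs next, so that policies such as time-sharing or shortest-remaining-time-first become simple plug-ins on top of one switching primitive. The Job class, run_one_iteration(), and the two policies below are assumptions made purely for illustration.

# Illustrative iteration-granularity scheduling in the spirit of fast job switching.
# Not Salus's code: Job, run_one_iteration(), and the policies are assumed names.

from collections import deque
from typing import Callable, Deque


class Job:
    """A DL job whose work is a stream of short training iterations."""

    def __init__(self, name: str, total_iterations: int):
        self.name = name
        self.remaining = total_iterations

    def run_one_iteration(self) -> None:
        # A real system would launch one iteration's GPU kernels here;
        # this sketch only decrements a counter to stay self-contained.
        self.remaining -= 1

    @property
    def done(self) -> bool:
        return self.remaining == 0


def schedule(jobs: Deque[Job], pick: Callable[[Deque[Job]], Job]) -> None:
    """Core loop: switching happens at iteration boundaries, when little
    intermediate GPU memory is live, so a policy can switch or preempt cheaply."""
    while jobs:
        job = pick(jobs)          # the policy decides who owns the GPU next
        job.run_one_iteration()   # run exactly one iteration, then reconsider
        if job.done:
            jobs.remove(job)


def round_robin(jobs: Deque[Job]) -> Job:
    """Time-sharing: rotate the queue so every job makes steady progress."""
    jobs.rotate(-1)
    return jobs[0]


def shortest_remaining_first(jobs: Deque[Job]) -> Job:
    """A preemptive SRTF-like policy, expressible on the same primitive."""
    return min(jobs, key=lambda j: j.remaining)


if __name__ == "__main__":
    queue = deque([Job("resnet", 3), Job("vgg", 2), Job("inference", 1)])
    schedule(queue, shortest_remaining_first)

Because all decisions happen between iterations, swapping round_robin for shortest_remaining_first changes the policy without touching the mechanism; the memory sharing primitive plays the complementary role of keeping several jobs resident on the GPU so that such switches stay cheap.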

Salus has long been in the making and is the project that got me into systems for AI and GPU resource management. Peifeng has been diligently working on it since 2017! While it took a long time, I’m excited that it has found a great home, and I’m looking forward to building on top of it. This is Peifeng’s first major paper, and the future is even brighter.

This year’s MLSys has 34 accepted papers and remains as highly competitive as its previous iteration.