Tag Archives: Systems + AI

Oort Accepted to Appear at OSDI’2021

Oort’s working title was Kuiper.

With the wide deployment of AI/ML in our daily lives, the need for data privacy has been receiving more attention in recent years. Federated Learning (FL) is an emerging sub-field of machine learning that focuses on in-situ processing of data wherever it is generated. This will only become more important as regulations around data movement (e.g., GDPR, CCPA) become even more restrictive. Although there are already a large number of FL algorithms from the ML community and some FL deployments at large companies, systems support for FL remains largely non-existent. Oort is our effort toward building the first open-source FL system that allows FL developers to select participants for their training in an informed manner instead of selecting them at random. In the process, we have also collected the largest public dataset for FL, which we plan to open source in the near future.

Federated Learning (FL) is an emerging direction in distributed machine learning (ML) that enables in-situ model training and testing on edge data. Despite having the same end goals as traditional ML, FL executions differ significantly in scale, spanning thousands to millions of participating devices. As a result, data characteristics and device capabilities vary widely across clients. Yet, existing efforts randomly select FL participants, which leads to poor model and system efficiency.

In this paper, we propose Oort to improve the performance of federated training and testing with guided participant selection. With an aim to improve time-to-accuracy performance in model training, Oort prioritizes the use of those clients who have both data that offers the greatest utility in improving model accuracy and the capability to run training quickly. To enable FL developers to interpret their results in model testing, Oort enforces their requirements on the distribution of participant data while improving the duration of federated testing by cherry-picking clients. Our evaluation shows that, compared to existing participant selection mechanisms, Oort improves time-to-accuracy performance by 1.2X-14.1X and final model accuracy by 1.3%-9.8%, while efficiently enforcing developer-specified model testing criteria at the scale of millions of clients.
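For intuition, here is a minimal Python sketch of the kind of guided selection described above: each client gets a score combining a statistical term (how much its data is likely to improve the model) with a system term (how quickly it can finish a round), and most participants are picked by score while a few are picked at random for exploration. The utility definitions, parameter names, and numbers below are simplified placeholders of my own, not Oort's actual implementation.

```python
import math
import random

def client_score(sample_losses, duration, preferred_duration, alpha=2.0):
    """Toy utility: clients whose samples still incur high loss score higher
    (statistical term); clients slower than the preferred round duration are
    penalized (system term). Exact forms in the paper differ."""
    n = len(sample_losses)
    stat_utility = n * math.sqrt(sum(l * l for l in sample_losses) / n)
    sys_penalty = (preferred_duration / duration) ** alpha if duration > preferred_duration else 1.0
    return stat_utility * sys_penalty

def select_participants(clients, k, preferred_duration, explore_frac=0.1):
    """Pick k clients: mostly by score (exploitation), a few at random
    (exploration) so rarely seen clients still get sampled."""
    n_explore = max(1, int(k * explore_frac))
    ranked = sorted(clients,
                    key=lambda c: client_score(c["losses"], c["duration"], preferred_duration),
                    reverse=True)
    exploit = ranked[: k - n_explore]
    rest = ranked[k - n_explore:]
    explore = random.sample(rest, min(n_explore, len(rest)))
    return exploit + explore

# Example: 100 simulated clients, pick 10 per round.
clients = [{"id": i,
            "losses": [random.random() for _ in range(32)],
            "duration": random.uniform(5, 60)} for i in range(100)]
print([c["id"] for c in select_participants(clients, 10, preferred_duration=20)])
```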

Fan and Allen have been working on Oort since the summer of 2019, and it's been a great learning experience for me. As always, it's been a pleasure to collaborate with Harsha, and I look forward to many more collaborations in the future. Over the past two years, many others have joined Fan in our efforts toward providing systems support for federated learning and analytics, with many exciting results at different stages of the pipeline focusing on cloud/edge/WAN challenges. It's only going to become more exciting!

This is the first OSDI in an odd year as OSDI moves to a yearly cadence. Although the number of submissions is lower than in the past, that is likely only due to the late announcement; serving on my first OSDI PC, I think the quality of the submitted and accepted papers remains as high as ever. Overall, the OSDI PC accepted 31 out of 165 submissions.

Fluid Accepted to Appear at MLSys’2021

While training and inference of deep learning models have received significant attention in recent years (e.g., Tiresias, AlloX, and Salus from our group), hyperparameter tuning is often overlooked or lumped into the same bucket of optimizations as training. Existing hyperparameter tuning solutions, primarily from the ML research community, are mostly resource-agnostic. More importantly, even if they try to use up all available resources, existing solutions do not distinguish between the throughput of a GPU (how much work a GPU is doing) and its goodput (how much of that is ultimately useful work) during hyperparameter tuning. Fluid is our attempt at bridging the gap between hyperparameter tuning algorithms and the underlying cluster resources by improving both intra- and inter-GPU goodput in large clusters.

Current hyperparameter tuning solutions lack complementary execution engines to efficiently leverage distributed computation, thus ignoring the possibility of intra- and inter-GPU sharing, which results in poor resource usage. In this paper, we present Fluid, a generalized hyperparameter tuning execution engine that coordinates between hyperparameter tuning jobs and cluster resources. Fluid schedules evaluation trials in such jobs using a water-filling approach to make the best use of resources at both intra- and inter-GPU granularities to speed up the tuning process. By abstracting a hyperparameter tuning job as a sequence of TrialGroup, Fluid can boost the performance of diverse hyperparameter tuning solutions. Our experiments show that Fluid can speed up synchronous BOHB by 200%, and BOHB and ASHA by 30%, while achieving similar final accuracy.
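As a rough illustration of the inter-GPU side of this idea, the sketch below water-fills a fixed GPU budget across pending trials so that no trial monopolizes the cluster while others sit idle. The function name, the per-trial cap, and the numbers are hypothetical; the actual engine also handles intra-GPU packing and elastic reallocation as trials come and go.

```python
def water_fill(trials, total_gpus, max_per_trial=4):
    """Give GPUs one at a time to the currently least-served trial,
    capped at max_per_trial, so allocations stay as even as possible."""
    alloc = {t: 0 for t in trials}
    remaining = total_gpus
    while remaining > 0:
        eligible = [t for t in trials if alloc[t] < max_per_trial]
        if not eligible:
            break  # every trial is already at its scaling cap
        target = min(eligible, key=lambda t: alloc[t])
        alloc[target] += 1
        remaining -= 1
    return alloc

# Example: 6 pending trials on a 16-GPU cluster.
print(water_fill([f"trial-{i}" for i in range(6)], total_gpus=16))
```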

Fluid is a joint project between Peifeng and Jiachen, which started right after Salus and before Jiachen started her Ph.D.! I’m super excited about many future works in the Systems + AI area from SymbioticLab members.

Presented Keynote Speech at HotEdgeVideo’2020

Earlier this week, I presented a keynote speech on the state of resource management for deep learning at the HotEdgeVideo’2020 workshop, covering our recent works on systems support for AI (Tiresias, AlloX, and Salus) and discussing open challenges in this space.

In this talk, I highlighted three aspects of resource management in the context of deep learning that I think make this setting unique even after decades of resource management work on CPUs, networks, and even Big Data clusters.

  1. Short-term predictability and long-term uncertainty of deep learning workloads: Deep learning workloads (hyperparameter tuning, training, and inference) may have different objectives, but they all share a common trend. Although we cannot know how many iterations there will be in a job or how many requests will arrive at a model for inference, each iteration/request performs the same computation for a given job. This means we can profile a job and then exploit that information for long-term benefits by adapting classic information-agnostic and information-limited scheduling techniques (see the sketch after this list).
  2. Heterogeneous, interchangeable compute devices in deep learning clusters: Deep learning clusters are becoming increasingly diverse, with many generations of GPUs and new hardware accelerators coming out every month. The key resource management challenge here is that all these compute devices are interchangeable (they all can compute), but they don't compute at the same rate for all models. Some models are better suited to CPUs, some to GPUs, some to TPUs, and so on. We need to rethink resource management algorithms to account for resource interchangeability.
  3. Black-box hardware accelerators: Deep learning hardware is also a black box. Even for GPUs, we do not have any control over their internals; apart from some high-level information that is publicly available, we don't know anything about what happens inside. For newer, vendor-locked accelerators, details are even scarcer. Consequently, resource management solutions should be designed to assume black-box hardware from the get-go and then rely on profiling (by leveraging the iterative nature of deep learning) and short-term predictability to extract good performance using software techniques.
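To make the first and third points concrete, here is a small Python sketch (names and numbers are my own, not code from any of our systems) that treats a device as a black box, times a handful of iterations to estimate per-iteration cost, and then uses those measurements to choose among interchangeable devices.

```python
import time

def profile_iteration_time(run_one_iteration, warmup=3, samples=10):
    """Treat the device as a black box: time a few iterations and use the
    average as a stable estimate, exploiting the fact that every iteration
    of a given job performs the same computation."""
    for _ in range(warmup):
        run_one_iteration()
    start = time.perf_counter()
    for _ in range(samples):
        run_one_iteration()
    return (time.perf_counter() - start) / samples

def pick_device(job_iterations, iter_time_by_device):
    """Interchangeable devices run the same job at different rates;
    choose the one that minimizes the job's estimated runtime."""
    return min(iter_time_by_device, key=lambda d: job_iterations * iter_time_by_device[d])

# Stand-in for one training step, just to show how profiling would be used.
fake_iteration = lambda: sum(i * i for i in range(100_000))
print(f"{profile_iteration_time(fake_iteration):.4f} s per iteration")

# Example with made-up per-iteration times (seconds).
measured = {"cpu": 0.90, "gpu-v100": 0.05, "tpu-v3": 0.08}
print(pick_device(job_iterations=10_000, iter_time_by_device=measured))
```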

My slides from this talk are publicly available and elaborate on these points in more detail.

Thanks Google for Supporting Our Research

We have been doing a lot of work on systems for AI in recent years, and historically we have done some work on application-aware networking. This support will help us combine the two directions to provide datacenter network support for AI workloads both at the edge and inside the network.

Many thanks to Nandita Dukkipati for championing our cause. We’ve been trying to find a good project to collaborate on since 2016 when we had a chat at a Dagstuhl workshop!

AlloX Accepted to Appear at EuroSys’2020

While GPUs always make the news when it comes to deep learning clusters (e.g., Salus or Tiresias), we are in the midst of an emergence of many more computing devices (e.g., FPGAs and problem-specific accelerators), alongside traditional CPUs. All of them are compute devices, but one cannot expect the same speed from all of them for all types of computation. A natural question, therefore, is how to allocate such interchangeable resources in a hybrid cluster. AlloX is our attempt at a reasonable answer.

Modern deep learning frameworks support a variety of hardware, including CPU, GPU, and other accelerators, to perform computation. In this paper, we study how to schedule jobs over such interchangeable resources – each with a different rate of computation – to optimize performance while providing fairness among users in a shared cluster. We demonstrate theoretically and empirically that existing solutions and their straightforward modifications perform poorly in the presence of interchangeable resources, which motivates the design and implementation of AlloX. At its core, AlloX transforms the scheduling problem into a min-cost bipartite matching problem and provides dynamic fair allocation over time. We theoretically prove its optimality in an ideal, offline setting and show empirically that it works well in the online scenario by incorporating it into Kubernetes. Evaluations on a small-scale CPU-GPU hybrid cluster and large-scale simulations highlight that AlloX can reduce the average job completion time significantly (by up to 95% when the system load is high) while providing fairness and preventing starvation.
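The core matching idea can be sketched with an off-the-shelf assignment solver: build a cost matrix whose columns are (device, position-from-the-end) slots, where placing a job k-th from the end on a device contributes (k+1) times its processing time on that device to the total completion time, and solve it as a min-cost bipartite matching. The processing times and two-device setup below are made up, and the real formulation additionally handles online arrivals and fairness.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Estimated processing time of each job on each device type (seconds).
# Jobs can run on either device, but at very different rates.
proc_time = np.array([
    [120.0,  20.0],   # job 0: [CPU, GPU]
    [ 45.0,  40.0],   # job 1
    [300.0,  35.0],   # job 2
    [ 60.0,  50.0],   # job 3
])
n_jobs, n_devices = proc_time.shape

# Column (d, k) = "run k-th from the end on device d"; a job in that slot
# contributes (k + 1) * its processing time to the sum of completion times.
cost = np.zeros((n_jobs, n_devices * n_jobs))
for d in range(n_devices):
    for k in range(n_jobs):
        cost[:, d * n_jobs + k] = (k + 1) * proc_time[:, d]

rows, cols = linear_sum_assignment(cost)  # min-cost bipartite matching
for job, col in zip(rows, cols):
    device, pos_from_end = divmod(col, n_jobs)
    print(f"job {job} -> device {device}, position {pos_from_end} from the end")
```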

AlloX has been in the oven for more than two years, and it is a testament to Tan and Xiao’s tenacity. It is also my second paper with Zhenhua after HUG, our first joint work based on our recent NSF project, and my first paper at EuroSys. I believe the results in this paper will give rise to new analyses of many systems that have interchangeable resources, such as DRAM-NVM hybrid systems or the storage/caching hierarchy.

This year the EuroSys PC accepted 43 out of 234 submissions.

Salus Accepted to Appear at MLSys’2020

With the rising popularity of deep learning, GPUs have become increasingly common in recent years. Modern GPUs are extremely powerful, with a lot of resources. A key challenge in this context is making sure that these devices are highly utilized. Although there has been a lot of research on improving GPU efficiency at the cluster level (e.g., our Tiresias in NSDI’19), little is known about how well individual GPUs are being utilized today. Worse, even if they are underutilized, little can be done because GPUs are opaque black boxes without any primitives for sharing them. Existing mechanisms for GPU sharing, such as NVIDIA MPS, are coarse-grained and cannot leverage application-specific information. Salus is our foray into the GPU sharing domain, providing two key sharing primitives that allow one to develop a variety of algorithms and improve GPU efficiency for training, inference, and hyperparameter tuning workloads.

Unlike traditional resources such as CPU or the network, modern GPUs do not natively support fine-grained sharing primitives. Consequently, implementing common policies such as time-sharing and preemption are expensive. Worse, when a deep learning (DL) application cannot completely use a GPU’s resources, the GPU cannot be efficiently shared between multiple applications, leading to GPU underutilization.

We present Salus to enable two GPU sharing primitives: fast job switching and memory sharing, to achieve fine-grained GPU sharing among multiple DL applications. Salus is an efficient, consolidated execution service that exposes a GPU to different DL applications, and enforces fine-grained sharing by performing iteration scheduling and addressing associated memory management issues. We show that these primitives can then be used to implement flexible sharing policies. Our integration of Salus with TensorFlow and evaluation on popular DL jobs shows that Salus can improve the average completion time of DL training jobs by 3.19X, GPU utilization for hyper-parameter tuning by 2.38X, and GPU utilization of DL inference applications by 42X over not sharing the GPU and 7X over NVIDIA MPS with small overhead.
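Conceptually, the fast-job-switching primitive boils down to scheduling at iteration boundaries rather than at whole-job granularity. The toy Python scheduler below illustrates that idea for a single GPU; the class and method names are mine, and the real system operates inside the framework's execution engine and also manages GPU memory regions.

```python
import heapq

class IterationScheduler:
    """Toy single-GPU scheduler that switches between jobs at iteration
    boundaries; switching is cheap when a paused job only keeps its small
    persistent state resident, not its ephemeral buffers."""
    def __init__(self):
        self._queue = []  # entries: (priority, seq, job)
        self._seq = 0

    def submit(self, job, priority):
        heapq.heappush(self._queue, (priority, self._seq, job))
        self._seq += 1

    def run(self):
        while self._queue:
            priority, seq, job = heapq.heappop(self._queue)
            done = job.run_one_iteration()   # switch point between jobs
            if not done:
                heapq.heappush(self._queue, (priority, seq, job))

class ToyJob:
    def __init__(self, name, iterations):
        self.name, self.left = name, iterations
    def run_one_iteration(self):
        self.left -= 1
        print(f"{self.name}: iteration done, {self.left} left")
        return self.left == 0

sched = IterationScheduler()
sched.submit(ToyJob("training", 3), priority=1)
sched.submit(ToyJob("inference", 2), priority=0)  # lower number = higher priority
sched.run()
```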

Salus has long been in the making and is my first project in systems for AI and GPU resource management. Peifeng has been diligently working on it since 2017! While it took a long time, I’m excited that it has found a great home, and I look forward to building on top of it. This is Peifeng’s first major paper, and the future is even brighter.

This year’s MLSys has 34 accepted papers and remains as highly competitive as its previous iteration.

Tiresias Accepted to Appear at NSDI’2019

With the advancement of AI in recent years, GPUs have emerged as a popular choice for training deep learning (DL) models on large datasets. To deal with ever-growing datasets, it is also common to run distributed deep learning over multiple GPUs in parallel. Achieving cost-effectiveness and high performance in these clusters relies on efficiently sharing resources between multiple users. Unfortunately, most GPU clusters in production rely on resource managers designed for traditional big data analytics. This results in suboptimal performance and strong, but unnecessary, constraints. Tiresias is our first attempt at designing a GPU cluster resource manager that relies on profiling to make good scheduling and placement decisions with little or no input from the users.

Distributed training of deep learning (DL) models on GPU clusters is becoming increasingly popular. Existing cluster managers face some unique challenges from DL training jobs, such as unpredictable training times, an all-or-nothing execution model, and inflexibility in GPU sharing. Our analysis of a large GPU cluster in production shows that existing big data schedulers — coupled with the consolidated job placement constraint, whereby GPUs for the same job must be allocated in as few machines as possible — cause long queueing delays and low overall performance.

We present Tiresias, a GPU cluster resource manager tailored for distributed DL training jobs, which efficiently schedules and places DL jobs to reduce their job completion times (JCT). Given that a DL job’s execution time is often unpredictable, we propose two scheduling algorithms — Discretized Two-Dimensional Gittins Index relies on partial information and Discretized Two-Dimensional LAS is information-agnostic — that aim to minimize the average JCT. Additionally, we describe when the consolidated placement constraint can be relaxed and present a placement algorithm to leverage these observations without any user input. Experiments on a cluster with 60 P100 GPUs — and large-scale trace-driven simulations — show that Tiresias improves the average JCT by up to 5.5X over an Apache YARN-based resource manager used in production. More importantly, Tiresias’s performance is comparable to that of solutions assuming perfect knowledge.
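To give a flavor of the information-agnostic algorithm, here is a small sketch of discretized two-dimensional LAS: a job's attained service is its GPU count times the time it has been served, and jobs are demoted to lower-priority queues as their attained service crosses exponentially spaced thresholds. The threshold values and job numbers below are made up for illustration, not taken from the paper.

```python
def queue_index(num_gpus, gpu_seconds_served, threshold=3600, factor=10):
    """Discretized 2D-LAS priority: attained service is GPUs x time, and a job
    drops to a lower-priority queue each time its attained service crosses the
    next (exponentially spaced) threshold."""
    attained = num_gpus * gpu_seconds_served
    q, limit = 0, threshold
    while attained >= limit:
        q += 1
        limit *= factor
    return q

# Jobs are scheduled from the lowest-index (highest-priority) queue first;
# within a queue, a simple policy such as FIFO by start time can be used.
jobs = [
    {"id": "a", "gpus": 4, "served_s": 600},
    {"id": "b", "gpus": 1, "served_s": 30_000},
    {"id": "c", "gpus": 8, "served_s": 7_200},
]
for j in sorted(jobs, key=lambda j: queue_index(j["gpus"], j["served_s"])):
    print(j["id"], "-> queue", queue_index(j["gpus"], j["served_s"]))
```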

This is Juncheng’s second NSDI paper after Infiniswap in NSDI’17, and a very proud moment for me as his advisor. I would like to thank all our collaborators. I would also like to thank Samir and Barna for inviting me to the TTIC Summer Workshop on Datacenter Scheduling, where I heard Mor’s talk on SOAP that inspired our use of Gittins Index-based scheduling in this context for the partial-information case. The application of LAS was inspired by my earlier work on information-agnostic coflow scheduling.

This year, the NSDI PC accepted 49 out of 332 submissions across the Spring (19/92) and Fall (30/240) deadlines, for a somewhat lower acceptance rate compared to recent years.

“No! Not Another Deep Learning Framework” to Appear at HotOS’2017

Our position paper calling for a respite in the deep learning framework building arms race has been accepted to appear at this year’s HotOS workshop. We make a simple observation: too many frameworks are being proposed with little interoperability between them, even though many target the same or similar workloads; this inevitably leads to repetitions and reinventions from a machine learning perspective and suboptimal performance from a systems perspective. We identify two places for consolidation across many deep learning frameworks’ architectures that may enable interoperability as well as code, optimization, and resource sharing, benefitting both the machine learning and systems communities.

In recent years, deep learning has pervaded many areas of computing due to the confluence of an explosive growth of large-scale computing capabilities, availability of datasets, and advances in learning techniques. While this rapid growth has resulted in diverse deep learning frameworks, it has also led to inefficiencies for both the users and developers of these frameworks. Specifically, adopting useful techniques across frameworks — both to perform learning tasks and to optimize performance — involves significant repetitions and reinventions.

In this paper, we observe that despite their diverse origins, many of these frameworks share architectural similarities. We argue that by introducing a common representation of learning tasks and a hardware abstraction model to capture compute heterogeneity, we might be able to relieve machine learning researchers from dealing with low-level systems issues and systems researchers from being tied to any specific framework. We expect this decoupling to accelerate progress in both domains.
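As a purely hypothetical illustration of what such a hardware abstraction model might look like, the Python interface below sketches a narrow set of operations a backend could expose to a common representation of learning tasks; none of these names come from the paper.

```python
from abc import ABC, abstractmethod

class ComputeBackend(ABC):
    """Hypothetical hardware-abstraction interface: frameworks lower a common
    graph representation onto whichever backends implement this narrow API."""

    @abstractmethod
    def allocate(self, num_bytes: int) -> int:
        """Reserve device memory and return an opaque buffer handle."""

    @abstractmethod
    def launch(self, op_name: str, inputs: list, outputs: list) -> None:
        """Run one operator from the common graph representation."""

    @abstractmethod
    def synchronize(self) -> None:
        """Block until all launched work has finished."""

class LoggingCPUBackend(ComputeBackend):
    """Trivial stand-in backend that just logs calls."""
    def allocate(self, num_bytes):
        print(f"allocate {num_bytes} bytes")
        return 0
    def launch(self, op_name, inputs, outputs):
        print(f"launch {op_name}")
    def synchronize(self):
        print("synchronize")

backend = LoggingCPUBackend()
buf = backend.allocate(1 << 20)
backend.launch("matmul", inputs=[buf], outputs=[buf])
backend.synchronize()
```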

Our foray into deep learning systems started with a class project by Peifeng and Linh last fall in my EECS 582 course. From a systems perspective, this is a very new and exciting area! We are learning new things every day, ranging from low-level GPU programming to communication over the NVLink technology, and we are looking forward to a very exciting summer.

FTR, the HotOS PC accepted 29 papers out of 94 submissions this year.