Tag Archives: Systems + AI

ModelKeeper and Zeus Accepted to Appear at NSDI’2023

Deep learning, and machine learning in general, is taking over the world. It is, however, quite expensive to tune, train, and serve deep learning models. Naturally, improving the efficiency and performance of deep learning workflows has received significant attention (Salus, Tiresias, and Fluid to name a few). Most existing works, including our own prior works, focus on resource efficiency, and they improve it in two primary ways. The first is packing work as tightly as possible (placement). The second is scheduling over time. Some apply both together. None focus on improving energy efficiency. ModelKeeper and Zeus are our efforts toward, respectively, improving resource efficiency by avoiding work altogether and improving energy efficiency instead of focusing solely on resource usage.


We know scheduling and placement can improve the efficiency of resource usage, but even with optimal algorithms one cannot reduce the amount of work that needs to be done in the general case. This simple observation led us to explore how we can reduce the amount of work that needs to be done when training DNN models. It turns out that instead of starting from random values and training all the way to the final weights, one can potentially initialize a model better when training starts and short-circuit the process! By identifying similar models that have already been trained in the past, one can reduce the number of iterations needed for a model to converge.

With the growing deployment of machine learning (ML) models, ML developers are training or re-training more and more deep neural networks (DNNs). They do so to find the most suitable model that meets their accuracy requirement while satisfying the resource and timeliness constraints of the target environment. In large shared clusters, the growing number of neural architecture search (NAS) and training jobs often results in models sharing architectural similarities with others from the same or a different ML developer. However, existing solutions do not provide a systematic mechanism to identify and leverage such similarities.

We present ModelKeeper, the first automated training warmup system that accelerates DNN training by repurposing previously-trained models in a shared cluster. Our key insight is that initializing a training job’s model by transforming an already-trained model’s weights can jump-start it and reduce the total amount of training needed. However, models submitted over time can differ in their architectures and accuracy. Given a new model to train, ModelKeeper scalably identifies its architectural similarity with previously trained models, selects a parent model with high similarity and good model accuracy, and performs structure-aware transformation of weights to preserve maximal information from the parent model during the warmup of new model weights. Our evaluations across thousands of CV and NLP models show that ModelKeeper achieves 1.3×–4.3× faster training completion with little overhead and no reduction in model accuracy.
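To make the three steps above (measure similarity, pick a parent, transform its weights) a bit more concrete, here is a minimal toy sketch. It is not ModelKeeper's actual algorithm or API: the model zoo, the layer-name signatures, the `similarity * accuracy` score, and the copy-the-overlap weight transfer are all assumptions made for illustration, with Python's `difflib.SequenceMatcher` standing in for real architecture matching.

```python
from difflib import SequenceMatcher

# Hypothetical model zoo: name -> (layer signature, validation accuracy).
# Flat layer-name lists are a toy stand-in for real architecture graphs.
ZOO = {
    "resnet_small": (["conv3-64", "conv3-64", "conv3-128", "fc-10"], 0.91),
    "vgg_tiny": (["conv3-32", "pool", "conv3-64", "pool", "fc-10"], 0.88),
    "mlp": (["fc-512", "fc-256", "fc-10"], 0.97),
}


def similarity(a, b):
    """Toy architectural similarity in [0, 1] based on matching layer runs."""
    return SequenceMatcher(None, a, b).ratio()


def select_parent(new_layers, zoo):
    """Pick the zoo model with the highest similarity * accuracy (assumed score)."""
    return max(
        zoo,
        key=lambda name: similarity(new_layers, zoo[name][0]) * zoo[name][1],
    )


def warm_start(parent_w, child_w):
    """Copy the overlapping top-left block of the parent's weights into the child.

    Entries of the child outside the overlap keep their initial values.
    """
    rows = min(len(parent_w), len(child_w))
    cols = min(len(parent_w[0]), len(child_w[0]))
    out = [row[:] for row in child_w]  # do not mutate the caller's matrix
    for i in range(rows):
        out[i][:cols] = parent_w[i][:cols]
    return out


new_model = ["conv3-64", "conv3-64", "conv3-128", "conv3-128", "fc-10"]
parent = select_parent(new_model, ZOO)
print(parent)
```

With this made-up zoo, the new model shares four of its five layers with `resnet_small`, so that model is selected even though `mlp` has the higher accuracy. This is the tradeoff between similarity and accuracy that the parent selection has to balance.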

Fan started the ModelKeeper project with Yinwei in late 2020 while Oort was making the rounds and FedScale was in its infancy. With his internship at Meta in the middle and the many other projects he’s been working on, the ModelKeeper submission was pushed back a couple of times. In hindsight, the extra time significantly improved the quality of the work. While the setting considered in this paper is cloud computing, ModelKeeper is likely going to become an integral part of the greater FedScale project to speed up federated learning as well.

ModelKeeper is yet another collaboration between Harsha and myself. Hopefully, we will continue to collaborate more even after Harsha moves to USC in Winter 2023.


With ever-increasing model sizes, the cost of DNN training is increasing rapidly. While the monetary cost is discussed often, there is an implicit energy cost of DNN training as well. For example, training the GPT-3 model consumes 1,287 megawatt-hours (MWh), which is equivalent to 120 years of electricity consumption for an average U.S. household. In this pioneering work, we take the first step in better understanding and then optimizing the energy consumption of DNN training. Specifically, we optimize batch size and GPU power cap for recurring training jobs to provide a better tradeoff between energy consumed and accuracy attained.
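As a quick sanity check on the GPT-3 comparison, the two figures quoted above imply a per-household consumption that can be worked out directly. The snippet uses only those two numbers; nothing else is assumed.

```python
GPT3_TRAINING_MWH = 1287  # energy to train GPT-3, as quoted above
HOUSEHOLD_YEARS = 120     # equivalent household-years, as quoted above

# Annual electricity use per household implied by the two figures.
implied_mwh_per_household_year = GPT3_TRAINING_MWH / HOUSEHOLD_YEARS
print(f"{implied_mwh_per_household_year:.1f} MWh per household per year")
```

That works out to a little under 11 MWh per household per year.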

Training deep neural networks (DNNs) is becoming increasingly more resource- and energy-intensive every year. Unfortunately, existing works primarily focus on optimizing DNN training for faster completion, often without considering the impact on energy efficiency.

In this paper, we observe that common practices to improve training performance can often lead to inefficient energy usage. More importantly, we demonstrate that there is a tradeoff between energy consumption and performance optimization. To this end, we propose an optimization framework, Zeus, to navigate this tradeoff by automatically configuring job- and GPU-level configurations of recurring DNN training jobs. Zeus uses an online exploration-exploitation approach in conjunction with just-in-time energy profiling, averting the need for expensive offline measurements, while adapting to workload changes and data drifts over time. Our evaluation shows that Zeus can improve the energy efficiency of DNN training by 18.7%-72.8% for diverse workloads.
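The online exploration-exploitation idea can be sketched roughly as follows. This is not Zeus's actual algorithm or API: the epsilon-greedy policy, the candidate batch sizes and power limits, the blended cost function with its knob `eta`, and the made-up `run_job` energy/time model are all assumptions for illustration. A real system would measure energy and time on the GPU instead of simulating them.

```python
import random

BATCH_SIZES = [32, 64, 128]        # hypothetical job-level choices
POWER_LIMITS_W = [150, 200, 250]   # hypothetical GPU power caps
MAX_POWER_W = max(POWER_LIMITS_W)


def run_job(batch_size, power_limit_w):
    """Hypothetical stand-in for one recurrence of a training job.

    Returns (energy in joules, time in seconds) from a made-up model.
    """
    time_s = (
        4000 * (64 / batch_size) ** 0.5 * (250 / power_limit_w) ** 0.6
        + 5 * batch_size
    )
    energy_j = time_s * power_limit_w * 0.9
    return energy_j, time_s


def cost(energy_j, time_s, eta=0.5):
    """Blend energy and time: eta=1 weighs only energy, eta=0 only time."""
    return eta * energy_j + (1 - eta) * MAX_POWER_W * time_s


def mean(values):
    return sum(values) / len(values)


def tune(num_recurrences=50, epsilon=0.2, seed=0):
    """Epsilon-greedy search over (batch size, power limit) across recurrences."""
    rng = random.Random(seed)
    arms = [(b, p) for b in BATCH_SIZES for p in POWER_LIMITS_W]
    observed = {}  # arm -> list of observed costs
    for _ in range(num_recurrences):
        untried = [a for a in arms if a not in observed]
        if untried:
            arm = untried[0]            # try every configuration once
        elif rng.random() < epsilon:
            arm = rng.choice(arms)      # keep exploring occasionally
        else:
            arm = min(observed, key=lambda a: mean(observed[a]))  # exploit
        observed.setdefault(arm, []).append(cost(*run_job(*arm)))
    return min(observed, key=lambda a: mean(observed[a]))


best_batch_size, best_power_limit_w = tune()
print(best_batch_size, best_power_limit_w)
```

Because each recurrence of the job doubles as a measurement, no separate offline profiling pass is needed, and continued exploration lets the choice shift if the workload changes over time.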

Zeus started sometime around Fall 2020/Winter 2021 with Jimmy. At the end of Winter, when Jimmy left for his internship, we had a basic idea of a problem and one motivating plot that’d eventually drive our efforts. With the arrival of Jae-Won in Fall 2021 as a first-year student and Jimmy being back from Meta, we picked up the pace, which eventually led to its submission. Zeus is the first Treehouse project, and my first foray into energy-related anything. We had a lot to learn, but I was in the capable hands of Jimmy and Jae-Won, who learned and taught me much. And we haven’t even scratched the surface!

To work on many more exciting projects like these, join SymbioticLab!

Peifeng has Phinished. Congrats Dr. Yu!

Peifeng became my second student to finish his PhD a few days ago after successfully defending his dissertation “Application-Aware Scheduling in Deep Learning Software Stacks.” This will be a big loss for the SymbioticLab as we will miss his presence and deep technical insights. Peifeng is joining Google to continue working on resource management systems for AI/ML.

Peifeng officially started his PhD in Fall 2017, but he started working with me on and off from the Fall before when he took EECS 582 with me as a master’s student at UM. Peifeng and his friend, Linh, were working on a term project on video captioning for that course, but Peifeng was interested in designing better systems for AI/ML instead of simply applying existing ML techniques to different use cases. Although I did not know anything about systems for AI/ML, Peifeng pulled me into this world. Since then, Peifeng has worked on several groundbreaking projects, including Salus and Fluid; Orloj, an even more exciting project, is in the pipeline to be published. Salus was the first software GPU sharing solution that provided significantly higher utilization than NVIDIA MPS; Fluid was the first to leverage the collective nature of jobs in hyperparameter tuning to improve GPU- and cluster-level utilization. Orloj is the first inference system to provide predictable performance for dynamic DNNs while maintaining best-in-class performance for traditional static DNNs. I enjoyed this journey thoroughly, learned a lot in the process, and am really proud to be called his advisor.

Peifeng is one of the best (ML) systems developers I have ever seen (and I have seen many luminaries over the years). He cares more about doing his work than hyping it up. He is also unbothered by the publications rat race to the point of causing advisor anxiety.

I have no doubt he will be extremely successful in whatever he sets his mind to.

FedScale Accepted to Appear at ICML’2022

Although theoretical federated learning (FL) research is growing exponentially, we are far from putting those theories into practice. Over the course of the last few years, SymbioticLab has made significant progress in building deployable FL systems, with Oort being the most prominent example. As I discussed in the past, while evaluating Oort, we observed the weaknesses of the existing FL workloads/benchmarks: they are too small and sometimes too homogeneous to highlight the uncertainties that FL deployments would face in the real world. FedScale was born out of the necessity to evaluate Oort. As we worked on it, we added more and more datasets to create a diverse benchmark that not only contains workloads to evaluate FL but also traces to emulate real-world end device characteristics. Eventually, we started building a runtime as well that one can use to implement any FL algorithm within FedScale. For example, Oort can be implemented in FedScale with a few lines, as can PyramidFL, a more recent work in MobiCom’22 that is based on Oort. This ICML paper gives an overview of the benchmarking aspects of FedScale for ML/FL researchers, while providing a quick intro to the systems runtime that we are continuously working on and plan to publish later this year.

We present FedScale, a diverse set of challenging and realistic benchmark datasets to facilitate scalable, comprehensive, and reproducible federated learning (FL) research. FedScale datasets are large-scale, encompassing a wide range of important FL tasks, such as image classification, object detection, word prediction, speech recognition, and sequence prediction in video streaming. For each dataset, we provide a unified evaluation protocol using realistic data splits and evaluation metrics. To meet the pressing need for reproducing realistic FL at scale, we build an efficient evaluation platform to simplify and standardize the process of FL experimental setup and model evaluation. Our evaluation platform provides flexible APIs to implement new FL algorithms, and includes new execution backends (e.g., mobile backends) with minimal developer efforts. Finally, we perform systematic benchmark experiments on these datasets. Our experiments suggest fruitful opportunities in heterogeneity-aware co-optimizations of the system and statistical efficiency under realistic FL characteristics. FedScale will be open-source and actively maintained, and we welcome feedback and contributions from the community.

Fan and Yinwei have been working on FedScale for more than two years, with some help from Xiangfeng toward the end of Oort. During this time, Jiachen and Sanjay joined, first as users of FedScale and later as its contributors. Of course, Harsha is with us, as in all our past FL projects. Including this summer, close to 20 undergrads and master’s students have worked on/with/around it. At this point, FedScale has become the largest project in the SymbioticLab, with interest from academic and industry users within and outside Michigan, and there is an active Slack channel as well where users from many different institutions collaborate. We are also organizing the first FedScale Summer School this year. Overall, FedScale reminds me of another small project called Spark I was part of many years ago!

This is my/our first paper in ICML or any ML conference for that matter, even though it’s not necessarily a core ML paper. This year, ICML received 5630 submissions. Among these, 1117 were accepted for short and 118 for long presentations, for a 21.94% overall acceptance rate; FedScale is one of the former. These numbers are mind-boggling for me as someone from the systems community!
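For the curious, the acceptance rate follows directly from the three counts above:

```python
submissions = 5630
accepted_short = 1117
accepted_long = 118

# Overall acceptance rate across both presentation types.
acceptance_rate = (accepted_short + accepted_long) / submissions
print(f"{acceptance_rate:.2%}")  # prints 21.94%
```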

Join us in making FedScale even bigger, better, and more useful, as a member of SymbioticLab or as a FedScale user/contributor. Now that we have the research vehicle, the possibilities are limitless. We are exploring maybe fewer than 10 such ideas, but 100s are waiting for you.

Visit http://fedscale.ai/ to learn more.

Juncheng Levels Up. Congrats Dr. Gu!

My first Ph.D. student Juncheng Gu graduated earlier this month after successfully defending his dissertation titled “Efficient Resource Management for Deep Learning Clusters.” This is a bittersweet moment. While I am extremely proud of everything he has done, I will miss having him around. I do know that a bigger stage awaits him; Juncheng is joining the ByteDance AI Lab to build practical systems for AI and machine learning!

Juncheng started his Ph.D. in the Fall of 2015 right before I started at Michigan. I joined his then advisor Kang Shin to co-advise him as he started working on a precursor to Infiniswap as a term project for the EECS 582 course I was teaching. Since then, Juncheng has worked on many projects spanning hardware, systems, and machine learning/computer vision with varying levels of luck and success, but they were all meaningful works. I consider him a generalist in his research taste. Infiniswap and Tiresias stand out the most among his projects. Infiniswap heralded the rise of the many followups we see today on the topic of memory disaggregation. It was the first of its kind and introduced many around the world to this new area of research. Tiresias was one of the earliest works on GPU cluster management and certainly the first that did not require any prior knowledge about deep learning jobs’ characteristics to effectively allocate GPUs for them and to schedule them. To this day, it is the best of its kind for distributed deep learning training. I am honored to have had the opportunity to advise Juncheng.

Juncheng is a great researcher, but he is an even better person. He is very down-to-earth and tries his best to help others out whenever possible. He also understates and underestimates what he can do and has achieved, often to a fault.

I wish him a fruitful career and a prosperous life!

Oort Wins the Distinguished Artifact Award at OSDI’2021. Congrats Fan and Xiangfeng!

Oort, our federated learning system for scalable machine learning over millions of edge devices, has received the Distinguished Artifact Award at this year’s USENIX OSDI conference!

This is a testament to a lot of hard work put in by Fan and Xiangfeng over the course of the last couple of years. Oort is our first foray into federated learning, but it certainly is not the last.

Oort and its workloads (FedScale) are both open-source at https://github.com/symbioticlab.

FedScale Released on GitHub

Anyone working on federated learning (FL) has faced this problem at least once: you are reading two papers and they either use very different datasets for performance evaluation or are unclear about their experimental assumptions about the runtime environment, or both. They often deal with very small datasets as well. There have been attempts at solutions too, resulting in many FL benchmarks. In the process of working on Oort, we faced the same problem(s). Unfortunately, none of the existing benchmarks fit our requirements. We had to create one on our own.

We present FedScale, a diverse set of challenging and realistic benchmark datasets to facilitate scalable, comprehensive, and reproducible federated learning (FL) research. FedScale datasets are large-scale, encompassing a diverse range of important FL tasks, such as image classification, object detection, language modeling, speech recognition, and reinforcement learning. For each dataset, we provide a unified evaluation protocol using realistic data splits and evaluation metrics. To meet the pressing need for reproducing realistic FL at scale, we have also built an efficient evaluation platform to simplify and standardize the process of FL experimental setup and model evaluation. Our evaluation platform provides flexible APIs to implement new FL algorithms and include new execution backends with minimal developer efforts. Finally, we perform in-depth benchmark experiments on these datasets. Our experiments suggest that FedScale presents significant challenges of heterogeneity-aware co-optimizations of the system and statistical efficiency under realistic FL characteristics, indicating fruitful opportunities for future research. FedScale is open-source with permissive licenses and actively maintained, and we welcome feedback and contributions from the community.

You can read up on the details in our paper and check out the code on GitHub. Do try it and contribute so that we can together build a large-scale benchmark that considers both data and system heterogeneity across a variety of application domains.

Fan, Yinwei, and Xiangfeng have put in a tremendous amount of work over almost two years to get to this point, and I’m super excited about its future.