Tag Archives: Carbon-Aware Systems

ModelKeeper and Zeus Accepted to Appear at NSDI’2023

Deep learning, and machine learning in general, is taking over the world. It is, however, quite expensive to tune, train, and serve deep learning models. Naturally, improving the efficiency and performance of deep learning workflows has received significant attention (Salus, Tiresias, and Fluid to name a few). Most existing works, including our prior works, focus on two primary ways to improve efficiency, and resource efficiency at that: packing work as tightly as possible (placement) and scheduling over time; some apply both together. None focus on improving energy efficiency. ModelKeeper and Zeus are our efforts to move beyond these approaches: ModelKeeper improves resource efficiency by avoiding work that does not need to be done, while Zeus improves energy efficiency instead of focusing solely on resource usage.

ModelKeeper

We know scheduling and placement can improve resource efficiency, but even with optimal algorithms one cannot, in the general case, reduce the amount of work that needs to be done. This simple observation led us to explore how we can reduce the amount of work needed to train DNN models. It turns out that instead of starting from random weights and training until they reach their final values, one can better initialize a model before training starts and short-circuit the process! By identifying similar models that have already been trained in the past, one can reduce the number of iterations a new model needs to converge.
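To make the warm-start idea concrete, here is a minimal sketch, not ModelKeeper’s actual algorithm: copy weights from an already-trained model into a new model for every layer whose name and shape match, and leave the rest randomly initialized. The two torchvision models below are only illustrative stand-ins for a previously trained model in the cluster and a newly submitted model.

```python
# Minimal warm-start sketch (illustrative only, not ModelKeeper's algorithm):
# copy weights for layers whose names and shapes match; keep the rest random.
from torchvision import models

parent = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # stand-in for an already-trained model
child = models.resnet34()                                          # stand-in for the new model to train

parent_state = parent.state_dict()
child_state = child.state_dict()

transferred = 0
for name, tensor in child_state.items():
    if name in parent_state and parent_state[name].shape == tensor.shape:
        child_state[name] = parent_state[name].clone()
        transferred += 1

child.load_state_dict(child_state)
print(f"Warm-started {transferred}/{len(child_state)} tensors from the parent model")
```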

With growing deployment of machine learning (ML) models, ML developers are training or re-training increasingly more deep neural networks (DNNs). They do so to find the most suitable model that meets their accuracy requirement while satisfying the resource and timeliness constraints of the target environment. In large shared clusters, the growing number of neural architecture search (NAS) and training jobs often results in models sharing architectural similarities with others from the same or a different ML developer. However, existing solutions do not provide a systematic mechanism to identify and leverage such similarities.

We present ModelKeeper, the first automated training warmup system that accelerates DNN training by repurposing previously-trained models in a shared cluster. Our key insight is that initializing a training job’s model by transforming an already-trained model’s weights can jump-start it and reduce the total amount of training needed. However, models submitted over time can differ in their architectures and accuracy. Given a new model to train, ModelKeeper scalably identifies its architectural similarity with previously trained models, selects a parent model with high similarity and good model accuracy, and performs structure-aware transformation of weights to preserve maximal information from the parent model during the warmup of new model weights. Our evaluations across thousands of CV and NLP models show that ModelKeeper achieves 1.3×–4.3× faster training completion with little overhead and no reduction in model accuracy.
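The parent-selection step can be illustrated with a toy sketch. ModelKeeper’s real matcher is structure-aware and far more scalable; the version below simply scores each candidate by the overlap of its layer signatures (module type plus weight shape) with the new model and favors candidates that are both similar and accurate. All names here are hypothetical.

```python
# Toy parent-model selection (illustrative; ModelKeeper's matcher is structure-aware):
# score candidates by layer-signature overlap with the new model, weighted by accuracy.
import torch.nn as nn

def signatures(model: nn.Module) -> set:
    sigs = set()
    for module in model.modules():
        weight = getattr(module, "weight", None)
        if weight is not None:
            sigs.add((type(module).__name__, tuple(weight.shape)))
    return sigs

def select_parent(new_model: nn.Module, zoo: list) -> nn.Module:
    """zoo: list of (trained_model, validation_accuracy) pairs."""
    new_sigs = signatures(new_model)

    def score(entry):
        model, accuracy = entry
        sigs = signatures(model)
        similarity = len(new_sigs & sigs) / max(len(new_sigs | sigs), 1)  # Jaccard overlap
        return similarity * accuracy  # favor parents that are similar *and* accurate

    best_model, _ = max(zoo, key=score)
    return best_model
```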

Fan started the ModelKeeper project with Yinwei in late 2020 while Oort was making rounds and FedScale was in its infancy. With his internship at Meta in the middle and many other projects he’s been working on, the ModelKeeper submission was pushed back a couple of times. In hindsight, the extra time significantly improved the quality of the work. While the setting considered in this paper is cloud computing, ModelKeeper is likely to become an integral part of the greater FedScale project to speed up federated learning as well.

ModelKeeper is yet another collaboration between Harsha and me. Hopefully, we will continue to collaborate even after Harsha moves to USC in Winter 2023.

Zeus

With ever-increasing model sizes, the cost of DNN training is rising rapidly. While the monetary cost is often discussed, there is an implicit energy cost of DNN training as well. For example, training the GPT-3 model consumed 1,287 megawatt-hours (MWh), equivalent to 120 years of electricity consumption for an average U.S. household. In this pioneering work, we take the first step toward better understanding and then optimizing the energy consumption of DNN training. Specifically, we optimize the batch size and GPU power cap of recurring training jobs to provide a better tradeoff between the energy consumed and the accuracy attained.
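For intuition, here is a rough sketch of the two knobs involved, using NVIDIA’s NVML bindings (pynvml). Changing the GPU power limit usually requires administrator privileges, the cumulative energy counter is only available on recent GPUs and library versions, and train_one_epoch is a hypothetical placeholder for the actual training loop; none of this is Zeus’s actual interface.

```python
# Illustrative sketch of the two knobs: GPU power cap and per-run energy.
# Not Zeus's interface; setting the power limit usually needs admin privileges.
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

def set_power_cap(watts: int) -> None:
    pynvml.nvmlDeviceSetPowerManagementLimit(gpu, watts * 1000)  # NVML expects milliwatts

def energy_joules() -> float:
    # Cumulative GPU energy counter (millijoules); available on recent GPUs.
    return pynvml.nvmlDeviceGetTotalEnergyConsumption(gpu) / 1000.0

set_power_cap(200)                 # cap the GPU at 200 W for this recurring job
start = energy_joules()
train_one_epoch(batch_size=256)    # hypothetical placeholder for the training loop
print(f"Epoch consumed {energy_joules() - start:.0f} J at this (batch size, power cap)")
```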

Training deep neural networks (DNNs) is becoming increasingly more resource- and energy-intensive every year. Unfortunately, existing works primarily focus on optimizing DNN training for faster completion, often without considering the impact on energy efficiency.

In this paper, we observe that common practices to improve training performance can often lead to inefficient energy usage. More importantly, we demonstrate that there is a tradeoff between energy consumption and performance optimization. To this end, we propose an optimization framework, Zeus, to navigate this tradeoff by automatically configuring job- and GPU-level configurations of recurring DNN training jobs. Zeus uses an online exploration-exploitation approach in conjunction with just-in-time energy profiling, averting the need for expensive offline measurements, while adapting to workload changes and data drifts over time. Our evaluation shows that Zeus can improve the energy efficiency of DNN training by 18.7%-72.8% for diverse workloads.
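As a toy illustration of the exploration-exploitation flavor of this approach, the sketch below picks a (batch size, power cap) configuration epsilon-greedily across recurrences of a job and records a weighted energy/time cost for each run. The candidate grid, the cost formula, and the constants are illustrative stand-ins, not Zeus’s actual formulation or algorithm.

```python
# Toy epsilon-greedy configuration search over (batch size, power cap) for a
# recurring training job. The candidate grid and cost formula are illustrative
# stand-ins, not Zeus's actual algorithm or cost metric.
import random
from collections import defaultdict

CANDIDATES = [(bs, cap) for bs in (64, 128, 256) for cap in (150, 200, 250)]  # (batch size, watts)
history = defaultdict(list)   # config -> observed costs from past recurrences
EPSILON, ETA = 0.2, 0.5       # exploration rate; energy-vs-time weight

def choose_config():
    unexplored = [c for c in CANDIDATES if not history[c]]
    if unexplored or random.random() < EPSILON:
        return random.choice(unexplored or CANDIDATES)  # explore
    return min(CANDIDATES, key=lambda c: sum(history[c]) / len(history[c]))  # exploit

def record(config, energy_joules, seconds):
    # Weighted cost trading off energy against time (illustrative).
    history[config].append(ETA * energy_joules + (1 - ETA) * seconds)

# Each time the job recurs: run it with choose_config(), profile its energy and
# runtime just-in-time, then call record() so later recurrences can exploit it.
```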

Zeus started sometime around Fall 2020/Winter 2021 with Jimmy. At the end of that winter, when Jimmy left for his internship, we had a basic idea of the problem and one motivating plot that would eventually drive our efforts. With the arrival of Jae-Won in Fall 2021 as a first-year student and Jimmy back from Meta, we picked up the pace, which eventually led to the submission. Zeus is the first Treehouse project and my first foray into anything energy-related. We had a lot to learn, but I was in the capable hands of Jimmy and Jae-Won, who learned quickly and taught me much. And we haven’t even scratched the surface!

To work on many more exciting projects like these, join SymbioticLab!

Treehouse White Paper Released

We started the Treehouse project at the beginning of last summer with a mini workshop. It took a while to distill our core ideas, but we finally have a white paper that attempts to provide answers to some frequently asked questions: why now? what does it mean for cloud software to be carbon-aware? what are (some of) the ways to get there? why these? etc.

It’s only the beginning, and as we go deeper into our explorations, I’m sure the details will evolve and become richer. Nonetheless, I do believe we have a compelling vision of a future where we can, among other things, make reasonable tradeoffs between the performance and carbon consumption of cloud software stacks, accurately understand how much carbon our software systems are consuming, and optimally choose consumption sites, sources, and timing to best address the energy challenges we are all facing.

We look forward to hearing from the community. Watch out for exciting announcements coming soon from the Treehouse project!

Treehouse Funded. Thanks NSF and VMware!

Treehouse, our proposal on energy-first software infrastructure design for sustainable cloud computing, has recently won three million dollars in joint funding from NSF and VMware! Here is a link to the VMware press release. SymbioticLab now has a chance to collaborate with a stellar team consisting of Tom Anderson, Adam Belay, Asaf Cidon, and Irene Zhang, and I’m super excited about the changes Treehouse will bring to how we do sustainable cloud computing in the near future.

Our team started with Asaf and me looking to go beyond memory disaggregation in early Fall 2020. Soon we convinced Adam to join us, and he introduced Irene to the idea; both of them are experts in efficient software design. Finally, together we managed to get Tom to lead our team, and many ideas about an energy-first redesign of cloud computing software stacks started to sprout.

The early steps in building Treehouse are already under way. Last month, we held our inaugural mini-workshop with more than 20 attendees, and I expect it to grow only bigger in the near future. Stay tuned!