Jie Graduates. Congrats Dr. You!

Jie (Jimmy) has recently become my third PhD student to graduate, with a dissertation titled “Toward Practical Application-Aware Big Data Systems.” Over the course of his PhD, Jimmy contributed to most of our group’s major projects, and we will miss him going forward. He’s joining Meta.

Jimmy officially started his PhD at Michigan in Fall 2016, but he began working with me in Summer 2017. He started on one of SymbioticLab’s first projects, on geo-distributed analytics, which took many names over the years (Gaia/Terra/System H). While the core project didn’t go as expected, it created tools and artifacts that helped other projects Jimmy contributed to, which were published in SPAA and NSDI. After that, Jimmy learned that sometimes it’s better to come up with your own idea, and he successfully led the Kayak project end-to-end, providing a middle ground between KV- and RPC-based systems. This was done in collaboration with Xin and his group. After moving from high-latency to low-latency networks, Jimmy’s final project went in yet another direction. In Zeus, Jimmy and Jae-Won collaborated on understanding and optimizing the energy consumption of DNN training. I think Zeus is the best of his works and will have a lasting impact. While it was stressful as an advisor to see him change course so many times during his PhD, it was also fun to see him eventually find his footing on his own terms.

Jimmy is very inquisitive, which led to him exploring many different things during his PhD. He is also very good at taking feedback and improving himself, which he’s clearly demonstrated over the past five years. I’m sure he’ll ask many questions and explore many new things in the next chapter(s) of his career.

ModelKeeper and Zeus Accepted to Appear at NSDI’2023

Deep learning, and machine learning in general, is taking over the world. It is, however, quite expensive to tune, train, and serve deep learning models. Naturally, improving the efficiency and performance of deep learning workflows has received significant attention (Salus, Tiresias, and Fluid to name a few). Most existing works, including our prior ones, focus on two primary ways to improve efficiency, and resource efficiency at that. The first is packing work as tightly as possible (placement). The second is scheduling over time. Some apply both together. None focus on improving energy efficiency. ModelKeeper and Zeus are our efforts toward, respectively, improving resource efficiency by avoiding work altogether and improving energy efficiency instead of focusing solely on resource usage.

ModelKeeper

We know scheduling and placement can improve the efficiency of resource usage, but even with optimal algorithms one cannot reduce the amount of work that needs to be done in the general case. This simple observation led us to explore how we can reduce the amount of work needed to train DNN models. It turns out that instead of starting from random weights and training until they reach their final values, one can better initialize a model before training starts and short-circuit the process! By identifying similar models that have already been trained in the past, one can reduce the number of iterations needed for a model to converge.

With growing deployment of machine learning (ML) models, ML developers are training or re-training increasingly more deep neural networks (DNNs). They do so to find the most suitable model that meets their accuracy requirement while satisfying the resource and timeliness constraints of the target environment. In large shared clusters, the growing number of neural architecture search (NAS) and training jobs often results in models sharing architectural similarities with others from the same or a different ML developer. However, existing solutions do not provide a systematic mechanism to identify and leverage such similarities.

We present ModelKeeper, the first automated training warmup system that accelerates DNN training by repurposing previously-trained models in a shared cluster. Our key insight is that initializing a training job’s model by transforming an already-trained model’s weights can jump-start it and reduce the total amount of training needed. However, models submitted over time can differ in their architectures and accuracy. Given a new model to train, ModelKeeper scalably identifies its architectural similarity with previously trained models, selects a parent model with high similarity and good model accuracy, and performs structure-aware transformation of weights to preserve maximal information from the parent model during the warmup of new model weights. Our evaluations across thousands of CV and NLP models show that ModelKeeper achieves 1.3×–4.3× faster training completion with little overhead and no reduction in model accuracy.
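To make the warm-start idea concrete, here is a minimal sketch in PyTorch under a strong simplifying assumption: weights are copied only for layers whose parameter names and tensor shapes match exactly. ModelKeeper’s actual matcher is structure-aware and handles architectural differences; the function and file names below are purely illustrative.

```python
# A toy warm-start: copy weights from an already-trained "parent" model into
# a new "child" model wherever parameter names and shapes line up, and keep
# the child's random initialization everywhere else. ModelKeeper's real
# matcher is structure-aware; this exact-match rule is only for illustration.
import torch

def warm_start(child: torch.nn.Module, parent_state: dict) -> int:
    """Initialize `child` from `parent_state`; return the number of tensors copied."""
    child_state = child.state_dict()
    copied = 0
    for name, tensor in child_state.items():
        parent_tensor = parent_state.get(name)
        if parent_tensor is not None and parent_tensor.shape == tensor.shape:
            child_state[name] = parent_tensor.clone()
            copied += 1
    child.load_state_dict(child_state)
    return copied

# Hypothetical usage:
#   parent_state = torch.load("parent_checkpoint.pt")  # a similar, trained model
#   child = build_new_model()                          # the model about to be trained
#   print(warm_start(child, parent_state), "tensors warm-started; train as usual")
```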

Fan started the ModelKeeper project with Yinwei in late 2020, while Oort was making rounds and FedScale was in its infancy. With his Meta internship in the middle and many other projects he has been working on, the ModelKeeper submission was pushed back a couple of times. In hindsight, the extra time significantly improved the quality of the work. While the setting considered in this paper is cloud computing, ModelKeeper is likely to become an integral part of the greater FedScale project to speed up federated learning as well.

ModelKeeper is yet another collaboration between Harsha and me. Hopefully, we will continue to collaborate even after Harsha moves to USC in Winter 2023.

Zeus

With ever-increasing model sizes, the cost of DNN training is rising rapidly. While the monetary cost is often discussed, there is an implicit energy cost of DNN training as well. For example, training the GPT-3 model consumes 1,287 megawatt-hours (MWh), which is equivalent to 120 years of electricity consumption for an average U.S. household. In this pioneering work, we take the first step in better understanding and then optimizing the energy consumption of DNN training. Specifically, we optimize the batch size and GPU power cap of recurring training jobs to provide a better tradeoff between the energy consumed and the accuracy attained.

Training deep neural networks (DNNs) is becoming increasingly more resource- and energy-intensive every year. Unfortunately, existing works primarily focus on optimizing DNN training for faster completion, often without considering the impact on energy efficiency.

In this paper, we observe that common practices to improve training performance can often lead to inefficient energy usage. More importantly, we demonstrate that there is a tradeoff between energy consumption and performance optimization. To this end, we propose an optimization framework, Zeus, to navigate this tradeoff by automatically configuring job- and GPU-level configurations of recurring DNN training jobs. Zeus uses an online exploration-exploitation approach in conjunction with just-in-time energy profiling, averting the need for expensive offline measurements, while adapting to workload changes and data drifts over time. Our evaluation shows that Zeus can improve the energy efficiency of DNN training by 18.7%-72.8% for diverse workloads.
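As a concrete illustration of the GPU-level knob, here is a minimal sketch of just-in-time energy profiling using NVIDIA’s NVML bindings (pynvml). This is not Zeus’s actual code: the sampling loop and names are assumptions for illustration, and changing the GPU power limit typically requires administrator privileges.

```python
# Illustrative just-in-time energy profiling for one power-cap setting,
# using NVIDIA's NVML bindings. Not Zeus's actual code: the sampling loop
# and names are assumptions, and setting the power limit typically needs
# administrator privileges.
import time
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

def profile(train_step, power_limit_mw, num_steps=50):
    """Run `num_steps` training steps under a GPU power cap (in milliwatts)
    and return (elapsed_seconds, estimated_energy_joules)."""
    pynvml.nvmlDeviceSetPowerManagementLimit(gpu, power_limit_mw)
    start = last = time.time()
    energy = 0.0
    for _ in range(num_steps):
        train_step()  # one optimizer step, supplied by the training job
        now = time.time()
        watts = pynvml.nvmlDeviceGetPowerUsage(gpu) / 1000.0  # mW -> W
        energy += watts * (now - last)  # rectangle-rule integration
        last = now
    return last - start, energy

# A Zeus-style search would repeat this for several (batch size, power cap)
# pairs online and pick the configuration with the best energy/time tradeoff.
```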

Zeus started sometime around Fall 2020/Winter 2021 with Jimmy. At the end of Winter, when Jimmy left for his internship, we had a basic idea of a problem and one motivating plot that would eventually drive our efforts. With the arrival of Jae-Won in Fall 2021 as a first-year student and Jimmy back from Meta, we picked up the pace, which eventually led to the submission. Zeus is the first Treehouse project, and my first foray into anything energy-related. We had a lot to learn, but I was in the capable hands of Jimmy and Jae-Won, who learned and taught me much. And we haven’t even scratched the surface!

To work on many more exciting projects like these, join SymbioticLab!

Peifeng has Phinished. Congrats Dr. Yu!

Peifeng just became my second student to finish his PhD a few days ago after successfully defending his dissertation, “Application-Aware Scheduling in Deep Learning Software Stacks.” This will be a big loss for SymbioticLab, as we will miss his presence and deep technical insights. Peifeng is joining Google to continue working on resource management systems for AI/ML.

Peifeng officially started his PhD in Fall 2017, but he had been working with me on and off since the Fall before, when he took EECS 582 with me as a master’s student at UM. Peifeng and his friend, Linh, were working on a term project on video captioning for that course, but Peifeng was more interested in designing better systems for AI/ML than in simply applying existing ML techniques to different use cases. Although I did not know anything about systems for AI/ML, Peifeng pulled me into this world. Since then, Peifeng has worked on several groundbreaking projects, including Salus and Fluid, with an even more exciting project, Orloj, in the pipeline to be published. Salus was the first software GPU sharing solution that provided significantly higher utilization than NVIDIA MPS; Fluid was the first to leverage the collective nature of jobs in hyperparameter tuning to improve GPU- and cluster-level utilization. Orloj is the first inference system to provide predictable performance for dynamic DNNs while maintaining best-in-class performance for traditional static DNNs. I enjoyed this journey thoroughly, learned a lot in the process, and am really proud to be called his advisor.

Peifeng is one of the best (ML) systems developers I have ever seen (and I have seen many luminaries over the years). He cares more about doing his work than hyping it up. He is also unbothered by the publication rat race, to the point of causing advisor anxiety.

I have no doubt he will be extremely successful in whatever he sets his mind to.

Promoted to Associate Professor With Tenure!

All credit goes to my students and collaborators who helped realize many groundbreaking ideas; the hundreds of students I taught at the undergraduate and graduate levels; my mentors who continue to advise me through good times and bad; everyone who wrote letters for me; the countless reviewers and panelists who improved our papers and proposals; the funding entities (especially NSF) that support our large research program; those who invited me to be on and learn from program committees, panels, meetings, and talks; my research community as a whole; everyone in CSE at Michigan; and countless others I’m sure I’m forgetting. Thanks to everyone from the depths of my heart.

Extra special thanks to my family who made countless sacrifices in unimaginable ways.

Well, it’s done now, but I don’t feel any different; only more energized to do bigger, better things. If you want to work on challenging problems, join SymbioticLab. We’re only getting started!

FedScale Accepted to Appear at ICML’2022

Although theoretical federated learning (FL) research is growing exponentially, we are far from putting those theories into practice. Over the course of the last few years, SymbioticLab has made significant progress in building deployable FL systems, with Oort being the most prominent example. As I discussed in the past, while evaluating Oort, we observed the weaknesses of existing FL workloads/benchmarks: they are too small and sometimes too homogeneous to highlight the uncertainties that FL deployments would face in the real world. FedScale was born out of the necessity to evaluate Oort. As we worked on it, we added more and more datasets to create a diverse benchmark that not only contains workloads to evaluate FL but also traces to emulate real-world end-device characteristics. Eventually, we started building a runtime as well that one can use to implement any FL algorithm within FedScale. For example, Oort can be implemented with a few lines in FedScale, as can PyramidFL, a more recent MobiCom’22 work based on Oort. This ICML paper gives an overview of the benchmarking aspects of FedScale for ML/FL researchers, while providing a quick intro to the systems runtime that we are continuously working on and plan to publish later this year.

We present FedScale, a diverse set of challenging and realistic benchmark datasets to facilitate scalable, comprehensive, and reproducible federated learning (FL) research. FedScale datasets are large-scale, encompassing a wide range of important FL tasks, such as image classification, object detection, word prediction, speech recognition, and sequence prediction in video streaming. For each dataset, we provide a unified evaluation protocol using realistic data splits and evaluation metrics. To meet the pressing need for reproducing realistic FL at scale, we build an efficient evaluation platform to simplify and standardize the process of FL experimental setup and model evaluation. Our evaluation platform provides flexible APIs to implement new FL algorithms, and includes new execution backends (e.g., mobile backends) with minimal developer efforts. Finally, we perform systematic benchmark experiments on these datasets. Our experiments suggest fruitful opportunities in heterogeneity-aware co-optimizations of the system and statistical efficiency under realistic FL characteristics. FedScale will be open-source and actively maintained, and we welcome feedback and contributions from the community.
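To illustrate the kind of experiment loop such a runtime standardizes, here is a minimal, framework-agnostic federated averaging sketch. None of these names are FedScale APIs; client selection, local training, and aggregation are all placeholders.

```python
# A minimal, framework-agnostic FedAvg round, just to illustrate the kind of
# experiment loop an FL runtime standardizes. None of these names are
# FedScale APIs; selection, local training, and aggregation are placeholders.
import random

def run_round(global_weights, clients, clients_per_round=10):
    """One FL round: sample clients, train locally, average the updates."""
    selected = random.sample(clients, min(clients_per_round, len(clients)))
    updates, sizes = [], []
    for client in selected:
        updates.append(client.local_train(global_weights))  # placeholder method
        sizes.append(client.num_samples)                    # weight by data size
    total = sum(sizes)
    # Weighted average of the client models, parameter by parameter.
    return {
        name: sum(n * u[name] for n, u in zip(sizes, updates)) / total
        for name in global_weights
    }

# A benchmark like FedScale additionally supplies realistic client datasets,
# device-speed traces, and availability traces that determine which clients
# can participate in each round and how long each of them takes.
```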

Fan and Yinwei had been working on FedScale for more than two years, with some help from Xiangfeng toward the end of Oort. During this time, Jiachen and Sanjay joined, first as users of FedScale and later as its contributors. Of course, Harsha is with us, as in all our past FL projects. Including this summer, close to 20 undergrads and master’s students have worked on/with/around it. At this point, FedScale has become the largest project in SymbioticLab, with interest from academic and industry users within and outside Michigan, and there is an active Slack channel where users from many different institutions collaborate. We are also organizing the first FedScale Summer School this year. Overall, FedScale reminds me of another small project called Spark that I was part of many years ago!

This is my/our first paper in ICML, or any ML conference for that matter, even though it’s not necessarily a core ML paper. This year, ICML received 5,630 submissions. Among these, 1,117 were accepted for short presentations and 118 for long presentations, for a 21.94% acceptance rate; FedScale is one of the former. These numbers are mind-boggling for me as someone from the systems community!

Join us in making FedScale even bigger, better, and more useful, as a member of SymbioticLab or as a FedScale user/contributor. Now that we have the research vehicle, the possibilities are limitless. We are exploring maybe fewer than 10 such ideas, but hundreds are waiting for you.

Visit http://fedscale.ai/ to learn more.

Aequitas Accepted to Appear at SIGCOMM’2022

Although Remote Procedure Calls (RPCs) generate the bulk of the network traffic in modern disaggregated datacenters, network-level optimizations still focus on low-level metrics at the granularity of packets, flowlets, and flows. This lack of application-level semantics leads to suboptimal performance, as there is often a mismatch between network- and application-level metrics. Indeed, my work on coflows relied on a similar observation for distributed data-parallel jobs. In this work, we focus on leveraging application- and user-level semantics of RPCs to improve application- and user-perceived performance. Specifically, we implement quality-of-service (QoS) primitives at the RPC level by extending classic works on service curves and network calculus. We show that this can be achieved using an edge-based solution in conjunction with traditional QoS classes in legacy/non-programmable switches.

With the increasing popularity of disaggregated storage and microservice architectures, high fan-out and fan-in Remote Procedure Calls (RPCs) now generate most of the traffic in modern datacenters. While the network plays a crucial role in RPC performance, traditional traffic classification categories cannot sufficiently capture their importance due to wide variations in RPC characteristics. As a result, meeting service-level objectives (SLOs), especially for performance-critical (PC) RPCs, remains challenging.

We present Aequitas, a distributed sender-driven admission control scheme that uses commodity Weighted-Fair Queuing (WFQ) to guarantee RPC-level SLOs. Aequitas maps PC RPCs to higher weight queues. In the presence of network overloads, it enforces cluster-wide RPC latency SLOs by limiting the amount of traffic admitted into any given QoS and downgrading the rest. We show analytically and empirically that this simple scheme works well. When the network demand spikes beyond provisioned capacity, Aequitas achieves a latency SLO that is 3.8× lower than the state-of-the-art congestion control at the 99.9th-percentile and admits up to 2× more PC RPCs meeting SLO when compared with pFabric, Qjump, D3, PDQ, and Homa. Results in a fleet-wide production deployment at a large cloud provider show a 10% latency improvement.
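As a toy illustration of the high-level idea in the abstract, the sketch below keeps a per-QoS admission probability at the sender that shrinks when observed RPC latencies violate the SLO and grows when they meet it, downgrading non-admitted RPCs to a lower class. The update rule and all names are assumptions for illustration, not Aequitas’s actual algorithm.

```python
# Toy sender-side admission control in the spirit of the abstract above:
# each QoS class keeps an admission probability that shrinks when observed
# RPC latencies violate the SLO and grows cautiously when they meet it;
# non-admitted RPCs are downgraded to a lower QoS class. This AIMD-style
# update rule and all names are illustrative, not Aequitas's algorithm.
import random

class QosAdmission:
    def __init__(self, slo_us: float, decrease: float = 0.9, increase: float = 0.01):
        self.slo_us = slo_us          # latency SLO for this QoS class
        self.decrease = decrease      # multiplicative decrease on SLO violation
        self.increase = increase      # additive increase on SLO compliance
        self.admit_prob = 1.0

    def on_rpc_completion(self, latency_us: float) -> None:
        """Update the admission probability from an observed RPC latency."""
        if latency_us > self.slo_us:
            self.admit_prob = max(0.05, self.admit_prob * self.decrease)
        else:
            self.admit_prob = min(1.0, self.admit_prob + self.increase)

    def classify(self, requested_qos: str, lower_qos: str) -> str:
        """Admit an RPC into its requested QoS with probability admit_prob;
        otherwise downgrade it."""
        return requested_qos if random.random() < self.admit_prob else lower_qos
```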

The inception of this project can be traced back to the spring of 2019, when I visited Google to attend a networking summit. Nandita and I had been trying to collaborate on something since 2016(!), and the time seemed right with Yiwen ready to go for an internship. Over the course of a couple of summers in 2019 and 2020, and many months before and after, Yiwen managed to build a theoretical framework that adds QoS support to the classic network calculus framework, in collaboration with Gautam. Yiwen also had a lot of support from Xian, PR, and unnamed others to get the simulator and the actual code implemented and deployed in Google datacenters. They implemented a public-facing/shareable version of the simulator as well! According to Amin and Nandita, this is one of the “big” problems, and I’m happy that we managed to pull it off.

This was my first time working with Google. Honestly, I was quite impressed by the rigor of the internal review process (even though the paper still got rejected a couple of times). It’s also my first paper with Gautam since 2012! He was great then, and he’s gotten even better in the past decade. Yiwen is also having a great year, with a SIGCOMM paper right after his NSDI one.

This year, the SIGCOMM PC accepted 55 out of 281 submissions after a couple of years of record-breaking acceptance rates.

Treehouse White Paper Released

We started the Treehouse project at the beginning of last summer with a mini workshop. It took a while to distill our core ideas, but we finally have a white paper that attempts to provide answers to some frequently asked questions: why now? what does it mean for cloud software to be carbon-aware? what are (some of) the ways to get there? why these? etc.

It’s only the beginning, and as we go deeper into our explorations, I’m sure the details will evolve and become richer. Nonetheless, I do believe we have a compelling vision of a future where we can, among other things, enable reasonable tradeoffs between the performance and carbon footprint of cloud software stacks, accurately understand how much carbon our software systems are consuming, and optimally choose consumption sites, sources, and timing to best address the energy challenges we are all facing.

We look forward to hearing from the community. Watch out for exciting announcements coming soon from the Treehouse project!

Hydra Accepted to Appear at FAST’2022

Almost five years after Infiniswap, memory disaggregation is now a mainstream research topic. It goes by many names, but the idea of accessing (unused/stranded) memory over high-speed networks is now close to reality. Despite the many works in this line of research, a key remaining problem is ensuring resilience: how can applications recover quickly if remote memory fails, becomes corrupted, or is inaccessible? Keeping a copy on disk, as Infiniswap does, causes a performance bottleneck under failure, while keeping in-memory replicas doubles the memory overhead. We started Hydra to hit a sweet spot between these two extremes by applying lessons learned from our EC-Cache project to extremely small objects. While EC-Cache explicitly focused on very large objects in the MB range, Hydra aims to perform erasure coding at 4KB page granularity at the microsecond timescales common for RDMA. Additionally, we extended Asaf’s CopySets idea to erasure coding to tolerate concurrent failures with low overhead.

We present Hydra, a low-latency, low-overhead, and highly available resilience mechanism for remote memory. Hydra can access erasure-coded remote memory within a single-digit μs read/write latency, significantly improving the performance-efficiency tradeoff over the state-of-the-art – it performs similar to in-memory replication with 1.6× lower memory overhead. We also propose CodingSets, a novel coding group placement algorithm for erasure-coded data, that provides load balancing while reducing the probability of data loss under correlated failures by an order of magnitude. With Hydra, even when only 50% memory is local, unmodified memory-intensive applications achieve performance close to that of the fully in-memory case in the presence of remote failures and outperforms the state-of-the-art remote-memory solutions by up to 4.35×.
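The memory overhead arithmetic behind the 1.6× figure is easy to see: full replication costs 2× memory, while erasure coding a page into k data splits and r parity splits costs (k + r)/k, e.g., 1.25× for k = 8 and r = 2, which is 1.6× lower than replication (the exact split counts here are illustrative, not necessarily Hydra’s). Below is a toy single-parity sketch of splitting a 4 KB page; Hydra’s actual coding, placement, and RDMA data path are far more involved.

```python
# Toy erasure coding of a 4 KB page (not Hydra's actual code): split the page
# into k equal data splits plus one XOR parity split, so any single lost split
# can be rebuilt. Real deployments use r > 1 parity splits to survive
# concurrent failures; memory overhead is (k + r) / k vs. 2x for replication.
PAGE_SIZE = 4096

def xor_bytes(chunks):
    """Byte-wise XOR of equal-length byte strings."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

def encode(page: bytes, k: int = 4):
    """Return k data splits followed by one parity split."""
    assert len(page) == PAGE_SIZE and PAGE_SIZE % k == 0
    size = PAGE_SIZE // k
    splits = [page[i * size:(i + 1) * size] for i in range(k)]
    return splits + [xor_bytes(splits)]

def recover(splits, lost_index: int) -> bytes:
    """Rebuild a single lost split by XOR-ing all surviving ones."""
    return xor_bytes([s for i, s in enumerate(splits) if i != lost_index])
```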

Youngmoon started working on Hydra right after we presented Infiniswap in early 2017! As Youngmoon graduated, Hasan started leading the project from early 2019, and Asaf joined us later that year. Together they significantly improved the paper over the early drafts. Even then, Hydra faced immense challenges. In the process, Hydra has taken over from Justitia the notorious distinction of holding my record for accepted-after-N-submissions.

This was my first time submitting to FAST.