Category Archives: Recent News

Presented a Keynote Talk on FL Systems at DistributedML’21

This week, I presented a keynote talk at the DistributedML’21 workshop on our recent work on building software systems support for practical federated computation (Sol, Oort, and FedScale). This is a longer version of my talk at the Google FL workshop last month, and the updated slides go into more detail on our research on cross-silo and cross-device federated learning and analytics.

Presented a Keynote at 2021 Workshop on Federated Learning and Analytics

Recently, I presented a keynote talk at the 2021 federated learning and analytics workshop organized by Google on our recent work on building software systems support for practical federated computation (Sol, Oort, and FedScale).

I based my talk on the similarities and differences between software stacks for cloud systems and federated systems for learning and analytics. While we still want to perform similar computation (to some extent), the underlying network emerges as one of the biggest challenges for the latter. Because the wide-area network (WAN) has significantly lower bandwidth and higher latency than a datacenter network, federated software stacks have to be rethought with those constraints in mind. Small tweaks here and there are not enough.

Federated learning and analytics systems also come in two broad flavors: cross-silo and cross-device. In the case of the former, a few computationally powerful and reliable facilities are connected over the WAN, each with many powerful compute devices. In the latter, a massive number of weak and unreliable devices (e.g., smartphones) take part in the computation. Naturally, cross-device solutions have to deal with additional challenges beyond the network. The devices have resource and battery constraints, and their owners may not always be connected or may have unique behavioral/charging patterns. How do we reason about learning and analytics under such uncertainty?

While the first two topics focused on systems research, the third piece of my talk was about providing a service to my machine learning (ML) colleagues so that they can easily implement and evaluate large-scale federated systems. I believe that systems researchers will fail their ML counterparts if an ML person has to spend considerable time building and tinkering with systems instead of spending that time developing new ideas and algorithms. To this end, I talked about the challenges in building such a benchmarking dataset and experimental harness.

I want to thank Peter Kairouz and Marco Gruteser from Google for inviting me to the workshop. My slides are available here and have more details.

Juncheng Levels Up. Congrats Dr. Gu!

My first Ph.D. student Juncheng Gu graduated earlier this month after successfully defending his dissertation titled “Efficient Resource Management for Deep Learning Clusters.” This is a bittersweet moment. While I am extremely proud of everything he has done, I will miss having him around. I do know that a bigger stage awaits him; Juncheng is joining the ByteDance AI Lab to build practical systems for AI and machine learning!

Juncheng started his Ph.D. in the Fall of 2015, right before I started at Michigan. I joined his then-advisor Kang Shin to co-advise him as he started working on a precursor to Infiniswap as a term project for the EECS 582 course I was teaching. Since then, Juncheng has worked on many projects ranging from hardware and systems to machine learning/computer vision, with varying levels of luck and success, but they were all meaningful works. I consider him a generalist in his research taste. Infiniswap and Tiresias stand out the most among his projects. Infiniswap heralded the rise of the many follow-ups we see today on the topic of memory disaggregation. It was the first of its kind and introduced many around the world to this new area of research. Tiresias was one of the earliest works on GPU cluster management and certainly the first that did not require any prior knowledge about deep learning jobs’ characteristics to effectively allocate GPUs for them and to schedule them. To this day, it is the best of its kind for distributed deep learning training. I am honored to have had the opportunity to advise Juncheng.

Juncheng is a great researcher, but he is an even better person. He is very down-to-earth and tries his best to help others out whenever possible. He also understates and underestimates what he can do and has achieved, often to a fault.

I wish him a fruitful career and a prosperous life!

Oort Wins the Distinguished Artifact Award at OSDI’2021. Congrats Fan and Xiangfeng!

Oort, our federated learning system for scalable machine learning over millions of edge devices, has received the Distinguished Artifact Award at this year’s USENIX OSDI conference!

This is a testament to a lot of hard work put in by Fan and Xiangfeng over the course of the last couple of years. Oort is our first foray into federated learning, but it certainly is not the last.

Oort and its workloads (FedScale) are both open-source at https://github.com/symbioticlab.

Justitia Accepted to Appear at NSDI’2022

The need for higher throughput and lower latency is driving kernel-bypass networking (KBN) in datacenters. Of the two related trends in KBN, hardware-based KBN is especially challenging because, unlike software KBN such as DPDK, it does not provide any control once a request is posted to the hardware. RDMA, which is the most prevalent form of hardware KBN in practice, is completely opaque. RDMA NICs (RNICs), whether they are using InfiniBand, RoCE, or iWARP, have fixed-function scheduling algorithms programmed in them. Like any other networking component, they also suffer from performance isolation issues when multiple applications compete for RNIC resources. Justitia is our attempt at introducing software control in hardware KBN.

Kernel-bypass networking (KBN) is becoming the new norm in modern datacenters. While hardware-based KBN offloads all dataplane tasks to specialized NICs to achieve better latency and CPU efficiency than software-based KBN, it also takes away the operator’s control over network sharing policies. Providing policy support in multi-tenant hardware KBN brings unique challenges — namely, preserving ultra-low latency and low CPU cost, finding a well-defined point of mediation, and rethinking traffic shapers. We present Justitia to address these challenges with three key design aspects: (i) Split Connection with message-level shaping, (ii) sender-based resource mediation together with receiver-side updates, and (iii) passive latency monitoring. Using a latency target as its knob, Justitia enables multi-tenancy policies such as predictable latencies and fair/weighted resource sharing. Our evaluation shows Justitia can effectively isolate latency-sensitive applications at the cost of slightly decreased utilization and ensure that throughput and bandwidth of the rest are not unfairly penalized.
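To make the Split Connection idea a bit more concrete, here is a minimal, hypothetical Python sketch of message-level shaping: a large message is split into fixed-size chunks, and a simple pacer decides when each chunk may be posted, so a latency-sensitive flow never sits behind one huge transfer. The chunk size, pacing rule, and all names below are illustrative assumptions, not Justitia’s actual implementation.

```python
# Hypothetical sketch of message-level shaping (the intuition behind Justitia's
# Split Connection): split large messages into chunks and pace how chunks are
# posted to the NIC, so small latency-sensitive messages are not stuck behind
# one huge transfer. Chunk size and pacing policy here are illustrative only.

CHUNK_BYTES = 1 << 20  # assumed 1 MiB split unit, not Justitia's real value


def split_message(message: bytes, chunk_bytes: int = CHUNK_BYTES):
    """Yield fixed-size chunks of a large message."""
    for offset in range(0, len(message), chunk_bytes):
        yield message[offset:offset + chunk_bytes]


class ChunkPacer:
    """Token-based pacer: each token admits one chunk per shaping interval."""

    def __init__(self, tokens_per_interval: int):
        self.tokens_per_interval = tokens_per_interval
        self.tokens = tokens_per_interval

    def refill(self):
        """Called once per shaping interval by a timer (omitted here)."""
        self.tokens = self.tokens_per_interval

    def try_post(self, chunk, post_fn) -> bool:
        """Post a chunk (e.g., hand it to the RNIC) if a token is available."""
        if self.tokens > 0:
            self.tokens -= 1
            post_fn(chunk)
            return True
        return False  # chunk waits for the next refill
```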

Yiwen started working on this problem when we first observed RDMA isolation issues in Infiniswap. He even wrote a short paper in KBNets 2017 based on his early findings. Yue worked on it for quite a few months before she went to Princeton for her Ph.D. Brent has been helping us get this work into shape since the beginning. It’s been a long and arduous road; every time we fixed something, new reviewers didn’t like something else. Finally, an NSDI revision allowed us to directly address the most pressing concerns. Without commenting on how much the paper has improved after all these iterations, I can say that adding revisions to NSDI has saved us, especially Yiwen, a lot of frustration. For what it’s worth, Justitia now holds the notorious distinction of being my current record for accepted-after-N-submissions; it’s been so long that I’ve lost track of the exact value of N!

FedScale Released on GitHub

Anyone working on federated learning (FL) has faced this problem at least once: you are reading two papers, and they either use very different datasets for performance evaluation, are unclear about their experimental assumptions about the runtime environment, or both. They often deal with very small datasets as well. There have been attempts at solutions too, resulting in many FL benchmarks. In the process of working on Oort, we faced the same problem(s). Unfortunately, none of the existing benchmarks fit our requirements. We had to create one on our own.

We present FedScale, a diverse set of challenging and realistic benchmark datasets to facilitate scalable, comprehensive, and reproducible federated learning (FL) research. FedScale datasets are large-scale, encompassing a diverse range of important FL tasks, such as image classification, object detection, language modeling, speech recognition, and reinforcement learning. For each dataset, we provide a unified evaluation protocol using realistic data splits and evaluation metrics. To meet the pressing need for reproducing realistic FL at scale, we have also built an efficient evaluation platform to simplify and standardize the process of FL experimental setup and model evaluation. Our evaluation platform provides flexible APIs to implement new FL algorithms and include new execution backends with minimal developer efforts. Finally, we perform in-depth benchmark experiments on these datasets. Our experiments suggest that FedScale presents significant challenges of heterogeneity-aware co-optimizations of the system and statistical efficiency under realistic FL characteristics, indicating fruitful opportunities for future research. FedScale is open-source with permissive licenses and actively maintained, and we welcome feedback and contributions from the community.
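To give a flavor of the kind of FL algorithm such a platform runs at scale, here is a minimal FedAvg-style aggregation step in plain Python. The function name and data layout are purely illustrative assumptions on my part; FedScale’s actual APIs are documented in the repository.

```python
# Minimal, hypothetical FedAvg-style aggregation step, only to illustrate the
# kind of FL algorithm a benchmark harness like FedScale executes at scale.
# Names and structure are illustrative; see the FedScale repo for its real APIs.
from typing import Dict, List, Tuple


def fedavg_aggregate(
    client_updates: List[Tuple[int, Dict[str, List[float]]]]
) -> Dict[str, List[float]]:
    """Weighted-average per-layer parameters by each client's sample count."""
    total_samples = sum(num_samples for num_samples, _ in client_updates)
    aggregated: Dict[str, List[float]] = {}
    for num_samples, params in client_updates:
        weight = num_samples / total_samples
        for layer, values in params.items():
            acc = aggregated.setdefault(layer, [0.0] * len(values))
            for i, v in enumerate(values):
                acc[i] += weight * v
    return aggregated


# Example: two clients with 10 and 30 samples contribute one-parameter "layers".
print(fedavg_aggregate([(10, {"w": [1.0]}), (30, {"w": [2.0]})]))  # {'w': [1.75]}
```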

You can read up on the details in our paper and check out the code on GitHub. Do try it out and contribute so that we can together build a large-scale benchmark that considers both data and system heterogeneity across a variety of application domains.

Fan, Yinwei, and Xiangfeng have put in a tremendous amount of work over almost two years to get to this point, and I’m super excited about its future.

AIFO Accepted to Appear at SIGCOMM’2021

Packet scheduling is a classic problem in networking. In recent years, however, the focus has somewhat shifted from designing new scheduling algorithms to designing generalized frameworks that can be programmed to approximate a variety of scheduling disciplines. Push-In First-Out (PIFO) from SIGCOMM 2016 is such a framework and has been shown to be quite expressive. Following that, a variety of solutions have attempted to make implementing PIFO more practical, with the key problem being minimizing the number of priority levels needed. SP-PIFO is a recent take that shows that having a handful of queues is enough. This leaves us with an obvious question: what’s the minimum number of queues one needs to approximate PIFO? We show that the answer is just one.

Programmable packet scheduling enables scheduling algorithms to be programmed into the data plane without changing the hardware. Existing proposals either have no hardware implementations or require multiple strict-priority queues.

We present Admission-In First-Out (AIFO) queues, a new solution for programmable packet scheduling that uses only a single first-in first-out queue. AIFO is motivated by the confluence of two recent trends: shallow buffers in switches and fast-converging congestion control in end hosts, which together lead to a simple observation: the decisive factor in a flow’s completion time (FCT) in modern datacenter networks is often which packets are enqueued or dropped, not the order in which they leave the switch. The core idea of AIFO is to maintain a sliding window to track the ranks of recent packets and to compute the relative rank of an arriving packet in the window for admission control. Theoretically, we prove that AIFO provides bounded performance relative to Push-In First-Out (PIFO). Empirically, we fully implement AIFO and evaluate it with a range of real workloads, demonstrating that AIFO closely approximates PIFO. Importantly, unlike PIFO, AIFO can run at line rate on existing hardware and uses minimal switch resources: as few as a single queue.
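The admission logic can be sketched in a few lines. The toy Python below keeps a sliding window of recent packet ranks and admits an arriving packet only when its quantile within the window fits the queue’s remaining headroom; the window size and the exact admission rule are simplifications of my own, not AIFO’s precise design.

```python
# Toy sketch of AIFO-style admission control: track the ranks of recently seen
# packets in a sliding window, compute an arriving packet's relative rank
# (quantile) within that window, and admit it only if the queue still has
# enough headroom for packets of that quantile. Parameters are illustrative.
from collections import deque


class AifoLikeAdmission:
    def __init__(self, window_size: int = 64, queue_capacity: int = 100):
        self.window = deque(maxlen=window_size)  # ranks of recent packets
        self.queue_capacity = queue_capacity
        self.queue_length = 0  # updated by the (omitted) enqueue/dequeue path

    def admit(self, rank: float) -> bool:
        """Return True if a packet with this rank should be enqueued."""
        self.window.append(rank)
        # Fraction of recent packets with a strictly smaller (better) rank.
        quantile = sum(r < rank for r in self.window) / len(self.window)
        # Remaining fraction of queue space; lower-ranked packets get priority.
        headroom = 1.0 - self.queue_length / self.queue_capacity
        return quantile <= headroom
```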

Although programmable packet scheduling has been quite popular for more than five years, I started paying careful attention only after the SP-PIFO presentation at NSDI 2020. I felt that we should be able to approximate something like that with even fewer priority classes, especially by using something similar to Foreground-Background scheduling, which needs only two priorities. Xin had been thinking about the problem even longer, given his vast experience with programmable switches, and approached me after submitting Kayak to NSDI 2021. Xin pointed out that two priorities need only one queue with an admission control mechanism in front! I’m glad he roped me in, as it’s always a pleasure working with him and Zhuolong. It seems unbelievable even to me that this is my first packet scheduling paper!

This year SIGCOMM has broken the acceptance record once again by accepting 55 out of 241 submissions into the program!

Treehouse Funded. Thanks NSF and VMware!

Treehouse, our proposal on energy-first software infrastructure design for sustainable cloud computing, has recently won three million dollars in joint funding from NSF and VMware! Here is a link to the VMware press release. SymbioticLab now has a chance to collaborate with a stellar team consisting of Tom Anderson, Adam Belay, Asaf Cidon, and Irene Zhang, and I’m super excited about the changes Treehouse will bring to how we do sustainable cloud computing in the near future.

Our team started with Asaf and me looking to go beyond memory disaggregation in early Fall of 2020. Soon we managed to convince Adam to join us, and he introduced Irene to the idea; both of them are experts in efficient software design. Finally, together we managed to get Tom to lead our team, and many ideas about an energy-first redesign of cloud computing software stacks started to sprout.

The early steps in building Treehouse are already under way. Last month, we had our inaugural mini-workshop with more than 20 attendees, and I expect it to only grow bigger in the near future. Stay tuned!