Category Archives: Recent News

Oort Accepted to Appear at OSDI’2021

Oort’s working title was Kuiper.

With the wide deployment of AI/ML in our daily lives, data privacy has been receiving more attention in recent years. Federated Learning (FL) is an emerging sub-field of machine learning that focuses on in-situ processing of data wherever it is generated. This will only become more important as regulations around data movement (e.g., GDPR, CCPA) become even more restrictive. Although there is already a large number of FL algorithms from the ML community and some FL deployments at large companies, systems support for FL is largely non-existent. Oort is our effort to build the first open-source FL system that allows FL developers to select participants for their training in an informed manner instead of selecting them at random. In the process, we have also collected the largest public dataset for FL, which we plan to open source in the near future.

Federated Learning (FL) is an emerging direction in distributed machine learning (ML) that enables in-situ model training and testing on edge data. Despite having the same end goals as traditional ML, FL executions differ significantly in scale, spanning thousands to millions of participating devices. As a result, data characteristics and device capabilities vary widely across clients. Yet, existing efforts randomly select FL participants, which leads to poor model and system efficiency.

In this paper, we propose Oort to improve the performance of federated training and testing with guided participant selection. With an aim to improve time-to-accuracy performance in model training, Oort prioritizes the use of those clients who have both data that offers the greatest utility in improving model accuracy and the capability to run training quickly. To enable FL developers to interpret their results in model testing, Oort enforces their requirements on the distribution of participant data while improving the duration of federated testing by cherry-picking clients. Our evaluation shows that, compared to existing participant selection mechanisms, Oort improves time-to-accuracy performance by 1.2X-14.1X and final model accuracy by 1.3%-9.8%, while efficiently enforcing developer-specified model testing criteria at the scale of millions of clients.
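
To make the guided-selection idea concrete, below is a minimal Python sketch of utility-based participant selection in the spirit of Oort. The client fields, the deadline-based penalty, and the exponent are illustrative assumptions rather than Oort’s exact formulation, which combines statistical and system utility with additional exploration and robustness logic.

    import math

    def client_utility(num_samples, sum_sq_loss, duration, deadline, alpha=2.0):
        """Toy utility score: statistical utility times a system-speed penalty.
        A rough sketch inspired by Oort's guided selection; not the paper's exact formula."""
        stat_utility = num_samples * math.sqrt(sum_sq_loss / num_samples)
        # Penalize clients expected to exceed the per-round deadline.
        sys_penalty = (deadline / duration) ** alpha if duration > deadline else 1.0
        return stat_utility * sys_penalty

    def select_participants(clients, k, deadline):
        """Pick the top-k clients by utility instead of sampling uniformly at random."""
        scored = sorted(
            clients,
            key=lambda c: client_utility(c["num_samples"], c["sum_sq_loss"],
                                         c["duration"], deadline),
            reverse=True,
        )
        return scored[:k]

    clients = [
        {"id": 0, "num_samples": 500, "sum_sq_loss": 120.0, "duration": 30.0},
        {"id": 1, "num_samples": 200, "sum_sq_loss": 300.0, "duration": 90.0},
        {"id": 2, "num_samples": 800, "sum_sq_loss": 80.0,  "duration": 45.0},
    ]
    print([c["id"] for c in select_participants(clients, k=2, deadline=60.0)])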

Fan and Allen have been working on Oort since the summer of 2019, and it’s been a great learning experience for me. As always, it’s been a pleasure to collaborate with Harsha, and I look forward to many more collaborations in the future. Over the past two years, many others have joined Fan in our efforts toward providing systems support for federated learning and analytics, with many exciting results at different stages of the pipeline, focusing on cloud, edge, and WAN challenges. It’s only going to become more exciting!

This is the first OSDI in an odd year as OSDI moves to a yearly cadence. Although the number of submissions is lower than in the past, that is likely only due to the late announcement; serving on my first OSDI PC, I think the quality of the submitted and accepted papers remains as high as ever. Overall, the OSDI PC accepted 31 out of 165 submissions.

Fluid Accepted to Appear at MLSys’2021

While training and inference of deep learning models have received significant attention in recent years (e.g., Tiresias, AlloX, and Salus from our group), hyperparameter tuning is often overlooked or lumped into the same bucket of optimizations as training. Existing hyperparameter tuning solutions, primarily from the ML research community, are mostly resource-agnostic. More importantly, even when they try to use up all available resources, existing solutions do not distinguish between the throughput of a GPU (how much work a GPU is doing) and its goodput (how much of that work is ultimately useful) during hyperparameter tuning. Fluid is our attempt at bridging the gap between hyperparameter tuning algorithms and the underlying cluster resources by improving both intra- and inter-GPU goodput in large clusters.

Current hyperparameter tuning solutions lack complementary execution engines to efficiently leverage distributed computation, thus ignoring the possibility of intra- and inter-GPU sharing and leading to poor resource usage. In this paper, we present FluidExec, a generalized hyperparameter tuning execution engine that coordinates between hyperparameter tuning jobs and cluster resources. FluidExec schedules evaluation trials in such jobs using a water-filling approach to make the best use of resources at both intra- and inter-GPU granularities to speed up the tuning process. By abstracting a hyperparameter tuning job as a sequence of TrialGroups, FluidExec can boost the performance of diverse hyperparameter tuning solutions. Our experiments show that FluidExec can speed up synchronous BOHB by 200%, and BOHB and ASHA by 30%, while achieving similar final accuracy.
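
As a rough illustration of the water-filling idea, the sketch below repeatedly tops up GPU shares across the trials of a group until either every trial hits its demand or the GPU budget runs out. The demand caps and the fractional-GPU interpretation are assumptions for illustration; this is not FluidExec’s actual scheduling algorithm.

    def water_fill(demands, capacity):
        """Distribute `capacity` GPUs across trials with per-trial demand caps
        by raising a common 'water level' (classic water-filling)."""
        alloc = {t: 0.0 for t in demands}
        remaining = capacity
        active = set(demands)
        while remaining > 1e-9 and active:
            share = remaining / len(active)
            for t in list(active):
                give = min(share, demands[t] - alloc[t])
                alloc[t] += give
                remaining -= give
                if demands[t] - alloc[t] < 1e-9:
                    active.discard(t)
        return alloc

    # Four trials that can each productively use at most this many GPUs.
    demands = {"trial-0": 4, "trial-1": 1, "trial-2": 2, "trial-3": 8}
    print(water_fill(demands, capacity=8))
    # Allocations below 1 would map to intra-GPU sharing (packing several small
    # trials onto one GPU); allocations above 1 map to inter-GPU distributed
    # training for a single trial.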

Fluid is a joint project between Peifeng and Jiachen, which started right after Salus and before Jiachen started her Ph.D.! I’m super excited about many future works in the Systems + AI area from SymbioticLab members.

Kayak Accepted to Appear at NSDI’2021

As memory disaggregation, and resource disaggregation in general, becomes popular, one must make a call about whether to keep moving data from remote memory or to sometimes ship compute to the remote data instead. This is not a new problem, nor is it unique to disaggregated datacenters. The notion of data locality and its associated challenges are rooted in the same observation. As the ratio between the compute and communication needs of an application changes, and as the speed of the network changes over time, the answer has changed many times. Oftentimes, the solution boils down to understanding workload and network characteristics and making a call: ship compute or ship data for that workload on that network. But what if the workload is dynamic? Kayak dynamically decides between the two options, making the right call at the right time.

How cloud applications should interact with their data remains an active area of research. Over the last decade, many have suggested relying on a key-value (KV) interface to interact with data stored in remote storage servers, while others have vouched for the benefits of using remote procedure call (RPC). Instead of choosing one over the other, in this paper, we observe that an ideal solution must adaptively combine both of them in order to maximize throughput while meeting application latency requirements. To this end, we propose a new system called Kayak that proactively adjusts the rate of requests and the fraction of requests to be executed using RPC or KV, all in a fully decentralized and self-regulated manner. We theoretically prove that Kayak can quickly converge to the optimal parameters. We implement a system prototype of Kayak. Our evaluations show that Kayak achieves sub-second convergence and improves overall throughput by 32.5%-63.4% for compute-intensive workloads and up to 12.2% for non-compute-intensive and transactional workloads over the state-of-the-art.
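
To give a flavor of the adaptive split, here is a toy hill-climbing controller that nudges the fraction of requests executed as RPC (ship compute) versus KV (ship data) in whichever direction improves observed throughput. The class, step size, and measurement loop are hypothetical; Kayak’s actual controller also regulates the request rate and comes with convergence guarantees.

    import random

    class AdaptiveSplit:
        """Hill-climbing sketch: adjust the RPC fraction toward higher observed
        throughput. A toy stand-in for Kayak's self-regulated controller."""

        def __init__(self, rpc_fraction=0.5, step=0.05):
            self.rpc_fraction = rpc_fraction
            self.step = step
            self.prev_throughput = None
            self.direction = 1.0

        def route(self):
            """Decide per request whether to ship compute (RPC) or ship data (KV)."""
            return "RPC" if random.random() < self.rpc_fraction else "KV"

        def update(self, measured_throughput):
            """Called once per measurement window with the observed throughput."""
            if self.prev_throughput is not None and measured_throughput < self.prev_throughput:
                self.direction = -self.direction  # last move hurt; reverse course
            self.rpc_fraction = min(1.0, max(0.0, self.rpc_fraction + self.direction * self.step))
            self.prev_throughput = measured_throughput

    controller = AdaptiveSplit()
    for window in range(5):
        # In a real system the throughput would be measured; here it is simulated.
        controller.update(measured_throughput=1000 + 50 * window)
        print(round(controller.rpc_fraction, 2), controller.route())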

The core idea behind Kayak is all due to Jie (Jimmy), who started the project in early 2019. The project picked up significant tailwind when Xin and his student Jingfeng joined us in the summer of 2020 to help us better understand the problem from an analytical perspective too, in addition to better positioning the work. I’m super excited about the possibilities and Kayak’s potential impact on resource disaggregation systems.

The NSDI PC this year accepted 40 out of 255 submissions from the Fall deadline, resulting in a lower acceptance rate than last year.

Presented Keynote Talk at CloudNet’2020

Earlier this week, I presented a keynote talk on the state of network-informed data systems design at the CloudNet’2020 conference, with a specific focus on our recent works on memory disaggregation (Infiniswap, Leap, and NetLock), and discussed the many open challenges toward making memory disaggregation practical.

In this talk, I discussed the motivation behind disaggregating memory, or any other expensive resource for that matter. High-performance data systems strive to keep data in main memory. They often over-provision to avoid running out of memory, leading to a 50% average underutilization in Google, Facebook, and Alibaba datacenters. The root cause is simple: applications today cannot access otherwise unused memory beyond their machine boundaries, even when their performance grinds to a halt. But could they? Over the course of the last five years, our research at the SymbioticLab has addressed, and continues to address, this fundamental question of memory disaggregation, whereby an application can use both local and remote memory by leveraging emerging high-speed networks.

I also highlighted at least eight major challenges any memory disaggregation solution must address to even have a shot at becoming practical and widely used. These include applicability to a large variety of applications without requiring any application-level changes; scalability to large datacenters; efficiency in using up all available memory; high performance when using disaggregated memory; performance isolation from other datacenter traffic; resilience in the presence of failures and unavailability; security from others; and generality to a variety of memory technologies beyond DRAM. While this may come across as a laundry list of problems, we do believe that a complete solution must address each one of them.

In this context, I discussed three projects: Infiniswap, which achieves applicability using remote memory paging, and scalability and efficiency using decentralized algorithms; Leap, which improves performance by prefetching; and NetLock, which shows how to disaggregate programmable switch memory. I also pointed out a variety of ongoing projects toward our ultimate goal of unified, practical memory disaggregation.
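
To give a taste of the prefetching idea, here is a simplified majority-trend detector over a window of recent page accesses: if most recent strides agree, prefetch a few pages along that stride. This is only a sketch in the spirit of Leap’s majority-based prefetcher; the window size, prefetch depth, and function names are assumptions.

    from collections import Counter

    def detect_trend(accesses, window=8):
        """Return the majority stride among the last `window` page accesses, or None."""
        recent = accesses[-window:]
        if len(recent) < 2:
            return None
        strides = [b - a for a, b in zip(recent, recent[1:])]
        stride, count = Counter(strides).most_common(1)[0]
        # Require a strict majority before trusting the trend.
        return stride if count > len(strides) // 2 else None

    def prefetch_candidates(accesses, depth=4):
        """Pages to prefetch beyond the last access if a trend is detected."""
        stride = detect_trend(accesses)
        if stride is None or stride == 0:
            return []
        last = accesses[-1]
        return [last + stride * i for i in range(1, depth + 1)]

    # Mostly-sequential access pattern with a little noise.
    history = [100, 101, 102, 7, 103, 104, 105, 106]
    print(prefetch_candidates(history))  # [107, 108, 109, 110]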

My slides from this talk are publicly available and have more details elaborating these points.

Presented Keynote Speech at HotEdgeVideo’2020

Earlier this week, I presented a keynote speech on the state of resource management for deep learning at the HotEdgeVideo’2020 workshop, covering our recent works on systems support for AI (Tiresias, AlloX, and Salus) and discussing open challenges in this space.

In this talk, I highlighted three aspects of resource management in the context of deep learning that I think make this setting unique even after decades of resource management research on CPUs, networks, and even Big Data clusters.

  1. Short-term predictability and long-term uncertainty of deep learning workloads: Deep learning workloads (hyperparameter tuning, training, and inference) may have different objectives, but they all share a common trend. Although we cannot know how many iterations there will be in a job or how many requests will come to a model for inference, each iteration/request performs the same computation for a given job. This means we can profile a job and then exploit that information for long-term benefits by adapting classic information-agnostic and information-limited scheduling techniques (see the sketch after this list).
  2. Heterogeneous, interchangeable compute devices in deep learning clusters: Deep learning clusters are becoming increasingly diverse, with many generations of GPUs and new hardware accelerators coming out every month. The key resource management challenge here is that all these compute devices are interchangeable (they all can compute), but they don’t compute at the same rate for all models. Some models are better suited to CPUs, some to GPUs, some to TPUs, and so on. We need to rethink resource management algorithms to account for resource interchangeability.
  3. Black-box hardware accelerators: Deep learning hardware is also a black box. Even for GPUs, we do not have any control over their internals; apart from some high-level information that is publicly available, we don’t know anything about what happens inside. For newer, vendor-locked accelerators, details are even scarcer. Consequently, resource management solutions should be designed to assume black-box hardware from the get-go and then rely on profiling (by leveraging the iterative nature of deep learning) and short-term predictability to extract good performance using software techniques.
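
To illustrate the first point, the sketch below profiles a few iterations of a job and uses the measured per-iteration time to rank jobs by estimated remaining service time. The job structure and the remaining_iters field are simplifying assumptions (Tiresias, notably, avoids needing the latter); the point is only that stable per-iteration times make such estimates possible.

    import time

    def profile_iteration_time(run_one_iteration, warmup=2, samples=5):
        """Measure steady-state time per iteration by timing a few iterations.
        Works because every iteration of a given job performs the same computation."""
        for _ in range(warmup):
            run_one_iteration()
        start = time.perf_counter()
        for _ in range(samples):
            run_one_iteration()
        return (time.perf_counter() - start) / samples

    def schedule_by_remaining_time(jobs):
        """Order jobs by estimated remaining GPU time (shortest first)."""
        return sorted(jobs, key=lambda j: j["iter_time"] * j["remaining_iters"])

    # A dummy "iteration" standing in for one training step.
    iter_time = profile_iteration_time(lambda: sum(i * i for i in range(10000)))
    jobs = [
        {"name": "job-a", "iter_time": iter_time, "remaining_iters": 5000},
        {"name": "job-b", "iter_time": iter_time * 3, "remaining_iters": 800},
    ]
    print([j["name"] for j in schedule_by_remaining_time(jobs)])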

My slides from this talk are publicly available and have more details elaborating these points.

Thanks Google for Supporting Our Research

We have been doing a lot of work on systems for AI in recent years and, historically, have done some work on application-aware networking. This support will help us combine the two directions to provide datacenter network support for AI workloads, both at the edge and inside the network.

Many thanks to Nandita Dukkipati for championing our cause. We’ve been trying to find a good project to collaborate on since 2016 when we had a chat at a Dagstuhl workshop!

Leap Wins the Best Paper Award at ATC’2020. Congrats Hasan!

Leap, the fastest memory disaggregation system to date, has won a best paper award at this year’s USENIX ATC conference!

This is a happy outcome of Hasan’s persistence on this project for more than two years. From coming up with the core idea to executing it well, Hasan has done a fantastic job so far; I expect more great outcomes from his ongoing work.

Leap is open-source and available at https://github.com/symbioticlab/leap.

NetLock Accepted to Appear at SIGCOMM’2020

High-throughput, low-latency lock managers are useful for building a variety of distributed applications. Traditionally, a key tradeoff in this context has been expressed in terms of the amount of knowledge available to the lock manager. On the one hand, a decentralized lock manager can increase throughput through parallelization, but it can starve certain categories of applications. On the other hand, a centralized lock manager can avoid starvation and impose resource-sharing policies, but it can be limited in throughput. In SIGMOD’18, we presented DSLR, which attempted to mitigate this tradeoff in clusters with fast RDMA networks by adapting Lamport’s bakery algorithm to RDMA’s fetch-and-add (FA) operations to design a decentralized solution. The downside is that we couldn’t implement complex policies that need centralized information.
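
The flavor of that decentralized design can be seen in the ticket-style sketch below: each client atomically fetch-and-adds a shared next-ticket counter and waits for its turn, with no central lock server making decisions. This is a local, threaded stand-in for what DSLR does with one-sided RDMA fetch-and-add on the NIC; it omits DSLR’s lease and failure-handling machinery.

    import itertools
    import threading
    import time

    class TicketLock:
        """Bakery/ticket lock built on a fetch-and-add style counter.
        Stand-in for DSLR's RDMA FA-based design; here FA is emulated locally."""

        def __init__(self):
            self._next_ticket = itertools.count()   # atomic-enough FA under CPython's GIL
            self._now_serving = 0

        def acquire(self):
            my_ticket = next(self._next_ticket)     # fetch-and-add(next_ticket, 1)
            while self._now_serving != my_ticket:   # spin until it is our turn
                time.sleep(0)                       # yield instead of burning CPU
            return my_ticket

        def release(self):
            self._now_serving += 1                  # hand the lock to the next ticket

    lock, order = TicketLock(), []

    def worker(name):
        lock.acquire()
        order.append(name)                          # critical section
        lock.release()

    threads = [threading.Thread(target=worker, args=(f"t{i}",)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(order)                                    # FIFO order of lock acquisition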

What if we could have a high-speed centralized point that all remote traffic must go through anyway? NetLock is our attempt at doing just that: implementing a centralized lock manager in a programmable switch working in tandem with the servers. The co-design is important to get around the resource limitations of the switch. By carefully caching hot locks in the switch and moving warm and cold ones to the servers, we can meet both the performance and policy goals of a lock manager without significant compromise in either.
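
A cartoon version of that placement decision: given a small switch memory budget, keep the most frequently requested (hot) locks in the switch and leave the rest to the servers. The request-count input and the top-k policy below are illustrative assumptions, not NetLock’s actual memory management mechanism.

    def place_locks(request_counts, switch_capacity):
        """Split locks between switch and servers by popularity.
        `request_counts` maps lock-id -> observed request count."""
        ranked = sorted(request_counts, key=request_counts.get, reverse=True)
        in_switch = set(ranked[:switch_capacity])   # hot locks served in the data plane
        on_servers = set(ranked[switch_capacity:])  # warm/cold locks handled by servers
        return in_switch, on_servers

    counts = {"lock-a": 9000, "lock-b": 120, "lock-c": 4500, "lock-d": 15}
    hot, cold = place_locks(counts, switch_capacity=2)
    print(sorted(hot), sorted(cold))  # ['lock-a', 'lock-c'] ['lock-b', 'lock-d']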

Lock managers are widely used by distributed systems. Traditional centralized lock managers can easily support policies between multiple users using global knowledge, but they suffer from low performance. In contrast, emerging decentralized approaches are faster but cannot provide flexible policy support. Furthermore, performance in both cases is limited by the server capability.

We present NetLock, a new centralized lock manager that co-designs servers and network switches to achieve high performance without sacrificing flexibility in policy support. The key idea of NetLock is to exploit the capability of emerging programmable switches to directly process lock requests in the switch data plane. Due to the limited switch memory, we design a memory management mechanism to seamlessly integrate the switch and server memory. To realize the locking functionality in the switch, we design a custom data plane module that efficiently pools multiple register arrays together to maximize memory utilization. We have implemented a NetLock prototype with a Barefoot Tofino switch and a cluster of commodity servers. Evaluation results show that NetLock improves the throughput by 14.0–18.4×, and reduces the average and 99% latency by 4.7–20.3× and 10.4–18.7× over DSLR, a state-of-the-art RDMA-based solution, while providing flexible policy support.

Xin and I came up with the idea of this project over a couple of meals in San Diego at OSDI’18, and later Zhuolong and Yiwen expanded and successfully executed our ideas, which led to NetLock. Similar to DSLR, NetLock explores a different design point in our larger memory disaggregation vision.

This year’s SIGCOMM probably has the highest acceptance rate in 25 years, if not longer. After a long, successful run at SIGCOMM and a short break doing many other exciting things, it’s great to be back to some networking research! Going forward, I’m hoping for much more along these lines, both inside the network and at the edge.