Category Archives: Recent News

Leap Accepted to Appear at ATC’2020

Since our pioneering work on Infiniswap, which attempted to make memory disaggregation practical, there have been quite a few proposals to use different application-level interfaces to remote memory over RDMA. A common issue faced by all these approaches is the high overhead of existing kernel data paths, whether they use the swapping subsystem or the file system. Moreover, we observed that even if such latencies could be resolved by short-circuiting the kernel, the latency of RDMA itself is still significantly higher than that of local memory access. By early 2018, we came to the conclusion that instead of holding our breath for those latencies to come closer, we should invest in designing an algorithm to prefetch data even before an application needs a remote page. It had to be extremely fast and had to deal with a variety of cloud applications, so we found complex learning-based approaches unsuitable. Leap is our simple-yet-effective answer to all these challenges.

Memory disaggregation over RDMA can improve the performance of memory-constrained applications by replacing disk swapping with remote memory accesses. However, state-of-the-art memory disaggregation solutions still use data path components designed for slow disks. As a result, applications experience remote memory access latency significantly higher than that of the underlying low-latency network, which itself is too high for many applications.

In this paper, we propose Leap, a prefetching solution for remote memory accesses due to memory disaggregation. At its core, Leap employs an online, majority-based prefetching algorithm, which increases the page cache hit rate. We complement it with a lightweight and efficient data path in the kernel that isolates each application’s data path to the disaggregated memory and mitigates latency bottlenecks arising from legacy throughput-optimizing operations. Integration of Leap in the Linux kernel improves the median and tail remote page access latencies of memory-bound applications by up to 104.04× and 22.62×, respectively, over the default data path. This leads to up to 10.16× performance improvements for applications using disaggregated memory in comparison to the state-of-the-art solutions.
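For readers who want a feel for the core idea, here is a minimal Python sketch of majority-vote-based stride detection for prefetching. The window size, prefetch depth, and the access pattern are illustrative assumptions; the actual Leap implementation lives inside the Linux kernel data path and handles many more details.

```python
# Minimal sketch of majority-vote stride detection for prefetching.
# Window size and prefetch depth are illustrative assumptions, not
# the values used in Leap's kernel implementation.

def detect_majority_stride(access_history, window=8):
    """Return the majority stride among recent page accesses, if one exists."""
    recent = access_history[-(window + 1):]
    strides = [b - a for a, b in zip(recent, recent[1:])]
    if not strides:
        return None
    # Boyer-Moore majority vote over the stride sequence.
    candidate, count = None, 0
    for s in strides:
        if count == 0:
            candidate, count = s, 1
        elif s == candidate:
            count += 1
        else:
            count -= 1
    # Verify the candidate actually holds a majority.
    if strides.count(candidate) * 2 > len(strides):
        return candidate
    return None

def pages_to_prefetch(access_history, depth=4):
    """Suggest pages to prefetch along the detected trend; empty if no trend."""
    stride = detect_majority_stride(access_history)
    if stride is None or stride == 0:
        return []
    last = access_history[-1]
    return [last + stride * i for i in range(1, depth + 1)]

# Example: a mostly sequential access pattern with some noise.
history = [100, 101, 102, 250, 103, 104, 105, 106]
print(pages_to_prefetch(history))  # -> [107, 108, 109, 110]
```

The appeal of a majority vote over, say, a learned model is that it is constant-time per access and robust to a few noisy accesses mixed into an otherwise regular pattern.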

The majority-based prefetching algorithm at the core of Leap is the brainchild of Hasan. Leap is not limited to remote or disaggregated memory either; it can be used in any slow storage-fast storage combination. This is Hasan’s first project, and I’m glad to see it eventually become a success. It’s my first paper at ATC too.

This year USENIX ATC accepted 65 out of 348 papers, which is pretty competitive.

AlloX Accepted to Appear at EuroSys’2020

While GPUs are always in the news when it comes to deep learning clusters (e.g., Salus or Tiresias), we are in the midst of an emergence of many more computing devices (e.g., FPGAs and problem-specific accelerators) alongside traditional CPUs. All of them are compute devices, but one cannot expect the same speed from all of them for all types of computations. A natural question, therefore, is how to allocate such interchangeable resources in a hybrid cluster. AlloX is our attempt at a reasonable answer.

Modern deep learning frameworks support a variety of hardware, including CPU, GPU, and other accelerators, to perform computation. In this paper, we study how to schedule jobs over such interchangeable resources – each with a different rate of computation – to optimize performance while providing fairness among users in a shared cluster. We demonstrate theoretically and empirically that existing solutions and their straightforward modifications perform poorly in the presence of interchangeable resources, which motivates the design and implementation of AlloX. At its core, AlloX transforms the scheduling problem into a min-cost bipartite matching problem and provides dynamic fair allocation over time. We theoretically prove its optimality in an ideal, offline setting and show empirically that it works well in the online scenario by integrating it with Kubernetes. Evaluations on a small-scale CPU-GPU hybrid cluster and large-scale simulations highlight that AlloX can reduce the average job completion time significantly (by up to 95% when the system load is high) while providing fairness and preventing starvation.
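To make the matching formulation concrete, here is a small sketch of assigning jobs to (device, position) slots so that the sum of completion times is minimized, using SciPy’s linear_sum_assignment. The job names, processing-time estimates, and position-based cost construction are illustrative assumptions; the actual system additionally handles online arrivals, fairness over time, and estimation.

```python
# Sketch of the min-cost bipartite matching view of scheduling over
# interchangeable devices: a job placed k-th from the end of a device's
# queue contributes k * processing_time to the total completion time.
# Processing times and device names are made-up numbers for illustration.
import numpy as np
from scipy.optimize import linear_sum_assignment

# Estimated processing time of each job on each device type (seconds).
proc_time = {
    "job0": {"cpu": 40.0, "gpu": 10.0},
    "job1": {"cpu": 12.0, "gpu": 8.0},
    "job2": {"cpu": 30.0, "gpu": 5.0},
}
devices = ["cpu", "gpu"]
jobs = list(proc_time)

# Columns are (device, position-from-the-end) slots.
slots = [(d, k) for d in devices for k in range(1, len(jobs) + 1)]
cost = np.array([[k * proc_time[j][d] for (d, k) in slots] for j in jobs])

row, col = linear_sum_assignment(cost)  # min-cost matching over jobs
for r, c in zip(row, col):
    d, k = slots[c]
    print(f"{jobs[r]} -> {d}, position {k} from the end")
print("total completion time:", cost[row, col].sum())
```

Because a GPU-friendly job may still be better off on a CPU when the GPU queue is long, the matching naturally trades off device speed against queueing position.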

AlloX has been in the oven for more than two years, and it is a testament to Tan and Xiao’s tenacity. It’s also my second paper with Zhenhua after HUG, our first joint work based on our recent NSF project, and my first paper at EuroSys. I believe the results in this paper will give rise to new analyses of many systems that have interchangeable resources, such as DRAM-NVM hybrid systems or the storage/caching hierarchy.

This year the EuroSys PC accepted 43 out of 234 submissions.

Salus Accepted to Appear at MLSys'2020

With the rise of deep learning, the popularity of GPUs has increased in recent years. Modern GPUs are extremely powerful and have a lot of resources. A key challenge in this context is making sure that these devices are highly utilized. Although there has been a lot of research on improving GPU efficiency at the cluster level (e.g., our Tiresias in NSDI’19), little is known about how well individual GPUs are utilized today. Worse, even if they are underutilized, little can be done because GPUs are opaque black boxes without any primitives for sharing them. Existing mechanisms for GPU sharing, such as NVIDIA MPS, are coarse-grained and cannot leverage application-specific information. Salus is our foray into the GPU sharing domain; it provides two key sharing primitives that allow one to develop a variety of algorithms and improve GPU efficiency for training, inference, and hyperparameter tuning workloads.

Unlike traditional resources such as CPU or the network, modern GPUs do not natively support fine-grained sharing primitives. Consequently, implementing common policies such as time-sharing and preemption is expensive. Worse, when a deep learning (DL) application cannot completely use a GPU’s resources, the GPU cannot be efficiently shared between multiple applications, leading to GPU underutilization.

We present Salus to enable two GPU sharing primitives: fast job switching and memory sharing, to achieve fine-grained GPU sharing among multiple DL applications. Salus is an efficient, consolidated execution service that exposes a GPU to different DL applications, and enforces fine-grained sharing by performing iteration scheduling and addressing associated memory management issues. We show that these primitives can then be used to implement flexible sharing policies. Our integration of Salus with TensorFlow and evaluation on popular DL jobs shows that Salus can improve the average completion time of DL training jobs by 3.19X, GPU utilization for hyper-parameter tuning by 2.38X, and GPU utilization of DL inference applications by 42X over not sharing the GPU and 7X over NVIDIA MPS with small overhead.
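As a rough illustration of the fast job switching primitive, below is a toy sketch of iteration-granularity time-sharing: jobs hand over the GPU at iteration boundaries, and a pluggable policy picks who runs next. The jobs, iteration counts, and round-robin policy are illustrative assumptions; Salus itself performs the switching inside the DL framework’s execution service and also manages GPU memory sharing, which this sketch ignores.

```python
# Toy sketch of iteration-granularity GPU time-sharing: each training job
# yields at iteration boundaries, and a scheduler (round-robin here, but
# any policy could be plugged in) decides who gets the GPU next.
from collections import deque

def training_job(name, iterations):
    """Each yield marks an iteration boundary where the GPU can be handed over."""
    for i in range(iterations):
        # ... one forward/backward pass would run on the GPU here ...
        yield f"{name}: finished iteration {i + 1}/{iterations}"

def iteration_scheduler(jobs):
    """Round-robin over jobs at iteration boundaries (a pluggable policy)."""
    queue = deque(jobs)
    while queue:
        job = queue.popleft()
        try:
            print(next(job))
            queue.append(job)      # job still has iterations left; requeue it
        except StopIteration:
            pass                   # job completed; drop it

iteration_scheduler([training_job("resnet", 3), training_job("lstm", 2)])
```

Switching at iteration boundaries is what makes preemption cheap: the scheduler never has to checkpoint a job mid-kernel, only between iterations.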

Salus has long been in the making and is my first project in systems for AI and GPU resource management. Peifeng has been diligently working on it since 2017! While it took a long time, I’m excited that it has found a great home, and I look forward to building on top of it. This is Peifeng’s first major paper, and the future is even brighter.

This year’s MLSys has 34 accepted papers and remains as highly competitive as its previous iteration.

Sol and Pando Accepted to Appear at NSDI'2020

With the advent of edge analytics and federated learning, the need for distributed computation and storage is only going to increase in coming years. Unfortunately, existing solutions for analytics and machine learning have focused primarily on datacenter environments. When these solutions are applied to wide-area scenarios, their compute efficiency decreases and storage overhead increases. Neither is suitable for pervasive storage and computation throughout the globe. In this iteration of NSDI, we have two papers to address the compute and storage aspects of emerging wide-area computing workloads.

Sol

Sol is a federated execution engine that can execute low-latency computation across a variety of network conditions. The key insight here is that modern execution engines (e.g., those used by Apache Spark or TensorFlow) have implicit assumptions about low-latency and high-bandwidth networks. Consequently, in poor network conditions, the overhead of coordinating work outweighs the work they are trying to execute. The end result, interestingly enough, is CPU underutilization, because workers spend more time waiting for new work to be assigned by the centralized coordinator/master than doing the work. Our solution is API-compatible with Apache Spark so that any existing jobs (SQL, ML, or Streaming) can run on Sol with significant performance improvements in edge computing and federated learning scenarios.

The popularity of big data and AI has led to many optimizations at different layers of distributed computation stacks. Despite – or perhaps, because of – its role as the narrow waist of such software stacks, the design of the execution engine, which is in charge of executing every single task of a job, has mostly remained unchanged. As a result, the execution engines available today are ones primarily designed for low latency and high bandwidth datacenter networks. When either or both of the network assumptions do not hold, CPUs are significantly underutilized.

In this paper, we take a first-principles approach toward developing an execution engine that can adapt to diverse network conditions. Sol, our federated execution engine architecture, flips the status quo in two respects. First, to mitigate the impact of high latency, Sol proactively assigns tasks, but does so judiciously to be resilient to uncertainties. Second, to improve the overall resource utilization, Sol decouples communication from computation internally instead of committing resources to both aspects of a task simultaneously. Our evaluations on EC2 show that, compared to Apache Spark in resource-constrained networks, Sol improves SQL and machine learning jobs by 16.4× and 4.2× on average.
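A simple back-of-the-envelope sketch shows why early-binding helps: if the coordinator assigns the next task only after hearing back about the previous one, every task pays a full round trip of idle time, whereas keeping a few tasks in flight per worker hides that latency. The task duration and RTT values below are illustrative numbers, not measurements from Sol.

```python
# Back-of-the-envelope sketch of reactive vs. pipelined (early-binding)
# task assignment over high-latency links. Numbers are illustrative.
import math

def reactive_utilization(task_ms, rtt_ms):
    """Worker utilization when tasks are assigned one at a time."""
    return task_ms / (task_ms + rtt_ms)

def pipeline_depth_for_full_utilization(task_ms, rtt_ms):
    """Tasks to keep in flight so the worker never waits on the coordinator."""
    return math.ceil(rtt_ms / task_ms) + 1

task = 50                            # per-task compute time (ms)
for rtt in (0.1, 10, 100):           # datacenter vs. wide-area round trips (ms)
    print(f"RTT {rtt:>6} ms: reactive utilization = "
          f"{reactive_utilization(task, rtt):.2f}, "
          f"in-flight tasks needed = {pipeline_depth_for_full_utilization(task, rtt)}")
```

The catch, of course, is that assigning tasks before their inputs are known introduces uncertainty, which is why Sol assigns proactively but judiciously rather than blindly flooding workers.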

This is Fan’s first major paper, and I’m very proud to see him open his book. I would like to thank Jimmy and Allen for their support in getting it done and Harsha for his enormous amount of time and efforts to make Sol successful. I believe Sol will have significant impact on the emerging fields of edge analytics and federated learning.

Pando

Pando started with a simple idea when Harsha and I were discussing how to apply erasure coding to mutable data. It ended up being so much more. Not only have we designed an erasure-coded state machine, we have also identified the theoretical tradeoff limits for read latency, write latency, and storage overhead. In the process, we show that erasure coding is the way to get closer to the limits that replication-based systems cannot reach. Moreover, Pando can dynamically switch between both depending on the goals it has to achieve.

By replicating data across sites in multiple geographic regions, web services can maximize availability and minimize latency for their users. However, when sacrificing data consistency is not an option, we show that service providers today have to incur significantly higher cost to meet desired latency goals than the lowest cost theoretically feasible. We show that the key to addressing this sub-optimality is to 1) allow for erasure coding, not just replication, of data across data centers, and 2) mitigate the resultant increase in read and write latencies by rethinking how to enable consensus across the wide-area network. Our extensive evaluation mimicking web service deployments on the Azure cloud service shows that we enable near-optimal latency versus cost tradeoffs.
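For intuition on why erasure coding changes the cost equation, the sketch below contrasts the storage overhead of replication with a (k data + m parity) code and shows a toy single-parity code that rebuilds one lost fragment by XOR. The parameters and the XOR code are illustrative assumptions; Pando uses proper erasure codes with a consensus protocol layered on top, which this sketch does not attempt to capture.

```python
# Toy sketch: storage overhead of replication vs. erasure coding, plus a
# single-parity (XOR) code that tolerates the loss of one fragment.
from functools import reduce

def storage_overhead(k, m):
    """A (k data + m parity) code stores (k+m)/k bytes per byte of data."""
    return (k + m) / k

print("3-way replication overhead:", storage_overhead(1, 2))     # 3.0x
print("(4+2) erasure coding overhead:", storage_overhead(4, 2))  # 1.5x

def encode_single_parity(data: bytes, k: int):
    """Split data into k equal fragments plus one XOR parity fragment."""
    frag_len = -(-len(data) // k)  # ceiling division
    frags = [data[i * frag_len:(i + 1) * frag_len].ljust(frag_len, b"\0")
             for i in range(k)]
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*frags))
    return frags + [parity]

def reconstruct_fragment(frags, lost_index):
    """Rebuild one missing fragment by XOR-ing all the surviving ones."""
    survivors = [f for i, f in enumerate(frags) if i != lost_index]
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*survivors))

value = b"geo-replicated configuration state"
pieces = encode_single_parity(value, k=4)
rebuilt = reconstruct_fragment(pieces, lost_index=2)
assert rebuilt == pieces[2]   # the lost data fragment is recovered exactly
```

The flip side is that a read now needs k fragments instead of one replica, which is exactly the read/write latency increase that Pando’s rethinking of wide-area consensus has to mitigate.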

While Muhammed is advised by Harsha, I have had the opportunity to work with him since 2016 (first on a project that failed; on this one, since 2017). Many of the intricacies of the protocol are outside my expertise, but I learned a lot from Harsha and Muhammed. I’m also glad to see that our original idea of mutable erasure-coded data has come to fruition in a much stronger form than what Harsha and I devised as the summer of 2017 was coming to an end. Btw, Pando now holds the notorious distinction of being my record for accepted-after-N-submissions; fifth time’s the charm!

The NSDI PC this year accepted 48 out of 275 submissions for the Fall deadline, increasing the acceptance rate from last year’s poor showing.

Co-Organized NSF Workshop on Next-Gen Cloud Research Infrastructure

Earlier this week Jack Brassil and I co-organized an NSF-supported workshop on next-generation cloud research infrastructure (RI) in Princeton, NJ. The focus of the workshop was the role of the cloud in research and education, how needs are changing, and how cloud infrastructure should evolve to keep up with those changing needs. We had about 60 experts who build and manage a variety of research infrastructure and use that infrastructure for a variety of research activities beyond CS. It was an eye-opening experience for me, especially given that, over the years, I have used many of the testbeds built and managed by our attendees without ever meeting or knowing them.

The workshop itself was divided into multiple plenary and breakout sessions, where we heard from cloud RI operators, commercial cloud operators, and cloud users in both CS and basic science domains. We also discussed the state of the art in cloud RI, the needs of existing and future experiments, and how to effectively fulfill those needs by combining RI and commercial clouds. Naturally, wide-area and federated computing as well as advances in hardware (e.g., accelerators, disaggregation) were at the forefront of our discussions. We also spent quite some time discussing the role of the cloud in education, how to make cloud computing accessible to under-served minorities via federally funded testbeds, and the role of the cloud for institutions focusing on those communities. It was a very different crowd than I am used to, but a great experience overall.

I’d like to thank Jack for inviting me to co-organize with him, Srini Seshan (CMU) and Terry Benzel (USC) for steering our efforts, NSF and especially Deep Medhi for supporting this workshop, Jen Rexford for supporting us, and of course, all our attendees who traveled from far-flung places to make it successful. Soon, we will have a report summarizing our findings from this workshop, which I hope will help guide the future agenda in cloud research infrastructure, both public and commercial.

Joint Award With CMU on Distributed Storage. Thanks NSF!

This project aims to build on our past and ongoing work with Rashmi Vinayak (CMU) and Harsha Madhyastha (Michigan) to address optimal performance-cost tradeoffs in distributed storage. It’s always fun to have the opportunity to work with great friends and colleagues.

I’m very grateful to NSF and the broader research community for their great support and show of confidence in our research.

Received VMware Early Career Faculty Award!

A few weeks ago, I received a cold email from VMware Research’s Irina Calciu with this great news! The award will support our ongoing research on memory disaggregation. VMware is doing some cool work in this space as well, and I look forward to collaborating with them in the near future.

I’d like to thank VMware for their vote of confidence and support, as well as my students whose hard work this award recognizes. We have several good results in the pipeline, and hopefully we’ll get to share them with the world soon.

Near Optimal Coflow Scheduling Accepted at SPAA’2019

Since the inception of coflow in 2012, the abstraction and the body of work surrounding it have been growing at a fast pace. In addition to systems building, we have seen a rise in theoretical analyses of the coflow scheduling problem. One of the most recent works to this end even received the Best Student Paper Award at SIGCOMM’2018.

Today I want to share a recent development in the theoretical analysis of coflow scheduling in the context of general graphs. While datacenter networks can be abstracted as a non-blocking switch, under failures and load imbalance this simplistic model no longer holds. More importantly, inter-datacenter wide-area networks (WANs) are never non-blocking anyway. Naturally, we need a new analytical framework that allows for a broader variety of network topologies. In our recent work with colleagues at the University of Maryland and Google, we provide a randomized 2-approximation algorithm for the single-path and free-path models of coflow scheduling over the WAN, significantly improving over the state of the art.

The Coflow scheduling problem has emerged as a popular abstraction in the last few years to study data communication problems within a data center [5]. In this basic framework, each coflow has a set of communication demands and the goal is to schedule many coflows in a manner that minimizes the total weighted completion time. A coflow is said to complete when all its communication needs are met. This problem has been extremely well studied for the case of complete bipartite graphs that model a data center with full bisection bandwidth and several approximation algorithms and effective heuristics have been proposed recently [1, 2, 29].

In this work, we study a slightly different model of coflow scheduling in general graphs (to capture traffic between datacenters [15, 29]) and develop practical and efficient approximation algorithms for it. Our main result is a randomized 2-approximation algorithm for the single-path and free-path models, significantly improving prior work. In addition, we demonstrate via extensive experiments that the algorithm is practical, easy to implement, and performs well in practice.
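To make the objective concrete, here is a small sketch that computes the total weighted completion time of a set of coflows on a single bottleneck link and compares FIFO ordering against Smith’s rule (largest weight-to-demand ratio first). The coflow names, weights, demands, and link rate are made-up numbers, and this is not the paper’s randomized 2-approximation for general graphs; it only illustrates what is being minimized.

```python
# Sketch of the total weighted completion time objective for coflows on a
# single bottleneck link, scheduling whole coflows one after another.

coflows = [  # (name, weight, total demand in GB) -- made-up numbers
    ("query-join", 3.0, 12.0),
    ("shuffle-a", 1.0, 2.0),
    ("backup", 0.5, 30.0),
]
link_gbps = 10.0  # bottleneck link rate

def weighted_completion_time(order):
    total, elapsed = 0.0, 0.0
    for name, weight, demand_gb in order:
        elapsed += demand_gb * 8 / link_gbps   # seconds to finish this coflow
        total += weight * elapsed
    return total

# Smith's rule: schedule in decreasing weight-to-demand ratio.
smith_order = sorted(coflows, key=lambda c: c[1] / c[2], reverse=True)
print("Smith order:", [c[0] for c in smith_order],
      "->", weighted_completion_time(smith_order))
print("FIFO order: ", [c[0] for c in coflows],
      "->", weighted_completion_time(coflows))
```

General graphs make the problem much harder than this single-link picture, since each coflow’s flows compete on different links and may even have path choices, which is where the approximation algorithms come in.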

This is my first paper at SPAA, but all thanks go to Sheng Yang at UMD, whose dissertation will include this work, his advisor Samir, and their collaborator Manish at Google. My student Jie You (Jimmy) did the evaluation for this work, comparing it against existing works on different WAN topologies and workloads. It was a pleasant experience working with the theory folks and helping them better understand the constraints. I look forward to many more future collaborations!