Tag Archives: Disaggregation

Leap Accepted to Appear at ATC’2020

Since our pioneering work on Infiniswap that attempted to make memory disaggregation practical, there has been quite a few proposals to use different application-level interfaces to remote memory over RDMA. A common issue faced by all these approaches is the high overhead of existing kernel data paths whether they use the swapping subsystem or the file system. Moreover, we observed that even if such latencies could be resolved by short-circuiting the kernel, the latency of RDMA itself is still significantly higher than that of local memory access. By early 2018, we came to the conclusion that instead of holding our breadth for them to come closer, we should invest on designing an algorithm to prefetch data even before an applications needs a remote page. It had be extremely fast and must deal with a variety of cloud applications; so we found complex learning-based approaches to be unsuitable. Leap is our simple-yet-effective answer to all these challenges.

Memory disaggregation over RDMA can improve the performance of memory-constrained applications by replacing disk swapping with remote memory accesses. However, state-of-the-art memory disaggregation solutions still use data path components designed for slow disks. As a result, applications experience remote memory access latency significantly higher than that of the underlying low-latency network, which itself is too high for many applications.

In this paper, we propose Leap, a prefetching solution for remote memory accesses due to memory disaggregation. At its core, Leap employs an online, majority-based prefetching algorithm, which increases the page cache hit rate. We complement it with a lightweight and efficient data path in the kernel that isolates each application’s data path to the disaggregated memory and mitigates latency bottlenecks arising from legacy throughput-optimizing operations. Integration of Leap in the Linux kernel improves the median and tail remote page access latencies of memory-bound applications by up to 104.04× and 22.62×, respectively, over the default data path. This leads to up to 10.16× performance improvements for applications using disaggregated memory in comparison to the state-of-the-art solutions.

The majority-based prefetching algorithm at the core of Leap is the brainchild of Hasan. Leap is not limited to remote or disaggregated memory either; it can be used in any slow storage-fast storage combination. This is Hasan’s first project, and I’m glad to see that it has eventually been a success. It’s my first paper in ATC too.

This year USENIX ATC accepted 65 out of 348 papers, which is pretty competitive.

Received VMware Early Career Faculty Award!

A few weeks ago, I received a cold email from VMware Research’s Irina Calciu with this great news! The award is to support our ongoing research on memory disaggregation. VMware is doing some cool work in this space as well, and I look forward to collaborating with them in near future.

I’d like to thank VMware for their vote of confidence and support, as well as my students whose hard work this award recognizes. We have several good results in the pipeline, and hopefully we’ll get to share them with the world soon.

Received NSF CAREER Award. Thanks NSF!

The overarching goal of the proposal is to take a holistic view to make memory disaggregation practical by addressing challenges in the end host, inside the network, and throughout the entire cluster.

Thanks NSF for supporting the proposal and Deep Medhi for championing it. I’d like to also thank Barzan, Harsha, Prabal, and Vyas for sharing their experiences with me. Last but not the least, this was possible due to my incredible students who laid out the groundwork.

DSLR Accepted to Appear at SIGMOD’2018

High-throughput, low-latency lock managers are useful for building a variety of distributed applications. A key tradeoff in this context can be expressed in terms of the amount of knowledge available to the lock manager. On the one hand, a decentralized lock manager can increase throughput by parallelization, but it can starve certain categories of applications. On the other hand, a centralized lock manager can avoid starvation and impose resource sharing policies, but it can be limited in throughput. DSLR is our attempt at mitigating this tradeoff in clusters with fast RDMA networks. Specifically, we adapt Lamport’s bakery algorithm in the context of RDMA’s fetch-and-add (FA) operations, which provides higher throughput, lower latency, and avoids starvation in comparison to the state-of-the-art.

Lock managers are a crucial component of modern distributed systems. However, with the increasing popularity of fast RDMA-enabled networks, traditional lock managers can no longer keep up with the latency and throughput requirements of modern systems. Centralized lock managers can ensure fairness and prevent starvation using global knowledge of the system, but are themselves single points of contention and failure. Consequently, they fall short in leveraging the full potential of RDMA networks. On the other hand, decentralized (RDMA-based) lock managers either completely sacrifice global knowledge to achieve higher throughput at the risk of starvation and higher tail latencies, or they resort to costly communications to maintain global knowledge, which can result in significantly lower throughput.

In this paper, we show that it is possible for a lock manager to be fully decentralized and yet exchange the partial knowledge necessary for preventing starvation and thereby reducing tail latencies. Our main observation is that we can design a lock manager using RDMA’s fetch-and-add (FA) operation, which always succeeds, rather than compare-and-swap (CAS), which only succeeds if a given condition is satisfied. While this requires us to rethink the locking mechanism from the ground up, it enables us to sidestep the performance drawbacks of the previous CAS-based proposals that relied solely on blind retries upon lock conflicts.

Specifically, we present DSLR (Decentralized and Starvation-free Lock management with RDMA), a decentralized lock manager that targets distributed systems with RDMA-enabled networks. We demonstrate that, despite being fully decentralized, DSLR prevents starvation and blind retries by providing first-come-first-serve (FCFS) scheduling without maintaining explicit queues. We adapt Lamport’s bakery algorithm [34] to an RDMA-enabled environment with multiple bakers, utilizing only one-sided READ and atomic FA operations. Our experiments show that DSLR delivers up to 2.8X higher throughput than all existing RDMA-based lock managers, while reducing their average and 99.9% latencies by up to 2.5X and 47X, respectively.

Barzan and I started this project with Dong Young in 2016 right after I joined Michigan, as our interests matched in terms of new and interesting applications of RDMA primitives. It’s exciting to see our work turn into my first SIGMOD paper. As we work on rack-scale/resource disaggregation over RDMA, we are seeing more exciting use cases of RDMA, going beyond key-value stores and designing new RDMA networking protocols. Stay tuned!

Infiniswap in USENIX ;login: and Elsewhere

Since our first open-source release of Infiniswap over the summer, we have seen growing interest with many follow-ups within our group and outside.

Here is a quick summary of selected writeups on Infiniswap:

What’s Next?

In addition to a lot of performance optimizations and bug fixes, we’re working on several key features of Infiniswap that will be released in coming months.

  1. High Availability: Infiniswap will be able to maintain high performance across failures, corruptions, and interruptions throughout the cluster.
  2. Performance Isolation: Infiniswap will be able to provide end-to-end performance guarantees over the RDMA network to multiple applications concurrently using it.

Received Two Alibaba Innovation Research Grants

More resources for following up on our recent memory disaggregation and erasure coding works! One of the awards is a collaboration with Harsha Madhyastha. Looking forward to working with Alibaba.

In 2017, many proposals are received by AIR (Alibaba Innovation Research), which are from 99 universities and institutes (domestic 54; overseas 45) in 13 countries and regions. After rigorous review by AIR Technical Committee, 43 distinguish research proposals are accepted and will be funded by AIR Program.

Infiniswap Released on GitHub

Today we are glad to announce the first open-source release of Infiniswap, the first practical, large-scale memory disaggregation system for cloud and HPC clusters.

Infiniswap is an efficient memory disaggregation system designed specifically for clusters with fast RDMA networks. It opportunistically harvests and transparently exposes unused cluster memory to unmodified applications by dividing the swap space of each machine into many slabs and distributing them across many machines’ remote memory. Because one-sided RDMA operations bypass remote CPUs, Infiniswap leverages the power of many choices to perform decentralized slab placements and evictions.

Extensive benchmarks on workloads from memory-intensive applications ranging from in-memory databases such as VoltDB and Memcached to popular big data software Apache Spark, PowerGraph, and GraphX show that Infiniswap provides order-of-magnitude performance improvements when working sets do not completely fit in memory. Simultaneously, it boosts cluster memory utilization by almost 50%.

Primary features included in this initial release are:

  • No new hardware and no application or operating system modifications;
  • Fault tolerance via asynchronous disk backups;
  • Scalability via decentralized algorithms.

Here are some links, if you want to check it out, contribute, or just want to point someone else who can help us to make it better.

Git repository: https://github.com/Infiniswap/infiniswap
Detailed Overview: Efficient Memory Disaggregation with Infiniswap

The project is still in its early stage and can use all the help to become successful. We appreciate your feedback.

FaiRDMA Accepted to Appear at KBNets’2017

As cloud providers deploy RDMA in their datacenters and developers rewrite/update their applications to use RDMA primitives, a key question remains open: what will happen when multiple RDMA-enabled applications must share the network? Surprisingly, this simple question does not yet have a conclusive answer. This is because existing work focus primarily on improving individual application’s performance using RDMA on a case-by-case basis. So we took a simple step. Yiwen built a benchmarking tool and ran some controlled experiments to find some very interesting results: there is often no performance isolation, and there are many factors that determine RDMA performance beyond the network itself. We are currently investigating the root causes and working on mitigating them.

To meet the increasing throughput and latency demands of modern applications, many operators are rapidly deploying RDMA in their datacenters. At the same time, developers are re-designing their software to take advantage of RDMA’s benefits for individual applications. However, when it comes to RDMA’s performance, many simple questions remain open.

In this paper, we consider the performance isolation characteristics of RDMA. Specifically, we conduct three sets of experiments — three combinations of one throughput-sensitive flow and one latency-sensitive flow — in a controlled environment, observe large discrepancies in RDMA performance with and without the presence of a competing flow, and describe our progress in identifying plausible root-causes.

This work is an offshoot, among several others, of our Infiniswap project on rack-scale memory disaggregation. As time goes by, I’m appreciating more and more Ion’s words of wisdom on building real systems to find real problems.

FTR, the KBNets PC accepted 9 papers out of 22 submissions this year. For a first-time workshop, I consider it a decent rate; there are some great papers in the program from both the industry and academia.

Infiniswap Accepted to Appear at NSDI’2017

Update: Camera-ready version is available here. Infiniswap code is now on GitHub!

As networks become faster, the difference between remote and local resources is blurring everyday. How can we take advantage of these blurred lines? This is the key observation behind resource disaggregation and, to some extent, rack-scale computing. In this paper, we take our first stab at making memory disaggregation practical by exposing remote memory to unmodified applications. While there have been several proposals and feasibility studies in recent years, to the best of our knowledge, this is the first concrete step in making it real.

Memory-intensive applications suffer large performance loss when their working sets do not fully fit in memory. Yet, they cannot leverage otherwise unused remote memory when paging out to disks even in the presence of large imbalance in memory utilizations across a cluster. Existing proposals for memory disaggregation call for new architectures, new hardware designs, and/or new programming models, making them infeasible.

This paper describes the design and implementation of Infiniswap, a remote memory paging system designed specifically for an RDMA network. Infiniswap opportunistically harvests and transparently exposes unused memory to unmodified applications by dividing the swap space of each machine into many slabs and distributing them across many machines’ remote memory. Because RDMA operations bypass remote CPUs, Infiniswap leverages the power of many choices to perform decentralized slab placements and evictions.

We have implemented and deployed Infiniswap on an RDMA cluster without any OS modifications and evaluated its effectiveness using multiple workloads running on unmodified VoltDB, Memcached, PowerGraph, GraphX, and Apache Spark. Using Infiniswap, throughputs of these applications improve between 7.1X (0.98X) to 16.3X (9.3X) over disk (Mellanox nbdX), and median and tail latencies between 5.5X (2X) and 58X (2.2X). Infiniswap does so with negligible remote CPU usage, whereas nbdX becomes CPU-bound. Infiniswap increases the overall memory utilization of a cluster and works well at scale.

This work started as a class project for EECS 582 in the Winter when I gave the idea to Juncheng Gu and Youngmoon Lee, who made the pieces into a whole. Over the summer, Yiwen Zhang, an enterprising and excellent undergraduate, joined the project and helped us in getting it done within time.

This year the NSDI PC accepted 46 out of 255 papers. This happens to be my first paper with an all-blue cast! I want to thank Kang for giving me complete access to Juncheng and Youngmoon; it’s been great collaborating with them. I’m also glad that Yiwen has decided to start a Master’s and stay with us for longer, and more importantly, our team will remain intact for many more exciting followups in this emerging research area.

If this excites you, come join our group!

TWO NSF Proposals as the Lead PI Awarded. Thanks NSF!

The first one is on rack-scale computing using RDMA-enabled networks with Barzan Mozafari at the University of Michigan, and the second is on theoretical and systems implications of long-term fairness in cluster computing with Zhenhua Liu (Stony Brook University).

Thanks NSF!

Combined with the recent awards on geo-distributed analytics from NSF and Google, I’m looking forward to very exciting days in the future. If you want to be a part these exciting efforts, consider joining my group!