Tag Archives: RDMA

DSLR Accepted to Appear at SIGMOD’2018

High-throughput, low-latency lock managers are useful for building a variety of distributed applications. A key tradeoff in this context can be expressed in terms of the amount of knowledge available to the lock manager. On the one hand, a decentralized lock manager can increase throughput by parallelization, but it can starve certain categories of applications. On the other hand, a centralized lock manager can avoid starvation and impose resource sharing policies, but it can be limited in throughput. DSLR is our attempt at mitigating this tradeoff in clusters with fast RDMA networks. Specifically, we adapt Lamport’s bakery algorithm to RDMA’s fetch-and-add (FA) operations, which yields higher throughput and lower latency than the state of the art while avoiding starvation.

Lock managers are a crucial component of modern distributed systems. However, with the increasing popularity of fast RDMA-enabled networks, traditional lock managers can no longer keep up with the latency and throughput requirements of modern systems. Centralized lock managers can ensure fairness and prevent starvation using global knowledge of the system, but are themselves single points of contention and failure. Consequently, they fall short in leveraging the full potential of RDMA networks. On the other hand, decentralized (RDMA-based) lock managers either completely sacrifice global knowledge to achieve higher throughput at the risk of starvation and higher tail latencies, or they resort to costly communications to maintain global knowledge, which can result in significantly lower throughput.

In this paper, we show that it is possible for a lock manager to be fully decentralized and yet exchange the partial knowledge necessary for preventing starvation and thereby reducing tail latencies. Our main observation is that we can design a lock manager using RDMA’s fetch-and-add (FA) operation, which always succeeds, rather than compare-and-swap (CAS), which only succeeds if a given condition is satisfied. While this requires us to rethink the locking mechanism from the ground up, it enables us to sidestep the performance drawbacks of the previous CAS-based proposals that relied solely on blind retries upon lock conflicts.
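To make the CAS-versus-FA distinction concrete, here is a minimal sketch in Python. It is a local simulation under stated assumptions: `SimulatedWord` is a hypothetical stand-in for a remote 64-bit word, and a mutex stands in for the atomicity the NIC would provide. It is not RDMA code and not taken from the DSLR implementation.

```python
import threading


class SimulatedWord:
    """Hypothetical stand-in for a remote word; a mutex simulates atomicity."""

    def __init__(self, value=0):
        self._value = value
        self._mutex = threading.Lock()

    def read(self):
        with self._mutex:
            return self._value

    def compare_and_swap(self, expected, new):
        """Write `new` only if the word equals `expected`; return the old value."""
        with self._mutex:
            old = self._value
            if old == expected:
                self._value = new
            return old

    def fetch_and_add(self, delta):
        """Unconditionally add `delta`; return the old value."""
        with self._mutex:
            old = self._value
            self._value = old + delta
            return old


# CAS-style locking: 0 means free, 1 means held. The lock is already held,
# so the swap does not happen and the caller learns nothing except "retry".
lock_word = SimulatedWord(1)
old = lock_word.compare_and_swap(0, 1)
cas_succeeded = (old == 0)

# FA-style ticketing: every call succeeds and hands back a distinct,
# ordered value, so each caller learns its position without retrying.
counter = SimulatedWord(0)
tickets = [counter.fetch_and_add(1) for _ in range(3)]
```

The failed CAS leaves the word untouched, which is why a CAS-only design has to retry blindly, whereas each FA both changes the state and returns information the caller can use.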

Specifically, we present DSLR (Decentralized and Starvation-free Lock management with RDMA), a decentralized lock manager that targets distributed systems with RDMA-enabled networks. We demonstrate that, despite being fully decentralized, DSLR prevents starvation and blind retries by providing first-come-first-served (FCFS) scheduling without maintaining explicit queues. We adapt Lamport’s bakery algorithm [34] to an RDMA-enabled environment with multiple bakers, utilizing only one-sided READ and atomic FA operations. Our experiments show that DSLR delivers up to 2.8X higher throughput than all existing RDMA-based lock managers, while reducing their average and 99.9th-percentile latencies by up to 2.5X and 47X, respectively.
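The bakery idea of FCFS without an explicit queue can be sketched with a plain ticket lock. This is a heavily simplified illustration, not DSLR itself: it runs in one process with threads, `AtomicCounter` and `TicketLock` are hypothetical names, the counter's `read` plays the role of a one-sided READ and `fetch_and_add` the role of FA, and it leaves out everything else a real lock manager needs, such as lock modes and failure handling.

```python
import threading
import time


class AtomicCounter:
    """Hypothetical stand-in for a remote counter; a mutex simulates atomicity."""

    def __init__(self):
        self._value = 0
        self._mutex = threading.Lock()

    def fetch_and_add(self, delta):
        with self._mutex:
            old = self._value
            self._value += delta
            return old

    def read(self):
        with self._mutex:
            return self._value


class TicketLock:
    """FCFS lock: take a ticket with FA, then poll until it is being served."""

    def __init__(self):
        self.next_ticket = AtomicCounter()
        self.now_serving = AtomicCounter()

    def acquire(self):
        ticket = self.next_ticket.fetch_and_add(1)  # always succeeds
        while self.now_serving.read() != ticket:    # poll, no CAS retries
            time.sleep(0.0001)
        return ticket

    def release(self):
        self.now_serving.fetch_and_add(1)           # admit the next ticket


def run_clients(num_clients=8):
    lock = TicketLock()
    served = []  # tickets in the order they entered the critical section

    def client():
        ticket = lock.acquire()
        served.append(ticket)  # critical section
        lock.release()

    threads = [threading.Thread(target=client) for _ in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return served


served_order = run_clients()
```

Whatever order the threads happen to be scheduled in, clients enter the critical section in ticket order, so nobody who arrived earlier can be overtaken indefinitely.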

Barzan and I started this project with Dong Young in 2016 right after I joined Michigan, as our interests matched in terms of new and interesting applications of RDMA primitives. It’s exciting to see our work turn into my first SIGMOD paper. As we work on rack-scale/resource disaggregation over RDMA, we are seeing more exciting use cases of RDMA, going beyond key-value stores and designing new RDMA networking protocols. Stay tuned!

Infiniswap in USENIX ;login: and Elsewhere

Since our first open-source release of Infiniswap over the summer, we have seen growing interest with many follow-ups within our group and outside.

Here is a quick summary of selected writeups on Infiniswap:

What’s Next?

In addition to a lot of performance optimizations and bug fixes, we’re working on several key features of Infiniswap that will be released in coming months.

  1. High Availability: Infiniswap will be able to maintain high performance across failures, corruptions, and interruptions throughout the cluster.
  2. Performance Isolation: Infiniswap will be able to provide end-to-end performance guarantees over the RDMA network to multiple applications concurrently using it.

Infiniswap Released on GitHub

Today we are glad to announce the first open-source release of Infiniswap, the first practical, large-scale memory disaggregation system for cloud and HPC clusters.

Infiniswap is an efficient memory disaggregation system designed specifically for clusters with fast RDMA networks. It opportunistically harvests and transparently exposes unused cluster memory to unmodified applications by dividing the swap space of each machine into many slabs and distributing them across many machines’ remote memory. Because one-sided RDMA operations bypass remote CPUs, Infiniswap leverages the power of many choices to perform decentralized slab placements and evictions.
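For intuition on why sampling a few candidates helps decentralized placement, here is a small simulation. It is a sketch under assumptions, not Infiniswap's placement code: `place_slabs` is a hypothetical helper, every slab is treated as one unit of load, and the two-choice setting is picked purely for illustration.

```python
import random


def place_slabs(num_machines, num_slabs, choices, seed=0):
    """Place each slab on the least-loaded of `choices` randomly sampled machines."""
    rng = random.Random(seed)
    load = [0] * num_machines
    for _ in range(num_slabs):
        candidates = rng.sample(range(num_machines), choices)
        target = min(candidates, key=lambda m: load[m])
        load[target] += 1
    return load


# Purely random placement versus picking the better of two sampled machines.
one_choice = place_slabs(100, 10_000, choices=1)
two_choices = place_slabs(100, 10_000, choices=2)

imbalance_one = max(one_choice) - min(one_choice)
imbalance_two = max(two_choices) - min(two_choices)
print("max-min load, one choice: ", imbalance_one)
print("max-min load, two choices:", imbalance_two)
```

Each placement decision looks at only the sampled machines, so no central coordinator or global view is needed, yet the load typically ends up far more even than with purely random placement.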

Extensive benchmarks on workloads from memory-intensive applications ranging from in-memory databases such as VoltDB and Memcached to popular big data software Apache Spark, PowerGraph, and GraphX show that Infiniswap provides order-of-magnitude performance improvements when working sets do not completely fit in memory. Simultaneously, it boosts cluster memory utilization by almost 50%.

Primary features included in this initial release are:

  • No new hardware and no application or operating system modifications;
  • Fault tolerance via asynchronous disk backups;
  • Scalability via decentralized algorithms.

Here are some links if you want to check it out, contribute, or point it out to someone who can help us make it better.

Git repository: https://github.com/Infiniswap/infiniswap
Detailed Overview: Efficient Memory Disaggregation with Infiniswap

The project is still in its early stages and could use all the help it can get. We appreciate your feedback.

FaiRDMA Accepted to Appear at KBNets’2017

As cloud providers deploy RDMA in their datacenters and developers rewrite/update their applications to use RDMA primitives, a key question remains open: what will happen when multiple RDMA-enabled applications must share the network? Surprisingly, this simple question does not yet have a conclusive answer. This is because existing work focuses primarily on improving individual applications’ performance using RDMA on a case-by-case basis. So we took a simple step. Yiwen built a benchmarking tool and ran some controlled experiments to find some very interesting results: there is often no performance isolation, and there are many factors that determine RDMA performance beyond the network itself. We are currently investigating the root causes and working on mitigating them.

To meet the increasing throughput and latency demands of modern applications, many operators are rapidly deploying RDMA in their datacenters. At the same time, developers are re-designing their software to take advantage of RDMA’s benefits for individual applications. However, when it comes to RDMA’s performance, many simple questions remain open.

In this paper, we consider the performance isolation characteristics of RDMA. Specifically, we conduct three sets of experiments — three combinations of one throughput-sensitive flow and one latency-sensitive flow — in a controlled environment, observe large discrepancies in RDMA performance with and without the presence of a competing flow, and describe our progress in identifying plausible root-causes.

This work is an offshoot, among several others, of our Infiniswap project on rack-scale memory disaggregation. As time goes by, I’m appreciating more and more Ion’s words of wisdom on building real systems to find real problems.

FTR, the KBNets PC accepted 9 papers out of 22 submissions this year. For a first-time workshop, I consider it a decent rate; there are some great papers in the program from both industry and academia.

Infiniswap Accepted to Appear at NSDI’2017

Update: Camera-ready version is available here. Infiniswap code is now on GitHub!

As networks become faster, the difference between remote and local resources is blurring every day. How can we take advantage of these blurred lines? This is the key observation behind resource disaggregation and, to some extent, rack-scale computing. In this paper, we take our first stab at making memory disaggregation practical by exposing remote memory to unmodified applications. While there have been several proposals and feasibility studies in recent years, to the best of our knowledge, this is the first concrete step in making it real.

Memory-intensive applications suffer large performance loss when their working sets do not fully fit in memory. Yet, they cannot leverage otherwise unused remote memory when paging out to disks even in the presence of large imbalance in memory utilizations across a cluster. Existing proposals for memory disaggregation call for new architectures, new hardware designs, and/or new programming models, making them infeasible.

This paper describes the design and implementation of Infiniswap, a remote memory paging system designed specifically for an RDMA network. Infiniswap opportunistically harvests and transparently exposes unused memory to unmodified applications by dividing the swap space of each machine into many slabs and distributing them across many machines’ remote memory. Because RDMA operations bypass remote CPUs, Infiniswap leverages the power of many choices to perform decentralized slab placements and evictions.

We have implemented and deployed Infiniswap on an RDMA cluster without any OS modifications and evaluated its effectiveness using multiple workloads running on unmodified VoltDB, Memcached, PowerGraph, GraphX, and Apache Spark. Using Infiniswap, throughputs of these applications improve between 7.1X (0.98X) and 16.3X (9.3X) over disk (Mellanox nbdX), and median and tail latencies between 5.5X (2X) and 58X (2.2X). Infiniswap does so with negligible remote CPU usage, whereas nbdX becomes CPU-bound. Infiniswap increases the overall memory utilization of a cluster and works well at scale.

This work started as a class project for EECS 582 in the Winter when I gave the idea to Juncheng Gu and Youngmoon Lee, who made the pieces into a whole. Over the summer, Yiwen Zhang, an enterprising and excellent undergraduate, joined the project and helped us get it done in time.

This year the NSDI PC accepted 46 out of 255 papers. This happens to be my first paper with an all-blue cast! I want to thank Kang for giving me complete access to Juncheng and Youngmoon; it’s been great collaborating with them. I’m also glad that Yiwen has decided to start a Master’s and stay with us for longer, and more importantly, our team will remain intact for many more exciting follow-ups in this emerging research area.

If this excites you, come join our group!