Tag Archives: RDMA

Hydra Accepted to Appear at FAST’2022

Almost five years after Infiniswap, memory disaggregation is now a mainstream research topic. It goes by many names, but the idea of accessing (unused/stranded) memory over high-speed networks is now close to reality. Despite many works in this line of research, a key remaining problem is ensuring resilience: how can applications recover quickly if remote memory fails, becomes corrupted, or is inaccessible? Keeping a copy on disk, as Infiniswap does, causes a performance bottleneck under failure, while keeping in-memory replicas doubles the memory overhead. We started Hydra to hit a sweet spot between these two extremes by applying lessons learned from our EC-Cache project to extremely small objects. While EC-Cache explicitly focused on very large objects in the MB range, Hydra aims to perform erasure coding at the 4KB page granularity and at the microsecond timescales common for RDMA. Additionally, we extended Asaf’s CopySets idea to erasure coding to tolerate concurrent failures with low overhead.

We present Hydra, a low-latency, low-overhead, and highly available resilience mechanism for remote memory. Hydra can access erasure-coded remote memory within single-digit μs read/write latency, significantly improving the performance-efficiency tradeoff over the state-of-the-art: it performs similarly to in-memory replication with 1.6× lower memory overhead. We also propose CodingSets, a novel coding group placement algorithm for erasure-coded data that provides load balancing while reducing the probability of data loss under correlated failures by an order of magnitude. With Hydra, even when only 50% of memory is local, unmodified memory-intensive applications achieve performance close to that of the fully in-memory case in the presence of remote failures and outperform state-of-the-art remote-memory solutions by up to 4.35×.
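
To make the page-granularity coding concrete, here is a minimal, hypothetical sketch (not Hydra’s code) that splits a 4KB page into K data splits plus a single XOR parity split and recovers one lost split. Hydra itself uses Reed–Solomon-style coding with r parity splits, issues the splits to remote machines over RDMA, and places coding groups with CodingSets; the constants K and SPLIT below are made up for illustration.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define PAGE_SIZE 4096
#define K 8                      /* number of data splits per page */
#define SPLIT (PAGE_SIZE / K)    /* 512 bytes per split            */

/* Encode one 4 KB page into K data splits and one XOR parity split.
 * Hydra uses Reed-Solomon-style coding with r parity splits; a single
 * XOR parity (r = 1) keeps the sketch short but shows the granularity. */
static void encode_page(const uint8_t page[PAGE_SIZE],
                        uint8_t splits[K][SPLIT],
                        uint8_t parity[SPLIT])
{
    memset(parity, 0, SPLIT);
    for (int i = 0; i < K; i++) {
        memcpy(splits[i], page + i * SPLIT, SPLIT);
        for (int j = 0; j < SPLIT; j++)
            parity[j] ^= splits[i][j];
    }
}

/* Rebuild a single lost split from the surviving splits plus parity. */
static void recover_split(uint8_t splits[K][SPLIT],
                          const uint8_t parity[SPLIT], int lost)
{
    memcpy(splits[lost], parity, SPLIT);
    for (int i = 0; i < K; i++)
        if (i != lost)
            for (int j = 0; j < SPLIT; j++)
                splits[lost][j] ^= splits[i][j];
}

int main(void)
{
    uint8_t page[PAGE_SIZE], splits[K][SPLIT], parity[SPLIT];
    for (int i = 0; i < PAGE_SIZE; i++) page[i] = (uint8_t)(i * 31);

    encode_page(page, splits, parity);
    memset(splits[3], 0, SPLIT);        /* pretend the machine holding split 3 failed */
    recover_split(splits, parity, 3);

    printf("split 3 %s\n",
           memcmp(splits[3], page + 3 * SPLIT, SPLIT) ? "corrupt" : "recovered");
    return 0;
}
```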

Youngmoon started working on Hydra right after we presented Infiniswap in early 2017! After Youngmoon graduated, Hasan started leading the project in early 2019, and Asaf joined us later that year. Together they significantly improved the paper over its early drafts. Even then, Hydra faced immense challenges. In the process, Hydra has taken over from Justitia the notorious distinction of holding my current record for accepted-after-N-submissions.

This was my first time submitting to FAST.

Justitia Accepted to Appear at NSDI’2022

The need for higher throughput and lower latency is driving kernel-bypass networking (KBN) in datacenters. Of the two related trends in KBN, hardware-based KBN is especially challenging because, unlike software KBN such as DPDK, it does not provide any control once a request is posted to the hardware. RDMA, which is the most prevalent form of hardware KBN in practice, is completely opaque. RDMA NICs (RNICs), whether they are using InfiniBand, RoCE, or iWARP, have fixed-function scheduling algorithms programmed in them. Like any other networking component, they also suffer from performance isolation issues when multiple applications compete for RNIC resources. Justitia is our attempt at introducing software control in hardware KBN.

Kernel-bypass networking (KBN) is becoming the new norm in modern datacenters. While hardware-based KBN offloads all dataplane tasks to specialized NICs to achieve better latency and CPU efficiency than software-based KBN, it also takes away the operator’s control over network sharing policies. Providing policy support in multi-tenant hardware KBN brings unique challenges — namely, preserving ultra-low latency and low CPU cost, finding a well-defined point of mediation, and rethinking traffic shapers. We present Justitia to address these challenges with three key design aspects: (i) Split Connection with message-level shaping, (ii) sender-based resource mediation together with receiver-side updates, and (iii) passive latency monitoring. Using a latency target as its knob, Justitia enables multi-tenancy policies such as predictable latencies and fair/weighted resource sharing. Our evaluation shows Justitia can effectively isolate latency-sensitive applications at the cost of slightly decreased utilization and ensure that throughput and bandwidth of the rest are not unfairly penalized.
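
As a rough illustration of what message-level shaping buys, here is a toy, self-contained sketch (not Justitia’s implementation) in which a large bandwidth-hungry transfer is chopped into fixed-size chunks so that small latency-sensitive messages can be interleaved rather than queued behind one multi-megabyte work request. The chunk size, message sizes, and the post_to_nic stand-in are all assumptions for illustration.

```c
#include <stdio.h>
#include <stddef.h>

#define CHUNK (1 << 20)                 /* split big messages into 1 MB chunks */

static void post_to_nic(const char *who, size_t bytes)
{
    /* Stand-in for posting a work request to the RNIC's send queue. */
    printf("post %-8s %zu bytes\n", who, bytes);
}

int main(void)
{
    size_t big_remaining = 16u << 20;   /* one 16 MB bandwidth-sensitive message */
    int small_pending = 4;              /* four 4 KB latency-sensitive messages  */

    while (big_remaining > 0 || small_pending > 0) {
        /* Latency-sensitive traffic gets the next slot whenever it is waiting. */
        if (small_pending > 0) {
            post_to_nic("latency", 4096);
            small_pending--;
        }
        /* Then at most one chunk of the big message goes out. */
        if (big_remaining > 0) {
            size_t chunk = big_remaining < CHUNK ? big_remaining : CHUNK;
            post_to_nic("bulk", chunk);
            big_remaining -= chunk;
        }
    }
    return 0;
}
```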

Yiwen started working on this problem when we first observed RDMA isolation issues in Infiniswap. He even wrote a short paper at KBNets 2017 based on his early findings. Yue worked on it for quite a few months before she went to Princeton for her Ph.D. Brent has been helping us get this work into shape since the beginning. It’s been a long and arduous road; every time we fixed something, new reviewers didn’t like something else. Finally, an NSDI revision allowed us to directly address the most pressing concerns. Without commenting on how much the paper has improved after all these iterations, I can say that adding revisions to NSDI has saved us, especially Yiwen, a lot of frustration. For what it’s worth, Justitia now has the notorious distinction of holding my current record for accepted-after-N-submissions; it’s been so long that I’ve lost track of the exact value of N!

Presented Keynote Talk at CloudNet’2020

Earlier this week, I presented a keynote talk on the state of network-informed data systems design at the CloudNet’2020 conference, with a specific focus on our recent works on memory disaggregation (Infiniswap, Leap, and NetLock), and discussed the many open challenges toward making memory disaggregation practical.

In this talk, I discussed the motivation behind disaggregating memory, or any other expensive resource for that matter. High-performance data systems strive to keep data in main memory. They often over-provision to avoid running out of memory, leading to an average memory underutilization of about 50% in Google, Facebook, and Alibaba datacenters. The root cause is simple: applications today cannot access otherwise unused memory beyond their machine boundaries even when their performance grinds to a halt. But could they? Over the course of the last five years, our research in the SymbioticLab has addressed, and continues to address, this fundamental question of memory disaggregation, whereby an application can use both local and remote memory over emerging high-speed networks.

I also highlighted at least eight major challenges any memory disaggregation solution must address to even have a shot at becoming practical and widely used. These include applicability to a large variety of applications without requiring any application-level changes; scalability to large datacenters; efficiency in using up all available memory; high performance when using disaggregated memory; performance isolation from other datacenter traffic; resilience in the presence of failures and unavailability; security from others; and generality to a variety of memory technologies beyond DRAM. While this may come across as a laundry list of problems, we believe that a complete solution must address each one of them.

In this context, I discussed three projects: Infiniswap, which achieves applicability using remote memory paging, and scalability and efficiency using decentralized algorithms; Leap, which improves performance by prefetching; and NetLock, which shows how to disaggregate programmable switch memory. I also pointed out a variety of ongoing projects toward our ultimate goal of unified, practical memory disaggregation.

My slides from this talk are publicly available and elaborate on these points in more detail.

Leap Wins the Best Paper Award at ATC’2020. Congrats Hasan!

Leap, the fastest memory disaggregation system to date, has won a best paper award at this year’s USENIX ATC conference!

This is a happy outcome of Hasan‘s persistence on this project for more than two years. From coming up with the core idea to executing it well, Hasan has done a fantastic job so far; I expect more great outcomes from his ongoing work.

Leap is open-source and available at https://github.com/symbioticlab/leap.

Leap Accepted to Appear at ATC’2020

Since our pioneering work on Infiniswap, which attempted to make memory disaggregation practical, there have been quite a few proposals to use different application-level interfaces to remote memory over RDMA. A common issue faced by all these approaches is the high overhead of existing kernel data paths, whether they use the swapping subsystem or the file system. Moreover, we observed that even if such latencies could be resolved by short-circuiting the kernel, the latency of RDMA itself is still significantly higher than that of local memory access. By early 2018, we came to the conclusion that instead of holding our breath for the two to come closer, we should invest in designing an algorithm to prefetch data even before an application needs a remote page. It had to be extremely fast and had to work for a variety of cloud applications, so we found complex learning-based approaches unsuitable. Leap is our simple-yet-effective answer to all these challenges.

Memory disaggregation over RDMA can improve the performance of memory-constrained applications by replacing disk swapping with remote memory accesses. However, state-of-the-art memory disaggregation solutions still use data path components designed for slow disks. As a result, applications experience remote memory access latency significantly higher than that of the underlying low-latency network, which itself is too high for many applications.

In this paper, we propose Leap, a prefetching solution for remote memory accesses due to memory disaggregation. At its core, Leap employs an online, majority-based prefetching algorithm, which increases the page cache hit rate. We complement it with a lightweight and efficient data path in the kernel that isolates each application’s data path to the disaggregated memory and mitigates latency bottlenecks arising from legacy throughput-optimizing operations. Integration of Leap in the Linux kernel improves the median and tail remote page access latencies of memory-bound applications by up to 104.04× and 22.62×, respectively, over the default data path. This leads to up to 10.16× performance improvements for applications using disaggregated memory in comparison to the state-of-the-art solutions.
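
For intuition, here is a small, self-contained sketch in the spirit of Leap’s majority-based trend detection: apply a Boyer–Moore-style majority vote to the deltas between recent page accesses and prefetch along that stride only when one clearly dominates. The window size, prefetch depth, and access trace below are illustrative assumptions, not Leap’s actual parameters or code.

```c
#include <stdio.h>

#define WINDOW 8

/* Return the majority stride among the last WINDOW page accesses, or 0
 * if no single stride accounts for more than half of the deltas. */
static long majority_stride(const long hist[], int n)
{
    long candidate = 0;
    int count = 0, votes = 0;

    for (int i = 1; i < n; i++) {           /* Boyer-Moore vote on deltas */
        long d = hist[i] - hist[i - 1];
        if (count == 0) { candidate = d; count = 1; }
        else if (d == candidate) count++;
        else count--;
    }
    for (int i = 1; i < n; i++)             /* verify it is a true majority */
        if (hist[i] - hist[i - 1] == candidate) votes++;
    return (votes > (n - 1) / 2) ? candidate : 0;
}

int main(void)
{
    /* Mostly-sequential page accesses with one stray access mixed in. */
    long hist[WINDOW] = { 100, 101, 102, 777, 103, 104, 105, 106 };
    long stride = majority_stride(hist, WINDOW);

    if (stride != 0)
        for (int i = 1; i <= 4; i++)        /* prefetch a few pages ahead */
            printf("prefetch page %ld\n", hist[WINDOW - 1] + i * stride);
    else
        printf("no clear trend; skip prefetching\n");
    return 0;
}
```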

The majority-based prefetching algorithm at the core of Leap is the brainchild of Hasan. Leap is not limited to remote or disaggregated memory either; it can be used with any combination of slow and fast storage. This is Hasan’s first project, and I’m glad to see that it has eventually been a success. It’s my first paper at ATC too.

This year USENIX ATC accepted 65 out of 348 papers, which is pretty competitive.

DSLR Accepted to Appear at SIGMOD’2018

High-throughput, low-latency lock managers are useful for building a variety of distributed applications. A key tradeoff in this context can be expressed in terms of the amount of knowledge available to the lock manager. On the one hand, a decentralized lock manager can increase throughput via parallelization, but it can starve certain categories of applications. On the other hand, a centralized lock manager can avoid starvation and impose resource sharing policies, but it can be limited in throughput. DSLR is our attempt at mitigating this tradeoff in clusters with fast RDMA networks. Specifically, we adapt Lamport’s bakery algorithm to RDMA’s fetch-and-add (FA) operations, which provides higher throughput and lower latency and avoids starvation in comparison to the state-of-the-art.

Lock managers are a crucial component of modern distributed systems. However, with the increasing popularity of fast RDMA-enabled networks, traditional lock managers can no longer keep up with the latency and throughput requirements of modern systems. Centralized lock managers can ensure fairness and prevent starvation using global knowledge of the system, but are themselves single points of contention and failure. Consequently, they fall short in leveraging the full potential of RDMA networks. On the other hand, decentralized (RDMA-based) lock managers either completely sacrifice global knowledge to achieve higher throughput at the risk of starvation and higher tail latencies, or they resort to costly communications to maintain global knowledge, which can result in significantly lower throughput.

In this paper, we show that it is possible for a lock manager to be fully decentralized and yet exchange the partial knowledge necessary for preventing starvation and thereby reducing tail latencies. Our main observation is that we can design a lock manager using RDMA’s fetch-and-add (FA) operation, which always succeeds, rather than compare-and-swap (CAS), which only succeeds if a given condition is satisfied. While this requires us to rethink the locking mechanism from the ground up, it enables us to sidestep the performance drawbacks of the previous CAS-based proposals that relied solely on blind retries upon lock conflicts.

Specifically, we present DSLR (Decentralized and Starvation-free Lock management with RDMA), a decentralized lock manager that targets distributed systems with RDMA-enabled networks. We demonstrate that, despite being fully decentralized, DSLR prevents starvation and blind retries by providing first-come-first-serve (FCFS) scheduling without maintaining explicit queues. We adapt Lamport’s bakery algorithm [34] to an RDMA-enabled environment with multiple bakers, utilizing only one-sided READ and atomic FA operations. Our experiments show that DSLR delivers up to 2.8X higher throughput than all existing RDMA-based lock managers, while reducing their average and 99.9th percentile latencies by up to 2.5X and 47X, respectively.
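
To see why fetch-and-add naturally yields FCFS ordering, here is a local-memory analogue (not DSLR’s implementation) of the ticket/bakery idea: acquiring the lock is one fetch-and-add that hands out the next ticket, and waiting is repeated reads of a “now serving” counter. In DSLR, the counters live in a 64-bit word on the remote lock table, the fetch-and-add is a one-sided RDMA FA and the polling is a one-sided RDMA READ, and the design additionally supports shared/exclusive modes and handles failed lock holders, all of which this sketch omits.

```c
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

typedef struct {
    atomic_uint next_ticket;   /* incremented by FA when a client arrives */
    atomic_uint now_serving;   /* incremented when the holder releases    */
} ticket_lock_t;

static void lock(ticket_lock_t *l)
{
    unsigned my = atomic_fetch_add(&l->next_ticket, 1);   /* "RDMA FA"   */
    while (atomic_load(&l->now_serving) != my)            /* "RDMA READ" */
        ;                                                 /* poll        */
}

static void unlock(ticket_lock_t *l)
{
    atomic_fetch_add(&l->now_serving, 1);
}

static ticket_lock_t g_lock;
static int counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        lock(&g_lock);
        counter++;                       /* critical section */
        unlock(&g_lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %d (expect 400000)\n", counter);
    return 0;
}
```

Because every arrival gets a strictly increasing ticket from a single atomic counter, requests are served in arrival order without any explicit queue, which is exactly the property a blind CAS-and-retry scheme cannot guarantee.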

Barzan and I started this project with Dong Young in 2016, right after I joined Michigan, as our interests matched in terms of new and interesting applications of RDMA primitives. It’s exciting to see this work turn into my first SIGMOD paper. As we continue working on rack-scale resource disaggregation over RDMA, we are seeing more exciting use cases of RDMA, going beyond key-value stores and designing new RDMA networking protocols. Stay tuned!

Infiniswap in USENIX ;login: and Elsewhere

Since our first open-source release of Infiniswap over the summer, we have seen growing interest, with many follow-ups both within our group and outside.

Here is a quick summary of selected writeups on Infiniswap:

What’s Next?

In addition to a lot of performance optimizations and bug fixes, we’re working on several key features of Infiniswap that will be released in the coming months.

  1. High Availability: Infiniswap will be able to maintain high performance across failures, corruptions, and interruptions throughout the cluster.
  2. Performance Isolation: Infiniswap will be able to provide end-to-end performance guarantees over the RDMA network to multiple applications concurrently using it.

Infiniswap Released on GitHub

Today we are glad to announce the first open-source release of Infiniswap, the first practical, large-scale memory disaggregation system for cloud and HPC clusters.

Infiniswap is an efficient memory disaggregation system designed specifically for clusters with fast RDMA networks. It opportunistically harvests and transparently exposes unused cluster memory to unmodified applications by dividing the swap space of each machine into many slabs and distributing them across many machines’ remote memory. Because one-sided RDMA operations bypass remote CPUs, Infiniswap leverages the power of many choices to perform decentralized slab placements and evictions.
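
As a rough sketch of the “power of many choices” idea (not Infiniswap’s code), the toy placement routine below probes a few randomly chosen machines for their free memory and places a slab on the least-loaded of them, avoiding any central coordinator. The machine count, sample size, slab size, and free-memory numbers are made-up parameters for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MACHINES 32
#define CHOICES  4          /* how many machines to probe per placement */

static long free_mem_mb[MACHINES];   /* stand-in for per-machine daemons */

/* Probe CHOICES random machines and place the slab on the one with the
 * most free memory; no global view of the cluster is required. */
static int place_slab(void)
{
    int best = rand() % MACHINES;
    for (int i = 1; i < CHOICES; i++) {
        int cand = rand() % MACHINES;
        if (free_mem_mb[cand] > free_mem_mb[best])
            best = cand;
    }
    free_mem_mb[best] -= 512;        /* assume 512 MB slabs */
    return best;
}

int main(void)
{
    srand((unsigned)time(NULL));
    for (int i = 0; i < MACHINES; i++)
        free_mem_mb[i] = 4096 + rand() % 8192;

    for (int s = 0; s < 8; s++)
        printf("slab %d -> machine %d\n", s, place_slab());
    return 0;
}
```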

Extensive benchmarks on workloads from memory-intensive applications ranging from in-memory databases such as VoltDB and Memcached to popular big data software Apache Spark, PowerGraph, and GraphX show that Infiniswap provides order-of-magnitude performance improvements when working sets do not completely fit in memory. Simultaneously, it boosts cluster memory utilization by almost 50%.

Primary features included in this initial release are:

  • No new hardware and no application or operating system modifications;
  • Fault tolerance via asynchronous disk backups;
  • Scalability via decentralized algorithms.

Here are some links if you want to check it out, contribute, or point someone else to it who can help us make it better.

Git repository: https://github.com/Infiniswap/infiniswap
Detailed Overview: Efficient Memory Disaggregation with Infiniswap

The project is still in its early stages and can use all the help it can get. We appreciate your feedback.

FaiRDMA Accepted to Appear at KBNets’2017

As cloud providers deploy RDMA in their datacenters and developers rewrite or update their applications to use RDMA primitives, a key question remains open: what happens when multiple RDMA-enabled applications must share the network? Surprisingly, this simple question does not yet have a conclusive answer, because existing work focuses primarily on improving individual applications’ performance using RDMA on a case-by-case basis. So we took a simple step. Yiwen built a benchmarking tool and ran controlled experiments that revealed some very interesting results: there is often no performance isolation, and many factors beyond the network itself determine RDMA performance. We are currently investigating the root causes and working on mitigating them.

To meet the increasing throughput and latency demands of modern applications, many operators are rapidly deploying RDMA in their datacenters. At the same time, developers are re-designing their software to take advantage of RDMA’s benefits for individual applications. However, when it comes to RDMA’s performance, many simple questions remain open.

In this paper, we consider the performance isolation characteristics of RDMA. Specifically, we conduct three sets of experiments (three combinations of one throughput-sensitive flow and one latency-sensitive flow) in a controlled environment, observe large discrepancies in RDMA performance with and without the presence of a competing flow, and describe our progress in identifying plausible root causes.

This work is an offshoot, among several others, of our Infiniswap project on rack-scale memory disaggregation. As time goes by, I’m appreciating more and more Ion’s words of wisdom on building real systems to find real problems.

FTR, the KBNets PC accepted 9 papers out of 22 submissions this year. For a first-time workshop, I consider that a decent rate; there are some great papers in the program from both industry and academia.

Infiniswap Accepted to Appear at NSDI’2017

Update: Camera-ready version is available here. Infiniswap code is now on GitHub!

As networks become faster, the difference between remote and local resources is blurring every day. How can we take advantage of these blurred lines? This is the key observation behind resource disaggregation and, to some extent, rack-scale computing. In this paper, we take our first stab at making memory disaggregation practical by exposing remote memory to unmodified applications. While there have been several proposals and feasibility studies in recent years, to the best of our knowledge, this is the first concrete step in making it real.

Memory-intensive applications suffer large performance loss when their working sets do not fully fit in memory. Yet, they cannot leverage otherwise unused remote memory when paging out to disks even in the presence of large imbalance in memory utilizations across a cluster. Existing proposals for memory disaggregation call for new architectures, new hardware designs, and/or new programming models, making them infeasible.

This paper describes the design and implementation of Infiniswap, a remote memory paging system designed specifically for an RDMA network. Infiniswap opportunistically harvests and transparently exposes unused memory to unmodified applications by dividing the swap space of each machine into many slabs and distributing them across many machines’ remote memory. Because RDMA operations bypass remote CPUs, Infiniswap leverages the power of many choices to perform decentralized slab placements and evictions.

We have implemented and deployed Infiniswap on an RDMA cluster without any OS modifications and evaluated its effectiveness using multiple workloads running on unmodified VoltDB, Memcached, PowerGraph, GraphX, and Apache Spark. Using Infiniswap, throughputs of these applications improve between 7.1X (0.98X) and 16.3X (9.3X) over disk (Mellanox nbdX), and median and tail latencies between 5.5X (2X) and 58X (2.2X). Infiniswap does so with negligible remote CPU usage, whereas nbdX becomes CPU-bound. Infiniswap increases the overall memory utilization of a cluster and works well at scale.

This work started as a class project for EECS 582 in the Winter, when I gave the idea to Juncheng Gu and Youngmoon Lee, who made the pieces into a whole. Over the summer, Yiwen Zhang, an enterprising and excellent undergraduate, joined the project and helped us get it done in time.

This year the NSDI PC accepted 46 out of 255 papers. This happens to be my first paper with an all-blue cast! I want to thank Kang for giving me complete access to Juncheng and Youngmoon; it’s been great collaborating with them. I’m also glad that Yiwen has decided to start a Master’s and stay with us longer, and, more importantly, that our team will remain intact for many more exciting follow-ups in this emerging research area.

If this excites you, come join our group!