Tag Archives: Disaggregation

Aequitas Accepted to Appear at SIGCOMM’2022

Although Remote Procedure Calls (RPCs) generate the bulk of the network traffic in modern disaggregated datacenters, network-level optimizations still focus on low-level metrics at the granularity of packets, flowlets, and flows. The lack of application-level semantics leads to suboptimal performance because there is often a mismatch between network- and application-level metrics. Indeed, my work on coflows relied on a similar observation for distributed data-parallel jobs. In this work, we focus on leveraging application- and user-level semantics of RPCs to improve application- and user-perceived performance. Specifically, we implement quality-of-service (QoS) primitives at the RPC level by extending classic works on service curves and network calculus. We show that this can be achieved with an edge-based solution in conjunction with traditional QoS classes in legacy/non-programmable switches.

With the increasing popularity of disaggregated storage and microservice architectures, high fan-out and fan-in Remote Procedure Calls (RPCs) now generate most of the traffic in modern datacenters. While the network plays a crucial role in RPC performance, traditional traffic classification categories cannot sufficiently capture their importance due to wide variations in RPC characteristics. As a result, meeting service-level objectives (SLOs), especially for performance-critical (PC) RPCs, remains challenging.

We present Aequitas, a distributed sender-driven admission control scheme that uses commodity Weighted-Fair Queuing (WFQ) to guarantee RPC-level SLOs. Aequitas maps PC RPCs to higher-weight queues. In the presence of network overloads, it enforces cluster-wide RPC latency SLOs by limiting the amount of traffic admitted into any given QoS and downgrading the rest. We show analytically and empirically that this simple scheme works well. When the network demand spikes beyond provisioned capacity, Aequitas achieves a latency SLO that is 3.8× lower than the state-of-the-art congestion control at the 99.9th percentile and admits up to 2× more PC RPCs meeting SLO when compared with pFabric, Qjump, D3, PDQ, and Homa. Results from a fleet-wide production deployment at a large cloud provider show a 10% latency improvement.
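To make the idea concrete, here is a minimal sketch of the flavor of sender-side admission control described above: RPCs are probabilistically admitted into the higher-weight QoS, and the admission probability is tightened when observed RPC latency misses the SLO. The class name, the AIMD-style update rule, and all constants are illustrative assumptions, not Aequitas’s actual algorithm or code.

```python
import random

class QosAdmissionController:
    """Illustrative per-sender admission control for one QoS class (not Aequitas's code)."""

    def __init__(self, slo_us, inc=0.01, dec=0.1):
        self.slo_us = slo_us            # latency SLO for this QoS (microseconds)
        self.p_admit = 1.0              # fraction of RPCs admitted into this QoS
        self.inc, self.dec = inc, dec   # additive increase / multiplicative decrease

    def classify(self):
        """Pick the QoS queue for an outgoing performance-critical RPC."""
        if random.random() < self.p_admit:
            return "QoS_high"           # admitted: mapped to a higher-weight WFQ queue
        return "QoS_low"                # downgraded when this QoS is overloaded

    def on_completion(self, measured_latency_us):
        """Adjust the admission probability based on observed RPC latency."""
        if measured_latency_us > self.slo_us:
            self.p_admit = max(0.05, self.p_admit * (1 - self.dec))  # SLO miss: admit less
        else:
            self.p_admit = min(1.0, self.p_admit + self.inc)         # SLO met: admit more
```

Because every sender runs the same loop against the same cluster-wide SLO, no central coordinator is needed, which is what makes an edge-based, sender-driven design attractive with legacy switches.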

The inception of this project can be traced back to the spring of 2019, when I visited Google to attend a networking summit. Nandita and I had been trying to collaborate on something since 2016(!), and the time seemed right with Yiwen ready to go for an internship. Over the course of a couple of summers in 2019 and 2020, and many months before and after, Yiwen managed to build a theoretical framework that added QoS support to the classic network calculus framework, in collaboration with Gautam. Yiwen also had a lot of support from Xian, PR, and unnamed others to get the simulator and the actual code implemented and deployed in Google datacenters. They implemented a public-facing/shareable version of the simulator as well! According to Amin and Nandita, this is one of the “big” problems, and I’m happy that we managed to pull it off.

This was my first time working with Google. Honestly, I was quite impressed by the rigor of the internal review process (even though the paper still got rejected a couple of times). It’s also my first paper with Gautam since 2012! He was great then, and he’s gotten even better in the past decade. Yiwen is also having a great year with a SIGCOMM paper right after his NSDI one.

This year the SIGCOMM PC accepted 55 out of 281 submissions after a couple of years of record-breaking acceptance rates.

Treehouse White Paper Released

We started the Treehouse project at the beginning of last summer with a mini workshop. It took a while to distill our core ideas, but we finally have a white paper that attempts to provide answers to some frequently asked questions: why now? what does it mean for cloud software to be carbon-aware? what are (some of) the ways to get there? why these? etc.

It’s only the beginning, and as we go deeper into our explorations, I’m sure the details will evolve and become richer. Nonetheless, I do believe that we have a compelling vision of the future where we can, among other things, enable reasonable tradeoffs between the performance and carbon consumption of cloud software stacks, accurately understand how much carbon our software systems are consuming, and optimally choose consumption sites, sources, and timing to best address the energy challenges we are all facing.

We look forward to hearing from the community. Watch out for exciting announcements coming soon from the Treehouse project!

Hydra Accepted to Appear at FAST’2022

Almost five years after Infiniswap, memory disaggregation is now a mainstream research topic. It goes by many names, but the idea of accessing (unused/stranded) memory over high-speed networks is now close to reality. Despite the many works in this line of research, a key remaining problem is ensuring resilience: how can applications recover quickly if remote memory fails, becomes corrupted, or is inaccessible? Keeping a copy on disk, as Infiniswap does, causes a performance bottleneck under failure, while keeping in-memory replicas doubles the memory overhead. We started Hydra to hit a sweet spot between these two extremes by applying lessons learned from our EC-Cache project to extremely small objects. Whereas EC-Cache explicitly focused on very large objects in the MB range, Hydra performs erasure coding at the 4KB page granularity within the microsecond timescales common for RDMA. Additionally, we extended Asaf’s CopySets idea to erasure coding to tolerate concurrent failures with low overhead.

We present Hydra, a low-latency, low-overhead, and highly available resilience mechanism for remote memory. Hydra can access erasure-coded remote memory within a single-digit μs read/write latency, significantly improving the performance-efficiency tradeoff over the state of the art: it performs similar to in-memory replication with 1.6× lower memory overhead. We also propose CodingSets, a novel coding group placement algorithm for erasure-coded data that provides load balancing while reducing the probability of data loss under correlated failures by an order of magnitude. With Hydra, even when only 50% of memory is local, unmodified memory-intensive applications achieve performance close to that of the fully in-memory case in the presence of remote failures and outperform the state-of-the-art remote-memory solutions by up to 4.35×.
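As a rough illustration of the data layout idea (and emphatically not Hydra’s implementation), the sketch below splits a 4 KB page into a few data splits plus a single XOR parity split, which tolerates the loss of any one split. Hydra itself uses general erasure codes over RDMA to tolerate multiple concurrent failures; the constants `PAGE_SIZE` and `K` and the helper names are assumptions for illustration only.

```python
from functools import reduce

PAGE_SIZE = 4096   # remote memory is accessed at 4 KB page granularity
K = 4              # illustrative number of data splits per page

def xor_bytes(chunks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

def encode_page(page: bytes):
    """Split one page into K data splits plus one XOR parity split."""
    assert len(page) == PAGE_SIZE
    split_len = PAGE_SIZE // K
    splits = [page[i * split_len:(i + 1) * split_len] for i in range(K)]
    return splits + [xor_bytes(splits)]   # K + 1 splits, placed on distinct hosts

def decode_page(splits):
    """Rebuild the page when at most one of the K + 1 splits is lost (None)."""
    if None in splits:
        idx = splits.index(None)
        survivors = [s for s in splits if s is not None]
        splits = splits[:idx] + [xor_bytes(survivors)] + splits[idx + 1:]
    return b"".join(splits[:K])
```

The appeal of this layout is that a read only needs enough splits to reconstruct the page, so a slow or failed remote host can be masked without keeping a full second copy in memory.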

Youngmoon started working on Hydra right after we presented Infiniswap in early 2017! As Youngmoon graduated, Hasan started leading the project in early 2019, and Asaf joined us later that year. Together they significantly improved the paper over its early drafts. Even then, Hydra faced immense challenges. In the process, Hydra has taken over from Justitia the notorious distinction of holding my current record for accepted-after-N-submissions.

This was my first time submitting to FAST.

Juncheng Levels Up. Congrats Dr. Gu!

My first Ph.D. student Juncheng Gu graduated earlier this month after successfully defending his dissertation titled “Efficient Resource Management for Deep Learning Clusters.” This is a bittersweet moment. While I am extremely proud of everything he has done, I will miss having him around. I do know that a bigger stage awaits him; Juncheng is joining the ByteDance AI Lab to build practical systems for AI and machine learning!

Juncheng started his Ph.D. in the Fall of 2015, right before I started at Michigan. I joined his then-advisor Kang Shin to co-advise him as he started working on a precursor to Infiniswap as a term project for the EECS 582 course I was teaching. Since then, Juncheng worked on many projects that ranged across hardware, systems, and machine learning/computer vision with varying levels of luck and success, but they were all meaningful works. I consider him a generalist in his research taste. Infiniswap and Tiresias stand out the most among his projects. Infiniswap heralded the rise of many follow-ups we see today on the topic of memory disaggregation. It was the first of its kind and introduced many around the world to this new area of research. Tiresias was one of the earliest works on GPU cluster management and certainly the first that did not require any prior knowledge about deep learning jobs’ characteristics to effectively allocate GPUs for them and to schedule them. To this day, it is the best of its kind for distributed deep learning training. I am honored to have had the opportunity to advise Juncheng.

Juncheng is a great researcher, but he is an even better person. He is very down-to-earth and tries his best to help others out whenever possible. He also understates and underestimates what he can do and has achieved, often to a fault.

I wish him a fruitful career and a prosperous life!

Justitia Accepted to Appear at NSDI’2022

The need for higher throughput and lower latency is driving kernel-bypass networking (KBN) in datacenters. Of the two related trends in KBN, hardware-based KBN is especially challenging because, unlike software KBN such as DPDK, it does not provide any control once a request is posted to the hardware. RDMA, which is the most prevalent form of hardware KBN in practice, is completely opaque. RDMA NICs (RNICs), whether they use InfiniBand, RoCE, or iWARP, have fixed-function scheduling algorithms programmed into them. Like any other networking component, they also suffer from performance isolation issues when multiple applications compete for RNIC resources. Justitia is our attempt at introducing software control in hardware KBN.

Kernel-bypass networking (KBN) is becoming the new norm in modern datacenters. While hardware-based KBN offloads all dataplane tasks to specialized NICs to achieve better latency and CPU efficiency than software-based KBN, it also takes away the operator’s control over network sharing policies. Providing policy support in multi-tenant hardware KBN brings unique challenges — namely, preserving ultra-low latency and low CPU cost, finding a well-defined point of mediation, and rethinking traffic shapers. We present Justitia to address these challenges with three key design aspects: (i) Split Connection with message-level shaping, (ii) sender-based resource mediation together with receiver-side updates, and (iii) passive latency monitoring. Using a latency target as its knob, Justitia enables multi-tenancy policies such as predictable latencies and fair/weighted resource sharing. Our evaluation shows Justitia can effectively isolate latency-sensitive applications at the cost of slightly decreased utilization and ensure that throughput and bandwidth of the rest are not unfairly penalized.
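The sketch below is a simplified, hypothetical model of the message-level shaping idea: large messages are split into chunks so that small, latency-sensitive messages never wait behind a huge transfer, and chunks are released only when a pacing token is available. The chunk size, token rate, and class names are illustrative assumptions; Justitia’s actual datapath works on RDMA operations and adjusts the allowed rate using passive latency monitoring against the latency target.

```python
CHUNK_BYTES = 1 << 20   # illustrative 1 MB split size for large messages

def split_message(msg: bytes, chunk_bytes: int = CHUNK_BYTES):
    """Split one application message into independently schedulable chunks."""
    return [msg[i:i + chunk_bytes] for i in range(0, len(msg), chunk_bytes)]

class TokenPacer:
    """Release at most one chunk per token; the token rate stands in for the
    sender's current allowance, which the real system tunes from passively
    monitored latency versus the operator-set latency target."""

    def __init__(self, tokens_per_sec, now):
        self.rate = tokens_per_sec
        self.tokens = 0.0
        self.last = now

    def try_send(self, now):
        # Replenish tokens for the elapsed time, capped at one outstanding token.
        self.tokens = min(1.0, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True    # post the next chunk to the RNIC
        return False       # hold the chunk and retry later
```

Splitting plus pacing is what gives software back a point of mediation even though the RNIC itself remains a fixed-function black box.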

Yiwen started working on this problem when we first observed RDMA isolation issues in Infiniswap. He even wrote a short paper at KBNets 2017 based on his early findings. Yue worked on it for quite a few months before she went to Princeton for her Ph.D. Brent has been helping us get this work into shape since the beginning. It’s been a long and arduous road; every time we fixed something, new reviewers didn’t like something else. Finally, an NSDI revision allowed us to directly address the most pressing concerns. Without commenting on how much the paper has improved after all these iterations, I can say that adding revisions to NSDI has saved us, especially Yiwen, a lot of frustration. For what it’s worth, Justitia now has the notorious distinction of holding my current record for accepted-after-N-submissions; it’s been so long that I’ve lost track of the exact value of N!

Treehouse Funded. Thanks NSF and VMware!

Treehouse, our proposal on energy-first software infrastructure design for sustainable cloud computing, has recently won three million dollars in joint funding from NSF and VMware! Here is a link to the VMware press release. SymbioticLab now has a chance to collaborate with a stellar team consisting of Tom Anderson, Adam Belay, Asaf Cidon, and Irene Zhang, and I’m super excited about the changes Treehouse will bring to how we do sustainable cloud computing in the near future.

Our team started with Asaf and me looking to go beyond memory disaggregation in the early Fall of 2020. Soon we managed to convince Adam to join us, and he introduced Irene to the idea; both of them are experts in efficient software design. Finally, together we managed to get Tom to lead our team, and many ideas about an energy-first redesign of cloud computing software stacks started to sprout.

The early steps in building Treehouse are already under way. Last month, we had our inaugural mini-workshop with more than 20 attendees, and I expect it to only grow bigger in the near future. Stay tuned!

Kayak Accepted to Appear at NSDI’2021

As memory disaggregation, and resource disaggregation in general, becomes popular, one must make a call about whether to keep moving data from remote memory or to sometimes ship compute to remote data instead. This is not a new problem, even in the context of disaggregated datacenters: the notion of data locality and its associated challenges are rooted in the same observation. As the ratio between the compute and communication needs of an application changes, and as the speed of the network changes over time, the answer has changed many times. Oftentimes, the solutions boil down to understanding workload and network characteristics and making a call: ship compute or ship data for that workload on that network. But what if the workload is dynamic? Kayak dynamically decides between the two options, making the right call at the right time.

How cloud applications should interact with their data remains an active area of research. Over the last decade, many have suggested relying on a key-value (KV) interface to interact with data stored in remote storage servers, while others have vouched for the benefits of using remote procedure call (RPC). Instead of choosing one over the other, in this paper, we observe that an ideal solution must adaptively combine both of them in order to maximize throughput while meeting application latency requirements. To this end, we propose a new system called Kayak that proactively adjusts the rate of requests and the fraction of requests to be executed using RPC or KV, all in a fully decentralized and self-regulated manner. We theoretically prove that Kayak can quickly converge to the optimal parameters. We implement a system prototype of Kayak. Our evaluations show that Kayak achieves sub-second convergence and improves overall throughput by 32.5%-63.4% for compute-intensive workloads and up to 12.2% for non-compute-intensive and transactional workloads over the state-of-the-art.
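As a toy illustration of the decision Kayak automates, the hill-climbing sketch below nudges the fraction of requests executed as RPCs (shipping compute) versus KV reads (shipping data) in whichever direction improved measured throughput. Kayak’s real controller also adjusts the request rate and comes with a convergence proof; the class, step size, and update rule here are illustrative assumptions, not Kayak’s algorithm.

```python
class ComputeVsDataController:
    """Hill-climb the fraction of requests shipped as RPCs vs. KV reads."""

    def __init__(self, step=0.05):
        self.rpc_fraction = 0.5    # start by splitting requests evenly
        self.step = step
        self.prev_tput = None
        self.direction = +1

    def next_fraction(self, measured_tput):
        """Call once per measurement interval with the observed throughput."""
        if self.prev_tput is not None and measured_tput < self.prev_tput:
            self.direction *= -1   # the last move hurt throughput; reverse course
        self.prev_tput = measured_tput
        self.rpc_fraction = min(1.0, max(0.0,
            self.rpc_fraction + self.direction * self.step))
        return self.rpc_fraction
```

Because each server can run such a loop on its own measurements, the approach stays decentralized and self-regulated, which is the property the abstract emphasizes.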

The core idea behind Kayak is all due to Jie (Jimmy), who started the project in early 2019. The project picked up significant tailwind when Xin and his student Jingfeng joined us in the summer of 2020 to help us better understand the problem from an analytical perspective, in addition to better positioning the work. I’m super excited about the possibilities and Kayak’s potential impact on resource disaggregation systems.

The NSDI PC this year accepted 40 out of 255 submissions for the Fall deadline, resulting in a lower acceptance rate than last year.

Presented Keynote Talk at CloudNet’2020

Earlier this week, I presented a keynote talk on the state of network-informed data systems design at the CloudNet’2020 conference, with a specific focus on our recent works on memory disaggregation (Infiniswap, Leap, and NetLock), and discussed the many open challenges toward making memory disaggregation practical.

In this talk, I discussed the motivation behind disaggregating memory, or any other expensive resource for that matter. High-performance data systems strive to keep data in main memory. They often over-provision to avoid running out of memory, leading to 50% average memory underutilization in Google, Facebook, and Alibaba datacenters. The root cause is simple: applications today cannot access otherwise unused memory beyond their machine boundaries even when their performance grinds to a halt. But could they? Over the course of the last five years, our research in the SymbioticLab has addressed, and continues to address, this fundamental question of memory disaggregation, whereby an application can use both local and remote memory over emerging high-speed networks.

I also highlighted at least eight major challenges that any memory disaggregation solution must address to even have a shot at becoming practical and widely used. These include applicability to a large variety of applications without requiring any application-level changes; scalability to large datacenters; efficiency in using up all available memory; high performance when using disaggregated memory; performance isolation from other datacenter traffic; resilience in the presence of failures and unavailability; security from others; and generality to a variety of memory technologies beyond DRAM. While this may come across as a laundry list of problems, we do believe that a complete solution must address each one of them.

In this context, I discussed three projects: Infiniswap, which achieves applicability using remote memory paging, and scalability and efficiency using decentralized algorithms; Leap, which improves performance by prefetching; and NetLock, which shows how to disaggregate programmable switch memory. I also pointed out a variety of ongoing projects toward our ultimate goal of unified, practical memory disaggregation.

My slides from this talk are publicly available and have more details elaborating these points.

Leap Wins the Best Paper Award at ATC’2020. Congrats Hasan!

Leap, the fastest memory disaggregation system to date, has won a best paper award at this year’s USENIX ATC conference!

This is a happy outcome of Hasan’s persistence on this project for more than two years. From coming up with the core idea to executing it well, Hasan has done a fantastic job so far; I expect more great outcomes from his ongoing work.

Leap is open-source and available at https://github.com/symbioticlab/leap.

NetLock Accepted to Appear at SIGCOMM’2020

High-throughput, low-latency lock managers are useful for building a variety of distributed applications. Traditionally, a key tradeoff in this context has been expressed in terms of the amount of knowledge available to the lock manager. On the one hand, a decentralized lock manager can increase throughput through parallelization, but it can starve certain categories of applications. On the other hand, a centralized lock manager can avoid starvation and impose resource sharing policies, but it can be limited in throughput. In SIGMOD’18, we presented DSLR, which attempted to mitigate this tradeoff in clusters with fast RDMA networks by adapting Lamport’s bakery algorithm to RDMA’s fetch-and-add (FA) operations in a decentralized design. The downside is that we couldn’t implement complex policies that need centralized information.
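For intuition, here is a local, in-process sketch of the ticket-style idea that the bakery-over-FA design builds on: each lock keeps two counters, clients atomically fetch-and-add to take a ticket, and they wait until the serving counter reaches it. DSLR performs these updates with one-sided RDMA FA on a remote lock table; the Python emulation below, including the mutex standing in for the RNIC’s atomic, is an illustrative assumption, not DSLR’s code.

```python
import threading, time

class TicketLock:
    """Two counters per lock; acquire/release via (emulated) fetch-and-add."""

    def __init__(self):
        self.next_ticket = 0            # FA target for acquirers
        self.now_serving = 0            # FA target for releasers
        self._fa_guard = threading.Lock()

    def _fetch_and_add(self, field, delta=1):
        with self._fa_guard:            # stands in for the NIC's atomic FA
            old = getattr(self, field)
            setattr(self, field, old + delta)
            return old

    def acquire(self):
        my_ticket = self._fetch_and_add("next_ticket")
        while self.now_serving != my_ticket:   # one-sided reads in the RDMA setting
            time.sleep(0)                      # yield between polls
        return my_ticket

    def release(self):
        self._fetch_and_add("now_serving")
```

The appeal is that acquiring a lock needs no server CPU at all, which is exactly why enforcing richer, centralized policies is hard in this design.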

What if we could have a high-speed centralized point that all remote traffic must go through anyway? NetLock is our attempt at doing just that: it implements a centralized lock manager in a programmable switch working in tandem with the servers. The co-design is important to work around the resource limitations of the switch. By carefully caching hot locks in the switch and moving warm and cold ones to the servers, we can meet both the performance and policy goals of a lock manager without significant compromise in either.

Lock managers are widely used by distributed systems. Traditional centralized lock managers can easily support policies between multiple users using global knowledge, but they suffer from low performance. In contrast, emerging decentralized approaches are faster but cannot provide flexible policy support. Furthermore, performance in both cases is limited by the server capability.

We present NetLock, a new centralized lock manager that co-designs servers and network switches to achieve high performance without sacrificing flexibility in policy support. The key idea of NetLock is to exploit the capability of emerging programmable switches to directly process lock requests in the switch data plane. Due to the limited switch memory, we design a memory management mechanism to seamlessly integrate the switch and server memory. To realize the locking functionality in the switch, we design a custom data plane module that efficiently pools multiple register arrays together to maximize memory utilization. We have implemented a NetLock prototype with a Barefoot Tofino switch and a cluster of commodity servers. Evaluation results show that NetLock improves the throughput by 14.0–18.4×, and reduces the average and 99% latency by 4.7–20.3× and 10.4–18.7×, respectively, over DSLR, a state-of-the-art RDMA-based solution, while providing flexible policy support.
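The sketch below is a deliberately simplified, non-P4 model of the memory split described above: the most frequently requested lock IDs are kept in the switch’s limited register slots, and requests for everything else fall back to lock servers. The capacity constant, class name, and frequency-based selection are illustrative assumptions; NetLock’s switch side is actually implemented in the Tofino data plane.

```python
from collections import Counter

SWITCH_SLOTS = 1024   # illustrative capacity of the switch's register arrays

class LockPlacement:
    """Keep the hottest locks in switch memory; route the rest to servers."""

    def __init__(self, observed_lock_requests):
        # Choose the most frequently requested locks, up to switch capacity.
        freq = Counter(observed_lock_requests)
        self.in_switch = {lid for lid, _ in freq.most_common(SWITCH_SLOTS)}

    def route(self, lock_id):
        """Return where a request for lock_id should be processed."""
        return "switch" if lock_id in self.in_switch else "server"
```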

Xin and I came up with the idea for this project over a couple of meals in San Diego at OSDI’18, and later Zhuolong and Yiwen expanded and successfully executed our ideas, which led to NetLock. Similar to DSLR, NetLock explores a different design point in our larger memory disaggregation vision.

This year’s SIGCOMM probably has the highest acceptance rate in 25 years, if not longer. After a long successful run at SIGCOMM and a small break doing many other exciting things, it’s great to be back to some networking research! But going forward, I’m hoping for much more along these lines, both inside the network and at the edge.