Tag Archives: Datacenter Networking

Aequitas Accepted to Appear at SIGCOMM’2022

Although Remote Procedure Calls (RPCs) generate the bulk of the network traffic in modern disaggregated datacenters, network-level optimizations still focus on low-level metrics at the granularity of packets, flowlets, and flows. The lack of application-level semantics leads to suboptimal performance, as there is often a mismatch between network- and application-level metrics. Indeed, my work on coflows relied on a similar observation for distributed data-parallel jobs. In this work, we focus on leveraging application- and user-level semantics of RPCs to improve application- and user-perceived performance. Specifically, we implement quality-of-service (QoS) primitives at the RPC level by extending classic works on service curves and network calculus. We show that this can be achieved using an edge-based solution in conjunction with traditional QoS classes in legacy/non-programmable switches.

With the increasing popularity of disaggregated storage and microservice architectures, high fan-out and fan-in Remote Procedure Calls (RPCs) now generate most of the traffic in modern datacenters. While the network plays a crucial role in RPC performance, traditional traffic classification categories cannot sufficiently capture their importance due to wide variations in RPC characteristics. As a result, meeting service-level objectives (SLOs), especially for performance-critical (PC) RPCs, remains challenging.

We present Aequitas, a distributed sender-driven admission control scheme that uses commodity Weighted-Fair Queuing (WFQ) to guarantee RPC-level SLOs. Aequitas maps PC RPCs to higher-weight queues. In the presence of network overloads, it enforces cluster-wide RPC latency SLOs by limiting the amount of traffic admitted into any given QoS and downgrading the rest. We show analytically and empirically that this simple scheme works well. When the network demand spikes beyond provisioned capacity, Aequitas achieves a latency SLO that is 3.8× lower than state-of-the-art congestion control at the 99.9th percentile and admits up to 2× more PC RPCs meeting SLO when compared with pFabric, Qjump, D3, PDQ, and Homa. Results in a fleet-wide production deployment at a large cloud provider show a 10% latency improvement.
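
To make the mechanism concrete, here is a minimal Python sketch of sender-driven, per-QoS admission control in the spirit of the abstract above, assuming an AIMD-style update of the admission probability driven by measured RPC latencies; the class, names, and constants are all illustrative and not taken from the actual implementation.

```python
# Minimal sketch (assumed names/constants): each sender keeps a per-QoS
# admission probability and nudges it AIMD-style from observed RPC latencies.
import random

QOS_HIGH, QOS_LOW = "wfq_high", "wfq_low"  # two commodity WFQ classes

class AdmissionController:
    def __init__(self, slo_ms, incr=0.01, decr=0.5):
        self.p_admit = 1.0    # fraction of PC RPCs admitted into QOS_HIGH
        self.slo_ms = slo_ms
        self.incr, self.decr = incr, decr

    def on_rpc_complete(self, latency_ms):
        # Additive increase while the SLO is met; multiplicative decrease on
        # misses, so admitted traffic stays within what the QoS can carry.
        if latency_ms <= self.slo_ms:
            self.p_admit = min(1.0, self.p_admit + self.incr)
        else:
            self.p_admit = max(0.0, self.p_admit * self.decr)

    def classify(self):
        # Admit into the higher-weight queue with probability p_admit;
        # downgrade (rather than drop) the rest.
        return QOS_HIGH if random.random() < self.p_admit else QOS_LOW
```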

The inception of this project can be traced back to the spring of 2019, when I visited Google to attend a networking summit. Nandita and I had been trying to collaborate on something since 2016(!), and the time seemed right with Yiwen ready to go for an internship. Over the course of a couple of summers in 2019 and 2020, and many months before and after, Yiwen managed to build a theoretical framework that added QoS support to the classic network calculus framework in collaboration with Gautam. Yiwen also had a lot of support from Xian, PR, and unnamed others to get the simulator and the actual code implemented and deployed in Google datacenters. They implemented a public-facing/shareable version of the simulator as well! According to Amin and Nandita, this is one of the “big” problems, and I’m happy that we managed to pull it off.

This was my first time working with Google. Honestly, I was quite impressed by the rigor of the internal review process (even though the paper still got rejected a couple of times). It’s also my first paper with Gautam since 2012! He was great then, and he’s gotten even better in the past decade. Yiwen is also having a great year, with a SIGCOMM paper right after his NSDI one.

This year the SIGCOMM PC accepted 55 out of 281 submissions after a couple of years of record-breaking acceptance rates.

Justitia Accepted to Appear at NSDI’2022

The need for higher throughput and lower latency is driving kernel-bypass networking (KBN) in datacenters. Of the two related trends in KBN, hardware-based KBN is especially challenging because, unlike software KBN such as DPDK, it does not provide any control once a request is posted to the hardware. RDMA, which is the most prevalent form of hardware KBN in practice, is completely opaque. RDMA NICs (RNICs), whether they are using InfiniBand, RoCE, or iWARP, have fixed-function scheduling algorithms programmed in them. Like any other networking component, they also suffer from performance isolation issues when multiple applications compete for RNIC resources. Justitia is our attempt at introducing software control in hardware KBN.

Kernel-bypass networking (KBN) is becoming the new norm in modern datacenters. While hardware-based KBN offloads all dataplane tasks to specialized NICs to achieve better latency and CPU efficiency than software-based KBN, it also takes away the operator’s control over network sharing policies. Providing policy support in multi-tenant hardware KBN brings unique challenges — namely, preserving ultra-low latency and low CPU cost, finding a well-defined point of mediation, and rethinking traffic shapers. We present Justitia to address these challenges with three key design aspects: (i) Split Connection with message-level shaping, (ii) sender-based resource mediation together with receiver-side updates, and (iii) passive latency monitoring. Using a latency target as its knob, Justitia enables multi-tenancy policies such as predictable latencies and fair/weighted resource sharing. Our evaluation shows Justitia can effectively isolate latency-sensitive applications at the cost of slightly decreased utilization and ensure that throughput and bandwidth of the rest are not unfairly penalized.
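
For flavor, here is a minimal Python model of the three design aspects, assuming (with my own naming) a Shaper that splits large messages into chunks and paces them against a rate that a passive latency monitor adjusts toward the operator's latency target; the chunk size, multipliers, and API are illustrative, not Justitia's actual implementation.

```python
# Minimal sketch (assumed API and constants): message-level shaping with a
# rate that passive latency monitoring adjusts toward a latency target.
import time

CHUNK_BYTES = 1 << 20  # split threshold for "Split Connection" (assumed)

class Shaper:
    def __init__(self, latency_target_us, rate_bytes_per_s):
        self.target = latency_target_us
        self.rate = rate_bytes_per_s  # mediated share for bandwidth-hungry flows

    def on_probe_latency(self, measured_us):
        # Passive latency monitoring: small reference messages report their
        # latency; tighten pacing above target, relax it below.
        self.rate *= 0.8 if measured_us > self.target else 1.05

    def send(self, post_chunk, msg_bytes):
        # Split one large application message into paced chunks so the
        # mediator regains control between hardware submissions.
        sent = 0
        while sent < msg_bytes:
            chunk = min(CHUNK_BYTES, msg_bytes - sent)
            time.sleep(chunk / self.rate)   # token pacing, simplified
            post_chunk(sent, chunk)         # e.g., post one RDMA work request
            sent += chunk

# Usage: shaper = Shaper(10, 1e9); shaper.send(lambda off, n: None, 5 << 20)
```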

Yiwen started working on this problem when we first observed RDMA isolation issues in Infiniswap. He even wrote a short paper in KBNets 2017 based on his early findings. Yue worked on it for quite a few months before she went to Princeton for her Ph.D. Brent has been helping us get this work into shape since the beginning. It’s been a long and arduous road; every time we fixed something, new reviewers didn’t like something else. Finally, an NSDI revision allowed us to directly address the most pressing concerns. Without commenting on how much the paper has improved after all these iterations, I can say that adding revisions to NSDI has saved us, especially Yiwen, a lot of frustration. For what it’s worth, Justitia now holds the notorious distinction of my current record for accepted-after-N-submissions; it’s been so long that I’ve lost track of the exact value of N!

AIFO Accepted to Appear at SIGCOMM’2021

Packet scheduling is a classic problem in networking. In recent years, however, the focus has somewhat shifted from designing new scheduling algorithms to designing generalized frameworks that can be programmed to approximate a variety of scheduling disciplines. Push-In First-Out (PIFO) from SIGCOMM 2016 is one such framework that has been shown to be quite expressive. Following that, a variety of solutions have attempted to make implementing PIFO more practical, with the key problem being minimizing the number of priority levels needed. SP-PIFO is a recent take showing that a handful of queues is enough. This leaves us with an obvious question: what is the minimum number of queues one needs to approximate PIFO? We show that the answer is just one.

Programmable packet scheduling enables scheduling algorithms to be programmed into the data plane without changing the hardware. Existing proposals either have no hardware implementations or require multiple strict-priority queues.

We present Admission-In First-Out (AIFO) queues, a new solution for programmable packet scheduling that uses only a single first-in first-out queue. AIFO is motivated by the confluence of two recent trends: shallow buffers in switches and fast-converging congestion control in end hosts, which together lead to a simple observation: the decisive factor in a flow’s completion time (FCT) in modern datacenter networks is often which packets are enqueued or dropped, not the order in which they leave the switch. The core idea of AIFO is to maintain a sliding window to track the ranks of recent packets and to compute the relative rank of an arriving packet in the window for admission control. Theoretically, we prove that AIFO provides bounded performance relative to Push-In First-Out (PIFO). Empirically, we fully implement AIFO and evaluate it with a range of real workloads, demonstrating that AIFO closely approximates PIFO. Importantly, unlike PIFO, AIFO can run at line rate on existing hardware and uses minimal switch resources—as little as a single queue.
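
Since the whole scheme fits in a few lines, here is a small Python model of the admission logic as I understand it from the paper; the window size, the parameter k, and the exact condition are simplified and illustrative.

```python
# Simplified sketch of the admission rule (window size and parameter k are
# illustrative): admit while the queue is nearly empty, otherwise admit only
# packets whose rank-quantile fits within the queue's remaining headroom.
from collections import deque

class AIFOQueue:
    def __init__(self, capacity, window=64, k=0.1):
        self.capacity = capacity            # FIFO capacity C (packets)
        self.k = k                          # burst-tolerance parameter
        self.window = deque(maxlen=window)  # sliding window of recent ranks
        self.occupancy = 0                  # current queue length c

    def admit(self, rank):
        self.window.append(rank)
        # Relative rank: fraction of recent packets with a smaller rank.
        quantile = sum(r < rank for r in self.window) / len(self.window)
        headroom = (self.capacity - self.occupancy) / self.capacity
        return (self.occupancy <= self.k * self.capacity
                or quantile <= headroom / (1 - self.k))

    def enqueue(self, rank):
        if self.admit(rank) and self.occupancy < self.capacity:
            self.occupancy += 1
            return True
        return False                        # packet dropped at admission

    def dequeue(self):
        if self.occupancy:
            self.occupancy -= 1             # FIFO drain; order never changes
```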

Although programmable packet scheduling has been quite popular for more than five years, I started paying careful attention only after the SP-PIFO presentation at NSDI 2020. I felt that we should be able to approximate something like that with even fewer priority classes, especially by using something similar to Foreground-Background scheduling, which needs only two priorities. Xin had been thinking about the problem even longer given his vast experience in programmable switches and approached me after submitting Kayak to NSDI 2021. Xin pointed out that two priorities need only one queue with an admission control mechanism in front! I’m glad he roped me in, as it’s always a pleasure working with him and Zhuolong. It seems unbelievable even to me that this is my first packet scheduling paper!

This year SIGCOMM has broken the acceptance record once again by accepting 55 out of 241 submissions into the program!

Thanks Google for Supporting Our Research

We have been doing a lot of work on systems for AI in recent years and have historically done some work on application-aware networking. This support will help us combine the two directions to provide datacenter network support for AI workloads, both at the edge and inside the network.

Many thanks to Nandita Dukkipati for championing our cause. We’ve been trying to find a good project to collaborate on since 2016 when we had a chat at a Dagstuhl workshop!

NetLock Accepted to Appear at SIGCOMM’2020

High-throughput, low-latency lock managers are useful for building a variety of distributed applications. Traditionally, a key tradeoff in this context has been expressed in terms of the amount of knowledge available to the lock manager. On the one hand, a decentralized lock manager can increase throughput by parallelization, but it can starve certain categories of applications. On the other hand, a centralized lock manager can avoid starvation and impose resource sharing policies, but it can be limited in throughput. In SIGMOD’18, we presented DSLR, which attempted to mitigate this tradeoff in clusters with fast RDMA networks by adapting Lamport’s bakery algorithm to RDMA’s fetch-and-add (FA) operations to design a decentralized solution. The downside was that we couldn’t implement complex policies that need centralized information.

What if we could have a high-speed centralized point that all remote traffic must go through anyway? NetLock is our attempt at doing just that by implementing a centralized lock manager in a programmable switch working in tandem with the servers. The co-design is important to work around the resource limitations of the switch. By carefully caching hot locks and moving warm and cold ones to the servers, we can meet both the performance and policy goals of a lock manager without significant compromise in either.

Lock managers are widely used by distributed systems. Traditional centralized lock managers can easily support policies between multiple users using global knowledge, but they suffer from low performance. In contrast, emerging decentralized approaches are faster but cannot provide flexible policy support. Furthermore, performance in both cases is limited by the server capability.

We present NetLock, a new centralized lock manager that co-designs servers and network switches to achieve high performance without sacrificing flexibility in policy support. The key idea of NetLock is to exploit the capability of emerging programmable switches to directly process lock requests in the switch data plane. Due to the limited switch memory, we design a memory management mechanism to seamlessly integrate the switch and server memory. To realize the locking functionality in the switch, we design a custom data plane module that efficiently pools multiple register arrays together to maximize memory utilization. We have implemented a NetLock prototype with a Barefoot Tofino switch and a cluster of commodity servers. Evaluation results show that NetLock improves the throughput by 14.0–18.4×, and reduces the average and 99% latency by 4.7–20.3× and 10.4–18.7×, over DSLR, a state-of-the-art RDMA-based solution, while providing flexible policy support.
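
To illustrate the switch–server split, here is a toy Python model of the control flow (the real data plane module is written in P4 for a Tofino switch; all names here are mine): hot locks get bounded wait queues in switch memory, and anything else, or any overflow, is forwarded to a lock server.

```python
# Toy model of NetLock's control flow (illustrative names; the real data
# plane is P4): hot locks are queued and granted in the switch, while
# warm/cold locks and overflow fall back to lock servers.
class SwitchLockTable:
    def __init__(self, hot_locks, slots_per_lock=16):
        # One bounded wait queue per hot lock, carved out of the pooled
        # register arrays mentioned in the abstract.
        self.queues = {lid: [] for lid in hot_locks}
        self.slots = slots_per_lock

    def handle_acquire(self, lock_id, client, grant, forward_to_server):
        q = self.queues.get(lock_id)
        if q is None or len(q) >= self.slots:
            forward_to_server(lock_id, client)  # server path: warm/cold/overflow
            return
        q.append(client)
        if len(q) == 1:
            grant(client, lock_id)              # uncontended: granted in-switch

    def handle_release(self, lock_id, grant):
        q = self.queues[lock_id]
        q.pop(0)
        if q:
            grant(q[0], lock_id)                # FIFO hand-off to next waiter
```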

Xin and I came up with the idea for this project over a couple of meals in San Diego at OSDI’18, and later Zhuolong and Yiwen expanded and successfully executed our ideas, which led to NetLock. Similar to DSLR, NetLock explores a different design point in our larger memory disaggregation vision.

This year’s SIGCOMM probably has the highest acceptance rate in 25 years, if not more. After a long successful run at SIGCOMM and a small break doing many other exciting things, it’s great to be back to some networking research! But going forward, I’m hoping for much more along these lines both inside the network and at the edge.

Two Papers Accepted at APNet’2018 and MAMA’2018 Workshops

A couple more workshop papers got accepted in the last few days!

  1. Pas de Deux: Shape the Circuits, and Shape the Apps Too! — APNet’18
  2. Fair Allocation of Heterogeneous and Interchangeable Resources — MAMA’18

The APNet one is about coflows on optical networks, following my collaborations with Hong and Kai at HKUST (SIGCOMM’16 and SIGCOMM’17), while the MAMA one is with Zhenhua and his very capable students Xiao and Tan, following up on my work with Zhenhua (NSDI’16).

Look forward to the longer versions soon.

DSLR Accepted to Appear at SIGMOD’2018

High-throughput, low-latency lock managers are useful for building a variety of distributed applications. A key tradeoff in this context can be expressed in terms of the amount of knowledge available to the lock manager. On the one hand, a decentralized lock manager can increase throughput by parallelization, but it can starve certain categories of applications. On the other hand, a centralized lock manager can avoid starvation and impose resource sharing policies, but it can be limited in throughput. DSLR is our attempt at mitigating this tradeoff in clusters with fast RDMA networks. Specifically, we adapt Lamport’s bakery algorithm to RDMA’s fetch-and-add (FA) operations, which provides higher throughput and lower latency, and avoids starvation, in comparison to the state of the art.

Lock managers are a crucial component of modern distributed systems. However, with the increasing popularity of fast RDMA-enabled networks, traditional lock managers can no longer keep up with the latency and throughput requirements of modern systems. Centralized lock managers can ensure fairness and prevent starvation using global knowledge of the system, but are themselves single points of contention and failure. Consequently, they fall short in leveraging the full potential of RDMA networks. On the other hand, decentralized (RDMA-based) lock managers either completely sacrifice global knowledge to achieve higher throughput at the risk of starvation and higher tail latencies, or they resort to costly communications to maintain global knowledge, which can result in significantly lower throughput.

In this paper, we show that it is possible for a lock manager to be fully decentralized and yet exchange the partial knowledge necessary for preventing starvation and thereby reducing tail latencies. Our main observation is that we can design a lock manager using RDMA’s fetch-and-add (FA) operation, which always succeeds, rather than compare-and-swap (CAS), which only succeeds if a given condition is satisfied. While this requires us to rethink the locking mechanism from the ground up, it enables us to sidestep the performance drawbacks of the previous CAS-based proposals that relied solely on blind retries upon lock conflicts.

Specifically, we present DSLR (Decentralized and Starvation-free Lock management with RDMA), a decentralized lock manager that targets distributed systems with RDMA-enabled networks. We demonstrate that, despite being fully decentralized, DSLR prevents starvation and blind retries by providing first-come-first-serve (FCFS) scheduling without maintaining explicit queues. We adapt Lamport’s bakery algorithm [34] to an RDMA-enabled environment with multiple bakers, utilizing only one-sided READ and atomic FA operations. Our experiments show that DSLR delivers up to 2.8X higher throughput than all existing RDMA-based lock managers, while reducing their average and 99.9% latencies by up to 2.5X and 47X, respectively.
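
The ticket/bakery intuition is easy to show in a few lines. Below is a local Python stand-in, assuming the counters live in a lock word that clients update with fetch-and-add and poll with one-sided READs; real DSLR packs shared and exclusive counts into a single 64-bit word updated via RDMA verbs, which this toy version ignores.

```python
# Local stand-in for the FA-based bakery idea (illustrative only): taking a
# ticket always succeeds, so there are no blind CAS retries, and serving
# tickets in order yields FCFS without an explicit queue.
import itertools

class TicketLock:
    def __init__(self):
        self._tickets = itertools.count()  # stands in for FA on "next ticket"
        self.now_serving = 0               # stands in for the "serving" counter

    def acquire(self):
        my_ticket = next(self._tickets)    # RDMA fetch-and-add equivalent
        while self.now_serving != my_ticket:
            pass                           # one-sided READ polling, simplified
        return my_ticket

    def release(self):
        self.now_serving += 1              # hand the lock to the next ticket
```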

Barzan and I started this project with Dong Young in 2016 right after I joined Michigan, as our interests matched in terms of new and interesting applications of RDMA primitives. It’s exciting to see our work turn into my first SIGMOD paper. As we work on rack-scale/resource disaggregation over RDMA, we are seeing more exciting use cases of RDMA, going beyond key-value stores and designing new RDMA networking protocols. Stay tuned!

Hermes Accepted to Appear at SIGCOMM’2017

Datacenter load balancing, especially in Clos topologies, remains a hot topic even after almost a decade. The pace of progress has picked up over the last few years with multiple solutions exploring different extremes of the solution space, ranging from edge-based to in-network solutions and using different granularities of load balancing: packets, flowcells, flowlets, or flows. Throughout all these efforts, load balancing granularity has always been either fixed or passively determined. This inflexibility and lack of control lead to many inefficiencies. In Hermes, we take a more active approach: we comprehensively sense path conditions and actively direct traffic to ensure cautious yet timely rerouting.

Production datacenters operate under various uncertainties such as traffic dynamics, topology asymmetry, and failures. Therefore, datacenter load balancing schemes must be resilient to these uncertainties; i.e., they should accurately sense path conditions and react in a timely manner to mitigate the fallout. Despite significant efforts, prior solutions have important drawbacks. On the one hand, solutions such as Presto and DRB are oblivious to path conditions and blindly reroute at a fixed granularity; on the other hand, solutions such as CONGA and CLOVE can sense congestion, but they can only reroute when flowlets emerge and thus cannot always react to uncertainties in a timely manner. To make things worse, these solutions fail to detect/handle failures such as blackholes or random packet drops, which greatly degrades their performance.

In this paper, we propose Hermes, a datacenter load balancer that is resilient to the aforementioned uncertainties. At its heart, Hermes leverages comprehensive sensing to detect path conditions, including failures that previous solutions left unattended, and reacts through timely yet cautious rerouting. Hermes is a practical edge-based solution that requires no switch modification. We have implemented Hermes with commodity switches and evaluated it through both testbed experiments and large-scale simulations. Our results show that Hermes achieves performance comparable to CONGA and Presto in normal cases and handles uncertainties well: under asymmetries, Hermes achieves up to 10% and 40% better flow completion time (FCT) than CONGA and CLOVE; under switch failures, it significantly outperforms all other schemes by over 50%.
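
To make "comprehensive sensing plus cautious rerouting" concrete, here is a small Python sketch of the decision loop as I would summarize it; the signals, thresholds, and path states are illustrative, not Hermes's actual parameters.

```python
# Illustrative decision loop: classify paths from sensed signals, then
# reroute only when the current path looks congested/failed AND a clearly
# better path exists (each reroute risks reordering). Thresholds are made up.
from dataclasses import dataclass

GOOD, GRAY, BAD = range(3)

@dataclass
class PathStats:
    rtt_us: float    # smoothed RTT on this path
    ecn_frac: float  # fraction of recently ECN-marked packets
    timeouts: int    # retransmission timeouts observed

@dataclass
class Flow:
    path: str        # id of the path this flow currently uses

def classify(s, rtt_thresh_us=300.0, ecn_thresh=0.4):
    if s.timeouts > 0:                    # blackhole / random-drop suspect
        return BAD
    if s.ecn_frac > ecn_thresh or s.rtt_us > rtt_thresh_us:
        return GRAY
    return GOOD

def maybe_reroute(flow, paths, margin_us=100.0):
    cur = paths[flow.path]
    if classify(cur) == GOOD:
        return flow.path                  # cautious: never touch a good path
    best = min(paths, key=lambda p: paths[p].rtt_us)
    if best != flow.path and paths[best].rtt_us + margin_us < cur.rtt_us:
        flow.path = best                  # timely: switch when clearly better
    return flow.path
```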

While I have spent a lot of time working on datacenter network-related problems, my focus has always been on enabling application-awareness in the network using coflows and multi-tenancy issues; I have actively stayed away from lower level details. So when Hong and Kai brought up this load balancing problem after last SIGCOMM, I was a bit apprehensive. I became interested when they posed the problem as an interaction challenge between lower-level load balancing solutions and transport protocols, and I’m glad I got involved. As always, it’s been a pleasure working with Hong and Kai. I’ve been closely working with Hong for about two years now, leading to two first-authored SIGCOMM papers under his belt; at this point, I feel like his unofficial co-advisor. Junxue and Wei were instrumental in getting the experiments done in time.

This year the SIGCOMM PC accepted 36 papers out of 250 submissions with a 14.4% acceptance rate. The number of accepted papers went down and the number of submissions up, which led to the lowest acceptance rate since SIGCOMM 2012. This was also my first time on SIGCOMM PC.

FaiRDMA Accepted to Appear at KBNets’2017

As cloud providers deploy RDMA in their datacenters and developers rewrite/update their applications to use RDMA primitives, a key question remains open: what will happen when multiple RDMA-enabled applications must share the network? Surprisingly, this simple question does not yet have a conclusive answer. This is because existing work focuses primarily on improving individual applications’ performance using RDMA on a case-by-case basis. So we took a simple step. Yiwen built a benchmarking tool and ran some controlled experiments to find some very interesting results: there is often no performance isolation, and many factors beyond the network itself determine RDMA performance. We are currently investigating the root causes and working on mitigating them.

To meet the increasing throughput and latency demands of modern applications, many operators are rapidly deploying RDMA in their datacenters. At the same time, developers are re-designing their software to take advantage of RDMA’s benefits for individual applications. However, when it comes to RDMA’s performance, many simple questions remain open.

In this paper, we consider the performance isolation characteristics of RDMA. Specifically, we conduct three sets of experiments — three combinations of one throughput-sensitive flow and one latency-sensitive flow — in a controlled environment, observe large discrepancies in RDMA performance with and without the presence of a competing flow, and describe our progress in identifying plausible root-causes.
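
The experiment design is easy to script. Here is a rough sketch using the standard perftest utilities (ib_write_lat and ib_write_bw); the server address is a placeholder, a matching server-side process must be running for each client invocation, and flags and output formats vary across versions, so treat every detail here as an assumption rather than our actual benchmarking tool.

```python
# Rough harness (placeholders throughout): measure a latency-sensitive flow
# alone, then again while a throughput-sensitive flow competes for the RNIC.
import subprocess

SERVER = "192.0.2.10"  # placeholder; each run needs a matching server process

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True).stdout

solo = run(["ib_write_lat", SERVER])            # baseline latency, no contention

bw = subprocess.Popen(["ib_write_bw", SERVER])  # competing bandwidth flow
contended = run(["ib_write_lat", SERVER])       # latency under contention
bw.terminate()

print("solo run:\n", solo)
print("contended run:\n", contended)
```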

This work is an offshoot, among several others, of our Infiniswap project on rack-scale memory disaggregation. As time goes by, I increasingly appreciate Ion’s words of wisdom on building real systems to find real problems.

FTR, the KBNets PC accepted 9 papers out of 22 submissions this year. For a first-time workshop, I consider it a decent rate; there are some great papers in the program from both the industry and academia.

Received SIGCOMM Doctoral Dissertation Award

About a week or so ago Bruce Maggs, SIGCOMM’s awards chair, kindly informed me over the phone that my dissertation on coflows has been selected for the 2015 ACM SIGCOMM Doctoral Dissertation Award. The committee for the award included Ratul Mahajan, Dina Papagiannaki, Laurent Vanbever (chair), and Minlan Yu, and the citation reads:

Chowdhury’s dissertation provides novel and application-aware networking abstractions which significantly improve the performance of networked applications running in the cloud.

It’s a great honor to join some great researchers who’ve won this award before me.

It goes without saying that this achievement wouldn’t have been possible without my great advisor Ion Stoica keeping me on course, my excellent collaborators over the years, and many others who paved my way in multiple ways. I am also grateful to some of my biggest supporters: Randy Katz, Mike Franklin, Srikanth Kandula, and Vyas Sekar. An equal amount of credit, if not more, goes to my family and friends, who cheered me on and supported me for so many years. I don’t say it often enough: thank you!