Tag Archives: SIGCOMM

Aequitas Accepted to Appear at SIGCOMM’2022

Although Remote Procedure Calls (RPCs) generate the bulk of the network traffic in modern disaggregated datacenters, network-level optimizations still focus on low-level metrics at the granularity of packets, flowlets, and flows. This lack of application-level semantics leads to suboptimal performance because there is often a mismatch between network- and application-level metrics. Indeed, my work on coflows relied on a similar observation for distributed data-parallel jobs. In this work, we focus on leveraging application- and user-level semantics of RPCs to improve application- and user-perceived performance. Specifically, we implement quality-of-service (QoS) primitives at the RPC level by extending classic works on service curves and network calculus. We show that this can be achieved using an edge-based solution in conjunction with traditional QoS classes in legacy/non-programmable switches.

With the increasing popularity of disaggregated storage and microservice architectures, high fan-out and fan-in Remote Procedure Calls (RPCs) now generate most of the traffic in modern datacenters. While the network plays a crucial role in RPC performance, traditional traffic classification categories cannot sufficiently capture their importance due to wide variations in RPC characteristics. As a result, meeting service-level objectives (SLOs), especially for performance-critical (PC) RPCs, remains challenging.

We present Aequitas, a distributed sender-driven admission control scheme that uses commodity Weighted-Fair Queuing (WFQ) to guarantee RPC-level SLOs. Aequitas maps PC RPCs to higher-weight queues. In the presence of network overloads, it enforces cluster-wide RPC latency SLOs by limiting the amount of traffic admitted into any given QoS and downgrading the rest. We show analytically and empirically that this simple scheme works well. When network demand spikes beyond provisioned capacity, Aequitas achieves a latency SLO that is 3.8× lower than the state-of-the-art congestion control at the 99.9th percentile and admits up to 2× more PC RPCs meeting their SLO when compared with pFabric, Qjump, D3, PDQ, and Homa. Results from a fleet-wide production deployment at a large cloud provider show a 10% latency improvement.
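To make the mechanism concrete, here is a minimal sender-side sketch in Python. It is a hedged approximation under my own assumptions: the class name, constants, and the AIMD-style probability update are illustrative, not Aequitas's actual implementation.

```python
import random

class QosAdmissionController:
    """Hedged sketch of Aequitas-style sender-driven admission control.

    Each sender tracks an admission probability for a high-priority
    QoS class. Measured RPC latencies that meet the SLO nudge the
    probability up additively; violations cut it multiplicatively
    (AIMD-style). RPCs that lose the coin flip are downgraded to a
    lower QoS rather than dropped. Names and constants are assumed.
    """

    def __init__(self, slo_us, inc=0.01, dec=0.9):
        self.slo_us = slo_us    # target network-latency SLO for this QoS
        self.inc = inc          # additive increase step
        self.dec = dec          # multiplicative decrease factor
        self.admit_prob = 1.0   # start by admitting everything

    def on_rpc_completion(self, measured_latency_us):
        # Feedback from completed RPCs drives the AIMD update.
        if measured_latency_us <= self.slo_us:
            self.admit_prob = min(1.0, self.admit_prob + self.inc)
        else:
            self.admit_prob = max(0.01, self.admit_prob * self.dec)

    def classify(self, requested_qos):
        # Admit into the requested QoS with the current probability;
        # otherwise downgrade, so overload never builds up in-network.
        if random.random() <= self.admit_prob:
            return requested_qos
        return "BEST_EFFORT"
```

Because every sender runs this loop independently against its own measurements, the scheme stays distributed and needs nothing from the network beyond the WFQ classes that commodity switches already provide.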

The inception of this project can be traced back to the spring of 2019, when I visited Google to attend a networking summit. Nandita and I had been trying to collaborate on something since 2016(!), and the time seemed right with Yiwen ready to go for an internship. Over the course of a couple of summers in 2019 and 2020, and many months before and after, Yiwen managed to build a theoretical framework that added QoS support to the classic network calculus framework in collaboration with Gautam. Yiwen also had a lot of support from Xian, PR, and unnamed others in getting the simulator and the actual code implemented and deployed in Google datacenters. They implemented a public-facing/shareable version of the simulator as well! According to Amin and Nandita, this is one of the “big” problems, and I’m happy that we managed to pull it off.

This was my first time working with Google. Honestly, I was quite impressed by the rigor of the internal review process (even though the paper still got rejected a couple of times). It’s also my first paper with Gautam since 2012! He was great then, and he’s gotten even better over the past decade. Yiwen is also having a great year, with a SIGCOMM paper right after his NSDI one.

This year the SIGCOMM PC accepted 55 out of 281 submissions after a couple of years of record-breaking acceptance rates.

AIFO Accepted to Appear at SIGCOMM’2021

Packet scheduling is a classic problem in networking. In recent years, however, the focus of packet scheduling research has somewhat shifted from designing new scheduling algorithms to designing generalized frameworks that can be programmed to approximate a variety of scheduling disciplines. Push-In First-Out (PIFO) from SIGCOMM 2016 is one such framework, and it has been shown to be quite expressive. Since then, a variety of solutions have attempted to make implementing PIFO more practical, the key problem being minimizing the number of priority levels needed. SP-PIFO is a recent take showing that a handful of queues is enough. This leaves us with an obvious question: what is the minimum number of queues one needs to approximate PIFO? We show that the answer is just one.

Programmable packet scheduling enables scheduling algorithms to be programmed into the data plane without changing the hardware. Existing proposals either have no hardware implementations or require multiple strict-priority queues.

We present Admission-In First-Out (AIFO) queues, a new solution for programmable packet scheduling that uses only a single first-in first-out queue. AIFO is motivated by the confluence of two recent trends: shallow buffers in switches and fast-converging congestion control in end hosts. Together, they lead to a simple observation: the decisive factor in a flow’s completion time (FCT) in modern datacenter networks is often which packets are enqueued or dropped, not the order in which they leave the switch. The core idea of AIFO is to maintain a sliding window to track the ranks of recent packets and to compute the relative rank of an arriving packet within the window for admission control. Theoretically, we prove that AIFO provides bounded performance relative to Push-In First-Out (PIFO). Empirically, we fully implement AIFO and evaluate it with a range of real workloads, demonstrating that AIFO closely approximates PIFO. Importantly, unlike PIFO, AIFO can run at line rate on existing hardware and uses minimal switch resources: as few as a single queue.
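The admission check itself is compact enough to sketch in plain Python. This is a hedged reading of the idea; the window size and the exact admission inequality are my assumptions, and the hardware version computes the rank quantile with approximations suited to the data plane.

```python
from collections import deque

class AifoQueue:
    """Hedged sketch of AIFO-style admission into one FIFO queue.

    An arriving packet with rank r is admitted only if its relative
    rank (fraction of recently seen packets with a smaller rank) fits
    within the fraction of queue capacity still free. Low-rank,
    high-priority packets thus get in even when the queue is nearly
    full, while high-rank packets are dropped early.
    """

    def __init__(self, capacity, window_size=64):
        self.capacity = capacity
        self.queue = deque()                      # the single FIFO queue
        self.window = deque(maxlen=window_size)   # ranks of recent arrivals

    def enqueue(self, packet, rank):
        self.window.append(rank)
        # Relative rank: fraction of windowed packets with smaller rank.
        smaller = sum(1 for r in self.window if r < rank)
        quantile = smaller / len(self.window)
        # Admit if that quantile fits in the remaining queue headroom.
        headroom = (self.capacity - len(self.queue)) / self.capacity
        if quantile <= headroom:
            self.queue.append(packet)
            return True
        return False    # dropped at admission; departures stay FIFO

    def dequeue(self):
        return self.queue.popleft() if self.queue else None
```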

Although programmable packet scheduling has been quite popular for more than five years, I started paying careful attention only after the SP-PIFO presentation at NSDI 2020. I felt that we should be able to approximate something like that with even fewer priority classes, especially by using something similar to Foreground-Background scheduling, which needs only two priorities. Xin had been thinking about the problem even longer given his vast experience with programmable switches, and he approached me after submitting Kayak to NSDI 2021. Xin pointed out that two priorities need only one queue with an admission control mechanism in front! I’m glad he roped me in, as it’s always a pleasure working with him and Zhuolong. It seems unbelievable even to me that this is my first packet scheduling paper!

This year SIGCOMM has broken the acceptance record once again by accepting 55 out of 241 submissions into the program!

NetLock Accepted to Appear at SIGCOMM’2020

High-throughput, low-latency lock managers are useful for building a variety of distributed applications. Traditionally, a key tradeoff in this context has been expressed in terms of the amount of knowledge available to the lock manager. On the one hand, a decentralized lock manager can increase throughput through parallelization, but it can starve certain categories of applications. On the other hand, a centralized lock manager can avoid starvation and impose resource-sharing policies, but it can be limited in throughput. In SIGMOD’18, we presented DSLR, which attempted to mitigate this tradeoff in clusters with fast RDMA networks by adapting Lamport’s bakery algorithm to RDMA’s fetch-and-add (FA) operations to build a decentralized solution. The downside is that we couldn’t implement complex policies that need centralized information.
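For context, the bakery idea maps naturally onto fetch-and-add: each lock keeps a “next ticket” counter and a “now serving” counter, and clients coordinate purely through atomic increments. Here is a toy single-process sketch in which a thread-safe counter stands in for one-sided RDMA FA verbs; the shared/exclusive lock modes, failure handling, and counter wraparound that DSLR actually deals with are all omitted.

```python
import itertools
import time

class BakeryLock:
    """Toy sketch of a bakery-style ticket lock.

    In DSLR, the two counters would live in a remote server's memory
    and be manipulated with one-sided RDMA fetch-and-add; here,
    CPython's thread-safe itertools.count() plays that role.
    """

    def __init__(self):
        self._next_ticket = itertools.count()  # fetch-and-add stand-in
        self._now_serving = 0

    def acquire(self):
        my_ticket = next(self._next_ticket)     # FA: take a ticket
        while self._now_serving != my_ticket:   # spin: poll the counter
            time.sleep(0)                       # yield instead of busy-wait
        return my_ticket

    def release(self):
        self._now_serving += 1                  # FA: advance "now serving"
```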

What if we could have a high-speed centralized point that all remote traffic must go through anyway? NetLock is our attempt at doing just that: implementing a centralized lock manager in a programmable switch working in tandem with the servers. The co-design is important to work around the resource limitations of the switch. By carefully caching hot locks and moving warm and cold ones to the servers, we can meet both the performance and policy goals of a lock manager without significant compromise in either.

Lock managers are widely used by distributed systems. Traditional centralized lock managers can easily support policies between multiple users using global knowledge, but they suffer from low performance. In contrast, emerging decentralized approaches are faster but cannot provide flexible policy support. Furthermore, performance in both cases is limited by the server capability.

We present NetLock, a new centralized lock manager that co-designs servers and network switches to achieve high performance without sacrificing flexibility in policy support. The key idea of NetLock is to exploit the capability of emerging programmable switches to directly process lock requests in the switch data plane. Due to the limited switch memory, we design a memory management mechanism to seamlessly integrate the switch and server memory. To realize the locking functionality in the switch, we design a custom data plane module that efficiently pools multiple register arrays together to maximize memory utilization. We have implemented a NetLock prototype with a Barefoot Tofino switch and a cluster of commodity servers. Evaluation results show that NetLock improves the throughput by 14.0–18.4×, and reduces the average and 99th-percentile latency by 4.7–20.3× and 10.4–18.7× over DSLR, a state-of-the-art RDMA-based solution, while providing flexible policy support.
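As a mental model of the co-design, here is a hedged sketch of the fast-path/slow-path split. All names are illustrative, and the real data plane operates on P4 register arrays and in-switch queues rather than Python dictionaries.

```python
class NetLockSketch:
    """Hedged sketch of NetLock's switch/server co-design.

    Hot locks live in limited switch memory and are granted in the
    data plane at line rate; warm and cold locks are served by
    commodity lock servers. Waiting-queue management and policy
    enforcement (e.g., fairness across tenants) are omitted.
    """

    def __init__(self, switch_capacity):
        self.switch_capacity = switch_capacity
        self.switch_locks = {}   # lock_id -> holder (None means free)
        self.server_locks = {}   # slow path, kept on lock servers

    def install_hot_lock(self, lock_id):
        # Control-plane decision: cache a popular lock in the switch.
        if len(self.switch_locks) < self.switch_capacity:
            self.switch_locks.setdefault(lock_id, None)
            return True
        return False             # out of switch memory; stays on servers

    def request(self, lock_id, client):
        if lock_id in self.switch_locks:            # fast path: in-switch
            if self.switch_locks[lock_id] is None:
                self.switch_locks[lock_id] = client
                return "GRANTED_BY_SWITCH"
            return "QUEUED_IN_SWITCH"
        holder = self.server_locks.setdefault(lock_id, None)   # slow path
        if holder is None:
            self.server_locks[lock_id] = client
            return "GRANTED_BY_SERVER"
        return "QUEUED_AT_SERVER"
```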

Xin and I came up with the idea for this project over a couple of meals in San Diego at OSDI’18, and later Zhuolong and Yiwen expanded and successfully executed our ideas, which led to NetLock. Similar to DSLR, NetLock explores a different design point in our larger memory disaggregation vision.

This year’s SIGCOMM probably has the highest acceptance rate in 25 years, if not more. After a long successful run at SIGCOMM and a small break doing many other exciting things, it’s great to be back to some networking research! But going forward, I’m hoping for much more along these lines both inside the network and at the edge.

Hermes Accepted to Appear at SIGCOMM’2017

Datacenter load balancing, especially in Clos topologies, remains a hot topic even after almost a decade. The pace of progress has picked up over the last few years with multiple solutions exploring different extremes of the solution space, ranging from edge-based to in-network solutions and using different granularities of load balancing: packets, flowcells, flowlets, or flows. Throughout all these efforts, load balancing granularity has always been either fixed or passively determined. This inflexibility and lack of control lead to many inefficiencies. In Hermes, we take a more active approach: we comprehensively sense path conditions and actively direct traffic to ensure cautious yet timely rerouting.

Production datacenters operate under various uncertainties such as traffic dynamics, topology asymmetry, and failures. Therefore, datacenter load balancing schemes must be resilient to these uncertainties; i.e., they should accurately sense path conditions and react in a timely manner to mitigate the fallout. Despite significant efforts, prior solutions have important drawbacks. On the one hand, solutions such as Presto and DRB are oblivious to path conditions and blindly reroute at a fixed granularity; on the other hand, solutions such as CONGA and CLOVE can sense congestion, but they can only reroute when flowlets emerge and thus cannot always react to uncertainties in time. To make things worse, these solutions fail to detect/handle failures such as blackholes or random packet drops, which greatly degrades their performance.

In this paper, we propose Hermes, a datacenter load balancer that is resilient to the aforementioned uncertainties. At its heart, Hermes leverages comprehensive sensing to detect path conditions, including failures left unattended by prior work, and reacts with timely yet cautious rerouting. Hermes is a practical edge-based solution that requires no switch modification. We have implemented Hermes with commodity switches and evaluated it through both testbed experiments and large-scale simulations. Our results show that Hermes achieves performance comparable to CONGA and Presto in normal cases and handles uncertainties well: under asymmetries, Hermes achieves up to 10% and 40% better flow completion time (FCT) than CONGA and CLOVE; under switch failures, it significantly outperforms all other schemes by over 50%.
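In sketch form, the “timely yet cautious” rerouting decision might look like the following; the sensed metrics, thresholds, and improvement margin are my assumptions rather than the paper’s exact heuristics.

```python
def should_reroute(current, candidate,
                   ecn_thresh=0.4, rtt_thresh_us=300, delta=0.1):
    """Hedged sketch of a Hermes-style cautious rerouting check.

    `current` and `candidate` are dicts of sensed per-path state,
    e.g. {"ecn_frac": 0.5, "rtt_us": 450, "failed": False}, where
    "failed" marks paths suspected of blackholes or random drops.
    """
    if current["failed"]:                   # failure: get off this path
        return not candidate["failed"]
    congested = (current["ecn_frac"] > ecn_thresh
                 or current["rtt_us"] > rtt_thresh_us)
    if not congested:
        return False                        # cautious: leave good paths alone
    # Reroute only to a clearly better path (margin `delta`), which
    # avoids oscillating between two similarly loaded paths.
    return (not candidate["failed"]
            and candidate["ecn_frac"] + delta < current["ecn_frac"])

# A congested path with a clearly less-marked alternative: reroute.
cur = {"ecn_frac": 0.6, "rtt_us": 500, "failed": False}
cand = {"ecn_frac": 0.1, "rtt_us": 120, "failed": False}
assert should_reroute(cur, cand)
```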

While I have spent a lot of time working on datacenter network-related problems, my focus has always been on enabling application-awareness in the network using coflows and on multi-tenancy issues; I have actively stayed away from lower-level details. So when Hong and Kai brought up this load balancing problem after last SIGCOMM, I was a bit apprehensive. I became interested when they posed the problem as an interaction challenge between lower-level load balancing solutions and transport protocols, and I’m glad I got involved. As always, it’s been a pleasure working with Hong and Kai. I’ve been working closely with Hong for about two years now, and he already has two first-authored SIGCOMM papers under his belt; at this point, I feel like his unofficial co-advisor. Junxue and Wei were instrumental in getting the experiments done in time.

This year the SIGCOMM PC accepted 36 papers out of 250 submissions with a 14.4% acceptance rate. The number of accepted papers went down and the number of submissions up, which led to the lowest acceptance rate since SIGCOMM 2012. This was also my first time on SIGCOMM PC.

CODA Accepted to Appear at SIGCOMM’2016

Update: Camera-ready version is available here now!

Since introducing the coflow abstraction in 2012, we’ve been working hard to make it practical one step at a time. Over the years, we’ve worked on efficient coflow scheduling, removed clairvoyance requirements in coflow scheduling, and performed fair sharing among coexisting coflows. Throughout all these efforts, one requirement remained constant: all distributed data-parallel applications had to be modified to correctly use the same coflow API. Unfortunately, this is difficult, if not impossible, because of the sheer number of computing frameworks available today.

CODA makes a first attempt at addressing this problem head-on. The intuition is simple: if we can automatically identify coflows from the network, applications need not be modified. There are at least two key challenges in achieving this goal. The first is classifying coflows. Unlike traditional traffic classification, coflows cannot be identified using five-tuples. Instead, we must use domain-specific insights to capture the logical relationship between multiple flows. The second challenge is avoiding the impact of misclassification, because even the best classifier will sometimes be wrong. We must be able to schedule coflows even when some flows have been misclassified into the wrong coflows. Our solution errs on the side of caution to perform error-tolerant coflow scheduling. Overall, CODA approximates Aalo very closely and significantly outperforms flow-based solutions.

Leveraging application-level requirements using coflows has recently been shown to improve application-level communication performance in data-parallel clusters. However, existing coflow-based solutions rely on modifying applications to extract coflows, making them inapplicable to many practical scenarios. In this paper, we present CODA, a first attempt at automatically identifying and scheduling coflows without any application-level modifications.

We employ an incremental clustering algorithm to perform fast, application-transparent coflow identification and complement it by proposing an error-tolerant coflow scheduler to mitigate occasional identification errors. Testbed experiments and large-scale simulations with realistic production workloads show that CODA can identify coflows with more than 90% accuracy, and CODA’s scheduler is robust to inaccuracies, enabling communication stages to complete 2.4X (5.1X) faster on average (95th percentile) in comparison to per-flow mechanisms. CODA’s performance is comparable to that of solutions requiring application-level modifications.
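To give a flavor of the identification half, here is a hedged sketch of incremental, attribute-aware clustering; the distance function, the “master” attribute, and all thresholds are illustrative stand-ins for CODA’s domain-specific ones.

```python
def identify_coflows(flows, time_eps=1.0, attr_penalty=5.0, thresh=2.0):
    """Hedged sketch of CODA-style coflow identification.

    `flows` is a list of dicts like {"start": 12.3, "master": "job-7"},
    where "master" stands in for an application-level attribute (e.g.,
    which job spawned the flow). Each flow joins the nearest existing
    cluster if close enough; otherwise it starts a new coflow.
    """
    def distance(flow, cluster):
        head = cluster[0]
        d = abs(flow["start"] - head["start"]) / time_eps
        if flow["master"] != head["master"]:
            d += attr_penalty        # discourage cross-application merges
        return d

    clusters = []
    for flow in sorted(flows, key=lambda f: f["start"]):   # incremental
        best = min(clusters, key=lambda c: distance(flow, c), default=None)
        if best is not None and distance(flow, best) < thresh:
            best.append(flow)
        else:
            clusters.append([flow])  # a new coflow begins
    return clusters
```

The scheduler then treats each cluster as one coflow while erring on the side of caution, because some assignments will inevitably be wrong.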

This work is a tight collaboration with Kai Chen and his students at HKUST: Hong Zhang, Bairen Yi, and Li Chen. It started with Hong sending me a short note asking whether I thought the problem was worth addressing. Of course I did; it is the second future work listed in my dissertation!

This year the SIGCOMM PC accepted 39 papers out of 225 submissions with a 17.33% acceptance rate. This also happens to be my first SIGCOMM in an advisory role and the first one without having Ion onboard. I am thankful to Kai for giving me full access to his excellent, hard-working students. It’s been a gratifying experience to work with them, and I look forward to further collaborations. Finally, I’m excited to start my life as an advisor on a high note. Go Blue!

Aalo Accepted to Appear at SIGCOMM’2015

Update: Camera-ready version is available here now!

Last SIGCOMM we introduced the coflow scheduling problem and presented Varys, which addressed its clairvoyant variation, i.e., when all the information about individual coflows is known a priori and there are no cluster or task scheduling dynamics. In many cases, these assumptions do not hold very well, leaving us with two primary challenges. First, how do we schedule coflows when individual flow sizes, the total number of flows in a coflow, or their endpoints are unknown? Second, how do we schedule coflows for jobs with more than one stage that form DAGs, where tasks can be scheduled in multiple waves? Aalo addresses both problems by providing the first solution to the non-clairvoyant coflow scheduling problem: it can efficiently schedule coflows without any prior knowledge of coflow characteristics and still approximates Varys’s performance very closely. This makes coflows practical in almost all scenarios we face in data-parallel communication.

Leveraging application-level requirements exposed through coflows has recently been shown to improve application-level communication performance in data-parallel clusters. However, existing efficient schedulers require a priori coflow information (e.g., flow size) and ignore cluster dynamics like pipelining, task failures, and speculative executions, limiting their applicability. Schedulers without prior knowledge compromise on performance to avoid head-of-line blocking. In this paper, we present Aalo, which strikes a balance and efficiently schedules coflows without prior knowledge.

Aalo employs Discretized Coflow-Aware Least-Attained Service (D-CLAS) to separate coflows into a small number of priority queues based on how much they have already sent across the cluster. By prioritizing across queues and scheduling coflows in FIFO order within each queue, Aalo’s non-clairvoyant scheduler can schedule diverse coflows and minimize their completion times. EC2 deployments and trace-driven simulations show that communication stages complete 1.93X faster on average and 3.59X faster at the 95th percentile using Aalo in comparison to per-flow mechanisms. Aalo’s performance is comparable to that of solutions using prior knowledge, and Aalo outperforms them in the presence of cluster dynamics.
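The queue-assignment rule at the heart of D-CLAS fits in a few lines. A hedged sketch follows, with illustrative (exponentially spaced) demotion thresholds; within each queue, coflows are served FIFO.

```python
# Hedged sketch of Aalo's D-CLAS queue assignment. A coflow starts in
# the highest-priority queue and is demoted as its total bytes sent
# cross successive thresholds; the threshold values here are made up.
THRESHOLDS = [10e6, 100e6, 1e9]      # bytes sent before each demotion

def dclas_queue(bytes_sent):
    """Return the priority-queue index (0 = highest priority) for a
    coflow that has sent `bytes_sent` bytes across the cluster so far."""
    for queue, limit in enumerate(THRESHOLDS):
        if bytes_sent < limit:
            return queue
    return len(THRESHOLDS)           # lowest-priority queue

# A small coflow stays in queue 0; a heavy one sinks to queue 3.
assert dclas_queue(1e6) == 0
assert dclas_queue(5e9) == 3
```

Because demotion depends only on bytes already sent, no prior knowledge of coflow sizes is needed; small coflows finish while still in high-priority queues, mimicking least-attained service.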

This is a joint work with Ion, the first time we have written a conference paper just between the two of us!

This year the SIGCOMM PC accepted 40 papers out of 262 submissions with a 15.27% acceptance rate. This also happens to be my last SIGCOMM as a student. Glad to be lucky enough to end on a high note!

Varys accepted at SIGCOMM’2014: Coflow is coming!

Update 2: Varys Developer Alpha released!

Update 1: Latest version is available here now!

We introduced the coflow abstraction back in 2012 at HotNets-XI, and now we have something solid to show what coflows are good for. Our paper on efficient and deadline-sensitive coflow scheduling has been accepted to appear at this year’s SIGCOMM. By exploiting a little bit of application-level information, coflows enable significantly better performance for distributed data-intensive jobs. In the process, we have discovered, characterized, and proposed heuristics for a novel scheduling problem: “Concurrent Open Shop Scheduling with Coupled Resources.” Yeah, it’s a mouthful; but hey, how often does one stumble upon a new scheduling problem?

Communication in data-parallel applications often involves a collection of parallel flows. Traditional techniques to optimize flow-level metrics do not perform well in optimizing such collections, because the network is largely agnostic to application-level requirements. The recently proposed coflow abstraction bridges this gap and creates new opportunities for network scheduling. In this paper, we address inter-coflow scheduling for two different objectives: decreasing communication time of data-intensive jobs and guaranteeing predictable communication time. We introduce the concurrent open shop scheduling with coupled resources problem, analyze its complexity, and propose effective heuristics to optimize either objective. We present Varys, a system that enables data-intensive frameworks to use coflows and the proposed algorithms while maintaining high network utilization and guaranteeing starvation freedom. EC2 deployments and trace-driven simulations show that communication stages complete up to 3.16X faster on average and up to 2X more coflows meet their deadlines using Varys in comparison to per-flow mechanisms. Moreover, Varys outperforms non-preemptive coflow schedulers by more than 5X.
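The ordering heuristic at the core of Varys can be sketched compactly. Below is a simplified, hedged rendition of Smallest-Effective-Bottleneck-First (SEBF): serve first the coflow whose bottleneck port would finish soonest. The accompanying rate allocation (MADD) and backfilling of unused bandwidth are omitted.

```python
def bottleneck_time(coflow, port_bw):
    """Completion time of a coflow's bottleneck port if the coflow got
    every port to itself. `coflow` maps port -> bytes to move through
    that port; `port_bw` maps port -> bandwidth in bytes/sec."""
    return max(demand / port_bw[port] for port, demand in coflow.items())

def sebf_order(coflows, port_bw):
    """Hedged sketch of Smallest-Effective-Bottleneck-First ordering:
    schedule coflows in increasing order of bottleneck completion time."""
    return sorted(coflows, key=lambda c: bottleneck_time(c, port_bw))

# Two coflows over two 1 GB/s ports: the small one goes first.
bw = {"in1": 1e9, "out1": 1e9}
small = {"in1": 1e8, "out1": 1e8}
big = {"in1": 5e9, "out1": 2e9}
assert sebf_order([big, small], bw) == [small, big]
```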

This is a joint work with Yuan Zhong and Ion Stoica. But as always, it took a village! I’d like to thank Mohammad Alizadeh, Justine Sherry, Peter Bailis, Rachit Agarwal, Ganesh Ananthanarayanan, Tathagata Das, Ali Ghodsi, Gautam Kumar, David Zats, Matei Zaharia, the NSDI and SIGCOMM reviewers, and everyone else who patiently read many drafts or listened to my spiel over the years. It’s been in the oven for many years, and I’m glad that it was so well received.

I believe coflows are the future, and they are coming for the good of the realm, err, network. Varys is only the tip of the iceberg, and there are way too many interesting problems to solve (both for network and theory folks!). We are in the process of open sourcing Varys over the coming weeks and look forward to seeing it used with major data-intensive frameworks. Personally, I’m happy with how my dissertation research is shaping up :)

This year the SIGCOMM PC accepted 45 papers out of 237 submissions with an 18.99% acceptance rate. This is the highest acceptance rate since 1995 and the highest number of papers ever accepted (tying with SIGCOMM’86)! After years of everyone complaining about the acceptance rate, this year’s PC has taken some action; probably a healthy step for the community.

Sinbad accepted to appear at SIGCOMM’2013

Update: Latest version is available here!

Our paper on leveraging flexibility in endpoint placement, aka Sinbad/Usher/Orchestrated File System (OFS), has been accepted for publication at this year’s SIGCOMM. We introduce constrained anycast in data-intensive clusters in the form of network-aware replica placement for cluster/distributed file systems. This is the second piece of my dissertation research, where the overarching goal is to push slightly more application semantics into the network to get disproportionately larger improvements. Sometimes things are unfair in a good way!

Many applications do not constrain the destinations of their network transfers. New opportunities emerge when such transfers contribute a large share of network bytes. By choosing endpoints that avoid congested links, the completion times of these transfers, as well as those of others without similar flexibility, can be improved. In this paper, we focus on leveraging the flexibility in replica placement during writes to cluster file systems (CFSes), which account for almost half of the cross-rack traffic in data-intensive clusters. CFS writes only require placing replicas in multiple fault domains and a balanced use of storage, but the replicas can otherwise be in any subset of machines in the cluster.

We study CFS interactions with the cluster network, analyze optimizations for replica placement, and propose Sinbad, a system that identifies imbalance and adapts replica destinations to navigate around congested links. Sinbad does so with little impact on long-term storage load balance. Experiments on EC2 (trace-driven simulations) show that block writes complete 1.3X (1.58X) faster as the network becomes more balanced. As a collateral benefit, the end-to-end completion times of data-intensive jobs improve by up to 1.26X. Trace-driven simulations also show that Sinbad’s improvement (1.58X) is close to that of a loose upper bound (1.89X) on the optimal.
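A hedged sketch of the placement decision follows. The real system also preserves long-term storage balance and reacts to shifting link utilizations; the one-replica-per-rack constraint here is an illustrative simplification of the fault-domain requirement.

```python
def place_replicas(rack_utilization, num_replicas=3):
    """Hedged sketch of Sinbad-style network-aware replica placement.

    `rack_utilization` maps rack -> current downlink utilization in
    [0, 1]. Pick the `num_replicas` least-congested racks, one replica
    per rack, so replicas span fault domains while avoiding hotspots.
    """
    ranked = sorted(rack_utilization, key=rack_utilization.get)
    if len(ranked) < num_replicas:
        raise ValueError("not enough fault domains for replication")
    return ranked[:num_replicas]

# The write steers clear of the rack whose downlink is running hot.
utilization = {"rack-a": 0.9, "rack-b": 0.2, "rack-c": 0.4, "rack-d": 0.3}
assert place_replicas(utilization) == ["rack-b", "rack-d", "rack-c"]
```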

This is a joint work with Srikanth Kandula and Ion Stoica. One cool fact is that we collaborated with Srikanth almost entirely over Skype for almost a year! I did take up a lot of his time away from his Summer interns, but most of them had their papers in too. There are a LOT of people to thank, including Yuan Zhong, Matei Zaharia, Gautam Kumar, Dave Maltz, Ganesh Ananthanarayanan, Ali Ghodsi, Raj Jain, and everyone else I’m forgetting right now. It took quite some time to put our ideas crisply in a few pages, but in the end the reception from the PC was great. We hope it will be useful in practice too.

This year the SIGCOMM PC accepted 38 papers out of 240 submissions with a 15.83% acceptance rate. This is the highest acceptance rate since 1996 (before Ion Stoica had any SIGCOMM paper) and the highest number of papers accepted since 1987 (before Scott Shenker had any paper in SIGCOMM)!

“Surviving Failures in Bandwidth-Constrained Datacenters” at SIGCOMM’2012

Update: Camera-ready version is in my publications page!

My internship work from last Summer has been accepted for publication at SIGCOMM’2012 as well; yay!! In this piece of work, we try to allocate machines for datacenter applications with bandwidth and fault-tolerance constraints, which are at odds: allocating for bandwidth tries to put machines close together, whereas a fault-tolerant allocation spreads machines out across multiple fault domains.

Datacenter networks have been designed to tolerate failures of network equipment and provide sufficient bandwidth. In practice, however, failures and maintenance of networking and power equipment often make tens to thousands of servers unavailable, and network congestion can increase service latency. Unfortunately, there exists an inherent tradeoff between achieving high fault tolerance and reducing bandwidth usage in the network core: spreading servers across fault domains improves fault tolerance but requires additional bandwidth, while deploying servers together reduces bandwidth usage but also decreases fault tolerance. We present a detailed analysis of a large-scale Web application and its communication patterns. Based on that, we propose and evaluate a novel optimization framework that achieves high fault tolerance while significantly reducing bandwidth usage in the network core by exploiting the skewness in the observed communication patterns.
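To make the tradeoff concrete, here is a toy cost function one could minimize over candidate allocations; it is purely illustrative and not the paper’s actual optimization framework.

```python
def allocation_cost(allocation, traffic, alpha=1.0):
    """Toy cost for the bandwidth/fault-tolerance tradeoff.

    `allocation` maps machine -> fault domain; `traffic` maps
    (machine, machine) pairs -> bytes exchanged. Traffic crossing
    fault domains consumes core bandwidth, while packing machines
    into one domain increases the worst-case loss when that domain
    fails. `alpha` trades the two objectives off.
    """
    core_bytes = sum(volume for (m1, m2), volume in traffic.items()
                     if allocation[m1] != allocation[m2])
    worst_loss = max(sum(1 for m in allocation if allocation[m] == d)
                     for d in set(allocation.values()))
    return core_bytes + alpha * worst_loss

# Skewed traffic: co-locating the one chatty pair saves core bandwidth
# while the quiet machines are spread across other fault domains.
traffic = {("a", "b"): 100, ("c", "d"): 1}
packed = {"a": 0, "b": 0, "c": 1, "d": 2}
spread = {"a": 0, "b": 1, "c": 2, "d": 3}
assert allocation_cost(packed, traffic) < allocation_cost(spread, traffic)
```

The skewness in the measured communication patterns is exactly what makes such an optimization pay off: only a few chatty pairs need co-location, so the rest can be spread out for fault tolerance.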

During my Master’s, I worked on several variations of a similar problem called virtual network embedding in the context of network virtualization (INFOCOM’2009, Networking’2010, VISA’2010, ToN’2012).

This year 32 out of 235 papers have been accepted at SIGCOMM, seven of which have at least one Berkeley author.

FairCloud has been accepted at SIGCOMM’2012

Update: Camera-ready is now available online!

This is kinda old news now, but it is still as exciting as it was a few days ago. Our paper “FairCloud: Sharing the Network in Cloud Computing” has been accepted for publication at this year’s SIGCOMM. We explore the design space of sharing networks, identify tradeoffs, and categorize different strategies based on their characteristics. In case you are following our Orchestra work, FairCloud sits in the Inter-Transfer Controller (ITC) of the Orchestra hierarchy.

The network, similar to CPU and memory, is a critical and shared resource in the cloud. However, unlike other resources, it is neither shared proportionally to payment, nor do cloud providers offer minimum guarantees on network bandwidth. The reason is that networks are more difficult to share, since the network allocation of a VM X depends not only on the VMs running on the same machine as X, but also on the other VMs that X communicates with, as well as on the cross-traffic on each link used by X. In this paper, we start from the above requirements (payment proportionality and minimum guarantees) and show that the network-specific challenges lead to fundamental tradeoffs when sharing datacenter networks. We then propose a set of properties to explicitly express these tradeoffs. Finally, we propose three allocation policies that allow us to navigate the tradeoff space. We evaluate their characteristics through simulation and testbed experiments, showing that they are able to provide minimum guarantees and achieve better proportionality with per-VM payment than known allocations.
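To illustrate the flavor of these policies, here is a hedged sketch of a per-link, proportional-share weight computation in the spirit of the paper’s proposals (the actual policies differ in detail): a flow between VMs X and Y gets weight W_X/N_X + W_Y/N_Y, where W is a VM’s payment-derived weight and N its communication fan-out.

```python
def flow_shares(flows, vm_weight, fanout):
    """Hedged sketch of a per-link proportional-share computation.

    `flows` lists (x, y) VM pairs sharing a congested link; each flow
    gets weight W_x/N_x + W_y/N_y, and the link's bandwidth is split
    in proportion to the resulting weights (weighted fair sharing).
    """
    weights = {(x, y): vm_weight[x] / fanout[x] + vm_weight[y] / fanout[y]
               for (x, y) in flows}
    total = sum(weights.values())
    return {flow: w / total for flow, w in weights.items()}

# VM "a" pays for weight 2 but talks to two peers; "b" talks to one.
shares = flow_shares([("a", "c"), ("a", "d"), ("b", "c")],
                     vm_weight={"a": 2, "b": 1, "c": 1, "d": 1},
                     fanout={"a": 2, "b": 1, "c": 2, "d": 1})
assert abs(sum(shares.values()) - 1.0) < 1e-9
```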

This year 32 out of 235 papers have been accepted at SIGCOMM. In other news, Berkeley has seven papers in this SIGCOMM!!!