DSLR Accepted to Appear at SIGMOD’2018

High-throughput, low-latency lock managers are useful for building a variety of distributed applications. A key tradeoff in this context can be expressed in terms of the amount of knowledge available to the lock manager. On the one hand, a decentralized lock manager can increase throughput via parallelization, but it can starve certain categories of applications. On the other hand, a centralized lock manager can avoid starvation and impose resource sharing policies, but it can be limited in throughput. DSLR is our attempt at mitigating this tradeoff in clusters with fast RDMA networks. Specifically, we adapt Lamport’s bakery algorithm to RDMA’s fetch-and-add (FA) operations, which yields higher throughput and lower latency than the state of the art while avoiding starvation.

Lock managers are a crucial component of modern distributed systems. However, with the increasing popularity of fast RDMA-enabled networks, traditional lock managers can no longer keep up with the latency and throughput requirements of modern systems. Centralized lock managers can ensure fairness and prevent starvation using global knowledge of the system, but are themselves single points of contention and failure. Consequently, they fall short in leveraging the full potential of RDMA networks. On the other hand, decentralized (RDMA-based) lock managers either completely sacrifice global knowledge to achieve higher throughput at the risk of starvation and higher tail latencies, or they resort to costly communications to maintain global knowledge, which can result in significantly lower throughput.

In this paper, we show that it is possible for a lock manager to be fully decentralized and yet exchange the partial knowledge necessary for preventing starvation and thereby reducing tail latencies. Our main observation is that we can design a lock manager using RDMA’s fetch-and-add (FA) operation, which always succeeds, rather than compare-and-swap (CAS), which only succeeds if a given condition is satisfied. While this requires us to rethink the locking mechanism from the ground up, it enables us to sidestep the performance drawbacks of the previous CAS-based proposals that relied solely on blind retries upon lock conflicts.

Specifically, we present DSLR (Decentralized and Starvation-free Lock management with RDMA), a decentralized lock manager that targets distributed systems with RDMA-enabled networks. We demonstrate that, despite being fully decentralized, DSLR prevents starvation and blind retries by providing first-come, first-served (FCFS) scheduling without maintaining explicit queues. We adapt Lamport’s bakery algorithm [34] to an RDMA-enabled environment with multiple bakers, utilizing only one-sided READ and atomic FA operations. Our experiments show that DSLR delivers up to 2.8X higher throughput than all existing RDMA-based lock managers, while reducing their average and 99.9th percentile latencies by up to 2.5X and 47X, respectively.
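For intuition, here is a minimal sketch of the core idea: a ticket lock in the style of Lamport’s bakery, built on fetch-and-add. This is not DSLR’s actual lock layout or wire protocol (the paper packs shared and exclusive counters into a single lock word and operates on it with one-sided RDMA verbs); C11 atomics stand in for the remote FA and READ operations, and all the names below are mine.

```c
// Bakery-style ticket lock on fetch-and-add (FA). Because FA always
// succeeds, each requester obtains a unique ticket in a single operation
// and simply waits for its turn -- no blind CAS retries are needed.
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    atomic_uint_fast32_t next_ticket; // bumped via FA on each lock request
    atomic_uint_fast32_t now_serving; // bumped via FA on each release
} ticket_lock;

static void lock_acquire(ticket_lock *l) {
    // In the RDMA setting, one remote FA returns our ticket in one round trip.
    uint_fast32_t my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    // Poll (a one-sided READ over RDMA) until it is our turn.
    while (atomic_load(&l->now_serving) != my_ticket)
        ; // spin
}

static void lock_release(ticket_lock *l) {
    // Handing the lock to the next ticket holder preserves FCFS order.
    atomic_fetch_add(&l->now_serving, 1);
}
```

Because FA always succeeds, every requester gets a unique, monotonically increasing ticket, which is exactly what provides FCFS ordering without explicit queues or retries.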

Barzan and I started this project with Dong Young in 2016 right after I joined Michigan, as our interests matched in terms of new and interesting applications of RDMA primitives. It’s exciting to see our work turn into my first SIGMOD paper. As we work on rack-scale/resource disaggregation over RDMA, we are seeing more exciting use cases of RDMA, going beyond key-value stores and designing new RDMA networking protocols. Stay tuned!

Co-Chairing APNET’2018 (Deadline: April 20); Submit Your Early Ideas!

Kun Tan and I are co-chairing the Second Asia-Pacific Workshop on Networking (APNet’2018), to be held in Beijing, China. APNet aims to bring together the very best researchers in computer networking and systems in a live forum in the Asia-Pacific region to discuss and debate innovative ideas at their early stages. The mission of APNet is to ensure that promising but not-yet-mature ideas receive timely feedback from leading researchers in the field. We are fortunate to have 35 PC members from 29 institutions across 13 countries!

I was honored to receive an invitation to co-chair it and accepted quickly because of the massive success of APNet’2017: 33 PC members wrote 179 reviews to select 17 papers out of 48 submissions, and about 150 people attended. This year, we are hoping for an even bigger reception in terms of the number of submissions, quality, and attendance.

Important Details

We invite submissions of short papers (up to 6 pages, including references) on a wide range of networking research topics.

  • Abstract registration: April 6, 2018 (11:59 PM GMT)
  • Paper submission: April 13, 2018 (11:59 PM GMT)
  • Notification of decision: June 11, 2018
  • Workshop dates: August 2-3, 2018

You can access the full call for papers (CFP) by following this link.

Infiniswap in USENIX ;login: and Elsewhere

Since our first open-source release of Infiniswap over the summer, we have seen growing interest, with many follow-ups both within our group and outside.

Here is a quick summary of selected writeups on Infiniswap:

What’s Next?

In addition to many performance optimizations and bug fixes, we’re working on several key features of Infiniswap that will be released in the coming months.

  1. High Availability: Infiniswap will be able to maintain high performance across failures, corruptions, and interruptions throughout the cluster.
  2. Performance Isolation: Infiniswap will be able to provide end-to-end performance guarantees over the RDMA network to multiple applications concurrently using it.

Received Two Alibaba Innovation Research Grants

More resources for following up on our recent memory disaggregation and erasure coding work! One of the awards is a collaboration with Harsha Madhyastha. Looking forward to working with Alibaba.

In 2017, AIR (Alibaba Innovation Research) received many proposals from 99 universities and institutes (54 domestic; 45 overseas) in 13 countries and regions. After rigorous review by the AIR Technical Committee, 43 distinguished research proposals were accepted and will be funded by the AIR Program.

Infiniswap Released on GitHub

Today we are glad to announce the first open-source release of Infiniswap, the first practical, large-scale memory disaggregation system for cloud and HPC clusters.

Infiniswap is an efficient memory disaggregation system designed specifically for clusters with fast RDMA networks. It opportunistically harvests and transparently exposes unused cluster memory to unmodified applications by dividing the swap space of each machine into many slabs and distributing them across many machines’ remote memory. Because one-sided RDMA operations bypass remote CPUs, Infiniswap leverages the power of many choices to perform decentralized slab placements and evictions.
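To illustrate the power-of-choices idea, here is a hypothetical sketch (not Infiniswap’s actual code, which runs inside the Linux kernel and probes remote daemons over RDMA): to place a slab, sample two candidate machines at random and put the slab on the less-loaded one.

```c
// Hypothetical power-of-two-choices slab placement. free_mem[i] stands in
// for machine i's probed free-memory response; a real implementation would
// obtain these values by probing remote daemons.
#include <stdlib.h>

int place_slab(const size_t *free_mem, int n_machines) {
    // Sample two machines uniformly at random...
    int a = rand() % n_machines;
    int b = rand() % n_machines;
    // ...and pick the one with more free memory.
    return (free_mem[a] >= free_mem[b]) ? a : b;
}
```

Sampling just two candidates requires no central coordinator or global view, yet is known to balance load nearly as well as always choosing the globally least-loaded machine; evictions can follow the mirror rule by sampling a few slabs and evicting from the most-loaded one.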

Extensive benchmarks on workloads from memory-intensive applications, ranging from in-memory databases such as VoltDB and Memcached to popular big data systems such as Apache Spark, PowerGraph, and GraphX, show that Infiniswap provides order-of-magnitude performance improvements when working sets do not completely fit in memory. Simultaneously, it boosts cluster memory utilization by almost 50%.

Primary features included in this initial release are:

  • No new hardware and no application or operating system modifications;
  • Fault tolerance via asynchronous disk backups;
  • Scalability via decentralized algorithms.

Here are some links if you want to check it out, contribute, or point someone our way who can help us make it better.

Git repository: https://github.com/Infiniswap/infiniswap
Detailed Overview: Efficient Memory Disaggregation with Infiniswap

The project is still in its early stages and can use all the help it can get. We appreciate your feedback.

Hermes Accepted to Appear at SIGCOMM’2017

Datacenter load balancing, especially in Clos topologies, remains a hot topic even after almost a decade. The pace of progress has picked up over the last few years with multiple solutions exploring different extremes of the solution space, ranging from edge-based to in-network solutions and using different granularities of load balancing: packets, flowcells, flowlets, or flows. Throughout all these efforts, load balancing granularity has always been either fixed or passively determined. This inflexibility and lack of control lead to many inefficiencies. In Hermes, we take a more active approach: we comprehensively sense path conditions and actively direct traffic to ensure cautious yet timely rerouting.

Production datacenters operate under various uncertainties such as traffic dynamics, topology asymmetry, and failures. Therefore, datacenter load balancing schemes must be resilient to these uncertainties; i.e., they should accurately sense path conditions and react in a timely manner to mitigate the fallout. Despite significant efforts, prior solutions have important drawbacks. On the one hand, solutions such as Presto and DRB are oblivious to path conditions and blindly reroute at a fixed granularity; on the other hand, solutions such as CONGA and CLOVE can sense congestion, but they can only reroute when flowlets emerge and thus cannot always react to uncertainties in a timely manner. To make matters worse, these solutions fail to detect and handle failures such as blackholes or random packet drops, which greatly degrades their performance.

In this paper, we propose Hermes, a datacenter load balancer that is resilient to the aforementioned uncertainties. At its heart, Hermes leverages comprehensive sensing to detect path conditions, including failures that previous solutions left unattended, and reacts with timely yet cautious rerouting. Hermes is a practical edge-based solution that requires no switch modifications. We have implemented Hermes with commodity switches and evaluated it through both testbed experiments and large-scale simulations. Our results show that Hermes achieves performance comparable to CONGA and Presto in normal cases and handles uncertainties well: under asymmetries, Hermes achieves up to 10% and 40% better flow completion time (FCT) than CONGA and CLOVE, respectively; under switch failures, it outperforms all other schemes by over 50%.
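To give a flavor of what “timely yet cautious” rerouting can mean (a hypothetical sketch; Hermes’s actual sensing signals and thresholds are more involved than this), a sender might flee a failed path immediately, but switch away from a merely congested path only when a clearly better path exists, so that packet reordering is not triggered for marginal gains.

```c
#include <stdbool.h>

// Sender-side estimate of a path's condition (illustrative fields only).
typedef struct {
    bool   failed;    // e.g., suspected blackhole or heavy random drops
    double ecn_frac;  // fraction of recent packets carrying ECN marks
    double rtt_us;    // smoothed RTT estimate, in microseconds
} path_state;

// Hypothetical rerouting predicate: timely for failures, cautious for
// congestion (demand a clear win before risking packet reordering).
bool should_reroute(const path_state *cur, const path_state *best,
                    double ecn_thresh, double rtt_margin_us) {
    if (cur->failed)
        return true;                       // react to failures immediately
    bool congested      = cur->ecn_frac > ecn_thresh;
    bool clearly_better = best->ecn_frac < ecn_thresh &&
                          best->rtt_us + rtt_margin_us < cur->rtt_us;
    return congested && clearly_better;
}
```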

While I have spent a lot of time working on datacenter network-related problems, my focus has always been on enabling application-awareness in the network using coflows and on multi-tenancy issues; I have actively stayed away from lower-level details. So when Hong and Kai brought up this load balancing problem after last SIGCOMM, I was a bit apprehensive. I became interested when they posed the problem as an interaction challenge between lower-level load balancing solutions and transport protocols, and I’m glad I got involved. As always, it’s been a pleasure working with Hong and Kai. I’ve been working closely with Hong for about two years now, and he now has two first-authored SIGCOMM papers under his belt; at this point, I feel like his unofficial co-advisor. Junxue and Wei were instrumental in getting the experiments done in time.

This year, the SIGCOMM PC accepted 36 papers out of 250 submissions, a 14.4% acceptance rate. The number of accepted papers went down while the number of submissions went up, leading to the lowest acceptance rate since SIGCOMM 2012. This was also my first time on the SIGCOMM PC.

Revamping EECS 489: A Retrospective

A couple of weeks ago, we wrapped up the Spring 2017 offering of the EECS 489: Computer Networks course. This was my first time teaching this course — in fact, it was my first time teaching any undergraduate course. Trying to introduce even small changes in an undergraduate course is difficult; being naive, I went for revamping almost everything! It was an impossible task, but I think we — my students, the support staff, and I — weathered it well. Hindsight being 20/20, now seems like a good time to look back at all the changes we introduced and all the challenges we faced. It is also a good time to thank the village that made it a successful one; yes, it does take a village.

What’s New?

This one is easy to answer in one word — EVERYTHING!!!

  • Course materials have gone through a major revamp with a focus on bringing “in the new stuff” (datacenters, cloud computing, application-aware networking, SDN, etc.) “without throwing out the old” (the Internet still rules), all the while keeping things simple;
  • Ordering of content is more straightforward now: a literal top-down walk of the network stack;
  • Textbook upgraded to the 7th edition of Kurose and Ross;
  • Most importantly, ALL NEW assignments/projects — on performance measurement, CDN, video streaming, reliable transport, and router design — with Mininet as the underlying substrate for large-scale emulation.

You can take a look at all the course materials from this offering at https://github.com/mosharaf/eecs489/tree/w17.

What Did the Students Think?

As far as I can tell from the student evaluations, almost everyone loved it! The small minority who didn’t enjoy the course as much still felt that they learned how the Internet works and how they can watch YouTube and Netflix; I consider that a success too and am happy to declare victory.

The Village

There are three sets of people who made this course a success.

First, we have an amazing community in computer networking, where everyone cares about making life easier for the youngins. I must thank the many networking researchers and academicians who helped me by sharing resources, including course materials, assignment/project ideas, and their collective knowledge from teaching computer networking for a long, long time. Without any particular ordering, I want to give my heartfelt thanks to Sylvia Ratnasamy (Berkeley), Sugih Jamin (Michigan), Aditya Akella (Wisconsin), Peter Steenkiste (CMU), Philip Levis (Stanford), Nick Feamster (Princeton), Vyas Sekar (CMU), Mohammad Alizadeh (MIT), Hari Balakrishnan (MIT), Arvind Krishnamurthy (UW), Harsha Madhyastha (Michigan), Jason Flinn (Michigan), and Peter Chen (Michigan). Additionally, I want to thank Kshiteej Mahajan (Wisconsin), David Naylor (CMU), Matt Mukerjee (CMU), and Chase Basich (Stanford) for answering many questions about different projects and assignments as we modified and ported them to Michigan. Without the help of the people above (and many others I may have missed), my efforts probably wouldn’t have been as successful.

The second set is small, but they are the most important pieces of the puzzle: my support staff. Given the class size, I could have only one GSI and one grader. I was extremely fortunate to have Nitish Paradkar as that one GSI. He is many in one and made everything possible; without him, my attempt to change ALL assignments/projects in one go would most definitely have been a disaster. Thank you, Nitish, and all the best with your future endeavors at Facebook. The other half of this team is Shane Schulte, my grader. While Nitish made everything ready to go, Shane kept things running by being very, very punctual with grading. One drawback of changing the entire course was that we couldn’t create any autograders, so Shane had to grade manually (with extreme cases double-checked by Nitish), on time every time, to keep the course going. While I’m on this topic, I also want to thank GitHub Education, which supported us by providing private repositories for all the students and made life easier for Nitish and Shane. Overall, I don’t know what I would’ve done had I not been lucky enough to have Nitish and Shane as partners.

The final set consists of my 72 students, who made all our work worthwhile. It was a big ask when I walked in on the first day and asked them to tolerate all my mistakes and my inexperience. They were excellent, and I don’t know what more I could’ve expected of them. There were many storms, but none became too big to handle because my students were patient enough to give our team a chance to figure things out. I’m very thankful to have them as my first set of undergrads. I must admit that I never thought I would enjoy teaching as much as I did, and it is only because of them. I wish them all the very best that life has to offer.

Looking Forward

Most things went well. However, despite our best efforts, there are still several loose ends (e.g., autograding and the exposition of some complicated ideas). I hope the next offering of the course will be markedly easier because of content reuse, and I look forward to spending that time and energy on better addressing the remaining issues that I noticed or that my students brought up.

FaiRDMA Accepted to Appear at KBNets’2017

As cloud providers deploy RDMA in their datacenters and developers rewrite/update their applications to use RDMA primitives, a key question remains open: what will happen when multiple RDMA-enabled applications must share the network? Surprisingly, this simple question does not yet have a conclusive answer, because existing work focuses primarily on improving individual applications’ performance using RDMA on a case-by-case basis. So we took a simple step. Yiwen built a benchmarking tool and ran controlled experiments that revealed some very interesting results: there is often no performance isolation, and many factors beyond the network itself determine RDMA performance. We are currently investigating the root causes and working on mitigating them.

To meet the increasing throughput and latency demands of modern applications, many operators are rapidly deploying RDMA in their datacenters. At the same time, developers are re-designing their software to take advantage of RDMA’s benefits for individual applications. However, when it comes to RDMA’s performance, many simple questions remain open.

In this paper, we consider the performance isolation characteristics of RDMA. Specifically, we conduct three sets of experiments — three combinations of one throughput-sensitive flow and one latency-sensitive flow — in a controlled environment, observe large discrepancies in RDMA performance with and without a competing flow, and describe our progress in identifying plausible root causes.

This work is an offshoot, among several others, of our Infiniswap project on rack-scale memory disaggregation. As time goes by, I’m appreciating Ion’s words of wisdom more and more: build real systems to find real problems.

FTR, the KBNets PC accepted 9 papers out of 22 submissions this year. For a first-time workshop, I consider that a decent rate; there are some great papers in the program from both industry and academia.

“No! Not Another Deep Learning Framework” to Appear at HotOS’2017

Our position paper calling for a respite in the deep learning framework building arms race has been accepted to appear at this year’s HotOS workshop. We make a simple observation: too many frameworks are being proposed with little interoperability between them, even though many target the same or similar workloads; this inevitably leads to repetitions and reinventions from a machine learning perspective and suboptimal performance from a systems perspective. We identify two places for consolidation across many deep learning frameworks’ architectures that may enable interoperability as well as code, optimization, and resource sharing, benefitting both the machine learning and systems communities.

In recent years, deep learning has pervaded many areas of computing due to the confluence of an explosive growth of large-scale computing capabilities, availability of datasets, and advances in learning techniques. While this rapid growth has resulted in diverse deep learning frameworks, it has also led to inefficiencies for both the users and developers of these frameworks. Specifically, adopting useful techniques across frameworks — both to perform learning tasks and to optimize performance — involves significant repetitions and reinventions.

In this paper, we observe that despite their diverse origins, many of these frameworks share architectural similarities. We argue that by introducing a common representation of learning tasks and a hardware abstraction model to capture compute heterogeneity, we might be able to relieve machine learning researchers from dealing with low-level systems issues and systems researchers from being tied to any specific framework. We expect this decoupling to accelerate progress in both domains.
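As a purely hypothetical illustration of what such a hardware abstraction model could look like (none of these names appear in the paper): if frameworks lowered their learning tasks onto a narrow device interface such as the one sketched below, a new accelerator backend would need to be written once rather than once per framework.

```c
// Hypothetical narrow device interface. A common representation of
// learning tasks would be compiled down to calls against these hooks,
// decoupling frameworks from accelerator-specific details.
#include <stddef.h>

typedef struct device_ops {
    void *(*alloc)(size_t bytes);                            // device memory
    void  (*release)(void *ptr);
    void  (*copy_to_dev)(void *dst, const void *src, size_t bytes);
    void  (*copy_from_dev)(void *dst, const void *src, size_t bytes);
    // Launch a named operator (e.g., "matmul") with opaque argument buffers.
    int   (*launch)(const char *op_name, void **args, int n_args);
} device_ops;
```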

Our foray into deep learning systems started with a class project by Peifeng and Linh last fall in my EECS 582 course. From a systems perspective, this is a very new and exciting area! We are learning new things every day, ranging from low-level GPU programming to communication over NVLink, and we are looking forward to a very exciting summer.

FTR, the HotOS PC accepted 29 papers out of 94 submissions this year.

Jack Kosaian Selected for NSF GRFP’2017!

Jack is the first student, graduate or undergraduate, that I have had the opportunity to work with since I joined Michigan. In just over a year working with me, he has an OSDI paper under his belt along with a first-authored submission to another major conference. It’s easy to forget that he’s still an undergrad!

Two other excellent graduate students from Michigan CSE — Andrew Quinn and Timothy Trippel — have also been selected. I’m fortunate to know both of them as well from my two EECS 582 offerings in 2016.

Many congratulations to everyone!