Received Two Alibaba Innovation Research Grants

More resources for following up on our recent memory disaggregation and erasure coding works! One of the awards is a collaboration with Harsha Madhyastha. Looking forward to working with Alibaba.

In 2017, AIR (Alibaba Innovation Research) received many proposals from 99 universities and institutes (54 domestic; 45 overseas) across 13 countries and regions. After rigorous review by the AIR Technical Committee, 43 distinguished research proposals were accepted and will be funded by the AIR program.

Infiniswap Released on GitHub

Today we are glad to announce the first open-source release of Infiniswap, the first practical, large-scale memory disaggregation system for cloud and HPC clusters.

Infiniswap is an efficient memory disaggregation system designed specifically for clusters with fast RDMA networks. It opportunistically harvests and transparently exposes unused cluster memory to unmodified applications by dividing the swap space of each machine into many slabs and distributing them across many machines’ remote memory. Because one-sided RDMA operations bypass remote CPUs, Infiniswap leverages the power of many choices to perform decentralized slab placements and evictions.
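To make the power-of-choices idea concrete, here is a minimal Python sketch of decentralized slab placement; all names (`Machine`, `place_slab`) and the candidate count are illustrative assumptions, not Infiniswap's actual code.

```python
import random

class Machine:
    """A hypothetical stand-in for a cluster machine advertising free memory."""
    def __init__(self, name, free_mem_mb):
        self.name = name
        self.free_mem_mb = free_mem_mb

def place_slab(cluster, slab_mb, num_choices=2):
    """Probe a few randomly chosen machines and place the slab on the one
    with the most free memory -- no central coordinator involved."""
    candidates = random.sample(cluster, min(num_choices, len(cluster)))
    best = max(candidates, key=lambda m: m.free_mem_mb)
    if best.free_mem_mb < slab_mb:
        return None  # no probed machine fits; fall back to local disk
    best.free_mem_mb -= slab_mb
    return best

# Each machine places its own slabs independently, so placement load
# spreads across the cluster without global coordination.
cluster = [Machine(f"m{i}", random.randint(1024, 8192)) for i in range(10)]
host = place_slab(cluster, slab_mb=512)
```

Sampling only a handful of candidates keeps probing overhead constant while still balancing load nearly as well as querying every machine, which is the classic power-of-choices result.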

Extensive benchmarks on memory-intensive workloads, ranging from in-memory databases such as VoltDB and Memcached to popular big data frameworks such as Apache Spark, PowerGraph, and GraphX, show that Infiniswap provides order-of-magnitude performance improvements when working sets do not completely fit in memory. Simultaneously, it boosts cluster memory utilization by almost 50%.

Primary features included in this initial release are:

  • No new hardware and no application or operating system modifications;
  • Fault tolerance via asynchronous disk backups;
  • Scalability via decentralized algorithms.

Here are some links if you want to check it out, contribute, or point someone our way who can help us make it better.

Git repository: https://github.com/Infiniswap/infiniswap
Detailed Overview: Efficient Memory Disaggregation with Infiniswap

The project is still in its early stages and can use all the help it can get. We appreciate your feedback.

Hermes Accepted to Appear at SIGCOMM’2017

Datacenter load balancing, especially in Clos topologies, remains a hot topic even after almost a decade. The pace of progress has picked up over the last few years with multiple solutions exploring different extremes of the solution space, ranging from edge-based to in-network solutions and using different granularities of load balancing: packets, flowcells, flowlets, or flows. Throughout all these efforts, load balancing granularity has always been either fixed or passively determined. This inflexibility and lack of control lead to many inefficiencies. In Hermes, we take a more active approach: we comprehensively sense path conditions and actively direct traffic to ensure cautious yet timely rerouting.

Production datacenters operate under various uncertainties such as traffic dynamics, topology asymmetry, and failures. Therefore, datacenter load balancing schemes must be resilient to these uncertainties; i.e., they should accurately sense path conditions and react in time to mitigate the fallout. Despite significant efforts, prior solutions have important drawbacks. On the one hand, solutions such as Presto and DRB are oblivious to path conditions and blindly reroute at a fixed granularity; on the other hand, solutions such as CONGA and CLOVE can sense congestion, but they can only reroute when flowlets emerge and thus cannot always react to uncertainties in a timely manner. To make things worse, these solutions fail to detect or handle failures such as blackholes and random packet drops, which greatly degrades their performance.
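The flowlet constraint described above can be illustrated with a small sketch: a flowlet-based scheme may only switch paths when the inter-packet gap exceeds a timeout, so it cannot escape a congested path until a boundary happens to occur. The class name and timeout value here are assumptions for illustration, not taken from any of these systems.

```python
FLOWLET_TIMEOUT = 0.0005  # 500 us; an assumed value, not from any paper

class FlowletRouter:
    """Toy model of flowlet-granularity load balancing."""
    def __init__(self, paths):
        self.paths = paths
        self.last_seen = None        # timestamp of the previous packet
        self.current_path = paths[0]

    def route(self, now, pick_best_path):
        """Reroute only at a flowlet boundary (a large enough packet gap);
        otherwise the flow is pinned to its current path, good or bad."""
        boundary = self.last_seen is None or now - self.last_seen > FLOWLET_TIMEOUT
        if boundary:
            self.current_path = pick_best_path(self.paths)
        self.last_seen = now
        return self.current_path
```

Back-to-back packets never cross the timeout, so even if sensing says another path is better, the reroute is deferred -- exactly the passivity Hermes avoids by rerouting actively (yet cautiously).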

In this paper, we propose Hermes, a datacenter load balancer that is resilient to the aforementioned uncertainties. At its heart, Hermes leverages comprehensive sensing to detect path conditions, including failures that went unattended before, and reacts with timely yet cautious rerouting. Hermes is a practical edge-based solution that requires no switch modifications. We have implemented Hermes with commodity switches and evaluated it through both testbed experiments and large-scale simulations. Our results show that Hermes achieves performance comparable to CONGA and Presto in normal cases and handles uncertainties well: under asymmetries, Hermes achieves up to 10% and 40% better flow completion time (FCT) than CONGA and CLOVE, respectively; under switch failures, it significantly outperforms all other schemes by over 50%.

While I have spent a lot of time working on datacenter network-related problems, my focus has always been on enabling application-awareness in the network using coflows and multi-tenancy issues; I have actively stayed away from lower level details. So when Hong and Kai brought up this load balancing problem after last SIGCOMM, I was a bit apprehensive. I became interested when they posed the problem as an interaction challenge between lower-level load balancing solutions and transport protocols, and I’m glad I got involved. As always, it’s been a pleasure working with Hong and Kai. I’ve been closely working with Hong for about two years now, leading to two first-authored SIGCOMM papers under his belt; at this point, I feel like his unofficial co-advisor. Junxue and Wei were instrumental in getting the experiments done in time.

This year the SIGCOMM PC accepted 36 papers out of 250 submissions with a 14.4% acceptance rate. The number of accepted papers went down and the number of submissions up, which led to the lowest acceptance rate since SIGCOMM 2012. This was also my first time on SIGCOMM PC.

Revamping EECS 489: A Retrospective

A couple of weeks ago, we wrapped up the Spring 2017 offering of EECS 489: Computer Networks. This was my first time teaching this course — in fact, it was my first time teaching any undergraduate course. Introducing even small changes in an undergraduate course is difficult; being naive, I went for revamping almost everything! It was an impossible task, but I think we — my students, support staff, and myself — weathered it well. Hindsight being 20/20, now seems like a good time to look back at all the changes we introduced and all the challenges we faced. It is also a good time to thank the village that made it a success; yes, it does take a village.

What’s New?

This one is easy to answer in one word — EVERYTHING!!!

  • Course materials have gone through a major revamp, mixing in “the new stuff” (datacenters, cloud computing, application-aware networking, SDN, etc.) “without throwing out the old” (the Internet still rules), all the while keeping things simple;
  • Ordering of content is more straightforward now: a literal top-down walk of the network stack;
  • Textbook upgraded to the 7th edition of Kurose and Ross;
  • Most importantly, ALL NEW assignments/projects — on performance measurement, CDN, video streaming, reliable transport, and router design — with Mininet as the underlying substrate for large-scale emulation.

You can take a look at all the course materials from this offering at https://github.com/mosharaf/eecs489/tree/w17.

What Did the Students Think?

As far as I can tell from student evaluations, almost everyone loved it! The small minority who didn’t enjoy the course as much still felt they learned how the Internet works and how they can watch YouTube and Netflix; I consider that a success too and am happy to declare victory.

The Village

There are three sets of people that made this course a success.

First, we have an amazing community in computer networking, where everyone cares about making life easier for the youngins. I must thank the many networking researchers and academicians who helped me by sharing resources, including course materials, assignment/project ideas, and their collective knowledge from teaching computer networking for a long, long time. Without any particular ordering, I want to give my heartfelt thanks to Sylvia Ratnasamy (Berkeley), Sugih Jamin (Michigan), Aditya Akella (Wisconsin), Peter Steenkiste (CMU), Philip Levis (Stanford), Nick Feamster (Princeton), Vyas Sekar (CMU), Mohammad Alizadeh (MIT), Hari Balakrishnan (MIT), Arvind Krishnamurthy (UW), Harsha Madhyastha (Michigan), Jason Flinn (Michigan), and Peter Chen (Michigan). Additionally, I want to thank Kshiteej Mahajan (Wisconsin), David Naylor (CMU), Matt Mukerjee (CMU), and Chase Basich (Stanford) for answering many questions regarding different projects and assignments as we modified and ported them to Michigan. Without the help of the people above (and many others that I may have missed), my efforts probably wouldn’t have been as successful.

The second set is small, but they are the most important pieces of the puzzle: my support staff. Given the class size, I could have only one GSI and one grader. I was extremely fortunate to have Nitish Paradkar as that one GSI. He is many-in-one and made everything possible. Without him, my attempt to change ALL assignments/projects in one go would most definitely have been a disaster. Thank you Nitish, and all the best with your future endeavors at Facebook. The other half of this team is Shane Schulte, my grader. While Nitish made everything ready to go, Shane kept things running by being very, very punctual in grading everything. One drawback of changing the entire course was that we couldn’t create any autograders; Shane had to grade manually (with extreme cases double-checked by Nitish), on time every time, to keep the course going. While I’m on this topic, I also want to thank GitHub Education, which supported us by providing private repositories for all the students and made life easier for Nitish and Shane. Overall, I don’t know what I would’ve done had I not been lucky enough to have Nitish and Shane as partners.

The final set consists of my 72 students, who made all our work worthwhile. It was a big ask when I walked in on the first day and asked them to tolerate all my mistakes and my inexperience. They were excellent, and I don’t know what more I could’ve expected of them. There were many storms, but none became too big to handle because my students were patient enough to give our team a chance to figure things out. I’m very thankful to have them as my first set of undergrads. I must admit that I never thought I would enjoy teaching as much as I did, and it is only because of them. I wish them all the very best that life has to offer.

Looking Forward

Most things went well. However, despite our best efforts, there are still several loose ends — e.g., autograding, the exposition of some complicated ideas, etc. I hope the next offering of the course will be markedly easier thanks to content reuse, and I look forward to spending that time and energy on the remaining issues that I noticed or that were brought up by my students.

FaiRDMA Accepted to Appear at KBNets’2017

As cloud providers deploy RDMA in their datacenters and developers rewrite/update their applications to use RDMA primitives, a key question remains open: what will happen when multiple RDMA-enabled applications must share the network? Surprisingly, this simple question does not yet have a conclusive answer, because existing work focuses primarily on improving individual applications’ performance using RDMA on a case-by-case basis. So we took a simple step. Yiwen built a benchmarking tool and ran controlled experiments, which revealed some very interesting results: there is often no performance isolation, and many factors beyond the network itself determine RDMA performance. We are currently investigating the root causes and working on mitigating them.

To meet the increasing throughput and latency demands of modern applications, many operators are rapidly deploying RDMA in their datacenters. At the same time, developers are re-designing their software to take advantage of RDMA’s benefits for individual applications. However, when it comes to RDMA’s performance, many simple questions remain open.

In this paper, we consider the performance isolation characteristics of RDMA. Specifically, we conduct three sets of experiments — three combinations of one throughput-sensitive flow and one latency-sensitive flow — in a controlled environment, observe large discrepancies in RDMA performance with and without the presence of a competing flow, and describe our progress in identifying plausible root-causes.

This work is an offshoot, among several others, of our Infiniswap project on rack-scale memory disaggregation. As time goes by, I’m appreciating more and more Ion’s words of wisdom on building real systems to find real problems.

FTR, the KBNets PC accepted 9 papers out of 22 submissions this year. For a first-time workshop, I consider it a decent rate; there are some great papers in the program from both industry and academia.

“No! Not Another Deep Learning Framework” to Appear at HotOS’2017

Our position paper calling for a respite in the deep learning framework building arms race has been accepted to appear at this year’s HotOS workshop. We make a simple observation: too many frameworks are being proposed with little interoperability between them, even though many target the same or similar workloads; this inevitably leads to repetitions and reinventions from a machine learning perspective and suboptimal performance from a systems perspective. We identify two places for consolidation across many deep learning frameworks’ architectures that may enable interoperability as well as code, optimization, and resource sharing, benefitting both the machine learning and systems communities.

In recent years, deep learning has pervaded many areas of computing due to the confluence of an explosive growth of large-scale computing capabilities, availability of datasets, and advances in learning techniques. While this rapid growth has resulted in diverse deep learning frameworks, it has also led to inefficiencies for both the users and developers of these frameworks. Specifically, adopting useful techniques across frameworks — both to perform learning tasks and to optimize performance — involves significant repetitions and reinventions.

In this paper, we observe that despite their diverse origins, many of these frameworks share architectural similarities. We argue that by introducing a common representation of learning tasks and a hardware abstraction model to capture compute heterogeneity, we might be able to relieve machine learning researchers from dealing with low-level systems issues and systems researchers from being tied to any specific framework. We expect this decoupling to accelerate progress in both domains.
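A toy sketch of the decoupling we argue for — learning tasks expressed once in a common representation and executed by interchangeable hardware backends — might look like the following. All names (`Op`, `Backend`, `CpuBackend`) are entirely hypothetical, not any real framework's API.

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    """One node in a common, framework-neutral computation graph."""
    name: str                               # e.g. "add", "mul", "relu"
    inputs: list = field(default_factory=list)  # Ops or input variable names

class Backend:
    """Hardware abstraction: every device type implements the same op set."""
    def run(self, op, *args):
        raise NotImplementedError

class CpuBackend(Backend):
    """A trivial scalar CPU backend; a GPU backend would plug in the
    same way with its own kernels."""
    KERNELS = {
        "add": lambda a, b: a + b,
        "mul": lambda a, b: a * b,
        "relu": lambda a: max(a, 0),
    }
    def run(self, op, *args):
        return self.KERNELS[op.name](*args)

def execute(node, backend, env):
    """Evaluate an Op graph bottom-up on whichever backend is plugged in."""
    args = [execute(i, backend, env) if isinstance(i, Op) else env[i]
            for i in node.inputs]
    return backend.run(node, *args)

# relu(x * w + b): defined once, runnable on any backend
graph = Op("relu", [Op("add", [Op("mul", ["x", "w"]), "b"])])
```

The point of the sketch is the seam: ML researchers work above `Op`, systems researchers work below `Backend`, and neither is tied to a specific framework.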

Our foray into deep learning systems started with a class project by Peifeng and Linh last Fall in my EECS 582 course. From a systems perspective, this is a very new and exciting area! We are learning new things everyday, ranging from low-level GPU programming to communication over the NVLink technology, and we are looking forward to a very exciting summer.

FTR, the HotOS PC accepted 29 papers out of 94 submissions this year.

Jack Kosaian Selected for NSF GRFP’2017!

Jack is the first student, graduate or undergraduate, that I have had the opportunity to work with since I joined Michigan. In just over a year working with me, he has an OSDI paper under his belt along with a first-authored submission to another major conference. It’s easy to forget that he’s still an undergrad!

Two other excellent graduate students from Michigan CSE — Andrew Quinn and Timothy Trippel — have also been selected. I’m fortunate to know both of them as well from my two EECS 582 offerings in 2016.

Many congratulations to everyone!

Infiniswap Accepted to Appear at NSDI’2017

Update: Camera-ready version is available here. Infiniswap code is now on GitHub!

As networks become faster, the difference between remote and local resources is blurring every day. How can we take advantage of these blurred lines? This is the key observation behind resource disaggregation and, to some extent, rack-scale computing. In this paper, we take our first stab at making memory disaggregation practical by exposing remote memory to unmodified applications. While there have been several proposals and feasibility studies in recent years, to the best of our knowledge, this is the first concrete step toward making it real.

Memory-intensive applications suffer large performance loss when their working sets do not fully fit in memory. Yet, they cannot leverage otherwise unused remote memory when paging out to disks even in the presence of large imbalance in memory utilizations across a cluster. Existing proposals for memory disaggregation call for new architectures, new hardware designs, and/or new programming models, making them infeasible.

This paper describes the design and implementation of Infiniswap, a remote memory paging system designed specifically for an RDMA network. Infiniswap opportunistically harvests and transparently exposes unused memory to unmodified applications by dividing the swap space of each machine into many slabs and distributing them across many machines’ remote memory. Because RDMA operations bypass remote CPUs, Infiniswap leverages the power of many choices to perform decentralized slab placements and evictions.

We have implemented and deployed Infiniswap on an RDMA cluster without any OS modifications and evaluated its effectiveness using multiple workloads running on unmodified VoltDB, Memcached, PowerGraph, GraphX, and Apache Spark. Using Infiniswap, throughputs of these applications improve between 7.1X (0.98X) and 16.3X (9.3X) over disk (Mellanox nbdX), and median and tail latencies between 5.5X (2X) and 58X (2.2X). Infiniswap does so with negligible remote CPU usage, whereas nbdX becomes CPU-bound. Infiniswap increases the overall memory utilization of a cluster and works well at scale.

This work started as a class project for EECS 582 in the Winter, when I gave the idea to Juncheng Gu and Youngmoon Lee, who made the pieces into a whole. Over the summer, Yiwen Zhang, an enterprising and excellent undergraduate, joined the project and helped us get it done in time.

This year the NSDI PC accepted 46 out of 255 papers. This happens to be my first paper with an all-blue cast! I want to thank Kang for giving me complete access to Juncheng and Youngmoon; it’s been great collaborating with them. I’m also glad that Yiwen has decided to start a Master’s and stay with us longer; more importantly, our team will remain intact for many more exciting follow-ups in this emerging research area.

If this excites you, come join our group!

TWO NSF Proposals Awarded as the Lead PI

The first one is on rack-scale computing using RDMA-enabled networks with Barzan Mozafari at the University of Michigan, and the second is on theoretical and systems implications of long-term fairness in cluster computing with Zhenhua Liu (Stony Brook University).

Thanks NSF!

Combined with the recent awards on geo-distributed analytics from NSF and Google, I’m looking forward to very exciting days ahead. If you want to be a part of these exciting efforts, consider joining my group!

Carbyne Accepted to Appear at OSDI’2016

Update: Camera-ready version is available here now!

With the wide adoption of distributed data-parallel applications, large-scale resource scheduling has become a constant source of innovation in recent years. There are tens of scheduling solutions that try to optimize for objectives such as user-level fairness, application-level performance, and cluster-level efficiency. However, given the well-known tradeoffs between fairness, performance, and efficiency, these solutions have traditionally focused on one primary objective (e.g., fairness in case of DRF), and they consider other objectives as best effort, secondary goals.

In this paper, we revisit the tradeoff space and demonstrate that aggressively optimizing one primary objective while giving up the rest is often unnecessary. Because a job cannot complete until all its tasks have completed, each job can altruistically yield some of its resources without hampering its own completion time. These altruistic resources can then be rescheduled among other jobs to significantly improve secondary objectives without hampering the primary one. The benefits of our approach are visible even for single-stage jobs, and they increase as jobs have more complex DAGs.

Given the well-known tradeoffs between performance, fairness, and efficiency, modern cluster schedulers focus on aggressively optimizing a single objective, while ignoring the rest. However, short-term convergence to a selected objective often does not result in noticeable long-term benefits. Instead, we propose an altruistic, long-term approach, where jobs yield fractions of their allocated resources without impacting their own completion times.

We show that leftover resources collected via the altruism of many jobs can then be rescheduled, essentially introducing a third degree of freedom in cluster scheduling — in addition to inter- and intra-job scheduling. We leverage this newfound flexibility in Carbyne, a scheduler that combines three existing schedulers to simultaneously optimize cluster-wide performance, fairness, and efficiency. Deployments and large-scale simulations on industrial benchmarks and production traces show that Carbyne closely approximates the state-of-the-art solutions for each of the three objectives simultaneously. For example, Carbyne provides 1.26X better efficiency and 1.59X lower average completion time than DRF, while providing similar fairness guarantees.
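The core altruism observation can be sketched in a few lines: a job only needs enough resources to finish its remaining work by its current completion time, and anything beyond that can be yielded. The function name and numbers below are hypothetical illustrations, not Carbyne's actual algorithm.

```python
def altruistic_yield(allocated_slots, remaining_work, time_to_completion):
    """Slots a job can give up without delaying its own completion.

    remaining_work is in slot-seconds; the minimum need is the slot count
    that just finishes the work by the job's current completion time."""
    min_needed = -(-remaining_work // time_to_completion)  # ceiling division
    return max(0, allocated_slots - min_needed)

# A job holding 10 slots with 12 slot-seconds of work and 4 seconds until
# its current completion time only needs ceil(12/4) = 3 slots; the other
# 7 can be pooled and rescheduled for secondary objectives.
spare = altruistic_yield(allocated_slots=10, remaining_work=12, time_to_completion=4)
```

The yielded slots form the cluster-wide pool that gives the scheduler its third degree of freedom: it can hand them to whichever jobs best improve the secondary objectives, without touching any job's completion time.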

Altruistic scheduling has many more use cases; e.g., we had a similar observation for coflow scheduling in Varys.

This work started as a collaboration with Robert Grandl and Aditya Akella toward the end of 2015. Ganesh Ananthanarayanan from MSR later joined us to take it to the next level. After CODA, this is related to another future work (the seventh) from my dissertation; infer whatever you want out of these two data points ;)

This year the OSDI PC accepted 47 out of 260 papers. This happens to be my first time submitting to OSDI. It’s also my first paper with Ganesh, even though it happened after we both graduated from Berkeley; we sat opposite to each other for four years back then! I also want to thank Aditya for letting me work closely with Robert; it’s been great collaborating with them.