Tag Archives: OSDI

Oort Wins the Distinguished Artifact Award at OSDI’2021. Congrats, Fan and Xiangfeng!

Oort, our federated learning system for scalable machine learning over millions of edge devices, has received the Distinguished Artifact Award at this year’s USENIX OSDI conference!

This is a testament to the hard work Fan and Xiangfeng have put in over the course of the last couple of years. Oort is our first foray into federated learning, but it certainly is not the last.

Oort and its workloads (FedScale) are both open-source at https://github.com/symbioticlab.

Oort Accepted to Appear at OSDI’2021

Oort’s working title was Kuiper.

With the wide deployment of AI/ML in our daily lives, the need for data privacy has been receiving more attention in recent years. Federated Learning (FL) is an emerging sub-field of machine learning that focuses on in-situ processing of data wherever it is generated. This is only going to become more important as regulations around data movement (e.g., GDPR, CCPA) become even more restrictive. Although there have already been a large number of FL algorithms from the ML community and some FL deployments at large companies, systems support for FL is somewhat non-existent. Oort is our effort toward building the first open-source FL system that allows FL developers to select participants for their training in an informed manner instead of selecting them at random. In the process, we have also collected the largest public dataset for FL, which we plan to open source in the near future.

Federated Learning (FL) is an emerging direction in distributed machine learning (ML) that enables in-situ model training and testing on edge data. Despite having the same end goals as traditional ML, FL executions differ significantly in scale, spanning thousands to millions of participating devices. As a result, data characteristics and device capabilities vary widely across clients. Yet, existing efforts randomly select FL participants, which leads to poor model and system efficiency.

In this paper, we propose Oort to improve the performance of federated training and testing with guided participant selection. With an aim to improve time-to-accuracy performance in model training, Oort prioritizes the use of those clients who have both data that offers the greatest utility in improving model accuracy and the capability to run training quickly. To enable FL developers to interpret their results in model testing, Oort enforces their requirements on the distribution of participant data while improving the duration of federated testing by cherry-picking clients. Our evaluation shows that, compared to existing participant selection mechanisms, Oort improves time-to-accuracy performance by 1.2X-14.1X and final model accuracy by 1.3%-9.8%, while efficiently enforcing developer-specified model testing criteria at the scale of millions of clients.
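To make guided participant selection concrete, here is a minimal Python sketch of the training-side idea: rank clients by a utility that rewards high statistical usefulness (approximated by each client’s aggregate training loss) and penalizes stragglers that exceed the developer’s preferred round duration. The exact utility form, the alpha exponent, and every name below are simplified assumptions for illustration, not Oort’s precise formulation.

```python
import math

def client_utility(losses, duration, deadline, alpha=2.0):
    """Score one client: statistical utility times a system-speed penalty.

    losses   -- per-sample training losses from the client's last round
    duration -- how long the client took to finish a round (seconds)
    deadline -- the developer's preferred round duration (seconds)
    alpha    -- straggler penalty exponent (an assumed knob for this sketch)
    """
    # Statistical utility: clients whose data still yields high loss are
    # likely to contribute more toward improving the model.
    stat_util = len(losses) * math.sqrt(sum(l * l for l in losses) / len(losses))
    # System utility: penalize clients slower than the target round duration.
    speed_penalty = (deadline / duration) ** alpha if duration > deadline else 1.0
    return stat_util * speed_penalty

def select_participants(clients, k):
    """Pick the top-k clients by utility instead of sampling at random."""
    return sorted(
        clients,
        key=lambda c: client_utility(c["losses"], c["duration"], c["deadline"]),
        reverse=True,
    )[:k]

# Three hypothetical clients: high-loss fast, low-loss fast, high-loss straggler.
clients = [
    {"id": 1, "losses": [2.1, 1.8, 2.5], "duration": 30, "deadline": 60},
    {"id": 2, "losses": [0.2, 0.1, 0.3], "duration": 20, "deadline": 60},
    {"id": 3, "losses": [2.0, 2.2, 1.9], "duration": 240, "deadline": 60},
]
print([c["id"] for c in select_participants(clients, 2)])  # -> [1, 2]
```

The real system also balances exploring unseen clients against exploiting known good ones, which this sketch omits entirely.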

Fan and Allen have been working on Oort since the summer of 2019, and it’s been a great learning experience for me. As always, it’s been a pleasure to collaborate with Harsha, and I look forward to many more collaborations in the future. Over the past two years, many others have joined Fan in our efforts toward providing systems support for federated learning and analytics, with many exciting results at different stages of the pipeline, focusing on cloud/edge/WAN challenges. It’s only going to become more exciting!

This is the first OSDI in an odd year, as OSDI moves to a yearly cadence. Although the number of submissions is lower than in the past, that is likely only due to the late announcement; serving on my first OSDI PC, I think the quality of the submitted and accepted papers remains as high as ever. Overall, the OSDI PC accepted 31 out of 165 submissions.

QOOP Accepted to Appear at OSDI’2018

Distributed data-parallel queries run on large collections of resources — within a single datacenter in the case of traditional big data queries, and across multiple datacenters in the case of geo-distributed analytics. The total amount of resources available to a running query changes over time for a variety of reasons, including changes in the fair share of assigned resources due to new job arrivals, machine failures, changes in spot instance pricing, and fluctuations in available WAN bandwidth. Traditional big data systems rely on the cluster scheduler to deal with these fluctuations: each query is assigned a fixed query plan by a query planner, which converts it into a distributed data-parallel job that is then executed by an execution engine. Over the years, this has led to pushing more and more complexity into the cluster scheduler for better performance. But what if we could change the query plan itself?

In QOOP, we do exactly that and show that it can lead to simpler execution engines and cluster schedulers. Instead of pushing complexity down the stack, QOOP allows the query planner to re-plan queries upon resource fluctuations using a simple greedy algorithm. We prove that this greedy approach remains competitive even under adversarial resource changes and empirically validate the theoretical result.

Modern data processing clusters are highly dynamic — both in terms of the number of concurrently running jobs and their resource usage. To improve job performance, recent works have focused on optimizing the cluster scheduler and the jobs’ query planner with a focus on picking the right execution plan — represented as a directed acyclic graph (DAG) — for a job in a resource-aware manner, and scheduling jobs in a DAG-aware manner. However, because existing solutions use a fixed DAG throughout the entire execution, the inability to adapt a DAG in reaction to resource changes often leads to large performance inefficiencies.

This paper argues for dynamic query re-planning, wherein we re-evaluate and re-plan a job’s DAG during its execution. We show that designing for re-planning requires fundamental changes to the interfaces between key layers of data analytics stacks today, i.e., the query planner, the execution engine, and the cluster scheduler. Instead of pushing more complexity into the scheduler or the query planner, we argue for a redistribution of responsibilities between the three components to simplify their designs. Under this redesign, we analytically show that a greedy algorithm for re-planning and execution alongside a simple max-min fair scheduler can offer provably competitive behavior even under adversarial resource changes. We prototype our algorithms atop Apache Hive and Tez. Using extensive experiments on a 20-node cluster, we show that our design can offer a median performance improvement of 1.47x compared to state-of-the-art alternatives.
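As a rough illustration of the re-planning loop, here is a hedged Python sketch; the cost model and all names below are our own simplifications rather than QOOP’s Hive/Tez implementation. Whenever the scheduler reports a new allocation, the planner re-estimates the completion time of each candidate DAG under that allocation, reusing work already completed, and greedily switches to whichever plan finishes soonest.

```python
def estimated_completion(plan, progress, allocation):
    """Crudely estimate the remaining completion time of a candidate plan.

    plan       -- list of (stage, total_work, max_parallelism) triples
    progress   -- dict: stage -> work already completed (results are reusable)
    allocation -- number of containers currently granted to the job

    Each stage runs at rate min(allocation, max_parallelism), one stage
    after another. Real cost models are far richer than this.
    """
    total = 0.0
    for stage, work, max_par in plan:
        remaining = max(work - progress.get(stage, 0), 0)
        total += remaining / min(max(allocation, 1), max_par)
    return total

def replan(plans, progress, allocation):
    """Greedy re-planning: switch to the plan that finishes soonest now."""
    return min(plans, key=lambda name: estimated_completion(
        plans[name], progress, allocation))

# Two equivalent plans for one query: a wide plan that soaks up containers,
# and a narrow plan whose stages cannot use more than 10 containers each.
plans = {
    "wide":   [("scan", 100, 100), ("wide_join", 400, 100)],
    "narrow": [("scan", 100, 100), ("agg", 200, 10), ("narrow_join", 100, 10)],
}
progress = {"scan": 100}  # the scan already finished; its output carries over

print(replan(plans, progress, allocation=50))  # -> wide   (resources abundant)
print(replan(plans, progress, allocation=5))   # -> narrow (after a resource drop)
```

The point of the sketch is the interface change: the planner, not the scheduler, reacts to the allocation change, and completed stages carry over across plans so re-planning does not throw work away.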

While QOOP focused on small-scale clusters, its core ideas can be applied to other dynamic query re-planning scenarios — for example, in response to WAN bandwidth fluctuations in the context of geo-distributed analytics.

This work is my second OSDI collaboration with Aditya after Carbyne in OSDI 2016 and my third OSDI paper overall. This time I got to work with another incredible student, Kshiteej Mahajan, as well as a terrific theoretician, Shuchi Chawla. I look forward to many more fruitful collaborations between our groups in the future.

Carbyne Accepted to Appear at OSDI’2016

Update: Camera-ready version is available here now!

With the wide adoption of distributed data-parallel applications, large-scale resource scheduling has become a constant source of innovation in recent years. There are dozens of scheduling solutions that try to optimize for objectives such as user-level fairness, application-level performance, and cluster-level efficiency. However, given the well-known tradeoffs between fairness, performance, and efficiency, these solutions have traditionally focused on one primary objective (e.g., fairness in the case of DRF) and treated the other objectives as best-effort, secondary goals.

In this paper, we revisit the tradeoff space and demonstrate that aggressively optimizing one primary objective while giving up on the rest often does not matter much. Because a job cannot complete until all its tasks have completed, each job can altruistically yield some of its resources without hampering its own completion time. These altruistically yielded resources can then be rescheduled among other jobs to significantly improve secondary objectives without hampering the primary one. The benefits of our approach are visible even for single-stage jobs, and they increase as jobs have more complex DAGs.

Given the well-known tradeoffs between performance, fairness, and efficiency, modern cluster schedulers focus on aggressively optimizing a single objective, while ignoring the rest. However, short-term convergence to a selected objective often does not result in noticeable long-term benefits. Instead, we propose an altruistic, long-term approach, where jobs yield fractions of their allocated resources without impacting their own completion times.

We show that leftover resources collected via the altruism of many jobs can then be rescheduled, essentially introducing a third degree of freedom in cluster scheduling — in addition to inter- and intra-job scheduling. We leverage this new-found flexibility in Carbyne, a scheduler that combines three existing schedulers to simultaneously optimize cluster-wide performance, fairness, and efficiency. Deployments and large-scale simulations on industrial benchmarks and production traces show that Carbyne closely approximates the state-of-the-art solutions for each of the three objectives simultaneously. For example, Carbyne provides 1.26X better efficiency and 1.59X lower average completion time than DRF, while providing similar fairness guarantees.
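To see why yielding can be free, here is a minimal sketch under a deliberately simple model of our own (independent tasks packed greedily onto slots; not Carbyne’s actual algorithm): a job whose completion time is pinned by its longest task gains nothing from extra slots, so it can keep the smallest allocation that preserves its completion time and altruistically yield the rest as leftover resources.

```python
import heapq

def makespan(task_durations, slots):
    """Completion time when tasks are packed greedily onto `slots` slots."""
    if slots <= 0:
        return float("inf")
    loads = [0.0] * slots
    heapq.heapify(loads)
    for d in sorted(task_durations, reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + d)
    return max(loads)

def altruistic_yield(task_durations, fair_share):
    """Find the smallest allocation preserving the job's completion time.

    Returns (slots_kept, slots_yielded); the yielded slots are the leftover
    resources the scheduler can redistribute toward other objectives.
    """
    target = makespan(task_durations, fair_share)
    kept = fair_share
    while kept > 1 and makespan(task_durations, kept - 1) <= target:
        kept -= 1
    return kept, fair_share - kept

# A job holding 8 slots whose completion time is pinned by one 10s task:
# the four 2s tasks fit on a single extra slot, so 6 of 8 slots are free.
print(altruistic_yield([10.0, 2.0, 2.0, 2.0, 2.0], fair_share=8))  # -> (2, 6)
```

The job finishes at time 10 either way; the six yielded slots are exactly the third degree of freedom the abstract describes.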

Altruistic scheduling has many more use cases; e.g., we had a similar observation for coflow scheduling in Varys.

This work started as a collaboration with Robert Grandl and Aditya Akella toward the end of 2015. Ganesh Ananthanarayanan from MSR later joined us to take it to the next level. After CODA, this is another paper related to a future work (the seventh) from my dissertation; infer whatever you want from these two data points ;)

This year the OSDI PC accepted 47 out of 260 papers. This happens to be my first time submitting to OSDI. It’s also my first paper with Ganesh, even though it happened after we both graduated from Berkeley; we sat opposite each other for four years back then! I also want to thank Aditya for letting me work closely with Robert; it’s been great collaborating with them.

EC-Cache Accepted to Appear at OSDI’2016

Update: Camera-ready version is available here now!

In-memory caching is the de facto solution for enabling low-latency analytics over large datasets. While caching objects, one must be careful to maximize the number of requests that can be served from memory in the presence of popularity skew, background load imbalance, and server failures. Traditional solutions use selective replication, i.e., adjusting the number of replicas based on object popularity, to address these issues. However, dynamically determining replication factors and reacting to hotspots and load imbalance quickly enough is often challenging.

We apply erasure coding to address the same set of challenges. Erasure coding is widely used in modern clusters to minimize storage usage, but it has not been applied to object caching, nor has it been considered a viable solution for load balancing. We show both analytically and empirically that, given the instruction sets in modern CPUs, erasure coding can be a viable solution that addresses the limitations of selective replication, as long as we are willing to sacrifice some network bandwidth. Given the rise of high-capacity network topologies and networking technologies, we believe this to be a timely tradeoff.

Data-intensive clusters rely on in-memory object caching to maximize the number of requests that can be served from memory in the presence of popularity skew, background load imbalance, and server failures. To improve I/O latency and load balancing, these caches typically employ selective replication, where the number of cached replicas of an object is proportional to its popularity. Selective replication, however, often falls short in practice, because it needs careful selection of replication factors and dynamic, load-aware replica placement.

EC-Cache is a load-balanced, high-performance cluster cache that uses erasure coding in a novel way to overcome the limitations of selective replication. EC-Cache employs two techniques: (a) splitting and erasure coding individual objects during writes, and (b) late binding during reads, wherein any k out of the (k + r) splits of an object are sufficient to reconstruct it. Compared to selective replication, EC-Cache improves load balancing by a factor of 3.3X and reduces the median and tail read latencies by more than 2X, while using the same amount of memory. EC-Cache does so using 10% additional bandwidth and a small increase in the amount of stored metadata. The benefits offered by EC-Cache are further amplified in the presence of background network load imbalance.
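To make the write and read paths concrete, here is a hedged, self-contained Python sketch that substitutes a single XOR parity split (k data splits, r = 1) for the general erasure codes a real cache would use (e.g., Reed-Solomon): writes split and encode each object, and reads reconstruct it from whichever k of the (k + r) splits arrive first, so no single slow or overloaded server sits on the read’s critical path.

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(obj: bytes, k: int):
    """Split an object into k data splits plus one XOR parity split (r = 1)."""
    obj += b"\x00" * ((-len(obj)) % k)      # pad to a multiple of k
    size = len(obj) // k
    data = [obj[i * size:(i + 1) * size] for i in range(k)]
    return data + [reduce(xor, data)]       # indices 0..k-1: data, k: parity

def decode(arrived, k):
    """Late binding: reconstruct the object from ANY k of the k+1 splits.

    arrived -- dict mapping split index -> split bytes, len(arrived) == k
    """
    if all(i in arrived for i in range(k)):  # all data splits made it
        return b"".join(arrived[i] for i in range(k))
    missing = next(i for i in range(k) if i not in arrived)
    arrived[missing] = reduce(xor, arrived.values())  # parity ^ rest = missing
    return b"".join(arrived[i] for i in range(k))

obj = b"an object cached across k + 1 servers"
splits = encode(obj, k=4)
# Pretend the server holding split 2 is overloaded; the read binds to the
# first k responders, which happen to include the parity split instead.
first_k = {i: s for i, s in enumerate(splits) if i != 2}
assert decode(first_k, k=4).rstrip(b"\x00") == obj
print("reconstructed without waiting on the slow server")
```

Late binding is what turns coding into a load-balancing tool: a read never has to wait on the most loaded server, at the cost of fetching r extra splits’ worth of bandwidth.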

The genesis of this work traces back to 2013, when I was just wrapping up Sinbad and heard Rashmi Vinayak give a talk about her work at Facebook. Although we planned to collaborate many times, we were always busy with other things and never managed to work together until very recently. During this time, the project evolved from disk-based storage to a general object store to its current form: an in-memory object cache. The very energetic Jack Kosaian tracked me down last year to work on something exciting, which also helped finally get the project going at full speed. I’ve been very fortunate with my collaborators.

This year the OSDI PC accepted 47 out of 260 papers. This happens to be my first time submitting to OSDI. It’s also my first paper with my own student. Jack has been a tremendous force that kept the project going and helped Rashmi proceed at a fast pace. I also want to thank Rashmi for mentoring Jack throughout the process. I love this advising business :) Go Blue!