Tag Archives: Spark

Orchestra is the Default Broadcast Mechanism in Apache Spark

With its recent release, Apache Spark has promoted Cornet—the BitTorrent-like broadcast mechanism proposed in Orchestra (SIGCOMM’11)—to be its default broadcast mechanism. It’s great to see our research make its way into the real world! Many thanks to Reynold and others for making it happen.

MLlib, the machine learning library of Spark, will enjoy the biggest boost from this change because of the broadcast-heavy nature of many machine learning algorithms.
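To see why broadcast-heavy workloads benefit, consider a back-of-the-envelope cost model (our own toy simplification, not Cornet’s actual algorithm): with a centralized broadcast, the driver’s link must carry one full copy of the data per worker, whereas a BitTorrent-like scheme lets workers re-share blocks among themselves, so completion time grows roughly logarithmically with cluster size.

```python
import math

# Toy cost model (illustrative only): time to broadcast `data_bytes`
# of model parameters to `n_workers`, each link at `link_bw` bytes/sec.

def centralized_broadcast_time(data_bytes, n_workers, link_bw):
    # The single source uploads one copy per worker over its own link,
    # so completion time grows linearly with the number of workers.
    return n_workers * data_bytes / link_bw

def p2p_broadcast_time(data_bytes, n_workers, link_bw):
    # BitTorrent-like: every worker that has the data re-shares it, so
    # the number of sources roughly doubles each round and completion
    # time grows only logarithmically with the number of workers.
    return math.ceil(math.log2(n_workers + 1)) * data_bytes / link_bw

model = 100e6   # a 100 MB model broadcast every iteration
bw = 1e9 / 8    # a 1 Gbps link, in bytes/sec
print(centralized_broadcast_time(model, 100, bw))  # 80.0 seconds
print(p2p_broadcast_time(model, 100, bw))          # 5.6 seconds
```

With an iterative algorithm re-broadcasting its model every iteration, this per-iteration gap compounds over the whole job.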

Spark has been accepted at NSDI’2012

Our paper “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing” has been accepted at NSDI’2012. This is Matei’s brainchild and joint work with many people including, but not limited to, TD, Ankur, Justin, Murphy, and professors Ion Stoica, Scott Shenker, and Michael Franklin. Unlike many other systems papers, Spark is actively developed and used by many people. You can also download it and use it in no time to solve all your problems; well, at least the ones that require analyzing big data in little time. In this paper, we focus on the concept of resilient distributed datasets, or RDDs, and show how we can perform fast, in-memory iterative and interactive jobs with low-overhead fault tolerance.

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including current specialized programming models for iterative jobs like Pregel. We have implemented RDDs in a system called Spark, which we evaluate through a variety of benchmarks and user applications.

The NSDI’2012 PC accepted 30 out of 169 papers. In other news, Berkeley will have a big presence at NSDI this time with several other papers. Go Bears!!!

Distributed in-memory datasets

AMPLab, “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,” UCB/EECS-2011-82, 2011. [PDF]

Russell Power, Jinyang Li, “Piccolo: Building Fast, Distributed Programs with Partitioned Tables,” OSDI, 2010. [PDF]

Summary

MapReduce and similar frameworks, while widely applicable, are limited to directed acyclic data flow models, do not expose global state, and are generally slow due to the lack of support for in-memory computation. MPI, while extremely powerful, is hard to use for non-experts. An ideal solution would be a compromise between the two approaches. Spark and Piccolo try to approximate that ideal within the MapReduce-to-MPI spectrum using in-memory data abstractions.

Piccolo

Piccolo provides a distributed key-value store-like abstraction, where applications/tasks can read from and write to shared storage. Users write partition functions to divide the data across multiple machines, control functions to decide the workflow, kernel functions to perform distributed operations on mutable state, and conflict-resolution functions to resolve write-write conflicts. Piccolo uses the Chandy-Lamport snapshot algorithm for periodic checkpointing and, when required, rolls back all tasks of a failed job to the last checkpoint.
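A toy sketch of the table abstraction follows (the class and function names are ours for illustration, not Piccolo’s actual API): the user supplies a partition function to route keys and a conflict-resolution function that merges concurrent writes to the same key.

```python
# Toy sketch of a Piccolo-style partitioned table (illustrative only).
class PartitionedTable:
    def __init__(self, num_partitions, partition_fn, resolve_fn):
        self.parts = [dict() for _ in range(num_partitions)]
        self.partition_fn = partition_fn  # key -> partition index
        self.resolve_fn = resolve_fn      # (old, new) -> merged value

    def update(self, key, value):
        # Writes to an existing key are merged via the user-supplied
        # conflict-resolution function instead of clobbering each other.
        part = self.parts[self.partition_fn(key)]
        part[key] = self.resolve_fn(part[key], value) if key in part else value

    def get(self, key):
        return self.parts[self.partition_fn(key)][key]

# Example: PageRank-style kernels summing rank contributions into
# shared mutable state, resolved by addition.
table = PartitionedTable(4,
                         partition_fn=lambda k: hash(k) % 4,
                         resolve_fn=lambda old, new: old + new)
for key, contrib in [("pageA", 0.25), ("pageA", 0.5), ("pageB", 1.0)]:
    table.update(key, contrib)
print(table.get("pageA"))  # 0.75
```

Because state is mutable and updates are fine-grained, recovery needs whole-table snapshots rather than replaying a log of transformations, which is exactly why Piccolo checkpoints.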

Spark

Spark is a distributed programming model based on a distributed in-memory data abstraction called Resilient Distributed Datasets (RDDs). RDDs are immutable, support coarse-grained transformations, and keep track of the transformations that have been applied to them using lineage, which can be used to reconstruct lost RDDs. As a result, checkpointing requirements/overheads are low in Spark.
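The lineage idea can be sketched in a few lines (a toy illustration, not Spark’s implementation): each derived dataset records only its parent and the coarse-grained transformation that produced it, so a lost in-memory partition can be recomputed on demand instead of being restored from a checkpoint.

```python
# Toy sketch of lineage-based recovery (illustrative only).
class ToyRDD:
    """Records lineage (a parent plus a transformation) instead of data."""
    def __init__(self, parent=None, transform=None, source=None):
        self.parent, self.transform, self.source = parent, transform, source
        self._cache = None  # materialized contents, if any

    def map(self, f):
        return ToyRDD(parent=self, transform=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self,
                      transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        if self._cache is None:          # lost or never materialized:
            if self.source is not None:  # a base RDD re-reads its input
                self._cache = list(self.source)
            else:                        # a derived RDD replays its lineage
                self._cache = self.transform(self.parent.collect())
        return self._cache

base = ToyRDD(source=range(10))
evens_squared = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(evens_squared.collect())  # [0, 4, 16, 36, 64]
evens_squared._cache = None     # simulate losing the in-memory data
print(evens_squared.collect())  # recomputed from lineage, same result
```

The lineage record here is a handful of function pointers, which is why checkpointing overhead stays low: nothing proportional to the data size needs to be written out.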

Spark vs Piccolo

There are two key differences between Spark and Piccolo.

  1. RDDs only support coarse-grained writes (transformations), as opposed to the finer-grained writes supported by Piccolo’s distributed tables. This allows efficient storage of lineage information, which reduces checkpointing overhead and enables fast fault recovery. However, it makes Spark unsuitable for applications that depend on fine-grained updates.
  2. RDDs are immutable, which enables straggler mitigation by speculative execution in Spark.

Comments

Piccolo is closer to MPI, while Spark is closer to MapReduce on the MapReduce-to-MPI spectrum. The key tradeoff in both cases, however, is between a framework’s usability and its applicability/power (framework complexity follows power). Both frameworks are much faster than Hadoop (though remember that Hadoop is not the best implementation of MapReduce), and a large fraction of that speedup comes from the use of memory. Maybe I am biased as a member of the Spark project, but Spark should be good enough for most applications unless they absolutely require fine-grained updates.

Technical report on Spark is available online

A technical report describing the key concepts behind Spark is available online. The abstract follows:

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce. RDDs are motivated by two types of applications that current data flow systems handle inefficiently: iterative algorithms, which are common in graph applications and machine learning, and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a highly restricted form of shared memory: they are read-only datasets that can only be constructed through bulk operations on other RDDs. However, we show that RDDs are expressive enough to capture a wide class of computations, including MapReduce and specialized programming models for iterative jobs such as Pregel. Our implementation of RDDs can outperform Hadoop by 20x for iterative jobs and can be used interactively to search a 1 TB dataset with latencies of 5-7 seconds.

You can also download the 0.3 release of Spark and read the corresponding release notes from the Spark download page.

Spark’s in the wild

We have been working on the Spark cluster computing framework for the last couple of years. It has always been open source under the BSD license on GitHub. But yesterday, during the AMPLab summer retreat at Chaminade, Santa Cruz, Matei announced the official launch of the Spark website (spark-project.org) and mailing lists, along with its 0.2 release.

Head over to the website to learn more about it, to download it, and to solve real-world problems (e.g., spam filtering, natural language processing, and traffic estimation) lightning fast! It will get even better as more people use it and contribute back.

Orchestra has been accepted at SIGCOMM’2011

Update: The camera-ready version of the paper will be available on the publications page very soon!

Our paper “Managing Data Transfers in Computer Clusters with Orchestra” has been accepted at SIGCOMM’2011. This is joint work with Matei, Justin, and professors Mike Jordan and Ion Stoica. The project started as part of Spark and is now quickly expanding to stand on its own and support other data-intensive frameworks (e.g., Hadoop, Dryad). We also believe that interfacing Orchestra with Mesos will enable better network sharing between concurrently running frameworks in data centers.

Cluster computing applications like MapReduce and Dryad transfer massive amounts of data between their computation stages. These transfers can have a significant impact on job performance, accounting for more than 50% of job completion times. Despite this impact, there has been relatively little work on optimizing the performance of these data transfers. In this paper, we propose a global management architecture and a set of algorithms that improve the transfer times of common communication patterns, such as broadcast and shuffle, and allow one to prioritize a transfer over other transfers belonging to the same application or different applications. Using a prototype implementation, we show that our solution improves broadcast completion times by up to 4.5x compared to the status quo implemented by Hadoop. Furthermore, we show that transfer-level scheduling can reduce the completion time of high-priority transfers by 1.7x.
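The benefit of transfer-level prioritization can be seen with a toy two-transfer model on a single bottleneck link (our own simplification, not Orchestra’s scheduler): under fair sharing the transfers split the bandwidth, while strict priority lets the high-priority transfer run alone and finish sooner.

```python
# Toy model (illustrative only): two transfers of the given sizes share
# one link of bandwidth `bw`. Returns (high_done, low_done) times.

def priority_completion(high, low, bw):
    # Strict priority: the high-priority transfer gets the whole link,
    # then the low-priority transfer runs.
    t_high = high / bw
    t_low = t_high + low / bw
    return t_high, t_low

def fair_completion(high, low, bw):
    # Fair sharing: each transfer sees bw/2 until the smaller finishes,
    # after which the survivor gets the full link.
    small, large = min(high, low), max(high, low)
    t_small = 2 * small / bw
    t_large = t_small + (large - small) / bw
    return (t_small, t_large) if high <= low else (t_large, t_small)

bw = 1.0  # normalized link bandwidth; sizes in the same units
print(priority_completion(1.0, 1.0, bw))  # high finishes at t=1.0
print(fair_completion(1.0, 1.0, bw))      # high finishes at t=2.0
```

Even in this crude model the high-priority transfer finishes 2x sooner under prioritization, which is the same flavor of gain the paper measures (1.7x) for real high-priority transfers.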

The paper has been well received so far, and we got great feedback from the anonymous reviewers that will further strengthen it. Hopefully, you will like it too :)

For those interested in stats: this year SIGCOMM accepted 32 out of 223 submissions.

Anyway, it’s Friday and we so excited!

Spark short paper has been accepted at HotCloud’10

An initial overview of our ongoing work on Spark, an iterative and interactive framework for cluster computing, has been accepted at HotCloud’10. I joined the project last February, while Matei has been working on it since last fall. I will upload the paper to the publications page once we have taken care of the reviewer comments/suggestions; meanwhile, you can read the technical report version.

This year HotCloud accepted 18 papers (24% of the submitted papers), and the PC is thinking about extending the workshop to a second day starting next year.