
Spark has been accepted at NSDI’2012

Our paper “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing” has been accepted at NSDI’2012. This is Matei’s brainchild and joint work with many people including, but not limited to, TD, Ankur, Justin, Murphy, and professors Ion Stoica, Scott Shenker, and Michael Franklin. Unlike the systems in many other papers, Spark is actively developed and used by many people. You can also download it and use it in no time to solve all your problems; well, at least the ones that require analyzing big data in little time. In this paper, we focus on the concept of resilient distributed datasets, or RDDs, and show how they enable fast, in-memory iterative and interactive jobs with low-overhead fault tolerance.

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including current specialized programming models for iterative jobs like Pregel. We have implemented RDDs in a system called Spark, which we evaluate through a variety of benchmarks and user applications.
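For a taste of what this looks like in practice, here is a minimal sketch of RDD code in Spark’s Scala API, in the spirit of the log-mining example the paper walks through (the HDFS path is a placeholder, and `sc` stands in for a SparkContext):

```scala
// Load error lines from a log file into a fault-tolerant in-memory dataset.
val lines  = sc.textFile("hdfs://...")            // RDD backed by a file in HDFS
val errors = lines.filter(_.startsWith("ERROR"))  // coarse-grained transformation
errors.cache()                                    // keep the result in cluster memory

// Subsequent actions reuse the cached dataset instead of rereading from disk;
// lost partitions can be recomputed from the lineage of transformations above
// rather than restored from checkpoints.
errors.count()
errors.filter(_.contains("MySQL")).count()
```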

The NSDI’2012 PC accepted 30 out of 169 papers. In other news, Berkeley will have a big presence at NSDI this time with several other papers. Go Bears!!!

Presented Orchestra at SIGCOMM’2011

I’m attending my second SIGCOMM and had the privilege of giving my first talk at the flagship networking conference. I presented Orchestra, and the talk was very well attended even though it was the last one of the day at 6PM. I’d like to thank everyone for showing up and for the lively Q&A session at the end. Now that the talk is over, I can enjoy the rest of the conference in a more relaxed fashion.

The slides for the talk are available here.

Technical report on Spark is available online

A technical report describing the key concepts behind Spark is now available online. The abstract follows:

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce. RDDs are motivated by two types of applications that current data flow systems handle inefficiently: iterative algorithms, which are common in graph applications and machine learning, and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a highly restricted form of shared memory: they are read-only datasets that can only be constructed through bulk operations on other RDDs. However, we show that RDDs are expressive enough to capture a wide class of computations, including MapReduce and specialized programming models for iterative jobs such as Pregel. Our implementation of RDDs can outperform Hadoop by 20x for iterative jobs and can be used interactively to search a 1 TB dataset with latencies of 5-7 seconds.
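To make the “iterative algorithms” case concrete, here is a sketch of logistic regression in Spark’s Scala API, adapted from the kind of example our papers use; `sc` stands in for a SparkContext, and the line format, `parsePoint` helper, and constants are assumptions for illustration:

```scala
case class Point(x: Array[Double], y: Double)

// Hypothetical parser, assuming one "label f1 f2 ... fD" record per line.
def parsePoint(line: String): Point = {
  val parts = line.split(' ').map(_.toDouble)
  Point(parts.tail, parts.head)
}

val D = 10           // number of features (assumed)
val ITERATIONS = 10

// The dataset is parsed once and cached in memory across all iterations.
val points = sc.textFile("hdfs://.../points").map(parsePoint).cache()

var w = Array.fill(D)(math.random)   // random initial weights
for (i <- 1 to ITERATIONS) {
  // One coarse-grained pass over the cached RDD computes the full gradient.
  val gradient = points.map { p =>
    val margin = (w, p.x).zipped.map(_ * _).sum
    val scale  = (1.0 / (1.0 + math.exp(-p.y * margin)) - 1.0) * p.y
    p.x.map(_ * scale)
  }.reduce((a, b) => (a, b).zipped.map(_ + _))
  w = (w, gradient).zipped.map(_ - _)              // gradient descent step
}
```

Because `points` stays in memory, each iteration avoids rereading and reparsing the input, which is where the order-of-magnitude speedups over on-disk data flow systems come from.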

You can also download the 0.3 release of Spark and read the corresponding release notes on the Spark download page.

Spark’s in the wild

We have been working on the Spark cluster computing framework for the last couple of years. It has always been open source under the BSD license on GitHub. But yesterday, during the AMPLab summer retreat at Chaminade, Santa Cruz, Matei announced the official launch of the Spark website (spark-project.org) and mailing lists, along with its 0.2 release.

Head over to the website to learn more, download it, and solve real-world problems (e.g., spam filtering, natural language processing, and traffic estimation) lightning fast! It will only get better as more people use it and contribute back.

Presented Orchestra at LBNL

Today I presented Orchestra for the first time in front of a crowd outside our lab. Taghrid Samak kindly invited me to LBNL’s Computing Sciences Seminar after we caught up over lunch last week for the first time in a year. She is currently a postdoctoral fellow with the Advanced Computing for Science group.

Overall, the talk went very well, with some interesting questions. We might even end up collaborating on future extensions to some pieces of Orchestra. Hot stuff!

Orchestra has been accepted at SIGCOMM’2011

Update: The camera-ready version of the paper will be available on the publications page very soon!

Our paper “Managing Data Transfers in Computer Clusters with Orchestra” has been accepted at SIGCOMM’2011. This is joint work with Matei, Justin, and professors Mike Jordan and Ion Stoica. The project started as part of Spark and is now quickly expanding to stand on its own and support other data-intensive frameworks (e.g., Hadoop, Dryad). We also believe that interfacing Orchestra with Mesos will enable better network sharing between concurrently running frameworks in data centers.

Cluster computing applications like MapReduce and Dryad transfer massive amounts of data between their computation stages. These transfers can have a significant impact on job performance, accounting for more than 50% of job completion times. Despite this impact, there has been relatively little work on optimizing the performance of these data transfers. In this paper, we propose a global management architecture and a set of algorithms that improve the transfer times of common communication patterns, such as broadcast and shuffle, and allow one to prioritize a transfer over other transfers belonging to the same application or different applications. Using a prototype implementation, we show that our solution improves broadcast completion times by up to 4.5x compared to the status quo implemented by Hadoop. Furthermore, we show that transfer-level scheduling can reduce the completion time of high-priority transfers by 1.7x.
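As a toy illustration of the transfer-prioritization idea (this is not Orchestra’s actual code; the names and the simple proportional-share policy are a sketch), consider dividing a shared link’s bandwidth among active transfers in proportion to per-transfer weights:

```scala
// Each transfer gets a share of the link proportional to its weight, so a
// high-priority transfer finishes sooner than under plain fair sharing.
case class Transfer(name: String, weight: Double, remainingMB: Double)

def allocate(linkMBps: Double, transfers: Seq[Transfer]): Map[String, Double] = {
  val active = transfers.filter(_.remainingMB > 0)
  val totalWeight = active.map(_.weight).sum
  active.map(t => t.name -> linkMBps * t.weight / totalWeight).toMap
}

// A weight-3 high-priority shuffle gets 75 MB/s of a 100 MB/s link,
// while a weight-1 background broadcast gets the remaining 25 MB/s.
val rates = allocate(100.0, Seq(
  Transfer("highPriorityShuffle", 3.0, 500),
  Transfer("backgroundBroadcast", 1.0, 500)
))
println(rates)
```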

The paper has been well received so far, and we got great feedback from the anonymous reviewers that will further strengthen it. Hopefully, you will like it too :)

For those interested in stats: SIGCOMM accepted 32 out of 223 submissions this year.

Anyway, it’s Friday and we so excited!

Spark short paper has been accepted at HotCloud’10

An initial overview of our ongoing work on Spark, an iterative and interactive framework for cluster computing, has been accepted at HotCloud’10. I joined the project last February, while Matei has been working on it since last Fall. I will upload the paper to the publications page once we have taken care of the reviewer comments and suggestions; meanwhile, you can read the technical report version.

This year HotCloud accepted 18 papers (24% of those submitted), and the PC is thinking about extending the workshop to a second day starting next year.