All posts by Mosharaf

Technical report on Spark is available Online

A technical report describing the key concepts behind Spark is available online. The abstract goes below:

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce. RDDs are motivated by two types of applications that current data flow systems handle inefficiently: iterative algorithms, which are common in graph applications and machine learning, and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a highly restricted form of shared memory: they are read-only datasets that can only be constructed through bulk operations on other RDDs. However, we show that RDDs are expressive enough to capture a wide class of computations, including MapReduce and specialized programming models for iterative jobs such as Pregel. Our implementation of RDDs can outperform Hadoop by 20x for iterative jobs and can be used interactively to search a 1 TB dataset with latencies of 5-7 seconds.

You can also download the 0.3 release of Spark and read corresponding release notes from the Spark download page.

Spark’s in the wild

We have been working on the Spark cluster computing framework for last couple of years. It has always been open source under the BSD license in github. But yesterday Matei declared official launch of the spark website (spark-project.org) and mailing lists along with its 0.2 release to everyone during the AMPLab summer retreat at Chaminede, Santa Cruz.

Head over to the website to learn more about it, to download, and to solve real world problems (e.g., spam filtering, natural language processing and traffic estimation) lightning fast! It will get even better as more people use it and contribute back.

Extended version of ViNEYard has been accepted in IEEE/ACM ToN

An extended, updated, and emended version of our ViNEYard paper in INFOCOM’09 has been accepted for publication in IEEE/ACM Transactions on Networking after yearlong multiple rounds of reviews. Since there is normally a long queue for actually getting an accepted ToN paper printed, its hard to tell when ours will officially be out there. I’d like to thank our anonymous reviewers who took great amount of care to find existing issues in the original submission and suggested directions for improvements. Thanks also to Cedric Westphal for moderating the process in a fast and efficient manner. Lastly, without Prof. Boutaba’s insistence on getting it done (even after both Muntasir and I had left Waterloo,) this paper would never have come to be.

Apart from many small tweaks and fixes (some reported by researchers who read the INFOCOM version; thank you), the following are include some of the notable changes:

  • Introduced WiNE, a generalized window-based VN embedding mechanism for equipping any existing online VN embedding algorithm with lookahead capabilities.
  • Compared the ViNEYard algorithms with their counterparts under the influence of WiNE ;)
  • Included time-complexity expressions from D-ViNE and R-ViNE.
  • Added extended evaluation results on the comparative run times of the proposed algorithms.
  • Updated/added references to related work.
  • Provided approximation ratio for D-ViNE under a restricted model.

Presented Orchestra at LBNL

Today I presented Orchestra for the first time in front of a crowd outside our lab. Taghrid Samak kindly invited me at LBNL’s Computing Sciences Seminar after we caught up over lunch last week, after a year. She is currently a post-doc fellow with the Advance Computing for Science group.

Overall, the talk went very well with some interesting questions. We might even get into future extension/collaboration work regarding some pieces of Orchestra. Hot stuff!

Orchestra has been accepted at SIGCOMM’2011

Update: Camera-ready version of the paper should be can be found in the publications page very soon!

Our paper “Managing Data Transfers in Computer Clusters with Orchestra” has been accepted at SIGCOMM’2011. This is a joint work with Matei, Justin, and professors Mike Jordan and Ion Stoica. The project started as part of Spark and now quickly expanding to stand on its own to support other data-intensive frameworks (e.g., Hadoop, Dryad etc.). We also believe that interfacing Orchestra with Mesos will enable better network sharing between concurrently running frameworks in data centers.

Cluster computing applications like MapReduce and Dryad transfer massive amounts of data between their computation stages. These transfers can have a significant impact on job performance, accounting for more than 50% of job completion times. Despite this impact, there has been relatively little work on optimizing the performance of these data transfers. In this paper, we propose a global management architecture and a set of algorithms that improve the transfer times of common communication patterns, such as broadcast and shuffle, and allow one to prioritize a transfer over other transfers belonging to the same application or different applications. Using a prototype implementation, we show that our solution improves broadcast completion times by up to 4.5x compared to the status quo implemented by Hadoop. Furthermore, we show that transfer-level scheduling can reduce the completion time of high-priority transfers by 1.7x.

The paper so far have been well-received, and we’ve got great feedback from the anonymous reviewers that will further strengthen it. Hopefully, you will like it too :)

Those who are interested in stats, this year SIGCOMM accepted 32 out of 223 submissions.

Anyway, it’s Friday and we so excited!

MSR Cambridge, here I come!

I’m going to spend this Summer in stealth mode at Microsoft Research, Cambridge working with Christos Gkantsidis and Hitesh Ballani on a super-secret project. Hopefully, we’ll have some cool results on a hot topic.

This will be my first time in England/UK as well. Looking forward to the English weather that I’ve heard so much about! A Schengen visa is also accompanying me. So don’t be surprised if I’m seen in random European cities.

PolyViNE has been accepted at VISA’2010

Our paper, “PolyViNE: Policy-based Virtual Network Embedding Across Multiple Domains” is set to appear in VISA’2010 workshop (with SIGCOMM’2010) in New Delhi. I worked on it during my last few months in Waterloo (circa Winter/Spring 2009), and it has been lying around ever since because everyone had been busy. Finally, its going to wake up and smell a workshop.

Intra-domain virtual network embedding (ViNE) is a well studied problem in the network virtualization literature. For most practical purposes, however, virtual networks (VNs) must be provisioned across heterogeneous administrative domains managed by multiple infrastructure providers (InPs).

In this paper we present PolyViNE, a policy-based inter-domain VN embedding framework that embeds end-to-end VNs in a decentralized manner. PolyViNE introduces a distributed protocol that coordinates the VN embedding process across participating InPs and ensures competitive prices for service providers (SPs), i.e., VN owners. We also present a location aware VN request forwarding mechanism — based on a hierarchical addressing scheme (COST) and a location awareness protocol (LAP) — to allow faster embedding and outline scalability and performance characteristics of PolyViNE through quantitative and qualitative evaluations.

As always, the paper can be found in my publications page. once I upload it (not yet).

Berkeley computer science networking Prelim reading list (Spring’10)

Somehow I never managed to upload the reading list after I finished my Prelim. Here it is finally, or they are – too many of them (not my fault).

This zipped file contains reading lists for the grad networking/network security courses from different years along with the grandfather of all reading lists for networking prelim compiled a decade ago. I didn’t read all of them, but most of them are good/great papers anyway.

All credits to Sara and Sameer for prodding me to put the files together. Best of luck to them!

Download: Berkeley CS Networking Prelim Reading List (Spring’10 Edition) [ZIP]

Spark short paper has been accepted at HotCloud’10

An initial overview of our ongoing work on Spark, an iterative and interactive framework for cluster computing, has been accepted at HotCloud’10. I’ve been joined the project last February, while Matei has been working on it since last Fall. I will have uploaded the paper in the publications page. once we have taken care of the reviewer comments/suggestions, meanwhile you can read the technical report version.

This year HotCloud accepted 18 papers (24% of the submitted papers), and the PC are thinking about extending the workshop to a 2nd day from next year.

Prelim, begone, I will have no more of thee!

Update: I’ve finally uploaded the reading list. Here it is!

Took the networking Prelim last Thursday and passed. I was a never fan of oral interrogations; heck, I am not a fan of any interrogation for that matter. And this one was really a close call; I am just happy to reach alive at the other end of the tunnel. What a relief! Thanks to everyone, specially Ganesh and Matei, who helped me prepare with mock Prelims and everything else.

I will have the (unofficial) reading list for this Prelim in a future post.

On another note, do watch Rosemary’s Baby to find out where the title of this post originated from. Believe it or not, I had this post in mind since I heard Rosemary saying “Pain, begone, I will have no more thee!”It feels great to be able to use it finally!