Graph-parallel frameworks

Google, “Pregel: A System for Large-Scale Graph Processing,” SIGMOD, 2010. [PDF]

Carnegie Mellon, “GraphLab: A New Framework for Parallel Machine Learning,” arXiv:1006.4990, 2010. [PDF]

Summary

Data-parallel frameworks such as MapReduce and Dryad are good at embarrassingly parallel jobs. They are a poor fit for iterative jobs and for jobs whose data dependencies across stages are sparse, because they assume dense dependencies (e.g., in MapReduce, each reducer is likely to depend on every mapper). Many problems, especially in machine learning, can be expressed intuitively as graphs with sparse computational dependencies, require multiple iterations to converge, and converge at different rates for different parameters. Pregel and GraphLab are two frameworks optimized for this class of graph-based problems.

A typical graph-parallel problem is expressed as a graph of vertices and edges, where each vertex and each edge has data associated with it. In every iteration, vertex and edge data are updated and messages are exchanged between neighboring entities. The update function is typically the same for every vertex, and it is written by the user. There may or may not be a synchronization step at the end of every iteration. In a distributed setting, the graph is partitioned across multiple nodes, and updates from the vertices on one node are communicated to other nodes using message passing.
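The iteration pattern described above can be sketched in a few lines of Python. This is a minimal single-machine illustration with a barrier after every iteration, not the API of Pregel or GraphLab; the names `run_supersteps` and `propagate_max`, the dictionary-based graph layout, and the rule that a vertex stays active only while it receives messages are all assumptions made for the example.

```python
from collections import defaultdict


def run_supersteps(graph, values, compute, max_supersteps=50):
    """Run a user-supplied vertex function in synchronized iterations.

    graph:   dict mapping each vertex to a list of its neighbors
    values:  dict mapping each vertex to its current data (updated in place)
    compute: hypothetical user function
             (vertex, value, messages, neighbors, superstep)
             -> (new_value, [(target, message), ...])
    """
    inbox = {v: [] for v in graph}   # messages delivered for the current step
    active = set(graph)              # every vertex runs in the first step
    steps = 0
    while active and steps < max_supersteps:
        outbox = defaultdict(list)
        for v in active:
            new_value, outgoing = compute(v, values[v], inbox[v], graph[v], steps)
            values[v] = new_value
            for target, message in outgoing:
                outbox[target].append(message)
        # Barrier: messages sent in this step become visible only in the next one.
        inbox = {v: outbox.get(v, []) for v in graph}
        active = {v for v in graph if inbox[v]}
        steps += 1
    return values, steps


def propagate_max(vertex, value, messages, neighbors, superstep):
    """Example vertex function: spread the largest value through the graph."""
    new_value = max([value] + messages)
    # Send only in the first step or when the value improved, so the job halts.
    if superstep == 0 or new_value > value:
        return new_value, [(n, new_value) for n in neighbors]
    return new_value, []
```

Calling `run_supersteps(graph, values, propagate_max)` on a small connected graph leaves every vertex holding the maximum initial value. Dropping the barrier, so that a vertex sees its neighbors' updates as soon as they happen, is what turns this synchronous loop into the asynchronous style.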

Pregel vs GraphLab

The key difference between Pregel and GraphLab is that Pregel has a barrier at the end of every iteration, whereas GraphLab is completely asynchronous. Asynchrony allows GraphLab to prioritize more complex vertices over others, but it also requires consistency models to keep concurrent updates from corrupting results. GraphLab proposes three consistency models (full, edge, and vertex consistency) that allow different levels of parallelism. Another difference is that Pregel allows dynamic modifications to the graph structure, whereas GraphLab does not.
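To make the three consistency models concrete, the sketch below checks whether two vertex updates could safely run at the same time under each model. It encodes my reading of the GraphLab paper rather than anything stated in this summary: full consistency rules out vertices that are adjacent or share a neighbor, edge consistency rules out only adjacent vertices, and vertex consistency only prevents two updates on the same vertex. The function name `can_run_concurrently` and the adjacency-list layout are hypothetical.

```python
def can_run_concurrently(graph, u, v, model):
    """Return True if updates on u and v may overlap under the given model.

    graph: dict mapping each vertex to a list of its neighbors (undirected)
    model: "vertex", "edge", or "full"
    """
    if u == v:
        return False  # no model allows two simultaneous updates of one vertex
    if model == "vertex":
        return True   # weakest model: any two distinct vertices may run together
    adjacent = v in graph[u] or u in graph[v]
    if model == "edge":
        return not adjacent  # each update owns its vertex and adjacent edges
    if model == "full":
        # Each update owns its whole neighborhood, so neighborhoods must not overlap.
        shares_neighbor = bool(set(graph[u]) & set(graph[v]))
        return not adjacent and not shares_neighbor
    raise ValueError(f"unknown consistency model: {model}")
```

On a path graph a-b-c-d, the pair (a, b) can overlap only under vertex consistency, (a, c) under vertex or edge consistency, and (a, d) under all three, which shows how stronger consistency shrinks the available parallelism.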

Comments

Pregel and GraphLab sit at two ends of the “power of framework” vs “ease of use” tradeoff space. Allowing asynchrony makes GraphLab more general and powerful than Pregel, but it is also more complex and requires users to understand which consistency model suits their problem. Pregel is simpler (as is common for frameworks in Google’s arsenal), but still capable of handling a wide variety of problems. Given its origin at Google and the existence of open-source clones like Giraph, Pregel’s model is more likely to succeed in the near future.
