Exploring Parallelism in Apache Spark.


What is Spark?

  • Apache Spark is an open-source data processing framework that offers in-memory computing and a graph computation API. It works with a wide variety of file systems.
  • The Spark developers claim it can run programs up to 100 times faster than Hadoop MapReduce (a popular Big Data framework) in memory, and up to 10 times faster on disk, on clusters built mostly from ordinary servers and commodity hardware. It also requires less code to be written.


Why Spark?

  • It is designed to overcome serious limitations of the one-pass computation model of MapReduce and the poor scalability of the MPI programming model.
  • It provides fault tolerance and is easy to combine with SQL, streaming data, machine learning, graphs, etc.
  • Mainly, parallel processing!


What parallel-paradigm features does Spark boast?

Note:

Before we start working through examples to understand the parallel action, make sure you have Apache Spark installed on your system. Another blog post gives clear instructions for getting Spark running on your machine. It will also help if you are familiar with the “Hello, World!” examples.

I haven’t worked on clusters or on commodity servers; this was run on my local machine, which has a quad-core processor, and the parallelism is exploited across the cores, i.e. threads run on each core. If you want this working on a large cluster of machines, Spark lets you configure them, and you can control the degree of parallelism too (see the sketch below).
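For example, on a local machine you can ask Spark for a fixed number of worker threads and suggest a default partition count. Here is a minimal sketch, assuming you build your own SparkContext; the app name and the values 4 and 8 are just illustrative, while the master URL format and the property name are standard Spark configuration:

import org.apache.spark.{SparkConf, SparkContext}

// Run Spark locally with 4 worker threads ("local[*]" would use one per core)
// and hint at a default number of partitions for distributed operations.
val conf = new SparkConf()
  .setAppName("ParallelismDemo")
  .setMaster("local[4]")
  .set("spark.default.parallelism", "8")
val sc = new SparkContext(conf)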

RDD (Resilient Distributed Dataset).

The goal of an RDD is to let you work with distributed collections as you would with local ones. An RDD is built through parallel transformations such as map, filter, reduceByKey, and join, which run on data-parallel systems and hence exploit data parallelism.

Take a look at the code below.


// Monte Carlo estimate of Pi: sample points uniformly in the unit square
// and count the fraction that falls inside the quarter circle of radius 1.
val NUM_SAMPLES = 1000000
val count = sc.parallelize(1 to NUM_SAMPLES).filter { _ =>
  val x = math.random
  val y = math.random
  x*x + y*y < 1
}.count()
println(s"Pi is roughly ${4.0 * count / NUM_SAMPLES}")

  • Parallelized collections are created by calling SparkContext’s parallelize method on an existing collection in your driver program.
  • The elements of the collection are copied to form a distributed dataset (an RDD) that can be operated on in parallel.
  • Spark runs one task for each partition of the RDD; typically you would want 2-4 partitions for each CPU in your cluster. You can also set the partition count yourself, as shown below.
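For instance, parallelize takes an optional second argument that sets the number of partitions explicitly. A quick sketch (the collection and the count of 8 are arbitrary):

val data = 1 to 1000000
// Cut the dataset into 8 partitions; Spark schedules one task per partition.
val rdd = sc.parallelize(data, 8)
println(s"Number of partitions: ${rdd.getNumPartitions}")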

GraphX Pregel API.

  • Graphs created by modern applications can be very large (potentially terabytes or petabytes in size), so a single node (computer) cannot hold the entire graph in memory.
  • The way forward is to partition the graph and parallelize its processing across many machines.
  • Pregel employs a vertex-centric approach (a bulk synchronous parallel model) to processing large distributed graphs. Pregel computations consist of a sequence of iterations, called supersteps.

Let’s look at the type signature of the Pregel operator along with a sketch of its implementation, and then use it to express the single-source shortest path (SSSP) algorithm.


class GraphOps[VD, ED] {
  def pregel[A]
      (initialMsg: A,
       maxIter: Int = Int.MaxValue,
       activeDir: EdgeDirection = EdgeDirection.Out)
      (vprog: (VertexId, VD, A) => VD,
       sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
       mergeMsg: (A, A) => A)
    : Graph[VD, ED] = {
    // Receive the initial message at each vertex
    var g = mapVertices( (vid, vdata) => vprog(vid, vdata, initialMsg) ).cache()
    // Compute the initial round of messages
    var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
    var activeMessages = messages.count()
    // Loop until no messages remain or maxIter is reached
    var i = 0
    while (activeMessages > 0 && i < maxIter) {
      // Receive the messages and update the vertices.
      g = g.joinVertices(messages)(vprog).cache()
      val oldMessages = messages
      // Send new messages, skipping edges where neither side received a message.
      // We must cache messages so they can be materialized on the next line,
      // allowing us to uncache the previous iteration.
      messages = g.mapReduceTriplets(
        sendMsg, mergeMsg, Some((oldMessages, activeDir))).cache()
      activeMessages = messages.count()
      i += 1
    }
    // Return the final graph
    g
  }
}

Now we can use the Pregel operator to express a computation such as single-source shortest path; a full implementation is available on GitHub.
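As an illustration, here is an SSSP sketch modeled on the example in the GraphX programming guide; the random log-normal test graph and the source vertex id 42 are arbitrary choices for demonstration.

import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.graphx.util.GraphGenerators

// A random test graph whose edge attribute holds the edge's "distance".
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
val sourceId: VertexId = 42 // arbitrary source vertex
// Every vertex starts at distance infinity, except the source at 0.
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist), // vertex program: keep the shorter distance
  triplet => { // send message: offer a shorter path to the destination if one exists
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  (a, b) => math.min(a, b) // merge messages: keep the minimum candidate distance
)
println(sssp.vertices.collect.mkString("\n"))

Each superstep here corresponds to one round of message exchange: vertices receive candidate distances, keep the minimum, and propagate improvements in parallel until no messages remain.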

Conclusion

I have made an attempt to demonstrate simple examples of parallelism with RDDs and the GraphX Pregel API. I believe this is enough to get us started on more complex graph problems, but of course it is not the only approach; there may be other, better examples for exploring these parallelism techniques and how they work.
