Development Workflow

Developing and deploying Spark applications can be challenging at first. This section walks you through the development and deployment workflow.

When developing a Spark application that uses external dependencies, a developer typically faces two challenges:

  • How to efficiently and quickly develop locally.

  • How to reliably deploy to a production cluster.

Apart from the actual development, the biggest hurdle is usually dependency management. This documentation guides you through both and helps you get set up as needed.

There are two ways the problem is typically approached: you can either add the dependencies to the classpath when you submit the application, or you can create one big jar (a “fat jar”) which contains all dependencies.

Adding the Connector to the Executor’s classpath

If you want to manage the dependency directly on the worker, the actual project setup is quite simple. You don’t need to set up shading and can use a build.sbt like this:

name := "your-awesome-app"

organization := "com.acme"

version := "1.0.0-SNAPSHOT"

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0",
  "org.apache.spark" %% "spark-streaming" % "2.2.0",
  "org.apache.spark" %% "spark-sql" % "2.2.0",
  "com.couchbase.client" %% "spark-connector" % "2.2.0"
)

When developing the application, make sure that the master is not set, since you will eventually deploy it to a non-local master.

Note that since Spark 2.0 it is recommended to use the SparkSession instead of initializing the SparkContext directly, but the SparkContext approach is still supported and behaves the same way, as shown below.
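For reference, a SparkSession-based setup could look roughly like the following sketch (the object name MainWithSession is purely illustrative; the bucket configuration key is the same one used in this section):

import org.apache.spark.sql.SparkSession

object MainWithSession {
  def main(args: Array[String]): Unit = {
    // Do not call .master(...) here either; the master is supplied at submit time
    val spark = SparkSession.builder()
      .appName("MyApp")
      .config("com.couchbase.bucket.travel-sample", "")
      .getOrCreate()

    // The underlying SparkContext is still available if needed
    val sc = spark.sparkContext

    // ...
  }
}

The SparkContext-based variant used throughout the rest of this section looks like this: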

import org.apache.spark.{SparkConf, SparkContext}

object Main {
  def main(args: Array[String]): Unit = {
    // Configure Spark
    val cfg = new SparkConf()
      .setAppName("MyApp")
      // .setMaster("local[*]") <--- do not set the master!
      .set("com.couchbase.bucket.travel-sample", "")

    // Generate The Context
    val sc = new SparkContext(cfg)

    // ...
  }
}

If you run this, Spark will complain:

Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration

There are different approaches to handle this, but they all boil down to conditionally setting the system property for the master. One useful approach is to have a runnable class in the test (or some other) directory which sets the property and then runs the application. This also has additional benefits that are discussed in the next section.

object RunLocal {
  def main(args: Array[String]): Unit = {
    System.setProperty("spark.master", "local[*]")
    Main.main(args)
  }
}
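Another variant (a sketch, not prescribed by the connector documentation) is to default the master only when none was supplied, using SparkConf.setIfMissing; a master passed via spark-submit still takes precedence:

val cfg = new SparkConf()
  .setAppName("MyApp")
  // Only used when no master was provided via spark-submit or system properties
  .setIfMissing("spark.master", "local[*]")
  .set("com.couchbase.bucket.travel-sample", "")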

You can then build your application via sbt package on the command line and submit it to Spark via spark-submit. Because Couchbase distributes the connector via Maven, you can just include it with the --packages flag:

/spark/bin/spark-submit --master spark://your-spark-master:7077 --packages com.couchbase.client:spark-connector_2.11:2.2.0 /path/to/app/target/scala-2.11/your-awesome-app_2.11-1.0.0-SNAPSHOT.jar

If your environment does not have access to the internet, you can use the --jars argument instead and grab the connector assembly with all its dependencies from the Download and API Reference page.
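For example, assuming the connector assembly has been downloaded to /path/to (the file name below is illustrative and depends on the version you download), the submit command could look like this:

/spark/bin/spark-submit --master spark://your-spark-master:7077 --jars /path/to/spark-connector-assembly-2.2.0.jar /path/to/app/target/scala-2.11/your-awesome-app_2.11-1.0.0-SNAPSHOT.jar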

Deploying a jar with dependencies included

The previous example shows how to add the connector as a dependency during the submit phase. If more than one dependency needs to be managed, this can quickly become cumbersome, since you also need to make sure that your build.sbt is always in sync with your command line parameters.

The alternative is to create a “fat jar” with batteries included that ships everything in the first place. This requires more setup on the project side but saves you some work later on.

The build.sbt looks a bit different here:

lazy val root = (project in file(".")).
  settings(
    name := "my-app",
    version := "1.0",
    scalaVersion := "2.11.8",
    mainClass in Compile := Some("Main")
  )

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided,test",
  "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided,test",
  "com.couchbase.client" %% "spark-connector" % "2.2.0"
)

The important piece here is that the Spark dependencies are scoped to provided and test. We need this so that Spark is not included in the fat jar but is still available when running in the IDE during development.

Also, create project/assembly.sbt if it doesn’t exist and add:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
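Depending on your transitive dependencies, sbt assembly may fail with deduplication errors (for example on files under META-INF). If that happens, a commonly used workaround is to add a merge strategy to your build.sbt; adjust the match cases to your situation:

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}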

The code is now similar to the previous section, but you need to make sure that the class you run in your IDE lives under the test sources, so that Spark is actually on the classpath during development:

In src/main/scala:

import org.apache.spark.{SparkConf, SparkContext}

object Main {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("TravelCruncher")
      .set("com.couchbase.bucket.travel-sample", "")

    val sc = new SparkContext(conf)

    // ...
  }
}

Under src/test/scala:

object RunLocal {
  def main(args: Array[String]): Unit = {
    System.setProperty("spark.master", "local[*]")
    Main.main(args)
  }
}

Now, instead of sbt package, you need to use sbt assembly, which will create the fat jar under target/scala-2.11/my-app-assembly-1.0.jar.
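You can optionally sanity-check the result, for example by listing the jar contents to confirm that the Couchbase connector classes made it in:

jar tf target/scala-2.11/my-app-assembly-1.0.jar | grep -i couchbase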

The jar can be submitted without further dependency shenanigans:

/spark/bin/spark-submit --master spark://your-spark-master:7077 /path/to/app/target/scala-2.11/my-app-assembly-1.0.jar