Spark
Spark is one of the most popular open-source distributed computation engines and offers a scalable, flexible framework for processing huge amounts of data efficiently. The recent 2.0 release milestone brought a number of significant improvements, including Datasets (an improved, strongly typed take on DataFrames), expanded SparkR support, and much more. One of the great things about Spark is that it’s relatively self-contained and doesn’t require a lot of extra infrastructure to work. While Spark’s latest release is 2.1.0 at the time of publishing, we’ll use version 2.0.1 throughout this post.
Jupyter
Jupyter notebooks are an interactive way to code that enables rapid prototyping and exploration. A browser-based frontend talks to the Jupyter server, which in turn drives an interactive REPL, the kernel, that processes snippets of code. The advantage for the user is being able to write code in small chunks that run independently but share the same namespace, which greatly facilitates testing or trying multiple approaches in a modular fashion. Jupyter supports a number of kernels (the things that actually run the code) besides the out-of-the-box Python one, but connecting it to Spark is a little trickier. Enter Apache Toree, a project meant to solve this problem by acting as a middleman between a running Spark cluster and other applications.
In this post I’ll describe how to go from a clean Ubuntu installation to running Spark 2.0 code on Jupyter. If you’re on Mac or Windows, I suggest looking into the Anaconda platform. We’ll need a few things available in the environment before we start (a sketch of how to install them follows the list):
- Java (for Spark)
- Scala 2.11+ (for Spark)
- Python 2 (for Jupyter and PySpark)
- pip (to handle Python 2 packages)
- git (the easiest way to get the latest Toree code)
- Docker (for building Toree)
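On a clean Ubuntu box, something like the following should cover most of this list. This is a minimal sketch assuming Ubuntu’s standard repositories; the package names (default-jdk, scala, python-pip, docker.io) are assumptions, so adjust them for your release, and consider following Docker’s own installation guide rather than using the distribution package.
sudo apt-get update
sudo apt-get install -y default-jdk scala python python-pip git docker.io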
Installation
The first step is to download and install Spark itself. In this example we’ll use Spark 2.0.1 built for Hadoop 2.4, but you can use whatever Hadoop version matches your existing stack, or not worry about it at all if you’re not using Hadoop. The install location doesn’t matter too much; throughout this example we’ll assume that Spark was installed in the folder /opt/spark.
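If you don’t already have a Spark build handy, a download along these lines should work. The archive URL and file name here are assumptions based on the Apache release archive, so substitute your preferred mirror and Hadoop build as needed:
wget https://archive.apache.org/dist/spark/spark-2.0.1/spark-2.0.1-bin-hadoop2.4.tgz
tar -xzf spark-2.0.1-bin-hadoop2.4.tgz
sudo mv spark-2.0.1-bin-hadoop2.4 /opt/spark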
Once Spark is in place, make sure that you set a couple of important environment variables. You can add the following lines to your shell’s startup script if you want them to be applied automatically in the future.
export SPARK_HOME=/opt/spark
export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib
We can easily install Jupyter using pip. A simple
pip install jupyter
should do it.
Next, we’ll clone the Toree repository from GitHub. As of this writing, a pip-installable package wasn’t available for Toree v0.2.0, which is what we need for Spark 2.0 support. No problem, we’ll build it ourselves! In fact, there’s a preset make target that produces exactly that package; we just need to set an environment variable to let Toree know which version of Spark we’re going to be using.
git clone git@github.com:apache/incubator-toree.git
cd incubator-toree
APACHE_SPARK_VERSION=2.0.1 make pip-release
Let’s install the package we just built:
pip install dist/toree-pip/toree-0.2.0.dev1.tar.gz
Putting the pieces together
All we have left to do is add Toree as one of Jupyter’s kernels. Remember that we previously set the environment variable $SPARK_HOME to point to where we installed Spark. The --replace flag here is a way of making this command idempotent: no matter how many times we run it, the end environment should be the same. This is nice for being able to reprovision boxes without worrying about whether or not commands have been run before.
jupyter toree install --replace --spark_home=$SPARK_HOME --kernel_name="Spark" --spark_opts="--master=local[*]"
You’ll notice that we can pass arbitrary options to Spark as well; these are the same parameters we would use when calling spark-submit. There’s another flag worth mentioning here: adding --user to the above command will install the kernel in the user’s directory, which can get around permissions problems when trying to install it into protected system folders without administrator privileges.
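For reference, the same install command with the flag added would look like this:
jupyter toree install --replace --user --spark_home=$SPARK_HOME --kernel_name="Spark" --spark_opts="--master=local[*]"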
And that’s it! You should now be able to run jupyter notebook, start a new notebook using the Spark-Scala kernel, and get to work. To use PySpark, open up a Python notebook and simply import pyspark.
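As a quick smoke test, a first PySpark cell might look something like the sketch below; the app name and row count are just illustrative:
from pyspark.sql import SparkSession

# Build a local SparkSession, the Spark 2.x entry point for DataFrame work
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("jupyter-smoke-test") \
    .getOrCreate()

# Create a small DataFrame of consecutive integers and count its rows
df = spark.range(1000)
print(df.count())  # should print 1000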
Happy distributed computing!