Spark 2.0 on Jupyter with Toree

Spark

Spark is one of the most popular open-source distributed computation engines, offering a scalable, flexible framework for processing huge amounts of data efficiently. The recent 2.0 release brought a number of significant improvements, including Datasets (a strongly typed extension of the DataFrame API), better SparkR support, and a lot more. One of the great things about Spark is that it’s relatively self-contained and doesn’t require much extra infrastructure to work. While Spark’s latest release is 2.1.0 at the time of publishing, we’ll use 2.0.1 in the examples throughout this post.

Jupyter

Jupyter notebooks are an interactive way to code that enables rapid prototyping and exploration. A browser-based frontend talks to the Jupyter server, which in turn drives an interactive REPL underneath that processes snippets of code. The advantage to the user is being able to write code in small chunks which can be run independently but share the same namespace, greatly facilitating testing or trying multiple approaches in a modular fashion. The platform supports a number of kernels (the components that actually run the code) besides the out-of-the-box Python one, but connecting Jupyter to Spark is a little trickier. Enter Apache Toree, a project meant to solve this problem by acting as a middleman between a running Spark cluster and other applications.

In this post I’ll describe how we go from a clean Ubuntu installation to being able to run Spark 2.0 code on Jupyter. If you’re on Mac or Windows, I suggest looking into the Anaconda platform. We will need to have access to certain things in the environment before we start:

  • Java (for Spark)
  • Scala 2.11+ (for Spark)
  • Python 2 (for Jupyter and PySpark)
  • pip (to handle Python 2 packages)
  • git (the easiest way to get the latest Toree code)
  • Docker (for building Toree)


Installation

The first step is to download and install Spark itself. In this example we’ll use Spark 2.0.1 built for Hadoop 2.4, but you can use whichever Hadoop version matches your existing stack, or not worry about it at all if you’re not using Hadoop. The install location doesn’t matter too much, but you should make sure to set a couple of important environment variables. In this example we’ll assume Spark was installed in /opt/spark/. You can add the following lines to your shell’s startup script if you want them applied automatically in the future.

export SPARK_HOME=/opt/spark
export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib
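
As an optional sanity check, you should now be able to import pyspark from a plain Python session once your shell has picked up the variables above. This is just a quick sketch; depending on your Spark build, you may also need to add the py4j zip that ships under $SPARK_HOME/python/lib to PYTHONPATH for the import to succeed.

# Optional sanity check: confirm that pyspark is importable with the paths above.
# If the import fails with an error about py4j, add the py4j-*-src.zip from
# $SPARK_HOME/python/lib to PYTHONPATH as well.
import pyspark
print(pyspark.__version__)  # should print 2.0.1 in this example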

We can install Jupyter itself using pip:

pip install jupyter

Next, we’ll clone the Toree repository from GitHub. As of this writing, a pip-installable package wasn’t yet available for Toree v. 0.2.0, which is what we need for Spark 2.0 support. No problem, we’ll build it ourselves! In fact, there’s a preset make target that builds the package; we just need to set an environment variable to tell Toree which version of Spark we’re going to be using.

git clone git@github.com:apache/incubator-toree.git
cd incubator-toree
APACHE_SPARK_VERSION=2.0.1 make pip-release

Let’s install the package we just built:

pip install dist/toree-pip/toree-0.2.0.dev1.tar.gz

Putting the pieces together

All we have left to do is add Toree as one of Jupyter’s kernels. Remember that we previously set the environment variable $SPARK_HOME to point to where we installed Spark. The --replace flag makes this command idempotent: no matter how many times we run it, the resulting environment is the same. This is nice for reprovisioning boxes without having to worry about whether the command has been run before.

jupyter toree install --replace --spark_home=$SPARK_HOME --kernel_name="Spark" --spark_opts="--master=local[*]"

You’ll notice that we can also pass arbitrary options through to Spark; these are the same parameters we would use when calling spark-submit. One more flag is worth mentioning: adding --user to the command above installs the kernel into the user’s directory, which can sidestep permission problems when you don’t have administrator privileges to write to protected system folders.

And that’s it! You should now be able to run jupyter notebook, start a new notebook using the Spark-Scala kernel, and get to work. To use PySpark, open up a Python notebook and simply import pyspark, as in the sketch below.
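
As a quick smoke test, here’s a minimal sketch of what a first cell in a Python notebook might look like; the app name, column names, and sample data below are arbitrary placeholders:

# A tiny PySpark smoke test for a Python notebook cell.
from pyspark.sql import SparkSession

# In Spark 2.0, SparkSession is the main entry point; build (or reuse) a local one.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("jupyter-smoke-test") \
    .getOrCreate()

# Create a small DataFrame and run a trivial action to confirm everything works.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])
df.show()
print(df.count())  # 3

spark.stop()

Happy distributed computing!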
