Try Apache Spark’s shell using Docker
Ever wanted to try out Apache Spark without actually having to install anything? Well, if you've got Docker, I've got a Christmas present for you: a Docker image you can pull to try out and run Spark commands in the Spark shell REPL. The image has been pushed to Docker Hub here and can easily be pulled using Docker.
So what exactly is this image, and how can I use it?
Well, all you need to do is execute this command:
[code language="bash"]
docker pull ogirardot/spark-docker-shell
[/code]
I'll try to keep this image up-to-date with future releases of Spark, so if you want to test against a specific version, all you have to do is pull (or directly run) the image with the corresponding tag, like this:
[code language="bash"]
docker pull ogirardot/spark-docker-shell:1.1.1
[/code]
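If you'd rather skip the explicit pull, docker run will fetch a missing tag on first use anyway, so running a tagged version directly should work just as well:
[code language="bash"]
docker run -t -i ogirardot/spark-docker-shell:1.1.1
[/code]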
Then, once Docker has downloaded the full image, the run command will give you access to a stand-alone spark-shell that lets you try out and learn Spark's API in a sandboxed environment. Here's what a correct launch looks like:
[code language="scala"]
docker run -t -i ogirardot/spark-docker-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/12/11 20:33:14 INFO SecurityManager: Changing view acls to: root
14/12/11 20:33:14 INFO SecurityManager: Changing modify acls to: root
14/12/11 20:33:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
14/12/11 20:33:14 INFO HttpServer: Starting HTTP Server
14/12/11 20:33:14 INFO Utils: Successfully started service 'HTTP class server' on port 50535.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.1
      /_/
Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
14/12/11 20:33:18 INFO SecurityManager: Changing view acls to: root
14/12/11 20:33:18 INFO SecurityManager: Changing modify acls to: root
14/12/11 20:33:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
14/12/11 20:33:19 INFO Slf4jLogger: Slf4jLogger started
14/12/11 20:33:19 INFO Remoting: Starting remoting
14/12/11 20:33:19 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@ea9ec670e429:43346]
14/12/11 20:33:19 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@ea9ec670e429:43346]
14/12/11 20:33:19 INFO Utils: Successfully started service 'sparkDriver' on port 43346.
14/12/11 20:33:19 INFO SparkEnv: Registering MapOutputTracker
14/12/11 20:33:19 INFO SparkEnv: Registering BlockManagerMaster
14/12/11 20:33:19 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20141211203319-f310
14/12/11 20:33:19 INFO Utils: Successfully started service 'Connection manager for block manager' on port 58304.
14/12/11 20:33:19 INFO ConnectionManager: Bound socket to port 58304 with id = ConnectionManagerId(ea9ec670e429,58304)
14/12/11 20:33:19 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
14/12/11 20:33:19 INFO BlockManagerMaster: Trying to register BlockManager
14/12/11 20:33:19 INFO BlockManagerMasterActor: Registering block manager ea9ec670e429:58304 with 265.4 MB RAM, BlockManagerId(<driver>, ea9ec670e429, 58304, 0)
14/12/11 20:33:19 INFO BlockManagerMaster: Registered BlockManager
14/12/11 20:33:19 INFO HttpFileServer: HTTP File server directory is /tmp/spark-4c832cee-7ed5-470d-9e41-d4a36227d48f
14/12/11 20:33:19 INFO HttpServer: Starting HTTP Server
14/12/11 20:33:19 INFO Utils: Successfully started service 'HTTP file server' on port 55020.
14/12/11 20:33:19 INFO Utils: Successfully started service 'SparkUI' on port 4040.
14/12/11 20:33:19 INFO SparkUI: Started SparkUI at http://ea9ec670e429:4040
14/12/11 20:33:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/12/11 20:33:19 INFO Executor: Using REPL class URI: http://172.17.0.15:50535
14/12/11 20:33:19 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@ea9ec670e429:43346/user/HeartbeatReceiver
14/12/11 20:33:19 INFO SparkILoop: Created spark context..
Spark context available as sc.
scala>
[/code]
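A quick side note: the logs above show the SparkUI starting on port 4040 inside the container. If you want to browse it from outside, you can publish that port when launching the shell (a sketch using Docker's standard -p flag, nothing specific to this image):
[code language="bash"]
docker run -t -i -p 4040:4040 ogirardot/spark-docker-shell
[/code]
The UI should then be reachable on port 4040 of your Docker host for as long as the shell is running.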
Once you reach the scala> prompt, you're practically done, and you can use the available SparkContext (the sc variable) with simple examples:
[code language="scala"]
scala> sc.parallelize(1 until 1000).map(_ * 2).filter(_ < 10).reduce(_ + _)
res0: Int = 20
[/code]
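And since the SparkContext sticks around between commands, you can try slightly bigger examples too. Here's a sketch of the classic word count on a small in-memory collection (note that the ordering of the resulting pairs may vary between runs):
[code language="scala"]
scala> val words = sc.parallelize(Seq("spark", "docker", "spark", "shell"))

scala> words.map(word => (word, 1)).reduceByKey(_ + _).collect()
res1: Array[(String, Int)] = Array((docker,1), (shell,1), (spark,2))
[/code]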
If you've got this right, you're all set! Plus, as this is a full Scala prompt, you can use any regular Scala code alongside the Spark API.
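For instance, here's a minimal sketch (the isEven helper is purely illustrative) mixing a plain Scala definition with a Spark transformation:
[code language="scala"]
scala> def isEven(n: Int): Boolean = n % 2 == 0
isEven: (n: Int)Boolean

scala> sc.parallelize(1 to 10).filter(isEven).collect()
res2: Array[Int] = Array(2, 4, 6, 8, 10)
[/code]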
So enjoy, take your time and be bold.