Saturday 6 June 2015

Configure IPython Notebook with Apache Spark (PySpark)

Apache Spark is a great tool for large-scale data processing. Recently, I have been working with different modules of Apache Spark using Scala and PySpark (a way of interfacing with Spark through Python).
This tutorial will show you how to properly set up PySpark with IPython Notebook. I originally followed the set up described on the Cloudera blog, but some steps were missing or incorrect, and I ran into the following errors:
I) Couldn't read .ipynb files; the error was:
 Unreadable Notebook: /home/kuntal/Downloads/setup-master/spark_tutorial_student.ipynb Unsupported nbformat version 4

II) IPython Notebook wouldn't run because it requires tornado >= 4.0, but version 3.1 was installed

Overview of this tutorial: 
  • Install IPython
  • Install Spark
  • Create PySpark profile for IPython
  • IPython and Spark (PySpark) Configuration
  • Word Count example
  • Shortcut trick for IPython Spark set up


System description:
  • Ubuntu 14.04
  • 6 GB RAM
  • Intel Core i3 quad-core processor

     => For demonstration purposes, Spark will run in local mode, but the configuration can be updated to submit code to a cluster.

Install IPython   
sudo apt-get install python-pip ipython ipython-notebook
sudo pip install --upgrade ipython tornado

Verify that the latest version of IPython is installed by typing:
ipython --version
3.1.0


Install Spark
Download the latest pre-built Spark binary release.
Unzip the archive to ~/spark-1.2.0/ -- we will refer to this directory as SPARK_HOME.


Create PySpark Profile for IPython
After Spark is installed, let’s start by creating a new IPython profile for PySpark.

ipython profile create pyspark


IPython and Spark (PySpark) Configuration
Update the profile config file /home/kuntal/.ipython/profile_pyspark/ipython_notebook_config.py
(Note: the file path may vary on your system, but when you create the pyspark profile, IPython prints the config file name with its full path.)

Edit this file:
vi /home/kuntal/.ipython/profile_pyspark/ipython_notebook_config.py

c = get_config()

Add the following lines after the c = get_config() line shown above.

# IPython PySpark
c.NotebookApp.ip = 'localhost'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 7770


Set the following environment variables in .bashrc or .bash_profile:
vi ~/.bashrc

export SPARK_HOME="/home/kuntal/knowledge/BigData/spark-1.2.0"
export PYSPARK_SUBMIT_ARGS="--master local[2]"
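
After sourcing .bashrc (or opening a new terminal), you can quickly confirm that Python sees the new variables; this small check is my own addition and not part of the original steps:

import os

# Both values should match what was added to .bashrc;
# if either prints None, the shell has not picked up the changes yet.
print(os.environ.get('SPARK_HOME'))
print(os.environ.get('PYSPARK_SUBMIT_ARGS'))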


Create a file named /home/kuntal/.ipython/profile_pyspark/startup/00-pyspark-setup.py containing the following:

# Configure the necessary Spark environment
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
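# Note: the py4j zip file name below is tied to the Spark release; adjust it if
# your SPARK_HOME/python/lib directory contains a different py4j version.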
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))


Now we are ready to launch a notebook using the PySpark profile

ipython notebook --profile=pyspark 
(Run the above command from a directory that contains your .ipynb files.)

Go to the browser: http://localhost:7770



To get an interactive PySpark shell with IPython, start it with the command:

ipython  --profile=pyspark
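
Once the shell starts, the SparkContext is already available as sc (created by the 00-pyspark-setup.py startup file). As a quick sanity check, here is a minimal snippet of my own (not part of the original post) that you can type at the prompt:

sc.version                                                        # prints the Spark version, e.g. '1.2.0'
sc.parallelize(range(100)).filter(lambda x: x % 2 == 0).count()   # should return 50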




Word Count Example
Now start your PySpark programming in the console. The word count example is a simple one taken from the Spark website:

Read the file with the Spark context.
text_file = sc.textFile("file:///home/kuntal/knowledge/BigData/spark-1.2.0/README.md")

Split each line from the file into words. Map each word to a tuple containing the word and an initial count of 1. Sum up the counts for each word.
word_counts = text_file.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

To actually count the words, execute the action:
word_counts.collect()

Here’s a portion of the output:

[(u'all', 1),
 (u'when', 1),
 (u'"local"', 1),
 (u'including', 3),
 (u'computation', 1),
 (u'Spark](#building-spark).', 1),
 (u'using:', 1),
 (u'guidance', 3),
...
 (u'./dev/run-tests', 1),
 (u'first', 1),
 (u'latest', 1)]

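The pairs returned by collect() are not ordered. As a small extension of the example (my own addition, not from the Spark website), the most frequent words can be pulled out with takeOrdered:

# Ten most frequent words, sorted by descending count
top_words = word_counts.takeOrdered(10, key=lambda pair: -pair[1])
print(top_words)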

Note:
You can use this set up to submit your lab exercises for the edX course Introduction to Big Data with Apache Spark (without the VM and the complicated set up procedure).



Shortcut for setting up IPython Notebook with Spark

This will give you a quick way of setting up Spark 1.5.0 (spark-1.5.0-bin-hadoop2.4) and IPython Notebook in minutes. It assumes you have already installed IPython Notebook, downloaded the latest Spark release, and set SPARK_HOME in your .bashrc or profile. Let's get started.

1) Create an IPython profile named pyspark
ipython profile create pyspark

2) Change directory to the IPython profile base (example given below for Ubuntu)
cd ~/.ipython/profile_pyspark/


3) Edit the config file
vi ipython_config.py

4) Add the following lines to the file
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8889

5) Now create a shell script to run IPython Notebook with Spark
vi ~/start_ipython_notebook.sh

6) Add the following lines
#!/bin/bash
IPYTHON_OPTS="notebook --profile=pyspark" pyspark

7) Make the script executable
chmod +x ~/start_ipython_notebook.sh

8) Run the Notebook
./start_ipython_notebook.sh

9) Validate the set up with some basic Spark code, for example the snippet below
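
As a minimal validation sketch of my own (not part of the original steps), open the notebook in the browser (port 8889 with the config above) and run a cell such as the classic Monte Carlo pi estimate; sc is predefined by the pyspark profile:

import random

# Estimate pi by sampling random points in the unit square;
# the printed value should be close to 3.14.
def inside(_):
    x, y = random.random(), random.random()
    return x * x + y * y < 1

num_samples = 100000
count = sc.parallelize(range(num_samples)).filter(inside).count()
print(4.0 * count / num_samples)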



Enjoy your sparky ride :)