Saturday 6 June 2015

Configure IPython Notebook with Apache Spark (PySpark)

Apache Spark is a great tool for large-scale data processing. Recently, I have been working with different modules of Apache Spark using Scala and PySpark (a way of interfacing with Spark through Python).
This tutorial will show you how to properly set up PySpark with IPython Notebook. I originally followed the set up described on the Cloudera blog, but some steps were missing or incorrect, and I ran into the following errors:
I) Couldn't read .ipynb files; the error was:
 Unreadable Notebook: /home/kuntal/Downloads/setup-master/spark_tutorial_student.ipynb Unsupported nbformat version 4

II) IPython Notebook wouldn't run because it requires tornado >= 4.0, but version 3.1 was installed

Overview of this tutorial: 
  • Install IPython
  • Install Spark
  • Create PySpark profile for IPython
  • IPython and Spark (PySpark) Configuration
  • Word Count example
  • Shortcut trick for IPython Spark set up


System description:
  • Ubuntu 14.04
  • 6 GB RAM
  • Intel Core i3 quad-core processor

     => For demonstration purposes, Spark will run in local mode, but the configuration can be updated to submit code to a cluster.

Install IPython   
sudo apt-get install python-pip ipython ipython-notebook
sudo pip install --upgrade ipython tornado

Verify that the latest version of IPython is installed by typing:
ipython --version
3.1.0


Install Spark
Download the latest pre-built Spark binary release.
Unzip the archive to ~/spark-1.2.0/ -- we will refer to this directory as SPARK_HOME.


Create PySpark Profile for IPython
After Spark is installed, let’s start by creating a new IPython profile for PySpark.

ipython profile create pyspark


IPython and Spark (PySpark) Configuration
Update the profile config file /home/kuntal/.ipython/profile_pyspark/ipython_notebook_config.py
(Note: the file path may vary on your system, but when you create the pyspark profile, IPython prints the config file name with its full path.)

Edit this file:
vi /home/kuntal/.ipython/profile_pyspark/ipython_notebook_config.py

c = get_config()

Add the following lines after the c = get_config() line shown above.

# IPython PySpark
c.NotebookApp.ip = 'localhost'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 7770


Set the following environment variables in .bashrc or .bash_profile:
vi ~/.bashrc

export SPARK_HOME="/home/kuntal/knowledge/BigData/spark-1.2.0"
export PYSPARK_SUBMIT_ARGS="--master local[2]"
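
After sourcing .bashrc (or opening a new terminal), you can quickly confirm that Python sees the new variables; this small check is my own addition and not part of the original steps:

import os

# Both values should match what was added to .bashrc;
# if either prints None, the shell has not picked up the changes yet.
print(os.environ.get('SPARK_HOME'))
print(os.environ.get('PYSPARK_SUBMIT_ARGS'))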


Create a file named /home/kuntal/.ipython/profile_pyspark/startup/00-pyspark-setup.py containing the following:

# Configure the necessary Spark environment
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
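# Note: the py4j zip file name below is tied to the Spark release; adjust it if
# your SPARK_HOME/python/lib directory contains a different py4j version.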
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))


Now we are ready to launch a notebook using the PySpark profile

ipython notebook --profile=pyspark 
(Run the above command from a directory that contains your .ipynb files.)

Go to the browser: http://localhost:7770



To get an interactive PySpark shell with IPython, start it with the command:

ipython  --profile=pyspark
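
Once the shell starts, the SparkContext is already available as sc (created by the 00-pyspark-setup.py startup file). As a quick sanity check, here is a minimal snippet of my own (not part of the original post) that you can type at the prompt:

sc.version                                                        # prints the Spark version, e.g. '1.2.0'
sc.parallelize(range(100)).filter(lambda x: x % 2 == 0).count()   # should return 50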




Word Count Example
Now start your PySpark programming in the console. The word count example is a simple one taken from the Spark website:

Read the file with the Spark context.
text_file = sc.textFile("file:///home/kuntal/knowledge/BigData/spark-1.2.0/README.md")

Split each line from the file into words. Map each word to a tuple containing the word and an initial count of 1. Sum up the counts for each word.
word_counts = text_file.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

To actually count the words, execute the action:
word_counts.collect()

Here’s a portion of the output:

[(u'all', 1),
 (u'when', 1),
 (u'"local"', 1),
 (u'including', 3),
 (u'computation', 1),
 (u'Spark](#building-spark).', 1),
 (u'using:', 1),
 (u'guidance', 3),
...
 (u'./dev/run-tests', 1),
 (u'first', 1),
 (u'latest', 1)]

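The pairs returned by collect() are not ordered. As a small extension of the example (my own addition, not from the Spark website), the most frequent words can be pulled out with takeOrdered:

# Ten most frequent words, sorted by descending count
top_words = word_counts.takeOrdered(10, key=lambda pair: -pair[1])
print(top_words)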

Note:
You can use this set up to submit your lab exercises for the edX course Introduction to Big Data with Apache Spark (without the VM and the complicated set up procedure).



Shortcut for setting up IPython Notebook with Spark

This will give you a quick way of setting up Spark 1.5.0 (spark-1.5.0-bin-hadoop2.4) and IPython Notebook in minutes. It assumes you have already installed IPython Notebook, downloaded the latest Spark release, and set SPARK_HOME in your .bashrc or profile. Let's get started.

1) Create an IPython profile named pyspark
ipython profile create pyspark

2) Change directory to the IPython profile base (example given below for Ubuntu)
cd ~/.ipython/profile_pyspark/


3) Edit the config file
vi ipython_config.py

4) Add the following lines to the file
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8889

5) Now create a shell script to run IPython Notebook with Spark
vi ~/start_ipython_notebook.sh

6) Add the following lines
#!/bin/bash
IPYTHON_OPTS="notebook --profile=pyspark" pyspark

7) Make the script executable
chmod +x ~/start_ipython_notebook.sh

8) Run the Notebook
./start_ipython_notebook.sh

9) Validate the set up with some basic Spark code, for example the snippet below
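
As a minimal validation sketch of my own (not part of the original steps), open the notebook in the browser (port 8889 with the config above) and run a cell such as the classic Monte Carlo pi estimate; sc is predefined by the pyspark profile:

import random

# Estimate pi by sampling random points in the unit square;
# the printed value should be close to 3.14.
def inside(_):
    x, y = random.random(), random.random()
    return x * x + y * y < 1

num_samples = 100000
count = sc.parallelize(range(num_samples)).filter(inside).count()
print(4.0 * count / num_samples)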



Enjoy your sparky ride :)