Apache Spark is a great tool for performing large-scale data processing. Recently, I have been working with different modules of Apache Spark using Scala and PySpark (a way of interfacing with Spark through Python).
This tutorial will show you how to properly set up PySpark with IPython Notebook. I followed the setup described on the Cloudera blog, but some steps were missing or incorrect. I encountered the following errors:
I) Couldn't read .ipynb files; the error was:
Unreadable Notebook: /home/kuntal/Downloads/setup-master/spark_tutorial_student.ipynb Unsupported nbformat version 4
II) IPython Notebook wouldn't run because it requires tornado >= 4.0, but I had 3.1.
Overview of this tutorial:
- Install IPython
- Install Spark
- Create PySpark profile for IPython
- IPython and Spark (PySpark) configuration
- Word Count example
- Shortcut trick for IPython Spark set up
Hardware description:
- Ubuntu 14.04
- 6GB RAM
- Intel i3 quad-core processor
=> For demonstration purposes, Spark will run in local mode, but the configuration can be updated to submit code to a cluster.
Install IPython
sudo apt-get install python-pip ipython ipython-notebook
sudo pip install --upgrade ipython tornado
Verify that the latest version of IPython is installed by typing:
ipython --version
3.1.0
Install SPARK
Download the latest Spark binary release:
Unzip it to ~/spark-1.2.0/ -- this directory will be referred to as SPARK_HOME.
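For example, using the pre-built Hadoop 2.4 package (the mirror URL below is one option; any Apache mirror works):
wget https://archive.apache.org/dist/spark/spark-1.2.0/spark-1.2.0-bin-hadoop2.4.tgz
tar -xzf spark-1.2.0-bin-hadoop2.4.tgz
mv spark-1.2.0-bin-hadoop2.4 ~/spark-1.2.0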
Create PySpark Profile for IPython
After Spark is installed, let’s start by creating a new IPython profile for PySpark.
ipython profile create pyspark
IPython and Spark (PySpark) Configuration
Update the profile config file /home/kuntal/.ipython/profile_pyspark/ipython_notebook_config.py.
(Note: the file path may vary on your system, but when you create the pyspark profile, IPython prints the config file names with their full paths.)
Edit this file:
vi /home/kuntal/.ipython/profile_pyspark/ipython_notebook_config.py
c = get_config()
Add the following lines after the above call:
# IPython PySpark
c.NotebookApp.ip = 'localhost'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 7770
Set the following environment variables in .bashrc or .bash_profile:
vi ~/.bashrc
export SPARK_HOME="/home/kuntal/knowledge/BigData/spark-1.2.0"
export PYSPARK_SUBMIT_ARGS="--master local[2]"
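The local[2] master runs Spark locally with two worker threads. To point the same setup at a standalone cluster later, you would only change the master URL, for example (host and port below are placeholders for your own cluster):
export PYSPARK_SUBMIT_ARGS="--master spark://master-host:7077"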
Create a file named /home/kuntal/.ipython/profile_pyspark/startup/00-pyspark-setup.py containing the following:
# Configure the necessary Spark environment
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError("SPARK_HOME environment variable is not set")

# Add PySpark and its bundled py4j to the Python path. The py4j version
# must match the one shipped with your Spark release (check $SPARK_HOME/python/lib).
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
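Note that execfile exists only in Python 2, which these Spark releases use by default. If your IPython kernel runs Python 3 (supported by Spark only from 1.4 onwards), the equivalent last line would be:
# Python 3 replacement for the execfile call above
with open(os.path.join(spark_home, 'python/pyspark/shell.py')) as f:
    exec(f.read())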
Now we are ready to launch a notebook using the PySpark profile:
ipython notebook --profile=pyspark
(Run the above command from a directory that contains your .ipynb files.)
Then point your browser to http://localhost:7770
To get an interactive PySpark shell with IPython instead, start it with the command:
ipython --profile=pyspark
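In both cases the startup script predefines the SparkContext as sc, which you can verify right away:
sc.version   # e.g. u'1.2.0'
sc.master    # u'local[2]' with the settings above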
Word Count Example
Now you can start PySpark programming in the console. The word count example below is taken from the Spark website.
Read the file with the Spark context:
text_file = sc.textFile("file:///home/kuntal/knowledge/BigData/spark-1.2.0/README.md")
Split each line of the file into words, map each word to a tuple containing the word and an initial count of 1, and sum up the counts for each word:
word_counts = text_file.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
To actually count the words, execute the action:
word_counts.collect()
Here’s a portion of the output:
[(u'all', 1),
(u'when', 1),
(u'"local"', 1),
(u'including', 3),
(u'computation', 1),
(u'Spark](#building-spark).', 1),
(u'using:', 1),
(u'guidance', 3),
...
(u'./dev/run-tests', 1),
(u'first', 1),
(u'latest', 1)]
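A natural follow-up is to pull out the most frequent words. takeOrdered sorts on the driver without collecting the whole RDD (the key function negates the count to get descending order):
# Top 10 most frequent words
top10 = word_counts.takeOrdered(10, key=lambda pair: -pair[1])
for word, count in top10:
    print("%s: %d" % (word, count))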
Note: You can use this setup to submit your lab exercises for the edX course Introduction to Big Data with Apache Spark (without the VM and its complicated setup procedure).
Shortcut for setting up IPython notebook with Spark
This will give you a quick way of setting up Spark 1.5.0 (spark-1.5.0-bin-hadoop2.4) and IPython Notebook in minutes, assuming you have already installed IPython Notebook, downloaded the latest Spark release, and set SPARK_HOME in your .bashrc or profile. Let's get started.
1) Create an IPython profile named pyspark
ipython profile create pyspark
2) Change to the IPython profile directory (example given below for Ubuntu)
cd ~/.ipython/profile_pyspark/
3) Edit the config file
vi ipython_config.py
4) Add the following lines to the file
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8889
5) Now create a shell script to run IPython Notebook with Spark
vi ~/start_ipython_notebook.sh
6) Add the following lines
#!/bin/bash
IPYTHON_OPTS="notebook --profile=pyspark" pyspark
7) Make the script executable
chmod +x ~/start_ipython_notebook.sh
8) Run the Notebook
./start_ipython_notebook.sh
9) Validate the setup with some basic Spark code
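For example, run this in the first notebook cell; the sc variable is predefined because the notebook was started through pyspark:
# Quick sanity check: count and sum the numbers 1..100
rdd = sc.parallelize(range(1, 101))
print(rdd.count())  # expected: 100
print(rdd.sum())    # expected: 5050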