Saturday 24 January 2015

Web Crawling and Data Mining with Apache Nutch

This tutorial series will show you how to do web crawling with Apache Nutch.
After you complete this series, you will be able to crawl data from most popular websites, and even build your own search engine with Apache Solr.

With fast-growing technologies such as social media, cloud computing, mobile applications, and big data, these are exciting, and challenging, times to be in computing.
One of the main challenges facing software architects is handling the massive volume of data consumed and produced by a huge, global user base. In addition, users expect online applications to always be available and responsive. To address the scalability and availability needs of modern web applications, we’ve seen a growing interest in specialized, non-relational data storage and processing technologies, collectively known as NoSQL (Not only SQL).

Apache Nutch:
Apache Nutch is a robust and scalable web crawler; it can also be driven from scripting languages such as Python. Use it whenever your application deals with huge volumes of data that you need to crawl, and it integrates very easily with a search engine such as Apache Solr.

Apache Solr:
Solr is a scalable, ready-to-deploy enterprise search engine that's optimized to search large volumes of text-centric data and return results sorted by relevance.
Scalable - Solr scales by distributing work (indexing and query processing) to multiple servers in a cluster.
Ready to deploy - Solr is open source, is easy to install and configure, and provides a preconfigured example to help you get started.
Optimized for search - Solr is fast and can execute complex queries at subsecond speed, often in only tens of milliseconds.
Large volumes of documents - Solr is designed to deal with indexes containing many millions of documents.
Text-centric - Solr is optimized for searching natural-language text, like emails, web pages, resumes, PDF documents, and social messages such as tweets or blogs.


This tutorial series consists of two parts:

Part 1 - Build and Install Nutch 2.2 with MySQL

Part 2 - Crawling Naaptol, Flipkart, Amazon, Jabong with Apache Nutch and Solr

Get ready to have some fun..!!

Web Crawling Naaptol, Flipkart, Amazon, Jabong with Apache Nutch and Apache Solr

This tutorial will teach you how to crawl data from popular online shopping portals (Amazon, Flipkart, Naaptol, and Jabong) and index the crawled data into Apache Solr. You will also learn how to crawl AJAX-enabled and secure (HTTPS) sites with Apache Nutch.

Please go through the previous tutorial to set up Apache Nutch 2.2 with MySQL.

Update the nutch-site.xml:
cd ${APACHE_NUTCH_HOME}/runtime/local/conf

Edit nutch-site.xml to enable crawling of secure HTTPS sites:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>DemoWebCrawler</value>
</property>

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field. This allows selecting a non-English language as the default one to retrieve. It is a useful setting for search engines built for a certain national group.
</description>
</property>

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information is available </description>
</property>

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>

<!-- Add this property so that Nutch can crawl secure HTTPS-based websites -->
<property>
 <name>plugin.includes</name>
 <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
</configuration>
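To confirm the HTTPS setup works before running a full crawl, you can try fetching a single secure page. This is a quick sketch that assumes your Nutch 2.x build ships the parsechecker tool; it fetches and parses one URL with the configured plugins without touching the crawl DB:

cd ${APACHE_NUTCH_HOME}/runtime/local
# fetch and parse one HTTPS page using the configured protocol-httpclient plugin
bin/nutch parsechecker https://www.flipkart.com/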

Update regex-urlfilter.txt so that blocked URLs can be crawled:

By default, regex-urlfilter.txt blocks URLs that have query-string parameters:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

Modify that line so that URLs with query-string parameters are crawled ('?' and '=' are removed from the blocked-character class):

# skip URLs containing certain characters as probable queries, etc.
-[*!@]

Alternatively, you can simply comment out (using #) that filter line entirely.
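If you prefer to make this edit from the command line, a one-liner like the following should work (a sketch; it assumes the default filter line shown above and the local runtime conf path):

cd ${APACHE_NUTCH_HOME}/runtime/local/conf
# drop '?' and '=' from the blocked-character class so query-string URLs survive
sed -i 's/^-\[?\*!@=\]$/-[*!@]/' regex-urlfilter.txt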
Now let's crawl data from Naaptol (mobile phones), Amazon (books), Flipkart (sports shoes), and Jabong (sports shoes).
Edit your seed.txt and paste the following:
http://www.naaptol.com/brands/nokia/mobile-phones.html
http://www.flipkart.com/mens-footwear/shoes/sports-shoes/pr?sid=osp,cil,nit,1cu&otracker=hp_nmenu_sub_men_0_Sports%20Shoes
http://www.amazon.in/s/ref=nb_sb_noss_2/278-5129563-3057638?url=search-alias%3Daps&field-keywords=machine%20learning
http://www.jabong.com/men/shoes/sports-shoes/?source=topnav_men
Start crawling by typing the following into the command line:
bin/nutch inject urls
bin/nutch generate -topN 20
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb

Repeat the last four commands (generate, fetch, parse and updatedb) again.
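Before moving on to Solr, you can confirm that pages actually landed in the MySQL store set up in the previous tutorial (replace xxxxx with your MySQL user):

# count the rows Nutch has written so far
mysql -u xxxxx -p -e "SELECT COUNT(*) FROM nutch.webpage;"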

Set up and index with Solr 
Use the latest version of Solr 4 (I'm using 4.9; any other 4.x version will work fine too). Untar it to $HOME/apache-solr-4.X.X. This folder will from now on be referred to as ${APACHE_SOLR_HOME}.

Download the Nutch schema.xml from this link and use it to replace ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.

From the terminal start solr:

cd ${APACHE_SOLR_HOME}/example

java -jar start.jar

You can check it is running by opening http://localhost:8983/solr in your web browser, as shown below. Select collection1 from the core selector.
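You can also check from the command line with curl (assuming the default collection1 core and port 8983):

# rows=0 returns just the response header with numFound, enough to prove Solr is up
curl 'http://localhost:8983/solr/collection1/select?q=*:*&rows=0&wt=json&indent=true'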




Leave that terminal running and from a different terminal type the following:

cd ${APACHE_NUTCH_HOME}/runtime/local/

bin/nutch solrindex http://localhost:8983/solr/ -reindex

You can now run Solr queries against your crawled content. Open http://localhost:8983/solr/#/collection1/query and, assuming you have already crawled the websites above, do a search by typing the following into the input box titled "q" (or "fq" for a filter query):

content:jabong OR content:nokia
(similarly, try others like shoes, books, etc.)

and you should see the matching documents.
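The same query can be issued over HTTP with curl (a sketch; it assumes the schema.xml you installed above defines the url and title fields, as the stock Nutch schema does):

# query the index directly; fl restricts the returned fields
curl 'http://localhost:8983/solr/collection1/select?q=content:jabong+OR+content:nokia&fl=url,title&rows=5&wt=json&indent=true'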



Congratulations :) You have just built your first small search engine from the crawl data of popular online shopping websites.



Build and Install Nutch 2.2 with MySQL

This tutorial will teach you how to build and set up Apache Nutch (latest version: 2.2) with MySQL. Let's get started!

Install MySQL Server and MySQL Client using the Ubuntu software center or  sudo apt-get install mysql-server mysql-client  at the command line.

As MySQL defaults to Latin-1, we need to edit the config ( sudo vi /etc/mysql/my.cnf ) and add the following under [mysqld]:

innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
max_allowed_packet=500M

The innodb options help deal with MySQL's small primary-key size restriction. The character and collation settings are needed to handle Unicode correctly. The max_allowed_packet setting is optional and only necessary for very large pages. Restart MySQL (or the machine) for the changes to take effect.
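Once MySQL is back up, it's worth confirming the settings took effect (replace xxxxx with your MySQL user):

# both variables should report the values set in my.cnf above
mysql -u xxxxx -p -e "SHOW VARIABLES LIKE 'character_set_server'; SHOW VARIABLES LIKE 'innodb_file_%';"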

Check to make sure MySQL is running by typing  sudo netstat -tap | grep mysql  and you should see something like:
tcp 0 0 localhost:mysql *:* LISTEN

We need to set up the nutch database manually, as the db schema currently generated by Nutch/Gora/MySQL defaults to Latin-1.
Log into MySQL at the command line using the MySQL user and password you set up earlier:

mysql -u xxxxx -p
then in the MySQL editor type the following:

CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci; 

use nutch;

press Enter, and then copy and paste the following all at once:

CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` longtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`prevModifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;

Then press Enter. You are done setting up the MySQL database for Nutch.
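Optionally, verify the table came out as intended (a quick check):

# column list plus row format / charset of the new table
mysql -u xxxxx -p -e "DESCRIBE nutch.webpage; SHOW TABLE STATUS FROM nutch LIKE 'webpage';"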

Set up Nutch 2.2 by downloading the apache-nutch-2.2-src.tar.gz version from
http://www.apache.org/dyn/closer.cgi/nutch/.
Untar the contents of the file you just downloaded to a folder we will refer to
going forward as ${APACHE_NUTCH_HOME}.

From inside the Nutch folder, ensure the MySQL dependency for Nutch is available by editing the following in ${APACHE_NUTCH_HOME}/ivy/ivy.xml
change

<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
to
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>

and uncomment the gora-sql dependency:

<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
and uncomment the mysql connector:

<!-- Uncomment this to use MySQL as database with SQL as Gora store. -->
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

Also update the following org.mortbay.jetty dependencies to version 7.0.0.pre5, or you may get a build failure:

<dependency org="org.mortbay.jetty" name="jetty" rev="7.0.0.pre5" conf="test->default" />
<dependency org="org.mortbay.jetty" name="jetty-util" rev="7.0.0.pre5" conf="test->default" />
<dependency org="org.mortbay.jetty" name="jetty-client" rev="7.0.0.pre5" />

Edit the ${APACHE_NUTCH_HOME}/conf/gora.properties file, either deleting or commenting out (using #) the default SqlStore properties. Then add the MySQL properties below, replacing xxxxx with the user and password you set up when installing MySQL earlier.

###############################
# MySQL properties #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=xxxxx
gora.sqlstore.jdbc.password=xxxxx
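Before building, it's worth a quick sanity check that the credentials you put in gora.properties actually work from the command line:

# should print 'nutch' if the user, password, and database line up
mysql -u xxxxx -p -h localhost nutch -e "SELECT DATABASE();"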

Edit the ${APACHE_NUTCH_HOME}/conf/gora-sql-mapping.xml file, changing the length of the primarykey from 512 to 767 in both places:
<primarykey column="id" length="767"/>

Configure ${APACHE_NUTCH_HOME}/conf/nutch-site.xml to put a name in the value field under http.agent.name. It can be anything, but it cannot be left blank. You must also specify SqlStore via the storage.data.store.class property, as shown below.

<property>
<name>http.agent.name</name>
<value>DemoWebCrawler</value>
</property>
<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field. This allows selecting a non-English language as the default one to retrieve. It is a useful setting for search engines built for a certain national group.
</description>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data. Currently the following stores are available: ....
</description>
</property>

Install ant using the Ubuntu software center or  sudo apt-get install ant  at the command line.
From the command line, cd to ${APACHE_NUTCH_HOME} and simply type  ant runtime
This may take a few minutes to compile.
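When the build finishes, a local runtime is produced; a quick look (exact contents may vary by version):

ls ${APACHE_NUTCH_HOME}/runtime/local
# expect something like: bin  conf  lib  plugins  test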

Start your first crawl by typing the lines below at the terminal (replace ‘http://nutch.apache.org/’ with whatever site you want to crawl):

Inject a URL into the DB

cd ${APACHE_NUTCH_HOME}/runtime/local

mkdir -p urls

echo 'http://nutch.apache.org/' > urls/seed.txt

Start crawling (you will want to create your own script later; a sketch follows below, but for now, just to see what is happening, type the following into the command line):

bin/nutch inject urls
bin/nutch generate -topN 20
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb

Repeat the last four commands (generate, fetch, parse and updatedb) again.
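As mentioned above, you will eventually want to script these rounds. Here is a minimal sketch (the round count of 3 is an arbitrary choice, not something this tutorial prescribes; raise it for deeper crawls):

#!/bin/bash
# crawl.sh - run the inject step once, then several generate/fetch/parse/updatedb rounds
cd ${APACHE_NUTCH_HOME}/runtime/local
bin/nutch inject urls
for round in 1 2 3; do
  echo "--- crawl round ${round} ---"
  bin/nutch generate -topN 20
  bin/nutch fetch -all
  bin/nutch parse -all
  bin/nutch updatedb
done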

For the generate command, topN is the maximum number of links you actually want to fetch and parse in each round. The first time there is only one URL (the one we injected from seed.txt), but after that there are many more. Note, however, that Nutch keeps track of all the links it encounters in the webpage table; it just limits the number it actually processes to topN, so don't be surprised to see many more rows in the webpage table than topN would suggest.
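To see how each round is progressing, you can group the webpage table by crawl status (a sketch; to my understanding the codes follow Nutch's CrawlStatus class, e.g. 1 = unfetched, 2 = fetched):

mysql -u xxxxx -p -e "SELECT status, COUNT(*) AS pages FROM nutch.webpage GROUP BY status;"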

Check your crawl results by looking at the webpage table in the nutch database.
mysql -u xxxxx -p
use nutch;
SELECT * FROM nutch.webpage LIMIT 10;

You should see the 10 rows/results of your crawl (I have shown them in MySQL Workbench):


Now that you have successfully set up Apache Nutch with MySQL and crawled a few websites, it's time to do something more interesting, like using this crawl data for indexing and searching. Follow the next tutorial.