Big Data Analytics and Machine Learning: Web Crawling and Data Mining with Apache Nutch

This tutorial series shows how to do web crawling with Apache Nutch.
After you complete it, you will be able to crawl data from most popular websites, and even build your own search engine with Apache Solr.

With fast-growing technologies such as social media, cloud computing, mobile applications, and big data, these are exciting, and challenging, times to be in computing.
One of the main challenges facing software architects is handling the massive volume of data consumed and produced by a huge, global user base. In addition, users expect online applications to always be available and responsive. To address the scalability and availability needs of modern web applications, we’ve seen a growing interest in specialized, non-relational data storage and processing technologies, collectively known as NoSQL (Not only SQL).

Apache Nutch:
Apache Nutch is a robust and scalable tool for web crawling, and it can also be integrated with the Python scripting language. It is a good fit whenever your application has to crawl a large amount of data. You can also integrate it with a search engine such as Apache Solr very easily.
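To make the idea of crawling a little more concrete, here is a minimal Python sketch of one step that any crawler has to perform: pulling the outgoing links out of a fetched page so they can be queued for the next round. This is not Nutch code and does not use any Nutch API; the class name `LinkExtractor`, the helper `extract_links`, and the example URLs are made up for illustration, and only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags.

    Hypothetical helper for illustration only; Nutch has its own parsers.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the links found in `html`, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    page = '<a href="b.html">next</a> <a href="http://other.org/">other</a>'
    for link in extract_links("http://example.com/a/", page):
        print(link)
```

A real crawler such as Nutch adds a lot on top of this step: fetching politely, respecting robots.txt, removing duplicates, and spreading the work across machines.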

Apache Solr:
Solr is a scalable, ready-to-deploy enterprise search engine that’s optimized to search large volumes of text-centric data and return results sorted by relevance.
Scalable- Solr scales by distributing work (indexing and query processing) to multiple servers in a cluster.
Ready to deploy- Solr is open source, is easy to install and configure, and provides a preconfigured example to help you get started.
Optimized for search- Solr is fast and can execute complex queries in under a second, often in only tens of milliseconds.
Large volumes of documents- Solr is designed to deal with indexes containing many millions of documents.
Text-centric- Solr is optimized for searching natural-language text, like emails, web pages, resumes, PDF documents, and social messages such as tweets or blogs.
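Solr is queried over HTTP, so a search boils down to a URL. The small Python sketch below builds such a URL for Solr's `select` handler. The host, the port (8983 is Solr's usual default), and the core name `nutch` are assumptions for illustration, and `build_solr_query_url` is a hypothetical helper, not part of Solr or Nutch. When no sort order is given, Solr returns results ordered by relevance score.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, core, text, rows=10):
    """Build a URL for Solr's select handler.

    base_url and core are assumptions: adjust them to your own setup.
    """
    params = {
        "q": text,     # the search query
        "rows": rows,  # how many results to return
        "wt": "json",  # ask for a JSON response
    }
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    url = build_solr_query_url("http://localhost:8983/solr", "nutch", "web crawling", rows=5)
    print(url)
    # prints: http://localhost:8983/solr/nutch/select?q=web+crawling&rows=5&wt=json
```

Opening the printed URL in a browser, or fetching it with `urllib.request.urlopen`, should return the matching documents once Solr is running and the core holds indexed data.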

This tutorial series consists of two parts:

Part 1- Build and Install Nutch 2.2 with MySQL

Part 2- Crawling Naptol, Flipkart, Amazon, Jabong with Apache Nutch and Solr

Get ready to have some fun..!!


Saturday, 24 January 2015
