How to code a simple web crawler using Java

There is a gigantic amount of information on the internet, far more than a single human could read in a million years. To gather it, automated software is required: a crawler.

In this post, I will show you how to make a prototype of a Web crawler step by step using Java. Making a Web crawler is not as difficult as it sounds. Just follow the guide and you will quickly get there in about an hour or less, and then enjoy the huge amount of information it can get for you. The goal of this tutorial is to be the simplest tutorial in the world for making a crawler using Java. As this is only a prototype, you will need to spend more time customizing it for your own needs.

I assume you know the following:

  • Basic Java programming
  • A little bit of SQL and the MySQL database

If you don’t want to use a database, you can use a file to track the crawling history.

1. The goal

In this tutorial, the goal is the following:

Given a website root URL, e.g., “”, return all the pages from this website that contain the string “hack”.

A typical crawler works in the following steps:

  1. Parse the root web page (“”), and get all the links from this page. To access each URL and parse the HTML, I will use JSoup, a convenient and simple Java library.
  2. Using the URLs retrieved in step 1, parse those pages in the same way.
  3. While doing the above steps, we need to track which pages have been processed before, so that each web page gets processed only once. This is why we need a database.
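The three steps above can be sketched as a breadth-first traversal with a visited set. The sketch below runs over a tiny in-memory “web” (a map from each URL to the links found on that page); in the real crawler, JSoup would fetch and parse the page, and the visited set would live in the database instead:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CrawlSketch {
    // Breadth-first crawl: start at root, follow every link, but process
    // each page only once thanks to the visited set.
    public static List<String> crawl(Map<String, List<String>> web, String root) {
        Set<String> visited = new HashSet<>();   // plays the role of the database
        Deque<String> queue = new ArrayDeque<>();
        List<String> order = new ArrayList<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            if (!visited.add(url)) continue;     // already processed: skip
            order.add(url);                      // "process" the page here
            for (String link : web.getOrDefault(url, List.of())) {
                if (!visited.contains(link)) queue.add(link);
            }
        }
        return order;                            // each page appears exactly once
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
                "/", List.of("/a", "/b"),
                "/a", List.of("/b", "/"),
                "/b", List.of("/"));
        System.out.println(crawl(web, "/"));     // prints [/, /a, /b]
    }
}
```

Note that even though pages link back to each other here, the visited set keeps the loop from running forever.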

2. Set up MySQL database

If you are using Windows, you can simply use WampServer. You can download it from and install it within a minute, and then you are good to go for the next step.

I will use phpMyAdmin to manipulate the MySQL database. It is simply a GUI for working with MySQL. It is totally fine if you use any other tool, or no GUI tool at all.

Create a database named “Crawler” and create a table called “Record” like the following:
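The original post shows the table definition at this point; since it is missing here, a minimal schema might look like the following (the column names are my assumption — all the crawler really needs is the URL of each processed page):

```sql
CREATE DATABASE IF NOT EXISTS Crawler;

USE Crawler;

CREATE TABLE Record (
    RecordID INT NOT NULL AUTO_INCREMENT,
    URL      TEXT NOT NULL,
    PRIMARY KEY (RecordID)
);
```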

3. Start crawling using Java

1). Download the JSoup core library from
Download mysql-connector-java-xxx-bin.jar from

2). Now create a project in Eclipse named “Crawler” and add the JSoup and mysql-connector jar files you downloaded to the Java Build Path. (Right-click the project –> select “Build Path” –> “Configure Build Path” –> click the “Libraries” tab –> click “Add External JARs”.)

3). Create a class named “DB” to handle the database actions.
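The original post includes the class body here; since it is missing, below is a minimal sketch of what such a DB helper might look like. It uses plain JDBC from the standard `java.sql` package (the mysql-connector jar supplies the driver at runtime); the connection URL, user, password, and method names are assumptions you would adjust for your own setup:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class DB {
    public Connection conn = null;  // stays null until connect() succeeds

    // Open a connection to the local Crawler database.
    // The URL, user, and password here are assumptions; adjust for your setup.
    public void connect() throws SQLException {
        conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/Crawler", "root", "");
    }

    // Return true if this URL is already stored in the Record table.
    public boolean isVisited(String url) throws SQLException {
        String sql = "SELECT 1 FROM Record WHERE URL = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, url);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }

    // Remember that this URL has been processed.
    public void insertRecord(String url) throws SQLException {
        String sql = "INSERT INTO Record (URL) VALUES (?)";
        try (PreparedStatement ps = ps(sql)) {
            ps.setString(1, url);
            ps.executeUpdate();
        }
    }

    private PreparedStatement ps(String sql) throws SQLException {
        return conn.prepareStatement(sql);
    }
}
```

Using prepared statements (rather than concatenating the URL into the SQL string) keeps URLs containing quotes from breaking the query.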

Now you have your own Web crawler. Of course, you will still need to filter out links you don’t want to crawl.
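As a starting point for that filtering, a small predicate like the one below can help. The rules here (stay on the target host, skip non-HTML resources and mailto/javascript links) are just examples, not a complete policy:

```java
import java.util.Set;

public class LinkFilter {
    // File extensions that are not HTML pages and are not worth parsing.
    private static final Set<String> SKIP_EXTENSIONS =
            Set.of(".jpg", ".png", ".gif", ".pdf", ".zip");

    // Decide whether a discovered link is worth crawling.
    public static boolean shouldCrawl(String url, String allowedHost) {
        if (url == null || url.isEmpty()) return false;
        if (url.startsWith("mailto:") || url.startsWith("javascript:")) return false;
        String lower = url.toLowerCase();
        for (String ext : SKIP_EXTENSIONS) {
            if (lower.endsWith(ext)) return false;   // skip binary resources
        }
        return lower.contains(allowedHost);          // stay on the target site
    }
}
```

You would call this on every link extracted in step 1 before adding it to the queue.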

Take your time and comment on this article.
