There is a gigantic amount of information on the internet, far more than a single human could read in a million years. To gather it, automated software is required: yes, a crawler.
In this post, I will show you how to build a prototype web crawler step by step using Java. Building a web crawler is not as difficult as it sounds. Just follow the guide and you will get there in about an hour or less, and then enjoy the huge amount of information it can gather for you. The goal of this tutorial is to be the simplest tutorial in the world for writing a crawler in Java. As this is only a prototype, you will need to spend more time customizing it for your needs.
I assume you know the following:
- Basic Java programming
- A little bit about SQL and MySQL databases.
If you don’t want to use a database, you can use a file to track the crawling history.
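For example, here is a minimal sketch of file-based tracking, assuming a plain-text history file with one URL per line (the class name and file name are made up for illustration):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

public class FileHistory {

    private final Path file;
    private final Set<String> visited = new HashSet<>();

    public FileHistory(String fileName) throws IOException {
        file = Paths.get(fileName);
        // load previously crawled URLs, one per line
        if (Files.exists(file)) {
            visited.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
        }
    }

    // returns true if the URL was new, and appends it to the history file
    public boolean markVisited(String url) throws IOException {
        if (!visited.add(url)) {
            return false; // already crawled
        }
        Files.write(file, (url + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return true;
    }
}
```

The rest of this tutorial uses MySQL, which scales better than re-reading a file once the crawl history grows large.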
1. The goal
In this tutorial, the goal is as the following:
Given a website root URL, e.g., “codingsec.net”, return all pages from this website that contain the string “hack”.
A typical crawler works in the following steps:
- Parse the root web page (“codingsec.net”) and get all links from it. To access each URL and parse the HTML, I will use JSoup, a convenient and simple Java library. (A minimal example follows this list.)
- Take the URLs retrieved in step 1 and parse them in the same way.
- While doing the above steps, we need to track which pages have already been processed, so that each web page is processed only once. This is why we need a database.
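As a quick preview of step 1, here is a minimal, self-contained JSoup sketch (the class name `LinkLister` is mine, just for illustration) that fetches the root page and prints every link it finds; the full crawler later in this tutorial adds the database bookkeeping on top of this:

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkLister {

    public static void main(String[] args) throws IOException {
        // fetch and parse the root page
        Document doc = Jsoup.connect("http://www.codingsec.net").get();
        // "a[href]" selects every anchor tag that has an href attribute
        for (Element link : doc.select("a[href]")) {
            // "abs:href" resolves relative links against the page URL
            System.out.println(link.attr("abs:href"));
        }
    }
}
```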
2. Set up MySQL database
If you are using Windows, you can simply use WampServer. Download it from wampserver.com, install it in a minute, and you are good to go for the next step.
I will use phpMyAdmin to manipulate the MySQL database. It is simply a GUI for working with MySQL. It is totally fine if you use any other tool, or no GUI tool at all.
Create a database named “Crawler” and create a table called “Record” like the following:
```sql
CREATE TABLE IF NOT EXISTS `Record` (
  `RecordID` INT(11) NOT NULL AUTO_INCREMENT,
  `URL` text NOT NULL,
  PRIMARY KEY (`RecordID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1;
```
3. Start crawling using Java
1). Download the JSoup core library from http://jsoup.org/download.
Download mysql-connector-java-xxx-bin.jar from http://dev.mysql.com/downloads/connector/j/
2). Now create a project in Eclipse named “Crawler” and add the JSoup and mysql-connector JAR files you downloaded to the Java Build Path. (Right-click the project –> select “Build Path” –> “Configure Build Path” –> click the “Libraries” tab –> click “Add External JARs”.)
3). Create a class named “DB” which is used for handling database actions.
```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DB {

    public Connection conn = null;

    public DB() {
        try {
            Class.forName("com.mysql.jdbc.Driver");
            String url = "jdbc:mysql://localhost:3306/Crawler";
            conn = DriverManager.getConnection(url, "root", "admin213");
            System.out.println("conn built");
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
    }

    public ResultSet runSql(String sql) throws SQLException {
        Statement sta = conn.createStatement();
        return sta.executeQuery(sql);
    }

    public boolean runSql2(String sql) throws SQLException {
        Statement sta = conn.createStatement();
        return sta.execute(sql);
    }

    @Override
    protected void finalize() throws Throwable {
        // close the connection when this object is garbage collected
        if (conn != null && !conn.isClosed()) {
            conn.close();
        }
    }
}
```

4). Create a class named “Main” which will be our crawler.

```java
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {

    public static DB db = new DB();

    public static void main(String[] args) throws SQLException, IOException {
        // clear the crawl history, then start from the root URL
        db.runSql2("TRUNCATE Record;");
        processPage("http://www.codingsec.net");
    }

    public static void processPage(String URL) throws SQLException, IOException {
        // check if the given URL is already in the database
        String sql = "select * from Record where URL = '" + URL + "'";
        ResultSet rs = db.runSql(sql);
        if (rs.next()) {
            // already processed; do nothing
        } else {
            // store the URL in the database to avoid parsing it again
            sql = "INSERT INTO `Crawler`.`Record` (`URL`) VALUES (?);";
            PreparedStatement stmt = db.conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
            stmt.setString(1, URL);
            stmt.execute();

            // fetch the current page and check it for the target string
            Document doc = Jsoup.connect(URL).get();
            if (doc.text().contains("hack")) {
                System.out.println(URL);
            }

            // get all links and recursively call the processPage method
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                if (link.attr("href").contains("codingsec.net"))
                    processPage(link.attr("abs:href"));
            }
        }
    }
}
```
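One caveat before you reuse this prototype: the duplicate check builds its SELECT by string concatenation, so a URL containing a quote character breaks the query and opens the door to SQL injection. Here is a minimal sketch of a safer check, written as a hypothetical helper (`isVisited` is my name, not from the original code) you could add to Main:

```java
// safer duplicate check using a parameterized query instead of string concatenation
public static boolean isVisited(String url) throws SQLException {
    String sql = "SELECT RecordID FROM Record WHERE URL = ?";
    PreparedStatement stmt = db.conn.prepareStatement(sql);
    stmt.setString(1, url);
    ResultSet rs = stmt.executeQuery();
    return rs.next(); // true if the URL is already recorded
}
```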
Now you have your own web crawler. Of course, you will still need to filter out links you don’t want to crawl; one possible sketch follows.
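For instance, a small filter method (sketched here as a hypothetical `shouldCrawl` helper; the extension list is illustrative, not from the original post) can skip media files and off-site links before recursing:

```java
// illustrative filter: stay on the target site and skip non-HTML resources
public static boolean shouldCrawl(String url) {
    if (url == null || url.isEmpty()) return false;
    if (!url.contains("codingsec.net")) return false; // stay on the target domain
    String lower = url.toLowerCase();
    // skip common binary/media resources (extend this list as needed)
    return !(lower.endsWith(".jpg") || lower.endsWith(".png")
            || lower.endsWith(".pdf") || lower.endsWith(".zip"));
}
```

You would then guard the recursive call with `if (shouldCrawl(link.attr("abs:href"))) processPage(...)` instead of the bare `contains("codingsec.net")` check.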
Take your time and comment on this article.