Web Crawler Utilities – JSpider tools


JSpider-tool is a set of utilities built on top of the JSpider application. JSpider is an open source product written in Java and released under the LGPL. JSpider-tool covers basic crawling tasks. JSpider is available for download together with its sources. After extracting the archive, jspider-tool is found as a utility in the bin folder.

Functionality available with JSpider-tool:

  • Can print the headers sent by a web server
  • Can display information about a web resource
  • Can display the content of a web resource
  • Can download a certain file from a web server to a local file
  • Can find all links to other resources in a certain page
  • Can find all e-mail addresses mentioned in a web page

Tools
There are several tools implemented in JSpider-tool. I'll explain them one by one, each with an example.

1. headers
It prints out the headers sent by a web server, which helps in understanding what is being sent to web clients. Note the X-hacker header sent by my blog :)

Example:

C:\softwares\jspider-src-0.5.0-dev\bin>jspider-tool headers http://paritoshranjan.wordpress.com
[Engine] jspider.home=..
[Engine] default output folder=..\output
[Engine] starting with configuration 'tool'
null:HTTP/1.1 200 OK
Server:nginx
Date:Mon, 05 Jul 2010 06:14:13 GMT
Content-Type:text/html; charset=UTF-8
Transfer-Encoding:chunked
Connection:close
Vary:Cookie
X-hacker:If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
X-Pingback:http://paritoshranjan.wordpress.com/xmlrpc.php
Link:; rel=shortlink
Set-Cookie:fid=23388923; path=/; domain=.wordpress.com
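
To make the idea concrete, here is a rough Python sketch that prints response headers in a similar Name:value layout. It is not how JSpider does it internally: the function names format_headers and fetch_headers are my own, and sending a HEAD request (rather than fetching the whole page) is an assumption of this sketch.

```python
from urllib.request import Request, urlopen


def format_headers(status_line, headers):
    """Render a status line and (name, value) pairs as 'Name:value' lines."""
    lines = [status_line]
    lines.extend(f"{name}:{value}" for name, value in headers)
    return "\n".join(lines)


def fetch_headers(url, timeout=10):
    """Send a HEAD request (an assumption of this sketch) and format the reply."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        status_line = f"HTTP {response.status} {response.reason}"
        return format_headers(status_line, response.getheaders())
```

Calling print(fetch_headers(url)) with any reachable URL prints one header per line. Some servers answer HEAD differently from GET, so the result may not match a browser's view exactly.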

2. info

The info tool reports everything that headers does, plus some additional information such as the size of the file and the time taken for the request to complete.

C:\softwares\jspider-src-0.5.0-dev\bin>jspider-tool.bat info http://paritoshranjan.wordpress.com
[Engine] jspider.home=..
[Engine] default output folder=..\output
[Engine] starting with configuration 'tool'
URL          : http://paritoshranjan.wordpress.com
HTTP Headers :
null:HTTP/1.1 200 OK
Server:nginx
Date:Mon, 05 Jul 2010 06:22:54 GMT
Content-Type:text/html; charset=UTF-8
Transfer-Encoding:chunked
Connection:close
Vary:Cookie
X-hacker:If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
X-Pingback:http://paritoshranjan.wordpress.com/xmlrpc.php
Link:; rel=shortlink
Set-Cookie:fid=1982526567; path=/; domain=.wordpress.com
Mime Type    : text/html; charset=UTF-8
Size         : 56054
Time (ms)    : 3589

It returns the URL fetched, the mime type, size of the content and the time it took to fetch the resource.
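
The extra fields are easy to reproduce. The Python sketch below measures the size and the elapsed time for a fetch and lays them out in the same label/value style as above. The names summarize and fetch_info are hypothetical, and timing the whole request with time.perf_counter is my assumption; JSpider may measure it differently.

```python
import time
from urllib.request import urlopen


def summarize(url, mime_type, size, elapsed_ms):
    """Lay out the fields as 'Label        : value' rows (labels padded to 13)."""
    rows = [
        ("URL", url),
        ("Mime Type", mime_type),
        ("Size", size),
        ("Time (ms)", elapsed_ms),
    ]
    return "\n".join(f"{label:<13}: {value}" for label, value in rows)


def fetch_info(url, timeout=10):
    """Fetch the resource, then report its MIME type, byte size and duration."""
    started = time.perf_counter()
    with urlopen(url, timeout=timeout) as response:
        body = response.read()
        mime_type = response.headers.get("Content-Type", "unknown")
    elapsed_ms = round((time.perf_counter() - started) * 1000)
    return summarize(url, mime_type, len(body), elapsed_ms)
```

The size here is the number of bytes actually read, which is useful when the server sends a chunked response without a Content-Length header.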

3. fetch
The fetch utility displays the content of the given resource.

C:\softwares\jspider-src-0.5.0-dev\bin>jspider-tool.bat fetch http://paritoshranjan.wordpress.com/robots.txt

results in the robots.txt file being printed out:

[Engine] jspider.home=..
[Engine] default output folder=..\output
[Engine] starting with configuration 'tool'
Sitemap: http://paritoshranjan.wordpress.com/sitemap.xml

User-agent: IRLbot
Crawl-delay: 3600

User-agent: *
Disallow: /next/

# har har
User-agent: *
Disallow: /activate/

User-agent: *
Disallow: /signup/

User-agent: *
Disallow: /related-tags.php

User-agent: *
Disallow:

4. download

The download utility downloads the requested resource and saves it in a file on the local filesystem. In the example below, the second argument (robots.txt) is the name of the local file.
C:\softwares\jspider-src-0.5.0-dev\bin>jspider-tool.bat download http://paritoshranjan.wordpress.com robots.txt
[Engine] jspider.home=..
[Engine] default output folder=..\output
[Engine] starting with configuration 'tool'
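
For comparison, the same operation takes only a few lines of Python. This is a minimal sketch and not JSpider's code; the function name download and its timeout parameter are my own choices.

```python
import shutil
from urllib.request import urlopen


def download(url, filename, timeout=10):
    """Stream the resource at `url` into the local file `filename`."""
    with urlopen(url, timeout=timeout) as response, open(filename, "wb") as target:
        # Copy in chunks so large files are never held in memory at once.
        shutil.copyfileobj(response, target)
    return filename
```

Like the tool, it takes the URL first and the local file name second, for example download(url, "robots.txt").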

5. findlinks

The findlinks utility finds the links on the given page. The links it reports can then be used as starting points for further crawling.

C:\softwares\jspider-src-0.5.0-dev\bin>jspider-tool findlinks http://paritoshranjan.wordpress.com
[Engine] jspider.home=..
[Engine] default output folder=..\output
[Engine] starting with configuration 'tool'

http://s2.wp.com/wp-content/themes/pub/inove/style.css

http://s2.wp.com/wp-content/themes/pub/inove/ie.css

http://s2.wp.com/wp-content/themes/pub/inove/js/base.js

http://s2.wp.com/wp-content/themes/pub/inove/js/menu.js

http://paritoshranjan.wordpress.com/feed/

http://paritoshranjan.wordpress.com/xmlrpc.php

http://s0.wp.com/wp-content/themes/h4/global.css

http://paritoshranjan.wordpress.com/xmlrpc.php

http://paritoshranjan.wordpress.com/wp-includes/wlwmanifest.xml
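
Link extraction of this kind can be sketched in Python with the standard html.parser module. The code below is only an illustration and not JSpider's parser: the class LinkCollector and the function find_links are hypothetical names, and I am assuming that every href and src attribute counts as a link, which matches the stylesheets and scripts in the listing above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href/src attribute values, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Turn relative references into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def find_links(html, base_url):
    """Return every link found in `html`, in document order, duplicates kept."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

Duplicates are kept on purpose, just as xmlrpc.php shows up twice in the output above. A crawler that feeds these links back in would normally track the URLs it has already visited.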

6. email
The email tool works the same way as the findlinks tool, but reports all e-mail addresses found in the web resource:
C:\softwares\jspider-src-0.5.0-dev\bin>jspider-tool email http://paritoshranjan.wordpress.com
Running this command prints all e-mail addresses found in the web page.
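
A simple way to approximate this is a regular expression over the page text. The Python sketch below is my own illustration, not JSpider's matching logic: the pattern is deliberately loose and will miss some valid addresses and accept some invalid ones, and the name find_emails is hypothetical.

```python
import re

# A loose pattern: local part, '@', then a domain ending in a 2+ letter label.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return the distinct addresses in `text`, in order of first appearance."""
    seen = []
    for match in EMAIL_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen
```

Running it on raw HTML also picks up addresses inside mailto: links, which is why the results are de-duplicated.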
