JSpider-tool is a set of utilities built on top of the JSpider application. JSpider is an open source product written in Java and is available under the LGPL license. JSpider-tool can be used to perform basic crawling tasks. JSpider, along with its sources, can be downloaded from the JSpider project site. After extracting the archive, jspider-tool can be found as a utility in the bin folder.
Functionality available with JSpider-tool:
- Can print the headers sent by a web server
- Can display information about a web resource
- Can display the content of a web resource
- Can download a certain file from a web server to a local file
- Can find all links to other resources in a certain page
- Can find all e-mail addresses mentioned in a web page
Tools
There are several tools implemented in JSpider-tool. I'll explain them one by one, each with an example.
1. headers
The headers tool prints out the headers sent by a web server, which helps in understanding what is being sent to web clients. Note the X-hacker header returned by my blog in the example below.
Example:
C:\softwares\jspider-src-0.5.0-dev\bin>jspider-tool headers http://paritoshranjan.wordpress.com
[Engine] jspider.home=..
[Engine] default output folder=..\output
[Engine] starting with configuration 'tool'
null:HTTP/1.1 200 OK
Server:nginx
Date:Mon, 05 Jul 2010 06:14:13 GMT
Content-Type:text/html; charset=UTF-8
Transfer-Encoding:chunked
Connection:close
Vary:Cookie
X-hacker:If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
X-Pingback:http://paritoshranjan.wordpress.com/xmlrpc.php
Link:; rel=shortlink
Set-Cookie:fid=23388923; path=/; domain=.wordpress.com
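To make the idea concrete, here is a minimal Python sketch of what a headers-style tool does. It is not JSpider's implementation; the function names fetch_headers and format_headers are my own, and the output only imitates the Name:value layout shown above. As a side note, the "null:" prefix on the status line in the JSpider output is most likely an artifact of Java's URLConnection, which stores the status line under a null key.

```python
import urllib.request


def format_headers(status_line, headers):
    """Render a status line and (name, value) pairs as 'Name:value' lines."""
    lines = [status_line]
    lines.extend(f"{name}:{value}" for name, value in headers)
    return "\n".join(lines)


def fetch_headers(url):
    """Request `url` and return (status_line, [(name, value), ...]).

    Hypothetical helper: it performs a plain GET and reads only the
    response metadata, not the body.
    """
    with urllib.request.urlopen(url) as response:
        status_line = f"{response.status} {response.reason}"
        return status_line, response.headers.items()


# Offline demonstration with hand-written sample data:
print(format_headers("200 OK", [("Server", "nginx"), ("Connection", "close")]))
```

Calling fetch_headers with a real URL and passing its result to format_headers would print the live headers of that server.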
2. info
The info tool reports everything the headers tool does, plus some additional details such as the size of the resource and the round-trip time of the request.
C:\softwares\jspider-src-0.5.0-dev\bin>jspider-tool.bat info http://paritoshranjan.wordpress.com
[Engine] jspider.home=..
[Engine] default output folder=..\output
[Engine] starting with configuration 'tool'
URL : http://paritoshranjan.wordpress.com
HTTP Headers :
null:HTTP/1.1 200 OK
Server:nginx
Date:Mon, 05 Jul 2010 06:22:54 GMT
Content-Type:text/html; charset=UTF-8
Transfer-Encoding:chunked
Connection:close
Vary:Cookie
X-hacker:If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
X-Pingback:http://paritoshranjan.wordpress.com/xmlrpc.php
Link:; rel=shortlink
Set-Cookie:fid=1982526567; path=/; domain=.wordpress.com
Mime Type : text/html; charset=UTF-8
Size : 56054
Time (ms) : 3589
Besides the headers, it reports the URL fetched, the MIME type, the size of the content, and the time in milliseconds it took to fetch the resource.
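The same measurements are easy to reproduce by hand. The following Python sketch is only an illustration under my own assumptions (the function name resource_info and the dictionary keys are made up, and JSpider may measure time differently): it times the whole request, reads the body to get its size, and takes the MIME type from the Content-Type header.

```python
import time
import urllib.request


def resource_info(url):
    """Fetch `url` and return its MIME type, body size and elapsed time."""
    start = time.monotonic()
    with urllib.request.urlopen(url) as response:
        body = response.read()  # read everything so the size is known
        mime_type = response.headers.get("Content-Type")
    elapsed_ms = int((time.monotonic() - start) * 1000)
    return {
        "url": url,
        "mime_type": mime_type,
        "size": len(body),      # size in bytes of the body as received
        "time_ms": elapsed_ms,  # request + download time, in milliseconds
    }
```

For a quick check without network access, urllib also accepts data: URLs, for example resource_info("data:text/plain,hello").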
3. fetch
The fetch utility displays the content of the given resource. For example, running
C:\softwares\jspider-src-0.5.0-dev\bin>jspider-tool.bat fetch http://paritoshranjan.wordpress.com/robots.txt
results in the robots.txt file being printed out:
[Engine] jspider.home=..
[Engine] default output folder=..\output
[Engine] starting with configuration 'tool'
Sitemap: http://paritoshranjan.wordpress.com/sitemap.xml
User-agent: IRLbot
Crawl-delay: 3600
User-agent: *
Disallow: /next/
# har har
User-agent: *
Disallow: /activate/
User-agent: *
Disallow: /signup/
User-agent: *
Disallow: /related-tags.php
User-agent: *
Disallow:
4. download
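The download utility downloads the requested resource and saves it to a file on the local filesystem. The first argument is the URL to download and the second is the name of the local file; in the example below, the content returned for the URL is saved as robots.txt.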
C:\softwares\jspider-src-0.5.0-dev\bin>jspider-tool.bat download http://paritoshranjan.wordpress.com robots.txt
[Engine] jspider.home=..
[Engine] default output folder=..\output
[Engine] starting with configuration 'tool'
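For comparison, a download step like this can be sketched in a few lines of Python. This is my own illustration, not how JSpider does it; the function name download and its return value are assumptions.

```python
import os
import shutil
import urllib.request


def download(url, local_path):
    """Stream the resource at `url` into `local_path`.

    Returns the number of bytes written, as reported by the filesystem.
    """
    with urllib.request.urlopen(url) as response, open(local_path, "wb") as out:
        # Copy in chunks so large files are not held in memory at once.
        shutil.copyfileobj(response, out)
    return os.path.getsize(local_path)
```

Like the JSpider tool, the sketch takes the URL first and the local file name second, so download(url, "robots.txt") mirrors the command above.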
5. findlinks
The findlinks utility lists the links found on the given page. The resulting URLs can then be used as starting points for crawling further.
C:\softwares\jspider-src-0.5.0-dev\bin>jspider-tool findlinks http://paritoshranjan.wordpress.com
[Engine] jspider.home=..
[Engine] default output folder=..\output
[Engine] starting with configuration 'tool'
http://s2.wp.com/wp-content/themes/pub/inove/style.css
http://s2.wp.com/wp-content/themes/pub/inove/ie.css
http://s2.wp.com/wp-content/themes/pub/inove/js/base.js
http://s2.wp.com/wp-content/themes/pub/inove/js/menu.js
http://paritoshranjan.wordpress.com/feed/
http://paritoshranjan.wordpress.com/xmlrpc.php
http://s0.wp.com/wp-content/themes/h4/global.css
http://paritoshranjan.wordpress.com/xmlrpc.php
http://paritoshranjan.wordpress.com/wp-includes/wlwmanifest.xml
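Link extraction of this kind boils down to parsing the HTML and collecting URL-bearing attributes. Here is a minimal Python sketch using the standard html.parser module. It is an approximation under my own assumptions: I only look at href and src attributes, the names LinkCollector and find_links are hypothetical, and JSpider's own rules for what counts as a link may differ. Duplicates are kept, as in the output above where xmlrpc.php appears twice.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the values of href/src attributes as absolute URLs."""

    LINK_ATTRS = {"href", "src"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.LINK_ATTRS and value:
                # Resolve relative references against the page URL.
                self.links.append(urljoin(self.base_url, value))


def find_links(html, base_url):
    """Return all links found in `html`, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


sample = (
    '<link rel="stylesheet" href="/style.css">'
    '<script src="http://s2.example/js/base.js"></script>'
    '<a href="feed/">Feed</a>'
)
for link in find_links(sample, "http://blog.example/"):
    print(link)
```

Feeding each returned URL back into the same function is the basic loop of a crawler, which is why this tool is a natural building block for further crawling.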
6. email
The email tool works the same way as the findlinks tool, but reports all e-mail addresses found in the web resource:
C:\softwares\jspider-src-0.5.0-dev\bin>jspider-tool email http://paritoshranjan.wordpress.com
Running this command prints out all e-mail addresses found in the web page.
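A simple way to approximate this behaviour is a regular expression scan over the page text. The Python sketch below is my own illustration, not JSpider's code: the pattern is deliberately simple and will not match every valid address, and the choice to drop duplicates is an assumption.

```python
import re

# Simplified pattern: local part, "@", domain labels, and a final
# alphabetic top-level domain of at least two letters.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return the distinct e-mail addresses in `text`, in order of appearance."""
    found = []
    for address in EMAIL_PATTERN.findall(text):
        if address not in found:
            found.append(address)
    return found


sample = (
    'Contact <a href="mailto:alice@example.com">alice@example.com</a> '
    "or bob.smith@mail.example.org."
)
print(find_emails(sample))
```

Because the scan works on raw text, it picks up addresses both in mailto: links and in visible page content.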




