LDSpider

The LDSpider project aims to build a web crawling framework for the linked data web. The requirements and challenges of crawling the linked data web differ from those of regular web crawling, so this project offers a web crawler adapted to traverse and harvest sources and instances from the linked data web. We offer a single jar that can be easily integrated into your own applications.

Author(s): Andreas Harth, Jürgen Umbrich, Aidan Hogan, Robert Isele
Contact Email: andreas.harth@deri.org
User Interface: Command Line
Programming Language(s): Java

Documentation

LDSpider is a web crawling framework for the Web of Data. It can be used as a command line application or through a flexible API from within another application.

Requirements:

  • Java Runtime Environment

Usage:

  1. Download the latest LDSpider version.
  2. Run:
    $ java -jar ldspider-1.1d.jar 
    usage:  [-a <file>] [-b <depth uri-limit> | -c <max-uris>] [-h] [-n]
            [-o <file>] [-r <redirects>] [-s <file> | -u <uri>] 
            [-t <threads>]  [-y]
    
     -a <file>                 name of access log file
     -b <depth uri-limit>      do strict breadth-first
     -c <max-uris>             use load balanced crawling strategy
     -h,--help                 print help
     -n                        do not extract links - just follow redirects
     -o <file>                 name of NQuad file with output
     -oe <uri>                 URI of an endpoint that supports SPARQL/Update
     -r <redirects>            write redirects.nx file
     -s <file>                 location of seed list
     -t <threads>              number of threads (default 2)
     -u <uri>                  uri of an instance
     -y                        stay on domains of seed uris
     -follow <uri>             only follow a specific predicate 
                               e.g. http://www.w3.org/2002/07/owl#sameAs
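
The options above can also be driven from another Java program by launching the jar as a child process. The following is a minimal sketch under stated assumptions, not the LDSpider API: the class `LdSpiderCommand` and its methods are hypothetical helpers, and the seed URI, output file name, and limits are placeholder values. It uses only flags listed in the help output (`-u`, `-c`, `-o`, `-t`).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of LDSpider): assembles and launches the documented CLI. */
final class LdSpiderCommand {

    /** Builds the argument list for a load-balanced crawl (-c) starting from one seed URI (-u). */
    static List<String> build(String jarPath, String seedUri, int maxUris,
                              String outputFile, int threads) {
        if (maxUris <= 0 || threads <= 0) {
            throw new IllegalArgumentException("maxUris and threads must be positive");
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("-u");
        cmd.add(seedUri);                     // uri of an instance
        cmd.add("-c");
        cmd.add(Integer.toString(maxUris));   // load balanced crawling strategy
        cmd.add("-o");
        cmd.add(outputFile);                  // NQuad output file
        cmd.add("-t");
        cmd.add(Integer.toString(threads));   // number of threads
        return cmd;
    }

    /** Runs the crawler as a child process, sharing this process's console, and returns its exit code. */
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Placeholder values; adjust the jar path and seed URI to your setup.
        List<String> cmd = build("ldspider-1.1d.jar", "http://example.org/resource",
                                 100, "output.nq", 4);
        // Print the command instead of running it, so the sketch works without the jar present.
        System.out.println(String.join(" ", cmd));
    }
}
```

Calling `LdSpiderCommand.run(cmd)` would start the crawl, provided the jar exists at the given path. Keeping the argument assembly separate from the process launch makes the command easy to inspect or log before anything is executed.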