The LDSpider project aims to build a web crawling framework for the linked data web. Because the requirements and challenges of crawling the linked data web differ from those of regular web crawling, this project offers a web crawler adapted to traverse and harvest sources and instances from the linked data web. It is distributed as a single jar that can be easily integrated into your own applications.
LDSpider is a web crawling framework for the Web of Data. It can be used as a command line application or, through a flexible API, from within another application.
- Java Runtime Environment
- Download the latest LDSpider version.
    $ java -jar ldspider-1.1d.jar
    usage: [-a <file>] [-b <depth uri-limit> | -c <max-uris>] [-h] [-n]
           [-o <file>] [-r <redirects>] [-s <file> | -u <uri>]
           [-t <threads>] [-y]
     -a <file>               name of access log file
     -b <depth uri-limit>    do strict breadth-first
     -c <max-uris>           use load balanced crawling strategy
     -h,--help               print help
     -n                      do not extract links - just follow redirects
     -o <file>               name of NQuad file with output
     -oe <uri>               URI of an endpoint that supports SPARQL/Update
     -r <redirects>          write redirects.nx file
     -s <file>               location of seed list
     -t <threads>            number of threads (default 2)
     -u <uri>                uri of an instance
     -y                      stay on domains of seed uris
     -follow <uri>           only follow a specific predicate e.g.
                             http://www.w3.org/2002/07/owl#sameAs
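As a rough illustration of how the options above combine, the following sketch prepares a seed list and assembles a strict breadth-first crawl. The file names `seeds.txt` and `crawl.nq`, the `example.org` seed URIs, and the values `2` and `100` passed to `-b` are placeholders chosen for this example, and the one-URI-per-line seed format is an assumption rather than something stated above. The script only invokes the crawler if the jar is present in the current directory; otherwise it prints the command it would have run.

```shell
#!/bin/sh
# Hypothetical file names; adjust them to your setup.
JAR=ldspider-1.1d.jar
SEEDS=seeds.txt
OUT=crawl.nq

# Write a seed list. One URI per line is an assumed format,
# and the example.org URIs are placeholders.
cat > "$SEEDS" <<'EOF'
http://example.org/resource/a
http://example.org/resource/b
EOF

# Collect the options as positional parameters:
#   -s  seed list           -b  strict breadth-first (depth, uri-limit)
#   -o  NQuad output file   -t  number of threads
#   -y  stay on the domains of the seed URIs
set -- -s "$SEEDS" -b 2 100 -o "$OUT" -t 4 -y

if [ -f "$JAR" ]; then
    java -jar "$JAR" "$@"
else
    # Jar not found: show the command instead of failing.
    echo "would run: java -jar $JAR $*"
fi
```

Replacing `-s "$SEEDS"` with `-u <uri>` would start from a single instance instead of a seed list; the usage text lists the two as alternatives.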