Christian Becker

In the course of my diploma thesis, I evaluated the performance of several RDF stores when small pieces of information are requested from a large dataset (DBpedia infoboxes plus two very small sets). The benchmark queries employ varying levels of joins and constraints.

As of now, only the configuration for OpenLink Virtuoso has been optimized; this must be taken into account when comparing performance.

Note: This work has been superseded by the Berlin SPARQL Benchmark (BSBM) which is available at http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/.

Contents

  1. Motivation
  2. Tested RDF Stores
  3. Dataset
  4. Benchmark Configuration
  5. Loading
  6. Queries
  7. Feedback

1. Motivation

The use case is a mobile client-server application that allows for the exploration of Linked Data based on geographical coordinates. As the application will be user-facing, short response times are of high importance. In this context, queries are expected to yield small result sets, but involve large datasets (such as DBpedia) and possibly several levels of joins.

2. Tested RDF Stores

RDF stores were required to support large datasets such as DBpedia, SPARQL and Named Graphs, and to provide a means of implementing owl:sameAs inference (i.e. a built-in ability or a suitable programming interface). The following stores were selected:

2.1 OpenLink Virtuoso Open-Source Edition 5.0.2

Virtuoso was compiled from source for x64. Import was performed using the JDBC interface; data was loaded using the TTLP_MT command.

The following parameters were modified from the default configuration:

[Database]
MaxCheckpointRemap              = 131072     ; set to 1 GB, as the database size is roughly 4 GB (reference)
 
[Parameters]
NumberOfBuffers                 = 85197      ; 65% of RAM (reference)
MaxDirtyBuffers                 = 63898      ; about 3/4 of buffers (reference)
TransactionAfterImageLimit      = 1500000000 ; required during import due to the size of the infoboxes set (reference)
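
The commented values can be reproduced with a little arithmetic. The following is a minimal sketch, assuming a database page size of 8 KB per buffer and the 1 GB of RAM mentioned under Benchmark Configuration; the helper name pages_for is made up for illustration and is not part of Virtuoso.

```python
# Sketch: derive the buffer settings from the available RAM.
# Assumptions: one buffer/page is 8 KB; the machine has 1 GB of RAM.
PAGE_SIZE = 8 * 1024      # bytes per page (assumed)
RAM_BYTES = 1024 ** 3     # 1 GB of RAM


def pages_for(num_bytes: float) -> int:
    """Express num_bytes in pages, rounded to the nearest whole page."""
    return round(num_bytes / PAGE_SIZE)


number_of_buffers = pages_for(0.65 * RAM_BYTES)      # 65% of RAM
max_dirty_buffers = round(0.75 * number_of_buffers)  # about 3/4 of the buffers
max_checkpoint_remap = pages_for(1024 ** 3)          # 1 GB expressed in pages

print(number_of_buffers, max_dirty_buffers, max_checkpoint_remap)
```

Under these assumptions the three computed values agree with the settings listed in the configuration.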

In an initial release of this benchmark, Virtuoso's performance was far from ideal. OpenLink traced this back to indexes that were inappropriate for this usage scenario, in which queries do not specify a graph. Following suggestions by OpenLink, the configuration was adjusted to include POGS, PSOG and SOPG indexes in addition to the default OGPS index, which shortened query times by a factor of 3 to 45.
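
Why additional indexes help can be illustrated with a simplified model: an index supports a lookup only as far as its leading columns are bound in the triple pattern. The sketch below is not Virtuoso's actual query optimizer; the function names are hypothetical, and the letters stand for subject (S), predicate (P), object (O) and graph (G).

```python
# Simplified model of index selection (an illustration, not Virtuoso's optimizer).
DEFAULT_INDEXES = ["OGPS"]
EXTENDED_INDEXES = ["OGPS", "POGS", "PSOG", "SOPG"]


def usable_prefix(index: str, bound: set) -> int:
    """Count the leading index columns that are bound in the pattern."""
    count = 0
    for column in index:
        if column not in bound:
            break
        count += 1
    return count


def best_index(indexes: list, bound: set) -> str:
    """Pick the index with the longest usable prefix (first one wins on ties)."""
    return max(indexes, key=lambda index: usable_prefix(index, bound))


# Query 6.1 binds only the subject and names no graph.
print(usable_prefix("OGPS", {"S"}))        # default index: no bound leading column
print(best_index(EXTENDED_INDEXES, {"S"}))
```

In this model, a pattern that binds only the subject cannot use any leading column of OGPS, whereas SOPG starts with the bound column.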

2.2 SDB Beta 1

The index layout was tested on PostgreSQL 8.2.5 and MySQL 5.0.45 (x64 versions, default configurations). The hash layout was tested only on PostgreSQL due to performance issues ("Hash loading is very bad on MySQL." - SDB Wiki).

The results obtained for SDB cannot currently be compared to those of Virtuoso, as the underlying databases have not been optimized. Andy Seaborne suggests the use of PostgreSQL's ANALYZE command.

2.3 Sesame 2.0 beta 6

Sesame's good preliminary results and moderate loading times prompted me to explore the effects of supplementary indexes in addition to the default spoc and posc indexes. The following table shows the build times on the full dataset (see the section Queries for query times):

Index   Build Time [s]
opsc    12666.415
ospc    12323.288
psoc     3299.538
sopc      349.508

3. Dataset

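The benchmark dataset consists of DBpedia's infoboxes, geocoordinates and homepages datasets with minor corrections (the files infoboxes-fixed.nt, geocoordinates-fixed.nt and homepages-fixed.nt used in the Loading section).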

4. Benchmark Configuration

The small amount of RAM (1 GB vs. a 4 GB dataset) likely affects the results. Accordingly, the results are meaningful only for comparable configurations.

5. Loading

The RDF stores feature different indexing behaviors: Sesame automatically indexes after each import, while SDB and Virtuoso allow for selective index activation. In order to make load times comparable, the data import was performed as follows:

  1. infoboxes-fixed.nt was imported with indexes initially disabled in SDB and Virtuoso. Indexes were then activated, and the time required for index creation was factored into the import time.
  2. geocoordinates-fixed.nt was imported with indexes enabled.
  3. homepages-fixed.nt was imported with indexes enabled.
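
The accounting used in step 1 can be sketched as follows. This is a minimal illustration, not the harness used for the benchmark; import_data and build_indexes are hypothetical callables standing in for the store-specific import and index-creation commands.

```python
import time
from typing import Callable


def timed(step: Callable[[], None]) -> float:
    """Run step() and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    step()
    return time.perf_counter() - start


def load_time_with_deferred_indexes(import_data: Callable[[], None],
                                    build_indexes: Callable[[], None]) -> float:
    """Import with indexes disabled, then build them; both phases count."""
    import_seconds = timed(import_data)    # bulk import, no indexes yet
    index_seconds = timed(build_indexes)   # index creation afterwards
    return import_seconds + index_seconds
```

For stores that index during import, such as Sesame, the comparable figure is simply timed(import_data).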

5.1 Loading of infoboxes-fixed.nt

[Figure: Loading of infoboxes-fixed.nt]

5.2 Loading of geocoordinates-fixed.nt

[Figure: Loading of geocoordinates-fixed.nt]

5.3 Loading of homepages-fixed.nt

[Figure: Loading of homepages-fixed.nt]

6. Queries

As little data has been prepared for actual use in the application so far, the queries are mostly of a generic nature. They run against the DBpedia infoboxes set and assess performance with varying levels of joins and constraints.

In order to minimize query caching effects, queries were always executed in order after server startup. An exception was Virtuoso, where a noticeable warm-up delay occurred with the initial query. Accordingly, results for query 1 were obtained by restarting the server and warming it up using query 5.
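
The measurement procedure can be sketched roughly as below. This is an assumption-laden illustration rather than the original benchmark code: execute is a hypothetical callable that sends one query to a freshly started store, and restarting the server is left outside the sketch.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


def run_in_order(queries: List[Tuple[str, str]],
                 execute: Callable[[str], object],
                 warm_up: Optional[str] = None) -> Dict[str, float]:
    """Execute each (name, sparql) pair once, in order, and time it.

    If warm_up is given, that query is executed first and is not timed.
    """
    if warm_up is not None:
        execute(warm_up)
    timings: Dict[str, float] = {}
    for name, sparql in queries:
        start = time.perf_counter()
        execute(sparql)
        timings[name] = time.perf_counter() - start
    return timings
```

For the Virtuoso exception described above, the server would be restarted and run_in_order called with only query 1 in queries and query 5 as warm_up.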

6.1 All available information about a specific subject

SELECT ?p ?o WHERE {
  <http://dbpedia.org/resource/Metropolitan_Museum_of_Art> ?p ?o
}
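
How the queries were submitted to each store is not stated here. As one possibility, a store that exposes a SPARQL Protocol endpoint accepts the query in the query parameter of an HTTP GET request. The sketch below only builds such a request URL; the endpoint address is a hypothetical example and must be adapted to the actual installation.

```python
from urllib.parse import urlencode

ENDPOINT = "http://localhost:8890/sparql"  # hypothetical endpoint URL

QUERY = """SELECT ?p ?o WHERE {
  <http://dbpedia.org/resource/Metropolitan_Museum_of_Art> ?p ?o
}"""


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build a SPARQL Protocol GET URL with the query URL-encoded."""
    return endpoint + "?" + urlencode({"query": query})


print(sparql_get_url(ENDPOINT, QUERY))
```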

6.2 Two degrees of separation from Kevin Bacon (?)

PREFIX p: <http://dbpedia.org/property/>

SELECT ?film1 ?actor1 ?film2 ?actor2
WHERE {
?film1 p:starring <http://dbpedia.org/resource/Kevin_Bacon> .
?film1 p:starring ?actor1 .
?film2 p:starring ?actor1 .
?film2 p:starring ?actor2 .
}
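
The four triple patterns amount to a three-way self-join on the p:starring relation. The following sketch evaluates the same join over a tiny in-memory list of (film, actor) pairs; the film and actor names are invented for illustration and are not DBpedia data.

```python
from collections import defaultdict

# Invented (film, actor) pairs standing in for p:starring triples.
STARRING = [
    ("FilmA", "Kevin_Bacon"),
    ("FilmA", "Actor1"),
    ("FilmB", "Actor1"),
    ("FilmB", "Actor2"),
]


def two_degrees(starring, seed):
    """Return (film1, actor1, film2, actor2) rows as the query would bind them."""
    films_by_actor = defaultdict(set)
    actors_by_film = defaultdict(set)
    for film, actor in starring:
        films_by_actor[actor].add(film)
        actors_by_film[film].add(actor)

    rows = []
    for film1 in sorted(films_by_actor[seed]):                # ?film1 starring seed
        for actor1 in sorted(actors_by_film[film1]):          # ?film1 starring ?actor1
            for film2 in sorted(films_by_actor[actor1]):      # ?film2 starring ?actor1
                for actor2 in sorted(actors_by_film[film2]):  # ?film2 starring ?actor2
                    rows.append((film1, actor1, film2, actor2))
    return rows


rows = two_degrees(STARRING, "Kevin_Bacon")
```

Because the query does not require the variables to differ, the result also contains rows in which ?actor1 is Kevin Bacon himself or ?film2 equals ?film1, which inflates the result set.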

6.3 Unconstrained query for artworks, artists, museums and their directors

PREFIX p: <http://dbpedia.org/property/>

SELECT ?artist ?artwork ?museum ?director
WHERE {
?artwork p:artist ?artist .
?artwork p:museum ?museum .
?museum p:director ?director
}

6.4 Homepages of resources roughly in the area of Berlin

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?s ?homepage WHERE {
   <http://dbpedia.org/resource/Berlin> geo:lat ?berlinLat .
   <http://dbpedia.org/resource/Berlin> geo:long ?berlinLong . 
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   ?s foaf:homepage ?homepage .
   FILTER (
     ?lat        <=     ?berlinLat + 0.03190235436 &&
     ?long       >=     ?berlinLong - 0.08679199218 &&
     ?lat        >=     ?berlinLat - 0.03190235436 && 
     ?long       <=     ?berlinLong + 0.08679199218)
}
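
The FILTER expresses a simple bounding-box test around the reference coordinates. As a sketch of the same predicate outside SPARQL, using the offsets from the query above (the function name and the sample coordinates are illustrative assumptions, not values taken from DBpedia):

```python
LAT_DELTA = 0.03190235436    # latitude offset in degrees, from the FILTER
LONG_DELTA = 0.08679199218   # longitude offset in degrees, from the FILTER


def in_bounding_box(lat, long, center_lat, center_long,
                    lat_delta=LAT_DELTA, long_delta=LONG_DELTA):
    """True if (lat, long) lies within the box around the center point."""
    return (center_lat - lat_delta <= lat <= center_lat + lat_delta and
            center_long - long_delta <= long <= center_long + long_delta)


# Illustrative coordinates only.
print(in_bounding_box(52.52, 13.45, center_lat=52.5, center_long=13.4))
print(in_bounding_box(52.60, 13.40, center_lat=52.5, center_long=13.4))
```

Note that the offsets are given in degrees, so the box covers a different east-west distance at different latitudes.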

6.5 Homepages of architects of resources roughly in the area of New York City

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://dbpedia.org/property/>

SELECT ?s ?a ?homepage WHERE {
   <http://dbpedia.org/resource/New_York_City> geo:lat ?nyLat .
   <http://dbpedia.org/resource/New_York_City> geo:long ?nyLong . 
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   ?s p:architect ?a .
   ?a foaf:homepage ?homepage .
   FILTER (
     ?lat        <=     ?nyLat + 0.3190235436 &&
     ?long       >=     ?nyLong - 0.8679199218 &&
     ?lat        >=     ?nyLat - 0.3190235436 && 
     ?long       <=     ?nyLong + 0.8679199218)
}

7. Feedback

Please send comments to Christian Becker.

Further information about our work in the area of the Semantic Web/Web-of-Data can be found at
List of our other open source projects @ Freie Universität Berlin