- Author:
- Chris Bizer (chris@bizer.de)
- Richard Cyganiak (richard@cyganiak.de)
- Version:
- $Id: index.html,v 1.11 2006/07/04 22:10:47 bizer Exp $

This document describes the implementation and internal architecture of the WIQA Engine and outlines the multi-step process that is required to check a set of RDF graphs against a WIQA policy and to generate WIQA explanations.

- 1. Overview
- 2. Package list
- 3. Policy objects
- 4. The parser
- 5. Query execution
- 6. The graph pattern tree
- 7. Evaluating count constraints
- 8. Generating explanations
- 9. Change log

The WIQA engine is a library for **filtering Semantic Web information**.
The input is a set of RDF graphs, each named with a URI. The output is an RDF graph that
contains only those triples from the source graphs that **match a certain
policy**.

In addition to the filtering, the
engine can generate textual **explanation** objects that explain to a user
why a certain triple was accepted into the output graph. The power of the policy language
can be extended by defining custom **Functions** and
**Extensions**.

Policies are defined using the **WIQA-PL policy language.** Its syntax is inspired by the SPARQL query language.
The WIQA engine builds on
**ARQ**, the SPARQL
implementation that is shipped with the Jena
Semantic Web Framework.

This document assumes familiarity with the WIQA engine's public API.

Package | Purpose |
---|---|

de.fuberlin.wiwiss.wiqa | The engine's public API. |

de.fuberlin.wiwiss.wiqa.bindings | Code for dealing with streams of solutions and for grouping solutions by shared variable values. Solutions are
called bindings inside the codebase because the ARQ class com.hp.hpl.jena.query.core.Binding
is used as an implementation of the concept. |

de.fuberlin.wiwiss.wiqa.count | Code for evaluating count constraints. Implements the algorithm described in section 7. |

de.fuberlin.wiwiss.wiqa.engine | The query engine's core, especially those parts that hook into Jena's ARQ query processor. |

de.fuberlin.wiwiss.wiqa.example | Some usage examples. |

de.fuberlin.wiwiss.wiqa.explanation | Code for generating explanations. Implements the algorithm described in section 8. |

de.fuberlin.wiwiss.wiqa.gp | Code for arranging graph patterns into a tree. Implements the algorithm described in section 6. |

de.fuberlin.wiwiss.wiqa.extensions | Contains various functions and extensions that add capabilities to the engine, such as a complex trust filtering metric. |

de.fuberlin.wiwiss.wiqa.parser | Hand-authored support code for the JavaCC-generated WIQA-PL parser. |

de.fuberlin.wiwiss.wiqa.parser.javacc | JavaCC-generated parser for WIQA policies. |

de.fuberlin.wiwiss.wiqa.vocab | RDF vocabularies used by WIQA. |

A WIQA policy is a set of patterns and constraints. Policies operate on an input dataset (described in the API overview). Input triples that match the patterns and constraints are accepted into an output RDF graph. Policies are written using the WIQA-PL language, whose features are described in the WIQA-PL Language Specification.

The following example policy has five graph patterns, several explanation templates, and a count constraint. Namespace prefix declarations have been omitted.

NAME "Asserted by analysts with at least 3 positive ratings" DESCRIPTION "Only accept information that has been asserted by analysts who have received at least 3 positive ratings." PATTERNS { GRAPH fd:GraphFromAggregator { EXPL "it was asserted by " ?authority " and " . ?GRAPH swp:assertedBy ?warrant . ?warrant swp:authority ?authority . } GRAPH ?graph2 { ?authority rdf:type fin:Analyst . } GRAPH fd:GraphFromAggregator { EXPL ?authority2 " claims that " ?authority " is an analyst." . ?graph2 swp:assertedBy ?warrant2 . ?warrant2 swp:authority ?authority2 . } GRAPH ANY { EXPL ?authority "has received positive ratings from" . ?rater fin:positiveRating ?authority . FILTER (wiqa:count(?rater) >= 3) . } GRAPH fd:BackgroundInformation { EXPL ?rater "who works for" ?company . ?rater fin:affiliation ?company . } }

These policies are parsed into a cluster of Java objects centered around the Policy class.

Policies consist of several parts:

- Name and description
- These are informatory text.
- Graph patterns
- A graph pattern is a set of triple patterns that is matched against certain named graphs
from the input dataset. It may also contain additional constraint expressions
(
`FILTER`clauses). In policy objects, each graph patterns is represented by an ARQ ElementNamedGraph instance. - Graph pattern tree
- In WIQA-PL, graph patterns appear as a simple list. But for some steps of query processing, it is necessary to order them into a tree of graph patterns. This is described in more detail in section 6.
- Root constraint expressions
- Constraint expressions (
`FILTER`clauses) can occur not just as part of graph patterns, but also on the root level of the policy. - Extensions
- In WIQA-PL, extension calls look just like normal
`FILTER`expressions, but the engine treats them differently because they are evaluated at a different stage of query execution. In policy objects, they are represented as instances of ARQ's Extension interface. - Count constraints
- Just as extension calls, Count constraints are evaluated at a stage different
from usual
`FILTER`conditions. In policy objects, they are represented as instances of the CountConstraint class. Their evaluation is described in section 7. - Explanation templates
- Explanations are generated from instances of the ExplanationTemplate class. The templates are attached to nodes of the graph pattern tree. Section 8 describes explanations in more detail.

The parser's job is to turn a policy, written down in WIQA-PL, into a Policy object. Application programmers access it through the PolicyParser facade, the full implementation is in the de.fuberlin.wiwiss.wiqa.parser package.

The parser is derived from ARQ's SPARQL parser
and built using the JavaCC
parser generator. Most of the parser code (the javacc subpackage) is automatically
generated from a grammar definition file by JavaCC. The grammar file
can be found in the distribution as `/grammar/wiqa.jj`. It was
created by adapting ARQ's `sparql.jj`. Most
low-level grammar structures, such as triple patterns or `FILTER`
expressions, have been re-used without change. But the grammar's higher
levels have been replaced with grammar rules for WIQA-PL.

While the parser reads a WIQA-PL file, it adds the various parts described in the previous section to a Policy object using its setter methods.

One syntactic construct in WIQA-PL is overloaded with three different features: the extension function syntax.

FILTER ... ext:someQName(?foo, ?bar) ...

When this “QName or URI followed by arguments in brackets” syntax is encountered in a policy, it could be

- an extension function, like in SPARQL, or
- an Extension, which is another mechanism for adding new functionality to WIQA-PL, and is evaluated in a different phase of query execution, or
- a count constraint.

These features can be distinguished only by the URI. Count
constraints use the well-known URI `wiqa:count`, and
all other URIs must be checked against the lists of registered functions
and extensions to see which kind it represents.

Distinguishing between these three types is somewhat
difficult in the token/rule framework provided by JavaCC. Like most parsers,
a JavaCC-generated parser operates in two stages: First, the input string is split into
tokens (e.g. keywords, QNames, URIs, literals, punctuation). Second, the
stream of tokens is matched against the grammar rules. To cope with the
three function types, the WIQA parser has another phase between these two: The token
stream is analysed, and URI or QName tokens whose URI is `wiqa:count`
or a registered extension are replaced by special “count” or
“extension” tokens. This happens in
CustomTokenManager.
The grammar rules for count constraints and extensions are defined using
these tokens.

This section describes what happens behind the scenes when the output graph is queried.

AcceptedGraph
is an implementation of Jena's
Graph
interface. Like any RDF graph, it is a **set of triples**.
AcceptedGraph is **read-only**, it doesn't implement methods
for adding and removing triples. All the read methods (find, contains, size, query) are implemented in terms of a simple **match of a triple pattern** against the graph's triples. This match is implemented in the method AcceptedGraph.graphBaseFind(tripleMatch). The basic procedure is as follows:

**Create an ARQ Query object**out of the policy's graph patterns and extension calls. ARQ query objects are abstract representations of RDF queries. This happens in the Policy.toARQ() method.**Pre-bind the**if the subject, predicate or object of the triple pattern were concrete RDF nodes and not wildcards. This happens in AcceptedGraph.graphBaseFind.`?SUBJ`,`?PRED`and`?OBJ`query variables**Let ARQ execute the query against the input dataset.**This automatically executes any calls to Functions (because they are evaluated when ARQ evaluates`FILTER`expressions in the graph patterns) and to Extensions (because they were explicitly added to the Query object in step 1). The result is an iterator of possible solutions to the query. WIQA has to hook into some internals of ARQ's query engine in order to return explanations from functions and extensions.**Evaluate all count constraints**by going through the solutions and counting the number of different values of the constrained variables. Groups of solutions that don't fulfil the constraint are discarded. This is described in more detail in section 7.**Generate explanations**by running the filtered solution stream through a QueryIterExplain, the most important class of the`explanation`package. Explanations are stored in an ExplanationCache, which is held by the AcceptedGraph. More details in section 8.**Deliver result triples**by taking only the`?SUBJ`,`?PRED`and`?OBJ`variables from the remaining solutions. An iterator over triples created from these variables is returned as the result of the initial AcceptedGraph.graphBaseFind call. This happens in TripleIterator, a private class inside AcceptedGraph.

The rest of this document provides more detail about steps 3, 4 and 5.

Almost all policies have at least one graph pattern. In WIQA-PL, graph patterns appear as a simple list. But for some steps of query processing, it is necessary to order this list into a tree.

Each node in the tree is a tuple `(gp, parent, v)` where
`gp` is a graph pattern, `parent` is the parent node,
and `v` is the node's *shared variable* —
a variable that is shared with the parent node's graph pattern.

The graph pattern of the tree's root is `rootpattern`,
the implicit graph pattern

?GRAPH { ?SUBJ ?PRED ?OBJ . }

that is part of every policy. The full tree is
constructed recursively by the function `buildTree(GP)`,
which takes the set of all graph patterns in the policy and
produces a set of nodes interconnected through references to
their parents.

buildTree(GP) input: GP, the set of graph patterns to be arranged as a tree return buildNodes(rootpattern,undef,undef, GP) buildNodes(gp, parent, v, GP) input: gp, a graph pattern input: parent, a node in the graph parent tree input: v, a variable shared with the graph pattern of parent input: GP, the set of graph patterns to be arranged as a tree n = (gp, parent, v) nodes = {n} for each graph pattern gp_{i}in GP for each variable v_{j}contained in gp_{i}if v_{j}is contained in gp and v_{j}is not theshared variableof any node in ancestors(n) U {n} then nodes = nodes U buildNodes(gp_{i}, n, v_{j}, GP - {gp_{i}}) return nodes ancestors(n) input: n = (parent, gp, v), a node in the graph pattern tree return {} if parent isundefreturn ancestors(parent) U {parent}

The algorithm is implemented in the TreeBuilder class.

This tree is used in two parts of WIQA query evaluation. In count constraints, it is used to separate solutions into groups, and the counting happens within each of the groups. And when explanations are generated, the tree structure of the explanation's parts mirrors the graph pattern tree.

This section describes the algorithm used for evaluating count constraints.

If a policy `P` is in use, and `B` is the solution set
produced by a request, and the filter condition

FILTER (wiqa:count(?x) > 3)

is encountered in graph pattern `gp`, then the filter result is
determined by the function `count(B, ?x, > 3, gp, graph pattern tree of P)`.
`gp` is *undef* if the count constraint is located on
the root.

A *solution* is a partial function from variable names to RDF terms.

count(B, v_{c}, condition, gp, t) input: B, a list of solutions input: v_{c}, the variable whose values are to be counted input: condition, a function from integers to {true, false} input: gp, the graph pattern containing the count constraint input: t, the policy's graph pattern tree result = B if gp =undefthen result = groupAndCount(result, v_{c}, condition, {?SUBJ, ?PRED, ?OBJ}) else do a depth-first walk of t; for each node n if gp = the graph pattern of n then V_{p}= the set ofshared variablesof n and all its ancestors result = groupAndCount(result, v_{c}, condition, {?SUBJ, ?PRED, ?OBJ} U V_{p})

The function `groupAndCount` sorts solutions into buckets, where
each bucket contains only solutions with identical values for the
some set of group variables `V _{group}`. This is similar
to SQL's

groupAndCount(B, v_{c}, condition, V_{group}) input: B, a list of solutions input: v_{c}, the variable whose values are to be counted input: condition, a function from integers to {true, false} input: V_{group}, a set of variables by which solutions are grouped prior to counting result = {} for each groupdef in { groupdef = projection(b, V_{group}) and b in B } group = { b in B : groupdef is a subset of b } if condition( |group| ) is true result = result U group

The `projection` of a solution on a set of variables V is a new solution that retains only values for variables in V.

projection(b, V) input: b, a solution input: V, a set of variables result = the sub-solution p of b that has p(v) = b(v) for all v in V, and p(v) =undeffor all other variables

These functions are implemented in the
`count` package,
the most important class is QueryIterCount.

This section describes how explanations for an accepted RDF triple are generated from the set of solutions.

An explanation is a set of explanation parts. Each of these is a tuple `(text, parts)`,
where `text` is a list of RDF nodes (usually literals), and `parts` is a
set of explanation parts.

Explanations are generated from explanation templates. An explanation template is a list of RDF nodes and variables. Variables in the template are replaced with actual RDF nodes from a solution when the template is instantiated.

The explanation for an RDF triple `rt = (s, p, o)` is generated by the
function `explain(rt, B, root, tree)`, where `B` is the solution set,
`tree` is a graph pattern tree as returned by `buildTree`, and `root`
is its root node.

explain(rt, B, n, tree) input: rt = (subj, pred, obj), an RDF triple input: S, a solution set input: n, a node in the graph pattern tree input: tree, a graph pattern tree S' = { s in S where s(?SUBJ) = subj & s(?PRED) = pred & s(?OBJ) = obj } return makeParts(S', root(tree), tree) makeParts(S, n, tree) input: S, a solution set input: n = (gp, parent, v), a node in the graph pattern tree input: tree, a graph pattern tree parts = {} if gp has an explanation template tpl V_{tpl}= set of all variables in tpl valuesets = {} for each s in S valuesets = valuesets U { projection(s, V_{tpl}) } for each valueset in valuesets text = instantiate(tpl, valueset) solutiongroup = { s in S where valueset subset of s } parts = {} for each c in children(n, tree) childparts = childparts U makeParts(solutiongroup, c, tree) parts = parts U { (text, childparts) } else for each c in children(n, tree) parts = parts U makeParts(S, c, tree) return parts instantiate(tpl, s) input: tpl = <r_{1}, ..., r_{n}>, a list of RDF nodes or variables input: s, a solution return <r_{1}', ..., r_{n}'>, where r_{i}' = s(r_{i}), if r_{i}is a variable r_{i}' = r_{i}, otherwise

$Log: index.html,v $ Revision 1.10 2006/06/26 21:58:17 cyganiak Changed iqa: prefix to wiqa: throughout Revision 1.9 2006/06/26 21:42:06 cyganiak Terminology: binding -> solution Revision 1.8 2006/06/26 20:47:49 cyganiak Small improvement to tree building algorithm Revision 1.7 2006/06/21 14:16:11 cyganiak Improved formal algorithms Revision 1.6 2006/06/14 11:59:00 cyganiak Terminology: parent link variable -> shared variable Revision 1.5 2006/06/14 10:19:17 cyganiak Added abstract, small editorial changes Revision 1.4 2006/06/13 15:24:59 cyganiak First complete version. Revision 1.3 2006/06/12 17:56:42 cyganiak - Added UML diagram of Policy and friends - Various revisions to the text Revision 1.2 2006/05/03 11:58:20 cyganiak Added section on graph pattern trees to engine implementation doc Revision 1.1 2006/05/02 13:14:55 cyganiak Half-finished draft of engine architecture/implementation docs.