The main steps of the Linked Data publication and consumption process are:

  1. Modeling
  2. Publishing
  3. Discovery
    1. Crawling
    2. Searching
    3. Browsing
    4. Extracting
  4. Consolidation
    1. Vocabulary Mapping
    2. Identity Resolution
  5. Application
    1. Exploration
    2. Integration

In the following, we describe these steps in detail, after first giving definitions for the main keywords.

Definitions

In the following, we give some definitions in order to clarify the role of the different parts and components of the process.

Linked Data

A data set is considered to be "Linked Data" if the statements it contains are expressed according to the Linked Data publishing principles. For the present document, we restrict this definition further to focus on data that is freely available on the Web.

Publication

The publication of Linked Data consists in making a given data set available on the Web. The publication involves converting the data from its original format and performing the necessary consolidation work to ensure compliance with the definition of Linked Data.

Consumption

The consumption of Linked Data is the process of acquiring a subset of the Linked Data available on the Web in order to fulfil a specific goal (data visualisation, data aggregation, …). It consists in accessing the data and aggregating or integrating it, which allows for displaying the data in different ways. The consumption of Linked Data may also involve the usage of data which is not Linked Data itself (for instance, a local database).

1 Modeling

On the Web of Data, the relationships between resources are expressed by concepts defined in vocabularies (also called "ontologies"). The decentralised publication model of Linked Data allows for re-using the same vocabularies across different data sets. When publishing data, one may consider using existing concepts from published ontologies or defining a new set of concepts (and publishing it). Vocabularies are typically created to fit a specific modeling problem. For instance, GoodRelations is used to express business-related relations (vendor, price, …), FOAF contains concepts about social networks (knows, familyName, …) and Dublin Core provides concepts that can be used for documents. Some of the existing vocabularies can be found at http://semanticweb.org/wiki/Ontology; Swoogle provides a search engine to find more of them.
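
As an illustration, the following sketch (written in Python with the rdflib library; all names and URIs are illustrative assumptions) models a person with FOAF and a document with Dublin Core, re-using published vocabularies instead of inventing new terms:

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DC, FOAF, RDF

    g = Graph()
    person = URIRef("http://example.org/people/alice")
    report = URIRef("http://example.org/docs/report-2011")

    # Re-used terms: foaf:Person / foaf:name for the person,
    # dc:title / dc:creator for the document
    g.add((person, RDF.type, FOAF.Person))
    g.add((person, FOAF.name, Literal("Alice Example")))
    g.add((report, DC.title, Literal("An example report")))
    g.add((report, DC.creator, person))  # an RDF link between the two resources

    print(g.serialize(format="turtle"))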

If no vocabulary matches a specific modeling use case, a new vocabulary needs to be created and published. Neologism is a vocabulary editing and publishing platform for the Web of Data, with a focus on ease of use and compatibility with the Linked Data principles. Neologism makes it possible to create classes and properties in an easy, fast and standards-compliant way.

Author(s):
Guido Cecilio, Stéphane Corlosquet, Richard Cyganiak 
Website:
License:
User Interface:
Graphic / Drupal Module 

2 Publishing

Publishing data as Linked Data on the Web of Data enables the integration of different data sources. It allows for displaying and querying different data sources and, furthermore, for integrating data describing the same entity.

When publishing Linked Data on the Web, data is represented using the Resource Description Framework (RDF).

The Web of Linked Data is built upon two simple ideas: structured data is published on the Web using dereferencable HTTP URIs to represent data items, and related data items are connected using RDF links.

It is desirable to publish not only the data but also its schema as Linked Data. Linked Data applications can then, for example, customize their views on the data.

The format of the original data that is to be published as Linked Data determines which publication steps to take.

If the original data is available in a relational database, we recommend using D2R Server to publish the data along with its schema as Linked Data. D2R Server offers HTML, Linked Data and SPARQL interfaces to the published data. It is written in Java and licensed under the Apache License V2.0.

Author(s):
Chris Bizer, Richard Cyganiak 
Website:
License:
User Interface:
Command Line 
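
Independently of D2R Server, the idea behind relational-to-RDF mapping can be sketched in a few lines of Python (using rdflib and the standard sqlite3 module; the table layout, file name and URIs are illustrative assumptions):

    import sqlite3
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import FOAF, RDF

    g = Graph()
    conn = sqlite3.connect("example.db")
    for person_id, name in conn.execute("SELECT id, name FROM people"):
        # Each row becomes a resource; each column value becomes a property
        person = URIRef("http://example.org/people/%s" % person_id)
        g.add((person, RDF.type, FOAF.Person))
        g.add((person, FOAF.name, Literal(name)))

    print(g.serialize(format="turtle"))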

If the original data is available in any other structured format, it has to be converted to RDF using a converter.

For CSV and Excel files, we recommend using the RDF Extension for Google Refine. It adds a graphical user interface for exporting the data of Google Refine projects in RDF format. The extension is available under the BSD license.

Author(s):
Fadi Maali, Richard Cyganiak 
Website:
License:
User Interface:
Graphic 

For XML files we recommend using XSLT to transform them into RDF.
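
The transformation itself depends entirely on the input schema. As a minimal sketch (applied here from Python with the lxml library; the catalog-of-books input structure is a hypothetical example), a stylesheet maps XML elements onto RDF/XML:

    import lxml.etree as ET

    # A stylesheet turning <catalog><book id="..."><title>...</title></book>…
    # into RDF/XML; adapt the match and select expressions to your own XML.
    XSLT = b"""<xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <xsl:template match="/catalog">
        <rdf:RDF>
          <xsl:for-each select="book">
            <rdf:Description rdf:about="http://example.org/book/{@id}">
              <dc:title><xsl:value-of select="title"/></dc:title>
            </rdf:Description>
          </xsl:for-each>
        </rdf:RDF>
      </xsl:template>
    </xsl:stylesheet>"""

    transform = ET.XSLT(ET.fromstring(XSLT))
    print(str(transform(ET.parse("books.xml"))))  # prints the RDF/XML output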

Data in any other format has to be converted using your own converter. You can also check the converter list at the W3C Semantic Web wiki.

Once you have converted your data to RDF, you have two options for publishing it on the Web as Linked Data.

You can serve the RDF file(s) using any web server. You will have to enable URL rewriting to make the Linked Data URIs in your data set dereferencable. We recommend this method for publishing small data sets or vocabularies.
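
The following sketch (a Flask application in Python; routes and file names are illustrative assumptions) shows the content negotiation that such URL rewriting typically implements: requests for a resource URI are answered with a 303 redirect to an RDF or HTML document, depending on the client's Accept header.

    from flask import Flask, redirect, request, send_file

    app = Flask(__name__)

    @app.route("/resource/<name>")
    def resource(name):
        # 303-redirect the resource URI to a data or HTML document,
        # depending on what the client asks for
        accept = request.headers.get("Accept", "")
        if "text/turtle" in accept or "application/rdf+xml" in accept:
            return redirect("/data/%s" % name, code=303)
        return redirect("/page/%s" % name, code=303)

    @app.route("/data/<name>")
    def data(name):
        # Serve the RDF description of the resource
        return send_file("rdf/%s.ttl" % name, mimetype="text/turtle")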

For bigger data sets, we recommend setting up a Triple store. Triple stores allow for storing Linked Data and providing it through different interfaces. A SPARQL interface makes the data accessible for querying; ideally, a Linked Data interface and an HTML interface are offered as well.
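
The following sketch (using the Python SPARQLWrapper library; the endpoint URL and query are illustrative assumptions) shows how the SPARQL interface of a Triple store is typically queried:

    from SPARQLWrapper import JSON, SPARQLWrapper

    sparql = SPARQLWrapper("http://example.org/sparql")  # hypothetical endpoint
    sparql.setQuery("""
        SELECT ?s ?label
        WHERE { ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label }
        LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["s"]["value"], row["label"]["value"])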

A list of Triple stores is available at the W3C Semantic Web wiki. The Berlin SPARQL Benchmark (BSBM) compares the performance of Triple stores that expose SPARQL endpoints; the benchmark results are available online.

For any Triple store that does not offer Linked Data and HTML interfaces itself, we recommend Pubby. Pubby makes it easy to turn a SPARQL endpoint into a Linked Data server. It is implemented as a Java web application and available under the Apache License V2.0.

Author(s):
Richard Cyganiak, Chris Bizer 
Website:
License:
User Interface:
Command Line 

3 Discovery

The Web of Data is essentially a decentralised publication system for data, just as the Web of Documents is a decentralised publication system for documents. As with the Web of Documents, it is not possible to get a global view of the Web of Data, and finding something in particular in it can turn into a needle-in-a-haystack problem. The discovery of data in the Web of Data can be done through essentially four different means: crawling the Web, using a search engine, browsing an index, or looking for data embedded in Web documents.

3.1 Crawling

Linked Data crawlers follow RDF links from a given set of seed URIs and store the retrieved data either in an RDF store or as local files. This approach is particularly useful for data that is not already available through SPARQL endpoints or RDF dumps.
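
As a rough illustration of what such a crawler does (a naive Python sketch using rdflib, not a substitute for a real crawler; the seed URI and limit are illustrative), it dereferences URIs, parses the returned RDF, and follows the object URIs it finds:

    from collections import deque
    from rdflib import Graph, URIRef

    def crawl(seeds, limit=50):
        store, queue, seen = Graph(), deque(seeds), set(seeds)
        while queue and len(seen) < limit:
            uri = queue.popleft()
            try:
                g = Graph()
                g.parse(uri)          # dereference the URI and parse the RDF
            except Exception:
                continue              # unreachable or non-RDF resource
            store += g
            for _, _, o in g:         # follow RDF links to new resources
                if isinstance(o, URIRef) and o not in seen:
                    seen.add(o)
                    queue.append(o)
        return store

    data = crawl(["http://dbpedia.org/resource/Berlin"])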

LDSpider is a web crawling framework for the Web of Data. It can be used through a command line application as well as through a flexible API for use within another application.

Author(s):
Andreas Harth, Jürgen Umbrich, Aidan Hogan, Robert Isele 
Website:
License:
User Interface:
Command Line 

3.2 Searching

Sindice is a state-of-the-art infrastructure to process, consolidate and query the Web of Data. Its web site provides a search engine that returns RDF documents matching the keyword(s) provided. To make this possible, Sindice uses crawlers that browse the Web of Data, storing and consolidating the data they find.

Author(s):
Giovanni Tummarello, Tamas Benko, Stephane Campinas, Richard Cyganiak, Szymon Danielczyk, Renaud Delbru, Robert Fuller, Michael Hausenblas, Michele Mostarda, Stephen Mulcahy, Davide Palmisano 
Website:
License:
 
User Interface:
Web-based 

3.3 Browsing

Manually curated indexes are available for the Web of Data; they contain extra information about the content of the data sets, such as the vocabularies used or the number of triples. These indexes offer a more directed approach to finding data sets than crawling or searching, allowing one to look for data sets covering a specific topic. The index for the LOD Cloud on CKAN currently (as of February 2011) lists 203 data sets.

3.4 Extracting

Syntactic extensions have been defined to embed structured data into the HTML that is used to create Web documents. Two main standards, Microformats and RDFa, enable a Web page author to add metadata describing the content of the document. The tool Any23 finds such embedded data and extracts it as RDF.

Author(s):
Michele Catasta, Richard Cyganiak, Michele Mostarda, Davide Palmisano, Gabriele Renzi, Jürgen Umbrich 
Website:
License:
User Interface:
Plugin / Web Service / Command Line 
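
Any23 itself is a Java tool. As a rough Python analogue (using the extruct library, an assumption not mentioned above; the URL is illustrative), embedded RDFa and Microformats can be extracted like this:

    import urllib.request

    import extruct

    url = "http://example.org/page-with-rdfa.html"  # illustrative URL
    html = urllib.request.urlopen(url).read().decode("utf-8")

    # Returns a dict keyed by syntax, e.g. data["rdfa"], data["microformat"]
    data = extruct.extract(html, base_url=url, syntaxes=["rdfa", "microformat"])
    print(data["rdfa"])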

4 Consolidation

Once a data set is published as Linked Data, its value can be increased in different ways.

If the data set's vocabulary defines new concepts or relationships, it should be mapped to existing vocabularies.

It is also desirable to publish links to data sets that are related to the newly published data set, by applying Identity Resolution methods.

4.1 Vocabulary Mapping

Linked Data sources often use different vocabularies to represent data about the same type of entity. In order to provide their users with an integrated view of the data, Linked Data applications may translate data from different vocabularies into the application's target schema.
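
Before turning to R2R below, the core idea of vocabulary mapping can be illustrated with a plain SPARQL CONSTRUCT query (run here via rdflib in Python; the source vocabulary http://example.org/src# and the file name are hypothetical):

    from rdflib import Graph

    g = Graph()
    g.parse("source.ttl")  # data using the source vocabulary (an assumption)

    # Translate src:fullName into foaf:name and src:contact into foaf:knows
    mapped = g.query("""
        PREFIX src:  <http://example.org/src#>
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        CONSTRUCT { ?p foaf:name ?n . ?p foaf:knows ?q . }
        WHERE     { ?p src:fullName ?n . OPTIONAL { ?p src:contact ?q } }
    """).graph

    print(mapped.serialize(format="turtle"))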

The R2R Framework enables Linked Data applications that discover data represented in unknown terms to search the Web for mappings and to apply the discovered mappings to translate the data into the application's target vocabulary. R2R provides the R2R Mapping Language for publishing fine-grained term mappings on the Web.

Author(s):
Andreas Schultz, Chris Bizer 
Website:
License:
User Interface:
Command Line 

4.2 Identity Resolution

Linked Data sources can overlap thematically; thus, the same entity may be described in different data sets, either in different detail or from different points of view. To integrate this data, identity resolution tools are needed: they identify duplicate entities and interlink them with owl:sameAs links. It is also desirable to link entities that are connected in the real world, e.g. a book to its author. Linking approaches can be manual, semi-automatic or automatic.
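
As a naive illustration of the idea (the Silk framework described below offers far more expressive linkage rules), the following Python sketch compares entities from two data sets by string similarity and emits owl:sameAs links above a threshold (file names and threshold are illustrative assumptions):

    from difflib import SequenceMatcher

    from rdflib import Graph
    from rdflib.namespace import FOAF, OWL

    a, b = Graph(), Graph()
    a.parse("dataset_a.ttl")  # both file names are illustrative assumptions
    b.parse("dataset_b.ttl")

    links = Graph()
    for s1, _, name1 in a.triples((None, FOAF.name, None)):
        for s2, _, name2 in b.triples((None, FOAF.name, None)):
            # Naive string similarity; Silk supports far richer comparisons
            if SequenceMatcher(None, str(name1), str(name2)).ratio() > 0.9:
                links.add((s1, OWL.sameAs, s2))

    print(links.serialize(format="turtle"))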

The Silk Link Discovery Framework is a tool for discovering relationships between data items within different Linked Data sources. Data publishers can use Silk to set RDF links from their data sources to other data sources on the Web. The framework is implemented in Scala and is available under the terms of the Apache Software License.

Silk is provided in three different editions which address different use cases. Silk Single Machine is used to generate RDF links on a single machine. Silk MapReduce is used to generate RDF links between data sets using a cluster of multiple machines. Silk Server can be used as an identity resolution component within applications that consume Linked Data from the Web. Silk Server provides an HTTP API for matching instances from an incoming stream of RDF data while keeping track of known entities.

Author(s):
Robert Isele, Anja Jentzsch, Chris Bizer, Julius Volz 
Website:
License:
User Interface:
Command Line 

5 Application

There are many ways to make use of the data published on the Web of Data. The publication model facilitates data integration, as all the data is published in the same format and uses common identifiers and vocabularies. We will highlight two use cases, focused on exploring the content of data sets and on enriching a Web site.

5.1 Exploration

Publication and storage tools such as D2R Server often offer HTML interfaces to Linked Data, allowing one to explore the content of a particular data set directly from a Web browser. Most data-centric visualization tools help validate and display the data in a textual way, e.g. as tables, but more complex information can be sought, such as the presence of other data related to a particular entity or the relation between two entities. Sig.ma and RelFinder address these two use cases, respectively. Sig.ma leverages the Sindice search engine to look for data about a particular resource; all the data found is aggregated and can be filtered by data source. RelFinder looks for and displays relations between two or more resources on the Web of Data. In addition to these two tools, the Sindice Web Data Inspector can be used to visualize and validate RDF files, HTML pages embedding Microformats, and XHTML pages embedding RDFa, thereby providing a means to verify the data being published.

Author(s):
Michele Catasta, Richard Cyganiak, Szymon Danielczyk, Giovanni Tummarello 
Website:
License:
User Interface:
Web-based 
Author(s):
Philipp Heim, Steffen Lohmann, Timo Stegemann 
Website:
License:
User Interface:
Graphical 

5.2 Integration

The data published on the Web of Data can be integrated into Web sites to enrich the information they provide. OntoWiki and SPARQL Views are two examples of such integration tools. OntoWiki is a Semantic Data Wiki enabling the collaborative creation and publication of RDF knowledge bases as Linked Data. SPARQL Views is a Drupal query plugin that allows querying RDF data in SPARQL endpoints and RDFa on Web pages, bringing the data into Drupal Views, where it can then be displayed and formatted.

Author(s):
Sebastian Tramp, Sören Auer, Philipp Frischmuth, Norman Heino 
Website:
License:
User Interface:
Web-based, Command line 
Author(s):
Lin Clark 
Website:
License:
 
User Interface:
Graphic