Silk - The Linked Data Integration Framework

Robert Isele (eccenca GmbH)
Anja Jentzsch (Hasso Plattner Institut)
Christian Bizer (University of Mannheim)
Julius Volz (Google)
Petar Petrovski (University of Mannheim)

Silk is an open source framework for integrating heterogeneous data sources. The primary use cases of Silk include:

Generating links between related data items within different Linked Data sources.
Linked Data publishers can use Silk to set RDF links from their data sources to other data sources on the Web.
Applying data transformations to structured data sources.

Silk is based on the Linked Data paradigm, which is built on two simple ideas: First, RDF provides an expressive data model for representing structured information. Second, RDF links are set between entities in different data sources. Background information about Linked Data and the vision of the Web of Data can be found in the overview article Linked Data - The Story So Far and the Linked Data book.

Linking Data Sources

Using the declarative Silk - Link Specification Language (Silk-LSL), developers can specify which types of RDF links should be discovered between data sources as well as which conditions data items must fulfill in order to be interlinked. These link conditions may combine various similarity metrics and can take the graph around a data item into account, which is addressed using an RDF path language. Silk accesses the data sources that should be interlinked via the SPARQL protocol and can thus be used against local as well as remote SPARQL endpoints. Link Specifications can be created using the Silk Workbench graphical user interface or manually in XML.
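The idea of a link condition that combines several similarity metrics can be sketched in a few lines of Python. This is only an illustration of the concept, not Silk-LSL and not Silk's implementation: the function names, the property names (label, population), the use of the minimum as aggregation and the threshold of 0.8 are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in for a string metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def numeric_similarity(a, b, max_distance):
    """1.0 for equal numbers, falling linearly to 0.0 at max_distance."""
    return max(0.0, 1.0 - abs(a - b) / max_distance)


def link_confidence(source, target):
    """Hypothetical link condition: both metrics must agree, so take the minimum."""
    label_sim = string_similarity(source["label"], target["label"])
    pop_sim = numeric_similarity(
        source["population"], target["population"], max_distance=100_000
    )
    return min(label_sim, pop_sim)


def generate_links(sources, targets, threshold=0.8):
    """Emit an owl:sameAs triple for every pair whose confidence reaches the threshold."""
    links = []
    for s in sources:
        for t in targets:
            if link_confidence(s, t) >= threshold:
                links.append((s["uri"], "owl:sameAs", t["uri"]))
    return links


sources = [{"uri": "http://example.org/a/Berlin", "label": "Berlin", "population": 3_500_000}]
targets = [
    {"uri": "http://example.org/b/berlin", "label": "berlin", "population": 3_510_000},
    {"uri": "http://example.org/b/paris", "label": "Paris", "population": 2_100_000},
]
links = generate_links(sources, targets)
```

In this sketch, only the two Berlin entries are linked, because the Paris entry fails both the label and the population comparison.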

The linking process is based on the Silk Link Discovery Engine which offers the following features:

Flexible, declarative language for specifying linkage rules
Support of RDF link generation (owl:sameAs links as well as other types)
Employment in distributed environments (by accessing local and remote SPARQL endpoints)
Usable in situations where terms from different vocabularies are mixed and where no consistent RDFS or OWL schemata exist
Scalability and high performance through efficient data handling (speedup factor of 20 compared to Silk 0.2):
- Reduction of network load by caching and reusing of SPARQL result sets
- Multi-threaded computation of the data item comparisons (3 million comparisons per minute on a Core2 Duo)
- Optional blocking of data items
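The effect of blocking and multi-threaded comparison can be illustrated with a small Python sketch. It uses a deliberately simple key-based blocking scheme (the first two characters of a label) and a thread pool. The blocking key, the function names and the exact-match comparison are assumptions for illustration and do not reflect how Silk implements these features.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def blocking_key(label):
    """Hypothetical blocking key: first two characters of the normalized label."""
    return label.strip().lower()[:2]


def build_blocks(items):
    """Group data items by blocking key."""
    blocks = defaultdict(list)
    for item in items:
        blocks[blocking_key(item["label"])].append(item)
    return blocks


def candidate_pairs(sources, targets):
    """Pair each source item only with target items from the same block."""
    target_blocks = build_blocks(targets)
    pairs = []
    for s in sources:
        for t in target_blocks.get(blocking_key(s["label"]), []):
            pairs.append((s, t))
    return pairs


def compare(pair):
    """Placeholder comparison: case-insensitive label equality."""
    s, t = pair
    return (s["uri"], t["uri"], s["label"].lower() == t["label"].lower())


def compare_all(sources, targets, workers=4):
    """Run the comparisons of all candidate pairs on a pool of worker threads."""
    pairs = candidate_pairs(sources, targets)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compare, pairs))


sources = [{"uri": "a:1", "label": "Berlin"}, {"uri": "a:2", "label": "Paris"}]
targets = [
    {"uri": "b:1", "label": "berlin"},
    {"uri": "b:2", "label": "Bern"},
    {"uri": "b:3", "label": "Paris"},
]
results = compare_all(sources, targets)
```

With these inputs, blocking reduces the work from six comparisons (the full cross product) to three candidate pairs. The trade-off is that a pair whose items fall into different blocks is never compared, so the choice of blocking key affects recall.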

Data Transformations

While the main part of an integration workflow lies in the interlinking of data sources, data sets coming from different sources sometimes require the harmonization of their schemata and data formats prior to interlinking. For this purpose, Silk enables the user to create and execute lightweight transformation rules. Transformation rules may be used for:

Data cleaning, e.g., removing unwanted values
Mapping between different properties or adding new properties with generated values.
Converting between different data formats. Data may be read from sources such as RDF, CSV or XML. Typically the output is written to an RDF store which can be queried using SPARQL, but data can also be written as CSV files to be imported into relational databases or opened in Excel.
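The three kinds of transformation listed above can be sketched together in Python. This is a conceptual illustration only, not Silk's transformation rule syntax: the property map, the list of unwanted values, the ex: property names and the helper functions are hypothetical.

```python
import csv
import io

# Hypothetical mapping from source property names to target properties.
PROPERTY_MAP = {"name": "rdfs:label", "country": "ex:country"}
# Hypothetical placeholder values that should be removed during cleaning.
UNWANTED = {"N/A", "unknown"}
FIELDS = ["rdfs:label", "ex:country", "ex:displayName"]


def transform(record):
    """Clean a record, map its properties and add one generated property."""
    cleaned = {}
    for key, value in record.items():
        value = value.strip()
        if value and value not in UNWANTED:  # data cleaning
            cleaned[key] = value
    # Property mapping: keep only properties that have a target name.
    output = {PROPERTY_MAP[k]: v for k, v in cleaned.items() if k in PROPERTY_MAP}
    # Generated value, added only when both inputs are present.
    if "rdfs:label" in output and "ex:country" in output:
        output["ex:displayName"] = f'{output["rdfs:label"]}, {output["ex:country"]}'
    return output


def to_csv(records, fieldnames):
    """Serialize transformed records as CSV text (missing values become empty cells)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


raw = [
    {"name": " Berlin ", "country": "Germany"},
    {"name": "Paris", "country": "N/A"},
]
transformed = [transform(r) for r in raw]
csv_text = to_csv(transformed, FIELDS)
```

The second record loses its placeholder country during cleaning, so no display name is generated for it and the corresponding CSV cells stay empty.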

Silk Workbench

Silk Workbench is a web application which guides the user through the process of interlinking different data sources.

Silk Workbench offers the following features:

It enables the user to manage different sets of data sources, linking tasks and transformation tasks.
It offers a graphical editor which enables the user to easily create and edit linking tasks and transformation tasks.
As finding a good linking heuristic is usually an iterative process, the Silk Workbench makes it possible for the user to quickly evaluate the links which are generated by the current link specification.
It allows the user to create and edit a set of reference links used to evaluate the current link specification.
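One common way to evaluate generated links against reference links is to compute precision, recall and F-measure. The Python sketch below assumes that the reference links are a set of known correct (source, target) pairs. The function name and the choice of measures are illustrative assumptions and are not taken from the Workbench itself.

```python
def evaluate(generated_links, reference_links):
    """Compare generated links with known correct links.

    Both arguments are iterables of (source_uri, target_uri) pairs.
    Returns (precision, recall, f_measure).
    """
    generated = set(generated_links)
    reference = set(reference_links)
    true_positives = len(generated & reference)
    precision = true_positives / len(generated) if generated else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


generated = [("a:1", "b:1"), ("a:2", "b:2"), ("a:3", "b:9")]
reference = [("a:1", "b:1"), ("a:2", "b:2"), ("a:4", "b:4")]
precision, recall, f_measure = evaluate(generated, reference)
```

Here two of the three generated links are correct and two of the three reference links are found, so all three measures equal 2/3. Rerunning such an evaluation after each change to the link specification supports the iterative refinement described above.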

Documentation of the Silk Workbench is available in the Wiki.

Silk Command Line Applications

In addition to the Workbench, Silk provides three different command line applications for executing link specifications:

Silk Single Machine is used to generate RDF links on a single machine. The datasets that should be interlinked can either reside on the same machine or on remote machines which are accessed via the SPARQL protocol. Silk Single Machine provides multithreading and caching. In addition, the performance is further enhanced using the MultiBlock blocking algorithm.
Silk MapReduce is used to generate RDF links between data sets using a cluster of multiple machines. Silk MapReduce is based on Hadoop and can for instance be run on Amazon Elastic MapReduce. Silk MapReduce enables Silk to scale out to very big datasets by distributing the link generation to multiple machines.
Silk Server can be used as an identity resolution component within applications that consume Linked Data from the Web. Silk Server provides an HTTP API for matching entities from an incoming stream of RDF data while keeping track of known entities. It can be used for instance together with a Linked Data crawler to populate a local duplicate-free cache with data from the Web.
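The identity resolution role described for Silk Server, matching incoming entities against known ones to keep a duplicate-free cache, can be sketched as follows. This is an in-memory illustration of the idea and not Silk Server's HTTP API: the EntityCache class, its resolve method and the label-based match function are hypothetical.

```python
class EntityCache:
    """Keeps one representative per real-world entity seen so far."""

    def __init__(self, is_match):
        self._is_match = is_match  # function (known, incoming) -> bool
        self._known = []

    def resolve(self, entity):
        """Return the URI of a matching known entity, or register the new one."""
        for known in self._known:
            if self._is_match(known, entity):
                return known["uri"]
        self._known.append(entity)
        return entity["uri"]

    def __len__(self):
        return len(self._known)


def same_label(known, incoming):
    """Placeholder matching rule: case-insensitive label equality."""
    return known["label"].lower() == incoming["label"].lower()


cache = EntityCache(same_label)
stream = [
    {"uri": "http://example.org/a/Berlin", "label": "Berlin"},
    {"uri": "http://example.org/b/berlin", "label": "berlin"},
    {"uri": "http://example.org/a/Paris", "label": "Paris"},
]
resolved = [cache.resolve(entity) for entity in stream]
```

The second entity resolves to the URI of the first, so the cache ends up with two entries for three incoming entities. The linear scan over known entities is kept for brevity; a real system would need an index or blocking to scale.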

Silk Free Text Preprocessor

The main goal of the Free Text Preprocessor is to produce a structured representation of data that contains or is derived from free text. The tool takes as input an RDF file whose properties have free text values and an additional RDF file that contains structured data used to learn the extraction model. Based on the learned model, the tool extracts new property-value pairs from the free text. The resulting output is an RDF dump file containing the extracted structured values. Using a declarative XML-based language, a user can specify which extraction methods to use.
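The general flow of learning a model from structured data and then extracting property-value pairs from free text can be illustrated with a very simple dictionary-based approach in Python. The extraction methods actually offered by the tool are not described here, so this sketch is an assumption throughout: the model is just the set of values seen per property, only single-word values are supported, and all names are hypothetical.

```python
import re
from collections import defaultdict


def learn_model(structured_records):
    """Collect, for each property, the set of values seen in the structured data."""
    model = defaultdict(set)
    for record in structured_records:
        for prop, value in record.items():
            model[prop].add(value.lower())
    return model


def extract(text, model):
    """Return (property, value) pairs whose value occurs as a word in the text."""
    tokens = set(re.findall(r"\w+", text.lower()))
    pairs = []
    for prop in sorted(model):
        for value in sorted(model[prop]):
            if value in tokens:
                pairs.append((prop, value))
    return pairs


structured = [
    {"ex:color": "Red", "ex:brand": "Acme"},
    {"ex:color": "Blue", "ex:brand": "Globex"},
]
model = learn_model(structured)
pairs = extract("Brand new Acme phone in red, barely used.", model)
```

In this example the text yields the pairs (ex:brand, acme) and (ex:color, red), which could then be written out as additional structured statements about the described item.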

Documentation of the Silk Free Text Preprocessor is available in the Wiki.

Acknowledgments

This work was supported in part by Vulcan Inc. as part of its Project Halo and by the EU FP7 project LOD2 - Creating Knowledge out of Interlinked Data (Grant No. 257943).