Implementation of an Algorithm for a stable RDF Graph Serialization

State: completed by Daniel Spicar

Published: 2011-09-13

Data of Internet applications are traditionally stored in relational databases. In the Semantic Web area, however, a more convenient way to describe resources is through triples: statements that each comprise a subject, a predicate, and an object. A collection of such triples forms a database. The World Wide Web Consortium (W3C) has developed the Resource Description Framework (RDF), a framework for representing information in the Web. A set of RDF triples is called an RDF Graph. Subjects and objects are the nodes of the graph, whereas predicates are directed arcs that always point from the subject to the object. Thus, an RDF Graph is a directed graph.
Generally, each resource is expected to be assigned a unique identifier in the form of a URI (also known as an RDF URI Reference). However, it is practical to have unique nodes without an intrinsic name. Such a node is called a blank node and can be used in several statements. The introduction of blank nodes into a graph also leads to problems in serializing the graph. One of these problems is addressed in this assignment.
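The graph model described above can be sketched in Java as follows. This is only an illustration: the class names RdfNode, RdfTriple, and RdfGraphSketch and the example.org URIs are assumptions made for this sketch, and literals are left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal model, not taken from any RDF library.
final class RdfNode {
    final String uri; // null marks a blank node

    private RdfNode(String uri) { this.uri = uri; }

    static RdfNode named(String uri) { return new RdfNode(uri); }
    static RdfNode blank() { return new RdfNode(null); }

    boolean isBlank() { return uri == null; }

    // Named nodes are equal when their URIs match; a blank node equals only itself.
    @Override public boolean equals(Object other) {
        if (this == other) return true;
        if (isBlank() || !(other instanceof RdfNode)) return false;
        return uri.equals(((RdfNode) other).uri);
    }

    @Override public int hashCode() {
        return isBlank() ? System.identityHashCode(this) : uri.hashCode();
    }
}

final class RdfTriple {
    final RdfNode subject;
    final RdfNode predicate;
    final RdfNode object;

    RdfTriple(RdfNode subject, RdfNode predicate, RdfNode object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }
}

final class RdfGraphSketch {
    // Builds a two-triple graph in which one blank node is used in both statements.
    static List<RdfTriple> example() {
        RdfNode alice = RdfNode.named("http://example.org/alice");
        RdfNode address = RdfNode.blank(); // has no name, only an identity
        List<RdfTriple> graph = new ArrayList<RdfTriple>();
        graph.add(new RdfTriple(alice, RdfNode.named("http://example.org/address"), address));
        graph.add(new RdfTriple(address, RdfNode.named("http://example.org/city"),
                RdfNode.named("http://example.org/Zurich")));
        return graph;
    }
}
```

In this sketch the blank node is the object of the first triple and the subject of the second, which is exactly the situation a serializer has to express with some identifier.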

In general, to serialize a graph, each blank node needs to be given a locally unique identifier. Blank node identifiers are different from all URIs and literals. However, the generation of blank node identifiers is implementation-specific: "Note that such blank node identifiers are not part of the RDF abstract syntax, and the representation of triples containing blank nodes is entirely dependent on the particular concrete syntax used".
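A common way to obtain locally unique identifiers is a simple counter. The following sketch assumes a deliberately crude representation: a URI is a Java String, any other object stands for a blank node, and literals are ignored. The class name NaiveBlankNodeLabeller and the "_:b" prefix are choices made for this illustration only.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class NaiveBlankNodeLabeller {
    // Labels are handed out in order of first encounter, so they depend on triple order.
    private final Map<Object, String> labels = new IdentityHashMap<Object, String>();

    // Renders one term: a String is treated as a URI, anything else as a blank node.
    String term(Object node) {
        if (node instanceof String) {
            return "<" + node + ">";
        }
        String label = labels.get(node);
        if (label == null) {
            label = "_:b" + labels.size();
            labels.put(node, label);
        }
        return label;
    }

    // Emits one N-Triples-style line per triple, in input order.
    List<String> serialize(List<Object[]> triples) {
        List<String> lines = new ArrayList<String>();
        for (Object[] t : triples) {
            lines.add(term(t[0]) + " " + term(t[1]) + " " + term(t[2]) + " .");
        }
        return lines;
    }
}
```

The identifiers produced this way are unique within one output, but nothing ties a given label to the same blank node in the next run.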

Generating locally unique blank node identifiers is not particularly difficult. Each triple store implementation that is able to store triples persistently should have a solution for generating blank node identifiers. For data exchange across a communication channel, a number of formats for the serialization of triples exist, including N3, N-Triples (a fixed subset of N3), Turtle, and RDF/XML. Problems arise when two graphs need to be merged or compared. In the case of merging, blank node identifiers may need to be re-allocated. Furthermore, no polynomial-time algorithm is known for determining whether two graphs are isomorphic. In this work, however, a simpler task is defined, which has a practical requirement.
When an RDF Graph is modified and then serialized, depending on the serialization algorithm applied, the resulting serialization output may show a big "difference" compared to the unmodified one, even if the changes made were minimal. This is particularly due to the re-allocation of blank node identifiers and the reordering of triples. Since it is often useful to know the degree of data changes in various practical tasks, it is highly desirable that the amount of change in an RDF Graph is reflected in the "amount of differences" between the two serialization outputs, before and after the changes are made. In this work, this property is termed "stable serialization".
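One possible way to approach this property, shown here only as a sketch and not as the algorithm developed in this work, is to derive each blank node label from the triples the node takes part in and to sort the output lines. The sketch assumes that terms are already given in N-Triples syntax and that blank nodes carry arbitrary input labels starting with "_:". The class name StableSerializerSketch and the "_:h" label scheme are hypothetical. Blank nodes with identical signatures are told apart by encounter order, so the output is only guaranteed to be independent of the input when all signatures differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class StableSerializerSketch {

    static boolean isBlank(String term) {
        return term.startsWith("_:");
    }

    // Describes the triples a blank node takes part in, hiding all blank node labels.
    static String signature(String node, List<String[]> triples) {
        List<String> parts = new ArrayList<String>();
        for (String[] t : triples) {
            if (!t[0].equals(node) && !t[2].equals(node)) continue;
            String s = t[0].equals(node) ? "SELF" : (isBlank(t[0]) ? "BLANK" : t[0]);
            String o = t[2].equals(node) ? "SELF" : (isBlank(t[2]) ? "BLANK" : t[2]);
            parts.add(s + " " + t[1] + " " + o);
        }
        Collections.sort(parts);
        return parts.toString();
    }

    static List<String> serialize(List<String[]> triples) {
        // Compute one signature per blank node, remembering encounter order.
        final Map<String, String> sig = new LinkedHashMap<String, String>();
        for (String[] t : triples) {
            for (String term : new String[] { t[0], t[2] }) {
                if (isBlank(term) && !sig.containsKey(term)) {
                    sig.put(term, signature(term, triples));
                }
            }
        }
        // Process nodes in signature order so that label assignment ignores input order.
        List<String> nodes = new ArrayList<String>(sig.keySet());
        nodes.sort((a, b) -> sig.get(a).compareTo(sig.get(b)));

        Map<String, String> label = new HashMap<String, String>();
        Set<String> used = new HashSet<String>();
        for (String node : nodes) {
            String base = "_:h" + Integer.toHexString(sig.get(node).hashCode());
            String candidate = base;
            // A suffix keeps labels unique for equal signatures or hash collisions.
            for (int i = 1; !used.add(candidate); i++) {
                candidate = base + "n" + i;
            }
            label.put(node, candidate);
        }

        List<String> lines = new ArrayList<String>();
        for (String[] t : triples) {
            String s = isBlank(t[0]) ? label.get(t[0]) : t[0];
            String o = isBlank(t[2]) ? label.get(t[2]) : t[2];
            lines.add(s + " " + t[1] + " " + o + " .");
        }
        Collections.sort(lines);
        return lines;
    }
}
```

With this scheme, a change that does not touch a blank node leaves that node's label and all lines mentioning it unchanged. A change to a triple that does involve a blank node alters its signature and therefore every line in which it appears, which is one of the limitations a real stable algorithm has to weigh.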

The work shall include:
- an analysis of existing algorithms for RDF Graph serialization with respect to their "stability",
- an implementation of an algorithm for a stable RDF Graph serialization in Java,
- a test to show that the implemented algorithm produces stable serializations of RDF Graphs.
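A test of stability needs some measure of the "amount of differences" between two serialization outputs. One simple candidate, assumed here for illustration and not prescribed by the assignment, is the number of lines that occur in only one of the two outputs. The class and method names are hypothetical, and each output is treated as a set of lines, one triple per line.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SerializationDiffSketch {
    // Counts lines that appear in exactly one of the two serialization outputs.
    static int changedLines(List<String> before, List<String> after) {
        Set<String> onlyBefore = new HashSet<String>(before);
        onlyBefore.removeAll(after);
        Set<String> onlyAfter = new HashSet<String>(after);
        onlyAfter.removeAll(before);
        return onlyBefore.size() + onlyAfter.size();
    }
}
```

Under this measure, a stable serializer should keep the count close to the number of triples that were actually added, removed, or modified.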

30% Design, 70% Implementation

Java

Supervisors: Prof. Dr. Burkhard Stiller
