Linked Data, SPARQL and “web termlists”

One problem with the expectation that the cultural sector will publish lots of Linked Data is that most (if not all) of our current data consists of string values held in traditional databases.  I have discussed before how we need software support to convert these string values to Linked Data URLs, either as a one-off operation within the source database, or on the fly as part of the Linked Data publication process.  Given that many Linked Data resources whose URLs we might want to use will offer a SPARQL endpoint, it would be nice if we could use such an endpoint directly to enhance our data.

Looking at this problem in the specific context of the Modes software, we have the option of setting up “web termlists”, which treat an external web resource as an authority file.  The interface to this resource includes a URL pattern which can be used to query for concepts matching a given string.  The assumption is that HTTP requests to this URL will return an XML response. An XSLT transform converts the returned XML into a form of XML which can be stored locally in a Modes data file.  In the past we have used the “web termlist” technique to interface to resources like Geonames, which have a reasonably simple query syntax.

One “SPARQL challenge” came about while making the standard Modes termlists into Linked Data resources.  The British Museum materials termlist has been published by the BM as a SKOS ontology as part of their online collections data, and they suggested that instead of creating a parallel Modes file, we should simply access their data directly.

The first job was to work out how to query the BM data so as to retrieve just materials termlist concepts, and how to retrieve useful information. This was tackled by going to the search box for the SPARQL endpoint and hitting it with queries until something useful came back.  Three key learning points came out of this exercise:

  • the CONSTRUCT query form gives you a useful subset of the original data as RDF triples; the standard SELECT form, which returns a table of variable bindings, is much less useful for this purpose
  • restricting results to the required SKOS concept scheme simply needed a ?s skos:inScheme <http://collection.britishmuseum.org/id/thesauri/material> clause in the SPARQL query
  • support for FILTER and REGEX is required if you want to search the data for string matches: FILTER regex(?term, "^agave")

This is the SPARQL query pattern which I eventually used:

CONSTRUCT { ?s ?p ?o } WHERE
{
?s ?rel ?term .
?s skos:inScheme <http://collection.britishmuseum.org/id/thesauri/material> .
?s ?p ?o
FILTER regex(?term, "^***")
}

(where “***” is replaced by the user’s search term).  This web termlist allows Modes users to record the string value of a material in their data, and then quickly look up and include the corresponding BM Linked Data identifier:

BM materials thesaurus lookup
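As a rough illustration of the substitution step, here is a minimal Python sketch that fills the "***" placeholder and turns the result into a query URL. It is not how Modes does it. The endpoint address is a placeholder, the helper names are hypothetical, and I have added an explicit skos PREFIX line on the assumption that the endpoint does not predefine one. The search term is escaped twice: once so that regex metacharacters are matched literally, and once so that it is safe inside a SPARQL string literal.

```python
from urllib.parse import urlencode

# Query pattern from the post, with an explicit PREFIX added (assumption).
MATERIALS_PATTERN = """PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT { ?s ?p ?o } WHERE
{
?s ?rel ?term .
?s skos:inScheme <http://collection.britishmuseum.org/id/thesauri/material> .
?s ?p ?o
FILTER regex(?term, "^***")
}"""

# Characters with special meaning in a regular expression.
REGEX_META = set(r"\.^$*+?()[]{}|")


def escape_regex(text):
    """Backslash-escape regex metacharacters so the term matches literally."""
    return "".join("\\" + ch if ch in REGEX_META else ch for ch in text)


def escape_sparql_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def fill_pattern(pattern, term):
    """Replace the "***" placeholder with the escaped search term."""
    return pattern.replace("***", escape_sparql_literal(escape_regex(term)))


def lookup_url(endpoint, query):
    """Build a GET URL using the standard SPARQL protocol 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    # "http://example.org/sparql" is a placeholder, not the BM endpoint.
    print(lookup_url("http://example.org/sparql",
                     fill_pattern(MATERIALS_PATTERN, "agave")))
```

Escaping matters because the term comes straight from the user: an unescaped quote or bracket would otherwise break the query or change what it matches.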

The second SPARQL endpoint which I wanted to access is the Ordnance Survey postcode resource.  The reason for this is that web resources such as Historypin require geolocation information (latitude/longitude) for uploaded resources.  Modes users want to be able to contribute to such web resources. However, typical Modes local history data might include postcodes, but certainly won’t have lat/long coordinates.  The Ordnance Survey postcode data includes lat/long and NGR coordinates for the centre of each postcode area.  So, by making a link to the OS data, Modes users can get the required coordinate information “for free”.  This is a good example of how cultural history institutions can get added value from adopting a Linked Data approach.

Looking up postcodes is more straightforward than materials keywords.  The correct form of a postcode is simple and well-understood, so there is no need for a SPARQL “search”, just a lookup of the OS data.  This URL pattern looks up the skos:notation property of the postcode, then returns all triples of which it is the subject:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT { ?s ?p ?o }
WHERE { ?s skos:notation "***"^^<http://data.ordnancesurvey.co.uk/ontology/postcode/Postcode> .
?s ?p ?o . }

(Again, “***” in this pattern is replaced by the postcode, e.g. “RH15 8JA”.) The only complication here is that the skos:notation has a specified datatype, so this has to be specified in the SPARQL query.  When a postcode is selected from the web termlist, a new record is created and stored in a local Modes data file, using the Place application:

OS postcode termlist lookup

Because this cached copy of the data is stored locally in a linked Modes data file, and in an XML format which is compatible with other Modes place data, the lat/long coordinates which it contains can easily be accessed and used in views and reports.  In particular, they can be included in a report which generates output for “bulk load” into Historypin (our original objective).
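To show the kind of extraction involved, here is a small Python sketch that pulls lat/long out of an RDF/XML response. Modes itself does this with an XSLT transform, so this is a stand-in technique. I am assuming the coordinates are expressed with the W3C WGS84 vocabulary (geo:lat and geo:long). The sample document is made up for illustration: its subject URL and coordinate values are placeholders, not real OS data.

```python
import xml.etree.ElementTree as ET

# Namespace of the W3C WGS84 vocabulary (assumed to be what the data uses).
GEO_NS = "http://www.w3.org/2003/01/geo/wgs84_pos#"

# Invented sample response: the URL and the values are placeholders.
SAMPLE_RDF = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <rdf:Description rdf:about="http://example.org/postcodeunit/EXAMPLE">
    <geo:lat>51.0</geo:lat>
    <geo:long>-0.1</geo:long>
  </rdf:Description>
</rdf:RDF>"""


def extract_lat_long(rdf_xml):
    """Return (lat, long) as floats, or None if either value is missing."""
    root = ET.fromstring(rdf_xml)
    lat = root.find(f".//{{{GEO_NS}}}lat")
    lon = root.find(f".//{{{GEO_NS}}}long")
    if lat is None or lon is None:
        return None
    return float(lat.text), float(lon.text)


if __name__ == "__main__":
    print(extract_lat_long(SAMPLE_RDF))
```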

These experiments demonstrate that if data is published in Linked Data format with a SPARQL endpoint, it is quite possible to use this as an “API” to access and use the data in a variety of ways. In particular, we can use the “web termlist” approach to generate new Linked Data connections to existing resources, and so enrich the developing cultural history Linked Data environment.

Posted in Linked Data | Leave a comment

Shakespeare’s works as Linked Data

As my contribution to the Will’s World Hack Week I decided to take the XML encodings of Shakespeare’s works (all the plays and some verse works) which were provided in the Will’s World Registry, and see if they could be expressed as Linked Data. The intention is to provide a set of stable URLs which represent every aspect of the plays’ text, and which can be used as the basis for a shared understanding of this resource.  This is not really a hack in the usual sense, in that there isn’t very much to see when you have finished.  The value of the exercise will only become apparent as and when other people start using the URLs which have been published.

The plays were presented as separate XML documents, conforming to a relatively simple XML Schema. I converted this to a DTD and added some attributes which I thought might come in handy, but left the element structure as found.  Then I wrote a routine to split each file into multiple mini-records (one per play, act, scene, speech and line) and imported them into a Modes database file.  I followed standard Modes practice (as used with TEI and other monolithic XML frameworks) and included processing instructions to represent parent/child links between these records.

Shakespeare XML in Modes
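The splitting step can be sketched roughly as follows. This is not the routine I actually wrote: the element names (play, act, scene, speech, line) are assumed for illustration, the sample document is invented, and the parent/child link is held in a plain dictionary field rather than in a processing instruction as Modes does.

```python
import xml.etree.ElementTree as ET
from itertools import count

# Element names that become records of their own (assumed names).
LEVELS = {"play", "act", "scene", "speech", "line"}

# Invented miniature document, just to exercise the code.
SAMPLE_PLAY = (
    "<play><act><scene><speech>"
    "<line>First line</line><line>Second line</line>"
    "</speech></scene></act></play>"
)


def split_records(xml_text):
    """Flatten a nested play into mini-records linked by parent id."""
    ids = count(1)
    records = []

    def walk(elem, parent_id):
        if elem.tag in LEVELS:
            rec_id = next(ids)
            records.append({
                "id": rec_id,
                "type": elem.tag,
                "parent": parent_id,  # None for the top-level play
                "text": (elem.text or "").strip() if elem.tag == "line" else None,
            })
            parent_id = rec_id  # children link to this record
        for child in elem:
            walk(child, parent_id)

    walk(ET.fromstring(xml_text), None)
    return records


if __name__ == "__main__":
    for record in split_records(SAMPLE_PLAY):
        print(record)
```

Each record carries only its own parent's identifier, which is enough to rebuild the hierarchy in either direction.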

The characters in each play were recorded in a separate Modes file, using the supplied Person data structure. I included each character’s speeches in their record, to allow cross-reference from character to speeches, and included the character’s code in the speech record, to tie things up in the other direction.

Person (character) data in Modes

Once the data was loaded, the next job was to write XSLT transforms to generate all the different forms in which it would be published. Initially, these forms are HTML, XML and RDF. Then a publication record needed to be added to the Modes Linked Data Framework for each file, specifying what formats it would be available in.

The URLs all follow the same pattern: domain / application / file / ‘id’ / identifier, e.g.:

http://richardlight.org.uk/Plays/shakespeare/id/654811

which defaults to an HTML rendition. You can use content negotiation (the HTTP Accept header) to get RDF or XML through a 303 See Other redirect, or you can cheat by putting the content type into the URL (domain / application / file / ‘id’ / content type / identifier) e.g.:

http://richardlight.org.uk/Plays/shakespeare/id/xml/654811
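From a client's point of view, the two routes look something like this in Python. This is a sketch only: the helper names are hypothetical, and the media type application/rdf+xml is my assumption about what the server expects in the Accept header. Nothing is sent over the network here; the request object is just constructed.

```python
from urllib.request import Request


def with_format(url, fmt):
    """Insert a content type segment after /id/ (the 'cheat' URL form)."""
    head, sep, identifier = url.partition("/id/")
    if not sep:
        raise ValueError("URL does not contain an /id/ segment")
    return f"{head}/id/{fmt}/{identifier}"


def negotiated_request(url, media_type):
    """Prepare a request that asks for a format via the Accept header."""
    return Request(url, headers={"Accept": media_type})


if __name__ == "__main__":
    base = "http://richardlight.org.uk/Plays/shakespeare/id/654811"
    print(with_format(base, "xml"))
    request = negotiated_request(base, "application/rdf+xml")
    print(request.get_header("Accept"))
```

Passing the prepared request to urllib.request.urlopen would follow the 303 redirect automatically, so the caller sees only the final representation.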

There is a simple word search facility which uses the same content-type-specifying trick:

http://richardlight.org.uk/Plays/shakespeare/search/xml/?q=vessel&start=11
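A search URL of this shape can be assembled as below. This is a minimal sketch with a hypothetical helper name. I am assuming that start is optional and can simply be left off for the first page of results.

```python
from urllib.parse import urlencode


def search_url(file_base, fmt, query, start=None):
    """Build domain/application/file/search/<format>/?q=...&start=..."""
    params = {"q": query}
    if start is not None:
        params["start"] = start  # offset of the first result wanted
    return f"{file_base}/search/{fmt}/?{urlencode(params)}"


if __name__ == "__main__":
    print(search_url("http://richardlight.org.uk/Plays/shakespeare",
                     "xml", "vessel", start=11))
```

Using urlencode means that search terms containing spaces or punctuation are percent-encoded correctly.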

The character identifiers weren’t that consistent in their format, so I minted a new set which combined the name with an abbreviation of the play, to make each one unique. Then I added cross-references between character records where the same character appeared in a number of plays.  Since the XML only had codes for characters, I had to find a separate list of full character names and meld them together.

The XML focuses on “speeches” (i.e. one or more consecutive lines spoken by the same character(s)), but I felt that it may also be useful to identify each line separately, so I have created individual records for these as well:

A single line expressed as RDF

(In fact, in response to a comment from another Will’s World Hack participant, I have also identified each word in each line with its own URL.)  The RDF schema which this data follows was hand-made for this project, and it was kept as simple as possible, e.g.:

RDF Schema for plays

The general idea is that you should be able to “follow your nose” around this data, using nothing more exotic than a browser.  The HTML version of this single-line record contains hyperlinks to take you to higher-level records:

Single line as browsable HTML

While this Linked Data resource is potentially useful as it stands, its value would be enhanced by some additional information. Two obvious omissions are the scene titles and any form of synopsis – even something at scene level would be really helpful.  The line structure potentially lends itself to adding translations, since I assume these would be carried out line-for-line.  I trawled DBpedia and the Library of Congress Subject Headings (LCSH) for authority records for characters, and added those which seemed relevant from DBpedia. However, I gave up on LCSH quite quickly on finding headings such as “Jailer” which applied to any jailer in any Shakespeare play.  If anyone has suggestions as to where such enrichment data might come from, I would be interested in helping to add it to this resource.

Another major question relates to the actual use of this data.  In principle, by following Linked Data guidelines, I am providing this Shakespeare resource in an open format which can be consumed in a wide variety of ways.  I would be interested to hear if anyone is making use of it, and in particular I would be willing to help improve the form in which it is presented, if this would enable its adoption.  It doesn’t have a SPARQL endpoint (but then neither does the LCSH resource).  The search URLs return results which aim to conform to the OpenSearch spec.

I wait with interest to see if this Linked Data resource can fulfil a practical need.


Modes vocabularies as Linked Data

Instead of just talking about Linked Data, I’ve decided to take some affirmative action.  The Modes software (mostly used for cataloguing museum collections) is distributed with a number of “termlists”.  Their aim is to make cataloguing more consistent, by checking entries against standard terminology.  Assuming that most Modes users will just use these termlists as published, they represent a commonality of practice across a whole (pretty large) community.

So I reckon that publishing these termlists as Linked Data would make it possible for Modes users to share this consistency of recording, by using the appropriate Linked Data URLs when publishing their own records as Linked Data.  In other words, they can convert their string values to Linked Data URLs.

As of yesterday, I think there are some results worth sharing.  We now have a dedicated subdomain of the Modes web presence: data.modes.org.uk.  Within that, we publish each vocabulary on a concept-by-concept basis.  The URLs have a consistent logic:

[domain]/[application]/[filename]/id/[identifier]

e.g.

http://data.modes.org.uk/Termlist/HertsSimpleName_termlist/id/ball

Content negotiation has been implemented, so you can ask for an RDF (or XML) representation in the HTTP Accept header.  Alternatively, you can include the format directly in the URL:

http://data.modes.org.uk/Termlist/HertsSimpleName_termlist/id/rdf/ball

and finesse the 303 See Other redirection step.  There is search support, which uses OpenSearch conventions, e.g.:

http://data.modes.org.uk/Termlist/HertsSimpleName_termlist/search/xml/?q=rope

This combination of search facility and delivery of individual “found” concepts is what we need to implement Modes “web termlists”, where an external web service (like Geonames) is used as the source of a controlled vocabulary.  Our hope is that it will also allow other cultural history software to make use of these Modes vocabularies, so that we can start to share conceptual frameworks more widely within the community.

We have the beginnings of a VoID implementation, which at present simply lists the resources available:

http://data.modes.org.uk/Termlist/void/rdf

For some of the vocabularies there is a VoID description and example:

http://data.modes.org.uk/Termlist/counties_termlist/void/rdf

We welcome feedback on this initiative, and will try to make any improvements which make it easier to use.
