Following my conclusions on the need for URLs when searching Culture Grid (and by extension other cultural heritage resources), I decided to move things forward by producing some working code. This article describes a proof-of-concept lookup service for place name resolution which I am working on.
The general principle is that place names are presented to the service as a string, e.g.:
Burgess Hill, West Sussex, U.K.
and, if it finds a matching place concept, the service will return a Linked Data URL for the place:
The search syntax matches both the form of place names found in Culture Grid’s spatial facet keywords, and also the way in which museums have tended to record place information (assuming they are following the original MDA guidelines, or a compatible successor framework).
The initial idea is to return URLs for the most specific place in the “address”, but there is no reason why there couldn’t be URLs for each level. A confidence level might be a helpful adjunct, to indicate how wise it would be to use these results automatically, e.g. to drive further machine-based lookups or inferences.
The principal assumption behind the proposed lookup service is that a hierarchical relationship exists between the place names in the search string. I further assume that they run from specific to general, or vice versa. Finally I assume that a consistent separator has been placed between each place name/level: by default a comma.
The place lookup service requires two support services: one to return all instances matching a given place name string, and the other to provide hierarchical relationships between a given place and contained/containing places. Put simply, the strategy is to look up the individual place names in the search string, and then find which of those places have the hierarchical relationships implied by the form of the search string.
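As a rough illustration of this strategy, here is a minimal sketch in Python. It is not the prototype's code. The two support services are passed in as plain functions (`find_places` and `find_ancestors`, both hypothetical names), and the sketch assumes the search string runs from specific to general and that every broader name must appear among the candidate's ancestors.

```python
from typing import Callable, Iterable, Optional


def split_levels(query: str, sep: str = ",") -> list[str]:
    """Split a place string into its levels, dropping empty parts."""
    return [part.strip() for part in query.split(sep) if part.strip()]


def resolve(
    query: str,
    find_places: Callable[[str], Iterable[int]],     # hypothetical: name -> candidate ids
    find_ancestors: Callable[[int], Iterable[str]],  # hypothetical: id -> containing place names
    sep: str = ",",
) -> Optional[int]:
    """Return the first candidate whose ancestors include every broader level."""
    levels = split_levels(query, sep)
    if not levels:
        return None
    # Assumption: the string runs from specific to general.
    specific, broader = levels[0], levels[1:]
    for place_id in find_places(specific):
        ancestors = {name.casefold() for name in find_ancestors(place_id)}
        if all(level.casefold() in ancestors for level in broader):
            return place_id
    return None
```

Passing the support services in as functions keeps the matching logic independent of whichever geographic resource is eventually used.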
Sources of URLs
There are several online geographic resources which could be used, e.g. Geonames, the TGN, or OS Linked Data. However, not all of them provide unique persistent dereferenceable URLs for each concept they define. This is certainly [at present] an issue with the Getty’s Thesaurus of Geographic Names, which in addition will only provide search results as HTML. OS Linked Data only covers Great Britain, but may still be a useful resource in the U.K. museums context (despite the provenance of the material described commonly being worldwide).
I therefore based the prototype service on Geonames, since it has the search facilities required, a worldwide scope, and the (potentially useful) facility to add places which are not currently covered. A search of the form:
will return G.B. places first in the hit list. For each hit, XML like this will be returned:
<geoname><toponymName>Burgess Hill</toponymName><name>burgess hill, west sussex</name><lat>50.95843</lat><lng>-0.13287</lng><geonameId>2654308</geonameId><countryCode>GB</countryCode><fcl>P</fcl><fcode>PPL</fcode></geoname>
This yields a Geonames identifier (2654308), which can then be used in a hierarchy request:
which in turn returns an XML response listing the higher-level places within which this place falls (Earth, Europe, U.K., England, West Sussex, Mid Sussex District). Not all of these are required for matching, but the county and country do match those specified in the search string, and there are no non-matches. So in this case, a response with a high level of certainty can be returned for the cost of just two Geonames requests. In other cases, multiple requests may be required.
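The two requests can be sketched as follows. The URL patterns are the ones shown in the debugging attributes of the sample response later in this article. The function names are illustrative, and the sketch makes no network calls: it only builds the URLs and parses XML of the shape shown above.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode


def search_url(name: str, country: str, username: str = "demo") -> str:
    """Build an exact-name search URL restricted to one country."""
    params = {"style": "short", "name_equals": name,
              "country": country, "username": username}
    return "http://api.geonames.org/search?" + urlencode(params)


def hierarchy_url(geoname_id: int, username: str = "demo") -> str:
    """Build the hierarchy request URL for a Geonames identifier."""
    params = {"geonameId": geoname_id, "username": username}
    return "http://api.geonames.org/hierarchy?" + urlencode(params)


def first_geoname_id(xml_text: str) -> Optional[int]:
    """Return the identifier of the first hit, or None if there are no hits."""
    root = ET.fromstring(xml_text)
    text = root.findtext(".//geonameId")
    return int(text) if text else None


def place_names(xml_text: str) -> set[str]:
    """Collect every toponymName and name found in a response."""
    root = ET.fromstring(xml_text)
    names = set()
    for geoname in root.iter("geoname"):
        for tag in ("toponymName", "name"):
            text = geoname.findtext(tag)
            if text:
                names.add(text)
    return names
```

Fetching each URL (for example with `urllib.request.urlopen`) and passing the response body to these helpers would give the identifier from the search and the set of containing place names from the hierarchy.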
How dynamic should the service be?
If the service were to be completely dynamic, all these queries would have to be submitted each time it was called. This would place a non-trivial burden on the resource being queried (Geonames, in the first instance), and would lead to slower response times for the user. Therefore it seems fairer to implement a degree of local storage to cache previous queries. This could consist simply of the search string and the results returned by the service (URLs and confidence levels), or more information could be stored locally.
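The simplest form of such a cache, storing only the search string and the result, might look like the sketch below. The table layout, the class name and the key normalisation (lower-casing and collapsing whitespace) are all assumptions made for illustration, and SQLite is just one convenient choice of store.

```python
import sqlite3
from typing import Callable, Optional, Tuple

Result = Tuple[Optional[str], int]  # (URL or None on failure, certainty percentage)


class LookupCache:
    """Remember the result for each search string so repeats cost no remote requests."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS lookups "
            "(query TEXT PRIMARY KEY, url TEXT, certainty INTEGER)"
        )

    def get_or_compute(self, query: str, compute: Callable[[str], Result]) -> Result:
        # Assumed normalisation: case and extra whitespace do not matter.
        key = " ".join(query.lower().split())
        row = self.db.execute(
            "SELECT url, certainty FROM lookups WHERE query = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0], row[1]
        url, certainty = compute(query)  # only reached on a cache miss
        self.db.execute("INSERT INTO lookups VALUES (?, ?, ?)", (key, url, certainty))
        self.db.commit()
        return url, certainty
```

Here `compute` stands for the full lookup against the remote resource. Failed lookups are cached too, which saves repeated requests but means a later correction in the source data would not be picked up until the entry is cleared.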
The Culture Grid data suggests that it would be useful to be able to specify a second place as the context within which the search string is evaluated (in the absence of an unambiguous hierarchy in the search string). In that case, the location of the source institution would provide a context within which local place names could be disambiguated. In other cases, one might just specify “U.K.”, on the assumption that material from elsewhere would have at least a country name specified.
Progress to date
I now have a working proof of concept, which can be found at:
This simple CGI program takes the following arguments:
- q: the sequence of place names to check, e.g. “q=Cambridge, Cambridgeshire, England”
- sep: the separator between place names within this sequence, e.g. “sep=;”. By default, commas are interpreted as separators. This value should be URL-encoded.
- username: the username to use when submitting requests to Geonames. By default this is “demo”
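For illustration, the handling of these three arguments and their defaults could be sketched like this in Python. The prototype's own implementation is not shown here, so the function name and the returned dictionary are assumptions.

```python
from urllib.parse import parse_qs


def parse_args(query_string: str) -> dict:
    """Extract q, sep and username from a raw query string, applying the defaults."""
    params = parse_qs(query_string)  # also undoes the URL encoding
    return {
        "q": params.get("q", [""])[0],
        "sep": params.get("sep", [","])[0],             # comma unless overridden
        "username": params.get("username", ["demo"])[0],
    }
```

Because `parse_qs` decodes percent-escapes, a separator supplied as `sep=%3B` arrives as a plain semicolon.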
Thus the HTTP request:
will yield the (XML) response:
<result q="Cambridge, Cambridgeshire, England" q1="Cambridge" q2="Cambridgeshire" q3="England" country="GB" url="http://api.geonames.org/search?style=short&name_equals=Cambridge&country=GB&username=demo" hits="5" geonameId="2653941" hierUrl="http://api.geonames.org/hierarchy?geonameId=2653941&username=demo" hit1="true" hit2="false: /geonames/geoname[toponymName[.='Cambridgeshire'] or name[.='Cambridgeshire']]" hit3="true" certainty="66">http://www.geonames.org/2653941/</result>
The key part of this XML response is the text inside the <result> element: the attributes are there simply to provide background and debugging information. If the lookup fails, this text node will be empty, but there will still be a <result> element so that the reasons for the failure can be investigated. Thus, in this example, the match on the county name has failed. The “hit2” attribute records this fact, and gives the XPath expression which failed. (This fails because Geonames has both name and toponymName as “County of Cambridgeshire”, not “Cambridgeshire”.)
The criterion for success is that there should be at least one hierarchical “match” in addition to a match on the most specific term itself (i.e. the one for which we are returning a URL). There is a “certainty” attribute which simply indicates the percentage of terms in the string which matched.
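The success test and the certainty figure can be expressed in a few lines. This is a sketch rather than the prototype's code: it assumes the per-term results are listed with the most specific term first, and that the percentage is rounded down, which is consistent with the 66 reported for two matches out of three in the example above.

```python
def assess(hits: list[bool]) -> tuple[bool, int]:
    """Return (success, certainty) for per-term match results.

    hits[0] is the most specific term; the rest are the hierarchical checks.
    """
    if not hits:
        return False, 0
    certainty = 100 * sum(hits) // len(hits)  # percentage of matching terms, rounded down
    success = hits[0] and any(hits[1:])       # the term itself plus at least one hierarchical match
    return success, certainty
```

For the Cambridge example, `assess([True, False, True])` gives `(True, 66)`.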
At present only the first retrieved term is checked hierarchically, and a non-match on hierarchy causes the whole process to return a “fail”. While it is clearly possible to go on and search subsequent hits until a hierarchical test succeeds, this would add greatly to the load on the Geonames service.
During testing I found that the daily quota for the Geonames “demo” user had been exceeded, so everything stopped working until the next day. For this reason I added support for the “username” parameter, so that registered Geonames users can work within their own resource allocation. (Anyone can apply for their own Geonames key – this is free.) By default, “demo” is used, but be aware that it may be throttled at any time.
Since Geonames provides support for searching by country, I have added a special check for country names within the search string. This uses an external file of country names and their associated two-letter codes.
The whole strategy relies on string matching, and this will often fail (as per the Cambridgeshire example above). I have gone for a “high precision, low recall” approach, by applying the “name_equals” parameter in preference to “name” or “q”. This, in conjunction with inclusion of the country code, improves the chance that the first “hit” will be the one I am after. (I found it necessary to include some commonly-used variant names for the U.K. in my country codes file, to improve the chance of getting a match on country name.)
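A country check of this kind might be sketched as below. The format of the country names file is not described here, so the tab-separated layout, the sample entries and the normalisation (ignoring case, full stops and extra spaces) are assumptions for illustration only.

```python
from typing import Optional

# Assumed file format: one "name<TAB>code" pair per line, variants included.
SAMPLE_COUNTRY_FILE = (
    "United Kingdom\tGB\n"
    "U.K.\tGB\n"
    "Great Britain\tGB\n"
    "England\tGB\n"
    "France\tFR\n"
)


def normalise(name: str) -> str:
    """Ignore case, full stops and extra spaces, so 'U.K.' and 'uk' compare equal."""
    return " ".join(name.replace(".", "").casefold().split())


def load_country_codes(text: str) -> dict[str, str]:
    """Parse the country file into a mapping of normalised name to two-letter code."""
    codes = {}
    for line in text.splitlines():
        name, _, code = line.partition("\t")
        if name.strip() and code.strip():
            codes[normalise(name)] = code.strip().upper()
    return codes


def find_country(levels: list[str], codes: dict[str, str]) -> Optional[str]:
    """Return the code of the first level that is a known country name, if any."""
    for level in levels:
        code = codes.get(normalise(level))
        if code:
            return code
    return None
```

The resulting code would then be supplied as the country parameter of the Geonames search, alongside “name_equals”.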
If the name lookup information were stored locally it would be possible to investigate the hierarchies of all hits, use partial and fuzzy string matches, etc. Since the raw Geonames data is available for download, this is always an option.