In my last post I described the process of converting FreeBMD search results into usable data. This allowed me to infer family relationships programmatically, using the surnames of spouse and mother as a key link.
I’ve now taken a step back from this, and re-cast the source FreeBMD data as records of events, rather than people. Which is fair enough – that’s what they are. I’m now looking to infer the identity of individuals automatically from this data.
Aligning death and birth information
My initial thought on personal identity was that if you could record the date and place of birth and death, together with a person’s name, then you would have enough information to be reasonably confident about their identity. Thus, if another person appears who has all these five properties in common, then we are actually describing the same person.
So my first attempt to infer identity directly from FreeBMD data involves taking death events, and doing a lookup to see if there is a corresponding birth event. Where age at death is recorded (from 1866 onwards) this can be quite a precise operation. However, even then it is possible to end up with more than one “candidate” birth event. If the place of birth and death are the same, this increases the likelihood that the two events relate to the same individual (though even a “perfect match” doesn’t imply certainty about this). Where age at death isn’t recorded, all you can do is note any birth events for the same name which precede the death event as low-likelihood candidates.
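To make the matching rules concrete, here is a minimal Python sketch of how one death event might be compared against one candidate birth event. It is not the transform I actually ran: the Event fields, the category labels, the sample death record and the one-quarter slack allowed on the age comparison are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    kind: str                 # "birth" or "death"
    surname: str
    forenames: str
    year: int
    quarter: int              # 1-4, the quarter in which the event was registered
    district: str
    age_at_death: Optional[int] = None   # only present on some death events


def quarter_index(e: Event) -> int:
    """Put year and quarter on a single scale so events can be subtracted."""
    return e.year * 4 + (e.quarter - 1)


def classify(death: Event, birth: Event, slack: int = 1) -> Optional[str]:
    """Return a match category for one death/birth pair, or None if no match.

    slack is an assumed tolerance, in quarters, for the imprecision of the dates.
    """
    # Exact match on name is required, as in the approach described above.
    if (birth.surname, birth.forenames) != (death.surname, death.forenames):
        return None
    gap = quarter_index(death) - quarter_index(birth)
    if gap < 0:
        return None           # the birth must not come after the death
    same_place = birth.district == death.district
    if death.age_at_death is not None:
        # An age of A completed years implies a gap of roughly 4A to 4A+4 quarters.
        low = 4 * death.age_at_death - slack
        high = 4 * death.age_at_death + 4 + slack
        if not low <= gap <= high:
            return None       # assumed: an age mismatch rules the candidate out
        return "age+place" if same_place else "age"
    return "place" if same_place else "name"


# Two births as in the William Kerridge case, plus an invented death record.
born_downham = Event("birth", "Kerridge", "William", 1871, 3, "Downham")
born_ipswich = Event("birth", "Kerridge", "William", 1871, 3, "Ipswich")
died_downham = Event("death", "Kerridge", "William", 1896, 1, "Downham", 24)

for b in (born_downham, born_ipswich):
    print(b.district, classify(died_downham, b))
```

Both births survive the age test, which is exactly the “more than one candidate” situation; only the place comparison separates them.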
Having removed duplicate entries from my Light and Kerridge events, I have a file of 32,228 BMD events. Of these, 9,136 are death events. An attempt to match each of these to a corresponding birth event had the following outcomes:
- 4,428 had some sort of match to birth events
- of these, 1,387 had a high-probability match on both age and place of birth/death
- 1,558 matched on age at death but not on place of birth/death
- 595 had a preceding birth event which matched on place (age at death not specified)
- the remaining 888 death events had a preceding birth event for a person with the same name, but no other factors to indicate they apply to the same individual
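As a quick sanity check on these tallies, the few lines of Python below add up the four categories and express the matched total as a share of all death events. The figures are the ones listed above; the dictionary labels are just shorthand of my own.

```python
deaths_total = 9136

# The four match categories from the list above.
matched = {
    "age and place": 1387,
    "age only": 1558,
    "place only (no age recorded)": 595,
    "name only": 888,
}

matched_total = sum(matched.values())
matched_share = matched_total / deaths_total
unmatched_total = deaths_total - matched_total

print(matched_total, f"{matched_share:.1%}")            # 4428 48.5%
print(unmatched_total, f"{1 - matched_share:.1%}")      # 4708 51.5%
```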
This means that nearly half (about 48%) of the death events have at least one plausibly corresponding birth event. In fact, even the more likely matches may have one or more competing birth events, as this screenshot shows:
Here, two William Kerridges were born in the third quarter of 1871. One lived to 24, the other didn’t make it to his second birthday. The fact that there is a place match on both entries increases the likelihood that one didn’t move from Downham, and the other didn’t move from Ipswich.
Further work is required to understand why just over half (about 52%) of death events don’t have any corresponding birth event. Of course, some will relate to births which happened prior to 1837, and so are not recorded in the FreeBMD data. Another possible factor would be variation in the recording of names, particularly forenames; my current approach assumes an exact match on name. Finally, there will be all the married women who finished their life with a different surname to the one they started it with.
The source data is imprecise both geographically and temporally. Each District covers a significant area, and each date is recorded only as falling within a three-month span. That span is the quarter of registration, so the event itself may have happened before it. Therefore any matches will only be indicative, rather than “certain”. A better indication of the likelihood of a match could be obtained by checking the total number of births for people with the same name; less common names will clearly be more amenable to this approach.
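One crude way to express the name-frequency idea is to weight each candidate by the reciprocal of the number of births recorded under that name. The Python sketch below does only that; the 1/n weighting is an assumed heuristic rather than something I have tested, and the sample names are invented.

```python
from collections import Counter


def name_counts(births):
    """Count birth events per (surname, forenames) pair."""
    return Counter(births)


def rarity_weight(name, counts):
    """Assumed heuristic: with n births under a name, each gets weight 1/n."""
    n = counts.get(name, 0)
    return 1.0 / n if n else 0.0


# Invented sample: one common name and one rare one.
births = [
    ("Kerridge", "William"),
    ("Kerridge", "William"),
    ("Kerridge", "William"),
    ("Kerridge", "William"),
    ("Light", "Zebedee"),
]
counts = name_counts(births)

print(rarity_weight(("Kerridge", "William"), counts))   # 0.25
print(rarity_weight(("Light", "Zebedee"), counts))      # 1.0
```

A weight like this could be combined with the age and place categories, so that a name-only match on a rare name counts for more than one on a common name.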
Search functionality used
In order to carry out these identity matching experiments, I set up a number of indexes to my Modes data, including indexes on surname and given name(s). I also set up a CGI program which allowed me to retrieve XML resources via HTTP. Everything else was done within an XSLT transform, which started with a death event record, and used the XSLT document() function to grab all the events relating to people with the same name as an XML “node set”. All the date and place comparison was done by XSLT.
Thus, if FreeBMD were to provide an HTTP query service which allowed searching of BMD events on surname and/or given name, and which returned an XML response, it would be possible to carry out the sort of identity inference described above on the complete FreeBMD database. In fact, the current online search facility already supports many more search criteria than this; the key development would be to re-cast the existing search functionality so that it can be invoked by an HTTP request, and so that it will deliver a machine-processable response.
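The request-and-parse pattern described in this section might look something like the following. This is a Python stand-in for illustration, not my XSLT, and not a description of any real FreeBMD interface: the endpoint URL, the query parameter names and the XML element and attribute names are all hypothetical. The response is a canned string so that the sketch runs without a network connection.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint; no such service is known to exist at this address.
BASE_URL = "http://example.org/cgi-bin/events"


def query_url(surname, forenames):
    """Build the GET request for all events recorded under one name."""
    return BASE_URL + "?" + urlencode({"surname": surname, "forenames": forenames})


# Hypothetical response format, standing in for what the service might return.
SAMPLE_RESPONSE = """<events>
  <event type="birth" year="1871" quarter="3" district="Downham"/>
  <event type="birth" year="1871" quarter="3" district="Ipswich"/>
  <event type="death" year="1896" quarter="1" district="Downham" age="24"/>
</events>"""


def births_before(xml_text, year, quarter):
    """Return the birth events registered no later than the given quarter."""
    root = ET.fromstring(xml_text)
    found = []
    for ev in root.findall("event[@type='birth']"):
        registered = (int(ev.get("year")), int(ev.get("quarter")))
        if registered <= (year, quarter):
            found.append(ev)
    return found


print(query_url("Kerridge", "William"))
for ev in births_before(SAMPLE_RESPONSE, 1896, 1):
    print(ev.get("district"), ev.get("year"), ev.get("quarter"))
```

In the XSLT version, the equivalent of query_url is the string passed to document(), and the filtering is done with XPath expressions over the returned node set.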