Welcome to the third in a series of technical posts about working with beneficial ownership (BO) data. Over the past three years, Open Ownership (OO) has been building a prototype Global Beneficial Ownership Register to prove the feasibility and value of linking BO data across jurisdictions. We have used this project to experiment with a variety of tools and processes, from standardising data under the Beneficial Ownership Data Standard (BODS) to providing a search engine, disclosure viewer, data archive, and network visualisation (amongst others). As this phase of our work draws to a close, we are documenting what we have learned and sharing our experiences with data users and publishers in the public and private sectors.
In Parts One and Two, we covered the broad topic of modelling BO data, then the significant challenge of reconciling data from different sources. This blog post expands on reconciliation to discuss the related subject of finding duplicated data and merging records. We will cover the modelling approaches needed to make merging possible and, crucially, reversible; different strategies one can adopt to find duplicates; and finally, what publishers can do to make duplicates less likely.
In a perfect world, where every BO record can be reconciled through unique identifiers, we would not have any duplicated data. Of course, the world is not perfect and in many disclosure regimes unique identifiers are not available, particularly for people. This ultimately results in less valuable data, because one of the most common questions asked of BO data is “what does this person own?” When a person’s records are duplicated, that question is not trivial to answer.
Here is an illustrative example from the UK’s People with Significant Control (PSC) register:
Despite his uniqueness as heir to the British throne, the nature of the UK’s BO reporting means that it is impossible to know for sure that the 17 different records returned for Prince Charles refer to the same person. Each ownership interest is reported separately by the company involved, and there is no uniquely identifying information across those records which can link Prince Charles definitively.
This problem is not unique to people, nor to the UK. Across the other data sources we import into the register, there are many situations that also require deduplication, for example:
Eagle-eyed viewers will notice in the screenshot above that the Open Ownership Register has succeeded in merging some records. It has achieved this by matching them exactly on the name, date of birth, and address. However, even with a name as regal as Charles Philip Arthur George Mountbatten Windsor, one sees a lot of variation. The register takes a particularly conservative approach, requiring exact matches across every field in an attempt to limit any mismatches. Other applications might prefer to be less strict, but no deduplication strategy will be perfect; it naturally reflects the risk appetite of the data’s end users. As a project with a small budget for corrections and a wide audience, we would rather not be wrong.
Many of the technical strategies for deduplication look similar to those discussed with regards to reconciliation in Part Two of this series. The difference is that users are not relying on some external oracle to identify the data, but rather looking within their own data to find likely matches. Similar techniques of data normalisation can increase that likelihood, so tools like Open Corporates and libpostal as well as simple text trimming and formatting can all help. We introduced and discussed a lot of these broad concepts in our earlier post, so here we are going to take a deeper dive into two tools to help put these approaches into practice.
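To make this concrete, here is a minimal sketch of the kind of simple normalisation we mean, written in Python. The field names and clean-up rules are illustrative assumptions rather than the Register’s actual code; a real pipeline might also run addresses through libpostal before comparing them.

```python
import re
import unicodedata


def normalise(value: str) -> str:
    """Apply simple text clean-up before matching."""
    # Strip accents, fold case and trim surrounding whitespace
    value = unicodedata.normalize("NFKD", value)
    value = "".join(c for c in value if not unicodedata.combining(c))
    value = value.casefold().strip()
    # Collapse punctuation and repeated whitespace so "1, Main St." == "1 main st"
    value = re.sub(r"[^\w\s]", " ", value)
    value = re.sub(r"\s+", " ", value)
    return value


# Illustrative record only; real pipelines normalise every field used for matching
record = {
    "name": "  Charles Philip Arthur George MOUNTBATTEN-WINDSOR ",
    "address": "Buckingham Palace, London, SW1A 1AA",
}
normalised = {field: normalise(value) for field, value in record.items()}
```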
With any complex matching system, users will be dealing with heuristics to find matches and searching across all of their data. This is basically a perfect situation for a search engine. On the Open Ownership Register we use Elasticsearch to power our search functions, so it made sense for us to use it to power duplicate matching, too. There are other search engines available that have similar features, and can give equally good results.
The basic approach is to set up a new index specifically for duplicate finding. It should contain normalised versions of all the fields that are useful to the matching process: names, addresses, dates of birth, and so on. Elasticsearch has a powerful analysis system (analyzers, token filters and normalizers) for things like lowercasing, stripping accents and handling alternate spellings, which can help with this. Specialised queries can then use the full power of the search engine to find duplicates for each record. In addition to being extremely flexible (and potentially very fast), this approach automatically gives users a scoring system for the results it finds.
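As an illustration, a dedicated duplicate-finding index might be configured along these lines. The field names and analysis settings below are assumptions made for the sake of the example, not our production mapping.

```python
# Illustrative index configuration: a custom analyzer that lowercases and
# strips accents from the fields we match on, plus an exact (but still
# normalised) keyword copy of the name for strict matching.
dedupe_index = {
    "settings": {
        "analysis": {
            "analyzer": {
                "normalised_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                }
            },
            "normalizer": {
                "normalised_exact": {
                    "type": "custom",
                    "filter": ["lowercase", "asciifolding"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "normalised_text",
                "fields": {
                    "exact": {"type": "keyword", "normalizer": "normalised_exact"}
                },
            },
            "address": {"type": "text", "analyzer": "normalised_text"},
            "date_of_birth": {"type": "keyword"},
        }
    },
}

# This could be sent as the body of a `PUT /bo-dedupe` request, or created
# via the official Python client, e.g.
# Elasticsearch(...).indices.create(index="bo-dedupe", **dedupe_index)
```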
The key to success in this setup is defining the search query precisely and then preparing the data in the index to make searches fast. A simple search for a name amongst all the data will likely return lots of results, and it will be difficult to discriminate between them. Experiment with the chosen search engine’s query language, combining fuzzier searches on some fields with exact matches on others. At the outset, some of these combinations will be expensive to run, but once a clear approach has been found, the search index can be optimised to speed them up.
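Building on the sketch above, a duplicate query for a single person record might combine a fuzzy match on the name with a strict filter on the date of birth. Again, the field names here are illustrative assumptions.

```python
def duplicate_query(record: dict) -> dict:
    """Build an Elasticsearch query that is fuzzy on name but strict on date of birth."""
    return {
        "query": {
            "bool": {
                "must": [
                    # Allow minor spelling variation in names
                    {"match": {"name": {"query": record["name"], "fuzziness": "AUTO"}}},
                    # ...but never match across different dates of birth
                    {"term": {"date_of_birth": record["date_of_birth"]}},
                ],
                "must_not": [
                    # Exclude the record we are searching from
                    {"ids": {"values": [record["id"]]}},
                ],
            }
        }
    }
```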
An alternative to a hand-tuned, heuristics-based search engine is a machine learning approach. Dedupe.io is an example of this, available as self-hosted software or as a hosted service, where users “train” a model by giving it examples of duplicates, and its algorithms work out from those examples which characteristics of the data connect duplicates. Dedupe.io has some extremely detailed documentation, which explains not only how it calculates matches, but also how it makes the matching process performant across large datasets. This is recommended reading for anyone interested in de-duping, regardless of whether or not you use their service.
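For a flavour of this workflow, here is a minimal sketch using the open-source dedupe Python library that underpins Dedupe.io, assuming its 2.x API. The toy records, field definitions and threshold are illustrative only.

```python
import dedupe

# A toy dataset: in practice this would be the full set of person records
records = {
    "psc-1": {"name": "charles philip arthur george mountbatten windsor",
              "address": "buckingham palace london", "date_of_birth": "1948-11"},
    "psc-2": {"name": "charles p a g mountbatten-windsor",
              "address": "buckingham palace, london", "date_of_birth": "1948-11"},
}

# Field definitions tell the model which comparisons to learn weights for
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String", "has missing": True},
    {"field": "date_of_birth", "type": "Exact", "has missing": True},
]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(records)   # sample candidate pairs for labelling
dedupe.console_label(deduper)       # interactively label example pairs as dupes or not
deduper.train()

# Cluster records that the trained model scores above the threshold
clusters = deduper.partition(records, threshold=0.5)
for record_ids, scores in clusters:
    print(record_ids, scores)
```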
No matter how clever a matching algorithm is, there will always be mistakes to deal with. Therefore, how the merging aspect of a deduplication system is implemented is extremely important. In the Open Ownership Register, we refer to two kinds of merges: hard and soft. Hard merges leave no trace: there is no way to undo them. Soft merges are where two records are functionally merged, but that merging process can be undone at a later date. We achieve this by modelling a merge as another connection in a graph:
Technically, this is achieved in our database by adding another relationship between the merged people. We pick the person record with the most owned companies and then link the others to it. Then, within our application code, we enforce the primacy of that record during all interactions with it or any of the other merged records. Crucially, this design allows us to connect those secondary records to a new primary, should any error come to light. The Open Ownership Register does not do this yet, but a more generalised form of this approach would allow a system to present multiple interpretations of merged data, with different connections for different algorithms used to match records.
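To illustrate the idea, here is a simplified sketch of how a reversible (“soft”) merge might be represented. The class and field names are our own for this example, not the Register’s actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    id: str
    name: str
    owned_companies: List[str] = field(default_factory=list)
    # A soft merge is just a pointer to the primary record; it can be
    # redirected later without losing the original data.
    merged_into: Optional[str] = None


def soft_merge(people: List[Person]) -> Person:
    """Pick the record with the most owned companies as primary and point
    the rest at it, leaving every original record intact."""
    primary = max(people, key=lambda p: len(p.owned_companies))
    for person in people:
        if person is not primary:
            person.merged_into = primary.id
    return primary


def unmerge(person: Person) -> None:
    """Reverse a merge if it turns out to be wrong."""
    person.merged_into = None
```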
The second key consideration, especially when deduping runs automatically, is being able to monitor what is happening and, in particular, to alert when data indicates that previous merges are no longer accurate. In the data model explained by the diagram above, any time a record’s primary, or its status as a primary, changes, that is an indication that some new information has come to light. Alerting on every such change could produce too much noise, so systems might instead guard against these changes during normal usage and run separate error-finding processes on a regular schedule. The Open Ownership Register manages this by logging such changes as they happen and reviewing them periodically; our de-duplications are run manually and sporadically, so we can dedicate time to this analysis.
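Here is a sketch of what that guard-and-log step could look like, using an illustrative record structure rather than our actual implementation.

```python
import logging

logger = logging.getLogger("dedupe.merges")


def reassign_primary(record: dict, new_primary_id: str) -> None:
    """Redirect a secondary record to a different primary, flagging the
    change for later review rather than alerting on it immediately."""
    old_primary_id = record.get("merged_into")
    if old_primary_id is not None and old_primary_id != new_primary_id:
        # A previously merged record is moving to a new primary: a hint that
        # an earlier merge may have been wrong, so queue it for review.
        logger.warning(
            "Re-merge detected: %s moved from primary %s to %s",
            record["id"], old_primary_id, new_primary_id,
        )
    record["merged_into"] = new_primary_id
```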
The final key component of merging duplicates is providing transparency to end users of a system about what has happened: explaining first that merging has happened and which algorithm or criteria were used, and then, ideally, allowing users to see the original records that were merged.
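One way to support this is to store a small provenance entry alongside each merge for the user interface to display. The structure below is purely illustrative.

```python
from datetime import datetime, timezone

# An illustrative provenance entry stored alongside each merge, so the UI
# can explain what was merged, when, and by which criteria, and link back
# to the untouched source records.
merge_provenance = {
    "primary_record": "person-123",
    "merged_records": ["person-456", "person-789"],
    "criteria": "exact match on normalised name, date of birth and address",
    "merged_at": datetime.now(timezone.utc).isoformat(),
    "reversible": True,
}
```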
Our work at Open Ownership speaks to both users and publishers of data. When we talk about duplicates or merging records, we are often fixing issues that could have been solved by the original publishers of that data. However, what a publisher could possibly do and what they can feasibly do within their current disclosure regime often differ hugely. Below are some of our most pragmatic suggestions, in roughly ascending order of complexity:
Overall, deduplication is a chore that everyone who works with BO data in the real world will have to deal with. There are some great high-tech solutions that can be put to work, but ultimately one of our aims at Open Ownership is to make them less necessary. Through our work on BODS and our principles for effective BO disclosure, we hope to increase the quality of BO data around the world, and reducing duplicates is one key measure of that.