Record Linkage - Maintaining a Registry
Record linkage: Exact Matching
Record linkage is a framework to bring together corresponding records from two or more data sources (or to find duplicates within a file). There are two general approaches to linking data files: exact matching and statistical matching. Exact matching differs from statistical matching in several ways. The most important difference is that in an exact match, one assumes that the individuals on the two files are the same, whereas in a statistical match one assumes that the individuals come from the same population, but there is no need to assume that the two files have the same individuals. For this case study, we will concentrate on exact matching techniques, since we are in the context of a registry and need to maintain exact information on our population of interest.
Unique identifiers rarely exist. It is therefore necessary to use identifying variables such as surname, given names, date of birth, etc. in order to link records from the two sources. In many cases, such identifying characteristics may not be unique to a particular individual. They may change over time, they may have been recorded incorrectly, or they may be missing in certain records. In this case study we want to perform an exact match, linking data from tax, deaths and births for the same people on each file in order to update the registry.
Deterministic Record Linkage
A match in deterministic linkage is made when a sufficient number of identifiers agree between two records. In the simplest and most restrictive case, all identifiers are required to agree. More flexible (hierarchical) rules can be used which allow some pre-defined subset of identifiers to “determine” a link.
One of the major limitations of deterministic linkage is that it considers each identifier to be of equal quality. Agreement on one identifier provides no stronger evidence for a link than agreement on any of the others. Consequently, it is impossible to resolve ties, which occur when one record matches with two (or more) others on the same number of identifiers.
In practice, identifiers differ in the amount of information they contain about an individual. Real data contain missing or incorrect values, with some identifiers coded more reliably than others. A single miscoded value can cause a link to fail, even if the evidence for a link based on other identifiers is perfect, or conversely, it can cause a record to be incorrectly linked.
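The rule structure described above can be sketched in a few lines. This is an illustrative sketch only: the identifier names and the hierarchical rule below are assumptions for the example, not fields defined by the case study files.

```python
# Illustrative sketch of deterministic linkage with a hierarchical rule.
# The identifier names are hypothetical, not taken from the case study files.
IDENTIFIERS = ["surname", "given_name", "birth_date", "sex"]

def deterministic_match(a: dict, b: dict) -> bool:
    """Return True when the records satisfy one of the pre-defined rules."""
    agree = {f for f in IDENTIFIERS if a.get(f) and a.get(f) == b.get(f)}
    # Rule 1 (strict): all identifiers agree.
    if agree == set(IDENTIFIERS):
        return True
    # Rule 2 (hierarchical): a pre-defined subset determines a link.
    return {"surname", "birth_date", "sex"} <= agree
```

Note that both rules treat every identifier as equally reliable: a single typo in the surname alone is enough to break both rules, which is exactly the limitation discussed above.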
Probabilistic Record Linkage
One way to take this difference into account is probabilistic linkage. Here we are not only concerned with how many identifiers match, but also which ones. A match on three strong identifiers will be taken over a match on three weaker ones, whereas in deterministic linkage this would have resulted in a tie. Fellegi and Sunter (1969) presented this idea in a formal mathematical framework. In addition to this approach, neural networks, bipartite graphs, and fuzzy logic have been used in linkage projects.
The importance of an identifier is measured by calculating the amount of information conveyed by the values of the variable. Variables with many potential values, such as day of birth or month of birth, usually contain more information than ones with few, such as sex. It is much more likely, for example, that two records selected at random will have the same sex than the same birthday. A match on birthday is then considered stronger evidence for a link than a match on sex, because of the much higher probability of the match on sex being due entirely to chance. This information, the probability of two random pairs having the same values, is used in conjunction with the probability that two real pairs have the same values to estimate the odds of a true pair.
Many areas rely on record linkage techniques to update registry information and to perform subsequent reports and analysis with the updated data set.
The variables used in a linkage project should be chosen based on the following criteria:
- Permanent: exist at birth and remain unchanged,
- Universal: every member of the population has it,
- Reasonable: persons do not object to the information being disclosed,
- Economical, Simple, Available, Known, Accurate, Unique.
Then, once the variables have been identified, the files need to be cleaned and standardized.
Cleaning (pre-processing) the File
Any data set contains errors. We need to minimize these errors and, more importantly, standardize the files.
1. Standardize name
- There are many ways that the student may standardize the name (remove spaces, hyphens or other characters).
- Two common methods to standardize names are: Soundex and NYSIIS.
2. Standardize dates
- Remove spaces, hyphens or other characters.
- Ensure a common format and order for day, month, year.
3. Standardize geographical data
- Remove spaces, hyphens or other characters.
- Postal Code
- Sub-components of Postal Code (FSA: Forward Sortation Area)
4. Assess missing data
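As an illustration of name standardization, here is a simplified American Soundex encoder (a sketch assuming ASCII names; the full algorithm has further conventions, and NYSIIS uses a different, name-specific rule set):

```python
# Simplified American Soundex: keep the first letter, encode the remaining
# consonants as digits, collapse adjacent duplicate codes, pad/truncate to 4.
CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(name: str) -> str:
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    result = name[0]
    prev = CODES.get(name[0], "")
    for c in name[1:]:
        if c in "HW":             # H and W do not separate equal codes
            continue
        d = CODES.get(c, "")      # vowels get "" and act as separators
        if d and d != prev:
            result += d
        prev = d
    return (result + "000")[:4]
```

For example, "Smith" and "Smyth" both encode to "S530", so a comparison or blocking step based on the code tolerates that spelling variation.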
There are several methods that can be used to link data. For this case study we will concentrate on probabilistic record linkage; however, feel free to work with other methods such as neural networks, bipartite graphs, and fuzzy logic.
Probabilistic Record Linkage: Fellegi-Sunter
In this type of linkage, one compares all possible pairs in order to determine which ones are most likely to be true pairs. Each pair is assessed using rules. Each outcome in a rule has an associated weight, and these weights are summed across all rules to get the total weight, which is used to assess the likelihood of a true pair.
A rule is a comparison of two fields. For example, one might compare agreement on the month of birth in two files.
For each rule one creates a series of outcomes. Using the comparison of month of birth the possible outcomes one might use are:
- Agreement: the months match perfectly,
- Disagreement: the months are more than 2 months apart,
- Partial agreement: the months are within 2 months of each other.
* Note that the outcomes of different rules are in general treated as independent; this assumption is what allows their weights to be added together.
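The three outcomes above can be coded as a small comparison function (a sketch: months are taken as integers 1–12 and the distance is a plain absolute difference, so a circular distance for December vs. January would be a refinement):

```python
def month_outcome(m1: int, m2: int) -> str:
    """Classify a month-of-birth comparison into one of three outcomes."""
    if m1 == m2:
        return "agreement"            # the months match perfectly
    if abs(m1 - m2) <= 2:
        return "partial agreement"    # within 2 months of each other
    return "disagreement"             # more than 2 months apart
```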
For each outcome in a rule we will calculate a weight. This is the combination of certain probabilities. We need to calculate the following:
- P(Agreement | Pair)
- P(Agreement | Not a Pair)
- P(Disagreement | Pair)
- P(Disagreement | Not a Pair)
It may not seem obvious that these should be anything other than 1 or 0; however, the data contain errors, so even records that form a true pair may not agree on every identifier.
Using month of birth and looking at the outcome of agreement: there are 12 months, so if we take two random people who are not a pair, the chance that they agree on month of birth is 1/12. The probability of disagreement for two who are not a pair is then 11/12.
We can also make some assumptions about how well the data are coded or captured. If the error rate is low, we might assume that there is a 90% chance that the data are entered correctly. If two records are really a pair, then allowing for data errors, the probability that they agree would be 0.90, while the probability that they disagree given they are a true pair is 0.10.
The ratios of these probabilities give the odds that an agreement comes from a true pair rather than a false pair. A template with example weight calculations is given in the file: Example Weights
The global weight is the sum of the weights for each outcome.
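Putting the month-of-birth numbers together, the agreement and disagreement weights are commonly taken as log-ratios of the probabilities above (the base-2 logarithm here is a common convention, not something the case study prescribes):

```python
import math

# m-probabilities: outcome given a true pair; u-probabilities: given a random pair.
m_agree, u_agree = 0.90, 1 / 12     # P(agree | pair),    P(agree | not a pair)
m_dis,   u_dis   = 0.10, 11 / 12    # P(disagree | pair), P(disagree | not a pair)

w_agree = math.log2(m_agree / u_agree)   # positive: evidence for a link
w_dis   = math.log2(m_dis / u_dis)       # negative: evidence against a link

# The global weight for a candidate pair is the sum of one such weight per rule.
```

Here w_agree ≈ 3.43 and w_dis ≈ -3.20: an agreement on month of birth multiplies the odds of a true pair by a factor of about 10.8.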
Since this method compares all possible pairs, it makes sense to only create pairs that could possibly be a match. One way to do this is to create blocks. Only records within a block are compared. For example, if we are relatively certain that sex is coded correctly on the files, one could block on sex, comparing males only with males and females only with females.
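A minimal blocking sketch follows; the blocking key is the caller's choice (sex, or a phonetic code of the surname, are typical options):

```python
from collections import defaultdict

def blocked_pairs(file_a, file_b, key):
    """Yield candidate pairs, comparing only records that share a block key."""
    blocks = defaultdict(list)
    for rec in file_b:
        blocks[key(rec)].append(rec)
    for rec in file_a:
        for candidate in blocks.get(key(rec), []):
            yield rec, candidate
```

With two files of n records each split into k roughly equal blocks, this cuts the number of comparisons from about n² to about n²/k.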
The idea of setting a threshold is best illustrated with a diagram. A theoretical frequency plot of global weights is bi-modal: the lower-weight mode corresponds to the false pairs and the higher-weight mode to the true pairs, and the two distributions overlap. One must choose two thresholds. Pairs above the upper threshold are considered true pairs, those below the lower threshold true non-pairs, and those in between possible pairs. The possible pairs need to be reviewed manually to determine whether they are in fact true or false pairs. The thresholds should be set to minimize the amount of manual review while also minimizing type I and type II misclassification errors.
Assessing the Linkage
How many pairs did we find? How accurate were we? Did we create many false pairs?
This helps to not only decide on threshold levels, but also helps to decide on the discriminating power of the rule set we are using. The following two graphs illustrate this. In the first example, only two rules are used in the calculation of the global weight. We can see that the distribution is not bi-modal and to distinguish between true and false pairs would be nearly impossible. In the second example, 15 rules are used and we can see that the two populations are much more easily separated.
The proportion of true pairs found will generally be less than 100 percent; even with a large number of rules and a long manual review process, some links will be missed.
For each area of the graph one can select a small sample of pairs and, with manual review, determine the number of true and false pairs. The false-pair rate above the upper threshold should be near 0%, the rate in the grey area in between will be higher, and the rate below the lower threshold should be near 100%. From these counts one can estimate the probability of missing a true pair and the probability of creating a false pair.
One can easily set up a program (in SAS or Excel) to perform a record linkage. However, here is a list of several available record linkage software packages: LINKS: A Record Linkage Package, GRLS, The Link King: Record Linkage and Consolidation Software, Netrics: Intelligent Record Matching™ software, Identity Search Server, and Surematch.
LINKS is a freeware record linkage package developed at the University of Manitoba; GRLS (Generalized Record Linkage Software) is a record linkage package developed and marketed by Statistics Canada; The Link King is a freeware SAS/AF application using the probabilistic algorithms developed by MEDSTAT for the Substance Abuse and Mental Health Services Administration's (SAMHSA) integrated database project; Netrics Intelligent Record Matching™ is US commercial software that uses a machine-learning matching model; Identity Search Server™ (ISS) is US commercial software that provides search, matching, duplicate discovery, and relationship linking for all forms of identification data using “fuzzy” indexes; Surematch is a freeware product that will standardize, phonetically search, match, review, correctly case, and de-duplicate a database.
Glossary of Terms
There are various terms used in record linkage. Some of these have been defined in: Newcombe, H.B. (1988). Handbook of Record Linkage: Methods for Health and Statistical Studies, Administration and Business. Oxford, U.K.: Oxford University Press, pp. 103-106.
The terms used in that book are as follows:
Blocking:
The use of sequencing information (e.g., the phonetically coded versions of the surnames) to divide the files into “pockets.” Normally, records are only compared with each other where they are from the same “pocket,” i.e., have identical blocking information. The purpose is to avoid having to compare the enormous numbers of record pairs that would be generated if every record in the file initiating the searches were allowed to pair with every record in the file being searched.
Denominator:
This usually refers to the denominator in a FREQUENCY RATIO, i.e., the frequency of a given comparison outcome among UNLINKABLE pairs of records brought together at random. It may be applied also to one of the two components of any ODDS.
Frequency Ratio:
The frequency of a given comparison outcome among correctly LINKED pairs of records, divided by the corresponding frequency among UNLINKABLE pairs brought together at random. The comparison outcome may be defined in any way, for example as a full agreement, a partial agreement, a more extreme disagreement, or any combination of values from the two records that are being compared. The FREQUENCY RATIO may be specific for the particular value of an identifier when it agrees, or for the value of the agreement portion of an identifier that partially agrees, or it may be non-specific for value.
General Frequency:
A weighted mean of the frequencies of the various values of an identifier among the individual (i.e., unpaired) records of the file being searched. It is non-specific for value. Value-specific frequencies are also obtained from the same source.
Global Frequency:
The frequency of a comparison outcome among pairs of records, when that outcome is defined in terms that are non-specific for the value of the identifier. The outcome may be a full agreement, a partial agreement, or a more extreme disagreement. The record pairs may be those of a LINKED file, or they may be UNLINKABLE pairs brought together at random. Only in the special case of the full agreement outcomes are the global and the general frequencies numerically equal, but they always remain conceptually different. The difference is that a global frequency, although value non-specific, always reflects the full definition of the outcome, including the non-agreement portion of that definition. A general frequency cannot do this because it is based on a file of single (i.e., unpaired) records.
Global Frequency Ratio:
The global frequency for a particular comparison outcome among LINKED pairs of records, divided by the corresponding frequency among UNLINKABLE pairs. It is equivalent to the global ODDS. GLOBAL FREQUENCY RATIOS for agreement outcomes and partial agreement outcomes are often subsequently converted to their value-specific counterparts during the linkage process. The conversion is accomplished by means of an adjustment upwards where the agreement portion of the identifier has a rare value, and an adjustment downwards where the value is common.
Record Linkage:
In its broadest sense, RECORD LINKAGE is the bringing together of information from two or more records that are believed to relate to the same “entity.” For an economic or social study, the “entities” in question might be farms or businesses. For a health study, the “entities” of special interest are usually individual people or families. It is in the latter sense that the word is used throughout this book.
Linked (and Unlinkable):
In line with the above definition of “record linkage,” LINKED pairs of records are pairs believed to relate to the same individual or family (or other kind of entity). Record pairs brought together and judged not to relate to the same individual or family may be referred to as “UNLINKABLE” pairs. For short, the two sorts of pairs are sometimes called “LINKS” and “NON-LINKS,” respectively. As used here, the term implies that some sort of decision has been reached concerning the likely correctness of the match.
Match:
This word is variously used in the literature on record linkage. In this book, however, it is given no special technical meaning and merely implies a pairing of records on the basis of some stated similarity (or dissimilarity). For example, early in a linkage operation, records from the two files being LINKED are normally matched for agreement of the surname code. The resulting pairs may also be called “candidate pairs” for linkage, but this emphasis is most appropriate in the later stages when the numbers of competing pairs have diminished. Pairs of records will frequently be spoken of as “correctly matched,” “falsely matched,” or “randomly matched.”
Numerator:
This usually refers to the numerator in a FREQUENCY RATIO, i.e., the frequency of a given comparison outcome among pairs of records believed to be correctly LINKED. It may be applied also to one of the two components of any ODDS.
Odds:
This word is used in its ordinary sense but is applied in a number of situations. As relating to a particular outcome from the comparison of a given identifier it is synonymous with the FREQUENCY RATIO for that outcome. As relating to the accumulated FREQUENCY RATIOS for a given record pair it refers to the overall RELATIVE ODDS. It is also applied to the overall ABSOLUTE ODDS.
Outcome:
This refers to any outcome or result from the comparison of a particular identifier (or concatenated identifiers) on a pair of records, or the comparison of a particular identifier on one record with a different but logically related identifier on the other. It may be defined in almost any way, for example as an AGREEMENT, a PARTIAL AGREEMENT, a more extreme DISAGREEMENT, any other SIMILARITY or DISSIMILARITY, or the absence of an identifier on one record as compared with its presence or absence on the other. An outcome may be specific for a particular value of an identifier (e.g., as it appears on the search record) or for any part of that identifier, especially where there is an agreement or partial agreement; it may be non-specific for value; or it may even be specific for a particular kind of DISAGREEMENT defined in terms of any pair of values being compared.
Value:
An identifier (e.g., an initial) may be said to have a number of different “values” (e.g., initial “A,” initial “B,” and so on). Surnames, given names, and places of birth have many possible values. Other identifiers tend to have fewer values that need to be distinguished from each other.
Weight:
In the literature, this term has been widely applied to the logarithms of various entities, such as:
- a FREQUENCY RATIO for a specified outcome from the comparison of a given identifier,
- the product of all the FREQUENCY RATIOS for a given record pair,
- the NUMERATOR of a particular FREQUENCY RATIO,
- the DENOMINATOR of a particular FREQUENCY RATIO,
- any estimate of such a numerator or denominator, not obtained directly from a file of matched pairs of records.
The use of the logarithm is merely a convenience when doing the arithmetic; it does not affect the logic except to make it appear more complicated. The term “WEIGHT” has therefore been employed sparingly in this book. Instead, reference has been made directly to the source frequency or FREQUENCY RATIO, or to the estimates of these, wherever possible.
In many research and administrative situations, the need arises to link two or more data sources. The student assignment is to complete a record linkage project that simulates updating a register. First pre-process the files (standardize the names, parse the addresses), then use existing record linkage software (or write your own) to match the files together. The registry is updated with new records from a source like the ‘birth file’, people are moved to new addresses using a file like the ‘drivers file’, and people are removed by linking them to a ‘death file’. When the matching is complete, the student will update the ‘registry file’, evaluate the linkages, and discuss the strengths and weaknesses of their approach.
The data sets consist of completely synthetic data constructed to simulate registry data for the purpose of this case study, using SAS 9.1. The synthetic files have name, date of birth, and date of death frequencies similar to those of a sample of residents of the province of Prince Edward Island. No information from real individuals is used in this case study.
The data sets are: register, births, drivers and deaths.
The file layouts are given in: Data Layout.
The register is a synthetic database that represents the population of PEI. The file contains a unique id, the name, date of birth, and address information for residents in the province of PEI. This is the file that we are working to maintain, to form the basis for current and future studies.
Births for a registry are persons who enter the population of interest, for example by moving into an area of interest, or attaining a certain age. The data file on births contains both present and previous address information as well as complete name and date of birth information. As data sources are updated independently it is possible that the registry may link to either the old or new address.
The information on the drivers file helps capture moves and births for a registry. People living in Canada may take out their first license, get a new license (moving in from out of province), or update their present license (within the province). This file has information on names, addresses, and dates of birth.
The deaths file contains simulated information contained on a vital statistics registry. The data includes the following information: names, address, date of birth and date of death.