Use of name recognition software, census data and multiple imputation to predict missing data on ethnicity: application to cancer registry records
© Ryan et al; licensee BioMed Central Ltd. 2012
Received: 18 May 2011
Accepted: 23 January 2012
Published: 23 January 2012
Information on ethnicity is commonly used by health services and researchers to plan services, ensure equality of access, and for epidemiological studies. In common with other important demographic and clinical data it is often incompletely recorded. This paper presents a method for imputing missing data on the ethnicity of cancer patients, developed for a regional cancer registry in the UK.
Routine records from cancer screening services, name recognition software (Nam Pehchan and Onomap), 2001 national Census data, and multiple imputation were used to predict the ethnicity of the 23% of cases that were still missing following linkage with self-reported ethnicity from inpatient hospital records.
The name recognition software packages were good predictors of ethnicity for South Asian cancer cases when compared with data on ethnicity derived from hospital inpatient records, especially when combined (sensitivity 90.5%; specificity 99.9%; PPV 93.3%). Onomap was a poor predictor of ethnicity for other minority ethnic groups (sensitivity 4.4% for Black cases and 0.0% for Chinese/Other ethnic groups). Area-based data derived from the national Census were also a poor predictor of non-White ethnicity (sensitivity: South Asian 7.4%; Black 2.3%; Chinese/Other 0.0%; Mixed 0.0%).
Currently, neither method for assigning individuals to an ethnic group (name recognition and ethnic distribution of area of residence) performs well across all ethnic groups. We recommend further development of name recognition applications and the identification of additional methods for predicting ethnicity to improve their precision and accuracy for comparisons of health outcomes. However, real improvements can only come from better recording of ethnicity by health services.
This paper presents a method for imputing missing data on the ethnicity of cancer patients, developed for a regional cancer registry in the UK. It implements existing approaches in a novel situation and evaluates their utility. It combines four differing approaches to dealing with missing data of this type: the use of an additional source of self-reported ethnicity to replace the missing data; the use of name recognition software to predict the ethnicity of individuals; the use of Census data based on area of residence to predict the ethnicity of individuals; and finally, the use of multiple imputation (MI) to make an allowance for the use of these predictors in subsequent statistical analyses. The method has applications beyond cancer registries, and the results presented below are relevant to all organisations and researchers that have incomplete information on the ethnic group of individuals but have access to additional data that may help predict ethnicity when it is missing.
The West Midlands Cancer Intelligence Unit (WMCIU) is a regional cancer registry covering a population of approximately 5.3 million. The registry collects information about new cases of cancer and produces statistics about incidence, prevalence, survival and mortality: the availability of information on the sociodemographic characteristics of cancer cases, including ethnicity, is important for service planning, ensuring equal access to these services, and for epidemiological studies. The main source of information on the ethnicity of cancer cases for the WMCIU is linkage to the national Hospital Episode Statistics database (HES), which routinely collects the self-reported ethnicity of hospital patients. However, HES currently only provides information on patients admitted to hospital: it does not include patients who were assessed or treated in secondary care but not admitted, nor does it include information on patients seen privately outside the National Health Service (NHS). Nationally, only 80% of cancer cases have ethnicity data available from HES. This reliance on linkage with hospital admissions has implications for estimates of cancer incidence, prevalence and survival. The use of complete case analyses in this situation has the potential to cause bias: for example, prostate cancer cases who are not treated ('watchful waiting') may have longer survival than those whose cancer is more advanced and who are therefore admitted to hospital. The exclusion of the former from survival analyses will tend to underestimate survival overall and, if ethnicity is associated with treatment type, may obscure any differences in survival between ethnic groups. Similarly, cancer cases who are known to the WMCIU through death certification only (DCO) (the only information held about the case is based on a death certificate) have less complete linkage with HES.
Any comparison of cancer incidence by ethnic group restricted to complete cases has the potential to obscure differences between ethnic groups if people from ethnic minorities are over-represented among the DCO cases.
There are alternative approaches to dealing with missing information that allow all cases to be retained for analysis. The simplest of these is to link with additional external information sources that record ethnicity, such as the National Breast Cancer Screening Service (NBSS), but the population coverage, completeness and accuracy of these data may also be limited. A second commonly used approach is to use lists of names that are associated with particular ethnic groups. Name recognition software packages such as Nam Pehchan and SANGRA have been used in a variety of settings to identify people with South Asian heritage [3–6]. More recently, another package called Onomap has become available: this attempts to identify people from many ethnic groups [7, 8]. Each of these packages relies on individuals having a name that is strongly associated with a particular ethnic group, and their sensitivity and specificity are known to vary from setting to setting [9–11]. None of them can identify people who would describe themselves as having a mixed ethnicity, individuals whose surnames are not specific to ethnic groups, or the original ethnic group of individuals who adopt their partner's surname where the partner is a member of another ethnic group. Another approach is the use of census information on the ethnic distribution of the area in which the case lived. Examples of this are a study on the uptake of breast cancer screening services in London, and the development of a risk calculator for coronary heart disease based on data from a large set of general practices [12, 13]. This approach has the advantage of being easy to implement if the postcode of the individual is available, but relies on the assumption that the ethnic distribution of a Census area (approximately 1,500 people) is an accurate predictor of the ethnicity of individuals.
The use of MI, such as that implemented by Royston and colleagues in the statistical software package Stata, has the potential to create a complete dataset that combines the predictions generated by the above methods, and makes an allowance for the imprecision of these predictions that is carried through to the final statistical analyses. It requires the user to have an accurate understanding of the reasons why the data are missing (the missing data mechanism) and good predictors of the value of the missing data, and it assumes that the data are otherwise missing at random (MAR). It is, therefore, a combined approach which aims to maximise accuracy where missing values are predicted and to adjust the precision of any estimates derived from these predictions (e.g. by widening the confidence intervals on estimates of cancer incidence).
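The allowance for imprecision described above is conventionally made by pooling the estimates obtained from each imputed dataset using Rubin's rules. The following is a minimal Python sketch of that pooling step under that assumption; it is not the Stata code used in this project, and the function name `pool_rubin` is illustrative only.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.

    estimates: point estimate obtained from each imputed dataset
    variances: squared standard error obtained from each imputed dataset
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputations")
    pooled = mean(estimates)
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # sample variance between imputations
    total = within + (1 + 1 / m) * between
    return pooled, total
```

The between-imputation term is what widens the confidence interval: the more the imputed datasets disagree, the larger the total variance.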
We will present information on the sensitivity and specificity of each of these methods, and describe how these multiple sources were combined in a dataset that can be used for the estimation of cancer incidence and survival by ethnic group.
111 694 cancer cases normally resident in the West Midlands region of the UK and diagnosed between 1 January 2001 and 31 December 2007 were included in the cohort for this project. Cases were limited to the five most common cancer sites (breast, upper GI, lower GI, prostate, and lung) as there would be too few cases from the non-White ethnic groups to allow reliable comparisons of incidence and survival for the less common cancers.
The five ethnic groupings used were: White, South Asian, Black, Chinese/Other, and Mixed. These broad groupings coincide with those used in official national statistics.
Availability of existing data on ethnicity
Data on ethnicity were available for 85 506/111 694 (77%) of cancer cases from HES, although some individuals had more than one ethnic group recorded. Where this occurred (1506/85 506 (1.8%)), we used the most commonly recorded ethnic group, in line with the method recommended by Downing and colleagues. This approach is believed to be appropriate as it uses the most information. We set ethnic group to missing for cases with more than one ethnic group recorded but without a 'most common' ethnic group (154/85 506 (0.18%) of the complete cases), again in line with the method used by Downing. We then tabulated the characteristics of the cohort, and used a univariate chi-square analysis to identify demographic and clinical factors associated with missing ethnicity (the missing data mechanism) [17, 18].
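The rule for cases with more than one recorded ethnic group can be sketched as follows. This is an illustrative Python sketch of the rule as described in the text, not the code used in the project; the function name `resolve_ethnic_group` is an assumption.

```python
from collections import Counter


def resolve_ethnic_group(recorded):
    """Return the most commonly recorded group for one case.

    recorded: list of ethnic groups from the case's HES records
    (None entries are ignored). Returns None when nothing is recorded
    or when no single group is the most common (a tie).
    """
    counts = Counter(g for g in recorded if g is not None)
    if not counts:
        return None
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # no 'most common' group: ethnicity set to missing
    return ranked[0][0]
```

For example, a case recorded twice as White and once as South Asian resolves to White, whereas a case recorded once as each is set to missing.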
Additional source of ethnicity data
The 28 795 breast cancer cases from the cohort were linked to data held by the eight breast cancer screening services (NBSS) in the region using their NHS number. We assessed the value of this additional information by comparing sensitivity, specificity and positive predictive value of the ethnicity recorded in the NBSS using the HES dataset as the gold standard, where both were available.
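The three measures used in this comparison can be computed from paired labels as in the sketch below. It assumes one reference label (here, HES) and one predicted label per case, and restricts the calculation to cases where both are available; the function and variable names are illustrative.

```python
def diagnostic_measures(reference, predicted, group):
    """Sensitivity, specificity and PPV for one ethnic group.

    reference: gold-standard labels (e.g. from HES), None if unavailable
    predicted: labels from the source being assessed, None if unavailable
    group:     the ethnic group being evaluated
    A measure is None when its denominator is zero.
    """
    pairs = [(r == group, p == group)
             for r, p in zip(reference, predicted)
             if r is not None and p is not None]
    tp = sum(1 for r, p in pairs if r and p)
    fn = sum(1 for r, p in pairs if r and not p)
    fp = sum(1 for r, p in pairs if not r and p)
    tn = sum(1 for r, p in pairs if not r and not p)

    def ratio(num, den):
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),  # true members correctly identified
        "specificity": ratio(tn, tn + fp),  # non-members correctly excluded
        "ppv": ratio(tp, tp + fp),          # predictions that are correct
    }
```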
Prediction of ethnicity using name recognition software
Two name recognition applications were available for use in this project: Nam Pehchan and Onomap. Nam Pehchan was used to identify people with South Asian names, and Onomap was used to identify people with names associated with White, South Asian, Black and Chinese/Other ethnic groups. As early use of Nam Pehchan with this cohort showed that it included forenames which were common among other ethnic groups (e.g. 'Mona'), we decided to run it on forenames and surnames separately, rather than the default approach which was to run it on all forenames and surnames combined. This allowed matched surnames to carry a greater weight than matched forenames in the MI process.
Since Onomap only makes use of a single forename, only the first forename was used when the application was run. However, as cases could have more than one surname associated with their WMCIU record, the application was run with each combination of forename and surname for each case. If Onomap assigned a case to more than one broad ethnic group, the multiple results were replaced by a single result according to the following order of preference: Chinese/Other, Black, South Asian, White. This ordering corresponds with the relative size of each group within the regional population, with preference given to the less common ethnic groups. The sensitivity, specificity, and positive predictive value (PPV) of the two applications were compared with the ethnic group recorded in HES in order to assess the ability of each to identify the ethnicity of cancer cases.
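The order of preference described above amounts to a simple priority lookup. The sketch below illustrates it in Python; the constant and function names are assumptions made for the example, and the input is taken to be the list of broad ethnic groups returned for one case across its forename and surname combinations.

```python
# Order of preference stated in the text: less common groups first.
PRIORITY = ["Chinese/Other", "Black", "South Asian", "White"]


def single_onomap_result(results):
    """Collapse several broad-group results for one case into one.

    results: broad ethnic groups assigned to the case, one per
    forename/surname combination. Returns None if none is recognised.
    """
    found = set(results)
    for group in PRIORITY:
        if group in found:
            return group
    return None
```

A case assigned to both White and South Asian would therefore be treated as South Asian.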
Prediction of ethnicity using area-based Census data
Cases were assigned to the Census area (lower layer super output area (LSOA): average size 1500 persons) associated with their postcode of residence, and linked to a dataset with the ethnic distribution of each LSOA in the 2001 national Census.
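This linkage can be pictured as two lookups: postcode to LSOA, then LSOA to ethnic distribution. The sketch below is a hypothetical illustration only; the lookup tables, the function name and the postcode normalisation are assumptions and do not describe the registry's actual data structures.

```python
def area_ethnic_distribution(postcode, postcode_to_lsoa, lsoa_distribution):
    """Return the Census ethnic distribution for a case's area of residence.

    postcode_to_lsoa:  dict mapping normalised postcodes to LSOA codes
    lsoa_distribution: dict mapping LSOA codes to {group: proportion}
    Returns None when the postcode or the LSOA cannot be matched.
    """
    key = postcode.replace(" ", "").upper()  # ignore spacing and case
    lsoa = postcode_to_lsoa.get(key)
    if lsoa is None:
        return None
    return lsoa_distribution.get(lsoa)
```

The returned proportions describe the area, not the individual, which is why they can only serve as a predictor of a case's ethnicity.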
Full multiple imputation model
Table 1: Characteristics of the Project Cohort
Column recovered: cases with missing ethnicity following linkage with HES records (% of cases), with a test statistic and p-value for each characteristic.
Row labels recovered: year of diagnosis; deprivation (Income Domain of the Index of Multiple Deprivation), from 1 (most deprived) to 5 (least deprived); ever seen privately (cancer was diagnosed or treated outside the free National Health Service on at least one occasion); number of admissions (includes non-cancer admissions).
Test statistics recovered, in order of appearance (p < 0.001 for each): 3500, 206, 6300, 1100, 391, 2300, 4600, 4000, 616, 2400, 61, 6300**.
The remaining row labels, counts and percentages were not recovered.
The project did not require separate ethical approval as it was commissioned by, and carried out in collaboration with, the regional cancer registry. Cancer registries have legal support to collect data relating to cancer under Section 251 of the NHS Act 2006 (and formerly under Section 60 of the Health and Social Care Act 2001).
The completeness of information on the ethnicity of cancer cases following linkage with HES varied significantly (P < 0.001 in each case) by the demographic and clinical factors listed in Table 1.
Table 2: Sensitivity, Specificity and Positive Predictive Value of NBSS-derived Ethnicity for Breast Cancer Cases (columns recovered: ethnic group recorded in HES; number of cases recorded in HES; positive predictive value; values not recovered)
Table 3: Ethnicity of Cases Following Linkage with HES and NBSS Datasets (columns recovered: cases with ethnic group recorded in HES (%); cases with unknown ethnicity resolved by NBSS linkage; ethnic breakdown of cohort following use of HES and NBSS (%); values not recovered)
Table: Sensitivity, Specificity and Positive Predictive Value of Name Recognition Software (columns recovered: name recognition software; positive predictive value; rows include Onomap and Nam Pehchan combined; values not recovered)
Table: Sensitivity, Specificity and Positive Predictive Value of Census Data on Ethnicity (columns recovered: positive predictive value; values not recovered)
Ethnicity that was still missing following linkage with the HES and NBSS datasets was imputed in Stata using ICE, with an imputation model that included the variables significantly associated with missingness (Table 1), the ethnicity predicted for each case by Onomap and Nam Pehchan, and the ethnic breakdown of the area of residence of the case. The number of imputed datasets generated for the full run was set to 23, as ethnicity was missing for 22.6% of the cases (Table 3).
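The choice of 23 imputed datasets appears to follow the rule of thumb that the number of imputations should be at least the percentage of cases with missing data. A minimal sketch of that rule, assuming the percentage is simply rounded up to the next whole number (the function name is illustrative):

```python
import math


def imputations_for(percent_missing):
    """Number of imputed datasets under the 'one per percent missing' rule.

    percent_missing: percentage of cases with missing values (0 to 100).
    Rounds up, and always returns at least one dataset.
    """
    if not 0 <= percent_missing <= 100:
        raise ValueError("percent_missing must be between 0 and 100")
    return max(1, math.ceil(percent_missing))
```

With 22.6% of cases missing ethnicity, `imputations_for(22.6)` gives the 23 datasets used here.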
Table: Sensitivity, Specificity and Positive Predictive Value of Full Model (columns recovered: positive predictive value; values not recovered)
Table: Comparison of Distribution of Ethnic Groups: Observed and Imputed (values not recovered)
The main aim of this project was to create a method to impute the ethnic group of cancer cases who were notified to the regional cancer registry, but whose ethnic group was not available from the registry's main source, linkage with the national database on hospital admissions (HES). We made use of precise external information on the ethnicity of cases where possible, through linkage with a further dataset (the NBSS), two name recognition applications, and area-based information on the ethnic make-up of the resident population. We then assessed the value of each of these additional sources by comparing them with the ethnic group of cases whose ethnicity was known from HES. In the final stage of the method, we created a dataset which can be used to estimate ethnic group specific cancer incidence and survival: this involved the use of an MI procedure (ICE).
The main benefit of using additional linked datasets, like the NBSS, is that it makes use of precise information recorded about the individuals of interest. The main limitation for this project is that the NBSS dataset only contains breast cancer cases who attended the screening programme and who had their ethnic group recorded at that time: this resolved the ethnic group of just 1% of the cancer cases whose ethnicity was not already known from linkage with HES. We decided to use NBSS recorded ethnicity as a direct substitute for missing ethnicity, rather than including it as a predictor of ethnicity in the MI process, because it referred directly to the person of interest. Similar datasets were not available for the other cancer sites of interest.
The performance of Nam Pehchan is widely known but, as far as we are aware, this is the first peer reviewed report on the Onomap application. The higher sensitivity and specificity that were achieved by using both applications together suggest that the best name-based predictions of ethnicity can be achieved by the use of multiple applications. It is, however, unlikely that name recognition software will ever precisely predict membership of some ethnic groups: although many people from South Asian, Chinese and some other ethnic groups may have distinctive names, many individuals from White, Black and Mixed ethnic groups do not. This suggests that we will always have to make some allowance for the imprecision of name-based predictions, and include additional predictors of ethnicity.
Area-based information from the national Census is a popular predictor of ethnicity and easy to implement, but this project demonstrates that it is not precise enough to be used alone. Although sensitivity for the White ethnic group may be high (> 99%), specificity is very low (21%), showing that it misclassifies approximately 4 out of every 5 people from non-White backgrounds as White. Similarly, sensitivity is poor for the most common non-White ethnic groups in the region: the use of area-based Census data only appears to correctly identify approximately 7% of South Asian and 2% of Black cases. Setting alternative cutoff values for the model predictions from the default value of 0.5 did not improve the predictive performance of the Census data to any great extent (results not shown).
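The "4 out of every 5" figure follows directly from the reported specificity for the White group, since the non-White cases misclassified as White are those that are not correctly excluded:

```latex
\[
  \Pr(\text{predicted White} \mid \text{non-White})
  = 1 - \text{specificity}
  = 1 - 0.21
  = 0.79 \approx \tfrac{4}{5}
\]
```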
The full model used to predict ethnicity within the MI procedure appeared to be superior to name recognition software and Census data used alone. The model appeared to perform best for the South Asian ethnic group, and identified membership of the White, Black and Chinese/Other ethnic groups with greater sensitivity than any of its individual constituent predictors. However, the sensitivity of the full model in absolute terms for the Black, Chinese/Other and Mixed groups is low. It is, therefore, uncertain whether the existing predictors can be improved, or new predictors added, in a way that would increase the sensitivity of future models to levels similar to that seen for the South Asian group.
The greatest difference between the observed and imputed data in the final model was for two of the three minority groups whose ethnicity is poorly predicted by name recognition software (Chinese/Other and Mixed). This may indicate that the imputation process does not perform well in these cases. New MI models may benefit from including other predictors of ethnicity which help identify membership of these groups more precisely: for example, country of birth, or age-specific data on the ethnic composition of small geographic areas (LSOAs) if these are published following the next national Census.
In summary, we have developed a method and dataset that will allow comparison of cancer incidence and survival between ethnic groups. However, the sensitivity of the Onomap name recognition application appears to be low for people from non-White and non-South Asian ethnic groups, suggesting that it is of limited use for studies that wish to classify individuals by ethnic group. Similarly, area-based information from the national Census, a common approach where individual names are not available but area of residence is, appears to be imprecise, particularly for the less common ethnic groups. Currently, neither method for assigning individuals to an ethnic group performs well across all ethnic groups. We recommend further development of name recognition applications and the identification of additional methods for predicting ethnicity to improve their precision and accuracy for comparisons of health outcomes. Nevertheless, neither imputation nor name recognition software will be completely accurate: reliable statistics relating to the incidence, prevalence and survival of persons with cancer by ethnic group require more complete recording of these data.
This project was supported by the West Midlands Cancer Intelligence Unit (WMCIU). It was commissioned by the WMCIU which provided access to the data and funding for RR to carry out the project. GL is the Director of the WMCIU and SV was the Deputy Director of Cancer Registration at the WMCIU at the time of the study: they contributed to the writing of the manuscript and to the decision to submit the manuscript for publication. RR and SW are employed by the University of Birmingham.
1. Hospital Episodes Statistics. accessed 01 Sept 2010, [http://www.hesonline.nhs.uk]
2. National Cancer Intelligence Network Coordinating Centre: Cancer incidence and survival by major ethnic group, England, 2002-2006. accessed 13 Jan 2012, [http://publications.cancerresearchuk.org/downloads/Product/CSINCSURVBYETHNICITY.pdf]
3. Cummins C, Winter H, Cheng KK, Maric R, Silcocks P, Varghese C: An assessment of the Nam Pehchan computer program for the identification of names of south Asian ethnic origin. J Public Health Med. 1999, 21: 401-6. 10.1093/pubmed/21.4.401.
4. Price CL, Szczepura AK, Gumber AK, Patnick JP: Comparison of breast and bowel cancer screening uptake patterns in a common cohort of South Asian women in England. BMC Health Services Research. 2010, 10: 103. 10.1186/1472-6963-10-103.
5. Szczepura A, Price CL, Gumber A: Breast and bowel cancer screening uptake patterns over 15 years for UK South Asian ethnic minority populations, corrected for differences in socio-demographic characteristics. BMC Public Health. 2008, 8: 1471-2458.
6. Nanchahal K, Mangtani P, Alston M, dos Santos Silva I: Development and validation of a computerized South Asian Names and Group Recognition Algorithm (SANGRA) for use in British health-related studies. J Public Health Med. 2001, 23: 278-85. 10.1093/pubmed/23.4.278.
7. OnoMAP. accessed 5 Oct 2010, [http://www.onomap.org]
8. The Names Projects. accessed 13 Jan 2012, [http://redress.lancs.ac.uk/resources/launch.php?creator=Mateos Pablo&title=The Names Projects]
9. Mateos P: A review of name-based ethnicity classification methods and their potential in population studies. Population Space and Place. 2007, 13: 243-263. 10.1002/psp.457.
10. Brant LJ, Boxall E: The problem with using computer programmes to assign ethnicity: Immigration decreases sensitivity. Public Health. 2009, 123: 316-320. 10.1016/j.puhe.2009.02.002.
11. Identifying Ethnicity: comparison of two computer programmes. [http://www2.warwick.ac.uk/fac/med/research/csri/ethnicityhealth/aspects_diversity/identifying_ethnicity/identifying_ethnicity.doc]
12. Renshaw C, Jack RH, Dixon S, Møller H, Davies EA: Estimating attendance for breast cancer screening in ethnic groups in London. BMC Public Health. 2010, 10: 157. 10.1186/1471-2458-10-157.
13. Hippisley-Cox J, Coupland C, Vinogradova Y, Robson J, May M, Brindle P: Derivation and validation of QRISK, a new cardiovascular disease risk score for the United Kingdom: prospective open cohort study. BMJ. 2007, 335: 136. 10.1136/bmj.39261.471806.55.
14. Royston P, Carlin JB, White IR: Multiple imputation of missing values: New features for mim. Stata Journal. 2009, 9: 252-264.
15. Population Estimates by Ethnic Group: Methodology Paper. accessed 13 Jan 2012, [http://www.ons.gov.uk/ons/rel/peeg/population-estimates-by-ethnic-group--experimental-/current-estimates/population-estimates-by-ethnic-group-methodology-paper.pdf]
16. Downing A, Forman D, Thomas JD, West RM, Lawrence G, Gilthorpe MS: Investigating the association between ethnicity and survival from breast cancer using routinely collected health data: challenges and potential solutions [abstract]. Journal of Epidemiology and Community Health. 2009, 63: 88. 10.1136/jech.2009.096735j.
17. van Buuren S, Boshuizen HC, Knook DL: Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine. 1999, 18: 681-694. 10.1002/(SICI)1097-0258(19990330)18:6<681::AID-SIM71>3.0.CO;2-R.
18. Sterne JA, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR: Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009, 338: b2393. 10.1136/bmj.b2393.
19. Iqbal G, Gumber A, Szczepura A, Johnson M, Wilson S, Dunn J: Improving ethnicity data collection for cancer statistics in the UK. Diversity in Health and Care. 2009, 16: 267-285.
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6947/12/3/prepub