Public health utility of cause of death data: applying empirical algorithms to improve data quality

Background: Accurate, comprehensive, cause-specific mortality estimates are crucial for informing public health decision making worldwide. Incorrectly or vaguely assigned deaths, defined as garbage-coded deaths, mask the true cause distribution. The Global Burden of Disease (GBD) study has developed methods to create comparable, timely, cause-specific mortality estimates; an impactful data processing method is the reallocation of garbage-coded deaths to a plausible underlying cause of death. We identify the pattern of garbage-coded deaths in the world and present the methods used to determine their redistribution to generate more plausible cause of death assignments.
Methods: We describe the methods developed for the GBD 2019 study and subsequent iterations to redistribute garbage-coded deaths in vital registration data to plausible underlying causes. These methods include analysis of multiple cause data, negative correlation, impairment, and proportional redistribution. We classify garbage codes into classes according to the level of specificity of the reported cause of death (CoD) and capture trends in the global pattern of proportion of garbage-coded deaths, disaggregated by these classes, and the relationship between this proportion and the Socio-Demographic Index. We examine the relative importance of the top four garbage codes by age and sex and demonstrate the impact of redistribution on the annual GBD CoD rankings.
Results: The proportion of least-specific (class 1 and 2) garbage-coded deaths ranged from 3.7% of all vital registration deaths to 67.3% in 2015, and the age-standardized proportion had an overall negative association with the Socio-Demographic Index. When broken down by age and sex, the category for unspecified lower respiratory infections was responsible for nearly 30% of garbage-coded deaths in those under 1 year of age for both sexes, representing the largest proportion of garbage codes for that age group. We show how the cause distribution by number of deaths changes before and after redistribution for four countries: Brazil, the United States, Japan, and France, highlighting the necessity of accounting for garbage-coded deaths in the GBD.
Conclusions: We provide a detailed description of redistribution methods developed for CoD data in the GBD; these methods represent an overall improvement in empiricism compared to past reliance on a priori knowledge.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12911-021-01501-1.


Background
Across humanity, we know two events to be inevitable: birth and death. In order to maximize the quality and quantity of time spent between these two events, we need accurate, timely, and cause-specific mortality estimates. Even though systematic cause of death (CoD) reporting has improved since the first such records measuring bubonic plague mortality [1], no country has yet created a perfectly accurate death registration system. The highest-quality CoD data are reported via vital registration (VR) systems, through which "the continuous, permanent, compulsory and universal" recording of vital demographic events occurs "in accordance with the legal requirements of a country" [2,3]. The presence of VR systems is far from ubiquitous and remains especially inadequate in lower- and lower-middle-income countries [4][5][6]. Furthermore, the process of completing and accurately coding a death certificate according to the international standard established by the International Statistical Classification of Diseases and Related Health Problems (ICD) is challenging for all countries, regardless of income status [7].
According to the ICD, only one CoD is reported for statistical purposes: the underlying cause of death (UCoD), i.e., the disease or injury that initiated the chain of events leading to death [8]. Physicians often do not receive adequate training in the public health importance of ICD rules, however, and death certificates are regularly filled out incorrectly [9][10][11]. As a result, many deaths are ascribed to "garbage" codes, i.e., codes that are not specific enough, that describe an immediate or intermediate CoD, or that describe an impossible CoD [12,13]. Sepsis, for example, is often listed as the UCoD; however, a number of conditions, including malaria, diabetes, or a road traffic injury [14], may be the underlying cause that leads to sepsis. Garbage codes mask the distribution of true underlying causes, and numerous country-specific data quality analyses that address garbage coding have revealed different mortality patterns than initially reported [15][16][17][18][19][20]. Furthermore, coding practices vary across age groups, sexes, space, and time, severely hindering intra- and inter-country comparability of cause-specific mortality over time and limiting the usability of CoD data for public health purposes [21][22][23][24].
The Global Burden of Disease (GBD) study, a tool for quantifying health loss from hundreds of diseases, injuries, and risk factors, is one response to the question of how to generate usable cause-specific mortality estimates from a collection of imperfect, heterogeneous data [2,25,26]. The GBD produces regular, timely estimates of cause-specific mortality that are comparable by age, sex, year, and location from 1980 onwards. Accounting for garbage-coded deaths is one of the key data processing steps in creating cause-specific mortality estimates and reveals a mortality distribution that countries can use to compare the mortality level and composition over time, across age groups and sexes. Here we present the methods developed to account for garbage-coded deaths in VR data by location, year, age, and sex in the GBD 2013 study through GBD 2020, in addition to describing the pattern of garbage-coded deaths in the world. Furthermore, we draw from previously established criteria, namely coverage and frequency of garbage-coded deaths, to evaluate the overall quality of CoD VR data in the world [16,27].

Methods
The GBD produces a continuously updated, comprehensive, comparable database of standardized CoD data by age, sex, location, and year from 1980 onwards. We aim to include as much CoD data as possible: rather than exclude data that do not fit the ideal, we have devised a number of methods to enhance the usability of a variety of CoD data sources. Existing CoD data sources differ based primarily on the method by which the data were collected (e.g., VR, verbal autopsy, sibling history) and the coding system and format used to report the CoD data (e.g., International Classification of Diseases [ICD]-9 and ICD-10) (Additional file 1: Figure 1). This variation creates a number of challenges in standardizing the data, including unknown age and/or sex, tabulated (aggregated) cause codes, misclassification of underlying causes to another cause or to garbage codes, and stochastic noise in deaths over time. An overview of the process for building the CoD database is summarized briefly below (Additional file 1: Figure 2), though an in-depth description of all methods is outside the scope of this paper and described elsewhere [2]. Specifically, we focus on the set of algorithms used to reallocate garbage-coded deaths to a most likely UCoD, collectively referred to as "redistribution" (the third box in Additional file 1: Figure 2). This study complies with the Guidelines for Accurate and Transparent Health Estimates Reporting (GATHER) statement [28]. The GBD study used de-identified data, and the waiver of informed consent was reviewed and approved by the University of Washington Institutional Review Board (application number 46665). Data preparation and analyses were carried out using R version 3.5.1 and Python 3 [29,30].
We will first briefly cover the key steps in the data processing pipeline to contextualize how redistribution of garbage-coded deaths fits into creating the CoD database (Additional file 1: Figure 2). First, all causes of death are mapped from their original coding onto the GBD cause list [2]. Second, observations from some CoD data sources are not available by detailed age and sex, and must be split into detailed age and sex groups. This is achieved by using cause-, age-, and sex-specific global mortality rates generated from CoD VR data for which complete age and sex detail is available. Alongside population, these mortality rates are used to estimate an expected number of deaths in each detailed age group and for both sexes, which are then scaled so that they sum to the deaths in the original non-detailed observation. Additional details on the age and sex splitting process can be found elsewhere [2]. Third, deaths where the cause has been misclassified to Alzheimer's disease and other dementias are reassigned to the most plausible underlying cause [2]. Fourth, deaths assigned to a garbage code are redistributed (the focus of this manuscript) (Fig. 1). Fifth, misclassification of HIV-related deaths is corrected [2,31,32]. Finally, noisy data due to stochastic variation are smoothed and CoD data are uploaded to a central database for use in the GBD fatal estimation process [2]. This paper provides further detail on the most current methods developed to account for garbage-coded deaths in VR data using the detailed ICD-9 and ICD-10 nosological classification systems, as these data represent the vast majority of GBD's mortality data (Additional file 1: Figure 1).
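The age and sex splitting step described above can be illustrated with a minimal sketch. It assumes a single aggregated observation, one cause, and dictionaries keyed by (age group, sex); the function name `split_deaths` and the example numbers are hypothetical and are not taken from the GBD code base.

```python
def split_deaths(total_deaths, population, global_rates):
    """Split one aggregated death count across detailed age-sex groups.

    population and global_rates are dicts keyed by (age_group, sex).
    global_rates holds cause-specific deaths per person (hypothetical values).
    """
    # Expected deaths in each group if the global rate applied to this population.
    expected = {g: population[g] * global_rates[g] for g in population}
    expected_total = sum(expected.values())
    if expected_total == 0:
        raise ValueError("expected deaths sum to zero; cannot split")
    # Rescale so the split deaths sum to the originally reported total.
    scale = total_deaths / expected_total
    return {g: e * scale for g, e in expected.items()}


# Hypothetical example: 14 deaths reported without age or sex detail.
population = {("0-4", "M"): 1000, ("0-4", "F"): 1000,
              ("5-9", "M"): 2000, ("5-9", "F"): 2000}
global_rates = {("0-4", "M"): 0.003, ("0-4", "F"): 0.002,
                ("5-9", "M"): 0.0005, ("5-9", "F"): 0.0005}
split = split_deaths(14, population, global_rates)
```

The rescaling step is what guarantees that splitting never creates or removes deaths; only their allocation across groups changes.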

Identification of garbage codes
In the first step of the cause of death database creation, every ICD code is mapped to a corresponding CoD in the mutually exclusive, collectively exhaustive GBD cause hierarchy (Additional file 1: Figure 3) [2]. Not every ICD code is a valid UCoD in the GBD hierarchy, however; garbage-coded deaths are those assigned to ICD codes that cannot or should not be considered the UCoD (Additional file 1: Figure 4) [33].
These codes include impossible causes of death, e.g., senility; nonspecific causes, e.g., ill-defined cancer site; causes that the GBD considers a symptom rather than a cause, e.g., back pain; and intermediate or immediate causes that result from other underlying conditions, e.g., heart failure, sepsis. We refer to these codes as "garbage codes"; garbage-coded deaths are not lost during analysis, but instead grouped based on diagnostic relatedness and collectively reassigned to the most probable UCoD during a process we refer to as redistribution, described in detail in the following sections.

Categorization of garbage codes
While all garbage codes are alike in that they cannot (or should not) be considered the UCoD, they vary in their level of specificity. For example, deaths that are garbage-coded as "sepsis" could be attributed to hundreds of underlying causes of death, whereas deaths garbage-coded to "unspecified stroke" have a short list of possible underlying causes. In GBD 2016, garbage levels, here termed "classes", were created to categorize garbage codes into four classes of increasing specificity [34]. A more detailed explanation of these classes has been published previously [35], and they are briefly described in Box 1 (a table of ICD codes by garbage class can be found in Additional file 1: Figure 4).
Classes one and two are collectively referred to as major garbage; correction of these classes has the most important policy implications, and the proportion of age-standardized major garbage out of all deaths in each location and year is a key component of the star rating data-quality metric produced by GBD [2], described in further detail below. In GBD 2020, 16.9% of ICD-9 and ICD-10 VR data across all years were major garbage-coded deaths, with the percent of major garbage staying relatively stable over time, ranging from a low of 13.5% to a high of 18.4% during the period from 1980 to 2019 (Additional file 1: Figure 5).

Star rating
Fatal GBD estimation is most accurate when using data from complete VR systems that span consecutive years, with a low proportion of garbage-coded deaths. In GBD 2013, the Vital Statistics Performance Index (VSPI), a composite of six metrics, was created to empirically measure the performance of VR systems [27]. In GBD 2016, a simpler system was developed, using a star rating system from 0 to 5 to represent data quality for a location across a given time series [34]. For any given location-year, the two components that determine this star rating are the proportion of age-standardized major garbage and level of completeness. Completeness is a measure of how successfully the VR captures deaths that occur in a location-year (regardless of garbage coding). It is calculated as the ratio of total deaths reported in the VR to total GBD-estimated all-cause deaths. These components are then used to calculate a percent well certified (PWC) value between 0 and 1 (Eq. 1):
$$\mathrm{PWC} = \text{Percent Completeness} \times \left(1 - \text{Percent Major Garbage}\right) \tag{1}$$
Star values are then assigned based on the calculated PWC value. A mapping of PWC values to star ratings can be found in Additional file 1: Figure 6. This method for assigning a star rating to a specific location-year of data and then summarizing that metric across a time series is described in detail elsewhere [34]. A location can increase its number of stars by decreasing the proportion of major garbage-coded deaths, increasing the total number of deaths captured, and increasing the number of available years of data. Data quality, as measured via the star rating system, ranges substantially across GBD locations and within countries with subnational detail available (Additional file 1: Figure 7).

Redistribution
Redistribution is the process of reallocating garbage-coded deaths to plausible underlying causes [12]. For each group of diagnostically related garbage codes, we define a set of probable underlying causes of death and the proportion of garbage-coded deaths that are redistributed to each underlying cause, separately by GBD age group, sex, location, and year. While uncertainty intervals for these proportions are calculated, they are used only to aid in the modelling of data that have completed all steps of the data processing pipeline (Additional file 1: Figure 2). They are not used to inform redistribution of the garbage-coded deaths. Thus, specific details regarding calculation of redistribution uncertainty have been omitted from this paper but are described in detail elsewhere [2].
There are four main methods used to determine a set of plausible underlying causes and proportions for a given group of garbage codes, explained in detail in subsequent paragraphs: (1) multiple cause analysis, (2) negative correlation, (3) impairment, and (4) proportional redistribution (Table 1, Fig. 1). Garbage codes are first grouped based on diagnostic relatedness (Additional file 1: Figure 4); afterwards, one of these four methods is chosen. The appropriate method is determined on a case-by-case basis, as will be explained in more detail below. Each of these methods independently produces the necessary inputs to redistribution, where garbage-coded deaths are reallocated. Although the underlying algorithm for redistribution, the final step shown in bright green in Fig. 1, has not changed significantly since GBD 2013 [36], substantial improvements were made during GBD 2019 and 2020 to the methods for the steps feeding into redistribution, shown in teal boxes in Fig. 1.
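Whichever of the four methods supplies the inputs, the final reallocation step has the same shape: a set of target causes and proportions is applied to the deaths assigned to a garbage code within one age, sex, location, and year stratum. The sketch below illustrates that step only; it is not the GBD redistribution algorithm itself, and the function name `redistribute` and the example counts are hypothetical.

```python
def redistribute(deaths, garbage_code, proportions):
    """Reallocate deaths coded to garbage_code onto target causes.

    deaths: dict mapping cause -> death count for one stratum.
    proportions: dict mapping target cause -> share of garbage-coded deaths;
                 the shares are assumed to sum to 1.
    """
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("redistribution proportions must sum to 1")
    out = dict(deaths)
    # Remove the garbage code from the output and hold its deaths for reallocation.
    garbage = out.pop(garbage_code, 0.0)
    for cause, share in proportions.items():
        out[cause] = out.get(cause, 0.0) + garbage * share
    return out


# Hypothetical stratum: 100 deaths coded to sepsis as the underlying cause.
before = {"sepsis": 100, "pneumonia": 300, "diabetes": 200}
after = redistribute(before, "sepsis", {"pneumonia": 0.7, "diabetes": 0.3})
```

Total deaths in the stratum are unchanged by construction; only the cause composition shifts.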

Multiple cause analysis
Death is not a single event, but rather a chain of causal events ultimately leading to death. Multiple cause data, individual-level records listing all causes from the death certificate, include the chain of events leading to death (Part I, Fig. 2) and other significant conditions that contribute to mortality but are not part of the sequence directly leading to death (Part II, Fig. 2) [37].
The chain of events leading to death includes underlying (disease or injury that initiated the events resulting in death), intermediate (events initiated by the underlying cause), and immediate (the terminal event) causes (Fig. 2) [37]. Multiple cause data rarely distinguish intermediate from immediate causes, and therefore we refer to all causes in the chain (i.e., non-underlying causes) on a death certificate as intermediate causes. For example, if a child gets pneumonia, is unable to receive adequate medical attention, and then dies of sepsis, we would say the underlying cause of death is pneumonia and sepsis is an intermediate cause. These data are particularly useful to analyze causes that would not otherwise be captured by the underlying cause alone [38], but such data are difficult to obtain due to data privacy issues; Table 2 shows the number of deaths and location-years available for analysis in GBD 2020. As the number of location-years with multiple cause data increases, so does our preference for this method over the others presented in this manuscript.

Intermediate causes of death
A variety of methods have been previously used to account for intermediate causes incorrectly listed as the UCoD, including multinomial regression, Bayesian regression, and coarsened exact matching [39][40][41][42].
We have built on these analyses, and the methods presented here include two key innovations introduced in GBD 2019 and further developed in GBD 2020: (1) determining a set of plausible underlying causes from multiple cause data, rather than relying on literature reviews or expert opinion, and (2) increasing generalizability across all GBD-estimated locations. In GBD 2019, the analysis described in the following paragraphs was introduced to inform the redistribution of deaths incorrectly coded to the following intermediate causes: sepsis; embolism (pulmonary and arterial); heart failure (left, right, and unspecified); acute kidney injury; hepatic failure; acute respiratory failure; pneumonitis; and unspecified central nervous system disorders. In GBD 2020, this list was expanded to include gastrointestinal bleeding; chronic respiratory failure; peritonitis; fluid, electrolyte, and acid-base disorders; arrhythmia; pneumothorax; alcoholic hepatic failure; amyloidosis; cachexia; osteomyelitis; plegia; atherosclerosis; empyema; hypertension; shock, cardiac arrest, and coma; and renal failure.
First, death certificates with a non-garbage UCoD were mapped to corresponding GBD causes and tagged to indicate presence of the intermediate cause of interest by ICD code (ICD codes for each intermediate cause can be found in Additional file 1: Figure 8). Second, we determined the set of the most plausible underlying causes, separately for each intermediate cause. A key feature of redistribution is the selection of the most likely underlying causes of death. Our first approach used all underlying causes appearing in the multiple cause data; however, this resulted in arbitrarily small proportions, e.g., 0.00068% of pulmonary embolism-related deaths due to diphtheria in high-income countries among males between the ages of 15 and 29.
To avoid artificial redistribution results, we performed a two-step process to trim the list of underlying causes that will serve as redistribution targets for the garbage-coded deaths. In the first step, we keep the underlying causes comprising 80% of deaths in the multiple cause data. In the second step, a least absolute shrinkage and selection operator (LASSO) regression is fitted using only the response variable (proportion of intermediate-cause-related deaths) and the underlying causes comprising the bottom 20% of deaths [43]. LASSO adds a penalty, tuned by adjusting the lambda parameter, equal to the absolute value of the magnitude of the coefficients, such that the coefficients on many of the underlying causes were reduced to zero and could be empirically excluded. Related dimension reduction techniques, such as ridge and elastic net regressions, may reduce coefficients, but do not push them to zero, and were therefore not used. The lambda parameter was chosen based on minimization of the cross-validated sum of squared residuals, with 10 folds. The R package "glmnet" was used [44].
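The two trimming steps can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' pipeline (which used the R package "glmnet" with 10-fold cross-validation): the function names are hypothetical, the LASSO is a bare coordinate-descent solver without an intercept, and lambda is fixed rather than cross-validated.

```python
import math


def split_by_cumulative_share(deaths_by_cause, share=0.80):
    """Return (kept, tail): the largest causes covering `share` of deaths, and the rest."""
    total = sum(deaths_by_cause.values())
    kept, tail, running = [], [], 0.0
    ranked = sorted(deaths_by_cause.items(), key=lambda kv: kv[1], reverse=True)
    for cause, n in ranked:
        # A cause is kept while the causes ranked above it cover less than `share`.
        (kept if running < share * total else tail).append(cause)
        running += n
    return kept, tail


def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within gamma of zero become exactly 0."""
    return math.copysign(max(abs(z) - gamma, 0.0), z)


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(y), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove the fit of every predictor except j.
            resid = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                     for i in range(n)]
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return beta


# Hypothetical counts: causes A and B cover the top 80%; C, D, E go to the LASSO step.
kept, tail = split_by_cumulative_share({"A": 50, "B": 35, "C": 8, "D": 4, "E": 3})
```

Tail causes whose fitted LASSO coefficient is exactly zero would be dropped as redistribution targets, while those with non-zero coefficients would be retained alongside the kept causes.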
After determining the most plausible set of underlying causes for each intermediate cause, we then constructed a predictive model. The proportion of deaths related to the intermediate cause of interest was estimated using a mixed-effects generalized linear model with binomial response and logit link (Eq. 2), fitted with the R package "lme4" [45]:
$$Y_i \sim \mathrm{Binomial}(n_i, \pi_i), \qquad \operatorname{logit}(\pi_i) = \beta_0 + \beta_1 X_i + \beta_{\text{age}} + \beta_{\text{sex}} + \gamma_{\text{underlying cause}} \tag{2}$$
where $Y_i$ is the proportion of deaths related to the intermediate cause of interest. The distribution of the random variable $Y_i$ is binomial, with $n_i$ observations and probability of an intermediate cause-related death $\pi_i$ for each age, sex, location, year, and underlying cause group $i$. $\beta_0$ is the global intercept, $\beta_1$ is the effect of the covariates $X$, $\beta_{\text{age}}$ and $\beta_{\text{sex}}$ are the effects of the categorical covariates for age group and sex, and $\gamma_{\text{underlying cause}}$ is the random effect on UCoD. Separate models were run for each intermediate cause of interest, and each set of covariates is listed in Additional file 1: Figure 9, with the most common being the Healthcare Access and Quality Index (HAQ Index). The HAQ Index is a measure of amenable mortality informed by mortality rates for a set of 32 causes which should not be fatal given adequate medical treatment [46].
A step-by-step example is given below for sepsis (Eq. 3). Proportions, referenced below as "sepsis fraction," were extrapolated for all GBD locations using the model above:
$$\begin{aligned}
&1.\ \text{sepsis deaths}_{a,s,l,y,c} = \text{sepsis fraction}_{a,s,l,y,c} \times \text{GBD deaths}_{a,s,l,y,c} \\
&2.\ \text{total sepsis deaths}_{a,s,l,y} = \sum_{c} \text{sepsis deaths}_{a,s,l,y,c} \\
&3.\ \text{proportion of sepsis to redistribute}_{a,s,l,y,c} = \frac{\text{sepsis deaths}_{a,s,l,y,c}}{\text{total sepsis deaths}_{a,s,l,y}}
\end{aligned} \tag{3}$$
where "sepsis fraction" was estimated from the model shown in Eq. 2 and a, s, l, y, c denote a given age group, sex, location, year, and UCoD, respectively.
The resulting proportions from Eq. 3, step 3 were used as inputs to redistribution (Fig. 1). Results from the multiple cause analysis for pulmonary embolism (Additional file 1: Figure 10) and unspecified heart failure (Additional file 1: Figure 11) are shown in the Additional file 1.

Unspecified injuries: X59 and Y34
Deaths due to injury are described in the ICD by codes specific to the external cause (e.g., motor vehicle crash) and for the injury diagnosis (e.g., injury to head), also referred to as nature of injury codes [47]. Though it is often easier to identify the nature of injury of a deceased person than the factor that caused the injury, a detailed external injury code is required for correctly assigning the UCoD [48]. Two common non-specific codes for external causes of mortality are exposure to unspecified factors (X59 in ICD-10) and unspecified event of undetermined intent (Y34 in ICD-10) [49]. These codes comprise 2.5% of all garbage-coded deaths and 8.1% of total injuries deaths in ICD-10 VR. To identify proportions and plausible underlying causes for these deaths, we employed a multi-step approach that uses the combination of nature of injury codes in the causal chain in multiple cause data (Fig. 2).
First, death certificates in multiple cause data with the garbage code of interest or a GBD injuries cause as the UCoD were selected. The detailed nature of injury codes in the causal chain of these death certificates were collapsed to 37 custom groups of diagnostically related ICD codes (Additional file 1: Figure 12). For each death, we then identified combinations of nature of injury codes appearing in the chain according to these custom groups. The top 95% of combinations were then used to derive preliminary cause-, age-, sex-, year-, and location-specific redistribution proportions. These proportions were derived based on the probability of a given combination being coded to an X59/Y34-related garbage code or a GBD injuries cause and then summed for all combinations. An example is given for X59 (Eq. 4), where: combination_j = a given nature of injury code combination in the causal chain; UCoD_X59 = a death with X59 coded as the UCoD; UCoD_(GBD injuries cause i) = a death with a given GBD injuries cause i coded as the UCoD.
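One plausible reading of the combination-based calculation is sketched below; it is an assumption about how the described probabilities combine, not a transcription of Eq. 4. For each nature of injury combination j, the share of X59 deaths showing that combination is multiplied by the share of specifically coded deaths with the same combination that belong to injuries cause i, and the products are summed over combinations. The function name, the record layout, and the omission of the top-95% filter are all simplifications introduced here.

```python
from collections import Counter, defaultdict


def x59_preliminary_proportions(records, garbage_code="X59"):
    """Preliminary share of garbage-coded injury deaths attributable to each cause.

    records: list of (ucod, combination) pairs, where ucod is either the garbage
    code or a GBD injuries cause, and combination is a hashable label for the
    nature of injury groups found in the causal chain.
    """
    garbage_combos = Counter(c for u, c in records if u == garbage_code)
    cause_by_combo = defaultdict(Counter)
    for ucod, combo in records:
        if ucod != garbage_code:
            cause_by_combo[combo][ucod] += 1

    n_garbage = sum(garbage_combos.values())
    props = defaultdict(float)
    for combo, n in garbage_combos.items():
        targets = cause_by_combo.get(combo)
        if not targets:
            continue  # no specifically coded deaths share this injury pattern
        p_combo = n / n_garbage  # P(combination j | UCoD is the garbage code)
        n_targets = sum(targets.values())
        for cause, k in targets.items():
            # P(UCoD is cause i | combination j), among specifically coded deaths
            props[cause] += p_combo * k / n_targets

    total = sum(props.values())
    return {cause: p / total for cause, p in props.items()}


# Hypothetical records: hip fractures dominate the X59 deaths.
records = ([("X59", "hip fracture")] * 3 + [("X59", "head injury")]
           + [("falls", "hip fracture")] * 8 + [("road injuries", "hip fracture")] * 2
           + [("falls", "head injury")] * 5 + [("road injuries", "head injury")] * 5)
proportions = x59_preliminary_proportions(records)
```

In this toy example most X59 deaths are pulled toward falls because the injury pattern seen on X59 certificates resembles the pattern seen on certificates coded to falls.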
These proportions are based on the specific pattern of injuries in country-years with multiple cause data; they are preliminary and can only be applied to multiple cause data to estimate the fraction of each injury cause that is coded to X59 or Y34. We applied these cause-, age-, sex-, year-, and location-specific redistribution proportions to the data where X59 or Y34 was the UCoD to get the number of unspecified injuries deaths "attributable" to each GBD injuries cause. Then, for each GBD injuries cause in the multiple cause data, we calculated the fraction of redistributed garbage-coded injuries deaths over the fraction of total injuries deaths for that cause and modeled this intermediate cause fraction using a mixed effects regression (Eq. 2), the same model used for intermediate causes. As described in detail above for analyzing intermediate causes, unspecified injuries fractions were multiplied by CoD results from the previous GBD round, summed across all GBD injuries causes, and final redistribution proportions were calculated separately for X59 (Additional file 1: Figure 13) and Y34 (Additional file 1: Figure 14) by age, sex, location, year, and GBD injuries cause for use in all CoD data. An additional, separate example of using multiple cause data to redistribute misclassification of accidental poisoning can be found in Additional file 1: Figure 15.
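The same three-step conversion from modeled fractions to redistribution proportions applies to sepsis (Eq. 3) and to the unspecified injuries codes. A minimal sketch for one age, sex, location, and year stratum follows; the function name and the example values are hypothetical.

```python
def fractions_to_proportions(fraction, gbd_deaths):
    """Turn modeled garbage-related fractions into redistribution proportions.

    fraction:   dict cause -> modeled fraction of that cause's deaths that
                involve the garbage code (e.g. the "sepsis fraction").
    gbd_deaths: dict cause -> estimated deaths for the same stratum.
    """
    # Step 1: garbage-related deaths for each candidate underlying cause.
    related = {c: fraction[c] * gbd_deaths[c] for c in fraction}
    # Step 2: total garbage-related deaths across all underlying causes.
    total = sum(related.values())
    if total == 0:
        raise ValueError("no garbage-related deaths in this stratum")
    # Step 3: proportion of garbage-coded deaths redistributed to each cause.
    return {c: d / total for c, d in related.items()}


# Hypothetical stratum with three candidate underlying causes.
props = fractions_to_proportions(
    {"pneumonia": 0.20, "diabetes": 0.05, "road injuries": 0.10},
    {"pneumonia": 1000, "diabetes": 2000, "road injuries": 500},
)
```

Because step 3 normalizes within the stratum, a cause with a small modeled fraction can still receive a sizeable share if it accounts for many deaths overall.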

Negative correlation
While multiple cause analysis is the preferred method of determining underlying cause targets and proportions for garbage codes, this method is not possible for class 4, the most specific garbage-coded deaths (e.g., malignant neoplasm of ill-defined digestive organs). This is because a death certificate would never include a more detailed ICD code nested within a less detailed code; for example, "malignant neoplasm of ill-defined digestive organs" and "liver cancer" would not appear on the same certificate. In these instances of class 4 garbage, there is a noticeable inverse relationship between the garbage-coded death and its plausible underlying causes of death, i.e., as the number of garbage-coded deaths increases, the number of deaths due to plausible underlying causes decreases. Thus, we use a negative correlation method to determine how to redistribute these deaths (Fig. 1). First described by Ahern et al. [50] for the redistribution of unspecified heart failure, this method assumes that with improvements in coding practices, more deaths are assigned to the plausible underlying cause(s) and fewer to the corresponding garbage codes. The detailed methods for negative correlation redistribution have been described elsewhere [2]. In GBD 2019, the core methods for negative correlation redistribution were revisited, and a slightly different approach was adopted to redistribute deaths attributed to unspecified diabetes, unspecified stroke, and malignant neoplasm without specification of site. Using unspecified stroke as an example, these methods are summarized in brief here.
The corresponding plausible underlying causes of death for unspecified stroke are assumed a priori to be the subtypes ischemic stroke, intracerebral hemorrhage, and subarachnoid hemorrhage. As shown in Eq. 5, we assume the logit-transformed proportion of each stroke subtype (out of all non-garbage-coded stroke deaths), $\mu_i$, can be modeled linearly as a function of covariates predictive of stroke mortality, $\beta_1 X_i$, with intercept $\beta_0$, for each age, sex, location, and year group $i$:
$$\mu_i = \beta_0 + \beta_1 X_i \tag{5}$$
In an ideal world, the method would conclude after the aforementioned regression (Eq. 5). In practice, however, we noticed bias in the residuals with respect to the proportion of unspecified stroke in all stroke-related deaths. To account for these biases, we apply an adjustment, which is made in two steps. First, residuals from the regression (Eq. 5) are calculated and regressed against the logit-transformed proportion of deaths coded to unspecified stroke in order to identify any trend present between the residuals and the proportion of deaths garbage-coded to unspecified stroke. Second, the adjustment is calculated using the slope of this regression line, and the difference between the value of the residuals when no deaths are garbage-coded to unspecified stroke and at the observed proportion of deaths coded to unspecified stroke (Eq. 6). Ideally, the proportion of unspecified stroke would not influence the model and regressing the residuals against the proportion of unspecified stroke would show little correlation with a slope near 0. The adjustment would then be quite small. However, stronger correlation between the residuals and the proportion of unspecified stroke results in a larger adjustment being necessary.
This adjustment is added to the initially estimated proportion of a given stroke subtype generated by Eq. 5, bringing it closer to the true proportion of a world without garbage coding. Proportions are normalized by age, sex, location, and year.
Since the residuals are modeled on the logit-transformed proportion of deaths garbage-coded to unspecified stroke, logit(GC%), it is not possible to calculate the adjustment for GC% = 0%. Instead, we used GC% = 1% to represent the counterfactual of "no garbage." The same methods are applied to the redistribution of unspecified diabetes and malignant neoplasm without specification of site. We therefore combine two approaches (descriptive linear modeling with covariates explanatory of mortality, and an adjustment for coding practices) to produce improved estimates as compared to previous GBD cycles.
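The two-step adjustment can be sketched as follows. This is an interpretation of the prose rather than the published Eq. 6: it assumes the adjustment equals the residual-regression slope times the difference between logit(1%) and the logit of the observed garbage-coded proportion, that it is added on the logit scale, and that subtype proportions are renormalized afterwards. All function names and example values are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def ols_slope(x, y):
    """Slope of a simple least-squares line of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def adjustment(slope, gc_observed, gc_counterfactual=0.01):
    """Fitted residual at the 'no garbage' counterfactual minus that at the observed GC%."""
    return slope * (logit(gc_counterfactual) - logit(gc_observed))


def adjusted_proportions(predicted_logit, slopes, gc_observed):
    """Apply the adjustment per subtype, then renormalize within the stratum."""
    raw = {s: inv_logit(predicted_logit[s] + adjustment(slopes[s], gc_observed))
           for s in predicted_logit}
    total = sum(raw.values())
    return {s: p / total for s, p in raw.items()}


# Step 1 (per subtype): slope of residuals on logit(GC%) across location-years.
slope = ols_slope([logit(0.05), logit(0.2), logit(0.5)], [0.10, -0.05, -0.30])
# Step 2: adjust predictions for a location-year with 20% unspecified stroke.
props = adjusted_proportions(
    {"ischemic stroke": 0.0, "intracerebral hemorrhage": -1.0, "subarachnoid hemorrhage": -2.0},
    {"ischemic stroke": slope, "intracerebral hemorrhage": 0.0, "subarachnoid hemorrhage": 0.0},
    gc_observed=0.2,
)
```

With a slope of zero the adjustment vanishes, which matches the ideal case described in the text where garbage coding carries no information about the subtype mix.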

Impairments
The GBD defines impairments as domains of health loss that are a consequence of multiple underlying causes, rather than underlying causes of death themselves [2]. Anemia, for example, can occur as the result of chronic kidney disease or malaria, but is not considered the UCoD. Due to the difficulty in identifying a single underlying cause for impairments, neither a multiple cause analysis nor the negative correlation method is possible, and instead we rely on the non-fatal burden estimation process of GBD [2]. The resulting years lived with disability (YLDs) [2] are used to calculate redistribution proportions and to determine a plausible set of underlying causes of death for impairments (Fig. 1).
Plausible underlying causes are restricted to causes that have years of life lost (YLLs) attributed to them rather than exclusively YLDs, i.e. causes from which a person can conceivably die. Proportions are calculated by dividing the number of cause-specific YLDs for a given impairment by the sum of YLDs across all causes for each age group, sex, location, and year. Locations with a star rating > 3 have country-specific proportions, while countries with a star rating ≤ 3 are assigned region-level proportions. GBD 2020 redistribution of anemia and pelvic inflammatory disease relied on the results of the non-fatal GBD 2019 estimation process. Proportions and underlying causes are then used as inputs to redistribute garbage-coded CoD data (Fig. 1). In the GBD 2020 study, 0.4% of garbage-coded deaths across all years were incorrectly assigned to impairments, rather than to the appropriate UCoD, prior to redistribution (Table 1).
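A minimal sketch of the impairment calculation for one age, sex, location, and year stratum is given below. It assumes that proportions are normalized over the causes eligible as redistribution targets (those with YLLs), so that they sum to 1; the function names, the placeholder cause label, and the YLD values are hypothetical.

```python
def impairment_proportions(ylds_by_cause, causes_with_ylls):
    """Share of an impairment's YLDs contributed by each cause that can be fatal."""
    # Restrict targets to causes from which a person can conceivably die.
    eligible = {c: y for c, y in ylds_by_cause.items() if c in causes_with_ylls}
    total = sum(eligible.values())
    if total == 0:
        raise ValueError("no eligible causes with YLDs for this impairment")
    return {c: y / total for c, y in eligible.items()}


def choose_proportions(star_rating, country_props, region_props):
    """Country-specific proportions for star rating > 3, region-level otherwise."""
    return country_props if star_rating > 3 else region_props


# Hypothetical anemia YLDs in one stratum.
country = impairment_proportions(
    {"chronic kidney disease": 300.0, "malaria": 100.0, "hypothetical non-fatal cause": 600.0},
    causes_with_ylls={"chronic kidney disease", "malaria"},
)
region = {"chronic kidney disease": 0.6, "malaria": 0.4}
used = choose_proportions(star_rating=2, country_props=country, region_props=region)
```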

Proportional redistribution
Unlike the other processes outlined above, where we use external data sources to define a set of proportions for redistribution, proportional redistribution reallocates garbage-coded deaths to be directly proportional to the distribution of plausible underlying causes of death in the non-garbage-coded deaths in the CoD data, separately by age group, sex, location, and year, as shown in Fig. 3.
The key assumption of proportional redistribution is that garbage coding is independent of underlying cause: every underlying cause targeted by proportional redistribution for a given garbage code is equally likely to be miscoded. We use this method when the distribution of non-garbage-coded deaths is plausible and there are enough non-garbage-coded deaths to inform the post-redistribution cause pattern (Fig. 1). For the least specific class 1 garbage-coded deaths, e.g., "all ill-defined," the set of plausible causes includes all non-garbage-coded deaths in the data, whereas for more detailed class 3 garbage codes, e.g., unspecified upper respiratory infections, the set of underlying causes is determined a priori based on clinical knowledge. Proportional redistribution was used for 11.6% of all ICD-9 and -10 coded VR deaths and 39.6% of garbage-coded deaths in CoD data from 1980 to 2019 (Table 1).
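Proportional redistribution can be sketched for a single stratum as follows. The sketch assumes the input holds one garbage code plus non-garbage causes only; passing `targets=None` mimics the class 1 case where every non-garbage cause is a target, while an explicit set mimics an a priori target list. The function name and counts are hypothetical.

```python
def proportional_redistribution(deaths, garbage_code, targets=None):
    """Reallocate garbage-coded deaths in proportion to observed target-cause deaths.

    deaths:  dict cause -> count for one age-sex-location-year stratum.
    targets: optional set of plausible underlying causes; None means all
             non-garbage causes present in the data.
    """
    garbage = deaths.get(garbage_code, 0)
    pool = {c: n for c, n in deaths.items()
            if c != garbage_code and (targets is None or c in targets)}
    pool_total = sum(pool.values())
    if pool_total == 0:
        raise ValueError("no non-garbage deaths available to inform redistribution")
    out = {c: float(n) for c, n in deaths.items() if c != garbage_code}
    for cause, n in pool.items():
        # Each target receives garbage deaths in proportion to its observed share.
        out[cause] += garbage * n / pool_total
    return out


stratum = {"all ill-defined": 100, "ischemic heart disease": 600,
           "stroke": 300, "road injuries": 100}
class1_style = proportional_redistribution(stratum, "all ill-defined")
restricted = proportional_redistribution(stratum, "all ill-defined",
                                         targets={"stroke", "road injuries"})
```

The error raised for an empty pool reflects the stated requirement that enough non-garbage-coded deaths exist to inform the post-redistribution pattern.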

Role of the funding source
The funders of the study had no role in study design, data collection, data analysis, data interpretation, or writing of the report. The corresponding author had full access to all the data in the study and final responsibility for the decision to submit for publication.

Results
The percentage of garbage-coded deaths out of all deaths in VR data varied widely across locations and by garbage code class. In VR data for the year 2015 (or the most recent year available by location), for example, deaths coded to major (class 1 or 2) garbage codes spanned a wider range across locations (from a low of 3.7% to a high of 67.3%) than deaths coded to more detailed (class 3 and 4) garbage codes, which ranged from 2.4% to 34.6% (Fig. 4). Additional stratification of the percentage of garbage-coded deaths for each class is presented in Additional file 1: Figure 16. Results in Fig. 4 are shown for the year 2015 to maximize data availability across locations, since the overall level of garbage coding does not change substantially over time (Additional file 1: Figure 5). There is also substantial subnational variation in the proportion of deaths coded to class 1 or 2 garbage codes. In 2015, subnational variation was largest in Russia, from 5.1% in the Jewish Autonomous Oblast to 27.7% in Rostov Oblast, and in Brazil, ranging from 8.5% in Espírito Santo to 29.5% in Bahia. Some countries, such as Japan, Norway, and the UK, had very little variation in the proportion of deaths coded to class 1 or 2 garbage codes, compared to countries with relatively more variation, such as the Philippines.
The age-standardized proportion of deaths coded to major garbage, out of all deaths, decreases as a location's Socio-Demographic Index (SDI) increases (Fig. 5). The SDI serves as an indicator of development status and is a value between 0 and 1 calculated from three components: fertility rate, income per capita, and average educational attainment. The SDI and its calculation are described in more detail elsewhere [33]. This relationship between SDI and age-standardized major garbage holds at the global level and in each GBD superregion, although it is less pronounced in some regions, such as sub-Saharan Africa. The age-standardized, rather than all-age, proportion of major garbage is the more useful metric for inter-country comparisons because the percentage of garbage-coded deaths is often higher in locations with larger elderly populations.
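The difference between the all-age and the age-standardized proportion can be illustrated with direct standardization, in which age-specific proportions are averaged using a fixed set of standard weights. This is a sketch under stated assumptions: the exact standardization procedure is not spelled out in this section, and the age groups, weights, and counts below are invented for the example.

```python
def all_age_proportion(deaths, garbage):
    """Garbage-coded deaths divided by all deaths, pooled over age groups."""
    return sum(garbage.values()) / sum(deaths.values())


def age_standardized_proportion(deaths, garbage, weights):
    """Weighted mean of age-specific garbage proportions.

    weights: {age_group: standard weight}; normalized here to sum to 1.
    """
    total_weight = sum(weights.values())
    return sum(
        (weights[age] / total_weight) * (garbage[age] / deaths[age])
        for age in weights
    )


# Hypothetical standard weights and counts, for illustration only.
weights = {"under_70": 0.7, "70_plus": 0.3}

# Both locations have the same age-specific garbage proportions (10% and 30%),
# but location B has a much larger share of deaths at older ages.
deaths_a = {"under_70": 100, "70_plus": 100}
garbage_a = {"under_70": 10, "70_plus": 30}
deaths_b = {"under_70": 100, "70_plus": 900}
garbage_b = {"under_70": 10, "70_plus": 270}

print(all_age_proportion(deaths_a, garbage_a), all_age_proportion(deaths_b, garbage_b))
print(
    age_standardized_proportion(deaths_a, garbage_a, weights),
    age_standardized_proportion(deaths_b, garbage_b, weights),
)
```

In this example the all-age proportion is higher in the location with the older age structure even though coding quality at each age is identical, while the age-standardized proportions agree.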
In addition to geographic variation, the garbage codes that comprise the most deaths vary across age groups. In those under 1 year of age in 2015, unspecified lower respiratory infections accounted for the largest proportion of garbage-coded deaths, out of all garbage-coded deaths, for both males and females (Fig. 6), compared to unspecified stroke for both sexes in the 50 to 79 age range. There was also some variation by sex and age: in those aged 80 and over, most garbage-coded deaths were attributable to unspecified lower respiratory infections in males, compared to unspecified stroke in females. While Fig. 6 depicts the most frequent garbage codes at the global level, there is notable variation in garbage code prevalence by location. More information on country-specific leading garbage codes can be found in Additional file 1: Figure 17. As with Fig. 4, results in both Fig. 6 and Fig. 7 are shown for the year 2015 to maximize data availability across locations.
The process of redistribution affects the number of deaths assigned to different causes, the cause fraction, and the corresponding mortality rates. The effect of redistribution can be large and results in changes in the rankings of the top causes of death by location, age, and sex. At the national level, redistribution of garbage codes can substantially change the rankings of the top 10, 20, and 50 causes of death. Figure 7 highlights the change in ranking by total deaths of the top 20 underlying causes before and after redistribution for Brazil, France, Japan, and the US in 2015, combined for all age groups and both sexes. These four countries were selected for illustrative purposes; underlying cause rankings for all other countries and territories estimated by GBD can be found in Additional file 1: Figure 18.
In Brazil, France, and the US, there were large increases in the rank of ischemic stroke after redistribution, from 31st to second, ninth to fourth, and 28th to fifth, respectively. Deaths due to diabetes mellitus type 2 increased 4.0-fold in the US and 10.6-fold in Brazil after redistribution. Notably, in Japan, Alzheimer's disease and other dementias rose from the ninth-ranked UCoD to the first in terms of number of deaths. Japan also saw large increases in the rank of deaths due to influenza, pneumococcal pneumonia, and other lower respiratory infections. Of our exemplars, the US is the only country where redistribution resulted in a large increase in the rank of drug use disorders, with opioid use disorders jumping from 141st to 16th following redistribution. France was the only country of the four to have an injuries-related cause move into the top 10 after redistribution, with deaths due to falls ranked sixth, increasing from 7,590 assigned deaths to 18,247 assigned deaths in 2015.

Discussion
We have described the four methods for redistributing garbage-coded VR deaths in the GBD: (1) multiple cause analysis, (2) negative correlation, (3) impairments, and (4) proportional redistribution (Fig. 1). Overall, the methods introduced here reflect an improvement in empiricism of redistribution methods; for less-detailed garbage, rather than relying on a priori selection of plausible underlying causes and proportions, we have sought out alternative methods and data sources. Notably, this study provides the first in-depth explanation of the incorporation of multiple cause data to inform redistribution for 32.2% of garbage-coded deaths in GBD 2020 (Table 1).
The change in ranking among the top 20 underlying causes of death by number of deaths before and after redistribution highlights the necessity of redistribution of garbage-coded deaths to understand a country's actual cause-specific mortality pattern (Fig. 7). This figure also captures the effects of misdiagnosis corrections, a process outside the scope of this paper that has been described previously [3]. Redistribution is not the ideal solution for the problem of garbage-coded deaths, however: ultimately, higher-quality CoD data in all locations is needed to provide accurate information on mortality patterns and inform public health decision making.
Fig. 7 caption (continued): The left-hand column is data before redistribution compared to data after redistribution in the right-hand column. Causes are connected by arrows before and after redistribution. Infectious diseases are shown in red, non-communicable diseases in blue, and injuries in green. This figure also captures additional corrections applied prior to redistribution, namely adjustments made for the misdiagnosis of Parkinson's, atrial fibrillation, and Alzheimer's disease and other dementias not discussed in detail in this paper (Additional file 1: Figure 1). Additionally, only real underlying causes are included in this figure. For that reason, one will not see "Garbage Code" listed in the deaths prior to redistribution.
Interventions to increase the quality of cause of death coding must be context-specific. We have shown that the proportion of major garbage has not varied dramatically over time (Additional file 1: Figure 5), but rather by SDI, with countries with lower SDI having higher proportions of age-standardized garbage (Fig. 5). When contrasting the proportion of major garbage versus more detailed garbage codes, however, there is substantial intra- and inter-country variation (Fig. 4). These deaths coded to classes 1 and 2 have the most substantial health policy implications, as they can mislead policy makers on the overall mortality composition in a population, as well as on the importance of various leading causes of death within a disease category [35]. Deaths coded to classes 3 and 4 can hamper prevention and treatment efforts because they do not distinguish between subtypes of a disease. National-level policy interventions have been shown to increase death registration (including ascertainment of a CoD) [51].
Specifically, enhanced training efforts led by the Bloomberg Data for Health Initiative, in which physicians and instructors leading ICD-compliant certification courses received targeted training, have dramatically increased the number of correctly completed death certificates in locations including the Philippines, Sri Lanka, and Peru [52]. Such interventions have decreased the number of deaths coded to class 1, 2, or 3 garbage; however, reducing deaths coded to the most specific, class 4 garbage often requires more expensive medical technology. Diagnosis of ischemic versus hemorrhagic stroke, for example, requires computed tomography scanners, which are often unavailable in low-resource settings [53].
The methods described here have a number of limitations. First, the scope of this paper is limited to countries sharing VR data for use in the GBD; countries without color in Fig. 4 are therefore excluded from the methods presented here. Second, in the majority of the multiple cause analysis models, the primary explanatory variable used to predict the proportion of intermediate-cause-related deaths for all GBD-estimated locations is the HAQ Index. Including additional explanatory covariates, additional sources of multiple cause data to support these covariates, and empirical covariate selection is crucial for strengthening the predictive validity of estimates. Third, the multiple cause analysis has circular dependencies, as the proportions used to redistribute garbage-coded deaths rely on GBD cause-specific mortality estimates. If, for example, our redistribution proportions for unspecified heart failure overestimate mortality due to a given CoD, then the overall level of estimated mortality will increase for that cause, and this effect will be perpetuated in subsequent GBD rounds. Solutions to reduce the circularity in the generation of results are being explored. Fourth, in the case of proportional redistribution, we make a strong assumption that the assignment of garbage codes is independent of the underlying cause. We hope to improve this method in the future by incorporating data in which death certificates are linked with hospital admissions. Lastly, we acknowledge that the term "garbage" codes may be viewed as punitive. Renaming has been discussed within the GBD; however, for this manuscript we have opted to retain the term for consistency with other publications on this topic.
In addition to continually seeking out additional multiple cause of death data, we are currently working to improve the methods used to redistribute the unspecified injury garbage codes X59 and Y34. We are in the process of implementing machine-learning algorithms to improve upon the algebra-based method described above for generating cause-, age-, sex-, and year-specific redistribution proportions for X59 and Y34. Furthermore, we would like to align our measure of data quality with the more comprehensive Vital Statistics Performance Index (VSPI). While VSPI and the current star rating of data quality both incorporate measures of completeness and proportion of garbage-coded deaths, VSPI includes additional measures such as the proportion of deaths without age or sex detail and the timeliness of data reporting. Producing VSPI as a data quality indicator would also align the GBD with other efforts to produce comparable metrics of data quality [54]. Lastly, we welcome future collaborations to analyze country-specific explanations behind many of the descriptive analyses produced here.

Conclusions
In an ideal world, CoD certification and coding practices would be consistent and accurate across space and time, and there would be no need for garbage code redistribution. In the absence of such standardized practices, the GBD uses redistribution methods on garbage-coded deaths in order to provide the most comprehensive set of cause of death-specific mortality estimates and enable precision in public health decision making. These methods continue to be updated and improved as new strategies and data sources become available.