Skip to main content

Dealing with missing data in laboratory test results used as a baseline covariate: results of multi-hospital cohort studies utilizing a database system contributing to MID-NET® in Japan

Abstract

Background

To evaluate missing data methods applied to laboratory test results used for confounding adjustment, utilizing data from 10 MID-NET®-collaborative hospitals.

Methods

Using two scenarios, five methods dealing with missing laboratory test results were applied, including three missing data methods (single regression imputation (SRI), multiple imputation (MI), and inverse probability weighted (IPW) method). We compared the point estimates of adjusted hazard ratios (aHRs) and 95% confidence intervals (CIs) between the five methods. Hospital variability in missing data was considered using the hospital-specific approach and overall approach. Confounding adjustment methods were propensity score (PS) weighting, PS matching, and regression adjustment.

Results

In Scenario 1, the risk of diabetes due to second-generation antipsychotics was compared with that due to first-generation antipsychotics. The aHR adjusted by PS weighting using SRI, MI, and IPW by the hospital-specific-approach was 0.61 [95%CI, 0.39–0.96], 0.63 [95%CI, 0.42–0.93], and 0.76 [95%CI, 0.46–1.25], respectively. In Scenario 2, the risk of liver injuries due to rosuvastatin was compared with that due to atorvastatin. Although PS matching largely contributed to differences in aHRs between methods, PS weighting provided no substantial difference in point estimates of aHRs between SRI and MI, similar to Scenario 1. The results of SRI and MI in both scenarios showed no considerable changes, even upon changing the approaches considering hospital variations.

Conclusions

SRI and MI provide similar point estimates of aHR. Two approaches considering hospital variations did not markedly affect the results. Adjustment by PS matching should be used carefully.

Peer Review reports

Background

In April 2018, the operation of the Medical Information Database Network (MID-NET®) began as a national project aimed at utilizing real-world data for drug safety assessments in Japan [1,2,3,4]. MID-NET® is a database systems network consisting of 10 collaborative organizations (23 collaborative hospitals) that can analyze data derived from claim data, diagnosis procedure combination data, and electronic medical record (EMR) data at the individual-level [2].

Laboratory test results derived from EMR data have detailed information on clinical symptoms [5] and are expected to be used for confounding adjustments in drug safety assessments. However, the appropriate use of these test results is difficult since a number of data obtained during routine medical care may be missing in datasets for analysis. Therefore, it is essential to appropriately select and apply existing methods to reduce bias due to missing data (hereinafter referred to as “missing data methods”).

Choosing a missing data method requires an understanding of the missing data sources and missing data mechanism [6,7,8]. Raebel et al., in their study using the US Food and Drug Administration Mini-Sentinel Distributed Database [6], reported that missing data sources of laboratory test results include testing outside of contracted laboratories, patient location where testing was conducted, patient clinical features, etc. In the study of the 10 MID-NET®-collaborative hospitals of the Tokushukai Medical Group [9], authors reported that patient background and setting of ordering system of laboratory tests were the main missing data sources. Although their impact was not evaluated because they were unobserved, the measurement policies by physicians and institutions were considered as potential sources as well. Missing data mechanism can be classified as missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR) based on the relationship of missing data probability with missing data and observed values [10] (See Table 1).

Table 1 Missing data mechanisms and their examples

In the utilization of laboratory test results contained in MID-NET® to confounding adjustments, it is not clear what impact different missing data methods have on the result. Furthermore, owing to the differences between hospitals in terms of the proportion of patients with missing data (hereinafter, “missing proportion”) and the relationship between missing data and patient background [9], hospital variations regarding missing data should be well-considered.

In this study, the application of missing data methods was carried out to two scenarios of drug safety assessment adjusted by one laboratory test item. We evaluated the impact that missing data methods/approaches to hospital variations had on the interpretation of effect estimates and results.

Methods

Study’s scope

We considered a drug safety assessment to estimate an exposure effect using the Cox proportional-hazards model, which is commonly used in cohort designs. Assuming one laboratory test item as a confounder, we applied five missing methods.

Database and target hospitals

This study used the database system for MID-NET®-collaborative organizations of Tokushukai Medical Group (hereinafter, “Tokushukai database”), which has the largest number of MID-NET®- collaborative hospitals (10 hospitals; Supplementary Table S1). Supplementary Table S2 demonstrates the data items used for analysis.

Definition of missing data

As per a previous study, missing data were defined as “data that would be meaningful for the analysis but not available during a specific period before the first prescription date” [9]. The specific period, based on the results of this prior study, was set to 90 days.

Missing methods

MAR, the reasoning behind which is explained in Missing methods section, was assumed as the missing data mechanism in this study. The following four methods were considered as missing data methods providing unbiased results (hereinafter, “MAR-based methods”) when both the MAR assumption and the correct model specification used in the missing data method (hereinafter collectively referred to as “missing data models”) were correct: single regression imputation (SRI), multiple imputation (MI), inverse probability weighted (IPW) method, and likelihood-based method [11,12,13,14]. In the MID-NET® database system, SAS version 9.4 (SAS Institute Inc., Cary, NC, USA) can be used. Since SRI, MI, and IPW methods are implemented in SAS version 9.4, we adopted these MAR-based methods.

In this study, five missing methods were applied in order to evaluate their impact on the effect estimation of the outcome model. The methods included the above three MAR-based methods, a method excluding a laboratory test item from baseline covariates (Exclusion), and a complete case (CC) method (Table 2). Table 2 provides a brief description of the three MAR-based methods [7, 15,16,17,18,19,20,21,22,23] and the settings of this study. In the application of MAR-based methods, two approaches were adopted to consider hospital variations in missing data. The hospital-specific-approach implements a missing data model within each hospital cohort. The overall-approach, consequently, implements a missing data model to the overall cohort (combined hospitals cohorts) and uses hospital as a fixed effect covariate (Fig. 1).

Table 2 Overview of the five missing methods
Fig. 1
figure 1

Overview of the two approaches considering hospital variations. Abbreviations: aHR, adjusted hazard ratio. † In the overall approach, the effects of missing data sources at the hospital level, including unobserved missing data sources, can be considered as the hospital effect of the fixed effect, if not all. ‡ In the hospital-specific approach, the interaction of patient background factors with hospitals can be considered

Confounding adjustment and outcome model

For the confounding adjustment methods, we adopted two propensity score (PS) methods (PS weighting and PS matching) as well as an outcome model method with confounding factors as covariates (regression adjustment). The confounding factors, other than the laboratory test item, were patient related factors (see Table 3) and the corresponding hospitals. Since the target population in the scenarios of this study is an exposed population, the standardized mortality ratio weighting (SMRW) [27] was used for PS weighting.

Table 3 Details of scenarios 1 and 2

Then, the point estimate of adjusted hazard ratio (aHR) and 95% confidence interval (CI) were calculated using the Cox proportional-hazard model. In PS matching, stratified Cox proportional-hazard model by matched pairs was used. Robust standard error was used to calculate 95% CI. In the combination of PS methods and MI, aHR estimation was performed for m imputed datasets, and m estimates were combined (“within approach”) [34] (see Supplementary Fig. S1).

Scenarios

Based on our previous research [9], we selected the scenarios of two cohort studies that evaluated the relation between drugs and their known risks (see Supplementary Fig. S2). Scenario 1 was the risk of diabetes associated with antipsychotic drug use. In Scenario 1, the blood glucose level before prescription of antipsychotic drugs was the target laboratory test item. Scenario 2 was the risk of hepatic injury associated with HMG-CoA reductase inhibitors (statins) use. In Scenario 2, the target laboratory test item was either the alkaline phosphatase (ALP), alanine transaminase (ALT), low-density lipoprotein cholesterol (LDL-chol), or triglyceride (TG) levels before statins prescription. In Scenario 2, four laboratory test items were included in the baseline covariates of the outcome model, and CC method, SRI, MI, and IPW method were set for each laboratory test item. Table 3 shows the detailed settings of the scenarios, including cohort sizes and the number of complete cases.

Protocol approval and statistical analysis

Our study protocol was approved by the Kyoto University Graduate School, Faculty of Medicine, and Kyoto University Hospital Ethics Committee in November 2018 (R1793). Statistical analyses were performed using SAS version 9.4 (SAS Institute, Cary, NC, USA).

Results

Overall cohort sizes and the numbers of complete cases were summarized in Table 3 and Supplementary Figs. S3 and S4. Patient backgrounds including incidence rate are demonstrated in Supplementary Tables S3-S6 (Tables S3 and S5 for overall cohorts and complete cases, and Tables S4 and S6 for hospital cohorts).

Scenario 1

Confounding adjustment by PS weighting

The aHR of Exclusion was 0.52 (95% CI, 0.34–0.81) (Fig. 2). CC method, SRI, MI, and IPW method contained blood glucose level as a covariate. The aHRs of SRI and MI by the hospital-specific-approach were 0.61 (95% CI, 0.39–0.96) and 0.63 (95% CI, 0.42–0.93), respectively; the point estimates were relatively close, but the width of the 95% CI was slightly narrower in MI. Conversely, the aHRs of the CC and IPW methods by hospital-specific-approach were 0.78 and 0.76, respectively, and thus higher than those of SRI and MI.

Fig. 2
figure 2

Scenario1 (the risk of diabetes associated with SGA compared to FGA use). Hazard ratios from outcome models with and without baseline blood glucose (SRI, MI, and IPW method were applied by hospital-specific-approach). Abbreviations: aHR, adjusted hazard ratio; CC, complete cases; CI, confidence interval; IPW, inverse probability weighted; MI, multiple imputation; PS, propensity score; SRI, single regression imputation. †The sample size in PS matching with MI gives the mean of 10 matched samples

The aHRs of SRI, MI, and IPW method by the overall-approach showed no substantial differences compared with the estimates by the hospital-specific-approach (Fig. 3).

Fig. 3
figure 3

Scenario1 (the risk of diabetes associated with SGA use compared to FGA use). The difference in hazard ratios between approaches considering hospital variations. Abbreviations: aHR, adjusted hazard ratio; CI, confidence interval; IPW, inverse probability weighted; MI, multiple imputation; PS, propensity score; SRI, single regression imputation. †The sample size in PS matching with MI gives the mean of 10 matched samples

Confounding adjustment by PS matching

The aHRs of SRI and MI by the hospital-specific-approach were 1.10 (95% CI, 0.72–1.66) and 1.01 (95% CI, 0.34–2.97), respectively. Although there were no substantial differences in point estimates, the width of the 95% CI was larger in MI (Fig. 2). The aHR of the IPW method was 1.31 by the hospital-specific-approach but reduced to 0.94 by the overall-approach (Fig. 3).

Regression adjustment

The aHRs of SRI and MI by the hospital-specific-approach were relatively close in terms of the point estimates and the 95% CIs (Fig. 2).

Scenario 2

Confounding adjustment by PS weighting

The aHR of Exclusion was 1.32 (95% CI, 0.83–2.08). CC method, SRI, MI, and IPW method included each laboratory test item as a covariate. The aHRs of SRI and MI by hospital-specific-approach varied between 1.20 and 1.32 through all laboratory test items (Fig. 4). While there were no substantial differences in the point estimates of SRI and MI, the width of the 95% CI was slightly narrower in MI; for example, the aHR of SRI and MI for LDL-chol were 1.24 (95% CI, 0.78–1.96) and 1.20 (95% CI, 0.84–1.71), respectively. The aHR of IPW method for ALT was 1.21, whereas those for ALP and LDL-chol were close to 1. The aHRs of the CC method varied depending on the type of laboratory test item. Accordingly, the aHRs for ALT and LDL-chol were 1.29 and 1.08, respectively.

Fig. 4
figure 4

Scenario2 (the risk of hepatic injury associated with rosuvastatin use compared to atorvastatin use). Hazard ratios from outcome models with and without baseline ALT, ALP, LDL-chol, or TG (SRI, MI, and IPW method were applied by hospital-specific-approach). Abbreviations: aHR, adjusted hazard ratio; ALT, alanine transaminase; ALP, alkaline phosphatase; CC, complete cases; CI, confidence interval; IPW, inverse probability weighted; LDL-chol, low-density lipoprotein cholesterol; MI, multiple imputation; TG, triglyceride; PS, propensity score; SRI, single regression imputation. † Not shown in the forest plot if the aHR is 8.0 or more. ‡The sample size in PS matching with MI gives the mean of 10 matched samples

The aHRs of SRI and MI by the overall-approach ranged between 1.19 and 1.32 through all laboratory test items (Fig. 5). For some laboratory test items, especially ALP, the aHR of MI tended to be higher than those determined by the hospital-specific-approach.

Fig. 5
figure 5

Scenario2 (the risk of hepatic injury associated with rosuvastatin use compared to atorvastatin use). The difference in hazard ratios between approaches considering the hospital variations. Abbreviations: aHR, adjusted hazard ratio; ALT, alanine transaminase; ALP, alkaline phosphatase; CI, confidence interval; IPW, inverse probability weighted; LDL-chol, low-density lipoprotein cholesterol; MI, multiple imputation; TG, triglyceride; PS, propensity score; SRI, single regression imputation. † Not shown in the forest plot if the aHR is 8. or more. ‡The sample size in PS matching with MI gives the mean of 10 matched samples

Confounding adjustment by PS matching

For all laboratory test items, the differences in aHR between methods were higher and the width of 95% CI was greater than the values obtained with other confounding adjustment methods.

Regression adjustment

The aHRs of SRI and MI by the hospital-specific-approach ranged from 1.22 to 1.28 throughout all laboratory test items. For the results in each laboratory test item, both the point estimates and the 95% CIs were relatively close in SRI and MI (Fig. 4). However, the aHR of the IPW method was lower than those of SRI and MI.

Discussion

We evaluated the impact of five missing methods, approaches to hospital variations, and confounding adjustment methods on the effect estimation in two different scenarios. In Missing methods section of Discusssion and Approaches considering hospital variations sections, we excluded discussions on PS matching in Scenario 2 were excluded and were, instead, included in Confounding adjustment methods section.

Missing methods

SRI versus MI

Although the setting is different, Marshall et al. [35] reported that the bias in SRI and MI, using the same covariates in imputation models, is the same when the missing proportion of a continuous variable is around 25%, and that there was no substantial difference in bias even if the missing proportion increased to 50%. In this study, we had no difference in the point estimates of aHRs between SRI and MI. This might be due to the fact that we used the same covariates in imputation models.

However, for 95% CIs, we found some differences between SRI and MI in confounding adjustment with PS methods (PS weighting of Scenarios 1 and 2 and PS matching of Scenario 1). This will be further explained in Confounding adjustment methods section. Although SRI is generally described as underestimating SE [36], the reason that this study did not show a clear underestimation of SE regardless of the type of confounding adjustment method might be due to the targeted single laboratory test item.

IPW method versus imputation methods (SRI and MI)

The degree of difference in the point estimates of aHRs between the IPW method and the imputation methods varied depending on laboratory test items. In particular, there were large differences in Scenarios 1 and 2 adjusted by blood glucose level and ALP, respectively. This might be attributed to the large weights in the IPW method. In the analyses of blood glucose level and ALP, some hospital cohorts had IPWs of 20 or higher. Large weights can contribute to estimation instability. Therefore, it is important to check the stability of the effect estimation using the truncation of large weights [23] as a sensitivity analysis.

CC and exclusion methods

In the CC method, large differences in aHRs from those in the MAR-based methods were observed, especially in Scenario 1 adjusted by blood glucose level and in Scenario 2 adjusted by LDL-chol. When the standardized mean differences (SMD) [37] between complete cases and overall cohorts were calculated for patient background factors, factors with SMD of 0.1 or higher existed in Scenarios 1 and 2 (complete cases for LDL-chol) (Supplementary Tables S3 and S5). Such factors were considered to have non-negligible differences in patient backgrounds [38, 39], and the aHR difference between the CC method and the missing data methods was due to the fact that complete cases did not represent overall cohorts.

In the method excluding the laboratory test item in Scenario 1, a relatively large difference in aHR from those in the MAR-based methods occurred. However, in some Scenario 2 cases, the difference was relatively small and no noticeable difference in the analysis for ALT existed. This might be attributed to the fact that confounding by ALT was small because mean ALTs were the same for rosuvastatin and atorvastatin users.

Assumption of missing data mechanism

We considered the missing data mechanism in this study was a mixture of MAR and MNAR. Applying the missing data method based on the MNAR assumption requires extensive modeling of the missing data process. In such a case, the MAR-based method may be used as the main analysis method. Then, to evaluate the stability of the main results with the MAR assumption, the pre-planned sensitivity analysis should be considered [40].

When applying the MAR-based methods, the validity of the MAR assumption should be relatively increased by considering as many factors affecting the missing data as possible increases [41, 42]. In this study, all patient background characteristics were included in the missing data model. However, other missing data sources, such as the settings in the ordering system for laboratory tests, the measurement policy of tests, and the preferences of doctors were unobserved and could not be included in the missing data models.

Approaches considering hospital variations

The results of SRI and MI, and some of those in the IPW method showed no noticeable change in aHR due to different approaches. In the overall approach, hospitals were included in the missing data model as a fixed variable and were expected to capture some of the effects of unobserved missing data sources at the hospital level. In the hospital-specific approach, interactions between patient backgrounds and hospitals can be considered. It is not known whether the hospital effect can be explained by the fixed effect. Additionally, the presence or absence of interaction has not been evaluated, but the results of this study indicate that the difference in approach may not significantly affect the difference in aHRs.

Some changes in aHRs due to the difference in approach in the IPW method may have been caused by the change in the distribution of IPW. In the analysis for ALP in Scenario 2 of the hospital-specific-approach, unlike the overall-approach, some patients with an IPW of 20 or higher existed.

Confounding adjustment methods

Confounding adjustment methods affected the results in two ways. First, in PS methods (especially PS matching), there was a difference in 95% CIs between SRI and MI. For binary outcomes, Granger et al. [43] reported that coverage of 95% CI was too high in the PS matching compared to the PS weighting using SMRW due to the overestimation of variances. A combination of the PS matching and MI with the “within approach” may have affected the difference in 95% CIs in this study as well.

Second, in the analysis using PS matching, the degree of variation in aHRs and the ranges of 95% CI between missing data methods were large. One of the reasons was attributed to the decrease in the number of patients to be analyzed (Figs. 2 and 3). In Scenario 2, the incidence rate was low (Supplementary Table S5), and the impact of decreasing sample size was large. PS matching should be used carefully, considering the results from this study and its features (e.g., the target population will no longer be an exposed population if all patients in the exposed population do not have matched controls).

In scenario 1, unlike a previous study [44], the increased risk of diabetes associated with SGA use was not observed. In our study, although we did not confirm the frequency of laboratory measurements during the follow-up period, there were differences between groups in the missing proportion before drug prescription (SGA users: 20.6%, FGA users: 9.2%), suggesting that a detection bias may have existed and affected the results. In scenario 2, the point estimates of aHR were around or over 1, consistent with previous studies [32, 33].

When selecting scenarios and laboratory test items, we also considered the feasibility of the number of events; thus, it was not covered in this study, but the following case can also be a typical scenario under which missing laboratory values occur in the MID-NET® based on the knowledge from our previous study [9]; the scenario using laboratory tests with restrictions on the implementation interval based on the health insurance system (e.g., measurement can only be performed once every 3 months) that may be unique to the Japanese medical environment. It is important to Considering the impact and characteristics of missing data is important when planning research.

There are two main limitations to this study. The first is that the scenarios and laboratory test items were limited. Since we used five missing methods, two approaches for considering hospital variations, and three confounder adjustment methods, we had to limit the scenarios and the number of laboratory test items. As more than one laboratory test item can be a confounding factor, further studies focusing on such situations are needed. The second is that we only examined the Tokushukai Database. When applying the missing data method to the entire MID-NET®, one must refer to the findings of this study considering the differences from the Tokushukai Database regarding the target population and missing data sources.

Conclusions

Based on the five main findings of this study (Table 4), we concluded that, although the different missing methods may contribute to differences in parameter estimates of the outcome model, SRI and MI can provide similar point estimates, and two approaches considering hospital variations do not have a major impact on the results. Confounding adjustment by PS matching gave unstable point estimates and wide confidence intervals and should therefore be used carefully. Although we report findings based on a case study and cannot draw generalizable recommendation, our research results may help in the selection of missing data imputation methods and the interpretation of obtained results in the future utilization of MID-NET®.

Table 4 Main findings in this study

Availability of data and materials

The datasets generated and/or analyzed during the current study are not publicly available due to the terms of use for MID-NET® to which we adhered when conducting this study; the accessibility of the dataset used for this analysis is restricted to specific authors including the corresponding author in a predetermined secure environment. No outside researchers are allowed to access the dataset. This study used the database system for MID-NET®-collaborative organizations of the Tokushukai Medical Group, a part of MID-NET®, and not the entire MID-NET®. However, we followed the terms of use for MID-NET®, as the datasets were included in the entire MID-NET®.

Abbreviations

aHR:

Adjusted hazard ratio

ALP:

Alkaline phosphatase

ALT:

Alanine transaminase

CC:

Complete case

CI:

Confidence interval

EMR:

Electronic medical record

FGA:

First-generation antipsychotic

HbA1c:

Hemoglobin A1c

ICD:

International Classification of Diseases

IPW:

Inverse probability weighted

JDS:

The Japan Diabetes Society

LDL-chol:

Low-density lipoprotein cholesterol

MAR:

Missing at random

MCAR:

Missing completely at random

MNAR:

Missing not at random

MI:

Multiple imputation

NGSP:

The National Glycohemoglobin Standardization Program

PS:

Propensity score

SGA:

Second-generation antipsychotic

SMD:

Standardized mean differences

SMRW:

Standardized mortality ratio weighting

SRI:

Single regression imputation

TG:

Triglyceride

References

  1. Yamada K, Itoh M, Fujimura Y, et al. The utilization and challenges of Japan’s MID-NET® medical information database network in postmarketing drug safety assessments: a summary of pilot pharmacoepidemiological studies. Pharmacoepidemiol Drug Saf. 2019;28(5):601–8.

    Article  PubMed  Google Scholar 

  2. Yamaguchi M, Inomata S, Harada S, et al. Establishment of the MID-NET® medical information database network as a reliable and valuable database for drug safety assessments in Japan. Pharmacoepidemiol Drug Saf. 2019;28(10):1395–404.

    Article  PubMed  PubMed Central  Google Scholar 

  3. Pharmaceuticals and Medical Devices Agency. Summary of MID-NET® study: no. 2018-001. 2020. https://www.pmda.go.jp/files/000233987.pdf. Accessed 7 May 2022.

  4. Pharmaceuticals and Medical Devices Agency. Summary of MID-NET® study: no. 2018-002. 2020. https://www.pmda.go.jp/files/000234446.pdf. Accessed 7 May 2022.

  5. Schneeweiss S, Rassen JA, Glynn RJ, et al. Supplementing claims data with outpatient laboratory test results to improve confounding adjustment in effectiveness studies of lipid-lowering treatments. BMC Med Res Methodol. 2012;12:180.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Raebel MA, Shetterly S, Lu CY, et al. Methods for using clinical laboratory test results as baseline confounders in multi-site observational database studies when missing data are expected. Pharmacoepidemiol Drug Saf. 2016;25(7):798–814.

    Article  CAS  PubMed  Google Scholar 

  7. Wells BJ, Chagin KM, Nowacki AS, Kattan MW. Strategies for handling missing data in electronic health record-derived data. EGEMS (Wash DC). 2013;1(3):1035.

    PubMed  Google Scholar 

  8. Eekhout I, de Boer RM, Twisk JW, de Vet HC, Heymans MW. Missing data: a systematic review of how they are reported and handled. Epidemiology. 2012;23(5):729–32.

    Article  PubMed  Google Scholar 

  9. Komamine M, Fujimura Y, Nitta Y, Omiya M, Doi M, Sato T. Characteristics of hospital differences in missing of clinical laboratory test results in a multi-hospital observational database contributing to MID-NET® in Japan. BMC Med Inform Decis Mak. 2021;21(1):181.

    Article  PubMed  PubMed Central  Google Scholar 

  10. Molenberghs G, Fitzmaurice G, Kenward MG, Tsiatis A, Verbeke G. Handbook of missing data methodology. Boca Raton: CRC Press; 2014.

    Book  Google Scholar 

  11. Herring AH, Ibrahim JG. Likelihood-based methods for missing covariates in the Cox proportional hazards model. J Am Stat Assoc. 2001;96(453):292–302.

    Article  Google Scholar 

  12. Donders AR, van der Heijden GJ, Stijnen T, Moons KG. Review: a gentle introduction to imputation of missing values. J Clin Epidemiol. 2006;59(10):1087–91.

    Article  PubMed  Google Scholar 

  13. Perkins NJ, Cole SR, Harel O, et al. Principled approaches to missing data in epidemiologic studies. Am J Epidemiol. 2018;187(3):568–75.

    Article  PubMed  Google Scholar 

  14. Dziura JD, Post LA, Zhao Q, Fu Z, Peduzzi P. Strategies for dealing with missing data in clinical trials: from design to analysis. Yale J Biol Med. 2013;86(3):343–58.

    PubMed  PubMed Central  Google Scholar 

  15. Little RJA, Rubin DB. Statistical analysis with missing data. 2nd ed. Hoboken: Wiley; 2002.

    Book  Google Scholar 

  16. Baraldi AN, Enders CK. An introduction to modern missing data analyses. J Sch Psychol. 2010;48(1):5–37.

    Article  PubMed  Google Scholar 

  17. Rubin DB. Inference and missing data. Biometrika. 1976;63:581–92.

    Article  Google Scholar 

  18. Rubin DB. Multiple imputation for nonresponse in surveys. New York: Wiley; 1987.

    Book  Google Scholar 

  19. Graham JW, Olchowski AE, Gilreath TD. How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prev Sci. 2007;8(3):206–13.

    Article  PubMed  Google Scholar 

  20. Bounthavong M, Watanabe JH, Sullivan KM. Approach to addressing missing data for electronic medical records and pharmacy claims data research. Pharmacotherapy. 2015;35(4):380–7.

    Article  PubMed  Google Scholar 

  21. van Buuren S. Flexible imputation of missing data. CRC Press; 2018.

  22. Wang CY, Chen HY. Augmented inverse probability weighted estimator for Cox missing covariate regression. Biometrics. 2001;57(2):414–9.

    Article  CAS  PubMed  Google Scholar 

  23. Seaman SR, White IR. Review of inverse probability weighting for dealing with missing data. Stat Methods Med Res. 2013;22(3):278–95.

    Article  PubMed  Google Scholar 

  24. Von Elm E, Altman DG, Egger M, et al. The strengthening the reporting of observational studies in epidemiology (STROBE) Statement: guidelines for reporting observational studies. Ann Intern Med. 2007;147(8):573–7.

    Article  Google Scholar 

  25. White IR, Royston P. Imputing missing covariate values for the Cox model. Stat Med. 2009;28(15):1982–98.

    Article  PubMed  PubMed Central  Google Scholar 

  26. Xu Q, Paik MC, Rundek T, Elkind MS, Sacco RL. Reweighting estimators for Cox regression with missing covariate data: analysis of insulin resistance and risk of stroke in the Northern Manhattan Study. Stat Med. 2011;30(28):3328–40.

    Article  PubMed  PubMed Central  Google Scholar 

  27. Sato T, Matsuyama Y. Marginal structural models as a tool for standardization. Epidemiology. 2003;14(6):680–6.

    Article  PubMed  Google Scholar 

  28. Newcomer JW, Haupt DW, Fucetola R, et al. Abnormalities in glucose regulation during antipsychotic treatment of schizophrenia. Arch Gen Psychiatry. 2002;59(4):337–45.

    Article  CAS  PubMed  Google Scholar 

  29. Lindenmayer JP, Nathan AM, Smith RC. Hyperglycemia associated with the use of atypical antipsychotics. J Clin Psychiatry. 2001;62Suppl23:30–8.

    Google Scholar 

  30. Crestor (rosuvastatin) [package insert]. Osaka: AstraZeneca K.K.; 2022. https://www.pmda.go.jp/PmdaSearch/iyakuDetail/ResultDataSetPDF/670227_2189017F1022_1_28. Accessed 7 May 2022.

  31. Lipitor (atorvastatin) [package insert]. Tokyo: Viatris Inc; 2021. https://www.pmda.go.jp/PmdaSearch/iyakuDetail/ResultDataSetPDF/671450_2189015F1023_2_01. Accessed 7 May 2022.

  32. Clarke AT, Johnson PC, Hall GC, Ford I, Mills PR. High dose atorvastatin associated with increased risk of significant hepatotoxicity in comparison to simvastatin in UK GPRD cohort. PLoS One. 2016;11(3):e0151587.

    Article  PubMed  PubMed Central  Google Scholar 

  33. Chang CH, Chang YC, Lee YC, Liu YC, Chuang LM, Lin JW. Severe hepatic injury associated with different statins in patients with chronic liver disease: a nationwide population-based cohort study. J Gastroenterol Hepatol. 2015;30(1):155–62.

    Article  CAS  PubMed  Google Scholar 

  34. Leyrat C, Seaman SR, White IR, et al. Propensity score analysis with partially observed covariates: how should multiple imputation be used? Stat Methods Med Res. 2019;28(1):3–19.

    Article  PubMed  Google Scholar 

  35. Marshall A, Altman DG, Holder RL. Comparison of imputation methods for handling missing covariate data when fitting a Cox proportional hazards model: a resampling study. BMC Med Res Methodol. 2010;10:112.

    Article  PubMed  PubMed Central  Google Scholar 

  36. Little RJA, Rubin DB. Statistical analysis with missing data. Wiley; 1987, pp.44–47.

  37. Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med. 2009;28(25):3083–107.

    Article  PubMed  PubMed Central  Google Scholar 

  38. Normand ST, Landrum MB, Guadagnoli E, et al. Validating recommendations for coronary angiography following an acute myocardial infarction in the elderly: a matched analysis using propensity scores. J Clin Epidemiol. 2001;54(4):387–98.

    Article  CAS  PubMed  Google Scholar 

  39. Cappabianca G, Mariscalco G, Biancari F, et al. Safety and efficacy of prothrombin complex concentrate as first-line treatment in bleeding after cardiac surgery. Crit Care. 2016;20:5.

    Article  PubMed  PubMed Central  Google Scholar 

  40. E9 (R1): addendum to statistical principles for clinical trials on choosing appropriate estimands and defining sensitivity analyses in clinical trials. https://database.ich.org/sites/default/files/E9-R1_Step4_Guideline_2019_1203.pdf. Accessed 7 May 2022.

  41. Beaulieu-Jones BK, Lavage DR, Snyder JW, Moore JH, Pendergrass SA, Bauer CR. Characterizing and managing missing structured data in electronic health records: data analysis. JMIR Med Inform. 2018;6(1):e11.

    Article  PubMed  PubMed Central  Google Scholar 

  42. van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med. 1999;18(6):681–94.

    Article  PubMed  Google Scholar 

  43. Granger E, Sergeant JC, Lunt M. Avoiding pitfalls when combining multiple imputation and propensity scores. Stat Med. 2019;38(26):5120–32.

    Article  PubMed  PubMed Central  Google Scholar 

  44. Smith M, Hopkins D, Peveler RC, Holt RI, Woodward M, Ismail K. First- v. second-generation antipsychotics and risk for diabetes in schizophrenia: systematic review and meta-analysis. Br J Psychiatry. 2008;192(6):406–11.

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

We thank Dr. Masaaki Doi, Dr. Yoshiaki Uyama, and Tokushukai Medical Group for their valuable assistance. Views expressed here are those of the authors and do not necessarily represent the official views and findings of the Pharmaceuticals and Medical Devices Agency. This research was funded by the Agency for Medical Research and Development (AMED) under Grant Number JP21lk0201702.

Funding

This research was supported by the Japan Agency for Medical Research and Development (AMED) under Grant Number JP21lk0201702. The funding bodies played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Author information

Authors and Affiliations

Authors

Contributions

MK, MO, and TS conceptualized the study. MK analyzed the data. MK wrote the initial draft of the manuscript. MK, YF, MO, and TS contributed to the interpretation of findings and manuscript revisions. All authors have read and approved the final version of the manuscript.

Corresponding author

Correspondence to Maki Komamine.

Ethics declarations

Ethics approval and consent to participate

The study was conducted in accordance with the Declaration of Helsinki and approved by the Kyoto University Graduate School and Faculty of Medicine Kyoto University Hospital Ethics Committee in November 2018 (R1793). Based on Japanese Ethical Guidelines for Medical and Health Research Involving Human Subjects, the need for informed consent from individual patients was waived by the Kyoto University Graduate School and Faculty of Medicine Kyoto University Hospital Ethics Committee. The use of extracted data for this study from the database system for MID-NET®-collaborative organizations of Tokushukai Medical Group was approved by the administrative board of General Incorporated Association Tokushukai, which is the data holder. The data used in this study were anonymized before their use (refer to the previous study by Yamaguchi et al. [2] for details of the data anonymization).

Consent for publication

Not applicable.

Competing interests

Maki Komamine is employed by the Pharmaceuticals and Medical Devices Agency and has no financial or personal relationships with other people or organizations that could inappropriately influence or bias the contents of this paper. Other authors have no financial or personal relationships with other people or organizations that could inappropriately influence or bias the contents of this paper.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Supplementary Table S1.

List of the 10 hospitals in the database system for MID-NET® collaborative organizations of Tokushukai Medical Group. Supplementary Table S2. Data items used in this study. Supplementary Table S3. Scenario 1. Patient backgrounds among the cohort, complete cases, and cases with missing data. Supplementary Table S4. Scenario 1. Patient backgrounds by hospital. Supplementary Table S5. Scenario 2. Patient backgrounds among the cohort, complete cases, and cases with missing data. Supplementary Table S6. Scenario 2. Patient backgrounds by hospital. Supplementary Figure S1. Sequence of steps from handling missing data to statistical analysis. Supplementary Figure S2. Study scenario selection flowchart. Supplementary Figure S3. Scenario 1. Number of patients in the study cohort: risk of diabetes associated with SGA. Supplementary Figure S4. Scenario 2. Number of patients in the study cohort: risk of hepatic injury associated with rosuvastatin.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Komamine, M., Fujimura, Y., Omiya, M. et al. Dealing with missing data in laboratory test results used as a baseline covariate: results of multi-hospital cohort studies utilizing a database system contributing to MID-NET® in Japan. BMC Med Inform Decis Mak 23, 242 (2023). https://doi.org/10.1186/s12911-023-02345-7

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12911-023-02345-7

Keywords