Data source
On March 1, 1995, Taiwan instituted the National Health Insurance (NHI) program, a mandatory single-payer program that offers comprehensive medical care coverage, including outpatient, inpatient, emergency, and traditional Chinese medicine services, to almost 98% of residents [32]. As of 2014, 99.9% of Taiwan’s population was enrolled, and foreigners in Taiwan were also eligible for the NHI program.
Since 1996, the NHI reimbursement data in Taiwan have been transferred to the National Health Research Institute (NHRI) for further management and organization. As part of these efforts, the NHRI established a national health care database, the National Health Insurance Research Database (NHIRD), which contains comprehensive information on clinical practices, including patients’ demographic characteristics, medical expenditures, prescription claims data, surgery codes, treatment codes, and diagnostic codes based on the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM).
The use of the NHIRD is limited to research purposes only. Researchers must follow the Computer-Processed Personal Data Protection Law (http://www.winklerpartners.com/?p=987) and the regulations of the NHRI. In addition, an agreement must be signed by the researchers and their supervisor upon application submission. All applications for database release are reviewed for approval by experts at the NHRI. Confidentiality is also maintained according to the directives of the Bureau of NHI. In the current study, the Longitudinal Health Insurance Database 2005 (LHID 2005), a dataset released by the NHRI, was used as the data source. The LHID 2005 contains all the original claims data of 1,000,000 beneficiaries randomly sampled from the 2005 Registry for Beneficiaries of the NHIRD, which holds the registration data of everyone who was a beneficiary of the NHI program between January 1, 2005 and January 1, 2006. There are approximately 25.68 million individuals in this registry. All the registration and claims data of these 1,000,000 individuals collected by the NHI program constitute the LHID 2005. The NHRI reported that no significant difference exists in the average insured payroll-related amount, sex distribution, or age distribution between patients in the LHID 2005 and those in the NHIRD.
Study population
The data extracted from the LHID 2005 were used to conduct a retrospective case–control study on patients who were newly diagnosed with BPD (ICD-9-CM code: 301.83) by a psychiatrist between January 1, 2003 and December 31, 2006. For each BPD patient, 4 age- and sex-matched control patients without BPD were randomly selected from the LHID 2005 between 2003 and 2006. The random selection was performed in SAS statistical software using random numbers generated from a uniform distribution. Information on physical and psychiatric comorbidities diagnosed within 3 years before and after enrollment was collected. In this study, all comorbidities were categorized according to the original classification of the ICD-9-CM system. Psychiatric disorders, including depressive disorder, bipolar disorder, anxiety disorder, substance use disorder (e.g., alcohol use disorder, opioid use disorder, and amphetamine use disorder), sleep disorder, eating disorder, autistic spectrum disorder, mental retardation, and ADHD, were categorized for the ARM analysis.
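The 1:4 matching step can be illustrated with a minimal Python sketch. It is not the SAS procedure used in this study: the function name `match_controls`, the record fields (`id`, `age`, `sex`), and the choice to sample controls without replacement across cases are all assumptions made for illustration, and the control pool is assumed to already exclude anyone with a BPD diagnosis.

```python
import random
from collections import defaultdict


def match_controls(cases, pool, ratio=4, seed=0):
    """Pick `ratio` controls per case from `pool`, matched exactly on (age, sex).

    `cases` and `pool` are lists of dicts with hypothetical keys
    "id", "age", and "sex". Controls are drawn without replacement.
    """
    rng = random.Random(seed)

    # Group candidate controls into strata keyed by (age, sex).
    strata = defaultdict(list)
    for person in pool:
        strata[(person["age"], person["sex"])].append(person["id"])

    # Shuffle each stratum so that popping from it is a uniform random draw.
    for ids in strata.values():
        rng.shuffle(ids)

    matched = {}
    for case in cases:
        candidates = strata[(case["age"], case["sex"])]
        if len(candidates) < ratio:
            raise ValueError(f"not enough controls for case {case['id']}")
        matched[case["id"]] = [candidates.pop() for _ in range(ratio)]
    return matched
```

Calling `match_controls(cases, pool)` returns a dictionary that maps each case identifier to its four matched control identifiers. Exact matching on age is a simplification; matching on age bands would only change how the stratum key is built.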
Statistical analysis
The prevalence of comorbidities in the BPD and control patients was calculated, and independent t tests and chi-squared tests were used to examine differences in demographic characteristics between the BPD and control patients. A univariable logistic regression model was also used to calculate the ORs of physical and psychiatric comorbidities between the BPD and control patients. Although the coverage rate of the NHI system in Taiwan is as high as 99%, a small amount of data in the dataset is still missing. In our study, the missingness was unrelated to the variables, that is, the data were missing completely at random. We therefore adopted the commonly used complete-case analysis to handle the missing data, simply excluding participants with any missing data. SAS statistical software for Windows, Version 9.3 (SAS Institute, Cary, NC, USA), was used for data extraction, computation, data linkage, processing, and sampling. All other statistical analyses were conducted using SPSS statistical software for Windows, Version 20 (IBM, Armonk, NY, USA). P < .05 was considered statistically significant.
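For a single binary comorbidity, the OR from a univariable logistic regression equals the cross-product ratio of the 2 × 2 table, so the calculation can be sketched without a regression package. The Python below is an illustration only, not the SPSS procedure used here; the function name `odds_ratio`, the field names, and the Wald-type 95% CI are assumptions. It also shows the complete-case step: records with a missing value are dropped before counting.

```python
import math


def odds_ratio(records, exposure, outcome="bpd"):
    """Return (OR, CI lower, CI upper, n used) for a binary exposure and outcome.

    `records` is a list of dicts; a value of None marks missing data.
    Assumes no cell of the 2 x 2 table is empty.
    """
    # Complete-case analysis: exclude records with any missing value.
    complete = [
        r for r in records
        if r.get(exposure) is not None and r.get(outcome) is not None
    ]

    a = sum(1 for r in complete if r[outcome] and r[exposure])          # cases, comorbidity present
    b = sum(1 for r in complete if r[outcome] and not r[exposure])      # cases, comorbidity absent
    c = sum(1 for r in complete if not r[outcome] and r[exposure])      # controls, comorbidity present
    d = sum(1 for r in complete if not r[outcome] and not r[exposure])  # controls, comorbidity absent

    estimate = (a * d) / (b * c)

    # Wald interval on the log-odds scale.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - 1.96 * se)
    upper = math.exp(math.log(estimate) + 1.96 * se)
    return estimate, lower, upper, len(complete)
```

A comorbidity would be reported as significantly associated with BPD at the .05 level when the interval excludes 1.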
Association rule mining
ARM is one of the most useful methods for discovering patterns or extracting co-occurrences from transactional databases. Recently, ARM has been applied in clinical data analysis [21, 33]. The collection of all diagnoses can be defined as a set of items, and enrollees’ clinical records can be represented as transactions, each containing the combination of diagnoses in that enrollee’s history. The basic concept of ARM as used in clinical data analysis can therefore be outlined as follows.
Let I be a complete set of diagnoses (i.e., items in conventional ARM) and T = {T_1, T_2, …, T_m} be a set of enrollees’ clinical records (i.e., transactions in conventional ARM), where T_i (1 ≤ i ≤ m) is a set of diagnoses for enrollee i (i.e., T_i ⊆ I). Given two nonoverlapping sets of diagnoses, X and Y (X ⊂ I, Y ⊂ I, and X ∩ Y = ∅), an association rule is an implication of the form X → Y (which is read as “X implies Y”), indicating that if a set of diagnoses X occurs, then a set of diagnoses Y also occurs in the enrollee’s clinical record [34]. X and Y represent the antecedent and consequent of the rule, respectively.
Two measures, support and confidence, must be assessed in the mining process to discover association rules. The rule X → Y has support s in T if s% of the enrollees’ clinical records in T contain X ∪ Y; the rule has confidence c if c% of the enrollees’ clinical records in T that contain X also contain Y. The confidence c can also be expressed as the conditional probability P(Y|X). Given a user-specified minimum support (called minsup) and minimum confidence (called minconf), the goal of ARM is to discover all association rules that have support and confidence greater than minsup and minconf, respectively.
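The two measures can be made concrete with a short Python sketch. The function names and the toy records are hypothetical and serve only to illustrate the definitions above; each clinical record is represented as a set of diagnosis labels, and the measures are returned as proportions rather than percentages.

```python
def support(transactions, itemset):
    """Proportion of records that contain every diagnosis in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Proportion of records containing X that also contain Y, i.e. P(Y|X)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical toy records, one set of diagnoses per enrollee.
records = [
    {"BPD", "depressive"},
    {"BPD", "depressive", "anxiety"},
    {"BPD", "anxiety"},
    {"depressive"},
    {"anxiety"},
]

s = support(records, {"BPD", "depressive"})       # 2 of 5 records
c = confidence(records, {"BPD"}, {"depressive"})  # 2 of the 3 records with BPD
```

In this toy example the rule {BPD} → {depressive} has a support of 2/5 and a confidence of 2/3.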
Although the support-confidence framework for ARM has been widely studied in the literature [21, 33], setting minsup and minconf simultaneously is a challenging task in real-world applications [35, 36]. To address this issue, we disregarded minsup and minconf and considered only the interestingness of rules in ARM. Previous studies indicate that interestingness measures allow the quality of rules to be evaluated quickly and thus facilitate the rule consolidation process [37]. This study used lift as the interestingness measure. The lift of the rule X → Y is defined as follows:
$$ \mathrm{lift}(X\to Y)=\frac{c(X\to Y)}{s(Y)}=\frac{P(Y\mid X)}{P(Y)}=\frac{P(X,Y)}{P(X)\times P(Y)} $$
The value of lift indicates how much the joint probability P(X, Y) deviates from the value P(X) × P(Y) expected under independence. Based on the above equation, the measure can be interpreted as follows:
$$ \mathrm{lift}(X\to Y)\begin{cases} >1, & \text{if } X \text{ and } Y \text{ are positively correlated} \\ =1, & \text{if } X \text{ and } Y \text{ are independent} \\ <1, & \text{if } X \text{ and } Y \text{ are negatively correlated} \end{cases} $$
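As a worked illustration with hypothetical counts (not taken from the study data), suppose that among 1,000 clinical records, 200 contain X, 100 contain Y, and 50 contain both. Then P(X) = 0.20, P(Y) = 0.10, and P(X, Y) = 0.05, so that

$$ \mathrm{lift}(X\to Y)=\frac{P(Y\mid X)}{P(Y)}=\frac{50/200}{100/1000}=\frac{0.25}{0.10}=2.5=\frac{0.05}{0.20\times 0.10} $$

In this example Y is 2.5 times as frequent among records that contain X as it is overall, so X and Y are positively correlated.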
The objective of this study was to determine the main psychiatric comorbidity of BPD by using ARM. ARM analyses were conducted using WEKA 3.6 open-source machine learning software (www.cs.waikato.ac.nz/ml/weka). The Apriori module in WEKA was used to discover interesting association rules relating to comorbidity of BPD. To apply the lift metric in the Apriori module, we set both minsup and minconf to 0% and specified lift as the metric type. The minimum lift was set to 1 in order to discover rules with a positive correlation between their antecedent and consequent.
Based on the above settings, WEKA outputs all interesting association rules (i.e., lift > 1) in descending order of their lift values. Although the lift metric can prune less meaningful rules during the mining process, the number of generated rules remains very large because the minsup and minconf thresholds are disregarded. We therefore limited the number of rules generated in WEKA. Specifically, we generated the top-k (k = 10,000) interesting rules by specifying the parameter k in WEKA.
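The filtering and ranking described above can be sketched in plain Python. This is a brute-force enumeration for small examples, not WEKA's Apriori implementation; the function name `top_k_rules` and the cap `max_items` on the number of diagnoses per rule are assumptions introduced to keep the sketch tractable.

```python
from itertools import combinations


def top_k_rules(transactions, k=10, max_items=3):
    """Return up to k rules with lift > 1 as (lift, antecedent, consequent),
    sorted by descending lift. Each transaction is a set of diagnoses."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def supp(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    for size in range(2, max_items + 1):
        for combo in combinations(items, size):
            whole = frozenset(combo)
            s_xy = supp(whole)
            if s_xy == 0:
                continue  # the diagnoses never co-occur, so no rule to score
            # Split the itemset into every antecedent/consequent pair.
            for r in range(1, size):
                for x in combinations(combo, r):
                    antecedent = frozenset(x)
                    consequent = whole - antecedent
                    lift = s_xy / (supp(antecedent) * supp(consequent))
                    if lift > 1:  # keep positively correlated rules only
                        rules.append((lift, sorted(antecedent), sorted(consequent)))

    # Descending lift; ties broken alphabetically for a stable order.
    rules.sort(key=lambda rule: (-rule[0], rule[1], rule[2]))
    return rules[:k]
```

Because no minimum support is applied, the number of candidate rules grows quickly with the number of distinct diagnoses, which is why a top-k cut-off is needed.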
To evaluate whether the discovered association rules hold in general, we partitioned the collected data into training and testing sets. Specifically, two thirds of the patients were randomly selected as the training set to discover association rules, and the remaining one third (the testing set) was used to validate the discovered association rules [38, 39].
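A minimal Python sketch of this hold-out scheme follows. The helper names are hypothetical, and the exact criterion used to judge a rule on the testing set is not specified above; one plausible check, assumed here, is to recompute the lift of a rule found on the training set using the testing records.

```python
import random


def split_patients(patient_ids, train_fraction=2 / 3, seed=0):
    """Randomly partition patient identifiers into training and testing lists."""
    rng = random.Random(seed)
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def lift(transactions, antecedent, consequent):
    """Lift of the rule antecedent -> consequent over a list of diagnosis sets."""
    n = len(transactions)
    x, y = frozenset(antecedent), frozenset(consequent)
    p_x = sum(1 for t in transactions if x <= t) / n
    p_y = sum(1 for t in transactions if y <= t) / n
    p_xy = sum(1 for t in transactions if (x | y) <= t) / n
    if p_x == 0 or p_y == 0:
        return float("nan")  # rule cannot be evaluated on this partition
    return p_xy / (p_x * p_y)
```

With `train_ids, test_ids = split_patients(all_ids)`, rules would be mined from the records of `train_ids` only and then re-scored with `lift` on the records of `test_ids`.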
Because some discovered association rules had small supports, bootstrap simulation was also used to validate the discovered association rules. From the 292 BPD patients included in our study, a random sample of 292 BPD patients was drawn with replacement; such a sample therefore includes some BPD patients multiple times and, at random, omits others. The 292 sampled BPD patients were each matched to 4 controls according to age and sex, resulting in a bootstrap sample, on which the association rules were evaluated. These steps were repeated 1000 times to create 1000 bootstrap samples. The mean support, confidence, lift, and ORs with 95% confidence intervals (CIs) across the 1000 bootstrap samples were calculated.
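The resampling loop can be sketched in Python as follows. This is a simplified illustration: the function name `bootstrap` is hypothetical, the re-matching of 4 controls to each resampled case is omitted, and the 95% CI is taken as a percentile interval, which is an assumption because the interval method is not specified above.

```python
import random
from statistics import mean


def bootstrap(cases, statistic, n_boot=1000, seed=0):
    """Resample `cases` with replacement n_boot times and summarize `statistic`.

    Returns (mean, lower, upper), where lower and upper bound an approximate
    95% percentile interval of the bootstrap estimates.
    """
    rng = random.Random(seed)
    n = len(cases)
    estimates = []
    for _ in range(n_boot):
        # Same size as the original sample; some cases repeat, others drop out.
        sample = rng.choices(cases, k=n)
        estimates.append(statistic(sample))

    estimates.sort()
    lower = estimates[int(0.025 * n_boot)]
    upper = estimates[int(0.975 * n_boot) - 1]
    return mean(estimates), lower, upper
```

Here `statistic` is any function that maps a resampled list of cases to a number, for example the support, confidence, or lift of one rule; a rule with small support will typically show a wide interval.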