 Research
 Open Access
 Published:
Prediction and risk stratification from hospital discharge records based on Hierarchical sLDA
BMC Medical Informatics and Decision Making volume 22, Article number: 14 (2022)
Abstract
Background
The greatly accelerated development of information technology has conveniently provided adoption for risk stratification, which means more beneficial for both patients and clinicians. Risk stratification offers accurate individualized prevention and therapeutic decision making etc. Hospital discharge records (HDRs) routinely include accurate conclusions of diagnoses of the patients. For this reason, in this paper, we propose an improved model for risk stratification in a supervised fashion by exploring HDRs about coronary heart disease (CHD).
Methods
We introduced an improved fourlayer supervised latent Dirichlet allocation (sLDA) approach called Hierarchical sLDA model, which categorized patient features in HDRs as patient featurevalue pairs in onehot way according to clinical guidelines for lab test of CHD. To address the data missing and imbalance problem, RFs and SMOTE methods are used respectively. After TFIDF processing of datasets, variational Bayes expectationmaximization method and generalized linear model were used to recognize the latent clinical state of a patient, i.e., risk stratification, as well as to predict CHD. Accuracy, macroF1, training and testing time performance were used to evaluate the performance of our model.
Results
According to the characteristics of our datasets, i.e., patient featurevalue pairs, we construct a supervised topic model by adding one more Dirichlet distribution hyperparameter to sLDA. Compared with established supervised algorithm Multiclass sLDA model, we demonstrate that our proposed approach enhances training time by 59.74% and testing time by 25.58% but almost no loss of average prediction accuracy on our datasets.
Conclusions
A model for risk stratification and prediction of CHD based on sLDA model was proposed. Experimental results show that Hierarchical sLDA model we proposed is competitive in time performance and accuracy. Hierarchical processing of patient features can significantly improve the disadvantages of low efficiency and timeconsuming Gibbs sampling of sLDA model.
Introduction
Cerebrovascular accident (CVA), coronary heart disease (CHD) and other cardiovascular diseases (CVD) are the leading causes of death and serious family burden in China nowadays. According to the World Health Organization (WHO), risk factors can increase the chances that a person suffers from that disease (WHO, 2014). Risk stratification incorporating these risk factors can be used by physicians to assess the risk of atherosclerotic of individual patient, such as taking treatment with drugs to lower blood pressure and blood cholesterol based on an individual’s absolute cardiovascular risk [1]. According to accurate risk stratification, so as to reduce the overall risk of CVD or CHD, physicians can formulate corresponding comprehensive clinical treatments or life intervention management programs for patients with different risk levels. To carry out risk stratification also plays an important role in individualized nursing, drug development and cost estimation of CVD [2].
In recent years, with the rapid development of medical information technology, the applications of electronic medical records (EMRs) are becoming more and more widely and indepth. EMRs store and share clinical information of patients, such as general items, diagnostic images (e.g., Xrays), history of present illness (HPI), family history, lab results etc. for different diseases [3,4,5] and various medical applications [6,7,8,9], especially HDRs, which include abundant information on patient risk factors. In 2014, in order to identify and extract medical risk factors related to CHD, i2b2/UTHealth Natural Language Processing shared task with respect to the longitudinal medical records of patients with diabetes. The risk factors included hypertension, hyperlipidemia, obesity, smoking status, family history, diabetes [10].
Multivariable traditional assessment models have been provided to estimate CVD risk [11, 12]. At the cohort level, these risk stratification models take statistical analysis techniques (e.g., logistic regression, Cox regression, etc.) to estimate the absolute risk of patients, which offer little insight beyond a flat scorebased segmentation that has high cost, carefully selected and highly stratified patient characteristics [13].
In this study, we proposed a fourlayer probabilistic topic model, i.e., Hierarchical sLDA model, for risk stratifications and prediction of CHD by using the diagnosis cases from HDRs, where Hierarchical sLDA model is a variant of sLDA. Firstly, we collected data from real clinical settings and annotated risk factors with an annotation tool developed by ourselves, under the guidance of clinicians. Then, we extracted the patient featurevalue pairs of risk factors and encoding them in One Hot way according to clinical guidelines for lab test of CHD. Meanwhile, to address the data missing and imbalance problem, RFs and SMOTE are used respectively. After TFIDF processing of datasets, variational Bayes Expectationmaximization (VBEM) and generalized linear model (GLM) was used to recognize the latent clinical state of a patient, i.e., risk stratification, as well as to predict CHD. Experimental results show that our model can significantly improve the disadvantages of low efficiency and timeconsuming of sLDA model.
Related work
With the continuous digitization and storage of knowledge in the forms of news, blogs, web pages, scientific articles, books, images, sound, video and social networks, it is more and more difficult for us to find what we are looking for from massive information [14].
In 1999, Thomas Hofmann proposed probabilistic latent semantic indexing (PLSI), which characterizes the polysemy of a word by describing the word frequency vector with multinomial distribution [15]. The proposal of PLSI enables the discovery and analysis of potential topics or categories in a large number of documents, and realizes the tasks of document clustering and dimension reduction.
In order to overcome the defects of inconsistent generative semantics of PLSI that cannot generate (i.e., predict) new documents and its model overfitting, Blei et al. [16] proposed LDA (latent Dirichlet allocation) topic model in 2003, and two commonly used approximate inference methods are Variational Bayes (VB) deterministic and collapsed Gibbs sampling (GS) stochastic approximation. Girolami and Kabán [17] showed that PLSI is maximum a posteriori (MAP) estimated LDA model under uniform Dirichlet prior. LDA is a threelayer hierarchical (including documents, topics and words) Bayesian unsupervised topic model. Based on the Bagofwords (BOW) representation, LDA can cluster topics or classify texts from a large number of documents and has good scalability. With the development of probabilistic topic model, it has made continuous progress in image analysis, bioinformatics and other fields [18]. Jelodar et al. [19] reviewed the research progress, future development trend and wide application subjects of LDA topic model from 2003 to 2016, such as Social Network, Crime Science, Medical/Biomedical and Linguistic science.
For text classification, Li and McCallum [20] showed that LDA model does not capture the correlation between topics, and the accuracy and efficiency of topic classification are not outstanding and insufficient. Furthermore, although LDA model can achieve document clustering and dimension reduction and other tasks, it is not suitable for prediction. In 2010, Blei and McAuliffe [21] proposed supervised latent Dirichlet allocation (sLDA). By combining GLM for latent topics, and selecting different exponential distribution family according to response variables, multiple response variables (e.g., real, category or multinomial response variable) can be predicted.
Wang et al. [22] extended to image classification and proposed Multiclass sLDA prediction model by selecting GLM model for multinomial response. Their experimental results showed that the average accuracy of the model for LabelMe datasets (1600 images, 8 categories) and UIUCSport datasets (1792 images, 8 categories) was 76% and 66% respectively, which was better than that of Li and Perona [23] and Bosch et al. [24], and the average accuracy increased by more than 10%.
By using factor graph to represent collapsed LDA to encode the joint probability, Zeng et al. [25] enabled the belief propagation (BP) for approximate inference and parameter estimation and has enhanced both speed and accuracy by experimental results on four document datasets.
Besides these methods, there are various variants and applications about the stateoftheart topic model LDA recently, e.g., opinion mining on big data [26], stock market returns [27], topic change point detection [28], openended versus closedended response [29], etc.
Motivated by these observations, our study proposes a probabilistic ensemble classification method, which distinguishes from other methods in that: (1) it takes a slight structural change to standard sLDA, which improves threelayer sLDA to a fourlayer called Hierarchically sLDA model, and achieves encouraging experimental results in terms of time performance; (2) our model can provide prediction of CHD and risk stratification simultaneously.
The remaining sections of this paper are organized as follows. “Hierarchical sLDA” section describes our proposed model, variational inference and parameter estimation. “Experiments” section carefully describes the datasets and presents our model experimental results. Finally, we present our conclusions possible directions for future work in “Conclusions” section.
Hierarchical sLDA
Modeling HDRs and labels
Firstly, we summarize some important symbols and notations in this paper shown in Table 1.
The graphical model representation of hierarchical sLDA is depicted in Fig. 1. Nodes are random variables; edges indicate possible dependence; a shaded node is an observed variable; an unshaded node is a hidden variable.
Each HDRs is represented as a bag of patient feature \(f_{1:F}\) or patient featurevalue pair \(v_{1:N}\). The category c is a discrete class label. Each topic is a distribution over a vocabulary of patient feature, and also be regarded as distribution over vocabulary of patient featurevalue pair. K is the number of latent topics; N is the number of featurevalue of a single HDRs; D is the number of patient HDRs.
Our model assumes the following generative process of an HDRs, and its class label.

1.
Draw topic proportions \(\theta\), \(\theta \mid \alpha \sim {\text {Dir}}(\alpha )\);

2.
For each patient HDRs featurevalue pair \(v_{1:N}\):

(a)
Draw topic assignment \(r_{n}\) from category distribution with parameter \(\theta\), \(r_{\mathrm {n}} \mid \theta \sim {\text {Mult}}(\theta )\);

(b)
Draw feature \(f_{n}\) from category distribution with parameter \(\pi\), \(f_{n} \mid r_{n}, \pi _{1: K} \sim {\text {Mult}}\left( \pi _{r_{n}}\right)\);

(c)
Draw featurevalue pair \(v_{n}\) from category distribution with parameter \(\beta\), \(v_{n} \mid r_{n}, f_{n}, \beta _{1: K} \sim {\text {Mult}}\left( \beta _{f_{n}, r_{n}}\right)\);

(a)

3.
Draw class label y from GLM distribution \(y \mid r_{1: N}, \eta \sim GLM({\bar{r}}, \eta )\), where \({\bar{r}}=\frac{1}{N} \sum _{n=1}^{N} r_{n}\), that is:
$$\begin{aligned}p\left( y \mid r_{1:N}, \eta \right) =\exp \left( \eta _{y}^{T} {\bar{r}}\right) / \sum _{l=1}^{C} \exp \left( \eta _{l}^{T} {\bar{r}}\right) \end{aligned}$$it also can be written as exponential distribution family:
$$\begin{aligned} p\left( y \mid r_{\mathrm {1:N}}, \eta \right) =\exp \left\{ \eta _{y}^{T} {\bar{r}}\log \left( \sum _{l=1}^{C} \exp \left( \eta _{l}^{T} {\bar{r}}\right) \right) \right\} \end{aligned}$$
The posterior inference of the model can be divided into three steps or tasks.

(a)
Variational approximate inference A posterior inference is to compute the conditional distribution of the latent variables at the patient HDRs level. That is, approximate inference is used to estimate the parameters of the variational distribution.

(b)
Parameter estimation The model parameters \(\alpha , \beta _{1: K}, \pi _{1: K}, \eta\) are fitted with variational expectation maximization (EM) by maximum likelihood estimation.

(c)
Prediction and risk stratification To perform prediction and risk stratification over the model parameters\(\alpha , \beta _{1: K}, \pi _{1: K}, \eta\) and variational distribution parameters \(\gamma , \phi\) means approximation of posterior expectation of response variable \(y={\mathrm {E}}\left[ Y \mid r_{1:N}, \alpha , \beta _{1: K}, \pi _{1: K}, \eta \right]\).
Variational approximate inference
Both Parameter estimation and prediction depend on the posterior inference. Following Jordan et al. [16], given patient HDRs and response variable y, we start the joint distribution in the following equation. The real posterior of latent variables (including topic proportions \(\theta\), topic \(r_{n}\)) is:
As the denominator of posterior distribution is difficult to calculate, we use meanfield variational approximation inference:
Let \(\zeta =\left\{ \alpha , \pi _{1: K}, \beta _{1: K}, \eta \right\}\), the KL divergence between the real posterior of latent variables \(p\left( \theta , r_{1:N} \mid f_{1:N}, v_{1:N}, y, \zeta \right)\) and variational distribution \(q\left( \theta , r_{\mathrm {1:N}}\right)\) is:
So the evidence lower bound (ELBO) is:
we denote ELBO by \({\mathscr {L}}(\bullet )\), and the entropy of variational distribution by \(H(q)={\mathrm {E}}_{q}\left[ \log q\left( \theta , r_{1: N}\right) \right]\), and then ELBO is written as:
We fit these parameters by maximizing ELBO with respect to \(\gamma ,\phi\) and obtain an estimate of the posterior under the sense of KL divergence between \(q\left( \theta , r_{1: N}\right)\) and the true posterior \(p\left( \theta , r_{1: N} \mid f_{1: N}, v_{1: N}, y, \zeta \right)\).
The terms of equation 1 are as follows:
where \(\Psi (\bullet )\) denotes the digamma function.
Variational Estep
The coordinate ascent method updates for variational parameter \(\gamma\), which is the same as sLDA, does not directly involve the response variable y.
Under the conditions \(E[\overline{\mathrm {r}}]={\bar{\phi }}=\frac{1}{N} \sum _{n=1}^{N} \phi _{n}\), the terms in ELBO containing \(\phi _{n}\) are:
Following Wang et al. [22], under the constraint \(\sum _{i=1}^{K} \phi _{n, i}=1\) and setting partial derivatives to zero of the ELBO with respect to \(\phi _{n, i}\), we write \(\log \left( \sum _{l=1}^{C} \prod _{n=1}^{N}\right.\) \(\left. \left( \sum _{i=1}^{K} \phi _{n, i} \exp \left( \eta _{l, i} \frac{1}{N}\right) \right) \right)\) as \(h^{T} \phi _{n}\), and then obtain:
The variational EM algorithm alternatively updates Eqs. 8 and 9 until the bound on the expected log likelihood converges.
Parameter estimation
Variational Mstep is an optimization of ELBO on the whole HDRs datasets level w.r.t model parameters \(\zeta =\left\{ \alpha , \pi _{1: K}, \beta _{1: K}, \eta \right\}\). Repeating the Variational Estep D times, we can obtain approximate posterior over latent variables \(\theta , r_{1:N}\) for each patient HDRs. It is noted that different patient HDRs has different variational distributions \(q_{d}\left( \theta , r_{1: N}\right)\). Then we obtain:
Variational Mstep
Similarly, the coordinate ascent method is used to maximize the whole HDRs datasets ELBO to estimate the model parameters \(\zeta =\left\{ \alpha , \pi _{1: K}, \beta _{1: K}, \eta \right\}\).

(a)
Setting
$$\begin{aligned}&\partial L\left( \alpha , \pi _{1: \mathrm {K}}, \beta _{1: \mathrm {K}}, \eta ; D\right) / \partial \pi _{k, f}=0\\&\partial L\left( \alpha , \pi _{1: \mathrm {K}}, \beta _{1: \mathrm {K}}, \eta ; D\right) / \partial \beta _{k, \mathrm {v}}=0 \end{aligned}$$it leads to:
$$\begin{aligned} {\hat{\pi }}_{k, f}^{new}&\propto \sum _{d=1}^{D} \sum _{n=1}^{N} I\left( f=f_{n, k}^{d}\right) \phi _{n, k}^{d} \end{aligned}$$(10)$$\begin{aligned} {\hat{\beta }}_{k, v}^{new}&\propto \sum _{d=1}^{D} \sum _{n=1}^{N} I\left( v=v_{n, k}^{d}\right) \phi _{n, k}^{d} \end{aligned}$$(11) 
(b)
The whole HDRs datasets ELBO containing \(\eta _{c}\) are:
$$\begin{aligned} L_{\left[ \eta _{1:C}\right] }(\mathrm {D})=\sum _{d=1}^{D}\left( \eta _{c_{d}}^{T} {\bar{\phi }}_{d}\log \left( \sum _{l=1}^{C}\left\{ \prod _{n=1}^{N}\left( \sum _{i=1}^{K} \phi _{n, i}^{d} \exp \left( \eta _{l, i} \frac{1}{N}\right) \right) \right\} \right) \right) \end{aligned}$$Setting \(\frac{\partial L_{\left[ \eta _{1:C}\right] }(\mathrm {D})}{\partial \eta _{\mathrm {c}, i}}=0\) does not lead to a closedform solution. Following Wang et al. [22], we optimize with conjugate gradient. Let \(\kappa _{d}=\sum _{l=1}^{C}\left\{ \prod _{n=1}^{N} \phi _{n, \mathrm {r}_{n}} \right.\) \(\left. \sum _{\mathrm {i}=1}^{K}\left( \exp \left( \eta _{l, i} \frac{1}{N}\right) \right) \right\}\), the derivatives are:
$$\begin{aligned} \begin{aligned} \frac{\partial L_{\left[ \eta _{\mathrm {1:c}}\right] }(\mathrm {D})}{\partial \eta _{\mathrm {c}, i}}&=\sum _{d=1}^{D}\left( 1\left[ \mathrm {c}_{d}=\mathrm {c}\right] {\bar{\phi }}_{d, i}\right) \sum _{d=1}^{D}\left( \kappa _{d}^{1} \prod _{n=1}^{N}\left( \sum _{j=1}^{K} \phi _{n, \mathrm {j}}^{\mathrm {d}} \exp \left( \eta _{\mathrm {c}_{j}} \frac{1}{N}\right) \right) \right. \\&\quad \left. \times \sum _{n=1}^{N}\left( \frac{\frac{1}{N} \phi _{n, \mathrm {i}}^{\mathrm {d}} \exp \left( \eta _{c_{i}} \frac{1}{N}\right) }{\sum _{j=1}^{K} \phi _{n, \mathrm {j}}^{\mathrm {d}} \exp \left( \eta _{\mathrm {c}_{j}} \frac{1}{N}\right) }\right) \right) \end{aligned} \end{aligned}$$ 
(c)
In this paper, We will address this setting in details about the Dirichlet parameter \(\alpha\) and \(\beta\) in “Discussion” section.
Prediction and risk stratification
Under the fitted model\(\left\{ \alpha ,\gamma ,\phi _{1:N},\pi _{1: K}, \beta _{1: K}, \eta \right\}\), the expected response value is:
where \(\mu (\bullet )={\mathrm {E}}_{G L M}[Y \mid \cdot ]=\left[ \frac{\exp \left( \eta _{1}^{T} \overline{\mathrm {r}}\right) }{\sum _{l=1}^{C} \exp \left( \eta _{l}^{T} \overline{\mathrm {r}}\right) }, \ldots , \frac{\exp \left( \eta _{C}^{T} \overline{\mathrm {r}}\right) }{\sum _{l=1}^{C} \exp \left( \eta _{l}^{T} \mathrm {r}\right) }\right] ^{T}\).
In classification, to estimate the probability of the label c with the variational distribution, we obtain:
Thus, the prediction formulation is:
where \(E[{{\overline{r}}}]={\bar{\phi }}=\frac{1}{N} \sum _{n=1}^{N} \phi _{n}\).
Experiments
Methods
Data source
We propose our model on realworld datasets in the clinical domains collected from the Cardiology Department of the First Affiliated Hospital of Xinjiang Medical University containing 420 HDRs of CHD patients. These HDRs describe basic information about patients with a first diagnosis of coronary atherosclerotic heart disease, such as admission, treatment history, important test results, discharge status, discharge orders, and followup recommendations. An original HDR of a patient was described in Fig. 2left, corresponding English version was shown in Fig. 2right. In this paper, the patient’s diagnostic results is used as the class label for the data, including the following 4 classes: Stable Angina Pectoris(SAP), Unstable Angina Pectoris(UAP), Ischemic Cardiomyopathy(ICM) and Acute Miocardial Infarction(AMI).
Annotation of features
HDRs are a kind of unstructured text. In order to guarantee the accuracy of datasets and ensure the credibility as far as possible, each patient case in the experimental datasets was identified and evaluated by clinicians. This is a laborintensive and laborious task.
With an annotation tool developed by ourselves, we finished the features annotation of patients and obtain structured data including patient demographics, general items, history of present illness (HPI), laboratory results, conclusions of diagnoses etc., which provide a comprehensive source for risk stratification. Based on the CHD risk factors formulated in I2B2 and combined with the clinical experience of clinicians in the diagnosis and treatment, we drew up annotation criterion and guidelines for this study under the help of clinicians. This work consists of three preannotation and a formal annotation, with 50 HDRs randomly selected each time. The result was verified by Inter Annotator Agreement (IAA) after each annotation to guarantee the qualification. At the same time, the criterion and guidelines of annotation were constantly updated to ensure the standardization throughout the process. Figure 3 shows the features annotation process of HDRs.
Data preprocessing
While random forests (RFs) [30], which allow for avoiding overfitting makes it suitable for processing data with outliers, missing values, have been widely applied to other fields such as biological prediction [31], we use RFs for filling the missing data.
And then, we obtain 34 patient features and 79 patient featurevalue pairs, according to specific medical guidelines and specifications of CHD, by dividing patient features into a series of categories. Using the TFIDF text mining technique to assign a weight to each feature and featurevalue pair term, features matrix of 420 × 34 dimensions and featurevalue pairs matrix of 420 × 79 dimensions are obtained respectively as the source datasets for modeling.
At the same time, 265 out of the 420 patient HDRs cases in the datasets are Stable Angina Pectoris (63.10%), and the class imbalance problem is encountered. In this paper, we use the SMOTE algorithm to generate additional samples from randomly oversampling minority class, and finally totaling 1060 samples are obtained. Therefore, features and diagnosis results can be extracted from HDRs, which provide data preparation for supervised training and testing and can be used as the labels of CHD. Summary statistics of the datasets are shown in Table 2.
Results
For risk stratification and classification prediction of CHD, we use fivefold crossvalidation method to create the train and test sets. Results are reported as an average across folds.
Classification performance
In order to perform assessment of the classification performance of our model, we compared it with previous salient approach Multiclass sLDA [22]. Multiclass sLDA embeds single softmax into LDA model, and reports better classification performance. The distinguishing factor between Hierarchical sLDA and Multiclass sLDA is the additional structure imposed on the featurevalue pair, which would result in an outstanding performance in predictive performance.
The experiments were performed from topic \(K = 10\) to \(K = 70\) with intervals of 10. The results of accuracy, training and testing time are illustrated in Fig. 4 and the confusion matrices are shown in Fig. 5. Hierarchical sLDA model we proposed is competitive in both time performance and accuracy, as validated by experimental results. From Fig. 4left and middle, comparison of over all classes based on fivefold cross validation it can be seen that Hierarchical sLDA (\(K = 70\)) reaches the average training time (669.69s) and testing time (0.32s) performance, which is 59.74% and 25.58% higher than the average training time (1663.42s) and testing time (0.43s) of Multiclass sLDA (\(K = 70\)) but almost no loss of accuracy on our datasets (See Fig. 4right). The average accuracy of the Hierarchical sLDA (\(K = 70\)) is 74.53%, while that of Multiclass sLDA (\(K = 70\)) is 74.06%.
From Fig. 4right, the average classification accuracy of the Hierarchical sLDA, as the number of topics increases, is smoothly converging, as well as not suffering from the overfitting problem. That means it provides more robust than other classifiers.
We evaluated on topic \(K=70\) about different types of CHD using macroF1 score, macroPrecision score, and macroRecall score on test data respectively. A comparison between Hierarchical sLDA model and Multiclass sLDA model indicated that the two models are not significantly different(See Table 3).
Risk stratification
Table 4 shows the top 5 risk factors of \(K=70\) topics inferred by Hierarchical sLDA under high, and lowrisk tier separately of different types of CHD. The Hierarchical sLDA model of HDRs for CHD shows us:

(a)
Different types of CHD generally have different risk factors, while DiabetesYes and Antiplatelet MedicationYes are risk factors of Stable angina pectoris and Unstable angina pectoris separately. We can take a conclusion that Antiplatelet Medication provides an effective treatment, simultaneously should be highly aware of Diabetes.

(b)
Uric acidAbn. and SBPAbn. are the risk factors of 4 types of CHD, which indicate the causeandeffect correlation between these two risk factors and CHD.

(c)
Both antiplatelet medication, ACEI/ARB and Lipid drug therapy are often used to reduce highrisk factors.

(d)
Gender seems to have higher probability of different types of CHD.
Dirichlet hyperparameter
According to experience, Wei and Croft [32] pointed out that the choice of Dirichlet hyperparameter is not sensitive to the experimental results. They used symmetric Dirichlet priors in the estimation \(\alpha ={50}/{K}\) and \(\beta =0.01\). In this paper, according to our analysis (See “Parameter estimation” section) and the experimental results in Fig. 6, under the same topic (e.g., \(K=70\)), selecting different hyperparameter \(\alpha\) has no significant effect on the average prediction accuracy (Fig. 6right). However, after using \(\alpha\) from 0.1 to 1.5 with interval of 0.1, K from 10 to 70 with interval of 5, the 3D represention from Fig. 6left and right show that, rather than \(\alpha ={50}/{K}\), we should optimize hyperparameter \(\alpha\) with the fitted curve
where \(c_{1}\) and \(c_{2}\) are constants and V is the number of patient featurevalue pairs (e.g., \(c_{1} = 0.3\), \(c_{2}\) = 0.25 and \(V = 79\) in our experiments).
Panichella [33] also showed that searchbased approaches are very effective in solving LDA hyperparameter tuning problem. In our proposed Hierarchical sLDA model, Dirichlet hyperparameter can affect the speed of convergence (See “Parameter estimation” section, Eqs. 10 and 11), therefore this structure has the potential to become a generic scheme for variants of sLDAbased model.
Discussion
Traditional statistical models for risk stratification of CVD risk are welldeveloped, but typically less flexible than machine learning techniques and only hold under well controlled conditions for prediction and classification.
Hierarchical sLDA model is a variant of sLDA. The experimental results clearly demonstrate that the proposed Hierarchical sLDA has significant advantages on supervised CHD classification and risk stratification relative to the compared Multiclass sLDA approach, including accuracy and time performance.
Most classification techniques do not handle hierarchical features, which offer little insight beyond a flat featurebased segmentation, as they assume that features in the training datasets are fully independent. By categorizing patient features in HDRs as patient featurevalue pairs, threelayer sLDA is improved to fourlayer Hierarchical sLDA, which can accelerate the convergence of timeconsuming Gibbs sampling.
We have shown that the model, as the number of topics increases, is converging smoothly. That means it is not suffering from overfitting problem and provides more robust than other classifiers.
Intriguingly, according to experience, Dirichlet hyperparameters are set to \(\alpha = 50/\)K and \(\beta = 0.01\). However, we recommend that Dirichlet hyperparameter \(\alpha\) may be optimized in another setup policy (See “Dirichlet hyperparameter” section). Through experiment analysis, there exists two limitations and weaknesses for the current approach:

(a)
Insufficient training data. From the data source process, due to specific difficulties, features extraction is semiautomatic, which can be time and labor intensive. And therefore, Insufficient training data limits the performance of the model.

(b)
More experiments. For future work, we will investigate the performance of our model when applied to other topic models and datasets.
Conclusions
We hereby have proposed an approach in HDRs for risk stratification and classification of CHD simultaneously over our datasets, which is competitive in time performance and accuracy. Hierarchical processing of patient features can significantly improve the disadvantages of low efficiency and timeconsuming Gibbs sampling of sLDA model. Meanwhile our model has the potential to be applied to other datasets by transforming the features of datasets into featurevalue pairs. On the other hand, while coronary angiography is the gold standard for the diagnosis of CHD, but its limitations, such as invasive, random errors by the selection of radiographic projection, limit its wide clinical applications. Risk factors, which is extracted from risk stratification, can be used as a reference to provide individualized prevention and therapeutic decisions with noninvasive methods. However, the difficulty in processing complicated clinical applications suggests that this is still an open question needed to be solved in future research.
Availability of data and materials
The datasets used in this study and the source code of Hierarchical sLDA are available at https://github.com/flyingbiao/HierachicalsLDA. All experiments were performed on Intel(R) Core(TM) i7 @4.20GHz with 16G bytes main memory running on Ubuntu 18.04.2. The algorithm was implemented using C++.
References
Rod J, Carlene Mm L, et al. Treatment with drugs to lower blood pressure and blood cholesterol based on an individual's absolute cardiovascular risk. Lancet. 2014;384(9943):591–8. https://doi.org/10.1016/S01406736(14)612125.
Schlesinger DE, Stultz CM. Deep learning for cardiovascular risk stratification. Curr Treat Options Cardiovasc Med. 2020. https://doi.org/10.1007/s11936020008140.
Brindle P, Beswick A, Fahey T, Ebrahim S. Accuracy and impact of risk assessment in the primary prevention of cardiovascular disease: a systematic review. Heart. 2006;92(12):1752–9. https://doi.org/10.1136/hrt.2006.087932.
Matheny M, Mcpheeters ML, et al. Systematic review of cardiovascular disease risk assessment tools [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2011.
Hsueh PYS, Zhu XX, et al. Automatic summarization of risk factors preceding disease progression an insightdriven healthcare service case study on using medical records of diabetic patients. World Wide Web Internet Web Inf Syst. 2015;18(4):1163–75. https://doi.org/10.1007/s1128001403042.
Brom H, Brooks Carthon JM, Ikeaba U, Chittams J. Leveraging electronic health records and machine learning to tailor nursing care for patients at high risk for readmissions. J Nurs Care Qual. 2019;35(1):27–33. https://doi.org/10.1097/NCQ.0000000000000412.
Whitlock EL, Braehler MR, Kaplan JA, Finlayson E, Rogers SE, Douglas V, Donovan AL. Derivation, validation, sustained performance, and clinical impact of an electronic medical recordbased perioperative delirium risk stratification tool. Anesth Analg. 2020;131(6):1901–10. https://doi.org/10.1213/ANE.0000000000005085.
Safarova MS, Kullo IJ. Using the electronic health record for genomics research. Curr Opin Lipidol. 2020;31(2):85–93. https://doi.org/10.1097/MOL.0000000000000662.
Petersen JD, Lozovatsky M, Markovic D, Duncan R, Zheng S, Shamsian A, Kagele S, Ross MK. Clinical decision support for hyperbilirubinemia risk assessment in the electronic health record. Acad Pediatr. 2020;20(6):857–62. https://doi.org/10.1016/j.acap.2020.02.009.
Stubbs A, Kotfila C, Xu H, Uzuner O. Identifying risk factors for heart disease over time: overview of 2014 i2b2/UTHealth shared task Track 2—ScienceDirect. J Biomed Inform. 2015;58(S):67–77. https://doi.org/10.1016/j.jbi.2015.07.001.
Conroy R, Sans S, Fitzgerald A, et al. Estimation of tenyear risk of fatal cardiovascular disease in Europe: the score project. Eur Heart J. 2003;11(24):987–1003. https://doi.org/10.1016/s0195668x(03)001143.
Goff DC, LloydJones DM, Bennett G, et al. 2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. J Am Coll Cardiol. 2014;63(25,B):2935–59. https://doi.org/10.1016/j.jacc.2013.11.005.
Huang Z, Dong W, Duan H. A probabilistic topic model for clinical risk stratification from electronic health records. J Biomed Inform. 2015;58:28–36. https://doi.org/10.1016/j.jbi.2015.09.005.
Blei DM. Probabilistic topic models. Commun ACM. 2012. https://doi.org/10.1145/2133806.2133826.
Hofmann T. Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval—SIGIR ’99 0. 1999. p. 50–57. https://doi.org/10.1145/312624.312649.
Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res. 2003;3:993–1022. https://doi.org/10.1162/jmlr.2003.3.45.993.
Girolami M, Kabán A. On an equivalence between PLSI and LDA. In: Proceedings of the 26th annual international ACM SIGIR conference on research and development in informaion retrieval—SIGIR ’03. 2003. p. 433–434. https://doi.org/10.1145/860435.860537.
Steyvers M, Griffiths T. Handbook of latent semantic analysis. 2014. p. 427–448. https://doi.org/10.4324/9780203936399.ch21.
Jelodar H, Wang Y, Yuan C, Feng X. Latent Dirichlet allocation (LDA) and topic modeling: models, applications, a survey. Multimed Tools Appl. 2017;78:15169–211. https://doi.org/10.1007/s1104201868944.
Li W, McCallum A. Pachinko allocation: DAGstructured mixture models of topic correlations. In: Proceedings of the 23rd international conference on machine learning—ICML’06. 2006. p. 577–584. https://doi.org/10.1145/1143844.1143917.
Blei DM, McAuliffe JD. Supervised topic models. Adv Neural Inf Process Syst. 2010;3:327–32. https://doi.org/10.1109/ICPR.2014.65.
Wang C, Blei DM, Li FF. Simultaneous image classification and annotation. In: IEEE conference on computer vision and pattern recognition—CVPR’09. 2009. p. 1903–1910. https://doi.org/10.1109/CVPR.2009.5206800.
Li FF, Perona P. A bayesian hierarchical model for learning natural scene categories. In: IEEE computer society conference on computer vision and pattern recognition—CVPR’05, vol 2. 2005. p. 524–531. https://doi.org/10.1109/CVPR.2005.16.
Bosch A, Zisserman A, Munoz X. Scene classification via pLSA. Eur Conf Comput Vis. 2006;3954:517–30. https://doi.org/10.1007/11744085_40.
Zeng J, Cheung WK, Liu J. Learning topic models by belief propagation. IEEE Trans Pattern Anal Mach Intell. 2013;35(5):1121–34. https://doi.org/10.1109/TPAMI.2012.185.
Yuan L, Bin J, Wei Y, Huang F, Hu X, Tan M. Big data aspectbased opinion mining using the slda and hmelda models. Wirel Commun Mob Comput. 2020;2020:1–19. https://doi.org/10.1155/2020/8869385.
Glasserman P, Krstovski K, Laliberte P, Mamaysky H. Choosing news topics to explain stock market returns. In: ACM international conference on AI in finance. 2020. p. 1–8. https://doi.org/10.1145/3383455.3422557.
Lu X, Guo Y, Chen J, Wang F. Topic change point detection using a mixed bayesian model. Data Min Knowl Discov. 2021. https://doi.org/10.1007/s10618021008041.
Baburajan V, de Abreu e Silva J, Pereira FC. Openended versus closedended responses: a comparison study using topic modeling and factor analysis. IEEE Trans Intell Transp Syst. 2021;22(4):2123–32. https://doi.org/10.1109/TITS.2020.3040904.
Ma L, Fan S. CURESMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests. BMC Bioinform. 2017;18:1–18. https://doi.org/10.1186/s128590171578z.
Hassan H, Badr A, Abdelhalim MB. Prediction of Oglycosylation sites using random forest and GAtuned PSO technique. Bioinform Biol Insights. 2015;9:103–9. https://doi.org/10.4137/BBI.S26864.
Wei X, Croft WB. LDAbased document models for adhoc retrieval. In: Proceedings of the 29th annual international ACM SIGIR. 2006. p. 178–185. https://doi.org/10.1145/1148170.1148204.
Panichella A. A systematic comparison of searchbased approaches for LDA hyperparameter tuning. Inf Softw Technol. 2021;130:106411. https://doi.org/10.1016/j.infsof.2020.106411.
Acknowledgements
The authors sincerely thank the cooperative hospital as well as all medical staffs involved for their supports. In addition, we would like to thank the anonymous reviewers for comments and suggestions offered for improvements.
Funding
This work was supported by the Natural Science Foundation of Xinjiang Uygur Autonomous Region under Grant No. 2019D01C205 and No. 2019D01C041. This content is entirely the responsibility of the authors and does not represent the official views of founder.
Author information
Authors and Affiliations
Contributions
GY wrote the main manuscript text. YZ prepared datasets. LZ, JZ, TZ designed the experiment. XB prepared data preprocessing. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
This study was approved by the ethics committee of the Medical Ethics Committee of the First Affiliated Hospital of Xinjiang Medical University, and the committee’s reference number is K20210701.
Consent for publication
Not applicable.
Competing interests
The authors report that they have no conflicts.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Yu, G., Zhang, L., Zhang, Y. et al. Prediction and risk stratification from hospital discharge records based on Hierarchical sLDA. BMC Med Inform Decis Mak 22, 14 (2022). https://doi.org/10.1186/s12911022017473
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12911022017473
Keywords
 Risk stratification
 Topic models
 Supervised latent Dirichlet allocation
 Hospital discharge records