 Research article
 Open Access
Quality indices for topic model selection and evaluation: a literature review and case study
BMC Medical Informatics and Decision Making volume 23, Article number: 132 (2023)
Abstract
Background
Topic models are a class of unsupervised machine learning models, which facilitate summarization, browsing and retrieval from large unstructured document collections. This study reviews several methods for assessing the quality of unsupervised topic models estimated using nonnegative matrix factorization. Techniques for topic model validation have been developed across disparate fields. We synthesize this literature, discuss the advantages and disadvantages of different techniques for topic model validation, and illustrate their usefulness for guiding model selection on a large clinical text corpus.
Design, setting and data
Using a retrospective cohort design, we curated a text corpus containing 382,666 clinical notes collected between January 1, 2017 and December 31, 2020 from primary care electronic medical records in Toronto, Canada.
Methods
Several topic model quality metrics have been proposed to assess different aspects of model fit. We explored the following metrics: reconstruction error, topic coherence, rank biased overlap, Kendall’s weighted tau, partition coefficient, partition entropy and the Xie-Beni statistic. Depending on context, cross-validation and/or bootstrap stability analysis were used to estimate these metrics on our corpus.
Results
Cross-validated reconstruction error favored large topic models (K ≥ 100 topics) on our corpus. Stability analysis using topic coherence and the Xie-Beni statistic also favored large models (K = 100 topics). Rank biased overlap and Kendall’s weighted tau favored small models (K = 5 topics). Few model evaluation metrics suggested mid-sized topic models (25 ≤ K ≤ 75) as being optimal. However, human judgement suggested that mid-sized topic models produced expressive low-dimensional summarizations of the corpus.
Conclusions
Topic model quality indices are transparent quantitative tools for guiding model selection and evaluation. Our empirical illustration demonstrated that different topic model quality indices favor models of different complexity, and may not select models aligning with human judgment. This suggests that different metrics capture different aspects of model goodness of fit. A combination of topic model quality indices, coupled with human validation, may be useful in appraising unsupervised topic models.
Background
An increasing share of modern human communication is captured in digital text format [1]. The increasing digitization of communication creates a demand for computational and statistical methods to facilitate exploration, understanding and extraction of meaningful insights from these voluminous and complex text data sources.
Topic models represent a class of statistical techniques for summarizing large text corpora. These models represent a document as arising from an admixture of latent topical vectors. The k = 1…K latent topical vectors describe the thematic content of the corpus using a set of semantically correlated word clusters. An admixing parameter (of dimension K) expresses the extent to which a particular document displays an affinity for a particular topic. Numerous statistical algorithms exist for estimating a topic model. One of the earliest techniques involved representing a text corpus as a document term matrix (DTM) and decomposing the resulting large sparse DTM via singular value decomposition, a methodology termed Latent Semantic Analysis (LSA) [2,3,4]. To facilitate improved interpretation, nonnegative matrix factorization (NMF) has been employed to decompose the DTM. NMF summarizes an input DTM in terms of a low-rank outer product decomposition; however, in NMF the statistical decomposition forces nonnegativity constraints on row and column bases, yielding an additive parts-based interpretation [5,6,7]. To enhance interpretation of how complex document collections emerge, probabilistic extensions of LSA/NMF were developed, including probabilistic latent semantic indexing (pLSI) [8] and latent Dirichlet allocation (LDA) [9,10,11,12].
As with other unsupervised machine learning algorithms, post-hoc evaluation, validation, and criticism of fitted topic models is encouraged, albeit challenging. Depending on the variant of topic model fit to a given document collection, several hyperparameters need to be tuned to achieve a meaningful summarization of the corpus. A common hyperparameter employed in all the aforementioned topic models is the number of topics (k = 1,2,3,…K), a discrete positive hyperparameter which governs model complexity. Specification of too few topics often yields a noisy thematic characterization, where learned topics are broad in scope (i.e. words loading highly on a given topic are not semantically correlated); whereas specification of too many topics often results in over-clustering (a phenomenon whereby semantically related topics are redundantly repeated in the summarization). Topic model validity indices are post-hoc quantitative metrics which can be used to guide the analyst towards aspects of a “good fitting” model. A wide variety of internal model quality indices have been proposed across disparate statistical research fields.
The primary objective of this manuscript is to review and synthesize literature surrounding modern topic model validity indices. As a secondary objective, we fit several topic models and apply different topic model quality indices to a large corpus of clinical notes collected from primary care electronic medical records from Toronto, Canada. We comment on the ability of these quality indices to guide analysts towards a consensus model. We critically appraise the different topic model quality indices and investigate how different metrics assess different aspects of model stability, robustness, and goodness of fit.
Literature review
Nonnegative matrix factorization topic models
In this study, we fit a nonnegative matrix factorization (NMF) topic model to an input document term matrix (DTM). The DTM is a large sparse matrix with d = 1…D rows (a single row for each document/note in the corpus) and v = 1…V columns (a single column for each word/token in the empirical vocabulary). Each element of the DTM \(({X}_{dv})\) is a count random variable, denoting the number of times word/token (v) occurs in document (d). NMF factorizes the D*V dimensional DTM into two latent submatrices of dimension D*K (\(\theta\)) and K*V (\(\phi\)). The DTM (X) consists of nonnegative integers (i.e. word frequency counts); whereas, the learned matrices (\(\theta\),\(\phi\)) consist of nonnegative real values.
Many suitable objective functions have been proposed for learning the NMF latent matrices (θ,\(\phi\)). We introduce a general NMF objective function below.
The objective function specifies that the observed elements of the DTM are approximated by a K-dimensional bilinear form (\(\sum_{k=1}^{K}{\theta }_{\left\{d,k\right\}}{\phi }_{\left\{k,v\right\}}\)). The user must specify the dimension of the latent space (K). The model can be adapted to different data generating mechanisms via the choice of loss function (\(f\left({X}_{\left\{d,v\right\}}; \sum_{k=1}^{K}{\theta }_{\left\{d,k\right\}}{\phi }_{\left\{k,v\right\}}\right)\)). Arbitrary weighting of data points can be accommodated through \({w}_{\left\{d,v\right\}}\) (examples of weighted NMF models are given in Udell et al. [13]). Regularization (\(\Lambda \left({\theta }_{d,k}, {\phi }_{k,v}\right)\)) can be introduced to achieve parameter estimates with desirable properties (e.g. sparsity, smoothness, minimum volume, etc.). Seminal articles on NMF include Paatero & Tapper [14] and Lee & Seung [5, 6]. Surveys of NMF and low rank models are given in Berry [15] and Udell et al. [13].
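The objective described above can be written out as a sketch. The display below is assembled only from the terms named in the text (weights, loss function, bilinear form, regularizer); the exact notation, including the summation limits and the placement of the nonnegativity constraints, is an assumption.

\[
\min_{\theta \ge 0,\ \phi \ge 0}\ \sum_{d=1}^{D}\sum_{v=1}^{V} {w}_{\{d,v\}}\, f\left({X}_{\{d,v\}}; \sum_{k=1}^{K}{\theta }_{\{d,k\}}{\phi }_{\{k,v\}}\right) + \Lambda \left({\theta }_{d,k}, {\phi }_{k,v}\right)
\]

Under this form, setting all weights to one, choosing a squared-error loss for f, and dropping the regularizer yields the unregularized Frobenius-norm objective.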
When using NMF for topic modelling, a sparse D*V dimensional DTM (X) is factored into two nonnegative real matrices: a D*K dimensional matrix of per-document topic parameters (\(\theta\)) and a K*V dimensional matrix of per-topic word parameters (\(\phi\)). The NMF model imposes nonnegativity constraints on the estimates of the latent matrices (\(\theta\) and \(\phi\)). Post-hoc, one can normalize the row vectors constituting both \(\theta\) and \(\phi\), dividing by their respective row sums. The resulting normalized vectors can be interpreted as compositional/probability vectors (i.e. each normalized row of \(\theta\) and \(\phi\) contains nonnegative entries which sum to one). Row vectors of the matrix \(\phi\) encode a set of k = 1…K per-topic word probabilities (estimated over a discrete set of v = 1…V words in our corpus). Row vectors of the matrix \(\theta\) encode a set of d = 1…D per-document topic proportions (estimated over a discrete set of k = 1…K latent dimensions), encoding the affinity a given document has for a particular topic.
Quality indices/metrics for evaluating NMF topic models
A topic model validity index is a numeric metric/score used to guide selection of an “optimal” topic model fitted to a given document collection. The choice of an “optimal” model is context dependent and, in many cases, may even represent a nebulous concept [16, 17]. As a result of the difficulty associated with a priori defining the attributes of an optimally performing topic model, different validity indices have been developed across different statistical research communities, which highlight different aspects of model goodness of fit. Four broad classes of topic model quality indices will be introduced: 1) metrics which emphasize model fit (i.e. residual error or reconstruction error), 2) metrics which focus on evaluation of the per-topic distribution over words matrix \(({\phi }_{\{k,v\}})\), 3) metrics which prioritize evaluation of the per-document distribution over topics matrix (\({\theta }_{\{d,k\}}\)), and 4) metrics which simultaneously combine evaluation of \({\theta }_{\{d,k\}}\) and \({\phi }_{\{k,v\}}.\)
Each of the topic quality indices discussed are examples of internal validation indices [18, 19]. Internal indices construct a validation score using only data available during the topic model fitting process. These internal indices can be contrasted with external validation indices. An external validation index uses information collected via the same sampling process that generated the original DTM; however, it is external to the topic model fitting/estimation algorithm. For example, it is common to evaluate a topic model in terms of the ability of the latent topical basis (particularly the D*K matrix of per-document topic weights, θ) to predict an external target vector in a regression/classification context.
A final approach to validating topic models involves subjective human interpretation/validation.
Matthews [20] describes “eyeballing” as a common approach to validating (or ascribing meaning to) fitted topic models. Using the “eyeballing” method, researchers fit several topic models to an observed document collection (over a predetermined hyperparameter grid) and subjectively label learned topic distributions by inspection of high-loading words/tokens (from \(\phi\)) or high-loading documents (from \(\theta\)). This subjective human-centric approach to topic model validation parallels that of face validity checks or social validity checks used in qualitative content analyses [21]. Doogan et al. [17] are also proponents of a more exhaustive approach to human-in-the-loop topic model validation where both the latent topic vectors and the per-document topic weights are simultaneously evaluated.
Monte Carlo cross validation on reconstruction error metrics
Cross-validation is a commonly employed methodology for estimating model performance and conducting model validation/selection [22]. Several challenges arise when cross-validating a matrix factorization topic model. The input data structure in NMF topic modelling is a sparse high-dimensional DTM. When performing cross-validation in the context of NMF topic modelling we do not want to hold out an entire row/column of the DTM. If an entire row/column index of the DTM is held out for validation/testing (in a k-fold cross-validation scheme), then the training algorithm will never learn an embedding/basis over the held-out row/column indices. As such, simple k-fold cross-validation schemes are not amenable to cross-validating an NMF topic model.
Wold [23] discusses several holdout schemes relevant to cross-validating matrix factorization models. One scheme holds out individual elements/indices (d,v) at random from the DTM, in such a manner that no entire row/column is left out of the training process. Wold [23] introduces an alternative cross-validation scheme which holds out diagonal bands of the DTM, again ensuring that no entire row/column is excluded from the training process. Owens et al. [24] extend the idea, holding out contiguous blocks of rows/columns, again ensuring no entire row/column is held out of the DTM during the training and cross-validation process. Bro et al. [25] review cross-validation in the context of matrix factorization problems. Lastly, if the matrix (or DTM) being sampled is dense, many of the above cross-validation schemes are trivial to implement; however, if the matrix is sparse (as is the case with many DTMs), checks on the validity of the holdout process need to be carefully implemented.
In this study we employ a Monte Carlo cross-validation scheme similar to Wold [23]. We represent the DTM using a sparse triplet/coordinate-format data structure. We randomly sample 80% of data elements for inclusion in the training process, and 20% of data elements are held out for inclusion in the validation process. Sampling is conducted without replacement. When sampling (d,v,x) triples for inclusion in the training sample, we write assertions to check that all row indices (d = 1…D) and all column indices (v = 1…V) are included in the training sample. If a randomly generated training sample excludes an entire row/column index, then the Monte Carlo cross-validation sample is rejected as invalid. We repeat the random sampling process five times (noting that sampling (d,v,x) triples from a large/sparse DTM is computationally expensive). We fit five independent NMF topic models to each Monte Carlo cross-validation training sample, estimate reconstruction error on each of the training/validation samples, and average the reconstruction errors over held-out validation samples.
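The element-wise holdout with row/column coverage checks can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: the function names `mc_holdout_split` and `heldout_rmse`, the retry limit, and the choice to store held-out entries as zeros in the training matrix are all hypothetical simplifications.

```python
import numpy as np
from scipy import sparse


def mc_holdout_split(dtm, train_frac=0.8, max_tries=100, seed=0):
    """Split the stored (d, v, x) triples of a DTM into training and validation
    parts, rejecting any split whose training part misses a whole row or column."""
    coo = sparse.coo_matrix(dtm)
    rng = np.random.default_rng(seed)
    n_train = int(round(train_frac * coo.nnz))
    for _ in range(max_tries):
        perm = rng.permutation(coo.nnz)            # sampling without replacement
        tr, va = perm[:n_train], perm[n_train:]
        rows_ok = np.unique(coo.row[tr]).size == coo.shape[0]
        cols_ok = np.unique(coo.col[tr]).size == coo.shape[1]
        if rows_ok and cols_ok:                    # every document and token is seen in training
            train = sparse.coo_matrix(
                (coo.data[tr], (coo.row[tr], coo.col[tr])), shape=coo.shape
            ).tocsr()
            valid = (coo.row[va], coo.col[va], coo.data[va])
            return train, valid
    raise RuntimeError("no valid split found; the DTM may be too sparse")


def heldout_rmse(theta, phi, valid):
    """Root mean squared reconstruction error on the held-out triples."""
    d, v, x = valid
    pred = np.einsum("ik,ki->i", theta[d], phi[:, v])   # sum_k theta[d,k] * phi[k,v]
    return float(np.sqrt(np.mean((x - pred) ** 2)))
```

Repeating the split with different seeds and averaging `heldout_rmse` over the resulting validation sets mirrors the averaging step described above.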
Bootstrap stability analysis using topic coherence metrics
Stability analysis is a generic methodology for evaluating the quality of a fitted unsupervised machine learning model. The methodology proceeds by drawing bootstrap samples (i.e. sampling with replacement) from the original data structure; a model is fitted to each bootstrap replicate dataset, and its quality/stability is evaluated [26, 27]. The specifics of the methodology are dependent on the topic model quality metric used in the analysis. In this section we discuss stability analysis in the context of topic coherence metrics.
Topic coherence metrics represent a family of scoring functions used to quantify the semantic correlation of word/token lists. For a given topic model, consider the top-P words/tokens loading most highly on a specific topical vector (k). Typically P is chosen to be a small integer value (P = {5, 10, 25, etc.}). Topical coherence is estimated by constructing \(\left(\begin{array}{c}P\\ 2\end{array}\right)\) co-occurrence scores between each pair of words/tokens in the top-P list. The co-occurrence scores are based on word frequency co-occurrence counts in the observed document collection (although external corpora may also be introduced for scoring). For each topic k = 1…K, an estimate of topical coherence is obtained. Averaging over k = 1…K topics in a fitted topic model results in an overall coherence score for the model fit. These scores can then be compared across b = 1…B bootstrap replicate samples to assess model stability. In this manuscript we consider two metrics: the UCI scoring metric [28, 29] and the UMASS scoring metric [30]. Development of topic coherence scores remains a popular area of research and a review on available topic coherence metrics is discussed in Roder et al. [31].
The UCI and UMASS scores are given below. The quantities describe the marginal or joint occurrence probabilities of words/tokens in the empirical corpus (although any arbitrarily chosen external corpus could be used to estimate the marginal/joint probabilities of a word occurrence).
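As a rough sketch of how such scores can be computed, the function below uses the commonly cited forms of the two metrics: UCI as the average pointwise mutual information of word pairs, and UMASS as the average log conditional probability of a word given a higher-ranked word. The smoothing constant `eps`, the use of document-level occurrence frequencies, and the function name are assumptions and may differ from the definitions used in the cited references.

```python
import numpy as np
from itertools import combinations


def topic_coherence(X, top_words, eps=1e-12):
    """Return (UCI, UMASS) coherence for one topic.

    X is a dense (D, V) count array; top_words lists the column indices of the
    topic's top-P words, ordered from highest to lowest loading.
    """
    occurs = (np.asarray(X) > 0).astype(float)     # binary document-word incidence
    uci, umass = [], []
    for wi, wj in combinations(top_words, 2):      # wi is ranked above wj
        p_i = occurs[:, wi].mean()                 # marginal occurrence probability
        p_j = occurs[:, wj].mean()
        p_ij = (occurs[:, wi] * occurs[:, wj]).mean()   # joint occurrence probability
        uci.append(np.log((p_ij + eps) / (p_i * p_j)))
        umass.append(np.log((p_ij + eps) / p_i))   # conditions on the higher-ranked word
    return float(np.mean(uci)), float(np.mean(umass))
```

Averaging the per-topic values over k = 1…K gives a model-level score that can then be compared across bootstrap replicates. The sketch assumes every top word occurs in at least one document.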
Stability analysis using set based agreement metrics
Set-based agreement metrics can also be employed to assess topic model stability. Under this approach to stability analysis, we begin by drawing b = 1…B bootstrap replicate samples from the original DTM (i.e. sampling with replacement). The goal is to compare agreement between top-P lists of semantically aligned topics fitted under different models. If there exist B bootstrap samples drawn for the stability analysis, then there exist \(\left(\begin{array}{c}B\\ 2\end{array}\right)\) pairs of models to compare, each model being of complexity K. For any given per-topic word distribution (indexed k = 1…K), and any two models (\({M}_{s}\) and \({M}_{t}\) from bootstrap samples \({b}_{s}\) and \({b}_{t}\) respectively, for s,t = 1…B), the goal is to pairwise compare the two resulting top-P lists (over k = 1…K latent dimensions in the model). There exist numerous metrics for comparing top-P lists of discrete items, and these types of metrics are reviewed in Fagin et al. [32]. In this study, we will use a particular set-based agreement metric, rank biased overlap (RBO) [33].
An inherent challenge associated with this variant of stability analysis involves the exchangeability of the learned topics from a fitted NMF model. That is, the ordering of the topical basis in the low-rank reconstruction is arbitrary. As such, even if two NMF topic models of the same latent dimension are fitted to a dataset (or two different bootstrap datasets) there is no guarantee that semantically related topics occur in the same (arbitrary) ordering across different model fits. However, it is possible to align the learned topical bases (\({\phi }_{s}\)) and (\({\phi }_{t}\)) across the two bootstrap datasets. Alignment of the respective topical matrices is a type of linear sum assignment problem. In other words, the goal is to learn an optimal K*K permutation matrix (\(\pi \in \Pi\)) such that the Frobenius norm error between the topical matrices (\({\phi }_{s}\)) and (\({\phi }_{t}\)) is minimized. Solving such a problem can be approached using the Hungarian algorithm [34, 35; see also Appendix A].
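The alignment step can be illustrated with SciPy's `linear_sum_assignment`, which solves this kind of assignment problem. The sketch below is an assumption about one reasonable implementation (the helper name `align_topics` is hypothetical): minimizing the summed squared Euclidean distance between matched rows is equivalent to minimizing the squared Frobenius norm between one topical matrix and the row-permuted other.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def align_topics(phi_s, phi_t):
    """Reorder the rows of phi_t so that row k best matches row k of phi_s."""
    # cost[i, j] = squared Euclidean distance between topic i of phi_s and topic j of phi_t
    cost = cdist(phi_s, phi_t, metric="sqeuclidean")
    _, col_ind = linear_sum_assignment(cost)       # optimal one-to-one matching
    return phi_t[col_ind], col_ind
```

After alignment, the k-th rows of the two matrices can be compared directly, for example by extracting and comparing their top-P word lists.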
Given two aligned topical matrices, from two NMF model fits, on two distinct bootstrap datasets, we proceed to estimating the RBO agreement metric between the two top-P token/word lists (S and T).
RBO measures the weighted average agreement between two top-P lists. The metric takes values in [0,1], with zero indicating no agreement between top-P lists, and one indicating complete agreement. The RBO score is a function of a tunable hyperparameter (z \(\in\) (0,1)) which determines how the score prioritizes agreement over top-P list depth. Smaller values of z favor top-weighted elements in the list, and in the limit as z approaches 0 only the first items in the pair of lists are compared. \({A}_{P}\) measures the fractional/proportional agreement of the lists (S and T) up to depth P. An example of top-5 agreement between word lists is given below (Table 1).
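A minimal sketch of an RBO computation for two equal-depth lists follows. It uses the extrapolated form of the score, chosen here because it equals one for identical lists and zero for disjoint lists, matching the range described above; whether the study used this or the plain truncated sum is not stated, so this choice (and the function name `rbo`) is an assumption. Lists are assumed to contain no repeated items.

```python
def rbo(S, T, z=0.9):
    """Extrapolated rank biased overlap of two ranked lists, compared to depth P."""
    P = min(len(S), len(T))
    seen_s, seen_t = set(), set()
    weighted_sum, agreement = 0.0, 0.0
    for depth in range(1, P + 1):
        seen_s.add(S[depth - 1])
        seen_t.add(T[depth - 1])
        agreement = len(seen_s & seen_t) / depth   # A_depth: proportional overlap so far
        weighted_sum += agreement * z ** depth
    # Finite weighted sum plus a term extrapolating the final agreement level.
    return agreement * z ** P + (1 - z) / z * weighted_sum
```

For example, comparing `["cough", "fever", "flu", "cold", "throat"]` with a copy in which only the last two words are swapped gives a higher score than a copy in which the first two words are swapped, reflecting the top-weighting controlled by z.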
We note that using an identical strategy as defined above, the RBO metric could also be used to assess the stability of returned top-P document lists from the matrix of per-document topic proportions (θ).
Stability analysis using rank based correlation metrics
Another commonly encountered metric for comparing top-P lists is Kendall’s tau statistic, a measure of rank-based correlation [36] ranging between [−1, +1]. The statistic is the ratio \(\tau = (C - Q)/\left(\begin{array}{c}P\\ 2\end{array}\right)\): the numerator is the number of concordant pairs (C) minus the number of discordant pairs (Q) between the two ranked lists, and the denominator is the total number of ways to choose two items from a rank-P list.
Kendall’s tau and other rank-based correlation metrics can be used to assess the quality/stability of top-P word/token lists emerging from \(\phi\), and of top-P document lists emerging from the matrix \(\theta\). An issue with Kendall’s tau in the context of comparing top-P lists is that the metric demands that all elements in one list be contained in the other list. In other words, the two top-P lists being compared must contain the same items (i.e. be conjoint, and not disjoint). Heuristics have been proposed to circumvent this challenge: for example, removing items/elements which occur in only one list from the scoring, or adding items occurring in only one list to the end of the other list. We have not seen these types of heuristics employed in the context of topic model evaluation; however, they are necessary, as bootstrap sampling does not ensure the same indices are present in any two samples/models being compared. Hence, we consider concordance estimated over the intersection of indices present in each pair of bootstrap samples. By the bootstrap 0.632 principle and the independence of the generated bootstrap samples, the concordance is estimated over approximately 0.632 × 0.632 ≈ 0.399 (about 40 percent) of the original indices in the input DTM.
In this study, we use a variant of Kendall’s tau, weighted Kendall’s tau, and estimate concordance over all elements of \(\theta\) and \(\phi\), respectively. Using weighted Kendall’s tau, elements of the top-V or top-D lists are not given equal weight; rather, items appearing higher in the ranked lists are given higher weighting. Any positive weighting function can be employed; in this study we use the hyperbolic function 1/(1 + r), where r denotes the rank of an item. The hyperbolic function assigns high weight to items appearing near the top of the ranked lists and attempts to ensure that arbitrary swaps/exchanges of low-ranking elements of the large lists do not unnecessarily deflate estimates of rank-based concordance.
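One readily available implementation of this statistic is `scipy.stats.weightedtau`, whose default weigher is the hyperbolic function 1/(1 + r). The wrapper below is a minimal sketch; the name `weighted_tau` is hypothetical, and the inputs are assumed to be two aligned score vectors over the same indices (for example, the same topic's row of \(\phi\) from two bootstrap fits, restricted to shared tokens).

```python
import numpy as np
from scipy.stats import weightedtau


def weighted_tau(a, b):
    """Weighted Kendall's tau between two score vectors over the same items.

    Larger scores are treated as more important, so disagreements among
    high-loading items cost more than disagreements among low-loading items.
    """
    tau, _ = weightedtau(np.asarray(a, float), np.asarray(b, float))
    return float(tau)
```

With this weighting, swapping the two highest-loading words of a topic lowers the score more than swapping the two lowest-loading words.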
Stability analysis using fuzzy clustering quality indices
We note that NMF topic modelling resembles the framework of admixture/mixed-membership modelling [37]. The learned parameter matrices (\(\theta\) and \(\phi\)) contain nonnegative real numbers; however, we can scale the row vectors of both \(\theta\) and \(\phi\) to represent probability/compositional vectors, dividing each by its respective row sum. Following transformation of the row vectors from the \(\theta\) matrix to the probability simplex, the NMF topic model closely resembles the grade of membership matrix used in fuzzy clustering models [38], and hence quality indices developed for validating fuzzy clustering solutions may be used to evaluate aspects of the quality of NMF topic model fits.
The normalized matrix of per-document topic weights (\({\theta }^{*}\)) describes the affinity of a given document for a specific latent topic vector. Each row of \({\theta }^{*}\) is a compositional/probability vector (living on the (K−1)-dimensional simplex). Under this transformation, the matrix \({\theta }^{*}\) closely resembles the grade-of-membership matrix in fuzzy clustering models. Several validity indices from the fuzzy clustering community have been developed for investigating aspects of the fuzzy clustering solutions, including the (modified) partition coefficient (PC) and the (modified) partition entropy (PE) [39, 40]. Mathematical details of both validity indices are given below. Hard clustering solutions would represent \({\theta }^{*}\) as d = 1…D bit-vectors (i.e. K−1 elements would be 0, and a single element would equal 1). As fuzzy clustering solutions (and membership vectors) tend towards bit-vector type solutions, mixed-membership/admixture type models begin to look more like hard-clustering models (i.e. a data point is assigned to only a single latent topical vector, rather than a mixture over latent topics). This phenomenon is captured by the partition coefficient and partition entropy validity indices. For the partition coefficient, scores near 1 imply a hard-type clustering, whereas scores closer to zero imply a fuzzy clustering where documents spread topical prevalence weight across observed latent dimensions. Conversely, for the partition entropy, scores near zero imply a near hard clustering solution, whereas larger positive scores indicate a fuzzier clustering solution.
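The following sketch computes these indices from a fitted \(\theta\) matrix using their standard textbook forms: PC as the mean squared membership, PE as the mean Shannon entropy of the membership vectors, and the modified PC rescaled so that it spans [0, 1]. The function name, the smoothing constant `eps`, and the omission of a modified PE (for which several variants exist) are assumptions; the sketch also assumes K ≥ 2 and that every row of \(\theta\) has a positive sum.

```python
import numpy as np


def partition_indices(theta, eps=1e-12):
    """Partition coefficient, modified partition coefficient and partition entropy."""
    theta = np.asarray(theta, dtype=float)
    theta_star = theta / theta.sum(axis=1, keepdims=True)    # rows on the simplex
    D, K = theta_star.shape
    pc = float(np.sum(theta_star ** 2) / D)                   # in [1/K, 1]
    mpc = 1.0 - (K / (K - 1.0)) * (1.0 - pc)                  # rescaled to [0, 1]
    pe = float(-np.sum(theta_star * np.log(theta_star + eps)) / D)   # in [0, log K]
    return {"pc": pc, "mpc": mpc, "pe": pe}
```

A hard assignment (each row a bit-vector) gives a modified PC of one and a PE of zero, while uniform memberships give a modified PC of zero and a PE of log K.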
The PC and PE validity indices discussed above focus only on evaluating the per-document topic-weight matrix (\({\theta }^{*}\)). Alternative fuzzy clustering validity indices have been devised and reviewed [38]; many of the fuzzy clustering metrics attempt to evaluate aspects of cluster separation and compactness. The Xie-Beni index is mathematically described in [41, 42] and is one such fuzzy-cluster validity index which may be applied to assess quality of a transformed NMF topic model solution. Smaller values of the Xie-Beni index indicate a better fuzzy clustering model fit (i.e. topics are compact and well-separated, when scaled over the grade-of-membership matrix). As such, the Xie-Beni index is one metric which simultaneously evaluates aspects of both latent matrices (\(\theta\) and \(\phi\)).
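For orientation, the sketch below applies the usual Xie-Beni ratio (membership-weighted compactness divided by the sample size times the minimum squared separation between cluster centres) to an NMF solution. The mapping onto the NMF matrices is an assumption made for illustration and is not confirmed by the text: documents are represented by row-normalized DTM rows, cluster centres by row-normalized rows of \(\phi\), and memberships by row-normalized \(\theta\), with fuzzifier m = 2.

```python
import numpy as np
from scipy.spatial.distance import cdist


def xie_beni(X, theta, phi, m=2.0):
    """Xie-Beni style index for an NMF topic model (smaller is better)."""
    X, theta, phi = (np.asarray(a, dtype=float) for a in (X, theta, phi))
    x_star = X / X.sum(axis=1, keepdims=True)               # document word proportions
    theta_star = theta / theta.sum(axis=1, keepdims=True)   # fuzzy memberships
    phi_star = phi / phi.sum(axis=1, keepdims=True)         # topic "centres"
    dist2 = cdist(x_star, phi_star, metric="sqeuclidean")   # (D, K) squared distances
    compactness = np.sum(theta_star ** m * dist2)
    sep = cdist(phi_star, phi_star, metric="sqeuclidean")
    np.fill_diagonal(sep, np.inf)                            # ignore self-distances
    return float(compactness / (X.shape[0] * sep.min()))
```

The numerator draws on both \(\theta\) and \(\phi\), while the denominator depends on \(\phi\) alone, which is how the index evaluates both latent matrices at once.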
In terms of stability analysis, we can draw b = 1…B bootstrap replicate samples (with replacement) from the original DTM (X), and estimate the fuzzy cluster stability indices on NMF topic models fit to each bootstrap dataset and assess robustness/stability against data perturbations.
Data augmentation based stability analysis
Many of the NMF topic model validity indices discussed above could be applied to stochastically augmented/perturbed versions of the original DTM (rather than bootstrap replicate data samples). In this scenario we envision the DTM being structured in (d,v,x) triplet/coordinate format, represented as (row-index, column-index, value) tuples. Using a data augmentation approach to stability analysis, one could maintain the row/column indices in the coordinate format data structure, and stochastically augment the observed data value using random noise. One may draw an entirely new value from some parametric distribution, or one may jitter the observed data value (up/down) by some random value, while obeying constraints of the original data generating mechanism (e.g. DTM entries are nonnegative count/integer random variables).
There exist certain advantages to the data augmentation approach to topic model stability analysis: 1) it is computationally much faster to sample/jitter new data values from the triplet/coordinate format sparse DTM than it is to draw legitimate cross-validation or bootstrap samples, and 2) data augmentation trivially ensures that the row/column indices from the original DTM are preserved and that documents and/or words/tokens appearing in the input DTM also appear in the augmented samples (this may not be the case with cross-validation or bootstrap sampling, and if desired must be verified via programmatic assertions). As a limitation, data augmentation may not be theoretically as principled a methodology for assessing model stability and robustness as compared with cross-validation and bootstrap resampling. Further, assumptions regarding the distribution generating augmented values are subjective. We do not experimentally investigate the suitability of data augmentation in NMF topic model validity, but we do consider it to be an interesting area for further research.
Methods
Study design, setting, data sources and inclusion/exclusion criteria
Our study employs a retrospective open cohort design. Encounter-level data are collected from patient primary care electronic medical records from Toronto, Canada. Data curation and cleaning is conducted by the University of Toronto Practice Based Research Network (UTOPIAN: https://www.dfcm.utoronto.ca/utopian). Additional details regarding UTOPIAN, including sampling, representativeness and data curation are given in Garies et al. [43]. The study start date is January 1, 2017 and the study end date is December 31, 2020. We include all patients who contribute at least one primary care clinical progress note during our study timeframe. We exclude patients who are missing basic demographic information (e.g. age or sex) or study identifiers (e.g. patient ID, physician ID, clinic ID).
Computationally processing the clinical progress note corpus
Representing collections of text data using a document term matrix
A document term matrix is a D*V dimensional matrix. D represents the number of documents in the collection (here the number of unique clinical progress notes, written during patient-provider interactions). V represents the number of unique words/tokens in the empirical vocabulary of the document collection. A given element (\({X}_{dv}\)) of the matrix is a count random variable, denoting the number of times a particular word/token (v) was used in a particular document (d).
Preprocessing text data
Raw clinical text data are a sequence of digital characters (letters, numbers, punctuation, other symbols, etc.). For each document in our collection, these raw text strings must be processed into individual words/tokens to facilitate creation of a DTM. Many approaches exist for processing clinical text data. We discuss several key elements of our text preprocessing pipeline, namely, tokenization, vocabulary normalization and dictionary/vocabulary creation.
Tokenization refers to the process of separating raw text strings (i.e. digital character sequences) into individual words/tokens [44, 45]. We employ a simple form of “whitespace tokenization”, which separates input character sequences into words/tokens based on the presence of whitespace boundaries (e.g. spaces, tabs, newlines, etc.).
Text normalization refers to the process of converting word/tokens into a single canonical form. We normalize tokens based on regular expressions, namely: case folding (lowercase conversion), and removal of punctuation/numbers.
Following tokenization and normalization, we manually review the most frequently occurring words in the corpus and select 2210 tokens for inclusion in the corpus vocabulary; the selected token set consists of focused/specific medical entities (e.g. disease names, disease symptoms, drug names, medical procedures, medical specialties, anatomical locations, etc.).
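The preprocessing steps described above (whitespace tokenization, regular-expression normalization, and restriction to a fixed vocabulary) can be sketched as follows. The helper names `normalize` and `build_dtm`, the specific regular expression, and the toy vocabulary are hypothetical; in particular, the pattern below keeps only the ASCII letters a to z, which is an assumption about how punctuation and numbers were removed.

```python
import re
from collections import Counter

from scipy import sparse


def normalize(text):
    """Whitespace-tokenize, lowercase, and strip punctuation/digits from each token."""
    tokens = text.split()                                        # split on spaces, tabs, newlines
    tokens = [re.sub(r"[^a-z]", "", t.lower()) for t in tokens]  # case fold, keep letters only
    return [t for t in tokens if t]                              # drop tokens emptied by cleaning


def build_dtm(docs, vocab):
    """Build a sparse D*V document term matrix of counts over a fixed vocabulary."""
    index = {w: j for j, w in enumerate(vocab)}
    rows, cols, vals = [], [], []
    for d, doc in enumerate(docs):
        counts = Counter(t for t in normalize(doc) if t in index)
        for w, c in counts.items():
            rows.append(d)
            cols.append(index[w])
            vals.append(c)
    return sparse.coo_matrix((vals, (rows, cols)), shape=(len(docs), len(vocab))).tocsr()
```

For instance, `build_dtm(["Cough, cough and FEVER 38.5!", "No fever; knee pain."], ["cough", "fever", "pain"])` yields a 2*3 matrix in which out-of-vocabulary tokens such as "and" or "knee" are ignored.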
Nonnegative matrix factorization
We estimated NMF topic models using the class sklearn.decomposition.NMF() from sklearn version 0.24.2 in 64-bit Python version 3.6. We varied model complexity (K = {5,10,25,50,75,100,150,200,250} topics) and investigated which models are selected as optimal using different metrics of model quality. We did not apply any regularization to latent parameter matrices. We randomly initialized parameter matrices. We estimated parameters using a gradient descent method on an L2/Frobenius-norm loss function, and we employed a loss function convergence tolerance of 1e-5 for terminating iterative update processes.
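A fit along these lines might look like the sketch below. Random initialization, the Frobenius loss, the 1e-5 tolerance and the absence of regularization follow the description above; the solver choice (`solver="cd"`, scikit-learn's default coordinate-descent solver), the iteration cap, the random seed, and the helper names are assumptions rather than the study's settings.

```python
import numpy as np
from sklearn.decomposition import NMF


def fit_nmf(X, K, seed=0):
    """Fit an unregularized NMF topic model with K topics to a DTM X."""
    model = NMF(
        n_components=K,
        init="random",            # random initialization of theta and phi
        solver="cd",              # assumption: scikit-learn's default solver
        beta_loss="frobenius",    # L2/Frobenius-norm loss
        tol=1e-5,                 # convergence tolerance
        max_iter=500,             # assumption: iteration cap
        random_state=seed,
    )
    theta = model.fit_transform(X)    # (D, K) per-document topic weights
    phi = model.components_           # (K, V) per-topic word weights
    return theta, phi, model.reconstruction_err_


def top_words(phi, P=5):
    """Column indices of the P highest-loading words for each topic."""
    return np.argsort(-phi, axis=1)[:, :P]
```

Looping `fit_nmf` over the grid K = 5, 10, 25, ..., 250 produces the set of candidate models to which the quality indices are then applied.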
Experiments comparing the quality of NMF topic models on the Utopian clinical note corpus

A) Monte Carlo cross-validation using a reconstruction error metric: \({\Vert X - \sum_{k=1}^{K}{\theta }_{dk}{\phi }_{kv}\Vert }_{2}\)
B) (Average) bootstrap stability using UCI/UMASS topic coherence metrics (over \(\phi\))
C) (Pairwise average) bootstrap stability using an RBO metric (over \(\phi\))
D) (Pairwise average) bootstrap stability using an RBO metric (over \(\theta\))
E) (Pairwise average) bootstrap stability using Kendall’s tau metric (over \(\phi\))
F) (Pairwise average) bootstrap stability using Kendall’s tau metric (over \(\theta\))
G) (Average) bootstrap stability using PC/PE fuzzy clustering coefficients (over \(\theta\))
H) (Average) bootstrap stability using the XB fuzzy clustering metric (over \(\theta\) and \(\phi\))
Research ethics
This study received ethics approval from North York General Hospital Research Ethics Board (REB ID: NYGH #20–0014).
Results
Description of corpus
The study corpus/sample consists of 382,666 primary care progress notes from 44,828 patients, 54 physicians, and 12 clinics collected 01/01/2017 through 31/12/2020 from Toronto, Canada.
NMF Per-topic word/token distribution
We summarize the corpus using the k = 1…K rows of the matrix \(\phi .\) We report on the top-5 words loading most highly on each topical vector (Table 2). Next to each word/token in Table 2, we additionally display its probability of occurrence under topic k = 1…50. These topical vectors provide a low-dimensional thematic summarization of the clinical text dataset. We observe interesting topics corresponding to many major thematic areas of primary care, including: acute health conditions (e.g. COVID-19 and other respiratory conditions such as cough, flu, and colds), chronic physical health conditions (e.g. heart disease, cancer, arthritis and other musculoskeletal issues), mental health conditions (e.g. anxiety, depression and sleep issues), preventative health/screening (e.g. pap smears, flu shots, diet/exercise) and social/familial dynamics.
Human judgement validation was used to determine the complexity of the model below (i.e. K = 50 latent topical bases). On inspection, we found that a model with approximately K = 50 latent topical dimensions resulted in a parsimonious summary of the primary care clinical text corpus. Models with far fewer topics often resulted in an incomplete summarization of primary care topics, and/or resulted in distinct primary care concepts being grouped under a single topical construct. Conversely, models with a far greater number of topics were often more time consuming to interpret, and resulted in semantically similar topics being redundantly described. In the results subsections which follow, we compare the topic model complexity identified via human judgement evaluation with that identified using quantitative topic model quality indices.
NMF Per-document topic distribution
We inspected the top-5 documents loading most highly on each of the k = 1…K columns of the latent matrix \(\theta .\) Excerpts of the most-relevant documents under each topical query provide complementary evidence that the learned latent basis effectively summarizes the corpus, and further can be used as a tool to facilitate document retrieval and clustering. We observed that documents loading most highly on a topical vector of \(\theta\) are semantically related to the corresponding word/tokens used to describe the topic (Table 2). Top-5 most probable documents under a particular topic are not displayed because clinical text excerpts may contain sensitive information and/or other protected health information.
NMF Topic model validity indices
In the subsections below we apply several topic model quality indices to the primary care clinical note corpus. We demonstrate how different topic model quality indices highlight different aspects of model goodness of fit, stability, and robustness.
Monte Carlo cross-validation using a (predictive) reconstruction error metric
In this subsection we use a Monte Carlo cross-validation methodology for comparatively evaluating topic model quality, across NMF models of complexity K = (5, 25, 50, 75, 100, 150, 200, 250). For each model complexity parameter (K) we assess mean reconstruction error on training and held-out test samples, using a five-fold Monte Carlo cross-validation scheme. The results suggest that larger (more complex) NMF models provide a better fit to the primary care corpus (Fig. 1).
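The resampling scheme can be sketched roughly as follows, assuming scikit-learn and NumPy are available. The 80/20 split fraction, the per-entry mean squared error normalisation, the synthetic matrix and the helper name mc_cv_error are all assumptions of this sketch and are not taken from the study. Held-out documents are projected onto the topics learned from the training documents, so \(\phi\) stays fixed when test error is computed.

```python
import numpy as np
from sklearn.decomposition import NMF

def mc_cv_error(X, k, n_splits=5, test_frac=0.2, seed=0):
    """Mean train/test reconstruction error over random document splits."""
    rng = np.random.default_rng(seed)
    n_docs = X.shape[0]
    n_test = max(1, int(round(test_frac * n_docs)))
    train_err, test_err = [], []
    for _ in range(n_splits):
        perm = rng.permutation(n_docs)
        test, train = perm[:n_test], perm[n_test:]
        model = NMF(n_components=k, init="random", tol=1e-5, max_iter=500,
                    random_state=int(rng.integers(1_000_000)))
        theta_train = model.fit_transform(X[train])   # learn theta and phi
        theta_test = model.transform(X[test])          # phi fixed, solve for theta
        phi = model.components_
        train_err.append(np.mean((X[train] - theta_train @ phi) ** 2))
        test_err.append(np.mean((X[test] - theta_test @ phi) ** 2))
    return float(np.mean(train_err)), float(np.mean(test_err))

# Synthetic stand-in for the document-term matrix (not the study data).
X = np.random.default_rng(1).poisson(0.3, size=(120, 30)).astype(float)
results = {k: mc_cv_error(X, k) for k in (2, 5, 10)}
```

Plotting the two error curves against K gives the kind of comparison described for Fig. 1; on real data one would also check that each resampled training matrix has no empty rows or columns.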
(Average) bootstrap stability using UCI/UMASS topic coherence metrics (over phi)
In this subsection we use an average bootstrap stability analysis methodology employing a topic coherence metric for assessing model quality over NMF complexity parameters K = (5, 10, 25, 50, 75, 100). For a given NMF model of complexity K, we average the coherence scores of the k = 1…K topical vectors (\({\phi }_{k})\), resulting in a single measure of model coherence. We further compare model-based topical coherence scores across the five separate bootstrap replicate samples in the stability analysis, and observe higher scores at larger values of model complexity (implying more coherent topics) (Fig. 2).
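For concreteness, a UMass-style coherence score in the sense of Mimno et al. can be sketched in plain Python as below. The function names and the toy documents are hypothetical, only the UMass variant is shown (the UCI variant relies on pointwise mutual information instead), and the sketch assumes every top word occurs in at least one document so that the denominator is never zero.

```python
import math

def umass_coherence(top_words, docs):
    """UMass coherence of one topic; docs is a list of token sets."""
    def doc_freq(*words):
        return sum(all(w in d for w in words) for d in docs)
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            # log of smoothed co-document frequency over document frequency
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1)
                              / doc_freq(top_words[l]))
    return score

def model_coherence(topics, docs):
    """Average the per-topic coherences into one score for the model."""
    return sum(umass_coherence(t, docs) for t in topics) / len(topics)

# Toy corpus for illustration only.
docs = [{"cough", "fever"}, {"cough", "flu"}, {"fever", "flu"},
        {"cough", "fever", "flu"}, {"knee"}]
score = model_coherence([["cough", "fever"], ["cough", "knee"]], docs)
```

Scores closer to zero indicate top words that tend to co-occur in the same documents, which is why higher averages are read as more coherent topics.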
(Pairwise average) bootstrap stability using a RBO metric (over phi)
In this subsection we use an average bootstrap stability analysis methodology employing a rank biased overlap metric for assessing model quality over NMF complexity parameters K = (5, 10, 25, 50, 75, 100). Using a five-fold stability analysis there exist \(\binom{5}{2} = 10\) pairwise model comparisons (over aligned NMF topical matrices \(\phi ).\) We average the pairwise RBO scores from the five-fold stability analysis process and comparatively evaluate model quality over complexity K = (5, 10, 25, 50, 75, 100). Using RBO we favor smaller models (K = {5,10}) (Fig. 3).
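The pairwise averaging can be sketched as follows. The code uses the extrapolated form of rank biased overlap from Webber et al. for equal-depth lists without repeated items; the persistence parameter p = 0.9, the function names and the assumption that topics have already been aligned across replicates are choices made for this sketch, not details confirmed by the study.

```python
from itertools import combinations

def rbo_ext(s, t, p=0.9):
    """Extrapolated RBO of two ranked lists, evaluated to their common depth."""
    k = min(len(s), len(t))
    seen_s, seen_t = set(), set()
    overlap, total, agreement = 0, 0.0, 0.0
    for d in range(1, k + 1):
        a, b = s[d - 1], t[d - 1]
        if a == b:
            overlap += 1
        else:
            overlap += (a in seen_t) + (b in seen_s)
        seen_s.add(a)
        seen_t.add(b)
        agreement = overlap / d          # proportion of overlap at depth d
        total += agreement * p ** d
    return agreement * p ** k + (1 - p) / p * total

def mean_pairwise_rbo(replicates, p=0.9):
    """replicates[r][k]: ranked top-word list of aligned topic k in replicate r."""
    scores = [rbo_ext(a[k], b[k], p)
              for a, b in combinations(replicates, 2)
              for k in range(len(a))]
    return sum(scores) / len(scores)

# Five identical toy replicates give 10 pairs and perfect agreement.
replicates = [[["cough", "fever", "flu"]] for _ in range(5)]
stability = mean_pairwise_rbo(replicates)
```

Because the weights decay geometrically with depth, disagreement among the top-ranked words of a topic lowers the score much more than disagreement further down the list.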
(Pairwise average) bootstrap stability using a RBO metric (over theta)
In this subsection we use an average bootstrap stability analysis methodology employing a rank biased overlap metric for assessing model quality over NMF complexity parameters K = (5, 10, 25, 50, 75, 100). Using a five-fold stability analysis there exist \(\binom{5}{2} = 10\) pairwise model comparisons (over aligned NMF per-document topic-prevalence matrices \(\theta ).\) We average the pairwise RBO scores from the five-fold stability analysis process and comparatively evaluate model quality over complexity K = (5, 10, 25, 50, 75, 100). Using RBO we favor smaller models (K = {5,10}) (Fig. 4).
(Pairwise average) bootstrap stability using weighted Kendall’s tau metric (over phi)
In this subsection we use an average bootstrap stability analysis methodology employing a weighted Kendall’s tau metric for assessing model quality over NMF complexity parameters K = (5, 10, 25, 50, 75, 100). Using a five-fold stability analysis there exist \(\binom{5}{2} = 10\) pairwise model comparisons (over aligned NMF topical matrices \(\phi ).\) We average the pairwise Kendall weighted tau statistics from the five-fold stability analysis process and comparatively evaluate model quality over complexity K = (5, 10, 25, 50, 75, 100). Using Kendall’s weighted tau we favor smaller models (K = {5,10}) (Fig. 5).
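One simple way to compute a weighted Kendall's tau between two aligned topical vectors is sketched below. The additive hyperbolic weighting (each item weighted by 1/(rank + 1), with ranks taken from the first vector) is an assumption of this sketch, since the weighting scheme is not stated in the text. Tied pairs contribute zero to the numerator, so results may differ from library implementations such as scipy.stats.weightedtau when ties are present.

```python
def weighted_tau(x, y):
    """Weighted Kendall's tau with additive hyperbolic weights (assumed scheme)."""
    n = len(x)
    order = sorted(range(n), key=lambda i: -x[i])   # indices by descending x
    rank = {i: r for r, i in enumerate(order)}      # zero-based rank under x
    num = den = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            w = 1.0 / (rank[i] + 1) + 1.0 / (rank[j] + 1)
            sign_x = (x[i] > x[j]) - (x[i] < x[j])
            sign_y = (y[i] > y[j]) - (y[i] < y[j])
            num += w * sign_x * sign_y   # concordant pairs add, discordant subtract
            den += w
    return num / den

# Toy word weights for one aligned topic in two bootstrap replicates.
tau = weighted_tau([3.0, 2.0, 1.0], [3.0, 1.0, 2.0])
```

The quadratic loop is adequate for a vocabulary of a few thousand tokens; the pairwise average across replicates then follows the same pattern as for any other pairwise stability score.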
(Pairwise average) bootstrap stability using weighted Kendall’s tau metric (over theta)
In this subsection we use an average bootstrap stability analysis methodology employing a weighted Kendall’s tau metric for assessing model quality over NMF complexity parameters K = (5, 10, 25, 50, 75, 100). Using a five-fold stability analysis there exist \(\binom{5}{2} = 10\) pairwise model comparisons (over aligned NMF per-document topic-prevalence matrices \(\theta ).\) We average the pairwise Kendall weighted tau statistics from the five-fold stability analysis process and comparatively evaluate model quality over complexity K = (5, 10, 25, 50, 75, 100). Using Kendall’s weighted tau we favor smaller models (K = {5,10}) (Fig. 6).
(Average) bootstrap stability using PC/PE fuzzy clustering coefficients (over theta)
In this subsection we use an average bootstrap stability analysis methodology employing partition coefficient and partition entropy metrics for assessing model quality over NMF complexity parameters K = (5, 10, 25, 50, 75, 100). For a given NMF model of complexity K we average/compare partition coefficient/entropy scores across the five separate bootstrap replicate samples in the stability analysis (Fig. 7).
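As a sketch of these two indices, the code below treats row-normalised topic weights as fuzzy memberships. The row normalisation, the function names and the toy matrix are assumptions made for illustration, and each document is assumed to have a positive total weight. With K topics, the partition coefficient lies between 1/K and 1 (higher means crisper assignments) and the partition entropy lies between 0 and log K (lower means crisper assignments).

```python
import math

def memberships(theta):
    """Row-normalise nonnegative topic weights so each document sums to 1."""
    return [[v / sum(row) for v in row] for row in theta]

def partition_coefficient(U):
    """Mean over documents of the sum of squared memberships."""
    return sum(u * u for row in U for u in row) / len(U)

def partition_entropy(U):
    """Mean over documents of the membership entropy (0 log 0 taken as 0)."""
    return -sum(u * math.log(u) for row in U for u in row if u > 0) / len(U)

# Toy per-document topic weights: two crisp documents and two evenly mixed ones.
U = memberships([[2.0, 0.0], [0.0, 5.0], [1.0, 1.0], [3.0, 3.0]])
pc, pe = partition_coefficient(U), partition_entropy(U)
```

Averaging pc and pe over the bootstrap replicates for each K gives one curve per index for comparing model complexities.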
(Average) bootstrap stability using Xie-Beni fuzzy clustering metric (over theta and phi)
In this subsection we use an average bootstrap stability analysis methodology employing Xie-Beni fuzzy clustering metrics for assessing model quality over NMF complexity parameters K = (5, 10, 25, 50, 75, 100). For a given NMF model of complexity K we average/compare Xie-Beni scores across the five separate bootstrap replicate samples in the stability analysis. The Xie-Beni statistic appears to favor larger topic model fits on our observed corpus (Fig. 8).
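The Xie-Beni index is a ratio of within-cluster compactness to between-cluster separation, with smaller values indicating compact, well separated clusters. A minimal sketch follows, under the assumption that memberships come from row-normalised \(\theta\) and cluster centres from the rows of \(\phi\); that mapping, the fuzzifier m = 2, the function names and the toy data are illustrative choices rather than details taken from the study.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def xie_beni(X, U, centers, m=2.0):
    """Xie-Beni index: compactness / (N * minimum squared centre separation)."""
    n, k = len(X), len(centers)
    compactness = sum(U[i][c] ** m * sqdist(X[i], centers[c])
                      for i in range(n) for c in range(k))
    separation = min(sqdist(centers[a], centers[b])
                     for a in range(k) for b in range(a + 1, k))
    return compactness / (n * separation)

# Toy example: four points, two crisp clusters centred at (0, 0.5) and (4, 0.5).
X = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]]
U = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
centers = [[0.0, 0.5], [4.0, 0.5]]
xb = xie_beni(X, U, centers)
```

The index requires at least two distinct centres; two near-identical topics drive the separation term toward zero and inflate the score.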
Discussion
The major finding of this study (and one which has been observed elsewhere) is that different topical model quality indices do not necessarily agree on a single topic model as being “optimal” when applied to a given empirical dataset. Cross-validated reconstruction error, (averaged) topic coherence and (averaged) Xie-Beni score (a ratio of compactness vs separation) were observed to favour large models; whereas, set based agreement measures (rank biased overlap) and rank correlation measures (weighted Kendall’s tau) pairwise averaged over aligned bootstrapped datasets seem to favour small models. Few/none of the metrics were observed to guide selection towards “mid-sized” models as being optimal (which were subjectively preferred based on human/analyst judgement). Large models (K ≥ 100 topics) were observed to produce smaller residual error (we do not observe an increase in validation/test error even at K = 250 topics; noting that fitting larger models is computationally prohibitive on our dataset). Similarly, large models were observed to generate focused, interpretable, and coherent topical vectors. And finally, the Xie-Beni fuzzy clustering coefficient is suggestive that geometrically compact topics/clusters that are well-separated form at increasing model complexity. On the contrary, stability metrics based on rank correlation (e.g. weighted Kendall’s tau) and set based agreement (rank biased overlap) seem to favour models of much lower complexity (K = 5 topics). The evidence suggests that different topic model quality indices lead to different inferences regarding an optimal NMF topic model. The investigated quality indices provide different/complementary insights regarding model goodness of fit, and it likely makes sense to utilize numerous indices when evaluating fitted models to empirical datasets.
Using a human-centric approach to NMF topic model selection, where data scientists and subject matter experts attempt to select an optimal topic model fitted to the primary care clinical note corpus (after “eyeballing” fits at K = {5,10,25,50,75,100}), we prefer mid-sized models (e.g. the K = 50 topic model). The mid-sized models (qualitatively) satisfy many desirable properties of a topic model: 1) the latent topical basis provides a meaningful characterization of the document collection, facilitating an improved understanding of the large/complex primary care progress notes corpus, 2) the document topic weights seem appropriate, providing an efficient low-dimensional basis for retrieval, clustering and browsing of documents, and 3) the model explained variation is reasonable (although could be improved with a more complex model). In terms of human judgement, we find large NMF topic models (subjectively) to be overly complex. It takes a great deal of time/effort to meaningfully “eyeball” K ≥ 100 topical bases (and high-loading documents). Further, when an NMF model of excessive complexity is fit to the primary care progress note corpus we observe an over-clustering effect where many of the learned focused/specific topics appear redundant (and by Ockham’s razor, this may suggest a more parsimonious model is attainable and perhaps preferable). On the contrary, we find that the NMF topic models of low complexity (K = {5,10}) which are preferred by the rank correlation metrics and the set based agreement metrics are not expressive enough to adequately summarize a complex clinical text corpus (subjectively, the space of primary care medicine is about more than 5–10 topics/themes).
Limitations and future work
We focused on several popular quality indices applicable to the evaluation of NMF topic models; however, our review is necessarily incomplete (as metrics are disparately studied over a vast number of scientific disciplines) and our evaluation focuses on a single large/complex biomedical text corpus. Future work should attempt to further synthesize/consolidate an increasing number of topic model quality indices, and further evaluate these metrics over many real and simulated datasets. Formal systematic literature reviews and/or scoping reviews may be valuable, as they could more exhaustively identify the space of available topic model quality indices. That said, our review has focused on some of the more popular metrics from a variety of disparate fields, such as: computer science and topic modelling, matrix factorization, fuzzy clustering, and set theory.
This study focused on evaluating/selecting an optimal model complexity parameter (K) over fitted NMF topic models. Metrics presented above could also be used to investigate other NMF model hyperparameters, for example: model loss function, model functional form, regularization, initialization techniques, and model termination criteria are all relevant hyperparameters whose optimal configuration can be assessed with quality indices discussed in this study. Incorporation of the aforementioned topic model quality indices, in a formal hyperparameter optimization framework, may help to guide the analyst towards an optimal hyperparameter configuration for a topic model fitted to a particular empirical dataset [46].
This study has focused on combining appropriate topic model quality metrics with computational resampling methods (e.g. cross-validation or bootstrapping) for assessing NMF topic model goodness of fit. The evaluation pipeline is computationally expensive: Monte Carlo cross-validation requires specific checks on the validity of returned DTMs and stability analysis may require application of expensive matrix alignment methods. Data augmentation was not thoroughly explored in this study; however, it may represent an interesting and computationally affordable approach to topic model evaluation.
This study focused on quality indices for evaluating aspects of NMF topic models. We did not compare NMF topic model fits against alternative topic modelling frameworks, for example: Bayesian probabilistic graphical models (e.g. latent Dirichlet allocation), neural topic models (e.g. BERTopic) or tensor factorization models (e.g. the canonical polyadic decomposition, or the Tucker decomposition). Each of the aforementioned methodologies estimates a latent representation, characterizing the extent to which 1) words load on topical vectors, and 2) documents load on topical vectors. As such, many of the topic model quality indices investigated in this study could be used to evaluate topic models generated using alternative approaches to statistical estimation.
In this study we adopted a hybrid approach to text processing and vocabulary construction. First, we performed an initial computational tokenization pass over the corpus; next, we reviewed the returned list of tokens and a human determined which ones to include in the final vocabulary (focusing particularly on lexical entities relevant to primary healthcare). The number of unique (and justifiable) approaches to text processing is essentially uncountably large. This study did not investigate alternative text processing pipelines, and their impact on topic model quality. For example, we did not consider using stemmers/lemmatizers; nor did we attempt to group semantically similar lexical variants post-tokenization. Further research should continue to investigate the impact of text processing pipelines on vocabulary specification in vector space models, and in particular topic models.
Conclusions
In this study we reviewed and comparatively evaluated several topic model quality indices. Oftentimes an eyeballing approach is used in topic model selection/evaluation, whereby subject matter experts and data scientists iteratively review learned topic models and subjectively determine an appropriate fitting model for the corpus at hand – the approach is often criticized as lacking empirical rigor, and advocates often suggest employing one of potentially many available topic model quality indices for guiding model selection. This study illustrates some challenges associated with the latter line of thought, namely, a large host of defensible topic model quality indices exist, and the choice of an optimal model appears metric dependent (i.e. different quality metrics guide the analyst toward fundamentally different NMF topic models). This finding does not invalidate quantitative topic model quality indices, rather it suggests that different metrics highlight different aspects of model goodness of fit. Further, human in the loop approaches to topic model selection/evaluation are likely still required where different models (under different hyperparameter configurations) are fitted to empirical datasets, and evaluated using a combination of human judgment in addition to different quality indices. Both quantitative topic model quality indices, and human judgement evaluation, are crucially important when interpreting unsupervised machine learning models.
Availability of data and materials
Data from this study are held by the University of Toronto Practice-Based Research Network. Ethics approval for this study does not allow access to patient level data outside of the trusted research environment in which it is held. Researchers can apply for access to the data, subject to approval by appropriate research ethics boards, and approval of the UTOPIAN scientific advisory committee.
References
Gentzkow M, Kelly B, Taddy M. Text as Data. Journal of Economic Literature. 2019;57:535–74.
Deerwester S, Dumais S, Furnas G, et al. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science. 1990;41:391–408.
Berry M, Dumais S, O’Brien G. Using Linear Algebra for Intelligent Information Retrieval. SIAM Rev. 1995;37:573–95.
Landauer T, Dumais S. A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Reviews. 1997;104:211–40.
Lee D, Seung S. Learning the Parts of an Object by Non-Negative Matrix Factorization. Nature. 1999;401:788–91.
Lee D, Seung HS. Algorithms for nonnegative matrix factorization. Advances in neural information processing systems. 2000;13.
Xu W, Liu X, Gong Y. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval; 2003. pp. 267–73.
Hofmann T. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval; 1999. pp. 50–7.
Blei D, Ng A, Jordan M. Latent Dirichlet Allocation. J Mach Learn Res. 2003;3:993–1022.
Griffiths M. Gibbs Sampling in the Generative Model of Latent Dirichlet Allocation. Technical Report: Stanford University; 2002.
Griffiths, T, Steyvers, M. Probabilistic Topic Models. In Handbook of Latent Semantic Analysis. Chapter 21. 2007.
Blei D. Probabilistic Topic Models. Commun ACM. 2012;55:77–84.
Udell M, Horn C, Zadeh R, et al. Generalized Low Rank Models. Foundations and Trends in Machine Learning. 2016;9:1–118.
Paatero P, Tapper U. Positive Matrix Factorization: A Non-Negative Factor Model with Optimal Utilization of Error Estimates of Data Values. Environmetrics. 1994;5:111–26.
Berry M, Browne M, Langville A, et al. Algorithms and Applications for Approximate Non-Negative Matrix Factorization. Comput Stat Data Anal. 2007;52:155–73.
Chang J, Gerrish S, Wang C, Boyd-Graber J, Blei D. Reading tea leaves: How humans interpret topic models. Advances in neural information processing systems. 2009;22.
Doogan C, Buntine W. Topic model or topic twaddle? Re-evaluating semantic interpretability measures. In Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies; 2021. pp. 3824–48.
Greene D, Cunningham P, Mayer R. Unsupervised learning and clustering. Machine learning techniques for multimedia: Case studies on organization and retrieval. 2008:51–90.
Palacio-Niño J, Berzal F. Evaluation Metrics for Unsupervised Learning Algorithms. arXiv. 2019; 1–9. URL: https://arxiv.org/pdf/1905.05667.pdf
Matthews P. Human-In-The-Loop Topic Modelling: Assessing topic labelling and genre-topic relations with a movie plot summary corpus. In The Human Position in an Artificial World: Creativity, Ethics and AI in Knowledge Organization. Ergon-Verlag; 2019. pp. 181–207.
Krippendorff K. Content Analysis: An Introduction to its Methodology. 2nd ed. Thousand Oaks, California: Sage Publications; 2008.
Hastie T, Tibshirani R, Friedman JH. The elements of statistical learning: data mining, inference, and prediction. New York: Springer; 2009. Vol. 2, pp. 1–758.
Wold S. Cross-Validatory Estimation of the Number of Components in Factor Analysis and Principal Component Analysis. Technometrics. 1978;20:387–405.
Owen A, Perry P. Bi-Cross-Validation of the SVD and Non-Negative Matrix Factorization. Annals of Applied Statistics. 2009;3:564–94.
Bro R, Kjeldahl K, Smilde A, et al. Cross-Validation of Component Models: A Critical Look at Current Methods. Analytical and Bioanalytical Chemistry. 2008;390:1241–51.
Greene D, O’Callaghan D, Cunningham P. How many topics? Stability analysis for topic models. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2014, Nancy, France, September 15-19, 2014. Proceedings, Part I 14. Springer Berlin Heidelberg; 2014. pp. 498–513.
Lange T, Roth V, Braun M, et al. Stability Based Validation of Clustering Solutions. Neural Comput. 2004;16:1299–323.
AlSumait L, Barbará D, Gentle J, Domeniconi C. Topic significance ranking of LDA generative models. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2009, Bled, Slovenia, September 7-11, 2009, Proceedings, Part I 20. Springer Berlin Heidelberg; 2009. pp. 67–82.
Newman D, Lau JH, Grieser K, Baldwin T. Automatic evaluation of topic coherence. In Human language technologies: The 2010 annual conference of the North American chapter of the association for computational linguistics; 2010. pp. 100–8.
Mimno D, Wallach H, Talley E, Leenders M, McCallum A. Optimizing semantic coherence in topic models. In Proceedings of the 2011 conference on empirical methods in natural language processing; 2011. pp. 262–72.
Röder M, Both A, Hinneburg A. Exploring the space of topic coherence measures. In Proceedings of the eighth ACM international conference on Web search and data mining; 2015. pp. 399–408.
Fagin R, Kumar R, Sivakumar D. Comparing Top-K Lists. SIAM Journal on Discrete Mathematics. 2003;17:134–60.
Webber W, Moffat A, Zobel J. A Similarity Measure for Indefinite Rankings. ACM Transactions on Information Systems. 2010;28:1–34.
Hurley J, Cattell R. Producing Direct Rotation to Test a Hypothesized Factor Solution. Behavioural Science. 1962;7:258–62.
Kuhn H. The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly. 1955;2:83–97.
Kendall M. A New Measure of Rank Correlation. Biometrika. 1938;30:81–9.
Airoldi E, Blei D, Erosheva E, et al. Handbook of Mixed Membership Models and Their Applications. Boca Raton, Florida: Chapman and Hall Press; 2015.
Wang W, Zhang Y. On Fuzzy Clustering Validity Indices. Fuzzy Sets Syst. 2007;158:2095–117.
Bezdek J. Cluster Validity with Fuzzy Sets. Cybernetics. 1974;3:58–73.
Dave R. Validating Fuzzy Partitions Obtained Through C-Shells Partitions. Pattern Recogn Lett. 1996;17:613–23.
Xie X, Beni G. A Validity Measure for Fuzzy Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1991;13:841–7.
Pal N, Bezdek J. On Cluster Validity for the Fuzzy C-Means Model. IEEE Trans Fuzzy Syst. 1995;3:370–9.
Garies S, Birtwhistle R, Drummond N, Queenan J, Williamson T. Data resource profile: national electronic medical record data from the Canadian Primary Care Sentinel Surveillance Network (CPCSSN). Int J Epidemiol. 2017;46(4):1091–2f.
Webster JJ, Kit C. Tokenization as the initial phase in NLP. In COLING 1992 volume 4: The 14th international conference on computational linguistics; 1992.
Díaz NPC, López MJM. An analysis of biomedical tokenization: problems and strategies. In Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis; 2015. pp. 40–9.
Hutter F, Kotthoff L, Vanschoren J. Automated Machine Learning. Springer; 2019.
Acknowledgements
Not applicable.
Funding
The study was supported by funding provided by a Foundation Grant (FDN 143303) from the Canadian Institutes of Health Research (CIHR). The funding agency had no role in the study design, the collection, analysis, or interpretation of data, the writing of the report, or the decision to submit the report for publication.
Contributions
CM: Writing – original draft, Study Conceptualization, Methodology, Data curation, Programming, Formal analysis. TS: Writing – review & editing, Supervision, Methodology. PA: Writing – review & editing, Supervision, Methodology. RM: Writing – review & editing, Supervision, Methodology. MG: Writing – review & editing, Supervision, Methodology, Investigation. ME: Writing – review & editing, Supervision, Methodology.
Ethics declarations
Ethics approval and consent to participate
This study received ethics approval from North York General Hospital Research Ethics Board (REB ID: NYGH #20–0014). All participating primary care physicians provided written informed consent for the collection and analysis of their electronic medical record data at UTOPIAN; patients rostered to a particular primary care physician could opt out of providing their data to UTOPIAN if they so chose. This model of consent was approved by the REB and is consistent with Ontario's privacy legislation (PHIPA Sect. 44).
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Supplementary Information
Additional file 1.
Appendix A and B.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Meaney, C., Stukel, T.A., Austin, P.C. et al. Quality indices for topic model selection and evaluation: a literature review and case study. BMC Med Inform Decis Mak 23, 132 (2023). https://doi.org/10.1186/s12911-023-02216-1
DOI: https://doi.org/10.1186/s12911-023-02216-1
Keywords
 Nonnegative matrix factorization
 Topic model
 Internal validation
 Cross-validation
 Stability analysis
 Clinical text data
 Electronic medical record