 Research article
 Open Access
 Open Peer Review
Optimum binary cutoff threshold of a diagnostic test: comparison of different methods using Monte Carlo technique
BMC Medical Informatics and Decision Making volume 14, Article number: 99 (2014)
Abstract
Background
Using Monte Carlo simulations, we compare different methods (maximizing Youden index, maximizing mutual information, and logistic regression) for their ability to determine optimum binary cutoff thresholds for a ratio-scaled diagnostic test variable. Special attention is given to how the stability and precision of the results depend on the distributional characteristics as well as on the pretest probabilities of the diagnostic categories in the test population.
Methods
Fictitious data sets of a ratio-scaled diagnostic test with different distributional characteristics are generated for 50, 100 and 200 fictitious "individuals" with systematic variation of pretest probabilities of two diagnostic categories. For each data set, optimum binary cutoff limits are determined employing different methods. Based on these optimum cutoff thresholds, sensitivities and specificities are calculated for the respective data sets. Mean values and SD of these variables are computed for 1000 repetitions each.
Results
Optimizations of cutoff limits using Youden index and logistic regression-derived likelihood ratio functions with correct adaptation for pretest probabilities both yield reasonably stable results, being nearly independent of the pretest probabilities actually used. Maximizing mutual information yields cutoff levels decreasing with increasing pretest probability of disease. The most precise results (in terms of the smallest SD) are usually seen for the likelihood ratio method. With this parametric method, however, cutoff values show a significant positive bias and, hence, specificities are usually slightly higher, and sensitivities are consequently slightly lower than with the two nonparametric methods.
Conclusions
In terms of stability and bias, the Youden index is best suited for determining optimal cutoff limits of a diagnostic variable. The results of the Youden method and the likelihood ratio method are surprisingly insensitive to distributional differences as well as to the pretest probabilities of the two diagnostic categories. As an additional bonus of the parametric procedure, transfer of the likelihood ratio functions, obtained from logistic regression analysis, to other diagnostic scenarios with different pretest probabilities is straightforward.
Background
Evaluation of diagnostic tests is an important issue in medical disciplines. Best known is the analysis of simple diagnostic test situations which can be represented by means of a 2 × 2 contingency table: one dimension of such a table is defined by two diagnostic categories (e.g., "nondiseased" versus "diseased"), and the second dimension represents the dichotomous test result (e.g., "normal" versus "pathological"). In keeping with its importance, there is a large literature on the subject. A recent series of review articles presents an excellent overview covering all relevant theoretical and practical aspects of the subject [1]-[4].
An interesting way of evaluating diagnostic tests is provided by information theory [5]-[9], and an alternative elegant way of dealing with multiple, variably scaled diagnostic variables was suggested in 1982 by Albert [10]: he demonstrated that logistic regression analysis can be employed to compute likelihood ratio functions which, in analogy to the well-known likelihood ratio obtained from a simple 2 × 2 contingency table, are useful to compute posttest probability functions of the diagnostic categories investigated. A critical step when applying logistic regression results for the computation of likelihood ratio functions is a correction according to the pretest probabilities of the diagnostic categories actually used for the regression procedure [10]. A combination of Albert's findings with a generalization of the computation of posttest probabilities for more than two diagnostic categories [11],[12] was demonstrated [13].
In an attempt to (1) direct new awareness to Albert's time-honoured but nevertheless most relevant results regarding the use of logistic regression analysis in clinical chemistry, and to (2) compare logistic regression analysis with other methods for dividing patients into those with low versus those with high risk of being "diseased", we here present the results of Monte Carlo simulation studies. Specifically, for a diagnostic dilemma ("diseased" versus "nondiseased") we simulate data sets for a fictitious diagnostic variable x with different prespecified distributional characteristics for the two diagnostic categories. Then, we search for the optimum cutoff threshold of x using the following methods:

maximizing the mutual information of the respective 2 × 2 contingency table obtained by systematically varying a binary cutoff threshold for the diagnostic variable x

maximizing the Youden index (Youden index = sensitivity + specificity − 1) by systematically varying a binary cutoff threshold for the diagnostic variable x

performing a logistic regression analysis on the problem and searching the value of the diagnostic variable x for which the logistic regression-derived likelihood ratio (LR) function, properly corrected for the pretest probabilities of the diagnostic categories used for the regression procedure, attains unity (i.e., the test value at which the posttest probabilities of the diagnostic categories equal the pretest probabilities).
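The two nonparametric criteria listed above can be written down directly for a single 2 × 2 contingency table. The following is a minimal Python sketch, not the authors' MATHEMATICA code; the function names and the cell labels (tp, fn, fp, tn) are chosen here for illustration, and mutual information is expressed in bits, which is an assumption about the unit.

```python
import math

def youden_index(tp, fn, fp, tn):
    """Youden index J = sensitivity + specificity - 1 for a 2 x 2 table."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0

def mutual_information(tp, fn, fp, tn):
    """Mutual information (bits) between diagnosis and dichotomous test result."""
    n = tp + fn + fp + tn
    # Each entry: joint count, its diagnosis (row) total, its test-result (column) total.
    cells = [
        (tp, tp + fn, tp + fp),
        (fn, tp + fn, fn + tn),
        (fp, fp + tn, tp + fp),
        (tn, fp + tn, fn + tn),
    ]
    mi = 0.0
    for joint, row_total, col_total in cells:
        if joint > 0:  # empty cells contribute nothing (0 * log 0 is taken as 0)
            mi += (joint / n) * math.log2(joint * n / (row_total * col_total))
    return mi
```

A perfectly separating test on a balanced sample gives a Youden index of 1 and one bit of mutual information, while a test unrelated to the diagnosis gives 0 for both.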
Major results of the simulations investigated are, for each of these three statistical procedures, the respective optimum cutoff values as well as their associated sensitivities and specificities. Besides mean values of these quantities of central interest, important "by-products" of the Monte Carlo approach are their SD observed over the repetitive computer experiments.
We perform such calculations for four scenarios using different distributional characteristics underlying the computer-generated test data. Besides employing different sample sizes, we systematically vary the pretest probability of disease [P(D)], the most important additional control variable, over a wide range (from 0.10 to 0.90).
With these Monte Carlo simulation experiments we attempt to answer the following research questions:

How well are the two nonparametric methods (maximizing mutual information, maximizing Youden index) and the parametric method (LR technique based on logistic regression analysis) suited for determining optimum binary cutoff levels of a ratio-scaled diagnostic test, given different distributional characteristics of test data, and how well do the results of the three methods agree with the theoretical crossing points of the distribution functions underlying the two diagnostic categories?

How do total numbers of test data and their composition in terms of pretest probabilities of the two diagnostic categories influence the results?

Which of the techniques yields the most precise estimates in terms of the resulting SD values of the Monte Carlo simulation runs?
Methods
All computations are done using the commercially available computer software MATHEMATICA, version 9, by Wolfram Research, Inc., Champaign, IL, USA.
First, for the categories "no disease" and "disease", according to the P(D) chosen, fictitious patient data sets are generated using the random number generator of MATHEMATICA in combination with one out of many possible distribution functions: thus, for both diagnostic categories, fictitious data of a ratio-scaled diagnostic variable are generated following the chosen distribution functions. We choose data sets with total numbers of 50, 100 and 200 fictitious "individuals", and we assume pretest probabilities of category "disease" [P(D)] increasing from 0.10 to 0.90 in steps of 0.10. We simulate four different diagnostic scenarios:

Scenario 1: The lognormal distribution with mean value 2.0 and standard deviation 0.4 is assumed for the "healthy" category, and with mean value 2.5 and standard deviation 0.3 for the "diseased" category.

Scenario 2: The chi-square distribution with 7 degrees of freedom is assumed for the "healthy" category, and with 10 degrees of freedom for the "diseased" category.

Scenario 3: The inverse gamma distribution with shape parameter 6.0 is assumed for the "healthy" category and 3.0 for the "diseased" category. The scale parameter is set to 20.0 for both categories.

Scenario 4: The chi-square distribution with 6 degrees of freedom is assumed for the "healthy" category; for the "diseased" category, the Weibull distribution is chosen with shape parameter 10.0 and scale parameter 20.0.
Using the MATHEMATICA function FindRoot, we obtain the following crossing points for the distribution functions of the two diagnostic categories: Scenario 1, x = 9.20041; Scenario 2, x = 7.47228; Scenario 3, x = 5.10873; and Scenario 4, x = 13.4333. Differences from these values define the bias of the actually detected mean cutoff levels.
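As an illustration of how such a crossing point can be located, the following Python sketch uses plain bisection instead of MATHEMATICA's FindRoot. It rests on two assumptions: that "distribution functions" refers to the probability densities, and that the stated lognormal parameters (2.0, 0.4 and 2.5, 0.3) are the mean and standard deviation of the underlying normal distribution, as in MATHEMATICA's LogNormalDistribution. The helper names and the search bracket are chosen here for illustration.

```python
import math

def lognormal_pdf(x, mu, sigma):
    """Density of a log-normal variable whose logarithm is N(mu, sigma^2)."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

def crossing_point(pdf_a, pdf_b, lo, hi, tol=1e-10):
    """Bisection root of pdf_a(x) - pdf_b(x) on a bracket [lo, hi] with a sign change."""
    def diff(x):
        return pdf_a(x) - pdf_b(x)
    f_lo = diff(lo)
    if f_lo * diff(hi) > 0:
        raise ValueError("bracket does not contain a sign change")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = diff(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

# Scenario 1: healthy ~ lognormal(2.0, 0.4), diseased ~ lognormal(2.5, 0.3).
healthy = lambda x: lognormal_pdf(x, 2.0, 0.4)
diseased = lambda x: lognormal_pdf(x, 2.5, 0.3)

# Bracket between the two medians, where the densities change order.
x_cross = crossing_point(healthy, diseased, math.exp(2.0), math.exp(2.5))
print(x_cross)
```

Under these assumptions the root should land close to the value of 9.20041 reported for Scenario 1. The two densities cross a second time far in the upper tail, which is why the bracket is restricted to the interval between the medians.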
Analyses done on each data set include:

"Empirical" determination of the cutoff value at which mutual information is maximum ("Mutual information method"): the cutoff value is systematically varied over the range of all test values by increments of 1.0, and that cutoff value is searched for which the resulting 2 × 2 contingency table produces the maximum mutual information.

Determination of the cutoff value at which Youden index is maximum. In the following, we shall designate this method as "Youden index method". The cutoff value is systematically varied over the range of all test values by increments of 1.0, and that cutoff value is searched for which the resulting 2 × 2 contingency table produces the maximum Youden index.

Logistic regression analysis and calculation of the LR function (with proper correction for P(D)). Determination of the test values for which the LR functions become equal to unity ("LR method"). Briefly, logistic regression analysis on a data set for N fictitious "individuals" yields a linear predictor function β0 + β1 x, where x is the test result. The parameters β0 and β1 denote the intercept and the slope of the linear predictor. The linear predictor must be corrected for the pretest probabilities of the diagnostic categories in order to yield the corrected linear predictor (equal to the natural logarithm of the LR function [10]):
$$\log_{\mathrm{e}}\left(\mathrm{LR}\right)={\beta}_{0}+{\beta}_{1}x-\log_{\mathrm{e}}\left[\frac{\mathrm{P}\left(\mathrm{D}\right)}{1-\mathrm{P}\left(\mathrm{D}\right)}\right]\text{,}$$the argument of the logarithm on the right side of the equation being the pretest odds.
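The correction step and the resulting cutoff can be sketched in a few lines of Python. Setting the corrected linear predictor to zero (LR = 1) gives the cutoff x = (log_e[P(D)/(1 − P(D))] − β0)/β1. The function names below are illustrative. The check case is hypothetical and not one of the paper's scenarios: for two unit-variance normal distributions with means 0 and 2, the exact log LR is 2x − 2, so a logistic fit on data with pretest probability P(D) would ideally return β1 = 2 and β0 = −2 + log_e(pretest odds).

```python
import math

def corrected_log_lr(x, beta0, beta1, p_d):
    """log_e LR(x): linear predictor minus the log pretest odds of the regression data."""
    return beta0 + beta1 * x - math.log(p_d / (1.0 - p_d))

def lr_cutoff(beta0, beta1, p_d):
    """Test value at which the corrected LR function equals 1 (log LR = 0)."""
    return (math.log(p_d / (1.0 - p_d)) - beta0) / beta1

# Hypothetical idealized coefficients for N(0, 1) versus N(2, 1); see lead-in.
for p_d in (0.1, 0.5, 0.9):
    beta0 = -2.0 + math.log(p_d / (1.0 - p_d))
    print(p_d, lr_cutoff(beta0, 2.0, p_d))
```

In this idealized case the uncorrected intercept shifts with P(D), while the corrected cutoff stays at x = 1, the crossing point of the two densities. With estimated coefficients the cancellation is only approximate.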
Thus, for each data set, according to each of these three methods, three cutoff limits as well as their associated sensitivities and specificities are computed as main results.
These analyses are repeated 1000 times in order to get not only estimates of these quantities of interest, but also "empirical" estimates for their SD values.
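The repetition scheme can be sketched as follows for the Youden index method and Scenario 1. This is a simplified Python illustration rather than the authors' notebook. It assumes that test values above the cutoff count as "pathological", that the scan starts at the largest integer not exceeding the smallest test value, and that the lognormal parameters refer to the underlying normal distribution. The function names, the seed and the defaults are illustrative.

```python
import math
import random
import statistics

def simulate_dataset(n, p_d, rng):
    """Scenario 1 sketch: log-normal test values for healthy and diseased subjects."""
    n_dis = round(n * p_d)
    healthy = [rng.lognormvariate(2.0, 0.4) for _ in range(n - n_dis)]
    diseased = [rng.lognormvariate(2.5, 0.3) for _ in range(n_dis)]
    return healthy, diseased

def youden_cutoff(healthy, diseased, step=1.0):
    """Scan cutoffs in increments of `step` and return the one maximizing the Youden index."""
    values = healthy + diseased
    c = float(math.floor(min(values)))  # assumed starting point of the scan
    hi = max(values)
    best_c, best_j = c, -1.0
    while c <= hi:
        sens = sum(v > c for v in diseased) / len(diseased)
        spec = sum(v <= c for v in healthy) / len(healthy)
        j = sens + spec - 1.0
        if j > best_j:
            best_c, best_j = c, j
        c += step
    return best_c

def monte_carlo(n=200, p_d=0.5, repetitions=1000, seed=1):
    """Mean and SD of the optimum cutoff over repeated simulated data sets."""
    rng = random.Random(seed)
    cutoffs = [youden_cutoff(*simulate_dataset(n, p_d, rng)) for _ in range(repetitions)]
    return statistics.mean(cutoffs), statistics.stdev(cutoffs)
```

Calling `monte_carlo(n=200, p_d=0.1)` and `monte_carlo(n=200, p_d=0.9)` and comparing the two means gives a rough impression of how strongly the cutoff depends on P(D) in this simplified setting.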
Additionally, the effects of P(D) on the parameters β0 and β1 as well as on the properly corrected intercept parameter and, hence, on the posttest probabilities computed therefrom under different diagnostic situations, are demonstrated for a specific example.
For convenience, we supply the MATHEMATICA documents necessary to reproduce our results: Additional file 1 (help.docx) gives a short explanation how to use the MATHEMATICA notebooks monte_carlo_SDev.nb (Additional file 2) which performs the necessary statistical calculations as well as the Monte Carlo simulation, and distributions.nb (Additional file 3) which produces graphical visualizations of the distribution functions used, and which calculates the crossing points of the two distribution functions for the "nondiseased" and the "diseased" fictitious individuals.
Results
The Monte Carlo experiments
For the Monte Carlo experiments, we use the following conditions: total numbers of fictitious "individuals" are chosen as N = 50, 100 and 200.
P(D) is varied, in steps of width 0.10, between P(D) = 0.10 and P(D) = 0.90.
At each P(D), 1000 data sets, each consisting of N = 50, 100 or 200 randomly chosen test values x, are generated according to the four distributional scenarios detailed in the Methods section. Figure 1 demonstrates the distribution functions underlying the four scenarios.
Each data set contains N test values, of which N × P(D) values are associated with "Disease", and N × [1 − P(D)] values are labelled "No disease". For each data set, the optimum cutoff threshold for x is determined by each of three different techniques (see Methods section).
Table 1 reports the ranges of the resulting mean values (and SD values) of the optimum cutoff values together with the associated sensitivities and specificities, obtained by the variation of P(D) from 0.1 to 0.9 in steps of 0.1.
For the mean cutoff values and their SD values obtained with each of the three methods, Figure 2 demonstrates for the four scenarios the dependence on P(D) as well as the deviations with respect to the theoretical crossing points of the distribution functions underlying the two diagnostic categories. (Notably, each result is based on 200 fictitious individuals and 1000 repetitions.)
Table 1 and Figure 2 reveal important and characteristic features of the results: first, while the Youden index method as well as the LR method produce cutoff levels which are remarkably stable with respect to the large variation of P(D), the Mutual information method yields, irrespective of the distributions used, monotonically decreasing cutoff levels with increasing P(D). So obviously this technique optimizes specificity of the test in the case of small P(D), and sensitivity in the case of high P(D). Second, the Mutual information method is invariably associated with the largest SD values, followed by the Youden index method; the parametric LR method shows by far the smallest variations. On the other hand, the LR method tends to produce a constant positive bias; with the exception of Scenario 4 (strongly separated distribution functions underlying the two diagnostic categories) the cutoff levels found with this method lie consistently above the theoretical crossing points of the respective distribution functions. In fact, the smallest bias is found with the Youden index method; with the Mutual information method cutoff levels are generally too high at small P(D), and too low at high P(D).
Figure 3 visualizes in more detail the results obtained for Scenario 1 (lognormal distributions with mean value 2.0 and standard deviation 0.4 for the "healthy" category, and with mean value 2.5 and standard deviation 0.3 for the "diseased" category) and 200 fictitious "individuals" (N = 200).
In accordance with Table 1 and Figure 2, the cutoff values and hence, the associated specificities, found by the LR method are usually slightly higher than those detected with the two nonparametric techniques. Consequently, the latter methods yield slightly better test sensitivities but slightly worse specificities than the LR method. As already shown in Figure 2 for the mean cutoff levels, also for sensitivities and specificities the stronger dependence of the Mutual information method on P(D) is clearly obvious from Figure 3 (panels a2 and a3). Analogously, also for the SD values of cutoff levels as well as of sensitivities and specificities the order LR method < Youden index method < Mutual information method is obtained.
Closer inspection of the SD results in Table 1 shows in addition that in accordance with expectation, the variances of the results decrease with increasing sample size N.
Dependence on P(D) of the parametric estimates obtained by the LR method
Despite the remarkable stability of the optimum cutoff thresholds as well as of sensitivities and specificities obtained by the parametric LR method over a broad range of P(D), the mean estimates of the logistic regression analyses (β0 and β1) nevertheless are somewhat dependent on P(D), and this dependence remains even after proper correction. For the example shown in Figure 3 [Scenario 1 (lognormal distributions with mean value 2.0 and standard deviation 0.4 for the "healthy" category, and with mean value 2.5 and standard deviation 0.3 for the "diseased" category) and 200 fictitious "individuals" (N = 200)], at P(D) = 0.10 the uncorrected mean intercept estimate (β0) is −5.246, and at P(D) = 0.90 it increases to −3.075. Hence, the mean corrected intercept estimate decreases from −3.049 to −5.273 between these limits; and the mean slope estimate (β1) increases from 0.303 to 0.535. So in fact, the corrected linear predictor functions (as well as the LR functions) also change somewhat according to P(D). How strongly do these dependencies influence the estimated posttest probabilities of disease? To answer this question, the mean values of the Monte Carlo estimates of the logistic regression analyses at P(D) = 0.10, 0.50 and 0.90 were used to calculate three respective corrected LR functions. With each of these three functions, then, according to the fact that the post-odds can be computed by multiplication of the pre-odds by the LR $\left[\frac{\mathrm{P}\left(\mathrm{D}|x\right)}{1-\mathrm{P}\left(\mathrm{D}|x\right)}=\frac{\mathrm{P}\left(\mathrm{D}\right)}{1-\mathrm{P}\left(\mathrm{D}\right)}\times \mathrm{LR}\left(x\right)\right]$, we compute the posttest probabilities P(D|x) (conditional probabilities for disease given test result x) as functions of test value x, again for the three values P(D) = 0.10, 0.50 and 0.90.
Figure 4 shows the resulting curves of the posttest probabilities. Notably, if the corrected estimates of the logistic regression analyses were independent of the P(D) of the data set employed, the three LR functions would coincide, and we would finally obtain only three different and parallel sigmoid curves (one for each P(D) used for the second step of this computation). However, as the estimated slope parameters (β1) increase with increasing P(D) in the data sets used for the logistic regression analyses, the sigmoidal posttest probability curves are steeper with respect to variation of test value x when we employ the estimates of logistic regression analyses obtained at higher P(D). Interestingly, each set of three curves obtained with the three different sets of logistic regression estimates for a specified P(D) in the second calculation step (in Figure 4 these are the curves with the same line style) crosses at a test value x ≈ 10 and at a posttest probability which approximates the actually specified P(D) (see the little arrows in Figure 4).
Taking into account the results from Figure 2, we can easily understand this behaviour: Figure 3, panel a1, shows that with the LR method, a cutoff value of x ≈ 10 is obtained over the whole range of P(D). In other words, at x ≈ 10, all three different LR functions yield approximately unity, and hence, the posttest probabilities approximate P(D).
In addition we note that, for each of the three different LR functions, the respective set of three curves at different P(D) in the second calculation step (in Figure 4 these are the curves with the same colour) runs "parallel", because the curves share the slope parameter β1 of the respective LR function.
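This behaviour can be checked numerically with the mean estimates reported above for Scenario 1 and N = 200. The Python sketch below takes the uncorrected intercepts as negative (−5.246 and −3.075), which is the sign under which the correction formula log_e(LR) = β0 + β1 x − log_e[P(D)/(1 − P(D))] reproduces the reported corrected intercepts. The estimates at P(D) = 0.50 are not given in the text and are therefore omitted. The function and variable names are illustrative.

```python
import math

def posttest_probability(x, beta0, beta1, p_train, p_apply):
    """P(D|x) when a fit from data with prevalence p_train is applied at prevalence p_apply."""
    log_lr = beta0 + beta1 * x - math.log(p_train / (1.0 - p_train))
    post_odds = p_apply / (1.0 - p_apply) * math.exp(log_lr)
    return post_odds / (1.0 + post_odds)

# (beta0, beta1, P(D) of the regression data): mean estimates reported for Scenario 1.
fits = [(-5.246, 0.303, 0.10), (-3.075, 0.535, 0.90)]

for beta0, beta1, p_train in fits:
    for p_apply in (0.10, 0.50, 0.90):
        p_post = posttest_probability(10.0, beta0, beta1, p_train, p_apply)
        print(p_train, p_apply, round(p_post, 3))
```

At x = 10 both LR functions are close to 1, so each printed posttest probability stays close to the pretest probability used in the second step, in line with the crossing points marked in Figure 4.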
Discussion
In this paper, we compare a parametric and two nonparametric methods of determining optimum binary cutoff levels of a ratio-scaled test using Monte Carlo technique. For scenarios with quite different distributional characteristics underlying the computer-generated data sets, and for different total numbers of fictitious "individuals" (i.e., data sets), we focus on the effects of varying P(D) on the optimum cutoff levels obtained, and on sensitivities and specificities associated with these threshold values.
Our study shows that the Youden index method and the LR method yield very stable mean cutoff levels over the whole range of P(D), while the results of the Mutual information method show a characteristic monotonic decrease of the mean cutoff values with increasing P(D). While the parametric LR method, based on logistic regression analysis followed by proper correction of the intercept parameter for P(D), produces by far the most precise estimates (smallest SD values), the method yields results which are positively biased for three of the four distributional scenarios studied. The best agreement between mean cutoff levels and the theoretical crossing points is obtained by the Youden index method, and the worst precision (largest SD values) is generally found with the technique of maximizing the mutual information statistic.
In perfect accordance with the behaviour of the mean optimum cutoff levels is the effect of the distributional scenarios as well as of P(D) on important test characteristics like sensitivities and specificities (and their SD values). Notably, as the LR technique generally produces the highest estimates of optimum cutoff levels (nearly irrespective of the distributional scenarios), it also yields the highest mean values for specificities, and in turn, the smallest mean values for sensitivities.
The estimates of the logistic regression analysis are somewhat dependent on the actual P(D) used, and posttest probability functions for the presence of disease, given a certain value of the diagnostic test variable, therefore show somewhat different slopes and positions; but as shown in Figure 4, the different curves obtained from logistic regression analyses with different P(D), when applied to compute posttest probabilities for a situation with an arbitrarily specified P(D) (in the second step), all cross approximately at a point in a P(D|x) vs. x diagram the abscissa of which is approximately equal to the optimum cutoff value and the ordinate of which approximates the specified P(D) (in the second step).
The results of the Monte Carlo simulations reported here appear to be representative for a broad variety of distributional characteristics underlying the test data in the "nondiseased" and the "diseased" category. Moreover, the results do not greatly vary when using 50, 100 or 200 fictitious "individuals"; clearly, the SD values obtained for increasing numbers of "individuals" are slightly decreasing. The study is restricted insofar as in any case, 1000 repetitions are employed for computing the respective mean values and SD values; however, this number is apparently high enough to guarantee quite stable estimates.
One might question the use of the crossing points of the involved distribution functions as the reference value for determining the bias of the methods: the crossing points of the distribution functions as used in our work imply a P(D) of 0.50 (both diagnostic categories would have the same weight) and, of course, varying the relative weights of the two distribution functions would lead to varying crossing points. For example, the theoretical crossing points for the distribution functions of Scenario 4 vary between 15.748 and 11.358 when P(D) changes from 0.10 to 0.90; the reported value of 13.4333 is obtained for P(D) = 0.50. However, we deliberately use the crossing points of the equally weighted distribution functions as the stable and correct reference value because we think that in diagnostic practice, the composition of a test sample with arbitrary P(D) should have as little effect as possible on critical results such as the optimum cutoff threshold. In the light of these considerations, it is particularly surprising and satisfying that the Youden index method and the LR method indeed provide optimum cutoff values that are essentially independent of P(D).
In this work, we have concentrated on the specific influence of varying P(D) on a few critical results of the diagnostic evaluation process; namely, the optimum cutoff levels and their associated sensitivities and specificities. We have not included many other important facets of modern test evaluation theory such as, e.g., utility aspects. It would certainly be promising to extend simulation studies such as the present one to these and other advanced issues.
Conclusions
Over a remarkably wide spectrum of distributional scenarios and over a wide range of different P(D) values, the Youden index method and the LR method give quite satisfactory results for optimum cutoff values in terms of stability and of the test characteristics derived from them. The results of the Mutual information method are more strongly dependent on P(D) and, in addition, show the highest variation. Notably, the parametric LR technique yields particularly precise but frequently positively biased results. A bonus of this method, on the other hand, is the straightforward transferability of the results to situations with other pretest probabilities.
References
 1.
Linnet K, Bossuyt PMM, Moons KGM, Reitsma JB: Quantifying the accuracy of a diagnostic test or marker. Clin Chem. 2012, 58: 1292-1301. 10.1373/clinchem.2012.182543.
 2.
Moons KGM, de Groot JAH, Linnet K, Reitsma JB, Bossuyt PMM: Quantifying the added value of a diagnostic test. Clin Chem. 2012, 58: 1408-1417. 10.1373/clinchem.2012.182550.
 3.
Reitsma JB, Moons KGM, Bossuyt PMM, Linnet K: Systematic reviews of studies quantifying the accuracy of diagnostic tests and markers. Clin Chem. 2012, 58: 1534-1545. 10.1373/clinchem.2012.182568.
 4.
Bossuyt PMM, Reitsma JB, Linnet K, Moons KGM: Beyond diagnostic accuracy: the clinical utility of diagnostic tests. Clin Chem. 2012, 58: 1636-1643. 10.1373/clinchem.2012.182576.
 5.
Diamond GA, Hirsch M, Forrester JS, Staniloff HM, Vas R, Halpern SW, Swan HJC: Application of information theory to clinical diagnostic testing. The electrocardiographic stress test. Circulation. 1981, 63: 915-921. 10.1161/01.CIR.63.4.915.
 6.
Büttner J: Grundlagen der Anwendung der Informationstheorie auf qualitative klinisch-chemische Untersuchungen. J Clin Chem Clin Biochem. 1982, 20: 477-490.
 7.
Rudolph RA, Bernstein LH, Babb J: Information induction for predicting acute myocardial infarction. Clin Chem. 1988, 34: 2031-2038.
 8.
Kazmierczak SC, Catrou PG, Van Lente F: Enzymatic markers of gallstone-induced pancreatitis identified by ROC curve analysis, discriminant analysis, logistic regression, likelihood ratios, and information theory. Clin Chem. 1995, 41: 523-531.
 9.
Reibnegger G: Beyond the 2 × 2 contingency table: a primer on entropies and mutual information in various scenarios involving m diagnostic categories and n categories of diagnostic tests. Clin Chim Acta. 2013, 425: 97-103. 10.1016/j.cca.2013.07.011.
 10.
Albert A: On the use and computation of likelihood ratios in clinical chemistry. Clin Chem. 1982, 28: 1113-1119.
 11.
Birkett NJ: Evaluation of diagnostic tests with multiple diagnostic categories. J Clin Epidemiol. 1988, 41: 491-494. 10.1016/0895-4356(88)90051-0.
 12.
Reibnegger G, Fuchs D, Hausen A, Werner ER, Werner-Felmayer G, Wachter H: Generalization of the likelihood ratio concept for diagnostic tests with multiple diagnostic categories. J Clin Epidemiol. 1989, 42: 477-479. 10.1016/0895-4356(89)90138-8.
 13.
Reibnegger G, Fuchs D, Hausen A, Werner ER, Werner-Felmayer G, Wachter H: Generalized likelihood ratio concept and logistic regression analysis for multiple diagnostic categories. Clin Chem. 1989, 35: 990-994.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
GR designed the study, developed the statistical content of the paper and programmed the computational core of the MATHEMATICA program. WS programmed the Monte Carlo part of the MATHEMATICA program, the output of the Monte Carlo results in EXCEL sheets as well as visual representations of the results. GR wrote the primary draft of the manuscript. Both authors revised the manuscript critically and both read and approved the final draft.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Reibnegger, G., Schrabmair, W. Optimum binary cutoff threshold of a diagnostic test: comparison of different methods using Monte Carlo technique. BMC Med Inform Decis Mak 14, 99 (2014) doi:10.1186/s12911-014-0099-1
Keywords
 Mutual Information
 Diagnostic Category
 Youden Index
 Likelihood Ratio Method
 Distributional Scenario