BMC Medical Informatics and Decision Making

Background: When developing multivariable regression models for diagnosis or prognosis, continuous independent variables can be categorized to make a prediction table instead of a prediction formula. Although many methods have been proposed to dichotomize prognostic variables, to date there has been no integrated method for polychotomization. The latter is necessary when dichotomization results in too much loss of information or when central values refer to normal states and more dispersed values refer to less preferable states, a situation that is not unusual in medical settings (e.g. body temperature, blood pressure). The goal of our study was to develop a theoretical and practical method for polychotomization.


Background
In modern diagnostic and descriptive prognostic research, regression models are often used to model an illness-related outcome based on a number of independent regressor variables, also referred to as diagnostic indicators or prognostic predictors [1]. Such regressor variables can be categorical or numerical. From the vantage point of applicability in a clinical setting, categorization (often dichotomization) of continuous independent variables can be useful. Obtaining a prediction at the bedside without a computer is easier with a prediction table based on categorized variables than with a prediction formula. Even if calculation is not problematic, table presentation of the risks has the practical advantages that (1) repeated use of the table will give physicians an intuitive feel for the disease risk, and (2) even if the value of one or two of the prognostic variables is not available, physicians can obtain a probability range corresponding to the patient's risk by referring to the most extreme cases in the table.
Depending on the setting, several different approaches have been proposed for dichotomization. One popular method is to find a cutoff point to discriminate whether a patient belongs to a normal group or a disease group based on the observed value of a predictive factor. This type of discriminant function analysis was first developed by R.A. Fisher [2] in the 1930s. The Mahalanobis distance [3] can be used to find the optimal cutoff point if the variable is normally distributed.
Another solution, sometimes used in clinical chemistry, is to find a cutoff point that maximizes the sum of sensitivity (SE) and specificity (SP) [4,5]. There are different versions of this approach where one can maximize the weighted sum of SE and SP, or maximize the SE while fixing SP to an acceptable value [6,7]. Cantor claimed that these methods have been used in many published articles without giving a theoretical foundation or scientific justification [8].
Yet another straightforward and popular method is to select the classification that maximizes a measure of difference between the two groups, for example by minimizing the p-value of a chi-square statistic [9,10]. This method, sometimes called the minimum p-value approach, has been described and used for the prognosis of cancers [11,12]. Several authors have pointed out that the naïve selection used in this method overestimates the significance of the predictor or indicator's relationship to the dependent variable because of multiple testing, and several adjustment methods of the observed p-values have been proposed [9][10][11][12][13][14][15][16][17][18].
Besides using the data at hand to come to a dichotomization of continuous variables, it is also possible to use profit (benefit) or loss (cost) information. In that case, the optimal cutoff point is defined so as to maximize the expected utility. Metz showed that the optimal point is the spot on the ROC curve at which the slope is (L/B)(1-p)/p, where B is the net benefit of treating diseased individuals, L the net loss of treating non-diseased individuals, and p the prevalence of the disease under study [19]. Nevertheless, Cantor et al., in a review of studies in the medical literature that referred to "ROC" and "cutoff", found that only a few articles included an L/B ratio in the analysis for determining an optimal cutoff point [8].
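As a small worked illustration of Metz's criterion, the sketch below computes the target slope (L/B)(1-p)/p and then, under the added assumption of an equal-variance binormal model (which is not part of the text above), the cutoff at which the ROC slope reaches that value. It relies on the fact that the slope of the ROC curve at a cutoff z equals the likelihood ratio f_d(z)/f_h(z). The function names and the example numbers are hypothetical.

```python
import math

def metz_slope(loss, benefit, prevalence):
    """Target ROC slope (L/B)(1 - p)/p at the utility-maximizing cutoff."""
    return (loss / benefit) * (1.0 - prevalence) / prevalence

def cutoff_equal_variance(slope, m_h, m_d, s):
    """Cutoff z where f_d(z)/f_h(z) == slope for N(m_h, s^2) vs N(m_d, s^2).

    Assumes m_d > m_h; in this model the log likelihood ratio is linear in z.
    """
    return (m_h + m_d) / 2.0 + s * s * math.log(slope) / (m_d - m_h)

# Hypothetical numbers: prevalence 10%, loss half as large as the benefit.
slope = metz_slope(loss=1.0, benefit=2.0, prevalence=0.1)
z = cutoff_equal_variance(slope, m_h=0.0, m_d=2.0, s=1.0)
print(f"target slope = {slope:.2f}, cutoff = {z:.3f}")
```

A slope of 1 places the cutoff at the midpoint of the two means; a larger slope (rare disease or costly false positives) moves it toward the diseased group.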
The above methods all concern dichotomization. However, when central values refer to normal states and dispersed values to diseased states, two (or more) cutoff points are necessary to discriminate these states. Consequently, one is inevitably faced with the challenge of polychotomization.
Unfortunately, methods for polychotomization are less developed. Although Kristjansson et al. [20] described a method for choosing optimal cutoff points in a screening test with a continuous score to divide people into a number of disease categories, their method is not applicable to polychotomization of regressor variables in regression models; their criterion loses its meaning in this setting.
The major goal of our study is to develop a theoretical and practical method for polychotomization. We propose a novel approach for independent continuous variables in regression models based on the overall discrimination index C introduced by Harrell et al. [21,22]. We will show that this index is closely related to the area under the ROC curve for the original continuous variable and that the resulting categorized variables have predictive properties comparable to the original continuous variable. However, the naïve search for the maximum C index gives rise to positive bias, not unlike the minimum p-value approach [9][10][11][12][13][14][15][16][17][18] or the method of maximizing the sum of the sensitivity and specificity [4,5]. We therefore propose a parametric version in which the estimates of the predictive performance and cutoff points are both essentially unbiased. We evaluate this method and present means and standard deviations of predictive performance and cutoff point estimates for typical cases via Monte Carlo simulation. Finally, we provide a simple application example with a predictive regression model for rhabdomyolysis and show how our method can be used to create a probability profile table.

The categorization criterion
We assume there is an existing predictive model based on patients that belong to either a normal group or a diseased group and that the distribution of the relevant independent continuous variable X is known or that we have observations on it. Our goal is to find a method of optimal polychotomization for this continuous variable with a minimum loss of predictive ability. This involves making the number of possible patient profiles finite, and replacing the regression formula with a table of the risk probabilities for all patient profiles. Unlike most previously developed approaches, we have no a priori intention to categorize the variable into two classes, and we assume that it might be necessary to compare categorizations into three or more classes.
For this discussion we need a measure to evaluate the predictive power of a predictive variable. Our choice for a measure of predictive power is the overall discrimination index C [21][22][23][24], or the 'pair consistency probability', as we like to call it. This measure refers to the probability that the relative position of single normal-disease pair values is consistent with the relative position of their values of central tendency.
Without loss of generality, we assume that the central value of the distribution of the random variable X in the group of healthy cases is smaller than the central value in the group of diseased cases. Next we randomly take a sample $x_i^{[h]}$ from the healthy group and another sample $x_i^{[d]}$ from the diseased group. Then the pair $(x_i^{[h]}, x_i^{[d]})$ is called consistent if $x_i^{[h]} < x_i^{[d]}$, tied if $x_i^{[h]} = x_i^{[d]}$ and inconsistent if $x_i^{[h]} > x_i^{[d]}$, and the pair consistency probability C is defined as:

$$C = p_{con} + \frac{1}{2}\, p_{tied} \qquad (1)$$

where $p_{con}$ and $p_{tied}$ denote the probabilities that the pair is consistent and tied respectively. Next, if we let $f_h$ represent the probability density function (PDF) of X in the healthy group and $f_d$ represent the PDF of X in the diseased group, and let z represent a cutoff point for dichotomization, then the true positive fraction $Tp$ and false positive fraction $Fp$ are defined by

$$Tp(z) = \int_z^{\infty} f_d(x)\,dx, \qquad Fp(z) = \int_z^{\infty} f_h(x)\,dx$$

In the case that the variable is continuous, as z increases, $Tp$ and $Fp$ both decrease continuously. The ROC curve [19,25] can be depicted as the trace of points $(Fp, Tp)$. Green and Swets [25] demonstrated that

$$C = \int_0^1 Tp \; dFp$$

This means that the pair consistency probability is equivalent to the area under the ROC curve for continuous variables. We will demonstrate that this relation also holds for polychotomized variables, and that the pair consistency probability C is a good measure to compare the predictive ability with the original continuous variable.
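The equivalence between the pair consistency probability and the area under the ROC curve can be checked numerically on a small sample. The following is a minimal sketch, assuming that tied pairs count one half (the usual convention for the C index) and that larger values point to disease; the function names and the toy data are hypothetical.

```python
def pair_consistency(healthy, diseased):
    """C = p_con + p_tied / 2, counted over all healthy-diseased pairs."""
    consistent = tied = 0
    for xh in healthy:
        for xd in diseased:
            if xh < xd:
                consistent += 1
            elif xh == xd:
                tied += 1
    n_pairs = len(healthy) * len(diseased)
    return (consistent + 0.5 * tied) / n_pairs

def empirical_auc(healthy, diseased):
    """Trapezoidal area under the empirical ROC curve of points (Fp, Tp)."""
    cutoffs = sorted(set(healthy) | set(diseased), reverse=True)
    points = [(0.0, 0.0)]
    for z in cutoffs:
        fp = sum(x >= z for x in healthy) / len(healthy)
        tp = sum(x >= z for x in diseased) / len(diseased)
        points.append((fp, tp))
    area = 0.0
    for (fp0, tp0), (fp1, tp1) in zip(points, points[1:]):
        area += (fp1 - fp0) * (tp0 + tp1) / 2.0
    return area

# Toy data with one tie (the value 2.0 occurs in both groups).
healthy = [1.0, 2.0, 3.0, 4.0]
diseased = [2.0, 3.5, 5.0, 6.0]
print(pair_consistency(healthy, diseased), empirical_auc(healthy, diseased))
```

For these toy data both functions return 0.78125 (12 consistent pairs and 1 tied pair out of 16).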

Optimal cutoff point for dichotomization
First, we discuss our method for dichotomization in which a continuous independent variable in a predictive model is categorized to one of two classes by a cutoff point. If we denote the value of the cutoff point z and assume that X is continuous in both the healthy and the diseased groups, that is, $P(x^{[h]} = z) = 0$ and $P(x^{[d]} = z) = 0$, the results of random pair sampling are classified into the following four cases:

$x^{[h]} < z$ and $x^{[d]} < z$: tied
$x^{[h]} < z$ and $x^{[d]} > z$: consistent
$x^{[h]} > z$ and $x^{[d]} < z$: inconsistent
$x^{[h]} > z$ and $x^{[d]} > z$: tied

Let α denote the probability that $x^{[h]}$ is greater than z, and β denote the probability that $x^{[d]}$ is less than z. Assuming that the central value of the distribution of the random variable X in the group of healthy cases is smaller than the central value in the group of diseased cases, we have

$$\alpha = \int_z^{\infty} f_h(x)\,dx, \qquad \beta = \int_{-\infty}^{z} f_d(x)\,dx \qquad (2)$$

Then the probability of a consistent pair becomes $p_{con} = (1 - \alpha)(1 - \beta)$, and the probability of a tied pair becomes

$$p_{tied} = (1 - \alpha)\beta + \alpha(1 - \beta)$$

Assigning these probabilities into (1), we have

$$C = 1 - \frac{\alpha + \beta}{2} \qquad (3)$$

It follows that the highest pair consistency probability is achieved when the sum of the two types of errors, α + β, is minimized. Since sensitivity is (1 - β) and specificity is (1 - α), we have

$$C = (\text{sensitivity} + \text{specificity})/2 \qquad (4)$$

Therefore the highest pair consistency probability is achieved when the sum of sensitivity and specificity is maximized. This means that for a dichotomous variable the area under the ROC straight line graph is, analogous to the case with a continuous variable, equivalent to the pair consistency probability C. Therefore, finding a cutoff point that maximizes C is equivalent to the problem of finding the point $(Fp_0, Tp_0)$ on the original ROC curve that maximizes the area A under the ROC straight line graph.
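To make the dichotomization criterion concrete, the sketch below evaluates C = 1 - (α + β)/2 = (sensitivity + specificity)/2 for two known normal distributions and finds the maximizing cutoff by a simple grid search. The normal model, the restriction of the search to the interval between the two means, and all names are assumptions made for this illustration.

```python
import math

def norm_cdf(x, mean, sd):
    """Normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def c_dichotomized(z, m_h, s_h, m_d, s_d):
    """C = 1 - (alpha + beta)/2 = (sensitivity + specificity)/2 at cutoff z."""
    alpha = 1.0 - norm_cdf(z, m_h, s_h)   # healthy value above the cutoff
    beta = norm_cdf(z, m_d, s_d)          # diseased value below the cutoff
    return 1.0 - (alpha + beta) / 2.0

def best_cutoff(m_h, s_h, m_d, s_d, steps=4000):
    """Grid search between the two means for the cutoff that maximizes C."""
    grid = [m_h + (m_d - m_h) * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda z: c_dichotomized(z, m_h, s_h, m_d, s_d))

# Toy example: N(0, 1) for healthy cases, N(2, 1) for diseased cases.
z_opt = best_cutoff(0.0, 1.0, 2.0, 1.0)
print(z_opt, c_dichotomized(z_opt, 0.0, 1.0, 2.0, 1.0))
```

With these equal-variance toy values the search returns the midpoint of the two means, 1.0, with C of about 0.841.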

Optimal cutoff points for polychotomization
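Next, consider the polychotomous case. Again, let $x^{[h]}$ be a sample from the continuous random variable X in the healthy group and $x^{[d]}$ a sample from the same variable in the disease group, both taken randomly. Let $z_0 = -\infty$, $z_n = \infty$ and $z_1, z_2, \ldots, z_{n-1}$ be cutoff points where $z_1 < z_2 < \ldots < z_{n-1}$. We define that

$$p_k^{[h]} = \int_{z_{k-1}}^{z_k} f_h(x)\,dx, \qquad p_k^{[d]} = \int_{z_{k-1}}^{z_k} f_d(x)\,dx \qquad (k = 1, \ldots, n)$$

[Figure: The ROC curve and ROC straight line graph for the sample distributions in Figure 1]

Then the probabilities for tied and concordant pairs become

$$p_{tied} = \sum_{k=1}^{n} p_k^{[h]} p_k^{[d]}, \qquad p_{con} = \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} p_j^{[h]} p_k^{[d]}$$

and the pair consistency probability C can be calculated from equation (1).
We also define

$$Tp_k = \int_{z_k}^{\infty} f_d(x)\,dx = \sum_{j=k+1}^{n} p_j^{[d]}, \qquad Fp_k = \int_{z_k}^{\infty} f_h(x)\,dx = \sum_{j=k+1}^{n} p_j^{[h]} \qquad (k = 0, \ldots, n)$$

The points $(Fp_k, Tp_k)$ lie on the original ROC curve, and the set of points $(Fp_k, Tp_k)$ joined by straight lines yields the ROC straight line graph. Let A represent the area under the ROC straight line graph and $A_k$ represent the area under the line whose ends are $(Fp_{k-1}, Tp_{k-1})$ and $(Fp_k, Tp_k)$. As illustrated in Figure 3, the area $A_k$ is

$$A_k = \frac{1}{2} (Fp_{k-1} - Fp_k)(Tp_{k-1} + Tp_k) = \frac{1}{2}\, p_k^{[h]} (Tp_{k-1} + Tp_k)$$

Therefore,

$$A = \sum_{k=1}^{n} A_k = \sum_{k=1}^{n} p_k^{[h]}\, Tp_k + \frac{1}{2} \sum_{k=1}^{n} p_k^{[h]} p_k^{[d]} = p_{con} + \frac{1}{2}\, p_{tied}$$

Then we have

$$C = A \qquad (6)$$

Again, the pair consistency probability C for the polychotomized variable is equivalent to the area under its ROC straight line graph, and the problem of finding the optimal cutoff points that maximize C is mathematically equivalent to finding the set of edge points of the ROC straight line graph that maximizes the area A under that graph.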
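The identity between C and the area under the ROC straight line graph can be verified numerically. The sketch below assumes normal distributions for both groups and computes C once from the class probabilities (consistent pairs plus half of the tied pairs) and once as a sum of trapezoids under the line graph. The function names, the per-class probability lists `p_h` and `p_d`, and the two cutoff values are hypothetical choices for illustration.

```python
import math

def norm_cdf(x, mean, sd):
    """Normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def class_probs(cutoffs, mean, sd):
    """Probability of each class defined by the ordered cutoff points."""
    cdf = [0.0] + [norm_cdf(z, mean, sd) for z in cutoffs] + [1.0]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]

def c_from_classes(p_h, p_d):
    """C = p_con + p_tied/2; consistent means the diseased class is higher."""
    p_tied = sum(h * d for h, d in zip(p_h, p_d))
    p_con = sum(p_h[j] * p_d[k]
                for j in range(len(p_h)) for k in range(j + 1, len(p_d)))
    return p_con + p_tied / 2.0

def area_under_line_graph(p_h, p_d):
    """Sum of trapezoids under the straight lines joining (Fp_k, Tp_k)."""
    area, fp, tp = 0.0, 1.0, 1.0          # start at (Fp_0, Tp_0) = (1, 1)
    for h, d in zip(p_h, p_d):
        fp_next, tp_next = fp - h, tp - d
        area += (fp - fp_next) * (tp + tp_next) / 2.0
        fp, tp = fp_next, tp_next
    return area

# Healthy N(0, 1), diseased N(1.517, 1.5^2), two illustrative cutoffs.
cutoffs = [0.5, 1.5]
p_h = class_probs(cutoffs, 0.0, 1.0)
p_d = class_probs(cutoffs, 1.517, 1.5)
print(c_from_classes(p_h, p_d), area_under_line_graph(p_h, p_d))
```

The two printed values agree up to rounding error, whatever cutoffs are chosen.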

Optimal cutoff points for variables for which normal and diseased cases have a common central tendency
There are many predictive variables whose central values refer to a normal state and whose more dispersed values refer to less preferable states. In the example of rhabdomyolysis prognosis that will follow later, body temperature, pulse rate, plasma sodium, and plasma pH are such variables. For these predictors, we need to find at least two cutoff points to discriminate normal and abnormal states. If we denote the values of the cutoff points $z_1$ and $z_2$ ($z_1 < z_2$), and regard the value between these two cutoff points as normal, then type I error α and type II error β become:

$$\alpha = \int_{-\infty}^{z_1} f_h(x)\,dx + \int_{z_2}^{\infty} f_h(x)\,dx$$

and

$$\beta = \int_{z_1}^{z_2} f_d(x)\,dx$$

The pair consistency probability C can now be calculated with equation (3) and the combination of cutoff points $(z_1, z_2)$ which maximizes (3) becomes the solution. In case of categorization of the variable into more than three states, we can define the optimal combination of cutoff points as follows: Let $z_n = -\infty$, $w_n = \infty$ and $z_1, z_2, \ldots, z_{n-1}, w_1, w_2, \ldots, w_{n-1}$ be cutoff points where $z_{n-1} < \ldots < z_2 < z_1 < w_1 < w_2 < \ldots < w_{n-1}$, and

$$p_1^{[h]} = \int_{z_1}^{w_1} f_h(x)\,dx, \qquad p_k^{[h]} = \int_{z_k}^{z_{k-1}} f_h(x)\,dx + \int_{w_{k-1}}^{w_k} f_h(x)\,dx \qquad (k = 2, \ldots, n)$$

with $p_k^{[d]}$ defined in the same way from $f_d$. Then the probabilities for tied and concordant pairs become

$$p_{tied} = \sum_{k=1}^{n} p_k^{[h]} p_k^{[d]}, \qquad p_{con} = \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} p_j^{[h]} p_k^{[d]}$$

and the pair consistency probability C can be calculated from equation (1). The combination of cutoff points that maximizes C becomes the solution.
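For a variable with a common central tendency, the two-cutoff search can be sketched as follows. The example assumes normal distributions with equal means and a larger spread in the diseased group, treats values between the two cutoffs as normal, and searches a coarse grid of cutoff pairs; the grid bounds, the step of 0.05 and all names are hypothetical.

```python
import math

def norm_cdf(x, mean, sd):
    """Normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def c_two_cutoffs(z1, z2, m_h, s_h, m_d, s_d):
    """C when values between z1 and z2 are normal and all others abnormal."""
    inside_h = norm_cdf(z2, m_h, s_h) - norm_cdf(z1, m_h, s_h)
    inside_d = norm_cdf(z2, m_d, s_d) - norm_cdf(z1, m_d, s_d)
    alpha = 1.0 - inside_h      # healthy value outside the normal range
    beta = inside_d             # diseased value inside the normal range
    return 1.0 - (alpha + beta) / 2.0

def best_pair(m_h, s_h, m_d, s_d, lo=-80, hi=80, per_unit=20):
    """Exhaustive search over all pairs z1 < z2 on a grid with step 1/per_unit."""
    grid = [i / per_unit for i in range(lo, hi + 1)]
    pairs = ((z1, z2) for i, z1 in enumerate(grid) for z2 in grid[i + 1:])
    return max(pairs,
               key=lambda p: c_two_cutoffs(p[0], p[1], m_h, s_h, m_d, s_d))

# Healthy N(0, 1) and diseased N(0, 2^2): same center, different spread.
z1, z2 = best_pair(0.0, 1.0, 0.0, 2.0)
print(z1, z2, c_two_cutoffs(z1, z2, 0.0, 1.0, 0.0, 2.0))
```

On this grid the search settles on the symmetric pair closest to the two points where the densities intersect (about ±1.36 for these parameters).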

Parametric method for estimating cutoff points and predictive performance
The polychotomization methods proposed in the previous sections have been developed under conditions where the exact distribution of a prognostic or diagnostic factor in a population is known. However, in research practice we work with samples and we need to discuss whether our methods can be applied in situations involving parameter uncertainty. Although some methods were developed for correct estimation of the pair consistency probability C in these situations, including non-parametric ones [22][23][24], none of them addressed the estimation of cutoff points and they can therefore not be applied to our setting.
The challenge we are faced with is that if we repeat the evaluation of the pair consistency probability to find optimal cutoff points, for instance by increasing the possible value of the cutoff point with a certain step, it gives rise to estimation error just like the minimum p-value approach [9][10][11][12][13][14][15][16][17][18] and would mistakenly lead to an optimistic conclusion on the predictive performance of the model in future observations.
It is clear that we need a practical method that does not suffer from this over-estimation bias. In this paper we show that if $f_h$ and $f_d$ can be transformed to normal distributions, a parametric method provides essentially unbiased estimators of predictive performance and cutoff points.
Our method is based on the following:

Distributions of the estimators for the cutoff point and the pair consistency probability
If $f_h$ and $f_d$ are both normal and $s_h = s_d$, then the two curves intersect at $x = (m_h + m_d)/2$. The pair consistency probability C takes the maximum value at this point as mentioned earlier.
In the case that $s_h$ is not equal to $s_d$, the two curves intersect at the following two points:

$$x = \frac{m_h s_d^2 - m_d s_h^2 \pm s_h s_d \sqrt{(m_d - m_h)^2 + 2 (s_d^2 - s_h^2) \ln(s_d / s_h)}}{s_d^2 - s_h^2}$$

and the point that is located between $m_h$ and $m_d$ can be used to calculate the true maximum value of the pair consistency probability C with equations (2) and (3). As it is difficult to evaluate the statistical properties of the above formulae analytically, even for the simplest dichotomization case, we performed a Monte Carlo simulation to assess the estimation of the cutoff points and the corresponding C. For these purposes, a custom simulation program was written in the programming language Pascal with the following characteristics:
[...]
d) [...] C), in which cutoff points are searched numerically in the same manner as the above stepwise repeated search, based not on the sample data but on the estimated PDFs,
e) repeat the above sample generation and estimation steps 10,000 or 100,000 times for each of various combinations of population parameters.
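The intersection point of two normal densities has a closed form, so the parametric estimate for dichotomization can be sketched in a few lines. The code below is an illustration under the normality assumption, not the authors' Pascal program: it solves the quadratic that results from equating the two densities, keeps the root between the two means, and plugs sample means and standard deviations into C = 1 - (α + β)/2. All function names are hypothetical.

```python
import math
import statistics

def norm_cdf(x, mean, sd):
    """Normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def intersection_between_means(m_h, s_h, m_d, s_d):
    """Point between the means where the two normal PDFs cross (m_h < m_d)."""
    if math.isclose(s_h, s_d):
        return (m_h + m_d) / 2.0
    root = math.sqrt((m_d - m_h) ** 2
                     + 2.0 * (s_d ** 2 - s_h ** 2) * math.log(s_d / s_h))
    denom = s_d ** 2 - s_h ** 2
    candidates = [(m_h * s_d ** 2 - m_d * s_h ** 2 + sign * s_h * s_d * root) / denom
                  for sign in (1.0, -1.0)]
    inside = [x for x in candidates if m_h <= x <= m_d]
    if not inside:
        raise ValueError("no intersection between the two means")
    return inside[0]

def c_at(z, m_h, s_h, m_d, s_d):
    """C = 1 - (alpha + beta)/2 for cutoff z under the normal model."""
    alpha = 1.0 - norm_cdf(z, m_h, s_h)
    beta = norm_cdf(z, m_d, s_d)
    return 1.0 - (alpha + beta) / 2.0

def parametric_estimate(healthy, diseased):
    """Plug sample means and SDs into the normal model; return (cutoff, C)."""
    m_h, s_h = statistics.mean(healthy), statistics.stdev(healthy)
    m_d, s_d = statistics.mean(diseased), statistics.stdev(diseased)
    z = intersection_between_means(m_h, s_h, m_d, s_d)
    return z, c_at(z, m_h, s_h, m_d, s_d)

# Population values N(0, 1) and N(1.517, 1.5^2).
z_true = intersection_between_means(0.0, 1.0, 1.517, 1.5)
print(z_true, c_at(z_true, 0.0, 1.0, 1.517, 1.5))
```

For the population values N(0, 1) and N(1.517, 1.5²) the cutoff is about 0.971 and C is about 0.738.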

Extension for multiple associated independent variables
Thus far, we have discussed a method for selecting cutoff points that maximizes the predictive ability of each prognostic variable individually. When a regression model has more than one explanatory variable, the version of our method presented in this article can only be applied if the variables are not associated (no correlation and no interaction). Since associations between prognostic variables are common, our method requires a multivariable extension in which cutoff points are found while taking such associations into account.
Our maximum C index approach can be applied to a multivariate scenario if the distributions of a number of prognostic variables for healthy and diseased groups can be described by multivariate normal distributions and if the calculation times are acceptable [26]. However, because we are still in the process of assessing the performance of multivariable extensions and comparisons with other approaches, we will only give a short summary below: a) determine the regression model that best fits the observations, b) estimate the multivariate normal distribution parameters from the observed data, c) for a set of categorized variables defined by a combination of cutoff points, calculate the regression equation and evaluate its overall C index (based not on the observed data but on the estimated distributions), d) iterate (c) systematically for every combination of cutoff points and select the combination of cutoff points which gives the maximum overall C index for the regression equation.

Evaluation of the parametric method by Monte Carlo simulation
In this section, we present an evaluation of our parametric method, together with the naïve application of a stepwise repeated search based on multiple evaluations. In the absence of a standard method for polychotomization, the latter is currently probably the first choice for researchers, mainly due to its simplicity.
We repeated the above simulations for various $n_h$ and $n_d$ ($n_h = n_d$) and Figure 8 and Figure 9 summarize the results. The graphs show that the estimation by the parametric method is almost unbiased even if the sample size is relatively small, both for dichotomization (Figure 8) and trichotomization for variables whose realizations in healthy and diseased groups have a similar central tendency (Figure 9), whereas the naïve repeated search method shows [...]

[Figure: Distributions of estimated pair consistency probability C in 100,000 simulations of dichotomization]
[Figure: Changes of estimated pair consistency probability C in dichotomization as a function of sample size]

Table 1 shows how the pair consistency probability C increases when the number of cutoff points changes from one to three for the case that $x^{[h]} \sim N(0, 1^2)$ and $x^{[d]} \sim N(\mu_d, 1.5^2)$. For instance, when the pair consistency probability for the original continuous variable is 0.8 ($\mu_d = 1.517$), the pair consistency probabilities for the dichotomized, trichotomized and quatrochotomized variables are 0.738, 0.775 and 0.787, respectively. Table 2 summarizes the means and standard deviations of the pair consistency probability C estimated by the parametric method for dichotomization when the sample sizes of the two groups are equal (n = 10, 25, 50, 100, 200, and 500) and $\sigma_d = 1.5\sigma_h$. Table 3 gives the results for trichotomization when the continuous variable in healthy and diseased cases has a common central tendency and the sample sizes of the two groups are equal. These tables can be used to evaluate the accuracy and precision of the estimated C for various sample sizes.
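The contrast between the naïve repeated search and the parametric estimate can be reproduced in miniature. The following sketch is not the authors' Pascal program; it is a reduced Python illustration that assumes samples of size 25 from N(0, 1) and N(1.517, 1.5²), 1,000 repetitions and a fixed random seed, and compares the average of both estimates with the dichotomized value of 0.738 quoted above. The search grid for the parametric version and all names are hypothetical.

```python
import math
import random
import statistics

TRUE_C = 0.738  # dichotomized C for N(0, 1) vs N(1.517, 1.5^2), as quoted above

def norm_cdf(x, mean, sd):
    """Normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def naive_max_c(healthy, diseased):
    """Largest (SE + SP)/2 over all cutoffs placed at observed values."""
    best = 0.5
    for z in sorted(healthy + diseased):
        specificity = sum(x < z for x in healthy) / len(healthy)
        sensitivity = sum(x >= z for x in diseased) / len(diseased)
        best = max(best, (sensitivity + specificity) / 2.0)
    return best

def parametric_max_c(healthy, diseased, steps=400):
    """Largest C over a grid of cutoffs, computed from fitted normal PDFs."""
    m_h, s_h = statistics.mean(healthy), statistics.stdev(healthy)
    m_d, s_d = statistics.mean(diseased), statistics.stdev(diseased)
    lo = min(m_h, m_d) - 3.0 * max(s_h, s_d)
    hi = max(m_h, m_d) + 3.0 * max(s_h, s_d)
    best = 0.5
    for i in range(steps + 1):
        z = lo + (hi - lo) * i / steps
        alpha = 1.0 - norm_cdf(z, m_h, s_h)
        beta = norm_cdf(z, m_d, s_d)
        best = max(best, 1.0 - (alpha + beta) / 2.0)
    return best

def simulate(n=25, reps=1000, mu_d=1.517, sd_d=1.5, seed=1):
    """Mean naive and mean parametric estimate of C over repeated samples."""
    rng = random.Random(seed)
    naive, parametric = [], []
    for _ in range(reps):
        healthy = [rng.gauss(0.0, 1.0) for _ in range(n)]
        diseased = [rng.gauss(mu_d, sd_d) for _ in range(n)]
        naive.append(naive_max_c(healthy, diseased))
        parametric.append(parametric_max_c(healthy, diseased))
    return statistics.mean(naive), statistics.mean(parametric)

naive_mean, parametric_mean = simulate()
print(f"true {TRUE_C:.3f}  naive {naive_mean:.3f}  parametric {parametric_mean:.3f}")
```

Because the naïve estimate is a maximum over many data-dependent cutoffs, its average lies above the true value, whereas the parametric average should stay close to it.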

Example: Polychotomization of the prognostic factors of rhabdomyolysis
Rhabdomyolysis is a potentially lethal complication, often observed in patients who have attempted suicide with large doses of psychotropic drugs. Though it is important to make the diagnosis and begin proper treatment at an early stage, the diagnosis of rhabdomyolysis is difficult unless specific enzymes and myoglobin in skeletal muscle are detected by laboratory tests.
To find prognostic variables of rhabdomyolysis at an outpatient clinic where laboratory data are not available, we previously evaluated 131 cases of acute drug toxicosis [27][28][29] and found twelve variables that contributed significantly to the diagnosis of rhabdomyolysis (rhabdomyolysis group: n = 34, non-rhabdomyolysis group: n = 97). For this example, we selected three non-laboratory variables to predict the risk at the outpatient clinic: (1) qtc: ECG QTc (non-dimensional); (2) t: time from taking the drug to hospitalization (hours); and (3) bt: body temperature (Celsius).
Applying the maximum pair consistency probability criterion, the three continuous variables are categorized, assuming that qtc is a normal variable, t a log-normal variable and bt a variable with a common central tendency. Table 4 shows the selected cutoff points and the changes of the pair consistency probability. Comparing the pair consistency probabilities of the categorized variable, we can observe how predictive ability changes with polychotomization, and the pair consistency probability C can be used as a measure to evaluate the loss of predictive ability by categorization.

[Figure: Changes of estimated pair consistency probability C in trichotomization as a function of sample size]

Next, we applied the cross-split-half method [30] to validate the effectiveness of prediction by these variables with logistic regression [31] and evaluated the amount of overestimation of prediction performance by a single data set. The estimated optimism for the overall C index was 0.018, which is sufficiently small.

Example: Risk table for prognosis of rhabdomyolysis
Based on categorized variables, we obtained the new prediction formula:

$$p = \frac{1}{1 + \exp(7.96 - 3.13\,QTC - 6.22\,T_1 - 3.11\,T_2 - 1.97\,BT)} \qquad (8)$$

where QTC is ECG QTc (1 for more than or equal to 0.45 and 0 for less than 0.45), $T_1$ is the time from drug ingestion to hospitalization (1 for more than or equal to 12 hours, 0 otherwise), $T_2$ is also the time from drug ingestion to hospitalization (1 for less than 12 hours and more than or equal to 5 hours, 0 otherwise), and BT is body temperature (1 for more than or equal to 37.2° or less than or equal to 34.0°, and 0 otherwise). Since the overall index C for this formula was 0.945, we estimate the predictive performance in future data will be around 0.927 (= 0.945 - 0.018).
To ascertain the fitness of the selected regression model, we conducted the Hosmer-Lemeshow goodness-of-fit test [32] by dividing disease probability into eight classes. The actual number of occurrences for each class showed good agreement with the expected number of occurrences of rhabdomyolysis (p = 0.618).
Since all three prognostic variables are categorized, the number of patient profiles becomes twelve and the risk probabilities of rhabdomyolysis for all possible patient profiles can now be obtained by assigning a combination of the values of categorized variables into regression formula (8). This yields a risk table for rhabdomyolysis occurrence (Table 5). For instance, if T, QTC and BT are "++", "+" and "-" respectively, we can read from the table that the risk of rhabdomyolysis is 0.801. Repeated use of this table over time will give physicians a "sense" of the disease risk.

[Table note: Two samples of the size n from $N(0, 1^2)$ and $N(0, \ldots)$ are generated by Monte Carlo simulation and the pair consistency probability C for the optimal trichotomized points is calculated by the parametric method. This step is iterated 10,000 times, producing means and standard deviations of [...]]
[Table note: Two samples of the size n from $N(0, 1^2)$ and $N(\mu_d, 1.5^2)$ are generated by Monte Carlo simulation and the pair consistency probability C for the optimal dichotomized point is calculated by the parametric method. This step is iterated 10,000 times, producing means and standard deviations of [...]]
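The construction of the risk table from formula (8) can be sketched directly. The coefficients and category codes below are those stated in the text; the dictionary layout and the function name `risk` are hypothetical choices made for this illustration.

```python
import math
from itertools import product

# Coefficients of prediction formula (8).
INTERCEPT = 7.96
COEF = {"QTC": 3.13, "T1": 6.22, "T2": 3.11, "BT": 1.97}

# Time category -> (T1, T2) indicator values used in formula (8).
TIME_LEVELS = {"++": (1, 0), "+": (0, 1), "-": (0, 0)}

def risk(t, qtc, bt):
    """Risk of rhabdomyolysis for one profile, e.g. risk('++', '+', '-')."""
    t1, t2 = TIME_LEVELS[t]
    linear = (INTERCEPT - COEF["QTC"] * (qtc == "+") - COEF["T1"] * t1
              - COEF["T2"] * t2 - COEF["BT"] * (bt == "+"))
    return 1.0 / (1.0 + math.exp(linear))

# All 3 x 2 x 2 = 12 patient profiles.
for t, qtc, bt in product(TIME_LEVELS, "+-", "+-"):
    print(f"T={t:<2} QTC={qtc} BT={bt}  risk={risk(t, qtc, bt):.3f}")
```

The profile T = "++", QTC = "+", BT = "-" reproduces the risk of 0.801 quoted in the text.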

Discussion
The criterion for optimal categorization of continuous variables in regression models may vary depending on the objective of the categorization, and there have been several different approaches. Many of these approaches are inadequate for our purpose. We have proposed to use the overall discrimination index C introduced by Harrell and other authors [21][22][23][24] as the measure for predictive performance of a categorized variable. Since the overall discrimination index C has a clear and straightforward meaning as the pair consistency probability, it is intuitively logical to use it as a measure for the predictive discrimination for polychotomized variables.
Though mathematically distinct, our method has much in common with previously developed methods [2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20][33][34][35][36][37][38], which can be explained through the relations between the pair consistency probability C, SE and SP, and the area under the ROC straight line graph, as is expressed in formulae (2) to (6). In addition, our ROC straight line graph has a close relation with ordinal dominance or the OD curve proposed by Darlington to visualize the ordering feature of two comparative sets [39]. He showed that the OD curve is a complete representation of the rank-order properties of data and many statistical procedures follow naturally from assessment of the curve. Bamber clarified the relation between the area above the OD curve and a measure identical to the pair consistency probability [40]. Our proof of formula (6) related to the ROC straight line graph corresponds to Bamber's OD curve related proof.
Monte Carlo simulation showed that the naïve search for the maximum C index will give rise to an estimation bias, which is very much like the positive bias that affects the minimum p-value method. Such bias is also seen in the method where the cutoff point is selected in a way that maximizes the sum of SE and SP. Linnet and Brandt calculated the sample distribution of (SE + SP)/2 in the case of dichotomization using computer simulation assuming that distributions are normal, and evaluated the positive bias induced by the selection of an optimal cutoff point [4]. They found that estimates of test performance are too optimistic when the sample size is small, with an average positive bias up to 15% for a sample size of 25. We have shown that this problem does not affect our proposed parametric method. However, there may be cases where a transformation to a normal distribution does not work well. For such cases, we conceive that approximation of the distribution curve by a more suitable function or a restricted cubic spline function [41] creates a workable situation. We are currently in the process of evaluating this approach and the results will be reported elsewhere.

[Table note: Abbreviations: T = time from taking the drug to arrival at hospital ('++' for more than or equal to 12 hours, '+' for less than 12 hours and more than or equal to 5 hours, '-' for less than 5 hours); QTC = ECG QTc ('+' for more than or equal to 0.45, '-' for less than 0.45); BT = body temperature ('+' for more than or equal to 37.2° or less than or equal to 34.0°, '-' for otherwise).]
To keep this introduction of the maximum C index approach for polychotomizing predictive variables short and readable, we have used an example in which a regression model without correlated independent variables and without interaction fitted the observed data well (p = 0.618 by Hosmer and Lemeshow goodness-of-fit test) [42,43]. However, if correlation and interaction are relevant for the regression function, our maximum C index approach must be extended to a multivariable setting. Mazumdar extended a cutoff point search based on the maximum chi-square method to a multivariable setting [44], and showed that the cutoff points obtained by a multivariable search were closer to the true cutoff points.
Another method that is appealing for regression settings with correlated independent variables is the so-called 'simplified integer score' method in which continuous variables are transformed into semi-continuous interval variables [41]. It has been used in numerous articles and is based on the categorization of the continuous variable, and the transformation of the products of the regression coefficient and the value of the variable into integers. This method is clinically useful and can be applied to the situation where explanatory variables are correlated. If the number of variables is small enough and they have few classifications, this method can also be used to create the simple probability profile tables that result from our approach. We are currently in the process of evaluating a multivariable extension of the C index maximization approach, including a comparison with this method.
Along with regression models, decision trees can also be used in diagnostic or prognostic decision making [36]. Breiman et al. developed an approach called classification and regression trees (CART) to build a decision tree for medical diagnosis based on a training data set [41,45]. In these decision trees, diagnosis is made by a sequential decision making process, in which a question on an independent variable is posed at each step and, depending on the answer, a different "branch" of the tree is selected until the final result is achieved. If an independent variable is continuous, dichotomization (or polychotomization) will be necessary to build a decision tree. Typically, the cutoff points are found by maximizing the total utility of the decision scheme [46,47], which appears to be closely related mathematically to our approach. Further study is necessary to make a theoretical and practical comparison.
We have indicated that it is easier for most people to read a probability profile table to obtain the risk probability than to calculate the risk with a regression formula. Additionally, probability profile tables give physicians an intuitive feel for the disease risk. Even if the value of one or two of the prognostic variables is not available, physicians can obtain a probability range corresponding to the patient's risk by referring to both the positive and negative cases from the table. By making simplified risk tables in advance, physicians can obtain the patient's risk from an auxiliary table, even if the value of a predictor is missing.
Since the table presentation of probabilities has these practical advantages, we believe our method for categorizing prognostic variables can be a helpful tool for making diagnostic or descriptive prognostic research with regression models more applicable in clinical practice.

Conclusion
We have proposed a new approach for polychotomization (including dichotomization) of independent continuous variables in regression models based on the overall discrimination index C, or the pair consistency probability, introduced by Harrell. We have shown that this index is closely related to the area under the ROC curve for the original continuous variable and that the resulting categorized variables have predictive properties comparable to the original continuous variable. We showed that the naïve application of the method gives rise to positive bias, not unlike the minimum p-value approach or the method of maximizing the sum of sensitivity and specificity, and we proposed a parametric version in which the estimates of the predictive performance and cutoff points are essentially unbiased. To evaluate the accuracy and precision of the estimate of the predictive performance, we presented tables of the means and standard deviations of the estimate of predictive performance for typical cases by the use of Monte Carlo simulation. Finally, we provided an application of our method to a prediction rule with continuous predictor variables for rhabdomyolysis and showed that our method for polychotomizing continuous regressor variables can be a valid and useful tool to create probability profile tables. All programs (and their source codes) used in this study are available from the authors.
ing useful comments and we are grateful to K. Doi for her technical assistance.