 Research article
 Open access
 Published:
Polychotomization of continuous variables in regression models based on the overall C index
BMC Medical Informatics and Decision Making volumeÂ 6, ArticleÂ number:Â 41 (2006)
Abstract
Background
When developing multivariable regression models for diagnosis or prognosis, continuous independent variables can be categorized to make a prediction table instead of a prediction formula. Although many methods have been proposed to dichotomize prognostic variables, to date there has been no integrated method for polychotomization. The latter is necessary when dichotomization results in too much loss of information or when central values refer to normal states and more dispersed values refer to less preferable states, a situation that is not unusual in medical settings (e.g. body temperature, blood pressure). The goal of our study was to develop a theoretical and practical method for polychotomization.
Methods
We used the overall discrimination index C, introduced by Harrel, as a measure of the predictive ability of an independent regressor variable and derived a method for polychotomization mathematically. Since the naÃ¯ve application of our method, like some existing methods, gives rise to positive bias, we developed a parametric method that minimizes this bias and assessed its performance by the use of Monte Carlo simulation.
Results
The overall C is closely related to the area under the ROC curve and the produced di(poly)chotomized variable's predictive performance is comparable to the original continuous variable. The simulation shows that the parametric method is essentially unbiased for both the estimates of performance and the cutoff points. Application of our method to the predictor variables of a previous study on rhabdomyolysis shows that it can be used to make probability profile tables that are applicable to the diagnosis or prognosis of individual patient status.
Conclusion
We propose a polychotomization (including dichotomization) method for independent continuous variables in regression models based on the overall discrimination index C and clarified its meaning mathematically. To avoid positive bias in application, we have proposed and evaluated a parametric method. The proposed method for polychotomizing continuous regressor variables performed well and can be used to create probability profile tables.
Background
In modern diagnostic and descriptive prognostic research, regression models are often used to model an illnessrelated outcome based on a number of independent regressor variables, also referred to as diagnostic indicators or prognostic predictors [1]. Such regressor variables can be categorical or numerical. From the vantage point of applicability in a clinical setting, categorization (often dichotomization) of continuous independent variables can be useful. Obtaining a prediction at the bedside without computer is easier with a prediction table based on categorized variables than with a prediction formula. Even if calculation is not problematic, table presentation of the risks has the practical advantages that (1) repeated use of the table will give physicians an intuitive feel for the disease risk, and (2) even if the value of one or two of the prognostic variables is not available, physicians can obtain a probability range corresponding to the patient's risk by referring to the most extreme cases in the table.
Depending on the setting, several different approaches have been proposed for dichotomization. One popular method is to find a cutoff point to discriminate whether a patient belongs to a normal group or a disease group based on the observed value of a predictive factor. This type of discriminant function analysis was first developed by R.A. Fisher [2] in 1930's. The Mahalanobis distance [3] can be used to find the optimal cutoff point if the variable distributes normally.
Another solution, sometimes used in clinical chemistry, is to find a cutoff point that maximizes the sum of sensitivity (SE) and specificity (SP) [4, 5]. There are different versions of this approach where one can maximize the weighted sum of SE and SP, or maximize the SE while fixing SP to an acceptable value [6, 7]. Cantor claimed that these methods have been used in many published articles without giving a theoretical foundation or scientific justification [8].
Yet another straightforward and popular method is to select a classification that maximizes a measure of difference between the two groups, such as the pvalue of a chi square statistic [9, 10]. This method, sometimes called the minimum pvalue approach, has been described and used for the prognosis of cancers [11, 12]. Several authors have pointed out that the naÃ¯ve selection used in this method overestimates the significance of the predictor or indicator's relationship to the dependent variable because of multiple testing, and several adjustment methods of the observed pvalues have been proposed [9â€“18].
Besides using the data at hand to come to a dichotomization of continuous variables, it is also possible to use profit (benefit) or loss (cost) information. In that case, the optical cutoff point is defined so as to maximize the expected utility. Metz showed that the optimal point is the spot on the ROC curve at which the slope is (L/B)(1p)/p, where B is the net benefit of treating diseased individuals, L the net loss of treating nondiseased individuals, and p the prevalence of the disease under study [19]. Nevertheless, Cantor et al., in a review of studies in the medical literature that referred to "ROC" and "cutoff", found that only a few articles included a L/B ratio in the analysis for determining an optimal cutoff point [8].
The above methods all concern dichotomization. However, when central values refer to normal states and dispersed values to diseased states, two (or more) cutoff points are necessary to discriminate these states. Consequently, one is inevitably faced with the challenge of polychotomization. Unfortunately, methods for polychotomization are less developed. Although Kristjansson et al. [20] described a method for choosing optimal cutoff points in a screening test with a continuous score to divide people into a number of disease categories, their method is not applicable to polychotomization of regressor variables in regression models; their criterion loses its meaning in this setting.
The major goal of our study is to develop a theoretical and practical method for polychotomization. We propose a novel approach for independent continuous variables in regression models based on the overall discrimination index C introduced by Harrel et al. [21, 22]. We will show that this index is closely related to the area under the ROC curve for the original continuous variable and that the resulting categorized variables have predictive properties comparable to the original continuous variable. However, the naÃ¯ve search of the maximum C index gives rise to positive bias, not unlike the minimum pvalue approach [9â€“18] or the method of maximizing the sum of the sensitivity and specificity [4, 5]. We therefore propose a parametric version in which the estimates of the predictive performance and cutoff points are both essentially unbiased. We evaluate this method and present means and standard deviations of predictive performance and cutoff point estimates for typical cases via Monte Carlo simulation. Finally, we provide a simple application example with a predictive regression model for rhabdomyolysis and show how our method can be used to create a probability profile table.
Methods
The categorization criterion
We assume there is an existing predictive model based on patients that belong to either a normal group or a diseased group and that the distribution of the relevant independent continuous variable X is known or that we have observations on it. Our goal is to find a method of optimal polychotomization for this continuous variable with a minimum loss of predictive ability. This involves making the number of possible patient's profiles finite, and replacing the regression formula with a table of the risk probabilities for all patient profiles. Different from most previously developed approaches we have no a priori intention to categorize the variable into two classes and we assume that it might be necessary to compare categorizations to three or more classes.
For this discussion we need a measure to evaluate the predictive power of a predictive variable. Our choice for a measure of predictive power is the overall discrimination index C [21â€“24], or the 'pair consistency probability', as we like to call it. This measure refers to the probability that the relative position of single normaldisease pair values is consistent with the relative position of their values of central tendency.
Without losing generality, we assume that the central value of the distribution of the random variable X in the group of healthy cases is smaller than the central value in the group of diseased cases. Next we take a sample x _{ i[h]}from the healthy group and another sample x _{ i[d]}from the diseased group randomly. Then the pair (x _{ i[h]}, x _{ i[d]}) is considered consistent if x _{ i[h]}<x _{ i[d]}, tied if x _{ i[h]}= x_{ i[d]}, and inconsistent if x _{ i[h]}> x _{ i[d]}and the pair consistency probability C is defined as:
C={p}_{con}+\frac{1}{2}{p}_{tied},\left(1\right)
where p _{ con }and p _{ tied }denote the probabilities that the pair is consistent and tied respectively.
Next, if we let f _{ h }represent the probability density function (PDF) of X in the healthy group and f _{ d }represent the PDF of X in the diseased group, and let z represent a cutoff point for dichotomization, then the true positive fraction Tp and false positive fraction Fp are defined by
Tp={\displaystyle {\xe2\u02c6\xab}_{z}^{\mathrm{\xe2\u02c6\u017e}}{f}_{d}(x)}dx\text{and}Fp={\displaystyle {\xe2\u02c6\xab}_{z}^{\mathrm{\xe2\u02c6\u017e}}{f}_{h}(x)}dx.
In the case that the variable is continuous, as z increases, Tp and Fp both decrease continuously. The ROC curve [19, 25] can be depicted as the trace of points (Fp , Tp ). Green and Swets [25] demonstrated that
\begin{array}{c}C={\displaystyle {\xe2\u02c6\xab}_{\xe2\u02c6\u2019\mathrm{\xe2\u02c6\u017e}}^{\mathrm{\xe2\u02c6\u017e}}P({x}_{[h]}=z)\xe2\u2039\dots P({x}_{[d]}>z)}dz\\ ={\displaystyle {\xe2\u02c6\xab}_{\xe2\u02c6\u2019\mathrm{\xe2\u02c6\u017e}}^{\mathrm{\xe2\u02c6\u017e}}{f}_{h}(z)\xe2\u2039\dots [{\displaystyle {\xe2\u02c6\xab}_{z}^{\mathrm{\xe2\u02c6\u017e}}{f}_{d}(x)dx]dz}}\\ ={\displaystyle {\xe2\u02c6\xab}_{0}^{1}Tp(z)dFp(z).}\end{array}
This means that the pair consistency probability is equivalent to the area under the ROC curve for continuous variables. We will demonstrate that this relation also holds for polychotomized variables, and that the pair consistency probability C is a good measure to compare the predictive ability with the original continuous variable.
Optimal cutoff point for dichotomization
First, we discuss our method for dichotomization in which a continuous independent variable in a predictive model is categorized to one of two classes by a cutoff point. If we denote the value of the cutoff point z and assume that X is continuous in both the healthy and the diseased groups, that is, P(x _{[h]}= z) = 0 and P(x _{[d]}= z) = 0, the results of random pair sampling are classified into the following four cases:
x _{[h]}<z and x _{[d]}<z, tied
x _{[h]}<z and x _{[d]}> z, consistent
x _{[h]}> z and x _{[d]}<z, inconsistent
x _{[h]}> z and x _{[d]}> z, tied.
Let Î± denote the probability that x _{[h]}is greater than z, and Î² denote the probability that x _{[d]}is less than z. Assuming that the central value of the distribution of the random variable X in the group of healthy cases is smaller than the central value in the group of diseased cases, we have
\mathrm{\xce\pm}={\displaystyle {\xe2\u02c6\xab}_{z}^{\mathrm{\xe2\u02c6\u017e}}{f}_{h}(x)dx=Fp\text{and}\mathrm{\xce\xb2}={\displaystyle {\xe2\u02c6\xab}_{\xe2\u02c6\u2019\mathrm{\xe2\u02c6\u017e}}^{z}{f}_{d}(x)dx=1\xe2\u02c6\u2019Tp.\left(2\right)}}
Then the probability of a consistent pair becomes
p _{ con }= (1  Î±)(1  Î²),
and the probability of a tied pair becomes
p _{ tied }= (1  Î±) Î² + Î± (1  Î²).
Assigning these probabilities into (1), we have
C = 1  (Î± + Î²)/2. (3)
It follows that the highest pair consistency probability is achieved when the sum of the two types of errors, Î± + Î², is minimized. Since sensitivity is (1  Î²) and specificity is (1  Î±), we have
C = (sensitivity + specificity)/2. (4)
Therefore the highest pair consistency probability is achieved when the sum of sensitivity and specificity is maximized.
Figure 1 illustrates the changes of C when f _{ h }and f _{ d }are normal. Let z be the cutoff point where f _{ h }and f _{ d }cross between two peaks. If the cutoff point is shifted to the right from z, then Î± will decrease and Î² will increase. In this case, since f _{ d }is greater than f _{ h }in this interval, the increase of Î² is greater than the decrease of Î±. If the cutoff point is shifted to the left, then the opposite is true. Therefore, the sum of the two types of errors, Î± + Î², occupies the local minimum at the point where f _{ h }and f _{ d }intersect between the peaks. If f _{ h }and f _{ d }are unimodal and cross only at one point, Î± + Î² occupies the true minimum at the cross point.
Generation and meaning of the ROC straight line graph for a dichotomous variable
As we have described earlier, when the independent variable is continuous, Tp and Fp both decrease continuously and the ROC curve can be depicted as the trace of points (Fp , Tp ). But what happens to the ROC curve when the variable is dichotomous? Let z _{0} represent the cutoff point and Fp _{0} and Tp _{0} denote the false positive and true positive fractions for z _{0}, respectively. Unlike the continuous variables, only three points (1, 1), (Fp _{0}, Tp _{0}) and (0, 0) are depicted in Fp  Tp coordinates and we cannot obtain a true curve (see Figure 2). We jointed these points with straight lines, and labelled this graph the ROC straight line graph. Then area A under the ROC straight line graph becomes:
A = Fp _{0} Tp _{0}/2 + (1  Fp _{0})Tp _{0} + (1  Fp _{0})(1  Tp _{0})/2
= 1  (Î± + Î²)/2 = C. (5)
This means that for a dichotomous variable, the area under the ROC straight line graph for a dichotomous variable is, analogous to the case with a continuous variable, equivalent to the pair consistency probability C. Therefore, finding a cutoff point that maximizes C is equivalent to the problem of finding the point (Fp _{0} , Tp _{0} ) on the original ROC curve that maximizes the area A under the ROC straight line graph.
Optimal cutoff points for polychotomization
Next, consider the polychotomous case. Again, let x _{[h]}be a sample from the continuous random variable X in the healthy group and x _{[d]}a sample from the same variable in the disease group, both taken randomly. Let z _{0} = âˆž, z _{ n }= âˆž and z _{1}, z _{2},..., z _{ n1}be cutoff points where z _{1}<z _{2} <...<z _{ n1}. We define that
\begin{array}{cc}{H}_{k}=P({z}_{k\xe2\u02c6\u20191}<{x}_{[h]}<{z}_{k})={\displaystyle {\xe2\u02c6\xab}_{{z}_{k\xe2\u02c6\u20191}}^{{z}_{k}}{f}_{h}(x)}dx& (k=1,\mathrm{...},n)\end{array}
\begin{array}{cc}{D}_{k}=P({z}_{k\xe2\u02c6\u20191}<{x}_{[d]}<{z}_{k})={\displaystyle {\xe2\u02c6\xab}_{{z}_{k\xe2\u02c6\u20191}}^{{z}_{k}}{f}_{d}(x)}dx& (k=1,\mathrm{...},n).\end{array}
Then the probabilities for tied and concordant pairs become
{p}_{tied}={\displaystyle \underset{k=1}{\overset{n}{\xe2\u02c6\u2018}}{H}_{k}\xe2\u2039\dots {D}_{k}}\text{and}{p}_{con}={\displaystyle \underset{k=1}{\overset{n\xe2\u02c6\u20191}{\xe2\u02c6\u2018}}{H}_{k}\xe2\u2039\dots \left({\displaystyle \underset{j=k+1}{\overset{n}{\xe2\u02c6\u2018}}{D}_{j}}\right)},
and the pair consistency probability C can be calculated from equation (1).
We also define
Tp _{ k }= P (x _{[d]}> z _{ k }) and Fp _{ k }= P (x _{[h]}> z _{ k }) (k = 0,..., n).
The points (Fp _{ k }, Tp _{ k }) lie on the original ROC curve, and the set of points (Fp _{ k }, Tp _{ k }) jointed by straight lines yields the ROC straight line graph. Let A represent the area under the ROC straight line graph and A _{ k }represent the area under the line whose ends are (Fp _{ k1}, Tp _{ k1}) and (Fp _{ k }, Tp _{ k }). As illustrated in Figure 3, the area A _{ k }is
{A}_{k}=\frac{1}{2}\{F{p}_{k\xe2\u02c6\u20191}\xe2\u02c6\u2019F{p}_{k}\}\xe2\u2039\dots \{T{p}_{k\xe2\u02c6\u20191}+T{p}_{k}\}.
Therefore,
\begin{array}{l}{A}_{k}=\frac{1}{2}P({z}_{k\xe2\u02c6\u20191}<{x}_{[h]}<{z}_{k})\xe2\u2039\dots \{P({x}_{[d]}>{z}_{k\xe2\u02c6\u20191})+P({x}_{[d]}>{z}_{k})\}\\ =P({z}_{k\xe2\u02c6\u20191}<{x}_{[h]}<{z}_{k})\xe2\u2039\dots P({x}_{[d]}>{z}_{k})+\frac{1}{2}P({z}_{k\xe2\u02c6\u20191}<{x}_{[h]}<{z}_{k})\xe2\u2039\dots P({z}_{k\xe2\u02c6\u20191}<{x}_{[d]}<{z}_{k})\\ =\{\begin{array}{ll}{H}_{k}\xe2\u2039\dots \left({\displaystyle \underset{j=k+1}{\overset{n}{\xe2\u02c6\u2018}}{D}_{j}}\right)+\frac{1}{2}{H}_{k}\xe2\u2039\dots {D}_{k}\hfill & (k=1,\mathrm{...},n\xe2\u02c6\u20191)\hfill \\ \frac{1}{2}{H}_{n}\xe2\u2039\dots {D}_{n}\hfill & (k=n)\hfill \end{array}\end{array}
Then we have
\begin{array}{c}A={\displaystyle \underset{k=1}{\overset{n}{\xe2\u02c6\u2018}}{A}_{k}}\\ ={\displaystyle \underset{k=1}{\overset{n\xe2\u02c6\u20191}{\xe2\u02c6\u2018}}{H}_{k}\xe2\u2039\dots \left({\displaystyle \underset{j=k+1}{\overset{n}{\xe2\u02c6\u2018}}{D}_{j}}\right)+\frac{1}{2}{\displaystyle \underset{k=1}{\overset{n}{\xe2\u02c6\u2018}}{H}_{k}\xe2\u2039\dots {D}_{k}}}\\ ={p}_{con}+\frac{1}{2}{p}_{tied}=C.\end{array}\left(6\right)
Again, the pair consistency probability C for the polychotomized variable is equivalent to the area under its ROC straight line graph, and the problem of finding the optimal cutoff points that maximize C is mathematically equivalent to finding the set of edge points of the ROC straight line graph that maximizes the area A under that graph.
Optimal cutoff points for variables for which normal and diseased cases have a common central tendency
There are many predictive variables whose central values refer to a normal state and whose more dispersed values refer to less preferable states. In the example of rhabdomyolysis prognosis that will follow later, body temperature, pulse rate, plasma sodium, and plasma pH are such variables. For these predictors, we need to find at least two cutoff points to discriminate normal and abnormal states. If we denote the values of the cutoff points z _{1} and z _{2} (z _{1} <z _{2}), and regard the value between these two cutoff points as normal, then type I error Î± and type II error Î² become:
\mathrm{\xce\pm}={\displaystyle {\xe2\u02c6\xab}_{\xe2\u02c6\u2019\mathrm{\xe2\u02c6\u017e}}^{{z}_{1}}{f}_{h}(x)}dx+{\displaystyle {\xe2\u02c6\xab}_{{z}_{2}}^{\mathrm{\xe2\u02c6\u017e}}{f}_{h}(x)}dx=Fp
and
\mathrm{\xce\xb2}={\displaystyle {\xe2\u02c6\xab}_{{z}_{1}}^{{z}_{2}}{f}_{d}(x)}dx=1\xe2\u02c6\u2019Tp.
The pair consistency probability C can now be calculated with equation (3) and the combination of cutoff points (z _{1}, z _{2}) which maximizes (3) becomes the solution. In case of categorization of the variable into more than three states, we can define the optimal combination of cutoff points as follows: Let z _{ n }= âˆž, w _{ n }= âˆž and z _{1}, z _{2},..., z _{ n1}, w _{1}, w _{2},..., w _{ n1}be cutoff points where z_{ n1}<...<z _{2} <z _{1} <w _{1} <w _{2} <...<w _{ n1}, and
\begin{array}{cc}{H}_{1}={\displaystyle {\xe2\u02c6\xab}_{{z}_{1}}^{{w}_{1}}{f}_{h}(x)dx,}& {D}_{1}={\displaystyle {\xe2\u02c6\xab}_{{z}_{1}}^{{w}_{1}}{f}_{d}(x)dx}\end{array}
\begin{array}{cc}{H}_{k}={\displaystyle {\xe2\u02c6\xab}_{{z}_{k}}^{{z}_{k\xe2\u02c6\u20191}}{f}_{h}(x)dx+{\displaystyle {\xe2\u02c6\xab}_{{w}_{k\xe2\u02c6\u20191}}^{{w}_{k}}{f}_{h}(x)dx}}& (k=2,\mathrm{...},n),\end{array}
\begin{array}{cc}{D}_{k}={\displaystyle {\xe2\u02c6\xab}_{{z}_{k}}^{{z}_{k\xe2\u02c6\u20191}}{f}_{d}(x)dx+{\displaystyle {\xe2\u02c6\xab}_{{w}_{k\xe2\u02c6\u20191}}^{{w}_{k}}{f}_{d}(x)dx}}& (k=2,\mathrm{...},n).\end{array}
Then the probabilities for tied and concordant pairs become
{p}_{tied}={\displaystyle \underset{k=1}{\overset{n}{\xe2\u02c6\u2018}}{H}_{k}\xe2\u2039\dots {D}_{k}}\text{and}{p}_{con}={\displaystyle \underset{k=1}{\overset{n\xe2\u02c6\u20191}{\xe2\u02c6\u2018}}{H}_{k}\xe2\u2039\dots \left({\displaystyle \underset{j=k+1}{\overset{n}{\xe2\u02c6\u2018}}{D}_{j}}\right)},
and the pair consistency probability C can be calculated from equation (1). The combination of cutoff points that maximizes C becomes the solution.
Parametric method for estimating cutoff points and predictive performance
The polychotomization methods proposed in the previous sections have been developed under conditions where the exact distribution of a prognostic or diagnostic factor in a population is known. However, in research practice we work with samples and we need to discuss whether our methods can be applied in situations involving parameter uncertainty. Although some methods were developed for correct estimation of the pair consistency probability C in these situations, including nonparametric ones [22â€“24], none of them addressed the estimation of cutoff points and they can therefore not be applied to our setting.
The challenge we are faced with is that if we repeat the evaluation of the pair consistency probability to find optimal cutoff points, for instance by increasing the possible value of the cutoff point with a certain step, it gives rise to estimation error just like the minimum pvalue approach [9â€“18] and would mistakenly lead to an optimistic conclusion on the predictive performance of the model in future observations.
It is clear that we need a practical method that does not suffer from this overestimation bias. In this paper we show that if f _{ h }and f _{ d }can be transformed to normal distributions, a parametric method provides essentially unbiased estimators of predictive performance and cutoff points.
Our method is based on the following:

a)
the assumption that the probability density functions of an independent variable on the healthy and disease groups, f _{ h }and f _{ d }, are both normally distributed or can be transformed to a normal distribution,

b)
the estimation of the means and standard deviations of f _{ h }and f _{ d }, m _{ h }, s _{ h }, m _{ d }, and s _{ d }from sample data,

c)
the localization of the optimal cutoff points based on the estimated distributions {\stackrel{\xcb\u0153}{f}}_{h} and {\stackrel{\xcb\u0153}{f}}_{d}, and

d)
the calculation of the predictive performance based on the estimated cutoff points.
Distributions of the estimators for the cutoff point and the pair consistency probability
If f _{ h }and f _{ d }are both normal and s _{ h }= s _{ d }, then the two curves intersect at x = (m _{ h }+ m _{ d })/2. The pair consistency probability C takes the maximum value at this point as mentioned earlier. In the case that s _{ h }is not equal to s _{ d }, the two curves intersect at the following two points:
x=\frac{1}{({s}_{d}^{2}\xe2\u02c6\u2019{s}_{h}^{2})}\{({m}_{h}{s}_{d}^{2}\xe2\u02c6\u2019{m}_{d}{s}_{h}^{2})\xc2\pm \sqrt{{({m}_{\text{h}}{s}_{\text{d}}^{2}\xe2\u02c6\u2019{m}_{\text{d}}{s}_{\text{h}}^{2})}^{2}\xe2\u02c6\u2019({s}_{\text{d}}^{2}\xe2\u02c6\u2019{s}_{\text{h}}^{2})[({m}_{\text{h}}^{2}{s}_{\text{d}}^{2}\xe2\u02c6\u2019{m}_{\text{d}}^{2}{s}_{\text{h}}^{2})+2{s}_{\text{h}}^{2}{s}_{\text{d}}^{2}\mathrm{log}(\frac{{s}_{\text{h}}}{{s}_{\text{d}}})]\}}\left(7\right)
and the point that is located between m _{ h }and m _{ d }can be used to calculate the true maximum value of the pair consistency probability C with equations (2) and (3). As it is difficult to evaluate the statistical properties of the above formulae analytically, even for the simplest dichotomization case, we performed a Monte Carlo simulation to assess the estimation of the cutoff points and the corresponding C. For these purposes, a custom simulation program was written in the programming language Pascal with the following characteristics:

a)
the assumption that f _{ h }and f _{ d }are both normal,

b)
generation of samples of healthy and disease groups, each with a given number of measurements, by randomly generating the value of the prognostic variable,

c)
estimation of the optimal cutoff points and pair consistency probability C by naÃ¯ve stepwise repeated search, in which the cutoff point is changed with a certain small step Î”z and the corresponding C is evaluated based on the sample data to find a point which gives the maximum C. In case of polychotomization, this step is iterated for every combination of possible cutoff values,

d)
estimation of the parameters of f _{ h }and f _{ d }and calculation of the optimal cutoff points based on the estimated distributions (including the corresponding predictive ability C), in which cutoff points are searched numerically in the same manner as the above stepwise repeated search based not on the sample data but on the estimated PDFs,

e)
repeat the above sample generation and estimating steps 10,000 or 100,000 times for each of various combinations of population parameters.
Extension for multiple associated independent variables
Thus far, we have discussed a method for selecting cutoff points that maximizes the predictive ability of each prognostic variable individually. When a regression model has more than one explanatory variable, the version of our method presented in this article can only be applied if the variables are not associated (no correlation and no interaction). Since associations between prognostic variables are common, our method requires a multivariable extension in which cutoff points are found while taking such associations into account.
Our maximum C index approach can be applied to multivariate scenario if the distributions of a number of prognostic variables for healthy and diseased groups can be described by multivariate normal distributions and if the calculation times are acceptable [26]. However, because we are still in the process of assessing the performance of multivariable extensions and comparisons with other approaches, we will only give a short summary below:

a)
determine the regression model that best fits the observations,

b)
estimate the multivariate normal distribution parameters from the observed data,

c)
for a set of categorized variables defined by a combination of cutoff points, calculate the regression equation and evaluate its overall C index (based not on the observed data but on the estimated distributions),

d)
iterate (c) systematically for every combination of cutoff points and select the combination of cutoff points which gives the maximum overall C index for the regression equation.
Results
Evaluation of the parametric method by Monte Carlo simulation
In this section, we present an evaluation of our parametric method, together with the naÃ¯ve application of a stepwise repeated search based on multiple evaluations. In the absence of a standard method for polychotomization, the latter is currently probably the first choice for researchers, mainly due to its simplicity.
Figures 4, 5, 6, illustrate the frequency distributions of the estimates of predictive performance C for the repeated search method and the parametric method for dichotomization (Figure 4), trichotomization (Figure 5), and polychotomization into four categories (Figure 6), when f _{ h }and f _{ d }are both normally distributed and n _{ h }= n _{ d }= 30. Since the true values for the C were 0.722, 0.748 and 0.755 for dichotomization, trichotomization and polychotomization into four categories, the parametric method provides essentially unbiased normally distributed estimators (means and SDs: 0.725 Â± 0.043, 0.751 Â± 0.048, and 0.758 Â± 0.050), whereas the repeated search method has relatively large positive biases (0.752 Â± 0.048, 0.786 Â± 0.051, and 0.795 Â± 0.053).
Figure 7 shows the frequencies of the optimal cutoff point in dichotomization estimated by the each of two methods. Whereas the true cutoff point is 1.150, the estimated values and their standard deviations are 1.175 Â± 0.209 with the parametric approach and 1.071 Â± 0.433 with the repeated search method, which means the former provides a more accurate estimator for the cutoff point with higher precision.
We repeated the above simulations for various n _{ h }and n _{ d }(n _{ h }= n _{ d }) and Figure 8 and Figure 9 summarize the results. The graphs show that the estimation by the parametric method is almost unbiased even if the sample size is relatively small, both for dichotomization (Figure 8) and trichotomization for variables whose realizations in healthy and diseased groups have a similar central tendency (Figure 9), whereas the naÃ¯ve repeated search method shows nonnegligible bias even when the sample size is large (n = 300).
Distributions of estimators from the parametric method
Table 1 shows how the pair consistency probability C increases when the number of the cutoff point changes from one to three for the case that x _{[h]}~ N(0, 1^{2}) and x _{[d]}~ N(Î¼ _{ d }, 1.5^{2}). For instance, when the pair consistency probability for the original continuous variable is 0.8 (Î¼ _{ d }= 1.517), the pair consistency probability for the dichotomized, trichotomized and quatrochotomized variables are 0.738, 0.775 and 0.787, respectively.
Table 2 summarizes the means and standard deviations of the pair consistency probability C estimated by the parametric method for dichotomization when the sample sizes of the two groups are equal (n = 10, 25, 50, 100, 200, and 500) and Ïƒ _{ d }= 1.5Ïƒ _{ h }. Table 3 gives the results for trichotomization when the continuous variable in healthy and diseased cases has a common central tendency and the sample sizes of the two groups are equal. These tables can be used to evaluate the accuracy and precision of the estimated predictive ability of C for various sample sizes.
Example: Polychotomization of the prognostic factors of rhabdomyolysis
Rhabdomyolysis is a potentially lethal complication, often observed in patients who have attempted suicide with large doses of psychotropic drugs. Though it is important to make the diagnosis and begin proper treatment at an early stage, the diagnosis of rhabdomyolysis is difficult unless specific enzymes and myoglobin in skeletal muscle are detected by laboratory tests.
To find prognostic variables of rhabdomyolysis at an outpatient clinic where laboratory data are not available, we previously evaluated 131 cases of acute drug toxicosis [27â€“29] and found twelve variables to be significantly contributing to diagnosis of rhabdomyolysis (rhabdomyolysis group: n = 34, nonrhabdomyolysis group: n = 97). For this example, we selected three non laboratory data variables to predict the risk at the outpatient clinic: (1) qtc: ECG QTc (nondimensional); (2) t: time from taking the drug to hospitalization (hours); and (3) bt, body temperature (Celsius).
Applying the maximum pair consistency probability criterion, the three continuous variables are categorized, assuming that qtc is a normal variable, t a lognormal variable and bt a variable with a common central tendency. Table 4 shows the selected cutoff points and the changes of the pair consistency probability. Comparing the pair consistency probabilities of the categorized variable, we can observe how predictive ability changes with polychotomization and the pair consistency probability C can be used as a measure to evaluate the loss of predictive ability by categorization.
Considering the predictive performance of the each of the categorized variables and convenience in the clinical setting, we finally chose the cutoff point values 0.45 for qtc, 5.0 and 12.0 for t, and 34.0 and 37.2 for bt. We then converted the continuous variables to categorical variables. Next, we applied the crosssplithalfmethod [30] to validate the effectiveness of prediction by these variables with logistic regression [31] and evaluated the amount of over estimation of prediction performance by a single data set. The estimated optimism for the overall C index was 0.018, which is sufficiently small.
Example: Risk table for prognosis of rhabdomyolysis
Based on categorized variables, we obtained the new prediction formula:
p = 1/(1 + exp(7.96  3.13QTC  6.22T _{1}  3.11T _{2}  1.97BT) (8)
where QTC is ECG QTc (1 for more than or equal to 0.45 and 0 for less than 0.45), T _{1} is the time from drug ingestion to hospitalization (1 for more than or equal to 12 hours, 0 for otherwise), T _{2} is also the time from drug ingestion to hospitalization (1 for less than 12 hours and more than or equal to 5 hours, 0 for otherwise), and BT is body temperature (1 for more than or equal to 37.2Â° or less than or equal to 34.0Â°, and 0 for otherwise). Since the overall index C for this formula was 0.945, we estimate the predictive performance in future data will be around 0.927(= 0.945  0.018).
To ascertain the fitness of the selected regression model, we conducted the HosmerLemeshow goodnessoffit test [32] by dividing disease probability into eight classes. The actual number of occurrences for each class showed good agreement with the expected number of occurrences of rhabdomyolysis (p = 0.618).
Since all the three prognostic variables are categorized, the number of patient profiles becomes twelve and the risk probabilities of rhabdomyolysis for all possible patient profiles can now be obtained by assigning a combination of the values of categorized variables into regression formula (8). This yields a risk table for rhabdomyolysis occurrence (Table 5). For instance, if T, QTC and BT are "++ ", "+ " and " " respectively, we can read from the table that the risk of rhabdomyolysis is 0.801. Repeated use of this table over time will give physicians a "sense" of the disease risk.
Discussion
The criterion for optimal categorization of continuous variables in regression models may vary depending on the object of the categorization, and there have been several different approaches. Many of these approaches are inadequate for our purpose. We have proposed to use the overall discrimination index C introduced by Harrel and other authors [21â€“24] as the measure for predictive performance of a categorized variable. Since the overall discrimination index C has a clear and straight forward meaning as the pair consistency probability, it is intuitively logical to use it as a measure for the predictive discrimination for polychotomized variables.
Though mathematically distinct, our method has much in common with previously developed methods [2â€“20, 33â€“38], which can be explained through the relations between the pair consistency probability C, SE and SP, and the area under the ROC straight line graph, as is expressed in formulae (2) to (6). In addition, our ROC straight line graph has a close relation with ordinal dominance or the OD curve proposed by Darlington to visualize the ordering feature of two comparative sets [39]. He showed that the OD curve is a complete representation of the rankorder properties of data and many statistical procedures follow naturally from assessment of the curve. Bamber clarified the relation between the area above the OD curve and a measure identical to the pair consistency probability [40]. Our proof of formula (6) related to the ROC straight line graph corresponds to Bamber's OD curve related proof.
Monte Carlo simulation showed that the naÃ¯ve search of the maximum C index will give rise to an estimation bias, which is very much like the positive bias that affects the minimum pvalue method. Such bias is also seen in the method where the cutoff point is selected in a way that maximizes the sum of SE and SP. Linnet and Brandt calculated the sample distribution of (SE + SP)/2 in the case of dichotomization using computer simulation assuming that distributions are normal, and evaluated the positive bias induced by the selection of an optimal cutoff point [4]. They found that estimates of test performance are too optimistic when the sample size is small, with an average positive bias up to 15% for a sample size of 25. We have shown that this problem does not affect our proposed parametric method.
However, there may be cases where a transformation to a normal distribution does not work well. For such cases, we conceive that approximation of distribution curve by a more suitable function or a restricted cubic spline function [41] creates a workable situation. We are currently in the process of evaluating this approach and the results will be reported elsewhere.
To keep this introduction of the maximum C index approach for polychotomizing predictive variables short and readable, we have used an example in which a regression model without correlated independent variables and without interaction fitted the observed data well (p = 0.618 by Hosmer and Lemeshow goodnessoffit test) [42, 43]. However, if correlation and interaction are relevant for the regression function, our maximum C index approach must be extended to a multivariable setting. Mazumdar extended a cutoff point search based on the maximum chisquare method to a multivariable setting [44], and showed that the cutoff points obtained by a multivariable search were closer to the true cutoff points.
Another method that is appealing for regression settings with correlated independent variables, is the socalled 'simplified integer score' method in which continuous variables are transformed into semicontinuous interval variables [41]. It has been used in numerous articles and is based on the categorization of the continuous variable, and the transformation of the products of the regression coefficient and the value of the variable into integers. This method is clinically useful and can be applied to the situation where explanatory variables are correlated. If the number of variables is small enough and they have few classifications, this method can also be used to create the simple probability profile tables that result from our approach. We are currently in the process of evaluating a multivariable extension of the C index maximization approach, including a comparison with this method.
Along with regression models, decision trees can also be used in diagnostic or prognostic decision making [36]. Breiman et al. developed an approach called classification and regression trees (CART) to build a decision tree for medical diagnosis based on a training data set [41, 45]. In these decision trees, diagnosis is made by a sequential decision making process, in which a question on an independent variable is posed at each step and, depending on the answer, a different "branch" of the tree is selected until the final result is achieved. If an independent variable is continuous, dichotomization (or polychotomization) will be necessary to build a decision tree. Typically, the cutoff points are found by maximizing the total utility of decision scheme [46, 47], which appears to be closely related mathematically to our approach. Further study is necessary to make a theoretical and practical comparison.
We have indicated that it is easier for most people to read a probability profile table to obtain the risk probability than to calculate the risk with a regression formula. Additionally, probability profile tables give physicians an intuitive feel for the disease risk. Even if the value of one or two of the prognostic variables is not available, physicians can obtain a probability range corresponding to the patient's risk by referring to both the positive and negative cases from the table. By making simplified risk tables in advance, physicians can obtain the patient's risk from an auxiliary table, even if the value of a predictor is missing. Since the table presentation of probabilities has these practical advantages, we believe our method for categorizing prognostic variables can be a helpful tool to make diagnostic or descriptive prognostic research with regression models become more applicable in clinical practice.
Conclusion
We have proposed a new approach for polychotomization (including dichotomization) of independent continuous variables in regression models based on the overall discrimination index C, or the pair consistency probability, introduced by Harrel. We have shown that this index is closely related to the area under the ROC curve for the original continuous variable and that the resulting categorized variables have predictive properties comparable to the original continuous variable. We showed that the naÃ¯ve application of the method gives rise to positive bias, not unlike the minimum pvalue approach or the method of maximizing the sum of sensitivity and specificity, and we proposed a parametric version in which the estimates of the predictive performance and cutoff points are essentially unbiased. To evaluate the accuracy and precision of the estimate of the predictive performance, we presented tables of the means and standard deviations of the estimate of predictive performance for typical cases by the use of Monte Carlo simulation. Finally we provided an application of our method to a prediction rule with continuous predictor variables for rhabdomyolysis and showed that our method for polychotomizing continuous regressor variables can be a valid and useful tool to create probability profile tables. All programs (and their source codes) used in this study are available from the authors.
References
Miettinen OS: The modern scientific physician: 3. Scientific diagnosis. CMAJ. 2001, 165 (6): 7812.
Fisher RA: The use of multiple measurements in taxonomic problems. Annals of Eugenics. 1936, 7 (2): 179188.
Mahalanobis PC: Mahalanobis distance. Proceedings National Institute of Science of India. 1936, 49 (2): 234256.
Linnet K, Brandt E: Assessing diagnostic tests once an optimal cutoff point has been selected. Clin Chem. 1986, 32 (7): 13411346.
Bairagi R, Suchindran CM: An estimator of the cutoff point maximizing sum of sensitivity and specificity. Indian J Stat. 1989, 51 (B2): 263269.
SchÃ¤fer H: Constructing a cutoff point for a quantitative diagnostic test. Stat Med. 1989, 8: 13811391.
Gail MH, Green SB: A generalization of the onesided twosample KolmogorovSmirnov statistics for evaluating diagnostic tests. Biometrics. 1976, 32: 561570. 10.2307/2529745.
Cantor SB, Sun CC, TortoleroLuna G, RichardsKortum R, Follen M: A comparison of C/B ratios from studies using receiver operating characteristic curve analysis. J Clin Epidemiol. 1999, 52: 885892. 10.1016/S08954356(99)00075X.
Miller R, Siegmund D: Maximally selected chi square statistics. Biometrics. 1982, 38: 10111016. 10.2307/2529881.
Lausen B, Schumacher M: Maximally selected rank statistics. Biometrics. 1992, 48: 7385. 10.2307/2532740.
Altman DG, Lausen B, Sauerbrei W, Schumacher M: Dangers of using "optimal" cutpoints in the evaluation of prognostic factors. J Natl Cancer Inst. 1994, 86 (11): 829835. 10.1093/jnci/86.11.829.
Mazumdar M, Glassman J: Categorizing a prognostic variable: review of methods, code for easy implementation and applications to decisionmaking about cancertreatments. Stat Med. 2000, 19: 113132. 10.1002/(SICI)10970258(20000115)19:1<113::AIDSIM245>3.0.CO;2O.
Hilsenbeck SG, Clark GM, McGuire WL: Why do so many prognostic factors fail to pan out?. Breast Cancer Res Treat. 1992, 22: 197206. 10.1007/BF01840833.
Cantor AB: Re: Dangers of using "optimal" cutpoints in the evaluation of prognostic factors. J Natl Cancer Inst. 1994, 86 (23): 179810.1093/jnci/86.23.1798a.
Lausen B, Schumacher M: Evaluating the effect of optimized cutoff values in the assessment of prognostic factors. Comp Stat Data Analysis. 1996, 21: 307326. 10.1016/01679473(95)00016X.
Hilsenbeck SG, Clark GM: Practical pvalue adjustment for optimally selected cutpoints. Stat Med. 1996, 15: 103112. 10.1002/(SICI)10970258(19960115)15:1<103::AIDSIM156>3.0.CO;2Y.
Faraggi D, Simon R: A simulation study of crossvalidation for selecting an optimal cutpoint in univariable survival analysis. Stat Med. 1996, 15: 22032213. 10.1002/(SICI)10970258(19961030)15:20<2203::AIDSIM357>3.0.CO;2G.
Contal C, O'Quigley J: An application of changepoint methods in studying the effect of age on survival in breast cancer. Comp Stat Data Analysis. 1999, 30: 253270. 10.1016/S01679473(98)000966.
Metz CE: Basic principles of ROC analysis. Semin Nucl Med. 1978, 8 (4): 283298.
Kristjansson B, Hill G, McDowell I, Lindsay J: Optimal cutpoints when screening for more than one disease state: an example from the Canadian study of health and aging. J Clin Epidemiol. 1996, 49 (12): 142328. 10.1016/S08954356(96)002727.
Harrell FE Jr, Califf RM, Pryor DB, Lee KL, Rosati RA: Evaluating the yield of medical tests. JAMA. 1982, 247 (18): 25432546. 10.1001/jama.247.18.2543.
Harrell FE Jr, Lee KL, Mark DB: Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996, 15 (4): 361387. 10.1002/(SICI)10970258(19960229)15:4<361::AIDSIM168>3.0.CO;24.
Nam B, D'Agostino RB: Discrimination index, the area under the ROC curve. GoodnessofFit Tests and Model Validity. Edited by: HuberCarol C. 2003, Boston: Birkhauser, 267279.
Pencina MJ, D'Agostino RB: Overall C as a measure of discrimination in survival analysis: model specific population value and confidence interval estimation. Stat Med. 2004, 23 (13): 21092123. 10.1002/sim.1802.
Green DM, Swets JA: Signal detection theory and psychophysics. 1966, New York: Wiley
Tsuruta H, Tsutsumi K, Doi K: The changes of predictive ability when prognostic factors are categorized. Proceedings of the 24th Joint Conference on Medical Informatics: 26â€“28 November 2004; Nagoya. 2004, Japan Association for Medical Informatics, 824825.
Morita S, Tsutsumi K, Doi K, Tsuruta H: Prediction of rhabdomyolysis occurring in patients with acute drug toxicosis by logistic regression model. Jpn J Gen Hosp Psychiatry. 1998, 10: 3743.
Tsuruta H, Tsutsumi K, Doi K: Prediction of rhabdomyolysis in patients with acute drug toxicosis. Proceedings of the 21th Joint Conference on Medical Informatics: 26â€“28 November 2001; Hamamatsu. 2001, Japan Association for Medical Informatics, 514515.
Tsuruta H, Tsutsumi K, Mochizuki M: Table presentation of the risk of rhabdomyolysis by the use of an optimal categorization method for prognostic factors and logistic regression analysis. Proceedings of the 11th World Congress on Medical Informatics: 7â€“11 September 2004; San Francisco. AMIA. Edited by: Fieschi M, Coiera E, Li YCJ. 2004, 1888
Cooper RG: An empirically derived new product project selection model. IEEE Trans Eng Manag. 1981, 28 (3): 5461.
Walker SH, Duncan DB: Estimation of the probability of an event as a function of several independent variables. Biometrika. 1967, 54 (1 and 2): 167179. 10.2307/2333860.
Lemeshow S, Hosmer DW: A review of goodness of fit statistics for use in the development of logistic regression models. Am J Epidemiol. 1982, 115 (1): 92106.
Metz CE, Kronman HB: Statistical significance tests for binormal ROC curves. J Math Psych. 1980, 22: 218243. 10.1016/00222496(80)900206.
Hanley JA, McNeil BJ: The meaning and use of the area under the receiver operating characteristic (ROC) curve. Radiology. 1982, 143 (1): 2936.
Hanley JA, McNeil BJ: A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology. 1983, 148 (3): 83943.
Hunink M, Glasziou P, Siegel J, Weeks J, Pliskin J, Elstein A, Milton CW: Decision Making in Health and Medicine: Integrating Evidence and Values. 2001, Cambridge: Cambridge University Press
Faraggi D, Reiser B: Estimation of the area under the ROC curve. Stat Med. 2002, 21 (20): 30933106. 10.1002/sim.1228.
Copas JB, Corbett P: Overestimation of the receiver operating characteristic curve for logistic regression. Biometrika. 2002, 89 (2): 315331. 10.1093/biomet/89.2.315.
Darlington RB: Comparing two groups by simple graphs. Psychol Bull. 1973, 79 (2): 110116. 10.1037/h0033854.
Bamber D: Area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J Math Psych. 1975, 12: 387415. 10.1016/00222496(75)900012.
Harrell FE: Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis. 2001, New York: Springer
Kleinbum DG, Kupper LL, Muller KE: Applied regression analysis and other multivariable methods. 1998, Boston: PWSKent Publishing Company
Hosmer DW, Lemeshow S: Applied Logistic Regression. 2000, New York: John Wiley and Sons
Mazumdar M, Smith A, Bacik J: Methods for categorizing a prognostic variable in a multivariable setting. Stat Med. 2003, 22: 559571. 10.1002/sim.1333.
Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. 1984, Belmont: Wadsworth
Long WJ, Griffith JL, Selker HP, D'Agostino RB: A comparison of logistic regression to decisiontree induction in a medical domain. Comput Biomed Res. 1993, 26: 7497. 10.1006/cbmr.1993.1005.
Shannon CE: A Mathematical Theory of Communication. The Bell System Tech J. 1948, 27: 379423.
Prepublication history
The prepublication history for this paper can be accessed here:http://www.biomedcentral.com/14726947/6/41/prepub
Acknowledgements
The authors would like to thank the late Dr. T. Tsutsumi and Dr. S. Morita for their contributions to the analysis of the rhabdomyolysis data. The authors would also like to thank Assistant Professor J. Goddard for providing useful comments and we are grateful to K. Doi for her technical assistance.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The author(s) declare that they have no competing interests.
Authors' contributions
HT derived the polychotomization method, drafted the manuscript and supervised the study. LB provided feedback on methodological issues and contributed to data analysis and manuscript writing. All authors read and approved the final manuscript.
Authorsâ€™ original submitted files for images
Below are the links to the authorsâ€™ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Tsuruta, H., Bax, L. Polychotomization of continuous variables in regression models based on the overall C index. BMC Med Inform Decis Mak 6, 41 (2006). https://doi.org/10.1186/14726947641
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/14726947641