Polychotomization of continuous variables in regression models based on the overall C index
- Harukazu Tsuruta^{1} and
- Leon Bax^{2}
DOI: 10.1186/1472-6947-6-41
© Tsuruta and Bax; licensee BioMed Central Ltd. 2006
Received: 23 May 2006
Accepted: 14 December 2006
Published: 14 December 2006
Abstract
Background
When developing multivariable regression models for diagnosis or prognosis, continuous independent variables can be categorized to make a prediction table instead of a prediction formula. Although many methods have been proposed to dichotomize prognostic variables, to date there has been no integrated method for polychotomization. The latter is necessary when dichotomization results in too much loss of information or when central values refer to normal states and more dispersed values refer to less preferable states, a situation that is not unusual in medical settings (e.g. body temperature, blood pressure). The goal of our study was to develop a theoretical and practical method for polychotomization.
Methods
We used the overall discrimination index C, introduced by Harrell, as a measure of the predictive ability of an independent regressor variable and derived a method for polychotomization mathematically. Since the naïve application of our method, like some existing methods, gives rise to positive bias, we developed a parametric method that minimizes this bias and assessed its performance by Monte Carlo simulation.
Results
The overall C is closely related to the area under the ROC curve, and the predictive performance of the resulting di(poly)chotomized variable is comparable to that of the original continuous variable. The simulation shows that the parametric method is essentially unbiased for both the estimates of performance and the cutoff points. Application of our method to the predictor variables of a previous study on rhabdomyolysis shows that it can be used to make probability profile tables that are applicable to the diagnosis or prognosis of individual patient status.
Conclusion
We have proposed a polychotomization (including dichotomization) method for independent continuous variables in regression models based on the overall discrimination index C and have clarified its meaning mathematically. To avoid positive bias in application, we have proposed and evaluated a parametric method. The proposed method for polychotomizing continuous regressor variables performed well and can be used to create probability profile tables.
Background
In modern diagnostic and descriptive prognostic research, regression models are often used to model an illness-related outcome based on a number of independent regressor variables, also referred to as diagnostic indicators or prognostic predictors [1]. Such regressor variables can be categorical or numerical. From the vantage point of applicability in a clinical setting, categorization (often dichotomization) of continuous independent variables can be useful. Obtaining a prediction at the bedside without a computer is easier with a prediction table based on categorized variables than with a prediction formula. Even if calculation is not problematic, table presentation of the risks has the practical advantages that (1) repeated use of the table will give physicians an intuitive feel for the disease risk, and (2) even if the value of one or two of the prognostic variables is not available, physicians can obtain a probability range corresponding to the patient's risk by referring to the most extreme cases in the table.
Depending on the setting, several different approaches have been proposed for dichotomization. One popular method is to find a cutoff point to discriminate whether a patient belongs to a normal group or a disease group based on the observed value of a predictive factor. This type of discriminant function analysis was first developed by R.A. Fisher [2] in the 1930s. The Mahalanobis distance [3] can be used to find the optimal cutoff point if the variable is normally distributed.
Another solution, sometimes used in clinical chemistry, is to find a cutoff point that maximizes the sum of sensitivity (SE) and specificity (SP) [4, 5]. There are different versions of this approach where one can maximize the weighted sum of SE and SP, or maximize the SE while fixing SP to an acceptable value [6, 7]. Cantor et al. claimed that these methods have been used in many published articles without giving a theoretical foundation or scientific justification [8].
Yet another straightforward and popular method is to select the classification that maximizes a measure of difference between the two groups, such as a chi-square statistic (equivalently, the classification that minimizes the corresponding p-value) [9, 10]. This method, sometimes called the minimum p-value approach, has been described and used for the prognosis of cancers [11, 12]. Several authors have pointed out that the naïve selection used in this method overestimates the significance of the predictor or indicator's relationship to the dependent variable because of multiple testing, and several adjustment methods of the observed p-values have been proposed [9–18].
Besides using the data at hand to come to a dichotomization of continuous variables, it is also possible to use profit (benefit) or loss (cost) information. In that case, the optimal cutoff point is defined so as to maximize the expected utility. Metz showed that the optimal point is the spot on the ROC curve at which the slope is (L/B)(1-p)/p, where B is the net benefit of treating diseased individuals, L the net loss of treating non-diseased individuals, and p the prevalence of the disease under study [19]. Nevertheless, Cantor et al., in a review of studies in the medical literature that referred to "ROC" and "cutoff", found that only a few articles included a L/B ratio in the analysis for determining an optimal cutoff point [8].
The above methods all concern dichotomization. However, when central values refer to normal states and dispersed values to diseased states, two (or more) cutoff points are necessary to discriminate these states. Consequently, one is inevitably faced with the challenge of polychotomization. Unfortunately, methods for polychotomization are less developed. Although Kristjansson et al. [20] described a method for choosing optimal cutoff points in a screening test with a continuous score to divide people into a number of disease categories, their method is not applicable to polychotomization of regressor variables in regression models; their criterion loses its meaning in this setting.
The major goal of our study is to develop a theoretical and practical method for polychotomization. We propose a novel approach for independent continuous variables in regression models based on the overall discrimination index C introduced by Harrell et al. [21, 22]. We will show that this index is closely related to the area under the ROC curve for the original continuous variable and that the resulting categorized variables have predictive properties comparable to the original continuous variable. However, the naïve search of the maximum C index gives rise to positive bias, not unlike the minimum p-value approach [9–18] or the method of maximizing the sum of the sensitivity and specificity [4, 5]. We therefore propose a parametric version in which the estimates of the predictive performance and cutoff points are both essentially unbiased. We evaluate this method and present means and standard deviations of predictive performance and cutoff point estimates for typical cases via Monte Carlo simulation. Finally, we provide a simple application example with a predictive regression model for rhabdomyolysis and show how our method can be used to create a probability profile table.
Methods
The categorization criterion
We assume there is an existing predictive model based on patients that belong to either a normal group or a diseased group and that the distribution of the relevant independent continuous variable X is known or that we have observations on it. Our goal is to find a method of optimal polychotomization for this continuous variable with a minimum loss of predictive ability. This involves making the number of possible patient profiles finite, and replacing the regression formula with a table of the risk probabilities for all patient profiles. Unlike most previously developed approaches, we have no a priori intention to categorize the variable into two classes, and we assume that it might be necessary to compare categorizations into three or more classes.
For this discussion we need a measure to evaluate the predictive power of a predictive variable. Our choice for a measure of predictive power is the overall discrimination index C [21–24], or the 'pair consistency probability', as we like to call it. This measure refers to the probability that the relative position of single normal-disease pair values is consistent with the relative position of their values of central tendency.
Without loss of generality, we assume that the central value of the distribution of the random variable X in the group of healthy cases is smaller than the central value in the group of diseased cases. Next we randomly take a sample x _{ i[h]} from the healthy group and another sample x _{ i[d]} from the diseased group. The pair (x _{ i[h]}, x _{ i[d]}) is considered consistent if x _{ i[h]} < x _{ i[d]}, tied if x _{ i[h]} = x _{ i[d]}, and inconsistent if x _{ i[h]} > x _{ i[d]}. The pair consistency probability C is defined as:
$C={p}_{con}+\frac{1}{2}{p}_{tied},\qquad \left(1\right)$
where p _{ con }and p _{ tied }denote the probabilities that the pair is consistent and tied respectively.
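As a minimal sketch of definition (1), the sample analogue of C can be obtained by counting consistent and tied pairs over all healthy-diseased pairs. The function name `pair_consistency` and the toy data are illustrative assumptions, and the sample-based count is only an estimate of the population probability defined above.

```python
def pair_consistency(healthy, diseased):
    """Sample estimate of C = p_con + p_tied / 2 over all healthy-diseased pairs."""
    consistent = 0
    tied = 0
    for xh in healthy:
        for xd in diseased:
            if xh < xd:        # consistent pair: healthy value below diseased value
                consistent += 1
            elif xh == xd:     # tied pair counts for one half
                tied += 1
    n_pairs = len(healthy) * len(diseased)
    return (consistent + 0.5 * tied) / n_pairs


# Toy data (hypothetical): 7 consistent pairs, 1 tied pair, 1 inconsistent pair.
healthy = [1.0, 2.0, 3.0]
diseased = [2.0, 3.5, 4.0]
print(pair_consistency(healthy, diseased))
```

With these toy values the count is (7 + 0.5)/9, so a value of 0.5 would correspond to no discrimination and 1.0 to perfect separation.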
Next, if we let f _{ h }represent the probability density function (PDF) of X in the healthy group and f _{ d }represent the PDF of X in the diseased group, and let z represent a cutoff point for dichotomization, then the true positive fraction Tp and false positive fraction Fp are defined by
$Tp={\displaystyle {\int}_{z}^{\infty}{f}_{d}(x)}dx\quad \text{and}\quad Fp={\displaystyle {\int}_{z}^{\infty}{f}_{h}(x)}dx.$
In the case that the variable is continuous, as z increases, Tp and Fp both decrease continuously. The ROC curve [19, 25] can be depicted as the trace of points (Fp , Tp ). Green and Swets [25] demonstrated that
$\begin{array}{c}C={\displaystyle {\int}_{-\infty}^{\infty}P({x}_{[h]}=z)\cdot P({x}_{[d]}>z)}dz\\ ={\displaystyle {\int}_{-\infty}^{\infty}{f}_{h}(z)\cdot [{\displaystyle {\int}_{z}^{\infty}{f}_{d}(x)dx]dz}}\\ ={\displaystyle {\int}_{0}^{1}Tp(z)dFp(z).}\end{array}$
This means that the pair consistency probability is equivalent to the area under the ROC curve for continuous variables. We will demonstrate that this relation also holds for polychotomized variables, and that the pair consistency probability C is a good measure to compare the predictive ability with the original continuous variable.
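The integral above can be checked numerically for a simple case. The sketch below assumes that X is normal in both groups, with illustrative parameters chosen here (healthy: mean 0, SD 1; diseased: mean 1.216, SD 1.5). It compares a midpoint-rule approximation of the integral of f_h(z)·P(x_[d] > z) with the standard closed form for two independent normal variables, Φ((m_d − m_h)/√(s_h² + s_d²)). The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def c_by_integration(healthy, diseased, steps=20000):
    """Midpoint-rule approximation of the integral of f_h(z) * P(x_d > z) dz."""
    lo = min(healthy.mean - 8 * healthy.stdev, diseased.mean - 8 * diseased.stdev)
    hi = max(healthy.mean + 8 * healthy.stdev, diseased.mean + 8 * diseased.stdev)
    dz = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * dz
        total += healthy.pdf(z) * (1.0 - diseased.cdf(z)) * dz
    return total


def c_closed_form(healthy, diseased):
    """P(x_h < x_d) for two independent normal variables."""
    spread = sqrt(healthy.variance + diseased.variance)
    return NormalDist().cdf((diseased.mean - healthy.mean) / spread)


# Assumed illustrative distributions, not taken from the text.
healthy = NormalDist(0.0, 1.0)
diseased = NormalDist(1.216, 1.5)
print(c_by_integration(healthy, diseased), c_closed_form(healthy, diseased))
```

Under these assumed parameters both values come out close to 0.75.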
Optimal cutoff point for dichotomization
First, we discuss our method for dichotomization in which a continuous independent variable in a predictive model is categorized to one of two classes by a cutoff point. If we denote the value of the cutoff point z and assume that X is continuous in both the healthy and the diseased groups, that is, P(x _{[h]}= z) = 0 and P(x _{[d]}= z) = 0, the results of random pair sampling are classified into the following four cases:
x _{[h]}<z and x _{[d]}<z, tied
x _{[h]}<z and x _{[d]}> z, consistent
x _{[h]}> z and x _{[d]}<z, inconsistent
x _{[h]}> z and x _{[d]}> z, tied.
Let α denote the probability that x _{[h]}is greater than z, and β denote the probability that x _{[d]}is less than z. Assuming that the central value of the distribution of the random variable X in the group of healthy cases is smaller than the central value in the group of diseased cases, we have
$\alpha ={\int}_{z}^{\infty}{f}_{h}(x)\,dx=Fp\quad \text{and}\quad \beta ={\int}_{-\infty}^{z}{f}_{d}(x)\,dx=1-Tp.\qquad \left(2\right)$
Then the probability of a consistent pair becomes
p _{ con }= (1 - α)(1 - β),
and the probability of a tied pair becomes
p _{ tied }= (1 - α) β + α (1 - β).
Substituting these probabilities into (1), we have
C = 1 - (α + β)/2. (3)
It follows that the highest pair consistency probability is achieved when the sum of the two types of errors, α + β, is minimized. Since sensitivity is (1 - β) and specificity is (1 - α), we have
C = (sensitivity + specificity)/2. (4)
Therefore the highest pair consistency probability is achieved when the sum of sensitivity and specificity is maximized.
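Equations (3) and (4) suggest a direct search for the cutoff point when the two distributions are known. The following sketch assumes normal distributions with illustrative parameters and restricts the search to the interval between the two means. The grid step and the function names `c_dichotomous` and `best_cutoff` are assumptions made for this example.

```python
from statistics import NormalDist


def c_dichotomous(z, healthy, diseased):
    """Equation (3): C = 1 - (alpha + beta) / 2 for a single cutoff z."""
    alpha = 1.0 - healthy.cdf(z)   # P(x_h > z) = Fp
    beta = diseased.cdf(z)         # P(x_d < z) = 1 - Tp
    return 1.0 - (alpha + beta) / 2.0


def best_cutoff(healthy, diseased, lo, hi, step=0.001):
    """Grid search for the cutoff in [lo, hi] that maximizes C."""
    n = int(round((hi - lo) / step))
    candidates = (lo + i * step for i in range(n + 1))
    return max(candidates, key=lambda z: c_dichotomous(z, healthy, diseased))


# Assumed illustrative distributions with equal standard deviations.
healthy = NormalDist(0.0, 1.0)
diseased = NormalDist(2.0, 1.0)
z = best_cutoff(healthy, diseased, healthy.mean, diseased.mean)
c = c_dichotomous(z, healthy, diseased)
sensitivity = 1.0 - diseased.cdf(z)
specificity = healthy.cdf(z)
print(z, c, (sensitivity + specificity) / 2.0)
```

With equal standard deviations the search settles near the midpoint of the two means, and the last two printed values agree, as equation (4) requires.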
Generation and meaning of the ROC straight line graph for a dichotomous variable
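For a dichotomous variable with cutoff point z, let (Fp _{0} , Tp _{0} ) be the corresponding point on the ROC curve. Connecting the points (0, 0), (Fp _{0} , Tp _{0} ), and (1, 1) by straight lines gives the ROC straight line graph. The area A under this graph is
A = Fp _{0} Tp _{0}/2 + (1 - Fp _{0})Tp _{0} + (1 - Fp _{0})(1 - Tp _{0})/2
= 1 - (α + β)/2 = C. (5)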
This means that, analogous to the case of a continuous variable, the area under the ROC straight line graph for a dichotomous variable is equivalent to the pair consistency probability C. Therefore, finding a cutoff point that maximizes C is equivalent to the problem of finding the point (Fp _{0} , Tp _{0} ) on the original ROC curve that maximizes the area A under the ROC straight line graph.
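The algebra behind (5) can be written out by substituting Fp _{0} = α and Tp _{0} = 1 - β from (2) into the three-term expression for A:
$\begin{array}{rl}A&=\frac{1}{2}\alpha (1-\beta )+(1-\alpha )(1-\beta )+\frac{1}{2}(1-\alpha )\beta \\ &=\frac{1}{2}\left[(\alpha -\alpha \beta )+(2-2\alpha -2\beta +2\alpha \beta )+(\beta -\alpha \beta )\right]\\ &=1-\frac{\alpha +\beta }{2}.\end{array}$
The αβ terms cancel, which leaves the same expression as in (3).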
Optimal cutoff points for polychotomization
Next, consider the polychotomous case. Again, let x _{[h]} be a sample from the continuous random variable X in the healthy group and x _{[d]} a sample from the same variable in the disease group, both taken randomly. Let z _{0} = -∞, z _{ n } = ∞ and z _{1}, z _{2},..., z _{ n-1} be cutoff points where z _{1} < z _{2} <...< z _{ n-1}. We define
$\begin{array}{cc}{H}_{k}=P({z}_{k-1}<{x}_{[h]}<{z}_{k})={\displaystyle {\int}_{{z}_{k-1}}^{{z}_{k}}{f}_{h}(x)}dx& (k=1,\mathrm{...},n)\end{array}$
$\begin{array}{cc}{D}_{k}=P({z}_{k-1}<{x}_{[d]}<{z}_{k})={\displaystyle {\int}_{{z}_{k-1}}^{{z}_{k}}{f}_{d}(x)}dx& (k=1,\mathrm{...},n).\end{array}$
Then the probabilities for tied and consistent pairs become
${p}_{tied}={\displaystyle \sum _{k=1}^{n}{H}_{k}\cdot {D}_{k}}\quad \text{and}\quad {p}_{con}={\displaystyle \sum _{k=1}^{n-1}{H}_{k}\cdot \left({\displaystyle \sum _{j=k+1}^{n}{D}_{j}}\right)},$
and the pair consistency probability C can be calculated from equation (1).
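We also define
Tp _{ k } = P (x _{[d]} > z _{ k }) and Fp _{ k } = P (x _{[h]} > z _{ k }) (k = 0,..., n).
The ROC straight line graph connects the points (Fp _{ k }, Tp _{ k }) (k = 0,..., n), and the area A _{ k } of the trapezoid under the segment between (Fp _{ k-1}, Tp _{ k-1}) and (Fp _{ k }, Tp _{ k }) is
${A}_{k}=\frac{1}{2}\{F{p}_{k-1}-F{p}_{k}\}\cdot \{T{p}_{k-1}+T{p}_{k}\}.$
Therefore,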
$\begin{array}{l}{A}_{k}=\frac{1}{2}P({z}_{k-1}<{x}_{[h]}<{z}_{k})\cdot \{P({x}_{[d]}>{z}_{k-1})+P({x}_{[d]}>{z}_{k})\}\\ =P({z}_{k-1}<{x}_{[h]}<{z}_{k})\cdot P({x}_{[d]}>{z}_{k})+\frac{1}{2}P({z}_{k-1}<{x}_{[h]}<{z}_{k})\cdot P({z}_{k-1}<{x}_{[d]}<{z}_{k})\\ =\{\begin{array}{ll}{H}_{k}\cdot \left({\displaystyle \sum _{j=k+1}^{n}{D}_{j}}\right)+\frac{1}{2}{H}_{k}\cdot {D}_{k}\hfill & (k=1,\mathrm{...},n-1)\hfill \\ \frac{1}{2}{H}_{n}\cdot {D}_{n}\hfill & (k=n)\hfill \end{array}\end{array}$
Then we have
$\begin{array}{c}A={\displaystyle \sum _{k=1}^{n}{A}_{k}}\\ ={\displaystyle \sum _{k=1}^{n-1}{H}_{k}\cdot \left({\displaystyle \sum _{j=k+1}^{n}{D}_{j}}\right)+\frac{1}{2}{\displaystyle \sum _{k=1}^{n}{H}_{k}\cdot {D}_{k}}}\\ ={p}_{con}+\frac{1}{2}{p}_{tied}=C.\end{array}\left(6\right)$
Again, the pair consistency probability C for the polychotomized variable is equivalent to the area under its ROC straight line graph, and the problem of finding the optimal cutoff points that maximize C is mathematically equivalent to finding the set of edge points of the ROC straight line graph that maximizes the area A under that graph.
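The equality in (6) can be illustrated numerically. The sketch below assumes normal distributions with illustrative parameters and an arbitrary set of cutoff points. It computes C from the class probabilities H_k and D_k and, separately, the area under the ROC straight line graph as a sum of trapezoids. All function names are hypothetical.

```python
from statistics import NormalDist


def class_probs(dist, cutoffs):
    """Probabilities of the n classes defined by sorted cutoffs z_1 < ... < z_{n-1}."""
    cdf = [0.0] + [dist.cdf(z) for z in cutoffs] + [1.0]
    return [cdf[k + 1] - cdf[k] for k in range(len(cdf) - 1)]


def c_polychotomous(healthy, diseased, cutoffs):
    """C = p_con + p_tied / 2 computed from H_k and D_k."""
    H = class_probs(healthy, cutoffs)
    D = class_probs(diseased, cutoffs)
    n = len(H)
    p_tied = sum(H[k] * D[k] for k in range(n))
    p_con = sum(H[k] * sum(D[k + 1:]) for k in range(n - 1))
    return p_con + 0.5 * p_tied


def roc_line_area(healthy, diseased, cutoffs):
    """Sum of the trapezoid areas A_k under the ROC straight line graph."""
    fp = [1.0] + [1.0 - healthy.cdf(z) for z in cutoffs] + [0.0]
    tp = [1.0] + [1.0 - diseased.cdf(z) for z in cutoffs] + [0.0]
    return sum(
        0.5 * (fp[k - 1] - fp[k]) * (tp[k - 1] + tp[k]) for k in range(1, len(fp))
    )


# Assumed illustrative distributions and cutoff points.
healthy = NormalDist(0.0, 1.0)
diseased = NormalDist(1.216, 1.5)
cutoffs = [0.0, 1.0, 2.0]
print(c_polychotomous(healthy, diseased, cutoffs), roc_line_area(healthy, diseased, cutoffs))
```

The two printed values agree up to floating-point error for any choice of increasing cutoff points, which is what (6) states.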
Optimal cutoff points for variables for which normal and diseased cases have a common central tendency
There are many predictive variables whose central values refer to a normal state and whose more dispersed values refer to less preferable states. In the example of rhabdomyolysis prognosis that will follow later, body temperature, pulse rate, plasma sodium, and plasma pH are such variables. For these predictors, we need to find at least two cutoff points to discriminate normal and abnormal states. If we denote the values of the cutoff points z _{1} and z _{2} (z _{1} <z _{2}), and regard the value between these two cutoff points as normal, then type I error α and type II error β become:
$\alpha ={\displaystyle {\int}_{-\infty}^{{z}_{1}}{f}_{h}(x)}dx+{\displaystyle {\int}_{{z}_{2}}^{\infty}{f}_{h}(x)}dx=Fp$
and
$\beta ={\displaystyle {\int}_{{z}_{1}}^{{z}_{2}}{f}_{d}(x)}dx=1-Tp.$
The pair consistency probability C can now be calculated with equation (3) and the combination of cutoff points (z _{1}, z _{2}) which maximizes (3) becomes the solution. In case of categorization of the variable into more than three states, we can define the optimal combination of cutoff points as follows: Let z _{ n }= -∞, w _{ n }= ∞ and z _{1}, z _{2},..., z _{ n-1}, w _{1}, w _{2},..., w _{ n-1}be cutoff points where z_{ n-1}<...<z _{2} <z _{1} <w _{1} <w _{2} <...<w _{ n-1}, and
$\begin{array}{cc}{H}_{1}={\displaystyle {\int}_{{z}_{1}}^{{w}_{1}}{f}_{h}(x)dx,}& {D}_{1}={\displaystyle {\int}_{{z}_{1}}^{{w}_{1}}{f}_{d}(x)dx}\end{array}$
$\begin{array}{cc}{H}_{k}={\displaystyle {\int}_{{z}_{k}}^{{z}_{k-1}}{f}_{h}(x)dx+{\displaystyle {\int}_{{w}_{k-1}}^{{w}_{k}}{f}_{h}(x)dx}}& (k=2,\mathrm{...},n),\end{array}$
$\begin{array}{cc}{D}_{k}={\displaystyle {\int}_{{z}_{k}}^{{z}_{k-1}}{f}_{d}(x)dx+{\displaystyle {\int}_{{w}_{k-1}}^{{w}_{k}}{f}_{d}(x)dx}}& (k=2,\mathrm{...},n).\end{array}$
Then the probabilities for tied and consistent pairs become
${p}_{tied}={\displaystyle \sum _{k=1}^{n}{H}_{k}\cdot {D}_{k}}\quad \text{and}\quad {p}_{con}={\displaystyle \sum _{k=1}^{n-1}{H}_{k}\cdot \left({\displaystyle \sum _{j=k+1}^{n}{D}_{j}}\right)},$
and the pair consistency probability C can be calculated from equation (1). The combination of cutoff points that maximizes C becomes the solution.
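For the simplest case of two cutoff points (z _{1}, z _{2}) with the interval between them regarded as normal, the search can be sketched as follows. The example assumes normal distributions with a common mean and a larger spread in the diseased group. The parameters, the grid, and the function names are illustrative assumptions.

```python
from itertools import combinations
from statistics import NormalDist


def c_two_sided(z1, z2, healthy, diseased):
    """Equation (3) with values between z1 and z2 regarded as normal."""
    alpha = healthy.cdf(z1) + (1.0 - healthy.cdf(z2))   # healthy outside (z1, z2)
    beta = diseased.cdf(z2) - diseased.cdf(z1)          # diseased inside (z1, z2)
    return 1.0 - (alpha + beta) / 2.0


def best_pair(healthy, diseased, lo, hi, step=0.05):
    """Exhaustive grid search over all pairs z1 < z2 in [lo, hi]."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return max(
        combinations(grid, 2),
        key=lambda pair: c_two_sided(pair[0], pair[1], healthy, diseased),
    )


# Assumed illustrative distributions: same central value, different spread.
healthy = NormalDist(0.0, 1.0)
diseased = NormalDist(0.0, 2.0)
z1, z2 = best_pair(healthy, diseased, -4.0, 4.0)
print(z1, z2, c_two_sided(z1, z2, healthy, diseased))
```

In this symmetric example the selected cutoff points lie near the two points where the assumed density curves cross, one on each side of the common mean.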
Parametric method for estimating cutoff points and predictive performance
The polychotomization methods proposed in the previous sections have been developed under conditions where the exact distribution of a prognostic or diagnostic factor in a population is known. However, in research practice we work with samples and we need to discuss whether our methods can be applied in situations involving parameter uncertainty. Although some methods were developed for correct estimation of the pair consistency probability C in these situations, including non-parametric ones [22–24], none of them addressed the estimation of cutoff points and they can therefore not be applied to our setting.
The challenge we are faced with is that if we repeat the evaluation of the pair consistency probability to find optimal cutoff points, for instance by increasing the possible value of the cutoff point with a certain step, it gives rise to estimation error just like the minimum p-value approach [9–18] and would mistakenly lead to an optimistic conclusion on the predictive performance of the model in future observations.
It is clear that we need a practical method that does not suffer from this over-estimation bias. In this paper we show that if f _{ h }and f _{ d }can be transformed to normal distributions, a parametric method provides essentially unbiased estimators of predictive performance and cutoff points.
The parametric method consists of the following steps:
- a) the assumption that the probability density functions of an independent variable in the healthy and disease groups, f _{ h } and f _{ d }, are both normal or can be transformed to a normal distribution,
- b) the estimation of the means and standard deviations of f _{ h } and f _{ d } (m _{ h }, s _{ h }, m _{ d }, and s _{ d }) from sample data,
- c) the localization of the optimal cutoff points based on the estimated distributions ${\tilde{f}}_{h}$ and ${\tilde{f}}_{d}$, and
- d) the calculation of the predictive performance based on the estimated cutoff points.
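Steps (a) to (d) can be sketched for a single cutoff point as follows. This is a minimal illustration and not the authors' program: it assumes untransformed normal data, a healthy mean below the diseased mean, and a grid search between the two estimated means. The function name `parametric_cutoff` and the toy samples are hypothetical.

```python
from statistics import NormalDist


def parametric_cutoff(healthy_sample, diseased_sample, step=0.001):
    """Estimate a single cutoff and its C from fitted normal distributions."""
    # Step (b): estimate m_h, s_h, m_d, s_d from the sample data.
    f_h = NormalDist.from_samples(healthy_sample)
    f_d = NormalDist.from_samples(diseased_sample)

    def c_at(z):
        alpha = 1.0 - f_h.cdf(z)   # estimated P(x_h > z)
        beta = f_d.cdf(z)          # estimated P(x_d < z)
        return 1.0 - (alpha + beta) / 2.0

    # Step (c): search on the estimated distributions, not on the raw data.
    lo, hi = sorted((f_h.mean, f_d.mean))
    n = int(round((hi - lo) / step))
    z_best = max((lo + i * step for i in range(n + 1)), key=c_at)
    # Step (d): predictive performance at the estimated cutoff.
    return z_best, c_at(z_best)


# Toy samples (hypothetical) with equal spread and means 0 and 2.
healthy_sample = [-1.0, -0.5, 0.0, 0.5, 1.0]
diseased_sample = [1.0, 1.5, 2.0, 2.5, 3.0]
print(parametric_cutoff(healthy_sample, diseased_sample))
```

Because the cutoff is located on the smooth fitted curves, the sample data enter only through the four estimated parameters.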
Distributions of the estimators for the cutoff point and the pair consistency probability
If f _{ h }and f _{ d }are both normal and s _{ h }= s _{ d }, then the two curves intersect at x = (m _{ h }+ m _{ d })/2. The pair consistency probability C takes the maximum value at this point as mentioned earlier. In the case that s _{ h }is not equal to s _{ d }, the two curves intersect at the following two points:
$x=\frac{1}{{s}_{d}^{2}-{s}_{h}^{2}}\left\{({m}_{h}{s}_{d}^{2}-{m}_{d}{s}_{h}^{2})\pm \sqrt{{({m}_{h}{s}_{d}^{2}-{m}_{d}{s}_{h}^{2})}^{2}-({s}_{d}^{2}-{s}_{h}^{2})\left[({m}_{h}^{2}{s}_{d}^{2}-{m}_{d}^{2}{s}_{h}^{2})+2{s}_{h}^{2}{s}_{d}^{2}\mathrm{log}\left(\frac{{s}_{h}}{{s}_{d}}\right)\right]}\right\}\qquad \left(7\right)$
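Formula (7) can be evaluated directly and checked by confirming that the two normal densities are equal at the returned points. The sketch below uses illustrative parameter values chosen for this example, and the function name `pdf_intersections` is hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def pdf_intersections(m_h, s_h, m_d, s_d):
    """Points where the two normal density curves cross, following formula (7)."""
    if s_h == s_d:
        # Equal standard deviations: a single crossing at the midpoint of the means.
        return [(m_h + m_d) / 2.0]
    a = s_d ** 2 - s_h ** 2
    b = m_h * s_d ** 2 - m_d * s_h ** 2
    c = (m_h ** 2 * s_d ** 2 - m_d ** 2 * s_h ** 2) \
        + 2.0 * s_h ** 2 * s_d ** 2 * log(s_h / s_d)
    root = sqrt(b * b - a * c)
    return sorted([(b - root) / a, (b + root) / a])


# Assumed illustrative parameters.
m_h, s_h, m_d, s_d = 0.0, 1.0, 1.216, 1.5
healthy = NormalDist(m_h, s_h)
diseased = NormalDist(m_d, s_d)
points = pdf_intersections(m_h, s_h, m_d, s_d)
for x in points:
    print(x, healthy.pdf(x), diseased.pdf(x))
```

For these values one crossing lies between the two means and the other far out in the lower tail, so only the former is a candidate when a single cutoff point is wanted.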
We examined the distributions of these estimators by Monte Carlo simulation, which consisted of the following steps:
- a) the assumption that f _{ h } and f _{ d } are both normal,
- b) generation of samples of healthy and disease groups, each with a given number of measurements, by randomly generating the value of the prognostic variable,
- c) estimation of the optimal cutoff points and pair consistency probability C by naïve stepwise repeated search, in which the cutoff point is changed with a certain small step Δz and the corresponding C is evaluated based on the sample data to find a point which gives the maximum C. In case of polychotomization, this step is iterated for every combination of possible cutoff values,
- d) estimation of the parameters of f _{ h } and f _{ d } and calculation of the optimal cutoff points based on the estimated distributions (including the corresponding predictive ability C), in which cutoff points are searched numerically in the same manner as the above stepwise repeated search based not on the sample data but on the estimated PDFs,
- e) repetition of the above sample generation and estimation steps 10,000 or 100,000 times for each of various combinations of population parameters.
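A scaled-down version of such a simulation can be sketched as below for a single cutoff point. It is an illustration under assumptions chosen here, not a reproduction of the reported results: normal populations with means 0 and 2 and unit standard deviations, 25 observations per group, only 200 repetitions, and a fixed random seed. All names are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def sample_c(z, healthy, diseased):
    """Equation (3) evaluated on sample proportions."""
    alpha = sum(x > z for x in healthy) / len(healthy)
    beta = sum(x < z for x in diseased) / len(diseased)
    return 1.0 - (alpha + beta) / 2.0


def naive_max_c(healthy, diseased, step=0.05):
    """Naive stepwise search: maximum sample-based C over a grid of cutoffs."""
    lo, hi = min(healthy + diseased), max(healthy + diseased)
    n = int((hi - lo) / step)
    return max(sample_c(lo + i * step, healthy, diseased) for i in range(n + 1))


def parametric_max_c(healthy, diseased, step=0.01):
    """Same search, but on normal distributions fitted to the samples."""
    f_h = NormalDist.from_samples(healthy)
    f_d = NormalDist.from_samples(diseased)
    lo, hi = sorted((f_h.mean, f_d.mean))
    n = int((hi - lo) / step)
    return max(
        1.0 - ((1.0 - f_h.cdf(z)) + f_d.cdf(z)) / 2.0
        for z in (lo + i * step for i in range(n + 1))
    )


def simulate(n=25, reps=200, seed=1):
    """Mean of the naive and parametric estimates of C over repeated samples."""
    rng = random.Random(seed)
    naive, parametric = [], []
    for _ in range(reps):
        healthy = [rng.gauss(0.0, 1.0) for _ in range(n)]
        diseased = [rng.gauss(2.0, 1.0) for _ in range(n)]
        naive.append(naive_max_c(healthy, diseased))
        parametric.append(parametric_max_c(healthy, diseased))
    return fmean(naive), fmean(parametric)


# True C of the optimally dichotomized variable for the assumed populations.
true_c = NormalDist().cdf(1.0)
naive_mean, parametric_mean = simulate()
print(true_c, naive_mean, parametric_mean)
```

In this setting the naive mean is expected to lie above the parametric mean, because the naive search takes the maximum of noisy sample-based values.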
Extension for multiple associated independent variables
Thus far, we have discussed a method for selecting cutoff points that maximizes the predictive ability of each prognostic variable individually. When a regression model has more than one explanatory variable, the version of our method presented in this article can only be applied if the variables are not associated (no correlation and no interaction). Since associations between prognostic variables are common, our method requires a multivariable extension in which cutoff points are found while taking such associations into account.
Such an extension can be outlined as follows:
- a) determine the regression model that best fits the observations,
- b) estimate the multivariate normal distribution parameters from the observed data,
- c) for a set of categorized variables defined by a combination of cutoff points, calculate the regression equation and evaluate its overall C index (based not on the observed data but on the estimated distributions),
- d) iterate (c) systematically for every combination of cutoff points and select the combination of cutoff points which gives the maximum overall C index for the regression equation.
Results
Evaluation of the parametric method by Monte Carlo simulation
In this section, we present an evaluation of our parametric method, together with the naïve application of a stepwise repeated search based on multiple evaluations. In the absence of a standard method for polychotomization, the latter is currently probably the first choice for researchers, mainly due to its simplicity.
Distributions of estimators from the parametric method
Changes of the pair consistency probability C by the number of cutoff points
μ _{ d }* | C** | C _{1}*** | C _{2}*** | C _{3}*** |
---|---|---|---|---|
0.691 | 0.650 | 0.630 | 0.648 | 0.653 |
0.941 | 0.700 | 0.663 | 0.688 | 0.695 |
1.216 | 0.750 | 0.700 | 0.731 | 0.741 |
1.517 | 0.800 | 0.738 | 0.775 | 0.787 |
1.868 | 0.850 | 0.780 | 0.821 | 0.835 |
2.310 | 0.900 | 0.828 | 0.871 | 0.884 |
2.965 | 0.950 | 0.886 | 0.925 | 0.937 |
Means and standard deviations of the estimates of C for a dichotomized variable
n | true C = 0.650 | true C = 0.750 | true C = 0.850 |
---|---|---|---|
10 | 0.662 ± 0.079 | 0.760 ± 0.076 | 0.856 ± 0.062 |
25 | 0.655 ± 0.051 | 0.754 ± 0.048 | 0.852 ± 0.040 |
50 | 0.652 ± 0.036 | 0.751 ± 0.034 | 0.851 ± 0.028 |
100 | 0.651 ± 0.026 | 0.751 ± 0.024 | 0.851 ± 0.020 |
200 | 0.650 ± 0.018 | 0.750 ± 0.017 | 0.850 ± 0.014 |
500 | 0.650 ± 0.011 | 0.750 ± 0.011 | 0.850 ± 0.009 |
Means and standard deviations of the estimates of C for a trichotomized variable
n | true C = 0.650 | true C = 0.750 | true C = 0.850 |
---|---|---|---|
10 | 0.671 ± 0.065 | 0.759 ± 0.057 | 0.854 ± 0.039 |
25 | 0.658 ± 0.042 | 0.754 ± 0.036 | 0.851 ± 0.024 |
50 | 0.654 ± 0.030 | 0.752 ± 0.025 | 0.851 ± 0.017 |
100 | 0.652 ± 0.022 | 0.751 ± 0.017 | 0.850 ± 0.012 |
200 | 0.651 ± 0.015 | 0.751 ± 0.012 | 0.850 ± 0.008 |
500 | 0.651 ± 0.010 | 0.750 ± 0.008 | 0.850 ± 0.005 |
Example: Polychotomization of the prognostic factors of rhabdomyolysis
Rhabdomyolysis is a potentially lethal complication, often observed in patients who have attempted suicide with large doses of psychotropic drugs. Though it is important to make the diagnosis and begin proper treatment at an early stage, the diagnosis of rhabdomyolysis is difficult unless specific enzymes and myoglobin in skeletal muscle are detected by laboratory tests.
To find prognostic variables of rhabdomyolysis at an outpatient clinic where laboratory data are not available, we previously evaluated 131 cases of acute drug toxicosis [27–29] and found twelve variables to be significantly contributing to diagnosis of rhabdomyolysis (rhabdomyolysis group: n = 34, non-rhabdomyolysis group: n = 97). For this example, we selected three non laboratory data variables to predict the risk at the outpatient clinic: (1) qtc: ECG QTc (non-dimensional); (2) t: time from taking the drug to hospitalization (hours); and (3) bt, body temperature (Celsius).
Optimal cutoff points for the prognostic factors of rhabdomyolysis
4a Optimal cutoff points for qtc*
number of cutoff points | z _{1} | z _{2} | z _{3} | C |
---|---|---|---|---|
1 | 0.460 | | | 0.611 |
2 | 0.428 | 0.491 | | 0.634 |
3 | 0.410 | 0.460 | 0.509 | 0.642 |
continuous | | | | 0.651 |
4b Optimal cutoff points for t** (hours)
number of cutoff points | z _{1} | z _{2} | z _{3} | C |
---|---|---|---|---|
1 | 7.74 | | | 0.751 |
2 | 4.99 | 12.16 | | 0.795 |
3 | 3.91 | 7.88 | 15.75 | 0.810 |
continuous | | | | 0.829 |
4c Optimal cutoff points for bt*** (Celsius)
number of cutoff points | z _{1} | z _{2} | C |
---|---|---|---|
2 | 33.9 | 37.2 | 0.640 |
continuous | | | 0.675 |
Considering the predictive performance of each of the categorized variables and convenience in the clinical setting, we finally chose the cutoff point values 0.45 for qtc, 5.0 and 12.0 for t, and 34.0 and 37.2 for bt. We then converted the continuous variables to categorical variables. Next, we applied the cross-split-half method [30] to validate the effectiveness of prediction by these variables with logistic regression [31] and evaluated the amount by which a single data set overestimates prediction performance. The estimated optimism for the overall C index was 0.018, which is sufficiently small.
Example: Risk table for prognosis of rhabdomyolysis
Based on categorized variables, we obtained the new prediction formula:
p = 1/(1 + exp(7.96 - 3.13QTC - 6.22T _{1} - 3.11T _{2} - 1.97BT)) (8)
where QTC is ECG QTc (1 for more than or equal to 0.45 and 0 for less than 0.45), T _{1} is the time from drug ingestion to hospitalization (1 for more than or equal to 12 hours, 0 otherwise), T _{2} is also the time from drug ingestion to hospitalization (1 for less than 12 hours and more than or equal to 5 hours, 0 otherwise), and BT is body temperature (1 for more than or equal to 37.2° or less than or equal to 34.0°, and 0 otherwise). Since the overall index C for this formula was 0.945, we estimate the predictive performance in future data will be around 0.927 (= 0.945 - 0.018).
To ascertain the fitness of the selected regression model, we conducted the Hosmer-Lemeshow goodness-of-fit test [32] by dividing disease probability into eight classes. The actual number of occurrences for each class showed good agreement with the expected number of occurrences of rhabdomyolysis (p = 0.618).
Probability profile table for rhabdomyolysis
T | QTC | BT | risk |
---|---|---|---|
- | - | - | 0.0003 |
- | - | + | 0.0025 |
- | + | - | 0.0079 |
- | + | + | 0.0542 |
+ | - | - | 0.0078 |
+ | - | + | 0.0532 |
+ | + | - | 0.152 |
+ | + | + | 0.562 |
++ | - | - | 0.149 |
++ | - | + | 0.557 |
++ | + | - | 0.801 |
++ | + | + | 0.966 |
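The probability profile table above can be regenerated from formula (8) with a few lines of code. In this sketch the T labels "-", "+", and "++" are read as under 5 hours, 5 to under 12 hours, and 12 hours or more, an interpretation inferred from the coefficients and the tabulated risks rather than stated in a legend. The function and variable names are hypothetical.

```python
from itertools import product
from math import exp


def risk(qtc, t1, t2, bt):
    """Formula (8): predicted probability from the 0/1 categorized predictors."""
    return 1.0 / (1.0 + exp(7.96 - 3.13 * qtc - 6.22 * t1 - 3.11 * t2 - 1.97 * bt))


# T label -> (T1, T2); this mapping of the labels is an assumption (see above).
T_LEVELS = {"-": (0, 0), "+": (0, 1), "++": (1, 0)}


def profile_table():
    """All twelve patient profiles in the row order of the table."""
    rows = []
    for (t_label, (t1, t2)), qtc, bt in product(T_LEVELS.items(), (0, 1), (0, 1)):
        rows.append((t_label, qtc, bt, risk(qtc, t1, t2, bt)))
    return rows


for t_label, qtc, bt, p in profile_table():
    print(f"{t_label:>2} | {'+' if qtc else '-'} | {'+' if bt else '-'} | {p:.4f}")
```

A patient with an unknown body temperature can then be given a range by reading the two rows that differ only in BT.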
Discussion
The criterion for optimal categorization of continuous variables in regression models may vary depending on the object of the categorization, and there have been several different approaches. Many of these approaches are inadequate for our purpose. We have proposed to use the overall discrimination index C introduced by Harrell and other authors [21–24] as the measure for predictive performance of a categorized variable. Since the overall discrimination index C has a clear and straightforward meaning as the pair consistency probability, it is intuitively logical to use it as a measure for the predictive discrimination for polychotomized variables.
Though mathematically distinct, our method has much in common with previously developed methods [2–20, 33–38], which can be explained through the relations between the pair consistency probability C, SE and SP, and the area under the ROC straight line graph, as is expressed in formulae (2) to (6). In addition, our ROC straight line graph has a close relation with ordinal dominance or the OD curve proposed by Darlington to visualize the ordering feature of two comparative sets [39]. He showed that the OD curve is a complete representation of the rank-order properties of data and many statistical procedures follow naturally from assessment of the curve. Bamber clarified the relation between the area above the OD curve and a measure identical to the pair consistency probability [40]. Our proof of formula (6) related to the ROC straight line graph corresponds to Bamber's OD curve related proof.
Monte Carlo simulation showed that the naïve search of the maximum C index will give rise to an estimation bias, which is very much like the positive bias that affects the minimum p-value method. Such bias is also seen in the method where the cutoff point is selected in a way that maximizes the sum of SE and SP. Linnet and Brandt calculated the sample distribution of (SE + SP)/2 in the case of dichotomization using computer simulation assuming that distributions are normal, and evaluated the positive bias induced by the selection of an optimal cutoff point [4]. They found that estimates of test performance are too optimistic when the sample size is small, with an average positive bias up to 15% for a sample size of 25. We have shown that this problem does not affect our proposed parametric method.
However, there may be cases where a transformation to a normal distribution does not work well. For such cases, we expect that approximating the distribution curve by a more suitable function, or by a restricted cubic spline function [41], would provide a workable solution. We are currently in the process of evaluating this approach and the results will be reported elsewhere.
To keep this introduction of the maximum C index approach for polychotomizing predictive variables short and readable, we have used an example in which a regression model without correlated independent variables and without interaction fitted the observed data well (p = 0.618 by Hosmer and Lemeshow goodness-of-fit test) [42, 43]. However, if correlation and interaction are relevant for the regression function, our maximum C index approach must be extended to a multivariable setting. Mazumdar extended a cutoff point search based on the maximum chi-square method to a multivariable setting [44], and showed that the cutoff points obtained by a multivariable search were closer to the true cutoff points.
Another method that is appealing for regression settings with correlated independent variables, is the so-called 'simplified integer score' method in which continuous variables are transformed into semi-continuous interval variables [41]. It has been used in numerous articles and is based on the categorization of the continuous variable, and the transformation of the products of the regression coefficient and the value of the variable into integers. This method is clinically useful and can be applied to the situation where explanatory variables are correlated. If the number of variables is small enough and they have few classifications, this method can also be used to create the simple probability profile tables that result from our approach. We are currently in the process of evaluating a multivariable extension of the C index maximization approach, including a comparison with this method.
Along with regression models, decision trees can also be used in diagnostic or prognostic decision making [36]. Breiman et al. developed an approach called classification and regression trees (CART) to build a decision tree for medical diagnosis based on a training data set [41, 45]. In these decision trees, diagnosis is made by a sequential decision making process, in which a question on an independent variable is posed at each step and, depending on the answer, a different "branch" of the tree is selected until the final result is reached. If an independent variable is continuous, dichotomization (or polychotomization) will be necessary to build a decision tree. Typically, the cutoff points are found by maximizing the total utility of the decision scheme [46, 47], which appears to be closely related mathematically to our approach. Further study is necessary to make a theoretical and practical comparison.
We have indicated that it is easier for most people to read the risk probability from a probability profile table than to calculate the risk with a regression formula. Additionally, probability profile tables give physicians an intuitive feel for the disease risk. Even if the value of one or two of the prognostic variables is not available, physicians can obtain a probability range corresponding to the patient's risk by referring to both the positive and negative cases in the table. By making simplified risk tables in advance, physicians can obtain the patient's risk from an auxiliary table, even if the value of a predictor is missing. Since the table presentation of probabilities has these practical advantages, we believe our method for categorizing prognostic variables can be a helpful tool to make diagnostic or descriptive prognostic research with regression models more applicable in clinical practice.
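The way a probability range is read from a profile table when a predictor is missing can be sketched as follows. This Python example assumes a logistic model with two dichotomized predictors; the intercept, coefficients, and predictor names are hypothetical and do not come from the rhabdomyolysis model in this study.

```python
import math
from itertools import product

# Hypothetical logistic model with two dichotomized predictors (1 = above cutoff).
INTERCEPT = -2.0
COEFS = {"x1_high": 1.2, "x2_high": 0.8}


def risk(profile):
    """Predicted probability for a complete profile of 0/1 predictor values."""
    z = INTERCEPT + sum(COEFS[name] * value for name, value in profile.items())
    return 1.0 / (1.0 + math.exp(-z))


def risk_range(known):
    """(lowest, highest) risk over all table rows consistent with the known values.

    Predictors that are absent or set to None are treated as missing.
    """
    missing = [name for name in COEFS if known.get(name) is None]
    probs = []
    for combo in product([0, 1], repeat=len(missing)):
        profile = {name: v for name, v in known.items() if v is not None}
        profile.update(zip(missing, combo))
        probs.append(risk(profile))
    return min(probs), max(probs)


# Full table: one row per combination of categories.
for x1, x2 in product([0, 1], repeat=2):
    print(x1, x2, round(risk({"x1_high": x1, "x2_high": x2}), 3))

# Second predictor unavailable: report the range spanned by its two rows.
low, high = risk_range({"x1_high": 1, "x2_high": None})
print(f"risk between {low:.3f} and {high:.3f}")
```

With complete data the range collapses to a single table entry, so the same lookup serves both situations.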
Conclusion
We have proposed a new approach for polychotomization (including dichotomization) of independent continuous variables in regression models based on the overall discrimination index C, or the pair consistency probability, introduced by Harrell. We have shown that this index is closely related to the area under the ROC curve for the original continuous variable and that the resulting categorized variables have predictive properties comparable to the original continuous variable. We showed that the naïve application of the method gives rise to positive bias, not unlike the minimum p-value approach or the method of maximizing the sum of sensitivity and specificity, and we proposed a parametric version in which the estimates of the predictive performance and cutoff points are essentially unbiased. To evaluate the accuracy and precision of the estimate of the predictive performance, we used Monte Carlo simulation to produce tables of its means and standard deviations for typical cases. Finally, we applied our method to a prediction rule with continuous predictor variables for rhabdomyolysis and showed that our method for polychotomizing continuous regressor variables can be a valid and useful tool to create probability profile tables. All programs used in this study, and their source code, are available from the authors.
Declarations
Acknowledgements
The authors would like to thank the late Dr. T. Tsutsumi and Dr. S. Morita for their contributions to the analysis of the rhabdomyolysis data. The authors would also like to thank Assistant Professor J. Goddard for providing useful comments and we are grateful to K. Doi for her technical assistance.
References
- Miettinen OS: The modern scientific physician: 3. Scientific diagnosis. CMAJ. 2001, 165 (6): 781-2.
- Fisher RA: The use of multiple measurements in taxonomic problems. Annals of Eugenics. 1936, 7 (2): 179-188.
- Mahalanobis PC: Mahalanobis distance. Proceedings National Institute of Science of India. 1936, 49 (2): 234-256.
- Linnet K, Brandt E: Assessing diagnostic tests once an optimal cutoff point has been selected. Clin Chem. 1986, 32 (7): 1341-1346.
- Bairagi R, Suchindran CM: An estimator of the cutoff point maximizing sum of sensitivity and specificity. Indian J Stat. 1989, 51 (B-2): 263-269.
- Schäfer H: Constructing a cut-off point for a quantitative diagnostic test. Stat Med. 1989, 8: 1381-1391.
- Gail MH, Green SB: A generalization of the one-sided two-sample Kolmogorov-Smirnov statistics for evaluating diagnostic tests. Biometrics. 1976, 32: 561-570. 10.2307/2529745.
- Cantor SB, Sun CC, Tortolero-Luna G, Richards-Kortum R, Follen M: A comparison of C/B ratios from studies using receiver operating characteristic curve analysis. J Clin Epidemiol. 1999, 52: 885-892. 10.1016/S0895-4356(99)00075-X.
- Miller R, Siegmund D: Maximally selected chi square statistics. Biometrics. 1982, 38: 1011-1016. 10.2307/2529881.
- Lausen B, Schumacher M: Maximally selected rank statistics. Biometrics. 1992, 48: 73-85. 10.2307/2532740.
- Altman DG, Lausen B, Sauerbrei W, Schumacher M: Dangers of using "optimal" cutpoints in the evaluation of prognostic factors. J Natl Cancer Inst. 1994, 86 (11): 829-835. 10.1093/jnci/86.11.829.
- Mazumdar M, Glassman J: Categorizing a prognostic variable: review of methods, code for easy implementation and applications to decision-making about cancer-treatments. Stat Med. 2000, 19: 113-132. 10.1002/(SICI)1097-0258(20000115)19:1<113::AID-SIM245>3.0.CO;2-O.
- Hilsenbeck SG, Clark GM, McGuire WL: Why do so many prognostic factors fail to pan out?. Breast Cancer Res Treat. 1992, 22: 197-206. 10.1007/BF01840833.
- Cantor AB: Re: Dangers of using "optimal" cutpoints in the evaluation of prognostic factors. J Natl Cancer Inst. 1994, 86 (23): 1798. 10.1093/jnci/86.23.1798-a.
- Lausen B, Schumacher M: Evaluating the effect of optimized cutoff values in the assessment of prognostic factors. Comp Stat Data Analysis. 1996, 21: 307-326. 10.1016/0167-9473(95)00016-X.
- Hilsenbeck SG, Clark GM: Practical p-value adjustment for optimally selected cutpoints. Stat Med. 1996, 15: 103-112. 10.1002/(SICI)1097-0258(19960115)15:1<103::AID-SIM156>3.0.CO;2-Y.
- Faraggi D, Simon R: A simulation study of cross-validation for selecting an optimal cutpoint in univariable survival analysis. Stat Med. 1996, 15: 2203-2213. 10.1002/(SICI)1097-0258(19961030)15:20<2203::AID-SIM357>3.0.CO;2-G.
- Contal C, O'Quigley J: An application of changepoint methods in studying the effect of age on survival in breast cancer. Comp Stat Data Analysis. 1999, 30: 253-270. 10.1016/S0167-9473(98)00096-6.
- Metz CE: Basic principles of ROC analysis. Semin Nucl Med. 1978, 8 (4): 283-298.
- Kristjansson B, Hill G, McDowell I, Lindsay J: Optimal cut-points when screening for more than one disease state: an example from the Canadian study of health and aging. J Clin Epidemiol. 1996, 49 (12): 1423-28. 10.1016/S0895-4356(96)00272-7.
- Harrell FE Jr, Califf RM, Pryor DB, Lee KL, Rosati RA: Evaluating the yield of medical tests. JAMA. 1982, 247 (18): 2543-2546. 10.1001/jama.247.18.2543.
- Harrell FE Jr, Lee KL, Mark DB: Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996, 15 (4): 361-387. 10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4.
- Nam B, D'Agostino RB: Discrimination index, the area under the ROC curve. Goodness-of-Fit Tests and Model Validity. Edited by: Huber-Carol C. 2003, Boston: Birkhauser, 267-279.
- Pencina MJ, D'Agostino RB: Overall C as a measure of discrimination in survival analysis: model specific population value and confidence interval estimation. Stat Med. 2004, 23 (13): 2109-2123. 10.1002/sim.1802.
- Green DM, Swets JA: Signal detection theory and psychophysics. 1966, New York: Wiley.
- Tsuruta H, Tsutsumi K, Doi K: The changes of predictive ability when prognostic factors are categorized. Proceedings of the 24th Joint Conference on Medical Informatics: 26–28 November 2004; Nagoya. 2004, Japan Association for Medical Informatics, 824-825.
- Morita S, Tsutsumi K, Doi K, Tsuruta H: Prediction of rhabdomyolysis occurring in patients with acute drug toxicosis by logistic regression model. Jpn J Gen Hosp Psychiatry. 1998, 10: 37-43.
- Tsuruta H, Tsutsumi K, Doi K: Prediction of rhabdomyolysis in patients with acute drug toxicosis. Proceedings of the 21st Joint Conference on Medical Informatics: 26–28 November 2001; Hamamatsu. 2001, Japan Association for Medical Informatics, 514-515.
- Tsuruta H, Tsutsumi K, Mochizuki M: Table presentation of the risk of rhabdomyolysis by the use of an optimal categorization method for prognostic factors and logistic regression analysis. Proceedings of the 11th World Congress on Medical Informatics: 7–11 September 2004; San Francisco. AMIA. Edited by: Fieschi M, Coiera E, Li YCJ. 2004, 1888.
- Cooper RG: An empirically derived new product project selection model. IEEE Trans Eng Manag. 1981, 28 (3): 54-61.
- Walker SH, Duncan DB: Estimation of the probability of an event as a function of several independent variables. Biometrika. 1967, 54 (1 and 2): 167-179. 10.2307/2333860.
- Lemeshow S, Hosmer DW: A review of goodness of fit statistics for use in the development of logistic regression models. Am J Epidemiol. 1982, 115 (1): 92-106.
- Metz CE, Kronman HB: Statistical significance tests for binormal ROC curves. J Math Psych. 1980, 22: 218-243. 10.1016/0022-2496(80)90020-6.
- Hanley JA, McNeil BJ: The meaning and use of the area under the receiver operating characteristic (ROC) curve. Radiology. 1982, 143 (1): 29-36.
- Hanley JA, McNeil BJ: A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology. 1983, 148 (3): 839-43.
- Hunink M, Glasziou P, Siegel J, Weeks J, Pliskin J, Elstein A, Milton CW: Decision Making in Health and Medicine: Integrating Evidence and Values. 2001, Cambridge: Cambridge University Press.
- Faraggi D, Reiser B: Estimation of the area under the ROC curve. Stat Med. 2002, 21 (20): 3093-3106. 10.1002/sim.1228.
- Copas JB, Corbett P: Overestimation of the receiver operating characteristic curve for logistic regression. Biometrika. 2002, 89 (2): 315-331. 10.1093/biomet/89.2.315.
- Darlington RB: Comparing two groups by simple graphs. Psychol Bull. 1973, 79 (2): 110-116. 10.1037/h0033854.
- Bamber D: Area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J Math Psych. 1975, 12: 387-415. 10.1016/0022-2496(75)90001-2.
- Harrell FE: Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis. 2001, New York: Springer.
- Kleinbaum DG, Kupper LL, Muller KE: Applied regression analysis and other multivariable methods. 1998, Boston: PWS-Kent Publishing Company.
- Hosmer DW, Lemeshow S: Applied Logistic Regression. 2000, New York: John Wiley and Sons.
- Mazumdar M, Smith A, Bacik J: Methods for categorizing a prognostic variable in a multivariable setting. Stat Med. 2003, 22: 559-571. 10.1002/sim.1333.
- Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. 1984, Belmont: Wadsworth.
- Long WJ, Griffith JL, Selker HP, D'Agostino RB: A comparison of logistic regression to decision-tree induction in a medical domain. Comput Biomed Res. 1993, 26: 74-97. 10.1006/cbmr.1993.1005.
- Shannon CE: A Mathematical Theory of Communication. The Bell System Tech J. 1948, 27: 379-423.
Pre-publication history
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6947/6/41/prepub
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.