Grid multi-category response logistic models

Background Multi-category response models are very important complements to binary logistic models in medical decision-making. Decomposing model construction by aggregating computation developed at different sites is necessary when data cannot be moved outside institutions due to privacy or other concerns. Such decomposition makes it possible to conduct grid computing to protect the privacy of individual observations. Methods This paper proposes two grid multi-category response models for ordinal and multinomial logistic regressions. Grid computation to test model assumptions is also developed for these two types of models. In addition, we present grid methods for goodness-of-fit assessment and for classification performance evaluation. Results Simulation results show that the grid models produce the same results as those obtained from corresponding centralized models, demonstrating that it is possible to build models using multi-center data without losing accuracy or transmitting observation-level data. Two real data sets are used to evaluate the performance of our proposed grid models. Conclusions The grid fitting method offers a practical solution for resolving privacy and other issues caused by pooling all data in a central site. The proposed method is applicable for various likelihood estimation problems, including other generalized linear models. Electronic supplementary material The online version of this article (doi:10.1186/s12911-015-0133-y) contains supplementary material, which is available to authorized users.


Background
In biomedical research, data sharing plays an important role in accelerating scientific discoveries. For example, networks based on information from electronic health record (EHR) [1,2] have been established for this purpose. However, due to privacy concerns, patient-level data cannot always be exchanged across different institutions. In these circumstances grid computing, which avoids sharing patient level data among multiple institutions, can be used to build a global model.
For example, logistic regression models have been used in a variety of clinical applications, such as scoring candidates for liver transplant using the Model for End-stage Liver Disease [3], producing estimates related to myocardial infarction diagnosis [4], and detecting suspicious accesses to electronic health records [5]. These scenarios, in their classical setups, have difficulties in handling multicenter data, as the training phase requires accessing the entire dataset.
Our previous work [6] and [7] proposed privacypreserving models through the aggregation of nonsensitive intermediary results (i.e., gradient and Hessian matrix for the log-likelihood function), but the model only deals with binary models. Response variables with more than two categorical values occur very often in medical models. For example, cancer progress is often categorized into 4 or 5 phases. One simple method to deal with multiple responses is to fit binary logistic fitting for each pair of these multiple categories. However, this approach is very inconvenient and the performance of each binary logistic model might be degraded when sample size is insufficient. Some researchers extended the binary logistic model to handle multi-category response problems. Among existing approaches, ordinal logistic [8] and multinomial logistic [9] are the two most popular multi-category response logistic models for ordinal and nominal responses, respectively. Both methods are widely used to fit data with multi-category response. However, methods for binary model fitting assessment may not be applicable to multi-category problems. Hosmer and Lemeshow [9] introduced novel methods to evaluate the goodness-of-fit of multi-category logistic models. The Area Under the ROC Curve (AUC) [10] is an important measure in checking classification performance of binary outcome models. Hand and Till [11] generalized the original AUC measure to deal with classification methods for multicategory outcome cases. The AUC for binary logistic regression is given bŷ where n 1 , n 2 are the number of observations with Y=1 and with Y=0, R is the rank sum based on the predicted probability of Y = 1 for observations with Y = 1 among all observations. Van Calster et. al. [12] described several AUC score estimation methods for the ordinal logistic model, one of which is to use the mean of AUC scores from K − 1 binary logistic regression estimations to serve as the AUC score for ordinal logistic model. Hand and Till [11] defined Â(k 1 |k 2 ) in the same way for observations with Y ∈ {k 1 , k 2 } and proposed a generalized AUC for multinomial logistic model as Yang and Carlin [13] generalized the ROC curve to a surface and used the volume under the ROC surface (VUC) to measure the accuracy of a diagnostic test based on multi-category response models. Dreiseitl et. al. [14] proposed to use a three-way ROC curve analysis for the same goal. Van Calster et. al. [12] suggested an ordinal c-index measurement (ORC) and discussed the relationship between the new measurement with VUC and other measurements based on assessing pairs of cases.
In this article, we introduce grid ordinal and multinomial logistic models to handle multi-center modeling of multi-category response, including model assumption checking. We also propose to use the grid AUC score to evaluate the added value of the grid model fitting when compared to models fitted by separate sub-datasets. The remainder of this article is organized as follows. The second Section briefly reviews ordinal logistic [8] and multinomial logistic [9] models and their model assumptions, and also discusses model coefficient estimation methods for both models and the statistical test for checking the ordinal logistic assumption. The third Section discusses grid maximum likelihood estimation and grid computing for the ordinal logistic model assumption test statistics. The fourth Section provides technical details for grid model fitting assessment. The fifth Section elaborates on grid AUC score computing. The sixth Section describes simulation studies to evaluate the theoretical results. The seventh Section carries out additional experiments on two real datasets to demonstrate our proposed methods. The eighth Section discusses the generalization of the proposed grid models and the limitations of this work.

Ordinal and multinomial logistic models
Before we introduce our method we first introduce both ordinal and multinomial logistic models in a few more detail. In terms of how to split response categories, many ordinal logistic models have been studied. However, in this article, we only focus on the proportional odds logistic model to deal with multi-category problems. The proposed method will be extended to other multi-category logistic regression models in the future. Suppose response Y could take values 1, ⋯, K (for K categories) with K ≥ 3. There are m features in the model and n observations. The predictor matrix can be expressed as Let's define p(w, i) ≜ Pr(Y ≤ w|x i ) and assume 1 ≤ i ≤ n and 1 ≤ w ≤ K − 1. The ordinal logistic regression [8] can be defined as With parameters β T = (b 1 , ⋯, b m ). The conditional likelihood function is given by where I y i ¼w ½ is the indicator function, with value of 1 if y i = w and 0 otherwise. Let θ = (α 1 , α 2 , ⋯, α K − 1 , β T ) T , the log-likelihood function for the proportional odds logistic model be denoted as l O (θ). The maximum likelihood estimation (MLE)θ for l O (θ) is usually computed using the Newton method for efficiency. The variance-covariance matrix forθ is estimated by − ∂ 2 l O θ ð Þ= ½ ∂θ∂θ T À Á jθ −1 . Equation (1) assumes that the non-intercept model coefficients β remain the same for 1 ≤ w ≤ K − 1. Usually, a justification for the model assumption is needed when fitting ordinal logistic model. This assumption is called proportional odds assumption [15]. The score test is a common way to test the proportional odds assumption. To perform the score test, we first introduce the generalized ordered logit model [16], which is a generalization of the ordinal logistic model as it allows non-intercept model coefficients to be different. The generalized ordered logit model is given by The log-likelihood function for this generalized model, l G (ψ), is obtained by combining (3) and (2). From its definition, we see that the generalized ordered logistic model requires more parameters than the proportional odds model. Hence, model fitting for small sample size data is a big concern for the generalized ordered logistic model. To check the proportional odds assumption, we need to test whether β 1 = ⋯ = β K − 1 . As mentioned previously, supposeθ ¼α 1 ; The score test statistic is Under the null hypothesis Þ . The multinomial logistic model is mainly dealing with a nominal response with unordered categories. It does not require the proportional odds assumption. Using the multinomial model on ordered data disregards the inherent information in the ordering of the response categories and is not, in general, recommended. Suppose the response variable and predictors are the same as described in the proportional odds model except that the proportional odds assumption does not hold. Let's de- . In multinomial logistic model for 1 ≤ i ≤ n and 1 ≤ w ≤ K − 1 The likelihood function is then given by As The log-likelihood function for multinomial logistic regression is denoted as l M (ψ). The MLEψ for multinomial logistic regression can be also obtained by the Newton method and the variance-covariance matrix forψ is estimated by It is worth noting that the multinomial logistic model requires the same number of parameters as does the generalized ordered logistic model.

Grid ordinal and multinomial logistic models
This section first proposes the grid Newton method for the MLE, which can be used for both the grid proportional odds and the multinomial logistic regression models. Then, we develop the grid proportional odds ratio test for proportional odds logistic regression. Suppose that we want to find the MLEθ for the loglikelihood function l(θ) with θ being a column vector. We can apply the Newton method as for J = 0, 1, 2, ⋯. θ (J) approachesθ as J increases. Because the Newton method is very efficient, it is usually enough for J < 15 to achieve a tolerance 10 (−6) for θ (J) , Suppose data are split into U parts in terms of observations and each part contains the same variables. Let l (θ) be the log-likelihood function for data combined from U parts, which can be decomposed by observations. Hence where l u (θ) is the log-likelihood function for data of part u with u = 1, ⋯, U. For the gradient and Hessian matrix of l(θ), we have and respectively. We get the following grid Newton method from (9), (10) and (7) θ Jþ1 Equation (11) tells us that each Newton update can be finished by combining gradients and Hessian matrices of the partial log-likelihood functions based on corresponding sub-datasets. This equation suggests the following model fitting process in which separate datasets do not need to be pooled in the fitting process.

Compute gradients and Hessian matrices based on
the current coefficient estimation using partial datasets separately. 2. Find overall gradients and Hessian matrices by combining the partial results obtained from Step1, then updating the coefficient estimation.
Starting from an initial value for the model coefficients, the MLE can be obtained by repeating Step 1 and Step 2 until convergence.
The above grid Newton method is used for both ordinal and multinomial logistic model coefficient estimations. The variance-covariance matrix of MLEθ based on the log-likelihood function l(θ) is given by − Using (9) and (10), we get the grid variance-covariance matrix estimates ofθ . This is a typical grid method for a variance-covariance matrix and it is suitable for both proportional odds and multinomial logistic regression. The gradients and Hessian matrices for both regression models are presented in Additional file 1.
For the grid computing for the proportional odds assumption test statistic T o in (4), we first compute the grid MLEθ based on the log-likelihood l O of ordinal logistic regression, then T o is produced by using (9) and (10) to evaluate the gradient and Hessian matrix of l G at ψ e , where ψ e comes from the rearrangement ofθ entries as introduced in the previous Section.

Grid model fit assessment
Assessment of goodness-of-fit for the ordinal logistic model can be done using methods for binary logistic regression on each of K − 1 regressions. Additionally, Fagerland and Hosmer [17] proposed a Homer-Lemeshow type goodness-of-fit test for the proportional odds. To handle the multinomial logistic model, Hosmer and Lemeshow [9] modified several existing measures, including Pearson's residual and R-square. Alternatively, Fagerland et al. [18] modified the Hosmer-Lemeshow (HL) test for the same goal. Some of these methods can be used for grid models.
We use the HL test as an example to explain grid model fit assessment. For binary logistic regression, the HL test statistic is calculated as follows. First, sorted values of the predicted probability of Y = 1 for all observations are split into g groups. E c,k equals the sum of predicted probability of Y = k (k = 0, 1) in category c, O c,k equals the number of observations with Y = k in category c. Then the test statistic is given by which asymptotically follows χ 2 g−2 . In the modified statistic, the g groups are split based on sorted values of the predicted probability of Y<K for all observations. The extended HL (EHL) test statistic is defined as where O c,k and E c,k are defined in the same way as above. The new statistic asymptotically follows χ 2 Þ . O c,k and E c,k only requires response value and predicted probability of Y=k for all observations. Grid HL m computing can be finished by first pooling Y values and corresponding predicted probability values from separate sub-datasets after grid model fitting.

Grid Area under the ROC Curve
The rationale of grid model fitting is based on the assumption that the grid model outperforms models fitted by separate sub-datasets. However, this is not always true and actually depends on data structures. The Area Under ROC Curve (AUC) is a very popular measurement to assess model classification performance, so we propose to use the AUC to check the value of a grid model.
For ordinal logistic regression, we adopt the idea proposed by Van Calster et al. [12] to use the mean of K − 1 AUC scores for assessing the model. For the multinomial logistic regression we adopt the Hand and Till [11] AUC estimation method. For both grid models, their AUC scores can be obtained by pooling response values and predicted probabilities for necessary observations from separate sub-datasets after model fitting. To check the added value, we need to compare the grid AUC score with the AUC score for each sub-dataset.

Simulation
The derivation of the grid method clearly implies that the grid method gives identical results as does the centralized method (i.e., the methods in which sub-datasets are pooled). Additionally, only the total sample size from all sites is important for the fitting results, and the sample size in an individual site will not affect the fitting results for a fixed total sample size. We conducted simulation studies to evaluate the accuracy of the proposed grid model estimation and to compare it with the classical centralized fitting method. Four simulation studies in different settings were performed to compare the various grid multi-category models against two corresponding centralized models. For all studies, simulated data are split into two pieces, one for model fitting, and another for AUC score evaluation and HL test. The first two studies are designed for the ordinal logistic model and the other two studies are designed for the multinomial logistic model. In Studies 1 and 2, data were simulated from an ordinal logistic model with total sample sizes 1800 and 900, respectively. In Studies 3 and 4, data were simulated from a multinomial logistic model with total sample sizes 1800 and 900, respectively. The HL tests for all binary logistic regression estimations were performed in Studies 1 and 2; the extended HL test was performed in Studies 3 and 4. In addition, an average AUC score or extended AUC score [11] was evaluated for each study.
In all studies, we simulated data so that there are 4 outcome categories (Y ∈ {1, 2, 3, 4}). For Studies 1 and 3 we used a total sample size of 1800 for centralized models and split them into 3 separate parts in three different ways: (600, 600, 600), (100, 200, 1500) and (50, 50, 1700) for the grid models. For Studies 2 and 4 we used a total sample size of 900 for centralized models and split them into 3 separate parts in three different ways: (300, 300, 300), (50, 100, 750) and (24, 26, 850) for the grid models. For all studies, each split subset was further split in half, one for model fitting and another for AUC evaluation and HL or extended HL tests. We chose two continuous covariates x 1 and x 2 and two binary covariates x 3 and x 4 (i.e., 5 coefficients for 4 covariates and intercept) in these studies. Simulation data were generated in two steps. First, we generated x 1 and x 2 from a standard normal distribution independently and generated x 3 and x 4 from a Bernoulli distribution with p = 0.5 independently. For Studies 1 and 2 we generated the response y from an ordinal distribution assuming that For Studies 3 and 4 we generated the response y from a multinomial distribution assuming that We conducted the simulations with 1000 runs in all studies. In Studies 1 and 2, the estimation for log odds log Pr Y ≤k ð Þ Pr Y >k ð Þ equalsα k þβ 1 x 1 þβ 2 x 2 þβ 3 x 3 þβ 4 x 4 , for k = 1, 2, 3. In Studies 3 and 4, the estimation for log odds log Pr Y ¼k ð Þ Pr Y ¼4 ð Þ equalsα k þβ 1;k x 1 þβ 2;k x 2 þβ 3;k x 3 þβ 4;k x 4 , for k = 1, 2, 3. Table 1 presents the results for Studies 1  and 2 and Table 2 presents the results for Studies 3 and 4. We show the average biases (Bias) and standard errors (Se) for the estimates in both tables. Table 3 provides the passing rate of the proportional odds assumption (POA) test, the HL test and both (POA&HL) tests for Studies 1 and 2. Table 3 depicts the results of the EHL test in Studies 3 and 4 among 1000 runs. Figure 1 shows the box plots of AUC scores for the four studies.
Note that, as expected, all four studies show that the three grid methods and the corresponding centralized method produce identical results. Hence, each table or figure presents the common results for the three grid models and the corresponding centralized model.

Two examples
In addition to simulation studies, we used two split public datasets to test our core model-fitting algorithm. The purpose was to illustrate how our core grid-fitting algorithm  works. Note that these are not real multi-center studies but used for illustration purposes.
The first example is about the low birth weight dataset, which was obtained from Hosmer and Lemeshow [9] and contains 189 observations with 9 non-redundant variables. We picked 8 variables including AGE, RACE, SMOKE, PTL, HT, UI, FTV, BWT from the dataset, and reasonably modified several variables to create a new dataset as follows. RACE is a three-category variable, replaced by two binary variables: OTHERvsWHITE and BLACKvsWHITE, respectively. PTL is the number of premature labors with values of 0, 1, etc., and was dichotomized into 0 and greater than 0. FTV is the number of physician visits, which is also dichotomized into 0 and greater than 0 as well. BWT is the birth weight in grams and it was categorized into 4 values (1, 2, 3 ,4) using cutoffs 3500, 3000 and 2500. AGE, SMOKE, HT, UI were kept as original, where AGE is continuous, SMOKE is binary, HT is binary variable for "History of hypertension", and UI is binary variable for "Presence of uterine irritability". We denote the new dataset as LBW.
To test the grid model fitting, we randomly picked 95 observations from LBW to create dataset LBW1 and the rest 94 observations to create LBW2. BWT is chosen as the 4-category response variable and the rest are covariates. Since the response is ordinal, we fitted a grid ordinal logistic model without pooling LBW1 and LBW2. Suppose the fitted value for log Pr BW T ≤k ð Þ Pr BW T >k ð Þ (k = 1, 2, 3) iŝ α k þβ 1 AGE þβ 2 OTHERvsWHITE þβ 3 BLACKvsWHITE þβ 4 SMOKE þβ 5 PT L þβ 6 HT þβ 7 UI þβ 8 FTV : Table 4 shows the model coefficient estimates (Est) and their standard errors (Se), with z-values (Zval) equal to the ratios of Est values over according Se values and p-values (Pval) to test whether Zval is significantly different than 0.
The grid proportional odds assumption test was also performed and resulted in a p-value of 0.366. Hence, there is no evidence to show that the assumption for the ordinal logistic model was invalid. To justify the grid model fitting, the ordinal logistic model was also fitted for LBW1 and for LBW2, separately. Grid AUC score (GAUC), AUC score for the model fitted by LBW1 (AUC1), and AUC score for the model fitted by LBW2 (AUC2) were all evaluated by 10-fold cross validation: GAUC ¼ 0:665; AUC1 ¼ 0:645; AUC2 ¼ 0:568: Note that in this example the data are randomly split so every subset has the same underlying population. Hence, small AUC values only result from smaller sample sizes (in subgroups). In addition, a grid HL test for grid model and HL tests for two separate models were performed using 10-fold cross validation with the same data partitions. Unfortunately, none of these models passed the HL test. This may be related to nonlinear effects of the   However, as shown in simulation studies, failing to pass the HL test does not necessary mean the goodness-of-fit of these models are very poor.
The second example is about Mammograph experience data, which was also obtained from Hosmer and Lemeshow [9] and contains 412 observations with 6 variables. We kept the original dataset and only replaced multi-category variables by multiple binary variables. The generated new dataset was denoted as MAM and contained 9 variables: ME, SYMPT1, SYMPT2, SYMPT3, PB, HIST, BSE, DETC1 and DETC2. ME denotes mammograph experience with "3 = never", "2 = within a year" and "1 = over a year ago". Original SYMPT was a 4category variable and denoted the 4 responses to "you do not need a mammograph unless you develop symptoms" from "strongly agree" to "strongly disagree". It was replaced by binary variables SYMPT1, SYMPT2 and SYMPT3. PB is a continuous variable for the degree of "perceived benefit of mammography". HIST is a binary variable for the response to whether "mother or sister has breast cancer history". BSE is the binary response to "Has anyone taught you how to examine your own breasts?". Original DECT was a 3-category variable and the response to "How likely is it that a mammogram could find a new case of breast cancer?". It was replaced by binary variables DECT1 and DECT2.
To justify the grid model fitting, a multinomial logistic model was also fitted for MAM1 and for MAM2, separately. However, both separate models produced invalid estimates (with very large standard errors). The invalid estimates are probably due to the small number of subjects with ME = 2 after splitting the dataset, and the large number of parameters. This obviously shows the need for grid model fitting based on datasets MAM1 and MAM2 when they are not allowed to be pooled. Ten-fold cross validation was used to evaluate extended AUC score and we performed the extended HL test for the grid fitted model. The grid AUC score was 0.626 and the grid fitted model passed the extended HL test.

Discussion
While our focus was on multi-category logit models, the grid MLE method is applicable for grid computing for various likelihood type estimation problems including other generalized linear models and generalized estimating equation models. However, when the likelihood is not separable for observations, then grid MLE may not work. For example, the Cox proportional hazards regression adopts a profile likelihood that cannot be split by observations. Hence, more effort is necessary to design a grid model for Cox proportional odds regression, which was discussed in our recent publication [19].
For the proposed grid HL test and grid AUC, Y values are pooled directly and not protected. To protect Y, the patient outcome values, we could adopt the methods proposed by Wu et al. [6] for the Grid HL test and the AUC score calculation, which avoid exchanging Y values. These methods are accomplished through using transmitted locally predicted probabilities and their orders. Details are given in Algorithms 1 and 2 in Wu et al. [6].
In practice, the grid model fitting using multi-site data is more complicated than what is described in this manuscript (we focused on the model fitting step). Very often, it is necessary to conduct data pre-processing before the model fitting. For example, gender may use a coding method in different sites. Hence, data harmonization is necessary before the grid model can be fitted. Another issue is missing data. One way to mitigate the problem is to deal with missing data during the pre-processing step using the same grid protocol across all sites. Another approach is to handle missing data in the grid model-fitting step, which would be cumbersome. Additionally, sometimes there are too many variables to fit the model; variable selection may thus be needed. Variable selection usually requires the construction of models and it can be incorporated into the model-fitting step. Different sites may have different variables, so choosing and harmonizing the values of common variables needs to be done before the model-fitting step. For the proposed grid models, we assumed that the data were uniformly distributed across local clinical sites, and treated the data from each local site as a random sample from the whole dataset. However, this assumption may not hold and we will consider cluster effects from different sites in our future work. We described (on page 4) that steps 1 and 2 for the grid model-fitting step need to be repeated until convergence. Each site needs to send the first derivative and Hessian matrices multiple times, which means that a reliable data transmission function is necessary for successfully fitting a grid model. Recently we produced a reliable webservice called WebGLORE for binary logistic grid fitting [20]. In our setting the data transmission was adequate but there may be settings in which this may not be the case.

Conclusion
In the proposed grid methods, individual-level observation data were never shared during the model fitting process. This offers a practical solution for mitigating privacy issues caused by pooling all data into a central site. Grid ordinal and multinomial logistic models were introduced in detail. In terms of increasing sample sizes, grid computing is more valuable for multi-category response logistic model than it is for binary logistic regression, since the larger number of coefficient estimates in multi-category models obviously require more observations. A small sample size might result in estimations with very large bias or standard error. The ordinal logistic model was proposed to only address the ordinal response data. The multinomial logistic model is used to deal with nominal response data, which requires even more coefficients and hence more observations for proper estimation when compared to the ordinal logistic model. The theory guarantees that the proposed grid Newton method achieves accurate estimation, which is the same as the one of the classical centralized Newton method. This is consistent with simulation study results. As shown in the simulation studies, the HL test and its extension might be too strong for assessing model fit and might produce false significant test results. These are limitations for the HL test, which are discussed by Vittinghoff et al. [21]. Hence, other model fit assessment methods introduced by Hosmer and Lemeshow [9] could be used in addition to the extended HL test for the multinomial logistic model, and other methods for binary logistic model fit assessment could be used in addition to the HL test for the ordinal logistic model.