 Research
 Open Access
A novel reliability-based regression model to analyze and forecast the severity of COVID-19 patients
BMC Medical Informatics and Decision Making volume 22, Article number: 123 (2022)
Abstract
Background
The coronavirus (SARS-CoV-2) outbreak has become a serious threat to human society all around the world. Due to the rapid spread of the disease and severe shortages of medical resources, predicting COVID-19 disease severity continues to be a challenge for healthcare systems. Accurate prediction of severe patients plays a vital role in determining treatment priorities, managing medical facilities effectively, and reducing the number of deaths. Various methods have been used in the literature to predict the severity prognosis of COVID-19 patients. Despite their different appearances, these methods all aim to achieve generalizable results by increasing the accuracy and reducing the errors of predictions. In other words, accuracy is considered the only factor that affects the generalizability of models. In addition to accuracy, the reliability and consistency of results are critical factors that must be considered to yield generalizable medical predictions. Since reliability plays a significant role in medical decisions, developing reliable data-driven medical models requires more attention.
Methods
This paper presents a new modeling technique to specify and maximize the reliability of results in predicting the severity prognosis of COVID-19 patients. We use the well-known classic regression as the base model on which to implement the proposed procedure. To assess its performance, the proposed model has been applied to predict the severity prognosis of COVID-19 using a dataset that includes clinical information of 46 COVID-19 patients. The dataset contains two types of patient outcomes: mild (discharge) and severe (ICU or death). To measure the efficiency of the proposed model, we compare its accuracy with that of the classic regression model.
Results
The proposed reliability-based regression model, with 98.6% sensitivity, 88.2% specificity, and 93.10% accuracy, performs better than the classic accuracy-based regression model, which achieves 95.7% sensitivity, 85.5% specificity, and 90.3% accuracy. In addition, graphical analysis of the ROC curve showed an AUC of 0.93 (95% CI 0.88–0.98) and an AUC of 0.90 (95% CI 0.85–0.96) for the proposed reliability-based and classic regression models, respectively.
Conclusions
Maximizing reliability in medical forecasting models can lead to more generalizable and accurate results. The competitive results indicate that the proposed reliability-based regression model performs better in predicting the deterioration of COVID-19 patients than the classic accuracy-based regression model. The proposed framework can be used as a suitable alternative to the traditional regression method to improve the decision-making and triage processes for COVID-19 patients.
Background
COVID-19, which first emerged in Wuhan, China, in December 2019, has spread rapidly all around the world and has caused serious challenges for public health and for economic and social activities. The COVID-19 pandemic has put considerable pressure on governments and healthcare systems. In this crisis, predicting the disease severity of arriving patients can play a fundamental role in saving more lives. It helps treatment teams prioritize patients who are more likely to develop an acute condition (ICU admission or death), which in turn accelerates the triage and healing processes, reduces the number of deaths, and leads to more efficient resource management. Patient characteristics, including clinical data and computed tomography (CT) imaging, have been studied by researchers to achieve precise predictions of COVID-19 severity. Gallo Marin et al. [15] have surveyed useful features for predicting the severity of COVID-19 disease. These factors include patients' age, comorbidities, immune response, radiographic findings, laboratory markers, and indicators of organ dysfunction. Francone et al. [14] have studied CT scores and laboratory findings of SARS-CoV-2 patients. Their results have shown that the CT score plays a critical role in forecasting patient outcomes and that there is a high correlation between this score and laboratory findings. Rokni et al. [36] have compared clinical, paraclinical, and laboratory findings between surviving and deceased COVID-19 patients using an independent-samples t-test. The results show that an elevated neutrophil-to-lymphocyte ratio (NLR), platelet-to-lymphocyte ratio (PLR), and systemic immune-inflammation index (SII) can be considered prognostic and risk-stratifying factors for the severe form of COVID-19. Zhang et al. [48] have compared clinical, laboratory, and CT findings between the surviving and deceased groups of patients. 
Their results have shown that older age, comorbidities such as diabetes and emphysema, and higher CRP and NLR values increase the risk of death in COVID-19 patients. The literature on COVID-19 forecasting, specifically for disease severity, shows great interest in applying model-based approaches in different forms. In general, these models fall into two main categories: analytical and predictive approaches. In analytical approaches, the final goal is to obtain a valid model for analyzing the underlying relationships between the target variable and the explanatory variable(s), whereas the main goal of predictive approaches is to predict the target variable. Both categories are beneficial in their domains and have been applied successfully in a wide range of applications.
Statistical and intelligent models are the two main classes of methods that have been used in this field. Statistical techniques are a common approach to developing COVID-19 severity prediction models. Regression models are among the most commonly used statistical methods in medical prediction. Different forms of regression, such as classic regression, logit regression, Cox regression, and the least absolute shrinkage and selection operator (LASSO), are among the statistical methods used most frequently in COVID-19 severity prediction research. Hajiahmadi et al. [16] have used a multivariate regression model to show the usefulness of the chest severity score (CSS) in predicting ICU admission and mortality. Homayounieh et al. [18] have applied a multiple logistic regression model to show the superiority of radiomics from non-contrast chest CT over radiologists' estimation in predicting the outcome of COVID-19. Huang et al. [19] have shown that clinical attributes including underlying diseases, increased respiratory rate, elevated C-reactive protein (CRP), and lactate dehydrogenase (LDH) correlate significantly with the progression of COVID-19 severity. Their results also indicate that elevated lactate dehydrogenase can be used as an effective feature to differentiate severe cases from mild ones. They have utilized single-factor and multivariate logistic regression models as prediction methods. Zhou et al. [52] have studied demographics, symptoms, comorbidities, temporal changes in laboratory results, CT features, and severity scores for recovered and deceased groups by employing the Mann-Whitney U test and the logistic regression model. Xiao et al. [44] have applied univariable and multivariable logistic regression models to demographic, clinical, laboratory, and radiological data of COVID-19 patients. 
Their findings show that a maximum CT score (>11) and chronic obstructive pulmonary disease (COPD) are critical features that affect the deterioration of COVID-19 patients. Shi et al. [37] have employed a LASSO logistic regression model to predict the severity of COVID-19 disease based on clinical and radiological findings of patients at admission. Wei et al. [42] have evaluated the value of CT texture analysis and clinical parameters for predicting severe COVID-19 patients. They first performed feature selection with a minimum redundancy and maximum relevance (MRMR) method and then used the selected features as independent variables in a multivariate logistic regression framework. Zhang et al. [48] have used univariable and multivariable logistic regression models to determine the risk factors of COVID-19 severity, including age, white blood cell count, neutrophils, glomerular filtration rate, and myoglobin. A scoring system was built according to the hazard ratio of each selected feature and used to predict severe COVID-19 patients. Chen et al. [8] have determined risk factors for a fatal outcome in hospitalized COVID-19 patients by employing multivariate Cox regression analysis. The risk factors include advanced age, dyspnea, coronary heart disease (CHD), cerebrovascular disease (CVD), and elevated levels of procalcitonin (PCT) and aspartate aminotransferase (AST). Bi et al. [6] have studied coagulation function factors in COVID-19 patients. Using a multivariate Cox analysis, they show that the fibrinogen-to-albumin ratio (FAR) and platelet count (PLT) are two important features for predicting progression to severe disease. Zhou et al. [53] have used the LASSO regression model to determine factors that affect COVID-19 severity, including body temperature at admission, cough, dyspnea, hypertension, cardiovascular disease, chronic liver disease, and chronic kidney disease. 
They have utilized a multivariable logistic regression to obtain COVID-19 severity predictions. Dong et al. [12] have employed Cox regression models to identify high-risk features in COVID-19 severity. These features, which include comorbidities, advanced age, reduced lymphocyte count, and higher lactate dehydrogenase at presentation, are used to build a scoring model for forecasting. McRae et al. [30] have used a logistic regression model with different attributes, including CRP, N-terminal pro-B-type natriuretic peptide (NT-proBNP), myoglobin (MYO), D-dimer, PCT, creatine kinase-myocardial band (CK-MB), and cardiac troponin I (cTnI), to determine COVID-19 severity. Zhang et al. [49] have employed the Cox regression method to forecast short-term recovery in adult hospitalized COVID-19 patients.
In addition to statistical models, which are useful tools for modeling and analysis, machine learning and artificial intelligence methods have attracted a great deal of attention in the field of COVID-19 severity prediction. Li et al. [27] have shown the effectiveness of laboratory tests and CT data for predicting severe cases by employing a machine learning approach based on random forests. Matos et al. [29] have predicted short-term outcomes in COVID-19 patients. They have shown that the volume of disease on CT scans and clinical attributes are useful for predicting short-term outcomes. They have applied lymphocyte percentage and C-reactive protein to predict the volume of disease on CT scans. Different classification methods have been employed in their work, including the generalized linear model (GLM), penalized binomial regression (PBR), conditional inference trees (CIT), and support vector machine with a linear kernel (SVL). Zhou et al. [51] have examined a set of clinical factors, including oxygenation index, basophil counts, aspartate aminotransferase, gender, magnesium, gamma-glutamyl transpeptidase, platelet counts, activated partial thromboplastin time, oxygen saturation, body temperature, and days after symptom onset, to predict COVID-19 disease development. They have used a genetic algorithm (GA) for feature selection and a support vector machine (SVM) model to make the predictions. Yan et al. [46] have proposed an XGBoost machine-learning model to predict critically ill patients using lactic dehydrogenase (LDH), lymphocyte, and high-sensitivity C-reactive protein (hs-CRP) factors. Ning et al. [31] have developed a deep learning approach to predict COVID-19 patient outcomes using CT images and 130 clinical features, including biochemical and cellular analyses of blood and urine samples. Bai et al. 
[5] have used clinical, laboratory, and CT data to predict COVID-19 malignant progression with different approaches, including the logistic regression model, linear discriminant analysis (LDA), SVM, multilayer perceptron (MLP), and long short-term memory (LSTM) methods. They have proposed a machine-learning-based model for severity prediction that outperforms the logistic regression model. Cheng et al. [9] have applied a random forest (RF) model to forecast ICU transfer within 24 h for hospitalized COVID-19 patients. Al-Najjar and Al-Rousan [2] have studied the effect of various variables, including sex, birth year, country, region, group, infection reason, and confirmation date, on the outcome (death or survival) of a set of COVID-19 patients by applying neural networks. Their results show that infection reason, confirmation date, and region are the most crucial factors in deceased cases, while region, birth year, and confirmation date are the most influential features in surviving patients. Moreover, the least influential factors in deceased cases are sex and group, whereas the least important factors in surviving patients are infection reason and country. Several studies carried out in this field are summarized in Table 1.
Despite their different appearances, COVID-19 severity prediction models have all been developed from a common idea: maximizing accuracy on a predefined training dataset (known patients) leads to higher generalizability on the unknown testing dataset (unseen samples). This means that the accuracy of results is treated as the only factor that determines the generalizability of forecasting models. Although this is a reasonable and common approach, accuracy is not the only factor in making generalizable predictions. The consistency, or stability, of a model's performance is also important for making proper decisions. In other words, a model whose performance varies less is more reliable, which is an important consideration in medical forecasting. Increasing the reliability of medical forecasting models increases patients' chance of survival and makes the treatment process more cost-efficient and time-efficient. The reliability of accuracy is therefore another critical factor in yielding more generalizable and confident medical results, yet it has not been taken into consideration in modeling processes. In general, increasing the reliability of medical results is usually pursued by reducing errors in laboratory tests, equipment errors, and human error. In fact, developing data-driven prediction approaches that maximize the reliability of a model's performance has been largely ignored in the literature. In this paper, we propose a reliability-based approach that maximizes the reliability of accuracy, rather than accuracy itself, to achieve more confident predictions of the severity prognosis of COVID-19 patients.
The main idea of this paper is to quantify the changes in the accuracy of a model's performance and to minimize these changes in order to maximize reliability. In this approach, the variation is measured by the variance function. This implies that as the changes in the model's performance accuracy on the training or validation set decrease, the reliability of the results for the unseen test set increases. To achieve this goal, the classic regression model is chosen to implement the proposed approach. This model has been used in the literature for prediction in various areas, including medicine, engineering, energy, finance, management, and the environment. We briefly describe recent research across a wide range of applications to show the importance and efficiency of this method.
In medicine, Rath et al. [35] applied multiple linear regression (MLR) techniques to predict the next day's trend in active cases of coronavirus disease in Odisha and India. These models achieved remarkable accuracy in COVID-19 recognition. Tang et al. [40] established an MLR model using radial artery pulse wave characteristic parameters to assess vascular aging. Huang et al. [20] presented a K-means-based multiple linear regression model to predict the weekly number of new local chronic obstructive pulmonary disease (COPD) hospitalizations from major air pollutants. This prediction model linking COPD and air pollutants supports early identification and individualized interventions to slow disease progression, and it reduces medical expenditures. The mean absolute percentage error (MAPE) was used to evaluate the model's efficiency.
In engineering, Ciulla and D'Amico [10] developed an MLR method to determine the thermal heating or cooling energy demand of a generic building in any weather condition. The promising results justify the use of MLR as an alternative method that provides an immediate and straightforward tool for solving a complex problem such as building energy balance. Park et al. [33] predicted the hourly heating performance of a large-scale ground source heat pump system with satisfactory accuracy using MLR and artificial neural network (ANN) models. This research demonstrated the advantage of MLR for interpreting the quantitative analysis of factors that influence the performance of the ground source heat pump system. In energy, Çerçi and Hürdoğan [7] designed MLR and ANN models to estimate the dry-bulb temperature and absolute humidity of the process air leaving the process outlet of a desiccant wheel. The coefficient of determination (\(R^2\)), mean absolute error (MAE), and root mean square error (RMSE) criteria were used to determine how consistent the results obtained from the different models were with the manufacturer's data. Khemet and Richman [23] predicted the quantity of air leakage in houses from variables including building geometry, building materials, building age, and local climate using the MLR model. Siavash et al. [38] predicted the turbine power curve and rotor speed of a small wind turbine equipped with a wide range of duct opening angles at any wind speed using MLR and ANN models. Four MLR models of different forms and a multilayer perceptron neural network were presented to estimate the power and rotor angular speed of a wind turbine equipped with a variable shroud. The accuracy of the prediction models was reported using RMSE and \(R^2\) for both the ANN and MLR models. In agriculture, Abrougui et al. [1] evaluated MLR and ANN models for predicting organic potato crop yield from tillage systems and soil properties. 
The results showed that the MLR model estimated crop yield more accurately than the ANN model. Lee et al. [25] used the MLR model to estimate the spatial distribution of soil moisture in South Korea. The coefficients of the MLR model were estimated seasonally, considering the precipitation of the preceding five days. Xie et al. [45] used MLR and random forest regression (RFR) models to estimate soil amylase and urease activities in long-term coastal reclaimed land. Pahlavan-Rad et al. [32] compared the MLR and RFR models for predicting soil infiltration rates in a dry flood plain of eastern Iran. The RMSE and MAE evaluation metrics were similar for the two models. In environment, Stoichev et al. [39] used an innovative MLR model to evaluate metal/metalloid contamination in the surface sediments of a coastal lagoon. Yuchi et al. [47] used MLR and RFR to model indoor air pollution with 87 potential predictor variables from outdoor monitoring data, questionnaires, home assessments, and geographic data sets. Tang et al. [41] developed MLR and support vector machine algorithms to predict the biodegradation rate, a significant process for removing organic chemicals from water, soil, and sediment environments. Amoozad-Khalili et al. [3] investigated the relationship between input costs and the income of wheat production in mechanized and semi-mechanized systems using various MLR models. In finance, Cogoljević et al. [11] applied MLR analysis to determine how the consumer price index, monetary aggregates, discount rate, and exchange rate affect inflation. The results show an acceptable correlation, which indicates a strong correlation between the real and estimated values.
Moreover, Zheng et al. [50] recently used MLR techniques to examine how process conditions (e.g., temperature and duration) and feedstock properties affect product characteristics. According to the \(R^2\) and RMSE, the developed MLR model determined hydrothermal carbonization properties quantitatively with high accuracy. Kern et al. [21] applied many MLR models to predict dry matter during curd treatment. The best models were selected based on the corrected Akaike information criterion (AICc), \(R^2\), and the most parsimonious construction for describing the data set. Kusano et al. [24] developed an MLR analysis to predict tensile properties from several microstructural features for selective laser melted and post heat-treated material. The model showed good predictive accuracy. Rahbari et al. [34] presented the MLR model as a conceptually simple and computationally efficient way of computing thermodynamic derivatives for the analysis of multicomponent systems. Hoang [17] proposed MLR and ANN models for estimating the punching shear capacity of steel fiber reinforced concrete (SFRC) slabs. Experimental results show that MLR can deliver better prediction outcomes than ANN and empirical design equations. Therefore, MLR can be a promising alternative to assist structural engineers in designing structures.
There are two main reasons for employing the classical linear regression model to implement the proposed reliability-based approach. First, because classical linear regression has low complexity, it eliminates the influence of other factors, such as model design and complexity, on generalization power, so that any increase in generalizability originates only from the increase in reliability. Second, the purpose of this paper is to analyze the severity of COVID-19 in addition to forecasting it. Therefore, state-of-the-art models that cannot analyze the relationships between variables have not been considered, and the regression model, a popular method for analysis purposes, has been chosen.
All MLR models in the literature share the same modeling logic: performance accuracy on the training data is maximized in order to achieve maximum accuracy on the test data, that is, the model's generalization ability. Accordingly, generalization ability in this type of model is considered to depend only on performance accuracy. Although accuracy is one of the most important factors affecting a model's generalization ability, it is not the only factor that explains how generalization ability changes. Another factor appears to be the degree of confidence in performance accuracy, in other words, the changes in performance accuracy under different conditions, which are not considered in conventional MLR modeling. In fact, conventional regression modeling rests on the assumption that maximum accuracy on inaccessible data is obtained from the models with the least error on the available data. In this type of regression modeling, the generalization ability of the model, which is the main factor influencing the quality of decisions made in real-world problems, is maximized by maximizing accuracy on the available historical data. However, the reliability of the model and its results is not considered in this process, even though the generalization capability of a model depends simultaneously on its accuracy and on the reliability of that accuracy. In this paper, a new methodology is proposed for multiple linear regression modeling: in contrast to traditionally developed models, the reliability of the constructed model is maximized instead of its accuracy.
To show the effectiveness of the proposed reliability-based regression (RbR) model, it has been applied to predict the severity of COVID-19 disease. A dataset containing clinical findings of 46 patients with COVID-19 symptoms is studied, and the severe cases are predicted by applying the proposed framework. The results indicate the superiority of the proposed RbR model over the classic regression model.
The remainder of this paper is organized as follows. In the next section, the concepts and formulation of the proposed RbR model are presented. In the "Results and discussion" section, the dataset is described, the proposed RbR model is applied to predict the disease severity of the COVID-19 patients in this dataset, and its performance is compared with that of the traditional regression model. Finally, the last section presents the conclusions.
Method
Traditional modeling approaches in medical prediction have all been developed from a common theory, which holds that accuracy on the training set is the only factor that affects the generalizability of models. However, a model's generalizability, which is important when applying the model to real-world problems, depends on both the accuracy and the reliability of its results. In fact, another way to enhance the generalizability of disease diagnosis models is to increase the reliability of the results and the reproducibility of the model's performance. Given the importance of reliable results in the diagnosis and treatment of diseases, this study develops a new reliability-based regression (RbR) model that maximizes reliability rather than accuracy in diagnostic methods. The basic concept of the presented model is to quantify the fluctuations of performance on the training data, or a portion of it (the validation data), and to minimize these fluctuations to ensure higher reliability and generalizability on the test data. Therefore, in the first step, the data is divided into training and testing data, and then a part of the training data is selected as validation data. To achieve maximum reliability, the unknown parameters of the proposed approach are calculated in such a manner that the fluctuations of the model's performance are minimized on the validation data.
In the following, the traditional multiple regression model, a well-known statistical technique in medical applications, is first briefly described, and then the procedure of the proposed reliability-based regression model is explained in detail.
Multiple linear regression is widely used in medical prediction research, especially for modeling and analyzing linear relationships between one output variable, such as disease severity, and one or several input variables, such as patients' attributes. A linear regression model can be written as follows:
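With \(X_{1t}=1\) assumed so that \(\beta _1\) serves as the intercept, a form consistent with the definitions below is:

$$Y_t = \beta_1 X_{1t} + \beta_2 X_{2t} + \cdots + \beta_k X_{kt} + u_t, \qquad t = 1, 2, \ldots, N \tag{1}$$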
where Y represents the output variable, \(X_1, X_2, \ldots , X_k\) are the explanatory (input) variables, \(\beta _1\) is the intercept of the regression line, \(\beta _2\) to \(\beta _k\) are the regression coefficients (slopes), u is the residual term, and N is the number of samples. The ordinary least squares (OLS) technique, which is used to estimate the unknown parameters of the above formula, works by minimizing the sum of squared errors (the differences between actual and predicted values). In other words, OLS is an accuracy-based technique. In contrast, the proposed model is based on the key idea that minimizing the variation of the squared errors maximizes the reliability of predictions. To build this model, a section of the training data set is first set aside as the validation data set. In this paper, the accuracy, measured as the sum of squared errors, is determined for the training data as well as for the training data plus each data point of the validation set, as follows:
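Writing \({\hat{\beta }}_{0j}\) for the coefficients fitted on the N training samples alone (an assumed notation, with \(X_{1t}=1\) for the intercept), the training-data expression would read:

$$\sum_{t=1}^{N} {\hat{u}}_{0t}^2 = \sum_{t=1}^{N} \Big( Y_t - \sum_{j=1}^{k} {\hat{\beta}}_{0j} X_{jt} \Big)^2 \tag{2}$$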
and in the same manner, for each member of the validation data set:
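Assuming the validation points are appended in order as samples \(N+1, \ldots , N+n\), and writing \({\hat{\beta }}_{ij}\) for the coefficients fitted on the first \(N+i\) samples (an assumed notation), a consistent form is:

$$\sum_{t=1}^{N+i} {\hat{u}}_{it}^2 = \sum_{t=1}^{N+i} \Big( Y_t - \sum_{j=1}^{k} {\hat{\beta}}_{ij} X_{jt} \Big)^2, \qquad i = 1, 2, \ldots, n \tag{3}$$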
where \(\sum _{t=1}^{N+i} {\hat{u}}_{it}^2\) for \(i=0,1,\ldots ,n\) and \(t=1,2,\ldots ,N+i\) are the residual sums of squares (RSS), and n is the size of the validation dataset. The unknown parameters for each data point, \(\beta _{ij} \quad i=0,1,\ldots ,n \quad j=1,2,\ldots ,k\), are determined in such a way that \(\sum {\hat{u}}_{it}^2\) is minimized [13, 22]. This is done by partially differentiating each equation with respect to the parameters for each data point and setting the results to zero. The process yields k simultaneous equations in k unknowns for each data point, as follows. For the training data:
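Setting the partial derivatives of the training-data RSS to zero gives the usual normal equations; with \({\hat{\beta }}_{0j}\) denoting the training-only coefficients (an assumed notation), they can be sketched as:

$$\sum_{t=1}^{N} X_{jt} \Big( Y_t - \sum_{j'=1}^{k} {\hat{\beta}}_{0j'} X_{j't} \Big) = 0, \qquad j = 1, 2, \ldots, k \tag{4}$$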
and in the same way, for the first data point of the validation data set:
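With \({\hat{\beta }}_{1j}\) denoting the coefficients fitted on the training data plus the first validation point (an assumed notation), the corresponding normal equations would be:

$$\sum_{t=1}^{N+1} X_{jt} \Big( Y_t - \sum_{j'=1}^{k} {\hat{\beta}}_{1j'} X_{j't} \Big) = 0, \qquad j = 1, 2, \ldots, k \tag{5}$$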
For the last data point of the validation dataset, we have:
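With \({\hat{\beta }}_{nj}\) denoting the coefficients fitted on the training data plus all n validation points (an assumed notation), a consistent form is:

$$\sum_{t=1}^{N+n} X_{jt} \Big( Y_t - \sum_{j'=1}^{k} {\hat{\beta}}_{nj'} X_{j't} \Big) = 0, \qquad j = 1, 2, \ldots, k \tag{6}$$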
To construct the RbR model with the minimum deviation of squared errors across validation samples, the unknown parameters of all accuracy-based regression lines must be equal. Thus, we have:
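In terms of the per-subset coefficients \({\hat{\beta }}_{ij}\) (fitted on the first \(N+i\) samples; an assumed notation), the equality constraint can be written as:

$$ {\hat{\beta}}_{0j} = {\hat{\beta}}_{1j} = \cdots = {\hat{\beta}}_{nj} = {\hat{\beta}}e_{j}, \qquad j = 1, 2, \ldots, k \tag{7}$$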
where \({\hat{\beta }}e_{j}\) is the jth parameter of the RbR model. Eventually, Eqs. (4–6) can be written as follows:
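Summing the normal equations over \(i = 0, 1, \ldots , n\) with the common parameters substituted gives a form that agrees with the definitions of \(A_{j,j'}\) and \(B_{j}\) given below (a reconstruction, not necessarily the original layout):

$$\sum_{i=0}^{n} \sum_{t=1}^{N+i} X_{jt} \Big( Y_t - \sum_{j'=1}^{k} {\hat{\beta}}e_{j'} X_{j't} \Big) = 0, \qquad j = 1, 2, \ldots, k \tag{8}$$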
The equations are presented in a matrix format as follows:
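Using the quantities \(A_{j,j'}\) and \(B_{j}\) defined below, the system can be sketched in matrix form as:

$$\begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,k} \\ A_{2,1} & A_{2,2} & \cdots & A_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ A_{k,1} & A_{k,2} & \cdots & A_{k,k} \end{bmatrix} \begin{bmatrix} {\hat{\beta}}e_{1} \\ {\hat{\beta}}e_{2} \\ \vdots \\ {\hat{\beta}}e_{k} \end{bmatrix} = \begin{bmatrix} B_{1} \\ B_{2} \\ \vdots \\ B_{k} \end{bmatrix} \tag{9}$$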
Finally, the unknown parameters of the RbR model can be obtained by solving Eq. (9). For instance, in a three-variable model, the parameters are estimated as follows:
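Assuming \(k=3\) and that the \(3 \times 3\) matrix of \(A_{j,j'}\) values is invertible, one compact way to write the solution is shown here; an expanded closed form may have been intended instead:

$$\begin{bmatrix} {\hat{\beta}}e_{1} \\ {\hat{\beta}}e_{2} \\ {\hat{\beta}}e_{3} \end{bmatrix} = \begin{bmatrix} A_{1,1} & A_{1,2} & A_{1,3} \\ A_{2,1} & A_{2,2} & A_{2,3} \\ A_{3,1} & A_{3,2} & A_{3,3} \end{bmatrix}^{-1} \begin{bmatrix} B_{1} \\ B_{2} \\ B_{3} \end{bmatrix} \tag{10}$$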
where \(A_{j,j'}=\sum _{i=0}^{n}\sum _{t=1}^{N+i} X_{jt} X_{j't}\) for \(j, j' = 1,2,\ldots , k\), and \(B_{j}=\sum _{i=0}^{n}\sum _{t=1}^{N+i} X_{jt} Y_{t}\) for \(j = 1,2,\ldots , k\).
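The estimation step can be sketched numerically as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `rbr_fit` is hypothetical, the first column of the design matrix is assumed to be all ones (the intercept), and the validation rows are assumed to be appended to the training rows in order, so that \(A\) and \(B\) accumulate over the nested sets of sizes \(N, N+1, \ldots , N+n\).

```python
import numpy as np


def rbr_fit(X_train, y_train, X_val, y_val):
    """Solve A @ beta = B, with A and B summed over nested data sets.

    Hypothetical helper: rows of X are samples, columns are the k regressors
    (first column assumed to be ones). Set i contains the N training rows
    plus the first i validation rows, for i = 0, 1, ..., n.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_val = np.asarray(X_val, dtype=float)
    y_val = np.asarray(y_val, dtype=float)

    N, k = X_train.shape
    n = X_val.shape[0]
    X_all = np.vstack([X_train, X_val])
    y_all = np.concatenate([y_train, y_val])

    A = np.zeros((k, k))
    B = np.zeros(k)
    for i in range(n + 1):
        Xi = X_all[: N + i]   # training rows plus first i validation rows
        yi = y_all[: N + i]
        A += Xi.T @ Xi        # accumulates sum_t X_jt * X_j't
        B += Xi.T @ yi        # accumulates sum_t X_jt * Y_t
    return np.linalg.solve(A, B)
```

Because the nested sums count each training row \(n+1\) times and the i-th validation row \(n-i+1\) times, solving this system is equivalent to a weighted least-squares fit with those weights; with an empty validation set it reduces to ordinary least squares.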
Results and discussion
In this study, we use the clinical features of 46 COVID-19 patients that were used by Li et al. [27]. The dataset contains more than 300 samples, with several samples per patient taken on different days, covering 105 different tests based on clinical reports. The dataset includes 10 severe and 36 mild patients. These patients visited the People's Hospital of Yicheng City, China, between January 16, 2020, and March 4, 2020, and were diagnosed with COVID-19. The severe group consists of 6 males and 4 females, and the mild group of 19 males and 17 females. The mean age of the patients is 48.6 years; the mean ages in the severe and non-severe groups are 56.8 and 46.5 years, respectively [27]. Because of the large amount of missing data, 28 factors were omitted, and for some factors with fewer missing values, the missing data were replaced with mean values. After normalization and data preprocessing, a group of 50 factors was selected to analyze and predict the severity of COVID-19 patients (the output variable) using the proposed reliability-based regression and classic accuracy-based regression models. Table 2 summarizes the list of independent variables (clinical factors). The download link for this data set is provided in the Availability of Data and Materials section.
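The preprocessing described above (mean imputation followed by normalization) could look like the following sketch. The text does not state which normalization was used; min-max scaling to [0, 1] is an assumption here, and `impute_and_scale` is a hypothetical helper name.

```python
import numpy as np


def impute_and_scale(X):
    """Replace missing values (NaN) with column means, then min-max scale.

    Assumption: min-max scaling stands in for the unspecified normalization.
    Columns that are entirely NaN are not handled and would stay NaN.
    """
    X = np.asarray(X, dtype=float).copy()
    col_mean = np.nanmean(X, axis=0)        # per-column mean, ignoring NaN
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_mean[cols]          # mean imputation

    lo = X.min(axis=0)
    hi = X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero
    return (X - lo) / span
```

Constant columns are mapped to zero by this sketch; in practice such columns would carry no information and could be dropped.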
In the first step, we use the proposed RbR model to analyze the variables that affect the disease severity of COVID-19 patients and compute their coefficients using the equations presented in the "Method" section. The results for all clinical variables are presented in Table 3. As shown, the \(R^2\) of the reliability-based model using all variables listed in Table 3 is more than 82%. To interpret the reliability-based regression coefficients and identify the most important risk factors, multicollinearity effects must be eliminated. Therefore, to analyze the relationships between the severity of COVID-19 patients and the clinical variables, in each group of highly correlated variables we keep the variable with the highest correlation to the dependent variable and remove the others. The result of fitting the RbR model between the severity of COVID-19 patients and the selected clinical variables is shown in Table 4. The results indicate that the clinical features remaining in the model explain more than 67% of the variation in the severity of COVID-19 patients. According to the results of the RbR model, the p-value is statistically significant (lower than 0.05) for the explanatory variables X12 (CRP), X13 (CysC), X18 (GGT), X21 (Hb), X23 (LDH), X25 (Lymph%), and X36 (PT). Table 3 indicates that the largest positive reliability-based coefficients belong to X23 (LDH), X13 (CysC), X36 (PT), X18 (GGT), and X12 (CRP), in that order, which means that, according to the RbR model, the levels of these factors increase in severe cases of COVID-19. In contrast, the variables X25 (Lymph%) and X21 (Hb) have negative coefficients, which indicates that the levels of these factors decrease in severe cases. The results are consistent with recent research showing elevated levels of LDH, CysC, PT, GGT, and CRP and lower lymphocyte percentage and hemoglobin in severe cases of COVID-19 [4, 26, 27, 28, 43]. 
This means that the RbR model, in addition to quantifying the fluctuations in model accuracy and minimizing them to maximize the reliability of the results, also yields logical effects of the influencing factors on the severity of COVID-19 patients. In the second step, after analyzing the variables that affect the severity of COVID-19 patients, we implement the reliability-based model to predict COVID-19 disease severity. All of the clinical factors are used in the prediction model. To build the prediction model, the dataset is first divided into a training set (80% of samples) and a testing set (20% of samples). Then, a part of the training data (10%) is used for validation and for obtaining the unknown parameters based on the formulation presented in the "Background" section. Because the validation data are selected in a specific way, and to remove any effect of a particular data selection on the model's performance, the procedure has been performed more than 100 times, each time with a different validation dataset.
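The data-splitting procedure described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function name, the fixed seed, random sampling as the selection rule, and splitting by sample rather than by patient are all hypothetical, and 300 samples is used only as a round illustrative number.

```python
import random


def repeated_validation_splits(n_samples, n_repeats=100, seed=0):
    """Return n_repeats (train, validation, test) index triples.

    The 80/20 train/test split is drawn once; the validation subset
    (10% of the training data) is redrawn on every repeat.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(0.2 * n_samples)
    test, train_full = indices[:n_test], indices[n_test:]
    n_val = max(1, round(0.1 * len(train_full)))
    splits = []
    for _ in range(n_repeats):
        # Draw a fresh validation subset from the training portion.
        validation = rng.sample(train_full, n_val)
        held_out = set(validation)
        train = [i for i in train_full if i not in held_out]
        splits.append((train, validation, test))
    return splits


splits = repeated_validation_splits(300, n_repeats=100)
train, validation, test = splits[0]
print(len(train), len(validation), len(test))
```

Averaging a performance measure over the repeats, and inspecting its spread, is one way to see how strongly the results depend on the particular validation subset.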
To assess the performance of the presented model, it is compared with the traditional regression model according to the accuracy metric, i.e., the ratio of correctly predicted samples to the total number of samples. The results achieved by the proposed RbR and the classic regression models are provided in Table 5 and Fig. 1. The performance results demonstrate that the proposed reliability-based approach, with 98.6% sensitivity, 88.2% specificity, and 93.10% accuracy, is more efficient than its accuracy-based rival and can predict severe COVID-19 patients with more validity. Therefore, the proposed RbR model provides more accurate results in distinguishing between severe and mild cases of COVID-19. In addition, the ROC curve in Fig. 2 and its analysis in Table 6 show that the proposed RbR model, with a higher area under the curve (AUC), performs better than the classic regression model. The empirical results illustrate the importance of considering reliability in predicting disease severity in COVID-19 patients and are important in two respects. First, because the proposed model minimizes performance fluctuations, it can guarantee the reliability of predictions, which matters especially in medical decision making, where stable and reliable results are required rather than merely accurate ones. Second, the results show that the proposed reliability-based approach not only increases the reliability and stability of the results in medical decisions but also yields more accurate results than the classic accuracy-based regression method. Hence, the proposed RbR model not only solves the problem of unreliable results in traditional accuracy-based models but also improves the accuracy of such models, so it can be a useful alternative to classic prediction models for making reliable and accurate medical decisions.
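For reference, the evaluation measures named above can be computed as in the following sketch. It assumes that the severe class is coded as 1 and the mild class as 0, that a regression output is turned into a class label with a 0.5 cut-off, and that both classes are present; these conventions, the function names, and the toy scores are illustrative assumptions rather than details taken from the paper.

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy, with 1 = severe and 0 = mild (assumed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),      # share of severe cases detected
        "specificity": tn / (tn + fp),      # share of mild cases detected
        "accuracy": (tp + tn) / len(pairs),  # correct predictions / all samples
    }


def auc(y_true, scores):
    """Area under the ROC curve as the probability that a random severe case
    receives a higher score than a random mild case (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example with made-up regression outputs for 3 severe and 5 mild samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cut-off is an assumption
metrics = classification_metrics(y_true, y_pred)
area = auc(y_true, scores)
print(metrics, area)
```

The pairwise definition of AUC used here is equivalent to the area under the empirical ROC curve and avoids having to trace the curve threshold by threshold.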
Conclusion
The accuracy of prediction models plays a critical role in forecasting the severity of COVID-19, but it is not the only factor that determines the generalizability of the models. The reliability of, and confidence in, that accuracy is another crucial factor that must be considered in modeling and forecasting the severity of COVID-19 patients. In this study, we have proposed a novel modeling approach to consider and maximize the reliability of the accuracy in predicting the severity of COVID-19 patients. For this purpose, the classic regression model, a fundamental and common statistical method in disease prediction, is used as the basic model. To show the generalization power of the proposed RbR model, we have applied it to a real-world dataset. The results imply that the proposed approach has not only increased the reliability of the results but has also provided logical results about the factors affecting the severity of COVID-19 patients and has yielded more accurate results than the classic accuracy-based regression model. The main contribution of the paper is the mathematical formulation of the proposed model, which is then used to analyze and forecast the severity of COVID-19 patients. The results of the suggested RbR model show the importance of the reliability effect on the generalization power of the classic regression model. For future work, applying the RbR model to other datasets on the severity of COVID-19 patients is suggested. The reliability-based approach can also be implemented on other types of existing models, including different statistical or artificial intelligence forecasting models.
Availability of data and materials
The dataset used and analysed during the current study is publicly available from the link provided and also from the corresponding author.
Abbreviations
AUC: Area under the curve
CHD: Coronary heart disease
CRP: C-reactive protein
CSS: Chest severity score
CT: Computed tomography
CVD: Cerebrovascular disease
GA: Genetic algorithm
LDA: Linear discriminant analysis
LSTM: Long short term memory
MLP: Multilayer perceptron
MLR: Multiple linear regression
MRMR: Minimum redundancy and maximum relevance
NLR: Neutrophil to lymphocyte ratio
OLS: Ordinary least square
PLR: Platelet to lymphocyte ratio
RbR: Reliability-based regression
RSS: Residual sum of squares
SII: Systematic immune-inflammation
SVM: Support vector machine
References
Abrougui K, Gabsi K, Mercatoris B, Khemis C, Amami R, Chehaibi S. Prediction of organic potato yield using tillage systems and soil properties by artificial neural network (ANN) and multiple linear regressions (MLR). Soil Tillage Res. 2019;190:202–8.
Al-Najjar H, Al-Rousan N. A classifier prediction model to predict the status of coronavirus COVID-19 patients in South Korea. Eur Rev Med Pharmacol Sci. 2020;24(6):3400–3.
Amoozad-Khalili M, Rostamian R, Esmaeilpour-Troujeni M, Kosari-Moghaddam A. Economic modeling of mechanized and semi-mechanized rainfed wheat production systems using multiple linear regression model. Inf Process Agric. 2020;7(1):30–40.
Anai M, Akaike K, Iwagoe H, Akasaka T, Higuchi T, Miyazaki A, Naito D, Tajima Y, Takahashi H, Komatsu T, et al. Decrease in hemoglobin level predicts increased risk for severe respiratory failure in COVID-19 patients with pneumonia. Respir Investig. 2021;59(2):187–93.
Bai X, Fang C, Zhou Y, Bai S, Liu Z, Xia L, Chen Q, Xu Y, Xia T, Gong S, et al. Predicting COVID-19 malignant progression with AI techniques. J Clin Med. 2020;9(6):1668.
Bi X, Su Z, Yan H, Du J, Wang J, Chen L, Peng M, Chen S, Shen B, Li J. Prediction of severe illness due to COVID-19 based on an analysis of initial fibrinogen to albumin ratio and platelet count. Platelets. 2020;31(5):674–9.
Çerçi KN, Hürdoğan E. Comparative study of multiple linear regression (MLR) and artificial neural network (ANN) techniques to model a solid desiccant wheel. Int Commun Heat Mass Transf. 2020;116:104713.
Chen R, Liang W, Jiang M, Guan W, Zhan C, Wang T, Tang C, Sang L, Liu J, Ni Z, et al. Risk factors of fatal outcome in hospitalized subjects with coronavirus disease 2019 from a nationwide analysis in China. Chest. 2020;158(1):97–105.
Cheng FY, Joshi H, Tandon P, Freeman R, Reich DL, Mazumdar M, Kohli-Seth R, Levin MA, Timsina P, Kia A. Using machine learning to predict ICU transfer in hospitalized COVID-19 patients. J Clin Med. 2020;9(6):1668.
Ciulla G, D'Amico A. Building energy performance forecasting: a multiple linear regression approach. Appl Energy. 2019;253:113500.
Cogoljević D, Gavrilović M, Roganović M, Matić I, Piljan I. Analyzing of consumer price index influence on inflation by multiple linear regression. Physica A. 2018;505:941–4.
Dong YM, Sun J, Li YX, Chen Q, Liu QQ, Sun Z, Pang R, Chen F, Xu BY, Manyande A, et al. Development and validation of a nomogram for assessing survival in patients with COVID-19 pneumonia. Clin Infect Dis. 2021;72(4):652–60.
Etemadi S, Khashei M. Etemadi multiple linear regression. Measurement. 2021;186:110080.
Francone M, Iafrate F, Masci GM, Coco S, Cilia F, Manganaro L, Panebianco V, Andreoli C, Colaiacomo MC, Zingaropoli MA, et al. Chest CT score in COVID-19 patients: correlation with disease severity and short-term prognosis. Eur Radiol. 2020;30(12):6808–17.
Gallo Marin B, Aghagoli G, Lavine K, Yang L, Siff EJ, Chiang SS, Salazar-Mather TP, Dumenco L, Savaria MC, Aung SN, et al. Predictors of COVID-19 severity: a literature review. Rev Med Virol. 2021;31(1):1–10.
Hajiahmadi S, Shayganfar A, Janghorbani M, Esfahani MM, Mahnam M, Bakhtiarvand N, Sami R, Khademi N, Dehghani M. Chest computed tomography severity score to predict adverse outcomes of patients with COVID-19. Infect Chemother. 2021;53(2):308.
Hoang ND. Estimating punching shear capacity of steel fibre reinforced concrete slabs using sequential piecewise multiple linear regression and artificial neural network. Measurement. 2019;137:58–70.
Homayounieh F, Ebrahimian S, Babaei R, Mobin HK, Zhang E, Bizzo BC, Mohseni I, Digumarthy SR, Kalra MK. CT radiomics, radiologists, and clinical information in predicting outcome of patients with COVID-19 pneumonia. Radiol Cardiothorac Imaging. 2020;2(4):e200322.
Huang H, Cai S, Li Y, Li Y, Fan Y, Li L, Lei C, Tang X, Hu F, Li F, et al. Prognostic factors for COVID-19 pneumonia progression to severe symptoms based on earlier clinical features: a retrospective analysis. Front Med. 2020;7:643.
Huang ZY, Lin S, Long LL, Cao JY, Luo F, Qin WC, Sun DM, Gregersen H. Predicting the morbidity of chronic obstructive pulmonary disease based on multiple locally weighted linear regression model with k-means clustering. Int J Med Inform. 2020;139:104141.
Kern C, Stefan T, Hinrichs J. Multiple linear regression modeling: prediction of cheese curd dry matter during curd treatment. Food Res Int. 2019;121:471–8.
Khashei M, Bakhtiarvand N, Etemadi S. A novel reliability-based regression model for medical modeling and forecasting. Diabetes Metab Syndr Clin Res Rev. 2021;15(6):102331.
Khemet B, Richman R. A univariate and multiple linear regression analysis on a national fan (de) pressurization testing database to predict airtightness in houses. Build Environ. 2018;146:88–97.
Kusano M, Miyazaki S, Watanabe M, Kishimoto S, Bulgarevich DS, Ono Y, Yumoto A. Tensile properties prediction by multiple linear regression analysis for selective laser melted and post heat-treated Ti-6Al-4V with microstructural quantification. Mater Sci Eng A. 2020;787:139549.
Lee Y, Jung C, Kim S. Spatial distribution of soil moisture estimates using a multiple linear regression model and Korean geostationary satellite (coms) data. Agric Water Manag. 2019;213:580–93.
Li C, Ye J, Chen Q, Hu W, Wang L, Fan Y, Lu Z, Chen J, Chen Z, Chen S, et al. Elevated lactate dehydrogenase (LDH) level as an independent risk factor for the severity and mortality of COVID-19. Aging (Albany NY). 2020;12(15):15670.
Li D, Zhang Q, Tan Y, Feng X, Yue Y, Bai Y, Li J, Li J, Xu Y, Chen S, et al. Prediction of COVID-19 severity using chest computed tomography and laboratory measurements: evaluation using a machine learning approach. JMIR Med Inform. 2020;8(11):e21604.
Lu C, Liu Y, Chen B, Yang H, Hu H, Zhao Y. Prognostic value of lymphocyte count in severe COVID-19 patients with corticosteroid treatment. Signal Transduct Target Ther. 2021;6(1):1–3.
Matos J, Paparo F, Mussetto I, Bacigalupo L, Veneziano A, Bernardi SP, Biscaldi E, Melani E, Antonucci G, Cremonesi P, et al. Evaluation of novel coronavirus disease (COVID-19) using quantitative lung CT and clinical data: prediction of short-term outcome. Eur Radiol Exp. 2020;4(1):1–10.
McRae MP, Simmons GW, Christodoulides NJ, Lu Z, Kang SK, Fenyo D, Alcorn T, Dapkins IP, Sharif I, Vurmaz D, et al. Clinical decision support tool and rapid point-of-care platform for determining disease severity in patients with COVID-19. Lab Chip. 2020;20(12):2075–85.
Ning W, Lei S, Yang J, Cao Y, Jiang P, Yang Q, Zhang J, Wang X, Chen F, Geng Z, et al. Open resource of clinical data from patients with pneumonia for the prediction of COVID-19 outcomes via deep learning. Nat Biomed Eng. 2020;4(12):1197–207.
Pahlavan-Rad MR, Dahmardeh K, Hadizadeh M, Keykha G, Mohammadnia N, Gangali M, Keikha M, Davatgar N, Brungard C. Prediction of soil water infiltration using multiple linear regression and random forest in a dry flood plain, Eastern Iran. CATENA. 2020;194:104715.
Park SK, Moon HJ, Min KC, Hwang C, Kim S. Application of a multiple linear regression and an artificial neural network model for the heating performance analysis and hourly prediction of a large-scale ground source heat pump system. Energy Build. 2018;165:206–15.
Rahbari A, Josephson TR, Sun Y, Moultos OA, Dubbeldam D, Siepmann JI, Vlugt TJ. Multiple linear regression and thermodynamic fluctuations are equivalent for computing thermodynamic derivatives from molecular simulation. Fluid Phase Equilib. 2020;523:112785.
Rath S, Tripathy A, Tripathy AR. Prediction of new active cases of coronavirus disease (COVID-19) pandemic using multiple linear regression model. Diabetes Metab Syndr Clin Res Rev. 2020;14(5):1467–74.
Rokni M, Ahmadikia K, Asghari S, Mashaei S, Hassanali F. Comparison of clinical, paraclinical and laboratory findings in survived and deceased patients with COVID-19: diagnostic role of inflammatory indications in determining the severity of illness. BMC Infect Dis. 2020;20(1):1–11.
Shi W, Peng X, Liu T, Cheng Z, Lu H, Yang S, Zhang J, Wang M, Gao Y, Shi Y, et al. A deep learning-based quantitative computed tomography model for predicting the severity of COVID-19: a retrospective study of 196 patients. Ann Transl Med. 2021;9(3):216–28.
Siavash NK, Ghobadian B, Najafi G, Rohani A, Tavakoli T, Mahmoodi E, Mamat R, et al. Prediction of power generation and rotor angular speed of a small wind turbine equipped to a controllable duct using artificial neural network and multiple linear regression. Environ Res. 2021;196:110434.
Stoichev T, Coelho JP, De Diego A, Valenzuela MGL, Pereira ME, de Chanvalon AT, Amouroux D. Multiple regression analysis to assess the contamination with metals and metalloids in surface sediments (Aveiro Lagoon, Portugal). Mar Pollut Bull. 2020;159:111470.
Tang Q, Huang L, Pan Z. Multiple linear regression model for vascular aging assessment based on radial artery pulse wave. Eur J Integr Med. 2019;28:92–7.
Tang W, Li Y, Yu Y, Wang Z, Xu T, Chen J, Lin J, Li X. Development of models predicting biodegradation rate rating with multiple linear regression and support vector machine algorithms. Chemosphere. 2020;253:126666.
Wei W, Hu XW, Cheng Q, Zhao YM, Ge YQ. Identification of common and severe COVID-19: the value of CT texture analysis and correlation with clinical characteristics. Eur Radiol. 2020;30(12):6788–96.
Wu MY, Yao L, Wang Y, Zhu XY, Wang XF, Tang PJ, Chen C. Clinical evaluation of potential usefulness of serum lactate dehydrogenase (LDH) in 2019 novel coronavirus (COVID-19) pneumonia. Respir Res. 2020;21(1):1–6.
Xiao J, Li X, Xie Y, Huang Z, Ding Y, Zhao S, Yang P, Du D, Liu B, Wang X. Maximum chest CT score is associated with progression to severe illness in patients with COVID-19: a retrospective study from Wuhan, China. BMC Infect Dis. 2020;20(1):1–11.
Xie X, Wu T, Zhu M, Jiang G, Xu Y, Wang X, Pu L. Comparison of random forest and multiple linear regression models for estimation of soil extracellular enzyme activities in agricultural reclaimed coastal saline land. Ecol Ind. 2021;120:106925.
Yan L, Zhang HT, Xiao Y, Wang M, Sun C, Liang J, Li S, Zhang M, Guo Y, Xiao Y, et al. Prediction of survival for severe COVID-19 patients with three clinical features: development of a machine learning-based prognostic model with clinical data in Wuhan. medRxiv. 2020.
Yuchi W, Gombojav E, Boldbaatar B, Galsuren J, Enkhmaa S, Beejin B, Naidan G, Ochir C, Legtseg B, Byambaa T, et al. Evaluation of random forest regression and multiple linear regression for predicting indoor fine particulate matter concentrations in a highly polluted city. Environ Pollut. 2019;245:746–53.
Zhang C, Qin L, Li K, Wang Q, Zhao Y, Xu B, Liang L, Dai Y, Feng Y, Sun J, et al. A novel scoring system for prediction of disease severity in COVID-19. Front Cell Infect Microbiol. 2020;10:318.
Zhang S, Guo M, Duan L, Wu F, Hu G, Wang Z, Huang Q, Liao T, Xu J, Ma Y, et al. Development and validation of a risk factor-based system to predict short-term survival in adult hospitalized patients with COVID-19: a multicenter, retrospective, cohort study. Crit Care. 2020;24(1):1–13.
Zheng X, Jiang Z, Ying Z, Song J, Chen W, Wang B. Role of feedstock properties and hydrothermal carbonization conditions on fuel properties of sewage sludge-derived hydrochar using multiple linear regression technique. Fuel. 2020;271:117609.
Zhou K, Sun Y, Li L, Zang Z, Wang J, Li J, Liang J, Zhang F, Zhang Q, Ge W, et al. Eleven routine clinical features predict COVID-19 severity uncovered by machine learning of longitudinal measurements. Comput Struct Biotechnol J. 2021;19:3640–9.
Zhou S, Chen C, Hu Y, Lv W, Ai T, Xia L. Chest CT imaging features and severity scores as biomarkers for prognostic prediction in patients with COVID-19. Ann Transl Med. 2020;8(21).
Zhou Y, He Y, Yang H, Yu H, Wang T, Chen Z, Yao R, Liang Z. Development and validation a nomogram for predicting the risk of severe COVID-19: a multicenter study in Sichuan, China. PLoS ONE. 2020;15(5):e0233328.
Acknowledgements
The authors gratefully acknowledge the support from Isfahan University of Technology and Avicenna Center of Excellence (ACE) under the Isfahan Research Network program.
Funding
The research costs are supported by Research and Technology Affairs of Isfahan University of Technology, No. 3135. The funding bodies did not play any role in the design of the study, analysis, and interpretation of data and in writing the manuscript.
Author information
Contributions
NB contributed significantly to data preprocessing and the design of the paper. NB and MK designed the RbR model and analyzed the experimental results. MM contributed to the conception of the study and to reviewing the paper. SH provided useful insights on the results through constructive discussions. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
This study was conducted according to the principles of the Declaration of Iran, and informed consent was obtained from all participants. The study protocol was approved by the Isfahan University of Medical Sciences Ethics Committee (reference number IR.MUI.MED.REC.1400.238).
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Bakhtiarvand, N., Khashei, M., Mahnam, M. et al. A novel reliability-based regression model to analyze and forecast the severity of COVID-19 patients. BMC Med Inform Decis Mak 22, 123 (2022). https://doi.org/10.1186/s12911-022-01861-2
DOI: https://doi.org/10.1186/s12911-022-01861-2
Keywords
COVID-19
Multiple linear regression (MLR)
Forecasting and modeling
Reliability and accuracy
Data analysis
Disease severity