
Diagnostic performance of classification trees and hematological functions in hematologic disorders: an application of multidimensional scaling and cluster analysis

Abstract

Background

Several hematological indices have been already proposed to discriminate between iron deficiency anemia (IDA) and β‐thalassemia trait (βTT). This study compared the diagnostic performance of different hematological discrimination indices with decision trees and support vector machines, so as to discriminate IDA from βTT using multidimensional scaling and cluster analysis. In addition, decision trees were used to determine the diagnostic classification scheme of patients.

Methods

This cross-sectional study included 1178 patients with hypochromic microcytic anemia (708 patients with βTT and 470 patients with IDA). The diagnostic performance of 43 hematological discrimination indices was compared with that of classification tree algorithms and support vector machines for discriminating IDA from βTT. Moreover, multidimensional scaling and cluster analysis were used to identify homogeneous subgroups of discrimination methods with similar performance.

Results

All the classification tree algorithms except the LOTUS tree algorithm showed acceptable accuracy measures for discrimination between IDA and βTT in comparison with the hematological discrimination indices. The results indicated that the CRUISE and C5.0 tree algorithms had better diagnostic performance and efficiency than the other discrimination methods. The AUC values of the CRUISE and C5.0 tree algorithms (0.940 and 0.999) indicated excellent diagnostic accuracy. Moreover, both algorithms showed that mean corpuscular volume can be considered the main variable in discrimination between IDA and βTT.

Conclusions

CRUISE and C5.0 tree algorithms as powerful methods in data mining techniques can be used to develop accurate differential methods along with other laboratory parameters for the discrimination of IDA and βTT. In addition, the multidimensional scaling method and cluster analysis can be considered as the most appropriate techniques to determine the discrimination indices with similar performance for future hematological studies.


Background

Microcytic anemia, a predominant hematologic disorder, is the most common form of anemia. IDA and βTT are its two most common types [1, 2]. The discrimination between IDA and βTT is a vital issue in hematology studies [3, 4]. IDA is a prevalent disorder worldwide, whereas βTT is predominant in the Mediterranean region [5,6,7,8,9,10].

Discriminating between these two hematologic disorders is necessary to prevent iron overload and its complications caused by misdiagnosis and inaccurate treatment, and to determine the prenatal causes of hemoglobin chain disorders. However, the differential diagnosis of IDA from βTT is a major challenge because the two disorders present similar experimental conditions [3, 11, 12].

In addition to the complete blood count (CBC), different tests have been used to differentiate precisely between IDA and βTT; however, they are time-consuming and expensive. The definitive diagnosis of βTT is based on an increase in HbA2 (hemoglobin A2), whereas that of IDA is based on an increase in TIBC (total iron binding capacity) together with a decrease in serum iron and serum ferritin [4, 11, 13,14,15,16].

Due to the importance of discriminating between these types of anemia, various studies have been conducted since 1973 to identify appropriate, rapid, and low-cost differential indices for discriminating between IDA and βTT [17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41]. The existing gaps in the literature about hematological indices showed that each hematological index only includes one or some specific blood parameters. In addition, some indices like Nishad [33] and Matos and Carvalho [41] are suggested based on the parametric statistical model like the discriminant analysis. However, this parametric model needs different assumptions (multivariate normality and equality of covariance matrices) and violation of these assumptions affects the results [42].

Recently, the accessibility of powerful statistical software programs has paved the way for the application of advanced statistical models such as data mining techniques in the differential diagnosis of IDA from βTT. However, few studies have already employed such advanced statistical methods and data mining techniques for differential diagnosis of hematological data [40, 43,44,45,46,47,48,49,50,51,52]. Therefore, this study was intended to compare tree algorithms as powerful machine-learning methods and support vector machines (SVM) with hematological indices in differentiation between IDA and βTT. Tree-based methods can determine homogeneous subgroups of patients needing different treatment strategies or diagnostic tests, making these methods useful for subgroup analysis [53,54,55,56].

Tree-based methods are nonparametric and need no assumptions about the functional form of the data. Besides, they can deal with high-dimensional datasets, high-order interactions, and nonlinear relationships. These methods are invariant to monotone transformations of predictor variables, and are robust to outliers, missing values, and multicollinearity. These algorithms can identify the cutoff points of important predictors to discriminate the patients. In addition, tree algorithms are easy to interpret as they display results graphically, making the results understandable without requiring statistical experience. These methods can also assist the clinician in decision making [57,58,59,60,61,62].

CART (Classification and Regression Tree) algorithm is the best-known classic tree algorithm [63], though it suffers from some problems like greediness and bias in split rule selection. Tree generating in CART is based on the greedy search algorithm, and this search cannot find a global optimum [64]. The splitting method in CART is biased toward independent variables with more distinct values [65, 66]. Several tree algorithms are proposed to solve the problems of the CART algorithm. In turn, Evtree algorithm (Evolutionary learning of globally optimal classification and regression trees) [64] has been proposed to solve the greediness problem. Tree algorithms like Quick, Unbiased and Efficient Statistical Tree (QUEST) [67], Classification Rule with Unbiased Interaction Selection and Estimation (CRUISE) [68], Generalized, Unbiased, Interaction Detection and Estimation (GUIDE) [69], Conditional Inference Trees (Ctree) [70], and Logistic Tree with Unbiased Selection (LOTUS) [62] are, in turn, suggested to solve the bias in split rule selection problem.

This study aimed to compare the diagnostic performance of the CART algorithm, the tree algorithms proposed to remedy its disadvantages, and SVM with that of hematological discrimination indices in discriminating between IDA and βTT. The comparison used accuracy measures such as true positive rate (TPR or sensitivity), true negative rate (TNR or specificity), false positive rate (FPR), false negative rate (FNR), accuracy, Youden’s index, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (PLR), negative likelihood ratio (NLR), diagnostic odds ratio (DOR), F-measure, and area under the curve (AUC).

Besides, the multidimensional scaling and cluster analysis were applied to extract homogeneous subgroups of hematological discriminating indices and classification tree algorithms with a similar performance according to the accuracy measures used.

Methods

Sample and disease type

This study included 1178 patients with hypochromic microcytic anemia from Boghrat clinical center in Tehran, Iran. CBC analysis of EDTA-K2 anti-coagulated blood samples was performed using a Sysmex KX-21 automated hematology analyzer to measure hematological parameters such as Hb (Hemoglobin), HCT (hematocrit), MCV (Mean Corpuscular Volume), MCH (Mean Corpuscular Hemoglobin), MCHC (Mean Corpuscular Hemoglobin Concentration), RBC (Red Blood Cell Count) and RDW (Red Blood Cell Distribution Width). In addition, HbA2, TIBC, serum iron and serum ferritin were measured for all patients.

Inclusion criteria

Patients with hypochromic microcytic anemia (MCV < 80 fL, MCH < 27 pg) and Hb < 12 g/dl for women or Hb < 13 g/dl for men were included in the study. Among them, 708 patients were diagnosed as βTT with HbA2 > 3.5%, and 470 patients were diagnosed as IDA with serum ferritin < 15 ng/ml according to the World Health Organization (WHO) [71, 72].

Exclusion criteria

Patients with simultaneous presentation of both diseases, severe anemia (Hb < 8 g/dl), anemia due to chronic disease, infectious disease, chronic inflammation, pregnancy or other hemoglobinopathies were excluded.

Statistical analysis

Descriptive statistics and univariate analysis

Descriptive statistics (mean, standard deviation, median, and interquartile range) were computed for the different blood parameters. Normality of the data was assessed using the Shapiro–Wilk test. The Mann–Whitney U test was used to compare the hematological parameters between the two groups (IDA and βTT). P < 0.05 was considered statistically significant.

Hematological discriminating indices for discriminating between IDA and βTT

Hematological indices for discrimination between IDA and βTT were computed for each patient according to their formula and cut off. These indices with their formula are shown in Additional file 1: Table S1.
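As a minimal illustration of how one such index is evaluated, the sketch below computes the Mentzer index (MCV divided by RBC count) and applies the cutoff of 13 that is commonly quoted for it in the literature. The function names and the sample values are hypothetical, and the formula and cutoff are not taken from Additional file 1: Table S1, which is not reproduced here.

```python
def mentzer_index(mcv_fl, rbc_million_per_ul):
    """Mentzer index: MCV (fL) divided by RBC count (10^6 cells per microliter)."""
    return mcv_fl / rbc_million_per_ul


def classify_by_mentzer(mcv_fl, rbc_million_per_ul, cutoff=13.0):
    """Label a patient with the commonly quoted rule: index < cutoff suggests
    beta-thalassemia trait, otherwise iron deficiency anemia.
    The cutoff default is an assumption taken from the general literature."""
    index = mentzer_index(mcv_fl, rbc_million_per_ul)
    return "beta-TT" if index < cutoff else "IDA"


# Hypothetical patients, for illustration only.
print(classify_by_mentzer(65.0, 5.8))  # low MCV with a high RBC count
print(classify_by_mentzer(70.0, 4.2))  # low MCV with a low RBC count
```

Each of the other indices follows the same pattern: a formula over CBC parameters and a published cutoff that assigns the patient to one of the two groups.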

Classification algorithms

Classification tree algorithms (CART [63], QUEST [67], CRUISE [68], J48 [73], GUIDE [69], Ctree [70], Evtree [64], C5.0 [74], and LOTUS [62]) and SVM [75] were used to discriminate IDA from βTT.

Accuracy measures

The diagnostic performance of the discrimination indices was compared with that of the classification tree algorithms using accuracy measures such as sensitivity, specificity, FPR, FNR, PPV, NPV, Youden's index (sensitivity + specificity – 1), accuracy, PLR, NLR, DOR, F-measure and AUC. A discrimination method with sensitivity, specificity, PPV, NPV, Youden's index, accuracy, F-measure and AUC close to 1 was considered to perform better. Likewise, a discrimination method with PLR > 10, NLR < 0.1 and a high DOR was considered to perform well in discriminating between IDA and βTT [76, 77]. Receiver operating characteristic (ROC) curve analysis was used to compute the AUC and to compare the AUC values of the discrimination methods [78].
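The accuracy measures listed above all derive from a 2 × 2 confusion table. The following sketch uses their standard textbook definitions; it is not the epiR or pROC code used in the study, the function names are hypothetical, and which disorder counts as the "positive" class is left to the caller. The AUC function uses the rank-based (Mann–Whitney) form, which equals the area under the empirical ROC curve.

```python
import math


def _ratio(num, den):
    """Divide, returning inf (or nan for 0/0) instead of raising."""
    if den == 0:
        return math.nan if num == 0 else math.inf
    return num / den


def diagnostic_measures(tp, fn, fp, tn):
    """Accuracy measures from the counts of a 2x2 confusion table."""
    sens = _ratio(tp, tp + fn)  # true positive rate
    spec = _ratio(tn, tn + fp)  # true negative rate
    return {
        "sensitivity": sens,
        "specificity": spec,
        "FPR": 1 - spec,
        "FNR": 1 - sens,
        "PPV": _ratio(tp, tp + fp),
        "NPV": _ratio(tn, tn + fn),
        "accuracy": _ratio(tp + tn, tp + fn + fp + tn),
        "Youden": sens + spec - 1,
        "PLR": _ratio(sens, 1 - spec),
        "NLR": _ratio(1 - sens, spec),
        "DOR": _ratio(tp * tn, fp * fn),
        # Harmonic mean of PPV and sensitivity.
        "F": _ratio(2 * tp, 2 * tp + fp + fn),
    }


def auc(pos_scores, neg_scores):
    """Probability that a positive case scores higher than a negative case,
    counting ties as one half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical counts, for illustration only.
for name, value in diagnostic_measures(tp=90, fn=10, fp=20, tn=80).items():
    print(f"{name}: {value:.3f}")
```

With these definitions, DOR equals PLR divided by NLR, which is why a method can only reach a high DOR by combining a high PLR with a low NLR.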

Multidimensional scaling

Multidimensional scaling was used to create a map, based on the Euclidean distance, that shows the similarity or dissimilarity between observations. This map can have one, two, three, or more dimensions. A smaller distance between two observations indicates greater similarity, and vice versa. This study used a two-dimensional map to show the similarity/dissimilarity among pairs of discrimination methods in terms of accuracy measures such as sensitivity, specificity, PPV, NPV, Youden's index, accuracy, PLR, NLR, F-measure, and AUC [79].
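The input to multidimensional scaling is a matrix of pairwise Euclidean distances between the discrimination methods, each method being described by its vector of accuracy measures. The sketch below builds only that dissimilarity matrix; the projection to two dimensions is left to a library routine. The method names and values are hypothetical, and whether the measures were rescaled before computing distances is not stated in the text, so no rescaling is applied here.

```python
import math


def distance_matrix(profiles):
    """Pairwise Euclidean distances between methods.

    profiles maps a method name to its list of accuracy measures,
    given in the same order for every method.
    """
    names = list(profiles)
    matrix = [[math.dist(profiles[a], profiles[b]) for b in names] for a in names]
    return names, matrix


# Hypothetical profiles: (sensitivity, specificity, AUC), for illustration only.
profiles = {
    "method_A": [0.95, 0.90, 0.93],
    "method_B": [0.94, 0.91, 0.92],
    "method_C": [0.60, 0.55, 0.58],
}
names, matrix = distance_matrix(profiles)
for name, row in zip(names, matrix):
    print(name, [round(d, 3) for d in row])
```

In this toy input, method_A and method_B lie close together and far from method_C, which is the pattern a two-dimensional map would display as two separate groups.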

Cluster analysis

Cluster analysis is a method for extracting homogeneous subgroups of observations. Different algorithms have been proposed for cluster analysis. This study used a complete-linkage hierarchical algorithm to determine homogeneous subgroups of methods with a similar diagnostic performance using accuracy measures. The optimal number of clusters of methods with a similar diagnostic performance was assessed using 30 appropriate measures. Finally, the optimal number was selected based on the majority rule [80].
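A minimal sketch of complete-linkage agglomerative clustering and of a majority vote over candidate cluster counts is given below. It is a plain illustration of the two ideas, not the NbClust procedure used in the study; the function names, the sample points, and the votes are hypothetical.

```python
import math
from collections import Counter


def complete_linkage(points, k):
    """Merge clusters until k remain. The distance between two clusters is
    the largest Euclidean distance between any pair of their members."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


def majority_vote(proposed_counts):
    """Pick the cluster count proposed by the largest number of indices."""
    return Counter(proposed_counts).most_common(1)[0][0]


# Hypothetical data, for illustration only.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
k = majority_vote([3, 3, 2, 4, 3])
print(k, complete_linkage(points, k))
```

Complete linkage tends to produce compact groups because a merge is only as cheap as the two most distant members allow.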

Software programs and checklists

Data analysis was done using R 4.0.0. The packages epiR and pROC were used to compute the accuracy measures and to perform the ROC curve analysis, respectively. Classification tree algorithms such as CART, J48, Ctree, Evtree, and C5.0 were fitted using the packages rpart, RWeka, party, evtree, and C50, respectively. Software for tree algorithms such as QUEST, CRUISE, GUIDE, and LOTUS was obtained from http://pages.stat.wisc.edu/~loh/research.html. The SVM algorithm and the multidimensional scaling method were fitted using the packages e1071 and MASS, respectively. The optimal number of clusters, that is, of homogeneous groups of discrimination methods with a similar diagnostic performance, was determined using the package NbClust. This study was also conducted based on the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) Statement: guidelines for reporting observational studies and the Standards for Reporting Studies of Diagnostic Accuracy (STARD). These checklists can be obtained from www.equator-network.org.

Results

This study included 1178 patients with hypochromic microcytic anemia (708 patients with βTT and 470 patients with IDA) to compare the diagnostic performance of hematological discrimination indices with classification tree algorithms and SVM in discriminating IDA from βTT. Data balance was assessed using Shannon entropy [81, 82]. Additional file 1: Table S2 shows the descriptive statistics of hematological parameters by type of hypochromic microcytic anemia (IDA and βTT). According to this table, all variables differed significantly between the groups (P < 0.001). The CRUISE, C5.0, CART, and GUIDE algorithms can calculate the normalized importance (%) of each predictor variable. These algorithms produced a similar ranking of hematological parameter importance. In this study, the normalized importance of variables was reported based on the classification tree algorithms with the best diagnostic performance (CRUISE and C5.0 algorithms). These algorithms showed that MCV and HCT had the highest and lowest importance for discrimination between IDA and βTT, respectively (Additional file 1: Table S2).
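One common way to express class balance with Shannon entropy is to divide the entropy of the class proportions by its maximum, log k for k classes, so that 1 means perfectly balanced and values near 0 mean highly imbalanced. The sketch below applies this to the group sizes reported above. The normalization is an assumption, since the exact formula used in the study is not given in the text, and the function name is hypothetical.

```python
import math


def entropy_balance(counts):
    """Normalized Shannon entropy of class counts (requires at least 2 classes).

    Returns a value in [0, 1]; 1 means all classes are equally frequent.
    """
    total = sum(counts)
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return entropy / math.log(len(counts))


# Group sizes from the study: 708 patients with beta-TT and 470 with IDA.
print(round(entropy_balance([708, 470]), 3))
```

A 60/40 split such as this one stays close to the balanced end of the scale.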

Figures 1 and 2 indicated that all predictor variables except HCT and RDW can be used to split the nodes of the trees. The first split of all tree-based methods except Evtree, Ctree, and LOTUS was based on MCV, with a similar splitting rule. The GUIDE and CART algorithms showed the same tree structure.

Fig. 1

Tree structure of classification tree algorithms such as J48, CART, GUIDE, QUEST, and CRUISE (red: β-thalassemia trait and green: iron deficiency anemia)

Fig. 2

Tree structure of classification tree algorithms such as Evtree, Ctree, LOTUS, and C5.0 (red: β-thalassemia trait and green: iron deficiency anemia)

Additional file 1: Table S3 displays the values of accuracy measures such as sensitivity, specificity, FPR, FNR, PPV and NPV for each discrimination method.

Additional file 1: Table S3 indicated that none of the discrimination methods was fully specific for discrimination between IDA and βTT. This table showed that the Janel index and the CRUISE tree algorithm had the lowest FPR as well as the highest TNR and PPV. In turn, the lowest TNR belonged to the Telmissani–MCHD index, while the lowest PPV was related to the Bessman (RDW) index. The Shine and Lal index and the Roth index showed perfect TPR (100%) and NPV (100%) as compared to other discrimination methods. Also, these indices showed the lowest FNR and the highest FPR. The lowest TPR (the highest FNR) was related to the Bessman (RDW) index, while the lowest NPV belonged to the Pornprasert (MCHC) index. All tree classification algorithms and SVM showed good performance for discriminating between IDA and βTT based on accuracy measures such as TPR, TNR, PPV and NPV in comparison to the other hematological discrimination methods (Additional file 1: Table S3).

The values of accuracy measures such as Youden's index, accuracy, PLR, NLR, and DOR for each discrimination method are shown in Table 1. According to this table, the highest Youden's index/accuracy belonged to the CRUISE and C5.0 tree algorithms, while the lowest Youden's index/accuracy was for the MCHC index. Also, the highest DOR/F-measure belonged to the CRUISE and C5.0 tree algorithms, whereas the Roth index and Bessman (RDW) index had the lowest DOR/F-measure. Table 1 indicated that only the CRUISE tree algorithm had PLR > 10. The discrimination methods with NLR < 0.1 were all tree algorithms except the C5.0 tree algorithm, together with indices such as Shine and Lal, Bordbar, Sehgal, and Kerman I.

Table 1 Youden’s index, accuracy, positive likelihood ratio (PLR), negative likelihood ratio (NLR) and diagnostic odds ratio (DOR) of each hematological index and classification tree algorithm for differentiation between iron deficiency anemia (IDA) and β‐thalassemia trait (βTT) with their 95% confidence interval

The AUC of each discrimination method for discrimination between IDA and βTT is shown in Table 2. The ROC analysis showed that the CRUISE and C5.0 tree algorithms had the highest AUC. According to the AUC, the CRUISE and C5.0 tree algorithms indicated excellent diagnostic accuracy, whereas the MCHC index was not useful for discrimination between IDA and βTT. Table 2 indicated that the AUC of all indices except Ricerca, Telmissani–MCHD, Huber–Herklotz, Zaghloul1, Zaghloul2 and Kandhrol1 was significantly greater than 0.5, and that the AUC of the RDW and MCHC indices was significantly less than 0.5 (P < 0.001).

Table 2 F-measure and AUC of each hematological index and classification tree algorithm for differentiation between iron deficiency anemia (IDA) and β‐thalassemia trait (βTT) with their 95% confidence interval

The comparison between AUC values of classification tree algorithms and hematological discrimination index with the best diagnostic performance among hematological indices (Ehsani index) showed that there was a statistically significant difference between AUC values of tree algorithms with Ehsani index (P < 0.05). In this regard, classification tree algorithms had significantly higher AUC than the mentioned hematological discrimination index. Also, CRUISE and C5.0 tree algorithms had significantly higher AUC than other classification tree algorithms, but there was no significant difference between AUC values of Ctree and CART algorithms (P > 0.05).

Overall, the results showed that CRUISE and C5.0 tree algorithms had a better performance for discrimination between IDA and βTT in comparison to all indices and other classification tree methods. The CRUISE tree algorithm extracted six homogeneous subgroups of patients (Fig. 1). According to the tree structure of the CRUISE tree algorithm, patients with MCV > 67.65; or 67.65 < MCV ≤ 71.25 and Hb ≤ 11.15; or MCV ≤ 67.65 and Hb ≤ 8.85 and MCHC ≤ 30.32 were classified as βTT. Patients with 67.65 < MCV ≤ 71.25 and Hb > 11.15; or MCV ≤ 67.65 and Hb > 8.85; or MCV ≤ 67.65 and MCHC > 30.32 were classified as IDA.

In addition, multidimensional scaling method extracted three subgroups of methods. The diagram of this analysis is shown in Fig. 3. One group included hematological discrimination indices such as Pornprasert, RDW, Kandhrol1, Huber–Herklotz, Sirachainan, Hameed, Zaghloul1, and Zaghloul2, while the other group included Shine and Lal, Roth, Ricerca, and Telmissani–MCHD. The third group in turn included classification tree algorithms, SVM, and some of hematological discrimination indices.

Fig. 3

Diagram of multidimensional scaling for extracting homogeneous groups of hematological indices and classification tree algorithms with a similar diagnostic performance (1:England and Fraser, 2:RBC, 3:Mentzer, 4:Srivastava, 5:Shine and Lal, 6:Bessman (RDW), 7:Ricerca, 8:Green and King, 9:Das Gupta, 10:Jayabose (RDWI), 11:Telmissani–MCHD, 12:Telmissani–MDHL, 13:Huber–Herklotz, 14:Kerman I, 15:Kerman II, 16:Sirdah, 17:Ehsani, 18:Keikhaei, 19:Nishad, 20:Wongprachum, 21:Sehgal, 22:Pornprasert, 23:Sirachainan, 24:Bordbar, 25:Matos and Carvalho, 26:Janel (11 T), 27:CRUISE Index, 28:Index26, 29:CART/Guide, 30:J48, 31:QUEST, 32:CRUISE, 33:Ctree, 34:Evtree, 35:Hisham, 36:Hameed, 37:Ravanbakhsh-F1, 38:Ravanbakhsh-F2, 39:Ravanbakhsh-F3, 40:Ravanbakhsh-F4, 41:Zaghloul1, 42:Zaghloul2, 43:Kandhrol1, 44:Kandhrol2, 45:Alparslan, 46:Merdin1, 47:Merdin2, 48:Roth, 49: Sargolzaie, 50: C5.0, 51: LOTUS, and 52: SVM)

Like the multidimensional scaling method, cluster analysis extracted three homogeneous groups of discrimination methods. The diagram of this analysis is shown in Fig. 4.

Fig. 4

Dendrogram of cluster analysis for extracting homogeneous groups of hematological discrimination indices and classification tree algorithms with the same diagnostic performance (each rectangle includes discrimination methods with a similar diagnostic performance)

Discussion

The two common types of microcytic anemia disorders are IDA and βTT, which have similar clinical and experimental conditions [3, 11, 12]. Discriminating between these two disorders is clinically important but requires time-consuming and expensive tests such as HbA2, serum iron, serum ferritin and TIBC [4, 11, 13,14,15,16]. Several hematological indices have been proposed for rapid and low-cost discrimination between IDA and βTT, but they are not fully sensitive and specific for differential diagnosis [17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41].

This study used classification tree algorithms to discriminate between IDA and βTT. These are efficient and low-cost detection methods to extract homogeneous subgroups of patients [53,54,55,56]. Thus, the diagnostic performance of hematological indices was compared with tree-based methods to differentiate IDA and βTT using various accuracy measures.

Additionally, multidimensional scaling was used to extract homogeneous subgroups of methods with a similar performance based on the mentioned criteria.

The findings showed that none of the mentioned discrimination methods is fully sensitive and specific in discrimination between IDA and βTT. Also, tree-based methods exhibited high performance for differential diagnosis in comparison with the hematological indices. The CRUISE tree algorithm indicated better performance than other discrimination methods based on accuracy measures such as Youden's index, accuracy, PLR, NLR, DOR, F-measure and AUC. These criteria combine both sensitivity and specificity and therefore indicate the diagnostic performance of a discrimination method more accurately than other criteria. So, this algorithm can help physicians make better clinical decisions.

Although the sensitivity of hematological discrimination methods such as Ricerca, Telmissani–MCHD, Bordbar, Roth, and Shine and Lal (S&L) was higher than that of the CRUISE tree algorithm, these hematological indices had a high false positive rate as compared to the CRUISE tree algorithm. Moreover, with respect to the other measurements, these indices had poor performance in discriminating between IDA and βTT.

Consistent with the findings of this study, other studies demonstrated that Ehsani index had good performance in discrimination between these two disorders in comparison with other hematological indices [83, 84]. Meta-analysis studies indicated that Bessman (RDW) index had a low AUC in comparison to other hematological indices [85, 86].

Overall, the findings showed that CRUISE tree algorithm had better performance in discrimination between IDA and βTT as compared to all hematological discrimination indices and other classification tree methods. Moreover, comparison between the AUC of CRUISE and C5.0 tree algorithms and Ehsani index (this index had the best diagnostic performance in comparison to the other hematological indices) showed that there was a statistically significant difference between AUC of these discrimination methods (P < 0.001); CRUISE and C5.0 tree algorithms had significantly higher AUC than this discrimination index. Indeed, all accuracy measures indicated that CRUISE and C5.0 tree algorithms had the best diagnostic performance among the discrimination methods used.

Tree-based methods were fitted using hematological parameters as predictor variables. Based on the results obtained from the CRUISE and C5.0 tree methods, MCV was the main hematological predictor parameter in differentiation between the types of hypochromic microcytic anemia. In this regard, it was found that patients with βTT had lower values of MCV. In a previous study which used different decision trees for discrimination between IDA and βTT, the first split of all algorithms was based on MCV, indicating that MCV was an important predictor variable in discrimination of IDA and βTT [47].

Several studies proposed various tree-based methods for the differential diagnosis of microcytic anemia [43, 44, 47, 50,51,52]. For instance, Bellinger et al. used classification algorithms such as the J48 decision tree, support vector machines (SVM), k-nearest neighbours (K-NN), multilayer perceptron (MLP) and naïve Bayes (NB) to discriminate between patients with IDA and βTT or both [50]. In another study, Setsirichok evaluated the classification of blood characteristics by a C4.5 decision tree, an NB classifier and an MLP for classifying eighteen classes of thalassemia abnormality [43]. Likewise, Jahangiri et al. (2017) used classification tree algorithms for constructing a differential scheme and investigating the performance of several tree algorithms for the differential diagnosis of IDA from βTT. In agreement with this study, Jahangiri et al. (2017) reported that the CRUISE tree algorithm had the highest AUC, that MCV was an important predictor variable in the discrimination of observations into IDA and βTT, and that the first split of all algorithms was based on MCV [47]. Moreover, Chakraborty et al. (2017) utilized the Ada-boost algorithm to generate multiple decision trees by using the C4.5 decision tree for classification of erythrocytes or anemia detection. Their proposed approach showed accuracy, specificity and sensitivity of 97.81%, 99.7% and 97.33%, respectively, in detecting abnormal erythrocytes [51]. Comparing the diagnostic performance of several algorithms such as J48, K-NN, artificial neural networks and NB for identifying β-thalassemia carriers, AlAgha concluded that naïve Bayes had superior performance in differentiating between normal individuals and β-thalassemia carriers [52].

Overall, the CRUISE and C5.0 tree algorithms with the best performance in this study showed better performance in comparison with tree algorithms in the previous studies [43, 87].

Using advanced methods such as tree-based methods in addition to the differential indices can be a good approach for discriminating between these two hematologic disorders. Whereas each index includes only one or a few specific blood parameters, machine learning methods can consider the effects of all blood parameters simultaneously for data prediction and exploratory modeling. Besides, using decision trees for discrimination between IDA and βTT can avoid expensive, time-consuming, and complicated laboratory procedures, as well as reliance on hematological indices that perform unsatisfactorily in discriminating between these two hematologic disorders.

The application of methods such as multidimensional scaling and cluster analysis is deemed useful for determining different classification methods with similar diagnostic functions. In previous hematological studies, such indices were compared subjectively based on the accuracy measures. Therefore, the application of the multidimensional scaling method and cluster analysis is proposed to determine the hematological discrimination indices with similar performance for future hematological studies.

Application in practice for medical studies

In medical diagnostic processes, decision making with high diagnostic performance is very important. Tree-based methods can be considered as appropriate methods for decision making, because they generate differential diagnosis with high accuracy measures (sensitivity, specificity, PPV, NPV, PLR, NLR, DOR, accuracy, and AUC) in comparison to the discrimination indices. In addition, tree algorithms display results graphically, making the results understandable with no statistical expertise. These algorithms can be thus useful for diagnostic classification scheme of patients in medical studies. This study thus considered the discrimination between IDA and βTT to prevent iron overload and its complications caused by misdiagnosis and inaccurate treatment, and also to determine the prenatal causes for hemoglobin chain disorders.

Conclusions

Given their diagnostic performance, the CRUISE and C5.0 tree algorithms can be considered appropriate methods for the differential diagnosis of patients in comparison to other methods. Moreover, tree-based methods are useful along with other parameters for discriminating between IDA and βTT. In conclusion, considering the advantages of tree algorithms, they can help physicians make better clinical decisions. The results showed that the multidimensional scaling method and cluster analysis are appropriate techniques to determine the discrimination indices with similar performance for future studies. In addition, the tree-based methods were identified as good methods for extracting homogeneous subgroups of observations in medical studies.

Availability of data and materials

The datasets used and/or analyzed during the study are available from the corresponding author on reasonable request.

References

  1. Kara B, Çal S, Aydogan A, Sarper N. The prevalence of anemia in adolescents: a study from Turkey. J Pediatr Hematol Oncol. 2006;28(5):316–21.


  2. Brittenham G. Disorders of iron metabolism: iron deficiency and overload. Hematol Basic Principles Pract. 2000.

  3. Hallberg L. Iron requirements. Biol Trace Elem Res. 1992;35(1):25–45.


  4. Olivieri N. The beta-thalassemias. N Engl J Med. 1999;341(2):99–109.


  5. Rathod DA, Kaur A, Patel V, Patel K, Kabrawala R, Patel V, et al. Usefulness of cell counter-based parameters and formulas in detection of β-thalassemia trait in areas of high prevalence. Am J Clin Pathol. 2007;128(4):585–9.


  6. Angastiniotis M, Modell B. Global epidemiology of hemoglobin disorders. Ann N Y Acad Sci. 1998;850(1):251–69.


  7. Weatherall D, Clegg JB. Inherited haemoglobin disorders: an increasing global health problem. Bull World Health Organ. 2001;79:704–12.


  8. Urrechaga E, Borque L, Escanero JF. The role of automated measurement of RBC subpopulations in differential diagnosis of microcytic anemia and β-thalassemia screening. Am J Clin Pathol. 2011;135(3):374–9.


  9. Galanello R, Origa R. Beta-thalassemia. Orphanet J Rare Dis. 2010;5(1):11.


  10. Camaschella C. New insights into iron deficiency and iron deficiency anemia. Blood Rev. 2017;31(4):225–33.


  11. Lafferty JD, Crowther MA, Ali MA, Levine M. The evaluation of various mathematical RBC indices and their efficacy in discriminating between thalassemic and non-thalassemic microcytosis. Am J Clin Pathol. 1996;106(2):201–5.


  12. Bessman JD, Gilmer PR, Gardner FH. Improved classification of anemias by MCV and RDW. Am J Clin Pathol. 1983;80(3):322–6.


  13. Thomas C, Thomas L. Biochemical markers and hematologic indices in the diagnosis of functional iron deficiency. Clin Chem. 2002;48(7):1066–76.


  14. Goddard AF, James MW, McIntyre AS, Scott BB. Guidelines for the management of iron deficiency anaemia. Gut. 2011;60:1309–16.

  15. Mosca A, Paleari R, Ivaldi G, Galanello R, Giordano P. The role of haemoglobin A2 testing in the diagnosis of thalassaemias and related haemoglobinopathies. J Clin Pathol. 2009;62(1):13–7.

  16. Demir A, Yarali N, Fisgin T, Duru F, Kara A. Most reliable indices in differentiation between thalassemia trait and iron deficiency anemia. Pediatr Int. 2002;44(6):612–6.

  17. England J, Fraser P. Differentiation of iron deficiency from thalassaemia trait by routine blood-count. Lancet. 1973;301(7801):449–52.

  18. Klee GG, Fairbanks VF, Pierre RV, O’Sullivan MB. Routine erythrocyte measurements in diagnosis of iron-deficiency anemia and thalassemia minor. Am J Clin Pathol. 1976;66(5):870–7.

  19. Mentzer W. Differentiation of iron deficiency from thalassaemia trait. Lancet. 1973;301(7808):882.

  20. Srivastava P, Bevington J. Iron deficiency and/or Thalassaemia trait. Lancet. 1973;301(7807):832.

  21. Shine I, Lal S. A strategy to detect β-thalassaemia minor. Lancet. 1977;309(8013):692–4.

  22. Bessman JD, Feinstein D. Quantitative anisocytosis as a discriminant between iron deficiency and thalassemia minor. Blood. 1979;53(2):288–93.

  23. Ricerca B, Storti S, d’Onofrio G, Mancini S, Vittori M, Campisi S, et al. Differentiation of iron deficiency from thalassaemia trait: a new approach. Haematologica. 1986;72(5):409–13.

  24. Green R, King R. A new red cell discriminant incorporating volume dispersion for differentiating iron deficiency anemia from thalassemia minor. Blood Cells. 1989;15(3):481–95.

  25. Gupta AD, Hegde C, Mistri R. Red cell distribution width as a measure of severity of iron deficiency in iron deficiency anemia. Indian J Med Res. 1994;100:177–83.

  26. Jayabose S, Giamelli J, Levendoglu-Tugal O, Sandoval C, Ozkaynak F, Visintainer P. # 262 Differentiating iron deficiency anemia from thalassemia minor by using an RDW-based index. J Pediatr Hematol Oncol. 1999;21(4):314.

  27. Telmissani OA, Khalil S, Roberts GT. Mean density of hemoglobin per liter of blood: a new hematologic parameter with an inherent discriminant function. Lab Hematol. 1999;5:149–52.

  28. Huber AR, Ottiger C, Risch L, Regenass S, Hergersberg M, Herklotz R, editors. Thalassemie-syndrome: klinik und diagnose. Schweiz Med Forum; 2004.

  29. Kohan N, Ramzi M. Evaluation of sensitivity and specificity of Kerman index I and II in screening beta thalassemia minor. 2008.

  30. Sirdah M, Tarazi I, Al Najjar E, Al HR. Evaluation of the diagnostic reliability of different RBC indices and formulas in the differentiation of the β-thalassaemia minor from iron deficiency in Palestinian population. Int J Lab Hematol. 2008;30(4):324–30.

  31. Ehsani M, Shahgholi E, Rahiminejad M, Seighali F, Rashidi A. A new index for discrimination between iron deficiency anemia and beta-thalassemia minor: results in 284 patients. Pakist J Biol Sci. 2009;12(5):473–5.

  32. Keikhaei B. A new valid formula in differentiating iron deficiency anemia from β-thalassemia trait. Pakist J Med Sci. 2010;26:368–73.

  33. Nishad AAN, Pathmeswaran A, Wickremasinghe A, Premawardhena A. The Thal-index with the BTT prediction.exe to discriminate β-thalassaemia traits from other microcytic anaemias. 2012.

  34. Wongprachum K, Sanchaisuriya K, Sanchaisuriya P, Siridamrongvattana S, Manpeun S, Schelp FP. Proxy indicators for identifying iron deficiency among anemic vegetarians in an area prevalent for thalassemia and hemoglobinopathies. Acta Haematol. 2012;127(4):250–5.

  35. Dharmani P, Sehgal K, Dadu T, Mankeshwar R, Shaikh A, Khodaiji S. Developing a new index and its comparison with other CBC-based indices for screening of beta thalassemia trait in a tertiary care hospital. Int J Lab Hematol. 2013;35:118.

  36. Pornprasert S, Panya A, Punyamung M, Yanola J, Kongpan C. Red cell indices and formulas used in differentiation of β-thalassemia trait from iron deficiency in Thai school children. Hemoglobin. 2014;38(4):258–61.

  37. Sirachainan N, Iamsirirak P, Charoenkwan P, Kadegasem P, Wongwerawattanakoon P, Sasanakul W, et al. New mathematical formula for differentiating thalassemia trait and iron deficiency anemia in thalassemia prevalent area: a study in healthy school-age children. Southeast Asian J Trop Med Public Health. 2014;45(1):174.

  38. Bordbar E, Taghipour M, Zucconi BE. Reliability of different RBC indices and formulas in discriminating between β-thalassemia minor and other microcytic hypochromic cases. Mediterr J Hematol Infect Dis. 2015;7(1).

  39. Janel A, Roszyk L, Rapatel C, Mareynat G, Berger MG, Serre-Sapin AF. Proposal of a score combining red blood cell indices for early differentiation of beta-thalassemia minor from iron deficiency anemia. Hematology. 2011;16(2):123–7.

  40. Jahangiri M, Rahim F, Malehi AS. Diagnostic performance of hematological discrimination indices to discriminate between βeta thalassemia trait and iron deficiency anemia and using cluster analysis: Introducing two new indices tested in Iranian population. Sci Rep. 2019;9(1):1–13.

  41. Matos JF, Dusse L, Borges KB, de Castro RL, Coura-Vital W, Carvalho MDG. A new index to discriminate between iron deficiency anemia and thalassemia trait. Rev Bras Hematol Hemoter. 2016;38(3):214–9.

  42. Sharma S. Applied multivariate techniques. New York: Wiley; 1995.

  43. Setsirichok D, Piroonratana T, Wongseree W, Usavanarong T, Paulkhaolarn N, Kanjanakorn C, et al. Classification of complete blood count and haemoglobin typing data by a C4.5 decision tree, a naïve Bayes classifier and a multilayer perceptron for thalassaemia screening. Biomed Signal Process Control. 2012;7(2):202–12.

  44. Dogan S, Turkoglu I. Iron-deficiency anemia detection from hematology parameters by using decision trees. Int J Sci Technol. 2008;3(1):85–92.

  45. Urrechaga E, Aguirre U, Izquierdo S. Multivariable discriminant analysis for the differential diagnosis of microcytic anemia. Anemia. 2013;2013.

  46. Wongseree W, Chaiyaratana N, Vichittumaros K, Winichagoon P, Fucharoen S. Thalassaemia classification by neural networks and genetic programming. Inf Sci. 2007;177(3):771–86.

  47. Jahangiri M, Khodadi E, Rahim F, Saki N, Saki Malehi A. Decision‐tree‐based methods for differential diagnosis of β‐thalassemia trait from iron deficiency anemia. Expert Syst. 2017;34(3).

  48. Barnhart-Magen G, Gotlib V, Marilus R, Einav Y. Differential diagnostics of thalassemia minor by artificial neural networks model. J Clin Lab Anal. 2013;27(6):481–6.

  49. Amendolia SR, Cossu G, Ganadu M, Golosio B, Masala G, Mura GM. A comparative study of k-nearest neighbour, support vector machine and multi-layer perceptron for thalassemia screening. Chemom Intell Lab Syst. 2003;69(1):13–20.

  50. Bellinger C, Amid A, Japkowicz N, Victor H, editors. Multi-label classification of anemia patients. In IEEE 14th international conference on machine learning and applications (ICMLA); 2015. IEEE.

  51. Maity M, Mungle T, Dhane D, Maiti AK, Chakraborty C. An ensemble rule learning approach for automated morphological classification of erythrocytes. J Med Syst. 2017;41(4):56.

  52. AlAgha AS, Faris H, Hammo BH, Alam A-Z. Identifying β-thalassemia carriers using a data mining approach: the case of the Gaza Strip, Palestine. Artif Intell Med. 2018;88:70–83.

  53. Lipkovich I, Dmitrienko A, Denne J, Enas G. Subgroup identification based on differential effect search—a recursive partitioning method for establishing response to treatment in patient subpopulations. Stat Med. 2011;30(21):2601–21.

  54. Loh WY, He X, Man M. A regression tree approach to identifying subgroups with differential treatment effects. Stat Med. 2015;34(11):1818–33.

  55. Su X, Tsai C-L, Wang H, Nickerson DM, Li B. Subgroup analysis via recursive partitioning. J Mach Learn Res. 2009;10:141–58.

  56. Li C, Glüer C-C, Eastell R, Felsenberg D, Reid DM, Roux C, et al. Tree-structured subgroup analysis of receiver operating characteristic curves for diagnostic tests. Acad Radiol. 2012;19(12):1529–36.

  57. De’ath G, Fabricius KE. Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology. 2000;81(11):3178–92.

  58. Lemon SC, Roy J, Clark MA, Friedmann PD, Rakowski W. Classification and regression tree analysis in public health: methodological review and comparison with logistic regression. Ann Behav Med. 2003;26(3):172–81.

  59. Speybroeck N, Berkvens D, Mfoukou-Ntsakala A, Aerts M, Hens N, Van Huylenbroeck G, et al. Classification trees versus multinomial models in the analysis of urban farming systems in Central Africa. Agric Syst. 2004;80(2):133–49.

  60. Malehi AS, Jahangiri M. Classic and bayesian tree-based methods. Enhanced expert systems. IntechOpen; 2019.

  61. Feldesman MR. Classification trees as an alternative to linear discriminant analysis. Am J Phys Anthropol. 2002;119(3):257–75.

  62. Chan K-Y, Loh W-Y. LOTUS: an algorithm for building accurate and comprehensible logistic regression trees. J Comput Graph Stat. 2004;13(4):826–52.

  63. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees. Boca Raton: CRC Press; 1984.

  64. Grubinger T, Zeileis A, Pfeiffer K-P. Evtree: Evolutionary learning of globally optimal classification and regression trees in R. Working papers in economics and statistics; 2011.

  65. Loh WY. Tree-structured classifiers. Wiley Interdiscip Rev Comput Stat. 2010;2(3):364–9.

  66. Loh WY. Classification and regression trees. Wiley Interdiscip Rev Data Min Knowl Discov. 2011;1(1):14–23.

  67. Loh W-Y, Shih Y-S. Split selection methods for classification trees. Stat Sin. 1997:815–40.

  68. Kim H, Loh W-Y. Classification trees with unbiased multiway splits. J Am Stat Assoc. 2001;96(454):589–604.

  69. Loh W-Y. Improving the precision of classification trees. Ann Appl Stat. 2009:1710–37.

  70. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat. 2006;15(3):651–74.

  71. World Health Organization. Serum ferritin concentrations for the assessment of iron status and iron deficiency in populations. World Health Organization; 2011.

  72. Chinudomwong P, Binyasing A, Trongsakul R, Paisooksantivatana K. Diagnostic performance of reticulocyte hemoglobin equivalent in assessing the iron status. J Clin Lab Anal. 2020:e23225.

  73. Quinlan JR. C4.5: programs for machine learning. Amsterdam: Elsevier; 2014.

  74. Kuhn M, Weston S, Culp M, Coulter N, Quinlan R. Package ‘C50’. CRAN, UTC; 2015.

  75. Karatzoglou A, Meyer D, Hornik K. Support vector machines in R. J Stat Softw. 2006;15(1):1–28.

  76. Šimundić A-M. Measures of diagnostic accuracy: basic definitions. Med Biol Sci. 2008;22(4):61–5.

  77. Ferri C, Hernández-Orallo J, Modroiu R. An experimental comparison of performance measures for classification. Pattern Recogn Lett. 2009;30(1):27–38.

  78. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988:837–45.

  79. Kruskal JB. Multidimensional scaling. London: Sage; 1978.

  80. Charrad M, Ghazzali N, Boiteau V, Niknafs A. NbClust: an R package for determining the relevant number of clusters in a data set. J Stat Softw. 2014;61(1):1–36.

  81. Wang K, Phillips CA, Saxton AM, Langston MA. EntropyExplorer: an R package for computing and comparing differential Shannon entropy, differential coefficient of variation and differential expression. BMC Res Notes. 2015;8(1):1–5.

  82. Available from: https://stats.stackexchange.com/questions/239973/a-general-measure-of-data-set-imbalance/239982.

  83. Ehsani M, Sotoudeh K, Shahgholi E, Rahiminezhad M, Seyghali F, Aslani A. Discrimination of iron deficiency anemia and beta thalassemia minor based on a new index. 2007.

  84. Vehapoglu A, Ozgurhan G, Demir AD, Uzuner S, Nursoy MA, Turkmen S, et al. Hematological indices for differential diagnosis of beta thalassemia trait and iron deficiency anemia. Anemia. 2014;2014.

  85. Hoffmann JJ, Urrechaga E, Aguirre U. Discriminant indices for distinguishing thalassemia and iron deficiency in patients with microcytic anemia: a meta-analysis. Clin Chem Lab Med. 2015;53(12):1883–94.

  86. Jahangiri M, Rahim F, Saki Malehi A, Pezeshki SMS, Ebrahimi M. Differential diagnosis of microcytic anemia, thalassemia or iron deficiency anemia: a diagnostic test accuracy meta-analysis. Mod Med Lab J. 2019;3(1):1–14.

  87. Bellinger C, Amid A, Japkowicz N, Victor H, editors. Multi-label classification of anemia patients. In: 2015 IEEE 14th international conference on machine learning and applications (ICMLA). IEEE;2015.

Acknowledgements

Research reported in this publication was supported by the Elite Researcher Grant Committee under award number 987862 from the National Institute for Medical Research Development (NIMAD), Tehran, Iran.

Funding

Research reported in this study was supported by the Elite Researcher Grant Committee of the National Institute for Medical Research Development (NIMAD), Tehran, Iran. The funding source had no role in the study design, data collection, analysis or interpretation of data, manuscript preparation, or the decision to submit.

Author information

Contributions

Conceptualization, project administration and supervision: AK. Data curation: FR. Formal analysis and methodology: MJ and KG. Software: MJ and ASM. Writing (original draft): AK and MJ. Writing (review and editing): all authors. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Anoshirvan Kazemnejad.

Ethics declarations

Ethics approval and consent to participate

This study was approved under ethical code IR.NIMAD.REC.1398.389 by the National Institute for Medical Research Development, Tehran, Iran. Written informed consent was obtained before enrollment. All methods were performed in accordance with the relevant guidelines and institutional regulations.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1. Table S1. Discrimination indices for differentiation between iron deficiency anemia (IDA) and β-thalassemia trait (βTT). Table S2. Descriptive statistics of blood parameters of the study groups and normalized importance (%) of hematological parameters based on the CRUISE tree algorithm (SD: standard deviation; IQR: interquartile range). Table S3. Sensitivity (TPR), specificity (TNR), false positive rate (FPR), false negative rate (FNR), positive predictive value (PPV) and negative predictive value (NPV) of each hematological index and classification tree algorithm for differentiation between iron deficiency anemia (IDA) and β-thalassemia trait (βTT), with 95% confidence intervals.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Rahim, F., Kazemnejad, A., Jahangiri, M. et al. Diagnostic performance of classification trees and hematological functions in hematologic disorders: an application of multidimensional scaling and cluster analysis. BMC Med Inform Decis Mak 21, 313 (2021). https://doi.org/10.1186/s12911-021-01678-5

  • DOI: https://doi.org/10.1186/s12911-021-01678-5

Keywords