
Table 2 Advantages and limitations of different supervised machine learning algorithms

From: Comparing different supervised machine learning algorithms for disease prediction

Supervised algorithm | Advantages | Limitations

Artificial neural network (ANN)

Advantages:

- Can detect complex nonlinear relationships between dependent and independent variables.
- Requires less formal statistical training.
- Multiple training algorithms are available.
- Can be applied to both classification and regression problems.

Limitations:

- Has 'black box' characteristics: the user cannot access the exact decision-making process and therefore cannot easily interpret or justify the predictions.
- Computationally expensive to train for a complex classification problem.
- Predictor (independent) variables require pre-processing.
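As a hedged illustration of the pre-processing point above, here is a minimal sketch using scikit-learn (assumed available); the dataset, hidden-layer size, and iteration budget are illustrative choices, not from the paper:

```python
# Minimal ANN sketch: scale the predictors before training,
# since neural networks are sensitive to feature ranges.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# StandardScaler performs the pre-processing the table mentions.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

Without the scaling step, the same network typically converges more slowly and less reliably, which is the practical cost the limitation describes.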

Decision tree (DT)

Advantages:

- The resultant classification tree is easy to understand and interpret.
- Data preparation is easier.
- Multiple data types, such as numeric, nominal, and categorical, are supported.
- Can generate robust classifiers that can be validated using statistical tests.

Limitations:

- Requires classes to be mutually exclusive.
- The algorithm cannot branch if the value of an attribute or variable at a non-leaf node is missing.
- The algorithm depends on the order of the attributes or variables.
- Does not perform as well as some other classifiers (e.g., artificial neural networks) [80].
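The interpretability advantage can be seen directly in scikit-learn, which can print a fitted tree as human-readable rules; this is a hedged sketch (dataset and depth limit are illustrative assumptions):

```python
# Minimal DT sketch: the fitted tree can be rendered as plain-text rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text prints the decision rules, one threshold per branch,
# which is what makes the classifier easy to inspect.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```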

K-nearest neighbour (KNN)

Advantages:

- Simple algorithm that can classify instances quickly.
- Can handle noisy instances or instances with missing attribute values.
- Can be used for both classification and regression.

Limitations:

- Computationally expensive as the number of attributes increases.
- All attributes are given equal importance, which can lead to poor classification performance.
- Provides no information on which attributes are most effective in making a good classification.
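A minimal KNN sketch, assuming scikit-learn (the dataset and the choice of k = 5 are illustrative, not from the paper); note that every feature contributes equally to the distance, which is the equal-importance limitation listed above:

```python
# Minimal KNN sketch: classification by majority vote of the
# 5 nearest neighbours under Euclidean distance.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)  # "training" just stores the instances
accuracy = knn.score(X_test, y_test)
```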

Logistic regression (LR)

Advantages:

- Straightforward and easy to implement.
- LR-based models can be updated easily.
- Makes no assumptions about the distribution of the independent variable(s).
- Model parameters have a clear probabilistic interpretation.

Limitations:

- Does not achieve good accuracy when input variables have complex relationships.
- Does not take the linear relationship between variables into account.
- Key components of LR, the logit models, are vulnerable to overconfidence.
- May overstate prediction accuracy due to sampling bias.
- Unless extended to the multinomial case, generic LR can only classify variables that have two states (i.e., dichotomous).
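The probabilistic interpretation mentioned above can be sketched with scikit-learn's `predict_proba`, which returns a calibrated-looking class probability per sample (dataset choice is an illustrative assumption):

```python
# Minimal LR sketch: the model outputs class probabilities,
# not just hard labels, for a dichotomous outcome.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

# Each row gives P(class 0) and P(class 1) for one sample; rows sum to 1.
proba = clf.predict_proba(X[:5])
```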

Naïve Bayes (NB)

Advantages:

- Simple and very useful for large datasets.
- Can be used for both binary and multi-class classification problems.
- Requires less training data.
- Can make probabilistic predictions and can handle both continuous and discrete data.

Limitations:

- Classes must be mutually exclusive.
- The presence of dependency between attributes negatively affects the classification performance.
- Assumes the normal distribution of numeric attributes.
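The normality assumption in the last limitation corresponds to the Gaussian variant of NB; a hedged sketch with scikit-learn's `GaussianNB` (dataset choice is illustrative):

```python
# Minimal NB sketch: GaussianNB models each numeric feature as
# normally distributed within each class, per the stated assumption.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
nb = GaussianNB()
nb.fit(X, y)  # fits per-class means and variances only, so training is cheap
accuracy = nb.score(X, y)
```

Fitting only per-class means and variances is also why the method needs comparatively little training data.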

Random forest (RF)

Advantages:

- Lower chance of variance and overfitting the training data compared to DT, since RF averages the outcomes of its constituent decision trees.
- Empirically, this ensemble-based classifier performs better than its individual base classifiers, i.e., DTs.
- Scales well to large datasets.
- Can provide estimates of which variables or attributes are important in the classification.

Limitations:

- More complex and computationally expensive.
- The number of base classifiers needs to be defined.
- When estimating variable importance, it favours variables or attributes that can take a high number of different values.
- Overfitting can occur easily.
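Both the variable-importance advantage and the need to predefine the number of base classifiers show up directly in scikit-learn's API; a hedged sketch (dataset and `n_estimators` are illustrative assumptions):

```python
# Minimal RF sketch: n_estimators is the number of base DTs that must
# be chosen, and feature_importances_ gives the importance estimates.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(data.data, data.target)

# One importance score per attribute; scores are normalised to sum to 1.
importances = rf.feature_importances_
```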

Support vector machine (SVM)

Advantages:

- More robust compared to LR.
- Can handle multiple feature spaces.
- Lower risk of overfitting.
- Performs well in classifying semi-structured or unstructured data, such as texts and images.

Limitations:

- Computationally expensive for large and complex datasets.
- Does not perform well if the data are noisy.
- The resultant model, and the weight and impact of variables, are often difficult to understand.
- Generic SVM cannot classify more than two classes unless extended.
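The two-class limitation is handled in practice by decomposing a multi-class problem into several binary SVMs; scikit-learn's `SVC` does this internally (one-vs-one). A hedged sketch with an illustrative dataset and kernel choice:

```python
# Minimal SVM sketch: SVC extends the binary SVM to 3 classes by
# internally training and combining pairwise binary classifiers.
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
svm = SVC(kernel="rbf")
svm.fit(X, y)
accuracy = svm.score(X, y)
```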