Enhancement of hepatitis virus immunoassay outcome predictions in imbalanced routine pathology data by data balancing and feature selection before the application of support vector machines

Table 2 Workflow for SVM analyses

HBV
Extract 9170 individuals with HBV recorded, of whom 172 were positive and 8998 negative
Split data into training (70%) and testing (30%) sets, giving 120 positive and 6300 negative in the training set
Either	Downsize the training data into 52 sets of 120 positive plus 120 negative
Or	SMOTE the training data at 400% oversampling and 100% undersampling, leading to 52 sets of 3960 individuals with 1920 positive and 2040 negative
Or	Multiply downsize the training data into 11 sets of 120 positive and 120 negative
Then either	Grow a random forest, pick its top five variables, and apply SVM using only those five variables
Or	Proceed straight to SVM
HCV
Extract 7820 individuals with HCV recorded, of whom 533 were positive and 7287 negative
Split data into training (70%) and testing (30%) sets, giving 373 positive and 5100 negative in the training set
Either	Downsize the training data into 13 sets of 373 positive, 373 negative
Or	SMOTE the training data at 400% oversampling and 100% undersampling, leading to 13 sets of 4797 individuals with 1492 positive and 1865 negative
Or	Multiply downsize the training data into 11 sets of 373 positive and 373 negative
Then either	Grow a random forest, pick its top five variables, and apply SVM using only those five variables
Or	Proceed straight to SVM
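
The downsizing step in the workflow above can be illustrated with a short sketch. It assumes that each balanced set pairs all training positives with a disjoint, equally sized random chunk of the training negatives, and that leftover negatives are dropped. The table does not state this, but the set counts are consistent with it (6300 // 120 = 52 for HBV, 5100 // 373 = 13 for HCV). The function name `downsize_sets` and the integer record IDs are hypothetical and used only for illustration.

```python
import random


def downsize_sets(positive_ids, negative_ids, seed=0):
    """Build balanced training sets by downsizing the majority class.

    Assumption (not stated in the table): negatives are shuffled and cut
    into disjoint chunks the same size as the positive class, and every
    chunk is paired with all positives. Leftover negatives are dropped.
    """
    rng = random.Random(seed)
    shuffled = list(negative_ids)
    rng.shuffle(shuffled)
    size = len(positive_ids)
    n_sets = len(shuffled) // size
    return [
        list(positive_ids) + shuffled[i * size:(i + 1) * size]
        for i in range(n_sets)
    ]


# Hypothetical integer IDs standing in for training records.
hbv_sets = downsize_sets(list(range(120)), list(range(1000, 1000 + 6300)))
hcv_sets = downsize_sets(list(range(373)), list(range(1000, 1000 + 5100)))

print(len(hbv_sets), len(hbv_sets[0]))  # 52 sets of 240 individuals
print(len(hcv_sets), len(hcv_sets[0]))  # 13 sets of 746 individuals
```

Each resulting set could then be passed either directly to an SVM or first through random forest variable ranking, as in the final two rows of each workflow.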

ISSN: 1472-6947