Using artificial intelligence to reduce diagnostic workload without compromising detection of urinary tract infections

Background: A substantial proportion of microbiological screening in diagnostic laboratories is due to suspected urinary tract infections (UTIs), yet approximately two thirds of urine samples typically yield negative culture results. By reducing the number of query samples to be cultured and enabling diagnostic services to concentrate on those in which there is true microbial infection, a significant improvement in the efficiency of the service is possible.

Methodology: The screening process for urine samples prior to culture was modelled in a single clinical microbiology laboratory covering three hospitals and community services across Bristol and Bath, UK. Retrospective analysis of all urine microscopy, culture, and sensitivity reports over one year was used to compare two methods of classification: a heuristic model using a combination of white blood cell count and bacterial count, and a machine learning approach testing three algorithms (Random Forest, Neural Network, Extreme Gradient Boosting) whilst factoring in independent variables including demographics, historical urine culture results, and clinical details provided with the specimen.

Results: A total of 212,554 urine reports were analysed. Initial findings demonstrated the potential of machine learning algorithms, which outperformed the heuristic model in terms of relative workload reduction achieved at a classification sensitivity > 95%. Upon further analysis of classification sensitivity within subpopulations, we concluded that samples from pregnant patients and children (age 11 or younger) require independent evaluation. Removing pregnant patients and children from the classification process was investigated first, but this diminished the workload reduction achieved. The optimal solution was found to be three Extreme Gradient Boosting algorithms, trained independently for the classification of pregnant patients, children, and all other patients. When combined, this system achieved a relative workload reduction of 41% and a sensitivity of 95% for each of the stratified patient groups.

Conclusion: Based on the considerable time and cost savings achieved, without compromising diagnostic performance, the heuristic model was successfully implemented in routine clinical practice in the diagnostic laboratory at Severn Pathology, Bristol. Our work shows the potential application of supervised machine learning models in improving service efficiency at a time when demand often surpasses the resources of public healthcare providers.

Electronic supplementary material: The online version of this article (10.1186/s12911-019-0878-9) contains supplementary material, which is available to authorized users.


Background
For routine clinical microbiology diagnostic laboratories, the highest workload is generated by urine samples from patients with suspected urinary tract infection (UTI) [1]. According to the UK Standards for Microbiology Investigations, UTIs are defined as the 'presence and multiplication of microorganisms, in one or more structures of the urinary tract, with associated tissue invasion'. The most common causative pathogen is E. coli, followed by other members of the Enterobacteriaceae family. The incidence of UTIs varies with age, gender, and comorbidities. Women experience a higher incidence than men, with 10-20% suffering from at least one symptomatic UTI during their lifetime. Most UTIs that occur in men are associated with physiological abnormalities of the urinary tract. In children, UTIs are common but often difficult to diagnose due to non-specific symptoms. Where a UTI is suspected, a urine sample is collected for processing by a centralised diagnostic laboratory. Upon arrival, the sample receives microscopic analysis, microbiological culture, and, where necessary, antimicrobial sensitivity testing [2]. However, many urine samples will yield a negative culture result, no significant bacterial isolate, or a mixed culture suggesting sample contamination. Such ambiguous and diagnostically unhelpful outcomes typically occur in approximately 70-80% of urine samples cultured [3][4][5][6][7][8]. This creates opportunities for significant cost savings. At the same time, diagnostic microbiology laboratories in the UK and elsewhere are undergoing transition to full laboratory automation [9][10][11]. With a view to assisting with the consolidation of services [12] and changes in laboratory practice, appropriate pre-processing and classification of urine samples prior to culture might be required to reduce the number of unnecessary cultures performed.
In many hospitals, automated urine microscopy is performed prior to culture using automated urine sediment analysers. This is a common precursor to culture and reports the cellular content of the urine sample; evidence of pyuria results in direct antimicrobial sensitivity testing accompanying culture, whereby, in addition to culture on chromogenic agar, urine is applied directly to nutrient agar for sensitivity testing by the Kirby-Bauer method. The use of microscopic analysis, biochemical dip-stick testing, and flow cytometry for predicting urinary tract infection is well documented in the literature. The current consensus is that WBC count and bacterial count correlate with culture outcome [3,4,13], but not well enough to replace culture entirely. Here we explored the potential for a machine learning solution to reduce the burden of culturing the large number of culture-negative samples without reducing detection of culture-positive samples, with concessions made for particularly vulnerable patient groups.
We speculated that the application of a statistical machine learning model that accounts not just for current diagnostic results but also for historical culture outcomes, as well as clinical details and demographic data, could potentially reduce laboratory workload without compromising the detection of UTIs. We contrast the classification performance of heuristic microscopy thresholds with three machine learning algorithms: a Random Forest classifier, a Neural Network with a single hidden layer, and the Extreme Gradient Boosting algorithm XGBoost. Random Forest classifiers are one of many ensemble methods, in which the predictions of multiple base estimators are combined to improve classification. In a Random Forest, multiple 'trees' are constructed, each from a bootstrap sample of the training data and a random subset of features. The final classification is obtained by averaging the predictions of all the 'trees', hence the name 'Random Forest' [14]. Neural Networks are supervised learning algorithms made up of multiple layers of 'perceptrons' with assigned weights, which, when summed and passed through a step function, produce a classification output. By optimising a loss function and adjusting the weights through a process called 'backpropagation', Neural Networks can learn non-linear relationships [14]. Boosting algorithms, such as the XGBoost algorithm in this study, generate a decision tree using a sample of the training data. The performance of the trained classifier, when tested using all the training data, is used to generate sample weights that influence the next classifier. An iterative process then occurs, each time generating a new classifier that is informed by the misclassifications of the prior classifier [15].
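The three model families can be sketched side by side with scikit-learn. This is an illustration only, not the study's code: the two features are invented stand-ins for WBC and bacterial counts, the toy labelling rule is arbitrary, and scikit-learn's GradientBoostingClassifier stands in for XGBoost.

```python
# Illustrative comparison of the three model families on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = rng.gamma(shape=2.0, scale=20.0, size=(n, 2))   # skewed "cell counts"
y = ((X[:, 0] > 30) & (X[:, 1] > 50)).astype(int)   # toy "positive culture" rule

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "neural_net": MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0),
    "boosted_trees": GradientBoostingClassifier(random_state=0),
}
# Fit each model and report holdout accuracy.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
print(scores)
```

In practice the tree ensembles tolerate skewed, unscaled counts better than the single-hidden-layer network, which normally benefits from feature scaling.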

Patient samples and data pre-processing
This project was performed as part of a service improvement measure on anonymised retrospective data at Southmead Hospital Bristol, North Bristol NHS Trust, UK, and was approved locally by the service manager and head of department. Urine samples with specimen dates between 1st October 2016 and 1st October 2017 (n = 225,207) were extracted from the Severn Pathology infection sciences laboratory information management system (LIMS), Winpath Enterprise. Additional file 2: Figure S1 details the pre-processing steps taken prior to investigation of microscopy thresholds and machine learning algorithms. Samples that received manual microscopy (often due to excessive haematuria or pyuria) and those from catheterised patients were excluded from the study. All pre-processing was performed in the Python programming language (version 3.5) utilising the Pandas library (version 0.23). The dependent variable, the culture result, was classified using regular expressions to create a binary outcome; a positive outcome was denoted as any significant bacterial isolate with accompanying antimicrobial sensitivities, whereas a negative outcome was a culture result of 'no growth', 'no significant growth', or 'mixed growth'.
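The binary labelling can be sketched as follows. The three negative report phrases come from the text above; the function name and the exact pattern are illustrative assumptions rather than the pipeline's published code.

```python
# Hedged sketch: labelling free-text culture reports as a binary outcome.
import re

# Phrases treated as negative outcomes, per the study's definition.
NEGATIVE_PATTERN = re.compile(r"no growth|no significant growth|mixed growth",
                              re.IGNORECASE)

def culture_outcome(report):
    """Return 1 for a significant isolate, 0 for a negative/unhelpful result."""
    return 0 if NEGATIVE_PATTERN.search(report) else 1

print(culture_outcome("No significant growth"))                    # → 0
print(culture_outcome("Escherichia coli, sensitivities attached"))  # → 1
```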
Microscopy counts for white blood cells (WBCs) and red blood cells (RBCs) were artificially capped at 100/μl due to the interface between SediMAX and Winpath Enterprise implemented in the laboratory. For the same reason, epithelial cell count was capped at 25/μl. No adjustments were made here, as the data set represents 'real-world' data and the type of data a model would encounter in practice. The bacterial cell count was heavily positively skewed. To counteract the effect of outliers without deviating from a representation of typical data, bacterial counts that exceeded the 99th percentile were classed as outliers and removed. Two additional features were engineered from the microscopy cell counts: 'haematuria with no WBCs' and 'pyuria with no RBCs'. Pyuria was defined as a WBC count ≥10/μl and haematuria as an RBC count ≥3/μl, as described in the UK Standards for Microbiology Investigations [2].
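These pre-processing steps can be sketched in pandas (the study's own library). Column names and the toy values are assumptions for illustration; the caps, the 99th-percentile outlier rule, and the two engineered flags follow the description above.

```python
# Illustrative pre-processing sketch in pandas (column names are assumptions).
import pandas as pd

df = pd.DataFrame({
    "wbc": [0, 5, 150, 40], "rbc": [0, 2, 300, 1],
    "epithelial": [0, 30, 10, 2], "bacteria": [10, 80, 100000, 500],
})

# Caps imposed by the SediMAX/LIMS interface.
df["wbc"] = df["wbc"].clip(upper=100)
df["rbc"] = df["rbc"].clip(upper=100)
df["epithelial"] = df["epithelial"].clip(upper=25)

# Drop bacterial-count outliers above the 99th percentile.
df = df[df["bacteria"] <= df["bacteria"].quantile(0.99)]

# Engineered flags: pyuria >= 10 WBC/ul, haematuria >= 3 RBC/ul (SMI definitions).
df["pyuria_no_rbc"] = (df["wbc"] >= 10) & (df["rbc"] < 3)
df["haematuria_no_wbc"] = (df["rbc"] >= 3) & (df["wbc"] < 10)
print(df)
```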

Patient groupings by clinical indicators
We defined several significant patient groups with a higher incidence of UTI based on clinical advice and prior published work [2,16,17]. For each of these groups we created a list of keywords for association (Additional file 1: Table S1). Using the Levenshtein distance algorithm implemented in the Natural Language Toolkit library (NLTK, version 3.3) [18] with an edit distance threshold of one or less, keywords were compared to clinical details provided with urine specimens, to classify specimens into significant patient groups. This implementation was chosen to negate errors in spelling and grammar in the clinical details provided, and as a result of its ease of use and popularity in text mining and bioinformatics applications [18,19].
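The fuzzy keyword matching can be sketched as below. The paper uses the Levenshtein implementation in NLTK; to keep this sketch self-contained, a minimal standard-library edit-distance function is shown instead, with the same threshold of at most one edit. The keywords and clinical-details strings are invented examples, not the study's keyword lists.

```python
# Sketch of fuzzy keyword matching against free-text clinical details.
import re

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def matches_group(details, keywords, max_dist=1):
    """True if any token of the clinical details is within max_dist of a keyword."""
    tokens = re.findall(r"[a-z]+", details.lower())
    return any(levenshtein(t, k) <= max_dist for t in tokens for k in keywords)

# Misspelt 'pregnent' is still matched, one edit away from 'pregnant'.
print(matches_group("pregnent, 12 weeks", {"pregnant", "pregnancy"}))
```

The one-edit threshold tolerates single typos while keeping unrelated words from matching by accident.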
To increase the accuracy of patient grouping, clinical details were consolidated where multiple samples were received from the same patient; approximately 58% of patients in the data set studied had multiple samples. For acute kidney infection, occurrence of keywords within a two-week timeframe resulted in allocation of a patient to this group. In the case of pregnancy, this timeframe was increased to nine months. When allocating patients to the pre-operative group, only the clinical details unique to a sample were considered. For all other groups the assumption was made that conditions are chronic and keyword search was conducted on the consolidation of all clinical details.
Using the same methodology as the patient grouping, two additional variables were engineered from the clinical details: the reported presence of nitrates in the urine, and descriptive qualities of the sample such as offensive smell and/or appearance.
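The per-patient consolidation of clinical details with a time window, described above for acute groups, can be sketched in pandas. The column names, window handling, and toy data are assumptions for illustration (a 14-day window stands in for the two-week acute-kidney-infection timeframe).

```python
# Sketch: pooling clinical details from a patient's samples within a time window.
import pandas as pd

samples = pd.DataFrame({
    "patient_id": [1, 1, 1, 2],
    "specimen_date": pd.to_datetime(["2017-01-01", "2017-01-10",
                                     "2017-06-01", "2017-03-05"]),
    "clinical_details": ["dysuria", "acute kidney infection",
                         "follow up", "routine"],
})

def pooled_details(row, window_days=14):
    """Join details from the same patient's samples within +/- window_days."""
    same = samples[samples.patient_id == row.patient_id]
    recent = same[(same.specimen_date - row.specimen_date).abs()
                  <= pd.Timedelta(days=window_days)]
    return " ".join(recent.clinical_details)

samples["acute_window_details"] = samples.apply(pooled_details, axis=1)
print(samples[["patient_id", "acute_window_details"]])
```

For chronic groups the window would simply span all of a patient's samples; for pregnancy it would widen to nine months.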

Exploratory data analysis and implementation of heuristic models and machine learning algorithms
Heuristic models using microscopy thresholds, as well as the machine learning algorithms, were developed in the Python programming language (version 3.5) utilising the Pandas (version 0.23) [20] and Scikit-learn (version 0.19) [14] libraries. Exploratory data analysis was performed in R (version 3.4.3) utilising the TidyVerse packages (version 1.2.1) [21] and base functions. Data visualisation and graphical plots were created using the Python library Seaborn (version 0.9.0) [22]. Three machine learning algorithms were assessed: a multi-layer feed-forward Neural Network, a Random Forest Classifier, and the XGBoost Gradient Boosted Tree Classifier. Random Forests, Neural Networks, and Boosting Ensembles have been noted as having the best performance in terms of accuracy amongst the 17 classifier 'families' investigated in a prior comparison [23]. Data was randomly split into training (70%, n = 157,645) and holdout (30%, n = 67,562) sets. Holdout data was used for model validation. Model training and parameter optimisation were performed using a grid-search algorithm with k-fold (k = 10) cross-validation, where the model parameters were chosen based on the area under the receiver operating characteristic curve (AUC score). Performance of models was measured as a balance between classification sensitivity and relative workload reduction when tested on holdout data; classification sensitivity took precedence in the choice of model, but once an optimal sensitivity of 95% was met, workload reduction was the deciding metric. Classification sensitivity and specificity were calculated as described in Additional file 3: Figure S2. 95% confidence intervals were calculated using the normal approximation method. Due to the size of the dataset studied, and following guidance published by Raschka [24], Cochran's Q test was selected to formally test for statistically significant differences in accuracy amongst models (p < 0.05). Where this condition was met, the McNemar test was used post hoc for individual model comparisons, with Bonferroni's correction for multiple comparisons; the McNemar and Cochran's Q tests were implemented using the MLXtend Python library [25].
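The tuning setup above (grid search, 10-fold cross-validation, AUC scoring) can be sketched with scikit-learn. The grid values, feature set, and class proportions are placeholders, not the published settings; the synthetic data only mimics the roughly 27% positive rate.

```python
# Sketch of grid search with 10-fold CV scored by ROC AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in: ~27% positive class, as in the study's data.
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.73],
                           random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [4, None]},
    scoring="roc_auc",   # parameters chosen on AUC, as in the study
    cv=10,
)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.best_score_, 3))
```

The holdout split is kept aside for final evaluation, exactly as the 70/30 split described above.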

Patient characteristics
Around 20% of the samples in the data set belonged to inpatients, with an incidence of significant culture of 20·8% (Table 1). The ratio of females to males was approximately 3:1, and the incidence of significant culture was similar at 21·6% and 26·8% for males and females, respectively. Amongst the groupings generated from clinical details, 'Pregnant' and 'Persistent/Recurrent Infection' contributed the largest proportion of the overall data, with all other groups each consisting of less than 12% of the data set. Samples categorised as 'Persistent/Recurrent Infection' showed an incidence of significant growth of almost 40%. The small number of samples whose clinical details included offensive smell or testing positive for nitrates showed the highest incidence of significant culture. Additionally, the presence of pyuria in the absence of red blood cells, a condition reported in 11·6% of samples, showed in excess of 50% bacterial culture-positive results. The age distribution for female patients was multimodal, with a peak between 20 and 40 years accounting for the pregnant women (Additional file 4: Figure S3). For males, the distribution was bimodal, with most samples coming from elderly individuals.

Exploratory data analysis
Exploratory data analysis revealed that, among the four microscopic cell counts performed, WBC and bacterial counts per μl showed the strongest correlation with the probability of significant bacterial growth on culture (Fig. 1). RBC and epithelial cell counts were not significantly associated with culture outcome. To confirm the relationships observed in Fig. 1, a Logistic Regression model was trained on the cellular counts; inclusion of WBC and bacterial counts produced a greater reduction in residual deviance than inclusion of RBC and epithelial cell counts. Age of the patient also positively correlated with the probability of significant growth, albeit to a lesser extent than WBC and bacterial counts.
With regards to the distribution of automated microscopy cell counts, the patient population was split into those with significant bacterial culture results and those without (Fig. 2). WBC counts demonstrated the greatest distinction between the population with significant culture results and the population without. Bacterial counts showed significant overlap between the two populations. Both were positively skewed, to a greater extent for the population with significant culture results, which also displayed a lower kurtosis. A high WBC count was associated with an increase in significant bacterial growth, as were bacterial counts above 500 cells/μl. Low counts of WBC or bacteria were, however, not diagnostic of a negative culture result.
Patient groups were ranked and compared using the Chi-squared test for independence (implemented in the Scikit-learn feature selection module). Pyuria in the absence of RBCs, pregnancy, positive testing for nitrates, persistent/recurrent infection, and being an inpatient ranked the highest, showing they were the least likely to be independent of class, and therefore more valuable for classification. Additionally, gender, smell, and being pre-operative ranked higher than other categorical variables, such as whether the patient was immunocompromised (Additional file 1: Table S2). While these were the most highly ranked of the clinical indicators, they were not in themselves enough for classification of the bulk of patients due to the low numbers existing in the population.
As an example, while being noted as positive for nitrates was associated with a high probability of culturable bacteria (59·7%), this occurred in only 6·09% of the patients found positive for bacterial culture. Hence, we examined the potential of heuristic and machine learning models that could include variables applicable to large numbers of patients.
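The Chi-squared ranking of binary indicators can be sketched with scikit-learn's `chi2` scorer. The feature names and the synthetic association strengths are illustrative assumptions; only the ranking mechanism reflects the method described above.

```python
# Sketch: ranking binary clinical indicators by chi-squared score.
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(1)
n = 5000
pyuria_no_rbc = rng.random(n) < 0.12   # ~11.6% prevalence, as in the data
inpatient = rng.random(n) < 0.2
noise = rng.random(n) < 0.5

# Outcome made to depend strongly on pyuria_no_rbc only (toy assumption).
y = (rng.random(n) < np.where(pyuria_no_rbc, 0.55, 0.2)).astype(int)

X = np.column_stack([pyuria_no_rbc, inpatient, noise]).astype(int)
scores, pvalues = chi2(X, y)
ranking = sorted(zip(["pyuria_no_rbc", "inpatient", "noise"], scores),
                 key=lambda t: t[1], reverse=True)
print(ranking)
```

Higher scores mean a feature is less likely to be independent of the culture outcome, mirroring the ordering reported in Additional file 1: Table S2.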
Performance of heuristic microscopy thresholds for predicting urine culture outcome
Given their strong association with positive bacterial culture, WBC counts and bacterial counts were chosen in combination to create a microscopy threshold for predicting culture outcome. Microscopy thresholds were compared using classification sensitivity, with 95% being chosen as the acceptable minimum. At the same time specificity, positive predictive value, negative predictive value, and the relative reduction in workload were calculated. By iterating over permutations from a range of WBC and bacterial counts, the effect of applied thresholds was simulated (Additional file 1: Tables S3 and S4). Following simulation of microscopy thresholds, the optimum minimum thresholds for WBC and bacterial counts were found to be 30/μl and 100/μl, respectively. With these criteria it was simulated that there would be a 39·1% reduction in the number of samples needing culture and a classification sensitivity of 96·0 ± 0·1% (95% CI) for culture-positive urines (Table 3). Despite achieving the optimal sensitivity, the specificity of using a microscopy threshold was only 52·1 ± 0·4% (95% CI). The potential for an improved solution that reduced the number of false positive classifications resulted in exploration of supervised machine learning solutions incorporating additional variables.
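The threshold simulation can be sketched as follows. The data here is synthetic, and one plausible reading of the heuristic is assumed: a sample is sent for culture if either its WBC count or its bacterial count reaches its threshold.

```python
# Sketch: simulating microscopy thresholds on labelled data.
import numpy as np

rng = np.random.default_rng(2)
n = 10000
positive = rng.random(n) < 0.27   # ~27% culture-positive, as in the study
# Skewed synthetic counts, higher on average for positives.
wbc = np.where(positive, rng.gamma(4, 25, n), rng.gamma(1.5, 10, n))
bacteria = np.where(positive, rng.gamma(4, 60, n), rng.gamma(1.5, 30, n))

def simulate(wbc_t, bact_t):
    """Sensitivity and relative workload reduction for a threshold pair."""
    cultured = (wbc >= wbc_t) | (bacteria >= bact_t)
    sensitivity = (cultured & positive).sum() / positive.sum()
    workload_reduction = 1 - cultured.mean()
    return sensitivity, workload_reduction

for wbc_t in (10, 30, 50):
    for bact_t in (100, 150):
        s, w = simulate(wbc_t, bact_t)
        print(f"WBC>={wbc_t}, bact>={bact_t}: sens={s:.3f}, reduction={w:.3f}")
```

Iterating the same way over a full grid of threshold pairs, and keeping the pair with the best workload reduction subject to sensitivity ≥ 95%, reproduces the selection logic described above.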

Integration of additional variables into machine learning algorithms
To measure the effectiveness of the machine learning algorithms, a Logistic Regression Classifier based on WBC and bacterial counts was used as a baseline. This algorithm exhibited similar performance to the microscopy threshold, as expected, since Logistic Regression classifiers cannot capture non-linear relationships between independent and dependent variables; such relationships were suspected during exploratory data analysis.
The data exhibited a natural class imbalance in that only 27% of samples resulted in a positive culture outcome. Given that the purpose of this study was to create a screening method which would reduce the incidence of culture without compromising sensitivity, class weights were applied in such a way that false negative classifications were more heavily penalised than false positives. Initial class weights were chosen through grid-search parameter optimisation and then adjusted manually to improve sensitivity. In the case of the Neural Network, resampling (without replacement) was used to eliminate class imbalance from the training data. Table 2 details the results of feature selection, performed using recursive feature elimination (RFE) to generate a list of optimal features; feature importance and AUC score in a Random Forest Classifier were used to eliminate features recursively. RFE suggested 16 optimal features (features with a ranking of 1).
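The combination of class weighting and recursive feature elimination can be sketched with scikit-learn. The weight of 10 on the positive class and the target of 4 features are illustrative assumptions, not the study's tuned values.

```python
# Sketch: RFE with a class-weighted Random Forest penalising false negatives.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           weights=[0.73], random_state=0)

estimator = RandomForestClassifier(
    n_estimators=100,
    class_weight={0: 1, 1: 10},   # false negatives cost more (weights illustrative)
    random_state=0,
)
# Recursively drop the least important feature until 4 remain.
selector = RFE(estimator, n_features_to_select=4).fit(X, y)
print("selected feature indices:",
      [i for i, keep in enumerate(selector.support_) if keep])
print("ranking:", list(selector.ranking_))
```

Features ranked 1 correspond to the retained set, analogous to the 16 optimal features reported in Table 2.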
The results of the supervised machine learning models when trained on the optimal features (those with an RFE ranking of 1) are shown in Table 3, with an accompanying ROC curve in Fig. 3. All machine learning algorithms outperformed the heuristic model (microscopy threshold of 30 WBC/μl and 100 bacteria/μl) in terms of accuracy. The Random Forest Classifier provided the best performance with a sensitivity of 95·95 ± 0·23% (95% CI) and a reduction in the number of necessary cultures by 47·58%. Cochran's Q test found a statistically significant difference between models and post-hoc comparison to the heuristic model by McNemar's test showed all models to be significantly different in terms of classification accuracy.

Classification of pregnant patients
When observing the classification sensitivity for different patient demographics, it was noted that the sensitivity for pregnant patients was in the range of 56-86% across all models, below the sensitivity for the general population. Asymptomatic bacteriuria is a condition known to occur in 2-10% of pregnancies and is associated with adverse outcomes such as increased risk of preterm birth, low birth weight, and perinatal mortality [26]. Figure 4 compares the kernel density estimate for WBC and bacterial counts, where there was significant bacterial growth on culture, for pregnant patients and all other patients. For pregnant patients there was a greater prevalence of samples with increased bacterial count in the absence of WBCs, which may explain the poor classification sensitivity in comparison to other patient groups.
Considering that all samples from pregnant patients and children under 11 years of age should be cultured routinely according to the recommendations of the UK Standards for Microbiology Investigations [2], the heuristic model was re-examined and microscopy thresholds analysed with those patients removed (Table 4). The new optimal microscopy threshold was found to be 30 WBC/μl and 150 bacteria/μl. This threshold performed with a sensitivity of 95·0 ± 0·1% (95% CI) and a relative workload reduction of 33·7% (Table 4, Fig. 5). Due to the considerable cost savings without compromising diagnostic performance, this model went on to be implemented into clinical practice at the Severn Pathology service in Bristol, UK.
In response to this finding, machine learning algorithms were revisited with the removal of pregnant patients and children under 11 years old from the classification process. Since the Random Forest classifier provided the best performance previously, a new implementation of this algorithm was trained on a randomly selected cohort of 70% of the remaining data; 30% was kept as holdout for evaluation of model performance. Parameter optimisation was performed using grid search, with a reduced class weight of 1:8 for positive culture now that pregnant patients were no longer considered. As shown in Table 4, a Random Forest Classifier that considers additional variables could achieve a specificity of 68·8%, compared with the specificity of the heuristic model of 44·6%. However, given that samples from pregnant women and children under 11 together comprise 29·2% of samples entering the pipeline, the overall workload reduction only improved by around 4%.
The alternative approach was to separate pregnant patients and children from all other samples, creating three separate datasets. Training and validation data was generated for each dataset following the same methodology as before. XGBoost is a resource-efficient algorithm that exhibits greater computational performance [15]. For this reason, combined with its good classification performance in prior experiments, it was chosen over all other machine learning models going forward. The algorithms were trained independently of one another and evaluated on holdout data from their separate populations (pregnant patients, children, and everyone else). Classification sensitivity for pregnant patients, children, and samples from all other patients was 95·4%, 94·9%, and 95·3%, respectively. When tested on the validation data, the combined workload reduction from the three independent models was 41·2%, a significant improvement over the performance of the heuristic model. This combination of XGBoost models gave optimal performance in terms of classification sensitivity and relative workload reduction, and is summarised in Fig. 6.
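The stratified design can be sketched as below: one boosted-tree model per patient group, with the combined workload reduction computed over all holdout samples. This is a toy illustration on synthetic data; scikit-learn's GradientBoostingClassifier stands in for XGBoost, and the group proportions and labelling rule are invented.

```python
# Sketch: one boosted-tree model per stratified patient group.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 6000
group = rng.choice(["pregnant", "child", "other"], size=n, p=[0.1, 0.2, 0.7])
X = rng.gamma(2, 20, size=(n, 3))
y = (X[:, 0] > 40).astype(int)   # toy "positive culture" rule

models, cultured, total = {}, 0, 0
for g in ("pregnant", "child", "other"):
    mask = group == g
    X_tr, X_te, y_tr, y_te = train_test_split(X[mask], y[mask],
                                              test_size=0.3, random_state=0)
    models[g] = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
    pred = models[g].predict(X_te)
    cultured += pred.sum()   # samples the model flags for culture
    total += len(pred)

print("combined workload reduction:", round(1 - cultured / total, 3))
```

Training each group on its own data lets the pregnant-patient model learn, for example, the higher prevalence of bacteriuria without pyuria noted in Fig. 4.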

Discussion
To our knowledge, there are no other observational studies of this magnitude on urine analysis for the diagnosis of UTIs. Most previous studies with the objective of predicting urine culture outcome from variables generated by sediment analysis, flow cytometry, and/or dip-stick testing have been controlled studies of a few hundred patients, with little consistency in the inclusion criteria [3,4,6,13,[27][28][29]. Prior efforts to establish a heuristic model based on microscopy thresholds generated conflicting results. Falbo et al. [4] and Inigo et al. [3] reported sensitivities and specificities in the range of 96-98% and 59-63%, respectively, with microscopy thresholds on sample populations of less than 1000. Both studies reported an optimum WBC count of 18 cells/μl but differing bacterial counts (44/μl and 97/μl, respectively). Variation in results between the two studies is likely to be due to small sample sizes. It should also be noted that neither study adjusted for pregnant patients or children under the age of 11, and the sensitivity of classification for vulnerable demographics was not shared. Additionally, greater than 50% of samples in the study by Inigo et al. originated from inpatients, and both studies included specimens from catheterised patients [3,4]. In contrast to those findings, Sterry-Blunt et al. [6] reported, from a study of 1411 samples, that the highest achievable negative predictive value when using white blood cell and bacterial count thresholds was 89·1%, and concluded that the SediMAX should not be used as a screening method prior to culture. The use of flow cytometry for urine analysis prior to culture has been gaining popularity as a replacement for automated urine microscopy and shows good performance in the literature. Multiple studies have now shown that the use of flow cytometry with optimised cell count thresholds provides greater specificity without compromising sensitivity when classifying urine samples [3,27,[30][31][32].
Future work should investigate the benefit of using machine learning algorithms that include cellular counts generated using flow cytometry methods as opposed to automated microscopy.
Taking advantage of recent developments in 'big data' technologies, our observational study analysed data representing an entire year of urine analysis at a large pathology service that covers sample processing for multiple hospitals as well as the community in the Bristol/Bath region in the Southwest of the UK. To our knowledge, there have been no previous attempts to apply machine learning techniques for the purpose of predicting urine culture outcome in a laboratory setting. Taylor et al. [5] applied supervised machine learning to predict UTIs in symptomatic emergency department patients. An observational study of 80,387 adult patients, using 211 variables of both clinical and laboratory data, was used to develop six machine learning algorithms that were then compared to documentation of UTI diagnosis and antibiotic administration. The study concluded that the XGBoost algorithm outperformed all other classifiers, and that, compared to the documented diagnosis, application of the algorithm would result in approximately 1 in 4 patients being re-categorised from false positive to true negative, and 1 in 11 patients being re-categorised from false negative to true positive. The XGBoost algorithm presented has similar performance to the one trained on our dataset, with an AUC score of 0·904. The sensitivity was poor, however, at 61·7%, with a corresponding specificity of 94·9%. It is suspected that the difference in sensitivity between our models is the result of the application of class weights. Taylor et al. [5] did not disclose any parameter tuning of this sort, and the sensitivity reported was likely a result of class imbalance (only 23% of their training data consisted of positive samples). Here, we applied class weights to direct the classification algorithm towards the high sensitivity expected of a screening test.
Our study made considerations for the high-risk groups of pregnant patients and children under the age of 11, with the objective to generate a predictive algorithm that would conform to the UK Standards for Microbiology Investigations. We also classified patients into groups based on identification of key words in clinical details provided by the requesting clinician. Although methods were put into place to increase the accuracy of these classifications (employment of a Levenshtein distance algorithm and consolidation of clinical details from patients with multiple samples), the free-form nature of the notes means that key words would not always be included even when applicable. This has likely led to an underestimation of some groups, but it is possible that this may be addressed in future by more advanced text mining of clinical notes, such as the use of deep learning techniques that can classify patients into medical subdomains, as shown successfully by Weng et al. [33].

Fig. 6 Performance of the optimal model, with independent classification algorithms for stratified patient groups, as predicted from validation data. The top four features are ranked by average feature importance for all decision trees in the model. Performance is shown as sensitivity ± 95% confidence interval.
In our dataset, when observing samples that generated a positive bacterial culture, there is a clear difference in the distribution of white cell counts in pregnant patients compared to all other patients. The changes in the immune response during pregnancy are not fully understood, but it is agreed that the immune system is significantly modulated [26]. This could explain the differences observed in our dataset, but we must also consider the contribution from screening for asymptomatic bacteriuria in pregnant patients during the middle trimester. Although asymptomatic bacteriuria is cited as being associated with adverse outcomes [26,27], a randomised controlled study of 5132 pregnant patients in the Netherlands reported a low risk of pyelonephritis in untreated asymptomatic bacteriuria, questioning the use of such screening [7].
Our study demonstrates the power of machine learning algorithms in defining critical variables for clinical diagnosis of suspected UTIs. Given increasing demand due to ageing populations in most developed and developing countries, radical change is needed to improve cost efficiency and optimise capacity in diagnostic laboratories. At a time when antimicrobial resistance is dramatically on the rise amongst Gram-negative bacteria, including the two most common urinary pathogens, E. coli and Klebsiella pneumoniae, any significant reduction in inappropriate sample processing will have a positive impact on the turn-around time for clinically relevant infection and improve time to appropriate therapy and antimicrobial stewardship.
Extrapolating our estimated workload reduction to a national scale, and considering only the reduced purchase of culture agar (without the time cost and additional expenses involved in performing bacterial culture), implementation of the three XGBoost algorithms described in Fig. 6 would result in savings of £800,000-5 million per year across the UK (estimates are based on local purchasing data and online sources [34]).
There are several limitations of this study. Firstly, the retrospective nature of the study makes it difficult to clarify some of the details such as potential mis-labelling of samples. However, the use of over 200,000 samples archived with a state-of-the-art LIMS system should ensure the data are relatively robust to random individual errors in labelling. Secondly, the clinical details provided by the requesting clinicians were relatively sparse. This is true for most diagnostic requests in a busy and publicly-funded hospital, where doctors must prioritise their limited time. Hence, the dataset represents the "real life" scenario. Thirdly, it should be remembered that the outcome we have studied is a culture predictability rather than clinical/therapeutic outcome.