On Scene Injury Severity Prediction (OSISP) model for trauma developed using the Swedish Trauma Registry

Background Providing optimal care for trauma, the leading cause of death for young adults, remains a challenge e.g., due to field triage limitations in assessing a patient’s condition and deciding on transport destination. Data-driven On Scene Injury Severity Prediction (OSISP) models for motor vehicle crashes have shown potential for providing real-time decision support. The objective of this study is therefore to evaluate if an Artificial Intelligence (AI) based clinical decision support system can identify severely injured trauma patients in the prehospital setting. Methods The Swedish Trauma Registry was used to train and validate five models – Logistic Regression, Random Forest, XGBoost, Support Vector Machine and Artificial Neural Network – in a stratified 10-fold cross validation setting and hold-out analysis. The models performed binary classification of the New Injury Severity Score and were evaluated using accuracy metrics, area under the receiver operating characteristic curve (AUC) and Precision-Recall curve (AUCPR), and under- and overtriage rates. Results There were 75,602 registrations between 2013–2020 and 47,357 (62.6%) remained after eligibility criteria were applied. Models were based on 21 predictors, including injury location. From the clinical outcome, about 40% of patients were undertriaged and 46% were overtriaged. Models demonstrated potential for improved triaging and yielded AUC between 0.80–0.89 and AUCPR between 0.43–0.62. Conclusions AI based OSISP models have potential to provide support during assessment of injury severity. The findings may be used for developing tools to complement field triage protocols, with potential to improve prehospital trauma care and thereby reduce morbidity and mortality for a large patient population. Supplementary Information The online version contains supplementary material available at 10.1186/s12911-023-02290-5.


Background
Trauma is defined as injury caused by external force and covers a wide spectrum of scenarios: penetrating and blunt force trauma; intentional and unintentional trauma; and low- and high-energy trauma, e.g., falls, motor vehicle crashes and violence [1]. It is the leading cause of death in the young population and accounts for more than 5 million deaths per year globally, corresponding to 9% of the world's deaths [2]. In addition to its high mortality rate, trauma also carries a high social cost; in Sweden, the cost of injuries is estimated at 60 billion SEK yearly (approximately US$6.4 billion/€5.8 billion) [3]. Prehospital assessment and care, i.e., care provided at the scene of injury and during transport to a hospital [3], can play a critical role in the delivery of optimal trauma care [4] by facilitating prioritization and the choice of an adequate destination. Increasing precision in prehospital assessment, prioritization and management of trauma patients is therefore essential to increase personalized care and improve medical outcome.
When arriving at the scene and during transport, the Emergency Medical Services (EMS) clinicians perform field triage to assess severity of injury, prioritize and decide transport destination [5]. If the assessment indicates that a patient is severely injured and has life-threatening conditions, the time to definitive optimal care, which may not be available at the closest hospital, must be minimized to increase the chance of survival [1]. To achieve this, trauma care can be organized as a trauma system that classifies certain medical facilities as trauma centers (TC) depending on their capabilities for managing severely injured patients [6]. The condition of a patient can then be matched with an appropriate destination according to predefined route schemes [6], where direct transportation of severely injured patients to a TC instead of the closest emergency department (ED) reduces the time to definitive care [1] and thereby reduces mortality [7,8]. A pooled analysis of 52 studies showed that trauma systems reduce the odds of mortality (OR 0.74, 95% confidence interval 0.69-0.79) [9]. There is no unified trauma system with TC in Sweden [10]; a common approach is therefore to approximate university hospitals as TC [11,12]. By doing so, reduced mortality has also been indicated [11,13].
Assessment of a patient's condition is a difficult task that requires both general and individual understanding of the trauma incident to deal with varying circumstances [1]. In Sweden, national trauma team criteria have been recommended for activating a trauma team at an ED [14]. EMS clinicians initiate the procedure and alert the nurse in charge in the ED, who in turn decides the level of trauma team alert. The protocol functions as a checklist and activates a full trauma team (level 1, life-threatening) if any physiologic or anatomic criteria are fulfilled. If none of the physiologic or anatomic criteria are fulfilled, but any mechanism of injury criteria are, a limited trauma team is activated (level 2, potentially life-threatening) [14]. The protocol also contains observation points that may raise the assessed level if fulfilled: age < 5 or > 60 years, pregnancy, hypothermia, anticoagulant therapy, serious comorbidity, intoxication or prehospital deterioration [14]. Field triage protocols can be evaluated by the percentage of incorrect classifications in terms of undertriage (a severely injured patient being transported to a non-TC) and overtriage (a patient with minor injuries being transported to a TC) [6]. The acceptable level of undertriage is less than 5%, as it has a direct impact on the patient's chance of survival, whereas overtriage has a higher acceptable range of 25 to 35%, as it mostly concerns overcrowding at a TC [6]. In practice, reported under- and overtriage rates usually do not meet the acceptable levels. High proportions of undertriaged patients have been indicated [11,15], especially among motor vehicle crashes [12].
The high proportion of undertriaged patients can be understood in light of the difficult trade-offs made in prehospital decision-making, where the level of resources at the closest hospital is weighed against the transportation time to a hospital further away with a higher level of resources [1]. In some cases, the results can also be explained by underestimation of a patient's care need, which is more common for certain categories of injuries and patients [16]. For instance, age has been identified as a factor influencing the level of care a patient receives, where elderly patients with severe injuries are at greater risk of being undertriaged [15-17]. Another example is pre-shock, which can be difficult to detect, as children, younger patients and athletes in particular can compensate for a long time before shock is evident [1]. The challenge of reaching acceptable levels of triage accuracy indicates limitations in the current method of assessing a patient's condition. We believe this may be improved by complementing field triage protocols with mathematics and leveraging the potential of statistics and artificial intelligence (AI), e.g., for discerning complex patterns between criteria associated with life-threatening conditions. AI-based methods for predicting the risk of injury severity can potentially increase the precision of trauma severity assessment.
AI can identify complex relationships between variables and has been shown to increase precision in several health care domains. The prehospital care is increasingly represented [18-20], where researchers face challenges with developing models based on incomplete data [21]. Injury severity prediction models for field triage have been studied by Candefjord and associates under the concept of On Scene Injury Severity Prediction (OSISP), focusing on motor vehicle crashes [22-24]. However, the potential of increasing precision in field triage of all trauma incidents in Sweden with AI has not been studied. Prior studies have focused either on subsets of trauma patients (geriatric trauma and motor vehicle crashes), or the complete prehospital patient group, which could be argued to lower the prediction performance due to poor generalization of performance across target domains [25]. Studies on adult trauma have focused on one particular model, a small set of predictors and/or listwise deletion of missing data, which may benefit the performance of simpler models.
The aim of this paper is to evaluate if an AI-based OSISP model for prehospital trauma has potential to complement clinical practice in predicting the risk of injury severity. The models are developed and internally validated with data from the Swedish Trauma Registry (SweTrau) [26]. The long-term ambition is to provide a responsible and explainable [27] data-driven prehospital injury severity classification model for real-time assessment of patients to support prehospital decision making.

Source of data
This was a registry study where data from SweTrau, the Swedish national trauma registry, during the period 2013-2020 were used to develop and validate OSISP models. Registration in SweTrau applies to patients fulfilling any of the following three criteria: 1) All patients where a trauma alert was activated at the hospital; 2) Hospitalized patients with New Injury Severity Score (NISS) > 15, even if they did not trigger a trauma alert; and 3) Patients who were transferred to the hospital within seven days after the traumatic incident and had NISS > 15 [26]. Exclusion criteria apply when the only injury was a chronic subdural hematoma or if a trauma alert was triggered without an underlying traumatic incident [26]. The registrations are managed by each connected hospital via authorized personnel [26].
SweTrau is based on a variable set proposed in the 2008-2009 Utstein protocol, a European consensus protocol for uniform reporting of data following major trauma [28]. The data contain predictive model variables (e.g., age, systolic blood pressure [SBP], dominating type of injury), system characteristic variables (e.g., type of transportation, airway management and highest level of prehospital care provider), and process mapping variables (e.g., timestamps of arrival at scene, first CT scan and first key emergency intervention) related to a patient's care chain registered at the scene of injury, on arrival at hospital, at discharge and at 30 days after the trauma incident. Injuries are coded retrospectively with the Abbreviated Injury Scale (AIS, version 2005 Update 2008), where a 7-digit code contains information on injury type, location and severity [29]. The overall injury severity of a patient with multiple injuries is described using the Injury Severity Score (ISS) [30] and the New Injury Severity Score (NISS) [31]. ISS is calculated by summing the squares of the highest AIS severity in each of the three most severely injured of six predefined body regions [30], whereas NISS is calculated by summing the squares of the three most severe injuries independent of body region [31].
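The difference between the two scores can be illustrated with a minimal Python sketch. It assumes that each injury is given as a hypothetical (body region, AIS severity) pair with severity 1-5; the function names and the example patient are illustrative and not taken from SweTrau, and special cases such as unsurvivable injuries are not handled.

```python
from collections import defaultdict


def iss(injuries):
    """ISS: squares of the worst AIS severity in the three worst body regions."""
    worst_per_region = defaultdict(int)
    for region, severity in injuries:
        worst_per_region[region] = max(worst_per_region[region], severity)
    top_three = sorted(worst_per_region.values(), reverse=True)[:3]
    return sum(s * s for s in top_three)


def niss(injuries):
    """NISS: squares of the three most severe injuries, regardless of region."""
    top_three = sorted((severity for _, severity in injuries), reverse=True)[:3]
    return sum(s * s for s in top_three)


# Hypothetical patient with two serious head injuries and a minor chest injury.
patient = [("head", 3), ("head", 3), ("chest", 1)]
iss_score = iss(patient)    # 3^2 + 1^2 = 10 (only one head injury counts)
niss_score = niss(patient)  # 3^2 + 3^2 + 1^2 = 19
```

In this example the patient would be classified as severely injured with NISS > 15 but not with ISS > 15, which is the kind of case that motivated the introduction of NISS.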

Sample size
The minimum number of registered trauma incidents needed to develop the prediction model was calculated according to Buderer [32] to decide whether data from SweTrau would be suitable to use. Statistical significance was set as p < 0.05 with a tolerance of 1% of the 95% confidence interval (CI). From SweTrau's annual reports, the prevalence of severely injured patients (NISS > 15) was approximated to 21.3%. To our knowledge, reported clinical practice of undertriage rate and overtriage rate range between 10.5-72.0% and 9.9-48.2%, respectively [15]. With the aim of developing a model that exceeds clinical practice in precision, the expected sensitivity and specificity were set to 90% and 25%, respectively. Calculations showed that a sample size of approximately 16,000 registrations was needed, clearly exceeded by the number of registrations in SweTrau during the selected time-period.
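The order of magnitude of this calculation can be reproduced with a short Python sketch of Buderer's formula. It assumes that the stated tolerance of 1% is the half-width of the 95% CI; the function name and argument names are illustrative, and the authors' exact computation is not given in the text.

```python
import math
from statistics import NormalDist


def buderer_sample_size(expected, group_fraction, half_width, confidence=0.95):
    """Total sample size needed to estimate a sensitivity or specificity.

    expected       -- expected sensitivity (or specificity)
    group_fraction -- prevalence for sensitivity, 1 - prevalence for specificity
    half_width     -- assumed half-width of the confidence interval
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * expected * (1 - expected) / (half_width ** 2 * group_fraction)
    return math.ceil(n)


prevalence = 0.213  # approximate share of patients with NISS > 15
n_sensitivity = buderer_sample_size(0.90, prevalence, 0.01)
n_specificity = buderer_sample_size(0.25, 1 - prevalence, 0.01)
n_required = max(n_sensitivity, n_specificity)
```

Under these assumptions the sensitivity requirement dominates and the required sample size lands slightly above 16,000 registrations, in line with the figure reported above.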

Participants
According to the annual report of 2020, 47 of 49 hospitals providing emergency services (95.9%) in Sweden were connected to SweTrau, where 40 (81.6%) contributed with registrations [33]. The registry had an approximate coverage, i.e. the number of trauma patients with intensive care need in SweTrau compared to the number in the Swedish Intensive Care Registry, of 63.4% in 2020, where the highest coverage was obtained from the Stockholm, south and middle healthcare regions and the lowest from the west, north, and southeast healthcare regions [33].
Six sampling exclusion criteria were applied to extract information relevant for field triage in the prehospital setting: 1) Registrations where a prehospital resource was not involved; 2) Transfers between hospitals; 3) Children, i.e. patients younger than 15 years; 4) Data falling outside realistic values according to defined ranges in SweTrau's manual or as judged by the authors; 5) Duplications, i.e. registrations that shared the same patient ID and where time between trauma incidents was less than 24 h; 6) Missing data in outcome variables.
The definition of a duplication was chosen to allow early readmissions to be retained, since multiple readmissions within a year may indicate an increased risk of unplanned readmission of trauma patients [34]. The time difference was calculated with the timestamp for arrival at hospital since it was the date-time variable with the least amount of missing data, and instances with a missing date or time were excluded. In the case of a duplication, only the first instance was included in the dataset.
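A minimal Python sketch of this duplicate rule is given below. The record layout (keys `patient_id` and `arrival`) is hypothetical, and the sketch assumes that each registration is compared with the most recent retained registration of the same patient, which the text does not specify.

```python
from datetime import datetime, timedelta


def drop_duplicates(registrations, window=timedelta(hours=24)):
    """Keep the first registration per patient within each 24 h window.

    Registrations with a missing arrival timestamp are excluded.
    """
    usable = [r for r in registrations if r["arrival"] is not None]
    usable.sort(key=lambda r: (r["patient_id"], r["arrival"]))
    kept, last_kept = [], {}
    for reg in usable:
        previous = last_kept.get(reg["patient_id"])
        if previous is not None and reg["arrival"] - previous < window:
            continue  # duplicate: same patient, less than 24 h apart
        kept.append(reg)
        last_kept[reg["patient_id"]] = reg["arrival"]
    return kept


records = [
    {"patient_id": "A", "arrival": datetime(2019, 5, 1, 10, 0)},
    {"patient_id": "A", "arrival": datetime(2019, 5, 1, 18, 0)},  # 8 h later
    {"patient_id": "A", "arrival": datetime(2019, 5, 3, 9, 0)},   # readmission
    {"patient_id": "B", "arrival": None},                         # no timestamp
]
kept = drop_duplicates(records)
```

Here the second registration is dropped as a duplicate, the readmission two days later is retained, and the registration without a timestamp is excluded.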

Predictors
As a first step, potential predictors were chosen based on relevance to injury severity gained from literature [1,5,28,35-38], clinical knowledge and potential to be captured in prehospital settings, resulting in the following set: age, gender, prehospital Glasgow Coma Scale (GCS), motor component of GCS (mGCS), prehospital SBP, prehospital respiratory rate (RR), prehospital cardiac arrest, prehospital airway management and type, season of year, weekday of trauma, time of trauma, time interval between the emergency call and the prehospital resource arriving at the scene (response time), dominating type of injury, mechanism of injury, intention of injury and AIS regions. In SweTrau, the SBP and RR can be registered as continuous (measurements of the vital signs) or categorical (approximations of the vital signs divided into Revised Trauma Score [RTS] levels). Because the data collection is based on different methods, both the continuous and categorical variables for SBP and RR were included.
Next, an assessment of each potential predictor's predictive value for injury severity was performed. A univariate Chi-square test of independence with significance level p < 0.05 was performed separately on each potential predictor versus the primary outcome (NISS > 15), where Yates' continuity correction was applied when the degrees of freedom were equal to one [39]. The univariate test was also used to select one variable when several carried similar information, i.e., GCS and mGCS, SBP based on measurements and RTS, and RR based on measurements and RTS. In these cases, the variable with the lowest p-value was selected. Logistic regression (LR) was applied for a multivariate analysis of the potential predictors, where variables with statistically significant coefficients were deemed suitable predictors. A significant result in either of the two statistical tests, i.e., univariate or multivariate, motivated inclusion of the variable in the final set of predictors used to train and validate the machine learning models.
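For a binary predictor, the univariate test reduces to a 2x2 contingency table with one degree of freedom. The following Python sketch shows the Yates-corrected statistic for that case only; the counts are invented for illustration and the function name is hypothetical. Predictors with more than two levels would need a general Chi-square implementation from a statistics library.

```python
import math


def chi2_yates_2x2(a, b, c, d):
    """Yates-corrected Chi-square test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) for one degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # The continuity correction is capped so that it never flips the sign.
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    statistic = n * corrected ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the Chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Invented counts: rows = predictor present/absent, columns = NISS > 15 yes/no.
statistic, p_value = chi2_yates_2x2(30, 70, 10, 90)
significant = p_value < 0.05
```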

Machine learning models
Five machine learning techniques were selected based on promising results within prehospital care, emergency medicine, triaging and trauma: LR [19,20,40-42], Random Forest (RF) [19,42-45], Support Vector Machine (SVM) [41,42], eXtreme Gradient Boosting (XGBoost) [18,45] and Artificial Neural Networks (ANN) [19,41,44]. Because the aim of the study was to explore if there is a potential in using an AI-based OSISP model for prehospital triage of trauma and complement the clinical practice, optimization during the model development was not incorporated in the study design and default settings were used for each model.
An LR model is a supervised learning technique that describes the expected probability of a positive event in terms of a logit function and regression coefficients [46]. Sklearn's class LogisticRegression was used to implement the model.
An RF model classifies samples by the majority vote of several decision trees, each created from a bootstrapped sample (drawn with replacement) of the original dataset and built by considering a random subset of the available variables [47]. Sklearn's class RandomForestClassifier was used to implement the model.
An XGBoost model performs classification by combining the outputs of several sequentially built trees, where each tree is created based on residual similarity scores and gains [48]. The model was implemented in Python using the open-source software library XGBoost, with the objective of binary classification. The evaluation metric was set as the area under the precision-recall curve (AUCPR) since it has been argued to reflect a model's performance more accurately in the case of imbalanced data compared to the traditional area under the curve (AUC) for the receiver operating characteristic (ROC) curve [49,50].
An SVM is a supervised learning technique that transforms data to a higher dimension to find a decision boundary, as a line or hyperplane, which successfully separates classes [51]. Sklearn's class SVC was used to implement the model.
An ANN consists of a network that resembles the human brain with input units, hidden units and output units and performs nonlinear classification by updating the connections between the units [52]. The model was implemented with Sklearn's class MLPClassifier.

Outcome
The NISS was selected as the primary outcome variable of this study because of its wide use in injury severity scoring and accessibility in SweTrau. To assess the sensitivity of the model's predictive ability in relation to injury severity, the ISS was also used as an outcome measure. Historically, NISS is the successor of ISS and was developed due to the limitation that ISS does not consider multiple severe injuries within the same body region [31]. Although both NISS [31] and ISS [53] correlate with mortality, comparative studies of the two scales report better predictive power in terms of survival after severe trauma with NISS as compared to ISS [31,54,55].
Traditionally, ISS > 15 has been used to define severely injured patients [56], but adjustments of the AIS coding of injuries have led to recommendations of adopting ISS > 12 to decrease the risk of excluding severely injured patients [53]. In the present study, a threshold of 15 was used as the definition of severely injured trauma patients. The model's predictive ability in relation to risk group was assessed by comparing the result of this threshold with a threshold equal to 12, for both NISS and ISS.
The secondary objective of this study was to evaluate whether the OSISP models have potential to complement clinical practice, i.e. whether OSISP has potential to lower field under- and overtriage. Because AIS codes are registered retrospectively at the hospital, the NISS and ISS scores are not available in the prehospital setting and were therefore not suitable as comparison metrics. Instead, under- and overtriage were selected, where undertriage was defined as a severely injured patient being transported to a non-TC and overtriage was defined as a patient with minor injuries transported to a TC, following ACS-COT recommendations [6]. The mapping of hospital name, hospital code and binary classification (TC/non-TC) followed an earlier study [11]. For clinical practice, under- and overtriage were calculated based on the NISS/ISS score registered at the hospital and whether the decided destination was a TC or non-TC. For models, under- and overtriage were calculated based on the predicted NISS/ISS score and an automatic decision of destination based on the NISS/ISS score. The difference in calculations of clinical outcome and models was applied since information about the geographic location of the scene of injury was not registered in SweTrau, and therefore a decision of transportation destination could not take the proximity of different hospitals into account.
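The triage definitions above can be written out as a small Python sketch. It assumes that each case is a pair of booleans (severely injured, transported to a TC) and that the rates are taken relative to all severely injured and all non-severely injured patients, respectively; the function name and the example counts are invented for illustration.

```python
def triage_rates(cases):
    """Return (undertriage, overtriage, precision) for (severe, to_tc) pairs."""
    tp = fn = fp = tn = 0
    for severe, to_tc in cases:
        if severe and to_tc:
            tp += 1  # severely injured, transported to a TC
        elif severe:
            fn += 1  # severely injured, transported to a non-TC (undertriage)
        elif to_tc:
            fp += 1  # minor injuries, transported to a TC (overtriage)
        else:
            tn += 1  # minor injuries, transported to a non-TC
    undertriage = fn / (tp + fn)
    overtriage = fp / (fp + tn)
    precision = tp / (tp + fp)
    return undertriage, overtriage, precision


# Invented example: 10 severely injured patients and 40 with minor injuries.
cases = (
    [(True, True)] * 6 + [(True, False)] * 4
    + [(False, True)] * 18 + [(False, False)] * 22
)
undertriage, overtriage, precision = triage_rates(cases)
```

For the clinical outcome, the second element of each pair would be the registered destination; for a model, it would be the destination implied by the predicted severity class.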

Missing data
The high proportion of missing values in trauma registries [21] requires careful consideration to attain a dataset that both represents the population and is sufficiently large for model development. Four main approaches are used for handling missing data in trauma registry-based studies: complete case (CC) analysis, subgroup analysis of unknown, multiple imputation (MI) or a combination of CC and MI [57]. The key to selecting a suitable method is a realistic assumption about the missing mechanism. In the case of trauma, missing completely at random is generally not a valid assumption [57,58]. In addition, the missing mechanism in trauma data may vary across variables and registering units; it has therefore been suggested that a more realistic assumption is missing at random (MAR) or a combination of MAR and missing not at random [59].
To our knowledge, the missing mechanisms in SweTrau remain unstudied. We therefore included different approaches to enable a comparison of model performance. From the raw data, instances with missing values in administrative and outcome variables (patient ID, date-time variable, ISS, NISS, 30-day mortality, hospital) were removed. Next, four datasets were generated: one based on CC analysis, and three based on different imputation techniques.
In dataset A, CC analysis was applied by examination of different thresholds of missing data in predictors and the effect on data size after listwise deletion. The thresholds ranged from 0 to 100% in steps of 5%, resulting in 21 data points. For each threshold, variables with a larger proportion of missing data were removed and listwise deletion was applied on the remaining predictors. The number of registrations left after the listwise deletion together with the number of remaining predictors were compared to find a threshold that enabled most predictors and instances to be included in dataset A. With this approach, the threshold of acceptable level of missing data was set to 15%.
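The threshold examination for dataset A can be sketched in Python as follows. The sketch assumes that registrations are dictionaries in which `None` marks a missing value; the predictor names and toy rows are invented, and the final choice of threshold (a judgment based on the resulting trade-off) is not automated here.

```python
def complete_case_sweep(rows, predictors, step=5):
    """For each missing-data threshold, count kept predictors and complete rows.

    Returns a list of (threshold_percent, n_predictors, n_rows) tuples.
    """
    n = len(rows)
    missing_pct = {
        p: 100 * sum(r[p] is None for r in rows) / n for p in predictors
    }
    results = []
    for threshold in range(0, 101, step):
        # Remove predictors with a larger share of missing data than allowed.
        kept = [p for p in predictors if missing_pct[p] <= threshold]
        # Listwise deletion on the remaining predictors.
        complete = [r for r in rows if all(r[p] is not None for p in kept)]
        results.append((threshold, len(kept), len(complete)))
    return results


rows = [
    {"age": 34, "sbp": 120, "rr": None},
    {"age": 71, "sbp": None, "rr": None},
    {"age": 52, "sbp": 95, "rr": 18},
    {"age": 45, "sbp": 130, "rr": None},
]
sweep = complete_case_sweep(rows, ["age", "sbp", "rr"])
```

Each tuple shows the trade-off described above: a higher threshold retains more predictors but leaves fewer complete registrations.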
Datasets B and C were generated using different imputation techniques on the predictors in dataset A. In dataset B, missing data (missing and unknown) represented a new level in each predictor.In dataset C, a single imputation technique was used to substitute missing values in prehospital predictors representing SBP, GCS and RR with corresponding in-hospital values.
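The two single-value strategies can be illustrated with a minimal Python sketch. The codes treated as missing, the label of the new level and the helper names are assumptions made for illustration only.

```python
MISSING_CODES = {None, "unknown"}  # assumed representations of missing data


def as_missing_level(value):
    """Dataset B idea: missing or unknown becomes its own category level."""
    return "missing" if value in MISSING_CODES else value


def fill_from_in_hospital(prehospital, in_hospital):
    """Dataset C idea: use the in-hospital value when the prehospital one is missing."""
    return in_hospital if prehospital is None else prehospital


level_b = as_missing_level("unknown")     # treated as a separate level
sbp_c = fill_from_in_hospital(None, 110)  # in-hospital SBP substitutes
sbp_kept = fill_from_in_hospital(95, 110)  # prehospital SBP is retained
```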
In dataset D, MI was used as it is recommended in the case of MAR [21,60] and has shown added value in analysis for both prehospital and in-hospital trauma data [59,61,62]. More specifically, MI by chained equations (MICE) was applied since it is recommended for non-monotonic missing data [60] and has been used in trauma registry-based studies [63,64]. Five datasets were imputed where the final set of predictors and the primary outcome (NISS > 15) were used to predict values for the missing locations. The predictors and outcome were in raw format to reduce risk of information loss during imputation. Different imputation methods were applied depending on data type, where numeric data was imputed using predictive mean matching, binary data imputed using a logistic regression model, nominal data imputed using a multinomial logit model, and ordinal data imputed using an ordered logit model [65]. A roman visit schedule and 20 iterations were applied during each imputation procedure.

Statistical analysis methods
The raw data were used to generate variables of interest and then the exclusion criteria were applied. Next, the described techniques for handling missing data were used to create datasets A-D. The univariate and multivariate tests for selection of predictors were performed on dataset A (CC analysis). Next, the final set of predictors was one-hot encoded to enable numeric input to the machine learning models, and a reference level was selected for each predictor to avoid multicollinearity. Model assessment [52] was performed with a stratified 10-fold cross-validation [66] on datasets A-D. Model evaluation metrics were selected to capture the performance in relation to clinical practice and imbalanced data [67] and included the following: under- and overtriage, accuracy, F-measure with β = 1 (F1), ROC curves with the True Positive Rate (TPR) versus False Positive Rate (FPR), AUC, Precision-Recall (ROCPR) curves with precision versus recall/sensitivity, and AUCPR. The cross-validated ROC curves, ROCPR curves and F1 scores were based on concatenated counts, i.e., true positives, false positives, false negatives and true negatives combined across the ten folds, while the accuracy, under- and overtriage, AUC and AUCPR were averaged across the folds [68]. Note that the TPR (recall) can be interpreted as 1 − undertriage, the FPR as overtriage, and precision as the proportion of patients transported to a TC who needed TC care. The same evaluation metrics were applied on dataset D and were based on the concatenated data from across the folds for each of the imputed datasets (D1-D5).
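The two ways of aggregating across folds can be made concrete with a Python sketch. It assumes binary labels and a simple round-robin assignment within each class to obtain stratified folds; the helper names and the per-fold confusion counts are invented, and the study itself may have relied on library routines for these steps.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample to one of k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            fold_of[i] = position % k  # deal samples out round-robin
    return fold_of


def summarize(fold_counts):
    """fold_counts: one (tp, fp, fn, tn) tuple per fold.

    Accuracy is averaged across folds; F1 uses counts pooled over all folds.
    """
    accuracies = [(tp + tn) / (tp + fp + fn + tn) for tp, fp, fn, tn in fold_counts]
    mean_accuracy = sum(accuracies) / len(accuracies)
    tp, fp, fn, _ = (sum(column) for column in zip(*fold_counts))
    pooled_f1 = 2 * tp / (2 * tp + fp + fn)
    return mean_accuracy, pooled_f1


labels = [1] * 20 + [0] * 80  # 20% severely injured, as a toy example
folds = stratified_folds(labels, k=10)
mean_accuracy, pooled_f1 = summarize([(8, 4, 2, 36), (6, 6, 4, 34)])
```

Pooling the confusion counts gives every registration equal weight, whereas averaging per-fold values gives every fold equal weight; with equally sized stratified folds the two differ mainly for ratio metrics such as F1.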

Hold-out analysis
In addition to 10-fold cross-validation, a hold-out analysis was performed on dataset A (CC) to evaluate the impact on model performance. The SweTrau data included registrations from 2020, the first year of the COVID-19 pandemic, which may have affected the characteristics of injuries [69]. Two cases were tested. In case 1, models were trained on data from 2013-2019 and evaluated on data from 2020. In case 2, models were trained on data from 2013-2015 and 2017-2020 and evaluated on data from 2016. The same dependent variable (NISS > 15), set of predictors and evaluation metrics as for the 10-fold cross-validation were used.
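The two hold-out cases amount to splitting the registrations by year, as in the following Python sketch. The `year` key and the toy rows are hypothetical.

```python
def split_by_year(rows, test_year):
    """Hold out all registrations from test_year; train on every other year."""
    train = [r for r in rows if r["year"] != test_year]
    test = [r for r in rows if r["year"] == test_year]
    return train, test


# One toy registration per year in the study period 2013-2020.
rows = [{"year": year} for year in range(2013, 2021)]
case1_train, case1_test = split_by_year(rows, 2020)  # evaluate on 2020
case2_train, case2_test = split_by_year(rows, 2016)  # evaluate on 2016
```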

Software
The analysis was executed in Python version 3.  S1 in Additional file 1).

Ethical considerations
The study was accepted by the Swedish Ethical Review Authority on the 10th of February 2021 (reference number 2020-06899) and conducted in agreement with the ethical references of the Swedish Research Council. All registry data were pseudonymized and the dataset did not contain any personal data. SweTrau data used in this study cannot be made publicly available.

Participants
There were 75,602 registrations during the period 2013-2020, and the distribution of trauma incidents with respect to year is presented in Table 1. After applying the eligibility criteria, 47,357 (62.6%) registrations remained. The patient selection process is displayed in Fig. 1.

Model development
The threshold of acceptable level of missing data in each variable for construction of Dataset A resulted in a set of possible predictors including gender, age, prehospital GCS and mGCS, prehospital SBP, prehospital RR, prehospital cardiac arrest, prehospital airway management, season of trauma, weekday of trauma, time of trauma, dominating type of injury, mechanism of injury, intention of injury, response time and all AIS regions. In-hospital RR predictors were removed due to a larger amount of missing data than the selected threshold and were therefore not used to substitute missing values in prehospital counterparts in Dataset C.
The univariate and multivariate tests resulted in statistically significant results for different variables. From the univariate analysis, gender, age, airway management, prehospital GCS and mGCS, prehospital SBP, prehospital RR, cardiac arrest, season of year, dominating type of injury, mechanism of injury, response time, and all AIS regions were significant. The mGCS had a lower p-value compared to the GCS and was therefore kept as predictor. From the multivariate analysis, gender, age, mGCS, SBP, RR, injury mechanism, intention of injury and all AIS regions except external had statistically significant coefficients. The coefficients from the multivariate analysis are presented in an additional file (see Table S2 in Additional file 2). The variables that did not achieve statistical significance in either of the two tests were weekday and time of trauma, which were therefore excluded as predictors. Descriptive statistics of the final set of predictors are presented in Table 2 for the excluded data and datasets A-D.

Model performance
Cross-validated ROC and ROCPR curves for all models are visualized in Fig. 2 for each dataset and evaluation metrics are summarized in Table 3. Mapping under- and overtriage to the ROC curves as described earlier, the clinically recommended overtriage of 25-35% corresponded to an undertriage between 8-25% depending on model and dataset, to be compared with the clinical recommendation of less than 5%. In the clinical outcome of field triage in SweTrau for the selected time period, undertriage was about 40.4% and overtriage about 45.8%. At a corresponding level of overtriage, the cross-validated OSISP models had an undertriage between 4.1-12.4%.
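Reading an operating point off a ROC curve, as done above, can be sketched in Python as follows. The sketch assumes that a model outputs a risk score per patient and that a patient is sent to a TC when the score reaches a threshold; the function name, scores and labels are invented for illustration.

```python
def undertriage_at_overtriage(severe, scores, max_overtriage):
    """Lowest undertriage among thresholds whose overtriage stays within the limit."""
    n_severe = sum(severe)
    n_minor = len(severe) - n_severe
    best = 1.0
    for threshold in sorted(set(scores)):
        to_tc = [score >= threshold for score in scores]
        false_pos = sum(1 for y, t in zip(severe, to_tc) if t and not y)
        false_neg = sum(1 for y, t in zip(severe, to_tc) if y and not t)
        if false_pos / n_minor <= max_overtriage:  # overtriage = FPR
            best = min(best, false_neg / n_severe)  # undertriage = 1 - TPR
    return best


# Toy data: 4 severely injured patients followed by 8 with minor injuries.
severe = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.5, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05]
result = undertriage_at_overtriage(severe, scores, max_overtriage=0.35)
```

With these toy data, accepting at most 35% overtriage leaves one of the four severely injured patients below the threshold, i.e., an undertriage of 25%.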
To the authors' knowledge, there is no clinical recommendation for precision. The clinical outcome resulted in a precision equal to 20.9% at a recall level of 59.6%. At a corresponding level of recall, the OSISP models had a precision between 41.1-56.4%. Undertriage, overtriage and precision for selected points on the ROC curve according to recommended levels by ACS-COT [6] and an additional point with low undertriage (1%) are presented in Table 4.
The ROC curves showed similar performance between the models, where LR, XGBoost and ANN yielded best performance. The ROCPR curves demonstrated better performance than baseline (prevalence in dataset), with LR, XGBoost, SVM and ANN performing similarly, whereas RF yielded the lowest accuracy. Comparison of the ROC and ROCPR curves showed a noisier behavior in the latter in case of low recall.
From Table 3, SVM achieved the highest accuracy while XGBoost performed best in terms of AUC and AUCPR across all datasets. The differences between models were nonetheless minor. Inconclusive results were indicated for the concatenated F1 score as no model performed best across all datasets. ROC and ROCPR curves for each of ISS > 12, ISS > 15 and NISS > 12 as definitions for a severely injured patient performed similarly and AUC and AUCPR are presented in an additional file (see Table S3 in Additional file 3). Removal of AIS regions as predictors resulted in a decline in performance, with average AUC and AUCPR values across the models between 0.57-0.74 and 0.25-0.40, respectively. Model performance for the hold-out analysis is presented in Table 5.

Table 3 Model performance for predicting the risk of severely injured (NISS > 15). Accuracy, AUC and AUCPR presented as average value and standard deviation across the folds. F1-score presented as concatenated value across all folds. Dataset D presented with an interval of respective value across all folds for the five imputed datasets, with the highest standard deviation.

Table 4 Model performance for different points on the ROC curve when predicting the risk of severely injured (NISS > 15). Values are presented as percentage and a denotes fixed metric values. Dataset D presented with an interval across the five imputed datasets (D1-D5).

Key results
In this study, an OSISP model for adult trauma was developed based on data from SweTrau. Predictors for severe injury were selected based on statistically significant results from univariate and multivariate tests, resulting in 21 included predictors. AIS regions constituted nine of these predictors and seem to be strong predictors. Both ROC and ROCPR curves demonstrated promising performance. Cross-validated evaluation metrics showed similar results across the models and the four different datasets derived from different strategies for handling missing data.

Limitations
There are several limitations connected to the data source. Data points from SweTrau originate from different hospitals, settings and regions. The number of active hospitals connected to the registry varies across the years, which can lead to a biased representation of hospitals with a high level of administrative resources to manage the time-consuming task of registering in quality registries. The eligibility criteria of the registry may disregard some trauma patients cared for by prehospital resources. For example, patients who are declared dead upon arrival at the hospital are not included in the registry. The registration in SweTrau is performed manually by a register nurse at each connected hospital. The data are in different electronic health records and require subjective assessment in some cases. The work requires many resources, e.g. the mean time of registering a patient at Sahlgrenska University Hospital is estimated at about 45 min. There were also some data quality issues, for instance some data falling outside realistic values that had to be discarded.
This study is limited to NISS and ISS as outcome measurements, scales that are similar but differ in which body regions the three most severe injuries of a patient can be located. In future studies, the prediction models' performance could be further compared with injury severity scales calculated differently from ISS and NISS.
In this study, we worked with a binary classification model (not severely injured/severely injured) and binary transport destination (NTC/TC). Multiclass classification might be more suitable depending on the destination definition. For instance, in the definition of TC used in the US, each TC is assigned a rating (I, II, III or IV) depending on the level of resources, where rating I represents the highest level and IV the lowest level [6]. Although there is no unified trauma system in Sweden, a similar ranking approach could possibly be adapted, for instance by assigning highest rank to university hospitals, second rank to county hospitals or trauma receiving hospitals, and third rank to remaining non-trauma receiving hospitals. These ratings could then be used as basis for possible destinations, and future models could potentially categorize what risk interval a patient is in and match it with an appropriately ranked hospital. Alternatively, the care needed could also guide destination selection. For instance, based on the predicted injury severity and locations of injuries a treatment might be recommended and the transport destination could then be based on what hospitals offer that treatment.
The estimated under- and overtriage for the AI models could not take into account the transportation times to the different nearby hospitals. The model performance might therefore be overestimated, as geographical information about nearby hospitals was not accessible and sometimes the transportation time to a TC may be too long to be recommended. Following the same reasoning, the clinical outcome in field triage could be argued to be underestimated as no TCs might have been located within a reasonable time-frame.
This was an exploratory study to evaluate AI-based field triage for the whole adult trauma population group. A more complex approach for optimizing the algorithms used was therefore out of scope but could be considered in future studies. From Tables 3 and 4, the generated datasets seem to have minor impacts on model performance. For dataset A accuracy and precision increase, whereas for dataset D F1-score increases, and for datasets B and D AUC, AUCPR and under- and overtriage increase. The differences are however small and in general the results are in relatively close agreement. An important aspect to consider during model development is data leakage, i.e., when information related to the test set leaks into the training set, which defeats the purpose of having a test dataset as it should consist of unseen data, and might lead to overoptimistic results [70]. Upcoming studies could reduce the risk of data leakage by adjusting the predictor selection. In this study, selection is based on univariate and multivariate statistical tests on Dataset A before applying 10-fold cross validation. Another approach could instead be to first divide the included data into ten folds for cross validation, create datasets A-D within each fold and train the model on the current combination of nine folds. Optimization of the models' hyperparameters may also increase the prediction ability. For instance, selecting a linear kernel for an SVM model in the case of non-linear data will lead to poor performance. Another possibility for optimization is to incorporate techniques developed for imbalanced data [67,71].
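The alternative procedure described above, in which predictor selection is repeated inside each cross-validation fold, can be sketched as follows. This is a minimal illustration under stated assumptions: the selection rule (difference in class means) and the nearest-centroid classifier are hypothetical stand-ins for the statistical tests and models used in the study, and the data are synthetic.

```python
import random


def stratified_folds(y, k, seed=0):
    """Assign each row index to one of k folds, keeping class ratios similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def select_predictors(X, y, n_keep):
    """Hypothetical stand-in for the statistical tests: keep the n_keep
    columns with the largest absolute difference between class means."""
    ranked = []
    for c in range(len(X[0])):
        pos = [row[c] for row, t in zip(X, y) if t == 1]
        neg = [row[c] for row, t in zip(X, y) if t == 0]
        ranked.append((abs(sum(pos) / len(pos) - sum(neg) / len(neg)), c))
    return sorted(c for _, c in sorted(ranked, reverse=True)[:n_keep])


def fit_centroids(X, y):
    """Stand-in model: one mean vector per class."""
    model = {}
    for label in set(y):
        rows = [r for r, t in zip(X, y) if t == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model


def predict_centroid(model, x):
    """Predict the class whose centroid is closest to x."""
    return min(model, key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, model[lab])))


def cross_validate(X, y, k, n_keep, fit, predict):
    """k-fold CV in which predictor selection only sees the training rows."""
    results = []
    for test_idx in stratified_folds(y, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        cols = select_predictors(X_train, y_train, n_keep)  # no test rows used
        model = fit([[row[c] for c in cols] for row in X_train], y_train)
        preds = [predict(model, [X[i][c] for c in cols]) for i in test_idx]
        accuracy = sum(p == y[i] for p, i in zip(preds, test_idx)) / len(test_idx)
        results.append((cols, accuracy))
    return results


# Synthetic data (hypothetical): column 0 is informative, columns 1-4 are noise.
rng = random.Random(1)
y = [i % 2 for i in range(200)]
X = [[2.0 * t + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(4)] for t in y]

results = cross_validate(X, y, k=10, n_keep=2, fit=fit_centroids, predict=predict_centroid)
mean_accuracy = sum(acc for _, acc in results) / len(results)
print(f"mean accuracy over folds: {mean_accuracy:.2f}")
```

The same pattern would apply to imputation: any step that learns from the data, such as creating datasets A-D, would be fitted on the nine training folds and then applied to the held-out fold.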

Interpretation
The results indicate that OSISP has potential to provide effective decision support for EMS clinicians. Injury locations based on AIS coding seem to be strong predictors, also indicated by other studies [20]. There may be some body regions that are stronger predictors, for instance the logistic regression coefficient for the external region was not significant and might be a weaker predictor compared to other body regions. It should be repeated that AIS codes are retrospectively coded at the hospital and not possible to obtain in real-time within the prehospital setting. However, the field triage protocols in some Swedish regions include markings of injury location that can be used to obtain similar information. Data collection of these markings may result in a different model performance as they will be coded in real-time with no time for controlling entered values. Nonetheless, the impact on model performance motivates further studies.
In general, the origin of each variable used to construct a prediction model is important to consider, as is how it fits in the prehospital trauma workflow. If the model is to function dynamically during the entire workflow the predictors must be readily available. One example in SweTrau is the two prehospital variables for SBP. One contains exact measurements; however, it is generally not recommended to perform exact measurements on-site with consideration to time-sensitive conditions [1]. The other contains approximations more in line with general practice, where the most suitable option of obtaining an approximation is chosen. This may on the other hand bias the data as all variable levels are not considered, only the most suitable from an accessibility perspective, leading to difficult decisions within the data analysis and interpretation of variables during model development. Furthermore, there is no time-point in SweTrau for when the prehospital variables were registered, impeding the development of dynamic algorithms.
The model performance for AI based field triaging in this study shows potential in improving precision and motivates further work towards clinical implementation, since early identification and transport of severely injured patients to a TC potentially improves patients' outcome both globally [7,9] and in Sweden [11]. However, when considering how the OSISP algorithm will function in practice, the field triage depends on more factors than injury severity scores. The real-time assessment of the patient being severely injured or not will provide a basis for the decision-making of transport destination, but this will also be influenced by factors like distance to the nearest hospital, the nearest hospital's resources and distance to the TC. Distance to a TC has an impact on patient outcome where mortality is increased with distance [72]. There is a difference between triaging to a TC and bypassing the nearest hospital in a big city compared to the same decision in a rural area where there is a long distance between the hospitals. In Sweden, the likelihood of being transported to a TC is reduced for every kilometer of distance to the center [12]. This does not mean that the patient is triaged incorrectly, as there are large risks in transporting severely injured patients in an ambulance with little opportunity for advanced treatment. For example, according to the authors' experiences, it can sometimes be the right decision to drive to the local hospital with a patient with uncontrolled bleeding to stabilize the circulation with, for example, blood products. Patients with isolated head injuries may also need to be anesthetized and intubated before a longer transport to a TC. However, these are difficult and complex decisions where EMS clinicians need further support. It is possible, for example, that in addition to an AI-based decision support the possibility of consulting the on-call trauma surgeon at the TC via video for support in transport decisions could bring further improvements to the prehospital workflow.
An important aspect when evaluating model performance is potential implications on clinical outcome. Clinical implementation requires determining an optimal point on the ROC curve. This is however not a trivial task as it should be decided based on both a technical and medical perspective. Recommended levels of under- and overtriage from ACS-COT could act as a reference but may be challenging to achieve in practice. Table 4 shows that performance is generally not on par with the recommended levels of 5% undertriage at less than 35% overtriage. Furthermore, the hold-out analysis indicates a time dependent characteristic in the data with an improved performance when excluding data from year 2016 during the training compared to excluding data from 2020 (Table 5). One hypothesis by the authors is that this may be a result of a stricter triaging policy due to a reduction of resources during the pandemic. Another hypothesis may be that COVID-19 led to a change in injury characteristics.
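To make the choice of operating point concrete, the sketch below selects the highest decision threshold that keeps undertriage within a given cap and reports the resulting overtriage. It assumes undertriage is the share of severely injured patients not flagged and overtriage is the share of not severely injured patients flagged, which mirrors the per-group percentages used in Table 1. Function names and data are hypothetical.

```python
def triage_rates(y_true, scores, threshold):
    """Return (undertriage, overtriage) when patients with score >= threshold
    are flagged as severely injured."""
    missed = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    false_alarms = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    n_severe = sum(y_true)
    n_not_severe = len(y_true) - n_severe
    return missed / n_severe, false_alarms / n_not_severe


def operating_point(y_true, scores, max_undertriage=0.05):
    """Return (threshold, undertriage, overtriage) for the highest threshold
    that meets the undertriage cap. Lowering the threshold can only reduce
    undertriage and raise overtriage, so the first hit has the lowest
    overtriage among the admissible thresholds."""
    for threshold in sorted(set(scores), reverse=True):
        under, over = triage_rates(y_true, scores, threshold)
        if under <= max_undertriage:
            return threshold, under, over
    return None  # only reached if there are no scores at all


# Toy data (hypothetical): 4 severely injured and 6 not severely injured.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.5, 0.4, 0.2, 0.1, 0.05]

strict = operating_point(y_true, scores, max_undertriage=0.05)
relaxed = operating_point(y_true, scores, max_undertriage=0.25)
print("5% cap: ", strict)
print("25% cap:", relaxed)
```

In this toy sample, relaxing the undertriage cap lowers overtriage, which is the trade-off that has to be weighed from both the technical and the medical perspective.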
The general reduction of 28% in undertriage compared to the clinical outcome may benefit around 900 patients in the SweTrau dataset used in this study, i.e. about 112 patients per year. This indicates a potential to achieve more equal care, although not all those patients may benefit from transport to a TC, e.g. depending on prolonged transportation time and type of injury. Today, patient assessment and care are influenced by different factors such as socioeconomics, ethnicity, age and gender. Two examples are that people in socioeconomically vulnerable areas more often receive inadequate care [73], and elderly patients with severe trauma are at greater risk of being transported to a hospital with insufficient resources to manage the injuries [9,16]. Because the OSISP algorithm has been developed based on a data-driven approach, such factors are managed during the training of the models and will not influence the prediction during the patient assessment. In addition, a digital tool does not experience the circumstances that EMS clinicians are exposed to, such as stress and tiredness. However, the support will function together with the EMS clinicians and these factors will still need to be considered in terms of how the variables have been measured and entered into the system. Furthermore, with a digital tool there is an opportunity to develop explainable support systems where the classification of a severely injured patient can be displayed to the EMS clinicians in terms of what variables were important for the prediction. This could give the EMS clinician the possibility to evaluate the patient and relate the OSISP recommendation to their clinical experience. For instance, LR may be a preferred model to test in a clinical setting since the models' performances were similar and the LR model's coefficients can be used to derive an explanation of why the patient is predicted to have high or low risk of severe injury. In addition to performance differences, fairness, equality and explainability should also be considered when deciding on which model to develop towards clinical implementation [27].
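As a sketch of how LR coefficients could support such an explanation, the example below splits a predicted risk into per-predictor contributions on the log-odds scale. The predictor names, coefficient values and intercept are purely hypothetical and are not the fitted values from this study.

```python
import math


def explain(coefficients, intercept, patient):
    """Return the predicted risk and the predictors ranked by the absolute
    size of their contribution (coefficient * value) to the log-odds."""
    contributions = {name: coefficients[name] * patient[name] for name in coefficients}
    log_odds = intercept + sum(contributions.values())
    risk = 1.0 / (1.0 + math.exp(-log_odds))
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return risk, ranked


# Hypothetical coefficients for binary predictors (illustration only).
coefficients = {
    "head_injury": 1.2,
    "thorax_injury": 0.9,
    "age_over_65": 0.4,
    "low_systolic_bp": 1.5,
}
intercept = -2.5

# Hypothetical patient: head injury, over 65, no thorax injury, normal SBP.
patient = {"head_injury": 1, "thorax_injury": 0, "age_over_65": 1, "low_systolic_bp": 0}

risk, ranked = explain(coefficients, intercept, patient)
print(f"predicted risk of severe injury: {risk:.2f}")
for name, contribution in ranked:
    print(f"  {name}: {contribution:+.1f} log-odds")
```

A display of this kind would let the EMS clinician see which registered findings drive the prediction and compare them with their own assessment.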
There are some comparative studies that can help indicate whether the models presented here achieve expected performance. Spangler et al. [18] applied machine learning on regional Swedish prehospital data (not limited to trauma), to develop risk scores for three triage related outcomes, achieving AUC values between 0.66-0.89. Kim et al. [19] used adult prehospital trauma data from the US to predict survival and obtained AUC values between 0.71-0.89. van Rein et al. [20] developed an LR model based on regional adult prehospital data from the Netherlands to predict severely injured (ISS > 15), reporting an AUC value of about 0.82 and an undertriage of about 11% at an overtriage of 50%. Previous studies by Candefjord and colleagues [22-24] developed OSISP models for motor vehicle crashes, reaching AUC values of 0.83 for Swedish data and 0.86 for US data, respectively. The models developed in the present study achieve competitive performances in terms of AUC and under- and overtriage. However, direct comparisons are impeded by variations in trauma systems and study designs, i.e., data collection and processing, selected outcomes and development procedures.

Future research
Development of AI models relies heavily on data, where a larger dataset is preferable. This becomes more important when enabling a larger set of predictors as some predictor levels might be rare. To strengthen results where multiple predictors are included, pooling data from different countries should therefore be considered. For instance, there are other trauma registries that base their variables on the proposed variables from the Utstein protocol. Pooling data from such registries could provide several opportunities for future work. One example is a pooling of data from different registries, where the extended dataset could be used to increase the size of the development dataset and/or constitute an internal validation dataset. This may increase the model's ability to generalize. A second possibility is to use the data from one registry for development and internal validation of a prediction model, and use data from the second registry to validate the model.
The SweTrau data do not represent all vital signs documented during the prehospital assessment. For instance, pulse, oxygen saturation and heart rate are commonly measured and have proven to contain important information about a patient's state and could be valuable to include in the decision support. These vitals may be recorded in other registries and a combination of these data could therefore be valuable to increase the data basis for model development.

Conclusions
An OSISP algorithm for trauma related events aimed for prehospital use shows promising results in aiding care givers in distinguishing between severely injured and non-severely injured patients. This could potentially lower undertriage and reduce mortality. Future model optimization is needed to determine the most suitable model. The results warrant further development, implementation studies and clinical studies of AI based tools to complement current tools for prehospital triage.

Fig. 1
Flow-chart of patient selection with the number of severely injured patients (NISS > 15) and field under- and overtriage included for each step. Instances presented as numbers of cases with percentage in parenthesis

Table 1
Distribution of trauma incidents with respect to year in raw and included data, presented as number and percentage. Percentages for 30-day mortality and NISS > 15 are based on the number of registrations per year, percentages for under- and overtriage are based on the number of severely injured and not-severely injured per year. m mortality

Table 2
Descriptive statistics of study population

Table 2 (continued)
Shows selected reference level for each predictor. Instances presented as number of cases with percentage in parenthesis. Dataset D presented with average number of cases and standard deviation across the five imputed datasets (D1-D5), percentage in parenthesis
a,b Denote statistically significant results for univariate and multivariate tests, respectively
c
Bakidou et al. BMC Medical Informatics and Decision Making (2023) 23:206

Table 5
Model performance for predicting the risk of severely injured (NISS > 15) in the hold-out analysis