Basic science and development
Current biotechnologies such as the different -omics analyses generate massive data, the so-called Biomedical Big Data (BBD). The results of all these research efforts can be dispersed over a wide range of locations, sources and specialties. AI based analytical tools are essential for sifting through all this information, detecting patterns, and giving meaning to the outputs of techniques currently used in biomedical research, such as next generation sequencing, microbiome analysis and proteomics. AI based search engines can retrieve and visualize all the available data on a given molecule, thus avoiding duplication of research efforts, and can help to find new uses for existing molecules (repurposing) or to create new molecules with predefined properties [22, 23]. AI can also help to generate new molecules for prespecified tasks, based on structural or functional similarity with other molecules [24]. IBM Watson for Drug Discovery [25] was developed for this purpose and was launched with great ambitions, although many commentators consider its actual real world performance a disappointment [26]. The latest reports on the capacity of DeepMind’s AI to help unravel the three dimensional structure of proteins [27] demonstrate the potential, but also the narrow applicability, of such AI systems: they excel at clearly circumscribed tasks, but transitioning from in silico or in vitro molecular biology to real world clinical applications remains very challenging. Understanding the molecular basis of disease processes can help identify patients in whom certain therapeutic approaches will not work, for example because they lack receptors for the intervention, or because their pathogenetic process runs through different pathways. Many therefore see AI as the ultimate step towards individual patient focused precision medicine. Unfortunately, however, biology is usually more complex than this; in addition, virtually all of these deep learning analyses are based on associations rather than causal links. A thorough in-depth understanding of causal pathway analysis is therefore often essential to translate results to clinical practice [8].
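To make the notion of similarity-based molecule search concrete, the following minimal sketch ranks a small compound library by structural similarity to a query molecule. It assumes the open-source RDKit toolkit; the compounds, fingerprint settings and similarity cut-off are illustrative choices, not a prescribed pipeline.

```python
# Minimal sketch of similarity-based molecule search, as used as a first
# filter in repurposing pipelines. SMILES strings and the 0.3 cut-off
# are illustrative only.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

query = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
library = {
    "salicylic acid": "O=C(O)c1ccccc1O",
    "ibuprofen": "CC(C)Cc1ccc(cc1)C(C)C(=O)O",
}

query_fp = AllChem.GetMorganFingerprintAsBitVect(query, radius=2, nBits=2048)
for name, smiles in library.items():
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
    sim = DataStructs.TanimotoSimilarity(query_fp, fp)
    if sim > 0.3:  # illustrative threshold for "structurally similar"
        print(f"{name}: Tanimoto similarity {sim:.2f}")
```

In a real repurposing setting, such a similarity search would only be the first step, followed by functional annotation and experimental validation.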
Information and evidence gathering
The current rate of publication is such that it substantially outpaces human capacity to read and assimilate all information [28]. Therefore, objective and systematic methods to search, review, and aggregate published studies are a fundamental aspect of evidence building [29, 36]. Many of the tasks involved are extremely time consuming and human resource intensive: setting up and testing a search strategy for each of the different databases (Ovid, Medline, PubMed, CENTRAL…); excluding non-relevant papers based on title and abstract; selecting papers based on the in- and exclusion criteria of the clinical question; extracting data and aggregating them into the different outcomes of interest; and applying risk of bias scores to the individual papers [30]. All these tasks need to be done in duplicate, to avoid errors and bias in interpretation. Moreover, information that is not published in indexed journals, also called grey literature, is difficult and time-consuming to find, yet neglecting this literature can lead to erroneous conclusions [31]. As a result of their laborious character, systematic reviews are sometimes already outdated at the time of publication. However, as these tasks are repetitive in nature, they are in principle ideally suited for automation. Most of these tasks are classification problems which can be addressed by AI based on natural language processing (NLP). Search engines can help plough through the world wide web to find grey literature information in abstracts and conference proceedings, and even in the data warehouses of administrative organizations such as the Food and Drug Administration (FDA) or the European Medicines Agency (EMA).
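As an illustration of title and abstract screening cast as a classification problem, the sketch below trains a simple text classifier on abstracts already labelled by human reviewers and uses it to rank unscreened records. It assumes scikit-learn; the toy records and labels are purely illustrative.

```python
# Minimal sketch of AI-assisted title/abstract screening as a text
# classification problem. Toy data, not from any real review.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# abstracts already labelled by human reviewers (1 = include, 0 = exclude)
train_texts = [
    "randomized trial of early dialysis initiation in acute kidney injury",
    "case report of a rare dermatological finding",
    "observational cohort on timing of renal replacement therapy",
    "editorial on conference logistics",
]
train_labels = [1, 0, 1, 0]

screener = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         LogisticRegression())
screener.fit(train_texts, train_labels)

# rank new, unscreened abstracts by predicted inclusion probability
new_texts = ["pilot RCT of delayed versus immediate dialysis start"]
print(screener.predict_proba(new_texts)[:, 1])
```

In practice such a model is used to prioritize records for human screening rather than to replace the duplicate human judgement described above.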
AI systems have been developed for each of these individual tasks or for combinations thereof. SWIFT-review [32], for example, can be used during the scoping phase to help formulate questions and identify whether they can reasonably be answered based on the available evidence. It can also be used to assist in updating existing systematic reviews. For example, European Renal Best Practice (ERBP) used the Early Review Organization System technology to facilitate and document the first phases of screening and selection of papers in the context of an international multidisciplinary team [36]. EPPI-reviewer [33], used by Cochrane and NICE, not only provides tools to retrieve, select and document papers based on self-learning algorithms, it can also perform automated text coding and data extraction from full papers. As the use of such AI based systems could speed up the turn-around time of systematic reviews, substantially more up-to-date systematic reviews could be undertaken. Furthermore, more databases, including those of the FDA and EMA and grey literature, could be explored for information, resulting in more in-depth analysis while at the same time reducing the impact of publication bias. The latter could easily be reduced even further by matching published work with pre-registered protocols, again a task for which AI is ideally suited [34]. Furthermore, once set up and trained, the AI algorithm could be run on a regular basis and update the evidence as new information emerges over time (a so-called ‘living’ systematic review).
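The retrieval step of such a ‘living’ systematic review could look like the sketch below, which polls PubMed through the public NCBI E-utilities for records added since the last run and hands them to the screening model. The query string and schedule are illustrative assumptions.

```python
# Sketch of the retrieval step of a 'living' systematic review: poll
# PubMed on a schedule and queue anything new for (automated) screening.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def new_pmids(query: str, days_back: int = 7) -> list[str]:
    params = {
        "db": "pubmed",
        "term": query,
        "reldate": days_back,   # only records added in the last N days
        "datetype": "edat",     # filter on Entrez (entry) date
        "retmode": "json",
        "retmax": 200,
    }
    resp = requests.get(EUTILS, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]

# run e.g. weekly from a scheduler; feed the PMIDs to the screening model
pmids = new_pmids('"acute kidney injury"[MeSH] AND dialysis')
print(f"{len(pmids)} new records to screen")
```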
However, some hurdles remain to be overcome before these AI based systems can gain widespread implementation for systematic evidence review. Crucially, these systems will need to provide evidence of non-inferiority in comparison with human hand-searched systematic reviews [35], so that all stakeholders involved can be confident that the same high standards apply to both. Although evidence review is in principle a linear process in time, different teams take different approaches to the different subtasks and their timing in the workflow. The different AI systems currently available [30] also mostly work as separate devices on subtasks of the evidence review process, so they need to be integrated into the workflow of the team. AI can substantially alter the accepted line of thought here. For example, all systematic review teams agree that framing the question (the PICO framework [36]) should be the first step, before any search or data extraction is done. Using AI, however, it would be possible to perform the data extraction automatically at the moment a paper is published, and to store the data in a knowledge repository for later analysis when the question arises. The methodological, epistemological and other potential sources of bias of such an approach remain to be investigated. Clearly, in order to be generalizable, an AI system should ideally automate all the tasks from the formulation of a question to the presentation of results.
Generation of knowledge from routinely collected data
Current paradigms place randomised controlled trials (RCTs) at the top of the hierarchy of evidence from which to derive information. However, RCTs have several characteristics that make them less suited as sources of evidence under certain conditions. First, inherent to the requirement for strict in- and exclusion criteria, RCTs typically investigate single, well defined and isolated interventions in very specific subgroups of the population. Whereas this is a strong advantage with regard to the deduction of causal relations between intervention and observed effect, it also has a downside, as it hampers the external validity of RCTs [37, 38]. In an era where patients increasingly have multiple underlying comorbidities rather than one single disease, this is a serious obstacle. Moreover, for rare diseases with treatments tailored specifically to particular individuals, there might simply not be enough comparable patients to set up a sufficiently powered RCT.
Second, RCTs are very expensive, hence evidence on the efficacy and safety of products with limited commercial interest is often lacking. Some commentators suggest the use of pragmatic RCTs, in which in- and exclusion criteria are broader and administrative regulation is less strict, thus reducing costs. In most settings, such pragmatic trials are de facto a hybrid between a genuine RCT and a prospective, well designed observational trial in which part of the population is randomized [39].
Third, RCTs do not solve the problem of what to do now, as their results only become available after a significant delay. For example, the IDEAL trial on the timing of the start of dialysis in patients with chronic kidney disease took 10 years from the start of the study to the publication of the results [40]. In addition, it appeared that, due to differences in the interpretation of the criteria to start, the trial actually answered a question different from the one it was originally randomized for [41]. In the field of cardiology, the question of thrombus aspiration during percutaneous coronary intervention was resolved more rapidly by a large registry trial [39] than by the classical RCT [42], even though the budget for the former was 30 times lower. In this era of rapid evolution and development of new innovative interventions, the effect size of an intervention thus cannot be assessed in RCTs in a timely manner, as by the time the RCT is finished, new interventions have become available [43]. Due to all these factors, knowledge generation in medicine would be incomplete and biased if it relied only on RCTs. The current system is thus suboptimal from an epistemological, ethical and regulatory perspective, and there is an urgent need for complementary ways to generate evidence next to RCTs [21].
Besides their use for benchmarking and demonstrating the value of an intervention in real world conditions (see below), routinely collected data can potentially also be used to emulate randomized clinical trials, using an approach known as counterfactual prediction. In this way, routinely collected data can become part of an additional or complementary methodology to RCTs [21]. Admittedly, existing observational datasets frequently do not contain the necessary granularity to make an exact emulation possible, but for about three quarters of trials sufficient data are available to do so reasonably well [14]. First, using this technique, data from an existing randomized controlled trial can potentially be generalized to a broader population. In this way, the validity of the effect size beyond the original trial population defined by specific in- and exclusion criteria can be established (Fig. 2). The technique can also be used to explore transportability [44], i.e. the validity of effect sizes in a population that substantially differs from the trial population [45]. Both conditions are becoming increasingly common, as the number of patients with more than one comorbidity is rapidly increasing.
Second, the approach of counterfactual prediction potentially enables the simulation of an as yet non-existing randomized clinical trial based on routinely collected data [19]. If such a simulation were successful, the need for an expensive RCT could potentially be avoided, and the desired evidence might be obtained more rapidly [39, 42]. This is especially relevant in settings where an RCT would be considered unethical or is simply not feasible.
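A minimal sketch of what such an emulation could look like is given below, using inverse probability of treatment weighting to mimic randomization. The dataset, column names and confounder set are hypothetical; a real emulation additionally requires explicit eligibility criteria, a well-defined time zero, and the untestable assumption that all relevant confounders are measured.

```python
# Minimal sketch of emulating a two-arm trial from routinely collected
# data via inverse probability of treatment weighting (IPW).
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("routine_care_cohort.csv")        # hypothetical extract
confounders = ["age", "egfr", "diabetes", "sbp"]   # hypothetical set

# model the probability of receiving the intervention (propensity score)
ps = LogisticRegression().fit(df[confounders], df["treated"]) \
                         .predict_proba(df[confounders])[:, 1]

# stabilized inverse probability of treatment weights
p = df["treated"].mean()
df["w"] = df["treated"] * p / ps + (1 - df["treated"]) * (1 - p) / (1 - ps)

# weighted outcome contrast emulates the trial's intention-to-treat effect
def weighted_mean(group: pd.DataFrame) -> float:
    return (group["outcome"] * group["w"]).sum() / group["w"].sum()

effect = (weighted_mean(df[df["treated"] == 1])
          - weighted_mean(df[df["treated"] == 0]))
print(f"weighted risk difference: {effect:.3f}")
```

The weighting step is what distinguishes this from a naive comparison of treated and untreated patients: within the limits of the measured confounders, it balances the two groups as randomization would.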
Third, the number of potential interventions for any particular condition is growing rapidly, and often these different interventions can be administered in a certain order. In HIV or cancer treatment, for example, the succession of different drugs or even types of interventions (radiotherapy, chemotherapy, surgery) will need to be evaluated, as these diseases tend to become chronic conditions requiring different treatments at different points in time (first line vs second line, attack vs maintenance, etc.). Also in renal replacement therapy, it is unclear which succession of available treatments (hemodialysis, peritoneal dialysis, transplantation) yields the best outcomes over the life span of the patient [46]. It is nearly impossible to explore all potential combinations and successions of these treatments through RCTs. The optimal timing of an intervention, such as starting renal replacement therapy for acute kidney injury, can also be difficult to explore in RCTs. Due to the necessarily strict definitions used in RCTs for deciding the timing of “early” versus “late” start of renal replacement, the available RCTs de facto compare different strategies, which explains their differences in outcome. It is clear, however, that not all potential definitions of “early” and “late” start can be explored in RCTs.
Such dynamic problems could potentially be explored based on routinely collected data, provided that appropriate analytical techniques are used [20]. Emulating RCTs based on routinely collected data could potentially generate evidence in these circumstances, with the advantage that the evidence can be updated as new cases come in. Linkage of routinely collected data with RCT data may also leverage RCT findings: it can increase statistical power, allowing RCTs to be stopped earlier [47]; it can help identify relevant effect modifiers, so that treatments can be tailored based on certain markers [48]; and it can enable earlier validation of surrogate endpoints [49].
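The sketch below illustrates, in deliberately crude form, how routinely collected data could be used to scan many candidate definitions of an “early” start that no single RCT could cover. The dataset, column names and thresholds are hypothetical, and the unadjusted contrast shown here would in practice have to be replaced by the full emulation machinery sketched earlier.

```python
# Sketch: scanning candidate "early start" definitions in routine data.
# Crude, unadjusted contrasts shown only to illustrate the mechanics.
import pandas as pd

df = pd.read_csv("aki_cohort.csv")  # hypothetical extract

# candidate definitions: renal replacement started within N hours of AKI stage 3
for hours in (12, 24, 48, 72):
    early = df["hours_to_rrt_start"] <= hours
    contrast = (df.loc[early, "mortality_90d"].mean()
                - df.loc[~early, "mortality_90d"].mean())
    print(f"early = start <= {hours}h: "
          f"crude 90-day mortality difference {contrast:+.3f}")
```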
Shared decision making: explaining evidence to patients
Shared decision-making (SDM) is increasingly advocated as the preferred conceptual framework for decisions at the individual patient level [50]. The three pillars involved in this decision process are the evidence base, the clinical expertise of the healthcare worker, and the preferences and values of the patient. The healthcare worker tries to inform the patient about the probability that intervention A will result in the desired outcome X, as verbalized by the patient, rather than an undesired outcome Y, taking the patient's situation into account. However, it is essential that the patient and physician can obtain, understand and correctly interpret the information provided. As both patients [51] and physicians [52] often lack basic statistical literacy, information needs to be presented in a way that helps them gain insight into the data and their meaning in a simple, informative and straightforward way [53, 54].
It is therefore crucial to develop easily understandable presentation paradigms that allow for the integration of all available evidence with the specific condition of the individual patient, helping her to make a decision that leads to a result as close as possible to her preferences and values. A healthcare worker has an obligation to first elicit the true values and life goals of the patient before considering a treatment. No interventions should be undertaken to achieve outcomes that are of no value to a given patient, for example interventions that are only intended to optimize surrogate outcomes that are irrelevant to the patient. The proposed treatment should be the one with the highest probability of achieving these values and life goals. Situations may of course arise in which the values of the healthcare worker and those of the patient differ and where, from the perspective of the healthcare worker, a suboptimal decision is made. The shared decision making process is akin to a balanced negotiation in which both parties try to achieve the best decision. The stronger the evidence (for example, several large randomized controlled trials with similar results) and the more important the outcome (for example, an important improvement in survival), the greater the effort the healthcare worker must make to convince the patient. Graphical representations of the projected outcome of different treatment alternatives can be constructed in real time by algorithms based on routinely collected data from patients with features similar to the target patient, but treated with the different alternative treatments available [17]. Such real time visualizations of the different options can be used to help achieve the goal of genuinely shared decision making [17, 53].
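A minimal sketch of such a real-time visual aid is shown below: survival curves of registry patients similar to the index patient, split by the treatment they actually received. The data source and column names are hypothetical; lifelines is an open-source survival analysis library assumed here for illustration.

```python
# Sketch of a visual aid for shared decision making: Kaplan-Meier
# survival curves of similar registry patients, per treatment received.
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

similar = pd.read_csv("matched_registry_patients.csv")  # hypothetical

ax = plt.subplot(111)
for treatment, group in similar.groupby("treatment"):
    KaplanMeierFitter(label=treatment).fit(
        group["followup_years"], group["died"]).plot_survival_function(ax=ax)
ax.set_xlabel("Years since decision")
ax.set_ylabel("Probability of survival")
plt.show()
```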
The need for core outcome sets
If we intend to use aggregated evidence from randomized trials and observational studies to assess outcomes that are relevant for patients, it is essential to create standardized core outcome sets [15]. A core outcome set is a compilation of well-defined outcome domains relevant to patients, with a unified, well circumscribed definition of the measure used to evaluate each outcome domain and of the desired way to report it. The unique definition is essential to allow for the aggregation of data across studies and to ensure that each study reports on the same construct of the outcome. Currently, many outcomes are ill-defined and have different meanings in different studies, leading to misinterpretation and confusion [55]. Even when there is a unified definition, the interpretation and application of this definition should be as uniform as possible. Differences in what is reported as an outcome substantially limit the progress of knowledge and make the aggregation of evidence difficult if not impossible, as such differences result in comparing apples and oranges. Moreover, using standardized outcomes will decrease research waste, as studies will only investigate those outcomes that are relevant to patients and society. There is a growing understanding of the importance of this problem by scientific organisations [56,57,58]. Some administrative and commercial initiatives have also been launched in this regard, e.g. ICHOM (International Consortium for Health Outcomes Measurement) [59].
Acute kidney injury (AKI) is a clear example. For a long time, a unified definition was lacking, leading to a wide divergence in reported incidence, prevalence and outcomes [10]. Over recent years, a unified definition has been formulated and accepted [60], yet the practical interpretation and implementation of this “unified” definition is still open to interpretation [61], with substantial impact on the reported incidence, prevalence and outcome of AKI [62]. When data are aggregated in big data sets, it is essential that the constructs represented by those data are defined and measured as uniformly as possible. If not, the “tank problem” [63] might arise, i.e. patients are categorized based on criteria that are not linked to their underlying pathology, but to some other, mostly technical aspect. This has been found, for example, with regard to the automated diagnosis of pneumonia, where the diagnosis was strongly influenced by the type of X-ray device used for imaging [64].
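The point that even a “unified” definition leaves room for interpretation can be made concrete in code. The sketch below implements the creatinine arm of the KDIGO AKI definition (a rise of at least 0.3 mg/dl within 48 hours, or a value of at least 1.5 times baseline within 7 days); how the baseline is operationalized is not fixed by the definition, and two common choices can classify the same patient differently. Function and variable names are ours.

```python
# Sketch of one operationalization of the creatinine arm of the KDIGO
# AKI definition. Values in mg/dl; inputs assumed to fall within the
# KDIGO time windows. KDIGO also has urine output criteria, omitted here.
def kdigo_aki(baseline_scr: float, current_scr: float,
              rise_within_48h: float) -> bool:
    return (rise_within_48h >= 0.3                  # absolute rise in 48 h
            or current_scr >= 1.5 * baseline_scr)   # relative rise in 7 d

# The interpretation problem: which baseline? Two common choices can
# classify the same patient differently.
history = [0.8, 0.9, 1.1, 1.3]                # creatinine values, past week
baseline_min = min(history)                   # choice 1: lowest recent value
baseline_mean = sum(history) / len(history)   # choice 2: recent mean
print(kdigo_aki(baseline_min, 1.3, 0.2))      # True  (1.3 >= 1.5 * 0.8)
print(kdigo_aki(baseline_mean, 1.3, 0.2))     # False (1.3 <  1.5 * 1.025)
```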
To be useful in shared decision making, standardized outcomes should also be relevant for all stakeholders [16]. This means that patients should be involved in constructing and selecting the outcome variables of interest, in deciding how they should be measured, and in determining which difference in that outcome is relevant to them [65]. The need for standardized outcomes will be further exacerbated if we start using AI to explore evidence. AI will either use predefined terminology, so that it can search for predefined terms, or natural language processing to extract concepts from existing texts. In both cases, standardization of the outcomes is essential. If AI uses predefined concepts in its search, we need to ensure that these concepts all have the same meaning in the primary sources, for otherwise information will be lost and may even be wrong, as an apple might not always truly be an apple. If we are going to use natural language processing, there are strong reasons to be concerned that AI will struggle to grasp the true meaning of expressions it encounters during text analysis and to place them in a general context, a task that can easily be performed by humans but is very hard for AI.
The use of big data in health economics analysis (affordability and prioritization)
Over the last decades, healthcare expenses have surged exponentially. Although partially explained by the ageing of the population, the bulk of this increase can be attributed to a steep increase in technological interventions, both in terms of availability and accessibility and in terms of the cost of these interventions. There is obviously a limit to the total budget that can be spent on healthcare, which inherently implies that choices need to be made all the time. To ensure that these choices result in justifiable healthcare, a thorough analysis is necessary. First, an assessment of the cost of the intervention in relation to its potential impact needs to be performed. The indicator of “quality adjusted life years gained” (QALY) is most frequently used in this context. The utility of an intervention is based on the available evidence regarding the estimated effect size, with systematic reviews being the ideal instrument to estimate it. As mentioned earlier, big data and AI can be used to help perform such systematic reviews if they are lacking.
Second, the budgetary impact of the intervention needs to be assessed, i.e. the cost of the intervention times the number of persons in society who would potentially benefit from it. Registry data can be used for this assessment. In nephrology, for example, registries could establish the number of people with diabetes mellitus type 2 and micro- or macro-albuminuria, thus also assessing the number of people who could potentially benefit from a drug that retards the progression of kidney disease. Ideally, data on the degree of comorbidity in this population should also be available, to assess the extent to which the available evidence can be generalized to this specific real-world population, in order to estimate the true effect and thus the real expected QALYs, as described above.
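The two assessments described above reduce to simple arithmetic once the inputs are available; the sketch below computes an incremental cost-effectiveness ratio (cost per QALY gained) and a budget impact. All numbers are illustrative.

```python
# Worked sketch of cost per QALY gained (ICER) and budget impact.
effect_new, effect_old = 8.2, 7.5        # expected QALYs per patient
cost_new, cost_old = 45_000, 30_000      # expected cost per patient (EUR)

icer = (cost_new - cost_old) / (effect_new - effect_old)
print(f"ICER: {icer:,.0f} EUR per QALY gained")        # 21,429 EUR/QALY

# budget impact: incremental cost times the eligible population size,
# e.g. T2DM patients with albuminuria as counted in a registry
eligible_patients = 12_000
budget_impact = (cost_new - cost_old) * eligible_patients
print(f"budget impact: {budget_impact:,.0f} EUR")      # 180,000,000 EUR
```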
Evidence from trials and registries could potentially be complemented by real world data from wearables, handheld devices and social media, to help assess the utility of interventions under the everyday living conditions of patients [66]. Health economic analysis requires many more databases, pertaining to people with different backgrounds, than effect size estimation does. For example, in order to estimate costs, not only data on the incidence and prevalence of disease conditions and data regarding costs are needed, but also information on the extent of the associated comorbidities and their distributions.
Given that all of these data need to be integrated, the use of big data for health economics analysis might be jeopardized even more by concerns about data management, about data quality in the data sources, and about the analytical techniques used [67]. Data management issues, such as data storage, computation power, opaque access, integration and linkage of datasets, and ensuring the uniqueness of the definitions used, can be mitigated by creating standardized approaches to the storage and definition of data. Even then, the cost of creating and/or accessing all these datasets might be prohibitive, preventing health economists from accessing and integrating all the databases they would ideally need to feed their models. Most machine learning algorithms are developed for prediction based on associations, for which they perform quite well. However, prediction is quite different from estimating the effect sizes of interventions, where it is essential that the relation between variables and outcome is causal. When applying machine learning in comparative health economics, it is moreover essential that the algorithms can handle the data in a counterfactual way: what would have happened if, instead of intervention A, intervention B had been implemented [68]?
It is an open question whether official bodies should accept observational data, even when “big”, as a substitute for randomized controlled trials [69]. Nevertheless, the US and European drug regulators (cf. the Twenty-first Century Cures Act in the US and the Adaptive Licensing approach in the EU) propose that in some cases Phase III trials could be replaced by post-marketing evidence based on studies of routinely collected data [70]. This paradigm shift not only entails ethical and regulatory challenges, but also substantial methodological challenges, because drawing valid causal conclusions from routinely collected data necessarily relies upon crucial assumptions about the causal structure of the real world beyond what is encoded in the data. It fundamentally changes the assessment of safety and effectiveness from a process with clearly distinct phases into a continuous process in which post-marketing evidence derived from routinely collected data plays an important role [71].
Likewise, for medical devices, the new Medical Device Regulation (MDR) (Regulation (EU) 2017/745) in the EU indicates an evolution towards the increased importance of post-marketing surveillance based on routinely collected data. Increased use of routinely collected data could provide valuable information on safety and effectiveness but the credibility, transparency and enforceability of their role in post-market surveillance should be explored and demonstrated [72,73,74,75]. Various remaining regulatory uncertainties, for example regarding the need to make public whether or not post-approval studies have begun, or the timing of confirmatory clinical trials, have spurred criticisms that such procedures might progressively lead to de-regulation [70, 72, 74].
The use of big data for safety and benchmarking
Safety monitoring and surveillance of outcomes of interventions
According to some commentators, systems based on available routinely collected data can potentially compensate for some of the limitations of monitoring based on spontaneous reporting, which has so far been considered the cornerstone [76], such as the underreporting of non-obvious side effects. Several pilot programs in the US (e.g. OMOP and the Sentinel initiatives), the EU (e.g. EU-ADR and PROTECT) and Asia (e.g. the Asian Pharmacoepidemiology Network, AsPEN) assess the potential of routinely collected data for pharmacovigilance and routine signal detection. However, partly due to limited statistical standards for risk assessment, none of these initiatives has convincingly provided credible or reproducible evidence of unexpected adverse drug reactions or confirmation of known harms [73]. A literature review comparing a broad range of analytical approaches identified traditional pharmaco-epidemiological designs (in particular, self-controlled designs) and sequence symmetry analysis as two of the most promising approaches for signal detection in routinely collected data [76]. An outcome-wide approach [77] to pharmaco-epidemiological designs based on propensity score analysis may considerably reduce modelling and computational demands, thereby increasing their suitability for routine signal detection, with a minimal risk of bias.
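Sequence symmetry analysis, one of the promising approaches mentioned above, is conceptually simple: among patients who have both the drug and the event on record, an excess of drug-first sequences suggests a possible adverse reaction. The minimal sketch below computes the crude sequence ratio; the column names are hypothetical and no correction for prescribing trends is applied.

```python
# Minimal sketch of sequence symmetry analysis for signal detection.
# Crude sequence ratio only; real analyses adjust for prescribing trends.
import pandas as pd

df = pd.read_csv("dispensing_and_diagnoses.csv")  # hypothetical extract
both = df.dropna(subset=["first_drug_date", "first_event_date"])

drug_first = (pd.to_datetime(both["first_drug_date"])
              < pd.to_datetime(both["first_event_date"])).sum()
event_first = len(both) - drug_first

crude_sequence_ratio = drug_first / event_first
print(f"crude sequence ratio: {crude_sequence_ratio:.2f}")  # >1 => signal
```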
Benchmarking and effectiveness
Although RCTs continue to rank high in the pyramid of evidence, they suffer from some inherent problems such as a lack of generalizability and transportability, as discussed above. In addition, for some interventions, the way in which RCT findings are implemented in real clinical practice could create a substantial difference between the effect size in real life and the effect size observed in RCTs, where conditions are mostly optimal. For technical interventions such as surgical operations, catheterisations and diagnostic procedures, the skill and expertise of the operator can have a substantial impact on the final outcome and will determine whether the results can be replicated or not. If an operator is highly skilled and experienced with intervention B, in her hands intervention B could yield better outcomes than intervention A, even if A proved superior in a randomized trial in which it was applied by operators skilled in A.
Currently, the quality of delivered healthcare is mostly the result of teamwork and of a succession of events, from correct referral, through correct diagnostic procedures and their interpretation, to correct identification of and attention to safety, with a culture of avoiding accidents, including basic nursing care. Hence it is not only the individual skill of one operator or one single intervention that determines the final result, but rather the full chain of all processes and people involved in the total process. Even for simple interventions, such as the dosing of dialysis for acute kidney injury, differences in outcomes between RCTs can be explained by differences in overall practice between centres [78]. Typically, in studies done in a single-centre setting, exceptional attention is given to all study participants, as the team believes in the investigated treatment; this is far less the case in multicentric trials. The routine collection of outcome data offers opportunities to evaluate and illustrate the performance of healthcare providers at both the micro level of the individual provider and the meso level of the organization. The technical possibilities of Big Data approaches allow the data needed for such evaluations to be collected from different sources and turned into a meaningful construct. For example, the outcome of a cancer intervention can be assessed by accessing laboratory, pathology or radiology data warehouses to collect data on an individual diagnosed with the cancer, and by linking these with the persons involved in the care as well as with other outcomes, such as mortality, medical costs, social welfare and employment, the need for societal support, and other parameters derived from various other available datasets.
From all these data, algorithms can derive different markers of performance. These can subsequently be used as feedback to the healthcare workers (formative evaluation), or to inform patients about the performance of different healthcare institutions or providers in domains that are of interest and value to them. In this way, Big Data could contribute to value based healthcare [79] and shared decision making. Whereas the technological requirements for assembling such online repositories will probably be resolved in the near future, some more fundamental methodological questions remain to be answered before such systems can be safely and effectively used in clinical practice. The selection of the most relevant constructs to reflect “performance” has been discussed already (cf. supra) and should follow the same procedures as those for establishing standardized core outcome sets. Furthermore, it should be questioned how the feedback to healthcare professionals should be structured and organized in order to achieve a true improvement in the quality of care provided. Currently, most of these systems benchmark against a mean. However, there is evidence that following up the performance of an individual provider over time, or comparison with accepted and established criteria, might be much more effective in inducing a positive change in behaviour [80]. Finally, one should be careful when designing the presentation of data to patients, as they can struggle with interpreting the information offered [17].
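Benchmarking against a mean is typically visualized with a funnel plot, in which centre-level event rates are plotted against volume together with control limits that narrow as volume grows. The sketch below shows the mechanics; the dataset and column names are hypothetical.

```python
# Sketch of provider benchmarking against an overall mean: a funnel plot
# with approximate 95% control limits around the pooled event rate.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

centres = pd.read_csv("centre_outcomes.csv")   # columns: n, events
p_overall = centres["events"].sum() / centres["n"].sum()

n = np.arange(centres["n"].min(), centres["n"].max())
se = np.sqrt(p_overall * (1 - p_overall) / n)  # binomial standard error

plt.plot(n, p_overall + 1.96 * se, "k--")      # upper control limit
plt.plot(n, p_overall - 1.96 * se, "k--")      # lower control limit
plt.axhline(p_overall, color="k")              # pooled mean
plt.scatter(centres["n"], centres["events"] / centres["n"])
plt.xlabel("Number of patients treated")
plt.ylabel("Observed event rate")
plt.show()
```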
Patient reported outcomes/patient reported experiences
Patient perspectives and experiences are attracting increasing interest. Patient-reported outcome measures (PROMs) and patient-reported experience measures (PREMs) are mostly questionnaires that assess patients’ health, health-related quality of life and other health-related constructs. They can be used to evaluate performance or as a benchmark to inform patient choice in healthcare. When used intelligently, however, the information can also be used to discover unmet needs or preferences in the approach and management of certain health conditions or patient groups, assess the effectiveness of different treatment plans, monitor disease progression, stimulate better communication, promote shared decision making, and issue tailored advice and education [81,82,83]. PROMs and PREMs make it possible to visualize the outcomes of interventions as achieved in the real treatment centres available in the patient's region, rather than the results obtained in the highly controlled setting of an RCT.
Patient reported outcomes and experiences are mostly collected as one-off (cross-sectional) assessments, most frequently using pencil and paper, which burdens both patients and staff. As a consequence, surveys are restricted in size, decreasing the relevance and spectrum of the topics explored. The advent of new digital technologies opens the door to more continuous, in-depth and online reporting of symptoms and experiences of patients, in a more feasible, sustainable and cost-effective way [84]. In the simplest format, patients can use a tablet or handheld device to complete questionnaires during waiting times in the hospital. More sophisticated systems allow for continuous reporting of symptoms and outcomes through smartphones or wearables. Some systems rely on algorithms that infer treatment recommendations, or advice to plan earlier consultations, from the data provided [85]. Such systems for digital symptom reporting can have a positive impact on the quality of healthcare, with reduced symptom distress, a lower symptom burden through better self-management, improved health-related quality of life, and a higher quality of interaction with healthcare professionals.
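A minimal sketch of how such an algorithm could turn continuous symptom reports into advice to plan an earlier consultation is given below. The symptom names, 0-10 scales and thresholds are illustrative assumptions, not a validated triage rule.

```python
# Sketch of a rule turning digital symptom reports (0-10 scales) into an
# advice to plan an earlier consultation. Thresholds are illustrative.
RED_FLAGS = {"breathlessness": 7, "pain": 8}
TREND_WINDOW = 3  # number of most recent daily reports to inspect

def needs_earlier_consultation(reports: list[dict]) -> bool:
    latest = reports[-1]
    # rule 1: any single severe score in the latest report
    if any(latest.get(s, 0) >= cutoff for s, cutoff in RED_FLAGS.items()):
        return True
    # rule 2: steadily worsening pain over the trend window
    recent = [r.get("pain", 0) for r in reports[-TREND_WINDOW:]]
    return (len(recent) == TREND_WINDOW
            and recent == sorted(recent)
            and recent[-1] - recent[0] >= 3)

reports = [{"pain": 3}, {"pain": 5}, {"pain": 6, "breathlessness": 2}]
print(needs_earlier_consultation(reports))  # True: pain rose 3 points
```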
Going one step further would be to track patient data continuously, for example by using smartwatches to register heartbeat, or geolocation to assess mobility, activity and independence as surrogates of well-being [66]. Even ‘smart’ pills that monitor adherence to medication intake are possible these days using AI technology [86]. Although attractive at first sight, many unresolved issues remain before such systems can be more widely used [86]. Major points of concern are the safeguarding of privacy and the likely impact on the attitudes of insurance companies and healthcare organizations, as the possibility of eavesdropping on all movements of patients, all of the time, makes it easier for them to distinguish high from low risk patients.