Study design and participants
Our research team used a retrospective cohort drawn from the New York University Langone Health (NYULH) EHR data on COVID-19 patients to derive and validate the RL algorithm. Eligible patients had a positive COVID-19 PCR test and received oxygen therapy in hospital between March 1, 2020 and January 19, 2021. We excluded patients aged below 50 and those who were not hospitalized, as they lacked consistent documentation of vital signs, treatments, and laboratory tests. This study was approved by the NYULH IRB, and the data were de-identified to ensure anonymity.
For each patient, we had access to demographic data (age, sex, race, ethnicity, and smoking status), ICU admission and discharge information, in-hospital living status, comorbidities, treatments, and laboratory test data. The comorbidities, including hyperlipidemia, coronary artery disease, heart failure, hypertension, diabetes, asthma or chronic obstructive pulmonary disease, dementia, and stroke, were defined based on International Classification of Diseases (ICD)-10 diagnosis codes. To reduce the feature dimensionality, we selected 36 laboratory tests based on two criteria: (1) less than 28% missing values; and (2) relevance to COVID-19 as a laboratory test or vital sign. Specifically, we explored the associations between laboratory tests and COVID-19 based on existing literature and clinical findings. For example, recent studies have shown that a reduced estimated glomerular filtration rate (eGFR), low platelet count, low serum calcium level, increased white blood cell count, neutrophil-to-lymphocyte ratio (NLR), and red blood cell distribution width-coefficient of variation (RDW-CV) are related to a high risk of severity and mortality in patients with COVID-19 [17,18,19,20,21]. Additionally, some research suggests that well-controlled blood glucose is associated with lower mortality in COVID-19 patients with type 2 diabetes [22], and that serum potassium levels should be monitored continuously because hypokalemia is common among patients with COVID-19 [23]. Arterial blood gas analysis, including pH, oxyhemoglobin saturation (SaO2), oxygen saturation (SpO2), partial pressure of oxygen (PaO2), and bicarbonate (HCO3), provides commonly used biomarkers for measuring the severity of ARDS [24, 25].
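The missingness criterion above can be sketched as a simple filter. This is a minimal illustration, not the study's actual preprocessing pipeline; `select_features`, the record layout, and the lab names are hypothetical, and `None` stands in for a missing measurement.

```python
def select_features(records, threshold=0.28):
    """Keep columns whose fraction of missing (None) values is below
    `threshold` (the paper's cutoff is 28%). `records` is a list of
    per-patient dicts sharing the same keys."""
    if not records:
        return []
    n = len(records)
    kept = []
    for col in records[0].keys():
        missing = sum(1 for r in records if r.get(col) is None)
        if missing / n < threshold:
            kept.append(col)
    return kept

# Hypothetical example: eGFR is missing in 1/4 records (25% < 28%, kept),
# troponin in 3/4 (75%, dropped).
records = [
    {"eGFR": 90, "troponin": None},
    {"eGFR": None, "troponin": None},
    {"eGFR": 80, "troponin": 5},
    {"eGFR": 70, "troponin": None},
]
```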
In this study, we employed leave-one-hospital-out validation to evaluate model performance. The whole dataset was divided into four batches by hospital; in each run, one batch served as the validation set and the remaining three as the training set.
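The leave-one-hospital-out scheme amounts to grouped cross-validation where the group is the hospital. A minimal sketch (the function name and index-based interface are illustrative, not from the study's code):

```python
def leave_one_hospital_out(patient_hospitals):
    """Yield (held_out_hospital, train_idx, val_idx) triples, holding out
    each hospital in turn. `patient_hospitals` lists each patient's
    hospital identifier, in patient order."""
    for held_out in sorted(set(patient_hospitals)):
        val = [i for i, h in enumerate(patient_hospitals) if h == held_out]
        train = [i for i, h in enumerate(patient_hospitals) if h != held_out]
        yield held_out, train, val
```

With four hospitals this yields four train/validation splits, each validating on patients the model never saw during training.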
RL algorithm overview
We model a patient's health trajectory and the clinical decisions made over the course of an ICU stay as a Markov decision process (MDP) with states, actions, and rewards. The state of a patient includes the observed demographics, vital signs, and laboratory tests at each time \(t\). The action is the oxygen flow rate. After a sequence of actions, the patient receives a reward if he/she survives the next 7 days; otherwise, a penalty is issued for death. The cumulative return is defined as the discounted sum of all rewards each patient receives during the ICU stay. The intrinsic design of RL provides a powerful tool for handling sparse and time-delayed reward signals, which makes it well-suited to overcome the heterogeneity of patient responses to actions and the delayed indications of treatment efficacy [14]. The details of the state, action, and reward are as follows.
-
State \(s_{t}\): observed patient’s characteristics at each time \(t\) with information, including demographics, COVID-19 lab tests, and vital signs.
-
Action \(a_{t}\): oxygen flow rate, ranging from 0 L/min to 60 L/min.
-
Reward \(r_{t}\): the reward of an action at time \(t\) is measured by its associated ultimate health outcome given the patient's health state. Similar to [14], we used in-hospital mortality as the system-defined reward and penalty. When a patient survived, a positive reward (i.e., a `reward' of +15) was released at the end of the patient's trajectory; a negative reward (i.e., a `penalty' of −15) was issued if the patient died. We find that such a reward function propagates the final health outcome backward to each decision and intervention over the period, so that RL can predict long-term effects and dynamically guide the optimal oxygen flow treatment.
-
Discount factor \(\gamma\): determines how the RL agent balances rewards in the distant future relative to those in the immediate future; it can take values between 0 and 1 [16]. Considering that ICU stays tend to be short, and after conducting side experiments, we chose a value of 0.99, which means that we put nearly as much importance on late deaths as on early deaths for each recommended oxygen flow rate.
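The reward scheme and discounted return defined above can be sketched in a few lines. The function names are illustrative; the terminal rewards (+15/−15) and the discount factor (0.99) follow the paper.

```python
def trajectory_rewards(n_steps, survived):
    """Zero reward at every step except the last, which is +15 if the
    patient survived and -15 otherwise (the paper's terminal reward)."""
    return [0.0] * (n_steps - 1) + [15.0 if survived else -15.0]

def discounted_return(rewards, gamma=0.99):
    """Discounted sum of rewards, G = sum_t gamma^t * r_t, computed
    backward through the trajectory."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

Because all intermediate rewards are zero, a surviving patient with a 3-step trajectory receives a return of \(15 \cdot 0.99^{2}\); with \(\gamma = 0.99\) the terminal signal is only mildly attenuated as it propagates back to early decisions.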
The schematic of the proposed scheme with the EHR cohort is shown in Fig. 1. As shown in the bottom part of this diagram, the electronic health data were collected from New York University Langone Health (NYULH) following clinical guidelines. At each time, the oxygen flow rate decision, denoted by \(a\), was chosen based on the current health state, denoted by \(s\), of the patient, and a new health state \(s^{\prime }\) was observed at the next measurement time. We record the tuple \(\left( {s,a,s^{\prime } ,r} \right)\) in the experience replay memory with zero reward \(r = 0\). At the end of the treatment, a positive reward (i.e., a `reward' of +15) was recorded if the patient survived, and a negative reward (i.e., a `penalty' of −15) was issued if the patient died. We then applied deep deterministic policy gradient (DDPG), as shown in the top part of Fig. 1, to learn the optimal decision policy from the experience replay memory. DDPG, composed of actor and critic networks, takes historical samples \(\left( {s,a,s^{{\prime }} ,r} \right)\) from the EHR data to concurrently learn a critic network (Q-function approximation) and an actor network (policy). The critic network, denoted by \(Q^{\pi } (s,a|\theta )\), is a nonlinear function that approximates the Q value function
$$Q^{\pi } \left( {s,a} \right) = E\left[ {\mathop \sum \limits_{t = 0}^{\infty } \gamma^{t} r_{t} \,|\,s_{0} = s,a_{0} = a,\pi } \right]$$
of the action \(a\) (i.e., oxygen flow rate) given a patient’s health state \(s\). The actor network, denoted by \(\pi_{\phi } \left( s \right)\), represents a policy: it proposes an action for each given state through the mapping \(a = \pi_{\phi } \left( s \right)\). The critic loss, defined by the mean squared TD error (see Eq. (5) in Additional file 1), is used to improve the approximation of the Q-function. Based on the expected Q-value computed by the critic network, we use the policy gradient method to optimize the actor network. In sum, DDPG learns a scoring rule (the critic network) that evaluates the performance of a candidate policy, i.e., one that returns an oxygen flow rate given a patient’s health state, and then uses this rule to improve the decision-making policy (the actor network) by optimizing the score. See more details in the Reinforcement Learning Algorithms Section of Additional file 1.
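The critic's regression target in DDPG can be illustrated with a rough sketch. The actor and critic here are trivially small linear stand-ins (the study uses deep neural networks with separate target networks); the weights, function names, and state dimension are all hypothetical.

```python
# Toy linear stand-ins for the actor and critic networks, for illustration
# only; the study trains deep networks on replay-memory samples (s, a, s', r).
def actor(s):
    """pi(s): propose an oxygen flow rate, clipped to [0, 60] L/min."""
    w = [0.5, -0.2]                       # hypothetical actor weights
    a = sum(wi * si for wi, si in zip(w, s))
    return min(max(a, 0.0), 60.0)

def critic(s, a):
    """Q(s, a): score a state-action pair with a linear stand-in."""
    w = [1.0, 0.3, 0.1]                   # hypothetical critic weights
    return sum(wi * xi for wi, xi in zip(w, list(s) + [a]))

def td_target(r, s_next, done, gamma=0.99):
    """DDPG critic regression target: y = r + gamma * Q(s', pi(s')) for
    non-terminal transitions, and y = r at the end of a trajectory
    (where the terminal +15/-15 reward enters)."""
    if done:
        return r
    return r + gamma * critic(s_next, actor(s_next))
```

The critic is then fit by minimizing the mean squared difference between \(Q(s,a)\) and this target over replayed transitions, while the actor is nudged in the direction that increases the critic's score.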
Model evaluation
We evaluated the RL-recommended oxygen therapy by comparing its efficacy with that of the observed therapy on the cohort from each validation hospital. At each decision time, the RL algorithm recommends an oxygen flow rate for the patient. If the absolute difference between the recommended and observed oxygen flow rates is less than 10 L/min, we say that the RL recommendation is “consistent” with the critical care physicians.
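The consistency criterion reduces to a per-decision threshold check. A minimal sketch (the function name and interface are illustrative; the 10 L/min tolerance is the paper's):

```python
def consistency_rate(recommended, observed, tol=10.0):
    """Fraction of decision times at which the RL-recommended oxygen flow
    rate is within `tol` L/min of the physician's observed rate."""
    pairs = list(zip(recommended, observed))
    agree = sum(1 for rec, obs in pairs if abs(rec - obs) < tol)
    return agree / len(pairs)
```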
When the RL recommendation is discrepant with the oxygen flow rate used by physicians, the efficacy of the RL-recommended oxygen therapy is not directly observed. The problem then becomes how to assess the future health outcomes of following RL recommendations. For this reason, we predicted the outcome of the RL-recommended treatment using a Cox proportional hazards model, a regression model commonly used in medical observational studies for investigating the association between patients' survival probability over a period and predictor variables of interest [26, 27]. In short, a patient was labeled as “alive” if he/she survived for seven days after a treatment; otherwise, the patient was labeled as “deceased”. We then fitted a Cox survival model with demographics, vital signs, and lab tests as predictors and evaluated the effect of each decision using leave-one-hospital-out validation.
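The 7-day outcome labeling can be sketched as follows. This is only the labeling step, not the Cox model fit itself; the function name, day-indexed interface, and the convention that `death_day=None` means the patient survived are all assumptions for illustration.

```python
def seven_day_label(treatment_day, death_day=None, horizon=7):
    """Label a treatment decision by the patient's status over the next
    `horizon` days: 'deceased' if death falls in that window, else 'alive'.
    `death_day=None` denotes a patient who survived the hospitalization."""
    if death_day is not None and treatment_day <= death_day < treatment_day + horizon:
        return "deceased"
    return "alive"
```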
To assess the performance of the survival models, we compared the predicted and observed outcomes (7-day living status) using four metrics: similarity, accuracy, the Chi-squared test, and the concordance index. Overall, the cosine similarity between predicted and actual survival is greater than 99.9%, and the concordance index is 0.83. Both metrics indicate that the predictive model can effectively estimate unobserved health outcomes. Moreover, the paired Chi-squared test (p-value < 0.0001) shows no significant difference between true and predicted survival.
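For reference, the concordance index counts, over all comparable patient pairs, how often the patient with the earlier event was assigned the higher predicted risk. A minimal sketch that ignores censoring (a full implementation, e.g. Harrell's C as used in survival packages, restricts comparable pairs under censoring):

```python
def concordance_index(risk, event_time):
    """Fraction of comparable pairs in which the patient with the earlier
    event has the higher predicted risk; ties in risk count as 0.5.
    Simplified: assumes every event is observed (no censoring)."""
    concordant, comparable = 0.0, 0
    n = len(risk)
    for i in range(n):
        for j in range(i + 1, n):
            if event_time[i] == event_time[j]:
                continue                  # tied times are not comparable
            comparable += 1
            # the earlier event should carry the higher risk score
            hi, lo = (i, j) if event_time[i] < event_time[j] else (j, i)
            if risk[hi] > risk[lo]:
                concordant += 1.0
            elif risk[hi] == risk[lo]:
                concordant += 0.5
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported 0.83 indicates strong discrimination.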