### Dataset

This study is set in three adult ICUs at The Ohio State University Wexner Medical Center (OSUWMC). OSUWMC serves as a major referral center for patients from the entire state of Ohio and throughout the Midwest. The ICUs comprise 83 beds in total and admit approximately 3,800 patients annually.

Essentris^{©} is the commercial system used for documentation in all ICUs at the medical center. Patient data entered into Essentris^{©} are eventually transferred to the information warehouse (IW). The IW compiles EHR data from the various electronic records systems throughout OSUWMC, namely, an administrative system (ADT^{©}), a laboratory system, a computerized provider order entry (CPOE) system, and a billing system. Figure 1 shows the study procedure. Specific details will be described in the following sections.

### Data extraction

Patients (age ≥ 18) admitted to the ICUs between January 1, 2007 and December 31, 2010 comprise the sample set. Institutional Review Board (IRB) approval was obtained in 2010 for the data extraction, and data were de-identified and supplied by IW staff. Patients developing a PU are identified by reviewing discharge diagnoses coded with International Classification of Diseases (ICD)-9 codes [25], one of the fields in the EHR system. For instance, if a patient has the ICD-9 code 707.07 (pressure ulcer, heel), the patient is included in the PU group; if a patient has none of the ICD-9 codes indicating PUs, the patient is included in the non-PU group.

### Data cleaning and preparation

Data cleaning and preparation involve several steps. First, patients who present with a PU at the time of admission are excluded, as they may have previously been exposed to unknown risk factors for which we have no data. Patients whose ICU stay is shorter than 36 hours are also excluded, since PUs generally develop after 3 days of admission [26]. Second, if a patient had multiple hospitalizations during the study period, only the first hospitalization record is included; similarly, if a patient had more than one ICU admission during that hospitalization, only the first ICU admission record is analyzed. This is because our objective is to find risk factors for patients experiencing their first incidence of PUs during an ICU stay. This patient selection process is consistent with our previous studies [5, 6].

Medications administered to patients in the PU and non-PU groups during the ICU stay are also listed. The list is reviewed by an interdisciplinary team that includes a registered nurse, two ICU clinical nurse specialists, and a dietician. Through this manual review, medications are grouped into 72 categories based on their perceived functional purposes and efficacies. For instance, Meperidine and Nalbuphine are placed in the Analgesia category, and Dopamine and Epinephrine in the Vasoactive category. Medication categories are coded as dichotomous (binary) variables.

Diagnostic data in the form of five-digit ICD-9 codes are extracted from the EHR system from records within the ICU length of stay. The first three digits indicate the main disease type and the last two provide additional information about the disease. Discharge diagnostic ICD-9 codes are used to identify patients' conditions during ICU hospitalization; the codes are then truncated to their first three digits in order to analyze the primary conditions. Most ICD-9 codes beginning with 707 denote PUs, except for 707.1 (ulcer of lower limb), 707.8 (chronic ulcer of other specified sites), and 707.9 (chronic ulcer of unspecified site); those codes are labeled 707-nonPU.

The Braden scale contains six subscales that measure sensory perception, moisture, activity, mobility, nutrition, and friction & shear. The Braden total score is calculated by summing the subscale scores. We consider each subscale separately to see which subscales are more strongly related to PU incidence in ICU settings. Moreover, most of the subscales have a significant association with PU incidence, in addition to the summed Braden total score. We include these for consistency with previous work [27, 28].
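As a minimal illustration of the summing step above, the total score is simply the sum of the six subscale values. The dictionary keys and example values below are our own, not taken from the study's data:

```python
# Sketch: computing a Braden total score from its six subscales.
# Key names and the example values are illustrative, not the study's schema.
def braden_total(subscales: dict) -> int:
    """Sum the six Braden subscale scores into a total score."""
    required = {"sensory_perception", "moisture", "activity",
                "mobility", "nutrition", "friction_shear"}
    missing = required - subscales.keys()
    if missing:
        raise ValueError(f"missing subscales: {sorted(missing)}")
    return sum(subscales[name] for name in required)

scores = {"sensory_perception": 3, "moisture": 2, "activity": 1,
          "mobility": 2, "nutrition": 3, "friction_shear": 2}
print(braden_total(scores))  # 13
```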

### Variable selection

In order to eliminate ill-defined, non-salient, and “noise” variables irrelevant to PUs, we select a set of medications and diagnoses highly associated with the PU condition. To this end, univariate analysis is first carried out to determine which medication categories and diagnoses (variables) are highly associated with PUs. For each variable, one of two statistical tests is used: because the χ^{2}-test is sensitive to small expected frequencies, it is used only where expected frequencies are large enough (≥ 20); otherwise Fisher’s Exact Test (FET) is applied. During this screening step we do not apply multiple-test compensation, in order to be more inclusive.
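The test-selection rule above can be sketched with SciPy. The 20-count threshold on expected frequencies follows the text; the contingency table in the usage line is invented for illustration:

```python
# Sketch of the univariate screening step: for each binary variable, build a
# 2x2 contingency table against PU status and apply a chi-squared test when
# all expected frequencies are large enough (>= 20, per the text), otherwise
# Fisher's Exact Test.
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

def screen_variable(table, min_expected=20):
    """Return (test_name, p_value) for a 2x2 contingency table."""
    table = np.asarray(table)
    _, p_chi2, _, expected = chi2_contingency(table, correction=False)
    if expected.min() >= min_expected:
        return "chi2", p_chi2
    _, p_fet = fisher_exact(table)
    return "fisher", p_fet

# rows: variable absent/present; columns: non-PU/PU (counts are invented)
print(screen_variable([[400, 50], [100, 30]]))
```

Because one expected cell count in the example table falls below 20, this call falls through to Fisher's Exact Test.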

Medication categories that are significantly associated with PUs are retained. Likewise, diagnoses highly associated with PUs (having what we call a strong “comorbidity association”) are retained. The retained medication categories, retained diagnoses, and all Braden features are used as variables (nodes) for the Bayesian networks.

### Bayesian network modeling

A Bayesian network model is introduced to model clinical data, which is high dimensional in nature and characterized by variables of heterogeneous data types. Bayesian networks are graphs in which nodes represent random variables and the absence of edges represents conditional independence. Formally, Bayesian networks are defined as follows:

Let *U* = {*x*_{1}, …, *x*_{n}}, *n* ≥ 1, be a set of random variables. A Bayesian network *B* over *U* is a network structure *B*_{s} in the form of a directed acyclic graph (DAG) over *U* together with a set of probability assertions *B*_{p} = {*Pr*(*u*|*Pa*(*u*)), *u* ∈ *U*}, where *Pa*(*u*) is the set of parents of *u* in *B*_{s}.

In this work, discrete-valued Bayesian networks are used; probability models are therefore represented with discrete conditional probability tables. Constructing a Bayesian network involves two steps: structure learning and parameter estimation. Structure learning extracts a network structure *B*_{s} from observed data. Parameter estimation then constructs the conditional probability distribution for each node in the network once the structure has been learned.

### Structure learning

Score-based structure learning is a commonly used method to identify a network structure. This approach uses a scoring function that measures how well a model fits the observed data: each candidate network is assigned a score, and the search tries to find the network maximizing that score. Finding an optimal solution is intractable, since the problem has been shown to be NP-hard; therefore, many approximate methods have been proposed.

Greedy hill climbing is one of the simplest and most commonly used search algorithms, and has been observed to achieve results similar to an optimal algorithm (run on small networks of no more than 20 nodes) [29]. It iteratively takes the step that yields the largest improvement in the score until no modification improves it, and can therefore terminate in a local optimum. Repeated hill climbing mitigates this by running the greedy algorithm multiple times and returning the best structure found across the runs.

Tabu search and simulated annealing are two approaches commonly used to explore the region around, and therefore escape, local optima. Tabu search [30] is a variation of greedy hill climbing that keeps a list of length *L* of recently used operations, such as edge addition, deletion, and reversal; at each step, operations in the list are not considered, forcing the search to explore new directions in the search space. Simulated annealing is another hill climbing variation that starts with an initially large “temperature” parameter: while the temperature is large, the algorithm may take steps that decrease the score, and as the algorithm proceeds, the temperature is gradually reduced so that the search increasingly focuses on moves that improve the score.
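The greedy hill climbing loop described above can be sketched as follows. This is a simplified illustration with a pluggable scoring function over the full edge set, not the implementation used in the study; real implementations exploit score decomposability and rescore only the changed family:

```python
# Sketch: greedy hill climbing over DAGs using add/delete/reverse edge moves.
# `score` is any scoring function (e.g., BDeu or MDL) mapping an edge set to
# a number; higher is better. Simplified for illustration.
from itertools import permutations

def is_acyclic(nodes, edges):
    """Depth-first search cycle check over the directed edge set."""
    children = {n: [] for n in nodes}
    for u, v in edges:
        children[u].append(v)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}
    def dfs(u):
        color[u] = GRAY
        for v in children[u]:
            if color[v] == GRAY or (color[v] == WHITE and not dfs(v)):
                return False  # back edge found: graph has a cycle
        color[u] = BLACK
        return True
    return all(dfs(n) for n in nodes if color[n] == WHITE)

def hill_climb(nodes, score):
    """Repeatedly apply the single-edge change that most improves the score."""
    edges = set()
    best = score(edges)
    while True:
        best_cand, best_score = None, best
        for u, v in permutations(nodes, 2):
            if (u, v) in edges:
                candidates = [edges - {(u, v)},                  # delete u->v
                              (edges - {(u, v)}) | {(v, u)}]     # reverse u->v
            else:
                candidates = [edges | {(u, v)}]                  # add u->v
            for cand in candidates:
                if is_acyclic(nodes, cand):
                    s = score(cand)
                    if s > best_score:
                        best_cand, best_score = cand, s
        if best_cand is None:        # local optimum: no move improves the score
            return edges, best
        edges, best = best_cand, best_score

# Toy score whose maximum is a chosen target DAG (purely illustrative)
target = {("A", "B"), ("B", "C")}
print(hill_climb(["A", "B", "C"], lambda e: -len(e ^ target)))
```

Repeated hill climbing would simply call `hill_climb` from several random starting edge sets and keep the best result; tabu search would additionally skip moves stored in a fixed-length list of recent operations.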

### Scoring functions

A scoring function is used with a search algorithm to approximate the probability of each candidate structure given the data *D*. The goal is to find the highest-scoring structure *B*^{*}_{s}, that is:

$$ B_s^{*} = \underset{B_s}{\arg\max}\; Score\left(B_s \mid D\right) $$

### Bayesian scoring function

The premise of the Bayesian scoring function is to compute the posterior distribution of a network given the data; the best network is the one that maximizes this posterior probability. A widely used Bayesian scoring function is the Bayesian Dirichlet with score equivalence and uniform priors (BDeu) proposed by Buntine [31]. BDeu assigns the same score to equivalent network structures and assumes a uniform prior distribution; it therefore requires only one hyper-parameter, the *equivalent sample size*. The BDeu scoring function is defined as follows:

$$ BDeu\left(B_s \mid D\right) = \log\left(Pr\left(B_s\right)\right) + \sum_{i=1}^{n} \sum_{j=1}^{q_i} \left( \log\left(\frac{\Gamma\left(\frac{N'}{q_i}\right)}{\Gamma\left(N_{ij} + \frac{N'}{q_i}\right)}\right) + \sum_{k=1}^{r_i} \log\left(\frac{\Gamma\left(N_{ijk} + \frac{N'}{r_i q_i}\right)}{\Gamma\left(\frac{N'}{r_i q_i}\right)}\right) \right) $$

where *Γ*(·) is the gamma function, *n* is the total number of variables, *r*_{i} is the number of possible values of variable *x*_{i} (e.g., 2 for a binary variable), and *q*_{i} is the number of possible value configurations of *Pa*(*x*_{i}). *N*_{ijk} is the number of records in the data set *D* having variable *x*_{i} in state *k* while *Pa*(*x*_{i}) takes its *j*-th configuration. *N*_{ij} is calculated by summing over all states of variable *x*_{i}: \( N_{ij} = \sum_{k=1}^{r_i} N_{ijk} \). *N*′ is the user-specified equivalent sample size, which expresses how much prior knowledge should be taken into account in the network structure.
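Under the definitions above, the family terms of the BDeu score can be computed with log-gamma functions. This sketch scores a single variable's family from its counts; the structure-prior term log *Pr*(*B*_{s}) would be added separately, and the toy counts are invented:

```python
# Sketch: BDeu contribution of one variable's family, following the formula
# above. ess is the equivalent sample size N', counts[j][k] = N_ijk, with
# q_i parent configurations (rows) and r_i variable states (columns).
from math import lgamma

def bdeu_family(counts, ess):
    """BDeu score term for one variable given its parent-configuration counts."""
    q = len(counts)          # number of parent configurations (q_i)
    r = len(counts[0])       # number of states of the variable (r_i)
    score = 0.0
    for row in counts:       # one term per parent configuration j
        n_ij = sum(row)
        score += lgamma(ess / q) - lgamma(n_ij + ess / q)
        for n_ijk in row:    # one term per state k
            score += lgamma(n_ijk + ess / (r * q)) - lgamma(ess / (r * q))
    return score

# Toy counts: binary variable with one binary parent (q_i = r_i = 2)
print(bdeu_family([[10, 2], [3, 15]], ess=1.0))
```

The returned value is a log marginal likelihood term, so it is always negative; summing such terms over all variables (plus the structure prior) gives the full BDeu score.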

### Information-theoretic scoring function

The premise of this scoring function is a tradeoff between how well the network structure fits the data and how complex the network is. It can be viewed as a log-likelihood (LL) function combined with a penalty factor that addresses the over-fitting problem. The log-likelihood is the log probability of *D* given *B*_{s} and can be calculated as:

$$ LL\left(D \mid B_s\right) = \log\left(Pr\left(D \mid B_s\right)\right) $$

There are several well-known information-theoretic scoring functions. In this study, we use the minimum description length (MDL) score (equivalent to the Bayesian information criterion (BIC) for Bayesian networks [29]), as it has been shown to outperform Akaike’s information criterion (AIC), the Bayesian Dirichlet equivalence score (BDeu), and the factorized normalized maximum likelihood (fNML) [29]. The MDL scoring function is defined as follows:

$$ MDL\left(B_s, D\right) = LL\left(D \mid B_s\right) - \frac{\log N}{2}\left|B_s\right| $$

where *N* is the number of records in *D* and |*B*_{s}| is the number of independent parameters in network *B*_{s}. The penalty factor can be viewed as the number of bits required to encode the model.
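The MDL score also decomposes over families. As a sketch under the definitions above, this computes one variable's contribution: the maximum log-likelihood of the family counts minus (log *N* / 2) times the family's *q*_{i}(*r*_{i} − 1) independent parameters. The toy counts are invented:

```python
# Sketch: MDL/BIC contribution of one variable's family, following the
# formula above. counts[j][k] = N_ijk, as in the BDeu sketch.
from math import log

def mdl_family(counts):
    """MDL score term for one variable given its parent-configuration counts."""
    q = len(counts)                      # parent configurations (q_i)
    r = len(counts[0])                   # states of the variable (r_i)
    n = sum(sum(row) for row in counts)  # total number of records N
    ll = 0.0
    for row in counts:
        n_ij = sum(row)
        for n_ijk in row:
            if n_ijk > 0:                # convention: 0 * log 0 = 0
                ll += n_ijk * log(n_ijk / n_ij)
    penalty = (log(n) / 2) * q * (r - 1)  # independent parameters: q_i(r_i - 1)
    return ll - penalty

print(mdl_family([[10, 2], [3, 15]]))
```

With perfectly deterministic counts the log-likelihood term is zero and only the complexity penalty remains, which is how the score trades fit against model size.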

### Parameter estimation

The conditional probability table for each variable in the network is created once the structure learning has been carried out. Direct estimates of the conditional probabilities are calculated for each node in the network structure as follows:

$$ \Pr\left(x = k \mid Pa(x) = j\right) = \frac{N_{ijk} + N'_{ijk}}{N_{ij} + N'_{ij}} $$

where *N*′_{ijk} is a prior parameter used in estimating the probability tables and can be interpreted as an initial count on each value, with \( N'_{ij} = \sum_{k=1}^{r_i} N'_{ijk} \). When *N*′_{ijk} = 1, *N*′_{ij} = *r*_{i}, which assigns a small prior probability (instead of 0) to values unobserved in the training data. With *N*′_{ijk} = 0, maximum likelihood estimates are obtained.
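The estimate above can be sketched for a single parent configuration *j*, assuming a uniform prior count *N*′_{ijk} on every state (the example counts are invented):

```python
# Sketch: one row of a conditional probability table, following the formula
# above. `prior` plays the role of N'_ijk; prior = 1.0 gives Laplace
# smoothing (no zero probabilities), prior = 0.0 gives maximum likelihood.
def estimate_cpt_row(counts, prior=1.0):
    """Pr(x = k | Pa(x) = j) for one parent configuration j."""
    total = sum(counts) + prior * len(counts)   # N_ij + N'_ij
    return [(n + prior) / total for n in counts]

print(estimate_cpt_row([8, 0, 2], prior=1.0))  # smoothed: no zero entries
print(estimate_cpt_row([8, 0, 2], prior=0.0))  # maximum likelihood estimates
```

The smoothed row keeps a small nonzero probability for the state never observed under this parent configuration, exactly the behavior the prior counts are meant to provide.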

### Markov Blanket

An important concept underlying a Bayesian network is the Markov blanket of a node: the set of nodes that shields the node from the rest of the network. This set contains the node’s parents, its children, and all other parents of its children. Formally, let *N* be the set of all nodes in a network and *M* a set of nodes not containing node *x*; *M* is a Markov blanket for *x* if *x* is conditionally independent, given *M*, of all variables in *N* – *M* – {*x*}, and *M* is further required to be minimal. This implies that a variable in a Bayesian network is conditionally independent, given its Markov blanket, of all variables outside the blanket; once the Markov blanket of a variable *x* is known, learning the values of variables outside the blanket leaves the probability of *x* unchanged [32]. This property is noteworthy because only the variables in the Markov blanket are required to predict the behavior of the outcome variable, allowing us to reduce the size of the model significantly.
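The construction above (parents, children, and the children's other parents) can be sketched directly from a DAG's edge set; the toy network is invented:

```python
# Sketch: reading off the Markov blanket of a node from a DAG given as a set
# of directed (parent, child) edges.
def markov_blanket(edges, x):
    """Parents, children, and co-parents (other parents of children) of x."""
    parents = {u for u, v in edges if v == x}
    children = {v for u, v in edges if u == x}
    co_parents = {u for u, v in edges if v in children and u != x}
    return parents | children | co_parents

# Toy network: A -> X, X -> C, B -> C, B -> D
edges = {("A", "X"), ("X", "C"), ("B", "C"), ("B", "D")}
print(markov_blanket(edges, "X"))  # parent A, child C, co-parent B; D excluded
```

In a learned network with PU status as the outcome node, this is the operation that identifies the reduced set of predictor variables the last paragraph describes.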