Dataset and data preprocess
We use the questionnaire and laboratory results from the National Health and Nutrition Examination Survey (NHANES) [17] to establish the clinical disease sign vectors. According to the questionnaire survey (e.g., “Has been diagnosed with type 2 diabetes?”), individual samples are divided into disease group (who answered “yes”) and healthy group (who answered “no”). Next, we perform the statistical analysis to identify those disease-related clinical variables from collected laboratory results in NHANES data. We extract 87,464 individual samples, 986 numerical clinical variables and more than 30 disease conditions from NHANES data range from 1999 to 2016. Here, we only consider the disease conditions with more than 1000 individual samples, which results in 6 unique diseases (i.e., asthma, coronary heart disease, congestive heart failure, heart attack, type 2 diabetes and stroke).
We use the prescription and laboratory result histories of patients in a proprietary deidentified Electronic Health Record (EHR) to establish the clinical drug effect vectors. We transform the prescription records of patients into matrixes based on medication use situations. To study the associations between prescribed drugs and laboratory results, we apply a continuous self-controlled case series model [15] to analyze the effects of a drug on the laboratory results. We only consider patients with complete records (i.e., having both prescription and its corresponding laboratory results), which results in 91,934 patients, 1344 kinds of treatments and 65 kinds of laboratory results. After excluding those prescriptions with less than 1000 patients, we obtain 392 unique prescribed drugs.
We bridge the drug and disease using the laboratory results obtained from each side. Since the laboratory results are from different data resources (i.e., national survey data and electronic health records), we need to standardize those laboratory results for further analysis. The laboratory results that appear in both datasets are included and mapped to a standard list with consistent names. Also, the non-numerical laboratory results are excluded. Finally, we obtain 35 laboratory results considered as clinical variables. The full list of the 35 clinical variables can be found in Additional file 5: Table S1. Our inference of a drug-disease pair is based on the complementary and adverse effects that each drug candidate and disease condition has on the 35 clinical variables.
Clinical disease sign vector
We extract 6 disease conditions and 35 clinical variables from NHANES after preprocessing to establish the clinical disease sign vector. The dimension of each disease sign vector is \(1 \times 35\). There are three types of relations between a disease and clinical vectors (i.e., “Up”, “Down” and “No”), which represents increasing, decreasing and not significantly changing of laboratory results level, respectively. As mentioned above, the combined data is divided into disease group and control group according to the questionnaire data, we apply Wilcoxon rank sum test (a.k.a., Mann Whitney U test) on two groups to calculate the p-value for each clinical variable. Certain p-value cut-off is used to examine whether the value change is significant or not [18]. In our work, the p-value threshold is set to be 0.05. Only the clinical variables satisfy the condition that p-values are less than 0.05 can be regarded as significant clinical variables concerning the disease. We consult the Mann–Whitney table of \(\alpha =0.05\). If the smaller value of \(U_{1}\) and \(U_{2}\) is larger than the value given in the table, the null hypothesis is true otherwise false. Then we assign relation direction to this clinical variable by comparing the average clinical variable value of the disease group and control group. Up relation (“\(\uparrow\)”) indicates a significant value increase in the disease group compared with the control group, while down relation (“\(\downarrow\)”) means the laboratory value of the disease group is significantly lower than that of the control group, no relation (“-”) indicate the laboratory result level will not be significantly influenced by the disease.
Clinical drug effect vector
To establish clinical drug effect vectors, we extract 392 drugs and 35 clinical variables from EHR data. The clinical variables used here are the same as ones in establishing the disease sign vectors. So, the dimension of each drug effect vector is \(1 \times 35\). We need to consider the prescription records of patients and their corresponding laboratory results records simultaneously, and the EHR dataset we use is a large dataset that includes millions of records. So, we need to find a way to analyze the high-dimensional longitudinal data. In our work, we adopt the continuous self-controlled case Series (CSCCS) model proposed by Kuang et al. [15], it is a lasso regression analysis model designed to do the data analytical work for EHR dataset.
Assuming there are N patients with a specific kind of clinical variable measurement and M kinds of drugs in EHR dataset. Continuous variable \(y_{i j}\), where \(i \in \{1,2, \ldots , N\}\), \(j \in \{1,2, \ldots , J_{i}\}\), indicates the value of \(j_{t h}\) clinical variable measurement taken among a total number of \(J_{i}\) measurements for the \(i_{t h}\) patient, while binary variable \(x_{i j m}\), where \(i \in \{1,2, \ldots , N\}\), \(j \in \{1,2, \ldots , J_{i}\}\), \(m \in \{1,2, \ldots , M\}\), are used to indicated the drug whether \(i_{t h }\) patient are exposed to the \(m_{t h}\) drug when the \(j_{t h }\) clinical variable measurement is taken. 0 represents no and 1 represents yes.
\(y_{i j}\) is regard as the output variables when we fit the structured data into the linear regression model, so we have:
$$\begin{aligned} y_{i j} | {\varvec{x}}_{i j}= & {} \alpha _{i}+\varvec{\beta }^{\top } {\varvec{x}}_{i j}+\epsilon _{i j}, \quad \epsilon _{i j} {\mathop {\sim }\limits ^{i i d}} N\left( 0, \sigma ^{2}\right) \nonumber \\ \varvec{\beta }= & {} \begin{bmatrix} \beta _{1}&\beta _{2}&\cdots&\beta _{M} \end{bmatrix}^{\top },\nonumber \\ {\varvec{x}}_{i j}= & {} \begin{bmatrix} x_{i j 1}&x_{i j 2}&\cdots&x_{i j M} \end{bmatrix}^{\top } \end{aligned}$$
(1)
\(\alpha _{i}\) in Eq. (1) represents the average baseline level of \(y_{ij}\) on \(i_{th}\) patient. That means it is independent of the date the measurement was taken and drugs the patient used when the measurement was taken. Each patient has an individual baseline value. \(\epsilon _{ij}\) here is an independent and identically distributed Gaussian noises with zero means and fixed but unknown variance \(\sigma ^{2}\). Then the linear model can be easily converted to a least square problem as follows:
$$\begin{aligned} \arg \min _{\varvec{\alpha },\varvec{\beta }}{\mathcal {L}}(\varvec{\alpha },\varvec{\beta })=\arg \min _{\varvec{\alpha },\varvec{\beta }}\frac{1}{2} \begin{Vmatrix} {\varvec{y}}- \begin{bmatrix} {\varvec{Z}}&{\varvec{X}} \end{bmatrix} \begin{bmatrix} \varvec{\alpha }\\ \varvec{\beta } \end{bmatrix} \end{Vmatrix}^{2}_{2} \end{aligned}$$
(2)
where
$$\begin{aligned}{}&\varvec{\alpha }= \begin{bmatrix} \alpha _{1}&\alpha _{2}&\cdots&\alpha _{N} \end{bmatrix}^{\top }, {\varvec{Z}}={\text {diag}} \begin{pmatrix} {\mathbf {1}}_{1}, \cdots , {\mathbf {1}}_{N} \end{pmatrix},\\&{\varvec{y}}= \begin{bmatrix} y_{11}&\cdots&y_{1 J_{1}}&{\cdots }&y_{N 1}&\cdots&y_{N J_{N}} \end{bmatrix}^{\top },\\&{\varvec{X}}= \begin{bmatrix} {\varvec{x}}_{11}&\cdots&{\varvec{x}}_{1 J_{1}}&\cdots&{\varvec{x}}_{N 1}&\cdots&{\varvec{x}}_{N J_{N}} \end{bmatrix}^{\top } \end{aligned}$$
where \({\varvec{Z}}\) is a block diagonal matrix and \(\varvec{1_{i}}\) is a \(J_{i} \times 1\) vector in which all the components are 1. By solving this problem, we can get the optimized parameter \(\varvec{\beta }\), which is also the interest of our task. \(\varvec{\beta }\) is a \(1 \times M\) parameter vector, parameter \(\varvec{\beta _{m}}\) in \(\varvec{\beta }\) indicates the effect of \(m_{t h}\) drug on the output variable \({\varvec{y}}\). The optimized parameter we get with the CSCCS model is numerical. Positive and negative parameters in this vector represent the corresponding drugs that may increase and decrease the level of output variable respectively, while 0 indicates the corresponding drugs do not influence it. In the CSCCS model, parameter \(\varvec{\alpha }\) is regarded as a nuisance parameter, our interest is parameter \(\varvec{\beta }\) so we do not need to care the value of \(\varvec{\alpha }\). To eliminate the effect of \(\varvec{\alpha }\), [15] consider:
$$\begin{aligned} \frac{\partial {\mathcal {L}}(\varvec{\alpha }, \varvec{\beta })}{\partial \varvec{\alpha }}={\mathbf {0}} \Rightarrow \varvec{\alpha }=\left( {\varvec{Z}}^{\top } {\varvec{Z}}\right) ^{-1} {\varvec{Z}}^{\top }({\varvec{y}}-{\varvec{X}} \varvec{\beta })=\overline{{\varvec{y}}}-\overline{{\varvec{X}}} \varvec{\beta } \end{aligned}$$
(3)
Where \(\overline{{\varvec{y}}}\) is a \(N\times 1\) vector which includes the average value of clinical value among N patients, \(\overline{y_{i}} = \frac{1}{J_{i}} \sum _{j=1}^{J_{i}} y_{i j}\). \(\overline{{\varvec{X}}}\) is a \(N \times M\) matrix and \(\overline{{\varvec{X}}}_{i}=\frac{1}{J_{i}} \sum _{j=1}^{J_{i}} {\varvec{x}}_{i j}^{\top }\). So, the expression of CSCCS model below, which is free of \(\varvec{\alpha }\), is derived by substituting Eq. (3) into Eq. (2):
$$\begin{aligned} \arg \min _{\beta } \frac{1}{2}\Vert {\varvec{y}}-{\varvec{Z}} \overline{{\varvec{y}}}-({\varvec{X}}-{\varvec{Z}} \overline{{\varvec{X}}}) \varvec{\beta }\Vert _{2}^{2} \end{aligned}$$
(4)
When we apply the CSCCS model on the high-dimensional longitudinal EHR data, we will add a \(L_{1}\) penalty term because there is an assumption that the level of clinical variables will only be significantly influenced by a small portion of drugs. The \(L_{1}\) penalization drives most components of \(\varvec{\beta }\) to zero or closed to zero [19]. In other words, we simply want to know the drugs which are most correlated to the level change of clinical variables. So the final expression of the CSCCS model we apply to this problem is:
$$\begin{aligned} \arg \min _{\varvec{\beta }} \frac{1}{2}\Vert {\varvec{y}}-{\varvec{Z}} \overline{{\varvec{y}}}-({\varvec{X}}-{\varvec{Z}} \overline{{\varvec{X}}}) \varvec{\beta }\Vert _{2}^{2}+\lambda \Vert \varvec{\beta }\Vert _{1} \end{aligned}$$
(5)
where \(\lambda >0\), \(\lambda\) decides the sparsity of optimized result so we need to tune this parameter to get a final result with proper sparsity level.
In order to further filter out the drugs which do not have a significant effect on clinical variables, our implementation also returns the p-value of each component in \(\varvec{\beta }\). We apply the same p-value cut-off strategy on the optimized result. Parameters with a p-value greater than 0.05 in \(\varvec{\beta }\) are regarded as insignificant effect and we assume their corresponding drugs are uncorrelated with clinical variable level change. The significant effects can be divided into increasing or decreasing effect based on the coefficient value is positive or negative. Then we assign each drug-clinical variable pair up (“\(\uparrow\)”), down (“\(\downarrow\)”) and no (“-”) relation type just like clinical disease sign vectors.
Scoring function
After we establish the clinical vectors for each drug-disease pair, we need to define a scoring function to calculate the repurposing possibility score for each drug-disease pair. The inference for each drug-disease pair is based on complementary and adverse effects. Specifically, complementary effect refers to the opposed relation type between a clinical disease sign vector and clinical drug effect vector on the same clinical variables, while adverse effect refers to the same relation type between a clinical disease sign vector and clinical drug effect vector on the same clinical variables. The complementary relation direction between the two vectors will increase the final repurposing possibility score of a drug-disease pair while adverse relation direction will decrease it. Here, we use a dot product-based scoring function to consider both complementary and adverse effects of a drug candidate on a disease. The scoring function can be written as follow:
$$\begin{aligned} \hbox {TS}_{\mathrm{disease-drug}}=-\hbox {CV}_{\mathrm{Drug}} \cdot \hbox {CV}_{\mathrm{Disease}} \end{aligned}$$
(6)
where \(\hbox {CV}_{\mathrm{Drug}}\) is the clinical drug effect vector, and \(\hbox {CV}_{\mathrm{Disease}}\) is the clinical disease sign vector. We transform the 3 kinds of relation type in clinical vectors (“\(\uparrow\)”, “\(\downarrow\)” and “−”) into numerical values (1.0, \(-1.0\), 0.0) for the convenience of calculation. To rank the drugs in descending order and emphasize the most powerful drug candidates predicted by our model, we add a minus sign before the product of the two vectors. So, the positive result calculated by this scoring function means there are more complementary relation directions than adverse relation directions between a drug candidate and a disease, while negative results indicate more adverse relation directions between this pair.