Given a potentially huge set of candidate criteria, our goal is to help users choose the most appropriate criteria from among the candidates. In this work, we assume that the candidate criteria exist in one or more documents in a given repository, either spread across different documents or together in the same document, and our technical objective is to rank these candidates according to their appropriateness to the task at hand. As mentioned previously, such a task would be to put together a clinical trial protocol about some particular disease, treatment, or drug.
To provide a concrete idea of our problem, the following is a snippet of a clinical trial protocol, downloaded from the clinical trials website http://www.clinicaltrials.gov, and modified for presentation as an example. It shows the title, the objective of the clinical trial as well as the inclusion and exclusion criteria.
Title: Antioxidant Systems and Age-Related Macular Degeneration
Objective: The objective of this study is to determine whether the antioxidant supplements used in AREDS shifted the plasma pool of the AREDS subjects to a more reduced state.
The AREDS subjects were randomly assigned to one of four treatment groups:
1. antioxidants (500mg Vitamin C, 4000IU Vitamin E, 15mg beta carotene)
2. zinc (80mg zinc oxide, 2mg cupric oxide)
3. antioxidants plus zinc;
4. placebo.
...
Inclusion Criteria:
* Age 55-80
* Participants with Intermediate or Advanced AMD
* participants with no ocular signs of AMD
* Willing to give written informed consent, make the required study visits, and follow instructions
* Any race and either sex
Exclusion Criteria:
* Current history of a medical condition that would preclude scheduled study visits or completion of the study (e.g., unstable cardiovascular disease, unstable pulmonary disease, chronic hepatitis, or AIDS).
* Intraocular surgery in study eye (eye to be treated) within 60 days prior to enrollment
* Presence of a scleral buckle in the study eye
...
As the snippet above shows, automatically identifying appropriate eligibility criteria is not an easy task. There are many possible starting points from which to address this problem, ranging from knowledge-based approaches [8], techniques involving natural language processing [9, 12], and information extraction methods [10, 12, 13], to machine learning approaches [5, 14, 15].
In this section, we describe an approach using machine learning techniques, in particular topic models and probability models. One practical advantage of probability models is that, assuming sufficient training data is available, they are fairly easy to build and update. In contrast to knowledge-based approaches, in which domain knowledge from experts is used, machine learning approaches typically require no manual tuning because general 'rules' are automatically inferred or discovered from observations.
For our particular problem, however, it is not obvious how to leverage probability models, since we do not have data labeled as appropriate or inappropriate eligibility criteria. This means that no supervised machine learning method can be used directly to solve the modeling problem. Even if one wanted to construct such training data, e.g., with appropriate (+1) and inappropriate (−1) labels, it would be infeasible to construct a set that can be used for any given search string associated with the task of putting together clinical trial protocols.
Our approach to addressing the unavailability of training data is to combine an unsupervised method and a supervised method to eventually construct probability models. The intuition is to identify a set of random variables that would allow us to specify conditions on the probability distribution. Topics are a natural choice for these random variables, and the topics associated with a candidate criterion can be used to specify conditions on a probability distribution. For instance, the probability that a candidate criterion is a good solution is directly linked to whether that candidate criterion belongs to the same set of topics a user is interested in, i.e., it is conditional upon the same set of topics.
Topic modeling [5, 15], an unsupervised machine learning approach, is a technique that can be used to infer what these topics are. Once the topics have been identified, it is possible to build probability models or classifiers using a one-versus-rest approach as a way of assigning training labels. Given training data, we can use topic models to identify these topics in a particular set of documents. The topics are themselves expressed as groups of words that belong to particular documents. We shall refer to this supervised training technique as anchored training, where topics predicted using an unsupervised learning model are subsequently used as pseudo-labels to build models using a supervised learning method. In a sense, the training 'labels' for the supervised learning method are based, or 'anchored', on latent topics that were inferred without manual labeling. It is important to point out that although topic models output clusters of documents according to particular topics, these topics are really pseudo-labels in the sense that documents vary in their levels of membership in a particular cluster. Unlike standard labels in supervised learning, which clearly mark a training instance as having one label and not the other, our pseudo-labels carry some amount of noise. A document (or in our case, a training instance) could be a member of more than one cluster or topic in varying proportions, so it is not readily apparent that these pseudo-labels are useful without further refinement.
The first step in building a model through anchored training is to get a rough idea of which subsets of the data are good training candidates for a particular pseudo-label. This step is achieved through an unsupervised learning method, whose goal is to produce an approximate grouping of the data. This approximate grouping via latent topics (pseudo-labels) can then be used as a guide to tease out the features that have real predictive or discriminatory power. In the second step, we refine the model using a supervised learning method; optionally, this refinement can be done after a dimensionality reduction phase. The output of the second step is a refined model because the features that are useful are identified during model fitting, e.g., via their learned weights. Thus, anchored training provides a way to build supervised models without the high cost of manually labeling a set of training instances. More importantly, models could potentially be built even when no labeled training data is available. The training method is outlined in Algorithm 1, where line (2) can be replaced with any method that outputs topic weights or proportions, and where line (6) can be replaced with any supervised method that uses the labels L_{i}, i = 1 ... n.
Algorithm 1: Anchored Training
Let W = {w_{i}} denote topic proportions, and D a collection of documents d. Let P = {p_{i}}, i = 1 ... n, be a set of trained logistic regression models p_{i}, and LDA_{n} be a trained Latent Dirichlet Allocation (LDA) model, where n is the number of topics.
Input: set of documents D, thresholds κ_{1}, κ_{2}
Output: set of probability models P
Anchor(D)
(1) P ← ∅
(2) W ← LDA_{n}(D)
(3) foreach i = 1 ... n
(4)     L_{i} ← {d ∈ D | w_{i}(d) > κ_{1}}
(5)     L_{−i} ← {d ∈ D | w_{i}(d) ≤ κ_{2}}
(6)     p_{i} ← train(L_{i}, L_{−i})
(7)     P ← P ∪ {p_{i}}
(8) return P
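The control flow of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the topic proportions of step (2) are assumed to be already computed and passed in as a dictionary, and `train` is a caller-supplied function standing in for any supervised learner. The names `anchor`, `topic_weights`, `kappa1`, and `kappa2` are hypothetical and chosen to mirror the notation above.

```python
from typing import Callable, Dict, Hashable, List, Sequence


def anchor(
    topic_weights: Dict[Hashable, Sequence[float]],
    kappa1: float,
    kappa2: float,
    train: Callable[[List[Hashable], List[Hashable]], object],
) -> List[object]:
    """Sketch of Algorithm 1 (Anchored Training).

    topic_weights maps each document id d to its topic proportions
    (w_1(d), ..., w_n(d)); it stands in for the output of step (2).
    train(positives, negatives) stands in for step (6) and may be any
    supervised method (assumption: it returns a fitted model object).
    """
    if not topic_weights:
        return []
    n_topics = len(next(iter(topic_weights.values())))
    models = []                                   # step (1): P <- empty
    for i in range(n_topics):                     # step (3)
        # step (4): documents in which topic i is strongly present
        positives = [d for d, w in topic_weights.items() if w[i] > kappa1]
        # step (5): documents in which topic i is weak or absent
        negatives = [d for d, w in topic_weights.items() if w[i] <= kappa2]
        # steps (6)-(7): fit one model per topic and collect it
        models.append(train(positives, negatives))
    return models                                 # step (8)
```

Documents whose weight for topic i falls between the two thresholds are left out of both sets for that topic, which is one way to keep the noisiest pseudo-labels out of training.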
In the work reported here, we used Latent Dirichlet Allocation (LDA) [5] as the unsupervised learning method for anchored training. Given a document, LDA outputs the topic proportions of that document. With a sufficiently large collection of documents, it is then easy to find documents that have a topic t in their mixture and those that do not. Alternatively, threshold values, as suggested in lines (4)-(5) of Algorithm 1, can be used. One can then use this dichotomized data either to train classifiers or to build probability models. In our case, we trained probability models from such data using logistic regression [4].
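As a rough illustration of how a probability model can be fit to such dichotomized data, the following is a minimal pure-Python logistic regression trained by batch gradient descent. It is a sketch, not the implementation used in this work: the feature representation (numeric vectors such as word counts), the learning rate, the number of epochs, and the function names `train_logistic` and `predict_proba` are all assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias)


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lr: float = 0.5,
    epochs: int = 500,
) -> Model:
    """Fit weights and bias by batch gradient descent on the log-loss.

    X holds one feature vector per training instance; y holds 1 for
    instances pseudo-labeled with the topic and 0 for the rest.
    """
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    m = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / m for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b


def predict_proba(model: Model, x: Sequence[float]) -> float:
    """Probability that instance x belongs to the topic."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```

In practice, an off-the-shelf logistic regression implementation with regularization would normally be preferred; the point here is only that each topic yields one binary model whose output is a probability.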
To help users put together the inclusion and exclusion criteria, we assume as an input to our method a small sample of clinical trial protocols identified by the user as relevant to the task at hand. Depending on the user interface, the sample set could be used as an initial seed to find other relevant documents. Such sample documents can, for instance, be a set resulting from a search query over a large repository by using a series of search strings. Note that as in the sample snippet above, important keywords that users normally use as part of the search string may or may not appear as words in the eligibility criteria section of the clinical trial protocol.
As mentioned above, the specific implementation of our approach uses Latent Dirichlet Allocation (LDA) and Logistic Regression (LR). We shall henceforth refer to our method as the LDALR method. Algorithm 2 presents the general steps of the LDALR method. In line (1), we use a trained LDA model to infer the latent topics from a given set of documents D. In this work, each d ∈ D is a clinical trial protocol, which itself contains a set of criteria S. Each criterion s_{j} ∈ S can be either an inclusion criterion or an exclusion criterion, but not both. Then, in line (2), we invoke the function GETSCORE (Algorithm 3), which executes two additional general steps, both performed in its line (4). First, it uses the logistic regression models p_{i} to compute the probability that a candidate criterion is of topic i. This is done for all topics i = 1 ... n. Then, it computes the expected value σ(s) of a candidate criterion by taking the sum of the probability-weighted normalized topic proportions $\tilde{w}_{i}$. GETSCORE returns the set of criteria sorted in descending order of expected value. Intuitively, candidate criteria whose topics are more dominant in the samples will have higher expected values, or scores, than those whose topics are not significantly represented in the samples.
Algorithm 2: LDALR Method
Let W = {w_{i}} denote topic proportions, s_{j} ∈ S denote a criterion, and S = {s_{j} | s_{j} ∈ d, d ∈ D}. Let P = {p_{i}}, i = 1 ... n, be the set of trained logistic regression models p_{i}, and LDA_{n} be a trained Latent Dirichlet Allocation (LDA) model, where n is the number of topics.
Input: set of documents D, trained logistic regression models P, trained LDA model LDA_{n}
Output: set of ordered criteria S^{★}
LDALR(D, P, LDA_{n})
(1) W ← LDA_{n}(D)
(2) S^{★} ← GETSCORE(S, W, P)
(3) return S^{★}
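To make the expected value concrete, consider a purely illustrative case; the numbers are assumptions chosen for the example and are not taken from the study. Suppose there are n = 2 topics, the normalized topic proportions of the sample documents are 0.75 and 0.25, and for a candidate criterion s the logistic regression models give probabilities 0.8 for topic 1 and 0.1 for topic 2:

```latex
\sigma(s) = \sum_{i=1}^{2} \tilde{w}_{i}\, p_{i}\bigl(T(s) = i\bigr)
          = 0.75 \times 0.8 + 0.25 \times 0.1
          = 0.625
```

A second candidate with probabilities 0.2 and 0.9 would score 0.75 × 0.2 + 0.25 × 0.9 = 0.375 and would therefore rank below the first, because most of its probability mass falls on the topic that is less dominant in the samples.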
Each d ∈ D has topic proportions as computed in line (1) of Algorithm 2. In Algorithm 3, we simply normalize the topic proportions w_{i} so that, for a given D, $\sum_{i=1}^{n} \tilde{w}_{i} = 1$. If we let w_{id} = w_{i}(d) denote the proportion of topic i in the d-th document, for d ∈ D, then the normalized weights $\tilde{w}_{i}$ are computed as:
$$\tilde{w}_{i} = \frac{\sum_{d=1}^{|D|} w_{id}}{\sum_{i=1}^{n} \sum_{d=1}^{|D|} w_{id}} \tag{1}$$
The normalize function in line (2) of the Scoring Function (Algorithm 3) is computed using Eq. (1).
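Eq. (1) amounts to summing each topic's proportion over all documents and dividing by the grand total. A minimal Python sketch follows, assuming the topic proportions are given as a list with one row per document; the function name `normalize` mirrors the algorithm, while the input layout is an assumption made for the example.

```python
from typing import List, Sequence


def normalize(W: Sequence[Sequence[float]]) -> List[float]:
    """Normalized topic weights following Eq. (1).

    W[d][i] is w_id, the proportion of topic i in document d.
    Returns one weight per topic; the weights sum to 1.
    """
    n_topics = len(W[0])
    # Numerator of Eq. (1): sum of w_id over documents, per topic.
    per_topic = [sum(doc[i] for doc in W) for i in range(n_topics)]
    # Denominator of Eq. (1): sum over all topics and documents.
    total = sum(per_topic)
    if total == 0:
        raise ValueError("topic proportions sum to zero")
    return [t / total for t in per_topic]
```

If every document's proportions already sum to 1, the denominator equals the number of documents and each normalized weight is simply the average proportion of that topic across the sample.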
The solutions given by LDALR have some interesting characteristics.
Proposition 1 Let S be a set of criteria and H = {h_{j}} be a sufficiently large set of randomly drawn target criteria h_{j}, where H and S are possibly disjoint and the topics in H and S are drawn from the same distribution. Furthermore, let ϕ be any similarity function such that ϕ(x_{1}, x_{2}) → [0, 1] for x_{1}, x_{2} ∈ H ∪ S, and let
$$s^{\top} = \underset{s \in S}{\arg\max}\; \varphi(s, h_{j}) \tag{2}$$
$$s^{*} = \underset{s \in S}{\arg\max}\; \sigma(s) \tag{3}$$
where σ(s) is the scoring function in Algorithm 3. Then
$$\frac{\varphi(s^{\top}, h_{j}) - \varphi(s^{\top}, s^{*})}{|H|} > \epsilon \tag{4}$$
$$\varphi(s^{\top}, s^{*}) - \varphi(s^{\top}, \mathrm{random}(s)) > 0 \tag{5}$$
for some ε > 0, s ∈ S, h_{j} ∈ H.
Algorithm 3: Scoring Function
Let σ denote the score of a criterion s ∈ S and let T(s) denote the topic of s.
Input: set of criteria S, topic proportions W = {w_{i}}, logistic regression models P = {p_{i}}, for i = 1 ... n
Output: set of ordered criteria S^{★}
GETSCORE(S, W, P)
(1) S^{★} ← ∅
(2) $\tilde{W} \leftarrow \text{normalize}(W)$
(3) foreach s ∈ S
(4)     $\sigma(s) \leftarrow \sum_{i=1}^{n} \tilde{w}_{i}\, p_{i}(T(s) = i)$
(5)     S^{★} ← S^{★} ∪ {〈s, σ(s)〉}
(6) return sort(S^{★})
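The scoring loop of Algorithm 3 can be sketched as follows. This is a hedged illustration rather than the authors' code: the normalized weights are assumed to be precomputed and passed in as `w_tilde`, and `topic_prob` is a hypothetical callable standing in for the trained models, returning the probability that criterion s is of topic i.

```python
from typing import Callable, Hashable, List, Sequence, Tuple


def get_score(
    criteria: Sequence[Hashable],
    w_tilde: Sequence[float],
    topic_prob: Callable[[Hashable, int], float],
) -> List[Tuple[Hashable, float]]:
    """Sketch of GETSCORE: rank criteria by expected value sigma(s).

    w_tilde[i] is the normalized proportion of topic i (step 2 is
    assumed done by the caller); topic_prob(s, i) plays the role of
    p_i(T(s) = i).
    """
    scored = []                                    # step (1)
    for s in criteria:                             # step (3)
        # step (4): probability-weighted sum of normalized proportions
        sigma = sum(w * topic_prob(s, i) for i, w in enumerate(w_tilde))
        scored.append((s, sigma))                  # step (5)
    # step (6): highest expected value first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

Passing the probability models in as a single callable keeps the sketch independent of how the per-topic models were trained or which features they use.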
Eqs. (4) and (5) describe characteristics of the average similarity that result from using LDALR. Eq. (4) says that the average similarity between what LDALR recommends and the optimal solution cannot exceed the average maximum similarity between the target and any candidate. Eq. (5), on the other hand, establishes a lower bound on the solutions that LDALR recommends: on average, LDALR's solution cannot be worse than a randomly drawn solution.