The performance evaluation clearly places the CAT Crawler meta-search engine on par with the individual search engines at BestBETs and UMHS with respect to recall, and well above them with respect to precision (see Table 2 and Figure 2). According to these results, the application can be called successful: by using the CAT Crawler to look for relevant information at specific sites, the medical professional obtains as much information as by going to the sites directly, but with higher precision.
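Although the exact denominators used in this study are discussed below, the comparison rests on the standard definitions of the two measures:

\[
\text{recall} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{relevant}\}|}, \qquad
\text{precision} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{retrieved}\}|}
\]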
Benoit has analyzed various methods of information retrieval and their impact on user behavior. He finds that users wish for greater interactive opportunities to determine for themselves the potential relevance of documents, and that a parts-of-document approach is preferable in many information retrieval situations. At present, the CAT Crawler offers a number of interactive options, but these would not have affected the calculation of recall and precision under the conditions of the present study. Benoit's reasoning should nevertheless be kept in mind for improving user friendliness: further useful filter functions could be included in future versions of the application. While such advanced search functions will pay off when large datasets are searched, the amount of information in the online CAT libraries is still manageable, and the user is better served if it is initially displayed in a broader way. For example, some of the information displayed may be older than 18 months, which makes it undesirable according to the strict rules for CAT updating defined by Sackett et al. Formally outdated information, however, may in a given situation still constitute "best evidence" and positively influence decision-making; a filter that blocks aged information will certainly influence this process.
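As a purely illustrative sketch of such an optional filter (the data structure, field names, and threshold handling below are assumptions for demonstration and not part of the CAT Crawler), formally outdated CATs could be flagged rather than removed, leaving the final judgement to the user:

```python
from datetime import date, timedelta

# Hypothetical CAT records; the real repositories store richer metadata.
cats = [
    {"title": "Antibiotics before appendectomy", "appraised": date(2002, 1, 15)},
    {"title": "Analgesia in acute abdominal pain", "appraised": date(2003, 6, 1)},
]

MAX_AGE = timedelta(days=548)  # roughly 18 months, per the CAT updating rule

def flag_outdated(cat, today=None):
    """Mark a CAT as formally outdated instead of silently dropping it,
    so that aged evidence can still inform decision-making."""
    today = today or date.today()
    cat["outdated"] = (today - cat["appraised"]) > MAX_AGE
    return cat

for cat in (flag_outdated(c) for c in cats):
    note = " [older than 18 months]" if cat["outdated"] else ""
    print(cat["title"] + note)
```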
Despite the encouraging results, some fundamental questions regarding the evaluation of this meta-search engine in particular, and of meta-search engines in general, remain unresolved.
With regard to recall, there is the theoretical possibility that manually searching all documents at a given repository would yield a higher recall for a given search term. With hundreds of CAT documents per repository, however, a human evaluator's attention is likely to wander, leading to less than optimal scrutiny of the documents and introducing a non-quantifiable error into the evaluation. This is a general problem of knowledge databases, especially when indexing is done by humans, whose decisions are not consistent. In a study of 700 Medline references indexed in duplicate, the consistency of main subject-heading indexing was only 68%, and that for heading-subheading combinations was significantly lower. Also, in two studies [17, 18] on Medline searching, there was considerable disagreement among those judging the relevance of the retrieved documents as to which documents were relevant to a given query.
In order to overcome this problem, the number of documents containing a given keyword, as found by the keyword extractor, was used as the basis for calculating the technical recall. This may (or may not) lead to numerical results for recall that differ from the true value that exhaustive manual searching would yield. As the same denominator is used throughout, however, the comparison between the results obtained by the individual search engines and the CAT Crawler meta-search engine remains valid.
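A minimal sketch of this bookkeeping, with purely illustrative counts (none of the numbers below are results from the evaluation): technical recall uses the keyword extractor's document count as denominator, while precision uses the number of documents actually returned.

```python
def technical_recall(relevant_retrieved, keyword_doc_count):
    """Recall computed against the number of documents the keyword
    extractor found to contain the keyword (the 'technical' denominator)."""
    return relevant_retrieved / keyword_doc_count if keyword_doc_count else 0.0

def precision(relevant_retrieved, total_retrieved):
    """Fraction of the retrieved documents that were judged relevant."""
    return relevant_retrieved / total_retrieved if total_retrieved else 0.0

# Illustrative counts for one keyword and one search engine.
print(technical_recall(relevant_retrieved=12, keyword_doc_count=15))  # 0.8
print(precision(relevant_retrieved=12, total_retrieved=20))           # 0.6
```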
Critics have pointed out the over-reliance of researchers on recall and precision in evaluation studies and the difficulty of designing an experiment that allows both laboratory-style control and operational realism. For instance, recall may be of little consequence once the user has found a useful document. Rhodes and Maes evaluated their systems with a traditional field user test and then asked for relevance feedback. In their experiment, users gave each delivered document a score from 1 to 5, from which an overall average value for perceived precision was calculated. While a document can receive a high relevance score, it may at the same time be of low practical usefulness, often because the documents were already known to the users and in some cases had even been written by them. Accordingly, Rhodes and Maes added features to the system that weeded out relevant documents that, by some predefined criteria, would not be useful. As a result, the measurable precision could be worse, but the overall usefulness could be better. In the study presented here, a similar approach was chosen in the instructions to the evaluators: they could distinguish between 'irrelevant' (e.g. the retrieved document was only a web-hosted clinical question) and 'medically irrelevant' (e.g. the word appendicitis appeared only in the reference section of a document dealing with abdominal pain relief). Owing to the relatively small number of documents, no difference could be detected between the various grades of relevance; the results were therefore pooled into relevant/irrelevant and used for calculating recall and precision as described above. If a larger number of volunteers could be recruited, repetition of this evaluation might yield interesting results.
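The pooling step can be sketched as follows, assuming each evaluator judgment is stored as one of the three labels described above (the labels and counts are illustrative only, not study data):

```python
# Hypothetical judgments for the documents retrieved by one query.
judgments = ["relevant", "medically irrelevant", "relevant", "irrelevant", "relevant"]

# Collapse the finer grades into the binary scale used for recall and precision.
IRRELEVANT_GRADES = {"irrelevant", "medically irrelevant"}
pooled = ["irrelevant" if j in IRRELEVANT_GRADES else "relevant" for j in judgments]

relevant_retrieved = pooled.count("relevant")
print(f"{relevant_retrieved} of {len(pooled)} retrieved documents judged relevant")
```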
Other approaches to evaluating system effectiveness have been developed in order to minimize these problems with recall and precision. One example is task-oriented methods, which measure how well the user can perform certain tasks [21–24]. These approaches were not chosen in this study for a reason: the primary aim was to compare the search engines, and under the present restrictions recall and precision are sufficient to answer this question.