An evaluation of GPT models for phenotype concept recognition

Objective: Clinical deep phenotyping and phenotype annotation play a critical role both in the diagnosis of patients with rare disorders and in building computationally-tractable knowledge in the rare disorders field. These processes rely on using ontology concepts, often from the Human Phenotype Ontology, in conjunction with a phenotype concept recognition task (usually supported by machine learning methods) to curate patient profiles or existing scientific literature. With the significant shift towards large language models (LLMs) for most NLP tasks, we examine the performance of the latest Generative Pre-trained Transformer (GPT) models underpinning ChatGPT as a foundation for the tasks of clinical phenotyping and phenotype annotation.

Materials and methods: The experimental setup of the study included seven prompts of various levels of specificity, two GPT models (gpt-3.5-turbo and gpt-4.0) and two established gold standard corpora for phenotype recognition, one consisting of publication abstracts and the other of clinical observations.

Results: The best run, using in-context learning, achieved a 0.58 document-level F1 score on publication abstracts and a 0.75 document-level F1 score on clinical observations, as well as a mention-level F1 score of 0.7, which surpasses the current best-in-class tool. Without in-context learning, however, performance is significantly below that of the existing approaches.

Conclusion: Our experiments show that gpt-4.0 surpasses the state-of-the-art performance if the task is constrained to a subset of the target ontology where there is prior knowledge of the terms that are expected to be matched. While the results are promising, the non-deterministic nature of the outcomes, the high cost and the lack of concordance between different runs using the same prompt and input make the use of these LLMs challenging for this particular task.
Supplementary Information The online version contains supplementary material available at 10.1186/s12911-024-02439-w.


INTRODUCTION
Over the past decade, clinical deep phenotyping - i.e., the comprehensive documentation of abnormal physical characteristics and traits in a computationally-tractable manner - has evolved into a common procedure for individuals who are either suspected of having or have been diagnosed with a rare disease. Similarly, the development and continuous enrichment of knowledge bases in the rare disease domain has become standard practice. Conceptually, both tasks rely on ontologies that are developed and updated by the medical community. Such ontologies facilitate the description of a patient's unique phenotype, as well as the characterisation of the phenotypic manifestations of gene mutations, using ontological terms and concepts. The utility of ontology-coded knowledge in rare diseases has been showcased repeatedly over the years in data sharing [1-3] and clinical variant prioritization and interpretation [4-6].
The Human Phenotype Ontology (HPO) [7, 8] provides the most comprehensive resource for computational deep phenotyping and has become the de facto standard for encoding phenotypes in the rare disease domain, for both disease definitions and patient profiles used to aid genomic diagnostics. The ontology, maintained by the Monarch Initiative [9], provides a set of more than 16,500 terms describing human phenotypic abnormalities, arranged as a hierarchy with the most specific terms furthest from the root, as depicted in Fig. 1. In addition to underpinning complex diagnostic tasks (e.g., clinical interpretation of an exome/genome) [10] or building care coordination plans, ontology-coded phenotypes often also represent a communication channel between practitioners and patients and, subsequently, between patients and other stakeholders, e.g., education, disability or welfare workers. Moreover, concepts grounded in HPO provide the explainability required to improve the transparency of the decision-making process, which can then support communication and documentation.
Manual curation of phenotype profiles - and manual annotation as a general task - is, however, tedious and has represented the main blocker to widespread uptake of computational deep phenotyping on the clinical side and to keeping rare disease knowledge bases up to date. (Semi-)automated methods that rely on natural language processing (NLP) have been introduced to remove this blocker and have gradually become the standard modus operandi. Such methods, the latest built using convolutional neural networks [11] or transformer-based architectures [12], also address a variety of challenges associated with phenotype concept recognition (CR), such as ambiguity, use of metaphorical expressions, as well as negation and complex or nested structures.
Lately, the focus has shifted to large language models (LLMs) for most NLP tasks. LLMs - a class of transformer-based models trained on trillions of words of diverse texts [13] - have showcased superior capabilities in application domains such as chatbots and text prediction [14]. Their main advantage also stems from the ability to use few-shot learning to perform specific tasks, without the need for further training or fine-tuning [15], which replaces the "traditional" task-driven training of machine learning models [16]. gpt-3.5 and gpt-4.0 are examples of such LLMs that have witnessed rapid general user adoption via the ChatGPT application, a chatbot fine-tuned for conversation-based interactions with humans. A user can "prompt" ChatGPT to perform a variety of tasks, with or without the need to provide examples to support them.
In the biomedical domain, several domain-specific models have been published - BioBERT [15], PubMedBERT [17] or BioGPT [18] - and shown to perform well on NLP tasks including relationship extraction (e.g., drug-drug interactions or drug-target interactions) and question answering. The experimental results also included comparisons against GPT-2.0, a predecessor of the current models powering ChatGPT. Lately, several studies have been published on the utility of using GPT models for annotation in general [19, 20], and few have discussed the efficiency of such models on concept recognition tasks, in particular phenotype concept recognition. Note that ontology-based concept recognition implies a joint task of named entity recognition (i.e., finding entities of interest in a text and their corresponding boundaries) and entity linking (i.e., aligning the entities found in the text to concepts defined in a given ontology). Experiments documenting the accuracy of named entity recognition - with a focus on diseases and chemical entities - were documented by Chen et al. [21], with gpt-4.0 (+ one-shot learning) achieving a performance poorer than a fine-tuned PubMedBERT, yet significantly better than gpt-3.5. This paper examines the ability of gpt-3.5 and gpt-4.0 to perform phenotype concept recognition using HPO as the background ontology. Three different approaches are used to generate prompts to gain a deeper understanding of the limitations in various scenarios. Specifically, the experimental setup targets direct concept recognition - i.e., named entity recognition followed by an alignment to HPO concepts - and few-shot learning.

MATERIALS AND METHODS
The study uses two gold standard corpora available in the literature for phenotype concept recognition: (i) a corpus of 228 scientific abstracts collected from PubMed, initially annotated and published by Groza et al. [22], and subsequently refined by Lobo et al. [23] (named HPO-GS from here on); and (ii) the dev component of the corpus made available through Track 3 of BioCreative VIII (454 entries), focusing on the extraction and normalization of phenotypes resulting from genetic diseases, based on dysmorphology physical examination [24] (named BIOC-GS from here on). An example of an entry is: "ABDOMEN: Small umbilical hernia. Mild distention. Soft." All experiments were conducted using these two corpora. HPO-GS covers 2,773 HPO term mentions and a total of 497 unique HPO IDs, with the minimum size of a document being 138 characters, the maximum 2,417 characters and the average ~500 characters. BIOC-GS covers 783 HPO term mentions and a total of 358 unique HPO IDs, with the minimum size of an entry being 13 characters, the maximum 225 characters and the average ~56 characters. Note that we chose the dev component of Track 3 because of its similarity to HPO-GS in the number of unique HPO IDs and in its profile (described below). We were unable to download the test component of Track 3 and hence the results reported here are not comparable to the results published by the Track's organisers.
The complexity of the lexical representations of the HPO concepts can be partially assessed based on their length (presented in Fig. 2) and structural placement in the ontology. The latter is depicted in Fig. 3 using the children of the Phenotypic abnormality concept as major categories, with the values representing the proportion of terms belonging to each category (as also depicted in Fig. 1). It can be observed that the large majority of concepts in both corpora (~88%) have low to moderate lexical complexity, with a label length of 4 words or less, and are placed predominantly in the nervous and musculoskeletal systems (including here also head and neck and limbs) - i.e., denoting finger, toe, face, arm and leg abnormalities.
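The word-count profile described above can be reproduced with a few lines of Python. The following is a minimal sketch; the example labels are illustrative, not drawn from the corpora:

```python
from collections import Counter

def label_length_distribution(labels):
    """Bucket concept labels by their word count."""
    return Counter(len(label.split()) for label in labels)

# Illustrative HPO labels (not the full corpus)
labels = ["Hypospadias", "Small umbilical hernia",
          "Abnormality of the nervous system"]
dist = label_length_distribution(labels)

# Proportion of labels with 4 words or fewer, i.e. the
# "low to moderate lexical complexity" bucket discussed above
short_fraction = sum(n for length, n in dist.items() if length <= 4) / len(labels)
```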

Prompt generation approaches
Conversational models such as gpt-3.5-turbo and gpt-4.0 take inputs in the form of prompts. These prompts can include - in addition to a target content - explicit instructions or examples of the desired output. This is sometimes referred to as "prompt engineering". A task, such as concept recognition, can be defined via prompts in various ways, with the behaviour, and hence the output, of the model being heavily influenced by the smallest differences in these definitions. In this study, we used three types of prompts to investigate the models' efficiency in performing phenotype concept recognition. Two remarks are worth noting about our selection strategy:
• We opted for well-known, low-barrier prompts that do not require significant prompt engineering knowledge and skills.
• We were aware of the risk of HPO ID hallucinations as a result of aiming for concept recognition instead of a chain of entity recognition, external entity linking and LLM-based validation; however, as presented later in our experiments, this did not materialise as a real concern.

Instructional (directed) phenotype concept recognition. Prompts in this category aimed to capture the impact of the wording used to define 'phenotypes' on the CR task. They are instructional (directed) because the model is asked explicitly to perform a certain task. The four prompts defined in this category are listed below; the key instructions are underlined for easier comprehension.
• The first three prompts direct the model to 'extract' artefacts from the provided text.Prompt 2 is a variation of Prompt 1 ('phenotypes' vs 'phenotypes and clinical abnormalities'), while Prompt 3 refers directly to HPO terms.Prompt 4 explicitly names the task requested to be performed by the model -i.e., 'automated concept recognition'.
Instructional (directed) named entity recognition followed by instructional (directed) entity alignment. The prompts in the first category directly target concept recognition by requesting HPO IDs. As a task, concept recognition can also be modelled as named entity recognition (used to detect entity boundaries in the text) followed by entity alignment (used to match the candidates extracted from the text to ontology concepts / IDs). Prompt sets 5 and 6 below explicitly perform this two-step process by first asking the model to extract phenotypes, then using this output as input to align the text to HPO IDs. Prompt set 6 is a subset of Prompt set 5 - 'phenotypes' vs 'phenotypes and clinical abnormalities'.
• Part 1: Examples: The Human Phenotype Ontology defines phenotype concepts using the following label - HPO ID associations: Hypospadias // HP:0000047 …
• Part 2: Task: Using the list above, find Human Phenotype Ontology concepts in the following text and return their associated IDs for every appearance in the text.
Part 1 was completed by adding the label - HPO ID pairs for all HPO concepts present in the gold standard corpus. Attempts were made to include the entire ontology, or to include the labels and all synonyms for the desired HPO concepts; however, they failed due to the model limitations on the size of the input content. We do, however, demonstrate the impact of using various sets of concepts to underpin the few-shot learning task. A complete example of Prompt 7 is provided in Appendix 1 in the Supplementary material.
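The two-part template can be assembled programmatically. The sketch below is illustrative (the helper name and formatting details are assumptions, not the authors' code); it takes a list of (label, HPO ID) pairs and the target text:

```python
def build_few_shot_prompt(concepts, text):
    """Assemble the two-part few-shot prompt: examples first, then the task.

    `concepts` is a list of (label, hpo_id) pairs,
    e.g. ("Hypospadias", "HP:0000047").
    """
    examples = "\n".join(f"{label} // {hpo_id}" for label, hpo_id in concepts)
    part1 = ("Examples: The Human Phenotype Ontology defines phenotype concepts "
             "using the following label - HPO ID associations:\n" + examples)
    part2 = ("Task: Using the list above, find Human Phenotype Ontology concepts "
             "in the following text and return their associated IDs for every "
             "appearance in the text:\n```" + text + "```")
    return part1 + "\n\n" + part2
```

At a few dozen bytes per label - ID pair, listing all >16,500 HPO terms in Part 1 would exceed the models' input limits, which is consistent with the restriction to gold standard concepts described above.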

Experimental setup
Experiments were conducted by calling the GPT models using the OpenAI API (https://platform.openai.com/docs/api-reference). Each call used one of the seven prompts discussed above and the text corresponding to each abstract or examination entry, individually, as user input. The results were stored individually and HPO concepts were extracted and associated with the PMID / entry ID corresponding to the text used as input. The code used to annotate the corpora and perform the evaluation is available at: https://github.com/tudorgroza/code-for-papers.
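A rough sketch of this annotation loop is shown below, assuming the openai Python SDK (v1 interface). The model name, prompt wording and the regex-based ID extraction are illustrative assumptions, not the authors' exact code:

```python
import re

HPO_ID = re.compile(r"HP:\d{7}")

def extract_hpo_ids(model_output):
    """Pull every HPO ID mention out of the model's free-text response."""
    return HPO_ID.findall(model_output)

def annotate(model, prompt, text):
    """Send one prompt + document to the chat API and collect HPO IDs."""
    from openai import OpenAI  # deferred so extract_hpo_ids works without the SDK
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt + "\n```" + text + "```"}],
    )
    return extract_hpo_ids(resp.choices[0].message.content)

# e.g. annotate("gpt-4", "Extract phenotypes and clinical abnormalities ...",
#               "ABDOMEN: Small umbilical hernia. Mild distention. Soft.")
```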
The standard evaluation procedure for concept recognition covers two aspects: (i) boundary detection - i.e., a correct alignment of the boundaries of the concepts in text, usually by matching the offsets of the corresponding text span to the offsets found by the system being evaluated; and (ii) concept mapping - i.e., a correct matching of the ID of the concept against that provided by the gold standard. The boundary detection step proved to be challenging to evaluate accurately with the results produced by the OpenAI GPT models - an aspect also documented by Chen et al. [19].
Consequently, given our focus on understanding the utility of these models to support manual phenotype annotation / curation, we relaxed the evaluation procedure to include only the second step -i.e., concept mapping.A correct match was, therefore, counted if the HPO ID present in the gold standard was found at least once by the LLM.
The evaluation metrics used in this experiment are the standard for the task: precision, recall and F1. These were computed at both macro and micro levels. The macro level defines a true positive when a desired HPO ID is found at least once by the LLM, while the micro level keeps track of all occurrences of the HPO ID in a particular abstract and defines a true positive when each individual occurrence is found by the LLM. Tables 1 and 2 list the experimental results achieved by both models across all seven prompts on HPO-GS and BIOC-GS respectively, while Table 3 lists, as a reference point, the results of the state-of-the-art methods for phenotype concept recognition. Below we discuss the main findings emerging from these results:
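The two evaluation levels can be sketched as follows. This is a simplified reimplementation under the relaxed, offset-free matching described above; the exact counting in the authors' repository may differ:

```python
from collections import Counter

def prf(tp, n_pred, n_gold):
    """Precision, recall and F1 from raw counts."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_scores(gold_docs, pred_docs):
    """Document-level: a gold HPO ID counts as found if predicted at least once."""
    tp = n_pred = n_gold = 0
    for gold, pred in zip(gold_docs, pred_docs):
        g, p = set(gold), set(pred)
        tp += len(g & p)
        n_pred += len(p)
        n_gold += len(g)
    return prf(tp, n_pred, n_gold)

def micro_scores(gold_docs, pred_docs):
    """Mention-level: every occurrence of an HPO ID must be matched."""
    tp = n_pred = n_gold = 0
    for gold, pred in zip(gold_docs, pred_docs):
        g, p = Counter(gold), Counter(pred)
        tp += sum((g & p).values())  # multiset intersection
        n_pred += sum(p.values())
        n_gold += sum(g.values())
    return prf(tp, n_pred, n_gold)
```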

• The few-shot learning strategy achieves results comparable to or better than the state of the art. Phenotype concept recognition is known to be a difficult task - as showcased by the F1 scores listed in Table 3, which are roughly 0.2 lower than those of other domain-specific concept recognition tasks, such as gene or drug names. The GPT models perform significantly lower than the state of the art in most cases, with the micro-level evaluation F1 scores being half the values of tools such as PhenoTagger or the Monarch Annotator (note that the latter does not rely on a BERT-based architecture, or, in general, on an LLM-based or neural network-based architecture). The few-shot learning strategy, however, showcases the power of generative models. While on HPO-GS the results are comparable to the state of the art, on BIOC-GS gpt-4.0 surpasses the best in class by a significant margin of almost 0.1 (0.7 F1 on micro-level evaluation compared to 0.61 F1 for PhenoTagger).
Although the setup employed for this strategy would not serve phenotype concept recognition in general, it would support manual annotation in a clearly defined domain, e.g., cardiovascular diseases. Additional experiments are described in the following section.
• Both models have a consistent behaviour across prompts. Prompts 1 and 2 - defining the task as an extraction of phenotypes and clinical abnormalities, followed by an alignment to HPO IDs - achieve the best precision, with the increased focus of Prompt 2 leading to better results when using gpt-3.5 (although this change had no effect on gpt-4.0). Similarly, Prompt 7 (few-shot learning) achieved the best recall - which was expected since the examples included all concepts present in the gold standard.
• Macro and micro-level evaluation results show significant discrepancies. While some differences were expected, the micro-level experimental results were surprisingly lower than the macro-level results on HPO-GS. This could be attributed to the variability of the lexical representations of the concepts in text - e.g., Brachydactyly C vs Brachydactyly, type C. This outcome does not hold on BIOC-GS, which seems to be more uniform.

Hallucinations
Hallucinations represent the generation of text that is inaccurate, nonsensical, or irrelevant to the given context. Our experiments defined standard, community-accepted phenotype concept recognition tasks, and the evaluation targeted HPO IDs extracted by the models. Hence, in terms of hallucinations, the expectation was to find non-existing HPO IDs in the output produced by the models.
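Detecting such hallucinations amounts to a set-difference check against the ontology. A minimal sketch, assuming `valid_ids` is loaded from an HPO release file (e.g. hp.obo or hp.json):

```python
import re

# A well-formed HPO ID: the "HP:" prefix followed by seven digits
HPO_PATTERN = re.compile(r"^HP:\d{7}$")

def find_hallucinations(predicted_ids, valid_ids):
    """Split predictions into well-formed but non-existing IDs and malformed strings."""
    valid = set(valid_ids)
    hallucinated = sorted({i for i in predicted_ids
                           if HPO_PATTERN.match(i) and i not in valid})
    malformed = sorted({i for i in predicted_ids if not HPO_PATTERN.match(i)})
    return hallucinated, malformed
```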

Few-shot learning with different sets of concepts
The results achieved by Prompt 7 and discussed above relied on the same set of concepts as those present in the gold standard. To test the impact of this set on the results (which in a standard setting would be expected, since the entire ontology would be considered), we performed three additional experiments using gpt-4.0 and Prompt 7. Firstly, we used the concepts covered by HPO-GS to do few-shot learning for BIOC-GS. The results were significantly lower, the model achieving macro-level precision, recall and F1 of 0.25, 0.23 and 0.24 respectively, and micro-level metrics of 0.23, 0.2 and 0.21.
Secondly, we used the top-level profile of the two corpora (depicted in Fig. 3; i.e., the majority of the concepts describing musculo-skeletal abnormalities) to generate a random set of ~1,000 concepts. This resulted in a set comprising 1,165 HPO concepts (~43KB in size with labels and ~7% of the entire ontology) and the following overlaps with the two gold standard corpora: (i) 160 concepts overlap with BIOC-GS - i.e., 45% of BIOC-GS and 14% of the learning set; (ii) 138 concepts overlap with HPO-GS - i.e., 30% of HPO-GS and 12% of the learning set. We re-ran Prompt 7 on gpt-4.0 on both corpora and the results, as shown in Table 5, support our assumption that gpt-4.0 would be useful for annotation purposes in a domain-specific setting, without the need to use the entire set of concepts describing the domain to perform few-shot learning.

Concordance across prompts
A complete overview of the pairwise concordance of the outcomes across both models and all prompts is provided in Appendices 2 and 3 in the Supplementary material. More concretely, we recorded the percentage of common correct and incorrect HPO IDs when considering one model output as the base reference. For example, on HPO-GS, 51% of the correct HPO IDs extracted by gpt-3.5 Prompt 1 are in common with the correct HPO IDs extracted by gpt-3.5 Prompt 2, with this common set representing 93% of the total correct HPO IDs extracted by the latter.
Overall, the results vary significantly and there is no model - prompt combination that achieved a high level of agreement on both correctly and incorrectly extracted HPO IDs. A stand-out is perhaps gpt-3.5 Prompt 2, which achieves a rather consistent level of agreement with most of the other prompts on both corpora: (i) on HPO-GS - 93% correct in common with Prompt 1 (which is expected because Prompt 2 conceptually targets a subset of Prompt 1), 87% with Prompt 7, 84% with gpt-4.0 Prompt 1; (ii) on BIOC-GS - 97% correct in common with Prompt 1, 81% with Prompt 7 and over 80% with all gpt-4.0 experiments except for Prompts 5 and 6.
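The pairwise figures above reduce to simple set arithmetic. A minimal sketch of how such percentages can be computed (an illustrative helper, not the authors' code):

```python
def pairwise_concordance(base_ids, other_ids):
    """Overlap of two runs' extracted HPO IDs.

    Returns (share of `base` IDs also found by `other`,
             share of `other` IDs covered by that common set).
    """
    base, other = set(base_ids), set(other_ids)
    common = base & other
    frac_of_base = len(common) / len(base) if base else 0.0
    frac_of_other = len(common) / len(other) if other else 0.0
    return frac_of_base, frac_of_other
```

In this reading, the 51% / 93% example reported above corresponds to `frac_of_base` / `frac_of_other` with gpt-3.5 Prompt 1 as the base run and Prompt 2 as the other.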
Appendix 4 in the Supplementary material lists the top 5 incorrectly extracted HPO IDs across all experiments. These HPO IDs are fairly consistent within the context of a model and completely divergent across models. For example, the most common errors of gpt-3.5 are: Decreased body weight (HP:0004325), Intellectual disability, profound (HP:0002187), Joint hypermobility (HP:0001382) and Abnormality of the nervous system (HP:0000707), while those of gpt-4.0 are: Poor wound healing (HP:0001058) and Cerebral hamartoma (HP:0009731).

Same model and prompt concordance
A final experiment was performed to understand the concordance across different runs of the same model and prompt. We ran the annotation experiment five times using gpt-4.0 and Prompt 1.
Overall, all runs achieved nearly the same precision and recall, with very minor differences (+/- 0.01). The concordance in the results produced by the runs was, however, surprisingly low. Across all runs, we found a common set of: 75.82% of all correctly identified HPO IDs; 28.09% of all incorrectly identified HPO IDs; and 86.6% of all concepts not found by the models. This shows a high level of divergence in the concept mapping errors produced by the individual runs.
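One plausible reading of these percentages is the intersection over the union of IDs across runs, applied separately to the correct, incorrect and missed sets. A minimal sketch under that assumption:

```python
def common_fraction(runs):
    """Fraction of all IDs seen across runs that appear in every single run.

    `runs` is a non-empty sequence of iterables of HPO IDs, one per run.
    """
    sets = [set(r) for r in runs]
    union = set().union(*sets)
    intersection = set.intersection(*sets)
    return len(intersection) / len(union) if union else 1.0
```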

LIMITATIONS
A summary of the limitations derived from the experiments discussed above is listed below:
• The few-shot learning strategy adopted in our experiments - while surpassing the state of the art in some cases - defeats the general purpose of open phenotype concept recognition. Due to limitations in the size of the input data, we restricted the examples to only the concepts present in the gold standard. In a real-world scenario, this set of concepts is unknown - and hence this strategy would fail. Our experiments did, however, show that ontology stratification strategies could be employed as an alternative to using the entire ontology - e.g., domain-specific selection. Cost is still a prohibitive factor for this approach. For example, the few-shot learning experiments on BIOC-GS cost USD $50, even though they used only 454 entries with an average length of 56 characters, plus the learning component of 358 unique ontology concepts (~12KB).
• The performance of the model is non-deterministic. Executing the same prompt over the same input leads to slightly different results. This is particularly challenging as it hinders the establishment of an accurate ground truth and leaves a degree of uncertainty about the completeness of the outcomes.
• The choice of wording in the prompt influences the results. While this is expected (hence the need for iterative prompt engineering), it is also remarkably challenging when considering the lack of concordance between the outcomes - as shown in Appendices 2 and 3 in the Supplementary material (e.g., prompts that have been iterated on produce HPO IDs that are not found by subsequent prompts).

CONCLUSION
This paper presents a study that assesses the capabilities of the GPT models underpinning ChatGPT to perform phenotype concept recognition, using concepts grounded in the Human Phenotype Ontology, assuming a need for manual curation / annotation of publications or clinical records. The experimental setup covered both gpt-3.5 and gpt-4.0 and a series of seven prompts, ranging from direct instructions to perform the task by name, to chains of named entity recognition followed by concept mapping, to few-shot learning. The results show that with an appropriate setup - in this case few-shot learning - these models can surpass the best-in-class tools, which use either BERT-based architectures or more classical natural language processing pipelines.
APPENDIX 1: Example of Prompt 7

Task: Using the list above, find Human Phenotype Ontology concepts in the following text and return their associated IDs for every appearance in the text: This paper is based on our experience with the Gorlin-Goltz syndrome and on data from 14 patients of the Nordwestdeutsche Kieferklinik in whom this disorder was detected, treated and followed up.
A clinical concept has been produced, with a diagnostic check list including a genetic and a dermatological routine work-up as well as a radiological survey of the jaws and skeleton. Whenever multiple basal cell carcinomas plus the typical jaw lesions are found in a patient, the diagnosis is easy. A minimum diagnostic criterion is the combination of either the skin tumours or multiple odontogenic keratocysts plus a positive family history for this disorder, bifid ribs, lamellar calcification of the falx cerebri or any one of the skeletal abnormalities typical of this syndrome. All those in whom this disorder is diagnosed or suspected should be followed up for the rest of their lives. The family should be examined and genetic counselling should be offered.

APPENDIX 2: Pairwise concordance across all models and prompts using the HPO-GS corpus

Fig 1. Simplified example of Human Phenotype Ontology concepts and their structural arrangement in the hierarchy. Solid lines denote direct parent-child relationships, while dotted lines denote ancestor-descendant relationships.

Fig 2. Label length distribution of the HPO concepts present in the gold standard corpus

Fig 3. Top-level overview of the gold standard corpus using the children of 'Phenotypic abnormality' as major categories.

• Prompt 1: Analyze the text below delimited by triple backticks and extract phenotypes and clinical abnormalities. Align the phenotypes and clinical abnormalities found to Human Phenotype Ontology IDs. List the results in a JSON format using the following structure.
• Prompt 2: Analyze the text below delimited by triple backticks, extract phenotypes and
• Prompt set 5:
  o Step 1: Analyze the text below delimited by triple backticks and extract phenotypes and clinical abnormalities. List them together with the start and end offsets.
  o Step 2: You will be provided with text delimited by triple backticks. Align the text below to Human Phenotype Ontology labels. List only the HPO concepts found.
• Prompt set 6:
  o Step 1: Analyze the text below delimited by triple backticks and extract phenotypes. List them together with the start and end offsets in the text.
  o Step 2: You will be provided with text delimited by triple backticks. Align the text below to Human Phenotype Ontology labels. List only the HPO concepts found.

Few-shot learning using a subset of HPO. The final category (Prompt 7) attempts to aid the model by providing examples of the concepts targeted for extraction. The prompt used a standard two-part template, as below:

Table 1 .
Macro and micro-level evaluation results across both models and all seven prompts on HPO-GS

Table 2 .
Macro and micro-level evaluation results across both models and all seven prompts on BIOC-GS

Table 3 .
Micro-level evaluation results of the state of the art methods for phenotype concept recognition

Table 4 .
Overview of the number of HPO IDs found in all experiments and associated hallucinations

Table 4 lists an overview of the number of concepts (identified by HPO IDs) extracted in our experiments, in addition to the number of hallucinations. It can be observed that the latter occur at insignificant levels and, as such, hallucinations do not pose challenges for this task. Some examples of hallucinations include: HP:0020115, HP:0025111, HP:0023656, HP:0031966, HP:0020019, HP:0040060. A second observation can be made with respect to Prompts 3 and 4 (instructing the model to perform the task by its name): these prompts are very prolific on HPO-GS (7,780 and 6,698 HPO IDs found, respectively), which leads to an increased recall and a lower precision.

Table 5 .
Experimental results on using a random set of concepts for few-shot learning