From: An evaluation of GPT models for phenotype concept recognition
| Tool | HPO-GS Precision | HPO-GS Recall | HPO-GS F1 | BioC-GS Precision | BioC-GS Recall | BioC-GS F1 |
|---|---|---|---|---|---|---|
| PhenoTagger [12] | 0.77 | 0.68 | 0.72 | 0.74 | 0.52 | 0.61 |
| ClinPheno [26] | 0.73 | 0.36 | 0.48 | 0.47 | 0.57 | 0.52 |
| Doc2HPO [25] | 0.81 | 0.50 | 0.62 | 0.84 | 0.29 | 0.43 |
| Monarch Annotator [9] | 0.82 | 0.50 | 0.62 | 0.47 | 0.46 | 0.46 |
| NCBO Annotator [27] | 0.66 | 0.49 | 0.56 | 0.78 | 0.41 | 0.54 |
| Best GPT without in-context learning (GPT-4, Prompt 4) | 0.32 | 0.23 | 0.27 | 0.43 | 0.46 | 0.44 |
| Best GPT-4 (Prompt 7; in-context learning) | 0.73 | 0.30 | 0.43 | 0.77 | 0.64 | 0.70 |
| Best GPT-3.5 (Prompt 7; in-context learning) | 0.28 | 0.25 | 0.26 | 0.54 | 0.49 | 0.51 |