From: An evaluation of GPT models for phenotype concept recognition
| Tool | HPO-GS Precision | HPO-GS Recall | HPO-GS F1 | BioC-GS Precision | BioC-GS Recall | BioC-GS F1 |
|---|---|---|---|---|---|---|
| PhenoTagger [12] | 0.77 | 0.68 | 0.72 | 0.74 | 0.52 | 0.61 |
| ClinPheno [26] | 0.73 | 0.36 | 0.48 | 0.47 | 0.57 | 0.52 |
| Doc2HPO [25] | 0.81 | 0.50 | 0.62 | 0.84 | 0.29 | 0.43 |
| Monarch Annotator [9] | 0.82 | 0.50 | 0.62 | 0.47 | 0.46 | 0.46 |
| NCBO Annotator [27] | 0.66 | 0.49 | 0.56 | 0.78 | 0.41 | 0.54 |
| Best GPT without in-context learning (GPT-4, Prompt 4) | 0.32 | 0.23 | 0.27 | 0.43 | 0.46 | 0.44 |
| Best GPT-4 (Prompt 7; in-context learning) | 0.73 | 0.30 | 0.43 | 0.77 | 0.64 | 0.70 |
| Best GPT-3.5 (Prompt 7; in-context learning) | 0.28 | 0.25 | 0.26 | 0.54 | 0.49 | 0.51 |