An evaluation of GPT models for phenotype concept recognition

Table 1 Document and mention-level evaluation results across both models and all seven prompts on HPO-GS

GPT version	Prompt	Document-level			Mention-level
		Precision	Recall	F1	Precision	Recall	F1
3.5	1	0.45	0.21	0.29	0.39	0.14	0.20
	2	0.51	0.12	0.19	0.46	0.08	0.13
	3	0.12	0.25	0.16	0.05	0.15	0.07
	4	0.12	0.28	0.16	0.07	0.17	0.10
	5	0.14	0.09	0.11	0.14	0.06	0.08
	6	0.30	0.13	0.18	0.29	0.08	0.12
	7	0.41	0.41	0.41	0.28	0.25	0.26
4	1	0.41	0.34	0.37	0.36	0.21	0.26
	2	0.41	0.34	0.37	0.36	0.21	0.26
	3	0.37	0.31	0.33	0.34	0.19	0.24
	4	0.34	0.38	0.35	0.32	0.23	0.27
	5	0.31	0.22	0.25	0.26	0.13	0.17
	6	0.35	0.17	0.22	0.29	0.10	0.15
	7	0.75	0.47	0.58	0.73	0.30	0.43
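
The F1 columns in Table 1 can be read as the harmonic mean of the precision and recall beside them. The sketch below is illustrative only: it assumes the standard F1 definition, which this excerpt does not state explicitly, and the helper name `f1_score` is hypothetical. Because the table reports values rounded to two decimals, recomputing F1 from the rounded precision and recall can differ from the printed F1 by about 0.01 on some rows.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall for GPT-4, prompt 7, copied from Table 1.
doc_f1 = f1_score(0.75, 0.47)      # document-level
mention_f1 = f1_score(0.73, 0.30)  # mention-level

print(round(doc_f1, 2), round(mention_f1, 2))
```

For this row the recomputed values round to the F1 figures printed in the table (0.58 and 0.43).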

ISSN: 1472-6947