An evaluation of GPT models for phenotype concept recognition

Table 2 Document- and mention-level evaluation results for both GPT models across all seven prompts on BIOC-GS

GPT version	Prompt	Document-level			Mention-level		
		Precision	Recall	F1	Precision	Recall	F1
3.5	1	0.51	0.12	0.19	0.50	0.11	0.18
	2	0.68	0.05	0.09	0.68	0.05	0.09
	3	0.27	0.29	0.28	0.26	0.25	0.25
	4	0.26	0.33	0.29	0.22	0.29	0.25
	5	0.31	0.20	0.24	0.30	0.17	0.22
	6	0.31	0.20	0.24	0.30	0.17	0.22
	7	0.56	0.56	0.56	0.54	0.49	0.51
4	1	0.46	0.44	0.45	0.45	0.39	0.42
	2	0.44	0.44	0.44	0.43	0.38	0.40
	3	0.47	0.43	0.45	0.47	0.37	0.41
	4	0.43	0.53	0.47	0.43	0.46	0.44
	5	0.44	0.27	0.33	0.43	0.24	0.31
	6	0.44	0.27	0.33	0.43	0.24	0.31
	7	0.78	0.73	0.75	0.77	0.64	0.70
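
The F1 columns in Table 2 are consistent with the standard definition of F1 as the harmonic mean of precision and recall. The table does not state how F1 was computed, so the sketch below is an assumption: it recomputes F1 from the rounded document-level precision and recall for a few rows. The function name `f1_score` and the `rows` dictionary are illustrative and do not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition, assumed here)."""
    if precision + recall == 0:
        return 0.0  # common convention: F1 is 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Document-level (precision, recall) pairs copied from Table 2.
# Keys are (GPT version, prompt number).
rows = {
    ("3.5", 1): (0.51, 0.12),
    ("3.5", 7): (0.56, 0.56),
    ("4", 1): (0.46, 0.44),
    ("4", 7): (0.78, 0.73),
}

for (model, prompt), (p, r) in rows.items():
    print(f"GPT-{model} prompt {prompt}: F1 = {f1_score(p, r):.2f}")
```

For these four rows the recomputed values round to 0.19, 0.56, 0.45 and 0.75, matching the document-level F1 column. Because the inputs are themselves rounded to two decimals, a recomputed value for other rows could differ from the published one in the last digit.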

ISSN: 1472-6947