Fig. 13 | BMC Medical Informatics and Decision Making


From: Exploring the potential of ChatGPT in medical dialogue summarization: a study on consistency with human preferences

Fig. 13

Human evaluation of 100 summaries generated by the ChatGPT, BART, and BERTSUM models, with average scores on four evaluation metrics: Contains Key Result, Coherence, Usefulness, and Readability. Sub-figures (a) and (b) show that summaries generated by ChatGPT achieved favorable results on the human evaluation metrics, especially under the Prompt_T condition, with a substantial proportion of “Strongly Agree” ratings across all metrics. Sub-figure (c) indicates that the BART model performed poorly on the human evaluation metrics, except for the “Readability” metric. Sub-figure (d) shows that the BERTSUM model performed very poorly across all metrics, with ratings falling almost entirely into “Strongly Disagree”.
