Table 4 Strict and relaxed overall performance on the test sets of the Covance, EliIE, and Chia corpora

From: A comparative study of pre-trained language models for named entity recognition in clinical trial eligibility criteria from multiple corpora

| Models | Covance P | Covance R | Covance F1 | EliIE P | EliIE R | EliIE F1 | Chia P | Chia R | Chia F1 |
|---|---|---|---|---|---|---|---|---|---|
| BERT | 0.691 (0.810) | 0.719 (0.849) | 0.705 (0.829) | 0.810 (0.877) | 0.842 (0.917) | 0.826 (0.896) | 0.577 (0.701) | 0.620 (0.761) | 0.598 (0.730) |
| SpanBERT | 0.692 (0.810) | 0.718 (0.847) | 0.705 (0.828) | 0.813 (0.879) | 0.843 (0.917) | 0.828 (0.897) | 0.593 (0.711) | 0.628 (0.758) | 0.610 (0.734) |
| BioBERT | 0.694 (0.812) | 0.722 (0.851) | 0.708 (0.831) | 0.810 (0.879) | 0.837 (0.915) | 0.823 (0.896) | 0.589 (0.707) | 0.632 (0.765) | 0.609 (0.735) |
| BlueBERT | 0.689 (0.807) | 0.718 (0.848) | 0.703 (0.827) | 0.811 (0.880) | 0.838 (0.917) | 0.824 (0.898) | 0.590 (0.702) | 0.616 (0.737) | 0.603 (0.719) |
| PubMedBERT | 0.704 (0.820) | 0.727 (0.851) | 0.715* (0.835) | 0.817 (0.881) | 0.847 (0.920) | 0.832* (0.900) | 0.606 (0.724) | 0.639 (0.765) | 0.622* (0.744) |
| SciBERT | 0.696 (0.813) | 0.723 (0.850) | 0.709 (0.831) | 0.813 (0.883) | 0.839 (0.915) | 0.825 (0.899) | 0.589 (0.709) | 0.634 (0.768) | 0.611 (0.737) |

  1. Bold values are the strict F1 scores that were compared using the Wilcoxon rank sum test. This is a non-parametric test that determines, based on ranks rather than the original F1 scores, whether the means of the strict F1 scores from the 10-fold experiments of the PubMedBERT model and of each other model (BERT, SpanBERT, BioBERT, SciBERT) are statistically different. A detailed definition of the Wilcoxon rank sum test can be found in reference [33] of the manuscript
  2. Numbers in parentheses are results based on the relaxed criteria
  3. * indicates p < 0.05 when compared with the other pre-trained models
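
As an illustration of the comparison described in note 1, the following is a minimal sketch of a two-sided Wilcoxon rank sum test using the normal approximation, written with the Python standard library only. The per-fold scores are invented placeholders, not the study's fold-level results (which are not reported on this page), and the names `average_ranks` and `rank_sum_test` are hypothetical. The sketch omits tie and continuity corrections, so it is not necessarily the exact procedure the authors ran.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation, no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                             # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2                 # expected rank sum under H0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # its standard deviation under H0
    z = (w - mean_w) / sd_w
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical strict F1 scores from 10 folds for two models (placeholder data).
folds_model_a = [0.712, 0.718, 0.709, 0.721, 0.715, 0.711, 0.719, 0.714, 0.717, 0.713]
folds_model_b = [0.704, 0.708, 0.701, 0.707, 0.703, 0.706, 0.702, 0.705, 0.700, 0.7085]

z, p = rank_sum_test(folds_model_a, folds_model_b)
print(f"z = {z:.3f}, p = {p:.5f}, significant at 0.05: {p < 0.05}")
```

Because the test uses only the ranks of the pooled fold scores, a single unusually high or low fold affects the result less than it would in a test on the raw values. For real analyses, an established implementation such as `scipy.stats.ranksums` is preferable to this sketch.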