Table 8 Distribution of tokens (upper rows) and entities (lower rows) per split

From: A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine

TOKENS Train Dev Test
Abstracts 84 855 27 957 28 433
M (SD) 282.85 (±67.66) 279.57 (±56.34) 284.33 (±88.49)
EudraCT 90 348 30 713 29 867
M (SD) 215.11 (±66.93) 219.38 (±68.04) 213.34 (±77.81)
All 175 203 58 670 58 300
M (SD) 243.34 (±75.04) 244.46 (±69.94) 242.92 (±89.41)
ENTITIES Train Dev Test
Abstracts 12 129 4092 3810
M (SD) 40.43 (±13.29) 40.92 (±13.78) 38.10 (±14.63)
EudraCT 15 972 5537 5159
M (SD) 38.03 (±14.10) 39.55 (±14.70) 36.85 (±14.90)
All 28 101 9629 8969
M (SD) 39.03 (±13.81) 40.12 (±14.31) 37.37 (±14.77)
ANAT 4023 1442 1263
M (SD) 5.59 (±4.88) 6.01 (±4.78) 5.26 (±4.61)
CHEM 5577 1840 1807
M (SD) 7.75 (±6.00) 7.67 (±6.01) 7.53 (±6.50)
DISO 7832 2716 2519
M (SD) 10.88 (±6.18) 11.32 (±6.94) 10.50 (±6.01)
PROC 10 669 3631 3380
M (SD) 14.82 (±6.91) 15.13 (±7.41) 14.08 (±7.27)
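
The counts in this table can be cross-checked arithmetically: the "All" rows should equal Abstracts plus EudraCT, and the four semantic groups (ANAT, CHEM, DISO, PROC) should add up to the total entity count in each split. The following is a minimal sketch of that check, with the figures transcribed by hand from the table. The variable and function names (`column_sums`, `split_shares`, and so on) are illustrative assumptions and are not taken from the article or its corpus tooling.

```python
# Counts transcribed from Table 8, ordered (train, dev, test).
# All names below are hypothetical helpers for illustration only.
SPLITS = ("train", "dev", "test")

tokens = {
    "Abstracts": (84_855, 27_957, 28_433),
    "EudraCT": (90_348, 30_713, 29_867),
    "All": (175_203, 58_670, 58_300),
}

entities = {
    "Abstracts": (12_129, 4_092, 3_810),
    "EudraCT": (15_972, 5_537, 5_159),
    "All": (28_101, 9_629, 8_969),
}

entity_types = {
    "ANAT": (4_023, 1_442, 1_263),
    "CHEM": (5_577, 1_840, 1_807),
    "DISO": (7_832, 2_716, 2_519),
    "PROC": (10_669, 3_631, 3_380),
}


def column_sums(rows):
    """Sum several (train, dev, test) tuples column by column."""
    return tuple(sum(col) for col in zip(*rows))


def split_shares(counts):
    """Return each split's percentage of the total, rounded to one decimal."""
    total = sum(counts)
    return {name: round(100 * n / total, 1) for name, n in zip(SPLITS, counts)}


# "All" should be the sum of the two text sources in every split.
tokens_consistent = (
    column_sums([tokens["Abstracts"], tokens["EudraCT"]]) == tokens["All"]
)
entities_consistent = (
    column_sums([entities["Abstracts"], entities["EudraCT"]]) == entities["All"]
)

# The four semantic groups should account for every annotated entity.
types_consistent = column_sums(entity_types.values()) == entities["All"]

token_shares = split_shares(tokens["All"])
entity_shares = split_shares(entities["All"])

if __name__ == "__main__":
    print("tokens consistent:  ", tokens_consistent)
    print("entities consistent:", entities_consistent)
    print("types consistent:   ", types_consistent)
    print("token shares (%):   ", token_shares)
    print("entity shares (%):  ", entity_shares)
```

With the figures as transcribed, all three consistency checks pass, and the shares come out close to a 60/20/20 train/dev/test division.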