Table 4 Count of sentences, tokens and annotated entities

From: A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine

| | Abstracts | EudraCT | Total |
|---|---|---|---|
| Texts | 500 | 700 | 1200 |
| Sentences | 7160 | 11 995 | 19 155 |
| Sentences per text, M (SD) | 14.32 (±4.24) | 17.14 (±5.24) | 15.96 (±5.04) |
| Annotated sentences | 5444 | 8607 | 14 051 |
| Annotated sentences per text, M (SD) | 10.89 (±3.00) | 12.29 (±4.63) | 11.71 (±4.09) |
| Tokens | 141 245 | 150 928 | 292 173 |
| Tokens per text, M (SD) | 282.49 (±70.21) | 215.61 (±69.38) | 243.48 (±77.11) |
| Entities | 20 031 | 26 668 | 46 699 |
| Entities per text, M (SD) | 40.06 (±13.67) | 38.10 (±14.39) | 38.92 (±14.12) |
| Nested entities | 2613 (13.04%) | 3914 (14.68%) | 6527 (13.98%) |
| Normalized to UMLS CUIs | 13 627 (68.03%) | 19 382 (72.68%) | 33 009 (70.68%) |
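
The derived figures in the table follow from the raw counts: each M value is a count divided by the number of texts, and each percentage is a share of the entity count in the same column. The sketch below recomputes them as a consistency check. The variable and function names are illustrative and do not come from the article, and the standard deviations are left out because the per-text counts needed to compute them are not in the table. The recomputed values agree with the table to within 0.01.

```python
# Counts copied from Table 4, in column order: Abstracts, EudraCT, Total.
# All names here are illustrative; they are not taken from the article.
COLUMNS = ("Abstracts", "EudraCT", "Total")
texts = (500, 700, 1200)
counts = {
    "Sentences": (7160, 11995, 19155),
    "Annotated sentences": (5444, 8607, 14051),
    "Tokens": (141245, 150928, 292173),
    "Entities": (20031, 26668, 46699),
}
nested = (2613, 3914, 6527)
normalized = (13627, 19382, 33009)


def per_text_means(row):
    """Mean per text: the count in each column divided by that column's texts."""
    return [count / n_texts for count, n_texts in zip(row, texts)]


def share_of_entities(row):
    """Percentage of the annotated entities in the same column."""
    return [100 * count / total for count, total in zip(row, counts["Entities"])]


means = {name: per_text_means(row) for name, row in counts.items()}
nested_pct = share_of_entities(nested)
normalized_pct = share_of_entities(normalized)

# The Total column should equal Abstracts + EudraCT for every count row.
totals_consistent = all(a + b == t for a, b, t in counts.values())

if __name__ == "__main__":
    for name, row in means.items():
        print(name, [f"{value:.2f}" for value in row])
    print("Nested %", [f"{value:.2f}" for value in nested_pct])
    print("Normalized %", [f"{value:.2f}" for value in normalized_pct])
    print("Totals consistent:", totals_consistent)
```

The same check can be applied to any other row by adding it to `counts`.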