Table 4 Count of sentences, tokens and annotated entities

From: A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine

 

| | Abstracts | EudraCT | Total |
|---|---|---|---|
| Texts | 500 | 700 | 1200 |
| Sentences | 7160 | 11 995 | 19 155 |
| M (SD) | 14.32 (±4.24) | 17.14 (±5.24) | 15.96 (±5.04) |
| Annotated sentences | 5444 | 8607 | 14 051 |
| M (SD) | 10.89 (±3.00) | 12.29 (±4.63) | 11.71 (±4.09) |
| Tokens | 141 245 | 150 928 | 292 173 |
| M (SD) | 282.49 (±70.21) | 215.61 (±69.38) | 243.48 (±77.11) |
| Entities | 20 031 | 26 668 | 46 699 |
| M (SD) | 40.06 (±13.67) | 38.10 (±14.39) | 38.92 (±14.12) |
| Nested entities | 2613 (13.04%) | 3914 (14.68%) | 6527 (13.98%) |
| Normalized to UMLS CUIs | 13 627 (68.03%) | 19 382 (72.68%) | 33 009 (70.68%) |
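
The derived figures in Table 4 can be checked from the raw counts. The sketch below is not part of the original article. It assumes that each M value is the mean per text (count divided by the number of texts) and that each percentage is taken relative to the entity count of the same column. The names `counts` and `summarize` are hypothetical, and the standard deviations are not reproduced because they require per-text counts that the table does not give.

```python
# Raw counts copied from Table 4 (thousands separators removed).
counts = {
    "Abstracts": {"texts": 500, "sentences": 7160, "tokens": 141245,
                  "entities": 20031, "nested": 2613, "normalized": 13627},
    "EudraCT": {"texts": 700, "sentences": 11995, "tokens": 150928,
                "entities": 26668, "nested": 3914, "normalized": 19382},
}

# The Total column is the field-wise sum of the two subcorpora.
counts["Total"] = {
    key: counts["Abstracts"][key] + counts["EudraCT"][key]
    for key in counts["Abstracts"]
}


def summarize(c):
    """Derive per-text means and entity percentages from raw counts.

    Assumption: M = count / texts; percentages are relative to all entities.
    """
    return {
        "sentences_per_text": round(c["sentences"] / c["texts"], 2),
        "tokens_per_text": round(c["tokens"] / c["texts"], 2),
        "entities_per_text": round(c["entities"] / c["texts"], 2),
        "nested_pct": round(100 * c["nested"] / c["entities"], 2),
        "normalized_pct": round(100 * c["normalized"] / c["entities"], 2),
    }


for name, c in counts.items():
    print(name, summarize(c))
```

Under these assumptions, the computed values match the means and percentages reported for sentences, tokens, entities, nested entities, and normalized entities in all three columns.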