Table 8 Distribution of tokens (upper rows) and entities (lower rows) per split

From: A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine

TOKENS

| | Train | Dev | Test |
|---|---|---|---|
| Abstracts | 84 855 | 27 957 | 28 433 |
| M (SD) | 282.85 (±67.66) | 279.57 (±56.34) | 284.33 (±88.49) |
| EudraCT | 90 348 | 30 713 | 29 867 |
| M (SD) | 215.11 (±66.93) | 219.38 (±68.04) | 213.34 (±77.81) |
| All | 175 203 | 58 670 | 58 300 |
| M (SD) | 243.34 (±75.04) | 244.46 (±69.94) | 242.92 (±89.41) |

ENTITIES

| | Train | Dev | Test |
|---|---|---|---|
| Abstracts | 12 129 | 4092 | 3810 |
| M (SD) | 40.43 (±13.29) | 40.92 (±13.78) | 38.10 (±14.63) |
| EudraCT | 15 972 | 5537 | 5159 |
| M (SD) | 38.03 (±14.10) | 39.55 (±14.70) | 36.85 (±14.90) |
| All | 28 101 | 9629 | 8969 |
| M (SD) | 39.03 (±13.81) | 40.12 (±14.31) | 37.37 (±14.77) |
| ANAT | 4023 | 1442 | 1263 |
| M (SD) | 5.59 (±4.88) | 6.01 (±4.78) | 5.26 (±4.61) |
| CHEM | 5577 | 1840 | 1807 |
| M (SD) | 7.75 (±6.00) | 7.67 (±6.01) | 7.53 (±6.50) |
| DISO | 7832 | 2716 | 2519 |
| M (SD) | 10.88 (±6.18) | 11.32 (±6.94) | 10.50 (±6.01) |
| PROC | 10 669 | 3631 | 3380 |
| M (SD) | 14.82 (±6.91) | 15.13 (±7.41) | 14.08 (±7.27) |
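
The counts in Table 8 can be cross-checked arithmetically: the Abstracts and EudraCT rows should add up to the All row, and the four entity types (ANAT, CHEM, DISO, PROC) should add up to the All entities row. The following is a minimal sketch of that check, with the counts transcribed by hand from the table. The variable and function names (`tokens`, `entities`, `entity_types`, `column_sums`, `split_shares`) are illustrative assumptions and are not part of the corpus or its tooling.

```python
# Counts transcribed from Table 8, ordered as (Train, Dev, Test).
tokens = {
    "Abstracts": (84855, 27957, 28433),
    "EudraCT": (90348, 30713, 29867),
    "All": (175203, 58670, 58300),
}
entities = {
    "Abstracts": (12129, 4092, 3810),
    "EudraCT": (15972, 5537, 5159),
    "All": (28101, 9629, 8969),
}
entity_types = {
    "ANAT": (4023, 1442, 1263),
    "CHEM": (5577, 1840, 1807),
    "DISO": (7832, 2716, 2519),
    "PROC": (10669, 3631, 3380),
}


def column_sums(rows):
    """Sum several (Train, Dev, Test) tuples column by column."""
    return tuple(sum(column) for column in zip(*rows))


def split_shares(row):
    """Return the fraction of a row's total that falls in each split."""
    total = sum(row)
    return tuple(value / total for value in row)


if __name__ == "__main__":
    # Sub-corpus totals should equal the "All" rows.
    print(column_sums([tokens["Abstracts"], tokens["EudraCT"]]) == tokens["All"])
    print(column_sums([entities["Abstracts"], entities["EudraCT"]]) == entities["All"])
    # Per-type entity counts should also equal the "All" entities row.
    print(column_sums(entity_types.values()) == entities["All"])
    # Share of tokens in Train, Dev, and Test.
    print(split_shares(tokens["All"]))
```

All three comparisons hold for the values in the table, so the reported totals are internally consistent.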