Table 2 BioNLP corpora in Spanish

From: A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine

| Corpus | Text type and size | Annotated entities (count) |
| --- | --- | --- |
| MultiMedica [27] | Technical/popularizing texts; 4204 in Spanish, >4M tokens | No entities annotated, only part-of-speech |
| MANTRA corpus [28] | Multilingual; in Spanish, texts from EMA (100; 1961 tokens) & Medline (100; 1087 tokens) | UMLS semantic types and CUIs; 5530 total annotations (756 in Spanish) |
| IxaMedGS [29] | 75 clinical reports (41 633 tokens) | Disease (2766), Drug (1191) and adverse drug reactions relations (228) |
| Spanish ADR [30] | 397 texts from ForumClinic (26 519 tokens) | Drugs (187) and adverse drug reactions (636) |
| Drug Semantics [31] | 30 texts from Summaries of Product Characteristics (226 729 tokens) | Disease (724), Drug (657), Measurement (557), Excipient (66), Composition (62), Dose Form (45), Route (42), Medicament (37), Food (31), Therapeutic Action (20) |
| IULA-SCRC [32] | 3194 sentences from 300 anonymized clinical records | Body part (7), Substance (14), Finding (1064), Procedure (93), Negation (1207) |
| Cotik et al. [33] | 513 radiology reports | Anatomy (4398), Finding (2637), Location (722), Measure (3210), Texture (1890), Measure Type (1127), Negation (1207), Uncertainty (109), Abbreviation (880), Temporal (35), Multiword (788); 9 relation types (10 987) |
| BARR2 [34] | 3563 report cases (1 433 685 tokens) | Abbreviations, acronyms and expanded terms (9552 annotations) |
| SPACCC [35] | 1000 clinical cases published in journals from SciELO (396 988 tokens) | PharmaCoNER: Proteins (3009), Normalizable to SNOMED CT (4398), Not-normalizable (50), Unclear (167). CODIESP: 18 483 ICD-10 codes |
| eHealth Discovery | 1173 Spanish health-related sentences from MedlinePlus | Entities (7188), Roles (3586) and 4 types of relations (2339) |
| NUBes [41] | 29 682 sentences from 7019 anonymized EHRs | Negation (7567 sentences) and Speculation (2219 sentences) |
| CWLC [42] | 1912 sentences (36 157 tokens) from 900 referrals | 9029 entities (Symptom, Diagnostic, Therapeutic or Laboratory Procedure, Family Member, Disease, Body part, Medication, Result, Abbreviation), 385 attributes (5 types), 284 relations |