Table 3 Number of tokens/unique tokens per data set and split

From: Classifying patient and professional voice in social media health posts

 

| Domain | Split | Reddit | Twitter | Both data sources |
| --- | --- | --- | --- | --- |
| Cardiovascular | Train | 831,169/26,037 | 119,087/16,118 | 950,256/34,998 |
| Cardiovascular | Test | 211,486/13,302 | 30,257/6,729 | 241,743/17,094 |
| Skin | Train | 1,159,225/29,176 | 98,410/13,639 | 1,257,635/35,854 |
| Skin | Test | 290,227/14,201 | 24,337/5,483 | 314,564/16,731 |
| Both domains | Train | 1,990,394/43,390 | 217,497/25,441 | 2,207,891/57,118 |
| Both domains | Test | 501,713/21,779 | 54,594/10,201 | 556,307/26,444 |
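
In Table 3, the total token counts add up across data sources (for example, 831,169 + 119,087 = 950,256 for the cardiovascular training split), while the unique token counts do not (26,037 + 16,118 exceeds 34,998), presumably because the Reddit and Twitter vocabularies overlap. The following is a minimal sketch of how such counts could be computed. It assumes lowercased whitespace tokenization, which may differ from the tokenizer used in the paper; the function name `token_stats` and the sample posts are hypothetical.

```python
from collections import Counter


def token_stats(texts):
    """Return (total tokens, unique tokens) for an iterable of strings.

    Assumption: lowercased whitespace tokenization, used here only for
    illustration; the paper's actual tokenizer is not specified in the table.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return sum(counts.values()), len(counts)


# Hypothetical toy posts, not taken from the data sets.
reddit = ["my heart rate is high", "heart surgery went well"]
twitter = ["high heart rate today"]

reddit_total, reddit_unique = token_stats(reddit)
twitter_total, twitter_unique = token_stats(twitter)
both_total, both_unique = token_stats(reddit + twitter)

# Totals are additive; unique counts are not, because shared words
# such as "heart" are counted once in the combined vocabulary.
print(f"Reddit:  {reddit_total}/{reddit_unique}")
print(f"Twitter: {twitter_total}/{twitter_unique}")
print(f"Both:    {both_total}/{both_unique}")
```

The combined unique count is bounded below by the larger of the two per-source vocabularies and above by their sum, which matches the pattern in every row of the table.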