
Table 2 Corpora descriptive statistics for characters, words, and tokens

From: Development of a machine learning model to predict mild cognitive impairment using natural language processing in the absence of screening

| Corpus | Num. chars (mean) | Num. chars (max) | Num. chars (min) | Num. words (mean) | Num. words (max) | Num. words (min) | Num. tokens (mean) | Num. tokens (max) | Num. tokens (min) |
|---|---|---|---|---|---|---|---|---|---|
| ACT (training) | 1229.6 | 52,491.0 | 0 | 216.9 | 9350.0 | 0 | 260.8 | 10,946.0 | 0 |
| Gen. Pop. (training) | 1324.9 | 76,831.0 | 0 | 233.0 | 15,029.0 | 0 | 276.9 | 17,422.0 | 0 |
| Gen. Pop. (validation) | 1118.7 | 58,080.0 | 0 | 196.9 | 9588.0 | 0 | 234.9 | 11,251.0 | 0 |
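
As a rough illustration of how per-corpus statistics like those in Table 2 might be computed, the following minimal Python sketch derives the mean, maximum, and minimum character, word, and token counts over a list of documents. It rests on assumptions not stated in the table: words are taken to be whitespace-delimited, and `simple_tokenize` is a hypothetical regex tokenizer standing in for whatever tokenizer the study actually used. The function name `corpus_stats` and the sample notes are also illustrative only.

```python
import re
from statistics import mean


def simple_tokenize(text):
    # Hypothetical stand-in tokenizer (an assumption, not the study's method):
    # each run of word characters and each punctuation mark is one token.
    return re.findall(r"\w+|[^\w\s]", text)


def corpus_stats(documents):
    # Per-document counts for each unit of measurement.
    counts = {
        "chars": [len(doc) for doc in documents],
        "words": [len(doc.split()) for doc in documents],  # whitespace-delimited words
        "tokens": [len(simple_tokenize(doc)) for doc in documents],
    }
    # Summarise each list of counts as mean (rounded to 1 decimal), max, and min.
    return {
        unit: {"mean": round(mean(values), 1), "max": max(values), "min": min(values)}
        for unit, values in counts.items()
    }


# Illustrative notes only; the empty string mimics a document of length 0.
notes = [
    "Patient reports memory loss.",
    "",
    "Follow-up in 6 months; no new concerns.",
]
stats = corpus_stats(notes)
for unit, summary in stats.items():
    print(unit, summary)
```

With this stand-in tokenizer, punctuation is split from words, so token counts are at least as large as word counts, which matches the pattern in the table. An empty document yields a minimum of 0 for every unit, as in the Min columns. The sketch assumes a non-empty list of documents, since `mean` raises an error on an empty list.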