Table 1 Summary of Testing and Training Data Available for Algorithm Development

From: Building a tobacco user registry by extracting multiple smoking behaviors from clinical notes

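| Smoking status¹ | i2b2 smoking status: Train (n = 398) | i2b2 smoking status: Test (n = 104) | Local EHR smoking status: Train (N = 533) | Local EHR smoking status: Test (N = 223) | Local EHR pack years: Train (N = 84) | Local EHR pack years: Test (N = 36) | Local EHR cessation date: Train (N = 54) | Local EHR cessation date: Test (N = 19) |
|---|---|---|---|---|---|---|---|---|
| Never | 66 | 16 | 117 | 51 | – | – | – | – |
| Ever | 80 | 25 | 139 | 64 | 84 | 26 | 54 | 19 |
| Former | 36 | 11 | 71 | 30 | 38 | 12 | 54 | 19 |
| Current | 35 | 11 | 58 | 31 | 39 | 23 | – | – |
| Smoker | 9 | 3 | 10 | 3 | 7 | 1 | – | – |
| Unknown | 252 | 63 | 277 | 108 | – | – | – | – |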

  1. Distribution of annotations for smoking status, pack years, and cessation date in the training and testing data from the i2b2 Challenge and our local EHR. Smoking status was determined by manual review, with notes classified as never smoker, former smoker, current smoker, smoker with unknown temporality (referred to as "smoker"), or no smoking status information (referred to as "unknown"). For the local EHR pack-year and cessation-date counts, we indicate the number of notes for which this information was identified by manual review.