Skip to main content
  • Research article
  • Open access
  • Published:

Methods to improve the quality of smoking records in a primary care EMR database: exploring multiple imputation and pattern-matching algorithms



Primary care electronic medical record (EMR) data are emerging as a useful source for secondary uses, such as disease surveillance, health outcomes research, and practice improvement. These data capture clinical details about patients’ health status, as well as behavioural risk factors, such as smoking. While the importance of documenting smoking status in a healthcare setting is recognized, the quality of smoking data captured in EMRs is variable. This study was designed to test methods aimed at improving the quality of patient smoking information in a primary care EMR database.


EMR data from community primary care settings extracted by two regional practice-based research networks in Alberta, Canada were used. Patients with at least one encounter in the previous 2 years (2016–2018) and having hypertension according to a validated definition were included (n = 48,377). Multiple imputation was tested under two different assumptions for missing data (smoking status is missing at random and missing not-at-random). A third method tested a novel pattern matching algorithm developed to augment smoking information in the primary care EMR database. External validity was examined by comparing the proportions of smoking categories generated in each method with a general population survey.


Among those with hypertension, 40.8% (n = 19,743) had either no smoking information recorded or it was not interpretable and considered missing. Those with missing smoking data differed statistically by demographics, clinical features, and type of EMR system used in the clinic. Both multiple imputation methods produced fully complete smoking status information, with the proportion of current smokers estimated at 25.3% (data missing at random) and 12.5% (data missing not-at-random). The pattern-matching algorithm classified 18.2% of patients as current smokers, similar to the population-based survey (18.9%), but still resulted in missing smoking information for 23.6% of patients. The algorithm was estimated to be 93.8% accurate overall, but varied by smoking status category.


Multiple imputation and algorithmic pattern-matching can be used to improve EMR data post-extraction but the recommended method depends on the purpose of secondary use (e.g. practice improvement or epidemiological analyses).

Peer Review reports


There has been a transformative shift towards the uptake of electronic medical record (EMR) systems in primary care settings and the subsequent re-use of these data for a variety of secondary purposes. In Canada, the Canadian Primary Care Sentinel Surveillance Network (CPCSSN) has established a source of de-identified, patient-level primary care EMR data that has been used for health research, disease surveillance, and quality improvement [1]. This has also included the development of extensive processes and algorithms to extract, transform, and standardize the raw EMR data generated from nearly 2 million patients in 7 Canadian provinces or territories [1, 2].

One distinct advantage of the CPCSSN data is the availability of behavioural risk factor information (such as smoking, alcohol use, diet and exercise) coupled with clinically-verified health measures and outcomes. However, risk factor data are often entered in unstructured fields or as free text in different areas of the EMR, which also varies by the type of EMR system being used. This creates difficulties during the data extraction process, as well as analytic challenges for data users, and also impacts the quality of these data in terms of completeness and accuracy. For instance, more than one-third of all patients over 16 years of age in the national CPCSSN database were missing smoking information, with biases likely introduced due to how smoking status was captured and for whom [3].

When conducting observational studies using EMR data, it is not uncommon to encounter missing data. Part of the challenge is determining whether the data are missing at random (where the probability of missingness depends on other observed information or data) or missing not-at-random (where the probability of missingness depends on unobserved or unknown information) [4], in addition to deciding on the most appropriate technique to use for analysis when data are missing. For some situations, it may not be recommended to use a complete case analysis (i.e. including only those who do not have missing data), as this method can reduce sample size and statistical power, lower precision of results, and produce biased findings [4]. Multiple imputation (MI) has been touted as a more accurate and less biased method, as missing data are imputed based on the distributions and variability of other data elements in the sample [4]. The use of MI to improve the recording of smoking data has been previously demonstrated in a large U.K. general practitioner database, where the authors observed that smoking data were likely missing non-randomly and had identified potential misclassification of ex-smokers who were recorded as non-smokers, particularly if a substantial period of time had passed since quitting or if quitting occurred at a younger age [5, 6].

The objective of this study is to explore methods aimed at improving the completeness of patient smoking status in the CPCSSN database, specifically focused on patients with hypertension. Two methods will test MI using different scenarios for data missing at random (MAR) or missing not at random (MNAR). The third method will test a newly-developed CPCSSN-based classifier for assigning individual smoking status based on additional free text in the smoking record.


Data source

This study used CPCSSN data collected by two practice-based research networks in Alberta, Canada (Northern Alberta Primary Care Research Network [NAPCReN], Southern Alberta Primary Care Research Network [SAPCReN]). Over 320 community-based providers (mostly family physicians with a small proportion of nurse practitioners and pediatricians) contribute de-identified EMR data from close to 400,000 patients across Alberta. Healthcare is publicly funded in Canada and typically administered by each province or territory. There is no mandatory patient registration in primary care, although patient rostering is increasingly undertaken in Alberta.

The CPCSSN data contain nearly all information entered into the patient EMR, including demographics, patient profile (e.g., medical diagnoses, history), physical measurements (e.g., blood pressure, height, weight), risk factor information, laboratory results, prescribed medications, medical procedures, referrals, and physician billing claims [1]. Additional details about the regional extraction, processing and transformation for NAPCReN and SAPCReN have been described elsewhere [7].


This analysis focused on a cohort of patients with hypertension, as guidelines recommend that tobacco use status should be updated on a regular basis for these patients [8]. Patients were identified as hypertensive if they met the validated definition developed by CPCSSN, which includes a combination of diagnostic codes related to hypertension and antihypertensive medications [9]. This definition demonstrated good validity as compared to the reference standard chart review (sensitivity = 84.9%; specificity = 93.5%) [9]. Patients with at least one encounter with their primary care provider within the previous 2 years (July 1, 2016 to June 30, 2018) were included; this is considered an acceptable indicator of ‘active’ patients in a practice, as the large majority of individuals visit a family physician at least once in a two-year period [10]. Those who were explicitly recorded as deceased or inactive according to their EMR status were excluded. Data from the entire history of the patient EMR were used for the analysis; the earliest timepoint varies by patient and depends on when a clinic integrated their EMR system into practice and when the patient first attended the clinic.

Defining smoking status

The smoking record for each patient found in the Risk Factor table of the CPCSSN database was used. Patient smoking status can be first documented at any time and updated as often or as infrequently as chosen by the patient and provider. For those patients who had more than one smoking status documented at any time throughout the history of their EMR, the most recent status was used according to the date the record was created in the clinic EMR system. Each risk factor record is comprised of several fields including: a unique CPCSSN identifier for each patient, date on which the record was entered into the EMR, risk factor name, risk factor status, and risk factor values (i.e. any quantitative or extraneous information). Records with the risk factor name coded as ‘Smoking’ or ‘Unknown’ were selected for this analysis. The ‘Status’ of these risk factor records contained the original EMR data and consisted of four categories: current, past, never, and unknown. For the imputation, all unknown statuses were recoded as missing data.

Defining demographic and clinical variables

The relevant patient characteristics and clinical values included in the multiple imputation were sex, age, rural or urban residence, body mass index (BMI), systolic blood pressure, diabetes, number of clinical encounters, and type of EMR system used in the practice. Patient residence was considered rural if the second character of the postal code was 0 and urban if the second character was any digit except 0. Most recently recorded blood pressure and body mass index (BMI) were used. BMI was included if it was deemed to be within a plausible range (between 12 and 70 kg/m2) and if the difference between subsequent per-patient measurements was within 10 kg/m2. Patients were indexed with diabetes and/or chronic obstructive pulmonary disease (COPD) according to validated CPCSSN definitions. The diabetes definition had a sensitivity of 95.6% and specificity of 97.1%; COPD demonstrated a sensitivity of 82.1% and specificity of 97.3% [9]. The number of clinical encounters was a sum of visit dates documented in the EMR for each patient. The earliest patient encounter date in the CPCSSN database varies, depending on when a clinic first implemented an EMR system and when the patient first attended.

Methods for improving completeness

Three methods for improving the completeness of smoking data were tested. The first two utilized multiple imputation (MI) – this is an analytical technique for improving missing data and can be readily implemented using currently available statistical software (e.g. R). MI first generates many different datasets with the missing values imputed based on the distribution of missing data in the original dataset; then, analyses are conducted on each dataset separately and lastly, the results are pooled to generate a multiple imputed estimate [4]. The advantage of MI is that it allows uncertainty in the missing data to be accounted for and has been shown to result in more accurate standard errors and less biased outcomes compared to a complete case analysis [4].

MI was applied under two different missing data assumptions, similar to Marston et al.’s previous work using a UK general practice EMR database [6]. The first MI method considered smoking status to be ‘missing at random’ (MAR) and missing data for three smoking categories, current, past, and never, were imputed. The second MI method assumes that all current smokers have been recorded (given the importance of smoking as a significant risk factor for hypertensive patients) and thus, smoking status is ‘missing-not-at-random’ (MNAR). Here, only the categories of past or never smokers will be imputed for those missing smoking status data.

The third method used a classifier algorithm recently developed by a CPCSSN data manger (MC). The raw EMR data that are used to create the CPCSSN national risk factor data set, comprising approximately 1 million smoking records in both official languages (English and French), were used to develop text-matching patterns for calculating patient smoking status. These raw data are broken into three text fields: original Name, original Value (e.g., 10 packs/day), and original Status, where ‘original’ refers to the raw data from the EMR. For records that are coded as ‘smoking’, risk factor status is coded to one of the following categories: current, never, past, not current, and unknown. A status of ‘not current’ means it is unclear whether the patient is a past smoker or has never smoked (e.g., ‘non-smoker’). Lastly, records that match either zero or multiple statuses are coded as ‘Unknown’.

The smoking status classifier was created following manual review of the top 2000 most frequently occurring smoking records in the national CPCSSN database, comprising approximately 75% of the data. For each record, the true smoking status was determined after review of the text in the all fields related to smoking status in the CPCSSN database (Name, Value, and Status). This text was then used to generate a unique text-pattern that identified the true status. Once all 2000 records were reviewed, a random sample of 1000 records from the remaining 25% of the data set were then analysed to validate and improve pattern-matching accuracy. For each status, a simplified set of all text-patterns, plus additional logic, was then collected to create the smoking status classifier. Figure 1 displays a flow chart of the classifier’s operation. It analyzes each record sequentially by field, with data in the Status field having precedence over Value, and Value over Name, as shown in Fig. 1. Entries that contain relationship words (e.g., aunt, father) are ignored by the classifier, as they are usually family history or other non-risk factor data that typically cause patient smoking status to be misclassified.

Fig. 1
figure 1

The flow diagram for the pattern-matching smoking status classifier (Method 3)


Patient demographic and clinical characteristics were reported as proportions for categorical variables, mean with standard deviation (SD) for normally distributed continuous variables, and median with lower and upper quartiles for non-parametric data. Comparisons of characteristics between patients with and without smoking records were conducted using a chi-squared test for categorical variables, one-way test for continuous variables, and Wilcoxon test for non-parametric data.

For the multiple imputation in Method 1 and 2, the MICE package in R was used [11]. This analytic tool uses a fully conditional specification to impute each variable by specifying an imputation model that best suits the variable type. For instance, a Bayesian polytomous regression model was used for the unordered, categorial smoking status data. Forty imputed datasets were generated with 5 iterations taken to impute the missing values.

To obtain error estimates for the MI methods, a simulation was conducted for patients with complete smoking status data. Missing smoking data were randomly introduced for 40.8% of the complete case group (the same proportion of missing smoking status data in the total sample of 48,377).

The classifier algorithm used by Method 3, programmed in Python, was applied to the hypertensive dataset and a ‘calculated’ Status was generated for each smoking record. Method 3 was then validated by manually assigning a smoking status to all unique records, based on the original Status and Value data, while blinded to the result of the algorithmic-calculated Status. Since, for the hypertensive data set, the original Name data only contain one string (‘Smoking’), they were not reviewed during algorithm assessment.

External validity was estimated using Alberta-specific data from the 2017 Canadian Tobacco, Alcohol, and Drugs Survey (CTADS) [12]. CTADS is a biennial cross-sectional telephone survey in Canadians ages 15 years and older conducted by Statistics Canada [12]. Results are available by province and therefore, the smoking status of Alberta respondents was used (n = 1444 survey respondents weighted to a population estimate of 3,478,000).

All data analysis was performed in RStudio version 1.1.456.

This study was approved by the Conjoint Health Research Ethics Board at the University of Calgary (REB17–1825) and the Health Research Ethics Board at the University of Alberta (Pro00079372).


The flow of patients into the study cohort is summarized in Fig. 2. There were 48,377 adult patients with hypertension identified in the CPCSSN data from Alberta within a two-year contact period who were not deceased or inactive. Of these, 28,634 (59.2%) had a smoking status recorded and available in the database, with approximately 20% having more than one smoking status entry recorded throughout the history of their EMR. For smoking entries with an associated date of entry, nearly 80% were recorded or updated within the previous 2 years (as the ‘most recent’ status was chosen for inclusion). Table 1 summarizes demographic, clinical, and practice characteristics for those with and without smoking information. Those who were missing smoking status information (n = 19,743) were more likely to be older (p = 0.001), female (p = 0.001), have higher systolic (p < 0.001) and diastolic (p = 0.027) blood pressure, lower BMI (p = 0.001), less co-morbid diabetes and COPD (p < 0.001), reside in an urban setting (p < 0.001), and have fewer total clinical encounters (p < 0.001). While the differences in sex, age, BMI, and blood pressure between the two groups were statistically significant, they were small in absolute terms and may not be clinically meaningful.

Fig. 2
figure 2

The flow diagram for patients in the study cohort

Table 1 Demographic and clinical characteristics of patients with and without smoking status recorded in their EMR

The two groups also differed significantly by the type of EMR system (Wolf, Med Access, Practice Solutions Suite, Healthquest, Telin) used by the primary care practice (p < 0.001). Of the five EMR systems included, it was observed that Med Access contributed the largest proportion of patients with missing or undefined smoking data (n = 13,110; 66.4%). When examining the missingness of smoking records by each EMR system, Practice Solutions Suite had the most complete (97% of these patients had a smoking record) and Healthquest had the least complete (23% of these patients had a smoking record); however, the total number of patients in these groups were small (462 patients on Practice Solutions EMR; 644 patients on Healthquest EMR).

Lastly, it was also observed that patients who were missing smoking status also had more missing data for other clinical variables (BMI and blood pressure).

Methods 1 and 2 (multiple imputation)

Table 2 summarizes patient smoking statuses for those without missing data (complete cases), MI for all three statuses (current, past, never; Method 1) and MI for only past or never smokers (Method 2). Both methods resulted in complete data for all patients in the full cohort; however, the proportion of current smokers varied substantially, with 25.3% in Method 1 and 12.5% in Method 2. Compared to the Alberta-specific CTADS survey data [12], Method 1 may be overestimating the proportion of current smokers and Method 2 may be underestimating current smokers. The ‘past’ category for Methods 1 and 2 fall within the survey’s estimated 95% confidence intervals. The relative error for the multiple imputation, which compared the proportions of smoking status in the complete cases to the imputed dataset, was small (current = − 0.05%, past = 0.12%, never = − 0.03%). Any additional free text related to smoking status was deemed to be insufficient for the MI process, as this was rarely available for most patients and was highly unstructured and not easily interpretable (e.g., over 850 unique strings were found with an average of 18 characters within each string; examples of text entries included “smokes socially”, “smoker: quit”, “up to 2ppd”, “yes”, “tobacco use assessment”, etc.).

Table 2 Comparison of different improvement methods for smoking categories

Method 3 (pattern-matching algorithm)

To validate the accuracy of the pattern-matching algorithm on all available smoking records, all 3295 unique pairs of original Smoking Status and Smoking Value data were manually reviewed from a total of 110,850 smoking records for 39,898 patients. The pattern-matching algorithm correctly assigned a calculated smoking status for 103,935 records (93.8%) (Table 3). In the data review, two new categories were included: ‘Conflicting Information’, which reported on pairs of Status and Value data elements in the same smoking record that were contradicting (e.g. record containing Status = “Current” and Value = “non smoker”); and ‘Not Smoking-related’, which identified records that had been categorized as a smoking record but contained non-smoking information (e.g., blister pack medications). The smoking categories that were most accurately coded by the algorithm were ‘past’ (99.9%), ‘never’ (99.3%),‘not current’ (90.4%), and ‘current’ (83.2%), while the ‘unknown’ category had the least accurate classification rate at 63%. The low accuracy in ‘unknown’ classification results largely because many of the records contain multiple risk factors (such as alcohol use, diet, exercise, etc.). Since the smoking classification algorithm is part of a general risk factor classification algorithm, it is designed to produce a definitive classification when only a single risk factor is identified in a given record.

Table 3 Comparison of the pattern-matching algorithm classification to data review

When examining the most recent smoking status for each patient calculated by the pattern-matching algorithm, the proportion of current smokers was slightly lower than in the complete case data (18.2% vs. 21.2%) but was similar to the survey prevalence of 18.9%. Although the algorithm may have more accurately recoded the smoking data, 23.6% of patients were still missing smoking status; however, this is a reduction by 17 percentage points in missingness as compared to the raw, uncoded data of the complete cases.

We conducted a sensitivity analysis to examine whether a different definition for the cohort (at least three encounters within the two-year study period, rather than one or more encounters) would elicit differences in the proportion of patients in each smoking status category and in the multiple imputation results. Although fewer patients remained in the new cohort (n = 43,375), their demographic and clinical characteristics remained largely the same and no significant differences were found for the multiple imputation.


This study examined post-extraction methods for improving the quality of smoking data in a primary care EMR database. The methodologies explored here differed by their aims and processes – the goal of MI is to improve the completeness of data using other variables in the dataset while also accounting for variability of these data. While the fully conditional multiple imputation successfully eliminated all missing smoking data, there is uncertainty around the accuracy of the smoking status classification. We used a province-specific general population survey to estimate the external validity of the smoking classifications produced by the MI; however, the CTADS survey data have their own limitations, such as a low response rate, small sample size, and non-response bias [13], which means the true prevalence of smoking is still unclear and particularly for adults with hypertension. Thus, the smoking information produced by MI is more appropriate for epidemiological purposes rather than for use in clinical feedback or practice quality improvement.

The rationale for smoking status as MNAR data in the MI was that current smokers are more likely to be captured in the EMR data, given its importance as a cardiovascular risk factor. Method 2 (MNAR) resulted in a lower proportion of current smokers (12.5%) compared to the other methods and survey data, which ranged from 18.2–25.3% (Table 1). It may be possible that for older adults with chronic conditions, emphasis is placed on smoking cessation and therefore, lower proportions of current smokers are observed in this cohort compared to population-level surveys. This finding is similar to a Canadian direct health measures survey, where only 9.9% of females and 14.7% of males treated for hypertension were current smokers; this analysis was focused on a smaller sample of older adults aged 60–79 (n = 2111) [14]. Furthermore, the proportion of smokers was found to decline with increasing age in a U.K. primary care database, which was also accompanied by a rising proportion of former smokers as age increased.

The pattern-matching algorithm was designed to use additional information in the smoking record to produce more complete and accurate smoking statuses on an individual level, which is the recommended method for purposes where the accuracy of an individual’s smoking status has greater significance – for example, when using primary care EMR data for practice quality improvement, clinical feedback, or generating a cohort or registry of current smokers. Although this method resulted in reasonably accurate smoking status classifications, the algorithm was still not able to classify 24% of smoking records; further, the category of ‘not current’ is ambiguous and may not be useful information on its own, as this may represent either former or never smokers (though, the number of total records in this category was small). These issues underscore the larger challenges associated with unstructured, inconsistent textual data in primary care EMRs. Any coding or classification that is applied to the CPCSSN data is predicated on the data being present and accurate in the record. The source of missing or inaccurate data can be difficult to identify, as there are numerous places throughout the data flow pathway that can contribute to poor data quality [15]. This includes during point-of-care encounters (patient self-reporting to clinician; poor data entry; lack of structured fields in the EMR to enter smoking information or EMR containing multiple fields where conflicting information is entered), during data extraction, or through data processing (misclassification or errors in the coding algorithm) [7, 15].

Ultimately, a variety of approaches are required to achieve complete, consistent, and accurate information about patient smoking status in EMRs. Any post-extraction approach to data quality improvement relies on complete and accurate data, which therefore means that strategies to support meaningful use of EMRs at the point of care are critical. For example, we found a statistically significant difference in the recording of smoking status attributed to type of EMR system. There are many different EMR products that exist with no formal requirements on the type or location of fields included in the EMR or how the interface appears to the user. Some systems allow for more structured, consistent information to be recorded in an easy-to-navigate way, though some contain more unstructured free text or check-boxes in hard-to-access areas of the chart that can be more difficult for the extraction and interpretation of data. In addition to developing better tools for providers, there have been other examples of interventions to enhance data quality at the practice-level: employing a dedicated data entry clerk was found to improve the quality and usability of EMR data (particularly for smoking status), though the long-term sustainability and scale may not be feasible [16]; implementing data quality feedback tools, reports, and educational training for clinicians have been shown to improve the recording of information in electronic health records [17, 18], as well as reduce variations in quality due to different electronic systems [17]; and lastly, policy-based initiatives, such as meaningful use regulations and financial incentives, have had a positive impact on the recording of smoking status, among other data elements [19, 20]. Even so, data improvement at the point-of-care require more intensive interventions and resources, and therefore post-data extraction strategies should also be pursued. This includes further exploration of the methods described in this paper, as well as more advanced techniques like natural language processing [21].


The first limitation of this study is the lack of a true reference standard from which to assess external validity. By not knowing the true prevalence of smokers (and other smoking statuses) among those with hypertension, we cannot say for certain whether the MI methods are correct from a population-level. The CTADS survey included in the paper reported on the smoking status of individuals in the general Alberta population and was not specific to hypertension, unlike the patient sample here. Other sources were considered but not used for various reasons: the Canadian Community Health Survey (CCHS) and the Canadian Health Measures Survey (CHMS) do not have publicly available smoking rates for people with hypertension. The CCHS describes the proportion of self-reported current (daily or occasional) smokers but not former or non-smokers. The CHMS only includes participants up to the age of 79.

The CPCSSN primary care EMR data also have specific shortcomings. Most importantly, the smoking data within the EMR may be limited in terms of its accuracy and timeliness, which would subsequently impact the results of the MI and pattern matching algorithm. Since smoking status is a dynamic state that can change at any point in time, it is difficult to confirm whether the data present in the EMR are a true reflection of each patient’s current smoking status. We attempted to address this by selecting the most recent smoking status available in the EMR for each patient and found that the majority of this cohort had their most recent smoking status recorded within the previous 2 years.

The CPCSSN data represent a relatively small sample of providers (just over 5% of all family physicians in Alberta [22]) and patients in Alberta (from a provincial population of over 4.3 million in 2019 [23]), and thus may be reasonably but not entirely representative of all providers and patients in the province or elsewhere in Canada. Both the MI and pattern-matching algorithms may perform differently depending on variations between different regional databases and the EMR system from which the data are extracted (e.g., only 5 out of the 11 EMR systems within the CPCSSN database were represented here).


Multiple imputation and algorithmic pattern-matching are two ways that EMR data can be improved post-extraction, with MI returning complete data and the classification algorithm permitting accuracy estimates. There is potential to further enhance the algorithm using more complex coding with more data points, which would further improve its accuracy and ability to reduce missing patient smoking status. In the future, these methods could also be tested on other risk factor information in a primary care EMR databases, such as alcohol use, diet and exercise.

Availability of data and materials

The national CPCSSN data are available to approved researchers for a fee; for more information or to submit a Letter of Intent, visit:

The Alberta-specific CPCSSN data that was used for this analysis are available as two separate data sets through the regional networks (NAPCReN, SAPCReN). Data access procedures and requirements vary by network; contact the corresponding author for more information or visit: /



Body mass index


Blood pressure


Confidence interval


Canadian primary care sentinel surveillance network


Canadian tobacco, alcohol, and drugs survey


Electronic medical record

kg/m2 :

kilograms per metres squared


missing at random


multiple imputation


multivariate imputation by chained equations


missing not at random


Northern alberta primary care research network


Southern alberta primary care research network


Standard deviation


United Kingdom


  1. Garies S, Birtwhistle R, Drummond N, Queenan J, Williamson T. Data Resource Profile: National electronic medical record data from the Canadian Primary Care Sentinel Surveillance Network (CPCSSN). Int J Epidemiol. 2017;46(4):1091–1092f.

    Article  Google Scholar 

  2. CPCSSN. Canadian Primary Care Sentinel Surveillance Network (CPCSSN). 2016. Available from: [cited 2019 Feb 14].

    Google Scholar 

  3. Greiver M, Aliarzadeh B, Meaney C, Moineddin R, Southgate CA, Barber DTS, et al. Are we asking patients if they smoke?: missing information on tobacco use in Canadian electronic medical records. Am J Prev Med. 2015;49(2):264–8.

    Article  Google Scholar 

  4. Pedersen AB, Mikkelsen EM, Cronin-Fenton D, Kristensen NR, Pham TM, Pedersen L, et al. Missing data and multiple imputation in clinical epidemiological research. Clin Epidemiol. 2017;9:157–66.

    Article  Google Scholar 

  5. Marston L, Carpenter JR, Walters KR, Morris RW, Nazareth I, Petersen I. Issues in multiple imputation of missing data for large general practice clinical databases. Pharmacoepidemiol Drug Saf. 2010;19:618–26.

    Article  Google Scholar 

  6. Marston L, Carpenter JR, Walters KR, Morris RW, Nazareth I, White IR, et al. Smoker, ex-smoker or non-smoker? The validity of routinely recorded smoking status in UK primary care: A cross-sectional study. BMJ Open. 2014;4:e004958.

    Article  Google Scholar 

  7. Garies S, Cummings M, Forst B, McBrien K, Soos B, Taylor M, et al. Achieving quality primary care data: a description of the Canadian Primary Care Sentinel Surveillance Network data capture, extraction, and processing in Alberta. Int J Popul Data Sci. 2019;4(2):1–8.

  8. Tobe SW, Stone JA, Anderson T, Bacon S, Cheng AY, Daskalopoulou SS, et al. Canadian cardiovascular harmonized National Guidelines Endeavour (C-CHANGE) guideline for the prevention and management of cardiovascular disease in primary care: 2018 update. CMAJ. 2018;190(40):E1192–206.

    Article  Google Scholar 

  9. Williamson T, Green ME, Birtwhistle R, Khan S, Garies S, Wong ST, et al. Validating the 8 CPCSSN case definitions for chronic disease surveillance in a primary care database of electronic health records. Ann Fam Med. 2014;12(4):367–72.

    Article  Google Scholar 

  10. Sibley LM, Moineddin R, Agha MM, Glazier RH. Risk adjustment using administrative data-based and survey-derived methods for explaining physician utilization. Med Care. 2010;48(2):175–82.

    Article  Google Scholar 

  11. van Buuren S, Groothuis-Oudshoorn K. MICE: multivariate imputation by chained equations in R. J Stat Softw. 2011;45(3):1–67.

    Article  Google Scholar 

  12. Statistics Canada. Table 2, Canadian Tobacco, Alcohol and Drugs Survey (CTADS). 2017. Available from: [cited 2019 May 1].

    Google Scholar 

  13. Gagné T. Estimation of smoking prevalence in Canada: implications of survey characteristics in the CCHS and CTUMS/CTADS. Can J Public Health. 2017;108(3):e331–4.

    Article  Google Scholar 

  14. Bushnik T, Hennessy DA, McAlister FA, Manuel DG. Factors associated with hypertension control among older Canadians. Health Rep. 2018;29(6):3–10.

    PubMed  Google Scholar 

  15. Verheij RA, Curcin V, Delaney BC, McGilchrist MM. Possible sources of bias in primary care electronic health record data use and reuse. J Med Internet Res. 2018;20(5):e185.

    Article  Google Scholar 

  16. Greiver M, Barnsley J, Aliarzadeh B, Krueger P, Moineddin R, Butt D, et al. Using a data entry clerk to improve data quality in primary care electronic medical records: a pilot study. Inform Prim Care. 2011;19(4):241–50.

    PubMed  Google Scholar 

  17. van der Bij S, Khan N, Ten Veen P, de Bakker DH, Verheij RA, Blumenthal D, et al. Improving the quality of EHR recording in primary care: a data quality feedback tool. J Am Med Inform Assoc. 2016;356(24):2527–34.

  18. Taggart J, Liaw ST, Yu H. Structured data quality reports to improve EHR data quality. Int J Med Inform. 2015;84(12):1094–8.

    Article  Google Scholar 

  19. Schindler-Ruwisch JM, Abroms LC, Bernstein SL, Heminger CL. A content analysis of electronic health record (EHR) functionality to support tobacco treatment. Transl Behav Med. 2017;7(2):148–56.

    Article  Google Scholar 

  20. Taggar JS, Coleman T, Lewis S, Szatkowski L. The impact of the quality and outcomes framework (QOF) on the recording of smoking targets in primary care medical records: cross-sectional analyses from the health improvement network (THIN) database. BMC Public Health. 2012;12:329.

    Article  Google Scholar 

  21. Liao KP, Cai T, Savova GK, Murphy SN, Karlson EW, Ananthakrishnan AN, et al. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ. 2015;350:1–6.

    Article  Google Scholar 

  22. Canadian Medical Association. Family medicine profile. 2018. Available from:

    Google Scholar 

  23. Government of Alberta. Quarterly population report; Second quarter 2019 2019. Available from: [cited 2019 Oct 21].

    Google Scholar 

Download references


Not applicable.


SG is funded through an Alberta Innovates Health Solutions Graduate Studentship (2016–2020). The CPCSSN project in Alberta, hosted by NAPCReN and SAPCReN, is funded by the Canadian Institutes of Health Research (CIHR) and Alberta Innovates through the Alberta Strategies for Patient Oriented Research (SPOR) Primary and Integrated Health Care Innovation Network, as well as the Public Health Agency of Canada. The funders had no role in the study design, data collection, analysis or interpretation of the data, or in the writing of the manuscript.

Author information

Authors and Affiliations



SG conceptualized the study, MC developed the pattern-matching algorithm, and SG and MC analysed and interpreted the data. SG wrote the initial draft of the manuscript. KM, HQ, DM, ND, and TW contributed to the interpretation of findings and revisions to the manuscript. All authors have read and approved the final version of the manuscript.

Corresponding author

Correspondence to Stephanie Garies.

Ethics declarations

Ethics approval and consent to participate

This study was approved by the University of Calgary’s Conjoint Health Research Ethics Board (REB17–1825) and the University of Alberta’s Health Research Ethics Board (Pro00079372). A waiver of individual patient consent was granted by the Research Ethics Board at each university affiliated with each participating CPCSSN practice-based research network for the collection and use of de-identified EMR data. Written consent was obtained from each sentinel participating in the CPCSSN project.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Garies, S., Cummings, M., Quan, H. et al. Methods to improve the quality of smoking records in a primary care EMR database: exploring multiple imputation and pattern-matching algorithms. BMC Med Inform Decis Mak 20, 56 (2020).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: