 Research
 Open Access
Exploiting mutual information for the imputation of static and dynamic mixed-type clinical data with an adaptive k-nearest neighbours approach
BMC Medical Informatics and Decision Making volume 20, Article number: 174 (2020)
Abstract
Background
Clinical registers constitute an invaluable resource in the medical data-driven decision making context. Accurate machine learning and data mining approaches on these data can lead to faster diagnosis, definition of tailored interventions, and improved outcome prediction. A typical issue when implementing such approaches is the almost unavoidable presence of missing values in the collected data. In this work, we propose an imputation algorithm based on a mutual information-weighted k-nearest neighbours approach, able to handle the simultaneous presence of missing information in different types of variables. We developed and validated the method on a clinical register consisting of the information collected over subsequent screening visits of a cohort of patients affected by amyotrophic lateral sclerosis.
Methods
For each subject with missing data to be imputed, we create a feature vector from the information collected over his/her first three months of visits. This vector is used as a sample in a k-nearest neighbours procedure in order to select, among the other patients, the ones with the most similar temporal evolution of the disease. An ad hoc similarity metric was implemented for the sample comparison, capable of handling the different nature of the data and the presence of multiple missing values, and of including the cross-information among features captured by the mutual information statistic.
Results
We validated the proposed imputation method on an independent test set, comparing its performance with that of three state-of-the-art competitors; the proposed method performed better. We further assessed the validity of our algorithm by comparing the performance of a survival classifier built on the data imputed with our method versus the one built on the data imputed with the best-performing competitor.
Conclusions
Imputation of missing data is a crucial, and often mandatory, step when working with real-world datasets. The algorithm proposed in this work could effectively impute an amyotrophic lateral sclerosis clinical dataset, by handling the temporal and the mixed-type nature of the data and by exploiting the cross-information among features. We also showed how the imputation quality can affect a machine learning task.
Background
By discovering novel and useful patterns from clinical registers and electronic health records, healthcare analytics has transformed the healthcare industry both in terms of cost optimisation and ever-improving quality of care [1]. Among the possible approaches, the use of machine learning (ML) and data mining techniques is providing the means to extract information from the complex and voluminous amount of available data, virtually creating a paradigm shift in the whole healthcare sector, from basic research to clinical and management applications [2, 3]. The possible advantages of such analyses could vastly improve patients’ lives and benefit society as a whole. From an economic perspective, the use of these techniques to improve practice efficiency results in more affordable, high-quality healthcare [4]. Besides, from a clinical point of view, the possible improvements in medical knowledge, as well as in diagnosis and prognosis capabilities, allow higher health standards. Studies such as survival analyses can highlight risk factors and detect the effect of specific treatments both on disease progression and on quality of life [5], moving towards a personalised care system. Moreover, an enhanced knowledge of the pathologies can be translated into computer-aided tools, offering clinicians valid support in decision making.
The creation of accurate and effective analytic models from healthcare data, however, is challenging, because of issues regarding quality and heterogeneity [6]. The type and frequency of collected data vary based on the specific application field, a patient’s clinical condition and administrative requirements. Moreover, medical tests and treatments can be carried out at different times even if patients exhibit the same symptoms. This, together with human factors (poor handwriting, missing charts or pages, measurements being documented in inconsistent locations, etc.), results in many aspects of a patient’s clinical condition being unmeasured or unrecorded at different time points.
Missing values may be clinically important, but cannot be handled by most analytics algorithms [7] and can significantly affect the conclusions that can be drawn from the data [8]. For instance, missing data can introduce bias in the results of randomised controlled trials, negatively affecting the derived clinical decisions and ultimately patient care [9]. When performing survival analysis, missing data can occur in one or more risk factors. The standard response of simply excluding the affected individuals from the analysis could lead to invalid results if the excluded group is selective with respect to the entire sample, and to a waste of costly collected data [10]. In remote health monitoring settings, missing data are a prevalent issue in long-term monitoring systems and can lead to failures in decision making [11]. For electronic health records, missing values frequently outnumber observed ones, mainly because these records were designed to document and improve patient care and to streamline billing rather than to collect data for research purposes [12].
Many kinds of analyses, from simple statistics to advanced data mining and machine learning methods, either fail altogether in dealing with missing data or end up producing biased estimates of the investigated associations when simple curing techniques (such as complete case analysis, overall mean imputation, or the missing-indicator method) are applied [13]. To utilise all clinical data and achieve optimal performance of the used algorithms, the missing data issue must be addressed by imputing the missing values.
When considering the heterogeneity of the data recorded in this setting, a typical example of a mixed-type dataset is represented by disease registers. The variables in this domain can be classified as either static if constant throughout the patient’s clinical history, such as sex or age at disease onset, or dynamic if varying in time, such as blood pressure or sugar levels at subsequent visits. Furthermore, they can be continuous when representing measurements in a range of continuous values, ordinal when the values fall in a discrete ordered set, or categorical when describing a qualitative property out of a finite number of categories or distinct groups without any order relations. An adequate imputation method should therefore be able to handle all of this data complexity at once.
Many of the available imputation methods are restricted to only one type of variable. For mixed-type data, the different variable types are usually handled separately, thus ignoring possible relations among variables of different types. Moreover, most of them make strong assumptions on the characteristics of the missing data, such as locality in Gaussian process-based models [14], low-rankness and temporal regularity in matrix factorisation models [15] and multivariate normality in Expectation-Maximisation methods [16]. Finally, most commonly used imputation methods are not able to explicitly handle the temporal nature of longitudinal patient data [17].
This paper presents an adaptive mutual information-weighted k-nearest neighbours (wkNN) imputation algorithm developed to explicitly handle missing values of continuous/ordinal/categorical and static/dynamic features conjointly. The proposed methodology was applied and validated on a subset of the Piemonte and Valle d’Aosta Amyotrophic Lateral Sclerosis (PARALS) register [18], a prospective epidemiological register from two Italian regions.
Types of missing data
Missing values can be of three general types: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). When missing data are MCAR, the presence and/or absence of data is completely independent of observable variables and parameters of interest. In this case, the set of subjects with no missing data is also a random sample from the source population. This represents the best possible type of missing data as any analysis performed will be unbiased [19], although it is a highly unlikely scenario.
Missing data are MAR when the propensity for a value to be missing depends on some observed patient characteristic. For instance, males are less likely to fill in a depression survey. This kind of missing data can induce bias in the resulting analysis especially when the data is unbalanced because of many missing values in a certain category.
Finally, we are in the MNAR scenario when the missing values are neither MCAR nor MAR. For instance, when asking subjects for their income level it might well be that missing data are more likely to occur when the income level is relatively high. Here, the reason for missingness obviously is not completely at random, but is related to unobserved patient characteristics.
Many imputation methods require the missing data to be MCAR, or at least MAR. On the other hand, an imputation based on a k-nearest neighbours approach is applicable in any of the three previous situations, as long as there is a relationship between the variable with the missing value and the other variables [20].
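The three mechanisms can be made concrete with a small simulation. The following is a minimal sketch for illustration only: the fields `sex` and `income`, the probabilities and the helpers `make_missing` and `missing_rate` are hypothetical and are not taken from the register or from the cited works.

```python
import random

def make_missing(records, mechanism, rng):
    """Return a copy of `records` with some 'income' values set to None.

    mechanism: 'MCAR' (constant probability), 'MAR' (probability depends on
    the observed 'sex' field) or 'MNAR' (probability depends on the income
    value that goes missing). All probabilities are illustrative assumptions.
    """
    out = []
    for rec in records:
        if mechanism == "MCAR":
            p = 0.2
        elif mechanism == "MAR":
            p = 0.4 if rec["sex"] == "M" else 0.1
        elif mechanism == "MNAR":
            p = 0.5 if rec["income"] > 50_000 else 0.05
        else:
            raise ValueError(mechanism)
        new = dict(rec)  # never modify the input records
        if rng.random() < p:
            new["income"] = None
        out.append(new)
    return out

def missing_rate(rows, keep):
    """Fraction of missing 'income' values among the rows selected by `keep`."""
    sel = [r for r in rows if keep(r)]
    return sum(r["income"] is None for r in sel) / len(sel)

rng = random.Random(0)
records = [{"sex": rng.choice("MF"), "income": rng.uniform(10_000, 90_000)}
           for _ in range(5000)]
mar = make_missing(records, "MAR", rng)
```

Under the MAR setting, the missing rate differs between the two observed groups, whereas under MCAR it would be roughly the same in both.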
Previous work
Several methods for handling missing data are already available [21]. The simplest approaches consist in focusing the analysis only on non-missing values in the dataset, by either dropping cases where at least one variable is missing or by dropping variables where at least one value is missing. These approaches completely neglect the relationships among variables, possibly causing severe information loss and worsening the statistical power and standard errors of the analyses [22, 23]. Mean/median/mode imputation and value propagation (Last Observation Carried Forward or Next Observation Carried Backward) are other fast and easily interpretable statistical approaches. These imputation methods, however, may lead to low accuracy and biased estimates of the investigated associations [13, 24].
Regression represents a somewhat more advanced imputation approach that estimates missing values by regressing them from other related variables [25], especially time [26]. While deterministic regression limits the imputation to the exact prediction of the regression model, often producing an overestimation of the correlation among the variables, stochastic regression adds a random error term to the predicted value in order to recover a part of the data variability [27].
Multivariate imputation by chained equations (MICE) [28] is one of the most prominent methods in the literature [29]. In this imputation procedure, a series of regression models are run whereby each variable with missing data is modelled conditional upon the other variables in the data. This means that each variable is modelled according to its distribution, with, for example, predictive mean matching for continuous data, logistic regression for binary data, polytomous logistic regression for categorical data and proportional odds for ordinal data.
3D-MICE, recently introduced in [17], combines MICE with Gaussian process (GP) [14, 30] predictions, thus imputing missing data based on both cross-sectional and longitudinal patient data information. MICE is used to carry out cross-sectional imputation of the missing values, while a single-task GP is used to perform longitudinal imputation. The estimates obtained by the two methods are then combined by computing a variance-informed weighted average. 3D-MICE can adequately impute continuous longitudinal patient data, but is unable to handle categorical and static variables.
A nonparametric method based on a random forest that can cope with different types of variables simultaneously, called missForest, was introduced by Stekhoven et al. [31]. This method is based on the idea that a random forest intrinsically constitutes a multiple imputation scheme by averaging over many unpruned classification or regression trees. While not requiring assumptions about distributional aspects of the data, missForest requires the observations to be pairwise independent, which is rarely the case when handling clinical records (several visits for each patient).
Another popular imputation method for cross-sectional time series data is Amelia II [16], which performs multiple imputation by implementing an Expectation-Maximisation with Bootstrapping algorithm. Amelia II can utilise both time series and multivariable information in a dataset for the imputation task. This method requires all variables in the dataset to be multivariate normally (MVN) distributed. This requirement reduces the applicability of the method especially when dealing with non-normalisable and/or categorical variables.
Recently, a number of deep learning frameworks for estimating missing values in multi-time-series clinical data have been proposed [32–34]. These methods achieved impressive results on benchmark datasets thanks to the high-quality representations extracted from large amounts of data, which means that their applicability is limited when only a few data are available.
The “nearest neighbours” (NN) methods are among the most popular imputation procedures [20, 35]. Missing values of samples with missing data are replaced by values extracted from other samples that are similar with respect to the observed characteristics. NN imputation approaches are donor-based methods where the imputed value is either a value that was actually measured for another record in a database (1-NN) or the average/median/mode of measured values from k records (kNN). These methods were often shown to outperform other imputation techniques [36], even though results depend heavily on the choice of the metric used to measure the similarity between samples. Moreover, because data collection periods vary across patients, samples may not be directly comparable. Furthermore, the similarity metric should also handle the presence of missing values in the donor samples, manage the different nature of the data, and take into account the possibly unbalanced contribution of static and dynamic variables, with the latter adding information over time.
Aim of this work
In this work, we present an imputation algorithm based on a weighted kNN approach, able to handle missing data in static and dynamic mixed-type variables simultaneously. The kNN imputation approach is fully nonparametric and does not require explicit models to relate variables, thus being less prone to model misspecification than other methods [20]. In our algorithm, we define an ad hoc similarity metric in which we employ the mutual information (MI) values between feature pairs as weights in the computation of the distance among samples, in order to account for the cross-feature information.
The proposed methodology has been developed and validated on a clinical epidemiological register of patients affected by amyotrophic lateral sclerosis (ALS), that is, a collection of dynamically acquired data over subsequent screening visits, one visit at a time. Compared to clinical trial datasets, epidemiological registers better characterise the general ALS population, since clinical trial populations must fit a stringent set of criteria [37]. This clinical register represents a typical instance of a complex dataset consisting of both static/dynamic and mixed-type variables and, consistently with its real-world nature, is inevitably subject to missing data.
ALS is a fatal neurodegenerative disorder characterised by progressive muscle paralysis caused by the degeneration of motor neurons in the brain and spinal cord [38]. The symptoms worsen over time and there are no known treatments that can effectively halt or reverse its progression, which inevitably results in respiratory failure, typically within 4 years from disease onset [39]. The enormous social, medical and human costs imposed on ALS patients, their families and the health systems in general are pushing the scientific community towards the development of computational tools to derive predictions for prognostic counselling, stratification of cohorts for pharmacological trials, and timing of interventions [40–44].
To this purpose, two distinct DREAM Challenges have been organised in the past years [41, 44]. By employing the clinical information of the first three months of patients’ visits from different datasets, the participants were asked to develop algorithms to predict the disease progression and to stratify the patients into meaningful subgroups. The PARALS register used in our work was partially included in the datasets of the second challenge.
ALS is a rare disease: its incidence in Europe and in populations of European descent is 2.6 cases per 100,000 people per year and its prevalence is 7–9 cases per 100,000 people [45], with ALS rates being mainly unknown in the rest of the world [38]. This implies that the available patients’ data collected in clinical registers are of inestimable importance for furthering the translational research on the disease and that missing values cannot be treated with simple curing techniques. With the aim of building a complete dataset from the PARALS register that can be similarly used for the application and development of ML algorithms, we developed an adaptive weighted k-nearest neighbours algorithm for the imputation of the first three months of screening visits. Our imputation method is based on the assumption that subjects with a similar disease progression over a short period of time share similar feature values and can therefore be cross-exploited to impute missing values.
In addition to adequately characterising the temporal evolution of the disease course [41], the selected time interval is short enough to allow the imputation of subjects with few available visits. Moreover, the information of newly added subjects can be promptly used for the imputation of others. Finally, by focusing on a reduced observation interval, only a relatively small number of visits (and thus a relatively small number of features) is considered. In a kNN setting, having a small number of features prevents the method from incurring the curse of dimensionality: in general, as the number of dimensions (features) increases, the closest distance among samples tends to the average distance and the predictive power of the algorithm decreases [46].
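The concentration of distances mentioned above can be observed numerically. The sketch below is an illustration on synthetic uniform data, not an analysis of the register; the helper `nearest_to_mean_ratio`, the sample size and the dimensions are arbitrary choices made for this example.

```python
import math
import random

def nearest_to_mean_ratio(n_points, n_dims, rng):
    """Ratio between the nearest-neighbour distance and the mean distance
    from a random query point to points drawn uniformly in [0, 1]^n_dims."""
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return min(dists) / (sum(dists) / len(dists))

rng = random.Random(42)
# The ratio approaches 1 as the number of dimensions grows.
ratios = {d: nearest_to_mean_ratio(500, d, rng) for d in (2, 20, 200)}
```

As the ratio approaches 1, the nearest neighbour is hardly closer than an average sample, which is why keeping the feature vector short helps a kNN-based imputation.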
The proposed method was compared to three other state-of-the-art imputation algorithms, namely Amelia II [16], missForest [31] and MICE [28], which are the main representatives of the methods currently available in the literature. Our experiments show that our method outperforms the competitors in the imputation of most of the features and on average.
To assess the possible impact of the proposed method in a concrete scenario, we provide a simple application of the imputed data in a survival classification task. We used a naïve Bayes (NB) classifier to distinguish between patients with long and short survival times by using only the information in their first three months of screening visits. Our results show that imputing the training set with the proposed method improves the prediction performance of the NB classifier on a hold-out test set, also achieving better performance than the classifier built on the training set imputed with the top competitor (MICE). By showing the effectiveness of the proposed imputation method in enhancing the training data for a very simple classification algorithm with naïve hypotheses, we support its applicability to more complex and sophisticated analyses. Finally, we believe that the proposed methodology could be of great aid to clinicians, since it enables the survival prediction of patients by employing only the information from their first three months of visits, regardless of possible missing values.
Materials and methods
Dataset
The dataset used in this work was extracted from the PARALS Register as follows. We selected the cohort of patients with first visit from January 1st, 2001 and follow-up up to July 18th, 2017, and excluded the ones having an onset that predated the first visit by five years or more (average ALS prognosis) in order to filter out clinical outliers. The selected cohort includes 700 patients, resulting in a dataset containing the information assessed over their subsequent screening visits, for a total of 6,726 visits.
The 25 variables collected in the dataset include some clinical features recorded during the first visit (the static ones): patient sex, body-mass index (BMI) both premorbid and at diagnosis, a measure of respiratory functionality (forced vital capacity, FVC) at diagnosis, familiality of ALS, the result of a genetic screening over the most common ALS-associated genes, presence of frontotemporal dementia (FTD), site of disease onset (limb/bulbar), age at onset, and diagnostic delay (time from ALS onset to diagnosis). The remaining features (the dynamic ones) are collected over visits and consist of: the presence/absence up to the current visit of non-invasive ventilation (NIV) and percutaneous endoscopic gastrostomy (PEG), two guideline-recommended interventions for symptom management in ALS, and the revised ALS Functional Rating Scale (ALSFRS-R) [47], a 12-item questionnaire rated on a 0–4 point scale that evaluates the observable functional status of patients with ALS and its change over time.
The time of the visit for each patient is expressed in months and set to zero in correspondence to the first visit, resulting in negative values for the onset delta. These variables are detailed in Table 1, according to their data type (continuous, ordinal, or categorical), with the percentage of native missing values and the static (S) or dynamic (D) nature of the feature. In this summary, for the NIV and PEG variables we reported the total number of patients who were administered these interventions.
In order to develop and validate the imputation algorithms on independent data, we split the dataset into training (80% = 560 subjects, 5,507 visits) and test (20% = 140 subjects, 1,219 visits) sets, by stratifying the dataset over all variables.
Imputation algorithm
In this work we developed a weighted kNN approach to impute the missing values in the first three months of screening visits of each patient. We based our algorithm on the assumption that patients with similar characteristics share the same disease course over time. Patient similarity is assessed by using a dedicated distance metric over their features.
Given a patient with a missing value to be imputed and a pool of other patients having that feature, the algorithm searches for the k closest subjects in terms of disease progression similarity and infers the estimate for the missing value. First, the distance between the current patient and each of the candidate subjects from the pool is computed. Then, a weighted average of the corresponding values in the k most similar patients is obtained and used as a plausible estimate of the missing one. To impute the whole dataset, the procedure is iterated for each missing value of the given patient and then for each patient with missing values in their visits. The algorithm takes into account the temporal evolution of the data over visits and handles both the mixed nature of the data and the presence of missing values in the distance computation.
Adaptive kNN sample construction
To capture the temporal evolution of the features over subsequent visits, for a given patient i with missing data to be imputed, the algorithm builds a feature vector (kNN sample) that contains the information recorded during his/her first three months of screening visits. The feature vector is created by binding the static information for that patient (constant throughout all his/her visits) to the dynamic ones in the [0,2] months time interval from the first visit in chronological order (with 0 being the first month). In our dataset, all the patients have between 1 and 4 visits in the first three months of screening: the algorithm adaptively builds kNN samples whose length depends on the number of available visits for each subject to be imputed. Figure 1(a) illustrates the sample construction for subject i, with p being the number of static features, m the number of the dynamic ones, and n the number of his/her visits in the first three months of screening.
To identify the subjects in the pool of candidates having disease progression similar to subject i, the algorithm builds an analogous feature vector for each candidate neighbour with an available value in correspondence to the feature to be imputed. In more detail, each candidate neighbour j is temporally mapped over the current subject i, adaptively building a sample according to their matching time points. The feature vector of j is initialised with the subject’s static features. Let \(\mathbf{t}_{i}=\left(t_{i,1}, t_{i,2}, \dots, t_{i,n}\right)\) be the time points of the visits in the first three months of screening for subject i. For each visit time point t_{i,l} of subject i, the closest-in-time visit of subject j within one month is selected. If no matching visit is found, candidate j is excluded from the kNN search. Otherwise, the dynamic features of the matching visit are extracted and appended to the feature vector of subject j; possible missing values in the matching visits of subject j are passed on to his/her feature vector. Please notice that a candidate subject j may have repeated blocks of dynamic features in his/her feature vector, corresponding to the same visit matching multiple visits of subject i. Also notice that the feature vectors of the candidate subjects include the dynamic information of visits in the [0,3] months time interval from the first visit (that is, of the first four months of screening visits). Figure 1(b) schematically depicts the candidate sample construction procedure.
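The temporal mapping of a candidate onto the subject's visits can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_candidate_sample`, the data layout (lists of values with `None` for missing entries) and the choice of treating "within one month" as an inclusive bound are all assumptions of this example.

```python
def build_candidate_sample(subject_times, cand_static, cand_visits, max_gap=1.0):
    """Map the visits of candidate j onto the visit times of subject i.

    subject_times: visit times (months) of subject i in the first three months.
    cand_static:   list of static feature values of candidate j.
    cand_visits:   list of (time_in_months, [dynamic feature values]) for j;
                   None marks a missing value and is passed through unchanged.
    Returns the candidate feature vector, or None when some visit of subject i
    has no candidate visit within `max_gap` months (candidate excluded).
    """
    sample = list(cand_static)  # start from the static features
    for t in subject_times:
        # closest-in-time candidate visit for this time point of subject i
        best = min(cand_visits, key=lambda v: abs(v[0] - t), default=None)
        if best is None or abs(best[0] - t) > max_gap:
            return None
        sample.extend(best[1])  # the same visit may be appended more than once
    return sample

# Hypothetical candidate with two static and two dynamic features.
vector = build_candidate_sample(
    subject_times=[0.0, 1.5, 2.0],
    cand_static=[1, 61.0],
    cand_visits=[(0.0, [4, 3]), (2.2, [3, None])],
)
```

In this toy case the candidate visit at 2.2 months matches both the second and the third visit of the subject, so its block of dynamic features appears twice in the resulting vector, missing value included.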
Weighted k-nearest neighbours
For a subject i with a missing value to be imputed, the wkNN algorithm proceeds as follows. The features of the subject sample, together with his/her candidate samples, are normalised to the [0,1] interval in order to account for the difference among the ranges. Then, the distance between subject i and each candidate j is computed according to the following metric.
Let \(\mathbf{v}=\left(v_{1}, v_{2}, \dots, v_{N}\right)\) and \(\mathbf{u}=\left(u_{1}, u_{2}, \dots, u_{N}\right)\) be the feature vectors of, respectively, subject i and candidate j. Let N_{stat}(v,u) and N_{dyn}(v,u) be, respectively, the number of common non-missing static and dynamic features in v and u. Also, let S_{categ}, S_{ord}, S_{cont}, D_{categ}, D_{ord}, and D_{cont} be the sets of indices of, respectively, the static categorical, the static ordinal, the static continuous, the dynamic categorical, the dynamic ordinal, and the dynamic continuous features in v and u. The distance between v and u is given by:
where n is the number of visits in the first three months of screening for subject i and I(v_{l},u_{l}) is 0 if v_{l}=u_{l} and 1 otherwise. If either v_{l} or u_{l}, or both, are missing, the feature at index l does not contribute to the distance. The numerator is divided by the number of comparable features in u and v to normalise the distance on the number of common non-missing values. Because of the sample building procedure, each dynamic feature appears n times in the feature vectors: to rebalance the contribution of all the features to the similarity metric, both the distance between static features and the count N_{stat}(v,u) are multiplied by n.
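A distance with the properties just described can be sketched in a few lines. This is one possible reading, not the authors' exact formula: it assumes that ordinal and continuous features (already normalised to [0,1]) contribute their absolute difference, while categorical features contribute the indicator I. The function name `wknn_distance` and the data layout (static features first, `None` for missing values) are hypothetical.

```python
def wknn_distance(v, u, kinds, n_static, n_visits):
    """Distance between two kNN samples whose features lie in [0, 1].

    v, u:     equal-length feature vectors; None marks a missing value.
    kinds:    per-index feature type: 'categ', 'ord' or 'cont'.
    n_static: number of leading static features (the rest are dynamic).
    n_visits: n, the number of visits of subject i in the sample.
    Assumption: ordinal and continuous features contribute |v_l - u_l|.
    Returns None when the two samples share no comparable feature.
    """
    num = 0.0
    den = 0
    for l, (a, b) in enumerate(zip(v, u)):
        if a is None or b is None:
            continue  # missing in either sample: no contribution
        term = float(a != b) if kinds[l] == "categ" else abs(a - b)
        weight = n_visits if l < n_static else 1  # rebalance static features
        num += weight * term
        den += weight  # ends up as n * N_stat(v, u) + N_dyn(v, u)
    return num / den if den else None

kinds = ["categ", "cont", "ord", "cont", "ord", "cont"]
v = ["M", 0.5, 1.0, 0.2, None, 0.4]
u = ["F", 0.25, 0.5, 0.2, 0.75, None]
d = wknn_distance(v, u, kinds, n_static=2, n_visits=2)  # 3.0 / 6 = 0.5
```

In the example, the two static features contribute 2 × (1 + 0.25), the two comparable dynamic features contribute 0.5 + 0, and the denominator is 2 × 2 + 2.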
At this point, a filtering step is performed: candidates whose number of features comparable with subject i is smaller than 90% of the total number of non-missing features in sample i (both computed with the same adjustment for the static features) are dropped.
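The filtering rule can be expressed as a simple predicate. The sketch below assumes the same hypothetical layout used for illustration elsewhere (static features first, `None` for missing values); the names `comparable_count` and `keep_candidate` are not from the original work.

```python
def comparable_count(v, u, n_static, n_visits):
    """Count the features observed in both samples, with each static
    feature counted n_visits times (the same adjustment as in the distance)."""
    total = 0
    for l, (a, b) in enumerate(zip(v, u)):
        if a is not None and b is not None:
            total += n_visits if l < n_static else 1
    return total

def keep_candidate(v, u, n_static, n_visits, threshold=0.9):
    """True when candidate u shares at least `threshold` of the (adjusted)
    non-missing features of the subject sample v."""
    observed = comparable_count(v, v, n_static, n_visits)
    shared = comparable_count(v, u, n_static, n_visits)
    return shared >= threshold * observed

v = [1, 0.5, 0.2, 0.3, None, 0.4]       # adjusted count: 2 * 2 + 3 = 7
u_good = [0, 0.1, 0.2, 0.3, 0.9, 0.4]   # shares all 7
u_bad = [0, None, 0.2, 0.3, 0.9, 0.4]   # shares 2 + 3 = 5 < 0.9 * 7
```

With these toy vectors, `u_good` is retained and `u_bad` is dropped before the k nearest neighbours are selected.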
Once the distances to all the candidates have been computed, the k nearest ones are selected and their values in correspondence to the feature to be imputed are used for the imputation: for continuous and ordinal features, after removing possible outliers (values more than 1.5 times the interquartile range above the upper quartile or below the lower quartile), the missing feature in i is imputed with the average of the selected values, each weighted by the inverse of the corresponding candidate distance; for categorical features, the missing feature in i is imputed with the mode of the selected values.
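The two imputation rules can be sketched as follows. Some details are assumptions of this example rather than statements about the original implementation: quartiles are taken from `statistics.quantiles` with its default method, a small epsilon is added to the distances to avoid division by zero, and ties in the mode are broken in favour of the first value encountered.

```python
import statistics
from collections import Counter

def impute_numeric(values, distances):
    """Inverse-distance weighted mean of the neighbour values, after dropping
    values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    pairs = list(zip(values, distances))
    if len(values) >= 2:  # quartiles need at least two data points
        q1, _, q3 = statistics.quantiles(values, n=4)
        lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
        pairs = [(x, d) for x, d in pairs if lo <= x <= hi]
    weights = [1.0 / (d + 1e-9) for _, d in pairs]  # epsilon: assumption
    return sum(w * x for w, (x, _) in zip(weights, pairs)) / sum(weights)

def impute_categorical(values):
    """Most frequent neighbour value."""
    return Counter(values).most_common(1)[0][0]

# The value 40 is discarded as an outlier before averaging.
estimate = impute_numeric([10, 11, 12, 13, 11, 12, 40], [0.5] * 7)
site = impute_categorical(["limb", "bulbar", "limb"])
```

With equal distances the weighted mean reduces to the plain mean of the retained values; closer neighbours dominate the estimate when the distances differ.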
The procedure is repeated over all features with missing values in subject i. In our implementation, values previously imputed in i are not used for the subsequent imputations.
Weighted k-nearest neighbours with mutual information
We improved the wkNN algorithm by including the cross-information among the features, given by the mutual information statistic, in the similarity metric (wkNN MI). Unlike correlation metrics, the MI can measure the strength of both linear and nonlinear associations among features.
The MI among features is computed using the infotheo R package v1.2.0 [48]. For two discrete variables X and Y whose joint probability distribution is p_{XY}(x,y)=P(X=x,Y=y), and marginal probability distributions are, respectively, p_{X}(x)=P(X=x) and p_{Y}(y)=P(Y=y), the mutual information between them, denoted MI(X,Y), is computed as:
\[\text{MI}(X,Y)=\sum_{x}\sum_{y} p_{XY}(x,y)\,\log \frac{p_{XY}(x,y)}{p_{X}(x)\,p_{Y}(y)}\]
The marginal and joint probability distributions of X and Y are determined empirically from the data by a frequentist approach. Continuous variables (X) are discretised into \(i=\sqrt [3]{N}\) intervals of equal width w=(max(X)− min(X))/i, where N is the number of samples of X.
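The empirical estimate and the equal-width discretisation can be illustrated with a short, dependency-free sketch. It is not a reimplementation of the infotheo package: the function names are hypothetical, the number of bins is obtained by rounding the cube root of N (the rounding is an assumption), the result is expressed in nats, and missing values are assumed to have been removed beforehand.

```python
import math
from collections import Counter

def discretise(xs):
    """Equal-width binning into round(N ** (1/3)) intervals (at least one)."""
    n_bins = max(1, round(len(xs) ** (1.0 / 3.0)))
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0] * len(xs)
    width = (hi - lo) / n_bins
    # the maximum falls on the right edge, so clamp it into the last bin
    return [min(int((x - lo) / width), n_bins - 1) for x in xs]

def mutual_information(xs, ys):
    """Empirical MI (natural logarithm) between two discrete sequences,
    with probabilities estimated by relative frequencies."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # (c / n) * log( (c / n) / ((px / n) * (py / n)) )
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

bins = discretise([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0])  # N = 8, 2 bins
mi_same = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])      # equals log(2)
mi_indep = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])     # equals 0
```

A variable shares maximal information with itself (its entropy), while two empirically independent variables have zero MI.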
Let f be the index of the feature currently being imputed in subject i, and let \(\text {\bf {MI}}_{f}=\left (\text {MI}_{f,1}, \dots, \text {MI}_{f,f}, \dots,\text {MI}_{f,N} \right)\) be the MI values between the feature at index f and all the features in the sample. The MI values are then employed as weights for the distance computation in the wkNN algorithm:
Please notice that here the distance among samples depends on the missing feature value currently being imputed, which means that the candidates chosen as nearest neighbours may change when imputing different features. An outline of the proposed imputation procedure is given in Fig. 2 and thoroughly described in Algorithm 1.
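The dependence of the neighbourhood on the feature being imputed can be illustrated with a small sketch. The exact weighting scheme is an assumption here: each per-feature term is multiplied by the corresponding MI value and the result is normalised by the sum of the weights of the comparable features, with absolute differences for non-categorical features. The function name `wknn_mi_distance` and the MI values are hypothetical.

```python
def wknn_mi_distance(v, u, kinds, n_static, n_visits, mi_f):
    """MI-weighted distance between samples v and u when imputing feature f.

    mi_f: MI values between feature f and every feature in the sample.
    Assumptions: per-feature terms are weighted by mi_f[l], static features
    are additionally multiplied by n_visits, and the sum of the weights of
    the comparable features is used as normaliser.
    """
    num = 0.0
    den = 0.0
    for l, (a, b) in enumerate(zip(v, u)):
        if a is None or b is None:
            continue  # includes feature f itself, which is missing in v
        term = float(a != b) if kinds[l] == "categ" else abs(a - b)
        w = mi_f[l] * (n_visits if l < n_static else 1)
        num += w * term
        den += w
    return num / den if den > 0 else None

kinds = ["cont", "cont", "cont"]
v = [0.0, 0.0, None]                      # third feature is being imputed
cand_a, cand_b = [0.0, 1.0, 0.3], [1.0, 0.0, 0.7]

# Target feature informative mostly about feature 0: candidate a is closer.
d_a1 = wknn_mi_distance(v, cand_a, kinds, 0, 1, [0.9, 0.1, 1.0])
d_b1 = wknn_mi_distance(v, cand_b, kinds, 0, 1, [0.9, 0.1, 1.0])
# Target feature informative mostly about feature 1: candidate b is closer.
d_a2 = wknn_mi_distance(v, cand_a, kinds, 0, 1, [0.1, 0.9, 1.0])
d_b2 = wknn_mi_distance(v, cand_b, kinds, 0, 1, [0.1, 0.9, 1.0])
```

The same pair of candidates is ranked differently depending on which features carry information about the one being imputed.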
Imputation performance metrics
To evaluate the performance of the developed imputation methods, we employed the normalised root-mean-square deviation (nRMSD) for the continuous and ordinal features and the proportion of falsely classified (PFC) for the categorical ones. Let f be the index of a feature imputed in T patient visits: \(\mathbf{v}_{f}^{\text{imp}}\) is the vector of imputed values for that feature and \(\mathbf{v}_{f}^{\text{true}}\) is the vector of true measured values. If f is the index of a continuous or ordinal feature, the corresponding nRMSD is calculated over the T patient visits as:
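Otherwise, if f is the index of a categorical feature, the corresponding PFC is calculated over the T patient visits as:
\[\text{PFC}_{f}=\frac{1}{T}\sum_{i=1}^{T} I\left(v_{i,f}^{\text{true}}, v_{i,f}^{\text{imp}}\right)\]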
where \(I(v_{i,f}^{\text {true}}, v_{i,f}^{\text {imp}})\) equals 0 if \(v_{i,f}^{\text {true}} = v_{i,f}^{\text {imp}}\), and 1 otherwise.
In order to better analyse and compare the distribution of the error, we also computed the normalised absolute error (nAE) of each imputed continuous or ordinal value. The nAE for the imputed feature f of a given patient visit is given by:
Analysing the nAE distribution for each feature allows us to gain more insight on the quality of the imputation.
In all cases, the closer these metrics are to zero the better the imputation.
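The three metrics can be sketched as follows. The normalisation constant is an assumption: here both nRMSD and nAE are divided by the range of the feature, which is consistent with nAE taking only the values 0, 0.25, 0.5, 0.75 and 1 on items scored from 0 to 4, but the range is passed in explicitly because its exact definition is not reproduced in this text. Function names are hypothetical.

```python
import math


def nrmsd(true, imputed, value_range):
    """Root-mean-square deviation divided by the (assumed) feature range."""
    mse = sum((t - p) ** 2 for t, p in zip(true, imputed)) / len(true)
    return math.sqrt(mse) / value_range


def pfc(true, imputed):
    """Proportion of imputed categorical values that differ from the truth."""
    return sum(t != p for t, p in zip(true, imputed)) / len(true)


def nae(true_value, imputed_value, value_range):
    """Absolute error of one imputed value divided by the feature range."""
    return abs(true_value - imputed_value) / value_range
```

For example, on an item scored from 0 to 4 (range 4), imputing 1 where the true value is 3 gives `nae(3, 1, 4) == 0.5`.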
Selecting the optimal number of nearest neighbours
The proposed wkNN and wkNN MI imputation methods require the user to select an adequate value for the hyperparameter k (the number of nearest neighbours). This can be achieved with a cross-validation scheme that tests different k values and selects the best one. The patients in the dataset are partitioned into a user-defined number of folds. For a given k value, for each patient in a given fold, and for each feature, all the measured values corresponding to that feature are first removed at the same time from the patient’s visits, and then imputed by using all the subjects from the other folds as candidates.
By repeating this procedure for all folds, an imputed value is obtained for each known measurement, and the imputation quality for the current value of k can be assessed by using a chosen performance metric. This procedure can be repeated for several values of k in order to determine the best performing one to be finally used to impute the whole dataset. Moreover, by removing the values of only one feature at a time, the distribution and pattern of missing values in the dataset are generally preserved, which ensures the plausibility of the imputation performance results.
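The selection loop can be sketched as below, under simplifying assumptions: each patient is reduced to a single dictionary mapping feature names to values (with `None` for missing entries) rather than a series of visits, folds are formed by striding over the patient list, and a single error function is applied to all features. The callables `impute` and `error`, as well as the name `select_k`, are hypothetical placeholders for the imputation routine and the chosen performance metric.

```python
def select_k(patients, k_values, n_folds, impute, error):
    """Return the k with the lowest cross-validated imputation error.

    patients: list of dicts mapping feature name -> value (None if missing).
    impute(masked_patient, feature, pool, k): returns the imputed value.
    error(truths, predictions): returns a score where lower is better.
    """
    folds = [list(range(i, len(patients), n_folds)) for i in range(n_folds)]
    scores = {}
    for k in k_values:
        truths, preds = [], []
        for fold in folds:
            held_out = set(fold)
            # Candidates are all the subjects belonging to the other folds.
            pool = [p for j, p in enumerate(patients) if j not in held_out]
            for j in fold:
                for feat, value in patients[j].items():
                    if value is None:
                        continue  # natively missing: no ground truth
                    # Remove one feature at a time, keeping the rest intact.
                    masked = {**patients[j], feat: None}
                    truths.append(value)
                    preds.append(impute(masked, feat, pool, k))
        scores[k] = error(truths, preds)
    best = min(scores, key=scores.get)
    return best, scores
```

In practice, continuous, ordinal and categorical features would be scored separately with the appropriate metric; a single `error` function is used here only to keep the sketch short.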
Enhancing the performance of a survival classification task with data imputation
Patients with ALS exhibit a very high degree of variability in disease susceptibility and pathogenic mechanisms. This is one of the main reasons for the negative results of therapeutic trials conducted so far, as statistical variance masks treatment effects [49, 50]. An optimal trial design requires sample size estimation, which, in turn, requires some understanding of the natural progression of the disease. The accurate prediction of the survival time in ALS patients is of paramount importance, and could aid prognostic counselling, stratification of cohorts for pharmacological trials, and timing of interventions.
In order to evaluate the enhanced potential of the dataset imputed with the proposed method, we implemented a simple survival classification task. The PARALS register contains survival information for each patient, either in the form of date of death for the deceased ones or the date of the last visit for the censored ones. For each subject, we determined the survival outcome as the binary answer to the question “Does the subject survive for more than 3 years (36 months) from his/her first screening visit?”. The patients that were censored before the 36-month threshold were discarded since we were unable to answer the question. The number of patients in the training set was thus reduced to 545 (from the initial 560), and the number of patients in the test set was reduced to 138 (from the initial 140). The 36-month threshold was selected because it splits the patients into two almost equal sets.
For each patient, we built a survival sample, that is, a feature vector able to encode the disease progression in his/her first three months of visits, as follows. For each dynamic feature in this time range, we computed three derived features, namely the minimum, the maximum, and the slope. The slope was obtained by fitting a linear regression model on the time series constituted by the values of the feature collected over the three-month interval. These values were then used together with the static features to construct a fixed-length vector (53 features in total) used as an input sample for our classification task (see Fig. 1(c)). The survival samples constructed on the original data (that is, before imputation) carry over their missing values: missing static features are simply carried over to the constructed samples, while for missing dynamic features a missing value is reported in each derived feature that could not be computed.
For this classification task we employed the naïve Bayes classifier [51] implemented in the e1071 R package v1.7-2 [52].
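The labelling rule can be sketched as a small function. It assumes that each patient is summarised by the number of months between the first screening visit and the last known date (death or last visit) together with a flag telling whether that date is a death; the name `survival_label` and its arguments are hypothetical.

```python
def survival_label(months_observed, deceased, threshold=36):
    """Binary survival outcome, or None when it cannot be determined.

    months_observed: months from the first screening visit to death (if
    deceased) or to the last visit (if censored).
    Returns True if the subject survived for more than `threshold` months,
    False if the subject died within the threshold, and None if the
    subject was censored before the outcome could be observed.
    """
    if months_observed > threshold:
        return True  # alive beyond the threshold, whatever happened later
    if deceased:
        return False
    return None  # censored too early: the patient is discarded
```

Patients for whom the function returns `None` correspond to those discarded from the training and test sets.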
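A minimal sketch of the derived features for a single dynamic feature is given below. It assumes that missing measurements are encoded as `None`, that the slope is the ordinary least-squares slope of value against time, and that the slope is reported as missing when fewer than two distinct time points are observed; how partially observed series were treated in the study is not detailed here, so this behaviour is an assumption. The name `derived_features` is hypothetical.

```python
def derived_features(times, values):
    """Return (minimum, maximum, slope) of one dynamic feature.

    times: visit times (for instance, in days from the first visit).
    values: measurements at those times, with None for missing ones.
    Entries that cannot be computed are returned as None.
    """
    pts = [(t, v) for t, v in zip(times, values) if v is not None]
    if not pts:
        return None, None, None
    ts = [t for t, _ in pts]
    vs = [v for _, v in pts]
    lo, hi = min(vs), max(vs)
    t_mean = sum(ts) / len(ts)
    v_mean = sum(vs) / len(vs)
    sxx = sum((t - t_mean) ** 2 for t in ts)
    if sxx == 0:
        return lo, hi, None  # slope undefined without two distinct times
    # Ordinary least-squares slope of value against time.
    slope = sum((t - t_mean) * (v - v_mean) for t, v in pts) / sxx
    return lo, hi, slope
```

Concatenating these triplets for all dynamic features with the static features yields a fixed-length vector regardless of the number of visits.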
Naïve Bayes models
Naïve Bayes is a simple learning algorithm that utilises Bayes’ theorem in conjunction with the “naïve” assumption that, given the class label, every pair of features is conditionally independent. A NB classifier considers the contribution of each feature to the given class probability as independent, regardless of possible correlations. Although this assumption is often violated in practice, NB classifiers often achieve competitive classification results [53]. Because of their computational efficiency and many other desirable features, NB classifiers are widely used in practice. A brief introduction to the method is reported in Additional file 1.
In order to evaluate the effect of the different imputation techniques on the classification task, and to further assess the performance of the proposed algorithm, we trained five NB models on five distinct sets of survival samples. First, starting from the original non-imputed training set composed of the first three months of patient visits, we built the corresponding training set of survival samples with their native missing values, from here on referred to as original dataset. From this first set we obtained two other sets for the complete case analysis: the complete cases dataset obtained by selecting only the survival samples without missing values, resulting in 252 survival samples, and the complete features dataset obtained by selecting only the features without missing values, resulting in 44 remaining features in the survival samples. Finally, we built two other training sets of survival samples for the classification task by imputing the first three months of patient visits from the training set once with the proposed algorithm (wkNN MI) and once with the best performing competitor.
The models were used to predict the set of test samples obtained from the non-imputed first three months of patient visits in the original test set.
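As a reminder of the standard formulation (not specific to this study, and with symbols introduced here only for illustration), for a class label c and a feature vector \(\mathbf {x}=(x_{1},\dots,x_{p})\) the conditional independence assumption gives

\(P(c \mid \mathbf {x}) \propto P(c)\prod _{j=1}^{p} P(x_{j} \mid c)\)

and the predicted class is the one that maximises this product. A common way of scoring a sample with missing entries, assumed here as an illustration rather than as a description of the specific implementation used, is to restrict the product to the features j that are observed in that sample.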
Results and discussion
Comparison with the other imputation methods
We compared the proposed algorithm with three state-of-the-art imputation methods, namely Amelia II (Amelia R package v1.7.5), missForest (missForest R package v1.4) and MICE (mice R package v3.6.0). We also introduced a random version of our algorithm, k-random neighbours (kRN), that randomly samples a subset of k subjects from the pool of available candidates, to be used as a baseline for the imputation performance assessment. The selection of the optimal hyperparameter values for all the employed imputation methods is reported in Additional file 1.
Performance comparison on the training set
On the training set, the imputation performance was evaluated with the LOOCV setting described earlier: for each subject, all the measured values of his/her features were removed one feature at a time, and were then imputed using the competitor methods. The imputed values obtained by each method were compared to the true ones, and the average error was evaluated for each feature.
Tables 2, 3 and 4 show the average error (in terms of nRMSD or PFC) obtained on the training set for each continuous, ordinal and categorical feature, respectively. The proposed wkNN MI imputation method outperforms the competitors on average and on the majority of the features. For the continuous features, the average nRMSD score obtained by wkNN MI with the optimal k=20 is 0.1195 against 0.1539 of wkNN with the optimal k=10, 0.1651 of Amelia II, 0.1572 of MICE, and 0.1784 of missForest. For the ordinal features, the average nRMSD score obtained by wkNN MI is 0.1182 against 0.1550 of wkNN, 0.1751 of Amelia II, 0.1521 of MICE, and 0.1728 of missForest. For the categorical features, the average PFC score obtained by wkNN MI is 0.1198 against 0.1323 of wkNN, 0.2589 of Amelia II, 0.1761 of MICE, and 0.1900 of missForest. In the three tables, we also report the performance of the kRN baseline, computed for k=10 and k=20: the proposed wkNN approaches outperform the baseline.
To verify that the performance improvement was in fact statistically significant, we analysed the nAE distributions and PFC values obtained by wkNN MI and MICE (the best performing among the competitor methods) on, respectively, the continuous/ordinal and categorical features. Figure 3 shows the nAE distributions obtained on the training set for the continuous features. The plots show that wkNN MI yields lower nAE values in all features. We also performed two-tailed Wilcoxon signed-rank tests [54] to assess the difference between the distributions: the obtained p-values are all smaller than 0.001, confirming that the difference is statistically significant. The Wilcoxon signed-rank test is a non-parametric statistical test used to assess whether the population mean ranks differ in a paired samples setting. This test can be used to determine whether two paired samples were selected from populations having the same distribution. We employed this non-parametric test to assess whether there is any statistically significant difference between the nAE distributions (which are very skewed and cannot be assumed to be normally distributed) obtained on continuous and ordinal data by different imputation methods.
Figure 4 shows the nAE distributions obtained on the training set for the ordinal features. The plots show that wkNN MI yields lower nAE values on 10 out of 12 features (ALSFRS-R scores 1 to 10). We also performed two-tailed Wilcoxon signed-rank tests with Pratt’s correction (since the nAE values on the ALSFRS-R variables can only assume values in {0,0.25,0.5,0.75,1}, the signed-rank test has many “ties”) to assess the difference between the distributions: the obtained p-values are smaller than 0.001 for the ALSFRS-R scores 1 to 10, which confirms that the difference is statistically significant for these features. Lastly, the tests showed that for ALSFRS-R 11 and 12 there was no statistically significant difference between wkNN MI and MICE.
Figure 5 compares the PFC values obtained by wkNN MI and MICE. The plots show that wkNN MI outperforms MICE in all the categorical features, resulting in a significant difference in 6 out of 7 of them, namely in sex, familiality, genetics, FTD, onset site, and NIV, while showing no significant improvement for PEG. We also performed McNemar’s Chi-squared test [55], which confirmed that the difference is statistically significant in these 6 features. McNemar’s Chi-squared test is a statistical test used on paired categorical data. It is applied to 2×2 dichotomous contingency tables with paired samples, to determine whether there is “marginal homogeneity”, that is, whether the row and column marginal frequencies are equal. When comparing two classifiers, each sample can either be classified correctly or misclassified by each classifier, and thus a 2×2 dichotomous contingency table can be built. The null hypothesis of “marginal homogeneity” would mean there is no difference between the two classifiers. The imputation of categorical data can be seen as a classification task, and thus McNemar’s Chi-squared test can be used to determine if the difference between two imputation methods is statistically significant.
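The comparison of two imputation methods on a categorical feature can be sketched as follows. The input is, for each imputed value, whether each method recovered the true category; only the discordant pairs enter the statistic. Whether a continuity correction was applied in the study is not stated, so it is exposed as an option here (enabled by default, which is an assumption). The p-value uses the fact that a Chi-squared variable with one degree of freedom is the square of a standard normal variable. The name `mcnemar` is hypothetical.

```python
import math


def mcnemar(correct_a, correct_b, correct=True):
    """McNemar's Chi-squared test on paired correct/incorrect outcomes.

    correct_a, correct_b: sequences of booleans, one pair per imputed value,
    telling whether method A and method B recovered the true category.
    Returns (statistic, p_value).
    """
    # Discordant pairs: exactly one of the two methods is right.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    if b + c == 0:
        return 0.0, 1.0
    diff = abs(b - c) - (1.0 if correct else 0.0)
    stat = max(diff, 0.0) ** 2 / (b + c)
    # Survival function of a Chi-squared distribution with 1 d.o.f.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

For instance, with 10 values recovered only by method A and 2 recovered only by method B, the corrected statistic is 49/12 and the p-value falls slightly below 0.05.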
Performance comparison on the test set
After selecting the methods’ hyperparameters on the training set, we compared the performance of the proposed imputation method against the competitors on the test set. For each patient in the test set, we removed all the known measurements from his/her visits, one feature at a time, and imputed the missing values by using all the training set subjects as candidates. This setting represents the common situation where new subjects are continuously added to an existing dataset of clinical records and some of their values are natively missing. For Amelia II, MICE and missForest, we bound the records of the first three months of visits for the given patient in the test set with all the information on the training set in a single data frame, which was then used as an input for these imputation algorithms. Finally, we compared the imputed values obtained by each method with the true ones.
The imputation results on the test set are shown in Tables 5, 6 and 7 for each continuous, ordinal and categorical feature, respectively. Results on the held-back test set confirm that the proposed wkNN MI imputation method outperforms the competitors on average and on the majority of the features. For the continuous features, the average nRMSD score obtained by wkNN MI is 0.1332 against 0.1624 of wkNN, 0.1803 of Amelia II, 0.1731 of MICE, and 0.2011 of missForest. For the ordinal features, the average nRMSD score obtained by wkNN MI is 0.1274 against 0.1561 of wkNN, 0.2654 of Amelia II, 0.1542 of MICE, and 0.1740 of missForest. For the categorical features, the average PFC score obtained by wkNN MI is 0.1303 against 0.1456 of wkNN, 0.2646 of Amelia II, 0.1900 of MICE, and 0.1966 of missForest. The baseline was also outperformed by the proposed wkNN approaches.
We also analysed the nAE distributions and PFC values obtained by wkNN MI and MICE (the best performing among the competitor methods) on, respectively, the continuous/ordinal and categorical features. Figure 6 shows the nAE distributions obtained on the test set for the continuous features. The plots and the two-tailed Wilcoxon signed-rank tests show that wkNN MI yields statistically significantly lower nAE values in 5 out of 6 features, namely BMI premorbid, FVC diagnosis, age at onset, diagnostic delay, and onset delta. The two methods did not obtain statistically significant differences in the imputation of BMI diagnosis.
Figure 7 shows the nAE distributions obtained on the test set for the ordinal features. The plots and the two-tailed Wilcoxon signed-rank tests with Pratt’s correction show that wkNN MI yields statistically significantly lower nAE values on 9 out of 12 features (ALSFRS-R scores 1 to 5 and 8 to 11) at the 0.05 level. Lastly, the tests showed that for ALSFRS-R 6, 7 and 12 there was no statistically significant difference between wkNN MI and MICE.
Figure 8 compares the PFC values obtained by wkNN MI and MICE. The plots and the McNemar’s Chi-squared tests show that wkNN MI outperforms MICE in 4 out of 7 categorical features, namely in sex, genetics, FTD, and onset site, at the 0.05 statistical significance level. No statistically significant improvements are obtained for familiality, NIV and PEG.
Survival classification results
In this section we report the results of the survival classification procedure. Figure 9 gives the Precision-Recall (PR) and Receiver Operating Characteristic (ROC) plots of the NB classifiers trained on the five different sets of training samples. These plots were obtained by thresholding on the class label probabilities obtained by the NB classifiers for each survival sample. We also included the PR and ROC plots of a random predictor as a baseline. To ensure that the performance improvement is statistically significant, we computed the absolute classification error of the NB classifiers for each classification sample in the test set. The absolute classification error of each sample was computed as the absolute value of the difference between the class label and the predicted class probability. We performed two-tailed Wilcoxon signed-rank tests to assess the difference between the errors.
As a first result, we observe that the proposed method improves the prediction capabilities of a NB classifier: indeed, the PR curve achieves a perfect precision score of 1.0 over a wider range of recall values. Moreover, the proposed method obtains the highest PR Area Under the Curve (AUC) value of 0.865. The improvement is somewhat less noticeable in terms of ROC curves and ROC-AUCs, although we can see that the proposed method improves the false positive rate, which stays at zero for a wider true positive rate interval. The statistical test on the absolute classification error compared to all the other classifiers obtained p-values smaller than 0.001, confirming that the improvement is statistically significant.
Interestingly enough, the complete cases (PR-AUC=0.833 and ROC-AUC=0.785) and complete features analyses (PR-AUC=0.840 and ROC-AUC=0.790) worsen the prediction quality of the classifier with respect to the original dataset (PR-AUC=0.850 and ROC-AUC=0.796). The p-values of the two-tailed Wilcoxon signed-rank tests comparing the complete cases and complete features analyses with the original dataset are <0.001 and 0.022, respectively, while there is no statistically significant difference between the complete cases and the complete features analyses (p-value =0.379). The loss of information resulting from simply ignoring samples or entire columns with missing data hinders the precision of the classifier. On the other hand, the NB classifier can effectively learn from the survival samples with their native missing values, as reflected by the prediction results.
By comparing the predictions of the NB classifier trained on the original dataset (PR-AUC=0.850 and ROC-AUC=0.796) with the ones trained on the two imputed datasets, we can see how the imputation quality can affect the classification performance: the performance improves when the patient data are imputed with wkNN MI (PR-AUC=0.865 and ROC-AUC=0.816), while it worsens when using the best competitor for the imputation (MICE), as can be seen from its PR and ROC curves, which do not achieve a perfect precision of 1 or a perfect false positive rate of 0 for any interval of recall/true positive rate.
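The per-sample error and a threshold-free summary of the ROC curve can be sketched as follows. Labels are assumed to be coded as 1 for the positive class and 0 for the negative class, and `probs` to hold the predicted probability of the positive class. The ROC-AUC is computed through its rank interpretation (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one), which equals the area under the curve obtained by sweeping the threshold. Function names are hypothetical.

```python
def absolute_errors(labels, probs):
    """Absolute difference between class label (0/1) and predicted probability."""
    return [abs(y - p) for y, p in zip(labels, probs)]


def roc_auc(labels, probs):
    """Area under the ROC curve via pairwise comparison of scores.

    Each (positive, negative) pair counts 1 if the positive sample has the
    higher predicted probability, 0.5 in case of a tie, and 0 otherwise.
    """
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))
```

The two error vectors produced by `absolute_errors` for two classifiers on the same test samples are paired, which is what makes a signed-rank test applicable to them.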
Conclusions
In this work we developed a weighted kNN-based imputation approach, able to plausibly fill in the missing values in an ALS disease register. The best performing method, the proposed weighted kNN with MI with k=20, outperforms the state-of-the-art algorithms in terms of imputation accuracy, on continuous, ordinal and categorical variables.
The advantages of the proposed approach are manifold. While many imputation methods require stringent assumptions on the nature of the missing data, a kNN-based imputation only requires the presence of some relationship between the variable with the missing value and the other variables. The imputed values are always in the dynamic range of the existing data. Furthermore, the selection of a small k parameter ensures a good compromise between performance and the need to preserve the original distribution of the data, a very important characteristic any imputation method should satisfy.
The proposed method employs the MI values between feature pairs as weights in the distance computation of the wkNN procedure. The results show that wkNN MI outperforms the wkNN approach, confirming that the MI can be effectively used to exploit the cross-information of the features for the imputation task.
We showed that the proposed algorithm is able to handle mixed-type data effectively, that is, patient records composed of categorical, ordinal and continuous features, each of which can be either static or dynamic, and with different distributions. In our method, thanks to the sample construction procedure described in the Adaptive kNN Sample Construction section, the temporal evolution of the data over subsequent visits is captured and exploited for the imputation. Furthermore, our method does not require a dataset of complete cases to perform the imputation because of the distance metric used. We only used information from the training set to impute the subjects of the test set in order to simulate the real-world scenario where new subjects populate the disease register a few at a time.
Finally, we provided a simple survival classification task as a potential application example of the proposed imputation method. Our results show that the imputation of the missing values in the training dataset improves the predictions of a Naïve Bayes classifier. Since the NB represents a very simple classification technique, we believe that more complex and sophisticated analyses could also benefit from our imputation method.
For all these reasons, we believe that our method is potentially applicable in diverse contexts where imputation is needed. The final aim of this work is to provide a tool that can enhance the quality and the quantity of the data employed in analytics tasks, to improve and accelerate translational research. Concretely, the tool will allow clinicians to effectively use the information collected in a limited time interval by remedying the possible presence of missing data.
The specific employment of the method in the context of epidemiological ALS registers will enable the development and application of machine learning and data mining methods for the prediction of ALS disease prognosis, as well as the identification of related biomarkers. As novel clinical registers covering wider patient populations and new clinical variables (for instance, new genetic test results, different functional scale measures) become available, missing values arising from the aggregation with older datasets could be imputed with the proposed approach. We also believe that the proposed methodology could be of great aid in other disease registers containing static and dynamic mixed-type data as well.
The proposed algorithm is able to impute missing data in a fixed time window (that is, the first three months of patients’ visits). We plan to extend its imputation capabilities to the patients’ whole visit history with a sliding-window approach. Moreover, other distance metrics with more sophisticated weighting schemes could yield better imputation results. We will investigate these issues in our future work.
Availability of data and materials
The datasets generated and/or analysed during the current study are not publicly available in order to ensure the patients’ rights to privacy and anonymity and to prevent inappropriate secondary analyses. The proposed algorithm was implemented in the wkNNMI R package and is freely available from CRAN at https://cran.r-project.org/package=wkNNMI.
Abbreviations
1-NN: 1-Nearest Neighbours
ALS: Amyotrophic Lateral Sclerosis
ALSFRS-R: Revised ALS Functional Rating Scale
AUC: Area Under the Curve
BMI: Body-mass index
DREAM: Dialogue for Reverse Engineering Assessments and Methods
FVC: Forced Vital Capacity
FTD: Frontotemporal Dementia
GP: Gaussian Process
kNN: k-Nearest Neighbours
kRN: k-Random Neighbours
LOOCV: Leave-One-Out Cross Validation
MAR: Missing At Random
MCAR: Missing Completely At Random
MI: Mutual Information
MICE: Multivariate Imputation by Chained Equations
ML: Machine Learning
MNAR: Missing Not At Random
MVN: Multivariate Normally Distributed
nAE: Normalised Absolute Error
NB: Naïve Bayes
NIV: Non-Invasive Ventilation
NN: Nearest Neighbours
nRMSD: Normalised Root-Mean-Square Deviation
PARALS: Piemonte and Valle d’Aosta Register for ALS
PEG: Percutaneous Endoscopic Gastrostomy
PFC: Proportion of Falsely Classified
PR: Precision-Recall
ROC: Receiver Operating Characteristic
wkNN MI: Mutual Information-weighted k-Nearest Neighbours
wkNN: weighted k-Nearest Neighbours
References
 1
El Morr C, Ali-Hassan H. Healthcare analytics applications. In: Analytics in Healthcare: A Practical Introduction. Cham: Springer: 2019. p. 57–70.
 2
Islam M, Hasan M, Wang X, Germack H, Noor-E-Alam M. A systematic review on healthcare analytics: Application and theoretical perspective of data mining. Healthcare. 2018; 6(2).
 3
Editorial. Ascent of machine learning in medicine. Nature Materials. 2019; 18(407).
 4
Davenport T, Kalakota R. The potential for artificial intelligence in healthcare. Future Healthc J. 2019; 6(2):94–98.
 5
Gogtay N, Thatte U. Survival analysis. J Assoc Physicians India. 2017; 65:80–84.
 6
Xiao C, Choi E, Sun J. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. J Am Med Inform Assoc. 2018; 25(10):1419–1428.
 7
Waljee A, Mukherjee A, Singal A, Zhang Y, Warren J, Balis U, Marrero J, Zhu J, Higgins P. Comparison of imputation methods for missing laboratory data in medicine. Br Med J (BMJ) Open. 2013; 3(8).
 8
Graham J. Missing data analysis: Making it work in the real world. Annu Rev Psychol. 2009; 60(1):549–576.
 9
Rombach I, Gray A, Jenkinson C, Murray D, Rivero-Arias O. Multiple imputation for patient reported outcome measures in randomised controlled trials: advantages and disadvantages of imputing at the item, subscale or composite score level. BioMed Cent (BMC) Med Res Methodol. 2018; 18(1):87.
 10
van Buuren S, Boshuizen H, Knook D. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine. 1999; 18(6):681–694.
 11
Azimi I, Pahikkala T, Rahmani A, Niela-Vilén H, Axelin A, Liljeberg P. Missing data resilient decision-making for healthcare IoT through personalization: A case study on maternal health. Futur Gener Comput Syst. 2019; 96:297–308.
 12
Beaulieu-Jones B, Lavage D, Snyder J, Moore J, Pendergrass S, Bauer C. Characterizing and managing missing structured data in electronic health records: data analysis. J Med Internet Res (JMIR) Med Inform. 2018; 6(1):11.
 13
Donders A, van der Heijden G. J. M. G., Stijnen T, Moons K. Review: A gentle introduction to imputation of missing values. J Clin Epidemiol. 2006; 59(10):1087–1091.
 14
Hori T, Montcho D, Agbangla C, Ebana K, Futakuchi K, Iwata H. Multi-task Gaussian process for imputing missing data in multi-trait and multi-environment trials. Theor Appl Genet. 2016; 129(11):2101–2115.
 15
Yu HF, Rao N, Dhillon I. Temporal regularized matrix factorization for high-dimensional time series prediction. In: Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R, editors. Advances in Neural Information Processing Systems 29. Barcelona, Spain: Curran Associates, Inc.: 2016. p. 847–855.
 16
Honaker J, King G, Blackwell M. Amelia II: A Program for Missing Data. J Stat Softw. 2011; 45(7):1–47.
 17
Luo Y, Szolovits P, Dighe A, Baron J. 3D-MICE: integration of cross-sectional and longitudinal imputation for multi-analyte longitudinal clinical data. J Am Med Inform Assoc. 2017; 25(6):645–653.
 18
Chiò A, Mora G, Moglia C, Manera U, Canosa A, Cammarosano S, Ilardi A, Bertuzzo D, Bersano E, Cugnasco P, Grassano M, Pisano F, Mazzini L, Calvo A. Secular Trends of Amyotrophic Lateral Sclerosis: The Piemonte and Valle d’Aosta Register. J Am Med Assoc (JAMA) Neurol. 2017; 74(9):1097–1104.
 19
Greenland S, Finkle W. A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am J Epidemiol. 1995; 142(12):1255–1264.
 20
Beretta L, Santaniello A. Nearest neighbor imputation algorithms: a critical evaluation. BioMed Central (BMC) Med Inform Decis Mak. 2016; 16(3):74.
 21
Bell M, Fiero M, Horton N, Hsu CH. Handling missing data in RCTs; a review of the top medical journals. BioMed Central (BMC) Med Res Methodol. 2014; 14(1):118.
 22
Peng CY, Harwell M, Liou SM, Ehman L. Advances in missing data methods and implications for educational research. Chap. 3 In: Sawilowsky S, editor. Real Data Analysis. Quantitative Methods in Education and the Behavioral Sciences: Issues, Research, and Teaching. New York: Information Age Publishing: 2007. p. 31–78.
 23
Weber G, Adams W, Bernstam E, Bickel J, Fox K, Marsolo K, Raghavan V, Turchin A, Zhou X, Murphy S, Mandl K. Biases introduced by filtering electronic health records for patients with “complete data”. J Am Med Inform Assoc. 2017; 24(6):1134–1141.
 24
Luo Y, Xin Y, Joshi R, Celi L, Szolovits P. Predicting ICU mortality risk by grouping temporal trends from a multivariate panel of physiologic measurements. In: Proceedings of the Thirtieth Association for the Advancement of Artificial Intelligence (AAAI) Conference on Artificial Intelligence. AAAI’16. Phoenix, Arizona, USA: AAAI Press: 2016. p. 42–50.
 25
Zhang Z. Missing data imputation: focusing on single imputation. Annals of Translational Medicine. 2016; 4(1).
 26
Moritz S, Bartz-Beielstein T. imputeTS: Time Series Missing Value Imputation in R. The R Journal. 2017; 9(1):207–218.
 27
Ray E, Qian J, Brecha R, Reilly M, Foulkes A. Stochastic imputation for integrated transcriptome association analysis of a longitudinally measured trait. Stat Methods Med Res. 2019.
 28
van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. J Stat Softw. 2011; 45(3):1–67.
 29
Azur M, Stuart E, Frangakis C, Leaf P. Multiple imputation by chained equations: what is it and how does it work? Int J Methods Psychiatr Res. 2011; 20(1):40–49.
 30
Rasmussen C. Gaussian processes in machine learning. In: Bousquet O, von Luxburg U, Rätsch G, editors. Advanced Lectures on Machine Learning: ML Summer Schools 2003, Canberra, Australia, February 2-14, 2003, Tübingen, Germany, August 4-16, 2003, Revised Lectures. Berlin, Heidelberg: Springer: 2004. p. 63–71.
 31
Stekhoven D, Bühlmann P. MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics. 2011; 28(1):112–118.
 32
Cao W, Wang D, Li J, Zhou H, Li L, Li Y. BRITS: bidirectional recurrent imputation for time series. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, editors. Advances in Neural Information Processing Systems 31. Montréal, Canada: Curran Associates, Inc.: 2018. p. 6775–6785.
 33
Luo Y, Cai X, Zhang Y, Xu J, Yuan X. Multivariate time series imputation with generative adversarial networks. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, editors. Advances in Neural Information Processing Systems 31. Montréal, Canada: Curran Associates, Inc.: 2018. p. 1603–1614.
 34
Yoon J, Zame W, van der Schaar M. Estimating missing data in temporal data streams using multi-directional recurrent neural networks. (IEEE) Trans Biomed Eng. 2019; 66(5):1477–1490.
 35
Andridge R, Little R. A review of hot deck imputation for survey nonresponse. Int Stat Rev. 2010; 78(1):40–64.
 36
Yenduri S, Iyengar S. Int J Softw Eng Knowl Eng. 2007; 17(01):127–152.
 37
Fournier C, Glass J. Modeling the course of amyotrophic lateral sclerosis. Nat Biotechnol. 2015; 33(1):45.
 38
van Es M, Hardiman O, Chio A, Al-Chalabi A, Pasterkamp R, Veldink J, Van den Berg LH. Amyotrophic lateral sclerosis. The Lancet. 2017.
 39
Huisman M, de Jong S, van Doormaal P, Weinreich S, Schelhaas H, van der Kooi AJ, de Visser M, Veldink J, van den Berg LH. Population based epidemiology of amyotrophic lateral sclerosis using capture–recapture methodology. J Neurol Neurosurg Psychiatry. 2011; 82(10):1165–1170.
 40
Atassi N, Berry J, Shui A, Zach N, Sherman A, Sinani E, Walker J, Katsovskiy I, Schoenfeld D, Cudkowicz M, Leitner M. The PRO-ACT database design, initial analyses, and predictive features. Neurology. 2014; 83(19):1719–1725.
 41
Küffner R, Zach N, Norel R, Hawe J, Schoenfeld D, Wang L, Li G, Fang L, Mackey L, Hardiman O, Cudkowicz M, Sherman A, Ertaylan G, Grosse-Wentrup M, Hothorn T, van Ligtenberg J, Macke J, Meyer T, Schölkopf B, Tran L, Vaughan R, Stolovitzky G, Leitner M. Crowdsourced analysis of clinical trial data to predict amyotrophic lateral sclerosis progression. Nat Biotechnol. 2015; 33(1):51.
 42
Taylor A, Fournier C, Polak M, Wang L, Zach N, Keymer M, Glass J, Ennist D. The Pooled Resource Open-Access ALS Clinical Trials Consortium: Predicting disease progression in amyotrophic lateral sclerosis. Ann Clin Transl Neurol. 2016; 3(11):866–875.
 43
Ong ML, Tan P, Holbrook J. Predicting functional decline and survival in amyotrophic lateral sclerosis. Public Library of Science (PloS) One. 2017; 12(4):0174925.
 44
Kueffner R, Zach N, Bronfeld M, Norel R, Atassi N, Balagurusamy V, Di Camillo B, Chiò A, Cudkowicz M, Dillenberger D, Garcia-Garcia J, Hardiman O, Hoff B, Knight J, Leitner M, Li G, Mangravite L, Norman T, Wang L, The ALS Stratification Consortium, Xiao J, Fang WC, Peng J, Yang C, Chang HJ, Stolovitzky G. Stratification of amyotrophic lateral sclerosis patients: a crowdsourcing approach. Scientific Reports. 2019; 9(1):690.
 45
Hardiman O, Al Chalabi A, Brayne C, Beghi E, van den Berg LH, Chio A, Martin S, Logroscino G, Rooney J. The changing picture of amyotrophic lateral sclerosis: lessons from European registers. J Neurol Neurosurg Psychiatry. 2017; 2016.
 46
Grus J. Data Science from Scratch: First Principles with Python 2nd edn. Sebastopol, CA, USA: O’Reilly Media; 2019.
 47
Cedarbaum J, Stambler N, Malta E, Fuller C, Hilt D, Thurmond B, Nakanishi A. The ALSFRS-R: a revised ALS functional rating scale that incorporates assessments of respiratory function. J Neurol Sci. 1999; 169(1):13–21.
 48
Meyer P. infotheo: Information-Theoretic Measures. R package version 1.2.0. https://cran.r-project.org/package=infotheo. Accessed 27 Apr 2020.
 49
Beghi E, Chiò A, Couratier P, Esteban J, Hardiman O, Logroscino G, Millul A, Mitchell D, Preux PM, Pupillo E, Stevic Z, Swingler R, Traynor B, Van den Berg LH, Veldink J, Zoccolella S. The Eurals Consortium: The epidemiology and treatment of ALS: focus on the heterogeneity of the disease and critical appraisal of therapeutic trials. Amyotroph Lateral Scler. 2011; 12(1):1–10.
 50
Rutkove S. Clinical measures of disease progression in amyotrophic lateral sclerosis. Neurotherapeutics. 2015; 12(2):384–393.
 51
Hand D, Yu K. Idiot’s Bayes–not so stupid after all? Int Stat Rev. 2001; 69(3):385–398.
 52
Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.7-2. https://cran.r-project.org/package=e1071. Accessed 27 Apr 2020.
 53
Zhang H. The optimality of naive Bayes. In: Barr V, Markov Z, editors. Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference (FLAIRS 2004). Miami Beach, Florida, USA: AAAI Press; 2004.
 54
Wilcoxon F. Individual comparisons by ranking methods. Biometrics Bulletin. 1945; 1(6):80–83.
 55
McNemar Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika. 1947; 12(2):153–157.
Acknowledgements
The authors are grateful to Dr. Alessandro Zandonà for his contributions in the early stages of this project.
About this supplement
This article has been published as part of BMC Medical Informatics and Decision Making, Volume 20 Supplement 5, 2020: Selected articles from the CIBB 2019 Special Session on Machine Learning in Healthcare Informatics and Medical Biology. The full contents of the supplement are available at https://bmcmedinformdecismak.biomedcentral.com/articles/supplements/volume-20-supplement-5.
Funding
This research has been partially supported by the University of Padua project C94I19001730001 “Deconstruct and rebuild phenotypes: a multimodal approach toward personalised medicine in ALS (DECIPHERALS)”, by the Italian Ministry of Health grant (Ricerca Finalizzata) RF201602362405 “Identification of genetic and environmental determinants of onset and progression of ALS (INITIALS)”, by the Italian Ministry of Education, University and Research grant for Research Projects of National Relevance (PRIN) 2017SNW5MB, and by the initiative “Departments of Excellence” of the Italian Ministry of Education, University and Research (Law 232/2016). The funding sources had no role in the design and conduct of the study; collection, management, analysis, or interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication. Publication costs are funded by the University of Padua project C94I19001730001 “Deconstruct and rebuild phenotypes: a multimodal approach toward personalised medicine in ALS (DECIPHERALS)”.
Author information
Contributions
ET, SD and BDC designed the study.
ACH, RV and ACA provided the patient data.
RV preprocessed the data.
ET and SD developed the tools, performed the analyses and produced the results.
ET and SD analysed the results and wrote the manuscript.
BDC and ACH acquired the funding and provided the resources.
All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
This study was approved by the local ethics committee of the “Azienda Ospedaliero-Universitaria della Città della Salute e della Scienza di Torino”, University of Turin. Informed consent to participate in the study was obtained from all the patients or their legal representatives. The databases were anonymised according to the Italian law for the protection of privacy.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Tavazzi, E., Daberdaku, S., Vasta, R. et al. Exploiting mutual information for the imputation of static and dynamic mixed-type clinical data with an adaptive k-nearest neighbours approach. BMC Med Inform Decis Mak 20, 174 (2020). https://doi.org/10.1186/s12911-020-01166-2
Keywords
 Imputation
 Missing data
 K-nearest neighbours
 Mutual information
 Naïve Bayes
 Clinical datasets
 Amyotrophic lateral sclerosis