Comparative analysis of predictive methods for early assessment of compliance with continuous positive airway pressure therapy

Background Patients suffering obstructive sleep apnea are mainly treated with continuous positive airway pressure (CPAP). Although it is a highly effective treatment, compliance with this therapy is problematic to achieve with serious consequences for the patients’ health. Unfortunately, there is a clear lack of clinical analytical tools to support the early prediction of compliant patients. Methods This work intends to take a further step in this direction by building compliance classifiers with CPAP therapy at three different moments of the patient follow-up, before the therapy starts (baseline) and at months 1 and 3 after the baseline. Results Results of the clinical trial shows that month 3 was the time-point with the most accurate classifier reaching an f1-score of 87% and 84% in cross-validation and test. At month 1, performances were almost as high as in month 3 with 82% and 84% of f1-score. At baseline, where no information of patients’ CPAP use was given yet, the best classifier achieved 73% and 76% of f1-score in cross-validation and test set respectively. Subsequent analyzes carried out with the best classifiers of each time point revealed baseline factors (i.e. headaches, psychological symptoms, arterial hypertension and EuroQol visual analog scale) closely related to the prediction of compliance independently of the time-point. In addition, among the variables taken only during the follow-up of the patients, Epworth and the average nighttime hours were the most important to predict compliance with CPAP. Conclusions Best classifiers reported high performances after one month of treatment, being the third month when significant differences were achieved with respect to the baseline. Four baseline variables were reported relevant for the prediction of compliance with CPAP at each time-point. Two characteristics more were also highlighted for the prediction of compliance at months 1 and 3. Trial registration ClinicalTrials.gov Identifier, NCT03116958. Retrospectively registered on 17 April 2017. Electronic supplementary material The online version of this article (10.1186/s12911-018-0657-z) contains supplementary material, which is available to authorized users.


Background
Obstructive sleep apnea (OSA) [1] is defined as repeated episodes of shallow or paused breathing during sleep, despite the effort to breathe. This syndrome is caused by complete or partial obstructions of the upper airway leading to daytime sleepiness and impaired cardiopulmonary function. Gold standard treatment of OSA involves the use of a device that administers continuous positive pressure (CPAP) in the respiratory tract of patients [2]. Patients should wear a mask, usually nasal [3,4], to receive the air from the CPAP device. Doctors are responsible for adjusting the air pressure provided by the device, preventing the throat from collapsing or damaging its tissue [5].
Several studies aimed at objectively assess the effects of using CPAP show that this kind of intervention is highly effective in improving symptoms, such as daytime sleepiness, morbidity, and mortality rates related to cardiovascular diseases [6,7]. Although it is highly effective in minimizing OSA symptoms, up to 36% of patients do not use or even discontinue CPAP [8,9].
Different factors have been found that influence the adherence to CPAP treatment, although they have not been reported consistently. In [10] patient demographic features, such as age, sex, and marital status were found to be weak predictors of CPAP adherence, as well as apneahypopnea index (AHI), oxygen desaturation index (ODI), or daytime sleepiness in [11][12][13]. In contrast, an improvement of ODI and deep sleep in patients during the CPAP titration increased the chances of complying with CPAP in [14]. Also, the adherence measured few days after CPAP initiation was shown to be a good predictor of long-term compliance in [15].
Despite these efforts, the reasons that lead to compliance with therapy remain an open field of research. In fact, knowing in more detail the initial factors that determine compliance with CPAP could help reduce dropout rates and the application of personalized treatments to improve patient compliance. Unfortunately, there is a clear lack of clinical analytical tools to support the early prediction of compliant patients.
When statistically analyzing the possible factors that might determine CPAP compliance, we found common complexities that compromise the predictive capacity and robustness of the findings. The limited number of participants, the large number of clinical variables and the quality of collected data are just to name a few. To overcome these limitations the use of machine learning (ML) might be an alternative solution to the aforementioned problems. Despite the numerous ML applications in the medical domain, (e.g. disease diagnosis [16,17]), compliance with therapy is usually constrained to the medication adherence problem [18,19]. Recent ML algorithms (e.g. support vector machines [20] and artificial neural networks [21]) often provide highly accurate predictive models. However, such models lack transparency and therefore their interpretation is difficult [22]. As a consequence, other classification algorithms in the medical field are still preferred (e.g. logistic regressions [23] or decision trees [24]).
The goal of the study presented in this paper is twofold. On the one hand, to provide a comparative analysis of predictive methods of CPAP compliance built using machine learning techniques in different stages of treatment. On the other hand, to define the most important factors associated with CPAP compliance identified by the best predictive methods obtained in the different stages of therapy up to month 6.

Participants
Fifty-one adult patients (> 18 years), diagnosed with OSA (15 or more apneas/hypopneas per hour in an overnight sleep study) and requiring CPAP treatment were recruited at Hospital Arnau de Vilanova (Lleida, Spain). Patients with impaired lung function (overlap syndrome, obesity hypoventilation, and restrictive disorders), severe heart failure, psychiatric disorders, periodic leg movements, pregnancy, other dyssomnias or parasomnias, and/or a history of previous CPAP treatment were excluded. The study was approved by the hospital ethics committee (Approval number: CEIC-1283). All recruited patients signed an informed consent form.
Of the 51 patients originally included in the study, 3 were excluded due to malfunction of the CPAP machine, 5 did not attend the last visit at the sleep unit and 1 patient died during the study.

Datasets
The study variables from the 42 patients were manually collected by lung specialists along four visits at month-0 (baseline or T0), at month-1 (T1), at month-3 (T3), and at month-6 (T6). During the first visit (T0) clinicians gathered 77 features organized in five categories: clinical history (e.g. depression, anxiety, arterial hypertension (HTA), cardiopathy, neurological disease, respiratory disease), symptoms (e.g. irritability, apathy, depression, insomnia), co-morbidities (e.g. diabetes, obesity, dyslipidemia), therapies (e.g. beta blockers, diuretics), sleeping test (e.g. sleeping time, AHI, percentage of the night spent with oxygen saturation < 90% or CT 90) and basal information (e.g. size, weight, BMI, tas, tad, oxygen saturation). In the second visit (T1), after the patients had the CPAP machine at home during one consecutive month, 16 new features related to monitoring were collected (e.g. nightly average use, abandon or adverse effects of the treatment, such as dry mouth, allergies, and cutaneous irritations). At the third month (T3), the same number of features as in T1 were gathered together with 5 more: size, weight, BMI, removed drugs, and added drugs). At month-6 (T6), although some other variables were collected, for the purpose of this study only the average use of nightly hours was considered. Eventually, three datasets (D0, D1, and D3) with an incremental number of features (i.e. D1 features = D0 features + features collected at T1) were created with 77, 93, and 114 features, respectively. The full list of variables is described in Additional file 1: Table S1 of the supplementary material.
In this study, we addressed CPAP compliant users as those who had more than 4 h per night on average during the first 6 months of treatment. Therefore, all samples from each dataset (ds) were labeled using the collected information about nightly hours/use on average of the CPAP device at the end of the month-6. In so doing, 24 (57%) patients were labeled as "compliant", class "1", as they correctly followed the CPAP therapy prescription (more than 4h nightly on average). On the contrary, 18 (43%) patients did not achieve the prescribed treatment (minus or equal than 4 h nightly on average) and they were labeled as "non-compliant", class "0".

Data description
Datasets D0, D1, and D3 collected at time points T0, T1, and T3, respectively, were statistically analyzed for a better understanding of the sample. In this task, the Mann-Whitney U test was used to evaluate the statistical significance of quantitative variables with CPAP compliance and Chi-square tests for qualitative characteristics. Previously, the categorical features were converted into numeric to achieve a homogeneous data type sample. Variables with only two categories were directly mapped into binary values. Variables with more than two categories, given their underlying incremental meaning, were mapped into unsigned ordinal values.

Data cleansing
We carried out a set of empirical tasks, supervised by the lung specialists, to reduce possible noise and redundancy in the datasets. First, given the existence of null values in the datasets, an imputation process was carried out consisting of computing the mode for the categorical characteristics and the mean for the numerical characteristics.
Subsequently, the distributions of the categorical characteristics were analyzed, which revealed variables with few individuals by category. The features with a number of individuals less than a given threshold were removed from the study to avoid the noise they might introduce when building the predictive models. To catch up possible information redundancy in the datasets, we computed the mutual information (MI) [25] score for the categorical variables in a pair-wise manner. From each pair, we kept the variable more statistically significant with the dependent variable using Chi-square test. Among the numerical features, we applied a correlation analysis to detect highly redundant features. Given the existence of non-normally distributed numerical features, we used the Spearman correlation method on all numerical variables in a pair-wise manner. Empirically we set-up a threshold for the correlation scores above of which one feature of the pair was removed (i.e. the feature with the highest P-value). P-values lower than 0.05 were considered statistically significant.

Classification framework
All preprocessed datasets presented common particularities, such as a small number of samples, the presence of missing values, class unbalance and high multidimensionality feature space. To cope with these complexities we designed a classification framework flexible enough to enable the execution of heterogeneous pipelines or sequence of configurable machine learning steps. In particular, the pipelines were composed of three mandatory steps (i.e. imputation, variance filtering and data standardization), two optional steps (i.e. feature selection and feature sampling), and two more final steps (i.e. classifier training and evaluation). In total 80 pipelines were configured from 4 feature selection methods, 5 classifier algorithms, 2 sampling strategies, and 2 evaluation metrics.
The result of running (i.e. training or building) a pipeline (Pipe i ) on a dataset (D j ) with parameters (params i ) is a predictive model or classifier (M i,j ) with its associated predictive performance (Perf i,j ). Figure 1 sketches the scheme with the input, output, and the different steps that configure the pipelines for compliance with CPAP therapy.
The first step of a pipeline is the imputation of null values. To do this, given the small proportion of null values in the datasets, a simple strategy was proposed to replace the null values with their most frequent value (for categorical characteristics) and with the mean value (for numerical characteristics).
The second step consists of a simple filter method to eliminate features with zero variance, that is, to eliminate these characteristics that have the same value in all the samples and that do not provide any additional information to the dataset. Since the data come from different sources, the next step is to standardize them. This step consisted of homogenizing all features to zero mean and variance one. This transformation step is crucial for the construction of many classification algorithms since it allows them to compare features without harming their performance or execution time [26].
Feature selection (fs) was introduced in the pipelines given a large number of features compared with the number of samples for each dataset (p > n). This type of methods aims to reduce over-fitting by improving model performance and generalization, to provide faster and more cost-effective models and simplify models making them easier to interpret [27]. Feature selection methods are usually divided into three categories: filter, wrapper, and embedded [28]. Filter methods, in general, examine features individually with respect to the class, wrapper methods use a predictive model to generate subsets of features evaluated according to their predictive power, and embedded methods search for an optimal subset of features during the training of the prediction model. In this study, we used one method for each of the different feature selection strategies. For the filter-based strategy, we defined a simple method (Combine_fs) that makes a ranking of the features by their statistical significance with the class (i.e. applying ANOVA or chi-squared tests according to the data type of the characteristics). Then, this method returns the subset of features through a configurable threshold. For the wrapper strategy, we proposed the recursive feature elimination (RFE_RF_fs) method [29] configured with a random forest to provide the importance of the features. The embedded strategy was in charge of the application of the least absolute shrinkage and selection operator (Lasso_fs) [30]. To this purpose, we used a logistic regression model configured with the L1 norm. It was also considered the possibility of not using any method of feature selection. The next step is data sampling (sm). In particular, we proposed the use of the synthetic minority over-sampling technique (Smote) [31]. This technique consists of creating synthetic samples (i.e. detecting similar instances and performing small perturbations in their values) of the under-represented class samples instead of creating copies, as the over-sampling method would do. In particular, we used this technique to balance the minority class with the same number of instances of the majority class. The main idea behind this method is to avoid the bias produced by many standard classifier learning algorithms towards the class with a larger number of instances. As in the feature selection pipeline step, we also considered the option of not using data sampling in the experiments.
Regarding the training and evaluation stage, we selected several classification algorithms (cls) to deal with various classification strategies (i.e. linear, non-linear, distancebased, and tree-based). In fact, the provision of different classification strategies is especially appropriate in complex datasets when the distribution of data is not clear. As already mentioned, the interpretability of the resulting predictive models is also a desired condition. Therefore, we opted for logistic regression (LR) [23], k-nearest neighbor (k-NN) [32], and random forest (RF) [33] for the subset of interpretable classification algorithms (referred as "descriptive", hereinafter). In contrast, we chose support vector machines (SVM) [20] and artificial neural networks (NN) [21] for the subset of algorithms with less interpretative capacity but with a potential greater discriminatory capacity (referred as "non-descriptive", hereinafter).

Evaluation setup
In order to ensure adequate performance evaluation, the available data were shuffled, stratified, and randomly divided into train (29 rows, 12 non-compliant, and 17 compliant) and test (13 rows, 6 non-compliant, and 7 compliant) sets with a ratio of 70/30. Therefore, the training set partitions of the three datasets (D0, D1, and D3) contained the same individuals. The same rule applies for the test set.
Test sets remained untouched until the end of the process. Training sets were used for 10-fold cross-validation to enable proper model tuning and evaluation. This technique is particularly suitable when the sample size is small. Indeed, as suggested in [34], the entire sequence of processes that composed each pipeline was wrapped-up within the cross-validation technique in order to reduce the possibility of biased results. Thus, training data were shuffled and randomly split into stratified train-validation sets (20 rows, 8 non-compliant, and 12 compliant) and stratified test-validation sets (9 rows, 4 non-compliant, and 5 compliant) following a ratio of 70/30. Then, for each of the configured pipelines (i.e. 80 pipelines), we created as many experiments as combinations of values for the different hyper-parameters defined for each method of the pipeline (Table 1).
We performed 10-fold cross-validation for all experiments in each pipeline. This process was repeated for each of the proposed learning metrics (i.e. f1-weighted and precision-weighted). The learning metric, f1-weighted  (f 1), was selected since it is a suitable measure for unbalanced datasets. This metric combines the precision and recall metrics weighted by the number of samples per each class. The other selected metric, precision-weighted (prec), tends to prefer classifiers with less incorrect compliant predicted patients, which indeed they are the most harmful cases to avoid. This metric computes the ratio of correctly classified cases (i.e. compliant patients) among all positive classified cases weighted by the number of samples per each class. As a result of this 10-fold cross-validation process, we reported for each experiment the average and standard deviation performance of the learning metric which the pipeline was configured with. A greedy-search strategy was applied to select the best experiment, i.e. best pipeline parameterization. Additionally, with the intention to avoid the possible bias introduced in this process to the validation dataset [35] and to reduce the chances of having obtained similar folds with repeated individuals, we evaluated the best parameterization of each pipeline (i.e. pipelines with the appropriated values for their hyper-parameters) using a final outer stratified 10fold cross-validation on the training data (i.e. with learning metric=f1-weighted, ratio=70/30 for cv-train and cvtest). As a result of this process we reported the final cross-validation performance (f1-score) of the pipelines. This score was provided by the f1-weighted metric since although having a high precision is desirable, a high recall (rec) is also needed especially for health institutions since it reduces false negatives and thus non-necessary clinical interventions and additional costs.
This whole process was repeated 80 times for all pipelines of each dataset. The best pipelines of each dataset were identified by ranking the cross-validation performances (i.e. f1-score) reported by each pipeline. The best pipelines of each dataset were compared together in order to find statistically significant differences. We did the same among the best descriptive and non-descriptive pipelines. To do this, we used a 10-fold cross-validated paired t-test.
To complete the reporting of this analysis, we computed on the test set and for each of the best pipelines of each dataset a comprehensive set of scores (i.e. f1, precision, recall, AUC, and confusion matrix) to enable a better understanding of the results. Also, we reported their ROC and learning curves.

Feature importances
We reported the most important features from the best descriptive pipelines. Let us note that, in this study, we only used those configured with random forest and logistic regression. for each dataset. To do this, we performed a ranking of features using a stability score [36]. This score measures how "stable" the features of a predictive model are. For that reason, we build a pipeline n times using n random subsets of fixed size s.
To compute this score, we created n = 100 randomly stratified partitions from the s = 70% of the entire dataset. For each data partition, we trained the best pipelines and keep a record of the selected features of the classifier and their weights (or feature importances for RF classifier). With this information, we computed the number of times any feature was selected (i.e. stability score) and its normalized absolute average weight (or importance). Since one classifier might report all features as relevant (i.e. non-zero), a threshold (t > 0.4) in the weights was empirically defined to make usable the stability score.

Data cleansing
A descriptive analysis of the initial datasets was carried out and summarized in Tables S2, S3 and S4 of the Additional file 1. Subsequently, the data cleansing process was conducted under the supervision of the lung specialists. From this process, we found out that (11/27/42) features had null values with ratios between 2.3% and 12% from the total number of rows. After a null imputation, we found (14/7/10) variables with underrepresented categories (<= 10% of rows per category). We also detected 4 pairs of categorical features (i.e. no active, anti-depressives, ADO, and memory disorders) in the D0 with MI scores above 50%. From the correlation analysis applied on the numerical variables, we found 4 highly (> 80%) redundant features in D0 (i.e. abdomen and hip circumference, weight and CT90%) and 4 features in D3 (i.e. size_3, weight_3, bmi_3 and total_use_hours_3). After the validation by the specialists of setting aside the aforementioned features, we obtained the final datasets for the classification analysis. These were composed of (54/63/70) variables, respectively.

Classification analysis
We evaluated 76 out of 80 initially configured pipelines for each dataset. In particular, we rule out pipelines which had same classification algorithm (i.e. random forest) for feature selection and classification given their initial poor contribution to the experimental results and the long runtime required to complete their evaluation.
To visually support these values, Figs. 2, 3, and 4 show the area under the receiver operating characteristic (ROC) curves and Figs. 5, 6, and 7 show the learning curves with the effects of increasing the size of the training set in their performances.
Regarding the comparison of performance between descriptive and non-descriptive pipelines, the best descriptive pipelines obtained f1-scores of 0.69 +/-0. 15 Table 4.
To complete the analysis, we extracted the most important features used by the best descriptive pipelines. In total (25/28/20) features were reported for each dataset after setting up a minimum threshold of 0.4 in the feature weights. Top-10 features of D0 and D1 reported stability scores above 0.6. In D3 only 2 features were above 0.4 of stability score. Figures 9, 10, and 11 provides a visual ranking of the features of each of the dataset ordered by the stability score.

Discussion
CPAP compliance has been demonstrated to reduce cardiovascular risk and symptoms in patients with sleep apnea. Therefore, an automatic support to the early identification of patients with poor compliance with CPAP therapy would allow personalized treatments and better management of hospital resources. Consequently, in this article we proposed the creation of classifiers that predict compliance with CPAP therapy in patients with OSA in the early stages of treatment.
One of the main challenges of this study has been to build predictive models from small data [37]. In this scenario, models tend to over-fit training and even validation data, which means they are not able to generalize well for previously unseen data. To overcome this problem, first, under the supervision of pulmonary specialists, we performed an initial process of data cleansing to eliminate the non-relevant and noisy characteristics of the dataset. Secondly, we set up several machine learning experiments (e.g. using data sampling, feature selection or one of the different proposed machine learning algorithms) to obtain those with the best generalization performance for each dataset. To do this, we proposed the use of machine learning pipelines to carry out all the steps of the learning experiment together for each of the folds of the crossvalidation. In this way, we avoided the knowledge leakage from the training to the evaluation stage. Thirdly, we used a nested cross-validation [38] to train, adjust, and evaluate the models. This mechanism reduces the chances of reporting over-fitted models since the validation scores are reported through a second level of cross-validation, once the models are tuned on the first level of crossvalidation. Lastly, we provided a thorough evaluation of the best models to understand their performance. The results of this evaluation are presented in Table 2 and show that the performance of the best predictive models, both for validation and test datasets, are high (i.e. greater than 0.7 for D0 and higher than 0.8 for D1 and D3 of precision, recall, and f1 scores) and close to each other (i.e. with only 0.02 and 0.03 difference in f1) for each of the analyzed datasets. These values reflect that current models are reliable and generalize well in data partitions that were not used for training. In addition, as can be seen in the Figs. 5, 6, and 7, validation curves reach the highest values when all training data are available to build the classifiers. However, as these are still a bit far from the training curves but the tendency is to get together, getting more data for training could be beneficial to obtain more accurate models (especially for datasets D1 and D3).
Going deep into the classification results obtained at three different moments of CPAP treatment, they show that dataset D0, collected before the start of treatment, is the most complex to learn compared to D1 and D3, collected in months 1 and 3 of the patient's therapy, Fig. 7 Learning curves of the best pipeline for dataset D3 respectively. Nevertheless, a considerable average f1-score of 0.73 +/-0.18 was achieved in cross-validation and 0.76 in the test set. As shown in Table 5, an important performance increase of 0.09 (p = 0.12) was reached in D1 with respect to D0. In D3, we get the best classification performance with a significant increase of 0.14 in f1-score (p = 0.024) with respect to D0. This same trend occurred in the test set where the best pipelines of D1 and D3 reported performances of 0.18 above the achieved in D0. The difference in f1-score between D3 and D1 did not prove to be significant (p = 0.30). These results seem to confirm that follow-up measurements help to increase baseline prediction performance. Indeed, the closer to the CPAP compliance cut-off we are the more confident the classifier is (i.e. performance in D0 is lower than performance in D1 and lower than D3). In addition, patterns of CPAP adherence appear early, in our case at month 1, since it is when the greatest increase in performance is achieved. This same finding was confirmed by other studies [15,39].
During the evaluation step, we noticed that the use of sampling, feature selection or a particular learning metric  were not as substantial as expected in any of the datasets (Table 3). To be more specific, the maximum increase in performance, regardless of the method used, was between 0.02 and 0.04 of f1-score in cross-validation. Probably this confined contribution was due to the initial preprocessing and the fact that the data were not severely unbalanced. Indeed, in D0 and D1 the use of feature selection compromised the performances with a maximum decrease of 0.13. In contrast, important increments of performance in all datasets were produced depending on the classification algorithm used (i.e. 0.14, 0.21, and 0.20). In D0 and D1 best pipelines were using an SVM. This result was not surprising because this algorithm has been already found suitable for problems with few samples and with a high number of features [40], being able to build complex non-linear decision boundaries. In D3, the best pipeline was configured with an RF, although the one with SVM also provided a high score. RF algorithm is also suitable for difficult problems and especially indicated for handling categorical features [33]. The other non-descriptive classifier (i.e. NN) reported competent performances in all three datasets, especially in D3, where it exceeded the results reported  by the best pipeline configured with an SVM. However, the k-NN algorithm reported the worst scores of the three datasets. This is partly because it does not usually work well with a large number of features [41].
Focusing on the differences of performance between the best descriptive and non-descriptive pipelines, those were always lower than 0.1 and not significant (p = 0.14) for D0 but significant in D1 (p = 0.02). In contrast, the best performance in D3 was achieved through a descriptive classifier although in cross-validation the difference in performance with the best non-descriptive pipeline (0.02+/-0.16 of f1-score) proved to be non-significant (p = 0.69).
Regarding the relevant factors related to the CPAP compliance prediction, four baseline features were found common in each of the best pipelines for the different datasets collected at time-points (T0, T1, and T3). Those were headaches, psychological symptoms (i.e. irritability, apathy, and depression), arterial hypertension and the visual analog scale (as part of EuroQol questionnaire).
From these four features, the headache was the most stable feature (i.e. with the highest stability score) at T0 and T1. In the baseline, all these characteristics were found significant with respect to the CPAP compliance but the latter (p = 0.079). In particular, compliant patients were more likely to not having headache (85%, 23 out of 27) nor psychological symptoms (67%, 18 out of 27), having arterial hypertension (74%, 20 out of 27) and worst visual analog scale score (9.15+/-1.02 on mean difference with respect to non-compliant patients). To the best of our knowledge, these features together have not previously been reported as relevant to predict patient compliance with CPAP therapy at either month 0, 1, and 3. In the literature, having morning headache was also found significant in a randomized control trial of OSA patients [42]. On the contrary, psychological factors did not show prediction capability in [43][44][45] but how patients were challenging difficult situations (active versus passive) [46]. In [47], authors highlighted the positive effect of CPAP treatment on blood pressure in patients with resistant hypertension. The visual analog scale, used as a generic method for measuring the quality of life, was reported useful to track treatment-induced changes in [48,49].
Different studies [50][51][52] have shown an improvement in snoring, gastro-esophageal reflux and oxygen saturation with CPAP treatment. In our case, they were found among the characteristics with the highest stability scores for the best pipelines of month-0 and month-1. However, only oxygen saturation (p < 0.001) predicted good compliance with CPAP.
Two of the features collected at months 1 and 3 (i.e. average hours of nightly CPAP use and Epworth) were found as the most important predictive features in these time-points. These features were significant regarding CPAP compliance. Interestingly, from months 1 to 3 the average of nightly hours of use for compliant users increased (from 5.9+/-1.51 to 6.17+/-1.29) while in non-compliant users decreased (from 4.4+/-1.75 to 3.56+/-1.76). On the contrary, the average of Epworth for compliant users decreased from 5.48+/-3.63 to 4.64+/-3.07, while for non-compliant increased from 7.33+/-3.7 to 8.46+/-4.16). Early measurements of the average hours of nightly CPAP use were already reported as predictive of CPAP compliance in [53,54]. Epworth was also reported as a relevant predictor of compliance in [15,[54][55][56]. The limitations of this study can be summarized in the following two points. First, although a common cutoff was selected for the definition of CPAP compliance, changes in this threshold might cause different performances as well as variations in the rank of the feature importance reported in this work, thus further explorations are required in this regard. Second, despite the positive scores obtained by the predictive models at the different time-points, the small number of individuals in the sample makes it highly recommendable to validate the obtained results with new data.
Finally, let us point out that the work proposed in this paper is part of the myOSA project (RTC-2014-3138-1), aimed at developing new ICT tools to support the OSA treatment. Under the umbrella of myOSA, we created an IoT system [57] that remotely monitors the patient's CPAP devices to provide indicators of progress such as early compliance, adherence level, as well as personalized recommendations to empower the patients. Moreover, we are currently investigating how to apply these predictive methods also in the context of the H2020 project CONNECARE (ID: 689802), in which we focus on patient's monitoring with the final goal of providing selfmanagement features to people in needs, such as chronic patients [58].

Conclusions
To the best of our knowledge, this article is the first attempt to analyze and compare the compliance with the CPAP therapy of patients with OSA at different points of the treatment by building classifiers. Three time-points were established to perform the analysis (i.e. before the treatment starts, at month 1, and at month 3). To build and evaluate the classifiers a flexible framework was designed relying on machine learning pipelines. High performances were reached yet after one month of treatment, being the third month when significant differences in performances were achieved with respect to the baseline. Four baseline variables were found relevant for the prediction of compliance with CPAP at each time-point. Two characteristics more, collected during the follow-up, were also highlighted for the prediction of compliance at months 1 and 3. Further tasks are devised to extend the present study, including the collection of new patients and exploring other CPAP compliance cut-offs, in order to validate the findings and reported performances. This work aims to take a step forward towards the creation of new tools to allow early and accurate detection of patients struggling to follow the CPAP treatment and thus enable personalized patient interventions that would lead to improving their quality of life.