The study was conducted using secondary data from the 2016 Ethiopian Demographic and Health Survey (EDHS) dataset. The EDHS is a population-based cross-sectional survey that was carried out from January 18 to June 27, 2016, in Ethiopia. Ethiopia has conducted nationally representative household surveys every five years to collect information on a variety of population, health, nutrition, and maternal and child health issues [8]. Hence, the 2016 EDHS collected nationwide data that allowed the calculation of key demographic indicators such as contraceptive discontinuation.
Source and study population
Source population
All women aged 15–49 in Ethiopia who had ever used contraceptives within the five years prior to the 2016 EDHS survey.
Study population
All women aged 15–49 in the selected enumeration areas who had ever used contraceptives within the five years prior to the 2016 EDHS survey.
Sample size determination
In the 2016 EDHS, a total of 18,008 households were selected for the sample, and 16,650 occupied households were successfully interviewed, yielding a response rate of 98%. Among these households, 16,583 eligible women were identified for interviews, of whom 15,683 were interviewed, achieving a 95% response rate [8]. The sample in this analysis was weighted to adjust for non-response and variations in the probability of selection. Furthermore, the sample was limited to women who had ever used any contraceptive method within the five years before the survey. Thus, the analysis was restricted to a weighted sample of 6,737 reproductive-age women. The data for predicting contraceptive discontinuation were obtained from the woman's questionnaire.
Study variables
Dependent variable
Contraceptive discontinuation status was the dependent variable, dichotomized into two categories: discontinued and not discontinued.
Independent variables
The independent variables were adopted from previous studies [12, 14, 27, 33,34,35] and include the woman's age, residence, marital status, religion, woman's occupation, sex of the household head, region, employment, woman's education, husband's education, internet use, mobile phone ownership, wealth index, television ownership, radio ownership, khat chewing, alcohol drinking, cigarette smoking, fertility preference, parity, family size, husband's desire for a child, and the woman's abortion history.
Operational definition
Contraceptive discontinuation was computed from the reproductive calendar, which consists of two columns: column 1 records births, pregnancies, terminations, and contraceptive use, and column 2 records reasons for discontinuation of contraceptive use. At the end of each episode of contraceptive use recorded in column 1 of the calendar, the interviewer asks additional questions to ascertain the reason for discontinuing the contraceptive method (modern or traditional) and records the code for that reason in column 2, in the row corresponding to the month in which use of the method ended. Discontinuation of a method is recorded if the box for the month in column 1 specifies any method use and column 2 specifies a reason for discontinuation; if column 2 is blank, this indicates continued use.
During variable recoding, this information was obtained from V302A (ever used anything or tried to delay or avoid getting pregnant) and V360 (reason for last discontinuation). Thus, if a woman had ever used any contraceptive method and mentioned a reason for the last discontinuation of a method, she was classified as a discontinuer; if she did not mention any reason for last discontinuation, she was classified as a non-discontinuer.
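The recoding logic above can be sketched in a few lines of pandas. This is a minimal illustration, not the study's actual script: the toy DataFrame and its values are hypothetical, and only the two DHS variable names (V302A, V360) come from the text.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: a woman who ever used a contraceptive method
# (V302A == 1) and reported a reason for her last discontinuation
# (V360 not missing) is coded as a discontinuer (1); an ever-user with
# no reported reason is coded as a non-discontinuer (0).
df = pd.DataFrame({
    "V302A": [1, 1, 1, 0],                # ever used anything to delay/avoid pregnancy
    "V360":  [2.0, np.nan, 5.0, np.nan],  # reason for last discontinuation (NaN = none)
})

# Restrict to ever-users, then derive the binary outcome.
ever_users = df[df["V302A"] == 1].copy()
ever_users["discontinued"] = ever_users["V360"].notna().astype(int)
print(ever_users["discontinued"].tolist())  # -> [1, 0, 1]
```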
Data processing and analysis
This study employed the general framework used in a prior study [36], built on Yufeng Guo's 7 Steps of Machine Learning [37], to predict contraceptive discontinuation. In his article, Yufeng Guo outlines seven steps of supervised machine learning: data collection, data preparation, model selection, model training, model evaluation, parameter tuning, and making a prediction. The machine learning algorithms for this study were implemented using the scikit-learn [38] (version 1.1.1) and xgboost [39] (version 1.6) packages in Python, using Jupyter Notebook.
Data collection
The dataset for this study is available on the Measure Demographic and Health Survey website and was obtained upon formal request. The data contain a weighted sample of 6,737 reproductive-age women who had ever used a modern or traditional method of contraception.
Data preparation/pre-processing
The process of transforming data, which includes data cleaning, exploratory data analysis, normalization, and dimensionality reduction, can have a profound impact on a model's performance. The data preparation techniques employed in this study were data cleaning, feature engineering, dimensionality reduction, and data splitting.
Data cleaning
Data cleaning is the first step after the data are retrieved and consists of detecting and removing outliers, handling missing values, and handling imbalanced categories of the outcome variable. In this study, the k-nearest neighbors (KNN) imputation technique was used to impute missing values of the independent variables; in a previous study [40], KNN imputation provided a more robust and sensitive method for missing-value estimation. Another data cleaning task was handling imbalanced data, i.e., a dataset in which one category of the outcome variable dominates while the other is underrepresented. Machine learning models trained on imbalanced data are typically biased toward the majority class and fail to predict rare/minority-class cases [41]. The problem of imbalanced data is now well recognized, and there are various approaches to address it [42]. Hence, an oversampling technique, the Synthetic Minority Oversampling Technique (SMOTE) [43], was used to balance the training data.
Feature engineering
The process of transforming raw data into features that better represent the underlying problem to predictive models, resulting in improved model accuracy on unseen data, is known as feature engineering [44]. Among the various feature engineering techniques, categorical variables were encoded into numeric values through the preprocessing module of the scikit-learn package: one-hot encoding for nominal variables and label encoding, which codes each category of a variable as a number.
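The two encodings can be sketched with scikit-learn's preprocessing module. The example data are hypothetical stand-ins for the survey variables:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical features standing in for the survey variables.
df = pd.DataFrame({
    "residence": ["urban", "rural", "rural", "urban"],     # nominal -> one-hot
    "education": ["none", "primary", "secondary", "none"], # -> integer codes
})

# One-hot encoding expands a nominal variable into one 0/1 column per category.
residence_ohe = OneHotEncoder().fit_transform(df[["residence"]]).toarray()

# Label encoding assigns each category an integer code (alphabetical order).
education_codes = LabelEncoder().fit_transform(df["education"])
print(residence_ohe.shape, education_codes.tolist())  # (4, 2) [0, 1, 2, 0]
```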
Dimensionality reduction
Dimensionality reduction refers to reducing the number of input variables for a predictive model. Fewer input variables can lead to a simpler predictive model, which can perform better when making predictions on new data [45]. There are two approaches to dimensionality reduction: feature selection and feature extraction, with the latter being more appropriate for pattern recognition or image processing [36]. Feature selection uses statistics to assess the relationship between the independent input variables and the outcome variable and selects the independent variables with the highest importance for predicting the target variable. Hence, feature selection was performed with the feature importance technique (implemented in the feature importance property of the ML models) to identify the most important predictors of contraceptive discontinuation. Feature importance is the average measure of how significant a feature is in comparison to the other features used in the ensemble model to predict the outcome variable; higher feature importance means that the feature is used more frequently to differentiate one outcome from the other [46]. This technique has been widely used in previous public health studies to identify predictors of various health outcomes [46,47,48,49]. Furthermore, it has been shown to outperform other feature selection methods such as Boruta and recursive feature elimination, making it extremely useful and efficient in selecting the important variables [50].
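As a minimal sketch of the technique on synthetic data, the `feature_importances_` attribute of a fitted tree ensemble ranks features directly (the toy data and model settings here are illustrative, not the study's):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data; in the study the features would be the encoded survey variables.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# feature_importances_ sums to 1; argsort ranks features from most to
# least important, and the top-ranked ones would be retained.
order = np.argsort(rf.feature_importances_)[::-1]
print("top 3 feature indices:", order[:3])
```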
Data split
Every machine learning algorithm needs training and test/validation data to train the model and validate it on data it has never seen before. A simple 80/20 split, in which 80% of the data is used for training and the remaining 20% for testing the model, was used. In addition, tenfold cross-validation was used for model training, as it does not waste a lot of data, which is a big advantage when the number of samples is small [38]. K-fold cross-validation divides all observations into k equal-sized groups of samples called folds; k−1 folds are used to train the prediction function, and the remaining fold is used for testing, repeated k times [38]. The k-fold cross-validation performance measure is the average of the values computed in the loop.
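The split and cross-validation scheme can be sketched with scikit-learn's model_selection utilities (synthetic data and a logistic regression stand-in, not the study's actual run):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, random_state=42)

# 80/20 hold-out split; the 20% test set is reserved for final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Tenfold cross-validation on the training portion; the reported score
# is the mean of the ten per-fold scores.
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_train, y_train, cv=10)
print(len(scores), round(scores.mean(), 3))
```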
Model selection
After the data had been prepared and divided into training and testing sets, suitable models were selected to perform the training. Since the outcome variable was categorical, the task was a classification task, and appropriate classifiers needed to be selected to conduct the prediction. The dataset used in the analysis falls under the category of binary classification, since contraceptive discontinuation was categorized into two mutually exclusive categories. The classification algorithms used for this analysis were logistic regression (LR), random forest (RF), KNN, artificial neural network, support vector machine (SVM), naïve Bayes, eXtreme gradient boosting (XGBoost), and AdaBoost classifiers. These algorithms were selected based on previous studies that applied machine learning techniques to classification tasks on EDHS data [46,47,48, 51,52,53].
Model training
Following model selection, the selected classifiers were trained with both balanced and unbalanced data, and their performance was compared through tenfold cross-validation. After this comparison, the best predictive model was selected and trained with the balanced training data for the final prediction on unseen test data.
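A comparison of this kind can be sketched by scoring each candidate with tenfold cross-validation; this is an illustration on synthetic data with a subset of the classifiers named above, not the study's actual comparison:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, random_state=42)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "NB": GaussianNB(),
}

# Mean tenfold cross-validated accuracy per candidate; the best scorer
# would then be refit on the balanced training data.
results = {name: cross_val_score(m, X, y, cv=10).mean()
           for name, m in models.items()}
for name, score in results.items():
    print(name, round(score, 3))
```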
Model evaluation
Model evaluation was performed after training to determine how well the model performs, by testing it on previously unseen data that was reserved for this purpose during data splitting. The confusion matrix, which is a simple cross-tabulation of the actual and predicted categories of the outcome variable, is a common method for analyzing the performance of a classification model [54]. From the confusion matrix, performance metrics such as overall accuracy, precision, recall, and F1 score can be calculated, all of which were used in this study to compare the performance of the selected classifiers. Furthermore, the receiver operating characteristic (ROC) curve was used to visualize the performance of the ML models.
The confusion matrix and the different metrics derived from it are adapted from [55] and presented in Table 1.
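The derived metrics can be verified on a small hand-worked example (the toy labels below are hypothetical; 1 = discontinued, 0 = not discontinued):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy ground truth and predictions for eight women.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# ravel() on a 2x2 confusion matrix yields (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                   # 3 1 1 3
print(accuracy_score(y_true, y_pred))   # (tp + tn) / total = 6/8 = 0.75
print(precision_score(y_true, y_pred))  # tp / (tp + fp)    = 3/4 = 0.75
print(recall_score(y_true, y_pred))     # tp / (tp + fn)    = 3/4 = 0.75
print(f1_score(y_true, y_pred))         # harmonic mean     = 0.75
```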
Hyperparameter tuning
A model hyperparameter is a configuration external to the model whose value must be specified by the user, as it cannot be estimated from the data [56]. A simple example of a hyperparameter is the number of neighbors (K) in the K-nearest neighbors algorithm, which must be specified manually. For this study, hyperparameter tuning was done through the Optuna [57] framework. According to its authors, Optuna formulates hyperparameter optimization as the process of minimizing/maximizing an objective function that takes a set of hyperparameters as input; it uses a Bayesian framework to better estimate the probability of the optimal values and to avoid unnecessary computation on non-performing parameter combinations while searching for the optimal settings. This framework improves on traditional hyperparameter tuning techniques such as grid search and randomized search, which use only the hyperparameter values explicitly defined by the user to optimize the model.
Making prediction
This is the final stage of the machine learning approach, in which all the above activities come together. Prediction is the estimation of the outcome variable based on the independent variables. In this case, contraceptive discontinuation was predicted from the important variables identified along the way: given the predictor variables, whether a woman would discontinue her contraceptive use was determined by the best-performing classifier with a specified accuracy.
Even though the machine learning analysis identified the most important predictors, it does not show which category of a predictor is more associated with discontinuation. In this study, association rule analysis was therefore applied with the Apriori algorithm (arules package) in R software (version 4.2.1) to identify the specific categories of predictor variables associated with contraceptive discontinuation. Association rules are IF–THEN rules, which are valuable because they are easy to interpret and select only the relevant features for the model during rule generation [58]. Instead of using straightforward tests of statistical significance, association rule mining (ARM) can find strong and frequent relationships between variables based on measures of "interestingness" that are related to the effect size of a pattern [59].
An association rule is a pair (X, Y) of sets of attributes, denoted X → Y, where X is the left-hand side (antecedent) and Y is the right-hand side (consequent) of the rule; it states that if X occurs, then Y is also likely to occur. Association rule mining is a fundamental data mining technique that thoroughly searches for frequent patterns, correlations, and associations among sets of variables, making it ideal for discovering predictive rules from medical data repositories [60, 61]. It has been used in prior healthcare research to identify risk factors for various health outcomes such as early childhood caries [62], parasite infection [63], motorcycle crash casualty [64], and stroke [65], and to discover symptom patterns of coronavirus disease 2019 (COVID-19) [66].
The strength of an association rule can be measured by support (the prevalence of X and Y co-occurring), confidence (the probability that Y occurs given that X is present), and lift [67], which refers to the deviation of the observed support from what would be expected if X and Y were independent. Confidence, also called the accuracy of an association rule, indicates how often the rule has been found to be true, i.e., how reliable the rule is [68]. If the lift is less than one, X and Y are negatively correlated, implying that if one is present, the other is likely to be absent; if it is greater than one, X and Y are positively correlated, indicating that if one is present, the other is likely to be present as well. However, X and Y are independent and have no relationship if the lift of the rule equals one [65]. Finally, the overall data preparation and analysis workflow is presented in Fig. 1.
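The three measures can be computed by hand on a toy example. The study used the R arules package; the sketch below merely illustrates the definitions in Python on five hypothetical records, with "rural" and "discontinued" as made-up attribute values:

```python
# Hand-computed association-rule measures for a rule X -> Y.
# Each record is the set of attribute values observed for one woman.
records = [
    {"rural", "discontinued"},
    {"rural", "discontinued"},
    {"rural"},
    {"urban", "discontinued"},
    {"urban"},
]
n = len(records)

X, Y = {"rural"}, {"discontinued"}
support_X  = sum(X <= r for r in records) / n        # P(X)        = 3/5
support_Y  = sum(Y <= r for r in records) / n        # P(Y)        = 3/5
support_XY = sum((X | Y) <= r for r in records) / n  # P(X and Y)  = 2/5

confidence = support_XY / support_X                  # P(Y | X)    = 2/3
lift = confidence / support_Y                        # > 1: positive association

print(support_XY, round(confidence, 3), round(lift, 3))
```

Here the lift of about 1.11 exceeds one, so in this toy data the antecedent and consequent are (weakly) positively associated.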