CNFE-SE: a novel approach combining complex network-based feature engineering and stacked ensemble to predict the success of intrauterine insemination and ranking the features
BMC Medical Informatics and Decision Making volume 21, Article number: 1 (2021)
Intrauterine Insemination (IUI) outcome prediction is a challenging issue for assisted reproductive technology (ART) practitioners. Predicting the success or failure of IUI based on the couples' features can assist physicians in deciding whether to suggest IUI to a couple and whether to continue the treatment. Many previous studies have focused on predicting the in vitro fertilization (IVF) and intracytoplasmic sperm injection (ICSI) outcome using machine learning algorithms. However, to the best of our knowledge, only a few studies have focused on predicting the outcome of IUI. The main aim of this study is to propose an automatic classification and feature scoring method to predict the IUI outcome and rank the most significant features.
For this purpose, a novel approach combining complex network-based feature engineering and stacked ensemble (CNFE-SE) is proposed. Three complex networks are extracted considering the patients' data similarities. The feature engineering step is performed on the complex networks. The original feature set and/or the features engineered are fed to the proposed stacked ensemble to classify and predict IUI outcome for couples per IUI treatment cycle. Our study is a retrospective study of a 5-year couples' data undergoing IUI. Data is collected from Reproductive Biomedicine Research Center, Royan Institute describing 11,255 IUI treatment cycles for 8,360 couples. Our dataset includes the couples' demographic characteristics, historical data about the patients' diseases, the clinical diagnosis, the treatment plans and the prescribed drugs during the cycles, semen quality, laboratory tests and the clinical pregnancy outcome.
Experimental results show that the proposed method outperforms the compared methods with Area under receiver operating characteristics curve (AUC) of 0.84 ± 0.01, sensitivity of 0.79 ± 0.01, specificity of 0.91 ± 0.01, and accuracy of 0.85 ± 0.01 for the prediction of IUI outcome.
The most important predictors for predicting IUI outcome are semen parameters (sperm motility and concentration) as well as female body mass index (BMI).
Infertility is defined as the failure of the female partner to conceive after at least one year of regular unprotected sexual intercourse. More than 186 million people worldwide, particularly those living in developing countries, suffer from infertility. In most cases, the causes of infertility are not clear, which complicates the treatment procedure. These problems have been exacerbated for several reasons, such as lifestyle changes, infection, and genetic issues. In many cases, the only way to get pregnant has been through the use of assisted reproductive technology (ART), and its performance has not yet been optimized.
Every year, more than 1.5 million ART cycles are carried out all over the world. ART consists of three basic procedures, intrauterine insemination (IUI), in-vitro fertilization (IVF) and intracytoplasmic sperm injection (ICSI), which are generally carried out in different steps of the treatment. IUI is the first-line treatment, and IVF and ICSI are the second and third stages of ART, respectively. In comparison with the other, more sophisticated ART methods, IUI is considered the easiest, least invasive and least expensive one. Most recent studies have shown the efficacy of IUI [6, 7].
IUI outcome prediction is a challenging issue for ART practitioners. Predicting the success or failure of IUI based on the couples' features can assist physicians in deciding whether to suggest IUI to a couple and whether to continue the treatment.
Machine learning, as a modern scientific discipline, concentrates on detecting hidden patterns and extracting information from data. It provides different methods and algorithms to predict the output from a set of input predictors, which can be used for clinical decision making.
To the best of our knowledge, many previous studies have been focused on predicting the IVF and ICSI outcome using machine learning methods as summarized in Table 1.
Table 2 lists the previous studies related to outcome prediction of ART methods which have analyzed data using data mining and/or statistical methods. Classifiers such as Decision Tree (DT), Logistic Regression (LR), Naïve Bayes (NB), K-Nearest Neighbors (K-NN), Support Vector Machines (SVM), Random Forest (RF), and Artificial Neural Networks (ANN) such as Multi-Layered Perceptron (MLP) and Radial Basis Function (RBF) have been used in these studies for predicting the clinical pregnancy after the complete cycles of different ART methods. A main drawback of most of these studies is the small size of the dataset and the small number of considered features. A small dataset increases the risk of overfitting the trained models. Overfitting occurs when a model has good predictive ability on the training dataset but shows poor performance on the test dataset. Models that overfit strongly have lower generalization ability.
In this study, a dataset including the features of 11,255 IUI treatment cycles for 8360 couples is considered for IUI outcome prediction. Our dataset includes the couples' demographic characteristics, historical data about the patients' diseases, the clinical diagnosis, the treatment plans and the prescribed drugs during the cycles, semen quality, laboratory tests and the clinical pregnancy outcome. Considering the large number of couples and their corresponding IUI treatment cycles is a main advantage of this study compared to the considered previous studies.
On the other hand, most of the previous studies have considered the outcome prediction for IVF or ICSI. To the best of our knowledge, only a few studies have focused on predicting the outcome of IUI, using clustering methods [9, 10] or regression analysis.
The previous studies based on regression analysis have only considered the weights of the independent features to predict the overall pregnancy probability and have not assessed the interconnection among the features [11,12,13,14,15,16,17]. Many previous studies have suffered from a lack of statistical power due to their small datasets [17, 18]. Also, the AUC of the previously proposed models for predicting IUI outcome has been low. Therefore, it is required to improve the prediction performance by proposing novel methods and considering more data records.
Most of the considered previous studies have used single classifiers and/or RF as a simple ensemble classifier. Some previous studies have illustrated that the stacked models can improve the classification performance for other applications and other datasets [19,20,21]. Therefore, in this study, a novel stacked ensemble is designed and proposed for improving the performance of IUI outcome prediction.
The main aim of this study is to develop an automatic classification and feature scoring method to predict intrauterine insemination (IUI) outcome and ranking of the most significant features, based on the features describing the couples and their corresponding IUI treatment cycles. For this purpose, a novel approach combining complex network-based feature engineering and stacked ensemble (CNFE-SE) is proposed. Three complex networks are extracted considering the patients' data similarities. The feature engineering step is performed on the complex networks. The original feature set and/or the features engineered are fed to the proposed stacked ensemble to classify and predict IUI outcome for couples per IUI treatment cycle. Our study is a retrospective study of a 5-year couples' data undergoing IUI. Data is collected from Reproductive Biomedicine Research Center, Royan Institute describing 11,255 IUI treatment cycles for 8,360 couples.
The novelty of this study is threefold:
Proposing a method for feature scoring and classification based on weighted complex networks and stacking ensemble classifiers
Proposing feature engineering method based on complex networks
Designing a novel stacked ensemble classifier for predicting IUI outcome
The main steps of the proposed approach combining complex network-based feature engineering and stacked ensemble (CNFE-SE) to predict the success of Intrauterine Insemination and ranking the features are illustrated in Fig. 1.
The main steps of the proposed method (CNFE-SE) as depicted in Fig. 1 include the modules for data collection and preparation, feature scoring and classification and finally model evaluation and validation. The first module consists of data collection, sampling from data, preprocessing the collected data and filtering irrelevant features. In the next module, ignoring a feature, constructing three complex networks from the patients, extracting features from the constructed complex networks, training the classifiers based on the extracted features and finally scoring the ignored feature are performed. The last module evaluates and validates the models trained in the previous module. More details about the mentioned tasks are described in the following subsections.
Our research is approved by the Institutional Review Board of the Royan Institute Research Center and the Royan Ethics Committee consistent with Helsinki Declaration with the approval ID of IR.ACECR.ROYAN.REC.1398.213. Anonymity and confidentiality of data were respected.
The dataset studied in this article was collected from Royan Institute, a public non-profit organization affiliated with the Academic Center for Education, Culture and Research (ACECR) in Iran. It includes the features describing the patients treated by the IUI method in the infertility clinic at Royan Institute between January 2011 and September 2015.
In this retrospective study, a completed episode is defined as a sequence of treatment cycles resulting in positive clinical pregnancy or ending when the treatment with IUI is stopped. The inclusion criteria for the couples to be treated under IUI cycles were male factor, ovulatory disorders such as PCOS, hypothalamic amenorrhea, diminished ovarian reserve, combined causes, and unexplained subfertility. The couples' duration of infertility was at least 1 year. Male infertility was defined as semen quality parameters lower than the standards determined by WHO, including sperm concentration lower than 15 million/mL, semen volume lower than 1.5 mL, and total motility lower than 40%. Male partners with donor sperm or varicocele, and semen samples with a total motile sperm count lower than 1 × 10^6, were excluded from being candidates for IUI treatment. Additionally, patients with anatomical and metabolic abnormalities, severe endometriosis and/or systemic diseases were excluded from our study.
A total of 11,255 IUI cycles related to 8,360 couples are considered, in which the women's age ranges from 16 to 47 with an average age of 29. This dataset contains 1,622 positive outcomes and 9,633 negative ones. Therefore, the overall pregnancy rate is 14.41% per completed cycle and 19.4% per couple. Each couple is treated for 1.31 ± 0.59 (mean ± standard deviation) IUI cycles, ranging from 1 to 7 cycles.
The features describe the couples' demographic characteristics, historical data about their diseases, the clinical diagnosis, the treatment plans and the prescribed drugs to the couples, male semen quality, laboratory tests and the clinical pregnancy outcome. The considered demographic features include age, body mass index (BMI), education level, consanguinity with spouse and some other features. The information about the history of the patients' subfertility consists of the duration and type of infertility, length of marriage and so on.
The feature values are of numerical, binary, nominal and binominal types for 86, 152, 51 and 7 features, respectively. More details about the features are shown in Appendix 1.
In the collected dataset, the majority of couples (almost 72%) have been treated for one cycle, 22% of couples have undergone two cycles, 5% of couples have been treated for three cycles, and less than 1% have been treated for more than three cycles. The maximum number of cycles for treating a couple is seven. Figure 2 depicts the distributions of positive and negative clinical pregnancy rates for patients per treatment cycle.
As illustrated by Fig. 2, 63% of the couples belonging to the positive class (positive clinical pregnancy after completing the cycle) have been pregnant after the first treatment cycle. 26% of data records in the positive class have received positive outcome after the second cycle. Moreover, 74% of the couples in the negative class have been considered after the first cycle.
Data should be randomly partitioned into training and test datasets with no overlapping among these two subsets. The models are trained on the training dataset and finally are evaluated by applying them to the test datasets.
K-fold cross validation (C.V.) is a common and popular sampling strategy used for this purpose. In this method, data is randomly divided into K disjoint equal-size subsets. Every time, one of these K subsets is considered as the test dataset and all (K-1) remaining subsets make the training one. The model is trained K times on K training datasets and applied to the corresponding test datasets to evaluate the performances of the trained models.
Before sampling from data, the features having missing value rate higher than 20% are removed from the study. Moreover, the patient records with high missing value rate (higher than 20%) are excluded from the study and then, fivefold C.V. is used for sampling from the collected dataset, in this study.
At first, dataset is partitioned into non-overlapping subsets D1, D2, …, DK based on K-fold Cross Validation strategy. Then, the models are trained on K training datasets composed of all D1, …, DK subsets excluding Di for 1 ≤ i ≤ K. Therefore, the ith training dataset consists of all D1, …, DK but Di and the ith test dataset is Di. The ith training dataset is balanced using over-sampling strategy.
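The partitioning and balancing described above can be sketched as follows. This is a minimal illustration that assumes random over-sampling by duplicating minority-class records; the study does not state which over-sampling variant was used, and the helper names and toy labels are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and cut them into k disjoint, near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def oversample(train_idx, labels, seed=0):
    """Duplicate randomly chosen minority-class indices until both classes
    have the same size (assumed variant: random over-sampling)."""
    rng = random.Random(seed)
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(train_idx)  # nothing to duplicate
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(train_idx) + extra

labels = [1] * 7 + [0] * 23  # toy labels, not the study data
folds = k_fold_indices(len(labels), k=5)
splits = []
for i, test_idx in enumerate(folds):
    # The ith training set is every fold except D_i; only it is balanced.
    train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
    splits.append((oversample(train_idx, labels), test_idx))
```

Only the training part of each split is balanced; the test fold keeps its original class distribution.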
Moreover, a strategy for classification structural risk assessment named A-Test is used, which will be described in more detail in the evaluation and validation subsection. The number of instances with positive and negative outcomes in each fold of the fivefold C.V. is 324–325 and 1926–1927, respectively. Therefore, the imbalance ratio of the training set in each of the five folds is about 0.168.
Preprocessing of data is one of the most essential steps in knowledge discovery tasks. A previous study has stated that 80% of the total time in data mining projects is allocated to the data preparation and preprocessing step.
In the first step, the initial collected dataset includes almost 86,000 data records describing the partners and about 1,000 features. The data records describing one couple per IUI treatment cycle are aggregated to make our dataset. Thus, the aggregated dataset includes 11,255 data records and 296 features describing a couple during an IUI treatment cycle.
The nominal features are converted to dummy binary variables. If a nominal feature has m different levels or values, it is converted to (m-1) dummy binary variables. Therefore, instead of considering a nominal feature in the classification and feature ranking, its corresponding dummy binary variables are considered in these tasks.
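As a small illustration of the (m-1) encoding, the sketch below converts one nominal column into dummy binary variables. The choice of the first level in sorted order as the all-zero reference level, and the example values, are assumptions made for this sketch only.

```python
def dummy_encode(values):
    """Map a nominal column with m levels to m-1 binary columns.

    The first level in sorted order is the reference level and is
    represented by all zeros (an assumption of this sketch).
    """
    levels = sorted(set(values))
    kept = levels[1:]
    rows = [[1 if v == lvl else 0 for lvl in kept] for v in values]
    return kept, rows

# Hypothetical nominal feature with m = 3 levels.
columns, rows = dummy_encode(["primary", "secondary", "primary", "unknown"])
```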
The missing values for numeric and categorical features are imputed based on the average and the most frequent values, respectively. All numerical and ordinal features are normalized using the min–max normalization method and the nominal features are converted into dummy binary variables.
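The imputation and normalization rules can be sketched as below. Missing values are represented by None here, which is an assumption about the data layout; in a cross-validation setting the statistics would normally be computed on the training part only.

```python
from collections import Counter

def impute_mean(col):
    """Replace missing numeric values (None) with the column average."""
    known = [v for v in col if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in col]

def impute_mode(col):
    """Replace missing categorical values (None) with the most frequent value."""
    mode = Counter(v for v in col if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in col]

def min_max(col):
    """Rescale a numeric column linearly to the interval [0, 1]."""
    lo, hi = min(col), max(col)
    if hi == lo:
        return [0.0 for _ in col]  # constant column
    return [(v - lo) / (hi - lo) for v in col]

# Toy values, not taken from the study data.
bmi = min_max(impute_mean([20.0, None, 30.0, 25.0]))
kind = impute_mode(["a", None, "a", "b"])
```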
Outlier detection is performed in this study based on the isolation forest method, which has been proposed by Liu et al. as an appropriate outlier detection method for high dimensional data. The hyperparameters of Isolation Forest, including the number of estimators, the maximum number of samples, the contamination coefficient, the maximum number of features, whether bootstrapping is used, and the number of jobs, are tuned using the grid search method. For evaluating the performance of Isolation Forest, its results are compared to other outlier detection methods such as One-class SVM with a Radial Basis Function (RBF) kernel, boxplot analysis and experts' opinions. Three outliers are identified by this method and excluded from the study.
Filtering irrelevant features
Since the aggregated dataset consists of many features, the irrelevant features can be removed to reduce the computational time required for processing and analyzing data. Thus, the features having very low correlation with the output feature or very high correlation with other input features are excluded from this study. The linear correlation coefficient between a pair of features Fp and Fq over the N data records is calculated as Eq. (1):
\[ \mathrm{Corr}(F_p, F_q) = \frac{\sum_{x=1}^{N} (F_{x,p} - m_p)(F_{x,q} - m_q)}{\sqrt{\sum_{x=1}^{N} (F_{x,p} - m_p)^2}\,\sqrt{\sum_{x=1}^{N} (F_{x,q} - m_q)^2}} \qquad (1) \]
where Fx,p (Fx,q) indicates the xth row of the feature Fp (Fq) and mp (mq) denotes the average of the feature Fp (Fq), respectively.
If two features Fp and Fq have low (high) correlation, Corr (Fp, Fq) tends to zero (− 1 or + 1).
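A minimal sketch of this filter is given below. The two cut-off values (low and high) are illustrative assumptions, since the thresholds used in the study are not stated, and the greedy order in which redundant features are dropped is also a choice of this sketch.

```python
from math import sqrt

def pearson(fp, fq):
    """Linear correlation coefficient between two equally long columns."""
    mp, mq = sum(fp) / len(fp), sum(fq) / len(fq)
    cov = sum((a - mp) * (b - mq) for a, b in zip(fp, fq))
    sp = sqrt(sum((a - mp) ** 2 for a in fp))
    sq = sqrt(sum((b - mq) ** 2 for b in fq))
    return cov / (sp * sq) if sp and sq else 0.0

def filter_features(features, target, low=0.01, high=0.95):
    """features: dict mapping a feature name to its column.
    low and high are assumed thresholds, not values from the study."""
    kept = []
    for name, col in features.items():
        if abs(pearson(col, target)) < low:
            continue  # nearly uncorrelated with the outcome
        if any(abs(pearson(col, features[k])) > high for k in kept):
            continue  # redundant with a feature that was already kept
        kept.append(name)
    return kept

toy = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, -1, 1, -1]}
selected = filter_features(toy, target=[0, 0, 1, 1])
```

In this toy input, "b" is a rescaled copy of "a" and "c" is uncorrelated with the target, so only "a" survives.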
Ignoring a feature
Breiman has proposed measuring the feature importance by the mean decrease in accuracy (MDA) of random forest. This study aims at ranking the features according to their predictive power for classifying the instances into positive or negative clinical pregnancy. For this purpose, all the steps 6–9 are performed by considering all the features excluding one feature each time, and MDA for the trained proposed classifier is calculated on the validation dataset. The MDA value shows how much the model accuracy decreases after removing a feature. Therefore, higher values of MDA indicate a higher predictive ability of the corresponding features.
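The leave-one-feature-out scoring loop can be sketched as follows. The callback train_and_score is a hypothetical stand-in for the whole pipeline (network construction, feature engineering and stacked ensemble training) that returns a validation accuracy; the toy accuracies are made up for illustration.

```python
def mda_scores(feature_names, train_and_score):
    """Accuracy drop observed when each feature is left out in turn.

    train_and_score is a hypothetical callback: it receives a list of
    feature names, trains the classifier on them and returns the
    accuracy on the validation dataset.
    """
    baseline = train_and_score(list(feature_names))
    scores = {}
    for name in feature_names:
        reduced = [f for f in feature_names if f != name]
        scores[name] = baseline - train_and_score(reduced)
    # Highest accuracy drop first, i.e. most important feature first.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

def toy_score(features):
    """Toy stand-in for the real training routine (made-up accuracies)."""
    gain = {"sperm_motility": 0.10, "female_bmi": 0.05, "education": 0.0}
    return 0.60 + sum(gain[f] for f in features)

ranking = mda_scores(["education", "female_bmi", "sperm_motility"], toy_score)
```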
Constructing complex networks of patients
Complex networks are an effective method for modeling nonlinear data. A complex network is a weighted undirected graph G = (V, E, W), where V is the set of nodes, E denotes the set of edges e (vi, vj) between the pairs of nodes vi and vj, and W is the set of weights w (vi, vj) assigned to the corresponding edges e (vi, vj) of E.
Three complex networks are constructed from the training dataset and the data record to be classified, regardless of whether that record belongs to the training or the test dataset. The first network, called CN1, has all the training data records and the record to be classified as its nodes. The second and third networks, named CN2 and CN3, consist of the record to be classified and all training data records excluding the negative and the positive class, respectively. If the considered data record belongs to the training dataset, its class label is excluded from its corresponding complex networks.
In other words, the nodes of CN1, CN2 and CN3 are one data record which should be classified and all the training data records, positive labeled and negative labeled training data records, respectively. Therefore, for each data record, three complex networks are constructed.
An edge between nodes vi and vj is drawn if the distance between the input features of the ith and jth training data records is smaller than a user-defined threshold. For calculating the pairwise distance between data records, the Euclidean distance function is used, which can be calculated as Eq. (2):
\[ d(v_i, v_j) = \sqrt{\sum_{p=1}^{m} (F_{i,p} - F_{j,p})^2} \qquad (2) \]
where m is the number of the input features, Fi,p and Fj,p denote the pth input feature values for data records corresponding to vi and vj.
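The construction of CN1, CN2 and CN3 for one record can be sketched as below. The sketch stores the raw Euclidean distance on each edge as a placeholder and does not reproduce the edge weight of Eq. (3); the function names and the toy records are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8

def build_network(records, threshold):
    """Return an adjacency dict: node index -> {neighbour index: distance}."""
    graph = {i: {} for i in range(len(records))}
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            d = dist(records[i], records[j])
            if d < threshold:  # keep only sufficiently similar pairs
                graph[i][j] = d
                graph[j][i] = d
    return graph

def three_networks(train_x, train_y, query, threshold):
    """CN1: all training records; CN2: positive ones; CN3: negative ones.
    The record to be classified is appended as the last node of each network."""
    pos = [x for x, y in zip(train_x, train_y) if y == 1]
    neg = [x for x, y in zip(train_x, train_y) if y == 0]
    return tuple(build_network(part + [query], threshold)
                 for part in (train_x, pos, neg))

# Toy records with two normalized features each.
train_x = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]]
train_y = [1, 1, 0]
cn1, cn2, cn3 = three_networks(train_x, train_y, [0.05, 0.0], threshold=0.5)
```

In this toy case the query record is linked to both positive records in CN2 and is isolated in CN3.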
The weight of the edge e(vi,vj) is calculated as Eq. (3):
Feature engineering based on the complex networks
In this section, three complex networks per data record are constructed including the considered data record, all training instances as CN1 and all training instances excluding negative (positive) instances as CN2 (CN3). A simple intuitive hypothesis is that a node has more similarity with the training instances of its own class compared to the instances of the other class. Therefore, the node centrality in different complex networks CN1, CN2 and CN3 can be compared to classify the node. Features listed in Tables 3, 4 are defined based on this hypothesis.
Node degree is the number of edges adjacent to the node. Betweenness centrality for graph nodes was introduced by Bavelas and is calculated as Eq. (4). If a node lies on many shortest paths between pairs of nodes, its betweenness centrality will be high. Nodes with high betweenness centrality are the bridges for information flow.
Node closeness centrality measures the reciprocal of the sum of the length of the shortest paths between the node and all other nodes in the graph.
Node eigenvector centrality is higher when the node is pointed to by many important nodes.
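Two of these centrality measures can be sketched in a few lines. The sketch treats the network as unweighted and restricts closeness to the nodes reachable from the considered node, which is an assumption made here to handle disconnected networks; the adjacency structure maps each node to its set of neighbours.

```python
from collections import deque

def degree(graph, node):
    """Number of edges adjacent to the node."""
    return len(graph[node])

def closeness(graph, node):
    """Reciprocal of the summed hop distances to all reachable nodes
    (unweighted breadth-first search; 0.0 for an isolated node)."""
    seen = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in seen:
                seen[v] = seen[u] + 1
                queue.append(v)
    total = sum(seen.values())
    return 1.0 / total if total else 0.0

# Path graph 0 - 1 - 2 with an isolated node 3.
g = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
```

Under the hypothesis stated above, such values would be computed for the record to be classified in each of its three networks and compared across them.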
Clustering coefficient of a node is calculated as Eq. (5):
Since the number of instances is very high, the complex networks are partitioned into smaller communities to reduce the computational complexity of calculating the engineered features.
One complex network extracted from only 100 data records treated by IUI method as a sample is shown in Fig. 3.
Figure 4 depicts two complex networks of the same samples of positive instances drawn by different thresholds.
As shown by Fig. 4, reducing the threshold for keeping the edges in the complex network, even by a small value, leads to a sparser network with more small-sized communities.
Figure 5 illustrates three complex networks from the samples of both classes, negative and/or positive classes.
As shown by Fig. 5, for the same threshold, the complex network built from the instances of both classes is the densest, and the complex network built from only positive instances is the sparsest and consists of several small communities.
Training the stacked ensemble classifier
The stacked ensemble classifier, a scalable meta-modeling methodology, was first introduced by Wolpert in 1992. It has been inspired by neural networks, with the classifiers considered as the nodes. Instead of a linear model, the stacked classifier can use any base classifier. The stacking operation is performed in either a normal stacking or a re-stacking mode. In the normal stacking mode, the base classifiers in each layer use the output scores of the previous ones as the predictors, similar to a typical feedforward neural network. The formula of the normal stacking mode is written as Eq. (6):
where n indicates the nth layer of the stacked ensemble, x denotes a sample of a dataset, V presents a vector holding the neurons (the base classifiers), D is the number of hidden neurons through the nth hidden layer and finally, k is the kth neuron in the nth layer.
Some previous studies have illustrated that the stacked models can improve the performance of the classification [20, 21, 30]. Therefore, in this study, a new stacked ensemble classifier is proposed and designed based on the normal stacking mode. In the beginning, some of the basic classifiers are trained, and those outperforming the others are selected to be considered as the base classifiers in the stacked ensemble layers. The architecture of the proposed stacked ensemble classifier is shown in Fig. 6.
As illustrated in Fig. 6, input dataset consists of the features in OFS, FS-Fi, EFS and/or EFS-Fi. Input dataset is fed to the base classifiers in the first layer of the proposed stacked ensemble classifier.
Several different classifiers are trained and verified. The classifiers used in the ensemble layers of the proposed stacked ensemble classifier are chosen among the trained classifiers with different hyperparameter values based on their accuracy and diversity on the validation dataset. A previous study has proposed a method to choose classifiers for ensemble learning based on accuracy and diversity, which is used in this study for the same purpose. The pairwise diversity of the classifiers is calculated using the Q statistic.
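For reference, the pairwise Q statistic (Yule's Q) of two classifiers can be computed from the counts of records that both classify correctly (N11), both misclassify (N00), or only one classifies correctly (N10, N01), as Q = (N11·N00 − N01·N10) / (N11·N00 + N01·N10). The sketch below is a minimal illustration with toy predictions; returning 0.0 for a zero denominator is a convention chosen here.

```python
def q_statistic(pred_a, pred_b, truth):
    """Yule's Q statistic for two classifiers on the same records.
    Values near 0 suggest diverse errors; values near 1 suggest that both
    classifiers tend to be right or wrong on the same records."""
    n11 = n00 = n10 = n01 = 0
    for a, b, t in zip(pred_a, pred_b, truth):
        ok_a, ok_b = a == t, b == t
        if ok_a and ok_b:
            n11 += 1
        elif not ok_a and not ok_b:
            n00 += 1
        elif ok_a:
            n10 += 1
        else:
            n01 += 1
    num = n11 * n00 - n01 * n10
    den = n11 * n00 + n01 * n10
    return num / den if den else 0.0

truth = [1, 1, 0, 0]
same = q_statistic([1, 1, 0, 1], [1, 1, 0, 1], truth)
different = q_statistic([1, 1, 0, 1], [1, 0, 0, 0], truth)
```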
Logistic regression (LR), support vector machines (SVM), decision tree (DT), random forest (RF), AdaBoost and Light Gradient Boosting Machine (LightGBM) are the base classifiers chosen based on their accuracy and diversity in both ensemble layers.
LR, SVM with linear kernel and DT are appropriate classifiers for classifying linearly separable data. SVM with non-linear kernels and the ensemble classifiers RF, AdaBoost and LightGBM can classify nonlinearly separable data with high performance. All the mentioned classifiers can be trained fast. Therefore, they are chosen as the base classifiers of the proposed stacked ensemble classifier.
The hyperparameters of the classifiers are tuned based on grid search method and the best values for hyperparameters leading to the highest accuracy for validation dataset are considered for each classifier.
After training the base classifiers in the first layer, their outputs are considered as Meta features according to the normal stacking mode. The Meta features are fed into the base classifiers of the second layer for training them. Finally, the outputs of the base classifiers in the second layer are aggregated by weighted voting aggregation rule.
The weight of each base classifier is obtained by measuring its accuracy for classifying the validation dataset. The validation dataset is about 20% of the original training dataset which is excluded during the base classifiers' training in both layers.
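The final aggregation step can be sketched as a weighted vote over the second-layer outputs. The tie-breaking rule (a tie goes to the negative class) and the toy weights are assumptions of this sketch.

```python
def weighted_vote(predictions, weights):
    """predictions: 0/1 labels from the second-layer classifiers for one record.
    weights: the validation accuracies of those classifiers."""
    score_pos = sum(w for p, w in zip(predictions, weights) if p == 1)
    score_neg = sum(w for p, w in zip(predictions, weights) if p == 0)
    return 1 if score_pos > score_neg else 0  # assumed: ties go to class 0

# One accurate classifier can outvote two weaker ones.
label_a = weighted_vote([1, 0, 0], [0.9, 0.4, 0.4])
label_b = weighted_vote([1, 0, 0], [0.5, 0.4, 0.4])
```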
Mathematical calculation is performed in this study to show the performance improvement obtained by stacked ensemble compared to traditional one-layer ensemble and the individual classifiers.
Without loss of generality, it is assumed that each base classifier in the first layer of the stacked ensemble has an error rate of ε < 0.50. If the aggregation of the base classifiers is performed with the bagging strategy, which is the simplest aggregation method and uses majority voting, the error rate of the first ensemble layer (εL1) can be calculated as Eq. (7):
\[ \varepsilon_{L1} = \sum_{i=\lfloor M/2 \rfloor + 1}^{M} \binom{M}{i} \varepsilon^{i} (1 - \varepsilon)^{M-i} \qquad (7) \]
where M is the number of independent base classifiers in the first ensemble layer. For a data record to be misclassified when bagging is used as the aggregation method, more than half of the base classifiers should misclassify the record. If i is the number of base classifiers which misclassify the data record, i should be more than M/2 for the first ensemble layer to misclassify it. For example, if M is 25, at least 13 base classifiers should misclassify a record for the ensemble to misclassify it. Now, if ε is 0.35 for each of the 25 base classifiers, εL1 will be about 0.06. This shows that the first ensemble layer, or a traditional ensemble, can improve the error rate of the single independent classifiers significantly.
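The binomial tail in this argument can be checked numerically. The sketch below assumes, as the text does, that the base classifiers err independently with the same error rate; evaluating it for M = 25 and ε = 0.35 gives a value of roughly 0.06.

```python
from math import comb

def majority_vote_error(m, eps):
    """Probability that more than half of m independent classifiers,
    each with error rate eps, misclassify the same record."""
    return sum(comb(m, i) * eps ** i * (1 - eps) ** (m - i)
               for i in range(m // 2 + 1, m + 1))

err_l1 = majority_vote_error(25, 0.35)
```

The independence assumption is the strong one here: correlated base classifiers would reduce the gain from voting.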
Now, it is assumed that we have one more ensemble layer such as a two-layer stacked ensemble. Bagging strategy uses simple majority voting for classifying data as Eq. (8):
where rj indicates the jth data record and i denotes the ith base classifier. As shown in Eq. (8), a simple decision tree or SVM with linear kernel can provide rules or find hyperplanes to classify data according to Eq. (8). Therefore, it can be shown that the performance of each base classifier in the second layer will not be worse than the simple bagging aggregation strategy used in the first ensemble layer.
This conclusion is true because each base classifier will try to find the hyperplane or rules to discriminate the training samples of two classes. But, bagging strategy uses simple majority voting. Furthermore, the input features (the first meta feature set as shown by Fig. 6) for the base classifiers of the second ensemble layer are the same as the input features fed to the bagging strategy in the first ensemble layer. These input features are the output class labels generated by the base classifiers in the first layer. Therefore, the error rate of each base classifier in the second ensemble layer would be at most εL1.
The aggregation rule in the first ensemble layer is majority voting in the bagging strategy. The base classifiers try to separate the instances of different classes using linear or non-linear hyperplanes or rules. The input dataset for majority voting in the first ensemble layer is the first meta feature set. Therefore, the input of the majority voting rule and the base classifiers of the second ensemble layer is the same. The majority voting rule can be stated as Eq. (9) for the first meta feature set with M columns:
where MV is the majority voting strategy. Majority voting strategy is similar to using a hyperplane considering the equal coefficients for all of its input features as the separator of two classes.
The base classifiers try to find a best hyperplane for discriminating the instances of two classes. Therefore, their fitted hyperplane will not be worse than the hyperplane used with majority voting strategy. Thus, their performance will be more than or equal to the performance of the majority voting in the first ensemble layer. According to the Eq. (7), it is shown that the performance of the majority voting will be much better than the performance of the single classifiers in the first ensemble layer. Therefore, the performance of the single classifiers in the second ensemble layer will be better than the performance of the single classifiers in the first ensemble layer.
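According to Eq. (7), if the bagging strategy is used for the second ensemble layer, the error rate of the second ensemble layer in the stacked ensemble would be εL2, which can be calculated as Eq. (10):
\[ \varepsilon_{L2} = \sum_{i=\lfloor M_2/2 \rfloor + 1}^{M_2} \binom{M_2}{i} \varepsilon_{b2}^{i} (1 - \varepsilon_{b2})^{M_2-i} \qquad (10) \]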
where M2 is the number of base classifiers in the second ensemble layer of the stacked ensemble and εb2 is the error rate of the base classifiers in the second ensemble layer. As mentioned in the previous paragraph, the error rate of each base classifier in the second layer would be at most εL1. Therefore, εb2 will not be more than εL1.
A previous study has demonstrated that adding more layers to a stacked ensemble can improve the classification performance in terms of accuracy and AUC.
Based on the obtained results, it can be shown that adding more layers to the stacked ensemble can improve its performance. However, adding more layers also increases the time complexity and memory usage.
There are a few studies considering the effect of the ensemble size or cardinality (the number of base classifiers in the ensemble classifier) on the performance of the ensemble method [1, 2]. These studies have shown that the suitable ensemble size depends on the diversity of the base classifiers included in the ensemble and on its aggregation rule [1, 2]. In addition, a previous study has examined different ensemble sizes including 10, 20, 50 and 100 classifiers for bioinformatics applications. It has shown that the best ensemble size was 50, but that an ensemble size of 10 is sufficient to achieve reasonable performance.
Scoring the ignored feature
As mentioned in Sect. 1.5, MDA score is calculated for each feature and is considered as the feature importance score.
Evaluating and validating the trained models
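To evaluate the performances of the trained models, the performance measures for classification problems are used in this study, including Accuracy, Sensitivity, Specificity and F-Score as shown in Eqs. (11)–(14):
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (11) \]
\[ \mathrm{Sensitivity} = \frac{TP}{TP + FN} \qquad (12) \]
\[ \mathrm{Specificity} = \frac{TN}{TN + FP} \qquad (13) \]
\[ \text{F-Score} = \frac{2\,TP}{2\,TP + FP + FN} \qquad (14) \]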
where TP and FN (TN and FP) indicate the number of instances in the positive (negative) class which are classified correctly and incorrectly, respectively.
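These measures follow directly from the four confusion-matrix counts, as the short sketch below shows. It uses the standard definitions, with the F-Score taken as the harmonic mean of precision and sensitivity; the counts are toy values, not results of the study, and zero denominators are not handled.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and F-score from confusion counts."""
    sensitivity = tp / (tp + fn)  # share of positive instances found
    specificity = tn / (tn + fp)  # share of negative instances found
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_score": 2 * precision * sensitivity / (precision + sensitivity),
    }

m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```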
Moreover, the area under the curve (AUC) of the receiver operating characteristic (ROC) curve is considered.
In order to validate the results, the experiments are repeated 50 times, and each time the data is selected based on fivefold C.V.
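The validation protocol can be sketched as repeated K-fold cross-validation that reports the mean and standard deviation over the repetitions. The sketch below makes two assumptions not stated in the text: the folds are stratified by class, and each repetition is summarized by the mean of its fold scores. `fit_and_score` is a hypothetical callback, and the majority-class baseline and toy labels are only placeholders for a real classifier and dataset.

```python
import random
from statistics import mean, stdev


def stratified_folds(labels, k, rng):
    """Split indices into k folds with roughly equal class proportions."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def repeated_cv(labels, fit_and_score, k=5, repeats=50, seed=0):
    """Mean and standard deviation of the per-repetition CV score."""
    rng = random.Random(seed)
    run_scores = []
    for _ in range(repeats):
        fold_scores = []
        for test_idx in stratified_folds(labels, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            fold_scores.append(fit_and_score(train_idx, test_idx))
        run_scores.append(mean(fold_scores))
    return mean(run_scores), stdev(run_scores)


# Toy imbalanced labels and a majority-class baseline as the "model".
labels = [1] * 20 + [0] * 80


def majority_baseline(train_idx, test_idx):
    train = [labels[i] for i in train_idx]
    guess = max(set(train), key=train.count)
    return sum(labels[i] == guess for i in test_idx) / len(test_idx)


avg, sd = repeated_cv(labels, majority_baseline)
```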
A method named A-Test has been proposed in a previous study to calculate the structural risk of a classifier model, that is, its instability on new test data. A-Test calculates the misclassification error percentage Γζ,K for different K values using balanced K-fold validation. In this study, the values of Γζ,K are reported for different classifiers and different feature sets. Γζ,K is calculated as Eq. (15):
where Kmax cannot be more than the size of the minority class. To estimate the structural risk of a classifier, the average of the values of Γζ,K is considered as Eq. (16):
where Γζ^ ranges from 0 to 100%. Higher values indicate a higher risk of misclassification, and lower values indicate a higher capacity and generalization ability of the model. Therefore, lower values of Γζ^ are preferred.
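The averaging step can be illustrated with a short sketch. It only implements what the text states (the average of the Γζ,K values, with K bounded by the minority class size); how each Γζ,K is obtained from balanced K-fold validation is not reproduced here. The function name and the error percentages in the usage line are hypothetical.

```python
from statistics import mean


def a_test_average(error_pct_by_k: dict, minority_size: int) -> float:
    """Average misclassification percentage over the tested K values.

    error_pct_by_k maps each K to the misclassification error (in %)
    measured with balanced K-fold validation for that K.
    """
    if not error_pct_by_k:
        raise ValueError("at least one K value is required")
    if max(error_pct_by_k) > minority_size:
        raise ValueError("K cannot exceed the size of the minority class")
    return mean(list(error_pct_by_k.values()))


# Made-up error percentages for K = 2, 3, 4 (illustration only).
risk = a_test_average({2: 20.0, 3: 18.0, 4: 16.0}, minority_size=50)
```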
In this section, the features are ranked based on MDA obtained by ignoring them during the training of CNFE-SE. Then the partial dependencies between high-ranked features are discussed. Finally, the performance of the proposed model (CNFE-SE) is compared with other state-of-the-art classifiers.
Ranking the significance of features
Figure 7 presents the top-20 features with the highest MDA scores for IUI outcome prediction, based on 50 repetitions of CNFE-SE training on different training samples. Post-wash total motile sperm count, female BMI, sperm motility grades a + b, total sperm motility and sperm motility grade c are high-ranked predictors of IUI outcome. Additionally, post-wash total motile sperm count, female BMI and total sperm count, which are illustrated in dark blue in Fig. 7, appear most often as the first informative feature across the repetitions. Generally, the variables related to the men's semen analysis parameters are high-ranked features in this study.
The Pearson correlation coefficients are calculated among the top-20 important features, and Fig. 8 depicts the heat map of the correlation coefficients.
As shown in Fig. 8, the male semen parameters are positively correlated with each other: the higher the sperm concentration, the higher the total sperm count and the total motile sperm count. Also, the couples' duration of infertility and duration of marriage are positively correlated.
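For readers who want to reproduce such a heat map on their own data, the Pearson coefficient and the pairwise correlation matrix can be computed as in the sketch below. The column names and values are hypothetical and are not taken from the study's dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns: dict) -> dict:
    """Pairwise Pearson coefficients for a dict of named columns."""
    return {
        a: {b: pearson(columns[a], columns[b]) for b in columns}
        for a in columns
    }


# Hypothetical toy columns (illustration only).
toy = {
    "concentration": [10.0, 20.0, 30.0, 40.0],
    "total_count": [25.0, 45.0, 70.0, 90.0],
    "other": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(toy)
```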
Figure 9 shows the exact values of MDA score for top-20 features in this study.
In addition, Table 3 lists MDA values of top-20 features.
Partial dependency between the features
Figure 10 depicts the partial dependency plots for the most important features. Partial dependency plots show whether a feature has a positive or negative effect on the response variable when the other features are controlled. However, when interpreting the graphs, it should be noted that the changes in the clinical pregnancy probability as a function of the feature values, even for the most significant features, are fairly small (the y-axis range is 0.44–0.52). Therefore, it is noteworthy that none of the features could individually raise the pregnancy probability above 0.52. This finding underlines the value of the machine learning approach, which captures the complicated associations between individual predictors to build an effective classification model.
According to the partial dependency plots shown in Fig. 10, the clinical pregnancy rate rises with increasing post-wash total motile sperm count and post-processing sperm concentration. When their values exceed 100 million and 30 million spermatozoa per ml, respectively, the pregnancy rate reaches its highest level. In addition, the likelihood of IUI success increases with the total sperm count, which has been reported in previous studies, too.
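A partial dependency curve of this kind can be sketched as the average model prediction obtained when one feature is forced to each value of a grid while all other features keep their observed values. In the sketch below, `predict` is a hypothetical scoring function standing in for the trained classifier, and the rows, feature names and coefficients are made up.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is set to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the input rows are untouched
            modified[feature] = value  # force the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


# Hypothetical model: a simple linear score (illustration only).
def toy_predict(row):
    return 0.4 + 0.001 * row["x"] + 0.01 * row["z"]


toy_rows = [{"x": 10, "z": 1}, {"x": 50, "z": 3}]
curve = partial_dependence(toy_predict, toy_rows, feature="x", grid=[0, 100])
```

A rising curve indicates a positive effect of the feature on the predicted probability, and the vertical span of the curve shows how much the feature alone can move the prediction.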
Comparing the performance of CNFE-SE with other state-of-the-art classifiers
Table 5 lists the performance measures for comparing CNFE-SE with other state-of-the-art classifiers.
Two different feature sets are considered as the input variables fed to the classifiers: all 296 features, and only the most important features (the top-20 features shown in Fig. 6). Moreover, CNFE-SE is trained and evaluated twice, once without feature engineering (FE) and once with feature engineering.
The models are executed and trained on different random training samples up to 50 times, and the mean ± standard deviation values are reported in Table 5. CNFE-SE outperforms the compared models, with an AUC of 0.84 ± 0.01, sensitivity of 0.79 ± 0.01, specificity of 0.91 ± 0.01, and accuracy of 0.85 ± 0.01 when trained on all 296 features. Moreover, CNFE-SE achieves its best performance when only the top-20 features are fed to it as input variables, with an AUC of 0.87 ± 0.01, sensitivity of 0.82 ± 0.01, specificity of 0.92 ± 0.01 and accuracy of 0.87 ± 0.01. These results show that feature engineering and restricting the input to the top-20 features improve the performance of CNFE-SE.
Table 6 shows the confusion matrix of CNFE-SE for total dataset.
Figure 11 depicts ROC curve for CNFE-SE trained with all features.
As shown in Fig. 11, the AUC of CNFE-SE trained on all features is 0.84 ± 0.01. As illustrated in Table 5, the compared single classifiers show relatively weak performance. The main reason is that the patients treated with IUI do not have complicated conditions and the leading cause of their infertility is idiopathic. Therefore, the data of the two classes are highly similar to each other, and differentiating them with a single classifier is not an easy task. However, among these models, Light-GBM, one of the state-of-the-art machine learning algorithms, has the second-best performance, because it is a gradient boosting framework that uses tree-based learning algorithms, exposes many hyper-parameters and places more emphasis on the accuracy of the results.
When the classes are imbalanced, the precision-recall curve is a useful instrument for presenting prediction success. A large area under this curve indicates both high precision, which corresponds to a low false-positive rate, and high recall, which corresponds to a low false-negative rate. Figure 12 shows the precision-recall curves for CNFE-SE trained using the top-20 features.
As shown in Fig. 12, CNFE-SE predicts both classes with highly reasonable performance.
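For illustration, the points of a precision-recall curve can be obtained by sweeping a decision threshold over the predicted scores, as in the sketch below. It assumes binary labels with at least one positive instance; the labels and scores are made-up toy values, not outputs of CNFE-SE.

```python
def precision_recall_points(labels, scores):
    """(recall, precision) pairs, one per distinct score threshold."""
    n_pos = sum(labels)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if p and y == 1)
        fp = sum(1 for y, p in zip(labels, predicted) if p and y == 0)
        # tp + fp >= 1 because the threshold is one of the observed scores.
        points.append((tp / n_pos, tp / (tp + fp)))
    return points


# Toy example with two positive and two negative instances.
points = precision_recall_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Lowering the threshold can only keep or raise recall, while precision may move in either direction, which is why the curve is informative for imbalanced classes.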
Moreover, the results of A-test method for structural risk calculation for different combinations of feature sets and classifiers are shown in Table 7.
Lower values of Γζ^ and Γζ indicate a lower risk when classifying previously unseen records and a higher capacity and generalization ability of the model. Therefore, the feature set and classifier achieving the lower values of Γζ^ and Γζ are preferred. As shown in Table 7, CNFE-SE trained using the top-20 features has the best performance based on the A-Test results.
In the current study, among the various features that significantly affect the IUI outcome, the strongest predictors are female BMI and the semen quality parameters. Semen data such as sperm count and motility are identified as the most prognostic factors in pregnancies conceived by IUI, and their association with IUI outcome has been demonstrated in some previous studies. Moreover, some previous studies have confirmed that the semen descriptors measured after the swim-up procedure are more important than those measured before the sperm washing process [39, 40]. Similarly, the percentage of motile sperm and its progression in the ejaculate are known in the literature as significant predictors of IUI outcome [41, 42]. Sperm motility grades a + b (progressive motility) and grade d (immotile sperm) are also identified in this study as potential predictive factors for a successful IUI. If their corresponding values are more than 20% and less than 15%, respectively, the IUI success rate is higher.
Furthermore, the results of this study indicate that the IUI success rate is relatively low when the female BMI is abnormal (lower than 20 or higher than 30). If the female BMI is about 25, the normal BMI value, the probability of pregnancy increases. This finding has been reported in previous studies, too.
Previous studies have shown that the pregnancy rate decreases as female age increases [42, 45]. The present study finds that women older than 38 have a lower chance of successful IUI. However, Erdem et al. have not found female age to be a prognostic factor in the prediction of IUI outcome.
As shown in Fig. 8, the duration of infertility inversely affects the fertility rate, and this decline in fecundity has been reported in some previous works as well. Also, previous studies have shown that when the couples' duration of infertility is less than six years, the pregnancy success rate is higher.
The total dose of gonadotropins is identified in this study as an important feature, and its significance has also been considered recently. This study finds that the total dose of gonadotropin is positively correlated with the pregnancy rate. Moreover, other factors contributing to the failure or success of IUI according to this study's findings include semen volume, male age, normal and amorphous sperm morphology, duration of marriage, and endometrial thickness, some of which have been reported as influential attributes in previous studies [48,49,50].
Finally, CNFE-SE trained using the 20 most important features yields good performance (AUC = 0.87, 95% CI 0.86–0.88). This shows that a model built on these features alone achieves highly reasonable performance.
Some studies consider the different cycles of a patient as independent of each other, which may lead to biased results. Others have considered only the first-cycle information [16, 51]. Our reanalysis of the first-cycle data revealed that the AUC values of Light-GBM and CNFE-SE are 0.62 ± 0.01 and 0.84 ± 0.01, respectively, which do not change significantly when all the cycles are taken into account. Moreover, as shown in the materials and methods section, increasing the number of cycles raises the clinical pregnancy rate, which is in line with the reported importance of this feature for subsequent IUI outcome [52, 53]. In contrast, the cycle number variable has not been identified as an important feature according to the CNFE-SE feature scores. This finding may be due to the large share of first-cycle data compared to the second, third and later cycles: approximately 74% of the data belongs to the first cycle of IUI treatment.
Finally, our study has some limitations. Some of the female hormonal tests, including FSH, TSH, LH, and AMH, have not been measured in all the patients before beginning the IUI cycle, and therefore they are eliminated from the analysis due to their high missing value rate. At the Royan center, the patients entering the IUI treatment cycles are those who do not have complicated conditions, and the women's hormonal tests are usually normal. Moreover, the male BMI is excluded because of its high rate of missing values. The features describing the geographic information of the couples' habitats are removed from the study due to the low quality of their data entry.
Machine learning algorithms are increasingly employed in different medical fields. Using machine learning methods, we are able to predict the success or failure of the IUI treatment cycle for each couple, based on their demographic characteristics and cycle information. In this study, our proposed CNFE-SE model shows superior performance among the compared state-of-the-art classifiers. A decision support system (DSS) can be designed and implemented based on CNFE-SE. This DSS can help the physicians to choose other treatment plans for couples whose predicted IUI success rate is low, and thereby reduce the patients' costs. The schematic of this medical assistance system is shown in Fig. 13.
The proposed DSS is trained on the training dataset by CNFE-SE after preprocessing the collected dataset. After completing the training of CNFE-SE, every time a new data record is registered in the DSS, it can be classified by CNFE-SE into positive or negative outcome. The predicted outcome for the new data record can assist the physicians to decide to treat the couple with IUI method or not.
In conclusion, the use of machine learning methods to predict the success or failure of IUI could effectively improve the prediction performance in comparison with classical prediction models such as regression analysis. Furthermore, our proposed CNFE-SE model outperforms the compared methods with highly reasonable accuracy. CNFE-SE can be used as a clinical decision-making aid for physicians to choose a beneficial treatment plan from among their patients' therapy options, which would reduce the patients' costs as well.
The experimental results in this study show that the most important features for predicting IUI outcome are semen parameters (sperm motility and concentration) as well as female BMI.
Some features which have been identified as good discriminative features for IUI outcome prediction in previous studies are excluded from this study because of their high missing value rate. For example, some of the female hormonal tests, including FSH, TSH, LH, and AMH, are not routinely measured in all the patients before IUI and are therefore excluded from the study. As future work, it is proposed to augment the dataset with data records that have no missing values in the mentioned features, to feed the excluded features to CNFE-SE, and then to rank the augmented feature set and evaluate the performance of the classifier.
On the other hand, some data records contain noisy information, which can reduce the performance of the classifiers. As future work, it is suggested to improve the robustness of CNFE-SE against noisy data by including vote-boosting and other previously proposed methods for increasing the noise robustness of classifiers. Moreover, the data is highly imbalanced, which can have a negative effect on the classifiers' performance. As another research opportunity, it is suggested to reduce the influence of the per-class data distribution by incorporating advanced balanced sampling strategies.
Determining the optimal ensemble size is still a challenging issue. It is suggested that the impact of the ensemble size on the overall performance of the stacked ensemble be studied in future work on different tasks and different datasets.
Availability of data and materials
Our study is a retrospective study of 5 years of data from couples undergoing IUI. The data is collected from the Reproductive Biomedicine Research Center, Royan Institute, and covers 8,360 couples who underwent 11,255 IUI cycles. However, we are not allowed to share the original dataset because of privacy and security issues.
Academic center for education, culture and research
Artificial neural networks
Assisted reproductive technology
Area under curve
Body mass index
Complex network which is comprised of all the training data records as its nodes
Complex network which includes all training data excluding negative class
Complex network which includes all training data excluding positive class
Complex network-based feature engineering and stacked ensemble
Decision support system
Hierarchical clustering analysis
In-vitro fertilization (IVF)
Mean decrease of accuracy
Principal component analysis
Radial basis function
Support vector machines
Practice Committee of the American Society for Reproductive Medicine. Definitions of infertility and recurrent pregnancy loss: a committee opinion. Fertil Steril. 2013;99(1):63.
Borght M, Wyns C. Fertility and infertility: definition and epidemiology. Clin Biochem. 2018;62:2–10.
Milewska AJ, et al. Prediction of infertility treatment outcomes using classification trees. Stud Log Gramm Rhetoric. 2016;47(1):7–19.
Blank C, et al. Prediction of implantation after blastocyst transfer in in vitro fertilization: a machine-learning perspective. Fertil Steril. 2019;111(2):318–26.
Patil AS. A review of soft computing used in assisted reproductive techniques (ART). Int J Eng Trends Appl (IJETA). 2015;2(3):88–93.
Bahadur G, et al. First line fertility treatment strategies regarding IUI and IVF require clinical evidence. Hum Reprod. 2016;31(6):1141–6.
Ombelet W, Puttemans P, Bosmans E. Intrauterine insemination: a first-step procedure in the algorithm of male subfertility treatment. Hum Reprod. 1995;10:90–102.
Deo RC. Machine learning in medicine. Circulation. 2015;132(20):1920–30.
Milewska AJ, et al. Analyzing outcomes of intrauterine insemination treatment by application of cluster analysis or kohonen neural networks. Stud Log Gramm Rhetoric. 2013;35(1):7–25.
Kooptiwoot S, Salam MA. IUI mining: human expert guidance of information theoretic network approach. Soft Comput. 2006;10(4):369–73.
Ghaffari F, et al. Evaluating the effective factors in pregnancy after intrauterine insemination: a retrospective study. Int J Fertil Steril. 2015;9(3):300.
Steures P, et al. Prediction of an ongoing pregnancy after intrauterine insemination. Fertil Steril. 2004;82(1):45–51.
Goldman RH, et al. Patient-specific predictions of outcome after gonadotropin ovulation induction/intrauterine insemination. Fertil Steril. 2014;101(6):1649–55.
Marshburn PB, et al. Spermatozoal characteristics from fresh and frozen donor semen and their correlation with fertility outcome after intrauterine insemination. Fertil Steril. 1992;58(1):179–86.
Moro F, et al. Anti-Müllerian hormone concentrations and antral follicle counts for the prediction of pregnancy outcomes after intrauterine insemination. Int J Gynecol Obstet. 2016;133(1):64–8.
Lemmens L, et al. Predictive value of sperm morphology and progressively motile sperm count for pregnancy outcomes in intrauterine insemination. Fertil Steril. 2016;105(6):1462–8.
Arslan M, et al. Predictive value of the hemizona assay for pregnancy outcome in patients undergoing controlled ovarian hyperstimulation with intrauterine insemination. Fertil Steril. 2006;85(6):1697–707.
Florio P, et al. Evaluation of endometrial activin A secretion for prediction of pregnancy after intrauterine insemination. Fertil Steril. 2010;93(7):2316–20.
Shah S, Kusiak A. Cancer gene search with data-mining and genetic algorithms. Comput Biol Med. 2007;37(2):251–61.
Kaya A. Cascaded classifiers and stacking methods for classification of pulmonary nodule characteristics. Comput Methods Programs Biomed. 2018;166:77–89.
Wang SQ, Yang J, Chou KC. Using stacked generalization to predict membrane protein types based on pseudo-amino acid composition. J Theor Biol. 2006;242(4):941–6.
Tocci A, Lucchini C. WHO reference values for human semen. Hum Reprod Update. 2010;16(5):559–559.
Zhang S, Zhang C, Yang Q. Data preparation for data mining. Appl Artif Intell. 2003;17(5–6):375–81.
Han J, Pei J, Kamber M. Data mining: concepts and techniques. Amsterdam: Elsevier; 2011.
Liu FT, Ting KM, Zhou ZH. Isolation forest. In: 2008 Eighth IEEE international conference on data mining. IEEE; 2008. p. 413–22.
Breiman L. Random forests. Mach Learn. 2001;45:5–32.
Diykh M, Li Y, Abdulla S. EEG sleep stages identification based on weighted undirected complex networks. Comput Methods Programs Biomed. 2020;184:105116.
Bavelas A. A mathematical model for group structure, human organization. Appl Anthropol. 1948;7(3):16–30.
Wolpert DH. Stacked generalization. Neural Netw. 1992;5(2):241–59.
Güneş F, Wolfinger R, Tan PY. Stacked ensemble models for improved prediction accuracy. In: Static Anal. Symp. 2017.
Sperandei S. Understanding logistic regression analysis. Biochem Med. 2014;24(1):12–8.
Cortes C, Vapnik V. Support-vector network. Mach Learn. 1995;20:1–25.
Quinlan JR. Induction of decision trees. Mach Learn. 1986;1:81–106.
Zhu J, et al. Multi-class AdaBoost. Stat Interface. 2009;2:349–60.
Ke G, et al. LightGBM: a highly efficient gradient boosting decision tree. In: Advances in neural information processing systems. 2017.
Gharehbaghi A, Linden M. A deep machine learning method for classifying cyclic time series of biological signals using time-growing neural network. IEEE Trans Neural Netw Learn Syst. 2018;29(9):4102–15.
Campana A, et al. Intrauterine insemination: evaluation of the results according to the woman’s age, sperm quality, total sperm count per insemination and life table analysis. Hum Reprod. 1996;11(4):732–6.
Kuriya A, Agbo C, Dahan MH. Do pregnancy rates differ with intra-uterine insemination when different combinations of semen analysis parameters are abnormal? J Turk German Gynecol Assoc. 2018;19(2):57.
Zhang E, et al. Effect of sperm count on success of intrauterine insemination in couples diagnosed with male factor infertility. Materia Socio-Medica. 2014;26(5):321.
Ombelet W, et al. Semen quality and intrauterine insemination. Reprod BioMed Online. 2003;7(4):485–92.
Dickey RP, et al. Comparison of the sperm quality necessary for successful intrauterine insemination with World Health Organization threshold values for normal sperm. Fertil Steril. 1999;71(4):684–9.
Duran HE, et al. Sperm DNA quality predicts intrauterine insemination outcome: a prospective cohort study. Hum Reprod. 2002;17(12):3122–8.
Muriel L, et al. Value of the sperm chromatin dispersion test in predicting pregnancy outcome in intrauterine insemination: a blind prospective study. Hum Reprod. 2006;21(3):738–44.
Thijssen A, et al. Predictive factors influencing pregnancy rates after intrauterine insemination with frozen donor semen: a prospective cohort study. Reprod Biomed Online. 2017;34(6):590–7.
Merviel P, et al. Predictive factors for pregnancy after intrauterine insemination (IUI): An analysis of 1038 cycles and a review of the literature. Fertil Steril. 2010;93(1):79–88.
Erdem A, et al. Factors affecting live birth rate in intrauterine insemination cycles with recombinant gonadotrophin stimulation. Reprod Biomed Online. 2008;17(2):199–206.
Kamath MS, et al. Predictive factors for pregnancy after intrauterine insemination: a prospective study of factors affecting outcome. Hum Reprod Sci. 2010;3(3):129.
Licht RS, Handel L, Sigman M. Site of semen collection and its effect on semen analysis parameters. Fertil Steril. 2008;89(2):395–7.
Francavilla F, et al. Effect of sperm morphology and motile sperm count on outcome of intrauterine insemination in oligozoospermia and/or asthenozoospermia. Fertil Steril. 1990;53(5):892–7.
Luco SM, et al. The evaluation of pre and post processing semen analysis parameters at the time of intrauterine insemination in couples diagnosed with male factor infertility and pregnancy rates based on stimulation agent. A retrospective cohort study. Eur J Obstet Gynecol Reprod Biol Endocrinol. 2014;179:159–62.
Blank C, et al. Prediction of implantation after blastocyst transfer in in vitro fertilization: a machine-learning perspective. Fertil Steril. 2019;111(2):318–26.
Nuojua-Huttunen S, et al. Intrauterine insemination treatment in subfertility: an analysis of factors affecting outcome. Hum Reprod. 1999;14(3):698–703.
Liu W, et al. Comparing the pregnancy rates of one versus two intrauterine inseminations (IUIs) in male factor and idiopathic infertility. J Assist Reprod Genet. 2006;23(2):75–9.
The authors acknowledge the staff of the Royan Institute, especially the informatics department, for their valuable contributions. There is no conflict of interest in this study.
This study was not funded by any organization.
Ethics approval and consent to participate
This study is approved by the institutional review board of the ROYAN Institute (IR.ACECR.ROYAN.REC.1398.213). The informed consent requirement for this study was waived because this was a retrospective study involving little sensitive or personal patient information, and all data were anonymized. The full name of the ethics committee that approved this study is IR.ACECR.ROYAN.REC, to which the ROYAN Institute belongs. The committee's reference number is IR.ACECR.ROYAN.REC.1398.213.
Consent for publication
The authors declare that there are no conflicts of interest.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Ranjbari, S., Khatibi, T., Vosough Dizaji, A. et al. CNFE-SE: a novel approach combining complex network-based feature engineering and stacked ensemble to predict the success of intrauterine insemination and ranking the features. BMC Med Inform Decis Mak 21, 1 (2021). https://doi.org/10.1186/s12911-020-01362-0
- IUI outcome prediction
- Complex networks
- Feature engineering
- Stacked ensemble classifier
- Feature selection