Performance evaluation and comparison
Compared with the other three models, the classification rules produced by the C5.0 DT model are easier to understand and apply in clinical practice. The TAN model shows the distribution of conditional probabilities, which helps interpret the probabilistic dependencies between the independent variables and the dependent variable. The LR model has been effective in traditional epidemiological and health statistical studies, and it yields odds ratios relative to the base category. However, when applied to large or high-dimensional data, the LR model was less effective than the data mining models. Accordingly, in this study the three data mining models classified better than the LR model and effectively improved screening for the risk of EGC, especially the MLP model, which had the highest accuracy and the largest AUC, also taking the classifier's clinical translation into consideration.
Although traditional statistical models readily explain the relationship between the dependent variable and the independent variables, they cope poorly with large numbers of variables, mixed variable types and complex relationships among variables [20,21,22]. If the purpose of a study is to boost predictive performance and the interpretability of the models is secondary, researchers prefer to develop data mining models to obtain satisfactory predictions [23]. The above discussion therefore suggests that the three data mining models, the MLP model in particular, are potentially optimal models for improving screening for the risk of EGC.
Important independent variables
This study identified 16 important factors influencing the risk of EGC, which may be of considerable value in screening for that risk. By focusing on these 16 factors, clinicians can rapidly evaluate the level of EGC risk in patients with gastric disease. The 16 factors include four serological examinations: HP antibody, pepsinogen I, gastrin 17 and pepsinogen I/II, which suggests that serological examinations are among the important methods for screening the risk of EGC. Yamaguchi Y also found that the ABC method, a combined assay of HP and serum pepsinogen, was useful for screening gastric cancer in high-risk and low-risk populations [24]. Many epidemiological researchers have reported that HP infection is a risk factor for gastric cancer. HP participates in the invasion, metastasis and clinical stage of gastric cancer and promotes its pathogenesis, so it is a potential clinical marker for evaluating the progress and prognosis of gastric cancer [25, 26].
This study indicates that the drinking-water source is an important factor for the risk of EGC. Well and river water may be contaminated for lack of effective regulation; the pollution sources include industrial waste, agricultural fertilizers and pesticides, and microorganisms [27,28,29]. Drinking polluted well or river water may cause gastrointestinal malignant tumors, which may be closely related to the following factors: bacteria, cyanotoxins, sulfates, nitrates, minerals, microelements, chlorides, heavy metals and so on [30].
Many eating habits also importantly affect the risk of EGC. On the one hand, previous studies have found that people who frequently drink tea and eat fruit have a low rate of tumors [31, 32]. On the other hand, there are dangerous eating habits, such as often drinking hot water. Constantly drinking hot water induces mucosal injuries in the digestive tract, which accelerate the carcinogenic processes of carcinogens [33]. This suggests that people should drink less hot water to prevent gastric cancer. Although previous researchers considered that smoking and drinking likely cause a variety of cancers, this study did not identify them as important factors for the risk of EGC, possibly because smoking and drinking were not analyzed quantitatively [34, 35].
The four demographic characteristics (occupation, residence, education level and language) reflect the social status and health care awareness of the participants, which may in turn shape their eating habits and other behaviors, so these four characteristics have comprehensive effects on patients' risk of EGC. Some studies have shown that a family history of gastric cancer is a risk factor for gastric cancer [36], and that a previous history of colorectal cancer, diabetes mellitus or gastric ulcer distinctly increases the risk of gastric cancer [37,38,39]. However, these factors were excluded when this study analyzed their correlation with the risk of EGC, probably because their proportions were too small to correlate with the risk of EGC.
Advantages and limitations
The greatest advantage of this study is that it screened the risk of EGC accurately and noninvasively. Some scholars have continuously studied medical instruments and detection reagents to improve the screening of EGC, and have applied the results to clinical gastroscopy and biopsy [24, 40]. A few researchers have combined genetics, proteomics and molecular biology to diagnose EGC [41, 42]. However, because of invasiveness, complexity, high cost or low compliance, these achievements have not been widely used in clinical screening for EGC. This study applied data mining methods to screen the risk of EGC on the basis of noninvasive factors. Data mining methods obtain better predictions than traditional epidemiological and health statistical methods when dealing with numerous factors and complicated relations among factors [22, 23]. Patients were initially screened by the optimal data mining models established, and the high-risk patients identified were then confirmed by further endoscopy plus pathological biopsy. This hierarchical screening strategy for EGC has high compliance and low cost, which should make it easier to increase the screening coverage of EGC in clinical practice.
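The two-stage logic of the hierarchical screening strategy can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the function names, the 0.5 threshold, the toy risk rule and the patient records are assumptions, not the models, cut-offs or data of this study.

```python
def hierarchical_screen(patients, predict_risk, threshold=0.5):
    """Stage 1: score every patient with a noninvasive risk model.
    Stage 2: refer only patients at or above the threshold for endoscopy."""
    referred, not_referred = [], []
    for patient in patients:
        risk = predict_risk(patient)
        if risk >= threshold:
            referred.append(patient["id"])
        else:
            not_referred.append(patient["id"])
    return referred, not_referred


def toy_risk(patient):
    # Hypothetical stand-in for a trained model; not a clinical rule.
    return 0.8 if patient["hp_antibody_positive"] else 0.2


# Illustrative records only; no real patient data.
patients = [
    {"id": "A", "hp_antibody_positive": True},
    {"id": "B", "hp_antibody_positive": False},
    {"id": "C", "hp_antibody_positive": True},
]

referred, not_referred = hierarchical_screen(patients, toy_risk)
print("Refer for endoscopy plus biopsy:", referred)
print("No referral at this stage:", not_referred)
```

In practice `predict_risk` would be replaced by a trained classifier, and the threshold would be chosen to balance sensitivity against the number of endoscopies performed.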
One limitation of this study is that the patients, drawn from 26 hospitals participating in the project of the First Hospital Affiliated Guangdong Pharmaceutical University, were skewed toward a narrow socioeconomic range, which limits how far these results can be generalized to more affluent populations. Furthermore, this study employed SMOTE to balance the training set and improve the predictive performance of the models, but the data generated by SMOTE are synthetic rather than real. Future research will gather sufficient real data, for the minority class in particular, to further qualify the overall result. Ultimately, the effective prediction models will be applied to construct a cloud platform for EGC screening to promote the clinical detection of EGC.
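Why SMOTE-generated records are not real observations can be seen from a minimal Python sketch of the interpolation idea behind the technique. This is a simplified illustration under stated assumptions, not the implementation used in this study: the function name `smote_sketch`, the value `k=3`, the random seed and the four two-feature samples are all hypothetical.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each placed on the line segment between
    a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class samples with two numeric features.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority, n_new=6)
for point in new_points:
    print(point)
```

Each synthetic point is a blend of two existing minority samples, so it adds no information beyond what the observed minority class already contains, which is why collecting more real minority-class data remains necessary.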