 Research article
 Open Access
Semi-supervised incremental learning with few examples for discovering medical association rules
BMC Medical Informatics and Decision Making volume 22, Article number: 20 (2022)
Abstract
Background
Association Rules are one of the main ways to represent structural patterns underlying raw data. They represent dependencies between sets of observations contained in the data. The associations established by these rules are very useful in the medical domain, for example in the predictive health field. Classic algorithms for association rule mining give rise to huge amounts of possible rules that should be filtered in order to select those most likely to be true. Most of the techniques proposed for this task are unsupervised. However, the accuracy provided by unsupervised systems is limited. Conversely, resorting to annotated data for training supervised systems is expensive and time-consuming. The purpose of this research is to design a new semi-supervised algorithm that performs like supervised algorithms but uses an affordable amount of training data.
Methods
In this work we propose a new semi-supervised data mining model that combines unsupervised techniques (Fisher’s exact test) with limited supervision. Starting with a small seed of annotated data, the model improves on the results (F-measure) obtained by a fully supervised system (standard supervised ML algorithms). The idea is based on exploiting the agreement between the predictions of the supervised system and those of the unsupervised techniques in a series of iterative steps.
Results
The new semi-supervised ML algorithm improves on the results (F-measure) of supervised algorithms in the task of mining medical association rules, while training with an affordable amount of manually annotated data.
Conclusions
Using a small amount of annotated data (which is easily achievable) leads to results similar to those of a supervised system. The proposal may be an important step for the practical development of techniques for mining association rules and generating new valuable scientific medical knowledge.
Background
Discovering the set of patterns or regularities that underlie raw data is the aim of Data Mining. One of the main ways to represent structural patterns underlying raw data is by Association Rules, which express dependencies or correlations between facts or observations in the data. Such dependency analysis is central to empirical science. Medical professionals want to identify factors or diseases that predispose to or prevent other diseases, and genetic researchers are interested in which gene groups correlate. For example, in the medical field, we can find an AR:
\(PCDF = high \wedge sex = female \rightarrow arthralgia \wedge acne\ eruptions\)
which asserts that there is a positive dependency between high levels of polychlorinated dibenzofuran (PCDF) and the presence of arthralgia and acne eruptions in female patients. This was demonstrated in the Yusho oil poisoning environmental case that happened in Japan during the sixties [1]. Even if there are various reasons why such a dependency relationship exists between different symptoms, the very existence of the relationship provides valuable information. It can influence decisions on medical diagnosis or treatments [2]. ARs, comprised of a few elements with some relationship between them, are much easier to interpret than other methods for identifying correlations, such as those based on automatic learning (Bayesian Networks, Support Vector Machines or Neural Networks). For instance, a database of such ARs could be given the following query: “find all rules that have the problem pharyngitis as consequent”, and the retrieved rules could identify which medical symptoms or problems should be treated or determined in order to prevent or diagnose pharyngitis.
There are several algorithms based on heuristic statistical models [3] that provide the complete set of ARs compatible with a database of groups of elements (events, medical conditions, features, etc.) that have occurred at the same time. However, many of these rules are irrelevant and may have happened by chance. A solution to this problem could be to train a Machine Learning (ML) system to identify relevant rules. However, this would require training data from which to learn. Given the large number of ARs generated from a database of co-occurring elements, manually tagging each rule as relevant or negligible is a costly and time-consuming process for medical experts.
The objective of this work is to design a new semi-supervised iterative ML algorithm, i.e. an algorithm that minimizes the amount of tagged ARs to be supplied as input. It only needs a tiny initial seed of tagged ARs, from which the algorithm self-trains in an incremental and iterative way. This is called bootstrapping [4], and it means that the economic and time costs of discovering new valid ARs would diminish drastically, making the task more practicable.
The proposed algorithm is based on a combination of supervised and unsupervised techniques that detect the most reliable information, which is then used to improve the incremental training of the system. The supervised system is based on a number of relevant AR features. We have evaluated the system using real data from different sections of a hospital, with the data homogenized, anonymized and standardized into EHR extracts. The data refers to real problems of hospital patients. We performed an exhaustive evaluation of the proposal, comparing the results of an unsupervised approach (0.63 F-measure) with a fully supervised one (0.71 F-measure) and with the proposed semi-supervised system (0.75 F-measure).
The new semi-supervised algorithm performs in a similar way to fully supervised ML algorithms on the same corpus, but uses a much smaller amount of manually tagged ARs, thus making the discovery of new medical knowledge easier to achieve.
The formal definitions of ARs and the concept of goodness measure related to an AR can be found in Additional file 1: Supplementary Material, section 1. Different goodness measures are available; the most widely used are the \({\chi }^2\) measure [5], for high absolute frequencies, and Fisher’s exact test [6], in general when these frequencies are low.
Given a set of data, several algorithms may be used to generate the ARs implied by the data. However, a brute-force search algorithm may generate such a high number of ARs that the problem is often called the curse of dimensionality. Some algorithms, such as FP-Growth [7], use a number of techniques to limit the number of rules produced. These include a minimum frequency threshold, also called the support of the rule, or a minimum confidence of the rule.
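To make the support and confidence thresholds concrete, here is a minimal brute-force sketch (not the FP-Growth algorithm itself, which avoids this exhaustive enumeration) that mines single-consequent rules from a toy set of patient problem lists; the data and function names are illustrative assumptions, not taken from the paper:

```python
from itertools import combinations

# Toy transactions: each set is the list of problems recorded for one patient.
transactions = [
    {"cough", "fever", "flu"},
    {"cough", "fever", "flu"},
    {"cough", "fever", "asthma"},
    {"fever", "flu"},
    {"cough", "flu"},
]

def support(itemset):
    """Relative frequency of an itemset in the transactions."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def mine_rules(min_support=0.4, min_confidence=0.7):
    """Brute-force enumeration of single-consequent rules, pruned by
    minimum support and minimum confidence (illustration only)."""
    items = sorted(set().union(*transactions))
    rules = []
    for k in range(2, len(items) + 1):
        for subset in combinations(items, k):
            s = frozenset(subset)
            if support(s) < min_support:
                continue  # rule too infrequent: fails the support threshold
            for consequent in s:
                antecedent = s - {consequent}
                confidence = support(s) / support(antecedent)
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, support(s), confidence))
    return rules

for ant, cons, sup, conf in mine_rules():
    print(f"{sorted(ant)} -> {cons}  support={sup:.2f} confidence={conf:.2f}")
```

Even on this five-patient toy dataset the enumeration visits every itemset, which is why scalable algorithms such as FP-Growth are needed in practice.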
However, neither of these two requirements guarantees the existence of a positive dependence between the antecedent and the consequent of the rule; indeed, the rule might have been generated by chance. Even after filtering the rules with a goodness measure, two kinds of errors may remain. Type 1 errors (false positives) are rules that pass the validation test but are false, and type 2 errors (false negatives) are true rules that fail it [8]. These two types of errors are usually complementary. Accordingly, the discovered ARs should always be pruned in a post-processing phase using a statistical test (goodness measure) such as the \({\chi }^2\) test or Fisher’s exact test.
Selection of significant patterns
In order to alleviate the false-positive problem in the discovery of association rules, several testing correction techniques have been proposed [9]. Most of them are based on the use of p values. The p value of an association rule R is the probability of observing R, or a rule stricter than R, when the two sides of R are independent. A rule with a low p value is unlikely to occur if its two sides are independent; accordingly, since the rule has been found in the data, it is unlikely that its two sides are independent, and the association is likely to be true. By way of contrast, a high p value does not provide information about the independence of the two sides of the rule, and such rules can be discarded. A commonly used p value threshold [10] is 0.05. Some of the most frequently used statistical tests for computing p values are Pearson’s chi-square test of independence [5, 11] and Fisher’s exact test [6]. These tests compute the p value from the discrepancies between observed and expected values. Whereas the chi-square test is an approximation for large sample sizes, Fisher’s exact test provides an exact p value for any sample size.
A technique for reducing the number of false positives, proposed by Webb [12], is based on separating the available data into exploratory and holdout sets. The exploratory set is used to discover rules using standard algorithms for association rules, such as FP-Growth [7]. The holdout set is then used to compute the statistical significance of the discovered rules using a standard test. Finally, by setting an appropriate threshold for the required statistical significance, the most promising rules are selected.
Fisher’s test provides the significance of the association (contingency) between two ways of classifying the data. The computation of the test is usually based on the contingency table that records the counts for the different classes. The p value is computed from the hypergeometric distribution of the numbers contained in the cells of the table.
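The hypergeometric computation described above can be sketched directly with the standard library; the function name and the toy contingency tables are illustrative, not taken from the paper:

```python
from math import comb

def fisher_p_value(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 contingency table
    [[a, b], [c, d]], where a counts cases with both the antecedent and
    the consequent present: the probability, under independence, of
    observing a or more joint occurrences (a hypergeometric tail)."""
    n_total = a + b + c + d
    row1 = a + b          # cases where the antecedent holds
    col1 = a + c          # cases where the consequent holds
    denom = comb(n_total, row1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        p += comb(col1, k) * comb(n_total - col1, row1 - k) / denom
    return p

# A strongly associated table yields a very low p value ...
print(fisher_p_value(30, 5, 5, 60))
# ... while an independent-looking table yields a high one.
print(fisher_p_value(10, 10, 10, 10))
```

In practice a library routine such as `scipy.stats.fisher_exact` would be used, but the tail sum above is the computation it performs for the one-sided case.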
Semi-supervised learning
Standard supervised ML algorithms trying to discover new good (true) rules (i.e. new medical knowledge) have a severe problem, namely the excessive amount of training data necessary. The amount of data used to train a model has a direct impact on its performance. Supervised systems trained on large amounts of annotated data outperform unsupervised systems, as they rely on more information related to the problem in question. However, human-annotated data is expensive and often difficult to obtain. This is because of the inherent complexity of knowledge-codifying rules and also the very high number of them being produced. Semi-supervised learning techniques can be an alternative when only limited amounts of annotated data are available. These techniques enhance a small amount of annotated data with a large amount of unlabeled data [4, 13]. This idea is related to other forms of semi-supervised learning, such as co-learning and mutual bootstrapping. The co-training approach [14] looks at multiple representations of the same data. During the co-training process, two classifiers are trained on the same data using different feature sets. These two classifiers then bootstrap each other and make predictions on unseen examples, thereby feeding each other. Data labeled with high confidence by one classifier is given to the other as training data. Another approach is mutual bootstrapping [15], which aims to learn different types of knowledge simultaneously by alternately leveraging one type of knowledge to learn the other. Our proposal differs from these approaches, since we do not combine two classifiers, but a supervised method with an unsupervised one. However, these provide different types of knowledge and are also applied alternately (in a series of iterations), as in the mutual bootstrapping approach.
Algorithms for association rule mining
Association rule mining (ARM) is one of the most popular methods used to extract knowledge from large databases [3]. In 1993 Agrawal et al. proposed the Apriori algorithm to extract frequent rules and patterns from databases [16]. Many researchers have tried to improve this process, including generating ARs with faster algorithms such as FP-Growth, or reducing the large number of rules generated [7, 17,18,19,20,21,22,23].
Examples of practical use of standard AR mining in the medical field include the identification of clinically accurate associations between medications, laboratory results and diseases [24, 25] and between clinical findings and chronic diseases [26]. Networks of such disease relationships have also been visualized [27]. AR generation algorithms such as Apriori [16] or FP-Growth [7] have also been used to establish relationships between healthcare parameters and specific problems, such as heart disease [28], brain tumours [29], HIV [30], oral cancer [31], type 2 diabetes [32] or Alzheimer’s disease [33]. The difficulty of controlling the proliferation of type 1 errors (false positives) is closely related to the subject of this paper and is addressed in [34] with non-definitive results (i.e. this is an active research topic). In [35] it is applied to the specific problem of mining a medical image dataset. Guo et al. [36] address the relationship between readmission and other features in diabetic patients’ data, reducing the readmission of such patients. In [37] the best AR mining algorithm is tested and chosen using a number of different criteria.
Methods
The EXTRAE algorithm presented in this paper is a semi-supervised system comprised of two modules: one that implements an unsupervised method and another that implements a supervised method. We will first describe the unsupervised module, then the supervised module, and finally the global system, which we have called the EXTRAE algorithm. Figure 1 shows a flow diagram with the interaction between the dataset and the unsupervised and supervised modules.
Although use of the dataset is explained in detail in the following sections, a brief description of its use by the different modules of the system is included below. The dataset is initially divided into training (80%) and test (20%) sets. As usual, the test set will be used to evaluate the performance of the system. The training set is in turn divided into seed and development sets, which will be used by the supervised module. The unsupervised module will only use the seed set. This seed set will be divided into equal parts by the unsupervised system, as described in the following section. The seed and development sets are of variable size, depending on the output of the matching approach that combines the output of the unsupervised and supervised modules.
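The partitioning just described can be sketched as follows; the helper function is hypothetical, and the seed size of 10 and the corpus of 1300 annotated rules are taken from figures reported later in the paper:

```python
import random

def split_dataset(rules, test_frac=0.2, seed_size=10, rng=None):
    """Partition annotated rules into a test set (20%), a small seed set
    and a development pool, mirroring the splits described above."""
    rng = rng or random.Random(42)  # fixed seed for reproducibility
    rules = list(rules)
    rng.shuffle(rules)
    n_test = int(len(rules) * test_frac)
    test, train = rules[:n_test], rules[n_test:]
    # The training portion is further split into seed and development sets.
    seed, development = train[:seed_size], train[seed_size:]
    return seed, development, test

seed, development, test = split_dataset(range(1300))
print(len(seed), len(development), len(test))  # 10 1030 260
```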
Unsupervised module
In this work we have implemented an unsupervised module (Fisher’s exact test) in order to calculate the p value for a set of rules. Specifically, we use this p value to rank the set of association rules in ascending order and establish a threshold: the rules above the threshold (lower p values) are considered true, and the rules below it (higher p values) are considered false.
Specifically, we carried out an initial study of the results that could be obtained when no annotated data is available and, accordingly, unsupervised methods must be used.
We apply the holdout technique proposed by Webb [12], splitting the dataset into exploratory and holdout parts, and applying the p value threshold on the holdout set in order to filter the rules extracted from the exploratory set.
Specifically, the following steps are performed:

1. The dataset is divided into exploratory (50%) and holdout (50%) sets.

2. The FP-Growth algorithm is applied to extract the association rules in both sets. The implementation used is that available in the SPMF software (Footnote 1). FP-Growth is an efficient algorithm for finding frequently co-occurring items in a dataset.

3. These two sets of rules allow us to apply the Fisher test to obtain the p values for the rules in the holdout set. Details about the computation of the test can be found in Additional file 1: Supplementary Material, section 2.

4. Finally, the rules in the holdout set are sorted according to their p value. Then, a threshold for the p value is set in order to select the rules with higher significance in the holdout set, assuming that the selected rules are true and the rest are false. Here, the tricky point is the selection of an appropriate threshold.
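The ranking-and-thresholding step above can be sketched as follows; `rank_and_label` is a hypothetical helper, and the rules and p values are invented for illustration:

```python
def rank_and_label(rules_with_p, threshold):
    """Sort rules by ascending p value and label those under the
    threshold as true and the rest as false (hypothetical helper)."""
    ranked = sorted(rules_with_p, key=lambda pair: pair[1])
    return [(rule, p, p < threshold) for rule, p in ranked]

# Invented holdout rules with their Fisher p values.
holdout = [("r1", 1e-12), ("r2", 0.03), ("r3", 1e-7), ("r4", 0.4)]
for rule, p, is_true in rank_and_label(holdout, threshold=1e-5):
    print(rule, p, is_true)
```

The open question, as the text notes, is how to choose `threshold`; the experiment below sweeps candidate values and measures the resulting F-measure.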
In order to illustrate the behaviour of the p value and its effect on system performance, an experiment with a sample labeled dataset was carried out. The threshold indicates the split between rules considered true and false. A rule with a p value lower than the threshold is considered true, whereas a rule with a p value higher than the threshold is considered false. Figure 2 shows the evolution of the p value and the performance (F-measure) of the system depending on the threshold used. It is clear that, beyond the optimal threshold value, a higher threshold has a negative impact on the performance of the system, reducing its F-measure. According to this experiment, the best threshold is at rule 232, corresponding to a p value of 1.42E-9. This setting achieves an evaluation score of 0.66.
Supervised module
We have data annotated by doctors with true and false labels, and therefore we can implement a supervised approach. The objective of the EXTRAE algorithm is to start from a small set of manually annotated rules and to increase its size in an unsupervised way, thus obtaining a large set of rules automatically annotated as true or false. This supervised module (as we will see later) has two functions. On the one hand, it is used on the training set, along with another method, to predict the rules that can be reliably added to the seed set. On the other hand, it is used on the whole test set for evaluation purposes, by comparing the set of rules automatically annotated with those annotated by a doctor.
We apply a Random Forest algorithm (see “Results” section), using the following set of features obtained from the FP-Growth algorithm:

- Support. The support of an association rule “A and B \(\rightarrow\) C” is the support of the set S = {A, B, C}. So the support of the rule is the (absolute or relative) number of cases in which the rule is correct (i.e. in which the presence of item C follows from the presence of items A and B).
- Confidence. The confidence of an association rule R = “X \(\rightarrow\) Y” (with item sets X and Y) is the support of the set of all items that appear in the rule (the support of S = X \(\cup\) Y) divided by the support of the antecedent (also called “if-part” or “body”) of the rule (here X).
- Lift. The lift value is the quotient of the posterior and the prior confidence of an association rule. That is, if “\(\emptyset \rightarrow\) flu” has a confidence of 60% and “cough \(\rightarrow\) flu” has a confidence of 72%, then the lift value (of the second rule) is 72/60 = 1.2.
- Number of antecedents. The number of antecedents of an association rule “A and B \(\rightarrow\) C” is the number of elements of the set S = {A, B}.
- Number of consequents. The number of consequents of an association rule “A and B \(\rightarrow\) C” is the number of elements of the set S = {C}.
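Assuming the occurrence counts are available from the mining step, the five features above could be computed as in this sketch; the function and the counts are illustrative, with the toy numbers chosen to reproduce the lift example given for “cough \(\rightarrow\) flu”:

```python
def rule_features(n_total, n_antecedent, n_consequent, n_both,
                  antecedent, consequent):
    """Compute the five classifier features for a rule antecedent -> consequent,
    given raw occurrence counts in the dataset (hypothetical helper)."""
    support = n_both / n_total
    confidence = n_both / n_antecedent
    prior_confidence = n_consequent / n_total  # confidence of "{} -> consequent"
    lift = confidence / prior_confidence       # posterior over prior confidence
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "n_antecedents": len(antecedent),
        "n_consequents": len(consequent),
    }

# "cough -> flu": flu appears in 60% of records, cough & flu in 36 of the
# 50 records with cough, i.e. a confidence of 72% and hence a lift of 1.2.
feats = rule_features(n_total=100, n_antecedent=50, n_consequent=60,
                      n_both=36, antecedent={"cough"}, consequent={"flu"})
print(feats)
```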
Semi-supervised approach
Since manually classifying ARs as true or false is an expensive and time-consuming task for a health professional, we have resorted to a new semi-supervised approach that reduces the amount of annotated data needed. The idea is to use a small set of annotated rules to train a classifier and to combine its predictions with those obtained using the p value method. Our hypothesis is that the cases in which both predictions coincide have greater reliability and provide a new set of rules that can in turn be used to train the system. Figure 3 shows a flow diagram of the semi-supervised incremental learning approach.
Specifically, our semi-supervised approach involves the following steps:

1. First, we divide the corpus annotated by doctors into two different sets: training (80%) and test (20%). The training set is in turn divided into two sets: seed and development.

2. We randomly select a small set S of rules from the training set, which is used as the seed. This seed set is used to train the supervised module, resulting in an ML model. The results will obviously be lower than those provided by a system trained with a larger set.

3. The ML model (i.e. the machine learning system developed from the seed rules) is then applied to predict the class (i.e. the true or false assignment) of each rule in the development set.

4. The p value threshold is calculated from the rules of the last computed seed set. This threshold is used to decide which rules are considered true or false: after sorting S according to p value, we choose as threshold the p value that maximizes the hits on the seed set (i.e. it divides the set into true and false rules with as many good predictions as possible).

5. The unsupervised module then applies the p value filter to the development set. We select the cases from the development set in which the predictions of the supervised module and those of the unsupervised module (based on the p value filter) match: both true or both false. These coinciding rules are added to the seed set and removed from the development set.

6. The new seed set (previous seed set plus the coincident rules from the development set) is used to train the supervised module again.

7. The described process is repeated until the set of coincident rules obtained from the development set is empty (i.e. the seed set cannot grow anymore).

Each model trained with the incremental seed set is evaluated on the test set in order to have a reference for the performance improvement.
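The iterative loop above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the unsupervised labels are passed in as a fixed array (the paper recomputes the p value threshold from the seed at each iteration), and in the usage example the gold labels of a synthetic dataset merely stand in for the unsupervised predictions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extrae_loop(seed_X, seed_y, dev_X, unsup_pred):
    """Simplified EXTRAE-style iteration: train on the seed, predict on the
    development pool, keep only the rules where the supervised prediction
    agrees with the unsupervised label, move them into the seed, repeat."""
    seed_X, seed_y = np.asarray(seed_X), np.asarray(seed_y)
    dev_X, unsup_pred = np.asarray(dev_X), np.asarray(unsup_pred)
    while True:
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        clf.fit(seed_X, seed_y)
        if len(dev_X) == 0:
            break
        sup_pred = clf.predict(dev_X)
        agree = sup_pred == unsup_pred
        if not agree.any():
            break  # no new coincident rules: the seed cannot grow anymore
        # Agreed predictions are trusted and join the seed set.
        seed_X = np.vstack([seed_X, dev_X[agree]])
        seed_y = np.concatenate([seed_y, sup_pred[agree]])
        dev_X, unsup_pred = dev_X[~agree], unsup_pred[~agree]
    return clf, seed_X, seed_y

# Synthetic usage: 10 seed examples, 190 development examples whose gold
# labels act as a stand-in for the unsupervised (p value based) labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
clf, seed_X, seed_y = extrae_loop(X[:10], y[:10], X[10:], y[10:])
print(seed_y.shape[0])
```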
Experimental framework
Dataset
In order to implement and test our semi-supervised ML algorithm we have used a standardized medical data corpus from the Fuenlabrada University Hospital (HUF) in Madrid, Spain. This corpus was constructed in a previous research project [38]. Electronic Health Records (EHR) from the HUF corpus are written in Spanish and normalized using the ISO/EN 13606 standard [39]. This standard follows a so-called dual model [40] that separates two levels of abstraction: one level of information, called the Reference Model (RM) [41], and one level of knowledge, using archetypes [42]. The EHRs in this corpus correspond to primary attention, several specialized attention services and the pharmacy department of the hospital. The EHR extract files of the HUF corpus are XML files, each corresponding to one patient. Each patient may in fact have several EHR extracts containing his or her medical information, and thus each XML file holds the medical problems suffered by the patient to whom it belongs. We have used the information from each medical problem (i.e. the name of the problem) to represent one different feature in our AR knowledge representation. This means that our medical data input to the FP-Growth algorithm generating the ARs is comprised of rows representing each patient and columns representing the name of each medical problem. Our ARs take the form
\(A \wedge B \wedge \cdots \rightarrow D\)
where the symbols \(A \cdots D\) correspond to the names of medical problems of one patient.
We call HUF-AR our manually annotated AR dataset, generated from the initial HUF data described in the “Dataset” section. The HUF-AR dataset is generated by applying the FP-Growth algorithm to the HUF data. We set the FP-Growth support and confidence parameters to 10% and 70%, respectively. Next, 1300 rules were randomly selected to be annotated by a doctor as true or false. Manual annotation was relatively simple, as most ARs are composed of common diseases typical of primary care. In addition, certain but trivial ARs were nevertheless classified as true, since they should contribute to the good behavior of the algorithm even though their intrinsic value is low.
The descriptions of the medical problems are written in natural language, which gives them great variability when referring to the same medical condition. In order to reduce this variability we performed a preprocessing of the data, described in Additional file 1: Supplementary Material, section 3.
Results
In this section we present the experiments carried out on the HUF corpus as well as the results obtained. Since the EXTRAE algorithm is comprised of an unsupervised module and a supervised module, we evaluate the impact of each module separately, i.e. we evaluate the unsupervised module as if it were an independent system and do the same with the supervised module. In both cases a test set of 20% has been used. The following sections show the results of this evaluation by modules, and then the overall performance of the EXTRAE algorithm.
Evaluation of the unsupervised module as an independent system
As seen in the previous sections, the unsupervised module uses an implementation of the FP-Growth algorithm. This algorithm is an efficient and scalable method for mining the complete set of frequent patterns by pattern fragment growth. The parameters used for this algorithm are: min. support of 0.01; min. confidence of 0.7; minimum lift of 1; max. antecedent length of 4; max. consequent length of 1.
Table 1 shows results for the unsupervised method. Several p value thresholds have been analysed in order to prove the influence of this parameter. In the case of the unsupervised module, the threshold of the p value is applied directly on the test set (no training set is used).
As per Table 1, the best threshold for the unsupervised method is a p value of 1E-11, obtaining an F-measure of 0.63. This value is consistent with the p value results shown in Fig. 2 for the whole corpus. Note that in this case, as it is an unsupervised method, the training set has not been used for any calculations; all operations have been carried out on the test set (20%). For this reason the results are slightly lower in this case.
Evaluation of the supervised module as an independent system
The supervised method uses the features described above in the “Supervised module” section, and the training and test sets are used in the usual way for a machine learning system. Because the EXTRAE algorithm works with a small set of association rules manually labelled by a doctor, we have designed an experiment around the size of the training set. The aim is not to prove the obvious fact that a supervised system gets worse results when the training set is smaller, but to analyze the difference in system performance depending on the size of the training set used. Table 2 shows the results for the supervised method depending on the size of the training set used. The test set has the same size (20%) in all cases.
In view of the results obtained by the supervised module in Table 2, the best training size is 80% (obtaining an F-measure of 0.71). For the supervised module there is a meaningful difference between the training sizes used, and performance grows with them. Finally, comparing with the results of the unsupervised module, the supervised module, as expected, obtains better results (0.71 vs 0.63). However, if we compare the performance of the unsupervised module with that of the supervised module trained on a set of the same size (20%), the results are similar (0.66 vs 0.63).
One of the relevant aspects when using a supervised system is the selection of the classification algorithm. The following section presents an experiment comparing a set of classification algorithms representing the main families of classification algorithms.
Supervised classification algorithms
Table 3 shows the results of the classification process using all the features introduced in this work and several classification algorithms included in the Weka data mining tool [43]. A large number of classification algorithms from different families have been analyzed. The evaluation was carried out using the training/test division (80/20%) that achieved the best performance in Table 2 for the supervised system. The results show that Random Forest is the algorithm with the best performance. Thus Random Forest [44] is used in the following experiments where the supervised module is employed.
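The paper runs this comparison in Weka; a rough scikit-learn analogue, on synthetic stand-in data rather than the HUF-AR features, might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the five rule features with true/false labels.
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# One representative per classifier family, evaluated by F-measure.
candidates = {
    "RandomForest": RandomForestClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "NaiveBayes": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
}
scores = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    scores[name] = f1_score(y_test, clf.predict(X_test))
for name, f1 in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: F-measure = {f1:.3f}")
```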
EXTRAE algorithm: semi-supervised incremental learning method
Table 4 shows the results of the semi-supervised method based on incremental learning (EXTRAE algorithm). Seed size is the original size of the training set from which the set is automatically increased. Iterations is the number of rounds in which new rules are added to the seed set until a set is reached to which no new rule can be added. The p value is calculated from the seed set. The results show the performance of the system after n iterations.
From the results shown in Table 4, the best seed size is 10. A p value threshold of 4.79E-13 is calculated for this seed size, and after 7 iterations an F-measure of 0.75 is obtained. The best result achieved with the supervised module was 0.71. The potential of the semi-supervised method based on incremental learning is thereby demonstrated.
The improvement of the incremental-learning-based approach (EXTRAE algorithm) over the supervised module is remarkable, taking into account that in both cases the same features are used for training. The improvement is due to the incremental learning method: the p value threshold allows the selection of better rules for learning, and therefore the trained model obtains better results. This is very similar to what happens in the semi-supervised Yarowsky algorithm [4], where it is of vital importance that very good examples are learned from the beginning in order to bootstrap the algorithm correctly and obtain good performance [45].
Table 5 shows the partial results of the EXTRAE algorithm in each iteration. In the first iteration, 793 new rules are added and an F-measure of 0.70 is obtained. From the fourth iteration onwards, the number of matching rules is greatly reduced, and performance increases slowly until it reaches an F-measure of 0.75. Accuracy shows a great evolution from the original seed: in only one iteration it increases by 16%, which proves the high quality of the added rules. Finally, the algorithm obtains an accuracy of 79%, improving the original accuracy by 21%.
Conclusions
We propose a new semi-supervised system, called the EXTRAE algorithm, that requires a minimum amount of annotated data to obtain reliable association rules. This algorithm is comprised of two modules: an unsupervised module and a supervised module. The outputs of both modules are combined in order to obtain the best performance.
The idea behind the system is to combine the information provided by a supervised module trained with very little data and the information provided by an unsupervised module. By selecting the predictions on which both models agree, we enlarge the training data for the next step of the algorithm. The process iterates until no new rules are selected.
We provide comparisons between an unsupervised model, a fully supervised model and the semi-supervised model (EXTRAE algorithm). We find that a small seed of between 10 and 20 rules is enough to achieve the best results. This is because the EXTRAE algorithm only adds the best association rules to the set of rules from which the supervised model learns to make its predictions. The results also show that the EXTRAE algorithm obtains better results as its seed set of association rules grows during the iterative process.
This work marks an important breakthrough in the development of systems for mining association rules, since an extremely small amount of annotated data, which is easily achievable, leads to results similar to those of a supervised system.
It will be possible in the near future to design fast and cost-effective experiments to obtain and validate new medical knowledge (codified in the form of association rules) from large standardized medical databases, thereby permitting the advance of scientific medicine in general and Personalized and Precision Medicine (PPM) in particular.
In the future we plan to extend the algorithms to work with other kinds of features extracted from standardized medical databases, such as the initial and final dates of problems, their duration or their severity. This can indeed be applied to any other relevant feature from the patient's EHR. We also plan to include data from the exposome, such as drugs, contaminants or daily lifestyle habits. We will perform the experiments on bigger and more specific databases, referring to a cohort especially selected to address a specific medical knowledge domain. We also plan to generate embeddings from medical reports. We will then explore the similarity between those embeddings, according to the antecedents and consequents of the association rules, as an alternative unsupervised method to the p value.
Availability of data and materials
The data that support the findings of this study are available from Fuenlabrada University Hospital (HUF), but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission from Fuenlabrada University Hospital.
Abbreviations
AR: Association rule
PCDF: Polychlorinated dibenzofuran
ML: Machine learning
EHR: Electronic health record
ARM: Association rule mining
AC: Associative classification
DM: Data mining
HIV: Human immunodeficiency virus
HUF: Fuenlabrada University Hospital
ISO/EN: International Organization for Standardization/European Norm
XML: Extensible Markup Language
ICU: Intensive care unit
PPM: Personalized and precision medicine
References
Masuda Y. The Yusho rice oil poisoning incident. In: Schecter A, editor. Dioxins and health. Berlin: Springer; 1994. p. 633–59. https://doi.org/10.1007/9781489914620_19.
Hämäläinen W. Efficient search methods for statistical dependency rules. Fundam Inform. 2011;113(2):117–50.
Ghafari SM, Tjortjis C. A survey on association rules mining using heuristics. Wiley Interdiscip Rev Data Min Knowl Discov. 2019. https://doi.org/10.1002/widm.1307.
Yarowsky D. Unsupervised word sense disambiguation rivaling supervised methods. In: 33rd Annual meeting of the association for computational linguistics, 26–30 June 1995, MIT, Cambridge, Massachusetts, USA, Proceedings., 1995. p. 189–196. http://aclweb.org/anthology/P/P95/P951026.pdf
Pearson K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Lond Edinb Dublin Philos Mag J Sci. 1900;50(302):157–75.
Fisher RA. Statistical methods for research workers. 13th ed. New York: Hafner; 1958.
Han J, Pei J, Yin Y, Mao R. Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Min Knowl Discov. 2004;8(1):53–87. https://doi.org/10.1023/B:DAMI.0000005258.31418.83.
Papoulis A. Probability, random variables and stochastic processes. 3rd ed. New York: McGraw-Hill; 1991.
Liu G, Zhang H, Wong L. Controlling false positives in association rule mining. Proc VLDB Endow. 2011;5(2):145–56.
Bross IDJ. Critical levels, statistical language and scientific inference. In: Godambe VP, Sprott DA, editors. Foundations of statistical inference. Toronto: Holt McDougal; 1971. p. 500–13.
Webb GI. Discovering significant patterns. Mach Learn. 2008;71(1):131.
Blum A, Mitchell T. Combining labeled and unlabeled data with co-training. In: Proceedings of the eleventh annual conference on computational learning theory. COLT' 98, 1998. p. 92–100. ACM, New York, NY, USA
Rumshisky A, Stubbs A. Machine learning for higher-level linguistic tasks. Dordrecht: Springer; 2017. p. 333–51 (Chap. 13).
Riloff E, Jones R. A retrospective on mutual bootstrapping. AI Mag. 2018;39(1):51–61.
Agrawal R, Imielinski T, Swami AN. Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD international conference on management of data, Washington, DC, USA, May 26–28, 1993. 1993. p. 207–216
Yan X, Zhang C, Zhang S. Genetic algorithm-based strategy for identifying association rules without specifying actual minimum support. Expert Syst Appl. 2009;36(2):3066–76. https://doi.org/10.1016/j.eswa.2008.01.028.
Djenouri Y, Drias H, Chemchem A. A hybrid bees swarm optimization and tabu search algorithm for association rule mining. In: Fifth world congress on nature and biologically inspired computing, NaBIC 2013, Fargo, ND, USA, August 12–14, 2013. IEEE, 2013. p. 120–125. https://doi.org/10.1109/NaBIC.2013.6617849.
Soysal ÖM. Association rule mining with mostly associated sequential patterns. Expert Syst Appl. 2015;42(5):2582–92. https://doi.org/10.1016/j.eswa.2014.10.049.
Goyal V, Sureka A, Patel D. Efficient skyline itemsets mining. In: Chen JY, Zaki MJ, Kahveci T, Salem S, Koyutürk M, editors Proceedings of the eighth international C* conference on computer science & software engineering, Yokohama, Japan, July 13–15, 2015. ACM, 2015. p. 119–124. https://doi.org/10.1145/2790798.2790816.
Narvekar M, Syed SF. An optimized algorithm for association rule mining using FP-tree. Procedia Comput Sci. 2015;45:101–10.
Drias H. Genetic algorithm versus memetic algorithm for association rules mining. In: 2014 Sixth world congress on nature and biologically inspired computing, NaBIC 2014, Porto, Portugal, July 30–August 1, 2014. IEEE, 2014. p. 208–213. https://doi.org/10.1109/NaBIC.2014.6921879.
Yuan J, Ding S. Research and improvement on association rule algorithm based on FP-growth. In: Wang FL, Lei J, Gong Z, Luo X, editors. Web information systems and mining—international conference, WISM 2012, Chengdu, China, October 26–28, 2012. Proceedings. Lecture notes in computer science, 2012, vol. 7529. Springer, p. 306–313. https://doi.org/10.1007/9783642334696_41.
Wright A, Chen ES, Maloney FL. An automated technique for identifying associations between medications, laboratory results and problems. J Biomed Inform. 2010;43(6):891–901. https://doi.org/10.1016/j.jbi.2010.09.009.
Rashid MA, Hoque MT, Sattar A. Association rules mining based clinical observations. 2014. arXiv:1401.2571
Imamura T, Matsumoto S, Kanagawa Y, Tajima B, Matsuya S, Furue M, Oyama H. A technique for identifying three diagnostic findings using association analysis. Med Biol Eng Comput. 2007;45(1):51–9. https://doi.org/10.1007/s1151700601216.
Chen H, Hu S, Luo Z, Tang L, Zeng Q, Wen X, Chen J, Que P, Peng B. Study of disease networks based on association rule mining from physical examination database. J Epidemiol Public Health Rev. 2017. https://doi.org/10.16966/24718211.157.
Rao PS, Devi TU. Applicability of Apriori-based association rules on medical data. Int J Appl Eng Res. 2017;12(20):9451–8.
Sengupta D, Sood M, Vijayvargia P, Hota S, Naik PK. Association rule mining based study for identification of clinical parameters akin to occurrence of brain tumor. Bioinformation. 2013;9(11):555–9.
Rameshkumar K. Extracting association rules from HIV infected patients' treatment dataset. Trends Bioinform. 2011. https://doi.org/10.3923/tb.2011.35.46.
Blessy RN, Amanullah KM. Oral cancer detection using Apriori algorithm. Int J Adv Res Comput Commun Eng. 2014;3(7):7376–9.
Rane N, Rao M. Association rule mining of type 2 diabetes using FP-growth association rule. Int J Eng Comput Sci. 2013;2(8):2319–7242.
Chaves R, Górriz J, Ramirez J, Illan IA, Salas-Gonzalez D, Rio MG. Efficient mining of association rules for the early diagnosis of Alzheimer's disease. Phys Med Biol. 2011;56(18):6047–63. https://doi.org/10.1088/00319155/56/18/017.
Deshmukh J, Bhosle U. Image mining using association rule for medical image dataset. Procedia Comput Sci. 2016. https://doi.org/10.1016/j.procs.2016.05.196.
Guo A, Zhang W, Xu S. Exploring the treatment effect in diabetes patients using association rule mining. Int J Inf Process Manag. 2016;7(3).
Lakshmi K, Vadivu G. Extracting association rules from medical health records using multicriteria decision analysis. Procedia Comput Sci. 2017. https://doi.org/10.1016/j.procs.2017.09.137.
Monteagudo JL, Salvador CH, Muñoz A, Pascual M, García-Sagredo P, Alvarez-Sánchez R, Cáceres-Tello J, García-Pérez J, García-Pacheco JL, López-Rodríguez F, Moreno O, Pozo JA, de la Cámara SP, de Madariaga RS, de Tena MJ. PITES: innovation platform in new services based on telemedicine and e-health for chronic and dependent patients. In: PITES: telemedicine and e-health innovation platform. Unidad de Investigación en Telemedicina y eSalud, Instituto de Salud Carlos III; 2014. p. 9–38.
Muñoz P, Trigo JD, Martínez I, Muñoz A, Escayola J, García J. The ISO/EN 13606 standard for the interoperable exchange of electronic health records. J Healthc Eng. 2011;2.
Beale T. Archetypes: constraint-based domain models for future-proof information systems. In: OOPSLA 2002 workshop on behavioural semantics. 2002. p. 16–32
Kalra D, Lloyd D. ISO 13606 electronic health record communication part 1: reference model. Technical report, ISO; 2008.
Kalra D, Beale T, Lloyd D. Electronic health record communication part 2: archetype interchange specification. Technical report, ISO; 2008.
Frank E, Hall MA, Witten IH. The Weka workbench. Online appendix. In: Data mining: practical machine learning tools and techniques. Amsterdam: Elsevier; 2016. p. 1–128.
Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
Sánchez-de-Madariaga R, Fernández-del-Castillo JR. The bootstrapping of the Yarowsky algorithm in real corpora. Inf Process Manag. 2009;45(1):55–69. https://doi.org/10.1016/j.ipm.2008.07.002.
Acknowledgements
The authors would like to thank Dr Dipak Kalra, leader of the EHRCom task force that defined the ISO/EN 13606 standard and his team from University College London for their kind permission to use the ISO/EN 13606 W3C XML schema.
Funding
This work has been partially supported by projects DOTTHEALTH (PID2019106942RBC32, MCI/AEI/FEDER, UE). (Design of the study. Analysis and interpretation of data) and EXTRAE II (IMIENS 2019). (Design of the study. Analysis and interpretation of data. HUF corpus manual tagging. Writing of the manuscript), PI18CIII/00004 “Infobanco para uso secundario de datos basado en estándares de tecnología y conocimiento: implementación y evaluación de un infobanco de salud para CoRIS (Infobank for the secondary use of data based on technology and knowledge standards: implementation and evaluation of a health infobank for CoRIS) – SmartPITeS” (Data collection and HUF corpus construction), and PI18CIII/00019  PI18/00890  PI18/00981 “Arquitectura normalizada de datos clínicos para la generación de infobancos y su uso secundario en investigación: solución tecnológica (Clinical data normalized architecture for the generation of infobanks and their secondary use in research: technological solution) – CAMAMA 4” (Data collection and HUF corpus construction) from Fondo de Investigación Sanitaria (FIS) Plan Nacional de I+D+i.
Author information
Authors and Affiliations
Contributions
The original idea of this paper is from authors RSM, JMR and LAS. RSM, JMR and LAS outlined the general idea and design of the new semisupervised algorithm. JMR and LAS designed the specific unsupervised modules. JMR constructed the specific algorithm. RSM, JMR and LAS evaluated the results and supervised the whole paper. JMCE annotated the AR corpus. All authors have read and approved the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
The need for ethics approval is deemed unnecessary according to national regulations: Ley Orgánica 15/1999 (Personal Data Protection), Ley 41/2002 (Patient autonomy and rights and obligations in clinical research and documentation) and Ley 14/2007 (Biomedical research).
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
This file includes the formal definitions related to Association Rules, the definition and computation of the Fisher's test and the EHR preprocessing used in this work.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Sánchez-de-Madariaga, R., Martinez-Romo, J., Escribano, J.M.C. et al. Semi-supervised incremental learning with few examples for discovering medical association rules. BMC Med Inform Decis Mak 22, 20 (2022). https://doi.org/10.1186/s12911-022-01755-3
DOI: https://doi.org/10.1186/s12911-022-01755-3
Keywords
 Medical records
 Association rules discovery
 Machine learning
 Semi-supervised approach