Table 1 The dual mining algorithm

From: Locating previously unknown patterns in data-mining results: a dual data- and knowledge-mining method

1 Given a database (DB) to be mined, select a relevant knowledgebase (KB);
2 Produce a mapping from the terminology of the DB to that of the KB, and/or vice versa. In biomedicine, databases usually use "aggregate classification", while knowledgebases usually use "detailed clinical vocabularies";
3 Choose the "primary unit of analysis" for the DB and for the KB. Examples of the unit of analysis for the DB are each 'patient', each 'visit', or each 'episode of care'; for the KB, it is each biomedical article;
4 Choose a type of pattern, a mining method for finding that type of pattern, and its measure of pattern strength. For example, association rules in which the strength of a rule is measured by Spearman's rho;
5 Given a list of attributes and their sampling probabilities, generate m n-tuples, where m and n are integers. m is the number of n-tuples chosen together for a single iteration, for example 20, 50, or 100. n is the number of attributes within a pattern, for example from 2 to 5;
6 Evaluate the batch of m n-tuples in the DB, and estimate the strength of each n-tuple;
7 Estimate the strengths of the same n-tuples in the KB;
8 Estimate the surprise score (SS) of each n-tuple from its pair of strengths in the DB and the KB. Also estimate the statistical significance of the scores;
9 Update the sampling probabilities of the attributes using the estimated SSs. Attributes observed more frequently in n-tuples with high SS receive higher sampling probabilities, while attributes of low-SS n-tuples receive lower probabilities;
10 Start over from step 5, until all n-tuples generated from the list of attributes are exhausted, or the time limit is reached.
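
The iterative part of the table (steps 5 to 10) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the table does not define how the surprise score is computed or how the sampling probabilities are updated, so the absolute difference of the two strengths and the exponential weight update used here are hypothetical choices. The names `sample_tuple`, `dual_mine`, `db_strength`, `kb_strength`, and `lr` are likewise invented for the example, a fixed iteration count stands in for the time limit, and the significance estimate of step 8 is omitted.

```python
import math
import random
from itertools import combinations


def sample_tuple(attrs, weights, n, rng):
    """Step 5: draw n distinct attributes, weighted by sampling probability."""
    pool = list(attrs)
    chosen = []
    for _ in range(n):
        pick = rng.choices(pool, weights=[weights[a] for a in pool], k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    return tuple(sorted(chosen))


def dual_mine(attrs, db_strength, kb_strength, m=20, n=2, max_iter=100,
              lr=0.5, seed=0):
    """Steps 5-10. db_strength and kb_strength are caller-supplied callables
    mapping an n-tuple to a number (assumed here to lie in [0, 1])."""
    rng = random.Random(seed)
    weights = {a: 1.0 for a in attrs}  # unnormalised sampling probabilities
    total = sum(1 for _ in combinations(attrs, n))
    scores = {}
    for _ in range(max_iter):  # iteration cap stands in for the time limit
        remaining = total - len(scores)
        if remaining == 0:  # step 10: all n-tuples exhausted
            break
        # Step 5: build a batch of up to m not-yet-evaluated n-tuples.
        batch, attempts = [], 0
        while len(batch) < min(m, remaining) and attempts < 50 * m:
            t = sample_tuple(attrs, weights, n, rng)
            if t not in scores and t not in batch:
                batch.append(t)
            attempts += 1
        if not batch:
            continue
        # Steps 6-8: strengths in DB and KB, then the surprise score.
        # ASSUMPTION: SS is the absolute difference of the two strengths.
        for t in batch:
            scores[t] = abs(db_strength(t) - kb_strength(t))
        # Step 9 (ASSUMPTION): raise weights of attributes in tuples whose SS
        # is above the batch mean, lower those below it.
        mean_ss = sum(scores[t] for t in batch) / len(batch)
        for t in batch:
            for a in t:
                weights[a] *= math.exp(lr * (scores[t] - mean_ss))
    return scores, weights
```

As a usage example, with toy strength functions in which only the pair `("a", "b")` is strong in the DB but weak in the KB, that pair receives the highest surprise score and its attributes end up with the largest sampling weights:

```python
scores, weights = dual_mine(
    ["a", "b", "c", "d"],
    db_strength=lambda t: 0.9 if t == ("a", "b") else 0.1,
    kb_strength=lambda t: 0.1,
    m=3,
)
print(max(scores, key=scores.get))
```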