Table 1 The dual mining algorithm

From: Locating previously unknown patterns in data-mining results: a dual data- and knowledge-mining method

1

Given a database (DB) to be mined, select a relevant knowledgebase (KB);

2

Map the terminology of the DB to that of the KB, and/or vice versa. In biomedicine, databases usually use "aggregate classification", while knowledgebases usually use "detailed clinical vocabularies";

3

Choose a "primary unit of analysis" for the DB and for the KB. Examples of a unit of analysis for the DB are each 'patient', each 'visit', or an 'episode of care'; for the KB, an example is each biomedical article;

4

Choose a type of pattern, a mining method for finding that type of pattern, and a measure of pattern strength. For example, association rules where the strength of a rule is measured by Spearman's Rho;

5

Given a list of attributes and their sampling probabilities, generate m n-tuples, where m and n are integers. m is the number of n-tuples chosen together for a single iteration; for example, m can be 20, 50, or 100. n is the number of attributes within a pattern; for example, n may range from 2 to 5;

6

Evaluate the batch of m n-tuples in the DB, and estimate the strength of each n-tuple;

7

Estimate the strengths of the same n-tuples in the KB;

8

Estimate the surprise score (SS) of each n-tuple from its pair of strengths in the DB and the KB. Also estimate the statistical significance of the scores;

9

Update the list of attribute sampling probabilities using the estimated SSs. Attributes observed more frequently in n-tuples with high SS receive higher sampling probabilities, while attributes of n-tuples with low SS receive lower probabilities;

10

Repeat from step 5 until all n-tuples that can be generated from the list of attributes are exhausted, or the time limit is reached.
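
The iterative part of the table (steps 5 to 10) can be illustrated with a minimal Python sketch. Several details here are assumptions, not the authors' definitions: the table does not give a formula for the surprise score, so the sketch uses the absolute difference between the DB and KB strengths as a placeholder; the probability update is a simple multiplicative rule chosen for illustration; `max_iter` stands in for the time limit; and the significance estimate of step 8 is omitted. The names `sample_tuple`, `dual_mine`, `db_strength`, `kb_strength`, and `lr` are hypothetical.

```python
import math
import random
from collections import defaultdict


def sample_tuple(attrs, probs, n, rng):
    """Draw n distinct attributes, weighted by their current sampling probabilities."""
    pool = list(attrs)
    weights = [probs[a] for a in pool]
    chosen = []
    for _ in range(n):
        i = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(i))
        weights.pop(i)
    return tuple(sorted(chosen))


def dual_mine(attrs, db_strength, kb_strength, n=2, m=20, max_iter=100, lr=0.5, seed=0):
    """Sketch of steps 5-10. db_strength and kb_strength are caller-supplied
    functions mapping an n-tuple of attributes to a strength value."""
    rng = random.Random(seed)
    probs = {a: 1.0 / len(attrs) for a in attrs}  # start from uniform sampling
    total = math.comb(len(attrs), n)              # number of distinct n-tuples
    scores = {}                                   # n-tuple -> surprise score
    for _ in range(max_iter):                     # max_iter stands in for the time limit
        if len(scores) == total:                  # step 10: all n-tuples exhausted
            break
        # Step 5: sample up to m n-tuples that have not been evaluated yet.
        target = min(m, total - len(scores))
        batch, attempts = [], 0
        while len(batch) < target and attempts < 100 * m:
            t = sample_tuple(attrs, probs, n, rng)
            attempts += 1
            if t not in scores and t not in batch:
                batch.append(t)
        if not batch:
            continue
        # Steps 6-8: strength in DB and KB, then an ASSUMED surprise score.
        for t in batch:
            scores[t] = abs(db_strength(t) - kb_strength(t))
        # Step 9: raise probabilities of attributes seen in high-SS tuples,
        # lower those seen in low-SS tuples (ASSUMED multiplicative update).
        sums, counts = defaultdict(float), defaultdict(int)
        for t in batch:
            for a in t:
                sums[a] += scores[t]
                counts[a] += 1
        batch_mean = sum(scores[t] for t in batch) / len(batch)
        for a in counts:
            probs[a] *= math.exp(lr * (sums[a] / counts[a] - batch_mean))
        norm = sum(probs.values())
        probs = {a: p / norm for a, p in probs.items()}
    return scores, probs
```

A call such as `dual_mine(["a", "b", "c", "d"], db, kb, n=2, m=6)` returns the surprise score of every evaluated n-tuple together with the final sampling probabilities, where `db` and `kb` are whatever strength estimators were chosen in step 4. Sampling without replacement inside `sample_tuple` keeps the attributes within one n-tuple distinct, which the table implies but does not state.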