|
Abstract:
Data mining
aims at discovering interesting and previously unknown patterns from data
sets. Transactional mining (association rules, decision trees etc.) can be
effectively used to find non-trivial patterns in categorical and
unstructured data. For applications that have an inherent structure (e.g.,
chemical compounds, proteins) graph mining is
appropriate, because mapping the structured data into other
representations would lead to loss of structure. The need for mining
structured data has increased in the past few years. Graph mining uses
graph theory principles to perform mining. Database mining of graphs aims
at mining structured graph data stored in relational database tables using
SQL queries. Various kinds of data such as Social network data, Protein,
and other Bioinformatics data can be effectively represented as graphs.
Graph mining has been successful in the areas of counter terrorism
analysis, credit card fraud detection, drug
discovery in pharmaceutical industry etc.
The focus of this thesis is to apply relational database techniques to
accommodate all aspects of graph mining. Our primary goal is to address
scalability of graph mining to very large data sets, not currently
addressed by main memory approaches. This thesis addressed the most
general graph representation including multiple edges between any two
vertices, and cycles. This thesis extends previous work (EDB-subdue) in a
number of ways: improved substructure representation to avoid false
positives during frequency counting, unconstrained substructure expansion
with pseudo duplicate elimination for expanding multiple edges, canonical
ordering of substructures for getting true count, hierarchical reduction
for producing abstract pattern and generalization of DMDL that includes
the presence of multiple edges in a subgraph. We also extend the
substructure pruning to include ties when selecting top beam
substructures.
|