A RELATIONAL DATABASE APPROACH FOR FREQUENT SUBGRAPH MINING
Data mining aims at discovering interesting and previously unknown patterns from data sets. Further more, graph-based data mining represents a collection of techniques for mining the relational aspects of data represented as a graph. Complex relationships in data can be represented using graphs and hence graph mining is appropriate for analyzing data that is rich in structural relationships. Database mining of graphs, on the other hand, aims at directly mining graphs stored in a database using SQL queries. Several SQL-based mining algorithms have been developed successfully and their efficiency and scalability have been established. One of them is HDB-Subdue which uses SQL-based approach for mining for the best substructures in a graph using the MDL (Minimum Description Length) Principle. Determining frequent subgraphs is another type of graph mining for which only main memory algorithms exist currently. There are many applications in social networks, biology, computer networks, chemistry and the World Wide Web that require mining of frequent subgraphs. Also, data in most applications are directly stored in a database. Hence, there is a need for developing a SQL-based approach for frequent subgraph mining that scales to very large data sizes.
The focus of this thesis is to apply relational database techniques to support frequent subgraph mining over a set of graphs. Our primary goal is to address scalability of graph mining to very large data sets, not currently addressed by main memory approaches. Unlike the main memory counter parts, this thesis addresses the most general graph representation including multiple edges between any two vertices, and cycles. In the process of developing frequent subgraph mining over a set of graphs, the substructure representation of HDB-Subdue has been leveraged and extended. An algorithm is presented for frequent subgraph mining over a set of graphs. We also present an algorithm for pseudo duplicate elimination that is more efficient than the one used in the previous approach (HDB-Subdue).
This thesis also presents an efficient approach to infer structural relationships from relational data to facilitate graph mining (either the best subgraph or frequent subgraphs). The approach developed for the task infers the entity-relationship model for the database using the table instances along with primary and foreign key constraints and generates the instances of the graphs from the populated instances of the relations. The primary focus of this approach is the portability of the system across various relational database platforms and to minimize memory/space requirement. As part of this task the following issues were addressed: i) representation of an individual relations as a graph (template), ii) representation of multiple relations as a graph (based on foreign key constraints), iii) alternative representations for ii), and iv) an algorithm which uses the most appropriate sequence in which the relations are processed to generate graph instances to minimize memory/space requirements.