DIVIDE AND CONQUER APPROACH TO SCALABLE SUBSTRUCTURE DISCOVERY: PARTITIONING SCHEMES, ALGORITHMS, OPTIMIZATION AND PERFORMANCE ANALYSIS USING MAP/REDUCE PARADIGM
With the proliferation of applications rich in relationships, graphs are becoming the preferred choice of data model for representing/storing data with relationships. The notion of â€œinformation retrievalâ€ and â€œinformation discoveryâ€ in graphs has acquired a completely new connotation and are currently being applied to a wide range of contexts ranging from social networks, chemical compounds, telephone networks to transactional networks. From the point of view of an end user, one of the most important aspects on graphs is to discover recurrent patterns following user-defined parameters. Finding frequent patterns play an important role in mining associations, correlations and many other interesting aspects among data.
The evolution of web 2.0 has propelled growth of graphs at an unprecedented rate. These graphs have hundreds of millions of entities and their interactions needing tens to hundreds of GBs of storage. Data processing for finding frequent patterns on these graphs generate huge intermediate result sets. Neither these graphs nor the intermediate results can be materialized on a single machine. Even if we have access to powerful machines to scale vertically, state-of-the art methods for frequent subgraph mining requires enormous amount of computing resources causing even powerful machines to crash at times. Therefore, development of techniques that scale horizontally and effectively with increasing graph sizes is necessary. This dissertation addresses research in that direction by designing scalable graph mining techniques. Although scalability has been the main concern while developing approaches and algorithms, their correctness, elegance, and efficiency on large-scale graphs is maintained.
Until now, graph mining has been addressed using main memory, disk-based as well as database-oriented approaches to deal with progressively large-sizes of applications. This dissertation starts with the problem of substructure discovery by dividing the graph into smaller partitions and then combining the results across partitions effectively. Two algorithms, based on two partitioning strategies are introduced which cast a main memory approach (Subdue ), along with its nuances into a distributed framework. Map/Reduce has been used as a distributed paradigm here. The basics of graph mining such as systematic expansion and computing graph similarity have been elegantly translated to the Map/Reduce paradigm. The overall focus is to address scalable mining techniques on partitioned graph using a cluster of (heterogeneous) commodity machines.
In the process of mapping the mining algorithm to a distributed environment, some of the nuances of the existing algorithm are propagated to the distributed paradigm. For example, now intermediate results are generated within each partition which need to be handled across partitions. A lot of these intermediate results are duplicates when different graphs grow into multiple copies of the same bigger graph. Elimination of duplicates is critical not only for correctness but also for reducing the mining cost (i.e., performance and speedup.) The next part of the dissertation introduces a set of optimizations over the existing iterative algorithm (both in single machine and map/reduce environment.) These optimizations aim to reduce duplicate generation by introducing heuristics based on graph characteristics. Irrespective of the choice of the heuristics, these optimizations improve response time and storage cost of graph mining.
The dissertation finally continues to examining different paradigm specific costs for our partition-based graph mining. Graph partitioning is a widely researched problem where the motive is to minimize inter partition connectivity. However graph partitioning has been predominantly used for query answering purpose in graphs. This dissertation outlines the limitations of a state of the art partitioning scheme for substructure discovery and introduces two new partitioning strategies for graph mining. Mining in the presence of partitions incurs computation cost in each partition and network cost of grouping results across partitions. We analyze the cost of partitionbased graph mining for our partitioning schemes thereby leading to evaluating the choice of initial partitioning for substructure discovery. Usability of these partitioning techniques for three different classes of graph mining problems (non iterative, fixed cost iterative and variable cost iterative) have been provided to postulate our observation that one partitioning scheme does not fit all. Finally, the dissertation elaborates the applicability of our partition-based techniques on a different paradigm (Spark) to justify the benefit of our algorithm design over any distributed paradigm.
The effectiveness of the techniques proposed in this dissertation are validated with extensive experimental analysis on 2 real world graphs (Live Journal  and Orkut ) and 2 synthetic graphs and their variations(generated using the Subgen  and RMAT  artificial graph generators.)