Skip to content. Skip to main navigation.


Graph Mining and Querying

This research explores two things: i) Efficient algorithms for substructure discovery in graphs (forests) and ii) querying (not searching) graphs using a query processing engine. For both we are interested in scalability.

Substructure discovery: We have developed efficient algorithms using a distributed framework (map/reduce) for finding the substructure that can compress the graph best in a very large directed labeled graph. The focus here is to develop graph analysis algorithms/techniques using different paradigms for identifying repetitive subgraphs that satisfy certain characteristics. The issues addressed as part of this work include: i) developing two partitioned algorithms for frequent substructure generation using the map/reduce paradigm by formulating appropriate key values to facilitate duplicate elimination and isomorphism detection, ii) designing partition management strategies for evaluating their impact, iii) Component cost analysis to identify places for improving the efficiency of the algorithm.  Our preliminary work on directed labeled graphs addresses scalability along with correctness of partitioned approach. It focuses on  i/o usage and response time behavior in partitioned graphs. We have shown its correctness analytically -- both with and without duplicate elimination. The algorithm has also been verified empirically by embedding known substructures of different sizes and of different characteristics (small, large, sparse, and dense) in a large synthetically generated graph. Scalability and effectiveness have been established on large real datasets like Twitter, Orkut and LiveJournal with Millions of nodes and hundreds Millions of edges.

Graph Querying: Although query optimization has received a lot of atention over the pas few decades, the same thing cannot be said about graph querying. Most of the work focuses on search and that too for very small subgraphs. It is still not possible to give a subgraph as input with conditions and wild cards as we do in a search and find all (or top-k) matching subgraphs in a large graph. This effort addresses this using a cost-based optimizer cutomized for graphs. As the sizes of graphs cab vevery large, scalability is also an issue. We use a modified version of the Subdue substructure discovery algorithm to perform focused expansion in the local vicinity to make it efficient. Both Input and output are graphs and not a triplet or so making it meaningful for user interaction. We have used Map/reduce that evaluate a query correctly using partitins and Map/Reduce architecture.

Motivation

As a general data structure, graphs have become increasingly important in modeling structural information and their relationships as is demonstrated by applications including social networks (e.g., LnikedIN, Twitter, Facebook, etc.), chemical informatics, bio-informatics, computer vision, video indexing, text retrieval and web analysis. With the rapid development of social networks, graph databases such as Freebase, and graphs derived from social networks as well as Wiki entity graphs have become too large. Even non main-memory based approaches (disk-based, database-oriented, and traditional partitioning) cannot handle these graph sizes in a meaningful way. Thus the onus is now on developing scalable algorithms/approaches to process ever increasing graph sizes with meaningful response times.

People

Department of Computer Science and Engineering