MR QP: A SCALABLE APPROACH TO QUERY PROCESSING ON ARBITRARY-SIZE GRAPHS USING THE MAP/REDUCE FRAMEWORK
The utility and widespread use of Relational Database Management Systems
(RDBMSs) come not only from their simple, easy-to-understand data model (a relation, or set) but mainly from the ability to write non-procedural queries that are
optimized by the system. Queries produce exact answers that match the contents
of the database. Query processing in RDBMSs has been researched for more than
four decades and includes extensions to more complex analysis on data warehouses. In
contrast, search has not been addressed by RDBMSs.
As other data models (key-value stores, column stores, and graphs,
to name a few) become popular for matching the characteristics of data sets, query processing and optimization are again becoming important. The
approaches used in RDBMSs, such as cost-based, I/O-focused optimization, may not apply
in the same way to new models and queries. Hence, new approaches need to be developed that suit the data model used and the expressiveness of the queries
to be supported.
This thesis addresses query processing on large graphs (or forests) and develops algorithms for query processing as well as heuristics for improving
response time using graph characteristics. Although search (unlike in RDBMSs) has received a lot of attention for graphs, query processing, in contrast, has received very
little. With the advent of large social networks and other large graphs
(e.g., Freebase, knowledge and entity graphs), querying to understand a data set
and retrieve relevant/exact information is becoming critical.
This thesis builds on previous work at the Information Technology Laboratory (IT Lab) at UTA to scale query processing to arbitrary-size graphs (or forests)
and to exploit parallelism as much as possible. Partitioning (a form of divide and
conquer) and Map/Reduce (for parallel processing) are used as the basic ingredients for
scalability. Partitioning a graph for query processing while computing all answers poses
a number of challenges: i) choosing a partitioning scheme, ii) scheduling, or choosing which partition or partitions to process at a time, iii) developing heuristics that exploit query and graph characteristics to reduce total
response time, and iv) importantly,
ensuring correctness of results.
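The partition-then-Map/Reduce flow described above can be illustrated in miniature. The following is a toy sketch, not the thesis's implementation: the graph, the label-match "query", and all function names are hypothetical, and a real partitioning scheme would be far more sophisticated than round-robin splitting.

```python
from functools import reduce

# Toy edge-labeled graph as (src, label, dst) triples (illustrative data only).
edges = [("a", "knows", "b"), ("b", "knows", "c"), ("c", "likes", "a"),
         ("d", "knows", "e"), ("e", "likes", "d")]

def partition(edge_list, k):
    """Split the edge list into k partitions (a stand-in for a real
    graph-partitioning scheme, e.g., one tuned for substructure discovery)."""
    return [edge_list[i::k] for i in range(k)]

def map_phase(part, query_label):
    """Map: each partition independently emits its edges matching the query."""
    return [e for e in part if e[1] == query_label]

def reduce_phase(acc, partial):
    """Reduce: union the partial answers produced by the partitions."""
    return acc + partial

parts = partition(edges, 2)
answers = reduce(reduce_phase, (map_phase(p, "knows") for p in parts), [])
```

Because each partition is processed independently in the map phase, the number of partitions scheduled at a time directly controls the degree of parallelism, which is the knob the thesis tunes against available resources.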
This thesis addresses all of the above challenges using the Map/Reduce framework.
The choice of Map/Reduce allows us to create partitions based on available
resources and to tune parallelism through the number of partitions scheduled at a
time. We use a partitioning strategy that has been shown to work well for substructure
discovery, and we develop a number of heuristics based on query and graph
characteristics.
characteristics. The query itself is expressed as a graph without having to cast in some
other language. Relational comparison operators, Boolean operators, wild cards, and
union queries are supported. There is no restriction on node and edge labels, and
uniquely labeled multiple edges are supported. Extensive experimental analysis of the
approach (partitioning sizes, algorithm, and heuristics) using large data sets (real-world and synthetic) are shown for speedup, scalability, and efficacy of the heuristics
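One way to picture a query graph whose labels admit comparisons, Boolean combinations, and wild cards is to treat each label slot as a predicate rather than a plain string. The encoding below is purely illustrative (the predicate names and the triple layout are assumptions, not the thesis's actual query representation):

```python
# Label predicates for a query-graph edge (src_pred, edge_pred, dst_pred).

def wildcard(_label):
    return True                      # wild card: matches any label

def eq(x):
    return lambda label: label == x  # exact label match

def gt(x):
    # relational comparison on numeric labels
    return lambda label: isinstance(label, (int, float)) and label > x

def any_of(*preds):
    # Boolean OR over predicates (union-style condition)
    return lambda label: any(p(label) for p in preds)

# Query edge: from a node labeled "person", along an edge whose weight
# exceeds 5 OR whose label is "knows", to any node at all.
query_edge = (eq("person"), any_of(gt(5), eq("knows")), wildcard)
src_pred, edge_pred, dst_pred = query_edge
```

Matching a data edge against this query edge then reduces to evaluating the three predicates on its source label, edge label, and destination label, which keeps the query a graph rather than a statement in a separate query language.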