Harshit Ashokkumar Modi

Mr. Harshit Ashokkumar Modi

Email id: harshitashokkum[dot]modi[at]mavs[dot]uta[dot]edu

Profile:

Graduation Year: May 2020

Dissertation:

MS Thesis

MR QP: A SCALABLE APPROACH TO QUERY PROCESSING ON ARBITRARY-SIZE GRAPHS USING THE MAP/REDUCE FRAMEWORK

Abstract

The utility and widespread use of Relational Database Management Systems (RDBMSs) comes not only from its simple, easy-to-understand data model (a relation or a set) but mainly from the ability to write non-procedural queries and their optimization by the system. Queries produce exact answers that match the contents of the database. Query processing of RDBMSs has been researched for more than 4 decades and includes extensions to more complex analysis on data warehouses. In contrast, search has not been addressed by RDBMSs.

As the use of other other data types (key-value store, column-store, and graphs to name a few) are becoming popular for modeling to match the data set characteristics, query processing and optimization are becoming important again. The approaches used in RDBMSs, such as cost-based, I/O focused may not be applicable in the same way to new models and queries. Hence, new approaches need to be developed that are suited for the data model used and the expressiveness of the queries to be supported.

This thesis addresses query processing of large graphs (or forest) and develops algorithms for query processing as well as develops heuristics for improving the response time using graph characteristics. Although search (unlike RDBMS) has received a lot of attention for graphs, query processing, in contrast, has received very little attention. With the advent of large social networks and other large graphs (e.g., freebase, knowledge and entity graphs), querying to understand the data set and retrieve relevant/exact information becoming critical.

This thesis builds on the previous work at the Information Technology laboratory at UTA (IT Lab) to scale query processing to arbitrary-size graphs (or forests) and to exploit parallelism as much as possible. Partitioning (a form of divide and conquer) and Map/Reduce (for parallel processing) are used as basic ingredients for scalability. Partitioning a graph for query processing and computing all answers poses a number of challenges: i) partitioning schemes, ii) scheduling or choosing which partition or partitions to schedule for processing, iii) developing heuristics for reducing the total response time exploiting query and graph characteristics, and iv) importantly, correctness of results.

This thesis address all of the above challenges using the map/reduce framework. The choice of map/reduce framework allows us to make partitions based on available resources and optimize parallelism based on the number of partitions to schedule at a time. We use a partitioning strategy that has been shown to be good for substructure discovery. We develop a number of heuristics that are based on query and graph characteristics. The query itself is expressed as a graph without having to cast in some other language. Relational comparison operators, Boolean operators, wild cards, and union queries are supported. There is no restriction on node and edge labels, and uniquely labeled multiple edges are supported. Extensive experimental analysis of the approach (partitioning sizes, algorithm, and heuristics) using large data sets (real-world and synthetic) are shown for speedup, scalability, and efficacy of the heuristics proposed.

Department of Computer Science and Engineering