Skip to content. Skip to main navigation.
The limitations of the traditional Database Management System to process continuous, unbounded and real time data streams for applications such as tickers, network monitoring, traffic management, and sensor monitoring have necessitated the development of MavStream. It is a Data Stream Management System (DSMS) developed for processing stream data with bursty arrival rates and need for real-time performance requirements.
This thesis investigates the design and implementation of scheduling strategies for stream processing. The scheduling strategies implemented in this thesis minimize the tuple latency enabling the system to satisfy quality of service (QoS) requirements. Several scheduling strategies have been integrated into the architecture of MavStream. We first introduce the path capacity strategy with the goal of minimization of the tuple latency. As an improvement, a segment scheduling strategy and its simplified version are implemented which aim at minimization of total memory requirement and throughput. Extensive set of experiments have been designed and executed to evaluate the effectiveness of the proposed scheduling strategies. In addition, a feeder module was developed to simulate real time streams. Extensive experiments have been performed to measure tuple latency, memory usage and their effect on varying data rate and window widths on input streams with varying characteristics.
This thesis also presents an efficient approach to infer structural relationships from relational data to facilitate graph mining (either the best subgraph or frequent subgraphs). The approach developed for the task infers the entity-relationship model for the database using the table instances along with primary and foreign key constraints and generates the instances of the graphs from the populated instances of the relations. The primary focus of this approach is the portability of the system across various relational database platforms and to minimize memory/space requirement. As part of this task the following issues were addressed: i) representation of an individual relations as a graph (template), ii) representation of multiple relations as a graph (based on foreign key constraints), iii) alternative representations for ii), and iv) an algorithm which uses the most appropriate sequence in which the relations are processed to generate graph instances to minimize memory/space requirements.