INCREMENTAL RETRIEVAL AND RANKING OF COMPLEX PATTERNS FROM TEXT REPOSITIORIES
As the volume of information accessible via the electronic medium (such as the Internet) is staggeringly large and growing rapidly, users have to sift through vast reservoirs of information to retrieve relevant data of their choice. It has been estimated that, the Internet consists of 2.5 billion unique, publicly accessible web-pages and this figure
is growing at an alarming rate of 7.3 million pages per day1. Currently, the only way to wade through such colossal information is by using search engines (such as Google, Live, Yahoo, etc.). Although the popularity of search engines has increased manifold due to - simplicity of usage, speed of retrieval and amount of results generated, their ability to intelligently retrieve relevant information is significantly hampered due to over-reliance
on Boolean operators for data retrieval.
To address these issues, we have designed a framework (made up of two interdependent yet distinct systems - InfoFilter and InfoSearch) based on an expressive pattern specification language and a set of novel detection algorithms that handle streaming as well as static data. InfoFilter handles pattern detection for streaming data (news feeds, IP packets, etc.) where freshness of search results is paramount. However, in the case of data that resides in the form of large yet static repositories, and when the freshness of data is not critical, the InfoSearch system handles pattern detection using a pre-computed index.
The initial design of InfoSearch, for complex pattern detection, focused on fetching all matching occurrences of the pattern in the data repositories. It further processed all the tuples of the operands that constituted the pattern. In this approach, all answers are generated even if the user is interested in a small number of answers. Generating all
answers when the request is only for “k” answers is inefficient in terms of processing time, memory utilization and number of computations. Moreover, the results are generated in the order in which they are detected, thus ignoring the relevancy of results with respect to user preferences. The user may be interested in ranked results which can indicate their quality. In order to address these problems, an incremental approach for complex pattern
detection is needed. Moreover, in order to generate results based on the relevancy rather than the order of detection, ranking mechanisms for appropriately filtering the results also needs to be addressed.
In this thesis, we investigate several approaches for incremental detection of complex patterns, retrieval, and ranking of results. We also investigate the need for novel data structures as well as the types of structural meta-data to be associated with the data stored in the index for ranking the fetched results. We propose a novel ranking
algorithm that utilizes the structural boundaries of the data to rank results based on the location and occurrence of a complex pattern in a document. We also present algorithms for each operator encountered in a pattern, that are based on ensuring optimal utilization of computational and memory resources in the least possible time. Extensive
experiments have been performed to evaluate the scalability, performance, and memory usage of these algorithms on a number of patterns.