|
Abstract:
As the volume of information accessible via
the electronic medium (such as the Internet) is staggeringly large and
growing rapidly, users have to sift through vast reservoirs of
information to retrieve relevant data of their choice. It has been
estimated that, the Internet consists of 2.5 billion unique, publicly
accessible web-pages and this figure
is growing at an alarming rate of 7.3 million pages per day1. Currently,
the only way to wade through such colossal information is by using
search engines (such as Google, Live, Yahoo, etc.). Although the
popularity of search engines has increased manifold due to – simplicity
of usage, speed of retrieval and amount of results generated, their
ability to intelligently retrieve relevant information is significantly
hampered due to over-reliance
on Boolean operators for data retrieval.
To address these issues, we have designed a
framework (made up of two interdependent yet distinct systems –
InfoFilter and InfoSearch) based on an expressive pattern specification
language and a set of novel detection algorithms that handle streaming
as well as static data. InfoFilter handles pattern detection for
streaming data (news feeds, IP packets, etc.) where freshness of search
results is paramount. However, in the case of data that resides in the
form of large yet static repositories, and when the freshness of data is
not critical, the InfoSearch system handles pattern detection using a
pre-computed index.
The initial design of InfoSearch, for complex pattern detection, focused
on fetching all matching occurrences of the pattern in the data
repositories. It further processed all the tuples of the operands that
constituted the pattern. In this approach, all answers are generated
even if the user is interested in a small number of answers. Generating
all
answers when the request is only for “k” answers is inefficient in terms
of processing time, memory utilization and number of computations.
Moreover, the results are generated in the order in which they are
detected, thus ignoring the relevancy of results with respect to user
preferences. The user may be interested in ranked results which can
indicate their quality. In order to address these problems, an
incremental approach for complex pattern
detection is needed. Moreover, in order to generate results based on the
relevancy rather than the order of detection, ranking mechanisms for
appropriately filtering the results also needs to be addressed.
In this thesis, we investigate several
approaches for incremental detection of complex patterns, retrieval, and
ranking of results. We also investigate the need for novel data
structures as well as the types of structural meta-data to be associated
with the data stored in the index for ranking the fetched results. We
propose a novel ranking
algorithm that utilizes the structural boundaries of the data to rank
results based on the location and occurrence of a complex pattern in a
document. We also present algorithms for each operator encountered in a
pattern, that are based on ensuring optimal utilization of computational
and memory resources in the least possible time. Extensive
experiments have been performed to evaluate the scalability,
performance, and memory usage of these algorithms on a number of
patterns.
|