|
Abstract:
The colossal amount of
information available online has resulted in overloading users who need to
navigate this information for their routine requirements. Although search
engines have been effective in reducing this information overload, they
support only keyword searches and queries that use Boolean operators.
There are certain application domains where more expressive ways of
searching are necessary. Consider searching a full-text patent database
for documents containing more than n occurrences of a particular pattern,
or
for documents that have a particular pattern followed by another pattern
within a specified distance. Such complex patterns involving pattern
frequency and sequence of patterns, as well as patterns involving
proximity, structural boundaries and synonyms are not supported by current
search engines. Users searching documents based on focused, precise
semantics may need to specify such patterns. This calls for a document
retrieval system (pattern specification language as well as an efficient
detection engine) that allows
such expressive patterns.
Presently, we have an expressive pattern specification language and
detection algorithms that work on stream data. That is, data coming as a
stream of text (or converted to a stream of text) is fed into the Pattern
Detection Graph (PDG) using a dataflow approach. This is suitable and
required for searching patterns over frequently updated data sources, such
as news feeds, IP packets, etc., in order to ensure freshness of the
search results. However, this approach is inefficient and unwarranted for
searching patterns over data sources that are extremely large or
relatively static, or where freshness is not critical (e.g., Web
repositories). For such large data sources, streaming approach becomes
impractical. One alternative is to use a pre-computed (or computed
off-line) index over the stored documents to efficiently detect expressive
queries.
This thesis investigates the type of information needed as part of the
index to detect patterns that include frequency, proximity and sequence of
patterns, phrases, synonyms, etc. We present algorithms for each operator
that uses the index and detects the pattern. The thesis also deals with
grouping of common subexpressions and their detection in an efficient
manner. We describe InfoSearch, a system for accepting complex patterns
and retrieving the documents that contain the patterns, by using an
inverted index over the document collection. InfoSearch allows
specification of patterns that include word frequency, proximity, sequence
or structural boundaries using a Pattern Specification Language (PSL). For
searching these patterns and retrieving the matching documents,
InfoSearch uses Pattern Detection Graphs to filter the results returned by
the index. |