INFOSEARCH: A SYSTEM FOR SEARCHING AND RETRIEVING DOCUMENTS USING COMPLEX QUERIES
The colossal amount of information available online has resulted in overloading users who need to navigate this information for their routine requirements. Although search engines have been effective in reducing this information overload, they support only keyword searches and queries that use Boolean operators. There are certain application domains where more expressive ways of searching are necessary. Consider searching a full-text patent database for documents containing more than n occurrences of a particular pattern, or
for documents that have a particular pattern followed by another pattern within a specified distance. Such complex patterns involving pattern frequency and sequence of patterns, as well as patterns involving proximity, structural boundaries and synonyms are not supported by current search engines. Users searching documents based on focused, precise semantics may need to specify such patterns. This calls for a document retrieval system (pattern specification language as well as an efficient detection engine) that allows
such expressive patterns.
Presently, we have an expressive pattern specification language and detection algorithms that work on stream data. That is, data coming as a stream of text (or converted to a stream of text) is fed into the Pattern Detection Graph (PDG) using a dataflow approach. This is suitable and required for searching patterns over frequently updated data sources, such as news feeds, IP packets, etc., in order to ensure freshness of the search results. However, this approach is inefficient and unwarranted for searching patterns over data sources that are extremely large or relatively static, or where freshness is not critical (e.g., Web repositories). For such large data sources, streaming approach becomes impractical. One alternative is to use a pre-computed (or computed off-line) index over the stored documents to efficiently detect expressive queries.
This thesis investigates the type of information needed as part of the index to detect patterns that include frequency, proximity and sequence of patterns, phrases, synonyms, etc. We present algorithms for each operator that uses the index and detects the pattern. The thesis also deals with grouping of common subexpressions and their detection in an efficient manner. We describe InfoSearch, a system for accepting complex patterns and retrieving the documents that contain the patterns, by using an inverted index over the document collection. InfoSearch allows specification of patterns that include word frequency, proximity, sequence or structural boundaries using a Pattern Specification Language (PSL). For searching these patterns and retrieving the matching documents, InfoSearch uses Pattern Detection Graphs to filter the results returned by the index.