INFOSIFT: ADAPTING GRAPH MINING TECHNIQUES FOR DOCUMENT CLASSIFICATION
Text classification is the problem of assigning pre-defined class labels to incoming, unclassified documents. The class labels are defined based on a sample of pre-classified documents, which are used as a training corpus. A number of machine learning, probabilistic, and information retrieval based approaches have been proposed for text classification, and the same approaches have been applied to email and web page classification. Most of these techniques rely on extracting keywords or highly frequent words; they do not extract groups of related terms that co-occur, and are therefore unable to capture relationships between words. These patterns of term association are important, as there is reason to believe that documents within a class adhere to a set of patterns, and that these patterns closely correspond to, and are derived from, the documents of that class.
What is needed is a classification system that determines the patterns of term association that emerge from the documents of a class and uses these patterns to classify similar documents. This thesis proposes a novel graph-based mining approach for document classification. Our approach is based on the premise that representative (common and recurring) structures or patterns can be extracted from a pre-classified document class and used effectively for classifying incoming documents. To the best of our knowledge, there is no existing work in the area of text, email or web page classification based on pattern inference and the utilization of the learned patterns for classification. A number of factors that influence representative structure extraction and classification are analyzed conceptually and validated experimentally. In our approach, the notion of inexact graph match is leveraged for deriving structures that provide coverage for characterizing the contents of a document class. The ability to classify based on similar, rather than exact, occurrences is singularly important in most classification tasks, as no two samples are exactly the same. Extensive experimentation validates the selection of parameters and the effectiveness of our approach for text, email and web page classification.
The novel idea proposed in this thesis aims to establish the groundwork for adapting graph mining techniques to various classification problems, not necessarily limited to text.