M-InfoSift: A Graph-based Approach For Multiclass Document Classification
As the volume of data added to the Internet grows daily, managing it has become an unavoidable problem. Document classification has been examined and explored as a technique for organizing and managing vast repositories of electronic documents such as emails, text, and web pages. Over the past decade, approaches drawn from machine learning, data mining, information retrieval, and other areas have been proposed for classifying electronic documents. While a majority of these techniques rely on extracting high-frequency keywords, they do not extract groups of related keywords. They also fail to capture the salient relationships among keywords and their inherent structure, which can be decisive in classifying specific types of documents (e.g., web pages). To this end, InfoSift was proposed, which applies graph mining techniques to document classification using a supervised learning model. Perhaps for the first time, it showed how the structure within a document can be used for classification, and that the techniques can be applied to different types of documents, such as text, email, and web pages. The framework focused on identifying representative substructures using a graph mining approach and on classifying an incoming unknown document into a folder using a ranking mechanism.
However, in the real world, documents are categorized into multiple folders based on varied characteristics (such as multiple folders for different emails or multiple classes for documents). Existing approaches are based on the occurrence of words and have not used the structural relationships within a document for classification. Adopting these approaches within the InfoSift framework does not lead to a feasible solution, because InfoSift considers groups of keywords and their relationships with other words. To bridge this gap between the strengths of InfoSift and the demands of multi-folder classification, a different technique needs to be investigated.
Hence, in this thesis, we introduce a new approach that extends InfoSift to support multiple categories (folders). We present a ranking technique that orders the representative (common and recurring) structures generated from pre-classified documents in order to categorize new incoming documents. The approach is based on a global ranking model that incorporates several factors relevant to document classification and overcomes numerous problems that arise when existing approaches are used for multiple-folder classification in the InfoSift system. A number of parameters that influence the generation of representative substructures in single-folder classification are analyzed, re-examined, and adapted to multiple folders. Additional graph representations have been analyzed and their use has been validated experimentally. Exhaustive experiments substantiating the selection of parameters for classifying unknown documents into multiple folders have been conducted for text, emails, and web pages.
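To make the idea of a global ranking across folders concrete, the following is a minimal illustrative sketch, not the thesis implementation. It assumes each representative substructure is a set of keyword-pair edges tagged with its source folder and carries a single numeric `score`; the names `Substructure`, `global_rank`, and `classify`, the score values, and the use of plain edge-set containment in place of real subgraph matching are all hypothetical simplifications introduced here.

```python
from dataclasses import dataclass

# An edge links two related keywords, e.g. ("team", "win").
Edge = tuple[str, str]


@dataclass(frozen=True)
class Substructure:
    folder: str              # folder whose documents produced this substructure
    edges: frozenset[Edge]   # simplified stand-in for a mined graph substructure
    score: float             # hypothetical rank score; how it is computed is not modeled here


def global_rank(substructures):
    """Order substructures from ALL folders in one list, best first.

    Ties on score are broken by size (larger first), then folder name,
    purely to keep this sketch deterministic.
    """
    return sorted(substructures, key=lambda s: (-s.score, -len(s.edges), s.folder))


def classify(doc_edges, ranked):
    """Return the folder of the highest-ranked substructure found in the document.

    Containment of edge sets is used as a simplification of subgraph matching.
    Returns None when no substructure matches.
    """
    for sub in ranked:
        if sub.edges <= doc_edges:
            return sub.folder
    return None


# Hypothetical substructures mined from two pre-classified folders.
subs = [
    Substructure("sports", frozenset({("team", "win")}), 0.6),
    Substructure("finance", frozenset({("stock", "price"), ("price", "rise")}), 0.9),
    Substructure("sports", frozenset({("team", "win"), ("win", "match")}), 0.8),
]
ranked = global_rank(subs)

# An incoming document, already converted to keyword-pair edges.
doc = frozenset({("team", "win"), ("win", "match"), ("stock", "price")})
result = classify(doc, ranked)
```

In this sketch the top-ranked finance substructure does not match because the document contains only one of its two edges, so the document falls through to the next entry in the single cross-folder ordering. Ranking all folders in one list, rather than ranking within each folder separately, is what lets substructures from different folders compete directly.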