Dr. Aditya Telang

Email id: aditya[dot]telang@mavs[dot]uta[dot]edu

Graduation: August 2011

Dissertation:

A HOLISTIC, SIMILARITY-BASED APPROACH FOR PERSONALIZED RANKING IN WEB DATABASES

Abstract

With the advent of the Web, the notion of information retrieval has acquired a completely new connotation and currently encompasses several disciplines, ranging from traditional forms of text and data retrieval in unstructured and structured repositories to retrieval of static and dynamic information from the contents of the surface and deep Web. From the point of view of the end user, a common thread that binds all these areas is the need to support appropriate alternatives for allowing users to specify their intent (i.e., the user input) and for displaying the resulting output ranked in an order relevant to the users.

In the context of specifying a user’s intent, the paradigms of querying and searching have served well as the staple mechanisms of information retrieval over structured and unstructured repositories. Processing queries over known, structured repositories (e.g., traditional and Web databases) is well understood, and search has become ubiquitous for unstructured repositories (e.g., document collections and the surface Web). Furthermore, searching structured repositories has also been explored to a limited extent. However, there is not much work on querying unstructured sources, which, we believe, is the next step in performing focused retrievals.

Correspondingly, one of the contributions of this dissertation is a novel semantics-guided approach, termed Query-By-Keywords (or QBK), to generate queries from search-like inputs for unstructured repositories. Instead of burdening the user with schema details, this approach utilizes pre-discovered semantic information, in the form of taxonomies, context-based relationships between keywords, and attribute and operator compatibility amongst Web sources, to generate query skeletons that are subsequently transformed into queries. Additionally, progressive feedback from users is used to further improve the accuracy of these query skeletons. The overall focus, thus, is to propose an alternative paradigm for the generation of queries on unstructured repositories using as little information from the user as possible.

Irrespective of the template for intent specification (i.e., either a search or a query request), the number of results typically returned in response to such intents is often extremely large. This is particularly true in the context of the deep Web, where a large number of results are returned for queries on Web databases, and choosing the most useful answer(s) becomes a tedious and time-consuming task. Most of the time the user is not interested in all answers; instead, s/he would prefer that the results be ranked based on her/his interests, characteristics, and past usage, with the most relevant ones displayed before the rest. Furthermore, these preferences vary as users and queries change.

Accordingly, in this dissertation, we propose a novel similarity-based framework for supporting user- and query-dependent ranking of query results in Web databases. This framework is based on the intuition that, for the results of a given query, similar users display comparable ranking preferences, and that a user displays similar ranking preferences over the results of analogous queries. Fittingly, this framework is supported by two novel and comprehensive models, 1) Query Similarity and 2) User Similarity, proposed as part of this work. In addition, this ranking framework relies on the availability of a small yet representative set of ranking functions collected across several user-query pairs, in order to rank the results of a given user query at runtime. Accordingly, we address the subsequent problem, i.e., establishing a relevant workload of ranking functions that assists the similarity model in the best possible way to achieve the goal of user- and query-dependent ranking. Furthermore, we advance a novel probabilistic learning model that infers individual ranking functions (for this workload) based on the implicit browsing behavior displayed by users. We establish the effectiveness of this holistic ranking framework by experimentally evaluating it on Google Base’s vehicle and real estate databases with the aid of Amazon’s Mechanical Turk users.

Department of Computer Science and Engineering