Dr. Yuanzhe Cai

Email id: yuanzhe[dot]cai@gmail[dot]com

Graduation Year: May 2014


PhD Thesis:

Search has become ubiquitous mainly because it is simple to use: it makes information gathering relatively easy and requires no learning curve. Question answering services and communities (termed CQA services or Q/A networks; e.g., Yahoo! Answers, Stack Overflow) have emerged in the last decade as yet another way to search. Here the intent is to obtain high-quality answers (from users with different levels of expertise) when a question is posed, or to retrieve answers from an archived Q/A repository. To use these services (and archives) effectively as an alternative to search, we need a framework, including techniques and algorithms, for identifying both the quality of answers and the expertise of the users answering questions. Determining answer quality is critical for archived data sets, because it is the basis for assessing their value as stored repositories for answering questions. Determining the expertise of users is more challenging, and it is essential for routing questions in real time, which matters to both paid and free Q/A services. This problem requires an understanding of the characteristics of interactions in this domain as well as the structure of the graphs derived from these interactions. These graphs (termed Ask-Answer graphs in this thesis) differ in subtle ways from web reference graphs, paper citation graphs, and others. Hence it is imperative to design effective and efficient ranking approaches for Q/A network data sets to help users search for and retrieve meaningful information.

The objective of this dissertation is to push the state of the art in the analysis of Q/A social network data sets in terms of theory, semantics, techniques and algorithms, and experimental analysis of real-world social interactions. We leverage “participant characteristics” because the community is dynamic: participants change over time and answer questions at will. Participant behavior therefore appears to be important for inferring some of the characteristics of their interactions.

First, our research has determined that temporal features make a significant difference in predicting the quality of answers, because the answerer’s (or participant’s) current behavior plays an important role in identifying the quality of an answer. We present learning-to-rank approaches for predicting answer quality and establish their superiority over the traditional classification approaches currently in use. Second, we discuss the difference between ask-answer graphs and web reference graphs and propose the ExpertRank framework, along with several approaches that use domain information, to predict the expertise level of users by considering both answer quality and graph structure. Third, current approaches infer expertise using traditional link-based methods such as PageRank or HITS. However, these approaches only identify global experts, termed generalists, in CQA services. A generalist may not be the best person to answer an arbitrary question. If a question contains several important concepts, it is more meaningful for a person who is an expert in those concepts to answer it. This thesis proposes techniques to identify experts at the concept level as a basic building block. This is critical because the derived concept rank can serve as a basis for inferring expertise at different levels. For example, a question can be viewed as a collection of a few important concepts, and we use the ConceptRank framework to identify specialists for answering that question. This can be generalized using a concept taxonomy for classifying topics, areas, and other larger concepts using the primary concept of coverage.

Ranking is central to the problems addressed in this thesis. Hence, we analyze the motivation behind traditional link-based approaches, such as HITS. We argue that these link-based approaches correspond to statistical information representing the opinion of web writers about these web resources. In contrast, we address the ranking problem in web and social networks by using the ILP (in-link probability) and OLP (out-link probability) of a graph, which helps in understanding the HITS approach in contexts other than web graphs. We have further established that these two probabilities correspond to the hub and authority vectors of the HITS approach. We have used standard Non-negative Matrix Factorization (NMF) to calculate the two probabilities for each node. Our experimental results and theoretical analysis validate the relationship between the ILOD approach and the HITS algorithm.
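To make the hub and authority vectors mentioned above concrete, the following is a minimal sketch of the standard HITS power iteration, not the thesis's ILP/OLP or NMF formulation. The function names `hits` and `normalize`, the sample user names, and the choice to direct each edge from asker to answerer are all assumptions made for illustration.

```python
from math import sqrt


def normalize(scores):
    """Scale scores to unit Euclidean length (no-op for an all-zero vector)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=50):
    """Standard HITS by power iteration on (source, target) edge pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of nodes linking in.
        new_auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum the updated authority scores of nodes linked to.
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical ask-answer graph; ASSUMPTION: an edge points from the
# asker to the user who answered the question.
edges = [
    ("alice", "carol"),
    ("bob", "carol"),
    ("alice", "dave"),
    ("erin", "carol"),
]
hub, auth = hits(edges)
top_answerer = max(auth, key=auth.get)
top_asker = max(hub, key=hub.get)
```

Under this edge convention, a high authority score marks a user who answers questions from many active askers, which is the kind of global "generalist" ranking that the abstract contrasts with concept-level expertise.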