ANALYSIS OF COMPLEX DATA SETS USING MULTILAYER NETWORKS: A DECOUPLING-BASED FRAMEWORK
We are on the cusp of analyzing a wide variety of data collected in every walk of life: social, biological, health-care, corporate, and climate, to name a few. These data sets are becoming more diverse and complex, in addition to growing in size.
Some of the complexity comes from interacting entities that arise in diverse disciplines, such as epidemiology, marketing strategy, social sciences, cybersecurity and drug design.
As data sets become more diverse and complex, we need appropriate models and accompanying analysis techniques that are also efficient. The ability to analyze large, complex, and disparate data for a broad set of analysis objectives differentiates big data analytics from mining, which is narrower in scope in terms of both data and analysis. For big data analytics, flexibility of analysis (as distinct from scalability) is important. Efficiency is also important because of the large number of analysis needs.
Elegantly modeling and efficiently analyzing these complex data sets to obtain actionable knowledge presents several challenges. Traditional approaches, such as using a single graph (a single-layer network, or monoplex), may not be sufficient or appropriate in terms of modeling and computational flexibility. Recently, multilayer networks (MLNs) have been proposed as an alternative for modeling such data elegantly.
In this thesis, we first discuss different types of multilayer networks -- homogeneous, heterogeneous, and hybrid -- from a modeling perspective. The benefits of this modeling, in terms of ease, understanding, and usage, are highlighted. Although big data analysis has prompted many new data models, little attention has been paid to deriving those models from requirements. Going straight from application requirements to a data model and its analysis, especially for complex data sets, is likely to be difficult, error-prone, and hard to extend. Hence, for data models used in big data analysis, such as multilayer networks (MLNs), there is a need to transform the requirements algorithmically using a systematic modeling approach, such as EER (Enhanced Entity Relationship). Here, we start with the application requirements of complex data sets, including analysis objectives, and show how the EER approach can be leveraged to generate the MLN model and the appropriate analysis expressions on it.
However, this model brings with it a new set of challenges for its analysis, in terms of both algorithms and efficiency. Few algorithms are available in the literature for analyzing an MLN as a whole, and applying currently available techniques to a transformed version of the MLN loses information about structure and semantics. Our proposed approach is to develop an analysis framework that does not transform the MLN model, so that structure and semantics are preserved. The general framework proposed and developed in this thesis is termed network decoupling. This framework is intended to benefit all aggregate computations, although this thesis focuses on two of them. The essence of the approach is to analyze each network layer individually and then use a composition function to aggregate the individual layer results. This thesis demonstrates the network decoupling approach and its merits for widely used aggregate graph analyses, namely community and centrality detection. For both community and centrality detection on MLNs using Boolean operators, efficient composition functions and algorithms have been developed and validated for homogeneous MLNs. To demonstrate the effectiveness of the framework, this thesis also proposes a new community definition for heterogeneous MLNs within the same framework. This definition not only uses the decoupling approach, based on bipartite graph matching, but also preserves structure and semantics. Structure and semantics preservation for MLNs (both homogeneous and heterogeneous) is crucial for drill-down analysis, which is needed to understand and interpret results clearly. Our definition supports a family of community detection algorithms for heterogeneous MLNs, which is very useful for matching analysis objectives. Further, for broader analysis, we introduce several weight metrics that bring individual layer community characteristics into the MLN community. Essentially, this results in an extensible family of community computations.
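As an illustration only, the decoupling idea described above (analyze each layer on its own, then compose the per-layer results) can be sketched in Python. The sketch assumes a homogeneous MLN given as a shared node set plus one edge list per layer. It uses connected components as a stand-in for a real per-layer community detection algorithm, and the AND composition shown (intersecting the node sets of communities from different layers) is a simplified, hypothetical composition function, not the one developed in this thesis. All function and variable names are illustrative.

```python
from itertools import product


def connected_components(nodes, edges):
    """Stand-in per-layer 'community' detector (an assumption of this sketch):
    returns the connected components of one layer as frozensets of nodes."""
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(frozenset(component))
    return components


def compose_and(communities_a, communities_b, min_size=2):
    """Hypothetical AND composition: keep groups of nodes that share a
    community in both inputs, dropping groups smaller than min_size."""
    composed = set()
    for a, b in product(communities_a, communities_b):
        common = a & b
        if len(common) >= min_size:
            composed.add(common)
    return composed


def decoupled_communities(nodes, layers, min_size=2):
    """Analyze each layer independently, then fold the per-layer results
    together with the composition function. Assumes at least one layer."""
    per_layer = [connected_components(nodes, edges) for edges in layers.values()]
    combined = {c for c in per_layer[0] if len(c) >= min_size}
    for layer_result in per_layer[1:]:
        combined = compose_and(combined, layer_result, min_size)
    return combined


# Toy two-layer homogeneous MLN over the same five nodes (made-up data).
nodes = ["a", "b", "c", "d", "e"]
layers = {
    "friend": [("a", "b"), ("b", "c"), ("d", "e")],
    "coworker": [("a", "b"), ("c", "d"), ("d", "e")],
}
result = decoupled_communities(nodes, layers)
print(sorted(sorted(c) for c in result))
```

Because each layer is processed independently, the per-layer results in this sketch could be computed once (or in parallel) and reused across different compositions, which is the kind of flexibility the decoupling approach aims for.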
Finally, the framework and the proposed algorithms have been applied to real-world data sets (Internet Movie Database - IMDb, Database Bibliography - DBLP, UK Accidents, US Airlines, Facebook) and to synthetic data sets in order to validate the approach, the flexibility it affords, its accuracy limits, and its efficiency. Meticulous drill-down analysis of the final results has yielded a few surprising findings, including predictions of potential future events that we could verify against independently available ground truth. Based on this work, a dashboard for visualizing MLN analysis is under development.