Held in conjunction with KDD'16

Aug 14, 2016 - San Francisco, CA

Aug 14, 2016 - San Francisco, CA

12th International Workshop on

Mining and Learning with Graphs

Mining and Learning with Graphs

There is a great deal of interest in analyzing data that is best represented as a graph. Examples include the WWW, social networks, biological networks, communication networks, transportation networks, energy grids, and many others. These graphs are typically multi-modal, multi-relational and dynamic. In the era of big data, the importance of being able to effectively mine and learn from such data is growing, as more and more structured and semi-structured data is becoming available. The workshop serves as a forum for researchers from a variety of fields working on mining and learning from graphs to share and discuss their latest findings.

There are many challenges involved in effectively mining and learning from this kind of data, including:

- Understanding the different techniques applicable, including graph mining algorithms, graphical models, latent variable models, matrix factorization methods and more.
- Dealing with the heterogeneity of the data.
- The common need for information integration and alignment.
- Handling dynamic and changing data.
- Addressing each of these issues at scale.

Traditionally, a number of subareas have contributed to this space: communities in graph mining, learning from structured data, statistical relational learning, inductive logic programming, and, moving beyond subdisciplines in computer science, social network analysis, and, more broadly network science.

Morning Sessions | |
---|---|

8:50 am | Opening Remarks |

9:00 am | Keynote: Lars Backstrom Serving a Billion Personalized News Feeds |

9:40 am | Paper Spotlights 1 |

10:00 am | Coffee Break |

10:30 am | Paper Spotlights 2 |

10:50 am | Keynote: Leman Akoglu Communities and Anomalies in Attributed Networks |

11:30 am | Poster Session |

12:00 pm | Lunch (+ Poster Session) |

Afternoon Sessions | |
---|---|

1:10 pm | Keynote: Tamara Kolda Correctly Modeling Networks |

1:50 pm |
Scaling Overlapping Clustering Incremental Method for Spectral Clustering of Increasing Orders |

2:10 pm | Keynote: Yizhou Sun Node Representation in Mining Heterogeneous Information Networks |

2:50 pm |
Distance-Based Influence in Networks: Computation and Maximization Measuring Graph Proximity with Blink Model |

3:10 pm | Coffee Break |

3:40 pm | Keynote: Jennifer Neville Statistical Methods for Modeling Network Distributions |

4:20 pm | Keynote: SVN Vishwanathan Exploiting the Computation Graph for Large Scale Distributed Machine Learning |

5:00 pm | Closing Remarks |

Assistant Professor

Stony Brook University

Given a network in which nodes are associated with a list of attributes, how can we define and characterize communities? How can we spot anomalous communities and anomalies within communities?
Networks have long been studied and focus has most recently shifted to 'networks with content'. Long-studied network questions, such as ranking, clustering, and similarity, are reconsidered for such networks, as the new information such as node/edge attributes and types help enrich the formulations and increase our understanding of real-world networks.
In this talk, I will introduce our work on spotting anomalies in networks with node attributes. Our main approach to anomaly mining in attributed networks is through communities. In particular, we quantify the degree that a community can be characterized through (a subset of) attributes on which its members 'click'. We then use such a quantity as a 'normality' score, based on which we identify individual anomalous nodes inside communities as well as communities that are anomalous as a group of nodes due to their low normality.

Director of Engineering

Facebook

Feed ranking's goal is to provide people with over a billion personalized experiences. We strive to provide the most compelling content to each person, personalized to them so that they are most likely to see the content that is most interesting to them. Similar to a newspaper, putting the right stories above the fold has always been critical to engaging customers and interesting them in the rest of the paper. In feed ranking, we face a similar challenge, but on a grander scale. Each time a person visits, we need to find the best piece of content out of all the available stories and put it at the top of feed where people are most likely to see it. To accomplish this, we do large-scale machine learning to model each person, figure out which friends, pages and topics they care about and pick the stories each particular person is interested in. In addition to the large-scale machine learning problems we work on, another primary area of research is understanding the value we are creating for people and making sure that our objective function is in alignment with what people want.

Distinguished Member

Sandia National Labs

Understanding and modeling go hand in hand – we develop models not only to make predictions but also to see where the models fail and there is more to do. Large-scale networks are immensely challenging to model mathematically. In this talk, we present our arguments for what features are important to measure and reproduce. In the undirected case, we show that graphs with high clustering coefficients (i.e., many triangles) must have dense Erdȍs-Rényi subgraphs. This is a key theoretical finding that may yield clues in understanding network structure. Following this line, we propose the Block Two-level Erdȍs-Rényi (BTER) model because it reproduces a given degree distribution and clustering coefficient profile (i.e., the triangle distribution), scales linearly in the number of edges, and is easily parallelized. We also consider the extension of this work to bipartite graphs, where we consider bipartite four-cycles, and propose a bipartite BTER (biBTER) model. These models can be used to generate artificial graphs that capture salient features of real graphs. We compare the artificial and real-world graphs so that we can understand where the models are accurate or not. Time permitting, we also explain how these models can be specified with very few parameters, which is useful for benchmarking purposes. We close with open questions for future investigations. This is joint work with S. Aksoy, A. Pinar, T. Plantenga, and C. Seshadhri.

Associate Professor

Purdue University

The recent interest in analyzing the network structure of complex systems has fueled a large body of research on both models of network structure and algorithms to automatically discover patterns for use in predictive models. However, robust statistical models, which can accurately represent distributions over graph populations, and sample efficiently from those distributions, are critical to assess the evaluate the performance of analytic algorithms and the significance of discovered patterns. However, unlike metric spaces, the space of graphs exhibits a combinatorial structure that poses significant theoretical and practical challenges to accurate estimation and efficient sampling/inference. In this talk, I will discuss our recent work on modeling distributions of networks, both attributed and unattributed, and outline how the methods can be used for inference and evaluation.

Assistant Professor

University of California, Los Angeles

One of the challenges in mining information networks is the lack of intrinsic metric in representing nodes into a low dimensional space, which is essential in many mining tasks, such as recommendation and anomaly detection. Moreover, when coming to heterogeneous information networks, where nodes belong to different types and links represent different semantic meanings, it is even more challenging to represent nodes properly. In this talk, we will focus on two mining tasks, i.e., (1) content-based recommendation and (2) anomaly detection in heterogeneous categorical events, and introduce (1) how to represent nodes when different types of nodes and links are involved; and (2) how heterogeneous links play different roles in these tasks. Our results have demonstrated the superiority as well as the interpretability of these new methodologies.

Professor | Principal Scientist

UCSC | Amazon

Many machine learning algorithms minimize a regularized risk. It is well known that stochastic optimization algorithms are both theoretically and practically well motivated for regularized risk minimization. Unfortunately, stochastic optimization is not easy to parallelize. In this talk, we take a radically new approach and show that working with the saddle-point problem that arises out of the Lagrangian has a very specific computational graph structure which can be exploited to allow for a natural partitioning of the parameters across multiple processors. This allows us to derive a new parallel stochastic optimization algorithm for regularized risk minimization. Joint work with: Inderjit Dhillon, Cho-Jui Hsieh, Shihao Ji, Shin Matsushima, Parameshwaran Raman, Hsiang-Fu Yu, and Hyokun Yun.

**A Graph Analytics Framework for Ranking Authors, Papers and Venues**
PDF

*Arindam Pal and Sushmita Ruj.*

Abstract: A lot of scientific works are published in different areas of science, technology, engineering and mathematics. It is not easy, even for experts, to judge the quality of authors, papers and venues (conferences and journals). An objective measure to assign scores to these entities and to rank them is very useful. Although, several metrics and indexes have been proposed earlier, they suffer from various problems. In this paper, we propose a graph-based analytics framework to assign scores and to rank authors, papers and venues. Our algorithm considers only the link structures of the underlying graphs. It does not take into account other aspects, such as the associated texts and the reputation of these entities. In the limit of large number of iterations, the solution of the iterative equations gives the unique entity scores. This framework can be easily extended to other interdependent networks.

**Adaptive Neighborhood Graph Construction for Inference in Multi-Relational Networks**
PDF

*Shobeir Fakhraei, Dhanya Sridhar, Jay Pujara and Lise Getoor.*

Abstract: A neighborhood graph, which represents the instances as vertices and their relations as weighted edges, is the basis of many semi-supervised and relational models for node labeling and link prediction. Most methods employ a sequential process to construct the neighborhood graph. This process often consists of generating a candidate graph, pruning the candidate graph to make a neighborhood graph, and then performing inference on the variables (i.e., nodes) in the neighborhood graph. In this paper, we propose a framework that can dynamically adapt the neighborhood graph based on the states of variables from intermediate inference results, as well as structural properties of the relations connecting them. A key strength of our framework is its ability to handle multi-relational data and employ varying amounts of relations for each instance based on the intermediate inference results. We formulate the link prediction task as inference on neighborhood graphs, and include preliminary results illustrating the caring effects of different strategies in our proposed framework.

**Adding Structure: Social Network Inference with Graph Priors**
PDF

*Han Liu, Stratis Ioannidis, Smriti Bhagat and Chen-Nee Chuah.*

Abstract: We study the problem of social network graph inference, whereby the topology of user interaction networks, as well as the strength of pairwise influences, are inferred from traces of information cascades. We propose a framework of introducing graph structural priors into the above inference process. This framework allows us to capture different priors on the graph's degree distribution, including, e.g., stretched exponential, log-normal (approximation), and power-law, which are important due to the natural prevalence of such structures in real social networks and many other complex graphs. We show that network inference under our model is amenable to the so-called majorize-minimize method, and that its implementation is tractable, as each step amounts to solving a convex optimization problem. We extensively evaluate our method over synthetic datasets as well as real-world datasets from Twitter and a Facebook app, iHeart. We observe that network inference incorporating our structural priors significantly outperforms state-of-the-art approaches incorporating a structure-free regularization term.

**Detecting Concept Drift in Classification Over Streaming Graphs**
PDF

*Yibo Yao and Lawrence Holder.*

Abstract: Detecting concept drift in data streams has been widely studied in the data mining community. Conventional drift detection methods use classifiers' outputs (e.g., classification accuracy, error rate) as indicators to signal concept changes. As a result, their performance greatly depends on the chosen classifiers. However, there is little work on addressing concept drift in graph-structured data. In this paper, we present a Graph Entropy-based Method (GEM) to effectively detect concept drift in graph streams. Contrary to many related works, we investigate the intrinsic properties of data (i.e., subgraph distribution w.r.t. class membership), instead of monitoring classification outputs. This method can be combined with any graph stream classifier to facilitate classification on non-stationary graph streams. Our approach is combined with several graph stream classification algorithms and tested on synthetic and real-world graph data streams. The experimental results demonstrate the advantage of our method in detecting concept drift as well as improving classification performance.

**Distance-Based Influence in Networks: Computation and Maximization**
PDF

*Edith Cohen, Daniel Delling, Thomas Pajor and Renato Werneck.*

Abstract: A premise at a heart of network analysis is that entities in a network derive utilities from their connections. The {\em influence} of a seed set $S$ of nodes is defined as the sum over nodes $j$ of the {\em utility} of $S$ to $j$. {\em Distance-based} utility, which is a decreasing function of the distance from $S$ to $j$, was explored in several successful research threads from social network analysis and economics: Network formation games [Bloch and Jackson 2007], Reachability-based influence [Richardson and Domingos 2002; Kempe et al. 2003]; ``threshold'' influence [Gomez-Rodriguez et al. 2011]; and {\em closeness centrality} [Bavelas 1948]. We formulate a model that unifies and extends this previous work and address the two fundamental computational problems in this domain: {\em Influence oracles} and {\em influence maximization} (IM). An oracle performs some preprocessing, after which influence queries for arbitrary seed sets can be efficiently computed. With IM, we seek a set of nodes of a given size with maximum influence. Since the IM problem is computationally hard, we instead seek a {\em greedy sequence} of nodes, with each prefix having influence that is at least $1-1/e$ of that of the optimal seed set of the same size. We present the first highly scalable algorithms for both problems, providing statistical guarantees on approximation quality and near-linear worst-case bounds on the computation. We perform an experimental evaluation which demonstrates the effectiveness of our designs on networks with hundreds of millions of edges.

**Distributed Community Detection on Edge-labeled Graphs using Spark**
PDF

*San-Chuan Hung, Miguel Araujo and Christos Faloutsos.*

Abstract: How can we detect communities in graphs with edge-labels, such as time-evolving networks or edge-colored graphs? Unlike classical graphs, edge-labels contain additional information about the type of edges, e.g., when two people got connected, or which company hosts the air route between two cities. We model community detection on edge-labeled graphs as a tensor decomposition problem and propose TeraCom, a distributed system that is able to scale in order to solve this problem on 10x larger graphs. By carefully designing our algorithm and leveraging the Spark framework, we show how to achieve better accuracy (in terms of recovering ground-truth communities) when compared to PARAFAC methods - up to 30% increase in NMI. We also present interesting clusters discovered by our system in a flights network.

**Efficient Comparison of Massive Graphs Through The Use Of ‘Graph Fingerprints’**
PDF

*Stephen Bonner, John Brennan, Georgios Theodoropoulos, Stephen McGough and Ibad Kureshi.*

Abstract: The problem of how to compare empirical graphs is an area of great interest within the field of network science. The ability to accurately but efficiently compare graphs has a significant impact in such areas as temporal graph evolution, anomaly detection and protein comparison. The comparison problem is compounded when working with graphs containing millions of anonymous, i.e. unlabelled, vertices and edges. Comparison of two or more graphs is highly computationally expensive. Thus reducing a graph to a much smaller feature set – called a fingerprint, which accurately captures the essence of the graph would be highly desirable. Such an approach would have potential applications outside of graph comparisons, especially in the area of machine learning. This paper introduces a feature extraction based approach for the efficient comparison of large topologically similar, but order varying, unlabelled graph datasets. The approach acts by producing a ‘Graph Fingerprint’ which represents both vertex level and global level topological features from a graph. The approach is shown to be efficient when comparing graphs which are highly topologically similar but order varying. The approach scales linearly with the size and complexity of the graphs being fingerprinted.

**Entity Resolution in Familial Networks**
PDF

*Pigi Kouki, Christopher Marcum, Laura Koehly and Lise Getoor.*

Abstract: Entity resolution is an important graph mining problem. Entity resolution is particularly interesting and challenging when there is rich relational structure. In this paper, we study the problem of performing entity resolution in familial networks. In our setting, we are given partial views of a familial network as described from the point of view of different people in the network and our goal is to reconstruct the underlying familial network from these perspective partial views. The data and relations provided may be inaccurate, missing or incomplete. In our approach, we start by augmenting the known set of familial relations with additional ones that are either inversed or derived from the original set of relations by linkage heuristics. Additionally, we propose a set of measures that capture the similarity of persons in the familial network based on both personal and relational information. We present a supervised learning approach where we view entity resolution in familial networks as a classication problem. Our experiments on real-world data from multiple-informant pedigrees show that our approach works well and that we can improve performance by considering separate similarity scores for each relation type.

**Entity Typing: A Critical Step for Mining Structures from Massive Unstructured Text**
PDF

*Xiang Ren, Wenqi He, Ahmed El-Kishky, Clare R. Voss, Heng Ji, Meng Qu and Jiawei Han.*

Abstract: We have been studying learning and mining graphs or networks. Where do most real networks come from? Although some networks come from well-structured and explicitly connected nodes and links, a majority of networks come from massive unstructured text data, and it takes human efforts to extract them and build them explicitly. Unfortunately, manual data curation and extraction of structures from unstructured data can be costly, unscalable, and error-prone. We have been investigating a data-driven approach to building structured networks from unstructured text data. First, quality phrases can be mined from massive text corpus, serving as basic semantic units, mostly being entities. Second, types can be inferred for such entities from such massive text data with distant supervision and relationships among entities can be uncovered by network embedding as well. Therefore, entity typing is a critical step for mining structures from unstructured text data. In this study, we focus on how to conduct entity typing with a data-driven approach. We show that "rough" entity types can be identified from massive text data with a distant supervision approach via some domain-independent knowledge-bases. However, for refined typing, even the type labels in a knowledge bases can be noisy (i.e., incorrect for the entity mention’s local context). We propose a general framework, called PLE, to jointly embed entity mentions, text features and entity types into the same low-dimensional space where, in that space, objects whose types are semantically close have similar representations. Then we estimate the type-path for each training example in a top-down manner using the learned embeddings. We formulate a global objective for learning the embeddings from text corpora and knowledge bases, which adopts a novel margin-based loss that is robust to noisy labels and faithfully models type correlation derived from knowledge bases. Our experiments on three public typing datasets demonstrate the effectiveness and robustness of PLE, with an average of 25\% improvement in accuracy compared to next best method.

**Fast Patchwork Bootstrap for Quantifying Estimation Uncertainties in Sparse Random Networks**
PDF

*Yulia Gel, Vyacheslav Lyubchich and Leticia Ramirez Ramirez.*

Abstract: We propose a new method of nonparametric bootstrap to quantify estimation uncertainties in large and possibly sparse random networks. The method is tailored for inference on functions of network degree distribution, under the assump- tion that both network degree distribution and network or- der are unknown. The key idea is based on adaptation of the “blocking” argument, developed for bootstrapping of time series and re-tiling of spatial data, to random networks. We sample network blocks (patches) and bootstrap the data within these patches. To select an optimal patch size, we develop a new computationally efficient and data-driven cross- validation algorithm. The proposed fast patchwork bootstrap (FPB) methodology further extends the ideas developed by Thompson et al., 2016, for a case of network mean degree, to inference on a degree distribution. In addition, the FPB is substantially less computationally expensive, requires less information on a graph, and is free from nuisance parameters. In our simulation study, we show that the new bootstrap method outperforms competing approaches by providing sharper and better calibrated confidence intervals for functions of a network degree distribution than other available approaches. We illustrate the FPB in application to a study of the Erdös collaboration network.

**IGLOO: Integrating global and local biological network alignment**
PDF

*Lei Meng, Joseph Crawford, Aaron Striegel and Tijana Milenkovic.*

Abstract: Analogous to genomic sequence alignment, biological network alignment (NA) aims to find regions of similarities between molecular networks (rather than sequences) of different species. NA can be either local (LNA) or global (GNA). LNA aims to identify highly conserved common subnetworks, which are typically small, while GNA aims to identify large common subnetworks, which are typically suboptimally conserved. We recently showed that LNA and GNA yield complementary results: LNA has high functional but low topological alignment quality, while GNA has high topological but low functional alignment quality. Thus, we propose IGLOO, a new approach that integrates GNA and LNA in hope to reconcile the two. We evaluate IGLOO against state-of-the-art LNA (NetworkBLAST, NetAligner, AlignNemo, and AlignMCL) and GNA (GHOST, NETAL, GEDEVO, MAGNA++, WAVE, and L-GRAAL) methods. We show that IGLOO allows for a trade-off between topological and functional alignment quality better than the existing LNA and GNA methods considered in our study.

**Identifying Anomalies in Graph Streams Using Change Detection**
PDF

*William Eberle and Lawrence Holder.*

Abstract: Anomaly detection in graph streams requires both the discovery of normative subgraph patterns in the stream and then the identification of subgraphs that are unexpected deviations from the normative patterns. Both of these processes are computationally complex, and therefore we should only update them when necessary. We present an approach based on a change detection metric used for graph sampling that selectively updates the normative patterns only when significant change has occurred. Using a batch processing method on the graph stream, we show that the change detection approach significantly reduces running times while maintaining the accuracy of more exhaustive approaches.

**Incremental Method for Spectral Clustering of Increasing Orders**
PDF

*Pin-Yu Chen, Baichuan Zhang, Mohammad Hasan and Alfred Hero.*

Abstract: The smallest eigenvalues and the associated eigenvectors (i.e., eigenpairs) of a graph Laplacian matrix have been widely used for spectral clustering and community detection. However, in real-life applications the number of clusters or communities (say, K) is generally unknown a-priori. Consequently, the majority of the existing methods either choose K heuristically or they repeat the clustering method with different choices of K and accept the best clustering result. The first option, more often, yields suboptimal result, while the second option is computationally expensive. In this work, we propose an incremental method for constructing the eigenspectrum of the graph Laplacian matrix. This method leverages the eigenstructure of graph Laplacian matrix to obtain the K-th eigenpairs of the Laplacian matrix given a collection of all the $K-1$ smallest eigenpairs. Our proposed method adapts the Laplacian matrix such that the batch eigenvalue decomposition problem transforms into an efficient sequential leading eigenpair computation problem. As a practical application, we consider user-guided spectral clustering. Specifically, we demonstrate that users can utilize the proposed incremental method for effective eigenpair computation and determining the desired number of clusters based on multiple clustering metrics.

**Investigating the impact of graph structure and attribute correlation on collective classification performance**
PDF

*Giselle Zeno and Jennifer Neville.*

Abstract: Relational machine learning methods can significantly improve the predictive accuracy of models for a range of network domains, from social networks to physical and biological networks. The methods automatically learn network correlation patterns from observed data and then use them in a collective inference process to propagate predictions throughout the network. While previous work has indicated that both link density and network autocorrelation impact the performance of collective classification models, this is based on observations from a limited set of real world networks available for empirical evaluation. There has been some work using synthetic data to systematically study model performance as data characteristics are varied, but the complexity of generating realistic network structures made it difficult to consider characteristics jointly. In this paper, we exploit a recently developed method for generating attributed networks with realistic network structure (i.e., parameters learned from real networks) and correlated attributes. Using synthetic data generated from the model, we conduct a systematic study of relational learning and collective inference methods to investigate how graph characteristics interact with attribute correlation to impact classification performance. Notably, we show that AUC performance of the method can be accurately predicted from a linear function of link density and attribute correlation.

**Local Spectral Diffusion for Robust Community Detection**
PDF

*Kun He, Pan Shi, John E. Hopcroft and David Bindel.*

Abstract: We address a semi-supervised learning problem of identifying all latent members of a local community from very few labeled seed members in large networks. By a simple and efficient sampling method, we conduct a comparatively small subgraph encompassing most of the latent members such that the follow-up membership identification could focus on an accurate local region instead of the whole network. Then we look for a sparse vector, a relaxed indicator vector representing the subordinative probability of the corresponding nodes, that lies in a local spectral subspace defined by an order-$d$ Krylov subspace. The subspace serves as a local proxy for the invariant subspace spanned by leading eigenvectors of the Laplacian matrices. Based on Rayleigh quotients, we relate the local membership identification task as a local RatioCut or local normalized cut optimization problem, and provide some theoretical justifications. We thoroughly explore different probability diffusion methods for the subspace definition and evaluate our method on four groups with a total of 28 representative LFR benchmark datasets, and eight public available real-world networks with labeled ground truth communities across multiple domains. Experimental results exhibit the effectiveness and robustness of the proposed algorithm, and the local spectral communities perform better than those from the celebrated Heat Kernel diffusion[10] and the PageRank diffusion[1].

**Measuring Graph Proximity with Blink Model**
PDF

*Haifeng Qian, Hui Wan, Mark N. Wegman, Luis A. Lastras and Ruchir Puri.*

Abstract: This paper proposes a new graph proximity measure. This measure is a derivative of network reliability. By analyzing its properties and comparing it against other proximity measures through graph examples, we demonstrate that it is more consistent with human intuition than competitors. A new deterministic algorithm is developed to approximate this measure with practical complexity. Empirical evaluation by two link prediction benchmarks, one in coauthorship networks and one in Wikipedia, shows promising results. For example, a single parameterization of the proposed measure achieves accuracies that are 14-35% above the best accuracy for each graph of all predictors reported in the 2007 Liben-Nowell and Kleinberg survey.

**Perseus3: Visualizing and Interactively Mining Large-Scale Graphs**
PDF

*Di Jin, Christos Faloutsos, Danai Koutra and Ticha Sethapakdi.*

Abstract: How can we summarize large graphs of different types, e.g., unipartite or bipartite, directed or undirected? How can we find anomalous patterns in such graphs efficiently? In this paper we present PERSEUS3, a large-scale graph mining system that supports analysis of three types of graphs: unipartite and undirected; bipartite and undirected; and unipartite and directed. Our system provides coupled summarization of graph properties and the network structure, and allows the user to interactively explore normal and anomalous node behaviors. PERSEUS3 is developed based on PERSEUS with three significant extensions: (1) Graph statistics are extracted depending on the type of the input network (e.g., total degree, eigenvectors for undirected graphs; in/out degree, SVD vectors for directed graphs); (2) Subgraphs of the selected node are interactively visualized through the adjacency matrices; (3) Heatmaps (instead of simple scatterplots) are adopted in graph summarization to improve the scalability of the system. Our extensive experiments show that PERSEUS3 handles different tasks of graph mining efficiently. Specifically we run the univariate undirected graph analysis on a Twitter who-follows-whom graph which spans 0.26 million users and 220 million links; we also run the bipartite graph analysis on a user-movie ratings dataset, and the directed graph analysis on a patent citation graph. We report the patterns discovered, including bipartite cores and outliers spotted by PERSEUS3.

**Predicting risky behavior in social communities**
PDF

*Olivia Simpson and Julian McAuley.*

Abstract: Predicting risk profiles of individuals in networks (e.g. susceptibility to a particular disease, or likelihood of smoking) is challenging for a variety of reasons. For one, ‘local’ features (such as an individual’s demographic information) may lack sufficient information to make informative predictions; this is especially problematic when predicting ‘risk,’ as the relevant features may be precisely those that an individual is disinclined to reveal in a survey. Secondly, even if such features are available, they still may miss crucial information, as ‘risk’ may be a function not just of an individual’s features but also those of their friends and social communities. Here, we predict individual’s risk profiles as a function of both their local features and those of their friends. Instead of modeling influence from the social network directly (which proved difficult as friendship links may be sparse and partially observed), we instead model influence by discovering social communities in the network that may be related to risky behavior. The result is a model that predicts risk as a function of local features, while making up for their deficiencies, and accounting for social influence by uncovering community structure in the network. We test our model by predicting risky behavior among adolescents from the Add health data set, and hometowns among users in a Facebook ego net. Compared to prediction by features alone, our model demonstrates better predictive accuracy when measured as a whole, and in particular when measured as a function of network “richness.”

**Real-Time Community Detection in Large Social Networks on a Laptop**
PDF

*Ben Chamberlain and Marc Deisenroth.*

Abstract: For a broad range of governmental and commercial applications, it is important to understand the allegiances, communities and structure of key players in society. One promising direction towards extracting this information is to exploit the rich relational data in digital social networks (the social graph). As social media data sets are very large, most approaches make use of distributed computing systems for this purpose. Distributing graph processing requires solving many dicult engineering problems, which has lead some researchers to look at single-machine solutions that are faster and easier to maintain. In this article, we present a single-machine real-time system for large-scale graph processing that allows analysts to interactively explore graph structures. The key idea is that the aggregate actions of large numbers of users can be compressed into a data structure that encapsulates user similarities while being robust to noise and queryable in real-time. We achieve single machine real-time performance by compressing the neighbourhood graph of each vertex using minhash signatures and facilitate rapid queries through Locality Sensitive Hashing. These techniques reduce query times from hours using industrial desktop machines operating on the full graph to milliseconds on standard laptops. Our method allows analysts to interactively explore strongly associated regions, e.g., communities, of large graphs in real-time on their laptops. Our work has been deployed in software that is actively used by analysts to understand social network structure. It oers another channel for social media providers to monetise their data, helping them to continue to provide free services that are valued by billions of people globally.

**Reducing Million-Node Graphs to a Few Structural Patterns: A Unified Approach**
PDF

*Yike Liu, Tara Safavi, Neil Shah and Danai Koutra.*

Abstract: How do graph clustering techniques compare in terms of summarization power? How well can they summarize a million-node graph with a few representative structures? In this paper, we compare and contrast different techniques: METIS, LOUVAIN, SPECTRAL CLUSTERING, SLASHBURN, BIGCLAM, HYCOMFIT, and KCBC, our proposed k-core-based clustering method. Unlike prior work that focuses on various measures of cluster quality, we use vocabulary structures that often appear in real graphs and the Minimum Description Length (MDL) principle to obtain a graph summary per clustering method. Our main contributions are: (i) Formulation: we propose a summarization-based evaluation of clustering methods. Our method, VOG-OVERLAP, concisely summarizes graphs in terms of their important structures with small edge overlap and large node/edge coverage; (ii) Algorithm: we introduce KCBC, a graph decomposition technique based on the k-core algorithm. We also introduce STEP, a summary assembly heuristic that produces compact summaries, as well as two parallel approximations thereof. (iii) Evaluation: we compare the summarization power of seven clustering techniques on large real graphs and analyze their compression rates, summary statistics, and runtimes.

**Relational Similarity Machines**
PDF

*Ryan Rossi, Rong Zhou and Nesreen Ahmed.*

Abstract: This paper proposes Relational Similarity Machines (RSM): a fast, accurate, and flexible relational learning framework for supervised and semi-supervised learning tasks. Despite the importance of relational learning, most existing methods are unable to handle large noisy attributed networks with low or even modest levels of relational autocorrelation. Furthermore, they also have issues with efficiency, accuracy, scalability, and flexibility. For instance, many existing methods perform poorly for multi-class classification problems, graphs that are sparsely labeled, or network data with low relational autocorrelation. In contrast, the proposed relational learning framework is designed to be (i) fast for learning and inference at real-time interactive rates, and (ii) flexible for a variety of learning settings, constraints, and application domains. The experiments demonstrate the effectiveness of RSM for a variety of tasks and data.

**Scaling Overlapping Clustering**
PDF

*Kyle Kloster, Merrielle Spain and Stephen Kelley.*

Abstract: Identifying communities plays a central role in understanding the structure of large networks. As practitioners analyze progressively larger networks, it becomes increasingly important to understand the computational complexity of candidate algorithms. We examine the complexity of the link clustering algorithm for overlapping community detection. We provide new, tight bounds for the original implementation and propose modifications to reduce algorithmic complexity. These new bounds are a function of the number of wedges in the graph. Additionally, we demonstrate that for several community detection algorithms, wedges predict runtime better than commonly cited graph features. We conclude by proposing a method to reduce the wedges in a graph by removing high-degree vertices from the network, identifying communities with an optimized version of link clustering, and heuristically matching communities with the removed vertices as a post-processing step. We empirically demonstrate a large reduction in processing time on several common data sets.

**Sparse Network Inference using the k-Support Norm**
PDF

*Aman Gupta, Haohan Wang and Rama Kumar Pasumarthi.*

Abstract: Network inference is an important problem in a variety of domains. Understanding gene interaction networks is an important problem in Biology. DNA microarrays provide a cheap and efficient mechanism of measuring mRNA expression levels of thousands of genes in one go. The computational problem then reduces to taking gene mRNA expression time series data and using inference/reverse-engineering techniques to learn a gene interaction network. In the context of social network analysis, network inference refers to the problem of inferring underlying network of influence, given time series data of when users performed certain actions (e.g., post, retweet, share). These networks capture the dynamics of influence and information diffusion. In this position paper, we discuss various strategies to learn sparse networks with the help of the k-support norm, which corresponds to the tightest convex relaxation of sparsity combined with an l2 penalty. This norm encourages sparsity and also controls the influence of individual parameters. We also discuss strategies to incorporate various constraints on learned networks.

**Subgraph2vec: Learning Distributed Representations of Rooted Sub-graphs from Large Graphs**
PDF

*Annamalai Narayanan, Mahinthan Chandramohan, Lihui Chen, Yang Liu and Santhosh Kumar Saminathan.*

Abstract: In this paper, we present subgraph2vec, a novel approach for learning latent representations of rooted subgraphs from large graphs inspired by recent advancements in Deep Learning and Graph Kernels. These latent representations encode semantic sub-structure dependencies in a continuous vector space, which is easily exploited by statistical models for tasks such as graph classification and clustering. Similar to recently proposed Deep Graph Kernels [7], subgraph2vec leverages on local information obtained from neighborhoods of nodes to learn their latent representations in an unsupervised fashion. We demonstrate that subgraph vectors learnt by our approach could be used in conjunction with classifiers such as CNNs, SVMs and relational data clustering algorithms to achieve significantly superior classification and clustering accuracies. Our experiments on several benchmark and largescale real-world datasets reveal that subgraph2vec achieves significant improvements in accuracies over existing graph kernels. Specifically, on two large-scale program analysis tasks, namely, code clone and malware detection, subgraph2vec outperforms state-of-the-art graph kernels by more than 17% and 7%, respectively.

**The Infinity Mirror Test for Analyzing the Robustness of Graph Generators**
PDF

*Salvador Aguinaga and Tim Weninger.*

Abstract: Graph generators learn a model from a source graph in order to generate a new graph that has many of the same properties. The learned models each have implicit and explicit biases built in, and its important to understand the assumptions that are made when generating a new graph. Of course, the differences between the new graph and the original graph, as compared by any number of graph properties, are important indicators of the biases inherent in any modelling task. But these critical differences are often subtle, and therefore not immediately apparent using standard performance metrics. Therefore, we introduce the infinity mirror test for the analysis of graph generator performance and robustness. This stress test operates by repeatedly, recursively fitting a model to itself. A perfect graph generator would have no deviation from the original or ideal graph, however the implicit biases and assumptions that are cooked into the various models are exaggerated by the infinity mirror test allowing for new insights that were not available before. We show, via hundreds of experiments on 6 real world graphs, that the several common graph generators do degenerate in interesting and informative ways. We believe that the observed degenerative patterns are clues to the future development of better graph models.

**Training Iterative Collective Classifiers with Back-Propagation**
PDF

*Shuangfei Fan and Bert Huang.*

Abstract: We propose a new method for training iterative collective classifiers for labeling nodes in network data. The iterative classification algorithm (ICA) is a canonical method for incorporating relational information into the classification process. Yet, existing methods for training ICA models rely on computing relational features using the true labels of the nodes. This method introduces a bias that is inconsistent with the actual prediction algorithm. In this paper, we introduce a variant of ICA, recurrent collective classification (RCC) as a procedure analogous to recurrent neural network prediction, which enables gradient-based strategies for optimizing over model parameters. We demonstrate that by training RCC with back-propagation, we more directly optimize the training loss of collective classification, which translates to improved accuracy and robustness on real network data. This robustness enables effective collective classification in settings where local classification is very noisy, settings that previously were particularly challenging for ICA and variants.

**User Action Prediction for Computational Advertisement Using Local Graph Algorithms**
PDF

*Hongxia Yang, Yada Zhu and Jingrui He.*

Abstract: User behavior modeling is essential in computational advertisement, which builds users' profiles by tracking their online behaviors and then delivers the relevant ads according to each user's interests and needs. Accurate models will lead to higher targeting accuracy and thus improved advertising performance. Intuitively, similar users tend to have similar behaviors towards the displayed ads. However, to the best of our knowledge, there is very little (if at all) previous work that explicitly investigates such similarities and incorporates them into ad response targeting and prediction, largely due to the prohibitive scale of the problem. To bridge this gap, in this paper, we use bipartite graphs to represent historical user behaviors, which consist of both user nodes and advertiser campaign nodes, as well as edges reflecting user-campaign interactions in the past. Based on this representation, we study random-walk-based local algorithms for user behavior modeling and action prediction, whose computational complexity depends only on the size of the output cluster, rather than the entire graph. Our goal is to improve action prediction by leveraging historical user-user, campaign-campaign, and user-campaign interactions. In particular, we propose the bipartite graphs AdvUserGraph accompanied with the ADNI algorithm. ADNI extends the NIBBLE algorithm to AdvUserGraph, and it is able to find the local cluster consisting of interested users towards a specific advertiser campaign. We also propose two extensions of ADNI with improved efficiencies. The performance of the proposed algorithms is demonstrated on both synthetic data and a world leading Demand Side Platform (DSP), showing that they are able to discriminate extremely rare events in terms of their action propensity.

**Using MapReduce for Impression Allocation in Online Social Networks**
PDF

*Inzamam Rahaman and Patrick Hosein.*

Abstract: Online Social Networks (OSNs) offer an efficient and cost effective platform for the dissemination of advertisements to potential consumers. User actions and relationshipscan be analyzed to make informed decisions aboutwhere, how, and to whom advertisement impressions should be allocatedto maximize the efficacy of an advertising campaign in an OSN. Basedon the observation that users influence their friends, research hasbeen done to decide to whom impressions should be given so as to maximize thetotal number of clicks achieved. Recently, a multistageformulation that involves the allocation of advertisementsin stages, has been proposed as a stochastic dynamicprogramming problem. We have developed heuristics based on this formulation and believethat the MapReduce model can be used to further increase performance and reduce compute time.This involves using cluster analysis together with a distributed version of our proposedgreedy algorithm. In this paper we provide the framework for this distributedalgorithm.

**Within-network classification with label-independent features and latent linkages**
PDF

*Christopher Ryther, Jakob Simonsen and Andreas Koch.*

Abstract: We study within-network classification in sparsely labelled networks, presenting two separate contributions: (A) a thorough reproduction of a label-independent method for classification from recent research using statistical relational learning (SRL) and semi-supervised learning (SSL), and (B) a novel approach that utilizes node attribute information to improve SRL and SSL classifier performance called Attribute Network Propagation (ANP). (B) uses a method of linearly combining predictions with a procedure of transforming node attributes into graph edges. For both contributions we use two well-known real-world datasets: the reality mining (RM) cell phone calls dataset and the Cora Portal publication citations dataset (CORA), both formulated as binary classification problems. We employ an existing classification framework to run 10 individual SRL- and SSL-based classifiers and evaluate performance using the area under the ROC curve (AUC). Results from (A) confirms that label-independent features can improve the performance of some relational classifiers using iterative methods, but in most cases, 26 out of 30, deteriorates performance on existing baselines. Results from (B) show that in 91 out of 100 cases it is possible to improve the performance of relational classifiers with ANP.

This workshop is a forum for exchanging ideas and methods for mining and learning with graphs, developing new common understandings of the problems at hand, sharing of data sets where applicable, and leveraging existing knowledge from different disciplines. The goal is to bring together researchers from academia, industry, and government, to create a forum for discussing recent advances graph analysis. In doing so we aim to better understand the overarching principles and the limitations of our current methods, and to inspire research on new algorithms and techniques for mining and learning with graphs.

To reflect the broad scope of work on mining and learning with graphs, we encourage submissions that span the spectrum from theoretical analysis to algorithms and implementation, to applications and empirical studies. As an example, the growth of user-generated content on blogs, microblogs, discussion forums, product reviews, etc., has given rise to a host of new opportunities for graph mining in the analysis of social media. We encourage submissions on theory, methods, and applications focusing on a broad range of graph-based approaches in various domains.

Topics of interest include, but are not limited to:

**Theoretical aspects:**- Computational or statistical learning theory related to graphs
- Theoretical analysis of graph algorithms or models
- Sampling and evaluation issues in graph algorithms
- Relationships between MLG and statistical relational learning or inductive logic programming

**Algorithms and methods:**- Graph mining
- Kernel methods for structured data
- Probabilistic and graphical models for structured data
- (Multi-) Relational data mining
- Methods for structured outputs
- Statistical models of graph structure
- Combinatorial graph methods
- Spectral graph methods
- Semi-supervised learning, active learning, transductive inference, and transfer learning in the context of graph

**Applications and analysis:**- Analysis of social media
- Social network analysis
- Analysis of biological networks
- Large-scale analysis and modeling

We invite the submission of regular research papers (6-8 pages) as well as position papers (2-4 pages).
We recommend papers to be formatted according to the standard double-column ACM Proceedings Style.
All papers will be peer reviewed, single-blinded.
Authors whose papers are accepted to the workshop will have the opportunity to participate in a poster session, and some set may also be chosen for oral presentation.
The accepted papers will be published online and will not be considered archival.

For paper submission, please proceed to the submission website.

Please send enquiries to chair@mlgworkshop.org.

To receive updates about the current and future workshops and the Graph Mining community, please join the Mailing List, or follow the Twitter Account.

**Paper Submission Open:** ~~March 16, 2016~~

**Paper Submission Deadline:** ~~May 27, 2016 ~~ (~~May 16~~)

**Author Notification:** ~~June 13, 2016 ~~

**Final Version:** ~~June 25, 2016 ~~

**Workshop:** August 14, 2016

University of Maryland

College Park

University of California

Santa Cruz

University of Michigan

Ann Arbor

University of California

San Diego

Facebook

Menlo Park

Leman Akoglu (Stony Brook University)

Aris Anagnostopoulos (Sapienza University of Rome)

Arindam Banerjee (University of Minnesota)

Christian Bauckhage (University of Bonn)

Hendrik Blockeel (K.U. Leuven)

Ulf Brefeld (Leuphana University of Lüneburg)

Aaron Clauset (University of Colorado Boulder)

Seshadhri Comandur (University of California Santa Cruz)

Bing Tian Dai (Singapore Management University)

Thomas Gärtner (University of Nottingham)

David Gleich (Purdue University)

Mohammad Hasan (Indiana University Purdue University)

Jake Hofman (Microsoft Research)

Larry Holder (Washington State University)

Bert Huang (Virginia Tech)

Kristian Kersting (Technical University of Dortmund)

Jennifer Neville (Purdue University)

Ali Pinar (Sandia National Laboratories)

Jan Ramon (K.U. Leuven)

Jiliang Tang (Yahoo Labs)

Hanghang Tong (Arizona State University)

Chris Volinsky (AT&T Labs-Research)

Stefan Wrobel (University of Bonn)

Xifeng Yan (University of California Santa Barbara)

Mohammed Zaki (Rensselaer Polytechnic Institute)

Elena Zheleva (Vox Media)

Zhongfei Zhang (Binghamton University)

2013, Chicago, USA (co-located with KDD)

2012, Edinburgh, Scotland (co-located with ICML)

2011, San Diego, USA (co-located with KDD)

2010, Washington, USA (co-located with KDD)

2009, Leuven, Belgium (co-located with SRL and ILP)

2008, Helsinki, Finland (co-located with ICML)

2007, Firenze, Italy

2006, Berlin, German (co-located with ECML and PKDD)

2005, Porto, Portugal, October 7, 2005

2004, Pisa, Italy, September 24, 2004

2003, Cavtat-Dubrovnik, Croatia