IEEE Transactions on Knowledge and Data Engineering - new TOC
http://ieeexplore.ieee.org
TOC Alert for Publication #69, February 15, 2018

Automatic Segmentation of Dynamic Network Sequences with Node Labels (Vol. 30, Issue 3, pp. 407-420)
Abstract (excerpt): …SnapNETS, to automatically find segmentations of such graph sequences, with different characteristics of nodes of each label in adjacent segments. It satisfies all the desired properties (being parameter-free, comprehensive, and scalable) by leveraging a principled, multi-level, flexible framework which maps the problem to a path-optimization problem over a weighted DAG. We also develop a parallel version of SnapNETS which speeds up its running time. Finally, we propose an extension of SnapNETS to handle dynamic graph structures and use it to detect anomalies (and events) in sequences of dynamic networks. Extensive experiments on several diverse real datasets show that it finds cut points matching ground truth or meaningful external signals and detects anomalies, outperforming non-trivial baselines. We also show that the segmentations are easily interpretable and that SnapNETS scales near-linearly with the size of the input.

Cleaning Antipatterns in an SQL Query Log (Vol. 30, Issue 3, pp. 421-434)

ComClus: A Self-Grouping Framework for Multi-Network Clustering (Vol. 30, Issue 3, pp. 435-448)
Abstract (excerpt): …ComClus, to simultaneously group and cluster multiple networks. ComClus is novel in combining the clustering approach of non-negative matrix factorization (NMF) with the feature-subspace learning approach of metric learning. Specifically, it treats node clusters as features of networks and learns proper subspaces from such features to differentiate network groups. During the learning process, the two procedures of network grouping and clustering are coupled and mutually enhanced. Moreover, ComClus can effectively leverage prior knowledge on how to group networks, so that network grouping can be conducted in a semi-supervised manner.
This will enable users to guide the grouping process using domain knowledge so that network clustering accuracy can be further boosted. Extensive experimental evaluations on a variety of synthetic and real datasets demonstrate the effectiveness and scalability of the proposed method.

Cross-Bucket Generalization for Information and Privacy Preservation (Vol. 30, Issue 3, pp. 449-459)
Abstract (excerpt): …$l$-diversity causes overprotection of identity and large amounts of information-utility loss. This paper presents a novel approach, called cross-bucket generalization, to address this problem. The rationale is to divide microdata into equivalence groups and buckets. First, it provides separate protection for identities and sensitive values, and the level of protection can be flexibly adjusted based on actual demands. Second, the sizes of equivalence groups and buckets are minimized as far as possible while still satisfying the protection requirements, which avoids overprotection of identity and reduces information loss. The experiments we conducted illustrate the effectiveness of our solution.

Discovering Canonical Correlations between Topical and Topological Information in Document Networks (Vol. 30, Issue 3, pp. 460-473)
Abstract (excerpt): …community and topic. Regardless of homophily (i.e., the tendency to link to similar others) or heterophily, CCA can properly capture the inherent correlations that fit the dataset itself without any prior hypothesis. We also incorporate auxiliary word embeddings to improve the quality of topics. The effectiveness of our proposed model is comprehensively verified on three different types of datasets: hyperlinked networks of web pages, social networks of friends, and coauthor networks of publications.
Experimental results show that our approach achieves significant improvements over the current state of the art.

Efficient Maintenance of Shortest Distances in Dynamic Graphs (Vol. 30, Issue 3, pp. 474-487)
Abstract (excerpt): …incremental algorithms, as it is impractical to recompute shortest distances from scratch every time an update occurs. In this paper, we address the problem of maintaining all-pairs shortest distances in dynamic graphs. We propose efficient incremental algorithms to process sequences of edge deletions/insertions/updates and vertex deletions/insertions. The proposed approach relies on general operators that can easily be "instantiated" both in main memory and on top of different underlying DBMSs. We provide complexity analyses of the proposed algorithms. Experimental results on several real-world datasets show that current main-memory algorithms soon become impractical, that disk-based ones are needed for larger graphs, and that our approach significantly outperforms state-of-the-art algorithms.

Finding Top-k Shortest Paths with Diversity (Vol. 30, Issue 3, pp. 488-502)
Abstract (excerpt): …$k$ shortest paths in a directed graph plays an important role in many application domains, such as providing alternative paths for vehicle-routing services. However, the returned $k$ shortest paths may be highly similar, i.e., share significant numbers of edges, thus adversely affecting service quality. In this paper, we formalize the K Shortest Paths with Diversity (KSPD) problem, which identifies top-$k$ shortest paths such that the paths are dissimilar to each other and the total length of the paths is minimized. We first prove that the KSPD problem is NP-hard and then propose a generic greedy framework to solve it, in the sense that (1) it supports a wide variety of path-similarity metrics widely adopted in the literature and (2) it can also efficiently solve the traditional KSP problem if no path similarity metric is specified.
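By way of illustration only, a greedy diversification step of the kind this framework supports can be sketched as follows: scan candidate paths in ascending length order and keep a path only if its edge-set Jaccard similarity to every already-kept path stays below a threshold. This is a hypothetical sketch, not the paper's algorithm; the function names and the threshold `tau` are invented for the example.

```python
def path_edges(path):
    """Edge set of a path given as a vertex sequence."""
    return {(u, v) for u, v in zip(path, path[1:])}

def jaccard(a, b):
    """Jaccard similarity of two edge sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def diverse_top_k(candidates, k, tau):
    """Greedy filter: keep a candidate path only if it is sufficiently
    dissimilar (Jaccard edge similarity < tau) to every path kept so far.
    `candidates` is assumed sorted by ascending total length."""
    kept = []
    for path in candidates:
        edges = path_edges(path)
        if all(jaccard(edges, path_edges(p)) < tau for p in kept):
            kept.append(path)
        if len(kept) == k:
            break
    return kept

# Example: the second candidate heavily overlaps the first and is skipped.
cands = [
    ["s", "a", "b", "t"],
    ["s", "a", "b", "c", "t"],   # shares edges (s,a),(a,b) with the first
    ["s", "x", "y", "t"],        # disjoint alternative
]
print(diverse_top_k(cands, k=2, tau=0.3))
# [['s', 'a', 'b', 't'], ['s', 'x', 'y', 't']]
```

The real framework avoids materializing all candidates up front by pruning with lower bounds; this sketch only shows the selection criterion.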
The core of the framework is two judiciously designed lower bounds, one dependent on and the other independent of the chosen path-similarity metric, which effectively reduce the search space and significantly improve efficiency. Empirical studies on five real-world and synthetic graphs and five different path-similarity metrics offer insight into the design properties of the proposed general framework and offer evidence that the proposed lower bounds are effective.

Improved Lower Bounds for Graph Edit Distance (Vol. 30, Issue 3, pp. 503-516)
Abstract (excerpt): …$\mathsf{BRANCH}$ that runs in $\mathcal{O}(n^2\Delta^3+n^3)$ time, where $\Delta$ is the maximum of the maximum degrees of the input graphs $G$ and $H$. We also develop a speed-up, $\mathsf{BRANCHFAST}$, that runs in $\mathcal{O}(n^2\Delta^2+n^3)$ time and computes an only slightly less accurate lower bound. The lower bounds produced by $\mathsf{BRANCH}$ and $\mathsf{BRANCHFAST}$ are shown to be pseudo-metrics on a collection of graphs. Finally, we suggest an anytime algorithm $\mathsf{BRANCHTIGHT}$ that iteratively improves $\mathsf{BRANCH}$'s lower bound. $\mathsf{BRANCHTIGHT}$ runs in $\mathcal{O}(n^3\Delta^2+I(n^2\Delta^3+n^3))$ time, where the number of iterations $I$ is controlled by the user. A detailed experimental evaluation shows that all suggested algorithms are Pareto optimal, that they are very effective when used as filters for edit-distance range queries, and that they perform excellently within classification frameworks.

Local and Global Structure Preservation for Robust Unsupervised Spectral Feature Selection (Vol. 30, Issue 3, pp. 517-529)

Mining Precise-Positioning Episode Rules from Event Sequences (Vol. 30, Issue 3, pp. 530-543)
Abstract (excerpt): …fixed-gap episodes to address this problem. A fixed-gap episode consists of an ordered set of events in which the elapsed time between any two consecutive events is a constant. Based on this concept, we formulate the problem of mining precise-positioning episode rules, in which the occurrence time of each event in the consequent is clearly specified. In addition, we develop a trie-based data structure to mine such precise-positioning episode rules, with several pruning strategies incorporated to improve performance and reduce memory consumption. Experimental results on real datasets show the superiority of our proposed algorithms.

MPI-FAUN: An MPI-Based Framework for Alternating-Updating Nonnegative Matrix Factorization (Vol. 30, Issue 3, pp. 544-558)
Abstract (excerpt): …$\mathbf{W}$ and $\mathbf{H}$, for a given input matrix $\mathbf{A}$, such that $\mathbf{A}\approx \mathbf{W}\mathbf{H}$.
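For reference, the factorization $\mathbf{A}\approx \mathbf{W}\mathbf{H}$ just defined can be computed by several alternating-updating schemes; the classic multiplicative-update rule is sketched below in pure Python ($H \leftarrow H \circ (W^\top A)/(W^\top W H)$, then symmetrically for $W$). This is a serial illustration of the general NMF iteration, not MPI-FAUN's parallel implementation, and all function names are invented for the example.

```python
import random

def matmul(X, Y):
    """Dense product of nested-list matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def nmf_mu(A, r, iters=100, eps=1e-9, seed=0):
    """Multiplicative-update NMF: nonnegative W (m x r) and H (r x n)
    with A ~= W H.  Each update multiplies the current factor
    elementwise by a ratio that can never turn an entry negative."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    W = [[rng.random() for _ in range(r)] for _ in range(m)]
    H = [[rng.random() for _ in range(n)] for _ in range(r)]
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(r)]
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(r)]
             for i in range(m)]
    return W, H

# A is exactly rank 1, so the reconstruction error should shrink to ~0.
A = [[1.0, 2.0], [2.0, 4.0]]
W, H = nmf_mu(A, r=1)
R = matmul(W, H)
err = max(abs(A[i][j] - R[i][j]) for i in range(2) for j in range(2))
print(err < 1e-3)  # True
```

The framework described in the entry hosts faster alternatives (HALS, Block Principal Pivoting) behind the same alternating-NLS skeleton.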
NMF is a useful tool for many applications in different domains, such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel algorithms to solve the problem for big datasets. The main contribution of this work is a new, high-performance parallel computational framework for a broad class of NMF algorithms that iteratively solve alternating non-negative least squares (NLS) subproblems for $\mathbf{W}$ and $\mathbf{H}$. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). The framework is flexible and able to leverage a variety of NMF and NLS algorithms, including Multiplicative Update, Hierarchical Alternating Least Squares, and Block Principal Pivoting. Our implementation allows us to benchmark and compare different algorithms on massive dense and sparse data matrices whose sizes span from a few hundred million to billions. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements. The code and the datasets used in the experiments are available online.

PurTreeClust: A Clustering Algorithm for Customer Segmentation from Massive Customer Transaction Data (Vol. 30, Issue 3, pp. 559-572)
Abstract (excerpt): …$k$ customers as the representatives of $k$ customer groups. Finally, the clustering results are obtained by assigning each customer to the nearest representative. We also propose a gap-statistic-based method to estimate the number of clusters. A series of experiments was conducted on ten real-life transaction datasets, and the results show the superior performance of the proposed method.

Selecting Optimal Subset to Release Under Differentially Private M-Estimators from Hybrid Datasets (Vol. 30, Issue 3, pp. 573-584)
Abstract (excerpt): …[1]. From the perspective of non-interactive learning, we first construct a weighted private density estimation from the hybrid datasets under DP. Along the same lines as [2], we analyze the accuracy of the DP M-estimators based on the hybrid datasets. Our main contributions are (i) we find that the bias-variance tradeoff in the performance of our M-estimators can be characterized by the sample size of the released dataset; and (ii) based on this finding, we develop an algorithm to select the optimal subset of the public dataset to release under DP.
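As background for the differential-privacy machinery this entry builds on, here is a minimal sketch of the standard Laplace mechanism applied to releasing a mean. This is a generic DP illustration, not the paper's M-estimator construction; the function names, bounds, and parameters are invented for the example.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) via the inverse CDF of a uniform draw."""
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def dp_mean(values, lo, hi, epsilon, seed=0):
    """Release the mean of values clipped to [lo, hi] under epsilon-DP.
    Changing one record moves the clipped mean by at most (hi - lo) / n,
    so Laplace noise with scale (hi - lo) / (n * epsilon) suffices."""
    rng = random.Random(seed)
    clipped = [min(max(v, lo), hi) for v in values]
    true_mean = sum(clipped) / len(clipped)
    scale = (hi - lo) / (len(clipped) * epsilon)
    return true_mean + laplace_noise(scale, rng)

released = dp_mean([0.2, 0.4, 0.6, 0.8], lo=0.0, hi=1.0, epsilon=1.0)
print(released)  # a noisy estimate of the true mean 0.5
```

The paper's tradeoff, in these terms: a larger released sample shrinks the noise scale (lower variance) but constrains which records appear (potential bias), which is what drives its optimal-subset selection.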
Our simulation studies and applications to real datasets confirm our findings and provide a guideline for real applications.

TaxiRec: Recommending Road Clusters to Taxi Drivers Using Ranking-Based Extreme Learning Machines (Vol. 30, Issue 3, pp. 585-598)

Using Reenactment to Retroactively Capture Provenance for Transactions (Vol. 30, Issue 3, pp. 599-612)
Abstract (excerpt): …reenactment, a novel technique for replaying a transactional history with provenance capture. Reenactment exploits the time-travel and audit-logging capabilities of modern DBMSs to replay parts of a transactional history using queries. Importantly, our technique requires no changes to the transactional workload or the underlying DBMS and incurs only moderate runtime overhead for transactions. We have implemented our approach on top of a commercial DBMS, and our experiments confirm that, by applying novel optimizations, we can efficiently capture provenance for complex transactions over large datasets.
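Of the definitions quoted in this issue, the fixed-gap episode (from the precise-positioning episode rules entry) is concrete enough to pin down in a short sketch: an ordered set of events is a fixed-gap episode exactly when the elapsed time between every pair of consecutive events is the same constant. The checker below is a hypothetical illustration of that definition, not the paper's trie-based miner.

```python
def is_fixed_gap_episode(events):
    """events: list of (timestamp, label) pairs, assumed time-ordered.
    True iff the elapsed time between every pair of consecutive events
    is the same constant (trivially true for fewer than two events)."""
    if len(events) < 2:
        return True
    gaps = [t2 - t1 for (t1, _), (t2, _) in zip(events, events[1:])]
    return all(g == gaps[0] for g in gaps)

print(is_fixed_gap_episode([(0, "A"), (5, "B"), (10, "C")]))  # True
print(is_fixed_gap_episode([(0, "A"), (5, "B"), (12, "C")]))  # False
```

Because each gap is a fixed constant, a rule whose consequent is such an episode pins down the exact occurrence time of every predicted event, which is what "precise-positioning" refers to.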