TY - CONF
TI - Parallel algorithms for graph optimization using tree decompositions
AU - Sullivan, B.D.
AU - Weerapurage, D.
AU - Groer, C.
AB - Although many NP-hard graph optimization problems can be solved in polynomial time on graphs of bounded tree-width, the adoption of these techniques into mainstream scientific computation has been limited due to the high memory requirements of the dynamic programming tables and excessive runtimes of sequential implementations. This work addresses both challenges by proposing a set of new parallel algorithms for all steps of a tree decomposition-based approach to solve the maximum weighted independent set problem. A hybrid OpenMP/MPI implementation includes a highly scalable parallel dynamic programming algorithm leveraging the MADNESS task-based runtime, and computational results demonstrate scaling. This work enables a significant expansion of the scale of graphs on which exact solutions to maximum weighted independent set can be obtained, and forms a framework for solving additional graph optimization problems with similar techniques.
C2 - 2013///
C3 - Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium Workshops and PhD Forum, IPDPSW 2013
DA - 2013///
DO - 10.1109/IPDPSW.2013.242
SP - 1838-1847
UR - http://www.scopus.com/inward/record.url?eid=2-s2.0-84899764118&partnerID=MN8TOARS
ER -

TY - JOUR
TI - On a conjecture of Andrica and Tomescu
AU - Sullivan, B.D.
T2 - Journal of Integer Sequences
DA - 2013///
PY - 2013///
VL - 16
IS - 3
UR - http://www.scopus.com/inward/record.url?eid=2-s2.0-84880061040&partnerID=MN8TOARS
ER -

TY - CONF
TI - Tree-Like Structure in Large Social and Information Networks
AU - Adcock, Aaron B.
AU - Sullivan, Blair D.
AU - Mahoney, Michael W.
AB - Although large social and information networks are often thought of as having hierarchical or tree-like structure, this assumption is rarely tested. We have performed a detailed empirical analysis of the tree-like properties of realistic informatics graphs using two very different notions of tree-likeness: Gromov's δ-hyperbolicity, which is a notion from geometric group theory that measures how tree-like a graph is in terms of its metric structure, and tree decompositions, tools from structural graph theory which measure how tree-like a graph is in terms of its cut structure. Although realistic informatics graphs often do not have meaningful tree-like structure when viewed with respect to the simplest and most popular metrics, e.g., the value of δ or the tree width, we conclude that many such graphs do have meaningful tree-like structure when viewed with respect to more refined metrics, e.g., a size-resolved notion of δ or a closer analysis of the tree decompositions. We also show that, although these two rigorous notions of tree-likeness capture very different tree-like structures in the worst case, for realistic informatics graphs they empirically identify surprisingly similar structure. We interpret this tree-like structure in terms of the recently-characterized "nested core-periphery" property of large informatics graphs, and we show that the fast and scalable k-core heuristic can be used to identify this tree-like structure.
C2 - 2013/12//
C3 - 2013 IEEE 13th International Conference on Data Mining
DA - 2013/12//
DO - 10.1109/icdm.2013.77
SP - 1-10
PB - IEEE
UR - http://dx.doi.org/10.1109/icdm.2013.77
ER -

TY - JOUR
TI - Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU
AU - Wu, Bo
AU - Zhao, Zhijia
AU - Zhang, Eddy Zheng
AU - Jiang, Yunlian
AU - Shen, Xipeng
T2 - ACM SIGPLAN Notices
AB - The performance of Graphics Processing Units (GPU) is sensitive to irregular memory references. Some recent work shows the promise of data reorganization for eliminating non-coalesced memory accesses that are caused by irregular references. However, all previous studies have employed simple, heuristic methods to determine the new data layouts to create. As a result, they either do not provide any performance guarantee or are effective in only some limited scenarios. This paper contributes a fundamental study of the problem. It systematically analyzes the inherent complexity of the problem in various settings, and for the first time, proves that the problem is NP-complete. It then points out the limitations of existing techniques and reveals that in practice, the essence of designing an appropriate data reorganization algorithm can be reduced to a tradeoff among space, time, and complexity. Based on that insight, it develops two new data reorganization algorithms to overcome the limitations of previous methods. Experiments show that an assembly composed of the new algorithms and a previous algorithm can circumvent the inherent complexity in finding optimal data layouts, making it feasible to minimize non-coalesced memory accesses for a variety of irregular applications and settings that are beyond the reach of existing techniques.
DA - 2013/8/23/
PY - 2013/8/23/
DO - 10.1145/2517327.2442523
VL - 48
IS - 8
SP - 57
J2 - SIGPLAN Not.
LA - en
SN - 0362-1340
UR - http://dx.doi.org/10.1145/2517327.2442523
DB - Crossref
KW - Performance
KW - Experimentation
KW - GPGPU
KW - Memory coalescing
KW - Computational complexity
KW - Thread-data remapping
KW - Runtime optimizations
KW - Data transformation
ER -

TY - CHAP
TI - Fine-Grained Treatment to Synchronizations in GPU-to-CPU Translation
AU - Guo, Ziyu
AU - Shen, Xipeng
T2 - Languages and Compilers for Parallel Computing
AB - GPU-to-CPU translation may extend Graphics Processing Units (GPU) program executions to multi-/many-core CPUs, and hence enable cross-device task migration and promote whole-system synergy. This paper describes some of our findings in the treatment of GPU synchronizations during the translation process. We show that careful dependence analysis may allow a fine-grained treatment of synchronizations and reveal redundant computation at the instruction-instance level. Based on thread-level dependence graphs, we present a method to enable such fine-grained treatment automatically. Experiments demonstrate that, compared to existing translations, the new approach can yield speedups of integer factors.
PY - 2013///
DO - 10.1007/978-3-642-36036-7_12
SP - 171-184
PB - Springer Berlin Heidelberg
SN - 9783642360350 9783642360367
UR - http://dx.doi.org/10.1007/978-3-642-36036-7_12
DB - Crossref
ER -

TY - CHAP
TI - Optimal Co-Scheduling to Minimize Makespan on Chip Multiprocessors
AU - Tian, Kai
AU - Jiang, Yunlian
AU - Shen, Xipeng
AU - Mao, Weizhen
T2 - Job Scheduling Strategies for Parallel Processing
AB - On-chip resource sharing among sibling cores causes resource contention on Chip Multiprocessors (CMP), considerably degrading program performance and system fairness. Job co-scheduling attempts to alleviate the problem by assigning jobs to cores intelligently. Despite many heuristics-based empirical explorations, studies on optimal co-scheduling and its inherent complexity started only recently, and all have concentrated on the minimization of total performance degradations. There is another important criterion for scheduling, makespan, which determines the finish time of a job set. Its importance for job co-scheduling on CMP is increasingly recognized, especially with the rise of CMP-based compute cloud, data centers, and server farms. However, optimal co-scheduling for makespan minimization still remains unexplored. This work compares the makespan minimization problem with the previously studied cost minimization (or degradation minimization) problem, revealing their connections as well as significant differences. It uncovers the computational complexity of the makespan minimization problem, and proposes a series of algorithms to either compute or approximate the optimal schedules. It proves that the problem is NP-complete in a general setting, but for a special case (dual-core without job migrations), the problem is solvable in O(n^2.5 · log n) time (n is the number of jobs). In addition, this paper presents a series of algorithms to compute or approximate the optimal schedules in the general setting. Experiments on both real and synthetic problems verify the optimality of the optimal co-scheduling algorithms, and demonstrate the reasonable accuracy and scalability of the approximation algorithms. The findings may advance the current understanding of optimal co-scheduling, facilitate the evaluation of real co-scheduling systems, and provide insights for the development of practical co-scheduling algorithms.
PY - 2013///
DO - 10.1007/978-3-642-35867-8_7
SP - 114-133
PB - Springer Berlin Heidelberg
SN - 9783642358661 9783642358678
UR - http://dx.doi.org/10.1007/978-3-642-35867-8_7
DB - Crossref
ER -

TY - JOUR
TI - HPar
AU - Zhao, Zhijia
AU - Bebenita, Michael
AU - Herman, Dave
AU - Sun, Jianhua
AU - Shen, Xipeng
T2 - ACM Transactions on Architecture and Code Optimization
AB - Parallelizing HTML parsing is challenging due to the complexities of HTML documents and the inherent dependencies in its parsing algorithm. As a result, despite numerous studies in parallel parsing, HTML parsing remains sequential today. It forms one of the final barriers for fully parallelizing browser operations to minimize the browser's response time, an important variable for user experiences, especially on portable devices. This article provides a comprehensive analysis of the special complexities of parallel HTML parsing and presents a systematic exploration in overcoming those difficulties through specially designed speculative parallelizations. This work develops, to the best of our knowledge, the first pipelining and data-level parallel HTML parsers.
The data-level parallel parser, named HPar, achieves up to 2.4× speedup on quad-core devices. This work demonstrates the feasibility of efficient, parallel HTML parsing for the first time and offers a set of novel insights for parallel HTML parsing.
DA - 2013/12/1/
PY - 2013/12/1/
DO - 10.1145/2541228.2555301
VL - 10
IS - 4
SP - 1-25
J2 - TACO
LA - en
SN - 1544-3566
UR - http://dx.doi.org/10.1145/2541228.2555301
DB - Crossref
ER -

TY - CHAP
TI - Simple Profile Rectifications Go a Long Way
AU - Wu, Bo
AU - Zhou, Mingzhou
AU - Shen, Xipeng
AU - Gao, Yaoqing
AU - Silvera, Raul
AU - Yiu, Graham
T2 - ECOOP 2013 – Object-Oriented Programming
AB - Feedback-driven program optimization (FDO) is common in modern compilers, including Just-In-Time compilers increasingly adopted for object-oriented or scripting languages. This paper describes a systematic study in understanding and alleviating the effects of sampling errors on the usefulness of the obtained profiles for FDO. Taking a statistical approach, it offers a series of counter-intuitive findings, and identifies two kinds of profile errors that affect FDO critically, namely zero-count errors and inconsistency errors. It further proposes statistical profile rectification, a simple approach to correcting profiling errors by leveraging statistical patterns in a profile. Experiments show that the simple approach enhances the effectiveness of sampled profile-based FDO dramatically, increasing the average FDO speedup from 1.16X to 1.3X, around 92% of what full profiles can yield.
PY - 2013///
DO - 10.1007/978-3-642-39038-8_27
SP - 654-678
PB - Springer Berlin Heidelberg
SN - 9783642390371 9783642390388
UR - http://dx.doi.org/10.1007/978-3-642-39038-8_27
DB - Crossref
ER -

TY - JOUR
TI - Adiabatic quantum programming: minor embedding with hard faults
AU - Klymko, Christine
AU - Sullivan, Blair D.
AU - Humble, Travis S.
T2 - Quantum Information Processing
AB - Adiabatic quantum programming defines the time-dependent mapping of a quantum algorithm into an underlying hardware or logical fabric. An essential step is embedding problem-specific information into the quantum logical fabric. We present algorithms for embedding arbitrary instances of the adiabatic quantum optimization algorithm into a square lattice of specialized unit cells. These methods extend with fabric growth while scaling linearly in time and quadratically in footprint. We also provide methods for handling hard faults in the logical fabric without invoking approximations to the original problem and illustrate their versatility through numerical studies of embeddability versus fault rates in square lattices of complete bipartite unit cells. The studies show that these algorithms are more resilient to faulty fabrics than naive embedding approaches, a feature which should prove useful in benchmarking the adiabatic quantum optimization algorithm on existing faulty hardware.
DA - 2013/11/20/
PY - 2013/11/20/
DO - 10.1007/S11128-013-0683-9
VL - 13
IS - 3
SP - 709-729
J2 - Quantum Inf Process
LA - en
SN - 1570-0755 1573-1332
UR - http://dx.doi.org/10.1007/S11128-013-0683-9
DB - Crossref
KW - Quantum computing
KW - Adiabatic quantum optimization
KW - Graph embedding
KW - Fault-tolerant computing
ER -

TY - CHAP
TI - Evaluating OpenMP Tasking at Scale for the Computation of Graph Hyperbolicity
AU - Adcock, Aaron B.
AU - Sullivan, Blair D.
AU - Hernandez, Oscar R.
AU - Mahoney, Michael W.
T2 - OpenMP in the Era of Low Power Devices and Accelerators
AB - We describe using OpenMP to compute δ-hyperbolicity, a quantity of interest in social and information network analysis, at a scale that uses up to 1000 threads. By considering both OpenMP workshare and tasking models to parallelize the computations, we find that multiple task levels permit finer-grained tasks at runtime and result in better performance at scale than worksharing constructs. We also characterize the effects of task inflation, load balancing, and scheduling overhead in this application, using both GNU and Intel compilers. Finally, we show how OpenMP 3.1 tasking clauses can be used to mitigate overheads at scale.
PY - 2013///
DO - 10.1007/978-3-642-40698-0_6
VL - 8122 LNCS
SP - 71-83
PB - Springer Berlin Heidelberg
SN - 9783642406973 9783642406980
UR - http://dx.doi.org/10.1007/978-3-642-40698-0_6
DB - Crossref
ER -

TY - CONF
TI - ProfMig: A framework for flexible migration of program profiles across software versions
AU - Zhou, Mingzhou
AU - Wu, Bo
AU - Ding, Yufei
AU - Shen, Xipeng
C3 - Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
AB - Offline program profiling is costly, especially when software updates are frequent. In this paper, we initiate a systematic exploration of cross-version program profile migration, which tries to effectively reuse the valid parts of the behavior profiles of an old software version for a new version. We explore the effects imposed on profile reusability by various factors in program behaviors, profile formats, and impact analysis, and introduce ProfMig, a framework for flexible migration of various profiles. We demonstrate the effectiveness of the techniques on migrating loop trip-count profiles and dynamic call graphs. The migration saves significant profiling time (48-67% on average) with less than 10% loss in accuracy for most programs.
DA - 2013/2//
PY - 2013/2//
DO - 10.1109/cgo.2013.6494984
ER -

TY - CONF
TI - Exploring Hybrid Memory for GPU Energy Efficiency through Software-Hardware Co-Design
AU - Wang, Bin
AU - Wu, Bo
AU - Li, Dong
AU - Shen, Xipeng
AU - Yu, Weikuan
AU - Jiao, Yizheng
AU - Vetter, Jeffrey
T2 - PACT
AB - Hybrid memory designs, such as DRAM plus Phase Change Memory (PCM), have shown some promise for alleviating power and density issues faced by traditional memory systems. But previous studies have concentrated on CPU systems with a modest level of parallelism. This work studies the problem in a massively parallel setting. Specifically, it investigates the special implications for hybrid memory imposed by the massive parallelism in GPU. It empirically shows that, contrary to promising results demonstrated for CPU, previous designs of PCM-based hybrid memory result in significant degradation to the energy efficiency of GPU. It reveals that the fundamental reason comes from a multi-facet mismatch between those designs and the massive parallelism in GPU. It presents a solution that centers around a close cooperation between compiler-directed data placement and hardware-assisted runtime adaptation. The co-design approach helps tap into the full potential of hybrid memory for GPU without requiring dramatic hardware changes over previous designs, yielding 6% and 49% energy savings on average compared to pure DRAM and pure PCM, respectively, while keeping performance loss below 2%.
C2 - 2013///
C3 - Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques
CY - Edinburgh, Scotland
DA - 2013///
PY - 2013/9/7/
DO - 10.1109/pact.2013.6618807
PB - IEEE
ER -

TY - JOUR
TI - Bayesian nonparametric models for community detection
AU - Guo, J.
AU - Nordman, D.J.
AU - Wilson, Alyson
T2 - Technometrics
AB - We propose a series of Bayesian nonparametric statistical models for community detection in graphs. We model the probability of the presence or absence of edges within the graph. Using these models, we naturally incorporate uncertainty and variability and take advantage of nonparametric techniques, such as the Chinese restaurant process and the Dirichlet process. Some of the contributions include: (a) the community structure is directly modeled without specifying the number of communities a priori; (b) the probabilities of edges within or between communities may be modeled as varying by community or pairs of communities; (c) some nodes can be classified as not belonging to any community; and (d) Bayesian model diagnostics are used to compare models and help with appropriate model selection. We start by fitting an initial model to a well-known network dataset, and we develop a series of increasingly complex models. We propose Markov chain Monte Carlo algorithms to carry out the estimation as well as an approach for community detection using the posterior distributions under a decision theoretical framework. Bayesian nonparametric techniques allow us to estimate the number and structure of communities from the data. To evaluate the proposed models for the example dataset, we discuss model comparison using the deviance information criterion and model checking using posterior predictive distributions. Supplementary materials are available online.
DA - 2013///
PY - 2013///
DO - 10.1080/00401706.2013.804438
VL - 55
IS - 4
SP - 390-402
UR - http://www.scopus.com/inward/record.url?eid=2-s2.0-84890066326&partnerID=MN8TOARS
ER -

TY - JOUR
TI - Bayesian Methods for Estimating System Reliability Using Heterogeneous Multilevel Information
AU - Guo, Jiqiang
AU - Wilson, Alyson G.
T2 - Technometrics
AB - We propose a Bayesian approach for assessing the reliability of multicomponent systems. Our models allow us to evaluate system, subsystem, and component reliability using multilevel information. Data are collected over time, and include binary, lifetime, and degradation data. We illustrate the methodology through two examples and discuss extensions. Supplementary materials are available online.
DA - 2013/11//
PY - 2013/11//
DO - 10.1080/00401706.2013.804441
VL - 55
IS - 4
SP - 461-472
SN - 1537-2723
UR - http://www.scopus.com/inward/record.url?eid=2-s2.0-84890052261&partnerID=MN8TOARS
KW - Degradation data
KW - Hierarchical model
KW - Lifetime data
KW - Multicomponent system
ER -