TY - RPRT TI - Experiences in Porting the Hubbard Model in Computational Materials Science to GPU AU - Albert, Chase AU - Paloski, Aaron AU - Shen, Xipeng AU - Walter, Eric J. AU - Zhang, Shiwei A3 - Computer Science Department, The College of William and Mary DA - 2010/// PY - 2010/// M1 - WM-CS-2010-04 M3 - Technical report PB - Computer Science Department, The College of William and Mary SN - WM-CS-2010-04 ER - TY - RPRT TI - Implementing the Dslash Operator in OpenCL AU - Kowalski, Andy AU - Shen, Xipeng A3 - Computer Science Department, The College of William and Mary DA - 2010/// PY - 2010/// M1 - WM-CS-2010-03 M3 - Technical Report PB - Computer Science Department, The College of William and Mary SN - WM-CS-2010-03 ER - TY - CHAP TI - LU Decomposition on Cell Broadband Engine: An Empirical Study to Exploit Heterogeneous Chip Multiprocessors AU - Mao, Feng AU - Shen, Xipeng T2 - Lecture Notes in Computer Science AB - To meet the needs of high performance computing, the Cell Broadband Engine owns many features that differ from traditional processors, such as the large number of synergistic processor elements, large register files, the ability to hide main-storage latency with concurrent computation and DMA transfers. The exploitation of those features requires the programmer to carefully tailor programs and simutaneously deal with various performance factors, including locality, load balance, communication overhead, and multi-level parallelism. These factors, unfortunately, are dependent on each other; an optimization that enhances one factor may degrade another. This paper presents our experience on optimizing LU decomposition, one of the commonly used algebra kernels in scientific computing, on Cell Broadband Engine. The optimizations exploit task-level, data-level, and communication-level parallelism. We study the effects of different task distribution strategies, prefetch, and software cache, and explore the tradeoff among different performance factors, stressing the interactions between different optimizations. This work offers some insights in the optimizations on heterogenous multi-core processors, including the selection of programming models, considerations in task distribution, and the holistic perspective required in optimizations. PY - 2010/// DO - 10.1007/978-3-642-15672-4_7 SP - 61-75 OP - PB - Springer Berlin Heidelberg SN - 9783642156717 9783642156724 UR - http://dx.doi.org/10.1007/978-3-642-15672-4_7 DB - Crossref ER - TY - CHAP TI - Combining Locality Analysis with Online Proactive Job Co-scheduling in Chip Multiprocessors AU - Jiang, Yunlian AU - Tian, Kai AU - Shen, Xipeng T2 - High Performance Embedded Architectures and Compilers AB - The shared-cache contention on Chip Multiprocessors causes performance degradation to applications and hurts system fairness. Many previously proposed solutions schedule programs according to runtime sampled cache performance to reduce cache contention. The strong dependence on runtime sampling inherently limits the scalability and effectiveness of those techniques. This work explores the combination of program locality analysis with job co-scheduling. The rationale is that program locality analysis typically offers a large-scope view of various facets of an application including data access patterns and cache requirement. That knowledge complements the local behaviors sampled by runtime systems. The combination offers the key to overcoming the limitations of prior co-scheduling techniques.Specifically, this work develops a lightweight locality model that enables efficient, proactive prediction of the performance of co-running processes, offering the potential for an integration in online scheduling systems. Compared to existing multicore scheduling systems, the technique reduces performance degradation by 34% (7% performance improvement) and unfairness by 47%. Its proactivity makes it resilient to the scalability issues that constraints the applicability of previous techniques. PY - 2010/// DO - 10.1007/978-3-642-11515-8_16 SP - 201-215 OP - PB - Springer Berlin Heidelberg SN - 9783642115141 9783642115158 UR - http://dx.doi.org/10.1007/978-3-642-11515-8_16 DB - Crossref ER - TY - CONF TI - Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs? AU - Zhang, Eddy Z. AU - Jiang, Yunlian AU - Shen, Xipeng T2 - the 15th ACM SIGPLAN symposium AB - Most modern Chip Multiprocessors (CMP) feature shared cache on chip. For multithreaded applications, the sharing reduces communication latency among co-running threads, but also results in cache contention. C2 - 2010/// C3 - Proceedings of the 15th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '10 DA - 2010/// DO - 10.1145/1693453.1693482 PB - ACM Press SN - 9781605587080 UR - http://dx.doi.org/10.1145/1693453.1693482 DB - Crossref KW - Shared Cache KW - Thread Scheduling KW - Parallel Program Optimizations KW - Chip Multiprocessors ER - TY - CHAP TI - Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors? AU - Jiang, Yunlian AU - Zhang, Eddy Z. AU - Tian, Kai AU - Shen, Xipeng T2 - Lecture Notes in Computer Science AB - On Chip Multiprocessors (CMP), it is common that multiple cores share certain levels of cache. The sharing increases the contention in cache and memory-to-chip bandwidth, further highlighting the importance of data locality analysis. As a rigorous and hardware-independent locality metric, reuse distance has served for a variety of locality analysis, program transformations, and performance prediction. However, previous studies have concentrated on sequential programs running on unicore processors. On CMP, accesses by different threads (or jobs) interact in the shared cache. How reuse distance applies to the new architecture remains an open question—particularly, how the interactions in shared cache affect the collection and application of reuse distance, and how reuse-distance–based locality analysis should adapt to such architecture changes. This paper presents our explorations towards answering those questions. It first introduces the concept of concurrent reuse distance, a direct extension of the traditional concept of reuse distance with data references by all co-running threads (or jobs) considered. It then discusses the properties of concurrent reuse distance, revealing the special challenges facing the collection and application of concurrent reuse distance on CMP platforms. Finally, it presents the solutions to those challenges for a class of multithreading applications. The solutions center on a probabilistic model that connects concurrent reuse distance with the data locality of each individual thread. Experiments demonstrate the effectiveness of the proposed techniques in facilitating the uses of concurrent reuse distance for CMP computing. PY - 2010/// DO - 10.1007/978-3-642-11970-5_15 SP - 264-282 OP - PB - Springer Berlin Heidelberg SN - 9783642119699 9783642119705 UR - http://dx.doi.org/10.1007/978-3-642-11970-5_15 DB - Crossref ER - TY - CONF TI - Exploiting statistical correlations for proactive prediction of program behaviors AU - Jiang, Yunlian AU - Zhang, Eddy Z. AU - Tian, Kai AU - Mao, Feng AU - Gethers, Malcom AU - Shen, Xipeng AU - Gao, Yaoqing T2 - the 8th annual IEEE/ ACM international symposium AB - This paper presents a finding and a technique on program behavior prediction. The finding is that surprisingly strong statistical correlations exist among the behaviors of different program components (e.g., loops) and among different types of program level behaviors (e.g., loop trip-counts versus data values). Furthermore, the correlations can be beneficially exploited: They help resolve the proactivity-adaptivity dilemma faced by existing program behavior predictions, making it possible to gain the strengths of both approaches--the large scope and earliness of offline-profiling--based predictions, and the cross-input adaptivity of runtime sampling-based predictions. C2 - 2010/// C3 - Proceedings of the 8th annual IEEE/ ACM international symposium on Code generation and optimization - CGO '10 DA - 2010/// DO - 10.1145/1772954.1772989 PB - ACM Press SN - 9781605586359 UR - http://dx.doi.org/10.1145/1772954.1772989 DB - Crossref ER - TY - CONF TI - Streamlining GPU applications on the fly AU - Zhang, Eddy Z. AU - Jiang, Yunlian AU - Guo, Ziyu AU - Shen, Xipeng T2 - the 24th ACM International Conference AB - Because of their tremendous computing power and remarkable cost efficiency, GPUs (graphic processing unit) have quickly emerged as a kind of influential platform for high performance computing. However, as GPUs are designed for massive data-parallel computing, their performance is subject to the presence of condition statements in a GPU application. On a conditional branch where threads diverge in which path to take, the threads taking different paths have to run serially. Such divergences often cause serious performance degradations, impairing the adoption of GPU for many applications that contain non-trivial branches or certain types of loops. C2 - 2010/// C3 - Proceedings of the 24th ACM International Conference on Supercomputing - ICS '10 DA - 2010/// DO - 10.1145/1810085.1810104 PB - ACM Press SN - 9781450300186 UR - http://dx.doi.org/10.1145/1810085.1810104 DB - Crossref ER - TY - CONF TI - An input-centric paradigm for program dynamic optimizations AU - Tian, Kai AU - Jiang, Yunlian AU - Zhang, Eddy Z. AU - Shen, Xipeng T2 - the ACM international conference AB - Accurately predicting program behaviors (e.g., locality, dependency, method calling frequency) is fundamental for program optimizations and runtime adaptations. Despite decades of remarkable progress, prior studies have not systematically exploited program inputs, a deciding factor for program behaviors.Triggered by the strong and predictive correlations between program inputs and behaviors that recent studies have uncovered, this work proposes to include program inputs into the focus of program behavior analysis, cultivating a new paradigm named input-centric program behavior analysis. This new approach consists of three components, forming a three-layer pyramid. At the base is program input characterization, a component for resolving the complexity in program raw inputs and the extraction of important features. In the middle is input-behavior modeling, a component for recognizing and modeling the correlations between characterized input features and program behaviors. These two components constitute input-centric program behavior analysis, which (ideally) is able to predict the large-scope behaviors of a program's execution as soon as the execution starts. The top layer of the pyramid is input-centric adaptation, which capitalizes on the novel opportunities that the first two components create to facilitate proactive adaptation for program optimizations.By centering on program inputs, the new approach resolves a proactivity-adaptivity dilemma inherent in previous techniques. Its benefits are demonstrated through proactive dynamic optimizations and version selection, yielding significant performance improvement on a set of Java and C programs. C2 - 2010/// C3 - Proceedings of the ACM international conference on Object oriented programming systems languages and applications - OOPSLA '10 DA - 2010/// DO - 10.1145/1869459.1869471 PB - ACM Press SN - 9781450302036 UR - http://dx.doi.org/10.1145/1869459.1869471 DB - Crossref ER - TY - JOUR TI - Counting paths in digraphs AU - Seymour, Paul AU - Sullivan, Blair D. T2 - European Journal of Combinatorics AB - Say a digraph is k-free if it has no directed cycles of length at most k, for positive integers k. Thomasse conjectured that the number of induced 3-vertex directed paths in a simple 2-free digraph on n vertices is at most (n-1)n(n+1)/15. We present an unpublished result of Bondy proving that there are at most 2n^3/25 such paths, and prove that for the class of circular interval digraphs, an upper bound of n^3/16 holds. We also study the problem of bounding the number of (non-induced) 4-vertex paths in 3-free digraphs. We show an upper bound of 4n^4/75 using Bondy's result for Thomasse's conjecture. DA - 2010/4// PY - 2010/4// DO - 10.1016/j.ejc.2009.05.008 VL - 31 IS - 3 SP - 961-975 J2 - European Journal of Combinatorics LA - en OP - SN - 0195-6698 UR - http://dx.doi.org/10.1016/j.ejc.2009.05.008 DB - Crossref ER - TY - JOUR TI - Comment AU - Wilson, A.G. AU - Anderson-Cook, C.M. T2 - Technometrics DA - 2010/// PY - 2010/// DO - 10.1198/TECH.2010.09178 VL - 52 IS - 4 SP - 397-400 UR - http://www.scopus.com/inward/record.url?eid=2-s2.0-78249237164&partnerID=MN8TOARS ER -