Works (9)

Updated: July 5th, 2023 15:34

2019 journal article

Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution

ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 16(3).

By: Z. Lin, H. Dai, M. Mantor & H. Zhou

author keywords: GPGPU; TLP; bandwidth management; concurrent kernel execution
TL;DR: A coordinated approach for CTA combination and bandwidth partitioning that dynamically detects co-running kernels as latency-sensitive or bandwidth-intensive, allocating more CTA resources to latency-sensitive kernels and more NoC/DRAM bandwidth to NoC-/DRAM-intensive kernels. (via Semantic Scholar)
Sources: Web Of Science, ORCID
Added: December 2, 2019
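
The detect-then-partition idea summarized in the TL;DR can be sketched as a toy heuristic. The threshold, the 3:1 split, and all names below are illustrative assumptions, not the paper's actual policy:

```python
def classify_kernel(bw_util, threshold=0.5):
    """Classify a co-running kernel by its NoC/DRAM bandwidth utilization
    (0..1). The 0.5 threshold is illustrative."""
    return "bandwidth" if bw_util >= threshold else "latency"


def partition(kernel_a_util, kernel_b_util, total_ctas=32):
    """Skew CTA slots toward the latency-sensitive kernel; the
    bandwidth-intensive kernel would instead receive the larger
    NoC/DRAM bandwidth share. Returns (ctas_a, ctas_b)."""
    a = classify_kernel(kernel_a_util)
    b = classify_kernel(kernel_b_util)
    if a == b:                        # same class: split evenly
        return (total_ctas // 2, total_ctas // 2)
    if a == "latency":                # 3:1 split, illustrative ratio
        return (total_ctas * 3 // 4, total_ctas // 4)
    return (total_ctas // 4, total_ctas * 3 // 4)
```

A dynamic scheme would re-run the classification periodically as the kernels' phase behavior changes.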

2019 conference paper

Exploring Memory Persistency Models for GPUs

28th International Conference on Parallel Architectures and Compilation Techniques (PACT), 310–322.

By: Z. Lin, M. Alshboul, Y. Solihin & H. Zhou

Event: International Conference on Parallel Architectures and Compilation Techniques at Seattle, WA on September 21-25, 2019

TL;DR: This paper adapts, re-architects, and optimizes CPU persistency models for GPUs, designs a pragma-based compiler scheme for expressing persistency models for GPUs, and identifies that the thread hierarchy in GPUs offers intuitive scopes to form epochs and durable transactions. (via Semantic Scholar)
Sources: Web Of Science, ORCID
Added: August 10, 2020
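
The epoch idea behind memory persistency can be illustrated with a toy model: writes within an epoch may persist in any order, but a barrier guarantees all earlier writes are durable before any later one. The class and method names are illustrative, not the paper's API:

```python
class EpochPersistLog:
    """Toy model of epoch persistency. Writes inside one epoch are
    unordered with respect to each other; barrier() closes the epoch,
    so everything before it persists before anything after it."""

    def __init__(self):
        self.epochs = [[]]            # list of epochs, each a list of writes

    def persist_write(self, addr, value):
        self.epochs[-1].append((addr, value))

    def barrier(self):
        # close the current epoch; later writes start a new one
        self.epochs.append([])

    def crash_recover(self, completed_epochs):
        """After a crash, only epochs up to the last completed barrier
        are guaranteed durable; replay them to rebuild state."""
        state = {}
        for epoch in self.epochs[:completed_epochs]:
            for addr, value in epoch:
                state[addr] = value
        return state
```

The paper's observation is that GPU scopes (thread block, grid) give natural boundaries for such epochs; this sketch models only the ordering contract.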

2019 article

Scatter-and-Gather Revisited: High-Performance Side-Channel-Resistant AES on GPUs

12TH WORKSHOP ON GENERAL PURPOSE PROCESSING USING GPUS (GPGPU 12), pp. 2–11.

By: Z. Lin, U. Mathur & H. Zhou

TL;DR: This paper revisits the scatter-and-gather (SG) approach and makes a case for using this approach to implement table-based cryptographic algorithms on GPUs to achieve both high performance and strong resistance to side channel attacks. (via Semantic Scholar)
Sources: Web Of Science, ORCID
Added: July 22, 2019
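
The core of the scatter-and-gather idea is that a table is scattered in memory so that the storage locations (banks/lines) touched by a lookup are the same for every index, hiding the secret-dependent access pattern. Below is a much-simplified sketch, byte-slicing a 256-entry table across per-byte "banks"; the paper's GPU shared-memory layout and key-dependent scattering differ:

```python
WORD = 4        # bytes per table entry
N = 256         # table entries (as in an AES T-table)


def scatter(table):
    """Scatter the table so byte b of every entry lives in bank b.
    Any lookup then reads exactly one byte from each of banks 0..3,
    independent of the (secret) index."""
    banks = [[0] * N for _ in range(WORD)]
    for i, entry in enumerate(table):
        for b in range(WORD):
            banks[b][i] = (entry >> (8 * b)) & 0xFF
    return banks


def gather(banks, i):
    """Reassemble entry i from one byte per bank."""
    return sum(banks[b][i] << (8 * b) for b in range(WORD))


def banks_touched(i):
    # banks read for index i -- constant by construction
    return list(range(WORD))
```

The property a side-channel adversary observes (which banks are accessed) is now index-independent, at the cost of the gather step per lookup.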

2018 conference paper

Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls

2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

By: H. Dai, Z. Lin, C. Li, C. Zhao, F. Wang, N. Zheng, H. Zhou

TL;DR: The proposed schemes significantly improve the weighted speedup of two state-of-the-art intra-SM sharing schemes, Warped-Slicer and SMK, by 24.6% and 27.2% on average, respectively, with lightweight hardware overhead. (via Semantic Scholar)
Sources: ORCID, Web Of Science
Added: September 22, 2019

2018 journal article

GPU Performance vs. Thread-Level Parallelism: Scalability Analysis and a Novel Way to Improve TLP

ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 15(1).

By: Z. Lin, M. Mantor & H. Zhou

author keywords: GPGPU; TLP; context switching; latency hiding
TL;DR: A novel scalability analysis from the perspective of throughput utilization of various GPU components, including off-chip DRAM, multiple levels of caches, and the interconnect between L1 D-caches and L2 partitions, shows that the interconnect bandwidth is a critical bound on GPU performance scalability. (via Semantic Scholar)
Sources: Web Of Science, ORCID
Added: August 6, 2018

2016 conference paper

Enabling efficient preemption for SIMT architectures with lightweight context switching

SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 898–908.

By: Z. Lin, L. Nyland & H. Zhou

TL;DR: This paper proposes three complementary ways to reduce and compress architectural state, achieving lightweight context switching on SIMT processors through compiler and hardware co-design. (via Semantic Scholar)
Sources: NC State University Libraries, ORCID
Added: August 6, 2018
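
One way to shrink the context that must be saved at a preemption point is to skip registers the compiler knows are dead. The sketch below is an illustrative model of that liveness-based idea, not the paper's mechanism:

```python
def compress_context(registers, live_mask):
    """Save only the live registers plus the mask needed to restore.
    `live_mask` bit i set means register i is live at the preemption
    point; dead registers need not be saved at all."""
    saved = [(i, v) for i, v in enumerate(registers) if (live_mask >> i) & 1]
    return live_mask, saved


def restore_context(num_regs, live_mask, saved, fill=0):
    """Rebuild the register file; dead registers get a don't-care fill
    value, since no correct program reads them before writing them."""
    regs = [fill] * num_regs
    for i, v in saved:
        regs[i] = v
    return regs
```

With, say, only 2 of 4 registers live, the saved state halves; across a full SIMT register file the savings compound over every thread.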

2015 conference paper

Automatic data placement into GPU on-chip memory resources

2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 23–33.

By: C. Li, Y. Yang, Z. Lin & H. Zhou

TL;DR: This paper focuses on programs that have already been reasonably optimized, either manually by programmers or automatically by compiler tools; the proposed compiler algorithms refine these programs by revising data placement across different types of GPU on-chip resources to achieve both performance enhancement and performance portability. (via Semantic Scholar)
Sources: NC State University Libraries, ORCID
Added: August 6, 2018

2015 chapter

GLES: A Practical GPGPU Optimizing Compiler Using Data Sharing and Thread Coarsening

In Languages and Compilers for Parallel Computing.

Zhen Lin

author keywords: GPGPU; Optimization; Compiler
TL;DR: This paper presents GLES, an optimizing compiler for GPGPU programs, which proposes two optimization techniques based on divergence analysis: data sharing optimization for data reuse and bandwidth enhancement and thread granularity coarsening for reducing redundant instructions. (via Semantic Scholar)
Source: ORCID
Added: September 22, 2019
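
Thread granularity coarsening, one of the two GLES optimizations, merges the work of several fine-grained threads into one so that per-thread setup instructions run once per merged group instead of once per item. A minimal host-side model of the effect (names and the "setup" placeholder are illustrative):

```python
def coarsen(work_items, factor=2):
    """Group `factor` adjacent work-items into one coarsened thread."""
    return [work_items[base:base + factor]
            for base in range(0, len(work_items), factor)]


def run_coarsened(xs, factor=2):
    """Each coarsened thread produces several outputs; the redundant
    per-thread setup (modeled here as computing a shared offset) is
    paid once per group, not once per element."""
    out = [0] * len(xs)
    setup_count = 0
    for group in coarsen(list(range(len(xs))), factor):
        offset = 3                # stand-in for redundant per-thread setup
        setup_count += 1
        for i in group:
            out[i] = xs[i] + offset
    return out, setup_count
```

On a GPU the trade-off is fewer threads (less latency-hiding TLP) against fewer redundant instructions, which is why divergence analysis guides where coarsening is safe and profitable.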

2014 conference paper

Implementation and evaluation of deep neural networks (DNN) on mainstream heterogeneous systems

Proceedings of 5th Asia-Pacific Workshop on Systems - APSys '14.

Zhen Lin

TL;DR: This paper implements two well-known DNN kernels, Multi-Layer Perceptron (MLP) and Autoencoder, on various GPUs and APUs from mainstream processor manufacturers, conducts bottleneck analysis, and presents optimization techniques for the various platforms. (via Semantic Scholar)
Source: ORCID
Added: September 22, 2019
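
For reference, the MLP kernel studied in the paper is just a stack of dense layers with a nonlinearity. A minimal pure-Python forward pass (sigmoid activation assumed; real implementations use batched matrix multiplies on the GPU/APU):

```python
import math


def mlp_forward(x, weights, biases):
    """Forward pass through dense layers: a = sigmoid(W a + b) per layer.
    `weights` is a list of row-major matrices, `biases` a matching list
    of vectors. Pure-Python sketch of the kernel's structure only."""
    a = x
    for W, b in zip(weights, biases):
        z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
             for row, b_i in zip(W, b)]
        a = [1.0 / (1.0 + math.exp(-v)) for v in z]
    return a
```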
