2019 journal article
Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution
ACM Transactions on Architecture and Code Optimization, 16(3).
2019 conference paper
Exploring Memory Persistency Models for GPUs
28th International Conference on Parallel Architectures and Compilation Techniques (PACT), 310–322.
Event: International Conference on Parallel Architectures and Compilation Techniques at Seattle, WA on September 21-25, 2019
Scatter-and-Gather Revisited: High-Performance Side-Channel-Resistant AES on GPUs
12th Workshop on General Purpose Processing Using GPUs (GPGPU 12), 2–11.
2018 conference paper
Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls
2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).
2018 journal article
GPU Performance vs. Thread-Level Parallelism: Scalability Analysis and a Novel Way to Improve TLP
ACM Transactions on Architecture and Code Optimization, 15(1).
2016 conference paper
Enabling efficient preemption for SIMT architectures with lightweight context switching
SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 898–908.
2015 conference paper
Automatic data placement into GPU on-chip memory resources
2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 23–33.
GLES: A Practical GPGPU Optimizing Compiler Using Data Sharing and Thread Coarsening
In Languages and Compilers for Parallel Computing.
2014 conference paper
Implementation and evaluation of deep neural networks (DNN) on mainstream heterogeneous systems
Proceedings of the 5th Asia-Pacific Workshop on Systems (APSys '14).