Hongwen Dai

Lin, Z., Dai, H., Mantor, M., & Zhou, H. (2019). Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution. ACM Transactions on Architecture and Code Optimization, 16(3). https://doi.org/10.1145/3326124

Dai, H., Li, C., Zhou, H., Gupta, S., Kartsaklis, C., & Mantor, M. (2016). A Model-Driven Approach to Warp/Thread-Block Level GPU Cache Bypassing. 2016 ACM/EDAC/IEEE Design Automation Conference (DAC). https://doi.org/10.1145/2897937.2897966

Mayank, K., Dai, H. W., Wei, J. Z., & Huiyang. (2015). Analyzing graphics processor unit (GPU) instruction set architectures. IEEE International Symposium on Performance Analysis of Systems and, 155–156. https://doi.org/10.1109/ispass.2015.7095794

Li, C., Yang, Y., Dai, H. W., Yan, S. G., Mueller, F., & Zhou, H. Y. (2014). Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs. IEEE International Symposium on Performance Analysis of Systems and, 231–241.