2024 article

Predicting GPUDirect Benefits for HPC Workloads

2024 32ND EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING, PDP 2024, pp. 88–97.

author keywords: performance prediction; simulator; GPGPU; heterogeneous architectures
Source: Web Of Science
Added: July 8, 2024

Graphics processing units (GPUs) are becoming increasingly popular in modern HPC systems. Hardware for data movement to and from GPUs, such as NVLink and GPUDirect, has reduced latencies, increased throughput, and eliminated redundant copies. In this work, we use discrete event simulations to explore the impact of different communication paradigms on the messaging performance of scientific applications running on multi-GPU nodes. First, we extend an existing simulation framework to model data movement on GPU-based clusters. Second, we implement support for the simulation of communication paradigms such as GPUDirect with point-to-point messages and collectives. Finally, we study the impact of different parameters on communication performance, such as the number of GPUs per node and the use of GPUDirect. We validate the framework and then simulate traces from GPU-enabled applications on a fat-tree based cluster. Simulation results uncover strengths but also weaknesses of GPUDirect, depending on the application and its usage of communication primitives.
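To illustrate the kind of cost model such a discrete event simulation might apply to each message, the sketch below compares a GPU-to-GPU transfer with and without GPUDirect using a simple latency-plus-bandwidth (alpha-beta) model. All constants and function names are illustrative assumptions, not values or code from the paper; without GPUDirect, the model charges two extra host-staging copies over PCIe.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bps):
    """Linear cost model for one transfer leg: alpha + size / beta."""
    return latency_s + size_bytes / bandwidth_bps

def gpu_message_time(size_bytes, gpudirect):
    """Illustrative end-to-end time for a GPU-to-GPU message.

    Parameters (latency, bandwidth) are made-up ballpark figures,
    not measurements from the paper.
    """
    # Network leg between the two nodes (or NVLink between GPUs).
    net = transfer_time(size_bytes, 2e-6, 12e9)
    if gpudirect:
        # GPUDirect: NIC reads/writes GPU memory directly, no staging.
        return net
    # Without GPUDirect: stage through host memory on both ends,
    # adding one GPU<->host PCIe copy per side.
    pcie = transfer_time(size_bytes, 5e-6, 10e9)
    return pcie + net + pcie

msg = 1 << 20  # 1 MiB message
print(gpu_message_time(msg, gpudirect=True))   # direct path
print(gpu_message_time(msg, gpudirect=False))  # host-staged path
```

Under this toy model the direct path is always cheaper; the paper's point is that in real applications the benefit depends on message sizes and on how the application mixes point-to-point and collective operations.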