2022 article
BALA-CPD: BALanced and Asynchronous Distributed Tensor Decomposition
2022 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2022), pp. 440–450.
Tensor decomposition is widely used in machine learning, recommendation systems, and social network analysis. Large real-world tensors require parallel algorithms running on distributed-memory systems. Such algorithms suffer from two major performance bottlenecks, load imbalance and communication cost, which are difficult to overcome because of the inherent tradeoffs among the multiple types of computation and communication, especially for irregular sparse tensors. Previous work predominantly focuses on balancing the load within the tensor-related computation, leaving the multiple matrix-only computations imbalanced and increasing communication costs. It also relies heavily on collective communication operations and bulk-synchronous execution, interleaving stages of global communication with stages of local computation, and therefore fails to hide the communication cost. In this paper, we present BALA-CPD, a novel algorithm that achieves the best overall workload balance and effectively overlaps communication with computation for the popular distributed Canonical Polyadic Decomposition (CPD) algorithms. BALA-CPD uses a workload and data partitioning scheme that prioritizes load balance for all the matrix-only computations and all the communications; when necessary, it adjusts to mitigate load imbalance in the tensor-related computation. Departing from bulk-synchronous approaches, BALA-CPD breaks down the computation and communication of consecutive stages, and masks the communication costs through a combination of one-sided asynchronous communication and a fine-grained interleaving of communication and computation. We implement BALA-CPD and evaluate it on a 64-node cluster with 1280 processors. Experimental results show that BALA-CPD is scalable and outperforms state-of-the-art distributed implementations by up to 1.8× on 1280 processors.
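For context, the distinction the abstract draws between tensor-related and matrix-only computation follows from the standard CP-ALS formulation (background material, not taken from the paper). A rank-$R$ CPD approximates a third-order tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$ and updates each factor matrix in turn:

\[
\mathcal{X} \;\approx\; \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r,
\qquad
\mathbf{A} \;\leftarrow\; \mathbf{X}_{(1)}\,(\mathbf{C} \odot \mathbf{B})\,
\bigl(\mathbf{C}^{\top}\mathbf{C} \ast \mathbf{B}^{\top}\mathbf{B}\bigr)^{\dagger},
\]

where $\odot$ is the Khatri–Rao product and $\ast$ the Hadamard product, with analogous updates for $\mathbf{B}$ and $\mathbf{C}$. The sparse MTTKRP $\mathbf{X}_{(1)}(\mathbf{C} \odot \mathbf{B})$ is the tensor-related computation, while the Gram products $\mathbf{C}^{\top}\mathbf{C}$, $\mathbf{B}^{\top}\mathbf{B}$ and the pseudoinverse are the matrix-only computations whose balance BALA-CPD prioritizes.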
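To make the overlap idea concrete, the following is a minimal C/MPI sketch of hiding a one-sided asynchronous transfer behind local computation, the general mechanism the abstract describes. It is not the authors' implementation; the ring exchange pattern, the buffer size N, and the data values are illustrative assumptions.

/* Sketch: overlap an MPI one-sided Put with local computation.
 * Compile with: mpicc overlap.c -o overlap */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024  /* illustrative row-block size */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *win_buf;  /* window memory: receives a neighbor's rows */
    MPI_Win win;
    MPI_Win_allocate((MPI_Aint)(N * sizeof(double)), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win_buf, &win);

    double *send = malloc(N * sizeof *send);
    for (int i = 0; i < N; i++) send[i] = rank + 0.001 * i;

    int right = (rank + 1) % size;  /* simple ring, for illustration */

    /* Passive-target epoch: the target rank does not participate,
     * so the transfer proceeds asynchronously. */
    MPI_Win_lock(MPI_LOCK_SHARED, right, 0, win);
    MPI_Put(send, N, MPI_DOUBLE, right, 0, N, MPI_DOUBLE, win);

    /* Local computation runs while the Put is in flight,
     * masking the communication cost. */
    double acc = 0.0;
    for (int i = 0; i < N; i++) acc += send[i] * send[i];

    MPI_Win_unlock(right, win);   /* completes the Put at the target */
    MPI_Barrier(MPI_COMM_WORLD);  /* all transfers have landed       */

    printf("rank %d: local result %.3f, first received value %.3f\n",
           rank, acc, win_buf[0]);

    free(send);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

In a bulk-synchronous design, the exchange would instead be a collective (e.g., an allgather of factor-matrix rows) that blocks all ranks before any local work can start; the passive-target epoch above is what lets communication and computation proceed concurrently.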