An Efficient Out-of-Core Implementation of Block Cholesky Decomposition on a Multi-GPU System

Lin Cheng, Hyunsu Cho, Peter Yoon, and Jiajia Zhao


Keywords: Cholesky decomposition, general-purpose GPU computing, image segmentation


The Cholesky decomposition is one of the most efficient preconditioners for iterative schemes, such as the conjugate gradient method, that solve linear systems. However, a linear system often exceeds the capacity of available memory. In this paper we present an efficient out-of-core implementation of the block Cholesky decomposition on a multi-GPU system that can handle linear systems of arbitrary size. Our implementation exploits, in a streamlined fashion, three memory systems: GPU memory, CPU host memory, and virtual memory on disk. We also demonstrate that memory traffic reduction, efficient data allocation, and task overlapping are critical to optimizing performance. Our experiments show that our implementation outperforms a multi-core CPU version by at least a factor of 30 for large matrices. We have also successfully applied our work to image segmentation.
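
To make the term "block Cholesky decomposition" concrete, the following is a minimal sketch of the standard right-looking blocked algorithm, written in plain Python and run entirely in memory. It is not the implementation described in the paper: the function names (`block_cholesky`, `factor_diagonal_block`, `panel_solve`, `trailing_update`) and the `block` parameter are hypothetical, and the GPU kernels, disk I/O, and task overlapping mentioned above are omitted. The point it illustrates is that each step touches only a few blocks at a time, which is the property an out-of-core scheme can exploit when moving blocks between disk, host memory, and GPU memory.

```python
import math


def factor_diagonal_block(a, lo, hi):
    """Unblocked Cholesky of the diagonal block a[lo:hi][lo:hi], in place."""
    for j in range(lo, hi):
        s = a[j][j] - sum(a[j][p] ** 2 for p in range(lo, j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        a[j][j] = math.sqrt(s)
        for i in range(j + 1, hi):
            t = a[i][j] - sum(a[i][p] * a[j][p] for p in range(lo, j))
            a[i][j] = t / a[j][j]


def panel_solve(a, lo, hi, n):
    """Solve X * L_kk^T = A_ik for the rows below the diagonal block."""
    for i in range(hi, n):
        for j in range(lo, hi):
            t = a[i][j] - sum(a[i][p] * a[j][p] for p in range(lo, j))
            a[i][j] = t / a[j][j]


def trailing_update(a, lo, hi, n):
    """Subtract the panel's contribution from the remaining lower triangle."""
    for i in range(hi, n):
        for j in range(hi, i + 1):
            a[i][j] -= sum(a[i][p] * a[j][p] for p in range(lo, hi))


def block_cholesky(matrix, block):
    """Return lower-triangular L with L * L^T == matrix (assumed SPD).

    `block` is the block width; an out-of-core variant would load only
    the blocks touched by each step instead of holding `a` in memory.
    """
    n = len(matrix)
    a = [list(map(float, row)) for row in matrix]  # work on a copy
    for lo in range(0, n, block):
        hi = min(lo + block, n)
        factor_diagonal_block(a, lo, hi)
        panel_solve(a, lo, hi, n)
        trailing_update(a, lo, hi, n)
    # Zero the strictly upper triangle, which was never updated.
    for i in range(n):
        for j in range(i + 1, n):
            a[i][j] = 0.0
    return a
```

For example, `block_cholesky([[4, 12, -16], [12, 37, -43], [-16, -43, 98]], 2)` factors a small symmetric positive definite matrix using one 2-wide block followed by one 1-wide block. In a real multi-GPU setting the three helper steps would typically map to library routines for Cholesky factorization, triangular solve, and symmetric rank-k update, but that mapping is an assumption here rather than something the abstract states.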
