Publications

Scalable parallel file write from a large NUMA system

Abstract

As the number of sockets and cores within a cache-coherent NUMA (Non-Uniform Memory Access) system increases, parallel file I/O performance is increasingly affected by processor affinity and synchronization overhead within the system. In this paper, we report on the performance impact of processor affinity and page caching overhead on parallel write operations to a single shared file on a large NUMA system. We ran experiments on two configurations of an HPE Superdome Flex system, one with 12 sockets and the other with 32 sockets, both with 24 cores per socket, using OpenMPI and the Lustre parallel file system. Our results show that processor affinity and page caching overhead can cause large performance variation depending on the number of MPI processes. We observe that page caching benefits parallel file writes with a small number of MPI processes, but it does not scale on a large NUMA system. Parallel file writes without page caching are scalable, but achieve higher performance only with a large number of MPI processes.
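The access pattern studied here, many processes writing disjoint regions of a single shared file at rank-specific offsets, can be sketched as follows. This is only an illustrative stand-in, not the paper's code: threads substitute for MPI ranks, and the file name, block size, and rank count are made up for the example.

```python
import os
import threading

NUM_RANKS = 8               # stands in for the number of MPI processes (illustrative)
BLOCK = 4096                # bytes written per "rank" (illustrative)
PATH = "shared_output.bin"  # hypothetical file name, not from the paper

# Preallocate the shared file so every writer has a valid region.
with open(PATH, "wb") as f:
    f.truncate(NUM_RANKS * BLOCK)

fd = os.open(PATH, os.O_WRONLY)

def write_region(rank: int) -> None:
    # Each rank writes its own disjoint block at a rank-specific offset,
    # so no two writers overlap -- the same layout used when MPI processes
    # write independently to a single shared file.
    data = bytes([rank % 256]) * BLOCK
    os.pwrite(fd, data, rank * BLOCK)  # positioned write, no shared file pointer

threads = [threading.Thread(target=write_region, args=(r,)) for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
os.close(fd)
```

In the paper's setting the writers are MPI processes on a large NUMA machine writing through (or, without page caching, around) the kernel page cache to a Lustre file system; this sketch shows only the disjoint-offset layout, not the caching behavior being measured.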

Date
December 5, 2025
Authors
Dong In D Kang, John Paul Walters, Stephen P Crago
Journal
HPEC