SWIFT: Using task-based parallelism, fully asynchronous communication, and graph partition-based domain decomposition for strong scaling on more than 100,000 cores.
Abstract
We present a new open-source cosmological code, called \swift, designed to solve the equations of hydrodynamics using a particle-based approach (Smoothed Particle Hydrodynamics) on hybrid shared/distributed-memory architectures. \swift was designed from the bottom up to provide excellent {\em strong scaling} on both commodity clusters (Tier-2 systems) and Top100 supercomputers (Tier-0 systems), without relying on architecture-specific features or specialized accelerator hardware. This performance is due to three main computational approaches:
\begin{itemize}
\item \textbf{Task-based parallelism} for shared-memory parallelism, which provides fine-grained load balancing and thus strong scaling on large numbers of cores.
\item \textbf{Graph-based domain decomposition}, which uses the task graph to decompose the simulation domain such that the {\em work}, as opposed to just the {\em data} as in most partitioning schemes, is equally distributed across all nodes.
\item \textbf{Fully dynamic and asynchronous communication}, in which communication is modelled as just another task in the task-based scheme, sending data whenever it is ready and deferring the tasks that rely on data from other nodes until that data arrives.
\end{itemize}
In order to use these approaches, the code had to be rewritten from scratch and its algorithms adapted to the task-based paradigm. As a result, we can show upwards of 60\% parallel efficiency for moderate-sized problems when increasing the number of cores 512-fold, on both x86-based and Power8-based architectures.