Software approaches for resilience of high performance computing systems
High-performance computing (HPC) systems are critical to advancing scientific discovery and innovation in a variety of fields. However, as HPC systems become larger and more complex, they are also exposed to more frequent and diverse faults that can impair performance and correctness. How can we ensure that HPC systems still run parallel programs correctly and efficiently when faults occur? A research team from Beihang University addressed this problem in a study published in Frontiers of Computer Science.
The team conducted a comprehensive and systematic survey of existing software resilience approaches for HPC systems. They classified these approaches into five categories: checkpointing, replication, soft error resilience, algorithm-based fault tolerance (ABFT), and fault detection and prediction. They summarized the main techniques and systems in each category and discussed their advantages and limitations.
In addition, they identified challenges facing recently developed software resilience approaches for HPC systems, mainly in terms of scalability and support for heterogeneous architectures. They also highlighted emerging fail-slow faults, in which a component keeps running but at degraded performance, as a problem that requires more attention in the future.
The paper aims to help researchers understand the progress and overall picture of HPC software resilience. It also provides some insights and directions for future research in this area.
More information:
Jie Jia et al, Software approaches for resilience of high performance computing systems: a survey, Frontiers of Computer Science (2022). DOI: 10.1007/s11704-022-2096-3
Provided by Frontiers Journals