2017 conference paper

Characterizing the impact of soft errors across microarchitectural structures and implications for predictability

Proceedings of the 2017 IEEE International Symposium on Workload Characterization (IISWC), 250–260.

By: B. Wibowo, A. Agrawal & J. Tuck

Source: NC State University Libraries
Added: August 6, 2018

Transistor sizes continue to shrink and system complexity continues to grow. As a result, soft errors in the system, including the processor core, are predicted to become one of the major reliability challenges. A fraction of soft errors at the device level can become unmasked errors visible to the user. An unmasked soft error may manifest as a detectable error, which may be recoverable (DRE) or unrecoverable (DUE), or as a Silent Data Corruption (SDC). Detecting and recovering from an SDC is especially challenging because an explicit checker is needed to detect the erroneous state. Predicting when SDCs are more likely could be valuable in designing resilient systems. To gain insight, we evaluate the Architectural Vulnerability Factor (AVF) of all major in-core memory structures of an out-of-order superscalar processor. In particular, we focus on the vulnerability factors for detectable but unrecoverable errors (DUEAVF) and silent data corruptions (SDCAVF) across windows of execution, studying their characteristics, their time-varying behavior, and their predictability using a linear regression trained offline. We perform more than 35 million microarchitectural fault injection simulations and, where necessary, run them to completion using functional simulation to determine AVF, DUEAVF, and SDCAVF. Our study shows that, like AVF, DUEAVF and SDCAVF vary over time and across applications. We also find significant differences in DUEAVF and SDCAVF across the processor structures we studied. Furthermore, we find that DUEAVF can be predicted using a linear regression with accuracy similar to that of AVF estimation. However, SDCAVF could not be predicted with the same level of accuracy. As a remedy, we propose adding a software vulnerability factor, in the form of SDCPVF, to the linear regression model for estimating SDCAVF. We find that the SDCPVF of the Architectural Register File explains most of the behavior of SDCAVF for the combined microarchitectural structures studied in this paper. Our evaluation shows that adding SDCPVF improves the accuracy by 5.19×, on average, to a level similar to that of the DUEAVF and AVF estimates. We also evaluate the impact of limiting software-layer reliability information to only 5 basic blocks (a 16× cost reduction, on average), and observe that it increases error by only 18.7%, on average.
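The relationship between fault injection outcomes and the three vulnerability factors named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' tooling. It assumes each factor is estimated as the fraction of injected faults in one execution window that end in the corresponding outcome class, and that AVF counts every unmasked outcome (DRE, DUE, or SDC). The outcome labels, the function name `vulnerability_factors`, and the sample window are hypothetical and carry no data from the paper.

```python
from collections import Counter

# Hypothetical outcome labels for a single fault injection run.
MASKED, DRE, DUE, SDC = "masked", "dre", "due", "sdc"


def vulnerability_factors(outcomes):
    """Estimate AVF, DUEAVF, and SDCAVF for one execution window.

    Assumption: each factor is the fraction of injected faults that
    ended in the corresponding outcome class, and AVF counts every
    unmasked outcome (DRE, DUE, or SDC).
    """
    total = len(outcomes)
    if total == 0:
        raise ValueError("need at least one injection outcome")
    counts = Counter(outcomes)
    unmasked = total - counts[MASKED]
    return {
        "AVF": unmasked / total,
        "DUEAVF": counts[DUE] / total,
        "SDCAVF": counts[SDC] / total,
    }


# Synthetic window of ten injections, for illustration only.
window = [MASKED] * 6 + [DRE] + [DUE] + [SDC] * 2
factors = vulnerability_factors(window)
print(factors)
```

Computing these fractions separately for each window of execution would yield the kind of time-varying series that a regression model could then be trained to estimate.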