2022 article

Eris: Fault Injection and Tracking Framework for Reliability Analysis of Open-Source Hardware


By: S. Nema n, J. Kirschner n, D. Adak n, S. Agarwal*, B. Feinberg*, A. Rodrigues*, M. Marinella*, A. Awad n

author keywords: Reliability; fault-injection; RISC-V; safety-critical; resilience; fault-tracking
Source: Web Of Science
Added: September 26, 2022

As transistors have been scaled over the past decade, modern systems have become increasingly susceptible to faults. Increased transistor densities and lower capacitances make a particle strike more likely to cause an upset. At the same time, complex computer systems are increasingly integrated into safety-critical systems such as autonomous vehicles. These two trends make the study of system reliability and fault tolerance essential for modern systems. To analyze and improve system reliability early in the design process, new tools are needed for RTL fault analysis.This paper proposes Eris, a novel framework to identify vulnerable components in hardware designs through fault-injection and fault propagation tracking. Eris builds on ESSENT—a fast C/C++ RTL simulation framework—to provide fault injection, fault tracking, and control-flow deviation detection capabilities for RTL designs. To demonstrate Eris’ capabilities, we analyze the reliability of the open source Rocket Chip SoC by randomly injecting faults during thousands of runs on four microbenchmarks. As part of this analysis we measure the sensitivity of different hardware structures to faults based on the likelihood of a random fault causing silent data corruption, unrecoverable data errors, program crashes, and program hangs. We detect control flow deviations and determine whether or not they are benign. Additionally, using Eris’ novel fault-tracking capabilities we are able to find 78% more vulnerable components in the same number of simulations compared to RTL-based fault injection techniques without these capabilities. We will release Eris as an open-source tool to aid future research into processor reliability and hardening.