Efficient fault-injection-based assessment of software-implemented hardware fault tolerance
Date
2016
Abstract
With continuously shrinking semiconductor structure sizes and lower supply
voltages, the per-device susceptibility to transient and permanent hardware
faults is on the rise. A class of countermeasures with growing popularity
is Software-Implemented Hardware Fault Tolerance (SIHFT), which avoids
expensive hardware mechanisms and can be applied application-specifically.
However, SIHFT can, counterintuitively, cause more harm than good, because
its overhead in execution time and memory space also increases the figurative
“attack surface” of the system. Application-specific configuration of SIHFT
is therefore a necessity rather than just an advantage.
Consequently, target programs need to be analyzed for particularly critical
spots to harden. SIHFT-hardened programs need to be measured and compared
throughout all development phases of the program to observe reliability
improvements or deteriorations over time. Additionally, SIHFT implementations
need to be tested.
The contributions of this dissertation focus on Fault Injection (FI) as an assessment technique satisfying all these requirements – analysis, measurement and comparison, and test. I describe the design and implementation of an FI tool, named Fail*, that overcomes several shortcomings in the state of
the art, and enables research on the general drawbacks of simulation-based
FI. As demonstrated in four case studies in the context of SIHFT research,
Fail* provides novel fine-grained analysis techniques that exploit the newly
gained possibility to analyze FI results from complete fault-space exploration.
These analysis techniques aid SIHFT design decisions on the level of program
modules, functions, variables, source-code lines, or single machine instructions.
Based on the experience from the case studies, I address the problem
of large computation efforts that accompany exhaustive fault-space exploration
from two different angles: Firstly, I develop a heuristic fault-space
pruning technique that allows freely trading the total FI-experiment count
for result accuracy, while still providing information on all possible fault-space
coordinates. Secondly, I speed up individual TAP-based FI experiments
by improving the fast-forwarding operation by several orders of magnitude
for most workloads. Finally, I dissect current practices in FI-based evaluation
of SIHFT-hardened programs, identify three widespread pitfalls in the
result interpretation, and advance the state of the art by defining a novel
comparison metric.
Keywords
Fault injection, Transient memory faults, Software-implemented hardware fault tolerance, Criticality analysis, Fault-tolerance assessment, FAIL*, Fault-similarity pruning, Smart-hopping, Extrapolated absolute failure count, Software-based fault tolerance, Software test