In the
internet you can google acronyms like SWIFT (Software Implemented Fault
Tolerance) or SIHFT (Software-implemented Hardware Fault Tolerance). Both have
the same meaning and describe a group of methods that help to organize,
enforce, protect, optimize your software in a such way that it will be tolerant
of bit flips. In this post I would like to discuss general SIHFT
approaches and give useful links to the publications for further details.
Control
flow checking
Detection of hardware
errors using control flow checking provides the means for recognizing of
invalid control flow of the executed program. Execution of sequences of
instructions that are not permitted for the executed binary can be detected.
Control flow checking cannot detect errors, which do only influence processed
data. Usually, control flow checking is only applied for inter-basic-block
control flow. Erroneous control flow within the boundaries of basic-blocks is
not detectable. Here is a simple example. We decompose the source code (on
the instruction level) into the basic blocks. A basic block is a number of
instructions in-between two control flow jumps (branching instructions). After
that we generate unique exit-signatures for all blocks. We instrument our code
in a such way that at the end of each block we save this signature into a special
register, and than, check it at the beginning of the next block. This helps to
tolerate bit-flips in control flow instructions and prevent wrong control flow
jumps.
·
Edson Borin, Cheng Wang,
Youfeng Wu, and Guido Araujo. Software-based transparent and comprehensive
control-flow error detection. In Proceedings of the International Symposium on
Code Generation and Optimization (CGO), Washington, DC, USA, 2006. IEEE
Computer Society. 6, 7
·
Ramtilak Vemu and Jacob A.
Abraham. CEDA: Control-flow error detection through assertions. In IOLTS ’06:
Proceedings of the 12th IEEE International Symposium on On-Line Testing. IEEE
Computer Society, 2006. 6, 7
Invariant-based
methods
Algorithm-based fault
tolerance and self-checking software use invariants to check the validity of
the generated results. These methods require the existence of appropriate
invariants, which provide a good generic failure detection capability. However,
they are not easy (if not impossible) to find for most applications or limited
to specific algorithms and program sections. Generic invariants as
assertion-based loop invariant only detect significant variations of the
results reliably and may miss subtle errors or side-effects.
·
V.K. Stefanidis and K.G.
Margaritis. Algorithm based fault tolerance : Review and experimental study. In
International Conference of Numerical Analysis and Applied Mathematics, 2004.
6, 7
·
Hal Wasserman and Manuel
Blum. Software reliability via run-time result-checking. J. ACM, 1997. 6,
7
Redundancy
Other software approaches
for detecting hardware errors work with replicated execution and comparison
(voting) of the obtained results. The protected software is modified during or
before compilation – rarely, dynamic binary instrumentation is used.
Replication is applied at different levels of abstraction. A number of
approaches duplicate single instructions within one thread. Others execute
duplicates of the whole program using several threads. This methods are similar
to the typical hardware reliability approaches like double and triple module
redundancies. Duplication of the instructions helps only to detect a bit flip,
triple execution can even mask it.
·
George A. Reis, Jonathan
Chang, David I. August, Robert Cohn, and Shubhendu S. Mukherjee. Configurable
transient fault detection via dynamic binary translation. In Proceedings of the
2nd Workshop on Architectural Reliability (WAR), 2006. 7
·
C. Bolchini, A. Miele, M.
Rebaudengo, F. Salice, D. Sciuto, L. Sterpone, and M. Violante. Software and
hardware techniques for SEU detection in IP processors. J. Electron. Test.,
2008. 6, 7
·
Cheng Wang, Ho seop Kim,
Youfeng Wu, and Victor Ying. Compiler-managed software-based redun- dant
multi-threading for transient fault detection. In International Symposium on
Code Generation and Optimization (CGO), 2007. 7
Replicated execution and
comparison methods help against bit flips in CPU during instruction processing.
In a similar way, we can tolerate bit flips in memory. For this purpose, we
have to store two or three copies of our variables and compare them on
read. The main drawback of these methods is a very high performance
(execution time, memory consumption) overhead.
·
George A. Reis, Jonathan
Chang, and David I. August. Automatic instruction-level software-only recovery.
IEEE Micro, 27:36–47, January 2007. 2
·
George A. Reis, Jonathan
Chang, Neil Vachharajani, Ram Rangan, and David I. August. Swift: Software
implemented fault tolerance. In Proceedings of the international symposium on
Code generation and optimization, CGO ’05, pages 243–254, Washington, DC, USA,
2005. IEEE Computer Society. 2
Arithmetic codes
Instead of duplication, or
in addition to it, arithmetic codes can be used to detect errors. The
arithmetic codes are conserved by correct arithmetic operations that is, a
correctly executed operation taking valid code words as input produces a result
that is again a valid code word. On the other hand, faulty arithmetic
operations do not conserve the code with a very high probability, that is,
faulty operations result in a non-valid code word. Furthermore, arithmetic
codes protect data from unnoticed modifications during storage or transport on
a bus. These random modifications with high probability will not result in a
valid code word. Thus, the arithmetic codes facilitate not only the detection
of soft errors but can catch permanent hardware faults, too. AN, ANB, and ANBD
encodings are different types of arithmetic code protections. Note that in the
ISO standard for the functional safety of road vehicles (ISO 26262) encoding
with an ANBD-code is one of the two means to reach the ASIL-D level – the
highest safety level.
·
A. Avizienis. Arithmetic
Error Codes: Cost and Effectiveness Studies for Application in Digital System
Design. In Transactions on Computers, 1971. 7
·
Nahmsuk Oh, Subhasish
Mitra, and Edward J. McCluskey. ED4I: Error detection by diverse data and
duplicated instructions. IEEE Trans. Comput., 51, 2002. 7
·
onathan Chang, George A.
Reis, and David I. August. Automatic instruction-level software-only recovery.
In Proceedings of the International Conference on Dependable Systems and
Networks (DSN), Washington, USA, 2006. 7
·
Ute Schiffel, Martin Su ̈ßkraut, and Christof Fetzer.
AN-encoding compiler: Building safety-critical systems with commodity hardware.
In The 28th International Conference on Computer Safety, Reliability and
Security (SafeComp 2009), 2009. 7, 9, 10
·
Ute Wappler and Christof
Fetzer. Hardware failure virtualization via software encoded processing. In 5th
IEEE International Conference on Industrial Informatics (INDIN 2007), 2007. 7,
9, 8, 10
·
Ute Wappler and Christof
Fetzer. Software encoded processing: Building dependable systems with com-
modity hardware. In The 26th International Conference on Computer Safety,
Reliability and Security (SafeComp 2007), 2007. 7, 9, 10
No comments:
Post a Comment