Tool for Customizing Fault Tolerance in a System

Karan Maini, Sriharsha Yerramalla
Dept. of Electrical and Computer Engineering, University of Wisconsin-Madison, USA
[email protected], [email protected]

Abstract There are numerous real-time and operation-critical systems in which failure is unacceptable at any stage of processing. Detecting, and even correcting, internal faults during normal operation is a prime concern in such applications. Fault Tolerance (F-T) has been taken into account for many years during the design of these applications, but it has not leveraged recent advances in CAD tools that automate the design process.[1] Inserting fault-tolerant structures into a circuit has therefore remained a challenge. In this paper we propose a new tool for the automatic insertion of fault-tolerant structures into a synthesizable HDL description of a design. With this tool, fault tolerance can be added to any instance of a module with little overhead in area and cost. A fault-tolerant design is produced automatically from user specifications provided through a GUI, and the resulting Verilog HDL design is simulated and synthesized using commercial tools such as ModelSim, Quartus and Design Vision. The GUI is implemented with Java Swing and interacts frequently with Perl scripts running as separate processes. Several implementation techniques have been considered to demonstrate the capabilities of the tool, and a comparative analysis of these techniques has been carried out in terms of area and timing delay.

Keywords Triple Modular Redundancy (TMR), N-Modular Redundancy (NMR), Hybrid Redundancy, Sift-Out Modular Redundancy, Majority Voting, Verilog HDL, Java Swing.

1. INTRODUCTION Development of a system generally proceeds through a number of stages: specification, design, prototyping, implementation and, finally, installation. A fault can occur during one or more of these phases. A fault is defined as a physical defect in some part of a system. A fault introduced during one development stage may become apparent only at a later stage. Faults manifest themselves as errors. When an error is encountered during the operation of a system, it can lead to a failure. A system is said to have failed if it cannot deliver its intended function.

Figure 1 shows a simple example that illustrates the three terms.[2]

Figure 1: Relationship between failure, error & fault

A Fault Tolerant system is one that can continue to perform its specified tasks correctly in the presence of hardware failures. In real-time applications where an error in system performance is unacceptable, F-T is mandatory, and designers of a critical application have to consider including F-T structures in the circuit. In the early days, these techniques had to be inserted into the design descriptions manually, because no tools were available for the task.[1] The relatively small number of fault-tolerant applications made the development of tools specific to fault-tolerant circuit design unattractive. With the growing need for operation-critical systems in which failure is unacceptable at any stage of processing, F-T circuits are now needed more widely. Such a circuit is usually designed by introducing fault-tolerant structures into a previously designed circuit. Within current HDL-based methodologies, this is usually done by manually modifying the HDL code to insert hardware redundancy, information redundancy or time redundancy at critical points in the circuit. The modified, fault-tolerant design then follows a similar design flow consisting of automatic logic synthesis and place and route.[1]

In this paper we present an automatic tool for developing fault-tolerant digital integrated circuits at the RT abstraction level in Verilog HDL. The tool helps designers increase their productivity when designing fault-tolerant circuits. First, basic specifications regarding the type of F-T are taken from the user, and then an F-T library is constructed to provide the components required to fulfil the request. After the components have been inserted into the design, a Verilog HDL file is returned to the user. Automatic validation of the resulting Verilog code is performed before the final output file is provided. Finally, the output is synthesized so that it can be compared against the original non-F-T design and against the alternative approaches for introducing F-T into the system.

The remainder of the paper is organized as follows. Section 2 presents the different implementation schemes and the construction of the F-T library. Section 3 describes the general scheme of the fault-tolerance insertion tool, including the GUI. Section 4 presents the results obtained from validation and synthesis and compares the schemes in terms of timing delay and area. Finally, Section 5 presents conclusions and future scope.

2. HARDWARE FAULT TOLERANT TECHNIQUES Most real-time systems need to function with high accuracy even in the presence of faults. This fault tolerance is generally achieved through redundancy, which can take several forms in a system: hardware, software, information or time. Our focus in this work is on hardware redundancy. For a given system, hardware redundancy can be achieved at different abstraction levels. Taking a mobile phone as an example, redundancy can be obtained by having a spare phone, a spare processor inside the phone, a spare block inside the processor, and so on. Different degrees of reliability can thus be applied at different levels, depending on resource availability and requirements. Our tool lets the user add several types of hardware redundancy at different levels of the hierarchy and to different components. The four types of redundancy targeted here are NMR, Hybrid, Sift-Out, and Pair and Spare. These methods are described briefly in the following sections.

NMR This is the most obvious and simplest technique: multiple instances of a component are used and a voter determines the majority output. Triple modular redundancy and five modular redundancy are the common schemes; higher-order NMR is possible, but the voter complexity increases drastically. A complex voter can become a bottleneck in the design and can defeat the purpose, since it is a single point of failure. There are ways to deal with this problem, but we do not discuss them here.

Hybrid This scheme combines NMR with spares. It can purge its faulty modules and switch one of the free spares in to replace them. An erroneous unit is identified by comparing the outputs of the primary modules with the voter's output. The implicit assumption is that only one module fails at any instant of time. The reconfiguration unit replaces the malfunctioning module with a working spare. The circuit implemented for hybrid redundancy is partly inspired by the one presented by Siewiorek et al.[4], with modifications to make scaling simpler. A high-level circuit diagram is presented in Figure 2.[11] One disadvantage of this scheme is that the voter, the compare unit and the reconfiguration unit are each a single point of failure.

Figure 2: Hybrid Redundancy
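As a behavioral illustration of the hybrid scheme, written in Python rather than Verilog, the following sketch models a voter, a compare unit and a reconfiguration unit. It is not the library circuit: the class name `HybridRedundancy`, the modelling of modules as plain callables, and the `good` and `stuck` test modules are all assumptions made for the example.

```python
from collections import Counter


class HybridRedundancy:
    """Behavioral model: primaries feed a majority voter; spares replace dissenters."""

    def __init__(self, primaries, spares):
        self.active = list(primaries)   # modules currently feeding the voter
        self.spares = list(spares)      # idle modules waiting to be switched in

    def step(self, x):
        outputs = [m(x) for m in self.active]
        # Voter: the value produced by most active modules wins.
        voted, _ = Counter(outputs).most_common(1)[0]
        # Compare unit: a module that disagrees with the voter is flagged.
        for i, out in enumerate(outputs):
            if out != voted and self.spares:
                # Reconfiguration unit: switch a spare in for the faulty module.
                self.active[i] = self.spares.pop(0)
        return voted


def good(x):
    return x + 1


def stuck(x):
    return 0   # hypothetical module whose output is stuck at 0


system = HybridRedundancy([good, stuck, good], spares=[good, good])
first = system.step(5)    # the fault is masked by the voter, then purged
second = system.step(7)   # all three active modules are now fault-free
```

After the first step the stuck module has been replaced by a spare, so one spare remains for a later failure.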

Sift-Out This technique tolerates up to N-2 failures, where N is the initial number of units. Each unit remains active as long as it is fault-free; as soon as it fails, it stops contributing to the system. This continues until only two units are left. The implementation consists of three modules: comparator, detector and collector. The comparator compares every input with the N-1 others and passes its outputs to the detector, which identifies the faulty module and isolates it from the rest of the units. The collector produces the final output. Memory elements are used in the circuit so that it tolerates transient faults and avoids race conditions. The circuit is partly borrowed from the one proposed by De Sousa and Mathur.[5] The top-level circuit is shown in Figure 3. Sift-out redundancy offers fault tolerance as high as or higher than hybrid redundancy, with a simpler implementation.

Figure 3: Sift-Out Modular Redundancy
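The comparator, detector and collector roles can be sketched behaviorally in Python. This is an illustration under assumptions, not the circuit of [5]: the class name `SiftOut` is hypothetical, and the detection rule used here (a unit is sifted out when it disagrees with more than half of its live peers, and sifting stops once two units remain) is one simple choice among several.

```python
class SiftOut:
    """Behavioral model of sift-out redundancy over a list of callable units."""

    def __init__(self, units):
        self.units = list(units)
        self.active = [True] * len(self.units)

    def step(self, x):
        live = [i for i, on in enumerate(self.active) if on]
        outputs = {i: self.units[i](x) for i in live}
        if len(live) > 2:
            for i in live:
                # Comparator: count the live peers this unit disagrees with.
                disagreements = sum(outputs[i] != outputs[j] for j in live if j != i)
                # Detector: isolate a unit that contradicts most of its peers.
                if 2 * disagreements > len(live) - 1:
                    self.active[i] = False
        # Collector: forward the output of any unit that is still active.
        for i in live:
            if self.active[i]:
                return outputs[i]
        raise RuntimeError("no fault-free unit left")


def good(x):
    return x + 1


def bad(x):
    return -1   # hypothetical faulty unit


sifter = SiftOut([good, good, bad, good, good])
result = sifter.step(3)
```

The faulty third unit is switched off after the first evaluation, and the remaining four units continue to drive the output.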

Pair and Spare In this technique modules are grouped in pairs, and the output of only one pair is used at a time. If the two modules of a pair produce inconsistent results, that pair is disconnected and the result is taken from the other pair. Implementation of this technique has not been completed yet; it will be completed by the end of this semester. Figure 4 gives a high-level picture of the pair and spare scheme.[11]
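Since the paper's own implementation of this scheme is not yet available, the following Python fragment is only a sketch of the selection rule described above. The function name `pair_and_spare` and the representation of a pair as a tuple of two callables are assumptions.

```python
def pair_and_spare(pairs, x):
    """Return the output of the first pair whose two modules agree."""
    for first, second in pairs:
        a, b = first(x), second(x)
        if a == b:
            # The comparator inside the pair sees no mismatch: use this pair.
            return a
        # Mismatch: treat this pair as disconnected and try the next one.
    raise RuntimeError("every pair reported a mismatch")


def good(x):
    return x + 1


def stuck(x):
    return 0   # hypothetical faulty module


value = pair_and_spare([(good, stuck), (good, good)], 5)
```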

3. IMPLEMENTATION The main objective of the proposed fault-tolerance insertion (FTI) tool is to generate a fault-tolerant Verilog design description. The designer provides the original Verilog design description and some guidelines about the type of F-T technique to be used, entering these details through the UI. The user selects the instance of a module that is to be made fault tolerant. Based on these details, components are selected from the Fault-Tolerant Library and inserted into the Verilog design file that instantiates that particular instance. Finally, the output Verilog file is generated and returned to the user.

Figure 4: Pair and Spare

Fault Tolerant (F-T) Library Our tool lets the user select the type of hardware redundancy to introduce into the system. The fault-tolerant library contains modules that support TMR and 5MR; based on the user's selection, a parameterized voter is chosen from the library and instantiated. For hybrid redundancy, the user can choose the number of spares that back up the triple modular redundancy core. The sift-out module is parameterized with respect to the number of units/channels. All modules are designed so that the tool can parameterize them on the fly. The Pair and Spare details are presented as we envision them, since this scheme is not yet implemented. The library can be extended with further types of redundancy without major changes to the tool, because library modules are inserted into the design as independent entities that do not disturb the functionality and integration of the existing logic. Table 1 lists the options that the user can choose and the corresponding parameters of each library module.

Table 1: Options provided by library modules
Module/parameters | Redundancy number | Spares | Voter | Bus width
Partly | Yes
Hybrid redundancy | Yes | Partly | Yes
Sift-out redundancy | Yes | Yes | NA
Pair and Spare | Yes | NA | NA | NA
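To make the idea of a voter parameterized by redundancy number and bus width concrete, here is a small Python model. It assumes a bitwise majority over an odd number of inputs, which is a common construction but is not confirmed by the text as the library's exact design; the function name `majority_vote` is hypothetical.

```python
def majority_vote(words, width):
    """Bitwise majority of an odd number of `width`-bit words."""
    if len(words) % 2 == 0:
        raise ValueError("need an odd number of inputs")
    result = 0
    for bit in range(width):
        # Count how many inputs have this bit set.
        ones = sum((w >> bit) & 1 for w in words)
        if 2 * ones > len(words):
            result |= 1 << bit
    return result


tmr_out = majority_vote([0b1010, 0b1010, 0b0110], width=4)
fmr_out = majority_vote([0xFFF, 0xFFF, 0x000, 0xFFF, 0x0F0], width=12)
```

The same function covers TMR and 5MR by changing the number of inputs, and any bus width by changing `width`, which mirrors the two parameters the library voter is said to expose.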

Figure 5: Scheme of the Fault - Tolerant Insertion Tool

Application of the FT Insertion Tool follows the scheme shown in Figure 5.[1] The final fault-tolerant design can then be synthesized or written to a new Verilog file, which serves as input for simulation or synthesis with other tools. This option provides several advantages:

• It allows behavioral simulation of the modified design, in order to validate that the design still behaves correctly after the modifications. This saves simulation time, because behavioral simulation is far more efficient than gate-level simulation.[1]
• The user can see the modifications made to the code, which gives higher confidence in the modification process. Manual modifications are still possible, if needed. While learning the tool, the user may compare the results of the automatic and manual modification processes.
• The proposed approach automates the insertion of fault-tolerant structures into the design, so the designer need not worry about designing the components. The designer just needs to import the F-T library into the workspace.

DESIGN FLOW

1. Start the GUI from the Eclipse IDE.
2. The user uploads Verilog design files using the file chooser.
3. A Perl script cleans the code in each file.
4. A Perl script creates the hierarchy of modules and instantiations, starting from the top-level module.
5. The hierarchy is displayed on the UI using JTree, and the user is prompted to select an instance to make fault tolerant.
6. The user is prompted to select one redundancy type.
7. Is Hybrid, Sift-Out or Pair and Spare selected? If yes, the user enters the number of spare(s); if no, the flow continues directly to the next step.
8. Appropriate components are selected from the F-T Library and the F-T Verilog file is output to the user.

Figure 6: Flowchart of Design Process
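The hierarchy-building step of this flow can be sketched as a short recursive walk. The sketch is in Python, whereas the tool uses a Perl script, and it makes simplifying assumptions: comments have already been removed, instantiations have the plain form `module_name instance_name ( ... );` without parameter overrides, and the tab-indented output format is illustrative rather than the tool's exact file layout. The names `MODULE_RE` and `build_hierarchy` are hypothetical.

```python
import re

# Captures each module's name and everything up to its endmodule keyword.
MODULE_RE = re.compile(r"\bmodule\s+(\w+)(.*?)\bendmodule\b", re.S)


def build_hierarchy(source, top):
    """Return tab-indented lines listing module instantiations below `top`."""
    bodies = {name: body for name, body in MODULE_RE.findall(source)}

    def walk(module, depth):
        lines = []
        for child in bodies:
            # An instantiation looks like: <module name> <instance name> (
            pattern = r"\b%s\s+(\w+)\s*\(" % re.escape(child)
            for instance in re.findall(pattern, bodies[module]):
                lines.append("\t" * depth + child + "\t" + instance)
                lines.extend(walk(child, depth + 1))   # recurse into the child
        return lines

    return [top] + walk(top, 1)


example = """
module fa(input a, input b, input cin, output s, output cout);
  assign s = a ^ b ^ cin;
  assign cout = (a & b) | (cin & (a ^ b));
endmodule

module adder2(input [1:0] a, input [1:0] b, output [2:0] s);
  wire c;
  fa u0 (a[0], b[0], 1'b0, s[0], c);
  fa u1 (a[1], b[1], c, s[1], s[2]);
endmodule

module top(input [1:0] x, input [1:0] y, output [2:0] z);
  adder2 add0 (x, y, z);
endmodule
"""

hierarchy = build_hierarchy(example, "top")
```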

Figure 6 describes the steps involved in greater detail. The Graphical User Interface (GUI) is created in the Eclipse Integrated Development Environment (IDE) using Java Swing, which provides a rich set of components to the designer. The UI is built using the Design view, which helps the designer lay it out quickly and efficiently. Figure 7 shows a screenshot of the tool in its entirety. It comprises the following three sections.

Section 1 This section shows the welcome screen. The user is asked to supply all the Verilog files of the design; this feature is implemented using JFileChooser. The look and feel of the UI is set to that of the system it is running on, such as Windows or Mac OS. After the user uploads a set of design files, the file paths are added to an ArrayList of String and displayed on the UI. The user can de-select a particular file if it should not be considered. The user is then prompted for the name of the top-level module of the design, so that the entire design hierarchy can be extracted in the next section. Before moving to the next section, the UI performs a couple of background tasks. It copies all the input files to its current working directory (cwd), i.e. the Java workspace. It then writes all the file names to a plain-text file “Files.txt” and forks a child process to call the Perl script “Scan.pl”.

Section 2 The interface between the Perl scripts and the Java code is implemented by writing results to plain-text files. This lets the two pieces of code interact easily without adding complexity, and it helps to keep the flow consistent on both ends. The “Scan.pl” script starts by reading “Files.txt” to fetch the file names one by one and removes any comments from each file. It copies the file line by line, disregarding comments, to a new file at the same location.
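The comment-removal step can be illustrated with a single regular expression. The sketch below is in Python rather than Perl and is not the actual “Scan.pl”; the names `COMMENT_RE` and `strip_comments` are hypothetical, and the pattern does not protect comment-like text inside string literals.

```python
import re

# Matches a // comment up to the end of the line, or a /* ... */ block.
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.S)


def strip_comments(verilog_text):
    """Remove single-line and multi-line comments from Verilog source text."""
    return COMMENT_RE.sub("", verilog_text)


cleaned = strip_comments("wire a; // net\n/* block\n comment */wire b;")
```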
A regular expression handles both single-line and multi-line comments in a Verilog file. In parallel, the top module in each design file is written to “Modules.txt” so that a list of all the modules involved is available. The next step is to generate the entire hierarchy of modules and their instantiations in the design. For this, another Perl script, “Rec_func.pl”, starts from the top-level module entered by the user and calls a function recursively, with a module name as its argument, to generate the hierarchy. The first level of the hierarchy is the top-level module, followed at the second level by all the modules present in its definition, and by their instance names at the last (third) level. These are separated by tabs, and the script writes the hierarchy to a “Hierarchy.txt” file, which is then read by the Java code. The main Java thread waits for the child process to finish and checks its return code to make sure the script did not fail before proceeding. The “Hierarchy.txt” file is read and all the nodes are stored in a JTree component. The tree is then laid out on the UI so that the user gets a clear view of which modules and instances are present in the design and at what hierarchical level.

Section 3 In this section the user is asked for the inputs that are essential for making the design fault tolerant. First, the user selects the instance (leaf node) in the JTree that should be made fault tolerant. JTree selection is restricted to a single node, and a message dialog box is shown if the user selects a module instead of an instantiation. The other input required is the redundancy technique. Since we have focused on hardware redundancy, the user is given the following options to choose from:

• Triple Modular Redundancy (TMR)
• Five Modular Redundancy (5MR)
• Hybrid Redundancy (Using TMR)
• Hybrid Redundancy (Using 5MR)
• Sift-Out Modular Redundancy
• Pair and Spare Modular Redundancy

These options are implemented using JRadioButton components grouped together in a ButtonGroup. If the user selects Hybrid Redundancy, the number of spares must be entered in a text field. Input validation is present throughout to ensure correct, non-zero inputs. Once all the inputs have been validated on clicking the “Submit” button, the request for a fault-tolerant design is registered and the specifications are stored in constants declared as fields of the Java class. Action listeners and handlers on the “Submit” button provide this validation.

While processing the request, the module definition file of the selected instance is read to identify the input, output and inout ports and their respective bus widths. These details are written to a “Ports.txt” file, both for the user's reference and for use in the last segment of the code. Different cases are handled for the declaration of input, output and inout ports and their bus widths, since these can appear on the module definition line or somewhere after it. In the last segment of the code, the module instantiation is replicated under new names (as many times as the redundancy type chosen by the user requires), and all output ports are renamed in each instantiation by appending a constant string to the original name (e.g. “_red0”, “_red1”, and so on, where “red” stands for redundancy).
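The port-identification step can be sketched with a regular expression that accepts declarations either in the module header or in the body. This Python fragment is an illustration, not the tool's Perl code: `PORT_RE` and `find_ports` are hypothetical names, only constant `[msb:lsb]` ranges are recognized, and comma-separated lists such as `input a, b;` would need extra handling.

```python
import re

# direction, optional net type, optional [msb:lsb] range, then the port name
PORT_RE = re.compile(
    r"\b(input|output|inout)\s+(?:wire\s+|reg\s+)?"
    r"(?:\[\s*(\d+)\s*:\s*(\d+)\s*\]\s*)?(\w+)"
)


def find_ports(module_text):
    """Return (direction, name, bus width) for each port declaration found."""
    ports = []
    for direction, msb, lsb, name in PORT_RE.findall(module_text):
        # A declaration without a range is a single-bit port.
        width = abs(int(msb) - int(lsb)) + 1 if msb else 1
        ports.append((direction, name, width))
    return ports


ports = find_ports(
    "module m(input clk, input [11:0] d, output reg [11:0] q);\n"
    "  inout [3:0] pad;\n"
    "endmodule"
)
```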

Figure 7: GUI displaying all 3 Sections and the user selections made on each section

This string constant is assumed to produce a unique variable name that is not already used in the design. It can be regarded as a magic string, by analogy with magic numbers. New wires are declared with the widths recorded in “Ports.txt”, and a voter is instantiated in the design according to the user's specification. The voter's output drives the same net that the original instance drove, so that consistency is retained throughout the design. The voter for majority selection is already defined as part of the Fault-Tolerant library described earlier. At the end, a message informs the user that the fault-tolerant design is ready, and the application closes when the “OK” button is clicked. The new Verilog design file is then ready for use. It can also be simulated against the test bench by introducing a fault in the original file and comparing the results of the original and the new design. Manual modifications remain possible in case minor adjustments are needed after simulation.
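The replication step can be sketched as a small text generator. The Python function below is an illustration only: the voter module name `voter`, its `WIDTH` parameter and its port names `in0`, `in1`, ..., `out` are hypothetical and do not come from the paper's library, and nets are assumed to be simple identifiers.

```python
def make_redundant(module, instance, connections, outputs, copies=3):
    """Emit Verilog text for `copies` replicas of one instance plus a voter.

    connections: port name -> net name used in the original instantiation
    outputs:     output port name -> bus width
    """
    lines = []
    # Declare one new wire per replica for every output net.
    for port, width in outputs.items():
        net = connections[port]
        rng = "[%d:0] " % (width - 1) if width > 1 else ""
        for k in range(copies):
            lines.append("wire %s%s_red%d;" % (rng, net, k))
    # Replicate the instance; only output connections get the _redK suffix.
    for k in range(copies):
        pins = []
        for port, net in connections.items():
            target = "%s_red%d" % (net, k) if port in outputs else net
            pins.append(".%s(%s)" % (port, target))
        lines.append("%s %s_red%d (%s);" % (module, instance, k, ", ".join(pins)))
    # The voter drives the original net so the rest of the design is untouched.
    for port, width in outputs.items():
        net = connections[port]
        ins = ", ".join(".in%d(%s_red%d)" % (k, net, k) for k in range(copies))
        lines.append("voter #(.WIDTH(%d)) vote_%s (%s, .out(%s));" % (width, net, ins, net))
    return "\n".join(lines)


text = make_redundant(
    "counter", "u_cnt",
    {"clk": "clk", "rst": "rst", "count": "offset"},
    {"count": 4},
)
```

For this hypothetical 4-bit counter instance, the result is three wire declarations, three renamed instances and one voter whose output is the original `offset` net.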

4. PROJECT RESULTS To compare the different schemes, each fault-tolerant unit is first analyzed on its own, without being added to any of our designs. The comparison is based on the area that the Design Vision tool estimates for building the fault-tolerant modules and on the worst-case timing delay overhead they cause. Design Vision estimates area under different optimizations. Table 2 gives the data collected for a bus width of 12 with three different optimizations.

Table 2: Area estimation with three different optimizations. Area is in library units and time in ns.

Module/Overhead | Normal Compile | Ungroup all | Compile Ultra | Delay
TMR | 23.3 | 23.3 | 23.3 | 8.324
5MR | 131.2 | 131.2 | 93.1 | 10.658
Hybrid (3 primary with 2 spares) | 370.6 | 358.4 | 269.4 | 13.294
Hybrid (5 primary with 3 spares) | 716 | 535.6 | 491.6 | 13.472
Sift-out (3 units) | 132.1 | 135.3 | 103.9 | 15.232
Sift-out (5 units) | 363.2 | 365.9 | 303.1 | 17.770
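As a quick check on how these figures scale with the redundancy number, the ratios between the three-unit and five-unit variants can be computed directly. The short Python calculation below uses the normal-compile area and the delay column, and assumes that the 8.324 ns entry is the TMR delay; the dictionary layout and the function name `growth` are illustrative.

```python
# (area with normal compile, worst-case delay in ns), as read from Table 2
table2 = {
    "TMR": (23.3, 8.324),
    "5MR": (131.2, 10.658),
    "Sift-out (3 units)": (132.1, 15.232),
    "Sift-out (5 units)": (363.2, 17.770),
}


def growth(small, large):
    """Return (area ratio, delay ratio) when moving from `small` to `large`."""
    area_small, delay_small = table2[small]
    area_large, delay_large = table2[large]
    return area_large / area_small, delay_large / delay_small


nmr_area, nmr_delay = growth("TMR", "5MR")
sift_area, sift_delay = growth("Sift-out (3 units)", "Sift-out (5 units)")
```

Under these assumptions, going from three to five units multiplies the area by roughly 5.6 for NMR and 2.7 for sift-out, while the delay grows by less than 30% in both cases.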

From the table above it can be observed that passive redundancy (TMR) adds the smallest delay to the existing functional path, while sift-out adds the largest. This is expected: TMR and 5MR have the simplest circuits of all and hence need the fewest logic gates and the least delay. Although the sift-out circuit needs a similar number of logic elements to hybrid redundancy for the same number of redundant modules, the fan-in and fan-out of its logic elements are higher, and its logic gates are therefore considerably slower. It is difficult to quantify the fault tolerance of these circuits, as both survive as long as the last two units are up and running, but intuitively simpler circuits are less prone to faults. Another trend in the data is that, although delay does not increase drastically with the redundancy number, the number of logic elements required is much higher for higher redundancy.

We verified our tool on three different designs. The design blocks used were a micro-op generator, a PIIR module and a 5-stage pipelined processor, henceforth called DUT1, DUT2 and DUT3. The designs were chosen so that they had considerable complexity and hierarchical depth. Quartus was used for Verilog design and verification. The tool is yet to be tested on DUT3; this is expected to be done soon. To verify the tool's functionality, faults were randomly injected into our correctly functioning DUTs. After noting the error caused by the injected fault, fault tolerance was introduced to mask the fault. This method was followed for all the fault-tolerant library modules to verify their behavior. DUT1 breaks complex instructions of the ARMv4 ISA into simple load and store instructions. The design contains counters that generate offsets for the loads and stores. The reset of one of the counters is tied to 1 to model a stuck-at-1 fault, and the effect of the error is observed on the external ports.
Different types of fault tolerance are then added to the design and the expected output is verified. Similar verification is performed on a module in DUT2. Corresponding to the hierarchies shown in Figure 7, Table 3 gives the area overhead incurred by adding fault tolerance to various modules inside DUT2. PIIR, a Programmable Infinite Impulse Response filter, is the top-level module of the design. It accepts two 16-bit input values and multiplies them by constants stored in the memory of the processor to generate a response. Five 16-bit multiplications take place inside the processor to produce a single output response, which is delayed by 2 clock cycles. The MAC is the multiply-and-accumulate unit used inside the PIIR unit. It contains a 16×16 Booth multiplier and a 16-bit carry look-ahead (CLA) adder to perform the multiplication and addition. The 16-bit CLA is built from four 4-bit CLAs, which are further broken down into 1-bit CLA adders. The table gives a rough idea of how the redundancy overhead depends on the hierarchy level: making the MAC fault tolerant would require far more resources than making its lower-level Mult_resbooth and CLA_16bit instances fault tolerant.

Table 3: Area Comparison of Original File (Without Fault Tolerance) with Different F-T Techniques Introduced for 16 bit Booth Multiplier and 16 bit CLA Adder

RT* Component | Original | 5MR | Hybrid (3, 2) | Sift-out (5)
Mult_resbooth | 6267.7 | 10288 | 10566 | 10608
CLA_16bit | 6267.7 | 6536.3 | 6543.4 | 6590.2

* RT: Redundancy Technique. Hybrid (3, 2): 3 primary, 2 spares. Sift-out (5): 5 redundant units.

5. CONCLUSION AND FUTURE WORK A tool for adding hardware fault tolerance has been built for systems written in Verilog. It can add fault tolerance to modules at different levels of the hierarchy, as requested by the user, and its library offers several fault-tolerance options. The functionality of the tool is verified on three different designs by adding fault tolerance at different levels of the hierarchy. A brief analysis of the library modules is performed in terms of the complexity and delay that adding fault tolerance introduces into the design. This work can be continued to add other types of hardware redundancy. While adding redundancies such as duplex and triplex-duplex is not very difficult from this stage, adding complex redundancies such as triplicated voters is not straightforward. Information redundancy can also be added, but it cannot be generalized as readily as hardware redundancy. Had time permitted, we could have added check bits to register files, increasing the options for fault tolerance. Currently we verify the generated Verilog code manually, but this can be automated by having our tool call an existing verification tool to run an input test bench automatically.

6. ACKNOWLEDGEMENT The authors would like to thank Prof. Kewal Saluja for his valuable feedback and comments, which were key to the successful completion of our project.

7. REFERENCES
[1] Luis Entrena, Celia Lopez, Emilio Olias, “Automatic Generation of Fault Tolerant VHDL Designs in RTL” in Forum on Design Language, 2001.
[2] Deepti Shinghal, Dinesh Chandra, “Design and Analysis of a Fault Tolerant Microprocessor Based on Triple Modular Redundancy Using VHDL” in International Journal of Advances in Engineering & Technology (IJAET), Mar 2011.
[3] R. Leveugle, “Automatic Modifications of High Level VHDL Descriptions for Fault Detection or Tolerance” in Design, Automation and Test in Europe Conference and Exhibition, 2002.
[4] Daniel P. Siewiorek, Edward J. McCluskey, “An Iterative Cell Switch Design for Hybrid Redundancy” in IEEE Transactions on Computers, Vol. C-22, No. 3, March 1973.
[5] Paulo T. De Sousa, Francis P. Mathur, “Sift-Out Modular Redundancy” in IEEE Transactions on Computers, Vol. C-27, No. 7, July 1978.
[6] A.M. Amendola, A. Benso, F. Corno, L. Impagliazzo, P. Marmo, P. Prinetto, M. Rebaudengo, M. Sonza Reorda, “Fault Behaviour Observation of a Microprocessor System through a VHDL simulation-based Fault Injection experiment”, IEEE, pp. 536-541, 1996.
[7] Fred A. Bower, Daniel J. Sorin, Sule Ozev, “A Mechanism for Online Diagnosis of Hard Faults in Microprocessors” in the 38th Annual International Symposium on Microarchitecture (MICRO), Barcelona, Spain, November 2005.
[8] Hamid R. Zarandi, Seyed Ghassem Miremadi, Alireza Ejlali, “Fault Injection into Verilog Models for Dependability Evaluation of Digital System” in Proceedings of the Second International Symposium on Parallel and Distributed Computing (ISPDC’03), IEEE, 2003.
[9] S. L. Hight, D. P. Petersen, “Dissent in a majority voting system,” IEEE Trans. Comput., vol. C-22, pp. 168-171, Feb 1973.
[10] A. Avizienis, “Design of fault-tolerant computers,” in 1967 Fall Joint Comput. Conf., AFIPS Conf. Proc., vol. 31. Washington, D.C.: Thompson, 1967, pp. 733-743.
[11] Israel Koren, C. Mani Krishna, “Fault Tolerant Systems”, Chapter 2 “Hardware Fault Tolerance”, 2007.
[12] Michael L. Bushnell, Vishwani D. Agrawal, “Essentials of Electronic Testing for Digital, Memory and Mixed-Signal VLSI Circuits”, Chapter 4 “Fault Modeling”, 2002.
[13] Marc Loy, Robert Eckstein, Dave Wood, James Elliott, Brian Cole, “Java Swing” (2nd edition), O'Reilly Media, Inc., p. 53, ISBN 1449337309, 2012.
[14] “Look And Feel” (Java SE7), Oracle Documentation, http://docs.oracle.com/javase/7/docs/api/javax/swing/LookAndFeel.html, May 2012.
[15] IEEE Standard for Verilog Hardware Description Language, IEEE Std. 1364-2005, IEEE, 2006.