Estimating TMR Reliability on FPGAs Using Markov Models
This paper summarizes several reliability models for modeling FPGA reliability using Markov models.
1 Estimating TMR Reliability on FPGAs Using Markov Models Daniel McMurtrey, Keith Morgan, Brian Pratt, Michael Wirthlin I. BACKGROUND A. Markov A fundamental problem in fault-tolerant computing is predicting a system s reliability. Several methods exist to model system reliability. Markov modeling, named for the Russian mathematician Andrei Markov, is one such method. The underlying assumption of Markov models is that the probability of a state transition depends only on the current state. These are called first-order Markov models. Other higher-order Markov models exist, but these are mathematically much more difficult to deal with. The systems analyzed in this paper can all be modeled as a first-order Markov process. Since the conditional probability of being in any state j at time t in a first-order Markov process only depends on the previous state i, the conditional probability of transitioning from one state to the next can be assembled in a square matrix called a transition matrix T. Each entry in the transition matrix at row m and column n represents the probability of a transition from state m to state n (where m can equal n). As such, all entries in a transition matrix are non-negative and the entries in each row must sum to unity. Multiplying the vector of marginal probabilities of the different states at time t will give the marginal vector of probabilities for the next time period [1]. The reliability of a computing system can be modeled as a very simple Markov process with two states functional and failed. A graphical representation of this Markov model is shown in Figure 1. State 0 represents the functional state and state 1 represents the failed state. The system transitions from functional to failed at a rate of , as labeled on the arc from sate 0 to state 1. Fig. 1. A simple two-state Markov model. The reliability function of the Markov model depicted in Figure 1 can be derived by first converting the process to a continuous-time model and then solving a set of differential equations for the probability the system is in state 1, the failed state, at time t. To convert to a continuous-time Markov model, a transition matrix T is needed. By inspection, the transition matrix for Figure 1 is T = 1 t t 0 1 , (1) where the entry in row m, column n represents the probability of transitioning from state m to state n. Notice that entry 1, 1 is one, meaning once the system is in the failed state it will remain there indefinitely. From the transition matrix a set of equations can be defined that represent the probability of being in state 0 or 1 at time t+ t. The distribution of probabilities at time t+ t is equal to the product of the distribution of the probabilities at time t multiplied by the transition matrix. Stated mathematically, [p0(t+ t), p1(t+ t)] = [p0(t), p1(t)] 1 t t 0 1 . (2) Multiplying out yields the system of equations p0(t + t) = (1 t)p0(t) (3) p1(t + t) = tp0(t) + p1(t). (4) Subtracting p0(t) from both sides of Equation 3, p1(t) from both sides of Equation 4, and dividing both sides of both equations by t gives p0(t + t) p0(t) t = p0(t) (5) p1(t + t) p1(t) t = p0(t). (6) Notice that the left hand side of Equations 5 and 6 are the definition of a derivative when t ! 0. Thus taking lim t!0 of Equations 5 and 6 generates p00(t) = p0(t) (7) p01(t) = p0(t). (8) Solving this set of equations is more easily accomplished if they are first converted to the LaPlace domain. Taking the LaPlace transform of Equations 7 and 8 yields sP0(s) p0(0) = P0(s) (9) sP1(s) p1(0) = P0(s) (10) where pi(0) = pi(t) at t = 0. In matrix form, Equations 9 and 10 can be written as [p0(0), p1(0)] = [P0(s), P1(s)] + s 0 s . (11) 2 Solving for [P0(s), P1(s)] gives [P0(s), P1(s)] = [p0(0), p1(0)] + s 0 s 1 . (12) Computing the inverse of the matrix yields [P0(s), P1(s)] = [p0(0), p1(0)] 1 +s ( +s)s 0 1 s . (13) If the system starts out operational (in state 0), then the initial probability distribution is [p0(0), p1(0)] = [1, 0]. (14) Substituting the values from Equation 14 into Equation 13 gives [P0(s), P1(s)] = [1, 0] 1 +s ( +s)s 0 1 s , (15) or after multiplying through, P0(s) = 1 + s (16) P1(s) = ( + s)s . (17) Taking the inverse LaPlace transform of Equations 16 and 17 results in p0(t) = e( t) (18) p1(t) = 1 e( t). (19) In words, Equations 18 and 19 are the probability pi(t) that the system is in state i at time t. Since the system reliability R(t) as a function of time is the probability that the system has survived up to time t, the reliability is simply the probability the system is in state 0 at time t, or R(t) = p0(t) (20) = e( t). In more complex systems it is typically easier to compute the probability the system is in the failed state and subtract that from unity (1) to get the reliability function. Computing R(t) in this manner yields R(t) = 1 p1(t) (21) = 1 (1 e( t)) = e( t), which is the same result. B. Triple Modular Redundancy Triple modular redundancy (TMR) is a reliability technique in which each module in a circuit is triplicated and a majority vote (two out of three) is taken on the outputs to determine the final module output. This technique is represented in Figure 2. In this figure, the three rows of rectangular blocks represent three domains of identical circuit modules. The three circles represent the 2-out-of-3 majority voters. The output from each domain is compared against the others and the final result determined. The output of the circuit is also triplicated. Fig. 2. A simplified view of triple modular redundancy (TMR). Rectangles represent logic blocks, circles represent voters. TMR is able to mask any single fault (affecting only one module) in the circuit. In fact, TMR masks any number of faults that affect only one domain (only the modules in the top row of Figure 2, for instance). As long as only one domain has failed, the matching outputs of the other two will determine the output of the voter. If, however, two faults occur in separate domains, the output of the voters may be incorrect. To prevent the system from failing, then, a failed module must be replaced or repaired before another fault occurs. In FPGAs, a method called configuration scrubbing, or simply scrubbing, is used to repair faulty modules. Scrubbing involves the periodic refresh of the contents of the FPGA configuration memory. This effectively replaces the modules that have failed by reinitializing the contents of the SRAM cells that define their operation. The scrubbing rate can be adjusted according to the rate at which faults are expected to occur. II. TMR WITH REPAIR Applying the Markov model to a system that has been implemented with TMR yields the graph shown in Figure 3. State 0 represents the state where all the TMR modules are functioning properly. State 1 represents the case where a fault has occurred in any of the three replicated modules. State 2 represents the state where more than one module has been affected by a fault. In state 2 the output is no longer dependable and the system is unreliable at this point. There is no recovery once multiple modules have failed. Fig. 3. Markov model representing TMR with repair through scrubbing. For the system modeled in Figure 3, the failure rate is assumed to be identical for each of the three modules. Also, it is assumed that scrubbing can repair any module at rate . Since the modules all have an equal failure rate , the rate at 3 which the system can transition from state 0 to state 1 is 3 , or the sum total of the failure rates of each of the individual modules. Once a single module has failed, the system is in state 1. From state 1, the system can transition to state 2 with rate 2 since one of the other non-affected modules must have a fault for the total system to fail. However, if scrubbing occurs before a second module fails, the system would return to the fully functioning state 0. Scrubbing occurs at rate . The transition matrix can be represented as T = 2 4 1 3 t 3 t 0 t 1 2 t t 2 t 0 0 1 3 5. (22) Using the transition matrix, the transition equation is [p0(t + t), p1(t + t), p2(t + t)] = [p0(t), p1(t), p2(t)]T. (23) Expanding this equation produces p0(t + t) = (1 3 t)p0(t) + tp1(t) (24) p1(t + t) = 3 tp0(t) + (1 (2 + ) t)p1(t)(25) p2(t + t) = 2 tp1(t) + p2(t). (26) Manipulating these equations produces p0(t + t) p0(t) t = 3 p0(t) + p1(t) (27) p1(t + t) p1(t) t = 3 p0(t) (2 + )p1(t) (28) p2(t + t) p2(t) t = 2 p1(t), (29) with the familiar definition of a derivative on the left hand side of each equation. Taking the limit as t ! 0 yields the continuous-time domain equations p00 = 3 p0(t) + p1(t) (30) p01 = 3 p0(t) (2 + )p1(t) (31) p02 = 2 p1(t). (32) Applying the LaPlace transform to each equation produces sP0(s) p0(0) = 3 P0(s) + P1(s) (33) sP1(s) p1(0) = 3 P0(s) (2 + )P1(s) (34) sP2(s) p2(0) = 2 P1(s). (35) Combining equations 33, 34, and 35 yields [p0(0), p1(0), p2(0)] = [P0(s), P1(s), P2(s)]A, (36) where, A = 2 4 s + 3 0 3 + s + 2 0 0 2 s 3 5. (37) A = 2 4 s + 3 3 0 + s + 2 2 0 0 s 3 5. (38) Solving for [P0(s), P1(s), P2(s)] by obtaining the inverse of A gives [P0(s), P1(s), P2(s)] = [p0(0), p1(0), p2(0)]A 1, (39) where, A 1 = 2 64 +s+2 s+s2+5 s+6 2 3 s+s2+5 s+6 2 6 2 s( s+s2+5 s+6 2) s+s2+5 s+6 2 s+3 s+s2+5 s+6 2 2 (s+3 ) s( s+s2+5 s+6 2) 0 0 1 s 3 75 . (40) Assuming the initial distribution of probabilities is [p0(0), p1(0), p2(0)] = [1, 0, 0] (meaning the system starts in the functional state with all modules fault-free), Equation 39 becomes [P0(s), P1(s), P2(s)] = [1, 0, 0]A 1, (41) which can also be represented as P0(s) = + s + 2 s + s2 + 5 s + 6 2 (42) P1(s) = 3 s + s2 + 5 s + 6 2 (43) P2(s) = 6 2 s ( s + s2 + 5 s + 6 2) . (44) Solving for the reliability of the system requires determining the probability of being in states 0 or 1. However, it reduces the math to subtract the probability of being in state 2 from unity (1). Solving for the reliability R(t), using the inverse LaPlace transform gives R(t) = 1 p2(t) (45) = ( + 5 )sinh( 1 2 t p 2 + 10 + 2)e 1 2 ( +5 )t p 2 + 10 + 2 +cosh( 1 2t p 2 + 10 + 2)e 1 2 ( +5 )t (46) These results were verified against the results obtained in [2]. Using the values of = 0.001 and = 0.1 we were able to demonstrate the improvement in reliability of a system implemented with TMR. Figure 4 is a plot of the reliability of a non-redundant system and a TMR system as a function of time. The reliability of the non-redundant system drops immediately and quickly approaches 0. The system with TMR has a much higher reliability than the non-redundant system. III. PERSISTENCE While many FPGA applications cannot tolerate faults nor the corresponding interruption of service, there are applications which can tolerate a temporary interruption of service. By tolerating temporary interruptions of service, an application can trade availability for improvements in reliability. In [3], [4], Morgan et al. showed that an FPGA application that implements scrubbing will experience both permanent and temporary fault-induced service interruptions. They called the permanent interruptions persistent errors and the temporary interruptions non-persistent errors. When a fault induces nonpersistent errors the application becomes temporarily unavailable. Once scrubbing repairs the fault, the functional errors will eventually exit and the system will return to normal operation. When a fault induces persistent errors the application becomes permanently unavailable. 4 Fig. 4. Comparison of the Reliability of TMR vs. no TMR on a design. = 0.001 and = 0.1 Traditionally, FPGA application failure occurs after any interruption of service. By tolerating temporary service interruptions, an application will only fail after permanent service interruptions. To measure the improvement in reliability achieved by tolerating temporary service interruptions, a Markov model of a system that tolerates non-persistent errors was developed. The model is shown in Figure 5. State 0 is the functional state. State 1 is the unavailable state. State 2 is the failed state. In state 1, one or more non-persistent errors have occurred. In state 3 a persistent error has occurred. All state transition arcs are governed by the normal failure rate , the fraction p of the FPGA cross section susceptible to persistent errors and the scrubbing rate . In state 0, the system will transition to state 2 when a persistent error occurs. On this arc the normal failure rate is scaled by the persistent cross section fraction p. On the other hand, the system will transition from state 0 to state 1 when a non-persistent error occurs. On this arc the normal failure rate is scaled by the non-persistent cross section fraction. Since the non-persistent and persistent cross sections are mutually exclusive, the non-persistent cross section fraction is simply 1 p. In state 1, the system will transition back to state 0 when scrubbing repairs the fault that caused a non-persistent error. The rate on this arc is simply the scrubbing rate . On the other hand, the state can still transition to state 2, the failed state, if another fault induces a persistent error before scrubbing returns the system to state 0. Again this arc is governed by the failure rate, scaled by the persistent fraction of the sensitive cross section. The transition matrix T for the Markov model in Figure 5 is T = 2 4 1 t (1 p) t p t t 1 ( p + ) t p t 0 0 1 3 5. (47) Using the process described in Section I-A, the reliability as a function of time was found to be R(t) = e( pt). (48) Notice the similarity to Equation 21, the reliability of a simple system without repair (repair of failures). Adding the concept of a persistent cross section to the Markov model simply scales the failure rate in the reliability equation by the persistent cross section fraction. Fig. 5. Markov model of a system that only fails after a persistent error. State 0 is the functional state. State 1 is the unavailable state. In state 1, one or more non-persistent errors have occurred. State 2 is the failed state. In state 2 a persistent error has occurred. Fig. 6. Markov model of a system with TMR that only fails after a persistent error. Applying the notion of persistence to TMR, produces the Markov model shown in Figure 6. As in previous discussions, is the rate at which faults occur in an individual module and is the scrubbing rate that allows the system to recover from faulty modules. The states in Figure 6 are: State 0 - the system has no errors State 1 - one module has failed with a non-persistent error State 2 - one module has failed with a persistent error (the module could also have non-persistent errors as well) 5 State 3 - two modules have failed with non-persistent errors State 4 - two modules have failed one with persistent error and at least one other with a non-persistent error (again the module with a persistent error could have other nonpersistent errors) State 5 - two modules have failed with persistent errors, all the modules could have any number of non-persistent errors Note that only in state 5 is the system failed to the point where scrubbing cannot correct the errors. Thus all states but 5 and 0 will transition back to state 0 if scrubbing occurs. These transition arcs are marked with the scrubbing rate . In state 0, the system will transition to state 1 if a non-persistent error has occurred in any of the modules. This occurs at a the sum of the error rates of each module scaled by the non-persistent fraction (1 p). The system will transition to state 2 if a persistent error occurs. The probability of this happening is the sum of the error rates of the individual modules scaled by the persistent fraction p. From state 1, the system can move to state 2 if a persistent error occurs in the already failed module, or move to state 3 if a non-persistent error occurs in a functioning module. State 2 will transition to state 4 if a non-persistent error occurs in a functioning module or state 5 if the error is persistent. State 3 will transition to state 4 if a persistent error occurs in any of the three modules. State 4 will transition to state 5 if a persistent error occurs in any of the two modules that do not have persistent errors. By inspection, the transition matrix for this Markov model is shown in Equation 49. Applying the techniques demonstrated in Sections I-A and II we can solve for the reliability of this model. The technique used yielded multiple solutions. Figure 7 is a plot of a particular solution with = 0.001, = 0.1 and p = 0.1. The plot shows that TMR does improve an FPGA application s reliability, specifically one that tolerates non-persistent errors. Comparing Figure 7 with Figure 4 shows designs which can tolerate non-persistent errors increase the reliability benefits gained by TMR. IV. TMR WITH PARTITIONS It is sometimes desirable to improve an FPGA application s reliability by partitioning the circuit into smaller sections and voting on the outputs of each partition. Figure 8 shows how this is done. In this arrangement, the correct state of the circuit is restored from partition to partition, rather than allowing an error propagate through the entire circuit. Adding partitions to TMR gives the advantage of higher reliability by masking more faults in the circuit than standard TMR. For example, if a fault occurred in module 1 of partition A and another fault occurred in module 2 of partition B, both faults would be masked and the outputs of both partition A and partition B would be correct. If a similar set of faults were to occur in the circuit of Figure 2, TMR would be defeated and the circuit output would be incorrect. To estimate the reliability of the partitioned TMR circuit, we created the model illustrated in Figure 9 which corresponds Fig. 7. Reliability of a TMR design vs. non-redundant design taking into account persistence. = 0.001, = 0.1, and p = 0.1 Fig. 8. A TMR system with multiple TMR partitions. to a circuit with two TMR partitions (as in Figure 8). State 0 represents the system when there are no faults present. State 1 represents the system after a fault occurs in a single module. State 2 represents the system when two modules in separate TMR partitions contain faults. State 3 is the failure state in which two faults are present in two different modules in the same TMR partition. represents the failure rate of both partitions of a single module. Note that states 0, 1, and 2 are all functional states and all faults are successfully masked. Using the model of Figure 9, we construct the transition matrix T as T = 6 T = 2 6664 1 3 t 3 (1 p) t 3 p t 0 0 0 t 1 (2 (1 p) + 3 p + ) t p t 2 (1 p) t 2 p t 0 t 0 1 (2 p + + 2 (1 p)) t 0 2 (1 p) t 2 p t t 0 0 1 (3 p + ) t 3 p t 0 t 0 0 0 1 (2 p + ) t 2 p t 0 0 0 0 0 1 3 7775 (49) Fig. 9. Markov model representing a TMR system with two TMR partitions. 2 664 1 3 t 3 t 0 0 t 1 ( + 5 2 ) t 3 2 t t t 0 1 ( + 2 ) t 2 t 0 0 0 1 3 775 (50) which results in the reliability function plotted in Figure 12. For this example we are assuming the partitions are exactly equal in size. The model in Figure 9 can be extended to include any number of TMR partitions in order to increase reliability further. Figure 10 shows the model as we have extended it to three TMR partitions. Figure 11 illustrates how the model could be extended to N partitions. The difference in reliability between systems with two and three TMR partitions in shown in Figure 12. Fig. 10. Markov model representing a TMR system with three TMR partitions. Fig. 11. Markov model representing a TMR system with N TMR partitions. V. CONCLUSIONS Although TMR, partitioned TMR, and non-persistent error tolerance empirically improve a system s reliability, it is good Fig. 12. Comparison of the Reliability of two TMR partitions vs. three TMR partitions on a design. = 0.001 and = 0.1 to derive a closed-form estimate of the reliability in order to accurately assess a particular system. In this paper, solutions for system reliability were derived for a simple computing system, a system with TMR and repair, a system which could tolerate non-persistent errors, a system with TMR and repair that could tolerate non-persistent errors and a system with partitioned TMR and repair. The results for the first four solutions are plotted together in Figure 13. Again, = 0.001, = 0.1 and p = 0.1. The results indicate that significant improvements in reliability can be made with the techniques referenced throughout this paper. REFERENCES [1] J. Lindsey, Statistical Analysis of Stochastic Processes in Time. Cambridge University Press, 2004. [2] D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems . A K Peters, 1998. [3] K. Morgan, M. Caffrey, P. Graham, E. Johnson, B. Pratt, and M. Wirthlin, Seu-induced persistent error propagation in FPGAs, IEEE Transactions on Nuclear Science, no. 6, DEC 2005. [4] K. S. Morgan, Seu-induced persistent error propagation in FPGAs, Master s thesis, Brigham Young University, August 2006. [5] D. K. Pradhan, Fault-Tolerant Computer System Design . Prentice Hall, 1998. [6] K. S. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications . John Wiley & Sons, Inc., 2002. 7 Fig. 13. Comparison of the Reliability of systems with different levels of mitigation and different levels of non-persistent error tolerance. = 0.001 and = 0.1