IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. 43, NO. 9, SEPTEMBER 1995
Algebraic Survivor Memory Management Design for Viterbi Detectors

Gerhard Fettweis, Member, IEEE
Abstract—The problem of survivor memory management of a Viterbi detector is classically solved either by a register-exchange implementation, which has minimal latency but large hardware complexity and power consumption, or by a trace-back scheme with small power consumption but larger latency. Here an algebraic formulation of the survivor memory management is introduced which provides a framework for the derivation of new algorithmic and architectural solutions. This allows solutions to be designed with greatly reduced latency and/or complexity, as well as for achieving a tradeoff between latency and complexity. VLSI case studies of specific new solutions have shown that at minimal latency more than 50% savings are possible in hardware complexity as well as power consumption.

Paper approved by I. Treng, the Editor for VLSI in Communications of the IEEE Communications Society. Manuscript received June 9, 1992; revised May 20, 1993. This paper was presented in part at the IEEE International Conference on Communications (ICC'92), Chicago, IL, June 1992. The author is with the Dresden University of Technology, D-01062 Dresden, Germany. IEEE Log Number 9413168.
I. INTRODUCTION
DYNAMIC PROGRAMMING is a well-established approach for a large variety of problems concerning multistage decision processes [1]. One specific application of dynamic programming is the search for the best path through a graph of weighted branches. These branch weights will in the following be referred to as branch metrics. The path through the graph which is to be found is the one with the maximum (or minimum) cost, i.e., the maximum value of accumulated branch metrics. An example of such a graph is the trellis (the state transition diagram) of a discrete-time finite state machine. The state sequence of the finite state machine marks a path through the trellis. If this path is to be estimated with the help of noisy measurements of the output of the finite state machine, and if this is solved by dynamic programming, then in communications this is called the "Viterbi algorithm" (VA) [2]. The VA was introduced in 1967 as a method to decode convolutional codes [3]. In the meantime the VA has found widespread applications in communications, e.g., in digital transmission, magnetic recording, and speech recognition. A comprehensive tutorial on the VA is given in [4].

The VA can be divided into three functional units: the branch metric unit (BMU), the add-compare-select unit (ACSU), and the survivor memory unit (SMU). Whereas the BMU and ACSU perform arithmetic operations such as addition, multiplication, and maximum/minimum selection, the SMU has to trace the course of a path with the help of decision pointers that were generated in the ACSU. Two basic methods for implementing the SMU are known, the register-exchange and the trace-back SMU, of which the first has minimal latency but large hardware complexity, while the latter has a smaller hardware complexity but longer latency.

The focus of this paper is on providing a novel algebraic framework for describing the survivor memory management problem. This enables the easy design of new SMU architectures, tailored to the desired latency/complexity optimization goal. In the following, a brief introduction to the VA is given in Section II. Section III describes the survivor memory problem, and furthermore its algebraic formulation is introduced [5]. Based on this, the following two sections outline architectural alternatives, i.e., continuous-flow processing in Section IV and block processing in Section V.

II. THE VITERBI ALGORITHM
Assume a discrete-time finite state machine with N states. Without loss of generality we assume that the transition diagram and the transition rate 1/T are constant in time. The trellis, which shows the transition dynamics, is a two-dimensional graph which is described in vertical direction by N states and in horizontal direction by time instants kT (T = 1). The states of time instant k are connected with those of time k+1 by the branches of time interval (k, k+1). Below we refer to a specific state i at time instant k as "node" s_{i,k}. A simple example of a trellis is given in Fig. 1(a) for N = 2 states. The notation used can be summarized as

    N          number of states
    k          time instant
    s_{i,k}    node: ith possible state of time instant k.
The finite-state machine chooses a path through the trellis, and with the help of the observed state transitions (over a noisy channel) the branch metrics of time interval (k, k+1) are computed. The best path through the trellis is calculated recursively by the VA, where best can mean, e.g., the most likely path. This is done recursively by computing N paths, i.e., the optimum path to each of the N nodes of time k. The N new optimum paths of time k+1 are calculated with the help of the old paths of time k and the branch metrics of time step (k, k+1). This shall be explained for the simple trellis shown in Fig. 1(a). As indicated in Fig. 1(b), each of the optimum paths of time k, i.e., each node s_{i,k}, has a path metric γ_{i,k} which is the accumulation of its branch metrics. Now the new optimum path leading to node s_{1,k+1} is the path with maximum metric leading to this node. Therefore the new path metric γ_{1,k+1} of node s_{1,k+1} is

    γ_{1,k+1} = max( λ_{11,k} + γ_{1,k},  λ_{12,k} + γ_{2,k} ),

where λ_{1i,k} denotes the metric of the branch from node s_{i,k} to node s_{1,k+1}, and the path metric of node s_{2,k+1} is computed in analogy. This is referred to as the add-compare-select (ACS) recursion of the VA.
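To make the ACS recursion concrete, the following Python sketch performs one ACS step for a general N-state trellis. It is only an illustration, not taken from the paper: state indices are 0-based, and the names acs_step, predecessors, and branch_metrics are ours.

    def acs_step(gamma, predecessors, branch_metrics):
        """One add-compare-select step of a maximum-metric Viterbi algorithm.

        gamma[i]              path metric of node s_{i,k}
        predecessors[i]       states that have a branch into state i
        branch_metrics[i][j]  metric of the branch from predecessors[i][j] into state i
        Returns the path metrics of time k+1 and, for every state, the decision
        (the surviving predecessor state), which is what the SMU has to process.
        """
        new_gamma, decisions = [], []
        for i, preds in enumerate(predecessors):
            candidates = [gamma[p] + branch_metrics[i][j]          # add
                          for j, p in enumerate(preds)]
            j_best = max(range(len(preds)), key=candidates.__getitem__)  # compare
            new_gamma.append(candidates[j_best])                   # select metric
            decisions.append(preds[j_best])                        # select pointer
        return new_gamma, decisions

For the two-state trellis of Fig. 1, predecessors would be [[0, 1], [0, 1]], and the returned decisions per step are exactly the pointers the SMU must store and trace.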
The problem which needs to be solved is to determine the best (unique) path with the help of the decisions of the ACS recursion. If all N paths are traced back in time, then they merge into a unique path, and this is exactly the best one which is to be found. The number of time steps that have to be traced back for the paths to have merged with high probability is called the survivor depth D. Therefore, in a practical implementation of the VA the latency of decoding is at least D time steps.

An implementation of the VA, referred to as a Viterbi detector (VD), can be divided into three basic units, as shown in Fig. 2. The input data is used in the branch metric unit (BMU) to calculate the set of branch metrics for each new time step. These are then fed to the add-compare-select unit (ACSU), which accumulates the branch metrics recursively as path metrics according to the ACS recursion. The survivor memory unit (SMU) processes the decisions which are being made in the ACSU and outputs the estimated path with a latency of at least D.

The problem solved by the SMU can therefore be stated as: find the state of time k − D. This is classically solved either by a register-exchange implementation, which has minimal latency but large hardware complexity and power consumption, or by a trace-back scheme with small power consumption but larger latency. Here an algebraic formulation of the survivor memory management is introduced which provides a framework for the derivation of new algorithmic and architectural solutions. VLSI case studies of specific new solutions have shown that more than 50% savings are possible in hardware complexity as well as power consumption.
Fig. 1. Example of trellis, add-compare-select, and detected paths. (a) Trellis with N = 2 states. (b) Decoding the optimum path to node s_{1,k+1} at time k+1. The paths merge when they are traced back D time steps.

Fig. 2. Block diagram of the Viterbi detector.

Fig. 3. Example of trace-back decision pointers for N = 4.
III. THE SURVIVOR MEMORY UNIT

The unit of the VD which is of concern in this paper is the SMU. Generally, two basic methods have been proposed for solving the problem of processing the decisions made in the ACSU to receive the detected path: the register-exchange (RE) and the trace-back (TB) SMU [6]. In case of an RE-SMU the new decisions of each iteration k are used to compute and store all N paths recursively, one to every state. Then the state of time k − D is simply determined by reading out the state of time k − D of one of the paths. In case of a TB-SMU the decisions are stored in a RAM, and then one path is traced back recursively D steps by using the stored decisions to determine the state of time k − D. At first glance this might seem not to be well suited for VLSI, since at each time step one new decision is written to the RAM and D decisions are read during the trace-back, making this a bottleneck for the iteration speed of the VD. However, by block-wise tracing back more than D steps at a time, a block of more than one state is determined per trace-back. Combining this with multiple trace-back pointers operating on multiple RAMs in parallel has allowed for the derivation of many efficient hardware solutions [7]–[9].
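As a simple behavioral illustration of the register-exchange principle (a sketch of the algorithmic behavior only, not of the register circuit; 0-based state indices, and the function name re_smu_step is ours), each survivor path can be stored explicitly and updated with the new decisions of every iteration:

    def re_smu_step(paths, decisions, D):
        """One register-exchange update.

        paths[i]      the last (up to D) states of the survivor path ending in state i
        decisions[i]  predecessor state selected by the ACS for state i at this step
        Returns the updated paths and, once the window is full, the decoded state
        of time k - D: the oldest entry of any path, since all paths have merged
        there with high probability.
        """
        new_paths = [(paths[decisions[i]] + [i])[-D:] for i in range(len(decisions))]
        decoded = new_paths[0][0] if len(new_paths[0]) == D else None
        return new_paths, decoded

The hardware analog keeps one register row per state and exchanges the rows according to the decisions every step, which is what makes the RE-SMU fast but register- and power-hungry.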
A. The Trace-Back SMU
A more detailed description of the trace-back scheme is as follows. At time k the current decision of state i points to its preceding state, for which we will use the notation b_k(i), with the value of b_k(i) ∈ {1, …, N} pointing to the state preceding state i. Hence, a set of N pointers {b_k(1), …, b_k(N)} makes up the decisions of time k. For ease of understanding see the example for N = 4 shown in Fig. 3. Now the trace-back procedure works by starting at an arbitrary state b at time k. Its decision b_k(b) determines the preceding state of time k−1, and the decision of this state determines the state of time k−2, as b_{k−1}(b_k(b)), etc., until by looking up D decisions in this trace-back manner

    b_{k−D+1}( ⋯ b_{k−1}( b_k(b) ) ⋯ )    (1)

the state of time k − D is determined.
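In software, the trace-back of (1) amounts to D successive pointer look-ups. The following minimal sketch uses 0-based state indices; decisions[t][i] plays the role of b_t(i), and the function name is ours:

    def trace_back(decisions, b, D):
        """Follow the decision pointers D steps back from state b at the latest time.

        decisions      list of pointer sets; decisions[t][i] = predecessor of state i
                       on the survivor path into state i at time t
        Returns the estimated state of time k - D, cf. (1).
        """
        state = b
        for t in range(len(decisions) - 1, len(decisions) - 1 - D, -1):
            state = decisions[t][state]
        return state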
As can be seen from the nature of this decision trace-back, the usual way of implementation is by using multiplexers to pick the next decision pointer in the scheme. However, this trace-back can also be formulated algebraically by introducing another notation for b_k(i). For b_k(i) = j the N-dimensional vector a_k(i) is defined as the all-zero vector except for a 1 entry at the jth position,

    a_k(i) := (0, …, 0, 1, 0, …, 0),    with the 1 at the jth position for b_k(i) = j.    (2)

Now the set of N decisions at time k forms the square matrix

    A_k := ( a_k(1); a_k(2); …; a_k(N) ),    (3)

whose ith row is the pointer vector a_k(i). Hence, if the starting state b of the trace-back is written as a (unit) vector b [in analogy to (2)], then a_k(b) can be written as the vector-matrix product

    a_k(b) = b · A_k.    (4)
Example: Assume a 4-state trellis where the decisions of time k are as shown in Fig. 3,

    b_k(1) = 2,   b_k(2) = 1,   b_k(3) = 2,   b_k(4) = 3,

then

          ( 0 1 0 0 )
    A_k = ( 1 0 0 0 )
          ( 0 1 0 0 )
          ( 0 0 1 0 ).

If we now multiply this matrix by (0, 0, 1, 0), this reads out the third row of A_k, i.e., it determines the preceding state of state 3 as b_k(3) = 2.
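The example can be reproduced directly. The following sketch uses 0-based state indices (so states 1–4 of Fig. 3 become 0–3), and the helper name decision_matrix is ours:

    import numpy as np

    def decision_matrix(b_k):
        """Matrix A_k of (3): row i is the unit vector a_k(i) with a 1 at column b_k[i]."""
        N = len(b_k)
        A = np.zeros((N, N), dtype=int)
        A[np.arange(N), b_k] = 1
        return A

    A_k = decision_matrix([1, 0, 1, 2])   # b_k(1)=2, b_k(2)=1, b_k(3)=2, b_k(4)=3, 0-based
    b = np.array([0, 0, 1, 0])            # start the trace-back at state 3
    print(b @ A_k)                        # [0 1 0 0]: the preceding state is state 2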
The significant result of the algebraic formulation is that the D-fold trace-back (1) can now be written as

    b · A_k A_{k−1} ⋯ A_{k−D+1}.    (5)

This is a D-fold vector-matrix product. It is to be noticed that this is just an algebraic formulation of the trace-back procedure. Hence, conventionally used multiplexer architectures for trace-back decoding can of course be applied here for the implementation of the vector-matrix multiplications. Furthermore, due to the simplicity of the matrix operations, it is clear that this can also be done by simple gate logic. The most important aspect of (5) is that the multiplication operation is associative. Therefore it can be carried out not only from left to right, but also in an arbitrary order, e.g., in a faster tree-like manner. In the following we shall now make use of this algebraic feature.
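Equation (5) and its associativity can be checked with a few lines of Python (a sketch; the decision matrices are assumed to be numpy arrays built as above, and the function names are ours). The left-to-right evaluation uses only vector-matrix products, whereas the tree-like evaluation combines the matrices pairwise:

    def traceback_vector(b, A_list):
        """Left-to-right evaluation of (5): b . A_k . A_{k-1} ... A_{k-D+1},
        one vector-matrix multiplication per trace-back step."""
        v = b
        for A in A_list:                  # A_list = [A_k, A_{k-1}, ..., A_{k-D+1}]
            v = v @ A
        return v

    def product_tree(A_list):
        """Tree-like evaluation of the matrix product (7); by associativity the
        result equals the left-to-right product, but the depth is only log2(D)."""
        while len(A_list) > 1:
            pairs = [A_list[i] @ A_list[i + 1] for i in range(0, len(A_list) - 1, 2)]
            if len(A_list) % 2:
                pairs.append(A_list[-1])
            A_list = pairs
        return A_list[0]

    # traceback_vector(b, A_list) == b @ product_tree(A_list) for any decision matrices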
IV. PIPELINE INTERLEAVING LOOK-AHEAD ARCHITECTURES

Since the trace-back decoding of the decisions principally has to take place at every new time instant, it is clear that the multiplication given in (5) is to be viewed as a sliding-window operation over the sequence {A_k}. Hence, at time k+1

    b · A_{k+1} A_k ⋯ A_{k−D+2}    (6)

has to be evaluated, and so on. It is to be noticed that, due to the fact that the associative law holds, the (D−1)-fold matrix-matrix multiplication of (5)

    A_k A_{k−1} ⋯ A_{k−D+1}    (7)

can be carried out first, and then the row of interest can be picked by applying b. The continuous "sliding window" computation of the expression (7) is analogous to the type of operation which is referred to as "pipeline interleaving look-ahead computation" for the parallelization of linear feedback loops¹ [10], [11]. Hence, all pipeline-interleaving architectures known for look-ahead computation can be applied for the continuous (sliding) evaluation of (7).

¹One other very important application of such a D-fold multiplication is the carry computation of a binary adder, for which different algorithms are known, e.g., carry-ripple, carry-skip, carry-select, and carry-look-ahead [10]. These architectures can all be transferred to derive SMU realizations.
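Viewed as a program, the sliding-window evaluation of (6) simply recombines the D most recent decision matrices at every time instant. The following sketch is ours (window handling and names are illustrative; the matrices are assumed to be numpy arrays as above):

    from collections import deque

    def sliding_traceback(decision_matrices, b, D):
        """Yield, for every time k >= D, the unit vector of the state of time k - D
        obtained by applying the window product (7) to the starting vector b."""
        window = deque(maxlen=D)
        for A in decision_matrices:       # A = A_k for k = 1, 2, ...
            window.appendleft(A)          # window = [A_k, A_{k-1}, ..., A_{k-D+1}]
            if len(window) == D:
                P = window[0]
                for M in list(window)[1:]:
                    P = P @ M             # the (D-1)-fold matrix product of (7)
                yield b @ P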
A. The Register-Exchange SMU

Fig. 4. Linear look-ahead pipeline-interleaving architecture (register exchange).

For notational ease, consider the m-fold partial products A_k A_{k−1} ⋯ A_{k−m+1}. The architecture known as "linear look-ahead" [10] for the sliding-window evaluation of (7) is shown in Fig. 4. In this case the current A_k is multiplied with D stored values in parallel, to obtain the following D results

    A_k,   A_k A_{k−1},   …,   A_k A_{k−1} ⋯ A_{k−D+1}.    (8)

As can be seen, the first element, A_k, indicates the preceding states of the N current paths. The next element, A_k A_{k−1}, determines the state two time steps back of every current path. By carrying this on, it can be seen that (8) yields exactly the state sequence of all N current paths of time k over the whole interval (k−D+1, k). Thus, it can easily be seen that the linear look-ahead architecture of Fig. 4 is the algebraic formulation of the RE-SMU.
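The update behind (8) can be written as one step of a recursion: the new decision matrix multiplies all stored partial products from the left. The following is a behavioral sketch of the linear look-ahead principle only, not of the register circuit; class and member names are ours, and the stored products are padded with identity matrices until D steps have elapsed:

    import numpy as np

    class LinearLookaheadSMU:
        """Keeps the D partial products of (8) and refreshes them every time step."""

        def __init__(self, N, D):
            self.D = D
            self.P = [np.eye(N, dtype=int) for _ in range(D)]  # P[m] ~ A_k A_{k-1} ... A_{k-m}

        def step(self, A_new):
            old = self.P
            # P[0] becomes the new A_k; every other product is A_k times an old product
            self.P = [A_new] + [A_new @ old[m] for m in range(self.D - 1)]
            return self.P[-1]   # A_k A_{k-1} ... A_{k-D+1}: row i is the unit vector of
                                # the state of time k - D on the survivor path of state i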
TABLE I. Comparison of the RE-SMU, the example of Fig. 5(a), the D-block TB-SMU, and the example of Fig. 6, in terms of total memory (split into registers and RAM), read/write (R/W) operations, vector-matrix multiplications, and latency (D, D, 4D, and 2D, respectively).

Note: as mentioned in Section IV-B, these multiplications can be more complex, especially for large N.
Fig. 6. Single multiplier feedback loop followed by block trace-back.

… of the corresponding time interval, and has given out m out of D decisions for the trace-back. The total RAM size therefore is only M = D pointers, each of complexity N. It is to be noticed that a block-wise trace-back always leads to giving out blocks of the detected path which internally are in time-reversed order. This can be corrected by a second RAM of size D, which again is operating in the same block-by-block LIFO manner [8]. Hence, the total latency is D + M = 2D. The additional RAM size is only D state indexes. This results in a total size of both LIFO RAMs of D × (N + 1). Compared to the analogous conventional two-pointer trace-back schemes [7], [8], this amounts to at least 50% savings in hardware as well as latency.

Of course this method can be generalized to the case where M is a divisor of D, D = f·M. Then a number of f multiplier feedback loops operate in parallel on the computation of the coarse-grain matrix products. In this case the total latency, comprised of the trace-back and of the time-ordering block LIFO, is reduced from 2D to D + M, and the RAM size is reduced to D × N + M × 1 = D × (N + 1/f). In comparison to conventional trace-back methods this new class of algorithms has a substantially reduced latency, RAM size, and trace-back pointer logic, at the cost of one, or in general f ≥ 1, additional matrix multipliers. Since these multipliers operate sequentially on one single decision matrix at a time, their complexity is exactly that of one stage of a conventional RE-SMU.
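The following Python sketch models the behavior of such a scheme. It is only a rough behavioral model under our own assumptions (blocks of length M = D, an arbitrary starting state at each block boundary, 0-based state indices, and class and variable names that are not from the paper): one matrix multiplier in a feedback loop accumulates the block product, the per-step decisions go into a RAM, and a LIFO restores the time order of each decoded block.

    import numpy as np

    class BlockTracebackSMU:
        """Single multiplier feedback loop followed by block trace-back (behavioral sketch)."""

        def __init__(self, N, D):
            self.N, self.D = N, D
            self.P = np.eye(N, dtype=int)   # running product of the current block
            self.ram = []                   # decisions of the current block
            self.prev_ram = None            # decisions of the previous block

        def step(self, b_k):
            """b_k[i] = predecessor of state i. Returns a decoded block of D states
            (oldest first) once every D steps, and None otherwise."""
            A = np.zeros((self.N, self.N), dtype=int)
            A[np.arange(self.N), b_k] = 1
            self.P = A @ self.P             # the single multiplier in the feedback loop
            self.ram.append(list(b_k))
            if len(self.ram) < self.D:
                return None
            out = None
            if self.prev_ram is not None:
                # coarse trace-back over the whole current block in one shot:
                state = int(np.argmax(self.P[0]))     # survivor of (arbitrary) state 0,
                                                      # D time steps back
                lifo = []
                for dec in reversed(self.prev_ram):   # fine trace-back through old block
                    lifo.append(state)
                    state = dec[state]
                out = lifo[::-1]                      # the LIFO restores the time order
            self.prev_ram = self.ram
            self.ram, self.P = [], np.eye(self.N, dtype=int)
            return out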
B. Further Methods

The description of all possible algorithms and architectures for survivor memory management would by far exceed the scope of this paper. The intention here lies in showing that the algebraic notation provides a framework for finding a large variety of new solutions. To point out the large span of new algorithms and architectures that can be derived, two directions of further research shall be pointed out in the following.

If a multiplier feedback loop is used for computing the running product of the decision matrices, then the coarse-grain sequence of products obtained every L steps can be used to perform trace-backs in the larger step size L, to cut down the latency of the SMU. If expression (10) is examined more closely, it can be seen that it can also be written as the product of the trace-back factor (b times the accumulated decision-matrix product) with a vector of partial products

    ( A_{k−D},   A_{k−D} A_{k−D−1},   …,   A_{k−D} ⋯ A_{k−D−M+1} ).    (12)

The contents of the vector in expression (12) is exactly what is computed by a register-exchange SMU of length M; see Section IV-A. Hence, combinations of register-exchange and trace-back promise to yield further solutions of interest. In addition, note that the algebraic formulation can also lead to simplified software implementations. For example, for an N = 2 state problem it can easily be seen that the logarithmic look-ahead RE-SMU of Fig. 5 can be much more efficient to implement than any other solution.

VI. DISCUSSION

Due to the variety of possible different technologies that may be used for implementing the architectures discussed in this paper, it is difficult to find an objective measure to compare them. To allow for some objective comparisons to be made, the total amount of memory must be divided into memory which can be realized by RAM and memory that must be realized by registers. In addition, the multiplications can be divided into vector-matrix and matrix-matrix multiplications, where the latter is N times as complex as the former, since it comprises N vector-matrix multiplications. A basic measure of power consumption is the number of vector-matrix multiplications and the number of read and write (R/W) operations that are necessary. Therefore, for power consumption comparisons, the number of R/W operations must be added as a measure. Using these more detailed measures, the solutions which are compared in Table I are the RE-SMU, the TB-SMU with block trace-back of block length D, and the new SMU architectures of Figs. 5(a) and 6.

It can be seen that the algebraic formulation of the SMU problem allowed for an easy design of new architectures, which are sample points in the large space of solutions with differing latency, memory complexity, and arithmetic complexity. The
algebraic formulation enables solutions to be designed with greatly reduced latency and/or complexity, and it also allows a tradeoff between latency, hardware complexity, and power consumption to be achieved.
VII. CONCLUSION

In this paper an algebraic formulation of the survivor memory management of Viterbi detectors is introduced. This reveals the fact that the problem of survivor memory implementation is analogous to the realization of look-ahead in parallelized linear feedback loops. Hence, next to finding new solutions, a wide range of known solutions can be transferred and adapted from this well-known problem. They mainly present novel approaches for survivor memory realization. VLSI case studies of novel algorithms and architectures have shown that 50% savings in hardware and/or latency can be achieved. The algebraic formulation introduced here is related to the algebraic formulation of the add-compare-select recursion of the Viterbi detector, introduced in [14], [15]. Hence, it now is easy to derive well-matched survivor memory realizations also for all parallelized Viterbi detectors.
REFERENCES

[1] R. E. Bellman and S. E. Dreyfus, Applied Dynamic Programming. Princeton, NJ: Princeton University Press, 1962.
[2] J. K. Omura, "On the Viterbi decoding algorithm," IEEE Trans. Inform. Theory, pp. 177–179, Jan. 1969.
[3] A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Trans. Inform. Theory, vol. IT-13, pp. 260–269, Apr. 1967.
[4] G. D. Forney, "The Viterbi algorithm," Proc. IEEE, vol. 61, pp. 268–278, Mar. 1973.
[5] G. Fettweis, "Algebraic survivor memory management for Viterbi detectors," in Proc. IEEE Int. Conf. Commun. (ICC'92), Chicago, IL, June 1992, pp. 313.4.1–313.4.5.
[6] C. M. Rader, "Memory management in a Viterbi decoder," IEEE Trans. Commun., vol. COM-29, pp. 1399–1401, Sept. 1981.
[7] R. Cypher and C. B. Shung, "Generalized trace back techniques for survivor memory management in the Viterbi algorithm," in Proc. IEEE GLOBECOM, San Diego, CA, Dec. 1990, vol. 2, pp. 1318–1322.
[8] G. Feygin and P. G. Gulak, "Survivor memory management in Viterbi decoders," IEEE Trans. Commun., vol. 39, 1991.
[9] T. K. Truong, M.-T. Shih, I. S. Reed, and E. H. Satorius, "A VLSI design for a trace-back Viterbi decoder," IEEE Trans. Commun., vol. 40, pp. 616–624, Mar. 1992.
[10] G. Fettweis, L. Thiele, and H. Meyr, "Algorithm transformations for unlimited parallelism," in Proc. IEEE Int. Symp. Circuits and Syst., New Orleans, LA, May 1990, vol. 2, pp. 1756–1759.
[11] K. K. Parhi and D. G. Messerschmitt, "Block digital filtering via incremental block-state structures," in Proc. IEEE Int. Symp. Circuits and Syst., Philadelphia, PA, 1987, pp. 645–648.
[12] —, "Pipelined VLSI recursive filter architectures using scattered look-ahead and decomposition," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing, New York, 1988, pp. 2120–2123.
[13] L. Thiele and G. Fettweis, "Algorithm transformations for unlimited parallelism," Electron. and Commun. (AEÜ), vol. 2, pp. 83–91, Apr. 1990.
[14] G. Fettweis and H. Meyr, "High-speed Viterbi processor: A systolic array solution," IEEE J. Select. Areas Commun., vol. 8, pp. 1520–1534, Oct. 1990.
[15] —, "High-speed parallel Viterbi decoding," IEEE Commun. Mag., pp. 46–55, May 1991.
Gerhard Fettweis (S'84–M'90) received the Dipl.-Ing. and the Ph.D. degrees in electrical engineering from the Aachen University of Technology, Aachen, Germany, in 1986 and 1990, respectively.
He is a scientist at TCSI Corporation, Berkeley, CA. In 1986 he worked at the ABB research laboratory, Baden, Switzerland, on his Diplom thesis. During 1991 he was a visiting scientist at the IBM Almaden Research Center, San Jose, CA. His interests are in microelectronics and digital wireless communications, especially the interaction between algorithm and architecture design for high-performance VLSI processor implementations.
Dr. Fettweis is a member of the IEEE Solid-State Circuits Council as representative of the IEEE Communications Society, and is Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II.