New Bit-Parallel Systolic Multiplier over GF($2^m$) Using the Modified Booth's Algorithm

Chiou-Yng Lee*, Hsin-Chiu Yu, Che Wun Chiou

LungHwa University, Email: [email protected]
Chang Gung University, Hsing Wu College, Email: [email protected]
Ching Yun University, Email: [email protected]

Abstract– A new algorithm for the multiplication of two elements in GF($2^m$) based on the modified Booth's algorithm is presented. The proposed algorithm permits an efficient realization of multiplexer-based bit-parallel multiplication using iterative arrays. The multiplier has a latency of 3m/2 clock cycles. The complexity of the proposed multiplier is estimated by its transistor count in a standard CMOS VLSI realization. Our analysis shows that, in terms of both time and space complexity, the multiplexer-based array architecture is the better choice for the proposed bit-parallel systolic multiplier.

I. Introduction

Many important applications, such as error-correcting codes [1] and cryptography [2], are based on GF($2^m$) arithmetic operations. Hence, efficient GF($2^m$) arithmetic with low circuit complexity, high throughput rate and short computation delay is required. Since Yeh, Reed, and Truong [3] first suggested a bit-parallel systolic multiplier over GF($2^m$), various bit-parallel systolic multipliers have been reported [4-6]. Fenn et al. [4] proposed a systolic multiplier using the dual basis representation. Wang and Lin [5] suggested two parallel systolic multipliers using a unidirectional data flow. Wei [6] proposed a power-sum circuit systolic multiplier for computing $AB^2 + C$ over GF($2^m$). However, these existing multipliers require a latency of at least 3m clock cycles.

On the other hand, the key idea of the "divide-and-conquer" methodology was first introduced by Booth [7]. In this technique, intermediate results are always in a redundant form of two integer numbers. In 1999, Pekmestzi [8] presented a multiplexer-based array multiplier that efficiently reduces complexity by using the modified Booth's algorithm. Building on the basic idea of the multiplexer-based implementation, this article implements the GF($2^m$) multiplier with multiplexers, in contrast to previous GF($2^m$) multipliers that use AND and XOR gates.

In this paper, we reduce the latency of the multiplication algorithm by using the modified Booth's algorithm. The algorithm can be implemented as a multiplexer-based bit-parallel systolic multiplier. The multiplier has a latency of 3m/2 clock cycles, as opposed to 3m in the existing multipliers of [3,5]. Therefore, the latency of the proposed multiplier is half that of the existing designs. The complexity of the proposed multiplier is estimated by its transistor count in a standard CMOS VLSI realization. As m increases, the proposed multipliers become less complex than the other multipliers. Our analysis shows that, in terms of both time and space complexity, the multiplexer-based array architecture is the better choice for the proposed bit-parallel systolic multipliers.

II. Finite Field Representation

We assume that the reader is familiar with the basic notions and theory of finite fields; see [1-2] for details. In the following paragraphs, we briefly introduce some basic operations of finite fields.

It is well known that the finite field GF($2^m$) can be viewed as a vector space of dimension m over GF(2). A basis of the form $\{1, \alpha, \alpha^2, \cdots, \alpha^{m-1}\}$ is called a polynomial basis of GF($2^m$), and $\alpha$ is called a primitive element of GF($2^m$). A polynomial $P(x) = p_0 + p_1 x + \cdots + p_{m-1}x^{m-1} + x^m$ of degree m over GF(2) is primitive if it is irreducible and has period $2^m - 1$. If $\alpha \in$ GF($2^m$) is a root of a primitive polynomial $P(x)$, then $\alpha^m = p_0 + p_1\alpha + \cdots + p_{m-1}\alpha^{m-1}$. Let the set $\{1, \alpha, \alpha^2, \cdots, \alpha^{m-1}\}$ be a polynomial basis; then every element A in GF($2^m$) can be represented by $A = a_0 + a_1\alpha + \cdots + a_{m-1}\alpha^{m-1}$, where $a_i \in$ GF(2) ($0 \le i \le m-1$) is the ith coordinate of A. Each element in GF($2^m$) has a unique representation as a linear combination of the polynomial basis elements. The multiplication of two elements in GF($2^m$) is uniquely determined by $\alpha^m = \sum_{j=0}^{m-1} p_j\alpha^j$, $p_j \in$ GF(2), since $P(\alpha) = 0$. Namely, multiplication in GF($2^m$) can be performed as polynomial multiplication modulo $P(x)$ on the field elements represented as polynomials of degree $m-1$ or less.

III. Proposed Multiplexer-Based Multiplication over GF($2^m$) Using Modified Booth's Algorithm

Booth [7] first introduced the key idea of the "divide-and-conquer" methodology. The major concept is that intermediate results are always in a redundant form of two integer numbers. Since Pekmestzi [8] presented in 1999 a multiplexer-based array multiplier whose complexity was efficiently reduced by using the modified Booth's algorithm, this section aims to implement the GF($2^m$)
multiplier using multiplexers rather than the AND and XOR gates used in previous GF($2^m$) multipliers.

Assume that the finite field GF($2^m$) is generated by a primitive polynomial of the form $P(x) = p_0 + p_1 x + p_2 x^2 + \cdots + p_{m-1}x^{m-1} + x^m$ over GF(2). Let $\alpha$ be a root of $P(x)$, i.e., $P(\alpha) = 0$. Then we have

$$\alpha^m = p_0 + p_1\alpha + p_2\alpha^2 + \cdots + p_{m-1}\alpha^{m-1} \tag{1}$$

$$\alpha^{m+1} = p_0\alpha + p_1\alpha^2 + p_2\alpha^3 + \cdots + p_{m-1}\alpha^m = p'_0 + p'_1\alpha + p'_2\alpha^2 + \cdots + p'_{m-1}\alpha^{m-1} \tag{2}$$

where

$$p'_j = \begin{cases} p_{j-1} + p_{m-1}p_j & \text{for } 1 \le j \le m-1 \\ p_{m-1}p_j & \text{for } j = 0 \end{cases}$$

Given the element $A = a_0 + a_1\alpha + a_2\alpha^2 + \cdots + a_{m-1}\alpha^{m-1}$ in GF($2^m$), a common computation is the multiply-by-$\alpha$ operation, which follows the rule in Eq. (4). Let us define $A' = A\alpha$; we obtain

$$A' = A\alpha = a_0\alpha + a_1\alpha^2 + a_2\alpha^3 + \cdots + a_{m-1}\alpha^m \tag{3}$$

Substituting (1) into (3), we have

$$A' = a'_0 + a'_1\alpha + a'_2\alpha^2 + \cdots + a'_{m-1}\alpha^{m-1} \tag{4}$$

where

$$a'_j = \begin{cases} a_{j-1} + a_{m-1}p_j & \text{for } 1 \le j \le m-1 \\ a_{m-1}p_j & \text{for } j = 0 \end{cases}$$

Next, let $A = a_0 + a_1\alpha + a_2\alpha^2 + \cdots + a_{m-1}\alpha^{m-1}$, $B = b_0 + b_1\alpha + b_2\alpha^2 + \cdots + b_{m-1}\alpha^{m-1}$ and $C = c_0 + c_1\alpha + c_2\alpha^2 + \cdots + c_{m-1}\alpha^{m-1}$ be three elements in GF($2^m$) with a primitive polynomial $P(x)$, where C is the product of A and B. Assuming that $A'$ is pre-computed using Eq. (4), the product C can be represented by

$$\begin{aligned}
C &= AB = A\Big(\sum_{j=0}^{m-1} b_j\alpha^j\Big) \\
&= \alpha^{m-2}(b_{m-1}\alpha A + b_{m-2}A) + A\Big(\sum_{j=0}^{m-3} b_j\alpha^j\Big) \\
&= \alpha^{m-2}(b_{m-1}A' + b_{m-2}A) + A\Big(\sum_{j=0}^{m-3} b_j\alpha^j\Big) \\
&= \alpha^{m-4}\big((b_{m-1}A' + b_{m-2}A)\alpha^2 + b_{m-3}A' + b_{m-4}A\big) + A\Big(\sum_{j=0}^{m-5} b_j\alpha^j\Big)
\end{aligned} \tag{5}$$

Let $T_i = t_{i,0} + t_{i,1}\alpha + t_{i,2}\alpha^2 + \cdots + t_{i,m-1}\alpha^{m-1}$ be the ith intermediate product; then the product C can be computed recursively as follows, for $1 \le i \le m/2$:

$$T_0 = 0 \tag{6}$$

$$T_i = T_{i-1}\alpha^2 + A'b_{m-2i+1} + Ab_{m-2i} \tag{7}$$

$$C = T_{m/2} \tag{8}$$

Substituting Eq. (1) and Eq. (2) into Eq. (7) yields

$$\begin{aligned}
T_i ={}& t_{i-1,m-3}\alpha^{m-1} + \cdots + t_{i-1,1}\alpha^3 + t_{i-1,0}\alpha^2 \\
&+ t_{i-1,m-1}(p'_{m-1}\alpha^{m-1} + \cdots + p'_1\alpha + p'_0) \\
&+ t_{i-1,m-2}(p_{m-1}\alpha^{m-1} + \cdots + p_1\alpha + p_0) \\
&+ b_{m-2i+1}(a'_{m-1}\alpha^{m-1} + \cdots + a'_1\alpha + a'_0) \\
&+ b_{m-2i}(a_{m-1}\alpha^{m-1} + \cdots + a_1\alpha + a_0)
\end{aligned} \tag{9}$$

where

$$t_{i,j} = \begin{cases} v_{i,j} + k_{i,j} & \text{for } j = 0, 1 \\ t_{i-1,j-2} + v_{i,j} + k_{i,j} & \text{for } 2 \le j \le m-1 \end{cases} \tag{10}$$

$$v_{i,j} = t_{i-1,m-2}\,p_j + t_{i-1,m-1}\,p'_j$$

$$k_{i,j} = b_{m-2i}\,a_j + b_{m-2i+1}\,a'_j$$

TABLE I. The values of $v_{i,j}$

$p_j$   $p'_j$   $v_{i,j}$
1       1        $t_i$
1       0        $t_{i-1,m-2}$
0       1        $t_{i-1,m-1}$
0       0        0

TABLE II. The values of $k_{i,j}$

$a_j$   $a'_j$   $k_{i,j}$
1       1        $b_i$
1       0        $b_{m-2i}$
0       1        $b_{m-2i+1}$
0       0        0

Since the modified Booth's recoding is considered, the quantity $v_{i,j}$ can be determined by the values of $p_j$ and $p'_j$, as shown in Table I. The quantity $v_{i,j}$ requires an actual computation only when both $p_j$ and $p'_j$ have the logical value 1. Hence, $v_{i,j}$ can be selected from 0, $t_{i-1,m-2}$, $t_{i-1,m-1}$ and the sum $t_i = t_{i-1,m-2} + t_{i-1,m-1}$, depending on the values of $p_j$ and $p'_j$. Similarly, $k_{i,j}$ can be selected from 0, $b_{m-2i+1}$, $b_{m-2i}$ and the sum $b_i = b_{m-2i+1} + b_{m-2i}$, depending on the values of $a_j$ and $a'_j$, as shown in Table II (in the tables, $b_i$ denotes this sum, not the ith coordinate of B). Therefore, the computation of $t_{i,j}$ in Eq. (10) needs two 4x1 multiplexers and one 3-input XOR gate when the values of $t_i$ and $b_i$ are precomputed.
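
The recursion of Eqs. (6)-(8) and the multiplexer selection of Eq. (10) with Tables I and II can be checked in software. The following Python sketch is an illustration only, not the authors' implementation: the integer encoding (bit j holds the coefficient of $\alpha^j$), the function names `mul_alpha`, `mul_reference`, `bits` and `mul_mux`, and the example polynomial $x^4 + x + 1$ are all assumptions made here, and m is assumed to be even.

```python
def mul_alpha(x, poly, m):
    """Multiply field element x by alpha, reducing with alpha^m from Eq. (1)."""
    x <<= 1
    if (x >> m) & 1:   # an alpha^m term appeared
        x ^= poly      # poly includes the x^m bit, so this also clears it
    return x


def mul_reference(a, b, poly, m):
    """Plain shift-and-add product, used here only as a cross-check."""
    c = 0
    for j in range(m):
        if (b >> j) & 1:
            c ^= a
        a = mul_alpha(a, poly, m)
    return c


def bits(x, m):
    """Coefficient list [x_0, ..., x_{m-1}] of a field element (hypothetical helper)."""
    return [(x >> j) & 1 for j in range(m)]


def mul_mux(a, b, poly, m):
    """Bit-level recursion of Eq. (10); m is assumed to be even."""
    assert m % 2 == 0
    low = poly ^ (1 << m)                                    # alpha^m as an element
    p, p1 = bits(low, m), bits(mul_alpha(low, poly, m), m)   # p_j and p'_j, Eq. (2)
    a0, a1 = bits(a, m), bits(mul_alpha(a, poly, m), m)      # a_j and a'_j, Eq. (4)
    bb = bits(b, m)
    t = [0] * m                                              # T_0 = 0, Eq. (6)
    for i in range(1, m // 2 + 1):
        t_lo, t_hi = t[m - 2], t[m - 1]
        b_lo, b_hi = bb[m - 2 * i], bb[m - 2 * i + 1]
        # 4x1 multiplexer inputs, keyed by (unprimed, primed) select bits
        mux1 = {(0, 0): 0, (1, 0): t_lo, (0, 1): t_hi, (1, 1): t_lo ^ t_hi}  # Table I
        mux2 = {(0, 0): 0, (1, 0): b_lo, (0, 1): b_hi, (1, 1): b_lo ^ b_hi}  # Table II
        # 3-input XOR of t_{i-1,j-2}, v_{i,j} and k_{i,j}
        t = [
            (t[j - 2] if j >= 2 else 0) ^ mux1[(p[j], p1[j])] ^ mux2[(a0[j], a1[j])]
            for j in range(m)
        ]
    return sum(bit << j for j, bit in enumerate(t))          # C = T_{m/2}, Eq. (8)


POLY = 0b10011  # x^4 + x + 1, a primitive polynomial chosen for this example
print(bin(mul_mux(0b1000, 0b0100, POLY, 4)))  # alpha^3 * alpha^2 -> 0b110
```

In this sketch each pass of the loop consumes two bits of B, which is why only m/2 passes are needed; the two dictionaries play the role of the two 4x1 multiplexers, with the sums $t_i$ and $b_i$ computed once per pass.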

IV. New Multiplexer-Based Bit-Parallel Systolic Multiplier over GF($2^m$)

In this section, a parallel-in, parallel-out, two-dimensional systolic array is presented for performing the proposed multiplication in GF($2^m$) with the polynomial basis. The discussion in this section is limited to the finite field GF($2^4$); an analogous development can be constructed for any finite field GF($2^m$). Fig. 1 shows a signal flow graph (SFG) array for realizing the recursion given in Eqs. (6)-(10) with m = 4. It consists of $m^2/2$ U-cells, m V-cells and m/2 Q-cells. A cell located at the ith row and the jth column of the SFG array is referred to as the (i,j) cell. In Fig. 1, the cells of the first row are V-cells and the cells of the first column are Q-cells; all other cells are U-cells. Each V-cell in position (0,j), for $0 \le j \le m-1$, uses one 2-input AND gate and one 2-input XOR gate to compute $a'_j = a_{m-1}p_j + a_{j-1}$, as shown in Fig. 3. Applying the basic concept of the modified Booth's algorithm, each Q-cell in Fig. 4 generates two sets, $(t_{i-1,m-1}, t_{i-1,m-2}, t_i)$ and $(b_{m-2i+1}, b_{m-2i}, b_i)$, which the U-cells use to determine the values $v_{i,j}$ and $k_{i,j}$, respectively. Once the two sets are generated in the $Q_i$ cell, the $U_{i,j}$ cell in Fig. 2, for $0 \le j \le m-1$ and $1 \le i \le m/2$, performs the following operations:

1) MUX1: Determine the value $v_{i,j}$, which is selected from 0, $t_{i-1,m-2}$, $t_{i-1,m-1}$ and $t_i$ by the values of $p_j$ and $p'_j$;
2) MUX2: Determine the value $k_{i,j}$, which is selected from 0, $b_{m-2i+1}$, $b_{m-2i}$ and $b_i$ by the values of $a_j$ and $a'_j$;
3) 3-input XOR: Perform the computation of $t_{i,j} = t_{i-1,j-2} + v_{i,j} + k_{i,j}$.

Fig. 1. The signal flow graph (SFG) array for GF($2^4$)
Fig. 2. The detailed circuit of the U-cell
Fig. 3. The detailed circuit of the V-cell
Fig. 4. The detailed circuit of the Q-cell

Applying the cut-set systolization techniques in [9], and supposing that two adjacent cells in the horizontal direction are combined into one cell to obtain a modified SFG array, we can derive a new parallel-in parallel-out systolic multiplier for computing AB in GF($2^m$), as shown in Fig. 5. The circuit consists of $m^2/4$ U cells, m/2 V cells and m/2 Q-cells. Each U cell is composed of two 3-input XOR gates, four MUX4x1 gates and 24 1-bit latches, as depicted in Fig. 6. Each V cell consists of two 2-input AND gates, two 2-input XOR gates and 10 1-bit latches, as shown in Fig. 7. The longest propagation delay of a cell is the sum of the delays of one MUX4x1 gate and one 3-input XOR gate. The proposed systolic multiplier over GF($2^m$) can produce one result every clock cycle with an initial delay of 3m/2 clock cycles.

Fig. 5. The proposed bit-parallel systolic multiplier over GF($2^4$)

Several bit-parallel systolic multipliers have been reported for performing multiplication in GF($2^m$) [3-6]. Their architectures use AND and XOR gates. The architectures for computing multiplication in GF($2^m$) in [5] require a latency of at least 3m clock cycles and incorporate $m^2$ identical cells. Each cell is composed of two 2-input AND
gates, one 3-input XOR gate, and seven 1-bit latches. However, it is difficult to reduce the latency of these multipliers because their multiplication algorithms are based on an MSB-first scheme. To reduce the latency, this paper has presented two new bit-parallel multiplexer-based systolic multipliers over GF($2^m$) using the modified Booth's algorithm. Both multipliers have a latency of 3m/2 clock cycles and a throughput rate of one result per clock cycle. A circuit comparison between the proposed multipliers and the existing multipliers is given in Table III. From this table, we can see that the proposed multipliers have a smaller latency than Wang-Lin's multiplier [5].

Fig. 6. The detailed circuit of the U cell
Fig. 7. The detailed circuit of the V cell

TABLE III. Comparison of the related systolic multipliers over GF($2^m$)

Item                         Wang-Lin [5]                 Proposed (Fig. 5)
Number of cells              $m^2$                        V cell: m/2; U cell: $m^2/4$; Q cell: m/2
Throughput                   1                            1
Cell complexity              2 AND2, 1 XOR3, 7 latches    per-cell breakdown listed below
Computation time per cell    $T_{AND2} + T_{XOR3}$        $T_{MUX4x1} + T_{XOR3}$
Latency (unit = cycles)      3m                           3m/2

Cell complexity of the proposed multiplier:

Gate      V cell   U cell   Q cell
AND2      2        0        0
XOR2      2        0        2
XOR3      0        2        0
MUX4x1    0        4        0
Latch     10       24       6

The complexity of the proposed multiplier is estimated by its transistor count in a standard CMOS VLSI realization. In CMOS VLSI technology, a 2-input AND gate, a 2-input XOR gate, a 1-bit latch and a MUX4x1 are composed of 6, 6, 8, and 16 transistors, respectively. Table IV shows the total number of transistors for various bit-parallel systolic multipliers. As m increases, our analysis shows that the proposed multiplier in Fig. 5 saves 13% of the transistors compared with the multipliers proposed in [4], [5], [10]. Therefore, the proposed algorithm using the multiplexer-based array yields a low-latency and low-complexity parallel systolic architecture.

TABLE IV. The total number of transistors for various bit-parallel systolic multipliers over GF($2^m$)

Multiplier           Basis            Total number of transistors
Wang-Lin [5]         polynomial       $42m^2 - 4m$
Fenn et al. [4]      dual             $42m^2 - 4m$
Lee-Chiou [10]       type-2 normal    $39m^2 + 6m$
Proposed (Fig. 5)    dual             $34m^2 + 35m + 4$

V. Conclusions

In this paper, we have proposed a novel multiplication algorithm using the modified Booth's algorithm that permits efficient VLSI realizations. The proposed algorithm can use a multiplexer-based array to implement a new bit-parallel systolic multiplier. The multiplier has a latency of 3m/2 clock cycles, as opposed to 3m in previous works that use AND and XOR gates. Therefore, the latency of the proposed multiplier is half that of the previous designs. The complexity of the proposed multiplier is estimated by its transistor count in a standard CMOS VLSI realization. Table IV shows that our proposed multiplier is less complex than other bit-parallel systolic multipliers. Therefore, in terms of both time and space complexity, the multiplexer-based array architecture is the better choice for the proposed bit-parallel systolic multipliers. Moreover, the proposed multipliers have attractive features for high-speed VLSI system design, such as simplicity, regularity and modularity.

References

[1] F. J. MacWilliams and N. J. A. Sloane, The Theory of Error-Correcting Codes, Amsterdam: North-Holland, 1977.
[2] R. Lidl and H. Niederreiter, Introduction to Finite Fields and Their Applications, New York: Cambridge Univ. Press, 1994.
[3] C. S. Yeh, I. S. Reed, and T. K. Truong, "Systolic Multipliers for Finite Fields GF($2^m$)," IEEE Trans. Computers, vol. C-33, pp. 357-360, Apr. 1984.
[4] S. T. J. Fenn, M. Benaissa, and O. Taylor, "Dual basis systolic multipliers for GF($2^m$)," IEE Computers and Digital Techniques, vol. 144, no. 1, pp. 43-46, Jan. 1997.
[5] C. L. Wang and J. L. Lin, "Systolic Array Implementation of Multipliers for GF($2^m$)," IEEE Trans. Circuits and Systems II, vol. 38, pp. 796-800, July 1991.
[6] S. W. Wei, "A Systolic Power-Sum Circuit for GF($2^m$)," IEEE Trans. Computers, vol. 43, no. 2, pp. 226-229, Feb. 1994.
[7] A. Booth, "A Signed Binary Multiplication Technique," Q. J. Mech. Appl. Math., vol. 4, pp. 236-240, 1951.
[8] K. Z. Pekmestzi, "Multiplexer-based Array Multipliers," IEEE Trans. Computers, vol. 48, no. 1, pp. 15-23, Jan. 1999.
[9] S. Y. Kung, VLSI Array Processors, Englewood Cliffs, NJ: Prentice-Hall, 1988.
[10] C. Y. Lee and C. W. Chiou, "Design of low-complexity bit-parallel systolic Hankel multipliers to implement multiplication in normal and dual bases of GF($2^m$)," IEICE Trans. Fund., vol. E88-A, no. 11, pp. 3169-3179, Nov. 2005.