
New bit-parallel Montgomery multiplier for trinomials using squaring operation

Yin Li, Yiyang Chen

Abstract—In this paper, a new bit-parallel Montgomery multiplier for $GF(2^m)$ is presented, where the field is generated with an irreducible trinomial. We first present a slightly generalized version of a newly proposed divide and conquer approach. Then, by combining this approach and a carefully chosen Montgomery factor, the Montgomery multiplication can be transformed into a composition of small polynomial multiplications and Montgomery squarings, which are simpler and more efficient. Explicit complexity formulae in terms of gate counts and time delay of our architecture are investigated. As a result, the proposed multiplier has generally 25% lower space complexity than the fastest multipliers, with time complexity as good as or better than previous Karatsuba-based multipliers for the same class of fields. Among the five irreducible polynomials recommended by NIST for the ECDSA (Elliptic Curve Digital Signature Algorithm), there are two trinomials which are available for our architecture. We show that our proposal outperforms the previous best known results if the space and time complexity are both considered.

Index Terms—Montgomery multiplication, squaring, bit-parallel, trinomials.

1 INTRODUCTION

Efficient hardware implementation of multiplication over $GF(2^m)$ is very important in many areas such as coding theory, computer algebra and public key cryptosystems [1], [2]. Nowadays, more and more circuit gates can be placed on a single chip, which makes bit-parallel architectures possible and reasonable. In recent years, a number of bit-parallel $GF(2^m)$ multiplier schemes and architectures have been proposed to achieve higher computation speed or lower area complexity. They cover a wide range of cases with respect to different basis representations [4], [5] and generating polynomials [6], [7], [8].

Montgomery multiplication is an important algorithm that was originally used for fast modular integer multiplication [9] and then extended to field multiplication over $GF(2^m)$ [10] and $GF(p^m)$ with $p > 2$ [11]. In [10], Koç and Acar introduced a class of algorithms for software implementation of Montgomery multiplication. They argued that Montgomery multiplication can be implemented efficiently if the Montgomery factor is chosen properly. The hardware implementation of Montgomery multiplication is investigated in [12], [13]. The Montgomery factor in these works is selected as $x^m$. In [14], Wu proposed a new bit-parallel Montgomery multiplier for irreducible trinomials using a different factor. His scheme is based on a slightly generalized version of the method proposed in [10], and he showed that choosing the Montgomery factor as the middle term of the trinomial $x^m + x^k + 1$ results in an efficient bit-parallel multiplier and squarer which are at least as good as previous proposals. Some systolic architectures have also been proposed for Montgomery multiplication with trinomials, e.g., [15], [16], [17]. Hariri and Reyhani-Masoleh [18] further improved Wu's proposal. Besides new recommendations for the Montgomery factor, fast bit-serial and bit-parallel multiplier architectures are also given for irreducible trinomials and pentanomials. It is argued that their scheme matches the best known result reported in the literature. Hariri and Reyhani-Masoleh's scheme is very fast, but their architecture costs about $2m^2$ circuit gates.

In this paper, our work is devoted to designing a bit-parallel non-systolic Montgomery multiplier for trinomials, which obtains a trade-off between space and time complexity. We start by describing a slightly extended version of the algorithm of Park et al. [19], referred to as the PCHS algorithm. Then, by combining this algorithm with Montgomery squaring operations, a new bit-parallel multiplier architecture is proposed. The main contributions of our work are as follows:

• The space complexity of our proposal is about 25% less than that of any other Montgomery or Mastrovito multiplier for trinomials, and matches the Karatsuba multiplier proposed by Elia [23].
• The time complexity of our proposal is slightly higher than that of the fastest multipliers, but by no more than $2T_X$.
• For the range $m \in [100, 1023]$ and $k \le m/2$, there are 1405 irreducible trinomials. For 1061 of these trinomials, the time complexity is equal to $T_A + (2 + \lceil \log_2 m \rceil)T_X$; for the other trinomials, it is only $1T_X$ more.

The remainder of this paper is organized as follows: In Section 2, we briefly review the PCHS algorithm and the Montgomery squaring operation over $GF(2^m)$. Then we describe a slightly extended PCHS algorithm. Based on it, a new bit-parallel Montgomery multiplier is developed in Section 3. In Section 4, we further analyze its complexity and present a comparison between our proposal and some others. Finally, some conclusions are drawn.

• Yin Li is with Department of Computer Science and Technology, Xinyang Normal University, Henan, China.
• Yiyang Chen is with Department of Computer Science and operational Research, Montreal University, Montreal, Canada.

2 PRELIMINARY

In this section, we briefly introduce the basic ingredients used in our algorithm, including the PCHS algorithm and Montgomery squaring over $GF(2^m)$.

2.1 The PCHS Algorithm

Recently, Park et al. [19] proposed a new divide and conquer approach for odd degree polynomial multiplication. This approach is analogous to Karatsuba algorithm but divides a polynomial according to parity of exponent of the indeterminate. Assume that $A = \sum_{i=0}^{m-1} a_i x^i$ and $B = \sum_{i=0}^{m-1} b_i x^i$ be two polynomials over $F_2[x]$ such that $m$ is an odd integer. $A$, $B$ can be partitioned into $A = A_1^2 + xA_2^2$ and $B = x^{-1}B_1^2 + B_2^2$ respectively, where

$$A_1 = \sum_{i=0}^{(m-1)/2} a_{2i} x^i, \quad A_2 = \sum_{i=0}^{(m-3)/2} a_{2i+1} x^i,$$
$$B_1 = \sum_{i=1}^{(m-1)/2} b_{2i-1} x^i, \quad B_2 = \sum_{i=0}^{(m-1)/2} b_{2i} x^i.$$

Then the polynomial multiplication can be rewritten as:

$$\begin{aligned}
AB &= (A_1^2 + xA_2^2)(x^{-1}B_1^2 + B_2^2) \\
&= x^{-1}(A_1B_1)^2 + x(A_2B_2)^2 + (A_1B_2)^2 + (A_2B_1)^2 \\
&= x^{-1}(A_1B_1)^2 + x(A_2B_2)^2 + (A_1B_1)^2 + (A_2B_2)^2 + [(A_1+A_2)(B_1+B_2)]^2.
\end{aligned} \tag{1}$$

It is clear that Equation (1) converts the $m$-term polynomial multiplication into three $\frac{m+1}{2}$ (or $\frac{m-1}{2}$)-term polynomial multiplications and squarings at the cost of three extra additions. For finite field multiplications, this formula can be combined with fast squaring operation together to construct efficient multiplier. The authors utilized the fast squaring formulae for two types of special pentanomials. However, their scheme is a little complicated as the related squaring is built on transformations between weak dual basis (WDB) and polynomial basis (PB) [20]. Actually, the Montgomery squaring for trinomial is very simple and efficient. In the following sections, we will describe new bit-parallel multiplier for irreducible trinomials based on the PCHS algorithm and Montgomery squaring operation.
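Equation (1) is a plain polynomial identity over $F_2[x]$, so it can be checked numerically. The following is a minimal sketch, assuming polynomials are stored as Python integers (bit $i$ holds the coefficient of $x^i$); the helper names `clmul`, `square`, `parity_split` and `pchs_product` are illustrative and do not come from the paper.

```python
import random

def clmul(a, b):
    """Carry-less (GF(2)[x]) product; bit i of an int holds the x^i coefficient."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def square(p):
    """Squaring over GF(2) moves the coefficient of x^i to x^(2i)."""
    r, i = 0, 0
    while p:
        if p & 1:
            r |= 1 << (2 * i)
        p >>= 1
        i += 1
    return r

def parity_split(p):
    """Return (sum p_{2i} x^i, sum p_{2i+1} x^i), the even and odd parts of p."""
    even, odd, i = 0, 0, 0
    while p >> i:
        bit = (p >> i) & 1
        if i % 2 == 0:
            even |= bit << (i // 2)
        else:
            odd |= bit << (i // 2)
        i += 1
    return even, odd

def pchs_product(a, b):
    """Multiply A and B with the three half-size products of Equation (1)."""
    a1, a2 = parity_split(a)              # A = A1^2 + x*A2^2
    b2, b_odd = parity_split(b)
    b1 = b_odd << 1                       # B1 = sum b_{2i-1} x^i, so B = x^-1*B1^2 + B2^2
    p1 = square(clmul(a1, b1))            # (A1*B1)^2, always divisible by x^2
    p2 = square(clmul(a2, b2))            # (A2*B2)^2
    p3 = square(clmul(a1 ^ a2, b1 ^ b2))  # [(A1+A2)(B1+B2)]^2
    return (p1 >> 1) ^ (p2 << 1) ^ p1 ^ p2 ^ p3
```

The shift `p1 >> 1` realizes the factor $x^{-1}$ exactly, because $B_1$ has no constant term and therefore $(A_1B_1)^2$ is divisible by $x^2$.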

2.2 Montgomery Squaring over $GF(2^m)$

The Montgomery squaring operation derives from Montgomery multiplication and is defined by $A^2(x)R^{-1}(x) \bmod f(x)$, where $f(x)$ is an irreducible polynomial generating $GF(2^m)$, $A(x), R(x) \in GF(2^m)$ and $R(x)$ is a fixed element named as Montgomery factor. The general algorithm for Montgomery squaring is studied in [10] where $f(x)$ is an arbitrary irreducible polynomial and $R(x)$ is selected as $x^m$. An optimized Montgomery squaring is proposed by Wu [14] for irreducible trinomials $x^m + x^k + 1$. This squaring is designed using $R(x) = x^k$ and the corresponding circuit delay is $T_X$, whereas the squaring in polynomial basis costs more circuit gates and has a delay of at most $2T_X$. The main reason is the factor $R^{-1}(x) = x^{-k}$ could simplify the modular reduction related to $x^m + x^k + 1$. Similar trick is also applied in some special types of irreducible pentanomial in [22].

3 NEW FIELD MULTIPLICATION USING MONTGOMERY SQUARING OPERATION

In this section, we present a new Montgomery multiplication formula for irreducible trinomials using a slightly generalized PCHS algorithm.

Suppose that the field $GF(2^m)$ is defined by an irreducible trinomial $f(x) = x^m + x^k + 1$ with a root $x$, and the field elements are represented using polynomial basis $\{1, x, \cdots, x^{m-1}\}$. From now on, we only take account of $f(x) = x^m + x^k + 1$ where $1 \le k \le m/2$, as there always exist irreducible trinomial $f(x) = x^m + x^{m-k} + 1$ by the reciprocal property [3]. Let $A, B \in GF(2^m)$ be two arbitrary elements in polynomial basis representation:

$$A = a_{m-1}x^{m-1} + a_{m-2}x^{m-2} + \cdots + a_1 x + a_0,$$
$$B = b_{m-1}x^{m-1} + b_{m-2}x^{m-2} + \cdots + b_1 x + b_0,$$

where $a_i, b_i \in F_2$. Denoted by $x^h$ $(1 \le h < m)$ the Montgomery factor, the Montgomery multiplication (MM) over $GF(2^m)$ is given by:

$$A \cdot B \cdot x^{-h} \bmod f(x). \tag{2}$$

3.1 Extended PCHS Algorithm

According to the parity of $m$, we consider two following cases.

$m$ is odd. Let

$$A = A_1^2 + xA_2^2, \quad B = x^{-1}B_1^2 + B_2^2,$$

where

$$A_1 = \sum_{i=0}^{\frac{m-1}{2}} a_{2i} x^i, \quad A_2 = \sum_{i=0}^{\frac{m-3}{2}} a_{2i+1} x^i, \quad B_1 = \sum_{i=1}^{\frac{m-1}{2}} b_{2i-1} x^i, \quad B_2 = \sum_{i=0}^{\frac{m-1}{2}} b_{2i} x^i.$$

Then (2) can be rewritten as

$$\begin{aligned}
ABx^{-h} &= \left[(A_1^2 + xA_2^2)(x^{-1}B_1^2 + B_2^2)\right]x^{-h} \\
&= [x^{-1}(A_1B_1)^2 + x(A_2B_2)^2 + (A_1B_2)^2 + (A_2B_1)^2]x^{-h} \\
&= [x^{-1}(A_1B_1)^2 + x(A_2B_2)^2 + (A_1B_1)^2 + (A_2B_2)^2 + (CD)^2]x^{-h} \\
&= (A_1B_1)^2x^{-h}(1+x^{-1}) + (CD)^2x^{-h} + (A_2B_2)^2x^{-h}(1+x),
\end{aligned}$$

where $C = A_1 + A_2$, $D = B_1 + B_2$.

$m$ is even. This case is a little different from the previous case. We partition $A$, $B$ as follows:

$$A = A_1^2 + xA_2^2, \quad B = B_1^2 + xB_2^2,$$

where

$$A_1 = \sum_{i=0}^{\frac{m}{2}-1} a_{2i} x^i, \quad A_2 = \sum_{i=0}^{\frac{m}{2}-1} a_{2i+1} x^i, \quad B_1 = \sum_{i=0}^{\frac{m}{2}-1} b_{2i} x^i, \quad B_2 = \sum_{i=0}^{\frac{m}{2}-1} b_{2i+1} x^i.$$

In this case, (2) is written as

$$\begin{aligned}
ABx^{-h} &= \left[(A_1^2 + xA_2^2)(B_1^2 + xB_2^2)\right]x^{-h} \\
&= \left[(A_1^2 + xA_2^2)(x^{-1}B_1^2 + B_2^2)\right]x^{-h+1} \\
&= [x^{-1}(A_1B_1)^2 + x(A_2B_2)^2 + (A_1B_2)^2 + (A_2B_1)^2]x^{-h+1} \\
&= [x^{-1}(A_1B_1)^2 + x(A_2B_2)^2 + (A_1B_1)^2 + (A_2B_2)^2 + (CD)^2]x^{-h+1} \\
&= (A_1B_1)^2x^{-h+1}(1+x^{-1}) + (CD)^2x^{-h+1} + (A_2B_2)^2x^{-h+1}(1+x),
\end{aligned}$$

where $C = A_1 + A_2$, $D = B_1 + B_2$.

The two expressions as above both transform Montgomery multiplication into the three squaring operations. We can choose suitable factor $x^h$ in order to obtain the simplest implementation. It is argued that the squaring $V^2(x)x^{-k} \bmod x^m + x^k + 1$ and $V^2(x)x^{-k+1} \bmod x^m + x^k + 1$ have the simplest modular reduction [14], [18]. Therefore, we choose the Montgomery factor $x^h$ as follows:

$$h = \begin{cases} k, & m \text{ is odd}, \\ k+1, & m \text{ is even}. \end{cases} \tag{3}$$

As a result, the Montgomery multiplication of two cases have the same transformation.

$$ABx^{-h} = (A_1B_1)^2x^{-k}(1+x^{-1}) + (A_2B_2)^2x^{-k}(1+x) + (CD)^2x^{-k}. \tag{4}$$
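The transformation (4), together with the choice of $h$ in (3), can be compared against a direct evaluation of (2). The sketch below is illustrative only: it assumes the integer bit-vector representation (bit $i$ is the coefficient of $x^i$), all function names are hypothetical, and the test trinomials are arbitrary small examples. The identity relies only on $x^m = x^k + 1$, so the check does not depend on $f(x)$ being irreducible.

```python
import random

def clmul(a, b):
    """Carry-less product in GF(2)[x]."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def reduce_mod(v, m, k):
    """Reduce v modulo f(x) = x^m + x^k + 1."""
    f = (1 << m) | (1 << k) | 1
    for d in range(v.bit_length() - 1, m - 1, -1):
        if (v >> d) & 1:
            v ^= f << (d - m)
    return v

def times_x_inv(v, m, k, count=1):
    """Multiply a reduced element by x^(-count) modulo f(x)."""
    f = (1 << m) | (1 << k) | 1
    for _ in range(count):
        if v & 1:          # adding f clears the constant term without changing the class
            v ^= f
        v >>= 1
    return v

def mont_mul_direct(a, b, m, k):
    """A*B*x^(-h) mod f(x), with h chosen as in (3)."""
    h = k if m % 2 else k + 1
    return times_x_inv(reduce_mod(clmul(a, b), m, k), m, k, h)

def mont_mul_split(a, b, m, k):
    """Same product via (4): three half-size products and Montgomery squarings."""
    def even(p):
        return sum(((p >> (2 * i)) & 1) << i for i in range(m))
    def odd(p):
        return sum(((p >> (2 * i + 1)) & 1) << i for i in range(m))
    a1, a2 = even(a), odd(a)
    if m % 2:
        b1, b2 = odd(b) << 1, even(b)   # B = x^-1*B1^2 + B2^2
    else:
        b1, b2 = even(b), odd(b)        # B = B1^2 + x*B2^2
    def mont_sq(v):                     # V^2 * x^(-k) mod f(x)
        return times_x_inv(reduce_mod(clmul(v, v), m, k), m, k, k)
    q1 = mont_sq(clmul(a1, b1))
    q2 = mont_sq(clmul(a2, b2))
    q3 = mont_sq(clmul(a1 ^ a2, b1 ^ b2))
    s1 = q1 ^ times_x_inv(q1, m, k)         # (1 + x^-1) factor
    s2 = q2 ^ reduce_mod(q2 << 1, m, k)     # (1 + x) factor
    return s1 ^ s2 ^ q3
```

For example, `mont_mul_split(a, b, 7, 1)` and `mont_mul_direct(a, b, 7, 1)` agree for every pair of 7-bit inputs, which mirrors the claim that both parities of $m$ lead to the same transformation (4).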

Meanwhile, its Montgomery squarings have one of the best factors. Note that the degrees of $A_1B_1$, $A_2B_2$ and $CD$ are at most $m-1$. From now on, the following notations are used:

$$A_1B_1 = \sum_{i=0}^{m-1} c_i x^i, \quad A_2B_2 = \sum_{i=0}^{m-1} d_i x^i, \quad CD = \sum_{i=0}^{m-1} e_i x^i,$$
$$S_1 = (A_1B_1)^2 x^{-k}(1+x^{-1}) \bmod f(x) = \sum_{i=0}^{m-1} r_i x^i,$$
$$S_2 = (A_2B_2)^2 x^{-k}(1+x) \bmod f(x) = \sum_{i=0}^{m-1} s_i x^i,$$
$$S_3 = (CD)^2 x^{-k} \bmod f(x) = \sum_{i=0}^{m-1} t_i x^i.$$

Here, notice that if $m$ is odd, $c_0 = 0$ and $d_{m-1} = 0$; if $m$ is even, we have $c_{m-1} = d_{m-1} = 0$. The new Montgomery multiplication can be summarized in following algorithm:

Algorithm 1 New Bit-parallel MM
Input: $A, B \in GF(2^m)$, $f(x)$
Output: $ABx^{-h} \bmod f(x)$
1: Partition $A$, $B$ according to (4)
2: Implement $S_1$, $S_2$, $S_3$ in parallel
3: Compute $S_1 + S_2 + S_3$ in parallel

We then consider the detailed computation of $S_1$, $S_2$ and $S_3$, respectively.

3.2 The Complexities of $A_1B_1$, $A_2B_2$

First we briefly analyze the complexities of the products $A_1B_1$ and $A_2B_2$ which will be used in the computation of $S_1$, $S_2$. According to previous description, the coefficients $c_i$ of $A_1B_1 = \sum_{i=0}^{m-1} c_i x^i$ are given as follows:

$m$ is odd.

$$c_i = \begin{cases}
0, & i = 0, \\
\sum_{j=0}^{i-1} a_{2j} b_{2(i-j)-1}, & 1 \le i \le \frac{m-1}{2}, \\
\sum_{j=i-\frac{m-1}{2}}^{\frac{m-1}{2}} a_{2j} b_{2(i-j)-1}, & \frac{m+1}{2} \le i \le m-1.
\end{cases} \tag{5}$$

$m$ is even.

$$c_i = \begin{cases}
\sum_{j=0}^{i} a_{2j} b_{2(i-j)}, & 0 \le i \le \frac{m}{2}-1, \\
\sum_{j=i-\frac{m}{2}+1}^{\frac{m}{2}-1} a_{2j} b_{2(i-j)}, & \frac{m}{2} \le i \le m-2.
\end{cases} \tag{6}$$

More explicitly, the gate count and time delay for implementation of each $c_i$ in (5) are presented in Table 1. It is easy to check that the computation of the $c_i$ totally cost $\frac{m^2-1}{4}$ AND and $\frac{m^2-4m+3}{4}$ XOR gates with path delay $T_A + \lceil \log_2 \frac{m-1}{2} \rceil T_X$.

When $m$ is even, it requires $\frac{m^2}{4}$ AND and $\frac{m^2-4m+4}{4}$ XOR gates with path delay $T_A + \lceil \log_2 \frac{m}{2} \rceil T_X$.

Similarly, we can easily obtain the space and time complexity related to $A_2B_2$ which are the same as those of $A_1B_1$.

TABLE 1
The computation complexity of $c_i$

$c_i$ | #AND | #XOR | Delay
$c_0 = 0$ | 0 | 0 | -
$c_1 = a_0b_1$ | 1 | 0 | $T_A$
$c_2 = a_0b_3 + a_2b_1$ | 2 | 1 | $T_A + T_X$
$\vdots$ | $\vdots$ | $\vdots$ | $\vdots$
$c_{\frac{m-1}{2}} = a_0b_{m-2} + \cdots + a_{m-3}b_1$ | $\frac{m-1}{2}$ | $\frac{m-3}{2}$ | $T_A + \lceil \log_2(\frac{m-1}{2}) \rceil T_X$
$c_{\frac{m+1}{2}} = a_2b_{m-2} + \cdots + a_{m-1}b_1$ | $\frac{m-1}{2}$ | $\frac{m-3}{2}$ | $T_A + \lceil \log_2(\frac{m-1}{2}) \rceil T_X$
$\vdots$ | $\vdots$ | $\vdots$ | $\vdots$
$c_{m-1} = a_{m-1}b_{m-2}$ | 1 | 0 | $T_A$
Total | $\frac{m^2-1}{4}$ | $\frac{m^2-4m+3}{4}$ | $T_A + \lceil \log_2 \frac{m-1}{2} \rceil T_X$

3.3 Different Cases

Then we consider the computations of $S_1$, $S_2$ and $S_3$. According to previous description, the key computation of $S_1$, $S_2$, $S_3$ is the Montgomery squaring related to $A_1B_1$, $A_2B_2$ and $CD$. This operation has been fully studied through the explicit formulations [14] which are varied according to the range of $m$ and $k$. Hence, we consider six cases:

1) $m$ is odd, $k$ is odd, $1 \le k \le \frac{m-3}{2}$,
2) $m$ is odd, $k$ is odd, $k = \frac{m-1}{2}$,
3) $m$ is odd, $k$ is even, $1 \le k \le \frac{m-3}{2}$,
4) $m$ is odd, $k$ is even, $k = \frac{m-1}{2}$,
5) $m$ is even, $k$ is odd, $m > 2k$,
6) $m$ is even, $k$ is odd, $m = 2k$.

The above six cases correspond to different squaring formulae, resulting different expression of $S_1$, $S_2$ and $S_3$. For the sake of the length of the paper, we only analyze two representative cases, i.e., case 1 and case 5 (see footnote 1).

1. We can follow a similar line of approaches used in case 1 and case 5.

3.4 The Computation of $S_1$, $S_2$

Case 1: Denote $\sum_{i=0}^{m-1} z_i x^i$ as the Montgomery squaring $(A_1B_1)^2 x^{-k} \bmod f(x)$, we have following expression using related formula in [14]:

$$z_i = \begin{cases}
c_{\frac{i}{2}} + c_{\frac{m+k+i}{2}}, & i = 0, 2, \cdots, k-1; \\
c_{\frac{m+k+i}{2}}, & i = k+1, k+3, \cdots, m-k-2; \\
c_{\frac{k-m+i}{2}}, & i = m-k, m-k+2, \cdots, m-1; \\
c_{\frac{k+i}{2}}, & i = 1, 3, \cdots, k-2; \\
c_{\frac{k+i}{2}} + c_{\frac{m+i}{2}}, & i = k, k+2, \cdots, m-2.
\end{cases} \tag{7}$$

Since $x^m + x^k = 1$, we have $x^{-1} = x^{m-1} + x^{k-1}$. It follows that:

$$\begin{aligned}
S_1 &= \sum_{i=0}^{m-1} z_i x^i (1 + x^{-1}) \bmod f(x) \\
&= \sum_{i=0, i \ne k-1}^{m-2} (z_i + z_{i+1})x^i + (z_{k-1} + z_k + z_0)x^{k-1} + (z_{m-1} + z_0)x^{m-1}.
\end{aligned} \tag{8}$$

Then we substitute $z_i$ with the expressions in (7). Note that $c_0 = 0$, the coefficients of $S_1$ are given by:

$$r_i = \begin{cases}
c_{\frac{i}{2}} + c_{\frac{m+k+i}{2}} + c_{\frac{k+i+1}{2}}, & i = 0, 2, \cdots, k-1; \\
c_{\frac{m+k+i}{2}} + c_{\frac{k+i+1}{2}} + c_{\frac{m+i+1}{2}}, & i = k+1, k+3, \cdots, m-k-2; \\
c_{\frac{k-m+i}{2}} + c_{\frac{k+i+1}{2}} + c_{\frac{m+i+1}{2}}, & i = m-k, m-k+2, \cdots, m-3; \\
c_{\frac{k-1}{2}} + c_{\frac{m+k}{2}}, & i = m-1; \\
c_{\frac{k+i}{2}} + c_{\frac{i+1}{2}} + c_{\frac{m+k+i+1}{2}}, & i = 1, 3, \cdots, k-2; \\
c_{\frac{k+i}{2}} + c_{\frac{m+i}{2}} + c_{\frac{m+k+i+1}{2}}, & i = k, k+2, \cdots, m-k-3; \\
c_{\frac{k+i}{2}} + c_{\frac{m+i}{2}} + c_{\frac{k-m+i+1}{2}}, & i = m-k-1, m-k+1, \cdots, m-2.
\end{cases} \tag{9}$$

Obviously, we can utilize the similar strategy to obtain the explicit expression of $S_2$. Let $\sum_{i=0}^{m-1} z'_i x^i$ denote the Montgomery squaring respect to $A_2B_2$, then we have:

$$\begin{aligned}
S_2 &= \sum_{i=0}^{m-1} z'_i x^i (1 + x) \bmod f(x) \\
&= \sum_{i=1, i \ne k}^{m-1} (z'_i + z'_{i-1})x^i + (z'_k + z'_{k-1} + z'_{m-1})x^k + (z'_{m-1} + z'_0).
\end{aligned} \tag{10}$$

The explicit formulae for the coefficients of $S_2$ are given in (11).

$$s_i = \begin{cases}
d_{\frac{i}{2}} + d_{\frac{m+k+i}{2}} + d_{\frac{k+i-1}{2}}, & i = 0, 2, \cdots, k-1; \\
d_{\frac{m+k+i}{2}} + d_{\frac{k+i-1}{2}} + d_{\frac{m+i-1}{2}}, & i = k+1, k+3, \cdots, m-k-2; \\
d_{\frac{k-m+i}{2}} + d_{\frac{k+i-1}{2}} + d_{\frac{m+i-1}{2}}, & i = m-k, m-k+2, \cdots, m-1; \\
d_{\frac{k+i}{2}} + d_{\frac{i-1}{2}} + d_{\frac{m+k+i-1}{2}}, & i = 1, 3, \cdots, k-2; \\
d_{\frac{k+i}{2}} + d_{\frac{m+i}{2}} + d_{\frac{m+k+i-1}{2}}, & i = k, k+2, \cdots, m-k-1; \\
d_{\frac{k+i}{2}} + d_{\frac{m+i}{2}} + d_{\frac{k-m+i-1}{2}}, & i = m-k+1, m-k+3, \cdots, m-2.
\end{cases} \tag{11}$$

Case 5: In this case, it is easy to check that $S_1$ and $S_2$ have the same transformation as case 1 presented in (8) and (10), but the Montgomery squaring formula is different. We have

$$r_i = \begin{cases}
c_{\frac{i}{2}} + c_{\frac{m+k+i+1}{2}} + c_{\frac{k+i+1}{2}}, & i = 0, 2, \cdots, k-3; \\
c_0 + c_{\frac{k-1}{2}} + c_{\frac{m+2k}{2}} + c_k, & i = k-1; \\
c_{\frac{m+i}{2}} + c_{\frac{k+i+1}{2}} + c_{\frac{m+k+i+1}{2}}, & i = k+1, k+3, \cdots, m-k-3; \\
c_{\frac{m+i}{2}} + c_{\frac{k+i+1}{2}} + c_{\frac{k-m+i+1}{2}}, & i = m-k-1, m-k+1, \cdots, m-2; \\
c_{\frac{k-1}{2}} + c_{\frac{m+k-1}{2}} + c_0, & i = m-1; \\
c_{\frac{k+i}{2}} + c_{\frac{m+k+i}{2}} + c_{\frac{i+1}{2}}, & i = 1, 3, \cdots, k-2; \\
c_{\frac{k+i}{2}} + c_{\frac{m+k+i}{2}} + c_{\frac{m+i+1}{2}}, & i = k, k+2, \cdots, m-k-2; \\
c_{\frac{k+i}{2}} + c_{\frac{k-m+i}{2}} + c_{\frac{m+i+1}{2}}, & i = m-k, m-k+2, \cdots, m-3;
\end{cases} \tag{12}$$

and

$$s_i = \begin{cases}
d_{\frac{i}{2}} + d_{\frac{m+k+i-1}{2}} + d_{\frac{k+i-1}{2}}, & i = 0, 2, \cdots, k-1; \\
d_{\frac{m+i}{2}} + d_{\frac{k+i-1}{2}} + d_{\frac{m+k+i-1}{2}}, & i = k+1, k+3, \cdots, m-k-1; \\
d_{\frac{m+i}{2}} + d_{\frac{k+i-1}{2}} + d_{\frac{k-m+i-1}{2}}, & i = m-k+1, m-k+3, \cdots, m-2; \\
d_{\frac{k+i}{2}} + d_{\frac{m+k+i}{2}} + d_{\frac{i-1}{2}}, & i = 1, 3, \cdots, k-2; \\
d_{\frac{k+i}{2}} + d_{\frac{m+k+i}{2}} + d_{\frac{m+i-1}{2}}, & i = k, k+2, \cdots, m-k-2; \\
d_{\frac{k+i}{2}} + d_{\frac{k-m+i}{2}} + d_{\frac{m+i-1}{2}}, & i = m-k, m-k+2, \cdots, m-1.
\end{cases} \tag{13}$$

The explicit formulae about $S_1 + S_2$ of case 2-4 and case 6 can be found in the appendix A.

3.5 Computation of $S_3$

Since the computation of $C = A_1 + A_2$ and $D = B_1 + B_2$ require one extra $T_X$ gate delay, in order to keep pace with the computations of $S_1$ and $S_2$, we use a different computational strategy for $S_3$ that combines the polynomial multiplication with Montgomery squaring.

Case 1: Let $\sum_{i=0}^{\frac{m-1}{2}} u_i x^i = C$ and $\sum_{i=0}^{\frac{m-1}{2}} v_i x^i = D$, then we have

$$e_i = \begin{cases}
\sum_{j=0}^{i} u_j v_{i-j}, & 0 \le i \le \frac{m-1}{2}, \\
\sum_{j=i-\frac{m-1}{2}}^{\frac{m-1}{2}} u_j v_{i-j}, & \frac{m+1}{2} \le i \le m-1,
\end{cases} \tag{14}$$

and

$$t_i = \begin{cases}
e_{\frac{i}{2}} + e_{\frac{m+k+i}{2}}, & i = 0, 2, \cdots, k-1; \\
e_{\frac{m+k+i}{2}}, & i = k+1, k+3, \cdots, m-k-2; \\
e_{\frac{k-m+i}{2}}, & i = m-k, m-k+2, \cdots, m-1; \\
e_{\frac{k+i}{2}}, & i = 1, 3, \cdots, k-2; \\
e_{\frac{k+i}{2}} + e_{\frac{m+i}{2}}, & i = k, k+2, \cdots, m-2.
\end{cases} \tag{15}$$

By substituting (14) into (15), we can obtain the explicit expression of the $t_i$ summarized in Table 2. Note that it only need $\frac{(m+1)^2}{4}$ AND gates to compute all the $u_iv_j$ for $i, j = 0, 1, \cdots, \frac{m-1}{2}$; here, we only present the required number of XOR gates.

XOR Gates Reuse trick: The computation of $t_i$ consists of multiplying $u_i$ with $v_j$ and adding up all these products using a binary XOR tree. In (15), we note that there exist certain overlapped terms among some $t_i$, thus reusing the intermediate results in binary XOR trees could further reduce the number of required XOR gates. For example, $t_0 = e_0 + e_{\frac{m+k}{2}}$ and $t_k = e_k + e_{\frac{m+k}{2}}$ contain the same part $e_{\frac{m+k}{2}}$. According to (14), it follows that $e_{\frac{m+k}{2}} = \sum_{j=\frac{k+1}{2}}^{\frac{m-1}{2}} u_j v_{\frac{m+k}{2}-j}$, consisting of $\frac{m-k}{2}$ terms. If $\frac{m-k}{2}$ is an odd number, $t_0$ and $t_k$ are computed in following way:

$$\begin{aligned}
t_0 = {} & \underline{[u_{\frac{k+1}{2}} v_{\frac{m-1}{2}} + u_{\frac{k+3}{2}} v_{\frac{m-3}{2}}]} + \underline{[u_{\frac{k+5}{2}} v_{\frac{m-5}{2}} + u_{\frac{k+7}{2}} v_{\frac{m-7}{2}}]} + \cdots \\
& + \underline{[u_{\frac{m-5}{2}} v_{\frac{k+5}{2}} + u_{\frac{m-3}{2}} v_{\frac{k+3}{2}}]} + [u_{\frac{m-1}{2}} v_{\frac{k+1}{2}} + u_0 v_0], \\
t_k = {} & \underline{[u_{\frac{k+1}{2}} v_{\frac{m-1}{2}} + u_{\frac{k+3}{2}} v_{\frac{m-3}{2}}]} + \underline{[u_{\frac{k+5}{2}} v_{\frac{m-5}{2}} + u_{\frac{k+7}{2}} v_{\frac{m-7}{2}}]} + \cdots \\
& + \underline{[u_{\frac{m-5}{2}} v_{\frac{k+5}{2}} + u_{\frac{m-3}{2}} v_{\frac{k+3}{2}}]} + [u_{\frac{m-1}{2}} v_{\frac{k+1}{2}} + u_0 v_k] \\
& + [u_1 v_{k-1} + u_2 v_{k-2}] + \cdots + [u_{k-1} v_1 + u_k v_0].
\end{aligned} \tag{16}$$

In (16), the terms $u_iv_j$ correspond to the XOR tree nodes in depth 0. We then add the nodes pairwise and repeat this step until adding up all those terms together. This procedure can be depicted in Fig. 1 and Fig. 2. The black nodes in tree (a) and tree (b) represent the overlapping terms in $t_0$ and $t_k$. Due to parallelism, the additions in the brackets are performed simultaneously. Note that brackets


TABLE 2 The computation complexity of ti before optimization i 0 2

u0 v0 +

ti P m2−1

1 j= k+ 2

u0 v1 +u1 v0 +

1 P k− 2

j=0

k+1 k+3

uj v m+k−j 2

m−1 2 3 j= k+ 2

P

.. . k−1

#XOR

uj v m+k+1−j 2

.. . P m2−1

uj v k−1−j + j=k uj v m−1+k−j 2 2 P m2−1 −1 j=k+1 uj v m2 +k+1−j P m2−1 j=k+2 uj v m−1+k+2−j 2

i

m−k 2

1 3

.. .

.. .

m−k 2

k−2

P

m−1 −k−2 2

k+2

m−k−2

u m−1 v m−1

0

m−k−1

m−k+2

u0 v1 +u1 v0

.. . m−1

0

.. . 1 P k− 2

j=0

uj v k−1−j 2

uj v k+1+1−j

k+3 2

2

uj vk−j +

j=0

.. .

u0 v 0

j=0

2

m−k+1

1

m−k+3

.. .

.. .

k−1 2

m−2

.. .

uj vk−1−j P m2−1

k−1

j=0

k

.. .

m−k

k+1 2

Pk−1 Pk

.. . 2

k+3 2

.. .

.. . 2

uj v k+1−j

j=0

m−k 2

m−1 −k−1 2

#XOR

ti 1 P k+ 2

Pk+1

j=0

uj vk+1−j +

j=0

P P

j=1 m−1 2

j=2

P m2−1

3 j= k+ 2

uj v m+k−j

m+k 2

−1

2

uj v m+k+1−j

m+k 2

−1

2

.. . P m2−1

P m2−1 m−1 2

1 j= k+ 2

.. .

uj v m−1−j + −k uj vm−k+1−j j= m2 2 2 P m2−1 u v uj v m+1−j + k j m− −1−j −k j= m2 +1 2 2 m − 1 P 2 uj v m+3−j + u v k − j m−k m− 3−j j= 2 +2

2

2

m+k 2 m+k 2

−2

m+k 2

−4

.. . P m2−1

1 j= k− 2

uj v m+k−1−j +u m−1 v m−1 2

(m-k)/2 terms

(m-k)/2 terms

.. . 2

2

m−k 2

+1

Fig. 1. The binary XOR tree (a) related to $t_0$

Fig. 2. The binary XOR tree (b) related to $t_k$

with underlines between the two expressions contain the same values, we only need to compute these values in tree (a) and reuse them in tree (b). Therefore, $\lfloor \frac{m-k}{4} \rfloor$ XOR gates will be saved in depth 0. Similarly, it follows that $\lfloor \frac{m-k}{8} \rfloor$ XOR gates can be saved at depth 1, $\lfloor \frac{m-k}{16} \rfloor$ XOR gates saved at depth 2, etc. Let $\lfloor \log_2(\frac{m-k}{2}) \rfloor = h$; in depth $h-1$, there exist two nodes, thus one XOR gate will be saved. Then the total number of saved XOR gates is

$$\left\lfloor \frac{m-k}{4} \right\rfloor + \left\lfloor \frac{m-k}{8} \right\rfloor + \cdots + \left\lfloor \frac{m-k}{2^{h+1}} \right\rfloor. \tag{17}$$

Here, note that $\lfloor \frac{m-k}{2^{h+1}} \rfloor = 1$. Moreover, above expression can be further simplified to

$$\frac{m-k}{2} - W\left(\frac{m-k}{2}\right),$$

according to certain proposition (the proposition and its proof are presented in appendix B). In Table 3, we present the explicit number of XOR gates for all the $t_i$ overlapped with others when we apply this trick. The second column $j$ represents the index of the $t_j$ which overlaps with $t_i$. Consequently, we can obtain


TABLE 3 The computation complexity of certain ti s after optimization i

overlap with j

#XOR

m−k

m−k 2

2 .. .

m−k+2 .. .

m−k 2

k−1

m−1

k

0

k+2 .. .

2 .. .

m−k−1

m − 2k − 1

0

m − 2k + 1 .. .

m−2

m−k−2

dlog2

m−k 2

+ 1e

− (2 − W (2)) .. .

dlog2 m−k 2

+ 1e

.. . dlog2

m−k 2

+ 1e

− ( m−k − W ( m−k )) 2 2

dlog2

m+k 2

+ 1e

− ( m−k − 1 − W ( m−k − 1)) 2 2 .. .

dlog2

m+k 2

+ 1e

m+k 2

dlog2

m+k 2

m−k+1 .. .

− (1 − W (1))

− ( k+1 − W ( k+1 )) 2 2

m−k 2

m+k 2

delay for binary tree

m+k 2

− ( k+1 − W ( k+1 )) 2 2

−2−

( k−1 2



W ( k−1 )) 2

.. . + 1e

e dlog2 m+k 2

.. . m−k 2

m+k 2

.. .

+ 1 − (1 − W (1))

dlog2

m−k 2

+ 1e

the complexity related to $S_3$ based on Table 2 and Table 3:

$$\#\text{AND}: \frac{(m+1)^2}{4},$$
$$\#\text{XOR}: \frac{(m-1)^2}{4} + \sum_{i=1}^{\frac{k+1}{2}} W(i) + \sum_{i=1}^{\frac{m-k}{2}} W(i),$$
$$\text{Delay}: T_A + \lceil \log_2(m+k+2) \rceil T_X.$$

Case 5: Actually, the computation strategies of other cases are nearly the same as that presented in case 1; we can use the XOR gates reuse trick to optimize its implementation. Note that the degrees of $C$, $D$ are at most $\frac{m}{2}-1$. Let $\sum_{i=0}^{\frac{m}{2}-1} u_i x^i = C$ and $\sum_{i=0}^{\frac{m}{2}-1} v_i x^i = D$, then the coefficients of $CD$ and its Montgomery squaring are:

$$e_i = \begin{cases}
\sum_{j=0}^{i} u_j v_{i-j}, & 0 \le i \le \frac{m}{2}-1, \\
\sum_{j=i-\frac{m}{2}+1}^{\frac{m}{2}-1} u_j v_{i-j}, & \frac{m}{2} \le i \le m-2,
\end{cases} \tag{18}$$

and

$$t_i = \begin{cases}
e_{\frac{i}{2}}, & i = 0, 2, \cdots, k-1; \\
e_{\frac{m+i}{2}}, & i = k+1, k+3, \cdots, m-2; \\
e_{\frac{k+i}{2}} + e_{\frac{m+k+i}{2}}, & i = 1, 3, \cdots, m-k-2; \\
e_{\frac{k+i}{2}} + e_{\frac{k-m+i}{2}}, & i = m-k, m-k+2, \cdots, m-1.
\end{cases} \tag{19}$$

We observed that when substitute Eq. (18) into Eq. (19), each $t_i$ is the sum of at most $\frac{m}{2}$ terms. Plus the delay of computing $u_iv_j$, the circuit delay for parallel implementation of Eq. (19) is at most $T_A + \lceil \log_2(\frac{m}{2}) \rceil T_X$. The space and time complexity of $S_3$ here are given by:

$$\#\text{AND}: \frac{m^2}{4},$$
$$\#\text{XOR}: \frac{m^2-4m+4}{4} + \sum_{i=1}^{\frac{m-k-3}{2}} W(i) + \sum_{i=1}^{\frac{k+1}{2}} W(i),$$
$$\text{Delay}: T_A + \lceil \log_2 m \rceil T_X.$$

3.6 The Computation Sequence

Ultimately, we add $S_1$, $S_2$ and $S_3$ together to obtain the result. It is crucial to arrange the computation sequence properly to obtain the optimal circuit delay. Particularly, note that the computation of $S_3$ requires at least one more $T_X$ gate delay than that of $A_1B_1$ and $A_2B_2$. During this extra XOR gate delay, we can perform one bitwise addition between $S_1$, $S_2$ in parallel.

Case 1: The computation sequence is arranged as follows:

$$\left.\begin{array}{l}
\underbrace{C, D}_{1T_X} \;\rightarrow\; \underbrace{(CD)^2x^{-t}}_{T_A + \lceil \log_2 \frac{m+k+2}{2} \rceil T_X} \\
\underbrace{A_1B_1, A_2B_2}_{T_A + \lceil \log_2 \frac{m-1}{2} \rceil T_X} \;\rightarrow\; \underbrace{[S_1 + S_2]}_{1T_X}
\end{array}\right\} \;\rightarrow\; \underbrace{[S_1 + S_2]^* + S_3}_{2T_X}$$

where $[S_1 + S_2]$ denotes the parallel bitwise additions in the square brackets indicated in following equation.

$$r_i + s_i = \begin{cases}
[c_{\frac{i}{2}} + c_{\frac{m+k+i}{2}}] + [c_{\frac{k+i+1}{2}} + d_{\frac{i}{2}}] + [d_{\frac{m+k+i}{2}} + d_{\frac{k+i-1}{2}}], & i = 0, 2, \cdots, k-1; \\
[c_{\frac{m+k+i}{2}} + c_{\frac{k+i+1}{2}}] + [c_{\frac{m+i+1}{2}} + d_{\frac{m+k+i}{2}}] + [d_{\frac{k+i-1}{2}} + d_{\frac{m+i-1}{2}}], & i = k+1, k+3, \cdots, m-k-2; \\
[c_{\frac{k-m+i}{2}} + c_{\frac{k+i+1}{2}}] + [c_{\frac{m+i+1}{2}} + d_{\frac{k-m+i}{2}}] + [d_{\frac{k+i-1}{2}} + d_{\frac{m+i-1}{2}}], & i = m-k, m-k+2, \cdots, m-3; \\
[c_{\frac{k-1}{2}} + c_{\frac{m+k}{2}}] + [d_{\frac{k-1}{2}} + d_{\frac{k+m-2}{2}}], & i = m-1; \\
[c_{\frac{k+i}{2}} + c_{\frac{i+1}{2}}] + [c_{\frac{m+k+i+1}{2}} + d_{\frac{k+i}{2}}] + [d_{\frac{i-1}{2}} + d_{\frac{m+k+i-1}{2}}], & i = 1, 3, \cdots, k-2; \\
[c_{\frac{k+i}{2}} + c_{\frac{m+i}{2}}] + [c_{\frac{m+k+i+1}{2}} + d_{\frac{k+i}{2}}] + [d_{\frac{m+i}{2}} + d_{\frac{m+k+i-1}{2}}], & i = k, k+2, \cdots, m-k-3; \\
[c_{\frac{m-1}{2}} + c_{\frac{2m-k-1}{2}}] + [d_{\frac{m-1}{2}} + d_{\frac{2m-k-1}{2}}], & i = m-k-1; \\
[c_{\frac{k+i}{2}} + c_{\frac{m+i}{2}}] + [c_{\frac{k-m+i+1}{2}} + d_{\frac{k+i}{2}}] + [d_{\frac{m+i}{2}} + d_{\frac{k-m+i-1}{2}}], & i = m-k+1, m-k+3, \cdots, m-2.
\end{cases} \tag{20}$$

After that, each coefficient of $S_1 + S_2$ consists of at most 3 terms; we denote these results as $[S_1 + S_2]^*$. It follows that there are at most 4 terms constituting these coefficients of $[S_1 + S_2]^* + S_3$, which can be implemented in $2T_X$ in parallel. Also note that $m-1$ extra XOR gates are needed for computation of $C$, $D$. Therefore, the total complexity of the proposed multiplier of case 1 will be

$$\begin{aligned}
\#\text{AND}&: \frac{3m^2+2m-1}{4}, \\
\#\text{XOR}&: \frac{3m^2}{4} + \frac{9m}{2} + \sum_{i=1}^{\frac{k+1}{2}} W(i) + \sum_{i=1}^{\frac{m-k}{2}} W(i) - \frac{29}{4}, \\
\text{Delay}&: T_A + (2 + \lceil \log_2(m+k+2) \rceil)T_X.
\end{aligned} \tag{21}$$

Case 5: According to the coefficient expressions of $S_1$ in (12), the coefficient $r_{k-1}$ consists of four terms which would lead to one more XOR delay for $S_1 + S_2$. However, we observed that $r_{k-1}$ contains the term $c_0$ that can be obtained with delay $T_A + T_X$. It is possible to "insert" $c_0$ to the binary XOR tree related to $t_{k-1}$ of $S_3$ (see footnote 3). Particularly, the $(k-1)$-th coefficient of $S_3$, i.e., $t_{k-1} = e_{\frac{k-1}{2}}$, consists of $\frac{k+1}{2}$ terms. The delay of the binary tree related to $t_{k-1} + c_0$ is $\lceil \log_2(\frac{k+1}{2} + 1) \rceil T_X$. Note that here $k < \frac{m}{2}$, we have $\lceil \log_2(\frac{k+1}{2} + 1) \rceil < \lceil \log_2 \frac{m}{2} \rceil$, which indicates that $S_3 + c_0$ has the same delay with that of $S_3$. Therefore, the computation sequence of case 5 is given by:

$$\left.\begin{array}{l}
\underbrace{C, D}_{1T_X} \;\rightarrow\; \underbrace{(CD)^2x^{-t} + \{c_0x^{k-1}\}}_{T_A + \lceil \log_2 \frac{m}{2} \rceil T_X} \\
\underbrace{A_1B_1, A_2B_2}_{T_A + \lceil \log_2 \frac{m}{2} \rceil T_X} \;\rightarrow\; \underbrace{[S_1 + S_2] - \{c_0x^{k-1}\}}_{1T_X}
\end{array}\right\} \;\rightarrow\; \underbrace{[S_1 + S_2]^* + S_3}_{2T_X}$$

The total complexity of the proposed multiplier of case 5 will be

$$\begin{aligned}
\#\text{AND}&: \frac{3m^2}{4}, \\
\#\text{XOR}&: \frac{3m^2}{4} + 5m + \sum_{i=1}^{\frac{k+1}{2}} W(i) + \sum_{i=1}^{\frac{m-k-3}{2}} W(i), \\
\text{Delay}&: T_A + (2 + \lceil \log_2 m \rceil)T_X.
\end{aligned} \tag{22}$$

The computation sequences of other cases are the same as those we presented in (21) and (22). Finally, we summarize the space and time complexity of these corresponding multipliers in the Table 4.

4 COMPARISON AND DISCUSSION

Comments on space complexity: We note that the expressions for number of XOR gates in Table 4 contain the sum of hamming weights related to certain integer, denoted by $\sum_{i=1}^{\sigma} W(i)$. This expression can be roughly written as $\frac{\sigma}{2} \log_2 \sigma$ (see footnote 4). Therefore, the number of XOR gates required by our multiplier here is about:

$$\frac{3m^2}{4} + O(m \log_2 m).$$

Comments on time complexity: Denoted by $T(m, k)$ the time complexity of our multiplier, according to (21), (22) and Table 4, one can check that

$$\begin{cases}
T(m,k) \le T_A + (3 + \lceil \log_2 m \rceil)T_X, & m \text{ odd}, \\
T(m,k) = T_A + (2 + \lceil \log_2 m \rceil)T_X, & m \text{ even}.
\end{cases}$$

But for odd $m$, it is interesting only if $T(m,k) = T_A + (2 + \lceil \log_2 m \rceil)T_X$. This happens frequently when $m = 2^n + c$ where $c$ is smaller than $2^{n-1}$. For the range $100 < m \le 1023$, there exist 786 trinomials of odd degrees where $k \le \frac{m-1}{2}$. We have checked all these trinomials and found that 442 trinomials satisfy the previous requirement.

3. All the nodes of the binary XOR trees related to $t_i$ can be calculated with the same delay $T_A + T_X$.

4. Note that bit length of σ is dlog2 σe, the average hamming weight of the number from 1 to σ is about σ2 , which directly obtain the evaluation.


TABLE 4 Complexities of Montgomery multiplier for other cases Case

#AND

#XOR

m odd, k even 0 nt ≥ 0.  2 2 2    Note that blog2 ic = n1 , then  +d 2i + d k+i + d m+k+i−1 ,   2 2          c k+i + c m+k+i+1 + c m+i+1 i = k, k + 2,  i i i  2 2 2  + + · · · + n1 = i − W (i)  +d k+i + d m+k+i−1 + d m+i−1 , · · · , m − k − 3;  2 4 2  2 2 2    c m−1 + cm− k + d m−1 + dm− k −1 , i = m − k − 1;   2 2 2  2   c i = m−k+1, m−k+3, k+i + c k−m+i+1 + c m+i+1   2 2 2   +d k+i + d k−m+i−1 + d m+i−1 , · · · , m − 3; 2 2 2  c m+k+i + c i+1 + c k+i+1   2 2 2    i = 1, 3, · · · , k − 1; +d m+k+i + d i−1 + d k+i−1 ,   2 2 2    m+i c + c i = k + 1, k + 3, + c m+k+i k+i+1   2 2 2    +d m+k+i + d m+i + d k+i−1 , · · · , m − k − 2;   2 2 2     c k−m+i + c m+i + c k+i+1 i = m − k, m−k+2,  2 2 2     · · · , m − 2; +d k−m+i + d m+i + d k+i−1 ,   2 2 2  c m+k−1 + c k + d m+k+1 + d k , i = m − 1. −1 2

2

2

2

Case 4: When m is odd and k is even, k = we have:

m−1 , 2

ri + si = c i + c k+i + c m+k+i+1   2 2 2  +d i + d k+i + d m+k+i−1 ,    2 2  2   c + c m+k+1 + dk + d m+k−1  k  2 2    c k+i + c k−m+i+1 + c m+i+1  2 2 2   +d k+i + d k−m+i−1 + d m+i−1 , 2

i = 0, 2, · · · , k − 2; i = k; i = k + 2, k + 4, · · · , m − 3;

2

2

 c m+k−1 + c k + d m+k−1 −1 + d k −1 , i = m − 1;   2 2 2 2    c m+k+i + c i+1 + c k+i+1   2 2 2    i = 1, 3, · · · , k − 1; +d m+k+i + d i−1 + d k+i−1 ,   2 2 2    c k−m+i + c m+i + c k+i+1 i = k + 1, k + 3,   2 2 2   +d k−m+i + d m+i + d k+i−1 , · · · , m − 2. 2

2

Case 6: When m is even and k is odd, m = 2k i = 0, 2, · · · , m 2 − 3; i=

 +d0 + d k+m−1   2   c k+i + c m+k+i + c i+1   2 2 2    +d k+i + d k−m+i + d m+i−1 ,   2 2 2    c k+i + c k−m+i + c m+i+1   2 2 2    +d k+i + d k−m+i + d m+i−1 ,  2  2 2   c k−1 + c m+k−1 + d m2 + d k−1 , 2

m 2

− 1;

i = k + 1, k + 3, · · · , m − 2; i=

2

2

When we rearrange these terms of previous expressions and add them up, we have:

$$\begin{aligned}
2^{n_1-1} + 2^{n_1-2} + \cdots + 1 &= 2^{n_1} - 1, \\
2^{n_2-1} + 2^{n_2-2} + \cdots + 1 &= 2^{n_2} - 1, \\
&\;\;\vdots \\
2^{n_t-1} + 2^{n_t-2} + \cdots + 1 &= 2^{n_t} - 1.
\end{aligned}$$

Obviously, $2^{n_1} - 1 + 2^{n_2} - 1 + \cdots + 2^{n_t} - 1 = i - W(i)$, which concludes the proposition.
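The identity $\lfloor \frac{i}{2} \rfloor + \lfloor \frac{i}{4} \rfloor + \cdots + \lfloor \frac{i}{2^{n_1}} \rfloor = i - W(i)$, which turns the saved-gate count (17) into $\frac{m-k}{2} - W(\frac{m-k}{2})$, is easy to confirm by exhaustive search over small integers. The sketch below uses hypothetical helper names, and the pair $m = 409$, $k = 87$ at the end is picked purely for illustration.

```python
def hamming_weight(i):
    """W(i): number of ones in the binary expansion of i."""
    return bin(i).count("1")

def halving_sum(i):
    """floor(i/2) + floor(i/4) + ... until the terms vanish."""
    total = 0
    i >>= 1
    while i:
        total += i
        i >>= 1
    return total

def saved_xor_gates(m, k):
    """Expression (17): floor((m-k)/4) + floor((m-k)/8) + ..., for even m - k."""
    return halving_sum((m - k) // 2)

# Illustrative values only: (m - k) / 2 = 161 = 0b10100001, so W = 3.
print(saved_xor_gates(409, 87))   # 158 = 161 - 3
```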

2

ri + si = c i + c k+i+1 + c m+k+i+1 + d i  2 2 2 2   +d m+k+i−1 + d k+i−1    2 2    c + c k−1 + dk−1 + d k−1  k  2 2    + c k+i+1 + c k−m+i+1 c m+i  2 2 2   +d m+i + d k+i−1 + d k−m+i−1 ,   2 2 2   ck + c m+k+1 + c0 + dk

2

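Proof: Firstly, it is clear that

$$\begin{aligned}
\left\lfloor \frac{i}{2} \right\rfloor &= 2^{n_1-1} + 2^{n_2-1} + \cdots + 2^{n_t-1}, \\
\left\lfloor \frac{i}{4} \right\rfloor &= 2^{n_1-2} + 2^{n_2-2} + \cdots + 2^{n_t-2}, \\
&\;\;\vdots \\
\left\lfloor \frac{i}{2^{n_t}} \right\rfloor &= 2^{n_1-n_t} + 2^{n_2-n_t} + \cdots + 2^{n_{t-1}-n_t} + 2^0, \\
&\;\;\vdots \\
\left\lfloor \frac{i}{2^{n_1}} \right\rfloor &= 1.
\end{aligned}$$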

m 2;

APPENDIX C

Proposition 2 The time complexity of our multiplier is at most $2T_X$ higher than those of Fan [4] and Hariri [18] scheme.

Proof: According to Table 5, the time complexity of [4] and [18] are the same, which is equal to $T_A + \lceil \log_2(2m-k-1) \rceil T_X$, if $0 < k \le$

i = 1, 3, · · · , i=

m 2

m 2

− 2;

+ 2, k + 4,

· · · , m − 3; i = m − 1;

$\frac{m-1}{2}$, or $T_A + \lceil \log_2(m+k) \rceil T_X$, if $k = \frac{m}{2}$. When we compare it with the formulae with respect to time delay presented in (21), (22) and Table 4, we have:

13

1) $m$ odd, $k$ odd, $1 \le k \le \frac{m-3}{2}$: $2m-k-1-(m+k+2) = m-2k-3 \ge 0$;
2) $m$ odd, $k$ odd, $k = \frac{m-1}{2}$: $2m-k-1-(m+k+2) = -2$;
3) a) $m$ odd, $k$ even, $1 \le k \le \frac{m-1}{3}$: $2m-k-1-(2m-3k-2) = 2k+1 > 0$;
   b) $m$ odd, $k$ even, $\frac{m-1}{3} < k \le \frac{m-3}{2}$: $2m-k-1-3k = 2m-4k-1 > 0$;
4) $m$ odd, $k$ even, $k = \frac{m-1}{2}$: $2m-k-1-3k = 2m-4k-1 = 1 > 0$;
5) $m$ even, $k$ odd, $m > 2k$: $2m-k-1-m = m-k-1 > 0$;
6) $m$ even, $k$ odd, $m = 2k$: $m+k-m = k > 0$.

Therefore, except in case 2, all the formulae related to the time delay require at most $2T_X$ more than those of the schemes of Fan [4] and Hariri [18].

In case 2, we note that the difference between $m+k+2$ and $2m-k-1$ is only 2, so $\lceil \log_2(m + \frac{m-1}{2} + 2) \rceil$ is very likely equal to $\lceil \log_2(2m - \frac{m-1}{2} - 1) \rceil$. The two values differ only if $m + \frac{m-1}{2} + 2 = 2^{\ell} + 1$ or $2^{\ell} + 2$, where $\ell > 0$ is an integer. We found no such irreducible trinomial for $m \in [100, 2048]$, the range of cryptographic interest. Therefore, in case 2, our multiplier is also at most $2T_X$ slower than the schemes of Fan [4] and Hariri [18], which concludes the proof of the proposition.
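The two appendix arguments can be spot-checked numerically. The following Python sketch is only an illustration, not part of the proposed architecture: it checks the identity $\lfloor i/2 \rfloor + \lfloor i/4 \rfloor + \cdots = i - W(i)$ for small $i$, and it evaluates the differences of the logarithm arguments exactly as they are written in cases 1) to 6). All function names (`floor_sum`, `log_argument_gap`, `case2_log_mismatches`, and so on) are hypothetical helpers introduced here. The sketch does not test whether $x^m + x^k + 1$ is irreducible, so it only lists candidate pairs $(m, k)$ for case 2.

```python
def hamming_weight(i):
    """W(i): number of 1 bits in the binary expansion of i."""
    return bin(i).count("1")


def floor_sum(i):
    """floor(i/2) + floor(i/4) + ..., summed until the quotient reaches 0."""
    total, q = 0, i >> 1
    while q:
        total += q
        q >>= 1
    return total


def ceil_log2(n):
    """Exact ceil(log2(n)) for n >= 1, using integer arithmetic only."""
    return (n - 1).bit_length()


def log_argument_gap(m, k):
    """Logarithm argument of [4]/[18] minus ours, following cases 1) to 6)."""
    if m % 2 == 1 and k % 2 == 1:        # cases 1) and 2)
        return (2 * m - k - 1) - (m + k + 2)
    if m % 2 == 1 and k % 2 == 0:        # cases 3a), 3b) and 4)
        ours = 2 * m - 3 * k - 2 if 3 * k <= m - 1 else 3 * k
        return (2 * m - k - 1) - ours
    if m % 2 == 0 and k % 2 == 1:        # cases 5) and 6)
        return (2 * m - k - 1) - m if m > 2 * k else (m + k) - m
    # m and k both even: x^m + x^k + 1 is a square over GF(2), hence reducible.
    raise ValueError("m and k both even is outside the six cases")


def check_cases(max_m):
    """True if the gap is -2 in case 2 and non-negative in every other case."""
    for m in range(3, max_m + 1):
        for k in range(1, m // 2 + 1):
            if m % 2 == 0 and k % 2 == 0:
                continue
            gap = log_argument_gap(m, k)
            in_case2 = m % 2 == 1 and k % 2 == 1 and 2 * k == m - 1
            if (in_case2 and gap != -2) or (not in_case2 and gap < 0):
                return False
    return True


def case2_log_mismatches(lo, hi):
    """Pairs (m, k) of case 2 whose two ceil(log2(.)) values differ.

    Irreducibility of x^m + x^k + 1 is NOT tested here.
    """
    hits = []
    for m in range(lo, hi + 1):
        if m % 4 != 3:                   # need m odd and k = (m - 1) / 2 odd
            continue
        k = (m - 1) // 2
        if ceil_log2(m + k + 2) != ceil_log2(2 * m - k - 1):
            hits.append((m, k))
    return hits


if __name__ == "__main__":
    print(all(floor_sum(i) == i - hamming_weight(i) for i in range(1, 4096)))
    print(check_cases(300))
    print(case2_log_mismatches(100, 2048))
```

Any pair reported by `case2_log_mismatches` is only a candidate: the claim in the proof concerns irreducible trinomials, so each candidate would still have to be tested for irreducibility with a separate tool.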

REFERENCES

[1] A.J. Menezes, I.F. Blake, X. Gao, R.C. Mullin, S.A. Vanstone, T. Yaghoobian, Applications of Finite Fields, Kluwer Academic, Norwell, Massachusetts, USA, 1993.
[2] I. Blake, G. Seroussi, N. Smart, Elliptic Curves in Cryptography, Lond. Math. Soc. Lect. Note Ser., vol. 265, Cambridge University Press, 1999.
[3] R. Lidl, H. Niederreiter, Finite Fields, Cambridge University Press, New York, NY, USA, 1996.
[4] H. Fan, M.A. Hasan, “Fast Bit Parallel-Shifted Polynomial Basis Multipliers in GF(2^n),” IEEE Trans. Circuits and Systems I: Fundamental Theory and Applications, vol. 53, no. 12, pp. 2606–2615, Dec. 2006.
[5] A. Cilardo, “Fast Parallel GF(2^m) Polynomial Multiplication for All Degrees,” IEEE Trans. Computers, vol. 62, no. 5, pp. 929–943, May 2013.
[6] B. Sunar and Ç.K. Koç, “Mastrovito multiplier for all trinomials,” IEEE Trans. Computers, vol. 48, no. 5, pp. 522–527, 1999.
[7] A. Halbutogullari and Ç.K. Koç, “Mastrovito multiplier for general irreducible polynomials,” IEEE Trans. Computers, vol. 49, no. 5, pp. 503–518, May 2000.
[8] T. Zhang and K.K. Parhi, “Systematic design of original and modified Mastrovito multipliers for general irreducible polynomials,” IEEE Trans. Computers, vol. 50, no. 7, pp. 734–749, July 2001.
[9] P.L. Montgomery, “Modular multiplication without trial division,” Mathematics of Computation, vol. 44, no. 170, pp. 519–521, 1985.
[10] Ç.K. Koç and T. Acar, “Montgomery multiplication in GF(2^k),” Designs, Codes and Cryptography, vol. 14, no. 1, pp. 57–69, 1998.
[11] J.C. Bajard, L. Imbert, and C. Negre, “Arithmetic operations in finite fields of medium prime characteristic using the Lagrange representation,” IEEE Trans. Computers, vol. 55, no. 9, pp. 1167–1177, 2006.
[12] C. Chiou, C. Lee, A. Deng, and J. Lin, “Concurrent Error Detection in Montgomery Multiplication over GF(2^m),” IEICE Trans. Fundamentals of Electronics, Comm. and Computer Sciences, vol. 89, no. 2, pp. 566–574, 2006.
[13] M. Morales-Sandoval, C. Feregrino-Uribe, P. Kitsos, and R. Cumplido, “Area/performance trade-off analysis of an FPGA digit-serial GF(2^m) Montgomery multiplier based on LFSR,” Comput. Electr. Eng., vol. 39, no. 2, pp. 542–549, February 2013.
[14] H. Wu, “Montgomery multiplier and squarer for a class of finite fields,” IEEE Trans. Comput., vol. 51, no. 5, pp. 521–529, 2002.
[15] C.-Y. Lee, C.-C. Chen, E.-H. Lu, “Compact Bit-Parallel Systolic Montgomery Multiplication Over GF(2^m) Generated by Trinomials,” TENCON 2006. 2006 IEEE Region 10 Conference, pp. 1–4, November 2006.
[16] C.-Y. Lee, C.W. Chiou, J.-M. Lin, C.-C. Chang, “Scalable and systolic Montgomery multiplier over GF(2^m) generated by trinomials,” IET Circuits, Devices & Systems, vol. 1, no. 6, pp. 477–484, December 2007.
[17] C.-Y. Lee, J.-S. Horng, I.-C. Jou, E.-H. Lu, “Low-Complexity Bit-Parallel Systolic Montgomery Multipliers for Special Classes of GF(2^m),” IEEE Trans. Comput., vol. 54, no. 9, pp. 1061–1070, September 2005.
[18] A. Hariri and A. Reyhani-Masoleh, “Bit-serial and bit-parallel Montgomery multiplication and squaring over GF(2^m),” IEEE Trans. Computers, vol. 58, no. 10, pp. 1332–1345, October 2009.
[19] S. Park, K. Chang, D. Hong, and C. Seo, “New efficient bit-parallel polynomial basis multiplier for special pentanomials,” Integration, the VLSI Journal, vol. 47, no. 1, pp. 130–139, 2014.
[20] S. Park, “Explicit formulae of polynomial basis squarer for pentanomials using weakly dual basis,” Integration, the VLSI Journal, vol. 45, pp. 205–210, 2012.
[21] Y. Cho, N. Chang, C. Kim, Y. Park, and S. Hong, “New bit parallel multiplier with low space complexity for all irreducible trinomials over GF(2^n),” IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 20, no. 10, pp. 1903–1908, October 2012.
[22] X. Xiong, H. Fan, “GF(2^n) bit-parallel squarer using generalised polynomial basis for new class of irreducible pentanomials,” Electronics Letters, vol. 50, no. 9, pp. 655–657, April 2014.
[23] M. Elia, M. Leone, and C. Visentin, “Low complexity bit-parallel multipliers for GF(2^m) with generator polynomial x^m + x^k + 1,” Electronics Letters, vol. 35, no. 7, pp. 551–552, 1999.
[24] Y. Li, G. Chen, and J. Li, “Speedup of bit-parallel Karatsuba multiplier in GF(2^m) generated by trinomials,” Inf. Process. Lett., vol. 111, no. 8, pp. 390–394, March 2011.


[25] N. Petra, D. De Caro, and A.G.M. Strollo, “A novel architecture for Galois fields GF(2^m) multipliers based on Mastrovito scheme,” IEEE Trans. Computers, vol. 56, no. 11, pp. 1470–1483, November 2007.
[26] H. Shen, Y. Jin, “Low complexity bit parallel multiplier for GF(2^m) generated by equally-spaced trinomials,” Inf. Process. Lett., vol. 107, no. 6, pp. 211–215, August 2008.
[27] G. Shou, Z. Mao, Y. Hu, Z. Guo, Z. Qian, “Low complexity architecture of bit parallel multipliers for GF(2^m),” Electronics Letters, vol. 46, no. 19, pp. 1326–1327, September 2010.
[28] Recommended Elliptic Curves for Federal Government Use, http://csrc.nist.gov/groups/ST/toolkit/documents/dss/NISTReCur.pdf, July 1999.