Novel Arithmetic Architecture for High Performance Implementation of SHA-3 Finalist Keccak on FPGA Platforms

Kashif Latif, M. Muzaffar Rao, Athar Mahboob, and Arshad Aziz

National University of Sciences and Technology (NUST), H-12 Islamabad, Pakistan
{kashif,mrao,athar}@pnec.edu.pk, [email protected]

Abstract. We propose a high-speed architecture for Keccak that uses Look-Up Table (LUT) resources on FPGAs to minimize the area of the Keccak data path and to shorten its critical paths. This approach allows us to design the Keccak data path with minimal resources and higher clock frequencies. We report our results in terms of chip area consumption, throughput and throughput per area. At the time of writing, the design presented in this work achieves the highest throughput reported for any of the SHA-3 candidates, reaching 13.67 Gbps for Keccak-256 on Virtex 6. This can enable line-rate hashing on 10 Gbps network interfaces.

Keywords: SHA-3, Keccak, Cryptographic Hash Functions, High Speed Encryption Hardware, FPGA, Reconfigurable Computing.

1 Introduction

Cryptographic hash algorithms are used in digital signatures, message authentication codes (MACs) and many other information security applications. Vulnerabilities found in recent years in a number of hash functions, including SHA-0, SHA-1, SHA-2, RIPEMD and MD5, have cast doubt on the long-term security of these algorithms [1-3]. To ensure the long-term robustness of applications that use hash functions, the National Institute of Standards and Technology (NIST) announced a public competition, in the Federal Register Notice published on November 2, 2007 [4], to develop a new cryptographic hash algorithm called SHA-3. The competition is now in its final round with five short-listed candidates: BLAKE, Grøstl, JH, Keccak and Skein. The tentative time frame for the end of the competition and the selection of the official SHA-3 is the fourth quarter of 2012 [5].

This paper describes a high-throughput, efficient hardware implementation of Keccak. The remainder of this paper is organized as follows. Section 2 gives a brief description of Keccak. In Section 3 we present the efficient hardware implementation of Keccak, elaborating our novel architectural approach using LUT resources. In Section 4 we give the results of our work, and in Section 5 we compare them with other reported efficient implementations of Keccak. Finally, we provide some conclusions in Section 6.

O.C.S. Choy et al. (Eds.): ARC 2012, LNCS 7199, pp. 372–378, 2012. © Springer-Verlag Berlin Heidelberg 2012

2 Brief Description of Keccak

Keccak is a family of sponge functions with members Keccak[r, c] characterized by two parameters, the bitrate r and the capacity c. The sum r + c determines the width of the Keccak-f permutation used in the sponge construction and is restricted to values in {25, 50, 100, 200, 400, 800, 1600} [6]. For the SHA-3 proposal, the Keccak team proposed Keccak[1600] with different r and c values for each desired hash output length [6]. The 1600-bit state of Keccak[1600] consists of a 5x5 matrix of 64-bit words. Each compression step of Keccak consists of 24 rounds. Let us denote the state matrix by A. Each round then consists of the following five steps:

Theta (θ):

$$C[x] = A[x,0] \oplus A[x,1] \oplus A[x,2] \oplus A[x,3] \oplus A[x,4], \quad 0 \le x \le 4 \tag{1}$$

$$D[x] = C[x-1] \oplus \mathrm{ROT}(C[x+1], 1), \quad 0 \le x \le 4 \tag{2}$$

$$A[x,y] = A[x,y] \oplus D[x], \quad 0 \le x, y \le 4 \tag{3}$$

Rho (ρ) - Pi (π):

$$B[y, 2x+3y] = \mathrm{ROT}(A[x,y], r[x,y]), \quad 0 \le x, y \le 4 \tag{4}$$

Chi (χ):

$$A[x,y] = B[x,y] \oplus (\lnot B[x+1,y] \land B[x+2,y]), \quad 0 \le x, y \le 4 \tag{5}$$

Iota (ι):

$$A[0,0] = A[0,0] \oplus RC \tag{6}$$

In the equations listed above, all operations on the indices are done modulo 5. A denotes the complete permutation state array and A[x,y] denotes a particular 64-bit word in that state. B[x,y], C[x] and D[x] are intermediate variables. The symbol ⊕ denotes the bitwise XOR, ¬ the bitwise complement and ∧ the bitwise AND operation. Finally, ROT(W, r) denotes the bitwise cyclic shift operation, moving the bit at position i into position i + r (modulo 64). The constants r[x,y] and RC are the cyclic shift offsets and the round constants respectively, and are defined in [6].
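As a software illustration of the round described by equations (1) to (6), the following Python sketch may be helpful. It is not the hardware design of this paper. The function names (`rot`, `keccak_round`, `keccak_f1600`) are chosen here for illustration only, and the rotation offsets and round constants are generated at start-up, assuming the generation rules of the Keccak specification [6], rather than copied from a table.

```python
MASK64 = (1 << 64) - 1


def rot(w, r):
    """Cyclic shift of a 64-bit word: bit i moves to position (i + r) mod 64."""
    r %= 64
    return ((w << r) | (w >> (64 - r))) & MASK64


def rotation_offsets():
    """Offsets r[x][y]: triangular numbers mod 64 along the pi trajectory
    (assumed generation rule from the Keccak specification)."""
    r = [[0] * 5 for _ in range(5)]
    x, y = 1, 0
    for t in range(24):
        r[x][y] = ((t + 1) * (t + 2) // 2) % 64
        x, y = y, (2 * x + 3 * y) % 5
    return r


def round_constants():
    """24 round constants produced by an 8-bit LFSR
    (assumed generation rule from the Keccak specification)."""
    constants, lfsr = [], 1
    for _ in range(24):
        rc = 0
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                rc ^= 1 << ((1 << j) - 1)
        constants.append(rc)
    return constants


R = rotation_offsets()
RC = round_constants()


def keccak_round(A, rc):
    """One round on a 5x5 list of 64-bit integers, indexed A[x][y]."""
    # Theta, equations (1)-(3)
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    D = [C[(x - 1) % 5] ^ rot(C[(x + 1) % 5], 1) for x in range(5)]
    A = [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]
    # Rho and Pi, equation (4)
    B = [[0] * 5 for _ in range(5)]
    for x in range(5):
        for y in range(5):
            B[y][(2 * x + 3 * y) % 5] = rot(A[x][y], R[x][y])
    # Chi, equation (5)
    A = [[B[x][y] ^ (~B[(x + 1) % 5][y] & B[(x + 2) % 5][y])
          for y in range(5)] for x in range(5)]
    # Iota, equation (6)
    A[0][0] ^= rc
    return A


def keccak_f1600(A):
    """Apply the 24 rounds to the state A and return the new state."""
    for rc in RC:
        A = keccak_round(A, rc)
    return A


zero_state = [[0] * 5 for _ in range(5)]
print(hex(keccak_f1600(zero_state)[0][0]))
```

Each pass through `keccak_round` corresponds to one application of the five steps, so `keccak_f1600` performs the 24 rounds of one compression step.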

3 Implementation

We have implemented the 256-bit and 512-bit variants of Keccak on Xilinx Virtex 5 and Virtex 6 FPGAs. We have made extensive use of Xilinx-specific library resources in the design of the Keccak data path. Xilinx LUT primitives are used to implement a single Keccak round. This round is then iterated the required number of times to realize the Keccak compression function. The use of primitives makes our design more efficient, with smaller area, higher speed and higher throughput per area than other reported work on Keccak.

3.1 Datapath of Keccak

The data path implemented for Keccak is shown in Fig. 1. A_Reg represents the state matrix register on which the processing of the Keccak algorithm takes place. The Keccak data path is fully parameterized, so that the design may be synthesized for any value of r (bitrate) and c (capacity). For this reason, the width of each net is marked as r, c or r + c in Fig. 1. The length of A_Reg also varies according to r and c and is defined as r + c bits. For Keccak-256, r is 1088 bits and c is 512 bits. For Keccak-512, r is 576 bits and c is 1024 bits. Accordingly, A_Reg is 1600 bits wide in both cases.

Fig. 1. Data path of Keccak (block diagram; labeled elements: msg, 0's, Concat., A_Reg, θ, ρ || π, χ, ι, RC ROM with counter, Trunc., hash; net widths r, c, r + c and 64)

At the beginning of every hash process, A_Reg is initialized with all zeros. The first message block is copied directly to A_Reg after concatenating it with a c-bit-wide stream of 0's. The Concat block in Fig. 1 represents this concatenation operation. The compression function of Keccak consists of five steps. In Fig. 1 each step is denoted by the symbol used in the Keccak specification. These steps are θ, ρ, π, χ and ι. We have combined these steps during implementation wherever possible. We have implemented ρ and π as a single step. The round constants (RC) are stored in a 24x64-bit single-port distributed ROM. The respective round constant is addressed during each round using the round number as the ROM address.

3.2 Novel Arithmetic Architecture for Keccak Compression Function

Our novel arithmetic architecture for the compression function of Keccak is now described. The compression function of the Keccak algorithm consists of simple XOR, AND and NOT operations. These operations are implemented using LUT primitives from the Xilinx-specific libraries. The details of the implementation of each step follow.

Theta (θ) Step: There are three equations in the θ step. Equation (1) is implemented using the LUT5 primitive for XOR logic, as shown in Fig. 2(a). The INIT value in hexadecimal, shown under attributes in Fig. 2(a), configures the LUT to perform an XOR operation on its inputs. The INIT value is derived by laying down the truth table for all possible combinations of LUT inputs. To XOR the five 64-bit operands of equation (1), the LUT5 primitive is instantiated 64 times. For the complete implementation of equation (1), 5x64 LUT5 instances are required.

Fig. 2. LUT primitives from Xilinx Library used to implement Keccak's round steps: (a) 5-bit XOR used in θ step (LUT5, INIT = 96696996); (b) 3-bit XOR used in θ step (LUT3, INIT = 96); (c) 3-bit logic used in χ step (LUT3, INIT = D2); (d) 2-bit XOR used in ι step (LUT2, INIT = 6)

We can combine equation (2) with equation (3) as follows:

$$A[x,y] = A[x,y] \oplus C[x-1] \oplus \mathrm{ROT}(C[x+1], 1), \quad 0 \le x, y \le 4 \tag{7}$$
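The truth-table derivation of the INIT values listed in Fig. 2 can be sketched in a few lines of Python. This is only an illustration: the helper names (`lut_init`, `xor_all`, `chi_bit`) are hypothetical, bit i of INIT is assumed to hold the LUT output for the input pattern whose binary encoding is i (with I0 as the least significant bit), and the pin assignment used for the χ logic is an assumption chosen because it reproduces the value D2.

```python
def lut_init(func, n_inputs):
    """Build the INIT word of an n-input LUT from its truth table.

    Assumed convention: bit i of INIT is the output for the input
    pattern i, where input I_k is bit k of i.
    """
    init = 0
    for i in range(1 << n_inputs):
        bits = [(i >> k) & 1 for k in range(n_inputs)]  # bits[k] is I_k
        if func(*bits):
            init |= 1 << i
    return init


def xor_all(*bits):
    """XOR of all LUT inputs (used for the theta and iota steps)."""
    out = 0
    for b in bits:
        out ^= b
    return out


def chi_bit(i0, i1, i2):
    """One bit of the chi step.

    Assumed pin mapping: I2 = B[x,y], I1 = B[x+1,y], I0 = B[x+2,y].
    """
    return i2 ^ ((i1 ^ 1) & i0)


print(f"LUT5, 5-bit XOR : {lut_init(xor_all, 5):08X}")
print(f"LUT3, 3-bit XOR : {lut_init(xor_all, 3):02X}")
print(f"LUT3, chi logic : {lut_init(chi_bit, 3):02X}")
print(f"LUT2, 2-bit XOR : {lut_init(xor_all, 2):X}")
```

Under these assumptions the four printed values are 96696996, 96, D2 and 6, which agree with the attributes in Fig. 2.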

For the implementation of equation (7), the LUT3 primitive is used for XOR logic, as shown in Fig. 2(b). The one-bit rotation in the last operand of equation (7) is implemented through rewiring. To implement the complete logic, 25x64 instantiations of the LUT3 primitive are required.

Rho (ρ) and Pi (π) Steps: ρ and π are permutations, which may be achieved through simple rewiring in hardware at no resource cost. The cyclic shift constant r[x,y] is fixed and known for each position of the matrix A. It is also implemented by means of fixed rewiring.

Chi (χ) Step: In the χ step, three logical operations, XOR, NOT and AND, are used. These are implemented using the LUT3 primitive as shown in Fig. 2(c). To accomplish the χ step, the LUT3 with this logic is instantiated 25x64 times.

Iota (ι) Step: The ι step involves a simple XOR of the round constant with the least significant 64 bits of A_Reg, i.e. A[0,0]. It is implemented using the LUT2 primitive as shown in Fig. 2(d). LUT2 is instantiated 64 times for the ι step.

These five steps, i.e. a single round of the Keccak algorithm, are accomplished in one clock cycle. Therefore 24 clock cycles are required to complete the 24 rounds of the Keccak algorithm. After completion of 24 rounds on a message block, the resulting r bits of the state in A_Reg are XORed with the next message block and the same round sequence is repeated. This process continues until all message blocks are processed. At the end, the state of A_Reg is truncated to the desired hash output length.
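The block-level control flow around A_Reg described above can be modeled roughly as follows. This is a behavioral sketch, not the RTL of the design. The function name `absorb_and_truncate` and its parameters are hypothetical, message blocks are assumed to be already padded to r bits, the r-bit portion is assumed to sit in the low-order bits of the state, and the round function is passed in as a callable so that any model of a single Keccak round can be plugged in.

```python
def absorb_and_truncate(blocks, r, c, out_bits, round_fn, rounds=24):
    """Behavioral model of the A_Reg control flow.

    blocks   : list of ints, each one already-padded r-bit message block
    round_fn : callable (state, round_index) -> state, on (r + c)-bit ints
    Returns (hash_value, clock_cycles), counting one cycle per round.
    """
    rate_mask = (1 << r) - 1
    width_mask = (1 << (r + c)) - 1
    a_reg = 0          # A_Reg is initialized with all zeros
    cycles = 0
    for block in blocks:
        if block >> r:
            raise ValueError("message block wider than r bits")
        # Concat: the block is extended with c zero bits and XORed into
        # the state. For the first block this equals a direct copy,
        # because A_Reg is still zero.
        a_reg ^= block & rate_mask
        for i in range(rounds):
            a_reg = round_fn(a_reg, i) & width_mask
            cycles += 1
    # Trunc: keep only the desired number of output bits
    return a_reg & ((1 << out_bits) - 1), cycles


# Toy stand-in for the round logic (identity), used only to show the flow.
digest, cycles = absorb_and_truncate([5, 3], 1088, 512, 256, lambda s, i: s)
print(digest, cycles)
```

With r = 1088, c = 512 and out_bits = 256 the parameters correspond to the Keccak-256 configuration, and every message block costs 24 cycles in this model.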

4 Implementation Results

The design has been implemented on Xilinx Virtex 5 and Virtex 6. The resulting clock frequencies and area utilization are reported after place and route. Table 1 shows the achieved area consumption, clock frequency, throughput and throughput per area for the implemented designs, together with the message block size in bits and the number of clock cycles required to hash a single message block.

Table 1. Results for Keccak. Clock frequency in MHz, area in Slices, throughput in Gb/s and throughput per area in Mbps/Slice.

Device    Variant     Cycles  Block size  Frequency  Area  Throughput  Throughput/Area
Virtex 6  Keccak-256  24      1088        301.57     915   13.67       14.94
Virtex 6  Keccak-512  24      576         291.21     1015  6.99        6.89
Virtex 5  Keccak-256  24      1088        275.56     1333  12.49       9.37
Virtex 5  Keccak-512  24      576         263.16     1197  6.32        5.28
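The figures in Table 1 are consistent with computing throughput as block size times clock frequency divided by the number of cycles per block, and throughput per area as throughput divided by the slice count. The short Python check below assumes exactly these two relations. The function names are illustrative, and the input values are taken from Table 1.

```python
def throughput_gbps(block_bits, f_mhz, cycles):
    """Throughput in Gb/s: block_bits are processed every `cycles` clocks."""
    return block_bits * f_mhz / cycles / 1000.0


def throughput_per_area(tp_gbps, slices):
    """Throughput per area in Mbps/Slice."""
    return tp_gbps * 1000.0 / slices


# (device, variant, block size in bits, frequency in MHz, area in slices)
RESULTS = [
    ("Virtex 6", "Keccak-256", 1088, 301.57, 915),
    ("Virtex 6", "Keccak-512", 576, 291.21, 1015),
    ("Virtex 5", "Keccak-256", 1088, 275.56, 1333),
    ("Virtex 5", "Keccak-512", 576, 263.16, 1197),
]

for device, variant, block_bits, f_mhz, slices in RESULTS:
    tp = throughput_gbps(block_bits, f_mhz, cycles=24)
    tpa = throughput_per_area(tp, slices)
    print(f"{device} {variant}: {tp:.2f} Gb/s, {tpa:.2f} Mbps/Slice")
```

For example, Keccak-256 on Virtex 6 gives 1088 x 301.57 / 24, or about 13.67 Gb/s, and dividing by 915 slices gives about 14.94 Mbps/Slice.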

5 Comparison with Previous Work

Table 2 shows the comparison of our results with previously reported implementations in terms of throughput, area and throughput per area. E. Homsirikamol et al. [12] discussed and reported their results for various architectures of Keccak using pipelining, folding and loop unrolling approaches. For the performance comparison, we considered their results for the architecture based on the basic iterative approach. However, our results in terms of throughput per area exceed all of their results. Our results for Virtex 6 and Virtex 5 are well ahead of all previously reported work in terms of throughput per area, except for Keccak-512 on Virtex 5. We show the best results of our work in bold font in Table 2.

Table 2. Comparison of Keccak Implementations. Clock frequency (Freq.) in MHz, area in Slices, throughput (TP) in Gbps and throughput per area (TPA) in Mbps/Slice.

                                 Keccak-256                    Keccak-512
Author(s)             Device     Freq.   Area  TP     TPA      Freq.   Area  TP    TPA
This work             Virtex 6   301.57  915   13.67  14.94    291.21  1015  6.99  6.89
This work             Virtex 5   275.56  1333  12.49  9.37     263.16  1197  6.32  5.28
Keccak Team [6]       Virtex 5   122.00  1330  5.20   3.91     -       -     -     -
Strömbergson [7]      Spartan3A  85.00   3393  4.80   1.41     -       -     -     -
Strömbergson [7]      Virtex 5   118.00  1483  6.70   4.52     -       -     -     -
Baldwin et al. [8]    Virtex 5   195.73  1971  6.26   3.17     195.73  1971  8.52  4.32
Matsuo et al. [9]     Virtex 5   205.00  1433  4.20   2.93     -       -     -     -
Akin et al. [10]      Spartan 3  81.40   2024  3.46   1.71     -       -     -     -
Akin et al. [10]      Virtex-II  136.60  2024  5.81   2.87     -       -     -     -
Akin et al. [10]      Virtex 4   142.90  2024  6.07   3.00     -       -     -     -
Kris Gaj et al. [11]  Virtex 5   238.38  1229  10.81  8.79     276.86  1236  6.64  5.37
E. Hom. et al. [12]   Virtex 6   -       1165  11.84  10.17    -       1231  7.23  5.87
E. Hom. et al. [12]   Virtex 5   -       1395  12.77  9.16     -       1220  6.56  5.37

6 Conclusion

In this work we have presented a high-throughput hardware implementation of the SHA-3 finalist Keccak. Look-Up Table (LUT) resources on FPGAs are used to enhance the hardware performance of Keccak in terms of both speed and area. We reported the implementation results of Keccak-256 and Keccak-512 on Xilinx Virtex 6 and Virtex 5 in terms of area, throughput and throughput per area, and compared them with previously reported implementation results. The results achieved in this work exceed those of the implementations reported so far. We compared and contrasted the performance figures of Keccak-256 and Keccak-512 on Virtex 5 and Virtex 6. This work serves as a performance investigation of Keccak on the most up-to-date FPGAs. Moreover, our design can be used in the latest gigabit wire-speed communication networks such as 10 Gbps Ethernet.

References

1. Wang, X., Lai, X., Feng, D., Yu, H.: Collisions for hash functions MD4, MD5, HAVAL-128 and RIPEMD. Cryptology ePrint Archive, Report 2004/199, pp. 1–4 (2004), http://eprint.iacr.org/2004/199
2. Szydlo, M.: SHA-1 collisions can be found in 2^63 operations. CryptoBytes Technical Newsletter (2005)
3. Stevens, M.: Fast collision attack on MD5. ePrint-2006-104, pp. 1–13 (2006), http://eprint.iacr.org/2006/104.pdf
4. Federal Register / Vol. 72, No. 212 / Friday, November 2 (2007) / Notices, http://csrc.nist.gov/groups/ST/hash/documents/FR_Notice_Nov07.pdf
5. National Institute of Standards and Technology (NIST): Cryptographic Hash Algorithm Competition, http://www.nist.gov/itl/csd/ct/
6. Bertoni, G., Daemen, J., Peeters, M., Assche, G.V.: The Keccak SHA-3 Submission version 3, pp. 1–14 (2011), http://keccak.noekeon.org/Keccak-submission-3.pdf
7. Strömbergson, J.: Implementation of the Keccak Hash Function in FPGA Devices, pp. 1–4 (2008), http://www.strombergson.com/files/Keccak_in_FPGAs.pdf
8. Baldwin, B., Hanley, N., Hamilton, M., Lu, L., Byrne, A., Neill, M., Marnane, W.P.: FPGA Implementations of the Round Two SHA-3 Candidates. In: 2nd SHA-3 Candidate Conference, Santa Barbara, August 23-24, pp. 1–18 (2010)
9. Matsuo, S., Knezevic, M., Schaumont, P., Verbauwhede, I., Satoh, A., Sakiyama, K., Ota, K.: How Can We Conduct Fair and Consistent Hardware Evaluation for SHA-3 Candidate? In: 2nd SHA-3 Candidate Conference, Santa Barbara, August 23-24, pp. 1–15 (2010)
10. Akin, A., Aysu, A., Ulusel, O.C., Savas, E.: Efficient Hardware Implementations of High Throughput SHA-3 Candidates Keccak, Luffa and Blue Midnight Wish for Single and Multi-Message Hashing. In: 2nd SHA-3 Candidate Conference, Santa Barbara, August 23-24, pp. 1–12 (2010)
11. Gaj, K., Homsirikamol, E., Rogawski, M.: Comprehensive Comparison of Hardware Performance of Fourteen Round 2 SHA-3 Candidates with 512-bit Outputs Using Field Programmable Gate Arrays. In: 2nd SHA-3 Candidate Conference, Santa Barbara, August 23-24, pp. 1–14 (2010)
12. Homsirikamol, E., Rogawski, M., Gaj, K.: Comparing Hardware Performance of Round 3 SHA-3 Candidates using Multiple Hardware Architectures in Xilinx and Altera FPGAs. In: ECRYPT II Hash Workshop 2011, Tallinn, Estonia, May 19-20, pp. 1–15 (2011)