Computer Networks 49 (2005) 476–491 www.elsevier.com/locate/comnet
A note on efficient implementation of prime generation algorithms in small portable devices

Chenghuai Lu *, Andre L.M. Dos Santos
Georgia Institute of Technology, College of Computing, 801 Atlantic Drive, Atlanta, GA 30309, United States

Received 23 December 2003; received in revised form 2 December 2004; accepted 17 December 2004
Available online 23 March 2005
Responsible Editor: G. Schaefer
Abstract

This paper investigates existing prime generation algorithms for small portable devices, optimizes them, and compares their efficiencies. A comparison of performances shows that the bit array algorithm is the most efficient among the existing prime generation algorithms. The paper further optimizes the implementation of the bit array algorithm by using an optimal parameter in the prime generation, namely the small prime set for its sieve procedure, and provides a method for estimating the optimal small prime set. The paper also gives generalized bit array algorithms that can find primes with special constraints, e.g., DSA primes and strong primes. Finally, the algorithms are implemented in a smart card and a PDA for validation. The results show that generating special primes sacrifices very little efficiency with respect to generating random primes, and that using optimal sets of small primes for prime generation yields a 30–200% efficiency improvement.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Public key cryptography; Smart card; Primality test; Sieve procedure
1. Introduction

Cryptographic functions based on public key cryptography [6,19] have gained increasing attention from the research and commercial communities, as well as from end users. The use of public key cryptography can add security to a wide variety of applications. In particular, public key cryptography is a valuable tool for simplifying key management and enabling secure communications. Recently, there has been a strong trend toward using public key cryptography in small portable devices such as smart cards and handheld PDAs to enable them to perform secure transactions. Those devices should be able to implement one or multiple

* Corresponding author. Tel.: +1 404 822 8849. E-mail address: [email protected] (C. Lu).
1389-1286/$ - see front matter 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.comnet.2004.12.007
public key cryptographic systems. Examples of public key algorithms that can be used by portable devices are RSA [17], Diffie–Hellman [7] and DSA [15]. Among them, RSA is the most popular public key cryptosystem and has been widely deployed in many portable devices to support protocols (e.g., communication protocols). Many currently available portable devices possess a limited amount of hardware resources and computing power. Consequently, the processing speed attained by implementing public key cryptographic functions on those devices can be much lower than on desktop computers. Because of this, there have been many studies on using specially designed hardware and software approaches to overcome the limited hardware resources and improve the performance of certain cryptographic functions [10,13,16]. This paper focuses on one of the important problems in public key cryptosystems: the generation of large random primes on resource-constrained devices. Due to the nature of the procedure for prime generation, which will be discussed later in the paper, generating large random primes is very time-consuming. For instance, generating 1024-bit primes, which can correspond to 2048-bit RSA key pairs, may take several minutes on devices like smart cards. For some applications, a user may need a higher security level that requires generating even larger primes, e.g., 2048-bit primes. In this case, the time needed for generating 2048-bit random primes is usually much longer.1 One of the reasons contributing to the low performance of large prime generation is the hardness of finding efficient primality testing algorithms. Another reason is that the existing prime generation algorithms and their implementations are not sufficiently investigated, particularly for small portable devices. A number of prime generation algorithms have been implemented on small portable devices, with performances varying widely.
Unfortunately, most implementations simply use any prime generation algorithm available to them without noticing
1 It could be more than 10 times as much as that for generating 1024-bit primes.
the big performance variations among them. As a result, some implementations run for an unreasonable amount of time. The total time for key generation can be very long and sometimes unacceptable, especially when a group of keys needs to be generated or when low-end tamper-resistant devices are used. In addition, many small portable devices have limited storage space, particularly the non-persistent storage used to hold temporary values. The problem is aggravated by several applications competing for the limited storage. Naturally, prime generation algorithms should optimize their storage requirements, which is not considered in the usual publicly available prime generation algorithms. Therefore, it is necessary to optimize the performance and storage requirements of prime generation algorithms for small portable devices. This paper investigates existing prime generation algorithms, optimizes them and compares their efficiencies in terms of time and memory space required. Hence, it provides a good reference for software engineers who implement large prime generation on small portable devices where resources are limited. The paper initially discusses one of the most used approaches for prime generation: incremental search. Then, several optimizations are made to the incremental search prime generation algorithm. The study of performances shows that the table lookup and bit array algorithms are the most efficient among all the algorithms examined. In addition, the storage requirements are compared. The bit array algorithm requires significantly less memory than the table lookup algorithm and is therefore the best choice among the algorithms when both time and memory efficiency are considered. One of the important issues in optimizing incremental search prime generation algorithms is choosing a small prime set (SPS) for the sieve procedure.
The paper analyzes the factors that affect the choice of the optimal SPS set when generating primes of different sizes on portable devices and proposes a method that can predict those values instead of exhaustively searching for them. Experiments show that the efficiency of prime generation can be improved by 30% in the worst case and 200% in the best case by using optimal SPS sets, compared with using some
commonly used SPS sets. The paper follows with a discussion of a prime generation algorithm, called the constructive method [10,11], which differs from incremental search. The paper compares the performance of the bit array algorithm using incremental search with that of the constructive method and shows that the bit array algorithm outperforms its peer. It can therefore be concluded that the bit array algorithm using incremental search is the most efficient among all the available prime generation algorithms that can be implemented on small portable devices. Some public key cryptosystems have additional requirements on the primes to be generated. For example, RSA requires random primes, or strong primes to meet the X9.31 standard; DSS requires DSA primes; and Diffie–Hellman systems require random or safe primes. This paper describes how the bit array algorithm can be generalized to generate primes guaranteed to satisfy such constraints while preserving efficiency. This paper uses smart cards and PDAs for discussion and validation. However, the results presented can be used as a guideline for any other form of resource-constrained computing system that needs to implement prime generation algorithms.
2. Background

Apart from being mathematically interesting, efficient generation of prime numbers is of extreme importance in modern cryptography. Today, a prime generation implementation should be able to generate random primes that are at least 512 bits long. In many cases, prime generation implementations will need to generate 1024-bit, 2048-bit or longer primes in order to support enhanced security levels in public key cryptosystems. A generic approach for prime generation is to repeatedly select a random number and test it for primality. The random number generation and primality testing stop when a number is found to be prime. The primality test can be either probabilistic or deterministic. A probabilistic primality test will find that a number is composite with probability 1, or that it is prime with some probability smaller than 1. Hence, by repeatedly running the test one gains more and more confidence that a number that does not fail any of the tests is prime. The most common probabilistic primality tests are the Fermat,2 Solovay–Strassen, and Miller–Rabin tests [3,12]. Deterministic primality tests [2,14] will find that a number is prime with probability 1, but they are much more complex and computationally expensive than probabilistic primality tests. Therefore, existing cryptographic applications often use probabilistic primality test algorithms to implement their primality tests. An implementation of the primality test, which is called a primality test oracle in this paper, performs several rounds of probabilistic primality tests on a candidate, using different witnesses. If the candidate passes all the tests, the oracle concludes that it is a probable prime. If the candidate fails any of the tests, the oracle concludes that the candidate is composite. A discussion on choosing the number of rounds of probabilistic primality tests in a primality test oracle can be found in [4]. It should be mentioned that, although probabilistic primality tests are faster than deterministic ones, they are still very expensive to perform. Therefore, it is extremely important to minimize the number of probabilistic primality tests needed to generate a prime in order to achieve good performance. Besides the generic prime generation approach, there is a slightly different paradigm of prime generation, called incremental search [4]. The incremental search randomly chooses an odd candidate q and looks through the sequence of candidates q, q + 2, q + 4, . . ., q + 2s until a probable prime is found, where s represents the number of consecutive odd numbers that will be examined. If the search reaches the end of the sequence, the prime generation fails. Some discussion of the security of using the incremental search method for prime
2 The Fermat test works as follows. (1) Choose a small number a, preferably a prime, as a witness; (2) compute a^(p−1) mod p, where p is the number being tested for primality; (3) if the result is not 1, the candidate is composite; otherwise, the candidate is prime with probability at least 1/2. Other probabilistic primality tests work in a similar way.
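As an illustration, a minimal sketch of one round of the Fermat test in Python (the function name and default witness are ours, not from the paper):

```python
def fermat_test(p, a=2):
    """One round of the Fermat test with witness a.

    Returns False when p is certainly composite; True means p is only
    a *probable* prime with respect to this witness.
    """
    if p < 3 or p % 2 == 0:
        return p == 2
    return pow(a, p - 1, p) == 1  # fast modular exponentiation
```

Note that Fermat pseudoprimes exist: 341 = 11 × 31 passes the test with witness 2, which is why a primality test oracle runs several rounds with different witnesses.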
generation can be found in [5] by Brandt et al. Incremental search has been used in practice by most existing prime generation applications. Therefore, this paper follows the common practice and uses incremental search as a baseline for studying optimizations of prime generation. Incremental search prime generation algorithms can be further optimized by test-dividing candidates against a set of small primes before applying the primality test oracle. This is usually referred to as a sieve procedure [4,12,13,18]. The sieve procedure can detect a portion of composite numbers, thus saving them from expensive probabilistic primality testing. There are three currently available prime generation algorithms based on incremental search: the test-division algorithm, the table-lookup algorithm and the bit array algorithm [13]. These algorithms mainly differ from each other in their sieve procedure implementations. Besides incremental search prime generation algorithms, another type of prime generation algorithm constructs candidates in such a way that they are guaranteed not to be divisible by small primes. These algorithms do not need to examine as many candidates per prime generation as the generic approach, since the constructed candidates have a higher probability of being prime. Therefore, the constructive methods can achieve better efficiency than the generic approach. The referred method is called the constructive method [10,11]. One of the goals of this paper is to implement the prime generation algorithms on portable devices and to compare and analyze their efficiencies. Another goal is to further optimize the best prime generation algorithm to achieve its best efficiency when implemented on portable devices. The authors are unaware of any similar work. Therefore, the authors feel that it is important to perform this study to provide guidelines to developers making prime generation implementations on portable devices.
3. Prime generation algorithms The generic form of incremental search algorithms will choose new candidate numbers by adding 2 to the old ones when they fail in the primality
test oracle until a probable prime is found. The procedure for generating an l-bit prime number using incremental search is illustrated below.

Step 1. Generate an l-bit odd candidate q.
Step 2. Perform primality testing on q.
Step 3. If q is a probable prime, output q and terminate the search.
Step 4. Increment q by 2 and go to Step 2.

In Step 2, a primality test oracle is used to determine whether a candidate is a probable prime. For the purpose of evaluating the efficiency of prime generation, it is typical that the primality test oracle accepts a candidate as prime if it passes one round of the Fermat test using 2 as the witness [4,10,13], without concern for the theoretical probability of the candidate being prime. Hence, one of the most important goals of the optimization is to reduce the number of calls to the primality test oracle, which equals the number of probabilistic primality tests, per prime generation. According to the prime number theorem (cf. [12]), the number of primes less than a number q is asymptotically equal to q/ln(q). Thus, if a candidate less than q is chosen randomly, the probability of the candidate being prime is approximately 1/ln(q); i.e., the generic prime generation algorithm needs to examine ln(q)/2 candidates (only odd candidates are considered) on average before a prime is found. It is proved, based on the well-known r-tuple conjecture [8,9], that the incremental search method also requires examining ln(q)/2 candidates on average before a probable prime is obtained [5]. This also implies ln(q)/2 calls to the primality test oracle. Generally, a sieve procedure is placed before the primality test oracle to improve performance. The sieve procedure detects a portion of composite candidates early by test-dividing them with a set of small primes. If a candidate is detected as composite by the sieve procedure, it is not passed on for the primality test.
Thus, the number of calls to the primality test oracle per prime generation is reduced. The prime generation procedure optimized by test-dividing candidates against a set of small primes is illustrated below.
Step 1. Generate an l-bit candidate q.
Step 2. Test-divide q against a set of small primes.
Step 3. If q is divisible by one of the primes in the set, go to Step 6.
Step 4. Perform primality testing on q.
Step 5. If q is a probable prime, output q and terminate.
Step 6. Increment q by 2 and go to Step 2.

Before we further discuss the optimizations on prime generation, some notation that will be used later is given below.

r             Upper boundary of an SPS set
SPS(r)        The set of small odd primes less than r
p_i           The ith smallest odd prime
PrimeTest(q)  A primality test oracle which returns TRUE if q is a probable prime and FALSE otherwise
K(r)          |SPS(r)|, the number of elements in SPS(r)
s             The number of consecutive odd numbers that will be examined in bit array algorithms
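The six steps above can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation; the one-round Fermat oracle with witness 2 follows the evaluation setup described earlier, and the tiny set SPS(29) is used for the sieve:

```python
import random

SPS_29 = [3, 5, 7, 11, 13, 17, 19, 23]  # SPS(29), a deliberately small sieve set

def prime_test(q):
    # Primality test oracle: one round of the Fermat test with witness 2,
    # as in the paper's efficiency evaluations.
    return pow(2, q - 1, q) == 1

def incremental_search(l):
    # Step 1: an l-bit odd candidate with the top bit set
    q = random.getrandbits(l) | (1 << (l - 1)) | 1
    while True:
        # Steps 2-3: sieve by test division against the small prime set
        if all(q % p != 0 for p in SPS_29):
            # Steps 4-5: primality test oracle
            if prime_test(q):
                return q
        # Step 6: next odd candidate
        q += 2
```

The incremental step (q += 2) is what makes the cheap residue-update sieves discussed later possible.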
The sieve procedure helps to exclude all but some fraction α(r) of the searched candidates before they are submitted to the primality test oracle, if we use SPS(r) as the set of small primes. This reduces the average number of calls to the primality test oracle per prime generation to α(r)ln(q)/2. The average of α(r) is given by

α(r) = (1 − 1/3)(1 − 1/5) · · · (1 − 1/p_K(r))

and is approximated by the very simple formula 2e^(−γ)/ln(r) ≈ 1.12/ln(r), based on Mertens' theorem [1]. Table 1 shows the expected and approximated values for α(r). As can be seen from Table 1, 1.12/ln(r) is a good approximation of α(r), especially when r goes above 1000. The expected average number of calls to the primality test oracle using different SPS sets, computed from the estimated rate α(r) of surviving candidates, is shown in Table 2. As can be seen in Table 2, the larger the set SPS(r), the smaller the number of calls to the primality test oracle per prime generation. Naturally, we want SPS(r) to be as big as possible. However, a larger SPS(r) needs more storage space and more processing time for performing the sieve procedure, which penalizes the overall performance. Thus, the goals for optimizing incremental search prime generation algorithms are:

• Finding a sieve procedure implementation that can efficiently process large SPS sets.
• Choosing the optimal SPS set such that the combined cost of performing the probabilistic primality tests and the sieve procedure is minimized.
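The exact product and the Mertens approximation are easy to compare numerically; the sketch below (function names are ours) reproduces the behavior summarized in Table 1:

```python
import math

def survivor_fraction(r):
    """Exact alpha(r): the product of (1 - 1/p) over odd primes p < r."""
    is_prime = [True] * max(r, 4)
    alpha = 1.0
    for p in range(3, r, 2):
        if is_prime[p]:
            alpha *= 1.0 - 1.0 / p
            for m in range(p * p, r, 2 * p):  # cross out odd multiples of p
                is_prime[m] = False
    return alpha

def mertens_approx(r):
    """Mertens approximation 2*e^(-gamma)/ln(r), i.e. about 1.1229/ln(r)."""
    gamma = 0.5772156649015329  # Euler-Mascheroni constant
    return 2.0 * math.exp(-gamma) / math.log(r)
```

For r = 256 the exact product comes out near 0.20 and the approximation near 0.2025, in line with the 20.0% and 20.2% entries of Table 1.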
4. Optimizations on the sieve procedure There have been three different implementations of the sieve procedure for incremental search
Table 1
Values for α(r)

                r = 29   r = 256   r = 512   r = 2560   r = 5120
E(α(r))         31.5%    20.0%     17.9%     14.3%      13.1%
1.1229/ln(r)    33.2%    20.2%     18.0%     14.3%      13.1%

Table 2
Expected average number of calls to the probabilistic primality test

            SPS(29)   SPS(256)   SPS(512)   SPS(2560)   SPS(5120)
512-bit     55.7      35.4       31.7       25.3        23.2
1024-bit    111.5     70.8       63.4       50.6        46.4
Fig. 1. Prime generation algorithm using test division.
Fig. 2. Prime generation algorithm using table look-up.
prime generation: the test-division algorithm (TDA), the table-lookup algorithm (TLU) and the bit array algorithm (BTA). The test-division algorithm is used by many prime generation applications, including RSAREF [20] and OpenSSL. The algorithm directly divides the candidates by the elements in the SPS set to check whether the candidates are composites with small prime factors, as shown in Fig. 1. The sieve procedure shown in Fig. 1 uses modular reduction operations. These operations are very expensive, even for devices that have hardware accelerators to speed up modular reductions, as the candidates are several hundred bits long. Therefore it is desirable to minimize the number of modular reduction operations in the sieve procedure. The modular reduction, shown as w_j^(i) := q^(i) mod p_j in Step 2.a of Fig. 1, can be improved as follows. Since q^(i+1) = q^(i) + 2 and w_j^(i) = q^(i) mod p_j, the residue of q^(i+1) can also be computed as w_j^(i+1) := (w_j^(i) + 2) mod p_j. The table look-up algorithm optimizes the sieve procedure by introducing an array T[1, . . . , K(r)] that caches the residues of the candidate q modulo the small primes, T[i] = q mod p_i, 1 ≤ i ≤ K(r). As a new candidate is generated by adding 2 to an old one when the old one fails the primality test oracle, the new residues T'[i] can be computed as (T[i] + 2) mod p_i, 1 ≤ i ≤ K(r). The upper boundary r of the SPS set is usually less than 2^16, so both the small primes and the residues are less than 2^16. Therefore the computations (T[i] + 2) mod p_i, 1 ≤ i ≤ K(r), are much faster to perform than those in the test-division algorithm. The algorithm just described is shown in Fig. 2. The algorithm shown in Fig. 2 requires a table [w_1, w_2, w_3, . . .] to keep all the residues w_j^(i) from
1. w_j^(i+1) := w_j^(i) + 2;
2. w_j^(i+1) := w_j^(i+1) − p_j;
3. If w_j^(i+1) < 0, w_j^(i+1) := w_j^(i+1) + p_j.

Fig. 3. Algorithm for finding w_j^(i+1).
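A sketch of this residue update in Python (the function name is ours); it computes (w + 2) mod p without a division, following the three steps of Fig. 3:

```python
def update_residue(w, p):
    """Given w = q mod p (0 <= w < p, p an odd prime), return (q + 2) mod p."""
    w = w + 2      # step 1: advance to the next odd candidate
    w = w - p      # step 2: provisionally subtract the modulus
    if w < 0:      # step 3: undo the subtraction if it overshot
        w = w + p
    return w
```

Since w < p before the update, w + 2 can exceed p by at most 1, so one conditional correction suffices.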
the previous iteration. A further improvement in the performance of the algorithm can be attained by calculating w_j^(i+1) := (w_j^(i) + 2) mod p_j as shown in Fig. 3.3 The bit array algorithm defines a bit array A = [a_0 a_1 . . . a_(s−1)], whose bits are initially set to 0 and where a_i represents the candidate q^(i) = q + 2i, 0 ≤ i ≤ s − 1. For each element p in SPS(r), the bits representing the candidates that are divisible by p are marked as 1. To place the marks, the first candidate divisible by p must initially be found within the considered range. The array index representing this first candidate is g(p) = min{i | q^(i) is divisible by p, 0 ≤ i ≤ s − 1}. Then, the bit corresponding to the found candidate is marked as 1, and so is every pth bit of A from that position to the end of the interval. After all the elements in SPS(r) have been considered, a bit in the bit array A is zero if and only if the corresponding candidate is not divisible by any of the elements in SPS(r). The sieve is then concluded, and the prime generation algorithm must only scan the bit array and perform the primality test for each
3 This method will be useful when a developer uses assembly for implementing the table lookup algorithm. This is because, in most assembly languages, there is no single instruction for comparing the value of two numbers.
Fig. 4. The bit array prime generation algorithm.
q^(i) such that a_i is zero. Fig. 4 shows the prime generation algorithm using the bit array. The index g(p) used in the algorithm shown in Fig. 4 can be found using the following formula:4

  g(p) = 0             if t = 0,
  g(p) = (p − t)/2     if t is odd,
  g(p) = (2p − t)/2    if t is even,

where t = q mod p. The bit array algorithm will find a probable prime if there is one in the chosen interval. However, there is the possibility that the interval contains no probable prime. In this case, the algorithm will restart the search from q + 2s (i.e., q^(s)) by letting q^(0) := q^(s) and repeating the same procedure.
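A compact sketch of the sieve phase of the bit array algorithm in Python (names are ours; it assumes q is odd and larger than every sieving prime):

```python
def bit_array_sieve(q, s, small_primes):
    """Sieve the interval q, q+2, ..., q+2(s-1).

    Returns a list A of s flags: A[i] == 1 iff q + 2i is divisible by
    some prime in small_primes, so A[i] == 0 marks a survivor.
    """
    A = [0] * s
    for p in small_primes:
        t = q % p
        if t == 0:
            g = 0
        elif t % 2 == 1:          # t odd
            g = (p - t) // 2
        else:                     # t even, t != 0
            g = (2 * p - t) // 2
        for i in range(g, s, p):  # mark every p-th candidate from g(p)
            A[i] = 1
    return A
```

Survivors (A[i] == 0) are then passed, in order, to the primality test oracle.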
5. Performance evaluations

5.1. Sieve performance and prime generation time

Algorithms using the three different sieve methods were implemented on the HP iPAQ 5500 pocket PC. We measured the performance overheads of the three sieve procedures using different upper boundaries for the SPS sets. The results are shown in Fig. 5, where the x-axis represents the
4 Assuming t = q mod p is odd, it is easy to verify that (p − t)/2 is an integer less than p. Then q^(g(p)) := q^(0) + 2(p − t)/2 = q + p − t. Since t ≡ q mod p, we have q − t ≡ 0 mod p, and hence q + p − t ≡ 0 mod p. Hence, (p − t)/2 is the smallest index i such that p | q^(i). The proofs for t = 0 and t even are similar.
upper boundaries for the SPS sets and the y-axis represents the overheads in seconds. From Fig. 5, we can see clearly that the test-division algorithm is much slower than the other two algorithms. The bit array and table lookup algorithms have comparable performance. However, the performance of the sieve procedure for the bit array algorithm becomes increasingly better than that of the table lookup algorithm as the SPS set gets larger. The three algorithms were also implemented in a smart card simulator, simulating the Infineon SLE66CX160S microprocessor running at 3.57 MHz. The performances of the sieve procedures are shown in Table 3. From Table 3 it can be seen that the bit array algorithm is clearly more efficient than the other two algorithms in performing the sieve procedures. Because the SLE66CX160S microprocessor has only 1 Kbyte of RAM, we were not able to create a residue table as required by SPS(2560) and SPS(5120) in the smart card (the reason will be given in the next section). Additionally, the implementation of the test-division sieve algorithm takes an excessively long time to perform the sieve procedure with SPS(2560) and SPS(5120). Therefore, we don't have performance numbers for those cases. It can be concluded from the simulations on the PDA and the smart card that the bit array algorithm is the most efficient among the three algorithms examined for implementation on portable computing devices. Finally, the entire prime generation algorithms were implemented on the HP iPAQ 5500 pocket PC and the SLE66CX160S smart card simulator. For each implementation, we chose different upper boundaries for the SPS sets and generated 500 primes for each SPS set. The performances are shown in Fig. 6 and Table 4. In Fig. 6, the x-axis represents the upper boundaries for the SPS sets and the y-axis represents the time in seconds.
The performance results clearly show the advantage of the bit array algorithm over the other two. As also can be seen from Fig. 6 and Table 4, using large SPS sets will initially reduce the time for prime generations. However, when the SPS sets
Fig. 5. Sieve performance for three algorithms (512-bit and 1024-bit panels; curves TDA, TLU, BTA; x-axis: upper boundary of the SPS set; y-axis: overhead in seconds).
Table 3
Performance of sieve procedures in the smart card using different algorithms

                 512-bit (s)   1024-bit (s)
SPS(256)
  TDA            2.00          20.00
  TLU            0.20          0.80
  BTA            0.11          0.17
SPS(512)
  TDA            2.54          30.37
  TLU            0.47          1.72
  BTA            0.15          0.26
SPS(2560)
  TDA            5.88          N/A^a
  TLU            N/A^b         N/A^b
  BTA            0.37          0.84
SPS(5120)
  TDA            9.81          N/A^a
  TLU            N/A^b         N/A^b
  BTA            0.64          1.55
Table 4
Performance of the prime generations in the smart card

                      512-bit (s)   1024-bit (s)
TDA with SPS(256)     9.29          89.54
TLU with SPS(256)     5.68          59.84
BTA with SPS(256)     5.38          58.58
BTA with SPS(512)     4.74          51.80
BTA with SPS(2560)    4.33          44.76
BTA with SPS(5120)    4.43          44.32
are excessively increased, the benefit of increasing SPS sets will become very small. In fact, after some point it will penalize the prime generations due to the overhead introduced by the sieve procedure. Therefore, to achieve the best efficiencies for prime generations, it is important that we choose proper prime generation algorithms as well as proper SPS sets. The discussion on how to choose the optimal SPS sets will be given in the next section.
Notes for Table 3:
a Excessively large overhead for the sieve procedure.
b Not enough memory for creating a residue table.
Fig. 6. Performances of the prime generations in the PDA (512-bit and 1024-bit panels; curves TDA, TLU, BTA; x-axis: upper boundary of the SPS set; y-axis: time in seconds).
5.2. Memory consumption

Every prime generation algorithm requires additional memory space to store the SPS sets. Fig. 7 shows the sizes of the SPS sets (y-axis) with respect to their upper boundaries (x-axis). In most cases, the memory requirement for storing the SPS sets is not an issue, since the SPS sets are pre-computed and can be stored in read-only memory. A different issue when using the table lookup algorithm is that it requires read/write memory of the same size as the SPS set for storing the residues. If one wants to use big SPS sets, such as SPS(2560) or larger, several Kbytes of read/write memory are needed for the residue tables. The read/write memory requirement needs to be factored in with care, as some portable devices possess very limited read/write memory space. Although EEPROM could also be used for the table, the side effect would be excessively higher write access times. In the SLE66CX160S microcontroller, updating one byte in RAM costs only a few microseconds, while the same operation costs 3.62 ms in EEPROM. Comparatively, the bit array algorithm has a much lower read/write memory requirement than the table lookup algorithm. The memory space required by the bit array algorithm is independent of the size of the SPS set and depends only on the number of candidates in the search interval. For the bit array algorithm implementations, the memory
size can be chosen to be a small fixed value, e.g., the 192 bytes of RAM used in this paper. The discussion on how to choose an appropriate RAM size for the bit array algorithm will be given in a later section.
6. Optimal SPS sets

As already discussed, it is important to choose a proper set of small primes for the sieve procedure in prime generation. Initially, increasing the size of the SPS set will substantially improve the performance of prime generation because it reduces the number of calls to the primality test oracle. However, the improvement becomes less and less significant as the size of the SPS set grows. After some point, the increase in sieve procedure overhead will exceed the improvement obtained by reducing the probabilistic primality tests, and from that point on the overall performance starts to degrade. Hence, an important goal when optimizing an incremental search prime generation algorithm is to find the optimal SPS set for the sieve procedure. This section discusses how to choose optimal SPS sets for the bit array algorithm, which the previous analysis concluded to be the most efficient prime generation algorithm. Some notation that is going to be used later is given below.

D_B   The sieve processing overhead for one small prime p
R     The primality test oracle processing overhead
s     The number of candidates in the search interval
Fig. 7. The sizes of the SPS sets corresponding to the upper boundaries.
D_B is the average time for processing one small prime p in the sieve implementation of the bit array algorithm. The sieve process for a small prime p is composed of computing the candidate modulo p, computing the index g(p), and marking every pth bit in the bit array. The overhead due to marking is usually negligible. This is because (1) when s is 1736, as used in this paper, the marking procedure takes ⌊1736/p⌋ iterations; as most small
primes are greater than 100, the number of iterations for marking will be small, not more than 20. (2) Marking a bit in the array only costs a few cycles. Therefore, we rather consider DB to be the overheads of the modular reduction and the computing of g(p). The optimal size for the SPS set used by the bit array algorithm is reached when the cost incurred by adding one small prime p into the small prime set is greater than that it can reduce. By adding a new element p, the overhead in the sieve procedure will be increased by DB. Meanwhile, it could detect a(p)ln(q)/2p more composite numbers per prime generation than before and save the time of a(p)ln(q)*R/2p, where R is the primality test oracle processing overhead and q is the first candidate in the search interval. Hence, the upper bound of the optimal SPS set is the largest value p such that DB < aðpÞ lnðqÞ R=2p
ð1Þ
Since ln(q) can be approximated by l*ln 2 when q is an l bit number, and a(p) can be approximated by 1.12/ln(p), the approximation of the optimal value p can be obtained by solving a much simpler inequality: p lnðpÞ < 1:12l lnð2ÞR=2DB
ð2Þ
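Formula (2) has no closed-form solution for p, but the largest admissible integer is easy to find numerically. The sketch below (function and parameter names are ours) reproduces the Formula (2) column of Table 5 to within rounding when fed the measured DB and R values.

```python
import math

def optimal_sps_bound(l, R, dB):
    """Largest integer p with p*ln(p) < 1.12 * l * ln(2) * R / (2*dB).
    l: bit length of the prime to generate; R: oracle time per call;
    dB: sieve time per small prime (same time units as R)."""
    target = 1.12 * l * math.log(2) * R / (2.0 * dB)
    lo, hi = 2, 4
    while hi * math.log(hi) < target:       # bracket the crossing point by doubling
        lo, hi = hi, hi * 2
    while hi - lo > 1:                      # integer bisection on the bracket
        mid = (lo + hi) // 2
        if mid * math.log(mid) < target:
            lo = mid
        else:
            hi = mid
    return lo
```

With the Table 5 measurements (DB = 1.6 ms, R = 116.5 ms at 512 bits), this yields a bound of about 1914, matching the table.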
From Formula (2) it can be seen that the optimal SPS set depends on three factors: the bit length l of the large primes to be generated, the processing time DB for one small prime, and the time R for one call to the primality test oracle. The bit length is a known parameter of a prime generation, while the values of DB and R are quite consistent and can be obtained by averaging the processing overheads of the corresponding steps over some small primes. Based on these three values, we are able to obtain the optimal small prime set by applying Formula (2).

We use two different approaches to obtain the optimal SPS sets for achieving the best performance in prime generation. In the first approach, we choose different SPS sets and measure their prime generation times using our bit array algorithm implementation on the smart card simulator, then compare the performances to find the optimal SPS set. To do this, we generate 500 primes for each SPS set, where the SPS sets are chosen to be SPS(27), SPS(256), SPS(512), SPS(2048), SPS(6144), SPS(10,240) and SPS(20,480). We found that SPS(2048) is the optimal SPS set for 512-bit prime generation, SPS(6144) is the optimal SPS set for 768-bit prime generation, and SPS(10,240) is the optimal SPS set for 1024-bit prime generation. In the second approach, we obtain the parameters DB and R by measuring and averaging the corresponding overheads for 100 small prime numbers, and then compute the optimal SPS set using Formula (2). The results are listed in Table 5 and the prime generation times are given in Fig. 8.

Table 5
The optimal SPS sets using different approaches

                                        512-bit      768-bit      1024-bit
DB (ms)                                 1.6          2.9          3.6
R (ms)                                  116.5        500.0        866.3
Optimal SPS set by exhaustive search    SPS(2048)    SPS(6144)    SPS(10,240)
Optimal SPS set by Formula (2)          SPS(1914)    SPS(5917)    SPS(10,347)

Although both approaches can find optimal SPS sets for prime generation, using Formula (2) is much more favorable than the exhaustive search. With exhaustive search, one may need to measure the prime generation performance for a large number of SPS sets, for instance the sets from 1024 to 20,480 at some increment, e.g., 1024, 2048, etc.5 Consequently, exhaustively searching for the optimal SPS set can be very time consuming. Meanwhile, using Formula (2) only requires measuring the times of the sieve procedure and the primality test, which are fairly easy to collect.

5 A big increment reduces the overhead of the exhaustive search, but may make the SPS set that is found less optimal.

Fig. 8 shows that using optimal SPS sets for prime generation is important. As can be seen, optimal SPS sets yield as much as a 100% performance improvement over an implementation using SPS(27)6 for 512-bit prime generation, and almost a 200% improvement for 1024-bit prime generation. Compared with SPS(512), the optimal SPS set gives more than a 25% improvement for 1024-bit prime generation. Due to limitations of the available smart card hardware, the bit array algorithm was not implemented or tested for 2048-bit prime generation.

6 This SPS set is the one used in RSAREF.

Fig. 8. Prime generation times (s) on the smart card using different SPS sets:

                                        512-bit    768-bit    1024-bit
Optimal SPS set by exhaustive search    4.32       19.6       40.32
Optimal SPS set by Formula (2)          4.33       19.6       40.33
SPS(27)                                 8.54       35.07      115.10
SPS(256)                                5.38       25.86      58.58
SPS(512)                                4.74       23.34      51.80

The bit array implementation chooses the search interval to be 1736 consecutive candidates. This choice was made so that the probability that the bit array algorithm fails to find a probable prime in the search interval is very small when generating up to 1024-bit primes. According to [5], if d(n) is the distance from n to the next larger prime and n is chosen uniformly from 1 to x, then the probability that d(n) > k·log(x) is e^(-k). Therefore, the probability of failing to find an l-bit probable prime in [q, q + 2s] is e^(-2s/(l·ln 2)). If s = 1736 and l = 512, 768 and 1024, the probabilities that the incremental search fails are 1.7 × 10^(-4), 3.1 × 10^(-3) and 10^(-2), respectively. In case the search fails, the prime generation chooses a new odd random number and starts again.
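The failure estimate above is a one-line computation; the sketch below (our code) evaluates it for the three bit lengths used in the paper. Exact values depend on rounding conventions and may differ slightly from the figures quoted above.

```python
import math

def search_failure_prob(s, l):
    """Probability that the incremental search finds no probable prime
    among the s candidates in [q, q + 2s] for an l-bit starting point q,
    using the e^(-2s/(l*ln 2)) estimate derived in the text."""
    return math.exp(-2.0 * s / (l * math.log(2)))

for l in (512, 768, 1024):
    print(l, search_failure_prob(1736, l))
```

As expected, the failure probability grows with the bit length, since primes thin out as the candidates get larger while the interval length stays fixed.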
7. Efficiency of constructive prime generation

As mentioned before, a large prime can also be generated using the constructive method [10,11], given in Fig. 9. The algorithm generates candidates that are coprime to g, the product of all the elements in SPS(r), thus increasing the probability that the candidates are prime.

Fig. 9. Constructive prime generation algorithm.

Despite several obvious differences from incremental search, the constructive prime generation algorithm needs to use the primality test oracle, just as incremental search does, to examine the primality of the candidates. Therefore, its efficiency is also determined by the number of calls to the primality test oracle per prime generation, just as for the algorithms already discussed in this paper. In the following, we study the efficiency of the constructive prime generation algorithm in more detail.

Fig. 10. Constructive prime generation vs. incremental search prime generation on the average number of calls to the primality test oracle.

In constructive prime generation, the candidates submitted to the primality test oracle are coprime to all the elements in SPS(r), since SPS(r) is used to compute the parameter g. This is similar to the bit array algorithm, where an SPS(r) is used to ensure that the candidates submitted to the primality test oracle are coprime to all the elements in SPS(r). Based on this observation, we conjecture that the average numbers of calls to the primality test oracle in these two algorithms are the same when the same SPS set is used, and that their prime generation efficiencies are therefore similar. An experiment was conducted whose outcome supports this conjecture. The experiment performs 10,000 prime generations of different bit lengths on a personal computer to estimate the average number of calls
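The key property of the constructive method, that every candidate is coprime to g by construction, can be illustrated with a simplified sketch. This is not the Joye-Paillier algorithm itself (which updates units modulo g more cleverly); it only shows why no candidate of the form a + t*g with gcd(a, g) = 1 can have a small prime factor. Names are ours.

```python
import math
import random

def constructive_candidate(l, sps):
    """Return an l-bit candidate coprime to every prime in sps (sps should
    include 2, so the candidate comes out odd). Since gcd(a + t*g, g) =
    gcd(a, g) = 1, no prime dividing g can divide the candidate.
    A simplified sketch of the constructive idea only."""
    g = math.prod(sps)
    assert g.bit_length() < l - 1, "product of small primes too large for l-bit primes"
    a = random.randrange(1, g)
    while math.gcd(a, g) != 1:               # draw a random unit modulo g
        a = random.randrange(1, g)
    lo = ((1 << (l - 1)) - a + g - 1) // g   # smallest t giving an l-bit value
    hi = ((1 << l) - 1 - a) // g             # largest t giving an l-bit value
    t = random.randrange(lo, hi + 1)
    return a + t * g
```

In the real algorithm each such candidate goes to the primality test oracle, exactly as an unmarked candidate does in the bit array algorithm, which is what motivates the conjecture above.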
to the probabilistic primality test when the constructive prime generation algorithm is used. The numbers of calls were recorded and averaged over the runs, and the average number of calls to the primality test oracle of the constructive algorithm was then compared with that of the bit array algorithm. The results are shown in Fig. 10.

In the left graph of Fig. 10, the SPS sets used by the bit array and constructive prime generation algorithms are the same, chosen as the optimal SPS sets for the constructive algorithm. For example, when SPS(373) is used to compute the g value for a 512-bit constructive prime generation, SPS(373) is also used for the corresponding 512-bit prime generation with the bit array algorithm; when SPS(719) is used for a 1024-bit constructive prime generation, SPS(719) is also used for the corresponding 1024-bit bit array prime generation. The outcome of the experiment supports the conjecture. The right graph of Fig. 10 shows the constructive algorithm using optimal SPS sets and the bit array algorithm using different SPS sets. It is clear that the bit array algorithm easily outperforms the constructive prime generation algorithm as larger SPS sets are chosen, since larger sets further reduce the number of calls to the primality test oracle. The constructive method cannot grow its SPS set beyond some point, since g is calculated as the product of all the primes in the SPS set, and the bit length of g must not exceed the bit length of the prime to be generated.

The bit array and constructive algorithms were then implemented on the smart card simulator for the SLE66CX160S microprocessor, and prime generations were performed with both algorithms, generating 500 primes at each of the 512-bit, 768-bit and 1024-bit sizes. The results of this experiment are shown in Table 6. It is easy to see that the bit array algorithm is more efficient than the constructive method.

Table 6
Timings for prime generation using constructive method vs. bit array algorithm

                                 512-bit    768-bit    1024-bit
Constr (s)                       5.30       25.54      59.16
BitArray (optimal SPS set) (s)   4.32       19.6       40.32
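The bit-length ceiling on g can be made concrete: since the product of all primes up to x is roughly e^x, the largest small prime the constructive method can use for l-bit primes is about l·ln 2. The sketch below (function name ours) computes the exact cutoff, which lands near the SPS(373) and SPS(719) values used above for the 512-bit and 1024-bit constructive runs.

```python
def max_constructive_sps(l):
    """Largest prime that can still be included in the constructive
    method's SPS set for l-bit primes: the product g of all primes up to
    the returned value has fewer than l bits, and adding the next prime
    would push g to l bits or more."""
    primes, g, last, n = [], 1, 0, 2
    while True:
        if all(n % p for p in primes):        # n is prime
            if (g * n).bit_length() >= l:
                return last                    # adding n would overflow l bits
            primes.append(n)
            g *= n
            last = n
        n += 1
```

The incremental search algorithms have no such ceiling, which is why they can keep profiting from larger SPS sets where the constructive method cannot.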
8. Generation of special primes

Some public key cryptosystems require primes of special forms instead of, or in addition to, random primes. The special primes needed by public key cryptosystems include DSA primes, strong primes and safe primes. In this section we therefore modify the bit array algorithm to generate those special primes with good efficiency. We use a generalized form of incremental search with an incremental value w other than 2, so that the sequence of candidates to be searched is q, q + w, q + 2w, ..., q + sw. Based on this generalized form of incremental search, DSA primes and strong primes can be generated as shown in Figs. 11 and 12. In the last step of both algorithms, a generalized bit array algorithm is used for the incremental search as follows. (1) An array A = [a_0 a_1 ... a_(s-1)], similar to that in the basic bit array algorithm, is created. (2) The index g(p) of the first candidate in the array divisible by a small prime p is computed as g(p) = (-w^(p-2) mod p)·q mod p, since q + g(p)·w ≡ 0 (mod p) and w^(-1) ≡ w^(p-2) (mod p). (3) The bit corresponding to that candidate, and every pth bit after it until the end of the interval, is marked as 1. (4) The primality tests are performed only for those candidates whose corresponding bits are zero.

Fig. 11. Incremental search for DSA prime.

Fig. 12. Incremental search for strong prime.

The algorithm for generating safe primes, illustrated in Fig. 13, is slightly different from the previous algorithms. For each small prime p, the index of the first candidate divisible by p with w = 4 is computed and denoted g1(p); the bit corresponding to that candidate, and every pth bit after it until the end of the interval, is marked as 1. Additionally, the index of the first candidate divisible by p with w = 2 is computed and denoted g2(p); again, the corresponding bit and every pth bit after it are marked as 1. After all the elements in SPS(r) have been considered, the prime generation algorithm only needs to scan the bit array and test the primality of the candidates whose corresponding bits are zero.

Fig. 13. The bit array algorithm for generating safe primes.

We implemented the DSA and strong prime generation algorithms on the SLE66CX160S smart card simulator to evaluate their performance. There are minor increases in DB, the sieve processing time per small prime, since the implementations differ slightly from that of random prime generation, while the time for one primality test oracle call is the same as in random prime generation. Based on the measured sieve processing and primality testing times (see Table 7), the optimal SPS sets were computed using Formula (2) and used in the prime generations. Finally, the times for generating DSA and strong primes of various bit lengths were obtained and are shown in Table 8, in seconds. As can be seen, the performance of generating special primes is close to that of generating random primes. The generation of safe primes was not implemented on the smart cards because it is extremely expensive: with the limited computing power of smart cards and PDAs, the generation of large safe primes, e.g., 1024-bit, could take hours even with the optimized algorithms.
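The generalized index computation of step (2) and the double sieve for safe primes can be sketched together. This is an illustrative Python model (names are ours), taking the candidate step to be w as in steps (1)-(4) above; it checks each candidate n and its companion (n-1)/2 against the small primes, which is what makes a position safe-prime-eligible.

```python
def generalized_index(q, w, p):
    """Index g(p) of the first candidate in q, q+w, q+2w, ... divisible by
    the odd small prime p (p must not divide w). Uses
    q + g(p)*w = 0 (mod p) and w^(-1) = w^(p-2) mod p (Fermat)."""
    return (-q * pow(w, p - 2, p)) % p

def safe_prime_sieve(n0, s, sps):
    """Bit array for a safe prime search over n0, n0+4, ..., n0+4(s-1),
    with n0 = 3 (mod 4) so the companions m = (n-1)/2 stay odd and step
    by 2. A bit is set when either n or m has a factor in sps (odd primes
    only); unmarked positions would go to the primality oracle twice."""
    assert n0 % 4 == 3
    m0 = (n0 - 1) // 2
    bits = [0] * s
    for p in sps:
        for g in (generalized_index(n0, 4, p),    # positions where p | n
                  generalized_index(m0, 2, p)):   # positions where p | (n-1)/2
            for i in range(g, s, p):
                bits[i] = 1
    return bits
```

The double marking explains why safe prime generation is so much more expensive than the other cases: each small prime removes candidates twice, yet each surviving candidate still needs two oracle calls.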
Table 8
The timings (s) for generating primes using the optimal SPS sets

               512-bit    768-bit    1024-bit
Random prime   4.32       19.60      40.23
DSA prime      5.53       20.70      41.50
Strong prime   6.73       22.00      42.71
Table 7
The timings for processing one small prime in the sieve procedures for generating DSA and strong primes, and the optimal SPS sets

           DSA prime generation                Strong prime generation
           DB (ms)   Upper bound by F. (2)     DB (ms)   Upper bound by F. (2)
512-bit    1.8       1725                      1.9       1645
768-bit    3.3       5270                      3.5       5000
1024-bit   4.1       9201                      4.4       8634
9. Conclusion and future work
This paper compares the efficiency of implementations of various prime generation algorithms on small portable devices, and shows that the bit array algorithm is the most efficient algorithm for generating probable primes. The paper further optimizes the bit array algorithm by discussing how to choose an important parameter, the optimal SPS set. The implementations show that the performance of prime generation can be substantially improved by using optimal SPS sets. Besides random primes, some special primes have important uses in public key cryptosystems. The paper develops generalized bit array algorithms that are able to find primes with special constraints, i.e., DSA, strong, and safe primes. The algorithms were implemented in a smart card simulator for validation, showing that there is very little efficiency loss for generating DSA primes and strong primes with respect to generating random primes.

Primality testing is the most time consuming part of prime generation and the part that most affects the efficiency of the overall process. An important open problem is to develop primality testing algorithms that are faster than those currently available. All the existing prime generation algorithms are optimized either by filtering out candidates that have small prime factors or by constructing candidates that are coprime to a set of small primes, so the size of the small prime set that a prime generation algorithm can handle efficiently directly affects its performance. The bit array algorithm can only handle small primes up to about 20,000. Future work will be to develop an efficient sieve procedure that can handle much larger sets of small primes, on the order of 10^8.
References

[1] E. Bach, J. Shallit, Algorithmic Number Theory, Vol. I: Efficient Algorithms, Foundations of Computing, MIT Press, Cambridge, MA, 1996.
[2] W. Bosma, M.P. van der Hulst, Primality proving with cyclotomy, Doctoral Dissertation, University of Amsterdam, 1990.
[3] D.M. Bressoud, Factorization and Primality Testing, Springer-Verlag, New York, 1989.
[4] J. Brandt, I. Damgard, P. Landrock, Speeding up prime number generation, in: Proceedings of ASIACRYPT '91, pp. 440-449.
[5] J. Brandt, I. Damgard, On generation of probable primes by incremental search, in: Proceedings of CRYPTO '92, pp. 358-369.
[6] W. Diffie, M. Hellman, Multiuser cryptographic techniques, in: Proceedings of the AFIPS National Computer Conference, 1976, pp. 109-112.
[7] W. Diffie, M.E. Hellman, New directions in cryptography, IEEE Transactions on Information Theory 22 (1976) 644-654.
[8] P.X. Gallagher, On the distribution of primes in short intervals, Mathematika 23 (1976) 4-9.
[9] G.H. Hardy, J.E. Littlewood, Some problems of 'Partitio Numerorum'; on the expression of a number as a sum of primes, Acta Mathematica 44 (1922) 1-70.
[10] M. Joye, P. Paillier, S. Vaudenay, Efficient generation of prime numbers, in: CHES 2000, pp. 340-354.
[11] M. Joye, P. Paillier, Constructive methods for the generation of prime numbers, in: Proceedings of the 2nd Open NESSIE Workshop, Egham, UK.
[12] D.E. Knuth, The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, 3rd ed., Addison-Wesley, Reading, MA, 1997.
[13] C.H. Lu, A. Dos Santos, F.R. Pimentel, Implementation of fast RSA key generation inside smart cards, in: Proceedings of the 17th ACM Symposium on Applied Computing, 2002, p. 214.
[14] F. Morain, Implementation of the Goldwasser-Kilian-Atkin primality testing algorithm, Mathematics of Computation 54 (1990) 839-854.
[15] NIST, US Department of Commerce, Digital Signature Standard, FIPS PUB 186, May 1994.
[16] P. Paillier, Low-cost double-size modular exponentiation or how to stretch your cryptoprocessor, in: Public Key Cryptography, 1999.
[17] R. Rivest, A. Shamir, L. Adleman, A method for obtaining digital signatures and public key cryptosystems, Communications of the ACM 21 (2) (1978) 158-164.
[18] R.D. Silverman, Fast generation of random, strong RSA primes, CryptoBytes 3 (1), RSA Laboratories, Spring 1997, pp. 9-13.
[19] W. Stallings, Cryptography and Network Security: Principles and Practice, 2nd ed., Prentice-Hall, New Jersey, 1999.
Acknowledgement The authors are grateful to the anonymous reviewers for their valuable comments. The research was partly supported by Microsoft.
[20] RSAREF crypto library. Available from .
Chenghuai Lu is a Ph.D. student in the College of Computing at Georgia Tech. His research interests include optimizing cryptographic functions on tamper resistant devices and side-channel cryptographic analysis.
Andre L.M. Dos Santos is an assistant professor in the College of Computing at Georgia Tech. His research interests include distributed systems security, security of systems using tamper resistant devices, and vulnerability analysis. He received a Ph.D. in computer science from University of California, Santa Barbara.