
IEEE TRANSACTIONS ON COMPUTERS, VOL. 38, NO. 9, SEPTEMBER 1989

Correspondence

Systolic Gaussian Elimination over GF(p) with Partial Pivoting

BERTRAND HOCHET, PATRICE QUINTON, AND YVES ROBERT

Manuscript received August 10, 1986; revised June 1, 1987. This work was supported by the Coordinated Research Program C3 of CNRS and the Ministère de la Recherche et de la Technologie. B. Hochet is with CNRS, Laboratoire TIM3, INPG, Grenoble Cedex, France. P. Quinton is with CNRS, IRISA, 35042 Rennes Cedex, France. Y. Robert is with CNRS, Laboratoire TIM3, and IBM ECSEC, 00147 Roma, Italy. IEEE Log Number 8824547.

Abstract: We propose a systolic architecture for the triangularization, via the Gauss elimination algorithm, of large dense n × n matrices over GF(p), where p is a prime number. The solution of large dense linear systems over GF(p) is the major computational step in various algorithms issued from arithmetic number theory and computer algebra. The proposed architecture implements the elimination with partial pivoting, although the operation of the array remains purely systolic. Extension of the array to the complete solution of a linear system Ax = b over GF(p) is also considered.

Index Terms: Finite fields, Gaussian triangularization, partial pivoting, solution of linear systems, systolic arrays.

I. INTRODUCTION

The solution of large dense linear systems of algebraic equations in finite fields appears to be the computational kernel of many important algorithms. Let us mention three representative applications:

- the large integer factoring routines, where the final stage necessitates performing Gaussian elimination on a large dense matrix over GF(2) [16]-[18];
- the factorization of polynomials over GF(p), where p is a prime number, via Berlekamp's algorithm [14];
- linear prediction in the analysis of discrete signals [15].

In some of these applications, matrices of several thousands of rows and columns are required, and there is a large number of matrices to triangularize. However, Gaussian elimination requires an amount of arithmetic operations proportional to the cube of the order of the matrix, and this leads to serious limitations both on the size of the problems that can be dealt with and on the speed of the solution, when implemented on sequential computers.

Parallel processing, therefore, offers an interesting perspective for such computations. Indeed, Parkinson and Wunderlich report in [17] the implementation of Gaussian elimination over GF(2) on the ICL DAP (which has 4096 processors), and Poet [18] considers the use of special-purpose hardware to factor 140-digit numbers using the same algorithm.

In this paper, we introduce a systolic architecture for the Gauss triangularization algorithm of general dense matrices over a finite field GF(p), where p is prime. Systolic arrays have been introduced by Kung and Leiserson [13]; they consist of a large number of elementary processors (or cells) which are mesh interconnected in a regular and modular way, and which achieve high performance through extensive concurrent and pipelined use of the cells. Such architectures have now proven to be efficient for the solution of compute-bound problems, and we refer the reader to [2] and [11] for a general presentation of the systolic model.

Systolic arrays for the Gaussian triangularization of dense real matrices are well known (see [1] and [7] among others). However, partial pivoting has never been implemented in such arrays, because the search for a pivot in a whole row or column would break down the locality and the regularity of the systolic design. Hence, systolic arrays for Gaussian elimination over R only apply to symmetric positive definite or diagonally dominant matrices. To triangularize general matrices, one must use an orthogonal factorization via Givens rotations (see the systolic implementations of [1], [4], [7], and [21], among others). Another alternative consists of including local exchanges of rows in the Gaussian elimination algorithm, as suggested by Gentleman and Kung [7]: such "pairwise pivoting" techniques have been proven to be numerically stable for a large class of matrices by Sorensen.

In a finite field GF(p), partial pivoting cannot be avoided, since on average one element out of p is zero. For instance, pairwise pivoting would break down as soon as, in a given column, both the diagonal coefficient and the next one happen to be zero. But it turns out that partial pivoting can be implemented much more efficiently than over R, since any nonzero element in a given row or column can be chosen as the pivot: there is no need to scan the entire column, as the search can stop as soon as a nonzero element has been found. This simple consideration will be very important in our implementation.

Section II is devoted to the design of a systolic array for the Gaussian elimination over GF(2) of an n × (n + 1) matrix (A, b). The geometry of this array is similar to that of the array of Gentleman and Kung [7], but the operation of the processors is different, in order to manage the additional pivoting features. In Section III, we discuss two extensions of this array: the on-the-fly solution of the triangular system that remains to be solved, and the computation in GF(p) when p is a general prime number.

II. GAUSSIAN ELIMINATION OVER GF(2)

Let A be a dense n × n matrix and b be an n-vector. To solve the system (1) Ax = b, we transform it into an equivalent triangular system (2) Tx = b'. The transformation of system (1) into system (2) is done by triangularizing the matrix A, using Gaussian elimination with partial pivoting.

We briefly recall the operation of Gentleman and Kung's two-dimensional triangular array of orthogonally connected processors, which triangularizes a real dense n × (n + 1) matrix (A, b) without pivoting. This array is depicted in Fig. 1. The total number of cells in the array is n[(n + 1)/2 + 1]. The array is composed of n rows, each row k including n + 2 - k processors numbered from left to right P_{k,1}, ..., P_{k,n+2-k}. The matrix (A, b) is fed into the array column by column. More specifically, column j of the matrix is input to processor P_{1,j}, one new element each time step, beginning at time t = j. This input format is depicted in Fig. 1.

The first row of (A, b) is stored in the upper row of the array and, as any row numbered i > 1 is read by the array, it is combined with the first one in order to zero out element a_{i1}; this combination corresponds to a transformation of the form

    (a_{1j}, a_{ij})^T := M_{i1} (a_{1j}, a_{ij})^T,   j = 1, ..., n + 1.


The 2 × 2 matrix M_{i1} is computed by P_{11} at time t = i and sent to the other cells of the first row; more precisely, cell P_{1,k} (k ≥ 2) performs the transformation (a_{1k}, a_{ik})^T := M_{i1} (a_{1k}, a_{ik})^T at time t = i + k - 1.
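To make the input format concrete, the following short Python sketch (ours, not part of the original correspondence) tabulates the schedule implied by the description above: element a_{ij} of the n × (n + 1) matrix (A, b) enters processor P_{1,j} at time t = i + j - 1, so each column enters one cell, skewed by one time step per column.

# A sketch (our illustration) of the skewed input schedule of the
# Gentleman-Kung style array: column j of the n x (n+1) matrix (A, b)
# enters processor P_{1,j} starting at time t = j, one element per step,
# so element a_{ij} reaches the top row at time t = i + j - 1.

def input_schedule(n):
    """Map each time step t to the (i, j) elements entering the top row."""
    schedule = {}
    for i in range(1, n + 1):          # row index of (A, b)
        for j in range(1, n + 2):      # column index (column n+1 holds b)
            schedule.setdefault(i + j - 1, []).append((i, j))
    return schedule

sched = input_schedule(4)
for t in sorted(sched):
    print(t, sched[t])
# For n = 4 the array itself has n * ((n + 1) / 2 + 1) = 14 cells.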



Let A^{(k)} = (a_{ij}^{(k)}) denote the matrix obtained from (A, b) after the elimination of the elements at positions (i, j) such that i > j, j = 1, ..., k - 1. Then (a_{kk}^{(k)}, ..., a_{kn}^{(k)}, b_k^{(k)}) is stored in the kth row of the array. When (a_{ik}^{(k)}, ..., a_{in}^{(k)}, b_i^{(k)}), i > k, is read by this row of cells, it is combined with (a_{kk}^{(k)}, ..., a_{kn}^{(k)}, b_k^{(k)}) using a 2 × 2 matrix M_{ik}, in order to set a_{ik}^{(k)} to zero.

A. General Description of the Array

The key idea to include partial pivoting in the algorithm is to generate matrices M_{ik} whose structure depends on the values of a_{kk}^{(k)} and a_{ik}^{(k)}. For the sake of simplicity, let us consider the case k = 1, i = 2, that is, when the first matrix M_{21} is generated by processor P_{11}. Element a_{11} = a_{11}^{(1)} is stored in P_{11}, and a_{21} = a_{21}^{(1)} is being input to P_{11}. Two cases must be considered:

a) If a_{11} = 1, row 1 will be chosen as the pivoting row, and the elimination of a_{21}, a_{31}, ..., a_{n1} will be done using usual Gaussian elimination. In our example,

  a1) if a_{21} = 0, there is nothing to do, and M_{21} is equal to I_2, the identity matrix of order 2;
  a2) if a_{21} = 1, we add row 1 to row 2 to zero out a_{21}, that is,

      M_{21} = | 1 0 |
               | 1 1 | .

Note that addition here denotes the addition in GF(2), i.e., the exclusive-or operation XOR.

b) If a_{11} = 0, we have to find a pivoting row. Two cases occur:

  b1) if a_{21} = 0, we do nothing, and choose M_{21} = I_2;
  b2) if a_{21} = 1, we exchange rows 1 and 2, and M_{21} is the permutation matrix

      M_{21} = | 0 1 |
               | 1 0 | .
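The following small Python check (our illustration; the function names are ours) enumerates the four cases above and verifies that each chosen M_{21} zeroes out the sub-diagonal entry over GF(2).

# A minimal check (not from the paper) of the four cases above: over
# GF(2), each chosen M_21 maps (a11, a21)^T to a vector whose second
# entry is zero whenever at least one of the two elements is nonzero.

I2   = [[1, 0], [0, 1]]   # cases a1) and b1): nothing to do
ADD  = [[1, 0], [1, 1]]   # case a2): add row 1 to row 2 (XOR)
PERM = [[0, 1], [1, 0]]   # case b2): exchange rows 1 and 2

def choose_m21(a11, a21):
    if a11 == 1:
        return ADD if a21 == 1 else I2
    return PERM if a21 == 1 else I2

def apply_gf2(m, v):
    return [(m[0][0] * v[0] + m[0][1] * v[1]) % 2,
            (m[1][0] * v[0] + m[1][1] * v[1]) % 2]

for a11 in (0, 1):
    for a21 in (0, 1):
        top, bottom = apply_gf2(choose_m21(a11, a21), [a11, a21])
        assert bottom == 0    # the sub-diagonal entry is always eliminated
        print((a11, a21), "->", (top, bottom))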

Row 2 is then stored in the first row of the array and acts as the pivoting row for phase 1, which consists in zeroing out all the elements but one of the first column of the matrix (A, b). Note that in case b1), row 2 will not be modified during the first phase of the algorithm. Assume that a_{i1}, i > 2, is the first nonzero element that we find: row i will be used as the pivoting row for phase 1, but since we already know that a_{11}, a_{21}, ..., a_{i-1,1} are all zero, we do not have to combine row i with rows 1, 2, ..., i - 1. We can give an informal description of the algorithm as follows:

for k := 1 to n - 1 do {phase k of the algorithm}
begin
    1. find the first j ≥ k such that a_{jk}^{(k)} = 1;
    2. exchange row k and row j (if j ≠ k), that is, choose row j as the pivoting row;
    3. zero out the elements a_{ik}^{(k)} such that i > j and a_{ik}^{(k)} = 1, by adding row j to row i (here addition denotes the addition in GF(2), that is, the XOR operation);
end.

Fig. 1. The systolic array for Gaussian elimination (n = 4), together with the input format of the matrix (A, b).

For a given k, note that steps 2 and 3 are not executed if we cannot find a j such that j ≥ k and a_{jk}^{(k)} = 1: in this case, the matrix A is rank deficient.
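For reference, here is a direct sequential Python rendering of the informal description above (a sketch of the algorithm itself, not of the systolic array); it operates on the augmented matrix (A, b) stored as lists of 0/1 values.

# A sequential rendering (ours) of the informal algorithm: Gaussian
# elimination over GF(2) with partial pivoting on the augmented
# n x (n+1) matrix (A, b). Addition in GF(2) is the XOR operation.

def triangularize_gf2(rows):
    n = len(rows)
    for k in range(n - 1):                       # phase k of the algorithm
        # 1. find the first j >= k such that a_jk = 1
        j = next((i for i in range(k, n) if rows[i][k] == 1), None)
        if j is None:
            continue          # steps 2-3 not executed: A is rank deficient
        # 2. exchange row k and row j: choose row j as the pivoting row
        if j != k:
            rows[k], rows[j] = rows[j], rows[k]
        # 3. zero out the remaining ones of column k with the pivoting row
        for i in range(k + 1, n):
            if rows[i][k] == 1:
                rows[i] = [x ^ y for x, y in zip(rows[i], rows[k])]
    return rows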

B. Operation of the Processors

There are two types of processors in the array, respectively represented as circular and square processors in Fig. 1. In the kth row of the array, the processor P_{k,1} is a circular processor, which generates the matrices M_{ik}, i > k, and all the other processors are square processors, which apply the transformation relative to M_{ik}. The operation of these two types of processors is detailed in Fig. 2.

The operation of a given processor depends on whether it is operating for the first time. In the description of Fig. 2, we assume for simplicity that a Boolean called init (for "initialization") is stored in every processor and set to "true" at the beginning of the computation. This Boolean controls the operation of the processors. In an actual implementation, the instruction to start the computation would be input to processor P_{11} together with the first coefficient a_{11}, and systolically propagated through the whole array (it can easily be checked that processor P_{k,j} operates for the first time at step 3k + j - 3).

Circular Processors: The first time they operate, circular processors store their input in their internal register. Afterwards, they compute the matrices M_{ik} and send them rightwards to the square processors. Rather than directly generating the matrices M_{ik}, the circular processors transmit rightwards one of the following three instructions:

- id, which stands for the identity matrix;
- perm, which requires a permutation of rows;
- add, which requires an addition of rows.

This allows the instructions generated by the circular processors to be encoded using only two bits.

Square Processors: Again, square processors store their input in their internal register the first time they operate. Afterwards, they update their vertical input and possibly their internal register, according to the instruction they receive from the left. See Fig. 2 for a complete description.

Fig. 2. Gaussian elimination over GF(2): operation of the processors. (a) Operation of the circular cells. (b) Operation of the square cells.

(a) Circular cells:

if init then { store a_in }
    begin r := a_in ; init := false ; end
else { generate instruction }
    case (a_in, r) of
        (0,0) : op_out := id { identity - still looking for pivot } ;
        (0,1) : op_out := id { element already zero } ;
        (1,0) : op_out := perm ; r := a_in { exchange rows } ;
        (1,1) : op_out := add { XOR operation } ;
    end

(b) Square cells:

if init then { store a_in }
    begin r := a_in ; init := false ; end
else { update a_in (and r if a permutation of rows) }
    begin
        case op_in of
            id : a_out := a_in { r is not modified } ;
            add : a_out := a_in XOR r { r is not modified } ;
            perm : a_out := r ; r := a_in { exchange a_in and r } ;
        end ;
        op_out := op_in ;
    end
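The cell programs of Fig. 2 translate into Python as follows (a behavioral sketch under our naming conventions; the explicit register update in the perm case of the circular cell is our reading of the figure, needed so that the new pivot is remembered).

# A Python transliteration (ours) of the two cell programs of Fig. 2.

def circular_step(cell, a_in):
    """Leftmost cell of a row: stores the pivot candidate, then emits
    one of the instructions id / perm / add for the rest of the row."""
    if cell["init"]:
        cell["r"], cell["init"] = a_in, False
        return None
    if a_in == 0:
        return "id"              # still looking for a pivot, or already zero
    if cell["r"] == 0:
        cell["r"] = a_in         # exchange rows: the incoming 1 is the pivot
        return "perm"
    return "add"                 # both elements are 1: XOR the rows

def square_step(cell, a_in, op):
    """Inner cell: applies the received instruction to its coefficient."""
    if cell["init"]:
        cell["r"], cell["init"] = a_in, False
        return None, None
    if op == "id":
        a_out = a_in                           # r is not modified
    elif op == "add":
        a_out = a_in ^ cell["r"]               # r is not modified
    else:                                      # op == "perm"
        a_out, cell["r"] = cell["r"], a_in     # exchange a_in and r
    return a_out, op                           # the instruction moves right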


C. Unloading the Array

At the end, the matrix (T, b') is stored in the array, the kth row of the array holding (t_{kk}, ..., t_{kn}, b_k'); let t_{i,n+1} = b_i' for convenience. There remains to unload the array. According to the general philosophy of the model, this must be done systolically, by propagating special Boolean control instructions. Moreover, we would like to pipeline the unloading of the array with the computation phase, so that n additional steps are not necessary. Various schemes are possible. We outline a scheme where each processor sends the content of its register to the right when it has finished operating.

At time t = 4 in our example (n = 4), P_{11} operates for the last time. It remains idle for one step and then sends t_{11} rightwards. P_{12} finishes its work at time t = 5. It transmits t_{11} to the right at t = 6, and then sends t_{12} at time t = 7. The process goes on, and t_{1i} is output by P_{15} at time t = 9 + i, for 1 ≤ i ≤ n + 1. Similarly, we start unloading the second row at time t = 8. We see that the first rows of the array are unloaded while the last rows still operate, thus achieving the pipelining that is sought. Indeed, in the general case, the last computation occurs in P_{n,1} at time 3n - 1, and t_{i,i+k} (1 ≤ i ≤ n, 0 ≤ k ≤ n + 1 - i) is output from the array at time 2n + i + k + 1. The total computation time is 3n + 2.

We point out that the unloading of the array cannot be simplified as in the case of Gaussian elimination over R without pivoting: in the real case, the coefficients t_{i,i+k} are not modified after being stored in the array; hence, they can be sent downwards immediately. Here, on the contrary, the t_{i,i+k} can be modified by any subsequent permutation of rows.
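A short Python sketch (ours) tabulating the output times stated above for n = 4; it merely evaluates the formula t = 2n + i + k + 1 and confirms that t_{1j} leaves the array at time 9 + j, with the last value appearing at time 3n + 2 = 14.

# Unloading timetable implied by the formulas above (our sketch):
# t_{i,i+k} leaves the array at time 2n + i + k + 1.

def unload_times(n):
    times = {}
    for i in range(1, n + 1):
        for k in range(0, n + 2 - i):       # t_{i,n+1} stands for b'_i
            times[(i, i + k)] = 2 * n + i + k + 1
    return times

times = unload_times(4)
print([times[(1, j)] for j in range(1, 6)])   # [10, 11, 12, 13, 14] = 9 + j
print(max(times.values()))                    # 14 = 3n + 2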

III. REMARKS AND EXTENSIONS

A. Solving the Triangular System

After triangularization, there remains a triangular system to be solved. Assuming the matrix A is nonsingular, we can use another systolic array, the triangular system solver of Kung and Leiserson [13], which requires 2n additional steps to solve the system Tx = b'. However, this would require the triangular matrix to be stored in the host and reordered by diagonals before being fed into Kung and Leiserson's array. Rather, it is more efficient to use the Jordan elimination scheme depicted for the real case in [6] and [19]. This scheme can be implemented on the same array and requires only n additional steps to provide the solution x, leading to a very efficient scheme of only n[(n + 1)/2 + 1] cells and 4n - 1 time steps to completely solve the system Ax = b.

The idea is to directly transform the system (2) Tx = b' into a diagonal one. At the end of the triangularization phase, rather than unloading the array, we start another computational phase, which we term the resolution phase. When a diagonal processor P_{k,1} receives the input resol, it checks the value of its internal register r. If r = 0, then the matrix is singular, and the processor outputs the control error to P_{k+1,1}. If r ≠ 0, then the diagonal processor transmits to its right neighbors the instruction to propagate downwards the content of their internal registers. As a result, row k will be combined successively with rows k + 1, k + 2, ..., n: this corresponds to the Jordan diagonalization scheme. The extended array is depicted in Fig. 3. Data coefficients and control instructions are input to the array and move systolically, until the solution vector is output by processor P_{n,2}. The triangularization phase is pipelined with the diagonalization phase, and the last component of the solution vector x is delivered after 4n - 1 time steps. The value of x is not validated if processor P_{n,1} outputs the control error at step 4n - 2, a necessary and sufficient condition for the coefficient matrix A to be singular. The operation of the processors is described in Fig. 4.

Note that the array can be straightforwardly extended to the case where there are q systems Ax_i = b_i to solve, 1 ≤ i ≤ q: this will require n(n + 1)/2 + qn cells and 4n + q - 2 time steps. However, for large values of q, it is more efficient to use the Gauss-Jordan diagonalization algorithm rather than triangularizing the system matrix and then solving q triangular systems: see [9] for more details.
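Sequentially, the resolution phase amounts to the following Python sketch (our rendering of the Jordan scheme, not the array code): over GF(2) the diagonal pivots equal 1, so combining row k with rows k + 1, ..., n leaves x_k in the right-hand side position.

# A sequential sketch (ours) of the resolution phase: after
# triangularization over GF(2), row k is combined with rows k+1, ..., n
# so that the system becomes diagonal and x_k can be read off directly;
# a zero diagonal register signals a singular matrix.

def jordan_resolve_gf2(rows):
    n = len(rows)                 # rows hold the n x (n+1) matrix (T, b')
    for k in range(n):
        if rows[k][k] == 0:
            raise ValueError("error: singular matrix")   # control `error`
        for j in range(k + 1, n):     # combine row k with rows k+1, ..., n
            if rows[k][j] == 1:       # t_jj = 1, so XORing row j clears t_kj
                rows[k] = [x ^ y for x, y in zip(rows[k], rows[j])]
    return [rows[k][n] for k in range(n)]   # x_k = b'_k once row k is diagonal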



Fig. 4. Solving Ax = b over GF(2): operation of the processors. (a) Operation of the circular cells. (b) Operation of the square cells.

(a) Circular cells:

case control_in of
    init : begin { store a_in }
        r := a_in ; control_out := control_in ;
    end
    triang : begin { generate instruction }
        case (a_in, r) of
            (0,0) : op_out := id { identity - still looking for pivot } ;
            (0,1) : op_out := id { element already zero } ;
            (1,0) : op_out := perm ; r := a_in { exchange rows } ;
            (1,1) : op_out := add { XOR operation } ;
        end ;
        control_out := control_in ;
    end
    resol : begin { solving the system - value of a_out meaningless }
        if r = 0 then control_out := error { singular matrix }
        else control_out := control_in ;
    end
    error : { singular matrix } control_out := control_in ;
end

(b) Square cells:

case control_in of
    init : { store a_in } r := a_in ;
    triang : begin { update a_in (and r if a permutation of rows) }
        case op_in of
            id : a_out := a_in { r is not modified } ;
            add : a_out := a_in XOR r { r is not modified } ;
            perm : a_out := r ; r := a_in { exchange a_in and r } ;
        end ;
        op_out := op_in ;
    end
    resol : { transmit r downwards } a_out := r ; op_out := op_in ;
    error : { meaningless operation }
end
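The control layer of Fig. 4 can be sketched in Python as follows (our rendering; only the circular cell is shown, since the square cells differ from Fig. 2 only by the resol and error cases).

# A control-level sketch (ours) of the Fig. 4 circular cell: a control
# word accompanies the data and selects among the four modes.

def circular_step_ctl(cell, a_in, control):
    """Return (op_out, control_out) for one time step."""
    if control == "init":                  # store a_in
        cell["r"] = a_in
        return None, "init"
    if control == "triang":                # generate id / perm / add, as in Fig. 2
        if a_in == 0:
            op = "id"
        elif cell["r"] == 0:
            op, cell["r"] = "perm", a_in
        else:
            op = "add"
        return op, "triang"
    if control == "resol":                 # check the pivot register
        return None, ("error" if cell["r"] == 0 else "resol")
    return None, "error"                   # singular matrix: propagate error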

B. Extension to GF(p), p a Prime Number

The previous array can easily be extended to deal with computations over a finite field GF(p), where p is any prime number. The key idea is the same: the first nonzero element found during the search is chosen as the pivot. The only modification in the operation of the processors is to replace the add instruction by the appropriate combination factor. Hence, the processors now generate an instruction

    op ∈ {id, perm, comb}.

Whenever op = comb, an appropriate combination factor must also be generated, and transmitted or stored according to the program of the cell.

The arithmetic requirements are more complex in GF(p) for general p than in GF(2). A possible way to implement the inverse computation is a table lookup, whereas the multiplication can be done using the MMA modular multiplication algorithm of [3], which consists of reducing each partial summation modulo p before going on: this prevents having to perform a division by p at the end. Finally, the cost of an addition is twice that in Z (for the same number of bits), because computing (a + b) mod p requires computing both a + b and a + b - p.

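The following Python sketch (ours; the helper names and the exact placement of the combination factor are assumptions) illustrates the GF(p) cell arithmetic described above: the inverse is obtained by table lookup, the circular cell emits comb together with a combination factor, and additions use the a + b versus a + b - p selection.

# A sketch (ours) of the GF(p) cell arithmetic. Python's % stands in
# for the partial-sum reductions of the MMA multiplication of [3].

P = 7                                              # any prime modulus
INV = {x: pow(x, P - 2, P) for x in range(1, P)}   # inverse table lookup

def add_mod_p(a, b):
    s = a + b                           # compute both a + b ...
    return s - P if s >= P else s       # ... and a + b - p, keep the right one

def circular_instruction(r, a_in):
    if a_in == 0:
        return ("id", None)             # still looking for pivot / already zero
    if r == 0:
        return ("perm", None)           # exchange rows
    return ("comb", (a_in * INV[r]) % P)   # combination factor for this row

def square_update(r, a_in, op, factor):
    if op == "id":
        return a_in, r
    if op == "perm":
        return r, a_in                  # exchange a_in and r
    return add_mod_p(a_in, (-factor * r) % P), r   # a_in - f * r (mod p)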

C. Conclusion

We have introduced a systolic array for solving dense linear systems over GF(p), where p is a prime number. The array implements the Gaussian elimination algorithm with partial pivoting, although the operation remains purely systolic. There are no feedback cycles in our arrays. The importance of designing arrays without feedback cycles is emphasized by Kung and Lam [12]: acyclic implementations usually exhibit more favorable characteristics with respect to fault tolerance, two-level pipelining, and problem decomposition in general (see Hwang and Cheng [10]). In particular, all the techniques described by Chuang and He [5] can easily be applied to our design, leading to a problem-size independent systolic array for Gaussian elimination over GF(p).

REFERENCES

[1] H. M. Ahmed, J. M. Delosme, and M. Morf, "Highly concurrent computing structures for matrix arithmetic and signal processing," IEEE Computer, vol. 15, pp. 65-82, 1982.
[2] F. André, P. Frison, and P. Quinton, "Algorithmes systoliques: de la théorie à la pratique," IRISA Res. Rep. 214, May 1983.
[3] G. R. Blakley, "A computer algorithm for computing the product AB modulo M," IEEE Trans. Comput., vol. C-32, no. 4, pp. 497-500, 1983.
[4] A. Bojanczyk, R. P. Brent, and H. T. Kung, "Numerically stable solution of dense systems of linear equations using mesh-connected processors," Tech. Rep., Carnegie-Mellon Univ., 1981.
[5] H. Y. H. Chuang and G. He, "Design of problem-size independent systolic array systems," in Proc. IEEE Int. Conf. ICCD'84, Port Chester, NY, 1984, pp. 152-156.
[6] M. Cosnard, Y. Robert, and M. Tchuente, "Matching parallel algorithms with architectures: A case study," IFIP Working Conf. Highly Parallel Computers, Nice, France, Mar. 24-26, 1986.
[7] W. M. Gentleman and H. T. Kung, "Matrix triangularisation by systolic arrays," in Proc. SPIE 298, Real-Time Signal Processing IV, San Diego, CA, 1981, pp. 19-26.
[8] D. Heller, "Partitioning big matrices for small systolic arrays," in VLSI and Modern Signal Processing, S. Y. Kung et al., Eds. Englewood Cliffs, NJ: Prentice-Hall, 1985, pp. 185-199.
[9] B. Hochet, P. Quinton, and Y. Robert, "Systolic solution of linear systems with partial pivoting," in Proc. IEEE Int. Conf. Computer Arithmetic, Como, Italy, May 1987.
[10] K. Hwang and Y. H. Cheng, "Partitioned matrix algorithm for VLSI arithmetic systems," IEEE Trans. Comput., vol. C-31, no. 12, pp. 1215-1224, 1982.
[11] H. T. Kung, "Why systolic architectures?," IEEE Computer, vol. 15, no. 1, pp. 37-46, 1982.
[12] H. T. Kung and M. S. Lam, "Fault-tolerance and two-level pipelining in VLSI systolic arrays," J. Parallel Distrib. Comput., vol. 1, no. 1, pp. 32-63, 1984.
[13] H. T. Kung and C. E. Leiserson, "Systolic arrays (for VLSI)," in Proc. Symp. Sparse Matrix Computations, I. S. Duff et al., Eds., Knoxville, TN, 1978, pp. 256-282.
[14] D. E. Knuth, The Art of Computer Programming, Vol. 2. Reading, MA: Addison-Wesley, 1969, ch. 4.6.2.
[15] J. Makhoul, "Linear prediction: A tutorial review," Proc. IEEE, vol. 63, no. 4, pp. 561-580, 1975.
[16] M. A. Morrison and J. Brillhart, "A method of factoring and the factorization of F7," Math. Comput., vol. 29, pp. 183-205, 1975.
[17] D. Parkinson and M. Wunderlich, "A compact algorithm for Gaussian elimination over GF(2) implemented on highly parallel computers," Parallel Comput., vol. 1, pp. 65-73, 1984.
[18] R. Poet, "The design of special purpose hardware to factor large integers," Comput. Phys. Commun., vol. 37, pp. 337-341, 1985.
[19] Y. Robert and M. Tchuente, "Résolution systolique de systèmes linéaires denses," RAIRO Modélisation et Analyse Numérique, vol. 19, no. 2, pp. 315-326, 1985.
[20] Y. Robert and D. Trystram, "Un réseau systolique orthogonal pour le problème du chemin algébrique," C.R.A.S. Paris, vol. 302, ser. I, no. 6, pp. 241-244, 1986.
[21] R. Schreiber and P. Kuekes, "Systolic linear algebra machines in digital signal processing," in VLSI and Modern Signal Processing, S. Y. Kung et al., Eds. Englewood Cliffs, NJ: Prentice-Hall, 1985.