An Algebraic Theory for Modeling Multistage Interconnection Networks

S. D. Kaushik, S. Sharma, C.-H. Huang Department of Computer and Information Science The Ohio State University Columbus, OH 43210

Abstract

We use an algebraic theory based on tensor products to model multistage interconnection networks. This algebraic theory has been used for designing and implementing block recursive numerical algorithms on shared-memory vector multiprocessors. In this paper, we focus on the modeling of multistage interconnection networks. The tensor product representations of the baseline network, the reverse baseline network, the indirect binary n-cube network, the generalized cube network, the omega network, and the flip network are given. We present the use of this theory for specifying and verifying network properties such as network partitioning and topological equivalence. Algorithm mapping using the tensor product formulation is demonstrated by mapping the matrix transposition algorithm onto multistage interconnection networks.

Keywords: Tensor product, parallel architecture, multistage interconnection network, partitionability, topological equivalence, algorithm mapping.

1 Introduction

Tensor products, also known as Kronecker products, have previously been used for matrix calculus [11, 12]. The notation has also been used for the design and implementation of block recursive numerical algorithms such as fast Fourier transforms [16, 17, 31] and Strassen's matrix multiplication algorithm [13, 14, 17]. The tensor product formulation of these algorithms has been used to generate efficient parallel and vector programs for shared-memory multiprocessors. It has proved useful for extracting parallel and vector operations and for automatically generating code for complex index computations [14, 15, 17]. We extend this work to use the tensor product notation for designing and implementing efficient algorithms for non-uniform shared-memory multiprocessors which are connected by multistage interconnection networks. A major step in mapping algorithms to non-uniform shared-memory multiprocessors is to model algorithms and interconnection networks in a compatible form. We will model multistage interconnection networks, including the baseline network, the reverse baseline network, the indirect binary n-cube network, the generalized cube network, the omega network, and the flip network, using the tensor product notation.

This work was supported in part by DARPA, order number 7899, monitored by NIST under grant number 60NANB1D1150, and by Ohio State University Seed Grant No. 221337.

A large number of multistage interconnection networks have been proposed and implemented in multiprocessor systems [2, 3, 4, 9, 10, 19, 20, 22, 24, 27, 28, 29, 30]. These networks typically connect a set of $N$ processing elements to a set of $N$ memory modules and perform various permutations between the two sets. The processing elements and memory modules are represented as $N$ input and $N$ output terminals. The permutations are performed by $\log_2 N$ stages of switching elements and the interconnections between consecutive stages. Each network has full access capability, i.e., it can connect any input terminal to any output terminal.

Several mathematical models have been used to specify and verify various properties of multistage interconnection networks. Benes [5, 6, 7] uses a group theoretic approach to prove properties of the Benes and Clos networks. Lawrie [20] uses a number theoretic approach to prove the permutation capabilities of the omega network. Wu and Feng use the binary representation of the input lines and permutations on these representations to prove topological and functional equivalence of a class of multistage interconnection networks and the universality of the omega network [32, 33, 34]. Siegel uses permutation cycles to explore the ability to partition various multistage interconnection networks [26, 27, 28]. Agrawal [1] uses a graph theoretic approach to model the properties of networks. Pradhan and Kodandapani [23] use bit strings to represent the permutations performed by single and multistage interconnection networks in order to partition a class of networks. It has been proved that the graph and binary representations for a given class of interconnection networks are equivalent to the tensor product representation [8]. However, the tensor product representation is more versatile because it can represent both algorithms and architectures.

In this paper, we use an algebraic theory based on the tensor product notation for representing multistage interconnection networks. A class of full-access, unique-path, blocking multistage interconnection networks, including the baseline network, the reverse baseline network, the indirect binary n-cube network, the generalized cube network, the omega network, and the flip network, is represented using the tensor product notation. The tensor product representation of the networks is used to prove the partitionability and the topological equivalence of the networks. The objective of this paper is to demonstrate the use of the tensor product notation to specify and verify a variety of properties, and to use these properties and representations to map algorithms expressed in tensor product form to a multiprocessor architecture based on a multistage interconnection network. The ability of the tensor product notation to represent both algorithm and interconnection network specifications enables us to effectively map an algorithm onto a specific network topology.

The paper is organized as follows. Section 2 gives a brief overview of the algebraic theory of tensor products. The representation of basic elements of multistage interconnection networks, including control structures, topology and switching stages, is described in Section 3. Tensor product representations of various networks are derived in Section 4. Network partitioning and topological equivalence are presented in Section 5 and Section 6, respectively. A tensor product formulation of the matrix transposition algorithm and its mapping onto multistage interconnection networks are presented in Section 7. Conclusions are given in Section 8.

2 The Tensor Product Representation

In this section, we give an overview of the tensor product notation and the properties which are used in the representation of multistage interconnection networks. For details of this theory, the reader is referred to [11, 12].

Definition 2.1 (Tensor Product) Let $A$ and $B$ be two matrices of sizes $m \times n$ and $p \times q$, respectively. The tensor product of $A$ and $B$ is the block matrix obtained by replacing each element $a_{i,j}$ by $a_{i,j}B$, i.e., $A \otimes B$ is an $mp \times nq$ matrix defined as

$$A \otimes B = \begin{bmatrix} a_{0,0}B & \cdots & a_{0,n-1}B \\ \vdots & \ddots & \vdots \\ a_{m-1,0}B & \cdots & a_{m-1,n-1}B \end{bmatrix}.$$

Tensor products can be expressed in terms of matrix and vector bases. A matrix basis $E^{m,n}_{i,j}$ is an $m \times n$ matrix with a one at the $(i, j)$-th position and zeros elsewhere. A vector basis $e^m_i$ is a column vector of length $m$ with a one at position $i$ and zeros elsewhere. If a matrix is stored by rows, the basis $E^{m,n}_{i,j}$ is isomorphic to the tensor product of two vector bases $e^m_i \otimes e^n_j$. The tensor product of two vector bases $e^m_i \otimes e^n_j$ is equal to the vector basis $e^{mn}_{in+j}$.
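As an illustration of Definition 2.1 and of the basis identity $e^m_i \otimes e^n_j = e^{mn}_{in+j}$, the following is a minimal sketch in plain Python. The helper names `kron` and `basis` are hypothetical and are not part of the paper; matrices are represented as lists of rows.

```python
def kron(a, b):
    """Tensor (Kronecker) product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def basis(m, i):
    """Vector basis e^m_i as an m x 1 column matrix."""
    return [[1 if k == i else 0] for k in range(m)]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
# Each a_ij is replaced by the block a_ij * B.
assert kron(A, B) == [[0, 1, 0, 2],
                      [1, 0, 2, 0],
                      [0, 3, 0, 4],
                      [3, 0, 4, 0]]

# e^m_i (x) e^n_j equals e^{mn}_{i*n + j} for every i and j.
m, n = 3, 4
assert all(kron(basis(m, i), basis(n, j)) == basis(m * n, i * n + j)
           for i in range(m) for j in range(n))
```

The second assertion is the row-major indexing rule: position $(i, j)$ of an $m \times n$ matrix stored by rows sits at offset $in + j$.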

Definition 2.2 (Direct Sum) Let $A$ and $B$ be two matrices of sizes $m \times n$ and $p \times q$, respectively. The direct sum of $A$ and $B$ is an $(m + p) \times (n + q)$ matrix defined as

$$A \oplus B = \begin{bmatrix} A & \\ & B \end{bmatrix}.$$

Note that $I_n \otimes B$, where $B$ is an $m \times m$ matrix, is the direct sum of $n$ copies of $B$,

$$I_n \otimes B = \bigoplus_{k=0}^{n-1} B = \begin{bmatrix} B & & \\ & \ddots & \\ & & B \end{bmatrix}.$$

Applying $I_n \otimes B$ to an $mn$-dimensional vector $X$ is equivalent to applying $B$ on $n$ consecutive segments of $X$, each of size $m$. The product $\prod_{i=0}^{n-1} A_i$ is interpreted as $\prod_{i=0}^{n-1} A_i = A_{n-1} \cdots A_0$. One of the permutations that arises frequently in the tensor product representation of networks is the stride permutation.

Definition 2.3 (Stride Permutation) $L^{mn}_n (e^m_i \otimes e^n_j) = e^n_j \otimes e^m_i$.

$L^{mn}_n$ is referred to as the stride permutation of size $mn$ with stride distance $n$. The tensor basis $e^m_i \otimes e^n_j$ is isomorphic to $E^{m,n}_{i,j}$ when a matrix is stored by rows; the tensor basis $e^n_j \otimes e^m_i$ is isomorphic to $E^{m,n}_{i,j}$ when a matrix is stored by columns. Therefore, the stride permutation $L^{mn}_n$ transposes an $m \times n$ matrix. The following properties of the tensor product are used in this paper. Let $I_n$ denote the $n \times n$ identity matrix, and let the appropriate matrix inverses and matrix products be defined.

1. $(A \otimes B)(C \otimes D) = AC \otimes BD$
2. $A \otimes B = (A \otimes I_n)(I_m \otimes B) = (I_m \otimes B)(A \otimes I_n)$
3. $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$
4. $\prod_{i=0}^{n-1} (I_n \otimes A_i) = I_n \otimes \left(\prod_{i=0}^{n-1} A_i\right)$
5. $A \otimes B \otimes C = A \otimes (B \otimes C) = (A \otimes B) \otimes C$
6. $(L^{rs}_r)^{-1} = L^{rs}_s$, $L^r_r = I_r$
7. $L^{rst}_{st} = L^{rst}_s L^{rst}_t$
8. $L^{rst}_t = (L^{rt}_t \otimes I_s)(I_r \otimes L^{st}_t)$
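The stride permutation and properties 6 to 8 can be checked numerically. The sketch below is only an illustration under one assumed encoding: a permutation is stored as a destination list `p`, where `p[b]` is the index of the basis vector that $e_b$ is mapped to. The function names are hypothetical.

```python
def identity(n):
    return list(range(n))


def stride(mn, n):
    """L^{mn}_n as a destination list: index i*n + j is sent to j*m + i."""
    m = mn // n
    return [(b % n) * m + b // n for b in range(mn)]


def compose(p, q):
    """The product P Q: q is applied first, then p."""
    return [p[q[b]] for b in range(len(q))]


def tensor(p, q):
    """P (x) Q for two permutations given as destination lists."""
    return [p[i] * len(q) + q[j] for i in range(len(p)) for j in range(len(q))]


def apply(p, x):
    """Apply the permutation to a data vector x."""
    y = [None] * len(x)
    for b, value in enumerate(x):
        y[p[b]] = value
    return y


# L^6_3 transposes the 2 x 3 row-major matrix [[0, 1, 2], [3, 4, 5]].
assert apply(stride(6, 3), [0, 1, 2, 3, 4, 5]) == [0, 3, 1, 4, 2, 5]

r, s, t = 2, 3, 4
# Property 6: (L^{rs}_r)^{-1} = L^{rs}_s.
assert compose(stride(r * s, r), stride(r * s, s)) == identity(r * s)
# Property 7: L^{rst}_{st} = L^{rst}_s L^{rst}_t.
assert stride(r * s * t, s * t) == compose(stride(r * s * t, s), stride(r * s * t, t))
# Property 8: L^{rst}_t = (L^{rt}_t (x) I_s)(I_r (x) L^{st}_t).
assert stride(r * s * t, t) == compose(tensor(stride(r * t, t), identity(s)),
                                       tensor(identity(r), stride(s * t, t)))
```

Storing destinations rather than full matrices keeps the check linear in the vector length, which is convenient when the same helpers are reused for whole networks.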

A useful property of the tensor product is that it can be commuted by applying stride permutations.

Theorem 2.1 (Commutation Theorem) $L^{mn}_n (A \otimes B) = (B \otimes A) L^{mn}_n$, where $A$ is an $m \times m$ matrix and $B$ is an $n \times n$ matrix. In other words, $(B \otimes A) = L^{mn}_n (A \otimes B) (L^{mn}_n)^{-1}$.
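The Commutation Theorem can be spot-checked on small dense matrices. This is a sketch with hypothetical helper names and arbitrary example matrices, not code from the paper; the stride permutation is built as an explicit 0/1 matrix from Definition 2.3.

```python
def kron(a, b):
    """Tensor product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def matmul(a, b):
    """Ordinary matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def stride_matrix(mn, n):
    """Matrix of L^{mn}_n: column i*n + j has its single one in row j*m + i."""
    m = mn // n
    mat = [[0] * mn for _ in range(mn)]
    for i in range(m):
        for j in range(n):
            mat[j * m + i][i * n + j] = 1
    return mat


A = [[1, 2], [3, 4]]                    # m x m with m = 2
B = [[1, 0, 2], [0, 3, 1], [4, 1, 0]]   # n x n with n = 3
L = stride_matrix(6, 3)

# L^{mn}_n (A (x) B) = (B (x) A) L^{mn}_n
assert matmul(L, kron(A, B)) == matmul(kron(B, A), L)
# Equivalent form; the inverse of a permutation matrix is its transpose.
assert kron(B, A) == matmul(matmul(L, kron(A, B)), transpose(L))
```

The matrices `A` and `B` need not be permutations; the identity holds for any square matrices of the stated sizes.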

The commutation theorem can be generalized to the following corollary.

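Corollary 2.1 If $A_i$ is an $n_i \times n_i$ matrix, then

$$(A_t \otimes \cdots \otimes A_{i+1} \otimes A_i \otimes \cdots \otimes A_0) = \left(I_{n_t \cdots n_{i+2}} \otimes L^{n_{i+1} n_i}_{n_{i+1}} \otimes I_{n_{i-1} \cdots n_0}\right) (A_t \otimes \cdots \otimes A_i \otimes A_{i+1} \otimes \cdots \otimes A_0) \left(I_{n_t \cdots n_{i+2}} \otimes L^{n_{i+1} n_i}_{n_i} \otimes I_{n_{i-1} \cdots n_0}\right).$$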

We will now describe the tensor product representation of multistage interconnection networks.

3 Multistage Interconnection Networks

A multistage interconnection network is constructed from one or more stages of smaller interconnection networks called switching elements. A $k_1 \times k_2$ switching element connects $k_1$ input lines to $k_2$ output lines. The number of stages is the maximum number of switching elements lying along any input-output path. Typically, $2 \times 2$ switching elements are used. A multistage interconnection network performs a permutation on $N$ input lines and connects them to $N$ output lines. The permutation is determined by the state of the switching elements and the permutations that connect the output lines of one stage to the input lines of the next stage. In this paper, we will consider networks with $N$ input and $N$ output lines and $n$ stages of $2 \times 2$ switching elements, such that $N = 2^n$. The networks considered here are known as full-access blocking multistage interconnection networks. The input and output lines can be represented as $N$-dimensional vectors, $X$ and $Y$, respectively:

$$X = \begin{bmatrix} x_0 \\ \vdots \\ x_{N-1} \end{bmatrix} \quad \text{and} \quad Y = \begin{bmatrix} y_0 \\ \vdots \\ y_{N-1} \end{bmatrix}.$$

The various permutations performed by the network can be represented by a single operator $A$. The input and output lines are related by $Y = AX$. For example, consider a multistage interconnection network of size $N = 4$. Suppose that a particular permutation performed by the interconnection network is represented by the matrix

$$A = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}.$$

Then, we have

$$\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} x_3 \\ x_2 \\ x_1 \\ x_0 \end{bmatrix}.$$

This means that output line $y_0$ is connected to input line $x_3$, output line $y_1$ is connected to input line $x_2$, output line $y_2$ is connected to input line $x_1$, and output line $y_3$ is connected to input line $x_0$. Operator $A$ can be expressed in terms of the permutations performed by the switching stages and the permutation of lines between the stages. These have been referred to as the control structure and the topology of multistage interconnection networks [25]. In the following sections, we present the representation of the topology, the control structure, and the switching elements of an interconnection network.

3.1 Topology

Since networks of size $N = 2^n$ are considered, the input-output lines and the lines between the stages are numbered from $0$ to $N - 1$. Each line can be represented by a binary string of length $n$ [32, 33]. The topology of a network is represented by the permutations performed between the switching stages of the network. A permutation $P^N$ between two switching stages maps the line with binary representation $b_{n-1} \cdots b_0$ at the output of the first stage to the line with binary representation $P^N(b_{n-1} \cdots b_0)$ at the input of the next stage. For example, the shuffle permutation used in the omega network maps the line with binary representation $b_{n-1} b_{n-2} \cdots b_0$ to the line with binary representation $b_{n-2} \cdots b_0 b_{n-1}$.

The binary string representing a line can also be represented by the tensor product basis $e^2_{i_{n-1}} \otimes \cdots \otimes e^2_{i_0}$, where $i_{n-1}, \ldots, i_0 \in \{0, 1\}$. For example, the binary string 1001 can be represented by $e^2_1 \otimes e^2_0 \otimes e^2_0 \otimes e^2_1$. The permutations performed in an interconnection network are represented as permutations performed on this tensor product basis. In terms of the tensor product basis, the shuffle permutation can be represented by an operator which maps the basis $e^2_{i_{n-1}} \otimes e^2_{i_{n-2}} \otimes \cdots \otimes e^2_{i_0}$ to the basis $e^2_{i_{n-2}} \otimes \cdots \otimes e^2_{i_0} \otimes e^2_{i_{n-1}}$. By Definition 2.3, this mapping can be specified by the stride permutation $L^{2^n}_{2^{n-1}}$:

$$L^{2^n}_{2^{n-1}} \left(e^2_{i_{n-1}} \otimes e^2_{i_{n-2}} \otimes \cdots \otimes e^2_{i_0}\right) = e^2_{i_{n-2}} \otimes \cdots \otimes e^2_{i_0} \otimes e^2_{i_{n-1}}.$$

It will be shown in Section 4 that the topologies of the baseline network, the reverse baseline network, the indirect binary n-cube network, the generalized cube network, the omega network, and the flip network can be expressed in terms of stride permutations.
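The correspondence between the shuffle on binary line names and the stride permutation $L^{2^n}_{2^{n-1}}$ can be sketched as follows. The encoding (a destination list indexed by line number) and the function names are assumptions made for this illustration.

```python
def stride(mn, n):
    """L^{mn}_n as a destination list: line i*n + j is sent to line j*m + i."""
    m = mn // n
    return [(b % n) * m + b // n for b in range(mn)]


def rotate_left(b, n_bits):
    """Rotate the n_bits-bit name b left by one: b_{n-1} ... b_0 -> b_{n-2} ... b_0 b_{n-1}."""
    top = b >> (n_bits - 1)
    return ((b << 1) & ((1 << n_bits) - 1)) | top


n_bits = 4
N = 1 << n_bits
shuffle = stride(N, N // 2)            # L^{2^n}_{2^{n-1}}

# Every line is sent to the line whose name is the left rotation of its own.
assert all(shuffle[b] == rotate_left(b, n_bits) for b in range(N))

# The line named 1001 is connected to the line named 0011.
assert format(shuffle[0b1001], "04b") == "0011"
```

In the same encoding, `stride(N, 2)` rotates the name right by one bit, which is the inverse shuffle.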

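3.2 Switching Element

A $2 \times 2$ switching element is used as the basic building block in multistage interconnection networks. Each switching element connects a pair of input lines $x_0$ and $x_1$ to a pair of output lines $y_0$ and $y_1$. The switching element has two possible states: a through state and a cross state. The through state can be represented by the $2 \times 2$ identity matrix $I_2$:

$$\begin{bmatrix} y_0 \\ y_1 \end{bmatrix} = I_2 X = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix}.$$

The cross state can be represented by the $2 \times 2$ matrix $S^2$:

$$\begin{bmatrix} y_0 \\ y_1 \end{bmatrix} = S^2 X = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} x_1 \\ x_0 \end{bmatrix}.$$

The switching element is represented as $D^2$, where $D^2 \in \{S^2, I_2\}$.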

3.3 Control Structure

The control structure of an interconnection network determines how the switches in a stage of an interconnection network can be set. There are two types of control structures: individual stage control, in which all the switches in a stage are set to the same state, and individual box control, in which the state of each switch in a particular stage can be set independently of the other switches in that stage. In individual stage control, stage $i$ of a network of size $N = 2^n$, denoted $D_i^{2^n}$, is represented as

$$D_i^{2^n} = \bigoplus_{k=0}^{2^{n-1}-1} D_i^2 = I_{2^{n-1}} \otimes D_i^2. \tag{1}$$

In individual box control, stage $i$, $D_i^{2^n}$, is represented as

$$D_i^{2^n} = \bigoplus_{k=0}^{2^{n-1}-1} D_{i,k}^2. \tag{2}$$

The individual stage control is used for proving properties about interconnection network structures, such as partitionability and topological equivalence. The individual box control is used for reasoning about those network properties for which the states of switching elements are of importance. When the state of each individual switching element is not a concern, we will often omit the subscript of $D_i^N$ and simply use $D^N$ to represent the switches in a stage.
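The two control structures of Eq. 1 and Eq. 2 can be sketched as direct sums of $2 \times 2$ switch states. The encoding is an assumption of this example: each switch is a destination list, `[0, 1]` for the through state $I_2$ and `[1, 0]` for the cross state $S^2$, and the function names are hypothetical.

```python
THROUGH = [0, 1]   # I_2
CROSS = [1, 0]     # S^2


def direct_sum(blocks):
    """Direct sum of permutations: each block acts on its own consecutive segment."""
    out, offset = [], 0
    for block in blocks:
        out.extend(offset + d for d in block)
        offset += len(block)
    return out


def stage_control(n_bits, state):
    """Eq. 1: all 2^(n-1) switches of the stage share one state."""
    return direct_sum([state] * (1 << (n_bits - 1)))


def box_control(states):
    """Eq. 2: one independently chosen state per switch."""
    return direct_sum(states)


# N = 8, every switch crossed: neighbouring lines are exchanged pairwise.
assert stage_control(3, CROSS) == [1, 0, 3, 2, 5, 4, 7, 6]
# N = 8, only the two middle switches crossed.
assert box_control([THROUGH, CROSS, CROSS, THROUGH]) == [0, 1, 3, 2, 5, 4, 6, 7]
```

Stage control is the special case of box control in which every entry of `states` is the same, which mirrors the identity $\bigoplus_k D_i^2 = I_{2^{n-1}} \otimes D_i^2$.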

3.4 Network Representation

In an $n$ stage network, there are $n + 1$ permutations: $n - 1$ permutations between the switching stages, one pre-permutation before the first switching stage, and one post-permutation after the last switching stage. Let us consider an $n$ stage network $A^N$. Let $P_i$, $1 \le i < n$, denote the permutation performed between stages $i - 1$ and $i$. Let $P_0$ be the pre-permutation and $P_n$ be the post-permutation. The operator $A^N$, which represents the permutation performed by the entire network, can be expressed as

$$A^N = P_n^N D_{n-1}^N P_{n-1}^N \cdots P_1^N D_0^N P_0^N.$$

In most multistage interconnection networks, either the pre-permutation or the post-permutation is the identity permutation $I_N$. Therefore, we represent an interconnection network as

$$A^N = \prod_{i=0}^{n-1} \left(D_i^N P_i^N\right) \tag{3}$$

or

$$A^N = \prod_{i=0}^{n-1} \left(P_{i+1}^N D_i^N\right). \tag{4}$$

(4)

For example, in the omega network of size $N = 4$, with individual stage control, the switching stage can be represented by $I_2 \otimes D_i^2$. The permutations performed between the stages can be expressed by the stride operator $L^4_2$. Also, an additional shuffle permutation is performed before the first stage. Using Eq. 3, the omega network can be expressed as

$$\Omega^4 = \left(I_2 \otimes D_1^2\right) L^4_2 \left(I_2 \otimes D_0^2\right) L^4_2.$$

To obtain the identity permutation, all the stages are set to the through state, i.e., $D_0^2 = I_2$ and $D_1^2 = I_2$:

$$\Omega^4 \Big|_{D_i^2 = I_2} = (I_2 \otimes I_2) L^4_2 (I_2 \otimes I_2) L^4_2 = I_4 L^4_2 I_4 L^4_2 = L^4_4 = I_4.$$
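The $N = 4$ omega example can be evaluated for every stage setting. This is a sketch under an assumed encoding (a permutation is a destination list, so `p[b]` is the output line reached from input line `b`); the helper names are hypothetical.

```python
from functools import reduce


def stride(mn, n):
    """L^{mn}_n as a destination list: line i*n + j is sent to line j*m + i."""
    m = mn // n
    return [(b % n) * m + b // n for b in range(mn)]


def tensor(p, q):
    """P (x) Q for two permutations given as destination lists."""
    return [p[i] * len(q) + q[j] for i in range(len(p)) for j in range(len(q))]


def compose(p, q):
    """The product P Q: q is applied first, then p."""
    return [p[q[b]] for b in range(len(q))]


def matprod(*perms):
    """Left-to-right operator product, e.g. matprod(a, b, c) = A B C."""
    return reduce(compose, perms)


I2, S2 = [0, 1], [1, 0]
L42 = stride(4, 2)


def omega4(d1, d0):
    """Omega^4 = (I_2 (x) D_1) L^4_2 (I_2 (x) D_0) L^4_2 under stage control."""
    return matprod(tensor(I2, d1), L42, tensor(I2, d0), L42)


assert omega4(I2, I2) == [0, 1, 2, 3]   # all through: identity, as in the text
assert omega4(S2, I2) == [1, 0, 3, 2]
assert omega4(I2, S2) == [2, 3, 0, 1]
assert omega4(S2, S2) == [3, 2, 1, 0]   # the reversal used as operator A in Section 3
```

Under individual stage control the four settings give four distinct permutations; individual box control would allow more.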

4 Representation of Multistage Interconnection Networks

In this section, the tensor product formulation for the baseline network, the reverse baseline network, the indirect binary n-cube network, the generalized cube network, the omega network, and the flip network is derived.

4.1 The Baseline Network

The baseline network topology can be represented by the permutation $B_j^N$, which maps the line $b_{n-1} \cdots b_{n-j+1} b_{n-j} \cdots b_1 b_0$ at the output of stage $j - 1$ to the line $b_{n-1} \cdots b_{n-j+1} b_0 b_{n-j} \cdots b_1$ at the input of stage $j$. The permutation $B_j^N$ thus places the bit $b_0$ between bits $b_{n-j+1}$ and $b_{n-j}$. In terms of the tensor product, this permutation can be represented as $B_j^N = I_{2^{j-1}} \otimes L^{2^{n-j+1}}_2$ because

$$\left(I_{2^{j-1}} \otimes L^{2^{n-j+1}}_2\right) \left(e^2_{i_{n-1}} \otimes \cdots \otimes e^2_{i_{n-j+1}} \otimes e^2_{i_{n-j}} \otimes \cdots \otimes e^2_{i_1} \otimes e^2_{i_0}\right) = e^2_{i_{n-1}} \otimes \cdots \otimes e^2_{i_{n-j+1}} \otimes e^2_{i_0} \otimes e^2_{i_{n-j}} \otimes \cdots \otimes e^2_{i_1}.$$

The pre-permutation is $I_N$ and the post-permutation is $I_{2^{n-1}} \otimes L^2_2 = I_N$. Using the network representation in Eq. 4, a baseline network of size $N = 2^n$ can be represented by

$$B^N = \prod_{i=0}^{n-1} \left[\left(I_{2^i} \otimes L^{2^{n-i}}_2\right) D^N\right]. \tag{5}$$

A baseline network of size $N = 8$ is shown in Fig. 1 and is represented by

$$B^8 = I_8 \left(I_4 \otimes D^2\right) \left(I_2 \otimes L^4_2\right) \left(I_4 \otimes D^2\right) L^8_2 \left(I_4 \otimes D^2\right).$$

In Fig. 1, the tensor products at the top represent the permutations between the stages and the tensor products at the bottom represent the switching stages.

Figure 1: Baseline Network for N = 8

4.2 The Reverse Baseline Network

The reverse baseline network can be represented by the permutation $R_j^N$, which maps the line $b_{n-1} \cdots b_{j+1} b_j b_{j-1} \cdots b_0$ at the output of stage $j - 1$ to the line $b_{n-1} \cdots b_{j+1} b_{j-1} \cdots b_0 b_j$ at the input of stage $j$. The permutation $R_j^N$ thus places the bit $b_j$ in the least significant position. In terms of the tensor product, we have $R_j^N = I_{2^{n-j-1}} \otimes L^{2^{j+1}}_{2^j}$ because

$$\left(I_{2^{n-j-1}} \otimes L^{2^{j+1}}_{2^j}\right) \left(e^2_{i_{n-1}} \otimes \cdots \otimes e^2_{i_{j+1}} \otimes e^2_{i_j} \otimes e^2_{i_{j-1}} \otimes \cdots \otimes e^2_{i_0}\right) = e^2_{i_{n-1}} \otimes \cdots \otimes e^2_{i_{j+1}} \otimes e^2_{i_{j-1}} \otimes \cdots \otimes e^2_{i_0} \otimes e^2_{i_j}.$$

The pre-permutation is $I_{2^{n-1}} \otimes L^2_1 = I_N$ and the post-permutation is $I_N$. Using the network representation in Eq. 3, a reverse baseline network of size $N = 2^n$ can be represented by

$$R^N = \prod_{i=0}^{n-1} \left[D^N \left(I_{2^{n-i-1}} \otimes L^{2^{i+1}}_{2^i}\right)\right]. \tag{6}$$

A reverse baseline network of size $N = 8$ is shown in Fig. 2 and, following Eq. 6, is represented by

$$R^8 = \left(I_4 \otimes D^2\right) L^8_4 \left(I_4 \otimes D^2\right) \left(I_2 \otimes L^4_2\right) \left(I_4 \otimes D^2\right) I_8.$$

Figure 2: Reverse Baseline Network for N = 8

The reverse baseline network can be shown to be the inverse of the baseline network.

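Theorem 4.1 $R^N = (B^N)^{-1}$.

Proof: Consider the baseline network representation from Eq. 5. Noting that $(D^N)^{-1} = D^N$, we have

$$\left(B^N\right)^{-1} = \prod_{i=n-1}^{0} \left[D^N \left(I_{2^i} \otimes L^{2^{n-i}}_{2^{n-i-1}}\right)\right] = \prod_{j=0}^{n-1} \left[D^N \left(I_{2^{n-j-1}} \otimes L^{2^{j+1}}_{2^j}\right)\right] = R^N. \qquad \Box$$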
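Theorem 4.1 can be checked exhaustively for small networks by building Eq. 5 and Eq. 6 as permutations. The sketch assumes individual stage control and stores a permutation as a destination list; all function names are hypothetical. Because the inverse traverses the stages in the opposite order, the reverse baseline network is given the stage states in reversed order.

```python
from functools import reduce
from itertools import product


def identity(n):
    return list(range(n))


def stride(mn, n):
    """L^{mn}_n as a destination list: line i*n + j is sent to line j*m + i."""
    m = mn // n
    return [(b % n) * m + b // n for b in range(mn)]


def tensor(p, q):
    return [p[i] * len(q) + q[j] for i in range(len(p)) for j in range(len(q))]


def compose(p, q):
    """The product P Q: q is applied first, then p."""
    return [p[q[b]] for b in range(len(q))]


def matprod(*perms):
    return reduce(compose, perms)


def baseline(states):
    """Eq. 5 with states[i] as the 2 x 2 state of stage i."""
    n = len(states)
    net = identity(1 << n)
    for i, d in enumerate(states):
        perm = tensor(identity(1 << i), stride(1 << (n - i), 2))
        net = matprod(perm, tensor(identity(1 << (n - 1)), d), net)
    return net


def reverse_baseline(states):
    """Eq. 6 with states[i] as the 2 x 2 state of stage i."""
    n = len(states)
    net = identity(1 << n)
    for i, d in enumerate(states):
        perm = tensor(identity(1 << (n - i - 1)), stride(1 << (i + 1), 1 << i))
        net = matprod(tensor(identity(1 << (n - 1)), d), perm, net)
    return net


I2, S2 = [0, 1], [1, 0]
for n in (2, 3, 4):
    for states in product([I2, S2], repeat=n):
        inverse = reverse_baseline(states[::-1])
        assert compose(inverse, baseline(states)) == identity(1 << n)
```

The loop covers all $2^n$ stage settings for $N = 4, 8, 16$, so it is a finite check of these sizes rather than a proof.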

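4.3 The Indirect Binary n-Cube Network

The indirect binary n-cube network [22] can be represented by the permutation $\mathcal{C}_j^N$, which maps the line $b_{n-1} \cdots b_{j+1} b_j b_{j-1} \cdots b_1 b_0$ at the output of stage $j - 1$ to the line $b_{n-1} \cdots b_{j+1} b_0 b_{j-1} \cdots b_1 b_j$ at the input of stage $j$. The permutation $\mathcal{C}_j^N$ exchanges bits $b_j$ and $b_0$. This exchange can be performed in two steps. In the first step, bit $b_j$ is placed in the position of bit $b_0$. This is performed by the tensor product $I_{2^{n-j-1}} \otimes L^{2^{j+1}}_{2^j}$. In the second step, bit $b_0$ is placed at the left-hand side of bit $b_{j-1}$. This step is performed by the tensor product $I_{2^{n-j-1}} \otimes L^{2^j}_2 \otimes I_2$. Thus,

$$\mathcal{C}_j^N = \left(I_{2^{n-j-1}} \otimes L^{2^j}_2 \otimes I_2\right) \left(I_{2^{n-j-1}} \otimes L^{2^{j+1}}_{2^j}\right) = I_{2^{n-j-1}} \otimes \left(L^{2^j}_2 \otimes I_2\right) L^{2^{j+1}}_{2^j}$$

because

$$\begin{aligned}
&\left(I_{2^{n-j-1}} \otimes L^{2^j}_2 \otimes I_2\right) \left(I_{2^{n-j-1}} \otimes L^{2^{j+1}}_{2^j}\right) \left(e^2_{i_{n-1}} \otimes \cdots \otimes e^2_{i_{j+1}} \otimes e^2_{i_j} \otimes e^2_{i_{j-1}} \otimes \cdots \otimes e^2_{i_1} \otimes e^2_{i_0}\right) \\
&\quad = \left(I_{2^{n-j-1}} \otimes L^{2^j}_2 \otimes I_2\right) \left(e^2_{i_{n-1}} \otimes \cdots \otimes e^2_{i_{j+1}} \otimes e^2_{i_{j-1}} \otimes \cdots \otimes e^2_{i_1} \otimes e^2_{i_0} \otimes e^2_{i_j}\right) \\
&\quad = e^2_{i_{n-1}} \otimes \cdots \otimes e^2_{i_{j+1}} \otimes e^2_{i_0} \otimes e^2_{i_{j-1}} \otimes \cdots \otimes e^2_{i_1} \otimes e^2_{i_j}.
\end{aligned}$$

The pre-permutation of the network is $I_N$. The post-permutation, $\mathcal{C}_n^N$, of the network is the inverse shuffle permutation, which maps the line with binary representation $b_{n-1} b_{n-2} \cdots b_1 b_0$ to the line with binary representation $b_0 b_{n-1} b_{n-2} \cdots b_1$. It can be expressed by

$$L^{2^n}_2 \left(e^2_{i_{n-1}} \otimes e^2_{i_{n-2}} \otimes \cdots \otimes e^2_{i_0}\right) = e^2_{i_0} \otimes e^2_{i_{n-1}} \otimes e^2_{i_{n-2}} \otimes \cdots \otimes e^2_{i_1}.$$

Using the network representation in Eq. 4, the indirect binary n-cube network can be expressed as

$$\mathcal{C}^N = L^{2^n}_2 D^N \prod_{i=0}^{n-2} \left[\left(I_{2^{n-i-2}} \otimes \left(L^{2^{i+1}}_2 \otimes I_2\right) L^{2^{i+2}}_{2^{i+1}}\right) D^N\right]. \tag{7}$$

An indirect binary n-cube network of size $N = 8$ is shown in Fig. 3 and is represented as

$$\mathcal{C}^8 = L^8_2 \left(I_4 \otimes D^2\right) \left(L^4_2 \otimes I_2\right) L^8_4 \left(I_4 \otimes D^2\right) \left(I_2 \otimes L^4_2\right) \left(I_4 \otimes D^2\right).$$

Figure 3: Indirect Binary n-cube Network for N = 8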

4.4 The Generalized Cube Network

The generalized cube network [27, 28, 29] can be represented by the permutation $G_j^N$, which maps line $b_{n-1} \cdots b_{n-j+1} b_{n-j} b_{n-j-1} \cdots b_1 b_0$ at the output of stage $j - 1$ to line $b_{n-1} \cdots b_{n-j+1} b_0 b_{n-j-1} \cdots b_1 b_{n-j}$ at the input of stage $j$. The permutation $G_j^N$ exchanges bits $b_{n-j}$ and $b_0$. In terms of the tensor product, the permutation $G_j^N$ is expressed as

$$G_j^N = I_{2^{j-1}} \otimes \left(L^{2^{n-j}}_2 \otimes I_2\right) L^{2^{n-j+1}}_{2^{n-j}}.$$

The pre-permutation of the network, $G_0^N$, is the shuffle permutation. The post-permutation of the network is $I_N$. Using the network representation in Eq. 3, the generalized cube network can be represented as

$$G^N = \left\{\prod_{i=1}^{n-1} D^N \left[I_{2^{i-1}} \otimes \left(L^{2^{n-i}}_2 \otimes I_2\right) L^{2^{n-i+1}}_{2^{n-i}}\right]\right\} D^N L^{2^n}_{2^{n-1}}. \tag{8}$$

The generalized cube network for $N = 8$ is shown in Fig. 4 and can be represented by

$$G^8 = \left(I_4 \otimes D^2\right) \left(I_2 \otimes L^4_2\right) \left(I_4 \otimes D^2\right) \left(L^4_2 \otimes I_2\right) L^8_4 \left(I_4 \otimes D^2\right) L^8_4.$$

Figure 4: Generalized Cube Network for N = 8

The generalized cube network can be shown to be the inverse of the indirect binary n-cube network.

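Theorem 4.2 $G^N = (\mathcal{C}^N)^{-1}$.

Proof:

$$\begin{aligned}
\left(\mathcal{C}^N\right)^{-1} &= \left\{\prod_{i=n-2}^{0} D^N \left[I_{2^{n-i-2}} \otimes L^{2^{i+2}}_2 \left(L^{2^{i+1}}_{2^i} \otimes I_2\right)\right]\right\} D^N L^{2^n}_{2^{n-1}} \\
&= \left\{\prod_{i=n-2}^{0} D^N \left[I_{2^{n-i-2}} \otimes \left(L^{2^{i+1}}_2 \otimes I_2\right) L^{2^{i+2}}_{2^{i+1}}\right]\right\} D^N L^{2^n}_{2^{n-1}} \\
&= \left\{\prod_{j=1}^{n-1} D^N \left[I_{2^{j-1}} \otimes \left(L^{2^{n-j}}_2 \otimes I_2\right) L^{2^{n-j+1}}_{2^{n-j}}\right]\right\} D^N L^{2^n}_{2^{n-1}} \\
&= G^N.
\end{aligned}$$

The second equality holds because each inter-stage permutation exchanges two bits and is therefore its own inverse; the third follows from the substitution $j = n - 1 - i$. $\Box$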
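The bit-exchange reading of the cube permutation and Theorem 4.2 can both be checked for small $n$. The sketch assumes individual stage control, stores a permutation as a destination list, and uses hypothetical function names; `cube_perm(n, j)` builds $I_{2^{n-j-1}} \otimes (L^{2^j}_2 \otimes I_2) L^{2^{j+1}}_{2^j}$.

```python
from functools import reduce
from itertools import product


def identity(n):
    return list(range(n))


def stride(mn, n):
    """L^{mn}_n as a destination list: line i*n + j is sent to line j*m + i."""
    m = mn // n
    return [(b % n) * m + b // n for b in range(mn)]


def tensor(p, q):
    return [p[i] * len(q) + q[j] for i in range(len(p)) for j in range(len(q))]


def compose(p, q):
    """The product P Q: q is applied first, then p."""
    return [p[q[b]] for b in range(len(q))]


def matprod(*perms):
    return reduce(compose, perms)


def stage(n, d):
    """I_{2^(n-1)} (x) D^2."""
    return tensor(identity(1 << (n - 1)), d)


def cube_perm(n, j):
    inner = compose(tensor(stride(1 << j, 2), identity(2)),
                    stride(1 << (j + 1), 1 << j))
    return tensor(identity(1 << (n - j - 1)), inner)


def swap_bits(b, j):
    """Exchange bit j and bit 0 of the line name b."""
    return b ^ ((1 << j) | 1) if ((b >> j) ^ b) & 1 else b


def indirect_cube(states):
    """Eq. 7 with states[i] as the state of stage i."""
    n = len(states)
    net = identity(1 << n)
    for i in range(n - 1):
        net = matprod(cube_perm(n, i + 1), stage(n, states[i]), net)
    return matprod(stride(1 << n, 2), stage(n, states[n - 1]), net)


def generalized_cube(states):
    """Eq. 8 with states[i] as the state of stage i."""
    n = len(states)
    net = matprod(stage(n, states[0]), stride(1 << n, 1 << (n - 1)))
    for i in range(1, n):
        net = matprod(stage(n, states[i]), cube_perm(n, n - i), net)
    return net


I2, S2 = [0, 1], [1, 0]
n = 3
for j in range(1, n):
    assert cube_perm(n, j) == [swap_bits(b, j) for b in range(1 << n)]
for states in product([I2, S2], repeat=n):
    assert compose(generalized_cube(states[::-1]), indirect_cube(states)) == identity(1 << n)
```

As with the baseline pair, the inverse network receives the stage states in reversed order, since its first stage undoes the last stage of the forward network.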

4.5 The Omega Network

The omega network [19, 20, 21, 30], also known as the shuffle exchange network, is represented by the permutation $\Omega_j^N$, which maps line $b_{n-1} b_{n-2} \cdots b_0$ at the output of stage $j - 1$ to line $b_{n-2} \cdots b_0 b_{n-1}$ at the input of stage $j$. The permutation $\Omega_j^N$ rotates the bit sequence left by one bit. In terms of the tensor product, $\Omega_j^N = L^{2^n}_{2^{n-1}}$ because

$$L^{2^n}_{2^{n-1}} \left(e^2_{i_{n-1}} \otimes e^2_{i_{n-2}} \otimes \cdots \otimes e^2_{i_0}\right) = e^2_{i_{n-2}} \otimes \cdots \otimes e^2_{i_0} \otimes e^2_{i_{n-1}}.$$

The pre-permutation of the network is $L^{2^n}_{2^{n-1}}$. The post-permutation of the network is $I_N$. Using the network representation in Eq. 3, the omega network of size $N = 2^n$ can be represented as

$$\Omega^N = \prod_{i=0}^{n-1} D^N L^{2^n}_{2^{n-1}}. \tag{9}$$

An omega network of size $N = 8$ is shown in Fig. 5 and is represented as

$$\Omega^8 = \left(I_4 \otimes D^2\right) L^8_4 \left(I_4 \otimes D^2\right) L^8_4 \left(I_4 \otimes D^2\right) L^8_4.$$

Figure 5: Omega Network for N = 8

4.6 The Flip Network

The flip network is represented by the permutation $F_j^N$, which maps line $b_{n-1} \cdots b_1 b_0$ at the output of stage $j - 1$ to line $b_0 b_{n-1} \cdots b_1$ at the input of stage $j$. Thus, permutation $F_j^N$ rotates the bit sequence right by one bit. In terms of the tensor product, $F_j^N = L^{2^n}_2$ because

$$L^{2^n}_2 \left(e^2_{i_{n-1}} \otimes \cdots \otimes e^2_{i_1} \otimes e^2_{i_0}\right) = e^2_{i_0} \otimes e^2_{i_{n-1}} \otimes \cdots \otimes e^2_{i_1}.$$

The pre-permutation of the network is $I_N$. The post-permutation of the network is $L^{2^n}_2$. Using the network representation in Eq. 4, the flip network can be represented as

$$F^N = \prod_{i=0}^{n-1} L^{2^n}_2 D^N. \tag{10}$$

A flip network of size $N = 8$ is shown in Fig. 6 and is represented as

$$F^8 = L^8_2 \left(I_4 \otimes D^2\right) L^8_2 \left(I_4 \otimes D^2\right) L^8_2 \left(I_4 \otimes D^2\right).$$

Figure 6: Flip Network for N = 8

The flip network can be shown to be the inverse of the omega network.

Theorem 4.3 $F^N = (\Omega^N)^{-1}$.

Proof:

$$\left(\Omega^N\right)^{-1} = \prod_{i=n-1}^{0} \left(L^{2^n}_{2^{n-1}}\right)^{-1} D^N = \prod_{j=0}^{n-1} L^{2^n}_2 D^N = F^N. \qquad \Box$$

5 Partitionability of Networks

If two independent tasks are to be performed on a multiprocessor system, it is often necessary to split the system into two independent multiprocessor subsystems, each consisting of a set of processors and memory modules connected by an interconnection network. Thus, it is necessary to be able to partition a network into subnetworks [26]. Each subnetwork should have all the interconnection capabilities of a complete network of its size and should be able to function independently. In the following sections, we show that each network of size $N = 2^n$ can be recursively expressed in terms of a network of size $N/2 = 2^{n-1}$. This recursive expression has two implications. First, the network can be recursively built from similar networks of size $N/2^i$, $1 \le i < n$; second, it can be partitioned into subnetworks of size $2^i$, $1 \le i < n$. To partition a network, it may be necessary to apply an additional permutation at the input lines before the first stage and/or at the output lines after the last stage of the network. We now present the partitioning of the networks presented in Section 4.

5.1 Partitioning of the Baseline and the Reverse Baseline Networks

The baseline and reverse baseline networks can be partitioned into their subnetworks. We show the recursive expression of the baseline network in the following theorem.

Theorem 5.1 $B^N = \left(I_2 \otimes B^{N/2}\right) L^N_2 D^N$.

Figure 7: Recursively built and Partitioned Baseline Network for N = 8 (panels (a) and (b))

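Proof:

$$\begin{aligned}
B^N &= \prod_{i=0}^{n-1} \left[\left(I_{2^i} \otimes L^{2^{n-i}}_2\right) \left(I_{2^{n-1}} \otimes D^2\right)\right] \\
&= \left\{\prod_{i=1}^{n-1} \left[\left(I_{2^i} \otimes L^{2^{n-i}}_2\right) \left(I_{2^{n-1}} \otimes D^2\right)\right]\right\} L^{2^n}_2 \left(I_{2^{n-1}} \otimes D^2\right) \\
&= \left\{\prod_{i=0}^{n-2} \left[\left(I_{2^{i+1}} \otimes L^{2^{n-(i+1)}}_2\right) \left(I_{2^{n-1}} \otimes D^2\right)\right]\right\} L^{2^n}_2 \left(I_{2^{n-1}} \otimes D^2\right) \\
&= \left\{I_2 \otimes \prod_{i=0}^{n-2} \left[\left(I_{2^i} \otimes L^{2^{(n-1)-i}}_2\right) \left(I_{2^{(n-1)-1}} \otimes D^2\right)\right]\right\} L^{2^n}_2 \left(I_{2^{n-1}} \otimes D^2\right) \\
&= \left(I_2 \otimes B^{N/2}\right) L^N_2 D^N. \qquad \Box
\end{aligned}$$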
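The recursion of Theorem 5.1 and the resulting partition can be checked for small sizes. The sketch assumes individual stage control, stores a permutation as a destination list, and uses hypothetical function names; `baseline(states)` follows Eq. 5 with `states[i]` as the state of stage `i`.

```python
from functools import reduce
from itertools import product


def identity(n):
    return list(range(n))


def stride(mn, n):
    """L^{mn}_n as a destination list: line i*n + j is sent to line j*m + i."""
    m = mn // n
    return [(b % n) * m + b // n for b in range(mn)]


def tensor(p, q):
    return [p[i] * len(q) + q[j] for i in range(len(p)) for j in range(len(q))]


def compose(p, q):
    """The product P Q: q is applied first, then p."""
    return [p[q[b]] for b in range(len(q))]


def matprod(*perms):
    return reduce(compose, perms)


def baseline(states):
    n = len(states)
    net = identity(1 << n)
    for i, d in enumerate(states):
        perm = tensor(identity(1 << i), stride(1 << (n - i), 2))
        net = matprod(perm, tensor(identity(1 << (n - 1)), d), net)
    return net


I2, S2 = [0, 1], [1, 0]
for n in (2, 3, 4):
    N = 1 << n
    for states in product([I2, S2], repeat=n):
        first, rest = states[0], states[1:]
        # Theorem 5.1: B^N = (I_2 (x) B^{N/2}) L^N_2 D^N.
        recursive = matprod(tensor(I2, baseline(rest)),
                            stride(N, 2),
                            tensor(identity(N // 2), first))
        assert baseline(states) == recursive
    for rest in product([I2, S2], repeat=n - 1):
        # First stage through, extra input permutation L^N_{N/2}: two independent halves.
        partitioned = compose(baseline((I2,) + rest), stride(N, N // 2))
        assert partitioned == tensor(I2, baseline(rest))
        assert all(partitioned[b] < N // 2 for b in range(N // 2))
```

The last assertion states the partition property directly: after the extra input permutation, the lower half of the inputs only ever reaches the lower half of the outputs.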

A baseline network of size $N = 8$ recursively constructed from two baseline networks of size $N = 4$ is shown in Fig. 7(a). The network is partitioned into two networks by setting the switches in the first stage to the through states and by applying a permutation $L^N_{N/2}$ at the input:

$$B^N = \left(I_2 \otimes B^{N/2}\right) L^N_2 \left(I_{N/2} \otimes I_2\right) L^N_{N/2} = I_2 \otimes B^{N/2}.$$

A partitioned network of size $N = 8$ is shown in Fig. 7(b). An additional external permutation $L^8_4$ is applied at the input to partition the network. Each individual subnetwork has been shaded with a different pattern. Using Theorems 4.1 and 5.1, the reverse baseline network can be recursively expressed as in the following theorem.

Theorem 5.2 $R^N = D^N L^N_{N/2} \left(I_2 \otimes R^{N/2}\right)$.

The reverse baseline network can be partitioned into two independent networks by setting the switches in the last stage to the through states and applying the permutation $L^N_2$ at the output:

$$R^N = L^N_2 \left(I_{N/2} \otimes I_2\right) L^N_{N/2} \left(I_2 \otimes R^{N/2}\right) = I_2 \otimes R^{N/2}.$$

A partitioned network of size $N = 8$ is shown in Fig. 8.

Figure 8: Partitioned Reverse Baseline Network for N = 8

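5.2 Partitioning of the Indirect Binary n-Cube and the Generalized Cube Networks

We show the partitionability of the indirect binary n-cube and the generalized cube networks in the theorems below.

Theorem 5.3 $\mathcal{C}^N = \left(D^2 \otimes I_{N/2}\right) \left(I_2 \otimes \mathcal{C}^{N/2}\right)$.

Proof:

$$\begin{aligned}
\mathcal{C}^N &= L^{2^n}_2 \left(I_{2^{n-1}} \otimes D^2\right) \prod_{i=0}^{n-2} \left[\left(I_{2^{n-i-2}} \otimes \left(L^{2^{i+1}}_2 \otimes I_2\right) L^{2^{i+2}}_{2^{i+1}}\right) \left(I_{2^{n-1}} \otimes D^2\right)\right] \\
&= L^{2^n}_2 \left(I_{2^{n-1}} \otimes D^2\right) \left(L^{2^{n-1}}_2 \otimes I_2\right) L^{2^n}_{2^{n-1}} \left(I_{2^{n-1}} \otimes D^2\right) \prod_{i=0}^{n-3} \left[\left(I_{2^{n-i-2}} \otimes \left(L^{2^{i+1}}_2 \otimes I_2\right) L^{2^{i+2}}_{2^{i+1}}\right) \left(I_{2^{n-1}} \otimes D^2\right)\right].
\end{aligned}$$

By Commutation Theorem 2.1, the first term can be rewritten as

$$\begin{aligned}
&L^{2^n}_2 \left(I_{2^{n-1}} \otimes D^2\right) \left(L^{2^{n-1}}_2 \otimes I_2\right) L^{2^n}_{2^{n-1}} \left(I_{2^{n-1}} \otimes D^2\right) \\
&\quad = \left(D^2 \otimes I_{2^{n-1}}\right) L^{2^n}_2 L^{2^n}_{2^{n-1}} \left(I_2 \otimes L^{2^{n-1}}_2\right) \left(I_{2^{n-1}} \otimes D^2\right) \\
&\quad = \left(D^2 \otimes I_{2^{n-1}}\right) \left(I_2 \otimes L^{2^{n-1}}_2\right) \left(I_2 \otimes I_{2^{n-2}} \otimes D^2\right) \\
&\quad = \left(D^2 \otimes I_{2^{n-1}}\right) \left(I_2 \otimes L^{2^{n-1}}_2 \left(I_{2^{n-2}} \otimes D^2\right)\right).
\end{aligned}$$

Next consider the second term,

$$\prod_{i=0}^{n-3} \left[\left(I_{2^{n-i-2}} \otimes \left(L^{2^{i+1}}_2 \otimes I_2\right) L^{2^{i+2}}_{2^{i+1}}\right) \left(I_{2^{n-1}} \otimes D^2\right)\right] = I_2 \otimes \prod_{i=0}^{n-3} \left[\left(I_{2^{n-i-3}} \otimes \left(L^{2^{i+1}}_2 \otimes I_2\right) L^{2^{i+2}}_{2^{i+1}}\right) \left(I_{2^{n-2}} \otimes D^2\right)\right].$$

Combining the two terms, we have

$$\begin{aligned}
\mathcal{C}^N &= \left(D^2 \otimes I_{2^{n-1}}\right) \left(I_2 \otimes \left\{L^{2^{n-1}}_2 \left(I_{2^{n-2}} \otimes D^2\right) \prod_{i=0}^{n-3} \left[\left(I_{2^{n-i-3}} \otimes \left(L^{2^{i+1}}_2 \otimes I_2\right) L^{2^{i+2}}_{2^{i+1}}\right) \left(I_{2^{n-2}} \otimes D^2\right)\right]\right\}\right) \\
&= \left(D^2 \otimes I_{N/2}\right) \left(I_2 \otimes \mathcal{C}^{N/2}\right). \qquad \Box
\end{aligned}$$

The indirect binary n-cube can be partitioned into two independent networks of size $N/2$ by setting the switches in the last switching stage to the through states:

$$\mathcal{C}^N = \left(I_2 \otimes I_{2^{n-1}}\right) \left(I_2 \otimes \mathcal{C}^{N/2}\right) = I_2 \otimes \mathcal{C}^{N/2}.$$

Figure 9: Partitioned Indirect Binary n-cube Network for N = 8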

A partitioned network of size $N = 8$ is shown in Fig. 9. Using Theorems 4.2 and 5.3, the generalized cube network can be recursively expressed as in the following theorem.

Theorem 5.4 $G^N = \left(I_2 \otimes G^{N/2}\right) \left(D^2 \otimes I_{N/2}\right)$.

A generalized cube network of size $N$ can be partitioned into two independent networks of size $N/2$ by setting the switches in the first switching stage to the through state. A partitioned network of size $N = 8$ is shown in Fig. 10.

Figure 10: Partitioned Generalized Cube Network for N = 8

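5.3 Partitioning of the Omega and the Flip Networks

The omega network can be partitioned in two ways as shown below.

Theorem 5.5

$$\begin{aligned}
\Omega^N &= \left(I_{N/2} \otimes D^2\right) L^N_{N/2} \left(I_2 \otimes \Omega^{N/2}\right) L^N_2 && (11) \\
&= \left(I_2 \otimes \Omega^{N/2}\right) L^N_2 \left(I_{N/2} \otimes D^2\right) L^N_{N/2}. && (12)
\end{aligned}$$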

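Proof: Consider the tensor representation for the omega network as given in Eq. 9:

$$\begin{aligned}
\Omega^N &= \prod_{i=0}^{n-1} \left(I_{2^{n-1}} \otimes D^2\right) L^{2^n}_{2^{n-1}} \\
&= \left(I_{2^{n-1}} \otimes D^2\right) L^{2^n}_{2^{n-1}} \left[\prod_{i=0}^{n-2} \left(I_{2^{n-1}} \otimes D^2\right) L^{2^n}_{2^{n-1}}\right] && (13) \\
&= \left[\prod_{i=0}^{n-2} \left(I_{2^{n-1}} \otimes D^2\right) L^{2^n}_{2^{n-1}}\right] \left(I_{2^{n-1}} \otimes D^2\right) L^{2^n}_{2^{n-1}}. && (14)
\end{aligned}$$

Consider the term $\prod_{i=0}^{n-2} \left(I_{2^{n-1}} \otimes D^2\right) L^{2^n}_{2^{n-1}}$. By Commutation Theorem 2.1,

$$\begin{aligned}
\prod_{i=0}^{n-2} \left(I_{2^{n-1}} \otimes D^2\right) L^{2^n}_{2^{n-1}} &= \prod_{i=0}^{n-2} \left[L^{2^n}_{2^{n-i-2}} \left(I_{2^{i+1}} \otimes D^2 \otimes I_{2^{n-i-2}}\right) L^{2^n}_{2^{i+2}} L^{2^n}_{2^{n-1}}\right] \\
&= \prod_{i=0}^{n-2} \left[L^{2^n}_{2^{n-i-2}} \left(I_{2^{i+1}} \otimes D^2 \otimes I_{2^{n-i-2}}\right) L^{2^n}_{2^{i+1}}\right].
\end{aligned}$$

Noting that for $i = 0$, $L^{2^n}_{2^{i+1}} = L^{2^n}_2$, for $i = n - 2$, $L^{2^n}_{2^{n-i-2}} = I_{2^n}$, and the product of two consecutive stride terms has the result $L^{2^n}_{2^{(i+1)+1}} L^{2^n}_{2^{n-i-2}} = L^{2^n}_{2^n} = I_{2^n}$, we have

$$\begin{aligned}
\prod_{i=0}^{n-2} \left(I_{2^{n-1}} \otimes D^2\right) L^{2^n}_{2^{n-1}} &= \left[\prod_{i=0}^{n-2} \left(I_{2^{i+1}} \otimes D^2 \otimes I_{2^{n-i-2}}\right)\right] L^{2^n}_2 \\
&= \left[I_2 \otimes \prod_{i=0}^{n-2} \left(I_{2^i} \otimes D^2 \otimes I_{2^{n-i-2}}\right)\right] L^{2^n}_2 \\
&= \left[I_2 \otimes \prod_{i=0}^{n-2} \left(L^{2^{n-1}}_{2^{i+1}} \left(I_{2^{n-i-2}} \otimes I_{2^i} \otimes D^2\right) L^{2^{n-1}}_{2^{n-i-2}}\right)\right] L^{2^n}_2 \\
&= \left[I_2 \otimes \prod_{i=0}^{n-2} \left(\left(I_{2^{n-2}} \otimes D^2\right) L^{2^{n-1}}_{2^{n-2}}\right)\right] L^{2^n}_2 \\
&= \left(I_2 \otimes \Omega^{N/2}\right) L^{2^n}_2. && (15)
\end{aligned}$$

Substituting Eq. 15 in Eq. 13 and Eq. 14, we have

$$\begin{aligned}
\Omega^N &= \left(I_{2^{n-1}} \otimes D^2\right) L^{2^n}_{2^{n-1}} \left(I_2 \otimes \Omega^{N/2}\right) L^{2^n}_2 \\
&= \left(I_2 \otimes \Omega^{N/2}\right) L^{2^n}_2 \left(I_{2^{n-1}} \otimes D^2\right) L^{2^n}_{2^{n-1}}. \qquad \Box
\end{aligned}$$

Figure 11: Partitioned Omega Network for N = 8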
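Both decompositions of Theorem 5.5 can be checked for small sizes. The sketch assumes individual stage control, stores a permutation as a destination list, and uses hypothetical function names; `omega(states)` follows Eq. 9 with `states[i]` as the state of stage `i`. In Eq. 11 the separated stage is the last one, in Eq. 12 it is the first one.

```python
from functools import reduce
from itertools import product


def identity(n):
    return list(range(n))


def stride(mn, n):
    """L^{mn}_n as a destination list: line i*n + j is sent to line j*m + i."""
    m = mn // n
    return [(b % n) * m + b // n for b in range(mn)]


def tensor(p, q):
    return [p[i] * len(q) + q[j] for i in range(len(p)) for j in range(len(q))]


def compose(p, q):
    """The product P Q: q is applied first, then p."""
    return [p[q[b]] for b in range(len(q))]


def matprod(*perms):
    return reduce(compose, perms)


def stage(N, d):
    """I_{N/2} (x) D^2."""
    return tensor(identity(N // 2), d)


def omega(states):
    N = 1 << len(states)
    net = identity(N)
    for d in states:
        net = matprod(stage(N, d), stride(N, N // 2), net)
    return net


I2, S2 = [0, 1], [1, 0]
for n in (2, 3, 4):
    N = 1 << n
    for states in product([I2, S2], repeat=n):
        eq11 = matprod(stage(N, states[-1]), stride(N, N // 2),
                       tensor(I2, omega(states[:-1])), stride(N, 2))
        eq12 = matprod(tensor(I2, omega(states[1:])), stride(N, 2),
                       stage(N, states[0]), stride(N, N // 2))
        assert omega(states) == eq11 == eq12

# Eq. 12 with the first stage through already yields two independent halves.
assert omega((I2, S2, I2)) == tensor(I2, omega((S2, I2)))
```

The final assertion illustrates why the partition based on Eq. 12 needs no additional external permutation: with the first stage in the through state, $L^N_2 L^N_{N/2}$ cancels to the identity.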

A partitioned omega network corresponding to Eq. 11 is obtained by setting the switches in the last switching stage to the through states, applying an additional permutation $L^N_{N/2}$ at the input, and applying an additional permutation $L^N_2$ at the output. A partitioning corresponding to Eq. 11 is shown in Fig. 11. A partitioned omega network corresponding to Eq. 12 is obtained by setting the switches in the first switching stage to the through states. Again, the partitionability of the flip network follows from Theorems 4.3 and 5.5.

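Theorem 5.6

\begin{align}
F_N &= L^N_{N/2}\, (I_2 \otimes F_{N/2})\, L^N_2\, (I_{N/2} \otimes D_2) \tag{16} \\
    &= L^N_2\, (I_{N/2} \otimes D_2)\, L^N_{N/2}\, (I_2 \otimes F_{N/2}). \tag{17}
\end{align}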

A partitioned flip network corresponding to Eq. 16 is obtained by setting the switches in the first stage to the through state, applying an additional permutation $L^N_{N/2}$ at the input, and applying an additional permutation $L^N_2$ at the output. A partitioned flip network corresponding to Eq. 17 is obtained by setting the switches in the last stage to the through state. A partitioning corresponding to Eq. 17 is shown in Fig. 12.
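As a worked step, Eq. 16 can be recovered from Eq. 11 by inversion. The sketch assumes that Theorem 4.3 (not reproduced in this section) identifies the flip network as the inverse of the omega network, $F_N = \Omega_N^{-1}$. It also uses $(L^N_2)^{-1} = L^N_{N/2}$ and the fact that the inverse of a switch setting $D_2$ is again a switch setting, so it is still written $D_2$:

```latex
\begin{align*}
F_N = \Omega_N^{-1}
&= \left[ (I_{N/2} \otimes D_2)\, L^N_{N/2}\, (I_2 \otimes \Omega_{N/2})\, L^N_2 \right]^{-1} \\
&= (L^N_2)^{-1}\, (I_2 \otimes \Omega_{N/2}^{-1})\, (L^N_{N/2})^{-1}\, (I_{N/2} \otimes D_2^{-1}) \\
&= L^N_{N/2}\, (I_2 \otimes F_{N/2})\, L^N_2\, (I_{N/2} \otimes D_2).
\end{align*}
```

Inverting Eq. 12 in the same way gives Eq. 17.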

Figure 12: Partitioned Flip Network for N = 8

6 Topological Equivalence of Networks

Many multistage interconnection networks that have been proposed and studied in the literature appear different yet share several properties. Topological equivalence is one means of establishing equivalence between two networks. Two networks are topologically equivalent if one can be obtained from the other by permuting switches within a switching stage. Wu and Feng have proved the topological equivalence of a class of multistage interconnection networks [32]. To specify topological equivalence, every line in the network is assigned a physical name and a logical name. The topological equivalence between two networks is shown by renumbering the lines at the input and the output of the switching stages in one of the networks. The output lines of the network are also renumbered. The new number of a line after the renumbering is its logical name. The permutation $\rho_i^N$, called the renumbering function, maps the physical names at stage $i$ of a network to the corresponding logical names. The permutation $\rho_i^N$ is of the form $\rho'_i \otimes I_2$, $0 \le i < n$, because it also permutes the switching elements within a stage. The permutation $\rho_n^N$ corresponds to the renumbering of the output lines. Two networks $N_1$ and $N_2$ are topologically equivalent if the lines of network $N_1$ can be renumbered so that the permutations on the logical names of network $N_1$ are identical to the permutations on the physical names of network $N_2$.

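Definition 6.1 Let $N_{1,i}$ and $N_{2,i}$, $0 \le i < n+1$, be the topologies of networks $N_1$ and $N_2$. $N_1$ and $N_2$ are topologically equivalent if there exists a set of permutations $\rho_i^N$, $0 \le i < n+1$, of the form $\rho'_i \otimes I_2$ for $0 \le i < n$, such that

\begin{align}
N_{1,i} = \left(\rho_i^N\right)^{-1} N_{2,i}\; \rho_{i-1}^N. \tag{18}
\end{align}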

We note that topological equivalence is an equivalence relation. To prove topological equivalence of two networks, we can construct the renumbering function as given below.

Figure 13: Indirect binary n-cube network with logical names (a) and baseline network with physical names (b)

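Proposition 6.1 Two networks $N_1$ and $N_2$ with topologies $N_{1,i}$ and $N_{2,i}$, $0 \le i < n+1$, are topologically equivalent if the renumbering function $\rho_i^N$, $0 \le i < n+1$, defined as

\begin{align*}
\rho_i^N = \left( \prod_{j=0}^{i} N_{2,j} \right) \left( \prod_{j=i}^{0} \left(N_{1,j}\right)^{-1} \right)
\end{align*}

has the form $\rho'_i \otimes I_2$ for $0 \le i < n$, and $\rho_n^N$ is an arbitrary permutation.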

Theorem 6.1 The baseline, the reverse baseline, the indirect binary n-cube, the generalized cube, the omega, and the flip networks are topologically equivalent.

Using Proposition 6.1, the renumbering functions for each pair of networks in the above class can be derived. As an example, we show topological equivalence of the baseline network and the indirect binary n-cube network of size $N = 8$. For the baseline network, we have $B_0^8 = I_8$, $B_1^8 = L^8_2$, $B_2^8 = I_2 \otimes L^4_2$, and $B_3^8 = I_8$. For the indirect binary n-cube network, we have $(C_0^8)^{-1} = I_8$, $(C_1^8)^{-1} = I_2 \otimes L^4_2$, $(C_2^8)^{-1} = L^8_2 (L^4_2 \otimes I_2)$, and $(C_3^8)^{-1} = L^8_4$. We obtain the renumbering functions:

\begin{align*}
\rho_0^8 &= I_8\, (I_8)^{-1} = I_4 \otimes I_2, \\
\rho_1^8 &= L^8_2\, (I_2 \otimes L^4_2) = L^4_2 \otimes I_2, \\
\rho_2^8 &= (I_2 \otimes L^4_2)(L^4_2 \otimes I_2)\, L^8_2\, (L^4_2 \otimes I_2) = L^4_2 \otimes I_2, \\
\rho_3^8 &= (L^4_2 \otimes I_2)\, L^8_4.
\end{align*}

The indirect binary n-cube network with the logical names is shown in Fig. 13(a). The permutations on the logical line numbers of the indirect binary n-cube network are identical to the permutations on the physical line numbers of the baseline network in Fig. 13(b).
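The renumbering functions of this example can be recomputed mechanically from Proposition 6.1. The sketch below is illustrative only: the list names `B` and `C_INV`, the helper `is_tensor_with_i2`, and the convention that the element at position `p` moves to `perm[p]` are assumptions introduced here, and the stage permutations are the ones quoted in the example for $N = 8$.

```python
from functools import reduce

def stride(n, s):
    """Stride permutation L^n_s: the element at position p moves to perm[p]."""
    m = n // s
    return [(p % s) * m + (p // s) for p in range(n)]

def identity(n):
    return list(range(n))

def compose(*perms):
    """Product of permutations; the rightmost factor acts first."""
    return reduce(lambda p, q: [p[q[x]] for x in range(len(q))], perms)

def tensor(p, q):
    """Tensor (Kronecker) product of two permutations."""
    return [p[x // len(q)] * len(q) + q[x % len(q)] for x in range(len(p) * len(q))]

L42, L82, L84 = stride(4, 2), stride(8, 2), stride(8, 4)
I2, I8 = identity(2), identity(8)

# Baseline topology B_i and inverse indirect binary n-cube topology (C_i)^{-1}.
B = [I8, L82, tensor(I2, L42), I8]
C_INV = [I8, tensor(I2, L42), compose(L82, tensor(L42, I2)), L84]

def renumbering(i):
    """rho_i = (B_i ... B_0) (C_0^{-1} ... C_i^{-1}), as in Proposition 6.1."""
    forward = compose(*[B[j] for j in range(i, -1, -1)])
    backward = compose(*[C_INV[j] for j in range(i + 1)])
    return compose(forward, backward)

def is_tensor_with_i2(perm):
    """True if perm has the form rho' (x) I_2, i.e. it moves whole switches."""
    return all(
        perm[2 * k] % 2 == 0 and perm[2 * k + 1] == perm[2 * k] + 1
        for k in range(len(perm) // 2)
    )

rho = [renumbering(i) for i in range(4)]
for i in range(3):
    print(i, rho[i], is_tensor_with_i2(rho[i]))
```

For stages 0 to 2 the form check succeeds, which is the condition the proposition asks for; the output renumbering $\rho_3^8$ is allowed to be arbitrary.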

7 Algorithm Mapping: An Example

In this section, we illustrate the use of the tensor product notation to specify the mapping of matrix transposition onto multistage interconnection networks.

Consider the transposition of a square matrix $A$ of size $16 \times 16$. The indices of matrix $A$ when stored in row-major order can be represented by the tensor basis $e_i^{16} \otimes e_j^{16}$, $0 \le i, j < 16$. Transposition of matrix $A$ corresponds to the stride permutation $L^{256}_{16}$. Let matrix $A$ be stored on an eight-processor system by rows such that row $i$ is located on processor $\lfloor i/2 \rfloor$, i.e., each processor has 2 consecutive rows of the matrix in its local memory. After matrix transposition, each processor contains two columns of $A$ stored in column-major order. Mapping this transposition to an eight-processor system requires further factorization of $L^{256}_{16}$ to:

\begin{align*}
L^{256}_{16} = \prod_{i=0}^{3} (L^8_2 \otimes I_{32})(I_4 \otimes L^4_2 \otimes I_{16})(I_8 \otimes L^{32}_2).
\end{align*}
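The factorization can be confirmed by multiplying the permutations out. This is a small check under the same hypothetical list representation used for illustration (the element at position `p` moves to `perm[p]`; helper names are not from the paper). Each of the four identical factors turns out to equal $L^{256}_2$, and four of them compose to $L^{256}_{16}$.

```python
from functools import reduce

def stride(n, s):
    """Stride permutation L^n_s: the element at position p moves to perm[p]."""
    m = n // s
    return [(p % s) * m + (p // s) for p in range(n)]

def identity(n):
    return list(range(n))

def compose(*perms):
    """Product of permutations; the rightmost factor acts first."""
    return reduce(lambda p, q: [p[q[x]] for x in range(len(q))], perms)

def tensor(*perms):
    """Tensor (Kronecker) product of any number of permutations."""
    def two(p, q):
        return [p[x // len(q)] * len(q) + q[x % len(q)] for x in range(len(p) * len(q))]
    return reduce(two, perms)

# One iteration of the product: (L^8_2 (x) I_32)(I_4 (x) L^4_2 (x) I_16)(I_8 (x) L^32_2).
step = compose(
    tensor(stride(8, 2), identity(32)),
    tensor(identity(4), stride(4, 2), identity(16)),
    tensor(identity(8), stride(32, 2)),
)
total = compose(step, step, step, step)

print(step == stride(256, 2))
print(total == stride(256, 16))
```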

Given each factor in the above factorization, we can determine the local computation and global communication required for that factor. A factor of the form $P_n \otimes I_d$ is interpreted as a permutation $P$ of data blocks of size $d$ performed on a network of size $n$. A factor of the form $(I_n \otimes Q_d)$ is interpreted as computation corresponding to $Q$ performed on $n$ processors in parallel. Examining each of the factors, we have

• $I_8 \otimes L^{32}_2$: corresponds to all 8 processors performing the permutation $L^{32}_2$ on the blocks of 32 elements stored in their local memory. The operation is performed in parallel on all the processors.

• $I_4 \otimes L^4_2 \otimes I_{16}$: represents processor-to-processor communication. The processors are split into 4 partitions, each consisting of 2 consecutive processors. Processors within a partition exchange data blocks of 16 elements with each other. This communication is performed in parallel in each of the 4 partitions. The indexing information for the data blocks to be communicated is obtained by observing the effects of the permutation $L^4_2 \otimes I_{16}$ on a block of 64 elements (stored on 2 processors in the same partition). The effect of this permutation is to exchange the elements $A(1, 0:15)$ from the first processor with elements $A(0, 0:15)$ from the second processor. This effect can be achieved by performing the permutation $I_4 \otimes S_2$. The interconnection network used should be capable of performing the above permutation.

• $L^8_2 \otimes I_{32}$: represents processor-to-processor communication in which the interconnection network performs the permutation $L^8_2$ on data blocks of 32 elements. The network should be able to perform $L^8_2$ efficiently.

The pseudo-code for a node program is shown in Fig. 14. Each processor contains two rows of array $A$ in $A_{loc}(0:1, 0:15)$. The send and receive commands correspond to non-local memory access. The sends are non-blocking and the receives are blocking. Note that the operations in the tensor product formulas are applied from right to left. Variable proc_id denotes the processor number.

A network is selected depending on its ability to perform $L^8_2$ and $I_4 \otimes S_2$. Consider the indirect binary n-cube network, the omega network, and the baseline network. The indirect binary n-cube performs permutation $I_4 \otimes S_2$ in one pass and permutation $L^8_2$ in two passes. The omega network performs permutation $I_4 \otimes S_2$ in one pass and permutation $L^8_2$ in three passes. The baseline network performs permutation $I_4 \otimes S_2$ in two passes and permutation $L^8_2$ in two passes. Thus the matrix transposition can be performed on any one of the above networks.

    do i = 0, 3
        /* computation for I_8 (x) L^32_2 */
        A_loc(0:1, 0:15) = L^32_2 (A_loc(0:1, 0:15))

        /* communication for I_4 (x) L^4_2 (x) I_16 */
        if (proc_id = 2*j) then
            Send(A_loc(1, 0:15), proc(2*j + 1))
            Recv(A_loc(1, 0:15), proc(2*j + 1))
        elseif (proc_id = 2*j + 1) then
            Send(A_loc(0, 0:15), proc(2*j))
            Recv(A_loc(0, 0:15), proc(2*j))
        endif

        /* communication for L^8_2 */
        Send(A_loc(0:1, 0:15), proc(L^8_2(proc_id)))
        Recv(A_loc(0:1, 0:15), proc(L^8_4(proc_id)))
    enddo

Figure 14: Matrix Transposition
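The node program can be exercised with a sequential simulation of the eight processors. The sketch below is an assumption-laden illustration rather than the authors' implementation: sends and receives are replaced by direct list assignments, each processor's local memory is a flat list of 32 elements in row-major order, and the function name `transpose_on_8_procs` is hypothetical.

```python
def stride(n, s):
    """Stride permutation L^n_s: the element at position p moves to perm[p]."""
    m = n // s
    return [(p % s) * m + (p // s) for p in range(n)]

def transpose_on_8_procs(a):
    """Transpose a 16x16 matrix (list of rows) using the three-step loop of Fig. 14."""
    # Processor p initially holds rows 2p and 2p+1.
    loc = [a[2 * p] + a[2 * p + 1] for p in range(8)]
    l32, l8 = stride(32, 2), stride(8, 2)
    for _ in range(4):
        # I_8 (x) L^32_2: local permutation on every processor.
        for p in range(8):
            new = [None] * 32
            for x in range(32):
                new[l32[x]] = loc[p][x]
            loc[p] = new
        # I_4 (x) L^4_2 (x) I_16: within each pair, the even processor's local
        # row 1 is exchanged with the odd processor's local row 0.
        for j in range(4):
            even, odd = loc[2 * j], loc[2 * j + 1]
            even[16:], odd[:16] = odd[:16], even[16:]
        # L^8_2 (x) I_32: processor p sends its whole block to processor L^8_2(p).
        moved = [None] * 8
        for p in range(8):
            moved[l8[p]] = loc[p]
        loc = moved
    # Row r of the result lives on processor r // 2, local row r % 2.
    return [loc[r // 2][16 * (r % 2): 16 * (r % 2) + 16] for r in range(16)]

a = [[16 * i + j for j in range(16)] for i in range(16)]
t = transpose_on_8_procs(a)
print(t[3][5] == a[5][3])
```

Replacing the direct assignments with message passing would require choosing a communication library, which the paper does not specify.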

8 Conclusions

In this paper, we have used an algebraic theory based on tensor products for specifying and verifying the properties of multistage interconnection networks. We have shown this notation to be useful in proving network partitioning and topological equivalence. We believe that this notation can be further used to prove other properties of this class of interconnection networks, such as functional equivalence and the universality of a particular network, and to obtain routing algorithms for interconnection networks. The tensor product representation of multistage interconnection networks can be used in mapping various algorithms onto specific parallel architectures. We have presented an example of mapping matrix transposition onto multistage networks. Strategies for automatic program generation for a particular architecture using the tensor product formulation are currently under study. Tensor product representations of direct networks such as the ring, the mesh, and the hypercube have also been developed [18], and the process of mapping algorithms onto these networks is currently under investigation.

Acknowledgments

We would like to express our appreciation to J. R. Johnson, R. W. Johnson, and P. Sadayappan for sharing their ideas concerning properties of tensor products and multistage interconnection networks.

References

[1] D. P. Agrawal. Graph theoretical analysis and design of multistage interconnection networks. IEEE Transactions on Computers, C-32(7):637–648, 1983.
[2] K. E. Batcher. The flip network in STARAN. In International Conference on Parallel Processing, pages 65–71, 1976.
[3] K. E. Batcher. Design of a massively parallel processor. IEEE Transactions on Computers, C-29(9):836–844, 1980.
[4] K. E. Batcher. Bit serial parallel processing systems. IEEE Transactions on Computers, C-31(5):377–384, 1982.
[5] V. E. Benes. Proving the rearrangeability of connecting networks by group calculations. The Bell System Technical Journal, 54:421–434, 1975.
[6] V. E. Benes. Towards a group-theoretic proof of the rearrangeability theorem for Clos network. The Bell System Technical Journal, 55:797–805, 1975.
[7] V. E. Benes. Applications of group theory to connecting networks. The Bell System Technical Journal, 54:407–420, 1975.
[8] Marc Davio. Kronecker products and shuffle algebra. IEEE Transactions on Computers, C-30(2):116–125, 1981.
[9] T. Feng. Data manipulating functions in parallel processors and their implementations. IEEE Transactions on Computers, C-23(3):309–318, 1974.
[10] T. Feng. A survey of interconnection networks. IEEE Transactions on Computers, C-30(12):12–27, 1981.
[11] A. Graham. Kronecker Products and Matrix Calculus: With Applications. Ellis Horwood Limited, 1981.
[12] R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press, Cambridge, 1991.
[13] C.-H. Huang, J. R. Johnson, and R. W. Johnson. A tensor product formulation of Strassen's matrix multiplication algorithm. Appl. Math Letters, 3(3):67–71, 1990.
[14] C.-H. Huang, J. R. Johnson, and R. W. Johnson. Generating parallel programs from tensor product formulas: a case study of Strassen's matrix multiplication algorithm. In International Conference on Parallel Processing, pages 104–108, 1992.
[15] J. R. Johnson, C.-H. Huang, and R. W. Johnson. Tensor permutations and block matrix allocation. In Second International Workshop on Array Structures (ATABLE-92), 1992. To appear.
[16] J. R. Johnson, R. W. Johnson, D. Rodriguez, and R. Tolimieri. A methodology for designing, modifying and implementing Fourier transform algorithms on various architectures. Circuits Systems Signal Process, 9(4):450–500, 1990.
[17] R. W. Johnson, C.-H. Huang, and J. R. Johnson. Multilinear algebra and parallel programming. Journal of Supercomputing, 5:189–218, 1991.
[18] S. D. Kaushik, S. Sharma, C.-H. Huang, J. R. Johnson, R. W. Johnson, and P. Sadayappan. An algebraic theory for modelling direct interconnection networks. In Supercomputing '92, pages 488–497, 1992.
[19] T. Lang. Interconnections between processors and memory modules using shuffle-exchange networks. IEEE Transactions on Computers, C-25(5):496–503, 1976.
[20] D. K. Lawrie. Access and alignment of data in an array processor. IEEE Transactions on Computers, C-24(12):1145–1155, 1975.
[21] D. S. Parker. Notes on shuffle/exchange-type switching networks. IEEE Transactions on Computers, C-29(3):213–222, 1980.
[22] M. C. Pease III. The indirect binary n-cube microprocessor array. IEEE Transactions on Computers, C-26(5):458–473, 1977.
[23] D. K. Pradhan and K. L. Kodandapani. A uniform representation of single- and multistage interconnection networks used in SIMD machines. IEEE Transactions on Computers, C-29(9):777–791, 1980.
[24] H. J. Siegel. Interconnection networks for SIMD machines. IEEE Computer, pages 57–65, 1979.
[25] H. J. Siegel. A model of SIMD machines and a comparison of various interconnection networks. IEEE Transactions on Computers, C-28(12):907–917, 1979.
[26] H. J. Siegel. The theory underlying the partitioning of permutation networks. IEEE Transactions on Computers, C-29(9):791–800, 1980.
[27] H. J. Siegel. Using the multistage cube network topology in parallel computers. Proceedings of IEEE, 77(12):1932–1953, 1989.
[28] H. J. Siegel. Interconnection Networks for Large Scale Parallel Processing: Theory and Case Studies. McGraw-Hill, 1990.
[29] H. J. Siegel, W. T. Hsu, and M. Jeng. An introduction to the multistage cube family of interconnection networks. Journal of Supercomputing, 1:13–42, 1987.
[30] H. S. Stone. Parallel processing with the perfect shuffle. IEEE Transactions on Computers, C-20(2):153–161, 1971.
[31] C. Van Loan. Computational Framework for the Fast Fourier Transform. SIAM, 1992.
[32] C.-L. Wu and T. Feng. On a class of multistage interconnection networks. IEEE Transactions on Computers, C-29(8):694–702, 1980.
[33] C.-L. Wu and T. Feng. The reverse exchange interconnection network. IEEE Transactions on Computers, C-29(9):801–810, 1980.
[34] C.-L. Wu and T. Feng. The universality of the shuffle-exchange network. IEEE Transactions on Computers, C-30(5):324–332, 1981.