Lesson 24: Distributed Matrix Multiply

Matrix Multiply: Basic Definitions

C ← C + A·B: each output element is the dot product of a row of A and a column of B, with the sum accumulated into the output.

Matrix multiply as pseudocode:

    for i ← 1 to m do
        for j ← 1 to n do
            for l ← 1 to k do
                C[i,j] ← C[i,j] + A[i,l] · B[l,j]

The time to complete the algorithm is:

    T(m,n,k) = O(mnk)  →  T(n) = O(n³) when m = n = k

Matrix multiply as parallel pseudocode:

    parfor i ← 1 to m do
        parfor j ← 1 to n do
            for l ← 1 to k do
                C[i,j] ← C[i,j] + A[i,l] · B[l,j]

Each 'row' and 'column' here could actually represent a submatrix. The third loop is a reduction, so it can be parallelized as well:

    parfor i ← 1 to m do
        parfor j ← 1 to n do
            let T[1:k] = temp array
            parfor l ← 1 to k do
                T[l] ← A[i,l] · B[l,j]
            C[i,j] ← C[i,j] + reduce(T[:])

    W(n) = O(n³)
    D(n) = O(log n)

(Both formulations are sketched in runnable code after this section.)

A Geometric View

Think of an n×n×n cube of index triples (i, l, j); the rows and columns can be projected onto the cube. The three matrices are areas on the x, y, and z planes, i.e., the shadows of a region I inside the cube. A multiplication A[i,l]·B[l,j] must be performed exactly where the three projections intersect, so the resulting volume I is the set of multiplications that need to be done.
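As a concrete illustration, here is a minimal Python sketch of the two formulations above: the sequential triple loop, and the reduction view of the inner loop. The function names and the numpy usage are this note's additions, not part of the lesson.

    import numpy as np

    def matmul_loops(A, B, C):
        # C <- C + A*B via the sequential triple loop from the pseudocode.
        m, k = A.shape
        k2, n = B.shape
        assert k == k2
        for i in range(m):
            for j in range(n):
                for l in range(k):
                    C[i, j] += A[i, l] * B[l, j]
        return C

    def matmul_reduce(A, B, C):
        # Same computation, with the innermost loop written as k independent
        # multiplies followed by a reduction, mirroring the parallel pseudocode:
        # parfor l: T[l] = A[i,l]*B[l,j], then C[i,j] += reduce(T).
        m, k = A.shape
        _, n = B.shape
        for i in range(m):
            for j in range(n):
                T = A[i, :] * B[:, j]   # the temp array T[1:k]
                C[i, j] += T.sum()      # a tree reduction has O(log k) depth
        return C

    # Check both against numpy's own matmul.
    rng = np.random.default_rng(0)
    A, B = rng.random((4, 5)), rng.random((5, 3))
    assert np.allclose(matmul_loops(A, B, np.zeros((4, 3))), A @ B)
    assert np.allclose(matmul_reduce(A, B, np.zeros((4, 3))), A @ B)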
According to Loomis and Whitney, the volume of I is bounded by the areas of its three shadows:

    |I| ≤ √(|S_A| · |S_B| · |S_C|)

where S_A, S_B, S_C are the projections of I onto the three faces. Recall SUMMA's storage: each node holds its three local blocks (3·n²/P) plus the two broadcast strips (2·s·n/√P), which exceeds 4·n²/P once s > (1/2)·n/√P. A smaller s increases latency, since it means more broadcast rounds; if s is at its maximum value n/√P, the SUMMA algorithm might need 5·n²/P words of storage, five times the size of one local block. (A small simulated SUMMA run appears after the derivation below.)

A Lower Bound on Communication

Goal: lower-bound the number of words a node MUST communicate, for any schedule. Divide the node's execution into phases in which it sends and receives exactly M words, where M is the local memory size (the final, partial phase may involve fewer). Let S_A, S_B, S_C now denote the sets of unique elements of each matrix the node sees in one phase; each contains at most 2M elements (M already resident plus M communicated). By Loomis-Whitney:

    max # multiplies per phase ≤ √(|S_A| · |S_B| · |S_C|) ≤ √(2M · 2M · 2M) = 2√2 · M^(3/2)

    L ≥ # full phases ≥ ⌊W / (max # multiplies per phase)⌋
    # full phases ≥ ⌊W / (2√2 · M^(3/2))⌋

    # words communicated by 1 node ≥ (# full phases) · M ≈ W / (2√2 · √M)

With the multiplies spread evenly, W = n³/P per node, and with M = O(n²/P) this gives

    # words communicated by 1 node = Ω(n²/√P)
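To make the SUMMA storage and traffic discussion concrete, here is a minimal single-process Python sketch that simulates SUMMA's panel broadcasts on a √P×√P grid and counts the words each simulated node receives. The word count models one pipelined (non-tree) broadcast per panel, an assumption of this sketch; it comes out to 2·n²/√P per node, matching the β term of the lower bound.

    import numpy as np

    def summa_sim(A, B, P, s):
        # Simulate SUMMA for n x n matrices on a sqrt(P) x sqrt(P) grid with
        # strip width s. Assumes sqrt(P) is an integer and that sqrt(P) and s
        # both divide n. Returns C and the broadcast words received per node.
        n = A.shape[0]
        q = int(round(np.sqrt(P)))
        b = n // q                       # each local block is b x b
        C = np.zeros((n, n))
        words_received = 0
        for l in range(0, n, s):         # n/s outer-product steps
            Apanel = A[:, l:l+s]         # column panel of A, broadcast along rows
            Bpanel = B[l:l+s, :]         # row panel of B, broadcast along columns
            # Each node receives a b x s piece of Apanel and an s x b piece
            # of Bpanel (pipelined broadcast modeled; no log factor).
            words_received += 2 * s * b
            C += Apanel @ Bpanel         # each node computes its b x b share
        return C, words_received

    n, P, s = 8, 4, 2
    rng = np.random.default_rng(1)
    A, B = rng.random((n, n)), rng.random((n, n))
    C, words = summa_sim(A, B, P, s)
    assert np.allclose(C, A @ B)
    print(words)   # 2*n*b = 2*n^2/sqrt(P) = 64 words per node here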
Putting the bound in terms of network time, and comparing with SUMMA using tree-based broadcasts:

    T_net(n; P) = Ω(α·√P + β·n²/√P)

    T_SUMMA,net(n; P, s) = α·(n/s)·log P + β·(n²/√P)·log P

So SUMMA matches the communication lower bound to within log P factors.
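The trade-off in s is easy to see by evaluating both formulas as a cost model. This is an illustrative sketch only: α, β, and the problem sizes below are made-up values, and logs are taken base 2.

    import math

    def t_net_lower_bound(n, P, alpha, beta):
        # Omega(alpha*sqrt(P) + beta*n^2/sqrt(P)), constants dropped.
        return alpha * math.sqrt(P) + beta * n**2 / math.sqrt(P)

    def t_summa_net(n, P, s, alpha, beta):
        # alpha*(n/s)*log P + beta*(n^2/sqrt(P))*log P  (tree-based broadcast)
        return (alpha * (n / s) + beta * n**2 / math.sqrt(P)) * math.log2(P)

    alpha, beta = 1.0, 0.01                     # made-up machine parameters
    n, P = 4096, 64
    for s in (1, 64, n // int(math.sqrt(P))):   # up to the maximum s = n/sqrt(P)
        print(f"s={s:4d}  T_SUMMA={t_summa_net(n, P, s, alpha, beta):12.1f}"
              f"  lower bound={t_net_lower_bound(n, P, alpha, beta):10.1f}")
    # Larger s shrinks only the alpha (latency) term; the beta term is fixed,
    # and larger s costs more strip storage, as noted above.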