Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices

M. L. Overton^1 and R. S. Womersley^2

June 1991. Submitted to Math Programming.

Abstract: This paper gives max characterizations for the sum of the largest eigenvalues of a symmetric matrix. The elements which achieve the maximum provide a concise characterization of the generalized gradient of the eigenvalue sum in terms of a dual matrix. The dual matrix provides the information required either to verify first-order optimality conditions at a point or to generate a descent direction for the eigenvalue sum from that point, splitting a multiple eigenvalue if necessary. A model minimization algorithm is outlined, and connections with the classical literature on sums of eigenvalues are explained. Sums of the largest eigenvalues in absolute value are also addressed.

Key words: symmetric matrix, maximum eigenvalues, spectral radius, minimax problem, max characterization, generalized gradient.

Short title: Minimizing sums of eigenvalues

1. Courant Institute of Mathematical Sciences, New York University, New York, New York, 10012. The work of this author was supported by the National Science Foundation under grant CCR-88-02408.
2. School of Mathematics, University of New South Wales, P.O. Box 1, Kensington, N.S.W. 2033, Australia. The work of this author was supported in part during a visit to Argonne National Laboratory by the Applied Mathematical Sciences subprogram of the Office of Energy Research of the U.S. Department of Energy under contract W-31-109-Eng-38, and in part during a visit to the Courant Institute by the U.S. Department of Energy under contract DE-FG02-88ER25053.

1 Introduction

Let A be an n by n real symmetric matrix, and let $\kappa \in \{1,\dots,n\}$. Denote the eigenvalues of A by $\lambda_1,\dots,\lambda_n$ and also by $\mu_1,\dots,\mu_n$, the difference being that the former are ordered by
$$ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n, \qquad (1.1) $$
while the latter are ordered by
$$ |\mu_1| \ge |\mu_2| \ge \cdots \ge |\mu_n|. \qquad (1.2) $$
Define
$$ f_\kappa(A) = \sum_{i=1}^{\kappa} \lambda_i \qquad (1.3) $$
and
$$ g_\kappa(A) = \sum_{i=1}^{\kappa} |\mu_i|. \qquad (1.4) $$

Note that $f_1(A)$ is the largest eigenvalue of A, while $g_1(A)$ is the spectral radius of A (its largest eigenvalue in absolute value). After establishing some further notation in Section 2, the main results of the paper are given in Sections 3 and 4 for the functions $f_\kappa(A)$ and $g_\kappa(A)$ respectively. Max characterizations of these functions are established in terms of the Frobenius inner product $\langle A, B \rangle = \mathrm{tr}(AB^T)$ for $A, B \in \mathbb{R}^{m \times n}$ (where tr(A) denotes the trace of A), and sets of matrices defined by positive semi-definite inequalities (see Section 2). Let $S_n$ denote the set of real n by n symmetric matrices,
$$ \Phi_{n\kappa} = \{\, U \in S_n : 0 \le U \le I,\ \mathrm{tr}(U) = \kappa \,\} \qquad (1.5) $$
and
$$ \Omega_{n\kappa} = \{\, W \in S_n : W = U - V, \text{ where } U, V \in S_n,\ 0 \le U \le I,\ 0 \le V \le I,\ \mathrm{tr}(U) + \mathrm{tr}(V) = \kappa \,\}. \qquad (1.6) $$
The sets $\Phi_{n\kappa}$ and $\Omega_{n\kappa}$ are compact convex subsets of $S_n$. It is shown that
$$ f_\kappa(A) = \max_{U \in \Phi_{n\kappa}} \langle A, U \rangle \qquad (1.7) $$
and
$$ g_\kappa(A) = \max_{W \in \Omega_{n\kappa}} \langle A, W \rangle. \qquad (1.8) $$
In fact, equation (1.7) implies a well known result of Fan (1949), namely
$$ f_\kappa(A) = \max_{Z \in \mathbb{R}^{n \times \kappa},\ Z^T Z = I} \mathrm{tr}(Z^T A Z). \qquad (1.9) $$

Fan's result is widely referenced; see Wielandt (1955), Cullum, Donath and Wolfe (1975), Friedland (1981), Sameh and Wisniewski (1982), Horn and Johnson (1985) and Fletcher (1985). Both (1.7) and (1.9) show that $f_\kappa(A)$ is a convex function. The advantage of (1.7) over (1.9) is that it leads directly to a characterization of the subdifferential of $f_\kappa$ which is computationally very useful, since it does not involve a convex hull operation. Equation (1.8) has a similar advantage over a corresponding analogue of (1.9). In the case $\kappa = 1$, (1.9) reduces to the Rayleigh principle
$$ f_1(A) = \max_{q^T q = 1} q^T A q, \qquad (1.10) $$
while (1.7) reduces to
$$ f_1(A) = \max_{U \ge 0,\ \mathrm{tr}(U) = 1} \langle A, U \rangle. \qquad (1.11) $$
Note that the inequality $U \le I$ is not required. Equation (1.11) is moderately well known; see Fletcher (1985) and Overton (1988, 1990). Now consider the composite functions
$$ f_\kappa(x) \equiv f_\kappa(A(x)), \qquad g_\kappa(x) \equiv g_\kappa(A(x)), \qquad (1.12) $$
where A(x) is a smooth symmetric matrix function defined on a vector of parameters $x \in \mathbb{R}^m$. The use of the same symbol $f_\kappa$ for a function defined on the set of symmetric matrices and on the parameter space $\mathbb{R}^m$ is convenient, and the distinction should be clear from the context. The max characterizations (1.7) and (1.8) prove that $f_\kappa(x)$ and $g_\kappa(x)$ are locally Lipschitz, subdifferentially regular, and have generalized gradients $\partial f_\kappa(x)$ and $\partial g_\kappa(x)$ respectively, which are nonempty compact convex sets in $\mathbb{R}^m$ (see Clarke (1983)). These generalized gradients are obtained by composing the subdifferentials of $f_\kappa(A)$ and $g_\kappa(A)$ with the derivative of A(x), using the chain rule. In fact Clarke (1983, Proposition 2.8.8) derives an expression for the generalized gradient of the largest eigenvalue $f_1(x)$, but in a form which requires a convex hull operation. An important feature of our max characterizations is that they lead to first-order optimality conditions which are computationally verifiable, providing matrix analogues of Lagrange multipliers in constrained optimization, namely U, or U and V, which we call dual matrices. The necessary condition $0 \in \partial f_\kappa(x)$ or $0 \in \partial g_\kappa(x)$ (see Clarke (1983)) provides systems of linear equations which can be solved to obtain the dual matrices. Inequalities of the form $0 \le U \le I$ determine whether the current point is a stationary point, or provide information from which a descent direction can be calculated. Equations (1.7) and (1.8) show that if A(x) is an affine function then $f_\kappa(x)$ and $g_\kappa(x)$ are convex functions. They also illustrate that minimizing $f_\kappa(x)$ or $g_\kappa(x)$ are minimax problems:
$$ \min_{x \in \mathbb{R}^m} f_\kappa(x) \equiv \min_{x \in \mathbb{R}^m} \max_{U \in \Phi_{n\kappa}} \langle A(x), U \rangle \qquad (1.13) $$

and
$$ \min_{x \in \mathbb{R}^m} g_\kappa(x) \equiv \min_{x \in \mathbb{R}^m} \max_{W \in \Omega_{n\kappa}} \langle A(x), W \rangle. \qquad (1.14) $$

If A(x) is affine, the saddle point result
$$ \min_{x \in \mathbb{R}^m} \max_{U \in \Phi_{n\kappa}} \langle A(x), U \rangle = \max_{U \in \Phi_{n\kappa}} \min_{x \in \mathbb{R}^m} \langle A(x), U \rangle, $$
and a similar result for (1.14), are established. This is similar to the result of Shapiro (1985) for minimizing a function of a symmetric matrix subject to positive semi-definite constraints. These saddle point results justify the dual matrix terminology. If the eigenvalues of A(x) are distinct then (1.13) and (1.14) are just minimax problems with smooth functions $\lambda_i(x) : \mathbb{R}^m \to \mathbb{R}$ for $i = 1,\dots,n$. Let $\lambda(x) = (\lambda_1(x),\dots,\lambda_n(x))^T$, using any ordering for the eigenvalues. Then
$$ f_\kappa(x) = \max_{u \in \phi_{n\kappa}} \lambda(x)^T u, \qquad (1.15) $$
where
$$ \phi_{n\kappa} = \{\, u \in \mathbb{R}^n : 0 \le u_i \le 1,\ i = 1,\dots,n,\ \sum_{i=1}^n u_i = \kappa \,\}, \qquad (1.16) $$
and
$$ g_\kappa(x) = \max_{w \in \hat\omega_{n\kappa}} \lambda(x)^T w, \qquad (1.17) $$
where
$$ \hat\omega_{n\kappa} = \{\, w \in \mathbb{R}^n : -1 \le w_i \le 1,\ i = 1,\dots,n,\ \sum_{i=1}^n |w_i| = \kappa \,\}. \qquad (1.18) $$

The additional complication in (1.13) and (1.14) arises from the possibility of multiple eigenvalues; hence the positive semi-definite constraints defining the sets $\Phi_{n\kappa}$ and $\Omega_{n\kappa}$. This paper is not concerned with algorithm development. However, a brief discussion of model algorithms for minimizing $f_\kappa(x)$ and $g_\kappa(x)$ is given. These are generalizations of the algorithm presented by Overton (1988) for the case when $\kappa = 1$ and A(x) is affine. It is, in fact, possible to design the model algorithms so that they have quadratic local convergence, even if the objective function is not smooth at the solution; see Overton (1988) and Overton (1990). More detail will be given by Overton and Womersley (to appear). Cullum, Donath and Wolfe (1975) gave an algorithm, related to the epsilon-subgradient methods of Lemarechal and others, for the case where the variables x are the diagonal elements of A(x). The most important difference between this earlier work and our model algorithms is that the latter compute the dual matrices which demonstrate optimality. These matrices are also the key to sensitivity analysis of the solution; see Overton (1990, Section 3).

Sums of eigenvalues of symmetric matrices have been addressed in one form or another in many classical papers on matrix theory; a good overview is Bellman (1970, Chapter 8). However, the rich interconnection between this subject and the sets $\Phi_{n\kappa}$ and $\Omega_{n\kappa}$ appears to have been largely overlooked. Note that although all our results are given in terms of real symmetric matrices, generalization to the complex Hermitian case is straightforward. The classical literature on sums of eigenvalues does not include much discussion of applications. However, these appear to be quite numerous, especially in connection with adjacency matrices of graphs. See in particular Cullum, Donath and Wolfe (1975), as well as Rendl and Wolkowicz (1990), and Alizadeh (1991). Another application is the "orthogonal Procrustes" problem, which refers to rotating a number of matrices towards a best least squares fit. This problem is discussed by Shapiro and Botha (1988) and has also been addressed by Watson (1990). There is a large variety of applications in the case $\kappa = 1$; see Overton (1990).

2 Notation

The following notation is used throughout this paper. Let
1. $S_n$ = set of all n by n real symmetric matrices ($A^T = A$).
2. $K_n$ = set of all n by n real skew-symmetric matrices ($A^T = -A$).
3. $D_n$ = set of all n by n real diagonal matrices.
4. $O_{m,n}$ = set of all m by n real orthogonal matrices, where $m \ge n$. Thus $Z^T Z = I_n$ for all $Z \in O_{m,n}$.

The vector $e_i$ is the ith coordinate vector, e is the vector of all 1s, and I is the identity matrix, with the dimensions of $e_i$, e and I determined by the context. A matrix $D \in D_n$ is denoted by $\mathrm{diag}(\delta_1,\dots,\delta_n)$, or $\mathrm{diag}(u)$ where $u \in \mathbb{R}^n$. The convex hull of a set is denoted by $\mathrm{conv}(\cdot)$. For $A \in \mathbb{R}^{n \times n}$ the eigenvalues of A are denoted by (1.1) and by (1.2). The trace of A is
$$ \mathrm{tr}(A) = \sum_{i=1}^n a_{ii} = \sum_{i=1}^n \lambda_i = \sum_{i=1}^n \mu_i. $$
The positive semi-definite partial ordering on $S_n$ is used to express matrix inequalities (see Golub and van Loan (1985) or Horn and Johnson (1985) for example). Thus $A \ge 0$ means that A is positive semi-definite (equivalently $y^T A y \ge 0$ for all y, or $\lambda_i \ge 0$ for $i = 1,\dots,n$). Hence $A \ge B$ means that $A - B$ is positive semi-definite. For example, the constraints $0 \le A \le I$ on $A \in S_n$ mean that $0 \le \lambda_i \le 1$ for $i = 1,\dots,n$.

The Frobenius inner product $\langle A, B \rangle$ of two matrices $A, B \in \mathbb{R}^{m \times n}$ is
$$ \langle A, B \rangle = \mathrm{tr}(AB^T) = \sum_{i=1}^m \sum_{j=1}^n a_{ij} b_{ij}. $$
This inner product is the natural extension for real matrix variables of the standard inner product $x^T y = \sum_{i=1}^n x_i y_i$ on $\mathbb{R}^n$. It is widely used in problems with matrix variables, for example in the work of Bellman and Fan (1963), Arnold (1971), and Craven and Mond (1981). Fletcher (1985), Overton (1988), and Overton and Womersley (1988) used the notation A : B for $\langle A, B \rangle$. Some useful properties of the Frobenius inner product are summarized below.
1. $\langle A, A \rangle = \|A\|_F^2$.
2. $\langle A, I \rangle = \mathrm{tr}(A)$.
3. If $A \in S_n$ and $K \in K_n$ then $\langle A, K \rangle = 0$.
4. As $\mathrm{tr}(AB^T) = \mathrm{tr}(BA^T) = \mathrm{tr}(A^T B)$,
$$ \langle A, B \rangle = \langle B, A \rangle = \langle A^T, B^T \rangle = \langle B^T, A^T \rangle. $$
In particular:
(a) For any nonsingular matrices $S \in \mathbb{R}^{m \times m}$ and $T \in \mathbb{R}^{n \times n}$,
$$ \langle A, B \rangle = \langle S^{-1} A, S^T B \rangle = \langle A T^{-1}, B T^T \rangle. $$
(b) If $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{m \times m}$ and $Z \in \mathbb{R}^{n \times m}$ then
$$ \langle Z^T A Z, B \rangle = \langle A, Z B Z^T \rangle. $$
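To make these identities concrete, the following short sketch (Python with NumPy; the helper name frob and the random test data are ours, not from the paper) checks the definition and properties 4(a) and 4(b) numerically.

```python
import numpy as np

def frob(A, B):
    """Frobenius inner product <A, B> = tr(A B^T) = sum_ij A_ij B_ij."""
    return np.trace(A @ B.T)

rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.standard_normal((m, n))
B = rng.standard_normal((m, n))

# Definition: tr(A B^T) equals the elementwise sum.
assert np.isclose(frob(A, B), np.sum(A * B))

# Property 4(a): <A, B> = <S^{-1} A, S^T B> = <A T^{-1}, B T^T> for nonsingular S, T.
S = rng.standard_normal((m, m)) + 5 * np.eye(m)
T = rng.standard_normal((n, n)) + 5 * np.eye(n)
assert np.isclose(frob(A, B), frob(np.linalg.solve(S, A), S.T @ B))
assert np.isclose(frob(A, B), frob(A @ np.linalg.inv(T), B @ T.T))

# Property 4(b): <Z^T A Z, B> = <A, Z B Z^T>.
A2 = rng.standard_normal((n, n))
B2 = rng.standard_normal((m, m))
Z = rng.standard_normal((n, m))
assert np.isclose(frob(Z.T @ A2 @ Z, B2), frob(A2, Z @ B2 @ Z.T))
print("Frobenius inner product identities verified")
```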

3 Sum of the largest eigenvalues

Let $A \in S_n$ have eigenvalues
$$ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n, $$
and let $Q \in O_{n,n}$ be a matrix whose columns are normalized eigenvectors of A, so
$$ Q^T A Q = \Lambda \qquad (3.1) $$
where $\Lambda = \mathrm{diag}(\lambda_1,\dots,\lambda_n)$. The matrix Q will be regarded as fixed, although if A has multiple eigenvalues the choice of Q is not unique.

Let $\kappa \in \{1,\dots,n\}$. Section 3.1 establishes a max characterization of
$$ f_\kappa(A) = \sum_{i=1}^{\kappa} \lambda_i. \qquad (3.2) $$
Thereafter we concentrate on the case when $A : \mathbb{R}^m \to S_n$ is a smooth matrix-valued function and
$$ f_\kappa(x) = f_\kappa(A(x)). \qquad (3.3) $$
Section 3.2 considers the differential properties of $f_\kappa(x)$, Section 3.3 gives necessary conditions for a local minimizer, Section 3.4 establishes a saddle point result, Section 3.5 gives a formula for the directional derivative, Section 3.6 discusses the generation of descent directions by splitting multiple eigenvalues, and Section 3.7 considers a model algorithm for minimizing $f_\kappa(x)$.

3.1 A max characterization

Let
$$ \Phi_{n\kappa} = \{\, U \in S_n : 0 \le U \le I,\ \mathrm{tr}(U) = \kappa \,\} \qquad (3.4) $$
and
$$ \phi_{n\kappa} = \{\, u \in \mathbb{R}^n : 0 \le u \le e,\ e^T u = \kappa \,\}. \qquad (3.5) $$

Lemma 3.1 The sets $\Phi_{n\kappa}$ and $\phi_{n\kappa}$ are compact and convex. Moreover $\Phi_{n\kappa}$ is invariant under orthogonal similarity transformations (i.e. $U \in \Phi_{n\kappa} \iff Z^T U Z \in \Phi_{n\kappa}$ where $Z \in O_{n,n}$), and $\Phi_{n\kappa}$ and $\phi_{n\kappa}$ are related by
$$ \Phi_{n\kappa} = \{\, U \in S_n : U = Z D Z^T \text{ where } Z \in O_{n,n},\ D = \mathrm{diag}(u_1,\dots,u_n) \text{ and } u \in \phi_{n\kappa} \,\} \qquad (3.6) $$
and
$$ \phi_{n\kappa} = \{\, u \in \mathbb{R}^n : u_i = U_{ii} \text{ for } i = 1,\dots,n \text{ where } U \in \Phi_{n\kappa} \,\}. \qquad (3.7) $$

Proof: That $\Phi_{n\kappa}$ and $\phi_{n\kappa}$ are compact convex sets is immediate. A spectral decomposition of U yields (3.6) and the orthogonal invariance of $\Phi_{n\kappa}$. The set $\phi_{n\kappa}$ is contained in the right-hand side of (3.7) since for any $u \in \phi_{n\kappa}$, $\mathrm{diag}(u_1,\dots,u_n) \in \Phi_{n\kappa}$. To obtain the reverse inclusion, let $U \in \Phi_{n\kappa}$ and define u by $u_i = U_{ii}$. The facts that the trace is the sum of the diagonal elements and that a positive semi-definite matrix cannot have a negative diagonal element then show that $u \in \phi_{n\kappa}$.

To characterize the elements that achieve the maximum in the following results, information about the multiplicity of the eigenvalues of A is needed. Let
$$ \lambda_1 \ge \cdots \ge \lambda_r > \lambda_{r+1} = \cdots = \lambda_\kappa = \cdots = \lambda_{r+t} > \lambda_{r+t+1} \ge \cdots \ge \lambda_n, \qquad (3.8) $$
where $t \ge 1$ and $r \ge 0$ are integers. The multiplicity of the $\kappa$th eigenvalue is t. The number of eigenvalues larger than $\lambda_\kappa$ is r. Here r may be zero; in particular this must be the case if $\kappa = 1$. Note that by definition $r + 1 \le \kappa \le r + t \le n$, so $t \ge \kappa - r$. Also, $t = 1$ implies that $\kappa = r + 1$. First, two lemmas, which depend only on the definitions of the sets $\Phi_{n\kappa}$ and $\phi_{n\kappa}$ and the ordering of the elements $\lambda_i$ in (3.8), are established. In particular they do not require $\lambda_i$ to be an eigenvalue of a matrix.

Lemma 3.2 If the elements of $\lambda \in \mathbb{R}^n$ satisfy (3.8) then
$$ \max_{u \in \phi_{n\kappa}} \lambda^T u = \sum_{i=1}^{\kappa} \lambda_i $$
with
$$ \mathrm{argmax}\{\, \lambda^T u : u \in \phi_{n\kappa} \,\} = \{\, u \in \mathbb{R}^n : u_i = 1,\ i = 1,\dots,r;\ 0 \le u_i \le 1,\ i = r+1,\dots,r+t;\ u_i = 0,\ i = r+t+1,\dots,n;\ \text{and } \sum_{i=r+1}^{r+t} u_i = \kappa - r \,\}. $$

Proof: These results follow directly from (3.5) and (3.8).

Lemma 3.3 Let $\Lambda = \mathrm{diag}(\lambda)$ where the elements of $\lambda \in \mathbb{R}^n$ satisfy (3.8). Then
$$ \max_{U \in \Phi_{n\kappa}} \langle \Lambda, U \rangle = \sum_{i=1}^{\kappa} \lambda_i \qquad (3.9) $$
with
$$ \mathrm{argmax}\{\, \langle \Lambda, U \rangle : U \in \Phi_{n\kappa} \,\} = \left\{\, U \in S_n : U = \begin{bmatrix} I & & \\ & \tilde U & \\ & & 0 \end{bmatrix},\ \tilde U \in \Phi_{t,\kappa-r} \,\right\} \qquad (3.10) $$
where
$$ \Phi_{t,\kappa-r} = \{\, \tilde U \in S_t : 0 \le \tilde U \le I \text{ and } \mathrm{tr}(\tilde U) = \kappa - r \,\}. \qquad (3.11) $$
Here the diagonal blocks of U have dimension r, t, and $n - r - t$ respectively.

h ; U i =

n X i=1

iUii:

(3:12)

Hence (3.9) follows from combining Lemmas 3.1 and 3.2. If U  is any element of the right hand side of (3.10) then from (3.8)

h ; U  i =

r X i=1

 X i + tr(U~ ) = i :

i=1  )T Then u = (U11 ; . . . ; Unn

Now suppose U P2 argmax f h ; U i : U 2 n; g. satis es T u = i=1 i , so from Lemma 3.2 it follows that

i = 1; . . . ; r; Uii = 1;  0  Uii  1; i = r + 1; . . . ; r + t  i = r + t + 1; . . . ; n Uii = 0; r+t X Uii =  ? r:

(3:13)

2 n; (3.14)

i=r+1

Partition the rows and columns of U  into blocks of r, t and n ? r ? t elements so 2 C11 C12 C13 3 (3:15) U  = 64 C12T C22 C23 75 T T C13 C23 C33 where C11 2 Sr ; C22 2 St; C33 2 Sn?r?t . As U  2 n; , 0  Cii  I; i = 1; 2; 3

and

3 X

i=1

tr(Cii) = :

From (3.14) and (3.15) the diagonal elements of C11 are all 1 and the diagonal elements of C33 are all zero. Suppose an o -diagonal element of C11 is nonzero. Then I ? C11 is symmetric, has zero elements on the diagonal and a nonzero o diagonal element, so is inde nite. This contradicts C11  I , so C11 = I . Similarly " # " # C C 0 ? C 11 12 12 I ? C T C = ?C T I ? C  0 22 22 12 12 implies that C12 = 0. As the diagonal elements of C33 are all zero the existence of a nonzero o -diagonal element in C33 would contradict C33  0, so C33 = 0. Similarly 9

it follows that C13 = 0 and C23 = 0 as the existence of a nonzero element in either of these submatrices contradicts U   0. The only conditions remaining then are 0  C22  I with tr(C22) =  ? r as required.

Remark: The proof uses the basic results that if $C \in S_n$ then
$$ C \ge 0,\ \mathrm{tr}(C) = 0 \;\Rightarrow\; C = 0, \qquad (3.16) $$
$$ C \le I,\ \mathrm{tr}(C) = n \;\Rightarrow\; C = I, \qquad (3.17) $$
which follow from the observation that if $C \ge 0$ and $C_{ii} = 0$ then $C_{ij} = 0$ and $C_{ji} = 0$ for $j = 1,\dots,n$.

Now consider sums of the eigenvalues of A. Let $P_1 \in O_{n,r}$ be the matrix consisting of the first r columns of Q (defined in equation (3.1)), and let $Q_1 \in O_{n,t}$ be the matrix consisting of the next t columns of Q. By (3.8) we have
$$ P_1^T A P_1 = \mathrm{diag}(\lambda_1,\dots,\lambda_r), \qquad Q_1^T A Q_1 = \lambda_\kappa I. \qquad (3.18) $$

Theorem 3.4 Let $A \in S_n$ have eigenvalues $\lambda_1 \ge \cdots \ge \lambda_n$. Then
$$ \max_{U \in \Phi_{n\kappa}} \langle A, U \rangle = \sum_{i=1}^{\kappa} \lambda_i. \qquad (3.19) $$
If the eigenvalues satisfy (3.8) then
$$ \mathrm{argmax}\{\, \langle A, U \rangle : U \in \Phi_{n\kappa} \,\} = \{\, U \in S_n : U = P_1 P_1^T + Q_1 \tilde U Q_1^T,\ \tilde U \in \Phi_{t,\kappa-r} \,\}, \qquad (3.20) $$
where $P_1$, $Q_1$ satisfy (3.18).

Proof: For any $U \in \Phi_{n\kappa}$ equation (3.1) and the properties of the Frobenius inner product imply that
$$ \langle A, U \rangle = \langle Q \Lambda Q^T, U \rangle = \langle \Lambda, Q^T U Q \rangle, \qquad (3.21) $$
with $Q^T U Q \in \Phi_{n\kappa}$ as $\Phi_{n\kappa}$ is invariant to orthogonal transformations. Hence
$$ \max_{U \in \Phi_{n\kappa}} \langle A, U \rangle = \max_{U \in \Phi_{n\kappa}} \langle \Lambda, U \rangle = \sum_{i=1}^{\kappa} \lambda_i, $$
where the second equality follows from Lemma 3.3. From (3.21) and the invariance of $\Phi_{n\kappa}$ to orthogonal transformations,
$$ \mathrm{argmax}\{\, \langle A, U \rangle : U \in \Phi_{n\kappa} \,\} = \{\, Q U^* Q^T : U^* \in \Gamma \,\} $$
where $\Gamma = \mathrm{argmax}\{\, \langle \Lambda, U \rangle : U \in \Phi_{n\kappa} \,\}$. The proof is completed by applying Lemma 3.3, since
$$ Q U^* Q^T = Q \begin{bmatrix} I & & \\ & \tilde U & \\ & & 0 \end{bmatrix} Q^T = P_1 P_1^T + Q_1 \tilde U Q_1^T. $$
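As an illustration of Theorem 3.4, the following sketch (Python/NumPy; the example matrix, the chosen Ũ and all names are ours, not from the paper) builds a symmetric A with a multiple $\lambda_\kappa$, forms an element of the argmax set (3.20), and checks (3.19) against random elements of $\Phi_{n\kappa}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, kappa = 6, 3

# Symmetric A with eigenvalues 5, 3, 3, 3, 1, 0, so r = 1 and t = 3 when kappa = 3.
lam = np.array([5.0, 3.0, 3.0, 3.0, 1.0, 0.0])
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))     # random orthogonal eigenvector matrix
A = Q @ np.diag(lam) @ Q.T
r, t = 1, 3
P1, Q1 = Q[:, :r], Q[:, r:r + t]
f_kappa = lam[:kappa].sum()                          # sum of the kappa largest eigenvalues

# An element of (3.20): U = P1 P1^T + Q1 Utilde Q1^T with Utilde in Phi_{t,kappa-r}.
V, _ = np.linalg.qr(rng.standard_normal((t, t)))
w = np.array([1.0, 0.7, 0.3])                        # entries in [0,1], summing to kappa - r = 2
Utilde = V @ np.diag(w) @ V.T
U = P1 @ P1.T + Q1 @ Utilde @ Q1.T
assert np.isclose(np.trace(A @ U), f_kappa)

# Random elements of Phi_{n,kappa} (convex combinations of extreme points Z Z^T,
# cf. Lemma 3.7 below) never exceed the maximum value (3.19).
for _ in range(200):
    weights = rng.dirichlet(np.ones(4))
    U_rand = np.zeros((n, n))
    for wgt in weights:
        Z, _ = np.linalg.qr(rng.standard_normal((n, kappa)))
        U_rand += wgt * Z @ Z.T
    assert np.trace(A @ U_rand) <= f_kappa + 1e-8
print("Theorem 3.4 verified on a random example")
```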

Remark: The term dual matrix is used to refer to either $U \in \Phi_{n\kappa}$ or $\tilde U \in \Phi_{t,\kappa-r}$. The distinction between U and $\tilde U$ is analogous to the question of whether or not to assign zero Lagrange multipliers to inactive constraints in nonlinear programming.

Remark: If $\kappa = r + t$ then $\tilde U = I$ and
$$ \mathrm{argmax}\{\, \langle A, U \rangle : U \in \Phi_{n\kappa} \,\} = P_1 P_1^T + Q_1 Q_1^T. \qquad (3.22) $$
The matrix achieving the maximum is unique if and only if $\kappa = r + t$. In particular this is the case if $t = 1$, i.e. the $\kappa$th largest eigenvalue is simple.

Remark: All the freedom in the choice of $Q_1$ in (3.18) is absorbed into the matrix $\tilde U$. It also makes no difference if any of the eigenvalues $\lambda_1,\dots,\lambda_r$ have multiplicity greater than 1, as all of these multiple eigenvalues are included in $f_\kappa(A)$. The corresponding columns of $P_1$ can be any orthonormal basis for the corresponding eigenspace, without affecting $P_1 P_1^T$.

Remark: As $A \in S_n$ it follows that $\langle A, K \rangle = 0$ for any $K \in K_n$ (the set of skew-symmetric n by n matrices). Hence $f_\kappa(A) = \langle A, U + K \rangle$ for any U belonging to (3.20) and $K \in K_n$.

Corollary 3.5 The function $f_\kappa(A) : S_n \to \mathbb{R}$ is convex and its subdifferential $\partial f_\kappa(A)$ is the nonempty compact convex set
$$ \partial f_\kappa(A) = \{\, U \in S_n : \exists\, \tilde U \in \Phi_{t,\kappa-r} \text{ with } U = P_1 P_1^T + Q_1 \tilde U Q_1^T \,\}. \qquad (3.23) $$

Proof: For any $U \in \Phi_{n\kappa}$ the inner product $\langle A, U \rangle$ is a linear function of $A \in S_n$, so the convexity of $f_\kappa(A)$ follows from the max characterization in Theorem 3.4. Moreover, from Rockafellar (1970, Corollary 23.5.3), the subdifferential of a function defined as a pointwise maximum of a set of linear functions is the convex hull of the gradients of the linear functions achieving the maximum at the given point. Thus the subdifferential of $f_\kappa(A)$ is
$$ \partial f_\kappa(A) = \mathrm{conv}\{\, U \in S_n : \langle A, U \rangle = f_\kappa(A) \,\}. $$
The result follows from Theorem 3.4 as the set (3.20) is already convex.

Remark: As previously noted, any skew-symmetric matrix can be added to U without affecting the inner product $\langle A, U \rangle$. As $f_\kappa(A)$ is defined on $S_n$, only the symmetric subdifferential is of interest.

Remark: In the case $\kappa = 1$, the condition $\tilde U \le I$ is unnecessary since it is implied by $\tilde U \ge 0$, $\mathrm{tr}(\tilde U) = 1$. This case is moderately well known; see Fletcher (1985). Fletcher also addressed the case $\kappa > 1$ in an appendix, but his Theorem A.4 is incorrect in the case $\kappa > 1$ since the condition $\tilde U \le I$ was omitted. This section is concluded by relating our max characterization to the well known result of Fan (1949).

Lemma 3.6 The extreme points of $\phi_{n\kappa}$ are
$$ \{\, u : u_i = 1 \text{ for exactly } \kappa \text{ of the indices } 1,\dots,n,\ u_i = 0 \text{ otherwise} \,\}. \qquad (3.24) $$

Proof: Straightforward.

Lemma 3.7 The extreme points of $\Phi_{n\kappa}$ are the elements of $\Phi_{n\kappa}$ with rank $\kappa$, that is the set of matrices $U \in \Phi_{n\kappa}$ with $\kappa$ eigenvalues equal to 1 and $n - \kappa$ eigenvalues equal to zero.

Proof: Matrices in $\Phi_{n\kappa}$ must have rank at least $\kappa$. Since $\Phi_{n\kappa}$ and $\phi_{n\kappa}$ are related by (3.6), it is straightforward to show that any element of $\Phi_{n\kappa}$ with rank greater than $\kappa$ is not an extreme point. The only candidates for extreme points, then, are those with rank $\kappa$. But it is clearly not possible that some rank $\kappa$ elements are extreme points and others not, since the definition of $\Phi_{n\kappa}$ does not in any way distinguish between different rank $\kappa$ elements. Since a compact convex set must have extreme points, the proof is complete. Fan's theorem now follows as a corollary.

Theorem 3.8 (Fan)
$$ \max_{Z \in O_{n,\kappa}} \mathrm{tr}(Z^T A Z) = \sum_{i=1}^{\kappa} \lambda_i. \qquad (3.25) $$

Proof: Since a linear function on a convex set must assume its maximum at an extreme point, combining Theorem 3.4 with the lemma just given shows that
$$ \max_{U \in \Phi_{n\kappa},\ \mathrm{rank}\,U = \kappa} \langle A, U \rangle = \sum_{i=1}^{\kappa} \lambda_i. \qquad (3.26) $$
Such matrices U have precisely the form $Z Z^T$, where $Z \in O_{n,\kappa}$. The proof is completed by noting that
$$ \langle A, Z Z^T \rangle = \langle Z^T A Z, I \rangle = \mathrm{tr}(Z^T A Z). \qquad (3.27) $$
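The following minimal NumPy sketch (our illustration, with made-up data) checks Fan's theorem (3.25): tr(Z^T A Z) attains the sum of the κ largest eigenvalues when the columns of Z are corresponding eigenvectors, and no orthonormal Z does better.

```python
import numpy as np

rng = np.random.default_rng(2)
n, kappa = 7, 3
A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # random symmetric matrix

lam, Q = np.linalg.eigh(A)                           # ascending eigenvalues
lam, Q = lam[::-1], Q[:, ::-1]                       # reorder so lam[0] >= ... >= lam[n-1]
f_kappa = lam[:kappa].sum()

Z_opt = Q[:, :kappa]                                 # eigenvectors of the kappa largest eigenvalues
assert np.isclose(np.trace(Z_opt.T @ A @ Z_opt), f_kappa)

for _ in range(500):                                 # random orthonormal Z never does better
    Z, _ = np.linalg.qr(rng.standard_normal((n, kappa)))
    assert np.trace(Z.T @ A @ Z) <= f_kappa + 1e-10
print("Fan's theorem (3.25) verified numerically")
```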

Remark: Another expression for the subdifferential is
$$ \partial f_\kappa(A) = \mathrm{conv}\{\, Z Z^T : \text{the columns of } Z \text{ form an orthonormal set of } \kappa \text{ eigenvectors for } \lambda_1,\dots,\lambda_\kappa \,\}. \qquad (3.28) $$
Note that these are just the elements that achieve the maximum in Fan's theorem, and that the elements whose convex hull is being taken are the extreme points of (3.23). Although simpler to write than (3.23), this expression is not as useful, as the structure of the subdifferential is not apparent.

Remark: H. Woerdeman and C.-K. Li recently informed us that the equality
$$ \mathrm{conv}\{\, Z Z^T : Z \in O_{n,\kappa} \,\} = \Phi_{n\kappa} $$
appeared in Fillmore and Williams (1971). This implies the equivalence of (3.19) and (3.25), although it does not by itself imply either of them; nor, strictly speaking, does it imply Lemma 3.7.

3.2 The generalized gradient

Let $A(x) : \mathbb{R}^m \to S_n$ be a smooth (at least once continuously differentiable) function whose partial derivative with respect to $x_k$ is
$$ A_k(x) \equiv \frac{\partial A(x)}{\partial x_k} \quad \text{for } k = 1,\dots,m. $$
This section is concerned with finding a computationally useful characterization of the generalized gradient of the function $f_\kappa(x) = f_\kappa(A(x))$. Although the eigenvalues and eigenvectors of A(x) are functions of $x \in \mathbb{R}^m$, the explicit dependence on x will usually be omitted. Thus the eigenvalues of A(x) are denoted as before by (3.8), with r and t now dependent on x, and with corresponding eigenvectors satisfying (3.18).

Theorem 3.9 The function $f_\kappa(x)$ is locally Lipschitz, subdifferentially regular, and its generalized gradient is the nonempty compact convex set
$$ \partial f_\kappa(x) = \{\, u \in \mathbb{R}^m : \exists\, \tilde U \in S_t \text{ with } 0 \le \tilde U \le I,\ \mathrm{tr}(\tilde U) = \kappa - r, \text{ and } u_k = \mathrm{tr}(P_1^T A_k(x) P_1) + \langle Q_1^T A_k(x) Q_1, \tilde U \rangle,\ k = 1,\dots,m \,\}. \qquad (3.29) $$

Proof: Since $f_\kappa(A)$ is convex and A(x) is smooth, the chain rule (Clarke (1983), Theorem 2.3.10) implies that $f_\kappa(x)$ is locally Lipschitz, subdifferentially regular, and that
$$ \partial f_\kappa(x) = \{\, u \in \mathbb{R}^m : u_k = \langle A_k(x), U \rangle,\ k = 1,\dots,m, \text{ where } U \in \partial f_\kappa(A(x)) \,\}. \qquad (3.30) $$
Corollary 3.5 and the properties of the inner product complete the proof.

Remark: Since by Theorem 3.4
$$ f_\kappa(x) = \max_{U \in \Phi_{n\kappa}} \langle A(x), U \rangle, $$
the result also follows from the characterization of generalized gradients of functions defined by a pointwise maximum in Clarke (1983, Theorem 2.8.6). The Clarke characterization shows that
$$ \partial f_\kappa(x) = \mathrm{conv}\{\, u \in \mathbb{R}^m : u_k = \langle A_k(x), U \rangle,\ k = 1,\dots,m, \text{ where } \langle A(x), U \rangle = f_\kappa(x) \,\}. \qquad (3.31) $$
Equation (3.29) follows from Theorem 3.4 since the maximizing set is already convex.

Remark: Bellman and Fan (1963) gave an example where the set $\{\, u \in \mathbb{R}^m : u_k = \langle A_k, U \rangle, \text{ where } U \ge 0 \,\}$ is not closed, and gave sufficient conditions for it to be closed. This is not a difficulty in our case because the trace condition ensures that $\Phi_{n\kappa}$ is compact. As $f_\kappa(x) : \mathbb{R}^m \to \mathbb{R}$ is a locally Lipschitz function, its generalized gradient is a nonempty compact convex set in $\mathbb{R}^m$.
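To illustrate (3.29), the sketch below (NumPy; the affine family A(x), the helper names and the test are our own illustrative choices) computes one element of $\partial f_\kappa(x)$ and checks it against the subgradient inequality, which holds globally because the affine case makes $f_\kappa$ convex.

```python
import numpy as np

def sym(M):
    return (M + M.T) / 2

rng = np.random.default_rng(3)
n, m, kappa = 5, 4, 2

# Affine family A(x) = A0 + x_1 A_1 + ... + x_m A_m (our example data).
A0 = sym(rng.standard_normal((n, n)))
Ak = [sym(rng.standard_normal((n, n))) for _ in range(m)]

def eig_desc(X):
    lam, Q = np.linalg.eigh(X)
    return lam[::-1], Q[:, ::-1]

def f(x):
    lam, _ = eig_desc(A0 + sum(xk * Mk for xk, Mk in zip(x, Ak)))
    return lam[:kappa].sum()

x = rng.standard_normal(m)
lam, Q = eig_desc(A0 + sum(xk * Mk for xk, Mk in zip(x, Ak)))
# Generically the eigenvalues are distinct at x, so r = kappa - 1, t = 1 and Utilde = I.
r, t = kappa - 1, 1
P1, Q1, Utilde = Q[:, :r], Q[:, r:r + t], np.eye(t)
u = np.array([np.trace(P1.T @ M @ P1) + np.trace(Q1.T @ M @ Q1 @ Utilde) for M in Ak])

# A(x) is affine, so f is convex and u is a subgradient: f(y) >= f(x) + u^T (y - x).
fx = f(x)
for _ in range(200):
    y = x + rng.standard_normal(m)
    assert f(y) >= fx + u @ (y - x) - 1e-8
print("element of the generalized gradient (3.29) verified as a subgradient")
```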

Corollary 3.10 If $\lambda_\kappa > \lambda_{\kappa+1}$ (i.e. $\kappa = r + t$) the function $f_\kappa$ is differentiable at x with
$$ \frac{\partial f_\kappa(x)}{\partial x_k} = \mathrm{tr}(P_1^T A_k(x) P_1) + \mathrm{tr}(Q_1^T A_k(x) Q_1). \qquad (3.32) $$

Proof: As $\kappa - r = t$, the only solution to $\mathrm{tr}(\tilde U) = t$ and $0 \le \tilde U \le I$ is $\tilde U = I$. Hence the set $\partial f_\kappa(x)$ is a singleton, $f_\kappa(x)$ is differentiable, and (3.29) reduces to (3.32).

Remark: A key point here is that if $\lambda_j$ is an eigenvalue of A(x) with multiplicity $\ell$, there exists a neighbourhood of x in which the corresponding group of $\ell$ eigenvalues is distinct from all the other eigenvalues. The sum of all the eigenvalues in this group is a differentiable function in this neighbourhood (see Kato (1982)). In particular, simple eigenvalues are smooth functions of x.

Remark: Cullum, Donath and Wolfe (1975) studied the case where A(x) is affine and only the diagonal elements of A(x) vary, so that $m = n$, $A(x) = A_0 + \mathrm{diag}(x_1,\dots,x_n)$, and $A_k(x) = e_k e_k^T$ for $k = 1,\dots,n$. Using Fan's theorem they showed that $f_\kappa(x)$ is differentiable at x when $\lambda_\kappa > \lambda_{\kappa+1}$, and that
$$ \partial f_\kappa(x) = \mathrm{conv}\{\, v \in \mathbb{R}^n : v_k = \mathrm{tr}(P_1^T A_k(x) P_1) + \mathrm{tr}(Z^T Q_1^T A_k(x) Q_1 Z) \text{ for } k = 1,\dots,m \text{ and } Z \in O_{t,\kappa-r} \,\}. \qquad (3.33) $$
The relationship between (3.33) and (3.29) is precisely that already explained between (3.28) and (3.23), namely the argument of the convex hull in (3.33) is the set of extreme points of (3.29).

3.3 Necessary conditions

The standard necessary condition for x to be a local minimizer of $f_\kappa$, namely $0 \in \partial f_\kappa(x)$ (see Clarke (1983)), implies there exists
$$ U \in \mathrm{argmax}\{\, \langle A(x), U \rangle : U \in \Phi_{n\kappa} \,\} \qquad (3.34) $$
such that
$$ \langle A_k, U \rangle = 0 \quad \text{for } k = 1,\dots,m. \qquad (3.35) $$
From Theorem 3.9, equations (3.34) and (3.35) are equivalent to the existence of a $\tilde U \in S_t$ such that
$$ 0 \le \tilde U \le I, \qquad \mathrm{tr}(\tilde U) = \kappa - r, \qquad (3.36) $$
and
$$ \mathrm{tr}(P_1^T A_k P_1) + \langle Q_1^T A_k Q_1, \tilde U \rangle = 0 \quad \text{for } k = 1,\dots,m. \qquad (3.37) $$
The conditions (3.36) and (3.37) are useful computationally, as one can relax the inequalities on $\tilde U$ and solve (3.37) together with $\mathrm{tr}(\tilde U) = \kappa - r$ for $\tilde U$. This requires solving a system of $m + 1$ linear equations for the $t(t+1)/2$ unknowns in the symmetric matrix $\tilde U$. If the inequalities $0 \le \tilde U \le I$ are not satisfied then a descent direction may be generated. This is discussed in Section 3.6. If $f_\kappa$ is convex (for example if A(x) is affine) then equations (3.36) and (3.37) are both necessary and sufficient for x to be a minimizer of $f_\kappa$.
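A minimal sketch of this test follows (NumPy; the data b, B_k and all helper names are our own illustrative choices, with the system made square so a generic exact solve exists): the symmetric unknown $\tilde U$ is vectorized, the relaxed equations (3.37) and tr($\tilde U$) = κ − r are solved, and the eigenvalues of $\tilde U$ are then compared with the bounds 0 ≤ $\tilde U$ ≤ I.

```python
import numpy as np

def sym(M):
    return (M + M.T) / 2

def sym_basis(t):
    """Basis E_ij (i <= j) of the space S_t of t x t symmetric matrices."""
    basis = []
    for i in range(t):
        for j in range(i, t):
            E = np.zeros((t, t)); E[i, j] = E[j, i] = 1.0
            basis.append(E)
    return basis

rng = np.random.default_rng(4)
m, t, kappa_minus_r = 5, 3, 2                       # m + 1 = 6 equations, t(t+1)/2 = 6 unknowns
b = rng.standard_normal(m)                          # b_k = tr(P1^T A_k P1) (made-up data)
Bk = [sym(rng.standard_normal((t, t))) for _ in range(m)]   # B_k = Q1^T A_k Q1 (made-up data)

basis = sym_basis(t)
rows = [[np.sum(B * E) for E in basis] for B in Bk]          # coefficients of <B_k, Utilde>
rows.append([np.trace(E) for E in basis])                    # coefficients of tr(Utilde)
rhs = np.concatenate([-b, [kappa_minus_r]])                  # (3.37) and tr(Utilde) = kappa - r
y = np.linalg.solve(np.array(rows), rhs)
Utilde = sum(c * E for c, E in zip(y, basis))

evals = np.linalg.eigvalsh(Utilde)
if evals.min() >= -1e-8 and evals.max() <= 1 + 1e-8:
    print("0 is in the generalized gradient: first-order conditions hold")
else:
    print("0 <= Utilde <= I violated: a descent direction exists (Section 3.6)")
```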

3.4 A saddle point result

Consider the function
$$ L(x, U) = \langle A(x), U \rangle. \qquad (3.38) $$
This section establishes a saddle point result, based on well known results for Lagrangian functions in convex programming (Rockafellar (1970)). A point $x^* \in \mathbb{R}^m$, $U^* \in \Phi_{n\kappa}$ is a saddle point of $L(x, U)$ if and only if
$$ L(x^*, U) \le L(x^*, U^*) \le L(x, U^*) \quad \forall\, x \in \mathbb{R}^m,\ \forall\, U \in \Phi_{n\kappa}. \qquad (3.39) $$
It is well known and easy to show that
$$ \min_{x \in \mathbb{R}^m} \max_{U \in \Phi_{n\kappa}} L(x, U) \ge \max_{U \in \Phi_{n\kappa}} \min_{x \in \mathbb{R}^m} L(x, U). \qquad (3.40) $$
The primal problem is
$$ \min_{x \in \mathbb{R}^m} f_\kappa(x), \qquad (3.41) $$
where
$$ f_\kappa(x) = \max_{U \in \Phi_{n\kappa}} L(x, U). \qquad (3.42) $$
Define the dual problem to be
$$ \max_{U \in \Phi_{n\kappa}} h(U), \qquad (3.43) $$
where
$$ h(U) = \min_{x \in \mathbb{R}^m} L(x, U). \qquad (3.44) $$
The following saddle point result is similar to that of Shapiro (1985) for minimizing a function of a symmetric matrix subject to positive semi-definite constraints. Indeed, it follows from the general saddle point theory for convex-concave functions (Rockafellar (1970), Theorem 36.3), but we give the proof for completeness.

Theorem 3.11 For each $U \in \Phi_{n\kappa}$ let $L(\cdot, U)$ be a convex function, and let the primal problem attain its solution at $x^*$. Then the primal and dual problems have the same optimal value, so
$$ \min_{x \in \mathbb{R}^m} \max_{U \in \Phi_{n\kappa}} L(x, U) = \max_{U \in \Phi_{n\kappa}} \min_{x \in \mathbb{R}^m} L(x, U), \qquad (3.45) $$
and $U^*$ satisfying (3.34) and (3.35) solves the dual problem (3.43).

Proof: From (3.42), $f_\kappa(x)$ is a convex function as it is a maximum of convex functions, so $0 \in \partial f_\kappa(x^*)$. Hence there exists a $U^*$ satisfying equations (3.34) and (3.35). We only have to show that $(x^*, U^*)$ is a saddle point of $L(x, U)$. From Theorem 3.4 and (3.34),
$$ L(x^*, U^*) = f_\kappa(x^*) \ge L(x^*, U) \quad \forall\, U \in \Phi_{n\kappa}. $$
The function $L(\cdot, U^*)$ is convex, so $0 \in \partial f_\kappa(x^*)$ implies that $L(\cdot, U^*)$ attains its minimum at $x^*$. This establishes (3.39).

Remark: As A(x) is a smooth function of x, so is $L(\cdot, U)$ for any U. Hence a necessary condition for x to be a local minimizer of $L(\cdot, U)$ is that
$$ \langle A_k(x), U \rangle = 0 \quad \text{for } k = 1,\dots,m. \qquad (3.46) $$
If A(x) is an affine function then $L(\cdot, U)$ is also an affine (and hence convex) function. In this case $L(\cdot, U)$ does not have a finite minimum unless it is constant, so (3.46) holds for all $x \in \mathbb{R}^m$.

3.5 The directional derivative

As $f_\kappa(x)$ is subdifferentially regular, the standard one-sided directional derivative at x in a direction $d \in \mathbb{R}^m$ exists and satisfies
$$ f_\kappa'(x; d) \equiv \lim_{\alpha \to 0^+} \frac{f_\kappa(x + \alpha d) - f_\kappa(x)}{\alpha} = \max_{u \in \partial f_\kappa(x)} u^T d = \sum_{k=1}^m d_k\, \mathrm{tr}(P_1^T A_k P_1) + \max_{\tilde U \in \Phi_{t,\kappa-r}} \sum_{k=1}^m d_k \langle Q_1^T A_k Q_1, \tilde U \rangle. \qquad (3.47) $$
Recall that the matrices $P_1$ and $Q_1$, defined by (3.18), are evaluated at the point x, and that $A_k$ is the partial derivative of A(x) with respect to $x_k$ evaluated at the point x. Define
$$ b_k = \mathrm{tr}(P_1^T A_k P_1), \qquad B_k = Q_1^T A_k Q_1 \quad \text{for } k = 1,\dots,m, \qquad (3.48) $$
and define $B(d) \in S_t$ by
$$ B(d) = \sum_{k=1}^m d_k B_k. \qquad (3.49) $$
Note that $b^T d$ is the sum of the eigenvalues (the trace) of $\sum_{k=1}^m d_k P_1^T A_k P_1$. Let the eigenvalues of $B(d)$ be $\gamma_1 \ge \cdots \ge \gamma_t$. Then from (3.47) and Theorem 3.4 it follows that
$$ f_\kappa'(x; d) = b^T d + \sum_{i=1}^{\kappa - r} \gamma_i. \qquad (3.50) $$
Hence $f_\kappa'(x; d)$ is the sum of all r eigenvalues of $\sum_{k=1}^m d_k P_1^T A_k P_1$ plus the sum of the $\kappa - r$ largest eigenvalues of $\sum_{k=1}^m d_k Q_1^T A_k Q_1$. Note that, unless $\kappa = r + t$, this is generally not the same as the sum of the $\kappa$ largest eigenvalues of
$$ \sum_{k=1}^m d_k \begin{bmatrix} P_1^T \\ Q_1^T \end{bmatrix} A_k\, [\, P_1 \ \ Q_1 \,]. $$

Remark: Classical eigenvalue perturbation theory (Kato (1982)) states that, in general, as A(x) is perturbed by $\alpha \sum d_k A_k + o(\alpha)$, the multiple eigenvalue $\lambda_{r+1} = \cdots = \lambda_{r+t}$ splits, with the t perturbed eigenvalues having first-order changes given, respectively, by the t eigenvalues of $B(d)$. This, together with the fact that the sum of the first r eigenvalues, being separated from $\lambda_{r+1}$, is a smooth function, provides an alternative proof of (3.50). However, the proof of these classical results is by no means straightforward.
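Formula (3.50) can be checked against a one-sided finite difference. The sketch below (NumPy; the affine example with a deliberately triple eigenvalue at x = 0 is our own construction) compares (3.50) with $(f_\kappa(x + \alpha d) - f_\kappa(x))/\alpha$ for a small step.

```python
import numpy as np

def sym(M):
    return (M + M.T) / 2

rng = np.random.default_rng(5)
n, m, kappa = 5, 3, 2

A0 = np.diag([5.0, 3.0, 3.0, 3.0, 1.0])              # lambda_kappa = 3 has multiplicity t = 3, r = 1
Ak = [sym(rng.standard_normal((n, n))) for _ in range(m)]

def f(x):
    lam = np.linalg.eigvalsh(A0 + sum(xk * Mk for xk, Mk in zip(x, Ak)))
    return lam[::-1][:kappa].sum()

r, t = 1, 3
P1 = np.eye(n)[:, :r]
Q1 = np.eye(n)[:, r:r + t]
b = np.array([np.trace(P1.T @ M @ P1) for M in Ak])
Bk = [Q1.T @ M @ Q1 for M in Ak]

d = rng.standard_normal(m)
gamma = np.linalg.eigvalsh(sum(dk * B for dk, B in zip(d, Bk)))[::-1]
deriv = b @ d + gamma[:kappa - r].sum()              # formula (3.50)

alpha = 1e-5
fd = (f(alpha * d) - f(np.zeros(m))) / alpha         # one-sided finite difference at x = 0
assert abs(deriv - fd) < 1e-3
print("directional derivative (3.50):", deriv, " finite difference:", fd)
```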

3.6 Splitting multiple eigenvalues

Given x, it is desired either (a) to generate a descent direction for $f_\kappa$, or (b) to demonstrate that x satisfies the first-order condition for optimality. If $\kappa = r + t$ then $f_\kappa(x)$ is differentiable; consequently it is sufficient to examine the gradient, whose entries are given by (3.29). If the gradient is zero, the first-order optimality conditions hold; conversely, if it is not zero, the negative gradient provides a descent direction. We therefore consider only the nonsmooth case $\kappa < r + t$ in the remainder of this section, although the following results actually apply in general (with a slight modification in the second case). When a descent direction exists, we are not particularly interested in obtaining the steepest descent direction, since it is generally advantageous to maintain the correct multiplicity when possible (by analogy with active set methods for constrained optimization). In the first of the next three cases a descent direction is generated keeping $\lambda_\kappa$ of multiplicity t (to first order), while in the second case generation of a descent direction requires splitting the group of eigenvalues corresponding to $\lambda_\kappa$.

Case 1. $I \in \mathrm{Span}\{ B_1,\dots,B_m \}$. Solve the system
$$ \omega I - \sum_{k=1}^m d_k B_k = 0, \qquad (3.51) $$
$$ \omega(\kappa - r) + \sum_{k=1}^m d_k b_k = -1. \qquad (3.52) $$

This is a system of $t(t+1)/2 + 1$ linear equations in the $m + 1$ unknowns $\omega, d_1,\dots,d_m$. Equation (3.51) implies that the eigenvalues of $B(d)$ defined by (3.49) are all equal to $\omega$. The system is solvable since (3.51) is solvable for any $\omega$ by assumption, and (3.52) scales this solution. Hence, from equations (3.50) and (3.52), $f_\kappa'(x; d) = -1$, where the direction $d \in \mathbb{R}^m$ has components $d_1,\dots,d_m$. Note that the $-1$ on the right-hand side of (3.52) is just a normalization constant and can be replaced by any $\delta < 0$, giving $f_\kappa'(x; d) = \delta$. To first order, all the eigenvalues $\lambda_{r+1}(x),\dots,\lambda_{r+t}(x)$ decrease at the same rate along d, and $\omega$ gives a first-order estimate of the change in their common value. Case 1 holds generically if $m \ge t(t+1)/2$, that is, the generic dimension of the manifold defined by
$$ \lambda_{r+1}(x) = \cdots = \lambda_{r+t}(x) \qquad (3.53) $$
is greater than zero (see Overton and Womersley (1988), Section 2 for more detail).

Case 2. Case 1 does not apply and the span of the $m + 1$ vectors in $\mathbb{R}^{t(t+1)/2}$ associated with $I, B_1,\dots,B_m$ has the maximum dimension $t(t+1)/2$. Solve the linear system
$$ \mathrm{tr}(\tilde U) = \kappa - r, \qquad (3.54) $$
$$ -\langle B_k, \tilde U \rangle = b_k, \quad k = 1,\dots,m, \qquad (3.55) $$
for the dual matrix $\tilde U \in S_t$. Note that the trace condition (3.54) is equivalent to $\langle I, \tilde U \rangle = \kappa - r$. Since the $\{B_k\}$ may not form a linearly independent set, (3.55) may be replaced by considering only a maximal independent set of $\{B_k\}$; the system cannot be inconsistent because of the related definitions of the left and right-hand sides ($B_k$ and $b_k$). By the rank assumption, the resulting linear system is square and nonsingular, with order $t(t+1)/2$, and has a unique solution $\tilde U$. If $\tilde U$ satisfies $0 \le \tilde U \le I$ then $0 \in \partial f_\kappa(x)$, so x satisfies the first-order necessary conditions for a minimum. If these inequalities on $\tilde U$ are not satisfied then the following result shows how to generate a descent direction.

Theorem 3.12 Suppose (3.54) and (3.55) are satisfied but $0 \notin \partial f_\kappa(x)$, so $\tilde U$ has an eigenvalue $\tau$ outside $[0, 1]$. Let $z \in \mathbb{R}^t$ be the corresponding normalized eigenvector of $\tilde U$. Choose $\beta \in \mathbb{R}$ so that $\beta < 0$ if $\tau > 1$ and $\beta > 0$ if $\tau < 0$. Solve
$$ \omega I - \sum_{k=1}^m d_k B_k = \beta z z^T. \qquad (3.56) $$
Then $d = [d_1, \dots, d_m]^T$ is a descent direction.

Proof: The linear system (3.56) is solvable by hypothesis, although d is unique only if the $\{B_k\}$ are independent. Note that the coefficient matrix of the left-hand side of (3.56) is the transpose of that for the system (3.54), (3.55). Taking the inner product of (3.56) with $\tilde U$ gives
$$ \omega\, \mathrm{tr}(\tilde U) - \sum_{k=1}^m d_k \langle B_k, \tilde U \rangle = \beta \langle z z^T, \tilde U \rangle, \qquad (3.57) $$
so from (3.54), (3.55) and the spectral decomposition of $\tilde U$,
$$ \omega(\kappa - r) + b^T d = \beta \tau. \qquad (3.58) $$
From (3.56), $B(d) = \omega I - \beta z z^T$ has eigenvalues $\omega - \beta, \omega, \dots, \omega$. If $\beta < 0$ the sum of the $\kappa - r$ largest eigenvalues of $B(d)$ is $\omega(\kappa - r) - \beta$. Hence (3.50) and (3.58) give
$$ f_\kappa'(x; d) = \beta(\tau - 1). \qquad (3.59) $$
Thus if $\tau > 1$, choosing $\beta < 0$ and solving (3.56) produces a descent direction. If $\beta > 0$ then, since $\kappa < r + t$, the sum of the $\kappa - r$ largest eigenvalues of $B(d)$ is $\omega(\kappa - r)$. Then (3.50) and (3.58) give
$$ f_\kappa'(x; d) = \beta \tau. \qquad (3.60) $$
Therefore, if $\tau < 0$, choosing $\beta > 0$ and solving (3.56) produces a descent direction.

Remark: For the case $\kappa = 1$ this result was given in Overton (1988) and Overton and Womersley (1988). As noted earlier, the condition $\tilde U \le I$ is not required when $\kappa = 1$.

Remark: Progress is made by splitting the multiple eigenvalue while maintaining multiplicity $t - 1$ to first order, i.e., the first-order change in all the eigenvalues but one in the cluster is the same. This is analogous to moving off a single active nonlinear constraint in the context of constrained optimization. The dual matrix $\tilde U$ provides the information which leads to the generation of a descent direction, just as negative Lagrange multipliers provide similar information in constrained optimization. The distinction between the cases $\tau < 0$ and $\tau > 1$ is as follows: if $\tau < 0$, one eigenvalue in the group of multiplicity t is separated from the others by a reduction, reducing the approximate multiplicity but leaving the number of eigenvalues larger than $\lambda_\kappa$, to first order, unchanged. If $\tau > 1$, one eigenvalue is separated from the others by an increase, again reducing the approximate multiplicity but increasing the number of larger eigenvalues (to first order). In either case the theorem guarantees an overall reduction in $f_\kappa$.

Case 2 applies generically if $m + 1 = t(t+1)/2$, i.e. the generic dimension of the manifold defined by (3.53) consists of a single point. It also applies if x minimizes $f_\kappa$ on this manifold; see Overton (1988) for a further explanation.

Case 3. Neither of Cases 1 and 2 applies. In this case degeneracy is said to occur. Generation of a descent direction is not straightforward.

3.7 Model algorithms

Practical algorithms for minimizing $f_\kappa(x)$ based on successive linear or quadratic programming may be defined to fully exploit the structure of the generalized gradient of $f_\kappa$. Such algorithms have been described and tested extensively in the case $\kappa = 1$ by Overton (1988, 1990). Suppose that $\bar x$ is a (local) minimizer of $f_\kappa(x)$, with corresponding values $\bar r$ and $\bar t$ defined by (3.8). The model algorithm must use estimates, say r and t, of $\bar r$ and $\bar t$. Note that, in general, the matrix iterates generated by the algorithm will have eigenvalues which are strictly multiple only in the limit at $x = \bar x$. The simplest way to estimate r and t is to use an eigenvalue separation tolerance. The basic step of a model algorithm for minimizing $f_\kappa(x)$ then becomes the solution of the following quadratic program:
$$ \min_{d \in \mathbb{R}^m,\ \omega \in \mathbb{R}} \quad \omega(\kappa - r) + b^T d + \tfrac12 d^T H d \qquad (3.61) $$
subject to
$$ \omega I - \sum_{k=1}^m d_k Q_1^T A_k Q_1 = \mathrm{diag}(\lambda_{r+1}, \dots, \lambda_{r+t}). \qquad (3.62) $$
Here H is some positive semi-definite matrix and all the quantities $Q_1$, $A_k$, $\lambda_i$ and b (defined in (3.48)) are evaluated at the current point x. The next trial point is $x + d$, and $\omega$ gives an estimate of $\lambda_\kappa(x + d)$. Briefly, equation (3.62) represents the appropriate linearization of the nonlinear system
$$ \lambda_{r+1}(x + d) = \cdots = \lambda_{r+t}(x + d) = \omega; \qquad (3.63) $$
since this system is not differentiable, this needs justification (Friedland, Nocedal and Overton (1987)). The first two terms in the objective function (3.61) represent a linearization of $f_\kappa(x + d)$, while the third term may be used to incorporate second derivative information. The Lagrange multipliers corresponding to the $t(t+1)/2$ equality constraints (3.62) make up a dual matrix estimate of $\tilde U$. Inequalities may be included in the quadratic program, to ensure that linearizations of the first r eigenvalues (or at least their average) are no smaller than $\omega$, and that the linearizations of $\lambda_{r+t+1}, \dots$ are no larger than $\omega$. A trust region constraint may be used to ensure reduction of the objective function at the trial point. It is possible to choose H so that local quadratic convergence takes place; alternatively, H can be set to zero for a first-order method using a linear programming subproblem. Further details are beyond the scope of this paper, but see Overton (1990) for the case $\kappa = 1$.
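A minimal sketch of the subproblem (3.61)-(3.62) follows (NumPy; this is our own equality-constrained QP solve via a KKT system, with H = I and made-up data at the current point, not the authors' implementation). Each entry of the t by t constraint (3.62) on or above the diagonal contributes one linear equality in the variables (ω, d).

```python
import numpy as np

def sym(M):
    return (M + M.T) / 2

rng = np.random.default_rng(6)
m, t, kappa, r = 4, 2, 3, 1                        # current multiplicity estimate t = 2, kappa - r = 2

# Data evaluated at the current point x (made-up values, cf. (3.48)):
b = rng.standard_normal(m)                         # b_k = tr(P1^T A_k P1)
Bk = [sym(rng.standard_normal((t, t))) for _ in range(m)]   # B_k = Q1^T A_k Q1
lam_cluster = np.array([2.03, 1.98])               # current lambda_{r+1}, ..., lambda_{r+t}
H = np.eye(m)                                      # positive definite model Hessian

# Variables v = (omega, d).  Entries (i, j), i <= j, of (3.62) give equalities C v = h.
pairs = [(i, j) for i in range(t) for j in range(i, t)]
C = np.zeros((len(pairs), 1 + m)); h = np.zeros(len(pairs))
for row, (i, j) in enumerate(pairs):
    C[row, 0] = 1.0 if i == j else 0.0             # coefficient of omega in (3.62)
    C[row, 1:] = [-B[i, j] for B in Bk]            # coefficients of d_1, ..., d_m
    h[row] = lam_cluster[i] if i == j else 0.0     # right-hand side diag(lambda_{r+1}, ...)

# Objective (3.61) written as 0.5 v^T Qm v + c^T v.
Qm = np.zeros((1 + m, 1 + m)); Qm[1:, 1:] = H
c = np.concatenate([[float(kappa - r)], b])

# Solve the equality-constrained QP via its KKT system.
p = len(pairs)
KKT = np.block([[Qm, C.T], [C, np.zeros((p, p))]])
sol = np.linalg.solve(KKT, np.concatenate([-c, h]))
omega, d, multipliers = sol[0], sol[1:1 + m], sol[1 + m:]
# The multipliers of (3.62) carry the dual matrix information described in the text.
print("omega =", omega, "\nd =", d, "\nconstraint multipliers =", multipliers)
```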

4 Sum of the largest eigenvalues in absolute value

We are now interested in functions of the form
$$ g_\kappa(A) = \sum_{i=1}^{\kappa} |\mu_i| \qquad (4.1) $$
where the eigenvalues $\mu_i$ of A are ordered by
$$ |\mu_1| \ge \cdots \ge |\mu_n|. \qquad (4.2) $$
Let $\sigma = |\mu_\kappa|$, the $\kappa$th largest eigenvalue modulus. One approach is to apply the results of the previous section to minimizing the sum of the $\kappa$ largest eigenvalues of
$$ \begin{bmatrix} A(x) & 0 \\ 0 & -A(x) \end{bmatrix}. $$
However, not only is the size of the problem then doubled, but the structure of the subdifferential is lost. The techniques in Section 4 are related to the well known idea of representing a scalar $\alpha$ by its positive part $\alpha_+ = \max\{0, \alpha\}$ and its negative part $\alpha_- = \max\{0, -\alpha\}$, so $\alpha = \alpha_+ - \alpha_-$ and $|\alpha| = \alpha_+ + \alpha_-$. For $w \in \mathbb{R}^n$ the vectors $w_+$ and $w_-$ are defined componentwise. This is discussed further in the Appendix.
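A one-line numerical illustration of this componentwise splitting (NumPy; the vector is our own example):

```python
import numpy as np

w = np.array([1.5, -0.3, 0.0, -2.0])
w_plus, w_minus = np.maximum(w, 0.0), np.maximum(-w, 0.0)   # componentwise positive/negative parts
assert np.allclose(w, w_plus - w_minus)
assert np.allclose(np.abs(w), w_plus + w_minus)
```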

4.1 A max characterization

This section establishes a max characterization of the sum of the $\kappa$ largest, in absolute value, eigenvalues of a matrix, and the elements which achieve the maximum in this characterization. Let $\kappa \in \{1,\dots,n\}$, and define
$$ \Omega_{n\kappa} = \{\, W \in S_n : W = U - V, \text{ where } U, V \in S_n,\ 0 \le U \le I,\ 0 \le V \le I,\ \mathrm{tr}(U) + \mathrm{tr}(V) = \kappa \,\} \qquad (4.3) $$
and
$$ \omega_{n\kappa} = \{\, w \in \mathbb{R}^n : w = u - v, \text{ where } u, v \in \mathbb{R}^n,\ 0 \le u \le e,\ 0 \le v \le e, \text{ and } e^T(u + v) = \kappa \,\}. \qquad (4.4) $$

Lemma 4.1 $\Omega_{n\kappa}$ and $\omega_{n\kappa}$ are compact convex sets, $\Omega_{n\kappa}$ is invariant under orthogonal similarity transformations (i.e. $W \in \Omega_{n\kappa} \iff Z^T W Z \in \Omega_{n\kappa}$ for any $Z \in O_{n,n}$), and $\Omega_{n\kappa}$ and $\omega_{n\kappa}$ are related by
$$ \Omega_{n\kappa} = \{\, W \in S_n : W = U - V,\ U = Z D Z^T,\ V = Y E Y^T \text{ where } Z, Y \in O_{n,n},\ D = \mathrm{diag}(u),\ E = \mathrm{diag}(v),\ 0 \le u \le e,\ 0 \le v \le e \text{ and } e^T(u + v) = \kappa \,\} \qquad (4.5) $$
and
$$ \omega_{n\kappa} = \{\, w \in \mathbb{R}^n : w_i = W_{ii} \text{ for } i = 1,\dots,n \text{ where } W \in \Omega_{n\kappa} \,\}. \qquad (4.6) $$

Proof: It is easily verified that $\Omega_{n\kappa}$ and $\omega_{n\kappa}$ are compact convex sets. A spectral decomposition of the matrices $U, V \in S_n$ in the definition of $\Omega_{n\kappa}$ shows that
$$ \Omega_{n\kappa} = \{\, W \in S_n : W = Z D Z^T - Y E Y^T \text{ where } Z, Y \in O_{n,n},\ D, E \in D_n,\ 0 \le D \le I,\ 0 \le E \le I \text{ and } \mathrm{tr}(D) + \mathrm{tr}(E) = \kappa \,\}, \qquad (4.7) $$
which establishes (4.5) and the invariance of $\Omega_{n\kappa}$ under orthogonal similarity transformations. Let $\Gamma$ denote the set on the right-hand side of equation (4.6). Choosing $Z = Y = I$, $D = \mathrm{diag}(w_+)$ and $E = \mathrm{diag}(w_-)$ in (4.7), for any $w \in \omega_{n\kappa}$, shows that $\omega_{n\kappa} \subseteq \Gamma$. To establish the reverse inclusion, let $w \in \Gamma$, i.e. $w_i = W_{ii}$, where $W = U - V$ with U, V satisfying the conditions in (4.3). Define u and v by $u_i = U_{ii}$, $v_i = V_{ii}$. Then $e^T(u + v) = \mathrm{tr}(U) + \mathrm{tr}(V) = \kappa$, and $0 \le u \le e$, $0 \le v \le e$ by (4.3), i.e. $w \in \omega_{n\kappa}$.

To characterize the elements which achieve the maximum in the following results requires information about the multiplicity of the eigenvalues of A. Although the ordering (4.2) is useful to give the simple definition (4.1) for $g_\kappa(A)$, as well as $\sigma = |\mu_\kappa|$, we shall now revert to the standard ordering. Consider the case when $\sigma > 0$ and let the eigenvalues of A be written
$$ \lambda_1 \ge \cdots \ge \lambda_{r_1} > \lambda_{r_1+1} = \cdots = \sigma = \cdots = \lambda_{r_1+t_1} > \lambda_{r_1+t_1+1} \ge \cdots \ge \lambda_{n-r_2-t_2} > \lambda_{n-r_2-t_2+1} = \cdots = -\sigma = \cdots = \lambda_{n-r_2} > \lambda_{n-r_2+1} \ge \cdots \ge \lambda_n, \qquad (4.8) $$
where $r_1, t_1, r_2, t_2$ are nonnegative integers. The eigenvalue equal to $\sigma$ has multiplicity $t_1$ and the eigenvalue equal to $-\sigma$ has multiplicity $t_2$. By assumption there is at least one eigenvalue with modulus $\sigma$, so $t_1 + t_2 \ge 1$. The number of eigenvalues greater than $\sigma$ is $r_1$, while the number of eigenvalues less than $-\sigma$ is $r_2$. Note that by definition
$$ r_1 + r_2 + 1 \le \kappa \le r_1 + t_1 + r_2 + t_2. \qquad (4.9) $$
Thus $r_1 = r_2 = 0$ if $\kappa = 1$. Also, if $t_1 = 1$ and $t_2 = 0$, or $t_1 = 0$ and $t_2 = 1$, then $\kappa = r_1 + r_2 + 1$. We have $\lambda_i > 0$ for $i = 1,\dots,r_1 + t_1$ and $\lambda_i < 0$ for $i = n - r_2 - t_2 + 1,\dots,n$. Thus
$$ g_\kappa(A) = \sum_{i=1}^{\kappa} |\mu_i| = \sum_{i=1}^{r_1} \lambda_i - \sum_{i=n-r_2+1}^{n} \lambda_i + \sigma(\kappa - r_1 - r_2). \qquad (4.10) $$

Example 4.2 Suppose $n = 11$ and $\lambda^T = (5, 4, 4, 4, 2, -1, -4, -4, -6, -6, -7)$. If $\kappa = 2$ or 3 then $\sigma = 6$, $r_1 = 0$, $t_1 = 0$, $r_2 = 1$ and $t_2 = 2$. If $\kappa = 5$, 6, or 7 then $\sigma = 4$, $r_1 = 1$, $t_1 = 3$, $r_2 = 3$ and $t_2 = 2$.

If $\sigma = 0$ then the appropriate ordering is
$$ \lambda_1 \ge \cdots \ge \lambda_{r_1} > \lambda_{r_1+1} = \cdots = 0 = \cdots = \lambda_{r_1+t} > \lambda_{n-r_2+1} \ge \cdots \ge \lambda_n, \qquad (4.11) $$
with $n - r_2 = r_1 + t$. Only the situation where $\sigma > 0$ will be considered in the rest of this section. The modifications required when $\sigma = 0$ can be derived from the ordering (4.11). We first give two lemmas which depend only on the orderings (4.2), (4.8) for a single set of n real numbers, and not on the fact that these are eigenvalues of a matrix A.

Lemma 4.3 If the vector $\lambda \in \mathbb{R}^n$ is ordered as in (4.8) then
$$ \max_{w \in \omega_{n\kappa}} \lambda^T w = \sum_{i=1}^{\kappa} |\mu_i| $$
and
$$ \mathrm{argmax}\{\, \lambda^T w : w \in \omega_{n\kappa} \,\} = \{\, w \in \mathbb{R}^n : w_i = 1 \text{ for } i = 1,\dots,r_1;\ 0 \le w_i \le 1 \text{ for } i = r_1+1,\dots,r_1+t_1;\ w_i = 0 \text{ for } i = r_1+t_1+1,\dots,n-r_2-t_2;\ -1 \le w_i \le 0 \text{ for } i = n-r_2-t_2+1,\dots,n-r_2;\ w_i = -1 \text{ for } i = n-r_2+1,\dots,n;\ \text{and } \sum_{i=r_1+1}^{r_1+t_1} w_i - \sum_{i=n-r_2-t_2+1}^{n-r_2} w_i = \kappa - r_1 - r_2 \,\}. \qquad (4.12) $$

Proof: The result follows directly from (4.2), (4.8) and the definition (4.4) of $\omega_{n\kappa}$.

Lemma 4.4 Let $\Lambda = \mathrm{diag}(\lambda)$. Then
$$ \max_{W \in \Omega_{n\kappa}} \langle \Lambda, W \rangle = \sum_{i=1}^{\kappa} |\mu_i| \qquad (4.13) $$
and
$$ \mathrm{argmax}\{\, \langle \Lambda, W \rangle : W \in \Omega_{n\kappa} \,\} = \{\, W \in S_n : W = U - V, \text{ where } U = \mathrm{diag}(I, \tilde U, 0, 0, 0),\ \tilde U \in S_{t_1},\ 0 \le \tilde U \le I;\ V = \mathrm{diag}(0, 0, 0, \tilde V, I),\ \tilde V \in S_{t_2},\ 0 \le \tilde V \le I;\ \mathrm{tr}(\tilde U) + \mathrm{tr}(\tilde V) = \kappa - r_1 - r_2 \,\}. \qquad (4.14) $$
Here the diagonal blocks of U and V have dimensions $r_1$, $t_1$, $n - r_1 - t_1 - t_2 - r_2$, $t_2$ and $r_2$ respectively.

Proof: As $\Lambda = \mathrm{diag}(\lambda)$,
$$ \langle \Lambda, W \rangle = \sum_{i=1}^n \lambda_i W_{ii}. $$
Equation (4.13) follows from Lemmas 4.1 and 4.3. If $W^*$ is any element of the right-hand side of (4.14) then from (4.8) and (4.10)
$$ \langle \Lambda, W^* \rangle = \sum_{i=1}^{r_1} \lambda_i + \sigma\, \mathrm{tr}(\tilde U) + \sigma\, \mathrm{tr}(\tilde V) - \sum_{i=n-r_2+1}^{n} \lambda_i = \sum_{i=1}^{\kappa} |\mu_i|. \qquad (4.15) $$
Conversely, let $W^* \in \mathrm{argmax}\{\, \langle \Lambda, W \rangle : W \in \Omega_{n\kappa} \,\}$. Then $w = (W^*_{11}, \dots, W^*_{nn})^T \in \omega_{n\kappa}$ satisfies $\lambda^T w = \sum_{i=1}^{\kappa} |\mu_i|$, and therefore also satisfies the properties given on the right-hand side of (4.12). Furthermore, $W^*$ has the representation $W^* = U^* - V^*$, where $0 \le U^* \le I$, $0 \le V^* \le I$, $\mathrm{tr}(U^*) + \mathrm{tr}(V^*) = \kappa$, so
$$ \begin{aligned} & U^*_{ii} = 1,\ V^*_{ii} = 0 && \text{for } i = 1,\dots,r_1; \\ & 0 \le U^*_{ii} \le 1,\ V^*_{ii} = 0 && \text{for } i = r_1+1,\dots,r_1+t_1; \\ & U^*_{ii} = 0,\ V^*_{ii} = 0 && \text{for } i = r_1+t_1+1,\dots,n-r_2-t_2; \\ & U^*_{ii} = 0,\ 0 \le V^*_{ii} \le 1 && \text{for } i = n-r_2-t_2+1,\dots,n-r_2; \\ & U^*_{ii} = 0,\ V^*_{ii} = 1 && \text{for } i = n-r_2+1,\dots,n; \end{aligned} \qquad \text{and} \qquad \sum_{i=r_1+1}^{r_1+t_1} U^*_{ii} + \sum_{i=n-r_2-t_2+1}^{n-r_2} V^*_{ii} = \kappa - r_1 - r_2. \qquad (4.16) $$
Partition the rows and columns of $W^* = U^* - V^*$ into blocks, so
$$ U^* = \begin{bmatrix} C_{11} & C_{12} & C_{13} & C_{14} & C_{15} \\ C_{12}^T & C_{22} & C_{23} & C_{24} & C_{25} \\ C_{13}^T & C_{23}^T & C_{33} & C_{34} & C_{35} \\ C_{14}^T & C_{24}^T & C_{34}^T & C_{44} & C_{45} \\ C_{15}^T & C_{25}^T & C_{35}^T & C_{45}^T & C_{55} \end{bmatrix}, \qquad V^* = \begin{bmatrix} E_{11} & E_{12} & E_{13} & E_{14} & E_{15} \\ E_{12}^T & E_{22} & E_{23} & E_{24} & E_{25} \\ E_{13}^T & E_{23}^T & E_{33} & E_{34} & E_{35} \\ E_{14}^T & E_{24}^T & E_{34}^T & E_{44} & E_{45} \\ E_{15}^T & E_{25}^T & E_{35}^T & E_{45}^T & E_{55} \end{bmatrix}, $$
where the dimensions of the symmetric matrices $C_{ii}$ and $E_{ii}$ for $i = 1,\dots,5$ are respectively $r_1$, $t_1$, $n - r_1 - t_1 - r_2 - t_2$, $t_2$ and $r_2$. As $W^* \in \Omega_{n\kappa}$,
$$ 0 \le C_{ii} \le I, \quad 0 \le E_{ii} \le I, \quad i = 1,\dots,5, \qquad \text{and} \qquad \sum_{i=1}^5 \mathrm{tr}(C_{ii}) + \sum_{i=1}^5 \mathrm{tr}(E_{ii}) = \kappa. $$
As in Section 3, the proof is completed using the basic results (3.16) and (3.17) for positive semi-definite matrices. From (4.16), $C_{11} = I$, as $C_{11} \le I$ and the diagonal elements of $C_{11}$ are all 1. Then, as $U^* \le I$, $C_{11} = I$ implies that $C_{1j} = 0$ for all $j \ne 1$. For $i = 3, 4, 5$, $C_{ii} = 0$, as $C_{ii} \ge 0$ and the diagonal elements of $C_{ii}$ are all zero. Then, for $i = 3, 4, 5$, $U^* \ge 0$ and $C_{ii} = 0$ imply that $C_{ij} = 0$ for all $j \ne i$. Similarly $E_{ii} = 0$ for $i = 1, 2, 3$ and $E_{55} = I$. Then $0 \le V^* \le I$ implies that $E_{ij} = 0$ for all $i \ne j$. The matrices $U^*$ and $V^*$ now reduce to those given in (4.14). The trace condition comes directly from (4.16).

Now consider sums of the eigenvalues of A, and recall that $Q \in O_{n,n}$ is a matrix of eigenvectors for A satisfying

$$ Q^T A Q = \Lambda. \qquad (4.17) $$
Let $P_1 \in O_{n,r_1}$, $P_2 \in O_{n,r_2}$ be the matrices consisting of the first $r_1$ and last $r_2$ columns of Q respectively. Also let $Q_1 \in O_{n,t_1}$ be the matrix consisting of columns $r_1+1,\dots,r_1+t_1$ of Q, and let $Q_2 \in O_{n,t_2}$ be the matrix consisting of columns $n-r_2-t_2+1,\dots,n-r_2$ of Q. By (4.8) we have
$$ P_1^T A P_1 = \mathrm{diag}(\lambda_1,\dots,\lambda_{r_1}), \qquad P_2^T A P_2 = \mathrm{diag}(\lambda_{n-r_2+1},\dots,\lambda_n), \qquad (4.18) $$
and
$$ Q_1^T A Q_1 = \sigma I, \qquad Q_2^T A Q_2 = -\sigma I. \qquad (4.19) $$

Theorem 4.5
$$ \max_{W \in \Omega_{n\kappa}} \langle A, W \rangle = g_\kappa(A), \qquad (4.20) $$
with
$$ \mathrm{argmax}\{\, \langle A, W \rangle : W \in \Omega_{n\kappa} \,\} = \{\, W \in S_n : W = U - V, \text{ where } U, V \in S_n,\ U = P_1 P_1^T + Q_1 \tilde U Q_1^T,\ V = P_2 P_2^T + Q_2 \tilde V Q_2^T,\ \tilde U \in S_{t_1},\ 0 \le \tilde U \le I,\ \tilde V \in S_{t_2},\ 0 \le \tilde V \le I, \text{ and } \mathrm{tr}(\tilde U) + \mathrm{tr}(\tilde V) = \kappa - r_1 - r_2 \,\}. \qquad (4.21) $$

Proof: For any $W \in \Omega_{n\kappa}$ equation (4.17) and the properties of the Frobenius inner product imply that
$$ \langle A, W \rangle = \langle Q \Lambda Q^T, W \rangle = \langle \Lambda, Q^T W Q \rangle, \qquad (4.22) $$
with $Q^T W Q \in \Omega_{n\kappa}$ as $\Omega_{n\kappa}$ is invariant under orthogonal similarity transformations. Hence
$$ \max_{W \in \Omega_{n\kappa}} \langle A, W \rangle = \max_{W \in \Omega_{n\kappa}} \langle \Lambda, W \rangle = g_\kappa(A), $$
where the second equality follows from Lemma 4.4. From (4.22) and the invariance of $\Omega_{n\kappa}$ to orthogonal transformations,
$$ \mathrm{argmax}\{\, \langle A, W \rangle : W \in \Omega_{n\kappa} \,\} = \{\, Q W^* Q^T : W^* \in \Gamma \,\} $$
where $\Gamma = \mathrm{argmax}\{\, \langle \Lambda, W \rangle : W \in \Omega_{n\kappa} \,\}$. The proof is completed by applying Lemma 4.4, since
$$ Q W^* Q^T = Q\, \mathrm{diag}(I, \tilde U, 0, 0, 0)\, Q^T - Q\, \mathrm{diag}(0, 0, 0, \tilde V, I)\, Q^T = P_1 P_1^T + Q_1 \tilde U Q_1^T - Q_2 \tilde V Q_2^T - P_2 P_2^T. $$

Remark: If $\kappa = r_1 + r_2 + t_1 + t_2$ then $\tilde U = I$, $\tilde V = I$ and
$$ \mathrm{argmax}\{\, \langle A, W \rangle : W \in \Omega_{n\kappa} \,\} = P_1 P_1^T + Q_1 Q_1^T - Q_2 Q_2^T - P_2 P_2^T. \qquad (4.23) $$
The matrix achieving the maximum in (4.13) is unique only in this case. It is precisely the case when $g_\kappa(A)$ is a smooth function of the elements of A.

Remark: If $t_1 = 1$, $t_2 = 1$, and $\kappa = r_1 + r_2 + 1$, then $\tilde U = \alpha$ and $\tilde V = \beta$, where $\alpha, \beta \in \mathbb{R}$ satisfy $\alpha, \beta \in [0, 1]$ and $\alpha + \beta = \kappa - r_1 - r_2 = 1$. Also
$$ \mathrm{argmax}\{\, \langle A, W \rangle : W \in \Omega_{n\kappa} \,\} = \{\, P_1 P_1^T + \alpha Q_1 Q_1^T - \beta Q_2 Q_2^T - P_2 P_2^T \,\}; \qquad (4.24) $$
here $Q_1$ and $Q_2$ have only one column. This is precisely the case when $g_\kappa(A)$ is nonsmooth but is the maximum of a finite number of smooth functions ($\lambda_{r_1+1}$ and $-\lambda_{n-r_2}$).

Remark: When $t_1 > 1$ or $t_2 > 1$, all the flexibility in the choices of $Q_1$ and $Q_2$ in (4.19) is absorbed into the matrices $\tilde U$ and $\tilde V$. It makes no difference if any of the eigenvalues $\lambda_1,\dots,\lambda_{r_1}$ or $\lambda_{n-r_2+1},\dots,\lambda_n$ have multiplicity greater than 1, as all these multiple eigenvalues are included in $g_\kappa(A)$ and the corresponding columns of $P_1$ or $P_2$ can be any orthonormal basis for the corresponding eigenspace.
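By analogy with the check after Theorem 3.4, the following sketch (NumPy; the example spectrum, the chosen Ũ and Ṽ, and the random test elements are all our own constructions) verifies (4.20)-(4.21) on a matrix with multiple eigenvalues at ±σ.

```python
import numpy as np

rng = np.random.default_rng(7)
n, kappa = 6, 4

lam = np.array([5.0, 3.0, 3.0, 0.0, -3.0, -4.0])     # sigma = 3, r1 = 1, t1 = 2, t2 = 1, r2 = 1
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(lam) @ Q.T
g_kappa = np.sort(np.abs(lam))[::-1][:kappa].sum()   # = 15

r1, t1, t2, r2 = 1, 2, 1, 1
P1, Q1 = Q[:, :r1], Q[:, r1:r1 + t1]
Q2, P2 = Q[:, n - r2 - t2:n - r2], Q[:, n - r2:]

# Element of the argmax set (4.21): tr(Utilde) + tr(Vtilde) = kappa - r1 - r2 = 2.
Vt, _ = np.linalg.qr(rng.standard_normal((t1, t1)))
Utilde = Vt @ np.diag([0.8, 0.5]) @ Vt.T             # 0 <= Utilde <= I, tr = 1.3
Vtilde = np.array([[0.7]])                           # 0 <= Vtilde <= I, tr = 0.7
W = P1 @ P1.T + Q1 @ Utilde @ Q1.T - Q2 @ Vtilde @ Q2.T - P2 @ P2.T
assert np.isclose(np.trace(A @ W), g_kappa)

# Random elements of Omega_{n,kappa} never exceed g_kappa (4.20).
for _ in range(200):
    Z1, _ = np.linalg.qr(rng.standard_normal((n, 2)))
    Z2, _ = np.linalg.qr(rng.standard_normal((n, 2)))
    W_rand = Z1 @ Z1.T - Z2 @ Z2.T                   # U = Z1 Z1^T, V = Z2 Z2^T, tr(U) + tr(V) = 4
    assert np.trace(A @ W_rand) <= g_kappa + 1e-8
print("Theorem 4.5 verified on a random example")
```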

Corollary 4.6 The function $g_\kappa(A) : S_n \to \mathbb{R}$ is convex and its subdifferential $\partial g_\kappa(A)$ is the nonempty compact convex set
$$ \partial g_\kappa(A) = \{\, W \in S_n : \text{there exist } \tilde U \in S_{t_1} \text{ with } 0 \le \tilde U \le I \text{ and } \tilde V \in S_{t_2} \text{ with } 0 \le \tilde V \le I,\ \mathrm{tr}(\tilde U) + \mathrm{tr}(\tilde V) = \kappa - r_1 - r_2,\ W = P_1 P_1^T + Q_1 \tilde U Q_1^T - Q_2 \tilde V Q_2^T - P_2 P_2^T \,\}. \qquad (4.25) $$

Proof: For any $W \in \Omega_{n\kappa}$ the inner product $\langle A, W \rangle$ is a linear function of $A \in S_n$, so from Rockafellar (1970) $g_\kappa(A)$ is a convex function of A. Moreover, the subdifferential is
$$ \partial g_\kappa(A) = \mathrm{conv}\{\, W \in S_n : \langle A, W \rangle = g_\kappa(A) \,\}. $$
The result follows from Theorem 4.5 as the set on the right-hand side of (4.25) is already convex.

As in Section 3, there are various equivalent forms for the argmax involving a convex hull operation, which are discussed in the Appendix. These lead to equivalent, but computationally less useful, expressions for $\partial g_\kappa(A)$.

4.2 The generalized gradient

As in Section 3.2, let $A(x) : \mathbb{R}^m \to S_n$ be a continuously differentiable function with partial derivatives
$$ A_k(x) \equiv \frac{\partial A(x)}{\partial x_k} \quad \text{for } k = 1,\dots,m. $$
This section is concerned with finding a computationally useful characterization of the generalized gradient of the function $g_\kappa(x) = g_\kappa(A(x))$. The eigenvalues of A(x) are, as before, denoted by both (4.2) and (4.8), with $r_1, t_1, r_2, t_2$ now dependent on x, and with the corresponding eigenvectors satisfying (4.18), (4.19). Thus the explicit dependence of the eigensystem of A(x) on x is generally omitted.

Theorem 4.7 The function $g_\kappa(x)$ is locally Lipschitz, subdifferentially regular, and its generalized gradient is the nonempty compact convex set
$$ \partial g_\kappa(x) = \{\, w \in \mathbb{R}^m : \exists\, \tilde U \in S_{t_1},\ \tilde V \in S_{t_2} \text{ with } 0 \le \tilde U \le I,\ 0 \le \tilde V \le I,\ \mathrm{tr}(\tilde U) + \mathrm{tr}(\tilde V) = \kappa - r_1 - r_2, \text{ and for } k = 1,\dots,m \qquad (4.26) $$
$$ w_k = \mathrm{tr}(P_1^T A_k(x) P_1) + \langle Q_1^T A_k(x) Q_1, \tilde U \rangle - \mathrm{tr}(P_2^T A_k(x) P_2) - \langle Q_2^T A_k(x) Q_2, \tilde V \rangle \,\}. \qquad (4.27) $$

Proof: By the Clarke chain rule, $g_\kappa(x)$ is locally Lipschitz, subdifferentially regular, and
$$ \partial g_\kappa(x) = \{\, w \in \mathbb{R}^m : w_k = \langle A_k(x), W \rangle,\ k = 1,\dots,m, \text{ where } W \in \partial g_\kappa(A(x)) \,\}. $$
Corollary 4.6 and the properties of the inner product complete the proof.

Corollary 4.8 If $\kappa = r_1 + t_1 + r_2 + t_2$ then the function $g_\kappa(x)$ is differentiable at x with
$$ \frac{\partial g_\kappa(x)}{\partial x_k} = \mathrm{tr}(P_1^T A_k(x) P_1) + \mathrm{tr}(Q_1^T A_k(x) Q_1) - \mathrm{tr}(Q_2^T A_k(x) Q_2) - \mathrm{tr}(P_2^T A_k(x) P_2). \qquad (4.28) $$

Proof: This is precisely the case when (4.21), and hence $\partial g_\kappa(x)$, is a singleton. Hence $g_\kappa(x)$ is differentiable, and (4.27) reduces to (4.28).

4.3 Necessary conditions

If x is a local minimizer of $g_\kappa$ then the standard necessary condition $0 \in \partial g_\kappa(x)$ and Theorem 4.7 imply there exist $\tilde U \in S_{t_1}$ and $\tilde V \in S_{t_2}$ such that
$$ 0 \le \tilde U \le I, \qquad 0 \le \tilde V \le I, \qquad (4.29) $$
$$ \mathrm{tr}(\tilde U) + \mathrm{tr}(\tilde V) = \kappa - r_1 - r_2, \qquad (4.30) $$
and, for $k = 1,\dots,m$,
$$ 0 = \mathrm{tr}(P_1^T A_k(x) P_1) + \langle Q_1^T A_k(x) Q_1, \tilde U \rangle - \mathrm{tr}(P_2^T A_k(x) P_2) - \langle Q_2^T A_k(x) Q_2, \tilde V \rangle. \qquad (4.31) $$
These conditions are useful computationally, as one can relax the inequalities on $\tilde U$ and $\tilde V$ and solve (4.30) and (4.31) for $\tilde U$ and $\tilde V$. This requires solving a system of $m + 1$ linear equations for the $t_1(t_1+1)/2 + t_2(t_2+1)/2$ unknowns in the symmetric matrices $\tilde U$ and $\tilde V$. When $\kappa = 1$ these are the same conditions as those used in Overton (1988). If the inequalities $0 \le \tilde U \le I$ or $0 \le \tilde V \le I$ are not satisfied then this information can be used to generate a descent direction; see Section 4.6. If $g_\kappa(x)$ is convex (e.g. if A(x) is affine) then equations (4.29), (4.30) and (4.31) are both necessary and sufficient for x to be a minimizer of $g_\kappa(x)$.
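These equations have the same structure as (3.36)-(3.37), with the unknown split into the two symmetric blocks $\tilde U$ and $\tilde V$. A minimal NumPy illustration of the relaxed solve follows (our own made-up data, with the system made square so a generic exact solution exists).

```python
import numpy as np

def sym(M):
    return (M + M.T) / 2

def sym_basis(t):
    """Basis E_ij (i <= j) of the space S_t of t x t symmetric matrices."""
    basis = []
    for i in range(t):
        for j in range(i, t):
            E = np.zeros((t, t)); E[i, j] = E[j, i] = 1.0
            basis.append(E)
    return basis

rng = np.random.default_rng(8)
m, t1, t2 = 5, 2, 2                                   # m + 1 = 6 equations, 3 + 3 = 6 unknowns
kappa_r1_r2 = 2.0                                     # value of kappa - r1 - r2
b = rng.standard_normal(m)                            # b_k from (4.40) (made-up data)
B1 = [sym(rng.standard_normal((t1, t1))) for _ in range(m)]   # Q1^T A_k Q1 (made-up data)
B2 = [sym(rng.standard_normal((t2, t2))) for _ in range(m)]   # Q2^T A_k Q2 (made-up data)

E1, E2 = sym_basis(t1), sym_basis(t2)
rows = [[np.sum(B1[k] * E) for E in E1] + [-np.sum(B2[k] * E) for E in E2] for k in range(m)]
rows.append([np.trace(E) for E in E1] + [np.trace(E) for E in E2])      # trace condition (4.30)
y = np.linalg.solve(np.array(rows), np.concatenate([-b, [kappa_r1_r2]]))

Utilde = sum(c * E for c, E in zip(y[:len(E1)], E1))
Vtilde = sum(c * E for c, E in zip(y[len(E1):], E2))
ok = (np.linalg.eigvalsh(Utilde).min() >= -1e-8 and np.linalg.eigvalsh(Utilde).max() <= 1 + 1e-8
      and np.linalg.eigvalsh(Vtilde).min() >= -1e-8 and np.linalg.eigvalsh(Vtilde).max() <= 1 + 1e-8)
print("stationary point" if ok else "descent direction available (Section 4.6)")
```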

4.4 A saddle point result

As in Section 3.4, it is possible to establish saddle point results for the Lagrangian function
$$ L(x, W) = \langle A(x), W \rangle, \qquad (4.32) $$
where $W \in \Omega_{n\kappa}$. The primal problem is
$$ \min_{x \in \mathbb{R}^m} g_\kappa(x), \qquad (4.33) $$
where
$$ g_\kappa(x) = \max_{W \in \Omega_{n\kappa}} L(x, W). \qquad (4.34) $$
The dual problem is
$$ \max_{W \in \Omega_{n\kappa}} h(W), \qquad (4.35) $$
where
$$ h(W) = \min_{x \in \mathbb{R}^m} L(x, W). \qquad (4.36) $$

Theorem 4.9 For each $W \in \Omega_{n\kappa}$ let $L(\cdot, W)$ be a convex function, and let the primal problem attain its solution at $x^*$. Then the primal and dual problems have the same optimal value, so
$$ \min_{x \in \mathbb{R}^m} \max_{W \in \Omega_{n\kappa}} L(x, W) = \max_{W \in \Omega_{n\kappa}} \min_{x \in \mathbb{R}^m} L(x, W), \qquad (4.37) $$
and $W^* \in \mathrm{argmax}\{\, \langle A(x^*), W \rangle : W \in \Omega_{n\kappa} \,\}$ satisfying $\langle A_k(x^*), W^* \rangle = 0$ for $k = 1,\dots,m$ solves the dual problem.

Proof: Analogous to the proof of Theorem 3.11.

4.5 The directional derivative

As $g_\kappa(x)$ is subdifferentially regular, the standard one-sided directional derivative $g_\kappa'(x; d)$ at x in a direction $d \in \mathbb{R}^m$ exists and satisfies
$$ g_\kappa'(x; d) = \max_{w \in \partial g_\kappa(x)} w^T d = \sum_{k=1}^m d_k \left[ \mathrm{tr}(P_1^T A_k P_1) - \mathrm{tr}(P_2^T A_k P_2) \right] + \max_{\tilde U, \tilde V} \sum_{k=1}^m d_k \left[ \langle Q_1^T A_k Q_1, \tilde U \rangle - \langle Q_2^T A_k Q_2, \tilde V \rangle \right], \qquad (4.38) $$
where the maximum is over all $\tilde U \in S_{t_1}$, $\tilde V \in S_{t_2}$ satisfying
$$ 0 \le \tilde U \le I, \quad 0 \le \tilde V \le I \quad \text{and} \quad \mathrm{tr}(\tilde U) + \mathrm{tr}(\tilde V) = \kappa - r_1 - r_2. \qquad (4.39) $$
Recall that the matrices $P_1$, $P_2$ and $Q_1$, $Q_2$ are evaluated at the point x, and that $A_k$ is the partial derivative of A(x) with respect to $x_k$ evaluated at the point x. For $k = 1,\dots,m$ define
$$ b_k = \mathrm{tr}(P_1^T A_k P_1) - \mathrm{tr}(P_2^T A_k P_2) \qquad (4.40) $$
and
$$ B_k = \begin{bmatrix} Q_1^T A_k Q_1 & 0 \\ 0 & -Q_2^T A_k Q_2 \end{bmatrix}. \qquad (4.41) $$
Note that $B_k \in S_{t_1+t_2}$. Let
$$ B(d) = \sum_{k=1}^m d_k B_k, \qquad (4.42) $$
and let the eigenvalues of $B(d)$ be $\gamma_1 \ge \cdots \ge \gamma_{t_1+t_2}$. Also let
$$ T = \begin{bmatrix} \tilde U & 0 \\ 0 & \tilde V \end{bmatrix}. \qquad (4.43) $$
Then from Theorem 4.5 and (4.38) and (4.39) it follows that
$$ g_\kappa'(x; d) = b^T d + \max_T \langle B(d), T \rangle, $$
where the maximum is over all matrices T satisfying (4.39) and (4.43). Since $B(d)$ is block diagonal, the maximum may equivalently be taken over matrices $T \in \Phi_{t_1+t_2,\,\kappa-r_1-r_2}$. Hence the directional derivative is
$$ g_\kappa'(x; d) = b^T d + \sum_{i=1}^{\kappa - r_1 - r_2} \gamma_i, \qquad (4.44) $$
the sum of all $r_1$ eigenvalues of $\sum_{k=1}^m d_k P_1^T A_k P_1$, minus the sum of all $r_2$ eigenvalues of $\sum_{k=1}^m d_k P_2^T A_k P_2$, plus the sum of the $\kappa - r_1 - r_2$ largest eigenvalues of $B(d)$.

4.6 Splitting multiple eigenvalues

This section discusses the generation of a descent direction from points where the first-order optimality conditions do not hold. In the first case a descent direction is generated keeping the eigenvalues corresponding to $\sigma$ of multiplicity $t_1$, and the eigenvalues corresponding to $-\sigma$ of multiplicity $t_2$ (to first order). The second case requires splitting at least one eigenvalue away from the common value $\sigma$. When $g_\kappa(x)$ is differentiable ($\kappa = r_1 + t_1 + t_2 + r_2$) and the gradient is nonzero, its negative provides a descent direction. Therefore we only consider the nonsmooth case $\kappa < r_1 + t_1 + t_2 + r_2$.

Case 1. $I \in \mathrm{Span}\{ B_1,\dots,B_m \}$. Solve the system
$$ \omega I - \sum_{k=1}^m d_k B_k = 0, \qquad (4.45) $$
$$ \omega(\kappa - r_1 - r_2) + \sum_{k=1}^m d_k b_k = -1. \qquad (4.46) $$
Noting the structure of $B_k$ in (4.41), this is a system of $t_1(t_1+1)/2 + t_2(t_2+1)/2 + 1$ linear equations in the $m + 1$ unknowns $\omega, d_1,\dots,d_m$. Equation (4.45) implies that the eigenvalues of $B(d)$ defined by (4.42) are all equal to $\omega$. Hence from equations (4.44) and (4.46), $g_\kappa'(x; d) = -1$, where the direction $d \in \mathbb{R}^m$ has components $d_1,\dots,d_m$. To first order, all the eigenvalues $\lambda_{r_1+1}(x),\dots,\lambda_{r_1+t_1}(x)$ decrease at the same rate along d, and all the eigenvalues $\lambda_{n-r_2-t_2+1}(x),\dots,\lambda_{n-r_2}(x)$ increase at the same rate along d. Thus $\omega$ gives a first-order estimate of the change in $\sigma$ along d. This case holds generically if the manifold defined by
$$ \lambda_{r_1+1}(x) = \cdots = \lambda_{r_1+t_1}(x) = \sigma \qquad (4.47) $$
and
$$ \lambda_{n-r_2-t_2+1}(x) = \cdots = \lambda_{n-r_2}(x) = -\sigma \qquad (4.48) $$
has dimension greater than zero, so $m \ge t_1(t_1+1)/2 + t_2(t_2+1)/2$.

Case 2. Case 1 does not apply and $\{\, I, B_1,\dots,B_m \,\}$ has full rank $t_1(t_1+1)/2 + t_2(t_2+1)/2$. Solve the linear system
$$ \mathrm{tr}(\tilde U) + \mathrm{tr}(\tilde V) = \kappa - r_1 - r_2, \qquad (4.49) $$
$$ -\langle B_k, T \rangle = b_k, \quad k = 1,\dots,m, \qquad (4.50) $$
for the dual matrices $\tilde U \in S_{t_1}$ and $\tilde V \in S_{t_2}$, where T is defined by (4.43). This is a system of $m + 1$ equations in $t_1(t_1+1)/2 + t_2(t_2+1)/2$ unknowns. A similar argument to that given in Section 3.6 shows that a unique solution exists. If $\tilde U$ satisfies $0 \le \tilde U \le I$ and $\tilde V$ satisfies $0 \le \tilde V \le I$ then $0 \in \partial g_\kappa(x)$, so x is a stationary point.

Theorem 4.10 Suppose (4.49) and (4.50) are satisfied but $0 \notin \partial g_\kappa(x)$, so either $\tilde U$ or $\tilde V$ has an eigenvalue outside $[0, 1]$. If an eigenvalue $\tau$ of $\tilde U$ lies outside $[0, 1]$, let $z \in \mathbb{R}^{t_1}$ be the corresponding normalized eigenvector of $\tilde U$. Let $\beta \in \mathbb{R}$ and solve
$$ \omega I - \sum_{k=1}^m d_k B_k = \beta \begin{bmatrix} z z^T & 0 \\ 0 & 0 \end{bmatrix}. \qquad (4.51) $$
Alternatively, if an eigenvalue $\tau$ of $\tilde V$ lies outside $[0, 1]$, let $y \in \mathbb{R}^{t_2}$ be the corresponding normalized eigenvector of $\tilde V$. Let $\beta \in \mathbb{R}$ and solve
$$ \omega I - \sum_{k=1}^m d_k B_k = \beta \begin{bmatrix} 0 & 0 \\ 0 & y y^T \end{bmatrix}. \qquad (4.52) $$
Then $\beta$ can be chosen so that $d = [d_1, \dots, d_m]^T$ is a descent direction.

Proof: First consider the case when an eigenvalue $\tau$ of $\tilde U$ lies outside $[0, 1]$. The linear system (4.51) is solvable by hypothesis, although d is not unique if $\{B_k\}$, $k = 1,\dots,m$, are linearly dependent. Taking an inner product of (4.51) with T defined in (4.43) gives
$$ \omega\, \mathrm{tr}(T) - \sum_{k=1}^m d_k \langle B_k, T \rangle = \beta \langle z z^T, \tilde U \rangle, $$
so from (4.49) and (4.50),
$$ \omega(\kappa - r_1 - r_2) + b^T d = \beta \tau. \qquad (4.53) $$
From (4.42) and (4.51),
$$ B(d) = \omega I - \beta \begin{bmatrix} z z^T & 0 \\ 0 & 0 \end{bmatrix} $$
has $t_1 + t_2$ eigenvalues $\omega - \beta, \omega, \dots, \omega$. If $\beta < 0$, the sum of the $\kappa - r_1 - r_2$ largest eigenvalues of $B(d)$ is $\omega(\kappa - r_1 - r_2) - \beta$. Hence (4.44) and (4.53) give
$$ g_\kappa'(x; d) = \beta(\tau - 1). \qquad (4.54) $$
Thus if $\tau > 1$, choosing $\beta < 0$ and solving (4.51) produces a descent direction. On the other hand, if $\beta > 0$, the sum of the $\kappa - r_1 - r_2$ largest eigenvalues of $B(d)$ is $\omega(\kappa - r_1 - r_2)$. Then (4.44) and (4.53) give
$$ g_\kappa'(x; d) = \beta \tau. \qquad (4.55) $$

Thus if $\tau < 0$, choosing $\beta > 0$ and solving (4.51) produces a descent direction. The proof follows the same lines when an eigenvalue $\tau$ of $\tilde V$ lies outside $[0, 1]$ and (4.52) is solved instead of (4.51).

Remark: In general, the system (4.51) splits the multiple eigenvalue $\lambda_{r_1+1} = \cdots = \lambda_{r_1+t_1} = \sigma$, to first order, reducing the approximate multiplicity by one, while (4.52) splits $\lambda_{n-r_2-t_2+1} = \cdots = \lambda_{n-r_2} = -\sigma$. A special case occurs when $t_1 = t_2 = 1$ (and consequently, by assumption, $\kappa = r_1 + r_2 + 1$). In this case both eigenvalues already have multiplicity one, but the direction d splits the common value $\sigma = \lambda_{r_1+1} = -\lambda_{n-r_2}$.

Remark: For the case $\kappa = 1$ this result was given by Overton (1988). In that earlier work the inequalities $\tilde U \le I$ and $\tilde V \le I$ did not appear. This is because $\kappa = 1$ implies $r_1 = r_2 = 0$, and the conditions $\mathrm{tr}(\tilde U) + \mathrm{tr}(\tilde V) = 1$, $\tilde U \ge 0$ and $\tilde V \ge 0$ then imply that $\tilde U \le I$ and $\tilde V \le I$.

Case 3. Neither Case 1 nor Case 2 applies. In this case degeneracy is said to occur, and the generation of a descent direction is not straightforward.

4.7 Model algorithms

As in Section 3.7, algorithms for minimizing $g(x)$ may be developed based on successive linear or quadratic programming. Suppose that $\bar x$ is a (local) minimizer of $g(x)$, with corresponding values $\bar r_1, \bar r_2, \bar t_1$ and $\bar t_2$ (defined by (4.8)). The model algorithm must use estimates, say $r_1, r_2, t_1$ and $t_2$, of these quantities. Note that, in general, the matrix iterates generated by the algorithm will have eigenvalues which are strictly multiple only in the limit at $x = \bar x$. The basic step of a model algorithm for minimizing $g(x)$ then becomes the solution of the following linear or quadratic program:
$$\min_{d \in R^m,\ \omega \in R} \ (\kappa - r_1 - r_2)\,\omega + b^T d + \tfrac12 d^T H d \qquad (4.56)$$
subject to
$$\omega I - \sum_{k=1}^m d_k Q_1^T A_k Q_1 = \mathrm{diag}(\lambda_{r_1+1}, \ldots, \lambda_{r_1+t_1}), \qquad (4.57)$$
$$\omega I + \sum_{k=1}^m d_k Q_2^T A_k Q_2 = -\mathrm{diag}(\lambda_{n-r_2-t_2+1}, \ldots, \lambda_{n-r_2}). \qquad (4.58)$$
Here the quantities $Q_1$, $Q_2$, $\lambda_i$ and $b$ (defined by (4.40)) are evaluated at the current point $x$. The new trial point is $x + d$, and $\omega$ is an estimate of $\nu$ evaluated at the point $x + d$.

Equation (4.57) represents the appropriate linearization of the nonlinear system
$$\lambda_{r_1+1}(x + d) = \cdots = \lambda_{r_1+t_1}(x + d) = \omega, \qquad (4.60)$$
while (4.58) is the linearization of
$$\lambda_{n-r_2-t_2+1}(x + d) = \cdots = \lambda_{n-r_2}(x + d) = -\omega. \qquad (4.61)$$
The first two terms in the objective function (4.56) represent a linearization of $g(x + d)$, while the third term may be used to incorporate second derivative information. The Lagrange multipliers corresponding to the $t_1(t_1+1)/2$ equality constraints (4.57) make up a dual matrix estimate of $\tilde U$, while the Lagrange multipliers corresponding to the $t_2(t_2+1)/2$ equality constraints (4.58) make up a dual matrix estimate of $\tilde V$. Inequalities may be added to the quadratic program to ensure that linearizations of the inequalities $\lambda_i(x+d) \ge \omega$ for $i = 1, \ldots, r_1$, $-\omega \le \lambda_i(x+d) \le \omega$ for $i = r_1 + t_1 + 1, \ldots, n - r_2 - t_2$, and $\lambda_i(x+d) \le -\omega$ for $i = n - r_2 + 1, \ldots, n$ are satisfied. A trust region constraint may be added to ensure a reduction in the objective function at the trial point. See Overton (1990) for further explanation of these ideas.
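For concreteness, here is a small numpy sketch of one basic step: with only the equality constraints (4.57)-(4.58), the quadratic program (4.56) can be solved directly through its KKT system. The input names (`A1`, `A2`, `lam1`, `lam2`) and the plain least-squares solve of the (possibly singular) KKT matrix are our own simplifications, not the paper's prescription; a practical implementation would add the inequality and trust-region safeguards described above.

```python
import numpy as np

def model_qp_step(A1, A2, lam1, lam2, b, H, kappa, r1, r2):
    """One basic step (4.56)-(4.58) as an equality-constrained QP solved via its KKT system.

    A1[k] = Q1^T A_k Q1, A2[k] = Q2^T A_k Q2, lam1/lam2 are the current eigenvalue
    groups, and b, H are as in (4.56); all are assumed evaluated at the current x.
    """
    m = len(A1)
    t1, t2 = A1[0].shape[0], A2[0].shape[0]
    i1, j1 = np.triu_indices(t1)
    i2, j2 = np.triu_indices(t2)
    # Constraint matrix C acting on (omega, d) and right-hand side e.
    C = np.vstack([
        np.column_stack([np.eye(t1)[i1, j1]] + [-Ak[i1, j1] for Ak in A1]),   # (4.57)
        np.column_stack([np.eye(t2)[i2, j2]] + [Ak[i2, j2] for Ak in A2]),    # (4.58)
    ])
    e = np.concatenate([np.diag(lam1)[i1, j1], -np.diag(lam2)[i2, j2]])
    c = np.concatenate([[kappa - r1 - r2], np.asarray(b, float)])   # linear term of (4.56)
    Q = np.zeros((m + 1, m + 1))
    Q[1:, 1:] = H                                                   # no curvature in omega
    p = C.shape[0]
    KKT = np.block([[Q, C.T], [C, np.zeros((p, p))]])
    sol, *_ = np.linalg.lstsq(KKT, np.concatenate([-c, e]), rcond=None)
    omega, d, mult = sol[0], sol[1:m + 1], sol[m + 1:]
    return omega, d, mult
```

The multiplier vector returned corresponds, upper-triangular entry by entry, to the dual matrix estimates of $\tilde U$ and $\tilde V$ mentioned in the text (up to the bookkeeping for symmetric off-diagonal entries, which this sketch leaves implicit).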

A  Appendix

This Appendix discusses further properties of the sets $\omega_{n,\kappa}$ and $\Omega_{n,\kappa}$ which, while not needed in the derivation of the subdifferential of $g_\kappa(A)$ given in Section 4.1, are of some independent interest. They lead to an alternative representation for the subdifferential of $g_\kappa(A)$, in much the same way that Fan's theorem was obtained in Section 3.1 as a corollary of our main results and led to (3.28), the alternative form for the subdifferential of $f_\kappa(A)$. As already mentioned, the techniques in Section 4 are related to the idea of representing a scalar $\alpha$ by its positive part $\alpha_+ = \max\{0, \alpha\}$ and its negative part $\alpha_- = \max\{0, -\alpha\}$, so that $\alpha = \alpha_+ - \alpha_-$ and $|\alpha| = \alpha_+ + \alpha_-$. The same definition applies to a vector componentwise, and, as will be seen, can also be extended in a natural way to matrices. Consider the set
$$\tilde\omega_{n,\kappa} = \{\, w \in R^n : -e \le w \le e,\ \sum_{i=1}^n |w_i| = \kappa \,\}. \qquad (A.1)$$

It is immediately apparent that $\tilde\omega_{n,\kappa} \subseteq \omega_{n,\kappa}$, since for any $w \in \tilde\omega_{n,\kappa}$, setting $u = w_+$, $v = w_-$ satisfies the requirements $w = u - v$, $0 \le u \le e$, $0 \le v \le e$, and $e^T(u+v) = \kappa$. However, the example $n = \kappa = 1$ and $u = 0.75$, $v = 0.25$ shows that the converse does not hold, since $w = 0.5 \in \omega_{n,\kappa}$, but $w \notin \tilde\omega_{n,\kappa}$. Instead, the following holds:

Lemma A.1
$$\omega_{n,\kappa} = \{\, w \in R^n : -e \le w \le e,\ \sum_{i=1}^n |w_i| \le \kappa \,\}. \qquad (A.2)$$

Proof: Let $w \in \omega_{n,\kappa}$, so $w = u - v$, $0 \le u \le e$, $0 \le v \le e$, and $e^T(u+v) = \kappa$. Then

$$|w_i| = |u_i - v_i| \le \max\{u_i, v_i\} \le u_i + v_i,$$
so $w$ is an element of the right-hand side of (A.2). Conversely, if $w$ is an element of the right-hand side, the following algorithm produces vectors $u$ and $v$ which demonstrate that $w = u - v \in \omega_{n,\kappa}$:
$$\delta := \kappa - \sum_{i=1}^n |w_i|;$$
for $i = 1, \ldots, n$ do
$$\gamma := \tfrac12 \min\{1 - |w_i|,\ \delta\}, \quad u_i := [w_i]_+ + \gamma, \quad v_i := [w_i]_- + \gamma, \quad \delta := \delta - 2\gamma.$$
Each step enforces $w_i = u_i - v_i$, $0 \le u_i \le 1$, $0 \le v_i \le 1$, but increases $e^T(u+v)$ until it equals $\kappa$. Note that $\delta$ must be nonnegative initially and zero on termination (since $\sum_{i=1}^n |w_i| \le \kappa \le n$).
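This construction is easy to implement directly; a short sketch (with a function name of our own) follows.

```python
import numpy as np

def split_vector(w, kappa):
    """The construction from the proof of Lemma A.1: given w with -e <= w <= e and
    sum |w_i| <= kappa <= n, return u, v with w = u - v, 0 <= u, v <= e and
    e^T(u + v) = kappa."""
    w = np.asarray(w, dtype=float)
    u, v = np.maximum(w, 0.0), np.maximum(-w, 0.0)     # u = w_+, v = w_-
    delta = kappa - np.abs(w).sum()
    for i in range(w.size):
        gamma = 0.5 * min(1.0 - abs(w[i]), delta)
        u[i] += gamma
        v[i] += gamma
        delta -= 2.0 * gamma
    assert abs(delta) < 1e-12                          # delta is zero on termination
    return u, v
```

For instance, with $w = (0.5)$ and $\kappa = 1$ it returns $u = 0.75$, $v = 0.25$, recovering the example given before Lemma A.1.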

An immediate corollary is:

Corollary A.2 The set of extreme points of the compact convex set $\omega_{n,\kappa}$ is
$$\{\, w \in R^n : w_i \in \{-1, 0, 1\} \ \text{for } i = 1, \ldots, n,\ \sum_{i=1}^n |w_i| = \kappa \,\},$$
that is, the $n$-vectors with $\kappa_1$ elements equal to $+1$, $\kappa_2$ elements equal to $-1$, and all other elements equal to zero, where $\kappa_1 \ge 0$, $\kappa_2 \ge 0$ and $\kappa_1 + \kappa_2 = \kappa$.

These results are now extended to the set $\Omega_{n,\kappa}$, by consideration of the eigenvalues of $W = U - V$. The first step is the following lemma, for which it is necessary to order the eigenvalues of $U$, $V$ and $W$ so as to be able to exploit a classical result of Weyl. (The last statement in this lemma is not actually needed, but is included for completeness.)

Lemma A.3 Let $W = U - V$, where $U \ge 0$, $V \ge 0$ and $\mathrm{tr}(U) + \mathrm{tr}(V) = \kappa$. Let $\sigma_1 \ge \cdots \ge \sigma_n$, $\tau_1 \ge \cdots \ge \tau_n$, and $\omega_1 \ge \cdots \ge \omega_n$ respectively denote the eigenvalues of $U$, the eigenvalues of $V$ and the eigenvalues of $W = U - V$. Then
$$\sum_{i=1}^n |\omega_i| \le \kappa. \qquad (A.3)$$
Moreover, equality holds in (A.3) if and only if, for $i = 1, \ldots, n$,
$$\sigma_i \tau_{n-i+1} = 0 \qquad (A.4)$$
and
$$\omega_i = \begin{cases} \sigma_i & \text{if } \sigma_i > 0, \\ 0 & \text{if } \sigma_i = \tau_{n-i+1} = 0, \\ -\tau_{n-i+1} & \text{if } \tau_{n-i+1} > 0. \end{cases} \qquad (A.5)$$

Proof: Weyl's theorem (see e.g. Horn and Johnson (1985), Section 4.3) shows that for $j = 1, \ldots, n$,
$$\sigma_n - \tau_{n-j+1} \le \omega_j \le \sigma_j - \tau_n.$$
As $U \ge 0$, $V \ge 0$, this implies that
$$-\tau_{n-j+1} \le \omega_j \le \sigma_j, \qquad (A.6)$$
so
$$|\omega_j| \le \max\{\sigma_j, \tau_{n-j+1}\} \le \sigma_j + \tau_{n-j+1}. \qquad (A.7)$$
Hence
$$\sum_{j=1}^n |\omega_j| \le \sum_{j=1}^n \sigma_j + \sum_{j=1}^n \tau_{n-j+1} = \kappa, \qquad (A.8)$$
which establishes (A.3). Suppose now that (A.3) is an equality. Then (A.8) holds with equality, i.e. (A.7) holds with equality for $j = 1, \ldots, n$. This implies that for each $j = 1, \ldots, n$ at least one of $\sigma_j$ and $\tau_{n-j+1}$ is zero, establishing (A.4). If $\tau_n > 0$, equation (A.4) and the ordering of the eigenvalues imply that $U = 0$. Similarly, if $\sigma_n > 0$ then $V = 0$. In both cases (A.5) is trivial. Consider now the case when $\sigma_n = \tau_n = 0$. If $\sigma_j > 0$ then $\tau_{n-j+1} = 0$, so (A.6) implies that
$$0 \le \omega_j \le \sigma_j.$$
Similarly, if $\tau_{n-j+1} > 0$ then $\sigma_j = 0$, so
$$-\tau_{n-j+1} \le \omega_j \le 0.$$

If $\sigma_j = \tau_{n-j+1} = 0$ then $\omega_j = 0$. Thus, if (A.3) holds with equality, then
$$\kappa = \sum_{j=1}^n |\omega_j| = \sum_{j:\,\sigma_j > 0} \omega_j + \sum_{j:\,\tau_{n-j+1} > 0} (-\omega_j) \le \sum_{j:\,\sigma_j > 0} \sigma_j + \sum_{j:\,\tau_{n-j+1} > 0} \tau_{n-j+1} = \kappa.$$

The only way this can hold with equality is if (A.5) holds. Conversely, as $U \ge 0$, $V \ge 0$ and $\mathrm{tr}(U) + \mathrm{tr}(V) = \kappa$, equations (A.4) and (A.5) directly give (A.3) with equality.

The following results are stated using the positive and negative parts of a symmetric matrix, although they could also be stated directly in terms of eigenvalues. Let $W \in S_n$ have eigenvalues $\omega_i$ for $i = 1, \ldots, n$ and corresponding normalized eigenvectors $v_i$, so
$$W = \sum_{i=1}^n \omega_i v_i v_i^T.$$

The positive and negative parts of the symmetric matrix $W$ are then
$$W_+ = \sum_{i=1}^n [\omega_i]_+ v_i v_i^T, \qquad (A.9)$$
$$W_- = \sum_{i=1}^n [\omega_i]_- v_i v_i^T. \qquad (A.10)$$
Then $W_+ \ge 0$, $W_- \ge 0$, $W = W_+ - W_-$,
$$\mathrm{tr}(W) = \sum_{i=1}^n \omega_i = \mathrm{tr}(W_+) - \mathrm{tr}(W_-) \qquad (A.11)$$
and
$$\sum_{i=1}^n |\omega_i| = \mathrm{tr}(W_+) + \mathrm{tr}(W_-). \qquad (A.12)$$
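A direct implementation of (A.9)-(A.10) via the spectral decomposition, with a numerical check of (A.11)-(A.12) on a random symmetric matrix (the helper name is ours), is:

```python
import numpy as np

def matrix_parts(W):
    """Positive and negative parts (A.9)-(A.10) of a symmetric matrix via its
    eigendecomposition; W = W_+ - W_- and both parts are positive semidefinite."""
    omega, V = np.linalg.eigh(W)
    Wp = (V * np.maximum(omega, 0.0)) @ V.T
    Wm = (V * np.maximum(-omega, 0.0)) @ V.T
    return Wp, Wm

rng = np.random.default_rng(0)
S = rng.standard_normal((5, 5))
W = (S + S.T) / 2
Wp, Wm = matrix_parts(W)
assert np.allclose(W, Wp - Wm)
assert np.isclose(np.trace(Wp) - np.trace(Wm), np.trace(W))                            # (A.11)
assert np.isclose(np.trace(Wp) + np.trace(Wm), np.abs(np.linalg.eigvalsh(W)).sum())    # (A.12)
```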

Now consider the set
$$\tilde\Omega_{n,\kappa} = \{\, W \in S_n : -I \le W \le I,\ \mathrm{tr}(W_+) + \mathrm{tr}(W_-) = \kappa \,\}. \qquad (A.13)$$
Note that from equations (A.1) and (A.12),
$$\tilde\Omega_{n,\kappa} = \{\, W \in S_n : W = X\,\mathrm{diag}(w)\,X^T,\ X \in O_{n,n},\ w \in \tilde\omega_{n,\kappa} \,\}.$$
For any $W \in \tilde\Omega_{n,\kappa}$, setting $U = W_+$, $V = W_-$ shows that $W \in \Omega_{n,\kappa}$, and so $\tilde\Omega_{n,\kappa} \subseteq \Omega_{n,\kappa}$. The same example given earlier to show that $\tilde\omega_{n,\kappa}$ does not equal $\omega_{n,\kappa}$ also shows that $\tilde\Omega_{n,\kappa}$ does not equal $\Omega_{n,\kappa}$. Instead, the following holds.

Corollary A.4
$$\Omega_{n,\kappa} = \{\, W \in S_n : -I \le W \le I,\ \mathrm{tr}(W_+) + \mathrm{tr}(W_-) \le \kappa \,\}. \qquad (A.14)$$

Proof: Let $W \in \Omega_{n,\kappa}$. Since $W = U - V$ with $0 \le U \le I$, $0 \le V \le I$ and $\mathrm{tr}(U) + \mathrm{tr}(V) = \kappa$, equations (A.6) and (A.8) show that $\Omega_{n,\kappa}$ is contained in the set on the right-hand side of (A.14). To establish the reverse inclusion, note that any $W$ in the set on the right-hand side has eigenvalues $w = [\omega_1, \ldots, \omega_n]^T$ satisfying the conditions on the right-hand side of (A.2). Therefore, from Lemma A.1, there exist $u, v$ such that $w = u - v \in \omega_{n,\kappa}$. Taking $U = X\,\mathrm{diag}(u)\,X^T$ and $V = X\,\mathrm{diag}(v)\,X^T$, where the columns of $X$ are the corresponding eigenvectors of $W$, shows that $W = U - V \in \Omega_{n,\kappa}$.
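The reverse inclusion in this proof is constructive: it combines the Lemma A.1 loop with the spectral decomposition of $W$. A sketch (assuming $-I \le W \le I$ and $\mathrm{tr}(W_+) + \mathrm{tr}(W_-) \le \kappa \le n$, and with a function name of our own) is:

```python
import numpy as np

def omega_decompose(W, kappa):
    """Construction from the proof of Corollary A.4: write W in Omega_{n,kappa}
    as U - V with 0 <= U, V <= I and tr(U) + tr(V) = kappa, by applying the
    Lemma A.1 construction to the eigenvalues of W."""
    w, X = np.linalg.eigh(W)
    u, v = np.maximum(w, 0.0), np.maximum(-w, 0.0)
    delta = kappa - np.abs(w).sum()
    for i in range(w.size):                       # the loop from Lemma A.1
        gamma = 0.5 * min(1.0 - abs(w[i]), delta)
        u[i] += gamma
        v[i] += gamma
        delta -= 2.0 * gamma
    U = (X * u) @ X.T                             # U = X diag(u) X^T
    V = (X * v) @ X.T                             # V = X diag(v) X^T
    return U, V
```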

Remark: The lack of convexity of the sets $\tilde\omega_{n,\kappa}$ and $\tilde\Omega_{n,\kappa}$ limits their usefulness. Hence the preference for exploiting $\omega_{n,\kappa}$ and $\Omega_{n,\kappa}$, which are convex since the restricting equations are linear, involving sums rather than sums of absolute values. A characterization of the extreme points of $\Omega_{n,\kappa}$ is now possible.

Lemma A.5 The set
$$\{\, YY^T - ZZ^T : Y^T Z = 0,\ Y^T Y = I_{\kappa_1},\ Z^T Z = I_{\kappa_2},\ \text{and } \kappa_1 + \kappa_2 = \kappa \,\}$$
is the set of extreme points of $\Omega_{n,\kappa}$.

Proof: Combining Lemma A.1 and Corollary A.4 gives
$$\Omega_{n,\kappa} = \{\, W \in S_n : W = X\,\mathrm{diag}(w)\,X^T,\ X \in O_{n,n},\ w \in \omega_{n,\kappa} \,\}.$$
The extreme points of $\Omega_{n,\kappa}$ therefore have the form $X\,\mathrm{diag}(w)\,X^T$ where $w$ is an extreme point of $\omega_{n,\kappa}$. Corollary A.2 then gives the desired result.

An obvious generalization of Fan's theorem (which we have nonetheless not encountered in the literature) now follows as a corollary:

Theorem A.6
$$g_\kappa(A) = \max\{\, \mathrm{tr}(Y^T A Y) - \mathrm{tr}(Z^T A Z) : Y^T Z = 0,\ Y^T Y = I_{\kappa_1},\ Z^T Z = I_{\kappa_2},\ \text{and } \kappa_1 + \kappa_2 = \kappa \,\}.$$
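Theorem A.6 is easy to check numerically: for each split $\kappa_1 + \kappa_2 = \kappa$, the trace difference is maximized by taking the columns of $Y$ to be eigenvectors of the $\kappa_1$ largest eigenvalues and the columns of $Z$ to be eigenvectors of the $\kappa_2$ smallest, and the best split attains $g_\kappa(A)$. A small sketch (function names ours) follows.

```python
import numpy as np

def g_kappa(A, kappa):
    """Sum of the kappa largest eigenvalues of A in absolute value."""
    lam = np.linalg.eigvalsh(A)
    return np.sort(np.abs(lam))[::-1][:kappa].sum()

def g_kappa_fan(A, kappa):
    """Right-hand side of Theorem A.6, maximized over the splits kappa1 + kappa2 = kappa,
    taking eigenvectors of the largest and smallest eigenvalues as Y and Z."""
    lam, V = np.linalg.eigh(A)                 # eigenvalues in ascending order
    best = -np.inf
    for k1 in range(kappa + 1):
        k2 = kappa - k1
        Y = V[:, ::-1][:, :k1]                 # eigenvectors of the k1 largest eigenvalues
        Z = V[:, :k2]                          # eigenvectors of the k2 smallest eigenvalues
        best = max(best, np.trace(Y.T @ A @ Y) - np.trace(Z.T @ A @ Z))
    return best

rng = np.random.default_rng(1)
S = rng.standard_normal((6, 6))
A = (S + S.T) / 2
assert np.isclose(g_kappa(A, 3), g_kappa_fan(A, 3))
```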

Lemma A.5 also leads immediately to another characterization of the elements achieving the maximum in (4.21), namely:

Theorem A.7
$$\mathrm{argmax}\,\{\, \langle A, W \rangle : W \in \Omega_{n,\kappa} \,\} = \mathrm{conv}\,\{\, YY^T - ZZ^T : Y^T Z = 0,\ Y^T Y = I_{\kappa_1},\ Z^T Z = I_{\kappa_2},\ \kappa_1 + \kappa_2 = \kappa,\ \text{and } \mathrm{tr}(Y^T A Y) - \mathrm{tr}(Z^T A Z) = g_\kappa(A) \,\}. \qquad (A.15)$$

From the ordering (4.8) of the eigenvalues of $A$, the requirement $\mathrm{tr}(Y^T A Y) - \mathrm{tr}(Z^T A Z) = g_\kappa(A)$ means that the columns of $Y$ must include $r_1$ orthonormal eigenvectors for all the eigenvalues $\lambda_1, \ldots, \lambda_{r_1}$ and the columns of $Z$ must include $r_2$ orthonormal eigenvectors for all the eigenvalues $\lambda_{n-r_2+1}, \ldots, \lambda_n$, so $\kappa_1 \ge r_1$ and $\kappa_2 \ge r_2$. The remaining columns of $Y$ and $Z$ may be any orthonormal sets of eigenvectors corresponding to $\lambda_{r_1+1} = \cdots = \lambda_{r_1+t_1}$ and $\lambda_{n-r_2-t_2+1} = \cdots = \lambda_{n-r_2}$ respectively. This leads to the following characterization of the subdifferential of $g_\kappa(A)$.

Corollary A.8
$$\partial g_\kappa(A) = \mathrm{conv}\,\{\, YY^T - ZZ^T \,\},$$
where the columns of $Y$ form an orthonormal set of $\kappa_1$ eigenvectors for $\lambda_1, \ldots, \lambda_{\kappa_1}$, the columns of $Z$ form an orthonormal set of $\kappa_2$ eigenvectors for $\lambda_{n-\kappa_2+1}, \ldots, \lambda_n$, and $r_1 \le \kappa_1 \le r_1 + t_1$, $r_2 \le \kappa_2 \le r_2 + t_2$, $\kappa_1 + \kappa_2 = \kappa$.

The presence of the convex hull operation means that this form is not as computationally convenient as that given in Corollary 4.6, nor does it display the structure of the subdifferential revealed there.

REFERENCES

1. F. Alizadeh (1991), Private communication.
2. V. I. Arnold (1971), "On matrices depending on parameters", Russian Mathematical Surveys 26:2, pp. 29-43.
3. R. Bellman (1970), Introduction to Matrix Analysis, McGraw-Hill, New York, 2nd edition.
4. R. Bellman and K. Fan (1963), "On systems of linear inequalities in matrix variables", in: V. L. Klee, ed., Convexity, American Mathematical Society, Providence, R.I., pp. 1-11.
5. F. Clarke (1983), Optimization and Nonsmooth Analysis, John Wiley, New York. Reprinted by SIAM, Philadelphia, 1990.
6. B. D. Craven and B. Mond (1981), "Linear programming with matrix variables", Linear Algebra and its Applications 38, pp. 73-80.
7. J. E. Cullum, W. E. Donath and P. Wolfe (1975), "The minimization of certain nondifferentiable sums of eigenvalues of symmetric matrices", Mathematical Programming Study 3, pp. 35-55.
8. K. Fan (1949), "On a theorem of Weyl concerning the eigenvalues of linear transformations", Proceedings of the National Academy of Sciences of the U.S.A. 35, pp. 652-655.
9. P. A. Fillmore and J. P. Williams (1971), "Some convexity theorems for matrices", Glasgow Mathematical Journal 12, pp. 110-117.
10. R. Fletcher (1985), "Semi-definite matrix constraints in optimization", SIAM Journal on Control and Optimization 23, pp. 493-513.
11. S. Friedland (1981), "Convex spectral functions", Linear and Multilinear Algebra 9, pp. 299-316.
12. S. Friedland, J. Nocedal and M. L. Overton (1987), "The formulation and analysis of numerical methods for inverse eigenvalue problems", SIAM Journal on Numerical Analysis 24, pp. 634-667.
13. G. H. Golub and C. F. Van Loan (1983), Matrix Computations, Johns Hopkins University Press, Baltimore.
14. R. A. Horn and C. Johnson (1985), Matrix Analysis, Cambridge University Press, New York.
15. T. Kato (1982), A Short Introduction to Perturbation Theory for Linear Operators, Springer-Verlag, New York.
16. M. L. Overton (1988), "On minimizing the maximum eigenvalue of a symmetric matrix", SIAM Journal on Matrix Analysis and Applications 9, pp. 256-268.
17. M. L. Overton (1990), "Large-scale optimization of eigenvalues", NYU Computer Science Dept. Report 505. To appear in SIAM Journal on Optimization.
18. M. L. Overton and R. S. Womersley (1988), "On minimizing the spectral radius of a nonsymmetric matrix function: Optimality conditions and duality theory", SIAM Journal on Matrix Analysis and Applications 9, pp. 473-498.
19. M. L. Overton and R. S. Womersley (to appear), "Second derivatives for optimizing eigenvalues of symmetric matrices", in preparation.
20. L. Qi and R. S. Womersley (to appear), "On extreme singular values", in preparation.
21. F. Rendl and H. Wolkowicz (1990), "A projection technique for partitioning the nodes of a graph", Department of Combinatorics and Optimization Report 90-20, University of Waterloo.
22. R. T. Rockafellar (1970), Convex Analysis, Princeton University Press, Princeton, NJ.
23. A. H. Sameh and J. A. Wisniewski (1982), "A trace minimization algorithm for the generalized eigenvalue problem", SIAM Journal on Numerical Analysis 19, pp. 1243-1259.
24. A. Shapiro (1985), "Extremal problems on the set of nonnegative definite matrices", Linear Algebra and its Applications 67, pp. 7-18.
25. A. Shapiro and J. D. Botha (1988), "Dual algorithms for orthogonal Procrustes rotations", SIAM Journal on Matrix Analysis and Applications 9, pp. 378-383.
26. G. A. Watson (1990), Private communication.
27. H. Wielandt (1955), "An extremum property of sums of eigenvalues", Proceedings of the American Mathematical Society 6, pp. 106-110.
