Musings on Multilinear Fitting∗

Martin J. Mohlenkamp†

September 2, 2010
Abstract

We show that the problems of approximating tensors and multivariate functions as sums of (tensor) products of vectors/functions can be considered in a unified framework, thus exposing their common multilinear structure. We study the alternating least squares algorithm within this framework from the orthogonal projection and gradient perspectives. We then use these perspectives to study its convergence behavior with and without regularization. Finally, we formulate the infinite dimensional version of this problem and an algorithm to compute in that context.
AMS Subject Classification: 15A69, 65D15
Keywords: Tensor approximation, Multilinear approximation, Alternating least squares
1 Introduction
The tensor approximation problem
$$F(j_1, j_2, \ldots, j_d) \approx \sum_{l=1}^{r} \prod_{i=1}^{d} v_i^l(j_i) \quad \text{for} \quad j_i = 1, 2, \ldots, M_i \qquad (1)$$
and the multivariate function approximation problem
$$f(x_1, x_2, \ldots, x_d) \approx \sum_{l=1}^{r} \prod_{i=1}^{d} \phi_i^l(x_i) \quad \text{for} \quad x_i \in X_i \qquad (2)$$
have become increasingly important as uses for them develop. Yet, even after many years of study, our understanding of them and of the main algorithm used to compute them is unsatisfactory. In this paper we attempt to clarify several aspects of these problems and the Alternating Least Squares (ALS) algorithm that tries to solve them. Although we include some mathematical "results", much of this work consists of very carefully formulating the problems in order to expose their essence. We also interpret this essence to suggest what should and should not be done when developing algorithms for, and understanding of, these problems.

The first goal of this paper is to show that the problems (1) and (2) are the same. More specifically, (1) is a special case of (2). This is not to say that the usage, desired side constraints, etc. are always the same, but only that the essential multilinear nature is the same. To support this assertion, in Section 2 we develop a common framework for both (1) and (2). Thereafter, the discussion is independent of which problem we are considering, with the exception that when f in (2) requires r = ∞ to achieve equality we cannot use
∗This material is based upon work supported by the National Science Foundation under Grant DMS-0545895.
†Department of Mathematics, Ohio University, 321 Morton Hall, Athens OH 45701; [email protected].
compactness arguments in the posedness and convergence discussion. We will use the function notation, since it is the most general (and cleanest).

Since the essential multilinear structure of (1) and (2) is the same, if a general conceptual framework of multilinear approximations exists, then it must apply to both problems. In particular, it cannot use partial derivatives such as ∂²/(∂x₁∂x₂) f(x₁, x₂, . . . , x_d), because they make no sense for (1). Unfortunately, this excludes the techniques developed for the closely-related sparse-grid methods (e.g. [5, 9, 10, 18, 19, 20]), which have produced the best results so far.

The "workhorse" algorithm for finding good approximations to (1) and (2) is Alternating Least Squares (ALS). The second goal of this paper is to clarify the nature of ALS. As a preliminary, in Section 3 we review ordinary linear least squares in a setting and notation compatible with Section 2. We show how to solve a linear least-squares problem with a method based on orthogonal projectors and with a method based on gradients, and show that these two perspectives result in exactly the same method. In Section 4 we review the ALS algorithm and show how to solve it using the method(s) in Section 3. The ALS routine can therefore be viewed as either alternately performing orthogonal projections or alternately solving to make partial gradients zero. The gradient perspective allows one to relate ALS to other gradient-based optimization methods. The orthogonal projector perspective is cleaner, and also shows that ALS always produces factors φ_i^l in a minimal subspace determined by f.

In Section 4.3 we consider the convergence properties of ALS. Without regularization the convergence is unsatisfactory, since the norms of summands can diverge to infinity. With regularization the convergence is more satisfactory, with only the possibility of limit cycles preventing convergence. Even at their best, the convergence results do not guarantee finding a global minimum of the approximation error or converging at any particular rate.

In Section 5 we formally pose the multilinear approximation problem in infinite dimensions. We show that an ALS-like algorithm can be performed in the infinite dimensional setting. Although the infinite dimensional setting may be purely academic, it provides a framework for thought experiments comparing ALS to alternative algorithms. If a proposed method cannot function in d = ∞ with minor modifications, then it likely will not scale well to large d.

Since most of the work here is based on classical definitions in analysis and linear algebra, we have gone rather light on the references. For a recent survey focusing on the tensor problem, see [14]. There is no such survey for the function problem, but we offer as examples [1, 2, 3, 13, 8, 11, 12, 6, 5, 17, 18, 19].
2 The Multilinear Setting

2.1 Basic Hilbert Spaces
Consider some variable x which can take values in some set X, on which there is a measure dx. For any two complex-valued functions of x, we can define the inner product by
$$\langle f, g \rangle = \int_X f(x)\bar{g}(x)\,dx , \qquad (3)$$
where ḡ denotes complex conjugation. From this inner product, we can define a norm by ‖f‖ = √⟨f, f⟩. The set of functions with finite norm defines a Hilbert space H. We need to consider two functions to be the same element of H if they differ pointwise only on a set of measure zero with respect to dx, which means we say f ≡ g ⇔ ‖f − g‖ = 0.

This function notation is quite general, and includes some special cases of note. If X = {1, 2, . . . , M} and dx is the counting measure, then a function f is specified by its values (f(1), f(2), . . . , f(M)). We can then refer to f as a vector, although nothing is gained by doing so. If X = {1, 2, . . .} and dx is the counting measure, then a function f is specified by the infinite vector (i.e. sequence) (f(1), f(2), . . .). For f to have finite norm, we need $\sum_{j=1}^{\infty} |f(j)|^2 < \infty$. If H has a countable orthonormal basis, then f is similarly
specified by the infinite vector of coefficients (f̂(1), f̂(2), . . .) and the inner product can also be computed by $\langle f, g \rangle = \sum_{j=1}^{\infty} \hat{f}(j)\overline{\hat{g}(j)}$.
2.2 Tensor Product Hilbert Space
Consider a set of such Hilbert spaces {H_i}_{i=1}^d. When it is not clear from the context, we will use the subscript i to indicate which domain X_i, inner product ⟨·, ·⟩_i, etc. we are talking about. Now take a set of elements f_i ∈ H_i. The tensor product f = f₁ ⊗ f₂ ⊗ · · · ⊗ f_d = ⊗_{i=1}^d f_i defines a function of x = (x₁, x₂, . . . , x_d) by f(x) = Π_{i=1}^d f_i(x_i). We can see from this definition that scalars can be moved between factors f_i without changing f. We can define an inner product of two such objects by
$$\langle f, g \rangle = \left\langle \bigotimes_{i=1}^{d} f_i,\; \bigotimes_{i=1}^{d} g_i \right\rangle = \prod_{i=1}^{d} \langle f_i, g_i \rangle_i , \qquad (4)$$
and the corresponding norm ‖f‖ = √⟨f, f⟩. Using linearity, the inner product is defined for finite linear combinations of such objects. The tensor product space H = ⊗_{i=1}^d H_i is defined as the completion of the set of finite linear combinations of these objects. The inner product can also be written as
$$\langle f, g \rangle = \int f(x)\bar{g}(x)\,dx = \int\!\!\!\int \cdots \int f(x_1, x_2, \ldots, x_d)\,\bar{g}(x_1, x_2, \ldots, x_d)\,dx_1\,dx_2 \cdots dx_d . \qquad (5)$$
As a special case, if each X_i = {1, 2, . . . , M_i} and dx_i is the counting measure, then f is specified by its values on all combinations of these x_i. These values can be arranged in a d-dimensional array and f referred to as a tensor. Nothing in particular is gained by doing so.

For notational convenience we allow the inner product on H_i to apply to functions involving other variables as well, as in
$$\langle f, g \rangle_i = \int f(x_1, x_2, \ldots, x_d)\,\bar{g}(x_1, x_2, \ldots, x_d)\,dx_i , \qquad (6)$$
which in general yields a function of all x_j with j ≠ i. Similarly, we define the complementary inner product
$$\langle f, g \rangle_{\setminus i} = \int\!\!\!\int \cdots \int f(x_1, x_2, \ldots, x_d)\,\bar{g}(x_1, x_2, \ldots, x_d)\,dx_1 \cdots dx_{i-1}\,dx_{i+1} \cdots dx_d , \qquad (7)$$
which in general yields a function of x_i.
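In the discrete case (each X_i = {1, . . . , M_i} with counting measure), the identity (4) says that the inner product of two separable functions is just a product of one-dimensional inner products, which is what makes the multilinear setting computationally tractable. The following is a minimal numerical sketch of this identity; the helper name and array layout are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def separable_inner_product(f_factors, g_factors):
    """Compute <f, g> for f = f_1 (x) ... (x) f_d and g = g_1 (x) ... (x) g_d
    as the product of one-dimensional inner products, as in (4)."""
    return np.prod([np.vdot(g_i, f_i) for f_i, g_i in zip(f_factors, g_factors)])

# Small check against the full d-dimensional sum (5) for d = 3.
rng = np.random.default_rng(0)
d, M = 3, 4
f_factors = [rng.standard_normal(M) for _ in range(d)]
g_factors = [rng.standard_normal(M) for _ in range(d)]

f_full = np.einsum('i,j,k->ijk', *f_factors)   # f(x1, x2, x3) = prod_i f_i(x_i)
g_full = np.einsum('i,j,k->ijk', *g_factors)

assert np.allclose(separable_inner_product(f_factors, g_factors),
                   np.sum(f_full * g_full))
```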
2.3 Approximation with Sums of Products
Let f ∈ H be some fixed "target" element. We wish to approximate f with a function g that is a sum of separable functions, written as
$$g(x_1, x_2, \ldots, x_d) = \sum_{l=1}^{r} g^l(x_1, x_2, \ldots, x_d) = \sum_{l=1}^{r} \prod_{i=1}^{d} \phi_i^l(x_i) , \qquad (8)$$
where the superscript l is an index rather than a power. One could normalize the φ_i^l and include scalars s^l, but that is not needed for our arguments. We would like to minimize the error function
$$E(g) = \|f - g\|^2 = \langle (f - g), (f - g) \rangle \qquad (9)$$
over all choices of {φ_i^l} for fixed r. Formally, we define
$$G_r = \left\{ \sum_{l=1}^{r} \prod_{i=1}^{d} \phi_i^l(x_i) \;\middle|\; \phi_i^l \in H_i \right\} , \qquad (10)$$
$$E^r = \{ E(g) \mid g \in G_r \} , \quad \text{and} \qquad (11)$$
$$\hat{E}^r = \inf E^r , \qquad (12)$$
so the goal is to find g ∈ G_r with E(g) = Ê^r, if possible.

We also consider the regularized error, for λ > 0, of
$$E_\lambda(g) = E_\lambda(g^1, \ldots, g^r) = \|f - g\|^2 + \lambda \sum_{l=1}^{r} \|g^l\|^2 = \langle (f - g), (f - g) \rangle + \lambda \sum_{l=1}^{r} \prod_{i=1}^{d} \langle \phi_i^l, \phi_i^l \rangle_i . \qquad (13)$$
The approximation of f ∈ H with g using E_λ can be viewed as the approximation of (f, 0, . . . , 0) ∈ H^{r+1} with (Σ_{l=1}^r g^l, g^1, . . . , g^r) using r + 1 copies of E, with the last r of them scaled by λ. Thus we still have a least-squares problem. Formally, we define
$$E_\lambda^r = \{ E_\lambda(g) \mid g \in G_r \} \quad \text{and} \qquad (14)$$
$$\hat{E}_\lambda^r = \inf E_\lambda^r , \qquad (15)$$
so the goal is to find g ∈ G_r with E_λ(g) = Ê_λ^r, if possible.

A special case of this problem is when f itself is a sum of separable functions, written as
$$f(x_1, x_2, \ldots, x_d) = \sum_{l=1}^{R} \prod_{i=1}^{d} \theta_i^l(x_i) . \qquad (16)$$
The goal then is to approximate f ∈ G_R using g ∈ G_r with r < R. By the definition of H, any f ∈ H can be arbitrarily well approximated by choosing R large enough.
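When f ∈ G_R and g ∈ G_r, the error (9) and the regularized error (13) can be evaluated entirely through one-dimensional inner products. The sketch below illustrates this in the discrete case; the factor-matrix storage convention (one M × (number of terms) matrix per direction) and the function names are my own choices, not the paper's.

```python
import numpy as np

def cross_gram(A_factors, B_factors):
    """Entry (m, n) is prod_i <A_i[:, m], B_i[:, n]>_i, the inner product of the
    m-th separable term built from A_factors with the n-th term from B_factors."""
    G = np.ones((A_factors[0].shape[1], B_factors[0].shape[1]))
    for A_i, B_i in zip(A_factors, B_factors):
        G *= A_i.T @ B_i                 # one-dimensional Gram matrices, multiplied entrywise
    return G

def error(theta, phi, lam=0.0):
    """E(g) = ||f - g||^2, and the regularized E_lambda of (13) when lam > 0,
    for f given by factors theta (R terms, as in (16)) and g by factors phi (r terms)."""
    ff = cross_gram(theta, theta).sum()          # <f, f>
    fg = cross_gram(theta, phi).sum()            # <f, g>
    gg_gram = cross_gram(phi, phi)
    E = ff - 2.0 * fg + gg_gram.sum()            # <g, g> = sum of all entries
    return E + lam * np.trace(gg_gram)           # trace gives sum_l ||g^l||^2

# Tiny example: d = 3 directions, grid size M = 5, R = 4 target terms, r = 2.
rng = np.random.default_rng(1)
d, M, R, r = 3, 5, 4, 2
theta = [rng.standard_normal((M, R)) for _ in range(d)]   # factors of f
phi = [rng.standard_normal((M, r)) for _ in range(d)]     # factors of g
print(error(theta, phi), error(theta, phi, lam=1e-4))
```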
3 Linear Least Squares Minimization
In this section we pose the linear least squares minimization problem in a very general setting. We then translate standard solution methods to this setting. We will use these linear results within algorithms to solve the multilinear fitting problem. We assume that the order of integrals can be exchanged, that derivatives can pass through integrals, and that all integrals are finite.

Let H be a Hilbert space. Fix some "target" f ∈ H. Let α be some set of parameters; it may be discrete, continuous, or a mixture. We will need to distinguish between the identity of a parameter and the value which that parameter takes. We will use s ∈ α to specify which parameter we are talking about, and t(s) to give the value of the parameter "indexed" by s. Suppose g(α) ∈ H depends linearly on the parameters in α with no constant term. With these assumptions and notation, we can write
$$g(\alpha) = \int_\alpha a(s)t(s)\,ds \qquad (17)$$
for some fixed a(s). We would like to minimize the error function
$$E(\alpha) = \|f - g(\alpha)\|^2 = \langle (f - g(\alpha)), (f - g(\alpha)) \rangle \qquad (18)$$
over the values that the parameters in α can take. For example, if α = {1, 2, . . . , r} and the t(s) are scalars, then the a(s) should be elements of H and (17) becomes
$$g = \sum_{s=1}^{r} t(s)\,a(s) , \qquad (19)$$
which is a simple linear combination. We then wish to determine the coefficients {t(s)}_{s=1}^r to minimize (18).
3.1 Solution using Orthogonal Projectors
A projector P is a linear operator such that P² = P. Its complement I − P is also a projector, since (I − P)² = I² − IP − PI + P² = I − P. The nullspace of P is exactly the range of I − P, since Pv = 0 ⇒ v = (I − P)v and y = (I − P)v ⇒ Py = P(I − P)v = (P − P²)v = 0. For a linear operator L, its adjoint L* is defined to be the linear operator such that ⟨Lf, g⟩ = ⟨f, L*g⟩. (In the case of matrices and vectors, the adjoint is the complex conjugate of the transpose of the matrix.) An orthogonal projector is a projector such that the range of P is orthogonal to the nullspace of P. Since the nullspace of P is the range of I − P, a projector is orthogonal if and only if ⟨Pv, (I − P)u⟩ = 0 for all u and v. By the definition of the adjoint this is equivalent to ⟨v, P*(I − P)u⟩ = 0. If P = P* (i.e. P is self-adjoint) then we have ⟨v, (P − P²)u⟩ = ⟨v, 0⟩ = 0 and so P is an orthogonal projector. One can show that the condition P = P* is also necessary.

Suppose we can construct an orthogonal projector onto functions of the form (17), where the a(s) are fixed but the t(s) are allowed to vary. In other words, we have a linear operator P such that P² = P, P = P*, and
$$g = \int_\alpha a(s)t(s)\,ds \;\implies\; Pg = g . \qquad (20)$$
The element g of the given form (17) that minimizes E = ‖f − g‖² is then given precisely by Pf. For any other g in the range of P, we know Pf − g is in the range of P and hence is orthogonal to (I − P)f, which is in the nullspace of P. Thus we have ‖f − g‖² = ‖f − Pf‖² + ‖Pf − g‖², and so ‖f − g‖ > ‖f − Pf‖.

To construct this projector, first define an inner product with respect to s by
$$\langle t, v \rangle_s = \int_\alpha t(s)\bar{v}(s)\,ds . \qquad (21)$$
Define the linear operator L by
$$Lt(s') = \int_\alpha \langle a(s), a(s') \rangle\, t(s)\,ds . \qquad (22)$$
The operator L is positive semi-definite since
$$\langle Lt, t \rangle_s = \int_\alpha \int_\alpha \langle a(s), a(s') \rangle\, t(s)\,ds\; \bar{t}(s')\,ds' = \left\langle \int_\alpha a(s)t(s)\,ds,\; \int_\alpha a(s')t(s')\,ds' \right\rangle \geq 0 . \qquad (23)$$
If the elements a(s) are linearly independent in the sense that
$$\int_\alpha a(s)t(s)\,ds = 0 \;\implies\; t = 0 \quad \text{a.e.} , \qquad (24)$$
then L is positive definite, and hence invertible. Notice that
$$\langle Lt, v \rangle_s = \int_\alpha \int_\alpha \langle a(s), a(s') \rangle\, t(s)\,ds\; \bar{v}(s')\,ds' = \int_\alpha t(s)\, \overline{\int_\alpha \langle a(s'), a(s) \rangle\, v(s')\,ds'}\,ds = \langle t, Lv \rangle_s , \qquad (25)$$
so L = L* and thus L is self-adjoint. If L is invertible, then L⁻¹ will thus also be self-adjoint.
The projector P is defined by
$$Pf = \int_\alpha a(s) \left( L^{-1} \langle f, a(\cdot) \rangle \right)(s)\, ds . \qquad (26)$$
To verify that P² = P we check
$$PPf = \int_\alpha a(s) \left( L^{-1} \left\langle \int_\alpha a(s') \left( L^{-1} \langle f, a(\cdot) \rangle \right)(s')\, ds',\; a(\cdot) \right\rangle \right)(s)\, ds \qquad (27)$$
$$= \int_\alpha a(s) \left( L^{-1} \int_\alpha \langle a(s'), a(\cdot) \rangle \left( L^{-1} \langle f, a(\cdot) \rangle \right)(s')\, ds' \right)(s)\, ds \qquad (28)$$
$$= \int_\alpha a(s) \left( L^{-1} L\, L^{-1} \langle f, a(\cdot) \rangle \right)(s)\, ds = Pf . \qquad (29)$$
To verify that P* = P we check
$$\langle Pf, v \rangle = \left\langle \int_\alpha a(s) \left( L^{-1} \langle f, a(\cdot) \rangle \right)(s)\, ds,\; v \right\rangle = \int_\alpha \left( L^{-1} \langle f, a(\cdot) \rangle \right)(s)\, \langle a(s), v \rangle\, ds \qquad (30)$$
$$= \left\langle L^{-1} \langle f, a(\cdot) \rangle,\; \langle v, a(\cdot) \rangle \right\rangle_s = \left\langle \langle f, a(\cdot) \rangle,\; L^{-1} \langle v, a(\cdot) \rangle \right\rangle_s \qquad (31)$$
$$= \int_\alpha \langle f, a(s) \rangle\, \overline{\left( L^{-1} \langle v, a(\cdot) \rangle \right)(s)}\, ds = \left\langle f,\; \int_\alpha a(s) \left( L^{-1} \langle v, a(\cdot) \rangle \right)(s)\, ds \right\rangle = \langle f, Pv \rangle . \qquad (32)$$
To verify (20), we check
$$Pg = \int_\alpha a(s) \left( L^{-1} \left\langle \int_\alpha a(s')t(s')\, ds',\; a(\cdot) \right\rangle \right)(s)\, ds \qquad (33)$$
$$= \int_\alpha a(s) \left( L^{-1} \int_\alpha \langle a(s'), a(\cdot) \rangle\, t(s')\, ds' \right)(s)\, ds = \int_\alpha a(s) \left( L^{-1} L t \right)(s)\, ds = g . \qquad (34)$$
Comparing (26) and (17), we see that the error function E in (18) is minimized by choosing t(s) = L⁻¹⟨f, a(·)⟩(s) in (17).

In the example where g is the simple sum (19), the definition (22) for the operator L becomes
$$Lt(s') = \sum_{s=1}^{r} \langle a(s), a(s') \rangle\, t(s) , \qquad (35)$$
which means L is an r × r matrix L with entries L(s', s) = ⟨a(s), a(s')⟩. The definition (26) for the projector P becomes
$$Pf = \sum_{s=1}^{r} a(s) \sum_{s'=1}^{r} \mathbf{L}^{-1}(s, s') \langle f, a(s') \rangle = a^{*} \mathbf{L}^{-1} \langle f, a \rangle . \qquad (36)$$
The optimal vector t of coefficients is given by t = L⁻¹⟨f, a⟩.
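For the simple linear combination (19), the construction above is just the familiar normal-equations solve with the Gram matrix (35). The following is a minimal sketch, with dimensions and names chosen only for illustration.

```python
import numpy as np

def least_squares_coefficients(a_list, f):
    """Minimize ||f - sum_s t[s] a[s]||^2 by solving the normal equations
    L t = <f, a>, with Gram matrix L(s', s) = <a(s), a(s')> as in (35)."""
    A = np.column_stack(a_list)          # columns are the fixed elements a(s)
    L = A.T @ A                          # Gram matrix
    rhs = A.T @ f                        # <f, a(s')>
    return np.linalg.solve(L, rhs)       # t = L^{-1} <f, a>

rng = np.random.default_rng(2)
n, r = 50, 4
a_list = [rng.standard_normal(n) for _ in range(r)]
f = rng.standard_normal(n)
t = least_squares_coefficients(a_list, f)

# The residual f - g is orthogonal to every a(s): the projection property of Section 3.1.
g = np.column_stack(a_list) @ t
assert np.allclose(np.column_stack(a_list).T @ (f - g), 0.0)
```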
Remark 3.1 If (24) fails and so the a(s) are linearly dependent, then L in (22) is not invertible. The orthogonal projection is still well-defined, but there may not be a unique representation of it in terms of a(s). For purposes of minimizing E, any representation will do. One can, for example, choose a pseudoinverse for L to use in place of L⁻¹.
3.2 Solution using Gradients
In this section we assume H is real, so that we can avoid the problem that z̄ is not a differentiable function of z without resorting to writing z = a + bi and differentiating separately in a and b. Inserting (17) into (18), we have
$$E(\alpha) = \langle f, f \rangle + \langle g(\alpha), g(\alpha) \rangle - 2\langle f, g(\alpha) \rangle \qquad (37)$$
$$= \langle f, f \rangle + \left\langle \int_\alpha a(s')t(s')\,ds',\; \int_\alpha a(s)t(s)\,ds \right\rangle - 2\left\langle f,\; \int_\alpha a(s)t(s)\,ds \right\rangle \qquad (38)$$
$$= \langle f, f \rangle + \int_\alpha \int_\alpha \langle a(s'), a(s) \rangle\, t(s')t(s)\,ds'\,ds - 2\int_\alpha \langle f, a(s) \rangle\, t(s)\,ds , \qquad (39)$$
under the assumption that we can exchange the order of integration. Under the assumption that we can differentiate through an integral, we can explicitly compute the partial derivative
$$\frac{\partial}{\partial t(s')} E(\alpha) = 2\int_\alpha \langle a(s), a(s') \rangle\, t(s)\,ds - 2\langle f, a(s') \rangle . \qquad (40)$$
Assembling these partial derivatives into the gradient, we have
$$\nabla E(\alpha) = 2\int_\alpha \langle a(s), a(s') \rangle\, t(s)\,ds - 2\langle f, a(s') \rangle = 2\left( Lt(s') - \langle f, a(s') \rangle \right) , \qquad (41)$$
using L from (22). Setting the gradient equal to zero, we obtain the normal equations
$$Lt(s) = \langle f, a(s) \rangle . \qquad (42)$$
Expressed as t(s) = L⁻¹⟨f, a(·)⟩(s), these are the same equations we obtained using orthogonal projectors in Section 3.1.
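As a quick numerical sanity check that (41) is indeed the gradient of (18) and that it vanishes at the normal-equations solution (42), here is a small sketch (the dimensions and tolerances are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(5)
n, r = 30, 3
A = rng.standard_normal((n, r))          # columns are the elements a(s)
f = rng.standard_normal(n)
t = rng.standard_normal(r)               # arbitrary coefficient values

def E(t):                                # E(alpha) = ||f - sum_s t(s) a(s)||^2
    return np.linalg.norm(f - A @ t) ** 2

# Analytic gradient (41): 2 (L t - <f, a>), with L the Gram matrix of (35).
grad = 2.0 * (A.T @ A @ t - A.T @ f)

# Compare with a centered finite-difference approximation of the gradient.
h, fd = 1e-6, np.zeros(r)
for s in range(r):
    e = np.zeros(r); e[s] = h
    fd[s] = (E(t + e) - E(t - e)) / (2 * h)
assert np.allclose(grad, fd, atol=1e-4)

# At the normal-equations solution (42) the gradient vanishes.
t_star = np.linalg.solve(A.T @ A, A.T @ f)
assert np.allclose(A.T @ A @ t_star - A.T @ f, 0.0)
```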
4 The Alternating Least Squares (ALS) Strategy
Suppose we wish to solve some least-squares problem that depends nonlinearly on some set of parameters. Assume we are given some initial guess at the correct parameter values. In its most abstract form, the ALS strategy is:

• Iterate until happy:
  – Select some subset of the parameters.
  – Solve the least-squares problem using those parameters while keeping the remaining parameters fixed.

To have any chance of solving the full problem, the subsets must be chosen so that each parameter is allowed to vary many times. Since we still must solve the inner least-squares problem, it is very convenient if the subsets chosen yield linear least-squares problems, which can then be solved with the methods in Section 3.

Within the multilinear setting of Section 2, we are interested in the least-squares problem of approximation with sums of products described in Section 2.3. Within this context there is a natural way to choose the subsets to yield linear problems. The ALS strategy becomes:

• Iterate until happy:
  – Loop through directions k = 1, 2, . . . , d:
    ∗ Fix {φ_i^l} for i ≠ k and solve the linear least-squares problem to minimize E(g) with respect to {φ_k^l}_{l=1}^r, thereby updating {φ_k^l}_{l=1}^r. (A minimal code sketch of this loop is given below, after (44).)

We now consider how to formulate the one-directional problem so that we can use the methods in Section 3 to solve it. For notational convenience we consider the k = 1 case within the ALS loop, so the functions in i = 2, 3, . . . , d are fixed and {φ_1^l}_{l=1}^r are modified. The set of parameters α has elements of the form s = (l, x₁) with l = 1, . . . , r and x₁ in the domain of definition of dx₁. The value of parameter s is t(s) = φ_1^l(x₁). The elements a(s) in (17) should essentially be Π_{i=2}^d φ_i^l(x_i), but we must be careful in their definition so that integration with them yields the desired results. In particular, we need an auxiliary integration variable ŝ = (l, x̂₁). We define
$$a(\hat{s}) = a(l, \hat{x}_1) = \delta(x_1 - \hat{x}_1) \prod_{i=2}^{d} \phi_i^l(x_i) , \qquad (43)$$
where the delta function satisfies $\int \delta(x_1 - \hat{x}_1)\phi(\hat{x}_1)\,d\hat{x}_1 = \phi(x_1)$. Using ŝ as the variable of integration, we then have that g in (17) reduces to g in (8) via
$$(17) = \int_\alpha a(\hat{s})t(\hat{s})\,d\hat{s} = \sum_{l=1}^{r} \int \delta(x_1 - \hat{x}_1) \left( \prod_{i=2}^{d} \phi_i^l(x_i) \right) \phi_1^l(\hat{x}_1)\,d\hat{x}_1 = \sum_{l=1}^{r} \prod_{i=1}^{d} \phi_i^l(x_i) = (8) . \qquad (44)$$
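To make the loop structure above concrete, here is a minimal Python skeleton of the ALS strategy. The factor storage and the routine `solve_direction` (which must implement the one-directional linear least-squares solve derived in Section 4.1 below) are placeholders of my own, not part of the paper.

```python
def als(f_data, phi, n_passes, solve_direction):
    """Abstract ALS loop: phi is a list of factor representations, one per direction;
    solve_direction(f_data, phi, k) must return the updated factors for direction k
    with all other directions held fixed (the one-directional problem of Section 4.1)."""
    d = len(phi)
    for _ in range(n_passes):          # "iterate until happy"
        for k in range(d):             # loop through directions k = 1, ..., d
            phi[k] = solve_direction(f_data, phi, k)
    return phi
```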
4.1 One-Directional Problem Solution using Orthogonal Projectors
With our definitions above for s, t(s), and a(s), the operator L in (22) becomes
$$Lt(l', x_1') = \sum_{l=1}^{r} \int \left\langle \delta(x_1 - \hat{x}_1) \prod_{i=2}^{d} \phi_i^l(x_i),\; \delta(x_1 - x_1') \prod_{i=2}^{d} \phi_i^{l'}(x_i) \right\rangle \phi_1^l(\hat{x}_1)\, d\hat{x}_1 \qquad (45)$$
$$= \sum_{l=1}^{r} \phi_1^l(x_1') \prod_{i=2}^{d} \langle \phi_i^l, \phi_i^{l'} \rangle_i . \qquad (46)$$
The projector P in (26) becomes
$$Pf = \sum_{l=1}^{r} \int \delta(x_1 - \hat{x}_1) \prod_{i=2}^{d} \phi_i^l(x_i) \left( L^{-1} \left\langle f,\; \delta(x_1 - \cdot) \prod_{i=2}^{d} \phi_i^{(\cdot)}(x_i) \right\rangle \right)(l, \hat{x}_1)\, d\hat{x}_1 \qquad (47)$$
$$= \sum_{l=1}^{r} \prod_{i=2}^{d} \phi_i^l(x_i) \left( L^{-1} \left\langle f,\; \prod_{i=2}^{d} \phi_i^{(\cdot)}(x_i) \right\rangle_{\setminus 1} \right)(l, x_1) . \qquad (48)$$
Thus we see that we should choose
$$t(s) = \phi_1^l(x_1) = \left( L^{-1} \left\langle f,\; \prod_{i=2}^{d} \phi_i^{(\cdot)}(x_i) \right\rangle_{\setminus 1} \right)(l, x_1) . \qquad (49)$$
We can simplify these expressions somewhat by noting that the operator L in (46) is separable. In mapping the discrete index, it acts as matrix multiplication from l to l'. In the continuous index it simply renames the variable to x₁'; by adjusting its name throughout, we can ignore this action. Define the matrix L with entries
$$\mathbf{L}(l', l) = \prod_{i=2}^{d} \langle \phi_i^l, \phi_i^{l'} \rangle_i . \qquad (50)$$
Define a column vector q whose entries are functions of x₁ by
$$q(l) = \phi_1^l(x_1) . \qquad (51)$$
We can then write (46) as Lt(l', x₁) = Lq. Define a column vector p whose entries are functions of x₂, . . . , x_d by
$$p(l) = \prod_{i=2}^{d} \phi_i^l(x_i) \qquad (52)$$
and a column vector b whose entries are functions of x₁ by
$$b(l') = \left\langle f,\; \prod_{i=2}^{d} \phi_i^{l'}(x_i) \right\rangle_{\setminus 1} . \qquad (53)$$
We can then write (48) as Pf = p̄*L⁻¹b and obtain the optimal t(s) = φ_1^l(x₁) in (49) as t = L⁻¹b. Since each entry in b is in span_{x₂,...,x_d}{f(·, x₂, . . . , x_d)}, we have φ_1^l ∈ span_{x₂,...,x_d}{f(·, x₂, . . . , x_d)} for all l.

If f is of the form (16), then
$$b(l') = \left\langle \sum_{l=1}^{R} \prod_{i=1}^{d} \theta_i^l(x_i),\; \prod_{i=2}^{d} \phi_i^{l'}(x_i) \right\rangle_{\setminus 1} = \sum_{l=1}^{R} \theta_1^l(x_1) \prod_{i=2}^{d} \langle \theta_i^l, \phi_i^{l'} \rangle_i , \qquad (54)$$
and therefore the optimal t(s) = φ_1^l(x₁) in (49) are given as
$$\begin{pmatrix} \phi_1^1 \\ \phi_1^2 \\ \vdots \\ \phi_1^r \end{pmatrix} = \mathbf{L}^{-1} b = \sum_{l=1}^{R} \theta_1^l(x_1)\, \mathbf{L}^{-1} \begin{pmatrix} \prod_{i=2}^{d} \langle \theta_i^l, \phi_i^1 \rangle_i \\ \prod_{i=2}^{d} \langle \theta_i^l, \phi_i^2 \rangle_i \\ \vdots \\ \prod_{i=2}^{d} \langle \theta_i^l, \phi_i^r \rangle_i \end{pmatrix} . \qquad (55)$$
Thus we see φ_1^l ∈ span{θ_1^l}_{l=1}^R.
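For f in the form (16), the one-directional update (50)-(55) reduces to a small dense linear solve: L is an elementwise (Hadamard) product of one-dimensional Gram matrices and b is assembled from the θ factors. The sketch below shows one such update in direction k for the discrete case; the factor-matrix convention and names are mine, and the optional `lam` anticipates the regularization of (58).

```python
import numpy as np

def update_direction(theta, phi, k, lam=0.0):
    """One ALS update in direction k for f in CP form (16).
    Builds L of (50) as an elementwise product of one-dimensional Gram matrices
    and b of (54), then solves the normal equations t = L^{-1} b.
    Setting lam > 0 inflates the diagonal by (1 + lam), as in (58)."""
    d, r = len(phi), phi[0].shape[1]
    L = np.ones((r, r))
    C = np.ones((r, theta[0].shape[1]))     # C[m, n] = prod_{i != k} <theta_i^n, phi_i^m>
    for i in range(d):
        if i == k:
            continue
        L *= phi[i].T @ phi[i]
        C *= phi[i].T @ theta[i]
    L = L * (1.0 + lam * np.eye(r))          # regularization of (58); no-op if lam == 0
    b = C @ theta[k].T                       # b as a function of x_k, one row per term l
    return np.linalg.solve(L, b).T           # updated factors, columns indexed by l

# One full ALS pass over all directions.
rng = np.random.default_rng(3)
d, M, R, r = 4, 6, 3, 2
theta = [rng.standard_normal((M, R)) for _ in range(d)]
phi = [rng.standard_normal((M, r)) for _ in range(d)]
for k in range(d):
    phi[k] = update_direction(theta, phi, k, lam=1e-6)
```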
4.1.1 Including Regularization
As noted in Section 2.3, when we replace the error measure E from (9) with E_λ from (13), we still have a least-squares problem, but now we approximate (f, 0, . . . , 0) ∈ H^{r+1} with (Σ_{l=1}^r g^l, g^1, . . . , g^r). The procedure is the same but the formulas change somewhat. We change (43) to
$$a(\hat{s}) = a(l, \hat{x}_1) = \left( \delta(x_1 - \hat{x}_1) \prod_{i=2}^{d} \phi_i^l(x_i) \right) (1, 0, \ldots, 0, 1, 0, \ldots) , \qquad (56)$$
where the second 1 is in position l + 1. The operator L in (46) becomes
$$Lt(l', x_1') = \sum_{l=1}^{r} \phi_1^l(x_1') \prod_{i=2}^{d} \langle \phi_i^l, \phi_i^{l'} \rangle_i + \lambda\, \phi_1^{l'}(x_1') \prod_{i=2}^{d} \langle \phi_i^{l'}, \phi_i^{l'} \rangle_i . \qquad (57)$$
The final formula (48) for the projector P is unchanged, as is the solution formula (49). The matrix L defined in (50) is changed to
$$\mathbf{L}(l', l) = \prod_{i=2}^{d} \langle \phi_i^l, \phi_i^{l'} \rangle_i + \lambda\, \delta(l - l') \prod_{i=2}^{d} \langle \phi_i^l, \phi_i^{l'} \rangle_i = \left( \prod_{i=2}^{d} \langle \phi_i^l, \phi_i^{l'} \rangle_i \right) \left( 1 + \lambda\, \delta(l - l') \right) . \qquad (58)$$
Thus we merely take the original L and inflate its diagonal by a factor of 1 + λ.
Remark 4.1 If we had assumed ‖φ_i^l‖ = 1 for i > 1, then including regularization in this way is equivalent to modifying the original L by adding λI to it. Such normalization is advisable numerically, since it acts as a preconditioner on L and thus makes the linear system involving L easier to solve.
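The normalization in Remark 4.1 takes only a few lines; this is a hedged sketch using the same factor-matrix convention as the earlier sketches (my own choice, not the paper's), and it assumes no factor is identically zero.

```python
import numpy as np

def normalize_off_directions(phi, k):
    """Rescale so that ||phi_i^l|| = 1 for all i != k, pushing the scale into
    direction k.  Each g^l is unchanged, and the regularized matrix of (58)
    then becomes exactly L + lam * I, as in Remark 4.1."""
    for i, F in enumerate(phi):
        if i == k:
            continue
        norms = np.linalg.norm(F, axis=0)   # one norm per term l (assumed nonzero)
        F /= norms                          # ||phi_i^l|| = 1 for i != k
        phi[k] *= norms                     # compensate so each g^l stays the same
    return phi
```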
4.2 One-Directional Problem Solution using Gradients
As we saw in Section 3.2, solving a linear least squares problem using gradients results in the same set of equations as the method using orthogonal projectors. The gradient (41) becomes
$$\nabla E(\alpha) = 2(\mathbf{L}t - b) \qquad (59)$$
and the normal equations (42) become
$$\mathbf{L}t = b , \qquad (60)$$
which is equivalent to the equation t = L⁻¹b obtained above.
4.3 Convergence of ALS
In this section we consider the convergence properties of the ALS algorithm. Since the error measures E and E_λ are not convex, there can be local minima, which are observed in practice. Thus at best we can hope for convergence to a local minimum. Even when we obtain zero error, the minimum may not have a unique representation, as shown by examples in [2, 15]. We will use below the standard results that a closed, bounded set in a finite-dimensional space is compact, and that a sequence in a compact set has a convergent subsequence.

In general our H is infinite dimensional. If f ∈ G_R for some finite R as in (16), then Section 4.1 showed that after one ALS pass we have φ_i^l ∈ span{θ_i^l}_{l=1}^R. Since this span is finite dimensional, both f and g can be embedded into a finite dimensional subspace of H, and in the proofs it is as if H were finite dimensional. Since the set of f ∈ G_R for some finite R is dense in H, this case is in some sense typical, although it may be of zero measure. If H initially was finite dimensional, as in the tensor approximation problem (1), then automatically f ∈ G_R with R the dimension of H. So, when we need to use finite dimensionality, we will assume f ∈ G_R.
4.3.1 Not Including Regularization
The first question is whether Ê^r in (12) is just an infimum or is actually a minimum. It has been observed several times (see [7] and references therein) that in general Ê^r is just an infimum. Thus the minimization problem is ill-posed. Specifically, for some f ∉ G_r there exists a sequence g^(k) ∈ G_r with E(g^(k)) → Ê^r = 0, such that lim_{k→∞} g^(k) = f ∉ G_r. The simplest example of this behavior (see [2]) is
$$f(x) = \sum_{i=1}^{d} \phi(x_i) = \lim_{h \to 0} \left[ \frac{1}{h} \prod_{i=1}^{d} \left( 1 + h\,\phi(x_i) \right) - \frac{1}{h} \right] \qquad (61)$$
for any non-constant φ and d > 2 = r. Thus there is no best approximation of that f for that r. Since the problem of finding Ê^r is ill-posed, any "good" algorithm should diverge, in the sense that some of the g^{l,(k)} diverge. In the left half of Figure 1 the behavior of ALS on an example of this type is shown. When f ∈ G_R this ill-posed behavior can only occur if max_l ‖g^{l,(k)}‖ → ∞ (see [7]); it is not clear if this result holds for f ∉ G_R.

The ALS iteration produces a sequence g^(k) ∈ G_r such that E(g^(0)) ≥ E(g^(k)) ≥ E(g^(k+1)) ≥ Ê^r ≥ 0. We know ∆E^k = E(g^(k)) − E(g^(k+1)) → 0 and the sequence {∆E^k} is in ℓ¹ (has finite sum). Consider one step of the ALS iteration, the update in direction 1 as described in Section 4.1. We denote the result after this update g̃^(1,k), the difference ∆₁g^(k) = g̃^(1,k) − g^(k), and the improvement ∆₁E^k = E(g^(k)) − E(g̃^(1,k)).
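The mechanism behind (61) can be reproduced numerically in a few lines: the rank-2 functions g_h approach f while the norms of their two summands blow up like 1/h. This is an illustrative sketch only; the grid, the choice of φ, and the h values are my own.

```python
import numpy as np

# The ill-posed example (61): f(x) = sum_{i=1}^d phi(x_i) with d = 3 > r = 2.
x = np.linspace(0.0, 1.0, 20)
phi = x ** 2                                     # any non-constant phi
one = np.ones_like(x)

def outer3(a, b, c):
    return np.einsum('i,j,k->ijk', a, b, c)

f = outer3(phi, one, one) + outer3(one, phi, one) + outer3(one, one, phi)

for h in [1e-1, 1e-2, 1e-3]:
    term1 = outer3(1 + h * phi, 1 + h * phi, 1 + h * phi) / h   # first summand of g_h
    term2 = -outer3(one, one, one) / h                          # second summand of g_h
    g_h = term1 + term2
    print(f"h={h:.0e}  error={np.linalg.norm(f - g_h):.2e}  "
          f"max summand norm={max(np.linalg.norm(term1), np.linalg.norm(term2)):.2e}")
```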
[Figure 1: Behavior of the ALS algorithm for (0, 1)⊗(1, 1)⊗(1, 1) + (1, 1)⊗(0, 1)⊗(1, 1) + (1, 1)⊗(1, 1)⊗(0, 1) approximated with r = 2. The horizontal axis is E(g^(k)) and the vertical is max_l{‖g^{l,(k)}‖²}, both on a log scale. A point is plotted every one hundred k up to k = 100000. The left plot is without regularization, and the norms of the terms would continue to grow, indicating divergence. The right plot is with λ = 10⁻⁴, and indicates convergence of both error and norms.]

Since the improvement from one pass ∆E^k dominates the improvement from a single direction ∆₁E^k, we know ∆₁E^k → 0 and the sequence of ∆₁E^k is in ℓ¹. With g^(k) given by (8), we can write
$$\tilde{g}^{(1,k)} = \sum_{l=1}^{r} \left( \phi_1^{l,(k)}(x_1) + \Delta\phi_1^{l,(k)}(x_1) \right) \prod_{i=2}^{d} \phi_i^{l,(k)}(x_i) \quad \text{and} \qquad (62)$$
$$\Delta_1 g^{(k)} = \sum_{l=1}^{r} \Delta\phi_1^{l,(k)}(x_1) \prod_{i=2}^{d} \phi_i^{l,(k)}(x_i) . \qquad (63)$$
In Section 4 we showed that each step of the ALS algorithm finds the linear least squares fit to f in the subspace spanned by the a(ŝ) in (43). In Section 4.1 we showed how to find this fit by doing an orthogonal projection onto this subspace, obtaining g̃^(1,k). Since g^(k) is also in this subspace, g^(k), g̃^(1,k), and f form a right triangle and we have
$$\|f - g^{(k)}\|^2 = \|f - \tilde{g}^{(1,k)}\|^2 + \|\tilde{g}^{(1,k)} - g^{(k)}\|^2 \qquad (64)$$
or, equivalently,
$$\Delta_1 E^k = E(g^{(k)}) - E(\tilde{g}^{(1,k)}) = \|\Delta_1 g^{(k)}\|^2 . \qquad (65)$$
This tells us ‖∆₁g^(k)‖² → 0 and ‖∆₁g^(k)‖² is in ℓ¹. Note that ‖∆₁g^(k)‖² → 0 does not imply the corresponding statement ‖∆₁g^{l,(k)}‖² → 0 for each summand, since individual terms can grow arbitrarily large yet cancel. This mode of divergence is expected, given the above results on the ill-posedness of the problem. Although ‖∆₁g^(k)‖² is in ℓ¹, that does not imply ‖∆₁g^(k)‖ or ∆₁g^(k) is summable, so we cannot conclude that g^(k) converges; it may be possible to have cycles or other bounded forms of divergence. In practice, divergence of the ALS algorithm through unbounded summands is observed. Bounded divergence would be hard to observe.

We can also consider the ALS convergence behavior in its relationship with the gradient. Let t₀ denote the previous value of t, as appears in (59). We can then write the new value of t, as appears in (60), as
t = t₀ + ∆t. Inserting this into (60), rearranging, and using (59), we have
$$\mathbf{L}\Delta t = b - \mathbf{L}t_0 = -\tfrac{1}{2}\nabla E(\alpha) . \qquad (66)$$
This gives us the bounds
$$\|\nabla E(\alpha)\| \leq 2\|\mathbf{L}\|\,\|\Delta t\| \quad \text{and} \qquad (67)$$
$$\|\Delta t\| \leq \tfrac{1}{2}\|\mathbf{L}^{-1}\|\,\|\nabla E(\alpha)\| . \qquad (68)$$
At any step in the iteration we know L defined in (50) has finite norm, so ‖∆t‖ = 0 ⇒ ‖∇E(α)‖ = 0. Thus if the ALS algorithm stops, then we are at a stationary point. Since we do not have a uniform bound on ‖L‖, we cannot conclude that ‖∆t^(k)‖ = (Σ_l ‖∆φ_i^{l,(k)}‖²)^{1/2} → 0 would imply ‖∇E^k(α)‖ → 0. We know L is positive semi-definite, but it may have a nullspace, so L⁻¹ may not exist. Thus we also cannot conclude that ‖∇E(α)‖ = 0 implies ‖∆t‖ = 0. (If ‖∇E(α)‖ = 0 then no improvement is possible on this step, so we could enforce ‖∆t‖ = 0. This approach would fail, however, when ‖∇E(α)‖ ≈ 0.)

Overall, the convergence of ALS without regularization is unsatisfactory. Besides the possibility of local minima and finite non-convergence, the norm-divergent behavior due to ill-posedness can cause catastrophic failure.
4.3.2 Including Regularization
The motivation for the regularized error E_λ in (13) is to prevent the norm-divergent behavior observed above for E. (It also controls the numerical loss-of-precision error due to ill-conditioning [2].) Since E_λ(g) ≥ λ max_l{‖g^l‖²}, we know ‖g^l‖² ≤ λ⁻¹E_λ(g) for all l. The ALS iteration produces a sequence g^(k) ∈ G_r such that E_λ(g^(0)) ≥ E_λ(g^(k)) ≥ E_λ(g^(k+1)) ≥ Ê_λ^r ≥ 0. We thus have ‖g^{l,(k)}‖² ≤ λ⁻¹E_λ(g^(k)) ≤ λ⁻¹E_λ(g^(0)), and thus the norm-divergent behavior observed for E is not possible. In the right half of Figure 1 the behavior of ALS with regularization is shown on an example that diverges without regularization. Since each g^{l,(k)} = Π_{i=1}^d φ_i^{l,(k)}(x_i), knowing g^{l,(k)} only defines the φ_i^{l,(k)} up to scalars, since e.g. φ_i^{l,(k)}φ_j^{l,(k)} = (cφ_i^{l,(k)})(c⁻¹φ_j^{l,(k)}). To remove this ambiguity, we enforce the normalization convention ‖φ_i^{l,(k)}‖ = ‖φ_j^{l,(k)}‖, and then obtain the bound ‖φ_i^{l,(k)}‖² ≤ (λ⁻¹E_λ(g^(k)))^{1/d} ≤ (λ⁻¹E_λ(g^(0)))^{1/d}. (There is still a sign ambiguity in φ_i^{l,(k)}.)

By the definition of Ê_λ^r as an infimum, there exists a sequence g^(k) ∈ G_r with E_λ(g^(k)) → Ê_λ^r. The sequence E_λ(g^(k)) must be bounded, which implies all the ‖φ_i^{l,(k)}‖ are also bounded. When f ∈ G_R, then for each l and i the sequence {φ_i^{l,(k)}} lies in a bounded, finite-dimensional, and therefore compact set, and thus has a convergent subsequence. Passing to subsequences of subsequences rd times, all the sequences {φ_i^{l,(k)}} converge. Using φ_i^l = lim_{k→∞} φ_i^{l,(k)}, we thus have g ∈ G_r with E_λ(g) = Ê_λ^r. Thus Ê_λ^r is in fact a minimum, and the approximation problem is well-posed. When f ∉ G_R for any finite R we cannot apply this argument directly.

Consider one step of the ALS algorithm as above, but now using E_λ instead of E. The right triangle is formed by (g^(k), g^{1,(k)}, . . . , g^{r,(k)}), (g̃^(1,k), g̃^{1,(1,k)}, . . . , g̃^{r,(1,k)}), and (f, 0, . . . , 0), and we have
$$\Delta_1 E_\lambda^k = \|\Delta_1 g^{(k)}\|^2 + \lambda \sum_{l=1}^{r} \|\Delta_1 g^{l,(k)}\|^2 = \|\Delta_1 g^{(k)}\|^2 + \lambda \sum_{l=1}^{r} \|\Delta\phi_1^{l,(k)}\|^2 \prod_{i=2}^{d} \|\phi_i^{l,(k)}\|^2 . \qquad (69)$$
If f ∈ G_R then by compactness the sequence {z^(k) = (g^{1,(k)}, . . . , g^{r,(k)})}_k must have at least one accumulation point in H^r. If it has only one, then it converges, and defines the limit g^(k) → g ∈ H. The set K of all accumulation points is closed and therefore compact. We will show that K is also connected. If K is not connected, then we can write K = A ∪ B with A ∩ B = ∅ and A and B nonempty and both relatively open and relatively closed within K. Since A and B are relatively closed subsets of a compact set, they are also compact. Since they are disjoint and compact, the minimum distance between them is some η > 0. Let N_A and N_B be open η/4 neighborhoods of A and B. The sequence z^(k) must visit both N_A and N_B infinitely often and use steps of size going to zero. Once the step size is smaller than η/2, the sequence must have an element outside N_A ∪ N_B. Since z^(k) must have such an element each time it moves from N_A to N_B, there are an infinite number of such elements outside N_A ∪ N_B. These points must then have an accumulation point that is not in N_A ∪ N_B and thus not in K. This contradicts the definition of K, and thus K must be connected. In particular, if K contains more than one point then it contains an uncountably infinite number of points. Since E_λ(g^(k)) is non-increasing, all accumulation points must have the same value of E_λ.

We can also say something about the gradient, based on the inequalities (67) and (68). When using E_λ, L is given by (58). Since we have the bound ‖φ_i^{l,(k)}‖² ≤ (λ⁻¹E_λ(g^(0)))^{1/d}, the entries of L are bounded by (1 + λ)(λ⁻¹E_λ(g^(0)))^{(d−1)/d}, and thus ‖L‖ ≤ r^{3/4}(1 + λ)^{1/2}(λ⁻¹E_λ(g^(0)))^{(d−1)/(2d)}. Thus if ‖∆t^(k)‖ → 0 then ‖∇E_λ^k(α)‖ → 0. Thus the ALS algorithm can only converge to stationary points. Without regularization, the matrix L as given in (50) is positive semi-definite, but can have a nullspace if the set {Π_{i=2}^d φ_i^l}_l is linearly dependent. With regularization, we modify that L in (58) by adding λΠ_{i=2}^d ⟨φ_i^{l'}, φ_i^{l'}⟩_i along the diagonal l = l'. With our current normalization convention to maintain ‖φ_i^{l,(k)}‖ = ‖φ_j^{l,(k)}‖, this gives us the bound ‖L⁻¹‖ ≤ (λ min_l{‖g^{l,(k)}‖^{2(d−1)/d}})⁻¹. If ‖g^{l,(k)}‖ maintains a uniform bound away from 0 as k → ∞, then ‖∇E_λ^k(α)‖ → 0 implies ‖∆t^(k)‖ → 0. We cannot exclude the possibility that ‖g^{l,(k)}‖ → 0; for example, b in (53) could evaluate to a zero vector and then the ALS algorithm becomes stuck at a zero function. (In practice numerical noise tends to kick one away from such zero states; in any case they are easy to identify and eliminate.)

Overall, the convergence of ALS with regularization is satisfactory. We still have local minima and the possibility of finite non-convergence, but there cannot be catastrophic divergence.
4.3.3 Loose Ends
We did not discuss how to choose the starting approximation, or (try to) avoid local minima. One strategy that appears necessary is to "grow" r. First one fits using r = 1, then adds to that a small random separable term to obtain r = 2, fits again, and continues until either the desired r is obtained or the error is sufficiently low. It is difficult to quantify the effect of growing r, but qualitatively the effect is dramatic.

In our discussion of convergence, we left open the possibility that the ALS algorithm with regularization could fail to converge because it had an infinite set of accumulation points. It is unclear if this can actually occur. Numerically, it would be very difficult to discover such a case, because we must determine that an ℓ² sequence is not in ℓ¹; one would likely believe that one was converging. Analytically, we can imagine a situation where the accumulation points lie on a circle and the sequence approaches the circle while rotating fast enough to avoid converging to any particular point on the circle. Such behavior occurs in dynamical systems (see e.g. the texts [4, 16]), so it may occur here. Since in our context we have a rather complicated iteration, it appears to be very difficult to show anything using techniques from dynamical systems.
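The rank-growing strategy above can be sketched in a few lines. This is only an outline under my own conventions: `fit` stands for any routine that runs ALS passes on the factors (for example, repeated calls to the one-directional update sketched in Section 4.1), and the perturbation size is an arbitrary choice.

```python
import numpy as np

def grow_rank(theta, r_target, fit, n_passes=50, eps=1e-3, seed=0):
    """Fit with r = 1, then repeatedly append a small random separable term and
    refit, until r_target terms are used.  `fit(theta, phi, n_passes)` is any
    routine that runs ALS passes on the factor matrices phi."""
    rng = np.random.default_rng(seed)
    d, M = len(theta), theta[0].shape[0]
    phi = [rng.standard_normal((M, 1)) for _ in range(d)]      # start with r = 1
    phi = fit(theta, phi, n_passes)
    for _ in range(r_target - 1):
        phi = [np.column_stack([F, eps * rng.standard_normal(M)]) for F in phi]
        phi = fit(theta, phi, n_passes)                        # refit with one more term
    return phi
```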
5 Infinite Dimensional Multilinear Setting
Recall the multilinear setting from Section 2, but let there be a countably infinite number of variables x_i for i = 1, 2, . . .. Our goal is to show how simple modifications of the ALS algorithm can operate in this context, and challenge developers of other algorithms to make sure theirs do so as well. Fix an element
$$f(x) = \sum_{l=1}^{R} \prod_{i=1}^{\infty} \theta_i^l(x_i) . \qquad (70)$$
We wish to approximate f with a function g of the form
$$g(x) = \sum_{l=1}^{r} \prod_{i=1}^{\infty} \phi_i^l(x_i) \qquad (71)$$
with some 1 ≤ r < R. We assume that we can find an initial g with convergent sums, products, and integrals and with ⟨f, g⟩ ≠ 0. We will assume all infinite products over the directions are absolutely convergent, so the order of multiplication does not matter.

We assume that we have one regular computational node per direction and that these can operate in parallel to do the same operation on all directions at once. We assume there is one central node that can communicate simultaneously with all directional nodes and can take infinite products. Directional node i has available to it complete information on {θ_i^l(x_i)}_{l=1}^R and {φ_i^l(x_i)}_{l=1}^r and whatever (scalar) infinite product information is passed from the central node. In particular it does not have access to θ_j^l or φ_j^l for j ≠ i.
5.1 Simultaneous ALS Update Algorithm
The first algorithm computes the orthogonal projections for each direction and then applies all of them simultaneously.

• Iterate until happy:
  – In parallel, the directional nodes for all i:
    ∗ Compute A_i(l, l') = ⟨φ_i^l, φ_i^{l'}⟩_i for l = 1, . . . , r and l' = 1, . . . , r.
    ∗ Compute B_i(l, l') = ⟨φ_i^l, θ_i^{l'}⟩_i for l = 1, . . . , r and l' = 1, . . . , R.
    ∗ Send A_i(l, l') and B_i(l, l') to the central node.
  – The central node:
    ∗ Computes A(l, l') = Π_{i=1}^∞ A_i(l, l') for l = 1, . . . , r and l' = 1, . . . , r.
    ∗ Computes B(l, l') = Π_{i=1}^∞ B_i(l, l') for l = 1, . . . , r and l' = 1, . . . , R.
    ∗ Sends A(l, l') and B(l, l') to all directional nodes.
  – In parallel, the directional nodes for all i:
    ∗ Compute L_i with entries
$$L_i(l, l') = A(l, l') / A_i(l, l') . \qquad (72)$$
    ∗ Compute a column vector b_i whose entries are functions of x_i by
$$b_i(l) = \sum_{l'=1}^{R} \theta_i^{l'}(x_i)\, B(l, l') / B_i(l, l') . \qquad (73)$$
    ∗ Solve for t_i in
$$L_i t_i = b_i . \qquad (74)$$
    ∗ Update φ_i^l to be φ̃_i^l = t_i(l).
The algorithm is clean except for potential divisions by zero. (When such a number is computed as zero, a very small arbitrary number could be used instead.) If the simultaneous updates are sufficiently small it should converge, but for arbitrary starting points it (probably) can diverge. One could change the simultaneous parallel steps to asynchronous parallel steps. This should be more efficient, but even harder to analyze.
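For a finite truncation d < ∞, the message pattern of Section 5.1 can be simulated serially in a few lines. This is a hedged sketch: the factor-matrix convention, names, and the crude zero guard (following the division-by-zero remark above) are my own.

```python
import numpy as np

def simultaneous_als_pass(theta, phi):
    """One pass of the simultaneous update of Section 5.1 on a finite truncation.
    Each "directional node" i forms A_i, B_i; the "central node" forms the full
    products A, B; every node then solves L_i t_i = b_i as in (72)-(74), and all
    directions are updated at once."""
    def guard(x):                                   # crude guard against exact zeros,
        return np.where(x == 0.0, 1e-30, x)         # per the remark above
    d = len(phi)
    A_i = [phi[i].T @ phi[i] for i in range(d)]     # A_i(l, l') = <phi_i^l, phi_i^{l'}>_i
    B_i = [phi[i].T @ theta[i] for i in range(d)]   # B_i(l, l') = <phi_i^l, theta_i^{l'}>_i
    A = np.prod(A_i, axis=0)                        # central node: entrywise products over i
    B = np.prod(B_i, axis=0)
    new_phi = []
    for i in range(d):                              # directional nodes, in parallel
        L_i = A / guard(A_i[i])                     # (72)
        b_i = (B / guard(B_i[i])) @ theta[i].T      # (73): a function of x_i for each l
        new_phi.append(np.linalg.solve(L_i, b_i).T) # (74), then update direction i
    return new_phi
```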
5.2 Greedy ALS Algorithm
This algorithm computes the orthogonal projections for each direction and then applies the one that appears most useful.

• Iterate until happy:
  – In parallel, the directional nodes for all i: Compute and send A_i(l, l') and B_i(l, l').
  – The central node: Compute and send A(l, l') and B(l, l').
  – In parallel, the directional nodes for all i:
    ∗ Compute L_i and b_i.
    ∗ Solve L_i t_i = b_i for t_i.
    ∗ Compute d_i = Σ_{l=1}^r ‖φ_i^l − t_i(l)‖² and send it to the central node.
  – The central node selects the maximum d_i and sends the instruction for the node that produced it to update its φ_i^l(x_i) to be φ̃_i^l(x_i) = t_i(l).

The central node needs the additional ability to choose the maximum of a countably infinite set of numbers and send an instruction to the directional node that supplied that maximum. One could relax this condition to finding a direction within some fixed fraction of the maximum/supremum. Like ALS, this algorithm can never increase the error. The criterion used above to select which direction to update was the greatest change in φ_i^l(x_i). One could instead use the largest (directional) gradient, the most improvement in ‖f − g‖, or other criteria.
Acknowledgments

This work was inspired by participation in the workshop Computational optimization for tensor decompositions at the American Institute of Mathematics in 2010.
References

[1] G. Beylkin and M. J. Mohlenkamp. Numerical operator calculus in higher dimensions. Proc. Natl. Acad. Sci. USA, 99(16):10246–10251, August 2002.

[2] G. Beylkin and M. J. Mohlenkamp. Algorithms for numerical analysis in high dimensions. SIAM J. Sci. Comput., 26(6):2133–2159, July 2005.

[3] G. Beylkin, M. J. Mohlenkamp, and F. Pérez. Approximating a wavefunction as an unconstrained sum of Slater determinants. Journal of Mathematical Physics, 49(3):032107, 2008.

[4] Michael Brin and Garrett Stuck. Introduction to dynamical systems. Cambridge University Press, Cambridge, 2002.

[5] Hans-Joachim Bungartz and Michael Griebel. Sparse grids. Acta Numer., 13:147–269, 2004.
[6] Sambasiva Rao Chinnamsetty, Mike Espig, Boris N. Khoromskij, Wolfgang Hackbusch, and Heinz-Juergen Flad. Tensor product approximation with optimal rank in quantum chemistry. Journal of Chemical Physics, 127(8):Art. No. 084110, August 2007.

[7] Vin de Silva and Lek-Heng Lim. Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM Journal on Matrix Analysis and Applications, Special Issue on Tensor Decompositions and Applications, 30(3):1084–1127, 2008.

[8] Ivan P. Gavrilyuk, Wolfgang Hackbusch, and Boris N. Khoromskij. Hierarchical tensor-product approximation to the inverse and related operators for high-dimensional elliptic problems. Computing, 74(2):131–157, 2005.

[9] M. Griebel and S. Knapek. Optimized general sparse grid approximation spaces for operator equations. Math. Comp., 78(268):2223–2257, 2009.

[10] Michael Griebel and Jan Hamaekers. Sparse grids for the Schrödinger equation. M2AN Math. Model. Numer. Anal., 41(2):215–247, 2007.

[11] W. Hackbusch and B. N. Khoromskij. Low-rank Kronecker-product approximation to multi-dimensional nonlocal operators. I. Separable approximation of multi-variate functions. Computing, 76(3-4):177–202, 2006.

[12] W. Hackbusch and B. N. Khoromskij. Low-rank Kronecker-product approximation to multi-dimensional nonlocal operators. II. HKT representation of certain operators. Computing, 76(3-4):203–225, 2006.

[13] W. Hackbusch, B. N. Khoromskij, and E. E. Tyrtyshnikov. Hierarchical Kronecker tensor-product approximations. J. Numer. Math., 13(2):119–156, 2005.

[14] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[15] Martin J. Mohlenkamp and Lucas Monzón. Trigonometric identities and sums of separable functions. The Mathematical Intelligencer, 27(2):65–69, 2005. http://www.math.ohiou.edu/~mjm/research/sine.pdf.

[16] Lawrence Perko. Differential equations and dynamical systems, volume 7 of Texts in Applied Mathematics. Springer-Verlag, New York, third edition, 2001.

[17] Harry Yserentant. On the regularity of the electronic Schrödinger equation in Hilbert spaces of mixed derivatives. Numer. Math., 98(4):731–759, 2004.

[18] Harry Yserentant. Sparse grid spaces for the numerical solution of the electronic Schrödinger equation. Numer. Math., 101:381–389, 2005.

[19] Harry Yserentant. The hyperbolic cross space approximation of electronic wavefunctions. Numer. Math., 105:659–690, 2007.

[20] Harry Yserentant. Regularity and approximability of electronic wave functions, volume 2000 of Lecture Notes in Mathematics. Springer-Verlag, Berlin, 2010.