Inference Over Heterogeneous Finite-/Infinite-Dimensional Systems Using Factor Graphs and Gaussian Processes

David M. Rosen, Guoquan Huang, and John J. Leonard

The authors are with the Computer Science and Artificial Intelligence Laboratory of the Massachusetts Institute of Technology, Cambridge, MA 02139, USA. Email: {dmrosen,gqhuang,jleonard}@mit.edu.

Abstract—The ability to reason over partially observable networks of interacting states is a fundamental competency in probabilistic robotics. While the well-known factor graph and Gaussian process models provide flexible and computationally efficient solutions for this inference problem in the special cases in which all of the hidden states are either finite-dimensional parameters or real-valued functions, respectively, in many cases we are interested in reasoning about heterogeneous networks whose hidden states are comprised of both finite-dimensional parameters and functions. To that end, in this paper we propose a novel probabilistic generative model that incorporates both factor graphs and Gaussian processes to model these heterogeneous systems. Our model improves upon prior approaches to inference within these networks by removing the assumption of any specific set of conditional independences amongst the modeled states, thereby significantly expanding the class of systems that can be represented. Furthermore, we show that inference within this model can always be performed by means of a two-stage procedure involving inference within a factor graph followed by inference over a Gaussian process; by exploiting fast inference methods for the individual factor graph and Gaussian process models to solve each of these subproblems in succession, we thus obtain a general framework for computationally efficient inference over heterogeneous finite-/infinite-dimensional systems.

I. INTRODUCTION

Many fundamental problems in robotics can be formulated as instances of inference over a network of interacting random states in which the goal is to estimate the values of some subset of the states given noisy observations of others; for example, the canonical problems of filtering, smoothing, localization, and mapping all belong to this class [1]. The development of computationally efficient inference methods for solving problems of this type is thus of significant practical import.

For the common special case in which all of the states are finite-dimensional parameters, factor graphs [2] have proven to be particularly useful: these models generalize and encompass both Bayesian networks and Markov random fields [3] (thus providing a unified theoretical framework for inference), and recent work in the robotics community has led to the development of efficient inference algorithms for these models that can solve problems involving tens of thousands of continuous random variables in real-time on a single processor [4]–[7]. More recently, Gaussian processes [8] have emerged as another useful class of models for the special case in which the hidden states are infinite collections of real values (i.e. real-valued functions); for example, recent work has applied these models to develop generalized Bayes filters [9] and Rauch-Tung-Striebel smoothers [10] in which the entire process and observation models can be estimated nonparametrically directly from data.

However, we are often interested in modeling heterogeneous networks whose states are comprised of both finite-dimensional parameters and entire functions (this is the case, for example, when modeling hybrid discrete-/continuous-time systems). While some special cases of this problem have recently been addressed in the robotics literature (e.g. in [9]–[11]), the approaches presented therein are developed for networks that assume the conditional independence of specific subsets of the modeled states (e.g. the hidden Markov models in [9], [10]), and therefore may not generalize to broader classes of networks in which these relations no longer hold. Indeed, to the best of our knowledge, there has been no discussion in the robotics literature of efficient inference methods for heterogeneous finite-/infinite-dimensional systems in the general case.

To that end, in this paper we develop a probabilistic generative model incorporating factor graphs and Gaussian processes to represent generic heterogeneous finite-/infinite-dimensional systems. In contrast to prior work, our model does not assume any specific set of conditional independences amongst the modeled states; the only requirement is that any interaction between states must be mediated by a finite set of finite-dimensional values (a significantly weaker condition). Furthermore, we show that inference within this model can always be performed by means of a two-stage procedure involving inference within a factor graph followed by inference over a Gaussian process; by applying efficient inference methods (based upon recent advances in sparse numerical linear algebra [12]–[14]) for the individual factor graph and Gaussian process models to solve each of these subproblems in succession, we are thus still able to opportunistically exploit whatever conditional independences do hold in a particular application (in the form of sparse linear systems, as described in Section II) to achieve fast computation without the need to explicitly require these independences in the design of our model itself. Through this approach we thus obtain a flexible and computationally efficient framework for inference over heterogeneous finite-/infinite-dimensional systems in the general case.

II. REVIEW OF MATHEMATICAL PRELIMINARIES

In this section we review the formulations of the factor graph and Gaussian process models together with their associated inference algorithms.

A. Factor graphs

1) Model formulation: A factor graph [2] is a bipartite graph that encodes the factorization of a probability distribution: given a distribution p : Ω → R over several variables Θ = (θ_1, ..., θ_n) ∈ Ω with factorization

p(\Theta) = \prod_{i=1}^{m} p_i(\Theta_i),    (1)

where Θ_i ⊆ {θ_1, ..., θ_n} for all 1 ≤ i ≤ m, the corresponding factor graph G = (F, Θ, E) is:

F = \{p_1, \dots, p_m\},
\Theta = \{\theta_1, \dots, \theta_n\},    (2)
E = \{(p_i, \theta_j) \mid \theta_j \in \Theta_i\}.

By (2), the factor nodes p_i ∈ F of G are in one-to-one correspondence with the factors of p in (1), the variable nodes θ_j ∈ Θ are in one-to-one correspondence with the arguments of p, and factor node p_i and variable node θ_j share an edge e_{ij} = (p_i, θ_j) ∈ E if and only if the variable θ_j appears as an argument to factor p_i in (1).

2) Inference in factor graphs: In general, we will be interested in the problem of Bayesian inference: given a hidden parameter Θ, an observable variable Z, and the joint distribution p(Z, Θ) = p(Z|Θ) · p(Θ) relating the two, we wish to obtain the posterior belief p(Θ|Z) for Θ given a measured value of Z:

p(\Theta \mid Z) = \frac{p(Z, \Theta)}{p(Z)}.    (3)

Unfortunately, without assuming special structure in the joint distribution p(Z, Θ) (e.g. prior conjugacy, etc.), the computation of the exact posterior p(Θ|Z) is generally intractable, since this requires the evaluation of a (possibly very high-dimensional) integral to obtain the evidence p(Z):

p(Z) = \int p(Z, \Theta) \, d\Theta.    (4)

On the other hand, it is often relatively straightforward to obtain the maximum a posteriori (MAP) point estimate Θ̂_MAP:

\hat{\Theta}_{\mathrm{MAP}} = \operatorname*{argmax}_{\Theta} p(\Theta \mid Z);    (5)

for even if p(Z) in (3) is unknown, it is constant with respect to Θ, and therefore (by virtue of (3) and (5)),

\hat{\Theta}_{\mathrm{MAP}} = \operatorname*{argmax}_{\Theta} \frac{p(Z, \Theta)}{p(Z)} = \operatorname*{argmax}_{\Theta} p(Z, \Theta).    (6)

Computation of Θ̂_MAP thus only requires that it be tractable to optimize the joint distribution p(Z, Θ) as a function of Θ. Let us assume in the sequel (as is commonly the case in practice) that p(Z, Θ) is a twice-differentiable probability density function, and that it factors as

p(Z, \Theta) = \prod_{i=1}^{m} p_i(Z_i, \Theta_i).    (7)

Under these conditions, the computational cost of optimizing (6) using Newton-type methods [15] turns out to be determined by the sparsity pattern of the factor graph corresponding to (7). To see this, observe that

\hat{\Theta}_{\mathrm{MAP}} = \operatorname*{argmax}_{\Theta} p(Z, \Theta)
  = \operatorname*{argmin}_{\Theta} \bigl[ -\ln p(Z, \Theta) \bigr]
  = \operatorname*{argmin}_{\Theta} \Bigl[ -\sum_{i=1}^{m} \ln p_i(Z_i, \Theta_i) \Bigr],    (8)

so that (by virtue of the final line of (8)) the Hessian

H = \frac{\partial^2}{\partial \Theta^2} \bigl[ -\ln p(Z, \Theta) \bigr] = (H_{jk})_{j,k=1}^{\dim(\Theta)}    (9)

has H_{jk} ≠ 0 only if θ_j, θ_k ∈ Θ_i for some factor p_i in (7), i.e., only if variable nodes θ_j and θ_k are connected to a common factor node p_i in G. The edge set E of the factor graph G corresponding to (7) thus directly encodes the sparsity pattern of the Hessian H, and this in turn determines the cost of computing the update step during each iteration of the optimization (8) (cf. e.g. [5]–[7] for details).

After computing the point estimate Θ̂_MAP in (8), one can additionally recover an approximation for the entire posterior p(Θ|Z) by means of the Laplace approximation [16, Sec. 4.4]:

\Theta \mid Z \approx \mathcal{N}\bigl(\hat{\Theta}_{\mathrm{MAP}}, \Lambda_{\Theta|Z}^{-1}\bigr),    (10a)
\Lambda_{\Theta|Z} = \frac{\partial^2}{\partial \Theta^2} \bigl[ -\ln p(Z, \Theta) \bigr] \Big|_{\Theta = \hat{\Theta}_{\mathrm{MAP}}};    (10b)

this approach approximates the true posterior using a Gaussian distribution that is locally fitted to p(Θ|Z) at Θ̂_MAP (more precisely, it fits a second-order Taylor series expansion to −ln p(Z, Θ) at Θ̂_MAP). We observe that the information matrix Λ_{Θ|Z} defined in (10b) is simply the Hessian in (9) evaluated at Θ̂_MAP; since Newton-type methods compute this matrix (or an approximation to it) as part of solving the optimization (8), it is thus available at no additional computational cost beyond that needed to compute Θ̂_MAP itself.
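As a concrete illustration of the MAP-plus-Laplace pipeline in (8)–(10), the following minimal sketch (ours, not from the paper: the two-variable toy factor graph, all numerical values, and the use of SciPy's least-squares solver are illustrative assumptions) computes Θ̂_MAP for a small Gaussian factor graph and reads the information matrix Λ_{Θ|Z} off the same Jacobian the optimizer already builds.

```python
import numpy as np
from scipy.optimize import least_squares

# Toy factor graph over Theta = (theta1, theta2) with three Gaussian factors:
# a prior on theta1, a relative ("odometry-like") factor between theta1 and
# theta2, and a direct measurement of theta2.  All numbers are made up.
z_prior, z_rel, z_meas = 0.0, 1.0, 1.2      # measured values
s_prior, s_rel, s_meas = 0.1, 0.05, 0.2     # measurement standard deviations

def residuals(theta):
    """Whitened residuals, so that -ln p(Z, Theta) = 0.5 * ||r(Theta)||^2 + const."""
    t1, t2 = theta
    return np.array([
        (t1 - z_prior) / s_prior,
        ((t2 - t1) - z_rel) / s_rel,
        (t2 - z_meas) / s_meas,
    ])

# MAP estimation (8): nonlinear least squares via a Newton-type (trust-region) method.
sol = least_squares(residuals, x0=np.zeros(2))
theta_map = sol.x

# Laplace approximation (10): for this whitened least-squares form, the
# Gauss-Newton Hessian approximation J^T J at the MAP estimate serves as the
# information matrix Lambda_{Theta|Z}; its sparsity pattern mirrors the factor
# graph's connectivity.
J = sol.jac
info = J.T @ J
cov = np.linalg.inv(info)

print("MAP estimate:", theta_map)
print("Laplace posterior covariance:\n", cov)
```

In a realistic problem the residual vector would simply stack one whitened block per factor in (7), and a sparse factorization [12]–[14] would replace the dense solve.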

B. Gaussian processes

1) Model formulation: Formally, a Gaussian process is a collection of random variables {F_x}_{x∈𝒳} ⊆ R^d indexed by some (possibly infinite) set 𝒳, any finite subset of which have a jointly Gaussian distribution [8], [17]. Since finite-dimensional Gaussian distributions are completely determined by their means and (co)variances, the joint distribution for a finite subset {F_{x_i}}_{i=1}^{n} of Gaussian process-distributed random variables can be completely specified by first and second moments of the form E[F_x] and E[(F_x − E[F_x])(F_{x'} − E[F_{x'}])^T] for x, x' ∈ {x_i}_{i=1}^{n}. Since these moments exist for any choices of x, x' ∈ 𝒳, we may define functions according to

m : \mathcal{X} \to \mathbb{R}^d
m(x) = \mathbb{E}[F_x]    (11)

and

k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^{d \times d}
k(x, x') = \mathbb{E}\bigl[(F_x - \mathbb{E}[F_x])(F_{x'} - \mathbb{E}[F_{x'}])^{\mathsf{T}}\bigr] = \mathbb{E}\bigl[(F_x - m(x))(F_{x'} - m(x'))^{\mathsf{T}}\bigr];    (12)

the functions m and k defined in (11) and (12) are called the mean function and covariance function for the Gaussian process, respectively. Conversely, given any function m : 𝒳 → R^d and any positive-definite matrix-valued kernel k : 𝒳 × 𝒳 → R^{d×d}, m and k determine (by means of (11) and (12)) a Gaussian process {F_x}_{x∈𝒳} ⊆ R^d for which they are the mean and covariance functions [8], [17]. We write

f \sim \mathcal{GP}(m, k)    (13)

to denote that f ≜ {F_x}_{x∈𝒳} is a collection of random variables distributed according to the Gaussian process with mean function m and covariance function k.

Since the Gaussian process (13) assigns to each x ∈ 𝒳 a Gaussian-distributed random value F_x ∈ R^d, the entire collection of random variables f = {F_x}_{x∈𝒳} can be thought of as a random function f : 𝒳 → R^d. In this view (the so-called function-space view [8]), a Gaussian process GP(m, k) specifies a probability distribution over the entire set of functions {f : 𝒳 → R^d} whose domain is 𝒳. Gaussian processes thus provide a very useful class of priors for performing nonparametric regression and interpolation/extrapolation in a Bayesian setting when the true parametric form of the regression function is itself uncertain.

2) Inference in Gaussian processes: When performing inference with Gaussian process models, we will generally be interested in determining the posterior belief for a function with a Gaussian process prior given observations of its values at several points. More precisely, we wish to determine the belief for f̄ ≜ f | F, where f ∼ GP(m, k) and

X = (x_1, \dots, x_{n_X}) \in \mathcal{X}^{n_X},
F = f(X) = (f(x_1), \dots, f(x_{n_X})) \in \mathbb{R}^{d n_X}    (14)

denote a vector of n_X points in 𝒳 and the corresponding vector of f's values at those points, respectively.

First, we observe that in order to recover the posterior belief for f̄, it suffices to determine the posterior belief F∗ | F for f's values F∗ = f(X∗) on some second finite subset of test points X∗ ∈ 𝒳^{n_X∗}, since we can then obtain f | F from F∗ | F by simply taking X∗ = x∗ ∈ 𝒳 to be a single test point and then allowing it to vary pointwise over the entire domain 𝒳. To that end, consider the joint distribution for (F, F∗):

\begin{pmatrix} F \\ F_* \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} M \\ M_* \end{pmatrix},\; \begin{pmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{pmatrix} \right),    (15)

where

M = m(X) = (m(x_1), \dots, m(x_{n_X})) \in \mathbb{R}^{d n_X},    (16)

M∗ = m(X∗) analogously, and K(X, Y) denotes the Gram matrix

K(X, Y) = \begin{pmatrix} k(x_1, y_1) & \cdots & k(x_1, y_{n_Y}) \\ \vdots & \ddots & \vdots \\ k(x_{n_X}, y_1) & \cdots & k(x_{n_X}, y_{n_Y}) \end{pmatrix} \in \mathbb{R}^{d n_X \times d n_Y}    (17)

for two vectors X ∈ 𝒳^{n_X} and Y ∈ 𝒳^{n_Y}. Since F and F∗ are jointly Gaussian-distributed according to (15), the conditional distribution F∗ | F is also Gaussian-distributed according to

F_* \mid F \sim \mathcal{N}\bigl(M_{F_*|F}, \Sigma_{F_*|F}\bigr),
M_{F_*|F} = M_* + K(X_*, X) K_X^{-1} (F - M),    (18)
\Sigma_{F_*|F} = K(X_*, X_*) - K(X_*, X) K_X^{-1} K(X, X_*),

where we have introduced the notation K_X ≜ K(X, X), since this quantity is constant with respect to X∗.

Now, since (18) holds for every choice of X and X∗, letting X∗ = x∗ ∈ 𝒳 be a single point (so that F∗ = f(x∗) is just the value of f at x∗), we find that the posterior belief for f̄ is again Gaussian process-distributed according to

\bar{f} \sim \mathcal{GP}(\bar{m}, \bar{k}),    (19a)
\bar{m}(x) = m(x) + k_X(x) K_X^{-1} (F - M),    (19b)
\bar{k}(x, x') = k(x, x') - k_X(x) K_X^{-1} k_X(x')^{\mathsf{T}},    (19c)

where k_X is the single-variable function defined by

k_X : \mathcal{X} \to \mathbb{R}^{d \times d n_X}
k_X(x) = K(x, X)    (20)

for fixed X. Gaussian processes thus provide a class of priors on the set of functions {f : 𝒳 → R^d} that is closed under posterior updates given observations of the function values F = f(X) on some finite subset of points X ∈ 𝒳^{n_X}.
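As a concrete numerical illustration of the posterior update (19)–(20), the following sketch (ours, not the paper's: the squared-exponential kernel, the zero prior mean, and all numbers are illustrative assumptions) computes m̄ and the pointwise posterior standard deviation for a scalar-valued f on a grid of test inputs.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def k(a, b, s2=1.0, l=0.3):
    """Squared-exponential kernel (an illustrative choice of k)."""
    return s2 * np.exp(-(a[:, None] - b[None, :])**2 / (2 * l**2))

# Observed function values F = f(X) (scalar-valued, d = 1) and test inputs x_*.
X = np.array([0.1, 0.4, 0.5, 0.9])
F = np.sin(2 * np.pi * X)            # stand-in "observations" of f
x_star = np.linspace(0.0, 1.0, 200)

m = lambda a: np.zeros_like(a)       # prior mean m == 0

# Posterior mean and covariance per (19b)-(19c), with K_X = K(X, X).
K_X = k(X, X) + 1e-9 * np.eye(X.size)        # jitter for numerical stability
chol = cho_factor(K_X)
alpha = cho_solve(chol, F - m(X))            # K_X^{-1} (F - M)
k_star = k(x_star, X)                        # k_X(x_*) = K(x_*, X)

m_bar = m(x_star) + k_star @ alpha                               # (19b)
k_bar = k(x_star, x_star) - k_star @ cho_solve(chol, k_star.T)   # (19c)
std = np.sqrt(np.clip(np.diag(k_bar), 0.0, None))

print(m_bar[:5], std[:5])
```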

3) A word on the design of Gaussian process models: As in all kernel-based methods, the covariance (i.e. kernel) function k(x, x') plays a crucial role in Gaussian process models. In this subsection, we show how to characterize several practically-important properties of Gaussian processes (e.g. sample function differentiability class) in terms of easily-ascertained properties of their kernel functions, and provide a few guidelines for the design of kernels in application.

Given f ∼ GP(m, k), equations (19)–(20) show how to obtain the posterior belief for f̄ = f | F after incorporating knowledge of f's values F = f(X) on a vector X of inputs. When using this posterior belief to predict the value of f at other (unobserved) inputs, the corresponding point estimator is just the posterior mean m̄ : 𝒳 → R^d, since

\bar{m}(x_*) = \mathbb{E}[f(x_*) \mid f(X)] = \operatorname*{argmax}_{f(x_*) \in \mathbb{R}^d} p(f(x_*) \mid f(X))    (21)

for all x∗ ∈ 𝒳 by virtue of (19). Since K_X, F, and M are constant with respect to x∗, equations (17) and (19b) imply that the posterior prediction function m̄ is a linear combination of the prior mean function m and terms of the form k(·, x_i)v_i, where v_i ∈ R^d for 1 ≤ i ≤ n_X; in particular, if m ≡ 0 (as is commonly assumed in practice), the predictor m̄ is just a linear combination of the terms k(·, x_i)v_i. Domain-specific knowledge can thus inform the design of the kernel function k so as to obtain a predictor m̄ with a parametric form that is well-suited to the prediction task at hand.

The kernel function k(x, x') also serves to define a notion of "similarity" between points x, x' ∈ 𝒳 in the input space, or equivalently, how tightly coupled the function values f(x) and f(x') are. In particular, if the kernel function k(·, x_i) : 𝒳 → R has a neighborhood N(x_i) outside of which ‖k(·, x_i)‖ is negligibly small, then f(x∗) and f(x_i) are only weakly coupled whenever x∗ ∉ N(x_i), and in this case equation (19b) shows that f's value f(x_i) at x_i does not significantly affect the value of the posterior mean prediction m̄(x∗). The choice of a kernel function k with a characteristic spatial scale (e.g. the radial basis kernel) thus gives rise to a Gaussian process whose sample functions f and posterior mean m̄ likewise have a characteristic spatial scale over which their values vary (cf. [8, Sec. 4.2]). Thus, if the system being modeled has a characteristic spatial scale, this knowledge can again be incorporated in the design of k.

Furthermore, knowledge of such a characteristic spatial scale can also be exploited in the design of k to reduce the computational cost of performing inference over the resulting Gaussian process model. As shown in (19), computing the posterior mean and covariance functions m̄ and k̄ requires the evaluation of a matrix-vector or matrix-matrix product with the inverse of the kernel matrix K_X ∈ R^{d n_X × d n_X}; in general, these operations are O(d³ n_X³), which can quickly become prohibitively expensive as the number of observations n_X increases. However, if k is chosen such that each k(·, x_i) is supported on a compact set whose size is determined by the spatial scale in the modeled system, then many of the elements k(x_i, x_j) in K_X will be zero, i.e. K_X (and likewise k_X(x)) will be block sparse; this sparsity can then be exploited to significantly reduce the computational cost of prediction (19).
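The computational benefit of a compactly supported kernel shows up directly in the structure of K_X. The sketch below (illustrative only: the particular Wendland-type compactly supported kernel, the problem size, and the data are all assumptions of ours) builds K_X for a one-dimensional input set, stores it sparsely, and solves the linear system appearing in (19b) with a sparse solver.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def k_compact(a, b, s2=1.0, support=0.05):
    """A compactly supported (Wendland-type) kernel: zero whenever |a - b| > support."""
    r = np.abs(a[:, None] - b[None, :]) / support
    return s2 * np.where(r < 1.0, (1.0 - r)**3 * (3.0 * r + 1.0), 0.0)

n = 2000
X = np.sort(np.random.default_rng(1).uniform(0.0, 1.0, n))
F = np.sin(2 * np.pi * X) + 0.05 * np.random.default_rng(2).normal(size=n)

K_X = sp.csc_matrix(k_compact(X, X) + 1e-6 * np.eye(n))
print(f"fraction of nonzero entries in K_X: {K_X.nnz / n**2:.3%}")

# Sparse solve of K_X^{-1}(F - M) for the posterior mean (19b); a sparse Cholesky
# factorization (e.g. CHOLMOD [14]) would exploit the band structure even more directly.
alpha = spla.spsolve(K_X, F)
x_star = np.linspace(0.0, 1.0, 5)
m_bar = k_compact(x_star, X) @ alpha
print(m_bar)
```

With a support much smaller than the input range, only a small fraction of the entries of K_X are nonzero, so the cost of prediction scales far better than the dense O(n_X³) solve.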

Finally, for cases in which 𝒳 ⊆ R^n, the kernel function also determines the continuity and differentiability classes of the sample functions drawn from a mean-zero Gaussian process. If f ∼ GP(0, k) with k : 𝒳 × 𝒳 → R (so that f is scalar-valued) and there exist ε > 0 and C > 0 such that

\mathbb{E}\bigl[\,|f(x) - f(y)|^2\,\bigr] \le \frac{C}{\bigl|\log \lVert x - y \rVert\bigr|^{1+\varepsilon}}    (22)

for all x, y ∈ I with I ⊂ 𝒳 a compact set, then f is continuous on I with probability 1; for the common special case in which k is a stationary kernel (i.e. in which k(x, y) = ρ(x − y) for some ρ : 𝒳 → R), condition (22) simplifies to

\rho(0) - \rho(x) \le \frac{C}{\bigl|\log \lVert x \rVert\bigr|^{1+\varepsilon}}    (23)

(cf. [18, pgs. 60–62]). Conditions (22) and (23) can be used to establish vector-valued sample function differentiability up to order D as follows. Let f ∼ GP(0, k) with k : 𝒳 × 𝒳 → R^{d×d}, and write f and k in coordinates as

f(x) = (f_1(x), \dots, f_d(x)) \in \mathbb{R}^d,
k(x, x') = \bigl(k_{ij}(x, x')\bigr)_{i,j=1}^{d} = \begin{pmatrix} k_{11}(x, x') & \cdots & k_{1d}(x, x') \\ \vdots & \ddots & \vdots \\ k_{d1}(x, x') & \cdots & k_{dd}(x, x') \end{pmatrix} \in \mathbb{R}^{d \times d}.    (24)

If the second partial derivative ∂²k_ii/(∂x_a ∂x'_a)(x, x') exists at (x∗, x∗) ∈ 𝒳 × 𝒳, then the mean-square partial derivative ∂f_i/∂x_a exists at x∗ and is jointly mean-zero Gaussian process-distributed with f according to

\mathbb{E}\!\left[ f_i(x)\, \frac{\partial f_j}{\partial x_b}(x') \right] = \frac{\partial k_{ij}}{\partial x'_b}(x, x'),
\mathbb{E}\!\left[ \frac{\partial f_i}{\partial x_a}(x)\, \frac{\partial f_j}{\partial x_b}(x') \right] = \frac{\partial^2 k_{ij}}{\partial x_a\, \partial x'_b}(x, x'),    (25)

where 1 ≤ a, b ≤ n (cf. [8, Secs. 4.1.1 and 9.4]). By induction, if each of the mixed partial derivatives ∂^{α+β}k_ij/(∂x^α ∂x'^β) exists at (x∗, x∗) ∈ 𝒳 × 𝒳 for all multi-indices α, β ∈ N^n with 0 ≤ |α|, |β| ≤ D, then f has mean-square mixed partial derivatives ∂^α f_i/∂x^α(x∗) of all orders up to and including D, and all of these partial derivatives are jointly mean-zero Gaussian process-distributed with f according to

\mathbb{E}\!\left[ \frac{\partial^\alpha f_i}{\partial x^\alpha}(x)\, \frac{\partial^\beta f_j}{\partial x^\beta}(x') \right] = \frac{\partial^{\alpha+\beta} k_{ij}}{\partial x^\alpha\, \partial x'^{\beta}}(x, x').    (26)

Equation (26) and conditions (22)–(23) can be used to establish that f ∈ C^D(I, R^d) with probability 1 by showing that each of the partial derivatives ∂^α f_i/∂x^α has continuous sample paths on I ⊂ 𝒳 with probability 1 for all 1 ≤ i ≤ d and all 0 ≤ |α| ≤ D (cf. [19, Sec. 2.5]). Thus, for applications in which sample functions must belong to a certain smoothness class (e.g. in the case of physical mechanical systems obeying Newton's laws, for which the trajectory is the second integral of the applied forces with respect to time), this knowledge can once again be incorporated into the design of the kernel k. Furthermore, the fact that f and its derivatives ∂^α f_i/∂x^α are jointly Gaussian process-distributed according to (26) allows the integration of observations of f's derivatives into the inference framework outlined in Section II-B2 whenever such observations are available (cf. [8, Sec. 9.4] and [20]).

Interested readers are encouraged to consult [17], [21] for more information on kernel-based machine learning techniques in general (including an extensive listing of commonly-used kernel functions and methods for constructing new kernels out of old), and [8], [18], [19], [21] for more information on the design of Gaussian process models in particular.
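To make the derivative relations (25)–(26) concrete, the following sketch (ours; the squared-exponential kernel and all numerical values are illustrative assumptions) assembles the joint covariance of a scalar f and its first derivative f' on a grid and draws a joint sample, checking it against finite differences.

```python
import numpy as np

# Joint covariance of (f(X), f'(X)) for a scalar GP f ~ GP(0, k) on R, using the
# squared-exponential kernel k(x, x') = s2 * exp(-(x - x')^2 / (2 * l^2)).
# Per (25): Cov[f(x),  f'(x')] = dk/dx'(x, x')        = k * (x - x') / l^2,
#           Cov[f'(x), f'(x')] = d^2k/(dx dx')(x, x') = k * (1/l^2 - (x - x')^2 / l^4).
s2, l = 1.0, 0.5                      # assumed hyperparameters
x = np.linspace(0.0, 2.0, 50)
D = x[:, None] - x[None, :]

k = s2 * np.exp(-D**2 / (2 * l**2))   # Cov[f(x),  f(x')]
k_d = k * D / l**2                    # Cov[f(x),  f'(x')]
k_dd = k * (1.0 / l**2 - D**2 / l**4) # Cov[f'(x), f'(x')]

# Stack (f, f') and sample from the joint mean-zero Gaussian implied by (26).
K_joint = np.block([[k, k_d], [k_d.T, k_dd]]) + 1e-9 * np.eye(2 * x.size)
sample = np.random.default_rng(0).multivariate_normal(np.zeros(2 * x.size), K_joint)
f_s, df_s = sample[:x.size], sample[x.size:]

# The sampled derivative closely matches finite differences of the sampled f.
print(np.max(np.abs(np.gradient(f_s, x) - df_s)))
```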

III. INFERENCE OVER HETEROGENEOUS FINITE-/INFINITE-DIMENSIONAL SYSTEMS

The factor graph and Gaussian process models of Section II provide extremely useful approaches for probabilistic inference over finite-dimensional parameters or entire functions, respectively; in this section, we show how to incorporate both of these models into a framework that enables inference over heterogeneous systems whose states are comprised of both finite-dimensional parameters and entire functions. We begin in Section III-A by introducing a probabilistic generative model to describe these systems, and then derive a set of computationally efficient algorithms for performing (approximate) Bayesian inference within this model in Section III-B.

Fig. 1. The factor graph describing the joint distribution of the finite-dimensional parameters in the model (27): here Θ is a hidden parameter with prior p_Θ(·), F = f(X) is the hidden vector of (random) values of a function f ∼ GP(m, k) evaluated on the input points X ∈ 𝒳^{n_X}, and Z is a vector of observations related to F and Θ through the likelihood (i.e. measurement) model p_Z(·|F, Θ). We also introduce an additional hidden vector of function values F∗ = f(X∗) on the inputs X∗ ∈ 𝒳^{n_X∗} that does not directly influence the observation Z, but whose posterior distribution we nevertheless wish to infer using the Gaussian process prior on f.

A. Model formulation

We are interested in performing inference over heterogeneous systems whose hidden states consist of both finite-dimensional parameters and functions. To that end, we define the following probabilistic generative model:

\Theta \sim p_\Theta(\cdot)
f \sim \mathcal{GP}(m, k)
F = f(X)    (27)
Z \mid F, \Theta \sim p_Z(\cdot \mid F, \Theta),

where Θ is a finite-dimensional parameter with prior p_Θ(·), f : 𝒳 → R^d is a function with prior GP(m, k) taking values F = f(X) on X ∈ 𝒳^{n_X}, and Z is a finite-dimensional vector of observations related to F and Θ through the likelihood (i.e. measurement) model p_Z(·|F, Θ). The factor graph describing the joint distribution of the finite-dimensional parameters Θ, F, and Z in (27) is shown in Fig. 1.

As can be seen in Fig. 1, (27) models the observation Z as arising from the interaction of a finite-dimensional hidden state Θ and a finite set of values F of a hidden function f. It does not enforce any conditional independence relations between Θ, F, and Z (equivalently, it does not require that p(Θ), p(F), or p(Z|F, Θ) admit any nontrivial factorizations), and therefore suffices to model any measurement arising from any interaction amongst (finite- or infinite-dimensional) hidden states that is mediated by a finite-dimensional set of values, as claimed. Finally, we point out that although (27) does constrain the observation Z to be finite-dimensional, this is not actually a limitation in practice, as physical sensors are only capable of collecting finitely many measurements.

B. Inference

In this section we derive inference methods for computing the joint posterior distribution p(f, Θ|Z) and the marginals p(f|Z) and p(Θ|Z) in the model (27). To begin, we observe that (by the same logic as was used in Section II-B2) performing inference over the entire function f is equivalent to replacing f with F∗ = f(X∗) and then allowing X∗ = x∗ ∈ 𝒳 to vary pointwise over all of 𝒳. To that end, we consider the joint posterior p(F∗, F, Θ|Z):

p(F_*, F, \Theta \mid Z) = p(F_* \mid F, \Theta, Z) \cdot p(F, \Theta \mid Z) = p(F_* \mid F) \cdot p(F, \Theta \mid Z),    (28)

where the first equality follows from the chain rule of probability and the second from the fact that F∗ ⊥⊥ (Z, Θ) | F (cf. Fig. 1). The first distribution p(F∗|F) in (28) comes directly from the Gaussian process prior on f and is given in closed form by (18); the second, p(F, Θ|Z), can be approximated by applying the Laplace approximation (10) to the subgraph in Fig. 1 determined by the solid edges (as described in Section II-A2) to produce:

\begin{pmatrix} F \\ \Theta \end{pmatrix} \Big|\, Z \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_{\bar{F}} \\ \mu_{\bar{\Theta}} \end{pmatrix},\; \begin{pmatrix} \Sigma_{\bar{F}\bar{F}} & \Sigma_{\bar{F}\bar{\Theta}} \\ \Sigma_{\bar{\Theta}\bar{F}} & \Sigma_{\bar{\Theta}\bar{\Theta}} \end{pmatrix} \right).    (29)

We can further decompose this distribution as

p(F, \Theta \mid Z) = p(F \mid \Theta, Z) \cdot p(\Theta \mid Z),    (30)

and by virtue of (29) we have

\Theta \mid Z \sim \mathcal{N}(\mu_{\bar{\Theta}}, \Sigma_{\bar{\Theta}\bar{\Theta}})    (31)

and

F \mid \Theta, Z \sim \mathcal{N}\bigl(\mu_{\bar{F}|\bar{\Theta}}, \Sigma_{\bar{F}|\bar{\Theta}}\bigr),
\mu_{\bar{F}|\bar{\Theta}} = \mu_{\bar{F}} + \Sigma_{\bar{F}\bar{\Theta}} \Sigma_{\bar{\Theta}\bar{\Theta}}^{-1} (\Theta - \mu_{\bar{\Theta}}),    (32)
\Sigma_{\bar{F}|\bar{\Theta}} = \Sigma_{\bar{F}\bar{F}} - \Sigma_{\bar{F}\bar{\Theta}} \Sigma_{\bar{\Theta}\bar{\Theta}}^{-1} \Sigma_{\bar{\Theta}\bar{F}}.

Now, we can obtain the joint distribution p(F∗, Θ|Z) from p(F∗, F, Θ|Z) by marginalizing F. By virtue of (28) and (30), this can be written as:

p(F_*, \Theta \mid Z) = \int p(F_*, F, \Theta \mid Z) \, dF = p(\Theta \mid Z) \int p(F_* \mid F) \cdot p(F \mid \Theta, Z) \, dF.    (33)

We observe that the distributions p(F∗|F) and p(F|Θ, Z) in the integrand in the final line of (33) can be interpreted as a (Gaussian) likelihood for F∗ given F and a (Gaussian) prior for F, respectively, so that the entire integral simply represents the marginal distribution for F∗ given Θ and Z:

p(F_* \mid \Theta, Z) = \int p(F_* \mid F) \cdot p(F \mid \Theta, Z) \, dF.    (34)

In general, given the Gaussian prior and likelihood models:

x \sim \mathcal{N}(\mu_x, \Sigma_x)
y \mid x \sim \mathcal{N}\bigl(Ax + b, \Sigma_{y|x}\bigr),    (35)

the marginal distribution of y is:

y \sim \mathcal{N}(\mu_y, \Sigma_y),
\mu_y = A\mu_x + b,    (36)
\Sigma_y = \Sigma_{y|x} + A \Sigma_x A^{\mathsf{T}}

(cf. e.g. [16, Sec. 2.3.3]). Equations (18), (32), and (34)–(36) together imply that the distribution p(F∗|Θ, Z) in (34) is given in closed form by:

F_* \mid \Theta, Z \sim \mathcal{N}\bigl(\mu_{\bar{F}_*|\bar{\Theta}}, \Sigma_{\bar{F}_*|\bar{\Theta}}\bigr),
\mu_{\bar{F}_*|\bar{\Theta}} = M(X_*) + K(X_*, X) K_X^{-1} \bigl(\mu_{\bar{F}|\bar{\Theta}} - M\bigr)    (37)
\Sigma_{\bar{F}_*|\bar{\Theta}} = K(X_*, X_*) - K(X_*, X) K_X^{-1} K(X, X_*) + K(X_*, X) K_X^{-1} \Sigma_{\bar{F}|\bar{\Theta}} K_X^{-1} K(X, X_*).

Finally, (33), (34), and (37) in turn imply that the joint posterior distribution p(f, Θ|Z) we seek is given by:

p(f, \Theta \mid Z) = p(f \mid \Theta, Z) \cdot p(\Theta \mid Z),    (38)

where p(Θ|Z) is given by (29) and (31), and

f \mid \Theta, Z \sim \mathcal{GP}\bigl(m_{\bar{f}|\bar{\Theta}}, k_{\bar{f}|\bar{\Theta}}\bigr)
m_{\bar{f}|\bar{\Theta}}(x) = m(x) + k_X(x) K_X^{-1} \bigl(\mu_{\bar{F}} + \Sigma_{\bar{F}\bar{\Theta}} \Sigma_{\bar{\Theta}\bar{\Theta}}^{-1} (\Theta - \mu_{\bar{\Theta}}) - M\bigr)    (39)
k_{\bar{f}|\bar{\Theta}}(x, x') = k(x, x') - k_X(x) K_X^{-1} k_X(x')^{\mathsf{T}} + k_X(x) K_X^{-1} \Sigma_{\bar{F}|\bar{\Theta}} K_X^{-1} k_X(x')^{\mathsf{T}}.

Now it remains only to determine the marginal distribution p(f|Z) (the marginal distribution p(Θ|Z) having already been determined in (29) and (31)). We observe that

p(F_* \mid Z) = \int p(F_* \mid F) \cdot p(F \mid Z) \, dF    (40)

with p(F∗|F) given by (18) and F|Z ∼ N(µ_F̄, Σ_F̄F̄) by (29). A second application of equations (35) then shows that

F_* \mid Z \sim \mathcal{N}\bigl(\mu_{\bar{F}_*}, \Sigma_{\bar{F}_*}\bigr),
\mu_{\bar{F}_*} = M(X_*) + K(X_*, X) K_X^{-1} (\mu_{\bar{F}} - M),    (41)
\Sigma_{\bar{F}_*} = K(X_*, X_*) - K(X_*, X) K_X^{-1} K(X, X_*) + K(X_*, X) K_X^{-1} \Sigma_{\bar{F}\bar{F}} K_X^{-1} K(X, X_*).

Thus, the marginal posterior distribution p(f|Z) is given by:

f \mid Z \sim \mathcal{GP}(m_{\bar{f}}, k_{\bar{f}}),
m_{\bar{f}}(x) = m(x) + k_X(x) K_X^{-1} (\mu_{\bar{F}} - M),    (42)
k_{\bar{f}}(x, x') = k(x, x') - k_X(x) K_X^{-1} k_X(x')^{\mathsf{T}} + k_X(x) K_X^{-1} \Sigma_{\bar{F}\bar{F}} K_X^{-1} k_X(x')^{\mathsf{T}}.

Equations (29) and (31), (38)–(39), and (42) admit the computation of p(f, Θ|Z) and its marginals using a two-stage inference procedure: first we compute the Laplace approximation for (F, Θ)|Z in (29) by applying the method of Section II-A2 to the factor graph determined by the solid edges in Fig. 1, and then we recover p(f, Θ|Z) or p(f|Z) by fusing p(F, Θ|Z) with the conditional distribution for f|F induced by the Gaussian process prior over f using (38)–(39) or (42), respectively.

Finally, we observe that although our algorithmic development has thus far involved only a single function f, parameter Θ, and observation Z, the fact that all of these may be vector-valued implies that this procedure immediately generalizes to incorporate any number of functions f_1, ..., f_{n_f}, parameters Θ_1, ..., Θ_{n_Θ}, and observations Z_1, ..., Z_{n_Z} by simply defining f = (f_1, ..., f_{n_f}), Θ = (Θ_1, ..., Θ_{n_Θ}), and Z = (Z_1, ..., Z_{n_Z}). Any conditional independence relationships that hold amongst these variables can subsequently be exploited in the form of factor graph sparsity (when performing inference over the finite-dimensional parameters F, Θ, and Z) or block sparsity of kernel matrices (when inferring the posterior distributions for f̄ = (f̄_1, ..., f̄_{n_f})). As we will see in the next section, the exploitation of sparsity enables fast inference over networks of the form (27) containing hundreds or thousands of continuous random variables.
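To illustrate the second stage of the two-stage procedure described above, the sketch below (ours: the kernel choice, the zero prior mean, and the stage-one numbers are illustrative assumptions) takes a Laplace approximation N(µ_F̄, Σ_F̄F̄) for F = f(X) — as the factor-graph stage of Section II-A2 would produce — and fuses it with the Gaussian process prior to evaluate the marginal posterior (42) on a grid of test inputs.

```python
import numpy as np

def k(a, b, s2=1.0, l=0.4):
    """Illustrative squared-exponential prior covariance for a scalar f."""
    return s2 * np.exp(-(a[:, None] - b[None, :])**2 / (2 * l**2))

# Stage 1 (assumed given): Laplace approximation N(mu_F, Sigma_FF) for F = f(X),
# e.g. obtained from a factor graph over (F, Theta) as in Section II-A2.
X = np.array([0.0, 0.3, 0.6, 1.0])
mu_F = np.array([0.0, 0.8, 0.9, 0.1])         # posterior mean of F from stage 1
Sigma_FF = np.diag([0.05, 0.02, 0.02, 0.05])  # posterior covariance of F from stage 1

m = lambda a: np.zeros_like(a)                # prior mean m == 0, so M = 0
x_star = np.linspace(0.0, 1.0, 100)

# Stage 2: marginal posterior GP (42).
K_X_inv = np.linalg.inv(k(X, X) + 1e-9 * np.eye(X.size))
k_star = k(x_star, X)                         # k_X(x_*)

m_bar = m(x_star) + k_star @ K_X_inv @ (mu_F - m(X))
k_bar = (k(x_star, x_star)
         - k_star @ K_X_inv @ k_star.T
         + k_star @ K_X_inv @ Sigma_FF @ K_X_inv @ k_star.T)

print(m_bar[:3])
print(np.sqrt(np.diag(k_bar))[:3])
```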

IV. AN EXAMPLE APPLICATION

In this section we demonstrate the application of the inference framework developed in Section III using a toy target-tracking example; specifically, we consider a novel hybrid discrete-/continuous-time formulation of the cooperative localization and target tracking (CLATT) problem. For ease of exposition, in this demonstration we will consider only a single mobile robot and a single target; however, the inference algorithm that we derive immediately generalizes to arbitrary numbers of robots and targets following the argument given at the end of Section III-B.

To that end, we consider a single mobile robot attempting to track a single target, both moving in the plane. We model the robot pose at time t_i as s_i = (x_i^r, y_i^r, θ_i^r), where (x_i^r, y_i^r) ∈ R² is the robot's position in the plane and θ_i^r ∈ (−π, π] its heading angle. We assume that the robot is equipped with a proprioceptive sensor (e.g. an inertial measurement unit) that enables it to estimate its ego-motion Δs_{i,i+1} between two subsequent poses s_i and s_{i+1} according to:

\Delta s_{i,i+1} = \begin{pmatrix} (x_{i+1}^r - x_i^r)\cos(\theta_i^r) + (y_{i+1}^r - y_i^r)\sin(\theta_i^r) \\ (x_{i+1}^r - x_i^r)\sin(\theta_i^r) - (y_{i+1}^r - y_i^r)\cos(\theta_i^r) \\ \theta_{i+1}^r - \theta_i^r \end{pmatrix},    (43)

and that these measurements are subject to mean-zero additive Gaussian noise with a standard deviation of .03 meters in each translational direction and 1 degree in rotation.

To prevent the accumulation of drift in its own state estimate, the robot also estimates the positions of, and relocalizes itself with respect to, any landmarks that it discovers as it moves through its environment (more precisely, it performs smoothing SLAM [1]). For this purpose, we assume that the robot is equipped with a sensor that enables it to measure the relative range and bearing m_{i,j} = (r_{i,j}, θ_{i,j}) from its current pose s_i to a landmark at position l_j = (x_j, y_j):

\Delta x_{i,j} = x_j - x_i^r,
\Delta y_{i,j} = y_j - y_i^r,
r_{i,j} = \sqrt{\Delta x_{i,j}^2 + \Delta y_{i,j}^2},    (44)
\theta_{i,j} = \arctan(\Delta y_{i,j}, \Delta x_{i,j}) - \theta_i^r.
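A minimal sketch of these two measurement models (ours, for illustration: function and variable names are ours, the noise terms are omitted, and the second component of (43) is implemented exactly as written above) is given below; in the factor graph these are the prediction functions whose whitened residuals play the role of the factors p_i(Z_i, Θ_i) in (7).

```python
import numpy as np

def odometry(s_i, s_ip1):
    """Predicted ego-motion (43) between poses s_i = (x, y, theta) and s_{i+1}."""
    dx, dy = s_ip1[0] - s_i[0], s_ip1[1] - s_i[1]
    c, s = np.cos(s_i[2]), np.sin(s_i[2])
    return np.array([dx * c + dy * s,
                     dx * s - dy * c,
                     s_ip1[2] - s_i[2]])

def range_bearing(s_i, l_j):
    """Predicted range and bearing (44) from pose s_i to a landmark l_j = (x_j, y_j)."""
    dx, dy = l_j[0] - s_i[0], l_j[1] - s_i[1]
    return np.array([np.hypot(dx, dy),
                     np.arctan2(dy, dx) - s_i[2]])

# Example usage with made-up poses and a made-up landmark.
s0 = np.array([0.0, 0.0, 0.1])
s1 = np.array([1.0, 0.2, 0.15])
print(odometry(s0, s1))
print(range_bearing(s0, np.array([3.0, 4.0])))
```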


Fig. 2. Tracking a target with a mobile robot. (a): Ground truth for this experiment, showing the robot poses (red circles), landmark locations (magenta asterisks), and the target's continuous-time trajectory (blue curve) and positions at which it was observed (blue circles). Landmark observations are indicated by a dashed green line connecting the landmark and the robot pose from which it was observed. (b): The initial estimate of the discrete-time variables (robot poses, observed target positions, and landmark locations) used to initialize the numerical optimization to compute the MAP estimate as described in Section II-A2. (c): The final MAP estimates for the discrete-time variables and the posterior marginal estimate for the entire target trajectory obtained as described in Section III-B. (d): The globally-registered target tracking errors in x and y, together with the 3σ confidence bounds reported by the inference method.

We assume that the measurements (44) are subject to zero-mean additive Gaussian noise with a standard deviation of .10 meters in range and 1 degree in bearing, and that the sensor has a 180-degree forward-facing field of view and a maximum range of 20 meters. Finally, we assume that the robot is also equipped with a sensor that enables it to measure the relative range and bearing from its current pose to the tracked target according to (44). For the sake of simplicity, in this example we will assume that this sensor is always able to observe the target, independent of its position relative to the robot, and that it is likewise subject to zero-mean additive Gaussian noise with a .10 meter standard deviation in range and a 1 degree standard deviation in bearing.

A discrete-time formulation suffices to model the robot state in this problem because we assume direct access to odometry measurements; these observations form a "backbone" of pose-to-pose constraints that (in combination with the landmark observations) enables smoothing over the robot's trajectory. Unfortunately, direct odometric measurements are generally not available for the target in target-tracking applications. In practice, it is common to replace odometry measurements with constraints derived from an assumed target motion model (e.g. constant-velocity models); however, it is not always clear a priori how to select an appropriate parametric class for such a model (particularly when the target is highly maneuverable or unpredictable), which renders this approach vulnerable to model misspecification.

Bearing these considerations in mind, in this example we propose to model the target position as an unknown continuous-time function f : R → R² with a Gaussian process prior; this avoids the necessity of specifying a particular parametric form for the target motion model (hence also the model misspecification problem) while still enabling smoothing over the target's observed positions by enforcing smoothness in the target's estimated trajectory through an appropriate design of the covariance function k (as described in Section II-B3).

To that end, we suppose that f ∼ GP(0, k_f), where

k_f : \mathbb{R} \times \mathbb{R} \to \mathbb{R}^{2 \times 2}
k_f(t, t') = k_{\mathrm{pp}1,2}\!\left(\frac{|t - t'|}{l}\right) \begin{pmatrix} \sigma_f^2 & 0 \\ 0 & \sigma_f^2 \end{pmatrix},    (45)

and k_{pp1,2} : R → R is the piecewise-polynomial radial basis kernel defined in [8, pg. 88]. This kernel has several desirable properties for this application, including stationarity, compact support on intervals of length l (corresponding to the length of the "sliding window" over which we wish to smooth the target path, and thus the sparsity of the kernel matrix K_X), a maximum (co)variance magnitude of σ_f² (specifying the strength of the applied smoothing), and first-order sample function differentiability (thus enforcing the condition that the target trajectory should be at least first-order differentiable, as we should expect for any physical system obeying Newton's laws). Based on prior knowledge of the target's maneuverability characteristics, in this example we will take σ_f = 5 meters and l = 10 seconds.
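As a sketch of how the prior (45) might be assembled in practice (ours, for illustration: we substitute a standard q = 1 piecewise-polynomial/Wendland form for k_{pp1,2}, which should be checked against the exact definition in [8, pg. 88]; the observation times are made up), consider:

```python
import numpy as np

sigma_f, ell = 5.0, 10.0          # sigma_f = 5 m, l = 10 s, as in the text

def k_pp(r):
    """Compactly supported piecewise-polynomial basis (a standard q = 1 / Wendland
    form, used here as a stand-in for the kpp_{1,2} of [8, pg. 88])."""
    return np.where(r < 1.0, (1.0 - r)**3 * (3.0 * r + 1.0), 0.0)

def k_f(t, t_prime):
    """Matrix-valued target-trajectory kernel (45): kpp(|t - t'|/l) * sigma_f^2 * I_2."""
    return k_pp(np.abs(t - t_prime) / ell) * sigma_f**2 * np.eye(2)

# Block kernel matrix over the (made-up) target observation times; observations more
# than l = 10 s apart contribute exactly-zero blocks, so K_X is block sparse.
times = np.arange(0.0, 130.0, 1.0)
n = times.size
K_X = np.zeros((2 * n, 2 * n))
for i, ti in enumerate(times):
    for j, tj in enumerate(times):
        K_X[2*i:2*i+2, 2*j:2*j+2] = k_f(ti, tj)

print(f"fraction of nonzero entries in K_X: {np.count_nonzero(K_X) / K_X.size:.3%}")
```

With l = 10 s and observations spread over 130 s, the off-diagonal blocks for time pairs more than 10 s apart vanish exactly, which is precisely the block sparsity of K_X exploited in Section II-B3.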

The experimental setup is shown in Fig. 2(a). The simulated environment is an 80 × 60 meter grid containing 45 randomly distributed landmarks. The robot traverses a single counterclockwise loop of radius 25 meters through this environment at a constant speed of 1.2 m/s, while the tracked target follows a Lissajous curve at speeds between 1.7 and 6.0 m/s (with an average speed of 4.1 m/s). The robot measures the target position, its own ego-motion relative to its prior pose, and any nearby landmarks once every second, for a duration of 130 seconds. The entire simulation thus comprises 45 landmarks (with 745 observations), 131 robot poses (with 130 odometry measurements connecting them), and 131 target positions (each with one measurement).

This raw data was batch-processed using the two-stage procedure outlined in Section III-B. First, an initial estimate for the discrete-time variables (the robot poses, landmark positions, and measured target positions) was obtained by integrating the robot odometry measurements and initializing the landmarks and target positions relative to the robot pose from which they were first observed (Fig. 2(b)). This estimate was then used as the initialization for an iterative numerical optimization to compute the MAP estimate and the Laplace approximation to the posterior distribution of the discrete-time variables as outlined in Section II-A2. Optimization was performed using MATLAB's lsqnonlin with the trust-region-reflective method (requiring 11 iterations and 3.37 seconds); the resulting MAP estimate Θ̂_MAP and the Hessian approximation H(Θ̂_MAP) ≈ 2 J(Θ̂_MAP)^T J(Θ̂_MAP) were then used to approximate the posterior distribution according to (10). Finally, this approximate posterior distribution for the discrete-time variables was used to compute the marginal posterior distribution for the unknown target path f(t) using (42). The final estimates are shown in Fig. 2(c).

To evaluate the performance of this method, we computed the least-squares-optimal registration between the robot's coordinate system and the global coordinate system based upon aligning the landmark position estimates (as would be done in practice when localizing the robot with respect to the global reference frame), and then computed the target tracking error as the difference between the robot's globally registered estimate and the ground truth; results are shown in Fig. 2(d). In this experiment we observed median tracking errors of −.018 and −.016 meters in x and y, respectively, and a total RMS error of .44 meters. Comparison of Figs. 2(a) and 2(d) reveals that most of this error arises when tracking the target through the aggressive turns at the corners of the environment; furthermore, while the tracking error may be somewhat high in these regions, the posterior 3σ confidence bounds produced by the inference method correctly capture the greater uncertainty in these sections of the estimate, thereby preserving consistency. This demonstrates the feasibility of the proposed nonparametric hybrid discrete-/continuous-time approach to the CLATT problem.

V. CONCLUSION

In this paper we developed a probabilistic generative model and an associated set of inference algorithms for reasoning over general heterogeneous finite-/infinite-dimensional systems. Our approach generalizes prior work by relaxing the requirement that a specific set of conditional independences amongst the modeled states must hold, yet is nevertheless still able to opportunistically exploit whatever conditional independences do obtain in a particular application to achieve fast computation. Through this approach we thus obtain a flexible and computationally efficient framework for inference over heterogeneous finite-/infinite-dimensional systems in the general case.

In closing, we remark that while the inference framework developed herein has been formulated in the batch setting, we believe it should be possible to adapt this approach to the online case by applying incremental methods (e.g. [5]–[7]) to efficiently solve the sequences of individual factor graph and Gaussian process subproblems. We intend to investigate this possibility in future research.

ACKNOWLEDGEMENTS

This work was partially supported by the Office of Naval Research under grants N00014-10-1-0936, N00014-11-1-0688 and N00014-13-1-0588 and by the National Science Foundation under grant IIS-1318392, which we gratefully acknowledge.

REFERENCES

[1] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. Cambridge, MA: The MIT Press, 2008.
[2] F. Kschischang, B. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 498–519, Feb. 2001.
[3] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. Cambridge, MA: The MIT Press, 2009.
[4] M. Kaess, V. Ila, R. Roberts, and F. Dellaert, "The Bayes tree: An algorithmic foundation for probabilistic robot mapping," in Intl. Workshop on the Algorithmic Foundations of Robotics (WAFR), Singapore, Dec. 2010.
[5] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. Leonard, and F. Dellaert, "iSAM2: Incremental smoothing and mapping using the Bayes tree," Intl. J. of Robotics Research, vol. 31, no. 2, pp. 216–235, Feb. 2012.
[6] D. Rosen, M. Kaess, and J. Leonard, "An incremental trust-region method for robust online sparse least-squares estimation," in IEEE Intl. Conf. on Robotics and Automation (ICRA), St. Paul, MN, May 2012, pp. 1262–1269.
[7] ——, "Robust incremental online inference over sparse factor graphs: Beyond the Gaussian case," in IEEE Intl. Conf. on Robotics and Automation (ICRA), Karlsruhe, Germany, May 2013, pp. 1017–1024.
[8] C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning. Cambridge, MA: The MIT Press, 2006.
[9] J. Ko and D. Fox, "GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models," Autonomous Robots, vol. 27, no. 1, pp. 75–90, Jul. 2009.
[10] M. Deisenroth, R. Turner, M. Huber, U. Hanebeck, and C. Rasmussen, "Robust filtering and smoothing with Gaussian processes," IEEE Trans. on Automatic Control, vol. 57, no. 7, pp. 1865–1871, Jul. 2012.
[11] C. Tong, P. Furgale, and T. Barfoot, "Gaussian process Gauss-Newton for non-parametric simultaneous localization and mapping," Intl. J. of Robotics Research, vol. 32, no. 5, pp. 507–525, May 2013.
[12] P. Matstoms, "Sparse QR factorization in MATLAB," ACM Trans. Math. Softw., vol. 20, no. 1, pp. 136–159, Mar. 1994.
[13] T. Davis, J. Gilbert, S. Larimore, and E. Ng, "A column approximate minimum degree ordering algorithm," ACM Trans. Math. Softw., vol. 30, no. 3, pp. 353–376, Sep. 2004.
[14] Y. Chen, T. Davis, W. Hager, and S. Rajamanickam, "Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate," ACM Trans. Math. Softw., vol. 35, no. 3, pp. 22:1–22:14, Oct. 2008.
[15] J. Nocedal and S. Wright, Numerical Optimization, 2nd ed. New York: Springer Science+Business Media, 2006.
[16] C. Bishop, Pattern Recognition and Machine Learning. New York: Springer Science+Business Media, 2006.
[17] M. Álvarez, L. Rosasco, and N. Lawrence, "Kernels for vector-valued functions: A review," Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Tech. Rep. MIT-CSAIL-TR-2011-033, CBCL-301, Jun. 2011.
[18] R. Adler, The Geometry of Random Fields. Philadelphia: The Society for Industrial and Applied Mathematics (SIAM), 2010.
[19] C. Paciorek, "Nonstationary Gaussian processes for regression and spatial modeling," Ph.D. dissertation, Carnegie Mellon University, 2003.
[20] E. Solak, R. Murray-Smith, W. Leithead, D. Leith, and C. Rasmussen, "Derivative observations in Gaussian process models of dynamic systems," in Advances in Neural Information Processing Systems (NIPS), Vancouver, Canada, 2003, pp. 1033–1040.
[21] B. Schölkopf and A. Smola, Learning with Kernels. Cambridge, MA: The MIT Press, 2002.
