Neural Gaussian Conditional Random Fields

Vladan Radosavljevic¹⋆, Slobodan Vucetic², and Zoran Obradovic²

¹ Yahoo Labs, Sunnyvale, CA, USA
[email protected]
² Temple University, Philadelphia, PA, USA
{vucetic,zoran.obradovic}@temple.edu

⋆ This study was conducted while the author was a postdoctoral associate at Temple University.
Abstract. We propose a Conditional Random Field (CRF) model for structured regression. By constraining the feature functions to be quadratic functions of outputs, the model can be conveniently represented in a Gaussian canonical form. We improve the representational power of the resulting Gaussian CRF (GCRF) model by (1) introducing an adaptive feature function that can learn nonlinear relationships between inputs and outputs and (2) allowing the weights of feature functions to depend on inputs. Since both the adaptive feature functions and the weights can be constructed using feedforward neural networks, we call the resulting model Neural GCRF. The appeal of Neural GCRF lies in its conceptual simplicity and in the computational efficiency of learning and inference through sparse matrix computations. Experimental evaluation on the remote sensing problem of aerosol estimation from satellite measurements and on a document retrieval problem showed that Neural GCRF is more accurate than the benchmark predictors.

Keywords: Gaussian conditional random fields, neural networks, graphical models
1 Introduction
Learning from structured data is a frequently encountered problem in geoscience [1, 2], computer vision [3, 4], bioinformatics [5, 6], and other areas where examples exhibit sequential [7, 8], temporal [9, 10], spatial [11], spatio-temporal [12, 13], or some other dependencies. In such cases, traditional unstructured supervised learning approaches can result in a weak model with low prediction accuracy [14]. Structured learning methods try to solve this problem by learning to simultaneously predict all outputs given all inputs. Structured approaches can exploit correlations among output variables, which often results in accuracy improvements over unstructured approaches that predict independently for each example. The benefits of structured learning grow with the strength of dependency between the examples and with the data size. In structured learning there is usually some prior knowledge about relationships among the outputs. Those relationships are application-specific and, very
often, they can be modeled by graphical models. The advantage of graphical models is that one can make use of sparseness in the interactions between outputs and develop efficient learning and inference algorithms.

In learning from structured data, Markov Random Fields [2] and Conditional Random Fields (CRF) [7] are among the most popular models. Originally, CRFs were designed for classification of sequential data [7] and have found many applications in areas such as computer vision [3] and computational biology [6]. Using CRFs for regression is a less explored topic. Continuous Conditional Random Fields (CCRF) [8] is a ranking model that takes into account relationships among the ranks of objects in document retrieval; with minor modifications, it can be used for structured regression problems. The Conditional State Space Model (CSSM) [15], an extension of the CRF to a domain with continuous multivariate outputs, was proposed for regression on sequential data. CSSM is an undirected model that makes no independence assumptions between outputs, which results in a more flexible framework. In [4] a conditional distribution of pixels given a noisy input image is modeled using weighted quadratic factors obtained by convolving the image with a set of filters; the feature functions in [4] are specifically designed for image de-noising problems and are not readily applicable to regression in general. The Gaussian CRF for structured regression problems, with feature functions constrained to a quadratic form, was introduced in [1]. The Sparse GCRF [10] is a variant of the GCRF model that incorporates l1 regularization in the optimization function, thus enforcing sparsity in the GCRF parameters.

GCRF has recently been successfully utilized in a variety of real-world applications. In computational advertising, GCRF significantly improved the accuracy of click-through rate estimation by taking into account relationships among advertisements [11]. An extension of GCRF to the non-Gaussian case using the copula transform was used in forecasting wind power [16]. In combination with decision trees, GCRF was successfully applied to short-term energy load forecasting [17], while in combination with support vector machines it was applied to automatic recognition of emotions from audio and visual features [18]. A tractable fully connected GCRF, which captures both long-range and short-range dependencies, was developed in [19] and successfully applied to image de-noising and geoscience problems.

To improve the expressive power of GCRF, we propose a Neural GCRF (NGCRF) regression model, of which CCRF and GCRF can be considered special cases. In addition to using existing unstructured predictors, the proposed NGCRF allows training additional unstructured predictors simultaneously with the other NGCRF parameters. This idea is motivated by the Conditional Neural Fields (CNF) [20, 5] proposed for classification problems to facilitate modeling of complex relationships between inputs and outputs. Moreover, the weights of NGCRF feature functions are themselves allowed to be nonlinear functions of the inputs. In this way, NGCRF is able to capture non-homogeneous relationships among outputs and to account for differing uncertainties in the unstructured predictors. We will show that learning and inference in NGCRF can be conducted efficiently through sparse matrix computations.
2 Gaussian conditional random fields
Let us denote by x = (x_1, ..., x_M) an M-dimensional vector of observations and by y = (y_1, ..., y_N) an N-dimensional vector of real-valued output variables. The objective is to learn a non-linear mapping f : R^M → R^N that predicts the vector of output variables y as accurately as possible given all inputs x. A CRF models a conditional distribution P(y|x), according to the associated graphical structure,

P(y|x) = \frac{1}{Z(\alpha, \beta, x)} e^{\phi(\alpha, \beta, y, x)},   (1)

with energy function

\phi(\alpha, \beta, y, x) = \sum_{i=1}^{N} A(\alpha, y_i, x) + \sum_{j \sim i} I(\beta, y_i, y_j, x),   (2)

where A(α, y_i, x) is the association potential with parameters α, I(β, y_i, y_j, x) is the interaction potential with parameters β, i ∼ j denotes that y_i and y_j are connected by an edge in the graph structure, and the normalization function Z(α, β, x) is defined as

Z(\alpha, \beta, x) = \int_{y} e^{\phi(\alpha, \beta, y, x)} \, dy.   (3)
The output y_i is associated with the vector of observations x = (x_1, ..., x_M) by a real-valued function called the association potential A(α, y_i, x), where α is a K-dimensional set of parameters. In general, A takes as input any appropriate combination of attributes from the vector of observations x. To model interactions among outputs, a real-valued function called the interaction potential I(β, y_i, y_j, x) is used, where β is an L-dimensional set of parameters. The interaction potential represents the relationship between two outputs and in general can depend on the inputs x. Different applications can impose different interaction potentials. The larger the value of the interaction potential, the more related the two outputs are. In CRF applications, A and I can be conveniently defined as linear combinations of a set of fixed feature functions in terms of α and β, as in [7]:

A(\alpha, y_i, x) = \sum_{k=1}^{K} \alpha_k f_k(y_i, x), \quad
I(\beta, y_i, y_j, x) = \sum_{l=1}^{L} \beta_l g_l(y_i, y_j, x).   (4)
The use of feature functions is convenient because it allows us to model arbitrary relationships between inputs and outputs. In this way, any potentially relevant
feature function can be included in the model and the learning algorithm can automatically determine its relevance. Models with real-valued outputs pose quite different challenges with respect to feature function complexity than discrete-valued models. Discrete-valued models are always feasible, because Z is finite, being a sum over finitely many possible values of y. On the contrary, for a model with real-valued outputs to be feasible, Z must be integrable. Proving that Z is integrable might in general be difficult due to the complexity of the association and interaction potentials.

2.1 Feature functions
Construction of appropriate feature functions in a CRF is a manual process that depends on the prior beliefs of a practitioner about which features could be useful. The choice of features is often constrained to simple constructs to reduce the complexity of learning and inference in the CRF. If A and I are defined as quadratic functions of y, P(y|x) becomes a multivariate Gaussian distribution, so that learning and inference can be accomplished in a computationally efficient manner. In the following, we describe the feature functions that lead to the Gaussian CRF. Let us assume we are given K unbiased unstructured predictors, R_k(x), k = 1, ..., K, each of which predicts a single output y_i taking into account x (in a special case, only the corresponding x_i can be used as x). To model the dependency between the predictions and the output, we use the quadratic feature functions

f_k(y_i, x) = -(y_i - R_k(x))^2, \quad k = 1, \ldots, K.   (5)
These feature functions follow the basic principle for association potentials in that their values are large when predictions and outputs are similar. To model the correlation among outputs, we use the quadratic feature function

g_l(y_i, y_j, x) = -e_l(i, j, x)(y_i - y_j)^2, \quad
e_l(i, j, x) = \begin{cases} w_l(i, j, x), & (i, j) \in G_l \\ 0, & (i, j) \notin G_l, \end{cases}   (6)
which imposes that outputs y_i and y_j have similar values if they are connected by an edge in the graph G_l. Here, w_l(i, j, x) represents the weight of an edge (i, j) in graph G_l. It should be noted that using multiple graphs G_l can facilitate modeling of different aspects of correlation between outputs (for example, spatial and temporal).

2.2 Multivariate Gaussian model
Conditional distribution P (y|x) for the CRF model in Eq. (1), which uses quadratic feature functions defined in the previous section, can be represented
as a multivariate Gaussian distribution. The resulting energy function of the GCRF model can be written as

\phi = -\sum_{i=1}^{N} \sum_{k=1}^{K} \alpha_k (y_i - R_k(x))^2 - \sum_{i,j} \sum_{l=1}^{L} \beta_l e_l(i, j, x)(y_i - y_j)^2.   (7)
The energy function is a quadratic function in terms of y. Therefore, P(y|x) can be transformed to a Gaussian form by representing φ as

\phi = -\frac{1}{2} (y - \mu)^T \Sigma^{-1} (y - \mu).   (8)
To transform P(y|x) to the Gaussian form, we determine Σ and µ by matching Eq. (7) and (8):

\Sigma^{-1}_{i,j} = 2 \begin{cases} \sum_{k=1}^{K} \alpha_k + \sum_{n=1, n \neq j}^{N} \sum_{l} \beta_l e_l(i, n, x), & i = j \\ -\sum_{l} \beta_l e_l(i, j, x), & i \neq j, \end{cases}   (9)

\mu = \Sigma b,   (10)

where b is a vector with elements

b_i = 2 \sum_{k=1}^{K} \alpha_k R_k(x).   (11)
If we calculate Z using the transformed exponent, we obtain

P(y|x) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} e^{-\frac{1}{2}(y - \mu)^T \Sigma^{-1} (y - \mu)}.   (12)
Therefore, the resulting conditional distribution is Gaussian with mean µ and covariance Σ. We observe that Σ is a function of the parameters α and β and the interaction potential graphs G_l, while µ is also a function of the inputs x. The resulting CRF is the Gaussian CRF (GCRF). In order for the model to be feasible, the conditional distribution has to be well defined. This means that we have to ensure that the precision matrix Σ^{-1} is positive semi-definite [1], which we will address in the following sections.
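To make the construction above concrete, the following is a minimal sketch (our own illustration, not from the paper; function and variable names are assumptions) of how Σ^{-1}, b, and µ from Eq. (9)-(11) could be assembled with sparse matrices, assuming a single interaction graph (L = 1) with a scalar weight β:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import spsolve

def gcrf_mean(R, alpha, beta, edges, edge_w):
    """Sketch of GCRF inference: build the precision matrix of Eq. (9)
    and the vector b of Eq. (11), then return mu = Sigma b of Eq. (10).

    R      : (N, K) array, R[i, k] is the prediction R_k(x) for output i
    alpha  : (K,) array of positive association weights
    beta   : positive scalar interaction weight (single graph, L = 1)
    edges  : iterable of (i, j) index pairs connected in the graph
    edge_w : matching iterable of edge weights w(i, j, x)
    """
    N, K = R.shape
    diag = np.full(N, 2.0 * alpha.sum())      # 2 * sum_k alpha_k
    rows, cols, vals = [], [], []
    for (i, j), w in zip(edges, edge_w):
        diag[i] += 2.0 * beta * w             # diagonal terms of Eq. (9)
        diag[j] += 2.0 * beta * w
        rows += [i, j]; cols += [j, i]
        vals += [-2.0 * beta * w] * 2         # off-diagonal terms of Eq. (9)
    rows += range(N); cols += range(N); vals += diag.tolist()
    Q = csr_matrix((vals, (rows, cols)), shape=(N, N))   # Sigma^{-1}
    b = 2.0 * R @ alpha                       # Eq. (11)
    return spsolve(Q, b)                      # mu = Sigma b, sparse solve
```

Because Σ^{-1} is sparse whenever the neighborhood structure is sparse, the solve avoids the explicit matrix inverse, which is the efficiency argument made in Section 3.3.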
3 Neural Gaussian CRF

In this section we propose the new Neural Gaussian CRF model, which enhances GCRF and increases its representational power.

3.1 Neural GCRF Model
First, motivated by the recently proposed Conditional Neural Fields [20, 5], we introduce the adaptive feature function defined as
f_a(y_i, x) = -(y_i - R_a(w, x))^2,   (13)
where R_a(w, x) is a function of weights w that can be trained simultaneously with the other GCRF parameters. In this way, R_a(w, x) can be trained directly with the goal of maximizing the log-likelihood, such that it complements the existing predictors R_k. In this paper, we assume that the predictor R_a(w, x) is a feedforward neural network.

Second, as defined in Eq. (4), the Gaussian CRF assigns weights α and β to the feature functions. Considering that the feature functions for the association potential are defined as squared errors of unstructured predictors, the role of the weights α and β is to measure their prediction uncertainty. Since it is likely that the quality of different predictors changes with x, we enhance GCRF such that the parameters α_k and β_l are replaced with the uncertainty functions α_k(θ_k, x) and β_l(ψ_l, x), where θ_k and ψ_l are the parameters. We allow feedforward neural networks to be used for the uncertainty functions. Using the adaptive feature and uncertainty functions, we have

A(\theta, y_i, x) = -\sum_{k=1}^{K} \alpha_k(\theta_k, x)(y_i - R_k(x))^2 - \alpha_a(\theta_a, x)(y_i - R_a(w, x))^2,
I(\psi, y_i, y_j, x) = -\sum_{l=1}^{L} \beta_l(\psi_l, x)(y_i - y_j)^2.   (14)

In this way, α_k(θ_k, x) models the varying degree of importance of predictor R_k over different conditions. Similarly, β_l(ψ_l, x) models the varying importance of the correlation between outputs. As a result, Σ from Eq. (9) becomes dependent on the inputs, thus allowing for error heteroscedasticity. The conditional distribution of the enhanced GCRF is Gaussian, as in Eq. (12). Since both the adaptive feature and the uncertainty functions are assumed to be feedforward neural networks, we call the resulting model the Neural GCRF (NGCRF).

Let us analyze the feasibility condition for the NGCRF model. In order for the model to be feasible, the precision matrix Σ^{-1} has to be positive semi-definite. A common approach used in practice [21] is to enforce the sufficient condition given by Gershgorin's circle theorem [22], which says that a symmetric matrix is positive definite if all diagonal elements are non-negative and the matrix is diagonally dominant.

Definition 1. A square matrix Σ^{-1} is diagonally dominant if the absolute value of each diagonal element is greater than the sum of the absolute values of the non-diagonal elements in the corresponding row: |\Sigma^{-1}_{i,i}| > \sum_{j \neq i} |\Sigma^{-1}_{i,j}|, \forall i.

Theorem 1. If the values of the functions α and β in Eq. (14) are always greater than 0, then the precision matrix Σ^{-1} that corresponds to the NGCRF model defined by the association and interaction potentials in Eq. (14) is diagonally dominant and hence positive definite.
Proof. For each i, the absolute value of the diagonal element Σ^{-1}_{i,i} of the precision matrix Σ^{-1} can be represented as

|\Sigma^{-1}_{i,i}| = \left| \sum_{k=1}^{K} \alpha_k(\theta_k, x) + \sum_{j \neq i} \sum_{l=1}^{L} \beta_l(\psi_l, x) \right| = \sum_{k=1}^{K} \alpha_k(\theta_k, x) + \sum_{j \neq i} \sum_{l=1}^{L} \beta_l(\psi_l, x),   (15)

where we use the fact that the values of α and β are always greater than 0. Similarly, the absolute value of each off-diagonal element Σ^{-1}_{i,j} equals

|\Sigma^{-1}_{i,j}| = \left| \sum_{l=1}^{L} \beta_l(\psi_l, x) \right| = \sum_{l=1}^{L} \beta_l(\psi_l, x).   (16)

Then, for each i we have

|\Sigma^{-1}_{i,i}| - \sum_{j \neq i} |\Sigma^{-1}_{i,j}| = \sum_{k=1}^{K} \alpha_k(\theta_k, x) > 0,   (17)

which proves the theorem. ∎
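As a practical sanity check, the sufficient condition of Theorem 1 can also be verified numerically; the sketch below (our own illustration, not part of the paper) tests positive diagonal entries plus strict diagonal dominance on a dense precision matrix:

```python
import numpy as np

def is_feasible_precision(Q):
    """Check the sufficient condition of Theorem 1 (via Gershgorin's circle
    theorem): all diagonal entries positive and the matrix strictly
    diagonally dominant, which together imply positive definiteness."""
    d = Q.diagonal()
    off_row_sums = np.abs(Q).sum(axis=1) - np.abs(d)
    return bool(np.all(d > 0) and np.all(np.abs(d) > off_row_sums))
```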
Therefore, one way to ensure that the NGCRF model is feasible is to impose the constraints α > 0 and β > 0, which is analytically tractable [8, 1] but known to be conservative [21]. To analyze the effect of constraining α > 0, we assume that the interaction potential is not used (the output variables are assumed to be conditionally independent). The prediction for each y_i then becomes a weighted average of the unstructured predictors, where the weights are positive values summing to 1. This constrains the range of outputs to y_i ∈ [min_k(R_k(x)), max_k(R_k(x))], which has negligible effect on NGCRF since we assumed that the unstructured predictors are unbiased. In [21] it was empirically verified that the constraint β > 0 reduces the parameter search space more and more as sparsity decreases and the number of parameters in the β functions increases. This leads to limited improvements when using NGCRF with the constraint β > 0 on denser graphs.

3.2 Learning and Inference of NGCRF
Learning. The learning task is to choose values of the parameters θ, ψ, and w that maximize the conditional log-likelihood on the set of training examples D = {(x_t, y_t), t = 1, ..., T}:

(\hat{\theta}, \hat{\psi}, \hat{w}) = \operatorname*{argmax}_{\theta, \psi, w} L(\theta, \psi, w), \quad \text{where } L(\theta, \psi, w) = \sum_{t=1}^{T} \log P(y_t | x_t).   (18)
By requiring α and β to be greater than 0, learning becomes a constrained optimization problem. To convert it to an unconstrained one, we adopt a technique used in [8, 1] that applies an exponential transformation of the functions to guarantee that their values are positive. We apply an exponential transformation to α and β:

\alpha_k = e^{u_k(\theta_k, x)}, \text{ for } k = 1, \ldots, K, \quad \alpha_a = e^{u_a(\theta_a, x)}, \quad \beta_l = e^{v_l(\psi_l, x)}, \text{ for } l = 1, \ldots, L,   (19)

where u_k and v_l are differentiable functions with respect to the parameters θ_k and ψ_l. All parameters are learned by gradient-based optimization. To apply a gradient-based method, we need the gradient of the conditional log-likelihood. The derivatives of L with respect to θ, ψ, and w are

\frac{\partial L}{\partial \theta_k} = \frac{\partial L}{\partial \alpha_k} \frac{\partial \alpha_k}{\partial u_k} \frac{\partial u_k}{\partial \theta_k}, \quad
\frac{\partial L}{\partial \psi_l} = \frac{\partial L}{\partial \beta_l} \frac{\partial \beta_l}{\partial v_l} \frac{\partial v_l}{\partial \psi_l}, \quad
\frac{\partial L}{\partial w} = \frac{\partial L}{\partial R_a} \frac{\partial R_a}{\partial w}.   (20)

The gradient of L with respect to θ and ψ has three components. The first components are ∂L/∂α_k and ∂L/∂β_l. The expression for ∂L/∂α_k is

\frac{\partial L}{\partial \alpha_k} = -\frac{1}{2} (y - \mu)^T \frac{\partial \Sigma^{-1}}{\partial \alpha_k} (y - \mu) + \left( \frac{\partial b^T}{\partial \alpha_k} - \mu^T \frac{\partial \Sigma^{-1}}{\partial \alpha_k} \right)(y - \mu) + \frac{1}{2} \operatorname{Tr}\!\left( \Sigma \frac{\partial \Sigma^{-1}}{\partial \alpha_k} \right).   (21)

To calculate ∂L/∂β_l, we use ∂b/∂β_l = 0 and obtain

\frac{\partial L}{\partial \beta_l} = -\frac{1}{2} (y + \mu)^T \frac{\partial \Sigma^{-1}}{\partial \beta_l} (y - \mu) + \frac{1}{2} \operatorname{Tr}\!\left( \Sigma \frac{\partial \Sigma^{-1}}{\partial \beta_l} \right).   (22)

From Eq. (19), the second components are ∂α_k/∂u_k = α_k and ∂β_l/∂v_l = β_l. The third components depend on the chosen functions u_k and v_l. The gradient of L with respect to w depends on the functional form of R_a. Since Σ^{-1} does not depend on R_a, ∂L/∂R_a becomes

\frac{\partial L}{\partial R_a} = 2 \alpha_a^T (y - \mu).   (23)
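As an illustration only (names are ours, not the paper's), the α-gradient of Eq. (21) and the chain rule of Eq. (20) might be coded as follows, assuming dense NumPy arrays for the derivative matrices:

```python
import numpy as np

def dL_dalpha_k(y, mu, Sigma, dQ, db):
    """Eq. (21): gradient of the log-likelihood w.r.t. alpha_k.
    dQ is d(Sigma^{-1})/d(alpha_k); db is d(b)/d(alpha_k)."""
    r = y - mu
    return (-0.5 * r @ dQ @ r
            + (db - dQ @ mu) @ r
            + 0.5 * np.trace(Sigma @ dQ))

# Chain rule of Eq. (20) under the exponential transform of Eq. (19):
# since alpha_k = exp(u_k), d(alpha_k)/d(u_k) = alpha_k, so
# dL/dtheta_k = dL_dalpha_k(...) * alpha_k * du_k/dtheta_k.
```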
We observe that the update for the adaptive model R_a is proportional to the difference between the true output and the mean of the NGCRF model. This means that R_a will be updated only if NGCRF is not able to predict the output correctly, and that R_a will be updated more aggressively when the error is larger. This justifies our hypothesis that R_a will work as a complement to the existing non-structured models. To ensure convergence, the iterative procedure presented in Algorithm 1 [23, 20] is used for learning the model parameters according to the update formulas derived earlier in this section. To avoid overfitting, which is a common problem in maximum likelihood optimization, we add regularization terms for α, θ, β, and ψ to the log-likelihood. In this way, we penalize large outputs of α and β as well as large weights θ and ψ.

Algorithm 1. Learning of NGCRF Parameters
Input: x, R_k(x), y.
1. Initialize θ_k, ψ_l.
2. Estimate θ_k, ψ_l by applying a gradient-based approach with Eq. (21) and (22), without taking R_a into account.
3. Initialize θ_a.
4. Learn the predictor R_a using Eq. (23).
5. repeat
     Apply gradient-based optimization to estimate all parameters.
   until convergence

Inference. The inference task is to find the outputs y for a given set of observations x and estimated parameters α̂ and β̂ such that the conditional probability P(y|x) is maximized. The NGCRF model is Gaussian and, therefore, the maximum a posteriori estimate of y is obtained as the expected value µ of the NGCRF distribution:

\hat{y} = \operatorname*{argmax}_{y} P(y|x) = \mu = \Sigma b,   (24)
while Σ is a measure of uncertainty of the point estimate.

3.3 Complexity
If the size of the training set is N and the learning takes I iterations, straightforward matrix computation results in O(IN³) time to train the model. The main cost is matrix inversion, since during the gradient-based optimization we need to find Σ as the inverse of Σ^{-1}. However, this is the worst-case performance. Since the matrix Σ^{-1} is typically very sparse (depending on the imposed neighborhood structure), the training time can be decreased to O(IN²) by using sparse matrix machinery, or even to O(IN) if we do not consider the interaction potential [21]. During inference, we need to compute µ, which takes O(N) time. As we eventually need to calculate the trace of the matrix, only the elements that correspond to the main diagonal need to be stored. Therefore, memory requirements depend only on the imposed neighborhood structure.
4 Experiments
To demonstrate the strength of the NGCRF model, we applied it to two real-world structured regression applications. The experimental results indicate that NGCRF improves prediction accuracy by efficiently utilizing information from structured data.
4.1 The NGCRF Model for Document Retrieval
In this application the objective is to retrieve the most relevant documents with respect to a given query. In order to make a comparison to the GCRF method, we replicated the experimental setup from [8]. We obtained query-document data from the OHSUMED dataset in LETOR [24], which is a standard data source used in document retrieval research (the same dataset was used in [8]). The OHSUMED dataset contains search queries, where each query is associated with a number of relevant documents. There are 106 queries, 348,566 documents, and a total of 16,140 query-"relevant document" pairs. From the NGCRF perspective, each query-"set of relevant documents" pair represents an example (x, y). Each component of y represents the relevance of the corresponding document to a query, while x contains extracted features. The features x were used to construct K = 25 unstructured predictors R_k(x) that predict document relevance for a given query. The outputs of the unstructured predictors are available in OHSUMED (more details are in [24]). OHSUMED considers three levels of relevance: highly, partially, and not relevant (each component of y can take the value 2, 1, or 0, respectively). In addition, OHSUMED contains information about the similarity between documents i and j, w(i, j, x), which was determined based on the similarity of their contents. With this setup, the goal is to estimate the relevance of each document in the database for a given query.
Benchmark methods. As benchmark methods we use the following (all parameters were set using a small validation set):

Unstructured retrieval by neural network (NN). We trained an NN with five hidden units to predict the relevance of documents for a given query. The inputs to the NN were the outputs of the unstructured predictors.

Structured retrieval by baseline GCRF. We trained GCRF to predict the relevance of documents. As unstructured predictors we used R_k, which are readily available in OHSUMED. GCRF also utilized the relationships among documents by incorporating the weights w(i, j, x) from OHSUMED into the interaction potential.

Structured retrieval by GCRF+NN. We trained a GCRF model using the unstructured predictors R_k from OHSUMED and a pre-trained NN. We call this model GCRF+NN.

RankSVM. A state-of-the-art retrieval method [25], whose predictions are available as part of OHSUMED.
[Fig. 1. Comparison of retrieval performance in terms of precision when top-n documents are retrieved (n = 1, ..., 5); curves shown for NGCRF, RankSVM, GCRF+NN, NN, and baseline GCRF.]
The NGCRF model. We trained the NGCRF model where the unstructured predictors were R_k, α was a function of the unstructured predictors, β was a function of the similarity between documents, and the adaptive NN was a function of R_k.

Evaluation. In our experiments, for each method we averaged the results over the 5-fold cross-validation data sets provided in OHSUMED. As an evaluation measure, we used precision@n, which represents the percentage of relevant publications among the top-n publications retrieved (n = 1, ..., 5 in our experiments). To fetch the top-n relevant publications we retrieved the publications corresponding to the n largest predictions. In Figure 1 we see that NN and GCRF+NN outperform the baseline GCRF, which can be explained by the ability of the NN to capture nonlinearity in the feature space. Furthermore, if we allow the NN to be adaptive, we see that NGCRF outperforms all other alternatives. NGCRF is also comparable to the state-of-the-art retrieval method RankSVM, which is specifically designed for ranking problems (while NGCRF has general applicability) and which also used R_k and w(i, j, x) as its inputs.
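A minimal sketch of the precision@n computation described above (our own illustration; we assume a relevance label greater than 0 counts a document as relevant, which is not stated explicitly in the paper):

```python
import numpy as np

def precision_at_n(scores, relevance, n):
    """precision@n: fraction of relevant documents among the n documents
    with the largest predicted relevance scores."""
    top = np.argsort(scores)[::-1][:n]          # indices of the n largest predictions
    return float(np.mean(relevance[top] > 0))   # labels 1 and 2 count as relevant
```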
4.2 The NGCRF Model for AOD Prediction
We evaluated the proposed Neural GCRF model on a high impact regression problem from remote sensing, namely, prediction of aerosol optical depth (AOD) in the atmosphere from satellite measurements. AOD is a measure of aerosol light extinction integrated vertically through the atmosphere. AOD prediction is important because one of the main challenges of today’s climate research is to characterize and quantify the effect of aerosols on Earth’s radiation budget [26].
We considered data from MODIS, an instrument aboard NASA's Terra satellite [27]. We used ground-based data obtained from AERONET [28], which is a global remote sensing network of radiometers that measure AOD several times per hour at specific geographic locations. The data can be obtained from the official MODIS website of NASA [29]. We extracted satellite-based attributes that are used as inputs to domain-based deterministic prediction algorithms [27]. In addition, we extracted information about the location of each data point (longitude and latitude) and a quality-of-observation (QA) index assigned to each point by domain scientists. The data quality index is provided at four levels, from the lowest quality QA = 0 to the highest quality QA = 3. We collected 28,374 data points distributed over the entire globe at 217 AERONET sites during the years 2005 and 2006.

Benchmark methods. Here we list the benchmark methods that we compared NGCRF to.

Deterministic prediction algorithm C005. The primary benchmark for comparison with our CRF predictors was the most recent version of the MODIS operational algorithm, called C005 [27]. This is a deterministic algorithm that retrieves AOD from MODIS observations relying on domain knowledge. It is based on the inversion of physical forward models developed by domain scientists.

Statistical prediction by a neural network. As a baseline statistical algorithm we used a neural network model trained to predict AERONET AOD from all MODIS attributes, excluding location and the quality flag. It has been shown previously that neural networks achieve accuracy comparable to C005 on the AOD prediction problem [30]. The neural network has a hidden layer with 10 nodes and an output layer with one node. In nested 5-fold cross-validation experiments we trained 5 neural networks. When testing on the 2006 data, we used a single network trained on the entire training set.

Structured prediction by GCRF. The aerosol data are characterized by strong spatial and temporal dependencies that a CRF is able to exploit by defining interactions among outputs using feature functions. Given a data set that consists of satellite observations and ground-based AOD measurements, a statistical prediction model (R_a) can be trained to use satellite observations as attributes and predict the labels, which are the ground-based AODs. The deterministic AOD prediction models (DP) are based on solid physical principles and tuned by domain scientists. To model the association potential, i.e., the dependency between the predictions and the output AOD, we introduce the following two feature functions:

f_1(y_i, x_i) = -(y_i - DP(x_i))^2, \quad f_a(y_i, x_i) = -(y_i - R_a(x_i))^2.   (25)
To model the interaction potential, we introduce the feature function

g_1(y_i, y_j, x) = -(y_i - y_j)^2.   (26)
Table 1. RMSE and FRAC of C005, NN, GCRF+NN, and NGCRF on data with four quality flags.

         C005    NN              GCRF+NN          NGCRF
RMSE     0.123   0.112 ± 0.002   0.105 ± 0.0006   0.102 ± 0.0008
FRAC     0.65    0.68 ± 0.03     0.71 ± 0.005     0.74 ± 0.007
This interaction potential reflects the correlation between spatio-temporal data examples i and j (closer examples are given larger weight). The learned parameter β represents the level of spatio-temporal correlation of neighboring outputs (a large β indicates that the spatio-temporal correlation is large). We partitioned the data into four subsets corresponding to quality flags QA = 0, 1, 2, and 3. We determined eight α parameters corresponding to the C005 and NN predictions over these subsets. To model the interaction potential we defined spatio-temporal neighbors as a pair of observations whose temporal distance temporalDist(i, j) is less than 7 days and spatial distance spatialDist(i, j) is less than 50 km. This choice is based on previous studies of aerosol dynamics by geoscientists. We multiply the feature g by weights w(i, j, x) that are products of Gaussians:

w(i, j, x) = \begin{cases} e^{-\frac{\text{spatialDist}(i,j)^2}{2\sigma_s^2} - \frac{\text{temporalDist}(i,j)^2}{2\sigma_t^2}}, & i \sim j \\ 0, & \text{otherwise}, \end{cases}   (27)
where σ_s = 50 and σ_t = 10 were determined using a small validation set.

The NGCRF model. Here we use attributes similar to those in the previous section, but in the spirit of the proposed NGCRF model. Instead of defining manual partitions of the dataset, we use all observations as inputs to the α functions. We define α as an exponential function of a linear combination of observations. To incorporate a potential bias, one observation is a vector of all ones:

\alpha_k(\theta, x^{(i)}) = e^{\sum_t \theta_t x_t^{(i)}},   (28)

where x_1^{(i)} is a vector of all ones and x_{2,3,4,5}^{(i)} are the quality flags. As the adaptive model R_a we used the NN defined in the previous sections. Its weight α_a follows the definition in Eq. (28). To model the spatio-temporal correlation, we use the spatial and temporal distance between i and j as two observations for the β function. Similarly to Eq. (28), we define β as

\beta(\psi, x^{(i,j)}) = e^{\sum_l \psi_l x_l^{(i,j)}},   (29)

where x_1^{(i,j)} is a vector of all ones, x_2^{(i,j)} represents the spatial distance between i and j, and x_3^{(i,j)} represents their temporal proximity.
Evaluation. To evaluate the proposed methods, we trained the models on the 2005 data and used the 2006 data for testing. There are many possible measures that could be used to assess AOD prediction accuracy. Given a vector t = (t_1, t_2, ..., t_N)^T of N outcome values and a vector y = (y_1, y_2, ..., y_N)^T of the corresponding predictions, we measure the root mean squared error (RMSE). We also report accuracy on the domain-specific measure called the fraction of successful predictions (FRAC), which penalizes errors on small AOD more than errors on large AOD [27]:

FRAC = \frac{I}{N} \times 100\%,   (30)

where I is the number of predictions that satisfy |y_i - t_i| ≤ 0.05 + 0.15 t_i.
The RMSE of the four models is presented in Table 1, where smaller numbers mean more accurate predictions. The FRAC accuracy of the four models is also shown in Table 1, where larger numbers correspond to better predictions. We can see that in our experiments the NN was more accurate than the operational C005 algorithm. GCRF+NN showed an improvement in accuracy over both NN and C005 by taking advantage of a combination of models and the spatio-temporal correlation in the data. NGCRF achieved even better accuracy by utilizing nonlinear weights and an adaptive statistical model, and by learning, rather than assuming, the level of correlation between points. Although NGCRF is a non-convex approach, it has only slightly larger variance in predictions than GCRF+NN. The obtained results provide strong evidence that adaptive structured learning approaches can be successfully applied to AOD prediction, where even a small improvement in prediction accuracy results in a large uncertainty reduction in the many geophysical studies that rely on AOD predictions [26].
5 Conclusion
Structured learning, a fairly new research area in machine learning, has had great success in classification, but its application to regression problems has not been explored sufficiently. In this article we proposed a method to adaptively combine the outputs of powerful non-structured regression models, such as neural networks, and a variety of correlated knowledge sources into a single prediction model by exploiting possible correlations among the outcome variables. It is worth pointing out the differences between our NGCRF model and the GCRF model proposed in [4]. The GCRF in [4] models a conditional distribution of pixels given a noisy input image using weighted quadratic factors obtained by convolving the image with a set of filters. That GCRF is designed for image de-noising problems, while NGCRF can be applied to general regression problems. Taking a closer look at the model of [4], we find that the features in Eq. (5) and (6) are represented there, while it does not model the adaptive component of NGCRF in Eq. (13). The proposed NGCRF is also readily applicable to other regression applications where there is a need for knowledge integration and exploration of structure in the outputs.
6 Acknowledgment
This work is supported in part by DARPA Grant FA9550-12-1-0406 negotiated by AFOSR, and NSF grant IIS-1117433.
References

1. Radosavljevic, V., Vucetic, S., Obradovic, Z.: Continuous conditional random fields for regression in remote sensing. In: European Conference on Artificial Intelligence (ECAI) (2010) 809–814
2. Solberg, A.H.S., Taxt, T., Jain, A.K.: A Markov random field model for classification of multisource satellite imagery. IEEE Transactions on Geoscience and Remote Sensing 34(1) (1996) 100–113
3. Kumar, S., Hebert, M.: Discriminative random fields: A discriminative framework for contextual interaction in classification. In: Proceedings of the Ninth IEEE International Conference on Computer Vision (2003) 1150–1157
4. Tappen, M.F., Liu, C., Adelson, E.H., Freeman, W.T.: Learning Gaussian conditional random fields for low-level vision. In: IEEE Conference on Computer Vision and Pattern Recognition (2007)
5. Peng, J., Bo, L., Xu, J.: Conditional neural fields. In: Advances in Neural Information Processing Systems 22 (2009) 1419–1427
6. Liu, Y., Carbonell, J., Klein-Seetharaman, J., Gopalakrishnan, V.: Comparison of probabilistic combination methods for protein secondary structure prediction. Bioinformatics 20(17) (2004) 3099–3107
7. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the International Conference on Machine Learning (2001)
8. Qin, T., Liu, T., Zhang, X., Wang, D., Li, H.: Global ranking using continuous conditional random fields. In: Neural Information Processing Systems (2008)
9. Grbovic, M., Vucetic, S.: Tracking concept change with incremental boosting by minimization of the evolving exponential loss. In: ECML/PKDD (1) (2011) 516–532
10. Wytock, M., Kolter, Z.: Sparse Gaussian conditional random fields: Algorithms, theory, and application to energy forecasting. In: Proceedings of the 30th International Conference on Machine Learning (ICML-13), Volume 28, JMLR Workshop and Conference Proceedings (2013) 1265–1273
11. Xiong, C., Wang, T., Ding, W., Shen, Y., Liu, T.Y.: Relational click prediction for sponsored search. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM '12), ACM (2012) 493–502
12. Grbovic, M., Li, W., Xu, P., Usadi, A.K., Song, L., Vucetic, S.: Decentralized fault detection and diagnosis via sparse PCA based decomposition and maximum entropy decision fusion. Journal of Process Control 22(4) (2012) 738–750
13. Djuric, N., Radosavljevic, V., Coric, V., Vucetic, S.: Travel speed forecasting by means of continuous conditional random fields. Transportation Research Record 2263 (2011) 131–139
14. Neville, J., Gallagher, B., Eliassi-Rad, T., Wang, T.: Correcting evaluation bias of relational classifiers with network cross-validation. Knowledge and Information Systems 30 (2012) 31–55
15. Kim, M., Pavlovic, V.: Discriminative learning for dynamic state prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(10) (2009) 1847–1861
16. Wytock, M., Kolter, J.: Large-scale probabilistic forecasting in energy systems using sparse Gaussian conditional random fields. In: 52nd IEEE Annual Conference on Decision and Control (CDC) (2013) 1019–1024
17. Guo, H.: Modeling short-term energy load with continuous conditional random fields. In: ECML/PKDD (1), Volume 8188 of Lecture Notes in Computer Science, Springer (2013) 433–448
18. Baltrušaitis, T., Banda, N., Robinson, P.: Dimensional affect recognition using continuous conditional random fields. In: IEEE Conference on Automatic Face and Gesture Recognition (2013)
19. Ristovski, K., Radosavljevic, V., Vucetic, S., Obradovic, Z.: Continuous conditional random fields for efficient regression in large fully connected graphs. In: AAAI, AAAI Press (2013)
20. Do, T.M.T., Artieres, T.: Neural conditional random fields. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Volume 9, JMLR (2010)
21. Rue, H., Held, L.: Gaussian Markov Random Fields: Theory and Applications. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis (2005)
22. Gerschgorin, S.: Über die Abgrenzung der Eigenwerte einer Matrix. Izv. Akad. Nauk. USSR Otd. Fiz.-Mat. Nauk 7 (1931) 749–754
23. Nix, D.A., Weigend, A.S.: Learning local error bars for nonlinear regression. In: Advances in Neural Information Processing Systems 7, MIT Press (1995) 489–495
24. Liu, T.Y., Xu, J., Qin, T., Xiong, W., Li, H.: LETOR: Benchmark dataset for research on learning to rank for information retrieval. In: SIGIR '07: Proceedings of the Learning to Rank Workshop at the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2007)
25. Joachims, T.: Optimizing search engines using clickthrough data. In: KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM (2002)
26. Kaufman, Y.J., Tanre, D., Boucher, O.: A satellite view of aerosols in the climate system. Nature 419(6903) (2002) 215–223
27. Remer, L.A., Kaufman, Y.: The MODIS aerosol algorithm, products and validation. Journal of the Atmospheric Sciences 62 (2005) 947–973
28. Holben, B.N., Eck, T.F.: AERONET: A federated instrument network and data archive for aerosol characterization. Remote Sensing of Environment 66 (1998) 1–16
29. Official MODIS website: http://modis.gsfc.nasa.gov
30. Radosavljevic, V., Vucetic, S., Obradovic, Z.: A data-mining technique for aerosol retrieval across multiple accuracy measures. IEEE Geoscience and Remote Sensing Letters 7(2) (2010) 411–415