0 such that g has more vertices with regression error ≥ C than does Strict(f, w), and for any D > C, g and Strict(f, w) have the same number of vertices with regression error ≥ D (it is not necessarily true that if c < C then Strict has fewer vertices than g with regression error ≥ c, but the concern is with the number of large errors). For example, for the unweighted data 3, 1, 2, Strict is 2, 2, 2 while Basic and Prefix are 2, 2, 2.5, i.e., their last value has an unnecessary error. Max is even worse, being 2, 2, 3.

Prefix, Basic and Avg have a weaker property: if the L∞ isotonic regression error is D and any of them has regression error D at vertex v, then all L∞ isotonic regressions have the same value at v. For Avg this is obvious. For Basic and Prefix the error at v is ≤ max{mean err(f, w : x, y) : x ⪯ v ⪯ y}. For the maximizing pair, mean err(f, w : x, y) is a lower bound on the larger of the regression errors at x and y, and hence if the regression error at v is D then any isotonic function having a different value at v would have a larger regression error at x or y and would not be an L∞ isotonic regression. Min and Max are at the opposite end of the spectrum, in that for any vertex v, one or both of them has the largest regression error at v among all L∞ isotonic regressions.
3 Isotonic Regression for Arbitrary Partial Orders

As mentioned above, for arbitrary dags the Prefix and Basic regressions can be calculated in Θ(n²) time if the transitive closure is given. The transitive closure can be computed in Θ(min{nm, n^ω}) time, where ω is such that matrix multiplication can be performed in O(n^ω) time; it is known that ω < 2.376. Kaufman and Tamir [27] provided an algorithm taking Θ(n log² n + m log n) time without needing the transitive closure.
It utilizes parametric search and feasibility tests, where a feasibility test is given an ε and decides whether there is an isotonic regression with L∞ error ≤ ε. Their feasibility test takes Θ(n log n + m) time, while we use a simpler test taking only Θ(m) time. Feasibility tests are also used in L∞ histogram construction [11, 16]. Unfortunately parametric search is quite complex and impractical, making the result of only theoretical interest. Later results give practical algorithms for various settings that arise in applications.

Theorem 1 Given weighted data (f, w) on a dag G = (V, E), in Θ(m log n) time the Min, Max and Avg L∞ isotonic regressions can be determined.

Proof: Since the optimal regression error is of the form mean err(f, w : u, v) for some u, v ∈ V, a straightforward approach is to do a binary search on these values, using a feasibility test to decide whether a larger or smaller value is needed. A naive implementation would explicitly compute all mean err values, taking Θ(n² + m log n) time. Parametric search is a “black box” procedure designed to provide the binary search component for such problems (i.e., ones based on the coordinates of intersections of lines). It takes Θ(n log n) time, plus the time for Θ(log n) feasibility tests (see the survey article by Agarwal and Sharir [2]). As shown below, each test can be performed in Θ(m) time using topological sorting. ✷

A window at v is a real-valued interval [av, bv]. Given a set of windows W = {[av, bv] : v ∈ V}, an isotonic function through W is an isotonic function g for which av ≤ g(v) ≤ bv for all v ∈ V.

Observation: A set of windows W = {[av, bv]} has an isotonic function through them if and only if s(v) = max{au : u ⪯ v} is an isotonic function through W. The isotonic constraint forces s(v) to be at least this large, and thus if s is not an isotonic function through W then there must be a v for which s(v) > bv. Since s(v) is the smallest possible value at v among all isotonic functions through W, this implies that there is no isotonic function through W.

Given an ε > 0, the ε-windows are the windows where w(v)(f(v) − av) = w(v)(bv − f(v)) = ε. A feasibility test for ε merely requires determining whether there is an isotonic function through the ε-windows. Since s can be computed via topological sorting, feasibility can be decided in Θ(m) time. When ε is the optimal regression error, at every vertex s has the minimum possible value among all isotonic functions fitting through the windows, and hence s is Min. Similarly, Max(x) = min{bv : x ⪯ v}.
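The feasibility test is simple enough to state concretely. The following is a minimal sketch (in Python, with hypothetical names; the dag is given by successor lists and vertices are numbered 0, …, n−1). It builds the ε-windows, computes s by one topological pass, and either returns s — which is Min when ε is the optimal error — or reports that no isotonic function fits through the windows.

    from collections import deque

    def feasible(eps, f, w, succ):
        # eps-window feasibility test from the proof of Theorem 1 (sketch).
        # succ: successor (adjacency) lists of the dag; vertices are 0..n-1.
        n = len(f)
        a = [f[v] - eps / w[v] for v in range(n)]    # lower ends of the eps-windows
        b = [f[v] + eps / w[v] for v in range(n)]    # upper ends of the eps-windows
        indeg = [0] * n
        for u in range(n):
            for v in succ[u]:
                indeg[v] += 1
        s = a[:]                                     # s(v) = max{a(u) : u precedes or equals v}
        q = deque(v for v in range(n) if indeg[v] == 0)
        while q:
            u = q.popleft()
            if s[u] > b[u]:
                return None                          # no isotonic function fits the windows
            for v in succ[u]:
                s[v] = max(s[v], s[u])
                indeg[v] -= 1
                if indeg[v] == 0:
                    q.append(v)
        return s                                     # s is Min when eps is the optimal error

A binary search (or parametric search) over candidate error values then repeatedly calls this test.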
Linear and rooted tree orderings have m = Θ(n), and hence for them Theorem 1 is faster than Kaufman and Tamir’s algorithm. Another important class with such sparsity is d-dimensional grids, which have Θ(dn) edges. Isotonic regression on points in d-dimensional space, which corresponds to having d independent variables, has been extensively studied [5, 8, 20, 37, 38, 41].

Corollary 2 Given weighted data (f, w) on a linear, tree, or d-dimensional grid ordering, the Min, Max and Avg L∞ isotonic regressions can be determined in Θ(n log n) time, where for d-dimensional grids the implied constant is a function of d. ✷

In Section 5 we show that Prefix can be computed in the same time bounds for linear and tree orders.

3.1 Data With Few Different Weights or Few Different Values

In some applications the number of different weights is significantly less than n. For example, there may be a few measuring devices with differing reliability, which is reflected in the weights attached to their measurements. The author has encountered this scenario in computer simulations of complex experiments. A program of moderate accuracy was run for many points in the parameter space, while one of greater accuracy, but taking far longer, was run for only a few. Results of the latter were weighted more heavily than those of the former.
In other situations there may be only a few different values but many weights. While general parametric search is impractical, for a modest number of different weights or different values a quite practical version is available.

Theorem 3 Given weighted data (f, w) on a dag G = (V, E), with t different weights, or t different values, the Min, Max and Avg L∞ isotonic regressions can be found in Θ(tn + m log n) time and the Prefix L∞ isotonic regression can be found in Θ(tm) time.

Proof: We give the parametric search algorithm for weights, with the one for values being nearly identical. For Min, Max and Avg we first give a version taking Θ((tn + m) log n) time, and then reduce this to the time claimed.

Suppose there are t weights w1, …, wt. For weight wi let Zi = {f(u) : w(u) = wi, u ∈ V}, and let zi(1), …, zi(k(i)) be the values of Zi in sorted order, where k(i) = |Zi|. For vertex v, let k(v, i) be the largest value such that zi(k(v, i)) ≤ f(v). A binary search over the range 1, …, k(v, i) will try to locate a j such that the mean error of (zi(j), wi) and (f(v), w(v)) is the L∞ isotonic regression error (of course, there may be no such value for a given v and i). Note that the error is decreasing in j. The search occurs in parallel for all v ∈ V and 1 ≤ i ≤ t, using the well-known approach for determining order statistics. At any step let e(v, i) denote the mean error of (zi(j), wi) and (f(v), w(v)), where j is the middle of the remaining range in the search at v for weight wi, i.e., initially j = ⌊(1 + k(v, i))/2⌋. The weight of e(v, i) is the size of the remaining range. A feasibility test is run on the weighted median of all tn of the e(v, i) weighted values. At least 1/4 of the sum of the sizes of the remaining ranges is eliminated at each step, so only Θ(log n) iterations are needed.

To reduce the time further we use a more efficient approach for determining which mean err values to calculate. For weights wi and wj, let M be the k(i) × k(j) matrix where M(a, b) is the mean err of (zi(a), wi) and (zj(b), wj) if zi(a) ≥ zj(b), and 0 otherwise. (M is only conceptual, it isn’t actually created.) The rows and columns of M are sorted, so one can use searching on sorted matrices [19] as the black box searching procedure. See Agarwal and Sharir [2] for a simple description: the procedure is easy to understand and implement, it’s the time analysis that’s complicated. Letting K = k(i) + k(j), their search evaluates Θ(K) entries of M and uses Θ(log K) feasibility tests, taking Θ(K) time beyond that of the feasibility tests. (The procedure originally given evaluates Θ(K log K) entries of M.) By using their search for all ordered pairs of weights, mean err is only calculated Θ(tn) times, and thus the time is as claimed.

A simple approach can be used for Prefix. For each weight wi, using topological sorting, for all vertices v determine max{f(u) : w(u) = wi, u ⪯ v}. From the t values at v one can determine pre(v), and hence Prefix can be computed in the time claimed. ✷

Note that Prefix can be computed faster than the time in Theorem 1 when t = o(log n).
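To make the last paragraph of the proof concrete, here is a minimal sketch (Python, hypothetical names) of the Prefix computation for t distinct weights. It assumes pre(v) is the largest weighted mean between v and a predecessor whose value is at least f(v) (the error-query characterization used in Section 4), and that a topological order of the dag is supplied; it is a direct transcription rather than the paper's envelope machinery.

    def prefix_few_weights(f, w, succ, pred, order):
        # succ/pred: adjacency lists of the dag; order: a topological order of 0..n-1.
        n = len(f)
        weights = sorted(set(w))                     # the t distinct weights
        NEG = float("-inf")
        # M[i][v] = max{f(u) : w(u) = weights[i], u a (possibly equal) predecessor of v}
        M = [[NEG] * n for _ in weights]
        for v in order:
            for i, wi in enumerate(weights):
                best = f[v] if w[v] == wi else NEG
                for u in pred[v]:
                    best = max(best, M[i][u])
                M[i][v] = best
        # pre(v): weighted mean with the worst violating predecessor in each weight class
        pre = list(f)
        for v in range(n):
            for i, wi in enumerate(weights):
                if M[i][v] >= f[v]:
                    pre[v] = max(pre[v], (wi * M[i][v] + w[v] * f[v]) / (wi + w[v]))
        # Prefix(v) = min over v and its successors of pre: one reverse topological pass
        g = list(pre)
        for v in reversed(order):
            for u in succ[v]:
                g[v] = min(g[v], g[u])
        return g

Each of the t weight classes is handled by one Θ(n + m) pass, matching the Θ(tm) bound of Theorem 3.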
4 Error Envelopes

We will exploit properties of intersecting line segments to find maximum violating pairs. This is a well-known approach to L∞ regression of weighted data.

Given data (f, w) on a set S, for x ≥ min{f(v) : v ∈ S} let g(x) be the L∞ regression error of using x as the regression value for {v ∈ S : f(v) ≤ x}. This has a simple geometric interpretation: for each v ∈ S, plot the ray with x-intercept f(v) and slope w(v); g is given by the upper envelope of these rays, i.e., the set of line segments with no points above them. See Figure 3(a). We call this upper envelope the upward error envelope, and max{err(v, x) : v ∈ S, f(v) ≥ x} is the downward error envelope.
[Figure 3: Error envelopes. (a) The upward error envelope, plotting error against regression value. (b) Merging error envelopes, one with Roman-labeled segments and one with Greek-labeled segments.]
[Figure 4: Dynamic envelope intersections, with segments A, B, C on one envelope and α, β, γ on the other.]
If the envelopes intersect at regression value r, then r is the L∞ weighted mean of S. It is straightforward to use balanced search trees so that the upward and downward error envelopes can be maintained and, in Θ(log |S|) time, one can insert a weighted value, determine the value of the envelope at a given regression value, determine the regression value having a given error, and determine the point of intersection of a ray with an envelope. Deletions are more difficult and are discussed in Section 2.1.3. Given the error envelopes one can determine their intersection in Θ(log |S|) time, and thus one can determine the mean of S in Θ(|S| log |S|) time. It is possible to do this in Θ(|S|) time by viewing it as a 2-dimensional linear programming problem, as noted in Section 2.1.1. This is needed for Basic to achieve its time bound, but is unnecessary for Prefix.

A slight variation will prove useful later. Given the downward error envelope E of a set S of vertices, and given a vertex v, an error query of E determines the intersection of the error envelope with the ray with x-intercept f(v) and slope w(v), i.e., the ray representing the error in using a regression value ≥ f(v) at v. The maximum of the error queries for vertices in S is the regression error of using the L∞ mean on S.

Another operation involving envelopes is merger. The only special aspect is that after two envelopes are merged, using standard merging of balanced binary search trees, some of the segments may need to be removed. E.g., in Figure 3(b), when the envelope with Roman labeling and the one with Greek labeling are merged, segments B, C, and γ will be removed. The endpoints of some of the remaining segments also change. Using a relatively straightforward procedure, once the envelopes are merged the removal of segments and changing of endpoints can be performed in Θ(a + b log n) time, where a is the size of the smaller envelope and b is the number of segments removed.

Given sets S1, S2 of vertices, a bipartite maximum violators operation determines a pair of vertices u1 ∈ S1, u2 ∈ S2 that maximize mean err(f, w : v1, v2) among all pairs where v1 ∈ S1, v2 ∈ S2 and f(v1) ≥ f(v2). If there are no such pairs then it returns the empty set. Note that S1 and S2 might overlap.
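As a concrete reference for these definitions, the following naive quadratic sketch (Python, hypothetical names) computes mean err directly — assuming the usual weighted-mean error formula w(u)w(v)|f(u) − f(v)|/(w(u) + w(v)) — and answers a single bipartite maximum violators query by brute force; Lemma 4 below is what makes a whole sequence of such queries, interleaved with deletions, cost only O(N log N) in total.

    def mean_err(f, w, u, v):
        # weighted L-infinity error of the best single value for the pair {u, v}
        return w[u] * w[v] * abs(f[u] - f[v]) / (w[u] + w[v])

    def bipartite_max_violators(f, w, S1, S2):
        # returns (u1, u2) maximizing mean_err over u1 in S1, u2 in S2 with
        # f(u1) >= f(u2), or () if there is no such pair
        best, pair = None, ()
        for u1 in S1:
            for u2 in S2:
                if f[u1] >= f[u2]:
                    e = mean_err(f, w, u1, u2)
                    if best is None or e > best:
                        best, pair = e, (u1, u2)
        return pair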
Lemma 4 (Repeated Violators) Given sets S1 and S2 of weighted values, in Θ(N log N) time one can do an arbitrary sequence of O(N) deletion and bipartite maximum violators operations, where N = |S1| + |S2|.

Proof: A downward error envelope for S1, and an upward one for S2, will be used to determine the violators. The envelopes will be threaded so that one can determine the segments adjacent to a given segment in constant time. They are “semi-dynamic” since only deletions change them once they are initially constructed. Hershberger and Suri [25] showed how to create the initial envelopes, and perform all the deletions, in O(N log N) time (they solved the semi-dynamic problem of maintaining the convex hull of a set of planar points which, by duality, can be used to solve the error envelope problem). Thus the only time remaining to be analyzed is the time needed to find the maximum violators.

Given the envelopes, the initial maximum violators can be found in Θ(log N) time. For subsequent violators, the only case of interest is when the segment deleted was one of the ones in the intersection. For example, in Figure 4, suppose initially B and β intersect, and then B is deleted, resulting in new segments in the upward error envelope represented by dashed lines. The new intersection can involve β or a successor, but not α. Further, on the upward envelope it might involve C since that segment may have lengthened, but cannot involve any successor of C. It might, however, involve some predecessor of C. The intersection is found by checking C, then its predecessor, etc., until it is found. During this process it may also be discovered that the intersection will involve a successor of α, etc. While a single deletion may involve Θ(N) segments, if the new intersection does not involve C then no future intersection can involve C until its predecessor has been deleted. Therefore the total number of segments examined over all iterations is O(N). ✷

The fact that only the semi-dynamic scenario is needed, without any insertions, is fortunate since it is not known how to solve the fully dynamic convex hull problem in O(N log N) time.

One can interpret L∞ isotonic regression as minimizing the maximum “penalty” at the vertices, given the isotonic requirement. This can be applied to rather general penalty functions, such as a classification problem with ordered classes and penalties for misclassifying a vertex. The time of the algorithms depends on the complexity of determining the regression value minimizing the error over a set of weighted values. One extension where there is no increase in time is when the error is linear but the weight for using regression values larger than the data value differs from the weight for using regression values less than the data value. For example, overestimating the capacity of a bridge is more serious than underestimating it. None of the operations on envelopes require that the weights for errors above and below the data value be the same.
5 Linear and Tree Orderings

Linear orders are by far the most important dags, and isotonic regression on linear and tree orders has been used for decades (see the extensive references in [5, 37]). Recent applications of isotonic regression on trees include analyzing web data [10, 32]. For rooted trees the ordering is towards the root.

For linear orders it is widely known that the L1 and L2 isotonic regressions can easily be found in Θ(n log n) and Θ(n) time, respectively. For trees the fastest isotonic regression algorithms use a bottom-up pair adjacent violators (PAV) approach [42], taking Θ(n log n) time for the L2 [33] and L1 [41] metrics. Corollary 2 shows that the Min, Max and Avg isotonic regressions can be determined in Θ(n log n) time. By using a bottom-up approach Prefix can be computed in the same time without requiring parametric search.

Theorem 5 Given weighted data (f, w) on a rooted tree T = (V, E), Algorithm A computes the Prefix L∞ isotonic regression in Θ(n log n) time.
    {DecErrEnv(v) is the downward error envelope of {u : u ⪯ v}, for v ∈ V}
    Using a topological sort traversal:
        DecErrEnv(v) = ∪{DecErrEnv(u) : (u, v) ∈ E}
        insert (f(v), w(v)) into DecErrEnv(v)
        pre(v) = result of an error query of DecErrEnv(v)
    Using a topological sort traversal in reverse order:
        Prefix(v) = min{pre(u) : v ⪯ u}

Algorithm A: Computing Prefix using topological sorting on dag G = (V, E)

Proof: Envelope merging was described in Section 4. A balanced tree of size a can be merged with one of size b, a ≤ b, in Θ(a(1 + log(b/a))) time, hence all of the initial union and insertion operations can be performed in Θ(n log n) time [9]. The time for segment removal during each merge is linear in the number of elements in the smaller envelope, plus the time needed to remove intervals. An interval is removed at most once throughout all of the mergers, so the total time for removals is O(n log n), and since the removals are initiated by segments in the smaller tree, the total time for initiations is O(n log n). This completes the proof of Theorem 5. ✷
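For intuition, the following is a naive Θ(n²) transcription of Algorithm A for a rooted tree (Python, hypothetical names). DecErrEnv(v) is kept as an explicit list of the weighted values in v's subtree, and the error query is replaced by the equivalent maximum weighted mean over violating pairs; the envelope merging of Theorem 5 is what brings this down to Θ(n log n).

    def prefix_on_tree(f, w, children, root):
        # children[v] lists v's children; the isotonic ordering is toward the root.
        n = len(f)
        pre = [0.0] * n
        g = [0.0] * n
        env = [None] * n                  # env[v]: (value, weight) pairs in v's subtree
        def build(v):                     # post-order = topological order toward the root
            e = [(f[v], w[v])]
            for c in children[v]:
                build(c)
                e.extend(env[c])
            env[v] = e
            p = f[v]                      # error query: worst violating pair involving v
            for fu, wu in e:
                if fu >= f[v]:
                    p = max(p, (wu * fu + w[v] * f[v]) / (wu + w[v]))
            pre[v] = p
        def push(v, bound):               # Prefix(v) = min of pre along the path from v to the root
            g[v] = min(pre[v], bound)
            for c in children[v]:
                push(c, g[v])
        build(root)                       # recursive for brevity
        push(root, float("inf"))
        return g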
5.1 River Isotonic Regression for Trees

Isotonic regression on trees arises in a variety of hierarchical classification and taxonomy problems in data mining and machine learning. Recent variations of isotonic regression for such problems include requiring that the regression value at a node be greater than or equal to the sum of the values of its children [7], and requiring that for each non-leaf node the regression value is the largest value of a child. Punera and Ghosh [36] called the latter “strict classification monotonicity”, but that is the opposite of the usual meaning of “strictly monotonic”. Here it is called river isotonic regression due to its similarity to river naming conventions. They solved the L1 problem in Θ(n² log n) time. We use a Prefix-like approach to show:

Theorem 6 Given weighted data (f, w) on a rooted tree T = (V, E), in Θ(n log n) time an L∞ river isotonic regression can be determined. If the data is unweighted then this can be done in Θ(n) time.

Proof: For the unweighted case, at vertex v let a be the largest data value in v’s subtree. For a path p from v to a leaf node under it, let b be the smallest value on p. If the regression values along the path are all the same, then the overall regression error in v’s subtree is at least (a − b)/2, and choosing x = (a + b)/2 as the regression value at v achieves this error along p. Given this choice, for any isotonic regression on the remainder of v’s subtree, reducing any values larger than x to x cannot increase the overall error. At each node, keep track of the largest, over the leaf paths beneath it, of the minimal value on the path; the child that achieves this; and the maximum value at all vertices in the subtree. If v were the root then this path gives an optimal regression value for it. These can all be determined in Θ(n) time by a bottom-up process.

Just as for Prefix, a top-down second stage is used, but it is slightly more complicated since the value at the root may be larger or smaller than its children’s. The value at the root is that of the bottom-up phase. Recursively, if the value at a node v has been determined to be x in the top-down stage, let w be the child that was recorded in the bottom-up phase. w’s value will be x. For all other children, their value is the minimum of x and their value from the bottom-up phase. It’s relatively straightforward to show that this produces an optimal river isotonic regression.
For the weighted case, for a vertex v again let p be a path from a leaf to v. Given the downward envelope for all vertices in v’s subtree, and the upward envelope for the vertices in p, using their intersection as the regression value at v gives a lower bound on the total regression error in the subtree when all vertices in p must have the same value. Choosing p to minimize the error of the intersection gives the minimal possible error bound for v’s subtree. A top-down stage provides the final regression value.

Determining the path to use in the weighted case is more complex than in the unweighted case. For example, suppose there is a vertex v with two leaf children. The left child has data value 0 and weight 1, the right child has value 1 and weight 2, and v has value 2 and weight 2. Then on the upward stage the value at v will be 1.5, using the path to the right child. Suppose v’s parent u has a descendant with value 11 and weight 1, and that the optimal path for u involves v. Then the optimal path is from u to v’s left child, with a regression value of 5 and error 6. Had the path terminated at v’s right child then the regression value would have been 4 1/3, with error 6 2/3. Thus, even though u’s path passes through v, the continuation to a leaf is different than for v.

To keep track of the correct path information upward envelopes are used, but they are slightly different. At vertex v, the envelopes for its children are merged together. However, rather than keeping the maximum error for each regression value, the smallest is kept, since there is a path achieving it. Once the children’s envelopes are merged, the segment corresponding to v is added, but it is inserted keeping the maximum error at each regression value, since any path through v must include v. It’s straightforward to show that the envelope merging operations can still be done in Θ(n log n) total time.

A final detail is that to determine the values during the top-down stage one needs to determine how to assign values to children. This requires the envelopes used when going up, so as they are being created they are maintained in persistent form so that intermediate stages can be recovered. ✷
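Returning to the unweighted case of Theorem 6, the two passes can be transcribed directly. A minimal sketch (Python, hypothetical names; children[v] lists the children of v and the ordering is toward the root), recursive for brevity:

    def river_unweighted(f, children, root):
        # Bottom-up: for each vertex record the subtree maximum and the best
        # achievable path minimum; top-down: assign values as described above.
        maxval, bestmin, bestchild = {}, {}, {}
        def up(v):
            if not children[v]:
                maxval[v], bestmin[v], bestchild[v] = f[v], f[v], None
                return
            for c in children[v]:
                up(c)
            maxval[v] = max(f[v], max(maxval[c] for c in children[v]))
            best = max(children[v], key=lambda c: bestmin[c])
            bestmin[v] = min(f[v], bestmin[best])
            bestchild[v] = best
        g = {}
        def down(v, x):
            g[v] = x
            for c in children[v]:
                if c == bestchild[v]:
                    down(c, x)                                     # this child must equal its parent
                else:
                    down(c, min(x, (maxval[c] + bestmin[c]) / 2))  # child's own bottom-up value, capped
        up(root)
        down(root, (maxval[root] + bestmin[root]) / 2)
        return g

The result satisfies the river constraint by construction (the recorded child always equals its parent, and the other children never exceed it); its optimality rests on the argument sketched in the proof.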
5.2 Unimodal Regression for Linear Orders

Given weighted data on a linear order v1 < … < vn, a unimodal regression g is one that minimizes the regression error subject to the unimodal constraint g(v1) ≤ … ≤ g(vk) ≥ g(vk+1) ≥ … ≥ g(vn) for some k, 1 ≤ k ≤ n. Unimodal orderings are also known as umbrella orderings. Optimal modes may not be unique, which can be seen by considering the unweighted data 1, 0, 1. This example shows there need not be a unimodal regression corresponding to Max, i.e., one which is pointwise as large as any other.

Several unimodal regression algorithms [12, 34, 39, 43] have been applied to “competing failure modes” problems [17, 22, 23]. For example, in dose-response settings, as the dose increases the efficacy increases, but toxicity increases as well. The goal is to find a dose that maximizes the probability of being both efficacious and nontoxic [17, 23]. A multi-arm bandit model is used, one bandit per dose, where the expected value of the bandits is assumed to be unimodal. The weights reflect the uncertainty in the expected value. An adaptive trial design is used, where after each observation, unimodal regression is used to decide which dose to use next. Unimodal bandit models have been examined since the 1980s [24].

A prefix approach was used in [39] to find the unimodal regression of a linear order, taking Θ(n) time for L2, Θ(n log n) for L1, and Θ(n) for L∞ with unweighted data. For this approach the information that needs to be computed at each vertex vk is the regression error of an isotonic regression on v1, …, vk. This is the maximum of the error of using pre(i) at vi, 1 ≤ i ≤ k, which can be computed as pre is being computed.
The corresponding information for decreasing isotonic regressions on vk, …, vn is also determined (computed in reverse order). An optimal mode vertex for L∞ is one that minimizes the larger of these two errors. Using Algorithm A gives the following corollary, which was also obtained by Lin et al. [29].

Corollary 7 Given weighted data (f, w) on a linear order of n vertices, an L∞ unimodal regression can be determined in Θ(n log n) time. ✷
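The prefix approach for L∞ can be illustrated by a naive quadratic sketch (Python, hypothetical names): compute the increasing-regression error of every prefix and the decreasing-regression error of every suffix as running maxima over violating pairs, then pick a mode minimizing the larger of the two. Algorithm A reduces this to Θ(n log n).

    def unimodal_mode_linf(f, w):
        n = len(f)
        def pair_err(i, j):            # weighted-mean error of a violating pair
            return w[i] * w[j] * abs(f[i] - f[j]) / (w[i] + w[j])
        inc = [0.0] * n                # inc[k]: L-inf increasing isotonic error on v_0..v_k
        best = 0.0
        for k in range(n):
            for i in range(k):
                if f[i] > f[k]:
                    best = max(best, pair_err(i, k))
            inc[k] = best
        dec = [0.0] * n                # dec[k]: L-inf decreasing isotonic error on v_k..v_{n-1}
        best = 0.0
        for k in range(n - 1, -1, -1):
            for j in range(n - 1, k, -1):
                if f[j] > f[k]:
                    best = max(best, pair_err(k, j))
            dec[k] = best
        # an optimal mode minimizes the larger of the two errors
        return min(range(n), key=lambda k: max(inc[k], dec[k]))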
5.3 Unimodal Regression for Trees

For an undirected tree the unimodal problem is to find an optimal root and the isotonic regression on the resulting rooted tree. Agarwal, Phillips and Sadri [1] were motivated to study this problem by an example concerning determining the surface topology in geographic information systems. They showed that for a tree with unweighted data the L2 unimodal regression, with a Lipschitz condition imposed, can be found in Θ(n log³ n) time. In a machine learning context, Yu and Mannor [46] gave active learning algorithms for locating the root of a tree for a multi-arm bandit model with a bandit at each vertex, where the expected reward is unimodal. This is similar to the use of unimodal regression in response adaptive dose-response designs mentioned above. Bandit models using rooted trees appear in [14].

The algorithms below are quite different from those for unimodal regression on a linear order, in that they first locate the root without creating regressions, and then any algorithm for rooted trees can be applied. For unweighted data, locating a root is simple.

Proposition 8 Given unweighted data f on an undirected tree T = (V, E), in Θ(n) time an L∞ unimodal regression can be found.

Proof: Let x be a vertex where f achieves its maximum and let g be an optimal L∞ unimodal regression on T. Let g′ be the function on T given by g′(v) = min{g(v), g(x)} for all v ∈ V. If ε = |f(x) − g(x)|, then ||f − g||∞ ≥ ε. Since f(v) ≤ f(x) for all v ∈ V, |g′(v) − f(v)| ≤ max{ε, |g(v) − f(v)|}. Thus ||g′ − f||∞ ≤ ||g − f||∞, and, since g was optimal, so is g′. Further, g′ is isotonic when the tree is directed with x as its root. Thus, to find a unimodal regression it suffices to locate a vertex with maximum data value, orient the tree with this as the root, and use Prefix to determine the isotonic regression. ✷

For weighted data it is not immediately obvious which vertices can be optimal roots, but there are edges for which one can quickly determine their correct orientation. The path from u to v, for u, v ∈ V, means the simple path from u to v. For weighted data (f, w) on the undirected tree T = (V, E), let u, v be such that mean err(f, w : u, v) = max{mean err(f, w : p, q) : p, q ∈ V}. Suppose f(u) < f(v). If in the directed tree u is on the path from v to the root then they are violators and the regression error is at least mean err(f, w : u, v). This is the error of the regression where all values are mean(f, w : u, v), and hence every vertex is an optimal root. To obtain a smaller regression error it must be that u is not on the path from v to the root of the directed tree, in which case the edge incident to u on the simple path from u to v is directed towards the root. This will be used to show the following:

Theorem 9 Given weighted data (f, w) on an undirected tree T = (V, E), in Θ(n log n) time an L∞ unimodal regression can be found.

Proof: The algorithm locates the root via recursion. Locate a pair of vertices u, v which maximize mean err. If f(u) < f(v) then let e be the edge incident to u on the path towards v. Remove e, which partitions T into trees T1 = (V1, E1), T2 = (V2, E2), where u ∈ V1.
Recursively find an optimum root r for a unimodal regression of T2. Lemma 10 proves that r is an optimal root for unimodal regression on T. Once an optimal root is known the Prefix isotonic regression can be determined in Θ(n log n) time (Theorem 5).

To find u, v, set S1 = S2 = V and perform a bipartite maximum violators operation. Once T1 is determined, all weighted values corresponding to it are removed from S1 and S2 and the resulting sets are used for the subproblem on T2. The repeated violators lemma (Lemma 4) shows that one can find pairs of maximum violators during all recursive steps in Θ(n log n) total time. Knowing u and v does not immediately tell one which edge incident to u is e. Lemma 11 shows that one can find e in O(|V1|) time. Using this, T1 can be removed from T, creating T2, in Θ(|V1|) time. Thus the total time for recursively shrinking the subtrees is Θ(n). ✷

Lemma 10 For an undirected tree T = (V, E) with weighted data (f, w), let u1, u2 be a pair which maximizes mean err and for which f(u1) < f(u2). Let {u1, x} be an edge on the path from u1 to u2. Let T1 = (V1, E1), T2 = (V2, E2) be the subtrees of T induced by removing this edge, where u1 ∈ T1 and x ∈ T2. Then an optimal root for L∞ unimodal regression of (f, w) restricted to T2 is an optimal root for L∞ unimodal regression of (f, w) on T.

Proof: If the minimal regression error of a unimodal regression on T is mean err(f, w : u1, u2) then all vertices are optimal roots. Otherwise the regression value at u2 must be > mean(f, w : u1, u2) and that at u1 must be < mean(f, w : u1, u2), and therefore any optimal root is in T2. Let g be a unimodal regression on T2 with root r, and let h be a unimodal regression on T. Define g′ on T as follows:

    g′(v) = h(v)                 if v ∈ T1
    g′(v) = max{h(u1), g(v)}     if v is on the path from r to x
    g′(v) = g(v)                 otherwise
Since the root of h is in T2, h restricted to T1 has u1 as its root, and the construction ensures that g′ is unimodal on T with root r. Since g and h are optimal, to show that g′ is optimal we only need to show that on the path from r to x its error is no worse than that of g or h. For an optimal regression the value at u2 is > h(u1), and hence on the path from r to u2, g > h(u1). Similarly, on the path from x to u2, h ≥ h(u1). Since the path from r to x is a subset of the union of the paths from r to u2 and from x to u2, the value of g′ on the path from r to x is between (or equal to) the values of g and h, and hence its error is no more than the maximum of theirs. Thus it too is optimal. ✷

Lemma 11 Given an undirected tree T = (V, E) and vertices u, v ∈ V, u ≠ v, let x ∈ V be such that the edge {u, x} is on the path from u to v. Then x can be determined in O(|T′|) time, where T′ is the maximal subtree of T containing u but not containing x.

Proof: Let p1, …, pk be the neighbors of u. Viewing u as the root, let Ti denote the subtree with root pi, i = 1, …, k. On each Ti a preorder traversal will be used. A global traversal is performed as follows: visit the first node in the traversal of T1, then the first node in the traversal of T2, …, then the first node in the traversal of Tk, then the 2nd node in the traversal of T1, the 2nd node in the traversal of T2, and so on. If v is reached while traversing Tj, or if for every Ti, i ≠ j, the final node in the traversal of Ti has been reached, then x = pj and T′ = {u} ∪ {Ti : i = 1, …, k, i ≠ j}. In all cases, x is identified and the number of nodes examined in the traversal of Tj is no more than the number of nodes traversed in the other subtrees, i.e., in T′. ✷
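The interleaved traversal of Lemma 11 is easy to implement with lazy, generator-based preorder traversals. A minimal sketch (Python, hypothetical names; adj gives the undirected adjacency lists, and u ≠ v with v in the tree):

    def find_edge_toward(adj, u, v):
        # Returns the neighbor x of u on the path from u to v, examining only
        # O(|T'|) vertices of the subtrees hanging off u away from x.
        def preorder(root, parent):            # lazy preorder of the subtree rooted at `root`
            stack = [(root, parent)]
            while stack:
                node, par = stack.pop()
                yield node
                for nb in adj[node]:
                    if nb != par:
                        stack.append((nb, node))
        gens = {p: preorder(p, u) for p in adj[u]}
        active = list(adj[u])
        while True:
            for p in list(active):             # round-robin: one step in each live traversal
                try:
                    node = next(gens[p])
                except StopIteration:
                    active.remove(p)
                    continue
                if node == v:
                    return p
            if len(active) == 1:               # every other subtree is exhausted: v lies here
                return active[0]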
6 Restricted Regression Values

Isotonic regressions with restricted values have been studied by several authors (see [3] and the references therein), including their use in various classification problems, e.g., [18]. For integer-valued isotonic regression on a linear order, Goldstein and Kruskal [21] gave a Θ(n) time algorithm for the L2 metric. For L∞, Liu and Ubhaya [30] gave a Θ(n²) time algorithm. Theorem 12 below shows that the time can be reduced to Θ(n log n). More generally, coupled with Theorem 1 it shows that for any dag an integer-valued L∞ regression can be found in Θ(m log n) time. Theorem 14 shows that these ideas can be extended to find approximations with specified accuracy.

We use an approach applicable to arbitrary dags and regression values restricted to a set S of real numbers. An S-valued Lp isotonic regression is an S-valued isotonic function which minimizes the Lp regression error among all S-valued isotonic functions. S must be a closed discrete set, which implies that for any real number x, if x ≤ max{s ∈ S} then ⌈x⌉S = min{s : s ≥ x, s ∈ S} is well-defined, and correspondingly if x ≥ min{s ∈ S} then ⌊x⌋S = max{s : s ≤ x, s ∈ S} is well-defined. If x < min{s ∈ S} let ⌊x⌋S = min{s ∈ S}, and if x > max{s ∈ S} then ⌈x⌉S = max{s ∈ S}. Further, for x ∈ S there is a unique successor if x < max{s ∈ S} and a unique predecessor if x > min{s ∈ S}. We assume that predecessor, successor, ⌊·⌋S, and ⌈·⌉S can be determined in constant time. We will show that one can quickly convert an unrestricted isotonic regression into an S-valued one.

Theorem 12 Given weighted data (f, w) on a dag G = (V, E), a closed discrete set S of real numbers, and an unrestricted L∞ isotonic regression of f, an S-valued L∞ isotonic regression can be determined in Θ(m) time.

We first establish a property connecting values of restricted and unrestricted isotonic regressions. For linear orders, Goldstein and Kruskal [21] showed that for L2 integer isotonic regression, rounding the unrestricted isotonic regression gives the correct regression, and Liu and Ubhaya [30] showed that this is true for the L∞ metric when the data is unweighted and the original regression is Basic. However, for weighted data, or when the original regression is not Basic, integer-valued L∞ isotonic regression is a bit more complicated. For example, on vertices {0, 1}, suppose the data values are 0.6, 0 and the weights are 2, 1. Then the unique unrestricted L∞ isotonic regression is 0.4, 0.4, yet the unique integer-valued L∞ regression is 1, 1, not 0, 0. While rounding does not always give the optimal S-valued isotonic regression, it nearly does.

Lemma 13 For 1 ≤ p ≤ ∞, given weighted data (f, w) on a dag G = (V, E), a closed discrete set S of real numbers, and an Lp isotonic regression g of (f, w), there is an S-valued Lp isotonic regression ĝ such that ĝ(v) ∈ {⌊g(v)⌋S, ⌈g(v)⌉S} for all v ∈ V.

Proof: Given p, let g be an unrestricted Lp isotonic regression, and let ĝ be an S-valued Lp isotonic regression. If there are vertices v such that ĝ(v) > ⌈g(v)⌉S then let u be one that maximizes the difference (a similar proof holds if there are vertices v such that ĝ(v) < ⌊g(v)⌋S). Let A = {v : g(v) = g(u) and ĝ(v) = ĝ(u), v ∈ V}. For any vertex v ∉ A which is a predecessor of an element of A, ĝ(v) < ĝ(u): the fact that v is not in A means either ĝ(v) < ĝ(u) or g(v) < g(u), but if only the latter holds then u does not maximize the difference. Similarly, if v ∉ A is a successor of a vertex in A then g(v) > g(u).
Let x ∈ S be the predecessor of ĝ(u). Note that x ≥ ⌈g(u)⌉S. Let ĝ′ be the S-valued isotonic function which is ĝ on V \ A and x on A. If the regression error of ĝ′ is less than that of ĝ then ĝ was not optimal. If the regression error of ĝ′ is larger, then the weighted mean of A is greater than x.
Let y = min{x, min{g(v) : g(v) > g(u), v ∈ V}}. Raising the value of g on A to y, and keeping it the same elsewhere, would decrease the Lp error and maintain the isotonic property, contradicting g’s optimality. Finally, if the Lp regression error of ĝ′ is the same as that of ĝ then recursively consider ĝ′. Since S is discrete there are only finitely many values between x and ⌈g(u)⌉S, so the process can only recurse a finite number of times. ✷

To prove Theorem 12, let g be an unrestricted L∞ isotonic regression and for r ∈ S let Lr be the set {v : ⌊g(v)⌋S = r, v ∈ V}. Note that this is a union of level sets in g. No matter what choices are made for the regression values within Lr, since these choices are either r or its successor r+ in S, they do not impose any isotonic restrictions on the choices for other level sets, since the values used on Lr are at least as large as the upper values in any predecessor and no larger than the lower values in any successor. Therefore we can consider each Lr separately. For v ∈ Lr, define the functions down err and up err by

    down err(v) = max{err(u, r) : u ⪯ v, u ∈ Lr}
    up err(v)   = max{err(u, r+) : v ⪯ u, u ∈ Lr}

down err(v) is a lower bound on the L∞ isotonic regression error if the regression value at v is r, since all of v's predecessors in Lr must also have regression value r, and similarly up err(v) is a lower bound if it is r+. Let ĝ be defined by

    ĝ(v) = r     if down err(v) ≤ up err(v)
    ĝ(v) = r+    otherwise

Since down err is monotonic increasing on Lr and up err is monotonic decreasing, ĝ is isotonic increasing. At each vertex v, changing the choice for ĝ(v) can never decrease the overall regression error, and hence an optimal S-valued L∞ isotonic regression has been found. down err and up err can be computed by topological sorting, completing the proof of the theorem. ✷

Thus for unweighted data an S-valued isotonic regression can be found in Θ(m) time. For weighted data, one can efficiently find a regression without starting with an unrestricted one. For an approximation with error within ε of optimal, one can choose S to be the multiples of ε/max{w(v) : v ∈ V} within the data range.
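A sketch of the Theorem 12 conversion follows (Python, hypothetical names). It assumes S is given sorted, g is any unrestricted L∞ isotonic regression, the dag is given by predecessor and successor lists together with a topological order, and err(v, x) = w(v)|f(v) − x|. Since g is isotonic, ⌊g⌋S is as well, so propagating down err and up err only through neighbors in the same Lr suffices.

    import bisect

    def s_valued_from_unrestricted(f, w, g, S, succ, pred, order):
        n = len(f)
        err = lambda v, x: w[v] * abs(f[v] - x)
        def floor_index(x):                        # index of floor_S(x), clamped to min(S)
            return max(bisect.bisect_right(S, x) - 1, 0)
        lvl = [floor_index(g[v]) for v in range(n)]
        nxt = lambda i: S[min(i + 1, len(S) - 1)]  # successor r+ (r itself if r = max S)
        down = [0.0] * n
        up = [0.0] * n
        for v in order:                            # down err: forward pass within each L_r
            down[v] = err(v, S[lvl[v]])
            for u in pred[v]:
                if lvl[u] == lvl[v]:
                    down[v] = max(down[v], down[u])
        for v in reversed(order):                  # up err: backward pass within each L_r
            up[v] = err(v, nxt(lvl[v]))
            for u in succ[v]:
                if lvl[u] == lvl[v]:
                    up[v] = max(up[v], up[u])
        return [S[lvl[v]] if down[v] <= up[v] else nxt(lvl[v]) for v in range(n)]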
Theorem 14 Given weighted data (f, w) on a dag G = (V, E), and given a finite set S of real numbers, in Θ(m log |S|) time an S-valued L∞ isotonic regression can be determined.

Proof: Let S = {s1, …, sd}, let [sa, sb], 1 ≤ a ≤ b ≤ d, denote the interval sa, …, sb of values in S, and for v ∈ V let errS(v, a, b) denote min{err(v, t) : t ∈ [sa, sb]}. The algorithm determines intervals of S in which the regression value for v will lie, where the intervals are progressively halved until they are a single value, which is the regression value of v. At each stage the assignment is isotonic in that if u ≺ v then either u and v are assigned to the same interval, or the intervals have no overlap and u’s interval has smaller values than v’s. Initially each vertex is assigned to all of S, i.e., to [s1, sd]. Suppose x is currently assigned to [sa, sb], and let c = ⌊(a + b)/2⌋. Define

    down err(x) = max{errS(u, a, c) : u ⪯ x, u currently assigned to [sa, sb]}
    up err(x)   = max{errS(v, c+1, b) : x ⪯ v, v currently assigned to [sa, sb]}

down err(x) is a lower bound on the L∞ regression error for vertices currently assigned to [sa, sb] if x is assigned to [sa, sc] in the next stage, and up err(x) is a lower bound if it is assigned to [sc+1, sb]. Assign x to [sa, sc] if down err(x) ≤ up err(x), and to [sc+1, sb] otherwise. As before, this assignment is isotonic and can be determined in Θ(m) time.

To see that the final assignments form an S-valued L∞ isotonic regression, suppose they do not. Let ε be the regression error of an S-valued L∞ isotonic regression, and consider the first iteration at which some vertex was assigned to a half for which its error is > ε. Let x be a minimal such vertex, initially in [sa, sb], and assume it is assigned to the lower half (a similar argument holds if it is assigned to the upper half). Thus ε < down err(x) ≤ up err(x). No predecessor of x has error > ε, therefore down err(x) = errS(x, a, c). Since the error bound was not exceeded in previous iterations it must be that f(x) > sc and errS(x, 1, c) > ε. Because up err(x) ≥ down err(x) there must be an x′ ≻ x initially in [sa, sb] for which errS(x′, c+1, b) = up err(x) > ε. Let y be a maximal such element. Similarly to x’s properties, it must be that f(y) < sc+1 and errS(y, c+1, d) > ε. Since x ≺ y, any S-valued isotonic regression must either assign a regression value to x in [s1, sc] or a regression value to y in [sc+1, sd], contradicting the assumption that there was an S-valued isotonic regression with error ε. ✷
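A sketch of the interval-halving algorithm (Python, hypothetical names; S sorted, dag given by predecessor/successor lists and a topological order). Each pass computes down err and up err by two topological sweeps, restricted to vertices currently sharing the same interval, and halves every non-singleton interval; after ⌈log |S|⌉ passes every interval is a single value.

    def s_valued_direct(f, w, S, succ, pred, order):
        n = len(f)
        lo, hi = [0] * n, [len(S) - 1] * n            # current interval [S[lo], S[hi]] per vertex
        def errS(v, a, b):                            # min error of v over the values S[a..b]
            if f[v] < S[a]: return w[v] * (S[a] - f[v])
            if f[v] > S[b]: return w[v] * (f[v] - S[b])
            return 0.0
        while any(lo[v] < hi[v] for v in range(n)):
            mid = [(lo[v] + hi[v]) // 2 for v in range(n)]
            down, up = [0.0] * n, [0.0] * n
            same = lambda u, v: lo[u] == lo[v] and hi[u] == hi[v]
            for v in order:                           # down err within the current interval
                down[v] = errS(v, lo[v], mid[v])
                for u in pred[v]:
                    if same(u, v):
                        down[v] = max(down[v], down[u])
            for v in reversed(order):                 # up err within the current interval
                up[v] = errS(v, mid[v] + 1, hi[v]) if mid[v] < hi[v] else float("inf")
                for u in succ[v]:
                    if same(u, v):
                        up[v] = max(up[v], up[u])
            for v in range(n):                        # halve every non-singleton interval
                if lo[v] < hi[v]:
                    if down[v] <= up[v]: hi[v] = mid[v]
                    else: lo[v] = mid[v] + 1
        return [S[lo[v]] for v in range(n)]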
7 Final Comments

For an arbitrary dag, Theorem 1 shows that an L∞ isotonic regression can be found in Θ(m log n) time. This is perhaps the optimal time, though we have no proof of this. Unfortunately it is based on parametric search, which is exceedingly complex and infeasible (see Agarwal and Sharir [2]). It is an open question whether there is a practical Θ(m log n) algorithm for the general case; it is open even when restricted to simple d-dimensional grids, d ≥ 2.

Since the fastest algorithm is infeasible we give alternatives for some important cases. We give a practical parametric search when there are few different weights, or few different values (Section 3.1). When the regression values must come from a specified set, Theorem 14 gives an algorithm using iterative partitioning. This also provides a simple approach when the regression is only needed to within a given accuracy. Another such approach is to do a binary search on the error using feasibility tests.

This paper introduced the Prefix regression. Basic has been extensively studied and is trivial to compute for unweighted data, but for weighted data Prefix is more useful. Prefix can be computed by using error envelopes, reusing calculations at one vertex to reduce those at successors. For linear and tree orders it can be determined in Θ(n log n) time, obviating the need for parametric search. It also provides a Θ(n log n) approach for river isotonic regression on a tree (Section 5.1) and unimodal regression on a line (Section 5.2). Thus the use of parametric search by Chun et al. [12] for unimodal regression is unnecessary.

Beyond their computational time, another important aspect of L∞ isotonic regression algorithms is the nature of the regressions they produce. This arises because of the non-uniqueness of L∞ regression. When the feasibility tests associated with parametric search are used, the natural resulting regressions are the pointwise minimum, Min, or maximum, Max. Avg has apparently not been studied previously. For isotonic regression one would not expect that increasing a data value results in the regression being lowered at another vertex, but this happens for Min: for the unweighted values 0, 3, 1 and 0, 5, 1 the regression changes from -1, 2, 2 to -2, 3, 3. Thus, unlike Prefix, Min is not a monotonic mapping of data into isotonic functions (Section 2.2). Max has similar behavior. Further, these regressions can have values outside the data range. E.g., for values 2, 3, 1, 2 with weights 1, 4, 4, 1, Min is -2, 2, 2, 2, Max is 2, 2, 2, 6, and Avg is 0, 2, 2, 4. For some purposes values outside the data range would not be acceptable. While one could trim the values so that they lie within the data range, Prefix has this property automatically, as do all Lp isotonic regressions for 1 ≤ p < ∞. For this example they would all be 2, 2, 2, 2.
In Section 2.3 it was shown that, in terms of minimizing the number of vertices with large regression errors, Strict is the best possible, and Avg, Basic and Prefix are significantly better than Min and Max. While the objective of L∞ regression is to minimize the worst error, the number of large errors may also be relevant. For example, when weighted L∞ isotonic regression is used to minimize the maximum distance to a service center [27] it might be beneficial if fewer customers had to travel the maximum weighted distance. Overall, among the current algorithms for general dags there is an inverse relationship between an algorithm’s worst-case time to determine an L∞ isotonic regression and the desirability of the regressions it produces.

Acknowledgements

Research partially supported by NSF grant CDI-1027192, DOE grant DE-FC52-08NA28616, and King Abdulaziz University.
References [1] Agarwal, K, Phillips, JM, and Sadri, B (2010), “Lipschitz unimodal and isotonic regression on paths and trees”, LATIN 2010, pp. 384–396. [2] Agarwal, PK and Sharir, M (1998), “Efficient algorithms for geometric optimization”, ACM Computing Surveys, pp. 412–458. [3] Ahuja, RK and Orlin, JB (2001), “A fast scaling algorithm for minimizing separable convex functions subject to chain constraints”, Operations Research 49, pp. 784–789. [4] Angelov, S, Harb, B, Kannan, S, and Wang, L-S (2006), “Weighted isotonic regression under the L1 norm”, Symp. Discrete Algorithms 2006, pp. 783–791. [5] Barlow, RE, Bartholomew, DJ, Bremner, JM, and Brunk, HD (1972), Statistical Inference Under Order Restrictions: the Theory and Application of Isotonic Regression, Wiley. [6] Barlow, RE and Brunk, HD (1972), “The isotonic regression problem and its dual”, J. Amer. Stat. Soc. 67, pp. 140–147. [7] Benabbas, S, Lee, HC, Oren, J and Ye, Y (2011), “Efficient sum-based hierarchical smoothing under ℓ1 norm”, arXiv:1108.1751. [8] Bril, G, Dykstra, R, Pillars, C and Robertson, T (1984), “Algorithm AS 206: Isotonic regression in two independent variables”, J. Royal Stat. Soc. Series C (Applied Stat.) 33, pp. 352–357. [9] Brown, MR and Tarjan, RE (1979), “A fast merging algorithm”, J. Assoc. Comp. Mach. 26, pp. 211– 226. [10] Chakrabarti, D, Kumar, R and Punera, K (2007), “Page-level template detection via isotonic smoothing”, Proc. 16th Int’l. World Wide Web Conf. [11] Chen, DZ and Wang, H (2013), “Approximating points by a piecewise linear function”, Algorithmica 66, pp. 682–713.
[12] Chun, J, Sadakane, K, and Tokuyama, T (2006), “Linear time algorithm for approximating a curve by a single-peaked curve”, Algorithmica 44, pp. 103–115.
[13] Chandrasekaran, R, Rhy, YU, Jacob, VS, and Hong, S (2005), “Isotonic separation”, INFORMS J. Computing 17, pp. 462–474.
[14] Coquelin, P-A and Munos, R (2007), “Bandit algorithms for tree search”, arXiv:cs/0703062.
[15] Dembczynski, K, Greco, S, Kotlowski, W, and Slowinski, R (2007), “Statistical model for rough set approach to multicriteria classification”, PKDD 2007: 11th European Conf. Principles and Practice Knowledge Discovery in Databases, Springer Lec. Notes Comp. Sci. 4702, pp. 164–175.
[16] Díaz-Báñez, JM and Mesa, JA (2001), “Fitting rectilinear polygonal curves to a set of points in the plane”, Eur. J. Operational Res. 130, pp. 214–222.
[17] Durham, S, Flournoy, N, and Li, W (1998), “A sequential design for maximizing the probability of a favorable response”, Can. J. Stat. 26, pp. 479–495.
[18] Dykstra, R, Hewett, J and Robertson, T (1999), “Nonparametric, isotonic discriminant procedures”, Biometrika 86, pp. 429–438.
[19] Frederickson, GN and Johnson, DB (1984), “Generalized selection and ranking: sorted matrices”, SIAM J. Comput. 13, pp. 14–30.
[20] Gebhardt, F (1970), “An algorithm for monotone regression with one or more independent variables”, Biometrika 57, pp. 263–271.
[21] Goldstein, AJ and Kruskal, JB (1976), “Least-squares fitting by monotonic functions having integer values”, J. Amer. Stat. Assoc. 71, pp. 370–373.
[22] Haiminen, N, and Gionis, A (2004), “Unimodal segmentation of sequences”, Proc. 4th IEEE Int’l. Conf. Data Mining, pp. 106–113.
[23] Hardwick, J, Meyer, M and Stout, QF (2003), “Directed walk designs for dose response problems with competing failure modes”, Biometrics 59, pp. 229–236.
[24] Herkenrath, U. (1983), “The N-armed bandit with unimodal structure”, Metrika 30, pp. 195–210.
[25] Hershberger, J, and Suri, S (1992), “Applications of a semi-dynamic convex hull algorithm”, BIT 32, pp. 249–267.
[26] Kalai, AT and Sastry, R (2009), “The Isotron algorithm: high-dimensional isotonic regression”, Proc. Comp. Learning Theory (COLT) 2009.
[27] Kaufman, Y and Tamir, A (1993), “Locating service centers with precedence constraints”, Discrete Applied Math. 47, pp. 251–261.
[28] Legg, D., Townsend, D (1984), “Best monotone approximation in L∞[0,1]”, J. Approx. Theory 42, pp. 30–35.
[29] Lin, T.-C., Kuo, C.-C., Hsieh, Y.-H., and Wang, B.-F. (2009), “Efficient algorithms for the inverse sorting problem with bound constraints under the l∞-norm and the Hamming distance”, J. Comp. and Sys. Sci. 75, pp. 451–464.
[30] Liu, M-H and Ubhaya, VA (1997), “Integer isotone optimization”, SIAM J. Optimization 7, pp. 1152–1159.
[31] Maxwell, WL and Muckstadt, JA (1985), “Establishing consistent and realistic reorder intervals in production-distribution systems”, Operations Research 33, pp. 1316–1341.
[32] Moon, T, Smola, A, Chang, Y and Zheng, Z (2010), “IntervalRank — isotonic regression with listwise and pairwise constraints”, Proc. Web Search and Data Mining, pp. 151–160.
[33] Pardalos, PM and Xue, G (1999), “Algorithms for a class of isotonic regression problems”, Algorithmica 23, pp. 211–222.
[34] Pehrsson, N-G and Frisén, M (1983), “The UREGR procedure”, Gothenburg Computer Central, Göteborg, Sweden.
[35] Pólya, G. (1913), “Sur un algorithme toujours convergent pour obtenir les polynomes de meilleure approximation de Tchebycheff pour une fonction continue quelconque”, Comptes Rendus 157, pp. 840–843.
[36] Punera, K and Ghosh, J (2008), “Enhanced hierarchical classification via isotonic smoothing”, Proc. Int’l. Conf. World Wide Web 2008, pp. 151–160.
[37] Robertson, T, Wright, FT, and Dykstra, RL (1988), Order Restricted Statistical Inference, Wiley.
[38] Spouge, J, Wan, H, and Wilber, WJ (2003), “Least squares isotonic regression in two dimensions”, J. Optimization Theory and Appl. 117, pp. 585–605.
[39] Stout, QF (2008), “Unimodal regression via prefix isotonic regression”, Comp. Stat. and Data Anal. 53, pp. 289–297. A preliminary version appeared in “Optimal algorithms for unimodal regression”, Computing and Stat. 32, 2000.
[40] Stout, QF (2012), “Strict L∞ isotonic regression”, J. Optimization Theory and Appl. 152, pp. 121–135.
[41] Stout, QF (2013), “Isotonic regression via partitioning”, Algorithmica 66, pp. 93–112.
[42] Thompson, WA Jr. (1962), “The problem of negative estimates of variance components”, Annals Math. Stat. 33, pp. 273–289.
[43] Turner, TR and Wollan, PC (1997), “Locating a maximum using isotonic regression”, Comp. Stat. and Data Anal. 25, pp. 305–320.
[44] Ubhaya, VA (1974), “Isotone optimization, I, II”, J. Approx. Theory 12, pp. 146–159, 315–331.
[45] Velikova, M and Daniels, H (2008), Monotone Prediction Models in Data Mining, VDM Verlag.
[46] Yu, JY and Mannor, S (2011), “Unimodal bandits”, Proc. Int’l. Conf. Machine Learn., pp. 41–48.