Bounds on Generalized Huffman Codes

Michael B. Baer, Member, IEEE

arXiv:cs/0702059v1 [cs.IT] 10 Feb 2007

Abstract

New lower and upper bounds are obtained for the compression of optimal binary prefix codes according to various nonlinear codeword length objectives. Like the coding bounds for Huffman coding, which concern the traditional linear code objective of minimizing average codeword length, these are in terms of a form of entropy and the probability of the most probable input symbol. As in Huffman coding, some upper bounds can be found using sufficient conditions for the codeword corresponding to the most probable symbol being one bit long. Whereas having probability no less than 0.4 is a tight sufficient condition for this to be the case in Huffman coding, other penalties differ, some having a tighter condition, some a looser condition, and others having no such sufficient condition. The objectives explored here are ones for which optimal codes can be found using a generalized form of Huffman coding. These objectives include one related to queueing (an increasing exponential average), one related to single-shot communications (a decaying exponential average), and the recently formulated minimum maximum pointwise redundancy. For these three objectives, we also investigate the necessary and sufficient conditions for the existence of an optimal code with a 1-bit codeword. In the last of them, and in a related fourth objective (dth exponential redundancy), multiple bounds are obtained for different intervals, much as in traditional Huffman coding.

Index Terms: Huffman codes, minimax redundancy, optimal prefix code, queueing, Rényi entropy.

I. INTRODUCTION

A lossless binary prefix coding problem takes a probability mass function p(i), defined for all i in the input alphabet X, and finds a binary code for X. Without loss of generality, we consider an n-item source emitting symbols drawn from the alphabet X = {1, 2, . . . , n}, where {p(i)} is the sequence of probabilities for possible symbols (p(i) > 0 for i ∈ X and $\sum_{i\in\mathcal{X}} p(i) = 1$) in monotonically nonincreasing order (p(i) ≥ p(j) for i < j). The source symbols are coded into binary codewords. The codeword c(i) ∈ {0, 1}*, in code c, corresponding to input symbol i, has length l(i), defining length vector l.

The goal of the traditional coding problem is to find a prefix code minimizing expected codeword length $\sum_{i\in\mathcal{X}} p(i)l(i)$, or, equivalently, minimizing average redundancy

$$\bar{R}(l, p) \triangleq \sum_{i\in\mathcal{X}} p(i)l(i) - H(p) = \sum_{i\in\mathcal{X}} p(i)\bigl(l(i) + \lg p(i)\bigr)$$

where H is $-\sum_{i\in\mathcal{X}} p(i)\lg p(i)$, Shannon entropy, and $\lg \triangleq \log_2$. A prefix code is a code for which no codeword begins with a sequence that also comprises the whole of a second codeword. This problem is equivalent to finding a minimum-weight external path

$$\sum_{i\in\mathcal{X}} w(i)\,l(i)$$

among all rooted binary trees, due to the fact that every prefix code can be represented as a binary tree. In this tree representation, each edge from a parent node to a child node is labeled 0 (left) or 1 (right), based on whether the child is the left or right child. A leaf is a node without children; this corresponds to a codeword, and the codeword is determined by the path from the root to the leaf. Thus, for example, a leaf that is the right (1) child of a left (0) child of a left (0) child of the root will correspond to codeword 001. Leaf depth (distance from the root) is thus codeword length. The weights are the probabilities (i.e., w(i) = p(i)), and, in fact, we will refer to the problem inputs as {w(i)} for certain generalizations in which their sum, $\sum_{i\in\mathcal{X}} w(i)$, need not be 1.

This work was supported in part by the National Science Foundation (NSF) under Grant CCR-9973134 and the Multidisciplinary University Research Initiative (MURI) under Grant DAAD-19-99-1-0215. Part of this work was performed while the author was at Stanford University. Material in this paper was presented at the 2006 International Symposium on Information Theory, Seattle, Washington, USA. The author was with the Department of Electrical Engineering, Stanford University, Stanford, CA 94305-9505 USA. He is now with Electronics for Imaging, 303 Velocity Way, Foster City, CA 94404 USA (e-mail: [email protected]).


If formulated in terms of l, the constraints on the minimization are the integer constraint (i.e., that codes must be of integer length) and the Kraft inequality [1]; that is, the set of allowable codeword length vectors is

$$L_n \triangleq \left\{ l \in \mathbb{Z}_+^n \text{ such that } \sum_{i=1}^n 2^{-l(i)} \le 1 \right\}.$$

It is well known that the Huffman algorithm [2] finds a code minimizing average redundancy; this is so well known that the problem itself is often referred to as the "Huffman problem." The Huffman algorithm is a greedy algorithm built on the observation that the two least likely items will have the same length and can thus be considered siblings in the coding tree. A reduction can thus be made in which the two items of weights w(i) and w(j) can be considered as one with combined weight w(i) + w(j), and the codeword of the combined item determines all but the last bit of each of the items combined, which are differentiated by this last bit. This reduction continues until there is one item left, and, assigning this item the null string, a code is defined for all input items. In the corresponding optimal code tree, the ith leaf corresponds to the codeword of the ith input item, and thus has weight w(i), whereas the weight of a parent node is determined by the combined weight of the corresponding merged item. Van Leeuwen gave an implementation of the Huffman algorithm that can be accomplished in linear time given sorted probabilities [3]. Shannon [4] had previously shown that an optimal $l_{\mathrm{opt}}$ must satisfy

$$H(p) \le \sum_{i\in\mathcal{X}} p(i)\,l_{\mathrm{opt}}(i) < H(p) + 1$$

or, equivalently,

$$0 \le \bar{R}(l_{\mathrm{opt}}, p) < 1.$$
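The greedy reduction just described is short to implement. The following is a minimal sketch (not from the paper; the function name and example are illustrative only) that computes optimal codeword lengths with a binary heap and checks Shannon's bounds on a small example.

```python
import heapq
from math import log2

def huffman_lengths(p):
    """Return optimal (Huffman) codeword lengths for probabilities p, n >= 2."""
    # Each heap entry: (weight, list of source symbols merged into this item).
    heap = [(w, [i]) for i, w in enumerate(p)]
    heapq.heapify(heap)
    lengths = [0] * len(p)
    while len(heap) > 1:
        (w1, s1), (w2, s2) = heapq.heappop(heap), heapq.heappop(heap)
        # Merging two items appends one bit to every codeword inside them.
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (w1 + w2, s1 + s2))
    return lengths

p = [0.4, 0.3, 0.2, 0.1]
l = huffman_lengths(p)
avg = sum(pi * li for pi, li in zip(p, l))
H = -sum(pi * log2(pi) for pi in p)
assert H <= avg < H + 1   # Shannon's bounds on the optimal expected length
```

Van Leeuwen's two-queue method mentioned above achieves the same result in linear time when the probabilities are already sorted.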

Less well known is that simple changes to the Huffman algorithm solve several related coding problems which optimize for different objectives. Here we will discuss four such problems, all four of which have been previously shown to satisfy redundancy bounds of the form

$$\tilde{H}(p) \le \tilde{L}(p, l_{\mathrm{opt}}) < \tilde{H}(p) + 1$$

or

$$0 \le \tilde{R}(l_{\mathrm{opt}}, p) < 1$$

for some entropy measure $\tilde{H}$, cost measure $\tilde{L}$, and redundancy measure $\tilde{R}$. We will improve the bounds, given p(1), for three of these measures and discuss the related issue of the length of the most likely codeword in these coding problems. These bounds are the first of their kind for nontraditional Huffman codes, bounds which are functions of both entropy and p(1), as in the traditional case [5]-[10]. However, they are not the first improved bounds for such codes. More sophisticated bounds on the optimal solution for one of these problems were given by Blumer and McEliece [11]; these appear as solutions to related problems rather than in closed form, however. Bounds given elsewhere [12] assumed that p(1) ≥ 0.4 always implies l(1) = 1, which we will show here to not necessarily be the case.

Generalized versions of the Huffman algorithm have been considered by many authors [13]-[16]. These generalizations change the combining rule; instead of replacing items i and j with an item of weight w(i) + w(j), the generalized algorithm replaces them with an item of weight f(w(i), w(j)) for some function f. Thus the weight of a combined item (a node) no longer need equal the sum of the probabilities of the items merged to create it (the sum of the leaves of the corresponding subtree). This has the result that the sum of weights in a reduced problem need not be 1, unlike in the original Huffman algorithm. In particular, the weight of the root, $w_{\mathrm{root}}$, need not be 1. However, we continue to assume that the sum of inputs to the nonreduced versions of coding problems will always be 1.

One such variation of the Huffman algorithm was used in Humblet's 1978 dissertation [17] for a queueing application (and further discussed in [13], [14], [18]). The problem this variation solves is as follows: Given probability mass function p and a > 1, find a code minimizing

$$L_a(p, l) \triangleq \log_a \sum_{i\in\mathcal{X}} p(i)\,a^{l(i)}. \qquad (1)$$


This growing exponential average problem is solved by using combining rule

$$f(w(i), w(j)) = a\,\bigl(w(i) + w(j)\bigr). \qquad (2)$$

This problem was proposed (without solution) in 1965 by Campbell [19]. In 1966, Campbell noted that this formulation can be extended to decaying exponential base a ∈ (0, 1) [20], and Humblet noted that the Huffman combining method (2) finds the optimal code for (1) with a ∈ (0, 1) as well [18]. An application for this decaying exponential variant was given in [21]; in this application, a communication channel has a window of opportunity with a total duration (in bits) distributed geometrically with parameter a. The probability of successful transmission is

$$P[\text{success}] = a^{L_a(p,l)}. \qquad (3)$$

Another extension of the first variation, proposed in 1975 [22] and solved in 1980 [14], can be called dth exponential redundancy [23], and is the minimization of the following:

$$R_d(l, p) \triangleq \frac{1}{d} \lg \sum_{i\in\mathcal{X}} p(i)^{1+d}\, 2^{d\,l(i)}. \qquad (4)$$

Here we assume that d > 0, although d ∈ (−1, 0) also yields a valid problem. Clearly, this can be solved via reduction to (1) by assigning $a = 2^d$ and using input weights $w(i) = p(i)^{1+d}$. We will find, however, that $R_d$ is more suited than $L_a$ to finding bounds in terms of the most probable item if that item is not very probable.

In 2004, Drmota and Szpankowski [24] proposed a problem which, instead of minimizing average redundancy $\bar{R}(l, p) \triangleq \sum_{i\in\mathcal{X}} p(i)(l(i) + \lg p(i))$, minimizes maximum pointwise redundancy

$$R^*(l, p) \triangleq \max_{i\in\mathcal{X}}\,\bigl(l(i) + \lg p(i)\bigr).$$

This was later noted to be solvable via a variation of Huffman coding [23] derived from that in [25], one for which

$$f(w(i), w(j)) = 2\max\bigl(w(i), w(j)\bigr). \qquad (5)$$
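All of the variants discussed here change only this combining rule, so a single sketch covers them. The code below is illustrative only (names and structure are not the paper's): the parameter `combine` plays the role of f, with `lambda x, y: x + y` recovering ordinary Huffman coding, `lambda x, y: a * (x + y)` giving rule (2), and `lambda x, y: 2 * max(x, y)` giving rule (5).

```python
import heapq

def generalized_huffman_lengths(weights, combine):
    """Huffman-style procedure in which the merged weight is combine(w1, w2).

    Returns (codeword lengths, weight of the root).
    """
    heap = [(w, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    root = heap[0][0] if len(heap) == 1 else None
    while len(heap) > 1:
        (w1, s1), (w2, s2) = heapq.heappop(heap), heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1          # one more edge above every merged leaf
        root = combine(w1, w2)
        heapq.heappush(heap, (root, s1 + s2))
    return lengths, root

p = [0.5, 0.25, 0.125, 0.125]
# Rule (2): growing exponential average, here with a = 2.
l_exp, _ = generalized_huffman_lengths(p, lambda x, y: 2 * (x + y))
# Rule (5): minimum maximum pointwise redundancy; per the paper,
# lg of the root weight equals the maximum pointwise redundancy R*(l, p).
l_max, root = generalized_huffman_lengths(p, lambda x, y: 2 * max(x, y))
```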

In the next section, we find tight exhaustive bounds for the values of optimal R*(l, p) and corresponding l(1) in terms of p(1), then find how we can extend these to exhaustive bounds for optimal R_d(l, p). In Section III, we investigate the behavior of optimal L_a(p, l) and l(1) in terms of p(1).

II. BOUNDS ON THE REDUNDANCY PROBLEMS

It is useful to come up with bounds on the performance of an optimal code, often in terms of the most probable symbol, p(1). In the traditional Huffman problem, such bounds are often referred to as "redundancy bounds" because they involve average redundancy, $\bar{R}(l, p) = \sum_{i\in\mathcal{X}} p(i)l(i) - H(p)$. The simplest bounds for the optimal solution to the minimum maximum pointwise redundancy problem

$$R^*_{\mathrm{opt}}(p) \triangleq \min_{l\in L_n} \max_{i\in\mathcal{X}}\,\bigl(l(i) + \lg p(i)\bigr)$$

are quite similar to and can be combined with those for the traditional problem:

$$0 \le \bar{R}_{\mathrm{opt}}(p) \le R^*_{\mathrm{opt}}(p) < 1 \qquad (6)$$

where $\bar{R}_{\mathrm{opt}}(p)$ is the average redundancy of the average-redundancy-optimal code. The traditional case is a lower bound because the maximum ($R^*(l, p)$) of the values $(l(i) + \lg p(i))$, which average to $\bar{R}(l, p)$, can be no less than the average (a fact that holds for all l and p). The upper bound is found similarly to the linear case; we can note that the Shannon code $l^0_p(i) \triangleq \lceil -\lg p(i) \rceil$ results in $R^*_{\mathrm{opt}}(p) \le R^*(l^0_p, p) = \max_{i\in\mathcal{X}}(\lceil -\lg p(i) \rceil + \lg p(i)) < 1$.

A few observations can be used to find a series of improved lower and upper bounds on optimum maximum pointwise redundancy based on (6):
1) In a Huffman-like tree for a maximum pointwise redundancy code, the weight of the root $w_{\mathrm{root}}$ determines the maximum pointwise redundancy, $R^*(l, p) = \lg w_{\mathrm{root}}$.
2) The total probability of any subtree is no greater than the weight of the subtree. This can be inductively observed.


3) In the Huffman-like coding, items are merged by nondecreasing weight. This can be observed by noting that any new merged item has weight greater than either of its merged items. In fact, any new merged item has weight at least twice as great as either of the merged items, due to (5).

A fourth observation is in the form of the following lemma:

Lemma 1: Given a probability mass function p for n = |X| items, if p(1) ≤ 2p(n−1), then a minimum maximum pointwise redundancy code can be represented by a complete tree, that is, a tree such that $\sum_{i\in\mathcal{X}} 2^{-l(i)} = 1$ and $|l(i) - l(j)| \le 1$ for all i, j ∈ X.

Proof: A code with minimum maximum pointwise redundancy is always obtained by using a Huffman-style algorithm combining the items with the smallest weights, w′ and w′′, yielding a new item of weight w = 2 max(w′, w′′), this process being repeated on the new set of weights, the tree thus being constructed up from the leaves to the root. Since any tree constructed in this manner satisfies $\sum_{i\in\mathcal{X}} 2^{-l(i)} = 1$, it remains only to show that $|l(i) - l(j)| \le 1$. Consider the tree formed by the application of this algorithm. Since the first (and thus least weighted) combined item is of weight 2p(n−1), clearly no combined item need be merged with another item until the point at which item 1 is merged or thereafter. The algorithm can, in this case, be seen as pairing off items in the order of a queue sorted from least weighted to most weighted and placing the paired-off items in the rear of the queue. Because items are processed with increasing weight, this processing occurs in queue order, and thus, at any given point, every item is processed about the same number of times as any other; the difference can only be one. This is true when the algorithm terminates, and codeword length is equal to the number of times an item is (by itself or as part of a combined item) processed. Thus $|l(i) - l(j)| \le 1$ for all i, j ∈ X, and the complete code tree is optimal.

We can now present the improved redundancy bounds.

Theorem 1: For any distribution in which p(1) ≥ 2/3, $R^*_{\mathrm{opt}}(p) = 1 + \lg p(1)$. If p(1) ∈ [0.5, 2/3), then $R^*_{\mathrm{opt}}(p) \in [1 + \lg p(1),\ 2 + \lg(1 - p(1)))$ and these bounds are tight. Define $\lambda \triangleq \lceil -\lg p(1) \rceil$, which, for p(1) ∈ (0, 0.5), is greater than 1. For this range the following bounds for $R^*_{\mathrm{opt}}(p)$ are tight:

$p(1) \in [1/2^\lambda,\ 1/(2^\lambda - 1))$: $R^*_{\mathrm{opt}}(p) \in \left[\lambda + \lg p(1),\ 1 + \lg\tfrac{1-p(1)}{1-2^{-\lambda}}\right)$

$p(1) \in [1/(2^\lambda - 1),\ 2/(2^\lambda + 1))$: $R^*_{\mathrm{opt}}(p) \in \left[\lg\tfrac{1-p(1)}{1-2^{-\lambda+1}},\ 1 + \lg\tfrac{1-p(1)}{1-2^{-\lambda}}\right)$

$p(1) \in [2/(2^\lambda + 1),\ 1/2^{\lambda-1})$: $R^*_{\mathrm{opt}}(p) \in \left[\lg\tfrac{1-p(1)}{1-2^{-\lambda+1}},\ \lambda + \lg p(1)\right]$
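Before turning to the proof, note that the intervals above are straightforward to evaluate. The following sketch (an illustrative helper, not part of the paper) returns the tight lower and upper bounds on $R^*_{\mathrm{opt}}(p)$ as a function of p(1), following the theorem's cases.

```python
from math import ceil, log2

def minimax_redundancy_bounds(p1):
    """Tight lower/upper bounds of Theorem 1 on R*_opt(p), given only p(1)."""
    if p1 >= 2/3:
        r = 1 + log2(p1)
        return r, r                          # value is fully determined
    if p1 >= 0.5:
        return 1 + log2(p1), 2 + log2(1 - p1)
    lam = ceil(-log2(p1))                    # lambda = ceil(-lg p(1)) > 1
    if p1 >= 2 / (2**lam + 1):               # [2/(2^lam+1), 1/2^(lam-1))
        return log2((1 - p1) / (1 - 2**(-lam + 1))), lam + log2(p1)
    if p1 >= 1 / (2**lam - 1):               # [1/(2^lam-1), 2/(2^lam+1))
        return (log2((1 - p1) / (1 - 2**(-lam + 1))),
                1 + log2((1 - p1) / (1 - 2**(-lam))))
    # [1/2^lam, 1/(2^lam-1))
    return lam + log2(p1), 1 + log2((1 - p1) / (1 - 2**(-lam)))

print(minimax_redundancy_bounds(0.3))   # p(1) = 0.3 falls in a lambda = 2 range
```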



Proof: The key here is generalizing the simple bounds of (6).

Upper bound: Let us define what we call a first-order Shannon code:

$$l^1_p(i) = \begin{cases} \lambda \triangleq \lceil -\lg p(1) \rceil, & i = 1 \\ \left\lceil -\lg\left( p(i)\,\dfrac{1-2^{-\lambda}}{1-p(1)} \right) \right\rceil, & i \in \{2, 3, \ldots, n\}. \end{cases}$$

This code improves upon the original "zero-order" Shannon code $l^0_p$ by taking the length of the first codeword into account when designing the rest of the code. The code satisfies the Kraft inequality, and thus, as a valid code, its redundancy is an upper bound on the redundancy of an optimal code. Note that

$$\max_{i>1}\left(l^1_p(i) + \lg p(i)\right) = \max_{i>1}\left(\left\lceil \lg\frac{1-p(1)}{p(i)(1-2^{-\lambda})} \right\rceil + \lg p(i)\right) < 1 + \lg\frac{1-p(1)}{1-2^{-\lambda}}.$$

If p(1) ∈ [2/(2^λ + 1), 1/2^{λ−1}), the maximum pointwise redundancy of the first item is no less than $1 + \lg((1-p(1))/(1-2^{-\lambda}))$, and thus $R^*_{\mathrm{opt}}(p) \le R^*(l^1_p, p) = \lambda + \lg p(1)$. Otherwise, $R^*_{\mathrm{opt}}(p) \le R^*(l^1_p, p) < 1 + \lg((1-p(1))/(1-2^{-\lambda}))$. The tightness of the upper bound in [0.5, 1) is shown via

$$p = (p(1),\ 1 - p(1) - \epsilon,\ \epsilon)$$


for which the bound is achieved in [2/3, 1) for any ε ∈ (0, (1 − p(1))/2] and approached in [0.5, 2/3) as ε ↓ 0. If λ > 1 and p(1) ∈ [2/(2^λ + 1), 1/2^{λ−1}), use probability mass function

$$p = \Bigl(p(1),\ \underbrace{\tfrac{1-p(1)-\epsilon}{2^\lambda-2},\ \ldots,\ \tfrac{1-p(1)-\epsilon}{2^\lambda-2}}_{2^\lambda-2},\ \epsilon\Bigr)$$

where $\epsilon \in (0,\ 1 - p(1)2^{\lambda-1})$. Because p(1) ≥ 2/(2^λ + 1), we have $1 - p(1)2^{\lambda-1} \le (1 - p(1) - \epsilon)/(2^\lambda - 2)$, so that p(n−1) ≥ p(n). Similarly, p(1) < 1/2^{λ−1} assures that p(1) ≥ p(2), so the probability mass function is monotonic. Since 2p(n − 1) > p(1), by Lemma 1, an optimal code for this probability mass function is l(i) = λ for all i, achieving R*(l, p) = λ + lg p(1), with item 1 having the maximum pointwise redundancy. This leaves only p(1) ∈ [1/2^λ, 2/(2^λ + 1)), for which we consider

$$p = \Bigl(p(1),\ \underbrace{\tfrac{1-p(1)-\epsilon}{2^\lambda-1},\ \ldots,\ \tfrac{1-p(1)-\epsilon}{2^\lambda-1}}_{2^\lambda-1},\ \epsilon\Bigr)$$

where ε ↓ 0. This is a monotonic probability mass function for sufficiently small ε, for which we also have p(1) < 2p(n − 1), so (again from Lemma 1) this results in an optimal code where l(i) = λ for i ∈ {1, 2, . . . , n − 2} and l(n − 1) = l(n) = λ + 1, and thus the bound is approached with item n − 1 having the maximum pointwise redundancy.

Lower bound: Consider all optimal codes with l(1) = µ for some fixed µ ∈ {1, 2, . . .}. If p(1) ≥ 2^{−µ}, then R*(l, p) ≥ l(1) + lg p(1) = µ + lg p(1). If p(1) < 2^{−µ}, consider the weights at level µ (i.e., µ edges below the root). One of these weights is p(1), while the rest are known to sum to a number no less than 1 − p(1). Thus at least one weight must be at least (1 − p(1))/(2^µ − 1), and R*(l, p) ≥ µ + lg((1 − p(1))/(2^µ − 1)). Thus,

$$R^*_{\mathrm{opt}}(p) \ge \mu + \lg\max\left(p(1),\ \frac{1-p(1)}{2^\mu-1}\right)$$

for l(1) = µ, and, since µ can be any positive integer,

$$R^*_{\mathrm{opt}}(p) \ge \min_{\mu\in\{1,2,3,\ldots\}}\left(\mu + \lg\max\left(p(1),\ \frac{1-p(1)}{2^\mu-1}\right)\right)$$

which is equivalent to the bounds provided. For p(1) ∈ [1/(2^{µ+1} − 1), 1/2^µ) for some µ, consider

$$p = \Bigl(p(1),\ \underbrace{\tfrac{1-p(1)}{2^{\mu+1}-2},\ \ldots,\ \tfrac{1-p(1)}{2^{\mu+1}-2}}_{2^{\mu+1}-2}\Bigr).$$

By Lemma 1, this will have a complete coding tree and thus achieve the lower bound for this range (λ = µ + 1). Similarly,

$$p = \Bigl(p(1),\ \underbrace{2^{-\mu-1},\ \ldots,\ 2^{-\mu-1}}_{2^{\mu+1}-2},\ 2^{-\mu} - p(1)\Bigr)$$

has a fixed-length optimal coding tree for p(1) ∈ [1/2^{µ+1}, 1/(2^{µ+1} − 1)), achieving the lower bound for this range (again λ = µ + 1).

Note that the bounds of (6) are identical to the tight bounds at powers of two. In addition, the tight bounds clearly approach 0 and 1 as p(1) ↓ 0.

Fig. 1. Tight bounds on minimum maximum pointwise redundancy, including achievable upper bounds (solid), approachable upper bounds (dashed), achievable lower bounds (dotted), and fully determined values for p(1) ≥ 2/3 (dot-dashed).

This behavior is in stark contrast with the traditional linear penalty, for which the bounds get closer, not further apart, due to Gallager's redundancy bound [5] ($\bar{R}_{\mathrm{opt}}(p) \le p(1) + 0.086$), which cannot be significantly improved for small p(1) [10]. Moreover, approaching 1, the upper and lower bounds on traditional coding converge but never merge, whereas the minimum maximum redundancy bounds are identical for p(1) ≥ 2/3.

In addition to finding redundancy bounds in terms of p(1), it is also often useful to find bounds on the behavior of l(1) in terms of p(1).

Theorem 2: Any optimal code for probability mass function p, where p(1) ≥ 2^{−ν}, must have l(1) ≤ ν. This bound is tight, in the sense that, for p(1) < 2^{−ν}, one can always find a probability mass function with l(1) > ν. Conversely, if p(1) ≤ 1/(2^ν − 1), there is an optimal code with l(1) ≥ ν, and this bound is also tight.

Proof: Suppose p(1) ≥ 2^{−ν} and l(1) ≥ 1 + ν. Then $R^*_{\mathrm{opt}}(p) = R^*(l, p) \ge l(1) + \lg p(1) \ge 1$, contradicting the simple bounds of (6). Thus l(1) ≤ ν. For tightness of the bound, suppose p(1) ∈ (2^{−ν−1}, 2^{−ν}) and consider $n = 2^{\nu+1}$ and

$$p = \Bigl(p(1),\ \underbrace{2^{-\nu-1},\ \ldots,\ 2^{-\nu-1}}_{n-2},\ 2^{-\nu} - p(1)\Bigr).$$

If l(1) ≤ ν, then, by the Kraft inequality, one of l(2) through l(n − 1) must exceed ν + 1. However, this contradicts the simple bounds of (6). For p(1) = 2^{−ν−1}, a uniform distribution results in l(1) = ν + 1. Thus, since these two results hold for any ν, this extends to all p(1) < 2^{−ν−1}, and this bound is tight.

Suppose p(1) ≤ 1/(2^ν − 1) and consider an optimal length distribution with l(1) < ν. Consider the weights of the nodes of the corresponding code tree at level l(1). One of these weights is p(1), while the rest are known to sum to a number no less than 1 − p(1). Thus there is one node of weight at least

$$\frac{1-p(1)}{2^{l(1)}-1} \ge \frac{1-p(1)}{2^{l(1)} - 2^{l(1)+1-\nu}}$$

and thus, taking the logarithm and adding l(1) to the right-hand side,

$$R^*(l, p) \ge \nu - 1 + \lg\frac{1-p(1)}{2^{\nu-1}-1}.$$

Note that $l(1) + 1 + \lg p(1) \le \nu + \lg p(1) \le \nu - 1 + \lg((1-p(1))/(2^{\nu-1}-1))$, a direct consequence of p(1) ≤ 1/(2^ν − 1). Thus, if we replace this code with one for which l(1) = ν, the code is still optimal. The tightness of the bound is easily seen by applying Lemma 1 to distributions of the form

$$p = \Bigl(p(1),\ \underbrace{\tfrac{1-p(1)}{2^\nu-2},\ \ldots,\ \tfrac{1-p(1)}{2^\nu-2}}_{2^\nu-2}\Bigr)$$

for p(1) ∈ (1/(2^ν − 1), 1/2^{ν−1}). This results in l(1) = ν − 1 and thus $R^*_{\mathrm{opt}}(p) = \nu + \lg(1-p(1)) - \lg(2^\nu - 2)$, which no code with l(1) > ν − 1 could achieve. In particular, if p(1) ≥ 0.5, then l(1) = 1, while if p(1) ≤ 1/3, there is an optimal code with l(1) > 1.

Fig. 2. Bounds on dth exponential redundancy, valid for any d. Upper bounds dashed, lower bounds dotted.

We now briefly address the dth exponential redundancy problem. Recall that this is the minimization of

$$R_d(p, l) \triangleq \frac{1}{d} \lg \sum_{i\in\mathcal{X}} p(i)^{1+d}\, 2^{d\,l(i)}$$

which can be rewritten as

$$R_d(p, l) = \frac{1}{d} \lg \sum_{i\in\mathcal{X}} p(i)\, 2^{d(l(i) + \lg p(i))}.$$

A straightforward application of Lyapunov's inequality for moments yields $R_c(p, l) \le R_d(p, l)$ for c ≤ d, which, taking limits to 0 and ∞, results in

$$0 \le \bar{R}(p, l) \le R_d(p, l) \le R^*(p, l) < 1$$

for any valid p, d > 0, and l, resulting in an extension of (6),

$$0 \le \bar{R}_{\mathrm{opt}}(p) \le R^d_{\mathrm{opt}}(p) \le R^*_{\mathrm{opt}}(p) < 1$$

where $R^d_{\mathrm{opt}}(p)$ is the optimal dth exponential redundancy, an improvement on the bounds found in [23]. This implies that this problem can be bounded in terms of the most likely symbol using the upper bounds of Theorem 1 and the lower bounds of traditional Huffman coding [8]:

$$\bar{R}_{\mathrm{opt}} \ge \xi - (1 - p(1))\lg(2^\xi - 1) - H(p(1),\ 1 - p(1))$$

where

$$\xi = \left\lceil \lg \frac{1 - 2^{\frac{1}{p(1)-1}}}{1 - 2^{\frac{p(1)}{p(1)-1}}} \right\rceil$$

for p(1) ∈ (0, 1) (and, recall, $H(x) = -\sum_i x(i)\lg x(i)$). This result is illustrated in Fig. 2.
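The two curves of Fig. 2 are easy to compute from the expressions above. The sketch below uses hypothetical helper names (not from the paper): one function evaluates the traditional-Huffman lower bound of [8], the other the upper bound carried over from Theorem 1; by the chain of inequalities above, both bound $R^d_{\mathrm{opt}}(p)$ for any d > 0.

```python
from math import ceil, log2

def huffman_lower_bound(p1):
    """Lower bound on optimal average redundancy given p(1) in (0, 1), per [8]."""
    xi = ceil(log2((1 - 2**(1/(p1 - 1))) / (1 - 2**(p1/(p1 - 1)))))
    H2 = -p1 * log2(p1) - (1 - p1) * log2(1 - p1)   # binary entropy H(p(1), 1-p(1))
    return xi - (1 - p1) * log2(2**xi - 1) - H2

def minimax_upper_bound(p1):
    """Upper bound on R*_opt(p) from Theorem 1 (so also an upper bound on R^d_opt(p))."""
    if p1 >= 2/3:
        return 1 + log2(p1)
    if p1 >= 0.5:
        return 2 + log2(1 - p1)
    lam = ceil(-log2(p1))
    if p1 >= 2 / (2**lam + 1):
        return lam + log2(p1)
    return 1 + log2((1 - p1) / (1 - 2**(-lam)))

# Both curves bracket R^d_opt(p) for any d, since R_opt_bar <= R^d_opt <= R*_opt.
for p1 in (0.2, 1/3, 0.5, 0.7):
    print(p1, huffman_lower_bound(p1), minimax_upper_bound(p1))
```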


Although these bounds for $R^d_{\mathrm{opt}}(p)$ are an improvement on the simplest bounds, and although any $L_a(p, l)$ minimization problem can be transformed into an $R_d(p, l)$ minimization problem, the required normalization of the input function p means that bounds for $R_d(p, l)$ in terms of p(1) alone cannot be transformed into bounds for $L_a(p, l)$ in terms of p(1) and entropy alone. It is the latter type of bound that we consider in the next section.

III. BOUNDS ON THE EXPONENTIAL PROBLEMS

While the average, maximum, and dth average redundancy problems yield performance bounds in terms of p(1) alone, it is not clear exactly what "redundancy" means for the exponential problems. For such problems, we simply seek to find any bounds on $L_a(p, l)$ in terms of p(1) and an appropriate entropy measure, as discussed below. Note that a ≤ 0.5 is a trivial case, always solved by a finite unary code,

$$c_{\mathrm{un}} \triangleq \bigl(0,\ 10,\ 110,\ \ldots,\ \underbrace{11\cdots10}_{n-1},\ \underbrace{11\cdots11}_{n-1}\bigr).$$

This can be seen by applying the exponential version of the Huffman algorithm; at each step, the combined weight will be the lowest weight of the reduced problem, being strictly less than the higher of the two combined weights, thus leading to a unary code. For a > 0.5, there is a relationship between this problem and Rényi entropy. Rényi entropy [26] is defined as

$$H_\alpha(p) \triangleq \frac{1}{1-\alpha} \lg \sum_{i=1}^n p(i)^\alpha$$

for α > 0, α ≠ 1. It is often defined for α ∈ {0, 1, ∞} via limits, that is,

$$H_0(p) \triangleq \lim_{\alpha \downarrow 0} H_\alpha(p) = \lg |p|$$

(the logarithm of the cardinality of p),

$$H_1(p) \triangleq \lim_{\alpha \to 1} H_\alpha(p) = -\sum_{i=1}^n p(i)\lg p(i)$$

(the Shannon entropy of p), and

$$H_\infty(p) \triangleq \lim_{\alpha \uparrow \infty} H_\alpha(p) = -\lg p(1)$$

(the min-entropy). Campbell first proposed exponential utility functions for coding in [19], [20]. He observed a simple lower bound for (1) with a > 0.5 in [20]; a simple upper bound has subsequently been shown, e.g., in [27, p. 156] and [11]. These bounds are similar to the minimum average redundancy and minimum maximum pointwise redundancy bounds of (6). In this case, however, the bounds involve Rényi's entropy, not Shannon's. Campbell proved the upper bound for a > 1 beginning with Hölder's inequality [20], but it is much simpler to ignore the integer constraint, $l \in \mathbb{Z}_+^n$, as in, e.g., [11], an approach which more easily carries over to the a ∈ (0.5, 1) case. Using Lagrange multipliers, we wish to minimize

$$J \triangleq \sum_{i\in\mathcal{X}} p(i)\,a^{l(i)} + \lambda\left(\sum_{i\in\mathcal{X}} 2^{-l(i)}\right)$$

if a > 1 and maximize it if a < 1. Differentiating with respect to l(i), we obtain

$$\frac{\partial J}{\partial l(i)} = (\ln a)\,p(i)\,a^{l(i)} - \lambda\,2^{-l(i)}\ln 2.$$

Setting the derivative to 0 and solving for l(i), we find that the optimal $l^\dagger$ should satisfy

$$2^{-l^\dagger(i)} = \left(\frac{p(i)\lg a}{\lambda}\right)^{\frac{1}{1+\lg a}}$$


where λ is the solution of

$$\sum_{j\in\mathcal{X}} \left(\frac{p(j)\lg a}{\lambda}\right)^{\frac{1}{1+\lg a}} = 1,$$

which is

$$\lambda = (\lg a)\left(\sum_{j\in\mathcal{X}} p(j)^{\frac{1}{1+\lg a}}\right)^{1+\lg a}.$$

Defining

$$\alpha(a) \triangleq \frac{1}{1+\lg a} = \frac{1}{\lg 2a}$$

yields

$$l^\dagger(i) = -\alpha(a)\lg p(i) + \lg\left(\sum_{j\in\mathcal{X}} p(j)^{\alpha(a)}\right). \qquad (7)$$

This solution satisfies the Karush-Kuhn-Tucker conditions [28] for a > 0.5, a ≠ 1, and is thus optimal. Then

$$\log_a \sum_{i=1}^n p(i)\,a^{l(i)} \ge H_{\alpha(a)}(p) = \log_a \sum_{i=1}^n p(i)\,a^{l^\dagger(i)}.$$

In finding an upper bound, define $l^\S$ as the integer codeword lengths

$$l^\S(i) \triangleq \lceil l^\dagger(i) \rceil.$$

Then, since $l^\S$ is in $L_n$,

$$\min_{l\in L_n} L_a(p, l) \le L_a(p, l^\S) = \log_a \sum_{i=1}^n p(i)\,a^{l^\S(i)} < \log_a\left(a\sum_{i=1}^n p(i)\,a^{l^\dagger(i)}\right) = H_{\alpha(a)}(p) + 1. \qquad (8)$$

Thus, for a > 0.5, a ≠ 1,

$$0 \le \min_{l\in L_n} L_a(p, l) - H_{\alpha(a)}(p) < 1, \qquad (9)$$

a similar result to the traditional coding bound [4]. As an example of these bounds, consider the probability distribution implied by Benford's law [29], [30]:

$$p(i) = \log_{10}(i+1) - \log_{10}(i), \quad i = 1, 2, \ldots, 9, \qquad (10)$$

that is, p ≈ (0.30, 0.17, 0.12, 0.10, 0.08, 0.07, 0.06, 0.05, 0.05).

At a = 0.9, for example, $H_{\alpha(a)}(p) \approx 2.822$, so the optimal code cost is between 2.822 and 3.822. In the application given in [21] with (3), this corresponds to an optimal solution with probability of success (codeword transmission) between 0.668 and 0.743. Running the algorithm, the optimal lengths are l = (2, 2, 3, 3, 4, 4, 4, 5, 5), resulting in cost 2.866 (probability of success 0.739). Note that the optimal cost in this example is quite close to entropy, indicating that better upper bounds might be possible.

In looking for a better bound, recall first that, as with the exponential Huffman algorithm, (9) applies for both a ∈ (0.5, 1) and a > 1. Improved bounds on the optimal solution for the a > 1 case were given by Blumer and McEliece [11]; these appear as solutions to related problems rather than in closed form, however. Taneja [12] gave closed-form bounds for cases in which p(1) ≥ 0.4, which, if traditional Huffman coding were used, would imply an optimal code with l(1) = 1. However, these bounds were flawed in that they did not take into account a difference between traditional Huffman coding and the generalized exponential case, one we elaborate on in this section. In addition, these bounds are in terms of entropy of degree α, which is different from Rényi's α-entropy.
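The a = 0.9 example above can be reproduced with a few lines built on combining rule (2); the sketch and its helper names are illustrative, not the paper's code.

```python
import heapq
from math import log, log2, log10

def renyi_entropy(p, alpha):
    """Rényi entropy H_alpha(p) in bits, for alpha > 0, alpha != 1."""
    return log2(sum(pi**alpha for pi in p)) / (1 - alpha)

def exp_huffman_lengths(p, a):
    """Exponential Huffman coding: the merged item gets weight a*(w1 + w2)."""
    heap = [(w, [i]) for i, w in enumerate(p)]
    heapq.heapify(heap)
    lengths = [0] * len(p)
    while len(heap) > 1:
        (w1, s1), (w2, s2) = heapq.heappop(heap), heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (a * (w1 + w2), s1 + s2))
    return lengths

a = 0.9
p = [log10(i + 1) - log10(i) for i in range(1, 10)]    # Benford's law, eq. (10)
alpha = 1 / (1 + log2(a))                               # alpha(a) from (7)
l = exp_huffman_lengths(p, a)
cost = log(sum(pi * a**li for pi, li in zip(p, l)), a)  # L_a(p, l)
print(renyi_entropy(p, alpha), l, cost, a**cost)        # ~2.82, (2,2,3,3,4,4,4,5,5), ~2.87, ~0.74
```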

Fig. 3. Minimum p(1) sufficient for the existence of an optimal l(1) not exceeding 1.

Bounds are more difficult to come by in the exponential generalizations because Rényi entropy's very definition [26] involves a relaxation of a property used in finding bounds such as Gallager's entropy bounds [5], namely

$$H_1[t\,p(1),\ (1-t)\,p(1),\ p(2),\ \ldots,\ p(n)] = H_1[p(1),\ p(2),\ \ldots,\ p(n)] + p(1)\,H_1(t,\ 1-t)$$

for Shannon entropy $H_1$ and t ∈ [0, 1]. This fails to hold for Rényi entropy. The penalty function $L_a$ differs from the usual measure of expectation in an analogous fashion, and we cannot know the weight of a given subtree in the optimal code (merged item in the coding procedure) simply by knowing the sum probability of the items included. However, we can find improved bounds for the exponential problem when we know that l(1) = 1:

Theorem 3: There exists an optimal code with l(1) = 1 for a and p if either a ≤ 0.5 or both a ∈ (0.5, 1] and p(1) ≥ 2a/(2a + 3). Conversely, given a ∈ (0.5, 1] and p(1) < 2a/(2a + 3), there exists a p such that any code with l(1) = 1 is suboptimal. Likewise, given a > 1 and p(1) < 1, there exists a p such that any code with l(1) = 1 is suboptimal.

Proof: Recall that the exponential Huffman algorithm combines the items with the smallest weights, w′ and w′′, yielding a new item of weight w = a(w′ + w′′), and this process is repeated on the new set of weights, the tree thus being constructed up from the leaves to the root. This process makes it clear that, as mentioned, the finite unary code (with l(1) = 1) is optimal for all a ≤ 0.5. This leaves the two nontrivial cases.

Case 1 (a ∈ (0.5, 1]): This is a generalization of [6] and is only slightly more complex to prove. Consider the coding step at which item 1 gets combined with other items; we wish to prove that this is the last step. At the beginning of this step the (possibly merged) items left to combine are $\{1\}, S^k_2, S^k_3, \ldots, S^k_k$, where we use $S^k_j$ to denote both a (possibly merged) item of weight $w(S^k_j)$ and the set of (individual) items combined in item $S^k_j$. Since {1} is combined in this step, all but one $S^k_j$ has at least weight p(1). Note too that all weights $w(S^k_j)$ must be less than or equal to the sums of probabilities $\sum_{i\in S^k_j} p(i)$. Then

$$\frac{2a(k-1)}{2a+3} \le (k-1)\,p(1) < p(1) + \sum_{j=2}^k w(S^k_j) \le p(1) + \sum_{j=2}^k \sum_{i\in S^k_j} p(i) = \sum_{i=1}^n p(i) = 1,$$

which, since a > 0.5, means that k < 5. Thus, because n < 4 always has an optimal code with l(1) = 1, we can consider the steps in exponential Huffman coding at and after which four items remain, one of which is item {1} and the others of which are $S^4_2$, $S^4_3$, and $S^4_4$. We show that, if p(1) ≥ 2a/(2a + 3), these items are combined as shown in Fig. 4:


Fig. 4. Tree in last steps of the exponential Huffman algorithm. (The root X has children {1} and $S^2_2$; $S^2_2$ has children $S^4_2$ and $S^4_3 \cup S^4_4$, where $w(S^4_3 \cup S^4_4) = a[w(S^4_3) + w(S^4_4)]$ and, if $|S^4_2| > 1$, $w(S^4_2) = a[w(S'_2) + w(S''_2)]$ with children $S'_2$ and $S''_2$.)

We assume without loss of generality that the weights $w(S^4_2)$, $w(S^4_3)$, and $w(S^4_4)$ are in descending order. From

$$w(S^4_2) + w(S^4_3) + w(S^4_4) \le \sum_{i=2}^n p(i) \le \frac{3}{2a+3},$$

$w(S^4_2) \ge w(S^4_3)$, and $w(S^4_2) \ge w(S^4_4)$, it follows that $w(S^4_3) + w(S^4_4) \le 2/(2a+3)$. Consider set $S^4_2$. If its cardinality is 1, then $w(S^4_2) \le p(1)$, so the next step merges the least two weighted items, $S^4_3$ and $S^4_4$. Since the merged item has weight at most 2a/(2a + 3), this item can then be combined with $S^4_2$, then {1}, so that l(1) = 1. If $S^4_2$ is a merged item, let us call the two items (sets) that merged to form it $S'_2$ and $S''_2$, indicated by the dashed nodes in Fig. 4. Because these were combined prior to this step, $w(S'_2) + w(S''_2) \le w(S^4_3) + w(S^4_4)$, so

$$w(S^4_2) \le a\left[w(S^4_3) + w(S^4_4)\right] \le \frac{2a}{2a+3}.$$

Thus $w(S^4_2)$, and by extension $w(S^4_3)$ and $w(S^4_4)$, are at most p(1). So $S^4_3$ and $S^4_4$ can be combined, and this merged item can be combined with $S^4_2$, then {1}, again resulting in l(1) = 1. This can be shown to be tight by noting that, for any $\epsilon \in (0, (2a-1)/(8a+12))$,

$$p_\epsilon \triangleq \left(\tfrac{2a}{2a+3} - 3\epsilon,\ \tfrac{1}{2a+3} + \epsilon,\ \tfrac{1}{2a+3} + \epsilon,\ \tfrac{1}{2a+3} + \epsilon\right)$$

achieves optimality only with length vector l = (2, 2, 2, 2). The result extends to smaller p(1).
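As a quick numerical check of this tightness example (an illustrative sketch; `exp_cost` is a hypothetical helper), one can compare the cost of the fixed-length code against the best Kraft-valid length vector with l(1) = 1:

```python
from math import log

def exp_cost(p, l, a):
    """L_a(p, l) = log_a sum_i p(i) a^l(i); the quantity minimized in (1)."""
    return log(sum(pi * a**li for pi, li in zip(p, l)), a)

a, eps = 0.8, 0.01                        # any a in (0.5, 1] and small enough eps
p = [2*a/(2*a + 3) - 3*eps] + [1/(2*a + 3) + eps] * 3   # the distribution p_eps above
fixed = exp_cost(p, [2, 2, 2, 2], a)      # complete tree, so l(1) = 2
short = exp_cost(p, [1, 2, 3, 3], a)      # best lengths with l(1) = 1
print(fixed, short, fixed < short)        # the fixed-length code should win, so l(1) > 1
```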

Case 2 (a > 1): Given a > 1 and p(1) < 1, we wish to show that a probability distribution always exists such that there is no optimal code with l(1) = 1. We first show that, for the exponential penalties as for the traditional Huffman penalty, every optimal l can be obtained via the (modified) Huffman procedure. That is, if multiple length vectors are optimal, each optimal length vector can be obtained by the Huffman procedure as long as ties are broken in a certain manner. Clearly the optimal code is obtained for n = 2. Let n′ be the smallest n for which there is an l that is optimal but cannot be obtained via the algorithm. Since l is optimal, consider the two smallest probabilities, p(n′ ) and p(n′ − 1). In this optimal code, two items having these probabilities (although not necessarily items n′ − 1 and n′ ) must have the longest codewords and must have the same codeword lengths. Were the latter not the case, the


codeword of the more probable item could be exchanged with one of a less probable item, resulting in a better code. Were the former not the case, the longest codeword length could be decreased by one without violating the Kraft inequality, resulting in a better code. Either way, the code would no longer be optimal. So clearly we can find two such smallest items with largest codewords (by breaking any ties properly), which, without loss of generality, can be considered siblings. This means that the problem can be reduced to one of size n′ − 1 via the exponential Huffman algorithm. But since all problems of size n′ − 1 can be solved via the algorithm, this is a contradiction, and the Huffman algorithm can thus find any optimal code. Note that this is not true for minimizing maximum pointwise redundancy, as the exchange argument no longer holds. This is why the sufficient condition of the previous section was not verified using Huffman-like methods.

Now we can show that there is always a probability mass function whose optimal codes have l(1) > 1, for any p(1) ∈ (0.2, 1); p(1) ≤ 0.2 follows easily. Let

$$m = \left\lceil \log_a \frac{4\,p(1)}{1-p(1)} \right\rceil$$

and suppose $n = 1 + 2^{2+m}$ and p(i) = (1 − p(1))/(n − 1) for all i ∈ {2, 3, . . . , n}. This distribution has an optimal code only with l(1) = 2 (or, if m is equal to the logarithm from which it is derived, only with l(1) = 2 and with l(1) = 3), since, although item 1 need not be merged before the penultimate step, at this step its weight is strictly less than either of the two other remaining weights, each of which has value $w' = a^{1+m}(1-p(1))/2$. Thus, knowing merely the values of a > 1 and p(1) < 1 is not sufficient to ensure that l(1) = 1. These relations are illustrated in Fig. 3, a plot of the minimum value of p(1) sufficient for the existence of an optimal code with l(1) not exceeding 1.

Similarly to minimum maximum pointwise redundancy, we can observe that, for a ≥ 1 (that is, a > 1 and traditional Huffman coding), a necessary condition for l(1) = 1 is p(1) ≥ 1/3: the sum of the last three combined weights is at least 1, and p(1) must be no less than the other two. However, for a < 1, there is no such necessary condition. Given a ∈ (0.5, 1) and p(1) ∈ (0, 1), consider the probability distribution consisting of one item with probability p(1) and $2^{1+g}$ items with equal probability (so that $n = 1 + 2^{1+g}$), where

$$g = \max\left(\left\lceil \log_a \frac{2a\,p(1)}{1-p(1)} \right\rceil,\ \left\lceil \lg \frac{1-2p(1)}{p(1)} \right\rceil,\ 0\right)$$

and, by convention, we define the logarithm of negative numbers to be −∞. Setting p(i) = (1 − p(1))/(n − 1) for all i ∈ {2, 3, . . . , n} results in a monotonic probability mass function in which $(1-p(1))a^g/2 < p(1)$, which means that the generalized Huffman algorithm will have in its penultimate step three items: one of weight p(1) and two of weight $(1-p(1))a^g/2$; these two will be complete subtrees with each leaf at depth g. Since $(1-p(1))a^g/2 < p(1)$, l(1) = 1. Again, this holds for any a ∈ (0.5, 1) and p(1) ∈ (0, 1), so no nontrivial necessary condition exists for l(1) = 1. This is also the case for a ≤ 0.5, since the unary code is optimal for any probability mass function.

Upper entropy bounds derived from Theorem 3, although rather complicated, are an improvement on the previously known bounds of (8):

Corollary 1: For l(1) = 1 (and thus for all p(1) ≥ 2a/(2a + 3)) and a ∈ (0.5, 1), the following holds:

$$\sum_{i=1}^n p(i)\,a^{l(i)} > a^2\left[a^{\alpha(a)H_{\alpha(a)}(p)} - p(1)^{\alpha(a)}\right]^{\frac{1}{\alpha(a)}} + a\,p(1)$$

or, equivalently,

$$L_a(p) < \log_a\left(a^2\left[a^{\alpha(a)H_{\alpha(a)}(p)} - p(1)^{\alpha(a)}\right]^{\frac{1}{\alpha(a)}} + a\,p(1)\right).$$

Note that this upper bound is tight for p(1) ≥ 0.5, as p = (p(1), 1 − p(1) − ε, ε) gets arbitrarily close for small ε.

Proof: This is a simple application of the coding bounds to the subtree including all items but item {1}. Let α = α(a) and B = {2, 3, . . . , n}, with Rényi α-entropy

$$H_\alpha(p_B) = \frac{1}{1-\alpha} \lg \sum_{i=2}^n \left(\frac{p(i)}{1-p(1)}\right)^\alpha.$$


$H_\alpha(p_B)$ is related to the entropy of the original source p by

$$2^{(1-\alpha)H_\alpha(p)} = (1-p(1))^\alpha\, 2^{(1-\alpha)H_\alpha(p_B)} + p(1)^\alpha$$

or, equivalently, since $2^{1-\alpha} = a^\alpha$,

$$a^{H_\alpha(p_B)} = \frac{1}{1-p(1)}\left[a^{\alpha H_\alpha(p)} - p(1)^\alpha\right]^{\frac{1}{\alpha}}.$$

Applying (8) to subtree B, we have

$$\frac{1}{(1-p(1))\,a}\sum_{i=2}^n p(i)\,a^{l(i)} > a^{H_\alpha(p_B)+1}.$$

The bound for $\sum_i p(i)\,a^{l(i)}$ is obtained by multiplying by $a(1-p(1))$ and adding the contribution of item {1}, $a\,p(1)$.

Let us apply this result to the Benford distribution in (10) for a = 0.6. In this case, $H_{\alpha(a)}(p) \approx 2.260$ and p(1) > 2a/(2a + 3), so l(1) = 1 and the probability of success is between 0.251 and $0.315 = a^{H_{\alpha(a)}(p)}$; that is, $L_a^{\mathrm{opt}} \in [2.259\ldots,\ 2.707\ldots)$. The simpler (inferior) lower probability (upper entropy) bound in (9) is probability 0.189 ($L_a^{\mathrm{opt}} < 3.259\ldots$). The optimal code is l = (1, 2, 3, 4, 5, 6, 7, 8, 8), which yields a probability of success of 0.296 ($L_a^{\mathrm{opt}} = 2.382\ldots$). This could likely be improved by considering the conditions for other values of l(1) and/or other lengths (as with the use of lengths in [31] to come up with general bounds [9], [10]), but the expression involved would be far more complex than even this.

Note that these arguments fail for lower entropy bounds. In fact, knowing $l(1) = \min_i l(i)$ does not help for this lower bound as it does in [6], since, according to (7), p(1) does not uniquely determine l(1) even in the "ideal" case. ($L_a(p, l) = H_{\alpha(a)}(p)$ if and only if $l = l^\dagger$, but a lower bound on how much $L_a$ and $H_{\alpha(a)}$ differ is not known.) Because of this, it is likely that, for all a > 0 except for a = 1 [8], given only p(1), a, and $H_{\alpha(a)}(p)$, the tightest lower bound for the penalty function is the previously known bound $H_{\alpha(a)}(p)$. Proving this would be a worthwhile contribution.

ACKNOWLEDGMENT

The author would like to thank J. David Morgenthaler for discussions on this topic.

REFERENCES

[1] B. McMillan, "Two inequalities implied by unique decipherability," IRE Trans. Inf. Theory, vol. IT-2, no. 4, pp. 115–116, Dec. 1956.
[2] D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proc. IRE, vol. 40, no. 9, pp. 1098–1101, Sept. 1952.
[3] J. van Leeuwen, "On the construction of Huffman trees," in Proc. 3rd Int. Colloquium on Automata, Languages, and Programming, July 1976, pp. 382–410.
[4] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, pp. 379–423, July 1948.
[5] R. G. Gallager, "Variations on a theme by Huffman," IEEE Trans. Inf. Theory, vol. IT-24, no. 6, pp. 668–674, Nov. 1978.
[6] O. Johnsen, "On the redundancy of binary Huffman codes," IEEE Trans. Inf. Theory, vol. IT-26, no. 2, pp. 220–222, Mar. 1980.
[7] R. M. Capocelli, R. Giancarlo, and I. J. Taneja, "Bounds on the redundancy of Huffman codes," IEEE Trans. Inf. Theory, vol. IT-32, no. 6, pp. 854–857, Nov. 1986.
[8] B. L. Montgomery and J. Abrahams, "On the redundancy of optimal binary prefix-condition codes for finite and infinite sources," IEEE Trans. Inf. Theory, vol. IT-33, no. 1, pp. 156–160, Jan. 1987.
[9] R. M. Capocelli and A. De Santis, "Tight upper bounds on the redundancy of Huffman codes," IEEE Trans. Inf. Theory, vol. IT-35, no. 5, pp. 1084–1091, Sept. 1989.
[10] D. Manstetten, "Tight bounds on the redundancy of Huffman codes," IEEE Trans. Inf. Theory, vol. IT-37, no. 1, pp. 144–151, Jan. 1992.
[11] A. C. Blumer and R. J. McEliece, "The Rényi redundancy of generalized Huffman codes," IEEE Trans. Inf. Theory, vol. IT-34, no. 5, pp. 1242–1249, Sept. 1988.
[12] I. J. Taneja, "A short note on the redundancy of degree α," Inf. Sci., vol. 39, no. 2, pp. 211–216, Sept. 1986.
[13] T. C. Hu, D. J. Kleitman, and J. K. Tamaki, "Binary trees optimum under various criteria," SIAM J. Appl. Math., vol. 37, no. 2, pp. 246–256, Apr. 1979.
[14] D. S. Parker, Jr., "Conditions for optimality of the Huffman algorithm," SIAM J. Comput., vol. 9, no. 3, pp. 470–489, Aug. 1980.
[15] D. E. Knuth, "Huffman's algorithm via algebra," J. Comb. Theory, Ser. A, vol. 32, pp. 216–224, 1982.
[16] C. Chang and J. Thomas, "Huffman algebras for independent random variables," Disc. Event Dynamic Syst., vol. 4, no. 1, pp. 23–40, Feb. 1994.


[17] P. A. Humblet, "Source coding for communication concentrators," Ph.D. dissertation, Massachusetts Institute of Technology, 1978.
[18] ——, "Generalization of Huffman coding to minimize the probability of buffer overflow," IEEE Trans. Inf. Theory, vol. IT-27, no. 2, pp. 230–232, Mar. 1981.
[19] L. L. Campbell, "A coding problem and Rényi's entropy," Inf. Contr., vol. 8, no. 4, pp. 423–429, Aug. 1965.
[20] ——, "Definition of entropy by means of a coding problem," Z. Wahrscheinlichkeitstheorie und verwandte Gebiete, vol. 6, pp. 113–118, 1966.
[21] M. B. Baer, "Integer coding with nonlinear costs," IEEE Trans. Inf. Theory, submitted for publication, preprint available from http://arxiv.org/abs/cs.IT/0511003.
[22] P. Nath, "On a coding theorem connected with Rényi entropy," Inf. Contr., vol. 29, no. 3, pp. 234–242, Nov. 1975.
[23] M. B. Baer, "A general framework for codes involving redundancy minimization," IEEE Trans. Inf. Theory, vol. IT-52, no. 1, pp. 344–349, Jan. 2006.
[24] M. Drmota and W. Szpankowski, "Precise minimax redundancy and regret," IEEE Trans. Inf. Theory, vol. IT-50, no. 11, pp. 2686–2707, Nov. 2004.
[25] M. C. Golumbic, "Combinatorial merging," IEEE Trans. Comput., vol. C-25, no. 11, pp. 1164–1167, Nov. 1976.
[26] A. Rényi, "On measures of entropy and information," in Proc. 4th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 1961, pp. 547–561.
[27] J. Aczél and Z. Daróczy, On Measures of Information and Their Characterizations. New York, NY: Academic, 1975.
[28] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, UK: Cambridge Univ. Press, 2003, available at http://www.stanford.edu/~boyd/cvxbook.html.
[29] S. Newcomb, "Note on the frequency of use of the different digits in natural numbers," Amer. J. Math., vol. 4, no. 1/4, pp. 39–40, 1881.
[30] F. Benford, "The law of anomalous numbers," Proc. Amer. Phil. Soc., vol. 78, no. 4, pp. 551–572, Mar. 1938.
[31] B. L. Montgomery and B. V. K. V. Kumar, "On the average codeword length of optimal binary codes for extended sources," IEEE Trans. Inf. Theory, vol. IT-33, no. 2, pp. 293–296, Mar. 1987.