Codes: Unequal Probabilities, Unequal Letter Costs

DORIS ALTENKAMP AND KURT MEHLHORN
University of Saarlandes, Saarbrücken, Federal Republic of Germany

ABSTRACT. The construction of alphabetic prefix codes with unequal letter costs and unequal probabilities is considered. A variant of the noiseless coding theorem is proved, giving closely matching lower and upper bounds for the cost of the optimal code. An algorithm is described which constructs a nearly optimal code in linear time.

KEY WORDS AND PHRASES: codes, unequal letter costs, unequal probabilities, noiseless coding, prefix codes, approximation algorithm, search trees

CR CATEGORIES: 5.25, 5.39, 5.6
1. Introduction

We study the construction of prefix codes in the case of unequal probabilities and unequal letter costs. The investigation is motivated by and oriented toward the following problem. Consider the ternary search tree in Figure 1. It has three internal nodes and six leaves. The internal nodes contain the keys {3, 4, 5, 10, 12} in sorted order, and the leaves represent the open intervals between keys. The standard strategy to locate X in this tree is best described by the following recursive procedure SEARCH.

proc SEARCH(int X; node v)
  if v is a leaf
  then "X is not in the tree"
  else begin
    let K1, K2 be the keys in node v;
    if X < K1 then SEARCH(X, left son of v)
    else if X = K1 then exit(found)
    else if K2 does not exist then SEARCH(X, right son of v)
    else if X < K2 then SEARCH(X, middle son of v)
    else if X = K2 then exit(found)
    else SEARCH(X, right son of v)
  end
end
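The procedure can be sketched in Python as follows. This is only an illustration: the `Node` class, the `search` function, and the move labels ("left", "K1", "middle", "K2", "right") are hypothetical names introduced here, and the shape of the tree (root [5, 10] with sons [3, 4], a leaf, and [12]) is an assumption inferred from the code words the text assigns to the key 4 and the interval (10, 12).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    keys: List[int]                  # one or two keys in sorted order
    sons: List[Optional["Node"]]     # len(keys) + 1 sons; None marks a leaf


def search(x: int, v: Optional[Node],
           path: Optional[List[str]] = None) -> Tuple[bool, List[str]]:
    """Follow procedure SEARCH; return (found, list of moves taken)."""
    path = [] if path is None else path
    if v is None:                    # v is a leaf: x is not in the tree
        return False, path
    k1 = v.keys[0]
    if x < k1:
        return search(x, v.sons[0], path + ["left"])
    if x == k1:
        return True, path + ["K1"]
    if len(v.keys) == 1:             # K2 does not exist
        return search(x, v.sons[1], path + ["right"])
    k2 = v.keys[1]
    if x < k2:
        return search(x, v.sons[1], path + ["middle"])
    if x == k2:
        return True, path + ["K2"]
    return search(x, v.sons[2], path + ["right"])


# Assumed shape of the example tree: three internal nodes, six leaves.
left = Node([3, 4], [None, None, None])
right = Node([12], [None, None])
root = Node([5, 10], [left, None, right])

print(search(4, root))     # (True, ['left', 'K2'])
print(search(11, root))    # (False, ['right', 'left'])
```

The move lists make the asymmetry visible: reaching K2 or the right son always requires more comparisons inside a node than reaching K1 or the left son.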
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
A preliminary version of this paper was presented at the Fifth International Colloquium on Automata, Languages, and Programming, Udine, Italy, 1978.
Authors' address: University of Saarlandes, 66 Saarbrücken, Fachbereich 10, Federal Republic of Germany 6600.
© 1980 ACM 0004-5411/80/0700-0412 $00.75
Journal of the Association for Computing Machinery, Vol. 27, No. 3, July 1980, pp. 412-427.

[FIGURE 1: ternary search tree on the keys {3, 4, 5, 10, 12}]

Apparently the search strategy is unsymmetric. It is cheaper to follow the pointer to the first subtree than to the second subtree, and it is cheaper to locate K1 than K2. We also assume that the probability of access is given for each key and each interval between keys. More precisely, suppose we have n keys $B_1, \ldots, B_n$ out of an ordered universe, with $B_1 < B_2 < \cdots < B_n$. Then $\beta_i$ denotes the probability of accessing $B_i$, $1 \le i \le n$, and $\alpha_j$ denotes the probability of accessing elements X, with $B_j < X < B_{j+1}$,
$0 \le j \le n$. $\alpha_0$ and $\alpha_n$ have obvious interpretations. In our example n = 5, $\beta_2$ is the probability of accessing 4, and $\alpha_2$ is the probability of accessing $X \in (4, 5)$. We always write the distribution of access probabilities as $\alpha_0, \beta_1, \alpha_1, \ldots, \beta_n, \alpha_n$.

Ternary trees, and in general (t + 1)-ary trees, correspond to prefix codes in a natural way. We are given letters $a_0, a_1, a_2, \ldots, a_{2t}$ of cost $c_0, c_1, c_2, \ldots, c_{2t}$, respectively; $c_l > 0$ for $0 \le l \le 2t$. Here letter $a_{2l}$ corresponds to following the pointer to the (l + 1)st subtree, $0 \le l \le t$, and letter $a_{2l+1}$ corresponds to a successful search terminating in the (l + 1)st key of a node, $0 \le l < t$. In our example, t = 2. The code word corresponding to 4, denoted $W_2$, is $a_0 a_3$. The code word corresponding to (10, 12), denoted $V_4$, is $a_4 a_0$. In general, a search tree is a prefix code

$$C = \{V_0, W_1, V_1, \ldots, W_n, V_n\} \quad \text{with} \quad V_j \in \Sigma^* \quad \text{and} \quad W_i \in \Sigma^* \Sigma_{\mathrm{end}},$$

where $\Sigma = \{a_0, a_2, a_4, \ldots, a_{2t}\}$ and $\Sigma_{\mathrm{end}} = \{a_1, a_3, \ldots, a_{2t-1}\}$, $0 \le j \le n$, $1 \le i \le n$.

[...]

[...] $h > 0$ and $L_h = \{i;\ c \cdot \mathrm{Cost}(U_i) < -\log p_i - h\}$. Then

$$1 \ge Q = \sum_{i=1}^{n} 2^{-c \cdot \mathrm{Cost}(U_i)} \ge \sum_{i \in L_h} 2^{-c \cdot \mathrm{Cost}(U_i)} \ge \sum_{i \in L_h} 2^{\log p_i + h} = 2^h \cdot \sum_{i \in L_h} p_i. \qquad \Box$$
2.2 THE ALPHABETIC CASE. Every alphabetic code $C = \{V_0, W_1, \ldots, W_n, V_n\}$ is a nonalphabetic code, and hence Theorem 1 applies. It shows that

$$\mathrm{Cost}(C) \ge \frac{1}{c}\, H(\alpha_0, \beta_1, \ldots, \beta_n, \alpha_n),$$

where $\sum_{k=0}^{2t} 2^{-c\,c_k} = 1$. In this section we improve upon this lower bound and essentially show that for every alphabetic code C,

$$\mathrm{Cost}(C) \ge \frac{1}{d}\, H(\alpha_0, \beta_1, \ldots, \beta_n, \alpha_n) - u \cdot \max_{i\ \mathrm{odd}} c_i \cdot \ln H(\alpha_0, \beta_1, \ldots, \beta_n, \alpha_n),$$

where $\sum_{k=0}^{t} 2^{-d\,c_{2k}} = 1$ and u is some constant. Note that only the letters in $\Sigma$, and not those in $\Sigma_{\mathrm{end}}$, are used to define d, and hence the new bound is much better for large H.

Example. Consider ternary trees with $c_0 = c_1 = c_2 = c_3 = c_4 = 1$. Then $c = \log 5$ and $d = \log 3$.
The alphabetic case differs from the nonalphabetic case in two respects:
(1) The letters in $\Sigma_{\mathrm{end}}$ can only be used at the end of code words $W_i$ and not at all in words $V_j$.
(2) The lexicographic ordering of code words must reflect the underlying ordering of the keys.

We will use only restriction (1) to improve upon the lower bound. There seems to be no way to incorporate this (combinatorial) restriction into the proof of Theorem 1. Rather, we turn the combinatorial restriction into a constraint on costs by artificially increasing the cost of letters in $\Sigma_{\mathrm{end}}$. Then we use the fact that letters in $\Sigma_{\mathrm{end}}$ are used at most once in words $W_i$ and not at all in words $V_j$ in order to relate the cost of a code under the old and the new cost function. Finally, we apply Theorem 1 to the new cost function.

Let $1 \le x < \infty$ be arbitrary, let

$$d_i = c_i \ \text{for } i \text{ even}, \qquad d_i = x \cdot c_i \ \text{for } i \text{ odd},$$

and let $c(x) \in \mathbb{R}$ be such that $\sum_{i=0}^{2t} 2^{-c(x)\,d_i} = 1$.

Remark. In the new cost function $d_i$, $0 \le i \le 2t$, we increased the cost of letters in $\Sigma_{\mathrm{end}}$ by factor x. For x = 1 the new cost function is identical with the old, and hence c(1) = c; for $x = \infty$ the cost of letters in $\Sigma_{\mathrm{end}}$ is infinite, and hence $c(\infty) = d$.

Let $C = \{V_0, W_1, V_1, \ldots, W_n, V_n\}$ be an alphabetic code for probability distribution $(\alpha_0, \beta_1, \alpha_1, \ldots, \beta_n, \alpha_n)$. In particular, $V_j \in \Sigma^*$ and $W_i \in \Sigma^* \Sigma_{\mathrm{end}}$. Let $\widetilde{\mathrm{Cost}}(C)$ be the cost of C with respect to $d_0, d_1, d_2, \ldots, d_{2t}$, and let $\mathrm{Cost}(C)$ be the cost of C with respect to $c_0, c_1, \ldots, c_{2t}$.
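The behavior of c(x) described in the Remark can be checked numerically. The following is a minimal sketch, assuming logarithms to base 2 (as the defining sums in powers of 2 suggest). The function names `characteristic_root` and `c_of_x` and the use of bisection are illustrative choices made here, not part of the paper.

```python
import math
from typing import Sequence


def characteristic_root(costs: Sequence[float], tol: float = 1e-12) -> float:
    """Return r > 0 with sum(2 ** (-r * c) for c in costs) == 1.

    Assumes at least two letters and strictly positive costs, so the
    sum is strictly decreasing in r and exceeds 1 at r = 0.
    """
    def excess(r: float) -> float:
        return sum(2.0 ** (-r * c) for c in costs) - 1.0

    lo, hi = 0.0, 1.0
    while excess(hi) > 0:            # grow the bracket until the sign changes
        hi *= 2.0
    while hi - lo > tol:             # plain bisection
        mid = (lo + hi) / 2.0
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def c_of_x(costs: Sequence[float], x: float) -> float:
    """c(x): root for the cost function with odd-indexed costs scaled by x."""
    scaled = [c if i % 2 == 0 else x * c for i, c in enumerate(costs)]
    return characteristic_root(scaled)


# The ternary example from the text: c_0 = ... = c_4 = 1.
costs = [1.0] * 5
c = c_of_x(costs, 1.0)                  # equals log2(5)
d = characteristic_root(costs[0::2])    # even-indexed letters only: log2(3)

for x in (1.0, 2.0, 10.0, 50.0):
    print(x, c_of_x(costs, x))
print("c =", c, "d =", d)
```

In this example c(x) decreases from c toward d as x grows, which is why a larger x gives a larger leading term H/c(x) at the price of a larger correction term.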
LEMMA 1. [...]

PROOF. [...]

$\mathrm{Cost}(C)$ [...] $\max\{H(\alpha_0, \beta_1, \ldots, \beta_n, \alpha_n)/c(x) - (x - 1) \cdot B \cdot \max c_i;\ 1$ [...]

[...] and hence $(5(t + 1) - 4)/(t + 1) \ge 3$, it suffices to choose e such that

$$a \left( 1 - \frac{t}{t + 1} \log(5(t + 1) - 4) \right) < e$$
for $t \ge 1$. Finally, if $n_0 = 5 > n/(t + 1)$, and hence $n < 5(t + 1)$, the inequality reduces to $e(t + 1)\log(n + 1) + at$