Even strongly universal hashing is pretty fast

Mikkel Thorup*
An experimental study of strongly universal hashing is presented, including non-standard usage of the floating point co-processor. Gains of up to a factor 15 are obtained over naive implementations of the classical $(ax + b) \bmod p$ scheme.

Strongly universal hashing. For hashing we are interested in getting a random-like function $h$ from some universe $U = \{0, \ldots, m-1\}$, $m = 2^k$, of keys into a domain $D = \{0, \ldots, n-1\}$, $n = 2^\ell$, of hash values. As theoreticians, we do this by selecting $h$ uniformly at random from some specific class $\mathcal{H}$ of functions from $U$ to $D$. We say that $\mathcal{H}$ is universal if it guarantees low collision probability, i.e. $\forall x \neq y \in U : \Pr_{h \in \mathcal{H}}(h(x) = h(y)) = O(1/n)$ [2]. Further, $\mathcal{H}$ is strongly universal if keys are mapped pairwise independently, i.e. if $\forall x \neq y \in U,\ \alpha, \beta \in D : \Pr_{h \in \mathcal{H}}(h(x) = \alpha \wedge h(y) = \beta) = O(1/n^2)$ [9]. Note that if $U = D$ and $\mathcal{H} = \{\mathrm{id}\}$ consists of the identity function, $\mathcal{H}$ is universal, but not strongly universal. Whereas universality suffices for the classical dictionary applications, it fails to generate pseudo-random numbers. Also it fails to support the following type of vector hashing [2]:

LEMMA 1. Suppose $\mathcal{H}$ is strongly universal. Let $\mathcal{H}^q$ define the class of functions from $U^q$ to $D$ such that $(h_1, \ldots, h_q)(x_1, \ldots, x_q) = h_1(x_1) \oplus \cdots \oplus h_q(x_q)$. Then $\mathcal{H}^q$ is strongly universal.

To see that universality does not suffice for the above vector hashing, note that $\{\mathrm{id}\}$ is universal whereas $\{\mathrm{id}\}^q$ is not universal, as it is constant over all permutations of the coordinates. A particularly nice feature of vector hashing is that if $\bar{x}'$ is obtained from $\bar{x}$ by locally replacing one coordinate $x_i$ by $x_i'$, then $h(\bar{x}') = h(\bar{x}) \oplus h_i(x_i) \oplus h_i(x_i')$. One natural application is local search heuristics: feasible solutions are high-dimensional vectors. To avoid cycling, we tabulate hash values of visited solutions. The neighborhood of the current solution is obtained by local changes whose hash values can quickly be found and checked.
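The local update property of vector hashing can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the implementation measured in the paper: the names `coord_hash`, `hash_coord`, `hash_vector`, `hash_update` and the dimension `Q` are hypothetical, and the per-coordinate function (a 64-bit multiply-add whose high 32 bits are kept, with `a` and `b` assumed to be drawn uniformly at random) is only a stand-in for whichever strongly universal $h_i$ is actually used.

```c
#include <assert.h>
#include <stdint.h>

#define Q 8 /* number of coordinates; hypothetical choice for illustration */

/* One per-coordinate hash function h_i, described by a pair (a, b).
   Assumption: a and b are chosen uniformly at random from 64-bit words. */
typedef struct {
    uint64_t a;
    uint64_t b;
} coord_hash;

/* Stand-in for h_i: compute a*x + b modulo 2^64 (unsigned wraparound)
   and keep the high 32 bits as the hash value. */
static uint32_t hash_coord(const coord_hash *h, uint32_t x)
{
    return (uint32_t)((h->a * (uint64_t)x + h->b) >> 32);
}

/* Vector hash from Lemma 1: XOR of the q coordinate hash values. */
static uint32_t hash_vector(const coord_hash h[Q], const uint32_t x[Q])
{
    uint32_t v = 0;
    for (int i = 0; i < Q; i++)
        v ^= hash_coord(&h[i], x[i]);
    return v;
}

/* Local update: hash of the vector obtained by replacing coordinate i,
   whose value was old_xi, by new_xi. Only two coordinate hashes are
   evaluated, independently of Q. */
static uint32_t hash_update(const coord_hash h[Q], uint32_t old_hash,
                            int i, uint32_t old_xi, uint32_t new_xi)
{
    assert(i >= 0 && i < Q);
    return old_hash ^ hash_coord(&h[i], old_xi) ^ hash_coord(&h[i], new_xi);
}
```

In a local search loop, one would keep the hash of the current solution and call `hash_update` for each candidate move, falling back to `hash_vector` only when a solution is built from scratch. The update works because XOR is its own inverse, so the old coordinate's contribution cancels out.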
Note that typical heuristics for string hashing [1, 8] do not allow such local updates of hash values.

Implementations. Two classical hashing schemes are the division method and the multiplication method [7]. In the division method, we fix some prime $p \geq \max\{m, n\}$, and then each member of $\mathcal{H}$ is characterized by two members $a$ and $b$ of $\mathbb{Z}_p$. Given $a$ and $b$, $h_{a,b}(x) = ((ax + b) \bmod p) \bmod n$. The division method is strongly universal [9]. Unfortunately, computing modulo a prime $p$ is very slow on today's computers, and hence it is often avoided. In this paper, experimentally, we explore published as well as new alternatives, and show that factors of 15 can be gained, making strongly universal hashing competitive with the heuristic alternatives that have been used in practice.

*AT&T Labs-Research, mthorup@research.att.edu

| Proc.      | SGI    | SGI   | SGI   | Sun   | Sun   | Intel |
| #×MHz      | 12×194 | 4×180 | 1×200 | 4×168 | 1×66  | 1×450 |
|------------|--------|-------|-------|-------|-------|-------|
| Division   | 7.72   | 4.34  | 4.27  | 41.04 | 66.60 | 3.36  |
| CWTrick    | 2.83   | 2.00  | 1.91  | 33.37 | 42.73 | 1.06  |
| Dietzfelb. | 1.13   | 0.62  | 0.82  | 19.34 | 24.32 | 0.47  |
| New idea   | 0.43   | 0.46  | 1.19  | 2.29  | 7.33  | 0.52  |
| Tab-char   | 0.55   | 0.55  | 1.11  | 0.81  | 2.82  | 0.20  |
| Tab-short  | 0.44   | 0.43  | 1.33  | 0.81  | 6.26  | 0.25  |
| Multipl.   | 0.33   | 0.36  | 0.53  | 7.05  | 8.63  | 0.10  |
| Pearson    | 1.43   | 1.55  | 1.36  | 2.49  | 4.22  | 0.56  |
| Bienstock  | 0.62   | 0.69  | 0.80  | 1.26  | 3.15  | 0.43  |
| Identity   | 0.09   | 0.10  | 0.16  | 0.09  | 0.45  | 0.07  |

Table 1: Seconds to compute 10,000,000 hash values

All experiments were done in C. As a case study we fixed $k = \ell = 32$, thus mapping 32-bit words into 32-bit words. With strongly universal hashing, any substring of the hash value is also strongly universal, so generating extra bits is only good. The least significant bit of $x$ is the rightmost. We use << and >> to denote left and right shift, so $x \mathbin{>>} i = \lfloor x/2^i \rfloor$ and $x$