Solving Range Queries in a Distributed System

Praveen Yalagandula
1 Introduction
The goal of this project is to design and build a scalable distributed discovery system for documents that (i) supports both simple queries and range queries on document names, (ii) supports efficient insertion and deletion of documents, (iii) distributes both storage and access loads uniformly among all participants, and (iv) is efficient in terms of the communication cost incurred in responding to queries. The ultimate goal of the project is to support full-fledged keyword search: a full-fledged keyword search system supports searches based on regular expressions. At the other extreme, a very simple system supports only single-keyword lookups. Systems based on Distributed Hash Tables (DHTs) typically support only simple keyword searches. Several intermediate schemes are possible, such as those that support range queries on a single dimension (e.g., [1]) and those that support multidimensional sequences of keywords and ranges (e.g., [5]). In this project, we focus on supporting range queries on a single dimension. Our approach combines ideas from Extendible Hashing [2] and Distributed Hash Tables.
Figure 1: A skip list (source: [1])
Figure 2: A skip graph with ⌈log N⌉ = 3 levels (source: [1])

2 Related Work

There are mainly two related papers: (1) Skip Graphs [1] and (2) Squid [5].

2.1 Skip Graph

Skip Graphs, proposed by Aspnes et al. [1], are a distributed data structure for supporting range queries. The structure is similar to a skip list (Figure 1) and uses ideas from Distributed Hash Tables [3] to achieve fault tolerance and load balance in terms of access loads. In this approach, each resource that a node has to share is assigned a random ID, called a membership vector, and the resources are arranged in non-descending sorted order according to their names or values. A DHT is constructed on the resources in the following way: for each prefix length of the membership vector of a resource, the resource keeps a right and a left pointer to the nearest resources that have the same prefix. Figure 2 illustrates these pointers. A range query or a simple query is satisfied as in the search procedure of a skip list. Skip graphs support range queries, support efficient insertion and deletion of new resources, are load-balanced in terms of both storage and access loads, are fault-tolerant, and are efficient in terms of the number of messages. The main disadvantage of this data structure is that it forms the DHT on the resources rather than on the physical nodes present in the system. If each node in an N-node system has K documents or resources to share, then each node needs to keep track of O(K log(N·K)) pointers to other nodes. Hence this system is not scalable with the number of resources or documents.
2.2 Squid

Squid, proposed by Schmidt et al. [5], supports multidimensional range queries. It maps the n-dimensional data to a one-dimensional space using Hilbert space-filling curves. The one-dimensional data is then mapped to nodes arranged in a linear fashion along the NodeId ring. While this mapping allows range queries to be performed efficiently, the scheme loses the load-balancing property inherent to DHTs. The authors propose two load-balancing schemes: (a) load balancing at join time and (b) load balancing at run time. In the former, a new node chooses multiple IDs, joins the network, and then discards all but the one ID that places it in the most loaded part of the network. This technique is both expensive, O(n log N) for joining at n places in an N-node network (O(n log² N) in the case of Chord), and insufficient to maintain load balance in the face of document insertions and deletions. Two schemes are presented for load balancing at run time: (1) exchange load with neighbors and (2) have each node host multiple virtual nodes. The first scheme incurs an O(N log² N) communication cost and is hence too expensive to perform periodically. The second scheme balances load well at the expense of an increased number of DHT pointers to maintain at each node. With each virtual node containing a single document, this scheme is the same as a Skip Graph: a node hosting k virtual nodes needs to maintain connections to O(k log N) DHT neighbors, and hence the scheme is not scalable with the number of documents in the system.
3 Our approach

Supporting range queries is harder than supporting simple single-keyword queries. Hash tables provide efficient constant-time lookup, insertion, and deletion but cannot support range queries. Balanced binary trees, B-trees, Tries, and similar structures support range queries at the cost of O(log N) insertion, lookup, and deletion complexity. Our approach is a fusion of Tries with hash tables: we arrange the data in a Trie structure and map the tree onto physical nodes using DHT techniques. This enables us to support range queries while exploiting the load-balancing property of DHTs, but at the cost of increased insertion, deletion, and lookup costs. We propose caching-based optimizations to reduce these costs.

3.1 Simple Algorithm

In this project, we consider range queries on one-dimensional keywords only. Further, we assume that keywords are drawn from a small alphabet, say Σ. We use the Kleene star notation ∗ to express ranges. We assume that a DHT is already constructed on the nodes in the system and can route messages for a key with a given ID to the corresponding responsible node (whose NodeId is closer to the ID than that of any other node in the system). We denote the node responsible for the hash of a string S by [S].

Insertion and deletion of documents: Initially, each node has zero documents. When a new document is inserted, the entry is routed to the node [Σ∗], called the root node. When the number of entries at the root node exceeds a blockSplitThreshold factor, the block of entries is split into |Σ| blocks, one for each character of the alphabet, and these are spread to the nodes [aΣ∗] for a ∈ Σ. This process is repeated recursively at the child nodes. To be able to merge entries back upon the deletion of documents, each node keeps track of its |Σ| child nodes. When the total number of entries over all children falls below blockMergeThreshold for a node, the node collects back all the entries from its |Σ| children. Note that the blockSplitThreshold and blockMergeThreshold parameters satisfy the invariant blockSplitThreshold ≥ blockMergeThreshold.
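To make the splitting procedure concrete, the following single-process Python sketch mimics the insertion logic described above. The Node class, the nodes dictionary (which stands in for DHT routing to [prefixΣ∗]), and the constants are illustrative assumptions, not the actual distributed implementation.

```python
# Single-process sketch of insertion with recursive splitting. The 'nodes'
# dictionary stands in for DHT routing to [prefix Sigma*]; children are
# created lazily. All names and constants here are illustrative.

ALPHABET = "abcdefghijklmnopqrstuvwxyz"      # Sigma
SPLIT_THRESHOLD = 4                          # blockSplitThreshold

class Node:
    def __init__(self, prefix):
        self.prefix = prefix                 # this node is [prefix Sigma*]
        self.entries = []                    # document names stored here
        self.split = False                   # True once pushed down to children

nodes = {"": Node("")}                       # "" denotes the root node [Sigma*]

def child(prefix, keyword):
    # The child of [prefix Sigma*] responsible for this keyword.
    # Assumes keywords are longer than any split prefix.
    p = prefix + keyword[len(prefix)]
    return nodes.setdefault(p, Node(p))

def split(node):
    node.split = True
    for w in node.entries:                   # spread entries to the [a Sigma*] children
        child(node.prefix, w).entries.append(w)
    node.entries = []
    for c in ALPHABET:                       # recursively split overfull children
        ch = nodes.get(node.prefix + c)
        if ch and len(ch.entries) > SPLIT_THRESHOLD:
            split(ch)

def insert(keyword):
    node = nodes[""]                         # every insertion starts at the root
    while node.split:                        # descend along already-split prefixes
        node = child(node.prefix, keyword)
    node.entries.append(keyword)
    if len(node.entries) > SPLIT_THRESHOLD:
        split(node)

for w in ["ant", "cat", "bat", "dog", "bait"]:
    insert(w)                                # mirrors the example of Figure 3
```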
Lookup: Any lookup request, say abc∗, is routed to the root node [Σ∗]. If the entries there have already been split, the request is passed down to the node [aΣ∗], then down to [abΣ∗], and so on.
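A corresponding sketch of this top-down lookup is given below; is_split and get_entries are hypothetical RPCs to the node responsible for a given prefix (each reached via one DHT routing step) and are introduced only for illustration.

```python
# Sketch of the top-down lookup: starting at the root [Sigma*], follow the
# split chain one character at a time until an unsplit node is reached.
# 'is_split(p)' and 'get_entries(p)' are assumed RPCs to the node [p Sigma*].

def lookup(prefix_query, is_split, get_entries):
    """prefix_query is the fixed part of a query such as 'abc*' (here 'abc')."""
    depth = 0
    while depth < len(prefix_query) and is_split(prefix_query[:depth]):
        depth += 1                  # entries were pushed one level further down
    # If [prefix_query Sigma*] has itself been split, a range query must also
    # be forwarded into its subtree; that case is omitted in this sketch.
    return get_entries(prefix_query[:depth])
```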
Example: A simple example depicting the insertion of keywords is shown in Figure 3. In the example, we assume that the blockSplitThreshold factor is 4.
Figure 3: An example Trie construction with a split threshold of 4.

Figure 4: An example Trie construction with 2-way splitting.
Discussion: The insertion, deletion, and lookup costs increase from O(log N) in DHTs to O(k log N), where k is the length of the document name. The procedure achieves load balance in terms of storage, but the root node is accessed on every operation, so the access load is not distributed fairly. Furthermore, splitting and merging are costly, since |Σ| nodes have to be accessed, requiring at least |Σ| messages.
Problem 1: The access load is not uniformly distributed. Fix: Use a binary search over prefix lengths. For example, start the lookup for the keyword computer at the node [compΣ∗] and proceed to either node [coΣ∗] or [computΣ∗] based on the information at [compΣ∗]. The lookup cost decreases from O(k log N) to O(log k · log N), and the access load is also distributed uniformly among the nodes.
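Under the same assumptions as the lookup sketch above (hypothetical is_split and get_entries helpers), the binary search over prefix lengths can be sketched as follows.

```python
# Sketch of the binary-search lookup. Because a node can only be split after
# its parent has been split, is_split() is monotone along the prefix chain of
# a keyword, so a binary search over the prefix length locates the bucket.

def lookup_binary(keyword, is_split, get_entries):
    lo, hi = 0, len(keyword)              # candidate prefix lengths
    while lo < hi:
        mid = (lo + hi) // 2              # e.g., first probe [compS*] for 'computer'
        if is_split(keyword[:mid]):
            lo = mid + 1                  # entries live at a longer prefix
        else:
            hi = mid                      # this prefix (or a shorter one) holds them
    return get_entries(keyword[:lo])      # O(log k) probes, each an O(log N) route
```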
Problem 2: Splitting and merging are costly. Fix: Instead of splitting |Σ| ways at a node, a coarser-grained split can be performed (2-way, 4-way, or, in general, b-way) to reduce the costs of splitting and merging. This increases the depth of the lookup tree, so the trade-off is an increased lookup time of O(log_b |Σ| · log k · log N). An example of 2-way splitting is shown in Figure 4.
4 Issues

Access Load Balance and Fault Tolerance: We replicate data at neighboring nodes on the logical NodeId ring. Replication provides fault tolerance and also helps in load-balancing the access load: the keywords and keyword ranges that are accessed most often are replicated on more nodes to offset the load on any single node. Consistency becomes an issue with replication; a simple eventual consistency model is efficient in terms of communication costs and is generally acceptable for the application domain under consideration.

Optimizations: Caching can reduce the number of steps in lookups, insertions, and deletions. Nodes can cache information about how far down the tree has already been split to speed up searches. Result caching can also be used to further improve performance.

Supporting General Expressions: The algorithm described above supports range queries with the wildcard character only at the end. Range queries with a Kleene star appearing anywhere else can also be supported, albeit at an increased communication cost. For example, a range query for Σ∗uter would need to be sent to all leaves of the Trie structure and hence possibly to all nodes in the system. A query like comΣ∗ter can be answered efficiently compared with a query for Σ∗uter; the length of the prefix before the first Kleene star in a query greatly affects the performance of the system.

Choosing Thresholds: A large value of blockSplitThreshold implies that all entries are stored in one or a few places, causing storage load imbalance in the system but providing efficient range query support. A small value gives good storage load balancing, but at the cost of an increased Trie depth and hence increased lookup, insertion, and deletion costs. We propose a dynamic scheme for picking an appropriate threshold value that minimizes the tree depth while ensuring storage load balance: initially, blockSplitThreshold is set to a moderate value, say 1 MB; when a node is assigned more than a few leaf nodes of the Trie, it increases blockSplitThreshold and decreases blockMergeThreshold.
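A minimal sketch of this dynamic threshold adjustment is given below; the doubling/halving policy, the constants, and the NodeState fields are assumptions made only for illustration.

```python
from dataclasses import dataclass

# Sketch of the dynamic threshold scheme described above. All names and
# constants are illustrative assumptions, not the implemented system.

@dataclass
class NodeState:
    num_trie_leaves: int = 0                  # Trie leaves mapped to this node
    block_split_threshold: int = 1 << 20      # start at a moderate value (~1 MB)
    block_merge_threshold: int = 1 << 18

MAX_LEAVES_PER_NODE = 4                       # "more than a few" leaves

def adjust_thresholds(state: NodeState) -> None:
    # If a physical node ends up hosting many small Trie leaves, make splits
    # rarer (raise the split threshold) and merges easier (lower the merge
    # threshold) so that buckets grow and the Trie becomes shallower.
    if state.num_trie_leaves > MAX_LEAVES_PER_NODE:
        state.block_split_threshold *= 2
        state.block_merge_threshold //= 2
    # Invariant from the text: blockSplitThreshold >= blockMergeThreshold.
    assert state.block_split_threshold >= state.block_merge_threshold
```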
Figure 5: Frequency distribution of the first one and first two letters in the chosen keyword set.

Figure 6: Frequency distribution of the first one and first two letters in the chosen keyword set after converting to lower case.
5 Evaluation

Workload: We use WordNet's word list, available for free download from dict.org. There are about 150,000 words in the database. In Figure 5(a), we plot the frequency of words against the starting ASCII character in our word set. As expected, most words in our workload start with either an uppercase or a lowercase letter of the English alphabet, and words starting with a lowercase letter dominate those starting with an uppercase letter. In Figure 5(b), we plot the frequency of words taking the first two characters of each word into consideration. These two graphs clearly show that the distribution of words has a very high variance. An approach similar to Squid, where order-preserving hashing is used to map the keywords to the set of ordered nodes, will therefore suffer from storage load imbalance. To further substantiate this point, we converted all words to lower case and plotted the distribution based only on the 26 letters of the English alphabet; Figures 6(a) and 6(b) depict the resulting frequency distribution of words for the starting one and two characters.

Encoding of the alphabet matters: while a straightforward ASCII encoding does not evenly distribute the keywords, using something like a Hamming code based on the occurrence rate of the characters might load-balance the keywords across the bit space evenly. Further investigation is necessary to quantify this; for now we use a simple straightforward encoding in the same order as the ASCII ordering. Figure 7 illustrates the encoding scheme we use in our simulations: we encode each character with 5 bits, making sure that the letters are evenly spaced in the 32-element code space.
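The sketch below shows one way to realize such a 5-bit encoding. The exact code assignment in our simulator follows the tree in Figure 7, so the spacing rule used here is only an illustrative approximation, and the function names are hypothetical.

```python
# Sketch of a 5-bit per-character encoding: the 26 lower-case letters are
# spread roughly evenly over the 32 available code points. The spacing rule
# is an illustrative approximation of the tree shown in Figure 7.

def encode_char(c: str) -> int:
    i = ord(c) - ord('a')                 # 0..25
    return (i * 32) // 26                 # spread over 0..31

def encode_keyword(word: str) -> str:
    # Concatenate the 5-bit codes; the resulting bit string is what the Trie
    # splits on when the split granularity (SG) is measured in bits.
    return "".join(format(encode_char(c), "05b")
                   for c in word.lower() if c.isalpha())

print(encode_keyword("ant"))   # -> '000001000010111'
```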
Figure 7: Encoding tree for representing the 26 lower-case letters with 5 bits.

We construct a Trie structure on the keyword set. Two parameters affect the construction: blockSplitThreshold, which we refer to as the Split Threshold (ST), and the granularity at which splitting is performed, which we call the Split Granularity (SG). The split granularity is measured in bits; a split granularity of 5 means that the split is done based on the 26 letters of the alphabet. In Figure 8, we plot the number of buckets and the average depth of the words in the Trie structure against the split threshold for various values of the split granularity. As expected, the number of buckets decreases as the split threshold is increased, and the average depth of the tree also decreases. The number of buckets and the average depth of the words are plotted against the split granularity in Figure 9. Increasing the granularity increases the number of buckets while the average depth of the Trie decreases: at a higher granularity, a bucket is split into more children than at a lower granularity, so the number of buckets increases while the need for further splits decreases, leading to a shorter Trie. These simulations clearly show that a higher split threshold and a larger split granularity decrease the average depth.
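A compressed sketch of the simulator logic behind Figures 8 and 9 is shown below. It operates on keywords already encoded as bit strings (for example, by the 5-bit encoding sketched earlier); the function and variable names are illustrative, and the real harness additionally sweeps ST and SG to produce the plotted curves.

```python
from collections import defaultdict

def build_trie(encoded_words, split_threshold, split_granularity):
    """encoded_words: bit strings, e.g. produced by a 5-bit encoding."""
    buckets = defaultdict(list)   # bit prefix -> encoded words stored there
    split = set()                 # prefixes whose buckets have been pushed down

    def place(bits, start=""):
        prefix = start
        # Descend split_granularity bits at a time along already-split prefixes.
        while prefix in split and len(prefix) < len(bits):
            prefix = bits[:len(prefix) + split_granularity]
        buckets[prefix].append(bits)
        if len(buckets[prefix]) > split_threshold and len(prefix) < len(bits):
            split.add(prefix)
            for w in buckets.pop(prefix):
                place(w, prefix)  # push the overflowing bucket one level down

    for bits in encoded_words:
        place(bits)

    total = sum(len(b) for b in buckets.values())
    avg_depth = sum(((len(p) + split_granularity - 1) // split_granularity) * len(b)
                    for p, b in buckets.items()) / max(total, 1)
    return len(buckets), avg_depth   # zero-sized buckets are never created
```

For instance, build_trie(words, 250, 5) would correspond to the (ST = 250, SG = 5) configuration used in the load-balance comparison later in this section.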
Figure 8: Number of buckets and average depth of the Trie structure after inserting all keywords, plotted against the split threshold (ST) for different values of the split granularity (SG).
Figure 9: Number of buckets and average depth of the Trie structure after inserting all keywords, plotted against the split granularity (SG) for different values of the split threshold (ST).

We observe that the number of buckets is very large even at small split granularity values. For example, at a split granularity of 1 and a threshold of 250, we expect around 150000/250 = 600 buckets, while the observed value is about 1500 buckets. The large number of buckets is due to the fact that many buckets with few entries are created during the Trie construction. For example, with 26-way splitting (a split granularity of 5) as shown in Figure 3, splitting the bucket at the root produces 26 buckets of which only 4 are non-empty (we do not count the zero-sized buckets in our bucket count). While a smaller split granularity reduces the number of buckets with few entries, we still observe a lot of buckets with a small number of entries.

We propose a B-tree flavored Trie-based approach that tries to reduce the number of buckets. When a bucket needs to be split because its size exceeds the split threshold, instead of splitting it k ways as specified by the split granularity, we delegate entries down only for a few ranges. This idea is illustrated in Figure 10. Compared with the 26-way split shown in Figure 3, this controlled delegation reduces the number of buckets. Notice that this splitting has the flavor of a B-tree, where intermediate nodes in the tree retain some entries. We propose to investigate the effectiveness of this approach as part of future work.
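The controlled delegation can be sketched as follows; the policy used to decide which child ranges are delegated (here, any child holding more than half the split threshold) is an assumption made for illustration, not a tuned design.

```python
from collections import defaultdict

# Sketch of the B-tree flavored controlled delegation: on overflow, only the
# heavier child ranges are pushed down, while the remaining entries stay at
# the intermediate node. The delegation policy is an illustrative assumption.

def controlled_split(buckets, prefix, split_threshold):
    by_child = defaultdict(list)
    for word in buckets[prefix]:               # group entries by next character
        by_child[word[len(prefix)]].append(word)
    kept = []
    for c, words in by_child.items():
        if len(words) > split_threshold // 2:  # delegate only heavy ranges
            buckets[prefix + c] = words
        else:
            kept.extend(words)                 # light ranges stay here (B-tree style)
    buckets[prefix] = kept
```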
Figure 10: An example Trie construction with a B-tree flavor.

Figure 11: Normalized standard deviation of the storage load on the nodes for basic Squid (without its load-balancing schemes) and the Trie-based approach.

Figure 12: Normalized standard deviation of the storage load on the nodes for the Trie-based approach, plotted against the split threshold (ST) for various values of the split granularity (SG).

Figure 13: Normalized standard deviation of the storage load on the nodes for the Trie-based approach, plotted against the split granularity (SG) for various values of the split threshold (ST).
We measure the storage load-balancing property of a scheme by the normalized standard deviation of the storage loads across all nodes. Figure 11 compares the load-balancing properties of the basic Squid approach with the Trie-based scheme, using ST=250 and SG=5 for the Trie-based approach. In basic Squid, the keywords are mapped to a linear space from 0 to 1 in an order-preserving manner, and the nodes are uniformly mapped to the same space; a keyword is assigned to the closest node with a larger ID. In this set of simulations, we do not consider the run-time load-balancing techniques proposed by the authors of the Squid system. Clearly, the Trie-based approach spreads the storage load better than the basic Squid approach.
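The sketch below states this metric precisely, assuming that "normalized" means the standard deviation divided by the mean load; the load vector comes from whichever keyword-to-node assignment scheme is being evaluated.

```python
from statistics import mean, pstdev

# Storage load-balance metric used in Figures 11-13, assuming normalization
# by the mean load (i.e., a coefficient of variation). 'loads' holds the
# number of keywords stored at each node.

def normalized_std_dev(loads):
    return pstdev(loads) / mean(loads)

# Example: loads for 1000 nodes under some keyword-to-node assignment.
# print(normalized_std_dev([len(ks) for ks in keywords_per_node]))
```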
We plot the normalized standard deviation for the Trie-based approach against the split threshold for various values of the split granularity in Figure 12. With increasing split threshold, the imbalance in the storage load across nodes increases. Smaller thresholds improve load balance because of the finer granularity at which the keywords can be spread across the machines; at a split threshold of just 1, the approach is the same as a standard DHT where each keyword is hashed separately. We also plot the normalized standard deviation against the split granularity for various values of the split threshold in Figure 13. Higher values of the split granularity spread the load more evenly than lower values.

6 Conclusions and Future Work
In this project, we propose a distributed Trie-like data structure for storing keywords and documents so that (a) range queries can be supported, (b) efficient insertions, deletions, and lookups are supported, (c) storage load is uniformly distributed across the participating nodes, and (d) access load is also uniformly distributed. Preliminary simulation results show that the Trie-based structure is more effective at distributing the load across the nodes. One interesting observation is the creation of a large number of buckets with few entries at higher split granularities. We propose a modification to the Trie-based data structure with the flavor of B-trees that reduces the creation of small buckets.

Some words are more common than others, and hence more documents match those words than others. As future work, we plan to query Google with each word to estimate the number of documents matching a particular keyword and use those numbers as part of our workload. We plan to implement our algorithm on top of Pastry [4], a freely available DHT system, and evaluate the efficacy of the algorithm with respect to the following metrics: (1) storage load balance, (2) access load balance, and (3) insertion, deletion, and lookup times and message costs (with and without caching enabled). We will also study the behavior of the algorithm under a fail-stop fault model for varying numbers of faults.
References

[1] J. Aspnes and G. Shah. Skip Graphs. In Proceedings of the 14th ACM-SIAM Symposium on Discrete Algorithms, January 2003.

[2] R. Fagin, J. Nievergelt, N. Pippenger, and H. R. Strong. Extendible hashing - a fast access method for dynamic files. ACM Transactions on Database Systems, 4(3):315-344, September 1979.

[3] C. G. Plaxton, R. Rajaraman, and A. W. Richa. Accessing Nearby Copies of Replicated Objects in a Distributed Environment. In ACM Symposium on Parallel Algorithms and Architectures, 1997.

[4] A. Rowstron and P. Druschel. Pastry: Scalable, Distributed Object Location and Routing for Large-scale Peer-to-peer Systems. In Proceedings of the 18th IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), November 2001.

[5] C. Schmidt and M. Parashar. Flexible Information Discovery in Decentralized Distributed Systems. In Proceedings of the 12th IEEE International Symposium on High Performance Distributed Computing, 2003.