iDISQUE: Tuning High-Dimensional Similarity Queries in DHT Networks

Xiaolong Zhang#, Lidan Shou#, Kian-Lee Tan*, Gang Chen#

# College of Computer Science, Zhejiang University, China
{xiaolongzhang,should,cg}@cs.zju.edu.cn
* School of Computing, National University of Singapore, Singapore
[email protected]

Abstract. In this paper, we propose a fully decentralized framework called iDISQUE to support tunable approximate similarity queries of high-dimensional data in DHT networks. The iDISQUE framework utilizes a distributed indexing scheme to organize data summary structures called iDisques, which describe the cluster information of the data on each peer. The publishing process of iDisques employs a locality-preserving mapping scheme. Approximate similarity queries can be resolved using the distributed index, and the accuracy of query results can be tuned by both the publishing and query costs. We employ a multi-probe technique to reduce the index size without compromising the effectiveness of queries, and we also propose an effective load-balancing technique based on multi-probing. Experiments on real and synthetic datasets confirm the effectiveness and efficiency of iDISQUE.

1 Introduction

In many applications, objects (documents, images, etc.) are characterized by a collection of relevant features, which are represented as points in a high-dimensional space. Given a query point, a similarity search finds the data points nearest (most similar) to it. While numerous techniques have been proposed for similarity search in high-dimensional spaces, they have mostly been studied in the context of centralized architectures [8, 13]. Unfortunately, due to the well-known curse of dimensionality, search in a high-dimensional space is considered a "hard" problem. It has been suggested that, since the selection of features and the choice of a distance metric in typical applications are rather heuristic, determining approximate nearest neighbors should suffice for most practical purposes.

The relentless growth of storage density and improvements in broadband network connectivity have fueled the increasing popularity of massive and distributed data collections. Peer-to-Peer (P2P) systems, as a popular medium for distributed information sharing and searching, have been gaining increasing interest in recent years, and there is strong demand for similarity search in P2P networks as well. Unfortunately, most existing P2P systems are not designed to support efficient similarity search. In unstructured P2P networks, there is no guarantee on the completeness and quality of the answers. On the other hand, while structured DHT-based P2P systems [19, 16, 17] offer an efficient lookup service, similarity search still

cannot be resolved readily. The reason is that DHT networks usually employ consistent hashing mechanisms, which destroy data locality. To facilitate similarity search in DHT networks, we need locality-preserving lookup services that can map similar data objects in the original data space to the same node in the overlay network.

We advocate a fully decentralized index to organize high-dimensional data in a DHT network. Our framework is motivated by the following ideas. First, we need a locality-preserving mapping scheme to map the index entries into the DHT network. Locality sensitive hashing (LSH) has been proved effective for mapping data with spatial proximity to the same bucket [12], and it has been utilized in [8] to support high-dimensional similarity search in a centralized setting. By carefully designing the mapping scheme in the DHT, we can realize high-dimensional locality among the peers. Second, mapping all data objects of each peer to the DHT would result in a huge distributed index. To reduce the size of the distributed index and the cost of index maintenance, we should publish data summaries to the DHT network instead of individual data objects. For this purpose, each peer needs a summarization method to derive representative summaries from its data objects. We note that clustering is a typical summarization approach for organizing high-dimensional data. For example, in a centralized environment, the iDistance [13] approach indexes high-dimensional data in clusters and provides effective similarity search based on a mapping scheme for these clusters. Although it is difficult to cluster the entire dataset in a P2P environment, we can exploit local clustering in each peer to construct a distributed index.

In this paper, we propose a practical framework called iDISQUE to support tunable approximate similarity queries of high-dimensional data in DHT networks. The contributions of our work are summarized as follows:

– We propose a fully decentralized framework called iDISQUE to handle approximate similarity queries, based on a novel locality-preserving mapping scheme. We also present a tunable query algorithm for approximate similarity search, whose result accuracy can be tuned by both the indexing and query costs.
– We present a distributed multi-probe technique that reduces the size of the indices without compromising the effectiveness of queries. In addition, we introduce a load-balancing technique based on multi-probing.
– We conduct extensive experiments to evaluate the performance of the proposed iDISQUE framework.

The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 describes the preliminaries. Section 4 gives an overview of the iDISQUE framework. Section 5 presents the locality-preserving indexing in iDISQUE, and Section 6 describes its query processing method. Section 7 presents the techniques to handle load imbalance. We present the experimental study in Section 8, and finally conclude the paper in Section 9.

2 Related Work

We are aware of a few previous works that propose techniques to support similarity search in P2P environments. These systems can be divided into three main categories.

The first category, such as MAAN [4] and Mercury [3], is attribute-based. In these systems, data indexing and similarity search operate on single-dimensional attribute spaces, and their performance is poor even at low dimensionality. The second category, such as Murk [7] and VBI-tree [14], is based on multi-dimensional data and space-partitioning schemes, mapping specific regions of the space to certain peers. Although these systems generally perform well at low dimensionality, their performance deteriorates rapidly as the dimensionality increases, due to the curse of dimensionality. The third category consists of metric-based P2P similarity search systems, with examples including [6], [18] and [11]. The SIMPEER framework [6] utilizes the iDistance [13] scheme to index high-dimensional data in a hierarchical unstructured P2P overlay; however, this framework is not fully decentralized, making it vulnerable to super-peer failure. In [18] the authors define a mapping scheme based on several common reference points to map documents to a one-dimensional Chord overlay. However, their scheme is limited to document retrieval applications. Perhaps the most similar work to ours is [11], which proposes an algorithm for approximate K-nearest-neighbor queries in structured P2P networks utilizing the locality sensitive hashing [8] scheme. However, their scheme is not fully decentralized, since it relies on a set of gateway peers, which become a single point of failure when their workload is large. Moreover, the effectiveness of their mapping and load-balancing schemes depends largely on precomputed global statistics, which are difficult to collect in a fully distributed way. In contrast, our approach is fully decentralized and requires no precomputed global statistics, which makes it more practical.

3 Preliminaries

In this section, we briefly introduce the basic mechanisms used in our iDISQUE framework, namely locality sensitive hashing and the iDistance indexing scheme.

3.1 Locality Sensitive Hashing

The basic idea of locality sensitive hashing (LSH) is to use a set of hash functions that map "similar" objects into the same hash bucket with high probability [12]. LSH is by far the basis of the best-known indexing methods for approximate nearest-neighbor (ANN) queries. An LSH function family has the property that objects close to each other have higher probabilities of colliding than those that are far apart. For a domain S of the point set with distance measure D, an LSH family is defined as follows: a family H = {h : S → U} is called (r1, r2, p1, p2)-sensitive for D if for any v, q ∈ S,

– if D(v, q) ≤ r1 then Pr_H[h(v) = h(q)] ≥ p1,
– if D(v, q) ≥ r2 then Pr_H[h(v) = h(q)] ≤ p2,

where Pr_H[h(v) = h(q)] is the colliding probability of the two points v, q ∈ S. To utilize LSH for approximate similarity (K-nearest-neighbor, KNN) search, we should pick r1 < r2 and p1 > p2 [8]. With these choices, nearby objects (those within

distance r1) have a greater chance (p1 vs. p2) of being hashed to the same value than objects that are far apart (those at a distance greater than r2).

3.2 The iDistance Indexing Scheme

The iDistance [13] scheme is an indexing technique for supporting similarity queries on high-dimensional data. It partitions the data space into several clusters and selects a reference point for each cluster. Each data object is assigned a one-dimensional iDistance value according to its distance to the reference point of its cluster. Therefore, all objects in the high-dimensional space can be mapped to the single-dimensional keys of a B+-tree. Details of similarity query processing in iDistance can be found in [13].
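To make the mapping concrete, the following is a minimal Python sketch of the iDistance key computation. The scaling constant c (which separates the key ranges of different clusters) and all names are our own illustration, not the authors' implementation:

```python
import math

def idistance_key(point, ref_points, c=10_000.0):
    """Map a high-dimensional point to a one-dimensional iDistance key.

    The point is assigned to its nearest reference point (cluster center),
    and its key is the cluster id scaled by a constant c plus the distance
    to that reference point. Points of the same cluster thus occupy a
    contiguous key range of a B+-tree.
    """
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    cluster_id = min(range(len(ref_points)),
                     key=lambda i: dist(point, ref_points[i]))
    return cluster_id * c + dist(point, ref_points[cluster_id])
```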

4 Overview

In this section, we first give an overview of the iDISQUE framework. Then, we present a simple indexing solution.

4.1 The iDISQUE Framework

The iDISQUE framework comprises a number of peers organized into a DHT network, which provides the basic lookup service: given a key, the lookup service maps it to an ID denoted by lookup(key), which can be used to find the peer responsible for the key. Without loss of generality, we use Chord [19] as the DHT overlay in our framework. A peer sharing high-dimensional data is called a data owner, while a peer containing index entries of the data shared by other peers is called an indexing peer. For sharing and searching data among the peers, the iDISQUE framework mainly provides the following two services:

– The index construction service. When a peer shares its data, the index construction service is invoked. It consists of the following four steps. First, the data owner employs a local clustering algorithm to generate a set of data clusters. For simplicity in presentation and computation, we assume that all data clusters are spherical in the vector space. Second, the data owner employs the iDistance scheme [13] to index its local data in clusters, using the cluster centers created in the previous step as the reference points. Third, for each data cluster generated in step one, the data owner creates a data structure called an iDisque (iDIStance QUadruplE), denoted by C* and defined as

C* = <C, rmin, rmax, IP>

where C is the center of the cluster, rmin and rmax are the minimum and maximum distances of all cluster members to the center, and IP is the IP address of the data owner. The iDisque is a compact data structure that describes the cluster information, so the data owner can capture the summary of its own data via a set of iDisques. Fourth, for each iDisque being generated, the data

owner publishes its replicas to multiple peers, using a mapping scheme that maps the center of the iDisque to a certain number (denoted by Lp) of indexing peers; the iDisque is then replicated to these indexing peers. In the rest of the paper, Lp is called the publishing mapping degree. The indexing peers receiving the iDisque insert it into their local index structures.

– The query processing service. A peer can submit a similarity query to retrieve the data objects most relevant to the given query. When a similarity (KNN) query is issued, it is first sent to a specified number (denoted by Lq) of indexing peers that may contain candidate iDisques, using the same mapping scheme as described above. In the rest of the paper, Lq is referred to as the query mapping degree. Second, each indexing peer receiving the query looks up its local part of the distributed index to find colliding iDisques, and the query is then sent to the data owners of these candidate clusters. Third, the data owners of the candidate clusters process the query utilizing their local iDistance indexes, and return their local K nearest neighbors as candidate query results. Finally, the querying node sorts the data in the candidate result sets by distance to the query point and produces the final results.

4.2 A Naive Indexing Scheme

For a Chord overlay, a straightforward indexing scheme can be implemented as follows. For each iDisque, we replicate it Lp times and map the copies evenly onto the Chord ring ranging from 0 to 2^m − 1, at a constant interval of P = 2^m / Lp. The constant P is called the publishing period. Therefore, any two replicas rep_i and rep_j are mapped to Chord keys such that |ChordKey(rep_i) − ChordKey(rep_j)| = n · P, where n is an integer in {1, . . . , ⌊Lp/2⌋}. This replication scheme guarantees that any interval of length greater than P in the Chord key space contains at least one replica of each iDisque, and hence all iDisques of the entire system. When a query is issued, we randomly choose an interval of length P on the Chord ring, and randomly select Lq keys in this interval (Lq ≤ P). The query is then delivered to the indexing peers in charge of these Lq keys.

In this naive indexing scheme, the iDisques are randomly distributed in the network; its index storage space is therefore uniformly allocated among the indexing peers, and the query load is balanced too. We denote this naive scheme by iDISQUE-Naive. We note that if Lq is large enough to cover all peers in the interval, the query touches all iDisques in the system and is therefore accurate. However, a typical Lq value is much smaller than P in a large network, so the naive scheme cannot achieve high accuracy without a large Lq value. To improve query accuracy at a low cost, we shall look at a locality-preserving indexing scheme in the iDISQUE framework.
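As an illustration of the structures above, here is a hedged Python sketch of the iDisque quadruple and of the naive replica placement; the class and function names are hypothetical, since the paper does not prescribe an implementation:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class IDisque:
    """Summary of one local data cluster (the iDisque quadruple C*)."""
    center: Tuple[float, ...]  # cluster center C
    r_min: float               # minimum distance of members to C
    r_max: float               # maximum distance of members to C
    owner_ip: str              # IP address of the data owner

def naive_replica_keys(base_key: int, lp: int, m: int = 32):
    """Place Lp replicas evenly on the Chord ring [0, 2^m - 1].

    Consecutive replicas are separated by the publishing period
    P = 2^m / Lp, so any key interval longer than P contains at
    least one replica of every iDisque.
    """
    period = (2 ** m) // lp
    return [(base_key + i * period) % (2 ** m) for i in range(lp)]
```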

5 Locality-Preserving Index Scheme

In this section, we introduce a locality-preserving indexing scheme for iDISQUE. In the remaining sections, we will assume the locality-preserving indexing scheme for

the iDISQUE framework unless explicitly stated otherwise. We begin by introducing a locality-preserving mapping scheme, and then present the index construction method.

5.1 Locality-Preserving Mapping

To facilitate locality-preserving mapping, a family of LSH functions is needed. Without loss of generality, we assume the distance measure to be the L2 norm. Therefore, we can use the family H of LSH functions for lp norms proposed by Datar et al. in [5], where each hash function is defined as

h_{a,b}(v) = ⌊(a · v + b) / W⌋,

where a is a d-dimensional random vector with entries chosen independently from a p-stable distribution, and b is a real number chosen uniformly from the range [0, W]. In this family, each hash function h_{a,b} : R^d → Z maps a d-dimensional vector v onto the set of integers. The p-stable distribution used in this work is the Gaussian distribution, which is 2-stable and therefore works for the Euclidean distance.

To resolve similarity queries, the locality-preserving mapping scheme in iDISQUE has to map similar objects (both data and query points) in the high-dimensional space to the same Chord key. Our proposed mapping scheme consists of the following consecutive steps. First, to amplify the gap between the "high" probability p1 and the "low" probability p2 (refer to Section 3.1), we define a function family G = {g : R^d → U^k} such that g(v) = (h1(v), . . . , hk(v)), where hj ∈ H. Each g function produces a k-dimensional vector. Concatenating the k LSH functions in g makes the collision probability of far-apart objects smaller (p2^k), but it also reduces the collision probability of nearby objects (p1^k). Second, to increase the collision probability of nearby objects, we choose L independent functions g1, . . . , gL from G at random. Each function gi (i = 1, . . . , L) is used to construct one hash table, resulting in L hash tables. We can hash each object into L buckets using g1, . . . , gL. As a result, nearby objects are hashed to at least one common bucket with a considerably higher probability, given by 1 − (1 − p1^k)^L. Third, to map each k-dimensional vector gi(v) to the Chord key space as evenly as possible, we multiply gi(v) with a k-dimensional random vector Ri = [a_{i,1}, . . . , a_{i,k}]^T:

ρi(v) = gi(v) · [a_{i,1}, . . . , a_{i,k}]^T mod M,

where M is the maximum Chord key 2^m − 1, and each element of Ri is chosen randomly from the Chord key space [0, 2^m − 1]. We call each ρi an LSH-based mapping function, and denote by ΨL the function set {ρ1, . . . , ρL}. Therefore, given ΨL = {ρ1, . . . , ρL}, we can map a point v ∈ R^d to L Chord keys ρ1(v), . . . , ρL(v). A query at point q is said to collide with a data point v if there exists a function ID i ∈ {1, . . . , L} such that ρi(q) = ρi(v).
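The three steps above can be sketched in Python as follows. This is an illustrative rendering under the paper's stated choices (Gaussian entries for a, k hash functions per g, and a random vector Ri reduced modulo M = 2^m − 1); the helper names are ours:

```python
import math
import random

def make_h(dim, w):
    """One p-stable LSH function: h_{a,b}(v) = floor((a . v + b) / W)."""
    a = [random.gauss(0.0, 1.0) for _ in range(dim)]  # Gaussian is 2-stable
    b = random.uniform(0.0, w)
    return lambda v: math.floor((sum(x * y for x, y in zip(a, v)) + b) / w)

def make_g(dim, k, w):
    """g(v) = (h_1(v), ..., h_k(v)): a concatenation of k LSH functions."""
    hs = [make_h(dim, w) for _ in range(k)]
    return lambda v: [h(v) for h in hs]

def make_rho(dim, k, w, m=32):
    """rho_i(v) = g_i(v) . R_i mod M: map a point to a Chord key."""
    g = make_g(dim, k, w)
    big_m = 2 ** m - 1                          # maximum Chord key, per the paper
    r = [random.randrange(2 ** m) for _ in range(k)]
    return lambda v: sum(gj * rj for gj, rj in zip(g(v), r)) % big_m

# Psi_L = {rho_1, ..., rho_L}: the same sequence is shared by all peers.
psi = [make_rho(dim=64, k=15, w=3.0) for _ in range(10)]
```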

5.2 Index Construction

We now describe the detailed process of index construction in iDISQUE. Of the four steps of the index construction service described in the previous section, we shall focus on the first and the fourth, namely clustering local data and publishing iDisques, since the other two steps have already been clearly described.

Clustering local data. When clustering its local data, a data owner must take two requirements into consideration. On one hand, as the iDisque structures are published and stored in the DHT network, the number of iDisques (clusters) published by a data owner should be as small as possible, so that the cost of constructing and storing the distributed index is limited. On the other hand, the geometric size of each cluster must be small enough to guarantee query accuracy. To balance the number of clusters against query accuracy, we tune the entire system using a parameter δ, which specifies the maximum radius allowed for each data cluster. A large δ value reduces the number of iDisques published and maintained in the system while impairing accuracy; a small δ does the reverse. Due to the restriction on cluster size, a data point might be "singular" if its distance from its nearest neighbor is more than δ. Such points are called singularities, and we create a singular cluster for each singularity. Although several existing clustering algorithms can cluster local data into clusters with a maximum radius δ, e.g., BIRCH [20] and CURE [10], our implementation adopts BIRCH [20] due to its popularity and simplicity.
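As an aside, radius-bounded local clustering of this kind can be approximated with scikit-learn's BIRCH implementation, whose threshold parameter bounds the radius of each subcluster. The sketch below is our illustration, assuming subcluster radii stand in for δ; the paper does not specify its exact BIRCH configuration:

```python
import numpy as np
from sklearn.cluster import Birch

def cluster_local_data(points: np.ndarray, delta: float):
    """Cluster local data so that each subcluster's radius is at most delta.

    Returns (center, r_min, r_max) triples from which iDisques are built;
    a point whose nearest neighbor is farther than delta naturally ends up
    as a singular cluster (r_min = r_max = 0).
    """
    birch = Birch(threshold=delta, n_clusters=None)
    labels = birch.fit_predict(points)
    summaries = []
    for label in np.unique(labels):
        members = points[labels == label]
        center = members.mean(axis=0)
        dists = np.linalg.norm(members - center, axis=1)
        summaries.append((center, dists.min(), dists.max()))
    return summaries
```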

Fig. 1: Publishing and querying an iDisque. The center C of an iDisque C* is mapped to Lp Chord keys ρ1(C), . . . , ρLp(C) when publishing, and a query q is mapped to Lq Chord keys ρ1(q), . . . , ρLq(q). If ρi(C) = ρi(q), a collision is detected and C* is a colliding iDisque.

Publishing iDisques. To publish iDisques to peers, we need an LSH-based mapping function sequence ΨLp = {ρ1, . . . , ρLp} whose length (cardinality) is equal to the publishing mapping degree Lp. All data owners use the same sequence ΨLp. Given the above locality-preserving mapping scheme, we can publish iDisques in a straightforward manner. Figure 1 shows an example of the mapping and publishing process. First, for each iDisque denoted by C* (meaning that it is centered at C), we map its cluster centroid C to Lp independent Chord keys ρ1(C), . . . , ρLp(C), using the aforementioned mapping scheme. Second, for each Chord key ρi(C), we invoke the Chord lookup service to find the peer which owns the Chord key, and send a publishing message containing the respective function ID i, the Chord key ρi(C), together

with the iDisque C* to that peer. As multiple keys might be assigned to the same peer under the Chord protocol, the upper bound on the cost of publishing an iDisque is Lp messages. When a publishing message is received, the indexing peer inserts the received data into its local data structure, a hash map indexed by the composite key (ρi(C), i). This data structure provides efficient local lookup given a Chord key and a mapping function ID.
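A hedged sketch of the publishing step and of an indexing peer's local structure follows; the chord.lookup call and the messaging layer are abstracted away, and all names are illustrative:

```python
def publish_idisque(idisque, rhos, chord):
    """Publish one iDisque to up to Lp indexing peers.

    rhos is the shared function sequence Psi_Lp; chord.lookup(key)
    is assumed to return the peer responsible for a Chord key.
    """
    for i, rho in enumerate(rhos):          # i is the function ID
        key = rho(idisque.center)
        peer = chord.lookup(key)
        peer.insert_index_entry((key, i), idisque)

class IndexingPeer:
    def __init__(self):
        # local hash map keyed by the composite key (Chord key, function ID)
        self.index = {}

    def insert_index_entry(self, composite_key, idisque):
        self.index.setdefault(composite_key, []).append(idisque)
```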

6 Query Processing in iDISQUE

6.1 Basic Query Processing Scheme

Query processing is analogous to the publishing process, as shown in Figure 1. When a KNN query at point q is issued, we map it to Lq Chord keys, called the query Chord key set, using an LSH-based mapping function sequence ΨLq = {ρ1, . . . , ρLq}, which is a prefix of the sequence ΨLp (the first Lq of the LSH-based mapping functions pre-defined in the system). Peers in charge of these Chord keys may contain colliding iDisques (those mapped to the same Chord key as the query point via the same mapping function ρi); we refer to such indexing peers as candidate peers. The querying node then distributes the query, along with its Chord key and function ID, to each respective candidate peer to look for colliding iDisques.

Based on the mapping scheme proposed in Section 5.1, the probability of collision between an iDisque centroid C and the query q depends on two factors: (1) the distance between q and C, and (2) the number of function IDs on which the two points may collide. Assuming the publishing mapping degree of an iDisque centered at C is Lp, the latter equals min(Lp, Lq). Therefore, the probability of collision is estimated as 1 − (1 − p1^k)^min(Lp, Lq). In the query process, we can tune the coverage of the query by varying Lq. If Lq is large enough, all iDisques close to the query point in the data space have a chance to collide with the query on at least one function ID i, where i ≤ Lp. In contrast, if Lq is small, the query cost is reduced, but its accuracy is inevitably impaired.

The above procedure supports progressive query refinement at run time. The querying peer maintains a top-K result queue sorted by distance to the query point, and distributes the query progressively to the candidate peers in the order ρ1, ρ2, . . . , ρLq. Each time a new candidate iDisque is returned, the querying peer asks the data owner of that iDisque for its top-K data points and merges the local results into the result queue. Meanwhile, the querying peer maintains a temporary list of data owners it has already asked for data, so that replicas of the same iDisque are simply discarded.

The query processing at each candidate peer is straightforward. Upon receiving a query, the candidate peer looks up its local hash map structure (described in Section 5.2) for colliding iDisques. If the query contains multiple function IDs and Chord keys (in case multiple functions are mapped to the same peer), multiple lookups are performed. Since some of the resulting colliding iDisques may not be really close to the query point (due to false positives introduced by the LSH mapping), all colliding iDisques further undergo a distance check in the data space. The candidate iDisques are then sent

back to the querying peer. The pseudocode of the basic query algorithm is presented in Algorithm 6.1.

Algorithm 6.1: QueryProcessing(q, K, Lq)
Input: q, the query object; K, the number of nearest neighbors;
       Lq, the query mapping degree
queue := ∅
Construct ΨLq from ΨLp
Create the query Chord key set using ΨLq
for each key_i in the query Chord key set do
    search indexing peer lookup(key_i)
    for each iDisque returned do
        search data owner iDisque.IP
        merge the local top-K results into queue
return queue
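On the candidate peer's side, the local lookup with the final distance check might look like the following sketch. The intersection predicate used to filter LSH false positives is our assumption; the paper only states that a distance check in data space is performed:

```python
import math

def local_lookup(peer_index, query_point, probes, search_radius):
    """Return colliding iDisques for a query at one candidate peer.

    probes is a list of (chord_key, function_id) pairs received with the
    query; search_radius bounds our illustrative distance check: the query
    sphere must intersect the cluster sphere for the iDisque to survive.
    """
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    results = []
    for composite_key in probes:
        for idisque in peer_index.get(composite_key, []):
            if dist(query_point, idisque.center) <= idisque.r_max + search_radius:
                results.append(idisque)
    return results
```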

6.2 Multi-Probing in iDISQUE

In the basic query processing algorithm above, one problem is that the query accuracy is bounded by min(Lp, Lq). When the publishing mapping degree Lp is small, the query accuracy is restricted; the publishing mapping degree therefore has to be large to achieve high query accuracy, resulting in a large index size and a high cost of index publishing. A recent study on a technique called multi-probing [15] has shed some light on this problem. The multi-probe query technique employs a probing sequence to look up multiple buckets which have a high probability of colliding with the target data. As a result, the method requires significantly fewer hash functions to achieve the same search quality as the conventional LSH scheme.

Inspired by this idea, we propose a distributed multi-probe technique in iDISQUE. For each mapping function ρi, the multi-probe technique creates multiple query keys instead of one, all of which are likely to collide with the keys of candidate iDisques. The multiple keys are generated in a query-directed manner. For a query point q, we first obtain the basic query keys ρ1(q), . . . , ρLq(q). Second, for each basic query key ρi(q), we employ the multi-probe method proposed in [15] to generate T extended query keys ρi^(1)(q), ρi^(2)(q), . . . , ρi^(T)(q), in descending order of their probability to contribute to the query. The parameter T determines the number of multi-probings for each basic query key, so the total number of query keys is (T + 1) · Lq. Third, we look up the indexing peers of the basic and extended query keys in the order ρ1(q), ρ2(q), . . . , ρLq(q), ρ1^(1)(q), ρ2^(1)(q), . . . , ρLq^(1)(q), . . . , ρ1^(j)(q), ρ2^(j)(q), . . . , ρLq^(j)(q), . . . . Users are allowed to determine how far the multi-probing should proceed according to the desired search accuracy and query cost.

Utilizing the above technique, we are able to achieve high accuracy even when the publishing mapping degree of iDisques is very small. In addition, we can always improve the search accuracy at the expense of more query keys (or higher query costs). Therefore, the search quality will not be restricted by the publishing mapping degree, which is specified by the data owners during index creation.
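The probing order can be sketched as follows; the extended_keys helper, which stands in for the query-directed key generation of [15], is a hypothetical function:

```python
def probe_order(q, rhos, t, extended_keys):
    """Yield query keys in rounds: first the Lq basic keys, then the j-th
    extended key of every basic key, for j = 1..T.

    extended_keys(q, rho, t) is assumed to return the t extended keys of
    one basic key, in descending probability of contributing to the query
    (per the multi-probe method of [15]).
    """
    yield from (rho(q) for rho in rhos)            # round 0: basic keys
    extended = [extended_keys(q, rho, t) for rho in rhos]
    for j in range(t):                             # rounds 1..T
        for ext in extended:
            yield ext[j]
```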

7 Load-Balancing

The iDISQUE framework can easily adopt load-balancing techniques. In this section, we propose a technique to handle load balancing. There are two possible categories of load imbalance in iDISQUE: imbalance of data storage among data owners, and imbalance of index size among indexing peers. The former can be addressed by a simple data migration or replication scheme such as LAR [9]. In iDISQUE, we focus on the problem of load imbalance among indexing peers, where the load of an indexing peer is defined as the number of iDisques it maintains.

In the ideal case, all iDisques are distributed evenly across the entire network. Under the locality-preserving mapping scheme, however, the assignment of iDisques might be skewed among the indexing peers. To address this problem, we restrict the maximum number of iDisques published to an indexing peer by defining a capacity threshold τ for each indexing peer, according to its storage or computing capacity. If the number of iDisques maintained by an indexing peer reaches the threshold, the peer is overloaded and rejects any request to insert a new iDisque. The publishing peer (data owner) then utilizes the multi-probe technique to discover a new indexing peer which hopefully maintains fewer iDisques. The discovery process is as follows: for an iDisque C* and a mapping function ρi, the basic publishing key ρi(C) and a series of extended publishing keys are created by the multi-probe technique, the extended keys ordered by descending probability of colliding with the basic publishing key. If the indexing peer mapped by the basic publishing key is overloaded, we probe the peers mapped by the extended keys one by one, in order of descending probability, until an indexing peer accepts C*.

When processing a similarity query, a load-balanced system must also use the multi-probe technique: if a query misses on a basic publishing key, it proceeds to the extended keys. The message cost of a query is still determined by the query mapping degree Lq and the parameter T. In the rest of this paper, we assume that multi-probing is enabled whenever the proposed load-balancing technique is in effect.
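A sketch of the capacity-constrained publishing loop follows; extended_keys again stands in for the multi-probe key generator, and the policy of giving up once the extended keys are exhausted is our assumption:

```python
def publish_with_capacity(idisque, rho, func_id, chord, tau, extended_keys):
    """Publish one replica of an iDisque under the capacity threshold tau.

    If the peer owning the basic publishing key already maintains tau
    iDisques, probe the peers of the extended keys one by one, in
    descending collision probability, until one accepts the replica.
    """
    for key in [rho(idisque.center)] + extended_keys(idisque.center, rho):
        peer = chord.lookup(key)
        if peer.load() < tau:                     # peer.load() is assumed
            peer.insert_index_entry((key, func_id), idisque)
            return True
    return False  # every probed peer was overloaded (assumed policy)
```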

8 Experimental Results

In this section, we evaluate the performance of the iDISQUE framework. We implement both the iDISQUE-Naive indexing scheme and the locality-preserving indexing scheme, which we refer to as iDISQUE-LSH. For comparison, we implement a fully decentralized high-dimensional similarity search P2P system based on spatial partitioning. For a d-dimensional space, we split the data space evenly on each dimension into s equal-length segments, obtaining a d-dimensional grid of s^d space partitions. For each partition we create an index

entry containing the IPs of the peers that own data records in that partition. All partitions, in Z-order, are assigned to the peers on the Chord overlay in a round-robin manner. We call this indexing scheme SPP (spatial partitioning). When a KNN query is issued to SPP, we generate a sphere in the d-dimensional space centered at the query point with a predefined radius. Indexing peers containing partitions that overlap the sphere are queried, and the results are returned to the querying peer. If the number of results is smaller than K, the sphere is enlarged iteratively to overlap more partitions. The SPP scheme provides accurate query results.

We conduct the following experiments to evaluate iDISQUE: (1) a comparison among the three indexing schemes, namely SPP, iDISQUE-Naive, and iDISQUE-LSH; (2) tuning various parameters in iDISQUE-LSH; (3) load-balancing; and (4) scalability.

8.1 Experiment Setup

The effectiveness of approximate similarity search is measured by recall and error rate, as defined in [8]. Given a query object q, let I(q) be the set of ideal answers (i.e., the K nearest neighbors of q) and let A(q) be the set of actual answers; then recall is defined as |A(q) ∩ I(q)| / |I(q)|. The error rate is defined as (1 / (|Q| · K)) · Σ_{q∈Q} Σ_{i=1}^{K} d_i^# / d_i^*, where d_i^# is the distance of the i-th nearest neighbor found by iDISQUE, and d_i^* is the distance of the true i-th nearest neighbor. Since the error rate does not add new insight over recall, we do not report it in our experiments due to the space limit.

One real dataset and one synthetic dataset are used to evaluate the iDISQUE framework. The real dataset (denoted by Covertype) is a subset containing 500k points selected randomly from the original Covertype dataset [2], which consists of 581k 55-dimensional instances of forest cover type data. The synthetic dataset is generated from the Amsterdam Library of Object Images set [1], which contains 12000 64-dimensional vectors of color histograms. We create new data points by displacing the original data points by a small amount (0.005) in a few random directions, obtaining a synthetic dataset of 100k points (denoted by ALOI). For both datasets, 1000 queries are drawn randomly from the data, and for each query a peer initiator is randomly selected. As most experimental results show similar trends for the two datasets, unless otherwise stated we only present the results for the Covertype dataset, due to the limit of space.

In our experiments, the default network size N is 1000. The dataset is horizontally partitioned evenly among the peers, and each node contains 100 data points by default. The splitting number of SPP on each dimension is s = 4. We experimented with different parameter values for the locality-preserving mapping and picked the ones that give the best performance: k = 15 and W = 3.0 for both datasets. We also conducted experiments with different K values (K = 1, 5, 10) for the KNN queries; the default K value is 5. For simplicity, we assume each iDisque C* has the same publishing mapping degree, and the default publishing and query mapping degrees (Lp and Lq) are 10. Since the current iDISQUE framework is based on Chord, the query processing cost is measured by the number of lookup messages to the indexing peers. Each query is executed 10 times, and the average results are presented.
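For reference, the two metrics translate directly into code; the following sketch is our transcription of the definitions and assumes per-query lists of result ids and neighbor distances:

```python
def recall(actual_ids, ideal_ids):
    """|A(q) intersect I(q)| / |I(q)| for one query."""
    return len(set(actual_ids) & set(ideal_ids)) / len(ideal_ids)

def error_rate(actual_dists, ideal_dists):
    """Mean ratio d_i# / d_i* over the K nearest neighbors of all queries.

    actual_dists[q][i] is the distance of the i-th NN found by the system;
    ideal_dists[q][i] is the distance of the true i-th NN.
    """
    total, count = 0.0, 0
    for a_q, i_q in zip(actual_dists, ideal_dists):
        for a, i in zip(a_q, i_q):
            total += a / i
            count += 1
    return total / count
```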

8.2 Comparative Study - Experiment 1

In this experiment, we first compare the total storage space of the proposed indexing schemes (with Lp = 1 for the iDISQUE schemes). Note that the index size of iDISQUE-Naive is the same as that of iDISQUE-LSH. The results in Table 1 indicate that the storage space of iDISQUE is much larger than that of SPP, although the former is still acceptable compared to the size of the datasets. It must be noted that as the value of Lp increases, the storage cost increases linearly, which justifies the need for the multi-probe technique.

Table 1: Results of the storage space

Dataset           Data size (KB)   SPP index size (KB)   iDISQUE index size (KB)
Covertype (500k)  214844           248                   15877
ALOI (100k)       50000            132                   4895


We now compare the distribution of the index storage space among peers for the three indexing schemes. For fairness of comparison, the results are presented as percentages. Figure 2(a) shows the cumulative distribution of index storage space among 1000 peers using the Covertype dataset. The figure indicates that iDISQUE-Naive provides the most uniform distribution of index storage space, while the result of iDISQUE-LSH is also satisfactory. However, the results of SPP indicate that nearly 95% of its index storage is assigned to 5% of the nodes, which partly explains the small index size of SPP. The skewed index of SPP would inevitably cause a serious load imbalance; a scheme based on spatial partitioning is therefore not truly viable for querying high-dimensional data.

Fig. 2: Results of Exp. 1. (a) Comparison of cumulative distributions of index size (percentage of indices vs. percentage of nodes, for iDISQUE-LSH, iDISQUE-Naive, SPP, and the ideal distribution); (b) comparison of search effectiveness (recall vs. Lq message cost, for iDISQUE-LSH and iDISQUE-Naive).

To compare the effectiveness and efficiency of the three schemes, we plot the accuracy of iDISQUE-Naive and iDISQUE-LSH against their message costs in Figure 2(b), given a publishing mapping degree of 40. The message cost of a query is determined by the query mapping degree Lq; specifically, Lq is equal to the number of messages caused by a query when multi-probing is disabled. Note that the recall of SPP is not plotted, as it is always 1; the message cost of SPP is 35. Figure 2(b) indicates that the recall of iDISQUE-Naive increases almost linearly as the message cost grows.

Fig. 3: Results of Exp. 2. (a) Effect of δ on recall (for K = 1, 5, 10); (b) tuning Lq (recall vs. Lq message cost, for K = 1, 5, 10); (c) effect of multi-probing (recall vs. message cost, for no MP with Lp = 20 and MP with Lp = 10, 5, 2, 1).

It reaches 44% at a cost of 40 messages. In contrast, the recall of iDISQUE-LSH increases rapidly at first, reaching about 70% at 10 messages. This shows that the accuracy of iDISQUE-LSH is acceptable. Since iDISQUE-LSH is much more effective than iDISQUE-Naive at the same message cost, we focus on iDISQUE-LSH in the remaining experiments.

8.3 Performance Tuning - Experiment 2

Figure 3(a) shows the effect of δ on recall. We can see that the smaller the δ, the higher the recall. This is because a smaller δ leads to finer iDisques, which represent the actual data points with a better approximation. If δ = 0, all iDisques are singular; in this case the iDisques represent the real data, and the recall is 73% for K = 1. Similar trends are observed for K = 5 and K = 10. These results confirm our analysis in Section 5.2. A δ of 0.1 strikes a balance between the storage space of the iDisques and the accuracy of queries, so we set δ = 0.1 as the default value.

Figure 3(b) shows the effect of Lq on recall for K = 1, 5, 10. It can be observed that when Lq < 5, the recall increases rapidly; when Lq = 5, the recall is nearly 60% for K = 1. These results confirm the effectiveness of our tunable approximate similarity query framework.

Figure 3(c) shows the accuracy of iDISQUE-LSH with the distributed multi-probe technique enabled (MP denotes multi-probing). The accuracy of a conventional (non-multi-probing) scheme is also plotted. As shown in the figure, multi-probing with small publishing mapping degrees (Lp = 1, 2, 5, 10) achieves highly competitive accuracy at nearly the same cost as the conventional scheme, which has a larger publishing degree of 20. The recall of multi-probing with Lp = 10 is even marginally better than that of the conventional scheme when the message cost is greater than 12. As a smaller Lp value leads to a smaller index size and lower index publishing costs, the multi-probe technique can reduce the index size without compromising the quality of queries.

8.4 Load-Balancing - Experiment 3

In this experiment, we study the effect of the proposed load-balancing technique, which is based on multi-probing. We vary the capacity threshold τ and illustrate the publishing cost, load distribution, and query accuracy in Figure 4. The threshold τ is

measured in units of the average load of all peers without load-balancing (i.e., a τ value of 2 means twice the average load). We also plot the same results for iDISQUE without load-balancing. The results indicate that as τ increases, the publishing cost drops rapidly because less multi-probing is required; however, the load becomes more skewed as the restriction is lifted. Meanwhile, as τ increases, more iDisques can be found via the extended query keys, producing higher accuracy. To obtain a balanced load while producing quality results without introducing a high publishing cost, a τ value of 2.5 is a good choice.

Fig. 4: Results of Exp. 3. (a) Total publishing cost (K messages vs. τ, with and without load balancing); (b) cumulative distribution of indices (percentage of indices vs. percentage of nodes, for τ = 3.0, 2.5, 2.0, 1.5, the ideal distribution, and no load balancing); (c) effects on recall (recall vs. τ, with and without load balancing).

8.5 Scalability - Experiment 4

In this experiment, we evaluate the scalability of iDISQUE using five networks, in which the number of peers ranges from 1000 to 5000. We also create five randomly selected subsets of the Covertype dataset with different sizes (100k, 200k, . . . , 500k), and create an iDISQUE index for each subset in the respective network. Figure 5 shows the accuracy at various query message costs (Lq) for the different network scales. The results show that the network size has little impact on the accuracy of iDISQUE; the proposed scheme is therefore scalable in terms of effectiveness.

Fig. 5: Result of scalability (recall vs. number of nodes, from 1000 to 5000, for Lq = 1, 5, 10).

9 Conclusion

In this work, we proposed a framework called iDISQUE to support tunable approximate similarity queries of high-dimensional data in DHT networks. The iDISQUE framework

is based on a locality-preserving mapping scheme, and its query accuracy can be tuned by both the indexing and query costs. We also proposed a distributed multi-probe technique for iDISQUE to reduce its index size without compromising the effectiveness of queries. Load balancing among the indexing nodes was achieved with a novel technique based on multi-probing. The experimental results confirmed the effectiveness and efficiency of the proposed framework. For future work, we will look at data update strategies in the iDISQUE framework, and will also consider dynamic load-balancing techniques to handle skewness in queries.

References

1. The Amsterdam Library of Object Images homepage. http://staff.science.uva.nl/aloi/, 2008.
2. UCI KDD Archive. http://www.kdd.ics.uci.edu, 2008.
3. A. R. Bharambe, M. Agrawal, and S. Seshan. Mercury: Supporting scalable multi-attribute range queries. In SIGCOMM, 2004.
4. M. Cai, M. R. Frank, J. Chen, and P. A. Szekely. MAAN: A multi-attribute addressable network for grid information services. In GRID, 2003.
5. M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Symposium on Computational Geometry, 2004.
6. C. Doulkeridis, A. Vlachou, Y. Kotidis, and M. Vazirgiannis. Peer-to-peer similarity search in metric spaces. In VLDB, 2007.
7. P. Ganesan, B. Yang, and H. Garcia-Molina. One torus to rule them all: Multi-dimensional queries in P2P systems. In WebDB, pages 19–24, 2004.
8. A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.
9. V. Gopalakrishnan, B. Silaghi, B. Bhattacharjee, and P. Keleher. Adaptive replication in peer-to-peer systems. In ICDCS, 2004.
10. S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In SIGMOD, 1998.
11. P. Haghani, S. Michel, and K. Aberer. Distributed similarity search in high dimensions using locality sensitive hashing. In EDBT, 2009.
12. P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, 1998.
13. H. V. Jagadish, B. C. Ooi, K.-L. Tan, C. Yu, and R. Zhang. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM TODS, 2005.
14. H. V. Jagadish, B. C. Ooi, Q. H. Vu, R. Zhang, and A. Zhou. VBI-tree: A peer-to-peer framework for supporting multi-dimensional indexing schemes. In ICDE, 2006.
15. Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: Efficient indexing for high-dimensional similarity search. In VLDB, 2007.
16. S. Ratnasamy, P. Francis, M. Handley, R. M. Karp, and S. Shenker. A scalable content-addressable network. In SIGCOMM, 2001.
17. A. I. T. Rowstron and P. Druschel. Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In Middleware, 2001.
18. O. D. Sahin, F. Emekçi, D. Agrawal, and A. E. Abbadi. Content-based similarity search over peer-to-peer systems. In DBISP2P, 2004.
19. I. Stoica, R. Morris, D. R. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In SIGCOMM, pages 149–160, 2001.
20. T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: A new data clustering algorithm and its applications. Data Min. Knowl. Discov., 1997.
