Efficient Similarity-based Declustering Techniques for Keyword-based Information Retrieval in the Streaming Data Model

S. Behl and R.M. Verma (USA)


Declustering, streaming model, text data, load balancing.


Multiple-disk architectures are an attractive approach to meeting high-performance I/O demands in I/O-intensive applications such as search engines, web servers, and information retrieval systems. This requires that the issues of dynamic load balancing and access parallelism be addressed, which is the goal of this paper. We address the problem of document declustering in a keyword-based information retrieval system for parallel architectures consisting of a single processor and multiple disks. We propose and experimentally evaluate four similarity-based methods for declustering documents, viz., set, multiset, vector, and Euclidean. Interestingly, our results show that for single-keyword queries as well as Boolean AND queries, the set and multiset methods generally outperform the vector and Euclidean methods, with set being the best for the so-called simple plan. We also introduce a highest-frequency-first retrieval scenario and compare the methods under this scenario, and find that the set and multiset methods are still generally superior to the other methods, with the multiset method outperforming the set method. We compare these methods with the (theoretically) optimal values, which are practically impossible to achieve. Finally, we approximated the multiset method using the harmonic mean and found that the results were slightly inferior to those of the multiset method, but still better than those of the vector and Euclidean methods.
