Entrepôts, Représentation et Ingénierie des Connaissances
Publications of the ERIC lab


- A New Test of Cluster Hypothesis Using a Scalable Similarity-Based Agglomerative Hierarchical Clustering Framework

Author(s): Wang X., Ah-Pine J., Darmont J.

Conference: CORIA 2017 | Conférence en Recherche d'Information et Applications et Rencontres des Jeunes Chercheurs en Recherche d'Information (Marseille, FR, 2017-03-29)

Ref HAL: hal-01504961_v1

The Cluster Hypothesis is the fundamental assumption behind the use of clustering in Information Retrieval. It states that similar documents tend to be relevant to the same query. Past research has extensively tested this hypothesis using agglomerative hierarchical clustering (AHC) methods. However, the conclusions are not consistent regarding the retrieval effectiveness of a given clustering method. The main limitation of these works is the poor scalability of AHC. In this paper, we extend our previous work to a new test of the cluster hypothesis by applying a scalable similarity-based AHC framework. In essence, the input pairwise cosine similarity matrix is sparsified using given threshold values to reduce memory usage and running time. Our experiments show that even when the similarity matrix is largely sparsified, retrieval effectiveness is retained for all tested methods. Moreover, two clustering methods, complete link and average link, do not always dominate the other methods, contrary to what past works reported.
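
The sparsification step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine` and `sparsified_similarities`, the toy document vectors, the threshold values, and the choice to keep pairs whose similarity is greater than or equal to the threshold are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def sparsified_similarities(vectors, threshold):
    """Keep only the pairs (i, j), i < j, whose cosine similarity reaches the threshold.

    Assumption: pairs below the threshold are simply not stored, so memory
    grows with the number of retained pairs rather than with n * n.
    """
    kept = {}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            s = cosine(vectors[i], vectors[j])
            if s >= threshold:
                kept[(i, j)] = s
    return kept

# Hypothetical toy term vectors for three documents.
docs = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]

loose = sparsified_similarities(docs, threshold=0.5)
strict = sparsified_similarities(docs, threshold=0.8)
```

With these toy vectors, only the pair (0, 1) survives the lower threshold and no pair survives the higher one, which shows how raising the threshold shrinks the input handed to the clustering step.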