Entrepôts, Représentation et Ingénierie des Connaissances
Publications of the ERIC lab


- Shallow Text Clustering Does Not Mean Weak Topics: How Topic Identification Can Leverage Bigram Features

Author(s): Velcin J., Roche Mathieu, Poncelet Pascal

Conference: DMNLP: Data Mining and Natural Language Processing (Riva del Garda, IT, 2016-09-23)
Proceedings: 3rd Workshop on Interactions between Data Mining and Natural Language Processing 2016, co-located with the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2016), vol. 1646 p. (2016)

Ref HAL: lirmm-01362434_v1

Text clustering and topic learning are two closely related tasks. In this paper, we show that topics can be learnt without strictly requiring an exact categorization. In particular, experiments performed on two real case studies with a vocabulary based on bigram features yield readable topics that cover most of the documents. Precision at 10 is up to 74% for a dataset of scientific abstracts with 10,000 features, which is 4% less than when using unigrams only but provides more interpretable topics.
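
The two notions the abstract relies on, a bigram vocabulary capped at a fixed number of features and precision at 10, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the whitespace tokenization, the frequency-based feature selection, and the reading of precision at k as "share of the k top-ranked documents carrying the expected label" are all assumptions made for illustration.

```python
from collections import Counter


def bigrams(tokens):
    """Return adjacent word pairs of a token list, joined with '_'."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def build_vocabulary(docs, max_features):
    """Keep the max_features most frequent bigrams over all documents.

    Assumption: lowercase + whitespace tokenization, frequency-based selection.
    """
    counts = Counter(bg for doc in docs for bg in bigrams(doc.lower().split()))
    return [bg for bg, _ in counts.most_common(max_features)]


def precision_at_k(ranked_labels, topic_label, k=10):
    """Fraction of the k top-ranked documents whose label equals topic_label.

    Assumption: ranked_labels lists the gold labels of documents, ordered by
    decreasing relevance to the topic being evaluated.
    """
    top = ranked_labels[:k]
    if not top:
        return 0.0
    return sum(1 for label in top if label == topic_label) / len(top)
```

For instance, `build_vocabulary(docs, 10_000)` would cap the vocabulary at the feature count quoted in the abstract, and `precision_at_k(labels, "some_topic")` would score one topic under the assumed definition.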