Entrepôts, Représentation et Ingénierie des Connaissances
Laboratory publications

- Shallow Text Clustering Does Not Mean Weak Topics: How Topic Identification Can Leverage Bigram Features

Author(s): Velcin J., Roche Mathieu, Poncelet Pascal

Conference: DMNLP: Data Mining and Natural Language Processing (Riva del Garda, IT, 2016-09-23)
Conference proceedings: 3rd Workshop on Interactions between Data Mining and Natural Language Processing 2016, co-located with the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2016), vol. 1646 p. (2016)

Ref HAL: lirmm-01362434_v1

Text clustering and topic learning are two closely related tasks. In this paper, we show that topics can be learned without strictly requiring an exact categorization. In particular, experiments on two real case studies with a vocabulary based on bigram features yield readable topics that cover most of the documents. Precision at 10 reaches up to 74% on a dataset of scientific abstracts with 10,000 features, which is 4% lower than with unigrams only, but the resulting topics are more interpretable.
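
As a rough illustration of the two ingredients the abstract names, the sketch below builds a bigram vocabulary capped at a fixed number of features and computes precision at k. It is not the authors' pipeline: the whitespace tokenization, the frequency-based feature selection, and the helper names (`bigrams`, `top_features`, `precision_at_k`) are assumptions made for this example, and precision at k is taken in its common sense (the share of the first k ranked items that are relevant), which may differ from the paper's exact protocol.

```python
from collections import Counter


def bigrams(tokens):
    """Return adjacent token pairs, joined by an underscore."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def top_features(documents, max_features):
    """Keep the most frequent bigrams across all documents as the vocabulary.

    Assumption: lowercased whitespace tokenization and raw frequency ranking.
    """
    counts = Counter()
    for doc in documents:
        counts.update(bigrams(doc.lower().split()))
    return [feature for feature, _ in counts.most_common(max_features)]


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the first k ranked items that belong to the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k
```

With a vocabulary limit such as the 10,000 features mentioned above, `top_features(documents, 10000)` would return the retained bigrams, and `precision_at_k(ranked_items, relevant_items, k=10)` would score a ranked list against a set of items judged relevant.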