Entrepôts, Représentation et Ingénierie des Connaissances
Laboratory publications

Feature Selection on Chinese Text Classification Using Character N-Grams

Author(s): Wei Z., Miao Duoqian, Chauchat J.-H., Zhong Caiming

Conference: 3rd International Conference on Rough Sets and Knowledge Technology (RSKT 08), Chengdu, China (Heidelberg, Germany, FR, 2008)
Conference proceedings: Springer, vol. p. 500–507 (2008)


In this paper, we perform Chinese text classification using an n-gram text representation on TanCorp, a new large corpus built specifically for Chinese text classification, which contains more than 14,000 texts divided into 12 classes. We use different n-gram feature sets (1- and 2-grams, or 1-, 2- and 3-grams) to represent documents. We compare different feature weights (absolute text frequency, relative text frequency, absolute n-gram frequency and relative n-gram frequency) and analyze the sparseness of the "document by feature" matrices in each case. We use the C-SVC classifier, an SVM algorithm designed for the multi-class classification task, and run our experiments on the TANAGRA platform. We found that the feature selection methods based on n-gram frequency (absolute or relative) always give better results and produce denser matrices.
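
As an illustration of the representation described in the abstract, the following is a minimal Python sketch of character n-gram extraction and of the four feature weights. It is not the authors' implementation: the experiments were run in TANAGRA, and the abstract does not give formulas for the weights. The definitions used here are assumptions: "text frequency" is read as the number of texts containing an n-gram, "n-gram frequency" as its total number of occurrences in the corpus, and the relative variants as those counts divided by the number of texts and by the total number of n-gram occurrences. The function names (char_ngrams, feature_weights, select_top) are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n_values=(1, 2)):
    """Return the character n-grams of `text` for each length in `n_values`."""
    return [text[i:i + n] for n in n_values for i in range(len(text) - n + 1)]


def feature_weights(texts, n_values=(1, 2)):
    """Compute four candidate weights per n-gram (assumed definitions)."""
    text_freq = Counter()   # number of texts that contain the n-gram
    ngram_freq = Counter()  # total occurrences of the n-gram in the corpus
    for text in texts:
        grams = char_ngrams(text, n_values)
        ngram_freq.update(grams)
        text_freq.update(set(grams))  # count each text at most once
    n_texts = len(texts)
    total_occurrences = sum(ngram_freq.values())
    return {
        gram: {
            "abs_text_freq": text_freq[gram],
            "rel_text_freq": text_freq[gram] / n_texts,
            "abs_ngram_freq": ngram_freq[gram],
            "rel_ngram_freq": ngram_freq[gram] / total_occurrences,
        }
        for gram in ngram_freq
    }


def select_top(weights, key, k):
    """Keep the k n-grams with the largest weight; ties broken alphabetically."""
    ranked = sorted(weights, key=lambda gram: (-weights[gram][key], gram))
    return ranked[:k]
```

For example, select_top(feature_weights(texts, (1, 2, 3)), "abs_ngram_freq", 1000) would keep the 1,000 most frequent 1-, 2- and 3-grams of a list of strings texts; the cutoff of 1,000 is arbitrary and not taken from the paper.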

Comments: rskt08wmcz