Entrepôts, Représentation et Ingénierie des Connaissances
- Integration Process for Multidimensional Textual Data Modeling

Author(s): Aknouche R.(Corresp.), Asfari O., Bentayeb F., Boussaid O.

Conference: 1st International Workshop in Software Evolution and Modernization SEM / ENASE 2013 (Angers, FR, 2013-07-04)
Conference proceedings: Proceedings of SEM / ENASE 2013, vol. p.119-126 (2013)

HAL ref: hal-00911862_v1

In this paper, we propose an original approach to the text warehousing process. It is based on a decision-support architecture that combines classical data warehousing tasks with information retrieval (IR) techniques. We first propose a new ETL process, named ETL-Text, for textual data integration. We then present a new Text Warehouse Model, denoted TWM, which takes into account both the structure and the semantics of the textual data. TWM is associated with new dimension types, including a metadata dimension and a semantic dimension. In addition, we propose a new analysis measure based on the language models widely used in IR. Our approach also relies on Wikipedia as an external knowledge source to extract the semantics of the textual documents. To validate our approach, we developed a prototype composed of several processing modules that illustrate the different steps of ETL-Text, and we used the 20 Newsgroups corpus for our experiments.
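The abstract does not define the analysis measure itself, so the following is only a minimal sketch of the kind of language model commonly used in IR: a unigram query-likelihood score with Dirichlet smoothing. The function name `query_likelihood`, the smoothing parameter `mu`, and the toy documents are all assumptions made for illustration; they are not taken from the paper and should not be read as the authors' measure.

```python
import math
from collections import Counter


def query_likelihood(query, doc, collection, mu=10.0):
    """Return log P(query | doc) under a unigram language model.

    Dirichlet smoothing (parameter mu, a hypothetical default here) mixes
    the document's term counts with collection-wide term probabilities.
    """
    doc_tf = Counter(doc)            # term frequencies in the document
    coll_tf = Counter(collection)    # term frequencies in the whole collection
    doc_len = len(doc)
    coll_len = len(collection)

    score = 0.0
    for term in query:
        p_coll = coll_tf[term] / coll_len
        if p_coll == 0.0:
            continue  # term never seen in the collection: ignore it
        p_term = (doc_tf[term] + mu * p_coll) / (doc_len + mu)
        score += math.log(p_term)
    return score


# Toy documents, invented for illustration only.
docs = {
    "d1": "the graphics card renders the image".split(),
    "d2": "the hockey team won the game".split(),
}
collection = [term for tokens in docs.values() for term in tokens]
query = "graphics image".split()

scores = {name: query_likelihood(query, tokens, collection)
          for name, tokens in docs.items()}
best = max(scores, key=scores.get)
```

In a warehouse setting, a score of this kind could in principle be aggregated over the documents of a cell to rank them against a query, but how TWM actually defines and aggregates its measure is described in the paper, not here.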