Entrepôts, Représentation et Ingénierie des Connaissances
- ETL-Text: Extract-Transform-Load Processes for Textual Data Warehousing

Author(s): Aknouche R.(Corresp.), Asfari O.(Corresp.), Bentayeb F., Boussaid O.

Conference: EPIA 2013 (16th Portuguese Conference on Artificial Intelligence) (Azores, PT, 2013-09-09)
Conference proceedings: Advances in Artificial Intelligence, EPIA 2013 16th Portuguese Conference on Artificial Intelligence, vol. ISBN: 978-989-95489-1-6 p.308-319 (2013)

Ref HAL: hal-00911861_v1

Building the ETL (Extract-Transform-Load) process is one of the largest tasks in constructing a data warehouse. ETL processes have received little research attention, owing to their difficulty and to the lack of a formal model for representing the ETL activities that map incoming data from different sources into a format suitable for loading into the warehouse. A main problem in warehousing multidimensional text databases is handling the content of their text cells. In this paper, we propose a model for textual data warehouse ETL processes called ETL-Text. It combines classical data warehousing tasks, information retrieval (IR) techniques, and information processing, in particular language modeling. Our approach uses Wikipedia as an external knowledge source to extract the semantics of textual documents. To validate our approach, we developed a prototype composed of several processing modules that illustrate the different ETL-Text processes, and we used the 20 Newsgroups corpus for our experiments.
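
The abstract does not specify the ETL-Text modules in detail, so the following is only a generic sketch of how an extract-transform-load pass over text with a language-modeling step might look. All function names, the Jelinek-Mercer smoothing choice, and the `lam` and `top_k` parameters are assumptions made for illustration, not the authors' implementation, and the Wikipedia-based semantic step is not covered.

```python
import re
from collections import Counter


def extract(raw_docs):
    """Extract: lowercase each raw document and split it into alphabetic terms."""
    return {doc_id: re.findall(r"[a-z]+", text.lower())
            for doc_id, text in raw_docs.items()}


def transform(tokens_by_doc, lam=0.5):
    """Transform: build one smoothed unigram language model per document.

    Assumption: Jelinek-Mercer smoothing, i.e. a linear mix of the document
    model and the collection model, weighted by the hypothetical `lam`.
    """
    collection = Counter()
    for tokens in tokens_by_doc.values():
        collection.update(tokens)
    total = sum(collection.values())

    models = {}
    for doc_id, tokens in tokens_by_doc.items():
        counts = Counter(tokens)
        n = len(tokens)
        models[doc_id] = {
            term: lam * (counts[term] / n if n else 0.0)
                  + (1 - lam) * collection[term] / total
            for term in collection
        }
    return models


def load(models, top_k=3):
    """Load: flatten models into fact rows (doc_id, term, weight).

    Only the top_k highest-weighted terms per document are kept; ties are
    broken alphabetically so the output is deterministic.
    """
    rows = []
    for doc_id, model in models.items():
        ranked = sorted(model.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        rows.extend((doc_id, term, weight) for term, weight in ranked)
    return rows
```

Calling `load(transform(extract(docs)))` on a dictionary mapping document identifiers to raw text yields rows that could populate a term-weight fact table. A real textual warehouse would also need dimension tables and a persistent store, which this sketch omits.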