- Automatic Language Identification for Romance Languages using Stop Words and Diacritics

Author(s): Truica C.-O., Velcin J., Boicea Alexandru

Conference: International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC) (Timis, RO, 2015-09-21)

HAL ref: hal-01193158_v1

Automatic language identification is a natural language processing problem that tries to determine the natural language of a given content. In this paper we present a statistical method for automatic language identification of written text using dictionaries containing stop words and diacritics. We propose different approaches that combine the two dictionaries to accurately determine the language of textual corpora. This method was chosen because stop words and diacritics are very specific to a language; although some languages share similar words and special characters, they do not have all of them in common. The languages taken into account were Romance languages because they are very similar and it is usually hard to distinguish between them from a computational point of view. We have tested our method using a Twitter corpus and a news article corpus. Both corpora consist of UTF-8 encoded text, so the diacritics could be taken into account; when a text has no diacritics, only the stop words are used to determine its language. The experimental results show that the proposed method has an accuracy of over 90% for small texts and over 99.8% for large texts.
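The abstract does not give the scoring formulas, so the following is only a minimal sketch of how a stop-word dictionary and a diacritics dictionary might be combined. The tiny word and character lists, the names `STOP_WORDS`, `DIACRITICS` and `identify_language`, and the weighted-sum score with `w_stop` and `w_diac` are all illustrative assumptions, not the authors' dictionaries or method.

```python
import re
import unicodedata
from collections import Counter

# Toy dictionaries for illustration only (assumption: the paper's
# dictionaries are far larger and are not reproduced here).
STOP_WORDS = {
    "fr": {"le", "la", "les", "et", "est", "dans", "pour", "avec", "une", "des"},
    "es": {"el", "la", "los", "y", "es", "en", "para", "con", "una", "del"},
    "it": {"il", "la", "gli", "e", "in", "per", "con", "una", "della", "che"},
    "pt": {"o", "a", "os", "e", "em", "para", "com", "uma", "do", "da"},
    "ro": {"este", "pentru", "cu", "un", "o", "la", "care", "nu", "din", "sau"},
}
DIACRITICS = {
    "fr": set("àâçéèêëîïôùûü"),
    "es": set("áéíñóúü"),
    "it": set("àèéìòù"),
    "pt": set("áâãàçéêíóôõú"),
    "ro": set("ăâîșț"),
}


def identify_language(text, w_stop=1.0, w_diac=1.0):
    """Return the best-scoring language code, or None if nothing matches.

    Hypothetical scoring: a weighted sum of the share of tokens that are
    stop words and the share of non-ASCII letters that are known diacritics.
    """
    lowered = unicodedata.normalize("NFC", text).lower()
    tokens = re.findall(r"\w+", lowered)
    # Count only non-ASCII letters; plain ASCII text yields an empty counter,
    # so the decision then rests on stop words alone.
    marks = Counter(ch for ch in lowered if ch.isalpha() and not ch.isascii())
    total_marks = sum(marks.values())

    scores = {}
    for lang in STOP_WORDS:
        stop_hits = sum(1 for tok in tokens if tok in STOP_WORDS[lang])
        diac_hits = sum(n for ch, n in marks.items() if ch in DIACRITICS[lang])
        stop_score = stop_hits / len(tokens) if tokens else 0.0
        diac_score = diac_hits / total_marks if total_marks else 0.0
        scores[lang] = w_stop * stop_score + w_diac * diac_score

    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(identify_language("El niño está en la montaña con una canción"))
    print(identify_language("Aceasta este o carte pentru copii care nu au timp"))
```

The second call contains no diacritics, so the diacritic term is zero for every language and the result depends only on stop words, which mirrors the fallback described in the abstract. With dictionaries this small, ties and misclassifications on short inputs are to be expected.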