University of Wollongong
Corpus-based Arabic stemming using N-grams

journal contribution
posted on 2024-11-14, 17:22 authored by Abdelaziz Zintouni, Asma Damankesh, Foroogh Barakati, Maha Atari, Mohamed Watfa, Farhad Oroumchian
In highly inflected languages such as Arabic, stemming improves text retrieval performance by reducing the number of word variants. We propose a modification of the corpus-based stemming approach that Xu and Croft introduced for English and Spanish, adapting it to stem Arabic words. In the first stage, we generate conflation classes by clustering 3-gram representations of the words found in only 10% of the data. In the second stage, these clusters are refined using different similarity measures and thresholds. We conducted retrieval experiments using raw data, the Light-10 stemmer, and 8 different variations of the similarity measures and thresholds, and compared the results. The experiments show that 3-gram stemming using the Dice distance for clustering and the EM similarity measure for refinement performs better than no stemming, but slightly worse than the Light-10 stemmer. Our method could potentially outperform the Light-10 stemmer if more text is sampled in the first stage.
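The first-stage idea of comparing words through their 3-gram representations can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `char_ngrams`, `dice_similarity`, and `dice_distance` are hypothetical, the handling of words shorter than three characters is an assumption, and the example uses English words only for readability (the same code works on any Unicode string, including Arabic).

```python
def char_ngrams(word, n=3):
    """Return the set of contiguous character n-grams of a word.

    Assumption: a word shorter than n is represented by itself.
    """
    if len(word) < n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice_similarity(a, b, n=3):
    """Dice coefficient between the n-gram sets of two words:
    2 * |A & B| / (|A| + |B|)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def dice_distance(a, b, n=3):
    """Dice distance, taken here as 1 minus the Dice coefficient."""
    return 1.0 - dice_similarity(a, b, n)


# "stemming" has 6 distinct 3-grams, "stemmer" has 5, and they share 3
# ("ste", "tem", "emm"), so the similarity is 2 * 3 / (6 + 5) = 6/11.
similarity = dice_similarity("stemming", "stemmer")
```

Words whose pairwise distance falls below a chosen threshold would be candidates for the same conflation class; the clustering procedure, the thresholds, and the second-stage refinement are not specified in this abstract and are left out of the sketch.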

Citation

Zintouni, A., Damankesh, A., Barakati, F., Atari, M., Watfa, M. & Oroumchian, F. 2010, 'Corpus-based Arabic stemming using N-grams', in The Sixth Asia Information Retrieval Societies Conference (AIRS 2010), 1-3 Dec 2010, Taiwan University, Taipei, Taiwan, Lecture Notes in Computer Science, vol. 6458, pp. 280-289.

Journal title

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Volume

6458 LNCS

Pagination

280-289

Language

English

RIS ID

36062
