Publication Details

Zintouni, A., Damankesh, A., Barakati, F., Atari, M., Watfa, M. & Oroumchian, F. 2010, 'Corpus-based Arabic stemming using N-grams', in The Sixth Asia Information Retrieval Societies Conference (AIRS 2010), 1-3 Dec 2010, Taiwan University, Taipei, Taiwan, Lecture Notes in Computer Science, vol. 6458, pp. 280-289.


In highly inflected languages such as Arabic, stemming improves text retrieval performance by reducing word variants. We propose a change to the corpus-based stemming approach that Xu and Croft proposed for English and Spanish, in order to stem Arabic words. In the first stage, we generate the conflation classes by clustering 3-gram representations of the words found in only 10% of the data. In the second stage, these clusters are refined using different similarity measures and thresholds. We conducted retrieval experiments using raw data, the Light-10 stemmer, and 8 different variations of the similarity measures and thresholds, and compared the results. The experiments show that 3-gram stemming using the Dice distance for clustering and the EM similarity measure for refinement performs better than no stemming, but slightly worse than the Light-10 stemmer. Our method could potentially outperform the Light-10 stemmer if more text is sampled in the first stage.
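
As a rough illustration of the first-stage idea, the sketch below computes a Dice distance between the character 3-gram sets of two words. It is a minimal example under stated assumptions, not the authors' implementation: the function names (`char_ngrams`, `dice_similarity`, `dice_distance`) are hypothetical, the words are not padded, duplicate n-grams are collapsed into a set, and the abstract does not specify the clustering procedure or thresholds, so none are shown.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of `word`.

    Assumption: no boundary padding; a word shorter than n is kept whole.
    """
    if len(word) < n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice_similarity(a, b, n=3):
    """Dice coefficient between the n-gram sets of two words: 2|A ∩ B| / (|A| + |B|)."""
    grams_a = char_ngrams(a, n)
    grams_b = char_ngrams(b, n)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def dice_distance(a, b, n=3):
    """Dice distance, taken here as 1 minus the Dice coefficient."""
    return 1.0 - dice_similarity(a, b, n)


if __name__ == "__main__":
    # Words sharing a stem share many 3-grams, so their distance is small.
    print(dice_distance("stemming", "stemmer"))
    print(dice_distance("stemming", "retrieval"))
```

Words whose distance falls below a chosen threshold would be candidates for the same conflation class; the threshold itself is a tuning parameter that this sketch leaves open.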


