Posted on 2024-11-14, 17:22. Authored by Abdelaziz Zintouni, Asma Damankesh, Foroogh Barakati, Maha Atari, Mohamed Watfa, Farhad Oroumchian
In languages with high word inflation such as Arabic, stemming improves text retrieval performance by reducing word variants. We propose a modification of the corpus-based stemming approach that Xu and Croft proposed for English and Spanish, in order to stem Arabic words. In the first stage, we generate the conflation classes by clustering 3-gram representations of the words found in only 10% of the data. In the second stage, these clusters are refined using different similarity measures and thresholds. We conducted retrieval experiments using raw data, the Light-10 stemmer, and 8 different variations of the similarity measures and thresholds, and compared the results. The experiments show that 3-gram stemming using the Dice distance for clustering and the EM similarity measure for refinement performs better than using no stemming, but slightly worse than the Light-10 stemmer. Our method could potentially outperform the Light-10 stemmer if more text is sampled in the first stage.
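The first stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the greedy single-pass clustering strategy, the unpadded 3-grams, and the 0.5 distance threshold are all assumptions chosen for illustration, and the second-stage refinement with the EM similarity measure is not shown.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word (no padding; an assumption)."""
    if len(word) < n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice_similarity(a, b, n=3):
    """Dice coefficient between the n-gram sets of two words."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def dice_distance(a, b, n=3):
    """Dice distance, defined here as 1 minus the Dice coefficient."""
    return 1.0 - dice_similarity(a, b, n)


def cluster_words(words, threshold=0.5, n=3):
    """Greedy single-pass clustering (hypothetical strategy).

    Each word joins the first cluster whose first member lies within
    `threshold` Dice distance; otherwise it starts a new cluster.
    """
    clusters = []
    for w in words:
        for c in clusters:
            if dice_distance(w, c[0], n) <= threshold:
                c.append(w)
                break
        else:
            clusters.append([w])
    return clusters


if __name__ == "__main__":
    # Works on any Unicode strings, including Arabic words.
    print(cluster_words(["stemming", "stemmer", "retrieval"]))
    print(dice_similarity("كتاب", "الكتاب"))
```

Each resulting cluster plays the role of a candidate conflation class: all of its members would be mapped to one representative form at indexing time. In the setting the abstract describes, the word list would come from a 10% sample of the collection.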
Citation
Zintouni, A., Damankesh, A., Barakati, F., Atari, M., Watfa, M. & Oroumchian, F. 2010, 'Corpus-based Arabic stemming using N-grams', in The Sixth Asia Information Retrieval Societies Conference (AIRS 2010), 1-3 Dec 2010, Taiwan University, Taipei, Taiwan, Lecture Notes in Computer Science, vol. 6458, pp. 280-289.
Journal title
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)