University of Wollongong in Dubai - Papers

Corpus-based Arabic stemming using N-grams

Abdelaziz Zintouni, University of Wollongong in Dubai
Asma Damankesh, University of Wollongong in Dubai
Foroogh Barakati, University of Wollongong in Dubai
Maha Atari, University of Wollongong in Dubai
Mohamed Watfa, University of Wollongong in Dubai
Farhad Oroumchian, University of Wollongong - Dubai Campus

RIS ID

36062

Publication Details

Zintouni, A., Damankesh, A., Barakati, F., Atari, M., Watfa, M. & Oroumchian, F. 2010, 'Corpus-based Arabic stemming using N-grams', in The Sixth Asia Information Retrieval Societies Conference (AIRS 2010), 1-3 Dec 2010, Taiwan University, Taipei, Taiwan, Lecture Notes in Computer Science, vol. 6458, pp. 280-289.

Abstract

In languages with high word inflection such as Arabic, stemming improves text retrieval performance by reducing word variants. We propose a change to the corpus-based stemming approach that Xu and Croft proposed for English and Spanish, in order to stem Arabic words. In the first stage, we generate the conflation classes by clustering 3-gram representations of the words found in only 10% of the data. In the second stage, these clusters are refined using different similarity measures and thresholds. We conducted retrieval experiments using raw data, the Light-10 stemmer, and 8 different variations of the similarity measures and thresholds, and compared the results. The experiments show that 3-gram stemming using the Dice distance for clustering and the EM similarity measure for refinement performs better than using no stemming, but slightly worse than the Light-10 stemmer. Our method could potentially outperform the Light-10 stemmer if more text is sampled in the first stage.
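The first-stage idea of comparing words by their 3-gram representations can be illustrated with a minimal sketch. The abstract does not specify how the n-grams are built or how the clustering is run, so the choices below are assumptions made for illustration only: unpadded character trigrams, the standard Dice coefficient over trigram sets, and the hypothetical helper names `char_ngrams`, `dice_similarity`, and `dice_distance`. It is not the authors' implementation.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word (assumption: no padding)."""
    if len(word) < n:
        # Words shorter than n are represented by themselves.
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice_similarity(a, b, n=3):
    """Dice coefficient between the n-gram sets of two words: 2|A∩B| / (|A|+|B|)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def dice_distance(a, b, n=3):
    """Distance form of the Dice coefficient, usable by a clustering step."""
    return 1.0 - dice_similarity(a, b, n)


# Two related Arabic word forms: "كتاب" (book) and "كتابة" (writing).
# Trigram sets: {كتا, تاب} and {كتا, تاب, ابة}, so 2 shared out of 2 + 3.
print(dice_similarity("كتاب", "كتابة"))  # 0.8
```

Word pairs whose Dice distance falls below a chosen threshold would be candidates for the same conflation class; the threshold itself and the second-stage refinement are outside what this sketch covers.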

Link to publisher version (DOI)

http://dx.doi.org/10.1007/978-3-642-17187-1_27
