Publication Details

Raja, F, Amiri, H, Tasharofi, S, Sarmadi, M, Hojjat, H and Oroumchian, F, Evaluation of part of speech tagging on Persian text, Proceedings of the Second Workshop on Computational Approaches to Arabic Script-based Languages, Stanford, California, 21-22 July 2007.


Part-of-speech (POS) tagging is one of the fundamental tasks in natural language processing. A POS tagger is a piece of software that reads text in some language and assigns a part-of-speech tag to each word. Our main interest in this research was to see how easily methods developed for a language such as English can be applied to a new and different language such as Persian (Farsi), and how well such approaches perform. This paper presents an evaluation of several POS tagging methods on Persian text: a statistical tagging method, a memory-based tagging approach, and two versions of Maximum Likelihood Estimation (MLE) tagging. The two MLE versions differ in how they handle unknown words. We also demonstrate the value of simple heuristics and post-processing in improving the accuracy of these methods. The experiments were conducted on a manually POS-tagged Persian corpus of over two million tagged words. The results are encouraging and comparable with those reported for other languages such as English, German, or Spanish.
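
As a rough illustration of what an MLE tagger does, the following minimal Python sketch assigns each known word the tag it carried most often in training data. It is not the implementation evaluated in the paper. The function names, the transliterated toy tokens, and the tag labels are all hypothetical. The abstract does not say how the two MLE versions treat unknown words, so the fallback used here (the most frequent tag overall) is only an assumed placeholder.

```python
from collections import Counter, defaultdict


def train_mle_tagger(tagged_sentences):
    """Count how often each word carries each tag in the training data."""
    word_tag_counts = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_tag_counts[word][tag] += 1
            tag_counts[tag] += 1
    # For each known word, keep the tag it was seen with most often.
    best_tag = {w: c.most_common(1)[0][0] for w, c in word_tag_counts.items()}
    # ASSUMPTION: unknown words get the most frequent tag overall.
    # The paper's two unknown-word strategies are not described in the abstract.
    default_tag = tag_counts.most_common(1)[0][0]
    return best_tag, default_tag


def tag(words, best_tag, default_tag):
    """Tag each word with its most likely tag, or the fallback if unseen."""
    return [(w, best_tag.get(w, default_tag)) for w in words]


# Hypothetical toy corpus (transliterated tokens, made-up tag labels).
TOY_CORPUS = [
    [("ketab", "N"), ("ra", "P"), ("khand", "V")],
    [("ketab", "N"), ("khub", "ADJ"), ("ast", "V")],
    [("mard", "N"), ("ketab", "N"), ("ra", "P"), ("did", "V")],
]

best_tag, default_tag = train_mle_tagger(TOY_CORPUS)
print(tag(["ketab", "ra", "sib"], best_tag, default_tag))
```

Here "sib" never appears in the toy corpus, so it receives the fallback tag. A tagger of this kind would normally be scored by comparing its output with held-out, manually tagged text.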