Publication Details

Raja, F, Tasharofi, S and Oroumchian, F, Statistical POS tagging experiments on Persian text, Proceedings of the Second Workshop on Computational Approaches to Arabic Script-based Languages, Stanford, California, 21-22 July 2007.


Part-of-speech (POS) tagging is the process of marking up the words in a text with their corresponding parts of speech. It is an essential part of text and natural language processing. Many models and software tools exist for POS tagging in English and other European languages, but little work has been done on POS tagging of the Persian language, which is written in the Arabic script. In these experiments we examine how effective it is to simply apply a POS tagger developed for a language such as English to Persian. Although English and Persian are both Indo-European languages, they have subtle differences. This paper presents the creation of a POS-tagged corpus for evaluation purposes and the evaluation of a statistical tagging method on Persian text. The results show that an overall tagging accuracy between 96.4% and 96.9% is achievable without adding any Persian linguistic knowledge to the tagging process. In this study we also looked at the effect of the size of the training and test corpora on the accuracy of POS tagging.