Publication year: 1390 (Solar Hijri)

Venue: 7th Conference on Machine Vision and Image Processing

Number of pages: 5

Author(s):

Alireza Alaei – Department of Studies in Computer Science, University of Mysore, Mysore, 570006, India
P. Nagabhushan – Department of Studies in Computer Science, University of Mysore, Mysore, 570006, India
Umapada Pal – Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata–108, India

Abstract:

In document image analysis, and especially in handwritten document image recognition, standard datasets play a vital role in evaluating the performance of algorithms and comparing results obtained by different groups of researchers. In this paper, an unconstrained Persian handwritten text dataset (PHTD) is introduced. The PHTD contains 140 handwritten documents of three different categories written by 40 individuals. The total numbers of text-lines and words/subwords in the dataset are 1787 and 27073, respectively. In most of the PHTD documents, either overlapping or touching text-lines are present. The average number of text-lines per document in the PHTD is 13. Two types of ground truth, one based on pixel information and one based on content information, are generated for the dataset. With these two types of ground truth, the PHTD can be used in many areas of document image processing, such as sentence recognition/understanding, text-line segmentation, word segmentation, word recognition, and character segmentation. To provide a framework for other research, recent text-line segmentation results on this dataset are also reported.