Full metadata record
DC Field | Value | Language
dc.contributor.author | Elarian, Yousef | -
dc.contributor.author | Idris, Fayez | -
dc.description.abstract | Arabic is a cursive script, which makes character segmentation difficult. We therefore propose a unit that is discrete by nature, namely the connected component, for Arabic text recognition. Any system that uses this unit needs a lexicon of valid Arabic connected components. Here, we produce and analyze a comprehensive lexicon of connected components. A lexicon can be extracted from corpora or synthesized from morphemes; we follow both approaches and merge their results. Generating a lexicon of connected components also requires extra tokenization and point-normalization steps to keep the size of the lexicon tractable. We produce a lexicon of surface words, reduce it to a lexicon of connected components, and finally to a lexicon of point-normalized connected components. The lexicon of point-normalized connected components contains 684,743 entries, a decrease of 97.17% relative to the word lexicon. | en
dc.relation.ispartof | First International Workshop on Frontiers in Arabic Handwriting Recognition, 2010 | en
dc.subject | Arabic optical text recognition | en
dc.subject | connected components | en
dc.subject | holistic recognition | en
dc.subject | lexicon generation | en
dc.title | A Lexicon of Connected Components for Arabic Optical Text Recognition | en
dcterms.accessRights | open access | -
Appears in Collections: 2010 - First International Workshop on Frontiers in Arabic Handwriting Recognition
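
The reduction described in the abstract (surface words, then connected components, then point-normalized connected components) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The set of non-joining letters, the dot-removal mapping, and the function names `connected_components`, `point_normalize`, `build_lexicon`, and `percent_decrease` are all assumptions made for illustration, and the input is assumed to be undiacritized Arabic text.

```python
# Letters that never join to the following letter, so a connected
# component ends after them (simplified, assumed set).
RIGHT_JOINING_ONLY = set("اآأإدذرزوؤةى")
HAMZA = "ء"  # assumed to join to neither neighbour

# Assumed dot-removal mapping: dotted letter -> dotless skeleton.
# A full mapping would cover more letters and positional forms.
DOTLESS = str.maketrans({
    "ب": "ٮ", "ت": "ٮ", "ث": "ٮ",
    "ج": "ح", "خ": "ح",
    "ذ": "د", "ز": "ر",
    "ش": "س", "ض": "ص", "ظ": "ط", "غ": "ع",
})


def connected_components(word):
    """Split a word into connected components at non-joining letters."""
    parts, current = [], ""
    for ch in word:
        if ch == HAMZA:
            # A stand-alone hamza closes the running component
            # and forms a component of its own.
            if current:
                parts.append(current)
            parts.append(ch)
            current = ""
            continue
        current += ch
        if ch in RIGHT_JOINING_ONLY:
            parts.append(current)
            current = ""
    if current:
        parts.append(current)
    return parts


def point_normalize(component):
    """Replace dotted letters by their dotless skeletons."""
    return component.translate(DOTLESS)


def build_lexicon(words):
    """Set of point-normalized connected components found in the words."""
    return {point_normalize(c) for w in words for c in connected_components(w)}


def percent_decrease(before, after):
    """Percent decrease from one lexicon size to another."""
    return 100.0 * (before - after) / before
```

Under these assumptions, the words باب and تاب split into different components but collapse to the same two point-normalized entries, which is the kind of merging that shrinks the lexicon.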

Files in This Item:
File | Description | Size | Format
Elarian_2.pdf | DNB | 199.36 kB | Adobe PDF

This item is protected by original copyright
