A Lexicon of Connected Components for Arabic Optical Text Recognition
dc.contributor.author | Elarian, Yousef | |
dc.contributor.author | Idris, Fayez | |
dc.date.accessioned | 2011-01-12T16:18:36Z | |
dc.date.available | 2011-01-12T16:18:36Z | |
dc.date.issued | 2011-01-12 | |
dc.description.abstract | Arabic is a cursive script, which makes character segmentation difficult. Hence, we suggest a unit that is discrete in nature, viz. the connected component, for Arabic text recognition. A lexicon listing valid Arabic connected components is necessary for any system that uses such a unit. Here, we produce and analyze a comprehensive lexicon of connected components. A lexicon can be extracted from corpora or synthesized from morphemes. We follow both approaches and merge their results. In addition, generating a lexicon of connected components involves extra tokenization and point-normalization steps to keep the size of the lexicon tractable. We produce a lexicon of surface words, reduce it to a lexicon of connected components, and finally to a lexicon of point-normalized connected components. The lexicon of point-normalized connected components contains 684,743 entries, a decrease of 97.17% relative to the word lexicon. | en |
dc.identifier.uri | http://hdl.handle.net/2003/27561 | |
dc.identifier.uri | http://dx.doi.org/10.17877/DE290R-14627 | |
dc.language.iso | en | |
dc.relation.ispartof | First International Workshop on Frontiers in Arabic Handwriting Recognition, 2010 | en |
dc.subject | Arabic optical text recognition | en |
dc.subject | connected components | en |
dc.subject | holistic recognition | en |
dc.subject | lexicon generation | en |
dc.subject.ddc | 004 | |
dc.title | A Lexicon of Connected Components for Arabic Optical Text Recognition | en |
dc.type | Text | |
dc.type.publicationtype | conferenceObject | |
dcterms.accessRights | open access |