A Lexicon of Connected Components for Arabic Optical Text Recognition

Authors:	Elarian, Yousef; Idris, Fayez
Title:	A Lexicon of Connected Components for Arabic Optical Text Recognition
Language (ISO):	en
Abstract:	Arabic is a cursive script in which characters are not easily segmented. Hence, we suggest a unit that is discrete in nature, viz. the connected component, for Arabic text recognition. A lexicon listing valid Arabic connected components is necessary for any system that is to use such a unit. Here, we produce and analyze a comprehensive lexicon of connected components. A lexicon can be extracted from corpora or synthesized from morphemes. We follow both approaches and merge their results. In addition, generating a lexicon of connected components involves extra tokenization and point-normalization steps that keep the size of the lexicon tractable. We produce a lexicon of surface words, reduce it to a lexicon of connected components, and finally to a lexicon of point-normalized connected components. The lexicon of point-normalized connected components contains 684,743 entries, a decrease of 97.17% relative to the word lexicon.
Subject Headings:	Arabic optical text recognition; connected components; holistic recognition; lexicon generation
URI:	http://hdl.handle.net/2003/27561 http://dx.doi.org/10.17877/DE290R-14627
Issue Date:	2011-01-12
Is part of:	First International Workshop on Frontiers in Arabic Handwriting Recognition, 2010
Appears in Collections:	2010 - First International Workshop on Frontiers in Arabic Handwriting Recognition
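
The reduction described in the abstract (surface words, then connected components, then point-normalized connected components) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the set of letters assumed not to join to the following letter, and the dot-stripping map `POINT_MAP` are all assumptions made for this example, and diacritics and position-dependent letter shapes are ignored.

```python
# Assumption: letters that do not join to the FOLLOWING letter, so a
# connected component ends right after them (illustrative set only).
NON_JOINING_AFTER = set("اأإآدذرزوؤةى")

# Assumption: the stand-alone hamza joins on neither side.
ISOLATED = {"ء"}

# Hypothetical point-normalization map: letters that differ only by dots
# are mapped to one representative. The paper's actual mapping may differ.
POINT_MAP = {
    "ت": "ب", "ث": "ب",
    "ج": "ح", "خ": "ح",
    "ذ": "د",
    "ز": "ر",
    "ش": "س",
    "ض": "ص",
    "ظ": "ط",
    "غ": "ع",
}


def split_components(word):
    """Split a surface word into connected components."""
    components, current = [], ""
    for ch in word:
        if ch in ISOLATED:
            # Close any open component, then emit the isolated letter alone.
            if current:
                components.append(current)
            components.append(ch)
            current = ""
            continue
        current += ch
        if ch in NON_JOINING_AFTER:
            # This letter does not join forward, so the component ends here.
            components.append(current)
            current = ""
    if current:
        components.append(current)
    return components


def point_normalize(component):
    """Replace each dotted letter with its assumed dotless representative."""
    return "".join(POINT_MAP.get(ch, ch) for ch in component)


def build_lexicons(words):
    """Return (connected-component lexicon, point-normalized lexicon) as sets."""
    cc_lexicon = set()
    for word in words:
        cc_lexicon.update(split_components(word))
    normalized = {point_normalize(c) for c in cc_lexicon}
    return cc_lexicon, normalized


def percent_decrease(before, after):
    """Percent decrease from `before` entries to `after` entries."""
    return 100.0 * (before - after) / before
```

Under these assumptions, a word such as كتاب splits into two components, because alef does not join to the letter after it. Because many words share components, and many components share a dotless skeleton, each step can only shrink the lexicon or leave it the same size, which is the effect the abstract reports.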

Files in This Item:

File	Description	Size	Format
Elarian_2.pdf	DNB	199.36 kB	Adobe PDF

This item is protected by original copyright