A Lexicon of Connected Components for Arabic Optical Text Recognition

dc.contributor.author: Elarian, Yousef
dc.contributor.author: Idris, Fayez
dc.date.accessioned: 2011-01-12T16:18:36Z
dc.date.available: 2011-01-12T16:18:36Z
dc.date.issued: 2011-01-12
dc.description.abstract: Arabic is a cursive script, which makes character segmentation difficult. Hence, we suggest a unit that is discrete in nature, viz. the connected component, for Arabic text recognition. A lexicon listing valid Arabic connected components is necessary for any system that uses such a unit. Here, we produce and analyze a comprehensive lexicon of connected components. A lexicon can be extracted from corpora or synthesized from morphemes. We follow both approaches and merge their results. In addition, generating a lexicon of connected components requires extra tokenization and point-normalization steps to keep the size of the lexicon tractable. We produce a lexicon of surface words, reduce it to a lexicon of connected components, and finally to a lexicon of point-normalized connected components. The lexicon of point-normalized connected components contains 684,743 entries, a decrease of 97.17% from the word lexicon. [en]
dc.identifier.uri: http://hdl.handle.net/2003/27561
dc.identifier.uri: http://dx.doi.org/10.17877/DE290R-14627
dc.language.iso: en
dc.relation.ispartof: First International Workshop on Frontiers in Arabic Handwriting Recognition, 2010 [en]
dc.subject: Arabic optical text recognition [en]
dc.subject: connected components [en]
dc.subject: holistic recognition [en]
dc.subject: lexicon generation [en]
dc.subject.ddc: 004
dc.title: A Lexicon of Connected Components for Arabic Optical Text Recognition [en]
dc.type: Text
dc.type.publicationtype: conferenceObject
dcterms.accessRights: open access
eldorado.dnb.deposit: true
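
The reduction described in the abstract (surface words, then connected components, then point-normalized connected components) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names are hypothetical, the set of letters that do not join to the following letter is a simplified assumption, and the dot-folding table is deliberately partial: it omits position-dependent cases such as ف/ق, ن and ي. Input is assumed to be undiacritized Arabic text.

```python
# Letters assumed not to join to the FOLLOWING letter (simplified set).
NON_JOINING = set("اأإآدذرزوؤءةى")

# Partial dot-folding table (assumption): each dotted letter is mapped to a
# representative letter with the same dot-free skeleton.
DOT_FOLD = str.maketrans({
    "ت": "ب", "ث": "ب",
    "ج": "ح", "خ": "ح",
    "ذ": "د", "ز": "ر",
    "ش": "س", "ض": "ص",
    "ظ": "ط", "غ": "ع",
})


def split_components(word):
    """Split a word into connected components by breaking after each
    non-joining letter (and before a standalone hamza)."""
    components, current = [], ""
    for ch in word:
        if ch == "ء" and current:
            # A standalone hamza does not join to the preceding letter either.
            components.append(current)
            current = ""
        current += ch
        if ch in NON_JOINING:
            components.append(current)
            current = ""
    if current:
        components.append(current)
    return components


def point_normalize(component):
    """Fold dotted letters onto their representative using DOT_FOLD."""
    return component.translate(DOT_FOLD)


def component_lexicon(words):
    """Build the set of point-normalized connected components of a word list."""
    return {point_normalize(c) for w in words for c in split_components(w)}


print(split_components("كتاب"))  # ['كتا', 'ب']
```

Because many words share components and many components share a dot-free skeleton, the resulting set is much smaller than the word list, which is the effect the abstract quantifies.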

Files

Original bundle

Name: Elarian_2.pdf
Size: 199.36 KB
Format: Adobe Portable Document Format
Description: DNB

License bundle

Name: license.txt
Size: 1.85 KB
Format: Item-specific license agreed upon to submission