Author: Sudholt, Sebastian
Date available: 2019-01-17
Date issued: 2018
Handle: http://hdl.handle.net/2003/37881
DOI: 10.17877/DE290R-19868
Title: Learning attribute representations with deep convolutional neural networks for word spotting
Type: doctoral thesis
Language: en
Keywords: Deep learning; Word spotting; Attributes; Computer vision; Document image analysis
DDC: 004
Subject headings: Deep learning; Machine vision; Image analysis; Image understanding

Abstract:
Understanding the contents of handwritten texts from document images has long been a traditional field of research in computer science. The ultimate goal is to automatically transcribe the text in the images into an electronic format. This would make the documents from which the images were generated much easier to access and would also allow for fast extraction of information. Especially for historical documents, the possibility to easily sift through large document image collections would be of great interest. Vast numbers of manuscripts all over the world store substantial amounts of as yet untapped information on cultural heritage. Being able to extract this information from large and diverse corpora would offer historians unprecedented insight into various aspects of ancient human life. The desired goal is thus to obtain information on the text embedded in digital document images with no manual human interaction at all.

A well-known approach for achieving this is to use models from the field of pattern recognition and machine learning in order to classify the text in the images into electronic representations of characters or words. This approach is known as Optical Character Recognition or text recognition and is among the oldest applications of pattern recognition and of computer science in general. Despite its long history, handwritten text recognition is still considered an unsolved task, as classification systems are not yet able to consistently achieve results comparable to those common for machine-printed text recognition. This is especially true for historical documents, where the text to be recognized typically exhibits varying degrees of degradation as well as large variability in the handwriting of the same characters and words.

Depending on the task at hand, a full transcription of the text might, however, not be necessary. If a potential user is only interested in whether a certain word or text portion is present in a given document collection, retrieval-based approaches can produce more robust results than recognition-based ones. These retrieval-based approaches compare parts of the document images to a sought-after query and decide whether the individual parts are similar to the query. The result is then a list of those parts of the document images which the method deems relevant. In the field of document image analysis, this retrieval approach is known as keyword spotting or simply word spotting.

Word spotting is the problem of interest in this thesis. In particular, a method is presented which allows neural network models to be used for different word spotting tasks. This method is inspired by a recent state-of-the-art approach which utilizes semantic attributes for word spotting. In pattern recognition and computer vision, semantic attributes describe characteristics of classes which may be shared between classes. This sharing ability enables attribute representations to encode which parts different classes have in common and which they do not. For example, when classifying animals, the classes tiger and zebra may share an attribute striped. For word spotting, attributes have been used to encode the occurrence and position of certain characters. The success of any attribute-based method is, of course, highly dependent on the ability of a classifier to correctly predict the individual attributes.
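To make the character-attribute idea more concrete, the following minimal Python sketch builds a pyramidal, binary character-occurrence vector for a word string. It is an illustration only: the alphabet and the two pyramid levels are assumptions chosen for the example, not the exact configuration derived in the thesis.

```python
# Illustrative sketch of a pyramidal binary character-occurrence vector.
# The alphabet and the pyramid levels are assumptions for this example,
# not the thesis' exact configuration.
import string

ALPHABET = string.ascii_lowercase  # assumed alphabet: 'a'-'z'

def char_attribute_vector(word, levels=(1, 2)):
    """One binary attribute per (pyramid split, character):
    'does this character occur in this part of the word?'"""
    word = word.lower()
    n = len(word)
    vector = []
    for level in levels:
        for split in range(level):
            lo, hi = split / level, (split + 1) / level
            # characters whose normalized center position lies in this split
            present = {ch for i, ch in enumerate(word)
                       if lo <= (i + 0.5) / n < hi}
            vector.extend(int(ch in present) for ch in ALPHABET)
    return vector

# 'tiger' and 'zebra' share some attributes (both contain 'e' and 'r',
# and both have 'r' in the right half of the word) and differ in others.
v_tiger = char_attribute_vector("tiger")
v_zebra = char_attribute_vector("zebra")
print(len(v_tiger), sum(a == b == 1 for a, b in zip(v_tiger, v_zebra)))
```

Because two word images labeled with the same string map to the identical attribute vector, such representations can serve as a common target space for query strings and word images alike.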
In order to accomplish an accurate prediction of attributes for word spotting tasks, the use of Convolutional Neural Networks (CNNs) is proposed in this thesis. CNNs have recently attracted a substantial amount of research interest, as they consistently achieve state-of-the-art results in virtually all fields of computer vision. Their main advantage over other methods is their ability to jointly optimize a classifier and the feature representations obtained from the images. This characteristic is known as end-to-end learning. While CNNs have been used extensively for classifying data into one of multiple classes for various tasks, predicting attributes with these neural networks has largely been limited to face and fashion attributes. For the method presented in this thesis, a CNN is trained in an end-to-end fashion to predict attribute representations extracted from word strings. These attributes are then leveraged to perform word spotting.

The core contribution lies in the design and evaluation of different neural network architectures which are specifically designed to be applied to document images. A major part of this design is determining suitable loss functions for the CNNs. Loss functions are a crucial ingredient in the training of neural networks in general and largely determine what kind of annotations the individual networks are able to learn for the given images. In particular, two loss functions are derived which allow for learning binary attribute representations as well as real-valued representations that can be considered attribute-like. Besides the loss functions, the second major contribution is the design of three CNN architectures which are tailor-made for problems involving handwritten text as data. Using the loss functions and the three architectures, a number of experiments are conducted in which the neural networks are trained to predict the attribute or attribute-like representations Pyramidal Histogram of Characters (PHOC), Spatial Pyramid of Characters (SPOC) and Discrete Cosine Transform of Words (DCToW). It is shown experimentally that the proposed approach of using neural networks to predict attribute representations achieves state-of-the-art results on various word spotting benchmarks.
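As a rough illustration of the two loss types named above, the following sketch pairs a per-attribute binary cross-entropy (a natural choice for binary vectors such as PHOC) with a cosine loss (a natural choice for real-valued, attribute-like vectors such as DCToW). The pairing and the exact formulas are assumptions for this example; the thesis derives its own formulations.

```python
# Hedged sketch of two loss types for attribute prediction:
# binary cross-entropy for binary attribute vectors and a cosine
# loss for real-valued, attribute-like representations.
import numpy as np

def binary_cross_entropy(logits, targets, eps=1e-12):
    """Mean per-attribute BCE: each output unit predicts one binary attribute."""
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid per attribute
    probs = np.clip(probs, eps, 1.0 - eps)
    return -np.mean(targets * np.log(probs)
                    + (1.0 - targets) * np.log(1.0 - probs))

def cosine_loss(pred, target, eps=1e-12):
    """1 - cosine similarity: penalizes the angle between prediction
    and target rather than their magnitudes."""
    num = float(np.dot(pred, target))
    den = float(np.linalg.norm(pred) * np.linalg.norm(target)) + eps
    return 1.0 - num / den

# Toy usage with 10-dimensional representations:
rng = np.random.default_rng(0)
binary_target = rng.integers(0, 2, size=10).astype(float)
print(binary_cross_entropy(rng.normal(size=10), binary_target))
print(cosine_loss(rng.normal(size=10), rng.normal(size=10)))
```

At retrieval time, the same cosine similarity can be used to rank word image regions against the predicted representation of a query, yielding the relevance-ordered result list described in the abstract.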