Mustererkennung in Eingebetteten Systemen

Recent Submissions

Now showing 1 - 13 of 13
  • Item
    Self-training for handwritten word recognition and retrieval
    (2024) Wolf, Fabian; Fink, Gernot A.; Fornés, Alicia
    Over centuries, handwritten documents have been the main means of capturing and storing information. Libraries and archives have gathered, stored, and maintained tremendously large document collections. These collections are in various ways a snapshot of their time and hold incredibly valuable data for the social and historical sciences. A significant problem for social scientists and historians alike is that the data stored in these collections is hardly accessible. Usually, no transcriptions exist, and creating them manually is infeasible at large scale. This problem motivates the use of automatic systems based on techniques such as handwriting recognition or automatic word search. While these two domains are classic problems in the document analysis community and have a long-standing tradition, they suffer from a severe drawback. Nowadays, well-performing models rely on machine learning techniques, which means models are trained in a supervised fashion using manually annotated training data. The manual creation of training data is a cumbersome process and is the main obstacle that often prevents the application of an automatic document analysis system. This thesis develops a method that allows for the training of handwriting recognition and word spotting models without the need for any manually annotated training samples. The underlying training concept is called self-training and relies on training on automatically generated pseudo-labels. The proposed training scheme can be summarized as follows. First, an initial model is trained on synthetic data that has been generated using a font-based approach. Then, this initial model makes predictions for an unlabeled training dataset. Subsequently, the predictions are used for another training step and constitute the current set of pseudo-labels. This process is repeated iteratively, alternating between the prediction of pseudo-labels and training on them. The method is then extended by integrating a confidence measure that allows for a better selection of less erroneous pseudo-labels. The experiments show that self-training the models considered in this work is feasible and leads to significant performance gains with respect to training on a synthetic dataset only. The investigation of synthetic data generation provides several insights, for example, that training on synthetic data constitutes a form of implicit language modeling, and that a calibrated dataset can be generated by using different style predictor networks. Further experiments on the integration of the confidence measures provide evidence that their use benefits performance and leads to a higher robustness with respect to poorly performing initial models. It can be concluded that self-training is a highly efficient approach to train well-performing models in the absence of manually annotated data and, therefore, provides a potential solution for the application of such models in the data-scarce domain of automatic analysis of historical document collections.
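    The iterative scheme can be made concrete in a few lines. Below is a minimal sketch of the self-training loop with confidence-based pseudo-label selection; the model interface, function names, and threshold value are hypothetical placeholders, not the thesis' actual implementation.
    ```python
    # Minimal sketch of iterative self-training with confidence-based
    # pseudo-label selection (all interfaces are hypothetical).

    def self_train(model, synthetic_data, unlabeled_images,
                   rounds=5, confidence_threshold=0.8):
        # 1. Bootstrap on synthetically rendered word images.
        model.fit(synthetic_data)

        for _ in range(rounds):
            # 2. Predict pseudo-labels for the unlabeled collection.
            pseudo = [(img, *model.predict_with_confidence(img))
                      for img in unlabeled_images]
            # 3. Keep only confident predictions to reduce label noise.
            confident = [(img, label) for img, label, conf in pseudo
                         if conf >= confidence_threshold]
            # 4. Retrain on the current set of pseudo-labels.
            model.fit(confident)
        return model
    ```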
  • Item
    Neuronale Ansätze zur semantischen Analyse handschriftlicher Dokumentenbilder
    (2024) Tüselmann, Oliver; Fink, Gernot A.; Fischer, Andreas
    In recent decades, the worldwide digitization of physical documents has laid an important foundation for the long-term preservation and accessibility of information. The current challenge is to develop technologies that enable efficient searching and semantic analysis of these large volumes of data. Handwritten documents pose particular difficulties, as they are often available only as image data and exhibit high variability. This thesis compares two approaches to the semantic analysis of handwritten document images: a traditional approach based on the combination of handwriting recognition and text analysis, and an end-to-end approach without explicit text recognition. In the first approach, handwritten image data is converted into machine-readable text and then analyzed semantically, which, however, carries the risk of error propagation. The alternative end-to-end approach avoids the error-propagation problem but does not exploit recent advances from the field of natural language processing. The potential of both approaches for analyzing handwritten document images is systematically investigated in several benchmarks. A key problem for recognition-free approaches is the lack of pretrained semantic word-image representations. As a solution, a cross-modal knowledge distillation approach is presented that transfers semantic information from machine-readable texts to handwritten images. To this end, handwritten word images are embedded into a textual semantic vector space using a convolutional neural network. The results show that this method is crucial for end-to-end models to reach the current level of performance in the semantic analysis of handwritten documents.
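    A minimal sketch of the cross-modal knowledge distillation step described above: a small CNN embeds word images into a textual semantic vector space and is pulled toward the embedding of the transcription. The architecture, the cosine-distance loss, and all dimensions are illustrative assumptions.
    ```python
    # Sketch of cross-modal knowledge distillation: a CNN maps word
    # images into a textual semantic embedding space. Architecture,
    # loss choice, and the source of `text_emb` (e.g., pretrained word
    # embeddings of the transcription) are illustrative assumptions.
    import torch
    import torch.nn as nn

    class WordImageEncoder(nn.Module):
        def __init__(self, embed_dim=300):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.head = nn.Linear(64, embed_dim)

        def forward(self, x):
            return self.head(self.features(x).flatten(1))

    def distillation_loss(image_emb, text_emb):
        # Pull the image embedding toward the semantic vector of its
        # transcription (cosine distance as one possible choice).
        return (1 - nn.functional.cosine_similarity(image_emb, text_emb)).mean()
    ```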
  • Item
    Transfer learning for multi-channel time-series Human Activity Recognition
    (2023) Moya Rueda, Fernando; Fink, Gernot A.; Kirste, Thomas
    Methods of human activity recognition (HAR) have been developed for the purpose of automatically classifying recordings of human movements into a set of activities. Capturing, evaluating, and analysing sequential data to recognise human activities accurately is critical for many applications in pervasive and ubiquitous computing, e.g., mobile- or ambient-assisted living, smart homes, activities of daily living, health support and rehabilitation, sports, automotive surveillance, and Industry 4.0. For example, HAR is particularly interesting for optimisation in those industries where manual work remains dominant. HAR takes as input signals from videos or from multi-channel time-series, e.g., human joint measurements from marker-based motion capturing systems and inertial measurements from wearables or on-body devices. Wearables have become relevant as they extend the potential of HAR beyond constrained or laboratory settings. This thesis focuses on HAR using multi-channel time-series. Multi-channel time-series HAR is, in general, a challenging classification task. This is because human activities and movements show a large variation. Humans carry out semantically very distinct activities in similar ways; conversely, they carry out similar activities in many different ways. Furthermore, multi-channel time-series HAR datasets suffer from the class-imbalance problem, with more samples of certain activities than others. This problem strongly depends on the annotation. Moreover, there are no standard definitions of human activities for annotation. Methods based on Deep Neural Networks (DNNs) are prevalent for multi-channel time-series HAR. Nevertheless, the performance of DNNs has not increased as significantly as in other fields such as image classification or segmentation. DNNs show a low sample efficiency, as they learn the temporal structure of activities completely from data. Considering supervised DNNs, the scarcity of annotated data is the primary concern. Annotated data of human behaviour is scarce and costly to obtain. The annotation process demands enormous resources. Additionally, annotation reliability varies, as annotations can be subject to human error or unclear and non-elaborated annotation protocols. Transfer learning has been used to cope with a limited amount of annotated data, overfitting, zero-shot learning or classification of unseen human activities, and the class-imbalance problem. Transfer learning can alleviate the problem of scarcity of annotated data. Learnt parameters and feature representations from a specific source domain are transferred to a target domain. Transfer learning extends the usability of large annotated datasets from source domains to related problems. This thesis proposes a general transfer learning approach to improve automatic multi-channel time-series HAR. The proposed transfer learning method combines a semantic attribute representation of activities and a specific deep neural network. It handles situations where the source and target domains differ, i.e., the sensor space and the set of activities change, without needing a large amount of annotated data from the target domain. The method considers different levels of transferability. First, an architecture handles a variety of dataset configurations with regard to the number of devices and their type; it creates fixed-size representations of sensor recordings that are representative of the human limbs. These networks process sequences of movements of the human limbs, either from poses or from inertial measurements. Second, it introduces a search for semantic attribute representations that favourably represent signal segments for recognising human activities in unknown scenarios, as these scenarios only provide annotations of activities and lack human-annotated semantic attributes. Third, it covers transferability from a variety of source datasets. The method takes advantage of a large human-pose dataset as a source domain, which was created during the development of this thesis. Furthermore, synthetic inertial measurements are derived from sequences of human poses, either from a marker-based motion capturing system or from video-based and pose-based HAR datasets. The latter specifically use the pixel-coordinate annotations of human poses as multi-channel time-series data. Real inertial measurements and these synthetic measurements are then deployed as a source domain for parameter transfer learning. Experimentation on different target datasets demonstrates that the proposed transfer learning method improves performance, most evidently when deploying a proportion of the targets' training material. This outcome suggests that the temporal convolutional filters are rather general, as they learn local temporal relations of human movements related to the semantic attributes, independent of the number of devices and their type. A human-limb-oriented deep architecture and an evolutionary algorithm provide an off-the-shelf predictor of semantic attributes that can be deployed directly in a new target scenario. Closely related problems can be addressed directly by manually specifying the attribute-to-activity relations, without the need for a search with an evolutionary algorithm. Besides, the learnt convolutional filters are activity-class dependent; hence, the classification performance on the activities shared among the datasets improves.
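    A minimal sketch of the attribute-based recognition idea described above, under simplifying assumptions: a temporal-convolutional network predicts a semantic attribute vector from a multi-channel window, and the activity whose attribute signature is closest is chosen. Shapes, layer choices, and the attribute matrix are illustrative, not the thesis' architecture.
    ```python
    # Sketch of attribute-based HAR: predict attributes from a
    # multi-channel window, then decide via the nearest
    # attribute-to-activity signature. All dimensions are examples.
    import torch
    import torch.nn as nn

    class AttributeTCN(nn.Module):
        def __init__(self, channels, n_attributes):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(channels, 64, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),
            )
            self.head = nn.Linear(64, n_attributes)

        def forward(self, x):            # x: (batch, channels, time)
            return torch.sigmoid(self.head(self.net(x).squeeze(-1)))

    def predict_activity(attr_pred, attribute_matrix):
        # attribute_matrix: (n_activities, n_attributes); rows encode
        # the attribute-to-activity relations. Nearest signature wins.
        dists = torch.cdist(attr_pred, attribute_matrix)
        return dists.argmin(dim=1)
    ```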
  • Item
    Assessing the reliability of deep neural networks
    (2023) Oberdiek, Philipp; Fink, Gernot A.; Harmeling, Stefan
    Deep Neural Networks (DNNs) have achieved astonishing results in the last two decades, fueled by ever larger datasets and the availability of high-performance compute hardware. This has led to breakthroughs in many applications such as image and speech recognition, natural language processing, autonomous driving, and drug discovery. Despite their success, the understanding of their internal workings and the interpretability of their predictions remain limited, and DNNs are often treated as "black boxes". Especially for safety-critical applications where the well-being of humans is at risk, decisions based on predictions should consider the associated uncertainties. Autonomous vehicles, for example, operate in a highly complex environment with potentially unpredictable situations that can lead to safety risks for pedestrians and other road users. In medical applications, decisions based on incorrect predictions can have serious consequences for a patient's health. As a consequence, the topic of Uncertainty Quantification (UQ) has received increasing attention in recent years. The goal of UQ is to assign uncertainties to predictions in order to ensure that the decision-making process takes potentially unreliable predictions into account. In addition, other tasks such as identifying model weaknesses, collecting additional data, or detecting malicious attacks can be supported by uncertainty estimates. Unfortunately, UQ for DNNs is a particularly challenging task due to their high complexity and nonlinearity. Uncertainties that can be derived from traditional statistical models are often not directly applicable to DNNs. Therefore, the development of new UQ techniques for DNNs is of paramount importance for safety-aware decision-making. This thesis evaluates existing UQ methods and proposes improvements and novel approaches that contribute to the reliability and trustworthiness of modern deep learning methodology. One of the core contributions of this work is the development of a novel generative learning framework with an integrated training of a One-vs-All (OvA) classifier. A Generative Adversarial Network (GAN) is trained in such a way that it is possible to sample from the boundary of the training distribution. These boundary samples shield the training dataset from the Out-of-Distribution (OoD) region. By making the GAN class-conditional, it is possible to shield each class separately, which integrates well with the formulation of an OvA classifier. The OvA classifier achieves outstanding results on the task of OoD detection and surpasses many previous works by large margins. In addition, the tight class shielding also improves the overall classification accuracy. A comprehensive and consistent evaluation on the tasks of False Positive, Out-of-Distribution, and Adversarial Example Detection on a diverse selection of datasets provides insights into the strengths and weaknesses of existing methods and the proposed approaches.
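    A minimal sketch of how an OvA classifier can be trained with GAN boundary samples as universal negatives, following the shielding idea above; the loss composition and the scoring rule are illustrative assumptions, not the thesis' exact formulation.
    ```python
    # Sketch of a One-vs-All (OvA) head with GAN boundary samples used
    # as negatives for every class; training details and the generator
    # are assumptions for illustration.
    import torch
    import torch.nn as nn

    def ova_loss(logits_real, labels, logits_boundary):
        # logits_*: (batch, n_classes); each class has its own sigmoid.
        targets = nn.functional.one_hot(labels, logits_real.size(1)).float()
        bce = nn.functional.binary_cross_entropy_with_logits
        loss_real = bce(logits_real, targets)
        # Boundary samples are "not any class": all-zero targets shield
        # each class region against the out-of-distribution area.
        loss_bnd = bce(logits_boundary, torch.zeros_like(logits_boundary))
        return loss_real + loss_bnd

    def ood_score(logits):
        # A low maximum class probability indicates an OoD input.
        return 1 - torch.sigmoid(logits).max(dim=1).values
    ```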
  • Item
    Segmentation-free word spotting with bag-of-features hidden Markov models
    (2019) Rothacker, Leonard; Fink, Gernot A.; Llados, Josep
    The method proposed in this thesis makes document images searchable with minimum manual effort. It works in the query-by-example scenario, where the user selects an exemplary occurrence of the query word in a document image. Afterwards, an entire collection of document images is searched automatically. The major challenge is to detect relevant words and to sort them according to similarity to the query. However, recognizing text in historic document images is extremely challenging. Different historic document collections have highly irregular visual appearances due to non-standardized layouts or large variabilities in handwritten script. An automatic text recognizer requires huge amounts of annotated samples from the collection, which are usually not directly available. In order to search document images with just a single example of the query word, the information that is available about the problem domain is integrated at various levels. Bag-of-features are a powerful image representation that can be adapted to the data automatically. The query word is represented with a hidden Markov model. This statistical sequence model is very suitable for the sequential structure of text. An important assumption is that the visual variability of the text within a single collection is limited. For example, this is typically the case if the documents have been written by only a few writers. Furthermore, the proposed method requires only minimal heuristic assumptions about the visual appearance of text. This is achieved by processing document images as a whole, without requiring a given segmentation of the images at word level or line level. The detection of potentially relevant document regions is based on similarity to the query; it is not required to recognize words in general. Word size variabilities can be handled by the hidden Markov model. In order to make the computationally costly application of the sequence model feasible in practice, regions are retrieved according to approximate similarity with an efficient model decoding algorithm. Since the approximate approach retrieves regions with high recall, re-ranking these regions with the sequence model leads to highly accurate word spotting results. In addition, the method can be extended to textual queries, i.e., query-by-string, if annotated samples become available. The method is evaluated on five benchmark datasets. In the segmentation-free query-by-example scenario, where no annotated sample set is available, the method outperforms all other methods that have been evaluated on any of these five benchmarks. If only a small dataset of annotated samples is available, the performance in the query-by-string scenario is competitive with the state of the art.
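    A minimal sketch of the bag-of-features representation on which the method builds: local descriptors are quantized against a codebook learned from the data (hard quantization shown for brevity). Descriptor extraction and codebook size are assumptions.
    ```python
    # Sketch of a bag-of-features image representation: quantize local
    # descriptors against a learned codebook and build a normalized
    # histogram. Descriptor choice and codebook size are assumptions.
    import numpy as np
    from sklearn.cluster import KMeans

    def learn_codebook(descriptors, n_words=512):
        # descriptors: (n_samples, dim) local features from documents.
        return KMeans(n_clusters=n_words, n_init=4).fit(descriptors)

    def bof_histogram(descriptors, codebook):
        words = codebook.predict(descriptors)
        hist = np.bincount(words, minlength=codebook.n_clusters)
        return hist / max(hist.sum(), 1)   # normalized term frequencies
    ```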
  • Item
    Learning attribute representations with deep convolutional neural networks for word spotting
    (2018) Sudholt, Sebastian; Fink, Gernot A.; Schomaker, Lambert
    Understanding the contents of handwritten texts from document images has long been a traditional field of research in computer science. The ultimate goal is to automatically transcribe the text in the images into an electronic format. This would make the documents from which the images were generated much easier to access and would also allow for a fast extraction of information. Especially for historical documents, a possibility to easily sift through large document image collections would be of high interest. There exist vast amounts of manuscripts all over the world, storing substantial amounts of yet untapped information on cultural heritage. Being able to extract this information for large and diverse corpora would give historians unprecedented insight into various aspects of ancient human life. The desired goal is thus to obtain information on the text embedded in digital document images with no manual human interaction at all. A well-known approach for achieving this is to use models from the field of pattern recognition and machine learning in order to classify the text in the images into electronic representations of characters or words. This approach is known as Optical Character Recognition or text recognition and belongs to the oldest applications of pattern recognition and of computer science in general. Despite its long history, handwritten text recognition is still considered an unsolved task, as classification systems are still not able to consistently achieve results as are common for machine-printed text recognition. This is especially true for historical documents, as the text to be recognized typically exhibits varying amounts of degradation as well as large variability in handwriting for the same characters and words. Depending on the task at hand, a full transcription of the text might, however, not be necessary. If a potential user is only interested in whether a certain word or text portion is present in a given document collection or not, retrieval-based approaches are able to produce more robust results than recognition-based ones. These retrieval-based approaches compare parts of the document images to a sought-after query and decide whether the individual parts are similar to the query. For a given method, the result is then a list of parts of the document images which are deemed relevant by the method. In the field of document image analysis, this retrieval approach is known as keyword spotting or simply word spotting. Word spotting is the problem of interest in this thesis. In particular, a method will be presented which allows for using neural network models in order to approach different word spotting tasks. This method is inspired by a recent state-of-the-art approach which utilizes semantic attributes for word spotting. In pattern recognition and computer vision, semantic attributes describe characteristics of classes which may be shared between classes. This sharing ability enables an attribute representation to encode which parts of different classes are common and which are not. For example, when classifying animals, the classes tiger and zebra may share an attribute striped. For word spotting, attributes have been used to encode the occurrence and position of certain characters. The success of any attribute-based method is, of course, highly dependent on the ability of a classifier to correctly predict the individual attributes. In order to accomplish an accurate prediction of attributes for word spotting tasks, the use of Convolutional Neural Networks (CNNs) is proposed in this thesis. CNNs have recently attracted a substantial amount of research interest, as they are able to consistently achieve state-of-the-art results in virtually all fields of computer vision. Their main advantage compared to other methods is their ability to jointly optimize a classifier and the feature representations obtained from the images. This characteristic is known as end-to-end learning. While CNNs have been used extensively for classifying data into one of multiple classes for various tasks, predicting attributes with these neural networks has largely been done for face and fashion attributes only. For the method presented in this thesis, a CNN is trained to predict attribute representations extracted from word strings in an end-to-end fashion. These attributes are leveraged in order to perform word spotting. The core contribution lies in the design and evaluation of different neural network architectures which are specifically designed to be applied to document images. A big part of this design is to determine suitable loss functions for the CNNs. Loss functions are a crucial ingredient in the training of neural networks in general and largely determine what kind of annotations the individual networks are able to learn for the given images. In particular, two loss functions are derived, which allow for learning binary attribute representations as well as real-valued representations that can be considered attribute-like. Besides the loss functions, the second major contribution is the design of three CNN architectures which are tailor-made for problems involving handwritten text as data. Using the loss functions and the three architectures, a number of experiments are conducted in which the neural networks are trained to predict the attribute or attribute-like representations Pyramidal Histogram of Characters (PHOC), Spatial Pyramid of Characters (SPOC), and Discrete Cosine Transform of Words (DCToW). It is shown experimentally that the proposed approach of using neural networks for predicting attribute representations achieves state-of-the-art results on various word spotting benchmarks.
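    A minimal sketch of a PHOC-style attribute vector and the corresponding binary cross-entropy objective; the alphabet, pyramid levels, and occupancy rule are illustrative simplifications of the representations named above.
    ```python
    # Sketch of a pyramidal histogram of characters (PHOC) target and
    # the loss used to learn it; alphabet, levels, and the occupancy
    # criterion (any positive overlap here) are illustrative choices.
    import torch
    import torch.nn as nn

    ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

    def phoc(word, levels=(1, 2, 3)):
        vec = []
        for level in levels:
            for region in range(level):
                lo, hi = region / level, (region + 1) / level
                bits = [0] * len(ALPHABET)
                for i, ch in enumerate(word.lower()):
                    # Does character i's span overlap this region?
                    c0, c1 = i / len(word), (i + 1) / len(word)
                    if min(c1, hi) - max(c0, lo) > 0 and ch in ALPHABET:
                        bits[ALPHABET.index(ch)] = 1
                vec.extend(bits)
        return torch.tensor(vec, dtype=torch.float32)

    # Training: a CNN with one sigmoid output per attribute regresses
    # the PHOC vector, e.g. loss_fn(cnn(image), phoc(transcription)).
    loss_fn = nn.BCEWithLogitsLoss()
    ```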
  • Item
    Partially supervised learning of models for visual scene and object recognition
    (2018) Grzeszick, René; Fink, Gernot A.; Frintrop, Simone
    When creating a visual recognition system for a novel task, one of the main burdens is the collection and annotation of data. Often, several thousand samples need to be manually reviewed and labeled so that the recognition system achieves the desired accuracy. The goal of this thesis is to provide methods that lower the annotation effort for visual scene and object recognition. These methods are applicable to traditional pattern recognition approaches as well as to methods from the field of deep learning. The contributions are three-fold, ranging from feature augmentation through semi-supervised learning for natural scene classification to zero-shot object recognition. The contribution in the field of feature augmentation deals with handcrafted feature representations. A novel method for incorporating additional information at the feature level is introduced. This information is subsequently integrated into a Bag-of-Features representation. The additional information can, for example, be of spatial or temporal nature, encoding a local feature's position within a sample in its feature descriptor. The information is quantized and appended to the feature vector and is thus also integrated into the unsupervised learning step of the Bag-of-Features representation. As a result, more specific codebook entries are computed for different regions within the samples. The results in the field of image classification for natural scenes and objects, as well as in the field of acoustic event detection, show that the proposed approach allows for learning compact feature representations without reducing the accuracy of the subsequent classification. In the field of semi-supervised learning, a novel approach for learning annotations in large collections of natural scene images is proposed. The approach is based on the active learning principle and incorporates multiple views on the data. The views, i.e., different feature representations, are clustered independently of each other. A human in the loop is asked to label each data cluster. The clusters are then iteratively refined based on cluster evaluation measures, and additional labels are assigned to the dataset. Ultimately, a voting over all views creates a partially labeled sample set that is used for training a classifier. The results on natural scene images show that a powerful visual classifier can be learned with minimal annotation effort. The approach has been evaluated for traditional handcrafted features as well as for features derived from a convolutional neural network. For semi-supervised learning, it is desirable to have a compact feature representation; for traditional features, those obtained by the proposed feature augmentation approach are a good example of such a representation. Semi-supervised learning is especially beneficial in the field of deep learning, which usually requires large amounts of labeled samples for training or even for adapting a deep neural network. For zero-shot object prediction, a method that combines visual and semantic information about natural scenes is proposed. A convolutional neural network is trained to distinguish different scene categories. Furthermore, the relations between scene categories and visual object classes are learned based on their semantic relation in large text corpora. The probability for a given image to show a certain scene is derived from the network and combined with the semantic relations in a statistical approach. This allows for predicting the presence of certain object classes in an image without having any visual training sample of any of the object classes. The results on a challenging dataset depicting various objects in natural scene images show that especially in cluttered scenes the semantic relations can be a powerful information cue. Furthermore, when post-processing the results of a visual object predictor, the detection accuracy can be improved at the minimal cost of providing additional scene labels. Combining these contributions, it is shown that a scene classifier can be trained with minimal human effort and its predictions can still be leveraged for object prediction. Thus, information about natural scene images and the object classes within these images can be gained without the burden of manually labeling tremendous amounts of images beforehand.
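    A minimal sketch of the statistical combination described above for zero-shot object prediction: scene probabilities from the CNN are weighted by scene-object relations derived from text corpora. The relation matrix and its normalization are assumptions.
    ```python
    # Sketch of zero-shot object scoring: combine the scene
    # classifier's output with scene-object relations, e.g.
    # p(object | image) ~ sum_s p(object | scene s) * p(scene s | image).
    # The relation matrix is a placeholder (e.g., normalized
    # word-embedding similarities between scene and object labels).
    import numpy as np

    def object_scores(scene_probs, scene_object_relation):
        # scene_probs: (n_scenes,) softmax output of the scene CNN.
        # scene_object_relation: (n_scenes, n_objects).
        return scene_probs @ scene_object_relation   # (n_objects,)
    ```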
  • Item
    Acoustic sensor network geometry calibration and applications
    (2017) Plinge, Axel; Fink, Gernot A.; Martin, Rainer
    In the modern world, we are increasingly surrounded by computing devices with communication links and one or more microphones, for example smartphones, tablets, laptops, or hearing aids. These devices can work together as nodes in an acoustic sensor network (ASN). Such networks are a growing platform that opens up the possibility of many practical applications. ASN-based speech enhancement, source localization, and event detection can be applied to teleconferencing, camera control, automation, or assisted living. For these kinds of applications, awareness of auditory objects and their spatial positioning are key properties. In order to provide these two kinds of information, novel methods have been developed in this thesis. Information on the type of auditory objects is provided by a novel real-time sound classification method. Information on the position of human speakers is provided by a novel localization and tracking method. In order to localize with respect to the ASN, the relative arrangement of the sensor nodes has to be known; therefore, different novel geometry calibration methods were developed.

    Sound classification: The first method addresses the task of identifying auditory objects. A novel application of the bag-of-features (BoF) paradigm to acoustic event classification and detection was introduced. It can be used for event and speech detection as well as for speaker identification. The use of both mel frequency cepstral coefficient (MFCC) and Gammatone frequency cepstral coefficient (GFCC) features improves the classification accuracy. By using soft quantization and introducing supervised training for the BoF model, superior accuracy is achieved. The method generalizes well from limited training data, works online, and can be computed in a fraction of real time. By a dedicated training strategy based on a hierarchy of stationarity, the detection of speech in mixtures with noise was realized. This makes the method robust against severe noise levels corrupting the speech signal, so it is possible to provide control information to a beamformer in order to realize blind speech enhancement. A reliable improvement is achieved in the presence of one or more stationary noise sources.

    Speaker localization: The localization method enables each node to determine the direction of arrival (DoA) of concurrent sound sources. The author's neuro-biologically inspired speaker localization method for microphone arrays was refined for use in ASNs. By implementing a dedicated cochlear and midbrain model, it is robust against the reverberation found in indoor rooms. In order to better model the unknown number of concurrent speakers, an application of the EM algorithm that realizes probabilistic clustering according to auditory scene analysis (ASA) principles was introduced. Based on this approach, a system for Euclidean tracking in ASNs was designed. Each node applies the node-wise localization method and shares probabilistic DoA estimates, together with an estimate of the spectral distribution, with the network. As this information is relatively sparse, it can be transmitted with low bandwidth. The system is robust against jitter and transmission errors. The information from all nodes is integrated according to spectral similarity to correctly associate concurrent speakers. By incorporating the intersection angle in the triangulation, the precision of the Euclidean localization is improved. Tracks of concurrent speakers are computed over time, as is shown with recordings in a reverberant room.

    Geometry calibration: The central task of geometry calibration has been solved with a special focus on sensor nodes equipped with multiple microphones. Novel methods were developed for different scenarios. An audio-visual method was introduced for the calibration of ASNs in video conferencing scenarios. The DoA estimates are fused with visual speaker tracking in order to provide sensor positions in a common coordinate system. A novel acoustic calibration method determines the relative positioning of the nodes from ambient sounds alone. Unlike previous methods that only infer the positioning of distributed microphones, the DoA is incorporated, and thus it becomes possible to calibrate the orientation of the nodes with high accuracy. This is very important for all applications using spatial information, as the triangulation error increases dramatically with bad orientation estimates. As speech events can be used, calibration becomes possible without the requirement of playing dedicated calibration sounds. Based on this, an online method employing a genetic algorithm with incremental measurements was introduced. By using the robust speech localization method, the calibration is computed in parallel to the tracking. The online method is able to calibrate ASNs in real time, as is shown with recordings of natural speakers in a reverberant room.

    The informed acoustic sensor network: All new methods are important building blocks for the use of ASNs. The online methods for localization and calibration both make use of the neuro-biologically inspired processing in the nodes, which leads to state-of-the-art results even in reverberant enclosures. The high robustness and reliability can be improved even further by including the event detection method in order to exclude non-speech events. When all methods are combined, both semantic information on what is happening in the acoustic scene and spatial information on the positioning of the speakers and sensor nodes are automatically acquired in real time. This realizes truly informed audio processing in ASNs. Practical applicability is shown by application to recordings in reverberant rooms. The contribution of this thesis is thus not only to advance the state of the art in automatically acquiring information on the acoustic scene, but also to push the practical applicability of such methods.
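    A minimal sketch of the geometric core of ASN localization: triangulating a source in the plane from two nodes' DoA estimates. Node positions and angles are example values; the thesis' method additionally weights estimates by the intersection angle.
    ```python
    # Sketch of 2-D triangulation from two direction-of-arrival (DoA)
    # estimates: intersect the bearing lines p_i + t * d_i of two nodes.
    import numpy as np

    def triangulate(p1, theta1, p2, theta2):
        d1 = np.array([np.cos(theta1), np.sin(theta1)])
        d2 = np.array([np.cos(theta2), np.sin(theta2)])
        # Solve p1 + t*d1 = p2 + s*d2 for the ray parameters (t, s).
        A = np.column_stack([d1, -d2])
        t, _ = np.linalg.solve(A, np.asarray(p2) - np.asarray(p1))
        return np.asarray(p1) + t * d1

    # Example: nodes at (0,0) and (4,0) seeing the source at 45°/135°.
    source = triangulate([0, 0], np.deg2rad(45), [4, 0], np.deg2rad(135))
    # -> array([2., 2.])
    ```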
  • Item
    Modeling and training options for handwritten Arabic text recognition
    (2016) Ahmad, Irfan; Fink, Gernot A.; Likforman-Sulem, Laurence
  • Item
    Die Detektion interessanter Objekte unter Verwendung eines objektbasierten Aufmerksamkeitsmodells
    (2016) Naße, Fabian; Fink, Gernot A.; Wöhler, Christian
    The human visual system is able to handle complex tasks, such as recognizing objects and persons, with ease. Computer vision is the research field concerned with the question of how comparable capabilities can be achieved in technical systems. In this regard, this dissertation considers the principle of visual attention, which represents an important aspect of the human visual system. It states that conscious perception is preceded by an unconscious process through which attention is selectively directed to potentially important or interesting visual content. This is a strategy of efficient information processing that allows fast reactions to relevant content. In this context, the term visual saliency denotes the property of visual content to stand out from its surroundings and therefore to attract attention. In general, such content has a comparatively high probability of being of interest to the observing individual. This thesis is concerned with attention-based object detection. The topic is motivated as an alternative to knowledge-based object detection methods, in which classification models are trained from annotated example images. Such methods generally involve a high manual preparation effort, exhibit high complexity, and scale poorly with the number of object categories considered. The central question of this thesis is therefore whether saliency can be used as a criterion for a more efficient localization of objects in images. Based on the thesis that it is precisely the interesting objects of a scene that are visually salient, an attention-based approach is intended to enable a fast and low-effort detection of such objects. This work first explains important foundations from the fields of pattern recognition, machine learning, and image processing. Subsequently, classical strategies for localizing objects in images are presented, and the advantages and disadvantages of different localization strategies are considered with respect to the attention-based approach. This is followed by a presentation of fundamental concepts as well as influential theories and models of human visual attention, and by a review of mathematical attention models from the literature. Building on this, a custom attention model is proposed that generates object proposals and scores them according to their saliency. For the sake of generic applicability, a purely data-driven approach is favored that deliberately avoids the use of problem-specific prior knowledge. The method is finally evaluated on a challenging benchmark, where comparisons with other models from the literature highlight the advantages of the proposed methods. Furthermore, the analysis of the results shows that saliency is an important criterion for the generic localization of objects in complex images.
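    A minimal sketch of a purely data-driven saliency score for object proposals in the spirit described above: a proposal is rated by how strongly its color statistics deviate from its surround. The histogram features and the chi-square distance are illustrative choices, not the model proposed in the thesis.
    ```python
    # Sketch of center-surround contrast as a saliency score for an
    # object proposal: compare color histograms of the proposal region
    # and its surrounding region. All modeling choices are examples.
    import numpy as np

    def color_histogram(pixels, bins=8):
        hist, _ = np.histogramdd(pixels.reshape(-1, 3),
                                 bins=(bins,) * 3, range=[(0, 256)] * 3)
        hist = hist.ravel()
        return hist / max(hist.sum(), 1)

    def saliency_score(center_pixels, surround_pixels):
        h_c = color_histogram(center_pixels)
        h_s = color_histogram(surround_pixels)
        # Chi-square distance: high when the proposal stands out.
        return 0.5 * np.sum((h_c - h_s) ** 2 / (h_c + h_s + 1e-9))
    ```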
  • Item
    Lampung handwritten character recognition
    (2016) Junaidi, Akmal; Fink, Gernot A.; Müller, Heinrich
    Lampung script is a local script from Lampung province, Indonesia. It is a non-cursive script that is written from left to right and consists of 20 characters. It also has 7 unique diacritics that can be placed on top of, below, or to the right of a character. Taking these positions into account, the number of diacritic classes grows to 12. This research is devoted to recognizing Lampung characters along with their diacritics. It aims to attract more attention to this script, especially from Indonesian researchers, and is also an endeavor to preserve the script from extinction. The recognition is carried out by a multi-step processing system, the so-called Lampung handwritten character recognition framework. It starts with the preprocessing of a document image, during which the input is separated into characters and diacritics. The characters are classified by a multistage scheme: the first stage classifies 18 character classes, and the second stage classifies special characters that consist of two components, so that the number of classes after the second stage becomes 20. The diacritics are classified into 7 classes. These diacritics must then be associated with the characters to form compound characters. The association is performed in two steps. First, each diacritic detects nearby characters, and the character closest to the diacritic is selected as its association; this is repeated until all diacritics have their characters. Since every diacritic then has a one-to-one association with a character, the pivot element is switched to the character in the second step: each character collects all its diacritics to form a compound character. This framework has been evaluated on a Lampung dataset created and annotated during this work, which is hosted at the Department of Computer Science, TU Dortmund, Germany. The proposed framework achieved a recognition rate of 80.64% on this data.
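    A minimal sketch of the two-step diacritic-character association described above; the inputs (component centroids from preprocessing) and all names are hypothetical.
    ```python
    # Sketch of diacritic-character association: each diacritic picks
    # the closest character (step 1), then each character collects its
    # diacritics into a compound character (step 2).
    import numpy as np
    from collections import defaultdict

    def associate(characters, diacritics):
        # characters, diacritics: lists of (component_id, (x, y)).
        compounds = defaultdict(list)
        for d_id, d_pos in diacritics:
            # Step 1: the nearest character wins the diacritic.
            dists = [np.hypot(d_pos[0] - c_pos[0], d_pos[1] - c_pos[1])
                     for _, c_pos in characters]
            c_id = characters[int(np.argmin(dists))][0]
            compounds[c_id].append(d_id)
        # Step 2: pivot on characters, each with its collected diacritics.
        return dict(compounds)
    ```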
  • Item
    Videobasierte Gestenerkennung in einer intelligenten Umgebung
    (2012-01-19) Richarz, Jan; Fink, Gernot A.; Müller, Heinrich
    This dissertation covers the design of a touchless, user-independent visual classification of arm gestures based on their spatio-temporal motion patterns, using methods from computer vision, pattern recognition, and machine learning. The application scenario is an intelligent conference room equipped with several off-the-shelf cameras. This scenario is particularly challenging for three reasons. First, for interaction to be as intuitive as possible, recognition must work independently of the user's position and orientation in the room, which largely rules out simplifying assumptions about the relative positions of user and camera. Second, a realistic indoor scenario is considered, in which the ambient conditions can change abruptly and the cameras' viewing angles vary widely. This requires the development of adaptive methods that can quickly adjust to such changes or are robust against them within wide limits. Third, the use of an unsynchronized multi-camera system is a novelty, which means that particular attention must be paid to handling the temporal offset between camera images during the 3D reconstruction of hypotheses. This also affects the classification task, since corresponding inaccuracies must be expected in the reconstructed 3D trajectories. An important criterion for the acceptance of a gesture-based human-machine interface is its responsiveness. The design therefore pays particular attention to the efficient realizability of the chosen methods. In particular, a parallel processing structure is realized in which the individual camera data streams are processed separately and the partial results are subsequently combined. Within the scope of this dissertation, the complete image processing pipeline was prototypically implemented. Among other things, it comprises the steps of person detection, person tracking, hand detection, 3D reconstruction of hypotheses, and classification of the spatio-temporal gesture trajectories with semi-continuous hidden Markov models (HMMs). The implemented methods are evaluated extensively on realistic, challenging datasets. Very good results are achieved for both person detection and hand detection, and the gesture classification reaches classification rates of nearly 90% for nine different gestures.
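    A minimal sketch of the pipeline's final step: classifying 3D gesture trajectories with one HMM per gesture class and a maximum-likelihood decision. hmmlearn's Gaussian HMMs stand in for the semi-continuous HMMs used in the thesis; data layout and state count are assumptions.
    ```python
    # Sketch of HMM-based trajectory classification: train one model
    # per gesture, classify by the highest log-likelihood.
    import numpy as np
    from hmmlearn import hmm

    def train_models(train_data, n_states=6):
        # train_data: {gesture: list of (T_i, 3) trajectory arrays}
        models = {}
        for gesture, trajs in train_data.items():
            X = np.vstack(trajs)
            lengths = [len(t) for t in trajs]
            m = hmm.GaussianHMM(n_components=n_states)
            m.fit(X, lengths)
            models[gesture] = m
        return models

    def classify(models, trajectory):
        # Maximum-likelihood decision over the class-specific HMMs.
        return max(models, key=lambda g: models[g].score(trajectory))
    ```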
  • Item
    Advanced ensemble methods for automatic classification of 1H-NMR spectra
    (2010-08-03) Lienemann, Kai; Fink, Gernot A.; Weihs, Claus