Grzeszick, René
2018-09-04
2018-09-04
2018
http://hdl.handle.net/2003/37117
10.17877/DE290R-19113

When creating a visual recognition system for a novel task, one of the main burdens is the collection and annotation of data. Often, several thousand samples need to be manually reviewed and labeled before the recognition system achieves the desired accuracy. The goal of this thesis is to provide methods that lower the annotation effort for visual scene and object recognition. These methods are applicable to traditional pattern recognition approaches as well as to methods from the field of deep learning. The contributions are three-fold, ranging from feature augmentation and semi-supervised learning for natural scene classification to zero-shot object recognition.

The contribution in the field of feature augmentation deals with handcrafted feature representations. A novel method for incorporating additional information at the feature level has been introduced. This information is subsequently integrated into a Bag-of-Features representation. The additional information can, for example, be of a spatial or temporal nature, encoding a local feature's position within a sample in its feature descriptor. The information is quantized and appended to the feature vector, and thus also integrated into the unsupervised learning step of the Bag-of-Features representation. As a result, more specific codebook entries are computed for different regions within the samples. The results in image classification of natural scenes and objects, as well as in acoustic event detection, show that the proposed approach allows for learning compact feature representations without reducing the accuracy of the subsequent classification.

In the field of semi-supervised learning, a novel approach for learning annotations in large collections of natural scene images has been proposed. The approach is based on the active learning principle and incorporates multiple views on the data.
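As an illustration of the feature-augmentation idea described above — quantizing a local feature's position and appending it to the descriptor before the unsupervised codebook learning — the following is a minimal sketch, not the thesis' actual implementation. All function names, the plain k-means codebook step, and the toy data are assumptions for illustration:

```python
import numpy as np

def augment_with_position(descriptors, positions, n_bins=4, weight=1.0):
    """Append a quantized (x, y) position to each local descriptor.

    descriptors: (N, D) array of local features (e.g. SIFT-like)
    positions:   (N, 2) array of (x, y) coordinates normalized to [0, 1]
    The coordinates are quantized into `n_bins` bins per axis, so the
    subsequent clustering also separates features by image region.
    """
    quantized = np.floor(positions * n_bins) / n_bins  # coarse spatial bins
    return np.hstack([descriptors, weight * quantized])

def build_codebook(features, k, n_iter=20, seed=0):
    """Plain k-means (Lloyd's algorithm) as the unsupervised BoF step."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(features[:, None] - centers[None], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):  # skip empty clusters
                centers[j] = features[assign == j].mean(axis=0)
    return centers

def bof_histogram(features, centers):
    """Hard-assign features to codebook entries and normalize the counts."""
    dists = np.linalg.norm(features[:, None] - centers[None], axis=2)
    hist = np.bincount(dists.argmin(axis=1), minlength=len(centers))
    return hist / hist.sum()

# Toy example: 50 random 8-D descriptors with normalized positions.
rng = np.random.default_rng(1)
desc = rng.random((50, 8))
pos = rng.random((50, 2))
aug = augment_with_position(desc, pos)      # (50, 10): descriptor + position
codebook = build_codebook(aug, k=5)
hist = bof_histogram(aug, codebook)         # compact BoF representation
```

Because the appended coordinates take part in the clustering, codebook entries become specific to image regions, which is the effect the abstract describes.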
The views, i.e., different feature representations, are clustered independently of each other. A human in the loop is asked to label each data cluster. The clusters are then iteratively refined based on cluster evaluation measures, and additional labels are assigned to the dataset. Ultimately, a voting over all views creates a partially labeled sample set that is used for training a classifier. The results on natural scene images show that a powerful visual classifier can be learned with minimal annotation effort. The approach has been evaluated for traditional handcrafted features as well as for features derived from a convolutional neural network. For semi-supervised learning, it is desirable to have a compact feature representation. For traditional features, the ones obtained by the proposed feature augmentation approach are a good example of such a representation. Semi-supervised learning is especially beneficial in the field of deep learning, which usually requires large amounts of labeled samples for training or even adapting a deep neural network.

For zero-shot object prediction, a method that combines visual and semantic information about natural scenes is proposed. A convolutional neural network is trained to distinguish different scene categories. Furthermore, the relations between scene categories and visual object classes are learned based on their semantic relation in large text corpora. The probability that a given image shows a certain scene is derived from the network and combined with the semantic relations in a statistical approach. This allows for predicting the presence of certain object classes in an image without any visual training samples of those object classes. The results on a challenging dataset depicting various objects in natural scene images show that, especially in cluttered scenes, semantic relations can be a powerful information cue.
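The zero-shot combination described above can be read as marginalizing over scenes: the score for an object class is the scene posterior from the CNN weighted by a semantic object–scene affinity. The following is a minimal sketch of that combination under assumed inputs; the scene posteriors, affinity values, and category names are all made-up illustrations, not data from the thesis:

```python
import numpy as np

# Hypothetical scene posteriors P(scene | image), e.g. a CNN softmax output.
scenes = ["kitchen", "street", "beach"]
p_scene = np.array([0.7, 0.2, 0.1])

# Hypothetical object-scene affinities derived from semantic relatedness in
# large text corpora (e.g. word-embedding similarity); one row per object.
objects = ["cup", "car", "umbrella"]
affinity = np.array([
    [0.9, 0.1, 0.2],   # cup: strongly related to kitchen
    [0.1, 0.8, 0.2],   # car: strongly related to street
    [0.1, 0.3, 0.7],   # umbrella: strongly related to beach
])

# Normalize each scene column to approximate P(object | scene), then
# marginalize over scenes: P(object | image) = sum_s P(object | s) P(s | image).
p_obj_given_scene = affinity / affinity.sum(axis=0, keepdims=True)
p_obj = p_obj_given_scene @ p_scene   # one score per object class
```

No visual training sample of any object class is used here; the object scores come entirely from the scene classifier and the text-derived relations, which is the mechanism the abstract describes.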
Furthermore, when post-processing the results of a visual object predictor, the detection accuracy can be improved at the minimal cost of providing additional scene labels. When combining these contributions, it is shown that a scene classifier can be trained with minimal human effort and its predictions can still be leveraged for object prediction. Thus, information about natural scene images and the object classes within these images can be gained without the burden of manually labeling tremendous amounts of images beforehand.

en
Computer vision
Semi-supervised learning
Deep learning
004
Partially supervised learning of models for visual scene and object recognition
doctoral thesis
Maschinelles Sehen
Teilüberwachtes Lernen
Deep learning