Author: Wolf, Fabian
Date available: 2024-12-03
Date issued: 2024
URI: http://hdl.handle.net/2003/43008
DOI: 10.17877/DE290R-24841
Title: Self-training for handwritten word recognition and retrieval
Type: PhD thesis
Language: en
DDC: 004
Keywords: Document analysis; Handwriting; Artificial intelligence; Neural networks
Keywords (German): Dokument; Handschrift; Künstliche Intelligenz; Neuronales Netz

Abstract:

For centuries, handwritten documents have been the main means of capturing and storing information. Libraries and archives have gathered, stored and maintained tremendously large document collections. These collections are in many ways a snapshot of their time and hold incredibly valuable data for the social and historical sciences. A significant problem for social scientists and historians alike is that the data stored in these collections is hardly accessible. Usually, no transcriptions exist, and creating them manually is infeasible at large scale. This problem motivates the use of automatic systems based on techniques such as handwriting recognition or automatic word search. While these two tasks are classic problems in the document analysis community with a long-standing tradition, they suffer from a severe drawback: today's well-performing models rely on machine learning techniques, which means they are trained in a supervised fashion on manually annotated training data. The manual creation of training data is a cumbersome process and the main obstacle that often prevents the application of an automatic document analysis system.

This thesis develops a method that allows handwriting recognition and word spotting models to be trained without any manually annotated training samples. The underlying training concept, called self-training, relies on training on automatically generated pseudo-labels. The proposed training scheme can be summarized as follows. First, an initial model is trained on synthetic data generated with a font-based approach. This initial model then makes predictions for an unlabeled training dataset. These predictions constitute the current set of pseudo-labels and are used for another training step. The process is repeated iteratively, alternating between predicting pseudo-labels and training on them. The method is further extended by integrating a confidence measure that allows for a better selection of less erroneous pseudo-labels.

The experiments show that self-training the models considered in this work is feasible and leads to significant performance gains compared to training only on a synthetic dataset. The investigation of synthetic data generation provides several insights, for example, that training on synthetic data constitutes a form of implicit language modeling, and that a calibrated dataset can be generated by using different style predictor networks. Further experiments on the integration of the confidence measures provide evidence that their use benefits performance and leads to higher robustness with respect to poorly performing initial models. It can be concluded that self-training is a highly efficient approach to training well-performing models in the absence of manually annotated data and therefore provides a potential solution for applying such models in the data-scarce domain of automatic analysis of historical document collections.
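
The first step of the scheme, font-based synthetic data generation, can be illustrated with a short sketch. The following is a minimal Python illustration using Pillow, not the thesis' implementation; FONT_PATH, the canvas size, and the rendering parameters are hypothetical placeholders.

# Minimal sketch of font-based synthetic word image generation.
# Assumes Pillow is installed; FONT_PATH is a hypothetical font file,
# not a resource from the thesis.
from PIL import Image, ImageDraw, ImageFont

FONT_PATH = "fonts/handwriting_style.ttf"  # placeholder path

def render_word(word: str, height: int = 64) -> Image.Image:
    """Render a word with a font to obtain a synthetic, labeled word image."""
    font = ImageFont.truetype(FONT_PATH, size=int(height * 0.75))
    left, top, right, bottom = font.getbbox(word)
    canvas = Image.new("L", (right - left + 16, height), color=255)  # white background
    draw = ImageDraw.Draw(canvas)
    # Center the word vertically; the rendered string itself is the label.
    draw.text((8 - left, (height - (bottom - top)) // 2 - top), word, font=font, fill=0)
    return canvas

# Rendering a word list with many different fonts yields an arbitrarily large
# labeled training set; the statistics of the chosen word list are what make
# training on such data a form of implicit language modeling.
synthetic = [(word, render_word(word)) for word in ["handwritten", "archive"]]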
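
The iterative self-training loop with confidence-based pseudo-label selection can likewise be sketched. The code below assumes a model object with train and predict methods and a fixed confidence threshold tau; all names are illustrative stand-ins, and the thesis' actual models and confidence measures are more involved.

# Minimal sketch of the iterative self-training loop described above.
# RecognitionModel, PseudoLabel, tau, and cycles are illustrative placeholders.
from dataclasses import dataclass
from typing import List, Tuple
import random

@dataclass
class PseudoLabel:
    image_id: int       # index of an unlabeled word image
    transcription: str  # predicted transcription used as a training target
    confidence: float   # confidence estimate in [0, 1] for the prediction

class RecognitionModel:
    """Stand-in for a handwriting recognition or word spotting network."""

    def train(self, samples: List[Tuple[int, str]]) -> None:
        pass  # a real model would run a supervised training step here

    def predict(self, image_id: int) -> Tuple[str, float]:
        # A real model would return a transcription and a confidence estimate
        # (e.g. derived from the decoder's output probabilities).
        return "word", random.random()

def self_train(model: RecognitionModel,
               synthetic: List[Tuple[int, str]],
               unlabeled_ids: List[int],
               cycles: int = 5,
               tau: float = 0.8) -> RecognitionModel:
    """Alternate between pseudo-label prediction and training."""
    model.train(synthetic)  # 1. initialize on font-based synthetic data
    for _ in range(cycles):
        # 2. predict pseudo-labels for the unlabeled training set
        pseudo = [PseudoLabel(i, *model.predict(i)) for i in unlabeled_ids]
        # 3. confidence-based selection of less erroneous pseudo-labels
        selected = [(p.image_id, p.transcription)
                    for p in pseudo if p.confidence >= tau]
        # 4. train on the current set of pseudo-labels and repeat
        model.train(selected)
    return model

model = self_train(RecognitionModel(), synthetic=[(0, "example")],
                   unlabeled_ids=list(range(100)))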