Resource-aware annotation through active learning
Date
2010-05-12T15:56:32Z
Abstract
The annotation of corpora has become a crucial prerequisite for
information extraction systems, which heavily rely on supervised
machine learning techniques and therefore require large amounts of
annotated training material. Annotation, however, requires human
intervention and is thus an extremely costly, labor-intensive, and
error-prone process. The burden of annotation is one of the major
obstacles when well-established information extraction systems are to
be applied to real-world problems, so a pressing research question is
how annotation can be made more efficient.
Most annotated corpora are built by collecting the documents to be
annotated through random sampling or simple keyword search. Only
recently have more sophisticated approaches to selecting the base
material been investigated with the aim of reducing annotation effort.
One promising direction is Active Learning (AL), where only examples
of high utility for classifier training are selected for manual
annotation. Because of this intelligent selection, classifiers can be
yielded at a given target performance with fewer labeled data points.
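The AL setting described above is typically realized as a pool-based loop: retrain on the labels collected so far, score the remaining pool by some utility measure, and ask the annotator to label only the highest-utility example. The following Python sketch illustrates that loop with uncertainty (least-confidence) sampling; the synthetic data, logistic regression classifier, and query strategy are illustrative assumptions, not the specific components studied in the thesis.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic two-class pool standing in for unlabeled examples.
X_pool = rng.normal(size=(1000, 5))
y_pool = (X_pool[:, 0] + 0.5 * X_pool[:, 1] > 0).astype(int)  # oracle labels

# Seed set: a few random examples, forced to cover both classes.
labeled = [int(np.argmax(y_pool == 0)), int(np.argmax(y_pool == 1))]
labeled += [int(i) for i in rng.choice(len(X_pool), size=8, replace=False)
            if i not in labeled]
unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

clf = LogisticRegression()

for al_round in range(20):                      # AL iterations
    clf.fit(X_pool[labeled], y_pool[labeled])   # retrain on all labels so far

    # Utility = prediction uncertainty (least-confidence sampling).
    proba = clf.predict_proba(X_pool[unlabeled])
    uncertainty = 1.0 - proba.max(axis=1)

    # Query the most uncertain example; a human annotator would label it here.
    query = unlabeled[int(np.argmax(uncertainty))]
    labeled.append(query)
    unlabeled.remove(query)

print(f"labeled {len(labeled)} examples after 20 AL rounds")

In practice the loop queries batches rather than single examples and terminates via a stopping criterion rather than a fixed round count; both issues are among the adaptations addressed in the thesis.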
This thesis centers on the question of how AL can be applied as a
resource-aware strategy for linguistic annotation. A set of
requirements is defined, and several approaches and adaptations to the
standard form of AL are proposed to meet these requirements. These
include: (1) a novel method to monitor and stop the AL-driven
annotation process; (2) an approach to semi-supervised AL where only
highly critical tokens have to be manually annotated while the rest is
tagged automatically (sketched below); (3) a discussion and empirical
investigation of the reusability of actively drawn samples; (4) a
comparative study of how class imbalance can be reduced right upfront
during AL-driven data acquisition; (5) two methods for selective
sampling of examples that are useful for multiple learning problems;
(6) an extensive evaluation of the proposed approaches to AL for Named
Entity Recognition with respect to both savings in corpus size and
actual annotation time; and finally (7) three methods by which these
approaches can be made cost-conscious so as to reduce annotation time
even further.
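To make contribution (2) concrete, the sketch below shows one generic way such a semi-supervised AL step could look: tokens the current model classifies with high confidence are tagged automatically, and only the remaining critical tokens are routed to the annotator. The confidence threshold, classifier, and toy data are assumptions for illustration; the thesis's actual criteria for identifying critical tokens are not reproduced here.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Toy stand-in for a selected sentence: feature vectors for its tokens.
X_tokens = rng.normal(size=(25, 5))
y_true = (X_tokens[:, 0] > 0).astype(int)           # oracle token labels

# A model trained on previously labeled data (here: a separate toy sample).
X_prev = rng.normal(size=(200, 5))
y_prev = (X_prev[:, 0] > 0).astype(int)
clf = LogisticRegression().fit(X_prev, y_prev)

# Route each token either to the human or to the automatic tagger.
proba = clf.predict_proba(X_tokens)
confidence = proba.max(axis=1)
THRESHOLD = 0.9                                     # assumed confidence cut-off

labels = clf.predict(X_tokens)                      # automatic tags
ask_human = confidence < THRESHOLD                  # critical tokens only
labels[ask_human] = y_true[ask_human]               # simulated manual labels

print(f"manually annotated {ask_human.sum()} of {len(X_tokens)} tokens")

The saving comes from the tokens above the threshold, which never reach the annotator; choosing the threshold trades annotation effort against the risk of propagating automatic tagging errors into the corpus.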
Keywords
Active learning, Machine learning, Named entity recognition, Natural language processing, Information extraction, Corpus annotation