
Randomized outlier detection with trees

dc.contributor.author: Buschjäger, Sebastian
dc.contributor.author: Honysz, Philipp-Jan
dc.contributor.author: Morik, Katharina
dc.date.accessioned: 2021-05-31T10:06:09Z
dc.date.available: 2021-05-31T10:06:09Z
dc.date.issued: 2020-12-15
dc.description.abstract: Isolation forest (IF) is a popular outlier detection algorithm that isolates outlier observations from regular observations by building multiple random isolation trees. The average number of comparisons required to isolate a given observation can then be used as a measure of its outlierness. Multiple extensions of this approach have been proposed in the literature, including the extended isolation forest (EIF) and the SCiForest. However, we find a lack of theoretical explanation of why IF, EIF, and SCiForest offer such good practical performance. In this paper, we present a theoretical framework that views these approaches from a distributional viewpoint. Using this viewpoint, we show that isolation-based approaches first accurately approximate the data distribution and then approximate the coefficients of mixture components using the average path length. Using this framework, we derive the generalized isolation forest (GIF), which also trains random isolation trees but moves beyond the average path length when combining them. That is, GIF splits the data into multiple sub-spaces by sampling random splits, as the original IF variants do, and directly estimates the mixture coefficients of a mixture distribution to score the outlierness of entire regions of data. In an extensive evaluation, we compare GIF with 18 state-of-the-art outlier detection methods on 14 different datasets. We show that GIF outperforms three competing tree-based methods and performs competitively with nearest-neighbor approaches while having a lower runtime. Finally, we highlight a use-case study that uses GIF to detect transaction fraud in financial data. [en]
dc.identifier.uri: http://hdl.handle.net/2003/40232
dc.identifier.uri: http://dx.doi.org/10.17877/DE290R-22105
dc.language.iso: en [de]
dc.relation.ispartofseries: Int J Data Sci Anal;
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: Outlier detection [en]
dc.subject: Isolation forest [en]
dc.subject: Density estimation [en]
dc.subject: Ensemble [en]
dc.subject: Tree [en]
dc.subject.ddc: 004
dc.title: Randomized outlier detection with trees [en]
dc.type: Text [de]
dc.type.publicationtype: article [de]
dcterms.accessRights: open access
eldorado.dnb.deposit: false [de]
eldorado.secondarypublication: true [de]
eldorado.secondarypublication.primarycitation: Buschjäger, S., Honysz, PJ. & Morik, K. Randomized outlier detection with trees. Int J Data Sci Anal (2020). [de]
eldorado.secondarypublication.primaryidentifier: https://doi.org/10.1007/s41060-020-00238-w [de]
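
The isolation idea summarized in the abstract (random splits, with the average number of comparisons needed to isolate an observation serving as its outlierness) can be illustrated with a minimal sketch. This covers only the basic isolation forest principle, not GIF, and it is not the authors' implementation. The function names `path_length` and `average_path_length`, the parameters `n_trees`, `sample_size`, and `max_depth`, and the synthetic data are all assumptions made for illustration.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=10):
    """Count the random splits followed until `point` is alone in its region.

    Hypothetical helper: the tree is never stored, only the branch that
    `point` falls into is followed.
    """
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))          # pick a random feature
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:                              # no split possible on this feature
        return depth
    split = rng.uniform(lo, hi)               # pick a random split value
    if point[dim] < split:
        side = [row for row in data if row[dim] < split]
    else:
        side = [row for row in data if row[dim] >= split]
    return path_length(point, side, rng, depth + 1, max_depth)


def average_path_length(point, data, n_trees=200, sample_size=64, seed=0):
    """Average the path length of `point` over many random subsamples."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        total += path_length(point, sample, rng)
    return total / n_trees


# Synthetic data (an assumption): one Gaussian cluster plus one distant point.
data_rng = random.Random(1)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
data.append((8.0, 8.0))

outlier_score = average_path_length((8.0, 8.0), data)
inlier_score = average_path_length((0.0, 0.0), data)
print("outlier:", outlier_score, "inlier:", inlier_score)
```

A shorter average path suggests that an observation is easier to isolate and therefore more outlying. In this sketch the distant point should obtain a clearly smaller value than a point at the center of the cluster.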

Files

Original bundle

Name: Buschjäger2020_Article_RandomizedOutlierDetectionWith.pdf
Size: 477.12 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 4.85 KB
Format: Item-specific license agreed upon to submission