Statistical analyses of tree-based ensembles

Date

2023

Abstract

This thesis studies tree-based ensemble learners, with particular attention to their behavior as prediction tools for multivariate or time-dependent outcomes and to their efficient implementation. Well-known examples such as Random Forest and Extra Trees are routinely used to predict univariate outcomes; for multivariate outcomes, however, the question arises whether it is better to fit separate univariate models or to follow a multivariate approach directly. Our results show that the multivariate approach is advantageous in scenarios with a high degree of dependency between the components of the outcome; in particular, significant performance differences between the Random Forest variants are observed.

For time series prediction, we investigate whether tree-based methods offer advantages over traditional methods such as ARIMA, particularly in data-driven logistics, where an abundance of complex and noisy data, from supply chain transactions to customer interactions, demands accurate and timely insights. Our results indicate that machine learning methods are effective, especially when the data-generating process involves additional layers of complexity.

Motivated by the trend towards increasingly autonomous and decentralized processes on resource-constrained devices in logistics, we explore strategies to reduce the inference time of machine learning algorithms, focusing on Random Forests and decision trees. Beyond the simple approach of enforcing shorter paths through the trees, we also investigate hardware-aware implementations. One optimization adapts the memory layout so that paths with higher probability are preferred, which is particularly beneficial when the splits within tree nodes are uneven.
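The univariate-versus-multivariate contrast described above can be sketched with scikit-learn, whose `RandomForestRegressor` accepts a multi-output target directly. This is a minimal illustration on synthetic data with correlated outcome components, not the thesis's actual experimental setup; all variable names and the data-generating process are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Correlated two-dimensional outcome: both components share the same signal.
signal = X[:, 0] + X[:, 1] ** 2
Y = np.column_stack([signal + 0.1 * rng.normal(size=200),
                     2 * signal + 0.1 * rng.normal(size=200)])

# Multivariate approach: a single forest fits both components jointly.
joint = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, Y)

# Univariate approach: one forest per outcome component, fitted separately.
separate = [RandomForestRegressor(n_estimators=20, random_state=0).fit(X, Y[:, j])
            for j in range(Y.shape[1])]

pred_joint = joint.predict(X[:5])
pred_sep = np.column_stack([m.predict(X[:5]) for m in separate])
print(pred_joint.shape, pred_sep.shape)
```

In the joint fit, each tree's splits are chosen with respect to both outcome components at once, which is where dependency between components can pay off.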
We present a regularization method that reduces path lengths by rewarding uneven probability distributions during decision tree training. This method proves particularly valuable for a memory-architecture-aware implementation, yielding a substantial reduction in execution time with minimal loss of accuracy, especially for large datasets and binary classification tasks. Simulation studies and real-life data examples from several fields support the findings of this thesis.
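The idea of rewarding uneven splits during training can be illustrated with a toy split criterion: impurity gain minus a penalty on balanced splits. The penalty form and the weight `lam` below are illustrative assumptions, not the regularizer actually proposed in the thesis.

```python
import numpy as np

def gini(y):
    # Gini impurity of a binary label vector.
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    return 2 * p * (1 - p)

def split_score(y_left, y_right, lam=0.0):
    """Impurity gain of a candidate split, minus a balance penalty.

    With lam > 0, balanced splits (left-probability near 0.5) are
    penalized, so training prefers uneven splits, which shorten the
    expected path length through the tree.
    """
    n = len(y_left) + len(y_right)
    p_left = len(y_left) / n
    parent = np.concatenate([y_left, y_right])
    gain = gini(parent) - p_left * gini(y_left) - (1 - p_left) * gini(y_right)
    balance = min(p_left, 1 - p_left)  # 0.5 means perfectly balanced
    return gain - lam * balance

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
even = split_score(y[:4], y[4:], lam=0.3)    # balanced but pure split
uneven = split_score(y[:2], y[2:], lam=0.3)  # uneven but impure split
print(even, uneven)
```

Trading off gain against balance in this way is what allows accuracy to degrade only minimally while the expected number of node visits per inference drops.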

Keywords

Machine learning, Tree-based models, Random forest
