Statistical Analyses of Tree-Based
Ensembles
Dissertation zur Erlangung des Doktorgrades Dr. rer. nat. der Fakultät Statistik der
Technischen Universität Dortmund
Vorgelegt von
Lena Schmid
geboren in Schwäbisch Gmünd
Dortmund, Dezember 2023
Amtierender Dekan:
Prof. Dr. Philipp Doebler
Gutachter:
Prof. Dr. Markus Pauly (Technische Universität Dortmund)
Prof. Dr. Andreas Groll (Technische Universität Dortmund)
Tag der Prüfung:
26.02.2024
Abstract
This thesis focuses on the study of tree-based ensemble learners, with particular attention
to their behavior as a prediction tool for multivariate or time-dependent outcomes
and their implementation for efficient execution. In particular, well-known examples
such as Random Forest and Extra Trees are often used for the prediction of univariate
outcomes. However, for multivariate outcomes, the question arises whether it is better to
fit univariate models separately or to follow a multivariate approach directly. Our results
show that the advantages of the multivariate approach can be observed in scenarios
where there is a high degree of dependency between the components of the results. In
particular, significant differences in the performance of the different Random Forest
approaches are observed. In terms of predictive performance for time series, we are
interested in whether the use of tree-based methods can offer advantages over traditional
time series methods such as ARIMA, particularly in the area of data-driven logistics,
where the abundance of complex and noisy data - from supply chain transactions to
customer interactions - requires accurate and timely insights. Our results indicate the
effectiveness of machine learning methods, especially in scenarios where data generation
processes are layered with a certain degree of further complexity. Motivated by the trend
towards increasingly autonomous and decentralized processes on resource-constrained
devices in logistics, we explore strategies to optimize the execution time of machine
learning algorithms for inference, focusing on Random Forests and decision trees. In
addition to the simple approach of enforcing shorter paths through decision trees, we
also investigate hardware-oriented implementations. One optimization is to adapt the
memory layout to prefer paths with higher probability, which is particularly beneficial
in cases with uneven splits within tree nodes. We present a regularization method that
reduces path lengths by rewarding uneven probability distributions during decision tree
training. This method proves to be particularly valuable for a memory architecture-aware
implementation, resulting in a substantial reduction in execution time with minimal
degradation in accuracy, especially for large datasets or datasets concerning binary
classification tasks. Simulation studies and real-life data examples from different fields
support our findings in this thesis.
iii

Acknowledgments
First of all, I would like to thank my supervisor Professor Markus Pauly for his guidance,
support and patience during my dissertation. I also thank Professor Andreas Groll for
agreeing to be the second supervisor of my dissertation and the BMEL for their funding.
Further, I would also like to thank all my co-authors for their helpful comments, sugges-
tions and discussions, which substantially contributed to this thesis.
My time as a PhD candidate was made truly special by my colleagues in both Ulm
and Dortmund. Despite the challenges of switching universities and locations, their
support and friendship made everything so much easier. Whether we were having
discussions, attending conferences together, or simply enjoying coffee breaks, cakes,
and ’Institutsstammtische’, my colleagues made every moment a delight. I would like to
thank them all for joining me on this journey.
Finally, I would like to thank my parents and my sisters for their encouragement,
unconditional love, and support, which made of all this possible. Thank you for always
being there for me!
“The only true wisdom is in knowing you know nothing.”
– Socrates
v

Contents
Abstract iii
Acknowledgments v
List of the Main Publications ix
Abbreviations xi
I Introduction 1
1 Introduction 3
2 Statistical Methods 7
2.1 Supervised Learning Problems . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Tree-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 CART . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.3 Extremely Randomized Trees . . . . . . . . . . . . . . . . . . 12
2.2.4 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Time Series Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Summary of the Main Articles 21
3.1 Article 1: Tree-Based Ensembles for Multi-Output Regression . . . . . 21
3.2 Article 2: Forecasting in Data-Driven Logistics - A Simulation Study . . 24
3.3 Article 3: TREE: Tree Regularization for Efficient Execution. . . . . . . 26
4 Further Research 29
4.1 Human Activity Recognition in Logistics . . . . . . . . . . . . . . . . 29
4.1.1 Interpretable Multi-Label Stacking . . . . . . . . . . . . . . . . 29
4.1.2 Transfer Learning in Warehousing . . . . . . . . . . . . . . . . 31
4.1.3 Dataset Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
vii
4.2 resKIL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Predicting Effects of Math Training . . . . . . . . . . . . . . . . . . . 36
4.4 Capsule Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5 Discussion and Outlook 41
Bibliography 43
II Publications 57
List of the Main Publications
This cumulative thesis is mainly based on the following three manuscripts:
Article 1: Schmid, L., Gerharz, A., Groll, A. & Pauly, M. (2023). Tree-based ensem-
bles for multi-output regression: Comparing multivariate approaches with sep-
arate univariate ones. Computational Statistics & Data Analysis, 179, 107628,
https://doi.org/10.1016/j.csda.2022.107628.
Copyright (2023), with permission from Elsevier.
Contribution of the author:
The author of this thesis had a leading role in the preparation and structuring of
the manuscript. In addition, she mainly implemented the simulation studies as
well as the analysis of the data examples.
Article 2: Schmid, L., Roidl, M. & Pauly, M. (2023). Comparing statistical and machine
learning methods for time series forecasting in data-driven logistics - A simulation
study. arXiv preprint, https://doi.org/10.48550/arXiv.2303.07139.
Contribution of the author:
The author of this thesis implemented extensive simulation studies and analyzed
the data example under Prof. Pauly’s guidance. She had a leading role in drafting
and writing the paper, with helpful comments from all co-authors.
Article 3: Schmid, L., Biebert, D., Hakert, C., Chen, K.-H., Lang, M., Pauly, M. & Chen,
J.-J. (2024). TREE: Tree Regularization for Efficient Execution. arXiv preprint,
https://doi.org/10.48550/arXiv.2406.12531.
Contribution of the author:
The author of this thesis had the leading role in implementing the simulation study.
She also prepared and structured the manuscript with the input from all co-authors.
ix

Abbreviations
AI Artificial Intelligence
ARIMA Autoregressive Integrated Moving Average.
BR Binary Relevance.
CART Classification and Regression Trees.
CSPS Corn Silage Processing Score.
Extra Trees Extremely Randomized Trees.
HAR Human Activity Recognition.
LGCM Latent Growth Curve Models.
pRSL Probabilistic Rule Stacking Learne.r
resKIL Resource-Efficient Artificial Intelligence for Embedded
Systems in Agricultural Technology.
XGBoost Extreme Gradient Boosting.
M Matrix are denoted by bold symbols.
M⊤ Transpose of a matrix or vector M .
N Natural numbers.
R Real numbers.
R+ Set of all positive real numbers.
x Vectors are denoted by bold symbols.
Z Set of all integers.
∥x∥2 Euclidean norm of the vector x.
1Ix∈A Indicator function.
⌊x⌋ Floor function, i.e., the largest integer smaller than or equal to x.
CA The collection of all possible cuts in the cell A.
Dn Training data set.
Tτ The set of all nodes of a tree τ .
xi

Part I
Introduction
1

1 Introduction
The rapid rise of digitization and increasing computing power have recently led to vast
amounts of data being generated worldwide daily. For example, social media platforms
such as Facebook and Instagram produce a huge amount of data through user interactions,
posts, comments and shared content. The growing popularity of connected devices such
as smart fitness trackers and home security systems also creates a wealth of data about
user behavior and interactions with the devices.
As these data sources become increasingly complex, extracting relevant and accurate
patterns for predictive models becomes a significant challenge. This complexity can
result in mathematical problems where a traditional modeling strategy can be either too
time-consuming or even too complex to provide satisfactory results. This has led in part
to a change in approach to statistical modeling. Instead of modeling the complex data
structure and searching for suitable analysis tools, the focus is on developing algorithms
that are tailored to solve the problem. The development of such techniques has mainly
emerged from the field of machine learning.
Formally, machine learning is defined as the study of systems that can learn from
data without explicit programming. Following Mitchell (1997), it is assumed that a
computer program is considered to learn data within a particular class of tasks and a
defined performance measure if its capabilities in these tasks improve with additional
data. Machine learning algorithms are in particular used to process classification or
regression tasks. For both types of tasks, a learning algorithm requires two datasets:
one that is used to train the algorithm and another for evaluating performance measures,
which consists of previously unseen observations. If the outcomes of interest in the
training set are known, the learning problem is categorized as supervised learning. In
contrast, in scenarios where the outcome of interest is unknown during the training
phase, the learning problem falls under unsupervised learning, specifically categorized
as clustering. Both types of problems can encounter challenges in the modeling phase if
the information in the features is of mixed type, including nominal, ordinal or metric
scaled data. This variety of learning scenarios places high demands on the algorithms
used in machine learning processes.
3
1 Introduction
Ensemble learning is a possible solution to these challenges, especially in the context
of supervised learning. In this machine learning technique, the predictions of multiple
individual models are combined to make a more robust and accurate prediction. In
particular, ensembles of decision trees are preferred over trees as a learning method
due to the instability of decision trees and their computational efficiency in training
and testing (Hastie et al., 2009). These tree-based ensemble learning methods often
achieve top performance on many supervised learning problems (Dietterich, 2000;
Fernández-Delgado et al., 2014; Zhang and Ma, 2012; Sagi and Rokach, 2018).
Two popular ensemble strategies, boosting (Freund et al., 1999) and bootstrap aggrega-
tion (bagging) (Breiman, 1996), have been extensively studied. Gradient tree boosting
(Friedman, 2001) has gained great popularity due to its iterative learning process in
which errors are corrected with each subsequent tree. Random Forests, a well-known
tree-based ensemble method, uses bagging (bootstrap aggregation) to create an ensemble
of decision trees. As shown in Figure 1.11 the number of publications citing the original
Random Forests publication (Breiman, 2001) is increasing, highlighting the growing
research interest in this approach.
5000
4000
3000
2000
1000
0
2000 2005 2010 2015 2020
Year
Figure 1.1: Number of publications with citation of the initial Random Forest paper of
Breiman (2001) between 2000 and 2023.
© Copyright Clarivate 2023. All rights reserved.
1This figure was created based on data from the Web of Science in November 2023.
4
Count
The aim of this thesis is to analyze tree-based ensembles from both performance and
efficiency perspectives, with a focus on supervised learning. While these methods have
proven successful in predicting univariate outcomes, e.g., (Wang et al., 2018; Liu and
Wu, 2017; Everingham et al., 2016), extending them to multivariate or time-dependent
outcomes presents a number of complex challenges and opportunities.
In multivariate analysis, the question arises: Should univariate models be fitted separately,
or is a direct multivariate approach more appropriate? Univariate models treat each
component of the outcome independently and capture individual relationships within
the data. In contrast, a direct multivariate approach considers the joint distribution
of outcomes, allowing for the capture of potentially more complex interactions or
dependencies (De’Ath, 2002; Segal and Xiao, 2011). A special case of multivariate data
is repeated measures (Crowder and Hand, 2017) that we treat within another research
project, where the aim is to investigate the potential of multivariate tree-based models
to predict learning trajectories. Note that repeated measures are essentially a subset of
multivariate data with a temporal dimension that connects the domains of multivariate
data and time series (Littell et al., 1998).
For time-dependent outcomes, tree-based ensemble methods have proven to be practical
tools (Kane et al., 2014), offering the advantage of capturing nonlinear relationships and
accommodating different data types. Despite their frequent use, the comparison of the
performance of tree-based ensembles with traditional time series methods, which are
generally faster to compute and easier to interpret, has not yet been widely discussed,
especially not in the context of logistics, where time series forecasts are important at
all levels of the supply chain (Syntetos et al., 2016). In addition to the importance of
time series forecasts for logistics, there is also a trend towards increasingly autonomous
and decentralized processes, such as autonomous robots (Krüger et al., 2020; Shamout
et al., 2022). These often go hand in hand with the switch from a centralized control
system to a decentralized one with less (computational) resources (Delfmann et al.,
2018; Venkatapathy et al., 2017). In such scenarios with resource-constrained devices,
optimizing the execution time of machine learning models is essential. In this thesis, the
concept of regularizing the training process is presented as a way to achieve this goal.
This thesis is organized as follows: Chapter 2 describes the underlying statistical models
and methods, such as the tree-based ensembles considered in our work. Chapter 3
provides a summary of the three main research articles underlying this dissertation,
followed by Chapter 4, which offers a short overview of the results from additional
projects undertaken during my doctoral studies. Chapter 5 contains a discussion of
the results, as well as an outlook for future research. Finally, Part II contains the three
articles summarized in Chapter 3.
5
1 Introduction
6
2 Statistical Methods
2.1 Supervised Learning Problems
Supervised learning is a branch of machine learning that deals with extracting knowledge
from datasets to make predictions when the outcome variable of interest is known (Hastie
et al., 2009).
In supervised learning, an algorithm is trained with training data Dn consisting of pairs
of input featuresX ∈ X and their corresponding known outcomes Y ∈ Y , where bothX
and Y are metric spaces. More precisely, Dn is given by Dn = {(Xi, Yi) ∈ X × Y , i =
1, . . . , n}, where n ∈ N and (Xi, Yi), i = 1, . . . , n, are often assumed to be independent
and identically distributed random variables with the same distribution as (X, Y ).
A central object of supervised learning is modeling a functional mapping f : X → Y
between the featuresX and the corresponding outcomes Y . Since f is usually unknown,
a method for determining an approximation f̂ is used. The quality of this estimate is
evaluated using loss functions L(Y, f̂(X)) that measure the discrepancy between the
predicted results and the actual values. The main goal is to determine f̂ such that it
minimizes the expected loss over all possible data points. In practice, however, the
multivariate distribution of (X, Y ) is usually unknown. Therefore, f̂ is determined by
optimizing the estimated generalization error instead. Since we are often interested
in predicting unseen data, the original dataset Dn is usually split (several times) into
a training and a test dataset, where the training dataset is then used to fit the model,
while the test dataset is used to evaluate the model’s performance. Various resampling
techniques such as cross-validation and bootstrapping can be used to create training and
test datasets from the original dataset (Hastie et al., 2009).
Classification One application of supervised learning is classification. In classifi-
cation, the goal is to assign class labels to the input features. In other words, in this
case, the support of Y is a countable set with finite cardinality. This type of learning
finds extensive applications in various fields, e.g., Zhang et al. (2018); Angelini et al.
7
2 Statistical Methods
(2008); Conneau et al. (2017). In medical diagnosis, for example, the aim is to determine
whether a disease is present or not based on patient data and symptoms (Gupta et al.,
2019; Kourou et al., 2015). Another example is given by human activity recognition in
warehouses, which I also studied during my thesis (Kirchhof et al., 2021b,a; Nair et al.,
2023). Here, the aim is to classify typical activities of workers based upon video and
sensor data.
Classification problems are often categorized based on the number of class labels. In
binary classification, there are two classes, while in multi-class classification, there are
more than two.
Regression In contrast to classification, regression problems deal with outcomes
whose support is not finitely countable. Regression finds applications, for example, in
predicting housing prices based on size and location (Park and Bae, 2015) and forecasting
the annual growth of a country’s gross domestic product based on historical data and
economic indicators (Yoon, 2021). Further examples are concerned with the prediction
of school performances based upon certain interventions and personal information (see
Section 4.3) or forecasting of quantitative time series in logistics (Schmid et al., 2023b).
If X = Rp and Y = Rd, p, d ∈ N, a regression task is often described as follows
Y = f(X) + ε,
where ε denotes the error variable. It is often assumed that the error variable has an
expected value of zero and an existing covariance matrix. If d > 1, the task is referred
to as a multi-output regression problem.
In this thesis, we explore both classification and regression problems. The first two
articles focus on regression, while the third addresses classification. Moreover, my
additional research projects conducted during my studies (Section 4) also deal with both
tasks, particularly covering application fields such as logistics (Section 4.1), agriculture
(Section 4.2) and social sciences (Section 4.3) and additional methods such as capsule
networks (Section 4.4).
2.2 Tree-Based Methods
Tree-based methods divide the feature space into a series of regions and then fit simple
models to each partition. They are conceptually relatively simple but often powerful
(Hastie et al., 2009). We first describe the well-known CART method, which serves as
8
2.2 Tree-Based Methods
the basis for more advanced models, including Random Forests, Extreme Randomized
Trees and Extreme Gradient Boosting. These advanced models are ensemble learners
combining multiple decision trees to improve prediction performance. In the following,
we assume that X = (X(1), . . . X(p))⊤ ∈ X = Rp and Y ⊂ R.
2.2.1 CART
Classification and regression trees (CART) (Breiman, 2017) are a class of nonparametric
algorithms that can handle classification and regression learning problems. As single
decision trees, they fall into the category of weak learners (Zhou, 2012). Consequently,
the predictions forX can vary if the training set Dn is changed. The general principle of
CART is to partition the support of X into disjoint regions R1, . . . , RJ , where J ∈ N.
These regions are often referred to as leaves. Each region Rj, j = 1, . . . , J, is assigned a
constant value cj ∈ Y , resulting in the final CART model fCART which is given by
∑J
fCART : Rp → Y , x → cj1Ix∈R .j
j=1
CART grows the tree greedily in a top-down fashion using binary splits. It starts with the
entire feature space ofX at the root node. To define a split at a node, let A ⊂ X = Rp
be the corresponding region and let Nn(A) denote the number of observations (Xi,Yi)
in the training dataset Dn with Xi ∈ A. A split is then given by (i, z )⊤i , where
i ∈ {1, . . . , p} is the split dimension and zi is the split value from the support of the
i-th coordinate within A. The choice of a split is determined by optimizing a criterion
dependent on the type of problem under consideration.
In the case of a regression problem (Y = R), the split is determined by maximizing the
decrease in empirical variance, also known as L2 loss (Loh, 2011). Let CA be the set of
all possible splits for a given node and its corresponding region A. Then, the best split
(i∗n, z
∗
n) based on the L2 loss is given by
{ (∑n1
(i∗n, z
∗
n) = argmax (Y − Y 2(i,z)∈CA k A) 1IXk∈ANn(A)
k=1
∑ )}n ( )2
− Yk − Y A 1I (i) − Y 1I 1I ,L X <z AR (i)X ≥z Xk∈Ak k
k=1
where AL = {x ∈ A : x(i) < z}, AR = {x ∈ A : x(i) ≥ z} and Y A, Y A andL
9
2 Statistical Methods
Y A denote the corresponding arithmetic means of the outcomes belonging to A, AR L
and AR, respectively. To save computing resources, the algorithm selects the middle
of two consecutive observations as possible split values in the optimization step (Loh
and Shih, 1997). Note that the definition of the best split is only valid if the i-th
component of X is continuous. If it is categorical, the criterion slightly changes. The
indicators are now given by 1I (i) for all categories χi of the i-th component andXk =χi
AL = {x ∈ A : x(i) ̸= χi}, AR = {x ∈ R : x(i) = χi}, see Breiman (2017).
The L2 loss is not suitable for classification problems. Here, impurity measures are used
to evaluate splits. These measures assess the purity of a region, indicating the extent to
which it contains observations of a single class label. The most commonly used impurity
measures QK for K ∈ N classes and a region A are (Hastie et al., 2009):
(i) Misclassification error: QK(p̂(A)) = 1−maxk=1,...,K p̂k(A),
∑
(ii) Gini index: QK(p̂(A)) =
K
k=1 p̂k(A) (1− p̂k(A)),
∑
(iii) Cross entropy: QK(p̂(A)) = − Kk=1 p̂k(A) log (p̂k(A)),
where p̂(A) = (p̂1(A), . . . , p̂K(A))
⊤ and p̂k(A) denotes the proportion of observations
that fall into A with the class label rk, 1 ≤ k ≤ K. The optimum split of such an
impurity measure is then given by (Hastie et al., 2009)
{
(i∗
Nn(AL)
n, z
∗
n) = argmax(i,z)∈C QK(p̂(A))− QK(p̂(A ))A LNn(A)}
−Nn(AR)QK(p̂(AR)) .
Nn(A)
After applying these splits recursively until a stopping criterion is fulfilled, the goal
is to appropriately estimate the constant values (regression) respectively class labels
(classification) {c1, . . . , cJ}. Fo∑r regression problems, the region Rj assigned estimated
constant ĉj is given by ĉj = i:X ∈R Yi, while in classification problems majorityi j
voting is performed in each region to obtain the class label, i.e., ĉj = modei:Xi∈R (Yj i).
Parameters that control the complexity of the tree are often used as stopping criteria.
Examples include limiting the maximum tree depth, specifying a minimum number
of samples in each leaf or using a minimum improvement threshold for the selected
splitting criterion (Therneau and Atkinson, 1997). If a CART tree grows too deep, the
number of regions J is very large and overfitting can occur, i.e., the tree fits the training
data too well but has difficulties in correctly predicting new, unseen data. Ways to reduce
the effect of overfitting are pruning, the subsequent reduction and optimization of the
10
2.2 Tree-Based Methods
trees, or bagging, where a bootstrap sample of the training data is used when creating
the trees (Hastie et al., 2009).
2.2.2 Random Forest
Random Forest is an ensemble learning method for regression and classification tasks
based on the construction of multiple decision trees. Different randomness sources are
introduced into the tree construction process to obtain various decision trees from a
single dataset. This randomness can include randomizing the dataset itself (Breiman,
1996), the feature set (Ho, 1998) or a combination of both (Breiman, 2001; Cutler and
Zhao, 2001).
The most popular Random Forest algorithm (Breiman, 2001) aims to eliminate the
prediction deficiencies of CARTs (Breiman, 2017), such as overfitting and the strong
dependence on the training dataset. By growing multiple trees in parallel, the biases and
variance of the model are reduced simultaneously (Breiman, 2001). The main principles
of Random Forests are bagging and feature sub-selection. This means a random selection
of training data is used when creating an unpruned tree. During construction, only a
small and fixed number of randomly selectedmtry features are used as split candidates
instead of all available features. The same split criteria are used as for CART, i.e., the
L2 loss for the regression and the Gini index for the classification. However, many
other splits, such as the absolute deviation from the median, can also be used. To
make predictions for new data, Random Forest aggregates results from multiple trees by
calculating the arithmetic mean for regression tasks and applying majority voting for
classification.
The most essential hyperparameters for the Random Forests are:
(i) The number of decision trees B ∈ N in the ensemble.
(ii) The number of pre-selected features mtry for conducting splits.
(iii) The number of sampled data points in each tree an ∈ {1, . . . , n}.
(iv) The sampling strategy S .
(iv) Various tree complexity parameters that can be used as stopping criteria, such as
the minimum number of observations that each leaf node should contain nmin ∈
{1, . . . , an} or the number of leaves in each tree maxnodes ∈ {1, . . . , an}.
The choice of hyperparameters is task-dependent and can be determined through tech-
niques like cross-validation or random search (Bartz et al., 2023). Several packages
11
2 Statistical Methods
already implement automatic tuning procedures for Random Forests (Kuhn, 2008; Bischl
et al., 2023). However, it is known that Random Forests often provide good results in
the default settings (Probst et al., 2019a; Fernández-Delgado et al., 2014). The number
of trees in a forest is a parameter that is usually not tuned as it is known that more trees
are better (Dı́az-Uriarte and Alvarez de Andrés, 2006; Scornet, 2017). The common
default value is B = 500 or B = 1, 000 (Probst et al., 2019b).
One of the central hyperparameters is the number of pre-selected features mtry ∈
{1, . . . , p}. Lower values of mtry lead to more different and less correlated trees,
enhancing stability during aggregation. However, lower mtry values can also cause the
trees to perform worse on average, as they may be based on a small suboptimal subset of
randomly selected candidate features (P√robst et al., 2019b). Breiman (2001) proposes
to use m = ⌊ptry ⌋ for regression and ⌊ p⌋ for classification as default values, which3
is still the standard choice in many software implementations such as the fast ranger
implementation in R (Wright and Ziegler, 2017). There are often two options for the
sampling strategy: Sampling with or without replacement. By default, the number of
sampled data points in each tree an is set to n, which means resampling is performed with
replacement (Hastie et al., 2009). However, in theoretical investigations sampling without
replacement is often assumed as it is easier to handle (Scornet et al., 2015; Ramosaj,
2020). The default value for nmin is five for regression and one for classification
(Dı́az-Uriarte and Alvarez de Andrés, 2006). Typically, trees are fully grown, i.e.,
maxnodes = an, or in the case of sampling without replacement, maxnodes = ⌊0.632an⌋
(Wright and Ziegler, 2017).
Several modifications and extensions of the original Random Forest algorithm have
been developed, encompassing alterations in various aspects, including the random
feature selection process, bagging procedures and the split criteria (Meinshausen, 2006;
Clémençon et al., 2013; Ishwaran et al., 2008). In our research, we investigate the use of
random forests for multi-output regression (Schmid et al., 2023a) and explore the use
of regularization to improve performance in classification tasks (Schmid et al., 2024).
Furthermore, we investigate its application in different domains such as social sciences
(Section 4.3) and data-driven logistics (Schmid et al., 2023b).
2.2.3 Extremely Randomized Trees
Extremely Randomized Trees (Extra Trees) (Geurts et al., 2006) is an ensemble method
that aims to obtain trees that are more decorrelated than trees from the Random Forest
approach by introducing an additional source of randomness into the method. The tree
12
2.2 Tree-Based Methods
construction process is very similar to that of Random Forests, with some differences.
In the original Extra Trees algorithm, no bagging is performed, i.e., each tree is built
using the entire available training dataset. Instead of computing each feature’s locally
optimal split value through a greedy search, Extra Trees randomly selects a set of split
values (Geurts et al., 2006).
Most of the primary hyperparameters are the same as for Random Forests. An additional
hyperparameter is the number of random split values at each node, denoted as nsplits.
Geurts et al. (2006) investigate the influence of this parameter in a benchmark study.
Typically, nsplits is set to one (Simm et al., 2014).
Although bagging is not included in the original Extra Trees method, some implementa-
tions, such as ranger (Wright and Ziegler, 2017), have integrated it. Moreover, Simm
et al. (2014) has extended the method to multi-task learning. In Schmid et al. (2023a),
we investigated the use of Extra Trees for multi-output regression and compared their
performance with the extension of Simm et al. (2014).
2.2.4 Boosting
Boosting is a powerful class of learning algorithms that combine multiple sequentially
constructed weak learners. In general, a boosting model f boost has the form
∑M
f boost(x) = fm(x),
m=1
where f1, . . . fM ,M ∈ N are some weak learners, often referred to as base functions in
the following. Friedman et al. (2000) showed that boosting can be viewed as an additive
expansion in a set of elementary functions. Therefore, f boost can be rewritten as
∑M
f boost(x) = αmh(x; am),
m=1
where α1, . . . αM ∈ R are expansion coefficients and h : Rd → R with x → h(x; a) are
simple base functions characterized by a set of parameters a ∈ RD. For example, if
CART learners are chosen as base functions h, the parameters a1, . . . , aM refer to the
parameterization of the decision trees.
To fit such a model, the base functions are selected and then their parameters {α , a }Mm m m=1
13
2 Statistical Methods
are determined by minimizing a loss function L averaged over the training dataset Dn:
1 ∑
(
n ∑ )M
{αm, a Mm}m=1 = argmin{α̃ ,ã }M L Yi, α̃mh(Xi, ãm m m) .m=1 n
i=1 m=1
As this process can be computationally intensive, a forward stagewise additive modeling
approach is often used (Friedman et al., 2000). Here, the optimization is performed by
sequentially adding new basis functions to the additive expansion without adjusting the
parameters and coefficients of those that have already been added. At each iteration m,
m = 1, . . . ,M, we determine
∑n
(αm, am) = argminα,a L(Yi, gm−1(Xi) + αh(Xi, a)), (2.1)
i=1
∑
with g0(x) = 0 and g
m−1
m−1(x) = i=1 αih(x; ai), which results in
gm(x) = gm−1(x) + αmh(x; am),
see Hastie et al. (2009) for details. The understanding of boosting algorithms as numeri-
cal optimization methods in the function space was developed by Breiman (1998, 1999).
Based on this, more general boosting algorithms were developed to optimize any differ-
entiable loss function (Friedman, 2001; Mason et al., 1999). They employed functional
gradient descent to solve Equation (2.1). The negative gradient for a differentiable loss
function L based on the training dataset is given by
[ ]
− ∂L(Yi, g(Xi))γm(xi) = − .
∂g(Xi) g(x)=gm−1(x)
Note that the negative gradient is only available at the data points given in Dn. Thus, to
generalize to other points, we approximate the negative gradient −γm(x) by h(x; am),
where
∑n
{αm, am} = argmin (−γ 2m(Xi)− αh(Xi; a)) .
α,a
i=1
The step length ρm to take in this step direction is then determined using a line search of
the form
∑n
ρm = argmin L(Yi, gm−1(Xi) + ρh(Xi; am)).
ρ
i=1
Friedman (2001) additionally introduces a regularization parameter 0 < ν ≤ 1, which
is multiplied to the step length at each iteration. The factor is also referred to as the
14
2.2 Tree-Based Methods
learning rate. The step taken at each iteration m is given by
νρmh(x; am).
The gradient boosting method for arbitrary supervised learning problems is outlined in
Algorithm 1. Instead of initializing g0(x) = 0 as in Equation (2.1), the optimal constant
Algorithm 1 Gradient Boosting
1: Input: Training dataset Dn, n∑umber of iteration M , loss function L, learning rate ν.
2: Initialize: f0(x) = argmin nρ i=1 L(Yi, ρ);
3: for m = 1, . . . ,M do [ ]
4: Compute for each i = 1, . . . , n : −γ (x ) = − ∂L(Yi,g(Xi))m i ;
∑ ∂g(Xi) g(x)=gm−1(x)
5: Set {αm, am} = arg n 2∑minα,a i=1[−γm(Xi)− α · h(Xi; a)] ;
6: Set ρm = argmin
n
ρ i=1 L(Yi, gm−1(Xi) + ρh(Xi; am));
7: Update: gm(x) = gm−1(x) + νρmh(x; am)
8: end for
9: Output gM(x)
model is initialized (line 2) (Hastie et al., 2009).
A well-known implementation of gradient boosting is Extreme Gradient Boosting (XG-
Boost) proposed by Chen and Guestrin (2016). It uses the Newton step instead of the
gradient step, which, according to Friedman et al. (2000), can reduce susceptibility
to overfitting and lead to improved performance. In this approach, in addition to the
negative gradients, the second-order derivatives are also taken into account, which are
given by
[ ]
∂2L(Yi, g(Xi))
δm(xi) = .
∂g(X 2i) g(x)=gm−1(x)
The Newton step is found by solving:
∑n ( )
am = argmin γm(Xi)h(Xi, a) +
1δm(Xi)h(Xi, a)
2 ,
a 2
i=1
which amounts to a weighted least-squares regression problem. The step length is then
given by νh(x; am).
Furthermore, XGBoost incorporates randomization and regularization techniques to
reduce overfitting while increasing training speed. In particular, when using CART as
15
2 Statistical Methods
the base learning method, XGBoost applies several strategies to speed up training. In
particular, it addresses the computational complexity of identifying the optimal split, a
particularly time-consuming aspect of decision tree construction algorithms. Instead of
evaluating all possible candidate splits, XGBoost employs a method based on percentiles
of the data, where only a subset of candidate splits is tested and their gain is computed
using aggregated statistics (Chen and Guestrin, 2016).
In general, the hyperparameter for XGBoost can be divided into the following categories:
(i) General boosting parameters, including the number of iterations M and the learn-
ing rate ν.
(ii) Base learner dependent parameters.
Friedman (2001) shows that smaller values of ν tend to improve the generalization
performance. Note, however, that a reduction of ν usually requires more iterations.
Consequently, the reduction of ν is associated with an increased computational effort. In
fact, Bühlmann and Hothorn (2007) mention that too large a choice of iterations M can
lead to a slow overfitting problem. When trees are used as base learners, the additional
hyperparameters are very similar to those discussed for the previously presented methods
and control the complexity of the individual trees.
Since XGBoost has demonstrated state-of-the-art performance across diverse problem
domains (Chen and Guestrin, 2016), we have investigated its effectiveness in time series
prediction (Schmid et al., 2023b). Some mathematical background for the latter is
provided, assuming a basic knowledge of time series analysis. For a more detailed
description, the reader is referred to Box et al. (2015) and Brockwell and Davis (2002).
2.3 Time Series Forecasting
A time series process is given by a sequence {yt, t ∈ T} of realization of real-valued
variables Yt, where t belongs to an at most countable index set T , typically T = N or
T = Z, with equidistant elements (Hamilton, 2020). The goal of time series prediction
is to approximate a future variable yt+H by a function
ŷt+H = v(yt, yt−1, . . .)
of the observed values up to the present time t, where H > 0 is the so-called prediction
horizon. Typically, this function v is chosen from a class of functions such that the
prediction error is minimal.
16
2.3 Time Series Forecasting
Commonly used for modeling time series, Autoregressive Integrated Moving Average
(ARIMA) processes (Box et al., 2015) combine autoregressions (AR) and moving aver-
ages (MA) with differencing to make the time series stationary. The ARIMA(q1, q2, q3)
model is defined as
φ(∆)(1−∆q3)yt = θ(∆)εt,
where {εt} is a white noise process, ∆ the backshift operator, φ(z) and θ(z) are poly-
noms of order q1 and q2, respectively.
To use this approach for prediction, appropriate values for the parameters have to be
estimated. This typically involves a two-step procedure: determining the model order
parameters q1, q2, q3 and then estimating the coefficients of the polynomials φ(z), θ(z)
based on these parameters (Hyndman and Athanasopoulos, 2018). The estimation meth-
ods presented in the following are based on the approach of Hyndman and Khandakar
(2008) which is implemented in the R-package forecast. The differentiation parame-
ter q3 is determined through successive unit root tests (Kwiatkowski et al., 1992). The
procedure consists of testing the data for a unit root. If the result is significant, the
differenced data is tested for a unit root, continuing until a non-significant result is ob-
tained. The values of q1 and q2 are selected by minimizing a model information criterion
(Konishi and Kitagawa, 2008) such as AIC, AICc or BIC after differentiating the data
q3 times. Instead of considering every possible combination of q1 and q2, the algorithm
uses a stepwise search to explore the model space. Once the model order parameters,
i.e., the values of q1, q2 and q3, are identified, the coefficients of the polynomials φ(z)
and θ(z) are estimated using maximum likelihood estimation.
Seasonality in a time series refers to recurring patterns or fluctuations that exhibit
regular and predictable behavior over fixed time intervals m. A seasonal model contains
coefficients of order q1 and q2 and includes observations with lags of multiples ofm. The
order of these terms is denoted by qs1 for the seasonal AR part and q
s
2 for the seasonal MA
part. The polynomials Φ(x) and Θ(x) are constructed similarly to their non-seasonal
counterparts, resulting in a seasonal ARIMA(q1, q2, q3)(qs s s1, q2, q3)m model (Shumway
and Stoffer, 2017), which is given by
Φ(∆m)ϕ(∆)(1−∆m qs) 3(1−∆q3)y = Θ(∆mt )θ(∆)εt.
Exponential smoothing offers an alternative approach for time series forecasting. By
iteratively updating weighted averages of historical observations, exponential smoothing
adapts to changing trends and captures seasonality. The TBATS model (De Livera et al.,
17
2 Statistical Methods
2011), which is based on exponential smoothing, is designed to model time series with
multiple trigonometric seasonalities. It uses a combination of Fourier terms with an
exponentially smoothing state space model and a Box-Cox transformation (Box and
Cox, 1964).
To be more precise, a TBATS model at time t with S seasonal pattern is given by
{
yζt−1
(ζ) , if ζ ̸= 0
y ζt =
log yt if ζ = 0,
∑S
(ζ) (i)
yt = ℓt−1 + ψbt−1 + st + wt,
i=1
ℓt = ℓt−1 + ψbt−1 + ηwt,
bt = (1− ψ)b+ ψbt−1 + βwt,
∑q′ ′1 ∑q2
w = φ′w ′t i t−i + θiεt−i + εt,
i=1 i=1
where ζ is the Box-Cox transformation parameter, ℓt is the local level at time t, b is the
long-run trend, bt the short-run trend at time t, wt denotes an ARIMA(q′1, q
′
2, 0) process
and εt is Gaussian white noise process with zero mean and constant variance σ2. The
smoothing parameters are given by η and β. The local level is incorporated to capture
short-term variations or fluctuations in the data (Brockwell and Davis, 2002). In addition,
a damped trend (Gardner Jr. and McKenzie, 1985) with a damping parameter ψ ensures
predictions of future values of the short-run trend bt, which refers to fluctuations or
patterns associated with periodic or cyclical movements that repeat over a short time
horizon, converge to the long-run trend b (Gardner Jr. and McKenzie, 1985). This
long-run trend represents the general tendency of the time series. For the seasonal
component’s process (i)st , i = 1, . . . , S, a trigonometric representation based on Fourier
series is used (West and Harrison, 2006; Harvey, 1990):
∑κi
(i) (i)
st = sj,t ,
j=1
(i) (i) (i) ∗(i) (i) (i)
sj,t = sj,t−1 cos(ωj ) + sj,t−1 sin(ωj ) + ξ1 wt,
∗(i) − (i) (i) ∗(i) (i) (i)sj,t = sj,t−1 sin(ωj ) + sj,t−1 cos(ωj ) + ξ2 wt,
where (i)ω = 2π j (i) (i)j , ξ1 and ξ2 are smoothing parameters and κm i denotes the number ofi
18
2.3 Time Series Forecasting
summands in the Fourier series for the i-th seasonal component.
Selecting the final model in the TBATS algorithm involves a systematic evaluation of
various model specifications. TBATS explores a spectrum of alternatives by fitting
multiple models with distinct characteristics, including the consideration of:
(i) The inclusion or exclusion of the Box-Cox transformation.
(ii) The presence or absence of trend components.
(iii) The incorporation or omission of trend damping effects.
(iv) The utilization or non-utilization of ARIMA(q′ , q′1 2, 0) processes for modeling
residuals.
(v) The exploration of non-seasonal models.
(vi) The assessment of different quantities of harmonics to effectively model seasonal
effects.
The final model is chosen based on the model information criterion AIC and is imple-
mented in the R-package tbats (De Livera et al., 2011).
In addition to traditional time series approaches such as ARIMA and TBATS, machine
learning methods are increasingly being used for time series forecasting. These methods
are able to capture complex patterns and non-linear relationships and have proven their
effectiveness in various areas (Kane et al., 2014; Khaidem et al., 2016; Dudek, 2015;
Wang et al., 2018). In finance, for example, XGBoost approaches are used to predict
crude oil prices (Gumus and Kiran, 2017) and electricity loads on the Australian energy
market (Abbasi et al., 2019). In retail, XGBoost approaches are being explored for
predicting store sales (Zhang et al., 2021), while in transportation they are used to predict
NYC cab travel time (Huang et al., 2020).
Random Forest models have been applied to generate hour-ahead wind power forecasts
(Lahouar and Slama, 2017). In the domain of online retailing, Random Forest approaches
are employed to model real-time delivery time forecasts (Salari et al., 2022) and for
predicting product demand for grocery items (Vairagade et al., 2019). Notably, modifi-
cations of Random Forest tailored for time series, particularly in resampling strategies,
have been explored (Härdle et al., 2003). Goehry et al. (2021) have investigated the use
of different block bootstrap strategies, preserving dependency structures by grouping
data into blocks and incorporated their findings into the rangerts package (Goehry
et al., 2017). Fokam (2022) has examined the use of an AR-sieve bootstrap (Bühlmann,
1997) that applies residual resampling by fitting an AR process to the data. In a sim-
ulation study, Fokam (2022) compared the block bootstrap strategy with the standard
19
2 Statistical Methods
one. The results indicate that the block bootstrap method may not offer a statistically
significant advantage over the classic bootstrap method.
In Schmid et al. (2023b), we conduct a comparative analysis in which we evaluate the
performance of Random Forest and XGBoost compared to traditional time series meth-
ods in the context of data-driven logistics, where robust forecasting methods are essential
for optimizing inventory management, ensuring on-time deliveries and optimizing supply
chains (Syntetos et al., 2016).
20
3 Summary of the Main Articles
3.1 Article 1: Tree-based ensembles for multi-output
regression: Comparing multivariate approaches
with separate univariate ones
We consider the case of multi-output regression described in Section 2.1, where our goal
is to establish a functional relationship between a multivariate outcome Y ∈ Rd, d ≥ 2
and some features X ∈ Rp. The motivation behind such multivariate analyses arises
from the possibility of capturing dependencies between outcomes, which may lead to
improved prediction performance compared to separate univariate analyses. While the
need to develop valid multivariate methods has already been recognized in inferential
statistics, there is still a gap in comprehensive studies that fully exploit the potential
of multivariate machine learning regression methods for prediction. Our focus is on
tree-based ensemble methods and we aim to fill this gap by comparing the predictive
accuracy of separate univariate analyses with a simultaneous multivariate analysis using
exhaustive simulation.
To be more precise, we consider Random Forests and Extra Trees. The multivariate
extension of these methods aligns with established work by De’Ath (2002); Segal and
Xiao (2011). We determine the impurity of a node τ using the multivariate L2 loss
∑
L2(k) = (Yj −Y )⊤t (Yj −Yt),
j:Yj∈τ
whereYt is the arithmetic mean of the outcome vector at node τ . In addition, we also
consider the multivariate L1 loss, which is defined by
∑ ∑d ∣ ∣
L1(k) = ∣∣Y i i∣j − Ỹt ∣ ,
j:Yj∈τ i=1
where Ỹt = (Ỹ 1t , . . . Ỹ
d)⊤t is the vector of marginal medians of the outcome at node τ .
21
3 Summary of the Main Articles
For comparison, we use the extension of Extra Trees for multi-task learning (Simm et al.,
2014). To ensure a fair evaluation with respect to computational time, we implemented
our tree construction algorithm, except for the extra trees for multi-task, for which we
used the R-package extraTrees (Simm et al., 2014).
Our simulation study employs runtime and mean squared error as evaluation metrics
to assess performance. We focus on a three-dimensional outcome with ten real-valued
features, each specified with different distributions, dependencies and underlying mod-
els. Specifically, we explore three distinct dependency structures among the features:
complete independence, weak dependence and strong dependence. Following Loh
(2002), we define six relationships between outcomes and some of the features. For
each relationship type, we consider models where all outcomes are generated by the
same data generating process, resulting in high correlation. Additionally, we examine
models where the data generating process remains constant, but the features used in the
process to compute outcomes differ. Moreover, we introduce models where the data
generating process varies for each component of the outcome. We also cover a spectrum
of dependency structures in the outcome, ranging from complete independence to high
dependence. To ensure comprehensive analysis, we consider three different sample sizes
and conduct 1,000 simulation runs.
The results indicate that when comparing the predictive accuracy of multivariate ap-
proaches with their univariate counterparts, the latter have performance advantages in
scenarios where the data generating processes for the outcome components are different.
However, in all other simulation configurations, the multivariate approaches have at
least similar or even better performance than their univariate counterparts. In particular,
when comparing Random Forests, the multivariate approaches are strongly superior to
the univariate approaches in certain scenarios. Furthermore, all methods are sensitive
to different dependency structures within the features, with similar effects observed
for univariate and multivariate approaches. As for the two algorithmic approaches,
either the Random Forest or the Extra Tree approaches have outperformed the others.
With regard to runtime, multivariate approaches demonstrate a clear runtime advantage
over univariate approaches with comparable implementation. There are only minor
differences in runtime and performance results concerning the multivariate loss function
used.
We also compare the methods on five datasets from the UCI repository (Dua and
Graff, 2017). These data analyses show that the multivariate approaches can improve
performance when considering multivariate outcomes, especially for the Extra Trees.
For the Random Forests, the multivariate counterpart only improved performance in a
few of the considered datasets.
22
3.1 Article 1: Tree-Based Ensembles for Multi-Output Regression
To sum up, we propose the use of multivariate extensions of tree-based ensemble methods.
In an extensive simulation study, we compare these with their univariate counterpart and
a multi-task extension of Extra Trees. Finally, all methods are analyzed on five real-life
multi-output regression datasets of varying complexity, considering performance and
runtime.
23
3 Summary of the Main Articles
3.2 Article 2: Comparing statistical and machine
learning methods for time series forecasting in
data-driven logistics - A simulation study.
Accurate forecasts derived from historical data play a central role in data-driven logistics.
They enable informed decisions, ensure timely delivery or minimize disruption in the
dynamic logistics domain. Therefore, forecasting methods are essential for strategic
planning in various logistics areas, including warehousing, transportation and supply
chain management (Syntetos et al., 2016). While time series models are widely used for
forecasting, the increasing prevalence of large and complex data sets in logistics has led
to a growing interest in machine learning methods for demand forecasting. To address
the ambiguity often faced when choosing the most appropriate forecasting method,
we conduct a comprehensive comparison of the forecasting performance of time series
models and machine learning methods. More precisely, different forecasting methods are
evaluated in terms of their out-of-the-box performance on a large number of simulated
time series that are important for logistics.
Our study focuses on (seasonal) ARIMA and TBATS as time series models. The
former model is one of the best known in time series forecasting and is often used as
a benchmark model (Zhang et al., 2001; Al-Saba and El-Amin, 1999). In addition,
TBATS models contain various approaches commonly used in forecasting, such as the
Box-Cox transformation or exponential smoothing (De Livera et al., 2011). Since the
use of machine learning approaches is already widely explored (Zhang and Qi, 2005;
Siami-Namini et al., 2018), we decided to focus on tree-based ensemble learners. In
particular, we consider XGBoost, which is known for its predictive power (Chen and
Guestrin, 2016) and Random Forests, one of the most popular tree-based ensembles that
is quite robust to hyperparameter tuning (Probst et al., 2019a). To provide a baseline,
a naive approach utilizing the last observation of the time series as prediction is also
considered.
We investigate twelve distinct data generating processes, each presenting different
challenges. These processes range from minimally nonlinear, such as AR, to strongly
nonlinear relationships between past and current values, such as seasonal AR. To increase
the complexity of the time series, we add a jump process and a random walk to each
data generating process. The jump process leads to sudden regime changes, while the
random walk introduces noise into the data. Our study encompasses four scenarios: (1)
the data generating process with no added complexity, (2) the process overlaid with the
jump process, (3) the process superposed with random noise and (4) the process subject
24
3.2 Article 2: Forecasting in Data-Driven Logistics - A Simulation Study
to both the jump process and the random noise.
In addition, we apply the methods to a real-life dataset from logistics and to datasets
generated from two queuing models, M/M/1 and M/M/2. These models are commonly
used to analyze the dynamics of queues with one (M/M/1) or two (M/M/2) servers,
taking into account factors such as the expected waiting time, the expected number of
customers in the queue and the expected server utilization.
For each setting, we generate a time series of length n from the respective data generation
processes with n ∈ {100, 500, 1, 000}. In our simulation study, we use the mean
absolute percentage error and the mean squared error as evaluation measures for the
predictive power since these metrics are widely used in forecasting time series in logistics
(Kuhlmann and Pauly, 2023).
In our simulation study, we employ a sliding window (Dietterich, 2002) approach for
forecasting using machine learning algorithms. This approach involves moving a fixed-
sized window over the time series data, using the data within the window as input for
prediction at each step. Furthermore, we explore the impact of time series differentiation
on the prediction performance of machine learning algorithms. This is motivated by the
aim of making the time series more stationary through differentiation.
When analyzing the simulation results, we find that in all simulation settings the Random
Forests outperform the XGBoost approaches. In general, either Random Forests or
time series methods show superior performance compared to the other approaches. In
particular, Random Forests perform better in scenarios with queueing models and in
cases where a Poisson process overlaps with the data generating processes. In all other
simulation settings, however, Random Forests perform comparably or slightly worse
than time series methods. Regarding the effect of data differentiation on the performance
of the two machine learning methods, we observe similar patterns. Differentiation
improved performance, especially in queuing scenarios and in situations where additional
complexity is introduced into the data generation process. Introducing a jump process
into the data generation process leads to a significant drop in performance for all methods
and settings while overlaying the data with additional noise can lead to differences in
performance depending on the setting.
To sum up, we investigate the forecasting performance of several tree-based ensembles
and time series models in an extensive simulation study. Based on these simulations,
we were able to derive recommendations for practitioners. Furthermore, we analyzed a
real-life dataset on demand forecasting from logistics.
25
3 Summary of the Main Articles
3.3 Article 3: TREE: Tree Regularization for Efficient
Execution.
The optimization of machine learning model execution time on resource-constrained
edge devices has been widely studied (Murshed et al., 2021; David et al., 2021). A
central aspect in this context is the limited availability of energy and time for executing
inference due to resource constraints. A well-known strategy to speed up inference is
to reduce the model size, e.g., by limiting the maximum tree depth. While this leads to
a significant reduction in execution time as fewer computations are required, it often
comes at the expense of accuracy. Another effective approach is using CPU cache
behavior, especially in the area of tree-based ensembles (Chen et al., 2022; Tabanelli
et al., 2022). Chen et al. (2022) have shown that a cache-friendly reordering of the nodes
improves execution time. In their approach, the split probabilities of the nodes are used
to determine the new order in memory, with unequal splits being beneficial as they lead
to nodes that are accessed more frequently.
We propose an alternative method to reduce the overall model size effectively. Our ap-
proach involves optimizing the decision tree construction by introducing a regularization
term into the split criterion. More precisely, the regularization term Ξτ , defined by
− |NnMu(AL)−Nn(AR)|Ξτ = 1 ,
Nn(A)
for a given node τ and the corresponding cell A serves as a control parameter to establish
a trade-off between tree size and prediction accuracy. This modification aims to increase
the occurrence of uneven splits, favoring shorter paths to leaf nodes.
To control the amount of penalization, we introduce the weighting factor λ ∈ R+ into
the regularization term. This factor has to be chosen effectively to obtain a balanced
compromise between accuracy and asymmetric splits. Note that there is a limit to which
each split can be usefully regularized, as at some point, all samples would go to one child
node. Therefore, the impact of the regularization factor reaches a limit the larger the
factor becomes. We quantify the expected performance improvement by the expected
depth dexpo of a single tree o, which is given by
∑
dexpo = pτ · do,τ ,
τ∈To
where To is the set of all leaves τ of a tree o. The probability of a leaf τ is denoted by pτ
and do,τ defines the depth of leaf τ in the tree o.
26
3.3 Article 3: TREE: Tree Regularization for Efficient Execution.
Reducing the expected depth indicates increased performance as fewer nodes need to be
loaded during inference. As soon as the expected depth stabilizes, the influence of the
regularization factor is less evident and a lower performance gain is to be expected. To
estimate the optimal regularization factor, we iteratively increase λ until the difference
in expected depth falls below a certain threshold.
To evaluate the impact of regularization in the training of Random Forests, we perform
extensive experiments on eleven datasets from the UCI repository (Dua and Graff, 2017),
focusing on classification tasks. Furthermore, we investigated different implementations
in terms of cache awareness of Random Forests as proposed by Chen et al. (2022) using
Gini impurity as the split criterion. The experiments are evaluated with respect to the
balanced accuracy and relative mean execution time of the inference that compares the
regularized implementation with the comparable non-regularized counterpart over 50
repetitions. For Random Forest hyperparameters, we varied the maximal depth and the
size of the inner bootstrap sample mtry across a set of recommended default values.
The results show that certain datasets, especially those with binary classification or a
large sample size, can benefit significantly from a high degree of regularization. This
improvement is seen in an improved runtime without significant loss of accuracy. While
regularization for shallow decision trees generally cannot offer a large spectrum for the
trade-off and quickly degrades to extreme cases, a broader spectrum for the trade-off
is generally offered for deeper tree models. Moreover, our observations show that both
cache-optimized and non-cache-optimized implementations yield similar results, which
supports the design principle of orthogonal optimization.
In order to better understand the effects of tree regularization on binary classification
datasets, we conducted a simulation study. We focused on a binary classification
problem with ten features and modeled three different dependency structures ranging
from independence to strong dependence while systematically modifying the class
balances. Various relations between the outcome and the features are considered. For
each setting, we generated samples of size n ∈ {100, 200, 500}. The regularization
strength λ and the other tree hyperparameters are varied as in the data analyses.
The results show that class balance and sample size are the driving factors that influence
the effects of regularization. Analyzing the effects of sample size revealed trends consis-
tent with previous results from analyzing real-world datasets. A significant improvement
in execution time was observed with increasing sample size, accompanied by a slight
decrease in accuracy. Furthermore, our results showed that highly imbalanced classes in
the results allowed for a significant improvement in speed with only a slight degradation
in accuracy.
27
3 Summary of the Main Articles
To sum up, we introduce a regularization term for the provoking uneven splits in decision
tree training, including an implementation in scikit-learn (Pedregosa et al., 2011). We
examine its influence on several real-life datasets from UCI and on simulated data. We
show that, especially for binary classification data sets and data sets with many samples,
this form of regularization can lead to a reduction of up to ≈ 4× in the execution time
with only a slight accuracy degradation.
28
4 Further Research
During my doctoral studies, I was involved in several other research projects. These
cover a wide range of topics, such as human activity recognition, interpretable transfer
learning, multi-label stacking, image-based analysis, predictive modeling and theoretical
foundations of machine learning algorithms. The following sections summarize the main
results of these works.
4.1 Human Activity Recognition in Logistics
All the results presented in this section were developed in collaboration with colleagues
from the Department of Statistics and the Chair of Materials Handling and Warehousing
at TU Dortmund University dealing with different aspects of human activity recognition
(HAR), see our joint research papers Kirchhof et al. (2021b,a); Nair et al. (2023) for
details.
The aim of HAR is to automatically identify and classify human activities based on
sensor or multi-channel time series data. It has applications in various fields, including
healthcare (Wang et al., 2019; Osmani et al., 2008), sports (Tian et al., 2013; Hsu et al.,
2018), smart homes (Yang et al., 2011; Van Kasteren et al., 2010) and logistics (Reining
et al., 2019; Rueda et al., 2018).
4.1.1 Interpretable Multi-Label Stacking
Recent developments in HAR focus on the introduction of multi-label representations of
human motion. These developments show that exploiting the structure between these
classes is essential to achieve better performance than with isolated classifiers (Pakrashi
et al., 2016).
For example, consider a warehouse worker’s human activity recognition task shown
in Figure 4.1 (Kirchhof et al., 2021b). Three embedded sensors provide probabilistic
29
4 Further Research
Figure 4.1: The three embedded machine learning sensors tell us: P (overhead work) =
0.5, P (standing) = 0.95, P (hands high) = 0.7. Given the rules: if standing
and hands high then often overhead work and if normal work then almost al-
ways hands centered or low, what is the new probability P (overhead work)?
information about the worker’s posture and hand position, as well as an initial guess
about whether the worker is performing overhead tasks. Intuitively, we would combine
these beliefs to understand the worker’s situation. In addition, we can use rules to decide
whether certain assumptions contradict or support each other, as shown in Figure 4.1.
The challenge is mathematically applying these non-deterministic rules to uncertain
input beliefs to compute updated beliefs.
Most models that address such problems are often stacked on top of ground learners
and refine the probabilistic estimates of the ground learners by modeling the structure
between labels in various ways: attribute class methods, knowledge graphs, Bayesian
networks and probabilistic rule-based approaches.
Attribute class methods introduce a layer of semantically interpretable attributes between
input data and output labels (Lampert et al., 2013; Atzmon and Chechik, 2018; Liu et al.,
2020), facilitating zero-shot learning (Xian et al., 2019). In contrast, other approaches
do not partition labels into attributes and classes but use knowledge graphs (Wu et al.,
2018; Lee et al., 2018), which allow for more comprehensive relationships between
labels and more straightforward interpretation but often require ground truth graphs.
Bayesian networks can extend these relationships to the probabilistic framework (Chen
et al., 2020; Shen et al., 2018), while probabilistic rule learners account for uncertainty
in labeling (Ding et al., 2015; Rapp et al., 2021).
In Kirchhof et al. (2021b), we introduce the probabilistic rule stacking learner (pRSL),
which combines elements from these methods by using Bayesian networks with a bipar-
tite connection of labels through rules, allowing complex and probabilistic relationships
between labels while maintaining acyclicity. These rules can either be based on prior
30
4.1 Human Activity Recognition in Logistics
knowledge or learned.
In a benchmark study, we compare pRSL to three methods: a neural network with
two hidden layers, MLWSE (Xia et al., 2021) and BOOMER (Rapp et al., 2021),
which has similarities to pRSL. The benchmark study includes six datasets from Mulan
(Tsoumakas et al., 2011). Random Forests (Malley et al., 2012) serve as binary relevance
(BR) learners and for evaluation, we examine joint accuracy, joint log-likelihood and
label-wise Hamming loss.
The results show that pRSL can compete with other interpretable and black-box al-
gorithms in modeling inter-label dependencies in various metrics. The multi-label
algorithms studied generally outperform the BR baseline in all datasets and metrics.
Each of the four multi-label algorithms show strengths on different datasets and metrics.
However, it should be noted that there are datasets where pRSL does not lead to im-
provements and may even worsen the results of the underlying classifiers. The reasons
for these differences are not yet fully understood, although calibration is considered a
possible factor. Further research is required to determine the causes of these differences
in performance.
4.1.2 Transfer Learning for Human Activity Recognition in
Warehousing
Since manual processes in warehouses account for more than half of total operating costs
(De Koster et al., 2007; Grosse et al., 2015), human activities need to be quantifiable to
enable their evaluation and improvement in terms of economy and ergonomics (Calzavara
et al., 2017). Note that different labels are of interest in each warehouse scenario. Ideally,
a classifier would be trained with scenario-specific data. However, collecting and labeling
such data is a labor-intensive endeavor.
Motivated by the large number of emerging HAR datasets that have activities and labels
that resemble warehouse activities (Niemann et al., 2020), we investigate the extent to
which existing datasets can be used for HAR in different warehouse scenarios using
transfer learning (Kirchhof et al., 2021a). The goal here is to connect the labels of an
existing dataset with the possibly different labels of a related target dataset. Transfer
learning can be performed using both interpretable and non-interpretable means. In
the former case, other HAR classes are associated with rules or structures that are
semantically interpretable.
Non-interpretable transfer learning usually involves fine-tuning neural networks (Zhang
et al., 2023). Here, a temporal convolution neural network (Rueda and Fink, 2021) is
31
4 Further Research
initially trained on a HAR dataset to recognize certain classes. During this process, the
network must learn how to summarize the input data in a more compact form, its internal
latent representation space, which is usually a black box. This internal representation
space is then used as a starting point for learning the classes of a new data set, typically
requiring only a few labeled examples of the new dataset, a scenario known as few-shot
learning (Cheng et al., 2013).
We focus on localizing the border between zero-shot interpretable and few-shot non-
interpretable transfer learning in HAR. Therefore, we apply pRSL and fine-tuning to
both a transfer from ergonomics to logistics and a transfer from sports to logistics.
The results of the experiments show that interpretable feature-based transfer learning
enables competitive zero-shot recognition but relies on semantically related features
between the reference and target datasets. On the other hand, non-interpretable fine-
tuning promotes transfer learning even in the face of significant domain shifts, e.g.,
switching from sports to logistics. Fine-tuned learners outperform fast-learned nulls
with only a few annotated examples from the target dataset. These results suggest that
transfer learning can significantly reduce the amount of labeled target data required, an
essential cost when using deep learning techniques in novel warehousing scenarios.
In analyzing the semantic meaning of the self-learned rules by pRSL, the example of
Figure 4.2 (Kirchhof et al., 2021a) shows that label-based transfer learning can produce
interpretable rules in the real world. Still, we emphasize that the interpretation strategy
needs to be fully developed.
(Stand) ∧ (Bulky Unit / Computer / No Item)→ Right Hand ∨ Left Hand ∨
(Step / Standing Still)
Figure 4.2: A self-learned rule from the domain-related transfer learning task together
with examples where the rule has a high impact on the classification decision
(as measured by the L1 distance to prediction without the rule).
32
4.1 Human Activity Recognition in Logistics
4.1.3 Dataset Bias
Ensuring the accuracy, fairness and generalizability of HAR models is essential for
their practical use and compliance with ethical and legal principles. It requires the
development of a robust classifier with respect to several criteria, such as personal
features. The efficient creation of high-quality datasets is a critical aspect of training
such a classifier (Cruz-Sandoval et al., 2019; Reining et al., 2020; Avsar et al., 2021).
The issue of classifier robustness is complex and often involves biases introduced by the
dataset. Recent research has explored dynamic inductive biases to address challenges
such as varying sensor placements (Chang et al., 2020), domain shifts (Khan et al., 2018),
inconsistent labels (Reining et al., 2020) and class imbalances (Niemann et al., 2020).
The latest attention to bias in classification, especially within deep network models,
has highlighted the influence of human characteristics. However, the significance of
subject-specific characteristics in multi-channel time series HAR datasets still needs to
be explored. Each individual exhibits unique motion patterns when performing activities,
which can be recognized through short-term signal patterns (Retsinas et al., 2020).
Consequently, when training and testing data involve subjects with varying physical
characteristics, a drop in performance is to be expected.
In our study (Nair et al., 2023), we investigate the extent to which biases in human
physical features in the training and testing datasets affect classifier performance. The
experiments are performed on different HAR datasets from different domains, consid-
ering the heterogeneity introduced by different features in the training dataset. As a
classifier, we use the established method proposed by Rueda et al. (2018) and evaluate
the performance based on the accuracy and weighted F1 score.
We have found that training data containing a wide range of physical features leads to
better accuracy for unseen test data containing objects with different physical features.
In addition, systematically increasing the size of the training sample to include subjects
with different features further improves accuracy. We therefore recommend that dataset
creators ensure that they have subjects with varying characteristics in their datasets.
Thus, the number of subjects and the diversity of their physical features can be increased
to improve the classifier’s performance if the effort to collect more data is reasonable.
33
4 Further Research
4.2 resKIL: Resource Efficient AI for Embedded
Systems in Agricultural Machinery
The three-year project ”resource-efficient artificial intelligence for embedded systems
in agricultural technology” (resKIL), funded by the German Federal Ministry of Food
and Agriculture (BMEL), is being carried out in collaboration with industrial partners
CLAAS and Zauberzeug as well as academic institutions such as the University of Os-
nabrück and the German Research Center for Artificial Intelligence (Bundesministerium
für Ernährung und Landwirtschaft, 2022).
The goal of the project is to develop a scalable and adaptable solution to significantly
expand the use of machine learning in agriculture in the near future. The project
focuses on specific areas, such as quality assessment of harvested crops and feature
recognition in the machinery environment, with an agile development strategy to ensure
continuous quality improvement. The project planning revolves around three harvest
periods, determining the workflow and iterations of individual work packages. The
transfer of developments across multiple harvests is a crucial aspect of demonstrating
the generalizability, robustness and improved quality of the approaches.
In this project, my main tasks can be divided into three components. First, I am
involved in developing a statistical framework for data collection and processing, which
includes the creation of detailed experimental designs for the subsequent training of
intelligent methods, i.e., to cover the artificial intelligence (AI) training solutions space
comprehensively. Based on CLAAS’ experience, detailed test plans are designed by
means of surrogate models, statistically evaluated and expanded to achieve optimal
coverage of the solution space. In addition to field or harvest trials, trials are conducted
on test parcels to minimize uncertainties in harvesting, such as data collection problems
or unsuitable trial conditions. Moreover, sequential experimental designs are used
because they can be adapted quickly. A small extraction of the experimental plans for
the harvest in 2022 is presented in Figure 4.3.
Figure 4.3: Extraction of the experimental design for the harvest 2022.
34
4.2 resKIL
Another aspect of my involvement is providing statistical support to the collaboration
partners. This includes advising on the selection and evaluation of models in the devel-
opment of a scalable AI platform product line for embedded systems or the evaluation
of the various annotation tools developed.
The final aspect is the evaluation of crop quality, with a focus on corn harvesting. The
primary metric used for this purpose is the Corn Silage Processing Score (CSPS), which
evaluates the efficiency of corn grain processing during harvest and ensiling (Ferreira
and Mertens, 2005). In general, it indicates how much of the corn kernels were processed
appropriately. To calculate the CSPS, a harvest sample is sent to a laboratory where it
is dried and sieved through a 4.74 mm sieve. Whole, unprocessed kernels are usually
left on the sieve. Starch content is determined in the whole sample and in the fraction
smaller than 4.75 mm. CSPS is then given by the percentage of starch content in the fine
fraction relative to the total starch content. Percentages above 70 % are considered very
good, while percentages between 70 % and 50 % are considered good and those below
50 % are considered unsatisfactory (Drewry et al., 2019).
In order to optimize this labor-intensive and costly process and to overcome the dis-
advantage that the results are only available after harvesting when adjustments to the
machine settings are no longer possible, an algorithm was developed with the following
objectives: The algorithm is intended to closely approximate the CSPS determined in
the laboratory while ensuring that the algorithm operates with high resource efficiency
so that it can be applied in real-time during the harvesting process. The algorithm
uses images captured during the harvesting process as input and then applies an image
segmentation procedure, see Figure 4.4 from CLAAS. This allows the calculation of an
optical CSPS based on the segmented data.
Figure 4.4: Harvest set up (left), example image captured during harvest (center) and its
segmented version (right).
In our study, we perform a comprehensive analysis of the optical CSPS and make
a direct comparison with the laboratory CSPS. We also compared the algorithm for
calculating the optical CSPS with alternative statistical methods. These methods use
35
4 Further Research
all the information obtained from image segmentation as input. In this analysis, we
evaluated the trade-off between accurate CSPS determination and associated resource
allocation to gain a better understanding of the practicality and effectiveness of these
resource-efficient approaches. We plan to publish the results of this study together with
our partners from CLAAS soon.
4.3 Predicting Effects of Math Training with
Multivariate Random Forests
Predicting the success of interventions is essential in education because it is critical
to realizing the full potential of individual learners, reducing educational disparities
and allocating societal resources efficiently. These predictive efforts span multiple
educational approaches, including adaptive interventions, in which learning experiences
are adjusted based on real-time data (Collins et al., 2004; Almirall and Chronis-Tuscano,
2016); response to intervention, a system for identifying and supporting students at
risk of learning disabilities (Grosche and Huber, 2012); and personalized education,
which includes tailored strategies and technologies (Tetzlaff et al., 2021). Educational
outcomes are often multidimensional, reflecting performance in different domains or
showing trajectories across different points in time.
Standard methods for modeling learning trajectories include latent growth curve models
(LGCMs) and multilevel regression models. LGCMs are used to capture individual-level
changes over time. De Koning et al. (2014) used LGCMs to predict clickstream data
trajectories. Another approach, multilevel regression models, considers the hierarchical
structure of data and includes repeated measures within individuals or clustered data.
Interestingly, multilevel regression models can often be expressed as LGCMs (McNeish
and Matta, 2018). However, they can be limited in their applicability, especially for
small sample sizes, high dimensionality, unidentified data, or noisy datasets. Alterna-
tive approaches, including machine learning, can provide solutions to overcome these
limitations.
Based on this and the findings of our previous work (Schmid et al., 2023a), we started
together with our colleagues Susanne Frick, Philip Doebler and Jörg-Tobias Kuhn from
the interdisciplinary research group FAIR 1 to investigate the use of multivariate and
univariate Random Forests to predict learning trajectories based on digital training data
1https://fair.tu-dortmund.de/en/
36
4.3 Predicting Effects of Math Training
from the Meister CODY Talasia app from January 2018 to December 2019. The project
is in the final stages and will be submitted soon.
The math training app CODY Talasia is designed to support children in second and third
grades who have problems with math. Developed by researchers at the University of
Münster, designed and marketed by Meister CODY GmbH, the app follows a structured
training concept with a detailed training plan. The training plan consists of various
components:
1. CODY-M 2-4 Test (Kuhn et al., 2013): This test is a validated mathematical
assessment to determine a user’s training profile and is categorized as A, B, or C.
It is recommended to perform it at the beginning and to repeat it if necessary.
2. Training sessions (Kuhn and Holling, 2014): Users are required to attend training
sessions five times per week. Each session consists of two tasks, each lasting
around ten minutes. The level of difficulty is adaptive and aims for a success
rate of 80%. The tasks follow a specific training plan determined by the training
profiles. Figure 4.5 shows some example tasks from the training (Chromik, 2021).
3. Status test: This test, done after every five days of training (approximately once a
week), evaluates the user’s performance in three subdomains: addition, subtraction
and number lines. It uses a gold coin scoring system that combines speed and
accuracy and has been shown to perform better than separate assessments, as
shown in Schwenk et al. (2017).
Figure 4.5: Excerpts from the app training tasks.
In our study, we use two different models to identify features that influence the predic-
tions. The first model, referred to as the base model, includes key variables such as
pretest score, grade and their interaction. In contrast, the second model extends the base
model to include additional training-related features such as average response time and
difficulty level. The evaluation of both models for each status test included univariate
37
4 Further Research
and multivariate approaches. In the univariate approach, we developed separate models
for each subdomain. In the multivariate approach, however, we created a single model to
predict performance in all three subdomains. For our analysis, we used Random Forests
and regularized linear regression with lasso penalty as prediction methods.
The results indicate that feature selection has a significant impact on prediction per-
formance. Models with more features show better performance, therefore information
on the training data is a useful source for early predictions. Comparison between mul-
tivariate and univariate approaches shows minimal differences in performance, with
multivariate models favoring the baseline setting and the opposite in the training set-
ting. Furthermore, we find no significant differences in predictive performance between
Random Forests and regularized linear regression.
The decision to adopt systems like Meister Cody based on these results involves both
statistical and technical considerations. On the statistical side, it requires achieving
low prediction errors and recognizing intervention effect heterogeneity. Simultaneously,
from a technical standpoint, we need to assess the feasibility of implementing these
statistical insights.
4.4 Capsule Network
Capsule networks are a recent advance in deep learning that aims to overcome some
limitations of traditional convolutional neural networks. They were introduced by Sabour
et al. (2017) to improve computer vision tasks, particularly in handling hierarchical and
spatial relationships within images. Capsule networks use capsules as basic building
blocks instead of traditional neurons. These capsules are designed to capture and
represent different parts and attributes of an object, allowing for better generalization
and improved recognition of complex patterns (Hinton et al., 2011).
A neuron computes the weighted sum of its scalar inputs and applies a nonlinear activa-
tion function to produce a scalar outcome. In contrast to that, the multivariate outcome
from a capsule is determined as follows: Let p, d ∈ N be arbitrary and x p1, . . . ,xn ∈ R
be the p-dimensional input vectors of a given capsule. Denote by W ∈ Rd×pi the
corresponding weight matrices. Then, the d-dimensional outcome y ∈ Rd is given by
(∑ )n
y = Ψ uiWixi ,
i=1
where ui ∈ R is the so-called i-th coupling coefficient (calculated via the dynamic
38
4.4 Capsule Network
routing algorithm Sabour et al. (2017)) and Ψ : Rp → Rp is the activation function.
Hinton et al. (2011) also introduces a new nonlinear activation function, the squash func-
tion, which limits the length of a vector to a maximum value of one while maintaining
the original direction of the vector, i.e.,
∥x∥
squash : Rp → Rp, x → ∥ x,x∥+ 1
where ∥ · ∥ denotes the euclidean norm. The universal approximation theorem in neural
networks (Devroye et al., 2013) states that a single hidden layer in a feedforward
neural network with a sufficient number of neurons and a suitable activation function
can approximate any continuous function with arbitrary accuracy. We have extended
this theorem for multivariate functions in the context of capsule networks for certain
activation functions and network structures. This work is in the final stages of preparation
for submission, contributing to the ongoing exploration and advancement of neural
network capabilities in handling complex multivariate relationships.
39
4 Further Research
40
5 Discussion and Outlook
In this work, we investigated the application of tree-based ensembles and evaluated
their effectiveness in multi-output regression and their adaptability in time-dependent
outcomes. Driven by the increase of autonomous and decentralized processes, especially
on resource-constrained devices in logistics, we investigate the impact of regularization
techniques to improve the execution efficiency of these algorithms.
Our investigation of multi-output regression has shown the potential advantages of
simultaneously considering dependencies between outcomes. A comparative analysis
of multivariate approaches, in particular Random Forests and Extra Trees, with their
univariate counterparts revealed that the multivariate methods show comparable or better
performance in scenarios where the data generation process does not differ significantly
between all outcome components. Conversely, the univariate approaches showed a
performance advantage in scenarios with different data generation processes for each
component. In our implementation, the multivariate methods outperformed the univariate
methods in terms of runtime efficiency. However, it is important to acknowledge that our
implementation still has room for improvement compared to the effective implementation
of Simm et al. (2014) and Wright and Ziegler (2017). While our simulations focused
primarily on continuous features, analogous results are observed when applied to mixed
features in real-world datasets. In a joint project with colleagues from FAIR (Section
4.3), we extended our analysis to a real-life dataset derived from a digital learning app
to predict the effects of math training. Here, we focus on the comparative analysis
of multivariate tree-based ensembles with other regression methods, such as lasso
regression. Simultaneously, the variable importance measure and the effective tuning of
hyperparameters in the context of multi-output regression is being explored. Another
direction for future research could be exploring hybrid approaches for multi-output
regression. These hybrid models, in which components are modeled in part using both
univariate and multivariate approaches, promise more flexibility in the modeling process.
Further, we conducted an exhaustive comparison between traditional time series models
such as (seasonal) ARIMA and TBATS and tree-based ensembles, including Random
Forests and XGBoost, to forecast time series in the dynamic landscape of data-driven
41
5 Discussion and Outlook
logistics. Our simulations uncovered nuanced performance variations, showcasing the
superiority of machine learning methods in specific scenarios, particularly when ad-
dressing queuing situations and data overlaid with additional (nonlinear) complexity.
The out-of-the-box performance of Random Forests, which outperformed XGBoost in
all considered scenarios, underlines its robustness in terms of hyperparameter tuning.
Training on differentiated time series can significantly improve the machine learning
resilience. Our findings are relevant to the logistics domain as they represent a care-
fully considered, data-driven approach to improving forecasting accuracy, inventory
management and operational efficiency. Based on these findings, we have developed
recommendations for practical researchers, providing guidance for implementing robust
forecasting strategies in real-world logistics scenarios. Extending our research to multi-
step forecasting, considering multivariate time series and dealing with uncertainty in
forecasts are interesting potential topics.
Due to the increasing trend towards autonomous and decentralized processes, we intro-
duced a regularization approach in the construction of tree-based methods to improve the
efficient execution time of inference. Our proposed approach, leveraging regularization,
aimed to balance model size reduction and maintaining prediction accuracy, effectively
addressing challenges posed by limited energy and time resources. The experimental
results highlighted significant improvements in runtime, particularly for datasets con-
cerning binary classification problems and those with large sample sizes. An interesting
application perspective arises in projects like resKIL (Section 4.2) as well as in au-
tonomous robots in logistics, where it is important to use resource-efficient algorithms
to enable real-time applications. Future considerations involve extending regularization
to the entire structure of the tree-based ensemble instead of individual trees, offering a
promising approach for further investigation. For instance, stratifying the dataset into
subsets with strong dependencies for training different trees could potentially enhance
the effectiveness of the regularization approach.
42
Bibliography
Abbasi, R. A., Javaid, N., Ghuman, M. N. J., Khan, Z. A., Ur Rehman, S., and Amanullah
(2019). Short term load forecasting using xgboost. In Web, Artificial Intelligence and
Network Applications (WAINA 2019), volume 927, pages 1120–1131. Springer.
Al-Saba, T. and El-Amin, I. (1999). Artificial neural networks as applied to long-term
demand forecasting. Artificial Intelligence in Engineering, 13(2):189–197.
Almirall, D. and Chronis-Tuscano, A. (2016). Adaptive interventions in child and ado-
lescent mental health. Journal of Clinical Child & Adolescent Psychology, 45(4):383–
395.
Angelini, E., di Tollo, G., and Roli, A. (2008). A neural network approach for credit risk
evaluation. The Quarterly Review of Economics and Finance, 48(4):733–755.
Atzmon, Y. and Chechik, G. (2018). Probabilistic and-or attribute grouping for zero-shot
learning. arXiv preprint arXiv:1806.02664.
Avsar, H., Altermann, E., Reining, C., Rueda, F. M., Fink, G. A., and ten Hompel,
M. (2021). Benchmarking Annotation Procedures for Multi-channel Time Series
HAR Dataset. In 2021 IEEE International Conference on Pervasive Computing and
Communications Workshops and Other Affiliated Events, pages 453–458.
Bartz, E., Bartz-Beielstein, T., Zaefferer, M., and Mersmann, O. (2023). Hyperparameter
Tuning for Machine and Deep Learning with R: A Practical Guide. Springer Nature.
Bischl, B., Binder, M., Lang, M., Pielok, T., Richter, J., Coors, S., Thomas, J., Ullmann,
T., Becker, M., Boulesteix, A.-L., Deng, D., and Lindauer, M. (2023). Hyperparameter
optimization: Foundations, algorithms, best practices, and open challenges. Wiley
Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 13(2):e1484.
Box, G. E. and Cox, D. R. (1964). An analysis of transformations. Journal of the Royal
Statistical Society Series B (Methodology), 26(2):211–243.
43
Bibliography
Box, G. E., Jenkins, G. M., Reinsel, G. C., and Ljung, G. M. (2015). Time series
analysis: forecasting and control. John Wiley & Sons.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24:123–140.
Breiman, L. (1998). Arcing classifier (with discussion and a rejoinder by the author).
The Annals of Statistics, 26(3):801–849.
Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation,
11(7):1493–1517.
Breiman, L. (2001). Random forests. Machine Learning, 45:5–32.
Breiman, L. (2017). Classification and Regression Trees. Routledge.
Brockwell, P. J. and Davis, R. A. (2002). Introduction to time series and forecasting.
Springer.
Bühlmann, P. (1997). Sieve bootstrap for time series. Bernoulli, pages 123–148.
Bühlmann, P. and Hothorn, T. (2007). Boosting algorithms: Regularization, prediction
and model fitting. Statistical Science, 22(4):477–505.
Bundesministerium für Ernährung und Landwirtschaft (2022). resKIL –
damit maschinelles Lernen in der Landwirtschaft deutlich mehr zum Ein-
satz kommt. https://www.bmel.de/SharedDocs/Praxisbericht/DE/
kuenstliche-intelligenz/resKIL.html, last accessed on 2023-11-01.
Calzavara, M., Glock, C. H., Grosse, E. H., Persona, A., and Sgarbossa, F. (2017).
Analysis of economic and ergonomic performance measures of different rack layouts
in an order picking warehouse. Computers & Industrial Engineering, 111:527–536.
Chang, Y., Mathur, A., Isopoussu, A., Song, J., and Kawsar, F. (2020). A Systematic
Study of Unsupervised Domain Adaptation for Robust Human-Activity Recognition.
Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., 4(1):1–30.
Chen, J., Dai, X., Yuan, Q., Lu, C., and Huang, H. (2020). Towards interpretable
clinical diagnosis with bayesian network ensembles stacked on entity-aware cnns.
In Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, pages 3143–3153.
Chen, K.-H., Su, C., Hakert, C., Buschjäger, S., Lee, C.-L., Lee, J.-K., Morik, K., and
Chen, J.-J. (2022). Efficient realization of decision trees for real-time inference. ACM
Transactions on Embedded Computing Systems, 21(6):1–26.
44
Bibliography
Chen, T. and Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pages 785–794. Association for Computing Machinery.
Cheng, H.-T., Sun, F.-T., Griss, M., Davis, P., Li, J., and You, D. (2013). Nuactiv:
Recognizing Unseen New Activities Using Semantic Attribute–based Learning. In
Proceeding of the 11th Annual International Conference on Mobile Systems, Applica-
tions, and Services, page 361–374.
Chromik, S. L. (2021). Evaluation mehrdimensionaler längsschnittlicher felddaten aus
einer dyskalkulie - trainingsapp. Master’s thesis, TU Dortmund University.
Clémençon, S., Depecker, M., and Vayatis, N. (2013). Ranking forests. Journal of
Machine Learning Research, 14:39–73.
Collins, L. M., Murphy, S. A., and Bierman, K. L. (2004). A conceptual framework for
adaptive preventive interventions. Prevention Science, 5:185–196.
Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and Bordes, A. (2017). Supervised
learning of universal sentence representations from natural language inference data.
In Proceedings of the 2017 Conference on Empirical Methods in Natural Language
Processing. Association for Computational Linguistics.
Crowder, M. J. and Hand, D. J. (2017). Analysis of repeated measures. Routledge.
Cruz-Sandoval, D., Beltran-Marquez, J., Garcia-Constantino, M., Gonzalez-Jasso, L. A.,
Favela, J., Lopez-Nava, I. H., Cleland, I., Ennis, A., Hernandez-Cruz, N., Rafferty,
J., Synnott, J., and Nugent, C. (2019). Semi-Automated Data Labeling for Activity
Recognition in Pervasive Healthcare. Sensors, 19(14).
Cutler, A. and Zhao, G. (2001). Pert-perfect random tree ensembles. Computing Science
and Statistics, 33(4).
David, R., Duke, J., Jain, A., Janapa Reddi, V., Jeffries, N., Li, J., Kreeger, N., Nappier,
I., Natraj, M., Wang, T., Warden, P., and Rhodes, R. (2021). Tensorflow lite micro:
Embedded machine learning for tinyml systems. In Proceedings of Machine Learning
and Systems, volume 3, pages 800–811.
De Koning, B., Timmerman, M., van Geert, P., and van der Meulen, B. (2014). Identify-
ing learning trajectories while playing a learning-to-learn computer game in different
children and instruction types. Journal of Cognitive Education and Psychology,
13(1):53–75.
45
Bibliography
De Koster, R., Le-Duc, T., and Roodbergen, K. J. (2007). Design and control of
warehouse order picking: A Literature Review. European Journal of Operational
Research, 182(2):481–501.
De Livera, A. M., Hyndman, R. J., and Snyder, R. D. (2011). Forecasting time series
with complex seasonal patterns using exponential smoothing. Journal of the American
Statistical Association, 106(496):1513–1527.
De’Ath, G. (2002). Multivariate regression trees: a new technique for modeling species -
environment relationships. Ecology, 83(4):1105–1117.
Delfmann, W., Ten Hompel, M., Kersten, W., Schmidt, T., and Stölzle, W. (2018).
Logistics as a science: Central research questions in the era of the fourth industrial
revolution. Logistics Research, 11(9):1–13.
Devroye, L., Györfi, L., and Lugosi, G. (2013). A Probabilistic Theory of Pattern
Recognition, volume 31. Springer Science & Business Media.
Dı́az-Uriarte, R. and Alvarez de Andrés, S. (2006). Gene selection and classification of
microarray data using random forest. BMC Bioinformatics, 7(3).
Dietterich, T. G. (2000). Ensemble methods in machine learning. In Multiple Classifier
Systems, pages 1–15. Springer.
Dietterich, T. G. (2002). Machine learning for sequential data: A review. In Structural,
Syntactic, and Statistical Pattern Recognition: Joint IAPR International Workshops
SSPR 2002 and SPR 2002 Windsor, Ontario, Canada, August 6–9, 2002 Proceedings,
pages 15–30. Springer.
Ding, N., Deng, J., Murphy, K. P., and Neven, H. (2015). Probabilistic label relation
graphs with ising models. In Proceedings of the IEEE International Conference on
Computer Vision (ICCV).
Drewry, J. L., Luck, B. D., Willett, R. M., Rocha, E. M., and Harmon, J. D. (2019).
Predicting kernel processing score of harvested and processed corn silage via image
processing techniques. Computers and Electronics in Agriculture, 160:144–152.
Dua, D. and Graff, C. (2017). Uci machine learning repository.
Dudek, G. (2015). Short-term load forecasting using random forests. In Intelligent Sys-
tems’ 2014: Proceedings of the 7th IEEE International Conference Intelligent Systems
IS’2014, September 24-26, 2014, Warsaw, Poland, Volume 2: Tools, Architectures,
Systems, Applications, pages 821–828. Springer.
46
Bibliography
Everingham, Y., Sexton, J., Skocaj, D., and Inman-Bamber, G. (2016). Accurate predic-
tion of sugarcane yield using a random forest algorithm. Agronomy for sustainable
development, 36:1–9.
Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014). Do we need
hundreds of classifiers to solve real world classification problems? Journal of Machine
Learning Research, 15(1):3133–3181.
Ferreira, G. andMertens, D. (2005). Chemical and physical characteristics of corn silages
and their effects on in vitro disappearance. Journal of Dairy Science, 88(12):4414–
4425.
Fokam, C. T. (2022). Ar-sieve bootstrap for the random forest and a simulation-based
comparison with rangerts time series prediction. Master’s thesis, TU Dortmund
University.
Freund, Y., Schapire, R., and Abe, N. (1999). A short introduction to boosting. Journal
of Japanese Society For Artificial Intelligence, 14:771–780.
Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: a
statistical view of boosting (with discussion and a rejoinder by the authors). The
Annals of Statistics, 28(2):337–407.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine.
The Annals of Statistics, 29(5):1189–1232.
Gardner Jr., E. S. and McKenzie, E. (1985). Forecasting trends in time series. Manage-
ment Science, 31(10):1237–1246.
Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine
Learning, 63:3–42.
Goehry, B., Yan, H., Goude, Y., Massart, P., and Poggi, J.-M. (2017). Rangerts: a ranger
implementation for time series. https://github.com/hyanworkspace/
rangerts. Last accessed on 10.12.2023.
Goehry, B., Yan, H., Goude, Y., Massart, P., and Poggi, J.-M. (2021). Random forests
for time series. hal-03129751.
Grosche, M. and Huber, C. (2012). Das response-to-intervention-modell als grundlage
für einen inklusiven paradigmenwechsel in der sonderpädagogik. Zeitschrift für
Heilpädagogik, 8(12):312–322.
47
Bibliography
Grosse, E. H., Glock, C. H., and Neumann, W. P. (2015). Human Factors in Order
Picking System Design: A Content Analysis. IFAC-PapersOnLine, 48(3):320–325.
Gumus, M. and Kiran, M. S. (2017). Crude oil price forecasting using xgboost. In
2017 International conference on computer science and engineering (UBMK), pages
1100–1103. IEEE.
Gupta, Y., Lama, R. K., Kwon, G.-R., and Initiative, A. D. N. (2019). Prediction and
classification of alzheimer’s disease based on combined features from apolipoprotein-
e genotype, cerebrospinal fluid, mr, and fdg-pet imaging biomarkers. Frontiers in
Computational Neuroscience, 13:72.
Hamilton, J. D. (2020). Time Series Analysis. Princeton university Press.
Härdle, W., Horowitz, J., and Kreiss, J.-P. (2003). Bootstrap methods for time series.
International Statistical Review, 71(2):435–459.
Harvey, A. C. (1990). Forecasting, Structural Time Series Models and the Kalman Filter.
Cambridge University Press.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning,
volume 2. Springer.
Hinton, G. E., Krizhevsky, A., and Wang, S. D. (2011). Transforming auto-encoders.
In Artificial Neural Networks and Machine Learning – ICANN 2011, pages 44–51.
Springer.
Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844.
Hsu, Y.-L., Yang, S.-C., Chang, H.-C., and Lai, H.-C. (2018). Human daily and sport
activity recognition using a wearable inertial sensor network. IEEE Access, 6:31715–
31728.
Huang, H., Pouls, M., Meyer, A., and Pauly, M. (2020). Travel time prediction using
tree-based ensembles. In Computational Logistics: 11th International Conference,
ICCL 2020, Enschede, The Netherlands, September 28–30, 2020, Proceedings 11,
pages 412–427. Springer.
Hyndman, R. J. and Athanasopoulos, G. (2018). Forecasting: principles and practice.
OTexts.
48
Bibliography
Hyndman, R. J. and Khandakar, Y. (2008). Automatic time series forecasting: The
forecast package for R. Journal of Statistical Software, 27(3):1–22.
Ishwaran, H., Kogalur, U. B., Blackstone, E. H., and Lauer, M. S. (2008). Random
survival forests. The Annals of Applied Statistics, 2(3):841 – 860.
Kane, M. J., Price, N., Scotch, M., and Rabinowitz, P. (2014). Comparison of arima and
random forest time series models for prediction of avian influenza h5n1 outbreaks.
BMC Bioinformatics, 15(1):1–9.
Khaidem, L., Saha, S., and Dey, S. R. (2016). Predicting the direction of stock market
prices using random forest. arXiv preprint arXiv:1605.00003.
Khan, M. A. A. H., Roy, N., and Misra, A. (2018). Scaling Human Activity Recognition
via Deep Learning-based Domain Adaptation. In 2018 IEEE International Conference
on Pervasive Computing and Communications (PerCom), pages 1–9.
Kirchhof, M., Schmid, L., Reining, C., ten Hompel, M., and Pauly, M. (2021a). Chances
of interpretable transfer learning for human activity recognition in warehousing. In
Computational Logistics, pages 163–177. Springer.
Kirchhof, M., Schmid, L., Reining, C., ten Hompel, M., and Pauly, M. (2021b). prsl:
Interpretable multi-label stacking by learning probabilistic rules. In Proceedings of
the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, volume 161,
pages 461–470.
Konishi, S. and Kitagawa, G. (2008). Information criteria and statistical modeling.
Springer Science & Business Media.
Kourou, K., Exarchos, T. P., Exarchos, K. P., Karamouzis, M. V., and Fotiadis, D. I.
(2015). Machine learning applications in cancer prognosis and prediction. Computa-
tional and Structural Biotechnology Journal, 13:8–17.
Krüger, A., Feldmann, F., Pauly, M., and Ten Hompel, M. (2020). Einsatzmöglichkeiten
maschineller lernverfahren in einer dezentral organisierten lagerverwaltung auf basis
intelligenter behälter. Logistics Journal: Proceedings, 2020(12).
Kuhlmann, L. and Pauly, M. (2023). A dynamic systems model for an economic
evaluation of sales forecasting methods. Tehnički glasnik, 17(3):397–404.
Kuhn, J.-T. and Holling, H. (2014). Number sense or working memory? The effect of
two computer-based trainings on mathematical skills in elementary school. Advances
in cognitive psychology, 10(2):59.
49
Bibliography
Kuhn, J.-T., Raddatz, J., Holling, H., and Dobel, C. (2013). Dyskalkulie vs. rechen-
schwäche: Basisnumerische verarbeitung in der grundschule. Lernen und Lernstörun-
gen.
Kuhn, M. (2008). Building predictive models in r using the caret package. Journal of
Statistical Software, 28:1–26.
Kwiatkowski, D., Phillips, P. C., Schmidt, P., and Shin, Y. (1992). Testing the null
hypothesis of stationarity against the alternative of a unit root: How sure are we that
economic time series have a unit root? Journal of Econometrics, 54(1–3):159–178.
Lahouar, A. and Slama, J. B. H. (2017). Hour-ahead wind power forecast based on
random forests. Renewable energy, 109:529–541.
Lampert, C. H., Nickisch, H., and Harmeling, S. (2013). Attribute-based classification
for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 36(3):453–465.
Lee, C.-W., Fang, W., Yeh, C.-K., and Wang, Y.-C. F. (2018). Multi-label zero-shot
learning with structured knowledge graphs. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pages 1576–1585.
Littell, R. C., Henry, P., and Ammerman, C. B. (1998). Statistical analysis of repeated
measures data using sas procedures. Journal of animal science, 76(4):1216–1231.
Liu, L., Zhou, T., Long, G., Jiang, J., and Zhang, C. (2020). Attribute propagation
network for graph zero-shot learning. Proceedings of the AAAI Conference on Artificial
Intelligence, 34(04):4868–4875.
Liu, Y. and Wu, H. (2017). Prediction of road traffic congestion based on random forest.
In 2017 10th International Symposium on Computational Intelligence and Design
(ISCID), volume 2, pages 361–364. IEEE.
Loh, W.-Y. (2002). Regression tress with unbiased variable selection and interaction
detection. Statistica Sinica, 12(2):361–386.
Loh, W.-Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews:
Data Mining and Knowledge Discovery, 1(1):14–23.
Loh, W.-Y. and Shih, Y.-S. (1997). Split selection methods for classification trees.
Statistica Sinica, 7(4):815–840.
50
Bibliography
Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G., and Ziegler, A. (2012). Probability
machines. Methods of Information in Medicine, 51(01):74–81.
Mason, L., Baxter, J., Bartlett, P., and Frean, M. (1999). Boosting algorithms as gradient
descent. In Advances in Neural Information Processing Systems, volume 12. MIT
Press.
McNeish, D. and Matta, T. (2018). Differentiating between mixed-effects and latent-
curve approaches to growth modeling. Behavior Research Methods, 50:1398–1414.
Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning
Research, 7:983–999.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill New York.
Murshed, M. S., Murphy, C., Hou, D., Khan, N., Ananthanarayanan, G., and Hussain, F.
(2021). Machine learning at the network edge: A survey. ACM Computing Surveys,
54(8):1–37.
Nair, N. R., Schmid, L., Rueda, F. M., Pauly, M., Fink, G. A., and Reining, C. (2023).
Dataset bias in human activity recognition. arXiv preprint arXiv:2301.10161.
Niemann, F., Reining, C., Rueda, F. M., Nair, N. R., Steffens, J. A., Fink, G. A., and
Hompel, M. t. (2020). LARa: Creating a Dataset for Human Activity Recognition in
Logistics Using Semantic Attributes. Sensors, 20(15).
Osmani, V., Balasubramaniam, S., and Botvich, D. (2008). Human activity recognition in
pervasive health-care: Supporting efficient remote collaboration. Journal of Network
and Computer Applications, 31(4):628–655.
Pakrashi, A., Greene, D., and MacNamee, B. (2016). Benchmarking Multi-label Classi-
fication Algorithms, pages 149–160. CEUR Workshop Proceedings.
Park, B. and Bae, J. K. (2015). Using machine learning algorithms for housing price
prediction: The case of fairfax county, virginia housing data. Expert Systems with
Applications, 42(6):2928–2934.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel,
M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau,
D., Brucher, M., Perrot, M., and Duchesnay, É. (2011). Scikit-learn: Machine learning
in python. Journal of Machine Learning Research, 12:2825–2830.
51
Bibliography
Probst, P., Boulesteix, A.-L., and Bischl, B. (2019a). Tunability: Importance of hyper-
parameters of machine learning algorithms. Journal of Machine Learning Research,
20(1):1–32.
Probst, P., Wright, M. N., and Boulesteix, A.-L. (2019b). Hyperparameters and tuning
strategies for random forest. Wiley Interdisciplinary Reviews: Data Mining and
Knowledge Discovery, 9(3):e1301.
Ramosaj, B. (2020). Analyzing Consistency and Statistical Inference in Random Forest
Models. PhD thesis, Technical University of Dortmund.
Rapp, M., Mencı́a, E. L., Fürnkranz, J., Nguyen, V.-L., and Hüllermeier, E. (2021).
Learning gradient boosted multi-label classification rules. InMachine Learning and
Knowledge Discovery in Databases, pages 124–140. Springer.
Reining, C., Niemann, F., Moya Rueda, F., Fink, G. A., and ten Hompel, M. (2019).
Human activity recognition for production and logistics – a systematic literature
review. Information, 10(8):245.
Reining, C., Rueda, F. M., Niemann, F., Fink, G. A., and Hompel, M. t. (2020). Annota-
tion performance for multi-channel time series har dataset in logistics. In 2020 IEEE
International Conference on Pervasive Computing and Communications Workshops
(PerCom Workshops), pages 1–6.
Retsinas, G., Filntisis, P. P., Efthymiou, N., Theodosis, E., Zlatintsi, A., and Maragos, P.
(2020). Person Identification Using Deep Convolutional Neural Networks on Short-
Term Signals from Wearable Sensors. In ICASSP 2020 - 2020 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3657–3661.
Rueda, F. M. and Fink, G. A. (2021). From human pose to on-body devices for human-
activity recognition. In 2020 25th International Conference on Pattern Recognition
(ICPR), pages 10066–10073.
Rueda, F. M., Grzeszick, R., Fink, G. A., Feldhorst, S., and ten Hompel, M. (2018). Con-
volutional neural networks for human activity recognition using body-worn sensors.
Informatics, 5(2):26.
Sabour, S., Frosst, N., and Hinton, G. E. (2017). Dynamic routing between capsules.
Advances in Neural Information Processing Systems, 30.
Sagi, O. and Rokach, L. (2018). Ensemble learning: A survey. Wiley Interdisciplinary
Reviews: Data Mining and Knowledge Discovery, 8(4):e1249.
52
Bibliography
Salari, N., Liu, S., and Shen, Z.-J. M. (2022). Real-time delivery time forecasting and
promising in online retailing: When will your package arrive? Manufacturing &
Service Operations Management, 24(3):1421–1436.
Schmid, L., Biebert, D., Hakert, C., Chen, K.-H., Lang, M., Pauly, M., and Chen, J.-J.
(2024). Tree: Tree regularization for efficient execution. arXiv preprint.
Schmid, L., Gerharz, A., Groll, A., and Pauly, M. (2023a). Tree-based ensembles for
multi-output regression: Comparing multivariate approaches with separate univariate
ones. Computational Statistics & Data Analysis, 179:107628.
Schmid, L., Roidl, M., and Pauly, M. (2023b). Comparing statistical and machine
learning methods for time series forecasting in data-driven logistics - a simulation
study. arXiv preprint.
Schwenk, C., Kuhn, J.-T., Doebler, P., and Holling, H. (2017). Auf goldmünzenjagd:
Psychometrische kennwerte verschiedener scoringansätze bei computergestützter lern-
verlaufsdiagnostik im bereich mathematik. Empirische Sonderpädagogik, 9(2):123–
142.
Scornet, E. (2017). Tuning parameters in random forests. ESAIM: Proceedings and
Surveys, 60:144–162.
Scornet, E., Biau, G., and Vert, J.-P. (2015). Consistency of random forests. The Annals
of Statistics, 43:1716–1741.
Segal, M. and Xiao, Y. (2011). Multivariate random forests. Wiley Interdisciplinary
Reviews: Data mining and knowledge discovery, 1(1):80–87.
Shamout, M., Ben-Abdallah, R., Alshurideh, M., Alzoubi, H., Kurdi, B. a., and Hamad-
neh, S. (2022). A conceptual model for the adoption of autonomous robots in supply
chain and logistics industry. Uncertain Supply Chain Management, 10(2):577–592.
Shen, R., Zou, F., Song, J., Yan, K., and Zhou, K. (2018). Efui: An ensemble framework
using uncertain inference for pornographic image recognition. Neurocomputing,
322:166–176.
Shumway, R. H. and Stoffer, D. S. (2017). Arima models. Time Series Analysis and Its
Applications: With R Examples, pages 75–163.
Siami-Namini, S., Tavakoli, N., and Namin, A. S. (2018). A comparison of arima
and lstm in forecasting time series. In 2018 17th IEEE International Conference on
Machine learning and Applications, pages 1394–1401. IEEE.
53
Bibliography
Simm, J., Magrans de Abril, I., and Sugiyama, M. (2014). Tree-based ensemble multi-
task learning method for classification and regression. IEICE TRANSACTIONS on
Information and Systems, 97(6):1677–1681.
Syntetos, A. A., Babai, Z., Boylan, J. E., Kolassa, S., and Nikolopoulos, K. (2016).
Supply chain forecasting: Theory, practice, their gap and the future. European Journal
of Operational Research, 252(1):1–26.
Tabanelli, E., Tagliavini, G., and Benini, L. (2022). Optimizing random forest-based
inference on risc-v mcus at the extreme edge. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 41(11):4516–4526.
Tetzlaff, L., Schmiedek, F., and Brod, G. (2021). Developing personalized education: A
dynamic framework. Educational Psychology Review, 33:863–882.
Therneau, T. M. and Atkinson, E. J. (1997). An introduction to recursive partitioning
using the rpart routines. Technical report, Mayo Foundation.
Tian, Y., Sukthankar, R., and Shah, M. (2013). Spatiotemporal deformable part models
for action detection. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR).
Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., and Vlahavas, I. (2011). Mulan:
A java library for multi-label learning. Journal of Machine Learning Research,
12:2411–2414.
Vairagade, N., Logofatu, D., Leon, F., and Muharemi, F. (2019). Demand forecasting
using random forest and artificial neural network for supply chain management. In
Computational Collective Intelligence: 11th International Conference, ICCCI 2019,
Hendaye, France, September 4–6, 2019, Proceedings, Part I 11, pages 328–339.
Springer.
Van Kasteren, T., Englebienne, G., and Kröse, B. J. (2010). An activity monitoring
system for elderly care using generative and discriminative models. Personal and
Ubiquitous Computing, 14:489–498.
Venkatapathy, A. K. R., Bayhan, H., Zeidler, F., and ten Hompel, M. (2017). Human
machine synergies in intra-logistics: Creating a hybrid network for research and
technologies. In 2017 Federated Conference on Computer Science and Information
Systems (FedCSIS), pages 1065–1068. IEEE.
54
Bibliography
Wang, Y., Cang, S., and Yu, H. (2019). A survey on wearable sensor modality centred
human activity recognition in health care. Expert Systems with Applications, 137:167–
190.
Wang, Z., Wang, Y., Zeng, R., Srinivasan, R. S., and Ahrentzen, S. (2018). Random
forest based hourly building energy prediction. Energy and Buildings, 171:11–25.
West, M. and Harrison, J. (2006). Bayesian Forecasting and Dynamic Models. Springer
Science & Business Media.
Wright, M. N. and Ziegler, A. (2017). ranger: A fast implementation of random forests
for high dimensional data in C++ and R. Journal of Statistical Software, 77(1):1–17.
Wu, B., Jia, F., Liu, W., Ghanem, B., and Lyu, S. (2018). Multi-label learning with
missing labels using mixed dependency graphs. International Journal of Computer
Vision, 126(8):875–896.
Xia, Y., Chen, K., and Yang, Y. (2021). Multi-label classification with weighted classifier
selection and stacked ensemble. Information Sciences, 557:421–442.
Xian, Y., Lampert, C. H., Schiele, B., and Akata, Z. (2019). Zero-shot learning—a
comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 41(9):2251–2265.
Yang, J., Lee, J., and Choi, J. (2011). Activity recognition based on rfid object usage for
smart mobile devices. Journal of Computer Science and Technology, 26(2):239–246.
Yoon, J. (2021). Forecasting of real gdp growth using machine learning models: Gradient
boosting and random forest approach. Computational Economics, 57(1):247–265.
Zhang, A., Lipton, Z. C., Li, M., and Smola, A. J. (2023). Dive into Deep Learning.
Cambridge University Press.
Zhang, C. and Ma, Y. (2012). Ensemble machine learning: methods and applications.
Springer.
Zhang, G. P., Patuwo, B. E., and Hu, M. Y. (2001). A simulation study of artificial neural
networks for nonlinear time-series forecasting. Computers & Operations Research,
28(4):381–396.
Zhang, G. P. and Qi, M. (2005). Neural network forecasting for seasonal and trend time
series. European Journal of Operational Research, 160(2):501–514.
55
Bibliography
Zhang, L., Bian, W., Qu, W., Tuo, L., and Wang, Y. (2021). Time series forecast of sales
volume based on xgboost. In Journal of Physics: Conference Series, volume 1873,
page 012067. IOP Publishing.
Zhang, L., Wang, S., and Liu, B. (2018). Deep learning for sentiment analysis: A survey.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1253.
Zhou, Z.-H. (2012). Ensemble methods: foundations and algorithms. CRC press.
56
Part II
Publications
57

Article 1
Schmid, L., Gerharz, A., Groll, A. & Pauly, M. (2023). Tree-based ensembles for
multi-output regression: Comparing multivariate approaches with separate univariate
ones. Computational Statistics & Data Analysis, 179, 107628,
https://doi.org/10.1016/j.csda.2022.107628.
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95

Article 2
Schmid, L., Roidl, M. & Pauly, M. (2023). Comparing statistical and machine learning
methods for time series forecasting in data-driven logistics - A simulation study. arXiv
preprint, https://doi.org/10.48550/arXiv.2303.07139
Comparing statistical and machine learning methods for time series
forecasting in data-driven logistics – A simulation study
Lena Schmida,∗, Moritz Roidlb, Markus Paulya,c
aDepartment of Statistics, TU Dortmund University, 44227 Dortmund, Germany
bChair of Material Handling and Warehousing, TU Dortmund University, 44221 Dortmund, Germany
cUA Ruhr, Research Center Trustworthy Data Science and Security, 44227 Dortmund, Germany
Abstract
Many planning and decision activities in logistics and supply chain management are based on forecasts of
multiple time dependent factors. Therefore, the quality of planning depends on the quality of the forecasts.
We compare different state-of-the art forecasting methods in terms of out-of-the-box forecasting performance.
Different to most existing research in logistics, we do not do this case-dependent but consider a broad set
of simulated time series to give more general recommendations. We therefore simulate various linear and
nonlinear time series that reflect different situations.
Keywords: Machine Learning, Time Series, Forecasting, Simulation Study
1. Introduction
Forecasting methods are essential for efficient planning in various logistics domains such as warehousing,
transport, and supply chain management. They enable companies to anticipate and plan for future demand,
capacity needs, and supply chain requirements. Thereby, different logistics applications require different
forecasts due to their unique characteristics. In the transport domain, e.g., accurate transportation fore-
casting enables logistics companies to optimize their transportation networks, reduce transportation costs
and enhance delivery reliability (Huang et al., 2020; Wu et al., 2004; Lin et al., 2005; Garrido and Mah-
massani, 2000; Wu and Levinson, 2021). Precise forecasting allows warehouse managers to optimize space
use, reduce stock-out risk, and improve overall efficiency (Shi et al., 2018; Ribeiro et al., 2022). In supply
chain management, accurate forecasts are, e.g., used to optimize resource use across the entire supply chain
(Feizabadi, 2022; Kuhlmann and Pauly, 2023; Syntetos et al., 2016). The above references show that the use
of forecasting techniques such as time series models and machine learning methods has become increasingly
popular in logistics in recent years. However, there is still a lack of consensus on which method is more
effective, especially as most methods of comparison in logistics solely rely on comparing the performance on
a few data sets (Ensafi et al., 2022; Ribeiro et al., 2022). In fact, different to other fields (e.g. Wu et al.,
2018; Weber et al., 2019) there do not exist rigorous benchmark studies in data-driven logistics to the best
of our knowledge. In our opinion, the key reason for this is that, outside of specific examples (e.g. Niemann
∗Corresponding Author
Email address: lena.schmid@tu-dortmund.de (Lena Schmid)
Preprint submitted to Journal of LATEXTemplates June 7, 2024
arXiv:2303.07139v2  [stat.ML]  6 Jun 2024
et al., 2020; Arora et al., 2022), there is a lack of freely accessible and well-characterized data sets for bench-
marking (e.g. Reining et al., 2019; Awasthi et al., 2023) in the logistics research domain. This hampers the
analysis of domain-specific pros and cons of method choices or the formulation of general recommendations.
To overcome this and to be in line with recent recommendations (Friedrich and Friede, 2023), we therefore
focus on simulating data from various statistical time series models that reflect potential logistic scenarios.
Time series models have been used in forecasting for several decades and are widely used in logistics for
sales or demand forecasting, see e.g. Kuhlmann and Pauly (2023); Shukla and Jharkharia (2011) and the
references cited therein. These models are based on historical data and use statistical techniques to identify
patterns and trends in the data, which can then be used to make predictions about future demand. Some
commonly used time series models in logistics include (seasonal) autoregressive integrated moving averages
(ARIMA) and exponential smoothing models. For example, Gilbert (2005) developed an ARIMA multistage
supply chain model that is based on time series models. Another example is Prophet (Taylor and Letham,
2018), a forecasting tool for time series analysis developed by Facebook, which includes additive modeling
with components such as seasonality, holidays and trend flexibility. Kumar Jha and Pande (2021) exam-
ined ARIMA and Prophet models for predicting supermarket sales. The Prophet models showed superior
predictive performance in terms of lower errors. Hasmin and Aini (2020) investigated the performance of
double exponential smoothing for inventory forecasting.
More recently, machine learning (ML) methods have become increasingly popular for demand forecasting
in logistics due to their ability to handle large and complex data sets. There are many literature reviews
(Carbonneau et al., 2008; Wenzel et al., 2019; Sharma et al., 2020; Ni et al., 2020; Baryannis et al., 2019),
that discuss the use of machine learning techniques in forecasting for supply chain management, including
an overview of the various techniques used and their advantages and limitations. However, our comment
regarding a lack of neutral benchmarking studies still applies.
Several studies have shown that ML methods such as neural networks, support vector regression, and
Random Forests can outperform traditional time series models for specific demand forecasting problems. For
example, a study by Ensafi et al. (2022) compared the prediction power of more than ten different forecasting
models, including classical methods such as ARIMA and ML techniques such as long short-term memory
(LSTM) and convolution neural networks, using a single data set containing the sales history of furniture
in a retail store. The results showed that the LSTM outperformed the other models in terms of prediction
performance. Another study by Kohzadi et al. (1996) also compared the forecasting power of ARIMA and
neural networks using a single commodity prices data set. Again the neural network performed better than
the ARIMA model. Similar results were obtained in Weng et al. (2019) or Siami-Namini et al. (2018). How-
ever, other studies have found mixed results, with some suggesting that time series models perform better
than ML methods. For instance, Palomares-Salas et al. (2009) compared the forecasting accuracy of ARIMA
and neural network models in predicting wind speed for short time intervals. The results showed that the
performance of both can be very similar, indicating that a more simple and interpretable forecasting model
could be used to administrate energy sources. A comparison of daily hotel demand forecasting performance
of SARIMAX, GARCH and neutral networks also showed that both time series approaches outperformed
2
the neural networks (Ampountolas, 2021). In the latter examples, one reason may also be the difficulty of
tuning complex machine learning procedures. That’s one reason why we focus on out-of-the-box machine
learning methods in our study.
The comparison of the forecasting performance of ML methods and time series models in logistics has
significant implications for businesses seeking to improve their forecasting accuracy. By identifying the most
effective forecasting methods, businesses can make better-informed decisions about production, inventory
management, and resource allocation. Thus, this work aims to provide a comprehensive comparison of the
forecasting performance of time series models and ML methods. Different from the above-mentioned works
that merely focus on single use cases, this task needs more variation in the data sets under study. To
this end, we compare various forecasting methods in terms of out-of-the-box forecasting performance on a
broad set of simulated time series. We thereby simulate various linear and nonlinear time series that are of
importance for logistics and study the one-step forecast performance of different statistical learning methods.
This work is structured as follows: Section 2 presents the different used forecasting methods. More
precisely, the (seasonal) ARIMA and TBATS models are presented. In addition, the machine learning
approaches (Random Forest and XGBoost) are described in more detail. Section 3 presents the simulation
design and framework, while Section 4 summarizes the main simulation results. In Section 5, an illustrative
real-world data example is analyzed before the manuscript concludes with a discussion of our findings and
an outlook for future research (Section 6).
2. Methods
In this section, we explain the one-step forecasting methods under investigation. There are various
strategies for modeling and forecasting time series. Traditional time series models, including moving averages
and exponential smoothing, follow a linear approach in which the predictions of future values are linear
functions of past observations. Due to their relative simplicity in terms of understanding and implementation,
linear models have found application in many forecasting problems (Fan et al., 2021; Nyoni, 2018; Benvenuto
et al., 2020). To overcome the limitations of linear models and account for certain nonlinear patterns observed
in real-world problems, several classes of nonlinear models have been proposed in the literature. Examples
cover the threshold autoregressive model (TAR) (Tsay, 1989) or the generalized autoregressive conditional
heteroscedastic model (GARCH) (Francq and Zakoian, 2019). Although some improvements have been
noted, the utility of their application to general prediction problems is limited (De Gooijer and Kumar,
1992): Since these models were developed for specific nonlinear patterns, they are often unable to model
other types of nonlinearities. Here, machine learning methods have been proposed as an alternative for time
series forecasting (Bontempi et al., 2012; Ahmed et al., 2010). Since it is impossible to cover the entire
spectrum of machine learning models and time series methods in our simulation study, we limit ourselves
to a selection of what we consider the most common algorithms in data-driven logistics. To evaluate the
performance, we compare these methods with a naive approach, where the last observation of the time series
is used as a prediction. The time series (Subsection 2.1) and machine learning methods (Subsection 2.2)
under study are explained in more detail in the next two subsections.
3
2.1. Time Series Methods
We focus on three different time series models: ARIMA, SARIMA, and TBATS. The first two models are
among the most popular models in traditional time series forecasting (Brockwell and Davis, 2002; Hyndman
and Athanasopoulos, 2018) and are often used as benchmark models for comparison with machine learning
algorithms (Al-Saba and El-Amin, 1999; Zhang et al., 2001b; Hwarng, 2001). In addition, TBATS models
combine many different approaches that are commonly used in forecasting.
ARIMA. Autoregressive integrated moving average (ARIMA)(Box et al., 2015) model is a generalized model
of the autoregressive moving average (ARMA) model and builds a composite model of the time series
(Shumway et al., 2000). Denoted as ARIMA(p, d, q), p, q, d ∈ N, the model is characterized by three key
components:
• AR (Autoregression): Represents the regression of the time series on its own past values, capturing
dependencies through lagged observations. The number of lagged observations included in the models
is given by p.
• I (Integrated): The differencing order (d) indicates the number of times the time series is differenced
to achieve stationarity. This transformation involves subtracting the current observation from its d-th
lag, which is crucial for stabilizing the mean and addressing trends.
• MA (Moving Average): Incorporates a moving average model to account for dependencies between
observations and the residual errors of the lagged observations (q).
In general a time series {xt}t generated from an ARIMA(p, d,q) model has the form
∑p ∑q
ϕi∆
dxt−i = θjεt−j ,
i=1 j=0
where p, d, q ∈ N, ϕ1, . . . , ϕp ∈ R are the autoregressive coefficients, θ1, . . . θq ∈ R are the moving average
coefficients and εt denotes the residuals or the errors at time t. The residuals are often assumed to follow a
white noise process, represented by a sequence of uncorrelated random variables with zero mean and finite
second moment. The difference operator ∆ is defined as ∆ : R → R with xt → xt − xt−1.
SARIMA. With seasonal time series data, short-term non-seasonal components likely contribute to the
model. Therefore, we need to estimate a seasonal ARIMA model incorporating non-seasonal and seasonal
factors into a multiplicative model (Shumway et al., 2000). The general form of a seasonal ARIMA model
is denoted as SARIMA(p, d, q)(P,D,Q)m, where p is the non-seasonal AR order, d is the non-seasonal
differentiation, q is the non-seasonal MA order, P , D and Q are the similar parameters for the seasonal
part. The parameter m denotes the number of time steps for a single period.
TBATS. For time series data exhibiting complex and diverse seasonal patterns, TBATS (Trigonometric
Seasonal Exponential Smoothing) is a robust modeling approach. Introduced as an extension of exponential
smoothing methods, TBATS accounts for different seasonalities through a combination of trigonometric
functions and exponential smoothing (De Livera et al., 2011). The model is particularly effective in handling
multiple seasonal cycles, making it suitable for data sets with intricate temporal structures.
The general form of a TBATS model consists of several components as described below:
4
• T (Trend): Captures the overall trend in the time series using an exponential smoothing mechanism.
• B (Box-Cox Transformation): Applies the Box-Cox transformation (Box and Cox, 1964) to stabilize
variance and ensure the homogeneity of variances.
• A (ARIMA Errors): Incorporates ARIMA errors to capture any remaining non-seasonal dependencies.
• S (Seasonal): Utilizes trigonometric functions to model multiple seasonal components, accommodating
various seasonal patterns.
2.2. Machine Learning Methods
Machine learning methods are increasingly being used to address time series prediction problems. In
fact, there exist too many approaches to consider in a comparison study like ours. We therefore restricted
ourselves to a class that has already been successfully used for predictions in the logistics context (Ji et al.,
2019; Islam and Amin, 2020; Ma et al., 2018; Huang et al., 2020; Kuhlmann et al., 2023): Tree-based
ensemble learners. We thereby focus on two models, each studied with and without differencing: Random
Forest and XGBoost on trees which are briefly introduced below.
XGBoost. Gradient boosting is an ensemble machine learning technique often used in classification and
regression problems and is particularly popular in predictive scenarios (Aguilar Madrid and Antonio, 2021).
As an ensemble technique, gradient boosting combines the results of several weak learners, referred to as
base learners, with the aim of building a model that generally performs better than the conventional single
machine learning models. Typically, gradient boosting utilizes decision trees as base learners. Like other
boosting methods, the core idea of gradient boosting is that during the learning procedure, new models
are built and fitted consecutively and not independently to provide better predictions of the output vari-
able. Thereby, new base learners are constructed with the aim of minimizing a loss function associated
with the whole ensemble. Instances that are not predicted correctly in previous steps and score higher er-
rors are correlated with larger weight values so that the model can focus on them and learn from its mistakes.
XGBoost stands for Extreme Gradient Boosting and is a specific implementation of gradient boosting
(Chen and Guestrin, 2016). It incorporates randomization and regularization techniques to reduce over-
fitting while increasing training speed. Moreover, it computes second-order gradients of the loss function,
which provides more information about the gradient’s direction, making it easier to minimize the loss func-
tion.
In general, the hyperparameters for XGBoost can be divided into two categories (Chen and Guestrin,
2016): General boosting parameters, including the number of iterations and the learning rate, which controls
how much information from a new tree will be used in the boosting step. Second, in base learner dependent
parameters. When trees are used as base learners, the additional hyperparameters are used to control the
complexity of the individual trees. Examples include limiting the maximum tree depth or specifying a
minimum number of samples in each leaf (Therneau and Atkinson, 1997). There also exists other boosting
variants (Schapire and Freund, 2013; Friedman, 2002; Mayr et al., 2014), but we concentrate on XGBoost
as it has emerged as one of the key machine learning models for prediction and was also referred to as ‘the
5
Queen of Machine Learning’ (Morde) in this context. XGBoost models have also been used for time series
forecasting, e.g., Luo et al. (2021); Alim et al. (2020). For example, in Zhang et al. (2021) the potential
of XGBoost in retail for predicting store sales was investigated while (Huang et al., 2020) studied this for
predicting the travel time of NYC cabs.
Random Forest. A Random Forest (Breiman, 2001) is a machine learning method based on building en-
sembles of decision trees. It was developed to address predictive shortcomings of traditional Classification
and Regression Trees (CARTs) (Breiman et al., 2017). Random Forests consist of a large number of weak
decision tree learners, which are grown in parallel to reduce the bias and variance of the model at the same
time (Breiman, 2001). For training a Random Forest, bootstrap samples are drawn from the training data
set. Each bootstrap sample is then used to grow a(n unpruned) tree. Instead of using all available features
in this step, only a small and fixed number of randomly sampled mtry features are selected as split candi-
dates. A split is chosen by the CART-split criterion for regression, i.e., by minimizing the sum of squared
errors in both child nodes. Instead of the CART-split criterion, many other distances, such as the least
absolute deviations of the mean (L1-norm), can also be used. These steps are repeated until B such trees
are grown, and new data is predicted by taking the mean of all B tree predictions. The most important
hyperparameters for the Random Forest (Wright and Ziegler, 2017) are:
• B as the number of grown trees. Note that this parameter is usually not tuned since it is known that
more trees are better.
• The cardinality of the sample of features at every node is mtry.
• The minimum number of observations that each terminal node should contain (stopping criteria).
Though there exist other variants of bagged tree-based ensembles (Geurts et al., 2006; Goehry et al., 2023),
we concentrate on the Random Forest as it is the best known method that is often seen as the machine
learning benchmark procedure (e.g. Pórtoles et al., 2018). In addition, Random Forests have also been
frequently used for time series forecasting (Huang et al., 2020; Kane et al., 2014). For example, in Salari
et al. (2022), a Random Forest approach was used to model real-time delivery time forecasts in online
retailing while Vairagade et al. (2019) applied Random Forest to predict product demand for grocery items.
While machine learning methods are quite en vogue, we should not neglect the advantages of time series
methods in terms of interpretability. Here, time series approaches enable a clearer understanding of the
factors influencing the predictions.
3. Simulation Set-up
In our simulation study, we compare the one-step forecast prediction performance of the methods de-
scribed in Section 2. All simulations were conducted in the statistical computing software R (R Core Team,
2022). We use the forecast package (Hyndman and Khandakar, 2008) for all time series approaches un-
der consideration. For the machine learning methods, we used the ranger (Wright and Ziegler, 2017) and
xgboost(Chen et al., 2022) packages for Random Forest and XGBoost, respectively. The concrete simulation
settings and data generating processes (DGPs) are described below.
6
Data Generating Processes. We consider twelve DGPs in total - an autoregressive model (AR), two bilinear
models (BL), two nonlinear autoregressive models (NAR), a nonlinear moving average model (NMA), two
sign autoregressive models (SAR), two smooth transition autoregressive models (STAR) and two TAR mod-
els. They are summarized in Table 1, where the error terms εt are independent and identically distributed
with a standard normal distribution.
Table 1: Data generating processes (DGPs) used in the simulation study. The error terms εt are i.i.d N (0, 1).
Model Type Variant(s) Data generating process
Autoregressive AR xt = 0.5xt−1 + 0.45xt−2 + εt,
Bilinear BL 1 xt = 0.7xt−1 · εt−2 + εt,
BL2 xt = 0.4xt−1 − 0.3xt−2 + 0.5xt−2 · εt−1 + εt,
Nonlinear Autoregressive NAR 1 x = 0.7|xt−1|t |xt−1|+2 + ε,
NAR2 x = 0.7|xt−1| + 0.35|xt−2|t |xt−1|+2 |xt−2|+2 + ε,
Nonlinear Moving Average NMA xt = εt − 0.3εt−1 + 0.2εt−2 + 0.4εt−1εt−2 − 0.25ε2t−2,
Sign Autoregressive SAR 1 xt = sign(xt−1) + εt,
SAR 2 xt = sign(xt−1 + xt−2) + εt,
Smooth Transition STAR 1 xt = 0.8ε − 0.8εt−1t 1+exp(−10x + ε− ) t,t 1
Autoregressive STAR 2 x = 0 0.1−0.9xt−1+0.8xt−2t .3xt + 0.6xt−2 + 1+exp(−10x + ε ,t− t1)
 0.9xt−1 + εt if |xt−1| ≤ 1
Threshold Autoregressive TAR 1 xt =

 −0.3xt−1 − εt if |xt−1| > 1
 0.9xt−1 + 0.05xt−2 + εt if |xt−1| ≤ 1
TAR 2 xt =  −0.3xt−1 + 0.65xt−2 − εt if |xt−1| > 1.
Similar models have been used to evaluate time series forecasts (Zhang et al., 2001a) and are of im-
portance in data-driven logistics. In particular, autoregressive models (AR, NAR1, NAR2) were chosen
to capture the persistence observed in historical logistics demand (Luong, 2007). Bilinear models (BL1,
BL2) reflect the complex interactions within logistics networks where different components contribute to
the observed patterns. The non-linear moving average (NMA) model is suitable for scenarios with com-
plex interdependencies between multiple factors. Sign autoregressive models (SAR1, SAR2) are suitable
for situations in which events or conditions have a directional influence on future events. Smooth transi-
tion autoregressive models (STAR1, STAR2) mimic logistics systems where demand changes gradually due
to external factors (Ubilava, 2012) and threshold autoregressive models (TAR1, TAR2) represent logistics
processes with different regimes based on specific conditions (Ricky Rambharat et al., 2005). This diverse
7
set of DGPs depicts many aspects of the multi-layered nature of logistics data, which includes persistence,
interactions, complicated dependencies, directional influences, smooth transitions and different regimes. In
the absence of comprehensive benchmark problems, this set-up allows us to evaluate the adaptability of
forecasting methods in dynamic logistics scenarios.
Additional Complexities. To add additional complexity to the analysis, we have incorporated settings with a
jump process and a random walk (Shumway et al., 2000) into each DGP. The jump process introduces sudden
regime changes (which may occur in logistics due to unforeseeable events), while the random walk adds noise
to the data (which may occur in logistics settings with increased complexity or less accurate measurements).
Thus, our study considers four different scenarios: (1) the DGP without additional complexity, (2) the DGP
superposed with the jump process, (3) the DGP superposed with random noise, and (4) the DGP superposed
with both the jump process and random noise. The jumps are modeled using a compound Poisson process
{pt}t (Kingman, 1992). The original DGP {xt}t is then superposed by pt as follows
x∗t = xt + pt,
where x∗t denotes the resulting DGP, and the compound Poisson process is given by
∑Nt
pt = Zi,
i=1
where Nt follows a Poisson distribution with parameter λ and Zi ∼ N (0, σ2p). For the jump experiments we
set σ2p to 1. A larger σ
2
p results in larger jumps in magnitude, while the mean over positive and negative
jumps remains zero. The parameter λ is set to n10 , where n denotes the length of the generated time series.
This means that, on average, a jump is expected to occur after every λ period. Superposing the DGP with
the compound Poisson process results in a mean shift by the actual jump size that occurred at each jump
event. As mentioned before, the noise is modeled by a random walk {wt}t with
wt = wt−1 + et,
where et ∼ N (0, σ2rw). In our study, we choose σ2rw in such a way that we obtain a setting with medium
noise, i.e., a signal-to-noise ratio (SNR) of four. The SNR (Box, 1988) is a measure that characterizes the
strength of the signal relative to the background noise. A higher SNR indicates a clearer and more dis-
cernible signal amidst the noise. By including the random walk, we achieve a resulting DGP that is globally
nonstationary due to the random walk overlay.
Additional Queueing Models. Beyond these 48 simulation models, we include the M/M/1 and M/M/2
queueing models (Cooper, 1981) in our study. Queueing models are commonly used in logistics, operations
research and industrial engineering to study the behavior of waiting lines or queues (Artalejo and Lopez-
Herrero, 2001; Schwarz et al., 2006; Kobayashi and Konheim, 1977; Gautam, 2012). Both models have
numerous real-world applications, such as in call centers (Brown et al., 2005), healthcare facilities (Green,
2006), and transportation systems (Radmilovic et al., 1996). The M/M/1 model is a classic queueing
8
model that assumes a single queue and one server. It is a stochastic model, where customer arrivals are
assumed to follow a Poisson process, and service times are exponentially distributed. The M/M/1 model
can be used to analyze the expected waiting time, the number of customers in the queue, and the expected
server utilization. The M/M/2 model is a variation of the M/M/1 model that assumes two parallel servers.
According to Gautam (2012), we set the arrival rate to four and the service rate to two.
Number of different Settings. For each setting, we generate time series of length n from the respective DGPs
with n ∈ {100, 500, 1000}. In total, this results in 150 (= 12 (time series DGPs) × 4 (further complexity)+2
queueing models) × 3 (lengths)) different simulation settings for each forecasting method.
Data Preprocessing. To forecast time series using a machine learning algorithm, the sliding window approach
(Dietterich, 2002) is used. In this approach, a fixed-sized window is moved over the time series data, and
at each step, the data within the window is used as input to a machine learning algorithm for prediction.
One advantage of the sliding window approach is that it allows the machine learning algorithm to capture
the temporal dependencies and patterns in the data. The window size is an important parameter in this
approach (Savva et al., 2020). If the window size is too small, it may not capture the relevant information in
the data, while if it is too large, it may introduce unnecessary noise and reduce the accuracy of the model.
We consider sliding window sizes of 2, 4, 8 and 16 and study which size is best suited for the different time
series lengths 100, 500 and 1000. Furthermore, in machine learning-based time series forecasting, we explore
two approaches: one using the original time series and the other using the differentiated time series as input.
The latter is essential as trees cannot forecast outside the range observed so far and to enhance stationarity
in the time series.
Choice of Parameters. In order not to have to discuss the different possibilities for hyperparameter tuning
of the machine learning algorithm, we use the default values recommended in the literature (Breiman, 2001;
Wright and Ziegler, 2017; Hastie et al., 2009). This has the additional advantage of a reduced runtime.
Thus, each ensemble learner consists of 500 trees, the inner bootstrap sample is equal to m ptry = ⌊ 3⌋, where
p denotes the number of features, the number of sample points in the bagging step is equal to the sample
size. Each terminal node should at least contain five observations. For XGBoost, we use a learning rate
of 0.3 and a maximal depth of 6. To estimate the parameters of the time series approaches, we use the
algorithms implemented in the R-package forecast.
Evaluation Measure. Since the mean square error (MSE) and the mean absolute percentage error (MAPE)
are widely used in the forecasting of time series in logistics (Kuhlmann and Pauly, 2023), we use them
as evaluation measures, which are calculated over 1,000 repeated forecasting steps. The MSE measures
the model’s accuracy, expressed as the average squared difference between observed and predicted values.
Simultaneously, the MAPE, calculated as the average percentage difference between observed and predicted
values, offers insights into the model’s relative performance.
4. Results
In this section, we describe the results of the simulation study. In particular, we present the MSE of the
different forecasting algorithms under various simulation configurations. The analysis of the MAPE results
can be found in the Appendix. We start with the performance of the methods for queueing models.
9
4.1. Predictive Power in Queueing Models
The influence of the different sliding window sizes and the differentiation is shown in Figures 1 and 2.
RF RF diff
7.5
35
7.2
34
33 6.9
32
6.6
31 Sliding Window Size
6.3
100 500 1000 100 500 1000 2
4
XGB XGB diff
8
11
16 16
10
15
9
14
8
13
100 500 1000 100 500 1000
length
Figure 1: MSE of ML approaches separated by the sliding window size for the M/M/1 setting. XGB stands for XGBoost and
RF for Random Forest; diff in the method name indicates that the data were differentiated.
RF RF diff
13 9.0
12 8.5
11 8.0
10 7.5 Sliding Window Size
100 500 1000 100 500 1000 2
4
XGB XGB diff
8
14 16
12
13
11
12 10
9
100 500 1000 100 500 1000
length
Figure 2: MSE of ML approaches separated by the sliding window size for the M/M/2 setting. XGB stands for XGBoost and
RF for Random Forest; diff in the method name indicates that the data were differentiated.
10
mse mse
Generally, differentiation improves the prediction power of both ML approaches in both settings. Especially
for the Random Forest, the MSE decreases by one-fifth after differentiation. The lengths of the time series
only have a minor influence on the MSE. The Random Forest with differentiated data outperformed the
other methods for all lengths. Comparing the effects of sliding window sizes, we find slight differences in
performance. Random Forests have smaller MSE values with smaller sliding windows in both settings, while
larger window sizes slightly improve performance in the other approaches.
The predictive power of the time series and naive approaches are given in Figure 3. Note that both
ARIMA and SARIMA models have identical MSE values. In both cases, the time series approach performs
better than the naive approach. However, the difference in performance is smaller for M/M/2. Again, the
influence of the time series length is marginal. While all time series approaches perform similarly in the
M/M/1 setting, the TBATS method has slightly smaller values in the M/M/2 setting.
1 2
12 10.0
11 9.5
10 Method
9.0
ARIMA
SARIMA
9 Naive
8.5
TBATS
8
8.0
7
7.5
100 500 1000 100 500 1000
length
Figure 3: MSE of time series and naive approaches for the M/M/1 (left) and M/M/2 (right) setting. ARIMA and SARIMA
models have identical MSE values, as no seasonality was present.
In both scenarios, the Random Forest approach with differenced data consistently showed the smallest
MSE. However, the differences between this method and the time series approaches were not great.
4.2. Predictive Power in the Different Time Series Settings
In the following, we analyze the performance of the methods for the DGPs described in Table 1. When
comparing the influence of sliding window size and differentiation on the performance of Random Forest
across all settings (Figure 4), we observed that non-differentiation resulted in smaller MSE values except
for the AR setting. In the AR setting, differentiation slightly outperformed non-differentiation. However, it
should be noted that as the length of the time series increases, the differences between the two approaches
become negligible. In all settings, the MSE values slightly decrease with an increase in time series length.
The sliding window size has a small influence on the prediction power and shows similar behavior across
different time series lengths.
11
mse
AR BL1 BL2 NAR1
5 1.6
1.4
1.3 4 2.5 1.4
1.2 3
1.1 2.0
1.2
1.0 2 1.0
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR2 NMA SAR1 SAR2
1.8 1.6
1.6 2.5 1.6
1.4
1.4 2.0 1.4
1.2 1.5 1.2 1.2
1.0 1.0
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR1 STAR2 TAR1 TAR2
1.9 2.25 3.5
6 1.7 2.00 3.0
5 1.5 1.75 2.5
1.50 2.0
4 1.3 1.25 1.5
1.1
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
Sliding Window Size 2 4 8 16 diff 0 1
Figure 4: MSE of the Random Forest approaches separated by the sliding window size and differentiation for the different data
generating processes.
AR BL1 BL2 NAR1
10
1.6 3.5 2.0
1.5 8 1.8
6 3.01.4 1.6
4 2.51.3 1.4
2.0
1.2 2 1.2
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR2 NMA SAR1 SAR2
2.4 2.8
2.1 2.4 2.1 2.00
1.8 2.0 1.8 1.75
1.5 1.6 1.5 1.50
1.2 1.2 1.25
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR1 STAR2 TAR1 TAR2
2.1 3.2
7 1.9 2.4 2.8
6 1.7 2.0
2.4
2.0
5 1.5 1.6 1.6
4 1.3
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
Sliding Window Size 2 4 8 16 diff 0 1
Figure 5: MSE of XGBoost approaches separated by the sliding window size and differentiation for the different data generating
processes.
12
mse mse
Similar observations can be made for XGBoost, see Figure 5. The sliding window’s size and the time
series length have a small effect on the performance quality. For all DGPs, the MSE values decrease slightly
with increasing time series length, except for BL1. Here, the MSE values first increase. The XGBoost
approaches generally have slightly larger MSE values than the Random Forest approaches.
Figure 6 shows the MSE values for the time series approaches. The performance of the time series
approaches is comparable to that of the Random Forest. All methods have very similar MSE values. The
time series length has only a minor impact on the predictive power, except for the BL1 setting. As observed
for the XGBoost approaches, MSE values in this setting first increase and then decrease with increasing
time series length.
AR BL1 BL2 NAR1
4.0 2.25
1.125 3.6 2.20 1.12
1.100 3.2 2.15
2.10 1.10
1.075 2.8 2.05 1.08
2.4 2.00
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR2 NMA SAR1 SAR2
1.6 1.55 1.401.20 1.50
1.45
1.15 1.5 1.351.40
1.10 1.4 1.35 1.301.30
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR1 STAR2 TAR1 TAR2
1.52 1.42 1.65
3.5 1.48 1.41 1.60
1.44 1.40
3.4 1.40 1.39 1.55
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
Method ARIMA SARIMA TBATS
Figure 6: MSE of the time series approaches for the different data generating processes.
Additional results can be found in the Appendix. Figure A.10 therein, e.g., shows that the naive approach
exhibits the largest MSE values compared to all methods. Thereby the performance of the naive approach is
dependent on the DGP and the length of the time series. For BL2, longer time series lengths generally lead to
better performance, but for NAR1 the performance may slightly decrease. For AR, BL1, and NMA models,
the MSE values typically decrease initially and then slightly increase as the time series length increases.
Conversely, NAR2, SAR1, SAR2, STAR1, STAR2, TAR1 and TAR2 tend to show the opposite trend.
4.3. Influence of the Additional Complexities on the Predictive Power
Based on the findings of the previous sections, we focus on the simulation results obtained with a sliding
window size of 8, as the choice of this size is due to the consistent performance observed with different
13
mse
sizes. Details of the results with other window sizes can be found in the Appendix, but a moderate size of
8 balances computational efficiency and information incorporation. Below we first consider the influence of
an additional jump process before discussing the white noise results.
The influence of the jumping process can be seen in Figure 7. All MSE values increase monotonically with
increasing sample size, indicating that the jump process significantly impacts predictive performance. Note
that as time series length increases, the Random Forest approach with differentiated data outperforms all
other approaches. Using the differenced data significantly improves the MSE values for both ML approaches,
particularly for increasing time series length. The predictive performance of the time series approaches is
similar for all DGPs and slightly better than that of the naive approach.
AR BL1 BL2 NAR1
160 160
120 120
120 120
80 80
80 80
40 40 40 40
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR2 NMA SAR1 SAR2
160 160 160
120 120 100 120
80 80 80
50
40 40 40
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR1 STAR2 TAR1 TAR2
160 160
120 120
120 120
80
80 80 80
40 40 40 40
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
RF XGB ARIMA Naive
Method
RF diff XGB diff SARIMA TBATS
Figure 7: MSE values of all methods and data generating processes superposed by a compound Poisson process.
Figure 8 summarizes the prediction results for all methods and all DGPs superposed by a random walk.
14
mse
Here, the time series length has only a minor influence on the prediction performance of the data overlaid
AR BL1 BL2 NAR1
8 3.5
4.0
7 5 3.0
3.5
6 4 2.5
3.0
5 2.0
2.5 3
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR2 NMA SAR1 SAR2
3.5 3.6
5
3.0 3.0 3.2
4
2.5 2.8
2.5
3
2.0 2.4
2.0
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR1 STAR2 TAR1 TAR2
10 4.0 15
5
9 3.5
8 4 10
3.0
7
3 2.5 5
6
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
RF XGB ARIMA Naive
Method
RF diff XGB diff SARIMA TBATS
Figure 8: MSE values of all methods and data generating processes superposed by a random walk.
with a random walk. For the AR and BL2 settings, the MSE values increase slightly when the time series
length is increased from 100 to 500. For all other DGPs, the MSE values decrease slightly, except for the
naive approach. The naive approach has the highest MSE values for all settings, followed by XGBoost,
except for BL2. Here, both approaches have similar values. The performance of the other methods depends
on the respective setting.
For the settings, AR, BL2, SAR1 and SAR2, Random Forest with differenced data again shows the
smallest MSE values, while the time series approaches show slightly larger values. Note that the XGBoosts
with differentiated data perform better in these settings than the Random Forests with non-differentiated
15
mse
data. In the BL1, NAR1, NAR2, NMA and STAR2 settings, only minor differences in the performance of
the Random Forests and time series approaches can be observed. When comparing the two XGBoost ap-
proaches in these settings, the differentiation reduces the MSE. The ML approaches show larger MSE values
in the STAR1, TAR1 and TAR2 settings than the time series approaches, with Random Forests performing
better than the XGBoost method.
The influence of both complexities, the random walk and the Poisson process, on the prediction perfor-
mance is shown in Figure A.15 in the Appendix. Similar to the case where a composite Poisson process
is superposed on the data, we observe an increase in MSE values with increasing time series length for all
settings. In particular, for time series lengths of 500, we obtain MSE values of more than 2,000.
4.4. Summarizing all Results
To evaluate the prediction performance across the spectrum of simulation settings, we calculate the
median rank for each prediction method in Table 2. The ranking is based on the MSE values, with rank 1
indicating the method with the lowest MSE. Each entry in the table represents the median rank of a
particular prediction method in all settings of a particular DGP model described in Section 2. Furthermore,
the results for the ranking take into account the performance of machine learning algorithms with a sliding
window size of 8.
Table 2: Median performance rank of forecasting methods across different simulation settings and different time series lengths.
Rankings are based on MSE values, with rank 1 indicating the method with the lowest MSE.
Method Queueing DGPs (Table 1) with
models no add. complexity jump random walk both
Random Forest 7 1 7 5 7
Random Forest Diff 1 6 1 1 1
XGBoost 7 5 7 7 7
XGBoost Diff 5 7 5 6 6
ARIMA 2.5 3 3 3 3
SARIMA 2.5 3 3 3 3
TBATS 3 3 3 3 3
Naive 6 8 6 8 5
The results in Table 2 provide useful insights into the relative predictive performance of the different
methods in different simulation scenarios. In particular, Random Forest with differentiated inputs proves to
be the best performing method, achieving the lowest median value across different complexities, including
scenarios with jumps, random walks or a combination of both. While XGBoost is competitive, it tends to
have a slightly higher median value under these conditions. Traditional time series methods such as ARIMA,
SARIMA and TBATS consistently show robust and similar performance.
5. Real-World Data Example
As explained at the onset, there is a lack of freely available and good documented data sets in logistics
research. We therefore use a rather simple real-world data example for illustration. The data set contains
daily demand orders from a Brazilian logistics company (Ferreira et al., 2017) and was sourced from the UCI
16
Machine Learning Repository (Dua and Graff, 2017). Covering a span of 60 consecutive days, the data set
consists of three time series that capture orders for products A, B, and C. Figure 9 shows the corresponding
time series in which specific shocks in the data can be identified. This observation puts us in a similar
setting to the simulation study where the DGP was overlaid with a Poisson process. Given this context,
it is of interest to evaluate whether the robust performance of (differentiated) machine learning algorithms
observed in the simulation study is also apparent in for this dataset.
Product A Product B Product C
300
100
250
200
75 200
100 15050
100
25
0 20 40 60 0 20 40 60 0 20 40 60
Days
Figure 9: Daily orders of a Brazilian logistics company separated by the different products.
The machine learning algorithms adhere to the hyperparameters outlined in Section 3, with a sliding
window size of eight, as informed by insights from our simulation study. We use the first 50 observations
to train all methods and the last ten observations to test the performance via time series cross validation
(Hyndman and Athanasopoulos, 2018, Chapter 5.10). The MSE and MAPE are again used as evaluation
measures. The summarized results are presented in Table 3. Note that the results of SARIMA and ARIMA
are identical due to the absence of seasonality and are therefore combined into one method.
Table 3: Mean MAPE and MSE of the methods considered in Section 2 using daily demand order data set.
MAPE MSE
Method Prod. A Prod. B Prod. C Prod. A Prod. B Prod. C
Random Forest 24.30 35.05 30.79 22.39 262.41 695.70
Random Forest Diff 6.67 21.80 15.84 4.91 197.23 1.97
XGBoost 25.06 41.62 19.51 22.34 376.62 147.20
XGBoost Diff 10.70 37.98 27.15 13.10 841.56 41.00
(S)ARIMA 28.57 49.30 33.56 29.48 1,142.14 655.88
TBATS 28.37 36.17 33.56 43.14 446.18 663.78
Naive 33.18 30.71 30.59 25.10 194.21 82.03
The results show that the performance of the forecasting methods is different in the various product cat-
egories. In general, the machine learning algorithms deliver consistently better results than the traditional
time series methods. This is in line with our simulation study, where ML methods showed better perfor-
mances when additional complexities were present. Random Forest with differentiation performed best for
17
Daily Orders
all three time series and evaluation measures, again confirming the results obtained in the simulation study
for such settings. It should be noted that the introduction of differentiation is beneficial for Random Forest
in all predictions. For XGBoost, however, performance on product A improves significantly when differenced
data is used, but in the other two time series differentiation leads to worse forecasting performance.
6. Summary, Discussion and Outlook
Summary with Higlights. The main objective of this simulation study was to perform a one-step comparative
analysis of prediction accuracy and evaluate the performance of tree-based machine learning and time
series approaches that are typically used in data-driven logistics. Through a comprehensive investigation of
different data generating processes, queueing models, and additional complexities, we aimed to determine
each method’s inherent strengths and limitations. Our analysis included conventional time series methods,
including (seasonal) ARIMA models and TBATS, as well as machine learning methods such as Random
Forest and XGBoost. In addition, we investigated the impact of data differencing on the performance of the
two latter algorithms. The key findings from our study are as follows:
• The out-of-the-box Random Forest emerged as the ML benchmark method.
• Training on differentiated time series can significantly improve the ML resilience.
• ML models are more robust with respect to additional (nonlinear) complexity, settings in which they
outperformed statistical time series approaches.
• In all other settings, the time series approaches were at least competitive or even performed better.
Detailed Discussion and Outlook. In our study, the Random Forest approach performed consistently better
in all simulation settings than the XGBoost approaches. It is worth noting that no hyperparameter tuning
was made in our study. Random Forests are known to be robust to hyperparameter settings and often per-
form well with default values (Probst et al., 2019; Fernández-Delgado et al., 2014). This robustness can be a
crucial factor contributing to their superior performance compared to XGBoost. Applying techniques such
as Bayesian Optimization or more simple grid or random search for hyperparameter tuning could change this
observation and should be investigated in future studies. Regarding the effect of data differentiation on the
performance of the two machine learning methods, we observed similar patterns. Differentiation improved
performance, especially in queueing scenarios and situations where additional complexity was introduced
into the data generation process. Without additional complexity, differentiation showed minimal impact,
with the performance of both methods deteriorating slightly when the differentiated data was used, except
for very linear data generation processes. Here, only a slight improvement was observed. This suggests
that differentiation plays a crucial role in improving the resilience of machine learning methods, especially
Random Forests when the data is overlaid with additional noise like a random walk. When comparing the
performance of the different time series approaches, we found subtle differences between them. ARIMA and
SARIMA showed relatively similar performance in all simulation settings under consideration. Their predic-
tion accuracy was quite consistent without big differences in most situations. Comparing their performance
with that of TBATS, the differences are also small and not substantial, suggesting that ARIMA, SARIMA
18
and TBATS had comparable predictive power in our simulation settings. The additional complexity in-
duced, such as a jump process or random noise, significantly impacts the predictive power. Introducing a
jump process leads to increased MSE values for all methods and settings, indicating a significant impact
on prediction accuracy. In this scenario, all methods show consistent behavior with strong increasing MSE
values for increasing time series lengths. When a noise process is introduced, a more nuanced pattern
emerges. For the machine learning approaches, differentiating the data proves beneficial and improves the
overall performance. The Random Forest approach with differenced data as input outperforms the other
approaches in most scenarios, closely followed by all three time series approaches. A comparison between
Random forests and the time series approaches shows different performance patterns in the different sim-
ulation environments. In queueing situations, where the underlying processes are often characterized by
complicated dynamics, the Random Forest approach shows superior performance. Furthermore, a notable
trend emerges in simulation settings where a Poisson process complements the data generating processes.
In these cases, ML methods show improved performance, indicating robustness to the inherent complexity
introduced by the Poisson process. The adaptability of ML models to capture and learn from nonlinear
patterns may contribute to their effectiveness in scenarios with Poisson process or random walk overlays.
However, it is essential to recognize that this beneficial performance of ML methods is not universal. In
all other simulation settings, the Random Forest approaches perform comparable or slightly worse than
all three time series approaches. In addition to the simulation study, our illustrative data analyses were
conducted with a focus on one-step demand forecasting for different products of a logistics company. The
results indicate that machine learning algorithms can improve the forecasting performance in this context.
In particular, the machine learning methods perform better or equally well as the time series methods for
most products.
In the context of data-driven logistics, our results underscore the importance of tailoring time series
forecasting methods to the specific characteristics of data sets encountered in different logistics areas. The
Random Forest approach, especially when using differentiated data as input, is recommended as an initial
benchmark prediction tool, particularly for data sets with a lot of noise or complex patterns. The robustness
of Random Forests, combined with their ability to achieve good results without extensive tuning of hyper-
parameters, makes them a pragmatic choice for various prediction scenarios. Conversely, in situations where
interpretability is paramount (e.g., to gain understanding or trust of users in warehouses or decision makers
in SCM) and the data exhibit clear patterns, traditional time series approaches remain a valuable and inter-
pretable option. These approaches often come with faster runtimes and greater resource efficiency, which is
also essential in the development of data-driven logistics, e.g. in case of resource constraints (Venkatapathy
et al., 2015; Gouda et al., 2023). As only one-step forecasts were considered, future simulation studies
should investigate whether the same observations can be found for more step forecasting. Also, additional
or hybrid methods must be investigated (Aladag et al., 2009; Zhang, 2003; Smyl, 2020). Another line of
future research needs to compare the methods with respect to uncertainty quantification, i.e., point-wise or
simultaneous prediction intervals and regions.
References
Aguilar Madrid, E., Antonio, N., 2021. Short-term electricity load forecasting with machine learning. Information 12, 50.
19
Ahmed, N.K., Atiya, A.F., Gayar, N.E., El-Shishiny, H., 2010. An empirical comparison of machine learning models for time
series forecasting. Econometric reviews 29, 594–621.
Al-Saba, T., El-Amin, I., 1999. Artificial neural networks as applied to long-term demand forecasting. Artificial Intelligence in
engineering 13, 189–197.
Aladag, C.H., Egrioglu, E., Kadilar, C., 2009. Forecasting nonlinear time series with a hybrid methodology. Applied Mathe-
matics Letters 22, 1467–1470.
Alim, M., Ye, G.H., Guan, P., Huang, D.S., Zhou, B.S., Wu, W., 2020. Comparison of arima model and xgboost model for
prediction of human brucellosis in mainland china: a time-series study. BMJ open 10, e039676.
Ampountolas, A., 2021. Modeling and forecasting daily hotel demand: A comparison based on sarimax, neural networks, and
garch models. Forecasting 3, 580–595.
Arora, K., Abbi, P., Gupta, P.K., 2022. Analysis of supply chain management data using machine learning algorithms, in:
Innovative Supply Chain Management via Digitalization and Artificial Intelligence. Springer, pp. 119–133.
Artalejo, J.R., Lopez-Herrero, M., 2001. Analysis of the busy period for the m/m/c queue: An algorithmic approach. Journal
of applied probability 38, 209–222.
Awasthi, S., Fernandez-Cortizas, M., Reining, C., Arias-Perez, P., Luna, M.A., Perez-Saura, D., Roidl, M., Gramse, N.,
Klokowski, P., Campoy, P., 2023. Micro uav swarm for industrial applications in indoor environment – a systematic literature
review. Logistics Research .
Baryannis, G., Dani, S., Antoniou, G., 2019. Predicting supply chain risks using machine learning: The trade-off between
performance and interpretability. Future Generation Computer Systems 101, 993–1004.
Benvenuto, D., Giovanetti, M., Vassallo, L., Angeletti, S., Ciccozzi, M., 2020. Application of the arima model on the covid-2019
epidemic dataset. Data in brief 29, 105340.
Bontempi, G., Ben Taieb, S., Borgne, Y.A.L., 2012. Machine learning strategies for time series forecasting, in: European
business intelligence summer school, Springer. pp. 62–77.
Box, G., 1988. Signal-to-noise ratios, performance criteria, and transformations. Technometrics 30, 1–17.
Box, G.E., Cox, D.R., 1964. An analysis of transformations. Journal of the Royal Statistical Society Series B (Methodology)
26, 211–243.
Box, G.E., Jenkins, G.M., Reinsel, G.C., Ljung, G.M., 2015. Time series analysis: forecasting and control. John Wiley & Sons.
Breiman, L., 2001. Random Forests. Machine Learning 45, 5 – 32. doi:https://doi.org/10.1023/A:1010933404324.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 2017. Classification and Regression Rrees. Routledge.
Brockwell, P.J., Davis, R.A., 2002. Introduction to time series and forecasting. Springer.
Brown, L., Gans, N., Mandelbaum, A., Sakov, A., Shen, H., Zeltyn, S., Zhao, L., 2005. Statistical analysis of a telephone call
center: A queueing-science perspective. Journal of the American statistical association 100, 36–50.
Carbonneau, R., Laframboise, K., Vahidov, R., 2008. Application of machine learning techniques for supply chain demand
forecasting. European Journal of Operational Research 184, 1140–1154.
Chen, T., Guestrin, C., 2016. Xgboost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining, Association for Computing Machinery. pp. 785–794.
Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y., Cho, H., Chen, K., Mitchell, R., Cano, I., Zhou, T., Li, M., Xie,
J., Lin, M., Geng, Y., Li, Y., Yuan, J., 2022. xgboost: Extreme Gradient Boosting. URL: https://CRAN.R-project.org/
package=xgboost. r package version 1.6.0.1.
Cooper, R.B., 1981. Queueing theory, in: Proceedings of the ACM’81 conference, pp. 119–122.
De Gooijer, J.G., Kumar, K., 1992. Some recent developments in non-linear time series modelling, testing, and forecasting.
International Journal of Forecasting 8, 135–156.
De Livera, A.M., Hyndman, R.J., Snyder, R.D., 2011. Forecasting time series with complex seasonal patterns using exponential
smoothing. Journal of the American statistical association 106, 1513–1527.
Dietterich, T.G., 2002. Machine learning for sequential data: A review, in: Structural, Syntactic, and Statistical Pattern
Recognition: Joint IAPR International Workshops SSPR 2002 and SPR 2002 Windsor, Ontario, Canada, August 6–9, 2002
Proceedings, Springer. pp. 15–30.
Dua, D., Graff, C., 2017. Uci machine learning repository. URL: http://archive.ics.uci.edu/ml.
Ensafi, Y., Amin, S.H., Zhang, G., Shah, B., 2022. Time-series forecasting of seasonal items sales using machine learning–a
comparative analysis. International Journal of Information Management Data Insights 2, 100058.
Fan, D., Sun, H., Yao, J., Zhang, K., Yan, X., Sun, Z., 2021. Well production forecasting based on arima-lstm model considering
manual operations. Energy 220, 119708.
20
Feizabadi, J., 2022. Machine learning demand forecasting and supply chain performance. International Journal of Logistics
Research and Applications 25, 119–142.
Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D., 2014. Do we need hundreds of classifiers to solve real world
classification problems? Journal of Machine Learning Research 15, 3133–3181.
Ferreira, R., Martiniano, A., Ferreira, A., Ferreira, A., Sassi, R., 2017. Daily Demand Forecasting Orders. UCI Machine
Learning Repository. DOI: https://doi.org/10.24432/C5BC8T.
Francq, C., Zakoian, J.M., 2019. GARCH models: structure, statistical inference and financial applications. John Wiley &
Sons.
Friedman, J.H., 2002. Stochastic gradient boosting. Computational statistics & data analysis 38, 367–378.
Friedrich, S., Friede, T., 2023. On the role of benchmarking data sets and simulations in method comparison studies. Biometrical
Journal , 2200212.
Garrido, R.A., Mahmassani, H.S., 2000. Forecasting freight transportation demand with the space–time multinomial probit
model. Transportation Research Part B: Methodological 34, 403–418.
Gautam, N., 2012. Analysis of queues: methods and applications. CRC Press.
Geurts, P., Ernst, D., Wehenkel, L., 2006. Extremely randomized trees. Machine learning 63, 3–42.
Gilbert, K., 2005. An ARIMA supply chain model. Management Science 51, 305–310. URL: http://search.ebscohost.
com/login.aspx?direct=true&db=buh&AN=16292291&site=ehost-live, doi:10.1287/mnsc.1040.0308. publisher: INFORMS:
Institute for Operations Research.
Goehry, B., Yan, H., Goude, Y., Massart, P., Poggi, J.M., 2023. Random forests for time series. REVSTAT-Statistical Journal
21, 283–302.
Gouda, A., Heinrich, D., Hünnefeld, M., Priyanta, I.F., Reining, C., Roidl, M., 2023. A grid-based sensor floor platform
for robot localization using machine learning, in: 2023 IEEE International Instrumentation and Measurement Technology
Conference (I2MTC), IEEE. pp. 1–6.
Green, L., 2006. Queueing analysis in healthcare. Patient flow: reducing delay in healthcare delivery , 281–307.
Hasmin, E., Aini, N., 2020. Data mining for inventory forecasting using double exponential smoothing method, in: 2020 2nd
International Conference on Cybernetics and Intelligent System (ICORIS), pp. 1–5. doi:10.1109/ICORIS50180.2020.9320765.
Hastie, T., Tibshirani, R., Friedman, J.H., Friedman, J.H., 2009. The elements of statistical learning: data mining, inference,
and prediction. volume 2. Springer.
Huang, H., Pouls, M., Meyer, A., Pauly, M., 2020. Travel time prediction using tree-based ensembles, in: Computational
Logistics: 11th International Conference, ICCL 2020, Enschede, The Netherlands, September 28–30, 2020, Proceedings 11,
Springer. pp. 412–427.
Hwarng, H.B., 2001. Insights into neural-network forecasting of time series corresponding to arma (p, q) structures. Omega
29, 273–289.
Hyndman, R.J., Athanasopoulos, G., 2018. Forecasting: principles and practice. OTexts.
Hyndman, R.J., Khandakar, Y., 2008. Automatic time series forecasting: the forecast package for R. Journal of Statistical
Software 26, 1–22. doi:10.18637/jss.v027.i03.
Islam, S., Amin, S.H., 2020. Prediction of probable backorder scenarios in the supply chain using distributed random forest
and gradient boosting machine learning techniques. Journal of Big Data 7, 1–22.
Ji, S., Wang, X., Zhao, W., Guo, D., 2019. An application of a three-stage xgboost-based model to sales forecasting of a
cross-border e-commerce enterprise. Mathematical Problems in Engineering 2019.
Kane, M.J., Price, N., Scotch, M., Rabinowitz, P., 2014. Comparison of arima and random forest time series models for
prediction of avian influenza h5n1 outbreaks. BMC bioinformatics 15, 1–9.
Kingman, J.F.C., 1992. Poisson processes. volume 3. Clarendon Press.
Kobayashi, H., Konheim, A., 1977. Queueing models for computer communications system analysis. IEEE Transactions on
Communications 25, 2–29.
Kohzadi, N., Boyd, M.S., Kermanshahi, B., Kaastra, I., 1996. A comparison of artificial neural network and time series models
for forecasting commodity prices. Neurocomputing 10, 169–181.
Kuhlmann, L., Pauly, M., 2023. A dynamic systems model for an economic evaluation of sales forecasting methods. Tehnički
glasnik 17, 397–404.
Kuhlmann, L., Wilmes, D., Müller, E., Pauly, M., Horn, D., 2023. Rodd: Robust outlier detection in data cubes. arXiv preprint
arXiv:2303.08193 .
Kumar Jha, B., Pande, S., 2021. Time series forecasting model for supermarket sales using fb-prophet, in: 2021 5th Inter-
21
national Conference on Computing Methodologies and Communication (ICCMC), pp. 547–554. doi:10.1109/ICCMC51019.
2021.9418033.
Lin, H.E., Zito, R., Taylor, M., et al., 2005. A review of travel-time prediction in transport and logistics, in: Proceedings of
the Eastern Asia Society for transportation studies, Bangkok, Thailand. pp. 1433–1448.
Luo, J., Zhang, Z., Fu, Y., Rao, F., 2021. Time series prediction of covid-19 transmission in america using lstm and xgboost
algorithms. Results in Physics 27, 104462.
Luong, H.T., 2007. Measure of bullwhip effect in supply chains with autoregressive demand process. European Journal of
Operational Research 180, 1086–1097.
Ma, Y., Zhang, Z., Ihler, A., Pan, B., 2018. Estimating warehouse rental price using machine learning techniques. International
Journal of Computers Communications & Control 13, 235–250.
Mayr, A., Binder, H., Gefeller, O., Schmid, M., 2014. The evolution of boosting algorithms. Methods of information in medicine
53, 419–427.
Morde, V., . Xgboost algorithm: Long may she reign! URL: https://towardsdatascience.com/
https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d. ”Accesssed 13 Dec 2023”.
Ni, D., Xiao, Z., Lim, M.K., 2020. A systematic review of the research trends of machine learning in supply chain management.
International Journal of Machine Learning and Cybernetics 11, 1463–1482.
Niemann, F., Reining, C., Moya Rueda, F., Nair, N.R., Steffens, J.A., Fink, G.A., Ten Hompel, M., 2020. Lara: Creating a
dataset for human activity recognition in logistics using semantic attributes. Sensors 20, 4083.
Nyoni, T., 2018. Modeling and forecasting inflation in kenya: Recent insights from arima and garch analysis. Dimorian Review
5, 16–40.
Palomares-Salas, J., De La Rosa, J., Ramiro, J., Melgar, J., Aguera, A., Moreno, A., 2009. Arima vs. neural networks for
wind speed forecasting, in: 2009 IEEE International Conference on Computational Intelligence for Measurement Systems
and Applications, IEEE. pp. 129–133.
Pórtoles, J., González, C., Moguerza, J.M., 2018. Electricity price forecasting with dynamic trees: A benchmark against the
random forest approach. Energies 11, 1588.
Probst, P., Boulesteix, A.L., Bischl, B., 2019. Tunability: Importance of hyperparameters of machine learning algorithms.
Journal of Machine Learning Research 20, 1–32.
R Core Team, 2022. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing.
Vienna, Austria. URL: https://www.R-project.org/.
Radmilovic, Z., Colic, V., Hrle, Z., 1996. Some aspects of storage and bulk queueing systems in transport operations. Trans-
portation planning and technology 20, 67–81.
Reining, C., Niemann, F., Moya Rueda, F., Fink, G.A., ten Hompel, M., 2019. Human activity recognition for production and
logistics—a systematic literature review. Information 10, 245.
Ribeiro, A.M.N., do Carmo, P.R.X., Endo, P.T., Rosati, P., Lynn, T., 2022. Short-and very short-term firm-level load forecasting
for warehouses: a comparison of machine learning and deep learning models. Energies 15, 750.
Ricky Rambharat, B., Brockwell, A.E., Seppi, D.J., 2005. A threshold autoregressive model for wholesale electricity prices.
Journal of the Royal Statistical Society Series C: Applied Statistics 54, 287–299.
Salari, N., Liu, S., Shen, Z.J.M., 2022. Real-time delivery time forecasting and promising in online retailing: When will your
package arrive? Manufacturing & Service Operations Management 24, 1421–1436.
Savva, A.D., Kassinopoulos, M., Smyrnis, N., Matsopoulos, G.K., Mitsis, G.D., 2020. Effects of motion related outliers in
dynamic functional connectivity using the sliding window method. Journal of Neuroscience Methods 330, 108519.
Schapire, R.E., Freund, Y., 2013. Boosting: Foundations and algorithms. Kybernetes 42, 164–166.
Schwarz, M., Sauer, C., Daduna, H., Kulik, R., Szekli, R., 2006. M/m/1 queueing systems with inventory. Queueing Systems
54, 55–78.
Sharma, R., Kamble, S.S., Gunasekaran, A., Kumar, V., Kumar, A., 2020. A systematic literature review on machine learning
applications for sustainable agriculture supply chain performance. Computers & Operations Research 119, 104926.
Shi, Y., Guo, X., Yu, Y., 2018. Dynamic warehouse size planning with demand forecast and contract flexibility. International
Journal of Production Research 56, 1313–1325.
Shukla, M., Jharkharia, S., 2011. Arima models to forecast demand in fresh supply chains. International Journal of Operational
Research 11, 1–18.
Shumway, R.H., Stoffer, D.S., Stoffer, D.S., 2000. Time series analysis and its applications. volume 3. Springer.
Siami-Namini, S., Tavakoli, N., Namin, A.S., 2018. A comparison of arima and lstm in forecasting time series, in: 2018 17th
22
IEEE international conference on machine learning and applications (ICMLA), IEEE. pp. 1394–1401.
Smyl, S., 2020. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. Interna-
tional Journal of Forecasting 36, 75–85.
Syntetos, A.A., Babai, Z., Boylan, J.E., Kolassa, S., Nikolopoulos, K., 2016. Supply chain forecasting: Theory, practice, their
gap and the future. European Journal of Operational Research 252, 1–26.
Taylor, S.J., Letham, B., 2018. Forecasting at scale. The American Statistician 72, 37–45.
Therneau, T.M., Atkinson, E.J., 1997. An Introduction to Recursive Partitioning Using the RPART Routines. Technical
Report. Mayo Foundation.
Tsay, R.S., 1989. Testing and modeling threshold autoregressive processes. Journal of the American statistical association 84,
231–240.
Ubilava, D., 2012. Modeling nonlinearities in the us soybean-to-corn price ratio: a smooth transition autoregression approach.
Agribusiness 28, 29–41.
Vairagade, N., Logofatu, D., Leon, F., Muharemi, F., 2019. Demand forecasting using random forest and artificial neural
network for supply chain management, in: Computational Collective Intelligence: 11th International Conference, ICCCI
2019, Hendaye, France, September 4–6, 2019, Proceedings, Part I 11, Springer. pp. 328–339.
Venkatapathy, A.K.R., Riesner, A., Roidl, M., Emmerich, J., ten Hompel, M., 2015. Phynode: An intelligent, cyber-physical
system with energy neutral operation for phynetlab, in: Smart SysTech 2015; European Conference on Smart Objects,
Systems and Technologies, VDE. pp. 1–8.
Weber, L.M., Saelens, W., Cannoodt, R., Soneson, C., Hapfelmeier, A., Gardner, P.P., Boulesteix, A.L., Saeys, Y., Robinson,
M.D., 2019. Essential guidelines for computational method benchmarking. Genome biology 20, 1–12.
Weng, Y., Wang, X., Hua, J., Wang, H., Kang, M., Wang, F.Y., 2019. Forecasting horticultural products price using arima
model and neural network based on a large-scale data set collected by web crawler. IEEE Transactions on Computational
Social Systems 6, 547–553.
Wenzel, H., Smit, D., Sardesai, S., 2019. A literature review on machine learning in supply chain management, in: Artificial In-
telligence and Digital Transformation in Supply Chain Management: Innovative Approaches for Supply Chains. Proceedings
of the Hamburg International Conference of Logistics (HICL), Vol. 27, Berlin: epubli GmbH. pp. 413–441.
Wright, M.N., Ziegler, A., 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R.
Journal of Statistical Software 77, 1–17. doi:10.18637/jss.v077.i01.
Wu, C.H., Ho, J.M., Lee, D.T., 2004. Travel-time prediction with support vector regression. IEEE transactions on intelligent
transportation systems 5, 276–281.
Wu, H., Levinson, D., 2021. The ensemble approach to forecasting: a review and synthesis. Transportation Research Part C:
Emerging Technologies 132, 103357.
Wu, Z., Ramsundar, B., Feinberg, E.N., Gomes, J., Geniesse, C., Pappu, A.S., Leswing, K., Pande, V., 2018. Moleculenet: a
benchmark for molecular machine learning. Chemical science 9, 513–530.
Zhang, G., Patuwo, B., Hu, M.Y., 2001a. A simulation study of artificial neural networks for nonlinear time-series fore-
casting. Computers & Operations Research 28, 381–396. URL: https://www.sciencedirect.com/science/article/pii/
S0305054899001239, doi:https://doi.org/10.1016/S0305-0548(99)00123-9.
Zhang, G.P., 2003. Time series forecasting using a hybrid arima and neural network model. Neurocomputing 50, 159–175.
Zhang, G.P., Patuwo, B.E., Hu, M.Y., 2001b. A simulation study of artificial neural networks for nonlinear time-series
forecasting. Computers & Operations Research 28, 381–396.
Zhang, L., Bian, W., Qu, W., Tuo, L., Wang, Y., 2021. Time series forecast of sales volume based on xgboost, in: Journal of
Physics: Conference Series, IOP Publishing. p. 012067.
23
Appendix A. Additional Simulation Results
AR BL1 BL2
2.07
2.04 6.2
2.01 6.1
3.65
1.98 6.0 3.60
1.95 5.9 3.55
100 500 1000 100 500 1000 100 500 1000
NAR1 NAR2 NMA
2.65 2.6752.650 4.3
2.60 2.625 4.2
2.55 2.600 4.1
4.0
100 500 1000 100 500 1000 100 500 1000
SAR1 SAR2 STAR1
2.30 2.60
8.8
2.56 8.7
2.25 8.62.52 8.5
2.20
100 500 1000 100 500 1000 100 500 1000
STAR2 TAR1 TAR2
3.6
4.4 13.643.5
4.2 3.4 13.60
4.0 3.3 13.563.2
100 500 1000 100 500 1000 100 500 1000
Length
Figure A.10: MSE of the naive approach for the different data generating processes.
Appendix A.1. Influence of Jump Process
Figure A.11 and A.12 summarize the prediction results for all sliding window sizes and data generating
processes using the ML methods. For both methods applied to differenced data, the performance is quite
similar across the different windows sizes. However, a small difference in MSE values can be observed for
the Random Forests, where a smaller window size slightly improves the prediction power.
24
mse
AR BL1 BL2 NAR1
160
160
150
120
120
120
100
80 8080
40 50 40 40
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR2 NMA SAR1 SAR2
150
160
150 150
120
100
100 100
80
50
50 50
40
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR1 STAR2 TAR1 TAR2
150
160 160
150
120 120
100
100
80
80
50
50
40
40
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
inputs 2 4 8 16 Method RF RF diff
Figure A.11: MSE values of all Random Forest approaches, sliding window sizes and data generating processes superposed by
a compound Poisson process.
25
mse
AR BL1 BL2 NAR1
160
150
150
120 120
100
100
80 80
50
40 50 40
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR2 NMA SAR1 SAR2
150
160 160
150
120 120
100
100
80 80
50
50
40 40
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR1 STAR2 TAR1 TAR2
160
160 160
120
120
120 120
80
80 8080
40 40 40
40
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
inputs 2 4 8 16 Method XGB XGB diff
Figure A.12: MSE values of all Random Forest approaches, sliding window sizes and data generating processes superposed by
a compound Poisson process.
26
mse
Appendix A.2. Influence of Additional Noise
AR BL1 BL2 NAR1
4.5
5.0
2.1
4.0 5.4
4.5
3.5 2.0
5.0
4.0
3.0
3.5
4.6 1.9
2.5 3.0
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR2 NMA SAR1 SAR2
2.7
3.25
2.1 2.6
2.75
3.00 2.5
2.0
2.50
2.4
2.75
1.9 2.3
2.25
2.50 2.2
1.8
2.00
2.1
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR1 STAR2 TAR1 TAR2
8.0
3.2 2.7 5.0
7.5
3.0 2.6
4.5
7.0
2.5
2.8
4.0
6.5 2.4
2.6
2.3 3.5
6.0
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
inputs 2 4 8 16 Method RF RF diff
Figure A.13: MSE values of all Random Forest approaches, sliding window sizes and data generating processes superposed by
a random walk.
27
mse
AR BL1 BL2 NAR1
8
4.5
6
2.6
4.0 7
5
2.4
3.5
6
4
2.2
3.0
5
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR2 NMA SAR1 SAR2
4.5
2.6 3.3 3.2
4.0
3.0
3.0
2.4 3.5
2.8
2.7
2.2 3.0
2.6
2.4
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR1 STAR2 TAR1 TAR2
3.50
5.6
10
3.6
3.25 5.2
9
3.3
3.00 4.8
8 3.0
4.4
2.75
7 2.7
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
inputs 2 4 8 16 Method XGB XGB diff
Figure A.14: MSE values of all XGBoost approaches, sliding window sizes and data generating processes superposed by a
random walk.
28
mse
Appendix A.3. Influence of Additional Noise and Jump Process
AR BL1 BL2 NAR1
6000
6000
6000 6000
4000
4000
4000 4000
2000
2000 2000 2000
0 0 0 0
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR2 NMA SAR1 SAR2
6000
6000
6000 6000
4000
4000
4000 4000
2000 2000
2000 2000
0 0 0 0
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR1 STAR2 TAR1 TAR2
6000
6000
6000
4000
4000
4000 4000
2000
2000 20002000
0 0 0 0
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
RF XGB ARIMA Naive
Method
RF diff XGB diff SARIMA TBATS
Figure A.15: MSE of all methods and settings, where the data generating processes were superposed by a random walk and
compound Poisson process.
29
mse
AR BL1 BL2 NAR1
6000
6000
6000 6000
4000
4000 4000 4000
2000
2000 2000 2000
0 0 0 0
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR2 NMA SAR1 SAR2
8000
6000
6000
6000
6000
4000 4000
4000 4000
2000 2000
2000 2000
0 0 0 0
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR1 STAR2 TAR1 TAR2
6000
6000
6000 6000
4000
4000
4000 4000
2000 2000
2000 2000
0 0 0 0
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
inputs 2 4 8 16 Method RF RF diff
Figure A.16: MSE of all Random Forest approaches, sliding window sizes and settings, where the data generating processes
were superposed by a random walk and compound Poisson process.
30
mse
AR BL1 BL2 NAR1
6000
6000 6000 6000
4000
4000 4000 4000
2000
2000 2000 2000
0 0 0 0
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR2 NMA SAR1 SAR2
6000
6000
6000 6000
4000
4000
4000 4000
2000
2000 2000 2000
0 0 0 0
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR1 STAR2 TAR1 TAR2
6000 6000
6000
6000
4000
4000 4000
4000
2000 2000 2000
2000
0 0 0 0
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
inputs 2 4 8 16 Method XGB XGB diff
Figure A.17: MSE of all XGBoost approaches, sliding window sizes and settings, where the data generating processes were
superposed by a random walk and compound Poisson process.
31
mse
Appendix B. MAPE Results
AR BL1 BL2 NAR1
200 600
400
750 500
190 350
400
180 500 300
300
170 250 250 200
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR2 NMA SAR1 SAR2
1500
3000 1000 600 method
800 ARIMA
2000 1000 500
600 naive
400
1000 SARIMA400 500
300 TBATS
200
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR1 STAR2 TAR1 TAR2
1600 600500
900 5001200 400
400
600 800 300
300
300 200400 200
100
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
Figure B.18: MAPE of the time series approaches for the different data generating processes described in Table 1.
32
mape
AR BL1 BL2 NAR1
400 500
200
1000 350
400
180 300
500 250 300
160
200 200
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR2 NMA SAR1 SAR2
700
3000 600 800
500 1000
2000 600
400
1000 300 500 400
200 200
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR1 STAR2 TAR1 TAR2
400
1500 400
600 350 350
1000 300 300
400 250 250
500 200
200 200
150
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
method Random Forest Random Forest Diff sliding window size 16 2 4 8
AR BL1 BL2 NAR1
270 550 900
2000 500 800
240
1500 450 700
210 400 6001000
350 500
180 500 300 400
300
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR2 NMA SAR1 SAR2
6000
800 1200 1000
4000 1000 800600 800 600
2000 400 600
400 400
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR1 STAR2 TAR1 TAR2
450
800 1500
400 400
600 1000 350
300 300
400 500 250
200 200
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
method XGBoost XGBoost Diff sliding window size 16 2 4 8
Figure B.19: MAPE of the Random Forest (above) and XGBoost (below) approaches for the different data generating processes
described in Table 1 superposed by a compound Poisson process.
33
mape mape
RF RF_dif XGB XGB_dif
2.5
2.0
1.5
1.0
0.5
0.0
30
25
20
15
10
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
sliding window size 16 2 4 8
Arima naive Sarima tbats
1.25
1.00
0.75
0.50
0.25
20
15
10
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
Figure B.20: MAPE of the machine learning algorithms (above) and time series approaches (below) for the M/M/1 and M/M/2
data generating process.
34
mape mapeMM1 MM2 MM1 MM2
AR0 BL10 BL20 NAR10
125
75 100 70 70
50 75 50 50
50
25 30 30
25
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR20 NMA0 SAR10 SAR20
100 50 200
50
75 40 150
40
30 50 30 100
20 25 20 50
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR10 STAR20 TAR10 TAR20
80 150
60 100
60 50 80 100
40
40 60
30 40 50
20 20
20
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
method Random Forest Random Forest Diff sliding window size 16 2 4 8
AR0 BL10 BL20 NAR10
100
80 80
75 90
60 60
50 60 40 40
25 30 20 20
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR20 NMA0 SAR10 SAR20
55
60 200
50 90 45
150
40 60 35 100
30 25
20 30 50
15
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR10 STAR20 TAR10 TAR20
100 160
60
75 120100
50 40 80
50
25 4020
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
method XGBoost XGBoost Diff sliding window size 16 2 4 8
Figure B.21: MAPE of the Random Forest (above) and XGBoost (below) approaches for the different data generating processes
described in Table 1 superposed by a compound Poisson process.
35
mape mape
AR0 BL10 BL20 NAR10
70 100
60 6060
50 75 50 50
40 40 40
30 50 30 30
20 25 20 20
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR20 NMA0 SAR10 SAR20
200
40 75 35
30 150
30 50 25 100
20 25 20 50
15
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR10 STAR20 TAR10 TAR20
70 60 80
60 50
50 60 100
40 40
30 30 40 50
20 20
20
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
method ARIMA naive SARIMA TBATS
AR0 BL10 BL20 NAR10
80 150
500
90
60 400 125
70 300 100
40 200
50 75
100
20 50
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR20 NMA0 SAR10 SAR20
125
200 80
200 100 70
150
160 75 60
50
100 120 50 40
50 80 30
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR10 STAR20 TAR10 TAR20
125 200 160
120 100 150 120
90 75
100 80
60 50
50 40
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
method ARIMA naive SARIMA TBATS
Figure B.22: MAPE of the time series approaches for the different data generating processes described in Table 1 superposed
by a compound Poisson process (above) or a random walk (below).
36
mape mape
AR0 BL10 BL20 NAR10
500
80 80 400 120
60 300 100
60
200
40 80
40 100
60
20
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR20 NMA0 SAR10 SAR20
200 125
200
100 80
160
150 75 60
100 120 50 40
50 80 25
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR10 STAR20 TAR10 TAR20
100
100 100 150 80
75 75
100 60
50 50 40
50
25 20
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
method Random Forest Random Forest Diff sliding window size 16 2 4 8
AR0 BL10 BL20 NAR10
500
80 90 150
400
125
60 70 300
100
40 50 200
100 75
20 30
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR20 NMA0 SAR10 SAR20
300
250 100
120
200 200 75
150 80
100 50100 40
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR10 STAR20 TAR10 TAR20
200
120 125 100
100 150 80
90
75 100 60
60 50 40
25 50
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
method XGBoost XGBoost Diff sliding window size 16 2 4 8
Figure B.23: MAPE of the Random Forest (above) and XGBoost (below) approaches for the different data generating processes
described in Table 1 superposed by a random walk.
37
mape mape
AR0 BL10 BL20 NAR10
60
80 100
50 150
40 60 75100
30 40 50
20 5020 25
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR20 NMA0 SAR10 SAR20
70 100
60 50
75 10050 40
40 50
30 30 50
20 25 20
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR10 STAR20 TAR10 TAR20
70
200 80
40 60
150 60 50
30 100 40 40
20 50 30
20 20
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
method Random Forest Random Forest Diff sliding window size 16 2 4 8
AR0 BL10 BL20 NAR10
80
200 12550
60 100
40 150 75
30 40 100 50
20 20 50 25
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR20 NMA0 SAR10 SAR20
60
90 50
50 100
40
40 60
30 30 50
30
20 20
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR10 STAR20 TAR10 TAR20
80
40 200 90
60
30 60
100 40
20 30
20
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
method XGBoost XGBoost Diff sliding window size 16 2 4 8
Figure B.24: MAPE of the Random Forest (above) and XGBoost (below) approaches for the different data generating processes
described in Table 1 superposed by a compound Poisson process and a random walk.
38
mape mape
AR0 BL10 BL20 NAR10
40 50 80
40 60 75
30
30 40 50
20 20
20 25
10
100 500 1000 100 500 1000 100 500 1000 100 500 1000
NAR20 NMA0 SAR10 SAR20
80
40 60 40 90
30 40 30 60
20 20 3020
100 500 1000 100 500 1000 100 500 1000 100 500 1000
STAR10 STAR20 TAR10 TAR20
160
35
50
30 120 40
80 4025 30
20 3040
20
15 20
100 500 1000 100 500 1000 100 500 1000 100 500 1000
length
method ARIMA naive SARIMA TBATS
Figure B.25: MAPE of the time series approaches for the different data generating processes described in Table 1 superposed
by a compound Poisson process and a random walk.
39
mape
Article 3
Schmid, L., Biebert, D., Hakert, C., Chen, K.-H., Lang, M., Pauly, M. & Chen, J.-J.
(2024). TREE: Tree Regularization for Efficient Execution. arXiv preprint,
https://doi.org/10.48550/arXiv.2406.12531
TREE: TREE REGULARIZATION FOR EFFICIENT EXECUTION
Lena Schmid 1 Daniel Biebert 1 Christian Hakert 1 Kuan-Hsun Chen 2 Michel Lang 1 Markus Pauly 1
Jian-Jia Chen 1
ABSTRACT
The rise of machine learning methods on heavily resource constrained devices requires not only the choice of a
suitable model architecture for the target platform, but also the optimization of the chosen model with regard to
execution time consumption for inference in order to optimally utilize the available resources. Random forests and
decision trees are shown to be a suitable model for such a scenario, since they are not only heavily tunable towards
the total model size, but also offer a high potential for optimizing their executions according to the underlying
memory architecture.
In addition to the straightforward strategy of enforcing shorter paths through decision trees and hence reducing the
execution time for inference, hardware-aware implementations can optimize the execution time in an orthogonal
manner. One particular hardware-aware optimization is to layout the memory of decision trees in such a way, that
higher probably paths are less likely to be evicted from system caches. This works particularly well when splits
within tree nodes are uneven and have a high probability to visit one of the child nodes.
In this paper, we present a method to reduce path lengths by rewarding uneven probability distributions during
the training of decision trees at the cost of a minimal accuracy degradation. Specifically, we regularize the
impurity computation of the CART algorithm in order to favor not only low impurity, but also highly asymmetric
distributions for the evaluation of split criteria and hence offer a high optimization potential for a memory
architecture-aware implementation. We show that especially for binary classification data sets and data sets with
many samples, this form of regularization can lead to an reduction of up to ≈ 4× in the execution time with a
minimal accuracy degradation.
1 INTRODUCTION a premier candidate for cache-aware optimizations, since
every inference follows one path, requiring only a small
Execution time optimization of machine learning models subset of nodes from the tree. This naturally fits in the
on the edge on extremely resource constrained devices has design principle of caches, since these are usually small
been widely studied, especially known as TinyML scenarios. and depend on a high locality of the memory accesses to
While one popular approach is to shrink the models (e.g., be fast and efficient. Chen et al. leverage a probabilistic
by reducing the number of neurons in neural networks, or model, describing the distribution of splits, to place the
the depth of decision trees in random forests) without losing frequent accessed paths in a cache-friendly manner (Chen
much accuracy, this approach is agnostic to the actual prop- et al., 2022). Breaking the probabilistic model down to
erties of the underlying hardware. One aspect of resource a single tree node, we observe that the approach can be
limitation is often the limited availability of energy and beneficial only when the probability of data tuples (i.e., split)
hence time budget for the execution of inference. Shrinking to take the left branch or the right branch differs significantly.
models indeed can meet this requirement, but a consider- In consequence, when optimizing the execution time of
able reduction of the execution time can also be achieved random forest models, a reduction in the tree size should
by an orthogonal hardware-aware implementation of the account for the distribution of splits in the tree nodes and
model, especially in the context of random forests (Chen maintain or enforce the property of uneven splits. This can
et al., 2022; Tabanelli et al., 2022). lead to a considerable execution time improvement by the
Random forests and their inner structure of decision trees are reduced tree size and an orthogonal improvement to favor
1 2 such cache-friendly implementations.TU Dortmund University, Germany University of Twente,
Netherlands. Correspondence to: Lena Schmid <lena.schmid@tu- In this paper, we introduce the design of a hardware-aware
dortmund.de>. regularization for decision tree training by actively reward-
ing uneven splits in single decision tree nodes. This leads to
1
arXiv:2406.12531v1  [cs.LG]  18 Jun 2024
the regularized construction of decision trees, which main- to reveal the relation between dataset properties and
tain the crucial properties for cache optimization, but with regularization effectiveness.
reduced total size or depth of some paths. Consequently, the
studied problem of this paper is how to regularize random 2 RELATED WORK
forest training with the objective of reducing the model size
and to reward uneven splits, while not degrading the accu- Performance optimization of trees and random forests is a
racy significantly. We tackle the problem by introducing a widely studied topic in the literature. When it comes to con-
regularization term into the split method of decision trees. crete hardware-close implementations, one popular example
This regularizer rewards split decisions that lead to uneven is the C++ implementation for random forests in Wright
splits to uphold asymmetric distributions. This leads to an & Ziegler (2017). The prominent concept of native trees,
orthogonal speed improvement to the cache optimization where nodes are stored in an array and executed in a narrow
mentioned above, and can even assist cache optimization. loop and if-else trees, where nodes form deeply nested if-
The introduction of the regularization offers a trade-off for else constructs, is introduced to maintain locality in the data
the application. The regularization term can be controlled by and instruction memory in Asadi et al. (2014). More vari-
a factor to take an either minor or major influence. We pro- ances of tree implementations are studied for the runtime of1
pose an intuitive application, where a tolerable degradation inference on RISC-V MCUs (Tabanelli et al., 2022).
in accuracy can be defined by the user. Subsequently, possi- Random forests are also considered to be executed on accel-
ble degrees of regularization are automatically tested, and erator devices, such as GPUs or FPGAs (Van Essen et al.,
the configuration with the maximal improvement in execu- 2012; Nakandala et al., 2020; Buschjäger & Morik, 2018)
tion time within the tolerable accuracy degradation is chosen. or in a vectorized manner (Kim et al., 2010). In addition to
If users are eager for a deeper investigation and the manual the deployment of the models to a hardware-close language
choice of a trade-off between accuracy degradation and exe- and massive parallel computation devices, also the optimiza-
cution time improvement, we report the corresponding data tion of the usage of the underlying hardware is investigated.
for a comprehensive set of possible and meaningful regu- This includes optimization of the throughput in a pipeline
larization degrees. These results are graphically illustrated execution (Prenger et al., 2013) and investigating the data
and allow an easy choice of the trade-off. It is generally a structure and the decision tree structure itself and gain per-
good idea to focus on Pareto optimal points with respect to formance improvement with proper reordering (Dato et al.,
accuracy degradation and execution time improvement in 2016; Lucchese et al., 2016). More specifically, the usage of
this data set for a first investigation. Beyond the choice of floating point hardware units and their performance impact
the meaningful application of the degree of regularization, is studied (Hakert et al., 2022a;b). Chen et al. (2022) utilize
the level of maximal possible meaningful regularization can a probabilistic model of the data distribution in the data set
reveal the information of how well the data set, which is to optimize the memory layout, in order to favor frequently
used for training is suited for this form of regularization. used paths for the cache behavior.
With the help of this, we determine a property, which we call
regularization robust on data sets, and identify properties, Although the approaches above provide various optimized
which make data sets more regularization robust. implementations of random forests, they do not alter the
training process in order to gain execution time perfor-
Despite the realization of the regularization in scikit-learn mance. One relevant approach is hyperparameter tuning
(Pedregosa et al., 2011), we focus on a comprehensive ex- (Bischl et al., 2023). Hyperparameter tuning specifically for
perimental evaluation of the proposed regularizer in this random forests is covered in Probst et al. (2019), resulting
paper. In detail, we take a set of UCI datasets (Dua & Graff, in the tool tuneRanger focusing on both accuracy max-
2017) and investigate the regularization in different model imization and explainability. The tool, however, does not
configurations. Furthermore, we conduct an extensive sim- include execution time performance as an objective. Mon-
ulation study with synthetic datasets, where the relation drian forests (Lakshminarayanan et al., 2014) in contrast,
between dataset properties and the effectiveness of the reg- introduce an online adaptive realization of random forests,
ularization is analyzed. In short, we provide the following which can improve the execution time performance while
contributions: maintaining a similar accuracy.
Regularizing the training process of random forests, to the
• A regularization term for the provocation of uneven
splits in decision tree training, including an implemen- 1Its naming system is deviated to the terminology used by most
tation in scikit-learn. of related works. For example, the if-else trees are named Naive
kernels, and the native trees are named Loop kernels. In this work,
• Evaluation of the regularization on UCI datasets. we follow the majority and use the terminology commonly found
in the literature.
• An extensive simulation study with synthetic datasets
2
best of our knowledge, has not been studied for the objective result will be two sets of samples from the dataset. This is
of execution time performance. Regularization, also beyond recursively repeated until a given stopping criterion is met
the scope of random forests, however, is a studied topic in (e.g. a certain depth is reached).
order to provide more explainability (Wu et al., 2018) or
achieve higher accuracy (Scheffer, 2000). Also, the effect of The basic working principle of all split criteria is to compute
high randomness in the random forest training as a form of a score for all possible split values at each node, and then
regularization is investigated (Mentch & Zhou, 2020). The select the split point corresponding to the best combined
objective of error tolerance and robustness is further shown criterion scores in the two resulting child nodes. More pre-
to be addressable by regularizing the training of binarized cisely, for a classification problem with k labels, pi denotes
neural networks (BNNs) (Buschjäger et al., 2021). the proportion of samples with class ci (i = 1, . . . , k) ina node. A widely used score for the impurity is the Gini
impurity, which is measured as
3 TREE REGULARIZATION
k
Improving the execution time of decision tree inference on ∑GINI = 1− p2. (1)
real hardware opens a larger design space. One way to i
i=1
achieve faster inference is to decrease the size of the model
itself. The obvious benefit towards execution time is, that Hence, when all samples belong to one class, the sum is 1
less computation is needed to return an inference result. and the resulting impurity is 0. The Gini impurity results in a
This approach usually introduces degradations in accuracy, larger value, the more evenly the class labels are distributed
as such the model cannot be shrank to an arbitrarily small in the node. One popular way to find the best split inside
size. A widely used method to decrease the model size is a node is finding the minimal mean Gini impurity of both
limiting the maximal depth a tree is allowed to grow to. resulting child nodes. We note that other split criteria such
In this work, we introduce an alternative method towards as Entropy and Information Gain can also be used (Breiman
reducing the overall model size. We optimize the decision et al., 1984). However, as the exact criterion for splitting is
tree construction to increase the existence of uneven splits not relevant for our proposed regularization, only the Gini
to benefit shorter paths to leaf nodes. More precisely, a impurity is covered here.
penalty term in the splitting criterion is introduced, which The split results in the samples being separated into two
serves as a control parameter to trade-off between tree size portions, being further used in the left and right child. This
and predictive accuracy. This control parameter effectively division in the samples then determines the probability of
shrinks the model size and reduces depths of single paths the left or right subtree to be used in an inference, Each
by maintaining and provoking uneven split decisions. node has a distinct access path starting from the root node
Another effective method is utilizing the cache behavior and ending in the node itself. To get the absolute probability
of the CPU. Chen et al. have shown that reordering the of any node, the individual probabilities of every node on
nodes inside memory in a cache-friendly manner improves the path to that node need to be multiplied. The resulting
execution time (Chen et al., 2022). In their approach the value is the probability of this node to be accessed during
split probabilities of nodes are used to determine the new prediction. Intuitively, the absolute probability of the root
order in memory. Here uneven splits are beneficial, as they node is 100%. The probability of any path to be taken during
result in nodes which are accessed more often. Therefore, inference is the probability of the leaf node the path ends in.
the benefit of the cache-friendly ordering is increased. Our These absolute probabilities can be used to identify which
proposed regularization both optimizes for smaller model paths are frequently accessed.
sizes and increases the likelihood of uneven splits.
3.2 Regularization Factor
For the sake of completeness, we first give a short overview
of the decision tree construction with the CART algorithm. A possibility to improve the execution time is to reduce the
Afterwards, we present the introduced regularization and total model size by controlling the training process to only
how it can be tuned for different scenarios iteratively. Lastly, keep important paths. The reduced amount of nodes leads
we discuss why the persistence of uneven splits are orthogo- to less memory loads during an inference. In addition, this
nal to the cache-aware optimizations in detail. regularization of the training can be designed such that not
only important paths in terms of prediction accuracy are
3.1 Decision Tree Construction kept, but also the access frequency of paths is maintained
kept. This consequently leads to an orthogonal optimization
A widely used training method to construct decision trees of the cache optimization from Chen et al., since the cache
is the CART algorithm (Breiman et al., 1984), by which friendly handling of frequently accessed paths is kept, and
the samples are repeatedly split by a chosen criterion. The cache replacements are reduced.
3
Since training of decision trees according to the CART algo- changes highly depending on a variety of factors (e.g., the
rithm (Breiman et al., 1984) consists of recursively splitting number of classes in the dataset).
the samples into two child nodes based on a threshold value,
the split decision can be modified in order to favor asym- There is a limit to how much any split can be usefully regu-
metric probabilities. In order to allow a trade-off between larized, as at some point all samples would go to one child
the original split criterion and the size-aware split, we intro- node. Therefore, the impact of the regularization factor is
duce an additive regularization factor for the split criterion, going to approach a limit the larger the factor gets.
penalizing even splits. The amount of penalization can be To find the optimal factor for a given scenario, the expected
controlled with a real-valued factor λ which is subject to performance improvement needs to be quantified. To that
tuning. Although this design is applicable to arbitrary split end, we define the expected depth of a single tree. It is
criteria, we here restrict ourselves to the popular Gini impu- measured as ∑
rity criterion in order to analyse the effect in depth. pl ∗ depth(l) (4)
In order to include a size-aware splitting criterion into this l∈leaf(t)
process, we define a regularization term R as where leaf(t) are all leaves of tree t, pl is the probability of
|#samplesleft −#samples | leaf node l and depth(l) is the depth of node l. The expected− rightR = 1 . (2) depth is therefore the mean depth the inference is expected
#samples to reach during repeated inference operations. Consequently,
Hence, when the split distributes samples almost equally to a reduction in the expected depth results in an increase in
the left and right child nodes, the value is close to 1, when performance, as fewer nodes have to be loaded during infer-
the split is very asymmetric on the other hand, the value ence. Furthermore, once the expected depth does not change
is closer to 0. Note that in contrast to the Gini impurity, significantly, the influence of the regularization factor is less
the regularization term does not operate on the class labels, pronounced and less performance gain is to be expected. To
but instead on the number of samples. In order to form the find an optimal factor, the factor is iteratively increased until
resulting split criterion, we add the regularization term with the difference in expected depth falls under a set threshold,
an adjustable weight λ ∈ R+ to the Gini impurity: which decides how close to the best possible performance
improvement the factor is tuned. At that point, performance
GINI′ = GINI+λ ·R. (3) is unlikely to improve further, and the corresponding value
for λ is chosen.
Adding the regularization term to the evaluation and opti-
mization of the Gini impurity in every step of the CART
algorithm allows accounting for cache-friendly splits during 4 EXPERIMENTAL EVALUATION
the training. It should be noted that the introduction of the
regularization potentially degrades the Gini impurity and To evaluate the application of the hardware-aware regular-
hence also the accuracy of the trained model. Consequently, ization, we conducted experiments on real and synthetic
the parameter λ has to be chosen effectively to provide a data sets. First, we apply a default setting, where the maxi-
good trade-off between accuracy and asymmetric splits. mal regularization is applied with a configurable, tolerableaccuracy degradation. Second, we enlighten the trade-off
Our modifications are directly implemented in scikit-learn. between degree of regularization, speed improvement and
To achieve the outlined regularization, a new split criterion accuracy drop. Lastly, we evaluate the limitations of regular-
based on the Gini impurity is introduced. The implementa- ization itself and report the boundaries for the meaningful
tion is largely similar as for the standard Gini split criterion. application.
However, when calculating the node impurity, the resulting
value is adapted according to Equation (3) and returned. To 4.1 Evaluation Setup
accommodate the factor λ, an additional hyperparameter
can be set while fitting the model to control the amount of For evaluating the execution time improvement, we trained
regularization. The source code is publicly available under random forests with different degrees of regularization (i.e.
[hiddenduetodoubleblindsubmission]. varying λ) on real and synthetic datasets. We subsequently
generated a straightforward C implementation and a cache
3.3 λ Tuning optimized implementation via Chen et al. (2022). The gen-
erated trees of both implementations are executed on a real
During training, the regularization factor λ needs to be set. world target machine. We use a server class system, i.e.
It should improve training towards the best performance with an Intel(R) Xeon(R) Gold 5218 @ 2.3GHz CPU with
optimization while preserving the accuracy as good as pos- 16 cores, 1024 KiB L1 Cache, 16 MiB L2 Cache and 22
sible. An optimal regularization factor cannot be picked MiB L3 Cache and 180GB RAM. We utilized Scikit-learn
universally. The effectiveness and influence of the factor to train random forests with varying number of trees and
4
maximal tree depths (1, 5, 15, 20) for each of these datasets. for all settings are available in the Appendix.
After a given threshold, the number of trees was only in-
creased for more experiments if it improves the accuracy Intuitive Application
enough. This was done to reduce the amount of redundant
experiments. To provide a better intuition for the impact To illustrate the most intuitive use case of the regularization,
of the regularization, we always compare the regularized we limit the allowed degradations in accuracy to 5%. We
implementation to the comparable not regularized counter- then pick the best regularization factor λ, which achieves the
part. In greater detail, the not cache optimized regularized maximal execution time improvement, while not degrading
implementation for a specific number of trees and maximal the accuracy beyond the specified level. Figure 1 reports
tree depth is compared to the not regularized version of the the corresponding results, where the x-axis separates the
not cache optimized implementation. This is similarly done different data sets from the UCI repository. The y-axis
for the cache optimized implementations. shows the relative speed improvement with regularization in
comparison to the same configuration without regularization.
In addition, we measured the balanced accuracy of the Each box includes random forests with different numbers of
trained model based on the test dataset. Balanced accuracy trees. The different colors indicate different maximal depths
evaluates a model’s classification performance by consider- of the trained decision trees and configurations without and
ing both sensitivity (true positive rate) and specificity (true with cache optimization.
negative rate), making it particularly useful in scenarios with
imbalanced datasets. Since the methods from Chen et al. From the presented results in Figure 1, several observa-
(2022) only optimize the memory layout and do not change tions can be made. First, it can be seen that for trees with
the model structure, the balanced accuracy is the same for a small maximal depth, the improvement in terms of ex-
all implementations. For the measurement of the execution ecution time is not reliably observable. Some configura-
time, we executed 50 repetitions of the inference of the test tions degrade the speed, some configurations only slightly
dataset and average the time consumption under realistic increase the speed. Considering that a limited maximal
execution conditions. To compare the balanced accuracy depth of 1 only allows for 3 tree nodes, these results are
and mean relative execution time, we used a training-test not surprising. Further, it can be observed that the speed
split ratio of 3:1 and repeated it 8 times. The scitkit-learn hy- improvement grows, the deeper the trees become. A gen-
perparameter was varied across a set of rec- eral tendency can be observed, that the deepest trees alsomax features√
{ ⌊ p⌋ ⌊√ ⌋ ⌊√ ⌋ } benefit most from regularization in terms of execution timeommended default values 2 , p , 2 p , p , where improvement. For the data sets, which achieve a significant
p denotes the number of features (Wright & Ziegler, 2017; execution time improvement, a similar scale of improvement
Hastie et al., 2009; Liaw et al., 2002). for not cache optimized and cache optimized implementa-
tions can be observed. This supports the design principle of
4.2 UCI Datasets a regularization, improving both not cache-optimized and
In the following experiments, the influence of the regulariza- cache-optimized implementations in an orthogonal manner.
tion is compared on eleven datasets from the UCI repository, It should be noted that this plot shows the relative execution
which was also adopted in Chen et al. (2022). Table 1 lists time in comparison to the not regularized version, i.e., when
the dataset name, source, number of samples (n), number the cache optimization improves the execution time upon the
of features (p), and number of classes (cl). For the ease of not cache optimized implementation, this improvement is
orthogonal to the regularization. The maximal improvement
in terms of execution time can be observed to be more than
Table 1. Name, source, number of samples (n), number of features 75%, i.e., more than 4× faster than without regularization.
(p), number of classes (cl) of each used dataset. The data sets, which profit most from the regularization in
Dataset Source n p cl terms of execution time improvement are adult, bank and
Adult (Kohavi, 1996) 48,842 64 2
Bank Marketing (Moro et al., 2014) 45,211 59 2 magic. Spambase and satlog also show a higher timing
Covertype (Blackard & Dean, 1999) 581,012 54 7 improvement than most of the other data sets. Comparing
Letter (Frey & Slate, 1991) 20,000 16 26
Magic (Dua & Graff, 2017) 19,020 10 2 this finding to Table 1 suggests the conclusion that data
MNIST (Dua & Graff, 2017) 45,000 784 10 sets with binary classification can benefit most from the
Satlog (Dua & Graff, 2017) 6,435 36 6
Spambase (Dua & Graff, 2017) 4,601 57 2 regularization in terms of execution time improvement.
Sensorless Drive (Dua & Graff, 2017) 58,509 48 11
Wearable Computing (Ugulino et al., 2012) 165,632 17 5
Wine Quality (Cortez et al., 2009) 6,497 11 7 Regularization Trade-Off
Tolerating only accuracy degradation until a configurable
presentation, we aggregated the multiple simulation settings threshold is a simplified form of application, which does not
with regard to the 8 replications and√focused on results ob- allow to make a trade-off. It could still happen, that a higher
tained with max features set to ⌊ p⌋. Detailed results
5
1.5
1.0
0.5
adult bank covertype letter magic mnist satlog sensorless− spambase wearable wine−
drive quality
dataset
cache opt., max. depth = 10 cache opt., max. depth = 20 not cache opt., max. depth = 15
cache opt., max. depth = 15 not cache opt., max. depth = 10 not cache opt., max. depth = 20
Figure 1. Impact of regularization on execution time across datasets
degree of regularization degrades the accuracy slightly be- can be observed to exist across different maximal depths of
yond this threshold, but achieves significant faster speed. trees. For the other configurations, it can be observed that a
Such scenarios are evaluated by analyzing the relation be- higher execution time improvement also comes with higher
tween the accuracy drop and the runtime improvement for accuracy degradation, especially for deeper trees. It can be
different degrees of regularization. We illustrate the results as well observed, that cache optimized and not cache opti-
of corresponding experiments in Figure 2. The data sets mized implementations form close results, which supports
are separated in different subplots. Each configuration, in- again the design principle of an orthogonal optimization.
cluding different amounts of trees and different degrees of This suggests the conclusion that, data sets with either bi-
regularization. forms one point, which is denoted by the nary classification or large sample sizes are better suited for
relative execution time improvement to the corresponding execution time improvement due to regularization without
not regularized counterpart on the x-axis and the accuracy high accuracy impact than other data sets. We call these
drop on the y-axis respectively. Cache optimized and not data sets regularization robust.
cache optimized implementations are separated by squares
and pluses. We further denote the limit of 5% accuracy drop, Limits of Regularization
as used for the previous intuition, by a dotted gray line.
The previous experiments and result discussions suggest the
From the results, two different major behaviors can be identi- conclusion, that several data sets have a property, namely
fied: For certain data sets, namely adult, bank, magic, mnist, regularization robust. This property refers to data sets,
spambase and wearable, the configurations with the maxi- which can benefit strongly from high degrees of regulariza-
mal speed improvement are either Pareto optimal or only tion in terms of their improved runtime, while not sacrificing
have a slight larger degradation in the accuracy than the con- too much accuracy. Previous experiments suggest that the
figurations with the lowest accuracy degradation. This trend data sets adult, bank, magic, mnist, spambase and wearable
6
rel. execution time
adult bank covertype letter
0.4
0.3
0.2
0.1
0.0
magic mnist satlog sensorless−drive
0.4
0.3
0.2
0.1
0.0
spambase wearable wine−quality 0.0 0.5 1.0
0.4
0.3
0.2
0.1
0.0
0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0
rel. execution time
max. depth 10 15 20 cache cache opt. not cache opt.
Figure 2. Normalized balanced accuracy values (with respect to the maximum value within each data set) and relative execution times for
all data sets with different tree depths and separated by cache optimization or not.
have this property to a certain degree. In order to investigate tion can profit from high regularization factors. These data
this property even further, we use regularization tuning as sets are adult, bank, magic, spambase and wearable. Except
explained in Section 3.3 to stop increasing the regulariza- the mnist data set, which has an exceptional high number
tion factor once the expected increase in speed falls under of classes, this is exactly the list of data sets, which are
a change threshold of 5%. In other words, if no further encountered as regularization robust before. Hence, by only
execution time improvement is achieved, the regularization investigating the data set properties upfront, an assertion
factor is not further increased. can be made whether the data set is regularization robust
We illustrate the amount of configurations of a data set (i.e. and thus may profit from strong degrees of regularization.
different number of trees in an ensemble and different im- We have seen that this often holds for binary classification
plementation strategies) with their maximal regularization problems or very large data sets.
factor in Figure 3. To get a general picture of the influence
of the regularization factor λ, experiments with λ ∈ [0, 40] 4.3 Simulation with Synthetic Data
are run regardless of any metric. Next, regularization tuning In order to better understand the effects of tree regularization
is used to stop once the factor is expected to not make a on binary classification datasets, particularly with respect to
significant difference to execution speed. It can be observed regularization robustness, we conducted experiments with
that for certain data sets, a reasonable amount of configura- synthetic data. We first describe how the data is generated
7
1− norm. balanced accuracy
adult bank covertype letter magic mnist
0.3
0.2
0.1
0.0
satlog sensorless−drive spambase w wearable                  wine−quality 50 60 70 80 90100
0.3
0.2
0.1
0.0
50 60 70 80 90100 50 60 70 80 90100 50 60 70 80 90100 50 60 70 80 90100 50 60 70 80 90100
max reg. factor
maximal depth 1 5 10 15 20
Figure 3. Relative frequency of maximum regularization factors in each data set, focusing on the range of values above 40.
and subsequently present the measured results. to provide some degree of adjustment to modulate the class
balance. Having fixed the dependencies among the features,
Simulation Setup
We consider a binary classification problem Y ∈ {0, 1} Table 2. Distributions ofX1, . . . , X5 used in the simulation stud-
with ten real-valued features X1, . . . , X10 for which we ies with p ∈ {0.2, 0.5, 0.7, 0.9}.
specify different distributions, dependencies and underly- Features Independent Weakly dependent Strongly dependent
ing models, described in the following. We model the last X1 Zp,1 Zp,1 Zp,1X2 Z0.1,−5 Z0.1,−5 Z0.1,−5
five features X6, . . . , X10 as independent and uniformly X Z0.5,2 Z0.5,2 Z0.5,2
distributed random variables from X Z[0, 10], independent of 4 0. , Z0.1,−5 + Z0.5,2 Z0.1,−5 + Z0.5,2X5 Z0.8,−2 Z0.8,−2 Z0.5,2 + 0.5Zp,1
the first five features X1, . . . , X5. For the first five fea-
tures we consider three different dependence structures
as summarized in Table 2. In the first setting (Indepen- we now model the dependencies with the outcome Y . In this
dent), we consider completely independent features, where study, we investigate three different relationships between
each feature follows a distinct mixed normal distribution Y and X1, . . . , X10 by incorporating different dependen-
Zℓ,k = lW1+(1− ℓ)N (∆µ+k, 1) with∆µ ∈ {1, 3, 5, 8} cies and correlation structures through logical rules. The
and W1 ∼ N (1, 1). The concrete choices for ℓ and k are settings range from a simple dependence of the outcome
given in Table 2. In the other two settings, we model a weak solely on the first feature X1 (S1), whereby Y equals 1 if
dependence between X2, X3 and X4 (second last column) the realization derived from X1 originates fromW1 of the
and a strong dependence between X1 and X5 (last column), normal mixed distribution. The more complex dependencies
respectively. Note that for the first feature, k = b is not involve the first three or five features. The concrete details
held constant, but is systematically varied from 0.2 to 0.9 are illustrated in Table 3.
8
rel. frequency
reachs 50. When examining the effect of regularization on
Table 3. Dependent models between the output and some of the
features. Here Oi refers to the event that the realization of the
balanced accuracy in these settings (plot on the left), it is
feature Xi originates from the first part of its mixed normal distri- noticeable that the blue and green settings show a more
bution. significant decrease in accuracy than the red settings. This
Setting Y = 1 suggests that the balance of classes influences the effect
S1 O1 of regularization, with unbalanced classes showing greater
S3 (O1 and O4) or ¬O2 sensitivity to regularization. By looking at the Pareto front
S5 (O1 and ¬O3) or (¬O5 and ¬O4) or O2 (plot on the right), we can see that the red setting dominates
the others for most configurations.
4.4 Discussion
For each setting, we generated samples of size num from
the respective model with num ∈ {100, 200, 500}. The The previously presented results indicate that the introduc-
regularization strength λ and the number of trees are var- tion of regularization offers a trade-off between a degrada-
ied as described in Section 4.1. The same applies to the tion of accuracy and the improvement of execution time.
hyperparametermax features. While for shallow decision trees the regularization generally
cannot offer a large spectrum for the trade-off and quickly
Results degrades to extreme cases, a wider spectrum for the trade-off
is offered for deeper tree models in general. It is worth notic-
For ease of presentation, we focus on the most important ing that the degradation of the accuracy is usually less by
results and general trends. Studying the simulation study one order of magnitude than the gained speed improvement,
results for all configurations, we observed that changes in when a moderate amount of regularization is chosen.
the dependency structure of the feature, the relationship
between features and outcome and the size of the inner boot- Investigating the dataset properties itself, the comparison
strap sample (max features) of the random forest had no between synthetic and real data sets shows that a major in-
large effect on the behavior of trees under the regularization. fluence on the effectiveness of the regularization is put by
In comparison, the balance of the prediction classes, regu- the balance of the prediction classes. Table 4 shows the Chi-
lated by the balance parameter b and ∆µ, and the sample Squared values for the UCI datasets, where a high values
size n were the driving forces for changes in the influence indicate a potential high imbalance of the distribution in
of the regularization. the prediction classes. It should be noted that these values
Examining the effects of sample size, we find results con-
sistent with those of the previous section. As the sample Table 4. Chi-Square Values of UCI Datasets
size increases, a greater improvement in execution time is Dataset Chi-square
observed along with a decrease in accuracy. Details can be Adult 6,556.066
found in the appendix Bank Marketing 18,552.56
Results for different combinations of the balance param- Covertype 71,3450.9
eters are shown in Figure 4. The results shown in Fig- Letter 19.3273
ure 4 are for max features = 6, n = 100, num = 100, Magic 1271.086
independent characteristics and the S3 model for the out- MNIST 139.705
come. We present results for three combinations of b and Satlog 504.379
∆µ: b = 0.9,∆µ = 8 (red), b = 0.7,∆µ = 3 (green) and
Spambase 153.515
b = 0.5,∆µ = 1 (blue). These combinations were selected
Sensorless Drive 2.2707
because of their different strengths of balance. Red is the Wearable Computing 33,934.62
most unbalanced and blue is the most balanced. The x-axis Wine Quality 6,383.254
for each of the first two plots shows an increasing regu-
larization factor, the y-axis shows the balanced accuracy can only be interpreted for binary classification datasets
in Figure 4 (left), and the relative execution time in Fig- (indicated in bold), since the structure of the imbalance be-
ure 4 (center). The x-axis for the figure on the right shows comes too complex for multi label classifications. It can
the 1-balanced accuracy, and the y-axis shows the relative be seen, that for the adult and bank dataset, which are ob-
execution time. The different shapes indicates wether a served to provide high speed improvements on minimal
cache-optimized version is used or not. For all three set- accuracy degradation, a relatively high Chi-Square value
tings, there is a clear trend towards faster relative execution can be observed. This aligns with the observations from the
times as the regularization factor increases. However, the synthetic datasets, where highly imbalanced distributions
improvement diminishes as soon as the regularization factor also allow high speed improvements on minimal accuracy
9
1.0
0.5
1.00
0.9
0.4
0.75
0.8
0.3
0.50
0.7
0.2
0.25
0.6
0.1
0.5 0.00
0.0
0 10 20 30 40 50 0 10 20 30 40 50 0.00 0.25 0.50 0.75 1.00
reg. factor reg. factor rel. execution time
Figure 4. Simulation results of the regularization for varying b and ∆µ: b = 0.9,∆µ = 8 (red), b = 0.7,∆µ = 3 (green) and
b = 0.5,∆µ = 1 (blue). The different shapes of the points indicate whether a cache-optimized version is used (circle) or not (triangle).
degradations. racy degradation and execution time improvement. We can
From the perspective of an user, regularization should be further categorize data sets as regularization robust, when
considered for deeper tree models, since the effectiveness for they are either binary classification data sets or have a high
small trees is highly limited. When a dataset is used, which amount of samples. Such data sets may benefit strongly
by default is imbalanced, regularization can be generally from regularization.Spending a deeper focus on the property
turned up further and gain more speed improvement while of being regularization robust, we see a dependency to the
not degrading the accuracy much. sample size in the synthetically generated data sets. We fur-ther observe a strong dependency with the imbalance of the
synthesized data sets and the effectiveness of regularization,
5 CONCLUSION AND OUTLOOK supporting the initial design principle. An implementation
in scitkit-learn is openly available.
Deploying machine learning models efficiently on resource-
constrained devices requires a carefully tuned model shape For future work, the application of regularization across
in terms of model size and a hardware-close implementa- the random forest structure should be studied, instead of
tions, of which the state-of-the-art cache-aware optimiza- considering single trees in separation. For instance, the
tions are prominent for random forests and decision trees. dataset can be split into subsets with strong dependencies
In this work, we present a method to regularize the impurity for the training of different trees, making the regularization
computation and reward highly asymmetric distributions more effective. Furthermore, it can be considered to have
in the training process of decision trees, which provokes heterogeneous degrees of regularization for other tree-based
uneven probability distributions (i.e., uneven splits) for of- ensembles.
fering high optimization potential.
To examine the effectiveness of our method, we conduct REFERENCES
extensive experimental evaluation on synthetic datasets and Asadi, N., Lin, J., and de Vries, A. P. Runtime optimiza-
on UCI datasets. The evaluation results show that a large
≈ × tions for tree-based machine learning models. IEEEexecution time reduction of up to 4 can be gained in Transactions on Knowledge and Data Engineering, 26
many cases while degrading the target accuracy by a few (9):2281–2292, Sept 2014. ISSN 1041-4347. doi:
percent. The user can either specify an acceptable threshold 10.1109/TKDE.2013.73.
of sacrificable accuracy degradation and derive the optimal
regularized result or can make an own trade-off by choos- Bischl, B., Binder, M., Lang, M., Pielok, T., Richter,
ing between pareto optimal points in the scope of accu- J., Coors, S., Thomas, J., Ullmann, T., Becker, M.,
Boulesteix, A.-L., Deng, D., and Lindauer, M. Hyper-
10
bal. accuracy
rel. execution time
1− bal. accuracy
parameter optimization: Foundations, algorithms, best Hakert, C., Chen, K.-H., and Chen, J.-J. Flint: Exploiting
practices, and open challenges. WIREs Data Mining floating point enabled integer arithmetic for efficient ran-
and Knowledge Discovery, 13(2), January 2023. doi: dom forest inference. arXiv preprint arXiv:2209.04181,
10.1002/widm.1484. 2022b.
Blackard, J. A. and Dean, D. J. Comparative accuracies Hastie, T., Tibshirani, R., Friedman, J. H., and Friedman,
of artificial neural networks and discriminant analysis in J. H. The elements of statistical learning: data mining,
predicting forest cover types from cartographic variables. inference, and prediction, volume 2. Springer, 2009.
Computers and electronics in agriculture, 24(3):131–151,
1999. Kim, C., Chhugani, J., Satish, N., Sedlar, E., Nguyen, A.,
Kaldewey, T., Lee, V., Brandt, S., and Dubey, P. FAST:
Breiman, L., Friedman, J., Olshen, R., and Stone, C. CART. Fast architecture sensitive tree search on modern CPUs
Classification and regression trees, 1984. doi: 10.1201/ and GPUs. In Proceedings of the 2010 ACM SIGMOD
9781315139470. International Conference on Management of data, pp.
339–350. ACM, 2010.
Buschjäger, S., Chen, J.-J., Chen, K.-H., Günzel, M., Hak-
ert, C., Morik, K., Novkin, R., Pfahler, L., and Yayla, M. Kohavi, R. Scaling up the accuracy of naive-bayes clas-
Margin-maximization in binarized neural networks for sifiers: a decision-tree hybrid. In Second International
optimizing bit error tolerance. In 2021 Design, Automa- Conference on Knowledge Discovery and Data Mining,
tion & Test in Europe Conference & Exhibition (DATE), 1996.
pp. 673–678, 2021. doi: 10.23919/DATE51398.2021.
9473918. Lakshminarayanan, B., Roy, D. M., and Teh, Y. W. Mon-
drian forests: Efficient online random forests. Advances
Buschjäger, S. and Morik, K. Decision tree and random in neural information processing systems, 27, 2014.
forest implementations for fast filtering of sensor data.
IEEE Transactions on Circuits and Systems I: Regular Liaw, A., Wiener, M., et al. Classification and regression by
Papers, 65(1):209–222, 2018. randomforest. R news, 2(3):18–22, 2002.
Chen, K.-H., Su, C., Hakert, C., Buschjäger, S., Lee, C.-L., Lucchese, C., Perego, R., Nardini, F. M., Tonellotto, N., Or-
Lee, J.-K., Morik, K., and Chen, J.-J. Efficient realization lando, S., and Venturini, R. Exploiting CPU SIMD exten-
of decision trees for real-time inference. ACM transac- sions to speed-up document scoring with tree ensembles.
tions on embedded computing systems, 21(6):1–26, 2022. In SIGIR 2016 - Proceedings of the 39th International
ACM SIGIR Conference on Research and Development
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., and Reis, J. in Information Retrieval, 2016. ISBN 9781450342902.
Modeling wine preferences by data mining from physic- doi: 10.1145/2911451.2914758.
ochemical properties. Decision support systems, 47(4):
547–553, 2009. Mentch, L. and Zhou, S. Randomization as regularization: A
degrees of freedom explanation for random forest success.
Dato, D., Lucchese, C., Nardini, F. M., Orlando, S., Perego, The Journal of Machine Learning Research, 21(1):6918–
R., Tonellotto, N., and Venturini, R. Fast ranking with 6953, 2020.
additive ensembles of oblivious and non-oblivious regres-
sion trees. ACM Transactions on Information Systems, Moro, S., Cortez, P., and Rita, P. A data-driven approach
2016. ISSN 15582868. doi: 10.1145/2987380. to predict the success of bank telemarketing. Decision
Support Systems, 62:22–31, 2014.
Dua, D. and Graff, C. UCI machine learning repository,
2017. URL http://archive.ics.uci.edu/ml. Nakandala, S., Saur, K., Yu, G.-I., Karanasos, K., Curino,
C., Weimer, M., and Interlandi, M. A tensor compiler
Frey, P. W. and Slate, D. J. Letter recognition using holland- for unified machine learning prediction serving. In 14th
style adaptive classifiers. Machine learning, 6:161–182, USENIX Symposium on Operating Systems Design and
1991. Implementation (OSDI 20), pp. 899–917, 2020.
Hakert, C., Chen, K.-H., and Chen, J.-J. Immediate split Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
trees: Immediate encoding of floating point split values in Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,
randomforests. European Conference on Machine Learn- Weiss, R., Dubourg, V., et al. Scikit-learn: Machine
ing and Principles and Practice of Knowledge Discovery learning in python. Journal of machine learning research,
in Databases, 2022a. 12(Oct):2825–2830, 2011.
11
Prenger, R., Chen, B., Marlatt, T., and Merl, D. Fast map
search for compact additive tree ensembles (cate). Tech-
nical report, Tech. rep., Lawrence Livermore National
Laboratory (LLNL), Livermore, CA, 2013.
Probst, P., Wright, M. N., and Boulesteix, A.-L. Hyperpa-
rameters and tuning strategies for random forest. Wiley
Interdisciplinary Reviews: data mining and knowledge
discovery, 9(3):e1301, 2019.
Scheffer, T. Nonparametric regularization of decision trees.
In Machine Learning: ECML 2000: 11th European
Conference on Machine Learning Barcelona, Catalonia,
Spain, May 31–June 2, 2000 Proceedings 11, pp. 344–
356. Springer, 2000.
Tabanelli, E., Tagliavini, G., and Benini, L. Optimizing ran-
dom forest-based inference on risc-v mcus at the extreme
edge. IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, 41(11):4516–4526,
2022. doi: 10.1109/TCAD.2022.3199903.
Ugulino, W., Cardador, D., Vega, K., Velloso, E., Milidiú,
R., and Fuks, H. Wearable computing: Accelerome-
ters’ data classification of body postures and movements.
In Advances in Artificial Intelligence-SBIA 2012: 21th
Brazilian Symposium on Artificial Intelligence, Curitiba,
Brazil, October 20-25, 2012. Proceedings, pp. 52–61.
Springer, 2012.
Van Essen, B., Macaraeg, C., Gokhale, M., and Prenger, R.
Accelerating a random forest classifier: Multi-core, gp-
gpu, or fpga? In Field-Programmable Custom Computing
Machines (FCCM), 2012 IEEE 20th Annual International
Symposium on, pp. 232–239. IEEE, 2012.
Wright, M. N. and Ziegler, A. ranger: A fast implementation
of random forests for high dimensional data in C++ and
R. Journal of Statistical Software, 77(1):1–17, 2017. doi:
10.18637/jss.v077.i01.
Wu, M., Hughes, M., Parbhoo, S., Zazzi, M., Roth, V., and
Doshi-Velez, F. Beyond sparsity: Tree regularization of
deep models for interpretability. In Proceedings of the
AAAI conference on artificial intelligence, volume 32,
2018.
12
APPENDIX
We present the complete results regardingmax features
and the different execution types. The results are presented
in graphical form.
13
adult bank covertype
1.00
0.75
0.50
0.25
letter magic mnist
1.00
0.75
0.50
0.25
satlog sensorless−drive spambase
1.00
0.75
0.50
0.25
wearable−body−postures wine−quality 0 25 50 75 100
1.00
0.75
0.50
0.25
0 25 50 75 100 0 25 50 75 100
reg. factor
max_features sqrt(p)/2 sqrt(p) 2sqrt(p) p  max. depth 1 5 10 15 20
Figure 5. Evaluation of the balanced accuracy for the UCI datasets
14
bal. accuracy
NaiveNativeTree Opt.NativeTree_25 Opt.PathIfTree_128000 StandardIfTree StandardNativeTree
3
2
1
0
3
2
1
0
3
2
1
0
3
2
1
0
0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100
reg. factor
1 10 20 max depth 1 5 10 15 20
 number of trees
5 15 30
Figure 6. Evaluation of the relative execution time for the adult dataset separated by the execution type max features, the maximum
depth (shape of the points) and the number of trees (color).
15
rel.  execution time
sqrt(p)/2 sqrt(p) 2sqrt(p) p
NaiveNativeTree Opt. NativeTree_25 Opt.PathIfTree_128000 StandardIfTree StandardNativeTree
2.0
1.5
1.0
0.5
0.0
2.0
1.5
1.0
0.5
0.0
2.0
1.5
1.0
0.5
0.0
2.0
1.5
1.0
0.5
0.0
0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100
reg. factor
1 10 20 max depth 1 5 10 15 20
 number of trees
5 15 30
Figure 7. Evaluation of the relative execution time for the bank dataset separated by the execution typemax features, the maximum
depth (shape of the points) and the number of trees (color).
16
rel. execution time
sqrt(p)/2 sqrt(p) 2sqrt(p) p
NaiveNativeTree Opt. NativeTree_25 Opt.PathIfTree_128000 StandardIfTree StandardNativeTree
1.25
1.00
0.75
0.50
0.25
1.25
1.00
0.75
0.50
0.25
1.25
1.00
0.75
0.50
0.25
1.25
1.00
0.75
0.50
0.25
0 20 40 60 80 0 20 40 60 80 0 20 40 60 80 0 20 40 60 80 0 20 40 60 80
reg. factor
1 10 20 40 max depth 1 5 10 15 20
 number of trees
5 15 30 50
Figure 8. Evaluation of the relative execution time for the covertype dataset separated by the execution typemax features, the maximum
depth (shape of the points) and the number of trees (color).
17
rel. execution time
sqrt(p)/2 sqrt(p) 2sqrt(p) p
NaiveNativeTree Opt.NativeTree_25 Opt.PathIfTree_128000 StandardIfTree StandardNativeTree
1.2
0.9
0.6
0.3
1.2
0.9
0.6
0.3
1.2
0.9
0.6
0.3
1.2
0.9
0.6
0.3
0 20 40 60 80 0 20 40 60 80 0 20 40 60 80 0 20 40 60 80 0 20 40 60 80
reg. factor
max depth 1 5 10 15 20 1 10 20 40
 number of trees
5 15 30
Figure 9. Evaluation of the relative execution time for the letter dataset separated by the execution type max features, the maximum
depth (shape of the points) and the number of trees (color).
18
rel. execution time
sqrt(p)/2 sqrt(p) 2sqrt(p) p
NaiveNativeTree Opt. NativeTree_25 Opt.PathIfTree_128000 StandardIfTree StandardNativeTree
1.0
0.5
1.0
0.5
1.0
0.5
1.0
0.5
0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100
reg. factor
1 10 20 max depth 1 5 10 15 20
 number of trees
5 15 30
Figure 10. Evaluation of the relative execution time for the magic dataset separated by the execution type max features, the maximum
depth (shape of the points) and the number of trees (color).
19
rel. execution time
sqrt(p)/2 sqrt(p) 2sqrt(p) p
NaiveNativeTree Opt. NativeTree_25 Opt.PathIfTree_128000 StandardIfTree StandardNativeTree
1.2
1.0
0.8
0.6
1.2
1.0
0.8
0.6
1.2
1.0
0.8
0.6
1.2
1.0
0.8
0.6
0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50
reg. factor
max depth 1 5 10 15 20 1 10 20 40
 number of trees
5 15 30
Figure 11. Evaluation of the relative execution time for the mnist dataset separated by the execution type max features, the maximum
depth (shape of the points) and the number of trees (color).
20
rel. execution time
sqrt(p)/2 sqrt(p) 2sqrt(p) p
NaiveNativeTree Opt. NativeTree_25 Opt.PathIfTree_128000 StandardIfTree StandardNativeTree
1.5
1.2
0.9
0.6
1.5
1.2
0.9
0.6
1.5
1.2
0.9
0.6
1.5
1.2
0.9
0.6
0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100
reg. factor
1 10 20 max depth 1 5 10 15 20
 number of trees
5 15 30
Figure 12. Evaluation of the relative execution time for the satlog dataset separated by the execution type max features, the maximum
depth (shape of the points) and the number of trees (color).
21
rel. execution time
sqrt(p)/2 sqrt(p) 2sqrt(p) p
NaiveNativeTree Opt.NativeTree_25 Opt.PathIfTree_128000 StandardIfTree StandardNativeTree
1.3
1.1
0.9
0.7
0.5
1.3
1.1
0.9
0.7
0.5
1.3
1.1
0.9
0.7
0.5
1.3
1.1
0.9
0.7
0.5
0 20 40 60 80 0 20 40 60 80 0 20 40 60 80 0 20 40 60 80 0 20 40 60 80
reg. factor
1 10 20 40 max depth 1 5 10 15 20
 number of trees
5 15 30 50
Figure 13. Evaluation of the relative execution time for the sensorless dataset separated by the execution type max features, the
maximum depth (shape of the points) and the number of trees (color).
22
rel. execution time
sqrt(p)/2 sqrt(p) 2sqrt(p) p
NaiveNativeTree Opt. NativeTree_25 Opt.PathIfTree_128000 StandardIfTree StandardNativeTree
1.2
0.9
0.6
0.3
1.2
0.9
0.6
0.3
1.2
0.9
0.6
0.3
1.2
0.9
0.6
0.3
0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100
reg. factor
1 10 20 max depth 1 5 10 15 20
 number of trees
5 15 30
Figure 14. Evaluation of the relative execution time for the spambase dataset separated by the execution type max features, the
maximum depth (shape of the points) and the number of trees (color).
23
rel.  execution time
sqrt(p)/2 sqrt(p) 2sqrt(p) p
NaiveNativeTree Opt. NativeTree_25 Opt.PathIfTree_128000 StandardIfTree StandardNativeTree
1.5
1.2
0.9
0.6
0.3
1.5
1.2
0.9
0.6
0.3
1.5
1.2
0.9
0.6
0.3
1.5
1.2
0.9
0.6
0.3
0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100
reg. factor
1 10 20 max depth 1 5 10 15 20
 number of trees
5 15 30
Figure 15. Evaluation of the relative execution time for the wearable dataset separated by the execution typemax features, the maximum
depth (shape of the points) and the number of trees (color).
24
rel.  execution time
sqrt(p)/2 sqrt(p) 2sqrt(p) p
NaiveNativeTree Opt. NativeTree_25 Opt.PathIfTree_128000 StandardIfTree StandardNativeTree
1.6
1.2
0.8
0.4
1.6
1.2
0.8
0.4
1.6
1.2
0.8
0.4
1.6
1.2
0.8
0.4
0 20 40 60 0 20 40 60 0 20 40 60 0 20 40 60 0 20 40 60
reg. factor
1 10 20 max depth 1 5 10 15 20
 number of trees
5 15 30
Figure 16. Evaluation of the relative execution time for the wine-quality dataset separated by the execution type max features, the
maximum depth (shape of the points) and the number of trees (color).
25
rel. execution time
sqrt(p)/2 sqrt(p) 2sqrt(p) p
100 200 500
1.0
0.9
0.8
0.7
0.6
0.5
1.0
0.9
0.8
0.7
0.6
0.5
1.0
0.9
0.8
0.7
0.6
0.5
1.0
0.9
0.8
0.7
0.6
0.5
0 25 50 75 100 0 25 50 75 100 0 25 50 75 100
reg. factor
maximal depth 1 5 10 15 20 number of trees 10 20 30 40 50
Figure 17. Balanced accuracy evaluation for synthetic data (red setting) with varying sample size,max features, maximum depth (point
shape) and number of trees (color).
26
balanced accuracy
1 3 6 10
100 200 500
0.9
0.8
0.7
0.6
0.5
0.9
0.8
0.7
0.6
0.5
0.9
0.8
0.7
0.6
0.5
0.9
0.8
0.7
0.6
0.5
0 25 50 75 100 0 25 50 75 100 0 25 50 75 100
reg. factor
maximal depth 1 5 10 15 20 number of trees 10 20 30 40 50
Figure 18. Balanced accuracy evaluation for synthetic data (green setting) with varying sample size,max features, maximum depth
(point shape) and number of trees (color).
27
balanced accuracy
1 3 6 10
100 200 500
0.9
0.8
0.7
0.6
0.5
0.9
0.8
0.7
0.6
0.5
0.9
0.8
0.7
0.6
0.5
0.9
0.8
0.7
0.6
0.5
0 25 50 75 100 0 25 50 75 100 0 25 50 75 100
reg. factor
maximal depth 1 5 10 15 20 number of trees 10 20 30 40 50
Figure 19. Balanced accuracy evaluation for synthetic data (blue setting) with varying sample size, max features, maximum depth
(point shape) and number of trees (color).
28
balanced accuracy
1 3 6 10
NaiveNativeTree Opt. NativeTree_25 Opt.PathIfTree_128000 StandardIfTree StandardNativeTree
1.5
1.0
0.5
0.0
1.5
1.0
0.5
0.0
1.5
1.0
0.5
0.0
1.5
1.0
0.5
0.0
0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100
reg. factor
 number of trees 10 20 30 40 max depth 1 5 10 15 20
Figure 20. Relative execution time evaluation for synthetic data (red setting) and sample size=100 with varying sample size,
max features, maximum depth (point shape) and number of tree2s9(color).
rel. exection time
1 3 6 10
NaiveNativeTree Opt. NativeTree_25 Opt.PathIfTree_128000 StandardIfTree StandardNativeTree
1.5
1.0
0.5
0.0
1.5
1.0
0.5
0.0
1.5
1.0
0.5
0.0
1.5
1.0
0.5
0.0
0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100
reg. factor
max depth 1 5 10 15 20  number of trees 10 20 30 40 50
Figure 21. Relative execution time evaluation for synthetic data (red setting) and sample size=200 with varying sample size,
max features, maximum depth (point shape) and number of tree3s0(color).
rel. exection time
1 3 6 10
NaiveNativeTree Opt. NativeTree_25 Opt.PathIfTree_128000 StandardIfTree StandardNativeTree
1.5
1.0
0.5
0.0
1.5
1.0
0.5
0.0
1.5
1.0
0.5
0.0
1.5
1.0
0.5
0.0
0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100
reg. factor
 number of trees 10 20 30 40 max depth 1 5 10 15 20
Figure 22. Relative execution time evaluation for synthetic data (green setting) and sample size=100 with varying sample size,
max features, maximum depth (point shape) and number of tree3s1(color).
rel. exection time
1 3 6 10
NaiveNativeTree Opt. NativeTree_25 Opt.PathIfTree_128000 StandardIfTree StandardNativeTree
1.5
1.0
0.5
0.0
1.5
1.0
0.5
0.0
1.5
1.0
0.5
0.0
1.5
1.0
0.5
0.0
0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100
reg. factor
 number of trees 10 20 30 40 max depth 1 5 10 15 20
Figure 23. Relative execution time evaluation for synthetic data (green setting) and sample size=200 with varying sample size,
max features, maximum depth (point shape) and number of tree3s2(color).
rel. exection time
1 3 6 10
NaiveNativeTree Opt. NativeTree_25 Opt.PathIfTree_128000 StandardIfTree StandardNativeTree
1.5
1.0
0.5
0.0
1.5
1.0
0.5
0.0
1.5
1.0
0.5
0.0
1.5
1.0
0.5
0.0
0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100
reg. factor
max depth 1 5 10 15 20  number of trees 10 20 30
Figure 24. Relative execution time evaluation for synthetic data (blue setting) and sample size=100 with varying sample size,
max features, maximum depth (point shape) and number of tree3s3(color).
rel. exection time
1 3 6 10
NaiveNativeTree Opt. NativeTree_25 Opt.PathIfTree_128000 StandardIfTree StandardNativeTree
1.5
1.0
0.5
0.0
1.5
1.0
0.5
0.0
1.5
1.0
0.5
0.0
1.5
1.0
0.5
0.0
0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100
reg. factor
max depth 1 5 10 15 20  number of trees 10 20 30 40 50
Figure 25. Relative execution time evaluation for synthetic data (blue setting) and sample size=200 with varying sample size,
max features, maximum depth (point shape) and number of tree3s4(color).
rel. exection time
1 3 6 10

Name: Lena Schmid
Erklärung
Hiermit erkläre ich, dass ich die vorliegende Dissertation mit dem Titel
“Statistical Analyses of Tree-Based Ensembles”
selbständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel
verwendet sowie die wörtlich oder inhaltlich übernommenen Stellen als solche kenntlich
gemacht habe und die Satzung der Technischen Universität Dortmund zur Sicherung
guter wissenschaftlicher Praxis in der jeweils gültigen Fassung beachtet habe. Ich
versichere außerdem, dass ich die beigefügte Dissertation nur in diesem und keinem
anderen Promotionsverfahren eingereicht habe und dass diesem Promotionsverfahren
keine endgültig gescheiterten Promotionsverfahren vorausgegangen sind. Ferner erkläre
ich, dass keine Aberkennung eines bereits erworbenen Doktorgrades vorliegt. Ich
versichere an Eides statt, dass ich nach bestem Wissen die reine Wahrheit erklärt und
nichts verschwiegen habe.
Dortmund, den . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Lena Schmid