Analyzing consistency and statistical inference in Random Forest models
Date
2020
Abstract
This thesis pays special attention to the Random Forest method, an ensemble learning technique
based on bagging and feature sub-spacing, covering three main aspects: its behavior as a
prediction tool in the presence of missing values, its role in uncertainty quantification, and
its use for variable screening. In the first part, we focus on the performance of Random Forest
models in prediction and missing value imputation, comparing them to other learning methods
such as boosting procedures. Therein, we aim to discover potential modifications of Breiman's
original Random Forest that increase the imputation performance of Random Forest based
models, using the normalized root mean squared error and the proportion of false classifications
as evaluation measures. Our results favored a mixed model combining stochastic
gradient boosting with a Random Forest based on kernel sampling. Regarding
inferential statistics after imputation, we were interested in whether Random Forest methods
deliver valid statistical inference procedures, especially in repeated measures ANOVA. Our
results indicated a heavy inflation of type-I error rates when testing the null hypothesis of no
mean time effects. We could furthermore show that the between-imputation variance according
to Rubin's multiple imputation rule vanishes almost surely when missForest is applied
repeatedly as an imputation scheme. As a consequence, uncertainty is underquantified during
imputation, leading to scenarios where imputations are not proper. Closely related to the issue
of valid statistical inference is the general topic of uncertainty quantification. Here, we focused
on the consistency of several residual variance estimators in regression models and delivered
theoretical guarantees that Random Forest based estimators are consistent. Besides
prediction, Random Forest is often used as a screening method for selecting informative features
in potentially high-dimensional settings. Focusing on regression problems, we gave
a formal proof that the Random Forest based internal permutation importance measure
delivers on average correct results, i.e. is (asymptotically) unbiased. Simulation studies
and real-life data examples from different fields support the findings of this thesis.
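The vanishing between-imputation variance mentioned above refers to Rubin's combining rules for multiple imputation. The following minimal sketch (not code from the thesis; `rubin_pool` is a hypothetical helper name) shows how a pooled total variance is formed from within- and between-imputation components, and how identical completed datasets collapse the between-imputation term to zero:

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool point estimates and their variances from m imputed datasets
    using Rubin's combining rules (illustrative sketch)."""
    m = len(estimates)
    q_bar = np.mean(estimates)           # pooled point estimate
    w_bar = np.mean(variances)           # within-imputation variance
    b = np.var(estimates, ddof=1)        # between-imputation variance
    t = w_bar + (1 + 1 / m) * b          # total variance of q_bar
    return q_bar, w_bar, b, t

# If an imputation scheme returns (nearly) identical completed datasets,
# the estimates agree across imputations, b collapses to zero, and the
# total variance understates the true uncertainty (imputation not proper).
q, w, b, t = rubin_pool([2.0, 2.0, 2.0], [0.5, 0.5, 0.5])
```

With varying estimates, e.g. `rubin_pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])`, the between-imputation variance contributes to the total and the pooled uncertainty grows accordingly.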
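The permutation importance measure studied in the last part can be sketched as follows: shuffle one feature at a time and record the resulting increase in prediction error. This is a simplified illustration of the idea (on a held-out set, for an arbitrary fitted regression model), not Breiman's exact out-of-bag implementation; the `LinearModel` class is a stand-in used only to make the example self-contained:

```python
import numpy as np

def permutation_importance(model, X, y, rng=None):
    """Increase in mean squared error after permuting each feature column.
    A feature whose permutation leaves the error unchanged is uninformative."""
    rng = np.random.default_rng(rng)
    base_mse = np.mean((y - model.predict(X)) ** 2)
    importances = []
    for j in range(X.shape[1]):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the j-th feature's link to y
        perm_mse = np.mean((y - model.predict(X_perm)) ** 2)
        importances.append(perm_mse - base_mse)
    return np.array(importances)

class LinearModel:
    """Toy fitted model standing in for a trained Random Forest."""
    def __init__(self, coef):
        self.coef = np.asarray(coef)
    def predict(self, X):
        return X @ self.coef
```

For a model that ignores a feature entirely, the corresponding importance is exactly zero, while permuting an informative feature yields a strictly positive value, matching the intuition behind the unbiasedness result.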
Keywords
Random Forest, Consistency, Statistical inference, Uncertainty quantification, Missing value imputation, Prediction intervals, Industrial application