Authors: Ramosaj, Burim
Title: Analyzing consistency and statistical inference in Random Forest models
Language (ISO): en
Abstract: This thesis pays special attention to the Random Forest method as an ensemble learning technique based on bagging and feature sub-spacing, covering three main aspects: its behavior as a prediction tool in the presence of missing values, its role in uncertainty quantification, and its use for variable screening. In the first part, we focus on the performance of Random Forest models in prediction and missing value imputation, comparing them to other learning methods such as boosting procedures. We aim to identify modifications of Breiman's original Random Forest that increase the imputation performance of Random Forest based models, using the normalized root mean squared error (NRMSE) and the proportion of false classification (PFC) as evaluation measures. Our results favored a hybrid approach combining stochastic gradient boosting and a Random Forest based on kernel sampling. Regarding inferential statistics after imputation, we investigated whether Random Forest based imputation delivers valid statistical inference, especially in repeated measures ANOVA. Our results indicated a severe inflation of type-I error rates when testing the null hypothesis of no mean time effect. We furthermore showed that the between-imputation variance in Rubin's multiple imputation rule vanishes almost surely when missForest is applied repeatedly as the imputation scheme. As a consequence, the uncertainty introduced by the imputation is understated and the imputations are not proper. Closely related to valid statistical inference is the general topic of uncertainty quantification. Here we focused on consistency properties of several residual variance estimators in regression models and provided theoretical guarantees that Random Forest based estimators are consistent. Besides prediction, Random Forest is often used as a screening method for selecting informative features in potentially high-dimensional settings. Focusing on regression problems, we gave a formal proof that the Random Forest based internal permutation importance measure delivers correct results on average, i.e. it is (asymptotically) unbiased. Simulation studies and real-life data examples from different fields support the findings of this thesis.
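For reference, the evaluation measures and the pooling rule named above can be written in standard, generic notation; this is a sketch using commonly used definitions (e.g. NRMSE and PFC as in the missForest literature) and may differ in detail from the exact formulation in the thesis:

% Generic definitions (assumed notation; not copied from the thesis):
% NRMSE over imputed continuous entries, PFC over imputed categorical entries,
% and Rubin's pooling rule for M multiply imputed data sets.
\[
\mathrm{NRMSE} = \sqrt{\frac{\operatorname{mean}\!\left\{ \bigl(X^{\mathrm{true}} - X^{\mathrm{imp}}\bigr)^2 \right\}}{\operatorname{var}\!\left\{ X^{\mathrm{true}} \right\}}},
\qquad
\mathrm{PFC} = \frac{\#\{\text{falsely classified imputed entries}\}}{\#\{\text{imputed categorical entries}\}},
\]
\[
T = \bar{W} + \left(1 + \frac{1}{M}\right) B,
\qquad
\bar{W} = \frac{1}{M}\sum_{m=1}^{M} \widehat{W}_m,
\qquad
B = \frac{1}{M-1}\sum_{m=1}^{M} \bigl(\hat{\theta}_m - \bar{\theta}\bigr)^2 .
\]

In this notation, the result stated above says that repeated application of missForest drives the between-imputation variance B toward zero almost surely, so the pooled variance T collapses to the within-imputation part and the imputation-induced uncertainty is understated.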
Subject Headings: Random Forest
Consistency
Statistical inference
Uncertainty quantification
Missing value imputation
Prediction intervals
Industrial application
Subject Headings (RSWK): Partielle Information
Automatische Klassifikation
Regressionsmodell
Random Forest
URI: http://hdl.handle.net/2003/39552
http://dx.doi.org/10.17877/DE290R-21444
Issue Date: 2020
Appears in Collections: Institut für Mathematische Statistik und industrielle Anwendungen

Files in This Item:
File: Dissertation_BurimRamosaj.pdf
Description: DNB
Size: 5.89 MB
Format: Adobe PDF


This item is protected by original copyright