Authors: Bommert, Andrea Martina
Title: Integration of feature selection stability in model fitting
Language (ISO): en
Abstract: In this thesis, four aspects connected to feature selection are analyzed: Firstly, a benchmark of filter methods for feature selection is conducted. Secondly, measures for the assessment of feature selection stability are compared both theoretically and empirically. Some of the stability measures are newly defined. Thirdly, a multi-criteria approach for obtaining desirable models with respect to predictive accuracy, feature selection stability, and sparsity is proposed and evaluated. Fourthly, an approach for finding desirable models for data sets with many similar features is suggested and evaluated. For the benchmark, 20 filter methods are analyzed. First, the filter methods are compared with respect to the order in which they rank the features and with respect to their scaling behavior, identifying groups of similar filter methods. Next, the run time and the predictive accuracy achieved when the filter methods are combined with a predictive model are analyzed, resulting in recommendations on filter methods that work well on many data sets. To identify suitable measures for stability assessment, 20 stability measures are compared based on both their theoretical properties and their empirical behavior. Five of the measures are newly proposed by us. Groups of stability measures that consider the same feature sets as stable or unstable are identified, and the impact of the number of selected features on the stability values is studied. Additionally, the run times for calculating the stability measures are analyzed. Based on all analyses, recommendations are made on which stability measures should be used in future analyses. When searching for a good predictive model, predictive accuracy is usually the only criterion considered in the model finding process. In this thesis, the benefits of additionally considering the feature selection stability and the number of selected features are investigated.
To find desirable configurations with respect to all three performance criteria, the hyperparameter tuning is performed in a multi-criteria fashion. This way, it is possible to find configurations that perform a more stable selection of fewer features without losing much predictive accuracy, compared to model fitting that considers only the predictive performance. Also, multi-criteria tuning yields models that overfit the training data less than models obtained with single-criteria tuning with respect to predictive accuracy alone. For data sets with many similar features, we propose employing L0-regularized regression and tuning its hyperparameter in a multi-criteria fashion with respect to both predictive accuracy and feature selection stability. We suggest assessing the stability with an adjusted stability measure, that is, a stability measure that takes into account similarities between features. The approach is evaluated on both simulated and real data sets. On simulated data, the proposed approach achieves the same or better predictive performance compared to established approaches. In contrast to the competing approaches, it succeeds in selecting the relevant features while avoiding irrelevant or redundant features. On real data, the proposed approach is beneficial for fitting models with fewer features without losing predictive accuracy.
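The stability measures discussed in the abstract quantify how much the sets of selected features vary across repeated selections (e.g., over resampling iterations). As a minimal illustration of this family of set-overlap measures — the classic average pairwise Jaccard index, not one of the measures newly proposed in the thesis — a sketch in Python:

```python
from itertools import combinations

def jaccard_stability(feature_sets):
    """Mean pairwise Jaccard similarity of the selected feature sets.

    feature_sets: list of sets, one set of selected features per
    resampling iteration. Returns a value in [0, 1]; 1 means the
    same features were selected every time.
    """
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        raise ValueError("need at least two feature sets")

    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 1.0

    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Identical selections are maximally stable:
print(jaccard_stability([{1, 2, 3}, {1, 2, 3}]))  # 1.0
# Partially overlapping selections score lower:
print(jaccard_stability([{1, 2, 3}, {2, 3, 4}]))  # 0.5
```

An adjusted measure, as suggested for data sets with many similar features, would additionally credit the exchange of one feature for a highly correlated one; that refinement is not shown here.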
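Multi-criteria tuning, as described above, searches for hyperparameter configurations that are not dominated with respect to predictive accuracy, feature selection stability, and sparsity simultaneously. A minimal sketch of such a Pareto filter — the configuration records and their values are purely illustrative, not results from the thesis:

```python
def pareto_front(configs):
    """Return the non-dominated configurations.

    configs: list of dicts with keys 'accuracy' and 'stability'
    (higher is better) and 'n_features' (lower is better).
    """
    def dominates(a, b):
        at_least_as_good = (a['accuracy'] >= b['accuracy']
                            and a['stability'] >= b['stability']
                            and a['n_features'] <= b['n_features'])
        strictly_better = (a['accuracy'] > b['accuracy']
                           or a['stability'] > b['stability']
                           or a['n_features'] < b['n_features'])
        return at_least_as_good and strictly_better

    return [c for c in configs
            if not any(dominates(other, c) for other in configs)]

# Hypothetical tuning results: the third configuration is worse on
# all three criteria than the first, so it is dominated.
configs = [
    {'accuracy': 0.90, 'stability': 0.80, 'n_features': 10},
    {'accuracy': 0.91, 'stability': 0.60, 'n_features': 25},
    {'accuracy': 0.85, 'stability': 0.70, 'n_features': 30},
]
front = pareto_front(configs)  # keeps the first two configurations
```

A decision maker then picks a configuration from the front, e.g. one trading a small accuracy loss for a much more stable, sparser selection.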
Subject Headings: Feature selection stability
Stability measures
Feature selection
Classification
High-dimensional data
Subject Headings (RSWK): Merkmalsextraktion
Klassifikation
Hochdimensionale Daten
URI: http://hdl.handle.net/2003/40023
http://dx.doi.org/10.17877/DE290R-21906
Issue Date: 2020
Appears in Collections: Statistische Methoden in der Genetik und Chemometrie

Files in This Item:
File: Dissertation_Bommert.pdf
Description: DNB
Size: 5.4 MB
Format: Adobe PDF


This item is protected by original copyright