Comparing simulation strategies and quantifying similarity of datasets

Date

2025

Abstract

Simulation studies are an essential tool for comparing and evaluating new and existing statistical methods. Generating realistic data is considered crucial for the reliability of the simulation results. There are various types of simulation studies that differ in the way in which data is generated. In parametric simulation studies, the data is generated using pseudo-random numbers according to a fully user-specified data-generating mechanism. The complete specification of the data-generating mechanism, i.e., of the data-generating process (DGP) for the covariates and the outcome-generating model (OGM) for generating observations of a target variable based on the generated covariates, might, however, result in oversimplification of complex real-world processes. An alternative to parametric simulation that is often claimed to produce more realistic data is statistical Plasmode simulation. For this, covariate data is generated by resampling from a real-world dataset. Observations of a target variable are then obtained by applying a user-specified OGM to that resampled data. The claim that Plasmode simulation leads to more realistic data and therefore better simulation results is, however, not supported by any empirical or theoretical results. Therefore, this thesis presents the first empirical comparison of parametric and Plasmode simulation studies. The estimation of the mean squared error (MSE) of the least squares (LS) estimator in linear regression, as well as the comparison of several binary classification methods, are considered as examples. In the context of comparing different simulation strategies, the similarity of the simulated datasets to a real-world dataset is of interest. Several methods for quantifying the similarity of two or more multivariate datasets have been proposed in the literature. Yet, there is no guidance available on which method to use in which situation.
Therefore, the remainder of the thesis is concerned with comparing methods for quantifying dataset similarity. First, a taxonomy of such methods based on their main ideas is provided together with a comparison based on 22 newly developed theoretical criteria for the applicability, interpretability, and theoretical properties of the methods. These can guide the choice of a suitable method for a given dataset comparison. To facilitate the choice in practice, an online tool is provided which allows for custom filtering of the theoretical criteria and sorting of the methods. To enable an empirical method comparison, an R package is provided that includes the most relevant dataset similarity methods implemented in a unified framework. Finally, a neutral comparison study of dataset similarity methods for categorical data is performed to provide insight into the performance of such methods in practice.
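
The difference between parametric and Plasmode simulation described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The linear OGM with Gaussian noise, the coefficient values, the sample sizes, and all function names are assumptions chosen for illustration, and the "real-world" dataset is a synthetic stand-in so that the example runs on its own.

```python
import numpy as np

def ogm(X, beta, sigma, rng):
    """User-specified outcome-generating model (assumed: linear with Gaussian noise)."""
    return X @ beta + rng.normal(0.0, sigma, size=X.shape[0])

def ls_estimate(X, y):
    """Least squares estimate of beta (no intercept, for brevity)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def parametric_covariates(n, p, rng):
    """Parametric DGP: covariates drawn from a fully specified distribution."""
    return rng.normal(0.0, 1.0, size=(n, p))

def plasmode_covariates(real_X, n, rng):
    """Plasmode DGP: covariate rows resampled with replacement from real data."""
    idx = rng.integers(0, real_X.shape[0], size=n)
    return real_X[idx]

def estimate_mse(draw_X, beta, sigma, n_rep, rng):
    """Monte Carlo estimate of the MSE of the LS estimator."""
    sq_errors = []
    for _ in range(n_rep):
        X = draw_X(rng)
        y = ogm(X, beta, sigma, rng)
        sq_errors.append(np.sum((ls_estimate(X, y) - beta) ** 2))
    return float(np.mean(sq_errors))

rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0, 0.5])  # assumed true coefficients

# Hypothetical stand-in for a real-world dataset; a real study would load actual data.
real_X = rng.standard_t(df=3, size=(500, 3))

mse_parametric = estimate_mse(
    lambda r: parametric_covariates(100, 3, r), beta, 1.0, 200, rng
)
mse_plasmode = estimate_mse(
    lambda r: plasmode_covariates(real_X, 100, r), beta, 1.0, 200, rng
)
print(mse_parametric, mse_plasmode)
```

Both strategies share the same OGM and differ only in how the covariates are obtained, which is the contrast the abstract draws.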

Subjects based on RSWK

Simulation, Data collection, Resampling, Experimental design, Statistics
