Institut für Mathematische Statistik und industrielle Anwendungen

Recent Submissions

Now showing 1 - 20 of 38
  • Item
    Testing hypotheses about correlation matrices in general MANOVA designs
    (2023-12-12) Sattler, Paavo; Pauly, Markus
    Correlation matrices are an essential tool for investigating the dependency structure of random vectors or for comparing them. We introduce an approach for testing a variety of null hypotheses that can be formulated in terms of the correlation matrix. Examples cover the MANOVA-type hypothesis of equal correlation matrices as well as tests for special correlation structures such as sphericity. Apart from the existence of fourth moments, our approach requires no further assumptions, allowing applications in various settings. To improve the small-sample performance, a bootstrap technique is proposed and theoretically justified. Based on this, we also present a procedure to simultaneously test the hypotheses of equal correlation and equal covariance matrices. The performance of all new test statistics is compared with existing procedures through extensive simulations.
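    As a rough illustration of what testing hypotheses about correlation matrices involves, the following sketch compares two independent groups with a naive Wald-type statistic: the upper triangle of each sample correlation matrix is vectorized, the covariance of the difference is estimated by a group-wise nonparametric bootstrap, and the statistic is referred to a chi-square limit. This is only a generic sketch assuming finite fourth moments; the function names (upper_tri_corr, wald_corr_equality) are illustrative and the procedure is not the ATS/bootstrap method proposed in the paper.
      import numpy as np
      from scipy import stats

      def upper_tri_corr(x):
          """Vectorize the upper triangle of the sample correlation matrix."""
          r = np.corrcoef(x, rowvar=False)
          return r[np.triu_indices_from(r, k=1)]

      def wald_corr_equality(x1, x2, n_boot=2000, seed=1):
          """Naive Wald-type test of H0: R1 = R2 for two independent samples."""
          rng = np.random.default_rng(seed)
          d = upper_tri_corr(x1) - upper_tri_corr(x2)
          boots = np.empty((n_boot, d.size))
          for b in range(n_boot):
              b1 = x1[rng.integers(0, len(x1), len(x1))]   # resample group 1
              b2 = x2[rng.integers(0, len(x2), len(x2))]   # resample group 2
              boots[b] = upper_tri_corr(b1) - upper_tri_corr(b2)
          v = np.cov(boots, rowvar=False)                  # bootstrap covariance of the difference
          stat = d @ np.linalg.solve(v, d)
          return stat, stats.chi2.sf(stat, df=d.size)

      rng = np.random.default_rng(0)
      x1 = rng.multivariate_normal([0, 0, 0], np.eye(3), size=80)    # same correlation structure,
      x2 = rng.multivariate_normal([1, 1, 1], np.eye(3), size=100)   # different means and sizes
      print(wald_corr_equality(x1, x2))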
  • Item
    Statistical analyses of tree-based ensembles
    (2023) Schmid, Lena; Pauly, Markus; Groll, Andreas
    This thesis focuses on the study of tree-based ensemble learners, with particular attention to their behavior as a prediction tool for multivariate or time-dependent outcomes and their implementation for efficient execution. In particular, well-known examples such as Random Forest and Extra Trees are often used for the prediction of univariate outcomes. For multivariate outcomes, however, the question arises whether it is better to fit univariate models separately or to follow a multivariate approach directly. Our results show that the advantages of the multivariate approach can be observed in scenarios with a high degree of dependency between the components of the outcome. In particular, significant differences in the performance of the different Random Forest approaches are observed. In terms of predictive performance for time series, we are interested in whether the use of tree-based methods can offer advantages over traditional time series methods such as ARIMA, particularly in the area of data-driven logistics, where the abundance of complex and noisy data - from supply chain transactions to customer interactions - requires accurate and timely insights. Our results indicate the effectiveness of machine learning methods, especially in scenarios where the data generation processes exhibit additional layers of complexity. Motivated by the trend towards increasingly autonomous and decentralized processes on resource-constrained devices in logistics, we explore strategies to optimize the execution time of machine learning algorithms for inference, focusing on Random Forests and decision trees. In addition to the simple approach of enforcing shorter paths through decision trees, we also investigate hardware-oriented implementations. One optimization is to adapt the memory layout to prefer paths with higher probability, which is particularly beneficial in cases with uneven splits within tree nodes. We present a regularization method that reduces path lengths by rewarding uneven probability distributions during decision tree training. This method proves to be particularly valuable for a memory-architecture-aware implementation, resulting in a substantial reduction in execution time with minimal degradation in accuracy, especially for large datasets or binary classification tasks. Simulation studies and real-life data examples from different fields support our findings in this thesis.
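    A minimal sketch of the univariate-versus-multivariate comparison discussed above, using scikit-learn's RandomForestRegressor (which accepts multi-output targets directly); the data-generating process and all settings are illustrative and not taken from the thesis.
      import numpy as np
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.metrics import mean_squared_error
      from sklearn.model_selection import train_test_split

      # bivariate outcome whose components share the same signal,
      # i.e. a setting with strongly dependent response components
      rng = np.random.default_rng(42)
      X = rng.uniform(-1, 1, size=(1000, 5))
      signal = np.sin(3 * X[:, 0]) + X[:, 1] ** 2
      Y = np.column_stack([signal + 0.1 * rng.normal(size=1000),
                           2 * signal + 0.1 * rng.normal(size=1000)])
      X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

      # multivariate approach: one forest fitted jointly to both components
      joint = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, Y_tr)

      # univariate approach: one forest per component
      separate = [RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, Y_tr[:, j])
                  for j in range(Y.shape[1])]
      pred_sep = np.column_stack([m.predict(X_te) for m in separate])

      print("joint    MSE:", mean_squared_error(Y_te, joint.predict(X_te)))
      print("separate MSE:", mean_squared_error(Y_te, pred_sep))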
  • Item
    Testing marginal homogeneity in Hilbert spaces with applications to stock market returns
    (2022-02-14) Ditzhaus, Marc; Gaigall, Daniel
    This paper considers a paired data framework and discusses the question of marginal homogeneity of bivariate high-dimensional or functional data. The related testing problem can be embedded into a more general setting for paired random variables taking values in a general Hilbert space. To address this problem, a Cramér–von-Mises type test statistic is applied, and a bootstrap procedure is suggested to obtain critical values and, finally, a consistent test. The desired properties of a bootstrap test, namely asymptotic exactness under the null hypothesis and consistency under alternatives, are derived. Simulations show the quality of the test in the finite sample case. A possible application is the comparison of two possibly dependent stock market returns based on functional data. The approach is demonstrated based on historical data for different stock market indices.
  • Item
    CASANOVA: permutation inference in factorial survival designs
    (2021-10-05) Ditzhaus, Marc; Genuneit, Jon; Janssen, Arnold; Pauly, Markus
    We propose inference procedures for general factorial designs with time-to-event endpoints. Similar to additive Aalen models, null hypotheses are formulated in terms of cumulative hazards. Deviations are measured in terms of quadratic forms in Nelson–Aalen-type integrals. Different from existing approaches, this allows us to work without restrictive model assumptions such as proportional hazards. In particular, crossing survival or hazard curves can be detected without a significant loss of power. For a distribution-free application of the method, a permutation strategy is suggested. The resulting procedures' asymptotic validity is proven and small sample performances are analyzed in extensive simulations. The analysis of a data set on asthma illustrates the applicability.
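    The procedure above builds on Nelson-Aalen-type integrals; the sketch below only computes the Nelson-Aalen cumulative hazard estimate for one sample of right-censored data (the quadratic forms and the permutation strategy are not reproduced). The function name and toy data are illustrative.
      import numpy as np

      def nelson_aalen(time, event):
          """Nelson-Aalen estimate of the cumulative hazard for right-censored data.
          time: observed times (event or censoring); event: 1 = event, 0 = censored."""
          time, event = np.asarray(time, float), np.asarray(event, int)
          t_events = np.unique(time[event == 1])
          increments = []
          for t in t_events:
              d = np.sum((time == t) & (event == 1))   # events at t
              n = np.sum(time >= t)                    # number at risk just before t
              increments.append(d / n)
          return t_events, np.cumsum(increments)

      t, na = nelson_aalen([2, 3, 3, 5, 7, 8], [1, 1, 0, 1, 0, 1])
      print(dict(zip(t, np.round(na, 3))))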
  • Item
    Which test for crossing survival curves? A user’s guideline
    (2022-01-30) Dormuth, Ina; Liu, Tiantian; Xu, Jin; Yu, Menggang; Pauly, Markus; Ditzhaus, Marc
    Background: The exchange of knowledge between statisticians developing new methodology and the clinicians, reviewers or authors applying them is fundamental. This is specifically true for clinical trials with time-to-event endpoints. One of the most commonly arising questions is that of equal survival distributions in a two-armed trial. The log-rank test is still the gold standard for addressing this question. However, in the case of non-proportional hazards, its power can become poor and multiple extensions have been developed to overcome this issue. We aim to facilitate the choice of a test for the detection of survival differences in the case of crossing hazards. Methods: We restricted the review to the most recent two-armed clinical oncology trials with crossing survival curves. Each data set was reconstructed using a state-of-the-art reconstruction algorithm. To ensure reproduction quality, only publications with published numbers at risk at multiple time points, sufficient printing quality, and a non-informative censoring pattern were included. This article reports the p-values of the log-rank and Peto-Peto tests as references and compares them with nine different tests developed for the detection of survival differences in the presence of non-proportional or crossing hazards. Results: We reviewed 1400 recent phase III clinical oncology trials and selected fifteen studies that met our eligibility criteria for data reconstruction. After including three further individual patient data sets, significant differences in survival were found for nine out of eighteen studies using the investigated tests. An important point that reviewers should pay attention to is that 28% of the studies with published survival curves did not report the number at risk. This makes reconstruction and plausibility checks almost impossible. Conclusions: The evaluation shows that inference methods constructed to detect differences in survival in the presence of non-proportional hazards are beneficial and help to provide guidance in choosing a sensible alternative to the standard log-rank test.
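    For reference, the standard two-sample log-rank test discussed above can be computed as in the following sketch (observed minus expected events at each event time, with the usual hypergeometric variance). The nine alternative tests compared in the article are not reproduced; the function name and toy data are illustrative.
      import numpy as np
      from scipy import stats

      def logrank_test(time, event, group):
          """Standard two-sample log-rank test for right-censored data."""
          time, event, group = map(np.asarray, (time, event, group))
          obs_minus_exp, var = 0.0, 0.0
          for t in np.unique(time[event == 1]):
              at_risk = time >= t
              n = at_risk.sum()                        # total number at risk
              n1 = (at_risk & (group == 1)).sum()      # at risk in group 1
              d = ((time == t) & (event == 1)).sum()   # events at t
              d1 = ((time == t) & (event == 1) & (group == 1)).sum()
              obs_minus_exp += d1 - d * n1 / n
              if n > 1:
                  var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
          chi2 = obs_minus_exp ** 2 / var
          return chi2, stats.chi2.sf(chi2, df=1)

      time  = [6, 7, 10, 15, 19, 25, 6, 9, 11, 12, 13, 21]
      event = [1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
      group = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
      print(logrank_test(time, event, group))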
  • Item
    Robust covariance estimation in mixed-effects meta-regression models
    (2022) Welz, Thilo; Pauly, Markus; Knapp, Guido
    In this PhD thesis we consider robust (sandwich) variance-covariance matrix estimators in the context of univariate and multivariate meta-analysis and meta-regression. The underlying model is the classical mixed-effects meta-regression model. Our goal is to enable valid statistical inference for the model coefficients. Specifically, we employ heteroscedasticity consistent (HC) and cluster-robust (CR) sandwich estimators in the univariate and multivariate setting. A key aim is to provide better small sample solutions for meta-analytic research and application. Tests based on the original formulations of these estimators are known to produce highly liberal results, especially when the number of studies is small. We therefore transfer results for improved sandwich estimation by Cribari-Neto and Zarkos (2004) to the meta-analytic context. We prove the asymptotic equivalence of HC estimators and compare them with commonly suggested techniques such as the Knapp-Hartung (KH) method or standard plug-in covariance matrix estimation in extensive simulation studies. The new versions of HC estimators considerably outperform their older counterparts, especially in small samples, achieving results comparable to the KH method. In a short excursion, we focus on constructing confidence regions for (Pearson) correlation coefficients as the main effect of interest in a random-effects meta-analysis. We develop a beta-distribution model for generating data in our simulations in addition to the commonly used truncated normal distribution model. We utilize different variance estimation approaches such as HC estimators, the KH method and a wild bootstrap approach, in combination with the Fisher-z transformation and an integral z-to-r back-transformation, to construct confidence regions. In simulation studies, our novel proposals improve coverage over the Hedges-Olkin-Vevea-z and Hunter-Schmidt approaches, enabling reliable inference for a greater range of true correlations. Finally, we extend our results for the HC estimators to construct CR sandwich estimators for multivariate meta-regression. The aim is to achieve valid inference for the model coefficients, based on Wald-type statistics, even in small samples. Our simulations show that previously suggested CR estimators, such as the bias-reduced linearization approach, can have unsatisfactory small sample performance for bivariate meta-regression. Furthermore, they show that the Hotelling's T^2-test suggested by Tipton and Pustejovsky (2015) can yield negative estimates for the degrees of freedom when the number of studies is small. We suggest an adjustment to the classical F-test, truncating the denominator degrees of freedom at two. Our CR extensions, using only the diagonal elements of the hat matrix to adjust residuals, improve coverage considerably in small samples. We focus on the bivariate case in our simulations, but the discussed approaches can also be applied more generally. We analyze both small and large sample behavior of all considered tests and confidence regions in extensive simulation studies. Furthermore, we apply the discussed approaches to real-life datasets from psychometric and medical research.
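    The following sketch shows a plain heteroscedasticity consistent (HC0/HC3-type) sandwich covariance for the weighted least squares fit of a mixed-effects meta-regression, given an estimate of the between-study variance. It is a generic textbook version, not the improved estimators of Cribari-Neto and Zarkos (2004) or the CR extensions developed in the thesis; names and the toy data are illustrative.
      import numpy as np

      def hc_sandwich_meta(y, X, v, tau2, kind="HC3"):
          """HC-type sandwich covariance for the WLS coefficients of a
          mixed-effects meta-regression (y: effect estimates, X: moderators,
          v: within-study variances, tau2: between-study variance)."""
          W = np.diag(1.0 / (v + tau2))                 # inverse-variance weights
          bread = np.linalg.inv(X.T @ W @ X)
          beta = bread @ X.T @ W @ y
          e = y - X @ beta                              # residuals
          h = np.diag(X @ bread @ X.T @ W)              # leverages of the weighted fit
          omega = e ** 2 if kind == "HC0" else e ** 2 / (1.0 - h) ** 2
          meat = X.T @ W @ np.diag(omega) @ W @ X
          return beta, bread @ meat @ bread

      rng = np.random.default_rng(3)
      k = 8                                             # number of studies
      x = rng.uniform(0, 1, k)
      X = np.column_stack([np.ones(k), x])              # intercept plus one moderator
      v = rng.uniform(0.01, 0.05, k)
      y = 0.3 + 0.5 * x + rng.normal(0, np.sqrt(v + 0.02))
      beta, cov = hc_sandwich_meta(y, X, v, tau2=0.02)
      print("beta:", beta, "robust SEs:", np.sqrt(np.diag(cov)))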
  • Item
    Nonparametric correlation-based methods with biomedical applications
    (2022) Nowak, Claus P.; Pauly, Markus; Schorning, Kirsten
    This cumulative dissertation consists of three manuscripts on nonparametric methodology, i.e., Simultaneous inference for Kendall's tau, Group sequential methods for the Mann-Whitney parameter, and The nonparametric Behrens-Fisher problem in small samples. The manuscript on Kendall's τ fully develops a nonparametric estimation theory for multiple rank correlation coefficients in terms of Kendall's τA and τB, Somers' D, as well as Kruskal and Goodman's γ, necessitating joint estimation of both the probabilities of ties occurring and the probability of concordance minus discordance. As for the second manuscript, I review and further develop group sequential methodology for the Mann-Whitney parameter. With the aid of data from a clinical trial in patients with relapsing-remitting multiple sclerosis, I demonstrate how one could repeatedly estimate the Mann-Whitney parameter during an ongoing trial together with repeated confidence intervals obtained by test inversion. In addition, I give simple approximate power formulas for this group sequential setting. The last manuscript further explores how best to approximate the sampling distribution of the Mann-Whitney parameter in the nonparametric Behrens-Fisher problem, an issue that arose from the preceding manuscript. In that regard, I explore different variance estimators and a permutation approach that have been proposed in the literature, and also examine some slightly modified small-sample t approximations. In all three manuscripts, I carried out simulations for various settings to assess the adequacy of the proposed methods.
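    A small illustration of two of the estimands discussed above: Kendall's tau-a and tau-b from pairwise concordance counts, and the Mann-Whitney parameter (relative effect) p = P(X < Y) + 0.5 P(X = Y). The simultaneous inference, the group sequential methodology and the variance estimators of the manuscripts are not covered; function names and data are illustrative.
      import numpy as np

      def kendall_taus(x, y):
          """Kendall's tau-a and tau-b via pairwise concordance counts."""
          x, y = np.asarray(x), np.asarray(y)
          n = len(x)
          conc = disc = tie_x = tie_y = 0
          for i in range(n):
              for j in range(i + 1, n):
                  dx, dy = np.sign(x[i] - x[j]), np.sign(y[i] - y[j])
                  if dx == 0 and dy == 0:      # tied in both coordinates
                      continue
                  elif dx == 0:
                      tie_x += 1
                  elif dy == 0:
                      tie_y += 1
                  elif dx == dy:
                      conc += 1
                  else:
                      disc += 1
          tau_a = (conc - disc) / (n * (n - 1) / 2)
          tau_b = (conc - disc) / np.sqrt((conc + disc + tie_x) * (conc + disc + tie_y))
          return tau_a, tau_b

      def mann_whitney_effect(x, y):
          """Relative effect p = P(X < Y) + 0.5 * P(X = Y) for two independent samples."""
          x, y = np.asarray(x), np.asarray(y)
          return (x[:, None] < y[None, :]).mean() + 0.5 * (x[:, None] == y[None, :]).mean()

      rng = np.random.default_rng(7)
      u = rng.normal(size=30)
      w = 0.5 * u + rng.normal(size=30)        # paired, positively dependent
      print(kendall_taus(u, w))
      print(mann_whitney_effect(rng.normal(size=30), rng.normal(0.5, 1, size=40)))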
  • Item
    Resampling-based inference methods for repeated measures data with missing values
    (2022) Amro, Lubna; Pauly, Markus; Ickstadt, Katja
    The primary objective of this dissertation was to (i) develop novel resampling approaches for handling repeated measures data with missing values, (ii) compare their empirical power against other existing approaches using a Monte Carlo simulation study, and (iii) pinpoint the limitations of some common approaches, particularly for small sample sizes. This dissertation investigates four different statistical problems. The first is semiparametric inference for comparing means of matched pairs with missing data in both arms. Therein, we propose two novel randomization techniques: a weighted combination test and a multiplication combination test. They are based upon combining separate results of the permutation versions of the paired t-test and the Welch test for the completely observed pairs and the incompletely observed components, respectively. As a second problem, we consider the same setting but with missingness in one arm only. There, we investigate a Wald-type statistic (WTS), an ANOVA-type statistic (ATS), and a modified ANOVA-type statistic (MATS). However, the ATS and MATS are not distribution-free under the null hypothesis, and the WTS suffers from slow convergence to its limiting chi-square distribution. Thus, we develop asymptotic model-based bootstrap versions of these tests. The third problem is nonparametric rank-based inference for matched pairs with incompleteness in both arms. In this more general setup, the only requirement is that the marginal distributions are not one-point distributions. Therein, we propose novel multiplication combination tests that can handle three different testing problems, including the nonparametric Behrens-Fisher problem (H0: p = 1/2). Finally, the fourth problem is nonparametric rank-based inference for incompletely observed factorial designs with repeated measures. Therein, we develop a wild bootstrap approach combined with quadratic form-type test statistics (WTS, ATS, and MATS). These rank-based methods can be applied to both continuous and ordinal or ordered categorical data, and (some) allow for singular covariance matrices. In addition to theoretically proving the asymptotic correctness of all the proposed procedures, extensive simulation studies demonstrate their favorable small sample properties in comparison to classical parametric tests. We also motivate and validate our approaches using real-life data examples from a variety of fields.
  • Item
    Fisher transformation based confidence intervals of correlations in fixed- and random-effects meta-analysis
    (2021-05-02) Welz, Thilo; Doebler, Philipp; Pauly, Markus
    Meta-analyses of correlation coefficients are an important technique to integrate results from many cross-sectional and longitudinal research designs. Uncertainty in pooled estimates is typically assessed with the help of confidence intervals, which can double as hypothesis tests for two-sided hypotheses about the underlying correlation. A standard approach to construct confidence intervals for the main effect is the Hedges-Olkin-Vevea Fisher-z (HOVz) approach, which is based on the Fisher-z transformation. Results from previous studies (Field, 2005, Psychol. Meth., 10, 444; Hafdahl and Williams, 2009, Psychol. Meth., 14, 24), however, indicate that in random-effects models the performance of the HOVz confidence interval can be unsatisfactory. We therefore propose improvements of the HOVz approach, which are based on enhanced variance estimators for the main effect estimate. In order to study the coverage of the new confidence intervals in both fixed- and random-effects meta-analysis models, we perform an extensive simulation study, comparing them to established approaches. Data were generated via a truncated normal and beta distribution model. The results show that our newly proposed confidence intervals based on a Knapp-Hartung-type variance estimator or robust heteroscedasticity consistent sandwich estimators in combination with the integral z-to-r transformation (Hafdahl, 2009, Br. J. Math. Stat. Psychol., 62, 233) provide more accurate coverage than existing approaches in most scenarios, especially in the more appropriate beta distribution simulation model.
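    A minimal sketch of the classical HOVz construction that the paper takes as its starting point: correlations are moved to the Fisher-z scale, pooled with a DerSimonian-Laird random-effects weighting, and the interval is back-transformed with tanh. The paper's improvements (Knapp-Hartung-type and HC sandwich variance estimators, integral z-to-r back-transformation) are not implemented; the numbers below are made up.
      import numpy as np
      from scipy import stats

      def hovz_ci(r, n, alpha=0.05):
          """Random-effects meta-analysis of correlations on the Fisher-z scale
          with a DerSimonian-Laird between-study variance estimate."""
          r, n = np.asarray(r, float), np.asarray(n, float)
          z = np.arctanh(r)                # Fisher-z transform
          v = 1.0 / (n - 3.0)              # approximate within-study variance
          w = 1.0 / v
          z_fixed = np.sum(w * z) / np.sum(w)
          q = np.sum(w * (z - z_fixed) ** 2)
          c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
          tau2 = max(0.0, (q - (len(r) - 1)) / c)        # DerSimonian-Laird estimate
          w_re = 1.0 / (v + tau2)
          z_re = np.sum(w_re * z) / np.sum(w_re)
          se = np.sqrt(1.0 / np.sum(w_re))
          crit = stats.norm.ppf(1 - alpha / 2)
          return np.tanh(z_re), np.tanh([z_re - crit * se, z_re + crit * se])

      # toy example: five study correlations with their sample sizes
      print(hovz_ci(r=[0.42, 0.30, 0.55, 0.38, 0.47], n=[40, 65, 30, 120, 50]))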
  • Item
    Asymptotic-based bootstrap approach for matched pairs with missingness in a single arm
    (2021-07-08) Amro, Lubna; Pauly, Markus; Ramosaj, Burim
    Missing values are a frequently arising difficulty when dealing with paired data. Several test procedures have been developed in the literature to tackle this problem. Some of them are even robust under deviations and control the type-I error quite accurately. However, most of these methods are not applicable when missing values are present in only a single arm. For this case, we provide asymptotically correct resampling tests that are robust under heteroskedasticity and skewed distributions. The tests are based on a meaningful restructuring of all observed information in quadratic form-type test statistics. An extensive simulation study is conducted exemplifying the tests for finite sample sizes under different missingness mechanisms. In addition, illustrative data examples based on real life studies are analyzed.
  • Item
    Inference for multivariate and high-dimensional data in heterogeneous designs
    (2021) Sattler, Paavo Aljoscha Nanosch; Pauly, Markus; Doebler, Philipp
    In the presented cumulative thesis, we develop statistical tests to check different hypotheses for multivariate and high-dimensional data. A suitable way to obtain scalar test statistics for multivariate problems are quadratic forms. The most common are statistics of Wald-type (WTS) or ANOVA-type (ATS) as well as centered and standardized versions of them. [Pauly et al., 2015] and [Chen and Qin, 2010] also used such quadratic forms to analyze hypotheses regarding the expectation vector of high-dimensional observations. They worked under different assumptions, but both allowed only one or two groups, respectively. We expand the approach of [Pauly et al., 2015] to multiple groups, which leads to a multitude of possible asymptotic frameworks, even allowing the number of groups to grow. In the considered split-plot design with normally distributed data, we investigate the asymptotic distribution of the standardized centered quadratic form under different conditions. In most cases, we could show that the respective limit distribution is obtained only under the specific conditions. For the frequently assumed case of equal covariance matrices, we also widen the considered asymptotic frameworks, since the sample sizes of the individual groups do not necessarily have to grow. Moreover, we add further cases in which the limit distribution can be calculated. These hold under homoscedasticity of the covariance matrices but also in the general case. This expansion of the asymptotic frameworks is one example of how the assumption of homoscedastic covariance matrices allows broader conclusions. Moreover, assuming equal covariance matrices also simplifies calculations or enables us to use a larger statistical toolbox. For the more general problem of testing hypotheses regarding covariance matrices, existing procedures have strict assumptions (e.g. in [Muirhead, 1982], [Anderson, 1984] and [Gupta and Xu, 2006]), test only special hypotheses (e.g. in [Box, 1953]), or are known to have low power (e.g. in [Zhang and Boos, 1993]). We introduce an intuitive approach with fewer restrictions, a multitude of possible null hypotheses, and a convincing small sample approximation. Thereby, nearly every quadratic form known from the mean-based analysis can be used, and two bootstrap approaches are applied to improve their performance. Furthermore, the approach can be extended to many other situations, such as testing hypotheses about correlation matrices or checking whether the covariance matrix has a particular structure. In extensive simulation studies, we investigated the type-I error of all developed tests and their power to detect deviations from the null hypothesis, for small up to large sample sizes.
  • Item
    Gaussian Process models and global optimization with categorical variables
    (2021) Kirchhoff, Dominik; Kuhnt, Sonja; Rahnenführer, Jörg
    This thesis is concerned with Gaussian Process (GP) models for computer experiments with both numerical and categorical input variables. The Low-Rank Correlation kernel LRCr is introduced for the estimation of the cross-correlation matrix – i.e., the matrix that contains the correlations of the GP given different levels of a categorical variable. LRCr is a rank-r approximation of the real but unknown cross-correlation matrix and provides two advantages over existing parsimonious correlation kernels: First, it lets the practitioner adapt the number of parameters to be estimated to the problem at hand by choosing an appropriate rank r. And second, the entries of the estimated cross-correlation matrix are not restricted to non-negative values. Moreover, an approach is presented that can generate a test function with mixed inputs from a test function having only continuous variables. This is done by discretizing (or “slicing”) one of its dimensions. Depending on the function and the slice positions, the slices sometimes happen to be highly positively correlated. By turning some slices in a specific way, the position and value of the global optimum can be preserved while changing the sign of a number of cross-correlations. With these methods, a simulation study is conducted that investigates the estimation accuracy of the cross-correlation matrices as well as the prediction accuracy of the response surface among different correlation kernels. Thereby, the number of points in the initial design of experiments and the amount of negative cross-correlations are varied in order to compare their impact on different kernels. We then focus on GP models with mixed inputs in the context of the Efficient Global Optimization (EGO) algorithm. We conduct another simulation study in which the distances of the different kernels' best found solutions to the optimum are compared. Again, the number of points in the initial experimental design is varied. However, the total budget of function evaluations is fixed. The results show that a higher number of EGO iterations tends to be preferable over a larger initial experimental design. Finally, three applications are considered: First, an optimization of hyperparameters of a computer vision algorithm. Second, an optimization of a logistics production process using a simulation model. And third, a bi-objective optimization of shift planning in a simulated high-bay warehouse, where constraints on the input variables must be met. These applications involve further challenges, which are successfully solved.
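    The exact parameterization of the LRCr kernel is not reproduced here; the sketch below only illustrates the underlying idea under stated assumptions: a rank-r factor matrix yields a positive semi-definite cross-correlation matrix with unit diagonal whose off-diagonal entries may be negative, and such a matrix can be combined with an RBF kernel on the continuous inputs in a product form. Function names are illustrative.
      import numpy as np

      def low_rank_cross_correlation(L):
          """Build a valid cross-correlation matrix from an (m x r) factor matrix L.
          T = L @ L.T is positive semi-definite by construction; rescaling to unit
          diagonal yields a correlation-type matrix that may contain negative entries."""
          T = L @ L.T
          d = np.sqrt(np.diag(T))
          return T / np.outer(d, d)

      def mixed_input_kernel(x1, x2, c1, c2, lengthscale, C):
          """Toy product kernel: RBF on the continuous part times the
          cross-correlation entry for the categorical levels c1, c2."""
          rbf = np.exp(-0.5 * np.sum((x1 - x2) ** 2) / lengthscale ** 2)
          return rbf * C[c1, c2]

      # rank-2 factor for a categorical variable with 4 levels
      rng = np.random.default_rng(0)
      L = rng.normal(size=(4, 2))
      C = low_rank_cross_correlation(L)
      print(np.round(C, 2))
      print(mixed_input_kernel(np.array([0.1, 0.2]), np.array([0.3, 0.1]), 0, 3, 0.5, C))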
  • Item
    QANOVA: quantile-based permutation methods for general factorial designs
    (2021-02-24) Ditzhaus, Marc; Fried, Roland; Pauly, Markus
    Population means and standard deviations are the most common estimands to quantify effects in factorial layouts. In fact, most statistical procedures in such designs are built toward inferring means or contrasts thereof. For more robust analyses, we consider the population median, the interquartile range (IQR) and more general quantile combinations as estimands, for which we formulate null hypotheses and calculate compatible confidence regions. Based upon simultaneous multivariate central limit theorems and corresponding resampling results, we derive asymptotically correct procedures in general, potentially heteroscedastic, factorial designs with univariate endpoints. Special cases cover robust tests for the population median or the IQR in arbitrary crossed one-, two- and higher-way layouts with potentially heteroscedastic error distributions. In extensive simulations, we analyze their small sample properties and also conduct an illustrative data analysis comparing children's height and weight from different countries.
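    QANOVA itself relies on simultaneous central limit theorems and permutation/resampling methods that are not reproduced here; the sketch below only shows a naive Wald-type comparison of group medians with bootstrap variance estimates, to illustrate quantiles as estimands in a heteroscedastic one-way layout. Names, data and the chi-square reference distribution are illustrative assumptions.
      import numpy as np
      from scipy import stats

      def median_wald_test(samples, n_boot=2000, seed=0):
          """Naive Wald-type test of H0: all group medians are equal, with
          group-wise bootstrap estimates of the variances of the sample medians."""
          rng = np.random.default_rng(seed)
          k = len(samples)
          med = np.array([np.median(s) for s in samples])
          var = np.array([
              np.var([np.median(rng.choice(s, size=len(s), replace=True))
                      for _ in range(n_boot)], ddof=1)
              for s in samples
          ])
          d = med[:-1] - med[-1]                 # contrasts against the last group
          V = np.diag(var[:-1]) + var[-1]        # covariance matrix of the contrasts
          stat = d @ np.linalg.solve(V, d)
          return stat, stats.chi2.sf(stat, df=k - 1)

      rng = np.random.default_rng(1)
      groups = [rng.normal(0, 1, 40), rng.normal(0, 2, 50), rng.normal(0.8, 1, 45)]
      print(median_wald_test(groups))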
  • Item
    Analyzing consistency and statistical inference in Random Forest models
    (2020) Ramosaj, Burim; Pauly, Markus; Rahnenführer, Jörg
    This thesis pays special attention to the Random Forest method as an ensemble learning technique using bagging and feature sub-spacing, covering three main aspects: its behavior as a prediction tool in the presence of missing values, its role in uncertainty quantification, and variable screening. In the first part, we focus on the performance of Random Forest models in prediction and missing value imputation, comparing it with other learning methods such as boosting procedures. Therein, we aim to discover potential modifications of Breiman's original Random Forest in order to increase the imputation performance of Random Forest based models, using the normalized root mean squared error and the proportion of false classification as evaluation measures. Our results favored a mixed model involving stochastic gradient boosting and a Random Forest based on kernel sampling. Regarding inferential statistics after imputation, we were interested in whether Random Forest methods deliver valid statistical inference procedures, especially in repeated measures ANOVA. Our results indicated a heavy inflation of type-I-error rates for testing the hypothesis of no mean time effects. We could furthermore show that the between-imputation variance according to Rubin's multiple imputation rule vanishes almost surely when repeatedly applying missForest as an imputation scheme. As a consequence, too little uncertainty is reflected during imputation, leading to scenarios where imputations are not proper. Closely related to the issue of valid statistical inference is the general topic of uncertainty quantification. Therein, we focused on consistency properties of several residual variance estimators in regression models and could deliver theoretical guarantees that Random Forest based estimators are consistent. Besides prediction, Random Forest is often used as a screening method for selecting informative features in potentially high-dimensional settings. Focusing on regression problems, we could deliver a formal proof that the Random Forest based internal permutation importance measure delivers on average correct results, i.e. is (asymptotically) unbiased. Simulation studies and real-life data examples from different fields support our findings in this thesis.
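    The unbiasedness result mentioned above concerns the internal (out-of-bag) permutation importance of Breiman's Random Forest; the sketch below shows only the basic idea of permutation importance, computed in-sample with scikit-learn for simplicity, which is a simplification rather than the measure analyzed in the thesis.
      import numpy as np
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.metrics import mean_squared_error

      # regression problem with two informative and three pure-noise features
      rng = np.random.default_rng(0)
      X = rng.normal(size=(500, 5))
      y = 2 * X[:, 0] + np.sin(2 * X[:, 1]) + 0.3 * rng.normal(size=500)

      rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
      baseline = mean_squared_error(y, rf.predict(X))

      importances = []
      for j in range(X.shape[1]):
          X_perm = X.copy()
          X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the link between feature j and y
          importances.append(mean_squared_error(y, rf.predict(X_perm)) - baseline)

      print(np.round(importances, 3))   # importances of the noise features should be near zero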
  • Item
    A simulation study to compare robust tests for linear mixed-effects meta-regression
    (2020-01-12) Welz, Thilo; Pauly, Markus
    The explanation of heterogeneity when synthesizing different studies is an important issue in meta‐analysis. Besides including a heterogeneity parameter in the statistical model, it is also important to understand possible causes of between‐study heterogeneity. One possibility is to incorporate study‐specific covariates in the model that account for between‐study variability. This leads to linear mixed‐effects meta‐regression models. A number of alternative methods have been proposed to estimate the (co)variance of the estimated regression coefficients in these models, which subsequently drives differences in the results of statistical methods. To quantify this, we compare the performance of hypothesis tests for moderator effects based upon different heteroscedasticity consistent covariance matrix estimators and the (untruncated) Knapp‐Hartung method in an extensive simulation study. In particular, we investigate type 1 error and power under varying conditions regarding the underlying distributions, heterogeneity, effect sizes, number of independent studies, and their sample sizes. Based upon these results, we give recommendations for suitable inference choices in different scenarios and highlight the danger of using tests regarding the study‐specific moderators based on inappropriate covariance estimators.
  • Item
    Ausreißeridentifikation für kategoriale und funktionale Daten im generalisierten linearen Modell
    (2017-03-09) Rehage, André; Kuhnt, Sonja; Fried, Roland
    In this thesis, methods for the identification of outliers in generalized linear models are developed. The focus is on categorical and functional response variables. In generalized linear models, the response is assumed to follow a distribution from the exponential family. The responses can therefore be analyzed with respect to whether unusual values are realized under this distributional assumption. For this purpose, the concept of α-outliers is used. With the help of robust kernel density estimators, this concept is extended to situations in which no distribution type is assumed. An important result of this thesis concerns the identification of outlying individual cells in contingency tables that are described by log-linear Poisson models. To this end, the minimal-pattern method is extended to a uniquely determined outlier identifier. A minimal pattern consists of a subset of the cells of a contingency table whose elements are regarded as potentially outlier-free. The performance of this procedure is assessed on real data sets and in simulation studies. For the choice of cells of a contingency table that are part of a minimal pattern, the relevance of certain geometric structures, the so-called k-loops (k-Schlingen), is highlighted. For functional data, outlier identifiers can be defined based on data depths. However, the data depths available in the literature do not focus on the shape of the data. In this thesis, this gap is closed by two new pseudo data depths for the identification of shape outliers. Their properties are analyzed both theoretically and on real as well as artificial data sets, and they are furthermore assessed in the context of generalized linear models with functional response, also taking a possible misspecification of the generalized linear model into account. In addition, methods for the identification of outliers in Gaussian processes based on the concept of α-outliers and on bagplots, respectively, are developed.
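    As a small illustration of the α-outlier concept (the thesis applies it to cells of contingency tables under log-linear Poisson models and extends it with robust kernel density estimators): for a Poisson distributed count, the α-outlier region roughly consists of the values of lowest probability whose total mass does not exceed α. The sketch ignores ties in the probability mass function; names and numbers are illustrative.
      import numpy as np
      from scipy import stats

      def poisson_alpha_outlier_region(lam, alpha=0.01, upper=None):
          """Sketch of the alpha-outlier region of a Poisson(lam) distribution:
          the values of lowest probability whose total mass does not exceed alpha."""
          if upper is None:
              upper = int(stats.poisson.ppf(1 - 1e-12, lam)) + 10   # practical support bound
          y = np.arange(upper + 1)
          p = stats.poisson.pmf(y, lam)
          order = np.argsort(p)                  # least likely values first
          inside = np.cumsum(p[order]) <= alpha  # keep accumulating mass up to alpha
          return np.sort(y[order[inside]])

      # a cell with fitted Poisson mean 4 would flag counts far in the tails
      print(poisson_alpha_outlier_region(lam=4, alpha=0.01))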
  • Item
    Black-box optimization of mixed discrete-continuous optimization problems
    (2016) Halstrup, Momchil; Kuhnt, Sonja; Weihs, Claus
    In numerous applications in industry it is becoming standard practice to study complex, real-world processes with the help of computer experiments - simulations. With increasing computing capabilities it has become customary to perform simulation studies beforehand, in which the desired process characteristics can be optimized. Computer experiments which have only continuous inputs have been studied and applied with great success in the past in a large variety of different fields. However, many experiments in practice have mixed quantitative and qualitative inputs. Such mixed-input experiments have only recently begun to receive more attention, but the field of research is still very new. Computer experiments very often take a long time to run, ranging from hours to days, making it impossible to perform direct optimization on the computer code. Instead, the simulator can be considered as a black-box function and a (meta-)model, which is cheaper to evaluate, is used to interpolate the simulation. In this thesis we develop models and optimization methods for experiments with purely continuous inputs, as well as for experiments with mixed qualitative-quantitative inputs. The optimization of expensive-to-evaluate black-box functions is often performed with the help of model-based sequential strategies. A popular choice is the efficient global optimization (EGO) algorithm, which is based on the prominent Kriging metamodel and the expected improvement (EI) search criterion. Kriging allows for great flexibility and can be used to approximate highly non-linear functions. It also provides a local uncertainty estimate at unknown locations, which, together with the EI criterion, can be used to guide the EGO algorithm to less explored regions of the search space. EGO based strategies have been successfully applied in numerous simulation studies. However, there are a few drawbacks of the EGO algorithm – for example, both the Kriging model and the EI criterion operate under the normality assumption, and the classical Kriging model assumes stationarity; both of these assumptions are fairly restrictive and can lead to a substantial loss of accuracy when they are violated. One further drawback of EGO is its inability to make adequate use of parallel computing. Moreover, the classical version of the EGO algorithm is only suitable for use in computer experiments with purely continuous inputs. The Kriging model uses Euclidean distances in order to interpolate the unknown black-box function – making interpolation of mixed-input functions difficult. In this work we address all of the drawbacks of the classical Kriging model and of the EGO algorithm described above. We develop an assumption-robust version of the EGO algorithm – called keiEGO – which does not rely on the Kriging model and the EI criterion. Instead, the robust alternatives, the kernel interpolation (KI) metamodel and the statistical lower bound (SLB) criterion, are implemented. The KI metamodel and the SLB criterion are less sophisticated than Kriging and the EI criterion, but they are completely free of the normality and stationarity assumptions. The keiEGO algorithm is compared to the classical Kriging-based EGO on a few synthetic function examples and on a simulation case study of a sheet metal forming process developed by the IUL institute of the TU Dortmund University in the course of the collaborative research center 708.
    Furthermore, we develop a method for parallel optimization – called ParOF – based on a technique from the field of sensitivity analysis, the FANOVA graph. This method makes the use of parallel computations possible in the optimization with EGO and also manages to achieve a dimensionality reduction of the original problem, which makes modeling and optimization much easier by mitigating the curse of dimensionality. The ParOF algorithm is also compared to the classical EGO algorithm on synthetic functions and on the same sheet metal forming case study mentioned before. The last part of this thesis is dedicated to EGO-like optimization of experiments with mixed inputs – thus addressing the last issue mentioned above. We start by assessing different state-of-the-art metamodels suitable for modeling and predicting mixed inputs. We then present a new class of Kriging models capable of modeling mixed inputs – called the Gower Kriging and developed in the course of this work. The Gower Kriging is also distance-based – it uses the Gower similarity measure, which constitutes a viable distance on the space of mixed quantitative-qualitative elements. With the help of the Gower Kriging we are able to produce a generalized EGO algorithm capable of optimization in this mixed space. We then perform a small benchmarking study, based on several synthetic examples of mixed-input functions of variable complexity, in which we compare the Gower-Kriging-based EGO method with EGO variations implemented with other state-of-the-art models for mixed data, based on their optimization capabilities.
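    A small sketch of the expected improvement (EI) criterion on which the classical EGO algorithm is based (the keiEGO/SLB and Gower Kriging developments of the thesis are not reproduced); the inputs are assumed to be a Kriging/GP predictive mean and standard deviation together with the best objective value found so far.
      import numpy as np
      from scipy.stats import norm

      def expected_improvement(mu, sigma, f_min):
          """Expected improvement of candidate points for minimization.
          mu, sigma: predictive mean and standard deviation; f_min: best value so far."""
          mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
          with np.errstate(divide="ignore", invalid="ignore"):
              z = (f_min - mu) / sigma
              ei = (f_min - mu) * norm.cdf(z) + sigma * norm.pdf(z)
          return np.where(sigma > 0, ei, 0.0)   # no improvement possible where sigma = 0

      # candidates with similar predicted means but different predictive uncertainty
      print(expected_improvement(mu=[0.5, 0.5, 0.2], sigma=[0.0, 0.3, 0.1], f_min=0.4))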
  • Item
    Statistische Modellierung und Optimierung multipler Zielgrößen
    (2016-01-21) Rudak, Nikolaus; Kuhnt, Sonja; Weihs, Claus
    In many technical applications, such as thermal spraying, machine settings (covariates) are sought that lead to a product with desired properties (response variables) at specified target values. The individual response variables are regarded as random variables and collected in a response vector. It is assumed that both the expectation and the variance of the response variables depend on the covariates. In general, not all target values can be reached simultaneously; only a good compromise is attainable. To select such a compromise, the JOP method can be used, in which a risk function is minimized for a whole series of cost matrices. So far, it has been assumed that the entries of the response vector are uncorrelated. In this thesis, the JOP method is extended to correlated responses. First, four different models are presented and extended in which both the mean vector and the covariance matrix of the response vector depend on the covariates. The estimation of the mean vector and the covariance matrix based on these four model formulations is investigated and compared in a simulation study. A simple procedure, in which the arithmetic mean and the empirical covariance matrix are computed for each setting, serves as a reference. Although no clear favorite emerges, the simple procedure is always outperformed by at least one of the other models. Subsequently, the JOP method is extended to correlated responses: a choice of non-diagonal cost matrices is proposed and the Pareto optimality of the JOP method is proven. This is followed by the presentation of an algorithm for the JOP method for correlated responses. The JOP method for correlated responses is then applied to a data set from the literature as well as to a data set from a thermal spraying process obtained within the SFB 823. The thesis concludes with a summary and an outlook.
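    A generic sketch of the kind of quadratic off-target risk the JOP method minimizes over a series of cost matrices, using the standard decomposition E[(Y - t)' C (Y - t)] = (mu - t)' C (mu - t) + tr(C Sigma); the response-surface functions, the cost matrix and the grid search below are purely illustrative and not the thesis's implementation.
      import numpy as np

      def quadratic_risk(mu, sigma, target, cost):
          """Quadratic off-target risk E[(Y - target)' C (Y - target)]
          = (mu - target)' C (mu - target) + trace(C Sigma)."""
          d = mu - target
          return d @ cost @ d + np.trace(cost @ sigma)

      # toy response surface: mean and covariance of two responses as functions
      # of a single machine setting x (purely illustrative forms)
      def mu_fun(x):
          return np.array([1.0 + 2.0 * x, 3.0 - x])

      def sigma_fun(x):
          s = np.array([0.2 + 0.1 * x, 0.3])
          corr = np.array([[1.0, 0.4], [0.4, 1.0]])
          return np.outer(s, s) * corr

      target = np.array([2.0, 2.5])
      cost = np.array([[1.0, 0.2], [0.2, 2.0]])      # one (non-diagonal) cost matrix

      grid = np.linspace(0.0, 1.0, 101)
      risks = [quadratic_risk(mu_fun(x), sigma_fun(x), target, cost) for x in grid]
      print("best setting on the grid:", grid[int(np.argmin(risks))])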
  • Item
    New methods for the sensitivity analysis of black-box functions with an application to sheet metal forming
    (2015) Fruth, Jana; Kuhnt, Sonja; Kunert, Joachim; Prieur, Clémentine
    The general field of the thesis is the sensitivity analysis of black-box functions. Sensitivity analysis studies how the variation of the output can be apportioned to the variation of the input sources. It is an important tool in the construction, analysis, and optimization of computer experiments. The total interaction index is presented, which can be used for the screening of interactions. Several variance-based estimation methods are suggested. Their properties are analyzed theoretically as well as in simulations. A further chapter concerns the sensitivity analysis of models that can take functions as input variables and return a scalar value as output. A very economical sequential approach is presented, which not only discovers the sensitivity of those functional variables as a whole but also identifies relevant regions in the functional domain. As a third concept, support index functions, i.e. functions of sensitivity indices over the support of the input distribution, are suggested. Finally, all three methods are successfully applied in the sensitivity analysis of sheet metal forming models.
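    The total interaction index and the methods for functional inputs are not reproduced here; the sketch below only illustrates the variance-based machinery they build on, namely a pick-freeze Monte Carlo estimate of first-order Sobol indices for independent uniform inputs (a Saltelli-type estimator). The toy function is illustrative.
      import numpy as np

      def sobol_first_order(f, d, n=100_000, seed=0):
          """Pick-freeze Monte Carlo estimate of first-order Sobol indices
          for a function f of d independent U(0,1) inputs."""
          rng = np.random.default_rng(seed)
          A = rng.uniform(size=(n, d))
          B = rng.uniform(size=(n, d))
          yA, yB = f(A), f(B)
          var = np.var(np.concatenate([yA, yB]), ddof=1)
          indices = []
          for i in range(d):
              ABi = A.copy()
              ABi[:, i] = B[:, i]              # replace only the i-th column
              yABi = f(ABi)
              # Saltelli-type estimator of Var(E[Y | X_i]) / Var(Y)
              indices.append(np.mean(yB * (yABi - yA)) / var)
          return np.array(indices)

      # additive toy function: the first input should dominate, the third is inert
      g = lambda X: 4.0 * X[:, 0] + 1.0 * X[:, 1] + 0.0 * X[:, 2]
      print(np.round(sobol_first_order(g, d=3), 3))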
  • Item
    Echtzeit-Extraktion relevanter Information aus multivariaten Zeitreihen basierend auf robuster Regression
    (2013-01-28) Borowski, Matthias; Gather, Ursula; Fried, Roland
    This thesis deals with real-time signal extraction from univariate and multivariate time series as well as with real-time monitoring of the relationships between the univariate components of multivariate time series. The methods investigated and developed in this thesis are suitable for real-time application to non-stationary time series that are measured at high frequency and exhibit outliers as well as noise with changing variability. A procedure for real-time signal extraction from univariate time series is developed, which is based on fitting robust Repeated Median regression lines in moving time windows whose width is chosen according to the current data situation. A comprehensive comparison study shows the superiority of the new method over an existing signal filter with adaptive window width selection. Based on the newly developed signal filter, a methodology for real-time monitoring of the relationships between the individual components of a multivariate time series is developed. At each point in time, this procedure assesses the relationship between two univariate time series on the basis of the currently prevailing trends; a relationship results from identically or similarly directed trends. The monitoring procedure is combined with the new adaptive signal filter into a multivariate procedure for the comprehensive extraction of relevant information in real time. In addition to multivariate signal extraction with adaptive window width selection, this new procedure provides an estimate of the noise variability for each univariate time series component, indicates the currently existing relationships at each measurement time, and detects jumps and trend changes in real time.
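    A minimal sketch of the Repeated Median building block used in the signal filter described above: Siegel's repeated median slope fitted in moving windows of fixed width. The adaptive choice of the window width, the error-variability estimation and the multivariate monitoring are not reproduced; the fixed width and the toy series are illustrative.
      import numpy as np

      def repeated_median_slope(x, y):
          """Siegel's repeated median slope: median over i of the median over j != i
          of the pairwise slopes (y_j - y_i) / (x_j - x_i)."""
          x, y = np.asarray(x, float), np.asarray(y, float)
          n = len(x)
          inner = []
          for i in range(n):
              slopes = [(y[j] - y[i]) / (x[j] - x[i]) for j in range(n) if j != i]
              inner.append(np.median(slopes))
          return np.median(inner)

      def moving_rm_signal(y, width):
          """Online-style signal extraction: fit a repeated median line in each
          moving window and return the fitted level at the window's right end."""
          t = np.arange(len(y), dtype=float)
          signal = np.full(len(y), np.nan)
          for end in range(width - 1, len(y)):
              win = slice(end - width + 1, end + 1)
              beta = repeated_median_slope(t[win], y[win])
              alpha = np.median(y[win] - beta * t[win])     # robust level estimate
              signal[end] = alpha + beta * t[end]
          return signal

      rng = np.random.default_rng(0)
      y = np.linspace(0, 5, 200) + rng.normal(0, 0.3, 200)
      y[50] += 8                                             # isolated outlier
      print(np.round(moving_rm_signal(y, width=21)[45:55], 2))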