Eldorado Community:

Eldorado Community: http://hdl.handle.net/2003/77 2024-07-23T13:57:57Z Benefit of using interaction effects for the analysis of high-dimensional time-response or dose-response data for two-group comparisons http://hdl.handle.net/2003/42423 Title: Benefit of using interaction effects for the analysis of high-dimensional time-response or dose-response data for two-group comparisons Authors: Duda, Julia C.; Drenda, Carolin; Kästel, Hue; Rahnenführer, Jörg; Kappenberg, Franziska Abstract: High throughput RNA sequencing experiments are widely conducted and analyzed to identify differentially expressed genes (DEGs). The statistical models calculated for this task are often not clear to practitioners, and analyses may not be optimally tailored to the research hypothesis. Often, interaction effects (IEs) are the mathematical equivalent of the biological research question but are not considered for different reasons. We fill this gap by explaining and presenting the potential benefit of IEs in the search for DEGs using RNA-Seq data of mice that receive different diets for different time periods. Using an IE model leads to a smaller, but likely more biologically informative set of DEGs compared to a common approach that avoids the calculation of IEs. 2023-11-27T00:00:00Z Designs for the simultaneous inference of concentration–response curves http://hdl.handle.net/2003/42403 Title: Designs for the simultaneous inference of concentration–response curves Authors: Schürmeyer, Leonie; Schorning, Kirsten; Rahnenführer, Jörg Abstract: Background: An important problem in toxicology in the context of gene expression data is the simultaneous inference of a large number of concentration–response relationships. The quality of the inference substantially depends on the choice of design of the experiments, in particular, on the set of different concentrations, at which observations are taken for the different genes under consideration. 
As this set has to be the same for all genes, the efficient planning of such experiments is very challenging. We address this problem by determining efficient designs for the simultaneous inference of a large number of concentration–response models. For that purpose, we both construct a D-optimality criterion for simultaneous inference and a K-means procedure which clusters the support points of the locally D-optimal designs of the individual models. Results: We show that a planning of experiments that addresses the simultaneous inference of a large number of concentration–response relationships yields a substantially more accurate statistical analysis. In particular, we compare the performance of the constructed designs to the ones of other commonly used designs in terms of D-efficiencies and in terms of the quality of the resulting model fits using a real data example dealing with valproic acid. For the quality comparison we perform an extensive simulation study. Conclusions: The design maximizing the D-optimality criterion for simultaneous inference improves the inference of the different concentration–response relationships substantially. The design based on the K-means procedure also performs well, whereas a log-equidistant design, which was also included in the analysis, performs poorly in terms of the quality of the simultaneous inference. Based on our findings, the D-optimal design for simultaneous inference should be used for upcoming analyses dealing with high-dimensional gene expression data. 2023-10-19T00:00:00Z Statistical inference for intensity-based load sharing models with damage accumulation http://hdl.handle.net/2003/42389 Title: Statistical inference for intensity-based load sharing models with damage accumulation Authors: Jakubzik, Mirko Alexander Abstract: Consider a system in which a load exerted on it is equally shared between its components. Whenever one component fails, the total load is redistributed across the surviving components. 
This in turn increases the individual load applied to each of these components and therefore their risk of failure. Such a system is called a load sharing system. In a load sharing system, the failure rate of a surviving component grows with the number of failed components. However, the risk of failure is likely to also depend on how long the surviving components were exposed to the shared load. This accumulation of damage within the system causes a continuous increase in the failure rate between consecutive component failures. This thesis deals with the statistical inference for load sharing systems with damage accumulation that can be modelled in terms of its component failure rate. We identify the component failure rate as the stochastic intensity of a counting process, for which a parametric model can be specified - an intensity-based load sharing model with damage accumulation. The first method of inference is the minimum distance estimator introduced by Kopperschmidt and Stute. They claim the strong consistency and asymptotic normality of this estimator, but we demonstrate that their proof of the asymptotic distribution is flawed. Our first important contribution is a corrected proof under slightly adjusted requirements. The second method of inference is based on the K-sign depth test, a powerful and robust generalization of the classical sign test that was up to now mostly used with the residuals of a linear model. We present a procedure to obtain a "residual" counterpart in an intensity-based model via the hazard transformation of a point process. Moreover, we derive conditions on the model under which the 3-sign depth test is consistent. The thesis closes by comparing these two methods with the established likelihood approach. To this end, we verify the applicability of the competing methods to the Basquin load sharing model with multiplicative damage accumulation recently proposed by Müller and Meyer. 
In a final simulation study, we assess the robustness of the methods in the presence of contaminated data. This study confirms that, in contrast to the other two approaches, the 3-sign depth test offers a tool of statistical inference for intensity-based load sharing models with damage accumulation that is both powerful and robust. 2023-01-01T00:00:00Z Model selection characteristics when using MCP-Mod for dose–response gene expression data http://hdl.handle.net/2003/42379 Title: Model selection characteristics when using MCP-Mod for dose–response gene expression data Authors: Duda, Julia C.; Kappenberg, Franziska; Rahnenführer, Jörg Abstract: We extend the scope of application for MCP-Mod (Multiple Comparison Procedure and Modeling) to in vitro gene expression data and assess its characteristics regarding model selection for concentration gene expression curves. Precisely, we apply MCP-Mod to single genes of a high-dimensional gene expression data set, where human embryonic stem cells were exposed to eight concentration levels of the compound valproic acid (VPA). As candidate models we consider the sigmoid Emax (four-parameter log-logistic), linear, quadratic, Emax, exponential, and beta model. Through simulations we investigate the impact of omitting one or more models from the candidate model set to uncover possibly superfluous models and to evaluate the precision and recall rates of selected models. Each model is selected according to the Akaike information criterion (AIC) for a considerable number of genes. For less noisy cases the popular sigmoid Emax model is frequently selected. For noisier data, simpler models such as the linear model are often selected, but mostly without a relevant performance advantage over the second-best model. Also, the commonly used standard Emax model has an unexpectedly low performance.
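The AIC-based selection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' MCP-Mod pipeline: it omits the multiple contrast test, uses only three of the candidate models (linear, quadratic, Emax), and fits the Emax model by a grid search over ED50. The dose levels, the simulated responses, and all function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def aic_gaussian(rss, n, k):
    # Gaussian AIC up to an additive constant; k mean parameters plus one variance.
    return n * np.log(rss / n) + 2 * (k + 1)

def fit_polynomial(dose, resp, degree):
    # Least-squares polynomial fit; returns residual sum of squares and parameter count.
    coef = np.polyfit(dose, resp, degree)
    rss = float(np.sum((resp - np.polyval(coef, dose)) ** 2))
    return rss, degree + 1

def fit_emax(dose, resp, ed50_grid):
    # Emax model: E0 + Emax * dose / (ED50 + dose).
    # For a fixed ED50 the model is linear in (E0, Emax), so we profile over a grid.
    best_rss = np.inf
    for ed50 in ed50_grid:
        X = np.column_stack([np.ones_like(dose), dose / (ed50 + dose)])
        beta, *_ = np.linalg.lstsq(X, resp, rcond=None)
        best_rss = min(best_rss, float(np.sum((resp - X @ beta) ** 2)))
    return best_rss, 3

def select_model(dose, resp):
    # Fit every candidate and return the name with the smallest AIC.
    n = len(resp)
    fits = {
        "linear": fit_polynomial(dose, resp, 1),
        "quadratic": fit_polynomial(dose, resp, 2),
        "emax": fit_emax(dose, resp, np.linspace(1.0, dose.max(), 200)),
    }
    aics = {name: aic_gaussian(rss, n, k) for name, (rss, k) in fits.items()}
    return min(aics, key=aics.get), aics

# Hypothetical design: eight concentration levels, three replicates each.
rng = np.random.default_rng(0)
dose = np.repeat(np.array([0.0, 25, 150, 350, 450, 550, 800, 1000]), 3)
resp = 1.0 + 2.0 * dose / (100.0 + dose) + rng.normal(scale=0.05, size=dose.size)

best, aics = select_model(dose, resp)
print(best, aics)
```

Because the simulated curve saturates early, the polynomial candidates leave large systematic residuals and the Emax candidate attains the smallest AIC; with noisier responses the differences between candidates shrink, which is the regime the abstract discusses.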
2022-02-20T00:00:00Z Improving adaptive seamless designs through Bayesian optimization http://hdl.handle.net/2003/42378 Title: Improving adaptive seamless designs through Bayesian optimization Authors: Richter, Jakob; Friede, Tim; Rahnenführer, Jörg Abstract: We propose to use Bayesian optimization (BO) to improve the efficiency of the design selection process in clinical trials. BO is a method to optimize expensive black-box functions by using a regression model as a surrogate to guide the search. In clinical trials, planning test procedures and sample sizes is a crucial task. A common goal is to maximize the test power, given a set of treatments, corresponding effect sizes, and a total number of samples. From a wide range of possible designs, we aim to select the best one in a short time to allow quick decisions. The standard approach of simulating the power for every single design can become too time-consuming. When the number of possible designs becomes very large, either large computational resources are required or an exhaustive exploration of all possible designs takes too long. Here, we propose to use BO to quickly find a clinical trial design with high power from a large number of candidate designs. We demonstrate the effectiveness of our approach by optimizing the power of adaptive seamless designs for different sets of treatment effect sizes. Comparing BO with an exhaustive evaluation of all candidate designs shows that BO finds competitive designs in a fraction of the time. 2022-02-25T00:00:00Z Statistische Methoden zur Validierung von Inhaltsanalysen http://hdl.handle.net/2003/42361 Title: Statistische Methoden zur Validierung von Inhaltsanalysen [Statistical methods for the validation of content analyses] Authors: Koppers, Lars Abstract: The analysis of large text corpora has by now become established in the humanities and social sciences as well. With the digital humanities, a completely new field of research has emerged there.
This made it possible for the first time to evaluate large text corpora systematically rather than examining only samples drawn from them. At the Dortmund Center for data-based Media Analysis (DoCMA), journalism research is conducted on the basis of media corpora. A main focus is on the development of topics in media products. The central method used is Latent Dirichlet Allocation (LDA; Blei, Ng et al. 2003), a generative topic model that identifies topics in text corpora, where both the topic distribution and the word distribution that defines a topic are assumed to be latent behind the text. This thesis addresses three different aspects of this subject area: an R package for the preprocessing and analysis of text corpora, with an emphasis on graphical visualizations that focus on the temporal component of the corpora; a more effective sampling scheme for the validation of subcorpora; and an analysis of topic coherence for model selection. When text mining media corpora, the same preprocessing steps, such as tokenization and the removal of stop words and umlauts, recur again and again before an LDA can be run. Existing R packages could be used both for the LDA and for the preprocessing. The R package tosca provides wrappers that make preprocessing clearer. In addition, tosca offers several plotting functions tailored to the provided analysis pipeline, which make it possible to obtain the temporal trajectories of topics and words with little effort. For validation, the intruder words and intruder topics proposed by Blei were implemented in R. For content analyses, usually only parts of the corpus are relevant rather than the whole. These parts can be identified via word filters or LDA topics.
Because the quality of the analysis depends on the quality of the resulting subcorpus, the subcorpus must be validated, which is done by human coders. Several attempts are often needed before the selection criteria for the subcorpus have been optimized to the point where its quality is sufficient. This thesis presents a procedure in which texts for validation are not drawn at random from the entire corpus but, depending on the knowledge already gained in earlier runs, from those intersections of the subcorpora that reduce the overall uncertainty the most. A problem with LDA is that mathematically optimized models often do not give users the best results in terms of content. At the same time, manual model selection is feasible only to a limited extent because of capacity constraints. This thesis examines topic coherence (Mimno et al. 2011), one of the measures proposed for model selection. While the measure does not allow comparisons across models with different parameters, it does make it possible to select one model from among repeated runs. Building on this, a procedure is presented for selecting an optimal model when users have already identified topics from other runs that are optimal for their research question. 2023-01-01T00:00:00Z A flexible approach to modelling over‐, under‐ and equidispersed count data in IRT: the Two‐Parameter Conway–Maxwell–Poisson model http://hdl.handle.net/2003/42348 Title: A flexible approach to modelling over‐, under‐ and equidispersed count data in IRT: the Two‐Parameter Conway–Maxwell–Poisson model Authors: Beisemann, Marie Abstract: Several psychometric tests and self-reports generate count data (e.g., divergent thinking tasks).
The most prominent count data item response theory model, the Rasch Poisson Counts Model (RPCM), is limited in applicability by two restrictive assumptions: equal item discriminations and equidispersion (conditional mean equal to conditional variance). Violations of these assumptions lead to impaired reliability and standard error estimates. Previous work generalized the RPCM but maintained some limitations. The two-parameter Poisson counts model allows for varying discriminations but retains the equidispersion assumption. The Conway–Maxwell–Poisson Counts Model allows for modelling over- and underdispersion (conditional mean less than and greater than conditional variance, respectively) but still assumes constant discriminations. The present work introduces the Two-Parameter Conway–Maxwell–Poisson (2PCMP) model which generalizes these three models to allow for varying discriminations and dispersions within one model, helping to better accommodate data from count data tests and self-reports. A marginal maximum likelihood method based on the EM algorithm is derived. An implementation of the 2PCMP model in R and C++ is provided. Two simulation studies examine the model's statistical properties and compare the 2PCMP model to established models. Data from divergent thinking tasks are reanalysed with the 2PCMP model to illustrate the model's flexibility and ability to test assumptions of special cases.; Correction for this article: https://doi.org/10.1111/bmsp.12312 2022-06-09T00:00:00Z The machines take over: a comparison of various supervised learning approaches for automated scoring of divergent thinking tasks http://hdl.handle.net/2003/42333 Title: The machines take over: a comparison of various supervised learning approaches for automated scoring of divergent thinking tasks Authors: Buczak, Philip; Huang, He; Forthmann, Boris; Doebler, Philipp Abstract: Traditionally, researchers employ human raters for scoring responses to creative thinking tasks. 
Apart from the associated costs this approach entails two potential risks. First, human raters can be subjective in their scoring behavior (inter-rater-variance). Second, individual raters are prone to inconsistent scoring patterns (intra-rater-variance). In light of these issues, we present an approach for automated scoring of Divergent Thinking (DT) Tasks. We implemented a pipeline aiming to generate accurate rating predictions for DT responses using text mining and machine learning methods. Based on two existing data sets from two different laboratories, we constructed several prediction models incorporating features representing meta information of the response or features engineered from the response’s word embeddings that were obtained using pre-trained GloVe and Word2Vec word vector spaces. Out of these features, word embeddings and features derived from them proved to be particularly effective. Overall, longer responses tended to achieve higher ratings as well as responses that were semantically distant from the stimulus object. In our comparison of three state-of-the-art machine learning algorithms, Random Forest and XGBoost tended to slightly outperform the Support Vector Regression.; Correction for this article: https://doi.org/10.1002/jocb.627 2022-08-08T00:00:00Z Tackling the Challenge of Aging Populations: The Impact of Increasing Life Expectancy and Low Fertility on the Old-Age Dependency Ratio http://hdl.handle.net/2003/42319 Title: Tackling the Challenge of Aging Populations: The Impact of Increasing Life Expectancy and Low Fertility on the Old-Age Dependency Ratio Authors: Pflaumer, Peter Abstract: The old-age dependency ratios are indicators of the number of elderly people who are generally economically inactive compared to the number of people of working age. They significantly affect the financial burden of social public pension schemes, making it essential to analyze the influence of mortality on this ratio. 
In this paper, the Gompertz model is used to investigate the effect of mortality and fertility on the old-age dependency ratio, with a focus on the impact of changes in life expectancy. Elasticity formulas are derived to analyze this effect, and the results indicate that an increase in life expectancy leads to a considerable rise in the old-age dependency ratio. 2023-09-01T00:00:00Z Refining Mortality Projections at Advanced Ages: Evaluating the Significance of Wittstein's Mortality Law http://hdl.handle.net/2003/42318 Title: Refining Mortality Projections at Advanced Ages: Evaluating the Significance of Wittstein's Mortality Law Authors: Pflaumer, Peter Abstract: Age-specific mortality rates for semi-supercentenarians and supercentenarians play a pivotal role in comprehending longevity and population dynamics at advanced ages. In this study, we introduce a modified Wittstein Model, offering an alternative to the conventional S-shaped curve models used in mortality forecasting. The Wittstein Model, originally formulated by Theodor Wittstein, has been adapted to suit contemporary contexts. Utilizing life table data for German women from 2019/2021, we project age-specific mortality rates, construct life tables commencing from age 100, and conduct a sensitivity analysis to assess the impact of model parameters on mortality patterns. The sensitivity analysis unveils the influence of parameter values on the shape of age-specific mortality rates. This study contributes to research in mortality forecasting, with a specific focus on semi-supercentenarians and supercentenarians, shedding light on an understudied population segment. Accurate projections carry profound implications for public health, healthcare planning, and social policy. Further research should explore the model's applicability in different contexts, providing a deeper understanding of mortality patterns at advanced ages. 
As the empirical database of centenarians expands, the model is expected to enhance its precision and reliability in forecasting age-specific mortality rates at advanced ages. 2023-11-01T00:00:00Z Analyzing the Historical Life Table of Thomas Young http://hdl.handle.net/2003/42156 Title: Analyzing the Historical Life Table of Thomas Young Authors: Pflaumer, Peter Abstract: Thomas Young (1773-1829) is one of the greatest thinkers and polymaths. His scientific work includes significant contributions in the fields of medicine, physics, anthropology and ancient history. Less well known, however, is Young's demographic contribution. In 1826, Thomas Young examined graphical curves of mortality of his epoch (decrement tables of the deceased) to see if they matched a formula he had developed. Looking for a law of mortality, he created a high order polynomial for the function of mortality. We use modern demographic methods to analyze and criticize his life table. Young's discrete life table is fitted by a continuous life table function (Lazarus distribution) in order to calculate important parameters. It is shown that Young's formula is an early and successful method of determining a model life table. It corresponds to a particular life table of Coale and Demeny. The article concludes with an exploration of Young's mortality formula of 1816, a concise yet foundational model, showcasing its ability to facilitate calculations of vital functions like life expectancy and the force of mortality, despite its lesser-known status. 
2023-08-01T00:00:00Z Leonhard Euler’s Research on the Multiplication of the Human Race with Models of Population Growth http://hdl.handle.net/2003/42155 Title: Leonhard Euler’s Research on the Multiplication of the Human Race with Models of Population Growth Authors: Pflaumer, Peter Abstract: The renowned Swiss mathematician Leonhard Euler created three variations of a simple population projection model, including one stable model and two non-stable models, that consider a couple with different fertility behaviors and life-spans. While one of the models was published by a German demographer, Johann Peter Süßmilch, in his book “The Divine Order”, the other two are not widely known in contemporary literature. This paper compares and reanalyzes the three variants of Euler's population projections using matrix algebra, providing diagrams and tables of the population time series and their growth rates, as well as age structures of selected years. It is demonstrated that the non-stable projection models can be explained in the long run by their geometric trend component, which is a special case of strong ergodicity in demography as described by Euler. Additionally, a continuous variant of Euler's stable model is introduced, allowing for the calculation of the age structure, intrinsic growth rate, and population momentum in a straightforward manner. The effect of immortality on population size and age structure at high growth rates is also examined. 2023-06-01T00:00:00Z Simple powerful robust tests based on sign depth http://hdl.handle.net/2003/42118 Title: Simple powerful robust tests based on sign depth Authors: Leckey, Kevin; Malcherczyk, Dennis; Horn, Melanie; Müller, Christine H. Abstract: Up to now, powerful outlier-robust tests for linear models have been based on M-estimators and are quite complicated. On the other hand, the simple robust classical sign test usually has very poor power for certain alternatives.
We present a generalization of the sign test which is similarly easy to comprehend but much more powerful. It is based on K-sign depth, K-depth for short. These so-called K-depth tests are motivated by simplicial regression depth, but are not restricted to regression problems. They can be applied as soon as the true model leads to independent residuals with median equal to zero. Moreover, general hypotheses on the unknown parameter vector can be tested. While the 2-depth test, i.e. the K-depth test for K=2, is equivalent to the classical sign test, K-depth tests with K≥3 turn out to be much more powerful in many applications. A drawback of the K-depth test is its fairly high computational effort when implemented naively. However, we show how this inherent computational complexity can be reduced. In order to see why K-depth tests with K≥3 are more powerful than the classical sign test, we discuss the asymptotic behavior of the test statistic for residual vectors with only a few sign changes, which is in particular the case for some alternatives the classical sign test cannot reject. In contrast, we also consider residual vectors with alternating signs, representing models that fit the data very well. Finally, we demonstrate the good power of the K-depth tests for some examples including high-dimensional multiple regression. 2022-07-30T00:00:00Z Semiparametric estimation of INAR models using roughness penalization http://hdl.handle.net/2003/42018 Title: Semiparametric estimation of INAR models using roughness penalization Authors: Faymonville, Maxime; Jentsch, Carsten; Weiß, Christian H.; Aleksandrov, Boris Abstract: Popular models for time series of count data are integer-valued autoregressive (INAR) models, for which the literature mainly deals with parametric estimation.
In this regard, a semiparametric estimation approach is a remarkable exception which allows for estimation of the INAR models without any parametric assumption on the innovation distribution. However, for small sample sizes, the estimation performance of this semiparametric estimation approach may be inferior. Therefore, to improve the estimation accuracy, we propose a penalized version of the semiparametric estimation approach, which exploits the fact that the innovation distribution is often considered to be smooth, i.e. two consecutive entries of the PMF differ only slightly from each other. This is the case, for example, in the frequently used INAR models with Poisson, negative binomially or geometrically distributed innovations. For the data-driven selection of the penalization parameter, we propose two algorithms and evaluate their performance. In Monte Carlo simulations, we illustrate the superiority of the proposed penalized estimation approach and argue that a combination of penalized and unpenalized estimation approaches results in overall best INAR model fits. 2022-09-21T00:00:00Z Essays in time series econometrics http://hdl.handle.net/2003/42002 Title: Essays in time series econometrics Authors: Reichold, Karsten Abstract: This cumulative dissertation consists of three self-contained papers all contributing to the cointegrating regression literature. The first chapter is devoted to classical linear cointegrating regressions, i.e., regressions that contain integrated processes as regressors. It combines traditional and self-normalized Wald-type test statistics with a vector autoregressive sieve bootstrap to reduce size distortions of hypothesis tests on the cointegrating vector. The second chapter focuses on panels of cointegrating polynomial regressions, i.e., panels of regressions that include an integrated process and its powers as regressors. 
It derives the asymptotic properties of a group-mean fully modified OLS estimator and hypothesis tests based upon it in a fixed cross-section and large time series dimension. The third chapter is devoted to testing for a cointegrating relationship between a fixed number of integrated processes. In particular, it derives asymptotic theory for an existing nonparametric variance ratio unit root test (originally proposed to test for a unit root in an observed univariate time series) when applied to regression residuals. 2023-01-01T00:00:00Z Statistical inference for the reserve risk http://hdl.handle.net/2003/41966 Title: Statistical inference for the reserve risk Authors: Steinmetz, Julia Abstract: The major part of the liabilities on an insurance company's balance sheet consists of the reserves. Reserves are built to pay for all future claims, known or unknown, that have occurred so far. Hence an accurate prediction of the outstanding claims to determine the reserve is important. For non-life insurance companies, Mack (1993) proposed a distribution-free approach to calculate the first two moments of the reserve. In this cumulative dissertation, we derive first asymptotic theory for the unconditional and conditional limit distribution of the reserve risk. To this end, we enhance the assumptions of Mack's model and derive a fully stochastic framework. The distribution of the reserve risk can be split up into two additive random parts covering the process and parameter uncertainty. The process uncertainty part dominates asymptotically and is in general non-Gaussian, both unconditionally and conditionally on the whole observed loss triangle or the last observed diagonal of the loss triangle. In contrast, the parameter uncertainty part is measurable with respect to the whole observed upper loss triangle.
Properly inflated, the parameter uncertainty part is Gaussian distributed conditional on the last observed diagonal of the loss triangle, and unconditional, it leads to a non-Gaussian distribution. Hence, the parameter uncertainty part is asymptotically negligible. In total, the reserve risk has asymptotically the same distribution as the process uncertainty part since this part dominates asymptotically leading to a non-Gaussian distribution conditional and unconditional. Using the theoretical asymptotic distribution results regarding the distribution of the reserve risk, we can now establish bootstrap consistency results, where the derived distribution of the reserve risk serves as a benchmark. Splitting the reserve risk into two additive parts enables a rigorous investigation of the validity of the Mack bootstrap. If the parametric family of distributions of the individual development factors is correctly specified, we prove that the (conditional) distribution of the asymptotically dominating process uncertainty part is correctly mimicked by the proposed Mack bootstrap approach. On the contrary, the corresponding (conditional) distribution of the estimation uncertainty part is generally not correctly captured by the Mack bootstrap. To address this issue, we propose an alternative Mack bootstrap, which uses a different centering and is designed to capture also the distribution of the estimation uncertainty part correctly. 2023-01-01T00:00:00Z Testing marginal homogeneity in Hilbert spaces with applications to stock market returns http://hdl.handle.net/2003/41843 Title: Testing marginal homogeneity in Hilbert spaces with applications to stock market returns Authors: Ditzhaus, Marc; Gaigall, Daniel Abstract: This paper considers a paired data framework and discusses the question of marginal homogeneity of bivariate high-dimensional or functional data. 
The related testing problem can be embedded into a more general setting for paired random variables taking values in a general Hilbert space. To address this problem, a Cramér–von-Mises type test statistic is applied and a bootstrap procedure is suggested to obtain critical values and finally a consistent test. The desired properties of a bootstrap test can be derived, namely asymptotic exactness under the null hypothesis and consistency under alternatives. Simulations show the quality of the test in the finite sample case. A possible application is the comparison of two possibly dependent stock market returns based on functional data. The approach is demonstrated based on historical data for different stock market indices. 2022-02-14T00:00:00Z Compressing data for generalized linear regression http://hdl.handle.net/2003/41841 Title: Compressing data for generalized linear regression Authors: Omlor, Simon Abstract: In this thesis we work on algorithmic data and dimension reduction techniques to solve scalability issues and to allow better analysis of massive data. For our algorithms we use the sketch and solve paradigm as well as some initialization tricks. We analyze a tradeoff between accuracy, running time and storage. We also show some lower bounds on the best possible data reduction factors. While we focus mostly on generalized linear regression, logistic and p-probit regression to be precise, we also deal with two-layer Rectified Linear Unit (ReLU) networks with logistic loss, which can be seen as an extension of logistic regression, i.e. logistic regression on the neural tangent kernel. We present coresets via sampling, sketches via random projections and several algorithmic techniques and prove that our algorithms are guaranteed to work with high probability. First, we consider the problem of logistic regression where the aim is to find the parameter beta maximizing the likelihood.
We are constructing a sketch in a single pass over a turnstile data stream. Depending on some parameters we can tweak size, running time and approximation guarantee of the sketch. We also show that our sketch works for other target functions as well. Second, we construct an epsilon-coreset for p-probit regression, which is a generalized version of probit regression. Therefore, we first compute the QR decomposition of a sketched version of our dataset in a first pass. We then use the matrix R to compute an approximation of the l_p-leverage scores of our data points which we use to compute sampling probabilities to construct the coreset. We then analyze the negative log likelihood of the p-generalized normal distribution to prove that this results in an epsilon-coreset. Finally, we look at two layer ReLU networks with logistic loss. Here we show that using a coupled initialization we can reduce the width of the networks to get a good approximation down from gamma^(-8) (Ji and Telgarsky, 2020) to gamma^(-2) where gamma is the so called separation margin. We further give an example where we prove that a width of gamma^(−1) is necessary to get less than constant error. 
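The leverage-score step described in this abstract can be illustrated with a minimal Python sketch. It is not the algorithm of the thesis: it assumes p = 2 (ordinary l_2 leverage scores) and a dense Gaussian sketching matrix, and the names `approx_leverage_scores` and `sample_coreset` are hypothetical.

```python
import numpy as np


def approx_leverage_scores(X, sketch_rows, rng):
    """Approximate l_2 leverage scores of the rows of X via a sketched QR.

    Assumption: a dense Gaussian sketch S is used here for simplicity;
    the abstract does not specify the sketching matrix.
    """
    n, _ = X.shape
    S = rng.normal(size=(sketch_rows, n)) / np.sqrt(sketch_rows)
    # QR decomposition of the sketched data; only R (d x d) is needed.
    _, R = np.linalg.qr(S @ X)
    # Rows of X @ inv(R), computed with a linear solve instead of an inverse.
    XR_inv = np.linalg.solve(R.T, X.T).T
    # Squared row norms approximate the leverage scores.
    return np.sum(XR_inv ** 2, axis=1)


def sample_coreset(scores, m, rng):
    """Sample m row indices proportionally to the scores, with importance weights."""
    probs = scores / scores.sum()
    idx = rng.choice(len(scores), size=m, replace=True, p=probs)
    weights = 1.0 / (m * probs[idx])
    return idx, weights
```

The weights 1 / (m * probability) make the weighted sum over the sample an unbiased estimate of the sum over the full data set, which is the usual reason for pairing sensitivity-style sampling with reweighting.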
2022-01-01T00:00:00Z Paola Zuccolotto and Marica Manisera (2020): Basketball Data Science: With Applications in R, CRC Press, 243 pp., £80.50 (Hardcover), ISBN: 978-1-138-60079-9 http://hdl.handle.net/2003/41733 Title: Paola Zuccolotto and Marica Manisera (2020): Basketball Data Science: With Applications in R, CRC Press, 243 pp., £80.50 (Hardcover), ISBN: 978-1-138-60079-9 Authors: Groll, Andreas; Jentsch, Carsten 2022-04-10T00:00:00Z Introducing LASSO-type penalisation to generalised joint regression modelling for count data http://hdl.handle.net/2003/41341 Title: Introducing LASSO-type penalisation to generalised joint regression modelling for count data Authors: van der Wurp, Hendrik; Groll, Andreas Abstract: In this work, we propose an extension of the versatile joint regression framework for bivariate count responses of the R package GJRM by Marra and Radice (R package version 0.2-3, 2020) by incorporating an (adaptive) LASSO-type penalty. The underlying estimation algorithm is based on a quadratic approximation of the penalty. The method enables variable selection and the corresponding estimates guarantee shrinkage and sparsity. Hence, this approach is particularly useful in high-dimensional count response settings. The proposal’s empirical performance is investigated in a simulation study and an application on FIFA World Cup football data. 2021-11-12T00:00:00Z CASANOVA: permutation inference in factorial survival designs http://hdl.handle.net/2003/41339 Title: CASANOVA: permutation inference in factorial survival designs Authors: Ditzhaus, Marc; Genuneit, Jon; Janssen, Arnold; Pauly, Markus Abstract: We propose inference procedures for general factorial designs with time-to-event endpoints. Similar to additive Aalen models, null hypotheses are formulated in terms of cumulative hazards. Deviations are measured in terms of quadratic forms in Nelson–Aalen-type integrals. 
Unlike existing approaches, this makes it possible to work without restrictive model assumptions such as proportional hazards. In particular, crossing survival or hazard curves can be detected without a significant loss of power. For a distribution-free application of the method, a permutation strategy is suggested. The resulting procedures' asymptotic validity is proven and small sample performances are analyzed in extensive simulations. The analysis of a data set on asthma illustrates the applicability. 2021-10-05T00:00:00Z Clusteranzahlbestimmung und Clusterung unter Nebenbedingungen in der Musiksignalanalyse und in Energienetzen http://hdl.handle.net/2003/41186 Title: Clusteranzahlbestimmung und Clusterung unter Nebenbedingungen in der Musiksignalanalyse und in Energienetzen Authors: Krey, Sebastian Abstract: Clustering methods are an important tool of unsupervised machine learning. They enable the automated structuring of large amounts of data and can thus be an important tool for further data processing or analysis, or form the basis for decisions. The application examples considered in this thesis, taken from music signal analysis and electrical engineering, show that purely distance-based clustering methods are not always sufficient and that constraints must be added to the underlying optimization problems in order to obtain meaningful clusterings that are helpful to the user. For this purpose, Order Constrained Solutions in k-Means Clustering (OCKC) and Spectral Clustering are used to represent the constraints. For OCKC, an efficient implementation of the method is also presented. A challenge common to all clustering methods is determining the number of clusters.
Since the clustering methods considered here are methods with constraints, the classical stability analysis using the adjusted Rand index on bootstrap samples of the data cannot be used to assess cluster stability, because such samples may violate the constraint. Alternatives are presented that ensure the constraint is satisfied in the generated data sets, both under an order restriction and under a neighbourhood condition. With these methods, cluster stability can be assessed with the Rand index for constrained clustering methods as well. 2022-01-01T00:00:00Z Which test for crossing survival curves? A user’s guideline http://hdl.handle.net/2003/41166 Title: Which test for crossing survival curves? A user’s guideline Authors: Dormuth, Ina; Liu, Tiantian; Xu, Jin; Yu, Menggang; Pauly, Markus; Ditzhaus, Marc Abstract: Background: The exchange of knowledge between statisticians developing new methodology and clinicians, reviewers or authors applying them is fundamental. This is specifically true for clinical trials with time-to-event endpoints. Thereby, one of the most commonly arising questions is that of equal survival distributions in a two-armed trial. The log-rank test is still the gold standard for addressing this question. However, in case of non-proportional hazards, its power can become poor and multiple extensions have been developed to overcome this issue. We aim to facilitate the choice of a test for the detection of survival differences in the case of crossing hazards. Methods: We restricted the review to the most recent two-armed clinical oncology trials with crossing survival curves. Each data set was reconstructed using a state-of-the-art reconstruction algorithm.
To ensure reproduction quality, only publications with published number at risk at multiple time points, sufficient printing quality and a non-informative censoring pattern were included. This article depicts the p-values of the log-rank and Peto-Peto test as references and compares them with nine different tests developed for detection of survival differences in the presence of non-proportional or crossing hazards. Results: We reviewed 1400 recent phase III clinical oncology trials and selected fifteen studies that met our eligibility criteria for data reconstruction. After including further three individual patient data sets, for nine out of eighteen studies significant differences in survival were found using the investigated tests. An important point that reviewers should pay attention to is that 28% of the studies with published survival curves did not report the number at risk. This makes reconstruction and plausibility checks almost impossible. Conclusions: The evaluation shows that inference methods constructed to detect differences in survival in presence of non-proportional hazards are beneficial and help to provide guidance in choosing a sensible alternative to the standard log-rank test. 2022-01-30T00:00:00Z Reliability evaluation and an update algorithm for the latent Dirichlet allocation http://hdl.handle.net/2003/41102 Title: Reliability evaluation and an update algorithm for the latent Dirichlet allocation Authors: Rieger, Jonas Abstract: Modeling text data is becoming increasingly popular. Topic models and in particular the latent Dirichlet allocation (LDA) represent a large field in text data analysis. In this context, the problem exists that running LDA repeatedly on the same data yields different results. This lack of reliability can be improved by repeated modeling and a reasonable choice of a representative. Further, updating existing LDA models with new data is another common challenge. 
Many dynamic models, when adding new data, also update parameters of past time points, and thus do not ensure the temporal consistency of the results. In this cumulative dissertation, I summarize in particular my methodological papers from the two areas of improving the reliability of LDA results and updating LDA results in a temporally consistent manner for use in monitoring scenarios. For this purpose, I first introduce the state of research for each of the two areas. After explaining the idea of the corresponding method, I give examples of applications in which the method has already been used and explain the implementation as an R package. Finally, for both fields I provide an outlook on potential further research.
2022-07-01T00:00:00Z Robust covariance estimation in mixed-effects meta-regression models http://hdl.handle.net/2003/41097 Title: Robust covariance estimation in mixed-effects meta-regression models Authors: Welz, Thilo Abstract: In this PhD thesis we consider robust (sandwich) variance-covariance matrix estimators in the context of univariate and multivariate meta-analysis and meta-regression. The underlying model is the classical mixed-effects meta-regression model. Our goal is to enable valid statistical inference for the model coefficients. Specifically, we employ heteroscedasticity consistent (HC) and cluster-robust (CR) sandwich estimators in the univariate and multivariate setting. A key aim is to provide better small sample solutions for meta-analytic research and application. Tests based on the original formulations of these estimators are known to produce highly liberal results, especially when the number of studies is small. We therefore transfer results for improved sandwich estimation by Cribari-Neto and Zarkos (2004) to the meta-analytic context. We prove the asymptotic equivalence of HC estimators and compare them with commonly suggested techniques such as the Knapp-Hartung (KH) method or standard plug-in covariance matrix estimation in extensive simulation studies. The new versions of HC estimators considerably outperform their older counterparts, especially in small samples, achieving results comparable to the KH method. In a slight excursion, we focus on constructing confidence regions for (Pearson) correlation coefficients as the main effect of interest in a random-effects meta-analysis. We develop a beta-distribution model for generating data in our simulations in addition to the commonly used truncated normal distribution model.
We utilize different variance estimation approaches such as HC estimators, the KH method and a wild bootstrap approach in combination with the Fisher-z transformation and an integral z-to-r back-transformation to construct confidence regions. In simulation studies, our novel proposals improve coverage over the Hedges-Olkin-Vevea-z approach and Hunter-Schmidt approaches, enabling reliable inference for a greater range of true correlations. Finally, we extend our results for the HC estimators to construct CR sandwich estimators for multivariate meta-regression. The aim is to achieve valid inference for the model coefficients, based on Wald-type statistics, even in small samples. Our simulations show that previously suggested CR estimators, such as the bias-reduced linearization approach, can have unsatisfactory small sample performance for bivariate meta-regression. Furthermore, they show that Hotelling’s T^2-test suggested by Tipton and Pustejovsky (2015) can yield negative estimates for the degrees of freedom when the number of studies is small. We suggest an adjustment to the classical F-test, truncating the denominator degrees of freedom at two. Our CR extensions, using only the diagonal elements of the hat matrix to adjust residuals, improve coverage considerably in small samples. We focus on the bivariate case in our simulations, but the discussed approaches can also be applied more generally. We analyze both small and large sample behavior of all considered tests and confidence regions in extensive simulation studies. Furthermore, we apply the discussed approaches to real-life datasets from psychometric and medical research. 2022-01-01T00:00:00Z Nonparametric correlation-based methods with biomedical applications http://hdl.handle.net/2003/41058 Title: Nonparametric correlation-based methods with biomedical applications Authors: Nowak, Claus P.
Abstract: This cumulative dissertation consists of three manuscripts on nonparametric methodology, i.e., Simultaneous inference for Kendall’s tau, Group sequential methods for the Mann-Whitney parameter, and The nonparametric Behrens-Fisher problem in small samples. The manuscript on Kendall’s τ fully develops a nonparametric estimation theory for multiple rank correlation coefficients in terms of Kendall’s τA and τB, Somers’ D, as well as Goodman and Kruskal’s γ, necessitating joint estimation of both the probabilities of ties occurring and the probability of concordance minus discordance. As for the second manuscript, I review and further develop group sequential methodology for the Mann-Whitney parameter. With the aid of data from a clinical trial in patients with relapsing-remitting multiple sclerosis, I demonstrate how one could repeatedly estimate the Mann-Whitney parameter during an ongoing trial together with repeated confidence intervals obtained by test inversion. In addition, I give simple approximate power formulas for this group sequential setting. The last manuscript further explores how best to approximate the sampling distribution of the Mann-Whitney parameter in terms of the nonparametric Behrens-Fisher problem, an issue that has arisen from the preceding manuscript. In that regard, I explore different variance estimators and a permutation approach that have been proposed in the literature, and also examine slightly modified versions of a small sample t approximation. In all three manuscripts, I carried out simulations for various settings to assess the adequacy of the proposed methods.
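As a small illustration of the quantity these manuscripts revolve around, the Mann-Whitney parameter p = P(X < Y) + 0.5 P(X = Y) can be estimated by counting pairs. The Python sketch below shows only this plug-in point estimate; the function name `mann_whitney_parameter` is hypothetical, and none of the group sequential or small-sample methodology of the dissertation is reproduced.

```python
import numpy as np


def mann_whitney_parameter(x, y):
    """Plug-in estimate of p = P(X < Y) + 0.5 * P(X = Y).

    All pairs (x_i, y_j) are compared; ties count one half.
    A value of 0.5 corresponds to the null hypothesis of the
    nonparametric Behrens-Fisher problem.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    less = (x[:, None] < y[None, :]).mean()
    ties = (x[:, None] == y[None, :]).mean()
    return less + 0.5 * ties
```

For example, `mann_whitney_parameter([1, 2, 3], [2, 3, 4])` counts six pairs with x < y and two ties among nine pairs, giving 7/9.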
2022-01-01T00:00:00Z Spatial and spatio-temporal regression modelling with conditional autoregressive random effects for epidemiological and spatially referenced data http://hdl.handle.net/2003/41006 Title: Spatial and spatio-temporal regression modelling with conditional autoregressive random effects for epidemiological and spatially referenced data Authors: Djeudeu-Deudjui, Dany-Armand Abstract: Regression models are suitable to analyse the association between health outcomes and environmental exposures. However, in urban health studies where spatial and temporal changes are of importance, spatial and spatio-temporal variations are usually neglected. This thesis develops and applies regression methods incorporating latent random effects terms with Conditional Autoregressive (CAR) structures in classical regression models to account for the spatial effects for cross-sectional analysis and spatio-temporal effects for longitudinal analysis. The thesis is divided into two main parts. Firstly, methods to analyse data for which all variables are given on an areal level are considered. The longitudinal Heinz Nixdorf Recall Study is used throughout this thesis for application. The association between the risk of depression and greenness at the district level is analysed. A spatial Poisson model with a latent CAR-structured random effect is applied for selected time points. Then, a sophisticated spatio-temporal extension of the Poisson model yields a negative association between greenness and depression. The findings also suggest strong temporal autocorrelation and weak spatial effects. Even when weak spatial effects suggest that they could be neglected, as in the case of this thesis, spatial and spatio-temporal random effects should be taken into account to provide reliable inference in urban health studies. Secondly, to avoid ecological and atomic fallacies due to data aggregation and disaggregation, all data should be used at the finest spatial level available.
Multilevel Conditional Autoregressive (CAR) models help to simultaneously use all variables at their initial spatial resolution and explain the spatial effect in epidemiological studies. This is especially important where subjects are nested within geographical units. This second part of the thesis has two goals. Essentially, it further develops the multilevel models for longitudinal data by adding existing random effects with CAR structures that change over time. These new models are named MLM tCARs. By comparing the MLM tCARs to the classical multilevel growth model via simulation studies, we observe a better performance of MLM tCARs in retrieving the true regression coefficients and with better fits. The models are comparatively applied on the analysis of the association between greenness and depressive symptoms at the individual level in the longitudinal Heinz Nixdorf Recall Study. The results show again negative association between greenness and depression and a decreasing linear individual time trend for all models. We observe once more very weak spatial variation and moderate temporal autocorrelation. Besides, the thesis provides comprehensive decision trees for analysing data in epidemiological studies for which variables have a spatial background. 2022-01-01T00:00:00Z Resampling-based inference methods for repeated measures data with missing values http://hdl.handle.net/2003/40978 Title: Resampling-based inference methods for repeated measures data with missing values Authors: Amro, Lubna Abstract: The primary objective of this dissertation was to (i) develop novel resampling approaches for handling repeated measures data with missing values, (ii) compare their empirical power against other existing approaches using a Monte Carlo simulation study, and (iii) pinpoint the limitations of some common approaches, particularly for small sample sizes. This dissertation investigates four different statistical problems. 
The first is semiparametric inference for comparing means of matched pairs with missing data in both arms. Therein, we propose two novel randomization techniques: a weighted combination test and a multiplication combination test. They are based upon combining separate results of the permutation versions of the paired t-test and Welch test for the completely observed pairs and the incompletely observed components, respectively. As a second problem, we consider the same setting but with missingness in one arm only. There, we investigate a Wald-type statistic (WTS), an ANOVA-type statistic (ATS), and a modified ANOVA-type statistic (MATS). However, ATS and MATS are not distribution-free under the null hypothesis, and WTS suffers from slow convergence to its limiting χ² distribution. Thus, we develop asymptotic model-based bootstrap versions of these tests. The third problem is on nonparametric rank-based inference for matched pairs with incompleteness in both arms. In this more general setup, the only requirement is that the marginal distributions are not one-point distributions. Therein, we propose novel multiplication combination tests that can handle three different testing problems, including the nonparametric Behrens-Fisher problem (H_0^p: {p = 1/2}). Finally, the fourth problem is nonparametric rank-based inference for incompletely observed factorial designs with repeated measures. Therein, we develop a wild bootstrap approach combined with quadratic form-type test statistics (WTS, ATS, and MATS). These rank-based methods can be applied to both continuous and ordinal or ordered categorical data and (some) allow for singular covariance matrices. In addition to theoretically proving the asymptotic correctness of all the proposed procedures, extensive simulation studies demonstrate their favorable small-sample properties in comparison to classical parametric tests. We also motivate and validate our approaches using real-life data examples from a variety of fields.
2022-01-01T00:00:00Z Optimal designs for comparing regression curves: dependence within and between groups http://hdl.handle.net/2003/40951 Title: Optimal designs for comparing regression curves: dependence within and between groups Authors: Schorning, Kirsten; Dette, Holger Abstract: We consider the problem of designing experiments for the comparison of two regression curves describing the relation between a predictor and a response in two groups, where the data between and within the group may be dependent. In order to derive efficient designs we use results from stochastic analysis to identify the best linear unbiased estimator (BLUE) in a corresponding continuous model. It is demonstrated that in general simultaneous estimation using the data from both groups yields more precise results than estimation of the parameters separately in the two groups. Using the BLUE from simultaneous estimation, we then construct an efficient linear estimator for finite sample size by minimizing the mean squared error between the optimal solution in the continuous model and its discrete approximation with respect to the weights (of the linear estimator). Finally, the optimal design points are determined by minimizing the maximal width of a simultaneous confidence band for the difference of the two regression functions. The advantages of the new approach are illustrated by means of a simulation study, where it is shown that the use of the optimal designs yields substantially narrower confidence bands than the application of uniform designs. 2021-11-26T00:00:00Z Forecasting US inflation using Markov dimension switching http://hdl.handle.net/2003/40940 Title: Forecasting US inflation using Markov dimension switching Authors: Prüser, Jan Abstract: This study considers Bayesian variable selection in the Phillips curve context by using the Bernoulli approach of Korobilis (Journal of Applied Econometrics, 2013, 28(2), 204–230). 
The Bernoulli model, however, is unable to account for model change over time, which is important if the set of relevant predictors changes. To tackle this problem, this paper extends the Bernoulli model by introducing a novel modeling approach called Markov dimension switching (MDS). MDS allows the set of predictors to change over time. It turns out that only a small set of predictors is relevant and that the relevant predictors exhibit a sizable degree of time variation for which the Bernoulli approach is not able to account, stressing the importance and benefit of the MDS approach. In addition, this paper provides empirical evidence that allowing for changing predictors over time is crucial for forecasting inflation. 2020-08-08T00:00:00Z Statistik im Sozialismus http://hdl.handle.net/2003/40856 Title: Statistik im Sozialismus Authors: Krämer, Walter; Leciejewski, Klaus Abstract: We establish a tendency in totalitarian regimes to use official statistical data for propaganda purposes. This is facilitated by an equally obvious tendency among western media to take such figures at face value. However, the big data revolution promises easy checks of such false claims and might help to impede such abuses.
2021-06-18T00:00:00Z Fisher transformation based confidence intervals of correlations in fixed- and random-effects meta-analysis http://hdl.handle.net/2003/40850 Title: Fisher transformation based confidence intervals of correlations in fixed- and random-effects meta-analysis Authors: Welz, Thilo; Doebler, Philipp; Pauly, Markus Abstract: Meta-analyses of correlation coefficients are an important technique to integrate results from many cross-sectional and longitudinal research designs. Uncertainty in pooled estimates is typically assessed with the help of confidence intervals, which can double as hypothesis tests for two-sided hypotheses about the underlying correlation. A standard approach to construct confidence intervals for the main effect is the Hedges-Olkin-Vevea Fisher-z (HOVz) approach, which is based on the Fisher-z transformation. Results from previous studies (Field, 2005, Psychol. Meth., 10, 444; Hafdahl and Williams, 2009, Psychol. Meth., 14, 24), however, indicate that in random-effects models the performance of the HOVz confidence interval can be unsatisfactory. To this end, we propose improvements of the HOVz approach, which are based on enhanced variance estimators for the main effect estimate. In order to study the coverage of the new confidence intervals in both fixed- and random-effects meta-analysis models, we perform an extensive simulation study, comparing them to established approaches. Data were generated via a truncated normal and beta distribution model. The results show that our newly proposed confidence intervals based on a Knapp-Hartung-type variance estimator or robust heteroscedasticity consistent sandwich estimators in combination with the integral z-to-r transformation (Hafdahl, 2009, Br. J. Math. Stat. Psychol., 62, 233) provide more accurate coverage than existing approaches in most scenarios, especially in the more appropriate beta distribution simulation model. 
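For orientation, the basic building block behind these intervals, the Fisher-z confidence interval for a single correlation, can be sketched in Python as follows. This is only the textbook single-study interval with the simple tanh back-transformation; it is not the HOVz pooling, the Knapp-Hartung-type or HC variance estimators, or the integral z-to-r transformation studied in the paper. The function name `fisher_z_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def fisher_z_ci(r, n, level=0.95):
    """Confidence interval for a correlation r from n observations.

    Uses z = atanh(r) with approximate standard error 1 / sqrt(n - 3),
    then maps the interval limits back with tanh.
    """
    if n <= 3:
        raise ValueError("n must be larger than 3")
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    q = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - q * se), math.tanh(z + q * se)
```

Because the interval is built on the z scale and mapped back, its limits always stay inside (-1, 1), which is one reason the transformation is popular for correlations.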
2021-05-02T00:00:00Z Asymptotic-based bootstrap approach for matched pairs with missingness in a single arm http://hdl.handle.net/2003/40837 Title: Asymptotic-based bootstrap approach for matched pairs with missingness in a single arm Authors: Amro, Lubna; Pauly, Markus; Ramosaj, Burim Abstract: Missing values are a common difficulty when dealing with paired data. Several test procedures have been developed in the literature to tackle this problem. Some of them are even robust under deviations and control type-I error quite accurately. However, most of these methods are not applicable when missing values are present only in a single arm. For this case, we provide asymptotically correct resampling tests that are robust under heteroskedasticity and skewed distributions. The tests are based on a meaningful restructuring of all observed information in quadratic form–type test statistics. An extensive simulation study is conducted exemplifying the tests for finite sample sizes under different missingness mechanisms. In addition, illustrative data examples based on real life studies are analyzed. 2021-07-08T00:00:00Z Inference for multivariate and high-dimensional data in heterogeneous designs http://hdl.handle.net/2003/40835 Title: Inference for multivariate and high-dimensional data in heterogeneous designs Authors: Sattler, Paavo Aljoscha Nanosch Abstract: In the presented cumulative thesis, we develop statistical tests to check different hypotheses for multivariate and high-dimensional data. Quadratic forms are a suitable way to obtain scalar test statistics for multivariate problems. The most common are statistics of Wald-type (WTS) or ANOVA-type (ATS) as well as centered and standardized versions of them. Also, [Pauly et al., 2015] and [Chen and Qin, 2010] used such quadratic forms to analyze hypotheses regarding the expectation vector of high-dimensional observations. They made different assumptions, but allowed only one or two groups, respectively.
We expand the approach from [Pauly et al., 2015] for multiple groups, which leads to a multitude of possible asymptotic frameworks allowing even the number of groups to grow. In the considered split-plot-design with normally distributed data, we investigate the asymptotic distribution of the standardized centered quadratic form under different conditions. In most cases, we could show that the individual limit distribution was only received under the specific conditions. For the frequently assumed case of equal covariance matrices, we also widen the considered asymptotic frameworks, since not necessarily the sample sizes of individual groups have to grow. Moreover, we add other cases in which the limit distribution can be calculated. These hold for homoscedasticity of covariance matrices but also for the general case. This expansion of the asymptotic frameworks is one example of how the assumption of homoscedastic covariance matrices allows widening conclusions. Moreover, assuming equal covariance matrices also simplifies calculations or enables us to use a larger statistical toolbox. For the more general issue of testing hypotheses regarding covariance matrices, existing procedures have strict assumptions (e.g. in [Muirhead, 1982], [Anderson, 1984] and [Gupta and Xu, 2006]), test only special hypotheses (e.g. in [Box, 1953]), or are known to have low power (e.g. in [Zhang and Boos, 1993]). We introduce an intuitive approach with fewer restrictions, a multitude of possible null hypotheses, and a convincing small sample approximation. Thereby, nearly every quadratic form known from the mean-based analysis can be used, and two bootstrap approaches are applied to improve their performance. Furthermore, it can be expanded to many other situations like testing hypotheses of correlation matrices or check whether the covariance matrix has a particular structure. 
We investigated the type-I-error for all developed tests and the power to detect deviations from the null hypothesis for small sample sizes up to large ones in extensive simulation studies. 2021-01-01T00:00:00Z Implications on feature detection when using the benefit–cost ratio http://hdl.handle.net/2003/40833 Title: Implications on feature detection when using the benefit–cost ratio Authors: Jagdhuber, Rudolf; Rahnenführer, Jörg Abstract: In many practical machine learning applications, there are two objectives: one is to maximize predictive accuracy and the other is to minimize costs of the resulting model. These costs of individual features may be financial costs, but can also refer to other aspects, for example, evaluation time. Feature selection addresses both objectives, as it reduces the number of features and can improve the generalization ability of the model. If costs differ between features, the feature selection needs to trade-off the individual benefit and cost of each feature. A popular trade-off choice is the ratio of both, the benefit–cost ratio (BCR). In this paper, we analyze implications of using this measure with special focus to the ability to distinguish relevant features from noise. We perform simulation studies for different cost and data settings and obtain detection rates of relevant features and empirical distributions of the trade-off ratio. Our simulation studies exposed a clear impact of the cost setting on the detection rate. In situations with large cost differences and small effect sizes, the BCR missed relevant features and preferred cheap noise features. We conclude that a trade-off between predictive performance and costs without a controlling hyperparameter can easily overemphasize very cheap noise features. While the simple benefit–cost ratio offers an easy solution to incorporate costs, it is important to be aware of its risks. 
Avoiding costs close to 0, rescaling large cost differences, or using a hyperparameter trade-off are ways to counteract the adverse effects exposed in this paper. 2021-06-03T00:00:00Z On MSE-optimal circular crossover designs http://hdl.handle.net/2003/40823 Title: On MSE-optimal circular crossover designs Authors: Neumann, Christoph; Kunert, Joachim Abstract: In crossover designs, each subject receives a series of treatments, one after the other in p consecutive periods. There is concern that the measurement of a subject at a given period might be influenced not only by the direct effect of the current treatment but also by a carryover effect of the treatment applied in the preceding period. Sometimes, the periods of a crossover design are arranged in a circular structure. Before the first period of the experiment itself, there is a run-in period, in which each subject receives the treatment it will receive again in the last period. No measurements are taken during the run-in period. We consider the estimate for direct effects of treatments which is not corrected for carryover effects. If there are carryover effects, this uncorrected estimate will be biased. In that situation, the quality of the estimate can be measured by the mean square error, the sum of the squared bias and the variance. We determine MSE-optimal designs, that is, designs for which the mean square error is as small as possible. Since the optimal design will in general depend on the size of the carryover effects, we also determine the efficiency of some designs compared to the locally optimal design. It turns out that circular neighbour-balanced designs are highly efficient. 
2021-11-12T00:00:00Z Generalized binary vector autoregressive processes http://hdl.handle.net/2003/40814 Title: Generalized binary vector autoregressive processes Authors: Jentsch, Carsten; Reichmann, Lena Abstract: Vector-valued extensions of univariate generalized binary autoregressive (gbAR) processes are proposed that enable the joint modeling of serial and cross-sectional dependence of multivariate binary data. The resulting class of generalized binary vector autoregressive (gbVAR) models is parsimonious, nicely interpretable and also allows negative dependence to be modeled. We provide stationarity conditions and derive moving-average-type representations that allow us to prove geometric mixing properties. Furthermore, we derive general stochastic properties of gbVAR processes, including formulae for transition probabilities. In particular, classical Yule–Walker equations hold that facilitate parameter estimation in gbVAR models. In simulations, we investigate the estimation performance, and for illustration, we apply gbVAR models to particulate matter (PM10, ‘fine dust’) alarm data observed at six monitoring stations in Stuttgart, Germany. 2021-07-28T00:00:00Z K-sign depth: Asymptotic distribution, efficient computation and applications http://hdl.handle.net/2003/40787 Title: K-sign depth: Asymptotic distribution, efficient computation and applications Authors: Malcherczyk, Dennis Abstract: The sign depth corresponds in many situations to the simplicial regression depth, which is closely related to the regression depth introduced by Rousseeuw (1999). This class of depth notions evaluates parameters associated with a statistical model for given data. The regression depth and simplicial regression depth are complicated to compute and to comprehend. In contrast, the sign depth is simply the relative number of ordered tuples of length K with alternating signs. This thesis is structured in three main parts. In the first part (Chapter 3) the asymptotic distribution is derived for arbitrary hyperparameters K. This derivation is based on the proof of Kustosz, Leucht and Müller (2016) for the case K=3. The Master's thesis of Malcherczyk (2018) analyzed this proof and found a considerably simplified version, which also yields a proof for K=4. Chapter 3 continues these results for general K. A substantial step of the proof is the representation of the sign depth as a continuous functional of symmetric random walks on the Skorokhod space. By applying a functional Central Limit Theorem and the Continuous Mapping Theorem, the asymptotic distribution can be obtained. The second part (Chapter 4 and 5) proposes various approaches for efficient computation, since the computational cost of an algorithm based directly on the definition of the sign depth grows polynomially with degree K.
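To make the cost of the definition-based approach concrete, the following is a minimal sketch that counts alternating-sign tuples by brute force. It assumes that "relative number" means the fraction of all index-ordered K-tuples and that no residual is exactly zero; the function name is hypothetical, and this is not the linear-time algorithm from the thesis.

```python
from itertools import combinations
from math import comb


def k_sign_depth_naive(residuals, k):
    """Fraction of index-ordered K-tuples of residuals with alternating signs.

    Brute force over all C(n, K) tuples, so the cost grows polynomially
    with degree K. Assumes no residual is exactly zero.
    """
    signs = [r > 0 for r in residuals]  # True for positive, False for negative
    n = len(signs)
    if k < 2 or n < k:
        raise ValueError("need 2 <= k <= number of residuals")
    alternating = sum(
        all(t[j] != t[j + 1] for j in range(k - 1))
        # combinations() keeps the input order, so each tuple is index-ordered
        for t in combinations(signs, k)
    )
    return alternating / comb(n, k)


depth = k_sign_depth_naive([1.0, -1.0, 1.0, -1.0], 3)
print(depth)
```

For four strictly alternating residuals and K=3, two of the four possible triples alternate, so the sketch returns 0.5; residuals that all share one sign give 0.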
Chapter 4 provides an algorithm based on the asymptotic derivation and Chapter 5 builds on an elementary idea considering the sign block structures of the residuals. The algorithm in Chapter 5 has linear time complexity for arbitrary K. The third part (Chapter 6 and 7) introduces several types of tests based on the sign depth and presents associated simulation studies. One application is model checking and another is testing the randomness of data. Further topics are, for example, the choice of the hyperparameter or the construction of a two-sample test for relevant differences. Moreover, ideas are given for generalized versions of the sign depth based on weighted signs, aiming at higher efficiency in general cases. For example, weighted signs based on Huberized versions of the residuals or signed ranks are introduced. Combined with the results of Horn (2021), who applied the sign depth in high-dimensional models, this lays the foundation for future research. The sign depth has the special property that models can be evaluated only by considering the residual vector. Therefore, nonparametric model classes can be considered as well. 2022-01-01T00:00:00Z On robust estimation of negative binomial INARCH models http://hdl.handle.net/2003/40748 Title: On robust estimation of negative binomial INARCH models Authors: Elsaied, Hanan; Fried, Roland Abstract: We discuss robust estimation of INARCH models for count time series, where each observation conditionally on its past follows a negative binomial distribution with a constant scale parameter, and the conditional mean depends linearly on previous observations. We develop several robust estimators, some of them being computationally fast modifications of methods of moments, and some rather efficient modifications of conditional maximum likelihood. These estimators are compared to related recent proposals using simulations. The usefulness of the proposed methods is illustrated by a real data example.
2021-04-24T00:00:00Z River-mediated dynamic environmental factors and perinatal data analysis http://hdl.handle.net/2003/40584 Title: River-mediated dynamic environmental factors and perinatal data analysis Authors: Rathjens, Jonathan Abstract: Perfluorooctanoic acid (PFOA) and related per- and polyfluoroalkyl substances, a group of man-made persistent organic chemicals employed for many products, are widely distributed in the environment. Adverse health effects may occur even at low exposure levels. A large-scale PFOA contamination of drinking water resources, especially of the river Ruhr, was detected in North Rhine-Westphalia, Germany, in summer 2006. Subsequent measurements are available from the water supply stations along the river and elsewhere. The first state-wide environmental-epidemiological study on the general population analyses these secondary data together with routinely collected perinatal registry data, to estimate possible developmental-toxic effects of PFOA exposure, especially regarding birth weight (BW). Drinking water data are temporally and spatially modelled to assign estimated exposure values to the residents. A generalised linear model with an inverse link deals with the steeply decreasing temporal data pattern at mainly affected stations. Confirmed by a river-wide joint model, the river's segments between the main junctions are the most important factor to explain the spatial structure, besides local effects. Deductions from stations to areal units are made possible via estimated supply proportions. Regression of perinatal data with BW as response usually includes the gestational age (GA) as an important covariate in polynomial form. However, bivariate modelling of BW and GA is recommended to distinguish effects on each, on both, and between them. 
Bayesian distributional copula regression is applied, where the marginals for BW and GA as well as the copula representing their dependence structure are fitted independently and all parameters are estimated conditional on covariates. While a Gaussian is suitable for BW, the skewed GA data are better modelled by the three-parameter Dagum distribution. The Clayton copula performs better than the Gumbel and the symmetric Gaussian copula, although the lower tail dependence is weak. A non-linear trend of BW on GA is detected by the standard polynomial model. Linear effects of biometric and obstetric covariates and also of maternal smoking on BW mean are similar in both models, while the distributional copula regression also reveals effects on all other parameters. The local PFOA exposure is spatio-temporally assigned to the perinatal data of the most affected town of Arnsberg and so included in the regression models. No significant effect is found, and a relatively high amount of noise remains. In future work and for larger regions, this can be addressed by exposure modelling on area level using dependence information, by allowing further asymmetry in the bivariate distribution of BW and GA, and by respecting geographical structures in birth data. 2021-01-01T00:00:00Z Gaussian Process models and global optimization with categorical variables http://hdl.handle.net/2003/40541 Title: Gaussian Process models and global optimization with categorical variables Authors: Kirchhoff, Dominik Abstract: This thesis is concerned with Gaussian Process (GP) models for computer experiments with both numerical and categorical input variables. The Low-Rank Correlation kernel LRCr is introduced for the estimation of the cross-correlation matrix – i.e., the matrix that contains the correlations of the GP given different levels of a categorical variable.
LRCr is a rank-r approximation of the real but unknown cross-correlation matrix and provides two advantages over existing parsimonious correlation kernels: First, it lets the practitioner adapt the number of parameters to be estimated according to the problem at hand by choosing an appropriate rank r. And second, the entries of the estimated cross-correlation matrix are not restricted to non-negative values. Moreover, an approach is presented that can generate a test function with mixed inputs from a test function having only continuous variables. This is done by discretizing (or “slicing”) one of its dimensions. Depending on the function and the slice positions, the slices sometimes happen to be highly positively correlated. By turning some slices in a specific way, the position and value of the global optimum can be preserved while changing the sign of a number of cross-correlations. With these methods, a simulation study is conducted that investigates the estimation accuracy of the cross-correlation matrices as well as the prediction accuracy of the response surface among different correlation kernels. Thereby, the number of points in the initial design of experiments and the number of negative cross-correlations are varied in order to compare their impact on different kernels. We then focus on GP models with mixed inputs in the context of the Efficient Global Optimization (EGO) algorithm. We conduct another simulation study in which the distances of the different kernels' best found solutions to the optimum are compared. Again, the number of points in the initial experimental design is varied. However, the total budget of function evaluations is fixed. The results show that a higher number of EGO iterations tends to be preferable over a larger initial experimental design. Finally, three applications are considered: First, an optimization of hyperparameters of a computer vision algorithm.
Second, an optimization of a logistics production process using a simulation model. And third, a bi-objective optimization of shift planning in a simulated high-bay warehouse, where constraints on the input variables must be met. These applications involve further challenges, which are successfully solved. 2021-01-01T00:00:00Z Essays on cointegration analysis in the state space framework http://hdl.handle.net/2003/40515 Title: Essays on cointegration analysis in the state space framework Authors: Matuschek, Lukas Abstract: Cointegration analysis is by now a standard tool in multivariate time series analysis with application ranging from economics to climate science. It was formalized by Soren Johansen and Katarina Juselius and their co-authors for VAR processes. This dissertation, consisting of three chapters corresponding to three articles written in collaboration with my co-authors Professor Dietmar Bauer, Patrick de Matos Ribeiro and Professor Martin Wagner, extends the cointegration theory to VARMA processes using a representation by state space systems. Chapter 1 focuses on theoretical results regarding the sets of transfer functions corresponding to VARMA systems with similar cointegrating properties, summarized in the so-called state space unit root structure. We develop and discuss different parameterizations for vector autoregressive moving average processes with arbitrary unit roots and (co)integration orders and discuss their topological properties. The general results are exemplified in detail for the empirically most relevant cases, the (multiple frequency or seasonal) I(1) and the I(2) case. In Chapter 2 we show that the Johansen framework for testing hypotheses on the cointegrating ranks and spaces for MFI(1) processes can be extended to the class of VARMA processes and introduce a state space error correction representation. 
The estimated cointegrating vectors are asymptotically mixed Gaussian and pseudo likelihood ratio tests for the cointegrating ranks have the same distributions under the null hypothesis in the VARMA case as in the VAR case. In a simulation study our tests outperform the tests by Johansen and Schaumburg in small samples. In Chapter 3 we develop estimation and inference techniques for I(2) cointegrated VARMA processes cast in state space format. We show consistency and derive the asymptotic distributions of estimators maximizing the Gaussian pseudo likelihood function. Furthermore, we discuss hypothesis tests for the state space unit root structure, leading to the well-known limiting distributions for VAR I(2) processes. Again, a small simulation study shows favorable results for small samples, with our test leading to better performance in determining these integer parameters. 2021-01-01T00:00:00Z Multimodale Likelihood-Funktionen in Mischverteilungsmodellen http://hdl.handle.net/2003/40489 Title: Multimodale Likelihood-Funktionen in Mischverteilungsmodellen Authors: Jastrow, Malte Abstract: Mixture models generally serve to fit composite distributions to data in which individual groups of observations follow different distributions. By modelling group membership as a latent variable, these models are also a popular method for cluster analysis (unsupervised learning). The groups to which observations are to be assigned are represented by differently parameterized distribution components. The distribution parameters of the individual components, as well as their mixing proportions, can be estimated by the maximum likelihood principle. As described in the literature, the likelihood function can have numerous optima even for a mixture of two normal components if the underlying variances differ strongly.
In this dissertation, the problem of multimodality is first illustrated graphically for mixtures of various distributions. Subsequently, the influence of the underlying parameters of the mixture models is investigated systematically. It turns out that multimodality increases substantially with the distance between the variance parameters of the two mixture components. An extensive simulation study examines how well the commonly used EM algorithm can optimize normal mixtures with likelihoods of varying complexity. It turns out that EM has an advantage over general black-box optimization algorithms that pursue special strategies for overcoming local optima, because the concrete assignment of the data to the distribution components used in each step considerably simplifies the objective function. In addition, the method of cluster starting points for EM is proposed as a practically relevant way to identify as many local optima of a multimodal likelihood function as possible. This works considerably better than the frequently practiced use of random starting points for EM and can make a decisive contribution to assessing a global optimization result in practice. 2021-01-01T00:00:00Z Sign depth for parameter tests in multiple regression http://hdl.handle.net/2003/40483 Title: Sign depth for parameter tests in multiple regression Authors: Horn, Melanie Abstract: This thesis deals with the question of how the sign depth test can be applied in the case of multiple regression. Because the result of this test depends on the ordering of the residuals and in most cases no inherent order is available for multidimensional values, suitable methods for ordering these values are needed.
In this thesis 13 different ordering methods are described, analyzed and compared with respect to characteristics, computational behavior and performance when using them in the context of sign depth tests. For the latter, several simulations of power functions for many different settings have been carried out. In the simulations different data situations as well as different multiple regression models and different parameters of the sign depth were examined. It is shown in this thesis that a group of so-called “distance based ordering methods” performs best and leads to satisfactory results of the sign depth test. Compared to other tests for regression parameters, like the Wald test or the classical sign test, the sign depth test also performs satisfactorily, and especially in the case of model checks it performs clearly better. In addition, this thesis describes the contents and functionality of the R package GSignTest which was written for this thesis and contains implementations of the sign depth, the sign depth test and the different ordering methods. 2021-01-01T00:00:00Z Flexible instrumental variable distributional regression http://hdl.handle.net/2003/40358 Title: Flexible instrumental variable distributional regression Authors: Briseño Sanchez, Guillermo; Hohberg, Maike; Groll, Andreas; Kneib, Thomas Abstract: We tackle two limitations of standard instrumental variable regression in experimental and observational studies: estimation restricted to the conditional mean of the outcome and the assumption of a linear relationship between regressors and outcome. More flexible regression approaches that solve these limitations have already been developed but have not yet been adopted in causality analysis. The paper develops an instrumental variable estimation procedure building on the framework of generalized additive models for location, scale and shape.
This enables modelling all distributional parameters of potentially complex response distributions and non-linear relationships between the explanatory variables, instrument and outcome. The approach shows good performance in simulations and is applied to a study that estimates the effect of rural electrification on the employment of females and males in the South African province of KwaZulu-Natal. We find positive marginal effects on the mean employment rates of females, negative effects for the employment of males and a reduced conditional standard deviation for both, indicating homogenization in employment rates due to the electrification programme. Although none of the effects are statistically significant, the application demonstrates the potential of using generalized additive models for location, scale and shape in instrumental variable regression both to account for endogeneity and to estimate treatment effects beyond the mean. 2020-08-16T00:00:00Z Risk and Return of the Tontine: A Brief Discussion http://hdl.handle.net/2003/40356 Title: Risk and Return of the Tontine: A Brief Discussion Authors: Pflaumer, Peter Abstract: This article analyzes the stochastic aspects of a tontine using a Gompertz distribution. In particular, the probabilistic and demographic risks of a tontine investment are examined. The expected value and variance of tontine payouts are calculated. Both parameters increase with age. The stochastic present value of a tontine payout is compared with the present value of a fixed annuity. It is shown that the tontine is more profitable than an annuity only at very high ages. Finally, the demographic risks associated with a tontine are discussed. Elasticities are used to calculate the impact of changes in modal age on the tontine payout. It is shown that the tontine payout is very sensitive to changes in modal age.
2021-07-16T00:00:00Z Statistical approaches for calculating alert concentrations from cytotoxicity and gene expression data http://hdl.handle.net/2003/40277 Title: Statistical approaches for calculating alert concentrations from cytotoxicity and gene expression data Authors: Kappenberg, Franziska Abstract: In this thesis, three different topics regarding the calculation of alert concentrations are considered. In toxicology, an alert concentration is the concentration where the response variable of interest attains or exceeds a pre-specified threshold. The first topic, handling deviating control values, considers cytotoxicity data. Often, response values for the lowest tested concentrations and the negative control do not coincide. This leads to the inability to properly interpret or even calculate the concentration where the curve attains a pre-specified percentage. Four different methods are proposed and compared in a controlled simulation study. All of these methods are based on the family of log-logistic functions. Based on the results from this simulation study, a concrete algorithm is stated that specifies which method to use in which case. The second topic is called identification of alert concentrations and considers gene expression data. Four methods to calculate specific alert concentrations are compared in a controlled simulation study, two based on the discrete observations only and two based on a parametric model fit; in each pair, one method takes significance into account and one considers absolute exceedance of the threshold only. Results show that, in general, the methods based on curve modelling overestimate the true underlying alert concentrations less drastically, while the number of alerts at too low concentrations does not exceed the significance level. The third topic aims at improving the estimation of the parameter in a 4pLL model corresponding to the half-maximal effect by sharing information across genes.
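As a rough illustration of how an alert concentration follows from a fitted curve, the sketch below evaluates a four-parameter log-logistic (4pLL) function and inverts it at a threshold. The parameterization (slope b, asymptotes c and d, half-maximal concentration e) is one common convention and is an assumption here, as are the function names and the example numbers; the thesis may define the model differently.

```python
def four_pll(x, b, c, d, e):
    """4pLL curve under an assumed parameterization.

    For b > 0 the response tends to d as x -> 0 and to c as x -> infinity;
    e is the concentration with response halfway between c and d.
    """
    return c + (d - c) / (1.0 + (x / e) ** b)


def alert_concentration(threshold, b, c, d, e):
    """Concentration at which the 4pLL curve attains the given threshold."""
    if not min(c, d) < threshold < max(c, d):
        raise ValueError("threshold must lie strictly between the asymptotes")
    return e * ((d - c) / (threshold - c) - 1.0) ** (1.0 / b)


# Hypothetical viability curve falling from 100 % to 0 %, half-maximal at 10
params = dict(b=2.0, c=0.0, d=100.0, e=10.0)
ec20 = alert_concentration(80.0, **params)  # 20 % reduction relative to d
print(ec20, four_pll(ec20, **params))
```

With these invented parameters the curve reaches 80 % at concentration 5, and evaluating the curve at the half-maximal parameter e gives the midpoint 50 %.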
Two approaches are presented: The first approach is to conduct a meta-analysis for estimates of this parameter for all genes that are 'similar' to each other. The second method makes use of an empirical Bayes procedure to effectively calculate a weighted mean between the individually observed value and the mean of all observed parameter values for a large dataset. The meta-analysis approach performs worse than directly estimating the parameter of interest, whereas the empirical Bayes method improves on the direct estimate in terms of the MSE. 2021-01-01T00:00:00Z Handling deviating control values in concentration-response curves http://hdl.handle.net/2003/40227 Title: Handling deviating control values in concentration-response curves Authors: Kappenberg, Franziska; Brecklinghaus, Tim; Albrecht, Wiebke; Blum, Jonathan; van der Wurp, Carola; Leist, Marcel; Hengstler, Jan G.; Rahnenführer, Jörg Abstract: In cell biology, pharmacology and toxicology, dose-response and concentration-response curves are frequently fitted to data with statistical methods. Such fits are used to derive quantitative measures (e.g. EC20 values) describing the relationship between the concentration of a compound or the strength of an intervention applied to cells and its effect on viability or function of these cells. Often, a reference, called negative control (or solvent control), is used to normalize the data. The negative control data sometimes deviate from the values measured for low (ineffective) test compound concentrations. In such cases, normalization of the data with respect to control values leads to biased estimates of the parameters of the concentration-response curve. Low quality estimates of effective concentrations can be the consequence. In a literature study, we found that this problem occurs in a large percentage of toxicological publications. We propose different strategies to tackle the problem, including complete omission of the controls.
Data from a controlled simulation study indicate the best-suited problem solution for different data structure scenarios. This was further exemplified by a real concentration-response study. We provide the following recommendations how to handle deviating controls: (1) The log-logistic 4pLL model is a good default option. (2) When there are at least two concentrations in the no-effect range, low variances of the replicate measurements, and deviating controls, control values should be omitted before fitting the model. (3) When data are missing in the no-effect range, the Brain-Cousens model sometimes leads to better results than the default model. 2020-09-23T00:00:00Z Limit theorems for locally stationary processes http://hdl.handle.net/2003/40226 Title: Limit theorems for locally stationary processes Authors: Kawka, Rafael Abstract: We present limit theorems for locally stationary processes that have a one sided time-varying moving average representation. In particular, we prove a central limit theorem (CLT), a weak and a strong law of large numbers (WLLN, SLLN) and a law of the iterated logarithm (LIL) under mild assumptions using a time-varying Beveridge–Nelson decomposition. 2020-10-01T00:00:00Z Comparison of random‐effects meta‐analysis models for the relative risk in the case of rare events - a simulation study http://hdl.handle.net/2003/40206 Title: Comparison of random‐effects meta‐analysis models for the relative risk in the case of rare events - a simulation study Authors: Beisemann, Marie; Doebler, Philipp; Holling, Heinz Abstract: Pooling the relative risk (RR) across studies investigating rare events, for example, adverse events, via meta‐analytical methods still presents a challenge to researchers. The main reason for this is the high probability of observing no events in treatment or control group or both, resulting in an undefined log RR (the basis of standard meta‐analysis). 
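The undefined log RR mentioned above can be seen in a few lines. The sketch below computes the log relative risk for one study and shows the effect of a continuity correction; adding 0.5 to every cell is one common choice and is an assumption here, not necessarily one of the corrections compared in the paper. The function name and the counts are hypothetical.

```python
import math


def log_relative_risk(events_t, n_t, events_c, n_c, cc=0.0):
    """Log relative risk of treatment vs. control for one study.

    cc is a continuity correction added to every cell of the 2x2 table,
    so each group size grows by 2 * cc. Returns None when the log RR is
    undefined because a (corrected) event count is zero.
    """
    a = events_t + cc
    c = events_c + cc
    if a == 0 or c == 0:
        return None  # log(0) or division by zero: log RR undefined
    risk_t = a / (n_t + 2 * cc)
    risk_c = c / (n_c + 2 * cc)
    return math.log(risk_t / risk_c)


# Hypothetical rare-event study: 0/100 events under treatment, 2/100 under control
uncorrected = log_relative_risk(0, 100, 2, 100)
corrected = log_relative_risk(0, 100, 2, 100, cc=0.5)
print(uncorrected, corrected)
```

Without correction the study has no defined log RR and would have to be dropped; with the correction it contributes log(0.2), a value driven largely by the chosen constant.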
Other technical challenges ensue, for example, the violation of normality assumptions, or bias due to exclusion of studies and application of continuity corrections, leading to poor performance of standard approaches. In the present simulation study, we compared three recently proposed alternative models (random‐effects [RE] Poisson regression, RE zero‐inflated Poisson [ZIP] regression, binomial regression) to the standard methods in conjunction with different continuity corrections and to different versions of beta‐binomial regression. Based on our investigation of the models' performance in 162 different simulation settings informed by meta‐analyses from the Cochrane database and distinguished by different underlying true effects, degrees of between‐study heterogeneity, numbers of primary studies, group size ratios, and baseline risks, we recommend the use of the RE Poisson regression model. The beta‐binomial model recommended by Kuss (2015) also performed well. Decent performance was also exhibited by the ZIP models, but they also had considerable convergence issues. We stress that these recommendations are only valid for meta‐analyses with larger numbers of primary studies. All models are applied to data from two Cochrane reviews to illustrate differences between and issues of the models. Limitations as well as practical implications and recommendations are discussed; a flowchart summarizing recommendations is provided. 2020-06-08T00:00:00Z Prediction intervals for load‐sharing systems in accelerated life testing http://hdl.handle.net/2003/40203 Title: Prediction intervals for load‐sharing systems in accelerated life testing Authors: Leckey, Kevin; Müller, Christine H.; Szugat, Sebastian; Maurer, Reinhard Abstract: Based on accelerated lifetime experiments, we consider the problem of constructing prediction intervals for the time point at which a given number of components of a load‐sharing system fails. 
Our research is motivated by lab experiments with prestressed concrete beams where the tension wires fail successively. Due to an audible noise when breaking, the time points of failure could be determined exactly by acoustic measurements. Under the assumption of equal load sharing between the tension wires, we present a model for the failure times based on a birth process. We provide a model check based on a Q‐Q plot including a simulated simultaneous confidence band and four simulation‐free prediction methods. Three of the prediction methods are given by confidence sets where two of them are based on classical tests and the third is based on a new outlier‐robust test using sign depth. The fourth method uses the implicit function theorem and the δ‐method to get prediction intervals without confidence sets for the unknown parameter. We compare these methods by a leave‐one‐out analysis of the data on prestressed concrete beams. Moreover, a simulation study is performed to discuss advantages and drawbacks of the individual methods. 2020-06-25T00:00:00Z Optimale Cross-over Designs zur Maximum-Likelihood-Schätzung im Cox-Modell bei Typ-I Zensierungen http://hdl.handle.net/2003/40202 Title: Optimale Cross-over Designs zur Maximum-Likelihood-Schätzung im Cox-Modell bei Typ-I Zensierungen Authors: Urbanik, Sarah Maria Abstract: In previous work, the determination of optimal designs for cross-over experiments has mostly concerned data that can be analyzed with simple linear models. This thesis focuses on event times as the response variable. These event times are assumed, first, to be subject to type-I censoring and, second, to be adequately fitted by the Cox model. The response function considered is described by a functional relationship of a direct treatment effect, a simple carry-over effect, and block and period effects.
The parameters of the Cox model can be estimated by maximum likelihood, which yields the information matrix of the direct treatment effects. This information matrix serves as the basis for the search for optimal designs. In contrast to responses that can be explained by a simple linear model, the central problem of the thesis is that, under censoring, the quality of a design depends on the true model parameters. As a consequence, only locally optimal designs can be determined. The dissertation seeks designs that are as close to optimal as possible when the differences between treatments are hard to detect. Furthermore, various assumptions on the model parameters are made that lead to statements about the asymptotic efficiency properties of designs. For finite sample sizes, the efficiencies of particular block designs are then examined in simulation studies. The results of the thesis serve as a basis for recommendations on conducting cross-over experiments with event times. In particular, the findings indicate that in many situations optimal designs for simple linear models can be highly efficient. 2020-01-01T00:00:00Z On the time-varying effects of economic policy uncertainty on the US economy http://hdl.handle.net/2003/40185 Title: On the time-varying effects of economic policy uncertainty on the US economy Authors: Prüser, Jan; Schlösser, Alexander Abstract: We study the impact of Economic Policy Uncertainty (EPU) on the US Economy by using a VAR with time‐varying coefficients. The coefficients are allowed to evolve gradually over time which allows us to discover structural changes without imposing them a priori.
We find three different regimes, which match the three major periods of the US economy, namely the Great Inflation, the Great Moderation and the Great Recession. The initial impact on real GDP ranges between −0.2% for the Great Inflation and Great Recession and −0.15% for the Great Moderation. In addition, the adverse effects of EPU are more persistent during the Great Recession providing an explanation for the slow recovery. This regime dependence is unique for EPU as the macroeconomic consequences of Financial Uncertainty turn out to be rather time invariant. 2020-09-11T00:00:00Z On variance estimation under shifts in the mean http://hdl.handle.net/2003/40121 Title: On variance estimation under shifts in the mean Authors: Axt, Ieva; Fried, Roland Abstract: In many situations, it is crucial to estimate the variance properly. Ordinary variance estimators perform poorly in the presence of shifts in the mean. We investigate an approach based on non-overlapping blocks, which yields good results in change-point scenarios. We show the strong consistency and the asymptotic normality of such blocks-estimators of the variance under independence. Weak consistency is shown for short-range dependent strictly stationary data. We provide recommendations on the appropriate choice of the block size and compare this blocks-approach with difference-based estimators. If level shifts occur frequently and are rather large, the best results can be obtained by adaptive trimming of the blocks. 
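As a rough illustration of the blocks idea in the abstract above, the following Python sketch averages the sample variances of consecutive non-overlapping blocks. This is one simple reading of a blocks-estimator, not necessarily the authors' exact definition; the function name blocks_variance, the block size of 20 and the simulated series with a single level shift are assumptions made for this example.

```python
import random
import statistics


def blocks_variance(x, block_size):
    """Average the sample variances of consecutive non-overlapping blocks.

    Hypothetical helper for illustration; trailing observations that do not
    fill a complete block are ignored.
    """
    if block_size < 2:
        raise ValueError("block_size must be at least 2")
    n_blocks = len(x) // block_size
    if n_blocks == 0:
        raise ValueError("series is shorter than one block")
    block_vars = [
        statistics.variance(x[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    ]
    return statistics.fmean(block_vars)


# Simulated data (assumption): unit-variance noise with a level shift of 5
# halfway through the series.
rng = random.Random(0)
series = [rng.gauss(0.0, 1.0) for _ in range(500)]
series += [rng.gauss(5.0, 1.0) for _ in range(500)]

naive = statistics.variance(series)            # inflated by the level shift
blocked = blocks_variance(series, block_size=20)  # close to the noise variance
print(f"ordinary estimator: {naive:.2f}, blocks estimator: {blocked:.2f}")
```

In this toy setting the shift falls on a block boundary, so no block is contaminated; with a shift inside a block, only that one block would be affected, which is the motivation for trimming mentioned in the abstract.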
2020-04-01T00:00:00Z Spatial and temporal analyses of perfluorooctanoic acid in drinking water for external exposure assessment in the Ruhr metropolitan area, Germany http://hdl.handle.net/2003/40120 Title: Spatial and temporal analyses of perfluorooctanoic acid in drinking water for external exposure assessment in the Ruhr metropolitan area, Germany Authors: Rathjens, Jonathan; Becker, Eva; Kolbe, Arthur; Ickstadt, Katja; Hölzer, Jürgen Abstract: Perfluorooctanoic acid (PFOA) and related chemicals among the per- and polyfluoroalkyl substances are widely distributed in the environment. Adverse health effects may occur even at low exposure levels. A large-scale contamination of drinking water resources, especially the rivers Möhne and Ruhr, was detected in North Rhine-Westphalia, Germany, in summer 2006. As a result, concentration data are available from the water supply stations along these rivers and partly from the water network of areas supplied by them. Measurements started after the contamination’s discovery. In addition, there are sparse data from stations in other regions. Further information on the supply structure (river system, station-to-area relations) and expert statements on contamination risks are available. Within the first state-wide environmental-epidemiological study on the general population, these data are temporally and spatially modelled to assign estimated exposure values to the resident population. A generalized linear model with an inverse link offers consistent temporal approaches to model each station’s PFOA data along the river Ruhr and copes with a steeply decreasing temporal data pattern at mainly affected locations. The river’s segments between the main junctions are the most important factor to explain the spatial structure, besides local effects. Deductions from supply stations to areas and, therefore, to the residents’ risk are possible via estimated supply proportions. 
The resulting potential correlation structure of the supply areas is dominated by the common water supply from the Ruhr. Other areas are often isolated and, therefore, need to be modelled separately. The contamination is homogeneous within most of the areas. 2020-12-04T00:00:00Z Exchange rate pass-through to import prices in Europe http://hdl.handle.net/2003/40105 Title: Exchange rate pass-through to import prices in Europe Authors: Arsova, Antonia Abstract: This paper takes a panel cointegration approach to the estimation of short- and long-run exchange rate pass-through (ERPT) to import prices in the European countries. Although economic theory suggests a long-run relationship between import prices and exchange rate, in recent empirical studies its existence has either been overlooked or it has proven difficult to establish. Resorting to novel tests for panel cointegration, we find support for the equilibrium relationship hypothesis. Exchange rate pass-through elasticities, estimated by two different techniques for cointegrated panel regressions, give insight into the most recent development of the ERPT. 2020-04-10T00:00:00Z Generalised joint regression for count data http://hdl.handle.net/2003/40103 Title: Generalised joint regression for count data Authors: Wurp, Hendrik van der; Groll, Andreas; Kneib, Thomas; Marra, Giampiero; Radice, Rosalba Abstract: We propose a versatile joint regression framework for count responses. The method is implemented in the R add-on package GJRM and allows for modelling linear and non-linear dependence through the use of several copulae. Moreover, the parameters of the marginal distributions of the count responses and of the copula can be specified as flexible functions of covariates. Motivated by competitive settings, we also discuss an extension which forces the regression coefficients of the marginal (linear) predictors to be equal via a suitable penalisation. 
Model fitting is based on a trust region algorithm which simultaneously estimates all the parameters of the joint models. We investigate the proposal’s empirical performance in two simulation studies, the first one designed for arbitrary count data, the other one reflecting competitive settings. Finally, the method is applied to football data, showing its benefits compared to the standard approach with regard to predictive performance. 2020-06-25T00:00:00Z Streaming statistical models via Merge & Reduce http://hdl.handle.net/2003/40091 Title: Streaming statistical models via Merge & Reduce Authors: Geppert, Leo N.; Ickstadt, Katja; Munteanu, Alexander; Sohler, Christian Abstract: Merge & Reduce is a general algorithmic scheme in the theory of data structures. Its main purpose is to transform static data structures—that support only queries—into dynamic data structures—that allow insertions of new elements—with as little overhead as possible. This can be used to turn classic offline algorithms for summarizing and analyzing data into streaming algorithms. We transfer these ideas to the setting of statistical data analysis in streaming environments. Our approach is conceptually different from previous settings where Merge & Reduce has been employed. Instead of summarizing the data, we combine the Merge & Reduce framework directly with statistical models. This enables performing computationally demanding data analysis tasks on massive data sets. The computations are divided into small tractable batches whose size is independent of the total number of observations n. The results are combined in a structured way at the cost of a bounded O(log n) factor in their memory requirements. It is only necessary, though nontrivial, to choose an appropriate statistical model and design merge and reduce operations on a casewise basis for the specific type of model.
We illustrate our Merge & Reduce schemes on simulated and real-world data employing (Bayesian) linear regression models, Gaussian mixture models and generalized linear models. 2020-06-12T00:00:00Z Penalized quasi-maximum likelihood estimation for extreme value models with application to flood frequency analysis http://hdl.handle.net/2003/40074 Title: Penalized quasi-maximum likelihood estimation for extreme value models with application to flood frequency analysis Authors: Bücher, Axel; Lilienthal, Jona; Kinsvater, Paul; Fried, Roland Abstract: A common statistical problem in hydrology is the estimation of annual maximal river flow distributions and their quantiles, with the objective of evaluating flood protection systems. Typically, record lengths are short and estimators imprecise, so that it is advisable to exploit additional sources of information. However, there is often uncertainty about the adequacy of such information, and a strict decision on whether to use it is difficult. We propose penalized quasi-maximum likelihood estimators to overcome this dilemma, allowing one to push the model towards a reasonable direction defined a priori. We are particularly interested in regional settings, with river flow observations collected at multiple stations. To account for regional information, we introduce a penalization term inspired by the popular Index Flood assumption. Unlike in standard approaches, the degree of regionalization can be controlled gradually instead of deciding between a local or a regional estimator. Theoretical results on the consistency of the estimator are provided and extensive simulations are performed for comparison with other local and regional estimators. The proposed procedure yields very good results for both homogeneous and heterogeneous groups of sites. A case study consisting of sites in Saxony, Germany, illustrates the applicability to real data.
2020-06-03T00:00:00Z Dieter Rasch, Rob Verdooren and Jürgen Pilz: Applied statistics: theory and problem solutions with R http://hdl.handle.net/2003/40072 Title: Dieter Rasch, Rob Verdooren and Jürgen Pilz: Applied statistics: theory and problem solutions with R Authors: Krämer, Walter 2020-02-06T00:00:00Z QANOVA: quantile-based permutation methods for general factorial designs http://hdl.handle.net/2003/40061 Title: QANOVA: quantile-based permutation methods for general factorial designs Authors: Ditzhaus, Marc; Fried, Roland; Pauly, Markus Abstract: Population means and standard deviations are the most common estimands to quantify effects in factorial layouts. In fact, most statistical procedures in such designs are built toward inferring means or contrasts thereof. For more robust analyses, we consider the population median, the interquartile range (IQR) and more general quantile combinations as estimands in which we formulate null hypotheses and calculate compatible confidence regions. Based upon simultaneous multivariate central limit theorems and corresponding resampling results, we derive asymptotically correct procedures in general, potentially heteroscedastic, factorial designs with univariate endpoints. Special cases cover robust tests for the population median or the IQR in arbitrary crossed one-, two- and higher-way layouts with potentially heteroscedastic error distributions. In extensive simulations, we analyze their small sample properties and also conduct an illustrating data analysis comparing children’s height and weight from different countries. 2021-02-24T00:00:00Z Pseudo maximum likelihood estimation of cointegrated multiple frequency I(1) VARMA processes using the state space framework http://hdl.handle.net/2003/40060 Title: Pseudo maximum likelihood estimation of cointegrated multiple frequency I(1) VARMA processes using the state space framework Authors: Ribeiro, Patrick de Matos Abstract: Since the seminal contribution of Clive W.J. 
Granger that introduced the concept of cointegration, the modeling of multivariate (economic) time series with models and methods that allow for unit roots and cointegration has become standard econometric practice with applications ranging from macroeconomics through finance to climate science. With some early exceptions, most authors focus on the VAR framework, most notably Johansen, who developed vector error correction models for the empirically most relevant cases, the I(1) and the I(2) case. Limiting cointegration analysis to VAR processes may be too restrictive. For several reasons discussed in this thesis it may be advantageous to use the more general VARMA framework. However, cointegration analysis in the VARMA framework is complicated, in particular in the case of higher integration orders or multiple unit roots. One possibility to overcome the difficulties for the cointegration analysis of VARMA processes is the usage of the state space framework. This dissertation provides important tools for cointegration analysis in the state space framework, namely a continuous parameterization and a pseudo maximum likelihood estimator for the multiple frequency I(1) case. Chapter 1 discusses the parameterization of state space processes of arbitrary integration orders. Since the state space representation of a stochastic process is not unique, a canonical form is necessary which selects one unique state space representation. Since this canonical form places restrictions on the system matrices, not all entries of the matrices are free parameters. Some entries are restricted to be zero or depend on other entries. The parametrization is based on the canonical form of Bauer and Wagner (2012), which is particularly well suited for cointegration analysis. Since there is no continuous parameterization for all state space systems of a given system order, we partition the set of all systems into subsets on which a continuous parameterization is possible.
For this we use a multi-index which is chosen in such a way that properties like the unit roots, integration orders and dimensions of the cointegrating spaces remain constant in each subset. In addition to deriving a continuous parametrization, which is almost everywhere continuously invertible, we find a generic subset which is open and dense in the set of all integrated processes with a state space representation of a given system order. Additionally, we discuss the topological structure of the subsets, defining a partial ordering of the multi-indices. Finally, we discuss the implementation of hypotheses on the cointegrating ranks and spaces in the parametrization for the empirically most relevant cases, the multiple frequency I(1) and the I(2) case. We show that all hypotheses commonly tested for VAR processes in these cases can be implemented in the state space framework. This potentially allows for the derivation of pseudo likelihood ratio tests for these hypotheses. Chapter 2 examines pseudo maximum likelihood estimation for multiple frequency I(1) processes. We derive the likelihood function for MFI(1) processes and show that the pseudo maximum likelihood estimator is consistent under relatively mild conditions. Additionally, we show that setting the starting values of the state process to zero does not affect the asymptotic properties of the pseudo maximum likelihood estimator. For the case of a correctly chosen multi-index we additionally derive the asymptotic distribution of the pseudo maximum likelihood estimator, providing the groundwork for pseudo likelihood ratio tests. Finally, Chapter 3 consists of a useful tutorial for the analysis of economic time series using the state space framework. Using the analysis of King, Plosser, Stock and Watson (1991) as an illustrative example, we demonstrate that all economically relevant questions examined by these authors can also be analyzed using the state space framework.
The analysis of King, Plosser, Stock and Watson (1991) is based on quarterly US economic data from 1949 to 1988. We compare the methods developed for the state space framework, namely the pseudo maximum likelihood estimator from Chapter 2 and the tests based upon it to the methods used by King, Plosser, Stock and Watson (1991), i.e., the DOLS estimator of and the tests for the cointegrating rank of Stock and Watson (1988) and to the vector error correction model for I(1) processes by Johansen (1995). The results obtained with the three different approaches differ, which indicates that the results of empirical applications to time series of dimension six or more with sample sizes below two or three hundred should be interpreted with care. Additionally, we test the robustness of the vector error correction model and the state space framework by repeating the analysis on an extended data set with quarterly US economic data from 1949 to 2018 and on the subset with data from 1989 to 2018. The results of both approaches differ for the three data sets. This may be a hint that there are structural breaks in the economic time series. 2020-01-01T00:00:00Z Online Diskriminanzanalyse für Datensituationen mit Concept Drift http://hdl.handle.net/2003/40038 Title: Online Diskriminanzanalyse für Datensituationen mit Concept Drift Authors: Schnackenberg, Sarah Anna Abstract: As data streams increasingly take the place of batch data, online algorithms are becoming ever more important. A key property of data streams is that the distribution underlying the observations can change over time. The term concept drift has become established for such situations. The dissertation focuses on discriminant analysis as one possible classification method.
Many previously published algorithms for online discriminant analysis have in common that, although they allow adaptation to a concept drift, they do not account for a continuously progressing change in the distribution, so that outdated (and therefore biased) estimators enter the classification rule used for prediction. The dissertation develops a methodology for extending online discriminant analysis methods in order to improve predictive performance in data situations with concept drift. For this extension, the concept drift is suitably modelled and forecast. A linear trend of the mean vectors over time is assumed and modelled with local linear regression. In this way, the mean vectors at the next time point can be forecast continuously for each class. These forecasts continually replace the previous estimators in the respective classification rule of the online discriminant analysis, so as to ensure better predictions for observations at the following time point. Because local linear regression models are local, non-linear trends can also be suitably approximated linearly. For special cases, it is proved that the estimators of the class mean vectors in the extended methods are unbiased for the mean vectors of the distribution at the time of prediction. The theoretical results are supported and extended by an extensive simulation study. For the evaluation, data streams with different types and strengths of concept drift are simulated as instances of the infinite space of all possible data situations with concept drift. The original and the extended methods are compared on these data situations with respect to predictive performance.
The predictive performance of the classifiers can be improved by extending the methods under a wide variety of forms of concept drift. 2020-01-01T00:00:00Z Integration of feature selection stability in model fitting http://hdl.handle.net/2003/40023 Title: Integration of feature selection stability in model fitting Authors: Bommert, Andrea Martina Abstract: In this thesis, four aspects connected to feature selection are analyzed: Firstly, a benchmark of filter methods for feature selection is conducted. Secondly, measures for the assessment of feature selection stability are compared both theoretically and empirically. Some of the stability measures are newly defined. Thirdly, a multi-criteria approach for obtaining desirable models with respect to predictive accuracy, feature selection stability, and sparsity is proposed and evaluated. Fourthly, an approach for finding desirable models for data sets with many similar features is suggested and evaluated. For the benchmark of filter methods, 20 filter methods are analyzed. First, the filter methods are compared with respect to the order in which they rank the features and with respect to their scaling behavior, identifying groups of similar filter methods. Next, the predictive accuracy of the filter methods when combined with a predictive model and the run time are analyzed, resulting in recommendations on filter methods that work well on many data sets. To identify suitable measures for stability assessment, 20 stability measures are compared based on both theoretical properties and on their empirical behavior. Five of the measures are newly proposed by us. Groups of stability measures that consider the same feature sets as stable or unstable are identified and the impact of the number of selected features on the stability values is studied. Additionally, the run times for calculating the stability measures are analyzed.
Based on all analyses, recommendations on which stability measures should be used in future analyses are made. When searching for a good predictive model, the predictive accuracy is usually the only criterion considered in the model finding process. In this thesis, the benefits of additionally considering the feature selection stability and the number of selected features are investigated. To find desirable configurations with respect to all three performance criteria, the hyperparameter tuning is performed in a multi-criteria fashion. This way, it is possible to find configurations that perform a more stable selection of fewer features without losing much predictive accuracy compared to model fitting only considering the predictive performance. Also, with multi-criteria tuning, models are obtained that over-fit the training data less than the models obtained with single-criteria tuning only with respect to predictive accuracy. For data sets with many similar features, we propose the approach of employing L0-regularized regression and tuning its hyperparameter in a multi-criteria fashion with respect to both predictive accuracy and feature selection stability. We suggest assessing the stability with an adjusted stability measure, that is, a stability measure that takes into account similarities between features. The approach is evaluated based on both simulated and real data sets. Based on simulated data, it is observed that the proposed approach achieves the same or better predictive performance compared to established approaches. In contrast to the competing approaches, the proposed approach succeeds at selecting the relevant features while avoiding irrelevant or redundant features. On real data, the proposed approach is beneficial for fitting models with fewer features without losing predictive accuracy. 
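To make the notion of feature selection stability in the abstract above concrete, here is a minimal Python sketch of one widely used unadjusted measure: the mean pairwise Jaccard index of the feature sets selected on different resamples. It is offered only as an illustration and is not claimed to be one of the measures recommended in the thesis; the function name jaccard_stability and the toy feature sets are assumptions.

```python
from itertools import combinations


def jaccard_stability(feature_sets):
    """Mean pairwise Jaccard index over a collection of selected feature sets.

    Returns 1.0 when every run selects the same features and 0.0 when no two
    runs share a feature. Two empty sets are treated as identical.
    """
    sets = [set(s) for s in feature_sets]
    if len(sets) < 2:
        raise ValueError("need at least two feature sets to compare")
    scores = []
    for a, b in combinations(sets, 2):
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Toy example (assumption): features chosen on three resampled data sets.
selected = [{"x1", "x2", "x3"}, {"x1", "x2", "x4"}, {"x1", "x2", "x3"}]
print(round(jaccard_stability(selected), 3))
```

A measure like this ignores similarities between features, which is why the abstract proposes an adjusted stability measure for data sets with many similar features.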
2020-01-01T00:00:00Z Modelling with feature costs under a total cost budget constraint http://hdl.handle.net/2003/39806 Title: Modelling with feature costs under a total cost budget constraint Authors: Jagdhuber, Rudolf Abstract: In modern high-dimensional data sets, feature selection is an essential pre-processing step for many statistical modelling tasks. The field of cost-sensitive feature selection extends the concepts of feature selection by introducing so-called feature costs. These do not necessarily relate to financial costs, but can be seen as a general construct to numerically valuate any disfavored aspect of a feature, like for example the run-time of a measurement procedure, or the patient harm of a biomarker test. There are multiple ideas to define a cost-sensitive feature selection setup. The strategy applied in this thesis is to introduce an additive cost-budget as an upper bound of the total costs. This extends the standard feature selection problem by an additional constraint on the sum of costs for included features. Main areas of research in this field include adaptations of standard feature selection algorithms to account for this additional constraint. However, cost-aware selection criteria also play an important role for the overall performance of these methods and need to be discussed in detail as well. This cumulative dissertation summarizes the work of three papers in this field. Two of these introduce new methods for cost-sensitive feature selection with a fixed budget constraint. The other discusses a common trade-off criterion of performance and cost. For this criterion, an analysis of the selection outcome in different setups revealed a reduction of the ability to distinguish between information and noise. This can for example be counteracted by introducing a hyperparameter in the criterion. 
The presented research on new cost-sensitive methods comprises adaptations of Greedy Forward Selection, Genetic Algorithms, filter approaches and a novel Random Forest based algorithm, which selects individual trees from a low-cost tree ensemble. Central concepts of each method are discussed and thorough simulation studies to evaluate individual strengths and weaknesses are provided. Every simulation study includes artificial, as well as real-world data examples to validate results in a broad context. Finally, all chapters present discussions with practical recommendations on the application of the proposed methods and conclude with an outlook on possible further research for the respective topics. 2020-01-01T00:00:00Z Methodenbaukasten zur Quantifizierung der statistischen Güte und deren Sensitivität von Last- und Verschleißanalysen mit einem Beispiel im Kontext alternativer Antriebskonzepte http://hdl.handle.net/2003/39802 Title: Methodenbaukasten zur Quantifizierung der statistischen Güte und deren Sensitivität von Last- und Verschleißanalysen mit einem Beispiel im Kontext alternativer Antriebskonzepte Authors: Lehmann, Thomas Abstract: This thesis was written as part of an industrial doctorate at Daimler AG in Sindelfingen. It covers the development and description of a statistical toolbox of methods for carrying out load and wear analyses as a process. The toolbox is tested by way of example on data in the context of alternative drive systems. The methodological focus is on quantifying the quality, or uncertainty, at each stage of the analysis and its sensitivity. The first analysis stage involves identifying different groups in load data, implemented through clustering methods. At the second stage, various linear and non-linear methods are used to predict the wear behaviour of the identified groups.
At both stages, both the quality of the method and its sensitivity are to be quantified. The thesis defines all necessary statistical methods and introduces the corresponding quality criteria. The toolbox comprises an iterative process in which both the clustering and the prediction are carried out in every iteration. Thus, on the one hand, the quality of each method can be quantified and assessed at every step and, on the other hand, the sensitivity of the quality, or uncertainty, of the methods and models can be quantified and assessed across several iterations. The iterative process developed, integrated into the Evidence Accumulation Clustering algorithm, offers the user decisive methodological advantages. First, the quality of each method and its sensitivity can be assessed at every step; second, because all methods are carried out simultaneously in every iteration, both are quantified across the analysis stages. The application example shows potential for increasing the quality of the models and reducing their sensitivity by carrying out both the variable selection for the load analysis and the model selection for the wear prediction as part of the process. The process developed makes it possible to assess the quality and stability of the analysis at an early stage (with a small data basis) and, where appropriate, to derive measures for action. 2020-01-01T00:00:00Z Extending model-based optimization with resource-aware parallelization and for dynamic optimization problems http://hdl.handle.net/2003/39770 Title: Extending model-based optimization with resource-aware parallelization and for dynamic optimization problems Authors: Richter, Jakob Abstract: This thesis contains two works on the topic of sequential model-based optimization (MBO).
In the first part an extension of MBO towards resource-aware parallelization is presented and in the second part MBO is adapted to optimize dynamic optimization problems. Before the newly developed methods are introduced the reader is given a detailed introduction into various aspects of MBO and related work. This covers thoughts on the choice of the initial design, the surrogate model, the acquisition functions, and the final optimization result. As most methods in this thesis rely on the Gaussian process regression it is covered in detail as well. The chapter on “Parallel MBO” dives into the topic of making use of multiple workers that can evaluate the black-box and especially focuses on the problem of heterogeneous runtimes. Strategies that tackle this problem can be divided into synchronous and asynchronous methods. Instead of proposing one configuration in an iterative fashion, as done by ordinary MBO, synchronous methods usually propose as many configurations as there are workers available. Previously proposed synchronous methods neglect the problem of heterogeneous runtimes which causes idling, when evaluations end at different times. This work presents current methods for parallel MBO that cover synchronous and asynchronous methods and presents the newly proposed Resource-Aware Model-based Optimization (RAMBO) Framework. This work shows that synchronous and asynchronous methods each have their advantages and disadvantages and that RAMBO can outperform common synchronous MBO methods if the runtime is predictable but still obtains comparable results in the worst case. The chapter on “MBO with Concept Drift” (MBO-CD) explains the adaptions that have been developed to allow optimization of black-box functions that change systematically over time. Two approaches are explained on how MBO can be taught to handle black-box functions where the relation between input and output changes over time, i.e. where a concept drift occurs. 
The window approach trains the surrogate only on the most recent observations. The time-as-covariate approach includes the time as an additional input variable in the surrogate, giving it the ability to learn the effect of the time. For the latter, a special acquisition function, the temporal expected improvement, is proposed. 2020-01-01T00:00:00Z Analyzing consistency and statistical inference in Random Forest models http://hdl.handle.net/2003/39552 Title: Analyzing consistency and statistical inference in Random Forest models Authors: Ramosaj, Burim Abstract: This thesis pays special attention to the Random Forest method as an ensemble learning technique using bagging and feature sub-spacing covering three main aspects: its behavior as a prediction tool under the presence of missing values, its role in uncertainty quantification and variable screening. In the first part, we focus on the performance of Random Forest models in prediction and missing value imputations while opposing it to other learning methods such as boosting procedures. Therein, we aim to discover potential modifications of Breiman’s original Random Forest in order to increase imputation performance of Random Forest based models using the normalized root mean squared error and the proportion of false classification as evaluation measures. Our results indicated the usage of a mixed model involving the stochastic gradient boosting and a Random Forest based on kernel sampling. Regarding inferential statistics after imputation, we were interested if Random Forest methods do deliver correct statistical inference procedures, especially in repeated measures ANOVA. Our results indicated a heavy inflation of type-I-error rates for testing no mean time effects. We could furthermore show that the between imputation variance according to Rubin’s multiple imputation rule vanishes almost surely, when repeatedly applying missForest as an imputation scheme. 
This has the consequence of less uncertainty quantification during imputation leading to scenarios where imputations are not proper. Closely related to the issue of valid statistical inference is the general topic of uncertainty quantification. Therein, we focused on consistency properties of several residual variance estimators in regression models and could deliver theoretical guarantees that Random Forest based estimators are consistent. Beside prediction, Random Forest is often used as a screening method for selecting informative features in potentially high-dimensional settings. Focusing on regression problems, we could deliver a formal proof that the Random Forest based internal permutation importance measure delivers on average correct results, i.e. is (asymptotically) unbiased. Simulation studies and real-life data examples from different fields support our findings in this thesis. 2020-01-01T00:00:00Z A simulation study to compare robust tests for linear mixed-effects meta-regression http://hdl.handle.net/2003/39344 Title: A simulation study to compare robust tests for linear mixed-effects meta-regression Authors: Welz, Thilo; Pauly, Markus Abstract: The explanation of heterogeneity when synthesizing different studies is an important issue in meta‐analysis. Besides including a heterogeneity parameter in the statistical model, it is also important to understand possible causes of between‐study heterogeneity. One possibility is to incorporate study‐specific covariates in the model that account for between‐study variability. This leads to linear mixed‐effects meta‐regression models. A number of alternative methods have been proposed to estimate the (co)variance of the estimated regression coefficients in these models, which subsequently drives differences in the results of statistical methods. 
To quantify this, we compare the performance of hypothesis tests for moderator effects based upon different heteroscedasticity consistent covariance matrix estimators and the (untruncated) Knapp‐Hartung method in an extensive simulation study. In particular, we investigate type 1 error and power under varying conditions regarding the underlying distributions, heterogeneity, effect sizes, number of independent studies, and their sample sizes. Based upon these results, we give recommendations for suitable inference choices in different scenarios and highlight the danger of using tests regarding the study‐specific moderators based on inappropriate covariance estimators. 2020-01-12T00:00:00Z Statistische Analyse von MCC-IMS-Messungen http://hdl.handle.net/2003/39303 Title: Statistische Analyse von MCC-IMS-Messungen Authors: Horsch, Salome Abstract: Analyzing a person's exhaled breath for diagnostic purposes has several advantages over other methods such as blood tests. Breath is always available, and collecting it is safe because no invasive procedure is required. When multi-capillary column ion mobility spectrometry (MCC-IMS) is used to analyze the breath, the measurement is completed within a few minutes and could in theory be evaluated immediately. For this to become possible, however, the resulting raw measurements must be processed automatically. At present this is still done by manual inspection of the raw measurements. To be able to replace this gold standard with automatic procedures, numerous combinations of algorithms were tested in this thesis. Because the goal of breath analysis is often to distinguish diseased from healthy persons, the methods were applied to three different data sets of this kind, and several classification algorithms were tested in addition.
An automatic combination of algorithms that achieves good results for the individual analysis steps was recommended for future use. The second part of the thesis dealt with factors that influence exhaled breath in MCC-IMS measurements. The effects of sex, smoking status, consumption of a food item, and the device used were examined. In particular, the measurements from the two devices studied differed markedly. These differences were examined in detail, and approaches to solving the problem were presented. 2020-01-01T00:00:00Z Blockwise estimation of parameters under abrupt changes in the mean http://hdl.handle.net/2003/39022 Title: Blockwise estimation of parameters under abrupt changes in the mean Authors: Axt, Ieva Abstract: In this thesis we are dealing with the estimation of parameters under shifts in the mean. The results of this work are based on three articles. The first main chapter of this thesis presents estimation methods for the LRD parameter under shifts in the mean. In the context of long range dependent (LRD) stochastic processes the main task is estimation of the Hurst parameter H, which describes the strength of dependence. When data are contaminated by level shifts, ordinary estimators of H, such as the Geweke and Porter-Hudak (GPH) estimator, may fail to distinguish between LRD and structural changes, such as jumps in the mean. As a consequence, the estimator may suffer from positive bias and overestimate the intensity of the LRD. This is a major issue, for example, when testing for changes in the mean. To overcome this problem, we propose to segregate the sample of size N into blocks and then to estimate H on each block separately. Estimates, calculated in different blocks, are then combined and a final estimate of the Hurst parameter is obtained.
We investigate several possibilities of segregating the data and assess their performance in a simulation study. One possibility is segregation into two blocks. The position at which the data are separated into two parts is either estimated using the Wilcoxon change-point test or chosen at any point, yielding estimates, which are combined by averaging. Another possibility is dividing the sequence of observations into many overlapping or non-overlapping blocks and estimating H by averaging estimates from these blocks. In the presence of one or even several jumps this procedure performs well in simulations. When dealing with processes with long memory and short range dependence, such as the fractionally integrated ARMA process (ARFIMA), the proposed estimators do not yield desirable results. Therefore, we follow an ARMA correction procedure and estimate the Hurst parameter in several recursive steps, using the overlapping or the non-overlapping blocks approach. In the context of LRD we observe that segregation into many blocks improves the ordinary estimators of H considerably under abrupt changes in the mean. We follow this same idea of segregation to estimate the variance of independent or weakly dependent processes under level shifts. The second main chapter of this thesis deals with scale estimation under shifts in the mean. When dealing with a few level shifts in finite samples we propose usage of the ordinary average of sample variances, obtained from many non-overlapping blocks. Under some conditions on the number of change-points and the number of blocks we prove strong consistency and asymptotic normality for independent data, where full asymptotic efficiency compared to the ordinary sample variance is shown. For weakly correlated processes we prove weak consistency of the blocks estimator. This estimator performs well when the number of level shifts is moderately low. 
In the presence of many level shifts even better results are obtained by an adaptive trimmed mean of the sample variances from non-overlapping blocks. The fraction of trimmed blockwise estimates is chosen adaptively, where extraordinary high sample variances are removed before calculating the average value. Even though this procedure is developed under the assumption of independence, it performs well also under weak dependence, e.g. when dealing with AR processes. If the data are additionally contaminated by outliers the proposed estimators fail to estimate the variance properly, since they are not robust. Therefore, we investigate a modified version of the well-known median absolute deviation (MAD) to account for both sources of contamination - level shifts and outliers. The formula of the MAD involves the sample median, which is not a good estimator of location in the presence of level shifts. Our proposal is to calculate the sample median in non-overlapping blocks and to consider absolute differences involving blockwise medians instead of a single median calculated on the whole sample. In this way only some blocks are affected by level shifts and the resulting modified MAD is robust against outliers and level shifts simultaneously. We proved strong consistency and asymptotic normality for independent random variables under some conditions on the number of change-points and the number of blocks. The Bahadur representation of the proposed estimator is shown to be the same as in the case of the ordinary MAD, resulting in the same asymptotic variance. In a simulation study the modified MAD provides very good results. The proposed estimator performs well as compared to other robust methods, which are discussed for comparison, in many simulation scenarios. 
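The two blockwise scale estimators described in the abstract above (the average of sample variances over non-overlapping blocks, and a MAD computed around blockwise medians) can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the near-equal block splitting, the simulated series, and the normal-consistency constant 1.4826 are assumptions made for illustration, and the adaptive trimming step is left out.

```python
import random
import statistics


def split_blocks(x, n_blocks):
    """Split x into n_blocks contiguous, non-overlapping blocks of near-equal size."""
    if n_blocks < 1 or len(x) < 2 * n_blocks:
        raise ValueError("need at least two observations per block")
    size, extra = divmod(len(x), n_blocks)
    blocks, start = [], 0
    for i in range(n_blocks):
        end = start + size + (1 if i < extra else 0)
        blocks.append(x[start:end])
        start = end
    return blocks


def blockwise_variance(x, n_blocks):
    """Average of the sample variances computed within each block."""
    return statistics.mean([statistics.variance(b) for b in split_blocks(x, n_blocks)])


def blockwise_mad(x, n_blocks, scale=1.4826):
    """MAD-type scale estimate that centers each block at its own median.

    scale=1.4826 is the usual normal-consistency constant (an assumption here).
    """
    deviations = []
    for block in split_blocks(x, n_blocks):
        center = statistics.median(block)
        deviations.extend(abs(v - center) for v in block)
    return scale * statistics.median(deviations)


# Illustrative data (hypothetical): standard normal noise with one level shift of size 5.
rng = random.Random(1)
series = [rng.gauss(0.0, 1.0) + (5.0 if t >= 520 else 0.0) for t in range(1000)]

print("ordinary variance :", statistics.variance(series))
print("blockwise variance:", blockwise_variance(series, 20))
print("blockwise MAD     :", blockwise_mad(series, 20))
```

In this setup only the one block that contains the jump is contaminated, so the blockwise estimates should stay much closer to the true scale of 1 than the ordinary sample variance does.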
2020-01-01T00:00:00Z Cost-constrained feature selection in binary classification: adaptations for greedy forward selection and genetic algorithms http://hdl.handle.net/2003/38554 Title: Cost-constrained feature selection in binary classification: adaptations for greedy forward selection and genetic algorithms Authors: Jagdhuber, Rudolf; Lang, Michel; Stenzl, Arnulf; Neuhaus, Jochen; Rahnenführer, Jörg Abstract: Background: With modern methods in biotechnology, the search for biomarkers has advanced to a challenging statistical task exploring high dimensional data sets. Feature selection is a widely researched preprocessing step to handle huge numbers of biomarker candidates and has special importance for the analysis of biomedical data. Such data sets often include many input features not related to the diagnostic or therapeutic target variable. A less researched, but also relevant aspect for medical applications are costs of different biomarker candidates. These costs are often financial costs, but can also refer to other aspects, for example the decision between a painful biopsy marker and a simple urine test. In this paper, we propose extensions to two feature selection methods to control the total amount of such costs: greedy forward selection and genetic algorithms. In comprehensive simulation studies of binary classification tasks, we compare the predictive performance, the run-time and the detection rate of relevant features for the new proposed methods and five baseline alternatives to handle budget constraints. Results: In simulations with a predefined budget constraint, our proposed methods outperform the baseline alternatives, with just minor differences between them. Only in the scenario without an actual budget constraint, our adapted greedy forward selection approach showed a clear drop in performance compared to the other methods. However, introducing a hyperparameter to adapt the benefit-cost trade-off in this method could overcome this weakness. 
Conclusions: In feature cost scenarios, where a total budget has to be met, common feature selection algorithms are often not suitable to identify well performing subsets for a modelling task. Adaptations of these algorithms such as the ones proposed in this paper can help to tackle this problem. 2020-01-28T00:00:00Z Generalized binary time series models http://hdl.handle.net/2003/38547 Title: Generalized binary time series models Authors: Jentsch, Carsten; Reichmann, Lena Abstract: The serial dependence of categorical data is commonly described using Markovian models. Such models are very flexible, but they can suffer from a huge number of parameters if the state space or the model order becomes large. To address the problem of a large number of model parameters, the class of (new) discrete autoregressive moving-average (NDARMA) models has been proposed as a parsimonious alternative to Markov models. However, NDARMA models do not allow any negative model parameters, which might be a severe drawback in practical applications. In particular, this model class cannot capture any negative serial correlation. For the special case of binary data, we propose an extension of the NDARMA model class that allows for negative model parameters, and, hence, autocorrelations leading to the considerably larger and more flexible model class of generalized binary ARMA (gbARMA) processes. We provide stationary conditions, give the stationary solution, and derive stochastic properties of gbARMA processes. For the purely autoregressive case, classical Yule–Walker equations hold that facilitate parameter estimation of gbAR models. Yule–Walker type equations are also derived for gbARMA processes. 
2019-12-14T00:00:00Z The G protein-coupled bile acid receptor TGR5 (Gpbar1) modulates endothelin-1 signaling in liver http://hdl.handle.net/2003/38540 Title: The G protein-coupled bile acid receptor TGR5 (Gpbar1) modulates endothelin-1 signaling in liver Authors: Klindt, Caroline; Reich, Maria; Hellwig, Birte; Stindt, Jan; Rahnenführer, Jörg; Hengstler, Jan G.; Köhrer, Karl; Schoonjans, Kristina; Häussinger, Dieter; Keitel, Verena Abstract: TGR5 (Gpbar1) is a G protein-coupled receptor responsive to bile acids (BAs), which is expressed in different non-parenchymal cells of the liver, including biliary epithelial cells, liver-resident macrophages, sinusoidal endothelial cells (LSECs), and activated hepatic stellate cells (HSCs). Mice with targeted deletion of TGR5 are more susceptible towards cholestatic liver injury induced by cholic acid-feeding and bile duct ligation, resulting in a reduced proliferative response and increased liver injury. Conjugated lithocholic acid (LCA) represents the most potent TGR5 BA ligand and LCA-feeding has been used as a model to rapidly induce severe cholestatic liver injury in mice. Thus, TGR5 knockout (KO) mice and wildtype (WT) littermates were fed a diet supplemented with 1% LCA for 84 h. Liver injury and gene expression changes induced by the LCA diet revealed an enrichment of pathways associated with inflammation, proliferation, and matrix remodeling. Knockout of TGR5 in mice caused upregulation of endothelin-1 (ET-1) expression in the livers. Analysis of TGR5-dependent ET-1 signaling in isolated LSECs and HSCs demonstrated that TGR5 activation reduces ET-1 expression and secretion from LSECs and triggers internalization of the ET-1 receptor in HSCs, dampening ET-1 responsiveness. Thus, we identified two independent mechanisms by which TGR5 inhibits ET-1 signaling and modulates portal pressure. 
2019-11-19T00:00:00Z Euler and Süßmilch’s Population Growth Model http://hdl.handle.net/2003/38535 Title: Euler and Süßmilch’s Population Growth Model Authors: Pflaumer, Peter Abstract: In 1761, the German demographer Johann Peter Süßmilch published a simple population growth model that starts with a couple, in the eighth chapter of his book "Die göttliche Ordnung". With the help of the Swiss mathematician Leonhard Euler, he projected the population for 300 years. He demonstrated that after that time the population will be growing approximately geometrically. In this paper, the population projection of Euler and Süßmilch is reanalyzed using matrix algebra. Graphs and tables show the time series of the population and its growth rates. Age structures of selected years are presented. The solution of the projection equation is derived. It is shown that the projection model can be described by a geometric trend model which is superimposed by six cyclical components. In the long run, the population time series can be explained quite well by the sum of only two components, the trend component and one component with explosive cycles of a period of about 24 years. In the very long run, the influence of the cyclical component diminishes, and the series can be solely explained by its geometric trend component, as was also recognized by Euler and Süßmilch. 2019-12-02T00:00:00Z Identifizierung und Modellierung von räumlichen Abhängigkeiten mit Anwendung auf deterministische und probabilistische Windvorhersagen http://hdl.handle.net/2003/38495 Title: Identifizierung und Modellierung von räumlichen Abhängigkeiten mit Anwendung auf deterministische und probabilistische Windvorhersagen Authors: Hüsch, Marc Abstract: This dissertation deals with statistical methods for identifying and modeling spatial dependence structures. The applied focus is on deterministic and probabilistic forecasts of wind speed and wind power.
The first part examines how the spatial dependence structure of deterministic wind power forecast errors in mainland Europe differs across forecast horizons and geographical conditions. Because of the high spatial and temporal resolution of the underlying data, the statistical methods used must be very efficient. For a first analysis of the spatial dependence structures, a purpose-built, correlation-based clustering method for spatio-temporal data sets is therefore used. It turns out that spatial correlation structures are particularly pronounced for longer forecast horizons and occur mainly in flat, windy regions. For an even more detailed analysis of the spatial dependence structure, an approach based on copulas and generalized additive models is also proposed. This approach shows that very large forecast errors, too, often occur jointly in a spatial context. In these situations in particular, high aggregated forecast errors can result, which pose an increased risk for energy market participants and transmission system operators. To better estimate forecast uncertainty in advance, practitioners therefore often rely on probabilistic forecasts or meteorological ensemble forecasts. This raises the question of how the quality of different probabilistic forecasts for several locations can be assessed meaningfully while accounting for spatial dependence structures. Using common multivariate scoring rules, the final part of the dissertation points out several difficulties that can arise when comparing the quality of multivariate probabilistic forecasts.
An empirical analysis and a sensitivity analysis illustrate that the scoring rules sometimes fail to identify misspecified spatial dependence structures correctly, merely because of differing forecast structures. 2019-01-01T00:00:00Z Evaluating an automated number series item generator using linear logistic test models http://hdl.handle.net/2003/38486 Title: Evaluating an automated number series item generator using linear logistic test models Authors: Loe, Bao Sheng; Sun, Luning; Simonfy, Filip; Doebler, Philipp Abstract: This study investigates the item properties of a newly developed Automatic Number Series Item Generator (ANSIG). The foundation of the ANSIG is based on five hypothesised cognitive operators. Thirteen item models were developed using the numGen R package and eleven were evaluated in this study. The 16-item ICAR (International Cognitive Ability Resource) short form ability test was used to evaluate construct validity. The Rasch Model and two Linear Logistic Test Models (LLTMs) were employed to estimate and predict the item parameters. Results indicate that a single factor determines the performance on tests composed of items generated by the ANSIG. Under the LLTM approach, all the cognitive operators were significant predictors of item difficulty. Moderate to high correlations were evident between the number series items and the ICAR test scores, with high correlation found for the ICAR Letter-Numeric-Series type items, suggesting adequate nomothetic span. Extended cognitive research is, nevertheless, essential for the automatic generation of an item pool with predictable psychometric properties. 2018-04-02T00:00:00Z Physical activity and outdoor play of children in public playground - do gender and social environment matter? http://hdl.handle.net/2003/38438 Title: Physical activity and outdoor play of children in public playground - do gender and social environment matter?
Authors: Reimers, Anne; Schoeppe, Stephanie; Demetriou, Yolanda; Knapp, Guido Abstract: Background: Few studies have delved into the relationship of the social environment with children’s physical activity and outdoor play in public playgrounds by considering gender differences. The aim of the present study was to examine gender differences and the relationship of the social environment with children’s physical activity and outdoor play in public playgrounds. Methods: A quantitative, observational study was conducted at ten playgrounds in one district of a middle-sized town in Germany. The social environment, physical activity levels, and outdoor play were measured using a modified version of the System for Observing Play and Leisure Activity in Youth. Results: In total, 266 observations of children (117 girls/149 boys) between four and 12 years old were used in this analysis. Significant gender differences were found in relation to activity types, but not in moderate-to-vigorous physical activity (MVPA). The presence of active children was the main explanatory variable for MVPA. In the models stratified by gender, the presence of opposite-sex children was a significant negative predictor of MVPA in girls but not in boys. Conclusions: The presence of active children contributes to children’s physical activity levels in public playgrounds. Girls’ physical activity seems to be suppressed in the presence of boys. 2018-06-28T00:00:00Z Group-based regionalization in flood frequency analysis considering heterogeneity http://hdl.handle.net/2003/38411 Title: Group-based regionalization in flood frequency analysis considering heterogeneity Authors: Lilienthal, Jona Abstract: This dissertation deals with the problem of estimating the recurrence time of rare flood events, especially in the case of short data records. 
In such scenarios regional flood frequency analysis is used, a multi-step procedure with the goal of improving quantile estimates by pooling information across different gauging stations. Different aspects of regional flood frequency analysis using the Index Flood model are analysed, and improvements for parts of the procedure are proposed. In group-based regional flood frequency analysis sets of stations are built from which a similar flood distribution is assumed. In the Index Flood model, this means that the flood distributions of all stations are the same except for a site-specific scaling factor. Because the validity of this assumption is of crucial importance for the benefits of regionalization, it is commonly checked using homogeneity tests. After possible reassignments of stations to the groups, the information of records within a group is pooled and quantile estimates can be calculated by combination of a site-specific factor and a regional curve. Each of the main chapters of this dissertation focuses attention on specific steps of this procedure. The first main chapter investigates the known drawbacks of the commonly used homogeneity testing procedure of Hosking and Wallis based on L-moments. A new generalized procedure is proposed that uses copulas to model the intersite dependence and trimmed L-moments as a more robust replacement of L-moments. With these changes an improved detection rate in situations of medium to high skewness and in the case of cross-correlated data can be achieved. Another benefit is an increased robustness against outliers or extreme events. The second main chapter is more technical. The asymptotic distribution of sample probability-weighted moments is described in a setting of multiple sites of different record lengths. This theory is then extended to sample TL-moments and GEV parameter and quantile estimators based on them. An estimator for the limiting covariance matrix is given and analysed. 
The applicability of the theory is illustrated by the construction of a homogeneity test. This test works well when used with trimmed L-moments, but it needs a record length of at least 100 observations at each site to give acceptable error rates. The last main chapter deals with penalized Maximum-Likelihood estimation in flood frequency analysis as an alternative data pooling scheme. Under the assumption of generalized extreme value distributed data, the Index Flood model is translated to restrictions on the parameter space. The penalty term of the optimization problem is then chosen to reflect those restrictions and its influence can be controlled by a hyperparameter. The hyperparameter choice can be automated by a cross-validation which leads to a procedure that automatically finds a compromise between local and regional estimation. This is especially useful in situations in which homogeneity is unclear. A simulation study indicates that this approach works nearly as well as pure regional methods if the homogeneity assumption is completely true and better than its competitors if the assumption does not hold. Overall, this dissertation presents different approaches and improvements to steps of a group-based regionalization procedure. A special interest is the assessment of the homogeneity of a given group that is analysed with two different approaches. However, due to short record lengths or limitations in the homogeneity testing procedures, heterogeneous groups are often still hard to detect. In such situations the presented penalized Maximum-Likelihood estimator can be applied that gives comparatively good results both in homogeneous and heterogeneous scenarios. However, application of this estimator does not supersede the group building steps, since the benefit of regionalization is highest if the homogeneity assumption is fulfilled.
2019-01-01T00:00:00Z Effects of exercise on the resting heart rate http://hdl.handle.net/2003/38337 Title: Effects of exercise on the resting heart rate Authors: Reimers, Anne; Knapp, Guido; Reimers, Carl-Detlev Abstract: Resting heart rate (RHR) is positively related to mortality. Regular exercise causes a reduction in RHR. The aim of the systematic review was to assess whether regular exercise or sports have an impact on the RHR in healthy subjects by taking different types of sports into account. A systematic literature search was conducted in six databases to identify controlled trials dealing with the effects of exercise or sports on the RHR in healthy subjects. The studies were summarized by meta-analyses. The literature search analyzed 191 studies presenting 215 samples fitting the eligibility criteria. 121 trials examined the effects of endurance training, 43 strength training, 15 combined endurance and strength training, 5 additional school sport programs, 21 yoga, 5 tai chi, 3 qigong, and 2 unspecified types of sports. All types of sports decreased the RHR. However, only endurance training and yoga significantly decreased the RHR in both sexes. The exercise-induced decreases of RHR were positively related to the pre-interventional RHR and negatively to the average age of the participants. From this, we can conclude that exercise, especially endurance training and yoga, decreases RHR. This effect may contribute to a reduction in all-cause mortality due to regular exercise or sports. 2018-12-01T00:00:00Z Risk Analysis in Capital Investment Appraisal with Correlated Cash Flows: Simple Analytical Methods http://hdl.handle.net/2003/38294 Title: Risk Analysis in Capital Investment Appraisal with Correlated Cash Flows: Simple Analytical Methods Authors: Pflaumer, Peter Abstract: Since uncertainty is the crucial point of a capital investment decision, risk analysis in capital budgeting is often applied.
Usually risk analysis is carried out by a Monte Carlo simulation. The aim of this article is to present simple analytical methods which allow us to calculate the standard deviation of a project with correlated cash flows as a risk measure. These methods are compared with simulation procedures carried out with R, and it is shown that the proposed simple analytical methods are indeed a quick and efficient procedure for assessing the risk of an investment project where the cash flows are correlated. 2017-07-01T00:00:00Z A Statistical Analysis of the Roulette Martingale System: Examples, Formulas and Simulations with R http://hdl.handle.net/2003/38279 Title: A Statistical Analysis of the Roulette Martingale System: Examples, Formulas and Simulations with R Authors: Pflaumer, Peter Abstract: Some gamblers use a martingale or doubling strategy as a way of improving their chances of winning. This paper derives important formulas for the martingale strategy, such as the distribution, the expected value, the standard deviation of the profit, the risk of a loss or the expected bet of one or multiple martingale rounds. A computer simulation study with R of the doubling strategy is presented. The results of doubling to gambling with a constant sized bet on simple chances (red or black numbers, even or odd numbers, and low (1–18) or high (19–36) numbers) and on single numbers (straight bets) are compared. In the long run, a loss is inevitable because of the negative expected value. The martingale strategy and the constant bet strategy on a single number are riskier than the constant bet strategy on a simple chance. This higher risk leads, however, to a higher chance of a positive profit in the short term. But on the other hand, higher risk means that the losses suffered by doublers and by single number bettors are much greater than that suffered by constant bettors. 
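The doubling strategy described in this abstract can be illustrated with a short sketch. The paper itself reports simulations in R; the Python code below is only an illustration under stated assumptions: a single-zero wheel (win probability 18/37 on a simple chance), a unit starting stake, and a hypothetical cap `max_bets` on the number of consecutive bets in one round. The function names are invented for this example. Under these assumptions the expected profit of one round is 1 - (2q)^n with q = 19/37 and n = `max_bets`, which is negative for every n.

```python
import random

# Assumption: single-zero (European) wheel, bet on a simple chance such as red.
P_WIN = 18 / 37


def martingale_round(max_bets, rng):
    """Play one martingale round: stake 1, double after each loss,
    stop at the first win or after max_bets consecutive losses."""
    stake, lost = 1, 0
    for _ in range(max_bets):
        if rng.random() < P_WIN:
            return stake - lost  # a win always nets exactly +1
        lost += stake
        stake *= 2
    return -lost  # all bets lost: -(2**max_bets - 1)


def expected_profit(max_bets):
    """Exact expected profit of one round: 1 - (2q)**n with q = 1 - P_WIN."""
    q = 1 - P_WIN
    return 1 - (2 * q) ** max_bets


def simulate_mean_profit(n_rounds, max_bets, seed=0):
    """Monte Carlo estimate of the mean profit per round."""
    rng = random.Random(seed)
    total = sum(martingale_round(max_bets, rng) for _ in range(n_rounds))
    return total / n_rounds
```

Calling `simulate_mean_profit(100_000, 3)` and comparing it with `expected_profit(3)` shows the pattern stated in the abstract: most rounds end with a small gain, but the rare large loss makes the expectation negative.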
2019-06-01T00:00:00Z Statistical modeling of protein-protein interaction networks http://hdl.handle.net/2003/38202 Title: Statistical modeling of protein-protein interaction networks Authors: Fermin Ruiz, Yessica Yulieth Abstract: Understanding how proteins bind to each other in a cell is the key in molecular biology to determine how experts can repair anomalies in cells. The major challenge in the prediction of protein-protein interactions is the cell-to-cell heterogeneity within a sample, due to genetic and epigenetic variabilities. Most studies about protein-protein interaction carry out their analysis without awareness of the underlying heterogeneity. This situation can lead to the identification of invalid interactions. As part of the solution to this problem, we propose in this thesis two kinds of analysis: one for snapshot data, where different samples of ten proteins were taken by toponome imaging, and another for time-correlated data, which provides a better approximation to the prediction of protein-protein interactions. The latter represents an advance in the analysis of data with high temporal resolution, such as that obtained through the quantification technique known as multicolor live cell imaging. The thesis presented here is divided into two parts. The first part, called "Revealing relationships among proteins involved in assembling focal adhesions", consists of the development of a methodology based on frequentist methods, such as machine learning and meta-analysis, for the prediction of protein-protein interaction on six different toponome imaging datasets. This methodology presents an advance in the analysis of highly heterogeneous snapshot data. Our aim here was to formulate a single model capable of identifying the relationship among different samples by summarizing their common results while accounting for their random variation.
This methodology leads to a set of common models over the six datasets, ranked by their predictive power, from which the researcher can choose a model according to its prediction accuracy or its parsimony. This part is developed in Chapters 1-7 and was published in Harizanova et al. (2016). The second part is called "Modelling of temporal networks with a nonparametric mixture of dynamic Bayesian networks". This part develops a Bayesian methodology for temporal networks that makes it possible to identify subpopulations in heterogeneous cell populations while simultaneously reconstructing the protein interaction network associated with each subpopulation. This method extends the nonparametric Bayesian networks (NPBNs) (Ickstadt et al., 2011) for the analysis of time-correlated data by using Gaussian dynamic Bayesian Networks (GDBNs). We evaluate our model based on the variation of specific parameters such as the underlying number of subpopulations, network density, and intra-subpopulation variability, among others. In addition, a comparative analysis with existing clustering methods such as NPBNs and hierarchical agglomerative clustering (Hclust) shows that including temporal correlations in the classification of multivariate time series improves the classification. The classic Hclust method using the dynamic time warping distances (T-Hclust) was found to be similar in precision to the Bayesian method proposed here. Furthermore, a comparative analysis with the GDBNs shows that a single GDBN model is poorly suited to reconstructing temporal networks in heterogeneous cell populations, while our method, as well as the joint use of the T-Hclust classifications with the GDBNs (T-Hclust+), is well suited to predicting temporal networks in a mixture. This part is developed in Chapters 8-16.
2018-01-01T00:00:00Z Robust and non-parametric control charts for time series with a time-varying trend http://hdl.handle.net/2003/38201 Title: Robust and non-parametric control charts for time series with a time-varying trend Authors: Abbas, Sermad Abstract: The detection of structural breaks is an important task in many online-monitoring applications. Dynamics in the underlying signal can lead to a time series with a time-varying trend. This complicates the distinction between natural fluctuations and sudden structural breaks that are relevant to the application. Moreover, outlier patches can be confused with structural breaks or mask them. A frequently formulated goal is to achieve a high detection quality while keeping the number of false alarms low. An example is the monitoring of vital parameters in intensive care where false alarms can lead to alarm fatigue of the medical staff, but missed structural breaks may cause health-threatening situations for the patient. The number of false alarms is often controlled by the average run length (ARL) or median run length (MRL), which measure the duration between two consecutive alarms. Typical procedures for online monitoring under these premises are control charts. They compute a control statistic from the most recent observations and compare it to control limits. By this, it can be decided whether the process is in control or out of control. In this thesis, control charts for the mean function are developed and studied. They are based on the sequential application of two-sample location tests in a moving time window to the most recent observations. The window is split into two subwindows which are compared with each other by the test. Unlike popular control schemes like the Shewhart, CUSUM, or EWMA scheme, a large set of historical data to specify the control limits is not required. Moreover, the control charts only depend on local model assumptions, allowing for an adaptation to the local process behaviour. 
In addition, by choosing appropriate window widths, it is possible to tune the control charts to be robust against a specific number of consecutive outliers. Thus, they can automatically distinguish between outlier patches and location shifts. Via simulations and some theoretical considerations, control charts based on selected two-sample tests are compared. Assuming a locally constant signal, the ability to detect sudden location shifts is studied. It is shown that the in-control run-length distribution of charts based on rank tests does not depend on the data-generating distribution. Hence, such charts keep a desired in-control ARL or MRL under every continuous distribution. Moreover, control charts based on robust competitors of the well-known two-sample t-test are considered. The difference of the sample means and the pooled empirical standard deviation are replaced with robust estimators for the height of a location shift and the scale. In case of tests based on the two-sample Hodges-Lehmann estimator for shift and the difference of the sample medians, the in-control ARL and MRL seem to be nearly distribution free when computing the control limits with a randomisation principle. In general, the charts retain properties of the underlying tests. Out-of-control simulations indicate that a test which is efficient over a wide range of distributions leads to a control chart with a high detection quality. Moreover, the robustness of the tests is inherited by the charts. In the considered settings, the two-sample Hodges-Lehmann estimator leads to a control chart with promising overall results concerning the in- and out-of-control behaviour. While being able to deal with very slow trends, the moving-window approach deteriorates for stronger trends of an in-control process. By confusing trends with location shifts, the number of false alarms becomes unacceptably large. 
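The moving-window idea described above can be sketched in a few lines. This is a minimal illustration, not the charts developed in the thesis: it only computes the two-sample Hodges-Lehmann shift estimate (the median of all pairwise differences between the right and the left subwindow) for each window position. The standardisation by a robust scale estimate and the randomisation-based control limits mentioned in the abstract are omitted, and the function names and the equal split into two subwindows of width `half_width` are assumptions made for this example.

```python
from statistics import median


def hl_shift(left, right):
    """Two-sample Hodges-Lehmann estimate of the location shift:
    median of all pairwise differences right[j] - left[i]."""
    return median(r - l for l in left for r in right)


def moving_window_shifts(series, half_width):
    """Slide a window of length 2 * half_width over the series and
    compare its older half with its more recent half."""
    shifts = []
    for t in range(2 * half_width, len(series) + 1):
        window = series[t - 2 * half_width:t]
        shifts.append(hl_shift(window[:half_width], window[half_width:]))
    return shifts
```

For a series that jumps from 0 to 5, the estimate peaks when the jump sits exactly between the two subwindows, whereas a single outlier in one subwindow leaves the estimate unchanged, which illustrates the robustness property discussed in the abstract.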
The aforementioned approach is extended by constructing residual control charts based on local model fitting. The idea is to compute a sequence of one-step-ahead forecast errors from the most recent observations to remove the trend and apply the tests to them. This combination makes it possible to detect location shifts and sudden trend changes in the original time series. Robust regression estimators retain the information on change points in the sequence better than non-robust ones. Based on a literature summary on robust online filtering procedures, which is also part of this thesis, the one-step-ahead forecast errors are computed by repeated median regression. The conclusions are similar as for the locally constant signal: Efficient robust tests lead to control charts with a high detection quality and robustness is preserved. However, due to correlated forecast errors, in-control ARL and MRL are not completely distribution free. Still, it is possible to construct charts for which these measures seem approximately distribution free. Again, a chart based on the two-sample Hodges-Lehmann estimator turns out to perform well. A first investigation under the assumption of a local autoregressive model of order one is also provided. In this case, the in- and out-of-control performances of the charts depend not only on the underlying distribution but also on the strength of the autocorrelation. Under distributional assumptions, the results indicate that an acceptable detection quality for small to moderate autocorrelation can be achieved. The application of the control charts to data from different real-world applications indicates that they can reliably detect large structural breaks even in trend periods. Additional rules can be helpful to further reduce the number of false alarms and facilitate the distinction between relevant and irrelevant changes. 
Furthermore, it is shown how the procedures can be modified to detect sudden variability changes in time series with a non-linear signal. 2019-01-01T00:00:00Z Optimale Versuchsplanung für Model-Averaging Schätzer http://hdl.handle.net/2003/38150 Title: Optimale Versuchsplanung für Model-Averaging Schätzer Authors: Alhorn, Kira Abstract: Optimal design of experiments can reduce statistical uncertainty, for example by minimizing the variance of an estimator. Usually, however, it is assumed that the model describing the functional relationship between the explanatory variables and the experimental outcome is known. In this thesis we consider the case in which only a class of possible candidate models that can describe this relationship is available. We propose new design criteria for the estimation of a target parameter that take this uncertainty about the true model into account. To this end we consider model-averaging estimators, which are weighted averages of the estimators in the individual candidate models. We assume that the weights used to compute the model-averaging estimator are fixed. Model-averaging estimators are in general not unbiased, so an optimal design minimizes the mean squared error of such an estimator. First we consider candidate models that satisfy the assumption of so-called local alternatives. These models are nested, and convenient expressions result for the asymptotic mean squared error of the model-averaging estimator. We determine locally and Bayesian optimal designs for model-averaging estimation of a target parameter and derive necessary conditions for the optimality of numerically determined designs.
The results are illustrated by several examples, and we show by means of simulations that the Bayesian optimal designs can reduce the mean squared error of the model-averaging estimator by up to 45% compared with other designs. We also propose an adaptive approach in which the model-averaging weights are determined from the results of previous experiments. Furthermore, we drop the assumption of local alternatives and derive the asymptotic distribution of model-averaging estimators for non-nested models. Here the true model need not be among the candidate models. We illustrate the theoretical results by simulations and then determine locally and Bayesian optimal designs for model-averaging estimation of a target parameter, which minimize the asymptotic mean squared error of the estimator. We show by examples that these designs can substantially increase the precision of model-averaging estimators. In addition, these designs also improve estimators after model selection as well as model-averaging estimators with random weights. We again determine adaptive designs, which update the model-averaging weights in several steps. 2019-01-01T00:00:00Z Cutting Optimal Pieces from Production Items http://hdl.handle.net/2003/37950 Title: Cutting Optimal Pieces from Production Items Authors: Kirchhof, Michael; Meyer, Oliver; Weihs, Claus Abstract: In the process of manufacturing various products, a larger production item is first produced and subsequently smaller parts are cut out of it. In this report we present three algorithms that find optimal positions of production pieces to be cut out of a larger production item. The algorithms are able to consider multiple quality parameters and optimize them in a given priority order.
They guarantee different levels of optimality and therefore differ in their required computing time and memory usage. We combine these algorithms according to their specific benefits and drawbacks and adapt the combination to the given computational resources. If possible, the process is sped up by splitting the search for pieces on the whole production item into several local searches. Lastly, the approach is embedded into an application with a graphical user interface to enable its use in the industry. 2019-03-08T00:00:00Z Subgroup analyses and investigations of treatment effect heterogeneity in clinical dose-finding trials http://hdl.handle.net/2003/37948 Title: Subgroup analyses and investigations of treatment effect heterogeneity in clinical dose-finding trials Authors: Thomas, Marius Abstract: Identifying subgroups that respond differently to a treatment is an important part of drug development. Exploratory subgroup analyses, which aim to identify subgroups of patients with differential treatment effects, are thus common in many randomized clinical trials. Statistically these analyses are known to be challenging because the number of possible subgroups is often large, which leads to multiplicity issues. Often such subgroup analyses are also performed for early phase clinical trials, where an additional challenge is the small sample size. In recent years several statistical approaches to these problems have been proposed, employing for example tree-based recursive partitioning algorithms, which are well-suited for handling interactions; penalized regression methods, which can be used to prevent overfitting when explicitly modeling a large number of covariate effects; or Bayesian approaches, which allow incorporating uncertainty and can be used to make optimal decisions with regard to subgroups. The available literature, however, focuses on two-arm clinical trials, where patients are randomized to the experimental treatment or a control (e.g.
current standard of care or placebo). A particular focus of this cumulative thesis is the development of statistical methodology for identification of subgroups in dose-finding trials, in which patients are administered several doses of a new drug. Dose-finding trials play a key role in the drug development process, since they provide valuable information about the effect of the dose on efficacy and safety. For identifying subgroups in this setting we consider the treatment effect to be a function of the dose and then try to identify relevant covariate effects on this treatment effect curve. These identified covariates can then be used to define subgroups with higher treatment effects but also subgroups, which require a different dose of the treatment. We propose two different approaches for this purpose. Firstly, a tree-based recursive partitioning algorithm, which detects covariate effects on the parameters of dose-response models and builds a tree of subgroups with different dose-response curves. Secondly, a Bayesian hierarchical model, which makes use of shrinkage priors to prevent overfitting in the considered settings with low sample sizes and a large number of considered covariates. In addition to approaches for subgroup identification we also consider the problem of testing a prespecified subgroup in addition to the full population in dose-finding trials. In a dose-finding setting contrast tests are often used to test for a significant dose-response signal, while taking the underlying dose-response relationship into account. Optimal contrast tests can be derived, when the underlying dose-response model is known, however often there is uncertainty about this underlying model. Testing procedures, which allow for uncertainty with regard to the underlying model and perform multiple contrast tests are therefore popular approaches in such settings. 
As a part of this thesis we extend such approaches to settings with multiple populations, in particular the situation, in which a prespecified subgroup is considered in addition to the full population. A last part of this cumulative thesis focuses on treatment effect estimation in identified subgroups. Naive treatment effect estimates in subgroups will often suffer from selection bias, especially when the number of considered subgroups is large. Several approaches to obtain adjusted treatment effect estimates in such situations have been proposed, using resampling, model averaging or penalized regression. We compare these approaches in an extensive simulation study for a large range of scenarios, in which such analyses are performed. 2019-01-01T00:00:00Z Bayesian and frequentist regression approaches for very large data sets http://hdl.handle.net/2003/37946 Title: Bayesian and frequentist regression approaches for very large data sets Authors: Geppert, Leo Nikolaus Abstract: This thesis is concerned with the analysis of frequentist and Bayesian regression models for data sets with a very large number of observations. Such large data sets pose a challenge when conducting regression analysis, because of the memory required (mainly for frequentist regression models) and the running time of the analysis (mainly for Bayesian regression models). I present two different approaches that can be employed in this setting. The first approach is based on random projections and reduces the number of observations to manageable level as a first step before the regression analysis. The reduced number of observations depends on the number of variables in the data set and the desired goodness of the approximation. It is, however, independent of the number of observations in the original data set, making it especially useful for very large data sets. Theoretical guarantees for Bayesian linear regression are presented, which extend known guarantees for the frequentist case. 
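The data-reduction step of the first approach can be illustrated with a small sketch. It is not the construction analysed in the thesis: the dense random-sign (Rademacher) projection, the function names and the restriction to a straight-line fit are all assumptions chosen to keep the example self-contained. The point it shows is that the regression is run on `k` projected rows, where `k` is chosen independently of the original number of observations `n`.

```python
import random


def sketch(rows, k, rng):
    """Reduce n rows to k rows by multiplying with a random k x n matrix
    whose entries are +1/sqrt(k) or -1/sqrt(k) (an assumed, simple choice)."""
    d = len(rows[0])
    scale = 1 / k ** 0.5
    reduced = [[0.0] * d for _ in range(k)]
    for row in rows:
        for i in range(k):
            sign = scale if rng.random() < 0.5 else -scale
            for j in range(d):
                reduced[i][j] += sign * row[j]
    return reduced


def fit_line(rows):
    """Least squares for y = a + b * x on rows [c, x, y], where c is the
    (possibly projected) intercept column. Solves the 2 x 2 normal equations."""
    s_cc = sum(c * c for c, x, y in rows)
    s_cx = sum(c * x for c, x, y in rows)
    s_xx = sum(x * x for c, x, y in rows)
    s_cy = sum(c * y for c, x, y in rows)
    s_xy = sum(x * y for c, x, y in rows)
    det = s_cc * s_xx - s_cx ** 2
    a = (s_cy * s_xx - s_cx * s_xy) / det
    b = (s_cc * s_xy - s_cx * s_cy) / det
    return a, b
```

The intercept column and the response are projected together with the predictor, so `fit_line(sketch(rows, k, rng))` works on the reduced data in the same way as `fit_line(rows)` works on the full data. With noisy data the two fits differ slightly, and the size of that difference is what approximation guarantees of the kind mentioned in the abstract control.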
The fundamental theorem covers Bayesian linear regression with arbitrary normal distributions or non-informative uniform distributions as prior distributions. I evaluate how close the posterior distributions of the original model and the reduced data set are for this theoretically covered case as well as for extensions towards hierarchical models and models using q-generalised normal distributions as prior. The second approach presents a transfer of the Merge & Reduce-principle from data structures to regression models. In Computer Science, Merge & Reduce is employed in order to enable the use of static data structures in a streaming setting. Here, I present three possibilities of employing Merge & Reduce directly on regression models. This enables sequential or parallel analysis of subsets of the data set. The partial results are then combined in a way that recovers the regression model on the full data set well. This approach is suitable for a wide range of regression models. I evaluate the performance on simulated and real world data sets using linear and Poisson regression models. Both approaches are able to recover regression models on the original data set well. They thus offer scalable versions of frequentist or Bayesian regression analysis for linear regression as well as extensions to generalised linear models, hierarchical models, and q-generalised normal distributions as prior distribution. Application on data streams or in distributed settings is also possible. Both approaches can be combined with multiple algorithms for frequentist or Bayesian regression analysis. 2018-01-01T00:00:00Z Multi-objective analysis of machine learning algorithms using model-based optimization techniques http://hdl.handle.net/2003/37937 Title: Multi-objective analysis of machine learning algorithms using model-based optimization techniques Authors: Horn, Daniel Abstract: My dissertation deals with the research areas optimization and machine learning. 
However, both of them are too extensive to be covered by a single person in a single work, and that is not the goal of my work either. Therefore, my dissertation focuses on interactions between these fields. On the one hand, most machine learning algorithms rely on optimization techniques. First, the training of a learner often implies an optimization. This is demonstrated by the SVM, where the weighted sum of the margin size and the sum of margin violations has to be optimized. Many other learners internally optimize either a least-squares or a maximum likelihood problem. Second, the performance of most machine learning algorithms depends on a set of hyper-parameters and an optimization has to be conducted in order to find the best performing model. Unfortunately, there is no globally accepted optimization algorithm for hyper-parameter tuning problems, and in practice naive algorithms like random or grid search are frequently used. On the other hand, some optimization algorithms rely on machine learning models. They are called model-based optimization algorithms and are mostly used to solve expensive optimization problems. During the optimization, the model is iteratively refined and exploited. One of the most challenging tasks here is the choice of the model class. It has to be applicable to the particular parameter space of the OP and to be well suited for modeling the function’s landscape. In this work, I gave special attention to the multi-objective case. In contrast to the single-objective case, where a single best solution is likely to exist, all possible trade-offs between the objectives have to be considered. Hence, not only a single best, but a set of best solutions exists, one for each trade-off. Although approaches for solving multi-objective problems differ from the corresponding approaches for single-objective problems in some parts, other parts can remain unchanged. This is shown for model-based multi-objective optimization algorithms. 
The last third of this work addresses the field of offline algorithm selection. In online algorithm selection the best algorithm for a problem is selected while solving it. In contrast, offline algorithm selection guesses the best algorithm a priori. Again, the work focuses on the multi-objective case: An algorithm has to be selected with respect to multiple, conflicting objectives. As with all offline techniques, this selection rule has to be trained on a set of available training data sets and can only be applied to new data sets that are similar enough to those in the training set. 2019-01-01T00:00:00Z Kontrollkarten zur Alarmgebung in Stromnetzen http://hdl.handle.net/2003/37879 Title: Kontrollkarten zur Alarmgebung in Stromnetzen Authors: Langesberg, Christian Abstract: The dissertation Kontrollkarten zur Alarmgebung in Stromnetzen (Control charts for alarm generation in power grids) was written as part of a research unit funded by the Deutsche Forschungsgemeinschaft with a focus on protection and control systems for reliable and secure electrical power transmission (FOR 1511). The author investigated the possibility of automated monitoring of an electrical power network by means of statistical process control. Recordings of the grid frequency from five European locations were available for this purpose. As it turned out, the available frequency data cannot be monitored with standard methods such as mean charts or individuals charts: these lead to impractically high false alarm rates. This problem results from multiple violations of the assumptions of the control chart technique: the frequency values are both highly autocorrelated and strongly dependent on one another. In addition, the data do not come from any known statistical distribution and are subject to continuous regulation processes.
As a remedy, various known procedures from statistical process control were considered, but in no case was a satisfactory quality achieved. Consequently, approaches for new variants are discussed. Finally, the use of a moving average of absolute differences as the control chart statistic is proposed. In addition, a symmetrization of the absolute differences is used, which increases the speed of convergence of the means (central limit theorem). To compare the convergence speeds of two procedures or parameter settings, a measure for assessing the closeness of a data vector to the family of normal distributions is needed. Since no generally good methodology is known here, nine metrics and test statistics were compared with respect to their underlying ideas and properties as well as in a simulation study. Finally, the methodology is applied to representative examples of power outages. 2018-01-01T00:00:00Z Projecting Age-Specific Death Probabilities at Advanced Ages Using the Mortality Laws of Gompertz and Wittstein http://hdl.handle.net/2003/37868 Title: Projecting Age-Specific Death Probabilities at Advanced Ages Using the Mortality Laws of Gompertz and Wittstein Authors: Pflaumer, Peter Abstract: In this paper, death probabilities derived from the Gompertz and Wittstein models are used to project mortality at advanced ages beginning at the age of 101 years. Life table data of Germany from 1871 to 2012 serve as a basis for the empirical analysis. Projections of the death probabilities and life table survivors will be shown. The increase of the death probabilities slows down at very old ages. Finally, Wittstein's formula will be regarded as a distribution function. Its reversed hazard rate function, which will be derived together with the median and the modal value, will clarify the significance of the parameters of the Wittstein distribution.
2018-12-18T00:00:00Z Statistische Modellierung eines Bohrprozesses http://hdl.handle.net/2003/37857 Title: Statistische Modellierung eines Bohrprozesses Authors: Herbrandt, Swetlana 2018-01-01T00:00:00Z Models and algorithms for low-frequency oscillations in power transmission systems http://hdl.handle.net/2003/37856 Title: Models and algorithms for low-frequency oscillations in power transmission systems Authors: Surmann, Dirk Abstract: Energy supply in the European power transmission system undergoes a structural change due to expansion and integration of renewable energy sources on a large scale. Generating renewable energy is more volatile and less predictable because it usually depends on the weather like wind and sun. Furthermore, the increase in power trading as a result of the full integration of national electricity markets into the European transmission system additionally burdens the power network. Higher volatility and increasing power trading consume additional resources of existing transmission lines while construction projects for network extension take a huge amount of time. As a consequence, the available resources within the European network have to be utilised efficiently and carefully. Reducing the security margins of components in power networks leads to higher vulnerability to additional problems. This thesis focuses on two topics with the aim of supporting power transmission systems stability. Firstly, selecting an optimal subset of nodes within a power network with respect to the particular issue of Low-Frequency Oscillation is addressed. A common application is the optimal placement of measurement devices within a power network. By integrating the modelled oscillations as a preprocessor into the algorithm, the constructed subset includes their characteristics and is optimal to measure this type of oscillation. 
Secondly, simulation software is widely applied to power networks generating data or investigating the potential effects of changed device parameters. The state of the art way manually defines test scenarios to investigate effects. Each test scenario challenges the corresponding transmission system by, e. g. changing device parameters, increasing its power consumption, or disconnecting a transmission line. Instead of relying on the manual generation of test scenarios to check the network behaviour for modified or new components, it is advantageous to employ an algorithm for building test scenarios. These mechanisms ensure that the range of operating conditions is covered and at the same time propose challenging test scenarios much better than manually generated test scenarios. Black box optimisation techniques support this process by exploring the possible space for test scenarios using a specialised criterion. This cumulative dissertation comprises a summary of six papers which deal with modelling of Low-Frequency Oscillations and with the prediction of corresponding values at unobserved nodes within a power transmission system. I will present two published R packages we implemented to simplify the above process. Applying graph kernels in combination with evolutionary algorithms addresses the node selection task. Issues in multimodal optimisation are addressed using contemporary techniques from model-based optimisation to efficiently identify local minima. 
2018-01-01T00:00:00Z Qualitätsvergleiche kalibrierter Wahrscheinlichkeitsprognosen mit Anwendung auf die internationale Ratingindustrie http://hdl.handle.net/2003/37668 Title: Qualitätsvergleiche kalibrierter Wahrscheinlichkeitsprognosen mit Anwendung auf die internationale Ratingindustrie (Quality comparisons of calibrated probability forecasts with application to the international rating industry) Authors: Neumärker, Simon 2018-01-01T00:00:00Z Survival models with selection of genomic covariates in heterogeneous cancer studies http://hdl.handle.net/2003/37144 Title: Survival models with selection of genomic covariates in heterogeneous cancer studies Authors: Madjar, Katrin Abstract: Building a risk prediction model for a specific subgroup of patients based on high-dimensional molecular measurements such as gene expression data is an important current field of biostatistical research. Major objectives in modeling high-dimensional data are good prediction performance and finding a subset of covariates that are truly relevant to the outcome (here: time-to-event endpoint). The latter requires variable selection to obtain a sparse, interpretable model solution. In this thesis, one further modeling objective is to take into account heterogeneity in the data due to known subgroups of patients that may differ in the relationship between genomic covariates and survival outcome. We consider multiple cancer studies as subgroups; however, our approaches can be applied to any other subgroups, for example, those defined by clinical covariates. We aim to provide a separate prediction model for each subgroup that allows the identification of common as well as subgroup-specific effects and has improved prediction accuracy over standard approaches. Standard subgroup analysis includes only patients of the subgroup of interest and may lead to a loss of power when the sample size is small, whereas standard combined analysis simply pools patients of all subgroups and may suffer from biased results and averaging of subgroup-specific effects.
To overcome these drawbacks, we propose two different statistical models that allow sharing information between subgroups to increase power when this is supported by data. One approach is a classical frequentist Cox proportional hazards model with a lasso penalty for variable selection and a weighted version of the Cox partial likelihood that includes patients of all subgroups but assigns them individual weights based on their subgroup affiliation. Patients who fit the subgroup of interest well receive higher weights in the subgroup-specific model. The other approach is a novel Bayesian Cox model that uses a stochastic search variable selection prior with latent indicators of variable inclusion. We assume a sparse graphical model that links genes within subgroups and the same genes across different subgroups. This graph structure is not known a priori and is inferred simultaneously with the important variables of each subgroup. Both approaches are evaluated through extensive simulations and applied to real lung cancer studies. Simulation results demonstrate that our proposed models can achieve improved prediction and variable selection accuracy over standard subgroup models when sample size is low. As expected, the standard combined model only identifies common effects but fails to detect subgroup-specific effects. 2018-01-01T00:00:00Z Statistical topics in clinical biosimilar development http://hdl.handle.net/2003/37098 Title: Statistical topics in clinical biosimilar development Authors: Mielke, Johanna 2018-01-01T00:00:00Z Clustermethoden für Massenspektren in proteomweiten statistischen Analysen http://hdl.handle.net/2003/36839 Title: Clustermethoden für Massenspektren in proteomweiten statistischen Analysen (Cluster methods for mass spectra in proteome-wide statistical analyses) Authors: Rieder, Vera Abstract: The thesis deals with cluster methods for mass spectrometric analyses in biodiversity research.
As an alternative to species identification by DNA barcoding, the analysis of the protein composition of organisms is used. The majority of protein analytics is now based on the so-called LC-MS/MS method, in which liquid chromatography (LC) as a separation method is combined with tandem mass spectrometry (MS/MS). Tandem mass spectra, which consist of detected intensities of occurring masses, are used to identify peptides and proteins by means of database search algorithms. Novel, unknown peptides are now detected by error-prone de novo peptide sequencing algorithms. As an alternative to annotation procedures, this thesis deals with the direct cluster analysis of tandem mass spectra. Two aspects are investigated: the cluster analysis of so-called runs, which comprise thousands of spectra of a protein sample, and the cluster analysis of individual tandem mass spectra. A cluster analysis of runs is carried out for several real data sets using the new method DISMS2, which determines distances between MS/MS runs without annotations. It is thus an alternative to the comparison of peptide lists that are based on the identification of spectra in database searches. The parameters of DISMS2 can be chosen freely, so that the selection of the highest peaks per spectrum (topn), the bin size in binning (bin), the restriction of spectrum comparisons to spectra close in time (ret) with similar precursor mass (prec), and the distance measure for mass spectra (dist) with a freely selectable threshold (cdis) vary. For parameter selection, an optimisation procedure is applied that uses the coefficient of determination R² of a nonparametric analysis-of-variance method.
For the cluster analysis of individual mass spectra, a comprehensive comparison of algorithms, so far missing from the literature, is compiled. The algorithms are either established for tandem mass spectra (CAST, MS-Cluster, PRIDE Cluster), known for large data sets (hierarchical cluster analysis, DBSCAN, connected components of a graph), or new (Neighbor Clustering). The evaluation is based on real data and several quality measures. 2018-01-01T00:00:00Z Classification Method Performance in High Dimensions http://hdl.handle.net/2003/36834 Title: Classification Method Performance in High Dimensions Authors: Weihs, Claus; Kassner, Tobias Abstract: We discuss standard classification methods for high-dimensional data and a small number of observations. By means of designed simulations illustrating the practical relevance of theoretical results, we show that in the 2-class case the following rules of thumb should be followed in such a situation to avoid the worst error rate, namely the probability π1 of the smaller class: Avoid “complicated” classifiers: The independence rule (ir) might be adequate, the support vector machine (svm) should only be considered as an expensive alternative, which is additionally sensitive to noise factors. From the outset, look for stochastically independent dimensions and balanced classes. Only take into account features which influence class separation sufficiently. Variable selection might help, though filters might be too rough. Compare your result with the result of the data-independent rule “Always predict the larger class”.
2018-04-13T00:00:00Z Arbeitszeiten von Professorinnen und Professoren in Deutschland 2016 http://hdl.handle.net/2003/36780 Title: Arbeitszeiten von Professorinnen und Professoren in Deutschland 2016 (Working hours of professors in Germany 2016) Authors: Weihs, Claus; Hernández Rodríguez, Tanja; Doeckel, Maximilian; Marty, Christoph; Wormer, Holger Abstract: In this study, reliable prediction intervals for the weekly total working time of university professors are determined from data of a 2016 survey and a priori information from earlier studies. In addition to total working time, partial working times, for example for teaching and research, are also determined. The results of frequentist and Bayesian analyses are compared. From the valid questionnaires of active, full-time university professors, direct estimation yields 56 h for the average weekly total working time and a 95% prediction interval of 35 h to 80 h. Frequentist and Bayesian analyses lead to similar results, and subject groups and genders differ little. If total working time is estimated as the sum of the working times for subtasks, this leads to a substantially larger mean of 63 h and to clearly different 95% prediction intervals: [42, 85] h in the Bayesian case and [28, 113] h in the frequentist case. Measurements of total working time derived from independently determined partial working times therefore appear reliable only if a Bayesian analysis with prior information about total working time is carried out, because sums of partial working times evidently tend to be larger than a direct estimate of total working time, both in mean and in variation. One possible reason is that respondents lack an overview of the total working time they have reported when no running total is kept while the questionnaire is being filled in.
At about 60%, the share of research-related activities in working time appears considerably higher than the share of teaching, supervision and examination of students at 23% and the share of administrative activities at 17%. The largest significant differences in the expected values of the subject groups always occur between the humanities/social sciences and one of the other subject groups, both for total working time and for partial working times. The difference in expected total workload between female and male professors is rather small. 2018-02-01T00:00:00Z Comparison of prediction intervals for crack growth based on random effects models http://hdl.handle.net/2003/36677 Title: Comparison of prediction intervals for crack growth based on random effects models Authors: Emdadi Fard, Maryam Abstract: Linear and nonlinear mixed effects models are applied extensively in the study of repeated measurements and longitudinal data. In this thesis, we propose two linear random effects models and a nonlinear random effects model based on the Paris-Erdogan equation for describing the crack growth data of Virkler et al. (1979). We describe how such models can be applied to obtain a point prediction and a prediction interval for the time at which the crack attains a specific length. We propose eleven new methods for prediction intervals by extending the methods of Swamy (1971), Rao (1975), Liski and Nummi (1996), Pinheiro and Bates (2000) and Stirnemann et al. (2011). We compare the methods and models by applying them to the crack propagation data and to simulated data. 2018-01-01T00:00:00Z