Fachgebiet Statistik in den Biowissenschaften

Recent Submissions

  • Item
    On robust estimation of negative binomial INARCH models
    (2021-04-24) Elsaied, Hanan; Fried, Roland
    We discuss robust estimation of INARCH models for count time series, where each observation conditionally on its past follows a negative binomial distribution with a constant scale parameter, and the conditional mean depends linearly on previous observations. We develop several robust estimators, some of them being computationally fast modifications of methods of moments, and some rather efficient modifications of conditional maximum likelihood. These estimators are compared to related recent proposals using simulations. The usefulness of the proposed methods is illustrated by a real data example.
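    The conditional model described above is easy to simulate. The following minimal R sketch (illustrative only, not the authors' code; the parameter names beta0, beta1, phi and the initialization of the first conditional mean are ad hoc choices) generates an NB-INARCH(1) series and fits it by non-robust conditional maximum likelihood:

```r
## Minimal sketch: simulate a negative binomial INARCH(1) series and fit it by
## conditional maximum likelihood.  Parameter names are illustrative.
set.seed(1)

sim_nb_inarch1 <- function(n, beta0, beta1, phi) {
  y <- numeric(n)
  lambda <- beta0 / (1 - beta1)          # start at the stationary mean
  for (t in seq_len(n)) {
    y[t] <- rnbinom(1, mu = lambda, size = phi)
    lambda <- beta0 + beta1 * y[t]       # conditional mean for the next step
  }
  y
}

negloglik <- function(par, y) {
  beta0 <- par[1]; beta1 <- par[2]; phi <- par[3]
  ## conditional mean at time t uses y[t-1]; the first entry is initialized ad hoc
  lambda <- beta0 + beta1 * c(mean(y), y[-length(y)])
  -sum(dnbinom(y, mu = lambda, size = phi, log = TRUE))
}

y <- sim_nb_inarch1(500, beta0 = 2, beta1 = 0.5, phi = 3)
fit <- optim(c(1, 0.3, 1), negloglik, y = y, method = "L-BFGS-B",
             lower = c(1e-6, 0, 1e-6), upper = c(Inf, 0.99, Inf))
fit$par   # rough estimates of (beta0, beta1, phi)
```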
  • Item
    Blockwise estimation of parameters under abrupt changes in the mean
    (2020) Axt, Ieva; Fried, Roland; Müller, Christine
    In this thesis we deal with the estimation of parameters under shifts in the mean. The results of this work are based on three articles. The first main chapter of this thesis presents estimation methods for the LRD parameter under shifts in the mean. In the context of long range dependent (LRD) stochastic processes the main task is estimation of the Hurst parameter H, which describes the strength of dependence. When data are contaminated by level shifts, ordinary estimators of H, such as the Geweke and Porter-Hudak (GPH) estimator, may fail to distinguish between LRD and structural changes, such as jumps in the mean. As a consequence, the estimator may suffer from positive bias and overestimate the intensity of the LRD. This is a major issue, for example, when testing for changes in the mean. To overcome this problem, we propose to segregate the sample of size N into blocks and then to estimate H on each block separately. Estimates calculated in different blocks are then combined into a final estimate of the Hurst parameter. We investigate several possibilities of segregating the data and assess their performance in a simulation study. One possibility is segregation into two blocks. The position at which the data are separated into two parts is either estimated using the Wilcoxon change-point test or chosen at an arbitrary point, yielding estimates that are combined by averaging. Another possibility is dividing the sequence of observations into many overlapping or non-overlapping blocks and estimating H by averaging estimates from these blocks. In the presence of one or even several jumps this procedure performs well in simulations. When dealing with processes with long memory and short range dependence, such as the fractionally integrated ARMA process (ARFIMA), the proposed estimators do not yield desirable results. Therefore, we follow an ARMA correction procedure and estimate the Hurst parameter in several recursive steps, using the overlapping or the non-overlapping blocks approach. In the context of LRD we observe that segregation into many blocks improves the ordinary estimators of H considerably under abrupt changes in the mean. We follow the same idea of segregation to estimate the variance of independent or weakly dependent processes under level shifts. The second main chapter of this thesis deals with scale estimation under shifts in the mean. When dealing with a few level shifts in finite samples we propose using the ordinary average of sample variances obtained from many non-overlapping blocks. Under some conditions on the number of change-points and the number of blocks we prove strong consistency and asymptotic normality for independent data, and full asymptotic efficiency compared to the ordinary sample variance is shown. For weakly correlated processes we prove weak consistency of the blocks estimator. This estimator performs well when the number of level shifts is moderately low. In the presence of many level shifts even better results are obtained by an adaptive trimmed mean of the sample variances from non-overlapping blocks. The fraction of trimmed blockwise estimates is chosen adaptively, where extraordinarily high sample variances are removed before calculating the average value. Even though this procedure is developed under the assumption of independence, it also performs well under weak dependence, e.g. when dealing with AR processes.
    If the data are additionally contaminated by outliers, the proposed estimators fail to estimate the variance properly, since they are not robust. Therefore, we investigate a modified version of the well-known median absolute deviation (MAD) to account for both sources of contamination: level shifts and outliers. The formula of the MAD involves the sample median, which is not a good estimator of location in the presence of level shifts. Our proposal is to calculate the sample median in non-overlapping blocks and to consider absolute differences involving blockwise medians instead of a single median calculated on the whole sample. In this way only some blocks are affected by level shifts, and the resulting modified MAD is robust against outliers and level shifts simultaneously. We prove strong consistency and asymptotic normality for independent random variables under some conditions on the number of change-points and the number of blocks. The Bahadur representation of the proposed estimator is shown to be the same as in the case of the ordinary MAD, resulting in the same asymptotic variance. In a simulation study the modified MAD provides very good results and performs well in many scenarios compared to other robust methods discussed for comparison.
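    As a rough illustration of the blockwise idea, the following R sketch computes a blockwise variant of the MAD in which the median is taken per block; the block length and the consistency factor 1.4826 are standard default choices here, not necessarily those analysed in the thesis:

```r
## Minimal sketch of a blockwise MAD: medians are computed in non-overlapping
## blocks and absolute deviations are taken from the block medians.
blockwise_mad <- function(x, block_len = 50) {
  blocks <- split(x, ceiling(seq_along(x) / block_len))
  abs_dev <- unlist(lapply(blocks, function(b) abs(b - median(b))))
  1.4826 * median(abs_dev)
}

## Example: standard normal data with one level shift in the middle
set.seed(2)
x <- rnorm(1000) + rep(c(0, 5), each = 500)
c(ordinary_mad = mad(x), blockwise = blockwise_mad(x))  # blockwise stays near 1
```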
  • Item
    Group-based regionalization in flood frequency analysis considering heterogeneity
    (2019) Lilienthal, Jona; Fried, Roland; Ligges, Uwe
    This dissertation deals with the problem of estimating the recurrence time of rare flood events, especially in the case of short data records. In such scenarios regional flood frequency analysis is used, a multi-step procedure with the goal of improving quantile estimates by pooling information across different gauging stations. Different aspects of regional flood frequency analysis using the Index Flood model are analysed, and improvements for parts of the procedure are proposed. In group-based regional flood frequency analysis, sets of stations are built for which a similar flood distribution is assumed. In the Index Flood model, this means that the flood distributions of all stations are the same except for a site-specific scaling factor. Because the validity of this assumption is of crucial importance for the benefits of regionalization, it is commonly checked using homogeneity tests. After possible reassignments of stations to the groups, the information from records within a group is pooled and quantile estimates can be calculated by combination of a site-specific factor and a regional curve. Each of the main chapters of this dissertation focuses on specific steps of this procedure. The first main chapter investigates the known drawbacks of the commonly used homogeneity testing procedure of Hosking and Wallis based on L-moments. A new generalized procedure is proposed that uses copulas to model the intersite dependence and trimmed L-moments as a more robust replacement of L-moments. With these changes an improved detection rate in situations of medium to high skewness and in the case of cross-correlated data can be achieved. Another benefit is an increased robustness against outliers or extreme events. The second main chapter is more technical. The asymptotic distribution of sample probability-weighted moments is described in a setting of multiple sites of different record lengths. This theory is then extended to sample TL-moments and GEV parameter and quantile estimators based on them. An estimator for the limiting covariance matrix is given and analysed. The applicability of the theory is illustrated by the construction of a homogeneity test. This test works well when used with trimmed L-moments, but it needs a record length of at least 100 observations at each site to give acceptable error rates. The last main chapter deals with penalized maximum likelihood estimation in flood frequency analysis as an alternative data pooling scheme. Under the assumption of generalized extreme value distributed data, the Index Flood model is translated into restrictions on the parameter space. The penalty term of the optimization problem is then chosen to reflect those restrictions, and its influence can be controlled by a hyperparameter. The hyperparameter choice can be automated by cross-validation, which leads to a procedure that automatically finds a compromise between local and regional estimation. This is especially useful in situations in which homogeneity is unclear. A simulation study indicates that this approach works nearly as well as pure regional methods if the homogeneity assumption is completely true, and better than its competitors if the assumption does not hold. Overall, this dissertation presents different approaches and improvements to steps of a group-based regionalization procedure. A particular focus is the assessment of the homogeneity of a given group, which is analysed with two different approaches.
    However, due to short record lengths or limitations in the homogeneity testing procedures, heterogeneous groups are often still hard to detect. In such situations the presented penalized maximum likelihood estimator can be applied, as it gives comparatively good results in both homogeneous and heterogeneous scenarios. Nevertheless, application of this estimator does not supersede the group building steps, since the benefit of regionalization is highest if the homogeneity assumption is fulfilled.
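    The sample probability-weighted moments and L-moments that underlie the homogeneity testing above can be computed directly; a minimal R sketch of the textbook Hosking-Wallis estimators (without the trimming step that yields TL-moments) follows:

```r
## Minimal sketch: unbiased sample probability-weighted moments b_r and the
## first sample L-moments derived from them (textbook formulation, not code
## from the dissertation).
sample_lmoments <- function(x) {
  x <- sort(x); n <- length(x); i <- seq_len(n)
  b0 <- mean(x)
  b1 <- sum((i - 1) / (n - 1) * x) / n
  b2 <- sum((i - 1) * (i - 2) / ((n - 1) * (n - 2)) * x) / n
  b3 <- sum((i - 1) * (i - 2) * (i - 3) / ((n - 1) * (n - 2) * (n - 3)) * x) / n
  l1 <- b0
  l2 <- 2 * b1 - b0
  l3 <- 6 * b2 - 6 * b1 + b0
  l4 <- 20 * b3 - 30 * b2 + 12 * b1 - b0
  c(l1 = l1, l2 = l2, t3 = l3 / l2, t4 = l4 / l2)  # L-scale and L-moment ratios
}

set.seed(3)
sample_lmoments(rgamma(200, shape = 2))   # L-skewness t3 should be clearly positive
```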
  • Item
    Robust and non-parametric control charts for time series with a time-varying trend
    (2019) Abbas, Sermad; Fried, Roland; Müller, Christine
    The detection of structural breaks is an important task in many online-monitoring applications. Dynamics in the underlying signal can lead to a time series with a time-varying trend. This complicates the distinction between natural fluctuations and sudden structural breaks that are relevant to the application. Moreover, outlier patches can be confused with structural breaks or mask them. A frequently formulated goal is to achieve a high detection quality while keeping the number of false alarms low. An example is the monitoring of vital parameters in intensive care where false alarms can lead to alarm fatigue of the medical staff, but missed structural breaks may cause health-threatening situations for the patient. The number of false alarms is often controlled by the average run length (ARL) or median run length (MRL), which measure the duration between two consecutive alarms. Typical procedures for online monitoring under these premises are control charts. They compute a control statistic from the most recent observations and compare it to control limits. By this, it can be decided whether the process is in control or out of control. In this thesis, control charts for the mean function are developed and studied. They are based on the sequential application of two-sample location tests in a moving time window to the most recent observations. The window is split into two subwindows which are compared with each other by the test. Unlike popular control schemes like the Shewhart, CUSUM, or EWMA scheme, a large set of historical data to specify the control limits is not required. Moreover, the control charts only depend on local model assumptions, allowing for an adaptation to the local process behaviour. In addition, by choosing appropriate window widths, it is possible to tune the control charts to be robust against a specific number of consecutive outliers. Thus, they can automatically distinguish between outlier patches and location shifts. Via simulations and some theoretical considerations, control charts based on selected two-sample tests are compared. Assuming a locally constant signal, the ability to detect sudden location shifts is studied. It is shown that the in-control run-length distribution of charts based on rank tests does not depend on the data-generating distribution. Hence, such charts keep a desired in-control ARL or MRL under every continuous distribution. Moreover, control charts based on robust competitors of the well-known two-sample t-test are considered. The difference of the sample means and the pooled empirical standard deviation are replaced with robust estimators for the height of a location shift and the scale. In case of tests based on the two-sample Hodges-Lehmann estimator for shift and the difference of the sample medians, the in-control ARL and MRL seem to be nearly distribution free when computing the control limits with a randomisation principle. In general, the charts retain properties of the underlying tests. Out-of-control simulations indicate that a test which is efficient over a wide range of distributions leads to a control chart with a high detection quality. Moreover, the robustness of the tests is inherited by the charts. In the considered settings, the two-sample Hodges-Lehmann estimator leads to a control chart with promising overall results concerning the in- and out-of-control behaviour. While being able to deal with very slow trends, the moving-window approach deteriorates for stronger trends of an in-control process. 
    Because trends are then confused with location shifts, the number of false alarms becomes unacceptably large. The aforementioned approach is extended by constructing residual control charts based on local model fitting. The idea is to compute a sequence of one-step-ahead forecast errors from the most recent observations to remove the trend and to apply the tests to them. This combination makes it possible to detect location shifts and sudden trend changes in the original time series. Robust regression estimators retain the information on change points in the sequence better than non-robust ones. Based on a literature summary on robust online filtering procedures, which is also part of this thesis, the one-step-ahead forecast errors are computed by repeated median regression. The conclusions are similar to those for the locally constant signal: efficient robust tests lead to control charts with a high detection quality, and robustness is preserved. However, due to correlated forecast errors, in-control ARL and MRL are not completely distribution free. Still, it is possible to construct charts for which these measures seem approximately distribution free. Again, a chart based on the two-sample Hodges-Lehmann estimator turns out to perform well. A first investigation under the assumption of a local autoregressive model of order one is also provided. In this case, the in- and out-of-control performances of the charts depend not only on the underlying distribution but also on the strength of the autocorrelation. Under distributional assumptions, the results indicate that an acceptable detection quality for small to moderate autocorrelation can be achieved. The application of the control charts to data from different real-world applications indicates that they can reliably detect large structural breaks even in trend periods. Additional rules can be helpful to further reduce the number of false alarms and facilitate the distinction between relevant and irrelevant changes. Furthermore, it is shown how the procedures can be modified to detect sudden variability changes in time series with a non-linear signal.
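    A minimal R sketch of the moving-window principle described above: the most recent window is split into two halves and the two-sample Hodges-Lehmann estimator serves as control statistic. The window width and the fixed control limit are illustrative; in the thesis the limits are calibrated to a target in-control ARL or MRL:

```r
## Two-sample Hodges-Lehmann estimate of the shift from x to y
hl2 <- function(x, y) median(outer(y, x, "-"))

## Moving-window chart: compare the older half of the window with the recent half
monitor <- function(x, width = 40, limit = 1.5) {
  h <- width / 2
  stat <- rep(NA_real_, length(x))
  for (t in seq(width, length(x))) {
    win <- x[(t - width + 1):t]
    stat[t] <- hl2(win[1:h], win[(h + 1):width])
  }
  list(stat = stat, alarm = which(abs(stat) > limit))
}

set.seed(4)
x <- c(rnorm(150), rnorm(150, mean = 2))   # level shift at t = 151
monitor(x)$alarm[1]                        # first alarm shortly after the shift
```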
  • Item
    Statistische Analyse von Modellen für die Krankheitsprogression
    (2017) Hainke, Katrin; Fried, Roland; Rahnenführer, Jörg
    This dissertation takes a closer look at models for describing disease progression and compares them using statistical methods. A deeper understanding of the onset and course of a disease is essential for suitable medical treatment. It is therefore of great importance to be able to identify individual steps of disease progression. The progression models presented in this work aim to estimate the dependence structure of the genetic events that have occurred and thus to specify characteristic event paths. The underlying genetic events are defined as binary random variables and describe mutations that have occurred on chromosomes, chromosome arms or even single genes. Overall, there are many different approaches for modelling disease progression from cross-sectional data, which are first presented together with their properties and estimation algorithms. A comprehensive evaluation and comparison of these models has so far been missing from the literature. Therefore, a model comparison addresses the question of whether there is a particular model that always describes disease progression best, and which model is best suited for which data situations. Furthermore, it is unclear how model classes behave and how the computed results should be judged when some model assumptions are violated. The above questions are easy to answer when the true model is known. If this is not the case, suitable model selection strategies are needed that choose an appropriate model from a set of model classes. This problem is also treated in the dissertation. Another important point before fitting a progression model is the selection of the events. Only those events that play a decisive role in the course of the disease should be included in the model. Various variable selection methods are presented, evaluated and applied to real data sets.
  • Item
    Robust change-point detection and dependence modeling
    (2017) Dürre, Alexander; Fried, Roland; Vogel, Daniel; Müller, Christine H.
    This doctoral thesis consists of three parts: robust estimation of the autocorrelation function, the spatial sign correlation, and robust change-point detection in panel data. Although these topics cover quite different statistical branches, namely time series analysis, multivariate analysis, and change-point detection, all sections share a common theme: robustness. Robustness here means that the statistical analysis should stay reliable if a small fraction of observations does not follow the chosen model. The first part of the thesis is a review study comparing different proposals for robust estimation of the autocorrelation function. Over the years many estimators have been proposed, but thorough comparisons are missing, resulting in a lack of knowledge about which estimator is preferable in which situation. We treat this problem, though we mainly concentrate on a special but nonetheless very popular case where the bulk of observations is generated from a linear Gaussian process. The second chapter deals with a related topic, namely measuring dependence through the spatial sign correlation, a robust estimator of the correlation coefficient that is distribution-free within the elliptical model and based on the spatial sign covariance matrix. We derive its asymptotic distribution and robustness properties like the influence function and gross error sensitivity. Furthermore, we propose a two-stage version which improves both efficiency under normality and robustness. The surprisingly simple formula of its asymptotic variance is used to construct a variance stabilizing transformation, which enables us to calculate very accurate confidence intervals that are distribution-free within the elliptical model. We also propose a positive semi-definite multivariate spatial sign correlation, which is more efficient but less robust than its bivariate counterpart. The third chapter deals with a robust test for a location change in panel data under serial dependence. Robustness is achieved by using robust scores, which are calculated by applying psi-functions. The main focus here is to derive asymptotics under the null hypothesis of a stationary panel when both the number of individuals and the number of time points tend to infinity. We can show under some regularity assumptions that the limiting distribution does not depend on the underlying distribution of the panel as long as we have short range dependence in the time dimension and independence in the cross-sectional dimension.
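    The spatial sign covariance matrix, the building block of the spatial sign correlation discussed above, can be sketched in a few lines of R. For simplicity the location is taken to be the coordinatewise median here, which differs from the spatial median used in the rigorous treatment:

```r
## Minimal sketch of the spatial sign covariance matrix (SSCM).
sscm <- function(X) {
  centred <- sweep(X, 2, apply(X, 2, median))   # simplification: coordinatewise median
  norms <- sqrt(rowSums(centred^2))
  signs <- centred / norms                      # project each observation to the unit sphere
  crossprod(signs) / nrow(X)
}

set.seed(5)
Sigma <- matrix(c(1, 0.8, 0.8, 1), 2)
X <- matrix(rnorm(2 * 500), ncol = 2) %*% chol(Sigma)
sscm(X)   # eigenvectors estimate those of Sigma; eigenvalues are shrunk towards each other
```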
  • Item
    Robust estimation methods with application to flood statistics
    (2017) Fischer, Svenja; Fried, Roland; Schumann, Andreas H.; Wendler, Martin; Krämer, Walter
    Robust statistics and the use of robust estimators have come more and more into focus during the last few years. In the context of flood statistics, robust estimation methods are used to obtain stable estimates of, e.g., design floods. These are estimates that do not change from one year to another just because one large flood occurred. A problem which is often ignored in flood statistics is the underlying dependence structure of the data. When considering discharge data with high time resolution, short range dependent behaviour can be detected within the time series. To take this into account, in this thesis a limit theorem for the class of GL-statistics is developed under the very general assumption of near epoch dependent processes on absolutely regular random variables, a well-known concept of short range dependence. GL-statistics form a very general class of statistics and can be used to represent many robust and non-robust estimators, such as Gini's mean difference, the Qn-estimator or the generalized Hodges-Lehmann estimator. In a direct application the limit distribution of L-moments and their robust extension, the trimmed L-moments, is derived. Moreover, a long-run variance estimator is developed. For all these results, the use of U-statistics and U-processes proves to be the key tool, such that a central limit theorem for multivariate U-statistics as well as an invariance principle for U-processes and the convergence of the remainder term of the Bahadur representation for U-quantiles are shown. A challenge in proving these results is posed by the multivariate kernels, which are needed to represent very general estimators and statistics. A concrete application in the context of flood statistics, in particular in the estimation of design floods, the classification of homogeneous groups and the modelling of short range dependent discharge series, is given. Here, well-known models (peak-over-threshold) as well as newly developed ones, for example mixing models using the distinction of floods according to their timescales, are combined with robust estimators, and their advantages and disadvantages with respect to stability and efficiency are investigated. The results show that the use of the new models, which take more information into account by enlarging the data basis, in combination with robust estimators leads to a very stable estimation of design floods, even in high quantiles. Whereas many of the classical estimators, like maximum likelihood estimators or L-moments, are affected by single extraordinarily extreme events and need a long time to stabilise, the robust methods approach the same level of stabilisation rather fast. Moreover, the newly developed mixing model can be used not only for flood estimation but also for regionalisation, that is, the modelling of ungauged basins. Here, too, especially when a classification of flood events and homogeneous groups of gauges is needed, the use of robust estimators results in stable estimates.
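    Two of the estimators named above, Gini's mean difference and a Hodges-Lehmann type location estimator, can be written directly as U-statistics with kernels of order two. The following R sketch uses naive O(n^2) implementations for illustration only; the generalized Hodges-Lehmann estimator of the thesis may differ from the classical variant shown here:

```r
## Minimal sketch: two U-statistics with kernels of order two.
gini_mean_diff <- function(x) {
  d <- abs(outer(x, x, "-"))
  sum(d[upper.tri(d)]) / choose(length(x), 2)   # kernel |x_i - x_j|
}

hodges_lehmann <- function(x) {
  s <- outer(x, x, "+") / 2
  median(s[upper.tri(s)])                       # median of pairwise means (Walsh averages)
}

set.seed(6)
x <- rnorm(200, mean = 10)
c(gini = gini_mean_diff(x), hl = hodges_lehmann(x))
```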
  • Item
    Semi- and non-parametric flood frequency analysis
    (2016) Kinsvater, Paul; Fried, Roland; Krämer, Walter; Ruckdeschel, Peter
    My thesis investigates classical assumptions of (regional) flood frequency analysis. I present new methods based on semi- and non-parametric assumptions and derive several theoretical results. The first main contribution (published in Environmetrics) deals with testing homogeneity of the extreme value behaviour of a group of marginal distributions. This assumption is fundamental for estimating high quantiles in classical (parametric) regional flood frequency analysis. We also show how to construct a regional estimator of high quantiles under semi-parametric assumptions, and I derive the corresponding asymptotic results. The second main contribution (accepted at Extremes) concerns a new test for structural breaks. In an international collaboration we constructed a test that is particularly sensitive to changes in the so-called Pickands dependence function of multivariate distributions. An extension of the procedure even allows known breaks in the marginal distributions to be ignored. The last main contribution studies conditional extreme value behaviour via a parametric model for the conditional extreme value index. The extreme value index describes the heaviness of a distribution's tail and is therefore of particular interest when high quantiles are the quantity of interest. In addition to existing methods we developed our own estimator and, building on it, a test procedure for detecting trends in extreme value behaviour. Our investigations suggest that seasonalities over the course of the year are also reflected in the extreme value behaviour of discharge time series. This contradicts the validity of some methods used in flood frequency analysis that are based on so-called partial duration series.
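    As a small illustration of extreme value index estimation, the quantity studied in the last contribution, the following R sketch implements the classical Hill estimator (not the regional or conditional estimators developed in the thesis); k, the number of upper order statistics, is a tuning choice:

```r
## Minimal sketch of the Hill estimator of the extreme value index.
hill <- function(x, k) {
  xs <- sort(x, decreasing = TRUE)
  mean(log(xs[1:k])) - log(xs[k + 1])
}

set.seed(7)
x <- 1 / runif(2000)^0.4   # Pareto-type tail with extreme value index 0.4
hill(x, k = 200)           # should be close to 0.4
```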
  • Item
    Modeling count time series following generalized linear models
    (2016) Liboschik, Tobias; Fried, Roland; Fokianos, Konstantinos
    Count time series are found in many different applications, e.g. in medicine, finance or industry, and have received increasing attention in the last two decades. The class of count time series following generalized linear models is very flexible and can describe serial correlation in a parsimonious way. The conditional mean of the observed process is linked to its past values, to past observations and to potential covariate effects. In this thesis we give a comprehensive formulation of this model class. We consider models with the identity and with the logarithmic link function. The conditional distribution can be Poisson or negative binomial. An important special case of this class is the so-called INGARCH model and its log-linear extension. A key contribution of this thesis is the R package tscount which provides likelihood-based estimation methods for analysis and modeling of count time series based on generalized linear models. The package includes methods for model fitting and assessment, prediction and intervention analysis. This thesis summarizes the theoretical background of these methods. It gives details on the implementation of the package and provides simulation results for models which have not been studied theoretically before. The usage of the package is illustrated by two data examples. Additionally, we provide a review of R packages which can be used for count time series analysis. A detailed comparison of tscount to those packages demonstrates that tscount is an important contribution which extends and complements existing software. A thematic focus of this thesis is the treatment of all kinds of unusual effects influencing the ordinary pattern of the data. This includes structural changes and different forms of outliers one is faced with in many time series. Our first study on this topic is concerned with retrospective detection of such changes. We analyze different approaches for modeling such intervention effects in count time series based on INGARCH models. Other authors treated a model where an intervention affects the non-observable underlying mean process at the time point of its occurrence and additionally the whole process thereafter via its dynamics. As an alternative, we consider a model where an intervention directly affects the observation at its occurrence, but not the underlying mean, and then also enters the dynamics of the process. While the former definition describes an internal change of the system, the latter can be understood as an external effect on the observations due to e.g. immigration. For our alternative model we develop conditional likelihood estimation and, based on this, tests and detection procedures for intervention effects. Both models are compared analytically and using simulated and real data examples. The procedures for our new model work reliably and we find some robustness against misspecification of the intervention model. The aforementioned methods are applied after the complete time series has been observed. In another study we investigate the prospective detection of structural changes, i.e. in real time. For example in public health, surveillance of infectious diseases aims at recognizing outbreaks of epidemics with only short time delays in order to take adequate action promptly. We point out that serial dependence is present in many infectious disease time series. Nevertheless it is still ignored by many procedures used for infectious disease surveillance.
    Using historical data, we design a prediction-based monitoring procedure for count time series following generalized linear models. We illustrate benefits but also pitfalls of using dependence models for monitoring. Moreover, we briefly review the literature on model selection, robust estimation and robust prediction for count time series. We also conduct a first study on robust model identification using robust estimators of the (partial) autocorrelation.
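    A minimal usage sketch for the tscount package described above; the argument names follow my reading of the package documentation and should be checked against the package vignette:

```r
## Minimal usage sketch for tscount (interface assumed from the documentation).
library(tscount)

## INGARCH(1,1)-type model: conditional mean depends on the previous observation
## and the previous conditional mean, negative binomial conditional distribution.
fit <- tsglm(campy,
             model = list(past_obs = 1, past_mean = 1),
             distr = "nbinom")
summary(fit)

## Retrospective search for an intervention effect;
## delta = 1 targets a level shift (delta = 0 would be a spiky outlier).
interv_detect(fit, taus = 20:120, delta = 1)
```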
  • Item
    Distribution-free analysis of homogeneity
    (2015) Wornowizki, Max; Fried, Roland; Müller, Christine
    In this dissertation three problems strongly connected to the topic of homogeneity are considered. For each of them a distribution-free approach is investigated using simulated as well as real data. The first procedure proposed is motivated by the fact that a mere rejection of homogeneity is unsatisfactory in many applications, because it is often not clear which discrepancies between the samples cause the rejection. To capture the dissimilarities, our method combines a fairly general mixture model with the classical nonparametric two-sample Kolmogorov-Smirnov test. In case of a rejection by this test, the proposed algorithm quantifies the discrepancies between the corresponding samples. These dissimilarities are represented by the so-called shrinkage factor and the correction distribution. The former measures the degree of discrepancy between the two samples. The latter contains information on the over- and undersampled regions when comparing one sample to the other in the Kolmogorov-Smirnov sense. We prove the correctness of the algorithm as well as its linear running time when applied to sorted samples. As illustrated in various data settings, the fast method leads to adequate and intuitive results. The second topic investigated is a new class of two-sample homogeneity tests based on the concept of f-divergences. These distance-like measures for pairs of distributions are defined via the corresponding probability density functions. Thus, homogeneity tests relying on f-divergences are not limited to discrepancies in location or scale, but can detect arbitrary types of alternatives. We propose a distribution-free estimation procedure for this class of measures based on kernel density estimation and spline smoothing. As shown in extensive simulations, the new method performs stably and quite well in comparison to several existing non- and semiparametric divergence estimators. Furthermore, we construct distribution-free two-sample homogeneity tests relying on various divergence estimators using the permutation principle. The tests are compared to an asymptotic divergence procedure as well as to several traditional parametric and nonparametric tests on data from different distributions under the null hypothesis and several alternatives. The results suggest that divergence-based methods have considerably higher power than traditional methods if the distributions do not predominantly differ in location. Therefore, it is advisable to use such tests if changes in scale, skewness, kurtosis or the distribution type are possible while the means of the samples are of comparable size. The methods are thus of great value in many applications, as illustrated on ion mobility spectrometry data. The last topic we deal with is the detection of structural breaks in time series. The method introduced is motivated by characteristic functions and Fourier-type transforms. It is highly flexible in several ways: firstly, it allows testing for the constancy of an arbitrary feature of a time series such as location, scale or skewness. It is thus applicable in various problems. Secondly, the method makes use of arbitrary estimators of the feature under investigation. Hence, a robustification of the approach or other modifications are straightforward. We demonstrate the testing procedure focussing on volatility as well as on kurtosis. In both cases our approach leads to reasonable rejection rates for symmetric distributions in comparison to several tests from the literature.
    In particular, the test shines in the presence of multiple structural breaks, because its test statistic is constructed in a blockwise manner. The positions and the number of presumed change points located by the new procedure also correspond quite well to the true ones. The method is thus well suited for many applications, as illustrated on exchange rate data.
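    The permutation principle mentioned above is generic: any two-sample statistic can be plugged in. A minimal R sketch, using the Kolmogorov-Smirnov distance as a stand-in for the kernel-based divergence estimators of the thesis:

```r
## Minimal sketch of a permutation two-sample test with an exchangeable statistic.
perm_test <- function(x, y, stat = function(a, b) ks.test(a, b)$statistic, B = 999) {
  obs <- stat(x, y)
  z <- c(x, y); n <- length(x)
  perm <- replicate(B, { idx <- sample(length(z), n); stat(z[idx], z[-idx]) })
  mean(c(perm, obs) >= obs)   # permutation p-value
}

set.seed(8)
perm_test(rnorm(60), rnorm(60, sd = 2))   # scale alternative: small p-value expected
```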
  • Item
    Threshold optimization and variable construction for classification in the MAGIC and FACT experiments
    (2014-10-15) Voigt, Tobias; Fried, Roland; Weihs, Claus
    In the MAGIC and FACT experiments, random forests are usually used for the classification of a gamma-ray signal against hadronic background. Random forests use a set of tree classifiers and aggregate the single decisions of the trees into one overall decision. In this work a method to choose an optimal threshold value for the random forest classification is introduced. The method is based on the minimization of the MSE of an estimator for the number of gamma particles in the data set. In a second step, new variables for the classification are introduced. The idea of these variables is to fit bivariate distributions to the images recorded by the two telescopes and to use distance measures for densities to calculate the distance between the observed and fitted distributions. With a reasonable choice of distributions to fit, it can be expected that such distances are smaller for gamma observations than for the hadronic background. In a third step, the new threshold optimization and the new variable construction are combined and compared to the methods currently in use. It can be seen that the new methods lead to substantial improvements of the classification with regard to the aim of the analysis.
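    The threshold optimization can be illustrated on labelled Monte Carlo data. The sketch below assumes a generic corrected count estimator (S - b*N)/(e - b), with gamma efficiency e and hadron acceptance b estimated at each threshold; the exact estimator and MSE criterion used in the thesis may differ:

```r
## Minimal sketch: pick the score threshold minimizing the empirical MSE of a
## corrected gamma-count estimator on labelled Monte Carlo data.
set.seed(9)
n_g <- 1000; n_h <- 9000
score <- c(rbeta(n_g, 5, 2), rbeta(n_h, 2, 5))   # RF scores: gammas tend to score high
label <- rep(c("gamma", "hadron"), c(n_g, n_h))

mse_of_threshold <- function(thr, B = 200) {
  e <- mean(score[label == "gamma"] > thr)       # gamma efficiency at this threshold
  b <- mean(score[label == "hadron"] > thr)      # hadron acceptance at this threshold
  est <- replicate(B, {
    idx <- sample(length(score), replace = TRUE) # bootstrap pseudo-experiment
    S <- sum(score[idx] > thr)
    (S - b * length(idx)) / (e - b)              # corrected count estimate
  })
  mean((est - n_g)^2)                            # empirical MSE around the true count
}

thrs <- seq(0.3, 0.9, by = 0.05)
thrs[which.min(sapply(thrs, mse_of_threshold))]  # threshold with smallest estimated MSE
```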
  • Item
    Robuste Verfahren zur Periodendetektion in ungleichmäßig beobachteten Lichtkurven
    (2014-01-31) Thieler, Anita Monika; Fried, Roland; Müller, Christine
    An important task in both astroparticle physics and astrophysics is the search for periodicity in the measurements of unevenly sampled time series, called light curves. Periodograms for light curves are often computed by fitting periodic functions of different trial periods to the light curve and computing one periodogram bar per period as a goodness-of-fit criterion. By using weighted regression, so-called measurement errors, which describe the measurement accuracy of each observation, can be taken into account in the periodogram computation. In this work 84 different periodogram methods built according to the above principle are compared. They differ in the fitted function, the regression technique used for fitting, and whether measurement errors are taken into account. Since light curve data frequently contain disturbances, several robust regression techniques are investigated in addition to the usual least squares regression. Many of the compared periodogram methods are proposed for the first time in this work and in accompanying publications. For the detection of conspicuously high periodogram bars, a procedure based on outlier identification is used for the first time: a distribution is fitted robustly to the periodogram bars by Cramér-von-Mises distance minimization, and bars exceeding a prespecified quantile of that distribution are detected. To reduce the dependence among the periodogram bars, procedures are also investigated in which the periodogram is thinned out in various ways before fitting the distribution; these, however, do not prove competitive in the comparison. The comparison is carried out by means of a simulation study and the analysis of real data. The programs used come from the R package RobPer, which was developed as part of this work. It turns out that taking the measurement errors into account offers no advantage for period detection. The use of robust regression for the periodogram computation, in contrast, can be very helpful when the data contain disturbances. In the application to real data, some indications of periodicities are found that have not yet been documented in the literature and that can only be found here with the help of robust regression. Four periodogram methods are identified that usually achieve good detection results even with disturbances in the data and that approximately keep a prescribed significance level in the absence of a periodic fluctuation. These methods are based on fitting a step function by M- or tau-regression and on fitting a third-order Fourier sum by L1- or M-regression. None of the four methods had been proposed for periodogram computation for unevenly sampled time series before this work and the accompanying publications. Furthermore, this work makes first proposals for a filter that can be applied to the light curve before the periodogram computation. This reduces a potentially present special noise component, the so-called red noise. The novel filter can be applied successfully to some simulated examples and one real data example.
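    A minimal R sketch of the periodogram principle described above: for each trial period a periodic function (here a single sine/cosine pair) is fitted and the coefficient of determination serves as the periodogram bar. Plain least squares is used for simplicity; the thesis replaces it by robust regression, as implemented in the RobPer package:

```r
## Minimal sketch: R^2-based periodogram for an unevenly sampled time series.
periodogram_r2 <- function(t, y, periods) {
  sapply(periods, function(p) {
    fit <- lm(y ~ sin(2 * pi * t / p) + cos(2 * pi * t / p))
    summary(fit)$r.squared
  })
}

set.seed(10)
t <- sort(runif(300, 0, 100))                 # unevenly sampled observation times
y <- sin(2 * pi * t / 7) + rnorm(300, sd = 0.5)
periods <- seq(2, 20, by = 0.1)
periods[which.max(periodogram_r2(t, y, periods))]   # close to the true period 7
```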
  • Item
    Robust normality test and robust power transformation with application to state change detection in non normal processes
    (2013-06-14) Ntiwa Foudjo, Arsene; Fried, Roland; Krämer, Walter
    The primary objective of this thesis is the construction of a powerful state change detection procedure for monitoring time series, which can help decision makers to react faster to changes in the system and define the proper course of action for each case. Without losing sight of our primary goal, we first derive a robust test of approximate normality based on the Shapiro-Wilk test (RSW), which detects whether the majority of the data follows a normal distribution. The RSW test is based on the idea of trimming the original sample, replacing the observations in the tails by artificially generated normally distributed data, and then performing the Shapiro-Wilk test on the modified sequence. We show that under the null hypothesis of normality the modified sequence is asymptotically normally distributed and that the RSW test statistic has the same asymptotic null distribution as the Shapiro-Wilk test statistic. The RSW test proves to be resistant to outliers and outperforms the other robust test for normality considered here in the presence of outliers. Intending to use the RSW test to create a robust estimator of the Box-Cox transformation, we also investigate its behaviour with respect to the inverse Box-Cox transformation. It proves to be resistant to outliers in this case as well and also outperforms its competitors in the presence of a few outliers. Secondly, we use the RSW test to derive a robust estimator λ̂_RSW of the Box-Cox transformation parameter. This is in line with the fact that the Box-Cox transformation only achieves approximate normality and that the Shapiro-Wilk test is one of the most powerful tests of normality. Gaudard & Karson (2000) already derived a non-robust estimator of the Box-Cox transformation parameter based on the Shapiro-Wilk test statistic that outperformed the other estimators considered in their comparison. As expected, λ̂_RSW is preferable to the maximum likelihood and the M-estimators we considered, mainly because it yields a better transformation in the sense that the transformed samples are not only more symmetrical according to the medcouple (a robust measure of symmetry and tail weight), but also have a higher pass rate for the RSW test and the MC1 test at a significance level of 5%. Finally, returning to state change detection, we opt for the method of Harrison & Stevens (1971), which considers four states: the steady state (normal state), the step change (level shift), the slope change and the outlier. The assumption of normally distributed data restricts the usage of the procedure, so we transform the data with λ̂_RSW to achieve approximate normality. We extend the update equations to two observations in the past, that is, we compute the probability of occurrence of a state change at time t - 2 given all available data until time t. This extension is used when we derive classification rules for the incoming observations, given that the procedure only computes a posteriori probabilities for the different states and does not classify them. We use linear discriminant analysis and intensive simulations to derive the classification rules. We derive an instantaneous classification, which separates the step change and the outlier from the slope change and the steady state at the arrival of each observation, and a one-step-after classification, which separates the three classes outlier, step change, and slope change or steady state one step after each observation becomes available.
    The simulations show that the first rule has an out-of-sample classification error of 2.1% and the second rule one of 3.11%. In contrast, the naive classification rule, which classifies according to the estimated a posteriori probability, yields misclassification errors of 5.35% and 7.29%, respectively. Unfortunately, a classification rule for the slope change is not derived. One could take advantage of the fact that the information on the past can be extended to as many past observations as one wishes, increasing the probability of detecting a slope change. In addition, we do not consider classification procedures other than linear discriminant analysis, although other procedures might yield better results than ours. For all computations in this work we used the software R (R Core Team, 2012).
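    The non-robust idea of choosing the Box-Cox parameter by maximizing the Shapiro-Wilk statistic of the transformed sample (the approach of Gaudard & Karson cited above) can be sketched in a few lines of R; the RSW-based estimator of the thesis additionally trims and replaces the tails before testing:

```r
## Minimal sketch: Box-Cox parameter chosen by maximizing the Shapiro-Wilk W statistic.
boxcox <- function(x, lambda) if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda

lambda_sw <- function(x, grid = seq(-2, 2, by = 0.01)) {
  w <- sapply(grid, function(l) shapiro.test(boxcox(x, l))$statistic)
  grid[which.max(w)]
}

set.seed(11)
x <- exp(rnorm(200, mean = 1, sd = 0.4))   # log-normal data: true lambda is 0
lambda_sw(x)                               # estimate near 0
```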
  • Item
    On nonparametric methods for robust jump preserving smoothing and trend detection
    (2012-10-10) Morell, Oliver; Fried, Roland; Müller, Christine H.
  • Item
    Robust modelling of count data
    (2012-03-30) Elsaied, Hanan Abdel kariem Abdel latif; Fried, Roland; Kuhnt, Sonja
    M-estimators, as modified versions of maximum likelihood estimators, and their asymptotic properties have played an important role in the development of modern robust statistics since the 1960s. In our thesis, we construct new M-estimators based on Tukey’s bisquare function to fit count data robustly. The Poisson distribution provides a standard framework for the analysis of this type of data. In the case of independent and identically distributed Poisson data, M-estimators based on the Huber and Tukey’s bisquare function are compared via simulations to already existing estimators implemented in R, both for clean data and for data with additive outliers. It turns out that it is difficult to combine high robustness against outliers and high efficiency under ideal conditions if the Poisson parameter is small, because such Poisson distributions are highly skewed. We suggest an alternative estimator based on adaptively trimmed means as a possible solution to this problem. Our simulation results indicate that a modified version of the R-function glmrob with external weights gives the best robustness properties among all estimation procedures based on the Huber function. A new modified Tukey M-estimator provides improvements over the other procedures which depend on the Tukey function and also over those which depend on the Huber function, particularly in the case of moderately large and very large outliers. The estimator based on adaptive trimming provides even better results at small Poisson means. Furthermore, our work constitutes a first treatment of robust M-estimation of INGARCH models for count time series. These models assume the observation at each point in time to follow a Poisson distribution conditionally on the past, with the conditional mean being a linear function of previous observations and past conditional means. We focus on the INGARCH(1,0) model as the simplest interesting variant. Our approach based on Tukey’s bisquare function with bias correction and initialization from a robust AR(1) fit provides good efficiencies in the case of clean data. In the presence of outliers, the bias-corrected Tukey M-estimators perform better than the uncorrected ones and the conditional maximum likelihood estimator. The construction of adequate Tukey M-estimators or the development of other robust estimators for INGARCH models of higher orders remains an open problem, albeit some preliminary investigations for the INGARCH(1,1) model are presented here. Some applications to real data from the medical field and artificial data examples indicate that the INGARCH(1,0) model is a promising candidate for such data, and that the issue of robust estimation tackled here is important.
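    A minimal R sketch of an M-estimator of a Poisson mean based on Tukey's bisquare psi-function applied to Pearson residuals; the Fisher-consistency correction discussed in the thesis is omitted here, so the estimate is only approximately consistent:

```r
## Tukey bisquare psi-function (standard tuning constant 4.685 assumed)
psi_tukey <- function(u, k = 4.685) ifelse(abs(u) <= k, u * (1 - (u / k)^2)^2, 0)

## Solve sum(psi((y - mu)/sqrt(mu))) = 0 for mu, searching over the central part of the data
tukey_pois <- function(y, k = 4.685) {
  score <- function(mu) sum(psi_tukey((y - mu) / sqrt(mu), k))
  uniroot(score, interval = unname(quantile(y, c(0.2, 0.9))))$root
}

set.seed(12)
y <- c(rpois(200, lambda = 5), rep(30, 5))   # clean data plus a patch of outliers
c(mean = mean(y), tukey = tukey_pois(y))     # the M-estimate stays near 5
```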
  • Item
    Elliptical graphical modelling
    (2011-01-10) Vogel, Daniel; Fried, R.; Müller, C. H.
  • Item
    Characterizing association parameters in genetic family-based association studies
    (2009-04-06) Böhringer, Stefan; Ickstadt, Katja; Kunert, Joachim