Reasoning about distributed relational data and query evaluation

Loading...
Thumbnail Image

Date

2019

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

Large data sets are often stored distributedly to increase the reliability of systems and the efficiency of query evaluation in them. While some query operators -- like selections and projections -- are intrinsically conform with parallel evaluation, others -- like joins -- demand specific distribution patterns. For relational databases, a common approach to evaluate queries in parallel relies on the use of rather simple distribution patterns for binary joins and the computation of the query result according to some query plan, operator by operator. Often, this requires the redistribution of large intermediate results (possibly larger than the input and/or output) and thus may lead to unnecessary long processing times. Thus, especially in the last decade, more elaborate distribution patterns that depend on the whole query have been studied and shown to allow more efficient query evaluation in several cases by reducing the amount of communication between servers. Ameloot et al. have described a setting where query evaluation is studied for a broad range of distribution patterns. Their work focuses on problems to decide whether a query can be evaluated correctly under a given distribution pattern. More particularly, they have considered two problems: "parallel correctness", where the pattern is specified explicitly, and "parallel-correctness transfer", where the pattern is known to be appropriate for another query. This thesis comprises the author's contributions to the complexity-theoretical investigation of these problems for conjunctive queries (and extensions thereof). These contributions complement the main characterisations and some additional complexity results by Ameloot et al. Furthermore, this thesis contains some new characterisations for "polarised" queries. Via the characterisations, parallel correctness and parallel-correctness transfer can be translated into questions on the co-occurrences of certain facts, induced by the query, on some server. Such questions and others can be modelled by "distribution dependencies", a variant of the well-known tuple- and equality-generating dependencies. Modelling via these constraints allows a more general description of distribution patterns in distributed relational data. The third contribution of this thesis is the study of the implication problem for distribution dependencies, providing lower and upper bounds for some fragments.

Description

Table of contents

Keywords

Distributed database, Query evaluation, Distribution policy, Parallel correctness, Parallel soundness, Parallel completeness, Parallel-correctness transfer, Containment, Implication, Reasoning, Static analysis

Citation