Approximation Techniques for Facility Location and Their Applications in Metric Embeddings

Dissertation for the degree of Doctor of Natural Sciences (Doktor der Naturwissenschaften) at Technische Universität Dortmund, Faculty of Computer Science (Fakultät für Informatik)

by Christiane Lammersen

Dortmund 2010

Date of the oral examination: 7 December 2010
Dean: Prof. Dr. Peter Buchholz
Reviewers: Prof. Dr. Christian Sohler, Prof. Dr. Friedhelm Meyer auf der Heide

Abstract

This thesis addresses the development of geometric approximation algorithms for huge datasets and is subdivided into two parts. The first part deals with algorithms for facility location problems, and the second part is concerned with the problem of computing compact representations of finite metric spaces.

Facility location problems are among the most studied problems in combinatorial optimization and operations research. In the facility location variants considered in this thesis, the input consists of a set of points where each point is a client as well as a potential location for a facility. Each client has to be served by a facility. However, connecting a client incurs connection costs, and opening or maintaining a facility causes so-called opening costs. The goal is to open a subset of the input points as facilities such that the total cost of the system is minimized.

We are particularly interested in facility location problems for large-scale distributed systems of mobile objects. In order to be able to analyze such complex systems, we examine the following partial aspects:

• First, we present a distributed algorithm that, in the case of uniform opening costs for the facilities and uniform demands of the clients, computes a constant-factor approximation for the metric facility location problem in only three communication rounds.
• In Chapter 4, we introduce a mobile facility location problem where the input points move continuously in a constant-dimensional Euclidean space.
In contrast to Chapter 3, we also take non-uniform opening costs for the facilities and non-uniform demands of the clients into account. We propose an event-driven data structure that efficiently maintains a subset of the mobile points as open facilities such that, at any time, the total cost of the system is at most a constant factor larger than the optimal facility location cost.
• In Chapter 5, we again consider a uniform facility location problem. This time, however, we develop a streaming algorithm where the input stream consists of insert and delete operations of points from a constant-dimensional Euclidean space. While reading the input stream, our algorithm carefully maintains a summary of the current point set so that the required space is polylogarithmic in the size of the input stream and, at any time, the algorithm can output a constant-factor approximation of the optimal facility location cost.
• In the next chapter, we give an efficient streaming implementation of a k-means clustering algorithm. The k-means clustering problem is closely related to the facility location problem. The goal is to place k facilities, the so-called cluster centers, such that the sum of the squared distances of the points to their nearest cluster center is minimized. Our algorithm is based on a coreset construction. A coreset is a small weighted point set that approximates the input point set with respect to the k-means clustering problem.

In the second part of this thesis, we study compact representations of finite metric spaces. Our representations have the property that a large fraction of all the pairwise distances between the points is almost preserved and only a small fraction of all the pairwise distances, the so-called slack, can be arbitrarily distorted. Constructions of such representations are an important tool in the analysis of huge datasets.
In Chapter 7, we apply some space-partitioning techniques from Chapter 5 to construct well-separated pair decompositions with slack for low-dimensional Euclidean spaces. We also show how to transfer this approach to doubling metric spaces.

Afterwards, we extend our techniques to obtain streaming algorithms that compute embeddings with a distortion of at most 1 + ε and with low slack for high-dimensional Euclidean spaces and doubling metric spaces. Furthermore, we investigate embeddings with low distortion and low slack for general metric spaces given as a data stream of points. In addition, we show how to use embedding techniques to get a (1 ± ε)-approximation algorithm for the high-dimensional Euclidean max-cut problem where the input stream consists of insert and delete operations of points. All of our streaming algorithms need only space that is polylogarithmic in the size of the input stream.

Zusammenfassung (Summary)

This dissertation is concerned with the development of geometric approximation algorithms for large datasets and is divided into two parts. The first part covers algorithms for various kinds of facility location problems, and the second part covers constructions of compact representations of finite metric spaces.

Facility location problems are among the most studied problems in combinatorial optimization and operations research. In the facility location variants we consider, the input is a set of points that represent both the locations of clients and the possible locations of facilities. Each client is to be served by a facility. Connection costs arise for the clients, and so-called opening costs arise for opening or maintaining the facilities in use. The goal is to open a subset of the possible facilities such that the total cost is minimized.

With regard to facility location problems, we are particularly interested in huge distributed systems of mobile locations. To be able to study such complex systems, we have examined the following partial aspects in detail:

• First, we present a distributed algorithm that, in the case of uniform opening costs for the facilities and uniform demands of the clients, outputs a constant-factor approximation for the metric facility location problem in only three communication rounds.
• In Chapter 4, we introduce a mobile facility location problem in which the input points move continuously in a low-dimensional Euclidean space. In contrast to Chapter 3, we now also take non-uniform opening costs for the facilities and non-uniform demands of the clients into account. We give an event-driven data structure that efficiently maintains a subset of the moving points as open facilities such that, at any time, the total cost of the system is at most a constant factor larger than the optimal total cost.
• In Chapter 5, we again consider a uniform facility location problem. This time, however, we develop an algorithm that can process data streams consisting of insert and delete operations of points in a low-dimensional Euclidean space. While reading the input, our data stream algorithm carefully maintains a summary of the current point set such that its space usage is polylogarithmic in the size of the input stream and, at any time, a constant-factor approximation of the optimal total cost of the facility location problem can be output.
• In the next chapter, we give an efficient data stream implementation of a k-means clustering algorithm. The k-means clustering problem is related to the facility location problem. The goal is to place k cluster centers such that the sum of the squared distances of the points to their respective nearest cluster center is minimized. Our algorithm is based on a new coreset construction. A coreset is a small weighted point set that approximates the input point set with respect to the k-means clustering problem.

In the second part of the dissertation, we deal with compact representations of finite metric spaces. Our representations have the property that they preserve a large fraction of the pairwise distances well and arbitrarily distort only a small fraction, the so-called slack. Constructions of such representations are an important tool in the analysis of large datasets.

In Chapter 7, we apply some space-partitioning techniques from Chapter 5 to construct well-separated pair decompositions with slack for low-dimensional Euclidean spaces. We also show how this approach can be transferred to doubling metrics.

Subsequently, we extend our techniques to obtain data stream algorithms that compute embeddings with a distortion of at most 1 + ε and low slack of high-dimensional Euclidean spaces and doubling metrics. Furthermore, we investigate embeddings with low distortion and low slack for general metrics that are given as a data stream of points. In addition, we show that embedding techniques can be used to obtain a (1 ± ε)-approximation algorithm for the high-dimensional Euclidean max-cut problem, where the input stream consists of insert and delete operations of points. All of our data stream algorithms require space that is only polylogarithmic in the size of the respective input stream.

Acknowledgments

First and foremost, I would like to thank my advisor, Prof. Dr. Christian Sohler, for giving me the opportunity to work with him and under his supervision.
I benefited in many ways from his great support. He integrated me into important research communities at an early stage. Even before I started my PhD studies, he gave me the opportunity to attend a summer school on data stream algorithms and a Dagstuhl seminar on sublinear algorithms. Throughout, his guidance was invaluable to me. Whenever I got stuck on a problem, I felt free to ask for his advice, which always led to helpful new ideas. It was also of great importance to me that he kept faith in my skills. Even when I thought that I would not be able to cope with a challenge, my advisor encouraged me to try, and it always worked out. In a nutshell, I greatly enjoyed working with him!

Special thanks also go to my co-advisor, Prof. Dr. Friedhelm Meyer auf der Heide. His comments and questions during seminar talks were really helpful in improving my thesis. I am also very pleased that he welcomed me so warmly each time I visited his research group in Paderborn.

The results in this thesis have emerged from collaborations with many smart people. I would like to express my sincere thanks to my co-authors Dr. Marcel Ackermann, Dr. Bastian Degener, Joachim Gehweiler, Marcus Märtens, Christoph Raupach, Dr. Anastasios Sidiropoulos, Prof. Dr. Christian Sohler, and Kamil Swierkot. It was a pleasure to collaborate with all of them.

During my PhD studies, I was a research assistant at Universität Paderborn, Universität Bonn, and Technische Universität Dortmund. This gave me the opportunity to become acquainted with many nice people and to make new friends. In particular, I would like to name Dr. Bastian Degener, Dominic Dumrauf, and Joachim Gehweiler. For being an amiable office mate for three years and a friend, I would like to thank Dr. Morteza Monemizadeh. His amazing knowledge of research results in the area of clustering and data stream algorithms has been very useful to me.
Furthermore, I am more than happy that I have found such a sympathetic and caring friend in Melanie Schmidt. The pleasant conversations with her always cheered me up. For many fruitful discussions and a nice working atmosphere, I would also like to thank Dr. Mohammad Ali Abam, Florian Berger, Antje Bertram, Prof. Dr. Beate Bollig, Dr. Olaf Bonorden, Dr. Gereon Frahling, Alexander Gilbers, Dr. André Gronemeier, Frank Hellweg, Prof. Dr. Rolf Klein, Mariele Knepper, Renate Kühn, Dr. Elmar Langetepe, Dr. Mario Mense, Rainer Penninger, Melanie Schmidt, Dr. Dirk Sudholt, Dr. Christian Thyssen, Tim Suess, Heinz-Georg Wassing, and Christine Zarges.

I would like to thank Prof. Dr. Stefano Leonardi for inviting me to a research visit at Sapienza University of Rome, Prof. Dr. Piotr Indyk for helpful discussions at the MADALGO Summer School on Data Stream Algorithms, and Prof. Dr. Sumit Ganguly for inviting me to the IITK Workshop on Algorithms for Processing Massive Data Sets.

Many thanks go to Dr. Mariano Zelke for carefully proofreading parts of my thesis and for giving helpful suggestions to improve its readability.

Certainly, I would never have managed to get so far without the support and encouragement of my friends and family. In particular, I would like to thank my brother Markus, who sparked my interest in computer science, my parents, Brunhilde and Klemens, to whom I owe most of what I am today, and Thomas Friebe, whom I can rely on in any situation and who enriches my life in such a wonderful way.

Contents

Notation and Terminology
1 Introduction
  1.1 Outline and Main Results
  1.2 Motivation
    1.2.1 Facility Location
    1.2.2 Clustering
    1.2.3 Compact Representations of Finite Metric Spaces
  1.3 Related Work
    1.3.1 Facility Location
    1.3.2 Compact Representations of Finite Metric Spaces
2 Preliminaries
  2.1 Distance Functions and Metric Spaces
  2.2 Facility Location Problems
  2.3 The Mettu-Plaxton Algorithm
  2.4 Computational Models
    2.4.1 Real Random Access Machine Model
    2.4.2 Synchronous Message Passing Model
    2.4.3 Kinetic Data Structures
    2.4.4 Data Stream Models
3 Facility Location in a Distributed Setting
  3.1 Preliminaries
    3.1.1 The Distributed Setting
    3.1.2 The Radii
  3.2 Distributed Algorithm for Metric Spaces
    3.2.1 Analysis of the Algorithm
  3.3 Distributed Algorithm for Powers of Metric Spaces
    3.3.1 Analysis of the Algorithm
4 A Kinetic Data Structure for Facility Location
  4.1 The Special Radii
    4.1.1 Definition of the Special Radii
    4.1.2 Computation of the Special Radii
    4.1.3 The Invariant
  4.2 The Kinetic Data Structure
    4.2.1 Initial Set of Open Facilities
    4.2.2 Event Queue
    4.2.3 Handling an Update
  4.3 Quality and Complexity of the Kinetic Data Structure
    4.3.1 Maintenance of the Invariant
    4.3.2 Complexity
5 Facility Location in Data Streams
  5.1 Definition of a Good Estimator
    5.1.1 Estimator for Special Cases
    5.1.2 Estimator Based on a Space Partition
    5.1.3 Properties of the Space Partition
    5.1.4 Analysis of the Estimator
  5.2 Randomized Algorithm
    5.2.1 Random Sampling
    5.2.2 Analysis of the Estimator
  5.3 Streaming Algorithm
    5.3.1 Analysis of the Estimator
6 A k-Means Implementation for Data Streams
  6.1 Preliminaries
    6.1.1 Definition of Euclidean k-Means Clusterings
    6.1.2 Definition of Coresets
    6.1.3 k-Means Clustering Algorithms
  6.2 Coreset Construction
  6.3 The Coreset Tree
    6.3.1 Definition of the Coreset Tree
    6.3.2 Construction of the Coreset Tree
    6.3.3 Extraction of the Coreset
  6.4 Streaming Algorithm
    6.4.1 The Merge-and-Reduce Technique
    6.4.2 Complexity
  6.5 Empirical Evaluation
    6.5.1 Datasets
    6.5.2 Parameters of the Algorithms
    6.5.3 Comparison of the Algorithms
7 Well-Separated Pair Decomposition with Slack
  7.1 Preliminaries
  7.2 Construction for Euclidean Metric Spaces
    7.2.1 Analysis of the Construction
  7.3 Construction for Doubling Metric Spaces
    7.3.1 Analysis of the Construction
8 Embeddings with Slack in Data Streams and Applications
  8.1 Preliminaries
  8.2 Embedding Euclidean Metric Spaces
    8.2.1 Low Dimensions
    8.2.2 High Dimensions
  8.3 Max-Cut in High Dimensions
  8.4 Embedding Doubling Metric Spaces
  8.5 Embedding General Metric Spaces
  8.6 Lower Bounds
9 Conclusions and Future Work
A Additional Tables for Chapter 6
  A.1 Parameters of Algorithm BIRCH
  A.2 Running Times of the Algorithms
  A.3 Clustering Cost of the Algorithms
  A.4 Standard Deviation of Running Time and Cost
B Mathematical Fundamentals
  B.1 Sequences, Series, and Inequalities
  B.2 Probability Theory
Bibliography

Notation and Terminology

This section provides an overview of the notation and terminology for the mathematical basics used throughout this thesis. More specialized notions are introduced gradually in the main chapters. For ease of reading, some of the definitions are repeated a few times.

∅    empty set
ℕ    set of the natural numbers {1, 2, 3, ...}
ℕ_0    set of the natural numbers including 0
[n]    set {0, 1, ..., n − 1}
ℝ    set of the reals
ℝ_{≥0}    set of the non-negative reals
[a, b]    closed interval of the reals x with a ≤ x ≤ b
(a ± ε)    closed interval of the reals x with a − ε ≤ x ≤ a + ε
(a, b)    open interval of the reals x with a < x < b
|A|    cardinality of set A
A ∪ B    union of sets A and B, i.e., {x | x ∈ A or x ∈ B}
A ∩ B    intersection of sets A and B, i.e., {x | x ∈ A and x ∈ B}
A \ B    difference set A minus B, i.e., {x | x ∈ A and x ∉ B}
A × B    Cartesian product of sets A and B, i.e., {(x, y) | x ∈ A and y ∈ B}
A^n    set of n-dimensional column vectors with entries from set A
A^{n×m}    set of (n × m)-matrices with entries from set A
M = (X, D)    metric space M, where X is a non-empty set of elements and D : X × X → ℝ_{≥0} is a distance function defined on X
D(x, y)    distance between x and y in some metric space
diam(X)    diameter of set X in some metric space
B(x, r)    closed ball of radius r centered at point x ∈ X in some metric space M = (X, D), i.e., the set {y ∈ X | D(x, y) ≤ r}
spread of M    ratio of farthest pair distance in X to closest pair distance in X for some finite metric space M = (X, D) with D(x, y) ≠ 0 for all pairs (x, y) ∈ X × X, x ≠ y
ℝ^d    the d-dimensional Euclidean space
v^T    transpose of vector v with entries from ℝ
v^T · w    scalar product of column vectors v, w ∈ ℝ^d
‖v‖    Euclidean norm of column vector v ∈ ℝ^d
G = (V, E)    graph G with vertex set V and edge set E
e    Euler's number 2.7182...
exp(x)    Euler's number to the power of x, i.e., the value e^x
ln(x)    natural logarithm of x, i.e., logarithm of x to the base e
log_b(x)    logarithm of x to the base b
log_b^k(x)    log_b(x) to the power of k, i.e., (log_b(x))^k
log(x)    binary logarithm of x, i.e., logarithm of x to the base 2
O(g)    {f : ℕ → ℝ_{≥0} | ∃n_0 ∈ ℕ ∃c > 0 ∀n ≥ n_0 : f(n) ≤ c · g(n)}
Ω(g)    {f : ℕ → ℝ_{≥0} | ∃n_0 ∈ ℕ ∃c > 0 ∀n ≥ n_0 : 0 ≤ c · g(n) ≤ f(n)}
Θ(g)    {f : ℕ → ℝ_{≥0} | f ∈ O(g) and f ∈ Ω(g)}
o(g)    {f : ℕ → ℝ_{≥0} | ∀c > 0 ∃n_0 ∈ ℕ ∀n ≥ n_0 : f(n) < c · g(n)}
ω(g)    {f : ℕ → ℝ_{≥0} | ∀c > 0 ∃n_0 ∈ ℕ ∀n ≥ n_0 : 0 ≤ c · g(n) < f(n)}
poly(g)    {f : ℕ → ℝ_{≥0} | ∃n_0 ∈ ℕ ∃c ≥ 1 ∀n ≥ n_0 : f(n) ≤ (g(n))^c}
polylog(g)    {f : ℕ → ℝ_{≥0} | ∃n_0 ∈ ℕ ∃c ≥ 1 ∀n ≥ n_0 : f(n) ≤ log^c(g(n))}
Õ(g)    {f : ℕ → ℝ_{≥0} | f ∈ O(g · polylog(g))}
g^H    {g^h | h ∈ H} for some g : ℕ → ℝ_{≥0} and H ⊂ {f | f : ℕ → ℝ_{≥0}}
g · H    {g · h | h ∈ H} for some g : ℕ → ℝ_{≥0} and H ⊂ {f | f : ℕ → ℝ_{≥0}}
g + H    {g + h | h ∈ H} for some g : ℕ → ℝ_{≥0} and H ⊂ {f | f : ℕ → ℝ_{≥0}}
min_{x∈A} f(x)    value f(x) with x ∈ A and f(x) ≤ f(y) for all y ∈ A
max_{x∈A} f(x)    value f(x) with x ∈ A and f(x) ≥ f(y) for all y ∈ A
⌈x⌉    smallest integer n with n ≥ x
⌊x⌋    largest integer n with n ≤ x
n!    factorial of the natural number n, i.e., the value n · (n − 1) · ... · 2 · 1
(n choose k)    binomial coefficient n over k, i.e., the value n!/((n − k)! · k!)
Pr[A]    probability of the event A
E[Z]    expectation of the random variable Z
V[Z]    variance of the random variable Z

1 Introduction

Facility location problems are among the most studied problems in operations research and combinatorial optimization. In its classical interpretation, the goal of facility location is to find optimal places for industrial facilities (e.g., restaurants, factories, or supermarkets) such that a combination of the building and maintenance costs for the facilities and the transportation costs for the clients is minimized. However, facility location problems also have applications in many other scenarios.
As a result, various types of facility location problems have been investigated to date.

In the facility location variants considered in this thesis, the input consists of a set of points where each point is a client as well as a potential location for a facility. Each client has to be served by a facility. Here, it must be taken into account that, on the one hand, serving a client incurs connection costs and, on the other hand, opening or maintaining a facility causes so-called opening costs. The goal is to open a subset of the input points as facilities such that the total cost of the system is minimized. In general, each facility has its individual opening cost, and the connection cost of a client is proportional to its individual demand as well as to its distance to the nearest open facility. This requires, of course, a distance measure defined on the input points. One typical scenario is that the points are from a Euclidean space and the distance between points is the Euclidean distance. However, other distance measures are also conceivable. In radio networks, for instance, it could be interesting to consider powered Euclidean distances since the energy required for transmitting a message over a certain distance is somewhere between the square and the cube of the distance.

We are particularly interested in facility location problems for large-scale distributed systems of mobile objects. In such a system, each object is an autonomous computational entity that has its own local memory and that communicates with the other entities by message passing. Since such systems are too complex to analyze in their entirety, we examine several partial aspects of them. Applications of our scenario arise, for example, in mobile ad-hoc and sensor networks. In these networks, nodes move continuously and interact with each other.
Often, they are organized in a hierarchical way where the upper layer offers the lower layer a certain service.

Furthermore, we are interested in designing algorithms that are capable of clustering huge Euclidean point sets efficiently. As the clustering objective, we focus on k-means. Note that the k-means clustering problem is closely related to the facility location problem, which itself belongs to the clustering problems as well. In general, the goal of clustering is to partition a set of given objects into subsets, the so-called clusters, such that objects from the same cluster are similar to each other and objects from different clusters are dissimilar. In the k-means clustering problem, the input is a set of points with a distance measure defined on them, and the goal is to place k cluster centers such that the sum of the squared distances of the points to their nearest cluster center is minimized. For each cluster center, there exists one cluster containing all the points that are closer to this cluster center than to all the other cluster centers. One application of clustering is the compact representation of huge datasets. For instance, we could map each data item to a point in a Euclidean space and, after having clustered the resulting point set, represent each cluster by its cluster center.

The second part of this thesis concentrates exclusively on compact representations of huge n-point metric spaces. An n-point metric space is a pair M = (X, D) where X is a set of n points and D is a distance measure defined on X that is non-negative, symmetric, and satisfies the triangle inequality. Our goal is to compute a compact representation of M that fairly captures the pairwise distances of M but is structurally simpler than M and uses only sublinear space. To measure the quality of such a representation, we use the notion of low-distortion embeddings with slack, which is defined as follows.
An embedding from a metric space M = (X, D) into a target metric space M′ = (X′, D′) is a mapping ϕ : X → X′. We say ϕ contracts the distance between two points x and y in X by a factor of α ≥ 1 if the embedded distance D′(ϕ(x), ϕ(y)) of x and y is α-times shorter than the original distance D(x, y). Similarly, we say that ϕ expands the distance between x and y by a factor of β ≥ 1 if the embedded distance D′(ϕ(x), ϕ(y)) of x and y is β-times longer than the original distance D(x, y). Now, the distortion of ϕ is defined as the product of the maximum contraction and the maximum expansion of all the pairwise distances in X. Finally, we say that ϕ has distortion ρ ≥ 1 and slack σ with 0 < σ < 1 if, for a (1 − σ)-fraction of all the pairwise distances in X, the distortion is ρ. The remaining pairwise distances, i.e., the slack, can be arbitrarily distorted.

In this thesis, we study the problem of computing embeddings with low distortion and low slack of several n-point metric spaces that are given as a data stream. A data stream is a sequence of data items which can only be accessed in one sequential scan that reads the data items one by one. Besides, while reading and processing the data, an algorithm is only allowed to use space that is sublinear in the size of the input stream.

In the following section, we will give a detailed overview of the results presented in this thesis.

1.1 Outline and Main Results

Chapter 2

This chapter provides some preparation for the main chapters. We give formal definitions of the metric spaces and the facility location problems considered in this thesis. Afterwards, we present an existing facility location algorithm due to Mettu and Plaxton [87], which has played an important role in the design of two of our facility location algorithms. Furthermore, we introduce the computational models that have been used to develop our algorithms and to analyze them in terms of their complexity.
This includes the synchronous message passing model for algorithms working in a distributed setting, the kinetic data structure framework for algorithms working in a mobile setting, and data stream models.

Chapter 3

We begin our studies by investigating a special type of metric facility location problem in a distributed setting. In this problem, we assume that each point is a client as well as a potential location for a facility and that the opening costs for the facilities and the demands of the clients are uniform. We present a randomized distributed algorithm that computes with high constant probability a constant-factor approximation for this type of facility location problem. The algorithm uses three rounds of all-to-all communication with message sizes bounded by O(log(n)) bits, where n is the number of input points. In particular, we show how each point decides locally after the first communication round whether it opens a facility or not. The following two communication rounds are only required to connect the clients to their nearest open facility. In the last part of Chapter 3, we extend our distributed algorithm to constant powers of metric spaces. Here, we also obtain a constant-factor approximation algorithm that uses three rounds of all-to-all communication with message sizes bounded by O(log(n)) bits.

The results of Chapter 3 have been previously published in [J. Gehweiler, C. Lammersen, and C. Sohler. A distributed O(1)-approximation algorithm for the uniform facility location problem. In Proceedings of the 18th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '06), pages 237–243. Association for Computing Machinery, 2006.].

Chapter 4

We reuse some essential ideas of Chapter 3 to approach a mobile facility location problem. This time, we take non-uniform opening costs for the facilities and non-uniform demands for the clients into account.
We assume that each point moves along a known trajectory in a d-dimensional Euclidean space and that, at any time, each point is either an open facility or a client. The opening cost that arises for a facility persists during the entire time it is open. Analogously, a client permanently has to pay some cost for its connection to an open facility. This cost depends on the client's demand and its distance to the nearest open facility.

To approach the mobile facility location problem, we propose a deterministic kinetic data structure. This data structure maintains a subset of the moving points as open facilities such that, at any time, the sum of the opening cost for the open facilities and the connection cost for the clients is at most a constant factor larger than the current optimal cost. The space requirement of our data structure is O(n(log^d(n) + log(nR))), where n denotes the number of input points and R is a value depending on the cost and demand values of the input points. If each trajectory can be described by a bounded-degree polynomial, we process O(n^2 log^2(nR)) events, each requiring O(log^{d+1}(n) · log(nR)) time and O(log(nR)) status changes. To our knowledge, no kinetic data structures for facility location had been proposed prior to the work presented in Chapter 4.

Chapter 4 is based on [B. Degener, J. Gehweiler, and C. Lammersen. The kinetic facility location problem. In Proceedings of the 11th Scandinavian Workshop on Algorithm Theory (SWAT '08), pages 378–389. Springer, 2008.] and [B. Degener, J. Gehweiler, and C. Lammersen. Kinetic facility location. Algorithmica, 57(3):562–584, July 2010. By invitation for the special issue on selected papers from SWAT '08.].

Chapter 5

We continue our studies of facility location problems.
Similar to Chapter 3, we consider a variant in which each input point is a client as well as a potential location for a facility and in which the opening costs for the facilities and the demands of the clients are uniform. However, this time, the input points are given as a dynamic geometric data stream. That is, the input is a sequence of insert and delete operations of points from a discrete Euclidean space {1, ..., ∆}^d. We assume that the dimension d is a constant. We present a randomized algorithm that computes a constant-factor approximation for the cost of the uniform facility location problem over dynamic geometric data streams. Our streaming algorithm processes an insertion or deletion of a point in time polylogarithmic in ∆, requires space polylogarithmic in ∆, and has an error probability of less than 1/3. We remark that this error probability can be reduced by using a standard amplification technique.

The construction of our streaming algorithm is done in three steps. The first step is to define a certain partition of the input space and to relate this partition to the cost that is to be computed. In particular, we show that if we assign to each cell in this partition a weight that corresponds to the number of points inside the cell times the side length of the cell, the sum of these weights is a constant-factor approximation for the facility location cost. In the next step, we propose a randomized algorithm that utilizes the existence of such a space partition but does not consider streaming. Finally, we explain how our randomized algorithm can be transferred to the dynamic geometric data stream model.

The results from Chapter 5 can be found in [C. Lammersen and C. Sohler. Facility location in dynamic geometric data streams. In Proceedings of the 16th Annual European Symposium on Algorithms (ESA ’08), pages 660–671. Springer, 2008.].
Chapter 6

This chapter deals with an efficient implementation of a k-means clustering algorithm for data streams, which we call StreamKM++. The k-means clustering problem is closely related to the facility location problem. The goal is to place k facilities, the so-called cluster centers, such that the sum of the squared distances of the points to their nearest cluster center is minimized. Our algorithm computes a small weighted coreset of the data stream that approximates the input point set with respect to the k-means clustering problem. The problem is then solved by running the k-Means++ algorithm [9] on the coreset.

Algorithm StreamKM++ is based on two new techniques. First, we use an adaptive, non-uniform sampling approach similar to the k-Means++ seeding procedure to obtain small coresets from the data stream. This construction is rather easy to implement and, unlike other coreset constructions, its running time has only a small dependency on the dimensionality of the data. Second, we propose a new data structure, which we call coreset tree. The use of coreset trees significantly reduces the time needed for the adaptive, non-uniform sampling during our coreset construction.

To evaluate the performance of our algorithm, we compare it experimentally with two well-known streaming implementations: BIRCH [111] and StreamLS [52]. We show that if the first priority is the quality of the clustering, then StreamKM++ provides a good alternative to BIRCH and StreamLS. This applies particularly if the number of cluster centers is large. To show that the performance of our algorithm is competitive with classical non-streaming algorithms as well, we also compare it with the k-Means++ algorithm and Lloyd’s algorithm [39, 80, 82] on some datasets of small or moderate size.

We end our investigation of problems related to facility location with Chapter 6.
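The adaptive, non-uniform sampling mentioned above can be illustrated by a minimal sketch of D^2 sampling in the style of the k-Means++ seeding, together with the weighted k-means cost that a coreset is meant to approximate. This is not the StreamKM++ coreset construction itself and it does not use a coreset tree; the function names `sq_dist`, `kmeans_cost`, and `d2_seeding` are hypothetical and chosen only for this illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points, weights, centers):
    """Weighted k-means cost: sum of weight * squared distance to the nearest center."""
    return sum(w * min(sq_dist(p, c) for c in centers)
               for p, w in zip(points, weights))


def d2_seeding(points, k, rng):
    """Choose k centers from `points` by D^2 sampling (k-means++ style seeding).

    The first center is uniform at random; each further center is drawn with
    probability proportional to the squared distance to the nearest center so far.
    """
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest chosen center
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # every point coincides with a chosen center; fall back to uniform
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
        d2 = [min(old, sq_dist(p, centers[-1])) for p, old in zip(points, d2)]
    return centers


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    seeds = d2_seeding(pts, 2, random.Random(0))
    print(seeds, kmeans_cost(pts, [1, 1, 1, 1], seeds))
```

Points that are already close to a chosen center are unlikely to be chosen again, which is why the sample tends to spread over the whole point set.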
Algorithm StreamKM++ together with its experimental study has been previously published in [M. R. Ackermann, C. Lammersen, M. Märtens, C. Raupach, C. Sohler, and K. Swierkot. StreamKM++: A clustering algorithm for data streams. In Proceedings of the 12th Workshop on Algorithm Engineering and Experiments (ALENEX ’10), pages 173–187. Society for Industrial and Applied Mathematics, 2010. Invited to the special issue on selected papers from ALENEX ’10. Submitted to ACM Journal on Experimental Algorithmics.].

Chapter 7

In this chapter, we start to investigate another geometric problem. The problem under consideration is the computation of compact representations of finite metric spaces that capture the metric structure well. One option to tackle this problem is the construction of a well-separated pair decomposition (WSPD). An ε-WSPD of a point set P is a representation of P which gives the guarantee that all pairwise distances of P are (1 ± ε)-preserved, i.e., each pairwise distance is expanded by a factor of at most (1 + ε) or compressed by a factor of at most 1/(1 − ε). In order to enable our representation to have size sublinear in |P|, we relax this condition and introduce the notion of a WSPD with slack. An ε-WSPD with slack σ for P is a representation of P such that at least a (1 − σ)-fraction of all the pairwise distances of P are (1 ± ε)-preserved. The remaining pairwise distances of P can be arbitrarily distorted.

We show how to compute an ε-WSPD with slack σ for a set P consisting of n points from a low-dimensional Euclidean space R^d. The space requirement for this compact representation is O(log(∆)/(ε^d σ)), where ∆ is the spread of P. Recall that the spread of a finite metric space is defined as the ratio of the farthest-pair distance to the closest-pair distance occurring in the metric space. The techniques used by our algorithm are also applicable to doubling metric spaces.
For a doubling metric space with bounded dimension λ and spread ∆, our algorithm computes an ε-WSPD with slack σ whose space requirement is O(log^2(∆)/(ε^λ σ)).

The results of Chapter 7 can be found in [C. Lammersen, A. Sidiropoulos, and C. Sohler. Streaming embeddings with slack. In Proceedings of the 11th Algorithms and Data Structures Symposium (WADS ’09), pages 483–494. Springer, 2009.].

Chapter 8

Chapter 8 addresses the computation of compact representations of finite metric spaces in data stream models. We present randomized streaming algorithms that, given a stream of n points from a metric space M, compute an embedding of M into an n-point metric space M′ that has low distortion and low slack. Our algorithms use space polylogarithmic in n and ∆, where ∆ is the spread of the metric space. Within such space limitations, it is impossible to store the embedding explicitly. We bypass this obstacle by computing a compact representation of M′, without storing the actual mapping from M into M′.

Given a slack parameter σ and a precision parameter ε, our streaming algorithm computes for a set of points P from a high-dimensional Euclidean space a set of points P′ from a low-dimensional Euclidean space such that P embeds into P′ with distortion 1 + ε and slack σ. The algorithm uses the techniques presented in Chapter 7. Furthermore, based on results obtained in Chapter 7, we show how to compute embeddings with distortion 1 + ε and slack σ of metric spaces with bounded doubling dimension in the data stream model. For general metric spaces, we propose a streaming embedding of a metric space M into a metric space M′ with distortion O(1) and slack σ. We complement our upper bounds by proving that embedding general metric spaces with distortion less than 2 and slack less than 1/4 requires Ω(n/log(n) + log(log(∆))) bits of memory.
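The notion of slack used in the summaries of Chapters 7 and 8 can be made concrete with a small brute-force check. The sketch below assumes that a pairwise distance d counts as (1 ± ε)-preserved when its image d′ satisfies (1 − ε)·d ≤ d′ ≤ (1 + ε)·d; the function name `slack_of_embedding` is hypothetical, and the quadratic enumeration of all pairs is only meant for tiny examples, not for the streaming setting.

```python
import math
from itertools import combinations


def slack_of_embedding(points, images, eps):
    """Fraction of point pairs whose distance is NOT (1 +/- eps)-preserved.

    points[i] is mapped to images[i]; both are tuples of coordinates.
    """
    pairs = list(combinations(range(len(points)), 2))
    bad = 0
    for i, j in pairs:
        d = math.dist(points[i], points[j])
        d_img = math.dist(images[i], images[j])
        if not ((1 - eps) * d <= d_img <= (1 + eps) * d):
            bad += 1
    return bad / len(pairs)


if __name__ == "__main__":
    # Two tight pairs far apart; each pair is replaced by a single representative.
    pts = [(0.0,), (1.0,), (100.0,), (101.0,)]
    img = [(0.0,), (0.0,), (100.0,), (100.0,)]
    print(slack_of_embedding(pts, img, eps=0.02))
```

In this toy instance only the two short distances inside the collapsed pairs are distorted, while all long distances are preserved up to the factor 1 ± ε.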
In addition, we use an embedding to show that there is a randomized streaming algorithm that computes with high constant probability a (1 ± ε)-approximation of the max-cut problem for a dynamic geometric data stream of high-dimensional Euclidean points.

The results of Chapter 8 are based on [C. Lammersen, A. Sidiropoulos, and C. Sohler. Streaming embeddings with slack. In Proceedings of the 11th Algorithms and Data Structures Symposium (WADS ’09), pages 483–494. Springer, 2009.].

Chapter 9

We conclude the thesis with Chapter 9, where we summarize our main results and also give some suggestions for future work.

Appendix A

This appendix contains some additional tables with detailed information about the experiments that we have run with our streaming implementation StreamKM++.

Appendix B

This appendix addresses some mathematical fundamentals which are assumed to be common knowledge throughout this thesis. The facts are stated for reference purposes, without giving any proofs.

1.2 Motivation

There are some relations between facility location, clustering, and computations of compact representations of finite metric spaces. Therefore, problems in these areas can be motivated in a similar way and have similar application scenarios. In the following, we will discuss this in detail for the problems considered in these areas.

1.2.1 Facility Location

Facility location problems capture a large variety of application scenarios in which we have to allocate resources to satisfy some requirement as well as possible while at the same time we have to pay some cost for the used resources. Therefore, it is not surprising that these kinds of problems belong to the most studied problems in combinatorial optimization and operations research.

In this thesis, we are particularly interested in facility location problems for large-scale distributed systems of mobile objects.
In such a system, each object is an autonomous computational entity that has its own local memory and that communicates with the other entities by message passing. Applications of our scenario are, for example, in battery-powered wireless ad-hoc and sensor networks. These networks are governed by tight energy constraints. Often, the nodes are organized in energy-efficient clusters where some selected cluster heads offer a certain service to the subset of nodes contained in their cluster. Imagine, for example, we have a sensor network containing hundreds or thousands of homogeneous nodes, and the task of the nodes is to periodically send their sensed data to a distant base station where the end-user can access the data. Then, a service of a cluster head could be aggregating the data of its cluster nodes into a small set of meaningful information and taking on the transmission of the aggregated data to the base station. Each node can act as a cluster head, but, at any time, each node that is set up or maintained as a cluster head has an additional energy consumption due to transmitting data to the distant base station, staying in ready-to-receive mode, etc. The energy consumption of a non-serving cluster node is determined by its demand and the transmission power needed to reach a cluster head. The longer the distance to the nearest cluster head is, the higher the transmission power has to be. As a result, we have to find a good trade-off between having a small number of cluster heads and keeping the energy needed for transmissions to the cluster heads low. The situation may be further aggravated if the nodes move continuously. Now, imagine that, to keep the total energy consumption of the system low, nodes are allowed to change their status from serving cluster head to demanding cluster node or vice versa. This scenario can be modeled as a facility location problem for a distributed system of mobile network nodes.
Since systems like the one described above are too complex to be analyzed in their entirety, we examine the following partial aspects of them:

Distributed Setting: We study the metric facility location problem with uniform opening costs and demands for large-scale distributed systems of static objects.

Kinetic Setting: Here, we are interested in the Euclidean facility location problem with non-uniform opening costs and demands for large-scale systems of mobile objects.

Dynamic Data Streams: We investigate the problem of computing the cost for the Euclidean facility location problem with uniform opening costs and demands for large-scale systems of mobile objects where the movement of the objects is given as a dynamic data stream of update operations.

Distributed Setting

Typically, no node in a large-scale ad-hoc or sensor network knows all the distance information that is needed to solve or even approximate the facility location problem. Besides, it is not possible to gather this information at a single node since this would cause much communication between the nodes, which in turn would result in a high energy consumption. In such a scenario, a distributed algorithm is required.

The distributed algorithm presented in Chapter 3 computes a constant-factor approximation of the metric facility location problem using only three rounds of communication where the message size is logarithmic in the total number of nodes.

Kinetic Setting

Maintaining clusters in a large-scale network of mobile nodes is a challenging task. Good facility location algorithms should ensure a trade-off between the quality of the solution at any given point of time and its stability and efficiency under motion since each status change from demanding cluster node to serving cluster head incurs some costs.
Motivated by the fact that the kinetic data structure (KDS) framework is common in the field of computational geometry and well-suited to maintain a combinatorial structure of continuously moving objects efficiently [2, 15, 54], we will develop a KDS for a mobile facility location problem in Chapter 4. Surprisingly, prior to our work, no KDS for the facility location problem was known. Besides, it does not seem that the only known (1 + ε)-approximation algorithms [8, 76] can be translated to the kinetic setting. This is because the authors used the Arora-scheme [7] including dynamic programming techniques, which do not lend themselves well to kinetization. So, at this stage, the best we can hope for is to construct a KDS maintaining a constant-factor approximation for the facility location problem, which we will do in Chapter 4.

Dynamic Data Streams

In battery-powered wireless ad-hoc and sensor networks of mobile nodes, it is common to communicate new positions of network nodes in the form of a stream of update operations. Such an update may, for example, specify the ‘name’ of the network node, its ‘old position’ and its ‘new position’. Thus, we can also think of it as a deletion of the node from its old position followed by an insertion of the same node at its new position.

The model of dynamic geometric data streams addresses such a scenario. We are given a stream of insert and delete operations of points from a discrete Euclidean space {1, ..., ∆}^d, and our goal is to maintain some information about the aggregated data. The difficulty is that the size of the processed data prevents us from storing it completely. This restriction is modeled by allowing only space polylogarithmic in ∆ and the length of the stream.

Since, in the facility location problem, the number of open facilities can be as large as the considered point set (and this can be as large as ∆^d), we cannot compute a solution in the dynamic data stream model.
Instead, we focus on approximating the cost of a solution. We remark that monitoring the cost can be very useful in resource allocation problems. For instance, it is often too costly to maintain, at all times, a nearly optimal set of open facilities in a distributed way. Instead, we can keep the same set of open facilities for some period of time but maintain, at all times, a good estimate of the optimal facility location cost. Then, we recompute a new set of open facilities by running a distributed algorithm as soon as the current cost estimate differs too much from the cost that we had after the latest run of the distributed algorithm.

Chapter 5 deals with an algorithm that computes a constant-factor approximation of the cost for the uniform facility location problem over dynamic geometric data streams.

1.2.2 Clustering

Clustering is the problem of partitioning a given set of objects into clusters such that objects from the same cluster are similar to each other and objects from different clusters are dissimilar. The goal is to simplify data by replacing a cluster by one or a few representatives, classify objects into groups of similar objects, or find patterns in some given data. Accordingly, clustering has applications in various areas, including data mining, database systems, data compression, and machine learning. In many of these applications, the data occurs in the form of data streams or is stored on hard disks. This means that streaming access is orders of magnitude faster than random access or is even the only possible access to the data. To sum up, clustering algorithms for data streams are basic tools in the analysis of huge datasets.

One of the most widely used clustering algorithms is Lloyd’s algorithm (sometimes also called the k-means algorithm) [39, 80, 82].
This algorithm is based on two observations: (1) Given a fixed set of centers, we obtain the best clustering by assigning each point to the nearest center and (2) given a cluster, the best center of the cluster is the center of gravity (i.e., the mean) of its points. Lloyd’s algorithm applies these two local optimization steps repeatedly to the current solution, until no more improvement is possible. It is known that the algorithm converges to a local optimum [100], but no approximation guarantee can be given [73].

Recently, Arthur and Vassilvitskii [9] developed the k-Means++ algorithm, which is a seeding procedure for Lloyd’s k-means algorithm that guarantees a solution of a certain quality and gives good experimental results. An advantage of this algorithm is that it also works well for high-dimensional datasets. However, a disadvantage is that, like Lloyd’s algorithm, the k-Means++ algorithm needs random access to the data items and, thus, is not suitable for data streams.

In Chapter 6, we will present a clustering algorithm for data streams that is based on the idea of the k-Means++ seeding procedure. Our algorithm utilizes the performance of the k-Means++ seeding on high-dimensional data but avoids random access to the data items.

1.2.3 Compact Representations of Finite Metric Spaces

Compact representations of finite metric spaces that fairly capture the pairwise distances and use only sublinear space are an important tool in the analysis of huge point sets since they can be stored in small space and much information about the point sets can be obtained from the corresponding pairwise distances. One option to obtain such a representation is the construction of a WSPD. Unfortunately, unless the input metric space is very simple (e.g., a multiset of points with many duplicates), one cannot find a sublinear space representation which preserves all pairwise distances.
It is not even possible to guarantee that all pairwise distances are preserved up to any fixed factor. Simply put, it is unavoidable that we lose some distances in the sense that they can get arbitrarily distorted. Accordingly, we extend the classic notion of WSPD to WSPD with slack where the slack quantifies the fraction of pairwise distances that are not well preserved.

In Chapter 7, we will show how to construct a WSPD with low slack for low-dimensional Euclidean spaces. Our construction is based on techniques that compute small summaries for point clouds consisting of a certain number of closely located points. Due to this construction, for several points, the distances to points in their immediate vicinity can be arbitrarily distorted. However, the distances to all the other points, which are further away, are well preserved. Therefore, problems related to finding furthest neighbors in large point sets can be efficiently approximated by our compact representation. One such problem that has recently been considered is the computation of reverse furthest neighbors [110]. Given a huge set of points P, a small or moderate query set Q, and a query point q ∈ Q, the task is to find all the points in P with the property that q is their furthest neighbor among all points in Q. One application for this problem could be the placement of an obnoxious industrial facility. Given a set P of residential sites and a set Q of potential locations for building such an obnoxious facility, one reasonable strategy is to select a location in Q that is furthest away from as many residential sites as possible, i.e., the location with the largest set of reverse furthest neighbors in P. The reader is referred to [110] for more examples in this area.

Another application for our compact representation is hierarchical clustering.
For instance, given a Euclidean point set P, the complete linkage clustering (or the furthest neighbor method) starts with |P| singleton clusters and successively merges the two nearest clusters where the distance between two clusters is defined as the distance between the two furthest objects in the two clusters. The merge step is repeated until the number of clusters corresponds to the desired number of clusters. In this scenario, our compact representation can be seen as a certain stage at which we have already performed a sequence of several merge steps. Thus, by applying the clustering method on the representation, we get a good approximation of the clustering for P. This is especially useful when the true clusters of P are compact and roughly equal in size.

In many applications, the datasets are high-dimensional and given in the form of a data stream. Examples of such datasets include the web graph, Internet traffic logs, clickstreams, and genome data. To analyze such datasets, streaming algorithms that embed a set of high-dimensional points with low distortion and low slack into a low-dimensional space can be of particular interest. Besides the fact that the embedded point set can be stored in small space, another benefit is that it might be useful to detect some structure in the original data more easily. For example, assume that we map the input data to the Euclidean plane or to R^3. Then, it is much simpler for the human visual system to detect structure in this data, such as tight clusters or isolated points. Chapter 8 addresses the development of streaming algorithms for computing embeddings with low distortion and low slack of high-dimensional Euclidean spaces, doubling metric spaces with low doubling dimension, and general metric spaces.

1.3 Related Work

Facility location variants as well as techniques to compute compact representations of metric spaces have been studied extensively in computer science.
It goes beyond the scope of this thesis to give a comprehensive overview of the vast available literature. In the following, we will focus our summary of the work that has emerged in both areas on the results which are most relevant to this thesis.

1.3.1 Facility Location

One of the most studied facility location problems is the problem which we refer to as the general facility location problem. Compared to the variants considered in this thesis, in the general facility location problem, only a subset of the input points are potential facility locations. More precisely, we are given a set of facilities F and a set of clients C. With each facility x_i ∈ F, a non-negative opening cost f_i is associated. Furthermore, there is a distance measure D defined on the input points, and, for each facility-client pair (x_i, y_j) ∈ F × C, the distance D(x_i, y_j) specifies the cost for connecting y_j to x_i. The goal is to open a subset of the facilities in F and to connect each client to an open facility so as to minimize the sum of the opening costs of the open facilities and the connection costs of the clients in C.

Note that, in the literature, the general facility location problem is also often called uncapacitated facility location problem. This indicates that each facility can serve an unlimited number of clients, whereas, in capacitated versions of the problem, each facility can serve only a certain limited number of clients. Since we only consider uncapacitated facility location problems in this thesis, we omit the attribute ‘uncapacitated’.

An instance of the general facility location problem is said to be metric if the distance measure D is non-negative, symmetric, and satisfies the triangle inequality. The general metric facility location problem is known to be NP-hard. The first polynomial-time constant-factor approximation algorithm for this problem was given by Shmoys et al. [102].
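For reference, the objective of the general facility location problem defined above can be written compactly as follows. The symbol S for the set of opened facilities is notation introduced here only for illustration, and the calligraphic letters stand for the given sets of facilities and clients.

```latex
\[
  \operatorname{cost}(S)
  \;=\; \sum_{x_i \in S} f_i
  \;+\; \sum_{y_j \in \mathcal{C}} \, \min_{x_i \in S} D(x_i, y_j),
  \qquad \emptyset \neq S \subseteq \mathcal{F}.
\]
```

The first sum is the total opening cost and the second sum is the total connection cost when every client is connected to its nearest open facility; the problem asks for a set S that minimizes cost(S).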
Later, several other polynomial-time constant-factor approximations have been proposed [12, 18, 27, 30, 51, 68, 69, 70, 77, 83, 87, 104]. These algorithms can roughly be grouped into algorithms using mainly linear programming (LP) rounding techniques, primal-dual methods, local search strategies, greedy strategies, or combinations of these techniques. The approximation algorithm of Shmoys et al. [102] relies on the LP rounding technique due to Lin and Vitter [79]. An LP rounding algorithm proceeds in two steps: The first step is to solve the linear relaxation of an integer programming formulation of the considered problem, and the second step is to round the obtained fractional LP solution to an integer solution. In this way, the algorithm of Shmoys et al. [102] achieves an approximation ratio of 3.16. Guha and Khuller [51] improved the LP rounding algorithm of Shmoys et al. [102] and combined it with a simple local search phase. Starting with the solution obtained from the LP rounding, in the local search phase, the amount of cost that is saved by opening a closed facility is computed for each closed facility. While there exists a facility whose amount of saved cost is positive, the facility that maximizes the decrease of the total facility location cost is opened. This combination of LP rounding and local search achieves an approximation ratio of 2.41. Chudak and Shmoys [30] developed a 1.736-approximation algorithm by improving the algorithm of Shmoys et al. [102]. The key elements to their improvement are a new rounding procedure for the LP relaxation of the facility location problem and the use of information about the dual linear program to the LP relaxation of the problem. A further improvement of the LP rounding algorithm has been proposed by Sviridenko [104] resulting in an approximation ratio of 1.582. Korupolu et al. [77] proposed a local search algorithm for the general metric facility location problem. 
In general, a local search algorithm starts with a feasible solution to the considered problem and iteratively applies a local improvement step in which minor modifications are made in order to obtain a solution of lower cost. The local improvement step presented in [77] searches for one facility or a pair of one open and one closed facility such that changing the status of the involved facilities decreases the total facility location cost. The algorithm of Korupolu et al. [77] yields an approximation ratio of 5 + ε for any constant ε > 0, and, according to [27], it can be implemented to run in O(n^4 · log(n/ε)) time. By using other techniques in the analysis, Arya et al. [12] proved that the local search algorithm of Korupolu et al. [77] actually achieves an approximation ratio of 3 + ε.

One of the most elegant approximation algorithms for the general metric facility location problem is the primal-dual method developed by Jain and Vazirani [70]. In general, the goal of a primal-dual method is to simultaneously compute a feasible integer solution for the original problem as well as a feasible solution to the dual linear program to its LP relaxation. If the considered problem is a minimization problem, like the general facility location problem, the cost of a feasible solution to the dual LP can be used as a lower bound on the optimal cost. Jain and Vazirani [70] proved that their primal-dual method for facility location is a 3-approximation algorithm that can be implemented to run in O(n^2 · log(n)) time.

Later, Jain et al. [68] used a primal-dual method in the analysis of two greedy algorithms. Their first greedy algorithm iteratively opens among all currently closed facilities the facility that minimizes, for any subset U of all currently unconnected clients, the ratio of the opening costs of the particular facility plus the connection cost of U to the size of U.
This algorithm achieves an approximation ratio of 1.861 and has a running time of O(n^2 · log(n)). The second greedy algorithm was obtained from the first one by small modifications resulting in an improved approximation ratio of 1.61 and a higher running time of O(n^3). Furthermore, Mettu and Plaxton [87] presented a greedy algorithm for the facility location variant with C = F which implicitly uses the primal-dual method of Jain and Vazirani [70]. This is done by defining so-called ‘radii’ for amortizing the cost needed to open a facility at a particular location. The algorithm opens iteratively the facility x_i ∈ F with the smallest radius that has no other open facility in the ball whose center is x_i and whose radius is twice the radius of x_i. The algorithm yields an approximation ratio of 3 and has a running time of O(n^2). Finally, the second-best approximation algorithm for the general facility location problem is based on the algorithm of Jain et al. [68]. This algorithm also involves the primal-dual method. It achieves an approximation ratio of 1.52 and has a running time of Õ(n^2) [83].

Another algorithm for the general facility location problem has been developed by Charikar and Guha [27]. This algorithm is a 1.728-approximation algorithm that combines the primal-dual method of Jain and Vazirani [70], a modified version of the local search technique presented by Guha and Khuller [51], and the LP rounding algorithm of Chudak and Shmoys [30]. Furthermore, Byrka and Aardal [18] proposed an algorithm that uses a modified version of the algorithm of Chudak and Shmoys [30] and combines this with the algorithm of Jain et al. [68]. The algorithm yields the best approximation ratio up to now, which is 1.5.

Concerning hardness results, Guha and Khuller [51] proved by a reduction from set cover that the general metric facility location problem of n input points cannot be approximated in polynomial time within a factor of 1.463, unless NP ⊆ DTIME[n^{O(log(log(n)))}].
Combining this result with an observation of Sviridenko implies that the approximation lower bound of 1.463 also holds, unless P = NP (see [108]). Furthermore, Thorup [106] proved that any constant-factor approximation algorithm, even a randomized one, requires Ω(n^2) time to compute a solution to the general metric facility location problem. Bădoiu et al. [14] extended this hardness result by showing that any bounded-factor approximation algorithm, even a randomized one, requires Ω(n^2) time to compute the cost of the general metric facility location problem. This result holds even for the variant with uniform opening costs.

However, for some special variants of the metric facility location problem, the hardness results mentioned above are no longer valid. For instance, Bădoiu et al. [14] considered the metric facility location problem with uniform opening costs in which every point can open a facility. In a first step, they proved that the sum of the radii defined by Mettu and Plaxton [87] is an estimator that approximates the optimal facility location cost to within a constant factor. In a second step, they showed how to obtain a constant-factor approximation of this estimator by using an adaptive sampling approach. This resulted in an algorithm for the considered variant of the metric facility location problem that computes a constant-factor approximation of the cost in O(n · log^2(n)) time. Furthermore, the non-approximability result of Guha and Khuller [51] is no longer valid in the special case of Euclidean spaces. A first randomized polynomial-time approximation scheme for the general Euclidean facility location problem in the plane has been developed by Arora et al. [8]. This algorithm is based on the Arora-scheme [7] and computes a (1 + ε)-approximation in O(n^{1+O(1/ε)} log(n)) time. The result of Arora et al. [8] was then improved by Kolliopoulos and Rao [76].
Assuming that there exists any polynomial-factor approximation for the total connection cost, they obtained a randomized polynomial-time approximation scheme that works in any constant-dimensional Euclidean space and has a running time of $O(2^{O((\log(1/\varepsilon)/\varepsilon)^{d-1})} \, n \log^{d+6}(n))$. For a more comprehensive overview of results on facility location problems in a classical setting, we refer the reader to the surveys by Shmoys [101] and Vygen [108].

The facility location problem has also been investigated in other settings. In the following, we will summarize the results obtained in distributed and kinetic settings as well as in data stream models.

Distributed Setting

Surprisingly, the first algorithm for a distributed facility location problem was proposed just a few years ago [90]. Given a set of $m$ facilities and a set of $n$ clients, Moscibroda and Wattenhofer [90] investigated the general non-metric facility location problem (i.e., distances do not have to satisfy the triangle inequality) in a synchronous message passing model. In their model, the communication network is a complete bipartite graph with communication links between each facility-client pair, and in each communication round each node can send a message containing $O(\log(n))$ bits to each neighbor in the communication network. To approach the distributed facility location problem, Moscibroda and Wattenhofer used some ideas from the centralized primal-dual method of Jain and Vazirani [70]. The obtained distributed primal-dual method provides a trade-off between the number of communication rounds and the resulting approximation ratio. In particular, it achieves an $O(\sqrt{k} \, (m\varrho)^{1/\sqrt{k}} \log(m + n))$-approximation in $O(k)$ communication rounds with a message size of $O(\log(n))$ bits. Here, $\varrho$ is a coefficient that depends on the cost values of the input instance.
In Chapter 3, we consider the metric facility location variant with uniform opening costs for the facilities and $X := \mathcal{C} = \mathcal{F}$ in the synchronous message passing model where the communication network is a clique. Compared to the problem studied in [90], our problem is much simpler, and so the algorithm presented in Chapter 3 is incomparable with the algorithm of Moscibroda and Wattenhofer. We developed our randomized distributed algorithm based on results from Mettu and Plaxton [87] and Bădoiu et al. [14]. As mentioned before, Bădoiu et al. [14] proved that the sum of the radii defined by Mettu and Plaxton [87] is a constant-factor approximation of the optimal facility location cost. Furthermore, for any facility $x_i \in X$, they gave a lower bound on the number of points located in the ball whose center is $x_i$ and whose radius equals the radius of $x_i$. Using this lower bound, we designed our randomized distributed algorithm in a way that it opens a subset of the potential facilities such that, with high constant probability, the total opening cost is at most a constant factor larger than the sum of the radii and each client $x_i$ has an open facility in a ball whose center is $x_i$ and whose radius is at most a constant factor larger than the radius of $x_i$. Hence, our algorithm computes with high constant probability a constant-factor approximation for $X$.

In a follow-up study on the results of Moscibroda and Wattenhofer [90] and the results presented in Chapter 3, Pandit and Pemmaraju [97] further investigated the metric version of the problem studied in [90]. Based on the primal-dual method of Jain and Vazirani [70] and a rapid randomized sparsification of graphs due to Gfeller and Vicari [49], they obtained a 7-approximation in $O(\log(m) + \log(n))$ communication rounds with a message size of $O(\log(m + n))$ bits.
This technique was then generalized to get an algorithm that, for each constant $k$, runs in $k$ communication rounds and computes a solution whose cost is only a factor of $O(m^{2/\sqrt{k}} \cdot n^{3/\sqrt{k}})$ larger than the optimal cost. We point out that the technique of Pandit and Pemmaraju can also be used to obtain a constant-factor approximation in $O(\log(n))$ communication rounds for the variant of our considered metric facility location problem where the opening costs of the facilities are non-uniform. Therefore, their result can be seen as a generalization of our result. For more information about distributed computing, the reader is referred to [81, 98].

Kinetic Setting

Some frameworks have been proposed for handling kinetic data. In this thesis, we consider a common model for processing points in motion, called kinetic data structures (KDS), which was introduced by Basch et al. [15]. Prior to the work presented in Chapter 4, no KDS for the facility location problem was known. However, some results have been obtained in the KDS framework for problems related to clustering, to which the facility location problem belongs. For instance, Gao et al. [46] provided a randomized KDS to maintain a set of centers among moving points in the plane such that, given a specified radius, all the points are covered by balls of the given radius centered at the chosen center points. Gao et al. showed that the size of the center set is at most a constant factor larger than the minimum one. Hershberger investigated a similar problem in [60]. More precisely, he proposed a deterministic KDS for maintaining a covering of moving points in $\mathbb{R}^d$ by unit boxes such that the number of boxes is always within a factor of $3^d$ of the optimal static covering at any instant. Another clustering problem that has been studied in the KDS framework is the kinetic $k$-center problem.
The goal of the kinetic $k$-center problem is to maintain a set of $k$ centers so as to minimize the maximum distance of any point to its closest center at any point of time. Gao et al. [47] proposed a deterministic KDS that maintains, for a set of moving points in $\mathbb{R}^d$, an 8-approximation of the discrete $k$-center problem, i.e., the centers have to be a subset of the moving input points. Bereg et al. [17] studied 1-center problems in which the center is not necessarily located at one of the moving input points. Among other results, they showed that, given a precision parameter $\varepsilon$, $0 < \varepsilon < 1$, there is a strategy for moving a center such that the location of this center provides a $(1 + \varepsilon)$-approximation of the 1-center problem for a set of moving points in the plane and, assuming each input point moves with velocity at most 1, the velocity of the center never exceeds $(2 + \varepsilon)(1 + \varepsilon)/\sqrt{2\varepsilon + \varepsilon^2}$. Furthermore, a KDS for the $k$-center problem in the context of outliers can be found in [45]. Har-Peled [56] investigated the $k$-center problem in a mobile setting different from the KDS framework. Instead of handling events, a static set which ensures a constant-factor approximation at all times is provided. However, the size of this set is $k^{\mu+1}$, where $\mu$ is the degree of the polynomial of the trajectories. Finally, we are aware of another result concerning clustering which addresses a randomized KDS for the Euclidean max-cut problem [31, 42]. For other work on KDSs, we refer the reader to the surveys by Guibas [54, 55].

Unfortunately, it does not seem that the only known $(1 + \varepsilon)$-approximation algorithms for facility location [8, 76] can be transferred to the kinetic setting, since they are based on the Arora scheme [7] including dynamic programming techniques, which do not lend themselves well to kinetization.
Our KDS for the mobile Euclidean facility location problem combines a modified version of the greedy algorithm of Mettu and Plaxton [87] with a counting argument of Bădoiu et al. [14]. Given any static Euclidean point set $P$, the original greedy algorithm opens as few facilities as possible in a way that each point $p_i \in P$ has at least one open facility in the ball with center $p_i$ and twice the radius of $p_i$. This results in a constant-factor approximation of the facility location problem for $P$. Concerning the radii defined by Mettu and Plaxton, the counting argument of Bădoiu et al. asserts that, for any facility $p_i \in P$, a constant-factor approximation of the radius of $p_i$ can be computed by just counting the number of points from $P$ contained in exponentially growing balls centered at $p_i$. This counting argument enables us to efficiently kinetize a modified version of the static greedy algorithm proposed by Mettu and Plaxton.

Data Streams

Although many geometric approximation algorithms have been developed in data stream models, we are only aware of three results concerning facility location problems. In [41], Fotakis presented a streaming algorithm for the metric facility location variant in which every input point is a potential facility location. The algorithm combines an online facility location algorithm due to Meyerson [88] with an incremental facility location algorithm due to Fotakis [40]. The course of the algorithm is controlled by so-called final distances. The final distance of a point is an upper bound on the distance of this point to the nearest facility at any future point of time. While reading the input stream, the next point is chosen as an open facility with a probability that is proportional to the ratio between the final distance of this point and the opening cost.
If a point is chosen as an open facility, it is stored in memory and replaces every currently stored facility whose distance to the new facility is at most a certain fraction of its final distance. In this way, the algorithm maintains a set of open facilities such that the total associated facility location cost is at most a constant factor larger than the optimal facility location cost. Unfortunately, both the update time and the space requirement of the algorithm are linear in the number of opened facilities, which can be linear in the input size.

Chang [26] developed a multi-pass streaming algorithm for the metric facility location variant in which every input point is a potential facility location. In contrast to the data stream models considered in this thesis, in the multi-pass streaming model, an algorithm is allowed to perform more than one sequential scan over the input data. During and after each such pass, the amount of available local memory space is assumed to be sublinear in the size of the input stream. To approach the facility location problem, Chang used an iterative algorithm that is based on a technique proposed by Indyk [61]. In each iteration, the algorithm takes a random sample from the input stream and computes a subset of open facilities by applying some known facility location algorithm on the sample set. Then, the algorithm removes all the points from consideration that are served sufficiently well and iterates on all the remaining points. This is repeated until all input points are served sufficiently well. Chang showed that his algorithm uses $O(\ell)$ passes and $\tilde{O}(k n^{2/\ell})$ space to compute a set of open facilities such that the total associated facility location cost is at most a factor of $O(\ell)$ larger than the optimal facility location cost. Here, $k$ is the number of open facilities and $n$ is the number of input points.
Thus, similar to Fotakis' streaming algorithm, there exist facility location instances for which the space requirement of Chang's algorithm is not sublinear in the input size. However, Chang justified his approach by proving that, for the considered facility location problem, any randomized $\ell$-pass streaming algorithm requires $\Omega(n/\ell)$ bits of memory to compute even a polynomial-factor approximation of the optimal facility location cost.

Prior to the result presented in Chapter 5, the only real streaming algorithm for facility location was proposed in [64], where the author introduced the model of dynamic geometric data streams and studied different geometric problems in this model. A dynamic geometric data stream is a sequence of insert and delete operations on a point set $P \subseteq \{1, \dots, \Delta\}^d$ in a discrete $d$-dimensional Euclidean space. In the facility location variant studied in [64], the opening costs are uniform and every point in $P$ can open a facility. For the purpose of guaranteeing a space requirement that is only polylogarithmic in the size of the input stream, Indyk developed an algorithm that approximates the optimal facility location cost for $P$ instead of an optimal set of open facilities. This is done by defining a certain partition of the space into nested square grids and a set of cells in this partition such that the number of these cells gives an $O(\log(\Delta))$-approximation of the optimal facility location cost. During the approximation process to estimate the number of these cells, the algorithm of [64] loses another factor of $O(\log(\Delta))$. (The author of [64] mentions that, with the help of a more intricate analysis, the approximation factor can be improved to $O(\log(\Delta))$.) In Chapter 5, we use a similar partition of the space into nested square grids as in [64], and we show that opening a subset of the cells defined in [64] results in a constant-factor approximation of the optimal facility location cost. This leads to a streaming algorithm that computes a constant-factor approximation of the cost for the facility location problem considered in [64], which strongly improves Indyk's result.

We point out that the approximation of the facility location cost was considered again in [14]. As mentioned at the beginning of this section, the authors of [14] proposed a sublinear-time algorithm that computes in $O(n \log^2(n))$ time a constant-factor approximation for the cost of the metric facility location variant in which every input point is a potential facility location and in which the opening costs for the facilities are uniform (footnote 2). Unfortunately, despite the relation of streaming and sublinear-time algorithms, the techniques cannot be transferred to the other model.

Note that the facility location problem in which every input point is a potential facility location and in which the opening costs for the facilities are uniform is closely related to the $k$-median and $k$-means clustering problems. In the $k$-median clustering problem, we are given a set of points and an integer $k$, and the goal is to determine a set of $k$ centers such that the sum of the distances from the input points to their corresponding nearest center is minimized. The cost function of the $k$-means clustering problem differs from that of the $k$-median clustering problem only in that we sum up the squared distances from the input points to their corresponding nearest center. For both clustering problems, a number of streaming algorithms have been developed [4, 28, 29, 36, 37, 44, 53, 57, 58]. Like the streaming algorithm presented in Chapter 6, many of these algorithms apply a merge-and-reduce technique based on a decomposition technique of Bentley and Saxe [16] to obtain a small coreset (see [2] for the introduction of the notion of coresets). Our coreset construction for the $k$-means clustering problem is based on the $k$-means++ seeding procedure [9].
We point out that the $k$-means++ seeding has also been investigated in [3] and [4]. However, our result differs from the results given in [3] and [4] and was obtained independently. In any case, all known algorithms for the $k$-median and $k$-means clustering problem require space $\Omega(k)$. Thus, they implicitly assume that $k$ is small, i.e., $k \in \mathrm{polylog}(\Delta)$ in dynamic data streams and $k \in \mathrm{polylog}(n)$ in insertion-only data streams, where $\Delta$ is the spread of the input points and $n$ is the length of the stream. As mentioned above, in facility location problems in which every input point is a potential facility location, the number of cluster centers $k$ can be as large as the maximum size of the point set under consideration. In Chapter 5, we will show that we can approximate the cost for such a facility location problem in space $o(k)$. No similar result is known for the $k$-median and $k$-means clustering problems. For other work in data stream models, we refer the reader to [66, 92, 93].

Footnote 2: Since the size of the representation of an $n$-point metric space is $\Theta(n^2)$, the complexity of this algorithm is sublinear with respect to the input size.

1.3.2 Compact Representations of Finite Metric Spaces

The compact representations of finite metric spaces considered in this thesis are well-separated pair decompositions with slack (WSPDs with slack) and metric embeddings with slack. Since we are not aware of any prior work on WSPDs with slack, the following overview deals with classical WSPDs. Afterwards, we will summarize the results obtained in the area of metric embeddings with slack.

WSPD

The notion of WSPD has been introduced in [22]. In the same paper, Callahan and Kosaraju showed that, for any set of $n$ points from any constant-dimensional Euclidean space and for any constant $\varepsilon$ with $0 < \varepsilon < 1$, there always exists an $\varepsilon$-WSPD consisting of $O(n)$ pairs, and such an $\varepsilon$-WSPD can be computed in $O(n \log(n))$ time.
The construction was later simplified by Har-Peled and Mendel [59], who observed that a WSPD can directly be generated from a compressed quadtree [22]. Also based on a compressed quadtree construction, Chan [25] showed that a WSPD for a Euclidean point set can be found in linear time if the spread of the point set is polynomially bounded in the size of the point set. Concerning dynamic point sets, Callahan [20] presented a deterministic algorithm and, later, Fischer and Har-Peled [38] presented a simpler randomized algorithm that maintains a WSPD for a constant-dimensional Euclidean point set in polylogarithmic time under insertions and deletions. In high dimensions, it is known that a WSPD can have quadratic complexity. One example is the uniform $n$-point metric (with all pairwise distances equal to 1), which can be realized as the vertices of a simplex in $\mathbb{R}^{n-1}$. Since WSPDs are useful data structures to represent distances between points efficiently, they have been applied for solving many proximity problems for point sets in a Euclidean space [10, 11, 19, 21, 22, 34, 50, 59, 78, 94].

Talwar [105] extended the notion of WSPD to spaces with low doubling dimension. He showed that, given any constant $\varepsilon$ with $0 < \varepsilon < 1$ and any $n$-point metric space with constant doubling dimension and spread $\Delta$, there always exists an $\varepsilon$-WSPD consisting of $O(n \log(\Delta))$ pairs. Furthermore, Gao and Zhang studied the construction of WSPDs for unit-disk graphs [48].

Metric Embedding with Slack

The theory of metric embeddings has received much attention in recent years, and embedding techniques have been applied in the development and analysis of many algorithms that operate on an underlying metric space. For recent work on metric embeddings, we refer the reader to the surveys [63, 67, 84]. In the following overview of prior work, we focus on the results that are related to metric embeddings with slack or that have been relevant in designing the algorithms presented in Chapter 8. Kleinberg et al.
[75] introduced the notion of embeddings with slack. Among other results, they showed that, for any constant $\sigma$ with $0 < \sigma < 1$, any metric space with bounded doubling dimension can be embedded with distortion $O(1)$ and slack $\sigma$ into a constant-dimensional Euclidean space. The results from [75] have been extended to arbitrary metric spaces and to embeddings under any $\ell_p$ norm, $p \geq 1$, by Chan et al. [23]. Furthermore, Abraham et al. [1] developed embeddings with low distortion and low slack for arbitrary metric spaces that additionally guarantee a constant average distortion.

Metric approximation with slack has also been investigated in the setting of graph spanners. Chan et al. [24] showed that, for any weighted graph $G$ and any $\varepsilon$ with $0 < \varepsilon < 1$, there exists a spanner of $G$ with a linear number of edges achieving stretch $O(\log(1/\varepsilon))$ and slack $\varepsilon$. The authors also gave a spanner construction which is the starting point of the embedding with slack of general metric spaces presented in Chapter 8. In order to transform this construction to the streaming model, we use a technique that has been applied by Czumaj and Sohler [32] to achieve 2-pass streaming algorithms for clustering problems. We point out that, in the same paper, Czumaj and Sohler [32] introduced the concept of $\alpha$-preserving metric embeddings, which is closely related to embeddings with slack. Their concept can be seen as a generalization of coresets. The goal is to embed a metric space into a structurally simpler metric space that approximates the original metric up to a factor of $\alpha$ with respect to a given optimization problem.

Embeddings of point sets into trees via a quadtree partitioning have been used by Indyk [64] to obtain approximation algorithms for several geometric problems. Also, Frahling and Sohler [44] applied a similar quadtree partitioning to get streaming algorithms for different clustering problems.
In Chapter 8, we use a similar partitioning technique to embed Euclidean metric spaces with low distortion and low slack.

2 Preliminaries

This chapter deals with definitions that are used throughout the whole thesis. More specialized definitions, which are only used to describe or analyze a certain algorithm, are introduced in the corresponding main chapters. Therefore, the first section of most of the main chapters is devoted to preliminaries, which include such specialized definitions.

In this thesis, we will develop approximation algorithms for geometric problems in various metric spaces. In particular, in Chapters 7 and 8, we will present algorithms for computing compact space representations of different types of finite metric spaces. Section 2.1 covers definitions of these metric spaces. A big part of this thesis is devoted to facility location problems. We will consider various facility location problems in different kinds of settings. More precisely, we will present facility location algorithms for distributed and mobile settings as well as for data streams. Formal definitions of the considered facility location problems are given in Section 2.2. Our facility location algorithms for distributed and mobile settings are based on the greedy algorithm of Mettu and Plaxton [87]. We will present this algorithm in Section 2.3. Finally, in Section 2.4, we will introduce the computational models that we have applied to develop our algorithms in the different kinds of settings.

2.1 Distance Functions and Metric Spaces

Metrics form an important class of distance functions. In this section, we will give a formal definition of general, Euclidean, and doubling metric spaces.

Distance Functions

Let $X$ be any non-empty set of elements. A function $D : X \times X \to \mathbb{R}$ is a distance function on $X$ if it satisfies the following axioms:
• Non-Negativity: For any $x, y \in X$, we have $D(x, y) \geq 0$.
• Symmetry: For any $x, y \in X$, we have $D(x, y) = D(y, x)$.
We generalize the definition of distance functions to sets. More precisely, for any finite set $X$ and any distance function $D$ on $X$, we define

$$\forall x \in X \; \forall Y \subseteq X : \quad D(x, Y) := \min_{y \in Y} D(x, y)$$

and

$$\forall Y \subseteq X \; \forall Z \subseteq X : \quad D(Y, Z) := \min_{y \in Y} D(y, Z) \, .$$

General Metric Spaces

A metric space $M$ is a pair $(X, D)$, where $X$ is a non-empty set of elements and $D$ is a distance function on $X$ that satisfies the following axioms:
• Reflexivity: For any $x, y \in X$, we have $D(x, y) = 0$ if and only if $x = y$.
• Triangle Inequality: For any $x, y, z \in X$, we have $D(x, z) \leq D(x, y) + D(y, z)$.

The complexities of several algorithms presented in this thesis depend on the spread of the given input metric space. For a finite metric space $M = (X, D)$ with $D(x, y) \neq 0$ for all pairs $(x, y) \in X \times X$ with $x \neq y$, the spread of $M$ is defined as the ratio of the farthest pair distance in $X$ to the closest pair distance in $X$.

Euclidean Metric Spaces

Our distance measure will often be the Euclidean distance. The Euclidean distance between two points is given by the Euclidean length of the difference vector of both points. More precisely, let $x := (x^{(1)}, x^{(2)}, \dots, x^{(d)})$ and $y := (y^{(1)}, y^{(2)}, \dots, y^{(d)})$ be any two points from the Euclidean space $\mathbb{R}^d$, where the dimension $d \in \mathbb{N}$ is any natural number. Then, the Euclidean distance between $x$ and $y$ is defined as

$$D(x, y) := \|x - y\| = \sqrt{\sum_{i=1}^{d} \left(x^{(i)} - y^{(i)}\right)^2} \, .$$

Since the Euclidean distance satisfies the condition of reflexivity and the triangle inequality, the pair $(\mathbb{R}^d, D)$ is a metric space.

Doubling Metric Spaces

A metric space $M = (X, D)$ is called a doubling metric space if there exists some $\lambda \in \mathbb{N}$ such that each ball with any radius $r$ centered at any point in $X$ can be covered by $2^\lambda$ balls, each of radius $r/2$ and centered at a point in $X$. The value $\lambda$ is called the doubling dimension of $M$. The doubling dimension can be seen as a generalization of the Euclidean dimension since $\mathbb{R}^d$ has a doubling dimension of $\Theta(d)$ [59].
Besides, the doubling dimension extends the notion of growth-restricted metric spaces defined by Karger and Ruhl [74].

2.2 Facility Location Problems

In this section, we will define different types of facility location problems. The first problem will be a facility location problem in general metric spaces. The problem definition is then extended to powers of metric spaces. Finally, we will introduce a mobile facility location problem in Euclidean spaces.

Metric Facility Location Problem

In the metric facility location problem, we are given a metric space $(\mathcal{F} \cup \mathcal{C}, D)$, where $\mathcal{F} := \{x_1, x_2, \dots, x_m\}$ is a set of $m$ facilities and $\mathcal{C} := \{y_1, y_2, \dots, y_n\}$ is a set of $n$ clients. Each facility $x_i \in \mathcal{F}$ has an associated non-negative opening cost $f_i$. Each client $y_j \in \mathcal{C}$ has a non-negative demand $d_j$. The goal is to find a subset $F \subseteq \mathcal{F}$ of open facilities such that the objective

$$\mathrm{FacLoc}((\mathcal{F}, \mathcal{C}), F) := \sum_{x_i \in F} f_i + \sum_{y_j \in \mathcal{C}} d_j \cdot D(y_j, F)$$

is minimized. The first part of the objective is the opening cost related to the open facilities in $F$. The second part of the objective is the cost related to all clients in $\mathcal{C}$, which we call the connection cost.

Throughout the whole thesis, we will only consider the variant of the metric facility location problem with $X := \mathcal{F} = \mathcal{C}$, where $X := \{x_1, x_2, \dots, x_n\}$ is a set of $n$ points. We then write the facility location cost in short as

$$\mathrm{FacLoc}(X, F) := \mathrm{FacLoc}((\mathcal{F}, \mathcal{C}), F) \, .$$

In the uniform metric facility location problem with $X := \mathcal{F} = \mathcal{C}$, both the opening costs of the facilities and the demands of the clients are uniform. More precisely, we assume that, for each $x_i \in X$, we have $f_i = f$ for some fixed value $f \geq 0$ and $d_i = 1$. Then, the goal is to find a subset $F \subseteq X$ of open facilities such that the objective

$$\mathrm{FacLoc}(X, F, f) := f \cdot |F| + \sum_{x_j \in X} D(x_j, F)$$

is minimized.

If the given metric space is a Euclidean space, we call the problem the (uniform) Euclidean facility location problem.
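To make the objective concrete, the following minimal Python sketch evaluates the facility location cost for the variant in which every point is both a client and a potential facility. The function names `fac_loc_cost` and `uniform_fac_loc_cost`, the index-based representation of the open set, and the use of the Euclidean distance are assumptions made for illustration only; they are not part of the thesis.

```python
import math


def fac_loc_cost(points, open_idx, opening_costs, demands, dist=math.dist):
    """Opening cost of the open facilities plus demand-weighted connection cost.

    points        -- list of coordinate tuples (the set X)
    open_idx      -- indices of the points opened as facilities (the set F)
    opening_costs -- opening cost f_i for every point
    demands       -- demand d_j for every point
    dist          -- distance function D (Euclidean distance by default)
    """
    if not open_idx:
        raise ValueError("at least one facility must be open")
    opening = sum(opening_costs[i] for i in open_idx)
    # D(x_j, F) is the distance from x_j to its nearest open facility.
    connection = sum(
        demands[j] * min(dist(points[j], points[i]) for i in open_idx)
        for j in range(len(points))
    )
    return opening + connection


def uniform_fac_loc_cost(points, open_idx, f, dist=math.dist):
    """Uniform variant: every opening cost equals f and every demand equals 1."""
    n = len(points)
    return fac_loc_cost(points, open_idx, [f] * n, [1] * n, dist)


# Three points on a line, uniform opening cost f = 2.
X = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
cost_one = uniform_fac_loc_cost(X, [0], f=2.0)     # 2 + (0 + 1 + 10)
cost_two = uniform_fac_loc_cost(X, [0, 2], f=2.0)  # 4 + (0 + 1 + 0)
```

In this toy instance, opening a second facility at the far point pays for itself because it removes a connection cost of 10 at an opening cost of 2.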
Facility Location Problem for Powers of Metric Spaces

In the facility location problem for powers of metric spaces, we are given a metric space $(\mathcal{F} \cup \mathcal{C}, D)$ and a constant metric exponent $\ell \geq 1$. As for the metric facility location problem, in this thesis, we will only consider the variant with $X := \mathcal{F} = \mathcal{C}$, where $X := \{x_1, x_2, \dots, x_n\}$ is a set of $n$ points. Each point $x_i \in X$ has an associated non-negative opening cost $f_i$ and an associated non-negative demand $d_i$. The goal is to find a subset $F \subseteq X$ of open facilities such that the objective

$$\mathrm{FacLoc}(X, F, \ell) := \sum_{x_i \in F} f_i + \sum_{x_j \in X} d_j \cdot D(x_j, F)^{\ell}$$

is minimized.

In the uniform facility location problem for powers of metric spaces with $X := \mathcal{F} = \mathcal{C}$, both the opening costs and the demands of the points are uniform. We assume that, for each $x_i \in X$, we have $f_i = f$ for some fixed value $f \geq 0$ and $d_i = 1$. Then, the goal is to find a subset $F \subseteq X$ of open facilities such that the objective

$$\mathrm{FacLoc}(X, F, f, \ell) := f \cdot |F| + \sum_{x_j \in X} D(x_j, F)^{\ell}$$

is minimized.

Mobile Facility Location Problem

In the mobile facility location problem, we are given a set of moving facilities $\mathcal{F}$ and a set of moving clients $\mathcal{C}$ in a Euclidean space $\mathbb{R}^d$. As described before, in this thesis, we will only consider the mobile facility location problem with $P := \mathcal{F} = \mathcal{C}$, where $P := \{p_1, p_2, \dots, p_n\}$ is a set of $n$ moving points in $\mathbb{R}^d$. Let $p_i(t)$ denote the position of $p_i$ at time $t$, and let $P(t) := \{p_1(t), p_2(t), \dots, p_n(t)\}$. For each point $p_i \in P$, there exists a non-negative opening cost $f_i$ and a non-negative demand $d_i$. Observe that neither the opening cost nor the demand of a point changes over time. The mobile facility location problem is to maintain, at each point of time $t$, a subset $F(t) \subseteq P(t)$ of open facilities such that

$$\mathrm{FacLoc}(P(t), F(t)) := \sum_{p_i(t) \in F(t)} f_i + \sum_{p_j(t) \in P(t)} d_j \cdot D(p_j(t), F(t))$$

is minimized.
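The following short Python sketch illustrates why the set of open facilities has to be maintained over time. It evaluates the mobile objective at a given time for a fixed choice of open facilities. The linear trajectories (start position plus time times velocity) and the names `position` and `mobile_cost` are assumptions made for this illustration; the problem definition itself only requires continuous motion.

```python
import math


def position(start, velocity, t):
    """Position at time t under linear motion (an assumption of this sketch)."""
    return tuple(s + t * v for s, v in zip(start, velocity))


def mobile_cost(starts, velocities, open_idx, opening_costs, demands, t):
    """Facility location cost of the point set at time t for the open set open_idx."""
    pts = [position(s, v, t) for s, v in zip(starts, velocities)]
    opening = sum(opening_costs[i] for i in open_idx)
    connection = sum(
        demands[j] * min(math.dist(pts[j], pts[i]) for i in open_idx)
        for j in range(len(pts))
    )
    return opening + connection


# Point 0 is stationary; point 1 approaches it with unit speed.
starts = [(0.0, 0.0), (4.0, 0.0)]
velocities = [(0.0, 0.0), (-1.0, 0.0)]
f = [3.0, 3.0]  # opening costs, constant over time
d = [1.0, 1.0]  # demands, constant over time

single_t0 = mobile_cost(starts, velocities, [0], f, d, t=0.0)    # 3 + 4
both_t0 = mobile_cost(starts, velocities, [0, 1], f, d, t=0.0)   # 6 + 0
single_t3 = mobile_cost(starts, velocities, [0], f, d, t=3.0)    # 3 + 1
both_t3 = mobile_cost(starts, velocities, [0, 1], f, d, t=3.0)   # 6 + 0
```

At time 0 opening both points is cheaper, while at time 3 a single open facility suffices, so a solution that is good at one moment can become poor later even though the costs and demands never change.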
2.3 The Mettu-Plaxton Algorithm

This section addresses the greedy algorithm of Mettu and Plaxton [87] that computes a constant-factor approximation for the metric facility location problem. Let $(X, D)$ be a metric space, where $X = \{x_1, \dots, x_n\}$ is a set of $n$ points and $D$ is a distance function defined on $X$. Following the definitions from Section 2.2, the opening cost of a point $x_i \in X$ is denoted by $f_i$ and its demand by $d_i$.

As mentioned in the previous chapter, the Mettu-Plaxton algorithm implicitly applies the primal-dual method of Jain and Vazirani proposed in [70]. This is done by defining so-called 'radii' for amortizing the cost needed to open a facility at a particular location. The idea of the Mettu-Plaxton algorithm is to open only a few facilities but, at the same time, to guarantee that each point $x_i \in X$ has at least one open facility in the ball with center $x_i$ and twice the radius of $x_i$. After giving a formal definition of balls and radii, we describe the algorithm in more detail.

Balls. For a point $x_i \in X$ and a non-negative value $r$, we define $B(x_i, r)$ to be the ball with center $x_i$ and radius $r$. Given such a ball $B(x_i, r)$, we let $\mathrm{weight}(B(x_i, r))$ denote the sum of the demands of all the points in $X$ that are located in the ball $B(x_i, r)$, i.e., we define

$$\mathrm{weight}(B(x_i, r)) := \sum_{x_j \in X \cap B(x_i, r)} d_j \, .$$

Radius Associated with a Point. According to [87], for each point $x_i \in X$, we define the value $r_i$ to be the radius of the ball with center $x_i$ that satisfies

$$\sum_{x_j \in X \cap B(x_i, r_i)} d_j \cdot (r_i - D(x_i, x_j)) = f_i \, . \qquad (2.1)$$

Figure 2.1: Illustration of $\sum_{x \in X \cap B(x_i, r_i)} (r_i - D(x_i, x))$ (in case of uniform demands with $d_j = 1$ for all $x_j \in X$). The dashed lines correspond to the distances summed up.

Figure 2.1 illustrates the definition of the radius $r_i$ associated with a point $x_i$. Observe that the sum on the left-hand side of Equation (2.1) is continuous and strictly monotonically increasing with $r_i$.
Hence, there exists a unique value $r_i$ satisfying the equation. Moreover, for any point $x_i \in X$, the radius $r_i$ ranges between

$$r_{\min} := \frac{\min_{x_j \in X} f_j}{n \cdot \max_{x_j \in X} d_j} \quad \text{and} \quad r_{\max} := \frac{\max_{x_j \in X} f_j}{\min_{x_j \in X} d_j} \, .$$

The lower limit of the range is met if (i) $f_i = \min_{x_j \in X} f_j$, (ii) all the points in $X$ are at the same position, and (iii) the demands of all the points are uniform such that $d_\ell = \max_{x_j \in X} d_j$ for any $\ell \in \{1, \dots, n\}$. Because of Conditions (ii) and (iii), the contribution of each point $x_j \in X$ to the sum is $r_i \cdot \max_{x_j \in X} d_j$, which is the highest possible value. The upper limit of the range is met if (i) $f_i = \max_{x_j \in X} f_j$, (ii) $x_i$ is the only point in the ball with radius $r_i$ and center $x_i$, and (iii) $d_i = \min_{x_j \in X} d_j$. In this case, due to Condition (ii), the contribution of each point $x_j \in X \setminus \{x_i\}$ to the sum is 0, and, due to Condition (iii), the contribution of $x_i$ is $r_i \cdot \min_{x_j \in X} d_j$, which is the lowest possible value.

The Algorithm. First, the Mettu-Plaxton algorithm computes for each point $x_i \in X$ its associated radius $r_i$. Then, it goes through all the points in $X$ in non-decreasing order of their radii and opens a facility at a point $x_i \in X$ if $x_i$ has no open facility in the ball with center $x_i$ and radius $2 r_i$. A pseudocode listing of the Mettu-Plaxton algorithm is given by Algorithm 2.3.1.

Algorithm 2.3.1 Mettu-Plaxton-FacLoc($X$)
    calculate the radius $r_i$ for each point $x_i \in X$
    sort all points in non-decreasing order according to their radii
    let $x_1, x_2, \dots, x_n$ be the sorted sequence
    for $i \leftarrow 1$ to $n$ do
        if there is no open facility in $B(x_i, 2 \cdot r_i)$ then
            open facility at $x_i$

Let $\mathrm{FacLoc}^*(X)$ be the optimal facility location cost for $X$. Then, Mettu and Plaxton obtained the following result:

Theorem 1 ([87]). Given any $n$-point metric space $(X, D)$, algorithm Mettu-Plaxton-FacLoc computes a subset $F \subseteq X$ of open facilities such that we have

$$\mathrm{FacLoc}(X, F) \leq 3 \cdot \mathrm{FacLoc}^*(X) \, .$$

The running time needed to compute $F$ is $O(n^2)$.
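The steps of Algorithm 2.3.1 can be sketched in Python as follows. This is a simplified illustration, not the implementation analyzed in [87]: the names `mp_radius` and `mettu_plaxton` are hypothetical, all demands are assumed to be strictly positive, balls are treated as closed, and the radius of each point is found by sorting the distances, which costs $O(n^2 \log n)$ time overall rather than the $O(n^2)$ stated in Theorem 1. The radius computation exploits that the left-hand side of Equation (2.1) is piecewise linear in the radius.

```python
import math


def mp_radius(i, points, opening_costs, demands, dist):
    """Solve Equation (2.1) for point i, assuming strictly positive demands."""
    # Pairs (distance to x_i, demand), sorted by distance.
    near = sorted((dist(points[i], points[j]), demands[j]) for j in range(len(points)))
    weight = 0.0         # total demand of the points inside the ball so far
    weighted_dist = 0.0  # sum of demand * distance over these points
    for k, (delta, d) in enumerate(near):
        weight += d
        weighted_dist += d * delta
        # With exactly these points in the ball, (2.1) reads
        # weight * r - weighted_dist = f_i.
        r = (opening_costs[i] + weighted_dist) / weight
        # The candidate is valid if the next point is not strictly inside the ball.
        if k + 1 == len(near) or r <= near[k + 1][0]:
            return r


def mettu_plaxton(points, opening_costs, demands, dist=math.dist):
    """Return the indices of the opened facilities and the radii of all points."""
    n = len(points)
    radii = [mp_radius(i, points, opening_costs, demands, dist) for i in range(n)]
    open_idx = []
    # Process the points in non-decreasing order of their radii.
    for i in sorted(range(n), key=lambda i: radii[i]):
        # Open x_i only if no open facility lies in the ball B(x_i, 2 * r_i).
        if all(dist(points[i], points[j]) > 2 * radii[i] for j in open_idx):
            open_idx.append(i)
    return open_idx, radii


X = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
opened, radii = mettu_plaxton(X, [2.0, 2.0, 2.0], [1.0, 1.0, 1.0])
```

For this instance the two nearby points both receive radius 1.5, since $(1.5 - 0) + (1.5 - 1) = 2$, and the isolated point receives radius 2. Only one of the nearby points is opened, because the other one finds it within twice its own radius.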
2.4 Computational Models

In this section, we describe the computational models that we apply to measure the complexity of our algorithms. These are the synchronous message passing model for algorithms working in a distributed setting, the kinetic data structure framework for algorithms working in a mobile setting, and data stream models. Before we give an overview of these models, we briefly describe the real random access machine model because, except for algorithms working in the synchronous message passing model, we measure the time and space complexities of our algorithms based on this model.

2.4.1 Real Random Access Machine Model

The real random access machine (RAM) model is a simplified and idealized model of a real computer, which is often used in computational geometry. In this model, a memory cell can store a real number and is called a memory unit. The allowed operations are

• arithmetic operations ($+, -, \cdot, \div$),
• comparisons of two memory cells ($<, \le, =, \ne, \ge, >$), and
• some standard operations (raising a number to a given power¹, extracting a root², logarithmic calculus³, trigonometric functions⁴).

¹ Our algorithms only raise natural numbers to a power greater than a small constant.
² We use an extraction of a root once in our distributed facility location algorithm for powers of metric spaces, once per embedding of a set of high-dimensional Euclidean points into a low-dimensional Euclidean space, and many times for our KDS to compute the points of intersection of two trajectories.
³ Some of our algorithms compute a few values of the form $\lceil \log(x) \rceil$ for some real number $x > 1$. Since the running time of such an algorithm is $\Omega(\lceil \log(x) \rceil)$, the value $\lceil \log(x) \rceil$ can even be computed by linear search for the smallest $i \in \mathbb{N}_0$ such that $2^i \ge x$, with negligible increase in the running time.
⁴ Our algorithms do not use trigonometric functions.
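The linear search mentioned in the footnote on logarithms can be sketched as follows. This is only an illustration; the function name `ceil_log2` is an assumption, and the logarithm is taken to base 2 because the footnote searches for the smallest $i$ with $2^i \ge x$. The sketch uses only multiplications and comparisons, which are among the operations allowed above.

```python
def ceil_log2(x: float) -> int:
    """Smallest i in N_0 with 2**i >= x, found by linear search (x > 1 assumed)."""
    i = 0
    power = 1  # invariant: power == 2**i
    while power < x:
        power *= 2
        i += 1
    return i
```

The loop performs $\lceil \log(x) \rceil$ iterations, which matches the footnote's remark that the search does not increase the asymptotic running time of an algorithm that already needs $\Omega(\lceil \log(x) \rceil)$ time.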
It is assumed that each of these allowed operations can be executed in a constant number of time units. In the analysis of our algorithms, we assume that each coordinate of a point can be represented by using one memory unit, and the distance between two constant-dimensional points can be computed in a constant number of time units. These assumptions are commonly made in computational geometry. Unless otherwise stated, we measure the running time of our algorithms in time units and the space requirement in memory cells.

2.4.2 Synchronous Message Passing Model

The synchronous message passing model is well-known and one of the most frequently used models to design algorithms in a distributed setting [81, 98]. In this model, a network is an undirected graph, where the nodes are the processors and the edges are the bidirectional communication channels between the processors. Each node has a unique ID and knows the total number of nodes in the network but does not know the topology of the network. At the beginning, the knowledge of a node about the network topology is limited to its neighbor nodes. To solve a given global problem, the nodes are allowed to communicate with each other. A global problem could be, for instance, to solve the facility location problem on the network nodes.

For the sake of simplicity, the communication is assumed to be synchronous, i.e., there are globally defined communication rounds. In each such round, each node can send a message to each of its neighbors. The message sizes are bounded by $B$ bits, where $B$ is the bandwidth parameter of the network. Often, it is assumed that the bandwidth parameter is logarithmic in the number of nodes. In this way, each message can contain a constant number of node IDs (among others, those of the message sender and receiver). The time complexity of a distributed algorithm that works in the synchronous message passing model is the number of required communication rounds.
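To make the notion of a communication round concrete, the following Python sketch simulates the model centrally on a toy task that is not taken from this thesis: every node learns the smallest node ID in the network. The function name `run_rounds` and the adjacency-dictionary representation are assumptions made for this illustration. Each message carries a single node ID, so it fits into a bandwidth of $O(\log n)$ bits.

```python
from typing import Dict, List


def run_rounds(neighbors: Dict[int, List[int]], num_rounds: int) -> Dict[int, int]:
    """Simulate synchronous rounds in which nodes flood the smallest ID they know."""
    known = {v: v for v in neighbors}  # initially, a node only knows its own ID
    for _ in range(num_rounds):
        # All messages of a round are sent before any node updates its state.
        inbox = {v: [known[u] for u in neighbors[v]] for v in neighbors}
        known = {v: min([known[v]] + inbox[v]) for v in neighbors}
    return known
```

On a path with four nodes, three rounds are needed until the smallest ID has reached the far end, whereas a single round suffices on a clique. The number of rounds is exactly the quantity that the time complexity of this model measures.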
2.4.3 Kinetic Data Structures

In 1999, Basch et al. [15] introduced the kinetic data structure (KDS) framework, which has been used as a central model for processing objects in motion ever since (see, e.g., [2, 15, 54] and the references therein). A KDS is a data structure that maintains a certain attribute of a set of continuously moving objects. For instance, in case of a facility location problem, this could be a set of open facilities that minimizes the facility location cost. The input of a KDS is a set of objects and a flight plan, i.e., each object moves continuously along a known trajectory. Furthermore, at any time, it is possible to change the flight plan by performing a so-called flight plan update, which means that one object changes its trajectory.

The main idea is that the continuous motion of the objects is utilized in a way that updates of the KDS take place only at discrete points in time and can be processed fast. As a result, a lot of computational effort can be saved by maintaining the KDS compared to handling just a series of instances of the corresponding static problem. To guarantee that the attribute is correct at any time, a KDS ensures that certain certificates are always valid. Whenever a certificate fails, we call this an event, and an update is required. In case of a facility location problem, such an event occurs, for instance, when a client has moved so far away from all the open facilities that its connection cost exceeds the opening cost of a facility. To be able to handle each event at the correct time, an event queue is maintained.

There are four important properties to measure the quality of a KDS. The worst-case amount of time to process an update is called responsiveness. The second and third properties are compactness and locality. The compactness is given by the ratio between the maximum number of certificates ever present to prove the correctness of the attribute and the number of the moving objects.
The locality addresses the maximum number of events in the event queue in which one object can be involved. As a result, the locality is a measure of how easily flight plan updates can be performed. The fourth property, the efficiency of a KDS, is the ratio between the worst-case total number of processed events and the worst-case number of processed events where the attribute changes. These worst-case numbers are specified under certain assumptions on the trajectories of the objects. Common assumptions are that the motions are linear or can be described by bounded-degree polynomials. A KDS is called responsive, compact, local, and efficient, respectively, if the associated value is at most polylogarithmic in the size of the input. For a more detailed description of the concepts of a KDS, the reader is referred to [15, 54, 55].

2.4.4 Data Stream Models

A data stream consists of a long sequence of data items. The length of this sequence restricts the amount of resources that is available to process the data and the type of access to the data. In general, the amount of data is too large to be stored in main memory. Often it is even larger than the capacity of modern hard disks. As a result, the data has to be processed on the fly, and the only possible access to the data is sequential reading. Typical examples of data streams are network traffic data, measurements of sensor networks, or web crawls.

In order to design efficient algorithms for data streams, computer scientists have invented many different data stream models. In this section, we provide a description of the two models considered in this thesis. These are the insertion-only data stream model and the dynamic data stream model, which are both frequently used in the field of geometry. For information about other data stream models and an overview of recent research, we refer the reader to [66, 92, 93].
Insertion-Only Data Stream Model

In the insertion-only data stream model⁵, the input is a sequence (of insert operations) of points $p_1, \ldots, p_i, \ldots, p_n$ in worst-case order. As mentioned above, the type of access to the input points and the amount of resources to process them are restricted. More precisely, instead of having random access to the input points, which would be very time consuming, algorithms perform one sequential scan over the input stream that reads the points one by one in increasing order of the index $i$. Furthermore, it is only allowed to use space that is sublinear in the size of the input stream. To deal with these restrictions, streaming algorithms try to maintain, at any time, a summary of all the data seen so far. Such a summary is a small-space representation that fairly approximates the input data with respect to a given problem, i.e., a solution computed on the original input data can be approximated by using the small summary.

⁵ The insertion-only data stream model is a special type of the cash register model (confer [92]).

The complexity of a streaming algorithm is measured by its space requirement, its update time needed to process an element of the input stream, and its time needed to extract a solution for the given problem from the maintained summary. All three of these requirements are assumed to be only polylogarithmic in the size of the input stream. Note that most of the streaming algorithms presented in this thesis have the property that they do not require extra time to extract a solution from the maintained summary since all necessary computations are done during an update. Accordingly, we will only specify the third complexity measure for those algorithms for which this property does not hold.

Dynamic Data Stream Model

The dynamic data stream model⁶ is an extension of the insertion-only data stream model which also allows delete operations of points.
In this thesis, our focus is on a special type of this model which is called the dynamic geometric data stream model. This model was introduced by Indyk [64] and is defined as follows. The input is a sequence of $m$ update operations on a point set $P \subseteq \{1, \ldots, \Delta\}^d$ in a discrete $d$-dimensional Euclidean space. At the beginning, the point set $P$ is empty. For any point $p \in \{1, \ldots, \Delta\}^d$, the operation Insert($p$) inserts $p$ into $P$, and, analogously, the operation Delete($p$) deletes $p$ from $P$. We assume that the update operations occur in worst-case order with the constraint that the stream is consistent, i.e., no point is removed that is not present in the current point set, and no point is added twice. Furthermore, we use $n$ as an upper bound on the size of the current point set $P$. Obviously, we have $n \in O(\Delta^d)$ and $n \le m$. Algorithms that work in the dynamic geometric data stream model are only allowed to perform one sequential scan over the input stream. The space requirement, the update time, and the time to extract a solution of the given problem from the maintained summary are each assumed to be only polylogarithmic in $m$ and $\Delta$ and, therefore, in $n$ since $n \le m$.

⁶ The dynamic data stream model is a special type of the turnstile model (confer [92]).

3 Facility Location in a Distributed Setting

This chapter addresses a randomized constant-factor approximation algorithm for the uniform metric facility location problem in a distributed setting. Our algorithm works in the synchronous message passing model where the underlying network is a clique with each node being a client as well as a potential location for a facility. Our algorithm is based on two facts that Bădoiu et al.
[14] discovered in case of the uniform metric facility location problem: (i) Given any point set X from a metric space, the sum of the radii defined by Mettu and Plaxton [87] is a constant-factor approximation of the optimal facility location cost for X, and (ii) for any facility xi ∈ X, there exists a lower bound on the number of points located in the ball whose center is xi and whose radius equals the radius of xi. Using these two facts, we designed our randomized distributed algorithm in a way that it determines in three communication rounds, with message sizes bounded to O(log(|X|)) bits, a subset of the input points as open facilities such that, with high constant probability, the following condition is satisfied: The total opening cost is at most a constant factor larger than the sum of the radii and each facility xi ∈ X has an open facility in a ball whose center is xi and whose radius is at most a constant factor larger than the radius of xi. Thus, with high constant probability, our algorithm computes a constant-factor approximation of the uniform facility location problem for X. Note that, in some settings, the transmission cost between two nodes is not linear in the distance. In radio networks, for example, it is a typical assumption that the energy required for transmitting a message via a certain distance is somewhere between the square and the cube of the distance. Motivated by this fact, we also extended our distributed algorithm to the uniform facility location problem for constant powers of metric spaces. The remainder of this chapter is organized as follows. In Section 3.1, we specify the used synchronous message passing model and generalize the two facts mentioned above to the uniform facility location problem for powers of metric spaces. Our distributed algorithm for the uniform metric facility location problem is presented in Section 3.2. The extension to constant powers of metric spaces can be found in Section 3.3. 
3.1 Preliminaries

In this chapter, we consider the uniform facility location problem for metric spaces and powers of metric spaces in a distributed setting. Given is a uniform opening cost $f$ and a metric space $(X, D)$, where $X = \{x_1, \ldots, x_n\}$ is a set of $n$ points and $D$ is a distance function defined on $X$. In the uniform facility location problem for powers of metric spaces, we are additionally given a constant metric exponent $\ell$. Recall the definition of the facility location cost for both considered problems from Section 2.2. We denote the cost of an optimal solution to the uniform metric facility location problem by $\mathrm{FacLoc}^*(X, f)$ and the cost of an optimal solution to the uniform facility location problem for powers of metric spaces by $\mathrm{FacLoc}^*(X, f, \ell)$.

3.1.1 The Distributed Setting

We consider the synchronous message passing model described in Section 2.4.2 where the communication network is a clique. This means that, in each communication round, each node can send a message to all other nodes. The message size is bounded by $O(\log(n))$ bits. Furthermore, we assume that every node knows the distance to all other nodes, and each distance can be represented by $O(\log(n))$ bits. Since we want to develop an approximation algorithm, we can always achieve this by appropriate rounding.

Note that although in our setting we allow all-to-all communication, it is not possible to solve the problem by accumulating all information at one node and then solving the problem with a classical (centralized) algorithm. The problem is that every node only knows the distances between itself and its neighbors. Since every node receives $O(n \log(n))$ bits of information in every communication round, it requires $\Omega(n)$ rounds to gather the information about all pairwise distances at a single node.
As shown in [106], we essentially require all this information because it is not possible to compute a constant-factor approximation to the facility location problem (with uniform opening costs and demands) without looking at $\Omega(n^2)$ distances.

3.1.2 The Radii

Radius Associated with a Point. We extend the original definition of a radius associated with a point, given in Section 2.3, to powers of metric spaces. More precisely, for each point $x_i \in X$, we define the value $r_i$ to be the radius of the ball with center $x_i$ that satisfies
\[ \sum_{x \in X \cap B(x_i, r_i)} \bigl( r_i^{\ell} - D(x_i, x)^{\ell} \bigr) = f . \tag{3.1} \]
Observe that there still exists exactly one solution for the radius $r_i$ since the left-hand side of Equation (3.1) is continuous and strictly monotonically increasing with $r_i$. For any $i \in \{1, \ldots, n\}$, we have $(f/n)^{1/\ell} \le r_i \le f^{1/\ell}$.

In case of uniform opening cost $f = 1$ and a metric exponent $\ell = 1$, Bădoiu et al. [14] discovered a useful relation between the value $\mathrm{weight}(B(x_i, r_i))$ and the radius $r_i$. Their result can be generalized to any uniform opening cost $f \ge 0$ and any metric exponent $\ell \ge 1$. We obtain the following lemma:

Lemma 3.1.1. For each $x_i \in X$, we have
\[ \mathrm{weight}(B(x_i, r_i)) \ge \frac{f}{r_i^{\ell}} . \]

Proof. Due to the definition of $r_i$, we have
\[ \sum_{x \in X \cap B(x_i, r_i)} \bigl( r_i^{\ell} - D(x_i, x)^{\ell} \bigr) = f , \]
which implies
\[ \sum_{x \in X \cap B(x_i, r_i)} r_i^{\ell} \ge f . \]
Since $\mathrm{weight}(B(x_i, r_i)) = |\{x \in X \mid x \in B(x_i, r_i)\}|$, we obtain $\mathrm{weight}(B(x_i, r_i)) \ge f / r_i^{\ell}$.

Sum of the Radii. Bădoiu et al. [14] proved that the sum of the radii associated with the points in $X$ is a good approximation of the optimal facility location cost for $X$. Again, their result can be generalized to any uniform opening cost $f \ge 0$ and any metric exponent $\ell \ge 1$. In the proof of the generalized result, we use a modified version of the Mettu-Plaxton algorithm.
More precisely, this version works exactly as Algorithm 2.3.1 except that, in the first step, it computes, for each point $x_i \in X$, the radius $r_i$ that satisfies Equation (3.1), instead of the original radius proposed by Mettu and Plaxton [87]. We will first show that this modified Mettu-Plaxton algorithm is still a constant-factor approximation. Based on this result, we will then prove that the sum of the exponentiated radii approximates the optimal cost $\mathrm{FacLoc}^*(X, f, \ell)$ within a constant factor.

Let $F_{\mathrm{MP}}$ be the set of open facilities computed by the modified Mettu-Plaxton algorithm. In the following, we will show that $\mathrm{FacLoc}(X, F_{\mathrm{MP}}, f, \ell) \le 3^{\ell} \cdot \mathrm{FacLoc}^*(X, f, \ell)$. The argumentation is basically the same as in [87]. Only a few minor adaptations to our scenario have been made.

Claim 3.1.2. For any point $x_i \in X$, there exists an open facility $x_j \in F_{\mathrm{MP}}$ such that $r_j \le r_i$ and $D(x_i, x_j) \le 2 \cdot r_i$.

Proof. If there is no such open facility $x_j$ with $r_j \le r_i$ in $B(x_i, 2 \cdot r_i)$, then we open a facility at $x_i$ and $x_i$ belongs to $F_{\mathrm{MP}}$.

Claim 3.1.3. Let $x_i$ and $x_j$ be distinct open facilities in $F_{\mathrm{MP}}$. Then, we have $D(x_i, x_j) > 2 \cdot \max\{r_i, r_j\}$.

Proof. Without loss of generality, we assume that $r_j \le r_i$. It follows that $x_j \notin B(x_i, 2 \cdot r_i)$. Otherwise, the point $x_i$ would not be an open facility. Thus, we have $D(x_i, x_j) > 2 \cdot r_i \ge 2 \cdot r_j$.

For any point $x_j \in X$ and an arbitrary set of open facilities $F' \subseteq X$, let
\[ \mathrm{charge}(x_j, F') := D(x_j, F')^{\ell} + \sum_{x_i \in F'} \max\{0, r_i^{\ell} - D(x_i, x_j)^{\ell}\} . \]

Claim 3.1.4. For an arbitrary set of open facilities $F' \subseteq X$, we have
\[ \sum_{x_j \in X} \mathrm{charge}(x_j, F') = \mathrm{FacLoc}(X, F', f, \ell) . \]

Proof. Due to the definition of $\mathrm{charge}(\cdot, \cdot)$ and Equation (3.1), we get
\begin{align*}
\sum_{x_j \in X} \mathrm{charge}(x_j, F')
&= \sum_{x_j \in X} D(x_j, F')^{\ell} + \sum_{x_j \in X} \sum_{x_i \in F'} \max\{0, r_i^{\ell} - D(x_i, x_j)^{\ell}\} \\
&= \sum_{x_j \in X} D(x_j, F')^{\ell} + \sum_{x_i \in F'} \sum_{x_j \in X \cap B(x_i, r_i)} \bigl( r_i^{\ell} - D(x_i, x_j)^{\ell} \bigr) \\
&= \sum_{x_j \in X} D(x_j, F')^{\ell} + \sum_{x_i \in F'} f \\
&= \mathrm{FacLoc}(X, F', f, \ell) .
\end{align*}

Claim 3.1.5.
Let $x_j \in X$ be any point, let $F' \subseteq X$ be an arbitrary set of open facilities, and let $x_i \in F'$ be any open facility. If we have $D(x_j, x_i) = D(x_j, F')$, then $\mathrm{charge}(x_j, F') \ge \max\{r_i^{\ell}, D(x_j, x_i)^{\ell}\}$.

Proof. If $x_j \notin B(x_i, r_i)$, then $\mathrm{charge}(x_j, F') \ge D(x_j, F')^{\ell} = D(x_j, x_i)^{\ell} > r_i^{\ell}$. Otherwise, we have
\begin{align*}
\mathrm{charge}(x_j, F') &\ge D(x_j, F')^{\ell} + \bigl( r_i^{\ell} - D(x_j, x_i)^{\ell} \bigr) \\
&= D(x_j, x_i)^{\ell} + \bigl( r_i^{\ell} - D(x_j, x_i)^{\ell} \bigr) \\
&= r_i^{\ell} \ge D(x_j, x_i)^{\ell} .
\end{align*}

Claim 3.1.6. Let $x_j \in X$ be any point, and let $x_i$ be any open facility in $F_{\mathrm{MP}}$. If $x_j \in B(x_i, r_i)$, then $\mathrm{charge}(x_j, F_{\mathrm{MP}}) \le r_i^{\ell}$.

Proof. By Claim 3.1.3, there is no open facility $x_m \in F_{\mathrm{MP}}$ such that we have $i \ne m$ and $x_j \in B(x_m, r_m)$. Since $D(x_j, F_{\mathrm{MP}}) \le D(x_j, x_i)$, we obtain
\begin{align*}
\mathrm{charge}(x_j, F_{\mathrm{MP}}) &= D(x_j, F_{\mathrm{MP}})^{\ell} + \bigl( r_i^{\ell} - D(x_j, x_i)^{\ell} \bigr) \\
&\le D(x_j, x_i)^{\ell} + \bigl( r_i^{\ell} - D(x_j, x_i)^{\ell} \bigr) = r_i^{\ell} .
\end{align*}

Claim 3.1.7. Let $x_j \in X$ be any point, and let $x_i$ be any open facility in $F_{\mathrm{MP}}$. If $x_j \notin B(x_i, r_i)$, then we have $\mathrm{charge}(x_j, F_{\mathrm{MP}}) \le D(x_j, x_i)^{\ell}$.

Proof. The correctness of the claim follows immediately, unless there is an open facility $x_m \in F_{\mathrm{MP}}$ such that $x_j \in B(x_m, r_m)$. If such an open facility $x_m$ exists, then Claims 3.1.3 and 3.1.6 imply $D(x_i, x_m) > 2 \cdot \max\{r_i, r_m\}$ and $\mathrm{charge}(x_j, F_{\mathrm{MP}}) \le r_m^{\ell}$. Furthermore, by triangle inequality, we obtain
\[ D(x_j, x_i) \ge D(x_i, x_m) - D(x_j, x_m) > 2 r_m - r_m = r_m , \]
which proves $\mathrm{charge}(x_j, F_{\mathrm{MP}}) \le r_m^{\ell} \le D(x_j, x_i)^{\ell}$.

Claim 3.1.8. For any point $x_j \in X$ and an arbitrary set of open facilities $F' \subseteq X$, we have $\mathrm{charge}(x_j, F_{\mathrm{MP}}) \le 3^{\ell} \cdot \mathrm{charge}(x_j, F')$.

Proof. Let $x_i$ be some open facility in $F'$ such that we have $D(x_j, x_i) = D(x_j, F')$. By Claim 3.1.2, there exists a facility $x_m \in F_{\mathrm{MP}}$ such that we have $r_m \le r_i$ and $D(x_i, x_m) \le 2 \cdot r_i$. If $x_j \in B(x_m, r_m)$, then we get $\mathrm{charge}(x_j, F_{\mathrm{MP}}) \le r_m^{\ell}$ by Claim 3.1.6. Since Claim 3.1.5 implies $\mathrm{charge}(x_j, F') \ge r_i^{\ell}$, we can conclude
\[ \mathrm{charge}(x_j, F_{\mathrm{MP}}) \le r_m^{\ell} \le r_i^{\ell} \le \mathrm{charge}(x_j, F') . \]
This proves the assertion in case that we have $x_j \in B(x_m, r_m)$. If $x_j \notin B(x_m, r_m)$, then $\mathrm{charge}(x_j, F_{\mathrm{MP}}) \le D(x_j, x_m)^{\ell}$ by Claim 3.1.7.
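Thus, by triangle inequality, we get
\begin{align*}
\mathrm{charge}(x_j, F_{\mathrm{MP}}) &\le D(x_j, x_m)^{\ell} \le \bigl( D(x_j, x_i) + D(x_i, x_m) \bigr)^{\ell} \\
&\le \bigl( D(x_j, x_i) + 2 \cdot r_i \bigr)^{\ell} \le 3^{\ell} \cdot \max\{D(x_j, x_i)^{\ell}, r_i^{\ell}\} .
\end{align*}
Now, the assertion follows by Claim 3.1.5.

Lemma 3.1.9. $\mathrm{FacLoc}(X, F_{\mathrm{MP}}, f, \ell) \le 3^{\ell} \cdot \mathrm{FacLoc}^*(X, f, \ell)$

Proof. The assertion follows from Claims 3.1.4 and 3.1.8, applied with $F'$ being an optimal set of open facilities.

Based on the results above, we can prove the following lemma:

Lemma 3.1.10.
\[ \frac{1}{2^{\ell+1}} \cdot \mathrm{FacLoc}^*(X, f, \ell) \le \sum_{x_i \in X} r_i^{\ell} \le 6^{\ell} \cdot \mathrm{FacLoc}^*(X, f, \ell) \]

Proof. We first prove the lower bound and then the upper bound. The argumentation is basically the same as in [14]. Only a few minor adaptations to our scenario have been made.

Lower bound: Let $F_{\mathrm{MP}}$ be the set of open facilities computed by the modified Mettu-Plaxton algorithm. Then, it follows from Claim 3.1.2 that
\[ 2^{\ell} \cdot \sum_{x_i \in X} r_i^{\ell} \ge \sum_{x_i \in X} D(x_i, F_{\mathrm{MP}})^{\ell} . \tag{3.2} \]
Next, we show that we also have
\[ 2^{\ell} \cdot \sum_{x_i \in X} r_i^{\ell} \ge f \cdot |F_{\mathrm{MP}}| . \tag{3.3} \]
Due to Claim 3.1.3, each point $x_i \in X$ is contained in at most one ball $B(x_j, r_j)$ for some open facility $x_j \in F_{\mathrm{MP}}$. Furthermore, observe that, for any point $x_m \in B(x_j, r_j)$, we must have $r_j \le 2 \cdot r_m$. Otherwise, we would have
\[ x_m \in B(x_m, 2 \cdot r_m) \subseteq B(x_m, r_j) \subseteq B(x_j, r_j + D(x_j, x_m)) \subseteq B(x_j, 2 \cdot r_j) , \]
and the modified Mettu-Plaxton algorithm would not open a facility at $x_j$, which is a contradiction. Hence, we obtain
\begin{align*}
\sum_{x_i \in X} r_i^{\ell}
&\ge \sum_{x_j \in F_{\mathrm{MP}}} \sum_{x_m \in X \cap B(x_j, r_j)} r_m^{\ell}
\ge \sum_{x_j \in F_{\mathrm{MP}}} \sum_{x_m \in X \cap B(x_j, r_j)} \Bigl( \frac{r_j}{2} \Bigr)^{\ell} \\
&= \frac{1}{2^{\ell}} \cdot \sum_{x_j \in F_{\mathrm{MP}}} \sum_{x_m \in X \cap B(x_j, r_j)} r_j^{\ell}
\ge \frac{1}{2^{\ell}} \cdot \sum_{x_j \in F_{\mathrm{MP}}} f
= \frac{1}{2^{\ell}} \cdot f \cdot |F_{\mathrm{MP}}| ,
\end{align*}
which proves Inequality (3.3). Due to Inequalities (3.2) and (3.3), we get
\[ 2^{\ell+1} \cdot \sum_{x_i \in X} r_i^{\ell} \ge f \cdot |F_{\mathrm{MP}}| + \sum_{x_i \in X} D(x_i, F_{\mathrm{MP}})^{\ell} = \mathrm{FacLoc}(X, F_{\mathrm{MP}}, f, \ell) \ge \mathrm{FacLoc}^*(X, f, \ell) . \]

Upper bound: Due to Lemma 3.1.9, we know that
\[ \mathrm{FacLoc}(X, F_{\mathrm{MP}}, f, \ell) \le 3^{\ell} \cdot \mathrm{FacLoc}^*(X, f, \ell) . \]
Thus, to prove the upper bound, it remains to show that
\[ \sum_{x_i \in X} r_i^{\ell} \le 2^{\ell} \cdot \mathrm{FacLoc}(X, F_{\mathrm{MP}}, f, \ell) . \]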
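Due to Claim 3.1.4, we have
\begin{align*}
2^{\ell} \cdot \mathrm{FacLoc}(X, F_{\mathrm{MP}}, f, \ell)
&= 2^{\ell} \cdot \sum_{x_i \in X} \mathrm{charge}(x_i, F_{\mathrm{MP}}) \\
&\ge 2^{\ell} \cdot \Biggl( \sum_{x_i \in F_{\mathrm{MP}}} r_i^{\ell} + \sum_{x_j \in X \setminus F_{\mathrm{MP}}} \max\{r_{\delta(j)}^{\ell}, D(x_j, x_{\delta(j)})^{\ell}\} \Biggr) ,
\end{align*}
where $\delta(j)$ denotes the index of the facility in $F_{\mathrm{MP}}$ that is closest to $x_j$. Thus, if we can show that
\[ 2^{\ell} \cdot \Biggl( \sum_{x_i \in F_{\mathrm{MP}}} r_i^{\ell} + \sum_{x_j \in X \setminus F_{\mathrm{MP}}} \max\{r_{\delta(j)}^{\ell}, D(x_j, x_{\delta(j)})^{\ell}\} \Biggr) \ge \sum_{x_i \in X} r_i^{\ell} , \tag{3.4} \]
then we are done. It is sufficient to prove
\[ r_j^{\ell} \le 2^{\ell-1} \cdot \bigl( D(x_j, x_{\delta(j)})^{\ell} + r_{\delta(j)}^{\ell} \bigr) \tag{3.5} \]
because this implies $\max\{r_{\delta(j)}^{\ell}, D(x_j, x_{\delta(j)})^{\ell}\} \ge r_j^{\ell} / 2^{\ell}$ and Inequality (3.4) follows. We prove the correctness of Inequality (3.5) by contradiction. Hence, we assume that
\[ r_j^{\ell} > 2^{\ell-1} \cdot \bigl( D(x_j, x_{\delta(j)})^{\ell} + r_{\delta(j)}^{\ell} \bigr) . \]
We can easily prove by induction that $2^{\ell-1} \cdot (a^{\ell} + b^{\ell}) \ge (a + b)^{\ell}$ for any $a, b \ge 0$. Thus, we obtain
\[ r_j^{\ell} > \bigl( D(x_j, x_{\delta(j)}) + r_{\delta(j)} \bigr)^{\ell} , \]
which, in turn, would imply $B(x_{\delta(j)}, r_{\delta(j)}) \subseteq B(x_j, r_j)$. Furthermore, by applying triangle inequality and $2^{\ell-1} \cdot (a^{\ell} + b^{\ell}) \ge (a + b)^{\ell}$ for any $a, b \ge 0$, we get
\[ D(x_j, x_m)^{\ell} \le \bigl( D(x_j, x_{\delta(j)}) + D(x_{\delta(j)}, x_m) \bigr)^{\ell} \le 2^{\ell-1} \cdot \bigl( D(x_j, x_{\delta(j)})^{\ell} + D(x_{\delta(j)}, x_m)^{\ell} \bigr) \]
as upper bound on the exponentiated distance between $x_j$ and any point $x_m \in B(x_{\delta(j)}, r_{\delta(j)})$. Now, we obtain
\begin{align*}
\sum_{x_m \in X \cap B(x_j, r_j)} \bigl( r_j^{\ell} - D(x_j, x_m)^{\ell} \bigr)
&\ge \sum_{x_m \in X \cap B(x_{\delta(j)}, r_{\delta(j)})} \bigl( r_j^{\ell} - D(x_j, x_m)^{\ell} \bigr) \\
&> \sum_{x_m \in X \cap B(x_{\delta(j)}, r_{\delta(j)})} \Bigl[ 2^{\ell-1} \cdot \bigl( D(x_j, x_{\delta(j)})^{\ell} + r_{\delta(j)}^{\ell} \bigr) - 2^{\ell-1} \cdot \bigl( D(x_j, x_{\delta(j)})^{\ell} + D(x_{\delta(j)}, x_m)^{\ell} \bigr) \Bigr] \\
&= 2^{\ell-1} \cdot \sum_{x_m \in X \cap B(x_{\delta(j)}, r_{\delta(j)})} \bigl( r_{\delta(j)}^{\ell} - D(x_{\delta(j)}, x_m)^{\ell} \bigr) \\
&= 2^{\ell-1} \cdot f \ge f ,
\end{align*}
which is a contradiction because the definition of $r_j$ requires
\[ \sum_{x_m \in X \cap B(x_j, r_j)} \bigl( r_j^{\ell} - D(x_j, x_m)^{\ell} \bigr) = f . \]
It follows that Inequality (3.5) is true, which was the only thing left to prove the assertion of the lemma.

3.2 Distributed Algorithm for Metric Spaces

Our distributed algorithm consists of three parts (see Algorithm 3.2.1 for a description in pseudocode). Recall that we assume that each point knows its distance to all the other points. At the beginning of the first part, each point $x_i \in X$ creates a $(\lceil \log(n) \rceil + 1)$-bit array.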
These bits are used to decide whether a point should open a facility or not. In the following, we will call these bits phase bits. The values of these phase bits are chosen at random so that, for each $k \in \{0, 1, \ldots, \lceil \log(n) \rceil\}$, the $k$-th phase bit is 1 with probability $\min\{2^k/n, 1\}$ and 0 otherwise. Finally, every point sends its $\lceil \log(n) \rceil + 1$ phase bits to all the other points.

The second part of the algorithm is organized in $\lceil \log(n) \rceil + 1$ phases. During these phases, each point decides locally, based on the phase bits, if it should open a facility or connect itself to another open facility. This is accomplished as follows: Consider the $k$-th phase of point $x_i$. The algorithm opens a facility at this point if the $k$-th phase bit is 1 and the first $k-1$ phase bits (i.e., the phase bits with indices $0, \ldots, k-1$) of all the other points at a distance of at most $2^k \cdot f/n$ from $x_i$ are 0. Otherwise, if there exists a point $x_j$ at a distance of at most $2^k \cdot f/n$ from $x_i$ which has a 1 among the first $k-1$ phase bits, the algorithm tentatively connects $x_i$ to the point $x_j$. In the final solution, $x_i$ will be connected to the nearest open facility (which might differ from $x_j$). Note that if neither the $k$-th phase bit of $x_i$ is 1 nor there exists a point $x_j$ at a distance of at most $2^k \cdot f/n$ from $x_i$ which has a 1 among the first $k-1$ phase bits, $x_i$ does nothing in phase $k$. At the end of the last phase, every point knows whether it is an open facility or not because the $\lceil \log(n) \rceil$-th phase bit of every point is 1 with probability $\min\{2^{\lceil \log(n) \rceil}/n, 1\} = 1$. Finally, each point broadcasts whether it is an open facility or not.

In the last part of the algorithm, every point that is not an open facility sends a connection request to the nearest open facility.
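The three parts described above can be simulated centrally. The following Python sketch is only an illustration under stated assumptions: the function name `simulate_distributed`, the callable `dist`, and the random generator argument `rng` are hypothetical, the logarithm is taken to base 2, and the message exchange is replaced by shared lists. Tentative connections are not tracked because the final solution connects every remaining point to its nearest open facility anyway.

```python
import random
from typing import Callable, List, Sequence, Tuple


def simulate_distributed(points: Sequence, dist: Callable, f: float,
                         rng: random.Random) -> Tuple[List[int], List[int]]:
    """Centrally simulate the phase-bit algorithm; returns (open facilities, assignment)."""
    n = len(points)
    last = (n - 1).bit_length()  # equals ceil(log2(n)) for n >= 1; phases are 0..last
    # Part 1: phase bit k of every point is 1 with probability min(2^k / n, 1).
    phi = [[1 if rng.random() < min(2 ** k / n, 1.0) else 0 for k in range(last + 1)]
           for _ in range(n)]
    # Part 2: x_i opens in phase k if its k-th bit is 1 and every other point within
    # distance 2^k * f / n has only zeros among its phase bits 0, ..., k-1.
    is_open = [False] * n
    for i in range(n):
        for k in range(last + 1):
            reach = 2 ** k * f / n
            if phi[i][k] == 1 and all(
                    phi[j][m] == 0
                    for j in range(n)
                    if j != i and dist(points[i], points[j]) <= reach
                    for m in range(k)):
                is_open[i] = True
    # Part 3: every point that is not open connects to its nearest open facility.
    facilities = [i for i in range(n) if is_open[i]]
    assignment = [i if is_open[i]
                  else min(facilities, key=lambda j: dist(points[i], points[j]))
                  for i in range(n)]
    return facilities, assignment
```

In this sketch at least one facility is always opened: a point whose first 1-bit has the smallest index among all points satisfies the opening condition in that phase, and the last phase bit of every point is 1 by construction.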
We will show in the next section that, with high constant probability, the total opening cost for the facilities is at most a constant factor larger than the sum of the radii, and any client $x_i \in X$ has at least one open facility in the ball $B(x_i, c \, r_i)$, where $c$ is some small constant. Since the sum of the radii is a constant-factor approximation of the optimal facility location cost (see Lemma 3.1.10), this implies that, with high constant probability, our distributed algorithm computes a constant-factor approximation for the uniform metric facility location problem.

Algorithm 3.2.1 Local Algorithm for Point $x_i$
1: open[i] ← false
2: for $k \leftarrow 0$ to $\lceil \log(n) \rceil$ do
3:     $\varphi_i[k] \leftarrow 1$ with probability $\min\{2^k/n, 1\}$, and $0$ otherwise
4: send $\varphi_i$ to all $x_j \in X$, $j \ne i$
5: receive $\varphi_j$ from all $x_j \in X$, $j \ne i$
6: for $k \leftarrow 0$ to $\lceil \log(n) \rceil$ do
7:     if $\varphi_i[k] = 1$ and for each point $x_j \in B(x_i, 2^k \cdot f/n)$, $x_j \ne x_i$, we have $\varphi_j[m] = 0$ for all $m < k$ then
8:         open[i] ← true
9: send open[i] to all $x_j \in X$, $j \ne i$
10: receive open[j] from all $x_j \in X$, $j \ne i$
11: if open[i] = false then
12:     connect to the nearest open facility

3.2.1 Analysis of the Algorithm

In this section, we show that our distributed algorithm produces a solution for the uniform metric facility location problem whose cost is, with high constant probability, at most a constant factor larger than the optimal cost. To simplify the analysis, we do not use the exact value $r_i$ satisfying Equation (3.1) for $\ell = 1$ but the value $\tilde{r}_i := 2^j \cdot f/n$, where $j$ is the smallest integer that satisfies the inequality
\[ \sum_{x \in X \cap B(x_i, \tilde{r}_i)} \bigl( \tilde{r}_i - D(x_i, x) \bigr) \ge f . \]
First, we give an upper bound on the expected opening cost of any point $x_i \in X$.

Lemma 3.2.1. Let $x_i \in X$ be any point. Then, the expected opening cost of $x_i$ is $O(r_i)$.

Proof. At first, we estimate the probability that the algorithm opens a facility at $x_i$ in any phase $k \in \{0, 1, \ldots, \lceil \log(n) \rceil\}$.
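Recall that this happens if the $k$-th phase bit of $x_i$ is 1 and the first $k-1$ phase bits of all the other points at a distance of at most $2^k \cdot f/n$ are 0. Let $Y_{i,k}$ be the indicator random variable for the event that the algorithm opens a facility at $x_i$ in phase $k$. We now consider the two cases $k < j$ and $j \le k \le \lceil \log(n) \rceil$ with $j = \log(\tilde{r}_i \cdot n/f)$.

Case $j \le k \le \lceil \log(n) \rceil$: The $k$-th phase bit of $x_i$ is 1 with probability $\min\{2^k/n, 1\} \le 2^k/n$. Furthermore, for any phase $m < k$, the $m$-th phase bit of an arbitrary point in $B(x_i, 2^k \cdot f/n)$ is 0 with probability $1 - 2^m/n$. Hence, the probability that all of the first $k-1$ phase bits of this point in $B(x_i, 2^k \cdot f/n)$ are 0 is $\prod_{m=0}^{k-1} (1 - 2^m/n)$. Thus, we have
\[ \Pr[Y_{i,k} = 1] \le \frac{2^k}{n} \cdot \left[ \left( 1 - \frac{2^0}{n} \right) \cdot \left( 1 - \frac{2^1}{n} \right) \cdots \left( 1 - \frac{2^{k-1}}{n} \right) \right]^{\mathrm{weight}(B(x_i, 2^k \cdot f/n))} . \]
Observe that $\tilde{r}_i \ge r_i$. By applying Lemma 3.1.1 with $\ell = 1$, we obtain that
\[ \mathrm{weight}\Bigl( B\Bigl( x_i, 2^k \cdot \frac{f}{n} \Bigr) \Bigr) \ge \mathrm{weight}\Bigl( B\Bigl( x_i, 2^j \cdot \frac{f}{n} \Bigr) \Bigr) = \mathrm{weight}(B(x_i, \tilde{r}_i)) \ge \mathrm{weight}(B(x_i, r_i)) \ge \frac{f}{r_i} \ge \frac{f}{\tilde{r}_i} . \]
Thus, we get
\begin{align*}
\Pr[Y_{i,k} = 1]
&\le \frac{2^k}{n} \cdot \left[ \left( 1 - \frac{2^0}{n} \right) \cdot \left( 1 - \frac{2^1}{n} \right) \cdots \left( 1 - \frac{2^{k-1}}{n} \right) \right]^{f/\tilde{r}_i} \\
&= \frac{2^k}{n} \cdot \left[ \left( 1 - \frac{2^0}{n} \right) \cdot \left( 1 - \frac{2^1}{n} \right) \cdots \left( 1 - \frac{2^{k-1}}{n} \right) \right]^{n/2^j} \\
&\le \frac{2^k}{n} \cdot \left( 1 - \frac{2^{k-1}}{n} \right)^{n/2^j} .
\end{align*}
Now, let $m$ denote the non-negative integer $k - j$. Then, we obtain
\begin{align*}
\Pr[Y_{i,k} = 1]
&\le \frac{2^{j+m}}{n} \cdot \left( 1 - \frac{2^{j+m-1}}{n} \right)^{n/2^j}
= \frac{2^{j+m}}{n} \cdot \left( 1 - \frac{2^{j+m-1}}{n} \right)^{\frac{n}{2^{j+m-1}} \cdot 2^{m-1}} \\
&\le \frac{2^{j+m}}{n} \cdot \left( \frac{1}{e} \right)^{2^{m-1}} ,
\end{align*}
where the last inequality is due to a bound on Euler's number (see Inequality (B.2)).

Case $k < j$: An upper bound on the probability that the algorithm opens a facility at $x_i$ in a phase $k < j$ is $2^k/n$. Hence, we have
\[ \Pr[Y_{i,k} = 1] \le \frac{2^k}{n} . \]
Let $Y_i$ be the indicator random variable for the event that the algorithm opens a facility at $x_i$.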
Then, the expected opening cost of point $x_i$ is upper bounded by
\begin{align*}
f \cdot \mathrm{E}[Y_i]
&= f \cdot \mathrm{E}\Biggl[ \sum_{k=0}^{\lceil \log(n) \rceil} Y_{i,k} \Biggr]
= f \cdot \sum_{k=0}^{\lceil \log(n) \rceil} \mathrm{E}[Y_{i,k}]
= f \cdot \sum_{k=0}^{\lceil \log(n) \rceil} \Pr[Y_{i,k} = 1] \\
&\le f \cdot \sum_{k=0}^{j-1} \frac{2^k}{n} + f \cdot \sum_{m=0}^{\lceil \log(n) \rceil - j} \frac{2^{j+m}}{n} \cdot \left( \frac{1}{e} \right)^{2^{m-1}} \\
&= f \cdot \frac{2^j - 1}{n} + f \cdot \frac{2^{j+1}}{n} \cdot \sum_{m=0}^{\lceil \log(n) \rceil - j} \left( \frac{1}{e} \right)^{2^{m-1}} \cdot 2^{m-1} \\
&\le f \cdot \frac{2^j - 1}{n} + f \cdot \frac{2^{j+1}}{n} \cdot \sum_{m=0}^{\lceil \log(n) \rceil - j} 2^{-m+1}
\in O(\tilde{r}_i) ,
\end{align*}
where the last inequality follows from the easily provable fact that $\left( \frac{1}{e} \right)^{2^{m-1}} \cdot 2^{m-1} \le 2^{-m+1}$ for all $m \ge 0$. Finally, due to the definition of $\tilde{r}_i$, the expected opening cost of the point $x_i$ is $O(r_i)$.

The proof of our upper bound on the final connection cost of any point $x_i \in X$ utilizes the following lemma:

Lemma 3.2.2. Let $x_i \in X$ be any point that has been chosen as open facility or that has tentatively been connected in any phase $k \in \{0, \ldots, \lceil \log(n) \rceil\}$. Then, the distance of $x_i$ to the nearest open facility is at most $2^{k+1} \cdot f/n$.

Proof. Obviously, if we open a facility at $x_i$ in phase $k$, then the distance of $x_i$ to the nearest open facility is $0 \le 2^{k+1} \cdot f/n$. Next, we consider the case that $x_i$ has been connected tentatively. Note that, due to our construction, a point cannot be connected tentatively in phase 0. Thus, in the following, we will assume that $x_i$ has tentatively been connected to a point $x_j \in X$ in a phase $k \in \{1, \ldots, \lceil \log(n) \rceil\}$. It follows that the distance from $x_i$ to $x_j$ is at most $2^k \cdot f/n$. Let $m$ denote the smallest index of a phase bit of $x_j$ whose value is 1. Since $x_i$ has tentatively been connected to $x_j$ in phase $k$, we have $m \le k - 1$. Now, either $x_j$ is open or tentatively connected. If it is open, then the distance from $x_i$ to the nearest open facility is at most $2^k \cdot f/n$ and we are done. Otherwise, $x_j$ has tentatively been connected to another point within a distance of at most $2^m \cdot f/n \le 2^{k-1} \cdot f/n$.
Recursively applying this argument (see also Figure 3.1) yields that there must be an open facility within a distance of at most
\[
2^k \cdot \frac{f}{n} + \sum_{m=0}^{k-1} 2^m \cdot \frac{f}{n} \le 2^{k+1} \cdot \frac{f}{n}
\]
from $x_i$.

Figure 3.1: Connecting $x_i$ to an open facility over a chain of tentatively connected points.

Lemma 3.2.3. Let $x_i \in X$ be any point. Then, the expected final connection cost of $x_i$ is $O(r_i)$.

Proof. Due to our construction, a point cannot be connected tentatively in phase 0. Thus, in phase 0, the algorithm either opens a facility at $x_i$ or does nothing with $x_i$. If it opens a facility at $x_i$, then the connection cost of $x_i$ is obviously 0. Due to Lemma 3.2.2, if the point $x_i$ has been chosen as an open facility or has tentatively been connected in a phase $k \in \{1, \ldots, \lceil \log(n) \rceil\}$, then its final connection cost is at most $2^{k+1} \cdot f/n$. Now, for each $k \in \{0, \ldots, \lceil \log(n) \rceil\}$, let $Z_{i,k}$ be the indicator random variable for the event that the algorithm has not opened a facility at $x_i$ and has not tentatively connected $x_i$ up to and including phase $k$. Then, we can upper bound the expected final connection cost of $x_i$ by
\[
\begin{aligned}
&\sum_{k=1}^{\lceil \log(n) \rceil} 2^{k+1} \cdot \frac{f}{n} \cdot \Pr[x_i \text{ is opened or tentatively connected in phase } k] \\
&\quad = \sum_{k=1}^{\lceil \log(n) \rceil} 2^{k+1} \cdot \frac{f}{n} \cdot \left( \Pr[Z_{i,k-1} = 1] - \Pr[Z_{i,k} = 1] \right) \\
&\quad = 2 \cdot \frac{f}{n} \cdot \Pr[Z_{i,0} = 1] - 2^{\lceil \log(n) \rceil + 1} \cdot \frac{f}{n} \cdot \Pr[Z_{i,\lceil \log(n) \rceil} = 1] + \sum_{k=0}^{\lceil \log(n) \rceil - 1} 2^{k+1} \cdot \frac{f}{n} \cdot \Pr[Z_{i,k} = 1] \\
&\quad = 2 \cdot \frac{f}{n} \cdot \Pr[Z_{i,0} = 1] + \sum_{k=0}^{\lceil \log(n) \rceil - 1} 2^{k+1} \cdot \frac{f}{n} \cdot \Pr[Z_{i,k} = 1] ,
\end{aligned}
\]
where the last equality follows from $\Pr[Z_{i,\lceil \log(n) \rceil} = 1] = 0$. Thus, to upper bound the expected final connection cost of $x_i$, we have to upper bound the probabilities $\Pr[Z_{i,k} = 1]$. We consider the two cases $k < j$ and $j \le k < \lceil \log(n) \rceil$ with $j = \log(\tilde{r}_i \cdot n/f)$.

Case $j \le k < \lceil \log(n) \rceil$: Observe that $Z_{i,k} = 1$ if the first $k$ phase bits of $x_i$ are 0, and the first $k-1$ phase bits of all the other points at a distance of at most $2^k \cdot f/n$ are also 0.
For any phase $m \le k$, the $m$-th phase bit of $x_i$ is 0 with probability $1 - 2^m/n$. Hence, the probability that all of the first $k$ phase bits of $x_i$ are 0 is $\prod_{m=0}^{k} (1 - 2^m/n)$. Similarly, the probability that all of the first $k-1$ phase bits of any point in $B(x_i, 2^k \cdot f/n)$ are 0 is $\prod_{m=0}^{k-1} (1 - 2^m/n)$. As proven in Lemma 3.2.1, the number of points in $B(x_i, 2^k \cdot f/n)$ is lower bounded by
\[
\mathrm{weight}\left(B\left(x_i, 2^k \cdot \tfrac{f}{n}\right)\right) \ge \frac{f}{\tilde{r}_i} .
\]
It follows that
\[
\begin{aligned}
\Pr[Z_{i,k} = 1] &\le \left[ \left(1 - \frac{2^0}{n}\right) \cdots \left(1 - \frac{2^k}{n}\right) \right] \cdot \left[ \left(1 - \frac{2^0}{n}\right) \cdots \left(1 - \frac{2^{k-1}}{n}\right) \right]^{f/\tilde{r}_i} \\
&\le \left(1 - \frac{2^k}{n}\right) \cdot \left(1 - \frac{2^{k-1}}{n}\right)^{n/2^j} .
\end{aligned}
\]
Let $m$ denote the non-negative integer $k - j$. Then, we have
\[
\Pr[Z_{i,k} = 1] \le \left(1 - \frac{2^{j+m}}{n}\right) \cdot \left(1 - \frac{2^{j+m-1}}{n}\right)^{\frac{n}{2^{j+m-1}} \cdot 2^{m-1}} \le \left(1 - \frac{2^{j+m}}{n}\right) \cdot \left(\frac{1}{e}\right)^{2^{m-1}} \le \left(\frac{1}{e}\right)^{2^{m-1}} ,
\]
where the second inequality is due to a bound on Euler's number (see Inequality (B.2)).

Case $k < j$: The probability that the algorithm neither opens a facility at $x_i$ nor tentatively connects $x_i$ up to and including phase $k$ is trivially at most 1. Hence, we get
\[
\Pr[Z_{i,k} = 1] \le 1 .
\]
Based on the above two cases, we can upper bound the expected final connection cost of $x_i$ by
\[
\begin{aligned}
&2 \cdot \frac{f}{n} \cdot \Pr[Z_{i,0} = 1] + \sum_{k=0}^{\lceil \log(n) \rceil - 1} 2^{k+1} \cdot \frac{f}{n} \cdot \Pr[Z_{i,k} = 1] \\
&\quad \le 2 \cdot \frac{f}{n} \cdot 1 + \sum_{k=0}^{j-1} 2^{k+1} \cdot \frac{f}{n} \cdot 1 + \sum_{m=0}^{\lceil \log(n) \rceil - j - 1} 2^{j+m+1} \cdot \frac{f}{n} \cdot \left(\frac{1}{e}\right)^{2^{m-1}} \\
&\quad \le 2 \cdot \frac{f}{n} + 2^{j+1} \cdot \frac{f}{n} + 2^{j+2} \cdot \frac{f}{n} \cdot \sum_{m=0}^{\lceil \log(n) \rceil - j - 1} 2^{m-1} \cdot \left(\frac{1}{e}\right)^{2^{m-1}} \\
&\quad \le 2 \cdot \frac{f}{n} + 2^{j+1} \cdot \frac{f}{n} + 2^{j+2} \cdot \frac{f}{n} \cdot \sum_{m=0}^{\lceil \log(n) \rceil - j - 1} 2^{-m+1} \in O(\tilde{r}_i) ,
\end{aligned}
\]
where, as in the proof of Lemma 3.2.1, the last inequality follows from the easily provable fact that $2^{m-1} \cdot (1/e)^{2^{m-1}} \le 2^{-m+1}$ for all $m \ge 0$. Finally, due to the definition of $\tilde{r}_i$, the expected final connection cost of the point $x_i$ is $O(r_i)$.

Now, we can prove that our distributed algorithm for the uniform metric facility location problem produces a solution whose total cost is, with high constant probability, at most a constant factor larger than the optimal cost.

Lemma 3.2.4.
The facility location cost for $X$ is $O(\mathrm{FacLoc}^*(X, f))$ with high constant probability.

Proof. Due to Lemmas 3.2.1 and 3.2.3, the expected opening cost as well as the expected final connection cost of any point $x_i \in X$ is $O(r_i)$. Thus, the algorithm computes a set of open facilities $F$ that leads to an expected total cost of $\sum_{x_i \in X} O(r_i)$. By applying Lemma 3.1.10 with $\ell = 1$, we have that the expected value of $\mathrm{FacLoc}(X, F, f)$ is $O(\mathrm{FacLoc}^*(X, f))$. Now, the assertion of the lemma follows by applying Markov's inequality.

We summarize our results in the following theorem:

Theorem 2. Given any $n$-point metric space $(X, D)$, there is a randomized distributed algorithm working in the synchronous message passing model that computes with high constant probability a constant-factor approximation of the uniform metric facility location problem for $X$. The algorithm uses three rounds of all-to-all communication where the message sizes are bounded by $O(\log(n))$ bits.

3.3 Distributed Algorithm for Powers of Metric Spaces

In this section, we extend the distributed algorithm given in Section 3.2 to the uniform facility location problem for powers of metric spaces. Let $\ell \in \mathbb{R}$ with $\ell \ge 1$ be the (constant) metric exponent. Then, we only have to make the following three adaptations to Algorithm 3.2.1:

1. The total number of phases is $\lceil \log(n)/\ell \rceil + 1$.
2. The $k$-th phase bit is set to 1 with probability $\min\{2^{k\ell}/n, 1\}$ and 0 otherwise.
3. In the $k$-th phase, we check the first $k-1$ phase bits of all the points within a distance of at most $2^k \cdot (f/n)^{1/\ell}$ from $x_i$.

The rest of the local algorithm for $x_i$ remains unchanged. A complete pseudocode listing of the adapted algorithm is given by Algorithm 3.3.1.
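For experimentation, the adapted local rule can also be simulated centrally. The following Python sketch is only an illustration under stated assumptions and not the distributed implementation itself: it draws the phase bits of all points in a single process, replaces the message rounds by direct array access, and takes the metric as a distance matrix. The names `simulate`, `dist`, and `rng` are hypothetical, and logarithms are assumed to be to base 2.

```python
import math
import random


def simulate(dist, f, ell=1.0, rng=None):
    """Centralized simulation of the local rule, run for every point.

    dist: symmetric n x n matrix of distances D(x_i, x_j)
    f:    uniform opening cost
    ell:  metric exponent (ell >= 1)
    Returns (is_open, assign), where assign[i] is the facility serving i.
    """
    rng = rng or random.Random(0)
    n = len(dist)
    phases = math.ceil(math.log2(n) / ell) + 1  # phases 0 .. ceil(log(n)/ell)

    # Every point draws its phase bits independently.
    phi = [[1 if rng.random() < min(2 ** (k * ell) / n, 1.0) else 0
            for k in range(phases)] for _ in range(n)]

    base = (f / n) ** (1.0 / ell)
    is_open = [False] * n
    for i in range(n):
        for k in range(phases):
            if phi[i][k] != 1:
                continue
            radius = 2 ** k * base
            # All other points in the ball must have only 0-bits in phases m < k.
            if all(not any(phi[j][:k])
                   for j in range(n) if j != i and dist[i][j] <= radius):
                is_open[i] = True
                break

    # Points that are not open connect to their nearest open facility.
    # When the last phase bit is 1 for every point (probability 1 in the
    # last phase), a point with the earliest 1-bit always opens.
    open_ids = [j for j in range(n) if is_open[j]]
    assign = [i if is_open[i] else min(open_ids, key=lambda j: dist[i][j])
              for i in range(n)]
    return is_open, assign
```

Calling `simulate(dist, f=4.0, ell=1.0)` on a small distance matrix returns one flag per point and the index of the facility each point is connected to; different seeds passed through `rng` give different random outcomes.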
Algorithm 3.3.1 Local Algorithm for Point $x_i$

    $\mathit{open}[i] \leftarrow$ false
    for $k \leftarrow 0$ to $\lceil \log(n)/\ell \rceil$ do
        $\varphi_i[k] \leftarrow 1$ with probability $\min\{2^{k\ell}/n, 1\}$, and $0$ otherwise
    send $\varphi_i$ to all $x_j \in X$, $j \ne i$
    receive $\varphi_j$ from all $x_j \in X$, $j \ne i$
    for $k \leftarrow 0$ to $\lceil \log(n)/\ell \rceil$ do
        if $\varphi_i[k] = 1$ and for each point $x_j \in B(x_i, 2^k \cdot (f/n)^{1/\ell})$, $x_j \ne x_i$, we have $\varphi_j[m] = 0$ for all $m < k$ then
            $\mathit{open}[i] \leftarrow$ true
    send $\mathit{open}[i]$ to all $x_j \in X$, $j \ne i$
    receive $\mathit{open}[j]$ from all $x_j \in X$, $j \ne i$
    if $\mathit{open}[i] =$ false then
        connect to the nearest open facility

3.3.1 Analysis of the Algorithm

Let $\tilde{r}_i := 2^j \cdot (f/n)^{1/\ell}$, where $j$ is the smallest integer that satisfies the inequality
\[
\sum_{x \in X \cap B(x_i, \tilde{r}_i)} \left( \tilde{r}_i^{\,\ell} - D(x_i, x)^{\ell} \right) \ge f ,
\]
be an approximation of the radius $r_i$ defined by Equation (3.1). Then, using this approximation $\tilde{r}_i$ and bearing the three adaptations mentioned above in mind, the analysis of our distributed algorithm given in Section 3.2.1 can be easily transferred to the uniform facility location problem for powers of metric spaces. We obtain the following lemmas:

Lemma 3.3.1. Let $x_i \in X$ be any point. Then, the expected opening cost of $x_i$ is $O(4^{\ell} \cdot r_i^{\ell})$.

Proof. We prove this lemma by using Lemma 3.1.10 and reusing the techniques given in the proof of Lemma 3.2.1. First, we compute an upper bound on the probability that the algorithm opens a facility at $x_i$ in any phase $k \in \{0, 1, \ldots, \lceil \log(n)/\ell \rceil\}$. Recall that this happens if the $k$-th phase bit of $x_i$ is 1 and the first $k-1$ bits of all the other points at a distance of at most $2^k \cdot (f/n)^{1/\ell}$ are 0. Let $Y_{i,k}$ be the indicator random variable for the event that the algorithm opens a facility at $x_i$ in phase $k$. We examine the two cases $k < j$ and $j \le k \le \lceil \log(n)/\ell \rceil$ with $j = \log(\tilde{r}_i \cdot (n/f)^{1/\ell})$.

Case $j \le k \le \lceil \log(n)/\ell \rceil$: The $k$-th phase bit of $x_i$ is 1 with probability $\min\{2^{k\ell}/n, 1\} \le 2^{k\ell}/n$.
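Furthermore, for any phase $m < k$, the $m$-th phase bit of an arbitrary point located in $B(x_i, 2^k \cdot (f/n)^{1/\ell})$ is 0 with probability $1 - 2^{m\ell}/n$. Thus, the probability that all of the first $k-1$ phase bits of this point are 0 is $\prod_{m=0}^{k-1} (1 - 2^{m\ell}/n)$. Hence, we get
\[
\Pr[Y_{i,k} = 1] \le \frac{2^{k\ell}}{n} \cdot \left[ \left(1 - \frac{2^{0 \cdot \ell}}{n}\right) \cdot \left(1 - \frac{2^{1 \cdot \ell}}{n}\right) \cdots \left(1 - \frac{2^{(k-1)\ell}}{n}\right) \right]^{\mathrm{weight}(B(x_i, 2^k \cdot (f/n)^{1/\ell}))} .
\]
Due to Lemma 3.1.1 and $\tilde{r}_i \ge r_i$, we have
\[
\mathrm{weight}\left(B\left(x_i, 2^k \cdot \left(\tfrac{f}{n}\right)^{1/\ell}\right)\right) \ge \mathrm{weight}\left(B\left(x_i, 2^j \cdot \left(\tfrac{f}{n}\right)^{1/\ell}\right)\right) \ge \mathrm{weight}(B(x_i, \tilde{r}_i)) \ge \mathrm{weight}(B(x_i, r_i)) \ge \frac{f}{r_i^{\ell}} \ge \frac{f}{\tilde{r}_i^{\,\ell}} .
\]
It follows that
\[
\begin{aligned}
\Pr[Y_{i,k} = 1] &\le \frac{2^{k\ell}}{n} \cdot \left[ \left(1 - \frac{2^{0 \cdot \ell}}{n}\right) \cdot \left(1 - \frac{2^{1 \cdot \ell}}{n}\right) \cdots \left(1 - \frac{2^{(k-1)\ell}}{n}\right) \right]^{f/\tilde{r}_i^{\,\ell}} \\
&= \frac{2^{k\ell}}{n} \cdot \left[ \left(1 - \frac{2^{0 \cdot \ell}}{n}\right) \cdot \left(1 - \frac{2^{1 \cdot \ell}}{n}\right) \cdots \left(1 - \frac{2^{(k-1)\ell}}{n}\right) \right]^{n/2^{j\ell}} \\
&\le \frac{2^{k\ell}}{n} \cdot \left(1 - \frac{2^{(k-1)\ell}}{n}\right)^{n/2^{j\ell}} .
\end{aligned}
\]
Now, let $m$ be the non-negative integer $k - j$. Then, we obtain
\[
\Pr[Y_{i,k} = 1] \le \frac{2^{(j+m)\ell}}{n} \cdot \left(1 - \frac{2^{(j+m-1)\ell}}{n}\right)^{n/2^{j\ell}} = \frac{2^{(j+m)\ell}}{n} \cdot \left(1 - \frac{2^{(j+m-1)\ell}}{n}\right)^{\frac{n}{2^{(j+m-1)\ell}} \cdot 2^{(m-1)\ell}} \le \frac{2^{(j+m)\ell}}{n} \cdot \left(\frac{1}{e}\right)^{2^{(m-1)\ell}} ,
\]
where the last inequality is due to a bound on Euler's number (see Inequality (B.2)).

Case $k < j$: Obviously, an upper bound on the probability that the algorithm opens a facility at $x_i$ in a phase $k < j$ is $2^{k\ell}/n$. It follows that
\[
\Pr[Y_{i,k} = 1] \le \frac{2^{k\ell}}{n} .
\]
Let $Y_i$ be the indicator random variable for the event that the algorithm opens a facility at $x_i$.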
Then, the expected opening cost of the point $x_i$ is upper bounded by
\[
f \cdot \mathrm{E}[Y_i] = f \cdot \mathrm{E}\left[ \sum_{k=0}^{\lceil \log(n)/\ell \rceil} Y_{i,k} \right] = f \cdot \sum_{k=0}^{\lceil \log(n)/\ell \rceil} \mathrm{E}[Y_{i,k}] = f \cdot \sum_{k=0}^{\lceil \log(n)/\ell \rceil} \Pr[Y_{i,k} = 1] .
\]
Based on the above two cases, we obtain
\[
\begin{aligned}
f \cdot \mathrm{E}[Y_i] &\le f \cdot \sum_{k=0}^{j-1} \frac{2^{k\ell}}{n} + f \cdot \sum_{m=0}^{\lceil \log(n)/\ell \rceil - j} \frac{2^{(j+m)\ell}}{n} \cdot \left(\frac{1}{e}\right)^{2^{(m-1)\ell}} \\
&= \frac{f}{n} \cdot \frac{2^{j\ell} - 1}{2^{\ell} - 1} + \frac{f}{n} \cdot 2^{(j+1)\ell} \cdot \sum_{m=0}^{\lceil \log(n)/\ell \rceil - j} \left(\frac{1}{e}\right)^{2^{(m-1)\ell}} \cdot 2^{(m-1)\ell} \\
&\le \frac{f}{n} \cdot 2^{j\ell} + \frac{f}{n} \cdot 2^{(j+1)\ell} \cdot \sum_{m=0}^{\lceil \log(n)/\ell \rceil - j} 2^{-m+1} \in O(2^{\ell} \cdot \tilde{r}_i^{\,\ell}) ,
\end{aligned}
\]
where the last inequality follows from the easily provable fact that $(1/e)^{2^{(m-1)\ell}} \cdot 2^{(m-1)\ell} \le 2^{-m+1}$ for all $m \ge 0$ and any $\ell \ge 1$. Finally, due to the definition of $\tilde{r}_i$, the expected opening cost of the point $x_i$ is $O(4^{\ell} \cdot r_i^{\ell})$.

The proof of our upper bound on the final connection cost of any point $x_i \in X$ utilizes the following lemma:

Lemma 3.3.2. Let $x_i \in X$ be any point that has been chosen as open facility or that has tentatively been connected in any phase $k \in \{0, \ldots, \lceil \log(n)/\ell \rceil\}$. Then, the distance of $x_i$ to the nearest open facility is at most $2^{k+1} \cdot (f/n)^{1/\ell}$.

Proof. To prove this lemma, we use the same approach as in the proof of Lemma 3.2.2. If the algorithm opens a facility at $x_i$ in phase $k$, the distance of $x_i$ to the nearest open facility is $0 \le 2^{k+1} \cdot (f/n)^{1/\ell}$. Next, we consider the case that $x_i$ has been connected tentatively. The algorithm does not tentatively connect any point in phase 0. Hence, in the following, we will assume that $x_i$ has tentatively been connected to a point $x_j \in X$ in a phase $k \in \{1, \ldots, \lceil \log(n)/\ell \rceil\}$. Then, the distance from $x_i$ to $x_j$ is at most $2^k \cdot (f/n)^{1/\ell}$. Let $m$ denote the smallest index of a phase bit of $x_j$ whose value is 1. Since $x_i$ has tentatively been connected to $x_j$ in phase $k$, we have $m \le k - 1$. Now, we have to consider the two cases that either $x_j$ is open or $x_j$ has been connected tentatively as well.
Obviously, if $x_j$ is an open facility, then the distance from $x_i$ to the nearest open facility is at most $2^k \cdot (f/n)^{1/\ell}$, so we are done. Otherwise, $x_j$ has tentatively been connected to another point within a distance of at most $2^m \cdot (f/n)^{1/\ell} \le 2^{k-1} \cdot (f/n)^{1/\ell}$. By recursively applying this argument, we obtain that there must be an open facility within a distance of at most
\[
2^k \cdot \left(\frac{f}{n}\right)^{1/\ell} + \sum_{m=0}^{k-1} 2^m \cdot \left(\frac{f}{n}\right)^{1/\ell} \le 2^{k+1} \cdot \left(\frac{f}{n}\right)^{1/\ell}
\]
from $x_i$.

Lemma 3.3.3. Let $x_i \in X$ be any point. Then, the expected final connection cost of $x_i$ is $O(16^{\ell} \cdot r_i^{\ell})$.

Proof. We prove this lemma by using Lemma 3.3.2 and reusing the techniques given in the proof of Lemma 3.2.3. The algorithm does not tentatively connect any point in phase 0. Hence, in phase 0, it either opens a facility at $x_i$ or does nothing with $x_i$, which obviously results in 0 connection cost for $x_i$ in phase 0. Due to Lemma 3.3.2, if the point $x_i$ has been chosen as an open facility or has tentatively been connected in any other phase $k \in \{1, \ldots, \lceil \log(n)/\ell \rceil\}$, then its final connection cost is at most $2^{(k+1)\ell} \cdot f/n$. Now, for each $k \in \{0, \ldots, \lceil \log(n)/\ell \rceil\}$, let $Z_{i,k}$ be the indicator random variable for the event that the algorithm has not opened a facility at $x_i$ and has not tentatively connected $x_i$ up to and including phase $k$. Then, we can upper bound the expected final connection cost of $x_i$ by
\[
\begin{aligned}
&\sum_{k=1}^{\lceil \log(n)/\ell \rceil} 2^{(k+1)\ell} \cdot \frac{f}{n} \cdot \Pr[x_i \text{ is opened or tentatively connected in phase } k] \\
&\quad = \sum_{k=1}^{\lceil \log(n)/\ell \rceil} 2^{(k+1)\ell} \cdot \frac{f}{n} \cdot \left( \Pr[Z_{i,k-1} = 1] - \Pr[Z_{i,k} = 1] \right) \\
&\quad \le \sum_{k=0}^{\lceil \log(n)/\ell \rceil - 1} 2^{(k+2)\ell} \cdot \frac{f}{n} \cdot \Pr[Z_{i,k} = 1] .
\end{aligned}
\]
In order to upper bound the expected final connection cost of $x_i$, we upper bound the probabilities $\Pr[Z_{i,k} = 1]$. To this end, we examine the two cases $k < j$ and $j \le k < \lceil \log(n)/\ell \rceil$ with $j = \log(\tilde{r}_i \cdot (n/f)^{1/\ell})$.
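Case $j \le k < \lceil \log(n)/\ell \rceil$: Observe that we have $Z_{i,k} = 1$ only in the case that the first $k$ phase bits of $x_i$ are 0 and the first $k-1$ phase bits of all the other points at a distance of at most $2^k \cdot (f/n)^{1/\ell}$ are 0 as well. For any phase $m \le k$, the $m$-th phase bit of $x_i$ is 0 with probability $1 - 2^{m\ell}/n$. Thus, the probability that all of the first $k$ phase bits of $x_i$ are 0 is $\prod_{m=0}^{k} (1 - 2^{m\ell}/n)$. Similarly, the probability that all of the first $k-1$ phase bits of any point in $B(x_i, 2^k \cdot (f/n)^{1/\ell})$ are 0 is $\prod_{m=0}^{k-1} (1 - 2^{m\ell}/n)$. As proven in Lemma 3.3.1, the number of points in $B(x_i, 2^k \cdot (f/n)^{1/\ell})$ is lower bounded by
\[
\mathrm{weight}\left(B\left(x_i, 2^k \cdot \left(\tfrac{f}{n}\right)^{1/\ell}\right)\right) \ge \frac{f}{\tilde{r}_i^{\,\ell}} .
\]
Hence, we have
\[
\begin{aligned}
\Pr[Z_{i,k} = 1] &\le \left[ \left(1 - \frac{2^{0 \cdot \ell}}{n}\right) \cdots \left(1 - \frac{2^{k\ell}}{n}\right) \right] \cdot \left[ \left(1 - \frac{2^{0 \cdot \ell}}{n}\right) \cdots \left(1 - \frac{2^{(k-1)\ell}}{n}\right) \right]^{f/\tilde{r}_i^{\,\ell}} \\
&\le \left(1 - \frac{2^{k\ell}}{n}\right) \cdot \left(1 - \frac{2^{(k-1)\ell}}{n}\right)^{n/2^{j\ell}} .
\end{aligned}
\]
Let $m$ denote the non-negative integer $k - j$. Then, we get
\[
\Pr[Z_{i,k} = 1] \le \left(1 - \frac{2^{(j+m)\ell}}{n}\right) \cdot \left(1 - \frac{2^{(j+m-1)\ell}}{n}\right)^{\frac{n}{2^{(j+m-1)\ell}} \cdot 2^{(m-1)\ell}} \le \left(1 - \frac{2^{(j+m)\ell}}{n}\right) \cdot \left(\frac{1}{e}\right)^{2^{(m-1)\ell}} \le \left(\frac{1}{e}\right)^{2^{(m-1)\ell}} ,
\]
where the second inequality is due to a bound on Euler's number (see Inequality (B.2)).

Case $k < j$: The probability that the algorithm neither opens a facility at $x_i$ nor tentatively connects $x_i$ up to and including phase $k$ is trivially at most 1, so we have
\[
\Pr[Z_{i,k} = 1] \le 1 .
\]
Now, we can upper bound the expected final connection cost of $x_i$ by
\[
\begin{aligned}
&\sum_{k=0}^{\lceil \log(n)/\ell \rceil - 1} 2^{(k+2)\ell} \cdot \frac{f}{n} \cdot \Pr[Z_{i,k} = 1] \\
&\quad \le \sum_{k=0}^{j-1} 2^{(k+2)\ell} \cdot \frac{f}{n} \cdot 1 + \sum_{m=0}^{\lceil \log(n)/\ell \rceil - j - 1} 2^{(j+m+2)\ell} \cdot \frac{f}{n} \cdot \left(\frac{1}{e}\right)^{2^{(m-1)\ell}} \\
&\quad = \frac{2^{j\ell} - 1}{2^{\ell} - 1} \cdot 2^{2\ell} \cdot \frac{f}{n} + 2^{(j+3)\ell} \cdot \frac{f}{n} \cdot \sum_{m=0}^{\lceil \log(n)/\ell \rceil - j - 1} 2^{(m-1)\ell} \cdot \left(\frac{1}{e}\right)^{2^{(m-1)\ell}} \\
&\quad \le 2^{(j+2)\ell} \cdot \frac{f}{n} + 2^{(j+3)\ell} \cdot \frac{f}{n} \cdot \sum_{m=0}^{\lceil \log(n)/\ell \rceil - j - 1} 2^{-m+1} \in O(8^{\ell} \cdot \tilde{r}_i^{\,\ell}) ,
\end{aligned}
\]
where, as in the proof of Lemma 3.3.1, the last inequality follows from the easily provable fact that $2^{(m-1)\ell} \cdot (1/e)^{2^{(m-1)\ell}} \le 2^{-m+1}$ for all $m \ge 0$ and any $\ell \ge 1$. Finally, due to the definition of $\tilde{r}_i$, the expected final connection cost of $x_i$ is $O(16^{\ell} \cdot r_i^{\ell})$.

Lemma 3.3.4.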
The facility location cost for $X$ is $O(\mathrm{FacLoc}^*(X, f, \ell))$ with high constant probability.

Proof. It follows from Lemmas 3.3.1 and 3.3.3 and $\ell$ being a constant metric exponent that the expected opening cost as well as the expected final connection cost of any point $x_i \in X$ is $O(r_i^{\ell})$. Hence, the algorithm computes a set of open facilities $F$ that leads to an expected total cost of $\sum_{x_i \in X} O(r_i^{\ell})$. Due to Lemma 3.1.10, we obtain that the expected value of $\mathrm{FacLoc}(X, F, f, \ell)$ is $O(\mathrm{FacLoc}^*(X, f, \ell))$. Finally, the assertion of the lemma follows by applying Markov's inequality.

We summarize our results in the following theorem:

Theorem 3. Given any $n$-point metric space $(X, D)$ and a constant metric exponent $\ell \ge 1$, there is a randomized distributed algorithm working in the synchronous message passing model that computes for $X$ with high constant probability a constant-factor approximation of the uniform facility location problem for powers of metric spaces. The algorithm uses three rounds of all-to-all communication where the message sizes are bounded by $O(\log(n))$ bits.

4 A Kinetic Data Structure for Facility Location

In this chapter, we investigate a facility location problem under motion. The input is a set of continuously moving objects. Each object moves along a known trajectory and can change its status between open facility and client at any time. The goal is to maintain a subset of the given objects as open facilities such that, at any time, the current facility location cost induced by the chosen open facilities is as close to the current optimal cost as possible, while a side condition is also satisfied. Observe that minimizing the mobile facility location cost at any time, without considering any side condition, can result in many status changes of the objects. Depending on the tasks of an open facility, such a status change can be expensive.
Hence, the side condition we consider is to change the status of an object only rarely, so that the total number of status changes stays below some appropriate threshold.

Since the kinetic data structure (KDS) framework is well-suited to maintain a combinatorial structure of continuously moving objects and common in the field of computational geometry [2, 15, 54], we developed a KDS for the facility location problem described above. Our KDS applies a counting argument of Bădoiu et al. [14] to kinetize a modified version of the Mettu-Plaxton algorithm. The counting argument asserts that the radius of a facility can be approximated well by just counting the number of points in exponentially growing balls centered at this particular facility.

Note that we cannot apply the original Mettu-Plaxton algorithm to obtain a responsive KDS, i.e., a KDS with polylogarithmic update time. The reason is that, similar to maintaining an exact solution for the mobile facility location problem, maintaining the solution provided by Algorithm 2.3.1 is not stable. That is, a slight perturbation of the input might result in a number of status changes that is linear in the number of input points, whereas we are looking for stable solutions, where only a polylogarithmic number of changes occur upon an event.

In Section 4.1, we present the essential ideas and some notations used throughout this chapter. A detailed description of the KDS can be found in Section 4.2. We analyze our KDS in Section 4.3. First, we prove that, at any time, our current set of open facilities is guaranteed to lead to a total cost which is at most a constant factor larger than the current optimal cost. Afterwards, we analyze our KDS in terms of its complexity.

4.1 The Special Radii

The input of the considered mobile facility location problem is a set $P = \{p_1, p_2, \ldots, p_n\}$ of $n$ independently moving points in $\mathbb{R}^d$, where $d$ is a constant.
For any point $p_i \in P$, we denote its opening cost by $f_i$ and its demand by $d_i$. Furthermore, let $p_i(t)$ denote the position of $p_i$ at the point of time $t$, and let $P(t) := \{p_1(t), p_2(t), \ldots, p_n(t)\}$. Then, the mobile facility location problem is to maintain, at each point of time $t$, a set of open facilities $F(t)$ such that $\mathrm{FacLoc}(P(t), F(t))$ is minimized (see Section 2.2 for a definition of $\mathrm{FacLoc}(P(t), F(t))$). We let $F^*(t)$ denote an optimal set of open facilities at the point of time $t$.

To approach the mobile facility location problem, we kinetize a modified version of the Mettu-Plaxton algorithm. One essential modification affects the radius associated with a point. According to Equation (2.1), we let $r_i(t)$ be the radius of a point $p_i \in P$ at the point of time $t$. More precisely, $r_i(t)$ is the radius of the ball with center $p_i(t)$ that satisfies
\[
\sum_{p_j(t) \in P(t) \cap B(p_i(t), r_i(t))} d_j \cdot \left( r_i(t) - D(p_i(t), p_j(t)) \right) = f_i . \tag{4.1}
\]
Let $r_{\min}$ denote the lower limit of the range of $r_i(t)$, and let $r_{\max}$ denote the upper limit. Then, as observed in Section 2.3, we have
\[
r_{\min} = \frac{\min_{p_j \in P} f_j}{n \cdot \max_{p_j \in P} d_j} \quad \text{and} \quad r_{\max} = \frac{\max_{p_j \in P} f_j}{\min_{p_j \in P} d_j} . \tag{4.2}
\]
Based on this definition of a radius, we introduce a new radius associated with a point. This new radius is much easier to maintain than the original radius when the points move. Compared to the original radii, the new radii of the points depend on cubes instead of balls. The key idea of our KDS is to use a set of nested cubes around each point and to update the KDS each time a point enters or leaves a cube of another point.

4.1.1 Definition of the Special Radii

Cubes. Similar to the definition of balls, for a point $p_i(t) \in P(t)$ and a non-negative value $r$, we define $C(p_i(t), r)$ to be the axis-parallel cube whose center is the point $p_i(t)$ and whose side length is $2r$.
Given such a cube $C(p_i(t), r)$, we let $\mathrm{weight}(C(p_i(t), r))$ denote the sum of the demands of all the points in $P(t)$ that are located in the cube $C(p_i(t), r)$, i.e., we define
\[
\mathrm{weight}(C(p_i(t), r)) := \sum_{p_j(t) \in P(t) \cap C(p_i(t), r)} d_j .
\]
Note that the cube $C(p_i(t), r)$ is a ball with radius $r$ with respect to the $L_\infty$-metric. Accordingly, and for the sake of simplicity, we will refer to the value $r$ of a cube $C(p_i(t), r)$ as the radius of the cube, i.e., twice the radius of a cube is equal to its side length.

Special Radius Associated with a Point. Our KDS maintains for each point $p_i \in P$ an approximation of $r_i(t)$, called the special radius $\tilde{r}_i(t)$, which is defined as follows:

Definition 4.1.1 (Special Radius). At any point of time $t$, the special radius $\tilde{r}_i(t)$ of any point $p_i \in P$ is the value $2^{\tilde{k}}$ such that $\tilde{k} = k_0 + \lceil \log(4\sqrt{d}) \rceil$, where $k_0$ is the minimum integer $k$ with $\lceil \log(r_{\min}) \rceil \le k \le \lceil \log(r_{\max}) \rceil$ for which $\mathrm{weight}(C(p_i(t), 2^{k})) \ge f_i \cdot 2^{-k}$ holds.

In the following, we will prove the existence of the special radius $\tilde{r}_i(t)$ of any point $p_i(t) \in P(t)$ at any point of time $t$. Moreover, we will show that the special radius $\tilde{r}_i(t)$ is a constant-factor approximation of the value $r_i(t)$.

The proof of the existence of the special radius is based on a result obtained in [14]. More precisely, for the uniform metric facility location problem, the authors of [14] gave lower and upper bounds on the value $r_i(t)$ (see also Lemma 3.1.1 with $\ell = 1$). We generalize their result to the non-uniform case:

Lemma 4.1.2. At any point of time $t$ and for each $p_i \in P$, we have
\[
\frac{f_i}{\mathrm{weight}(B(p_i(t), r_i(t)))} \le r_i(t) \le \frac{2 \cdot f_i}{\mathrm{weight}(B(p_i(t), r_i(t)/2))} .
\]
Proof. It follows from the definition of $r_i(t)$ given in Equation (4.1) that
\[
\sum_{p_j(t) \in P(t) \cap B(p_i(t), r_i(t))} d_j \cdot r_i(t) \ge f_i ,
\]
so we have
\[
r_i(t) \ge \frac{f_i}{\sum_{p_j(t) \in P(t) \cap B(p_i(t), r_i(t))} d_j} = \frac{f_i}{\mathrm{weight}(B(p_i(t), r_i(t)))} .
\]
This proves the first inequality of the lemma.
Furthermore, we get
\[
\begin{aligned}
f_i &= \sum_{p_j(t) \in P(t) \cap B(p_i(t), r_i(t))} d_j \cdot \left( r_i(t) - D(p_i(t), p_j(t)) \right) \\
&\ge \sum_{p_j(t) \in P(t) \cap B(p_i(t), r_i(t)/2)} d_j \cdot \left( r_i(t) - D(p_i(t), p_j(t)) \right) \\
&\ge \frac{r_i(t)}{2} \cdot \sum_{p_j(t) \in P(t) \cap B(p_i(t), r_i(t)/2)} d_j = \frac{r_i(t)}{2} \cdot \mathrm{weight}(B(p_i(t), r_i(t)/2)) ,
\end{aligned}
\]
where the first inequality follows from $B(p_i(t), r_i(t)/2) \subseteq B(p_i(t), r_i(t))$ and the second inequality from the fact that $r_i(t) - D(p_i(t), p_j(t)) \ge r_i(t)/2$ for all $p_j(t) \in P(t) \cap B(p_i(t), r_i(t)/2)$. This proves the second inequality of the lemma.

Lemma 4.1.3. Let $t$ be any point of time, and let $p_i \in P$ be any point. Then, there exists an integer $k$ with $\lceil \log(r_{\min}) \rceil \le k \le \lceil \log(r_{\max}) \rceil$ such that
\[
\mathrm{weight}(B(p_i(t), 2^k)) \ge f_i \cdot 2^{-k} .
\]
Proof. Due to Lemma 4.1.2, we have
\[
\mathrm{weight}(B(p_i(t), 2^{\log(r_i(t))})) = \mathrm{weight}(B(p_i(t), r_i(t))) \ge \frac{f_i}{r_i(t)} = \frac{f_i}{2^{\log(r_i(t))}} .
\]
Since $2^{\lceil \log(r_i(t)) \rceil} \ge 2^{\log(r_i(t))}$, it follows that
\[
\mathrm{weight}(B(p_i(t), 2^{\lceil \log(r_i(t)) \rceil})) \ge \frac{f_i}{2^{\lceil \log(r_i(t)) \rceil}} .
\]
Now, the existence of an integer $k$ with $\lceil \log(r_{\min}) \rceil \le k \le \lceil \log(r_{\max}) \rceil$ such that $\mathrm{weight}(B(p_i(t), 2^k)) \ge f_i \cdot 2^{-k}$ follows from $r_{\min} \le r_i(t) \le r_{\max}$.

Due to Lemma 4.1.3 and the fact that a ball with a certain radius is completely covered by the cube having the same center and the same radius as the ball, we obtain the following result:

Corollary 4.1.4. Let $t$ be any point of time, and let $p_i \in P$ be any point. Then, there exists an integer $k$ with $\lceil \log(r_{\min}) \rceil \le k \le \lceil \log(r_{\max}) \rceil$ such that
\[
\mathrm{weight}(C(p_i(t), 2^k)) \ge f_i \cdot 2^{-k} .
\]
It follows from Corollary 4.1.4 that, at each point of time $t$, the special radius $\tilde{r}_i(t)$ of each point $p_i \in P$ exists. Next, we use a modified version of a counting argument given in [14] to prove that $\tilde{r}_i(t)$ is a constant-factor approximation of $r_i(t)$. More precisely, for the uniform metric facility location problem, Bădoiu et al. [14] showed how to approximate $r_i(t)$ by counting the number of points in exponentially growing balls around $p_i(t)$. We generalize their result to the non-uniform case:

Lemma 4.1.5.
Let $t$ be any point of time, let $p_i \in P$ be any point, and let $k_1$ be the minimum integer $k$ with $\lceil \log(r_{\min}) \rceil \le k \le \lceil \log(r_{\max}) \rceil$ such that $\mathrm{weight}(B(p_i(t), 2^k)) \ge f_i \cdot 2^{-k}$. Then, it holds that
\[
\frac{1}{2} \cdot r_i(t) \le 2^{k_1} \le 2 \cdot r_i(t) .
\]
Proof. The existence of the integer $k_1$ is due to Lemma 4.1.3. Furthermore, due to the choice of $k_1$, we have
\[
\mathrm{weight}\left(B\left(p_i(t), 2^{k_1 - 1}\right)\right) < f_i \cdot 2^{-(k_1 - 1)} .
\]
It follows that, for any $r_i(t) < 2^{k_1 - 1}$, we get
\[
\mathrm{weight}(B(p_i(t), r_i(t))) \le \mathrm{weight}\left(B\left(p_i(t), 2^{k_1 - 1}\right)\right) < f_i \cdot 2^{-(k_1 - 1)} < f_i \cdot \frac{1}{r_i(t)} .
\]
Now, we obtain
\[
r_i(t) < \frac{f_i}{\mathrm{weight}(B(p_i(t), r_i(t)))} ,
\]
which is a contradiction to Lemma 4.1.2. Hence, $r_i(t) \ge 2^{k_1 - 1}$ must be true, which proves the second inequality of the assertion. Furthermore, for any $r_i(t) > 2^{k_1 + 1}$, we have
\[
\mathrm{weight}(B(p_i(t), r_i(t)/2)) \ge \mathrm{weight}\left(B\left(p_i(t), 2^{k_1}\right)\right) \ge f_i \cdot 2^{-k_1} > f_i \cdot \frac{2}{r_i(t)} .
\]
In this case, it follows that
\[
r_i(t) > \frac{2 f_i}{\mathrm{weight}(B(p_i(t), r_i(t)/2))} ,
\]
which is again a contradiction to Lemma 4.1.2. Thus, we have $r_i(t) \le 2^{k_1 + 1}$, which proves the first inequality of the assertion.

Our algorithm uses the approach of [14], but, for any integer $k$, we approximate the sum of the demands of all the points in a ball with radius $2^k$ by the sum of the demands of all the points in a cube with radius $2^k$. This leads to the following result:

Lemma 4.1.6. Let $t$ be any point of time, let $p_i \in P$ be any point, and let $k_0$ be the minimum integer $k$ with $\lceil \log(r_{\min}) \rceil \le k \le \lceil \log(r_{\max}) \rceil$ such that $\mathrm{weight}(C(p_i(t), 2^k)) \ge f_i \cdot 2^{-k}$. Then, it holds that
\[
\frac{1}{4\sqrt{d}} \cdot r_i(t) \le 2^{k_0} \le 2 \cdot r_i(t) .
\]
Proof. Let $k_1$, $\lceil \log(r_{\min}) \rceil \le k_1 \le \lceil \log(r_{\max}) \rceil$, be defined as in Lemma 4.1.5. Then, the radius of $C(p_i(t), 2^{k_0})$ is at most $2^{k_1}$ since each point in $P(t)$ that is located in $B(p_i(t), 2^{k_1})$ is also located in $C(p_i(t), 2^{k_1})$. Thus, we get
\[
\mathrm{weight}\left(C\left(p_i(t), 2^{k_1}\right)\right) \ge f_i \cdot 2^{-k_1} .
\]
The maximum radius of $C(p_i(t), 2^{k_0})$ is illustrated on the left hand side of Figure 4.1. Furthermore, the radius of $C(p_i(t), 2^{k_0})$ is larger than $1/\sqrt{d} \cdot 2^{k_1 - 1}$.
The reason is that
\[
\mathrm{weight}\left(B\left(p_i(t), 2^{k_1 - 1}\right)\right) < f_i \cdot 2^{-(k_1 - 1)}
\]
and
\[
\mathrm{weight}\left(C\left(p_i(t), \tfrac{1}{\sqrt{d}} \cdot 2^{k_1 - 1}\right)\right) \le \mathrm{weight}\left(B\left(p_i(t), 2^{k_1 - 1}\right)\right) ,
\]
so we have
\[
\mathrm{weight}\left(C\left(p_i(t), 2^{k_1 - 1 - \log(\sqrt{d})}\right)\right) = \mathrm{weight}\left(C\left(p_i(t), \tfrac{1}{\sqrt{d}} \cdot 2^{k_1 - 1}\right)\right) < f_i \cdot 2^{-(k_1 - 1)} < f_i \cdot 2^{-(k_1 - 1 - \log(\sqrt{d}))} .
\]
Figure 4.1: Illustration of the maximum and minimum radius of $C(p_i(t), 2^{k_0})$.

The minimum radius of $C(p_i(t), 2^{k_0})$ is illustrated on the right hand side of Figure 4.1. Now, the lemma follows from $1/\sqrt{d} \cdot 2^{k_1 - 1} < 2^{k_0} \le 2^{k_1}$ and Lemma 4.1.5.

Based on Lemma 4.1.6, we can now show that the special radius associated with a point is always a constant-factor approximation of the original radius defined by Mettu and Plaxton [87]. Furthermore, we prove that the number of possible values of a special radius is only logarithmic in $nR$, where
\[
R := \frac{\max_{p_i \in P} f_i \cdot \max_{p_i \in P} d_i}{\min_{p_i \in P} f_i \cdot \min_{p_i \in P} d_i} .
\]
Lemma 4.1.7. Let $t$ be any point of time, and let $p_i \in P$ be any point. Then, we have
\[
r_i(t) \le \tilde{r}_i(t) \le 2^{3 + \lceil \log(\sqrt{d}) \rceil} \cdot r_i(t) .
\]
The number of possible values for $\tilde{r}_i(t)$ is upper bounded by $O(\log(nR))$.

Proof. Due to Lemma 4.1.6, we have
\[
2^{-\log(4\sqrt{d})} \cdot r_i(t) \le 2^{k_0} \le 2 \cdot r_i(t) .
\]
According to Definition 4.1.1, we set the special radius to
\[
\tilde{r}_i(t) = 2^{\tilde{k}} = 2^{k_0 + \lceil \log(4\sqrt{d}) \rceil} ,
\]
so we obtain $r_i(t) \le \tilde{r}_i(t) \le 2^{3 + \lceil \log(\sqrt{d}) \rceil} \cdot r_i(t)$. Due to $r_{\min} \le r_i(t) \le r_{\max}$, Equation (4.2), and the fact that $\tilde{r}_i(t)$ is a power of 2, there are $O(\log(nR))$ possible values for $\tilde{r}_i(t)$.

Walls around a Point. We consider a set of $O(\log(nR))$ nested cubes for each point $p_i(t) \in P(t)$. More precisely, there is the cube $C(p_i(t), 2^k)$ with radius $2^k$ for each $k \in \{\lceil \log(r_{\min}) \rceil + \lceil \log(4\sqrt{d}) \rceil, \lceil \log(r_{\min}) \rceil + 1 + \lceil \log(4\sqrt{d}) \rceil, \ldots, \lceil \log(r_{\max}) \rceil + \lceil \log(4\sqrt{d}) \rceil\}$. The side faces of the cube defined by $C(p_i(t), 2^k)$ form a wall around $p_i(t)$, which we call $W_{i,k}(t)$. Hence, there exists a set of $O(\log(nR))$ walls for $p_i(t)$.
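To make Definition 4.1.1 and the wall exponents concrete, the following Python sketch evaluates them by brute force on a static snapshot of the points. It is only an illustration under assumptions that are not part of the thesis: the points are given as coordinate tuples at one fixed time, all costs and demands are positive, logarithms are to base 2, and the function names `radius`, `cube_weight`, `special_radius`, and `wall_exponents` are hypothetical. The KDS itself uses range trees instead of the linear scans shown here.

```python
import math


def radius(i, pts, cost, demand):
    """Solve Equation (4.1) for r_i: sum_j d_j * max(0, r - D(p_i, p_j)) = f_i."""
    items = sorted((math.dist(pts[i], q), w) for q, w in zip(pts, demand))
    weight = 0.0  # total demand of the points currently inside the ball
    acc = 0.0     # sum of d_j * D(p_i, p_j) over these points
    for idx, (dj, w) in enumerate(items):
        weight += w
        acc += w * dj
        r = (cost[i] + acc) / weight  # solves weight * r - acc = f_i
        nxt = items[idx + 1][0] if idx + 1 < len(items) else math.inf
        if r <= nxt:  # no further point lies inside the ball of radius r
            return r


def cube_weight(i, pts, demand, r):
    """Total demand inside the axis-parallel cube C(p_i, r) (an L_inf ball)."""
    return sum(w for q, w in zip(pts, demand)
               if max(abs(a - b) for a, b in zip(pts[i], q)) <= r)


def exponent_range(pts, cost, demand):
    """Range ceil(log r_min) .. ceil(log r_max) from Equation (4.2)."""
    r_min = min(cost) / (len(pts) * max(demand))
    r_max = max(cost) / min(demand)
    return math.ceil(math.log2(r_min)), math.ceil(math.log2(r_max))


def special_radius(i, pts, cost, demand):
    """Special radius 2^(k0 + ceil(log(4 sqrt(d)))) of Definition 4.1.1."""
    lo, hi = exponent_range(pts, cost, demand)
    shift = math.ceil(math.log2(4 * math.sqrt(len(pts[0]))))
    for k in range(lo, hi + 1):
        if cube_weight(i, pts, demand, 2.0 ** k) >= cost[i] * 2.0 ** (-k):
            return 2.0 ** (k + shift)
    raise ValueError("no valid exponent; costs and demands must be positive")


def wall_exponents(pts, cost, demand):
    """Exponents k of the nested cubes C(p_i, 2^k) whose faces form the walls."""
    lo, hi = exponent_range(pts, cost, demand)
    shift = math.ceil(math.log2(4 * math.sqrt(len(pts[0]))))
    return list(range(lo + shift, hi + shift + 1))
```

On small inputs, comparing `radius` and `special_radius` point by point gives a quick sanity check of the two-sided bound stated in Lemma 4.1.7.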
We use this set of walls to determine the points of time when an update of $p_i$ in our KDS is required. In general, an event occurs each time any point crosses any wall of another point.

4.1.2 Computation of the Special Radii

In order to compute the special radius associated with any point efficiently at any time, we maintain two $(d+1)$-dimensional dynamic range trees denoted by $T_1$ and $T_2$. At any time, range tree $T_1$ is used to manage the current set of open facilities (which we call open points), and $T_2$ stores the current set of clients (which we call closed points). Apart from the fact that the two data structures contain different point sets, they are constructed in the same way. In the first $d$ levels of the range trees, the points are handled according to their coordinates and, in the $(d+1)$-st level, the points are handled according to their special radii. Additionally, with each node $v$ in every binary search tree of the $(d+1)$-st level, we store the sum of the demands of all the points contained in the subtree rooted at $v$.

At any point of time $t$, the range trees rely on the relative position of the points in $P(t)$. More precisely, the leaves of any binary search tree of any level $\ell$, $1 \le \ell \le d$, in $T_1$ and $T_2$ store the points sorted according to their ranks based on dimension $\ell$, i.e., sorted according to their $\ell$-th coordinate. Now, the movement of the points in $P$ is reflected by insert and delete operations on $T_1$ and $T_2$. At each point of time $t$, when any two points $p_i(t), p_j(t) \in P(t)$ change their ranks based on any dimension $\ell$, we delete $p_i$ and $p_j$ from $T_1$ and $T_2$ and reinsert them according to their position at time $t$.

By applying a technique proposed by Willard and Lueker in [109], we are able to support all required properties of $T_1$ and $T_2$ efficiently. More precisely, $T_1$ and $T_2$ have the following complexity:

Lemma 4.1.8 ([109]).
The range trees $T_1$ and $T_2$ have a space requirement of $O(n \log^{d}(n))$ and can be initialized in $O(n \log^{d+1}(n))$ time. The worst-case time per insertion and deletion is $O(\log^{d+1}(n))$. Given any orthogonal range $[x_1, x'_1] \times [x_2, x'_2] \times \ldots \times [x_{d+1}, x'_{d+1}] \subset \mathbb{R}^{d+1}$ at any time $t$, the set
\[
Q := \{ p_i(t) \in P(t) \mid p_i(t) \in [x_1, x'_1] \times [x_2, x'_2] \times \ldots \times [x_d, x'_d] \text{ and } \tilde{r}_i(t) \in [x_{d+1}, x'_{d+1}] \}
\]
can be computed in $O(\log^{d+1}(n) + |Q|)$ time and the value $\sum_{p_i(t) \in Q} d_i$ in $O(\log^{d+1}(n))$ time.

Besides the two range trees, we maintain a binary search tree $T$ that contains for each point in $P$ a pair consisting of the point's index and its current status (which is either open or closed). $T$ is sorted according to the indices. Thus, we can output the status of a given point in $O(\log(n))$ time by querying $T$.

4.1.3 The Invariant

The key idea of our KDS is to maintain one invariant consisting of the following conditions:

(a) for each closed point $p_i(t) \in P(t) \setminus F(t)$, there is an open point $p_j(t) \in F(t)$ with $\tilde{r}_j(t) \le \tilde{r}_i(t)$ in $C(p_i(t), 4 \cdot \tilde{r}_i(t))$ and
(b) for each open point $p_i(t) \in F(t)$, there is no other open point $p_j(t) \in F(t)$ with $\tilde{r}_j(t) \le \tilde{r}_i(t)$ in $C(p_i(t), 2 \cdot \tilde{r}_i(t))$.

The choice of Conditions (a) and (b) enables our KDS to be stable. Moreover, we will show that, by maintaining Conditions (a) and (b), we have, at any point of time $t$, a set of open facilities $F(t)$ that leads to a total cost which is at most a constant factor larger than the optimal cost. The following argumentation for proving that our KDS maintains a constant-factor approximation is basically the same as in [87] and the proof of Lemma 3.1.9. Only a few minor adaptations to the kinetic setting have been made.

Claim 4.1.9. Let $t$ be any point of time, and let $p_i(t)$ be any point in $P(t)$. If the invariant is satisfied at the point of time $t$, then there exists a point $p_j(t) \in F(t)$ such that $\tilde{r}_j(t) \le \tilde{r}_i(t)$ and $D(p_i(t), p_j(t)) \le 64d \cdot r_i(t)$.

Proof.
Since the invariant is satisfied, there is an open facility $p_j(t) \in F(t)$ with radius $\tilde{r}_j(t) \le \tilde{r}_i(t)$ in $C(p_i(t), 4 \cdot \tilde{r}_i(t))$ for each point $p_i(t) \in P(t)$. Thus, we get $D(p_i(t), p_j(t)) \le \sqrt{d} \cdot 4 \cdot \tilde{r}_i(t)$. Now, due to Lemma 4.1.7, we have
\[
D(p_i(t), p_j(t)) \le \sqrt{d} \cdot 4 \cdot 2^{3 + \lceil \log(\sqrt{d}) \rceil} \cdot r_i(t) \le 64d \cdot r_i(t) .
\]
Claim 4.1.10. Let $t$ be any point of time, and let $p_i(t)$ and $p_j(t)$ be distinct points in $F(t)$. If the invariant is satisfied at the point of time $t$, then we have
\[
D(p_i(t), p_j(t)) > 2 \cdot \max\{r_i(t), r_j(t)\} .
\]
Proof. Without loss of generality, we assume that $\tilde{r}_j(t) \le \tilde{r}_i(t)$. From the fact that the invariant is satisfied, it follows that $p_j(t) \notin C(p_i(t), 2 \cdot \tilde{r}_i(t))$. Otherwise, the point $p_i$ would be closed at the point of time $t$. Thus, we have
\[
D(p_i(t), p_j(t)) > 2 \cdot \tilde{r}_i(t) \ge 2 \cdot r_i(t)
\]
and
\[
D(p_i(t), p_j(t)) > 2 \cdot \tilde{r}_i(t) \ge 2 \cdot \tilde{r}_j(t) \ge 2 \cdot r_j(t) ,
\]
where $\tilde{r}_i(t) \ge r_i(t)$ and $\tilde{r}_j(t) \ge r_j(t)$ follow from Lemma 4.1.7.

For any point $p_j(t) \in P(t)$ and an arbitrary set of open facilities $X(t) \subseteq P(t)$, let
\[
\mathrm{charge}(p_j(t), X(t)) := D(p_j(t), X(t)) + \sum_{p_i(t) \in X(t)} \max\{0, r_i(t) - D(p_i(t), p_j(t))\} .
\]
Claim 4.1.11. Let $t$ be any point of time. For an arbitrary set of open facilities $X(t) \subseteq P(t)$, we get
\[
\sum_{p_j(t) \in P(t)} \mathrm{charge}(p_j(t), X(t)) \cdot d_j = \mathrm{FacLoc}(P(t), X(t)) .
\]
Proof. Due to the definition of $\mathrm{charge}(\cdot, \cdot)$ and $\mathrm{FacLoc}(\cdot, \cdot)$ and due to Equation (4.1), we get
\[
\begin{aligned}
\sum_{p_j(t) \in P(t)} \mathrm{charge}(p_j(t), X(t)) \cdot d_j
&= \sum_{p_j(t) \in P(t)} D(p_j(t), X(t)) \cdot d_j + \sum_{p_i(t) \in X(t)} \; \sum_{p_j(t) \in P(t) \cap B(p_i(t), r_i(t))} \left( r_i(t) - D(p_i(t), p_j(t)) \right) \cdot d_j \\
&= \sum_{p_j(t) \in P(t)} D(p_j(t), X(t)) \cdot d_j + \sum_{p_i(t) \in X(t)} f_i \\
&= \mathrm{FacLoc}(P(t), X(t)) .
\end{aligned}
\]
Claim 4.1.12. Let $t$ be any point of time, let $p_j(t) \in P(t)$ be any point, let $X(t) \subseteq P(t)$ be an arbitrary set of open facilities, and let $p_i(t) \in X(t)$ be any open facility. If we have $D(p_j(t), p_i(t)) = D(p_j(t), X(t))$, then $\mathrm{charge}(p_j(t), X(t)) \ge \max\{r_i(t), D(p_j(t), p_i(t))\}$.

Proof. If $p_j(t) \notin B(p_i(t), r_i(t))$, then
\[
\mathrm{charge}(p_j(t), X(t)) \ge D(p_j(t), X(t)) = D(p_j(t), p_i(t)) > r_i(t) .
\]
Otherwise, we have
\[ \mathrm{charge}(p_j(t), X(t)) \ge D(p_j(t), X(t)) + (r_i(t) - D(p_j(t), p_i(t))) = D(p_j(t), p_i(t)) + (r_i(t) - D(p_j(t), p_i(t))) = r_i(t) \ge D(p_j(t), p_i(t)) . \]

Claim 4.1.13. Let $t$ be any point of time, let $p_j(t) \in P(t)$ be any point, and let $p_i(t)$ be any open facility in $F(t)$. If the invariant is satisfied at the point of time $t$ and we have $p_j(t) \in B(p_i(t), r_i(t))$, then $\mathrm{charge}(p_j(t), F(t)) \le r_i(t)$.

Proof. By Claim 4.1.10, there is no open point $p_\ell(t) \in F(t)$ such that we have $i \ne \ell$ and $p_j(t) \in B(p_\ell(t), r_\ell(t))$. Since $D(p_j(t), F(t)) \le D(p_j(t), p_i(t))$, we obtain
\[ \mathrm{charge}(p_j(t), F(t)) = D(p_j(t), F(t)) + (r_i(t) - D(p_j(t), p_i(t))) \le D(p_j(t), p_i(t)) + (r_i(t) - D(p_j(t), p_i(t))) = r_i(t) . \]

Claim 4.1.14. Let $t$ be any point of time, let $p_j(t) \in P(t)$ be any point, and let $p_i(t)$ be any open facility in $F(t)$. If the invariant is satisfied at the point of time $t$ and we have $p_j(t) \notin B(p_i(t), r_i(t))$, then $\mathrm{charge}(p_j(t), F(t)) < D(p_j(t), p_i(t))$.

Proof. The correctness of the claim follows immediately, unless there is a point $p_\ell(t) \in F(t)$ such that $p_j(t) \in B(p_\ell(t), r_\ell(t))$. If such a point $p_\ell(t)$ exists, then Claims 4.1.10 and 4.1.13 imply $D(p_i(t), p_\ell(t)) > 2 \cdot \max\{r_i(t), r_\ell(t)\}$ and $\mathrm{charge}(p_j(t), F(t)) \le r_\ell(t)$. Furthermore, by triangle inequality, we obtain
\[ D(p_j(t), p_i(t)) \ge D(p_i(t), p_\ell(t)) - D(p_j(t), p_\ell(t)) > 2 r_\ell(t) - r_\ell(t) = r_\ell(t) , \]
which proves $\mathrm{charge}(p_j(t), F(t)) \le r_\ell(t) < D(p_j(t), p_i(t))$.

Claim 4.1.15. Let $t$ be any point of time, let $p_j(t) \in P(t)$ be any point, and let $X(t) \subseteq P(t)$ be an arbitrary set of open facilities. If the invariant is satisfied at the point of time $t$, then
\[ \mathrm{charge}(p_j(t), F(t)) < (64d + 1) \cdot \mathrm{charge}(p_j(t), X(t)) . \]

Proof. Let $p_i(t)$ be some point in $X(t)$ such that we have $D(p_j(t), p_i(t)) = D(p_j(t), X(t))$. By Claim 4.1.9, there exists a point $p_\ell(t) \in F(t)$ such that we have $\tilde r_\ell(t) \le \tilde r_i(t)$ and $D(p_i(t), p_\ell(t)) \le 64d \cdot r_i(t)$. If $p_j(t) \in B(p_\ell(t), r_\ell(t))$, then we obtain $\mathrm{charge}(p_j(t), F(t)) \le r_\ell(t)$ by Claim 4.1.13.
Then, we get
\[ r_\ell(t) \le \tilde r_\ell(t) \le \tilde r_i(t) \le \sqrt{d} \cdot 4 \cdot 2^{3 + \lceil \log(\sqrt{d}) \rceil} \cdot r_i(t) \le 64d \cdot r_i(t) \]
due to the arguments above and Lemma 4.1.7. Since Claim 4.1.12 implies $\mathrm{charge}(p_j(t), X(t)) \ge r_i(t)$, we can conclude
\[ \mathrm{charge}(p_j(t), F(t)) \le r_\ell(t) \le 64d \cdot r_i(t) \le 64d \cdot \mathrm{charge}(p_j(t), X(t)) . \]
This proves the assertion in case that we have $p_j(t) \in B(p_\ell(t), r_\ell(t))$.

If $p_j(t) \notin B(p_\ell(t), r_\ell(t))$, then $\mathrm{charge}(p_j(t), F(t)) < D(p_j(t), p_\ell(t))$ by Claim 4.1.14. Thus, by triangle inequality, we get
\[ \mathrm{charge}(p_j(t), F(t)) < D(p_j(t), p_i(t)) + D(p_i(t), p_\ell(t)) \le D(p_j(t), p_i(t)) + 64d \cdot r_i(t) . \]
Since the ratio of $D(p_j(t), p_i(t)) + 64d \cdot r_i(t)$ to the maximum of $r_i(t)$ and $D(p_j(t), p_i(t))$ is at most $64d + 1$, we obtain $\mathrm{charge}(p_j(t), F(t)) < (64d + 1) \cdot \max\{D(p_j(t), p_i(t)), r_i(t)\}$. Now, the assertion follows by Claim 4.1.12.

Lemma 4.1.16. Let $t$ be any point of time. If the invariant is satisfied at the point of time $t$, then we have
\[ \mathrm{FacLoc}(P(t), F(t)) < (64d + 1) \cdot \mathrm{FacLoc}(P(t), F^*(t)) . \]

Proof. Due to Claims 4.1.11 and 4.1.15, we have
\[ \mathrm{FacLoc}(P(t), F(t)) = \sum_{p_j(t) \in P(t)} \mathrm{charge}(p_j(t), F(t)) \cdot d_j < \sum_{p_j(t) \in P(t)} (64d + 1) \cdot \mathrm{charge}(p_j(t), X(t)) \cdot d_j = (64d + 1) \cdot \mathrm{FacLoc}(P(t), X(t)) \]
for an arbitrary set of open facilities $X(t) \subseteq P(t)$. Thus, the approximation factor is also true for an optimal set of open facilities $F^*(t)$, which completes the proof of the lemma.

4.2 The Kinetic Data Structure

This section addresses the design of our KDS for the mobile facility location problem. After describing how to compute an initial set of open facilities, we describe how the event queue is structured and how an update of the KDS is processed.

4.2.1 Initial Set of Open Facilities

Let $p_i(t_0)$ denote the initial position of the point $p_i \in P$. To compute an initial set of open facilities, we apply Algorithm 4.2.1, which is a modified version of Algorithm 2.3.1, on the point set $P(t_0)$.
The modification is that, instead of considering exactly the sorted sequence of the $r_i(t_0)$ values, we round each $r_i(t_0)$ to one of the $O(\log(nR))$ possible values for the special radii (i.e., compute its corresponding $\tilde r_i(t_0)$ value) and use the sorted sequence of the rounded values.

Algorithm 4.2.1 Modified-Mettu-Plaxton-FacLoc($P$, $t_0$)
1: calculate the radius $\tilde r_i(t_0)$ for each point $p_i(t_0) \in P(t_0)$
2: for $k \leftarrow \lceil \log(r_{\min}) \rceil + \lceil \log(4\sqrt{d}) \rceil$ to $\lceil \log(r_{\max}) \rceil + \lceil \log(4\sqrt{d}) \rceil$ do
3:     let $I_k$ be the set of indices of all the points with radius $2^k$
4:     for each $i \in I_k$ do
5:         if there is no open facility in $C(p_i(t_0), 2 \cdot 2^k)$ then
6:             open facility at $p_i(t_0)$

4.2.2 Event Queue

In order to maintain the invariant defined in Section 4.1.3, we have to update our KDS at certain points of time. More precisely, we perform an update at each point of time when a point $p_j(t)$ crosses a wall $W_{i,k}(t)$, $\lceil \log(r_{\min}) \rceil + \lceil \log(4\sqrt{d}) \rceil \le k \le \lceil \log(r_{\max}) \rceil + \lceil \log(4\sqrt{d}) \rceil$, of another point $p_i(t)$. To keep track of these events, we use the following data structure: For each dimension $\ell$, $1 \le \ell \le d$, we store all $n$ points and all $O(n \cdot \log(nR))$ wall faces that are orthogonal to the $\ell$-th coordinate axis in a list sorted by the $\ell$-th coordinate. For each consecutive pair in each of the $d$ lists, we maintain one certificate that certifies the sorted order of the list. We define the failure time of the certificate for any pair of consecutive objects to be the first future point of time when these objects change their ranks in their sorted list. The failure times of all certificates are maintained in one event queue. If more than one event occurs at the same time, we handle them in an arbitrary order. Of course, not every event implies that a point crosses a wall of another point (e.g., a rank change of two wall faces also causes an event), but every crossing of a wall is detected by the failure of at least one certificate.
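The certificate mechanism described above can be illustrated for the simplest case, namely objects that move linearly along one coordinate axis. The following is only a minimal sketch under stated assumptions, not the implementation used in this thesis: the names `failure_time` and `build_event_queue` are hypothetical, every object (a point or a wall face) is assumed to be given as a pair (position at time 0, velocity), and Python's `heapq` module stands in for the min-heap that realizes the event queue.

```python
import heapq
from fractions import Fraction

def failure_time(a, b, now=0):
    """First time after `now` at which a (currently left of b) meets b.

    a and b are (x0, v) pairs describing the trajectory x0 + v * t.
    Returns None if a never catches up with b after `now`.
    """
    (xa, va), (xb, vb) = a, b
    if va <= vb:                      # the gap between a and b never shrinks
        return None
    t = Fraction(xb - xa) / Fraction(va - vb)
    return t if t > now else None

def build_event_queue(objects, now=0):
    """Sort the objects by their current coordinate and create one
    certificate (failure time) per pair of consecutive objects."""
    def key(i):
        x0, v = objects[i]
        return (x0 + v * now, v)      # ties are broken by velocity
    order = sorted(range(len(objects)), key=key)
    queue = []
    for left, right in zip(order, order[1:]):
        t = failure_time(objects[left], objects[right], now)
        if t is not None:
            heapq.heappush(queue, (t, left, right))
    return order, queue

# Three objects on a line: (initial position, velocity).
order, queue = build_event_queue([(0, 2), (10, 0), (4, 1)])
```

In this example, the sorted order is object 0, object 2, object 1, and the earliest certificate failure is the one between objects 0 and 2. After handling an event, only the certificates of the two swapped objects and their new neighbors would have to be recomputed.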
The event queue has the following complexity:

Lemma 4.2.1. The event queue has size $O(n \log(nR))$, can be initialized in $O(n \log^2(nR))$ time, and can be updated in $O(\log(nR))$ time. Provided that the trajectories can be described by bounded-degree polynomials, the total number of events is $O(n^2 \log^2(nR))$. A flight plan update involves $O(\log(nR))$ certificates and requires $O(\log^2(nR))$ time.

Proof. Each of the $d$ lists stores $n$ points and $O(n \log(nR))$ wall faces. It follows that the event queue holds $O(n \log(nR))$ events. Thus, the upper bound on the space requirement is as claimed.

The initialization of the $d$ lists and the event queue can be done by simple sorting operations in $O(n \log(nR) \log(n \log(nR))) \subset O(n \log^2(nR))$ time. In each following update, we have to re-calculate the points of time when the two objects involved in the current event change their ranks with their two neighbors in the corresponding list. Thus, a constant number of events have to be updated in the event queue. Since the event queue contains $O(n \log(nR))$ elements and we can use a min-heap to realize it, an update of an event requires $O(\log(n \log(nR))) \subset O(\log(nR))$ time.

Furthermore, a flight plan update of a point causes a re-calculation of the points of time when the point and all its wall faces change their ranks with the associated neighbors in all $d$ lists. Afterwards, the involved certificates are updated in the event queue. Since a point has $O(\log(nR))$ wall faces, the number of involved certificates is $O(\log(nR))$. Their update in the event queue can be accomplished in $O(\log^2(nR))$ time.

In case that the trajectories can be described by bounded-degree polynomials and no flight plan update occurs, the upper bound on the total number of events is given as follows. For each pair of elements, an event occurs when the trajectories of the two elements cross each other. The number of cuts of two polynomials is bounded by the maximum degree of both polynomials.
Hence, the total number of cuts of $O(n \log(nR))$ bounded-degree polynomials is $O(c\,n^2 \log^2(nR))$, where the constant $c$ is the maximum degree of the polynomials.

4.2.3 Handling an Update

In this section, we describe how an event $E$, occurring at any point of time $t$, is handled (cf. Algorithm 4.2.2, ll. 5). As the first step, the event queue is updated as explained in Section 4.2.2. Then, we have to distinguish between the following three cases:

(i) Both objects involved in the considered certificate are faces of walls.

(ii) Both objects involved in the considered certificate are points.

(iii) One object involved in the considered certificate is a point and the other object is a face of a wall.

The handling of the three cases mainly depends on whether the invariant is violated or not. We say that a point $p_i(t) \in P(t)$ violates the invariant at a point of time $t$ if either

(a) $p_i(t)$ is closed, but there is no open facility with radius smaller than or equal to $\tilde r_i(t)$ in the cube $C(p_i(t), 4 \cdot \tilde r_i(t))$ or

(b) $p_i(t)$ is open, but there is another open facility with radius smaller than or equal to $\tilde r_i(t)$ in the cube $C(p_i(t), 2 \cdot \tilde r_i(t))$.

We assume that the invariant is satisfied by the time when $E$ occurs.

In Case (i), no point crosses the wall of another point. As a result, the invariant is still satisfied, so handling $E$ is completed.

In Case (ii), the event indicates that a point $p_i(t)$ and another point $p_j(t)$ change their ranks based on a dimension $\ell$, $1 \le \ell \le d$. This means that we have to update the position of $p_i$ and $p_j$ in the range trees $T_1$ and $T_2$. Since no point crosses a wall of another point, handling $E$ is then completed.

In Case (iii), it might be that the invariant is violated. Let $p_j(t)$ be the first object involved in the considered certificate, and let $p_i(t)$ be the point whose wall is the second object involved in the considered certificate. In case that $p_j(t)$ does not cross a wall of $p_i(t)$, handling $E$ is completed.
Otherwise, we update the radius $\tilde r_i(t)$ according to Definition 4.1.1, i.e., we set $\tilde r_i(t) = 2^{\tilde k}$ such that $\tilde k = k_0 + \lceil \log(4\sqrt{d}) \rceil$ and $k_0$ is the minimum integer $k$, $\lceil \log(r_{\min}) \rceil \le k \le \lceil \log(r_{\max}) \rceil$, with $\mathrm{weight}(C(p_i(t), 2^{k_0})) \ge f_i \cdot 2^{-k_0}$. We will show that the new value of $k_0$ differs from its old value (before event $E$ occurred) by at most 1. Thus, there are three possible values for $k_0$. Each of these values can be tested by one range query on both $T_1$ and $T_2$. Afterwards, we test if $p_i(t)$ violates the invariant by using a range query on $T_1$. If this is the case, we change the status of $p_i(t)$. As an effect of changing the radius or the status of one point, the invariant may be violated by many other points (e.g., their open facility has been closed). In the following, we will show how to deal with this problem (cf. Algorithm 4.2.3).

Algorithm Restore. Suppose that $p_i(t)$ is a point that triggered an event $E$ at a point of time $t$ and whose radius or status changed due to $E$. Let $\tilde r_i(t) = 2^{\tilde k}$ be its updated radius. First, we restore the invariant at all points with radius $2^{\tilde k - 1}$ to ensure that no point with radius less than or equal to $2^{\tilde k - 1}$ violates the invariant. Then, we handle all points with radius $2^{\tilde k}$ that violate the invariant, then the points with radius $2^{\tilde k + 1}$, and so on, up to the biggest possible radius.

Now, we describe the procedure in general for any radius $2^k$. We define two cubes $S_1 := C(p_i(t), 4 \cdot 2^{k+1})$ and $S_2 := C(p_i(t), 6 \cdot 2^{k+1})$. Both cubes are divided into equally sized cubelets with radius $2^k$. The left hand side of Figure 4.2 illustrates this decomposition in the plane.

To guarantee that no open point with radius $2^k$ violates the invariant, we proceed as follows with each cubelet in $S_1$: Let $m$ be the center point of the considered cubelet.
If there is an open facility with radius less than $2^k$ in $C(m, 3 \cdot 2^k)$, then we close all facilities with radius $2^k$ in $C(m, 2^k)$. Note that there is at most one such facility. The considered area around a cubelet is illustrated in Figure 4.2.

In order to ensure that no closed point with radius $2^k$ violates the invariant either, we proceed as follows with each cubelet in $S_2$: Let $m$ be the center point of the considered cubelet. If there does not exist an open facility with radius less than or equal to $2^k$ in $C(m, 3 \cdot 2^k)$, then we open a point with radius $2^k$ in the cubelet (if there is such a point). Regardless of whether we opened a point or not, it is guaranteed that, for each closed point $p_j(t)$ with $\tilde r_j(t) = 2^k$ in the cubelet, there is an open facility in $C(p_j(t), 4 \cdot \tilde r_j(t))$.

Algorithm 4.2.2 KineticFL($P$, $t_0$)
1: Modified-Mettu-Plaxton-FacLoc($P$, $t_0$)
2: initialize event queue $Q$
3: while $Q$ is not empty do
4:     $E \leftarrow$ dequeue($Q$)
5:     update $Q$
6:     if $E$ indicates that $p_i(t)$ and $p_j(t)$ change their ranks in any list for any $i, j$ then
7:         update position of $p_i$ and $p_j$ in $T_1$ and $T_2$
8:     else
9:         if $E$ indicates that $p_j(t)$ crosses a wall of $p_i(t)$ for any $i, j$ then
10:            update $\tilde r_i(t) \leftarrow 2^{\tilde k}$ in $T_1$ and $T_2$
11:            if $p_i(t)$ violates the invariant then
12:                change status of $p_i(t)$
13:            if radius or status of $p_i(t)$ changed then
14:                Restore($p_i(t)$, $\tilde k$)

Algorithm 4.2.3 Restore($p_i(t)$, $\tilde k$)
1: for $k \leftarrow \tilde k - 1$ to $\lceil \log(r_{\max}) \rceil + \lceil \log(4\sqrt{d}) \rceil$ do
2:     define cubes $S_1 \leftarrow C(p_i(t), 4 \cdot 2^{k+1})$ and $S_2 \leftarrow C(p_i(t), 6 \cdot 2^{k+1})$
3:     for each cubelet $C$ with center $m_C$ and radius $2^k$ in $S_1$ do
4:         if $\exists$ open facility with radius $< 2^k$ in $C(m_C, 3 \cdot 2^k)$ then
5:             close all facilities with radius $2^k$ in $C$
6:     for each cubelet $C$ with center $m_C$ and radius $2^k$ in $S_2$ do
7:         if $\nexists$ open facility with radius $\le 2^k$ in $C(m_C, 3 \cdot 2^k)$ then
8:             open one point with radius $2^k$ in $C$ (if existing)
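The two passes of algorithm Restore can be sketched in a few lines of Python. This is a brute-force illustration under stated assumptions rather than the data structure of this chapter: linear scans replace the range queries on $T_1$ and $T_2$, each point is a hypothetical dictionary with the keys `pos`, `k` (the exponent of its special radius $2^k$) and `open`, cubes are treated as closed axis-parallel cubes, and points that lie exactly on a cubelet boundary are simply assigned to every cubelet that contains them.

```python
from itertools import product

def in_cube(x, center, radius):
    """True if x lies in the axis-parallel cube C(center, radius)."""
    return all(abs(a - c) <= radius for a, c in zip(x, center))

def cubelet_centers(center, factor, k):
    """Centers of the cubelets with radius 2**k tiling C(center, factor * 2**(k+1))."""
    r = 2 ** k
    per_axis = 2 * factor                       # big side / cubelet side
    offsets = [(2 * j - per_axis + 1) * r for j in range(per_axis)]
    return [tuple(c + o for c, o in zip(center, off))
            for off in product(offsets, repeat=len(center))]

def restore(points, center, k_tilde, k_max):
    """Brute-force version of the two loops of algorithm Restore."""
    for k in range(k_tilde - 1, k_max + 1):
        r = 2 ** k
        # Pass over S1: close facilities of radius 2^k near a smaller open facility.
        for m in cubelet_centers(center, 4, k):
            if any(p["open"] and p["k"] < k and in_cube(p["pos"], m, 3 * r)
                   for p in points):
                for p in points:
                    if p["k"] == k and in_cube(p["pos"], m, r):
                        p["open"] = False
        # Pass over S2: open one point of radius 2^k where no facility is nearby.
        for m in cubelet_centers(center, 6, k):
            if not any(p["open"] and p["k"] <= k and in_cube(p["pos"], m, 3 * r)
                       for p in points):
                for p in points:
                    if p["k"] == k and in_cube(p["pos"], m, r):
                        p["open"] = True
                        break

# A small open facility at the origin forces the larger one at (1, 1) to close.
pts = [{"pos": (0, 0), "k": 0, "open": True},
       {"pos": (1, 1), "k": 1, "open": True}]
restore(pts, center=(0, 0), k_tilde=1, k_max=1)
```

For a constant dimension $d$, the number of cubelets inspected per radius is constant ($8^d$ in $S_1$ and $12^d$ in $S_2$), which is what makes the range-tree version efficient; the linear scans above are used only to keep the sketch short.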
4.3 Quality and Complexity of the Kinetic Data Structure

At first, we prove that our KDS maintains a subset of the moving input points as open facilities such that, at any time, the associated total cost is at most a constant factor larger than the current optimal cost. For that purpose, we show that we restore the invariant each time it is violated. Finally, we analyze the complexity of our KDS.

Figure 4.2: Illustration of the decomposition into cubelets and the tested area for a cubelet. The shown decomposition is used during the iteration of algorithm Restore that restores the invariant at all points with radius $2^k$. The cubes $S_1$ and $S_2$ are indicated by thick lines. For each cubelet in $S_1$ and $S_2$, we perform a test. The shaded area indicates the tested area $C(m, 3 \cdot 2^k)$ for one cubelet in $S_1$. This area is magnified on the right hand side of the figure, where the dark shaded area corresponds to the tested cubelet $C(m, 2^k)$.

4.3.1 Maintenance of the Invariant

To simplify the description of the following proofs, we assume that at most one event occurs at the same time. Assuming this, we can show that the invariant is always satisfied after our KDS has handled an event. In case that more than one event occurs at the same time, the following proofs would differ in the sense that the fulfillment of the invariant can be guaranteed only after our KDS has handled all of these events.

First, we prove that the invariant is satisfied as long as algorithm KineticFL does not call algorithm Restore.

Lemma 4.3.1. The invariant is satisfied after the first step of algorithm KineticFL.

Proof. Since algorithm Modified-Mettu-Plaxton-FacLoc treats the points in non-decreasing order according to their special radii and opens a point $p_i(t_0)$ with radius $\tilde r_i(t_0)$ if and only if there is no other open point in $C(p_i(t_0), 2 \cdot \tilde r_i(t_0))$, no open point violates the invariant.
Furthermore, algorithm Modified-Mettu-Plaxton-FacLoc does not open a point $p_i(t_0)$ with radius $\tilde r_i(t_0)$ if and only if there is another open point in $C(p_i(t_0), 2 \cdot \tilde r_i(t_0)) \subseteq C(p_i(t_0), 4 \cdot \tilde r_i(t_0))$. Because this point has been treated earlier than $p_i(t_0)$, its radius is less than or equal to $\tilde r_i(t_0)$. Thus, there exists an open point with radius less than or equal to $\tilde r_i(t_0)$ in $C(p_i(t_0), 4 \cdot \tilde r_i(t_0))$. Hence, no closed point violates the invariant.

Claim 4.3.2. Let $E$ be any event such that algorithm KineticFL does not change the radius or the status of any point. If the invariant is satisfied before $E$, then it holds after $E$ as well.

Proof. We have to consider two cases. In the first case, no point crosses a wall of another point. This implies that no point enters or leaves any cube of another point and no point changes its radius. Hence, the invariant is still valid and the claim holds.

Let $t$ be the point of time when event $E$ occurs. Then, in the second case, we have that a wall $W_{i,k}(t)$ of a point $p_i(t)$ is crossed by another point $p_j(t)$, but our algorithm does not change the radius or the status of $p_i(t)$. It follows that $p_i(t)$ neither changed its radius nor violates the invariant, because otherwise our algorithm would have changed the radius or the status of $p_i(t)$, respectively. Due to the fact that $p_i(t)$ is unchanged and only the wall $W_{i,k}(t)$ is crossed at the point of time $t$, no point in $P(t) \setminus \{p_i(t)\}$ violates the invariant either. This completes the proof.

Next, we prove that the updated radius of a point that triggered an event $E$ differs at most by a factor of 2 from its value before $E$.

Claim 4.3.3. Let $E$ be an event at any point of time $t$ where any point $p_j(t) \in P(t)$ crosses any wall of any other point $p_i(t) \in P(t)$. Let $t' < t$ be any point of time after the latest point of time when $p_i$ has been involved in one event. We get $1/2 \cdot \tilde r_i(t') \le \tilde r_i(t) \le 2 \cdot \tilde r_i(t')$.

Proof.
Let $k'_0$ and $k_0$ be the minimum integers $k$ with $\lceil \log(r_{\min}) \rceil \le k \le \lceil \log(r_{\max}) \rceil$ for which we have $\mathrm{weight}(C(p_i(t'), 2^{k'_0})) \ge f_i \cdot 2^{-k'_0}$ and $\mathrm{weight}(C(p_i(t), 2^{k_0})) \ge f_i \cdot 2^{-k_0}$, respectively. Note that the existence of $k'_0$ and $k_0$ is due to Corollary 4.1.4. Furthermore, let $W_{i,\ell}(t)$ be the wall that is crossed by $p_j(t)$. We have to consider the cases (i) $p_j(t)$ leaves the cube $C(p_i(t), 2^\ell)$ and (ii) $p_j(t)$ enters the cube $C(p_i(t), 2^\ell)$.

Case (i). Since the point of time $t'$, $p_j$ is the only point that has crossed a wall of $p_i$. It follows that $\mathrm{weight}(C(p_i(t), 2^m)) < f_i \cdot 2^{-m}$, for any $m < k'_0$, and $\mathrm{weight}(C(p_i(t), 2^{k'_0})) \le \mathrm{weight}(C(p_i(t'), 2^{k'_0}))$. This implies $k_0 \ge k'_0$. Since $p_j(t)$ has only crossed one wall of $p_i(t)$, we get
\[ \mathrm{weight}(C(p_i(t), 2^{k'_0 + 1})) \ge \mathrm{weight}(C(p_i(t'), 2^{k'_0})) \ge f_i \cdot 2^{-k'_0} \ge f_i \cdot 2^{-(k'_0 + 1)} , \]
where the second inequality is given by the definition of $k'_0$. Thus, we have $k_0 \le k'_0 + 1$. Overall, we obtain $k'_0 \le k_0 \le k'_0 + 1$ in Case (i).

Case (ii). Due to the fact that $p_j(t)$ is the only point that has crossed a wall of $p_i(t)$ and $p_j(t)$ enters a cube with center $p_i(t)$, we have $\mathrm{weight}(C(p_i(t), 2^m)) \ge \mathrm{weight}(C(p_i(t'), 2^m))$, for all possible values of $m$. Hence, we get $k_0 \le k'_0$. Recall that $p_j(t)$ crosses the wall $W_{i,\ell}(t)$. If $\ell \ge k'_0 - 1$, then $k_0 \ge k'_0 - 1$ follows obviously. Now, let us assume that $\ell < k'_0 - 1$ and $k_0 = \ell$. Due to this assumption, we obtain that $\mathrm{weight}(C(p_i(t), 2^\ell)) \ge f_i \cdot 2^{-\ell}$. Since $p_j$ is the only point that has crossed a wall of $p_i$, we also have $\mathrm{weight}(C(p_i(t'), 2^{\ell + 1})) \ge f_i \cdot 2^{-\ell} \ge f_i \cdot 2^{-(\ell + 1)}$. This implies $k'_0 \le \ell + 1$, which is a contradiction. Hence, we get $k'_0 - 1 \le k_0 \le k'_0$ in Case (ii).

Considering both cases, we get $k'_0 - 1 \le k_0 \le k'_0 + 1$. Now, the claim follows due to the definition of the special radii.

The following claims show that the invariant is restored after each call of algorithm Restore.

Claim 4.3.4.
Let $p_h(t)$ be a point that triggered an event $E$ and whose radius or status changed due to $E$. Let $\tilde r_h(t) = 2^{\tilde k}$ be the updated radius of $p_h(t)$. If no point with radius less than or equal to $2^{\tilde k - 2}$ violates the invariant before $E$, then this holds after $E$ as well.

Proof. Due to Claim 4.3.3, the radius of $p_h$ has been at least $2^{\tilde k - 1}$ before $E$. While processing event $E$, we only change the status of points with radius larger than or equal to $2^{\tilde k - 1}$. These status changes cannot affect the invariant at points with radius less than or equal to $2^{\tilde k - 2}$. Thus, the assertion follows.

Figure 4.3: The dark gray area indicates the cube $C(m, 2^\ell)$ in $S_2$ that contains $p_i(t)$ during running the outer for-loop of algorithm Restore for $k = \ell$. The light gray area indicates the cube $C(m, 3 \cdot 2^\ell)$. (a) Arrangement of points that leads to the desired contradiction in the proof of Case (i) in Claim 4.3.5. (b) Arrangement of points that leads to the desired contradiction in the proof of Case (i) in Claim 4.3.6.

Claim 4.3.5. Let $p_h(t)$ be a point that triggered an event $E$ and whose radius or status changed due to $E$. Let $\tilde r_h(t) = 2^{\tilde k}$ be the updated radius of $p_h(t)$. If the invariant is satisfied before $E$ and no open point with radius less than or equal to $2^{\ell - 1}$ violates the invariant before running the outer for-loop of algorithm Restore for $k = \ell$, $\tilde k - 1 \le \ell \le \lceil \log(r_{\max}) \rceil + \lceil \log(4\sqrt{d}) \rceil$, then, after running this for-loop, no open point with radius $2^\ell$ violates the invariant.

Proof. The proof is by contradiction. Let us assume that, after running the outer for-loop of algorithm Restore for $k = \ell$, there is an open point $p_i(t)$ with radius $\tilde r_i(t) = 2^\ell$ that has another open point $p_j(t)$ with radius $\tilde r_j(t) \le \tilde r_i(t)$ in $C(p_i(t), 2 \cdot \tilde r_i(t))$. We have to consider the cases (i) $p_i(t) \in S_2$ and (ii) $p_i(t) \notin S_2$.

Case (i).
Subcase $\tilde r_j(t) < \tilde r_i(t)$: Due to the fact that $\tilde r_j(t) < 2^\ell$, we have opened $p_j$ before running the outer for-loop for $k = \ell$. It follows that $p_i(t) \in C(m, 2^\ell)$ and $p_j(t) \notin C(m, 3 \cdot 2^\ell)$ for one center $m$ of a considered cubelet (see Figure 4.3 (a)) because otherwise we either would have closed $p_i(t)$ or would not have opened $p_i(t)$. Thus, we have $p_j(t) \notin C(p_i(t), 2^{\ell + 1}) = C(p_i(t), 2 \cdot \tilde r_i(t))$, which is a contradiction to the assumption made above.

Subcase $\tilde r_j(t) = \tilde r_i(t)$: We have to consider the case that neither $p_i$ nor $p_j$ is opened while running the outer for-loop for $k = \ell$ and the case that at least one of $p_i$ and $p_j$ is opened during this for-loop. In the first case, it follows that $p_i$ and $p_j$ must have been open before running the outer for-loop for $k = \ell$. It follows that both points have been open before $E$ or one point is $p_h$. Then, either the invariant has been violated before $E$, which is a contradiction to the precondition of the claim, or changing the status of $p_h$ violated the invariant, which means that a rule of the algorithm has been broken. In the latter case, we have opened $p_i$ or $p_j$ or both while running the outer for-loop for $k = \ell$. Without loss of generality, let us assume that we have opened $p_j$ before we have opened $p_i$. Then, we must have that $p_i(t) \in C(m, 2^\ell)$ and $p_j(t) \notin C(m, 3 \cdot 2^\ell)$ for one center $m$ of a considered cubelet (see Figure 4.3 (a)). It follows that $p_j(t) \notin C(p_i(t), 2^{\ell + 1}) = C(p_i(t), 2 \cdot \tilde r_i(t))$, which is a contradiction to the assumption made above.

Case (ii). Subcase $\tilde r_j(t) < \tilde r_i(t)$: Due to the fact that $\tilde r_j(t) < 2^\ell$, we have opened $p_j$ before running the outer for-loop for $k = \ell$. Furthermore, it follows from $p_i(t) \notin S_2$ that we must have opened $p_i$ before running the outer for-loop for $k = \ell$ as well. Hence, both $p_i$ and $p_j$ have been open before running this for-loop. Thus, the invariant must have been violated at point $p_j(t)$ with $\tilde r_j(t) \le 2^{\ell - 1}$ before running the outer for-loop for $k = \ell$, which is a contradiction to the precondition of the claim.
Subcase $\tilde r_j(t) = \tilde r_i(t)$: We can use the same argumentation as in subcase $\tilde r_j(t) = \tilde r_i(t)$ of Case (i) with the modification that we know that $p_i$ has been opened before running the outer for-loop for $k = \ell$. The reason is that $p_i(t) \notin S_2$, so we do not change its status while running this for-loop.

Claim 4.3.6. Let $p_h(t)$ be a point that triggered an event $E$ and whose radius or status changed due to $E$. Let $\tilde r_h(t) = 2^{\tilde k}$ be the updated radius of $p_h(t)$. If the invariant is satisfied before $E$ and no closed point with radius less than or equal to $2^{\ell - 1}$ violates the invariant before running the outer for-loop of algorithm Restore for $k = \ell$, where $\tilde k - 1 \le \ell \le \lceil \log(r_{\max}) \rceil + \lceil \log(4\sqrt{d}) \rceil$, then, after running this for-loop, no closed point with radius $2^\ell$ violates the invariant.

Proof. The proof is by contradiction. Let us assume that, after running the outer for-loop of algorithm Restore for $k = \ell$, there is a closed point $p_i(t)$ with radius $\tilde r_i(t) = 2^\ell$ that has no open point with radius less than or equal to $\tilde r_i(t)$ in $C(p_i(t), 4 \cdot \tilde r_i(t))$. We have to consider the cases (i) $p_i(t) \in S_2$ and (ii) $p_i(t) \notin S_2$.

Case (i). Due to our construction, we have $p_i(t) \in C(m, 2^\ell)$ and there is an open point $p_j(t)$ with radius at most $2^\ell$ in $C(m, 3 \cdot 2^\ell)$ for any center $m$ of a considered cubelet (see Figure 4.3 (b)) because otherwise we would have opened a point with radius $2^\ell$ in $C(m, 2^\ell)$. Note that, in case there is no other point with radius at most $2^\ell$ in $C(m, 2^\ell)$ except $p_i(t)$, we would have opened $p_i$ and $p_j = p_i$. Thus, we have $p_j(t) \in C(p_i(t), 2^{\ell + 2}) = C(p_i(t), 4 \cdot \tilde r_i(t))$, which is a contradiction to the assumption made above.

Case (ii). Let $t'$ be any point of time between the occurrence of $E$ and the latest event before. Then, there was an open point $p_j(t')$ with radius less than or equal to $\tilde r_i(t')$ in the cube $C(p_i(t'), 4 \cdot \tilde r_i(t'))$ because otherwise the invariant was violated before $E$.
Since $E$ had no influence on the radius of $p_i$, we have $\tilde r_i(t') = \tilde r_i(t) = 2^\ell$.

First, let us assume that $p_j = p_h$. Since $p_i(t) \notin S_2 = C(p_h(t), 6 \cdot 2^{\ell + 1})$, we have $p_j(t) \notin C(p_i(t), 6 \cdot 2^{\ell + 1})$. From $p_j(t') \in C(p_i(t'), 4 \cdot \tilde r_i(t')) = C(p_i(t'), 4 \cdot 2^\ell)$ and $p_j(t) \notin C(p_i(t), 6 \cdot 2^{\ell + 1})$, it follows that $p_j$ must have crossed the wall $W_{i,\ell + 3}(t'')$ at a time $t''$ with $t' < t'' < t$. This implies an event at time $t''$, which is a contradiction to the definition of $t'$. Thus, we have $p_j \ne p_h$.

Due to $p_i \ne p_h$, $p_j \ne p_h$, and $p_j(t') \in C(p_i(t'), 4 \cdot \tilde r_i(t'))$, $p_j(t) \in C(p_i(t), 4 \cdot \tilde r_i(t))$ must also be true. Thus, if $p_i$ violates the invariant after $E$, then we must have closed $p_j$ during processing $E$. We only close points with radius less than or equal to $\tilde r_i(t)$ in $S_1$, so we must have $p_j(t) \in S_1$. Since $p_i(t) \notin S_2$ and $p_j(t) \in S_1$, we get $p_j(t) \notin C(p_i(t), 2 \cdot 2^{\ell + 1}) = C(p_i(t), 4 \cdot \tilde r_i(t))$, which is a contradiction.

Now, we can combine the obtained results to get the following lemma:

Lemma 4.3.7. The invariant is satisfied after algorithm KineticFL has handled an event.

Proof. Due to Lemma 4.3.1 and Claim 4.3.2, the invariant is satisfied as long as we do not call algorithm Restore. Now, we show by induction that the invariant is also satisfied after running algorithm Restore. Let $p_h(t)$ be the point whose radius or status changed due to an event $E$, and let $\tilde r_h(t) = 2^{\tilde k}$ be its updated radius. Due to the precondition given above and Claim 4.3.4, the assertion is true for all points with radius at most $2^{\tilde k - 2}$. This proves the base case. By induction hypothesis, the preconditions of Claims 4.3.5 and 4.3.6 hold for any $\ell$ with $\tilde k - 1 \le \ell \le \lceil \log(r_{\max}) \rceil + \lceil \log(4\sqrt{d}) \rceil$. This means that the assertion holds for all points with radius at most $2^{\ell - 1}$. It follows from Claims 4.3.5 and 4.3.6 that the assertion also holds for all points with radius at most $2^\ell$, which completes the proof of the lemma.
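For small test instances, Conditions (a) and (b) of the invariant from Section 4.1.3 can be checked directly by brute force, which is convenient for validating an implementation against Lemma 4.3.7. The following is a minimal sketch under stated assumptions: the function names are hypothetical, each point is represented as a triple (position, special radius, is_open), and a quadratic scan replaces the range queries.

```python
def in_cube(x, center, radius):
    """True if x lies in the axis-parallel cube C(center, radius)."""
    return all(abs(a - c) <= radius for a, c in zip(x, center))

def violates_invariant(i, points):
    """Check Conditions (a) and (b) for the point with index i.

    points is a list of (position, special_radius, is_open) triples.
    """
    pos_i, r_i, open_i = points[i]
    if open_i:
        # (b): another open point with radius <= r_i inside C(p_i, 2 * r_i)
        return any(j != i and open_j and r_j <= r_i
                   and in_cube(pos_j, pos_i, 2 * r_i)
                   for j, (pos_j, r_j, open_j) in enumerate(points))
    # (a): no open point with radius <= r_i inside C(p_i, 4 * r_i)
    return not any(open_j and r_j <= r_i and in_cube(pos_j, pos_i, 4 * r_i)
                   for pos_j, r_j, open_j in points)

def invariant_holds(points):
    """True if no point violates the invariant."""
    return all(not violates_invariant(i, points) for i in range(len(points)))

ok = invariant_holds([((0, 0), 1, True), ((3, 0), 1, False)])
too_close = invariant_holds([((0, 0), 1, True), ((1, 0), 1, True)])
uncovered = invariant_holds([((0, 0), 1, True), ((10, 0), 1, False)])
```

The first configuration satisfies the invariant, the second one violates Condition (b), and the third one violates Condition (a).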
Due to Lemmas 4.3.1, 4.3.7 and 4.1.16, we get the following result:

Lemma 4.3.8. The KDS for the mobile facility location problem in $\mathbb{R}^d$ maintains at each point of time $t$ a subset of open facilities $F(t) \subseteq P(t)$ such that we have
\[ \mathrm{FacLoc}(P(t), F(t)) < (64d + 1) \cdot \mathrm{FacLoc}(P(t), F^*(t)) . \]

4.3.2 Complexity

In the remainder of this chapter, we analyze our KDS in terms of its compactness, locality, responsiveness, and efficiency (see Section 2.4.3 for definitions of these attributes). Lemma 4.2.1 already implies that our KDS is compact and local. Next, we prove that the requirement for being responsive and efficient is also fulfilled.

Lemma 4.3.9. Each update operation requires $O(\log^{d+1}(n) \cdot \log(nR))$ time and $O(\log(nR))$ status changes.

Proof. Due to Lemma 4.2.1, the time to update the event queue is $O(\log(nR))$. Except for algorithm Restore, all further steps require a constant number of range queries on $T_1$ and $T_2$. Due to Lemma 4.1.8, this requires $O(\log^{d+1}(n))$ time.

Next, we examine the time needed for algorithm Restore. We consider the running time resulting for restoring the invariant at points with radius $2^k$. The number of cubelets with radius $2^k$ in $C(p_h(t), 6 \cdot 2^{k+1})$ is $12^d$, where $p_h(t)$ is the point that triggered the event. The query of open or closed points for one cubelet can be answered by one range query on $T_1$ or $T_2$. Due to Lemma 4.1.8, this requires $O(\log^{d+1}(n))$ time. Afterwards, there has to be at most one point inserted and deleted in $T_1$ and $T_2$, which can be done in $O(\log^{d+1}(n))$ time according to Lemma 4.1.8. By summation over all radii, we get a total running time of $O(\log^{d+1}(n) \cdot \log(nR))$.

There can exist at most one open facility with radius $2^k$ in a cubelet with radius $2^k$ because otherwise at least one open facility would violate the invariant. Hence, the number of open facilities with radius $2^k$ that are closed while running algorithm Restore is constant.
Furthermore, we open at most one facility in each cubelet, so the number of opened facilities with radius $2^k$ is also constant. Due to the fact that we handle $O(\log(nR))$ radii, there are $O(\log(nR))$ status changes per event.

Since our KDS processes a total number of $O(n^2 \log^2(nR))$ events (see Lemma 4.2.1), the total processing time is bounded by $O(n^2 \log^{d+1}(n) \cdot \log^3(nR))$. To measure the efficiency as defined in Section 2.4.3, we use a result from [46]. In [46], Gao et al. investigated a problem in the KDS framework which is closely related to the mobile facility location problem. In particular, they provided a randomized KDS to maintain a set of centers among moving points in the plane such that, given a specified radius, all the points are covered by balls of the given radius centered at the chosen center points. Gao et al. showed that the size of the center set is at most a constant factor larger than the minimum one. To prove the efficiency of their KDS, they showed that there is a set of $n$ points moving linearly on the real line that forces any $c$-approximate cover to change $\Omega(n^2/c^2)$ times. With some minor modifications, their result can be transferred to the facility location problem.

Lemma 4.3.10. For any constant $c > 1$, there exists a set $P$ of $n$ points moving linearly on the real line such that any $c$-approximate solution to the mobile facility location problem for $P$ undergoes $\Omega(n^2/c^2)$ status changes.

Proof. We assume that $c$ is an integer and $n = 2cm$ with $m \ge 12c^2$ being also an integer. Let $P$ be the set of $n$ moving points which is defined as follows. We partition $P$ into $m$ groups, each containing $2c$ points. Let the $j$-th point in the $i$-th group be denoted by $p_{i,j}$, where $0 \le i < m$ and $0 \le j < 2c$. The initial position of all the points in the $i$-th group is $i \cdot 2m$. Now, we let the point $p_{i,j}$ move with speed $j \cdot 2m$. Let $p_{i,j}(t)$ be the position of $p_{i,j}$ at the point of time $t$.
Then, we have
\[ p_{i,j}(t) = (i + jt) \cdot 2m , \]
for $0 \le i < m$, $0 \le j < 2c$, and $t \ge 0$. Note that, in the time period from $0$ to $m$, the points often change their ranks on the line. Afterwards, no two points will change their rank any more.

Let us consider the configuration of $P$ at any point of time $t_1 := k + 3c/m$, for some integer $k < m$. At the point of time $t_1$, the location of the point $p_{i,j}$ is
\[ p_{i,j}(t_1) = \left( i + jk + \frac{3cj}{m} \right) \cdot 2m = 2(i + jk)m + 6cj . \]
Let $p_{i,j}$ and $p_{i',j'}$ be any two distinct points. In case that $i + jk \ne i' + j'k$, the distance between $p_{i,j}$ and $p_{i',j'}$ is
\[ |p_{i,j}(t_1) - p_{i',j'}(t_1)| > 2m - 12c^2 \ge 4c \]
at the point of time $t_1$. In case that $i + jk = i' + j'k$, we have $j' \ne j$ since $p_{i,j}$ and $p_{i',j'}$ are distinct. Then, it follows that the distance between $p_{i,j}$ and $p_{i',j'}$ is
\[ |p_{i,j}(t_1) - p_{i',j'}(t_1)| \ge 6c \]
at the point of time $t_1$. Thus, at the point of time $t_1$, no two points are within distance $4c$ of each other.

Assuming that the opening costs as well as the demands of all the points in $P$ are 1, we next analyze an optimal solution for $P$ at the point of time $t_1$. Since the distance between any two points in $P$ is greater than $4c$ at the point of time $t_1$, the only existing optimal solution is to open a facility at each input point. This leads to a total cost of $n$. It follows that any $c$-approximate solution can have a cost of at most $cn$. Let us now consider an approximate solution in which only a $(1 - \alpha)$-fraction of the input points are open facilities. Then, the cost for this solution is more than $n - \alpha n + 4c \cdot \alpha n$ since the distance between any two points is greater than $4c$ at the point of time $t_1$. To ensure that the cost is at most $cn$, $\alpha$ must be smaller than $(c - 1)/(4c - 1)$. Since $1/4 > (c - 1)/(4c - 1)$ for $c > 1$, we obtain that any $c$-approximate solution must open more than $3n/4$ facilities.

Next, we consider the configuration of $P$ at any point of time $t_2 := k$, for some integer $k < m$.
Since $p_{i,j}(t_2) = (i + jk) \cdot 2m$, where $0 \le i < m$, $0 \le j < 2c$, and $k < m$, each point is located at a position $2sm$ for some $s \in \{0, \ldots, m + 2ck\}$. It follows that, at time $t_2$, there exist at most $m + 2ck$ open facilities in an optimal solution, and the optimal facility location cost is at most $m + 2ck$. Thus, a $c$-approximate solution may have at most $c(m + 2ck)$ open facilities. Hence, between the times $t_1$ and $t_2$, any $c$-approximate solution undergoes at least $3n/4 - c(m + 2ck) = n/4 - 2c^2 k$ status changes. Summing up over all $k \in \{0, \ldots, K - 1\}$, the number of status changes is at least
\[ \sum_{k=0}^{K-1} \left( \frac{n}{4} - 2c^2 k \right) > \frac{Kn}{4} - c^2 K^2 . \]
Setting $K = n/(8c^2) < m$, we have established that the total number of changes is $\Omega(n^2/c^2)$.

Due to Lemma 4.3.10 and the fact that we process a total of $O(n^2 \log^2(nR))$ events, our KDS has an efficiency value of $O(\log^2(nR))$. Hence, the KDS for the mobile facility location problem is efficient. We summarize our results in the following theorem:

Theorem 4. Let $P$ be a set of $n$ independently moving points in $\mathbb{R}^d$, where $d$ is a constant dimension. Then, there exists a deterministic KDS for the mobile facility location problem that maintains at any time $t$ a set $F(t) \subseteq P(t)$ such that we have
\[ \mathrm{FacLoc}(P(t), F(t)) < (64d + 1) \cdot \mathrm{FacLoc}(P(t), F^*(t)) . \]
Let $R = \max_{p_i \in P} f_i \cdot \max_{p_i \in P} d_i / (\min_{p_i \in P} f_i \cdot \min_{p_i \in P} d_i)$, where $f_i$ and $d_i$ are the opening cost and the demand of a point $p_i$, respectively. Then, the KDS has a space requirement of $O(n(\log^d(n) + \log(nR)))$, and each event requires $O(\log(nR))$ status changes and $O(\log^{d+1}(n) \cdot \log(nR))$ update time. If the trajectories can be described by bounded-degree polynomials, the total number of updates is $O(n^2 \log^2(nR))$, which results in a total processing time of $O(n^2 \log^{d+1}(n) \cdot \log^3(nR))$. A flight plan update involves $O(\log(nR))$ certificates and requires $O(\log^2(nR))$ time.
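The point configuration used in the proof of Lemma 4.3.10 can be checked numerically for small parameters. The following Python sketch is only an illustration under stated assumptions: the helper names `positions` and `min_gap` and the concrete values $c = 2$, $m = 48$ are chosen here and do not appear in the text. It verifies that, at every time $t_1 = k + 3c/m$, all points are more than $4c$ apart, and that, at every time $t_2 = k$, the points occupy at most $m + 2ck$ distinct positions.

```python
from fractions import Fraction


def positions(c, m, t):
    """Positions p_{i,j}(t) = (i + j*t) * 2m for 0 <= i < m and 0 <= j < 2c."""
    return [(i + j * t) * 2 * m for i in range(m) for j in range(2 * c)]


def min_gap(values):
    """Smallest distance between two points on the line (0 if positions coincide)."""
    ordered = sorted(values)
    return min(b - a for a, b in zip(ordered, ordered[1:]))


# Illustrative parameters (assumption): c = 2 and m = 12 * c^2 = 48, so n = 2cm = 192.
c, m = 2, 48
for k in range(m):
    t1 = k + Fraction(3 * c, m)           # exact rational time, no rounding errors
    assert min_gap(positions(c, m, t1)) > 4 * c
    # At the integer time t2 = k, many points share a position.
    assert len(set(positions(c, m, k))) <= m + 2 * c * k
```

Exact rational arithmetic is used for $t_1$ so that the distance checks are not affected by floating-point rounding.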
5 Facility Location in Data Streams

This chapter deals with a constant-factor approximation algorithm for the cost of the uniform facility location problem over dynamic geometric data streams in a discrete Euclidean space $\{1, \ldots, \Delta\}^d$, where $d$ is a constant. The starting point of our algorithm is the work of Indyk [64]. It gives the best previous approach for approximating the cost of the uniform facility location problem over dynamic geometric data streams and guarantees an approximation factor of $O(\log^2(\Delta))$. In [64], Indyk defines a certain partition of the space into nested square grids and a set of cells in this partition such that the number of these cells gives an $O(\log(\Delta))$-approximation. When estimating the number of these cells, the algorithm of [64] loses another $O(\log(\Delta))$ factor. In Section 5.1, we use a similar partition of the space into nested square grids, and we show that opening a facility in each cell of a subset of the cells defined in [64] leads to a constant-factor approximation of the facility location cost. Moreover, in Section 5.3, we propose an algorithm that maintains this cost sufficiently well in the dynamic geometric data stream model. In this way, we obtain a streaming algorithm for approximating the cost of the uniform facility location problem that improves the approximation factor of the best previous one from $O(\log^2(\Delta))$ to a constant.

5.1 Definition of a Good Estimator

Let $P := \{p_1, \ldots, p_n\}$ be a set of $n$ points from a discrete Euclidean space $\{1, \ldots, \Delta\}^d$, where $d$ is a constant. In the streaming context, $P$ refers to the current point set, i.e., the set of points obtained after having applied an input sequence of insertions and deletions. In this section, we define a good estimator for the uniform Euclidean facility location problem (see Section 2.2 for a definition). Before we derive our estimator for the general case, we show how to deal with some special cases.
5.1.1 Estimator for Special Cases

We consider the following four special cases:

(i) The point set $P$ is empty.
(ii) The point set $P$ is non-empty and contains $O(\lceil f/\Delta \rceil)$ points.
(iii) The opening cost $f$ is at most 1.
(iv) The opening cost $f$ is at least $\Delta^{d+1}$.

In Case (i), there are no points that have to be served by an open facility. Hence, the facility location cost is 0, and our estimator is 0.

We distinguish two subcases of Case (ii), namely $f/\Delta < 1$ and $f/\Delta \ge 1$. In the first subcase, we have $|P| \ge 1$ and $|P| \in O(1)$. Thus, there exists at least one open facility in an optimal solution, so the optimal facility location cost is at least $f$. If each point opens a facility, then the facility location cost is $f \cdot |P| \in O(f)$. Hence, the optimal facility location cost is $\Theta(f)$. In the second subcase, we have $|P| \ge 1$ and $|P| \in O(f/\Delta)$. Again, there exists at least one open facility in an optimal solution, so the optimal facility location cost is at least $f$. Furthermore, the total connection cost of the points in $P$ is $O(f)$ since the longest pairwise distance in $P$ is upper bounded by $\sqrt{d} \cdot \Delta$ and there are $O(f/\Delta)$ points in $P$. Thus, if we open one facility and connect the remaining points in $P$ to this facility, then the resulting facility location cost is $O(f)$. It follows that the optimal facility location cost is again $\Theta(f)$. Hence, in both subcases of Case (ii), we set our estimator to $f$.

In Case (iii), it is optimal to open a facility at each point in $P$ since the opening cost $f$ is at most as big as the minimum pairwise distance in $P$. Hence, our estimator is $f \cdot |P|$.

Case (iv) is similar to Case (ii). We can assume that $P$ is not empty because otherwise we have Case (i). It follows that there has to be at least one open facility in an optimal solution. Thus, the optimal facility location cost is at least $f$.
Furthermore, since the maximum pairwise distance of the points in $P$ is at most $\sqrt{d} \cdot \Delta$ and there are at most $\Delta^d$ points in $P$, the cost to connect all the points in $P$ to the same facility is $O(\Delta^{d+1})$. Thus, the optimal facility location cost is $\Theta(f)$, so we can always safely output $f$ as a constant-factor approximation of the optimal facility location cost.

Distinct Elements Data Structure

To be able to transfer the computation of our estimators to the dynamic geometric data stream model, we have to be able to compute a good estimator for the size of $P$. To obtain this estimator, we use the data structure for counting the number of distinct elements in a data stream, under insertions and deletions, that has been proposed by Kane et al. [72]. This data structure has the following properties:

Lemma 5.1.1 ([72]). Let $\varepsilon$, $0 < \varepsilon < 1$, be a precision parameter. There is a data structure that computes a $(1 \pm \varepsilon)$-approximation of the number of distinct elements in a data stream under insertions and deletions with probability at least $2/3$. The space requirement of the data structure is upper bounded by $O(1/\varepsilon^2 \cdot \log(N) \cdot (\log(1/\varepsilon) + \log(\log(M))))$ bits, where $N$ is the size of the domain of the elements and $M$ is the multiplicity of single elements. The update time of an element is $O(1)$.

Corollary 5.1.2. Let $\varepsilon$, $0 < \varepsilon < 1$, be a precision parameter, and let $\delta$, $0 < \delta < 1$, be an error probability parameter. There is a data structure that computes a $(1 \pm \varepsilon)$-approximation of the number of distinct elements in a data stream under insertions and deletions with probability at least $1 - \delta$. The data structure has a space requirement of $O(1/\varepsilon^2 \cdot \log(N) \cdot (\log(1/\varepsilon) + \log(\log(M))) \cdot \log(1/\delta))$ bits, where $N$ is the size of the domain of the elements and $M$ is the multiplicity of single elements. The update time of an element is $O(\log(1/\delta))$.

Proof.
The data structure from Lemma 5.1.1 outputs a $(1 \pm \varepsilon)$-approximation of the number of distinct elements in a data stream under insertions and deletions with an error probability of at most $1/3$. This error probability can be reduced by using a standard amplification technique. More precisely, we run $\lceil 75 \ln(1/\delta) \rceil$ copies of the algorithm in parallel and output their median value. For each $j \in \{1, \ldots, \lceil 75 \ln(1/\delta) \rceil\}$, let $Z_j$ be the indicator random variable for the event that the $j$-th run of the algorithm outputs a $(1 \pm \varepsilon)$-approximation of the number of distinct elements. By a Chernoff bound, we get
\[ \Pr\left[ \sum_{j=1}^{\lceil 75 \ln(1/\delta) \rceil} Z_j \le \left( 1 - \frac{1}{5} \right) \cdot \mathrm{E}\left[ \sum_{j=1}^{\lceil 75 \ln(1/\delta) \rceil} Z_j \right] \right] \le \exp\left( - \frac{1}{2 \cdot 5^2} \cdot \mathrm{E}\left[ \sum_{j=1}^{\lceil 75 \ln(1/\delta) \rceil} Z_j \right] \right) \le \delta . \]
Thus, the probability that more than a fraction of $8/15$ of the copies computes a $(1 \pm \varepsilon)$-approximation is at least $1 - \delta$. This implies that the median value of the copies is a $(1 \pm \varepsilon)$-approximation with probability at least $1 - \delta$. Now, the assertion follows from Lemma 5.1.1.

Given a stream of insert and delete operations of points from a discrete Euclidean space $\{1, \ldots, \Delta\}^d$ and an error probability parameter $\delta$, we apply the data structure from Corollary 5.1.2 with precision parameter $\varepsilon := 1/2$. Since we assume that the input stream is consistent, i.e., no point is removed that is not present in the current point set and no point is added twice, the multiplicity $M$ is constant. Furthermore, the size $N$ of the domain is $\Delta^d$. Thus, with probability at least $1 - \delta$, we can compute a constant-factor approximation of $|P|$ and, hence, a constant-factor approximation of the facility location cost in the four special cases using $O(d \cdot \log(\Delta) \cdot \log(1/\delta))$ space. An insertion or deletion of a point requires $O(\log(1/\delta))$ time.

5.1.2 Estimator Based on a Space Partition

In the remainder of this chapter, we will always assume that the size of $P$ is $\Omega(f/\Delta)$ and $1 < f < \Delta^{d+1}$. Furthermore, we assume that the value $f$ is a power of 2.
Note that, by rounding the opening cost up to the next power of 2, the facility location cost and also our estimator are increased by a factor of at most 2.

In order to deal with the general case, we define a certain partition of the input space and relate this partition to the cost for the uniform Euclidean facility location problem. In particular, if we assign to each cell in this partition a weight that corresponds to the number of points inside the cell multiplied by the side length of the cell, the sum of these weights is a constant-factor approximation of the cost for the uniform facility location problem. We will use this fact in Section 5.3 to develop an approximation algorithm in the dynamic geometric data stream model.

To compute the above-mentioned space partition for $P$, we impose $\lceil \log(\Delta) \rceil + 1$ nested square grids over the point space, denoted by $G(0), G(1), \ldots, G(\lceil \log(\Delta) \rceil)$. The side length of each cell in grid $G(i)$ is $2^i$. We say that the grid cells in $G(i)$ are in level $i$. The set of neighbors $\Gamma(C)$ of a cell $C \in G(i)$ is the set of all the cells in grid $G(i)$ that share some part of their boundary with $C$. Note that all the cells located at the border of some grid have fewer than $3^d - 1$ neighbors. For example, there is only one cell in grid $G(\lceil \log(\Delta) \rceil)$, and this cell has no neighbors. All remaining cells have exactly $3^d - 1$ neighbors. Furthermore, we need the definition of a parent cell and a subcell. The parent cell of a cell $C \in G(i)$ in any level $i \in [\lceil \log(\Delta) \rceil]$ is the cell in $G(i + 1)$ that contains $C$. The subcells of a cell $C \in G(i)$ in any level $i \in \{1, 2, \ldots, \lceil \log(\Delta) \rceil\}$ are all the cells in $G(i - 1)$ that are contained in $C$.

In each grid $G(i)$, the active and maximal-useful cells play a decisive role in the space partitioning. They are defined as follows:

Definition 5.1.3 (Active Cell). A cell in any level $i \in [\lceil \log(\Delta) \rceil + 1]$ is called active if it contains at least $a(i) := f/2^i$ points of $P$.
A grid cell that is not active is inactive.

Observe that if a cell $C$ is active, then all the cells that contain $C$ are active as well.

Definition 5.1.4 (Useful and Maximal-Useful Cell). A cell $C$ in any level $i \in [\lceil \log(\Delta) \rceil + 1]$ is called useful if neither $C$ nor any of its neighbors $\Gamma(C)$ in grid $G(i)$ contains an active subcell. A grid cell that is not useful is useless. A cell in any level $i \in [\lceil \log(\Delta) \rceil]$ is maximal-useful if it is useful but its parent cell is useless. The cell in level $\lceil \log(\Delta) \rceil$ is maximal-useful if it is useful.

Our space partition consists of all maximal-useful cells. Let $\mathrm{SP}(i)$ be the set of all maximal-useful cells in grid $G(i)$, and let $\mathrm{SP} := \bigcup_i \mathrm{SP}(i)$ be the set of all maximal-useful cells. The cells in $\mathrm{SP}$ form a partition of the input space. This follows from the fact that we can construct $\mathrm{SP}$ in a process similar to that of building a quadtree. In general, a quadtree for a $d$-dimensional point set is a rooted tree in which every node corresponds to a square cell. Each internal node $v$ has $2^d$ children whose corresponding cells form a partition of the cell of $v$. Hence, the cells corresponding to the leaf nodes of the quadtree form a partition of the space, which is called a quadtree partition. Following this definition, we can construct our space partition by starting from the cell in the coarsest grid $G(\lceil \log(\Delta) \rceil)$ and recursively splitting each useless cell into $2^d$ equal-sized square subcells. The final space partition consists only of useful cells whose parent cells are useless. Hence, we obtain $\mathrm{SP}$ as desired. An illustration of a space partitioning is given in Figure 5.1.

The key idea is now to place an open facility in each active cell in $\mathrm{SP}$. Figure 5.2 illustrates how this is related to a solution for the uniform facility location problem. We remark that our strategy of choosing the set of open facilities is a refinement of the strategy proposed in [64].
More precisely, the open facilities in [64] are chosen from all active cells in $\bigcup_{i=0}^{\lceil \log(\Delta) \rceil} G(i)$, whereas we choose the open facilities from a subset of these cells.

Figure 5.1: Example illustrating the quadtree partition for a set of points from $\{1, \ldots, 128\}^2$ and for the opening-cost value $f = 64$. Active cells are colored in gray. Useless cells are indicated by thick borders. Subcells of a cell are indicated by dashed borders. (a)-(c) The quadtree partition for subsequent depths of the recursion. (d) The final quadtree partition and its active cells.

Figure 5.2: (a) The final quadtree partition for a set of points from $\{1, \ldots, 128\}^2$ and for the opening-cost value $f = 64$. Active cells are colored in gray. (b) Solution for the uniform facility location problem whose cost is approximated by the algorithm. The red points are the open facilities. Connections between points are indicated by line segments.

Next, we define a value $\mathrm{FL}(P, f)$ that is based on the space partition $\mathrm{SP}$ and yields a constant-factor approximation of the cost of an optimal solution for the uniform Euclidean facility location problem. Let $n_P(C)$ be the number of points in the set $P$ that are contained in the cell $C$. Then, the estimator for the facility location cost is defined as
\[ \mathrm{FL}(P, f) := \sum_{i=0}^{\lceil \log(\Delta) \rceil} \sum_{C \in \mathrm{SP}(i)} n_P(C) \cdot 2^i . \tag{5.1} \]

5.1.3 Properties of the Space Partition

Before we prove that $\mathrm{FL}(P, f)$ is indeed an $O(1)$-approximation of the cost of the uniform Euclidean facility location problem, we discuss some properties of the space partition that are needed in the analysis. We say that two cells in a space partition are neighbors if they share at least one point of their boundary.
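The estimator of Equation (5.1) can be sketched offline in a few lines of Python. This is a hedged illustration, not the streaming algorithm of Section 5.3: the function name `estimator_fl` is hypothetical, and the sketch assumes half-open grid cells, i.e., a point $p$ lies in the level-$i$ cell with index vector $\lfloor (p - 1)/2^i \rfloor$ (taken coordinate-wise). It follows the recursive construction described above: starting from the single cell of the coarsest grid, every useless cell is split into its $2^d$ subcells, and every cell that is not split contributes its point count times its side length.

```python
from collections import Counter
from itertools import product
from math import ceil, log2


def estimator_fl(points, f, delta):
    """Offline sketch of FL(P, f); assumes half-open cells and f a power of 2."""
    d = len(points[0])
    top = ceil(log2(delta))
    # counts[i][cell] = n_P(C) for each non-empty cell C of level i.
    counts = [Counter(tuple((x - 1) >> i for x in p) for p in points)
              for i in range(top + 1)]
    offsets = list(product((-1, 0, 1), repeat=d))
    # useless[i]: level-i cells that contain an active subcell or have a
    # neighbor containing one. Level-0 cells have no subcells.
    useless = [set()]
    for i in range(1, top + 1):
        bad = set()
        for cell, n in counts[i - 1].items():
            if n * 2 ** (i - 1) >= f:      # active: n >= a(i-1) = f / 2^(i-1)
                parent = tuple(c >> 1 for c in cell)
                bad.update(tuple(a + o for a, o in zip(parent, off))
                           for off in offsets)
        useless.append(bad)
    total = 0
    stack = [(top, (0,) * d)]
    while stack:
        i, cell = stack.pop()
        if counts[i][cell] == 0:
            continue                       # empty cells contribute nothing
        if cell in useless[i]:
            stack.extend((i - 1, tuple(2 * c + b for c, b in zip(cell, bits)))
                         for bits in product((0, 1), repeat=d))
        else:
            total += counts[i][cell] * 2 ** i   # n_P(C) times side length 2^i
    return total
```

For example, with $\Delta = 8$ and $f = 4$, the four points $(1,1), (1,2), (2,1), (2,2)$ end up in a single cell of side length 2, so the sketch returns $4 \cdot 2 = 8$, which is close to the cost of opening one facility and connecting the other three points to it.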
Furthermore, the distance between two cells is defined as the minimum distance between two points such that one point lies on the boundary of one cell and the other point lies on the boundary of the other cell. Now, we show that the space partition $\mathrm{SP}$ has the following properties:

Lemma 5.1.5. The set $\mathrm{SP}$ of all maximal-useful cells has the following five properties:

(i) The side length of each cell in $\mathrm{SP}$ differs from the side length of each of its neighbors by a factor of at most 2, i.e., the space partition is balanced.
(ii) Let $i \in [\lceil \log(\Delta) \rceil + 1]$ be any level, and let $C$ be any useless cell in $G(i)$. Then, there exists an active cell with side length at most $2^{i-1}$ in $\mathrm{SP}$ that has a distance of at most $\sqrt{d} \cdot 2^{i+1}$ from $C$.
(iii) Let $i \in [\lceil \log(\Delta) \rceil + 1]$ be any level, and let $C$ be any inactive cell in $\mathrm{SP}(i)$. Then, there exists an active cell with side length at most $2^i$ in $\mathrm{SP}$ that has a distance of at most $5\sqrt{d} \cdot 2^i$ from $C$.
(iv) Let $i \in [\lceil \log(\Delta) \rceil + 1]$ be any level, and let $C$ be any active cell in $\mathrm{SP}(i)$. Then, we have
\[ \frac{f}{2^i} \le n_P(C) < 2^{d+1} \cdot \frac{f}{2^i} . \]
(v) Let $i \in [\lceil \log(\Delta) \rceil + 1]$ be any level. Then, we have $0 < a(i) < \Delta^{d+1}$, and we have either $a(i) \ge 1$ or $\mathrm{SP}(i)$ contains no non-empty cell.

Proof. (i) Obviously, there cannot be a cell $C \in \mathrm{SP}(0) \cup \mathrm{SP}(1)$ that has a neighbor in $\mathrm{SP}$ whose side length is less than half the side length of $C$. We prove the assertion for any level $i \in \{2, \ldots, \lceil \log(\Delta) \rceil\}$ by contradiction. Assume that $C_{\mathrm{big}}$ is a cell from $\mathrm{SP}(i)$ that has a neighbor cell $C_{\mathrm{small}}$ in $\mathrm{SP}(j)$, $j \le i - 2$, i.e., $C_{\mathrm{small}}$ is a neighbor cell with side length $2^j \le 2^{i-2}$. This situation is illustrated in Figure 5.3. Let $C'_{\mathrm{small}}$ be the parent cell of $C_{\mathrm{small}}$. Since $C_{\mathrm{small}}$ is maximal-useful, its parent $C'_{\mathrm{small}}$ is useless. Hence, $C'_{\mathrm{small}}$ or at least one neighbor in $\Gamma(C'_{\mathrm{small}})$ has an active subcell (the light gray area in Figure 5.3). This subcell is either contained in $C_{\mathrm{big}}$ or one of its neighbors $\Gamma(C_{\mathrm{big}})$.
Hence, $C_{\mathrm{big}}$ is also a useless cell and cannot be a cell in $\mathrm{SP}(i)$, which is a contradiction.

Figure 5.3: Arrangement of cells that leads to the desired contradiction in the proof of the first property stated in Lemma 5.1.5. The cell $C_{\mathrm{small}}$ is indicated by the dark gray square. The area containing an active subcell is colored in light gray.

(ii) The cells in level 0 are all useful by definition since they contain no subcells. To prove the assertion for the remaining levels, we proceed by induction. Let $\ell$ be the smallest level such that $\mathrm{SP}(\ell)$ is not empty. Let $C$ be a useless cell in grid $G(\ell + 1)$. Since $C$ is useless, either $C$ or one of its neighbors in $\Gamma(C)$ contains an active subcell $A$. By the choice of $\ell$, we know that $A$ is maximal-useful and in $\mathrm{SP}$. Furthermore, $A$ has a side length of $2^\ell$ and a distance of at most $\sqrt{d} \cdot 2^\ell$ from $C$, which is less than $\sqrt{d} \cdot 2^{\ell+1}$. This proves the base case. Now, let $C$ be a useless cell in grid $G(i)$. By definition, either $C$ or one of its neighbors in $\Gamma(C)$ contains an active subcell. Let $A$ be such a subcell. The cell $A$ has side length $2^{i-1}$ and a distance of at most $\sqrt{d} \cdot 2^{i-1}$ from $C$. If $A$ is useful, it is maximal-useful and in $\mathrm{SP}$, so we are done. Otherwise, $A$ is useless and in grid $G(j)$, $j < i$. By the induction hypothesis, there is an active cell $A'$ with side length at most $2^{j-1}$ in $\mathrm{SP}$ that has a distance of at most $\sqrt{d} \cdot 2^{j+1} \le \sqrt{d} \cdot 2^i$ from $A$. Since $A$ has a diagonal of length $\sqrt{d} \cdot 2^{i-1}$, we get that the distance from $C$ to $A'$ is at most $2 \cdot \sqrt{d} \cdot 2^{i-1} + \sqrt{d} \cdot 2^i = \sqrt{d} \cdot 2^{i+1}$.

(iii) Since we assume that $|P| \in \Omega(f/\Delta)$, the cell in $G(\lceil \log(\Delta) \rceil)$ is active. Accordingly, let $C \in \mathrm{SP}$ be an inactive cell in a level $i \in [\lceil \log(\Delta) \rceil]$. Let $C'$ be the parent cell of $C$. By Property (ii), there is an active cell with side length at most $2^i$ in $\mathrm{SP}$ that has a distance of at most $\sqrt{d} \cdot 2^{i+2}$ from $C'$. Hence, the distance from $C$ is at most $\sqrt{d} \cdot 2^i + \sqrt{d} \cdot 2^{i+2} = 5\sqrt{d} \cdot 2^i$.
(iv) The first inequality of the assertion follows from our definition of an active cell. Since each cell in level 0 contains at most $2^d$ points from $P$ and $f > 1$, the second inequality is satisfied for each cell in $\mathrm{SP}(0)$. Let $i \in \{1, \ldots, \lceil \log(\Delta) \rceil\}$ be any level and $C$ be any cell in $\mathrm{SP}(i)$. The number of points in $C$ is less than $2^{d+1} \cdot f/2^i$ because each of the $2^d$ subcells of $C$ is inactive, i.e., there are fewer than $f/2^{i-1}$ points inside such a subcell.

(v) Recall that $a(i) = f/2^i$. Since $f > 1$, we have $a(i) > 0$. Furthermore, it follows from $f < \Delta^{d+1}$ and $i \ge 0$ that $a(i) = f/2^i < \Delta^{d+1}$. Obviously, we get $a(0) = f/2^0 > 1$ since $f > 1$. Thus, in case $i = 0$, we always have $a(i) \ge 1$. For any level $i \in \{1, \ldots, \lceil \log(\Delta) \rceil\}$, the proof is by contradiction. Let us assume that there is a non-empty cell $C \in \mathrm{SP}(i)$ with $a(i) \le 1/2$. Then, we have $a(i - 1) \le 1$, so $C$ contains an active subcell, which is a contradiction to the construction of the space partition $\mathrm{SP}$. Hence, we have $a(i) > 1/2$. In addition, since $a(i) = f/2^i > 1/2$ and we assume that $f$ is a power of 2, we have $a(i) \ge 1$.

5.1.4 Analysis of the Estimator

In this section, we analyze our estimator $\mathrm{FL}(P, f)$. We separate the analysis into two parts. We give an appropriate lower bound in the first part and an appropriate upper bound in the second part. For this purpose, let $\mathrm{FacLoc}^*(P, f)$ be the cost of an optimal facility location solution for $P$.

Lemma 5.1.6. $\mathrm{FL}(P, f) \in \Omega(\mathrm{FacLoc}^*(P, f))$.

Proof. Our goal is to define a set of open facilities such that the induced facility location cost is $O(\mathrm{FL}(P, f))$. This proves $\mathrm{FL}(P, f) \in \Omega(\mathrm{FacLoc}^*(P, f))$. We will show that it suffices to open one facility in each active cell in $\mathrm{SP}$. We give an upper bound on the contribution of the points in each cell in $\mathrm{SP}$. For any level $i \in [\lceil \log(\Delta) \rceil + 1]$, each active cell $C \in \mathrm{SP}(i)$ contributes at most $f + n_P(C) \cdot \sqrt{d} \cdot 2^i$ because we open one facility in $C$ and connect the points inside of $C$ to this facility.
Since $C$ is active, it contains at least $f/2^i$ points. Thus, we have $f + n_P(C) \cdot \sqrt{d} \cdot 2^i \in O(n_P(C) \cdot 2^i)$. The points in each inactive cell $C$ in $\mathrm{SP}$ are connected to the nearest open facility. Due to Property (iii) in Lemma 5.1.5, for each inactive cell $C \in \mathrm{SP}(i)$, there exists an active cell with side length at most $2^i$ in $\mathrm{SP}$ that has a distance of at most $5\sqrt{d} \cdot 2^i$ from $C$. Thus, the connection cost for the points in $C$ is at most $n_P(C) \cdot (5\sqrt{d} \cdot 2^i + \sqrt{d} \cdot 2^i) \in O(n_P(C) \cdot 2^i)$. Summing up over all cells in $\mathrm{SP}$ gives that the cost of the defined solution is $O(\mathrm{FL}(P, f))$.

Lemma 5.1.7. $\mathrm{FL}(P, f) \in O(\mathrm{FacLoc}^*(P, f))$.

Proof. Let $F^*$ be a set of optimal open facilities. Since we assume that $P$ is not empty, the set $F^*$ is not empty and we have $\mathrm{FacLoc}^*(P, f) \in \Omega(f)$. Now, for any level $i \in [\lceil \log(\Delta) \rceil + 1]$, we partition the set $\mathrm{SP}(i)$ into two subsets $\mathrm{SP}_{\mathrm{near}}(i)$ and $\mathrm{SP}_{\mathrm{dist}}(i)$. The set $\mathrm{SP}_{\mathrm{near}}(i)$ contains every cell whose distance to its nearest open facility in $F^*$ is less than $2^{i-1}$, i.e.,
\[ \mathrm{SP}_{\mathrm{near}}(i) := \{ C \in \mathrm{SP}(i) \mid \min_{q \in F^*} D(q, C) < 2^{i-1} \} . \]
The set $\mathrm{SP}_{\mathrm{dist}}(i)$ contains all remaining cells from $\mathrm{SP}(i)$, i.e.,
\[ \mathrm{SP}_{\mathrm{dist}}(i) := \{ C \in \mathrm{SP}(i) \mid \min_{q \in F^*} D(q, C) \ge 2^{i-1} \} . \]
For each cell $C \in \bigcup_{i=0}^{\lceil \log(\Delta) \rceil} \mathrm{SP}_{\mathrm{dist}}(i)$, the cost to connect the points inside of $C$ to the nearest open facility in $F^*$ is at least $n_P(C) \cdot 2^{i-1}$. This is exactly half of the cost that we charge for the cell $C$ by the definition of $\mathrm{FL}(P, f)$. Thus, the cost that we charge for points contained in $\bigcup_{i=0}^{\lceil \log(\Delta) \rceil} \mathrm{SP}_{\mathrm{dist}}(i)$ is upper bounded by twice the optimal connection cost.

Let $C \in \mathrm{SP}(j)$ be a cell in any level $j \in [\lceil \log(\Delta) \rceil + 1]$ that contains an optimal facility $q \in F^*$. Furthermore, let $C' \in \mathrm{SP}(i)$ be any cell in any level $i \in [\lceil \log(\Delta) \rceil + 1]$ such that $C'$ is not a direct neighbor of $C$. Due to Property (i) in Lemma 5.1.5, $\mathrm{SP}$ is a balanced space partition, so the neighbors of $C'$ have a side length of at least $2^{i-1}$. It follows that there is at least one cell with side length at least $2^{i-1}$ between $C'$ and $C$.
Thus, the distance from $q$ to $C'$ is at least $2^{i-1}$. Hence, we have $C' \notin \mathrm{SP}_{\mathrm{near}}(i)$. This implies that only direct neighbors of $C$ can be in $\bigcup_{i=0}^{\lceil \log(\Delta) \rceil} \mathrm{SP}_{\mathrm{near}}(i)$. Due to the fact that $\mathrm{SP}$ is a balanced space partition, the neighbors of $C$ have a side length of at least $2^{j-1}$. Thus, the number of neighbors of $C$ is at most $4^d - 2^d$. It follows that fewer than $4^d$ cells in $\bigcup_{i=0}^{\lceil \log(\Delta) \rceil} \mathrm{SP}_{\mathrm{near}}(i)$ are within a distance of less than half of their side length from $q$. Hence, we have
\[ \sum_{i=0}^{\lceil \log(\Delta) \rceil} |\mathrm{SP}_{\mathrm{near}}(i)| < 4^d \cdot |F^*| . \]
Now, for each cell in $\mathrm{SP}$, we charge a cost of $O(f)$ by the definition of $\mathrm{FL}(P, f)$. This is due to Property (iv) in Lemma 5.1.5, which implies that a cell in $\mathrm{SP}(i)$ contains at most $2^{d+1} \cdot f/2^i$ points. Thus, the cost that arises for all cells in $\bigcup_{i=0}^{\lceil \log(\Delta) \rceil} \mathrm{SP}_{\mathrm{near}}(i)$ is $O(f \cdot |F^*|)$, which is at most a constant factor larger than the optimal opening cost.

5.2 Randomized Algorithm

In this section, we describe a randomized algorithm that implements the ideas of Section 5.1 and that, with some modifications, can be transformed into a streaming algorithm, which we will do in Section 5.3.

The approach of the algorithm is closely related to performing the quadtree partition into maximal-useful cells. We try to identify all active cells in the grids. For that purpose, for each level $i \in [\lceil \log(\Delta) \rceil + 1]$, we maintain one random sample set and take each point into this set with probability $\alpha(i) := \min\{1/a(i), 1\}$. Recall that a cell in grid $G(i)$ is active if it contains at least $a(i) = f/2^i$ points. Thus, in expectation, we will see at least one point in every active cell of grid $G(i)$. Observe that some sample points will also end up in inactive cells. However, we will show in the analysis that this does not negatively affect our algorithm. We call a cell in grid $G(i)$ marked if it contains at least one sample point.
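The sampling and marking step can be illustrated with a short Python sketch. It is only an approximation of the idea under stated assumptions: the names `sampled` and `marked_cells` are hypothetical, a pseudo-random generator seeded with the level and the point stands in for truly independent random choices (so that the same point always receives the same decision, also when it is deleted later), and cells are taken to be half-open, i.e., a point $p$ lies in the level-$i$ cell with index vector $\lfloor (p - 1)/2^i \rfloor$.

```python
import random
from math import ceil, log2


def sampled(point, level, f, seed=0):
    """Reproducible coin flip that succeeds with probability min(2^level / f, 1).

    Assumption: a PRNG seeded with (seed, level, point) replaces true randomness.
    """
    alpha = min(2 ** level / f, 1.0)       # alpha(i) = min{1 / a(i), 1}
    return random.Random(f"{seed}:{level}:{point}").random() < alpha


def marked_cells(points, f, delta, seed=0):
    """For each level i, the set of cells of G(i) that contain a sample point."""
    top = ceil(log2(delta))
    return [{tuple((x - 1) >> i for x in p)
             for p in points if sampled(p, i, f, seed)}
            for i in range(top + 1)]
```

In levels with $a(i) \le 1$, every point is sampled, so every non-empty cell of such a level is marked. In lower levels, a cell with $n_P(C)$ points contains $n_P(C) \cdot 2^i / f$ sample points in expectation.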
The key idea is to go through all levels $i \in [\lceil \log(\Delta) \rceil + 1]$ and to open one facility in every marked cell $C$ in grid $G(i)$ such that the following two conditions are satisfied:

(a) No subcell of $C$ is marked.
(b) No smaller cell within a distance of less than $2^{i-1}$ from $C$ is marked.

The motivation of Condition (b) is that, in our space partition $\mathrm{SP}$, the side lengths of neighbor cells differ at most by a factor of 2. Hence, a marked cell from $\mathrm{SP}$ prevents at most a constant number of other cells from $\mathrm{SP}$ from opening a facility.

Finally, we obtain a new estimator for the cost of the uniform facility location problem based on our randomized algorithm. Let $\mathcal{F}$ denote the set of cells in which we open a facility. Then, the estimator is $\mathrm{FL}_{\mathrm{rand}}(P, f) := f \cdot |\mathcal{F}|$.

5.2.1 Random Sampling

In each level $i \in [\lceil \log(\Delta) \rceil + 1]$, we would like to sample each point from $P$ independently at random with probability $\alpha(i) = \min\{1/a(i), 1\}$. Since we have insert as well as delete operations of points, the random experiments for the points must be reproducible. Therefore, for each level $i \in [\lceil \log(\Delta) \rceil + 1]$, we use a function $h_i : \{1, \ldots, \Delta\}^d \to \{0, 1\}$ that maps a point to the value 1 with probability $\alpha(i) = \min\{1/a(i), 1\}$. For any point $p \in P \subseteq \{1, \ldots, \Delta\}^d$, if $h_i(p) = 1$, then $p$ is a sample point. Otherwise, $p$ is not a sample point. We can construct a function $h_i(\cdot)$ with the following properties:

Lemma 5.2.1. For each level $i \in [\lceil \log(\Delta) \rceil + 1]$, there is a function $h_i : \{1, \ldots, \Delta\}^d \to \{0, 1\}$ that maps each point $p \in \{1, \ldots, \Delta\}^d$ independently at random to a value in $\{0, 1\}$ such that
\[ \Pr[h_i(p) = 1] = \min\{1/a(i), 1\} . \]
The function $h_i(\cdot)$ uses $O(\Delta^d \cdot \log(\Delta))$ random bits. For each point $p \in \{1, \ldots, \Delta\}^d$, the value of $h_i(p)$ can be computed in $O(\log(\Delta))$ time.

Proof. Let $i \in [\lceil \log(\Delta) \rceil + 1]$ be any fixed level. If $\alpha(i) = 1$, the assertion of the lemma is obviously true. Hence, in the following, we will assume that $\alpha(i) < 1$ and, thus, $a(i) > 1$.
Since $i \ge 0$ and $f < \Delta^{d+1}$, we have $a(i) = f/2^i < \Delta^{d+1}$. In addition, since $f$ is a power of 2 with positive exponent and $a(i) > 1$, the value $a(i)$ is also a power of 2 with positive exponent. Observe that $a(i)$ can be represented by $\ell := \lceil (d + 1) \log(\Delta) \rceil$ bits. Now, for each point $p \in \{1, \ldots, \Delta\}^d$, we generate a bit vector of length $\ell$, where each bit is chosen independently and uniformly at random. Let $r_i(p) := (r_i^{(1)}, r_i^{(2)}, \ldots, r_i^{(\ell)})$ be the generated bit sequence for $p$. The function $h_i$ maps $p$ to 1 if $r_i^{(j)} = 0$ for $j < \log(a(i))$ and $r_i^{(\log(a(i)))} = 1$. For any $k \in \{1, \ldots, \ell\}$, the event that $r_i^{(j)} = 0$ for $j < k$ and $r_i^{(k)} = 1$ happens with probability $2^{-k}$. Thus, the probability that $h_i$ maps the point $p$ to 1 is
\[ \Pr[h_i(p) = 1] = 2^{-\log(a(i))} = \frac{1}{a(i)} = \alpha(i) . \]
Since we generate $\ell$ random bits for each point in $\{1, \ldots, \Delta\}^d$, the function $h_i(\cdot)$ uses $\ell \cdot \Delta^d \in O(\Delta^d \cdot \log(\Delta))$ random bits in total. To compute $h_i(p)$ for any $p \in \{1, \ldots, \Delta\}^d$, we have at most one read and compare operation for each bit in $r_i(p)$. Thus, the time to compute $h_i(p)$ is $O(\log(\Delta))$.

The issue of full randomness will be discussed in Section 5.3.

5.2.2 Analysis of the Estimator

We will show that, with high constant probability, our randomized algorithm computes a facility location cost that is a constant-factor approximation of the estimator $\mathrm{FL}(P, f)$.

For any level $i \in [\lceil \log(\Delta) \rceil + 1]$, let $\mathcal{F}(i)$ be the set of marked cells in $G(i)$ that do not have a marked subcell and that do not have a smaller marked cell within a distance of less than $2^{i-1}$. Then, the cells in the set $\bigcup_{i=0}^{\lceil \log(\Delta) \rceil} \mathcal{F}(i)$ are exactly the cells in which the algorithm opens its facilities, i.e., we have $\mathcal{F} = \bigcup_{i=0}^{\lceil \log(\Delta) \rceil} \mathcal{F}(i)$. Thus, the estimator of the randomized algorithm is given by
\[ \mathrm{FL}_{\mathrm{rand}}(P, f) = f \cdot \sum_{i=0}^{\lceil \log(\Delta) \rceil} |\mathcal{F}(i)| . \tag{5.2} \]
Next, we derive appropriate lower and upper bounds of the estimator $\mathrm{FL}_{\mathrm{rand}}(P, f)$.

Lemma 5.2.2. $\mathrm{FL}_{\mathrm{rand}}(P, f) \in \Omega(\mathrm{FL}(P, f))$ with probability at least $15/16$.

Proof.
Let us consider the space partition $\mathrm{SP}$ defined in Section 5.1. We are interested in the number of marked cells from $\mathrm{SP}$. However, $f$ multiplied by the number of marked cells from $\mathrm{SP}$ does not immediately give a lower bound on $\mathrm{FL}_{\mathrm{rand}}(P, f)$. The reason is that, for any level $i \in [\lceil \log(\Delta) \rceil + 1]$, we do not open a facility in a marked cell in $\mathrm{SP}(i)$ if there is a smaller cell within a distance of less than $2^{i-1}$ that is also marked. Since neighbor cells in $\mathrm{SP}$ differ by a factor of at most 2 in their side lengths, every marked cell in $\mathrm{SP}$ can prevent at most a constant number of other marked cells in $\mathrm{SP}$ from opening a facility. Thus, if we can show that the number of marked cells in $\mathrm{SP}$ is $\Omega(\mathrm{FL}(P, f)/f)$ with sufficiently high probability, then the assertion follows.

We say that a point $p \in P$ that lies in a cell of $\mathrm{SP}(i)$ is marked if it is sampled in level $i$. Let $X_p$ denote the indicator random variable for the event that $p$ is marked. Then, the expected number of marked points in any cell $C \in \mathrm{SP}(i)$ is
\[ \mathrm{E}\left[ \sum_{p \in C} X_p \right] = n_P(C) \cdot \min\left\{ \frac{1}{a(i)}, 1 \right\} . \tag{5.3} \]
By the definition of $a(i)$, we obtain that
\[ \mathrm{E}\left[ \sum_{p \in C} X_p \right] \le \frac{n_P(C)}{a(i)} = \frac{n_P(C) \cdot 2^i}{f} . \]
Due to Property (iv) in Lemma 5.1.5, for every cell $C \in \mathrm{SP}$, we get
\[ \mathrm{E}\left[ \sum_{p \in C} X_p \right] < 2^{d+1} . \]
Hence, we can group the cells from $\mathrm{SP}$ into sets $S_1, \ldots, S_\ell$ such that, for each set $S_j$ with $1 \le j < \ell$, we have
\[ 40 \le \sum_{C \in S_j} \sum_{p \in C} \mathrm{E}[X_p] < 40 + 2^{d+1} \tag{5.4} \]
and
\[ \sum_{C \in S_\ell} \sum_{p \in C} \mathrm{E}[X_p] < 40 + 2^{d+1} \tag{5.5} \]
for the set $S_\ell$.

Next, we analyze the contribution of the sets $S_1, \ldots, S_\ell$ to the estimator $\mathrm{FL}(P, f)$. Due to Property (v) in Lemma 5.1.5, for any cell $C \in \mathrm{SP}(i)$, we have either $a(i) \ge 1$ or $a(i) > 0$ and $n_P(C) = 0$. It follows from Equation (5.3) that
\[ \mathrm{E}\left[ \sum_{p \in C} X_p \right] \ge \frac{n_P(C)}{a(i)} . \]
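Due to Inequalities (5.4) and (5.5), the contribution of each $S_j$, $1 \le j \le \ell$, to the estimator $\mathrm{FL}(P, f)$ is
\[ \sum_{i=0}^{\lceil \log(\Delta) \rceil} \sum_{C \in S_j \cap \mathrm{SP}(i)} n_P(C) \cdot 2^i \le \sum_{i=0}^{\lceil \log(\Delta) \rceil} \sum_{C \in S_j \cap \mathrm{SP}(i)} 2^i \cdot a(i) \cdot \mathrm{E}\left[ \sum_{p \in C} X_p \right] = f \cdot \sum_{i=0}^{\lceil \log(\Delta) \rceil} \sum_{C \in S_j \cap \mathrm{SP}(i)} \mathrm{E}\left[ \sum_{p \in C} X_p \right] < f \cdot (40 + 2^{d+1}) \in O(f) . \]
Hence, we have $\mathrm{FL}(P, f) \in O(f \ell)$. This means that the assertion of the lemma follows if the number of marked cells in $\mathrm{SP}$ is $\Omega(\ell)$. We consider the cases $\ell \le 2$ and $\ell > 2$.

First, we consider the case that $\ell > 2$. We define the random variable
\[ Y_j := \sum_{C \in S_j} \sum_{p \in C} X_p . \]
By a Chernoff bound, we obtain
\[ \Pr\left[ Y_j \le \left( 1 - \frac{1}{2} \right) \cdot \mathrm{E}[Y_j] \right] \le \exp\left( - \frac{\mathrm{E}[Y_j]}{8} \right) . \]
This implies that $\Pr[Y_j \le 20] \le 1/e^5$ for $1 \le j < \ell$. Hence, with probability at least $1 - 1/e^5$, at least one of the cells in $S_j$ is marked. For $1 \le j < \ell$, let $Z_j$ denote the indicator random variable for the event that no cell in $S_j$ is marked. The expected value of $Z_j$ is at most $1/e^5$. By Markov's inequality, we get
\[ \Pr\left[ \sum_{j=1}^{\ell-1} Z_j \ge 32 \cdot \mathrm{E}\left[ \sum_{j=1}^{\ell-1} Z_j \right] \right] \le \frac{1}{32} . \]
Thus, we have
\[ \sum_{j=1}^{\ell-1} Z_j < 32 \cdot \mathrm{E}\left[ \sum_{j=1}^{\ell-1} Z_j \right] \le 32 \cdot \frac{\ell - 1}{e^5} \le \frac{\ell}{3} \]
with probability at least $31/32$. It follows that, with probability at least $31/32$, the number of marked cells in $\mathrm{SP}$ is at least $\ell - 1 - \ell/3 \in \Omega(\ell)$. Thus, we have $\mathrm{FL}_{\mathrm{rand}}(P, f) \in \Omega(f \ell)$, so $\mathrm{FL}_{\mathrm{rand}}(P, f) \in \Omega(\mathrm{FL}(P, f))$.

In case that $\ell \le 2$, we have $\mathrm{FL}(P, f) \in O(f)$. Furthermore, we can assume that $P$ contains at least $32 \cdot \lceil f/\Delta \rceil$ points because otherwise we have one of the special cases considered in Section 5.1.1. It follows that the expected number of marked points in the cell $C$ in grid $G(\lceil \log(\Delta) \rceil)$ satisfies
\[ \mathrm{E}\left[ \sum_{p \in C} X_p \right] \ge \min\left\{ \frac{1}{a(\lceil \log(\Delta) \rceil)}, 1 \right\} \cdot 32 \cdot \left\lceil \frac{f}{\Delta} \right\rceil = \min\left\{ \frac{2^{\lceil \log(\Delta) \rceil}}{f}, 1 \right\} \cdot 32 \cdot \left\lceil \frac{f}{\Delta} \right\rceil \ge 32 \]
since $\lceil f/\Delta \rceil \ge 1$. By a Chernoff bound, we get
\[ \Pr\left[ \sum_{p \in C} X_p \le \left( 1 - \frac{1}{2} \right) \cdot \mathrm{E}\left[ \sum_{p \in C} X_p \right] \right] \le \exp\left( - \frac{\mathrm{E}\left[ \sum_{p \in C} X_p \right]}{8} \right) \le \exp\left( - \frac{32}{8} \right) \le \frac{1}{32} . \]
Thus, with probability at least $31/32$, the cell $C$ in $G(\lceil \log(\Delta) \rceil)$ is marked.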
Due to our construction, we have $F \ge 1$, so $\mathrm{FL}_{\mathrm{rand}}(P, f) \in \Omega(f)$. Hence, we get $\mathrm{FL}_{\mathrm{rand}}(P, f) \in \Omega(\mathrm{FL}(P, f))$, which completes the proof.

To prove the upper bound, we first observe that every cell $C$ is either contained in $SP$, or it can be partitioned into cells from $SP$ ($C$ lies above $SP$), or it is a subcell of a cell in $SP$ ($C$ lies below $SP$). We will first show that the overall expected number of sample points from cells that lie below $SP$ or that do not lie 'far above' $SP$ is $O(\mathrm{FL}(P, f)/f)$. Hence, the overall cost caused by these cells is $O(\mathrm{FL}(P, f))$. Then, we prove that the expected contribution of cells 'far above' $SP$ is also $O(\mathrm{FL}(P, f))$. The latter fact follows because every such cell $C$ in grid $\mathcal{G}(i)$ has a (smaller) active cell from $SP$ within distance $2^{i-1}$. These active cells are typically marked, with the result that the expected contribution of $C$ is small.

Definition 5.2.3 (Height of a Cell). We say that a cell $C$ in grid $\mathcal{G}(i)$ has height $k$ if the smallest cell in $SP$ that is contained in $C$ has side length $2^{i-k}$. If no cell in $SP$ is contained in $C$, then we define its height to be $-\infty$.

Lemma 5.2.4. $\mathrm{FL}_{\mathrm{rand}}(P, f) \in O(\mathrm{FL}(P, f))$ with probability at least 15/16.

Proof. Let $i \in [\lceil\log(\Delta)\rceil + 1]$ be any level, and let $X_p$ denote the indicator random variable for the event $h_i(p) = 1$. Furthermore, for a cell $C$ in grid $\mathcal{G}(i)$, let
\[ X_C := \sum_{p \in P \cap C} X_p \]
denote the random variable for the number of sample points in cell $C$. With this definition, it follows that, for every cell $C$ in grid $\mathcal{G}(i)$, we have
\[ \mathrm{E}[X_C] = n_P(C) \cdot \min\Big\{\frac{1}{a(i)}, 1\Big\} . \]
By the definition of $a(i)$, we get
\[ \mathrm{E}[X_C] \le \frac{n_P(C)}{a(i)} = n_P(C) \cdot \frac{2^i}{f} . \]
For any $k \in \mathbb{N}_0$ and any level $i \in \{k, \ldots, \lceil\log(\Delta)\rceil\}$, let us now consider an arbitrary cell $C$ in grid $\mathcal{G}(i)$ with height $k$. The cell $C$ can be partitioned into cells $C_1, \ldots, C_\ell$ from $SP$ that differ in their side lengths by a factor of at most $2^k$. Since $\alpha(i) \le 2^k \cdot \alpha(i - k)$, we have
\[ \mathrm{E}[X_C] \le 2^k \cdot \mathrm{E}\Big[\sum_{j=1}^{\ell} X_{C_j}\Big] . \]
Observe that cells from the same grid cannot overlap and two cells from different grids only overlap if the smaller cell is completely contained in the bigger cell. Thus, due to the definition of height, the cells of height $k$ do not overlap. Due to linearity of expectation, it follows that
\[ \mathrm{E}\Big[\sum_{\text{cells } C \text{ of height } k \text{ with } k \in \mathbb{N}_0} X_C\Big] \le 2^k \cdot \mathrm{E}\Big[\sum_{i=0}^{\lceil\log(\Delta)\rceil} \sum_{C \in SP(i)} X_C\Big] \le 2^k \cdot \sum_{i=0}^{\lceil\log(\Delta)\rceil} \sum_{C \in SP(i)} n_P(C) \cdot \frac{2^i}{f} \le 2^k \cdot \frac{\mathrm{FL}(P, f)}{f} . \]
Hence, for $k^* := \lceil\log(10\sqrt{d})\rceil$, the expected number of sample points in cells with a non-negative height of at most $k^*$ is less than $10\sqrt{d} \cdot \mathrm{FL}(P, f)/f$.

Next, we consider the expected number of sample points in cells with a negative height. For any level $i \in [\lceil\log(\Delta)\rceil + 1]$, let $C' \in SP(i)$ be any cell with height 0. Then, the expected number of sample points in all the cells that are below $SP$ and that are contained in $C'$ is
\[ \mathrm{E}\Big[\sum_{j=0}^{i-1} \sum_{C \in \mathcal{G}(j) : C \subset C'} X_C\Big] \le \sum_{j=0}^{i-1} \sum_{C \in \mathcal{G}(j) : C \subset C'} n_P(C) \cdot \frac{2^j}{f} = \sum_{j=0}^{i-1} n_P(C') \cdot \frac{2^{i-1-j}}{f} \le \mathrm{E}[X_{C'}] . \]
Summing up over all cells in $SP$, we obtain that the expected number of sample points in cells below $SP$ is at most
\[ \mathrm{E}\Big[\sum_{\text{cells } C \text{ of height } -\infty} X_C\Big] \le \sum_{i=0}^{\lceil\log(\Delta)\rceil} \sum_{C \in SP(i)} \mathrm{E}[X_C] \le \sum_{i=0}^{\lceil\log(\Delta)\rceil} \sum_{C \in SP(i)} n_P(C) \cdot \frac{2^i}{f} = \frac{\mathrm{FL}(P, f)}{f} . \]
Thus, the expected number of sample points in cells with height at most $k^*$ is less than $11\sqrt{d} \cdot \mathrm{FL}(P, f)/f$. By Markov's inequality, we obtain
\[ \Pr\Big[\sum_{\text{cells } C \text{ of height at most } k^*} X_C \ge 32 \cdot \mathrm{E}\Big[\sum_{\text{cells } C \text{ of height at most } k^*} X_C\Big]\Big] \le \frac{1}{32} . \]
Hence, with probability at least 31/32, the opening cost for facilities in cells with height at most $k^*$ is less than $f \cdot 352\sqrt{d} \cdot \mathrm{FL}(P, f)/f \in O(\mathrm{FL}(P, f))$.

Now, for any level $i \in \{k^* + 1, \ldots, \lceil\log(\Delta)\rceil\}$, let us consider an arbitrary cell $C$ in grid $\mathcal{G}(i)$ with height bigger than $k^*$. By the definition of height and the value of $k^*$, $C$ contains a subcell from $SP$ with side length less than $2^{i-k^*} \le 2^i/(10\sqrt{d})$.
Due to Property (iii) in Lemma 5.1.5, we know that, for any level $j \in [\lceil\log(\Delta)\rceil + 1]$, every cell in $SP(j)$ has an active cell with side length at most $2^j$ in $SP$ within a distance of at most $5\sqrt{d} \cdot 2^j$. We conclude that there is an active cell with side length less than $2^i/(10\sqrt{d})$ in $SP$ within a distance of less than $2^{i-1}$ from $C$. Now, observe that every parent cell of an active cell is active and contains the cell. Hence, there is a cell in grid $\mathcal{G}(i-1)$ within a distance of less than $2^{i-1}$ from $C$ that is active. To simplify further descriptions, we will rephrase this as follows. Let the level-$j$-neighborhood of $C$ be the set of all the cells in $\mathcal{G}(j)$ that share some part of their boundary or interior with $C$. Then, every cell in grid $\mathcal{G}(i)$ with height at least $k^*$ is active and contains an active cell in $SP$ or has a cell in its level-$(i-1)$-neighborhood that is active and contains an active cell in $SP$.

Now, we proceed as follows. For each active cell $A_i$ in $SP(i)$, we consider the cell itself and all cells that contain it. For each such cell $A_j$ in grid $\mathcal{G}(j)$, $j \in \{i, \ldots, \lceil\log(\Delta)\rceil\}$, we assume that all the $2^d$ cells in the level-$(j+1)$-neighborhood of $A_j$ contain an open facility if and only if $A_j$ is not marked. Thus, the expected contribution of $A_i$ and all the cells which belong to a level-$j$-neighborhood of $A_i$ with $j \in \{i, \ldots, \lceil\log(\Delta)\rceil\}$ is at most
\[ f + 2^d \cdot f \cdot \sum_{j=i}^{\lceil\log(\Delta)\rceil - 1} \Pr[A_j \text{ is not marked}] . \]
Each point in $A_j$ is not sampled with probability at most $1 - \min\{1/a(j), 1\}$. Hence, if $a(j) \le 1$, we have $\Pr[A_j \text{ is not marked}] = 0$. Otherwise, since $n_P(A_j) \ge n_P(A_i) \ge a(i)$, we obtain
\[ \Pr[A_j \text{ is not marked}] \le \Big(1 - \frac{1}{a(j)}\Big)^{n_P(A_i)} \le \exp\Big(-\frac{n_P(A_i)}{a(j)}\Big) \le \exp\Big(-\frac{2^{j-i} \cdot n_P(A_i)}{a(i)}\Big) \le \exp(-(j - i)) , \]
where the second inequality is due to a bound on Euler's number (see Inequality (B.2)). It follows that
\[ \sum_{j=i}^{\lceil\log(\Delta)\rceil - 1} \Pr[A_j \text{ is not marked}] \le \sum_{j=i}^{\lceil\log(\Delta)\rceil - 1} e^{-(j-i)} \le \sum_{j=0}^{\infty} e^{-j} = \frac{e}{e - 1} . \]
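The chain of inequalities above can be illustrated numerically. The sketch below assumes $a(j) = f/2^j$ (as implied by the identity $n_P(C)/a(i) = n_P(C) \cdot 2^i/f$ used earlier) and an active cell with $n_P(A_i) \ge a(i)$; the function names and the concrete parameters in the usage note are hypothetical.

```python
import math


def prob_not_marked(n_points, a_j):
    """Bound (1 - 1/a_j)^n on the probability that none of n points is sampled.

    For a_j <= 1 every point is sampled with probability 1, so the bound is 0.
    """
    if a_j <= 1:
        return 0.0
    return (1 - 1 / a_j) ** n_points


def sum_not_marked(f, i, n_points, top_level):
    """Sum the bound over levels j = i, ..., top_level - 1, assuming a(j) = f / 2^j."""
    total = 0.0
    for j in range(i, top_level):
        a_j = f / 2 ** j
        p = prob_not_marked(n_points, a_j)
        # Each term is dominated by exp(-(j - i)) when n_points >= a(i).
        assert p <= math.exp(-(j - i)) + 1e-12
        total += p
    return total
```

For instance, `sum_not_marked(64, 2, 16, 10)` uses $f = 64$, $i = 2$, and $n_P(A_i) = a(i) = 16$; the resulting sum stays below $e/(e-1) \approx 1.582$, in line with the geometric series bound.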
Thus, the expected contribution of $A_i$ and all the cells which belong to a level-$j$-neighborhood of $A_i$ with $j \in \{i, \ldots, \lceil\log(\Delta)\rceil\}$ is at most $O(f)$. Observe that each cell with height greater than $k^*$ is a cell in such a neighborhood of an active cell, since it contains itself an active cell from $SP$ or one of its neighbors contains an active cell from $SP$. It follows that the expected contribution from cells with height greater than $k^*$ is $O(f)$ times the number of active cells in $SP$. Since a cell in any level $i \in [\lceil\log(\Delta)\rceil + 1]$ must contain at least $a(i)$ points to be active, the number of active cells in $SP$ is
\[ \sum_{i=0}^{\lceil\log(\Delta)\rceil} \sum_{C \in SP(i)} \min\Big\{\Big\lfloor \frac{n_P(C)}{a(i)} \Big\rfloor, 1\Big\} \le \sum_{i=0}^{\lceil\log(\Delta)\rceil} \sum_{C \in SP(i)} \frac{n_P(C)}{a(i)} = \frac{1}{f} \cdot \sum_{i=0}^{\lceil\log(\Delta)\rceil} \sum_{C \in SP(i)} n_P(C) \cdot 2^i \le \frac{\mathrm{FL}(P, f)}{f} . \]
Thus, the expected opening cost for facilities in cells with height greater than $k^*$ is $O(\mathrm{FL}(P, f))$. By Markov's inequality, the opening cost for facilities in cells with height greater than $k^*$ is less than 32 times its expected value, which is also $O(\mathrm{FL}(P, f))$, with probability at least 31/32. Together with the first part of the proof, we obtain that the total opening cost for facilities is $O(\mathrm{FL}(P, f))$ with probability at least 15/16.

5.3 Streaming Algorithm

In this section, we describe how our randomized algorithm can be transferred to the dynamic geometric data stream model. For each level $i \in [\lceil\log(\Delta)\rceil + 1]$, let $\mathcal{M}(i)$ be the subset of marked cells in $\mathcal{G}(i)$, and let $\mathcal{U}(i)$ be the subset of cells in $\mathcal{G}(i)$ that have a cell contained in the set $\bigcup_{j=0}^{i-1} \mathcal{M}(j)$ within a distance of less than $2^{i-1}$. Thus, we have
\[ \mathrm{FL}_{\mathrm{rand}}(P, f) = f \cdot \sum_{i=0}^{\lceil\log(\Delta)\rceil} |\mathcal{M}(i) \setminus \mathcal{U}(i)| . \]
Recall that, in the streaming context, $P$ refers to the current point set, i.e., the set of points obtained after having applied an input sequence of insertions and deletions. The difficulty is to maintain, for each level $i \in [\lceil\log(\Delta)\rceil + 1]$, a good estimator for the value $|\mathcal{M}(i) \setminus \mathcal{U}(i)|$ in the streaming model.
We use a technique similar to the one described in [64] to solve this problem. In particular, we use two data structures that both maintain the number of distinct elements in a stream under insertions and deletions. The first data structure, called $DE_1(i)$, is supposed to maintain a good estimator for the value $|\mathcal{M}(i) \cup \mathcal{U}(i)|$. The second data structure, called $DE_2(i)$, is supposed to maintain a good estimator for the value $|\mathcal{U}(i)|$. We can show that the difference of these two estimators is a good estimator for the desired value $|\mathcal{M}(i) \setminus \mathcal{U}(i)|$. Next, we explain this method in more detail.

For any point $p \in \{1, \ldots, \Delta\}^d$ and any level $i \in [\lceil\log(\Delta)\rceil + 1]$, the cell in level $i$ that contains the point $p$ is denoted by $C_p(i)$, and the set of neighbor cells of $C_p(i)$ in level $i$ is denoted by $\Gamma(C_p(i))$. Furthermore, let $h_i : \{1, \ldots, \Delta\}^d \to \{0, 1\}$ be the random function introduced in Section 5.2.1. Then, our implementation of an insert operation of $p$ is as follows (see also Algorithm 5.3.2). If $h_k(p) = 0$ for each index $k \in [\lceil\log(\Delta)\rceil + 1]$, then $p$ is not a sample point and we do nothing with $p$. Otherwise, let $k \in [\lceil\log(\Delta)\rceil + 1]$ be the smallest index with $h_k(p) = 1$. Thus, $p$ is sampled in level $k$, so $C_p(k)$ is a marked cell. We insert $C_p(k)$ in $DE_1(k)$ and, for each $j \in \{k+1, k+2, \ldots, \lceil\log(\Delta)\rceil\}$, we insert all cells in $\mathcal{G}(j)$ such that $C_p(k)$ is within a distance of less than $2^{j-1}$ in both $DE_1(j)$ and $DE_2(j)$. A deletion of $p$ is implemented analogously (see Algorithm 5.3.3).

After having processed the whole input stream in this way, we compute an estimator for the facility location cost. For each level $i \in [\lceil\log(\Delta)\rceil + 1]$, let $|DE_1(i)|$ and $|DE_2(i)|$ be the number of distinct elements in $DE_1(i)$ and $DE_2(i)$, respectively. Then, our estimator for the optimal facility location cost in the dynamic geometric data stream model is
\[ \mathrm{FL}_{\mathrm{stream}}(P, f) := f \cdot \sum_{i=0}^{\lceil\log(\Delta)\rceil} \big(|DE_1(i)| - |DE_2(i)|\big) . \tag{5.6} \]
A description of our algorithm in pseudocode is given by Algorithms 5.3.1, 5.3.2, and 5.3.3.
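The difference-of-estimators idea behind Equation (5.6) can be sketched for a single level as follows. This is a minimal illustration, assuming an exact multiset counter as a stand-in for the approximate DE data structures; the class ExactDistinct and the function level_estimate are hypothetical names, and the grid geometry that decides which cells are inserted is left out.

```python
from collections import Counter


class ExactDistinct:
    """Exact stand-in for a distinct-elements structure under insertions and deletions."""

    def __init__(self):
        self._multiplicity = Counter()

    def insert(self, item):
        self._multiplicity[item] += 1

    def delete(self, item):
        self._multiplicity[item] -= 1
        if self._multiplicity[item] == 0:
            del self._multiplicity[item]

    def count(self):
        """Number of distinct items with positive multiplicity."""
        return len(self._multiplicity)


def level_estimate(marked_cells, blocked_cells):
    """Return |M \\ U| for one level as |M ∪ U| - |U|.

    marked_cells: cells of M(i), possibly repeated (one entry per sample point).
    blocked_cells: cells of U(i), possibly repeated.
    """
    de1, de2 = ExactDistinct(), ExactDistinct()
    for cell in marked_cells:
        de1.insert(cell)          # marked cells go to DE1 only
    for cell in blocked_cells:
        de1.insert(cell)          # blocked cells go to both structures
        de2.insert(cell)
    return de1.count() - de2.count()
```

With exact counters the difference equals $|\mathcal{M}(i) \setminus \mathcal{U}(i)|$ exactly; with $(1 \pm \varepsilon)$-approximate counters the error is analyzed in Section 5.3.1.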
Algorithm 5.3.1 FacLocCost(f, ∆)
    for i ← 0 to ⌈log(∆)⌉ do
        create random function h_i(·)
        initialize empty data structures DE1(i) and DE2(i)
    for each pair (p, operation) in the stream do
        if operation = insert then
            Insert(p)
        if operation = delete then
            Delete(p)
    z ← 0
    for i ← 0 to ⌈log(∆)⌉ do
        z ← z + |DE1(i)| − |DE2(i)|
    return f · z

Algorithm 5.3.2 Insert(p)
    sampled ← false
    i ← 0
    while sampled = false and i ≤ ⌈log(∆)⌉ do
        if h_i(p) = 1 then
            sampled ← true
            insert C_p(i) in DE1(i)
            for j ← i + 1 to ⌈log(∆)⌉ do
                insert all cells from G(j) that contain a cell from Γ(C_p(j − 1)) in DE1(j) and DE2(j)
        i ← i + 1

5.3.1 Analysis of the Estimator

In this section, we show that our streaming algorithm outputs a constant-factor approximation of the optimal facility location cost, has polylogarithmic update time, and uses polylogarithmic memory space. For that purpose, we analyze the quality and complexity of the distinct elements data structures and the random sampling technique.

Distinct Elements Data Structures

It follows from the analysis in Sections 5.1 and 5.2 that $\mathrm{FL}_{\mathrm{rand}}(P, f) \in \Omega(\mathrm{FacLoc}^*(P, f))$ and $\mathrm{FL}_{\mathrm{rand}}(P, f) \in O(\mathrm{FacLoc}^*(P, f))$ hold with probability at least 7/8. Thus, with probability at least 7/8, we have $1/c \cdot \mathrm{FacLoc}^*(P, f) \le \mathrm{FL}_{\mathrm{rand}}(P, f) \le c \cdot \mathrm{FacLoc}^*(P, f)$ for an appropriately chosen constant $c$. Next, we analyze how much the estimator $\mathrm{FL}_{\mathrm{stream}}(P, f)$ might differ from $\mathrm{FL}_{\mathrm{rand}}(P, f)$. For this purpose, let us assume that we have one fixed random function $h_i : \{1, \ldots, \Delta\}^d \to \{0, 1\}$ for each level $i \in [\lceil\log(\Delta)\rceil + 1]$ that is used by both our randomized algorithm and our streaming algorithm. We will show that the difference between $\mathrm{FL}_{\mathrm{stream}}(P, f)$ and $\mathrm{FL}_{\mathrm{rand}}(P, f)$, which is caused by using DE data structures to maintain the number of distinct elements in a data stream under insertions and deletions, can be upper bounded by $1/(2c) \cdot \mathrm{FacLoc}^*(P, f)$ with probability greater than 7/8.
Then, we have $1/(2c) \cdot \mathrm{FacLoc}^*(P, f) \le \mathrm{FL}_{\mathrm{stream}}(P, f) \le (2c^2 + 1)/(2c) \cdot \mathrm{FacLoc}^*(P, f)$ with high constant probability, which implies $\mathrm{FL}_{\mathrm{stream}}(P, f) \in \Theta(\mathrm{FacLoc}^*(P, f))$ with high constant probability.

We use the technique proposed by Kane et al. [72] to realize the DE data structures. Then, due to Corollary 5.1.2, for a precision parameter $\varepsilon$ and an error probability parameter $\delta$, which we will both specify later, each of our DE data structures computes a $(1 \pm \varepsilon)$-approximation of the number of distinct elements in a data stream under insertions and deletions with probability at least $1 - \delta$.

Algorithm 5.3.3 Delete(p)
    sampled ← false
    i ← 0
    while sampled = false and i ≤ ⌈log(∆)⌉ do
        if h_i(p) = 1 then
            sampled ← true
            delete C_p(i) from DE1(i)
            for j ← i + 1 to ⌈log(∆)⌉ do
                delete all cells from G(j) that contain a cell from Γ(C_p(j − 1)) from DE1(j) and DE2(j)
        i ← i + 1

We will show that the difference between $\mathrm{FL}_{\mathrm{stream}}(P, f)$ and $\mathrm{FL}_{\mathrm{rand}}(P, f)$ depends on the value $\sum_{i=0}^{\lceil\log(\Delta)\rceil} f \cdot |\mathcal{M}(i)|$. For that reason, we next give an appropriate upper bound on $\sum_{i=0}^{\lceil\log(\Delta)\rceil} f \cdot |\mathcal{M}(i)|$.

Lemma 5.3.1. If $\mathrm{FL}_{\mathrm{rand}}(P, f) \le c \cdot \mathrm{FacLoc}^*(P, f)$, then we have
\[ \sum_{i=0}^{\lceil\log(\Delta)\rceil} f \cdot |\mathcal{M}(i)| \le c \cdot 2^d \cdot (\log(\Delta) + 2) \cdot \mathrm{FacLoc}^*(P, f) . \]

Proof. We open one facility in each marked cell in $\mathcal{G}(0)$. Thus, we have
\[ f \cdot |\mathcal{M}(0)| = f \cdot |\mathcal{F}(0)| \le \mathrm{FL}_{\mathrm{rand}}(P, f) \le c \cdot \mathrm{FacLoc}^*(P, f) . \]
For any level $i \in [\lceil\log(\Delta)\rceil + 1]$, we open a facility in each cell in $\mathcal{M}(i)$ such that no subcell of this cell is marked and no smaller cell within a distance of less than $2^{i-1}$ is marked. Let us consider any cell $C$ in level $i - 1$ that is a marked cell or contains a marked cell. Let $C'$ be this marked cell. There are $2^d$ cells in $\mathcal{G}(i)$ such that $C$ is within a distance of less than $2^{i-1}$ from these cells, namely the set of cells in the level-$i$-neighborhood of $C$ (see Figure 5.4).
Recall that the level-$i$-neighborhood of $C$ is the set of all the cells in level $i$ that share some part of their boundary or interior with $C$. It follows that $C'$ prevents at most $2^d$ cells in $\mathcal{M}(i)$ from opening a facility. Now, either $C'$ contains an open facility or there exists a smaller marked cell $C''$ that prevents $C'$ from opening a facility. Since $C''$ has to be within a distance of less than $2^{i-2}$ from $C' \subseteq C$, it is located in the level-$(i-2)$-neighborhood of $C$. We can recursively apply this argument until we have found the first marked cell that is not prevented by any smaller marked cell from opening a facility. Note that this happens in level 0 at the latest. Since $\sum_{j=0}^{i-2} 2^j \le 2^{i-1}$, these marked cells are all located in the level-$(i-1)$-neighborhood of $C$ and, thus, also in the level-$i$-neighborhood of $C$. Hence, for a fraction of at least $1/2^d$ cells in $\mathcal{M}(i)$, there exists at least one cell in $\bigcup_{j=0}^{i} \mathcal{F}(j)$. Thus, we have
\[ f \cdot |\mathcal{M}(i)| \le f \cdot 2^d \cdot \sum_{j=0}^{i} |\mathcal{F}(j)| \le 2^d \cdot \mathrm{FL}_{\mathrm{rand}}(P, f) \le 2^d \cdot c \cdot \mathrm{FacLoc}^*(P, f) . \]
Now, the lemma follows from the fact that there are at most $\log(\Delta) + 2$ levels.

[Figure 5.4: Illustration of the area of influence of a marked cell. The cell $C \in \mathcal{G}(i-1)$ is a marked cell or contains a marked cell. The area of influence in grid $\mathcal{G}(i)$, i.e., the subset of cells in $\mathcal{G}(i)$ that are within a distance of less than $2^{i-1}$ from $C$, is colored in light gray. The white cells in grid $\mathcal{G}(i)$ are outside of the influence of the marked cell in $C$.]

Finally, we can upper bound the difference between $\mathrm{FL}_{\mathrm{stream}}(P, f)$ and $\mathrm{FL}_{\mathrm{rand}}(P, f)$.

Lemma 5.3.2.
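If $\mathrm{FL}_{\mathrm{rand}}(P, f) \le c \cdot \mathrm{FacLoc}^*(P, f)$ and if we run each DE data structure to maintain a $(1 \pm \varepsilon)$-approximation of the number of distinct elements in data streams under insertions and deletions with a precision parameter $\varepsilon$, $0 < \varepsilon \le 1/(8c^2 \cdot 2^{2d} \cdot (\log(\Delta) + 2)^2)$, and an error probability $\delta$, $0 < \delta < 1/(16(\log(\Delta) + 2))$, then
\[ |\mathrm{FL}_{\mathrm{stream}}(P, f) - \mathrm{FL}_{\mathrm{rand}}(P, f)| < \frac{1}{2c} \cdot \mathrm{FacLoc}^*(P, f) \]
with probability greater than 7/8.

Proof. Since we use two DE data structures per level and there are at most $\log(\Delta) + 2$ levels, we use at most $2(\log(\Delta) + 2)$ DE data structures in total. By the union bound, the probability that at least one of these DE data structures fails is less than 1/8. Hence, the probability that each DE data structure maintains a $(1 \pm \varepsilon)$-approximation is greater than 7/8. Since we run each DE data structure with a precision parameter $\varepsilon$, we can upper bound the difference between $\mathrm{FL}_{\mathrm{stream}}(P, f)$ and $\mathrm{FL}_{\mathrm{rand}}(P, f)$ by
\[ |\mathrm{FL}_{\mathrm{stream}}(P, f) - \mathrm{FL}_{\mathrm{rand}}(P, f)| \le \varepsilon \cdot f \cdot \sum_{i=0}^{\lceil\log(\Delta)\rceil} \big(|\mathcal{M}(i)| + 2 \cdot |\mathcal{U}(i)|\big) . \]
Due to Lemma 5.3.1 and $\varepsilon \le 1/(8c^2 \cdot 2^{2d} \cdot (\log(\Delta) + 2)^2) < 1/(4c^2 \cdot 2^d \cdot (\log(\Delta) + 2))$, we have
\[ \varepsilon \cdot f \cdot \sum_{i=0}^{\lceil\log(\Delta)\rceil} |\mathcal{M}(i)| \le \varepsilon \cdot c \cdot 2^d \cdot (\log(\Delta) + 2) \cdot \mathrm{FacLoc}^*(P, f) < \frac{1}{4c} \cdot \mathrm{FacLoc}^*(P, f) . \]
Next, we upper bound the value $\varepsilon \cdot f \cdot \sum_{i=0}^{\lceil\log(\Delta)\rceil} |\mathcal{U}(i)|$. Observe that the set $\mathcal{U}(0)$ is empty. Furthermore, for any cell $C \in \bigcup_{j=0}^{i-1} \mathcal{M}(j)$, there are at most $2^d$ cells in $\mathcal{G}(i)$ that are within a distance of less than $2^{i-1}$ from $C$. Thus, there exists at least one cell in $\bigcup_{j=0}^{i-1} \mathcal{M}(j)$ for a fraction of at least $1/2^d$ cells in $\mathcal{U}(i)$. Hence, we have
\[ |\mathcal{U}(i)| \le 2^d \cdot \sum_{j=0}^{i-1} |\mathcal{M}(j)| \le 2^d \cdot \sum_{j=0}^{\lceil\log(\Delta)\rceil} |\mathcal{M}(j)| . \]
Summation over all levels results in
\[ \sum_{i=0}^{\lceil\log(\Delta)\rceil} |\mathcal{U}(i)| \le 2^d \cdot (\log(\Delta) + 2) \cdot \sum_{i=0}^{\lceil\log(\Delta)\rceil} |\mathcal{M}(i)| . \]
Due to Lemma 5.3.1 and $\varepsilon \le 1/(8c^2 \cdot 2^{2d} \cdot (\log(\Delta) + 2)^2)$, we get
\[ \varepsilon \cdot f \cdot \sum_{i=0}^{\lceil\log(\Delta)\rceil} 2 \cdot |\mathcal{U}(i)| \le \varepsilon \cdot f \cdot 2 \cdot 2^d \cdot (\log(\Delta) + 2) \cdot \sum_{i=0}^{\lceil\log(\Delta)\rceil} |\mathcal{M}(i)| \le \varepsilon \cdot 2c \cdot 2^{2d} \cdot (\log(\Delta) + 2)^2 \cdot \mathrm{FacLoc}^*(P, f) \le \frac{1}{4c} \cdot \mathrm{FacLoc}^*(P, f) . \]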
Thus, we obtain
\[ |\mathrm{FL}_{\mathrm{stream}}(P, f) - \mathrm{FL}_{\mathrm{rand}}(P, f)| \le \varepsilon \cdot f \cdot \sum_{i=0}^{\lceil\log(\Delta)\rceil} \big(|\mathcal{M}(i)| + 2 \cdot |\mathcal{U}(i)|\big) < \frac{1}{2c} \cdot \mathrm{FacLoc}^*(P, f) \]
with probability greater than 7/8.

We summarize our results achieved so far in the following lemma. Note that Lemma 5.3.3 considers only the space requirement and update time of the DE data structures. The space and time needed to do the random sampling will be analyzed later.

Lemma 5.3.3. If we run each DE data structure to maintain a $(1 \pm \varepsilon)$-approximation of the number of distinct elements in data streams under insertions and deletions with a precision parameter $\varepsilon := 1/(8c^2 \cdot 2^{2d} \cdot (\log(\Delta) + 2)^2)$ and an error probability $\delta := 1/(17(\log(\Delta) + 2))$, then $\mathrm{FL}_{\mathrm{stream}}(P, f) \in \Theta(\mathrm{FacLoc}^*(P, f))$ with probability greater than 3/4. The DE data structures require $O(\log^6(\Delta) \cdot (\log(\log(\Delta)))^2)$ bits of space and $O(\log(\Delta) \cdot \log(\log(\Delta)))$ update time.

Proof. Due to Lemmas 5.1.6, 5.1.7, 5.2.2, and 5.2.4, we have
\[ \frac{1}{c} \cdot \mathrm{FacLoc}^*(P, f) \le \mathrm{FL}_{\mathrm{rand}}(P, f) \le c \cdot \mathrm{FacLoc}^*(P, f) \]
for an appropriately chosen constant $c \ge 1$ with probability at least 7/8. If this is the case and each DE data structure is run with a precision parameter $\varepsilon = 1/(8c^2 \cdot 2^{2d} \cdot (\log(\Delta) + 2)^2)$ and an error probability $\delta = 1/(17(\log(\Delta) + 2))$, then it follows from Lemma 5.3.2 that the difference between $\mathrm{FL}_{\mathrm{stream}}(P, f)$ and $\mathrm{FL}_{\mathrm{rand}}(P, f)$ is at most $1/(2c) \cdot \mathrm{FacLoc}^*(P, f)$ with probability greater than 7/8. Thus, we obtain
\[ \frac{1}{2c} \cdot \mathrm{FacLoc}^*(P, f) \le \mathrm{FL}_{\mathrm{stream}}(P, f) \le \frac{2c^2 + 1}{2c} \cdot \mathrm{FacLoc}^*(P, f) \]
with probability greater than 3/4. This proves that $\mathrm{FL}_{\mathrm{stream}}(P, f) \in \Theta(\mathrm{FacLoc}^*(P, f))$ with probability greater than 3/4.

Due to Corollary 5.1.2 and for our values of $\varepsilon$ and $\delta$, each DE data structure has a space requirement of
\[ O\big(1/\varepsilon^2 \cdot \log(N) \cdot (\log(1/\varepsilon) + \log(\log(M))) \cdot \log(1/\delta)\big) = O\big(\log^4(\Delta) \cdot \log(N) \cdot (\log(\log(\Delta)) + \log(\log(M))) \cdot \log(\log(\Delta))\big) \]
bits, where $N$ is the size of the domain of the elements and $M$ is the multiplicity of single elements in the DE data structure.
Since the grid of any level contains at most $\Delta^d$ cells and each DE data structure contains only cells from one level, we have $N = \Delta^d$. Furthermore, due to our implementation of Insert(p) and Delete(p) for any point $p \in \{1, \ldots, \Delta\}^d$, the multiplicity of a cell in any DE data structure is at most $M = \Delta^d$. Hence, each DE data structure needs $O(\log^5(\Delta) \cdot (\log(\log(\Delta)))^2)$ bits of space. Since we use $O(\log(\Delta))$ DE data structures, the total space requirement for the DE data structures is upper bounded by $O(\log^6(\Delta) \cdot (\log(\log(\Delta)))^2)$ bits.

While running Insert(p) for any point $p \in \{1, \ldots, \Delta\}^d$, we insert at most a constant number of cells in each DE data structure. Thus, due to Corollary 5.1.2 and for our value of $\delta$, the time to process an Insert operation is $O(\log(\log(\Delta)))$ for each DE data structure. Analogously, we can upper bound the time to process a Delete operation. It follows that the total update time for the $O(\log(\Delta))$ DE data structures is $O(\log(\Delta) \cdot \log(\log(\Delta)))$, which completes the proof of the lemma.

Random Sampling

For each level $i \in [\lceil\log(\Delta)\rceil + 1]$, we use the random function $h_i : \{1, \ldots, \Delta\}^d \to \{0, 1\}$ described in Section 5.2.1 to realize the random sampling of points. To overcome the assumption of full randomness needed for the creation of these $h_i(\cdot)$, we use a pseudo-random generator of Nisan [95]. This approach was first proposed in [62].

First, we briefly summarize some facts about pseudo-random generators for space-bounded computation proposed by Nisan [95]. Then, we show how to utilize these facts for the creation of the random functions. Let ALG be any algorithm that uses at most $O(k)$ bits of memory and, thus, has at most $2^{O(k)}$ distinct states. Furthermore, we assume that ALG uses at most $g$ chunks of random bits, where each chunk is of length $\ell \in O(k)$. Let $\mathrm{ALG}(x)$ be the state of ALG after having used the random bit sequence $x \in \{0, 1\}^{g\ell}$.
Then, there is a pseudo-random generator for ALG with the following properties:

Lemma 5.3.4 ([95]). Let ALG be an algorithm that uses $O(k)$ bits of memory and $g$ chunks of random bits, where each chunk is of length $\ell \in O(k)$. Then, there exists a pseudo-random generator $R : \{0, 1\}^s \to (\{0, 1\}^\ell)^g$ for ALG which expands $s$ random bits into $t := g \cdot \ell$ bits such that $s \in O(k \log(t))$ and
\[ \sum_{\text{states } z \text{ of ALG}} \big|\Pr[\mathrm{ALG}(x) = z] - \Pr[\mathrm{ALG}(R(y)) = z]\big| \le 2^{-k} , \]
where $x$ is chosen uniformly at random from $\{0, 1\}^t$ and $y$ is chosen uniformly at random from $\{0, 1\}^s$. For any $y \in \{0, 1\}^s$, any length-$\ell$ chunk of $R(y)$ can be computed using $O(\log(t))$ arithmetic operations on $O(\ell)$-bit words.

In the proof of the following lemma, we show how to apply a pseudo-random generator to reduce the randomness needed for the creation of the random functions. To do so, we proceed in a similar way as Indyk [62, 65].

Lemma 5.3.5. There is an implementation of algorithm FacLocCost that outputs a constant-factor approximation of the optimal facility location cost for the current point set with probability greater than 2/3. The implementation requires $O(\log^7(\Delta) \cdot (\log(\log(\Delta)))^2)$ bits of space and $O(\log^7(\Delta) \cdot (\log(\log(\Delta)))^2)$ random bits. An insertion or deletion of a point requires $O(\log^2(\Delta))$ arithmetic operations on $O(\log(\Delta))$-bit words.

Proof. Let FLCFullyRandom be the implementation of algorithm FacLocCost that uses the type of DE data structures given in Corollary 5.1.2 and the kind of random functions proposed in Lemma 5.2.1. To prove the lemma, we adopt the argumentation given by Indyk in [62, 65].

Due to Lemma 5.3.3, the total space requirement for the DE data structures is upper bounded by $O(\log^6(\Delta) \cdot (\log(\log(\Delta)))^2)$ bits. Furthermore, FLCFullyRandom requires $O(\log^2(\Delta))$ random bits per point in $\{1, \ldots, \Delta\}^d$ in total for the creation of the $O(\log(\Delta))$ random functions.
Since we might access a specific point several times and the output of each random function for this point should always be the same, we have to store the random bits of all $\Delta^d$ points. Obviously, an algorithm working in the dynamic geometric data stream model cannot use $\Omega(\Delta)$ bits of space. This problem is avoidable by allowing a negligible probability of error in the computation of the number of open facilities.

For the moment, let us assume the stream is sorted, which means that the insertions and deletions of a specific point occur consecutively in the stream. Then, it is sufficient to compute the output of each random function only once per point. Thus, in case of a sorted stream, algorithm FLCFullyRandom uses $O(\log^6(\Delta) \cdot (\log(\log(\Delta)))^2)$ bits and $O(\Delta^d \cdot \log(\Delta))$ chunks, each consisting of $O(\log(\Delta))$ random bits. Note that there are $O(\log(\Delta))$ chunks per point in $\{1, \ldots, \Delta\}^d$, one for each level. Due to Lemma 5.3.4, there exists a pseudo-random generator $R$ which, given a random seed of size $O(\log^6(\Delta) \cdot (\log(\log(\Delta)))^2 \cdot \log(\Delta))$, expands it to $\Delta^d \cdot \log(\Delta)$ chunks of $O(\log(\Delta))$ random bits such that each chunk can be computed using $O(\log(\Delta))$ arithmetic operations on $O(\log(\Delta))$-bit words and using these chunks results in a negligible probability of error in the computation of the number of open facilities. Let us denote the implementation of algorithm FacLocCost which uses a pseudo-random generator $R$ for the creation of the random functions by FLCPseudoRandom. Then, according to Lemma 5.3.4 and since we can assume that $\Delta \ge 4$, the probability that the computations of the implementation FLCFullyRandom differ from those of the implementation FLCPseudoRandom is at most
\[ \Pr[\mathrm{FLCFullyRandom} \ne \mathrm{FLCPseudoRandom}] \le 2^{-\log^6(\Delta) \cdot (\log(\log(\Delta)))^2} \le 2^{-6\log(\Delta)} = \Delta^{-6} . \]
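The requirement that every access to a point yields the same value $h_i(p)$ without storing per-point bits can be illustrated with a seeded deterministic function. The sketch below is only an illustration: it uses a keyed BLAKE2b hash from Python's hashlib as a stand-in and is not Nisan's generator, so it carries none of the guarantees of Lemma 5.3.4. The function name sample_bit and the seed format are hypothetical; the sampling probability $\min\{1/a(i), 1\}$ with $a(i) = f/2^i$ is taken from Section 5.2.

```python
import hashlib


def sample_bit(seed: bytes, level: int, point: tuple, f: float) -> int:
    """Deterministic stand-in for h_i(p): returns 1 with probability min{2^i / f, 1}.

    The value depends only on (seed, level, point), so repeated insertions and
    deletions of the same point see the same bit without any per-point storage.
    The seed must be at most 64 bytes long (BLAKE2b key limit).
    """
    probability = min(2.0 ** level / f, 1.0)
    message = repr((level, point)).encode()
    digest = hashlib.blake2b(message, key=seed, digest_size=8).digest()
    value = int.from_bytes(digest, "big")          # uniform-looking in [0, 2^64)
    threshold = int(probability * 2 ** 64)         # equals 2^64 when probability is 1
    return 1 if value < threshold else 0
```

Only the short seed is stored; the bit for any (level, point) pair is recomputed on demand, which mirrors how the pseudo-random generator is used in the proof.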
Since, for a fixed random seed, algorithm FLCPseudoRandom does not depend on the order in which the insertions and deletions of points appear in the stream, we also get $\Pr[\mathrm{FLCFullyRandom} \ne \mathrm{FLCPseudoRandom}] \le 1/\Delta^6$ for the unsorted stream.

Due to Lemma 5.3.3, we obtain that the implementation FLCFullyRandom of algorithm FacLocCost has an error probability of less than 1/4. Since we assume that $\Delta \ge 4$ and $\Pr[\mathrm{FLCFullyRandom} \ne \mathrm{FLCPseudoRandom}] \le 1/\Delta^6$, the implementation FLCPseudoRandom works with error probability less than 1/3.

Due to Lemma 5.3.3, the space requirement of the DE data structures is upper bounded by $O(\log^6(\Delta) \cdot (\log(\log(\Delta)))^2)$ bits and their update time is $O(\log(\Delta) \cdot \log(\log(\Delta)))$. For the random functions, the pseudo-random generator of Nisan [95] needs a random seed of size $O(\log^7(\Delta) \cdot (\log(\log(\Delta)))^2)$. Furthermore, for any level $i \in [\lceil\log(\Delta)\rceil + 1]$ and any point $p \in \{1, \ldots, \Delta\}^d$, the value $h_i(p)$ can be computed using $O(\log(\Delta))$ arithmetic operations on $O(\log(\Delta))$-bit words. Thus, we need $O(\log^7(\Delta) \cdot (\log(\log(\Delta)))^2)$ random bits in total and, for any point, the output values of all random functions can be computed using $O(\log^2(\Delta))$ arithmetic operations on $O(\log(\Delta))$-bit words.

As explained in the proof of Corollary 5.1.2, we can use a standard amplification technique to obtain the following main result:

Theorem 5. There is a randomized streaming algorithm that computes with probability $1 - \delta$ a constant-factor approximation of the facility location cost for a stream of points with uniform opening costs and demands in the discrete Euclidean space $\{1, \ldots, \Delta\}^d$ under insertions and deletions, where $d$ is a constant. The algorithm has a space requirement of $O(\log^7(\Delta) \cdot (\log(\log(\Delta)))^2 \cdot \log(1/\delta))$ bits and uses $O(\log^7(\Delta) \cdot (\log(\log(\Delta)))^2 \cdot \log(1/\delta))$ random bits. An insertion or deletion of a point requires $O(\log^2(\Delta) \cdot \log(1/\delta))$ arithmetic operations on $O(\log(\Delta))$-bit words.
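The amplification step can be sketched as follows, assuming that the technique meant here is the usual median of independent runs (the proof of Corollary 5.1.2 is not reproduced in this section, so this is an assumption). The names amplified_estimate and run_once are hypothetical; run_once stands for one independent execution of the streaming algorithm with success probability greater than 2/3.

```python
import statistics


def amplified_estimate(run_once, repetitions):
    """Median of independent runs (assumed amplification scheme).

    run_once(seed) is a hypothetical callable that returns one estimate computed
    with fresh randomness derived from seed. If each run is within the desired
    bounds with probability greater than 2/3, the median of O(log(1/delta)) runs
    is within the same bounds with probability at least 1 - delta.
    """
    return statistics.median(run_once(seed) for seed in range(repetitions))
```

As a usage example, if a minority of runs returns a wildly wrong value, the median still returns a value produced by a successful run; the additional $\log(1/\delta)$ factor in Theorem 5 matches the number of parallel copies.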
6 A k-Means Implementation for Data Streams

In this chapter, we develop an efficient algorithm for the k-means clustering problem in the insertion-only data stream model. We call our algorithm StreamKM++. The k-means clustering problem is closely related to the facility location problem. Given a set of points and a natural number $k$, the goal of the k-means clustering problem is to place $k$ facilities, the so-called cluster centers, such that the sum of the squared distances of the points to their nearest cluster center is minimized. To approach this problem, our streaming algorithm maintains a small summary of the input points using the merge-and-reduce technique [16, 58], i.e., the data is organized in a small number of samples, each representing $2^i m$ input points (for some $i \in \mathbb{N}_0$ and a fixed $m \in \mathbb{N}$). Whenever two samples representing the same number of input points exist, we take the union (merge) and create a new sample (reduce). After having processed the whole input stream in this way, we apply the k-Means++ algorithm [9] on the sample to obtain a k-means clustering.

For the reduce step, we develop a new coreset construction. A coreset is a small weighted point set that approximates the original input point set with respect to a given optimization problem, which is in our case the k-means clustering problem. Our focus is to propose a coreset construction that is suitable for high-dimensional data. Existing constructions based on grid computations [44, 58] yield coresets of a size that is exponential in the dimension. Since the k-Means++ seeding works well for high-dimensional data, a coreset construction based on this approach seems to be more promising. We give a theoretical analysis of this approach in Section 6.2.

In order to implement this coreset construction efficiently, we propose a new data structure, which we call coreset tree.
This is a tree-like data structure that stores points in such a way that we can perform a fast adaptive sampling which is very similar to the k-Means++ seeding. According to our experiments, the seed computed on the coreset tree has essentially the same properties as the original k-Means++ seed. The advantage of the coreset tree approach is that the running time is significantly shorter than the running time of the original k-Means++ seeding.

It should be noted that the k-Means++ seeding has also been theoretically investigated in [3] and [4]. Aggarwal et al. [3] used the k-Means++ seeding to obtain a small weighted point set such that an optimal k-means clustering of the original point set can be approximated well by clustering the small weighted set. Ailon et al. [4] used the k-Means++ seeding to obtain a streaming algorithm for the k-means clustering problem that guarantees an approximation factor of $O(c^{\alpha} \log(k))$, where $c$ is some constant, $\alpha \approx \log(n)/\log(M)$, $n$ is the number of input points in the stream, and $M$ is the amount of work memory available to the algorithm. However, our result differs from the results given in [3] and [4] and was obtained independently.

In Section 6.5, we compare algorithm StreamKM++ with algorithms BIRCH [111] and StreamLS [96, 52] as well as with the non-streaming version of algorithm k-Means++. It turns out that our algorithm is slower than BIRCH, but it computes significantly better solutions (in terms of the sum of squared errors). In addition, to obtain the desired number of clusters, our algorithm does not require the trial-and-error adjustment of parameters as BIRCH does. The quality of the clustering of algorithm StreamLS is comparable to that of our algorithm, but the running time of StreamKM++ scales much better with the number of cluster centers.
For example, on the dataset Tower, our algorithm computes a clustering with $k = 100$ centers in about 3% of the running time of StreamLS. In comparison with the standard implementation of k-Means++, our algorithm runs much faster on larger datasets and computes solutions that are on a par with k-Means++. For example, on the dataset Covertype, our algorithm computes a clustering with $k = 50$ centers of essentially the same quality as k-Means++ does, but it needs only about 3% of the running time of k-Means++.

Next, we introduce some notation and give a brief overview of the considered competitors of StreamKM++.

6.1 Preliminaries

6.1.1 Definition of Euclidean k-Means Clusterings

Recall from Section 2.1 that, for any two points $p, q \in \mathbb{R}^d$ and any set of points $C \subset \mathbb{R}^d$, we denote the Euclidean distance between $p$ and $q$ by $D(p, q) := \|p - q\|$, and we define
\[ D(p, C) := \min_{c \in C} D(p, c) . \]
Similarly, for squared Euclidean distances, we define $D^2(p, q) := \|p - q\|^2$ and
\[ D^2(p, C) := \min_{c \in C} D^2(p, c) . \]
Let $P \subset \mathbb{R}^d$ be a set of points with size $|P| =: n$. The Euclidean k-means clustering problem for $P$ is given as follows.

Definition 6.1.1 (Euclidean k-Means Clustering Problem). For a set $P \subset \mathbb{R}^d$ and $k \in \mathbb{N}$, the Euclidean k-means clustering problem is to find a set $C := \{c_1, \ldots, c_k\}$ of $k$ cluster centers in $\mathbb{R}^d$ and a partition of the set $P$ into $k$ clusters $C_1, \ldots, C_k$ such that the k-means clustering cost
\[ \mathrm{Means}(P, C, C_1, \ldots, C_k) := \sum_{i=1}^{k} \sum_{p \in C_i} D^2(p, c_i) \]
is minimized.
Analogously, for a weighted set $S \subset \mathbb{R}^d$ with weight function $w : S \to \mathbb{R}_{\ge 0}$ and $k \in \mathbb{N}$, the weighted Euclidean k-means clustering problem is to find a set $C := \{c_1, \ldots, c_k\}$ of $k$ cluster centers in $\mathbb{R}^d$ and a partition of the set $S$ into $k$ clusters $C_1, \ldots, C_k$ such that the k-means clustering cost
\[ \mathrm{Means}(S, C, C_1, \ldots, C_k) := \sum_{i=1}^{k} \sum_{q \in C_i} w(q) \cdot D^2(q, c_i) \]
is minimized.

If a partition $C_1, \ldots, C_k$ of $P$ relates each point to its nearest cluster center, i.e., if, for each $p \in P$ and each $i \in \{1, \ldots, k\}$, we have
\[ p \in C_i \Rightarrow D(p, c_i) = \min_{j \in \{1, \ldots, k\}} D(p, c_j) , \]
then we write for short
\[ \mathrm{Means}(P, C) := \mathrm{Means}(P, C, C_1, \ldots, C_k) . \]
Furthermore, we denote the cost of an optimal Euclidean k-means clustering of $P$ by
\[ \mathrm{Means}^*_k(P) := \min_{C' \subset \mathbb{R}^d : |C'| = k} \mathrm{Means}(P, C') . \]

6.1.2 Definition of Coresets

An important concept that we use is the notion of coresets. Generally speaking, a coreset for a set $P$ is a small (weighted) set such that, for any set of $k$ cluster centers, the (weighted) clustering cost for the coreset is an approximation of the clustering cost for the original set $P$ with small relative error. The advantage of such a coreset is that we can apply any fast approximation algorithm (for the weighted problem) on the usually much smaller coreset to compute an approximate solution for the original set $P$ more efficiently. We use the following formal definition:

Definition 6.1.2 (Coreset for k-Means Clustering Problem). Let $P \subset \mathbb{R}^d$ be a set of points, let $k \in \mathbb{N}$, and let $\varepsilon$, $0 < \varepsilon \le 1$, be a precision parameter. A weighted multiset $S \subset \mathbb{R}^d$ with positive weight function $w : S \to \mathbb{R}_{\ge 0}$ is called a $(k, \varepsilon)$-coreset of $P$ for the k-means clustering problem if, for each $C \subset \mathbb{R}^d$ of size $|C| = k$, we have
\[ (1 - \varepsilon) \cdot \mathrm{Means}(P, C) \le \mathrm{Means}(S, C) \le (1 + \varepsilon) \cdot \mathrm{Means}(P, C) . \]
Our clustering algorithm maintains a small coreset in the insertion-only data stream model (see Section 2.4.4 for a definition of this data stream model).

6.1.3 k-Means Clustering Algorithms

In the experiments, we compare StreamKM++ with two frequently used clustering algorithms for processing data streams, namely with algorithm BIRCH of Zhang et al. [111] and with a streaming variant of the local search algorithm given by O'Callaghan et al. [96] and Guha et al. [52], which we refer to as algorithm StreamLS.
On smaller datasets, we also compare our algorithm with a classical implementation of Lloyd’s k-means algorithm [80], using initial seeds chosen either uniformly at random (algorithm k-Means) or according to the adaptive, non-uniform seeding from Arthur and Vassilvitskii [9] (algorithm k-Means++). In the following, we will give a brief overview of these k-means clustering algorithms.

Algorithm k-Means

One of the most widely used clustering algorithms is Lloyd’s algorithm. This algorithm is sometimes also called the k-means algorithm [80, 39, 82]. Lloyd’s algorithm is based on two observations:

1. Given a fixed set of centers, we obtain the best clustering by assigning each point to the nearest center.
2. Given a cluster, the best center of the cluster is the center of gravity (i.e., the mean) of its points.

Lloyd’s algorithm applies these two local optimization steps repeatedly to the current solution, until no more improvement is possible. See Algorithm 6.1.1 for a description in pseudocode.

Algorithm 6.1.1 k-Means(P, k)
1: choose $k$ initial centers $c_1, \dots, c_k$ uniformly at random from $P$
2: repeat
3:   partition $P$ into $k$ subsets $P_1, \dots, P_k$ such that $P_i$, $1 \le i \le k$, contains all points whose nearest center is $c_i$
4:   replace the current set of centers by a new set of centers $c_1, \dots, c_k$ such that center $c_i$, $1 \le i \le k$, is the center of gravity of $P_i$
5: until the set of centers has not changed

It is known that the algorithm converges to a local optimum [100], and the quality of the computed solution is sensitive to the choice of the starting centers. Kanungo et al. [73] give a simple example where, for a fixed set of starting centers, Lloyd’s algorithm converges to a local minimum that is arbitrarily bad compared to the optimal solution. This example can be extended to the case where the starting centers are chosen by uniform seeding as given in Algorithm 6.1.1.
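To make the two alternating steps of Lloyd’s algorithm concrete, the following is a minimal Python sketch of Algorithm 6.1.1. It is an illustration only and not the implementation used in the experiments: the function names `squared_distance`, `lloyd_kmeans`, and `kmeans_cost`, the `rng` and `max_iterations` parameters, and the rule of keeping the old center when a cluster becomes empty are assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance D^2(p, q) between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, k, rng=None, max_iterations=100):
    """Sketch of Algorithm 6.1.1: uniform seeding followed by Lloyd iterations."""
    rng = rng or random.Random()
    centers = rng.sample(points, k)  # line 1: k initial centers, uniformly at random
    for _ in range(max_iterations):  # safety cap (assumption, not in the pseudocode)
        # Line 3: assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Line 4: replace every center by the center of gravity of its cluster.
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                dim = len(cluster[0])
                new_centers.append(
                    tuple(sum(p[j] for p in cluster) / len(cluster) for j in range(dim))
                )
            else:
                new_centers.append(centers[i])  # assumption: keep center of an empty cluster
        # Line 5: stop as soon as the set of centers no longer changes.
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def kmeans_cost(points, centers):
    """Means(P, C): sum of squared distances of all points to their nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)
```

A call such as `lloyd_kmeans(points, k)` returns the final centers, and `kmeans_cost(points, centers)` evaluates the clustering cost from Definition 6.1.1. Since the result depends on the random seed, running the sketch several times and keeping the cheapest solution is a common precaution.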
Algorithm k-Means++

Recently, Arthur and Vassilvitskii developed the k-Means++ algorithm [9], which is a seeding procedure for Lloyd’s k-means algorithm. This seeding procedure takes into account the fact that the quality of the solution of Lloyd’s k-means algorithm depends strongly on the initial set of centers. In order to achieve a better arrangement, it chooses the initial set of centers adaptively and non-uniformly at random by choosing each point as the next center with probability proportional to its squared distance from the nearest center already chosen. Note that, for a given set of centers, the squared distance of a point from its nearest center corresponds to the current contribution of this point to the total k-means clustering cost. The k-Means++ seeding procedure is given by Algorithm 6.1.2. For simplicity of description, we say that Algorithm 6.1.2 chooses the set $C$ at random according to $D^2$.

Algorithm 6.1.2 AdaptiveSeeding(P, k)
1: choose an initial center $c_1$ uniformly at random from $P$
2: $C \leftarrow \{c_1\}$
3: for $i \leftarrow 2$ to $k$ do
4:   choose the next center $c_i$ at random from $P$, where the probability of each $p \in P$ is given by $D^2(p, C)/\mathrm{Means}(P, C)$
5:   $C \leftarrow C \cup \{c_i\}$

By replacing line 1 of Algorithm 6.1.1 with Algorithm 6.1.2, Arthur and Vassilvitskii developed a k-means clustering algorithm, known as the k-Means++ algorithm, which gives good experimental results and guarantees a solution of a certain quality. More precisely, they showed the following:

Lemma 6.1.3 ([9]). Let $C \subseteq P$ be a set of $k$ points chosen at random according to $D^2$. Then, we have

$$E\left[\mathrm{Means}(P, C)\right] \le 8\,(2 + \ln(k)) \cdot \mathrm{Means}^*_k(P) .$$

Algorithm BIRCH

One of the earliest and best known practical clustering algorithms for data streams is BIRCH (which is an acronym for ‘Balanced Iterative Reducing and Clustering using Hierarchies’) [111]. BIRCH is a heuristic which exploits the observation that the point space is usually not uniformly occupied.
It scans the given set of input points once and computes a pre-clustering by summarizing dense regions of points by their so-called clustering features. Such a clustering feature consists of the number of points in the region, the center of gravity, and the sum of squared distances to the origin. Thereby, the problem of clustering the original input point set is reduced to the problem of clustering the set of summaries, which is much smaller than the original point set. The pre-clustering is then clustered by using an agglomerative (bottom-up) clustering algorithm. In this process, the algorithm uses the clustering features to calculate the intra-cluster distances. BIRCH successively merges the closest pair of clusters until the desired number of clusters is obtained.

To a certain extent, BIRCH uses a kind of coreset construction. However, no theoretical analysis of this method is known. For more details about BIRCH, the reader is referred to [111].

Algorithm StreamLS

Another well-known clustering algorithm for data streams is the streaming implementation of algorithm LSearch from O’Callaghan et al. [96] and Guha et al. [52], which we refer to as StreamLS. This algorithm partitions the input stream into chunks and computes for each chunk a k-means clustering solution using a local search algorithm from Guha et al. [53]. Finally, the local search algorithm is applied once more on the union of the solutions for the chunks to obtain a k-means clustering for the whole input stream.

The local search algorithm of Guha et al. [53] takes advantage of the relationship between the k-means clustering problem and the uniform facility location problem (see Section 2.2 for a definition of the uniform facility location problem). More precisely, it is based on the observation that if the opening cost of a facility increases, then the number of facilities (or cluster centers) of an optimal solution tends to decrease.
Hence, to solve the k-means problem, the algorithm of Guha et al. [53] performs a binary search on the opening cost of a facility to find a cost that gives the desired number of cluster centers. During the binary search, each facility location problem is solved by starting with an initial solution that is obtained by a simple non-uniform sampling approach and then refining this solution by making local improvements. More details can be found in [96, 53, 52].

6.2 Coreset Construction

Our k-means clustering algorithm uses a coreset construction based on the k-Means++ seeding procedure from Arthur and Vassilvitskii [9]. One reason for this design decision was that the k-Means++ seeding works well for high-dimensional datasets, which is often required in practice. This nice property does not apply to many other clustering methods, such as the grid-based methods from Har-Peled and Mazumdar [58] and Frahling and Sohler [44].

Let $P \subset \mathbb{R}^d$ be a set of points with size $|P| =: n$. For an arbitrary fixed parameter $m \in \mathbb{N}$, our coreset construction is as follows (see also Algorithm 6.2.1). First, we choose a set $S := \{q_1, q_2, \dots, q_m\}$ of size $m$ at random according to $D^2$. Let $Q_i$ denote the set of points from $P$ that are closest to $q_i$ (breaking ties arbitrarily). By using the weight function $w : S \to \mathbb{R}_{\ge 0}$ with $w(q_i) = |Q_i|$, we obtain the weighted set $S$ as our coreset.

Note that our coreset construction is rather easy to implement and its running time has a merely linear dependency on the dimension $d$. Furthermore, empirical evaluation (as given in Section 6.5) suggests that our construction leads to good coresets even for relatively small choices of $m$ (say, $m = 200k$). Unfortunately, we do not have a formal proof supporting this observation. However, we are able to take a first step by proving that, at least in low-dimensional spaces, our construction indeed leads to small coresets.
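The construction just described (sampling $m$ points according to $D^2$ and weighting each sample by the number of input points closest to it) can be sketched in Python as follows. This is a hedged illustration under stated assumptions rather than the thesis implementation: the names `adaptive_coreset`, `weighted_cost`, and `squared_distance` are hypothetical, ties are broken in favor of the earlier sample point, and the uniform fallback for the degenerate case in which all remaining distances are zero is an assumption of this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance D^2(p, q) between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def adaptive_coreset(points, m, rng=None):
    """Choose m points at random according to D^2 and weight them by |Q_i|."""
    rng = rng or random.Random()
    first = rng.choice(points)  # q_1 is chosen uniformly at random
    coreset = [first]
    # nearest_sq[j] = D^2(points[j], S); nearest_idx[j] = index of the closest sample.
    nearest_sq = [squared_distance(p, first) for p in points]
    nearest_idx = [0] * len(points)
    for i in range(1, m):
        if sum(nearest_sq) > 0:
            # Each point is drawn with probability D^2(p, S) / Means(P, S).
            q = rng.choices(points, weights=nearest_sq, k=1)[0]
        else:
            q = rng.choice(points)  # assumption: uniform fallback if all distances are 0
        coreset.append(q)
        # Update the nearest sample of every point (ties keep the earlier sample).
        for j, p in enumerate(points):
            d = squared_distance(p, q)
            if d < nearest_sq[j]:
                nearest_sq[j] = d
                nearest_idx[j] = i
    # w(q_i) = |Q_i| = number of points whose nearest sample is q_i.
    weights = [0] * m
    for idx in nearest_idx:
        weights[idx] += 1
    return coreset, weights


def weighted_cost(coreset, weights, centers):
    """Weighted clustering cost Means(S, C) of the coreset for a set of centers."""
    return sum(
        w * min(squared_distance(q, c) for c in centers)
        for q, w in zip(coreset, weights)
    )
```

For any candidate set of centers, `weighted_cost(coreset, weights, centers)` can then be compared with the exact cost on the full point set to observe the relative error empirically. Each round scans all points once, so this sketch takes $\Theta(dnm)$ time in total.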
Algorithm 6.2.1 AdaptiveCoreset(P, m)
1: choose an initial coreset point $q_1$ uniformly at random from $P$
2: $w(q_1) \leftarrow 0$
3: $S \leftarrow \{q_1\}$
4: for $i \leftarrow 2$ to $m$ do
5:   choose $q_i$ at random according to $D^2$ from $P$
6:   $w(q_i) \leftarrow 0$
7:   $S \leftarrow S \cup \{q_i\}$
8: for each $p \in P$ do
9:   let $q_i \in S$, $1 \le i \le m$, be the nearest coreset point to $p$
10:  $w(q_i) \leftarrow w(q_i) + 1$

Our proof is based on Lemma 6.2.1. Intuitively, this lemma states that if we consider an optimal $m$-clustering of $P$, with $m$ large enough, then the optimal $m$-clustering cost is merely a tiny fraction of the optimal $k$-clustering cost of $P$. Lemma 6.2.1 is a consequence of the fact that there exist $(k, \gamma)$-coresets of size $m \in (d/\gamma)^{O(d)} \, k \log(n)$, which has already been proven by Har-Peled and Mazumdar [58].

Lemma 6.2.1. Let $0 < \gamma \le 1$, and let $m \in \mathbb{N}$. If

$$m \ge \left(\frac{16d}{\gamma}\right)^{d/2} \cdot k \cdot \lceil \log(n) + 3 \rceil ,$$

then we get

$$\mathrm{Means}^*_m(P) \le \gamma \cdot \mathrm{Means}^*_k(P) .$$

Proof. Let $C := \{c_1, \dots, c_k\}$ be an optimal solution to the Euclidean k-means clustering problem for $P$ with $|P| =: n$, i.e., $\mathrm{Means}(P, C) = \mathrm{Means}^*_k(P)$. We consider an exponential grid around each center in $C$. The construction of this grid is essentially the same as the one from Har-Peled and Mazumdar [58]. In detail, the construction is defined as follows. Let the average cost per point of an optimal $k$-clustering solution for $P$ be denoted by

$$R := \frac{\mathrm{Means}^*_k(P)}{n} .$$

Furthermore, for each $j \in \{0, 1, \dots, \lceil \log(n) + 2 \rceil\}$ and each center $c_i \in C$, let $V_{ij}$ be the axis-parallel square centered at $c_i$ with side length

$$r_j := \sqrt{2^j R} .$$

Then, we recursively define $W_{i0} := V_{i0}$ and $W_{ij} := V_{ij} \setminus V_{i,j-1}$ for $j \in \{1, 2, \dots, \lceil \log(n) + 2 \rceil\}$. Obviously, each point in $P$ is contained within some $W_{ij}$ since otherwise there would be a point $p \in P$ with

$$D^2(p, C) > \left(\frac{r_{\lceil \log(n) + 2 \rceil}}{2}\right)^2 = \frac{2^{\lceil \log(n) + 2 \rceil} R}{4} \ge nR \ge \mathrm{Means}^*_k(P) ,$$

which is a contradiction. For each $i, j$ individually, we partition $W_{ij}$ into small grid cells with side length

$$r'_j := \sqrt{\frac{\gamma}{9d}} \cdot r_j = \sqrt{\frac{\gamma}{9d} \cdot 2^j R} .$$
We remark that the small grid cells do not have to fit properly in $W_{ij}$. In fact, we impose a grid with side length $r'_j$ on $W_{ij}$ such that $W_{ij}$ is completely covered. Then, the partition of $W_{ij}$ consists of all the small cells that lie completely inside $W_{ij}$ as well as the parts inside $W_{ij}$ of all the small cells that only partly overlap $W_{ij}$. This partition is illustrated by Figure 6.1.

Figure 6.1: Illustration of the partition of $W_{ij}$ into small grid cells. The area $W_{ij}$ is colored in gray. The white parts of the small cells do not belong to the partition of $W_{ij}$.

For each grid cell $\mathcal{C}$ such that $\mathcal{C} \cap W_{ij}$ contains points from $P$, we select a single point from $P \cap \mathcal{C} \cap W_{ij}$ as the representative of all the points in $P \cap \mathcal{C} \cap W_{ij}$. Let $G$ be the set of all these representatives. Since we have

$$\frac{r_j}{r'_j} = \sqrt{\frac{9d}{\gamma}} \ge 3 ,$$

there are at most

$$\sum_{c_i \in C} \sum_{j=0}^{\lceil \log(n) + 2 \rceil} \left( \left\lceil \frac{r_j}{r'_j} \right\rceil \right)^d \le \sum_{c_i \in C} \sum_{j=0}^{\lceil \log(n) + 2 \rceil} \left( \frac{4}{3} \cdot \frac{r_j}{r'_j} \right)^d = k \cdot \lceil \log(n) + 3 \rceil \cdot \left(\frac{16d}{\gamma}\right)^{d/2}$$

grid cells. Since this number is smaller than or equal to $m$, we obtain $|G| \le m$. Let $g_p$ denote the representative of $p \in P$ in $G$. Then, we have

$$\mathrm{Means}^*_m(P) \le \mathrm{Means}^*_{|G|}(P) \le \mathrm{Means}(P, G) \le \sum_{p \in P} D^2(p, g_p) . \tag{6.1}$$

For each point $p \in P$, the distance from its representative $g_p$ is upper bounded by the diagonal of the grid cell that contains $p$. Thus, for each $p \in W_{i0}$, we have

$$D^2(p, g_p) \le \left(\sqrt{d} \cdot r'_0\right)^2 \le \frac{\gamma R}{9} . \tag{6.2}$$

Furthermore, for each $p \in W_{ij}$ with $j \ge 1$, we know that $c_i$ is the center of $V_{i,j-1}$ and $p$ is not contained in $V_{i,j-1}$. It follows that

$$D^2(p, C) \ge \left(\frac{r_{j-1}}{2}\right)^2 \ge 2^{j-3} R .$$

Hence, in this case, we get

$$D^2(p, g_p) \le \left(\sqrt{d} \cdot r'_j\right)^2 = \frac{\gamma}{9} \cdot 2^j R \le \frac{8\gamma}{9} \cdot D^2(p, C) . \tag{6.3}$$

Due to Inequalities (6.1)–(6.3) and the definition of $R$, we obtain

$$\begin{aligned}
\mathrm{Means}^*_m(P) &\le \sum_{p \in P} D^2(p, g_p) \le \sum_{p \in P} \left( \frac{\gamma R}{9} + \frac{8\gamma}{9} \cdot D^2(p, C) \right) \\
&= n \cdot \frac{\gamma}{9} R + \frac{8\gamma}{9} \sum_{p \in P} D^2(p, C) \\
&= \frac{\gamma}{9} \cdot \mathrm{Means}^*_k(P) + \frac{8\gamma}{9} \cdot \mathrm{Means}^*_k(P) = \gamma \cdot \mathrm{Means}^*_k(P) .
\end{aligned}$$

Now, we go back to our coreset construction.
Given the point set $P$ and a parameter $m \in \mathbb{N}$, let $S$ be our weighted coreset chosen at random according to $D^2$ from $P$ by Algorithm 6.2.1. Furthermore, let $C$ be an arbitrary set of $k$ centers. For each point $p \in P$, we denote by $q_p$ the point from $S$ whose weight has been increased by 1 due to $p$ in lines 9 and 10 of Algorithm 6.2.1, i.e., $q_p$ is a point from $S$ closest to $p$. Then, the difference between the cost of clustering $P$ and the cost of clustering $S$ is at most

$$\left| \mathrm{Means}(P, C) - \mathrm{Means}(S, C) \right| = \left| \sum_{p \in P} D^2(p, C) - \sum_{p \in P} D^2(q_p, C) \right| \le \sum_{p \in P} \left| D^2(p, C) - D^2(q_p, C) \right| .$$

We partition $P$ into two subsets $P_{\mathrm{near}}$ and $P_{\mathrm{dist}}$. Roughly speaking, the set $P_{\mathrm{near}}$ contains each point $p \in P$ whose distance from its coreset point $q_p$ is small compared to the distance from its nearest center in $C$. More precisely, for any constant $\varepsilon$ with $0 < \varepsilon \le 1$, we define

$$P_{\mathrm{near}} := \{ p \in P \mid D(p, q_p) \le \varepsilon D(p, C) \} .$$

The set $P_{\mathrm{dist}}$ contains all the other points from $P$, i.e.,

$$P_{\mathrm{dist}} := \{ p \in P \mid D(p, q_p) > \varepsilon D(p, C) \} .$$

First, in Claim 6.2.2, we estimate the error of the clustering cost that occurs for any point in $P_{\mathrm{near}}$. Then, in Claim 6.2.3, we give an estimation of the error for any point in $P_{\mathrm{dist}}$.

Claim 6.2.2. If $p \in P_{\mathrm{near}}$, then

$$\left| D^2(p, C) - D^2(q_p, C) \right| \le 3\varepsilon D^2(p, C) .$$

Proof. For the moment, let us assume that $D(p, C) \le D(q_p, C)$. Let $c_p$ denote the element from $C$ closest to $p$. By the triangle inequality and the definition of $P_{\mathrm{near}}$, we have

$$D(q_p, C) \le D(q_p, c_p) \le D(p, c_p) + D(p, q_p) \le (1 + \varepsilon) \cdot D(p, C) .$$

Hence, for the squared distances, we obtain

$$D^2(q_p, C) \le (1 + \varepsilon)^2 \cdot D^2(p, C) \le (1 + 3\varepsilon) \cdot D^2(p, C) .$$

Thus, we get $D^2(q_p, C) - D^2(p, C) \le 3\varepsilon D^2(p, C)$, which proves the claim in the case $D(p, C) \le D(q_p, C)$.

Now, assume that $D(p, C) > D(q_p, C)$. Let $c_s$ denote the element from $C$ closest to $q_p$. Again, by the triangle inequality and the definition of $P_{\mathrm{near}}$, we have

$$D(p, C) \le D(p, c_s) \le D(q_p, c_s) + D(p, q_p) \le D(q_p, C) + \varepsilon D(p, C) .$$

It follows that $(1 - \varepsilon) \cdot D(p, C) \le D(q_p, C)$.
For the squared distances, we obtain

$$D^2(q_p, C) \ge (1 - \varepsilon)^2 \cdot D^2(p, C) > (1 - 2\varepsilon) \cdot D^2(p, C) .$$

Hence, we get

$$D^2(p, C) - D^2(q_p, C) < 2\varepsilon D^2(p, C) < 3\varepsilon D^2(p, C) ,$$

which proves the claim in the case $D(p, C) > D(q_p, C)$.

Claim 6.2.3. If $p \in P_{\mathrm{dist}}$, then

$$\left| D^2(p, C) - D^2(q_p, C) \right| \le \frac{3}{\varepsilon} D^2(p, q_p) .$$

Proof. Let $c_p$ denote the element from $C$ closest to $p$, and let $c_s$ denote the element from $C$ closest to $q_p$. By the triangle inequality, we have

$$D(p, C) \le D(p, c_s) \le D(p, q_p) + D(q_p, c_s) = D(p, q_p) + D(q_p, C) .$$

Similarly, we get

$$D(q_p, C) \le D(q_p, c_p) \le D(p, q_p) + D(p, c_p) = D(p, q_p) + D(p, C) .$$

It follows that $|D(p, C) - D(q_p, C)| \le D(p, q_p)$ and $D(p, C) + D(q_p, C) \le 2 D(p, C) + D(p, q_p)$. Since $D(p, q_p) > \varepsilon D(p, C)$ and $\varepsilon \le 1$, we get

$$\begin{aligned}
\left| D^2(p, C) - D^2(q_p, C) \right| &= |D(p, C) - D(q_p, C)| \cdot \left( D(p, C) + D(q_p, C) \right) \\
&\le D(p, q_p) \cdot \left( 2 D(p, C) + D(p, q_p) \right) \\
&\le \left( \frac{2}{\varepsilon} + 1 \right) D^2(p, q_p) \le \frac{3}{\varepsilon} D^2(p, q_p) .
\end{aligned}$$

Now, we can show our main result.

Theorem 6. Let $k \in \mathbb{N}$, let $\varepsilon$, $0 < \varepsilon \le 1$, be a precision parameter, and let $\delta$, $0 < \delta < 1$, be an error probability. Given a point set $P \subset \mathbb{R}^d$ of size $|P| =: n$ and a size parameter

$$m \in \left(\frac{d}{\delta\varepsilon}\right)^{O(d)} \cdot k \cdot \log(n) \cdot \log^{d/2}\left(\frac{k \log(n)}{\delta\varepsilon}\right) ,$$

algorithm AdaptiveCoreset computes a weighted multiset $S$ with size $m$ that is a $(k, 6\varepsilon)$-coreset of $P$ with probability at least $1 - \delta$.

Proof. Due to Claims 6.2.2 and 6.2.3, we have

$$\begin{aligned}
\left| \mathrm{Means}(P, C) - \mathrm{Means}(S, C) \right| &\le \sum_{p \in P} \left| D^2(p, C) - D^2(q_p, C) \right| \\
&\le \sum_{p \in P_{\mathrm{near}}} \left| D^2(p, C) - D^2(q_p, C) \right| + \sum_{p \in P_{\mathrm{dist}}} \left| D^2(p, C) - D^2(q_p, C) \right| \\
&\le 3\varepsilon \sum_{p \in P_{\mathrm{near}}} D^2(p, C) + \frac{3}{\varepsilon} \sum_{p \in P_{\mathrm{dist}}} D^2(p, q_p) \\
&\le 3\varepsilon \cdot \mathrm{Means}(P, C) + \frac{3}{\varepsilon} \cdot \mathrm{Means}(P, S) .
\end{aligned}$$

Due to Lemma 6.1.3 and Markov’s inequality, we obtain

$$\mathrm{Means}(P, S) \le \frac{8}{\delta} (2 + \ln(m)) \cdot \mathrm{Means}^*_m(P)$$

with probability at least $1 - \delta$.
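Hence, by using Lemma 6.2.1 with

$$\gamma := \frac{\varepsilon^2 \delta}{8 (2 + \ln m)} ,$$

we have

$$\mathrm{Means}(P, S) \le \frac{8}{\delta} (2 + \ln(m)) \cdot \mathrm{Means}^*_m(P) \le \frac{8}{\delta} (2 + \ln(m)) \cdot \gamma \cdot \mathrm{Means}^*_k(P) \le \varepsilon^2 \, \mathrm{Means}^*_k(P) \le \varepsilon^2 \, \mathrm{Means}(P, C)$$

and, thus, $|\mathrm{Means}(P, C) - \mathrm{Means}(S, C)| \le 6\varepsilon \cdot \mathrm{Means}(P, C)$ with probability at least $1 - \delta$, provided that the coreset size $m$ satisfies the condition

$$m \ge \left(\frac{16d}{\gamma}\right)^{d/2} \cdot k \cdot \lceil \log(n) + 3 \rceil . \tag{6.4}$$

Hence, the only thing left to do is to prove that there exists a coreset size

$$m \in \left(\frac{d}{\delta\varepsilon}\right)^{O(d)} \cdot k \cdot \log(n) \cdot \log^{d/2}\left(\frac{k \log(n)}{\delta\varepsilon}\right) \tag{6.5}$$

that satisfies Inequality (6.4). Since we can assume that $n \ge 16$ and $m \ge 8$, Inequality (6.4) is satisfied if we have

$$\frac{m}{\log^{d/2}(m)} \ge \frac{2 \cdot 16^d \, d^{d/2} \, k \log(n)}{\delta^{d/2} \varepsilon^d} . \tag{6.6}$$

We conclude that Condition (6.5) and Inequality (6.6) are both satisfied for a choice of

$$m = (2d)^{d/2} \cdot \frac{2 \cdot 16^d \, d^{d/2} \, k \log(n)}{\delta^{d/2} \varepsilon^d} \cdot \log^{d/2}\left( \frac{2 \cdot 16^d \, d^{d/2} \, k \log(n)}{\delta^{d/2} \varepsilon^d} \right)$$

since we have

$$\begin{aligned}
\log^{d/2}(m) &= \log^{d/2}\left( (2d)^{d/2} \cdot \frac{2 \cdot 16^d \, d^{d/2} \, k \log(n)}{\delta^{d/2} \varepsilon^d} \cdot \log^{d/2}\left( \frac{2 \cdot 16^d \, d^{d/2} \, k \log(n)}{\delta^{d/2} \varepsilon^d} \right) \right) \\
&\le \left(\frac{d}{2}\right)^{d/2} \cdot \log^{d/2}\left( 2d \cdot \frac{2 \cdot 16^d \, d^{d/2} \, k \log(n)}{\delta^{d/2} \varepsilon^d} \cdot \log\left( \frac{2 \cdot 16^d \, d^{d/2} \, k \log(n)}{\delta^{d/2} \varepsilon^d} \right) \right) \\
&\le \left(\frac{d}{2}\right)^{d/2} \cdot \log^{d/2}\left( \left( \frac{2 \cdot 16^d \, d^{d/2} \, k \log(n)}{\delta^{d/2} \varepsilon^d} \right)^4 \right) \\
&= (2d)^{d/2} \cdot \log^{d/2}\left( \frac{2 \cdot 16^d \, d^{d/2} \, k \log(n)}{\delta^{d/2} \varepsilon^d} \right) .
\end{aligned}$$

Please note that the size bound on the number of coreset points $m$ from Theorem 6 is merely a sufficient condition, and that, to the best of our knowledge, there is no reason to assume that this size bound is tight. Hence, in compliance with our experiments, the actual dependency of $m$ on the dimension $d$ may as well be better than is suggested by the theorem.

6.3 The Coreset Tree

Unfortunately, there is one practical problem concerning the k-Means++ seeding procedure. Assume that we have chosen a sample set $S = \{q_1, q_2, \dots, q_i\}$ from the input set $P \subseteq \mathbb{R}^d$ so far, where $i < m$ and $|P| =: n$. In order to compute the probabilities to choose the next sample point $q_{i+1}$, we need to determine the squared distance from each point in $P$ to its nearest neighbor in $S$.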
Hence, using a standard implementation of such a computation, we require $\Theta(dnm)$ time to obtain all $m$ coreset points, which is too slow for larger values of $m$. For this reason, we propose a new data structure called coreset tree, which speeds up this computation. Roughly speaking, a coreset tree is a hierarchical decomposition of $P$ where each leaf represents a set of this decomposition. The advantage of the coreset tree is that it allows us to compute subsequent sample points by taking only points from a subset of $P$ into account whose size is significantly smaller than $n$. We obtain that if the constructed coreset tree is balanced (i.e., the tree is of depth $\Theta(\log(m))$ and each leaf represents roughly the same number of points), we merely need $\Theta(dn \log(m))$ time to compute all $m$ coreset points. This intuition is supported by our empirical evaluation on real-world datasets, where we find that the process of sampling according to $D^2$ is significantly sped up while the resulting sample set $S$ has essentially the same properties as the original k-Means++ seed.

In the following, we will explain the construction of the coreset tree in more detail. A description in pseudocode is given by Algorithm 6.3.1.

6.3.1 Definition of the Coreset Tree

A coreset tree $T$ for a point set $P$ is a binary tree that is associated with a hierarchical divisive clustering for $P$: One starts with a single cluster that contains the whole point set $P$ and successively partitions existing clusters into two subclusters such that the points in one subcluster are far from the points in the other subcluster. The division step is repeated until the number of clusters corresponds to the desired number of clusters. Associated with this procedure, the coreset tree $T$ has to satisfy the following properties:

(i) Each node of $T$ is associated with a cluster in the hierarchical divisive clustering.
(ii) The root of $T$ is associated with the single cluster that contains the whole point set $P$.
(iii) The nodes associated with the two subclusters of a cluster $C$ are the child nodes of the node associated with $C$.

With each node $v$ of $T$, we store the following attributes: a point set $P_v$, a representative point $q_v$ from $P_v$, and a value $\mathrm{weight}(v)$. Here, point set $P_v$ is the cluster associated with node $v$. Note that this attribute only has to be stored explicitly in the leaf nodes of $T$, while, for an internal node $v$, the set $P_v$ is implicitly defined by the union of the point sets of its children. The representative point $q_v$ of a node $v$ is obtained by using the technique of non-uniform sampling according to $D^2$. At any time, the points $q_\ell$ stored at the leaf nodes $\ell$ are exactly the points that have been chosen so far to be points of the eventual coreset. Furthermore, for a leaf node $\ell$, the attribute $\mathrm{weight}(\ell)$ equals $\mathrm{Means}(P_\ell, q_\ell)$, which is the sum of squared distances over all points in $P_\ell$ to $q_\ell$. The value $\mathrm{weight}(v)$ of an internal node $v$ is defined as the sum of the weights of its children.

6.3.2 Construction of the Coreset Tree

For the sake of simplicity, at any time, we number the leaf nodes of the current coreset tree consecutively starting with 1. At the beginning, $T$ consists of one node, the root, which is given the number 1 and is associated with the whole point set $P$. The attribute $q_1$ of the root is our first point in $S$ and is obtained by sampling one point uniformly at random from $P$.

Now, let us assume that our current tree has $i$ leaf nodes $1, 2, \dots, i$, the corresponding sample points are $q_1, q_2, \dots, q_i$, and $P_1, P_2, \dots, P_i$ are the associated clusters. We obtain the next sample point $q_{i+1}$, new clusters in our hierarchical divisive clustering, and, thus, new nodes in $T$ by performing the following three steps:

1. Choose a leaf node $\ell$ at random, where the probability of each leaf node $\ell'$ is proportional to $\mathrm{cost}(P_{\ell'}, q_{\ell'})$, i.e., to the attribute $\mathrm{weight}(\ell') = \mathrm{Means}(P_{\ell'}, q_{\ell'})$.
2. Choose a new sample point, denoted by $q_{i+1}$, from the subset $P_\ell$ at random according to $D^2$.
3. Based on $q_\ell$ and $q_{i+1}$, split $P_\ell$ into two subclusters and create two child nodes of $\ell$ in $T$.

The first step is implemented as follows: Starting at the root of $T$, let $u$ be the current inner node. Then, we select randomly a child node of $u$, where the probability distribution for the child nodes of $u$ is given by their associated weights. More precisely, each child node $v$ of the current node $u$ is chosen with probability $\mathrm{weight}(v)/\mathrm{weight}(u)$. We continue this selection process until we reach a leaf node. Let $\ell$ be the selected leaf node, let $q_\ell$ be the sample point stored at $\ell$, and let $P_\ell$ be the subset of $P$ associated with $\ell$. It is easy to check that, in doing so, we have chosen $\ell$ among the leaf nodes with probability $\mathrm{cost}(P_\ell, q_\ell) / \sum_{j=1}^{i} \mathrm{cost}(P_j, q_j)$.

In the second step, we choose the new sample point $q_{i+1}$ from $P_\ell$ at random according to $D^2$, i.e., each $p \in P_\ell$ is chosen with probability $D^2(p, q_\ell) / \mathrm{cost}(P_\ell, q_\ell)$. In doing so, each point in $P$ is sampled with a probability that is proportional to its squared distance to its center in the clustering induced by the partition of the leaf nodes (giving the clusters) and their sample points (being the centers). That is, we use the same distribution as the k-Means++ seeding does with the exception that the probability of choosing a point $p \in P_j$ is proportional to $D^2(p, q_j)$ rather than proportional to $D^2(p, \{q_1, \dots, q_i\})$.

Algorithm 6.3.1 TreeCoreset(P, m)
1: choose $q_1$ uniformly at random from $P$
2: root $\leftarrow$ node with $q_{\mathrm{root}} \leftarrow q_1$ and $\mathrm{weight}(\mathrm{root}) \leftarrow \mathrm{Means}(P, q_1)$
3: $S \leftarrow \{q_1\}$
4: for $i \leftarrow 2$ to $m$ do
5:   start at root, iteratively select one of the two child nodes at random according to their weights until a leaf $\ell$ is chosen
6:   choose $q_i$ according to $D^2$ from $P_\ell$
7:   $S \leftarrow S \cup \{q_i\}$
8:   create two child nodes $\ell_1, \ell_2$ of $\ell$ and update $\mathrm{weight}(\ell)$
9:   propagate update of weight attribute upwards up to node root
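The first two steps, i.e., the weighted descent from the root to a leaf and the $D^2$ sampling inside the selected leaf, can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the `Node` class and the functions `select_leaf` and `sample_point` are hypothetical names, the tree is assumed to be binary with consistent weight attributes, and the handling of zero-weight nodes is an assumption of this example rather than part of the description above.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance D^2(p, q) between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


class Node:
    """Hypothetical coreset tree node holding P_v, q_v, and weight(v)."""

    def __init__(self, points, rep, parent=None):
        self.points = points      # cluster P_v (only needed at the leaves)
        self.rep = rep            # representative point q_v
        self.parent = parent
        self.children = []        # empty for a leaf, two nodes otherwise
        # For a leaf, weight(v) = Means(P_v, q_v); an inner node must later be
        # set to the sum of the weights of its children.
        self.weight = sum(squared_distance(p, rep) for p in points)


def select_leaf(root, rng):
    """Step 1: descend from the root, choosing child v with prob. weight(v)/weight(u)."""
    node = root
    while node.children:
        left, right = node.children
        if node.weight <= 0:
            node = left  # assumption: arbitrary choice in the degenerate case
        elif rng.random() * node.weight < left.weight:
            node = left
        else:
            node = right
    return node


def sample_point(leaf, rng):
    """Step 2: choose p from P_leaf with probability D^2(p, q_leaf) / weight(leaf)."""
    if leaf.weight <= 0:
        return rng.choice(leaf.points)  # assumption: uniform fallback
    weights = [squared_distance(p, leaf.rep) for p in leaf.points]
    return rng.choices(leaf.points, weights=weights, k=1)[0]
```

Because every inner node stores the sum of the weights of its children, the descent touches only one root-to-leaf path, and the subsequent sampling looks only at the points of the selected leaf instead of the whole point set.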
In the third step, we create two child nodes $\ell_1$ and $\ell_2$ of $\ell$ and compute the associated partition of $P_\ell$ as well as the corresponding attributes. We store at node $\ell_1$ the point $q_\ell$ and at node $\ell_2$ our new sample point $q_{i+1}$. Based on these two representative points, we partition $P_\ell$ into two subsets $P_{\ell_1}$ and $P_{\ell_2}$. The set $P_{\ell_1}$ contains all the points from $P_\ell$ which are closer to $q_\ell$ than to $q_{i+1}$, i.e.,

$$P_{\ell_1} = \{ p \in P_\ell \mid D(p, q_\ell) < D(p, q_{i+1}) \} .$$

The set $P_{\ell_2}$ contains all the remaining points from $P_\ell$, i.e.,

$$P_{\ell_2} = \{ p \in P_\ell \mid D(p, q_{i+1}) \le D(p, q_\ell) \} .$$

The node $\ell_1$ is associated with the set $P_{\ell_1}$, and $\ell_2$ is associated with the set $P_{\ell_2}$. We determine the weight attribute for the nodes $\ell_1$ and $\ell_2$ as described above. Recall here that the weight attribute of an inner node of $T$ is defined as the sum of the weights of its child nodes. Consequently, we update the weight of the parent node $\ell$ of $\ell_1$ and $\ell_2$ accordingly. Afterwards, this update is propagated upwards, until we reach the root of the tree.

6.3.3 Extraction of the Coreset

As soon as the coreset tree $T$ has $m$ leaf nodes, we can construct our coreset. Let $q_1, q_2, \dots, q_m$ be the representative points stored at the leaf nodes of $T$. Furthermore, let $Q_i$ denote the set of points from $P$ which are closest to $q_i$ (breaking ties arbitrarily). Then, we obtain the coreset $S = \{q_1, q_2, \dots, q_m\}$ where the weight of $q_i$ is given by the number of points in $Q_i$.

6.4 Streaming Algorithm

In this section, we describe our clustering algorithm for data streams. To this end, let $m$ be a fixed size parameter.
First, we extract a small coreset of size $m$ from the data stream by using the merge-and-reduce technique from Har-Peled and Mazumdar [58], which is based on the theory of decomposable search problems of Bentley and Saxe [16]. This streaming method is described in detail in the section below. For the reduce step, we employ our new coreset construction, using the coreset trees as given in Section 6.3. Afterwards, a $k$-clustering can be obtained at any time by running any k-means algorithm on the coreset. Note that since the size of the coreset is much smaller than the size of the data stream, it is no longer inefficient to use algorithms that require random access on their input data. In our implementation, we run the k-Means++ algorithm from Arthur and Vassilvitskii [9] on our coreset five times independently and choose the best clustering result obtained this way. We call the resulting algorithm StreamKM++.

Algorithm 6.3.2 InsertPoint(p)
1: put $p$ into $B_0$
2: if $B_0$ is full then
3:   create empty bucket $S$
4:   move points from $B_0$ to $S$
5:   empty $B_0$
6:   $i \leftarrow 1$
7:   while $B_i$ is not empty do
8:     create coreset from the union of $B_i$ and $S$
9:     store coreset in $S$
10:    empty $B_i$
11:    $i \leftarrow i + 1$
12:  move points from $S$ to $B_i$

6.4.1 The Merge-and-Reduce Technique

In order to maintain a small coreset for all points in the data stream, we use the merge-and-reduce method [16, 58]. For a data stream containing $n$ points, the algorithm maintains $L := \lceil \log(n/m) + 2 \rceil$ buckets $B_0, B_1, \dots, B_{L-1}$. Bucket $B_0$ can store any number between 0 and $m$ points. For $i \ge 1$, bucket $B_i$ is either empty or contains exactly $m$ points. The idea of this approach is that, at any time, if bucket $B_i$ is full, it contains a coreset of size $m$ representing $2^{i-1} m$ points from the data stream.

New points from the data stream are always inserted into the first bucket $B_0$. If bucket $B_0$ is full (i.e., contains $m$ points), all points from $B_0$ need to be moved to bucket $B_1$.
If bucket $B_1$ is empty, we are done. However, if bucket $B_1$ already contains $m$ points, we compute a new coreset $S$ of size $m$ from the union of the $2m$ points stored in $B_0$ and $B_1$ by using the coreset construction described above. Now, both buckets $B_0$ and $B_1$ are emptied and the $m$ points from coreset $S$ need to be moved into bucket $B_2$. If bucket $B_2$ is full, we repeat the process with $S$ and $B_2$. Overall, the process is repeated iteratively until we find the first empty bucket into which we can move the coreset $S$. A description in pseudocode for inserting a point from the data stream into the buckets is given by Algorithm 6.3.2.

At any time, it is possible to compute a coreset of size $m$ for all the points in the data stream that we have seen so far. For this purpose, we compute a coreset from the union of the at most $m \lceil \log(n/m) + 2 \rceil$ weighted coreset points stored in all the buckets $B_0, B_1, \dots, B_{L-1}$ by using the coreset tree construction. In this way, we obtain the desired coreset of size $m$.

Note that the coreset tree construction can be easily generalized to input points with integer weights. To this end, each time we choose a new coreset point, we compute the probabilities of the points according to $D^2$, as described before, and then multiply each probability by the weight of the corresponding point. We also incorporate the point weights when we compute the weight attribute of a new leaf node. These two adaptations can be thought of as replacing each weighted point by multiple copies of the same point, each having weight 1.

6.4.2 Complexity

Using our implementation, a single merge-and-reduce step is guaranteed to be executed in time $O(dm^2)$ (or even in time $\Theta(dm \log(m))$ if we assume the used coreset tree to be balanced). For a stream of $n$ points, $\lceil n/m \rceil$ such steps are needed. The amortized running time of all merge-and-reduce steps is at most $O(dnm)$.
The final merge-and-reduce step, to obtain a coreset of size $m$ for the union of all buckets, can be done in time $O(dm^2 \log(n/m))$. Finally, algorithm k-Means++ is executed five times on an input set of size $m$, requiring time $\Theta(dkm)$ per iteration. Summing up, the total running time of algorithm StreamKM++ is $O(dnm)$, and the amortized processing time per data item is $O(dm)$.

Obviously, algorithm StreamKM++ needs at most $\Theta(dm \log(n/m))$ memory units. Hence, both the processing time and the space requirement have a low dependency on the dimension $d$. As a result, our approach is suitable for high-dimensional data. Of course, careful consideration has to be given to the choice of the coreset size parameter $m$. Our experiments show that a choice of $m = 200k$ is sufficient for a good clustering quality without sacrificing too much running time.

6.5 Empirical Evaluation

We conducted several experiments on different datasets to evaluate the quality of algorithm StreamKM++.¹ A description of the datasets can be found in the next section. The computation on the biggest dataset, which is denoted by BigCross, was performed on a DELL Optiplex 620 machine with 3 GHz Pentium D CPU and 2 GB main memory, using Linux 2.6.9 kernel. For all remaining datasets, the computation was performed on a DELL Optiplex 620 machine with 3 GHz Pentium D CPU and 4 GB main memory, using Linux 2.6.18 kernel.

¹ The source code, the documentation, and the datasets of our experiments can be found at http://www.cs.upb.de/en/fachgebiete/ag-bloemer/research/clustering/streamkmpp/

We compared algorithm StreamKM++ with two frequently used clustering algorithms for processing data streams, namely with algorithm BIRCH [111] and with algorithm StreamLS [96, 52].
On the smaller datasets, we also compared our algorithm with a vanilla implementation of Lloyd's algorithm [80], using initial seeds chosen either uniformly at random (algorithm k-Means) or according to the non-uniform seeding from Arthur and Vassilvitskii [9] (algorithm k-Means++). All algorithms were compiled using g++ from the GNU Compiler Collection at optimization level 2. The quality measure for all experiments was the sum of squared distances, referred to as the cost of the clustering.

6.5.1 Datasets

Since synthetic datasets are typically easy to cluster, we focused our experiments on real-world datasets to obtain practically relevant results. Our main source was the UCI Machine Learning Repository [13]. In the following, we give a brief description of all the datasets used in our empirical evaluation.

Spambase (footnote 2) is a dataset that contains data about spam e-mails and non-spam e-mails, including work and personal e-mails. Each data entry is a vector consisting of frequencies of certain words or characters occurring in the message and a class attribute that denotes whether the corresponding e-mail was considered as spam or not. After removing the classification attribute, 4 601 data points in 57 dimensions remained.

Intrusion (footnotes 2 and 3) comprises data about TCP transmissions in a simulated environment. This simulation included different types of network attacks and intrusion attempts as well as normal network traffic. We used a 10% subset of the whole unlabeled dataset (footnote 4) and excluded all symbolic features. Eventually, 311 078 data points in 34 dimensions remained.

Covertype (footnotes 2 and 5) contains cartographic data about some wilderness areas inside the Roosevelt National Forest of northern Colorado. The main purpose of analyzing this dataset is to predict the forest cover type of specific regions from cartographic variables, which is a classification task. After removing the classification attribute, 581 012 data points in 54 dimensions remained.
The Tower (footnote 6) dataset consists of the RGB values of a 2 560 by 1 920 pixel image file. All 4 915 200 pixels are mapped into a 3-dimensional space of integer values between 0 and 255, representing the colors used in the image. Note that clustering techniques are frequently used for lossy image compression: Individual colors can be substituted with their corresponding cluster center.

The Census 1990 (footnote 2) dataset consists of a one percent sample of the Public Use Microdata Samples (PUMS) person records, sampled from the full 1990 census set contributed by the U.S. Department of Commerce Census Bureau. Most of the data is citizen-related information, such as personal income or age. The dataset has 2 458 285 data points in 68 dimensions. To our knowledge, it is one of the largest naturally structured and freely accessible datasets available.

                data points   dimension   type
Spambase              4 601          57   float
Intrusion           311 078          34   int, float
Covertype           581 012          54   int
Tower             4 915 200           3   int
Census 1990       2 458 285          68   int
BigCross         11 620 300          57   int
Normdata            100 000          15   float

Table 6.1: Overview of the datasets

Footnote 2: The dataset was contributed by the UCI Machine Learning Repository [13].
Footnote 3: The Intrusion dataset is part of the kddcup99 dataset.
Footnote 4: Available for free download at http://kdd.ics.uci.edu/databases/kddcup99/kddcup.newtestdata_10_percent_unlabeled.gz
Footnote 5: Copyright by Jock A. Blackard, Colorado State University
Footnote 6: The Tower dataset was contributed by Gereon Frahling and is available for free download at: http://homepages.uni-paderborn.de/frahling/coremeans.html

To run our algorithm on really huge datasets, we created the Cartesian product of the Tower and Covertype datasets. In this way, we got a naturally structured dataset that is large enough to test our algorithm's ability to handle huge amounts of data. We used a 1.5 GB sized subset of the Cartesian product consisting of 11 620 300 data points with 57 attributes, which we refer to as the BigCross dataset.
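The Cartesian product construction can be illustrated with a small sketch. It only shows how each combined point gets 3 + 54 = 57 attributes; the function name `cross_dataset` and its `limit` parameter are hypothetical, and taking the first `limit` rows is an assumption, since the text does not say how the 11 620 300 point subset was selected.

```python
from itertools import islice, product

def cross_dataset(left, right, limit=None):
    """Concatenate every row of `left` with every row of `right`.

    `limit` (hypothetical parameter) keeps only the first `limit` combined
    rows; with limit=None the full Cartesian product is returned.
    """
    combined = (tuple(a) + tuple(b) for a, b in product(left, right))
    return list(islice(combined, limit))

# Toy stand-ins: 3 colour values per row and 54 cartographic values per row.
tower_like = [(r, r, r) for r in range(4)]
covertype_like = [tuple(range(54)) for _ in range(5)]

full = cross_dataset(tower_like, covertype_like)
subset = cross_dataset(tower_like, covertype_like, limit=10)
```

Building the product lazily and cutting it off with `islice` avoids materializing all pairs when only a subset is needed.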
To evaluate the impact of the number of well separated clusters of a dataset, we also considered a number of synthetic datasets, to which we collectively refer as the Normdata datasets. To generate these datasets, we used essentially the same construction that has already been used in [9] to evaluate the k-Means++ algorithm. More precisely, for different values of $k$, we chose $k$ 'true' centers uniformly at random from a 15-dimensional hypercube of side length 100. We then randomly chose points from a uniform mixture of 15-dimensional normal distributions of variance 1 around these center points. In this way, we obtained $k$ well separated clusters. Each Normdata dataset consists of 100 000 points.

The size and the dimensionality of the datasets are summarized in Table 6.1.

6.5.2 Parameters of the Algorithms

For algorithm BIRCH, we set all parameters of the experimental environment, except for the memory settings, as recommended by the authors of BIRCH. Like Guha et al. [52], we observed that the CF-Tree had fewer leaves than it was allowed to use. The CF-Tree is the data structure that is used to compute the pre-clustering into the so-called clustering features (see also Section 6.1). The more leaves it has, the finer the pre-clustering is. Therefore, from time to time, BIRCH did not produce the correct number of centers, especially when the number of clusters $k$ was high. For this reason, the memory settings had to be manually adjusted for each individual dataset. The complete list of parameters is given in Tables A.1 and A.2 in Appendix A.1.

For algorithm StreamKM++, we experimentally determined an appropriate coreset size $m$ as a function of $k$. For obvious reasons, we need to choose $m \geq k$. To estimate an $m$ that is sufficient to obtain good approximation results, we ran several experiments for different values of $k$ and $m$ on the datasets Covertype and Tower.
Due to the randomized nature of StreamKM++ (see footnote 7), we conducted ten runs for each combination of $k$ and $m$. Figure 6.2 shows the average running times and cost of the clusterings. Concerning the cost, we observed that, for coreset sizes that are only marginally larger than $k$, the quality of a clustering can be improved considerably by increasing the coreset size. In contrast to that, for coreset sizes of, say, $m = 100k$ or more, the quality improves only slightly with increasing coreset size. For instance, the cost of a 50-clustering of either dataset computed on 20 000 coreset points is only marginally smaller than the cost of a 50-clustering computed on 10 000 coreset points. However, with respect to the running time, we observed that the running time grows roughly linearly with the coreset size.

Overall, we conclude the following. On the one hand, $m$ should not be chosen too small (e.g., a very small multiple of $k$) because, for these values of $m$, the quality of a clustering can be easily improved without sacrificing too much running time. On the other hand, $m$ should not be chosen too large (e.g., a large multiple of $k$) because the increase in quality is only very small compared to clusterings for smaller coresets, but the running time is significantly higher. Therefore, we assume that our choice of $m = 200k$ provides a good trade-off for arbitrary datasets. However, smaller sizes such as $m = 20k$ or $m = 50k$ might still be sufficient to obtain very good clustering results on datasets with $k$ well separated clusters.

For algorithm StreamLS, the size of the data chunks is set equal to the coreset size $m$ of algorithm StreamKM++. This is done to enable a fair comparison of both algorithms by allowing the same memory usage. We have to point out that, due to its nature, algorithm StreamLS does not always compute the prespecified number of cluster centers.
In such a case, the difference varies from dataset to dataset and usually lies within a 20% margin of the prespecified number.

Footnote 7: We used the Mersenne Twister PRNG [85].

6.5.3 Comparison of the Algorithms

Comparison with BIRCH and StreamLS

To compare StreamKM++ with BIRCH and StreamLS, we conducted several experiments for different values of $k$ on the four larger real-world datasets, i.e., the datasets Covertype, Tower, Census 1990, and BigCross. In each of these experiments, we set $m = 200k$. For the randomized algorithms StreamKM++ and StreamLS, ten experiments were conducted for each fixed $k$. For BIRCH, a single run was used since it is a deterministic algorithm. The average running times and cost of the clusterings are summarized in Figures 6.3 and 6.4. The interested reader can find the concrete values of all experiments in Appendices A.2 and A.3.

Figure 6.2: Experimental results for different coreset sizes

In our experiments, algorithm BIRCH had the best running time of all algorithms. However, this comes at the expense of a high k-means clustering cost. In terms of the sum of squared distances, algorithms StreamKM++ and StreamLS outperform BIRCH by a factor of up to 2. Furthermore, as already mentioned, one drawback of algorithm BIRCH is the need to adjust parameters manually to obtain a clustering with the desired number of centers.

By comparing StreamKM++ and StreamLS, we observed that the quality of the clusterings was on a par. More precisely, the costs of both algorithms lie within a ±5% margin of each other. In contrast to algorithm StreamLS, the number of centers computed by our algorithm always equals its prespecified value. Hence, the cost of clusterings computed by algorithm StreamKM++ tends to be more stable than the cost of clusterings computed by algorithm StreamLS. The standard deviations of the running times and clustering cost for $k = 20$ are given in Tables 6.2 and 6.3.
A complete overview for all experiments can be found in Appendix A.4.

In terms of running time, it turns out that our algorithm scales much better with an increasing number of centers than algorithm StreamLS does. While for about $k \leq 10$ centers StreamLS is sometimes faster than our algorithm, for a larger number of centers, our algorithm easily outperforms StreamLS. For instance, on the dataset Tower, StreamKM++ computes a clustering with $k = 100$ centers in about 3% of the running time of StreamLS.

To investigate the impact of the number of clusters on the running time further, we conducted experiments on the synthetic datasets Normdata for different values of $k$ and $m$. As described before, for both StreamKM++ and StreamLS, ten experiments were conducted for each combination of $k$ and $m$. The average running times of the clusterings are shown in Figure 6.5. Note that we omitted a figure presenting the average cost of the clusterings because both StreamKM++ and StreamLS always found an optimal or near optimal clustering. The interested reader can find the average values as well as the standard deviations for both running times and cost of the clusterings in the appendix.

Figure 6.3: Experimental results for the datasets Census 1990 and BigCross

k = 20           running time (in sec)
                 StreamKM++   StreamLS   k-Means++
Spambase               1.09          -        3.88
Intrusion              3.22          -       98.11
Covertype              6.93      18.18     1249.18
Tower                  0.58      14.11     1594.76
Census 1990            5.16      54.30           -
BigCross              11.49     162.44           -

Table 6.2: Standard deviation of the running time for k = 20

Figure 6.5 reveals the difference between the running times of StreamKM++ and StreamLS. The ratio between the running time needed by StreamKM++ and the running time needed by StreamLS decreases with an increasing number of clusters.
For $m = 500$, StreamKM++ computed the clusterings for $k = 100$ in about 8% of the running time of StreamLS and, for $k = 200$, it finished the clustering in about 2% of the running time of StreamLS. For $m = 1\,000$, StreamKM++ computed the clusterings for $k = 100$ in about 38% of the running time of StreamLS, whereas, for $k = 200$, it needed about 3% of the running time of StreamLS.

Figure 6.4: Experimental results for the datasets Covertype and Tower

k = 20           cost
                 StreamKM++       StreamLS         k-Means++
Spambase         6.49 · 10^5      -                1.73 · 10^6
Intrusion        8.54 · 10^10     -                3.70 · 10^11
Covertype        1.08 · 10^9      1.03 · 10^10     9.17 · 10^8
Tower            7.31 · 10^6      2.71 · 10^7      4.39 · 10^7
Census 1990      3.66 · 10^6      3.14 · 10^6      -
BigCross         2.46 · 10^10     3.36 · 10^11     -

Table 6.3: Standard deviation of the cost for k = 20

Overall, we conclude that, if the first priority is the quality of the clustering, then our algorithm provides a good alternative to BIRCH and StreamLS. This applies particularly if the number of cluster centers is large.

Figure 6.5: Experimental results for the Normdata datasets

Figure 6.6: Experimental results for the datasets Spambase and Intrusion

Comparison with k-Means and k-Means++

We also compared the quality of StreamKM++ with the quality of classical non-streaming k-means algorithms. Because of their popularity, we have chosen k-Means and k-Means++ as competitors. These algorithms are designed to work in a classical non-streaming setting and, due to their need for random access to the data, are not suited for larger datasets. For this reason, we have run k-Means only on the two smallest datasets Spambase and Intrusion, while k-Means++ has been evaluated only on the four smaller datasets Covertype, Tower, Spambase, and Intrusion. For each fixed $k$, we conducted ten experiments. The results of these experiments are summarized in Figures 6.6 and 6.4.
Please note that the results for the dataset Intrusion are on a logarithmic scale. The concrete values of all experiments can be found in Appendices A.2 and A.3. The standard deviations of the running times and clustering cost are given in Appendix A.4.

As expected, k-Means++ is clearly superior to the classical algorithm k-Means both in terms of quality and running time. Comparing k-Means++ with our streaming algorithm, we find that on all datasets the quality of the clusterings computed by algorithm StreamKM++ is on a par with or even better than the clusterings obtained by algorithm k-Means++. We conjecture that this is due to the fact that, in the last step of our algorithm, we run the k-Means++ algorithm five times on the coreset and choose the best clustering result obtained this way. In contrast, for the experiments with the k-Means++ algorithm, we ran the k-Means++ algorithm only once in each repetition of the experiment.

However, the running time of k-Means++ is only comparable with algorithm StreamKM++ for the smallest dataset Spambase. Even for moderately large datasets, like the Covertype dataset, we observe that algorithm StreamKM++ is orders of magnitude faster than k-Means++. We conclude that algorithm k-Means++ should only be used if the size of the dataset is not too large. For larger datasets, algorithm StreamKM++ computes comparable clusterings in a significantly improved running time.

7 Well-Separated Pair Decomposition with Slack

In this chapter, we study the construction of well-separated pair decompositions (WSPDs) for point sets. Intuitively, two point sets are called a well-separated pair if the shortest distance from any point in one set to any point in the other set is large compared to the diameter of both sets.
A well-separated pair decomposition of a point set consists of a collection of well-separated pairs that covers all the pairs of distinct points, i.e., any two distinct points belong to the two different subsets of some pair. In this way, a WSPD of size $t$ allows all pairwise distances to be compactly summarized by $t$ distances.

Now, let us assume that we are given a huge point set $P$. Due to the size of $P$, it could be useful to have a compact representation that fairly captures the pairwise distances of $P$ and uses space sublinear in $|P|$. If the structure of $P$ is very simple (e.g., $P$ is a multiset with many duplicates), it might be possible to construct a WSPD whose representation has sublinear size. However, if the structure of $P$ is more complex, one cannot find a sublinear space representation of a WSPD such that all pairwise distances of $P$ are well preserved. To be able to deal with this problem, we introduce the notion of a WSPD with slack. A WSPD with slack $\sigma$ for $P$ guarantees that at least a $(1 - \sigma)$-fraction of all the pairwise distances of $P$ are well preserved.

After giving a formal definition of a WSPD with slack, we present an efficient construction of a WSPD with low slack for low-dimensional Euclidean point sets in Section 7.2. Our construction is similar to the one we used in Chapter 5. We build a quadtree partition for the input points in which we recursively split every cell that contains more than a certain threshold of points. Based on this partition, we obtain a representation whose space requirement is polylogarithmic in both the size and the spread of the point set. In Section 7.3, we show how to transfer our construction for low-dimensional Euclidean point sets to point sets with bounded doubling dimension.

Based on the techniques developed in this chapter, we will design streaming algorithms to compute low-distortion embeddings with low slack in Chapter 8.
7.1 Preliminaries

At first, we briefly recapitulate the classic notion of a well-separated pair decomposition (WSPD). A more detailed description can be found in [22, 103]. Afterwards, we relax the classic notion and give a formal definition of a WSPD with slack.

Let $M = (X, D)$ be any metric space, where $X$ is a set of $n$ points and $D$ is a distance function defined on $X$ (see Section 2.1 for a definition of metric spaces). Throughout this chapter, we assume that the minimum pairwise distance between two points in $X$ is at least 1, and the maximum pairwise distance is at most $\Delta$. For any constant parameter $\varepsilon$ with $0 < \varepsilon < 1$, two non-empty subsets $X_1, X_2 \subseteq X$ are called $\varepsilon$-well-separated if we have

$$\max\{\operatorname{diam}(X_1), \operatorname{diam}(X_2)\} \leq \varepsilon \cdot D(X_1, X_2),$$

where $\operatorname{diam}(X_1)$ and $\operatorname{diam}(X_2)$ are the diameters of $X_1$ and $X_2$, respectively, and $D(X_1, X_2)$ is the shortest distance from any point in $X_1$ to any point in $X_2$. The value $\varepsilon$ is often called the separation parameter. Based on this, an $\varepsilon$-WSPD for $M$ is defined as follows.

Definition 7.1.1 (WSPD). Let $\varepsilon$, $0 < \varepsilon < 1$, be a separation parameter. Let $M = (X, D)$ be any $n$-point metric space, and let $\mathcal{P}$ be a collection of $\varepsilon$-well-separated pairs of subsets $\{(A_0, B_0), \ldots, (A_{t-1}, B_{t-1})\}$, where $A_i, B_i \subseteq X$ for $i \in [t]$. $\mathcal{P}$ is called an $\varepsilon$-WSPD for $M$ if every pair of points $(a, b) \in X \times X$, $a \neq b$, lies in $A_i \times B_i$ or $B_i \times A_i$ for exactly one index $i \in [t]$.

The usefulness of an $\varepsilon$-WSPD for $M$ is that, for any $i \in [t]$, the distances between pairs of points from $A_i \times B_i$ are all identical to within a factor of $1 + 2\varepsilon$. Thus, if we store instead of each pair $(A_i, B_i)$ a pair of representative points $(R(A_i), R(B_i))$ with $R(A_i) \in A_i$ and $R(B_i) \in B_i$, then $D(R(A_i), R(B_i))$ is a $(1 \pm 2\varepsilon)$-approximation of all the distances between pairs of points from $A_i \times B_i$. We also say that the distances between pairs of points from $A_i \times B_i$ are $(1 \pm 2\varepsilon)$-preserved. Hence, an $\varepsilon$-WSPD for $M$ has the property that all pairwise distances of $M$ are $(1 \pm 2\varepsilon)$-preserved. Typically, one assumes that the size of an $\varepsilon$-WSPD for $M$ is linear in $n$.
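The separation condition and the resulting $(1 \pm 2\varepsilon)$ guarantee can be checked on a toy example. The following sketch assumes Euclidean points given as tuples; the helper names `diameter`, `set_distance`, and `is_well_separated` are illustrative and not taken from the text.

```python
import itertools
import math

def diameter(points):
    """Largest pairwise distance within a point set (0 for a single point)."""
    return max((math.dist(p, q) for p, q in itertools.combinations(points, 2)),
               default=0.0)

def set_distance(a_points, b_points):
    """Shortest distance from any point of one set to any point of the other."""
    return min(math.dist(a, b) for a in a_points for b in b_points)

def is_well_separated(a_points, b_points, eps):
    """Check max{diam(A), diam(B)} <= eps * D(A, B)."""
    return (max(diameter(a_points), diameter(b_points))
            <= eps * set_distance(a_points, b_points))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(10.0, 0.0), (10.0, 1.0)]
eps = 0.2

# Distance between two arbitrary representatives R(A) and R(B).
rep_distance = math.dist(A[0], B[0])
all_distances = [math.dist(a, b) for a in A for b in B]
preserved = all((1 - 2 * eps) * rep_distance <= dist <= (1 + 2 * eps) * rep_distance
                for dist in all_distances)
```

Here both diameters are 1 and the set distance is 9, so the pair is 0.2-well-separated but not 0.1-well-separated, and every distance between the two sets lies within a factor of $1 \pm 2\varepsilon$ of the representative distance.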
Since we restrict the space requirement of the representation of $M$ to be sublinear in $n$, and since an $\varepsilon$-WSPD of sublinear size does not exist for every separation parameter $\varepsilon$ and every metric $M$ (e.g., not for the uniform $n$-point metric), we introduce the notion of a WSPD with slack.

Definition 7.1.2 (WSPD with Slack). Let $\varepsilon$, $0 < \varepsilon < 1$, be a separation parameter and $\sigma$, $0 < \sigma < 1$, be a slack parameter. Let $M = (X, D)$ be any $n$-point metric space, and let $\mathcal{P}$ be a collection of pairs of subsets $\{(A_0, B_0), \ldots, (A_{t-1}, B_{t-1})\}$, where $A_i, B_i \subseteq X$ for $i \in [t]$. Let $I_\varepsilon$ be the subset of indices such that, for all $j \in I_\varepsilon$, $(A_j, B_j)$ is $\varepsilon$-well-separated. $\mathcal{P}$ is called an $\varepsilon$-WSPD with slack $\sigma$ for $M$ if every pair of points $(a, b) \in X \times X$, $a \neq b$, lies in $A_i \times B_i$ or $B_i \times A_i$ for at most one index $i \in [t]$ and

$$\sum_{j \in I_\varepsilon} |A_j| \cdot |B_j| \geq (1 - \sigma) \cdot n^2.$$

Despite the fact that the distance function $D$ is symmetric, the slack $\sigma$ of a WSPD is measured as the fraction of all ordered pairs $(a, b) \in X \times X$ that do not satisfy the condition given in Definition 7.1.1. The assumption of having $n^2$ (instead of $\binom{n}{2}$) pairwise distances simplifies descriptions and makes our proofs cleaner, without changing the results in any significant way.

7.2 Construction for Euclidean Metric Spaces

This section deals with the construction of a WSPD with slack for Euclidean metrics. Let $M = (P, D)$ be an $n$-point Euclidean space with constant dimension $d$, let $\varepsilon$, $0 < \varepsilon < 1$, be a separation parameter, and let $\sigma$, $2^{2d}/n < \sigma < 1$, be a slack parameter. In order to construct an $\varepsilon$-WSPD with slack $\sigma$ for $M$, we impose $\lceil \log \Delta \rceil + 1$ nested square grids over $P$ denoted by $G^{(0)}, G^{(1)}, \ldots, G^{(\lceil \log \Delta \rceil)}$. The side length of each cell in grid $G^{(i)}$ is $2^i$. We say that the grid cells in $G^{(i)}$ are in level $i$.

Our algorithm consists of three phases. In the first phase, we compute a partition of the space based on the heavy cells in the grids (see Definition 7.2.1).
Then follows a refinement phase, where each cell of the space partition is further subdivided into smaller cells, which we call cubelets. In the last phase, we determine a so-called representative for each cubelet and compute an $\varepsilon$-WSPD with slack $\sigma$ from the set of representatives.

Definition 7.2.1 (Heavy Cell). We call a grid cell heavy if it contains at least $h(\sigma) \cdot n$ points of $P$, where $h(\sigma) := \sigma/2^d$ is a function dependent on $\sigma$. A grid cell that is not heavy is called light.

Now, we describe the three phases in detail (see Algorithm 7.2.1 for a description in pseudocode). In the first phase, we build a partition of the point space based on a quadtree. To recapitulate, a quadtree for a $d$-dimensional point set is a rooted tree in which every internal node has $2^d$ children. Furthermore, every node corresponds to a grid cell and, for any internal node $v$, the cells of its children form a partition of the cell corresponding to $v$. Thus, the cells of the leaf nodes form a partition of the cell of the root node. We call this partition a quadtree partition.

Our quadtree partition for $P$ is now constructed as follows. We start with the coarsest grid $G^{(\lceil \log \Delta \rceil)}$ and identify all heavy cells in this grid, i.e., cells containing at least $h(\sigma) \cdot n$ points. Then, we subdivide every heavy cell $C$ into $2^d$ equal sized subcells. These subcells are contained in grid $G^{(\lceil \log \Delta \rceil - 1)}$. We call $C$ the parent cell of these subcells. If none of the subcells is heavy, we stop our process. Otherwise, the algorithm recursively subdivides every heavy cell such that, at the end of the first phase, we have only light cells in our space partition. Note that all the cells in grid $G^{(0)}$ are light since such a cell can contain at most $2^d$ points and $\sigma > 2^{2d}/n$ implies $h(\sigma) \cdot n > 2^d$. Figure 7.1 illustrates the quadtree partitioning with the help of an example.

The refinement phase consists of three steps.
The first refinement is that we build a so-called balanced or restricted quadtree partition of the quadtree partition obtained so far, i.e., the side length of each cell is allowed to differ from the side lengths of all neighboring cells by a factor of at most 2 [33, 107]. That means that we further subdivide every leaf cell $C$ of the quadtree which has a neighboring cell whose side length is less than half of the side length of $C$. We say that two cells are neighbors if they share some part of the boundary. In Figure 7.2, the first refinement step is illustrated by using the example from Figure 7.1. In the second refinement step, every leaf cell of the balanced quadtree is subdivided into $\ell_1^d$ equal sized cubes, where $\ell_1 := \lceil 6\sqrt{d} \rceil$. Finally, we subdivide every cube into $\ell_2(\varepsilon)^d$ equal sized cubelets, where $\ell_2(\varepsilon) := \lceil 2\sqrt{d}/\varepsilon \rceil$ is a function dependent on $\varepsilon$. Note that we could have merged the second and third refinement step into one step, but the definition of cubes makes the analysis easier.

Figure 7.1: Example illustrating the quadtree partition for a point set in the plane. A cell is heavy if it contains at least 5 points. (a)-(d) The quadtree partition for subsequent depths of the recursion, i.e., after having subdivided each heavy cell in grid $G^{(\lceil \log \Delta \rceil)}$, $G^{(\lceil \log \Delta \rceil - 1)}$, $G^{(\lceil \log \Delta \rceil - 2)}$, and $G^{(\lceil \log \Delta \rceil - 3)}$, respectively. Since no cell in partition (d) is heavy, the recursion stops here.

Figure 7.2: Example illustrating the refinement of a quadtree partition to get a balanced quadtree partition. Cells created during the balancing process are indicated by dashed lines.

It remains to determine the representatives. For each non-empty cubelet $C$, we replace all the points inside of $C$ by one representative. This representative is set to the location of an arbitrary point it represents (i.e., a point from $P \cap C$) and weighted by the total number of replaced points.
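The first phase and the choice of representatives can be sketched as follows. This is a simplified illustration under stated assumptions, not the construction analyzed below: the balancing step is omitted, empty subcells are not stored, the two subdivisions into cubes and cubelets are merged into one grid with $\ell_1 \cdot \ell_2(\varepsilon)$ cells per axis, and the function names `light_leaf_cells` and `cubelet_representatives` are hypothetical.

```python
import math
from collections import defaultdict

def light_leaf_cells(points, side, threshold):
    """Split every heavy cell (>= threshold points) until only light cells
    or cells of side length 1 remain. Returns (origin, side, points) triples
    for the non-empty leaf cells."""
    d = len(points[0])
    leaves, stack = [], [((0.0,) * d, float(side), list(points))]
    while stack:
        origin, s, pts = stack.pop()
        if len(pts) >= threshold and s > 1:  # heavy cell: split into subcells
            half = s / 2
            children = defaultdict(list)
            for p in pts:
                key = tuple(min(int((p[k] - origin[k]) // half), 1) for k in range(d))
                children[key].append(p)
            for key, sub in children.items():
                child_origin = tuple(origin[k] + key[k] * half for k in range(d))
                stack.append((child_origin, half, sub))
        else:
            leaves.append((origin, s, pts))
    return leaves

def cubelet_representatives(leaves, eps):
    """One weighted representative per non-empty cubelet of each leaf cell."""
    reps = []
    for origin, s, pts in leaves:
        d = len(origin)
        per_axis = math.ceil(6 * math.sqrt(d)) * math.ceil(2 * math.sqrt(d) / eps)
        step = s / per_axis  # side length of a cubelet in this leaf cell
        cubelets = defaultdict(list)
        for p in pts:
            key = tuple(min(int((p[k] - origin[k]) // step), per_axis - 1)
                        for k in range(d))
            cubelets[key].append(p)
        # Arbitrary point of the cubelet, weighted by the number of points.
        reps.extend((group[0], len(group)) for group in cubelets.values())
    return reps

# 64 points on a unit grid plus two points in a sparse region of the domain.
points = [(x + 0.5, y + 0.5) for x in range(8) for y in range(8)]
points += [(64.1, 64.1), (65.1, 64.1)]
sigma, d = 0.5, 2
threshold = sigma / 2 ** d * len(points)  # h(sigma) * n
leaves = light_leaf_cells(points, side=128, threshold=threshold)
reps = cubelet_representatives(leaves, eps=0.5)
```

In this toy input, the dense region is split down to cells of side length 2, while the two sparse points stay in one large light cell and end up in the same cubelet, so they are replaced by a single representative of weight 2.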
Finally, the collection of all representative pairs is our $\varepsilon$-WSPD with slack $\sigma$ for $M$. Note that we implicitly store this information by storing the set of all representatives.

7.2.1 Analysis of the Construction

First, we prove that our construction yields an $\varepsilon$-WSPD with slack $\sigma$ for $M$. Then, we analyze its complexity. For that purpose, for any level $i \in [\lceil \log \Delta \rceil + 1]$, let $L(i)$ be the set of all the leaf cells of the unbalanced quadtree whose side length is $2^i$, i.e., leaf cells in level $i$. Analogously, let $L^+(i)$ be the set of all the leaf cells of the balanced quadtree whose side length is $2^i$. We define $L^*(i)$ to be the set of all the cubes contained in a cell in $L^+(i)$. Furthermore, we denote by $H(i)$ the set of heavy cells in level $i$ that do not have a heavy subcell. Note that the parent cell of any cell in $L(i)$ is in $H(i+1)$.

Algorithm 7.2.1 ConstructWSPD(P, ε, σ)
  initialize space partition with the cells in grid $G^{(\lceil \log \Delta \rceil)}$
  initialize queue Q with the cells in grid $G^{(\lceil \log \Delta \rceil)}$
  QuadtreePartition(P, σ, Q)
  Q ← insert all cells of space partition
  BalancedQuadtreePartition(P, σ, Q)
  for each cell C in space partition do
    split C into $\ell_1^d$ cubes
  for each cube C in space partition do
    split C into $\ell_2(\varepsilon)^d$ cubelets
  initialize empty set R of representatives
  for each non-empty cubelet C in space partition do
    q ← arbitrary point from P ∩ C weighted by number of points in C
    R ← R ∪ {q}
  return R

Algorithm 7.2.2 QuadtreePartition(P, σ, Q)
  while Q is not empty do
    C ← remove first cell from Q
    if C is heavy then
      split C into $2^d$ subcells
      Q ← insert all subcells of C

Algorithm 7.2.3 BalancedQuadtreePartition(P, σ, Q)
  while Q is not empty do
    C ← remove first cell from Q
    if C violates the balancing condition then
      split C into $2^d$ subcells
      Q ← insert all subcells of C
      Q ← insert all neighbors of C that now violate the balancing condition

Separation and Slack

In the second phase of
the algorithm, every cube in $\bigcup_{i=0}^{\lceil \log \Delta \rceil} L^*(i)$ is divided into $\ell_2(\varepsilon)^d$ equal sized cubelets. The next lemma shows that the choice of the function $\ell_2(\varepsilon)$ guarantees that any two cubelets of different non-neighboring cubes are $\varepsilon$-well-separated.

Lemma 7.2.2. Let $\ell_2(\varepsilon) := \lceil 2\sqrt{d}/\varepsilon \rceil$. If each cube in $\bigcup_{i=0}^{\lceil \log \Delta \rceil} L^*(i)$ is divided into $\ell_2(\varepsilon)^d$ equal sized cubelets, then any two cubelets which are not contained in the same cube or in neighboring cubes are $\varepsilon$-well-separated.

Proof. Let $C_1$ and $C_2$ be any two cubelets which are not contained in the same cube or in neighboring cubes. Furthermore, let $C_1$ be in any level $i \in [\lceil \log \Delta \rceil + 1]$ and $C_2$ be in a level $j$. Without loss of generality, we assume that $j \in \{i, \ldots, \lceil \log \Delta \rceil\}$. We consider the two cases $j \in \{i, i+1\}$ and $j \in \{i+2, \ldots, \lceil \log \Delta \rceil\}$.

We start with the case $j \in \{i, i+1\}$. The side length of a cube in level $i$ is $2^i/\ell_1$, where $\ell_1 = \lceil 6\sqrt{d} \rceil$. Since the side lengths of neighboring cells in $\bigcup_{k=0}^{\lceil \log \Delta \rceil} L^+(k)$ differ by a factor of at most 2 and each cell in $\bigcup_{k=0}^{\lceil \log \Delta \rceil} L^+(k)$ is divided into equal sized cubes, the distance between the cube containing $C_1$ and the cube containing $C_2$ is at least $2^i/\ell_1$. Since the diagonal of the bigger cubelet $C_2$ is

$$\operatorname{diag}(C_2) = \frac{\sqrt{d} \cdot 2^j}{\ell_1 \cdot \ell_2(\varepsilon)} \leq \varepsilon \cdot \frac{2^i}{\ell_1},$$

we get that $C_1$ and $C_2$ are $\varepsilon$-well-separated (see Section 7.1 for a definition of an $\varepsilon$-well-separated pair).

Now, we consider the case $j \in \{i+2, \ldots, \lceil \log \Delta \rceil\}$. Due to the balanced quadtree partitioning, the side lengths of neighboring cells in $\bigcup_{k=0}^{\lceil \log \Delta \rceil} L^+(k)$ differ by a factor of at most 2. Hence, the distance between any cell in $L^+(i)$ and any cell in $L^+(j)$ is at least $\sum_{k=i+1}^{j-1} 2^k = 2^j - 2^{i+1} \geq 2^{j-1}$. Since $C_1$ is contained in a cell in $L^+(i)$ and $C_2$ is contained in a cell in $L^+(j)$, the distance between $C_1$ and $C_2$ is at least $2^{j-1}$.
Since the diagonal of the bigger cubelet $C_2$ is

$$\operatorname{diag}(C_2) = \frac{\sqrt{d} \cdot 2^j}{\ell_1 \cdot \ell_2(\varepsilon)} < \varepsilon \cdot 2^{j-1},$$

the two cubelets $C_1$ and $C_2$ are $\varepsilon$-well-separated.

Due to Lemma 7.2.2, to bound the slack of the $\varepsilon$-WSPD for $M$, we have to bound the number of points in each cube in $\bigcup_{i=0}^{\lceil \log \Delta \rceil} L^*(i)$. The following lemma shows that this is guaranteed by our choice of $h(\sigma)$.

Lemma 7.2.3. Let $h(\sigma) := \sigma/2^d$, and let $p_1$ and $p_2$ be any two points in $P$. If the cubelet that contains $p_1$ and the cubelet that contains $p_2$ are not $\varepsilon$-well-separated, then $p_2$ belongs to the $\sigma n$ closest points of $p_1$.

Proof. At first, we bound the maximum distance $D(p_1, p_2)$ between $p_1$ and $p_2$. Then, we bound the total number of points whose distance from $p_1$ is at most $D(p_1, p_2)$. If this number is at most $\sigma n$, then the correctness of the lemma follows.

For any level $i \in [\lceil \log \Delta \rceil + 1]$, let $C_1^* \in L^*(i)$ be the cube that contains $p_1$, and let $C_2^*$ be the cube that contains $p_2$. Due to Lemma 7.2.2, $C_1^*$ and $C_2^*$ must be neighbors. Since we use a balanced quadtree partitioning, the side lengths of $C_1^*$ and $C_2^*$ differ by a factor of at most 2. Since $C_1^* \in L^*(i)$, the side length of $C_1^*$ is $2^i/\ell_1$ with $\ell_1 = \lceil 6\sqrt{d} \rceil$. We have to consider the cases (i) $C_2^* \in L^*(i)$, (ii) $C_2^* \in L^*(i+1)$, and (iii) $C_2^* \in L^*(i-1)$.

Figure 7.3: Sketch of Case (i) in the proof of Lemma 7.2.3.

Case (i) is illustrated in Figure 7.3. In this case, the maximum distance between $p_1$ and $p_2$ is at most

$$D(p_1, p_2) \leq \sqrt{d} \cdot \left( \frac{2^i}{\ell_1} + \frac{2^i}{\ell_1} \right) = \sqrt{d} \cdot \left( \frac{2^{i+1}}{\ell_1} + \frac{2^i}{\sqrt{d}\,\ell_1} \right) - \frac{2^i}{\ell_1} \leq 2^{i-1} - \frac{2^i}{\ell_1}.$$

Since the cube $C_1^*$ is contained in a cell in $L^+(i)$ and we use a balanced quadtree partitioning, the side lengths of all neighboring cells of the cell containing $C_1^*$ are at least $2^{i-1}$. Now, since the side length of $C_1^*$ is $2^i/\ell_1$, the ball with center $p_1$ and radius $D(p_1, p_2) \leq 2^{i-1} - 2^i/\ell_1$ is covered by at most $2^d$ cells in $\bigcup_{k=0}^{\lceil \log \Delta \rceil} L^+(k)$.
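The inequalities used so far depend only on the choices $\ell_1 = \lceil 6\sqrt{d} \rceil$ and $\ell_2(\varepsilon) = \lceil 2\sqrt{d}/\varepsilon \rceil$. As a numeric sanity check (not a replacement for the proofs), the following sketch evaluates the two diagonal bounds from the proof of Lemma 7.2.2 and the bound of Case (i) after dividing out the common powers of two; the function names `constants` and `inequalities_hold` and the tolerance `tol` are illustrative assumptions.

```python
import math

def constants(d, eps):
    """Return (l1, l2) as chosen in the construction."""
    l1 = math.ceil(6 * math.sqrt(d))
    l2 = math.ceil(2 * math.sqrt(d) / eps)
    return l1, l2

def inequalities_hold(d, eps, tol=1e-12):
    """Check the three bounds with the common powers of two divided out.
    `tol` absorbs floating-point rounding in the non-strict comparisons."""
    l1, l2 = constants(d, eps)
    root = math.sqrt(d)
    # Lemma 7.2.2, worst case j = i + 1:
    #   sqrt(d) * 2^(i+1) / (l1 * l2) <= eps * 2^i / l1
    near = 2 * root <= eps * l2 + tol
    # Lemma 7.2.2, case j >= i + 2:
    #   sqrt(d) * 2^j / (l1 * l2) < eps * 2^(j-1)
    far = 2 * root < eps * l1 * l2
    # Lemma 7.2.3, Case (i):
    #   sqrt(d) * 2^(i+1) / l1 <= 2^(i-1) - 2^i / l1, i.e. 4 sqrt(d) + 2 <= l1
    case_i = 4 * root + 2 <= l1 + tol
    return near and far and case_i

results = {(d, eps): inequalities_hold(d, eps)
           for d in range(1, 9)
           for eps in (0.05, 0.1, 0.25, 0.5, 0.9)}
```

The last check makes the missing algebra explicit: Case (i) needs $\ell_1 \geq 4\sqrt{d} + 2$, which follows from $\ell_1 \geq 6\sqrt{d}$ because $2\sqrt{d} \geq 2$ for every $d \geq 1$.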
In Case (ii), the maximum distance between $p_1$ and $p_2$ is at most

$$D(p_1, p_2) \leq \sqrt{d} \cdot \left( \frac{2^i}{\ell_1} + \frac{2^{i+1}}{\ell_1} \right) < 2^i - \frac{2^i}{\ell_1}.$$

Furthermore, since $C_2^*$ is contained in a cell in $L^+(i+1)$, the side lengths of all common neighbors of the cells containing $C_1^*$ and $C_2^*$ are at least $2^i$. Now, since the side length of $C_1^*$ is $2^i/\ell_1$, the ball with center $p_1$ and radius $D(p_1, p_2) \leq 2^i - 2^i/\ell_1$ can be covered by at most $2^d$ cells in $\bigcup_{k=0}^{\lceil \log \Delta \rceil} L^+(k)$. Case (iii) is symmetric to Case (ii).

As a result, in all three cases, we have to count the number of points in $2^d$ cells in $\bigcup_{k=0}^{\lceil \log \Delta \rceil} L^+(k)$. Since the cells in $\bigcup_{k=0}^{\lceil \log \Delta \rceil} L^+(k)$ are light cells, each one of them contains at most $h(\sigma) \cdot |P|$ points. It follows that the number of points whose distance from $p_1$ is at most $D(p_1, p_2)$ is at most $\sigma \cdot |P|$.

Remark 7.2.4. Lemma 7.2.3 does not only imply that the collection of all representative pairs is an $\varepsilon$-WSPD with slack $\sigma$ for $M$. It also tells us where the slack arises. More precisely, for each point in $P$, the distances to its $\sigma n$ closest neighbors in $P$ can be arbitrarily distorted, but the distances to all other points in $P$ are $(1 \pm 2\varepsilon)$-preserved.

Complexity

In order to upper bound the number of representatives, we have to bound the number of cells in the balanced quadtree partition. This is done by first analyzing the dependency of this number on the number of cells in the unbalanced quadtree partition. The proof of the following lemma is basically given in [33]. Only a few adjustments to our scenario have been made. However, for the sake of completeness, we include the full proof here.

Lemma 7.2.5 ([33]). The number of cells in the balanced quadtree partition is

$$\left| \bigcup_{i=0}^{\lceil \log \Delta \rceil} L^+(i) \right| \in O\left( 6^d \cdot \left| \bigcup_{i=0}^{\lceil \log \Delta \rceil} L(i) \right| \right).$$

Proof. The proof follows the one in Chapter 14 from [33].
We call the cells in the unbalanced quadtree old cells and the cells that are in the balanced but not in the unbalanced quadtree new cells. We will show that, for each old cell, there are at most $3^d - 1$ cells in the same level that have to be split. Since each split operation creates $2^d$ new cells, the total number of new cells is at most $6^d$ times the total number of old cells. Since the total number of cells in the unbalanced quadtree is at most $2 \cdot |\bigcup_{i=0}^{\lceil \log(\Delta) \rceil} \mathcal{L}(i)|$, the total number of new cells is at most $2 \cdot 6^d \cdot |\bigcup_{i=0}^{\lceil \log(\Delta) \rceil} \mathcal{L}(i)|$. Thus, the number of cells in $\bigcup_{i=0}^{\lceil \log(\Delta) \rceil} \mathcal{L}^+(i)$ is obviously upper bounded by the number of leaf cells in the unbalanced quadtree plus the total number of new cells, which is at most $|\bigcup_{i=0}^{\lceil \log(\Delta) \rceil} \mathcal{L}(i)| + 2 \cdot 6^d \cdot |\bigcup_{i=0}^{\lceil \log(\Delta) \rceil} \mathcal{L}(i)|$.

Figure 7.4: Illustration of the arrangement of cells in the proof of Lemma 7.2.5. The neighboring cell of $C$ which causes the splitting of $C$ is indicated in gray.

We use a charging argument to prove that, for each old cell, there are at most $3^d - 1$ cells in the same level that have to be split. Let us suppose that we have to split an (old or new) cell $C \in \mathcal{G}(i)$ in any level $i \in [\lceil \log(\Delta) \rceil + 1]$ during the balancing process. Then, we claim that at least one of the $3^d - 1$ neighboring cells of $C$ in level $i$ is old. We charge the splitting of $C$ to this old cell.

Let us assume that our claim is wrong. Let $i \in [\lceil \log(\Delta) \rceil + 1]$ be the smallest level and $C \in \mathcal{G}(i)$ be a cell such that $C$ is split during the balancing process and has no neighboring cell in level $i$ which is old (see Figure 7.4). Since $C$ is split, there is a neighboring cell $C''$ of $C$ with side length at most $2^{i-2}$. Let $C' \in \mathcal{G}(i-1)$ be the cell that contains $C''$. Due to the fact that $C'$ is contained in a new cell, it is new itself. Also, all the neighboring cells of $C'$ in level $i - 1$ are new.
This follows from the fact that all neighboring cells of $C$ in level $i$ are new and $C$ is split during the balancing process. Furthermore, since $C'$ contains the cell $C''$, $C'$ is split during the balancing process. Thus, $C'$ is a cell that is split in the balancing process and there is no neighboring cell of $C'$ in level $i - 1$ which is old. This is a contradiction to the choice of $C$.

Lemma 7.2.6. The number of cells in the balanced quadtree partition is $O(2^{O(d)} \cdot \log(\Delta)/\sigma)$.

Proof. Each ancestor cell of a heavy cell is heavy, and heavy cells are split during the quadtree partitioning. Recall that, for any level $i \in [\lceil \log(\Delta) \rceil + 1]$, $\mathcal{H}(i)$ is the set of all heavy cells in level $i$ that do not have a heavy subcell. Then, for each cell in $\mathcal{H}(i)$, there are $2^d$ cells in $\mathcal{G}(i-1)$ and at most $2^d \cdot (\lceil \log(\Delta) \rceil + 1)$ cells in $\bigcup_{j=i}^{\lceil \log(\Delta) \rceil} \mathcal{G}(j)$ that are cells in our quadtree. Since there are at most $1/h(\sigma)$ heavy cells in $\bigcup_{i=0}^{\lceil \log(\Delta) \rceil} \mathcal{H}(i)$, the total number of cells in the unbalanced quadtree is bounded by $O(1/h(\sigma) \cdot 2^d \cdot \log(\Delta))$. Due to Lemma 7.2.5, the number of cells in the balanced quadtree partition is $O(1/h(\sigma) \cdot 12^d \cdot \log(\Delta))$. Now, the lemma follows from $h(\sigma) = \sigma/2^d$.

Lemma 7.2.7. The space partition consists of $O(2^{O(d)} \cdot d^d \cdot \log(\Delta)/(\varepsilon^d \sigma))$ cubelets.

Proof. Due to Lemma 7.2.6, there are $O(2^{O(d)} \cdot \log(\Delta)/\sigma)$ cells in the balanced quadtree partition. Now, the lemma follows from the fact that each cell in the balanced quadtree partition contains $\lceil 6\sqrt{d} \rceil^d$ cubes and each cube contains $\lceil 2\sqrt{d}/\varepsilon \rceil^d$ cubelets.

Based on the results given above, we can now analyze the complexity of our algorithm.

Lemma 7.2.8. The algorithm has a running time of
$$O\left( n \cdot \left( \frac{d^2}{\varepsilon} + d \log(\Delta) \right) + \frac{2^{O(d)} \cdot d \cdot \log^2(\Delta)}{\sigma} + \frac{2^{O(d)} \cdot d^d \cdot \log(\Delta)}{\varepsilon^d \sigma} \right)$$
and a space requirement of
$$O\left( dn + \frac{2^{O(d)} \cdot d^d \cdot \log(\Delta)}{\varepsilon^d \sigma} \right).$$

Proof. Since each level of a quadtree forms a partition of $P$, the total number of points associated with cells at the same level of the quadtree is at most $n$.
Thus, the arrangement of points in any level of the quadtree can be computed from the preceding level in $O(dn)$ time. Since our quadtree has a height of at most $\lceil \log(\Delta) \rceil + 1$, the total running time to compute the arrangements of the points in all the levels of the quadtree is $O(dn \log(\Delta))$.

During the balancing process, for each cell that has ever been split, we check if this cell has neighbors in the quadtree partition that violate the balancing condition. Let $C$ be such a split cell in any level $i \in [\lceil \log(\Delta) \rceil + 1]$. Since we must only check neighbors of $C$ whose side length is at least twice as big as the side length of $C$, the level of the neighbors that we have to check is at least $i + 1$ and the total number of these neighbors is at most $2^d - 1$ (see Figure 7.5). Let $C'$ be any cell in $\mathcal{G}(i+1)$ that shares some part of its boundary with $C$. If the subcells of $C'$ are not cells of the current quadtree, then the leaf cell of the current quadtree that contains $C'$ is a violating neighbor of $C$. We can find this violating neighbor by searching for the leaf cell in the quadtree that contains the midpoint of $C'$. Since the height of the quadtree is $O(\log(\Delta))$, this cell can be found in $O(d \log(\Delta))$ time. Furthermore, due to Lemma 7.2.6, the number of cells in the balanced quadtree is $O(2^{O(d)} \cdot \log(\Delta)/\sigma)$. Thus, the time for checking neighborhoods during the balancing process is $O(2^{O(d)} \cdot d \cdot \log^2(\Delta)/\sigma)$.

Figure 7.5: The cell $C$ is split during the balancing operation. A neighbor of $C$ in the current quadtree partition that violates the balancing condition has to contain at least one of the cells colored in gray.

Finally, the arrangement of the points in the cubelets can be computed from the balanced quadtree partition in $n \cdot d \cdot (\ell_1 \cdot \ell_2(\varepsilon)) \in O(d^2 n/\varepsilon)$ time. Obviously, each node of the final partition tree has to be created once.
Since it follows from Lemma 7.2.7 that this tree consists of $O(2^{O(d)} \cdot d^d \cdot \log(\Delta)/(\varepsilon^d \sigma))$ nodes, this can be done in $O(2^{O(d)} \cdot d^d \cdot \log(\Delta)/(\varepsilon^d \sigma))$ time. Thus, the total running time is as claimed in the lemma. Furthermore, the space requirement for the $d$-dimensional points in $P$ and the partition tree is upper bounded by $O(dn + 2^{O(d)} \cdot d^d \cdot \log(\Delta)/(\varepsilon^d \sigma))$.

We summarize our results in the following theorem:

Theorem 7 (WSPD with Slack for Euclidean Spaces). Let $P$ be a set of $n$ points with spread $\Delta$ from a low-dimensional Euclidean space $\mathbb{R}^d$, let $\varepsilon$, $0 < \varepsilon < 1$, be a separation parameter, and let $\sigma$, $2^{2d}/n < \sigma < 1$, be a slack parameter. Then, there exists an algorithm that computes a weighted point set $P' \subset P$ with cardinality $O(2^{O(d)} \cdot d^d \cdot \log(\Delta)/(\varepsilon^d \sigma))$ which is an implicit representation of an $\varepsilon$-WSPD with slack $\sigma$ for $P$. The algorithm has a running time of
$$O\left( n \cdot \left( \frac{d^2}{\varepsilon} + d \log(\Delta) \right) + \frac{2^{O(d)} \cdot d \cdot \log^2(\Delta)}{\sigma} + \frac{2^{O(d)} \cdot d^d \cdot \log(\Delta)}{\varepsilon^d \sigma} \right)$$
and a space requirement of
$$O\left( dn + \frac{2^{O(d)} \cdot d^d \cdot \log(\Delta)}{\varepsilon^d \sigma} \right).$$

Proof. We have $P' \subset P$ since the location of each representative is a point from $P$. It follows from Lemmas 7.2.2 and 7.2.3 that $P'$ is an implicit representation of an $\varepsilon$-WSPD with slack $\sigma$ for $P$. The cardinality of $P'$ is implied by Lemma 7.2.7, and the running time and space requirement to compute $P'$ are due to Lemma 7.2.8. In our construction, we made the assumption that $\sigma > 2^{2d}/n$ to ensure that all the cells in grid $\mathcal{G}(0)$ are light.

7.3 Construction for Doubling Metric Spaces

We transfer our approach to construct a WSPD with slack for low-dimensional Euclidean point sets to point sets with bounded doubling dimension. The input of our algorithm is an $n$-point doubling metric space $M = (X, D)$ with bounded dimension $\lambda$, a separation parameter $\varepsilon$, $0 < \varepsilon < 1$, and a slack parameter $\sigma$, $(\lceil \log(\Delta) \rceil + 1) \cdot 2^{6\lambda+3}/n < \sigma < 1$.
Recall from the definition of doubling metric spaces that each ball in $M$ with any radius $r$ centered at any point in $X$ can be covered by $2^\lambda$ balls each of radius $r/2$ and centered at a point in $X$. We assume that the minimum pairwise distance between two points in $X$ is at least 1, and the maximum pairwise distance is at most $\Delta$. Furthermore, we assume access to a distance oracle that, given any two points from $X$, can compute in constant time the distance between these two points.

Our idea is to replace the square grids from Section 7.2 by uniform cut decompositions, which are defined as follows:

Definition 7.3.1 (Uniform Cut Decomposition, [35]). Let $X$ be a non-empty point set with a distance function defined on it. An $r$-cut decomposition of $X$ is a partition of $X$ into balls such that the following conditions are satisfied:
i) Each ball is centered at a point in $X$.
ii) Each ball has radius $r$.
iii) Each point in $X$ is covered by a ball.

Now, we explain our construction in detail (see Algorithm 7.3.1 for a description in pseudocode). For each $i \in [\lceil \log(\Delta) \rceil + 1]$, we compute a $2^i$-cut decomposition of the point set $X$. We say that the balls with radius $2^i$ are in level $i$. We denote the set of balls in level $i$ by $\mathcal{G}(i)$. In case that a point is covered by more than one ball, we assign it to any one of them. We point out that the arrangement of balls in the uniform cut decomposition of any level does not depend on the arrangement of balls in the uniform cut decomposition of any other level. According to this and in contrast to our approach for low-dimensional Euclidean point sets, our algorithm for doubling metric spaces is not recursive. To compute the WSPD for $X$, we identify in each level the heavy balls.

Definition 7.3.2 (Heavy Ball). We call a ball heavy if it contains at least $h(\sigma) \cdot n$ points of $X$, where $h(\sigma) := \sigma/((\lceil \log(\Delta) \rceil + 1) \cdot 2^{5\lambda+3})$ is a function dependent on $\sigma$. A ball that is not heavy is called light.
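To make Definitions 7.3.1 and 7.3.2 concrete, the following minimal Python sketch partitions a point set into balls of radius $r$ and classifies them as heavy or light. It is an illustration under stated assumptions, not the construction analyzed in this chapter: the greedy center selection is a simple stand-in for whichever method is used to compute the cut decomposition, logarithms are taken to base 2, and the names `cut_decomposition`, `heavy_fraction`, and `split_heavy_light` are hypothetical.

```python
import math


def cut_decomposition(points, dist, r):
    """Greedily partition `points` into balls of radius r (assumed stand-in method).

    Every ball is centered at an input point, and every point is assigned
    to exactly one ball, so the three conditions of Definition 7.3.1 hold.
    """
    remaining = list(points)
    balls = []
    while remaining:
        center = remaining[0]  # any still-uncovered point may serve as a center
        ball = [p for p in remaining if dist(center, p) <= r]
        remaining = [p for p in remaining if dist(center, p) > r]
        balls.append((center, ball))
    return balls


def heavy_fraction(sigma, spread, lam):
    """h(sigma) from Definition 7.3.2, assuming log denotes the base-2 logarithm."""
    levels = math.ceil(math.log2(spread)) + 1
    return sigma / (levels * 2 ** (5 * lam + 3))


def split_heavy_light(balls, n, h):
    """A ball is heavy if it holds at least h * n points, and light otherwise."""
    heavy = [b for b in balls if len(b[1]) >= h * n]
    light = [b for b in balls if len(b[1]) < h * n]
    return heavy, light
```

For points on a line with the absolute difference as distance, `cut_decomposition([0, 1, 2, 10, 11, 12, 13], lambda a, b: abs(a - b), 2)` yields three balls centered at 0, 10, and 13. The threshold `h` is passed explicitly to `split_heavy_light` so that small examples can be classified without satisfying the lower bound on $\sigma$ required by the analysis.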
Algorithm 7.3.1 ConstructWSPD($X$, $\varepsilon$, $\sigma$)
    initialize empty set $R$ of representatives
    for $i \leftarrow 0$ to $\lceil \log(\Delta) \rceil$ do
        $\mathcal{G}(i) \leftarrow$ set of balls in $2^i$-cut decomposition of $X$
        for each heavy ball $B$ in $\mathcal{G}(i)$ do
            decompose $B$ into mini balls with radius $2^{i-\ell(\varepsilon)}$
            for each mini ball $B'$ in $B$ do
                $y \leftarrow$ point located at center of $B'$ and with weight $w(y) \leftarrow 0$
                for each point $x \in X \cap B'$ do
                    if $x$ is not marked then
                        mark $x$
                        $w(y) \leftarrow w(y) + 1$
                $R \leftarrow R \cup \{y\}$
    return $R$

In each level $i \in [\lceil \log(\Delta) \rceil + 1]$, we identify the heavy balls, i.e., balls containing at least $h(\sigma) \cdot n$ points from $X$. Each of these heavy balls is decomposed into mini balls of radius $2^{i-\ell(\varepsilon)}$, where $\ell(\varepsilon) := \lceil \log(1/\varepsilon) \rceil$. Note that all the balls in $\mathcal{G}(0)$ are light since such a ball can contain at most $2^\lambda$ points and $\sigma > (\lceil \log(\Delta) \rceil + 1) \cdot 2^{6\lambda+3}/n$ implies $h(\sigma) \cdot n > 2^\lambda$.

Next, we compute a representative for each point in $X$. Let $x \in X$ be any point. Then, the representative of $x$ is the center of the smallest mini ball that contains $x$. Note that each mini ball belongs to a uniform cut decomposition of a heavy ball and each point in $X$ is contained in at least one heavy ball since the ball in level $\lceil \log(\Delta) \rceil$ is heavy and covers all points from $X$. The set of representatives for $X$ is the weighted set of center points of the mini balls where each such point is weighted by the number of points it represents. The collection of all representative pairs is our compact representation for $M$. As mentioned in Section 7.2, we implicitly store this information by storing the set of all representatives.

Note that, for any level $i \in [\lceil \log(\Delta) \rceil + 1]$, a $2^i$-cut decomposition of $X$ can be computed by applying the well-known 2-approximation algorithm for vertex cover. Let $G = (V, E)$ be any simple graph. Then, the vertex-cover algorithm chooses repeatedly any edge $\{x, y\} \in E$, inserts $x$ and $y$ into the currently found vertex cover and removes all edges incident to $x$ or $y$ from $E$.
This is done until the edge set $E$ is empty. Since, in this way, we implicitly find a non-extendable matching of $G$ which is always a vertex cover for $G$ and an optimal cover contains at least one endpoint of each edge in this matching, the algorithm outputs a 2-approximation for vertex cover. Now, let $G$ be the graph that has a vertex for each point in $X$ and an edge $\{x, y\}$ for each unordered pair of points $x, y \in X$ with $D(x, y) \le 2^i$. Then, by applying the 2-approximation algorithm for vertex cover on $G$, we compute a $2^i$-cut decomposition of $X$ whose size is at most twice as big as the size of an optimal $2^i$-cut decomposition of $X$.

7.3.1 Analysis of the Construction

First, we show that our construction computes an $\varepsilon$-WSPD with slack $\sigma$ for $M$. Then, we analyze its complexity. For any level $i \in [\lceil \log(\Delta) \rceil + 1]$, let $\mathcal{H}(i)$ be the set of heavy balls in level $i$. We call any two balls $B(x_1, r_1)$ and $B(x_2, r_2)$ neighboring balls if $B(x_1, 3r_1)$ contains at least one point of $B(x_2, r_2)$ or $B(x_2, 3r_2)$ contains at least one point of $B(x_1, r_1)$.

Separation and Slack

The following lemma proves that the choice of the function $\ell(\varepsilon)$ guarantees that any two mini balls contained in different non-neighboring heavy balls are $\varepsilon$-well-separated.

Lemma 7.3.3. Let $\ell(\varepsilon) := \lceil \log(1/\varepsilon) \rceil$. If, for each level $i \in [\lceil \log(\Delta) \rceil + 1]$, each heavy ball in $\mathcal{H}(i)$ is decomposed into mini balls with radius $2^{i-\ell(\varepsilon)}$, then any two mini balls contained in different non-neighboring heavy balls are $\varepsilon$-well-separated.

Proof. Let $B_1$ be any heavy ball in $\mathcal{H}(i)$ in any level $i \in [\lceil \log(\Delta) \rceil + 1]$, and let $B_2$ be any heavy ball in $\mathcal{H}(j)$ such that $B_1$ and $B_2$ are non-neighboring balls (see Figure 7.6 for an illustration). Without loss of generality, we assume that $j \in \{i, \ldots, \lceil \log(\Delta) \rceil\}$. Since the radius of $B_2$ is $2^j$ and $B_1$ is not a neighboring ball of $B_2$, the distance between the center of $B_2$ and any point in $B_1$ is larger than $3 \cdot 2^j$. Hence, the distance between any point in $B_1$ to any point in $B_2$ is larger than $2^{j+1}$.
Thus, the distance between any point in any mini ball $B_1^*$ in $B_1$ to any point in any mini ball $B_2^*$ in $B_2$ is larger than $2^{j+1}$. Now, the assertion of the lemma follows from the fact that the diameters of $B_1^*$ and $B_2^*$ are at most
$$\operatorname{diam}(B_2^*) = 2 \cdot 2^{j-\ell(\varepsilon)} \le \varepsilon \cdot 2^{j+1}.$$

Due to Lemma 7.3.3, to bound the slack of the $\varepsilon$-WSPD for $M$, we have to bound the number of points in each heavy ball. The following lemma shows that this is guaranteed by our choice of $h(\sigma)$.

Lemma 7.3.4. Let $h(\sigma) := \sigma/((\lceil \log(\Delta) \rceil + 1) \cdot 2^{5\lambda+3})$, then the collection of all representative pairs is an $\varepsilon$-WSPD with slack $\sigma$ for $M$.

Figure 7.6: Illustration of the two non-neighboring heavy balls $B_1$ and $B_2$ in the proof of Lemma 7.3.3.

Proof. For any point $x_1 \in X$, we compute an upper bound on the number of points $x_2 \in X$ such that the radius of the smallest mini ball containing $x_2$ is at least as big as the radius of the smallest mini ball containing $x_1$ and the two mini balls are not $\varepsilon$-well-separated. By multiplying this number by $2n$, we get an upper bound on the slack of our $\varepsilon$-WSPD for all ordered pairs $(x_1, x_2) \in X \times X$.

Let $x_1$ and $x_2$ be any points in $X$ that satisfy the condition above. Let $i \in [\lceil \log(\Delta) \rceil + 1]$ be the level such that $B_1 \in \mathcal{H}(i)$ is the smallest heavy ball that contains $x_1$. Let $x$ be the center of $B_1$. Furthermore, let $B_2 \in \mathcal{H}(j)$, $j \in \{i, \ldots, \lceil \log(\Delta) \rceil\}$, be the smallest heavy ball that contains $x_2$ (see Figure 7.7 for an illustration). First, we show that $B_1$ and $B_2$ must be neighboring balls. Then, we prove an upper bound of $2^{3\lambda+1}$ on the number of heavy balls in level $j$ that are neighboring balls of $B_1$. This allows us to derive an upper bound of $(\lceil \log(\Delta) \rceil + 1) \cdot 2^{3\lambda+1}$ on the total number of heavy balls in any level $j \in \{i, \ldots, \lceil \log(\Delta) \rceil\}$ that are neighboring balls of $B_1$.
Finally, we show that the total weight of all the representative points located in mini balls forming a cut decomposition of such a heavy ball is at most $\sigma n/((\lceil \log(\Delta) \rceil + 1) \cdot 2^{3\lambda+2})$. Then, the number of neighboring heavy balls of $B_1$ times the representative weight associated with a heavy ball times $2n$ is at most
$$(\lceil \log(\Delta) \rceil + 1) \cdot 2^{3\lambda+1} \cdot \frac{\sigma n}{(\lceil \log(\Delta) \rceil + 1) \cdot 2^{3\lambda+2}} \cdot 2n \le \sigma \cdot n^2,$$
which proves that the slack is at most $\sigma$.

Since the smallest mini ball that contains $x_1$ and the smallest mini ball that contains $x_2$ are not $\varepsilon$-well-separated, it follows from Lemma 7.3.3 that $B_1$ and $B_2$ are neighboring balls. Since $B_2$ is at least as big as $B_1$, the distance from the center of $B_2$ to the closest point in $B_1$ is at most $3 \cdot 2^j$. It follows that the distance from the center of $B_1$ to any point in $B_2$ is at most $2^i + 3 \cdot 2^j + 2^j < 2^{j+3}$. Hence, $B_2$ is completely contained in the ball $B(x, 2^{j+3})$. Since $M$ is a doubling metric space, $B(x, 2^{j+3})$ can be covered by at most $2^{3\lambda}$ balls of radius $2^j$. Since we use a 2-approximation algorithm to compute cut decompositions, the number of balls in level $j$ that are completely contained in $B(x, 2^{j+3})$ is at most $2^{3\lambda+1}$. Hence, the number of balls in level $j$ that are neighboring balls of $B_1$ is at most $2^{3\lambda+1}$. Summing up over all levels $j \in \{i, \ldots, \lceil \log(\Delta) \rceil\}$, the total number of balls that are neighboring balls of $B_1$ is at most $(\lceil \log(\Delta) \rceil + 1) \cdot 2^{3\lambda+1}$.

Figure 7.7: Illustration of the two neighboring heavy balls $B_1$ and $B_2$ in the proof of Lemma 7.3.4. The ball $B(x, 2^{j+3})$, which completely contains $B_2$, is indicated by the dashed arc.

Let $B(x', 2^j) \in \mathcal{H}(j)$, $j \in \{i, \ldots, \lceil \log(\Delta) \rceil\}$, be any neighboring ball of $B_1$. Observe that the representative point of any point $x'' \in B(x', 2^j)$ is not from level $j$ if $x''$ is covered by a heavy ball with radius smaller than $2^j$.
Obviously, it follows that the representative point of any point $x'' \in B(x', 2^j)$ is not from level $j$ if $x''$ is covered by a heavy ball in $\mathcal{H}(j-1)$. Thus, we compute an upper bound on the number of points in all the light balls in $\mathcal{G}(j-1)$ that are at least partly covered by $B(x', 2^j)$. Observe that these balls are completely contained in the ball $B(x', 2^{j+1})$. Due to our construction, the number of balls in $\mathcal{G}(j-1)$ that are completely contained in the ball $B(x', 2^{j+1})$ is at most $2^{2\lambda+1}$. Since any light ball contains less than $h(\sigma) \cdot n$ points and $h(\sigma) = \sigma/((\lceil \log(\Delta) \rceil + 1) \cdot 2^{5\lambda+3})$, the total number of points in all the light balls in $\mathcal{G}(j-1)$ that are completely contained in the ball $B(x', 2^{j+1})$ is less than $h(\sigma) \cdot n \cdot 2^{2\lambda+1} \le \sigma n/((\lceil \log(\Delta) \rceil + 1) \cdot 2^{3\lambda+2})$. Thus, the total weight of representative points in $B(x', 2^j)$ is at most $\sigma n/((\lceil \log(\Delta) \rceil + 1) \cdot 2^{3\lambda+2})$, which was the only thing left to prove the assertion of the lemma.

Unfortunately, in contrast to our WSPD construction for low-dimensional Euclidean spaces, the property that, for each point, the distances to its $\sigma n$ closest neighbors can be arbitrarily distorted, but the distances to all the other points are $(1 \pm 2\varepsilon)$-preserved (see Remark 7.2.4), does not hold for our WSPD construction in doubling metric spaces. The reason is that neighboring balls can differ in their radii by more than a constant factor. Due to this fact, there might exist a ball with many neighboring balls of smaller radius, so the number of the distances from a single point to other points in $X$ that are not $(1 \pm 2\varepsilon)$-preserved might be bigger than $\sigma n$.

Complexity

In order to upper bound the number of representatives, we have to bound the total number of mini balls.

Lemma 7.3.5. The number of mini balls is $O(2^{O(\lambda)} \cdot \log^2(\Delta)/(\varepsilon^\lambda \sigma))$.

Proof. Let $i \in [\lceil \log(\Delta) \rceil + 1]$ be any level.
Since any heavy ball contains at least $h(\sigma) \cdot n$ points from $X$ and $h(\sigma) = \sigma/((\lceil \log(\Delta) \rceil + 1) \cdot 2^{5\lambda+3})$, the total number of heavy balls in level $i$ is at most $1/h(\sigma) = ((\lceil \log(\Delta) \rceil + 1) \cdot 2^{5\lambda+3})/\sigma$. Since $M$ is a doubling metric space with dimension $\lambda$, any ball in level $i$ can be decomposed into at most $(2^\lambda)^{\ell(\varepsilon)}$ balls with radius $2^{i-\ell(\varepsilon)}$. Thus, by applying the 2-approximation algorithm for vertex cover, we decompose each heavy ball into at most
$$2 \cdot (2^\lambda)^{\ell(\varepsilon)} = 2 \cdot (2^\lambda)^{\lceil \log(1/\varepsilon) \rceil} \le \frac{2^{\lambda+1}}{\varepsilon^\lambda}$$
mini balls. It follows that the total number of mini balls in level $i$ is at most
$$\frac{(\lceil \log(\Delta) \rceil + 1) \cdot 2^{5\lambda+3}}{\sigma} \cdot \frac{2^{\lambda+1}}{\varepsilon^\lambda} = \frac{(\lceil \log(\Delta) \rceil + 1) \cdot 2^{6\lambda+4}}{\varepsilon^\lambda \sigma}.$$
Summing up over all levels, we obtain that the total number of mini balls is upper bounded by $(\lceil \log(\Delta) \rceil + 1)^2 \cdot 2^{6\lambda+4}/(\varepsilon^\lambda \sigma)$.

Lemma 7.3.6. The algorithm has a running time of $O(n^2 \cdot \log(\Delta))$ and a space requirement of $O(n \cdot \log(\Delta))$.

Proof. By applying a standard implementation of the described 2-approximation algorithm for vertex cover, we can decompose the set $X$ into balls with any specified radius in $O(n^2)$ time. Recall that we assume access to a distance oracle that, given any two points from $X$, computes in constant time the distance between these two points. Since we have $\lceil \log(\Delta) \rceil + 1$ levels, the total running time to compute all uniform cut decompositions is $O(n^2 \cdot \log(\Delta))$. Since each uniform cut decomposition is a partition of $X$, we can find the heavy balls of any level in $O(n)$ time. Setting the representative points for any level can also be done in $O(n)$ time. Since there are $\lceil \log(\Delta) \rceil + 1$ levels, the total running time for finding heavy balls and setting representatives is $O(n \cdot \log(\Delta))$. Thus, the total running time is $O(n^2 \cdot \log(\Delta))$.

For each level, we store a uniform cut decomposition of $X$, and, for a subset of $X$, we store a decomposition into mini balls. Since the number of balls and mini balls per level is less than $n$, the space requirement of our algorithm is $O(n \cdot \log(\Delta))$.
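The level-by-level procedure whose cost is analyzed above can be sketched in a few lines of Python. This is a hedged illustration rather than the implementation analyzed in the lemmas: the helper `cut_decomposition` uses a simple greedy center selection in place of the vertex-cover-based method, logarithms are taken to base 2, the heavy threshold `h` is passed in directly so that small inputs can be tried, representatives of weight zero are omitted, and weights of a point that serves as a mini-ball center in several levels are accumulated. All function names are hypothetical.

```python
import math


def cut_decomposition(points, dist, r):
    """Greedy partition into balls of radius r centered at input points (assumed stand-in)."""
    remaining = list(points)
    balls = []
    while remaining:
        center = remaining[0]
        balls.append((center, [p for p in remaining if dist(center, p) <= r]))
        remaining = [p for p in remaining if dist(center, p) > r]
    return balls


def construct_representatives(points, dist, eps, h, spread):
    """Return a dict mapping each representative to its weight.

    Levels are processed from small to large radii, so a point is counted
    at the first (smallest) level in which a heavy ball covers it.
    """
    n = len(points)
    ell = math.ceil(math.log2(1 / eps))    # l(eps) = ceil(log(1/eps))
    top = math.ceil(math.log2(spread))     # highest level ceil(log(Delta))
    marked = set()
    weights = {}
    for i in range(top + 1):
        for _, members in cut_decomposition(points, dist, 2.0 ** i):
            if len(members) < h * n:
                continue                   # light ball: nothing to do at this level
            # decompose the heavy ball into mini balls of radius 2^(i - l(eps))
            for y, mini in cut_decomposition(members, dist, 2.0 ** (i - ell)):
                fresh = [x for x in mini if x not in marked]
                marked.update(fresh)
                if fresh:
                    weights[y] = weights.get(y, 0) + len(fresh)
    return weights
```

Each call to `cut_decomposition` performs a quadratic number of distance evaluations in the worst case, and there are $\lceil \log(\Delta) \rceil + 1$ levels, which matches the shape of the $O(n^2 \cdot \log(\Delta))$ bound. Because the ball in the top level covers all points, the returned weights sum to $n$ whenever $h \le 1$.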
We summarize our results in the following theorem:

Theorem 8 (WSPD with Slack for Doubling Metric Spaces). Let $M = (X, D)$ be an $n$-point metric space with bounded doubling dimension $\lambda$ and spread $\Delta$, let $\varepsilon$, $0 < \varepsilon < 1$, be a separation parameter, and let $\sigma$, $(\lceil \log(\Delta) \rceil + 1) \cdot 2^{6\lambda+3}/n < \sigma < 1$, be a slack parameter. Then, there exists an algorithm that computes a weighted point set $X' \subset X$ with cardinality $O(2^{O(\lambda)} \cdot \log^2(\Delta)/(\varepsilon^\lambda \sigma))$ which is an implicit representation of an $\varepsilon$-WSPD with slack $\sigma$ for $M$. The algorithm has a running time of $O(n^2 \cdot \log(\Delta))$ and a space requirement of $O(n \cdot \log(\Delta))$.

Proof. We have $X' \subset X$ since the location of each representative is a point from $X$. Due to Lemma 7.3.4, $X'$ is an implicit representation of an $\varepsilon$-WSPD with slack $\sigma$ for $M$. The cardinality of $X'$ follows from Lemma 7.3.5, and the running time and space requirement to compute $X'$ are due to Lemma 7.3.6. In our construction, we made the assumption that $\sigma > (\lceil \log(\Delta) \rceil + 1) \cdot 2^{6\lambda+3}/n$ to ensure that all the balls in $\mathcal{G}(0)$ are light.

8 Embeddings with Slack in Data Streams and Applications

This chapter is devoted to the problem of computing low-distortion embeddings in the streaming model. Given a stream of points from an $n$-point metric space $M$, our streaming algorithms compute an embedding of $M$ into another $n$-point metric space $M'$ that preserves a $(1-\sigma)$-fraction of all the pairwise distances with small distortion. The parameter $\sigma$ is called the slack of the embedding. The strict space limitations specified by the streaming model prevent us from storing our embedding explicitly. We bypass this obstacle by computing a compact representation of $M'$ without storing the actual mapping from $M$ into $M'$.

We present streaming embeddings with low distortion and low slack for $n$-point Euclidean metric spaces in Section 8.2, doubling metric spaces in Section 8.4, and general metric spaces in Section 8.5.
The embeddings for Euclidean and doubling metric spaces are based on the techniques developed in Chapter 7. The embedding for general metric spaces takes advantage of the existence of certain subsets of points called edge-dense nets. Intuitively, an edge-dense net $N \subseteq X$ of a metric space $M = (X, D)$ has the property that, for a $(1 - \sigma)$-fraction of pairs of points $(x, y) \in X \times X$, the distance between $N$ and both $x$ and $y$ is small compared to $D(x, y)$. The existence of such nets follows from results on embeddings with beacons by Kleinberg et al. [75]. After some modifications, this allows us to compute a low-distortion embedding with low slack in the streaming model. Our method resembles the construction of spanners with slack of Chan et al. [24].

Finally, we prove some lower bounds on the space requirement of streaming embeddings with slack in Section 8.6.

8.1 Preliminaries

A metric embedding is the transformation of one metric space into another metric space.

Definition 8.1.1 (Metric Embedding). A mapping $\varphi : X \to X'$ from a metric space $M = (X, D)$ into a target metric space $M' = (X', D')$ is called a metric embedding.

In this thesis, we are only interested in embedding metric spaces $M = (X, D)$, where $X$ is a set of $n$ points. Given such an $n$-point metric space $M = (X, D)$, our streaming algorithms compute an embedding $\varphi : X \to X'$ into a target metric space $M' = (X', D')$ whose representation uses only sublinear space. Besides the space requirement, we will measure the quality of $\varphi$ by its distortion and slack.

Definition 8.1.2 (Embedding with Distortion and Slack, [23]). Let $\varrho \ge 1$ be a precision parameter, and let $\sigma$, $0 < \sigma < 1$, be a slack parameter.
An embedding $\varphi : X \to X'$ from a finite metric space $M = (X, D)$ into a target metric space $M' = (X', D')$ has distortion $\varrho$ and slack $\sigma$ if there are two values $\alpha, \beta \ge 1$ with $\alpha \cdot \beta \le \varrho$ such that
$$\frac{1}{\alpha} \cdot D(x, y) \le D'(\varphi(x), \varphi(y)) \le \beta \cdot D(x, y) \qquad (8.1)$$
is true for a $(1 - \sigma)$-fraction of pairs $(x, y) \in X \times X$.

Similar to our definition of slack of a WSPD, the slack $\sigma$ of an embedding $\varphi$ is measured by the fraction of all ordered pairs $(x, y) \in X \times X$ that do not satisfy Inequality (8.1). The assumption of having $n^2$ (instead of $\binom{n}{2}$) pairwise distances simplifies descriptions and makes our proofs cleaner without changing the results in any significant way. In case that an embedding $\varphi$ has distortion $\varrho$ and slack $\sigma$ with $\sigma = 0$, we just say that $\varphi$ has distortion $\varrho$.

In the following, we will present streaming embeddings for Euclidean, doubling, and general metric spaces. Our algorithms for general and doubling metric spaces work in the insertion-only data stream model, whereas the ones for Euclidean metric spaces work in the dynamic geometric data stream model. In each case, we assume that the minimum pairwise distance of the given metric space is at least 1, and the maximum pairwise distance is at most $\Delta$. Furthermore, we assume that the parameter $n$ is known in advance by our algorithms.

8.2 Embedding Euclidean Metric Spaces

In this section, we explain how to compute with high probability a low-distortion embedding with low slack for a Euclidean metric space $M = (P, D)$ given as a dynamic geometric data stream. Recall that, in this streaming model, the input is a sequence of $m$ Insert and Delete operations of points from a discrete Euclidean space $\{1, \ldots, \Delta\}^d$. At first, we assume that the dimension $d$ is a constant. We will show in Section 8.2.2 how to get rid of this assumption.

8.2.1 Low Dimensions

Our algorithm for constant-dimensional Euclidean spaces is based on the WSPD construction described in Section 7.2.
In order to construct an $\varepsilon$-WSPD with slack $\sigma$ for a point set $P$, we first use a certain quadtree partition of the point space into a few cells and an elaborate refinement of this partition, where each cell is further subdivided into a few cubelets. Then, we replace each point by a representative such that all the representatives of points located inside of the same cubelet have the same position. Let $R(p)$ be the representative of a point $p \in P$, then $R : P \to P'$ is an embedding from $M$ into the target metric space $M' = (P', D)$. The advantage of this embedding is that it can be computed by using only the information about the number of points in certain cells or cubelets and is not reliant on the exact location of the points in $P$. We will show that we can use a random sampling technique to estimate the number of points in the relevant cells and cubelets. To sum up, the idea of our streaming algorithm is to maintain a random sample of the current point set $P$ given as a dynamic geometric data stream and to apply the algorithm described in Section 7.2 on the sample set.

Now, we explain the sample step in more detail. We read the items of the input stream one by one. Each time, we decide whether we use the associated point for further computations or not. For that purpose, we use the technique described in [43, 42] to maintain a sample set of the current point set $P$ with size $s \in \Theta(2^{\Theta(d)} \cdot d^d \cdot \log(\Delta) \log(n/\delta)/(\varepsilon^d \sigma^4))$, where $\delta$ is the error probability of the algorithm. We denote this sample set by $S$.

After the sample step, we execute the quadtree partitioning for $S$ based on the heavy cells in $\lceil \log(\Delta) \rceil + 1$ nested square grids over $S$ as described in Section 7.2. During this process, a cell is identified as heavy if it contains at least $\sigma s/2^{d+1}$ sample points. A cell containing fewer sample points is identified as light.
Thereafter, we build the balanced quadtree partition and perform the refinement into equal sized cubelets as explained in Section 7.2. For each cubelet $C$ that contains at least $\lceil \ln(n)/\sigma^2 \rceil$ sample points, we replace the points in $C$ by one representative. This point is set to the location of an arbitrary sample point inside of $C$ and weighted by $\lceil |C| \cdot n/s \rceil$, where $|C|$ denotes the number of replaced points. To avoid that the total weight of the representatives differs from $n$, we sum up all weights and increase or decrease the weight of some arbitrary representatives by the required amount. The set of all weighted representatives is our compact representation for $M'$.

Let us first ignore that we use a sample step. Then, due to the fact that the embedding $R : P \to P'$ from $M$ into the target metric space $M' = (P', D)$ is determined by the construction of an $\varepsilon$-WSPD with slack $\sigma$ for $P$, the embedding $R$ has distortion $(1 + 2\varepsilon)^2$ and slack $\sigma$. In the following, we will show that the sample step, which enables us to compute a representation of $M'$ in the dynamic geometric data stream model, does not significantly increase the slack.

Maintenance of the Sample Set

By applying the technique described in [43, 42], we are able to maintain a sample set of the current point set $P$ under insertions and deletions such that every point in the sample set is chosen nearly uniformly at random from $P$. More precisely, by adopting the results in [43, 42], we obtain the following lemma:

Lemma 8.2.1 (Sample Data Structure, [43, 42]). Let $\delta$, $0 < \delta \le 1$, be an error probability parameter. Given a sequence of Insert and Delete operations of points from the discrete Euclidean space $\{1, \ldots, \Delta\}^d$, there is a data structure that, with probability $1 - \delta$, returns $s$ points $q_0, \ldots, q_{s-1}$ from the current point set $P := \{p_0, \ldots, p_{n-1}\}$ such that
$$\Pr[q_i = p_j] = \frac{1}{n} \pm \frac{\delta}{\Delta^d}$$
is true for every $j \in [n]$ and for every $i \in [s]$.
Both update time and space requirement of the algorithm are $O((s + \log(1/\delta)) \cdot d^2 \cdot \log^2(\Delta/\delta))$.

Slack Induced by the Sample Step

Due to the fact that we use a sample set to estimate the number of points in certain cells and cubelets, we make an error which increases the slack. In order to measure this increase of the slack, we first investigate how much the quadtree partition computed on the sample set $S$ might differ from the one that we would get by taking the whole input point set $P$ into account.

Recall that the algorithm identifies each cell that contains at least $\sigma s/2^{d+1}$ sample points as heavy. Next, we show that if each point in $S$ is chosen uniformly at random from $P$, then, with high probability, the algorithm identifies every heavy cell as heavy and every cell which contains significantly fewer points than a heavy cell as light.

Lemma 8.2.2. If each point in $S$ is chosen uniformly at random from $P$, then every heavy cell is identified as heavy with probability at least $1 - \delta$.

Proof. Let $\mathcal{H}$ be the set of all heavy cells that do not have a heavy subcell. Obviously, there are at most $n$ cells in $\mathcal{H}$. Let $C$ be any such cell. Let $Y_i$ be the indicator random variable for the event that the $i$-th point in $S$ is contained in cell $C$. Since $C$ is heavy, it contains at least $\sigma n/2^d$ points. Thus, we have $E[Y_i] \ge \sigma/2^d$. By a Chernoff bound and linearity of expectation, we get
$$\Pr\left[ \sum_{i=1}^{|S|} Y_i < \left( 1 - \frac{1}{2} \right) \cdot E\left[ \sum_{i=1}^{|S|} Y_i \right] \right] \le \exp\left( -\frac{|S| \cdot E[Y_i]}{2^3} \right) \le \exp\left( -\frac{\sigma \cdot |S|}{2^{d+3}} \right).$$
For $|S| \ge 8 \cdot 2^d \ln(n/\delta)/\sigma$, this probability is at most $\delta/n$. Since we have chosen
$$|S| \in \Theta\left( \frac{2^{\Theta(d)} \cdot d^d \cdot \log(\Delta) \log(n/\delta)}{\varepsilon^d \sigma^4} \right) \subset \Omega\left( \frac{2^d \ln(n/\delta)}{\sigma} \right),$$
$C$ contains at least
$$\left( 1 - \frac{1}{2} \right) \cdot E\left[ \sum_{i=1}^{|S|} Y_i \right] \ge \frac{\sigma s}{2^{d+1}}$$
sample points with probability at least $1 - \delta/n$. By the union bound, the probability that every cell in $\mathcal{H}$ contains at least $\sigma s/2^{d+1}$ sample points is at least $1 - \delta$. Obviously, it then follows that each ancestor cell of a cell in $\mathcal{H}$ also contains at least $\sigma s/2^{d+1}$ sample points.
Hence, every heavy cell is identified as heavy with probability at least $1 - \delta$.

We call a cell that contains at least $\sigma n/2^{d+2}$ points from $P$ quarter-heavy. The following lemma proves that, with high probability, no cell which is not quarter-heavy is identified as heavy by the algorithm.

Lemma 8.2.3. If each point in $S$ is chosen uniformly at random from $P$, then every cell that is not quarter-heavy is identified as light with probability at least $1 - \delta$.

Proof. Let $C$ be any cell that contains $\sigma n/(2^{d+2} k)$ points from $P$, where $k > 1$. Let $Y_i$ be the indicator random variable for the event that the $i$-th point in $S$ is contained in cell $C$. We have $\mathrm{E}[Y_i] = \sigma/(2^{d+2} k)$. By a Chernoff bound and linearity of expectation, we get
\[ \Pr\left[ \sum_{i=1}^{|S|} Y_i \ge (1 + k) \cdot \mathrm{E}\left[ \sum_{i=1}^{|S|} Y_i \right] \right] \le \exp\left( - \frac{k \cdot |S| \cdot \mathrm{E}[Y_i]}{3} \right) = \exp\left( - \frac{|S| \cdot \sigma}{3 \cdot 2^{d+2}} \right) . \]
For $|S| \ge 12 \cdot 2^d \ln(n/\delta)/\sigma$, this probability is at most $\delta/n$. Since we have chosen
\[ |S| \in \Theta\left( \frac{2^{\Theta(d)} \cdot d^d \cdot \log(\Delta) \log(n/\delta)}{\varepsilon^d \sigma^4} \right) \subset \Omega\left( \frac{2^d \ln(n/\delta)}{\sigma} \right) , \]
$C$ contains fewer than
\[ (1 + k) \cdot \mathrm{E}\left[ \sum_{i=1}^{|S|} Y_i \right] < \frac{\sigma s}{2^{d+1}} \]
sample points with probability at least $1 - \delta/n$. Let us assume that the algorithm is performing the quadtree partitioning and has already identified all cells which are at least quarter-heavy as heavy cells and has not yet tried to identify any cell that is not quarter-heavy. Since a cell must contain at least one point from $P$ to be a potential candidate for being identified as heavy, there are at most $n$ such candidate cells in the current space partition. By the union bound, the probability that all of them contain fewer than $\sigma s/2^{d+1}$ sample points is at least $1 - \delta$. Since each cell contains at most as many sample points as its parent cell, it then follows that every descendant cell of a cell in the current space partition contains fewer than $\sigma s/2^{d+1}$ sample points.
Thus, every cell that is not quarter-heavy contains fewer than $\sigma s/2^{d+1}$ sample points and is, hence, identified as light with probability at least $1 - \delta$.

Due to Lemmas 8.2.2 and 8.2.3, we know that the quadtree partition on $S$ is fairly close to the quadtree partition on $P$. By utilizing this fact, we next analyze the slack induced by estimating the number of points in cubelets based on the sample set $S$.

Lemma 8.2.4. Let $Z \in \Theta(2^{\Theta(d)} \cdot d^d \cdot \log(\Delta)/(\varepsilon^d \sigma))$. If each point in $S$ is chosen uniformly at random from $P$, then we can define a set $\mathcal{Z}$ of $Z$ cubelets such that the set of cubelets constructed by the algorithm is a subset of $\mathcal{Z}$ with probability at least $1 - 2\delta$.

Proof. Due to Lemmas 8.2.2 and 8.2.3, with probability at least $1 - 2\delta$, the algorithm satisfies the condition that it identifies every heavy cell as heavy and every cell which is not quarter-heavy as light. Let us now consider the quadtree partition that we get by splitting exactly the quarter-heavy cells and performing the balancing operations. Let $\mathcal{L}^+$ be the set of cells in the resulting balanced quadtree. Then, the set of cells in any balanced quadtree partition obtained from a run of our algorithm that satisfies the above condition is a subset of $\mathcal{L}^+$. Furthermore, let $\mathcal{Z}$ be the set of cubelets that we obtain by subdividing each cell in $\mathcal{L}^+$ into cubelets. Then, the set of cubelets constructed during any run of the algorithm that satisfies the above condition is a subset of $\mathcal{Z}$.

Next, we upper bound the size of $\mathcal{Z}$. We proceed exactly as we have done in the proof of Lemma 7.2.6. There are at most $2^{d+2}/\sigma$ quarter-heavy cells which do not have a quarter-heavy subcell in a quadtree partition. For each such cell, there exist at most $2^d (\lceil \log(\Delta) \rceil + 1)$ cells in the quadtree. Hence, the unbalanced quadtree partition contains $O(2^{2d} \cdot \log(\Delta)/\sigma)$ cells. Due to Lemma 7.2.5, the number of cells in the balanced quadtree partition is $O(2^{4d} \cdot \log(\Delta)/\sigma)$.
Since in the last step of the algorithm a cell is subdivided into $\lceil 6\sqrt{d} \rceil^d$ cubes and each cube into $\lceil 2\sqrt{d}/\varepsilon \rceil^d$ cubelets, the set $\mathcal{Z}$ consists of $O(288^d \cdot d^d \cdot \log(\Delta)/(\varepsilon^d \sigma))$ cubelets.

Lemma 8.2.5. Let $Z \in \Theta(2^{\Theta(d)} \cdot d^d \cdot \log(\Delta)/(\varepsilon^d \sigma))$, and let $U$ be the union of all the cubelets that contain at most $\sigma n/(2Z)$ points from $P$. If each point in $S$ is chosen uniformly at random from $P$ and the space partition consists of at most $Z$ cubelets, then, with probability at least $1 - \delta$, the number of points from $P$ in $U$ is at most $\sigma n/2$ and the number of sample points in $U$ is at most $\sigma s$.

Proof. If there are at most $Z$ cubelets in the space partition, then the number of points in cubelets that contain at most $\sigma n/(2Z)$ points from $P$ is at most $\sigma n/2$. Thus, for some $k \ge 1$, the total number of points from $P$ in $U$ is $\sigma n/(2k)$. Let $Y_i$ be the indicator random variable for the event that the $i$-th point in $S$ is contained in $U$. We have $\mathrm{E}[Y_i] = \sigma/(2k)$. By a Chernoff bound and linearity of expectation, we get
\[ \Pr\left[ \sum_{i=1}^{|S|} Y_i \ge (1 + k) \cdot \mathrm{E}\left[ \sum_{i=1}^{|S|} Y_i \right] \right] \le \exp\left( - \frac{k \cdot |S| \cdot \mathrm{E}[Y_i]}{3} \right) = \exp\left( - \frac{|S| \cdot \sigma}{6} \right) . \]
For $|S| \ge 6 \ln(1/\delta)/\sigma$, this probability is at most $\delta$. Since we have chosen
\[ |S| \in \Theta\left( \frac{2^{\Theta(d)} \cdot d^d \cdot \log(\Delta) \log(n/\delta)}{\varepsilon^d \sigma^4} \right) \subset \Omega\left( \frac{\ln(1/\delta)}{\sigma} \right) , \]
$U$ contains fewer than
\[ (1 + k) \cdot \mathrm{E}\left[ \sum_{i=1}^{|S|} Y_i \right] \le \sigma s \]
sample points with probability at least $1 - \delta$.

Lemma 8.2.6. Let $Z \in \Theta(2^{\Theta(d)} \cdot d^d \cdot \log(\Delta)/(\varepsilon^d \sigma))$. If each point in $S$ is chosen uniformly at random from $P$, then the number of points in every cubelet that contains at least $\sigma n/(2Z)$ points from $P$ can be $(1 \pm \sigma)$-approximated by $S$ with probability $1 - \delta$.

Proof. Let $C$ be any cubelet that contains at least $\sigma n/(2Z)$ points from $P$. Let $Y_i$ be the indicator random variable for the event that the $i$-th point in $S$ is contained in cubelet $C$. We have $\mathrm{E}[Y_i] \ge \sigma/(2Z)$.
By Chernoff bounds and linearity of expectation, we get
\[ \Pr\left[ \sum_{i=1}^{|S|} Y_i \ge (1 + \sigma) \cdot \mathrm{E}\left[ \sum_{i=1}^{|S|} Y_i \right] \right] \le \exp\left( - \frac{\sigma^2 \cdot |S| \cdot \mathrm{E}[Y_i]}{3} \right) \le \exp\left( - \frac{\sigma^3 \cdot |S|}{6Z} \right) \]
and
\[ \Pr\left[ \sum_{i=1}^{|S|} Y_i \le (1 - \sigma) \cdot \mathrm{E}\left[ \sum_{i=1}^{|S|} Y_i \right] \right] \le \exp\left( - \frac{\sigma^2 \cdot |S| \cdot \mathrm{E}[Y_i]}{2} \right) \le \exp\left( - \frac{\sigma^3 \cdot |S|}{4Z} \right) . \]
For $|S| \ge 6Z \ln(n/\delta)/\sigma^3$, each of these probabilities is at most $\delta/n$. Since we have chosen
\[ |S| \in \Theta\left( \frac{2^{\Theta(d)} \cdot d^d \cdot \log(\Delta) \log(n/\delta)}{\varepsilon^d \sigma^4} \right) \subseteq \Omega\left( \frac{Z \ln(n/\delta)}{\sigma^3} \right) , \]
the number of points in $C$ can be $(1 \pm \sigma)$-approximated with probability $1 - 2\delta/n$. By the union bound, the number of points in every cubelet that contains at least $\sigma n/(2Z)$ points is $(1 \pm \sigma)$-approximated with probability at least $1 - \delta$.

Based on the lemmas given above, we will show that, with high probability, our streaming algorithm computes an $\varepsilon$-WSPD with slack $\sigma' = 4\sigma$ for $P$. Unfortunately, in contrast to our WSPD construction in a classical non-streaming model, the following property does not hold for our streaming embedding: for each point, the distances to its $\sigma' n$ closest neighbors can be arbitrarily distorted, but the distances to all the other points are $(1 \pm 2\varepsilon)$-preserved (see Remark 7.2.4). This is caused by the use of the random sampling technique. Put simply, we know how big the slack is, but we do not know where it arises.

Weight of the Representatives

To ensure that the total weight of the representatives equals $n$, we adjust the weight of some representatives in the last phase of the algorithm. Now, we prove that this adjustment is small.

Lemma 8.2.7. Let $R$ be the set of representatives before the adjustment, and let $w(r)$ denote the weight of a representative $r \in R$. Then,
\[ (1 - \sigma) \cdot n < \sum_{r \in R} w(r) \le \left( 1 + \frac{\sigma^2}{\ln(n)} \right) \cdot n \]
holds with probability at least $1 - 4\delta$.

Proof. In the third phase, we place one representative in each cubelet $C$ that contains at least $\lceil \ln(n)/\sigma^2 \rceil$ sample points.
The weight of this representative is set to $\lceil |C| \cdot n/s \rceil$, where $|C|$ denotes the number of sample points in $C$. It follows that the total weight of the representatives can be smaller than $n$. The sample data structure from Lemma 8.2.1 fails with probability $\delta$. Furthermore, the statistical difference from the exact uniform distribution is at most $\delta$. However, if the sample data structure works as required, it follows from Lemma 8.2.4 that, with an error probability of $2\delta$, the number of cubelets containing fewer than $\lceil \ln(n)/\sigma^2 \rceil$ sample points is at most $Z \in O(2^{O(d)} \cdot d^d \cdot \log(\Delta)/(\varepsilon^d \sigma))$ and
\[ Z \le \frac{s \cdot \sigma^3}{6 \log(n)} \le \frac{\sigma s}{3 \lceil \ln(n)/\sigma^2 \rceil} \]
for our chosen value of $s$. Thus, we get
\[ n - \sum_{r \in R} w(r) \le \left\lceil \left\lceil \frac{\ln(n)}{\sigma^2} \right\rceil \cdot \frac{n}{s} \right\rceil \cdot \frac{\sigma s}{3 \lceil \ln(n)/\sigma^2 \rceil} < 2 \cdot \left\lceil \frac{\ln(n)}{\sigma^2} \right\rceil \cdot \frac{n}{s} \cdot \frac{\sigma s}{3 \lceil \ln(n)/\sigma^2 \rceil} < \sigma n \]
with probability at least $1 - 4\delta$, which proves the first inequality of the lemma.

The sum of the weights can be larger than $n$ because the weight of each representative is rounded up to the next integer. Thus, the sum of the weights is at most $n + |R|$. Since $s \le n$ and every cubelet in which we set a representative contains at least $\lceil \ln(n)/\sigma^2 \rceil$ points of the sample set $S$, we get
\[ \sum_{r \in R} w(r) - n \le n + |R| - n \le \frac{s}{\lceil \ln(n)/\sigma^2 \rceil} \le \frac{\sigma^2 n}{\ln(n)} , \]
which proves the second inequality of the lemma.

We summarize our results in the following theorem:

Theorem 9. Given a stream of Insert and Delete operations of points from a discrete Euclidean space $\{1, \dots, \Delta\}^d$, where $d$ is a constant, a precision parameter $\varepsilon$, $0 < \varepsilon < 1$, a slack parameter $\sigma$, $1/o(n) < \sigma < 1$, and an error probability parameter $\delta$, $0 < \delta < 1$, there is a randomized streaming algorithm that computes with probability $1 - \delta$, for the current point set $P$ of size $n$, a point set $P' \subset P$ of size $O(\log(\Delta)/(\varepsilon^d \sigma))$ such that $P$ embeds into $P'$ with distortion $1 + \varepsilon$ and slack $\sigma$.
The algorithm has an update time of
\[ O\left( \frac{\log(\Delta) \cdot \log^2(\Delta/\delta) \cdot \log(n/\delta)}{\varepsilon^{d+1} \sigma^4} \right) \]
and a space requirement of
\[ O\left( \frac{\log(\Delta) \cdot \log^2(\Delta/\delta) \cdot \log(n/\delta)}{\varepsilon^d \sigma^4} \right) . \]

Proof. Due to Lemmas 8.2.1 and 8.2.2, with high probability, the algorithm identifies and splits each heavy cell during the quadtree partitioning, which implies that Lemmas 7.2.2 and 7.2.3 are applicable. It follows that, for any two points $p_1$ and $p_2$ in $P$, if the cubelet containing $p_1$ and the cubelet containing $p_2$ are not $\varepsilon$-well-separated, then $p_2$ belongs to the $\sigma n$ closest points of $p_1$. This induces a slack of $\sigma$. Since we estimate the number of points in each cubelet based on the sample set $S$, we get an additional slack. Due to Lemmas 8.2.4, 8.2.5 and 8.2.6, with high probability, the additional slack induced by this estimation is at most $2\sigma$. Finally, we get more additional slack since we only place representatives in cubelets containing more than a certain threshold of points and we round up the weight of each representative. Due to Lemma 8.2.7, with high probability, this slack is at most $\sigma$. Thus, with high probability, our streaming algorithm computes a representation of an $\varepsilon$-WSPD with slack $4\sigma$ for $P$. This $\varepsilon$-WSPD is an implicit embedding of $P$ with distortion $(1 + 2\varepsilon)^2$ and slack $4\sigma$.

The error probability is given as follows. We use a random sample as given by the data structure from Lemma 8.2.1. This data structure fails with probability $\delta$. Furthermore, the statistical difference from the exact uniform distribution is at most $\delta$. If the sample data structure works as required, Lemmas 8.2.2 and 8.2.3 hold with probability at least $1 - 2\delta$. Thus, the probability that Lemmas 8.2.1, 8.2.2, and 8.2.3 hold is at least $1 - 4\delta$. If this is the case, then the assertions given in Lemmas 8.2.4, 8.2.7, 7.2.2 and 7.2.3 follow directly and the assertions given in Lemmas 8.2.5 and 8.2.6 each hold with an error probability of at most $\delta$.
Thus, the total error probability of our algorithm is at most $6\delta$.

In summary, if we run our embedding algorithm with a precision parameter $\varepsilon' \le \varepsilon/5$, a slack parameter $\sigma' \le \sigma/4$, and an error probability parameter $\delta' \le \delta/6$, then the embedding has distortion $(1 + 2\varepsilon')^2 \le 1 + \varepsilon$ and slack $4\sigma' \le \sigma$ and works with error probability $6\delta' \le \delta$. Due to Theorem 7, the size of $P'$ is $O(\log(\Delta)/(\varepsilon^d \sigma))$. Furthermore, it follows from Theorem 7 and $S \subset P$ that we have $P' \subset P$.

The update time is given as follows. At first, we use the data structure of Lemma 8.2.1 to decide whether the point is sampled or not. This costs $O((s + \log(1/\delta)) \cdot \log^2(\Delta/\delta))$ time. Afterwards, we build the balanced quadtree partition and the refinement into a set of $O(\log(\Delta)/(\varepsilon^d \cdot \sigma))$ cubelets for a set of $s$ points. By applying Theorem 7, this can be done in $O(s(1/\varepsilon + \log(\Delta)) + \log^2(\Delta)/\sigma + \log(\Delta)/(\varepsilon^d \sigma))$ time. Since the size of the sample set is $s \in \Theta(\log(n/\delta) \cdot \log(\Delta)/(\varepsilon^d \cdot \sigma^4))$, the total update time is as claimed in the theorem.

Due to Lemma 8.2.1, the space required to store the sample data structure is upper bounded by $O((s + \log(1/\delta)) \cdot \log^2(\Delta/\delta))$. Furthermore, we have to store the partition tree with $O(\log(\Delta)/(\varepsilon^d \cdot \sigma))$ nodes for the sample set $S$. This requires $O(s + \log(\Delta)/(\varepsilon^d \sigma))$ space. Since the size of the sample set is $s \in \Theta(\log(n/\delta) \cdot \log(\Delta)/(\varepsilon^d \cdot \sigma^4))$, the space for the sampling data structure dominates. Thus, the total space requirement is as stated in the theorem.

Since we use the WSPD construction given in Section 7.2, we have to make sure that $\sigma' > 2^{2d}/n$ (cf. Theorem 7). However, this is implicitly required by the fact that the space requirement of a streaming algorithm has to be sublinear in $n$ and the space requirement of our streaming algorithm is $\omega(1/\sigma)$.
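The parameter rescaling at the end of this proof can be checked numerically. The following is a minimal sketch, assuming the boundary choices ε′ = ε/5, σ′ = σ/4 and δ′ = δ/6; the function names are illustrative and do not appear in the thesis.

```python
def rescale_parameters(eps, sigma, delta):
    """Internal parameters (eps', sigma', delta') at the boundary values
    eps/5, sigma/4 and delta/6 (assumed choice; any smaller values also work)."""
    if not (0 < eps < 1 and 0 < sigma < 1 and 0 < delta < 1):
        raise ValueError("eps, sigma and delta must lie strictly between 0 and 1")
    return eps / 5, sigma / 4, delta / 6


def resulting_guarantees(eps_p, sigma_p, delta_p):
    """Guarantees of the streaming embedding run with the internal parameters:
    distortion (1 + 2*eps')^2, slack 4*sigma', error probability 6*delta'."""
    return (1 + 2 * eps_p) ** 2, 4 * sigma_p, 6 * delta_p


if __name__ == "__main__":
    for eps in (0.1, 0.5, 0.99):
        internal = rescale_parameters(eps, 0.2, 0.05)
        distortion, slack, error = resulting_guarantees(*internal)
        # The distortion never exceeds the requested 1 + eps.
        print(eps, distortion <= 1 + eps, slack, error)
```

The check works because $(1 + 2\varepsilon/5)^2 = 1 + 4\varepsilon/5 + 4\varepsilon^2/25 \le 1 + 24\varepsilon/25$ for $0 < \varepsilon < 1$.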
8.2.2 High Dimensions

If the points in $P$ have a high dimension, we first use the Johnson-Lindenstrauss embedding [71] with $d(\varepsilon, \sigma, \delta) \in \Theta(1/(\varepsilon^2 \sigma \delta))$ dimensions to get an embedding into a low-dimensional space that has distortion $1 + \varepsilon$ and slack $\sigma$ with probability $1 - \delta$. Afterwards, we apply the techniques described in Sections 7.2 and 8.2 on the low-dimensional point set. This composition of two embeddings, both with distortion $1 + \varepsilon$ and slack $\sigma$, yields an embedding with distortion $1 + 3\varepsilon$ and slack $2\sigma$ that can be computed in the dynamic geometric data stream model.

In the following, we give an idea of how to use the techniques developed by Johnson and Lindenstrauss [71] to obtain an embedding from a high-dimensional space into a low-dimensional space with distortion $1 + \varepsilon$ and slack $\sigma$. Overall, we apply the AMS-sketch [6] to get an embedding similar to the Johnson-Lindenstrauss embedding [71]. The main difference between both techniques is as follows. The AMS-sketch computes one random variable for the whole input stream such that the sum of the squared coordinate values is close to the second frequency moment of the input stream with high probability. In contrast, our method computes for each $d$-dimensional point in the data stream $d(\varepsilon, \sigma, \delta)$ random variables, the $d(\varepsilon, \sigma, \delta)$ coordinates of the embedded point, whose squared values are all equal to the so-called second frequency moment of this $d$-dimensional input point with high probability. More precisely, we can look upon one $d$-dimensional point as a stream consisting of $d$ different elements where the frequency of element $i$, $1 \le i \le d$, is given by the value of the $i$-th coordinate of the $d$-dimensional point. It is easy to see that, for this definition of frequency moments, the second frequency moment of a point is equal to the squared norm of this point.
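Written out, with $F_2(p)$ used here as an illustrative shorthand for the second frequency moment of a point $p = (p^{(1)}, \dots, p^{(d)})$ (the symbol is an assumption of this note and follows the usual convention for frequency moments), the identity is
\[ F_2(p) = \sum_{i=1}^{d} \left( p^{(i)} \right)^2 = \|p\|^2 . \]
For instance, the point $p = (3, 4)$ has $F_2(p) = 9 + 16 = 25 = \|p\|^2$.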
Because the embedding of a point is given by a linear mapping, the embedded distance vector of two points is equal to the distance vector of the two embedded points. Consequently, our method computes an embedding that preserves an approximation of almost all squared pairwise distances. Hence, the embedding preserves an approximation of almost all simple pairwise distances. Next, we state our result and give a detailed proof of its correctness.

Theorem 10. Let $\varepsilon$, $0 < \varepsilon < 1$, be a precision parameter, let $\sigma$, $0 < \sigma < 1$, be a slack parameter, let $\delta$, $0 < \delta < 1$, be an error probability parameter, and let $d(\varepsilon, \sigma, \delta) := 2/(\varepsilon^2 \sigma \delta)$ be a function dependent on $\varepsilon$, $\sigma$, and $\delta$. Given a set $P$ of $n$ points in $\mathbb{R}^d$, there exists an embedding $\varphi : P \to \mathbb{R}^{d(\varepsilon, \sigma, \delta)}$ such that
\[ (1 - \varepsilon) \cdot D(p, q) \le D(\varphi(p), \varphi(q)) \le (1 + \varepsilon) \cdot D(p, q) \]
is true for at least $(1 - \sigma) \cdot n^2$ pairs of points $(p, q) \in P \times P$ with probability at least $1 - \delta$. Each point can be embedded in $O(d \cdot \log^2(d)/(\varepsilon^2 \sigma \delta))$ time using $O(\log(d)/(\varepsilon^2 \sigma \delta))$ space.

Proof. Our proof of the theorem is almost identical to the proof of Theorem 2.2 in [6]. However, since we will present another result that is based on similar techniques, we do not only describe our modifications to the proof of Theorem 2.2 in [6] but include the full proof below.

For each point $p \in P$ and each coordinate $i \in \{1, \dots, d(\varepsilon, \sigma, \delta)\}$, the algorithm computes a random variable $Y_i(p)$. We define the embedding $\varphi$ of the point $p$ by
\[ \varphi(p) := \frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot \left( Y_1(p), \dots, Y_{d(\varepsilon, \sigma, \delta)}(p) \right)^T . \]
For each point $p \in P$, the value $Y_i(p)$ is computed in the same way. Fix an explicit set $V := \{v_1, \dots, v_Z\}$ of $Z \in O(d^2)$ vectors of length $d$ with $+1, -1$ entries which are four-wise independent, i.e., for every four distinct coordinates, each of the 16 possible combinations $\{-1, 1\}^4$ occurs uniformly distributed in $V$. As described in [5, 6], such sets can be constructed with the help of the parity check matrices of BCH codes.
The implementation of this construction requires an irreducible polynomial of degree $g$ over the finite field $\mathbb{F}_2$, where $2^g$ is the smallest power of 2 greater than $d$. Such a polynomial can be found by using only $O(\log d)$ space. Then, the construction enables us to compute each coordinate of each vector in $V$ in $O(\log d)$ space, using a constant number of multiplications in the finite field $\mathbb{F}_{2^g}$ and binary inner products of vectors of length $g$.

In order to compute $Y_i(p)$, for any $p \in \mathbb{R}^d$, we choose a random vector
\[ v_z =: r_i = \left( r_i^{(1)}, r_i^{(2)}, \dots, r_i^{(d)} \right) \in V , \]
where $z$ is chosen uniformly at random from $\{1, \dots, Z\}$. Note that, once we have chosen a random vector $r_i$ to compute the $i$-th coordinate of $\varphi(p)$ for the first point $p \in P$, we use $r_i$ to compute the $i$-th coordinate of $\varphi(p')$ for every point $p' \in P$, i.e., we choose $d(\varepsilon, \sigma, \delta)$ random vectors in total. Recall that we denote the $i$-th coordinate of a point $p$ by $p^{(i)}$. We define
\[ Y_i(p) := \sum_{k=1}^{d} r_i^{(k)} \cdot p^{(k)} . \]
To compute $\varphi(p) = 1/\sqrt{d(\varepsilon, \sigma, \delta)} \cdot (Y_1(p), \dots, Y_{d(\varepsilon, \sigma, \delta)}(p))^T$, we have to keep the value $z$ and have to maintain the sum $Y_i(p)$ for each coordinate $i \in \{1, \dots, d(\varepsilon, \sigma, \delta)\}$. Recall that the bits of each $r_i = v_z$ can be generated from $z$ in $O(\log(d))$ space, using a constant number of arithmetic and finite field operations on elements of $O(\log(d))$ bits. Thus, the embedding of one $d$-dimensional input point requires $O(\log(d) \cdot d(\varepsilon, \sigma, \delta))$ space and $O(d \cdot \log^2(d) \cdot d(\varepsilon, \sigma, \delta))$ time. Furthermore, if $A$ denotes the $d(\varepsilon, \sigma, \delta) \times d$ matrix whose rows are the vectors $r_1, \dots, r_{d(\varepsilon, \sigma, \delta)}$, then we can write $\varphi(p) = 1/\sqrt{d(\varepsilon, \sigma, \delta)} \cdot A p$. Hence, $\varphi$ is a linear function, i.e., $\varphi(p - q) = \varphi(p) - \varphi(q)$ for all pairs $(p, q) \in \mathbb{R}^d \times \mathbb{R}^d$.

Let $Y(\nu) := \|\varphi(\nu)\|^2$ be the random variable for the squared length of $\varphi(\nu)$. Due to our definition of $\varphi(\nu)$, we have
\[ Y(\nu) = \sum_{i=1}^{d(\varepsilon, \sigma, \delta)} \left( \frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot Y_i(\nu) \right)^2 . \]
Next, we show that the expected value of $Y(\nu)$ is $\|\nu\|^2$ and, by bounding the variance of $Y(\nu)$, that $Y(\nu)$ is sharply concentrated.
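Before the analysis, the linear map defined above can be illustrated with a small sketch. It is not the space-efficient construction of the proof: as a simplifying assumption, the sign vectors are drawn with fully independent entries from Python's pseudo-random generator, which is stronger than the four-wise independence obtained from BCH codes and needs far more memory. All function names are illustrative.

```python
import math
import random


def target_dimension(eps, sigma, delta):
    """d(eps, sigma, delta) := 2 / (eps^2 * sigma * delta), rounded up."""
    return math.ceil(2.0 / (eps ** 2 * sigma * delta))


def make_sign_vectors(k, d, seed=0):
    """k random vectors r_1, ..., r_k of length d with entries +1 or -1.
    ASSUMPTION: fully independent entries, not the BCH-based four-wise
    independent family used in the proof."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(k)]


def embed(point, sign_vectors):
    """phi(p) = (1 / sqrt(k)) * (Y_1(p), ..., Y_k(p)), where
    Y_i(p) is the inner product of r_i and p."""
    k = len(sign_vectors)
    scale = 1.0 / math.sqrt(k)
    return [scale * sum(r * x for r, x in zip(row, point))
            for row in sign_vectors]


if __name__ == "__main__":
    d = 50
    k = target_dimension(0.5, 0.5, 0.5)
    vectors = make_sign_vectors(k, d, seed=42)
    p = [float(i % 7) for i in range(d)]
    q = [float(i % 3) for i in range(d)]
    true_dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    emb_dist = math.sqrt(sum((a - b) ** 2
                             for a, b in zip(embed(p, vectors), embed(q, vectors))))
    print(k, true_dist, emb_dist)
```

The same `sign_vectors` must be reused for every point, mirroring the remark that the vectors $r_i$ are chosen once for the whole point set.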
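Because the random variables $r_i^{(k)}$ are pairwise independent and $\mathrm{E}[r_i^{(k)}] = 0$ for all pairs $(i, k) \in \{1, \dots, d(\varepsilon, \sigma, \delta)\} \times \{1, \dots, d\}$, we have
\[ \begin{aligned} \mathrm{E}\left[ \left( \frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot Y_i(\nu) \right)^2 \right] &= \mathrm{E}\left[ \left( \frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot \sum_{k=1}^{d} r_i^{(k)} \cdot \nu^{(k)} \right)^2 \right] \\ &= \sum_{k=1}^{d} \frac{1}{d(\varepsilon, \sigma, \delta)} \cdot \mathrm{E}\left[ \left( r_i^{(k)} \right)^2 \right] \cdot \left( \nu^{(k)} \right)^2 + \sum_{1 \le k < \ell \le d} \frac{2}{d(\varepsilon, \sigma, \delta)} \cdot \mathrm{E}\left[ r_i^{(k)} \right] \cdot \mathrm{E}\left[ r_i^{(\ell)} \right] \cdot \nu^{(k)} \cdot \nu^{(\ell)} \\ &= \sum_{k=1}^{d} \frac{1}{d(\varepsilon, \sigma, \delta)} \cdot \left( \nu^{(k)} \right)^2 = \frac{1}{d(\varepsilon, \sigma, \delta)} \cdot \|\nu\|^2 . \end{aligned} \]
Due to linearity of expectation, it follows that
\[ \mathrm{E}[Y(\nu)] = \mathrm{E}\left[ \sum_{i=1}^{d(\varepsilon, \sigma, \delta)} \left( \frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot Y_i(\nu) \right)^2 \right] = \|\nu\|^2 . \]
Since the variables $r_i^{(k)}$ are four-wise independent, we have
\[ \mathrm{E}\left[ \left( \frac{Y_i(\nu)}{\sqrt{d(\varepsilon, \sigma, \delta)}} \right)^4 \right] = \sum_{k=1}^{d} \frac{1}{d(\varepsilon, \sigma, \delta)^2} \cdot \left( \nu^{(k)} \right)^4 + \sum_{1 \le k < \ell \le d} \frac{6}{d(\varepsilon, \sigma, \delta)^2} \cdot \left( \nu^{(k)} \right)^2 \cdot \left( \nu^{(\ell)} \right)^2 . \]
Furthermore, we obtain
\[ \begin{aligned} \mathrm{E}\left[ \left( \frac{Y_i(\nu)}{\sqrt{d(\varepsilon, \sigma, \delta)}} \right)^2 \right]^2 &= \left( \sum_{k=1}^{d} \frac{1}{d(\varepsilon, \sigma, \delta)} \cdot \left( \nu^{(k)} \right)^2 \right)^2 \\ &= \sum_{k=1}^{d} \frac{1}{d(\varepsilon, \sigma, \delta)^2} \cdot \left( \nu^{(k)} \right)^4 + \sum_{1 \le k < \ell \le d} \frac{2}{d(\varepsilon, \sigma, \delta)^2} \cdot \left( \nu^{(k)} \right)^2 \cdot \left( \nu^{(\ell)} \right)^2 . \end{aligned} \]
It follows that
\[ \begin{aligned} \mathrm{V}\left[ \left( \frac{Y_i(\nu)}{\sqrt{d(\varepsilon, \sigma, \delta)}} \right)^2 \right] &= \mathrm{E}\left[ \left( \frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot Y_i(\nu) \right)^4 \right] - \mathrm{E}\left[ \left( \frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot Y_i(\nu) \right)^2 \right]^2 \\ &= \sum_{1 \le k < \ell \le d} \frac{4}{d(\varepsilon, \sigma, \delta)^2} \cdot \left( \nu^{(k)} \right)^2 \cdot \left( \nu^{(\ell)} \right)^2 \le 2 \cdot \mathrm{E}\left[ \left( \frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot Y_i(\nu) \right)^2 \right]^2 . \end{aligned} \]
Now, we can upper bound the variance of $Y(\nu)$ by
\[ \begin{aligned} \mathrm{V}[Y(\nu)] &= \mathrm{V}\left[ \sum_{i=1}^{d(\varepsilon, \sigma, \delta)} \left( \frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot Y_i(\nu) \right)^2 \right] = d(\varepsilon, \sigma, \delta) \cdot \mathrm{V}\left[ \left( \frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot Y_i(\nu) \right)^2 \right] \\ &\le d(\varepsilon, \sigma, \delta) \cdot 2 \cdot \mathrm{E}\left[ \left( \frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot Y_i(\nu) \right)^2 \right]^2 \le \frac{2 \cdot \|\nu\|^4}{d(\varepsilon, \sigma, \delta)} . \end{aligned} \]
Hence, by Chebyshev's inequality and the definition of $d(\varepsilon, \sigma, \delta)$, we get
\[ \Pr\left[ |Y(\nu) - \mathrm{E}[Y(\nu)]| > \varepsilon \cdot \|\nu\|^2 \right] \le \frac{\mathrm{V}[Y(\nu)]}{\varepsilon^2 \cdot \|\nu\|^4} = \frac{2}{d(\varepsilon, \sigma, \delta) \cdot \varepsilon^2} = \sigma\delta . \]
Let $Z_\ell$ be the indicator random variable for the event that the distance between the $\ell$-th pair of points is not $(1 \pm \varepsilon)$-approximated. By the above, we get $\mathrm{E}[Z_\ell] \le \sigma\delta$.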
Then, by Markov's inequality, we have
\[ \Pr\left[ \sum_{\ell=1}^{n^2} Z_\ell \ge \sigma \cdot n^2 \right] \le \frac{\mathrm{E}\left[ \sum_{\ell=1}^{n^2} Z_\ell \right]}{\sigma n^2} \le \frac{\sigma \delta n^2}{\sigma n^2} = \delta . \]

By combining the above result with the results from Sections 7.2 and 8.2, we obtain the following theorem:

Theorem 11. Given a stream of Insert and Delete operations of points from a discrete Euclidean space $\{1, \dots, \Delta\}^d$, a precision parameter $\varepsilon$, $0 < \varepsilon < 1$, a slack parameter $\sigma$, $1/o(n) < \sigma < 1$, and an error probability parameter $\delta$, $0 < \delta < 1/2$, there is a randomized streaming algorithm that computes with probability $1 - \delta$, for the current point set $P$ of size $n$, a point set $P'$ from a discrete Euclidean space $\{1, \dots, \Delta'\}^{d'}$ with spread $\Delta' \in O(\sqrt{dn}\,\Delta/(\varepsilon^2 \sqrt{\sigma\delta}))$ and dimension $d' \in \Theta(1/(\varepsilon^2 \sigma \delta))$ and of size
\[ O\left( \left( \frac{1}{\varepsilon \sigma \delta} \right)^{O(1/(\varepsilon^2 \sigma \delta))} \cdot \log(dn\Delta) \right) \]
such that $P$ embeds into $P'$ with distortion $1 + \varepsilon$ and slack $\sigma$. The algorithm has an update time of
\[ O\left( \frac{d \cdot \log^2(d)}{\varepsilon^2 \sigma \delta} + \left( \frac{1}{\varepsilon \sigma \delta} \right)^{O(1/(\varepsilon^2 \sigma \delta))} \cdot \log^4(dn\Delta) \right) \]
and a space requirement of
\[ O\left( \frac{\log(d)}{\varepsilon^2 \sigma \delta} + \left( \frac{1}{\varepsilon \sigma \delta} \right)^{O(1/(\varepsilon^2 \sigma \delta))} \cdot \log^4(dn\Delta) \right) . \]

Proof. We combine the embedding from Theorem 10 with the construction described in Sections 7.2 and 8.2. At first, we embed the discrete high-dimensional Euclidean point set $P$ into a low-dimensional Euclidean space. Then, we impose an appropriately fine grid on the target space and move each embedded point to its nearest grid point. This technique is sometimes called snap rounding. It follows that, by appropriate scaling of the point space, the resulting point set is from a discrete low-dimensional Euclidean space. On this point set, we apply the construction described in Sections 7.2 and 8.2. Now, we explain our approach in more detail.
By applying the techniques given in the proof of Theorem 10 with a precision parameter $\varepsilon' := \varepsilon/18$, a slack parameter $\sigma' := \sigma/2$, and an error probability parameter $\delta' := \delta/3$, we get an embedding $\varphi : P \to \mathbb{R}^{d(\varepsilon', \sigma', \delta')}$ with $d(\varepsilon', \sigma', \delta') \in \Theta(1/(\varepsilon^2 \sigma \delta))$ such that
\[ \left( 1 - \frac{\varepsilon}{18} \right) \cdot D(p, q) \le D(\varphi(p), \varphi(q)) \le \left( 1 + \frac{\varepsilon}{18} \right) \cdot D(p, q) \qquad (8.2) \]
is true for at least $(1 - \sigma/2) \cdot n^2$ pairs of points $(p, q) \in P \times P$ with probability at least $1 - \delta/3$.

Furthermore, we can upper bound the maximum distance between two embedded points as follows. Let $p$ and $q$ be any two points from $P$. We define $\nu := p - q$ and $Y(\nu) := \|\varphi(\nu)\|^2$. Then, as explained in the proof of Theorem 10, the expected value of $Y(\nu)$ is $\mathrm{E}[Y(\nu)] = \|\nu\|^2$, and we can upper bound the variance of $Y(\nu)$ by
\[ \mathrm{V}[Y(\nu)] \le \frac{2 \cdot \|\nu\|^4}{d(\varepsilon', \sigma', \delta')} . \]
Thus, by Chebyshev's inequality, we get
\[ \Pr\left[ |Y(\nu) - \mathrm{E}[Y(\nu)]| > n \cdot \|\nu\|^2 \right] \le \frac{\mathrm{V}[Y(\nu)]}{n^2 \cdot \|\nu\|^4} \le \frac{\delta}{3n^2} . \]
Due to the union bound and $\|\nu\|^2 \le d\Delta^2$, all squared pairwise distances of the embedded points are at most $O(n \cdot d\Delta^2)$ with probability at least $1 - \delta/3$. It follows that the diameter of the embedded point set is $O(\sqrt{dn}\,\Delta)$ with probability at least $1 - \delta/3$.

Next, we apply the snap-rounding technique. More precisely, we impose a square grid on the target space $\mathbb{R}^{d(\varepsilon', \sigma', \delta')}$, where each cell has side length $\varepsilon/(18\sqrt{d(\varepsilon', \sigma', \delta')})$, and move each embedded point to its nearest grid point. Each point is moved by a distance of at most
\[ \frac{\varepsilon}{18\sqrt{d(\varepsilon', \sigma', \delta')}} \cdot \frac{\sqrt{d(\varepsilon', \sigma', \delta')}}{2} = \frac{\varepsilon}{36} . \]
Thus, by moving each point to its nearest grid point, the distance between any two points is decreased or increased by at most $\varepsilon/18$. Let $\varphi'(p)$ be the position of an embedded and moved point $p \in P$.
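The snap-rounding step can be sketched as follows. This is a minimal illustration, assuming a grid whose points are the integer multiples of the cell side in every coordinate (the proof does not fix the grid origin); the function names are illustrative.

```python
import math


def cell_side(eps, dim):
    """Side length eps / (18 * sqrt(dim)) of a grid cell in the target space
    of dimension dim = d(eps', sigma', delta')."""
    return eps / (18.0 * math.sqrt(dim))


def snap_to_grid(point, cell):
    """Move each coordinate to the nearest multiple of `cell`.
    ASSUMPTION: the grid is aligned with the origin."""
    return [cell * round(x / cell) for x in point]


def displacement(point, snapped):
    """Euclidean distance between a point and its snapped position."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point, snapped)))


if __name__ == "__main__":
    eps, dim = 0.5, 16
    cell = cell_side(eps, dim)
    p = [0.1234, -2.5, 3.14159, 0.0] * 4
    moved = snap_to_grid(p, cell)
    # Each coordinate moves by at most cell / 2, so the point moves by at
    # most cell * sqrt(dim) / 2 = eps / 36.
    print(displacement(p, moved) <= eps / 36)
```

Dividing the snapped coordinates by the cell side then yields integer coordinates, which corresponds to the scaling by $18\sqrt{d(\varepsilon', \sigma', \delta')}/\varepsilon$ used in the proof.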
Since the minimum pairwise distance between distinct points in $P$ is 1 and Inequality (8.2) is true for at least $(1 - \sigma/2) \cdot n^2$ pairs of points $(p, q) \in P \times P$ with probability at least $1 - \delta/3$, we have that
\[ \left( 1 - \frac{\varepsilon}{9} \right) \cdot D(p, q) \le D(\varphi'(p), \varphi'(q)) \le \left( 1 + \frac{\varepsilon}{9} \right) \cdot D(p, q) \]
is true for at least $(1 - \sigma/2) \cdot n^2$ pairs of points $(p, q) \in P \times P$ with probability at least $1 - \delta/3$. It follows that the embedding $\varphi'$ has distortion $(1 + \varepsilon/9)/(1 - \varepsilon/9) \le 1 + \varepsilon/3$ and slack $\sigma/2$ with probability at least $1 - \delta/3$. Furthermore, each point lies on a grid with cell size $\varepsilon/(18\sqrt{d(\varepsilon', \sigma', \delta')})$ and the maximum pairwise distance of points is $O(\sqrt{dn}\,\Delta)$ with probability at least $1 - \delta/3$. Hence, by scaling the point space by $18\sqrt{d(\varepsilon', \sigma', \delta')}/\varepsilon$, we get a set of points from a discrete low-dimensional space $\{1, \dots, \Delta'\}^{d'}$ with spread $\Delta' \in O(\sqrt{dn}\,\Delta/(\varepsilon^2 \sqrt{\sigma\delta}))$ and dimension $d' \in O(1/(\varepsilon^2 \sigma \delta))$.

On the obtained point set, we run our construction from Sections 7.2 and 8.2 with a precision parameter $\varepsilon'' := \varepsilon/3$, a slack parameter $\sigma'' := \sigma/2$, and an error probability parameter $\delta'' := \delta/3$. Then, with a total error probability of $\delta$, the resulting point set $P'$ embeds $P$ with distortion $(1 + \varepsilon/3) \cdot (1 + \varepsilon/3) \le 1 + \varepsilon$ and slack $\sigma$. It follows from the above and Theorem 7 that we also have $P' \subset \{1, \dots, \Delta'\}^{d'}$ with spread $\Delta' \in O(\sqrt{dn}\,\Delta/(\varepsilon^2 \sqrt{\sigma\delta}))$ and dimension $d' \in O(1/(\varepsilon^2 \sigma \delta))$.

As explained before in the proof of Theorem 9, we have to ensure that $\sigma'' > 2^{2d}/n$ (cf. Theorem 7) since we use the construction given in Section 7.2. However, this is implicitly required by the fact that the space requirement of a streaming algorithm has to be sublinear in $n$ and the space requirement of our streaming algorithm is $\omega(1/\sigma)$.

Finally, we analyze the complexity of our construction. Due to Theorem 10, each point in $P$ can be embedded into the low-dimensional space $\mathbb{R}^{d(\varepsilon', \sigma', \delta')}$ in $O(d \cdot \log^2(d)/(\varepsilon^2 \sigma \delta))$ time using $O(\log(d)/(\varepsilon^2 \sigma \delta))$ space.
Due to Theorems 7 and 9, the construction from Sections 7.2 and 8.2 applied on a set of points with dimension $O(1/(\varepsilon^2 \sigma \delta))$ and spread $O(\sqrt{dn}\,\Delta/(\varepsilon^2 \sqrt{\sigma\delta}))$ has both an update time and a space requirement of
\[ O\left( \left( \frac{1}{\varepsilon \sigma \delta} \right)^{O(1/(\varepsilon^2 \sigma \delta))} \cdot \log^4(dn\Delta) \right) . \]
The size of the set of representatives follows from Lemma 7.2.7.

8.3 Max-Cut in High Dimensions

In this section, we show how to embed a set of high-dimensional Euclidean points into a low-dimensional Euclidean space such that the sum of the pairwise distances is well preserved. Afterwards, we use this result to design a streaming algorithm that implicitly computes a $(1 \pm \varepsilon)$-approximation of the max-cut problem for a dynamic data stream of high-dimensional Euclidean points.

Let $\varphi : P \to \mathbb{R}^{d(\varepsilon, \delta)}$ be the Johnson-Lindenstrauss embedding where each point is mapped into a Euclidean space with dimension $d(\varepsilon, \delta) \in \Theta(1/(\varepsilon^2 \delta^2))$. Then, we will show that, for a pair of points $(p, q) \in P \times P$, the expected value of $|D(\varphi(p), \varphi(q)) - D(p, q)|$ is at most $\varepsilon\delta \cdot D(p, q)$ and $|D(\varphi(p), \varphi(q)) - D(p, q)|$ is sharply concentrated around its expected value with probability $1 - \delta$. This leads to the following lemma:

Lemma 8.3.1. Let $\varepsilon$, $0 < \varepsilon < 1$, be a precision parameter, let $\delta$, $0 < \delta < 1$, be an error probability parameter, and let $d(\varepsilon, \delta) := 50/(\varepsilon^2 \delta^2)$ be a function dependent on $\varepsilon$ and $\delta$. Given a set $P$ of $n$ points in $\mathbb{R}^d$, there exists an embedding $\varphi : P \to \mathbb{R}^{d(\varepsilon, \delta)}$ such that
\[ \sum_{(p, q) \in P \times P} |D(\varphi(p), \varphi(q)) - D(p, q)| \le \varepsilon \cdot \sum_{(p, q) \in P \times P} D(p, q) \]
is true with probability at least $1 - \delta$. Each point can be embedded in $O(d \cdot \log^2(d)/(\varepsilon^2 \delta^2))$ time using $O(\log(d)/(\varepsilon^2 \delta^2))$ space.

Proof. For each point $p \in P$ and each coordinate $i \in \{1, \dots, d(\varepsilon, \delta)\}$, we compute a random variable $Y_i(p)$ as explained in the proof of Theorem 10. We define the embedding $\varphi$ for the point $p$ by
\[ \varphi(p) := \frac{1}{\sqrt{d(\varepsilon, \delta)}} \cdot \left( Y_1(p), \dots, Y_{d(\varepsilon, \delta)}(p) \right)^T . \]
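Following the construction in the proof of Theorem 10, each point can be embedded using a space of $O(\log(d)/(\varepsilon^2 \delta^2))$ and by performing $O(d/(\varepsilon^2 \delta^2))$ arithmetic and finite field operations on elements of $O(\log(d))$ bits. Furthermore, since $\varphi$ is a linear function, we have $\varphi(p - q) = \varphi(p) - \varphi(q)$ for all pairs $(p, q) \in \mathbb{R}^d \times \mathbb{R}^d$.

Now, let $p$ and $q$ be any two points in $\mathbb{R}^d$. We define $\nu := p - q$ and $Y(\nu) := \|\varphi(\nu)\|^2$ to be the random variable for the squared length of $\varphi(\nu)$. Then, as explained in the proof of Theorem 10, the expected value of $Y(\nu)$ is $\mathrm{E}[Y(\nu)] = \|\nu\|^2$, and we can upper bound the variance of $Y(\nu)$ by
\[ \mathrm{V}[Y(\nu)] \le \frac{2 \cdot \|\nu\|^4}{d(\varepsilon, \delta)} . \]
Let $\mathrm{err}(p, q)$ be the error that occurs due to the estimation of $\|\nu\| = \|p - q\|$, i.e.,
\[ \mathrm{err}(p, q) := \left| \sqrt{Y(\nu)} - \|\nu\| \right| . \]
The expected value of $\mathrm{err}(p, q)$ is bounded by
\[ \begin{aligned} \mathrm{E}[\mathrm{err}(p, q)] &\le \frac{\varepsilon\delta}{5} \cdot \|\nu\| + \sum_{i=0}^{\infty} \Pr\left[ \frac{\varepsilon\delta}{5} \cdot 2^i \cdot \|\nu\| < \left| \sqrt{Y(\nu)} - \|\nu\| \right| \le \frac{\varepsilon\delta}{5} \cdot 2^{i+1} \cdot \|\nu\| \right] \cdot \frac{\varepsilon\delta}{5} \cdot 2^{i+1} \cdot \|\nu\| \\ &\le \frac{\varepsilon\delta}{5} \cdot \|\nu\| + \sum_{i=0}^{\infty} \Pr\left[ \left| \sqrt{Y(\nu)} - \|\nu\| \right| > \frac{\varepsilon\delta}{5} \cdot 2^i \cdot \|\nu\| \right] \cdot \frac{\varepsilon\delta}{5} \cdot 2^{i+1} \cdot \|\nu\| . \end{aligned} \]
It follows that, in order to upper bound the expected value of $\mathrm{err}(p, q)$, we have to upper bound the probability that $\mathrm{err}(p, q) > \varepsilon\delta/5 \cdot 2^i \cdot \|\nu\|$ for each $i \in \mathbb{N}_0$. Let $\ell$ be any fixed power of 2. Then, for $0 \le \varepsilon\delta\ell/5 \le 1$, we get
\[ \begin{aligned} \Pr\left[ \left| \sqrt{Y(\nu)} - \|\nu\| \right| > \frac{\varepsilon\delta\ell}{5} \cdot \|\nu\| \right] &= \Pr\left[ \sqrt{Y(\nu)} < \left( 1 - \frac{\varepsilon\delta\ell}{5} \right) \cdot \|\nu\| \ \text{or} \ \sqrt{Y(\nu)} > \left( 1 + \frac{\varepsilon\delta\ell}{5} \right) \cdot \|\nu\| \right] \\ &= \Pr\left[ Y(\nu) < \left( 1 - \frac{\varepsilon\delta\ell}{5} \right)^2 \cdot \|\nu\|^2 \ \text{or} \ Y(\nu) > \left( 1 + \frac{\varepsilon\delta\ell}{5} \right)^2 \cdot \|\nu\|^2 \right] \\ &\le \Pr\left[ Y(\nu) < \left( 1 - \frac{\varepsilon\delta\ell}{5} \right) \cdot \|\nu\|^2 \ \text{or} \ Y(\nu) > \left( 1 + \frac{\varepsilon\delta\ell}{5} \right) \cdot \|\nu\|^2 \right] \\ &= \Pr\left[ \left| Y(\nu) - \|\nu\|^2 \right| > \frac{\varepsilon\delta\ell}{5} \cdot \|\nu\|^2 \right] . \end{aligned} \]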
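Similarly, for $\varepsilon\delta\ell/5 > 1$, we get
\[ \begin{aligned} \Pr\left[ \left| \sqrt{Y(\nu)} - \|\nu\| \right| > \frac{\varepsilon\delta\ell}{5} \cdot \|\nu\| \right] &= \Pr\left[ \sqrt{Y(\nu)} > \left( 1 + \frac{\varepsilon\delta\ell}{5} \right) \cdot \|\nu\| \right] \\ &= \Pr\left[ Y(\nu) > \left( 1 + \frac{\varepsilon\delta\ell}{5} \right)^2 \cdot \|\nu\|^2 \right] \\ &\le \Pr\left[ Y(\nu) > \left( 1 + \frac{\varepsilon\delta\ell}{5} \right) \cdot \|\nu\|^2 \right] \\ &= \Pr\left[ Y(\nu) - \|\nu\|^2 > \frac{\varepsilon\delta\ell}{5} \cdot \|\nu\|^2 \right] \\ &\le \Pr\left[ \left| Y(\nu) - \|\nu\|^2 \right| > \frac{\varepsilon\delta\ell}{5} \cdot \|\nu\|^2 \right] , \end{aligned} \]
where the first equality follows from the fact that the case $\sqrt{Y(\nu)} < (1 - \varepsilon\delta\ell/5) \cdot \|\nu\| < 0$ cannot occur. Thus, for any value $\varepsilon\delta\ell/5 \in \mathbb{R}$, we have
\[ \Pr\left[ \left| \sqrt{Y(\nu)} - \|\nu\| \right| > \frac{\varepsilon\delta\ell}{5} \cdot \|\nu\| \right] \le \Pr\left[ \left| Y(\nu) - \|\nu\|^2 \right| > \frac{\varepsilon\delta\ell}{5} \cdot \|\nu\|^2 \right] . \]
By Chebyshev's inequality, we can upper bound this probability by
\[ \Pr\left[ \left| Y(\nu) - \|\nu\|^2 \right| > \frac{\varepsilon\delta\ell}{5} \cdot \|\nu\|^2 \right] \le \frac{25 \cdot \mathrm{V}[Y(\nu)]}{\varepsilon^2 \delta^2 \ell^2 \|\nu\|^4} \le \frac{50}{d(\varepsilon, \delta) \cdot \varepsilon^2 \delta^2 \ell^2} = \frac{1}{\ell^2} . \]
Now, the expected value of $\mathrm{err}(p, q)$ can be upper bounded by
\[ \begin{aligned} \mathrm{E}[\mathrm{err}(p, q)] &\le \frac{\varepsilon\delta}{5} \cdot \|\nu\| + \sum_{i=0}^{\infty} \Pr\left[ \left| \sqrt{Y(\nu)} - \|\nu\| \right| > \frac{\varepsilon\delta}{5} \cdot 2^i \cdot \|\nu\| \right] \cdot \frac{\varepsilon\delta}{5} \cdot 2^{i+1} \cdot \|\nu\| \\ &\le \frac{\varepsilon\delta}{5} \cdot \|\nu\| + \sum_{i=0}^{\infty} \frac{1}{2^{2i}} \cdot \frac{\varepsilon\delta}{5} \cdot 2^{i+1} \cdot \|\nu\| \\ &= \frac{\varepsilon\delta}{5} \cdot \|\nu\| + 2 \cdot \sum_{i=0}^{\infty} \frac{1}{2^i} \cdot \frac{\varepsilon\delta}{5} \cdot \|\nu\| \le \varepsilon\delta \|\nu\| . \end{aligned} \]
Due to Markov's inequality, it follows that
\[ \Pr\left[ \sum_{p \in P} \sum_{q \in P} \mathrm{err}(p, q) \ge \frac{1}{\delta} \cdot \mathrm{E}\left[ \sum_{p \in P} \sum_{q \in P} \mathrm{err}(p, q) \right] \right] \le \delta . \]
Due to linearity of expectation, we have
\[ \begin{aligned} \Pr\left[ \sum_{p \in P} \sum_{q \in P} \mathrm{err}(p, q) \le \varepsilon \cdot \sum_{p \in P} \sum_{q \in P} \|p - q\| \right] &\ge \Pr\left[ \sum_{p \in P} \sum_{q \in P} \mathrm{err}(p, q) \le \frac{1}{\delta} \cdot \sum_{p \in P} \sum_{q \in P} \mathrm{E}[\mathrm{err}(p, q)] \right] \\ &= \Pr\left[ \sum_{p \in P} \sum_{q \in P} \mathrm{err}(p, q) \le \frac{1}{\delta} \cdot \mathrm{E}\left[ \sum_{p \in P} \sum_{q \in P} \mathrm{err}(p, q) \right] \right] \ge 1 - \delta . \end{aligned} \]

Given any Euclidean point set $P$, the embedding described above is useful for all geometric problems that satisfy the following four properties:

(i) The cost of an optimal solution for $P$ is a function whose set of input parameters is a subset of all pairwise distances of $P$.

(ii) The cost of an optimal solution for $P$ is at least $\sum_{p \in P} \sum_{q \in P} 1/c \cdot D(p, q)$, where $c \ge 1$ is any small constant.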
(iii) If the distance $D(p, q)$ between any two points $p, q \in P$ is increased or decreased by any value $\alpha > 0$, the cost of an optimal solution for $P$ is increased or decreased by at most $O(\alpha)$.

(iv) The complexity of all known $(1 \pm \varepsilon)$-approximation algorithms depends exponentially on the dimension of $P$.

To handle these problems, we first embed the input points and afterwards apply any efficient $(1 \pm \varepsilon)$-approximation algorithm on the embedded points. One suitable problem is the max-cut problem in the dynamic geometric data stream model.

Definition 8.3.2 (Euclidean Max-Cut Problem). For a set $P \subset \mathbb{R}^d$, the Euclidean max-cut problem is to find a partition of $P$ into two subsets $C_1$ and $C_2$ such that the sum
\[ \mathrm{Cut}(P, C_1, C_2) := \sum_{(p, q) \in C_1 \times C_2} D(p, q) \]
of inter-cluster distances is maximized.

Obviously, the max-cut problem satisfies Properties (i) and (iii). Furthermore, it is shown in [44] that Property (ii) is satisfied for $c = 4$. Concerning Property (iv), the authors of [44] gave an efficient $(1 \pm \varepsilon)$-approximation for the max-cut problem in low dimensions that has the following properties:

Lemma 8.3.3 ([44]). Let $\varepsilon$, $0 < \varepsilon < 1$, be a precision parameter. Given a stream of $m$ Insert and Delete operations of points from a discrete Euclidean space $\{1, \dots, \Delta\}^d$, where $d$ is a constant, there exists a streaming algorithm that computes with probability at least $2/3$, for the current point set $P$ with cardinality $n$, a data structure of size $O(\log^3(\Delta m) \cdot \log^4(\Delta)/\varepsilon^{2d+4})$ from which an implicit $(1 \pm \varepsilon)$-approximate solution for the max-cut problem can be extracted in $\mathrm{poly}(\exp(1/\varepsilon)^{O(1)}, (1/\varepsilon)^d, \log(\Delta), \log(n), \log(m))$ time. An update can be processed in $O(\log^2(\Delta) \cdot \log(\Delta m))$ time.

By combining the embedding given in Lemma 8.3.1 with the approximation algorithm presented in [44], we can implicitly compute a $(1 \pm \varepsilon)$-approximation for the max-cut problem on dynamic geometric data streams of high-dimensional points.

Theorem 12.
Let $\varepsilon$, $0 < \varepsilon < 1$, be a precision parameter. Given a stream of $m$ Insert and Delete operations of points from a discrete high-dimensional Euclidean space $\{1,\dots,\Delta\}^d$, there is a randomized streaming algorithm that has a space requirement of $O(\log^7(d\Delta mn)/\varepsilon^{O(1/\varepsilon^2)})$ and computes with probability at least $5/8$, for the current point set $P$ of size $n$, a data structure from which an implicit $(1\pm\varepsilon)$-approximation for the max-cut problem can be extracted in $\mathrm{poly}(\exp(1/\varepsilon)^{O(1)}, (1/\varepsilon)^{1/\varepsilon^2}, \log(d), \log(\Delta), \log(n), \log(m))$ time. An update requires $O(d\cdot\log^2(d)/\varepsilon^2 + \log^3(d\Delta nm/\varepsilon))$ time.

Proof. We proceed in a similar way as in the proof of Theorem 11. At first, we embed the discrete high-dimensional Euclidean point set $P$ into a low-dimensional Euclidean space. This embedding induces a small multiplicative error on the cost of a maximum cut. Then, we apply the snap-rounding technique, i.e., we impose an appropriately fine grid on the target space and move each embedded point to its nearest grid point. This movement of the points induces an additive error, which can be charged against a lower bound on the cost of a maximum cut for $P$ to get a small multiplicative error. Finally, by applying the techniques described in [44] on the embedded and moved points, we obtain the results stated in the theorem. Next, we explain our construction in more detail.

In the first step, we apply the embedding $\varphi : P \to P'$ given in Lemma 8.3.1 with precision parameter $\varepsilon' := \varepsilon/16$ and error probability parameter $\delta' := 1/24$ on $P$. Then, we have that
\[
\sum_{(p,q)\in P\times P} \big|D(\varphi(p),\varphi(q)) - D(p,q)\big| \le \varepsilon'\cdot\sum_{(p,q)\in P\times P} D(p,q)
\]
is true with probability at least $1-\delta'$. Since Property (ii) (on page 160) is satisfied for $c = 4$ [44], we have $\mathrm{MaxCut}(P) \ge 1/4\cdot\sum_{(p,q)\in P\times P} D(p,q)$.
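Since the pairs $(p,q) \in C_1\times C_2$ of each cut $(C_1, C_2)$ of $P$ form a subset of $P\times P$, we obtain
\[
\begin{aligned}
\Big|\sum_{(p,q)\in C_1\times C_2} D(\varphi(p),\varphi(q)) - \sum_{(p,q)\in C_1\times C_2} D(p,q)\Big|
&\le \sum_{(p,q)\in C_1\times C_2} \big|D(\varphi(p),\varphi(q)) - D(p,q)\big| \\
&\le \sum_{(p,q)\in P\times P} \big|D(\varphi(p),\varphi(q)) - D(p,q)\big| \\
&\le \varepsilon'\cdot\sum_{(p,q)\in P\times P} D(p,q) \\
&\le 4\varepsilon'\cdot\mathrm{MaxCut}(P) = \frac{\varepsilon}{4}\cdot\mathrm{MaxCut}(P)
\end{aligned}
\]
for all cuts $(C_1, C_2)$ of $P$ with probability $23/24$. Let $(C_1', C_2')$ be a maximum cut of $P$, and let $(C_1'', C_2'')$ be any cut of $P$ such that the embedded point sets of $C_1''$ and $C_2''$ form a maximum cut of $P'$. It follows from the above that
\[
\begin{aligned}
\sum_{(p,q)\in C_1''\times C_2''} D(\varphi(p),\varphi(q))
&\ge \sum_{(p,q)\in C_1'\times C_2'} D(\varphi(p),\varphi(q)) \\
&\ge \sum_{(p,q)\in C_1'\times C_2'} D(p,q) - \frac{\varepsilon}{4}\cdot\mathrm{MaxCut}(P)
= \Big(1-\frac{\varepsilon}{4}\Big)\cdot\mathrm{MaxCut}(P)
\end{aligned}
\]
and
\[
\sum_{(p,q)\in C_1''\times C_2''} D(\varphi(p),\varphi(q))
\le \sum_{(p,q)\in C_1''\times C_2''} D(p,q) + \frac{\varepsilon}{4}\cdot\mathrm{MaxCut}(P)
\le \Big(1+\frac{\varepsilon}{4}\Big)\cdot\mathrm{MaxCut}(P).
\]
Thus, we have
\[
\Big(1-\frac{\varepsilon}{4}\Big)\cdot\mathrm{MaxCut}(P) \le \mathrm{MaxCut}(P') \le \Big(1+\frac{\varepsilon}{4}\Big)\cdot\mathrm{MaxCut}(P) \tag{8.3}
\]
with probability at least $23/24$.

In the second step, we apply the snap-rounding technique. We impose a square grid on the target space $\mathbb{R}^{d(\varepsilon',\delta')}$ with $d(\varepsilon',\delta') \in \Theta(1/(\varepsilon^2\delta^2))$, where each cell has side length $\varepsilon/(16\cdot\sqrt{d(\varepsilon',\delta')})$, and move each point in $P'$ to its nearest grid point. Let $P''$ be the set of points that we obtain after moving the points in $P'$. Each point is moved by a distance of at most
\[
\frac{\varepsilon}{16\cdot\sqrt{d(\varepsilon',\delta')}}\cdot\frac{\sqrt{d(\varepsilon',\delta')}}{2} = \frac{\varepsilon}{32}.
\]
Thus, the movement of the points induces an additive error of at most $\varepsilon n^2/16$ on the sum of the pairwise distances. Since Property (ii) (on page 160) is satisfied for $c = 4$ [44] and the minimum pairwise distance of $P$ is 1, a lower bound on the cost of a maximum cut for $P$ is $n^2/4$. Hence, we have $\varepsilon n^2/16 \le \varepsilon/4\cdot\mathrm{MaxCut}(P)$. Due to Inequality (8.3), we get
\[
\Big(1-\frac{\varepsilon}{2}\Big)\cdot\mathrm{MaxCut}(P) \le \mathrm{MaxCut}(P'') \le \Big(1+\frac{\varepsilon}{2}\Big)\cdot\mathrm{MaxCut}(P)
\]
with probability at least $1 - 1/24$. Besides, we can upper bound the diameter of $P''$ as follows.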
Since the maximum pairwise distance of $P$ is $\sqrt{d}\,\Delta$, the value $n^2\cdot\sqrt{d}\,\Delta$ is an upper bound on the cost of a maximum cut for $P$. Since the diameter of a point set is a lower bound on the cost of a maximum cut of the point set, we get
\[
\mathrm{diam}(P') \le \mathrm{MaxCut}(P') \le \Big(1+\frac{\varepsilon}{4}\Big)\cdot\mathrm{MaxCut}(P) \le \Big(1+\frac{\varepsilon}{4}\Big)\cdot n^2\cdot\sqrt{d}\,\Delta,
\]
where the second inequality follows from Inequality (8.3). As a result, the diameter of $P''$ is $O(\sqrt{d}\,\Delta n^2)$. Furthermore, each point in $P''$ lies on a grid with cell size $\varepsilon/(16\cdot\sqrt{d(\varepsilon',\delta')})$. Thus, by scaling the point space by $16\cdot\sqrt{d(\varepsilon',\delta')}/\varepsilon$, we get a set of points from a discrete low-dimensional space $\{1,\dots,\Delta'\}^{d'}$ with $\Delta' \in O(\sqrt{d}\,\Delta n^2/\varepsilon^2)$ and $d' \in O(1/\varepsilon^2)$.

On the scaled point set, we run the approximation algorithm of [44] with precision parameter $\varepsilon'' := \varepsilon/3$. Due to Lemma 8.3.3 and our calculations above, with probability at least $23/24 - 1/3 = 5/8$, we can compute a point set $P'''$ such that
\[
\Big(1-\frac{\varepsilon}{2}\Big)\Big(1-\frac{\varepsilon}{3}\Big)\cdot\mathrm{MaxCut}(P)
\le \frac{\varepsilon\cdot\mathrm{MaxCut}(P''')}{16\cdot\sqrt{d(\varepsilon',\delta')}}
\le \Big(1+\frac{\varepsilon}{2}\Big)\Big(1+\frac{\varepsilon}{3}\Big)\cdot\mathrm{MaxCut}(P).
\]
Since $(1-\varepsilon/2)(1-\varepsilon/3) \ge (1-\varepsilon)$ and $(1+\varepsilon/2)(1+\varepsilon/3) \le (1+\varepsilon)$, after rescaling, our construction computes an implicit $(1\pm\varepsilon)$-approximate solution for the max-cut problem with probability at least $5/8$.

Note that our construction works in the streaming model, where the first two steps are used to transform a stream of high-dimensional points into a stream of low-dimensional points. Due to Lemma 8.3.1, the transformation of one high-dimensional input point requires $O(\log(d)/\varepsilon^2)$ space and $O(d\cdot\log^2(d)/\varepsilon^2)$ time. Finally, since we apply the approximation algorithm of [44] on a stream of points with dimension $O(1/\varepsilon^2)$ and spread $O(\sqrt{d}\,\Delta n^2/\varepsilon^2)$, the complexity of our construction is as claimed in the theorem.
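The snap-rounding step of the proof can be illustrated with a small sketch. It is not part of the original construction: the function names `snap_round` and `max_pairwise_change` are hypothetical, the input is assumed to be an already embedded point set in $k$ dimensions (the embedding of Lemma 8.3.1 itself is not reproduced), and the grid is anchored at the origin. Under these assumptions, the code rounds each point to the nearest vertex of a grid with cell side $\varepsilon/(16\sqrt{k})$ and reports the largest change of a pairwise distance.

```python
import math
import random

def snap_round(points, eps):
    """Snap each point to the nearest vertex of a grid with cell side eps / (16 * sqrt(k))."""
    k = len(points[0])                    # dimension of the (already embedded) points
    side = eps / (16 * math.sqrt(k))      # cell side length used in the proof
    # Integer grid coordinates: the point set after scaling by 16 * sqrt(k) / eps.
    grid = [tuple(round(c / side) for c in p) for p in points]
    # The same grid vertices expressed in the original, unscaled coordinates.
    snapped = [tuple(g * side for g in gp) for gp in grid]
    return grid, snapped

def max_pairwise_change(points, snapped):
    """Largest absolute change of any pairwise distance caused by the rounding."""
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            before = math.dist(points[i], points[j])
            after = math.dist(snapped[i], snapped[j])
            worst = max(worst, abs(after - before))
    return worst

if __name__ == "__main__":
    rng = random.Random(0)
    eps, k = 0.5, 4
    pts = [tuple(rng.uniform(0, 10) for _ in range(k)) for _ in range(50)]
    grid, snapped = snap_round(pts, eps)
    moved = max(math.dist(p, q) for p, q in zip(pts, snapped))
    # Both checks mirror the bounds eps/32 (per point) and eps/16 (per pair).
    print(moved <= eps / 32 + 1e-12)
    print(max_pairwise_change(pts, snapped) <= eps / 16 + 1e-12)
```

Each coordinate moves by at most half a cell side, so a point moves by at most $\varepsilon/32$ and a pairwise distance changes by at most $\varepsilon/16$, which summed over at most $n^2$ pairs gives the additive error $\varepsilon n^2/16$ used above. The integer tuples in `grid` play the role of the scaled discrete point set, up to a shift of the coordinate origin.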
8.4 Embedding Doubling Metric Spaces

In this section, we show how to compute a low-distortion embedding with low slack for an $n$-point doubling metric space $M = (X, D)$ with bounded dimension $\lambda$ given as a stream of points in the insertion-only data stream model. We assume that the minimum pairwise distance between two points in $X$ is at least 1, and the maximum pairwise distance is at most $\Delta$. Furthermore, we assume access to a distance oracle that, given any two points from $X$, can compute in constant time the distance between these two points.

The idea of our streaming algorithm is based on the results obtained in Section 7.3. Recall that our WSPD construction from Section 7.3 works as follows. It computes the uniform cut decompositions $\mathcal{G}^{(0)},\dots,\mathcal{G}^{(\lceil\log(\Delta)\rceil)}$ and decomposes each heavy ball in the uniform cut decompositions into a set of mini balls. Then, the weighted centers of these mini balls are the representatives of the WSPD. The idea of our streaming algorithm is to take two sample sets from the input stream. The first sample set is our set of representatives and is supposed to hit every mini ball that contains more than a certain threshold of points with high probability. The second sample set is supposed to approximate the weight of the mini ball centers. Next, we explain our streaming algorithm in more detail (see Algorithm 8.4.1 for a description in pseudocode).

We take two sample sets from the input stream denoted by $R$ and $S$. For that purpose, each input point is chosen at random with probability
\[
\Pr[\text{point is taken into } R] := \frac{\big((\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5}\big)\cdot\ln(n/\delta)}{\varepsilon^\lambda\sigma^2\cdot n}
\]
to be a sample point in $R$, where $\delta$ is the error probability of the algorithm. Similarly, each input point is taken at random with probability
\[
\Pr[\text{point is taken into } S] := \frac{6\cdot(\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5}\cdot\ln(n/\delta)}{\varepsilon^\lambda\sigma^4\cdot n}
\]
into the sample set $S$. Let $R := \{y_0,\dots,y_{k-1}\}$ and $S := \{s_0,\dots,s_{\ell-1}\}$ be the sample sets after having read the whole input stream. Then, $R$ is our set of representatives for $X$, and $S$ determines the weight of the representatives in $R$. More precisely, each point in $S$ is assigned to the nearest representative in $R$. For each representative $y_i \in R$, let $c_i$ be the total number of points in $S$ that have been assigned to $y_i$. Then, if $c_i > 0$, we set the weight of $y_i$ to $\lceil c_i\cdot n/|S|\rceil$. Otherwise, we remove $y_i$ from the set of representatives. To avoid that the total weight of the representatives is larger than $n$, we sum up all weights and decrease the weight of some arbitrary representatives by the required amount. The set of all weighted representatives is our compact representation for $M'$.

Slack Induced by the Sample Step

Let $\mathcal{M} := \{\mathcal{B}_0,\dots,\mathcal{B}_{m-1}\}$ be the set of mini balls that we would obtain by running the WSPD construction from Section 7.3 on the $n$-point metric space $M = (X, D)$. Then, we show that, with high probability, there is at least one sample point from $R$ in each mini ball that contains at least a certain fraction of points from $X$. Furthermore, for each ball in the same set of mini balls, we show that, with high probability, the number of points inside the ball can be $(1\pm\sigma)$-approximated by $S$. Finally, we prove that the remaining mini balls contain only a few points from $X$ as well as from the sample set $S$.
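The two-sample procedure described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation analyzed in this section: the function name `embed_doubling_metric` is hypothetical, the two sampling probabilities are passed in as plain parameters `p_r` and `p_s` instead of being computed from the formulas above (which exceed 1 unless $n$ is very large), the distance oracle is an ordinary function `dist`, and the excess weight is removed from the heaviest representatives first, which is one arbitrary choice among many.

```python
import math
import random

def embed_doubling_metric(stream, n, p_r, p_s, dist, rng=None):
    """One-pass sampling of R and S, followed by the weight extraction step.

    Returns a list of (representative, weight) pairs.
    """
    rng = rng or random.Random()
    reps, sample = [], []
    for x in stream:
        if rng.random() < p_r:       # x becomes a representative (set R)
            reps.append(x)
        if rng.random() < p_s:       # x is used to estimate weights (set S)
            sample.append(x)
    if not reps or not sample:
        return []

    # Assign every point of S to its nearest representative in R.
    counts = [0] * len(reps)
    for s in sample:
        nearest = min(range(len(reps)), key=lambda i: dist(s, reps[i]))
        counts[nearest] += 1

    # Weight ceil(c_i * n / |S|) in integer arithmetic; representatives with c_i = 0 are dropped.
    size = len(sample)
    weights = {i: (c * n + size - 1) // size for i, c in enumerate(counts) if c > 0}

    # Rounding up can push the total above n; remove the excess while keeping weights >= 1.
    excess = sum(weights.values()) - n
    for i in sorted(weights, key=weights.get, reverse=True):
        if excess <= 0:
            break
        cut = min(excess, weights[i] - 1)
        weights[i] -= cut
        excess -= cut
    return [(reps[i], w) for i, w in weights.items()]

if __name__ == "__main__":
    points = [(float(i), float(i % 7)) for i in range(200)]
    result = embed_doubling_metric(points, len(points), 0.1, 0.4, math.dist, random.Random(3))
    print(len(result), sum(w for _, w in result))
```

The extraction step compares every point of $S$ with every point of $R$, so it performs $|R|\cdot|S|$ oracle calls, while the streaming loop itself does constant work per point.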
Algorithm 8.4.1 EmbedDoublingMetric($n, \Delta, \varepsilon, \sigma, \delta$)

    initialize empty point sets R and S
    i ← 0
    for each point x in the stream do
        flip a coin that shows head with probability Pr[point is taken into R]
        if coin shows head then
            y_i ← x
            R ← R ∪ {y_i}
            initialize counter c_i with 0
            i ← i + 1
        flip a coin that shows head with probability Pr[point is taken into S]
        if coin shows head then
            S ← S ∪ {x}
    for each point x ∈ S do
        compute nearest neighbor y_i in R
        increment counter c_i by 1
    for each point y_i ∈ R do
        set weight of y_i to ⌈c_i · n/|S|⌉
    return R

In some proofs, we need to know good estimators for the number of points in the sample sets $R$ and $S$. For this reason, we first give appropriate lower and upper bounds on the sizes of $R$ and $S$.

Lemma 8.4.1. If each point in $X$ is taken with probability
\[
\Pr[\text{point is taken into } R] := \frac{\big((\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5}\big)\cdot\ln(n/\delta)}{\varepsilon^\lambda\sigma^2\cdot n}
\]
into the set $R$, then we have
\[
\frac{\big((\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+4}\big)\cdot\ln(n/\delta)}{\varepsilon^\lambda\sigma^2} < |R| < \frac{3\cdot\big((\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+4}\big)\cdot\ln(n/\delta)}{\varepsilon^\lambda\sigma^2}
\]
with probability $1-\delta/n$.

Proof. Let $Y_i$ be the indicator random variable for the event that the $i$-th point in $X$ is taken into the sample set $R$. We have
\[
\mathbf{E}[Y_i] = \frac{\big((\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5}\big)\cdot\ln(n/\delta)}{\varepsilon^\lambda\sigma^2\cdot n}.
\]
By a Chernoff bound and linearity of expectation, we get
\[
\Pr\Big[\sum_{i=1}^{|X|} Y_i \ge \Big(1+\frac{1}{2}\Big)\cdot\mathbf{E}\Big[\sum_{i=1}^{|X|} Y_i\Big]\Big]
\le \exp\Big(-\frac{n\cdot\mathbf{E}[Y_i]}{12}\Big)
\le \exp\Big(-\frac{\big((\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+3}\big)\cdot\ln(n/\delta)}{3\cdot\varepsilon^\lambda\sigma^2}\Big)
\le \frac{\delta}{2n}
\]
and
\[
\Pr\Big[\sum_{i=1}^{|X|} Y_i \le \Big(1-\frac{1}{2}\Big)\cdot\mathbf{E}\Big[\sum_{i=1}^{|X|} Y_i\Big]\Big]
\le \exp\Big(-\frac{n\cdot\mathbf{E}[Y_i]}{8}\Big)
\le \exp\Big(-\frac{\big((\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+2}\big)\cdot\ln(n/\delta)}{\varepsilon^\lambda\sigma^2}\Big)
\le \frac{\delta}{2n}.
\]
Thus, we have
\[
\Big(1-\frac{1}{2}\Big)\cdot\mathbf{E}\Big[\sum_{i=1}^{|X|} Y_i\Big] < \sum_{i=1}^{|X|} Y_i < \Big(1+\frac{1}{2}\Big)\cdot\mathbf{E}\Big[\sum_{i=1}^{|X|} Y_i\Big]
\]
with probability at least $1-\delta/n$. Now, the assertion follows from $\sum_{i=1}^{|X|} Y_i = |R|$ and $\mathbf{E}\big[\sum_{i=1}^{|X|} Y_i\big] = n\cdot\mathbf{E}[Y_i]$.

Lemma 8.4.2.
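If each point in $X$ is taken with probability
\[
\Pr[\text{point is taken into } S] := \frac{6\cdot(\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5}\cdot\ln(n/\delta)}{\varepsilon^\lambda\sigma^4\cdot n}
\]
into the set $S$, then we have
\[
\frac{3\cdot(\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5}\cdot\ln(n/\delta)}{\varepsilon^\lambda\sigma^4} < |S| < \frac{9\cdot(\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5}\cdot\ln(n/\delta)}{\varepsilon^\lambda\sigma^4}
\]
with probability $1-\delta/n$.

Proof. The proof follows the same approach as the proof of Lemma 8.4.1. Let $Y_i$ be the indicator random variable for the event that the $i$-th point in $X$ is taken into the sample set $S$. We have
\[
\mathbf{E}[Y_i] = \frac{6\cdot(\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5}\cdot\ln(n/\delta)}{\varepsilon^\lambda\sigma^4\cdot n}.
\]
By a Chernoff bound and linearity of expectation, we obtain
\[
\Pr\Big[\sum_{i=1}^{|X|} Y_i \ge \Big(1+\frac{1}{2}\Big)\cdot\mathbf{E}\Big[\sum_{i=1}^{|X|} Y_i\Big]\Big]
\le \exp\Big(-\frac{n\cdot\mathbf{E}[Y_i]}{12}\Big)
\le \exp\Big(-\frac{(\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+4}\cdot\ln(n/\delta)}{\varepsilon^\lambda\sigma^4}\Big)
\le \frac{\delta}{2n}
\]
and
\[
\Pr\Big[\sum_{i=1}^{|X|} Y_i \le \Big(1-\frac{1}{2}\Big)\cdot\mathbf{E}\Big[\sum_{i=1}^{|X|} Y_i\Big]\Big]
\le \exp\Big(-\frac{n\cdot\mathbf{E}[Y_i]}{8}\Big)
\le \exp\Big(-\frac{3\cdot(\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+3}\cdot\ln(n/\delta)}{\varepsilon^\lambda\sigma^4}\Big)
\le \frac{\delta}{2n}.
\]
Hence, we get
\[
\Big(1-\frac{1}{2}\Big)\cdot\mathbf{E}\Big[\sum_{i=1}^{|X|} Y_i\Big] < \sum_{i=1}^{|X|} Y_i < \Big(1+\frac{1}{2}\Big)\cdot\mathbf{E}\Big[\sum_{i=1}^{|X|} Y_i\Big]
\]
with probability at least $1-\delta/n$. Now, the assertion is due to $\sum_{i=1}^{|X|} Y_i = |S|$ and $\mathbf{E}\big[\sum_{i=1}^{|X|} Y_i\big] = n\cdot\mathbf{E}[Y_i]$.

The following lemma shows that, with high probability, there is at least one sample point from $R$ in each mini ball that contains at least a certain fraction of points from $X$.

Lemma 8.4.3. With probability $1-\delta$, there is at least one sample point from $R$ in each mini ball that contains at least $\varepsilon^\lambda\sigma^2 n/((\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5})$ points from $X$.

Proof. Let $\mathcal{B}$ be any mini ball that contains at least $\varepsilon^\lambda\sigma^2 n/((\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5})$ points from $X$. Then, we have
\[
\begin{aligned}
\Pr[\mathcal{B} \text{ contains no sample point from } R]
&\le \big(1-\Pr[\text{point is taken into } R]\big)^{\frac{\varepsilon^\lambda\sigma^2 n}{(\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5}}} \\
&= \Big(1-\frac{\big((\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5}\big)\cdot\ln(n/\delta)}{\varepsilon^\lambda\sigma^2\cdot n}\Big)^{\frac{\varepsilon^\lambda\sigma^2 n}{(\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5}}} \\
&\le \delta/n,
\end{aligned}
\]
where the second inequality is due to a bound on Euler's number (see Inequality (B.2)).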
Finally, the assertion of the lemma follows by the union bound since we can assume that the number of mini balls is less than $n$.

Now, we can show that, with high probability, there are just a few sample points from $S$ located in mini balls that contain less than a certain fraction of points from $X$. Furthermore, the number of points in each of the remaining mini balls can be $(1\pm\sigma)$-approximated by $S$ with high probability.

Lemma 8.4.4. Let $U$ be the union of all the mini balls in $\mathcal{M}$ that contain less than $\varepsilon^\lambda\sigma^2 n/((\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5})$ points from $X$. If $|S| \ge 6\ln(1/\delta)/\sigma$, then, with probability at least $1-\delta$, the number of points from $X$ in $U$ is less than $\sigma n/2$ and the number of sample points from $S$ that are contained in $U$ is at most $\sigma|S|$.

Proof. As we have shown in the proof of Lemma 7.3.5, the number of mini balls is at most $(\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+4}/(\varepsilon^\lambda\sigma)$. Thus, the total number of points in mini balls contained in $U$ is less than
\[
\frac{(\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+4}}{\varepsilon^\lambda\sigma}\cdot\frac{\varepsilon^\lambda\sigma^2 n}{(\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5}} = \frac{\sigma n}{2}.
\]
It follows that, for some $t \ge 1$, the total number of points in mini balls from $U$ is $\sigma n/(2t)$. Let $Y_i$ be the indicator random variable for the event that the $i$-th point in $S$ is contained in $U$. We have $\mathbf{E}[Y_i] = \sigma/(2t)$. By a Chernoff bound and linearity of expectation, we get
\[
\Pr\Big[\sum_{i=1}^{|S|} Y_i \ge (1+t)\cdot\mathbf{E}\Big[\sum_{i=1}^{|S|} Y_i\Big]\Big]
\le \exp\Big(-\frac{t\cdot|S|\cdot\mathbf{E}[Y_i]}{3}\Big)
= \exp\Big(-\frac{|S|\cdot\sigma}{6}\Big).
\]
Since we assume that $|S| \ge 6\ln(1/\delta)/\sigma$, this probability is at most $\delta$. Thus, $U$ contains less than
\[
(1+t)\cdot\mathbf{E}\Big[\sum_{i=1}^{|S|} Y_i\Big] \le \sigma|S|
\]
sample points with probability at least $1-\delta$.

Lemma 8.4.5. If $|S| \ge 3\cdot(\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5}\cdot\ln(n/\delta)/(\varepsilon^\lambda\sigma^4)$, then the number of points in every mini ball that contains at least $\varepsilon^\lambda\sigma^2 n/((\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5})$ points from $X$ can be $(1\pm\sigma)$-approximated by $S$ with probability $1-\delta$.

Proof. Let $\mathcal{B}$ be any mini ball that contains at least $\varepsilon^\lambda\sigma^2 n/((\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5})$ points from $X$.
Let $Y_i$ be the indicator random variable for the event that the $i$-th point in $S$ is contained in the mini ball $\mathcal{B}$. We have $\mathbf{E}[Y_i] \ge \varepsilon^\lambda\sigma^2/((\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5})$. By a Chernoff bound and linearity of expectation, we get
\[
\Pr\Big[\sum_{i=1}^{|S|} Y_i \ge (1+\sigma)\cdot\mathbf{E}\Big[\sum_{i=1}^{|S|} Y_i\Big]\Big]
\le \exp\Big(-\frac{\sigma^2\cdot|S|\cdot\mathbf{E}[Y_i]}{3}\Big)
\le \exp\Big(-\frac{\varepsilon^\lambda\sigma^4\cdot|S|}{3\cdot(\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5}}\Big)
\]
and
\[
\Pr\Big[\sum_{i=1}^{|S|} Y_i \le (1-\sigma)\cdot\mathbf{E}\Big[\sum_{i=1}^{|S|} Y_i\Big]\Big]
\le \exp\Big(-\frac{\sigma^2\cdot|S|\cdot\mathbf{E}[Y_i]}{2}\Big)
\le \exp\Big(-\frac{\varepsilon^\lambda\sigma^4\cdot|S|}{(\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+6}}\Big).
\]
Since we assume that $|S| \ge 3\cdot(\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5}\cdot\ln(n/\delta)/(\varepsilon^\lambda\sigma^4)$, each of these probabilities is at most $\delta/n$. Hence, the number of points in $\mathcal{B}$ can be $(1\pm\sigma)$-approximated with probability $1-2\delta/n$. By the union bound, the number of points in every mini ball that contains at least $\varepsilon^\lambda\sigma^2 n/((\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5})$ points from $X$ is $(1\pm\sigma)$-approximated with probability at least $1-\delta$.

Weight of the Representatives

To avoid that the total weight of the representatives differs from $n$, we adjust the weight of some representatives. We adopt the result given in Section 8.2 to show that the adjustment is small.

Lemma 8.4.6. Let $R$ be the set of representatives before the adjustment, and let $w(y_i)$ denote the weight of a representative $y_i \in R$. Then,
\[
n \le \sum_{y_i\in R} w(y_i) < \Big(1+\frac{\sigma^2}{2}\Big)\cdot n
\]
holds with probability at least $1-2\delta/n$.

Proof. Since the sum of the counters over all representatives is equal to $|S|$ and the weight of a representative is at least its counter multiplied by $n/|S|$, the total weight of the representatives is at least $n$. This proves the first inequality of the lemma.

The sum of the weights can be larger than $n$ because the weight of each representative is rounded up to the next integer. Thus, the sum of the weights is at most $n + |R|$. Due to Lemma 8.4.1, we have
\[
|R| < \frac{3\cdot\big((\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+4}\big)\cdot\ln(n/\delta)}{\varepsilon^\lambda\sigma^2}
\]
with probability $1-\delta/n$.
Furthermore, due to Lemma 8.4.2, we have
\[
|S| > \frac{3\cdot(\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5}\cdot\ln(n/\delta)}{\varepsilon^\lambda\sigma^4}
\]
with probability $1-\delta/n$. It follows that $|R| < |S|\cdot\sigma^2/2$ with probability $1-2\delta/n$. Since $|S| \le n$, we obtain
\[
\sum_{y_i\in R} w(y_i) - n \le n + |R| - n < \frac{\sigma^2}{2}\cdot|S| \le \frac{\sigma^2}{2}\cdot n,
\]
which proves the second inequality of the lemma.

We summarize our results in the following theorem:

Theorem 13. Given a stream of points from an $n$-point doubling metric space $M = (X, D)$ with bounded dimension $\lambda$, a precision parameter $\varepsilon$, $0 < \varepsilon < 1$, a slack parameter $\sigma$, $1/o(n) < \sigma < 1$, and an error probability parameter $\delta$, $0 < \delta < 1$, there is a randomized streaming algorithm that computes with probability $1-\delta$ a set $X' \subset X$ of cardinality $O(\log^2(\Delta)\cdot\log(n/\delta)/(\varepsilon^\lambda\sigma^2))$ such that $M = (X, D)$ embeds into $M' = (X', D)$ with distortion $1+\varepsilon$ and slack $\sigma$. The algorithm requires $O(\log^2(\Delta)\cdot\log(n/\delta)/(\varepsilon^\lambda\sigma^4))$ space and has a constant update time. The set $X'$ can be extracted in $O(\log^4(\Delta)\cdot\log^2(n/\delta)/(\varepsilon^{2\lambda}\sigma^6))$ time.

Proof. Due to Lemma 8.4.3, with high probability, there is at least one representative in each mini ball that contains at least $\varepsilon^\lambda\sigma^2 n/((\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5})$ points from $X$. Let $B(x, r)$ be such a mini ball. Then, each point $x' \in S \cap B(x, r)$ is either assigned to a representative in $B(x, r)$ or to another representative which is closer to $x'$. It follows that the representative to which we assign $x'$ is contained in the ball $B(x, 3r)$. Hence, due to Lemmas 7.3.3 and 7.3.4, the representatives for points located in mini balls that contain at least $\varepsilon^\lambda\sigma^2 n/((\lceil\log(\Delta)\rceil+1)^2\cdot 2^{6\lambda+5})$ points from $X$ form a $3\varepsilon$-WSPD with slack $\sigma$ for $X$. Since we estimate the number of points in the mini balls based on the sample set $S$, we get an additional slack. Due to Lemmas 8.4.2, 8.4.4 and 8.4.5, with high probability, the additional slack induced by this estimation is at most $2\sigma$.
Furthermore, we get another additional slack of $\sigma^2/2$ (see Lemma 8.4.6) since we round up the weight of each representative. Thus, with high probability, our streaming algorithm computes a $3\varepsilon$-WSPD with slack $4\sigma$ for $X$. Due to our construction, this WSPD is an embedding for $X$ with distortion $(1+3\varepsilon)^2$ and slack $4\sigma$.

Due to Lemma 8.4.1, we can assume with high probability that the set of representatives $R$ and, hence, the set $X'$ has a cardinality of $O(\log^2(\Delta)\cdot\log(n/\delta)/(\varepsilon^\lambda\sigma^2))$. Furthermore, due to Lemma 8.4.2, we can assume with high probability that the set $S$ has a cardinality of $O(\log^2(\Delta)\cdot\log(n/\delta)/(\varepsilon^\lambda\sigma^4))$. Since we only store the two sample sets $R$ and $S$, the total space requirement of our algorithm is $O(\log^2(\Delta)\cdot\log(n/\delta)/(\varepsilon^\lambda\sigma^4))$.

The error probability is given as follows. Lemmas 8.4.1 and 8.4.2 hold with a total error probability of at most $2\delta/n$. If this is the case, then the assertions given in Lemmas 8.4.4 and 8.4.5 follow with a total error probability of $2\delta$. The assertion of Lemma 8.4.3 is true with probability $1-\delta$. Thus, the total error probability of our algorithm is at most $4\delta$.

Overall, if we run our embedding algorithm with a precision parameter $\varepsilon' \le \varepsilon/9$, a slack parameter $\sigma' \le \sigma/4$, and an error probability parameter $\delta' \le \delta/4$, then the embedding has distortion $(1+3\varepsilon')^2 \le 1+\varepsilon$ and slack $4\sigma' \le \sigma$ and works with error probability $4\delta' \le \delta$.

Since we can decide in constant time whether a point is taken into one or both of the two sample sets, the algorithm has a constant update time. To extract the weighted set $X'$, we assign each point in $S$ to its nearest neighbor in $R$. Since we assume access to a distance oracle that, given any two points from $X$, can compute in constant time the distance between these two points, the assignment of $S$ can be done in $|R|\cdot|S|$ time.

Since we use the construction from Section 7.3 in the analysis of our streaming algorithm, we have to ensure that $\sigma' > (\lceil\log(\Delta)\rceil+1)\cdot 2^{3\lambda}/n$ (confer Theorem 8). However, this is implicitly required by the fact that the space requirement of a streaming algorithm has to be sublinear in $n$ and the space requirement of our streaming algorithm is $\omega(1/\sigma)$.

8.5 Embedding General Metric Spaces

This section deals with a streaming algorithm that embeds a general $n$-point metric space $M = (X, D)$ with constant distortion and slack $\sigma$ into a metric space $M' = (X', D')$. As in the previous sections, we assume that the minimum pairwise distance of $M$ is at least 1, and the maximum pairwise distance is at most $\Delta$. Furthermore, we assume access to a distance oracle that, given any two points from $X$, can compute in constant time the distance between these two points.

Our algorithm works in the insertion-only data stream model and resembles the construction of spanners with slack proposed by Chan et al. [24]. A spanner with slack $\sigma$ for $M$ is a sparse graph $G$ whose vertices are the points in $X$ and whose shortest-path metric approximates a $(1-\sigma)$-fraction of all pairwise distances of $M$ with small distortion. The first step of the spanner construction presented in [24] is the computation of a small edge-dense net $N \subset X$ of $M$. Intuitively, $N$ has the property that, for a large fraction of pairs of points $(x, y) \in X\times X$, the distance between $N$ and both $x$ and $y$ is small compared to $D(x, y)$. Based on $N$, the edges of $G$ are constructed as follows. For each pair of points $(x, y) \in N\times N$, an edge with length $D(x, y)$ is added to $G$. For each point $x \in X\setminus N$, its closest neighbor $y$ in $N$ is determined and an edge with length $D(x, y)$ is added to $G$.

Now, we transform the construction to the streaming model. Our first modification is that we replace the edge-dense net $N$ by a sample set $S$ drawn uniformly at random from $X$, i.e., $G$ contains a clique on $S$ and each point $x \in X\setminus S$ is connected to its closest neighbor in $S$. We will show that, if the size of $S$ is chosen carefully, this modification changes the properties of $G$ only slightly.
Secondly, instead of storing for each point $x \in X\setminus S$ an edge to a point in $S$, we store for each point $x' \in S$ the number of points at each distance scale that have $x'$ as their nearest neighbor. This technique was applied earlier by Czumaj and Sohler [32] to obtain 2-pass streaming algorithms for clustering problems. Since our streaming algorithm has to make do with a single pass, and since after having read only a part of the input stream one cannot know the nearest neighbor of a point $x \in X$ in the final sample set $S$, we compute the nearest neighbor in the current sample set $S$ at the point of time when $x$ appears in the stream. In this way, we are able to compute a compact representation of a spanner with slack for $M$ in the streaming model.

Next, we describe our streaming algorithm in more detail. A description in pseudocode is given by Algorithm 8.5.1. We read the points of the input stream one by one and sample each point with probability
\[
\Pr[\text{point is sampled}] := \frac{m\,\log(n/\delta)\,(\lceil\log(\Delta)\rceil+1)}{\sigma^2 n},
\]
where $\delta$ is the error probability of the algorithm and $m$ is the size of the edge-dense net $N$ mentioned above (which is a constant depending on $\sigma$). Let $S := \{s_0,\dots,s_{k-1}\}$ be the set of sampled points. For each $i \in [k]$, we maintain counters $c_{i,0}, c_{i,1},\dots,c_{i,\lceil\log(\Delta)\rceil}$, which are initially set to 0. Moreover, for each point $x \in X\setminus S$, we compute its nearest neighbor $\tau(x) = s_i$ in $S$ at the point of time when $x$ appears in the stream, and we increment $c_{i,j}$, where $j = \lceil\log(D(x, s_i))\rceil$. By storing the points in $S$ and the counters $c_{i,j}$, we implicitly store the following metric space $M'$. The metric $M'$ is the shortest-path metric of a graph $G$ with vertex set $X$. For each pair of points $(s_i, s_j) \in S\times S$, the graph $G$ contains an edge $\{s_i, s_j\}$ of length $D(s_i, s_j)$. For each point $x \in X\setminus S$, the graph $G$ contains an edge $\{x, \tau(x)\}$ of length $2^{\lceil\log(D(x,\tau(x)))\rceil}$. We denote the resulting embedding by $\varphi$.
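The one-pass procedure described above can be sketched in Python as follows. This is an illustrative sketch under assumptions that go beyond the text: the function name `embed_general_metric` is hypothetical, the sampling probability `p` is passed in directly instead of being computed from $m$, $n$, $\delta$, $\sigma$ and $\Delta$, logarithms are taken to base 2, and points that arrive before the first sample point are merely counted in `skipped`, a case the description does not address.

```python
import math
import random

def embed_general_metric(stream, delta_max, p, dist, rng=None):
    """One pass over the stream; returns the sample S, the counters c[i][j], and a skip count.

    Assumes every pairwise distance lies in [1, delta_max].
    """
    rng = rng or random.Random()
    scales = math.ceil(math.log2(delta_max)) + 1   # distance scales 0 .. ceil(log(Delta))
    sample, counters = [], []
    skipped = 0
    for x in stream:
        if rng.random() < p:
            # x joins the sample set and gets one counter per distance scale
            sample.append(x)
            counters.append([0] * scales)
        elif sample:
            # nearest neighbor tau(x) in the *current* sample set
            i = min(range(len(sample)), key=lambda t: dist(x, sample[t]))
            j = math.ceil(math.log2(dist(x, sample[i])))
            counters[i][j] += 1
        else:
            skipped += 1                            # no sample point exists yet
    return sample, counters, skipped

if __name__ == "__main__":
    points = [(float(i), float((3 * i) % 11)) for i in range(100)]
    s, c, skipped = embed_general_metric(points, 128, 0.1, math.dist, random.Random(5))
    print(len(s), sum(map(sum, c)) + skipped + len(s) == len(points))
```

A counter value `counters[i][j]` stands for that many points attached to `sample[i]` by an edge of length $2^j$, so the returned data is enough to evaluate distances in the implicit graph without storing the non-sampled points.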
Note that we do not store the mapping $\varphi : X \to X'$ since this would require $\Omega(n)$ space.

Algorithm 8.5.1 EmbedGeneralMetric($n, \Delta, \sigma, \delta$)

    initialize empty point set S
    i ← 0
    for each point x in the stream do
        flip a coin that shows head with probability Pr[point is sampled]
        if coin shows head then
            s_i ← x
            S ← S ∪ {s_i}
            for j ← 0 to ⌈log(Δ)⌉ do
                initialize counter c_{i,j} with 0
            i ← i + 1
        else
            compute nearest neighbor τ(x) = s_{i'} in S
            increment counter c_{i',⌈log(D(x, s_{i'}))⌉} by 1
    return points in S together with their counters

Analysis of the Embedding

In order to prove that the embedding $\varphi$ has constant distortion and slack $\sigma$, we first show that $M$ indeed contains some small edge-dense net $N$.

Definition 8.5.1 (Edge-Dense Net). Let $M = (X, D)$ be any general metric space, let $\gamma > 0$ be any precision parameter, and let $\sigma$, $0 < \sigma < 1$, be any slack parameter. We say that a subset $N \subset X$ is a $(\sigma, \gamma)$-edge-dense net for $M$ if, for at least a $(1-\sigma)$-fraction of pairs $(x, y) \in X\times X$, there exists a pair $(b_x, b_y) \in N\times N$ such that
\[
\max\{D(x, b_x), D(y, b_y)\} \le \gamma\cdot D(x, y).
\]

Lemma 8.5.2 ([75]). For any general metric space $M = (X, D)$ and for any slack parameter $\sigma$, $0 < \sigma < 1$, there exists a subset $N \subset X$ with $|N| = C(\sigma)$, where $C(\sigma)$ is a constant depending on $\sigma$, such that
\[
\min_{b\in N}\{D(x, b), D(y, b)\} \le D(x, y)
\]
is true for at least a $(1-\sigma)$-fraction of pairs $(x, y) \in X\times X$.

We reformulate the above lemma as follows.

Lemma 8.5.3. For any general metric space $M = (X, D)$ and for any slack parameter $\sigma$, $0 < \sigma < 1$, there exists a $(\sigma, 2)$-edge-dense net $N \subset X$ with $|N| = C(\sigma)$, where $C(\sigma)$ is a constant depending on $\sigma$.

Proof. Let $N$ be the set given by Lemma 8.5.2. Then, for at least a $(1-\sigma)$-fraction of pairs $(x, y) \in X\times X$, there exists an element $b \in N$ such that
\[
\min\{D(x, b), D(y, b)\} \le D(x, y).
\]
Without loss of generality, we assume that $D(x, b) \le D(x, y)$.
By the triangle inequality, we have
\[
D(y, b) \le D(y, x) + D(x, b) \le 2\cdot D(x, y),
\]
and the assertion follows.

Now, let $N := \{z_0,\dots,z_{m-1}\}$ be a $(\sigma, 2)$-edge-dense net for the input metric space $M$. For each $\ell \in [m]$, let $X_\ell$ be the set of points in $X$ for which the nearest neighbor in $N$ is $z_\ell$ (breaking ties arbitrarily). Furthermore, for each $j \in [\lceil\log(\Delta)\rceil+1]$, we define
\[
X_{\ell,j} := \big\{x \in X_\ell \mid D(x, z_\ell) \in (2^{j-1}, 2^j]\big\}.
\]
We say that $X_{\ell,j}$ is good if after $\sigma|X_{\ell,j}|$ points from $X_{\ell,j}$ have appeared in the stream, the set $S$ contains at least one point from $X_{\ell,j}$. In case this condition fails, we say that $X_{\ell,j}$ is bad. The next lemma shows that each set $X_{\ell,j}$ that contains more than a certain threshold of points is good with high probability.

Lemma 8.5.4. With probability at least $1-\delta$, for each $\ell \in [m]$ and each $j \in [\lceil\log(\Delta)\rceil+1]$ with $|X_{\ell,j}| \ge \sigma n/(m(\lceil\log(\Delta)\rceil+1))$, $X_{\ell,j}$ is good.

Proof. Pick any $\ell \in [m]$ and any $j \in [\lceil\log(\Delta)\rceil+1]$ such that
\[
|X_{\ell,j}| \ge \frac{\sigma n}{m(\lceil\log(\Delta)\rceil+1)}.
\]
Then, we have
\[
\Pr[X_{\ell,j} \text{ is bad}] = \big(1-\Pr[\text{point is sampled}]\big)^{\sigma|X_{\ell,j}|}
\le \Big(1-\frac{m\log(n/\delta)(\lceil\log(\Delta)\rceil+1)}{\sigma^2 n}\Big)^{\frac{\sigma^2 n}{m(\lceil\log(\Delta)\rceil+1)}}
< \delta/n,
\]
where the second inequality is due to a bound on Euler's number (see Inequality (B.2)). Finally, we obtain the assertion of the lemma by applying the union bound over all pairs $(\ell, j) \in [m]\times[\lceil\log(\Delta)\rceil+1]$. Recall that $m$ is a constant depending on $\sigma$ (see Lemma 8.5.3), so we can assume that $m\cdot(\lceil\log(\Delta)\rceil+1) \le n$.

The results above allow us to bound the distortion and the slack of our embedding $\varphi$.

Lemma 8.5.5. With probability at least $1-\delta$, for at least a $(1-3\sigma)$-fraction of pairs $(x, y) \in X\times X$, we have
\[
D(x, y) \le D'(\varphi(x), \varphi(y)) \le 46\cdot D(x, y).
\]

Proof. The number of points contained in sets $X_{\ell,j}$ with $|X_{\ell,j}| < \sigma n/(m(\lceil\log(\Delta)\rceil+1))$ is at most $\sigma n$.
Now, by Lemmas 8.5.2, 8.5.3, and 8.5.4 and since $N$ is a $(\sigma, 2)$-edge-dense net, it follows with probability at least $1-\delta$ that, for at least a $(1-3\sigma)$-fraction of pairs $(x, y) \in X\times X$, there exist $b_x, b_y \in N$ and $x', y' \in S$ (see Figure 8.1) such that:

• $x'$ and $y'$ appear in the stream before $x$ and $y$, respectively,
• $D(x, b_x) \le D(x, y)$ or $D(y, b_y) \le D(x, y)$,
• $\max\{D(x, b_x), D(y, b_y)\} \le 2\cdot D(x, y)$,
• $D(x', b_x) \le 2\cdot D(x, b_x)$, and $D(y', b_y) \le 2\cdot D(y, b_y)$.

Without loss of generality, we assume that $D(x, b_x) \le D(x, y)$. Now, for a pair $(x, y) \in X\times X$, we get
\[
\begin{aligned}
D'(\varphi(x), \varphi(y))
&= 2^{\lceil\log(D(x,\tau(x)))\rceil} + D(\tau(x), \tau(y)) + 2^{\lceil\log(D(y,\tau(y)))\rceil} \\
&\le 2^{\lceil\log(D(x,x'))\rceil} + D(\tau(x), x) + D(x, x') + D(x', y') + D(y', y) + D(y, \tau(y)) + 2^{\lceil\log(D(y,y'))\rceil} \\
&\le 2^{\lceil\log(D(x,x'))\rceil} + 2\cdot D(x, x') + D(x', y') + 2\cdot D(y', y) + 2^{\lceil\log(D(y,y'))\rceil} \\
&\le 4\cdot D(x, x') + D(x', y') + 4\cdot D(y, y') \\
&\le 4\cdot\big(D(x, b_x) + D(b_x, x')\big) + D(x', y') + 4\cdot\big(D(y, b_y) + D(b_y, y')\big) \\
&\le 12\cdot D(x, b_x) + D(x', y') + 12\cdot D(y, b_y) \\
&\le 12\cdot D(x, b_x) + D(x', b_x) + D(b_x, x) + D(x, y) + D(y, b_y) + D(b_y, y') + 12\cdot D(y, b_y) \\
&\le 15\cdot D(x, b_x) + D(x, y) + 15\cdot D(y, b_y) \\
&\le 46\cdot D(x, y).
\end{aligned}
\]

By combining our results, we obtain the following theorem:

Theorem 14. Let $\sigma$, $0 < \sigma < 1$, be a slack parameter, and let $\delta$, $0 < \delta < 1$, be an error probability parameter. Given a stream of points from any general $n$-point metric space $M$, there exists a randomized streaming algorithm that computes with probability at least $1-\delta$ an implicit representation of an $n$-point metric space $M'$ such that $M$ embeds into $M'$ with distortion $O(1)$ and slack $\sigma$. The algorithm requires $O(C(\sigma)\cdot\log(n/\delta)\cdot\log(n)\cdot\log^2(\Delta)/\sigma^2)$ space, where $C(\sigma)$ is a constant depending on $\sigma$.

Figure 8.1: Illustration of the embedding $\varphi$ for a set of points in the Euclidean plane. The points $z_1$ and $z_2$ belong to the edge-dense net $N$. The areas which contain the sets $X_{1,1}$ and $X_{2,2}$ are colored in gray.
Both sets are good, which implies that $X_{1,1}$ contains a sample point $x'$ and $X_{2,2}$ contains a sample point $y'$. The distance scale for each sample point is indicated by the dashed red circles. The distance between $x$ and $y$ is represented by $D(\varphi(x), x') + D(x', y') + D(y', \varphi(y))$, which is indicated by the line segments.

Proof. First, we compute an upper bound on the space requirement of our algorithm. Let $Y_i$ be the indicator random variable for the event that the $i$-th point in the data stream is sampled. Recall that each point is sampled with probability $\Pr[\text{point is sampled}] = C(\sigma)\log(n/\delta)(\lceil\log(\Delta)\rceil+1)/(\sigma^2 n)$. Thus, we have $\mathbf{E}[Y_i] = \Pr[\text{point is sampled}]$. By a Chernoff bound, we get
\[
\Pr\Big[\sum_{i=1}^{|X|} Y_i \ge 4\cdot\mathbf{E}\Big[\sum_{i=1}^{|X|} Y_i\Big]\Big]
\le \exp\Big(-\frac{3\cdot n\cdot\mathbf{E}[Y_i]}{3}\Big)
\le \exp\big(-n\cdot\Pr[\text{point is sampled}]\big)
\le \delta.
\]
It follows that, with probability at least $1-\delta$, the size of the sample set $S$ is
\[
|S| \in O\Big(\frac{C(\sigma)\cdot\log(n/\delta)\cdot\log(\Delta)}{\sigma^2}\Big).
\]
Since our algorithm only stores each point in $S$ together with its $O(\log(\Delta))$ counters and each counter is set to at most $n$, the total space requirement is as claimed.

Due to Lemma 8.5.5 and the considerations above, the embedding $\varphi$ has distortion $O(1)$ and slack $3\sigma$ and works with a total error probability of at most $2\delta$. Thus, if we run our algorithm with a slack parameter $\sigma' \le \sigma/3$ and an error probability parameter of $\delta' \le \delta/2$, the resulting embedding has slack $3\sigma' \le \sigma$ and an error probability of $2\delta' \le \delta$.

8.6 Lower Bounds

In this section, we derive two lower bounds. First, we show that any algorithm that embeds an $n$-point metric space $M$ into another metric space $M'$ with distortion $\varrho < 2$ and slack $\sigma < 1/4$ requires $\Omega(n/\log n)$ bits of memory. The second lower bound depends on the spread $\Delta$ of $M$. More precisely, we prove that any algorithm that embeds a metric space $M$ into another metric space $M'$ with distortion $\varrho < 2$ and slack $\sigma < 1/4$ needs $\Omega(\log\log\Delta)$ bits of memory.
Both proofs are based on the pigeonhole principle. We show that if we restrict the memory space to a certain number of bits, there cannot exist a so-called $(\sigma, \varrho)$-net.

Definition 8.6.1 (Net for Metric Spaces). Let $\varrho \ge 1$ be a precision parameter, and let $\sigma$, $0 < \sigma < 1$, be a slack parameter. A set of $n$-point metric spaces $\mathcal{N}$ is called a $(\sigma, \varrho)$-net if every $n$-point metric space $M$ embeds into some metric space $M' \in \mathcal{N}$ with distortion $\varrho$ and slack $\sigma$.

Theorem 15. Let $\varrho$, $1 \le \varrho < 2$, be a precision parameter, and let $\sigma$, $0 \le \sigma < 1/4$, be a slack parameter. Then, any algorithm that computes for every arbitrary $n$-point metric space $M = (X, D)$ with positive probability an (implicit or explicit) representation of another metric space $M' = (X', D')$ such that there is an embedding from $M$ to $M'$ with distortion $\varrho$ and slack $\sigma$ requires $\Omega(n/\log n)$ bits of memory.

Proof. We use the probabilistic method to prove the assertion. In general, the probabilistic method says that if an object chosen at random from a given universe satisfies a certain property with positive probability, then there must exist an object in the universe that satisfies this property. Applied to our problem, the universe is a set of $n$-point metric spaces and the desired property of an $n$-point metric space $M$ is that $M$ cannot be embedded by an algorithm using $O(n/\log n)$ bits of memory such that the embedding has distortion $\varrho$ and slack $\sigma$. If a randomly chosen $n$-point metric space has this property with positive probability, then there must exist an $n$-point metric space that cannot be embedded by an algorithm using $O(n/\log n)$ bits of memory such that the embedding has distortion $\varrho$ and slack $\sigma$.

Now, let us consider any algorithm for embedding $n$-point metric spaces that uses at most $k$ bits of memory. Then, this algorithm has at most $2^k$ distinct states. Each of these states can correspond to at most one target metric space. Let us denote the set of these target metric spaces by $\mathcal{N}$.
By using the probabilistic method, we show that, for a certain value of $k$, $N$ is not a $(\sigma, \varrho)$-net, i.e., there exists an $n$-point metric space that cannot be embedded into any metric space in $N$ with distortion $\varrho$ and slack $\sigma$. This proves the assertion of the theorem for any algorithm using at most $k$ bits of memory.

Let $M = (X, D)$ be a random $n$-point 1-2-metric space, i.e., every distance is chosen independently and uniformly at random from $\{1, 2\}$. Without loss of generality, we assume that the computed embedding is non-expanding, i.e., the distance between any two points in $X$ is at least as big as the corresponding distance of the embedded points in the target metric space.

Let us consider an arbitrary target metric space $M' \in N$. There are $n!$ ways to embed $M$ into $M'$. We fix one of these possible embeddings. Without loss of generality, we can assume that $X = X'$, i.e., our fixed embedding is the identity function. For the moment, let us assume that the embedding must have slack 0. Then, since our embedding is non-expanding, we must map all 1-distances in $M$ to distances of length at most 1 in $M'$. Furthermore, since our embedding has distortion $\varrho < 2$, we must map all 2-distances in $M$ to a value greater than 1 in $M'$. We call an assignment that violates these conditions a wrong assignment. Let $\mathrm{err}(M, M')$ be the total number of wrong assignments that occur by embedding $M$ into $M'$. Since we allow a slack of $\sigma$, we are allowed to make at most $\sigma \cdot \binom{n}{2}$ wrong assignments.

For two arbitrary points $x, y \in X$ in $M'$, the distance $D'(x, y)$ is either bigger than 1 or at most 1. Since $D(x, y)$ is chosen uniformly at random from $\{1, 2\}$, the chance that the assignment of $D(x, y)$ to $D'(x, y)$ belongs to the wrong assignments is $1/2$. Consequently, $\mathrm{err}(M, M')$ is a sum of $\binom{n}{2}$ independent 0-1 random variables with $E[\mathrm{err}(M, M')] = \frac{1}{2} \cdot \binom{n}{2}$. Therefore, by a Chernoff bound with $\varepsilon = 1/2$, we have
\[
  \Pr\left[ \mathrm{err}(M, M') \le \sigma \cdot \binom{n}{2} \right]
  \le \Pr\left[ \mathrm{err}(M, M') \le \frac{1}{4} \cdot \binom{n}{2} \right]
  = \Pr\left[ \mathrm{err}(M, M') \le \left( 1 - \frac{1}{2} \right) \cdot E[\mathrm{err}(M, M')] \right]
  \le \exp\left( - \frac{E[\mathrm{err}(M, M')]}{8} \right)
  = \exp\left( - \frac{1}{16} \cdot \binom{n}{2} \right) .
\]
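The Chernoff estimate at the end of this calculation can be sanity-checked numerically. The following is a minimal sketch, not part of the original proof: it assumes that $\mathrm{err}(M, M')$ is binomially distributed with $\binom{n}{2}$ trials and success probability $1/2$, and compares the exact lower tail with the bound $\exp(-\binom{n}{2}/16)$. The function names are hypothetical and chosen only for illustration.

```python
from math import comb, exp


def exact_lower_tail(num_pairs: int, threshold: float) -> float:
    """Exact Pr[Bin(num_pairs, 1/2) <= threshold] for a non-negative threshold."""
    limit = int(threshold)  # floor, since the threshold is non-negative
    favourable = sum(comb(num_pairs, j) for j in range(limit + 1))
    return favourable / 2 ** num_pairs


def chernoff_bound(num_pairs: int) -> float:
    """exp(-E[err]/8) with E[err] = num_pairs/2, i.e. exp(-num_pairs/16)."""
    return exp(-num_pairs / 16)


if __name__ == "__main__":
    for n in (8, 16, 32):
        num_pairs = comb(n, 2)  # number of pairwise distances
        tail = exact_lower_tail(num_pairs, num_pairs / 4)
        print(n, num_pairs, tail, chernoff_bound(num_pairs))
```

In each case the exact tail stays below the Chernoff bound, and both shrink quickly as $n$ grows, which is what the subsequent counting step relies on.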
Since $|N| \le 2^k$ and there are $n!$ ways to embed $M$ into one metric space from $N$, the union bound implies that the overall probability that $M$ embeds into any metric space from $N$ with distortion $\varrho$ and slack $\sigma$ is at most
\[
  n! \cdot 2^k \cdot \exp\left( - \frac{1}{16} \cdot \binom{n}{2} \right) ,
\]
which is less than 1 for certain $k = cn/\log n$ with sufficiently small constant $c$. Thus, the randomly chosen $n$-point metric space $M$ cannot be embedded into any metric space from $N$ with positive probability. It follows that, for such $k$, there must exist an $n$-point metric space that cannot be embedded into any metric space from $N$ with distortion $\varrho$ and slack $\sigma$, which completes the proof.

Theorem 16. Let $\varrho$, $1 \le \varrho < 2$, be a precision parameter, and let $\sigma$, $0 \le \sigma < 1/4$, be a slack parameter. Then, any algorithm that computes with positive probability for every metric space $M = (P, D)$, where $P$ is a set of points from the discrete Euclidean space $\{0, \ldots, \Delta\} \subseteq \mathbb{R}$ and $D$ is the Euclidean distance function defined on $P$, an (implicit or explicit) representation of another metric space $M' = (X', D')$ such that there is an embedding from $M$ to $M'$ with distortion $\varrho$ and slack $\sigma$ requires $\Omega(\log(\log(\Delta)))$ bits of memory.

Proof. For each $i \in \{1, \ldots, \lfloor \log \Delta \rfloor\}$, let $M_i$ be the metric space obtained by placing $|P|/2$ points at the coordinate 0 and $|P|/2$ points at the coordinate $2^i$. Any pair of these metric spaces differs by a factor of at least 2 in $|P|^2/4$ of its distances. Thus, there is no metric space $M'$ such that both $M_i$ and $M_j$, $i \ne j$, embed into $M'$ with distortion less than 2 and slack less than $1/4$. Hence, for each of these metric spaces, there must exist a unique state of an algorithm that computes such an embedding. It follows that the algorithm has at least $\lfloor \log \Delta \rfloor$ states and so it needs $\Omega(\log(\log(\Delta)))$ bits of memory to distinguish them.
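The counting argument in the proof of Theorem 16 can be illustrated on a small instance. The sketch below is only an illustration under simplifying assumptions: it matches the points of $M_i$ and $M_j$ by index (the proof has to handle arbitrary embeddings), it takes logarithms to base 2, and all function names are hypothetical.

```python
from itertools import combinations
from math import ceil, floor, log2


def two_cluster_metric(num_points: int, i: int) -> list[int]:
    """Coordinates of M_i: half of the points at 0, the other half at 2**i."""
    half = num_points // 2
    return [0] * half + [2 ** i] * half


def pairs_off_by_factor_two(coords_a: list[int], coords_b: list[int]) -> int:
    """Count index pairs whose two distances differ by a factor of at least 2."""
    count = 0
    for u, v in combinations(range(len(coords_a)), 2):
        dist_a = abs(coords_a[u] - coords_a[v])
        dist_b = abs(coords_b[u] - coords_b[v])
        small, large = min(dist_a, dist_b), max(dist_a, dist_b)
        # Pairs inside one cluster have distance 0 in both spaces and are skipped.
        if small > 0 and large >= 2 * small:
            count += 1
    return count


def state_lower_bound_bits(spread: int) -> int:
    """Bits needed to tell floor(log2(spread)) different states apart."""
    return ceil(log2(floor(log2(spread))))


if __name__ == "__main__":
    num_points = 8
    for i, j in [(1, 2), (1, 3), (2, 3)]:
        differing = pairs_off_by_factor_two(
            two_cluster_metric(num_points, i), two_cluster_metric(num_points, j)
        )
        print(i, j, differing, num_points ** 2 // 4)
    print(state_lower_bound_bits(2 ** 16))
```

For every pair $i \ne j$, exactly the $|P|^2/4$ cross-cluster pairs are counted, which matches the number used in the proof.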
9 Conclusions and Future Work

In this thesis, we developed facility location algorithms and embeddings with slack for huge datasets.

Chapter 3

We presented a randomized distributed algorithm for the facility location problem, considering both metric spaces and powers of metric spaces. For the special case of uniform costs and demands, our algorithm provides a constant-factor approximation using three communication rounds. We believe that our algorithm is particularly well-suited for facility-location-type problems in wireless networks. This is because the algorithm uses only a few broadcasts (in every communication round, each node sends the same message to its neighbors), which can be easily done in wireless networks.

In the analysis, we used the fact that the sum of the radii of the points is a constant-factor approximation of the expected total cost. This fact is not directly applicable to the non-uniform case, which means that our result cannot directly be generalized to the non-uniform metric facility location problem. However, motivated by our result, Pandit and Pemmaraju [97] obtained a constant-factor approximation in $O(\log(n))$ communication rounds for the variant of our considered metric facility location problem where the opening costs of the facilities are non-uniform. It would be interesting to find out whether $\Omega(\log(n))$ communication rounds are required or not.

Chapter 4

We initiated the study of a KDS for the mobile facility location problem. In particular, we proposed a KDS that maintains a subset of the moving input points as open facilities such that, at any time, the associated total cost is at most a constant factor larger than the current optimal cost. We showed that our KDS is compact, local, responsive, and efficient.

Note that the complexity of our KDS is polylogarithmic in $R$, which is a value depending on the opening cost and demand values of the input points.
Hence, the compactness, locality, responsiveness, and efficiency are not fully polylogarithmic, but only pseudo-polylogarithmic. An interesting direction for future work is to reduce this pseudo-polylogarithmic term to a truly polylogarithmic term. Furthermore, future work in the area of mobile facility location problems could consider an additional opening cost that arises at the point in time when a point changes its status from client to open facility. Here, we point out that in our scenario the additional opening cost per event would already be bounded because we open at most a logarithmic number of facilities per event.

Chapter 5

Chapter 5 addresses one of the central results in this thesis. In this chapter, we developed a randomized algorithm that computes a constant-factor approximation of the cost for the uniform facility location problem over dynamic geometric data streams. Our streaming algorithm strongly improves the best previous one, which guarantees an approximation factor of $O(\log^2(\Delta))$.

We think that it is worthwhile to further investigate the uniform facility location problem over dynamic geometric data streams. In particular, we are optimistic that one can obtain a $(1 \pm \varepsilon)$-approximation algorithm for the facility location cost. This might also provide new insights into handling other problems in the dynamic geometric data stream model, such as computing the earth mover distance or the minimum length of a traveling-salesperson tour. Future work could also include generalizing our results to the non-uniform facility location variant.

Chapter 6

We presented a streaming implementation of a k-means clustering algorithm that is based on a new coreset construction. We have shown that this algorithm is capable of efficiently clustering huge amounts of data in the insertion-only data stream model. To evaluate our algorithm, we ran a series of experiments on large real-world datasets.
We found empirical evidence that, in terms of the cost of the clustering, our algorithm is comparable with StreamLS and significantly better than BIRCH. In terms of the running time, our algorithm outperforms StreamLS, especially for a large number of centers $k$.

From a theoretical point of view, we showed that, for a precision parameter $\varepsilon$ with $0 < \varepsilon < 1$, our adaptive sampling approach computes a $(k, \varepsilon)$-coreset in constant-dimensional Euclidean space. However, the bound on the coreset size depends exponentially on the dimension $d$ (see Theorem 6). In line with our experiments, we conjecture that one can prove a size bound with a much lower dependency on the dimension. Also, from an experimental point of view, it would be interesting to examine the effect of the dimension on an appropriate coreset size more extensively.

Chapters 7 and 8

We considered compact representations of metric spaces. In Chapter 7, we introduced the notion of a WSPD with slack and gave constructions of WSPDs with slack for Euclidean and doubling metric spaces. In Chapter 8, we presented streaming algorithms to compute low-distortion embeddings with low slack for Euclidean, doubling, and general metric spaces. Furthermore, we used an embedding to obtain a randomized algorithm that computes a $(1 \pm \varepsilon)$-approximation of the max-cut problem for a dynamic geometric data stream of high-dimensional Euclidean points.

Metric embeddings with slack preserve much information about the original pairwise distances and can be stored in small space. For this reason, we believe that they are an important tool in the analysis of data streams and deserve further investigation.
A Additional Tables for Chapter 6

A.1 Parameters of Algorithm BIRCH

| | Covertype | Tower | Census 1990 | BigCross |
|---|---|---|---|---|
| p = | 10 | 5 | 5 | 25 |

Table A.1: Manual adjustment of the parameter TotalMemSize as percentage of the dataset size for algorithm BIRCH

| parameter | value |
|---|---|
| CorD | 0 |
| TotalMemSize (in bytes) | p% of dataset size |
| TotalBufferSize (in bytes) | 5% of TotalMemSize |
| TotalQueueSize (in bytes) | 5% of TotalMemSize |
| TotalOutlierTreeSize (in bytes) | 5% of TotalMemSize |
| WMflag | 0 |
| W vector | (1, 1, ..., 1) |
| M vector | (0, 0, ..., 0) |
| PageSize (in bytes) | 1024 |
| BDtype | 4 |
| Ftype | 0 |
| Phase1Scheme | 0 |
| RebuiltAlg | 0 |
| StatTimes | 3 |
| NoiseRate | 0.25 |
| Range | 2000 |
| CFDistr | 0 |
| H | 0 |
| Bars vector | (100, 100, ..., 100) |
| K | number of clusters k |
| InitFt | 0 |
| Ft | 0 |
| Gtype | 1 |
| GDtype | 2 |
| Qtype | 0 |
| RefineAlg | 1 |
| NoiseFlag | 0 |
| MaxRPass | 1 |

Table A.2: Setting of the parameters of algorithm BIRCH

A.2 Running Times of the Algorithms

Running time (in sec):

| dataset | k | StreamKM++ | StreamLS | BIRCH | k-Means++ | k-Means |
|---|---|---|---|---|---|---|
| Spambase | 10 | 3.06 | - | - | 3.57 | 19.02 |
| | 20 | 7.04 | - | - | 8.22 | 59.85 |
| | 30 | 16.45 | - | - | 19.05 | 88.8 |
| | 40 | 28.93 | - | - | 20.54 | 132.03 |
| | 50 | 44.48 | - | - | 25.9 | 182.08 |
| Intrusion | 10 | 74.1 | - | - | 50.6 | 408.8 |
| | 20 | 103.1 | - | - | 262.4 | 2711.3 |
| | 30 | 143.8 | - | - | 1973.3 | 4389.1 |
| | 40 | 197.6 | - | - | 1257.0 | 10733.7 |
| | 50 | 250.5 | - | - | 1339.5 | 14282.0 |
| Covertype | 10 | 245 | 147 | 44 | 3389 | - |
| | 20 | 297 | 460 | 44 | 5160 | - |
| | 30 | 378 | 1027 | 44 | 14933 | - |
| | 40 | 454 | 1773 | 44 | 16713 | - |
| | 50 | 617 | 2588 | 44 | 25803 | - |
| Tower | 20 | 157 | 679 | 77 | 2960 | - |
| | 40 | 168 | 1989 | 78 | 6902 | - |
| | 60 | 187 | 3849 | 77 | 11247 | - |
| | 80 | 211 | 6212 | 77 | 19206 | - |
| | 100 | 248 | 8946 | 77 | 17161 | - |
| Census 1990 | 10 | 1571 | 631 | 271 | - | - |
| | 20 | 1724 | 2362 | 271 | - | - |
| | 30 | 1839 | 5504 | 271 | - | - |
| | 40 | 1956 | 10054 | 272 | - | - |
| | 50 | 2057 | 11842 | 272 | - | - |
| BigCross | 15 | 5486 | 6239 | 1006 | - | - |
| | 20 | 5738 | 10502 | 998 | - | - |
| | 25 | 5933 | 15780 | 996 | - | - |
| | 30 | 6076 | 22779 | 996 | - | - |
| Normdata (m = 500) | 100 | 14.5 | 178.2 | - | - | - |
| | 125 | 14.9 | 401.8 | - | - | - |
| | 150 | 15.1 | 569.3 | - | - | - |
| | 175 | 15.1 | 659.3 | - | - | - |
| | 200 | 15.6 | 731.8 | - | - | - |
| Normdata (m = 1000) | 100 | 16.7 | 44.8 | - | - | - |
| | 125 | 17.1 | 92.6 | - | - | - |
| | 150 | 17.5 | 176.9 | - | - | - |
| | 175 | 17.6 | 378.1 | - | - | - |
| | 200 | 18.3 | 586.7 | - | - | - |

Table A.3: Average running times of the algorithms

A.3 Clustering Cost of the Algorithms

Cost:

| dataset | k | StreamKM++ | StreamLS | BIRCH | k-Means++ | k-Means |
|---|---|---|---|---|---|---|
| Spambase | 10 | 7.85 · 10^7 | - | - | 8.71 · 10^7 | 1.70 · 10^8 |
| | 20 | 2.27 · 10^7 | - | - | 2.45 · 10^7 | 1.53 · 10^8 |
| | 30 | 1.24 · 10^7 | - | - | 1.34 · 10^7 | 1.51 · 10^8 |
| | 40 | 8.64 · 10^6 | - | - | 9.01 · 10^6 | 1.49 · 10^8 |
| | 50 | 6.29 · 10^6 | - | - | 6.68 · 10^6 | 1.48 · 10^8 |
| Intrusion | 10 | 1.27 · 10^13 | - | - | 1.75 · 10^13 | 9.52 · 10^14 |
| | 20 | 1.26 · 10^12 | - | - | 1.55 · 10^12 | 9.51 · 10^14 |
| | 30 | 4.29 · 10^11 | - | - | 4.96 · 10^11 | 9.51 · 10^14 |
| | 40 | 1.95 · 10^11 | - | - | 2.25 · 10^11 | 9.50 · 10^14 |
| | 50 | 1.11 · 10^11 | - | - | 1.29 · 10^11 | 9.50 · 10^14 |
| Covertype | 10 | 3.43 · 10^11 | 3.42 · 10^11 | 4.24 · 10^11 | 3.42 · 10^11 | - |
| | 20 | 2.06 · 10^11 | 2.05 · 10^11 | 2.97 · 10^11 | 2.03 · 10^11 | - |
| | 30 | 1.57 · 10^11 | 1.56 · 10^11 | 1.89 · 10^11 | 1.54 · 10^11 | - |
| | 40 | 1.31 · 10^11 | 1.32 · 10^11 | 1.59 · 10^11 | 1.29 · 10^11 | - |
| | 50 | 1.15 · 10^11 | 1.18 · 10^11 | 1.41 · 10^11 | 1.13 · 10^11 | - |
| Tower | 20 | 6.24 · 10^8 | 6.16 · 10^8 | 9.26 · 10^8 | 6.51 · 10^8 | - |
| | 40 | 3.34 · 10^8 | 3.34 · 10^8 | 4.75 · 10^8 | 3.30 · 10^8 | - |
| | 60 | 2.43 · 10^8 | 2.37 · 10^8 | 3.89 · 10^8 | 2.40 · 10^8 | - |
| | 80 | 1.95 · 10^8 | 1.91 · 10^8 | 3.47 · 10^8 | 1.92 · 10^8 | - |
| | 100 | 1.65 · 10^8 | 1.63 · 10^8 | 2.98 · 10^8 | 1.63 · 10^8 | - |
| Census 1990 | 10 | 2.48 · 10^8 | 2.40 · 10^8 | 3.98 · 10^8 | - | - |
| | 20 | 1.90 · 10^8 | 1.85 · 10^8 | 3.17 · 10^8 | - | - |
| | 30 | 1.59 · 10^8 | 1.53 · 10^8 | 2.94 · 10^8 | - | - |
| | 40 | 1.41 · 10^8 | 1.35 · 10^8 | 2.78 · 10^8 | - | - |
| | 50 | 1.28 · 10^8 | 1.24 · 10^8 | 2.73 · 10^8 | - | - |
| BigCross | 15 | 5.05 · 10^12 | 5.23 · 10^12 | 6.69 · 10^12 | - | - |
| | 20 | 4.15 · 10^12 | 4.23 · 10^12 | 4.85 · 10^12 | - | - |
| | 25 | 3.59 · 10^12 | 3.54 · 10^12 | 4.45 · 10^12 | - | - |
| | 30 | 3.18 · 10^12 | 3.18 · 10^12 | 3.83 · 10^12 | - | - |
| Normdata (m = 500) | 100 | 1.50 · 10^6 | 1.50 · 10^6 | - | - | - |
| | 125 | 1.50 · 10^6 | 1.50 · 10^6 | - | - | - |
| | 150 | 1.50 · 10^6 | 1.50 · 10^6 | - | - | - |
| | 175 | 1.50 · 10^6 | 1.50 · 10^6 | - | - | - |
| | 200 | 1.50 · 10^6 | 1.50 · 10^6 | - | - | - |
| Normdata (m = 1000) | 100 | 1.50 · 10^6 | 1.50 · 10^6 | - | - | - |
| | 125 | 1.50 · 10^6 | 1.50 · 10^6 | - | - | - |
| | 150 | 1.50 · 10^6 | 1.50 · 10^6 | - | - | - |
| | 175 | 1.50 · 10^6 | 1.50 · 10^6 | - | - | - |
| | 200 | 1.50 · 10^6 | 1.50 · 10^6 | - | - | - |

Table A.4: Average clustering cost of the algorithms

A.4 Standard Deviation of Running Time and Cost

Running time (in sec):

| dataset | k | StreamKM++ | StreamLS | k-Means++ | k-Means |
|---|---|---|---|---|---|
| Spambase | 10 | 0.29 | - | 1.5 | 3.33 |
| | 20 | 1.09 | - | 3.88 | 6.36 |
| | 30 | 1.52 | - | 11.27 | 17.61 |
| | 40 | 6.56 | - | 6.97 | 26.95 |
| | 50 | 6.59 | - | 12.83 | 68.1 |
| Intrusion | 10 | 0.68 | - | 40.81 | 58.84 |
| | 20 | 3.22 | - | 98.11 | 499.7 |
| | 30 | 6.07 | - | 1263.44 | 345.6 |
| | 40 | 24.91 | - | 563.20 | 1306.2 |
| | 50 | 31.58 | - | 706.00 | 1190.78 |
| Covertype | 10 | 0.88 | 2.43 | 2295.85 | - |
| | 20 | 6.93 | 18.18 | 1249.18 | - |
| | 30 | 4.15 | 2.14 | 9653.06 | - |
| | 40 | 14.02 | 7.64 | 6838.93 | - |
| | 50 | 39.28 | 123.28 | 12231.98 | - |
| Tower | 20 | 0.58 | 14.11 | 1594.76 | - |
| | 40 | 1.79 | 50.83 | 2085.12 | - |
| | 60 | 3.96 | 58.27 | 3656.87 | - |
| | 80 | 7.95 | 122.65 | 5162.60 | - |
| | 100 | 11.34 | 315.31 | 1795.07 | - |
| Census 1990 | 10 | 2.04 | 9.08 | - | - |
| | 20 | 5.16 | 54.3 | - | - |
| | 30 | 5.38 | 98.03 | - | - |
| | 40 | 23.31 | 193.00 | - | - |
| | 50 | 17.43 | 533.39 | - | - |
| BigCross | 15 | 10.49 | 93.6 | - | - |
| | 20 | 11.49 | 162.44 | - | - |
| | 25 | 15.69 | 226.38 | - | - |
| | 30 | 16.66 | 200.68 | - | - |
| Normdata (m = 500) | 100 | 0.07 | 1.22 | - | - |
| | 125 | 0.05 | 1.14 | - | - |
| | 150 | 0.05 | 2.19 | - | - |
| | 175 | 0.03 | 2.89 | - | - |
| | 200 | 0.03 | 4.05 | - | - |
| Normdata (m = 1000) | 100 | 0.06 | 0.6 | - | - |
| | 125 | 0.06 | 1.32 | - | - |
| | 150 | 0.04 | 2.56 | - | - |
| | 175 | 0.08 | 3.96 | - | - |
| | 200 | 0.2 | 2.41 | - | - |

Table A.5: Standard deviation of the running time

Cost:

| dataset | k | StreamKM++ | StreamLS | k-Means++ | k-Means |
|---|---|---|---|---|---|
| Spambase | 10 | 2.05 · 10^6 | - | 9.57 · 10^6 | 1.06 · 10^6 |
| | 20 | 6.49 · 10^5 | - | 1.73 · 10^6 | 8.78 · 10^4 |
| | 30 | 3.14 · 10^5 | - | 9.51 · 10^5 | 8.81 · 10^4 |
| | 40 | 1.93 · 10^5 | - | 5.31 · 10^5 | 3.42 · 10^6 |
| | 50 | 1.49 · 10^5 | - | 2.47 · 10^5 | 2.91 · 10^6 |
| Intrusion | 10 | 1.39 · 10^12 | - | 6.61 · 10^12 | 3.09 · 10^11 |
| | 20 | 8.54 · 10^10 | - | 3.70 · 10^11 | 8.20 · 10^9 |
| | 30 | 3.13 · 10^10 | - | 6.85 · 10^10 | 2.54 · 10^10 |
| | 40 | 7.03 · 10^9 | - | 3.25 · 10^10 | 1.53 · 10^8 |
| | 50 | 6.01 · 10^9 | - | 1.61 · 10^10 | 6.82 · 10^8 |
| Covertype | 10 | 2.47 · 10^9 | 2.70 · 10^10 | 3.63 · 10^9 | - |
| | 20 | 1.08 · 10^9 | 1.03 · 10^10 | 9.17 · 10^8 | - |
| | 30 | 1.49 · 10^9 | 6.61 · 10^9 | 6.12 · 10^8 | - |
| | 40 | 8.38 · 10^8 | 5.63 · 10^9 | 6.64 · 10^8 | - |
| | 50 | 5.68 · 10^8 | 3.90 · 10^9 | 2.92 · 10^8 | - |
| Tower | 20 | 7.31 · 10^6 | 2.71 · 10^7 | 4.39 · 10^7 | - |
| | 40 | 1.85 · 10^6 | 1.65 · 10^7 | 4.37 · 10^6 | - |
| | 60 | 1.52 · 10^6 | 1.55 · 10^7 | 1.61 · 10^6 | - |
| | 80 | 1.03 · 10^6 | 9.63 · 10^6 | 1.54 · 10^6 | - |
| | 100 | 7.73 · 10^5 | 1.03 · 10^7 | 1.17 · 10^6 | - |
| Census 1990 | 10 | 5.02 · 10^6 | 1.45 · 10^5 | - | - |
| | 20 | 3.66 · 10^6 | 3.14 · 10^6 | - | - |
| | 30 | 1.61 · 10^6 | 9.34 · 10^5 | - | - |
| | 40 | 1.21 · 10^6 | 8.13 · 10^5 | - | - |
| | 50 | 1.01 · 10^6 | 6.80 · 10^5 | - | - |
| BigCross | 15 | 3.22 · 10^10 | 1.75 · 10^11 | - | - |
| | 20 | 2.46 · 10^10 | 3.36 · 10^11 | - | - |
| | 25 | 1.86 · 10^10 | 1.76 · 10^11 | - | - |
| | 30 | 1.94 · 10^10 | 1.29 · 10^11 | - | - |
| Normdata (m = 500) | 100 | 0 | 0 | - | - |
| | 125 | 0 | 0 | - | - |
| | 150 | 0 | 0 | - | - |
| | 175 | 0 | 0 | - | - |
| | 200 | 0 | 0 | - | - |
| Normdata (m = 1000) | 100 | 0 | 0 | - | - |
| | 125 | 0 | 0 | - | - |
| | 150 | 0 | 0 | - | - |
| | 175 | 0 | 0 | - | - |
| | 200 | 0 | 0 | - | - |

Table A.6: Standard deviation of the clustering cost

B Mathematical Fundamentals

This appendix deals with some mathematical fundamentals which are assumed to be common knowledge throughout this thesis. In Section B.1, we specify partial sums of some classical series and state some useful inequalities concerning Euler’s number. Section B.2 addresses probability theory.

B.1 Sequences, Series, and Inequalities

Arithmetic Series

Let $(a_1, a_2, \ldots)$ be any infinite arithmetic sequence, i.e., there is a fixed constant $d \in \mathbb{R}$, called the common difference, such that $a_i - a_{i-1} = d$ for all $i \in \mathbb{N} \setminus \{1\}$. It is well-known and can be proven by simple induction that the $n$-th partial sum of the associated infinite series is equal to
\[
  \sum_{i=1}^{n} a_i = \frac{n}{2} \cdot (a_1 + a_n) .
\]
In particular, the sum of the first $n$ natural numbers is equal to
\[
  \sum_{i=1}^{n} i = \frac{n(n+1)}{2} .
\]

Geometric Series

Let $(a_1, a_2, \ldots)$ be any infinite geometric sequence, i.e., there is a fixed constant $q \in \mathbb{R}$ with $q \ne 1$, called the common ratio, such that $a_i / a_{i-1} = q$ for all $i \in \mathbb{N} \setminus \{1\}$. It is well-known and can be proven by simple induction that the $n$-th partial sum of the associated infinite series is equal to
\[
  \sum_{i=1}^{n} a_i = a_1 \cdot \sum_{i=0}^{n-1} q^i = a_1 \cdot \frac{q^n - 1}{q - 1} .
\]
In case that $|q| < 1$, the sum of the infinite geometric series is
\[
  \sum_{i=1}^{\infty} a_i = a_1 \cdot \frac{1}{1 - q} .
\]

Bounds on Euler’s Number

In the following, we will state some useful inequalities concerning Euler’s number (see [91, 99]). For all $n \in \mathbb{N}$, we have
\[
  \left( 1 + \frac{1}{n} \right)^{n} \le e \le \left( 1 + \frac{1}{n} \right)^{n+1} \tag{B.1}
\]
and
\[
  \left( 1 - \frac{1}{n} \right)^{n} \le \frac{1}{e} \le \left( 1 - \frac{1}{n} \right)^{n-1} . \tag{B.2}
\]
These inequalities imply
\[
  \lim_{n \to \infty} \left( 1 + \frac{1}{n} \right)^{n} = e
  \quad \text{and} \quad
  \lim_{n \to \infty} \left( 1 - \frac{1}{n} \right)^{n} = \frac{1}{e} .
\]
Furthermore, it is known that, for all $n \in \mathbb{N}$, we have
\[
  \left( \frac{n}{e} \right)^{n} \le n! \le n^n .
\]

B.2 Probability Theory

This section addresses some basics in probability theory which are assumed to be common knowledge throughout this thesis. The interested reader can find a more general introduction in [89, 99].

The set $\Omega$ of all possible outcomes of a random experiment is called a sample space. In this thesis, we only consider discrete sample spaces, i.e., any considered sample space is a countable set of elementary events of the form $\Omega = \{\omega_1, \omega_2, \ldots, \omega_n\}$. The probability distribution on $\Omega$ is a function $p : \Omega \to \mathbb{R}$ which satisfies the following two conditions:

(i) the probability associated with any elementary event is non-negative, i.e., $p(\omega_i) \ge 0$ for any $\omega_i \in \Omega$,

(ii) the sum of probabilities over all elementary events is equal to 1, i.e., $\sum_{\omega_i \in \Omega} p(\omega_i) = 1$.

A subset of $\Omega$ is called an event. Any event $E \subseteq \Omega$ is said to be true if the outcome of the random experiment is some $\omega \in E$. Otherwise, $E$ is said to be false. The probability of $E$ is defined by
\[
  \Pr[E] := \sum_{\omega \in E} p(\omega) .
\]
The probability of an event $E_2$ assuming an event $E_1$ with $\Pr[E_1] > 0$ is called conditional probability and is defined as
\[
  \Pr[E_2 \mid E_1] := \frac{\Pr[E_1 \cap E_2]}{\Pr[E_1]} .
\]

Random Variables, Expectation, and Variance

A random variable is a function $X := X(\omega)$ defined on a sample space $\Omega$. In this thesis, we only consider discrete and real-valued random variables. Such a random variable is a function $X : \Omega \to \mathbb{R}$ which only takes isolated values with non-zero probabilities. A random variable $X$ is called an indicator random variable for an event $E$ in case that, for all $\omega \in \Omega$, we have $X(\omega) \in \{0, 1\}$ and $X(\omega) = 1$ if and only if $\omega \in E$.

For any discrete and real-valued random variable $X$ and any real $k \in \mathbb{R}$, we define $[X = k] := \{\omega \in \Omega \mid X(\omega) = k\}$.
Based on this definition, we use the abbreviations
\[
  \Pr[X \le k] := \sum_{\ell \le k \,:\, \ell \in X(\Omega)} \Pr[X = \ell]
  \quad \text{and} \quad
  \Pr[X \ge k] := \sum_{\ell \ge k \,:\, \ell \in X(\Omega)} \Pr[X = \ell] .
\]
Furthermore, for two random variables $X$ and $Y$, we use the abbreviations
\[
  \Pr[[X = k] \cap [Y = \ell]] := \Pr[X = k \wedge Y = \ell]
  \quad \text{and} \quad
  \Pr[[X = k] \cup [Y = \ell]] := \Pr[X = k \vee Y = \ell] .
\]
Two random variables $X$ and $Y$ are called independent if, for all $x, y \in \mathbb{R}$ with $\Pr[Y = y] > 0$,
\[
  \Pr[X = x \mid Y = y] = \Pr[X = x] .
\]
A set $X_0, X_1, \ldots, X_{n-1}$ of random variables is called independent if, for all $i \in [n]$ and $I \subseteq [n] \setminus \{i\}$,
\[
  \Pr\left[ X_i = x_i \,\middle|\, \bigwedge_{j \in I} X_j = x_j \right] = \Pr[X_i = x_i] \tag{B.3}
\]
for all $x_i \in \mathbb{R}$ and $x_j \in \mathbb{R}$ with $j \in I$. A set $X_0, X_1, \ldots, X_{n-1}$ of random variables is called $k$-wise independent if (B.3) holds for all $I \subseteq [n] \setminus \{i\}$ with $|I| \le k$.

The expectation of any discrete and real-valued random variable $X$ is defined as
\[
  E[X] := \sum_{k \in X(\Omega)} k \cdot \Pr[X = k] .
\]
The following three properties of the expectation are often used in the analysis of randomized algorithms. Proofs can be found in [89].

(i) For any random variable $X$ and any real $k \in \mathbb{R}$, we have $E[k \cdot X] = k \cdot E[X]$.

(ii) For any two random variables $X$ and $Y$, we have $E[X + Y] = E[X] + E[Y]$. This property is called linearity of expectation.

(iii) For any two independent random variables $X$ and $Y$, $E[X \cdot Y] = E[X] \cdot E[Y]$.

The variance of any random variable $X$ is defined as
\[
  V[X] := E\left[ (X - E[X])^2 \right] .
\]
By expanding the term $(X - E[X])^2$, we get
\[
  V[X] = E\left[ (X - E[X])^2 \right]
       = E\left[ X^2 - 2X \cdot E[X] + E[X]^2 \right]
       = E\left[ X^2 \right] - 2 \cdot E[X] \cdot E[X] + E[X]^2
       = E\left[ X^2 \right] - E[X]^2 .
\]
The following three properties of the variance are often used in the analysis of randomized algorithms. Proofs can be found in [89].

(i) For any random variable $X$, we have $V[X] \ge 0$.

(ii) For any random variable $X$ and any two reals $a, b \in \mathbb{R}$, $V[a + b \cdot X] = b^2 \cdot V[X]$.

(iii) For any two independent random variables $X$ and $Y$, $V[X + Y] = V[X] + V[Y]$.

The standard deviation of a random variable $X$ is defined as $\sigma := \sqrt{V[X]}$.
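The definitions and properties of expectation and variance stated above can be checked on a small discrete example. The following sketch is only an illustration: it represents a distribution as a dictionary from values to exact probabilities, uses a fair six-sided die as the example, and all function names are hypothetical.

```python
from fractions import Fraction

Distribution = dict[int, Fraction]  # maps each value k to Pr[X = k]


def expectation(dist: Distribution) -> Fraction:
    """E[X] = sum over k of k * Pr[X = k]."""
    return sum((k * p for k, p in dist.items()), Fraction(0))


def variance(dist: Distribution) -> Fraction:
    """V[X] = E[(X - E[X])^2]."""
    mean = expectation(dist)
    return sum(((k - mean) ** 2 * p for k, p in dist.items()), Fraction(0))


def variance_via_moments(dist: Distribution) -> Fraction:
    """V[X] = E[X^2] - E[X]^2."""
    second_moment = sum((k * k * p for k, p in dist.items()), Fraction(0))
    return second_moment - expectation(dist) ** 2


def affine(dist: Distribution, a: int, b: int) -> Distribution:
    """Distribution of a + b * X."""
    out: Distribution = {}
    for k, p in dist.items():
        out[a + b * k] = out.get(a + b * k, Fraction(0)) + p
    return out


def sum_independent(dist_x: Distribution, dist_y: Distribution) -> Distribution:
    """Distribution of X + Y for independent X and Y."""
    out: Distribution = {}
    for kx, px in dist_x.items():
        for ky, py in dist_y.items():
            out[kx + ky] = out.get(kx + ky, Fraction(0)) + px * py
    return out


if __name__ == "__main__":
    die = {k: Fraction(1, 6) for k in range(1, 7)}
    print(expectation(die), variance(die), variance_via_moments(die))
    print(variance(affine(die, 3, 2)), variance(sum_independent(die, die)))
```

Exact rational arithmetic is used so that the identities hold with equality rather than up to floating-point error.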
Useful Inequalities

The following inequalities are frequently used in the analysis of our randomized algorithms.

Union Bound

Let $E_1$ and $E_2$ be any two events. Then, we have
\[
  \Pr[E_1 \vee E_2] \le \Pr[E_1] + \Pr[E_2] .
\]
The union bound is implied by the inclusion-exclusion principle from combinatorics.

Markov’s Inequality

Let $X : \Omega \to \mathbb{R}_{\ge 0}$ be a non-negative random variable. Then, for any $k \in \mathbb{R}$ with $k > 0$, we have
\[
  \Pr[X \ge k] \le \frac{E[X]}{k} .
\]
A proof of Markov’s inequality can be found in [89].

Chebyshev’s Inequality

Let $X : \Omega \to \mathbb{R}$ be a random variable. Then, for any $k \in \mathbb{R}$ with $k > 0$, we have
\[
  \Pr[\,|X - E[X]| \ge k\,] \le \frac{V[X]}{k^2} .
\]
Chebyshev’s inequality follows from Markov’s inequality. A formal proof can be found in [89].

Chernoff Bounds

Let $X_1, X_2, \ldots, X_n : \Omega \to \{0, 1\}$ be a set of independent 0-1-random variables. Then, it holds for all $\varepsilon \ge 0$ that
\[
  \Pr\left[ \sum_{i=1}^{n} X_i \ge (1 + \varepsilon) \cdot E\left[ \sum_{i=1}^{n} X_i \right] \right]
  \le \exp\left( - \frac{1}{3} \cdot \min\{\varepsilon, \varepsilon^2\} \cdot E\left[ \sum_{i=1}^{n} X_i \right] \right) .
\]
Furthermore, it holds for all $\varepsilon$, $0 \le \varepsilon \le 1$, that
\[
  \Pr\left[ \sum_{i=1}^{n} X_i \le (1 - \varepsilon) \cdot E\left[ \sum_{i=1}^{n} X_i \right] \right]
  \le \exp\left( - \frac{1}{2} \cdot \varepsilon^2 \cdot E\left[ \sum_{i=1}^{n} X_i \right] \right) .
\]
A formal proof can be found in [86].

Bibliography

[1] I. Abraham, Y. Bartal, and O. Neiman. Advances in metric embedding theory. In Proceedings of the 38th Annual ACM Symposium on Theory of Computing (STOC ’06), pages 271–286. Association for Computing Machinery, 2006.
[2] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Approximating extent measures of points. Journal of the ACM (JACM), 51(4):606–635, July 2004.
[3] A. Aggarwal, A. Deshpande, and R. Kannan. Adaptive sampling for k-means clustering. In Proceedings of the 12th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems (APPROX ’10), pages 15–28. Springer, 2009.
[4] N. Ailon, R. Jaiswal, and C. Monteleoni. Streaming k-means approximation. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A.
Culotta, editors, Advances in Neural Information Processing Systems 22, pages 10–18. MIT Press, 2009.
[5] N. Alon, L. Babai, and A. Itai. A fast and simple randomized parallel algorithm for the maximal independent set problem. Journal of Algorithms, 7(4):567–583, 1986.
[6] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences (JCSS), 58(1):137–147, February 1999.
[7] S. Arora. Polynomial time approximation schemes for Euclidean traveling salesman and other geometric problems. Journal of the ACM (JACM), 45(5):753–782, September 1998.
[8] S. Arora, P. Raghavan, and S. Rao. Approximation schemes for Euclidean k-medians and related problems. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing (STOC ’98), pages 106–113. Association for Computing Machinery, 1998.
[9] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’07), pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
[10] S. Arya, G. Das, D. M. Mount, J. S. Salowe, and M. H. M. Smid. Euclidean spanners: Short, thin, and lanky. In Proceedings of the 27th Annual ACM Symposium on Theory of Computing (STOC ’95), pages 489–498. Association for Computing Machinery, 1995.
[11] S. Arya, D. M. Mount, and M. H. M. Smid. Randomized and deterministic algorithms for geometric spanners of small diameter. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS ’94), pages 703–712. IEEE Computer Society, 1994.
[12] V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit. Local search heuristics for k-median and facility location problems. SIAM Journal on Computing, 33(3):544–562, 2004.
[13] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.
University of California, Irvine, School of Information and Computer Sciences, available at http://www.ics.uci.edu/~mlearn/MLRepository.html.
[14] M. Bădoiu, A. Czumaj, P. Indyk, and C. Sohler. Facility location in sublinear time. In Proceedings of the 32nd Annual International Colloquium on Automata, Languages and Programming (ICALP ’05), volume 3580, pages 866–877. Springer, 2005.
[15] J. Basch, L. J. Guibas, and J. Hershberger. Data structures for mobile data. Journal of Algorithms, 31(1):1–28, 1999.
[16] J. L. Bentley and J. B. Saxe. Decomposable searching problems I. Static-to-dynamic transformation. Journal of Algorithms, 1(4):301–358, December 1980.
[17] S. Bereg, B. K. Bhattacharya, D. G. Kirkpatrick, and M. Segal. Competitive algorithms for maintaining a mobile center. MONET, 11(2):177–186, April 2006.
[18] J. Byrka and K. Aardal. An optimal bifactor approximation algorithm for the metric uncapacitated facility location problem. SIAM Journal on Computing, 39(6):2212–2231, March 2010.
[19] P. B. Callahan. Optimal parallel all-nearest-neighbors using the well-separated pair decomposition (preliminary version). In Proceedings of the 34th Annual Symposium on Foundations of Computer Science (FOCS ’93), pages 332–340. IEEE Computer Society, 1993.
[20] P. B. Callahan. Dealing with higher dimensions: The well-separated pair decomposition and its applications. PhD thesis, Johns Hopkins University, Baltimore, Maryland, 1995.
[21] P. B. Callahan and S. R. Kosaraju. Algorithms for dynamic closest pair and n-body potential fields. In Proceedings of the 6th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’95), pages 263–272. Society for Industrial and Applied Mathematics, 1995.
[22] P. B. Callahan and S. R. Kosaraju. A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields. Journal of the ACM (JACM), 42(1):67–90, 1995.
[23] T.-H. H. Chan, K. Dhamdhere, A. Gupta, J.
Kleinberg, and A. Slivkins. Metric embeddings with relaxed guarantees. SIAM Journal on Computing, 38(6):2303–2329, March 2009.
[24] T.-H. H. Chan, M. Dinitz, and A. Gupta. Spanners with slack. In Proceedings of the 14th Annual European Symposium on Algorithms (ESA ’06), pages 196–207. Springer, 2006.
[25] T. M. Chan. Well-separated pair decomposition in linear time? Information Processing Letters, 107(5):138–141, August 2008.
[26] K. L. Chang. Pass-efficient algorithms for facility location. Technical Report YALEU/DCS/TR-1337, Yale University, November 2005.
[27] M. Charikar and S. Guha. Improved combinatorial algorithms for facility location problems. SIAM Journal on Computing, 34(4):803–824, 2005.
[28] K. Chen. On k-median clustering in high dimensions. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’06), pages 1177–1185. Association for Computing Machinery, 2006.
[29] K. Chen. On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM Journal on Computing, 39(3):923–947, August 2009.
[30] F. A. Chudak and D. B. Shmoys. Improved approximation algorithms for the uncapacitated facility location problem. SIAM Journal on Computing, 33(1):1–25, 2003.
[31] A. Czumaj, G. Frahling, and C. Sohler. Efficient kinetic data structures for max-cut. In Proceedings of the 19th Canadian Conference on Computational Geometry (CCCG ’07), pages 157–160. Carleton University, Ottawa, Canada, 2007.
[32] A. Czumaj and C. Sohler. Small space representations for metric min-sum k-clustering and their applications. Theory of Computing Systems, 46(3):416–442, April 2010.
[33] M. de Berg, O. Cheong, M. van Kreveld, and M. Overmars. Computational Geometry: Algorithms and Applications. Springer, 3rd edition, 2008.
[34] J. Erickson. Dense point sets have sparse Delaunay triangulations or "... but not too nasty". Discrete & Computational Geometry, 33(1):83–115, January 2005.
[35] J.
Fakcharoenphol, S. Rao, and K. Talwar. A tight bound on approximating arbitrary metrics by tree metrics. Journal of Computer and System Sciences (JCSS), 69(3):485–497, November 2004.
[36] D. Feldman, A. Fiat, and M. Sharir. Coresets for weighted facilities and their applications. In Proceedings of the 47th IEEE Symposium on Foundations of Computer Science (FOCS ’06), pages 315–324. IEEE Computer Society, 2006.
[37] D. Feldman, M. Monemizadeh, and C. Sohler. A PTAS for k-means clustering based on weak coresets. In Proceedings of the 23rd ACM Symposium on Computational Geometry (SCG ’07), pages 11–18. Association for Computing Machinery, 2007.
[38] J. Fischer and S. Har-Peled. Dynamic well-separated pair decomposition made easy. In Proceedings of the 17th Canadian Conference on Computational Geometry (CCCG ’05), pages 235–238. University of Windsor, Ontario, Canada, 2005.
[39] E. W. Forgy. Cluster analysis of multivariate data: Efficiency versus interpretability of classifications. Biometrics, 21:768–780, 1965.
[40] D. Fotakis. Incremental algorithms for facility location and k-median. Theoretical Computer Science, 361(2–3):275–313, September 2006.
[41] D. Fotakis. Memoryless facility location in one pass. In Proceedings of the 23rd Annual Symposium on Theoretical Aspects of Computer Science (STACS ’06), pages 608–620. Springer, 2006.
[42] G. Frahling. Algorithms for Dynamic Geometric Data Streams. PhD thesis, University of Paderborn, 2006.
[43] G. Frahling, P. Indyk, and C. Sohler. Sampling in dynamic data streams and applications. International Journal of Computational Geometry and Applications (IJCGA), 18(1–2):3–28, 2008.
[44] G. Frahling and C. Sohler. Coresets in dynamic geometric data streams. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC ’05), pages 209–217. Association for Computing Machinery, 2005.
[45] S. A. Friedler and D. M. Mount. Approximation algorithm for the kinetic robust k-center problem.
Computational Geometry, 43(6–7):572–586, August 2010.
[46] J. Gao, L. J. Guibas, J. Hershberger, L. Zhang, and A. Zhu. Discrete mobile centers. Discrete & Computational Geometry, 30(1):45–63, July 2003.
[47] J. Gao, L. J. Guibas, and A. T. Nguyen. Deformable spanners and applications. Computational Geometry, 35(1–2):2–19, August 2006.
[48] J. Gao and L. Zhang. Well-separated pair decomposition for the unit-disk graph metric and its applications. SIAM Journal on Computing, 35(1):151–169, 2005.
[49] B. Gfeller and E. Vicari. A randomized distributed algorithm for the maximal independent set problem in growth-bounded graphs. In Proceedings of the 26th Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC ’07), pages 53–60. Association for Computing Machinery, 2007.
[50] J. Gudmundsson, C. Levcopoulos, G. Narasimhan, and M. H. M. Smid. Approximate distance oracles for geometric graphs. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’02), pages 828–837. Society for Industrial and Applied Mathematics, 2002.
[51] S. Guha and S. Khuller. Greedy strikes back: Improved facility location algorithms. Journal of Algorithms, 31(1):228–248, April 1999.
[52] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering (TKDE), 15(3):515–528, January/February 2003.
[53] S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering data streams. In Proceedings of the 41st Symposium on Foundations of Computer Science (FOCS ’00), pages 359–366. IEEE Computer Society, 2000.
[54] L. J. Guibas. Kinetic data structures: A state of the art report. In Proceedings of the 3rd Workshop on the Algorithmic Foundations of Robotics (WAFR ’98), pages 191–209. A. K. Peters, Ltd., 1998.
[55] L. J. Guibas. Kinetic data structures. In D. P. Mehta and S. Sahni, editors, Handbook of Data Structures and Applications.
Chapman and Hall/CRC, 2004.
[56] S. Har-Peled. Clustering motion. Discrete & Computational Geometry, 31(4):545–565, March 2004.
[57] S. Har-Peled and A. Kushal. Smaller coresets for k-median and k-means clustering. Discrete & Computational Geometry, 37(1):3–19, January 2007.
[58] S. Har-Peled and S. Mazumdar. On coresets for k-means and k-median clustering. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC ’04), pages 291–300. Association for Computing Machinery, 2004.
[59] S. Har-Peled and M. Mendel. Fast construction of nets in low-dimensional metrics and their applications. SIAM Journal on Computing, 35(5):1148–1184, 2006.
[60] J. Hershberger. Smooth kinetic maintenance of clusters. Computational Geometry, 31(1–2):3–30, May 2005.
[61] P. Indyk. Sublinear time algorithms for metric space problems. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing (STOC ’99), pages 428–434. Association for Computing Machinery, 1999.
[62] P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proceedings of the 41st IEEE Symposium on Foundations of Computer Science (FOCS ’00), pages 189–197. IEEE Computer Society, 2000.
[63] P. Indyk. Algorithmic applications of low-distortion geometric embeddings. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science (FOCS ’01), pages 10–33. IEEE Computer Society, 2001.
[64] P. Indyk. Algorithms for dynamic geometric problems over data streams. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC ’04), pages 373–380. Association for Computing Machinery, 2004.
[65] P. Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM (JACM), 53(3):307–323, May 2006.
[66] P. Indyk. Sketching, streaming and sub-linear space algorithms. Graduate course notes, available at http://stellar.mit.edu/S/course/6/fa07/6.895/, 2007.
[67] P. Indyk and J.
Matousek. Low-distortion embeddings of finite metric spaces. In J. E. Goodman and J. O’Rourke, editors, Handbook of Discrete and Computational Geometry. Chapman and Hall/CRC, 2004.
[68] K. Jain, M. Mahdian, E. Markakis, A. Saberi, and V. V. Vazirani. Greedy facility location algorithms analyzed using dual fitting with factor-revealing LP. Journal of the ACM (JACM), 50(6):795–824, November 2003.
[69] K. Jain, M. Mahdian, and A. Saberi. A new greedy approach for facility location problems. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC ’02), pages 731–740. Association for Computing Machinery, 2002.
[70] K. Jain and V. V. Vazirani. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. Journal of the ACM (JACM), 48(2):274–296, March 2001.
[71] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability, volume 26 of Contemporary Mathematics, pages 189–206. American Mathematical Society, 1984.
[72] D. M. Kane, J. Nelson, and D. P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS ’10), pages 41–52. Association for Computing Machinery, 2010.
[73] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2–3):89–112, June 2004.
[74] D. R. Karger and M. Ruhl. Finding nearest neighbors in growth-restricted metrics. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC ’02), pages 741–750. Association for Computing Machinery, 2002.
[75] J. M. Kleinberg, A. Slivkins, and T. Wexler. Triangulation and embedding using small sets of beacons. Journal of the ACM (JACM), 56(6), September 2009.
[76] S. G. Kolliopoulos and S. Rao.
A nearly linear-time approximation scheme for the Euclidean k-median problem. SIAM Journal on Computing, 37(3):757–782, 2007.
[77] M. R. Korupolu, C. G. Plaxton, and R. Rajaraman. Analysis of a local search heuristic for facility location problems. Journal of Algorithms, 37(1):146–188, October 2000.
[78] C. Levcopoulos, G. Narasimhan, and M. H. M. Smid. Improved algorithms for constructing fault-tolerant spanners. Algorithmica, 32(1):144–156, 2002.
[79] J.-H. Lin and J. S. Vitter. Approximation algorithms for geometric median problems. Information Processing Letters, 44(5):245–249, 1992.
[80] S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, March 1982.
[81] N. A. Lynch. Distributed Algorithms. Morgan Kaufmann, 1996.
[82] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.
[83] M. Mahdian, Y. Ye, and J. Zhang. Approximation algorithms for metric facility location problems. SIAM Journal on Computing, 36(2):411–432, 2006.
[84] J. Matousek. Lectures on Discrete Geometry. Graduate Texts in Mathematics. Springer, 1st edition, 2002.
[85] M. Matsumoto and T. Nishimura. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation (TOMACS), 8(1):3–30, January 1998.
[86] C. J. H. McDiarmid. Concentration. In Probabilistic Methods for Algorithmic Discrete Mathematics, volume 16 of Algorithms and Combinatorics, pages 195–248. Springer, 1998.
[87] R. R. Mettu and C. G. Plaxton. The online median problem. SIAM Journal on Computing, 32(3):816–832, 2003.
[88] A. Meyerson. Online facility location. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science (FOCS ’01), pages 426–431. IEEE Computer Society, 2001.
[89] M. Mitzenmacher and E. Upfal. Probability and Computing. Cambridge University Press, 2005.
[90] T. Moscibroda and R. Wattenhofer. Facility location: Distributed approximation. In Proceedings of the 24th Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC ’05), pages 108–117. Association for Computing Machinery, 2005.
[91] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[92] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2), 2005.
[93] S. Muthukrishnan. Data stream algorithms. Notes from Barbados Complexity Theory Meeting, available at http://sites.google.com/site/algoresearch/datastreamalgorithms, 2009.
[94] G. Narasimhan and M. H. M. Smid. Approximating the stretch factor of Euclidean graphs. SIAM Journal on Computing, 30(3):978–989, 2000.
[95] N. Nisan. Pseudorandom generators for space-bounded computation. Combinatorica, 12(4):449–461, December 1992.
[96] L. O’Callaghan, A. Meyerson, R. Motwani, N. Mishra, and S. Guha. Streaming-data algorithms for high-quality clustering. In Proceedings of the 18th International Conference on Data Engineering (ICDE ’02), pages 685–696. IEEE Computer Society, 2002.
[97] S. Pandit and S. V. Pemmaraju. Return of the primal-dual: Distributed metric facility location. In Proceedings of the 28th Annual ACM Symposium on Principles of Distributed Computing (PODC ’09), pages 180–189. Association for Computing Machinery, 2009.
[98] D. Peleg. Distributed Computing: A Locality-Sensitive Approach. SIAM Monographs on Discrete Mathematics and Applications, 2000.
[99] C. Scheideler. Probabilistic Methods for Coordination Problems. Habilitation thesis, University of Paderborn, 2000.
[100] S. Z. Selim and M. A. Ismail. k-means-type algorithms: A generalized convergence theorem and characterization of local optimality.
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 6(1):81–87, January 1984.
[101] D. B. Shmoys. Approximation algorithms for facility location problems. In Proceedings of the 3rd International Workshop on Approximation Algorithms for Combinatorial Optimization Problems (APPROX ’00), pages 27–33. Springer, 2000.
[102] D. B. Shmoys, É. Tardos, and K. Aardal. Approximation algorithms for facility location problems. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing (STOC ’97), pages 265–274. Association for Computing Machinery, 1997.
[103] M. H. M. Smid. The well-separated pair decomposition and its applications. In T. F. Gonzalez, editor, Handbook of Approximation Algorithms and Metaheuristics. Chapman & Hall/CRC, 2007.
[104] M. Sviridenko. An improved approximation algorithm for the metric uncapacitated facility location problem. In Proceedings of the 9th International Conference on Integer Programming and Combinatorial Optimization (IPCO ’02), pages 240–257. Springer, 2002.
[105] K. Talwar. Bypassing the embedding: Algorithms for low dimensional metrics. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC ’04), pages 281–290. Association for Computing Machinery, 2004.
[106] M. Thorup. Quick k-median, k-center, and facility location for sparse graphs. SIAM Journal on Computing, 34(2):405–432, 2004.
[107] B. Von Herzen and A. H. Barr. Accurate triangulations of deformed, intersecting surfaces. In Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’87), pages 103–110. Association for Computing Machinery, 1987.
[108] J. Vygen. Approximation algorithms for facility location problems (Lecture notes). Technical Report 05950-OR, Research Institute for Discrete Mathematics, University of Bonn, 2005. Available at http://www.or.uni-bonn.de/~vygen/files/fl.pdf.
[109] D. E. Willard and G. S. Lueker.
Adding range restriction capability to dynamic data structures. Journal of the ACM (JACM), 32(3):597–617, July 1985.
[110] B. Yao, F. Li, and P. Kumar. Reverse furthest neighbors in spatial databases. In Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE ’09), pages 664–675. IEEE Computer Society, 2009.
[111] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1(2):141–182, June 1997.