Approximation Techniques for
Facility Location and Their Applications
in Metric Embeddings
Dissertation
submitted for the degree of
Doctor of Natural Sciences (Doktor der Naturwissenschaften)
at Technische Universität Dortmund,
Faculty of Computer Science (Fakultät für Informatik)
by
Christiane Lammersen
Dortmund
2010
Date of oral examination: 7 December 2010
Dean: Prof. Dr. Peter Buchholz
Reviewers: Prof. Dr. Christian Sohler,
Prof. Dr. Friedhelm Meyer auf der Heide
Abstract
This thesis addresses the development of geometric approximation algorithms for huge
datasets and is subdivided into two parts. The first part deals with algorithms for
facility location problems, and the second part is concerned with the problem of computing
compact representations of finite metric spaces.
Facility location problems are among the most studied problems in combinatorial
optimization and operations research. In the facility location variants considered in this thesis,
the input consists of a set of points where each point is a client as well as a potential
location for a facility. Each client has to be served by a facility. However, connecting a
client incurs connection costs, and opening or maintaining a facility causes so-called
opening costs. The goal is to open a subset of the input points as facilities such that the total
cost of the system is minimized.
We are particularly interested in facility location problems for large-scale distributed
systems of mobile objects. In order to be able to analyze such complex systems, we
examine the following partial aspects:
• First, we present a distributed algorithm that, for uniform opening costs of the
facilities and uniform demands of the clients, computes a constant-factor
approximation for the metric facility location problem in only three
communication rounds.
• In Chapter 4, we introduce a mobile facility location problem where the input points
move continuously in a constant-dimensional Euclidean space. In contrast to
Chapter 3, we also take non-uniform opening costs for the facilities and non-uniform
demands of the clients into account. We propose an event-driven data structure that
efficiently maintains a subset of the mobile points as open facilities such that, at any
time, the total cost of the system is at most a constant factor larger than the optimal
facility location cost.
• In Chapter 5, we again consider a uniform facility location problem. This time,
however, we develop a streaming algorithm where the input stream consists of insert
and delete operations of points from a constant-dimensional Euclidean space. While
reading the input stream, our algorithm maintains a carefully constructed summary of
the current point set such that the required space is polylogarithmic in the size
of the input stream and, at any time, the algorithm can output a constant-factor
approximation of the optimal facility location cost.
• In the next chapter, we give an efficient streaming implementation of a k-means
clustering algorithm. The k-means clustering problem is closely related to the facility
location problem. The goal is to place k facilities, the so-called cluster centers, such
that the sum of the squared distances of the points to their nearest cluster center is
minimized. Our algorithm is based on a coreset construction. A coreset is a small
weighted point set that approximates the input point set with respect to the k-means
clustering problem.
In the second part of this thesis, we study compact representations of finite metric spaces.
Our representations have the property that a large fraction of all the pairwise distances
between the points is approximately preserved and only a small fraction of all the pairwise
distances, the so-called slack, can be arbitrarily distorted. Constructions of such
representations are an important tool in the analysis of huge datasets.
In Chapter 7, we apply some space-partitioning techniques from Chapter 5 to construct
well-separated pair decompositions with slack for low-dimensional Euclidean spaces. We
also show how to transfer this approach to doubling metric spaces.
Afterwards, we extend our techniques to obtain streaming algorithms that compute
embeddings with a distortion of at most 1 + ε and with low slack for high-dimensional
Euclidean spaces and doubling metric spaces. Furthermore, we investigate embeddings
with low distortion and low slack for general metric spaces given as a data stream of
points. In addition, we show how to use embedding techniques to obtain a (1 ± ε)-approximation
algorithm for the high-dimensional Euclidean max-cut problem where the input stream
consists of insert and delete operations of points. All of our streaming algorithms need
only space that is polylogarithmic in the size of the input stream.
Zusammenfassung (Summary)
This dissertation deals with the development of geometric approximation algorithms
for large datasets and is divided into two parts. The first part covers algorithms for
several kinds of facility location problems, and the second part covers constructions
of compact representations of finite metric spaces.
Facility location problems are among the most studied problems in combinatorial
optimization and operations research. In the facility location variants we consider,
the input is a set of points that represent both the locations of clients and the
possible locations of facilities. Each client is to be served by a facility. Clients
incur connection costs, and opening or maintaining the facilities in use incurs
so-called opening costs. The goal is to open a subset of the possible facilities such
that the total cost is minimized.
Among facility location problems, we are particularly interested in huge distributed
systems of mobile sites. To be able to study such complex systems, we have examined
the following partial aspects in detail:
• First, we present a distributed algorithm that, for uniform opening costs of the
facilities and uniform demands of the clients, outputs a constant-factor
approximation for the metric facility location problem in only three
communication rounds.
• In Chapter 4, we introduce a mobile facility location problem in which the input
points move continuously in a low-dimensional Euclidean space. In contrast to
Chapter 3, this time we also take non-uniform opening costs for the facilities and
non-uniform demands of the clients into account. We give an event-driven data
structure that efficiently maintains a subset of the moving points as open
facilities such that, at any time, the total cost of the system is at most a
constant factor larger than the optimal total cost.
• In Chapter 5, we again consider a uniform facility location problem. This time,
however, we develop an algorithm that can process data streams consisting of insert
and delete operations of points in a low-dimensional Euclidean space. While reading
the input, our streaming algorithm cleverly maintains a summary of the current point
set such that the space it uses is polylogarithmic in the size of the input stream
and, at any time, a constant-factor approximation of the optimal total cost for the
facility location problem can be output.
• In the next chapter, we give an efficient streaming implementation of a k-means
clustering algorithm. The k-means clustering problem is related to the facility
location problem. Here, k cluster centers are to be placed such that the squared
distances of the points to their respective nearest cluster center are minimized.
Our algorithm is based on a new coreset construction. A coreset is a small weighted
point set that approximates the input point set with respect to the k-means
clustering problem.
In the second part of the dissertation, we deal with compact representations of
finite metric spaces. Our representations have the property that they preserve a
large fraction of the pairwise distances well and distort only a small fraction, the
so-called slack, arbitrarily. Constructions of such representations are an important
tool in the analysis of large datasets.
In Chapter 7, we apply some space-partitioning techniques from Chapter 5 to construct
well-separated pair decompositions with slack for low-dimensional Euclidean spaces.
We also show how this approach can be transferred to doubling metrics.
Afterwards, we extend our techniques to obtain streaming algorithms that compute
embeddings with a distortion of at most 1 + ε and low slack of high-dimensional
Euclidean spaces and doubling metrics. Furthermore, we investigate embeddings with
low distortion and low slack for general metrics that are given as a data stream of
points. In addition, we show that embedding techniques can be used to obtain a
(1 ± ε)-approximation algorithm for the high-dimensional Euclidean max-cut problem,
where the input stream consists of insert and delete operations of points. All of our
streaming algorithms need space that is only polylogarithmic in the size of the
respective input stream.
Acknowledgments
First and foremost, I would like to thank my advisor, Prof. Dr. Christian Sohler, for giving
me the opportunity to work with him and under his supervision. I benefited in many ways
from his great support. He integrated me into important research communities at an early
stage. Even before I started my PhD studies, he gave me the opportunity to attend a summer
school on data stream algorithms and a Dagstuhl seminar on sublinear algorithms. Throughout
this time, his guidance was invaluable to me. Whenever I got stuck with a problem, I
felt free to ask for his advice, which always led to helpful new ideas. It was also of
great importance to me that he kept faith in my skills. Even when I thought that I would
not be able to cope with a challenge, my advisor encouraged me to try, and it always
worked out. In a nutshell, I greatly enjoyed working with him!
Special thanks also go to my co-advisor, Prof. Dr. Friedhelm Meyer auf der Heide. His
comments and questions during seminar talks were really helpful in improving my thesis.
I am also very pleased that he welcomed me so warmly each time I visited his
research group in Paderborn.
The results in this thesis have emerged from collaborations with many smart people.
I would like to express my sincere thanks to my co-authors Dr. Marcel Ackermann, Dr. Bastian
Degener, Joachim Gehweiler, Marcus Märtens, Christoph Raupach, Dr. Anastasios
Sidiropoulos, Prof. Dr. Christian Sohler, and Kamil Swierkot. It was a pleasure for me to
collaborate with all of them.
During my PhD studies, I was a research assistant at Universität Paderborn,
Universität Bonn, and Technische Universität Dortmund. This gave me the opportunity
to become acquainted with many nice people and to make new friends. Particularly, I
would like to name Dr. Bastian Degener, Dominic Dumrauf, and Joachim Gehweiler. For
being an amiable three-year office mate and friend, I would like to thank Dr. Morteza
Monemizadeh. His amazing knowledge of research results in the area of clustering and
data stream algorithms has been very useful for me. Furthermore, I am more than happy
that I have found such a sympathetic and caring friend in Melanie Schmidt. The pleasant
conversations with her always cheered me up. For many fruitful discussions and a nice
working atmosphere, I would also like to thank Dr. Mohammad Ali Abam, Florian Berger,
Antje Bertram, Prof. Dr. Beate Bollig, Dr. Olaf Bonorden, Dr. Gereon Frahling, Alexander
Gilbers, Dr. André Gronemeier, Frank Hellweg, Prof. Dr. Rolf Klein, Mariele Knepper,
Renate Kühn, Dr. Elmar Langetepe, Dr. Mario Mense, Rainer Penninger, Melanie Schmidt,
Dr. Dirk Sudholt, Dr. Christian Thyssen, Tim Suess, Heinz-Georg Wassing, and Christine
Zarges.
I would like to thank Prof. Dr. Stefano Leonardi for inviting me to a research visit
at Sapienza University of Rome, Prof. Dr. Piotr Indyk for helpful discussions at the
MADALGO Summer School on Data Stream Algorithms, and Prof. Dr. Sumit Ganguly
for inviting me to the IITK Workshop on Algorithms for Processing Massive Data Sets.
Many thanks go to Dr. Mariano Zelke for carefully proof-reading parts of my thesis and
for giving helpful suggestions to improve the readability of my thesis.
Certainly, I would never have managed to get this far without the support and encouragement
of my friends and family. In particular, I would like to thank my brother Markus, who
sparked my interest in computer science, my parents, Brunhilde and Klemens, to whom I owe
most of what I am today, and Thomas Friebe, on whom I can rely in any situation and
who enriches my life in such a wonderful way.
Contents
Notation and Terminology
1 Introduction
1.1 Outline and Main Results
1.2 Motivation
1.2.1 Facility Location
1.2.2 Clustering
1.2.3 Compact Representations of Finite Metric Spaces
1.3 Related Work
1.3.1 Facility Location
1.3.2 Compact Representations of Finite Metric Spaces
2 Preliminaries
2.1 Distance Functions and Metric Spaces
2.2 Facility Location Problems
2.3 The Mettu-Plaxton Algorithm
2.4 Computational Models
2.4.1 Real Random Access Machine Model
2.4.2 Synchronous Message Passing Model
2.4.3 Kinetic Data Structures
2.4.4 Data Stream Models
3 Facility Location in a Distributed Setting
3.1 Preliminaries
3.1.1 The Distributed Setting
3.1.2 The Radii
3.2 Distributed Algorithm for Metric Spaces
3.2.1 Analysis of the Algorithm
3.3 Distributed Algorithm for Powers of Metric Spaces
3.3.1 Analysis of the Algorithm
4 A Kinetic Data Structure for Facility Location
4.1 The Special Radii
4.1.1 Definition of the Special Radii
4.1.2 Computation of the Special Radii
4.1.3 The Invariant
4.2 The Kinetic Data Structure
4.2.1 Initial Set of Open Facilities
4.2.2 Event Queue
4.2.3 Handling an Update
4.3 Quality and Complexity of the Kinetic Data Structure
4.3.1 Maintenance of the Invariant
4.3.2 Complexity
5 Facility Location in Data Streams
5.1 Definition of a Good Estimator
5.1.1 Estimator for Special Cases
5.1.2 Estimator Based on a Space Partition
5.1.3 Properties of the Space Partition
5.1.4 Analysis of the Estimator
5.2 Randomized Algorithm
5.2.1 Random Sampling
5.2.2 Analysis of the Estimator
5.3 Streaming Algorithm
5.3.1 Analysis of the Estimator
6 A k-Means Implementation for Data Streams
6.1 Preliminaries
6.1.1 Definition of Euclidean k-Means Clusterings
6.1.2 Definition of Coresets
6.1.3 k-Means Clustering Algorithms
6.2 Coreset Construction
6.3 The Coreset Tree
6.3.1 Definition of the Coreset Tree
6.3.2 Construction of the Coreset Tree
6.3.3 Extraction of the Coreset
6.4 Streaming Algorithm
6.4.1 The Merge-and-Reduce Technique
6.4.2 Complexity
6.5 Empirical Evaluation
6.5.1 Datasets
6.5.2 Parameters of the Algorithms
6.5.3 Comparison of the Algorithms
7 Well-Separated Pair Decomposition with Slack
7.1 Preliminaries
7.2 Construction for Euclidean Metric Spaces
7.2.1 Analysis of the Construction
7.3 Construction for Doubling Metric Spaces
7.3.1 Analysis of the Construction
8 Embeddings with Slack in Data Streams and Applications
8.1 Preliminaries
8.2 Embedding Euclidean Metric Spaces
8.2.1 Low Dimensions
8.2.2 High Dimensions
8.3 Max-Cut in High Dimensions
8.4 Embedding Doubling Metric Spaces
8.5 Embedding General Metric Spaces
8.6 Lower Bounds
9 Conclusions and Future Work
A Additional Tables for Chapter 6
A.1 Parameters of Algorithm BIRCH
A.2 Running Times of the Algorithms
A.3 Clustering Cost of the Algorithms
A.4 Standard Deviation of Running Time and Cost
B Mathematical Fundamentals
B.1 Sequences, Series, and Inequalities
B.2 Probability Theory
Bibliography
Notation and Terminology
This section provides an overview of the basic mathematical notation and terminology
used throughout this thesis. More specialized notions are introduced gradually in the main
chapters. For ease of reading, some of the definitions are repeated where they are used.
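$\emptyset$    empty set
$\mathbb{N}$    set of the natural numbers $\{1, 2, 3, \ldots\}$
$\mathbb{N}_0$    set of the natural numbers including 0
$[n]$    set $\{0, 1, \ldots, n-1\}$
$\mathbb{R}$    set of the reals
$\mathbb{R}_{\ge 0}$    set of the non-negative reals
$[a, b]$    closed interval of the reals $x$ with $a \le x \le b$
$(a \pm \varepsilon)$    closed interval of the reals $x$ with $a - \varepsilon \le x \le a + \varepsilon$
$(a, b)$    open interval of the reals $x$ with $a < x < b$
$|A|$    cardinality of set $A$
$A \cup B$    union of sets $A$ and $B$, i.e., $\{x \mid x \in A \text{ or } x \in B\}$
$A \cap B$    intersection of sets $A$ and $B$, i.e., $\{x \mid x \in A \text{ and } x \in B\}$
$A \setminus B$    difference set $A$ minus $B$, i.e., $\{x \mid x \in A \text{ and } x \notin B\}$
$A \times B$    Cartesian product of sets $A$ and $B$, i.e., $\{(x, y) \mid x \in A \text{ and } y \in B\}$
$A^n$    set of $n$-dimensional column vectors with entries from set $A$
$A^{n \times m}$    set of $(n \times m)$-matrices with entries from set $A$
$M = (X, D)$    metric space $M$, where $X$ is a non-empty set of elements and $D : X \times X \to \mathbb{R}_{\ge 0}$ is a distance function defined on $X$
$D(x, y)$    distance between $x$ and $y$ in some metric space
$\mathrm{diam}(X)$    diameter of set $X$ in some metric space
$B(x, r)$    closed ball of radius $r$ centered at point $x \in X$ in some metric space $M = (X, D)$, i.e., the set $\{y \in X \mid D(x, y) \le r\}$
spread of $M$    ratio of farthest pair distance in $X$ to closest pair distance in $X$ for some finite metric space $M = (X, D)$ with $D(x, y) \ne 0$ for all pairs $(x, y) \in X \times X$, $x \ne y$
$\mathbb{R}^d$    the $d$-dimensional Euclidean space
$v^T$    transpose of vector $v$ with entries from $\mathbb{R}$
$v^T \cdot w$    scalar product of column vectors $v, w \in \mathbb{R}^d$
$\|v\|$    Euclidean norm of column vector $v \in \mathbb{R}^d$
$G = (V, E)$    graph $G$ with vertex set $V$ and edge set $E$
$e$    Euler's number $2.7182\ldots$
$\exp(x)$    Euler's number to the power of $x$, i.e., the value $e^x$
$\ln(x)$    natural logarithm of $x$, i.e., logarithm of $x$ to the base $e$
$\log_b(x)$    logarithm of $x$ to the base $b$
$\log_b^k(x)$    $\log_b(x)$ to the power of $k$, i.e., $(\log_b(x))^k$
$\log(x)$    binary logarithm of $x$, i.e., logarithm of $x$ to the base 2
$O(g)$    $\{f : \mathbb{N} \to \mathbb{R}_{\ge 0} \mid \exists n_0 \in \mathbb{N}\ \exists c > 0\ \forall n \ge n_0 : f(n) \le c \cdot g(n)\}$
$\Omega(g)$    $\{f : \mathbb{N} \to \mathbb{R}_{\ge 0} \mid \exists n_0 \in \mathbb{N}\ \exists c > 0\ \forall n \ge n_0 : 0 \le c \cdot g(n) \le f(n)\}$
$\Theta(g)$    $\{f : \mathbb{N} \to \mathbb{R}_{\ge 0} \mid f \in O(g) \text{ and } f \in \Omega(g)\}$
$o(g)$    $\{f : \mathbb{N} \to \mathbb{R}_{\ge 0} \mid \forall c > 0\ \exists n_0 \in \mathbb{N}\ \forall n \ge n_0 : f(n) < c \cdot g(n)\}$
$\omega(g)$    $\{f : \mathbb{N} \to \mathbb{R}_{\ge 0} \mid \forall c > 0\ \exists n_0 \in \mathbb{N}\ \forall n \ge n_0 : 0 \le c \cdot g(n) < f(n)\}$
$\mathrm{poly}(g)$    $\{f : \mathbb{N} \to \mathbb{R}_{\ge 0} \mid \exists n_0 \in \mathbb{N}\ \exists c \ge 1\ \forall n \ge n_0 : f(n) \le (g(n))^c\}$
$\mathrm{polylog}(g)$    $\{f : \mathbb{N} \to \mathbb{R}_{\ge 0} \mid \exists n_0 \in \mathbb{N}\ \exists c \ge 1\ \forall n \ge n_0 : f(n) \le \log^c(g(n))\}$
$\tilde{O}(g)$    $\{f : \mathbb{N} \to \mathbb{R}_{\ge 0} \mid f \in O(g \cdot \mathrm{polylog}(g))\}$
$g^H$    $\{g^h \mid h \in H\}$ for some $g : \mathbb{N} \to \mathbb{R}_{\ge 0}$ and $H \subset \{f \mid f : \mathbb{N} \to \mathbb{R}_{\ge 0}\}$
$g \cdot H$    $\{g \cdot h \mid h \in H\}$ for some $g : \mathbb{N} \to \mathbb{R}_{\ge 0}$ and $H \subset \{f \mid f : \mathbb{N} \to \mathbb{R}_{\ge 0}\}$
$g + H$    $\{g + h \mid h \in H\}$ for some $g : \mathbb{N} \to \mathbb{R}_{\ge 0}$ and $H \subset \{f \mid f : \mathbb{N} \to \mathbb{R}_{\ge 0}\}$
$\min_{x \in A} f(x)$    value $f(x)$ with $x \in A$ and $f(x) \le f(y)$ for all $y \in A$
$\max_{x \in A} f(x)$    value $f(x)$ with $x \in A$ and $f(x) \ge f(y)$ for all $y \in A$
$\lceil x \rceil$    smallest integer $n$ with $n \ge x$
$\lfloor x \rfloor$    largest integer $n$ with $n \le x$
$n!$    factorial of the natural number $n$, i.e., the value $n \cdot (n-1) \cdot \ldots \cdot 2 \cdot 1$
$\binom{n}{k}$    binomial coefficient $n$ over $k$, i.e., the value $n!/((n-k)! \cdot k!)$
$\Pr[A]$    probability of the event $A$
$\mathrm{E}[Z]$    expectation of the random variable $Z$
$\mathrm{V}[Z]$    variance of the random variable $Z$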
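To make the metric-space notation concrete, the following minimal Python sketch evaluates diam(X), B(x, r), and the spread for a small point set. It assumes the Euclidean distance as D and points given as tuples. The function names are illustrative and not taken from the thesis.

```python
import math
from itertools import combinations


def dist(x, y):
    """Euclidean distance D(x, y) between two points given as tuples."""
    return math.dist(x, y)


def diam(points):
    """Diameter of a finite point set: the largest pairwise distance."""
    return max(dist(x, y) for x, y in combinations(points, 2))


def ball(points, center, r):
    """Closed ball B(center, r): all points y with D(center, y) <= r."""
    return [y for y in points if dist(center, y) <= r]


def spread(points):
    """Ratio of the farthest pair distance to the closest pair distance.

    Assumes all points are pairwise distinct, so no distance is zero.
    """
    pair_dists = [dist(x, y) for x, y in combinations(points, 2)]
    return max(pair_dists) / min(pair_dists)


X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
print(diam(X), ball(X, (0.0, 0.0), 5.0), spread(X))
```

Here the pairwise distances are 5, 5, and 10, so the diameter is 10 and the spread is 2.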
1 Introduction
Facility location problems belong to the most studied problems in operations research and
combinatorial optimization. In its classical interpretation, the goal of facility location is
to find optimal places for industrial facilities (e.g., restaurants, factories, or supermarkets)
such that a combination of the building and maintenance costs for the facilities and the
transportation costs for the clients is minimized. However, facility location problems
also have applications in many other scenarios. As a result, various types of facility location
problems have been investigated to date.
In the facility location variants considered in this thesis, the input consists of a set of
points where each point is a client as well as a potential location for a facility. Each client
has to be served by a facility. Here, it must be taken into account that, on the one hand,
serving a client incurs connection costs and, on the other hand, opening or maintaining a
facility causes so-called opening costs. The goal is to open a subset of the input points as
facilities such that the total cost of the system is minimized.
In general, each facility has its individual opening cost, and the connection cost of
a client depends proportionally on its individual demand as well as on its distance to
the nearest open facility. This requires a distance measure
defined on the input points. One typical scenario is that the points are from
a Euclidean space and the distance between points is given by the Euclidean
distance. However, other distance measures are also conceivable. In radio networks, for
instance, it can be useful to consider powered Euclidean distances, since the energy
required for transmitting a message over a certain distance is somewhere between the square
and the cube of the distance.
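The cost model described above can be written out as a small Python sketch. It assumes points in a Euclidean space given as tuples, facilities identified by their index in the point list, and an optional exponent `power` for powered Euclidean distances. All names are illustrative. The sketch only evaluates the cost of a given set of open facilities and is not one of the algorithms developed in this thesis.

```python
import math


def facility_location_cost(points, opening_cost, demand, open_facilities, power=1.0):
    """Total cost of opening the facilities with the given indices.

    Total cost = sum of the opening costs of the open facilities
               + sum over all points of demand * (distance to nearest open facility) ** power.
    An open facility serves itself at distance 0.
    """
    if not open_facilities:
        raise ValueError("at least one facility must be open")
    opening = sum(opening_cost[i] for i in open_facilities)
    connection = 0.0
    for j, p in enumerate(points):
        nearest = min(math.dist(p, points[i]) for i in open_facilities)
        connection += demand[j] * nearest ** power
    return opening + connection


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
opening_cost = [2.0, 2.0, 2.0]  # uniform opening costs
demand = [1.0, 1.0, 1.0]        # uniform demands

print(facility_location_cost(points, opening_cost, demand, {0}))           # 13.0
print(facility_location_cost(points, opening_cost, demand, {0, 2}))        # 5.0
print(facility_location_cost(points, opening_cost, demand, {0}, power=2))  # 103.0
```

Opening a second facility at the far point raises the opening cost from 2 to 4 but reduces the connection cost from 11 to 1, which illustrates the trade-off that the optimization has to balance.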
We are particularly interested in facility location problems for large-scale distributed
systems of mobile objects. In such a system, each object is an autonomous computational
entity that has its own local memory and that communicates with the other entities by
message passing. Since such systems are too complex to analyze in their entirety,
we examine several partial aspects of them. Applications of our scenario arise, for example,
in mobile ad-hoc and sensor networks. In these networks, nodes move continuously and
interact with each other. Often, they are organized in a hierarchical way where the upper
layer offers the lower layer a certain service.
Furthermore, we are interested in designing algorithms that are capable of clustering
huge Euclidean point sets efficiently. As the clustering objective, we focus on k-means. Note
that the k-means clustering problem is closely related to the facility location problem,
which itself belongs to the clustering problems as well. In general, the goal of a clustering
is to partition a set of given objects into subsets, the so-called clusters, such that objects
from the same cluster are similar to each other and objects from different clusters are
dissimilar. In the k-means clustering problem, the input is a set of points with a distance
measure defined on them, and the goal is to place k cluster centers such that the sum
of the squared distances of the points to their nearest cluster center is minimized. For
each cluster center, there exists one cluster containing all the points that are closer to this
cluster center than to all the other cluster centers. One application of clustering is the
compact representation of huge datasets. For instance, we could map each data item to
a point in a Euclidean space and, after having clustered the resulting point set, represent
each cluster by its cluster center.
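The k-means objective and the induced clusters can be sketched in a few lines of Python. The sketch assumes Euclidean points given as tuples and breaks ties between equally close centers by the lower center index, which is an assumption made for illustration. It evaluates a given set of centers and does not compute good centers.

```python
import math


def kmeans_cost(points, centers):
    """Sum of the squared distances of the points to their nearest center."""
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)


def assign_clusters(points, centers):
    """Partition the points by nearest center (ties go to the lower index)."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
centers = [(1.0, 0.0), (11.0, 0.0)]

print(kmeans_cost(points, centers))      # 4.0
print(assign_clusters(points, centers))
```

Representing each of the two clusters by its center compresses the four points to two, at a k-means cost of 4.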
The second part of this thesis concentrates exclusively on compact representations of
huge n-point metric spaces. An n-point metric space is a pair M = (X,D) where X is a
set of n points and D is a distance measure defined on X that is non-negative, symmetric,
and satisfies the triangle inequality. Our goal is to compute a compact representation of
M that fairly captures the pairwise distances of M but is structurally simpler than M and
uses only sublinear space. To measure the quality of such a representation, we use the
notion of low-distortion embeddings with slack, which is defined as follows. An embedding
from a metric space M = (X, D) into a target metric space M′ = (X′, D′) is a mapping
ϕ : X → X′. We say that ϕ contracts the distance between two points x and y in X by a factor
of α ≥ 1 if D′(ϕ(x), ϕ(y)) = D(x, y)/α, i.e., the embedded distance is shorter than the
original distance D(x, y) by a factor of α. Similarly, we say that ϕ expands the distance
between x and y by a factor of β ≥ 1 if D′(ϕ(x), ϕ(y)) = β · D(x, y), i.e., the embedded
distance is longer than the original distance by a factor of β. Now, the distortion of ϕ is
defined as the product of the maximum contraction and the maximum expansion over all the
pairwise distances in X. Finally, we say that ϕ has distortion ρ ≥ 1 and slack σ with
0 < σ < 1 if, for a (1 − σ)-fraction of all the pairwise distances in X, the distortion is
at most ρ. The remaining pairwise distances, i.e., the slack, can be arbitrarily distorted.
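The definition of distortion with slack can be checked on small examples with the following Python sketch. It assumes that all original and embedded pairwise distances are non-zero and that the number of ignored pairs is rounded down to ⌊σ · m⌋ for m pairs; both are assumptions made here for illustration. Since the distortion of a set of pairs depends only on the smallest and largest ratio D′/D, it suffices to examine contiguous windows of the sorted ratios.

```python
import math
from itertools import combinations


def distance_ratios(points, embed, D, D_prime):
    """Sorted ratios D'(phi(x), phi(y)) / D(x, y) over all unordered pairs."""
    return sorted(
        D_prime(embed(x), embed(y)) / D(x, y) for x, y in combinations(points, 2)
    )


def distortion(ratios):
    """Maximum contraction times maximum expansion (each at least 1)."""
    max_expansion = max(1.0, max(ratios))
    max_contraction = max(1.0, 1.0 / min(ratios))
    return max_contraction * max_expansion


def distortion_with_slack(sorted_ratios, sigma):
    """Smallest distortion achievable when a sigma-fraction of pairs is ignored."""
    m = len(sorted_ratios)
    keep = m - math.floor(sigma * m)
    return min(
        distortion(sorted_ratios[i:i + keep]) for i in range(m - keep + 1)
    )


def D(a, b):
    """Distance on the real line, used for both the source and the target space."""
    return abs(a - b)


points = [0.0, 1.0, 2.0, 3.0]


def embed(x):
    """Hypothetical embedding that moves only the point 3.0 far away."""
    return 100.0 if x == 3.0 else x


ratios = distance_ratios(points, embed, D, D)
print(distortion(ratios))                  # 98.0
print(distortion_with_slack(ratios, 0.5))  # 1.0
```

In this example, the three pairs involving the moved point are heavily expanded, so the distortion without slack is 98. With slack 1/2, these three of the six pairs may be ignored, and the remaining pairs are preserved exactly.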
In this thesis, we study the problem of computing embeddings with low distortion and
low slack of several n-point metric spaces that are given as a data stream. A data stream
is a sequence of data items which can only be accessed in one sequential scan that reads
the data items one by one. Moreover, while reading and processing the data, an algorithm
is only allowed to use space that is sublinear in the size of the input stream.
In the following section, we will give a detailed overview of the results presented in this
thesis.
1.1 Outline and Main Results
Chapter 2
This chapter provides some preparation for the main chapters. We give formal definitions
of the metric spaces and the facility location problems considered in this thesis. Afterwards,
we present an existing facility location algorithm due to Mettu and Plaxton [87], which
has played an important role in the design of two of our facility location algorithms.
Furthermore, we introduce the computational models that have been used to develop our
algorithms and to analyze them in terms of their complexity. This includes the synchronous
message passing model for algorithms working in a distributed setting, the kinetic data
structure framework for algorithms working in a mobile setting, and data stream models.
Chapter 3
We begin our studies by investigating a special type of metric facility location problem
in a distributed setting. In this problem, we assume that each point is a client as well
as a potential location for a facility and that the opening costs for the facilities and the
demands of the clients are uniform. We present a randomized distributed algorithm that
computes, with high constant probability, a constant-factor approximation for this type of
facility location problem. The algorithm uses three rounds of all-to-all communication
with message sizes bounded by O(log(n)) bits, where n is the number of input points. In
particular, we show how each point decides locally after the first communication round
whether it opens a facility or not. The remaining two communication rounds are only
required to connect the clients to their nearest open facility.
In the last part of Chapter 3, we extend our distributed algorithm to constant powers of
metric spaces. Here, we also obtain a constant-factor approximation algorithm that uses
three rounds of all-to-all communication with message sizes bounded by O(log(n)) bits.
The results of Chapter 3 were previously published in [J. Gehweiler, C. Lammersen,
and C. Sohler. A distributed O(1)-approximation algorithm for the uniform facility
location problem. In Proceedings of the 18th ACM Symposium on Parallelism in Algorithms
and Architectures (SPAA ’06), pages 237–243. Association for Computing Machinery,
2006.].
Chapter 4
We reuse some essential ideas of Chapter 3 to approach a mobile facility location problem.
This time, we take non-uniform opening costs for the facilities and non-uniform demands
for the clients into account. We assume that each point moves along a known trajectory in
a d-dimensional Euclidean space and that, at any time, each point is either an open facility
or a client. The opening cost of a facility is incurred during the entire time it
is open. Analogously, a client continuously pays for its connection to an open facility.
This cost depends on the client’s demand and its distance to the nearest
open facility.
To approach the mobile facility location problem, we propose a deterministic kinetic
data structure. This data structure maintains a subset of the moving points as open
facilities such that, at any time, the sum of the opening cost for the open facilities and the
connection cost for the clients is at most a constant factor larger than the current optimal
cost. The space requirement of our data structure is O(n(log^d(n) + log(nR))), where n
denotes the number of input points and R is a value depending on the cost and demand
values of the input points. In case that each trajectory can be described by a bounded
degree polynomial, we process O(n^2 log^2(nR)) events, each requiring O(log^(d+1)(n) · log(nR))
time and O(log(nR)) status changes. To our knowledge, there had been no kinetic data
structures for facility location proposed prior to the work presented in Chapter 4.
Chapter 4 is based on [B. Degener, J. Gehweiler, and C. Lammersen. The kinetic
facility location problem. In Proceedings of the 11th Scandinavian Workshop on Algorithm
Theory (SWAT ’08), pages 378–389. Springer, 2008.] and [B. Degener, J. Gehweiler, and
C. Lammersen. Kinetic facility location. Algorithmica, 57(3):562–584, July 2010. By
invitation for the special issue on selected papers from SWAT ’08.].
Chapter 5
We continue our studies of facility location problems. Similar to Chapter 3, we consider a
variant in which each input point is a client as well as a potential location for a facility and
in which the opening costs for the facilities and the demands of the clients are uniform.
However, this time, the input points are given as a dynamic geometric data stream. This
means that the input is a sequence of insert and delete operations on points from a discrete
Euclidean space {1, ..., ∆}^d. We assume that the dimension d is a constant.
We present a randomized algorithm that computes a constant-factor approximation for
the cost of the uniform facility location problem over dynamic geometric data streams. Our
streaming algorithm processes an insertion or deletion of a point in time polylogarithmic
in ∆, requires space polylogarithmic in ∆, and has an error probability of less than 1/3.
We remark that this error probability can be reduced by using a standard amplification
technique.
The construction of our streaming algorithm is done in three steps. The first step is to
define a certain partition of the input space and to relate this partition to the cost that
is to be calculated. In particular, we show that if we assign to each cell in this partition a
weight that corresponds to the number of points inside the cell times the side length of the
cell, the sum of these weights is a constant-factor approximation for the facility location
cost. In the next step, we propose a randomized algorithm that utilizes the existence
of such a space partition but does not consider streaming. Finally, we explain how our
randomized algorithm can be transferred to the dynamic geometric data stream model.
The results from Chapter 5 can be found in [C. Lammersen and C. Sohler. Facility
location in dynamic geometric data streams. In Proceedings of the 16th Annual European
Symposium on Algorithms (ESA ’08), pages 660–671. Springer, 2008.].
Chapter 6
This chapter deals with an efficient implementation of a k-means clustering algorithm for
data streams, which we call StreamKM++. The k-means clustering problem is closely
related to the facility location problem. The goal is to place k facilities, the so-called cluster
centers, such that the sum of the squared distances of the points to their nearest cluster
center is minimized. Our algorithm computes a small weighted coreset of the data stream
that approximates the input point set with respect to the k-means clustering problem. The
problem is then solved by running the k-Means++ algorithm [9] on the coreset.
Algorithm StreamKM++ is based on two new techniques. First, we use an adaptive,
non-uniform sampling approach similar to the k-Means++ seeding procedure to obtain
small coresets from the data stream. This construction is rather easy to implement and,
unlike other coreset constructions, its running time has only a small dependency on the
dimensionality of the data. Second, we propose a new data structure, which we call coreset
tree. The use of coreset trees significantly speeds up the time that is necessary for the
adaptive, non-uniform sampling during our coreset construction.
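To illustrate the kind of adaptive, non-uniform sampling referred to above, the following Python sketch implements the standard k-Means++ seeding rule, in which each new center is drawn with probability proportional to its squared distance to the nearest center chosen so far. It is a minimal illustration only: it assumes random access to a point list held in memory and does not reproduce the coreset construction or the coreset tree of StreamKM++. The names d2_seeding and squared_distance are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_seeding(points, k, rng=None):
    """Choose k centers by squared-distance-weighted sampling (illustrative sketch)."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    # closest[i] is the squared distance from points[i] to its nearest chosen center.
    closest = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < k:
        total = sum(closest)
        if total == 0.0:
            # Every point coincides with a chosen center; fall back to a uniform draw.
            centers.append(rng.choice(points))
            continue
        threshold = rng.random() * total
        # Default guards against floating-point rounding: last point with positive weight.
        chosen = max(i for i, w in enumerate(closest) if w > 0.0)
        cumulative = 0.0
        for i, w in enumerate(closest):
            cumulative += w
            if cumulative > threshold:
                chosen = i
                break
        centers.append(points[chosen])
        closest = [min(c, squared_distance(p, points[chosen]))
                   for p, c in zip(points, closest)]
    return centers
```

Each round scans all points once, which is the cost that a dedicated data structure would aim to reduce.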
To evaluate the performance of our algorithm, we compare it experimentally with two
well-known streaming implementations: BIRCH [111] and StreamLS [52]. We show that
if the first priority is the quality of the clustering, then StreamKM++ provides a good
alternative to BIRCH and StreamLS. This applies particularly if the number of cluster
centers is large. To show that the performance of our algorithm is competitive with classical
non-streaming algorithms as well, we also compare it with the k-Means++ algorithm and
Lloyd’s algorithm [39, 80, 82] on some datasets with small or moderate size.
We end our investigation of problems related to facility location with Chapter 6.
Algorithm StreamKM++ together with its experimental study has been previously
published in [M. R. Ackermann, C. Lammersen, M. Märtens, C. Raupach, C. Sohler, and
K. Swierkot. StreamKM++: A clustering algorithm for data streams. In Proceedings of
the 12th Workshop on Algorithm Engineering and Experiments (ALENEX ’10), pages 173–
187. Society for Industrial and Applied Mathematics, 2010. Invited to the special is-
sue on selected papers from ALENEX ’10. Submitted to ACM Journal on Experimental
Algorithmics.].
Chapter 7
In this chapter, we start to investigate another geometric problem. The problem under
consideration is the computation of compact representations of finite metric spaces that
capture the metric structure well. One option to tackle this problem is the construction of
a well-separated pair decomposition (WSPD). An ε-WSPD of a point set P is a represen-
tation of P which gives the guarantee that all pairwise distances of P are (1±ε)-preserved,
i.e., each pairwise distance is expanded by a factor of at most (1 + ε) or compressed by a
factor of at most 1/(1− ε). In order to enable our representation to have size sublinear in
|P |, we relax this condition and introduce the notion of a WSPD with slack. An ε-WSPD
with slack σ for P is a representation of P such that at least a (1 − σ)-fraction of all the
pairwise distances of P are (1 ± ε)-preserved. The remaining pairwise distances of P can
be arbitrarily distorted.
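The definition above can be made concrete with a small check. The following Python sketch assumes, purely for illustration, that both the true distances and the distances reported by a representation are available as explicit symmetric matrices (which a sublinear-size representation would of course avoid); the function names are hypothetical.

```python
from itertools import combinations


def preserved_fraction(true_dist, approx_dist, eps):
    """Fraction of pairs whose approximate distance lies in [(1 - eps) * d, (1 + eps) * d]."""
    pairs = list(combinations(range(len(true_dist)), 2))
    good = 0
    for i, j in pairs:
        d = true_dist[i][j]
        if (1 - eps) * d <= approx_dist[i][j] <= (1 + eps) * d:
            good += 1
    return good / len(pairs)


def has_slack_at_most(true_dist, approx_dist, eps, sigma):
    """True if at least a (1 - sigma)-fraction of all pairwise distances is (1 ± eps)-preserved."""
    return preserved_fraction(true_dist, approx_dist, eps) >= 1 - sigma
```

With slack σ = 0, the check reduces to the classic requirement that every pairwise distance is preserved.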
We show how to compute an ε-WSPD with slack σ for a set P consisting of n points
from a low-dimensional Euclidean space R^d. The space requirement for this compact
representation is O(log(∆)/(ε^d σ)), where ∆ is the spread of P. Recall that the spread of a
finite metric space is defined as the ratio of the farthest-pair distance to the closest-pair distance
occurring in the metric space. The techniques used by our algorithm are also applicable
to doubling metric spaces. For a doubling metric space with bounded dimension λ and
spread ∆, our algorithm computes an ε-WSPD with slack σ whose space requirement is
O(log^2(∆)/(ε^λ σ)).
The results of Chapter 7 can be found in [C. Lammersen, A. Sidiropoulos, and C. Sohler.
Streaming embeddings with slack. In Proceedings of the 11th Algorithms and Data Struc-
tures Symposium (WADS ’09), pages 483–494. Springer, 2009.].
Chapter 8
Chapter 8 addresses the computation of compact representations of finite metric spaces in
data stream models. We present randomized streaming algorithms that, given a stream
of n points from a metric space M, compute an embedding of M into an n-point metric
space M′ that has low distortion and low slack. Our algorithms use space polylogarithmic
in n and ∆, where ∆ is the spread of the metric space. Within such space limitations, it
is impossible to store the embedding explicitly. We bypass this obstacle by computing a
compact representation of M′, without storing the actual mapping from M into M′.
Given a slack parameter σ and a precision parameter ε, our streaming algorithm computes
for a set of points P from a high-dimensional Euclidean space a set of points P′ from
a low-dimensional Euclidean space such that P embeds into P′ with distortion 1 + ε and
slack σ. The algorithm uses the techniques presented in Chapter 7. Furthermore, based on
results obtained in Chapter 7, we show how to compute embeddings with distortion 1 + ε
and slack σ of metric spaces with bounded doubling dimension in the data stream model.
For general metric spaces, we propose a streaming embedding of a metric space M into a
metric space M′ with distortion O(1) and slack σ. We complement our upper bounds by
proving that embedding general metric spaces with distortion less than 2 and slack less
than 1/4 requires Ω(n/ log(n) + log(log(∆))) bits of memory.
Besides, we use an embedding to show that there is a randomized streaming algorithm
that computes with high constant probability a (1 ± ε)-approximation of the max-cut
problem for a dynamic geometric data stream of high-dimensional Euclidean points.
The results of Chapter 8 are based on [C. Lammersen, A. Sidiropoulos, and C. Sohler.
Streaming embeddings with slack. In Proceedings of the 11th Algorithms and Data Struc-
tures Symposium (WADS ’09), pages 483–494. Springer, 2009.].
Chapter 9
We conclude the thesis with Chapter 9, where we summarize our main results and also
give some suggestions for future work.
Appendix A
This appendix contains some additional tables with detailed information about the exper-
iments that we have run with our streaming implementation StreamKM++.
Appendix B
This appendix addresses some mathematical fundamentals which are assumed to be com-
mon knowledge throughout this thesis. The facts are stated for reference purposes, without
giving any proofs.
1.2 Motivation
There are some relations between facility location, clustering, and computations of compact
representations of finite metric spaces. Therefore, problems in these areas can be motivated
in a similar way and have similar application scenarios. In the following, we will discuss
this in detail for the problems considered in these areas.
1.2.1 Facility Location
Facility location problems capture a large variety of application scenarios in which we have
to allocate resources to satisfy some requirement as well as possible while at the same
time we have to pay some cost for the used resources. Therefore, it is not surprising that
these kinds of problems belong to the most studied problems in combinatorial optimization
and operations research.
In this thesis, we are particularly interested in facility location problems for large-scale
distributed systems of mobile objects. In such a system, each object is an autonomous
computational entity that has its own local memory and that communicates with the
other entities by message passing.
Applications of our scenario are, for example, in battery-powered wireless ad-hoc and
sensor networks. These networks are governed by tight energy constraints. Often, the
nodes are organized in energy-efficient clusters where some selected cluster heads offer a
certain service to the subset of nodes contained in their cluster. Imagine, for example, we
have a sensor network containing hundreds or thousands of homogeneous nodes, and the
task of the nodes is to periodically send their sensed data to a distant base station where
the end-user can access the data. Then, a service of a cluster head could be aggregating
the data of its cluster nodes into a small set of meaningful information and taking on the
transmission of the aggregated data to the base station. Each node can act as a cluster
head, but, at any time, each node that is set up or maintained as a cluster head has an
additional energy consumption due to transmitting data to the distant base station, staying
in ready-to-receive mode, etc. The energy consumption of a non-serving cluster node is
determined by its demand and the transmission power needed to reach a cluster head.
The transmission power has to be higher the longer the distance to the nearest cluster
head is. As a result, we have to find a good trade-off to have a small number of cluster
heads and at the same time keep the energy needed for transmissions to the cluster heads
low. The situation may be further aggravated in case that the nodes move continuously.
Now, imagine that, to keep the total energy consumption of the system low, nodes are
allowed to change their status from serving cluster head to demanding cluster node or vice
versa. This scenario can be modeled as a facility location problem for a distributed system
of mobile network nodes.
Since systems like the one described above are too complex to analyze in
their entirety, we examine the following partial aspects of them:
Distributed Setting: We study the metric facility location problem with uniform opening
costs and demands for large-scale distributed systems of static objects.
Kinetic Setting: Here, we are interested in the Euclidean facility location problem with
non-uniform opening costs and demands for large-scale systems of mobile objects.
Dynamic Data Streams: We investigate the problem of computing the cost for the Euclidean
facility location problem with uniform opening costs and demands for large-scale
systems of mobile objects where the movement of the objects is given as a
dynamic data stream of update operations.
Distributed Setting
Typically, no node in a large-scale ad-hoc or sensor network knows all the distance informa-
tion that is needed to solve or even approximate the facility location problem. Besides, it is
not possible to gather this information at a single node since this would cause much com-
munication between the nodes, which in turn would result in a high energy consumption.
In such a scenario, a distributed algorithm is required.
The distributed algorithm presented in Chapter 3 computes a constant-factor approxi-
mation of the metric facility location problem using only three rounds of communication
where the message size is logarithmic in the total number of nodes.
Kinetic Setting
Maintaining clusters in a large-scale network of mobile nodes is a challenging task. Good
facility location algorithms should ensure a trade-off between the quality of the solution
at any given point of time and its stability and efficiency under motion since each status
change from demanding cluster node to serving cluster head incurs some costs.
Motivated by the fact that the KDS framework is common in the field of computational
geometry and well-suited to maintain a combinatorial structure of continuously moving
objects efficiently [2, 15, 54], we will develop a KDS for a mobile facility location problem
in Chapter 4. Surprisingly, prior to our work, no KDS for the facility location
problem was known. Besides, it does not seem that the only known (1 + ε)-approximation
algorithms [8, 76] can be translated to the kinetic setting. This is because the authors
used the Arora-scheme [7] including dynamic programming techniques, which do not lend
themselves well to kinetization. So, at this stage, the best we can hope for is to construct a KDS
maintaining a constant-factor approximation for the facility location problem, which we
will do in Chapter 4.
Dynamic Data Streams
In battery-powered wireless ad-hoc and sensor networks of mobile nodes, it is common to
communicate new positions of network nodes in the form of a stream of update operations.
Such an update may, for example, specify the ‘name’ of the network node, its ‘old position’
and its ‘new position’. Thus, we can also think of it as a deletion of the node from its old
position followed by an insertion of the same node at its new position.
The model of dynamic geometric data streams addresses such a scenario. We are given a
stream of insert and delete operations of points from a discrete Euclidean space {1, ..., ∆}^d,
and our goal is to maintain some information about the aggregated data. The difficulty is
that the size of the processed data prevents us from storing it completely. This restriction
is modeled by allowing only space polylogarithmic in ∆ and the length of the stream.
Since, in the facility location problem, the number of open facilities can be as large as
the considered point set (and this can be as large as ∆^d), we cannot compute a solution in
the dynamic data stream model. Instead, we focus on approximating the cost of a solution.
We remark that monitoring the cost can be very useful in resource allocation problems.
For instance, it is often too costly to maintain, at all times, a nearly optimal set of open
facilities in a distributed way. Instead, we can keep the same set of open facilities for some
period of time but maintain, at all times, a good estimation of the optimal facility location
cost. Then, we recompute a new set of open facilities by running a distributed algorithm
as soon as the current cost estimation differs too much from the cost that we had after the
latest run of the distributed algorithm.
Chapter 5 deals with an algorithm that computes a constant-factor approximation of
the cost for the uniform facility location problem over dynamic geometric data streams.
1.2.2 Clustering
Clustering is the problem of partitioning a given set of objects into clusters such that objects
from the same cluster are similar to each other and objects from different clusters are dissimilar.
The goal is to simplify data by replacing a cluster by one or a few representatives,
classify objects into groups of similar objects, or find patterns in some given data. Accordingly,
clustering has applications in various areas, including data mining, database
systems, data compression, and machine learning. In many of these applications, the data
occurs in the form of data streams or is stored on hard disks. This means that streaming
access is orders of magnitude faster than random access or is even the only possible access
to the data. To sum up, clustering algorithms for data streams are basic tools in the
analysis of huge datasets.
One of the most widely used clustering algorithms is Lloyd’s algorithm (sometimes also
called the k-means algorithm) [39, 80, 82]. This algorithm is based on two observations:
(1) Given a fixed set of centers, we obtain the best clustering by assigning each point to the
nearest center and (2) given a cluster, the best center of the cluster is the center of gravity
(i.e., the mean) of its points. Lloyd’s algorithm applies these two local optimization steps
repeatedly to the current solution, until no more improvement is possible. It is known that
the algorithm converges to a local optimum [100], and no approximation guarantee can
be given [73]. Recently, Arthur and Vassilvitskii [9] developed the k-Means++ algorithm,
which is a seeding procedure for Lloyd’s k-means algorithm that guarantees a solution
with certain quality and gives good experimental results. An advantage of this algorithm
is that it also works well for high-dimensional datasets. However, a disadvantage is that,
like Lloyd’s algorithm, the k-Means++ algorithm needs random access to the data items
and, thus, is not suitable for data streams.
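The two local optimization steps of Lloyd's algorithm described above can be sketched as follows. This is a minimal in-memory illustration, assuming points are given as tuples of floats and initial centers are supplied by the caller; the function names and the iteration cap max_iter are hypothetical choices, and clusters that become empty simply keep their old center.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Center of gravity of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def lloyd(points, centers, max_iter=100):
    """Alternate assignment and re-centering until the centers stop changing."""
    centers = [tuple(c) for c in centers]
    clusters = []
    for _ in range(max_iter):
        # Step (1): assign each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)),
                      key=lambda i: squared_distance(p, centers[i]))
            clusters[idx].append(p)
        # Step (2): move each center to the mean of its cluster.
        new_centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

Both steps scan the whole point set in every iteration, which illustrates why repeated access to the data is needed.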
In Chapter 6, we will present a clustering algorithm for data streams that is based on
the idea of the k-Means++ seeding procedure. Our algorithm utilizes the performance of
the k-Means++ seeding on high-dimensional data but avoids random access to the data
items.
1.2.3 Compact Representations of Finite Metric Spaces
Compact representations of finite metric spaces that fairly capture the pairwise distances
and use only sublinear space are an important tool in the analysis of huge point sets
since they can be stored in small space and much information about the point sets can be
obtained from the corresponding pairwise distances. One option to obtain such a represen-
tation is the construction of a WSPD. Unfortunately, unless the input metric space is very
simple (e.g., given is a multiset of points with many duplicates), one cannot find a sub-
linear space representation which preserves all pairwise distances. It is not even possible
to guarantee that all pairwise distances are preserved up to any fixed factor. Simply said,
it is unavoidable that we lose some distances in the sense that they can get arbitrarily
distorted. According to this, we extend the classic notion of WSPD to WSPD with slack
where the slack quantifies the fraction of pairwise distances that are not well preserved.
In Chapter 7, we will show how to construct a WSPD with low slack for low-dimensional
Euclidean spaces. Our construction is based on techniques that compute small summaries
for point clouds consisting of a certain number of closely located points. Due to this
construction, for several points, the distances to points in their immediate vicinity can be
arbitrarily distorted. However, the distances to all the other points, which are further away,
are well preserved. Therefore, problems related to finding furthest neighbors in large point
sets can be efficiently approximated by our compact representation. One such problem that
has recently been considered is the computation of reverse furthest neighbors [110]. Given
a huge set of points P , a small or moderate query set Q, and a query point q ∈ Q, the
task is to find all the points in P with the property that q is their furthest neighbor among
all points in Q. One application for this problem could be the placement of an obnoxious
industrial facility. Given a set P of residential sites and a set Q of potential locations for
building such an obnoxious facility, one reasonable strategy is to select a location in Q that
is the furthest location from as many residential sites as possible, i.e., the location with the largest
set of reverse furthest neighbors in P . The reader is referred to [110] for more examples in
this area.
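For reference, the reverse furthest neighbor query described above can be stated as a brute-force computation. The following Python sketch is only a direct transcription of the definition on explicitly stored point sets, taking time proportional to |P| · |Q| per query; it is not the approximation via a compact representation, and the function names are hypothetical.

```python
import math


def reverse_furthest_neighbors(P, Q, q):
    """All points p in P whose furthest point among Q is q (ties go to the first maximum)."""
    return [p for p in P if max(Q, key=lambda c: math.dist(p, c)) == q]


def best_obnoxious_site(P, Q):
    """Location in Q with the largest set of reverse furthest neighbors in P."""
    return max(Q, key=lambda q: len(reverse_furthest_neighbors(P, Q, q)))
```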
Another application for our compact representation is hierarchical clustering. For in-
stance, given a Euclidean point set P , the complete linkage clustering (or the furthest
neighbor method) starts with |P | singleton clusters and successively merges the two near-
est clusters where the distance between two clusters is defined as the distance between
the two furthest objects in the two clusters. The merge step is repeated until the number
of clusters corresponds to the desired number of clusters. In this scenario, our compact
representation can be seen as a certain stage, when we have already performed a sequence
of several merge steps. Thus, by applying the clustering method on the representation, we
get a good approximation of the clustering for P . This is especially useful when the true
clusters of P are compact and roughly equal in size.
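The merge procedure of complete linkage clustering described above can be sketched in a few lines of Python. This naive version recomputes all cluster distances in every step and is meant only to make the definition concrete; the function name and the assumption that the desired number of clusters k is at least 1 are ours.

```python
import math


def complete_linkage(points, k):
    """Merge the two nearest clusters until only k clusters remain (naive sketch)."""
    clusters = [[p] for p in points]

    def linkage(a, b):
        # Cluster distance: distance between the two furthest objects.
        return max(math.dist(p, q) for p in a for q in b)

    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Starting the same loop from a list of already merged groups instead of singletons corresponds to applying the method to an intermediate stage of the merging process.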
In many applications, the datasets are high-dimensional and given in the form of a data
stream. Examples of such datasets include the web graph, Internet traffic logs, click-
streams, and genome data. To analyze such datasets, streaming algorithms that embed
a set of high-dimensional points with low distortion and low slack into a low-dimensional
space can be of particular interest. Besides the fact that the embedded point set can be
stored in small space, another benefit is that it might be useful to detect some structure
in the original data more easily. For example, let us assume that we map the input data to the
Euclidean plane or to R^3. Then, it is much simpler for the human visual system to detect
structure in this data, tight clusters or isolated points, for instance.
Chapter 8 addresses the development of streaming algorithms for computing embeddings
with low distortion and low slack of high-dimensional Euclidean spaces, doubling metric
spaces with low doubling dimension, and general metric spaces.
1.3 Related Work
Facility location variants as well as techniques to compute compact representations of
metric spaces have extensively been studied in computer science. It goes beyond the scope
of this thesis to give a comprehensive overview of the vast available literature. In the
following, we will focus our summary of the work that has emerged in both areas on the results
which are most relevant to this thesis.
1.3.1 Facility Location
One of the most studied facility location problems is the problem which we refer to as
the general facility location problem. Compared to the variants considered in this thesis,
in the general facility location problem, only a subset of the input points are potential
facility locations. More precisely, we are given a set of facilities F and a set of clients C.
With each facility x_i ∈ F, a non-negative opening cost f_i is associated. Furthermore,
there is a distance measure D defined on the input points, and, for each facility-client pair
(x_i, y_j) ∈ F × C, the distance D(x_i, y_j) specifies the cost for connecting y_j to x_i. The goal
is to open a subset F′ ⊆ F of the facilities and to connect each client to an open facility so
as to minimize the sum of the opening costs for F′ and the connection costs for C.
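The objective just described can be written down directly. The following Python sketch evaluates the cost of a given set of open facilities and, for tiny instances only, finds an optimal set by enumerating all non-empty subsets (exponential time, in line with the hardness of the problem). The function names and the representation of the input as lists, a cost dictionary, and a distance callable are illustrative assumptions.

```python
from itertools import combinations


def facility_location_cost(open_facilities, clients, opening_cost, dist):
    """Opening costs of the open facilities plus each client's distance to its nearest one."""
    opening = sum(opening_cost[x] for x in open_facilities)
    connection = sum(min(dist(x, y) for x in open_facilities) for y in clients)
    return opening + connection


def brute_force_optimum(facilities, clients, opening_cost, dist):
    """Try every non-empty subset of facilities; feasible only for very small inputs."""
    best = None
    for r in range(1, len(facilities) + 1):
        for subset in combinations(facilities, r):
            cost = facility_location_cost(subset, clients, opening_cost, dist)
            if best is None or cost < best[0]:
                best = (cost, subset)
    return best
```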
Note that, in the literature, the general facility location problem is also often called
uncapacitated facility location problem. This indicates that each facility can serve an un-
limited number of clients, whereas, in capacitated versions of the problem, each facility
can serve only a certain limited number of clients. Since we only consider uncapacitated
facility location problems in this thesis, we omit the attribute ‘uncapacitated’.
An instance of the general facility location problem is said to be metric if the distance
measure D is non-negative, symmetric, and satisfies the triangle inequality. The gen-
eral metric facility location problem is known to be NP-hard. The first polynomial-time
constant-factor approximation algorithm for this problem was given by Shmoys et al. [102].
Later, several other polynomial-time constant-factor approximations have been proposed
[12, 18, 27, 30, 51, 68, 69, 70, 77, 83, 87, 104]. These algorithms can roughly be grouped
into algorithms using mainly linear programming (LP) rounding techniques, primal-dual
methods, local search strategies, greedy strategies, or combinations of these techniques.
The approximation algorithm of Shmoys et al. [102] relies on the LP rounding technique
due to Lin and Vitter [79]. An LP rounding algorithm proceeds in two steps: The first step
is to solve the linear relaxation of an integer programming formulation of the considered
problem, and the second step is to round the obtained fractional LP solution to an integer
solution. In this way, the algorithm of Shmoys et al. [102] achieves an approximation ratio
of 3.16. Guha and Khuller [51] improved the LP rounding algorithm of Shmoys et al. [102]
and combined it with a simple local search phase. Starting with the solution obtained
from the LP rounding, in the local search phase, the amount of cost that is saved by
opening a closed facility is computed for each closed facility. While there exists a facility
whose amount of saved cost is positive, the facility that maximizes the decrease of the total
facility location cost is opened. This combination of LP rounding and local search achieves
an approximation ratio of 2.41. Chudak and Shmoys [30] developed a 1.736-approximation
algorithm by improving the algorithm of Shmoys et al. [102]. The key elements to their
improvement are a new rounding procedure for the LP relaxation of the facility location
problem and the use of information about the dual linear program to the LP relaxation of
the problem. A further improvement of the LP rounding algorithm has been proposed by
Sviridenko [104] resulting in an approximation ratio of 1.582.
Korupolu et al. [77] proposed a local search algorithm for the general metric facility
location problem. In general, a local search algorithm starts with a feasible solution to
the considered problem and applies iteratively a local improvement step in which minor
modifications are made in order to obtain a solution of lower cost. The local improvement
step presented in [77] searches for one facility or a pair of one open and one closed facility
such that changing the status of the involved facilities decreases the total facility location
cost. The algorithm of Korupolu et al. [77] yields an approximation ratio of 5 + ε for any
constant ε > 0, and, according to [27], it can be implemented to run in O(n^4 · log(n/ε))
time. By using other techniques in the analysis, Arya et al. [12] proved that the local
search algorithm of Korupolu et al. [77] actually achieves an approximation ratio of 3 + ε.
One of the most elegant approximation algorithms for the general metric facility location
problem is the primal-dual method developed by Jain and Vazirani [70]. In general, the
goal of a primal-dual method is to simultaneously compute a feasible integer solution for
the original problem as well as a feasible solution to the dual linear program to its LP
relaxation. In case that the considered problem is a minimization problem, like the general
facility location problem, the cost of a feasible solution to the dual LP can be used as
a lower bound on the optimal cost. Jain and Vazirani [70] proved that their primal-dual
method for facility location is a 3-approximation algorithm that can be implemented to run
in O(n^2 · log(n)) time. Later, Jain et al. [68] used a primal-dual method in the analysis of
two greedy algorithms. Their first greedy algorithm iteratively opens among all currently
closed facilities the facility that minimizes, for any subset U of all currently unconnected
clients, the ratio of the opening costs of the particular facility plus the connection cost of U
to the size of U . This algorithm achieves an approximation ratio of 1.861 and has a running
time of O(n^2 · log(n)). The second greedy algorithm was obtained from the first one by small
modifications resulting in an improved approximation ratio of 1.61 and a higher running
time of O(n^3). Furthermore, Mettu and Plaxton [87] presented a greedy algorithm for the
facility location variant with C = F which implicitly uses the primal-dual method of Jain
and Vazirani [70]. This is done by defining so-called ‘radii’ for amortizing the cost needed
to open a facility at a particular location. The algorithm opens iteratively the facility
x_i ∈ F with the smallest radius that has no other open facility in the ball whose center is
x_i and whose radius is twice the radius of x_i. The algorithm yields an approximation ratio
of 3 and has a running time of O(n^2). Finally, the second best approximation algorithm
for the general facility location problem is based on the algorithm of Jain et al. [68]. This
algorithm also involves the primal-dual method. It achieves an approximation ratio of 1.52
and has a running time of Õ(n^2) [83].
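The radius-based greedy rule of Mettu and Plaxton described above can be sketched as follows. The text does not spell out how the radii are defined, so the sketch rests on an assumption: for unit demands, the radius r of a point p is taken to be the value for which the sum of r − D(p, q) over all points q with D(p, q) ≤ r equals the opening cost of p, which is how this quantity is commonly stated. The function names are hypothetical, points are assumed to be distinct tuples, and opening costs are assumed to be positive.

```python
import math


def mp_radius(p, points, f):
    """Radius r of p with sum over q in ball(p, r) of (r - dist(p, q)) equal to f (assumed definition)."""
    dists = sorted(math.dist(p, q) for q in points)  # includes dist(p, p) = 0
    prefix = 0.0
    for k, d in enumerate(dists, start=1):
        prefix += d
        r = (f + prefix) / k  # radius if exactly the k nearest points lie in the ball
        if k == len(dists) or r <= dists[k]:
            return r


def mp_greedy(points, opening_cost):
    """Open points in order of increasing radius unless an open facility lies within twice the radius."""
    radii = {p: mp_radius(p, points, opening_cost[p]) for p in points}
    opened = []
    for p in sorted(points, key=lambda x: radii[x]):
        if all(math.dist(p, q) > 2 * radii[p] for q in opened):
            opened.append(p)
    return opened, radii
```

Intuitively, a point in a dense region reaches its opening cost with a small radius and is therefore considered early, while isolated points obtain a radius close to their opening cost.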
Another algorithm for the general facility location problem has been developed by
Charikar and Guha [27]. This algorithm is a 1.728-approximation algorithm that combines
the primal-dual method of Jain and Vazirani [70], a modified version of the local
search technique presented by Guha and Khuller [51], and the LP rounding algorithm of
Chudak and Shmoys [30]. Furthermore, Byrka and Aardal [18] proposed an algorithm that
uses a modified version of the algorithm of Chudak and Shmoys [30] and combines this
with the algorithm of Jain et al. [68]. The algorithm yields the best approximation ratio
up to now, which is 1.5.
Concerning hardness results, Guha and Khuller [51] proved by a reduction from set cover
that the general metric facility location problem of n input points cannot be approximated
in polynomial time within a factor of 1.463, unless NP ⊆ DTIME[n^{O(log log n)}]. Combining
this result with an observation of Sviridenko implies that the approximation lower bound
of 1.463 also holds, unless P = NP (see [108]). Furthermore, Thorup proved [106] that any
constant-factor approximation algorithm, even a randomized one, requires Ω(n^2) time to
compute a solution to the general metric facility location problem. Bădoiu et al. [14] extended
this hardness result by showing that any bounded-factor approximation algorithm,
even a randomized one, requires Ω(n^2) time to compute the cost of the general metric
facility location problem. This result holds even for the variant with uniform opening
costs.
However, for some special variants of the metric facility location problem, the hardness
results mentioned above are no longer valid. For instance, Bădoiu et al. [14] considered
the metric facility location problem with uniform opening costs in which every point can
open a facility. In a first step, they proved that the sum of the radii defined by Mettu
and Plaxton [87] is an estimator that approximates the optimal facility location cost to
within a constant factor. In a second step, they showed how to obtain a constant-factor
approximation of this estimator by using an adaptive sampling approach. This resulted in
an algorithm for the considered variant of the metric facility location problem that computes
a constant-factor approximation of the cost in O(n · log^2(n)) time. Furthermore, the
non-approximability result of Guha and Khuller [51] is no longer valid in the special case of
Euclidean spaces. A first randomized polynomial-time approximation scheme for the general
Euclidean facility location problem in the plane has been developed by Arora et al. [8].
This algorithm is based on the Arora-scheme [7] and computes a (1 + ε)-approximation
in O(n^{1+O(1/ε)} log(n)) time. The result of Arora et al. [8] was then improved by Kolliopoulos
and Rao [76]. Assuming that there exists any polynomial-factor approximation
for the total connection cost, they obtained a randomized polynomial-time approximation
scheme that works in any constant-dimensional Euclidean space and has a running time of
O(2^{O((log(1/ε)/ε)^{d−1})} n log^{d+6}(n)).
For a more comprehensive overview of results on facility location problems in a classical
setting, we refer the reader to the surveys by Shmoys [101] and Vygen [108].
The facility location problem has also been investigated in other settings. In the follow-
ing, we will summarize the results obtained in distributed and kinetic settings as well as
in data stream models.
Distributed Setting
Surprisingly, the first algorithm for a distributed facility location problem was proposed
just a few years ago [90]. Given a set of m facilities and a set of n clients, Moscibroda
and Wattenhofer [90] investigated the general non-metric facility location problem (i.e.,
distances do not have to satisfy the triangle inequality) in a synchronous message passing
model. In their considered model, the communication network is a complete bipartite graph
with communication links between each facility-client pair, and each node can send in each
communication round a message containing O(log(n)) bits to each neighbor in the com-
munication network. To approach the distributed facility location problem, Moscibroda
and Wattenhofer used some ideas from the centralized primal-dual method of Jain and
Vazirani [70]. The obtained distributed primal-dual method provides a trade-off between
the number of communication rounds and the resulting approximation ratio. In particular,
it achieves an O(√k · (mρ)^{1/√k} · log(m + n))-approximation in O(k) communication rounds
with a message size of O(log(n)) bits. Here, ρ is a coefficient that depends on the cost
values of the input instance.
In Chapter 3, we consider the metric facility location variant with uniform opening costs
for the facilities and X := C = F in the synchronous message passing model where the
communication network is a clique. Compared to the problem studied in [90], our prob-
lem is much simpler, and so the algorithm presented in Chapter 3 is incomparable with
the algorithm of Moscibroda and Wattenhofer. We developed our randomized distributed
algorithm based on results from Mettu and Plaxton [87] and Bădoiu et al. [14]. As men-
tioned before, Bădoiu et al. [14] proved that the sum of the radii defined by Mettu and
Plaxton [87] is a constant-factor approximation of the optimal facility location cost. Fur-
thermore, for any facility xi ∈ X, they gave a lower bound on the number of points located
in the ball whose center is xi and whose radius equals the radius of xi. Using this lower
bound, we designed our randomized distributed algorithm in a way that it opens a subset
of the potential facilities such that, with high constant probability, the total opening cost
is at most a constant factor larger than the sum of the radii and each client xi has an open
facility in a ball whose center is xi and whose radius is at most a constant factor larger
than the radius of xi. Hence, our algorithm computes with high constant probability a
constant-factor approximation for X.
In a follow-up study on the results of Moscibroda and Wattenhofer [90] and the results
presented in Chapter 3, Pandit and Pemmaraju [97] further investigated the metric version
of the problem studied in [90]. Based on the primal-dual method of Jain and Vazirani [70]
and a rapid randomized sparsification of graphs due to Gfeller and Vicari [49], they obtained
a 7-approximation in O(log(m) + log(n)) communication rounds with a message size of
O(log(m+n)) bits. This technique was then generalized to get an algorithm that, for each
constant k, runs in k communication rounds and computes a solution whose cost is only a
factor of O(m^{2/√k} · n^{3/√k}) larger than the optimal cost. We point out that the technique
of Pandit and Pemmaraju can also be used to obtain a constant-factor approximation in
O(log(n)) communication rounds for the variant of our considered metric facility location
problem where the opening costs of the facilities are non-uniform. Therefore, their result can
be seen as a generalization of our result.
For more information about distributed computing, the reader is referred to [81, 98].
Kinetic Setting
Some frameworks have been proposed for handling kinetic data. In this thesis, we consider
a common model for processing points in motion, called kinetic data structures (KDS),
which was introduced by Basch et al. [15].
Prior to the work presented in Chapter 4, no KDS for the facility location
problem was known. However, some results have been obtained in the KDS framework for
problems related to clustering, to which the facility location problem belongs. For instance,
Gao et al. [46] provided a randomized KDS to maintain a set of centers among moving
points in the plane such that, given a specified radius, all the points are covered by balls
of the given radius centered at the chosen center points. Gao et al. showed that the size
of the center set is at most a constant factor larger than the minimum one. Hershberger
investigated a similar problem in [60]. More precisely, he proposed a deterministic KDS
for maintaining a covering of moving points in R^d by unit boxes such that the number of
boxes is always within a factor of 3^d of the optimal static covering at any instant. Another
clustering problem that has been studied in the KDS framework is the kinetic k-center
problem. The goal of the kinetic k-center problem is to maintain a set of k centers so as to
minimize the maximum distance of any point to its closest center at any point of time. Gao
et al. [47] proposed a deterministic KDS that maintains, for a set of moving points in R^d,
an 8-approximation of the discrete k-center problem, i.e., the centers have to be a subset of
the moving input points. Bereg et al. [17] studied 1-center problems in which the center is
not necessarily located at one of the moving input points. Among other results, they showed
that, given a precision parameter ε, 0 < ε < 1, there is a strategy for moving a center such
that the location of this center provides a (1 + ε)-approximation of the 1-center problem
for a set of moving points in the plane and, assuming each input point moves with velocity
at most 1, the velocity of the center never exceeds (2 + ε)(1 + ε)/√(2ε + ε^2). Furthermore, a
KDS for the k-center problem in the context of outliers can be found in [45]. Har-Peled [56]
investigated the k-center problem in a mobile setting different from the KDS framework.
Instead of handling events, a static set which ensures a constant-factor approximation at
all times is provided. However, the size of this set is k^{µ+1}, where µ is the degree of the
polynomial of the trajectories. Finally, we are aware of another result concerning clustering
which addresses a randomized KDS for the Euclidean max-cut problem [31, 42].
For other work on KDSs, we refer the reader to the surveys by Guibas [54, 55].
Unfortunately, it does not seem that the only known (1+ε)-approximation algorithms for
facility location [8, 76] can be transferred to the kinetic setting since they are based on the
Arora-scheme [7] including dynamic programming techniques, which do not lend themselves
to kinetization. Our KDS for the mobile Euclidean facility location problem combines
a modified version of the greedy algorithm of Mettu and Plaxton [87] with a counting
argument of Bădoiu et al. [14]. Given any static Euclidean point set P , the original greedy
algorithm opens as few facilities as possible in a way that each point pi ∈ P has at least
one open facility in the ball with center pi and twice the radius of pi. This results in a
constant-factor approximation of the facility location problem for P . Concerning the radii
defined by Mettu and Plaxton, the counting argument of Bădoiu et al. asserts that, for any
facility pi ∈ P , a constant-factor approximation of the radius of pi can be computed by just
counting the number of points from P contained in exponentially growing balls centered
at pi. This counting argument enables us to efficiently kinetize a modified version of the
static greedy algorithm proposed by Mettu and Plaxton.
Data Streams
Although many geometric approximation algorithms have been developed in data stream
models, we are only aware of three results concerning facility location problems.
In [41], Fotakis presented a streaming algorithm for the metric facility location variant
in which every input point is a potential facility location. The algorithm combines an
online facility location algorithm due to Meyerson [88] with an incremental facility location
algorithm due to Fotakis [40]. The course of the algorithm is controlled by so-called final
distances. The final distance of a point is an upper bound on the distance of this point
to the nearest facility at any future point of time. While reading the input stream, the
next point is chosen as open facility with a probability that is proportional to the ratio
between the final distance of this point and the opening cost. In case that a point is chosen
as open facility, it is stored in memory and replaces every currently stored facility which
has the property that its distance to the new facility is at most a certain fraction of its
final distance. In this way, the algorithm maintains a set of open facilities such that the
total associated facility location cost is at most a constant factor larger than the optimal
facility location cost. Unfortunately, both the update time and the space requirement of
the algorithm are linear in the number of opened facilities, which can be linear in the input
size.
Chang [26] developed a multi-pass streaming algorithm for the metric facility location
variant in which every input point is a potential facility location. In contrast to the data
stream models considered in this thesis, in the multi-pass streaming model, an algorithm
is allowed to perform more than one sequential scan over the input data. During and after
each such pass, the amount of available local memory space is assumed to be sublinear in
the size of the input stream. To approach the facility location problem, Chang used an
iterative algorithm that is based on a technique proposed by Indyk [61]. In each iteration,
the algorithm takes a random sample from the input stream and computes a subset of open
facilities by applying some known facility location algorithm on the sample set. Then, the
algorithm removes all the points from consideration that are served sufficiently well and
iterates on all the remaining points. This is repeated, until all input points are served
sufficiently well. Chang showed that his algorithm uses O(ℓ) passes and Õ(k n^{2/ℓ}) space to
compute a set of open facilities such that the total associated facility location cost is at most
a factor of O(ℓ) larger than the optimal facility location cost. Here, k is the number of open
facilities and n is the number of input points. Thus, similar to Fotakis’ streaming algorithm,
there exist facility location instances for which the space requirement of Chang’s algorithm
is not sublinear in the input size. However, Chang justified his approach by proving that,
for the considered facility location problem, any randomized ℓ-pass streaming algorithm
requires Ω(n/ℓ) bits of memory to compute even a polynomial-factor approximation of the
optimal facility location cost.
Prior to the result presented in Chapter 5, the only real streaming algorithm for
facility location was proposed in [64], where the author introduced the model of dynamic
geometric data streams and studied different geometric problems in this model. A dynamic
geometric data stream is a sequence of insert and delete operations on a point set P ⊆
{1, ..., ∆}^d in a discrete d-dimensional Euclidean space. In the facility location variant
studied in [64], the opening costs are uniform and every point in P can open a facility. For
the purpose of guaranteeing a space requirement that is only polylogarithmic in the size
of the input stream, Indyk developed an algorithm that approximates the optimal facility
location cost for P instead of an optimal set of open facilities. This is done by defining a
certain partition of the space into nested square grids and a set of cells in this partition such
that the number of these cells gives an O(log(∆))-approximation of the optimal facility
location cost. During the approximation process to estimate the number of these cells, the
algorithm of [64] loses another factor of O(log(∆)).¹
¹ The author of [64] mentions that, with the help of a more intricate analysis, the approximation factor
can be improved to O(log(∆)).
In Chapter 5, we use a similar partition of the space into nested square grids as in [64],
and we show that opening a subset of the cells defined in [64] results in a constant-factor
approximation of the optimal facility location cost. This leads to a streaming algorithm
that computes a constant-factor approximation of the cost for the facility location problem
considered in [64], which strongly improves Indyk’s result.
We point out that the approximation of the facility location cost was considered again
in [14]. As mentioned at the beginning of this section, the authors of [14] proposed a
sublinear-time algorithm that computes in O(n log^2(n)) time a constant-factor approximation
for the cost of the metric facility location variant in which every input point is a
potential facility location and in which the opening costs for the facilities are uniform.²
Unfortunately, despite the relation of streaming and sublinear-time algorithms, the techniques
cannot be transferred to the other model.
Note that the facility location problem in which every input point is a potential facility
location and in which the opening costs for the facilities are uniform is closely related to the
k-median and k-means clustering problems. In the k-median clustering problem, we are
given a set of points and an integer k, and the goal is to determine a set of k centers such
that the sum of the distances from the input points to their corresponding nearest center
is minimized. The cost function of the k-means clustering problem differs from the one of
the k-median clustering problem only in the way that we sum up the squared distances
from the input points to their corresponding nearest center. For both clustering problems,
a number of streaming algorithms have been developed [4, 28, 29, 36, 37, 44, 53, 57, 58].
Like the streaming algorithm presented in Chapter 6, many of these algorithms apply a
merge-and-reduce technique based on a decomposition technique of Bentley and Saxe [16]
to obtain a small coreset (see [2] for the introduction of the notion of coresets). Our
coreset construction for the k-means clustering problem is based on the k-means++ seeding
procedure [9]. We point out that the k-means++ seeding has also been investigated in [3]
and [4]. However, our result differs from the results given in [3] and [4] and was obtained
independently.
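The difference between the two cost functions described above can be made concrete with a few lines of Python. This is only an illustrative sketch; the function name `clustering_cost`, the `squared` flag, and the one-dimensional example data are assumptions made here and are not taken from the cited works.

```python
def clustering_cost(points, centers, dist, squared=False):
    """Sum over all points of the (optionally squared) distance to the nearest center."""
    power = 2 if squared else 1
    return sum(min(dist(p, c) for c in centers) ** power for p in points)


def line_dist(a, b):
    # Hypothetical example metric: absolute difference on the real line.
    return abs(a - b)


points = [0, 2, 10]
centers = [0, 10]  # k = 2 centers chosen among the input points

print(clustering_cost(points, centers, line_dist))                # k-median cost: 2
print(clustering_cost(points, centers, line_dist, squared=True))  # k-means cost: 4
```

Only the point 2 is not a center; it contributes its distance 2 to the k-median cost and the squared distance 4 to the k-means cost.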
In any case, all known algorithms for the k-median and k-means clustering problems
require space Ω(k). Thus, they implicitly assume that k is small, i.e., k ∈ polylog(∆)
in dynamic data streams and k ∈ polylog(n) in insertion-only data streams, where ∆ is
the spread of the input points and n is the length of the stream. As mentioned above, in
facility location problems in which every input point is a potential facility location, the
number of cluster centers k can be as large as the maximum size of the point set under
consideration. In Chapter 5, we will show that we can approximate the cost for such a
facility location problem in space o(k). No similar result is known for the k-median and
k-means clustering problems.
For other work in data stream models, we refer the reader to [66, 92, 93].
² Since the size of the representation of an n-point metric space is Θ(n^2), the complexity of this algorithm
is sublinear with respect to the input size.
1.3.2 Compact Representations of Finite Metric Spaces
The compact representations of finite metric spaces considered in this thesis are well-separated
pair decompositions with slack (WSPDs with slack) and metric embeddings
with slack. Since we are not aware of any prior work on WSPDs with slack, the following
overview deals with classical WSPDs. Afterwards, we will summarize the results obtained
in the area of metric embeddings with slack.
WSPD
The notion of WSPD has been introduced in [22]. In the same paper, Callahan and
Kosaraju showed that, for any set of n points from any constant-dimensional Euclidean
space and for any constant ε with 0 < ε < 1, there always exists an ε-WSPD consisting of
O(n) pairs and such an ε-WSPD can be computed in O(n log(n)) time. The construction
was later simplified by Har-Peled and Mendel [59], who observed that a WSPD can directly
be generated from a compressed quadtree [22]. Also based on a compressed quadtree
construction, Chan [25] showed that a WSPD for a Euclidean point set can be found in
linear time if the spread of the point set is polynomially bounded in the size of the point set.
Concerning dynamic point sets, Callahan [20] presented a deterministic algorithm and later
Fischer and Har-Peled [38] a simpler randomized algorithm that maintains a WSPD for
a constant-dimensional Euclidean point set in polylogarithmic time under insertions and
deletions. In high dimensions, it is known that a WSPD can have quadratic complexity.
One example is the uniform n-point metric (with all pairwise distances equal to 1), which
can be realized as the vertices of a simplex in R^{n−1}.
Since WSPDs are useful data structures to represent distances between points efficiently,
they have been applied for solving many proximity problems for point sets in a Euclidean
space [10, 11, 19, 21, 22, 34, 50, 59, 78, 94].
Talwar [105] extended the notion of WSPD to spaces with low doubling dimension. He
showed that, given any constant ε with 0 < ε < 1 and any n-point metric space with
constant doubling dimension and spread ∆, there always exists an ε-WSPD consisting of
O(n log(∆)) pairs. Furthermore, Gao and Zhang studied the construction of WSPDs for
unit-disk graphs [48].
Metric Embedding with Slack
The theory of metric embeddings received much attention in recent years, and embedding
techniques have been applied in the development and analysis of many algorithms that
operate on an underlying metric space. For recent work on metric embeddings, we refer
the reader to the surveys [63, 67, 84]. In the following overview of prior work, we focus on
the results that are related to metric embeddings with slack or that have been relevant in
designing the algorithms presented in Chapter 8.
Kleinberg et al. [75] introduced the notion of embeddings with slack. Among other re-
sults, they showed that, for any constant σ with 0 < σ < 1, any metric space with bounded
doubling dimension can be embedded with distortion O(1) and slack σ into a constant-
dimensional Euclidean space. The results from [75] have been extended to arbitrary metric
spaces and to embeddings under any ℓ_p norm, p ≥ 1, by Chan et al. [23]. Furthermore,
Abraham et al. [1] developed embeddings with low distortion and low slack for arbitrary
metric spaces that additionally guarantee a constant average distortion.
Metric approximation with slack has also been investigated in the setting of graph span-
ners. Chan et al. [24] showed that, for any weighted graph G and any ε with 0 < ε < 1,
there exists a spanner of G with a linear number of edges achieving stretch O(log(1/ε))
and slack ε. The authors also gave a spanner construction which is the starting point
of the embedding with slack of general metric spaces presented in Chapter 8. In order
to transform this construction to the streaming model, we use a technique that has been
applied by Czumaj and Sohler [32] to achieve 2-pass streaming algorithms for clustering
problems. We point out that, in the same paper, Czumaj and Sohler [32] introduced the
concept of α-preserving metric embeddings, which is closely related to embeddings with
slack. Their concept can be seen as a generalization of coresets. The goal is to embed a
metric space into a structurally simpler metric space that approximates the original metric
up to a factor of α with respect to a given optimization problem.
Embeddings of point sets into trees via a quadtree partitioning have been used by
Indyk [64] to obtain approximation algorithms for several geometric problems. Also,
Frahling and Sohler [44] applied a similar quadtree partitioning to get streaming algorithms
for different clustering problems. In Chapter 8, we use a similar partitioning technique to
embed Euclidean metric spaces with low distortion and low slack.
2 Preliminaries
This chapter deals with definitions that are used throughout the whole thesis. More special
definitions, which are only used to describe or analyze a certain algorithm, are introduced in
the corresponding main chapters. Therefore, the first section of most of the main chapters
is for preliminaries, which include such special definitions.
In this thesis, we will develop approximation algorithms for geometric problems in various
metric spaces. In particular, in Chapters 7 and 8, we will present algorithms for computing
compact space representations of different types of finite metric spaces. Section 2.1 covers
definitions of these metric spaces.
A big part of this thesis is devoted to facility location problems. We will consider various
facility location problems in different kinds of settings. More precisely, we will present
facility location algorithms for distributed and mobile settings as well as for data streams.
Formal definitions of the considered facility location problems are given in Section 2.2.
Our facility location algorithms for distributed and mobile settings are based on the greedy
algorithm of Mettu and Plaxton [87]. We will present this algorithm in Section 2.3. Finally,
in Section 2.4, we will introduce the computational models that we have applied to develop
our algorithms in the different kinds of settings.
2.1 Distance Functions and Metric Spaces
Metrics form an important class of distance functions. In this section, we will give
formal definitions of general, Euclidean, and doubling metric spaces.
Distance Functions
Let X be any non-empty set of elements. A function D : X×X → R is a distance function
on X if it satisfies the following axioms:
• Non-Negativity: For any x, y ∈ X, we have D(x, y) ≥ 0.
• Symmetry: For any x, y ∈ X, we have D(x, y) = D(y, x).
We generalize the definition of distance functions to sets. More precisely, for any finite
set X and any distance function D on X, we define
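\[ \forall x \in X \;\; \forall Y \subseteq X : \quad D(x, Y) := \min_{y \in Y} D(x, y) \]
and
\[ \forall Y \subseteq X \;\; \forall Z \subseteq X : \quad D(Y, Z) := \min_{y \in Y} D(y, Z) . \]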
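The two set-distance definitions translate directly into nested minima. The following Python sketch is only an illustration; the names `dist_point_set` and `dist_set_set` and the example distance function on the real line are assumptions that do not appear in the thesis, and both functions assume non-empty sets.

```python
def dist_point_set(dist, x, Y):
    """D(x, Y): distance from x to the nearest element of the non-empty set Y."""
    return min(dist(x, y) for y in Y)


def dist_set_set(dist, Y, Z):
    """D(Y, Z): smallest distance between an element of Y and an element of Z."""
    return min(dist_point_set(dist, y, Z) for y in Y)


def line_dist(a, b):
    # Hypothetical example distance function: absolute difference on the real line.
    return abs(a - b)


print(dist_point_set(line_dist, 5, {1, 4, 9}))   # 1
print(dist_set_set(line_dist, {0, 10}, {4, 7}))  # 3
```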
General Metric Spaces
A metric space M is a pair (X,D), where X is a non-empty set of elements and D is a
distance function on X that satisfies the following axioms:
• Reflexivity: For any x, y ∈ X, we have D(x, y) = 0 if and only if x = y.
• Triangle Inequality: For any x, y, z ∈ X, we have D(x, z) ≤ D(x, y) + D(y, z).
The complexities of several algorithms presented in this thesis depend on the spread of
the given input metric space. For a finite metric space M = (X,D) with D(x, y) ≠ 0 for
all pairs (x, y) ∈ X × X with x ≠ y, the spread of M is defined as the ratio of the farthest
pair distance in X to the closest pair distance in X.
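As a small illustration of this definition, the spread of a finite point set can be computed by brute force over all pairs. The sketch below is an assumption-laden example rather than part of the thesis: the function name `spread` and the sample points are hypothetical, and the quadratic enumeration is chosen for clarity, not efficiency.

```python
import math
from itertools import combinations


def spread(points, dist):
    """Ratio of the farthest pair distance to the closest pair distance."""
    pair_dists = [dist(p, q) for p, q in combinations(points, 2)]
    if not pair_dists or min(pair_dists) == 0:
        raise ValueError("need at least two points and no zero pairwise distance")
    return max(pair_dists) / min(pair_dists)


# Hypothetical example: three points on a line in the Euclidean plane.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
print(spread(points, math.dist))  # 4.0
```

Here the farthest pair has distance 4 and the closest pair has distance 1, so the spread is 4.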
Euclidean Metric Spaces
Our distance measure will often be the Euclidean distance. The Euclidean distance between
two points is given by the Euclidean length of the difference vector of both points. More
precisely, let x := (x^{(1)}, x^{(2)}, ..., x^{(d)}) and y := (y^{(1)}, y^{(2)}, ..., y^{(d)}) be any two points from
the Euclidean space R^d, where the dimension d ∈ N is any natural number. Then, the
Euclidean distance between x and y is defined as
\[ D(x, y) := \|x - y\| = \sqrt{\sum_{i=1}^{d} \left( x^{(i)} - y^{(i)} \right)^2} . \]
Since the Euclidean distance satisfies the condition of reflexivity and the triangle inequality,
the pair (R^d, D) is a metric space.
Doubling Metric Spaces
A metric space M = (X,D) is called a doubling metric space if there exists some λ ∈ N
such that each ball with any radius r centered at any point in X can be covered by 2^λ
balls each of radius r/2 and centered at a point in X. The value λ is called the doubling
dimension of M .
The doubling dimension can be seen as a generalization of the Euclidean dimension since
R^d has a doubling dimension of Θ(d) [59]. Besides, the doubling dimension extends the
notion of growth restricted metric spaces defined by Karger and Ruhl [74].
2.2 Facility Location Problems
In this section, we will define different types of facility location problems. The first problem
will be a facility location problem in general metric spaces. The problem definition is then
extended to powers of metric spaces. Finally, we will introduce a mobile facility location
problem in Euclidean spaces.
Metric Facility Location Problem
In the metric facility location problem, we are given a metric space (ℱ ∪ 𝒞, D), where
ℱ := {x1, x2, ..., xm} is a set of m facilities and 𝒞 := {y1, y2, ..., yn} is a set of n clients.
With each facility xi ∈ ℱ, there is a non-negative opening cost fi associated. Each client
yj ∈ 𝒞 has a non-negative demand dj. The goal is to find a subset F ⊆ ℱ of open facilities
such that the objective
\[ \mathrm{FacLoc}((\mathcal{F}, \mathcal{C}), F) := \sum_{x_i \in F} f_i + \sum_{y_j \in \mathcal{C}} d_j \cdot D(y_j, F) \]
is minimized. The first part of the objective is the opening cost related to the open facilities
in F. The second part of the objective is the cost related to all clients in 𝒞, which we call
the connection cost.
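To make the objective concrete, the following Python sketch evaluates it for a given set of open facilities. It is a minimal illustration under stated assumptions: the function name `fac_loc`, the dictionary-based representation of opening costs and demands, and the example instance are all hypothetical and only serve to show how the two parts of the cost are summed.

```python
import math


def fac_loc(open_facilities, opening_cost, demand, dist):
    """Opening cost of the open facilities plus demand-weighted connection cost."""
    if not open_facilities:
        raise ValueError("at least one facility must be open")
    opening = sum(opening_cost[x] for x in open_facilities)
    connection = sum(
        d * min(dist(y, x) for x in open_facilities) for y, d in demand.items()
    )
    return opening + connection


# Hypothetical instance in the plane: two facilities and two clients.
opening_cost = {(0.0, 0.0): 3.0, (10.0, 0.0): 5.0}
demand = {(1.0, 0.0): 2.0, (9.0, 0.0): 1.0}

print(fac_loc([(0.0, 0.0)], opening_cost, demand, math.dist))               # 14.0
print(fac_loc([(0.0, 0.0), (10.0, 0.0)], opening_cost, demand, math.dist))  # 11.0
```

In this instance, opening the second facility raises the opening cost from 3 to 8 but reduces the connection cost from 11 to 3, so opening both facilities is cheaper.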
Throughout the whole thesis, we will only consider the variant of the metric facility
location problem with X := ℱ = 𝒞, where X := {x1, x2, ..., xn} is a set of n points. We
then write the facility location cost for short as
\[ \mathrm{FacLoc}(X, F) := \mathrm{FacLoc}((\mathcal{F}, \mathcal{C}), F) . \]
In the uniform metric facility location problem with X := ℱ = 𝒞, both the opening costs
of the facilities and the demands of the clients are uniform. More precisely, we assume that,
for each xi ∈ X, we have fi = f for some fixed value f ≥ 0 and di = 1. Then, the goal is
to find a subset F ⊆ X of open facilities such that the objective
\[ \mathrm{FacLoc}(X, F, f) := f \cdot |F| + \sum_{x_j \in X} D(x_j, F) \]
is minimized.
In case that the given metric space is a Euclidean space, we call the problem the (uni-
form) Euclidean facility location problem.
Facility Location Problem for Powers of Metric Spaces
In the facility location problem for powers of metric spaces, we are given a metric space
(ℱ ∪ 𝒞, D) and a constant metric exponent ℓ ≥ 1. As for the metric facility
location problem, in this thesis, we will only consider the variant with X := ℱ = 𝒞, where
X := {x1, x2, . . . , xn} is a set of n points. With each point xi ∈ X, there is a non-negative
opening cost fi and a non-negative demand di associated. The goal is to find a subset
F ⊆ X of open facilities such that the objective
\[ \mathrm{FacLoc}(X, F, \ell) := \sum_{x_i \in F} f_i + \sum_{x_j \in X} d_j \cdot D(x_j, F)^{\ell} \]
is minimized.
In the uniform facility location problem for powers of metric spaces with X := ℱ = 𝒞,
both the opening costs and the demands of the points are uniform. We assume that, for
each xi ∈ X, we have fi = f for some fixed value f ≥ 0 and di = 1. Then, the goal is to
find a subset F ⊆ X of open facilities such that the objective
\[ \mathrm{FacLoc}(X, F, f, \ell) := f \cdot |F| + \sum_{x_j \in X} D(x_j, F)^{\ell} \]
is minimized.
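The effect of the metric exponent on the uniform objective can be seen in a short Python sketch. The function name `uniform_power_cost` and the one-dimensional example are assumptions made for illustration only; with exponent 1 the function reduces to the uniform metric facility location cost.

```python
def uniform_power_cost(points, open_facilities, f, ell, dist):
    """f * |F| plus the sum of the ell-th powers of the distances to the nearest open facility."""
    return f * len(open_facilities) + sum(
        min(dist(p, q) for q in open_facilities) ** ell for p in points
    )


def line_dist(a, b):
    # Hypothetical example metric: absolute difference on the real line.
    return abs(a - b)


points = [0, 1, 5]
print(uniform_power_cost(points, [0], 2, 1, line_dist))     # 8
print(uniform_power_cost(points, [0], 2, 2, line_dist))     # 28
print(uniform_power_cost(points, [0, 5], 2, 2, line_dist))  # 5
```

With exponent 2, the distant point 5 contributes 25 instead of 5, so opening a second facility at 5 pays off much more clearly than in the case of exponent 1.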
Mobile Facility Location Problem
In the mobile facility location problem, we are given a set of moving facilities ℱ and a set of
moving clients 𝒞 in a Euclidean space R^d. As described before, in this thesis, we will only
consider the mobile facility location problem with P := ℱ = 𝒞, where P := {p1, p2, ..., pn}
is a set of n moving points in R^d. Let pi(t) denote the position of pi at time t, and let
P (t) := {p1(t), p2(t), . . . , pn(t)}. For each point pi ∈ P , there exists a non-negative opening
cost fi and a non-negative demand di. Observe that both the opening cost and the demand
of a point do not change over time. The mobile facility location problem is to maintain,
at each point of time t, a subset F (t) ⊆ P (t) of open facilities such that
\[ \mathrm{FacLoc}(P(t), F(t)) := \sum_{p_i(t) \in F(t)} f_i + \sum_{p_j(t) \in P(t)} d_j \cdot D(p_j(t), F(t)) \]
is minimized.
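The following Python sketch illustrates why the set of open facilities has to be maintained over time. It assumes, purely for illustration, that every point moves linearly, i.e., p_i(t) = a_i + t · v_i; the names `position` and `mobile_cost` and the example data are likewise hypothetical, and the definition above does not prescribe this kind of motion.

```python
import math


def position(start, velocity, t):
    """Position at time t of a point moving linearly (an assumption of this sketch)."""
    return tuple(s + t * v for s, v in zip(start, velocity))


def mobile_cost(starts, velocities, opening_cost, demand, open_idx, t):
    """FacLoc(P(t), F(t)) for the facilities with indices open_idx at time t."""
    pos = [position(s, v, t) for s, v in zip(starts, velocities)]
    opening = sum(opening_cost[i] for i in open_idx)
    connection = sum(
        demand[j] * min(math.dist(pos[j], pos[i]) for i in open_idx)
        for j in range(len(pos))
    )
    return opening + connection


# Two points in the plane: p_0 stays at the origin, p_1 moves away with unit speed.
starts = [(0.0, 0.0), (0.0, 0.0)]
velocities = [(0.0, 0.0), (1.0, 0.0)]
opening_cost = [3.0, 3.0]
demand = [1.0, 1.0]

for t in (1.0, 5.0):
    only_first = mobile_cost(starts, velocities, opening_cost, demand, [0], t)
    both = mobile_cost(starts, velocities, opening_cost, demand, [0, 1], t)
    print(t, only_first, both)
```

At time 1, opening only the first point costs 4 while opening both costs 6. At time 5, the costs are 8 and 6, so the cheaper choice changes as the points move.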
2.3 The Mettu-Plaxton Algorithm
This section addresses the greedy algorithm of Mettu and Plaxton [87] that computes a
constant-factor approximation for the metric facility location problem. Let (X,D) be a
metric space, where X = {x1, . . . , xn} is a set of n points and D is a distance function
defined on X. Following the definitions from Section 2.2, the opening cost of a point xi ∈ X
is denoted by fi and its demand by di.
As mentioned in the previous chapter, the Mettu-Plaxton algorithm implicitly applies
the primal-dual method of Jain and Vazirani proposed in [70]. This is done by defining
so-called ‘radii’ for amortizing the cost needed to open a facility at a particular location.
The idea of the Mettu-Plaxton algorithm is to open only a few facilities but, at the same
time, to guarantee that each point xi ∈ X has at least one open facility in the ball with
center xi and twice the radius of xi. After giving a formal definition of balls and radii, we
describe the algorithm in more detail.
Balls. For a point xi ∈ X and a non-negative value r, we define B(xi, r) to be the ball
with center xi and radius r. Given such a ball B(xi, r), we let weight(B(xi, r)) denote the
sum of the demands of all the points in X that are located in the ball B(xi, r), i.e., we
define
\[ \mathrm{weight}(B(x_i, r)) := \sum_{x_j \in X \cap B(x_i, r)} d_j . \]
Radius Associated with a Point. According to [87], for each point xi ∈ X, we define the
value ri to be the radius of the ball with center xi that satisfies
\[ \sum_{x_j \in X \cap B(x_i, r_i)} d_j \cdot (r_i - D(x_i, x_j)) = f_i . \tag{2.1} \]
Figure 2.1: Illustration of $\sum_{x \in X \cap B(x_i, r_i)} (r_i - D(x_i, x))$ (in case of uniform demands with
dj = 1 for all xj ∈ X). The dashed lines correspond to the distances summed up.
Figure 2.1 illustrates the definition of the radius ri associated with a point xi. Observe
that the sum on the left hand side of Equation (2.1) is continuous and strictly monotonically
increasing with ri. Hence, there exists a unique value ri satisfying the equation. Moreover,
for any point xi ∈ X, the radius ri ranges between
\[ r_{\min} := \frac{\min_{x_j \in X} f_j}{n \cdot \max_{x_j \in X} d_j} \quad \text{and} \quad r_{\max} := \frac{\max_{x_j \in X} f_j}{\min_{x_j \in X} d_j} . \]
The lower limit of the range is met if (i) fi = min_{xj∈X} fj, (ii) all the points in X are at the
same position, and (iii) the demands of all the points are uniform such that dℓ = max_{xj∈X} dj
for any ℓ ∈ {1, . . . , n}. Because of Conditions (ii) and (iii), the contribution of each point
xj ∈ X to the sum is ri · max_{xj∈X} dj, which is the highest possible value. The upper
limit of the range is met if (i) fi = max_{xj∈X} fj, (ii) xi is the only point in the ball with
radius ri and center xi, and (iii) di = min_{xj∈X} dj. In this case, due to Condition (ii), the
contribution of each point xj ∈ X\{xi} to the sum is 0, and, due to Condition (iii), the
contribution of xi is ri · min_{xj∈X} dj, which is the lowest possible value.
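Since the left hand side of Equation (2.1) is piecewise linear in ri, the radius can be found by a sweep over the points in order of their distance from xi. The following is a minimal sketch under the assumptions that all demands are positive and that the list of distances contains the entry 0 for xi itself; the name `mp_radius` is illustrative and not taken from [87].

```python
def mp_radius(dists, demands, f):
    """Solve sum_j d_j * max(0, r - D(x_i, x_j)) = f for r.

    dists   -- distances D(x_i, x_j) from x_i to every point (0 for x_i itself)
    demands -- positive demands d_j, in the same order as dists
    f       -- opening cost f_i of x_i
    """
    order = sorted(range(len(dists)), key=lambda j: dists[j])
    weight = 0.0  # total demand of the points currently assumed inside the ball
    moment = 0.0  # sum of d_j * D(x_i, x_j) over those points
    for pos, j in enumerate(order):
        weight += demands[j]
        moment += demands[j] * dists[j]
        # Candidate radius if exactly the first pos + 1 points lie in the ball.
        r = (f + moment) / weight
        nxt = dists[order[pos + 1]] if pos + 1 < len(order) else float("inf")
        if r <= nxt:
            return r
    raise ValueError("no radius found; check that all demands are positive")
```

For three points at the same position with unit demands and f = 3, the sketch returns 1, which matches the lower limit rmin of the range stated above.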
The Algorithm. First, the Mettu and Plaxton algorithm computes for each point xi ∈ X
its associated radius ri. Then, it goes through all the points in X in non-decreasing order
of their radii and opens a facility at a point xi ∈ X if xi has no open facility in the ball
with center xi and radius 2ri. A pseudocode listing of the Mettu and Plaxton algorithm is
given by Algorithm 2.3.1.
Let FacLoc∗(X) be the optimal facility location cost for X. Then, Mettu and Plaxton
obtained the following result:
Algorithm 2.3.1 Mettu-Plaxton-FacLoc(X)
1: calculate the radius ri for each point xi ∈ X
2: sort all points in non-decreasing order according to their radii
3: let x1, x2, . . . , xn be the sorted sequence
4: for i ← 1 to n do
5:     if there is no open facility in B(xi, 2 · ri) then
6:         open facility at xi
Theorem 1 ([87]). Given any n-point metric space (X,D), algorithm Mettu-Plaxton-FacLoc
computes a subset F ⊆ X of open facilities such that we have
\[ \mathrm{FacLoc}(X, F) \le 3 \cdot \mathrm{FacLoc}^*(X) . \]
The running time needed to compute F is O(n^2).
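Algorithm 2.3.1 can be sketched in Python as follows. This is an illustration rather than the implementation of [87]: it assumes the metric is given as a full distance matrix, that all demands are positive, and that balls are closed. The function names are hypothetical, and the straightforward radius computation makes the sketch run in O(n^2 log n) time instead of the O(n^2) stated in Theorem 1.

```python
def radius(dist_row, demands, f):
    """Radius r_i of Equation (2.1), found by a sweep over sorted distances."""
    order = sorted(range(len(dist_row)), key=lambda j: dist_row[j])
    weight = moment = 0.0
    for pos, j in enumerate(order):
        weight += demands[j]
        moment += demands[j] * dist_row[j]
        r = (f + moment) / weight
        nxt = dist_row[order[pos + 1]] if pos + 1 < len(order) else float("inf")
        if r <= nxt:
            return r
    raise ValueError("demands must be positive")


def mettu_plaxton(D, opening_costs, demands):
    """Return (indices of open facilities, list of radii)."""
    n = len(D)
    radii = [radius(D[i], demands, opening_costs[i]) for i in range(n)]
    open_facilities = []
    # Process the points in non-decreasing order of their radii.
    for i in sorted(range(n), key=lambda i: (radii[i], i)):
        # Open x_i only if no open facility lies in the ball B(x_i, 2 * r_i).
        if all(D[i][j] > 2 * radii[i] for j in open_facilities):
            open_facilities.append(i)
    return open_facilities, radii
```

For three points on a line at positions 0, 1, and 10 with unit demands and opening cost 2, the sketch opens facilities at the first and the third point.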
2.4 Computational Models
In this section, we will describe the computational models that we apply to measure the
complexity of our algorithms. This includes the synchronous message passing model for
algorithms working in a distributed setting, the kinetic data structure framework for al-
gorithms working in a mobile setting, and data stream models. Before we will give an
overview of these models, we will briefly describe the real random access machine model
because, except for algorithms working in the synchronous message passing model, we
measure the time and space complexities of our algorithms based on this model.
2.4.1 Real Random Access Machine Model
The real random access machine (RAM ) model is a simplified and idealized model of a real
computer, which is often used in computational geometry. In this model, a memory cell
can store a real number and is called a memory unit. The allowed operations are
• arithmetic operations (+, −, ·, ÷),
• comparisons of two memory cells (<, ≤, =, ≠, ≥, >), and
• some standard operations (raising a number to a given power¹, extracting a root²,
logarithmic calculus³, trigonometric functions⁴).
¹ Our algorithms only raise natural numbers to a power greater than a small constant.
² We use an extraction of a root once in our distributed facility location algorithm for powers of metric
spaces, once per embedding of a set of high-dimensional Euclidean points into a low-dimensional
Euclidean space, and many times for our KDS to compute the points of intersection of two trajectories.
³ Some of our algorithms compute a few values of the form ⌈log(x)⌉ for some real number x > 1. Since the
running time of such an algorithm is Ω(⌈log(x)⌉), the value ⌈log(x)⌉ can even be computed by linear
search for the smallest i ∈ N0 such that 2^i ≥ x, with negligible increase in the running time.
⁴ Our algorithms do not use trigonometric functions.
It is assumed that each of these allowed operations can be executed in a constant number
of time units.
In the analysis of our algorithms, we assume that each coordinate of a point can be
represented by using one memory unit, and the distance between two constant-dimensional
points can be computed in a constant number of time units. These assumptions are
commonly made in computational geometry. Unless otherwise stated, we measure the
running time of our algorithms in time units and the space requirement in memory cells.
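The linear search mentioned in the footnote on logarithmic calculus can be sketched as follows. It uses only multiplications and comparisons, so it stays within the arithmetic operations of the real RAM model; the function name is illustrative.

```python
def ceil_log2_linear(x):
    """Smallest non-negative integer i with 2**i >= x, for a real number x > 1."""
    if x <= 1:
        raise ValueError("x must be greater than 1")
    i = 0
    power = 1  # invariant: power == 2**i
    while power < x:
        power *= 2
        i += 1
    return i
```

The loop performs exactly ⌈log(x)⌉ iterations, which is why the search does not increase the asymptotic running time of an algorithm that already needs Ω(⌈log(x)⌉) time.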
2.4.2 Synchronous Message Passing Model
The synchronous message passing model is well-known and one of the most frequently used
models to design algorithms in a distributed setting [81, 98]. In this model, a network is an
undirected graph, where the nodes are the processors and the edges are the bidirectional
communication channels between the processors. Each node has a unique ID and knows
the total number of nodes in the network but does not know the topology of the network.
At the beginning, the knowledge of a node about the network topology is limited to the
neighbor nodes. To solve a given global problem, the nodes are allowed to communicate
with each other. A global problem could be to solve the facility location problem on the
network nodes, for instance. For the sake of simplicity, the communication is assumed to be
synchronous, i.e., there are globally defined communication rounds. In each such round,
each node can send a message to each of its neighbors. The size of each message
is bounded by B bits, where B is the bandwidth parameter of the network. Often, it
is assumed that the bandwidth parameter is logarithmic in the number of nodes. In this
way, each message can contain a constant number of node IDs (among others, those of the
message sender and receiver).
The time complexity of a distributed algorithm that works in the synchronous message
passing model is the number of required communication rounds.
2.4.3 Kinetic Data Structures
In 1999, Basch et al. [15] introduced the kinetic data structure (KDS) framework, which has
been used as a central model for processing objects in motion ever since (see, e.g., [2, 15, 54]
and the references therein). A KDS is a data structure that maintains a certain attribute
of a set of continuously moving objects. For instance, in case of a facility location problem,
this could be a set of open facilities that minimizes the facility location cost. The input of a
KDS is a set of objects and a flight plan, i.e., each object moves continuously along a known
trajectory. Furthermore, at any time, it is possible to change the flight plan by performing
a so-called flight plan update, which means that one object changes its trajectory. The main
idea is now that the continuous motion of the objects is utilized in a way that updates
of the KDS take place only at discrete points of time and can be processed fast. As a
result, a lot of computational effort can be saved by maintaining the KDS compared to
handling just a series of instances of the corresponding static problem. To guarantee that
the attribute is correct at any time, a KDS ensures that certain certificates are always
valid. Whenever a certificate fails, we call this an event, and an update is required. In
case of a facility location problem, such an event occurs, for instance, when a client has
moved so far away from all the open facilities that its connection cost exceeds the opening
cost of a facility. To be able to handle each event at the correct time, an event queue is
maintained.
There are four important properties to measure the quality of a KDS. The worst-case
amount of time to process an update is called responsiveness. The second and third
properties are compactness and locality. The compactness is given by the ratio between
the maximum number of certificates ever present to prove the correctness of the attribute
and the number of the moving objects. The locality addresses the maximum number of
events in the event queue in which one object can be involved. As a result, the locality is
a measure of how easily flight plan updates can be performed. The fourth property, the
efficiency of a KDS, is the ratio between the worst-case total number of processed events
and the worst-case number of processed events where the attribute changes. These worst-
case numbers are specified under certain assumptions on the trajectories of the objects.
Common assumptions are that the motions are linear or can be described by bounded-
degree polynomials. A KDS is called responsive, compact, local, and efficient, respectively,
if the associated value is at most polylogarithmic in the size of the input.
For a more detailed description of the concepts of a KDS, the reader is referred to [15,
54, 55].
2.4.4 Data Stream Models
A data stream consists of a long sequence of data items. The length of this sequence
restricts the amount of resources that is available to process the data and the type of
access to the data. In general, the amount of data is too large to be stored in main
memory. Often it is even larger than the capacity of modern hard disks. As a result, the
data has to be processed on the fly, and the only possible access to the data is sequential
reading. Typical examples of data streams are network traffic data, measurements of sensor
networks, or web crawls.
In order to design efficient algorithms for data streams, computer scientists have invented
many different data stream models. In this section, we will provide a description of the
two models considered in this thesis. These are the insertion-only data stream model and
the dynamic data stream model, which are both frequently used in the field of geometry.
For information about other data stream models and an overview of recent research, we
refer the reader to [66, 92, 93].
Insertion-Only Data Stream Model
In the insertion-only data stream model⁵, the input is a sequence (of insert operations)
of points p1, . . . , pi, . . . , pn in worst-case order. As mentioned above, the type of access
to the input points and the amount of resources to process them are restricted. More
precisely, instead of having random access to the input points, which would be very time
consuming, algorithms perform one sequential scan over the input stream that reads the
points one by one in increasing order of the index i. Furthermore, it is only allowed to
use space that is sublinear in the size of the input stream. To deal with these restrictions,
streaming algorithms try to maintain, at any time, a summary of all the data seen so far.
Such a summary is a small-space representation that fairly approximates the input data
with respect to a given problem, i.e., a solution computed on the original input data can
be approximated by using the small summary.
⁵ The insertion-only data stream model is a special type of the cash register model (confer [92]).
The complexity of a streaming algorithm is measured by its space requirement, its update
time needed to process an element of the input stream, and its time needed to extract a so-
lution for the given problem from the maintained summary. All of these three requirements
are assumed to be only polylogarithmic in the size of the input stream.
Note that most of the streaming algorithms presented in this thesis have the property
that they do not require extra time to extract a solution from the maintained summary
since all necessary computations are done during an update. According to this, we will
only specify the third complexity measure of those algorithms for which this property does
not hold.
Dynamic Data Stream Model
The dynamic data stream model⁶ is an extension of the insertion-only data stream model
which also allows delete operations of points.
In this thesis, our focus is on a special type of this model which is called the dynamic
geometric data stream model. This model was introduced by Indyk [64] and is defined as
follows. The input is a sequence of m update operations on a point set P ⊆ {1, . . . ,∆}^d in
a discrete d-dimensional Euclidean space. At the beginning, the point set P is empty. For
any point p ∈ {1, . . . ,∆}^d, the operation Insert(p) inserts p into P , and, analogously, the
operation Delete(p) deletes p from P . We assume that the update operations occur in
worst-case order with the constraint that the stream is consistent, i.e., no point is removed
that is not present in the current point set, and no point is added twice. Furthermore,
we use n as an upper bound on the size of the current point set P . Obviously, we have
n ∈ O(∆^d) and n ≤ m.
Algorithms that work in the dynamic geometric data stream model are only allowed to
perform one sequential scan over the input stream. The space requirement, the update
time, and the time to extract a solution of the given problem from the maintained summary
are each assumed to be only polylogarithmic in m and ∆ and, therefore, in n since n ≤ m.
⁶ The dynamic data stream model is a special type of the turnstile model (confer [92]).
3 Facility Location in a Distributed Setting
This chapter addresses a randomized constant-factor approximation algorithm for the uni-
form metric facility location problem in a distributed setting. Our algorithm works in the
synchronous message passing model where the underlying network is a clique with each
node being a client as well as a potential location for a facility.
Our algorithm is based on two facts that Bădoiu et al. [14] discovered in case of the
uniform metric facility location problem: (i) Given any point set X from a metric space,
the sum of the radii defined by Mettu and Plaxton [87] is a constant-factor approximation of
the optimal facility location cost for X, and (ii) for any facility xi ∈ X, there exists a lower
bound on the number of points located in the ball whose center is xi and whose radius
equals the radius of xi. Using these two facts, we designed our randomized distributed
algorithm in a way that it determines in three communication rounds, with message sizes
bounded to O(log(|X|)) bits, a subset of the input points as open facilities such that, with
high constant probability, the following condition is satisfied: The total opening cost is at
most a constant factor larger than the sum of the radii and each facility xi ∈ X has an
open facility in a ball whose center is xi and whose radius is at most a constant factor
larger than the radius of xi. Thus, with high constant probability, our algorithm computes
a constant-factor approximation of the uniform facility location problem for X.
Note that, in some settings, the transmission cost between two nodes is not linear in the
distance. In radio networks, for example, it is a typical assumption that the energy required
for transmitting a message via a certain distance is somewhere between the square and the
cube of the distance. Motivated by this fact, we also extended our distributed algorithm
to the uniform facility location problem for constant powers of metric spaces.
The remainder of this chapter is organized as follows. In Section 3.1, we specify the used
synchronous message passing model and generalize the two facts mentioned above to the
uniform facility location problem for powers of metric spaces. Our distributed algorithm
for the uniform metric facility location problem is presented in Section 3.2. The extension
to constant powers of metric spaces can be found in Section 3.3.
3.1 Preliminaries
In this chapter, we consider the uniform facility location problem for metric spaces and
powers of metric spaces in a distributed setting. Given is a uniform opening cost f and
a metric space (X,D), where X = {x1, . . . , xn} is a set of n points and D is a distance
function defined on X. In the uniform facility location problem for powers of metric spaces,
we are additionally given a constant metric exponent ℓ. Recall the definition of the facility
location cost for both considered problems from Section 2.2.
We denote the cost of an optimal solution to the uniform metric facility location problem
by FacLoc*(X, f) and the cost of an optimal solution to the uniform facility location
problem for powers of metric spaces by FacLoc*(X, f, ℓ).
3.1.1 The Distributed Setting
We consider the synchronous message passing model described in Section 2.4.2 where the
communication network is a clique. This means that, in each communication round, each
node can send a message to all other nodes. The size of each message is
bounded by O(log(n)) bits. Furthermore, we assume that every node knows the distance
to all other nodes, and each distance can be represented by O(log(n)) bits. Since we
want to develop an approximation algorithm, we can always achieve this by appropriate
rounding.
Note that although in our setting we allow all-to-all communication, it is not possible to
solve the problem by accumulating all information at one node and then solving the problem
with a classical (centralized) algorithm. The problem is that every node only knows the
distance to its neighbors. Since every node receives O(n log(n)) bits of information in
every communication round, it requires Ω(n) rounds to gather the information about all
pairwise distances at a single node. As shown in [106], we essentially require all this
information because it is not possible to compute a constant-factor approximation to the
facility location problem (with uniform opening costs and demands) without looking at
Ω(n^2) distances.
3.1.2 The Radii
Radius Associated with a Point. We extend the original definition of a radius associated
with a point, given in Section 2.3, to powers of metric spaces. More precisely, for each point
xi ∈ X, we define the value ri to be the radius of the ball with center xi that satisfies
\[ \sum_{x \in X \cap B(x_i, r_i)} \left( r_i^{\ell} - D(x_i, x)^{\ell} \right) = f . \tag{3.1} \]
Observe that there still exists only one solution to the radius ri since the left hand side
of Equation (3.1) is continuous and strictly monotonically increasing with ri. For any
i ∈ {1, . . . , n}, we have (f/n)^{1/ℓ} ≤ ri ≤ f^{1/ℓ}.
In case of uniform opening cost f = 1 and a metric exponent ℓ = 1, Bădoiu et al. [14]
discovered a useful relation between the value weight(B(xi, ri)) and the radius ri. Their
result can be generalized to any uniform opening cost f ≥ 0 and any metric exponent
ℓ ≥ 1. We obtain the following lemma:
Lemma 3.1.1. For each xi ∈ X, we have
\[ \mathrm{weight}(B(x_i, r_i)) \ge \frac{f}{r_i^{\ell}} . \]
Proof. Due to the definition of ri, we have
\[ \sum_{x \in X \cap B(x_i, r_i)} \left( r_i^{\ell} - D(x_i, x)^{\ell} \right) = f , \]
which implies
\[ \sum_{x \in X \cap B(x_i, r_i)} r_i^{\ell} \ge f . \]
Since weight(B(xi, ri)) = |{x ∈ X | x ∈ B(xi, ri)}|, we obtain weight(B(xi, ri)) ≥ f/r_i^ℓ.
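Equation (3.1) has the same structure as Equation (2.1) with unit demands, applied to the ℓ-th powers of the distances, so the radius can again be computed by a sweep. The sketch below does this and can be used to check Lemma 3.1.1 and the bounds (f/n)^{1/ℓ} ≤ ri ≤ f^{1/ℓ} numerically on small examples. The function name is illustrative, and the list of distances is assumed to contain the entry 0 for xi itself.

```python
def power_radius(dists, f, ell):
    """Radius r_i with sum over the ball of (r_i**ell - D(x_i, x)**ell) = f.

    dists -- distances from x_i to all points of X (including 0 for x_i)
    f     -- uniform opening cost
    ell   -- metric exponent, ell >= 1
    """
    powered = sorted(d ** ell for d in dists)
    total = 0.0  # sum of D(x_i, x)**ell over the points assumed inside the ball
    for count, p in enumerate(powered, start=1):
        total += p
        # Candidate value of r_i**ell if exactly `count` points lie in the ball.
        r_pow = (f + total) / count
        nxt = powered[count] if count < len(powered) else float("inf")
        if r_pow <= nxt:
            return r_pow ** (1.0 / ell)
    raise ValueError("dists must not be empty")
```

For distances 0, 1, 3 with f = 2 and ℓ = 2, the sketch yields ri = √1.5, the ball contains two points, and f/r_i^ℓ = 4/3, in line with the lemma.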
Sum of the Radii. Bădoiu et al. [14] proved that the sum of the radii associated with
the points in X is a good approximation of the optimal facility location cost for X. Again,
their result can be generalized to any uniform opening cost f ≥ 0 and any metric exponent
ℓ ≥ 1.
In the proof of the generalized result, we use a modified version of the Mettu-Plaxton
algorithm. More precisely, this version works exactly as Algorithm 2.3.1 except that, in the
first step, it computes, for each point xi ∈ X, the radius ri that satisfies Equation (3.1),
instead of the original radius proposed by Mettu and Plaxton [87]. We will first show that
this modified Mettu-Plaxton algorithm is still a constant-factor approximation. Based on
this result, we will then prove that the sum of the exponentiated radii approximates the
optimal cost FacLoc*(X, f, ℓ) within a constant factor.
Let FMP be the set of open facilities computed by the modified Mettu-Plaxton algorithm.
In the following, we will show that FacLoc(X,FMP, f, ℓ) ≤ 3^ℓ · FacLoc*(X, f, ℓ). The
argumentation is basically the same as in [87]. Only a few minor adaptations to our
scenario have been made.
Claim 3.1.2. For any point xi ∈ X, there exists an open facility xj ∈ FMP such that
rj ≤ ri and D(xi, xj) ≤ 2 · ri.
Proof. If there is no such open facility xj with rj ≤ ri in B(xi, 2 ·ri), then we open a facility
at xi and xi belongs to FMP.
Claim 3.1.3. Let xi and xj be distinct open facilities in FMP. Then, we have D(xi, xj) >
2 ·max{ri, rj}.
Proof. Without loss of generality, we assume that rj ≤ ri. It follows that xj /∈ B(xi, 2 · ri).
Otherwise, the point xi would not be an open facility. Thus, we have
D(xi, xj) > 2 · ri ≥ 2 · rj .
For any point xj ∈ X and an arbitrary set of open facilities F ′ ⊆ X, let
\[ \mathrm{charge}(x_j, F') := D(x_j, F')^{\ell} + \sum_{x_i \in F'} \max\{0, r_i^{\ell} - D(x_i, x_j)^{\ell}\} . \]
Claim 3.1.4. For an arbitrary set of open facilities F ′ ⊆ X, we have
\[ \sum_{x_j \in X} \mathrm{charge}(x_j, F') = \mathrm{FacLoc}(X, F', f, \ell) . \]
Proof. Due to the definition of charge(·, ·) and Equation (3.1), we get
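\begin{align*}
\sum_{x_j \in X} \mathrm{charge}(x_j, F')
&= \sum_{x_j \in X} D(x_j, F')^{\ell} + \sum_{x_j \in X} \sum_{x_i \in F'} \max\{0, r_i^{\ell} - D(x_i, x_j)^{\ell}\} \\
&= \sum_{x_j \in X} D(x_j, F')^{\ell} + \sum_{x_i \in F'} \sum_{x_j \in X \cap B(x_i, r_i)} \left( r_i^{\ell} - D(x_i, x_j)^{\ell} \right) \\
&= \sum_{x_j \in X} D(x_j, F')^{\ell} + \sum_{x_i \in F'} f \\
&= \mathrm{FacLoc}(X, F', f, \ell) .
\end{align*}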
Claim 3.1.5. Let xj ∈ X be any point, let F′ ⊆ X be an arbitrary set of open facilities,
and let xi ∈ F′ be any open facility. If we have D(xj, xi) = D(xj, F′), then charge(xj, F′) ≥
max{r_i^ℓ, D(xj, xi)^ℓ}.
Proof. If xj ∉ B(xi, ri), then
\begin{align*}
\mathrm{charge}(x_j, F') &\ge D(x_j, F')^{\ell} \\
&= D(x_j, x_i)^{\ell} \\
&> r_i^{\ell} .
\end{align*}
Otherwise, we have
\begin{align*}
\mathrm{charge}(x_j, F') &\ge D(x_j, F')^{\ell} + \left( r_i^{\ell} - D(x_j, x_i)^{\ell} \right) \\
&= D(x_j, x_i)^{\ell} + \left( r_i^{\ell} - D(x_j, x_i)^{\ell} \right) \\
&= r_i^{\ell} \\
&\ge D(x_j, x_i)^{\ell} .
\end{align*}
Claim 3.1.6. Let xj ∈ X be any point, and let xi be any open facility in FMP. If xj ∈
B(xi, ri), then charge(xj, FMP) ≤ r_i^ℓ.
Proof. By Claim 3.1.3, there is no open point xm ∈ FMP such that we have i ≠ m and
xj ∈ B(xm, rm). Since D(xj, FMP) ≤ D(xj, xi), we obtain
\begin{align*}
\mathrm{charge}(x_j, F_{\mathrm{MP}}) &= D(x_j, F_{\mathrm{MP}})^{\ell} + \left( r_i^{\ell} - D(x_j, x_i)^{\ell} \right) \\
&\le D(x_j, x_i)^{\ell} + \left( r_i^{\ell} - D(x_j, x_i)^{\ell} \right) \\
&= r_i^{\ell} .
\end{align*}
Claim 3.1.7. Let xj ∈ X be any point, and let xi be any open facility in FMP. If xj ∉ B(xi, ri),
then we have charge(xj, FMP) ≤ D(xj, xi)^ℓ.
Proof. The correctness of the claim follows immediately, unless there is an open facility
xm ∈ FMP such that xj ∈ B(xm, rm). If such an open facility xm exists, then Claims 3.1.3
and 3.1.6 imply D(xi, xm) > 2 · max{ri, rm} and charge(xj, FMP) ≤ r_m^ℓ. Furthermore, by
triangle inequality, we obtain
\begin{align*}
D(x_j, x_i) &\ge D(x_i, x_m) - D(x_j, x_m) \\
&> 2 r_m - r_m \\
&= r_m ,
\end{align*}
which proves charge(xj, FMP) ≤ r_m^ℓ ≤ D(xj, xi)^ℓ.
Claim 3.1.8. For any point xj ∈ X and an arbitrary set of open facilities F′ ⊆ X, we
have charge(xj, FMP) ≤ 3^ℓ · charge(xj, F′).
Proof. Let xi be some open facility in F′ such that we have D(xj, xi) = D(xj, F′). By
Claim 3.1.2, there exists a facility xm ∈ FMP such that we have rm ≤ ri and D(xi, xm) ≤
2 · ri.
If xj ∈ B(xm, rm), then we get charge(xj, FMP) ≤ r_m^ℓ by Claim 3.1.6. Since Claim 3.1.5
implies charge(xj, F′) ≥ r_i^ℓ, we can conclude
\[ \mathrm{charge}(x_j, F_{\mathrm{MP}}) \le r_m^{\ell} \le r_i^{\ell} \le \mathrm{charge}(x_j, F') . \]
This proves the assertion in case that we have xj ∈ B(xm, rm).
If xj ∉ B(xm, rm), then charge(xj, FMP) ≤ D(xj, xm)^ℓ by Claim 3.1.7. Thus, by triangle
inequality, we get
\begin{align*}
\mathrm{charge}(x_j, F_{\mathrm{MP}}) &\le D(x_j, x_m)^{\ell} \\
&\le \left( D(x_j, x_i) + D(x_i, x_m) \right)^{\ell} \\
&\le \left( D(x_j, x_i) + 2 \cdot r_i \right)^{\ell} \\
&\le 3^{\ell} \cdot \max\{D(x_j, x_i)^{\ell}, r_i^{\ell}\} .
\end{align*}
Now, the assertion follows by Claim 3.1.5.
Lemma 3.1.9. FacLoc(X,FMP, f, ℓ) ≤ 3^ℓ · FacLoc*(X, f, ℓ)
Proof. The assertion follows from Claims 3.1.4 and 3.1.8, applied with F′ being an optimal
set of open facilities and summed over all points xj ∈ X.
Based on the results above, we can prove the following lemma:
Lemma 3.1.10.
\[ \frac{1}{2^{\ell+1}} \cdot \mathrm{FacLoc}^*(X, f, \ell) \le \sum_{x_i \in X} r_i^{\ell} \le 6^{\ell} \cdot \mathrm{FacLoc}^*(X, f, \ell) \]
Proof. We first prove the lower bound and then the upper bound. The argumentation is
basically the same as in [14]. Only a few minor adaptations to our scenario have been
made.
Lower bound: Let FMP be the set of open facilities computed by the modified Mettu-
Plaxton algorithm. Then, it follows from Claim 3.1.2 that
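\[ 2^{\ell} \cdot \sum_{x_i \in X} r_i^{\ell} \ge \sum_{x_i \in X} D(x_i, F_{\mathrm{MP}})^{\ell} . \tag{3.2} \]
Next, we show that we also have
\[ 2^{\ell} \cdot \sum_{x_i \in X} r_i^{\ell} \ge f \cdot |F_{\mathrm{MP}}| . \tag{3.3} \]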
Due to Claim 3.1.3, each point xi ∈ X is contained in at most one ball B(xj, rj) for some
open facility xj ∈ FMP. Furthermore, observe that, for any point xm ∈ B(xj, rj), we must
have rj ≤ 2 · rm. Otherwise, we would have
\[ x_m \in B(x_m, 2 \cdot r_m) \subseteq B(x_m, r_j) \subseteq B(x_j, r_j + D(x_j, x_m)) \subseteq B(x_j, 2 \cdot r_j) , \]
and the modified Mettu-Plaxton algorithm would not open a facility at xj, which is a
contradiction. Hence, we obtain
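\begin{align*}
\sum_{x_i \in X} r_i^{\ell}
&\ge \sum_{x_j \in F_{\mathrm{MP}}} \sum_{x_m \in X \cap B(x_j, r_j)} r_m^{\ell} \\
&\ge \sum_{x_j \in F_{\mathrm{MP}}} \sum_{x_m \in X \cap B(x_j, r_j)} \left( \frac{r_j}{2} \right)^{\ell} \\
&= \frac{1}{2^{\ell}} \cdot \sum_{x_j \in F_{\mathrm{MP}}} \sum_{x_m \in X \cap B(x_j, r_j)} r_j^{\ell} \\
&\ge \frac{1}{2^{\ell}} \cdot \sum_{x_j \in F_{\mathrm{MP}}} f \\
&= \frac{1}{2^{\ell}} \cdot f \cdot |F_{\mathrm{MP}}| ,
\end{align*}
which proves Inequality (3.3). Due to Inequalities (3.2) and (3.3), we get
\begin{align*}
2^{\ell+1} \cdot \sum_{x_i \in X} r_i^{\ell}
&\ge f \cdot |F_{\mathrm{MP}}| + \sum_{x_i \in X} D(x_i, F_{\mathrm{MP}})^{\ell} \\
&= \mathrm{FacLoc}(X, F_{\mathrm{MP}}, f, \ell) \\
&\ge \mathrm{FacLoc}^*(X, f, \ell) .
\end{align*}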
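Upper bound: Due to Lemma 3.1.9, we know that
\[ \mathrm{FacLoc}(X, F_{\mathrm{MP}}, f, \ell) \le 3^{\ell} \cdot \mathrm{FacLoc}^*(X, f, \ell) . \]
Thus, to prove the upper bound, it remains to show that
\[ \sum_{x_i \in X} r_i^{\ell} \le 2^{\ell} \cdot \mathrm{FacLoc}(X, F_{\mathrm{MP}}, f, \ell) . \]
Due to Claim 3.1.4, we have
\begin{align*}
2^{\ell} \cdot \mathrm{FacLoc}(X, F_{\mathrm{MP}}, f, \ell)
&= 2^{\ell} \cdot \sum_{x_i \in X} \mathrm{charge}(x_i, F_{\mathrm{MP}}) \\
&\ge 2^{\ell} \cdot \left( \sum_{x_i \in F_{\mathrm{MP}}} r_i^{\ell} + \sum_{x_j \in X \setminus F_{\mathrm{MP}}} \max\{ r_{\delta(j)}^{\ell}, D(x_j, x_{\delta(j)})^{\ell} \} \right) ,
\end{align*}
where δ(j) denotes the index of the facility in FMP that is closest to xj. Thus, if we can
show that
\[ 2^{\ell} \cdot \left( \sum_{x_i \in F_{\mathrm{MP}}} r_i^{\ell} + \sum_{x_j \in X \setminus F_{\mathrm{MP}}} \max\{ r_{\delta(j)}^{\ell}, D(x_j, x_{\delta(j)})^{\ell} \} \right) \ge \sum_{x_i \in X} r_i^{\ell} , \tag{3.4} \]
then we are done. It is sufficient to prove
\[ r_j^{\ell} \le 2^{\ell-1} \cdot \left( D(x_j, x_{\delta(j)})^{\ell} + r_{\delta(j)}^{\ell} \right) \tag{3.5} \]
because this implies max{r_δ(j)^ℓ, D(xj, xδ(j))^ℓ} ≥ r_j^ℓ/2^ℓ and Inequality (3.4) follows. We prove
the correctness of Inequality (3.5) by contradiction. Hence, we assume that
\[ r_j^{\ell} > 2^{\ell-1} \cdot \left( D(x_j, x_{\delta(j)})^{\ell} + r_{\delta(j)}^{\ell} \right) . \]
We can easily prove by induction that 2^{ℓ−1} · (a^ℓ + b^ℓ) ≥ (a + b)^ℓ for any a, b ≥ 0. Thus, we
obtain
\[ r_j^{\ell} > \left( D(x_j, x_{\delta(j)}) + r_{\delta(j)} \right)^{\ell} , \]
which, in turn, would imply B(xδ(j), rδ(j)) ⊆ B(xj, rj). Furthermore, by applying triangle
inequality and 2^{ℓ−1} · (a^ℓ + b^ℓ) ≥ (a + b)^ℓ for any a, b ≥ 0, we get
\begin{align*}
D(x_j, x_m)^{\ell} &\le \left( D(x_j, x_{\delta(j)}) + D(x_{\delta(j)}, x_m) \right)^{\ell} \\
&\le 2^{\ell-1} \cdot \left( D(x_j, x_{\delta(j)})^{\ell} + D(x_{\delta(j)}, x_m)^{\ell} \right)
\end{align*}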
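as upper bound on the exponentiated distance between xj and any point xm ∈ B(xδ(j), rδ(j)).
Now, we obtain
\begin{align*}
\sum_{x_m \in X \cap B(x_j, r_j)} \left( r_j^{\ell} - D(x_j, x_m)^{\ell} \right)
&\ge \sum_{x_m \in X \cap B(x_{\delta(j)}, r_{\delta(j)})} \left( r_j^{\ell} - D(x_j, x_m)^{\ell} \right) \\
&> \sum_{x_m \in X \cap B(x_{\delta(j)}, r_{\delta(j)})} \left[ 2^{\ell-1} \cdot \left( D(x_j, x_{\delta(j)})^{\ell} + r_{\delta(j)}^{\ell} \right) - 2^{\ell-1} \cdot \left( D(x_j, x_{\delta(j)})^{\ell} + D(x_{\delta(j)}, x_m)^{\ell} \right) \right] \\
&= 2^{\ell-1} \cdot \sum_{x_m \in X \cap B(x_{\delta(j)}, r_{\delta(j)})} \left( r_{\delta(j)}^{\ell} - D(x_{\delta(j)}, x_m)^{\ell} \right) \\
&= 2^{\ell-1} \cdot f \\
&\ge f ,
\end{align*}
which is a contradiction because the definition of rj requires
\[ \sum_{x_m \in X \cap B(x_j, r_j)} \left( r_j^{\ell} - D(x_j, x_m)^{\ell} \right) = f . \]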
It follows that Inequality (3.5) is true, which was the only thing left to prove the assertion
of the lemma.
3.2 Distributed Algorithm for Metric Spaces
Our distributed algorithm consists of three parts (see Algorithm 3.2.1 for a description in
pseudocode). Recall that we assume that each point knows its distance to all the other
points. At the beginning of the first part, each point xi ∈ X creates a (⌈log(n)⌉ + 1)-bit
array. These bits are used to decide whether a point should open a facility or not. In the
following, we will call these bits phase bits. The values of these phase bits are chosen at
random so that, for each k ∈ {0, 1, . . . , ⌈log(n)⌉}, the k-th phase bit is 1 with probability
min{2^k/n, 1} and 0 otherwise. Finally, every point sends its ⌈log(n)⌉ + 1 phase bits to all
the other points.
The second part of the algorithm is organized in ⌈log(n)⌉ + 1 phases. During these
phases, each point decides locally, based on the phase bits, if it should open a facility or
connect itself to another open facility. This is accomplished as follows: Consider the k-th
phase of point xi. The algorithm opens a facility at this point if the k-th phase bit is 1 and
the first k − 1 phase bits of all the other points at a distance of at most 2^k · f/n from xi are
0. Otherwise, if there exists a point xj at a distance of at most 2^k · f/n from xi which has
a 1 among the first k − 1 phase bits, the algorithm tentatively connects xi to the point xj.
In the final solution, xi will be connected to the nearest open facility (which might differ
from xj). Note that if neither the k-th phase bit of xi is 1 nor there exists a point xj at
a distance of at most 2^k · f/n from xi which has a 1 among the first k − 1 phase bits, xi
does nothing in phase k. At the end of the last phase, every point knows whether it is an
open facility or not because the ⌈log(n)⌉-th phase bit of every point is 1 with probability
min{2^⌈log(n)⌉/n, 1} = 1. Finally, each point broadcasts whether it is an open facility or not.
In the last part of the algorithm, every point that is not an open facility sends a request
of connection to the nearest open facility.
We will show in the next section that, with high constant probability, the total opening
cost for the facilities is at most a constant factor larger than the sum of the radii, and any
client xi ∈ X has at least one open facility in the ball B(xi, cri), where c is some small
constant. Since the sum of the radii is a constant-factor approximation of the optimal
facility location cost (see Lemma 3.1.10), this implies that, with high constant probability,
our distributed algorithm computes a constant-factor approximation for the uniform metric
facility location problem.
Algorithm 3.2.1 Local Algorithm for Point xi
1: open[i] ← false
2: for k ← 0 to ⌈log(n)⌉ do
3:     ϕi[k] ← 1 with probability min{2^k/n, 1}, and 0 otherwise
4: send ϕi to all xj ∈ X, j ≠ i
5: receive ϕj from all xj ∈ X, j ≠ i
6: for k ← 0 to ⌈log(n)⌉ do
7:     if ϕi[k] = 1 and for each point xj ∈ B(xi, 2^k · f/n),
       xj ≠ xi, we have ϕj[m] = 0 for all m < k then
8:         open[i] ← true
9: send open[i] to all xj ∈ X, j ≠ i
10: receive open[j] from all xj ∈ X, j ≠ i
11: if open[i] = false then
12:     connect to the nearest open facility
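To experiment with Algorithm 3.2.1, the three communication rounds can be simulated centrally. The sketch below is a simulation under simplifying assumptions, not a distributed implementation: all phase bits are held in one list, the metric is given as a distance matrix, balls are closed, and the logarithm is taken to base 2. The function name and the random generator argument are illustrative.

```python
import random


def simulate_phase_algorithm(D, f, rng):
    """Centralized simulation of Algorithm 3.2.1 for all points at once.

    D   -- symmetric distance matrix of the n points
    f   -- uniform opening cost
    rng -- random.Random instance used to draw the phase bits
    Returns (is_open, assignment), where assignment[i] is the facility of x_i.
    """
    n = len(D)
    last = (n - 1).bit_length()  # equals ceil(log2(n)) for n >= 1
    # Round 1: every point draws and broadcasts its phase bits.
    phi = [[1 if rng.random() < min(2 ** k / n, 1.0) else 0
            for k in range(last + 1)] for _ in range(n)]
    # Local decision of every point, based on all received phase bits.
    is_open = [False] * n
    for i in range(n):
        for k in range(last + 1):
            if phi[i][k] != 1:
                continue
            reach = 2 ** k * f / n
            # x_i is blocked if a nearby point has a 1 in some phase m < k.
            blocked = any(j != i and D[i][j] <= reach and any(phi[j][:k])
                          for j in range(n))
            if not blocked:
                is_open[i] = True
    # Rounds 2 and 3: broadcast open[i]; non-facilities connect to the nearest one.
    facilities = [j for j in range(n) if is_open[j]]
    assignment = [i if is_open[i] else min(facilities, key=lambda j: D[i][j])
                  for i in range(n)]
    return is_open, assignment
```

A call such as `simulate_phase_algorithm(D, 1.0, random.Random(0))` returns one random outcome. At least one facility is always opened, because a point whose first 1-bit occurs in the earliest phase cannot be blocked in that phase.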
3.2.1 Analysis of the Algorithm
In this section, we show that our distributed algorithm produces a solution for the uniform
metric facility location problem whose cost is, with high constant probability, at most a
constant factor larger than the optimal cost.
To simplify the analysis, we do not use the exact value ri satisfying Equation (3.1) for
ℓ = 1 but the value r̃i := 2^j · f/n where j is the smallest integer that satisfies the inequality
\[ \sum_{x \in X \cap B(x_i, \tilde{r}_i)} \left( \tilde{r}_i - D(x_i, x) \right) \ge f . \]
First, we give an upper bound on the expected opening cost of any point xi ∈ X.
Lemma 3.2.1. Let xi ∈ X be any point. Then, the expected opening cost of xi is O(ri).
Proof. At first, we estimate the probability that the algorithm opens a facility at xi in
any phase k ∈ {0, 1, . . . , ⌈log(n)⌉}. Recall that this happens if the k-th phase bit of xi is
1 and the first k − 1 phase bits of all the other points at a distance of at most 2^k · f/n
are 0. Let Yi,k be the indicator random variable for the event that the algorithm opens a
facility at xi in phase k. We now consider the two cases k < j and j ≤ k ≤ ⌈log(n)⌉ with
j = log(r̃i · n/f).
Case j ≤ k ≤ ⌈log(n)⌉: The k-th phase bit of xi is 1 with probability min{2^k/n, 1} ≤
2^k/n. Furthermore, for any phase m < k, the m-th phase bit of an arbitrary point in
B(xi, 2^k · f/n) is 0 with probability 1 − 2^m/n. Hence, the probability that all of the first
k − 1 phase bits of this point in B(xi, 2^k · f/n) are 0 is ∏_{m=0}^{k−1} (1 − 2^m/n). Thus, we have
\[ \Pr[Y_{i,k} = 1] \le \frac{2^k}{n} \cdot \left[ \left( 1 - \frac{2^0}{n} \right) \cdot \left( 1 - \frac{2^1}{n} \right) \cdot \ldots \cdot \left( 1 - \frac{2^{k-1}}{n} \right) \right]^{\mathrm{weight}(B(x_i, 2^k \cdot f/n))} . \]
Observe that r˜i ≥ ri. By applying Lemma 3.1.1 with ` = 1, we obtain that
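\begin{align*}
\mathrm{weight}\left( B\left( x_i, 2^k \cdot \frac{f}{n} \right) \right)
&\ge \mathrm{weight}\left( B\left( x_i, 2^j \cdot \frac{f}{n} \right) \right) \\
&= \mathrm{weight}(B(x_i, \tilde{r}_i)) \\
&\ge \mathrm{weight}(B(x_i, r_i)) \\
&\ge \frac{f}{r_i} \\
&\ge \frac{f}{\tilde{r}_i} .
\end{align*}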
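Thus, we get
\[
\begin{aligned}
  \Pr[Y_{i,k} = 1]
  &\le \frac{2^k}{n} \cdot \left[ \left(1 - \frac{2^0}{n}\right) \cdot \left(1 - \frac{2^1}{n}\right) \cdots
       \left(1 - \frac{2^{k-1}}{n}\right) \right]^{f/\tilde{r}_i} \\
  &= \frac{2^k}{n} \cdot \left[ \left(1 - \frac{2^0}{n}\right) \cdot \left(1 - \frac{2^1}{n}\right) \cdots
       \left(1 - \frac{2^{k-1}}{n}\right) \right]^{n/2^j} \\
  &\le \frac{2^k}{n} \cdot \left(1 - \frac{2^{k-1}}{n}\right)^{n/2^j} .
\end{aligned}
\]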
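Now, let $m$ denote the non-negative integer $k - j$. Then, we obtain
\[
\begin{aligned}
  \Pr[Y_{i,k} = 1]
  &\le \frac{2^{j+m}}{n} \cdot \left(1 - \frac{2^{j+m-1}}{n}\right)^{n/2^j} \\
  &= \frac{2^{j+m}}{n} \cdot \left(1 - \frac{2^{j+m-1}}{n}\right)^{\frac{n}{2^{j+m-1}} \cdot 2^{m-1}} \\
  &\le \frac{2^{j+m}}{n} \cdot \left(\frac{1}{e}\right)^{2^{m-1}} ,
\end{aligned}
\]
where the last inequality is due to a bound on Euler's number (see Inequality (B.2)).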
3.2 Distributed Algorithm for Metric Spaces 41
Case $k < j$: An upper bound on the probability that the algorithm opens a facility at $x_i$
in a phase $k < j$ is $2^k/n$. Hence, we have
\[
  \Pr[Y_{i,k} = 1] \le \frac{2^k}{n} .
\]
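Let $Y_i$ be the indicator random variable for the event that the algorithm opens a facility
at $x_i$. Then, the expected opening cost of the point $x_i$ is upper bounded by
\[
\begin{aligned}
  f \cdot E[Y_i]
  &= f \cdot E\left[ \sum_{k=0}^{\lceil \log(n) \rceil} Y_{i,k} \right]
   = f \cdot \sum_{k=0}^{\lceil \log(n) \rceil} E[Y_{i,k}]
   = f \cdot \sum_{k=0}^{\lceil \log(n) \rceil} \Pr[Y_{i,k} = 1] \\
  &\le f \cdot \sum_{k=0}^{j-1} \frac{2^k}{n}
     + f \cdot \sum_{m=0}^{\lceil \log(n) \rceil - j} \frac{2^{j+m}}{n} \cdot \left(\frac{1}{e}\right)^{2^{m-1}} \\
  &= f \cdot \frac{2^j - 1}{n}
     + f \cdot \frac{2^{j+1}}{n} \cdot \sum_{m=0}^{\lceil \log(n) \rceil - j} \left(\frac{1}{e}\right)^{2^{m-1}} \cdot 2^{m-1} \\
  &\le f \cdot \frac{2^j - 1}{n}
     + f \cdot \frac{2^{j+1}}{n} \cdot \sum_{m=0}^{\lceil \log(n) \rceil - j} 2^{-m+1} \\
  &\in O(\tilde{r}_i) ,
\end{aligned}
\]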
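where the last inequality follows from the easily provable fact that
\[
  \left(\frac{1}{e}\right)^{2^{m-1}} \cdot 2^{m-1} \le 2^{-m+1}
\]
for all $m \ge 0$. Finally, since $j$ is the smallest integer satisfying the defining inequality
of $\tilde{r}_i$, we have $\tilde{r}_i/2 < r_i$ and hence $\tilde{r}_i < 2 \cdot r_i$. Thus, the expected opening cost of the
point $x_i$ is $O(r_i)$.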
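The "easily provable fact" used above can be verified with one line of calculus. The following derivation is a sketch added for the reader; the substitution $x := 2^{m-1}$ and the auxiliary function $h$ are notation introduced here and do not appear in the original text.

```latex
% Substitute x := 2^{m-1} > 0, so that 2^{-m+1} = 1/x.
% The claim (1/e)^{x} \cdot x \le 1/x is equivalent to h(x) := x^2 e^{-x} \le 1.
\[
  h'(x) = x\,(2 - x)\,e^{-x}
  \quad\Longrightarrow\quad
  \max_{x > 0} h(x) = h(2) = \frac{4}{e^2} < 1 ,
\]
\[
  \text{hence}\qquad
  \left(\frac{1}{e}\right)^{2^{m-1}} \cdot 2^{m-1}
  = \frac{h(2^{m-1})}{2^{m-1}}
  \le \frac{1}{2^{m-1}}
  = 2^{-m+1}
  \qquad \text{for all } m \ge 0 .
\]
```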
The proof of our upper bound on the final connection cost of any point xi ∈ X utilizes
the following lemma:
Lemma 3.2.2. Let $x_i \in X$ be any point that has been chosen as open facility or that has
tentatively been connected in any phase $k \in \{0, \dots, \lceil \log(n) \rceil\}$. Then, the distance of $x_i$ to
the nearest open facility is at most $2^{k+1} \cdot f/n$.
Proof. Obviously, if we open a facility at $x_i$ in phase $k$, then the distance of $x_i$ to the nearest
open facility is $0 \le 2^{k+1} \cdot f/n$. Next, we consider the case that $x_i$ has been connected
tentatively. Note that, due to our construction, a point cannot be connected tentatively
in phase 0. Thus, in the following, we will assume that $x_i$ has tentatively been connected
to a point $x_j \in X$ in a phase $k \in \{1, \dots, \lceil \log(n) \rceil\}$. It follows that the distance from $x_i$ to
$x_j$ is at most $2^k \cdot f/n$. Let $m$ denote the smallest index of a phase bit of $x_j$ whose value
is 1. Since $x_i$ has tentatively been connected to $x_j$ in phase $k$, we have $m \le k - 1$. Now,
either $x_j$ is open or tentatively connected. If it is open, then the distance from $x_i$ to the
nearest open facility is at most $2^k \cdot f/n$ and we are done. Otherwise, $x_j$ has tentatively been
connected to another point within a distance of at most $2^m \cdot f/n \le 2^{k-1} \cdot f/n$. Recursively
applying this argument (see also Figure 3.1) yields that there must be an open facility
within a distance of at most
\[
  2^k \cdot \frac{f}{n} + \sum_{m=0}^{k-1} 2^m \cdot \frac{f}{n}
  = \left(2^{k+1} - 1\right) \cdot \frac{f}{n}
  \le 2^{k+1} \cdot \frac{f}{n}
\]
from $x_i$.
Figure 3.1: Connecting xi to an open facility over a chain of tentatively connected points.
(Labels in the figure: $2^k \cdot f/n$ and $\le 2^{k+1} \cdot f/n$.)
Lemma 3.2.3. Let xi ∈ X be any point. Then, the expected final connection cost of xi is
O(ri).
Proof. Due to our construction, a point cannot be connected tentatively in phase 0. Thus,
in phase 0, the algorithm either opens a facility at $x_i$ or does nothing with $x_i$. If it opens
a facility at $x_i$, then the connection cost of $x_i$ is obviously 0. Due to Lemma 3.2.2, if
the point $x_i$ has been chosen as an open facility or has tentatively been connected in a
phase $k \in \{1, \dots, \lceil \log(n) \rceil\}$, then its final connection cost is at most $2^{k+1} \cdot f/n$. Now, for
each $k \in \{0, \dots, \lceil \log(n) \rceil\}$, let $Z_{i,k}$ be the indicator random variable for the event that the
algorithm has not opened a facility at $x_i$ and has not tentatively connected $x_i$ up to and
including phase $k$. Then, we can upper bound the expected final connection cost of $x_i$ by
\[
\begin{aligned}
  &\sum_{k=1}^{\lceil \log(n) \rceil} 2^{k+1} \cdot \frac{f}{n} \cdot
     \Pr[x_i \text{ is opened or tentatively connected in phase } k] \\
  &\quad= \sum_{k=1}^{\lceil \log(n) \rceil} 2^{k+1} \cdot \frac{f}{n} \cdot
     \bigl( \Pr[Z_{i,k-1} = 1] - \Pr[Z_{i,k} = 1] \bigr) \\
  &\quad= 2 \cdot \frac{f}{n} \cdot \Pr[Z_{i,0} = 1]
     - 2^{\lceil \log(n) \rceil + 1} \cdot \frac{f}{n} \cdot \Pr\bigl[Z_{i,\lceil \log(n) \rceil} = 1\bigr]
     + \sum_{k=0}^{\lceil \log(n) \rceil - 1} 2^{k+1} \cdot \frac{f}{n} \cdot \Pr[Z_{i,k} = 1] \\
  &\quad= 2 \cdot \frac{f}{n} \cdot \Pr[Z_{i,0} = 1]
     + \sum_{k=0}^{\lceil \log(n) \rceil - 1} 2^{k+1} \cdot \frac{f}{n} \cdot \Pr[Z_{i,k} = 1] ,
\end{aligned}
\]
where the last equality follows from $\Pr[Z_{i,\lceil \log(n) \rceil} = 1] = 0$. Thus, to upper bound the
expected final connection cost of $x_i$, we have to upper bound the probabilities $\Pr[Z_{i,k} = 1]$.
We consider the two cases $k < j$ and $j \le k < \lceil \log(n) \rceil$ with $j = \log(\tilde{r}_i \cdot n/f)$.
Case $j \le k < \lceil \log(n) \rceil$: Observe that $Z_{i,k} = 1$ if the first $k$ phase bits of $x_i$ are 0, and
the first $k-1$ phase bits of all the other points at a distance of at most $2^k \cdot f/n$ are also
0. For any phase $m \le k$, the $m$-th phase bit of $x_i$ is 0 with probability $1 - 2^m/n$. Hence,
the probability that all of the first $k$ phase bits of $x_i$ are 0 is $\prod_{m=0}^{k} (1 - 2^m/n)$. Similarly,
the probability that all of the first $k-1$ phase bits of any point in $B(x_i, 2^k \cdot f/n)$ are 0 is
$\prod_{m=0}^{k-1} (1 - 2^m/n)$. As proven in Lemma 3.2.1, the number of points in $B(x_i, 2^k \cdot f/n)$ is lower
bounded by
\[
  \mathrm{weight}\left(B\left(x_i, 2^k \cdot \frac{f}{n}\right)\right) \ge \frac{f}{\tilde{r}_i} .
\]
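It follows that
\[
\begin{aligned}
  \Pr[Z_{i,k} = 1]
  &\le \left[ \left(1 - \frac{2^0}{n}\right) \cdots \left(1 - \frac{2^k}{n}\right) \right] \cdot
       \left[ \left(1 - \frac{2^0}{n}\right) \cdots \left(1 - \frac{2^{k-1}}{n}\right) \right]^{f/\tilde{r}_i} \\
  &\le \left(1 - \frac{2^k}{n}\right) \cdot \left(1 - \frac{2^{k-1}}{n}\right)^{n/2^j} .
\end{aligned}
\]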
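Let $m$ denote the non-negative integer $k - j$. Then, we have
\[
\begin{aligned}
  \Pr[Z_{i,k} = 1]
  &\le \left(1 - \frac{2^{j+m}}{n}\right) \cdot
       \left(1 - \frac{2^{j+m-1}}{n}\right)^{\frac{n}{2^{j+m-1}} \cdot 2^{m-1}} \\
  &\le \left(1 - \frac{2^{j+m}}{n}\right) \cdot \left(\frac{1}{e}\right)^{2^{m-1}} \\
  &\le \left(\frac{1}{e}\right)^{2^{m-1}} ,
\end{aligned}
\]
where the second inequality is due to a bound on Euler's number (see Inequality (B.2)).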
Case $k < j$: Obviously, an upper bound on the probability that the algorithm neither
opens a facility at $x_i$ nor tentatively connects $x_i$ up to and including phase $k$ is 1. Hence, we
get
\[
  \Pr[Z_{i,k} = 1] \le 1 .
\]
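Based on the above two cases, we can upper bound the expected final connection cost
of $x_i$ by
\[
\begin{aligned}
  &2 \cdot \frac{f}{n} \cdot \Pr[Z_{i,0} = 1]
     + \sum_{k=0}^{\lceil \log(n) \rceil - 1} 2^{k+1} \cdot \frac{f}{n} \cdot \Pr[Z_{i,k} = 1] \\
  &\quad\le 2 \cdot \frac{f}{n} \cdot 1
     + \sum_{k=0}^{j-1} 2^{k+1} \cdot \frac{f}{n} \cdot 1
     + \sum_{m=0}^{\lceil \log(n) \rceil - j - 1} 2^{j+m+1} \cdot \frac{f}{n} \cdot \left(\frac{1}{e}\right)^{2^{m-1}} \\
  &\quad\le 2 \cdot \frac{f}{n} + 2^{j+1} \cdot \frac{f}{n}
     + 2^{j+2} \cdot \frac{f}{n} \cdot \sum_{m=0}^{\lceil \log(n) \rceil - j - 1} 2^{m-1} \cdot \left(\frac{1}{e}\right)^{2^{m-1}} \\
  &\quad\le 2 \cdot \frac{f}{n} + 2^{j+1} \cdot \frac{f}{n}
     + 2^{j+2} \cdot \frac{f}{n} \cdot \sum_{m=0}^{\lceil \log(n) \rceil - j - 1} 2^{-m+1} \\
  &\quad\in O(\tilde{r}_i) ,
\end{aligned}
\]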
where, as in the proof of Lemma 3.2.1, the last inequality follows from the easily provable
fact that
\[
  2^{m-1} \cdot \left(\frac{1}{e}\right)^{2^{m-1}} \le 2^{-m+1}
\]
for all $m \ge 0$. Finally, due to the definition of $\tilde{r}_i$, the expected final connection cost of the
point $x_i$ is $O(r_i)$.
Now, we can prove that our distributed algorithm for the uniform metric facility location
problem produces a solution whose total cost is with high constant probability at most a
constant factor larger than the optimal cost.
Lemma 3.2.4. The facility location cost for X is O(FacLoc*(X, f)) with high constant
probability.
Proof. Due to Lemmas 3.2.1 and 3.2.3, the expected opening cost as well as the expected
final connection cost of any point $x_i \in X$ is $O(r_i)$. Thus, the algorithm computes a
set of open facilities $F$ that leads to an expected total cost of $\sum_{x_i \in X} O(r_i)$. By applying
Lemma 3.1.10 with $\ell = 1$, we have that the expected value of FacLoc(X, F, f) is
O(FacLoc*(X, f)). Now, the assertion of the lemma follows by applying Markov's inequality.
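The final Markov step can be spelled out as follows. This is a sketch for illustration: the constants $\alpha$ (hidden in the $O$-notation of the expected cost) and $c > 1$ (the slack factor) are symbols introduced here and are not named in the original proof.

```latex
% Assume E[FacLoc(X,F,f)] <= \alpha \cdot FacLoc^*(X,f) for a constant \alpha.
% Markov's inequality for the non-negative random variable FacLoc(X,F,f) gives,
% for any constant c > 1:
\[
  \Pr\bigl[\mathrm{FacLoc}(X,F,f) \ge c \cdot \alpha \cdot \mathrm{FacLoc}^*(X,f)\bigr]
  \le \frac{E[\mathrm{FacLoc}(X,F,f)]}{c \cdot \alpha \cdot \mathrm{FacLoc}^*(X,f)}
  \le \frac{1}{c} .
\]
% Hence, with probability at least 1 - 1/c, the cost is at most
% c \cdot \alpha \cdot FacLoc^*(X,f), i.e., a constant factor above the optimum.
```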
We summarize our results in the following theorem:
Theorem 2. Given any n-point metric space (X,D), there is a randomized distributed
algorithm working in the synchronous message passing model that computes with high
constant probability a constant-factor approximation of the uniform metric facility location
problem for X. The algorithm uses three rounds of all-to-all communication where the
message sizes are bounded by O(log(n)) bits.
3.3 Distributed Algorithm for Powers of Metric Spaces
In this section, we extend the distributed algorithm given in Section 3.2 to the uniform
facility location problem for powers of metric spaces. Let $\ell \in \mathbb{R}$ with $\ell \ge 1$ be the
(constant) metric exponent. Then, we only have to make the following three adaptations
to Algorithm 3.2.1:
1. The total number of phases is $\lceil \log(n)/\ell \rceil + 1$.
2. The $k$-th phase bit is set to 1 with probability $\min\{2^{k\ell}/n, 1\}$ and 0 otherwise.
3. In the $k$-th phase, we check the first $k-1$ phase bits of all the points within a distance
of at most $2^k \cdot (f/n)^{1/\ell}$ from $x_i$.
The rest of the local algorithm for $x_i$ remains unchanged. A complete pseudocode listing
of the adapted algorithm is given by Algorithm 3.3.1.
Algorithm 3.3.1 Local Algorithm for Point xi
1: open[i] ← false
2: for k ← 0 to ⌈log(n)/ℓ⌉ do
3:   ϕi[k] ← 1 with probability min{2^(kℓ)/n, 1}, and 0 otherwise
4: send ϕi to all xj ∈ X, j ≠ i
5: receive ϕj from all xj ∈ X, j ≠ i
6: for k ← 0 to ⌈log(n)/ℓ⌉ do
7:   if ϕi[k] = 1 and for each point xj ∈ B(xi, 2^k · (f/n)^(1/ℓ)),
       xj ≠ xi, we have ϕj[m] = 0 for all m < k then
8:     open[i] ← true
9: send open[i] to all xj ∈ X, j ≠ i
10: receive open[j] from all xj ∈ X, j ≠ i
11: if open[i] = false then
12:   connect to the nearest open facility
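The local algorithm above can be simulated centrally to get a feeling for its behaviour. The following is a minimal sketch, not the author's implementation: the function name `simulate_local_algorithm`, the distance-matrix input, the use of base-2 logarithms, and the treatment of the three communication rounds as plain shared lists are all assumptions made for illustration.

```python
import math
import random


def simulate_local_algorithm(dist, f, ell=1.0, rng=None):
    """Centralized simulation sketch of the local algorithm for all points.

    dist is a symmetric n x n distance matrix with zero diagonal, f the
    uniform opening cost, ell >= 1 the metric exponent (ell = 1 corresponds
    to the plain metric case). Returns (open facility indices, total cost).
    """
    rng = rng or random.Random(0)
    n = len(dist)
    phases = math.ceil(math.log2(n) / ell) + 1  # phases 0 .. ceil(log(n)/ell)

    # Round 1: every point draws its phase bits and "sends" them to all others.
    phi = [[1 if rng.random() < min(2.0 ** (k * ell) / n, 1.0) else 0
            for k in range(phases)] for _ in range(n)]
    # Index of the first 1-bit of each point (phases if there is none).
    first_one = [next((k for k, b in enumerate(bits) if b), phases)
                 for bits in phi]

    # Round 2: x_i opens if, in some phase k, its k-th bit is 1 and every
    # other point within distance 2^k * (f/n)^(1/ell) has only 0-bits below k.
    scale = (f / n) ** (1.0 / ell)
    facilities = []
    for i in range(n):
        for k in range(phases):
            if phi[i][k] != 1:
                continue
            radius = 2.0 ** k * scale
            if all(first_one[j] >= k for j in range(n)
                   if j != i and dist[i][j] <= radius):
                facilities.append(i)
                break

    # Round 3: every point connects to the nearest open facility; the
    # connection cost uses the ell-th power of the distance.
    cost = f * len(facilities) + sum(
        min(dist[i][j] for j in facilities) ** ell for i in range(n))
    return facilities, cost
```

In this sketch, the point whose first 1-bit has the globally smallest index always opens, so the set of facilities is non-empty whenever the last phase bit is set with probability 1.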
3.3.1 Analysis of the Algorithm
Let $\tilde{r}_i := 2^j \cdot (f/n)^{1/\ell}$, where $j$ is the smallest integer that satisfies the inequality
\[
  \sum_{x \in X \cap B(x_i, \tilde{r}_i)} \bigl( \tilde{r}_i^{\,\ell} - D(x_i, x)^{\ell} \bigr) \ge f ,
\]
be an approximation of the radius $r_i$ defined by Equation (3.1). Then, using this
approximation $\tilde{r}_i$ and bearing the three adaptations mentioned above in mind, the analysis of
our distributed algorithm given in Section 3.2.1 can be easily transferred to the uniform
facility location problem for powers of metric spaces. We obtain the following lemmas:
Lemma 3.3.1. Let $x_i \in X$ be any point. Then, the expected opening cost of $x_i$ is $O(4^\ell \cdot r_i^\ell)$.
Proof. We prove this lemma by using Lemma 3.1.10 and reusing the techniques given in
the proof of Lemma 3.2.1.
First, we compute an upper bound on the probability that the algorithm opens a facility
at $x_i$ in any phase $k \in \{0, 1, \dots, \lceil \log(n)/\ell \rceil\}$. Recall that this happens if the $k$-th phase bit
of $x_i$ is 1 and the first $k-1$ bits of all the other points at a distance of at most $2^k \cdot (f/n)^{1/\ell}$
are 0. Let $Y_{i,k}$ be the indicator random variable for the event that the algorithm opens a
facility at $x_i$ in phase $k$. We examine the two cases $k < j$ and $j \le k \le \lceil \log(n)/\ell \rceil$ with
$j = \log(\tilde{r}_i \cdot (n/f)^{1/\ell})$.
Case $j \le k \le \lceil \log(n)/\ell \rceil$: The $k$-th phase bit of $x_i$ is 1 with probability $\min\{2^{k\ell}/n, 1\} \le 2^{k\ell}/n$.
Furthermore, for any phase $m < k$, the $m$-th phase bit of an arbitrary point located
in $B(x_i, 2^k \cdot (f/n)^{1/\ell})$ is 0 with probability $1 - 2^{m\ell}/n$. Thus, the probability that all of the
first $k-1$ phase bits of this point are 0 is $\prod_{m=0}^{k-1} (1 - 2^{m\ell}/n)$. Hence, we get
\[
  \Pr[Y_{i,k} = 1] \le \frac{2^{k\ell}}{n} \cdot
  \left[ \left(1 - \frac{2^{0 \cdot \ell}}{n}\right) \cdot \left(1 - \frac{2^{1 \cdot \ell}}{n}\right) \cdots
         \left(1 - \frac{2^{(k-1)\ell}}{n}\right) \right]^{\mathrm{weight}(B(x_i,\, 2^k \cdot (f/n)^{1/\ell}))} .
\]
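Due to Lemma 3.1.1 and $\tilde{r}_i \ge r_i$, we have
\[
\begin{aligned}
  \mathrm{weight}\left(B\left(x_i, 2^k \cdot \left(\frac{f}{n}\right)^{1/\ell}\right)\right)
  &\ge \mathrm{weight}\left(B\left(x_i, 2^j \cdot \left(\frac{f}{n}\right)^{1/\ell}\right)\right) \\
  &\ge \mathrm{weight}(B(x_i, \tilde{r}_i)) \\
  &\ge \mathrm{weight}(B(x_i, r_i)) \\
  &\ge \frac{f}{r_i^\ell} \ge \frac{f}{\tilde{r}_i^{\,\ell}} .
\end{aligned}
\]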
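It follows that
\[
\begin{aligned}
  \Pr[Y_{i,k} = 1]
  &\le \frac{2^{k\ell}}{n} \cdot \left[ \left(1 - \frac{2^{0 \cdot \ell}}{n}\right) \cdot \left(1 - \frac{2^{1 \cdot \ell}}{n}\right) \cdots
       \left(1 - \frac{2^{(k-1)\ell}}{n}\right) \right]^{f/\tilde{r}_i^{\,\ell}} \\
  &= \frac{2^{k\ell}}{n} \cdot \left[ \left(1 - \frac{2^{0 \cdot \ell}}{n}\right) \cdot \left(1 - \frac{2^{1 \cdot \ell}}{n}\right) \cdots
       \left(1 - \frac{2^{(k-1)\ell}}{n}\right) \right]^{n/2^{j\ell}} \\
  &\le \frac{2^{k\ell}}{n} \cdot \left(1 - \frac{2^{(k-1)\ell}}{n}\right)^{n/2^{j\ell}} .
\end{aligned}
\]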
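Now, let $m$ be the non-negative integer $k - j$. Then, we obtain
\[
\begin{aligned}
  \Pr[Y_{i,k} = 1]
  &\le \frac{2^{(j+m)\ell}}{n} \cdot \left(1 - \frac{2^{(j+m-1)\ell}}{n}\right)^{n/2^{j\ell}} \\
  &= \frac{2^{(j+m)\ell}}{n} \cdot \left(1 - \frac{2^{(j+m-1)\ell}}{n}\right)^{\frac{n}{2^{(j+m-1)\ell}} \cdot 2^{(m-1)\ell}} \\
  &\le \frac{2^{(j+m)\ell}}{n} \cdot \left(\frac{1}{e}\right)^{2^{(m-1)\ell}} ,
\end{aligned}
\]
where the last inequality is due to a bound on Euler's number (see Inequality (B.2)).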
Case $k < j$: Obviously, an upper bound on the probability that the algorithm opens a
facility at $x_i$ in a phase $k < j$ is $2^{k\ell}/n$. It follows that
\[
  \Pr[Y_{i,k} = 1] \le \frac{2^{k\ell}}{n} .
\]
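Let $Y_i$ be the indicator random variable for the event that the algorithm opens a facility
at $x_i$. Then, the expected opening cost of the point $x_i$ is upper bounded by
\[
  f \cdot E[Y_i]
  = f \cdot E\left[ \sum_{k=0}^{\lceil \log(n)/\ell \rceil} Y_{i,k} \right]
  = f \cdot \sum_{k=0}^{\lceil \log(n)/\ell \rceil} E[Y_{i,k}]
  = f \cdot \sum_{k=0}^{\lceil \log(n)/\ell \rceil} \Pr[Y_{i,k} = 1] .
\]
Based on the above two cases, we obtain
\[
\begin{aligned}
  f \cdot E[Y_i]
  &\le f \cdot \sum_{k=0}^{j-1} \frac{2^{k\ell}}{n}
     + f \cdot \sum_{m=0}^{\lceil \log(n)/\ell \rceil - j} \frac{2^{(j+m)\ell}}{n} \cdot \left(\frac{1}{e}\right)^{2^{(m-1)\ell}} \\
  &= \frac{f}{n} \cdot \frac{2^{j\ell} - 1}{2^\ell - 1}
     + \frac{f}{n} \cdot 2^{(j+1)\ell} \cdot \sum_{m=0}^{\lceil \log(n)/\ell \rceil - j}
       \left(\frac{1}{e}\right)^{2^{(m-1)\ell}} \cdot 2^{(m-1)\ell} \\
  &\le \frac{f}{n} \cdot 2^{j\ell}
     + \frac{f}{n} \cdot 2^{(j+1)\ell} \cdot \sum_{m=0}^{\lceil \log(n)/\ell \rceil - j} 2^{-m+1} \\
  &\in O(2^\ell \cdot \tilde{r}_i^{\,\ell}) ,
\end{aligned}
\]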
where the last inequality follows from the easily provable fact that
\[
  \left(\frac{1}{e}\right)^{2^{(m-1)\ell}} \cdot 2^{(m-1)\ell} \le 2^{-m+1}
\]
for all $m \ge 0$ and any $\ell \ge 1$. Finally, due to the definition of $\tilde{r}_i$, the expected opening cost
of the point $x_i$ is $O(4^\ell \cdot r_i^\ell)$.
The proof of our upper bound on the final connection cost of any point xi ∈ X utilizes
the following lemma:
Lemma 3.3.2. Let $x_i \in X$ be any point that has been chosen as open facility or that has
tentatively been connected in any phase $k \in \{0, \dots, \lceil \log(n)/\ell \rceil\}$. Then, the distance of $x_i$
to the nearest open facility is at most $2^{k+1} \cdot (f/n)^{1/\ell}$.
Proof. To prove this lemma, we use the same approach as in the proof of Lemma 3.2.2.
If the algorithm opens a facility at $x_i$ in phase $k$, the distance of $x_i$ to the
nearest open facility is $0 \le 2^{k+1} \cdot (f/n)^{1/\ell}$. Next, we consider the case that $x_i$ has been
connected tentatively. The algorithm does not tentatively connect any point in phase 0.
Hence, in the following, we will assume that $x_i$ has tentatively been connected to a point
$x_j \in X$ in a phase $k \in \{1, \dots, \lceil \log(n)/\ell \rceil\}$. Then, the distance from $x_i$ to $x_j$ is at most
$2^k \cdot (f/n)^{1/\ell}$. Let $m$ denote the smallest index of a phase bit of $x_j$ whose value is 1. Since
$x_i$ has tentatively been connected to $x_j$ in phase $k$, we have $m \le k - 1$. Now, we have to
consider the two cases that either $x_j$ is open or $x_j$ has been connected tentatively as well.
Obviously, if $x_j$ is an open facility, then the distance from $x_i$ to the nearest open facility
is at most $2^k \cdot (f/n)^{1/\ell}$, so we are done. Otherwise, $x_j$ has tentatively been connected to
another point within a distance of at most $2^m \cdot (f/n)^{1/\ell} \le 2^{k-1} \cdot (f/n)^{1/\ell}$. By recursively
applying this argument, we obtain that there must be an open facility within a distance of
at most
\[
  2^k \cdot \left(\frac{f}{n}\right)^{1/\ell} + \sum_{m=0}^{k-1} 2^m \cdot \left(\frac{f}{n}\right)^{1/\ell}
  \le 2^{k+1} \cdot \left(\frac{f}{n}\right)^{1/\ell}
\]
from $x_i$.
Lemma 3.3.3. Let $x_i \in X$ be any point. Then, the expected final connection cost of $x_i$ is
$O(16^\ell \cdot r_i^\ell)$.
Proof. We prove this lemma by using Lemma 3.3.2 and reusing the techniques given in the
proof of Lemma 3.2.3.
The algorithm does not tentatively connect any point in phase 0. Hence, in phase 0, it
either opens a facility at $x_i$ or does nothing with $x_i$, which obviously results in 0 connection
cost for $x_i$ in phase 0. Due to Lemma 3.3.2, if the point $x_i$ has been chosen as an open
facility or has tentatively been connected in any other phase $k \in \{1, \dots, \lceil \log(n)/\ell \rceil\}$, then
its final connection cost is at most $2^{(k+1)\ell} \cdot f/n$. Now, for each $k \in \{0, \dots, \lceil \log(n)/\ell \rceil\}$, let
$Z_{i,k}$ be the indicator random variable for the event that the algorithm has not opened a
facility at $x_i$ and has not tentatively connected $x_i$ up to and including phase $k$. Then, we
can upper bound the expected final connection cost of $x_i$ by
\[
\begin{aligned}
  &\sum_{k=1}^{\lceil \log(n)/\ell \rceil} 2^{(k+1)\ell} \cdot \frac{f}{n} \cdot
     \Pr[x_i \text{ is opened or tentatively connected in phase } k] \\
  &\quad= \sum_{k=1}^{\lceil \log(n)/\ell \rceil} 2^{(k+1)\ell} \cdot \frac{f}{n} \cdot
     \bigl( \Pr[Z_{i,k-1} = 1] - \Pr[Z_{i,k} = 1] \bigr) \\
  &\quad\le \sum_{k=0}^{\lceil \log(n)/\ell \rceil - 1} 2^{(k+2)\ell} \cdot \frac{f}{n} \cdot \Pr[Z_{i,k} = 1] .
\end{aligned}
\]
In order to upper bound the expected final connection cost of $x_i$, we upper bound the
probabilities $\Pr[Z_{i,k} = 1]$. To this end, we examine the two cases $k < j$ and
$j \le k < \lceil \log(n)/\ell \rceil$ with $j = \log(\tilde{r}_i \cdot (n/f)^{1/\ell})$.
Case $j \le k < \lceil \log(n)/\ell \rceil$: Observe that we have $Z_{i,k} = 1$ only in the case that the first
$k$ phase bits of $x_i$ are 0 and the first $k-1$ phase bits of all the other points at a distance
of at most $2^k \cdot (f/n)^{1/\ell}$ are 0 as well. For any phase $m \le k$, the $m$-th phase bit of $x_i$ is
0 with probability $1 - 2^{m\ell}/n$. Thus, the probability that all of the first $k$ phase bits of $x_i$
are 0 is $\prod_{m=0}^{k} (1 - 2^{m\ell}/n)$. Similarly, the probability that all of the first $k-1$ phase bits of
any point in $B(x_i, 2^k \cdot (f/n)^{1/\ell})$ are 0 is $\prod_{m=0}^{k-1} (1 - 2^{m\ell}/n)$. As proven in Lemma 3.3.1, the
number of points in $B(x_i, 2^k \cdot (f/n)^{1/\ell})$ is lower bounded by
\[
  \mathrm{weight}\left(B\left(x_i, 2^k \cdot \left(\frac{f}{n}\right)^{1/\ell}\right)\right) \ge \frac{f}{\tilde{r}_i^{\,\ell}} .
\]
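Hence, we have
\[
\begin{aligned}
  \Pr[Z_{i,k} = 1]
  &\le \left[ \left(1 - \frac{2^{0 \cdot \ell}}{n}\right) \cdots \left(1 - \frac{2^{k\ell}}{n}\right) \right] \cdot
       \left[ \left(1 - \frac{2^{0 \cdot \ell}}{n}\right) \cdots \left(1 - \frac{2^{(k-1)\ell}}{n}\right) \right]^{f/\tilde{r}_i^{\,\ell}} \\
  &\le \left(1 - \frac{2^{k\ell}}{n}\right) \cdot \left(1 - \frac{2^{(k-1)\ell}}{n}\right)^{n/2^{j\ell}} .
\end{aligned}
\]
Let $m$ denote the non-negative integer $k - j$. Then, we get
\[
\begin{aligned}
  \Pr[Z_{i,k} = 1]
  &\le \left(1 - \frac{2^{(j+m)\ell}}{n}\right) \cdot
       \left(1 - \frac{2^{(j+m-1)\ell}}{n}\right)^{\frac{n}{2^{(j+m-1)\ell}} \cdot 2^{(m-1)\ell}} \\
  &\le \left(1 - \frac{2^{(j+m)\ell}}{n}\right) \cdot \left(\frac{1}{e}\right)^{2^{(m-1)\ell}} \\
  &\le \left(\frac{1}{e}\right)^{2^{(m-1)\ell}} ,
\end{aligned}
\]
where the second inequality is due to a bound on Euler's number (see Inequality (B.2)).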
Case $k < j$: An obvious upper bound on the probability that the algorithm neither
opens a facility at $x_i$ nor tentatively connects $x_i$ up to and including phase $k$ is 1, so we have
\[
  \Pr[Z_{i,k} = 1] \le 1 .
\]
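Now, we can upper bound the expected final connection cost of $x_i$ by
\[
\begin{aligned}
  &\sum_{k=0}^{\lceil \log(n)/\ell \rceil - 1} 2^{(k+2)\ell} \cdot \frac{f}{n} \cdot \Pr[Z_{i,k} = 1] \\
  &\quad\le \sum_{k=0}^{j-1} 2^{(k+2)\ell} \cdot \frac{f}{n} \cdot 1
     + \sum_{m=0}^{\lceil \log(n)/\ell \rceil - j - 1} 2^{(j+m+2)\ell} \cdot \frac{f}{n} \cdot \left(\frac{1}{e}\right)^{2^{(m-1)\ell}} \\
  &\quad= \frac{2^{j\ell} - 1}{2^\ell - 1} \cdot 2^{2\ell} \cdot \frac{f}{n}
     + 2^{(j+3)\ell} \cdot \frac{f}{n} \cdot \sum_{m=0}^{\lceil \log(n)/\ell \rceil - j - 1}
       2^{(m-1)\ell} \cdot \left(\frac{1}{e}\right)^{2^{(m-1)\ell}} \\
  &\quad\le 2^{(j+2)\ell} \cdot \frac{f}{n}
     + 2^{(j+3)\ell} \cdot \frac{f}{n} \cdot \sum_{m=0}^{\lceil \log(n)/\ell \rceil - j - 1} 2^{-m+1} \\
  &\quad\in O(8^\ell \cdot \tilde{r}_i^{\,\ell}) ,
\end{aligned}
\]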
where, as in the proof of Lemma 3.3.1, the last inequality follows from the easily provable
fact that
\[
  2^{(m-1)\ell} \cdot \left(\frac{1}{e}\right)^{2^{(m-1)\ell}} \le 2^{-m+1}
\]
for all $m \ge 0$ and any $\ell \ge 1$. Finally, due to the definition of $\tilde{r}_i$, the expected final
connection cost of $x_i$ is $O(16^\ell \cdot r_i^\ell)$.
Lemma 3.3.4. The facility location cost for $X$ is $O(\mathrm{FacLoc}^*(X, f, \ell))$ with high constant
probability.
Proof. It follows from Lemmas 3.3.1 and 3.3.3 and $\ell$ being a constant metric exponent
that the expected opening cost as well as the expected final connection cost of any point
$x_i \in X$ is $O(r_i^\ell)$. Hence, the algorithm computes a set of open facilities $F$ that leads to
an expected total cost of $\sum_{x_i \in X} O(r_i^\ell)$. Due to Lemma 3.1.10, we obtain that the expected
value of $\mathrm{FacLoc}(X, F, f, \ell)$ is $O(\mathrm{FacLoc}^*(X, f, \ell))$. Finally, the assertion of the lemma
follows by applying Markov's inequality.
We summarize our results in the following theorem:
Theorem 3. Given any n-point metric space (X,D) and a constant metric exponent $\ell \ge 1$,
there is a randomized distributed algorithm working in the synchronous message passing
model that computes for X with high constant probability a constant-factor approximation
of the uniform facility location problem for powers of metric spaces. The algorithm uses
three rounds of all-to-all communication where the message sizes are bounded by O(log(n))
bits.
4 A Kinetic Data Structure for Facility Location
In this chapter, we investigate a facility location problem under motion. The input is a
set of continuously moving objects. Each object moves along a known trajectory and can
change its status between open facility and client at any time. The goal is to maintain a
subset of the given objects as open facilities such that, at any time, the current facility
location cost induced by the chosen open facilities is as close to the current optimal cost
as possible, and also some side condition is satisfied. Observe that minimizing the mobile
facility location cost at any time, without considering any side condition, can result in
many status changes of the objects. Depending on the tasks of an open facility, such a
status change can be expensive. Hence, the side condition we consider is to change the
status of an object rather seldom so that the total number of status changes is below some
appropriate threshold.
Since the kinetic data structure (KDS) framework is well-suited to maintain a combina-
torial structure of continuously moving objects and common in the field of computational
geometry [2, 15, 54], we developed a KDS for the facility location problem described above.
Our KDS applies a counting argument of Bădoiu et al. [14] to kinetize a modified version of
the Mettu-Plaxton algorithm. The counting argument asserts that the radius of a facility
can be approximated well by just counting the number of points in exponentially growing
balls centered at this particular facility.
Note that we cannot apply the original Mettu-Plaxton algorithm to obtain a respon-
sive KDS, i.e., a KDS with polylogarithmic update time. The reason is that similar to
maintaining an exact solution for the mobile facility location problem, maintaining the
solution provided by Algorithm 2.3.1 is not stable. That means, a slight perturbation of
the input might result in a number of status changes that is linear in the number of input
points, whereas we are looking for stable solutions, where only a polylogarithmic number
of changes occur upon an event.
In Section 4.1, we present the essential ideas and some notations used throughout this
chapter. A detailed description of the KDS can be found in Section 4.2. We analyze our
KDS in Section 4.3. First, we prove that, at any time, it is guaranteed that our current
set of open facilities leads to a total cost which is at most a constant factor larger than the
current optimal cost. Afterwards, we analyze our KDS in terms of its complexity.
4.1 The Special Radii
The input of the considered mobile facility location problem is a set $P = \{p_1, p_2, \dots, p_n\}$
of $n$ independently moving points in $\mathbb{R}^d$, where $d$ is a constant. For any point $p_i \in P$,
we denote its opening cost by $f_i$ and its demand by $d_i$. Furthermore, let $p_i(t)$ denote
the position of $p_i$ at the point of time $t$, and let $P(t) := \{p_1(t), p_2(t), \dots, p_n(t)\}$. Then,
the mobile facility location problem is to maintain, at each point of time $t$, a set of open
facilities $F(t)$ such that $\mathrm{FacLoc}(P(t), F(t))$ is minimized (see Section 2.2 for a definition of
$\mathrm{FacLoc}(P(t), F(t))$). We let $F^*(t)$ denote an optimal set of open facilities at the point of
time $t$.
To approach the mobile facility location problem, we kinetize a modified version of the
Mettu-Plaxton algorithm. One essential modification affects the radius associated with a
point. According to Equation (2.1), we let $r_i(t)$ be the radius of a point $p_i \in P$ at the point
of time $t$. More precisely, $r_i(t)$ is the radius of the ball with center $p_i(t)$ that satisfies
\[
  \sum_{p_j(t) \in P(t) \cap B(p_i(t), r_i(t))} d_j \cdot \bigl( r_i(t) - D(p_i(t), p_j(t)) \bigr) = f_i . \tag{4.1}
\]
Let $r_{\min}$ denote the lower limit of the range of $r_i(t)$, and let $r_{\max}$ denote the upper limit.
Then, as observed in Section 2.3, we have
\[
  r_{\min} = \frac{\min_{p_j \in P} f_j}{n \cdot \max_{p_j \in P} d_j}
  \qquad\text{and}\qquad
  r_{\max} = \frac{\max_{p_j \in P} f_j}{\min_{p_j \in P} d_j} . \tag{4.2}
\]
Based on this definition of a radius, we introduce a new radius associated with a point.
This new radius is much easier to maintain than the original radius when the points move.
Compared to the original radii, the new radii of the points depend on cubes instead of
balls. The key idea of our KDS is to use a set of nested cubes around each point and to
update the KDS each time a point enters or leaves a cube of another point.
4.1.1 Definition of the Special Radii
Cubes. Similar to the definition of balls, for a point $p_i(t) \in P(t)$ and a non-negative
value $r$, we define $C(p_i(t), r)$ to be the axis-parallel cube whose center is the point $p_i(t)$ and
whose side length is $2r$. Given such a cube $C(p_i(t), r)$, we let $\mathrm{weight}(C(p_i(t), r))$ denote the
sum of the demands of all the points in $P(t)$ that are located in the cube $C(p_i(t), r)$, i.e.,
we define
\[
  \mathrm{weight}(C(p_i(t), r)) := \sum_{p_j(t) \in P(t) \cap C(p_i(t), r)} d_j .
\]
Note that the cube $C(p_i(t), r)$ is a ball with radius $r$ with respect to the $L_\infty$-metric.
Accordingly, and for the sake of simplicity, we will refer to the value $r$ of a cube $C(p_i(t), r)$
as the radius of the cube, i.e., twice the radius of a cube is equal to its side length.
Special Radius Associated with a Point. Our KDS maintains for each point $p_i \in P$ an
approximation of $r_i(t)$, called the special radius $\tilde{r}_i(t)$, which is defined as follows:
Definition 4.1.1 (Special Radius). At any point of time $t$, the special radius $\tilde{r}_i(t)$ of any
point $p_i \in P$ is the value $2^{\tilde{k}}$ such that $\tilde{k} = k_0 + \lceil \log(4\sqrt{d}) \rceil$ and $k_0$ is the minimum integer
$k$ with $\lceil \log(r_{\min}) \rceil \le k \le \lceil \log(r_{\max}) \rceil$ for which $\mathrm{weight}(C(p_i(t), 2^{k})) \ge f_i \cdot 2^{-k}$ holds.
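Definition 4.1.1 can be turned into a direct brute-force computation, which may help to see how the exponent $k_0$ and the offset $\lceil \log(4\sqrt{d}) \rceil$ interact. The sketch below is for illustration only and is not the kinetic data structure of this chapter: the function name `special_radius`, the list-based input, closed cubes, and base-2 logarithms are assumptions.

```python
import math


def special_radius(points, demands, costs, i):
    """Brute-force special radius of point i at a fixed time (a sketch).

    points: list of d-dimensional coordinate tuples, demands: d_j > 0,
    costs: opening costs f_j > 0. Logarithms are taken to base 2.
    """
    n, d = len(points), len(points[i])
    r_min = min(costs) / (n * max(demands))      # Equation (4.2)
    r_max = max(costs) / min(demands)
    k_lo = math.ceil(math.log2(r_min))
    k_hi = math.ceil(math.log2(r_max))

    def cube_weight(r):
        # Total demand inside the axis-parallel cube C(p_i, r), which is
        # the L_infinity ball of radius r around p_i.
        return sum(dj for pj, dj in zip(points, demands)
                   if max(abs(a - b) for a, b in zip(points[i], pj)) <= r)

    for k in range(k_lo, k_hi + 1):
        if cube_weight(2.0 ** k) >= costs[i] * 2.0 ** (-k):
            # k is k_0; shift the exponent by ceil(log(4 * sqrt(d))).
            return 2.0 ** (k + math.ceil(math.log2(4.0 * math.sqrt(d))))
    raise ValueError("no admissible exponent found")
```

For the planar points (0,0), (1,0), (0,1), (5,5) with unit demands and opening cost 4 each, the exact radius of the first point is $r_1 = 2$ (from $3r - 2 = 4$), and the sketch yields $k_0 = 1$ and $\tilde{r}_1 = 2^{1+3} = 16$.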
In the following, we will prove the existence of the special radius r˜i(t) of any point
pi(t) ∈ P (t) at any point of time t. Moreover, we will show that the special radius r˜i(t) is
a constant-factor approximation of the value ri(t).
The proof of the existence of the special radius is based on a result obtained in [14].
More precisely, for the uniform metric facility location problem, the authors of [14] gave
lower and upper bounds on the value $r_i(t)$ (see also Lemma 3.1.1 with $\ell = 1$). We
generalize their result to the non-uniform case:
Lemma 4.1.2. At any point of time $t$ and for each $p_i \in P$, we have
\[
  \frac{f_i}{\mathrm{weight}(B(p_i(t), r_i(t)))} \le r_i(t) \le \frac{2 \cdot f_i}{\mathrm{weight}(B(p_i(t), r_i(t)/2))} .
\]
Proof. It follows from the definition of $r_i(t)$ given in Equation (4.1) that
\[
  \sum_{p_j(t) \in P(t) \cap B(p_i(t), r_i(t))} d_j \cdot r_i(t) \ge f_i ,
\]
so we have
\[
  r_i(t) \ge \frac{f_i}{\sum_{p_j(t) \in P(t) \cap B(p_i(t), r_i(t))} d_j}
  = \frac{f_i}{\mathrm{weight}(B(p_i(t), r_i(t)))} .
\]
This proves the first inequality of the lemma.
Furthermore, we get
\[
\begin{aligned}
  f_i &= \sum_{p_j(t) \in P(t) \cap B(p_i(t), r_i(t))} d_j \cdot \bigl( r_i(t) - D(p_i(t), p_j(t)) \bigr) \\
      &\ge \sum_{p_j(t) \in P(t) \cap B(p_i(t), r_i(t)/2)} d_j \cdot \bigl( r_i(t) - D(p_i(t), p_j(t)) \bigr) \\
      &\ge \frac{r_i(t)}{2} \cdot \sum_{p_j(t) \in P(t) \cap B(p_i(t), r_i(t)/2)} d_j \\
      &= \frac{r_i(t)}{2} \cdot \mathrm{weight}(B(p_i(t), r_i(t)/2)) ,
\end{aligned}
\]
where the first inequality follows from $B(p_i(t), r_i(t)/2) \subseteq B(p_i(t), r_i(t))$ and the second
inequality follows from the fact that $r_i(t) - D(p_i(t), p_j(t)) \ge r_i(t)/2$ for
all $p_j(t) \in P(t) \cap B(p_i(t), r_i(t)/2)$. This proves the
second inequality of the lemma.
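Since the left-hand side of Equation (4.1) is piecewise linear and increasing in the radius, $r_i(t)$ can be computed exactly by scanning the points in order of their distance to $p_i(t)$. The following sketch does this at a fixed time and allows the two bounds of Lemma 4.1.2 to be checked numerically; the helper names `exact_radius` and `ball_weight` and the list-based input are assumptions made for illustration.

```python
import math


def exact_radius(dists, demands, f_i):
    """Solve sum_{j: D_j <= r} d_j * (r - D_j) = f_i for r (a sketch).

    dists[j] is D(p_i, p_j) at a fixed time and must include the entry 0
    for p_i itself; all demands are assumed to be positive.
    """
    order = sorted(zip(dists, demands))
    w = 0.0    # total demand of the points already inside the ball
    wd = 0.0   # sum of d_j * D_j over those points
    for idx, (dist, dem) in enumerate(order):
        w += dem
        wd += dem * dist
        r = (f_i + wd) / w  # solves w * r - wd = f_i for the current set
        nxt = order[idx + 1][0] if idx + 1 < len(order) else math.inf
        if r <= nxt:        # no further point enters the ball before r
            return r


def ball_weight(dists, demands, r):
    # Total demand inside the ball B(p_i, r).
    return sum(dem for dist, dem in zip(dists, demands) if dist <= r)
```

A usage example: for distances $0, 1, 1, \sqrt{50}$, unit demands, and $f_i = 4$, the sketch returns $r_i = 2$, and both bounds of the lemma hold since $4/3 \le 2 \le 8/3$.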
Lemma 4.1.3. Let $t$ be any point of time, and let $p_i \in P$ be any point. Then, there exists
an integer $k$ with $\lceil \log(r_{\min}) \rceil \le k \le \lceil \log(r_{\max}) \rceil$ such that
\[
  \mathrm{weight}(B(p_i(t), 2^k)) \ge f_i \cdot 2^{-k} .
\]
Proof. Due to Lemma 4.1.2, we have
\[
  \mathrm{weight}(B(p_i(t), 2^{\log(r_i(t))})) = \mathrm{weight}(B(p_i(t), r_i(t)))
  \ge \frac{f_i}{r_i(t)} = \frac{f_i}{2^{\log(r_i(t))}} .
\]
Since $2^{\lceil \log(r_i(t)) \rceil} \ge 2^{\log(r_i(t))}$, it follows that
\[
  \mathrm{weight}(B(p_i(t), 2^{\lceil \log(r_i(t)) \rceil})) \ge \frac{f_i}{2^{\lceil \log(r_i(t)) \rceil}} .
\]
Now, the existence of an integer $k$ with $\lceil \log(r_{\min}) \rceil \le k \le \lceil \log(r_{\max}) \rceil$ such that
\[
  \mathrm{weight}(B(p_i(t), 2^k)) \ge f_i \cdot 2^{-k}
\]
follows from $r_{\min} \le r_i(t) \le r_{\max}$.
Due to Lemma 4.1.3 and the fact that a ball with a certain radius is completely covered
by the cube having the same center and the same radius as the ball, we obtain the following
result:
Corollary 4.1.4. Let $t$ be any point of time, and let $p_i \in P$ be any point. Then, there
exists an integer $k$ with $\lceil \log(r_{\min}) \rceil \le k \le \lceil \log(r_{\max}) \rceil$ such that
\[
  \mathrm{weight}(C(p_i(t), 2^k)) \ge f_i \cdot 2^{-k} .
\]
It follows from Corollary 4.1.4 that, at each point of time t, the special radius r˜i(t) of
each point pi ∈ P exists. Next, we use a modified version of a counting argument given
in [14] to prove that r˜i(t) is a constant-factor approximation of ri(t). More precisely, for
the uniform metric facility location problem, Bădoiu et al. [14] showed how to approximate
ri(t) by counting the number of points in exponentially growing balls around pi(t). We
generalize their result to the non-uniform case:
Lemma 4.1.5. Let $t$ be any point of time, let $p_i \in P$ be any point, and let $k_1$ be the
minimum integer $k$ with $\lceil \log(r_{\min}) \rceil \le k \le \lceil \log(r_{\max}) \rceil$ such that
$\mathrm{weight}(B(p_i(t), 2^k)) \ge f_i \cdot 2^{-k}$. Then, it holds that
\[
  \frac{1}{2} \cdot r_i(t) \le 2^{k_1} \le 2 \cdot r_i(t) .
\]
Proof. The existence of the integer $k_1$ is due to Lemma 4.1.3. Furthermore, due to the
choice of $k_1$, we have
\[
  \mathrm{weight}\bigl(B\bigl(p_i(t), 2^{k_1 - 1}\bigr)\bigr) < f_i \cdot 2^{-(k_1 - 1)} .
\]
Suppose that $r_i(t) < 2^{k_1 - 1}$. Then, we get
\[
  \mathrm{weight}(B(p_i(t), r_i(t))) \le \mathrm{weight}\bigl(B\bigl(p_i(t), 2^{k_1 - 1}\bigr)\bigr)
  < f_i \cdot 2^{-(k_1 - 1)} < f_i \cdot \frac{1}{r_i(t)} .
\]
Now, we obtain
\[
  r_i(t) < \frac{f_i}{\mathrm{weight}(B(p_i(t), r_i(t)))} ,
\]
which is a contradiction to Lemma 4.1.2. Hence, $r_i(t) \ge 2^{k_1 - 1}$ must be true, which proves
the second inequality of the assertion.
Furthermore, suppose that $r_i(t) > 2^{k_1 + 1}$. Then, we have
\[
  \mathrm{weight}(B(p_i(t), r_i(t)/2)) \ge \mathrm{weight}\bigl(B\bigl(p_i(t), 2^{k_1}\bigr)\bigr)
  \ge f_i \cdot 2^{-k_1} > f_i \cdot \frac{2}{r_i(t)} .
\]
In this case, it follows that
\[
  r_i(t) > \frac{2 f_i}{\mathrm{weight}(B(p_i(t), r_i(t)/2))} ,
\]
which is again a contradiction to Lemma 4.1.2. Thus, we have $r_i(t) \le 2^{k_1 + 1}$, which proves
the first inequality of the assertion.
Our algorithm uses the approach of [14], but, for any integer $k$, we approximate the sum
of the demands of all the points in a ball with radius $2^k$ by the sum of the demands of all
the points in a cube with radius $2^k$. This leads to the following result:
Lemma 4.1.6. Let $t$ be any point of time, let $p_i \in P$ be any point, and let $k_0$ be the
minimum integer $k$ with $\lceil \log(r_{\min}) \rceil \le k \le \lceil \log(r_{\max}) \rceil$ such that
$\mathrm{weight}(C(p_i(t), 2^k)) \ge f_i \cdot 2^{-k}$. Then, it holds that
\[
  \frac{1}{4\sqrt{d}} \cdot r_i(t) \le 2^{k_0} \le 2 \cdot r_i(t) .
\]
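Proof. Let $k_1$, $\lceil \log(r_{\min}) \rceil \le k_1 \le \lceil \log(r_{\max}) \rceil$, be defined as in Lemma 4.1.5. Then, the
radius of $C(p_i(t), 2^{k_0})$ is at most $2^{k_1}$ since each point in $P(t)$ that is located in $B(p_i(t), 2^{k_1})$
is also located in $C(p_i(t), 2^{k_1})$. Thus, we get
\[
  \mathrm{weight}\bigl(C\bigl(p_i(t), 2^{k_1}\bigr)\bigr) \ge f_i \cdot 2^{-k_1} .
\]
The maximum radius of $C(p_i(t), 2^{k_0})$ is illustrated on the left hand side of Figure 4.1.
Furthermore, the radius of $C(p_i(t), 2^{k_0})$ is larger than $1/\sqrt{d} \cdot 2^{k_1 - 1}$. The reason is that
\[
  \mathrm{weight}\bigl(B\bigl(p_i(t), 2^{k_1 - 1}\bigr)\bigr) < f_i \cdot 2^{-(k_1 - 1)}
\]
and
\[
  \mathrm{weight}\left(C\left(p_i(t), \frac{1}{\sqrt{d}} \cdot 2^{k_1 - 1}\right)\right)
  \le \mathrm{weight}\bigl(B\bigl(p_i(t), 2^{k_1 - 1}\bigr)\bigr) ,
\]
so we have
\[
\begin{aligned}
  \mathrm{weight}\left(C\left(p_i(t), 2^{k_1 - 1 - \log(\sqrt{d})}\right)\right)
  &= \mathrm{weight}\left(C\left(p_i(t), \frac{1}{\sqrt{d}} \cdot 2^{k_1 - 1}\right)\right) \\
  &< f_i \cdot 2^{-(k_1 - 1)} \\
  &< f_i \cdot 2^{-(k_1 - 1 - \log(\sqrt{d}))} .
\end{aligned}
\]
Figure 4.1: Illustration of the maximum and minimum radius of $C(p_i(t), 2^{k_0})$.
(Labels in the figure: $p_i(t)$ with radius $2^{k_1}$ on the left, and $p_i(t)$ with radius $2^{k_1 - 1}$ on the right.)
The minimum radius of $C(p_i(t), 2^{k_0})$ is illustrated on the right hand side of Figure 4.1.
Now, the lemma follows from $1/\sqrt{d} \cdot 2^{k_1 - 1} < 2^{k_0} \le 2^{k_1}$ and Lemma 4.1.5.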
Based on Lemma 4.1.6, we can now show that the special radius associated with a point
is always a constant-factor approximation of the original radius defined by Mettu and
Plaxton [87]. Furthermore, we prove that the number of possible values of a special radius
is only logarithmic in $nR$ where
\[
  R := \frac{\max_{p_i \in P} f_i \cdot \max_{p_i \in P} d_i}{\min_{p_i \in P} f_i \cdot \min_{p_i \in P} d_i} .
\]
Lemma 4.1.7. Let $t$ be any point of time, and let $p_i \in P$ be any point. Then, we have
\[
  r_i(t) \le \tilde{r}_i(t) \le 2^{3 + \lceil \log(\sqrt{d}) \rceil} \cdot r_i(t) .
\]
The number of possible values for $\tilde{r}_i(t)$ is upper bounded by $O(\log(nR))$.
Proof. Due to Lemma 4.1.6, we have
\[
  2^{-\log(4\sqrt{d})} \cdot r_i(t) \le 2^{k_0} \le 2 \cdot r_i(t) .
\]
According to Definition 4.1.1, we set the special radius to
\[
  \tilde{r}_i(t) = 2^{\tilde{k}} = 2^{k_0 + \lceil \log(4\sqrt{d}) \rceil} ,
\]
so we obtain $r_i(t) \le \tilde{r}_i(t) \le 2^{3 + \lceil \log(\sqrt{d}) \rceil} \cdot r_i(t)$. Due to $r_{\min} \le r_i(t) \le r_{\max}$, Equation (4.2),
and the fact that $\tilde{r}_i(t)$ is a power of 2, there are $O(\log(nR))$ possible values for $\tilde{r}_i(t)$.
Walls around a Point. We consider a set of $O(\log(nR))$ nested cubes for each point
$p_i(t) \in P(t)$. More precisely, there is the cube $C(p_i(t), 2^k)$ with radius $2^k$ for each
\[
  k \in \bigl\{ \lceil \log(r_{\min}) \rceil + \lceil \log(4\sqrt{d}) \rceil,\;
               \lceil \log(r_{\min}) \rceil + 1 + \lceil \log(4\sqrt{d}) \rceil,\; \dots,\;
               \lceil \log(r_{\max}) \rceil + \lceil \log(4\sqrt{d}) \rceil \bigr\} .
\]
The side faces of the cube defined by $C(p_i(t), 2^k)$ form a wall around $p_i(t)$, which we call $W_{i,k}(t)$.
Hence, there exists a set of $O(\log(nR))$ walls for $p_i(t)$. We use this set of walls to determine
the points of time when an update of $p_i$ in our KDS is required. In general, an event occurs
each time any point crosses any wall of another point.
4.1.2 Computation of the Special Radii
In order to compute the special radius associated with any point efficiently at any time,
we maintain two (d + 1)-dimensional dynamic range trees denoted by T1 and T2. At any
time, range tree T1 is used to manage the current set of open facilities (which we call open
points), and T2 stores the current set of clients (which we call closed points). Apart from
the fact that the two data structures contain different point sets, they are constructed in
the same way. In the first d levels of the range trees, the points are handled according
to their coordinates and, in the (d + 1)-st level, the points are handled according to their
special radii. Additionally, with each node v in every binary search tree of the (d + 1)-st
level, we store the sum of the demands of all the points contained in the subtree rooted at
v.
At any point of time $t$, the range trees rely on the relative position of the points in
$P(t)$. More precisely, the leaves of any binary search tree of any level $\ell$, $1 \le \ell \le d$, in $T_1$
and $T_2$ store the points sorted according to their ranks based on dimension $\ell$, i.e., sorted
according to their $\ell$-th coordinate. Now, the movement of the points in $P$ is reflected by
insert and delete operations on $T_1$ and $T_2$. At each point of time $t$, when any two points
$p_i(t), p_j(t) \in P(t)$ change their ranks based on any dimension $\ell$, we delete $p_i$ and $p_j$ from
$T_1$ and $T_2$ and reinsert them according to their position at time $t$.
By applying a technique proposed by Willard and Lueker in [109], we are able to support
all required properties of T1 and T2 efficiently. More precisely, T1 and T2 have the following
complexity:
Lemma 4.1.8 ([109]). The range trees T1 and T2 have a space requirement of $O(n \log^{d}(n))$
and can be initialized in $O(n \log^{d+1}(n))$ time. The worst-case time per insertion and deletion
is $O(\log^{d+1}(n))$. Given any orthogonal range
$[x_1, x'_1] \times [x_2, x'_2] \times \ldots \times [x_{d+1}, x'_{d+1}] \subset \mathbb{R}^{d+1}$ at any time t, the set
$$Q := \{p_i(t) \in P(t) \mid p_i(t) \in [x_1, x'_1] \times [x_2, x'_2] \times \ldots \times [x_d, x'_d] \text{ and } \tilde{r}_i(t) \in [x_{d+1}, x'_{d+1}]\}$$
can be computed in $O(\log^{d+1}(n) + |Q|)$ time and the value $\sum_{p_i(t) \in Q} d_i$ in
$O(\log^{d+1}(n))$ time.
Besides the two range trees, we maintain a binary search tree T that contains for each
point in P a pair consisting of the point’s index and its current status (which is either open
or closed). T is sorted according to the indices. Thus, we can output the status of a given
point in O(log(n)) time by querying T .
4.1.3 The Invariant
The key idea of our KDS is to keep up one invariant consisting of the following conditions:
(a) for each closed point pi(t) ∈ P (t)\F (t), there is an open point pj(t) ∈ F (t) with
r˜j(t) ≤ r˜i(t) in C(pi(t), 4 · r˜i(t)) and
(b) for each open point pi(t) ∈ F (t), there is no other open point pj(t) ∈ F (t) with
r˜j(t) ≤ r˜i(t) in C(pi(t), 2 · r˜i(t)).
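Conditions (a) and (b) can be checked by brute force, as in the following Python sketch. It is only an illustration: the KDS itself uses range queries instead of the quadratic scan shown here, and the names `in_cube` and `violations` are hypothetical. The sketch assumes closed axis-parallel cubes whose radius is half the side length.

```python
def in_cube(center, radius, q):
    """True if q lies in the closed axis-parallel cube C(center, radius)."""
    return all(abs(c - x) <= radius for c, x in zip(center, q))

def violations(points, radii, is_open):
    """Indices of points that violate Condition (a) or (b).

    points[i]  -- coordinates of p_i(t)
    radii[i]   -- special radius of p_i(t)
    is_open[i] -- True if p_i(t) is an open facility
    """
    bad = []
    for i, p in enumerate(points):
        factor = 2 if is_open[i] else 4
        witness = any(
            j != i and is_open[j] and radii[j] <= radii[i]
            and in_cube(p, factor * radii[i], points[j])
            for j in range(len(points))
        )
        # Open point: a witness is forbidden (b). Closed point: a witness is required (a).
        if witness == is_open[i]:
            bad.append(i)
    return bad
```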
The choice of Conditions (a) and (b) enables our KDS to be stable. Moreover, we will
show that, by keeping up Conditions (a) and (b), we maintain, at any point of time t, a set
of open facilities F (t) that leads to a total cost which is at most a constant factor larger
than the optimal cost. The following argumentation for proving that our KDS maintains a
constant-factor approximation is basically the same as in [87] and the proof of Lemma 3.1.9.
Only a few minor adaptations to the kinetic setting have been made.
Claim 4.1.9. Let t be any point of time, and let pi(t) be any point in P (t). If the invariant
is satisfied at the point of time t, then there exists a point pj(t) ∈ F (t) such that r˜j(t) ≤ r˜i(t)
and D(pi(t), pj(t)) ≤ 64d · ri(t).
Proof. Since the invariant is satisfied, there is an open facility $p_j(t) \in F(t)$ with radius
$\tilde{r}_j(t) \le \tilde{r}_i(t)$ in $C(p_i(t), 4 \cdot \tilde{r}_i(t))$ for each point $p_i(t) \in P(t)$. Thus, we get
$D(p_i(t), p_j(t)) \le \sqrt{d} \cdot 4 \cdot \tilde{r}_i(t)$. Now, due to Lemma 4.1.7, we have
$$D(p_i(t), p_j(t)) \le \sqrt{d} \cdot 4 \cdot 2^{3+\lceil \log(\sqrt{d}) \rceil} \cdot r_i(t) \le 64d \cdot r_i(t) .$$
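The last inequality can be spelled out as follows, assuming that log denotes the base-2 logarithm and using $\lceil x \rceil < x + 1$:

$$\sqrt{d} \cdot 4 \cdot 2^{3+\lceil \log(\sqrt{d}) \rceil} < \sqrt{d} \cdot 4 \cdot 2^{3 + \log(\sqrt{d}) + 1} = 4\sqrt{d} \cdot 16 \sqrt{d} = 64d .$$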
Claim 4.1.10. Let t be any point of time, and let pi(t) and pj(t) be distinct points in F (t).
If the invariant is satisfied at the point of time t, then we have
D(pi(t), pj(t)) > 2 ·max{ri(t), rj(t)} .
Proof. Without loss of generality, we assume that r˜j(t) ≤ r˜i(t). From the fact that the
invariant is satisfied, it follows that pj(t) ∉ C(pi(t), 2 · r˜i(t)). Otherwise, the point pi would
be closed at the point of time t. Thus, we have
D(pi(t), pj(t)) > 2 · r˜i(t) ≥ 2 · ri(t)
and
D(pi(t), pj(t)) > 2 · r˜i(t) ≥ 2 · r˜j(t) ≥ 2 · rj(t) ,
where r˜i(t) ≥ ri(t) and r˜j(t) ≥ rj(t) follow from Lemma 4.1.7.
For any point $p_j(t) \in P(t)$ and an arbitrary set of open facilities $X(t) \subseteq P(t)$, let
$$\mathrm{charge}(p_j(t), X(t)) := D(p_j(t), X(t)) + \sum_{p_i(t) \in X(t)} \max\{0,\ r_i(t) - D(p_i(t), p_j(t))\} .$$
Claim 4.1.11. Let t be any point of time. For an arbitrary set of open facilities
$X(t) \subseteq P(t)$, we get
$$\sum_{p_j(t) \in P(t)} \mathrm{charge}(p_j(t), X(t)) \cdot d_j = \mathrm{FacLoc}(P(t), X(t)) .$$
Proof. Due to the definition of charge(·, ·) and FacLoc(·, ·) and due to Equation (4.1), we
get
$$\begin{aligned}
\sum_{p_j(t) \in P(t)} \mathrm{charge}(p_j(t), X(t)) \cdot d_j
&= \sum_{p_j(t) \in P(t)} D(p_j(t), X(t)) \cdot d_j + \sum_{p_i(t) \in X(t)} \ \sum_{p_j(t) \in P(t) \cap B(p_i(t), r_i(t))} (r_i(t) - D(p_i(t), p_j(t))) \cdot d_j \\
&= \sum_{p_j(t) \in P(t)} D(p_j(t), X(t)) \cdot d_j + \sum_{p_i(t) \in X(t)} f_i \\
&= \mathrm{FacLoc}(P(t), X(t)) .
\end{aligned}$$
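The identity of Claim 4.1.11 can be checked numerically with the following Python sketch for a fixed point of time. It assumes, as the proof above suggests, that Equation (4.1) defines $r_i$ as the solution of $\sum_{p_j \in B(p_i, r_i)} (r_i - D(p_i, p_j)) \cdot d_j = f_i$, that all demands are positive, and that D is the Euclidean distance. The function names are hypothetical.

```python
import math

def radius(i, points, demands, f_i):
    """Solve sum_j max(0, r - D(p_i, p_j)) * d_j = f_i for r (positive demands assumed)."""
    order = sorted(range(len(points)), key=lambda j: math.dist(points[i], points[j]))
    weight = 0.0   # total demand of the points already inside the ball
    moment = 0.0   # sum of demand * distance over those points
    for pos, j in enumerate(order):
        weight += demands[j]
        moment += demands[j] * math.dist(points[i], points[j])
        r = (f_i + moment) / weight
        nxt = (math.dist(points[i], points[order[pos + 1]])
               if pos + 1 < len(order) else math.inf)
        if r <= nxt:   # the ball of radius r contains exactly the points seen so far
            return r

def charge(j, X, points, radii):
    """charge(p_j, X) as defined above; X is a list of indices of open facilities."""
    d_to_X = min(math.dist(points[j], points[i]) for i in X)
    return d_to_X + sum(max(0.0, radii[i] - math.dist(points[i], points[j])) for i in X)

def fac_loc(X, points, demands, costs):
    """Opening costs of X plus demand-weighted connection costs."""
    opening = sum(costs[i] for i in X)
    connection = sum(demands[j] * min(math.dist(points[j], points[i]) for i in X)
                     for j in range(len(points)))
    return opening + connection
```

Summing `charge(j, X, points, radii) * demands[j]` over all j should then agree with `fac_loc(X, points, demands, costs)` up to floating-point error, for any non-empty X.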
Claim 4.1.12. Let t be any point of time, let pj(t) ∈ P (t) be any point, let X(t) ⊆ P (t)
be an arbitrary set of open facilities, and let pi(t) ∈ X(t) be any open facility. If we have
D(pj(t), pi(t)) = D(pj(t), X(t)), then charge(pj(t), X(t)) ≥ max{ri(t),D(pj(t), pi(t))}.
Proof. If pj(t) ∉ B(pi(t), ri(t)), then
charge(pj(t), X(t)) ≥ D(pj(t), X(t))
= D(pj(t), pi(t))
> ri(t) .
Otherwise, we have
charge(pj(t), X(t)) ≥ D(pj(t), X(t)) + (ri(t)−D(pj(t), pi(t)))
= D(pj(t), pi(t)) + (ri(t)−D(pj(t), pi(t)))
= ri(t)
≥ D(pj(t), pi(t)) .
Claim 4.1.13. Let t be any point of time, let pj(t) ∈ P (t) be any point, and let pi(t) be
any open facility in F (t). If the invariant is satisfied at the point of time t and we have
pj(t) ∈ B(pi(t), ri(t)), then charge(pj(t), F (t)) ≤ ri(t).
Proof. By Claim 4.1.10, there is no open point $p_\ell(t) \in F(t)$ such that we have $i \ne \ell$ and
$p_j(t) \in B(p_\ell(t), r_\ell(t))$. Since $D(p_j(t), F(t)) \le D(p_j(t), p_i(t))$, we obtain
$$\begin{aligned}
\mathrm{charge}(p_j(t), F(t)) &= D(p_j(t), F(t)) + (r_i(t) - D(p_j(t), p_i(t))) \\
&\le D(p_j(t), p_i(t)) + (r_i(t) - D(p_j(t), p_i(t))) \\
&= r_i(t) .
\end{aligned}$$
Claim 4.1.14. Let t be any point of time, let $p_j(t) \in P(t)$ be any point, and let $p_i(t)$ be
any open facility in $F(t)$. If the invariant is satisfied at the point of time t and we have
$p_j(t) \notin B(p_i(t), r_i(t))$, then $\mathrm{charge}(p_j(t), F(t)) < D(p_j(t), p_i(t))$.
Proof. The correctness of the claim follows immediately, unless there is a point $p_\ell(t) \in F(t)$
such that $p_j(t) \in B(p_\ell(t), r_\ell(t))$. If such a point $p_\ell(t)$ exists, then Claims 4.1.10 and 4.1.13
imply $D(p_i(t), p_\ell(t)) > 2 \cdot \max\{r_i(t), r_\ell(t)\}$ and $\mathrm{charge}(p_j(t), F(t)) \le r_\ell(t)$. Furthermore,
by triangle inequality, we obtain
$$\begin{aligned}
D(p_j(t), p_i(t)) &\ge D(p_i(t), p_\ell(t)) - D(p_j(t), p_\ell(t)) \\
&> 2 r_\ell(t) - r_\ell(t) \\
&= r_\ell(t) ,
\end{aligned}$$
which proves $\mathrm{charge}(p_j(t), F(t)) \le r_\ell(t) < D(p_j(t), p_i(t))$.
Claim 4.1.15. Let t be any point of time, let $p_j(t) \in P(t)$ be any point, and let $X(t) \subseteq P(t)$
be an arbitrary set of open facilities. If the invariant is satisfied at the point of time t, then
$$\mathrm{charge}(p_j(t), F(t)) < (64d + 1) \cdot \mathrm{charge}(p_j(t), X(t)) .$$
Proof. Let $p_i(t)$ be some point in $X(t)$ such that we have $D(p_j(t), p_i(t)) = D(p_j(t), X(t))$.
By Claim 4.1.9, there exists a point $p_\ell(t) \in F(t)$ such that we have $\tilde{r}_\ell(t) \le \tilde{r}_i(t)$ and
$D(p_i(t), p_\ell(t)) \le 64d \cdot r_i(t)$.
If $p_j(t) \in B(p_\ell(t), r_\ell(t))$, then we obtain $\mathrm{charge}(p_j(t), F(t)) \le r_\ell(t)$ by Claim 4.1.13.
Then, we get $r_\ell(t) \le \tilde{r}_\ell(t) \le \tilde{r}_i(t) \le \sqrt{d} \cdot 4 \cdot 2^{3+\lceil \log(\sqrt{d}) \rceil} \cdot r_i(t) \le 64d \cdot r_i(t)$ due to the
arguments above and Lemma 4.1.7. Since Claim 4.1.12 implies $\mathrm{charge}(p_j(t), X(t)) \ge r_i(t)$,
we can conclude
$$\mathrm{charge}(p_j(t), F(t)) \le r_\ell(t) \le 64d \cdot r_i(t) \le 64d \cdot \mathrm{charge}(p_j(t), X(t)) .$$
This proves the assertion in case that we have $p_j(t) \in B(p_\ell(t), r_\ell(t))$.
If $p_j(t) \notin B(p_\ell(t), r_\ell(t))$, then $\mathrm{charge}(p_j(t), F(t)) < D(p_j(t), p_\ell(t))$ by Claim 4.1.14. Thus,
by triangle inequality, we get
$$\begin{aligned}
\mathrm{charge}(p_j(t), F(t)) &< D(p_j(t), p_i(t)) + D(p_i(t), p_\ell(t)) \\
&\le D(p_j(t), p_i(t)) + 64d \cdot r_i(t) .
\end{aligned}$$
Since the ratio of $D(p_j(t), p_i(t)) + 64d \cdot r_i(t)$ to the maximum of $r_i(t)$ and $D(p_j(t), p_i(t))$
is at most $64d + 1$, we obtain $\mathrm{charge}(p_j(t), F(t)) < (64d + 1) \cdot \max\{D(p_j(t), p_i(t)), r_i(t)\}$.
Now, the assertion follows by Claim 4.1.12.
Lemma 4.1.16. Let t be any point of time. If the invariant is satisfied at the point of
time t, then we have
$$\mathrm{FacLoc}(P(t), F(t)) < (64d + 1) \cdot \mathrm{FacLoc}(P(t), F^*(t)) .$$
Proof. Due to Claims 4.1.11 and 4.1.15, we have
$$\begin{aligned}
\mathrm{FacLoc}(P(t), F(t)) &= \sum_{p_j(t) \in P(t)} \mathrm{charge}(p_j(t), F(t)) \cdot d_j \\
&< \sum_{p_j(t) \in P(t)} (64d + 1) \cdot \mathrm{charge}(p_j(t), X(t)) \cdot d_j \\
&= (64d + 1) \cdot \mathrm{FacLoc}(P(t), X(t))
\end{aligned}$$
for an arbitrary set of open facilities $X(t) \subseteq P(t)$. Thus, the bound also holds
for an optimal set of open facilities $F^*(t)$, which completes the proof of the lemma.
4.2 The Kinetic Data Structure
This section addresses the design of our KDS for the mobile facility location problem. After
describing how to compute an initial set of open facilities, we describe how the event queue
is structured and how an update of the KDS is processed.
4.2.1 Initial Set of Open Facilities
Let pi(t0) denote the initial position of the point pi ∈ P . To compute an initial set of open
facilities, we apply Algorithm 4.2.1, which is a modified version of Algorithm 2.3.1, on the
point set P (t0). The modification is that, instead of considering exactly the sorted sequence
of the ri(t0) values, we round each ri(t0) to one of the O(log(nR)) possible values for the
special radii (i.e., compute its corresponding r˜i(t0) value) and use the sorted sequence of
the rounded values.
Algorithm 4.2.1 Modified-Mettu-Plaxton-FacLoc(P, t0)
1: calculate the radius r̃_i(t0) for each point p_i(t0) ∈ P(t0)
2: for k ← ⌈log(r_min)⌉ + ⌈log(4√d)⌉ to ⌈log(r_max)⌉ + ⌈log(4√d)⌉ do
3:     let I_k be the set of indices of all the points with radius 2^k
4:     for each i ∈ I_k do
5:         if there is no open facility in C(p_i(t0), 2 · 2^k) then
6:             open facility at p_i(t0)
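A minimal Python sketch of lines 2 to 6 of Algorithm 4.2.1 is given below. It assumes that the exponents of the special radii have already been computed (line 1) and replaces the range queries by a linear scan over the open facilities. The names `in_cube` and `modified_mettu_plaxton` are hypothetical, and cubes are assumed to be closed with radius equal to half the side length.

```python
def in_cube(center, radius, q):
    """True if q lies in the closed axis-parallel cube C(center, radius)."""
    return all(abs(c - x) <= radius for c, x in zip(center, q))

def modified_mettu_plaxton(points, exponents):
    """Return the indices of the opened facilities.

    exponents[i] is the integer k with special radius 2**k for point i.
    Sorting by exponent visits the sets I_k in increasing order of k.
    """
    opened = []
    for i in sorted(range(len(points)), key=lambda idx: exponents[idx]):
        r = 2 ** exponents[i]
        if not any(in_cube(points[i], 2 * r, points[j]) for j in opened):
            opened.append(i)
    return opened
```

Because the points are visited in non-decreasing order of their special radii, every facility that is open when point i is examined has a radius of at most 2^k, which is what Conditions (a) and (b) of the invariant refer to.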
4.2.2 Event Queue
In order to maintain the invariant defined in Section 4.1.3, we have to update our KDS at
certain points of time. More precisely, we perform an update at each point of time when a
point $p_j(t)$ crosses a wall $W_{i,k}(t)$,
$\lceil\log(r_{\min})\rceil + \lceil\log(4\sqrt{d})\rceil \le k \le \lceil\log(r_{\max})\rceil + \lceil\log(4\sqrt{d})\rceil$,
of another point $p_i(t)$.
To keep track of these events, we use the following data structure: For each dimension
ℓ, 1 ≤ ℓ ≤ d, we store all n points and all O(n · log(nR)) wall faces that are orthogonal to
the ℓ-th coordinate axis in a list sorted by the ℓ-th coordinate. For each consecutive pair
in each of the d lists, we maintain one certificate to certify the sorted order of the lists. We
define the failure time of the certificate for any pair of consecutive objects to be the first
future point of time when these objects change their ranks in their sorted list. The failure
times of all certificates are maintained in one event queue.
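The certificate mechanism for a single sorted list can be sketched in Python as follows. The sketch makes simplifying assumptions that are not part of the text above: every object (point or wall face) moves linearly along the considered axis, x(t) = x0 + v · t, there are no flight plan updates, and outdated certificates are discarded lazily when they are dequeued instead of being updated in place. The names `failure_time` and `simulate` are hypothetical.

```python
import heapq

def failure_time(a, b, now):
    """First time >= now at which a (currently left of b) overtakes b.
    Objects are (x0, v) pairs with x(t) = x0 + v * t; None if a never overtakes b."""
    (xa, va), (xb, vb) = a, b
    if va <= vb:
        return None
    t = (xb - xa) / (va - vb)
    return t if t >= now else None

def simulate(objects, t_end):
    """Maintain the sorted order of the objects up to time t_end.
    Returns the final order and the list of processed rank changes (t, a, b)."""
    order = sorted(range(len(objects)), key=lambda i: objects[i][0])
    pos = {obj: p for p, obj in enumerate(order)}
    heap = []  # event queue: (failure time, left object, right object)

    def certify(p, now):
        # create the certificate for the consecutive pair at positions p, p + 1
        if 0 <= p < len(order) - 1:
            a, b = order[p], order[p + 1]
            t = failure_time(objects[a], objects[b], now)
            if t is not None:
                heapq.heappush(heap, (t, a, b))

    for p in range(len(order) - 1):
        certify(p, 0.0)
    swaps = []
    while heap and heap[0][0] <= t_end:
        t, a, b = heapq.heappop(heap)
        p = pos[a]
        if p + 1 >= len(order) or order[p + 1] != b:
            continue  # stale certificate: a and b are no longer consecutive
        order[p], order[p + 1] = b, a
        pos[a], pos[b] = p + 1, p
        swaps.append((t, a, b))
        certify(p - 1, t)  # new left neighbor of b
        certify(p + 1, t)  # new right neighbor of a
    return order, swaps
```

In the KDS, each processed rank change would additionally be classified by the types of the two objects involved (point or wall face).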
In case that more than one event occurs at the same time, we handle them in an arbitrary
order. Certainly, it is not the case that each event implicates that a point crosses a wall of
another point (as, e.g., the change of the rank of two wall faces also causes an event), but
definitely every crossing of a wall is discovered by a failure of at least one certificate. The
event queue has the following complexity:
Lemma 4.2.1. The event queue has size $O(n \log(nR))$, can be initialized in $O(n \log^2(nR))$
time, and can be updated in $O(\log(nR))$ time. Provided that the trajectories can be described
by bounded-degree polynomials, the total number of events is $O(n^2 \log^2(nR))$. A flight plan
update involves $O(\log(nR))$ certificates and requires $O(\log^2(nR))$ time.
Proof. Each of the d lists stores n points and O(n log(nR)) wall faces. It follows that the
event queue holds O(n log(nR)) events. Thus, the upper bound on the space requirement
is as claimed.
The initialization of the d lists and the event queue can be done by simple sorting
operations in $O(n \log(nR) \log(n \log(nR))) \subset O(n \log^2(nR))$ time. In each following update,
we have to re-calculate the points of time when the two objects involved in the current event
change their ranks with their two neighbors in the corresponding list. Thus, a constant
number of events have to be updated in the event queue. Since the event queue contains
O(n log(nR)) elements and we can use a min-heap to realize it, an update of an event
requires O(log(n log(nR))) ⊂ O(log(nR)) time. Furthermore, a flight plan update of a
point causes a re-calculation of the points of time when the point and all its wall faces
change their ranks with the associated neighbors in all d lists. Afterwards, the involved
certificates are updated in the event queue. Since a point has O(log(nR)) wall faces, the
number of involved certificates is O(log(nR)). Their update in the event queue can be
accomplished in $O(\log^2(nR))$ time.
In case that the trajectories can be described by bounded-degree polynomials and no
flight plan update occurs, the upper bound on the total number of events is given as
follows. For each pair of elements, an event occurs when the trajectories of the two elements
cross each other. The number of intersections of two polynomial trajectories is bounded by
the maximum degree of both polynomials. Hence, the total number of pairwise intersections of
$O(n \log(nR))$ bounded-degree polynomials is $O(c\, n^2 \log^2(nR))$, where the constant c is the
maximum degree of the polynomials.
4.2.3 Handling an Update
In this section, we describe how an event E, occurring at any point of time t, is handled
(confer Algorithm 4.2.2, ll. 5). As the first step, the event queue is updated as explained
in Section 4.2.2. Then, we have to distinguish between the following three cases:
(i) Both objects involved in the considered certificate are faces of walls.
(ii) Both objects involved in the considered certificate are points.
(iii) One object involved in the considered certificate is a point and the other object is a
face of a wall.
The handling of the three cases mainly depends on whether the invariant is violated or
not. We say that a point pi(t) ∈ P (t) violates the invariant at a point of time t if either
(a) pi(t) is closed, but there is no open facility with radius smaller than or equal to r˜i(t) in
the cube C(pi(t), 4 · r˜i(t)) or (b) pi(t) is open, but there is another open facility with radius
smaller than or equal to r˜i(t) in the cube C(pi(t), 2 · r˜i(t)). We assume that the invariant
is satisfied by the time when E occurs.
In Case (i), no point crosses the wall of another point. As a result, the invariant is still
satisfied, so handling E is completed.
In Case (ii), the event indicates that a point pi(t) and another point pj(t) change their
ranks based on a dimension ℓ, 1 ≤ ℓ ≤ d. This means that we have to update the position
of pi and pj in the range trees T1 and T2. Since no point crosses a wall of another point,
handling E is then completed.
In Case (iii), it might be that the invariant is violated. Let $p_j(t)$ be the first object
involved in the considered certificate, and let $p_i(t)$ be the point whose wall is the second
object involved in the considered certificate. In case that $p_j(t)$ does not cross a wall
of $p_i(t)$, handling E is completed. Otherwise, we update the radius $\tilde{r}_i(t)$ according to
Definition 4.1.1, i.e., we set $\tilde{r}_i(t) = 2^{\tilde{k}}$ such that $\tilde{k} = k_0 + \lceil\log(4\sqrt{d})\rceil$ and $k_0$ is the
minimum integer k, $\lceil\log(r_{\min})\rceil \le k \le \lceil\log(r_{\max})\rceil$, with
$\mathrm{weight}(C(p_i(t), 2^{k_0})) \ge f_i \cdot 2^{-k_0}$.
We will show that the new value of $k_0$ differs from its old value (before event E occurred)
by at most 1. Thus, there are three possible values for $k_0$. Each of these values can be
tested by one range query on both T1 and T2. Afterwards, we test if $p_i(t)$ violates the
invariant by using a range query on T1. If this is the case, we change the status of $p_i(t)$. As
an effect of changing the radius or the status of one point, the invariant may be violated
by many other points (e.g., their open facility has been closed). In the following, we will
show how to deal with this problem (confer Algorithm 4.2.3).
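The local update of $k_0$ can be sketched in Python as follows. The callable `weight_of_cube` is a hypothetical stand-in for the range queries on T1 and T2: given a radius, it is assumed to return the value weight(C(p_i(t), radius)) and to be non-decreasing in the radius. The names `satisfies` and `updated_k0` are also hypothetical.

```python
def satisfies(weight_of_cube, f_i, k):
    """Test the condition weight(C(p_i(t), 2^k)) >= f_i * 2^(-k)."""
    return weight_of_cube(2.0 ** k) >= f_i * 2.0 ** (-k)

def updated_k0(weight_of_cube, f_i, old_k0, k_min, k_max):
    """Smallest candidate in {old_k0 - 1, old_k0, old_k0 + 1} within [k_min, k_max]
    that satisfies the condition, or None if no candidate does."""
    for k in (old_k0 - 1, old_k0, old_k0 + 1):
        if k_min <= k <= k_max and satisfies(weight_of_cube, f_i, k):
            return k
    return None
```

Since the left-hand side of the condition does not decrease and the right-hand side decreases as k grows, the smallest satisfying candidate is the true minimum whenever the new value differs from the old one by at most 1.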
Algorithm Restore. Suppose that $p_i(t)$ is a point that triggered an event E at a point
of time t and whose radius or status changed due to E. Let $\tilde{r}_i(t) = 2^{\tilde{k}}$ be its updated
radius. First, we restore the invariant at all points with radius $2^{\tilde{k}-1}$ to ensure that no point
with radius less than or equal to $2^{\tilde{k}-1}$ violates the invariant. Then, we handle all points
with radius $2^{\tilde{k}}$ that violate the invariant, then the points with radius $2^{\tilde{k}+1}$, and so on, up to
the biggest possible radius. Now, we describe the procedure in general for any radius $2^k$.
We define two cubes $S_1 := C(p_i(t), 4 \cdot 2^{k+1})$ and $S_2 := C(p_i(t), 6 \cdot 2^{k+1})$. Both cubes
are divided into equally sized cubelets with radius $2^k$. The left hand side of Figure 4.2
illustrates this decomposition in the plane.
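The decomposition can be made concrete with a short Python sketch. It assumes that the radius of a cube is half its side length, so that S1 splits into 8 cubelets per dimension and S2 into 12 cubelets per dimension. The function name `cubelet_centers` and its parameter `factor` (4 for S1, 6 for S2) are hypothetical.

```python
from itertools import product

def cubelet_centers(center, k, factor):
    """Centers of the cubelets of radius 2**k that tile C(center, factor * 2**(k+1))."""
    r = 2 ** k
    per_axis = 2 * factor  # cube side 4 * factor * r divided by cubelet side 2 * r
    offsets = [(-per_axis + 1 + 2 * s) * r for s in range(per_axis)]
    return [tuple(c + o for c, o in zip(center, off))
            for off in product(offsets, repeat=len(center))]
```

In the plane, this gives 64 cubelets for S1 and 144 cubelets for S2.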
To guarantee that no open point with radius $2^k$ violates the invariant, we proceed as
follows with each cubelet in $S_1$: Let m be the center point of the considered cubelet. If
there is an open facility with radius less than $2^k$ in $C(m, 3 \cdot 2^k)$, then we close all facilities
with radius $2^k$ in $C(m, 2^k)$. Note that there is at most one such facility. The considered
area around a cubelet is illustrated in Figure 4.2.
In order to ensure that no closed point with radius $2^k$ violates the invariant either, we
proceed as follows with each cubelet in $S_2$: Let m be the center point of the considered
cubelet. If there does not exist an open facility with radius less than or equal to $2^k$ in
$C(m, 3 \cdot 2^k)$, then we open a point with radius $2^k$ in the cubelet (if there is such a point).
No matter whether we opened a point or not, it is guaranteed that, for each closed point
$p_j(t)$ with $\tilde{r}_j(t) = 2^k$ in the cubelet, there is an open facility in $C(p_j(t), 4 \cdot \tilde{r}_j(t))$.
Algorithm 4.2.2 KineticFL(P, t0)
1: Modified-Mettu-Plaxton-FacLoc(P, t0)
2: initialize event queue Q
3: while Q is not empty do
4:     E ← dequeue(Q)
5:     update Q
6:     if E indicates that p_i(t) and p_j(t) change their ranks in any list for any i, j then
7:         update position of p_i and p_j in T1 and T2
8:     else
9:         if E indicates that p_j(t) crosses a wall of p_i(t) for any i, j then
10:            update r̃_i(t) ← 2^k̃ in T1 and T2
11:            if p_i(t) violates the invariant then
12:                change status of p_i(t)
13:            if radius or status of p_i(t) changed then
14:                Restore(p_i(t), k̃)
Algorithm 4.2.3 Restore(p_i(t), k̃)
1: for k ← k̃ − 1 to ⌈log(r_max)⌉ + ⌈log(4√d)⌉ do
2:     define cubes S1 ← C(p_i(t), 4 · 2^(k+1)) and S2 ← C(p_i(t), 6 · 2^(k+1))
3:     for each cubelet C with center m_C and radius 2^k in S1 do
4:         if ∃ open facility with radius < 2^k in C(m_C, 3 · 2^k) then
5:             close all facilities with radius 2^k in C
6:     for each cubelet C with center m_C and radius 2^k in S2 do
7:         if ∄ open facility with radius ≤ 2^k in C(m_C, 3 · 2^k) then
8:             open one point with radius 2^k in C (if existing)
4.3 Quality and Complexity of the Kinetic Data Structure
At first, we prove that our KDS maintains a subset of the moving input points as open
facilities such that, at any time, the associated total cost is at most a constant factor larger
than the current optimal cost. For that purpose, we show that we restore the invariant
each time it is violated. Finally, we analyze the complexity of our KDS.
Figure 4.2: Illustration of the decomposition into cubelets and the tested area for a cubelet.
The shown decomposition is used during the iteration of algorithm Restore
that restores the invariant at all points with radius $2^k$. The cubes $S_1$ and $S_2$ are
indicated by thick lines. For each cubelet in $S_1$ and $S_2$, we perform a test. The
shaded area indicates the tested area $C(m, 3 \cdot 2^k)$ for one cubelet in $S_1$. This
area is magnified on the right hand side of the figure, where the dark shaded
area corresponds to the tested cubelet $C(m, 2^k)$.
4.3.1 Maintenance of the Invariant
To simplify the description of the following proofs, we assume that at most one event occurs
at the same time. Assuming this, we can show that the invariant is always satisfied after
our KDS has handled an event. In case that more than one event occurs at the same time,
the following proofs would differ in the sense that the fulfillment of the invariant can be
guaranteed only after our KDS has handled all of these events.
First, we prove that the invariant is satisfied as long as algorithm KineticFL does not
call algorithm Restore.
Lemma 4.3.1. The invariant is satisfied after the first step of algorithm KineticFL.
Proof. Since algorithm Modified-Mettu-Plaxton-FacLoc treats the points in non-
decreasing order according to their special radii and opens a point pi(t0) with radius r˜i(t0)
if and only if there is no other open point in C(pi(t0), 2 · r˜i(t0)), no open point violates the
invariant.
Furthermore, algorithm Modified-Mettu-Plaxton-FacLoc does not open a point
$p_i(t_0)$ with radius $\tilde{r}_i(t_0)$ if and only if there is another open point in
$C(p_i(t_0), 2 \cdot \tilde{r}_i(t_0)) \subseteq C(p_i(t_0), 4 \cdot \tilde{r}_i(t_0))$. Because this point has been treated earlier than
$p_i(t_0)$, its radius is less than or equal to $\tilde{r}_i(t_0)$. Thus, there exists an open point with radius
less than or equal to $\tilde{r}_i(t_0)$ in $C(p_i(t_0), 4 \cdot \tilde{r}_i(t_0))$. Hence, no closed point violates the invariant.
Claim 4.3.2. Let E be any event such that algorithm KineticFL does not change the
radius or the status of any point. If the invariant is satisfied before E, then it holds after
E as well.
Proof. We have to consider two cases. In the first case, no point crosses a wall of another
point. This implies that no point enters or leaves any cube of another point and no point
changes its radius. Hence, the invariant is still valid and the claim holds.
Let t be the point of time when event E occurs. Then, in the second case, we have that
a wall $W_{i,k}(t)$ of a point $p_i(t)$ is crossed by another point $p_j(t)$, but our algorithm does not
change the radius or the status of $p_i(t)$. It follows that $p_i(t)$ neither changed its radius nor
violates the invariant because otherwise our algorithm would have changed the radius
or the status of $p_i(t)$, respectively. Due to the fact that $p_i(t)$ is unchanged and only the
wall $W_{i,k}(t)$ is crossed at the point of time t, no point in $P(t) \setminus \{p_i(t)\}$ violates the invariant
either. This completes the proof.
Next, we prove that the updated radius of a point that triggered an event E differs at
most by a factor of 2 from its value before E.
Claim 4.3.3. Let E be an event at any point of time t where any point $p_j(t) \in P(t)$ crosses
any wall of any other point $p_i(t) \in P(t)$. Let $t' < t$ be any point of time after the latest
point of time when $p_i$ has been involved in one event. We get
$\tfrac{1}{2} \cdot \tilde{r}_i(t') \le \tilde{r}_i(t) \le 2 \cdot \tilde{r}_i(t')$.
Proof. Let $k'_0$ and $k_0$ be the minimum integers k with $\lceil\log(r_{\min})\rceil \le k \le \lceil\log(r_{\max})\rceil$
for which we have $\mathrm{weight}(C(p_i(t'), 2^{k'_0})) \ge f_i \cdot 2^{-k'_0}$ and $\mathrm{weight}(C(p_i(t), 2^{k_0})) \ge f_i \cdot 2^{-k_0}$,
respectively. Note that the existence of $k'_0$ and $k_0$ is due to Corollary 4.1.4. Furthermore,
let $W_{i,\ell}(t)$ be the wall that is crossed by $p_j(t)$. We have to consider the cases (i) $p_j(t)$
leaves the cube $C(p_i(t), 2^{\ell})$ and (ii) $p_j(t)$ enters the cube $C(p_i(t), 2^{\ell})$.
Case (i). Since the point of time $t'$, $p_j$ is the only point that has crossed a wall of $p_i$. It
follows that $\mathrm{weight}(C(p_i(t), 2^m)) < f_i \cdot 2^{-m}$, for any $m < k'_0$, and
$\mathrm{weight}(C(p_i(t), 2^{k'_0})) \le \mathrm{weight}(C(p_i(t'), 2^{k'_0}))$. This implies $k_0 \ge k'_0$.
Since $p_j(t)$ has only crossed one wall of $p_i(t)$, we get
$$\mathrm{weight}(C(p_i(t), 2^{k'_0+1})) \ge \mathrm{weight}(C(p_i(t'), 2^{k'_0})) \ge f_i \cdot 2^{-k'_0} \ge f_i \cdot 2^{-(k'_0+1)} ,$$
where the second inequality is given by the definition of $k'_0$. Thus, we have $k_0 \le k'_0 + 1$.
Overall, we obtain $k'_0 \le k_0 \le k'_0 + 1$ in Case (i).
Case (ii). Due to the fact that $p_j(t)$ is the only point that has crossed a wall of $p_i(t)$ and
$p_j(t)$ enters a cube with center $p_i(t)$, we have $\mathrm{weight}(C(p_i(t), 2^m)) \ge \mathrm{weight}(C(p_i(t'), 2^m))$,
for all possible values of m. Hence, we get $k_0 \le k'_0$.
Recall that $p_j(t)$ crosses the wall $W_{i,\ell}(t)$. If $\ell \ge k'_0 - 1$, then $k_0 \ge k'_0 - 1$ follows obviously.
Now, let us assume that $\ell < k'_0 - 1$ and $k_0 = \ell$. Due to this assumption, we obtain that
$\mathrm{weight}(C(p_i(t), 2^{\ell})) \ge f_i \cdot 2^{-\ell}$. Since $p_j$ is the only point that has crossed a wall of $p_i$, we
also have $\mathrm{weight}(C(p_i(t'), 2^{\ell+1})) \ge f_i \cdot 2^{-\ell} \ge f_i \cdot 2^{-(\ell+1)}$. This implies $k'_0 \le \ell + 1$, which is a
contradiction. Hence, we get $k'_0 - 1 \le k_0 \le k'_0$ in Case (ii).
Considering both cases, we get $k'_0 - 1 \le k_0 \le k'_0 + 1$. Now, the claim follows due to the
definition of the special radii.
The following claims show that the invariant is restored after each call of algorithm
Restore.
Claim 4.3.4. Let $p_h(t)$ be a point that triggered an event E and whose radius or status
changed due to E. Let $\tilde{r}_h(t) = 2^{\tilde{k}}$ be the updated radius of $p_h(t)$. If no point with radius
less than or equal to $2^{\tilde{k}-2}$ violates the invariant before E, then this holds after E as well.
Proof. Due to Claim 4.3.3, the radius of $p_h$ has been at least $2^{\tilde{k}-1}$ before E. While processing
event E, we only change the status of points with radius larger than or equal to
$2^{\tilde{k}-1}$. These status changes cannot affect the invariant at points with radius less than or
equal to $2^{\tilde{k}-2}$. Thus, the assertion follows.
Figure 4.3: The dark gray area indicates the cube $C(m, 2^{\ell})$ in $S_2$ that contains $p_i(t)$ during
running the outer for-loop of algorithm Restore for $k = \ell$. The light gray
area indicates the cube $C(m, 3 \cdot 2^{\ell})$. (a) Arrangement of points that leads to the
desired contradiction in the proof of Case (i) in Claim 4.3.5. (b) Arrangement
of points that leads to the desired contradiction in the proof of Case (i) in
Claim 4.3.6.
Claim 4.3.5. Let $p_h(t)$ be a point that triggered an event E and whose radius or status
changed due to E. Let $\tilde{r}_h(t) = 2^{\tilde{k}}$ be the updated radius of $p_h(t)$. If the invariant is
satisfied before E and no open point with radius less than or equal to $2^{\ell-1}$ violates the
invariant before running the outer for-loop of algorithm Restore for $k = \ell$,
$\tilde{k} - 1 \le \ell \le \lceil\log(r_{\max})\rceil + \lceil\log(4\sqrt{d})\rceil$, then, after running this for-loop, no open point
with radius $2^{\ell}$ violates the invariant.
Proof. The proof is by contradiction. Let us assume that, after running the outer for-loop
of algorithm Restore for $k = \ell$, there is an open point $p_i(t)$ with radius $\tilde{r}_i(t) = 2^{\ell}$ that
has another open point $p_j(t)$ with radius $\tilde{r}_j(t) \le \tilde{r}_i(t)$ in $C(p_i(t), 2 \cdot \tilde{r}_i(t))$. We have to
consider the cases (i) $p_i(t) \in S_2$ and (ii) $p_i(t) \notin S_2$.
Case (i). Subcase $\tilde{r}_j(t) < \tilde{r}_i(t)$: Due to the fact that $\tilde{r}_j(t) < 2^{\ell}$, we have opened $p_j$
before running the outer for-loop for $k = \ell$. It follows that $p_i(t) \in C(m, 2^{\ell})$ and
$p_j(t) \notin C(m, 3 \cdot 2^{\ell})$ for one center m of a considered cubelet (see Figure 4.3 (a)) because otherwise
we either would have closed $p_i(t)$ or would not have opened $p_i(t)$. Thus, we have
$p_j(t) \notin C(p_i(t), 2^{\ell+1}) = C(p_i(t), 2 \cdot \tilde{r}_i(t))$, which is a contradiction to the assumption made above.
Subcase $\tilde{r}_j(t) = \tilde{r}_i(t)$: We have to consider the case that neither $p_i$ nor $p_j$ is opened while
running the outer for-loop for $k = \ell$ and the case that at least one of $p_i$ and $p_j$ is opened
during this for-loop. In the first case, it follows that $p_i$ and $p_j$ must have been open before
running the outer for-loop for $k = \ell$. It follows that both points have been open before
E or one point is $p_h$. Then, either the invariant has been violated before E, which is a
contradiction to the precondition of the claim, or changing the status of $p_h$ violated the
invariant, which means that a rule of the algorithm has been broken. In the second case,
we have opened $p_i$ or $p_j$ or both while running the outer for-loop for $k = \ell$. Without loss
of generality, let us assume that we have opened $p_j$ before we have opened $p_i$. Then, we
must have that $p_i(t) \in C(m, 2^{\ell})$ and $p_j(t) \notin C(m, 3 \cdot 2^{\ell})$ for one center m of a considered
cubelet (see Figure 4.3 (a)). It follows that $p_j(t) \notin C(p_i(t), 2^{\ell+1}) = C(p_i(t), 2 \cdot \tilde{r}_i(t))$, which
is a contradiction to the assumption made above.
Case (ii). Subcase $\tilde{r}_j(t) < \tilde{r}_i(t)$: Due to the fact that $\tilde{r}_j(t) < 2^{\ell}$, we have opened $p_j$
before running the outer for-loop for $k = \ell$. Furthermore, it follows from $p_i(t) \notin S_2$ that
we must have opened $p_i$ before running the outer for-loop for $k = \ell$ as well. Hence, both
$p_i$ and $p_j$ have been open before running this for-loop. Thus, the invariant must have been
violated at point $p_j(t)$ with $\tilde{r}_j(t) \le 2^{\ell-1}$ before running the outer for-loop for $k = \ell$, which
is a contradiction to the precondition of the claim.
Subcase $\tilde{r}_j(t) = \tilde{r}_i(t)$: We can use the same argumentation as in subcase $\tilde{r}_j(t) = \tilde{r}_i(t)$
of Case (i) with the modification that we know that $p_i$ has been opened before running
the outer for-loop for $k = \ell$. The reason is that $p_i(t) \notin S_2$, so we do not change its status
while running this for-loop.
Claim 4.3.6. Let $p_h(t)$ be a point that triggered an event E and whose radius or status
changed due to E. Let $\tilde{r}_h(t) = 2^{\tilde{k}}$ be the updated radius of $p_h(t)$. If the invariant is satisfied
before E and no closed point with radius less than or equal to $2^{\ell-1}$ violates the invariant
before running the outer for-loop of algorithm Restore for $k = \ell$, where
$\tilde{k} - 1 \le \ell \le \lceil\log(r_{\max})\rceil + \lceil\log(4\sqrt{d})\rceil$, then, after running this for-loop, no closed point
with radius $2^{\ell}$ violates the invariant.
Proof. The proof is by contradiction. Let us assume that, after running the outer for-loop
of algorithm Restore for $k = \ell$, there is a closed point $p_i(t)$ with radius $\tilde{r}_i(t) = 2^{\ell}$ that
has no open point with radius less than or equal to $\tilde{r}_i(t)$ in $C(p_i(t), 4 \cdot \tilde{r}_i(t))$. We have to
consider the cases (i) $p_i(t) \in S_2$ and (ii) $p_i(t) \notin S_2$.
Case (i). Due to our construction, we have $p_i(t) \in C(m, 2^{\ell})$ and there is an open point
$p_j(t)$ with radius at most $2^{\ell}$ in $C(m, 3 \cdot 2^{\ell})$ for one center m of a considered cubelet (see
Figure 4.3 (b)) because otherwise we would have opened a point with radius $2^{\ell}$ in $C(m, 2^{\ell})$.
Note that, in case there is no other point with radius at most $2^{\ell}$ in $C(m, 2^{\ell})$ except $p_i(t)$, we
would have opened $p_i$ and $p_j = p_i$. Thus, we have $p_j(t) \in C(p_i(t), 2^{\ell+2}) = C(p_i(t), 4 \cdot \tilde{r}_i(t))$,
which is a contradiction to the assumption made above.
Case (ii). Let $t'$ be any point of time between the occurrence of E and the latest event
before. Then, there was an open point $p_j(t')$ with radius less than or equal to $\tilde{r}_i(t')$ in the
cube $C(p_i(t'), 4 \cdot \tilde{r}_i(t'))$ because otherwise the invariant was violated before E. Since E had
no influence on the radius of $p_i$, we have $\tilde{r}_i(t') = \tilde{r}_i(t) = 2^{\ell}$.
First, let us assume that $p_j = p_h$. Since $p_i(t) \notin S_2 = C(p_h(t), 6 \cdot 2^{\ell+1})$, we have
$p_j(t) \notin C(p_i(t), 6 \cdot 2^{\ell+1})$. From $p_j(t') \in C(p_i(t'), 4 \cdot \tilde{r}_i(t')) = C(p_i(t'), 4 \cdot 2^{\ell})$ and
$p_j(t) \notin C(p_i(t), 6 \cdot 2^{\ell+1})$, it follows that $p_j$ must have crossed the wall $W_{i,\ell+3}(t'')$ at a time $t''$
with $t' < t'' < t$. This implies an event at time $t''$, which is a contradiction to the definition
of $t'$. Thus, we have $p_j \ne p_h$.
Due to $p_i \ne p_h$, $p_j \ne p_h$, and $p_j(t') \in C(p_i(t'), 4 \cdot \tilde{r}_i(t'))$, $p_j(t) \in C(p_i(t), 4 \cdot \tilde{r}_i(t))$ must
also be true. Thus, if $p_i$ violates the invariant after E, then we must have closed $p_j$ during
processing E. We only close points with radius less than or equal to $\tilde{r}_i(t)$ in $S_1$, so we
must have $p_j(t) \in S_1$. Since $p_i(t) \notin S_2$ and $p_j(t) \in S_1$, we get
$p_j(t) \notin C(p_i(t), 2 \cdot 2^{\ell+1}) = C(p_i(t), 4 \cdot \tilde{r}_i(t))$, which is a contradiction.
Now, we can combine the obtained results to get the following lemma:
Lemma 4.3.7. The invariant is satisfied after algorithm KineticFL has handled an
event.
Proof. Due to Lemma 4.3.1 and Claim 4.3.2, the invariant is satisfied as long as we do not
call algorithm Restore. Now, we show by induction that the invariant is also satisfied
after running algorithm Restore.
Let $p_h(t)$ be the point whose radius or status changed due to an event E, and let
$\tilde{r}_h(t) = 2^{\tilde{k}}$ be its updated radius. Due to the precondition given above and Claim 4.3.4,
the assertion is true for all points with radius at most $2^{\tilde{k}-2}$. This proves the base case.
By induction hypothesis, the preconditions of Claims 4.3.5 and 4.3.6 hold for any $\ell$ with
$\tilde{k} - 1 \le \ell \le \lceil\log(r_{\max})\rceil + \lceil\log(4\sqrt{d})\rceil$. This means that the assertion holds for all points
with radius at most $2^{\ell-1}$. It follows from Claims 4.3.5 and 4.3.6 that the assertion also
holds for all points with radius at most $2^{\ell}$, which completes the proof of the lemma.
Due to Lemmas 4.3.1, 4.3.7 and 4.1.16, we get the following result:
Lemma 4.3.8. The KDS for the mobile facility location problem in $\mathbb{R}^d$ maintains at each
point of time t a subset of open facilities $F(t) \subseteq P(t)$ such that we have
$$\mathrm{FacLoc}(P(t), F(t)) < (64d + 1) \cdot \mathrm{FacLoc}(P(t), F^*(t)) .$$
4.3.2 Complexity
In the remainder of this chapter, we analyze our KDS in terms of its compactness, lo-
cality, responsiveness, and efficiency (see Section 2.4.3 for definitions of these attributes).
Lemma 4.2.1 already implies that our KDS is compact and local. Next, we prove that the
requirement for being responsive and efficient is also fulfilled.
Lemma 4.3.9. Each update operation requires $O(\log^{d+1}(n) \cdot \log(nR))$ time and $O(\log(nR))$ status changes.
Proof. Due to Lemma 4.2.1, the time to update the event queue is $O(\log(nR))$. Except for algorithm Restore, all further steps require a constant number of range queries on $T_1$ and $T_2$. Due to Lemma 4.1.8, this requires $O(\log^{d+1}(n))$ time. Next, we examine the time needed for algorithm Restore. We consider the running time for restoring the invariant at points with radius $2^k$. The number of cubelets with radius $2^k$ in $C(p_h(t), 6 \cdot 2^{k+1})$ is $12^d$, where $p_h(t)$ is the point that triggered the event. The query of open or closed points for one cubelet can be answered by one range query on $T_1$ or $T_2$. Due to Lemma 4.1.8, this requires $O(\log^{d+1}(n))$ time. Afterwards, at most one point has to be inserted into and deleted from $T_1$ and $T_2$, which can be done in $O(\log^{d+1}(n))$ time according to Lemma 4.1.8. By summation over all radii, we get a total running time of $O(\log^{d+1}(n) \cdot \log(nR))$.
There can exist at most one open facility with radius $2^k$ in a cubelet with radius $2^k$ because otherwise at least one open facility would violate the invariant. Hence, the number of open facilities with radius $2^k$ that are closed while running algorithm Restore is constant. Furthermore, we open at most one facility in each cubelet, so the number of opened facilities with radius $2^k$ is also constant. Due to the fact that we handle $O(\log(nR))$ radii, there are $O(\log(nR))$ status changes per event.
Since our KDS processes a total number of $O(n^2 \log^2(nR))$ events (see Lemma 4.2.1), the total processing time is bounded by $O(n^2 \log^{d+1}(n) \cdot \log^3(nR))$. To measure the efficiency as defined in Section 2.4.3, we use a result from [46]. In [46], Gao et al. investigated a problem in the KDS framework which is closely related to the mobile facility location problem. In particular, they provided a randomized KDS to maintain a set of centers among moving points in the plane such that, given a specified radius, all the points are covered by balls of the given radius centered at the chosen center points. Gao et al. showed that the size of the center set is at most a constant factor larger than the minimum one. To prove the efficiency of their KDS, they showed that there is a set of $n$ points moving linearly on the real line that forces any $c$-approximate cover to change $\Omega(n^2/c^2)$ times. With some minor modifications, their result can be transferred to the facility location problem.
Lemma 4.3.10. For any constant $c > 1$, there exists a set $P$ of $n$ points moving linearly on the real line such that any $c$-approximate solution to the mobile facility location problem for $P$ undergoes $\Omega(n^2/c^2)$ status changes.
Proof. We assume that $c$ is an integer and $n = 2cm$ with $m \ge 12c^2$ being also an integer. Let $P$ be the set of $n$ moving points which is defined as follows. We partition $P$ into $m$ groups, each containing $2c$ points. Let the $j$-th point in the $i$-th group be denoted by $p_{i,j}$, where $0 \le i < m$ and $0 \le j < 2c$. The initial position of all the points in the $i$-th group is $i \cdot 2m$. Now, we let the point $p_{i,j}$ move with speed $j \cdot 2m$. Let $p_{i,j}(t)$ be the position of $p_{i,j}$ at the point of time $t$. Then, we have
\[ p_{i,j}(t) = (i + jt) \cdot 2m , \]
for 0 ≤ i < m, 0 ≤ j < 2c, and t ≥ 0. Note that, in the time period from 0 to m, the
points often change their ranks on the line. Afterwards, no two points will change their
rank any more.
Let us consider the configuration of $P$ at any point of time $t_1 := k + 3c/m$, for some integer $k < m$. At the point of time $t_1$, the location of the point $p_{i,j}$ is
\[ p_{i,j}(t_1) = \left( i + jk + \frac{3cj}{m} \right) \cdot 2m = 2(i + jk)m + 6cj . \]
Let $p_{i,j}$ and $p_{i',j'}$ be any two distinct points. In case that $i + jk \neq i' + j'k$, the distance between $p_{i,j}$ and $p_{i',j'}$ is
\[ |p_{i,j}(t_1) - p_{i',j'}(t_1)| > 2m - 12c^2 \ge 4c \]
at the point of time $t_1$. In case that $i + jk = i' + j'k$, we have $j' \neq j$ since $p_{i,j}$ and $p_{i',j'}$ are distinct. Then, it follows that the distance between $p_{i,j}$ and $p_{i',j'}$ is
\[ |p_{i,j}(t_1) - p_{i',j'}(t_1)| \ge 6c \]
at the point of time $t_1$. Thus, at the point of time $t_1$, no two points are within distance $4c$ of each other.
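The separation claim can be checked numerically for small parameters. The following is a minimal sketch under stated assumptions: the function names and the concrete values of c and m are illustrative and not part of the original proof. All positions are integers, so the check is exact.

```python
def positions_t1(c, m, k):
    """Positions p_{i,j}(t1) = 2(i + jk)m + 6cj at time t1 = k + 3c/m."""
    return [2 * (i + j * k) * m + 6 * c * j
            for i in range(m) for j in range(2 * c)]


def min_gap(values):
    """Smallest distance between any two points on the line."""
    s = sorted(values)
    return min(b - a for a, b in zip(s, s[1:]))


def distinct_positions_t2(c, m, k):
    """Number of distinct positions (i + jk) * 2m at time t2 = k."""
    return len({(i + j * k) * 2 * m for i in range(m) for j in range(2 * c)})


# Illustrative parameters with m >= 12 c^2 (assumed values, n = 2cm = 192).
c, m = 2, 48
for k in range(m):
    # At time t1, every pair of points is more than 4c apart.
    assert min_gap(positions_t1(c, m, k)) > 4 * c
    # At time t2 = k, the points occupy at most m + 2ck distinct locations.
    assert distinct_positions_t2(c, m, k) <= m + 2 * c * k
```

The second assertion anticipates the counting argument for the points of time $t_2 := k$ used later in the proof.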
Assuming that the opening costs as well as the demands of all the points in $P$ are 1, we next analyze an optimal solution for $P$ at the point of time $t_1$. Since the distance between any two points in $P$ is greater than $4c$ at the point of time $t_1$, the only existing optimal solution is to open a facility at each input point. This leads to a total cost of $n$. It follows that any $c$-approximate solution can have a cost of at most $cn$. Let us now consider an approximate solution in which only a $(1 - \alpha)$-fraction of the input points are open facilities. Then, the cost for this solution is more than $n - \alpha n + 4c \cdot \alpha n$ since the distance between any two points is greater than $4c$ at the point of time $t_1$. To ensure that the cost is at most $cn$, $\alpha$ must be smaller than $(c - 1)/(4c - 1)$. Since $1/4 > (c - 1)/(4c - 1)$ for $c > 1$, we obtain that any $c$-approximate solution must open more than $3n/4$ facilities.
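The bound on $\alpha$ follows from a short calculation, written out here for convenience. Because the actual cost is strictly larger than the left-hand side, the resulting inequality for $\alpha$ is strict.

```latex
\begin{align*}
n - \alpha n + 4c \cdot \alpha n \le cn
  &\iff 1 + \alpha (4c - 1) \le c
   \iff \alpha \le \frac{c - 1}{4c - 1} , \\
\frac{c - 1}{4c - 1} < \frac{1}{4}
  &\iff 4c - 4 < 4c - 1 .
\end{align*}
```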
Next, we consider the configuration of P at any point of time t2 := k, for some integer
k < m. Since pi,j(t2) = (i+ jk) · 2m, where 0 ≤ i < m, 0 ≤ j < 2c, and k < m, each point
is located at a position 2sm for some s ∈ {0, . . . ,m+ 2ck}. It follows that, at the point of
time t2, there exist at most m+ 2ck open facilities in an optimal solution, and the optimal
facility location cost is at most m+ 2ck. Thus, a c-approximate solution may have at most
c(m+ 2ck) open facilities.
Hence, between the points of time $t_1$ and $t_2$, any $c$-approximate solution undergoes at least $3n/4 - c(m + 2ck) = n/4 - 2c^2 k$ status changes. Summing up over all $k \in \{0, \ldots, K - 1\}$, the number of status changes is at least
\[ \sum_{k=0}^{K-1} \left( \frac{n}{4} - 2c^2 k \right) > \frac{Kn}{4} - c^2 K^2 . \]
Setting $K = n/(8c^2) < m$, we have established that the total number of changes is $\Omega(n^2/c^2)$.
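As a sanity check of the summation, the following sketch evaluates the exact sum and the closed-form lower bound for concrete values. The function name and the parameters are illustrative assumptions; with $K = n/(8c^2)$, the closed form equals $n^2/(64c^2)$.

```python
from fractions import Fraction


def status_change_bound(n, c):
    """Return (K, exact sum, closed-form lower bound) for K = n / (8 c^2)."""
    K = n // (8 * c * c)
    # Exact value of sum_{k=0}^{K-1} (n/4 - 2 c^2 k).
    exact = sum(Fraction(n, 4) - 2 * c * c * k for k in range(K))
    # Lower bound K n / 4 - c^2 K^2 used in the proof.
    closed_form = Fraction(K * n, 4) - c * c * K * K
    return K, exact, closed_form


K, exact, closed_form = status_change_bound(192, 2)
assert exact > closed_form
assert closed_form == Fraction(192 ** 2, 64 * 2 ** 2)
```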
Due to Lemma 4.3.10 and the fact that we process a total number of $O(n^2 \log^2(nR))$ events, our KDS has an efficiency value of $O(\log^2(nR))$. Hence, the KDS for the mobile facility location problem is efficient.
We summarize our results in the following theorem:
Theorem 4. Let $P$ be a set of $n$ independently moving points in $\mathbb{R}^d$, where $d$ is a constant dimension. Then, there exists a deterministic KDS for the mobile facility location problem that maintains at any point of time $t$ a set $F(t) \subseteq P(t)$ such that we have
\[ \mathrm{FacLoc}(P(t), F(t)) < (64d + 1) \cdot \mathrm{FacLoc}(P(t), F^*(t)) . \]
Let $R = \max_{p_i \in P} f_i \cdot \max_{p_i \in P} d_i / (\min_{p_i \in P} f_i \cdot \min_{p_i \in P} d_i)$, where $f_i$ and $d_i$ are the opening cost and the demand of a point $p_i$, respectively. Then, the KDS has a space requirement of $O(n(\log^d(n) + \log(nR)))$ and each event requires $O(\log(nR))$ status changes and $O(\log^{d+1}(n) \cdot \log(nR))$ update time. In case that the trajectories can be described by bounded-degree polynomials, the total number of updates is $O(n^2 \log^2(nR))$, which results in a total processing time of $O(n^2 \log^{d+1}(n) \cdot \log^3(nR))$. A flight plan update involves $O(\log(nR))$ certificates and requires $O(\log^2(nR))$ time.
5 Facility Location in Data Streams
This chapter deals with a constant-factor approximation algorithm for the cost of the uniform facility location problem over dynamic geometric data streams in a discrete Euclidean space $\{1, \ldots, \Delta\}^d$, where $d$ is a constant. The starting point of our algorithm is the work of Indyk [64]. It gives the best previous approach for approximating the cost of the uniform facility location problem over dynamic geometric data streams and guarantees an approximation factor of $O(\log^2(\Delta))$. In [64], Indyk defines a certain partition of the space into nested square grids and a set of cells in this partition such that the number of these cells gives an $O(\log(\Delta))$-approximation. During the approximation process to estimate the number of these cells, the algorithm of [64] loses another $O(\log(\Delta))$ factor.
In Section 5.1, we use a similar partition of the space into nested square grids, and we
show that opening a facility in each cell of a subset of the cells defined in [64] leads to a
constant-factor approximation of the facility location cost. Moreover, in Section 5.3, we
propose an algorithm that maintains this cost sufficiently well in the dynamic geometric
data stream model. In this way, we obtain a streaming algorithm for approximating the
cost of the uniform facility location problem that strongly improves the best previous one.
5.1 Definition of a Good Estimator
Let $P := \{p_1, \ldots, p_n\}$ be a set of $n$ points from a discrete Euclidean space $\{1, \ldots, \Delta\}^d$,
where d is a constant. In the streaming context, P will refer to the current point set,
i.e., the set of points obtained after having applied an input sequence of insertions and
deletions.
In this section, we will define a good estimator for the uniform Euclidean facility location
problem (see Section 2.2 for a definition). Before we derive our estimator for the general
case, we show how to deal with some special cases.
5.1.1 Estimator for Special Cases
We consider the following four special cases:
(i) The point set P is empty.
(ii) The point set $P$ is non-empty and contains $O(\lceil f/\Delta \rceil)$ points.
(iii) The opening cost $f$ is at most 1.
(iv) The opening cost $f$ is at least $\Delta^{d+1}$.
In Case (i), there are no points that have to be served by an open facility. Hence, the
facility location cost is obviously 0. Thus, our estimator is 0.
We distinguish two subcases of Case (ii), namely $f/\Delta < 1$ and $f/\Delta \ge 1$. In the first subcase, we have $|P| \ge 1$ and $|P| \in O(1)$. Thus, there exists at least one open facility in an optimal solution, so the optimal facility location cost is at least $f$. If each point opens a facility, then the facility location cost is $f \cdot |P| \in O(f)$. Hence, the optimal facility location cost is $\Theta(f)$. In the second subcase, we have $|P| \ge 1$ and $|P| \in O(f/\Delta)$. Again, there exists at least one open facility in an optimal solution, so the optimal facility location cost is at least $f$. Furthermore, the total connection cost of the points in $P$ is $O(f)$ since the longest pairwise distance in $P$ is upper bounded by $\sqrt{d} \cdot \Delta$ and there are $O(f/\Delta)$ points in $P$. Thus, if we open one facility and connect the remaining points in $P$ to this facility, then the resulting facility location cost is $O(f)$. It follows that the optimal facility location cost is again $\Theta(f)$. Hence, in both subcases of Case (ii), we set our estimator to $f$.
In Case (iii), it is optimal to open a facility at each point in P since the opening cost f
is at most as big as the minimum pairwise distance in P . Hence, our estimator is f · |P |.
Case (iv) is similar to Case (ii). We can assume that $P$ is not empty because otherwise we have Case (i). It follows that there has to be at least one open facility in an optimal solution. Thus, the optimal facility location cost is at least $f$. Furthermore, since the maximum pairwise distance of the points in $P$ is at most $\sqrt{d} \cdot \Delta$ and there are at most $\Delta^d$ points in $P$, the cost to connect all the points in $P$ to the same facility is $O(\Delta^{d+1})$, which is $O(f)$ because $f \ge \Delta^{d+1}$. Thus, the optimal facility location cost is $\Theta(f)$, so we can always safely output $f$ as a constant-factor approximation of the optimal facility location cost.
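The four special cases can be summarized as a small dispatch routine. This is a minimal sketch, assuming the number of points is known (in the streaming setting it would be replaced by an estimate); the function name and the parameter `small_factor`, which stands in for the constant hidden in the $O(\cdot)$ of Case (ii), are hypothetical.

```python
from math import ceil


def special_case_estimate(num_points, f, delta, d, small_factor=1):
    """Return the estimator for Cases (i)-(iv), or None for the general case.

    small_factor is a hypothetical constant for the O(ceil(f / delta))
    threshold of Case (ii); the text does not fix a concrete value.
    """
    if num_points == 0:                                # Case (i)
        return 0
    if f <= 1:                                         # Case (iii)
        return f * num_points
    if f >= delta ** (d + 1):                          # Case (iv)
        return f
    if num_points <= small_factor * ceil(f / delta):   # Case (ii)
        return f
    return None                                        # general case


print(special_case_estimate(num_points=1, f=4, delta=8, d=2))
```

A return value of `None` signals that the general estimator based on the space partition has to be used.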
Distinct Elements Data Structure
To be able to transfer the computation of our estimators to the dynamic geometric data
stream model, we have to be able to compute a good estimator for the size of P . To obtain
this estimator, we use the data structure for counting the number of distinct elements in
a data stream, under insertions and deletions, that has been proposed by Kane et al. [72].
This data structure has the following properties:
Lemma 5.1.1 ([72]). Let $\varepsilon$, $0 < \varepsilon < 1$, be a precision parameter. There is a data structure that computes a $(1 \pm \varepsilon)$-approximation of the number of distinct elements in a data stream under insertions and deletions with probability at least 2/3. The space requirement of the data structure is upper bounded by $O(1/\varepsilon^2 \cdot \log(N) \cdot (\log(1/\varepsilon) + \log(\log(M))))$ bits, where $N$ is the size of the domain of the elements and $M$ is the multiplicity of single elements. The update time of an element is $O(1)$.
Corollary 5.1.2. Let $\varepsilon$, $0 < \varepsilon < 1$, be a precision parameter, and let $\delta$, $0 < \delta < 1$, be an error probability parameter. There is a data structure that computes a $(1 \pm \varepsilon)$-approximation of the number of distinct elements in a data stream under insertions and deletions with probability at least $1 - \delta$. The data structure has a space requirement of $O(1/\varepsilon^2 \cdot \log(N) \cdot (\log(1/\varepsilon) + \log(\log(M))) \cdot \log(1/\delta))$ bits, where $N$ is the size of the domain of the elements and $M$ is the multiplicity of single elements. The update time of an element is $O(\log(1/\delta))$.
Proof. The data structure from Lemma 5.1.1 outputs a $(1 \pm \varepsilon)$-approximation of the number of distinct elements in a data stream under insertions and deletions with an error probability of at most 1/3. This error probability can be reduced by using a standard amplification technique. More precisely, we run $\lceil 75 \ln(1/\delta) \rceil$ copies of the algorithm in parallel and output their median value. For each $j \in \{1, \ldots, \lceil 75 \ln(1/\delta) \rceil\}$, let $Z_j$ be the indicator random variable for the event that the $j$-th run of the algorithm outputs a $(1 \pm \varepsilon)$-approximation of the number of distinct elements. By a Chernoff bound, we get
\[
\Pr\left[ \sum_{j=1}^{\lceil 75 \ln(1/\delta) \rceil} Z_j \le \left( 1 - \frac{1}{5} \right) \cdot \mathrm{E}\left[ \sum_{j=1}^{\lceil 75 \ln(1/\delta) \rceil} Z_j \right] \right]
\le \exp\left( -\frac{1}{2 \cdot 5^2} \cdot \mathrm{E}\left[ \sum_{j=1}^{\lceil 75 \ln(1/\delta) \rceil} Z_j \right] \right)
\le \delta .
\]
Thus, the probability that more than a fraction of 8/15 of the copies compute a $(1 \pm \varepsilon)$-approximation is at least $1 - \delta$. This implies that the median value of the copies is a $(1 \pm \varepsilon)$-approximation with probability at least $1 - \delta$. Now, the assertion follows from Lemma 5.1.1.
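The amplification step can be illustrated as follows. This is a minimal sketch: `make_noisy_counter` is a hypothetical stand-in for the data structure of Lemma 5.1.1 (it is correct only with probability 2/3 and otherwise returns a badly wrong value), and it is not the distinct-elements algorithm of [72].

```python
import math
import random
import statistics


def num_copies(delta):
    """Number of parallel copies used in the proof: ceil(75 ln(1/delta))."""
    return math.ceil(75 * math.log(1 / delta))


def amplified_estimate(base_estimator, delta):
    """Run independent copies and output the median of their answers."""
    return statistics.median(base_estimator() for _ in range(num_copies(delta)))


def make_noisy_counter(true_value, eps, rng):
    """Hypothetical estimator: within (1 +- eps) with probability 2/3."""
    def estimate():
        if rng.random() < 2 / 3:
            return true_value * (1 + rng.uniform(-eps, eps))
        # Failure case: return a value far outside the allowed range.
        return true_value * rng.choice([0.1, 10.0])
    return estimate


rng = random.Random(1)
estimate = amplified_estimate(make_noisy_counter(1000, 0.5, rng), delta=0.01)
print(num_copies(0.01), estimate)
```

The median is robust here because it leaves the allowed range only if at least half of the copies fail, which the Chernoff bound above excludes with probability at least $1 - \delta$.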
Given a stream of insert and delete operations of points from a discrete Euclidean space $\{1, \ldots, \Delta\}^d$ and an error probability parameter $\delta$, we apply the data structure from Corollary 5.1.2 with precision parameter $\varepsilon := 1/2$. Since we assume that the input stream is consistent, i.e., no point is removed which is not present in the current point set and no point is added twice, the multiplicity $M$ is constant. Furthermore, the size $N$ of the domain is $\Delta^d$. Thus, with probability at least $1 - \delta$, we can compute a constant-factor approximation of $|P|$ and, hence, a constant-factor approximation of the facility location cost in the four special cases using $O(d \cdot \log(\Delta) \cdot \log(1/\delta))$ space. An insertion or deletion of a point requires $O(\log(1/\delta))$ time.
5.1.2 Estimator Based on a Space Partition
In the remainder of this chapter, we will always assume that the size of $P$ is $\Omega(f/\Delta)$ and $1 < f < \Delta^{d+1}$. Furthermore, we assume that the value $f$ is a power of 2. Note that, by rounding the opening cost up to the next power of 2, the facility location cost and also our estimator is increased by a factor of at most 2.
In order to deal with the general case, we define a certain partition of the input space
and relate this partition to the cost for the uniform Euclidean facility location problem.
In particular, if we assign to each cell in this partition a weight that corresponds to the
number of points inside the cell multiplied by the side length of the cell, the sum of these
weights is a constant-factor approximation of the cost for the uniform facility location
problem. We will use this fact in Section 5.3 to develop an approximation algorithm in the
dynamic geometric data stream model.
To compute the above-mentioned space partition for $P$, we impose $\lceil \log(\Delta) \rceil + 1$ nested square grids over the point space denoted by $\mathcal{G}(0), \mathcal{G}(1), \ldots, \mathcal{G}(\lceil \log(\Delta) \rceil)$. The side length of each cell in grid $\mathcal{G}(i)$ is $2^i$. We say that the grid cells in $\mathcal{G}(i)$ are in level $i$. The set of neighbors $\Gamma(C)$ of a cell $C \in \mathcal{G}(i)$ is the set of all the cells in grid $\mathcal{G}(i)$ that share some part of their boundary with $C$. Note that all the cells located at the border of some grid have fewer than $3^d - 1$ neighbors. For example, there is only one cell in grid $\mathcal{G}(\lceil \log(\Delta) \rceil)$, and this cell has no neighbors. All remaining cells have exactly $3^d - 1$ neighbors. Furthermore, we need the definition of a parent cell and a subcell. The parent cell of a cell $C \in \mathcal{G}(i)$ in any level $i \in [\lceil \log(\Delta) \rceil]$ is the cell in $\mathcal{G}(i + 1)$ that contains $C$. The subcells of a cell $C \in \mathcal{G}(i)$ in any level $i \in \{1, 2, \ldots, \lceil \log(\Delta) \rceil\}$ are all the cells in $\mathcal{G}(i - 1)$ that are contained in $C$.
In each grid G (i), the active and maximal-useful cells will play a decisive role in the
space partitioning. They are defined as follows:
Definition 5.1.3 (Active Cell). A cell in any level $i \in [\lceil \log(\Delta) \rceil + 1]$ is called active if it contains at least $a(i) := f/2^i$ points of $P$. A grid cell that is not active is inactive.
Observe that if a cell $C$ is active, then all the cells that contain $C$ are active as well.
Definition 5.1.4 (Useful and Maximal-Useful Cell). A cell $C$ in any level $i \in [\lceil \log(\Delta) \rceil + 1]$ is called useful if it neither contains an active subcell nor any of its neighbors $\Gamma(C)$ in grid $\mathcal{G}(i)$ contains an active subcell. A grid cell that is not useful is useless.
A cell in any level $i \in [\lceil \log(\Delta) \rceil]$ is maximal-useful if it is useful but its parent cell is useless. The cell in level $\mathcal{G}(\lceil \log(\Delta) \rceil)$ is maximal-useful if it is useful.
Our space partition consists of all maximal-useful cells. Let $\mathcal{SP}(i)$ be the set of all maximal-useful cells in grid $\mathcal{G}(i)$, and let $\mathcal{SP} := \bigcup_i \mathcal{SP}(i)$ be the set of all maximal-useful cells. The cells in $\mathcal{SP}$ form a partition of the input space. This follows from the fact that we can simply construct $\mathcal{SP}$ in a process similar to that of building a quadtree. In general, a quadtree for a $d$-dimensional point set is a rooted tree in which every node corresponds to a squared cell. Each internal node $v$ has $2^d$ children whose corresponding cells build a partition of $v$. Hence, the cells corresponding to the leaf nodes of the quadtree form a partition of the space, which is called a quadtree partition. Following this definition, we can construct our space partition by starting from the cell in the coarsest grid $\mathcal{G}(\lceil \log(\Delta) \rceil)$ and recursively splitting each useless cell into $2^d$ equal sized, squared subcells. The final space partition consists of only useful cells whose parent cells are useless. Hence, we obtain $\mathcal{SP}$ as desired. An illustration of a space partitioning is given in Figure 5.1.
The key idea is now to place an open facility in each active cell in $\mathcal{SP}$. Figure 5.2 illustrates how this is related to a solution for the uniform facility location problem. We remark that our strategy of choosing the set of open facilities is a refinement of the strategy proposed in [64]. More precisely, the open facilities in [64] are chosen from all active cells in $\bigcup_{i=0}^{\lceil \log(\Delta) \rceil} \mathcal{G}(i)$, whereas we choose the open facilities from a subset of these cells.
Figure 5.1: Example illustrating the quadtree partition for a set of points from $\{1, \ldots, 128\}^2$ and for the opening-cost value $f = 64$. Active cells are colored in gray. Useless cells are indicated by thick borders. Subcells of a cell are indicated by dashed borders. (a)-(c) The quadtree partition for subsequent depths of the recursion. (d) The final quadtree partition and its active cells.
Figure 5.2: (a) The final quadtree partition for a set of points from $\{1, \ldots, 128\}^2$ and for the opening-cost value $f = 64$. Active cells are colored in gray. (b) Solution for the uniform facility location problem whose cost is approximated by the algorithm. The red points are the open facilities. Connections between points are indicated by line segments.
Next, we define a value $FL(P, f)$ that is based on the space partition $\mathcal{SP}$ and yields a constant-factor approximation of the cost of an optimal solution for the uniform Euclidean facility location problem. Let $n_P(C)$ be the number of points in the set $P$ that are contained in the cell $C$. Then, the estimator for the facility location cost is defined as
\[ FL(P, f) := \sum_{i=0}^{\lceil \log(\Delta) \rceil} \sum_{C \in \mathcal{SP}(i)} n_P(C) \cdot 2^i . \tag{5.1} \]
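The definitions above translate into a direct, non-streaming computation of the estimator. The following is a minimal sketch under stated assumptions: cells are addressed by integer index tuples (a point $p$ lies in the level-$i$ cell with index $((p_1 - 1) \gg i, \ldots)$), only non-empty cells are touched, and all function names are illustrative. For each point, the code descends from the coarsest grid and splits while the current cell is useless, which mirrors the recursive construction of the space partition.

```python
from collections import Counter
from itertools import product
from math import ceil, log2


def cell_of(p, i):
    """Index of the level-i cell (side length 2**i) containing point p."""
    return tuple((x - 1) >> i for x in p)


def active_cells(points, f, i):
    """Level-i cells that contain at least a(i) = f / 2**i points."""
    counts = Counter(cell_of(p, i) for p in points)
    return {c for c, n in counts.items() if n * 2 ** i >= f}


def useless_cells(points, f, i):
    """Level-i cells that contain, or have a neighbor containing,
    an active level-(i-1) subcell. Level-0 cells have no subcells."""
    useless = set()
    if i == 0:
        return useless
    for a in active_cells(points, f, i - 1):
        parent = tuple(x >> 1 for x in a)
        # The parent itself and all cells sharing boundary with it.
        for off in product((-1, 0, 1), repeat=len(parent)):
            useless.add(tuple(x + o for x, o in zip(parent, off)))
    return useless


def estimator_fl(points, f, delta):
    """Evaluate Equation (5.1): sum of n_P(C) * 2**i over maximal-useful cells."""
    top = ceil(log2(delta))
    useless = [useless_cells(points, f, i) for i in range(top + 1)]
    total = 0
    for p in points:
        i = top
        while cell_of(p, i) in useless[i]:  # split useless cells
            i -= 1
        total += 2 ** i                     # p lies in a cell of SP(i)
    return total


pts = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8)]
print(estimator_fl(pts, f=4, delta=8))
```

In this small example, every point ends up in a maximal-useful cell of level 1, so each of the five points contributes a weight of 2.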
5.1.3 Properties of the Space Partition
Before we prove that FL(P, f) is indeed an O(1)-approximation of the cost of the uniform
Euclidean facility location problem, we discuss some properties of the space partition that
are needed in the analysis. We say that two cells in a space partition are neighbors if they
share at least one point of their boundary. Furthermore, the distance between two cells
is defined as the minimum distance between two points such that one point lies on the
boundary of one cell and the other point lies on the boundary of the other cell. Now, we
show that the space partition SP has the following properties:
Lemma 5.1.5. The set SP of all maximal-useful cells has the following five properties:
(i) The side length of each cell in SP differs from the side length of each of its neighbors
by a factor of at most 2, i.e., the space partition is balanced.
(ii) Let $i \in [\lceil \log(\Delta) \rceil + 1]$ be any level, and let $C$ be any useless cell in $\mathcal{G}(i)$. Then, there exists an active cell with side length at most $2^{i-1}$ in $\mathcal{SP}$ that has a distance of at most $\sqrt{d} \cdot 2^{i+1}$ from $C$.
(iii) Let $i \in [\lceil \log(\Delta) \rceil + 1]$ be any level, and let $C$ be any inactive cell in $\mathcal{SP}(i)$. Then, there exists an active cell with side length at most $2^i$ in $\mathcal{SP}$ that has a distance of at most $5\sqrt{d} \cdot 2^i$ from $C$.
(iv) Let $i \in [\lceil \log(\Delta) \rceil + 1]$ be any level, and let $C$ be any active cell in $\mathcal{SP}(i)$. Then, we have
\[ \frac{f}{2^i} \le n_P(C) < 2^{d+1} \cdot \frac{f}{2^i} . \]
(v) Let $i \in [\lceil \log(\Delta) \rceil + 1]$ be any level. Then, we have $0 < a(i) < \Delta^{d+1}$, and we have either $a(i) \ge 1$ or $\mathcal{SP}(i)$ contains no non-empty cell.
Proof.
(i) Obviously, there cannot be a cell $C \in \mathcal{SP}(0) \cup \mathcal{SP}(1)$ that has a neighbor in $\mathcal{SP}$ whose side length is less than half the side length of $C$. We prove the assertion for any level $i \in \{2, \ldots, \lceil \log(\Delta) \rceil\}$ by contradiction. Assume that $C_{big}$ is a cell from $\mathcal{SP}(i)$ that has a neighbor cell $C_{small}$ in $\mathcal{SP}(j)$, $j \le i - 2$, i.e., $C_{small}$ is a neighbor cell with side length $2^j \le 2^{i-2}$. This situation is illustrated in Figure 5.3. Let $C'_{small}$ be the parent cell of $C_{small}$. Since $C_{small}$ is maximal-useful, its parent $C'_{small}$ is useless. Hence, $C'_{small}$ or at least one neighbor in $\Gamma(C'_{small})$ has an active subcell (the light gray area in Figure 5.3). This subcell is either contained in $C_{big}$ or one of its neighbors $\Gamma(C_{big})$. Hence, $C_{big}$ is also a useless cell and cannot be a cell in $\mathcal{SP}(i)$, which is a contradiction.
Figure 5.3: Arrangement of cells that leads to the desired contradiction in the proof of the first property stated in Lemma 5.1.5. The cell $C_{small}$ (side length $2^{i-2}$) is indicated by the dark gray square next to the cell $C_{big}$ (side length $2^i$). The area containing an active subcell is colored in light gray.
(ii) The cells in level 0 are all useful per definition since they contain no subcells. To prove the assertion for the remaining levels, we proceed by induction. Let $\ell$ be the smallest level such that $\mathcal{SP}(\ell)$ is not empty. Let $C$ be a useless cell in grid $\mathcal{G}(\ell + 1)$. Since $C$ is useless, either $C$ or one of its neighbors in $\Gamma(C)$ contains an active subcell $A$. By the choice of $\ell$, we know that $A$ is maximal-useful and in $\mathcal{SP}$. Furthermore, $A$ has a side length of $2^{\ell}$ and a distance of at most $\sqrt{d} \cdot 2^{\ell}$ from $C$, which is less than $\sqrt{d} \cdot 2^{\ell+1}$. This proves the base case. Now, let $C$ be a useless cell in grid $\mathcal{G}(i)$. By definition, either $C$ or one of its neighbors in $\Gamma(C)$ contains an active subcell. Let $A$ be such a subcell. The cell $A$ has side length $2^{i-1}$ and a distance of at most $\sqrt{d} \cdot 2^{i-1}$ from $C$. If $A$ is useful, it is maximal-useful and in $\mathcal{SP}$, so we are done. Otherwise, $A$ is useless and in grid $\mathcal{G}(j)$, $j < i$. By induction hypothesis, we have an active cell $A'$ with side length at most $2^{j-1}$ in $\mathcal{SP}$ which has a distance of at most $\sqrt{d} \cdot 2^{j+1} \le \sqrt{d} \cdot 2^i$ from $A$. Since $A$ has a diagonal of length $\sqrt{d} \cdot 2^{i-1}$, we get that the distance from $C$ to $A'$ is at most $2 \cdot \sqrt{d} \cdot 2^{i-1} + \sqrt{d} \cdot 2^i = \sqrt{d} \cdot 2^{i+1}$.
(iii) Since we assume that $|P| \in \Omega(f/\Delta)$, the cell in $\mathcal{G}(\lceil \log(\Delta) \rceil)$ is active. According to this, let $C \in \mathcal{SP}$ be an inactive cell in a level $i \in [\lceil \log(\Delta) \rceil]$. Let $C'$ be the parent cell of $C$. By (ii), there is an active cell with side length at most $2^i$ in $\mathcal{SP}$ that has a distance of at most $\sqrt{d} \cdot 2^{i+2}$ from $C'$. Hence, the distance from $C$ is at most $\sqrt{d} \cdot 2^i + \sqrt{d} \cdot 2^{i+2} = 5\sqrt{d} \cdot 2^i$.
(iv) The first inequality of the assertion follows from our definition of an active cell. Since each cell in level 0 contains at most $2^d$ points from $P$ and $f > 1$, the second inequality is satisfied for each cell in $\mathcal{SP}(0)$. Let $i \in \{1, \ldots, \lceil \log(\Delta) \rceil\}$ be any level and $C$ be any cell in $\mathcal{SP}(i)$. The number of points in $C$ is less than $2^{d+1} \cdot f/2^i$ because each of the $2^d$ subcells of $C$ is inactive, i.e., there are fewer than $f/2^{i-1}$ points inside such a subcell.
(v) Recall that $a(i) = f/2^i$. Since $f > 1$, we have $a(i) > 0$. Furthermore, it follows from $f < \Delta^{d+1}$ and $i \ge 0$ that $a(i) = f/2^i < \Delta^{d+1}$.
Obviously, we get $a(0) = f/2^0 > 1$ since $f > 1$. Thus, in case $i = 0$, we always have $a(i) \ge 1$. For any level $i \in \{1, \ldots, \lceil \log(\Delta) \rceil\}$, the proof is by contradiction. Let us assume that there is a non-empty cell $C \in \mathcal{SP}(i)$ with $a(i) \le 1/2$. Then, we have $a(i - 1) \le 1$, so $C$ contains an active subcell, which is a contradiction to the construction of the space partition $\mathcal{SP}$. Hence, we have $a(i) > 1/2$. In addition, since $a(i) = f/2^i > 1/2$ and we assume that $f$ is a power of 2, we have $a(i) \ge 1$.
5.1.4 Analysis of the Estimator
In this section, we analyze our estimator FL(P, f). We separate the analysis into two parts.
We give an appropriate lower bound in the first part and an appropriate upper bound in
the second part. For this purpose, let FacLoc*(P, f) be the cost of an optimal facility
location solution for P .
Lemma 5.1.6. FL(P, f) ∈ Ω(FacLoc*(P, f)).
Proof. Our goal is to define a set of open facilities such that the induced facility location
cost is O(FL(P, f)). This proves FL(P, f) ∈ Ω(FacLoc*(P, f)). We will show that it
suffices to open one facility in each active cell in SP .
We give an upper bound on the contribution of the points in each cell in $\mathcal{SP}$. For any level $i \in [\lceil \log(\Delta) \rceil + 1]$, each active cell $C \in \mathcal{SP}(i)$ contributes at most $f + n_P(C) \cdot \sqrt{d} \cdot 2^i$ because we open one facility in $C$ and connect the points inside of $C$ to this facility. Since $C$ is active, it contains at least $f/2^i$ points. Thus, we have $f + n_P(C) \cdot \sqrt{d} \cdot 2^i \in O(n_P(C) \cdot 2^i)$. The points in each inactive cell $C$ in $\mathcal{SP}$ are connected to the nearest open facility. Due to Property (iii) in Lemma 5.1.5, for each inactive cell $C \in \mathcal{SP}(i)$, there exists an active cell with side length at most $2^i$ in $\mathcal{SP}$ which has a distance of at most $5\sqrt{d} \cdot 2^i$ from $C$. Thus, the connection cost for the points in $C$ is at most $n_P(C) \cdot (5\sqrt{d} \cdot 2^i + \sqrt{d} \cdot 2^i) \in O(n_P(C) \cdot 2^i)$. Summing up over all cells in $\mathcal{SP}$ gives that the cost of the defined solution is $O(FL(P, f))$.
Lemma 5.1.7. FL(P, f) ∈ O(FacLoc*(P, f)).
Proof. Let $F^*$ be a set of optimal open facilities. Since we assume that $P$ is not empty, the set $F^*$ is not empty and we have $\mathrm{FacLoc}^*(P, f) \in \Omega(f)$. Now, for any level $i \in [\lceil \log(\Delta) \rceil + 1]$, we partition the set $\mathcal{SP}(i)$ into two subsets $\mathcal{SP}_{near}(i)$ and $\mathcal{SP}_{dist}(i)$. The set $\mathcal{SP}_{near}(i)$ contains every cell whose distance to its nearest open facility in $F^*$ is less than $2^{i-1}$, i.e.,
\[ \mathcal{SP}_{near}(i) := \{ C \in \mathcal{SP}(i) \mid \min_{q \in F^*} D(q, C) < 2^{i-1} \} . \]
The set $\mathcal{SP}_{dist}(i)$ contains all remaining cells from $\mathcal{SP}(i)$, i.e.,
\[ \mathcal{SP}_{dist}(i) := \{ C \in \mathcal{SP}(i) \mid \min_{q \in F^*} D(q, C) \ge 2^{i-1} \} . \]
For each cell $C \in \bigcup_{i=0}^{\lceil \log(\Delta) \rceil} \mathcal{SP}_{dist}(i)$, the cost to connect the points inside of $C$ to the nearest open facility in $F^*$ is at least $n_P(C) \cdot 2^{i-1}$. This is exactly half of the cost that we charge for the cell $C$ by the definition of $FL(P, f)$. Thus, the cost that we charge for points contained in $\bigcup_{i=0}^{\lceil \log(\Delta) \rceil} \mathcal{SP}_{dist}(i)$ is upper bounded by twice the optimal connection cost.
Let $C \in \mathcal{SP}(j)$ be a cell in any level $j \in [\lceil \log(\Delta) \rceil + 1]$ that contains an optimal facility $q \in F^*$. Furthermore, let $C' \in \mathcal{SP}(i)$ be any cell in any level $i \in [\lceil \log(\Delta) \rceil + 1]$ such that $C'$ is not a direct neighbor of $C$. Due to Property (i) in Lemma 5.1.5, $\mathcal{SP}$ is a balanced space partition, so the neighbors of $C'$ have a side length of at least $2^{i-1}$. It follows that there is at least one cell with side length at least $2^{i-1}$ between $C'$ and $C$. Thus, $C'$ is not within distance of less than $2^{i-1}$ from $q$. Hence, we have $C' \notin \mathcal{SP}_{near}(i)$. This implies that only direct neighbors of $C$ can be in $\bigcup_{i=0}^{\lceil \log(\Delta) \rceil} \mathcal{SP}_{near}(i)$. Due to the fact that $\mathcal{SP}$ is a balanced space partition, the neighbors of $C$ have a side length of at least $2^{j-1}$. Thus, the number of neighbors of $C$ is at most $4^d - 2^d$. It follows that less than $4^d$ cells in $\bigcup_{i=0}^{\lceil \log(\Delta) \rceil} \mathcal{SP}_{near}(i)$ are within distance less than half of their side length from $q$. Hence, we have
\[ \sum_{i=0}^{\lceil \log(\Delta) \rceil} |\mathcal{SP}_{near}(i)| < 4^d \cdot |F^*| . \]
Now, for each cell in $\mathcal{SP}$, we charge a cost of $O(f)$ by the definition of $FL(P, f)$. This is due to Property (iv) in Lemma 5.1.5, which implies that a cell in $\mathcal{SP}(i)$ contains at most $2^{d+1} \cdot f/2^i$ points. Thus, the cost that arises for all cells in $\bigcup_{i=0}^{\lceil \log(\Delta) \rceil} \mathcal{SP}_{near}(i)$ is $O(f \cdot |F^*|)$, which is at most a constant factor larger than the optimal opening cost.
5.2 Randomized Algorithm
In this section, we describe a randomized algorithm that implements the ideas of Section 5.1
and, with some modifications, can be transformed into a streaming algorithm, which we
will do in Section 5.3.
The approach of the algorithm is closely related to performing the quadtree partition into maximal-useful cells. We try to identify all active cells in the grids. For that purpose, for each level $i \in [\lceil \log(\Delta) \rceil + 1]$, we maintain one random sample set and take each point into this set with probability $\alpha(i) := \min\{1/a(i), 1\}$. Recall that a cell in grid $\mathcal{G}(i)$ is active if it contains at least $a(i) = f/2^i$ points. Thus, in expectation, we will see at least one point in every active cell of grid $\mathcal{G}(i)$. Observe that some sample points will also end up in inactive cells. However, we will show in the analysis that this does not negatively affect our algorithm. We call a cell in grid $\mathcal{G}(i)$ marked if it contains at least one sample point. The key idea is to go through all levels $i \in [\lceil \log(\Delta) \rceil + 1]$ and to open one facility in every marked cell $C$ in grid $\mathcal{G}(i)$ such that the following two conditions are satisfied:
(a) No subcell of $C$ is marked.
(b) No smaller cell within a distance of less than $2^{i-1}$ from $C$ is marked.
The motivation of Condition (b) is that, in our space partition $\mathcal{SP}$, the side lengths of neighbor cells differ at most by a factor of 2. Hence, a marked cell from $\mathcal{SP}$ prevents at most a constant number of other cells from $\mathcal{SP}$ from opening a facility.
Finally, we obtain a new estimator for the cost of the uniform facility location problem based on our randomized algorithm. Let $\mathcal{F}$ denote the set of cells where we open a facility. Then, the estimator is $FL_{rand}(P, f) := f \cdot |\mathcal{F}|$.
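The selection rule can be sketched as follows. This is an illustrative, quadratic-time sketch under stated assumptions, not the streaming implementation: the distance between two cells is taken to be the minimum Euclidean distance between their closed boxes, the sampling decision is passed in as a function `sample(i, p)`, and all names are hypothetical. Under this distance, a marked subcell of $C$ has distance 0 from $C$, so the check for Condition (b) also covers Condition (a).

```python
import random
from math import ceil, log2, sqrt


def cell_of(p, i):
    """Index of the level-i cell (side length 2**i) containing point p."""
    return tuple((x - 1) >> i for x in p)


def box_distance(c1, i1, c2, i2):
    """Minimum Euclidean distance between two closed cells."""
    sq = 0
    for a, b in zip(c1, c2):
        lo1, hi1 = a << i1, (a + 1) << i1
        lo2, hi2 = b << i2, (b + 1) << i2
        gap = max(0, lo1 - hi2, lo2 - hi1)
        sq += gap * gap
    return sqrt(sq)


def fl_rand(points, f, delta, sample):
    """Return f * |F|, where F is the set of cells that open a facility."""
    top = ceil(log2(delta))
    # A level-i cell is marked if it contains a level-i sample point.
    marked = [{cell_of(p, i) for p in points if sample(i, p)}
              for i in range(top + 1)]
    opened = 0
    for i in range(top + 1):
        for c in marked[i]:
            # Conditions (a) and (b): no smaller marked cell nearby.
            blocked = any(box_distance(c, i, s, j) < 2 ** (i - 1)
                          for j in range(i) for s in marked[j])
            if not blocked:
                opened += 1
    return f * opened


def bernoulli_sampler(f, seed=0):
    """Sample p at level i with probability alpha(i) = min(2**i / f, 1).
    Decisions are cached so that repeated queries are reproducible."""
    rng, cache = random.Random(seed), {}

    def sample(i, p):
        if (i, p) not in cache:
            cache[(i, p)] = rng.random() < min(2 ** i / f, 1.0)
        return cache[(i, p)]
    return sample


pts = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8)]
print(fl_rand(pts, f=4, delta=8, sample=bernoulli_sampler(4)))
```

If every point is sampled in every level, only the non-empty level-0 cells open a facility, so the sketch returns $f \cdot |P|$ for a set of distinct points.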
5.2.1 Random Sampling
In each level $i \in [\lceil \log(\Delta) \rceil + 1]$, we would like to sample each point from $P$ independently at random with probability $\alpha(i) = \min\{1/a(i), 1\}$. Since we have insert as well as delete operations of points, the random experiments for the points must be reproducible. Therefore, for each level $i \in [\lceil \log(\Delta) \rceil + 1]$, we use a function $h_i : \{1, \ldots, \Delta\}^d \to \{0, 1\}$ that maps a point to the value 1 with probability $\alpha(i) = \min\{1/a(i), 1\}$. For any point $p \in P \subseteq \{1, \ldots, \Delta\}^d$, if $h_i(p) = 1$, then $p$ is a sample point. Otherwise, $p$ is not a sample point. We can construct a function $h_i(\cdot)$ with the following properties:
Lemma 5.2.1. For each level $i \in [\lceil \log(\Delta) \rceil + 1]$, there is a function $h_i : \{1, \ldots, \Delta\}^d \to \{0, 1\}$ which maps each point $p \in \{1, \ldots, \Delta\}^d$ independently at random to a value in $\{0, 1\}$ such that
\[ \Pr[h_i(p) = 1] = \min\{1/a(i), 1\} . \]
The function $h_i(\cdot)$ uses $O(\Delta^d \cdot \log(\Delta))$ random bits. For each point $p \in \{1, \ldots, \Delta\}^d$, the value of $h_i(p)$ can be computed in $O(\log(\Delta))$ time.
Proof. Let i ∈ [dlog(∆)e+ 1] be any fixed level. In case that α(i) = 1, the assertion of the
lemma is obviously true. Hence, in the following, we will assume that α(i) < 1 and, thus,
a(i) > 1.
Since i ≥ 0 and f < ∆^{d+1}, we have a(i) = f/2^i < ∆^{d+1}. In addition, since f is a power
of 2 with positive exponent and a(i) > 1, the value a(i) is also a power of 2 with positive
exponent. Observe that a(i) can be represented by ℓ := ⌈(d + 1) log(∆)⌉ bits.
Now, for each point p ∈ {1, ..., ∆}^d, we generate a bit vector of length ℓ, where each bit
is chosen independently and uniformly at random. Let
\[ r_i(p) := \left( r_i^{(1)}, r_i^{(2)}, \ldots, r_i^{(\ell)} \right) \]
be the generated bit sequence for p. The function h_i maps p to 1 if r_i^{(j)} = 0 for j < log(a(i))
and r_i^{(log(a(i)))} = 1. For any k ∈ {1, ..., ℓ}, the event that r_i^{(j)} = 0 for j < k and r_i^{(k)} = 1
happens with probability 2^{−k}. Thus, the probability that h_i maps the point p to 1 is
\[ \Pr[h_i(p) = 1] = 2^{-\log(a(i))} = \frac{1}{a(i)} = \alpha(i) . \]
Since we generate ℓ random bits for each point in {1, ..., ∆}^d, the function h_i(·) uses
ℓ · ∆^d ∈ O(∆^d · log(∆)) random bits in total. To compute h_i(p) for any p ∈ {1, ..., ∆}^d,
we have at most one read and compare operation for each bit in r_i(p). Thus, the time to
compute h_i(p) is O(log(∆)).
The issue of full randomness will be discussed in Section 5.3.
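The bit-pattern test from the proof of Lemma 5.2.1 can be illustrated with a short Python sketch. It is a sketch under assumptions rather than the construction analyzed above: instead of a stored table of truly random bits, the bits are derived from SHA-256 applied to a seed, the level, and the point, which only serves to make the experiment reproducible. The names `random_bits`, `h`, and `seed` are hypothetical, and f is assumed to be a power of 2.

```python
import hashlib


def random_bits(seed, level, point, length):
    """Reproducible bit list for (level, point); a stand-in for stored random bits."""
    bits = []
    counter = 0
    while len(bits) < length:
        digest = hashlib.sha256(f"{seed}|{level}|{point}|{counter}".encode()).digest()
        for byte in digest:
            for b in range(8):
                bits.append((byte >> b) & 1)
        counter += 1
    return bits[:length]


def h(seed, level, point, f):
    """Return 1 with probability min{1/a(level), 1}, where a(level) = f / 2^level."""
    if (1 << level) >= f:
        return 1                               # a(level) <= 1: every point is sampled
    k = (f >> level).bit_length() - 1          # k = log2(a(level)), exact for powers of 2
    bits = random_bits(seed, level, point, k)
    # Sampled iff the first k-1 bits are 0 and the k-th bit is 1 (probability 2^-k).
    return 1 if all(b == 0 for b in bits[:k - 1]) and bits[k - 1] == 1 else 0
```

Calling `h` twice with the same arguments yields the same value, which is the property needed to handle a deletion of a point consistently with its earlier insertion.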
5.2 Randomized Algorithm 85
5.2.2 Analysis of the Estimator
We will show that, with high constant probability, our randomized algorithm computes
a facility location cost that is a constant-factor approximation of the estimator FL(P, f).
For any level i ∈ [⌈log(∆)⌉ + 1], let F^{(i)} be the set of marked cells in G^{(i)} that do not have
a marked subcell and that do not have a smaller marked cell within a distance of less than
2^{i−1}. Then, the cells in the set ⋃_{i=0}^{⌈log(∆)⌉} F^{(i)} are exactly the cells in which the algorithm
opens its facilities, i.e., we have F = ⋃_{i=0}^{⌈log(∆)⌉} F^{(i)}. Thus, the estimator of the randomized
algorithm is given by
\[ \mathrm{FL}_{\mathrm{rand}}(P, f) = f \cdot \sum_{i=0}^{\lceil \log(\Delta) \rceil} |F^{(i)}| . \tag{5.2} \]
Next, we derive appropriate lower and upper bounds on the estimator FLrand(P, f).
Lemma 5.2.2. FLrand(P, f) ∈ Ω(FL(P, f)) with probability at least 15/16.
Proof. Let us consider the space partition SP defined in Section 5.1. We are interested in
the number of marked cells from SP. However, f multiplied by the number of marked cells
from SP does not immediately give a lower bound on FLrand(P, f). The reason is that, for
any level i ∈ [⌈log(∆)⌉ + 1], we do not open a facility in a marked cell in SP^{(i)} if there is a
smaller cell within a distance of less than 2^{i−1} which is also marked. Since neighbor cells in SP
differ by a factor of at most 2 in their side lengths, every marked cell in SP can prevent
at most a constant number of other marked cells in SP from opening a facility. Thus, if
we can show that the expected number of marked cells in SP is Ω(FL(P, f)/f), then the
assertion follows.
We say that a point p ∈ P ∩ SP^{(i)} is marked if it is sampled in level i. Let X_p denote
the indicator random variable for the event that p is marked. Then, the expected number
of marked points in any cell C ∈ SP^{(i)} is
\[ \mathrm{E}\Bigl[ \sum_{p \in C} X_p \Bigr] = n_P(C) \cdot \min\Bigl\{ \frac{1}{a(i)}, 1 \Bigr\} . \tag{5.3} \]
By the definition of a(i), we obtain that
\[ \mathrm{E}\Bigl[ \sum_{p \in C} X_p \Bigr] \le \frac{n_P(C)}{a(i)} = \frac{n_P(C) \cdot 2^i}{f} . \]
Due to Property (iv) in Lemma 5.1.5, for every cell C ∈ SP, we get
\[ \mathrm{E}\Bigl[ \sum_{p \in C} X_p \Bigr] < 2^{d+1} . \]
Hence, we can group the cells from SP into sets S_1, ..., S_ℓ such that, for each set S_j with
1 ≤ j < ℓ, we have
\[ 40 \le \sum_{C \in S_j} \sum_{p \in C} \mathrm{E}[X_p] < 40 + 2^{d+1} \tag{5.4} \]
and
\[ \sum_{C \in S_\ell} \sum_{p \in C} \mathrm{E}[X_p] < 40 + 2^{d+1} \tag{5.5} \]
for the set S_ℓ.
Next, we analyze the contribution of the sets S_1, ..., S_ℓ to the estimator FL(P, f). Due
to Property (v) in Lemma 5.1.5, for any cell C ∈ SP^{(i)}, we have either a(i) ≥ 1 or a(i) > 0
and n_P(C) = 0. It follows from Equation (5.3) that
\[ \mathrm{E}\Bigl[ \sum_{p \in C} X_p \Bigr] \ge \frac{n_P(C)}{a(i)} . \]
Due to Inequalities (5.4) and (5.5), the contribution of each S_j, 1 ≤ j ≤ ℓ, to the estimator
FL(P, f) is
\[
\begin{aligned}
\sum_{i=0}^{\lceil \log(\Delta) \rceil} \sum_{C \in S_j \cap SP^{(i)}} n_P(C) \cdot 2^i
&\le \sum_{i=0}^{\lceil \log(\Delta) \rceil} \sum_{C \in S_j \cap SP^{(i)}} 2^i \cdot a(i) \cdot \mathrm{E}\Bigl[ \sum_{p \in C} X_p \Bigr] \\
&= f \cdot \sum_{i=0}^{\lceil \log(\Delta) \rceil} \sum_{C \in S_j \cap SP^{(i)}} \mathrm{E}\Bigl[ \sum_{p \in C} X_p \Bigr] \\
&< f \cdot (40 + 2^{d+1}) \in O(f) .
\end{aligned}
\]
Hence, we have FL(P, f) ∈ O(fℓ). This means that the assertion of the lemma follows if the
number of marked cells in SP is Ω(ℓ). We consider the cases ℓ ≤ 2 and ℓ > 2.
First, we consider the case that ℓ > 2. We define the random variable
\[ Y_j := \sum_{C \in S_j} \sum_{p \in C} X_p . \]
By a Chernoff bound, we obtain
\[ \Pr\Bigl[ Y_j \le \Bigl( 1 - \frac{1}{2} \Bigr) \cdot \mathrm{E}[Y_j] \Bigr] \le \exp\Bigl( -\frac{\mathrm{E}[Y_j]}{8} \Bigr) . \]
This implies that Pr[Y_j ≤ 20] ≤ 1/e^5 for 1 ≤ j < ℓ. Hence, with probability at least
1 − 1/e^5, at least one of the cells in S_j is marked. For 1 ≤ j < ℓ, let Z_j denote the indicator
random variable for the event that no cell in S_j is marked. The expected value of Z_j is at
most 1/e^5. By Markov’s inequality, we get
\[ \Pr\Bigl[ \sum_{j=1}^{\ell-1} Z_j \ge 32 \cdot \mathrm{E}\Bigl[ \sum_{j=1}^{\ell-1} Z_j \Bigr] \Bigr] \le \frac{1}{32} . \]
Thus, we have
\[ \sum_{j=1}^{\ell-1} Z_j < 32 \cdot \mathrm{E}\Bigl[ \sum_{j=1}^{\ell-1} Z_j \Bigr] \le 32 \cdot \frac{\ell - 1}{e^5} \le \frac{\ell}{3} \]
with probability at least 31/32. It follows that, with probability at least 31/32, the number
of marked cells in SP is at least ℓ − 1 − ℓ/3 ∈ Ω(ℓ). Thus, we have FLrand(P, f) ∈ Ω(fℓ),
so FLrand(P, f) ∈ Ω(FL(P, f)).
In case that ℓ ≤ 2, we have FL(P, f) ∈ O(f). Furthermore, we can assume that P
contains at least 32 · ⌈f/∆⌉ points, otherwise we have one of the special cases considered
in Section 5.1.1. It follows that the expected number of marked points in the cell C in grid
G^{(⌈log(∆)⌉)} satisfies
\[
\mathrm{E}\Bigl[ \sum_{p \in C} X_p \Bigr]
\ge \min\Bigl\{ \frac{1}{a(\lceil \log(\Delta) \rceil)}, 1 \Bigr\} \cdot 32 \cdot \Bigl\lceil \frac{f}{\Delta} \Bigr\rceil
= \min\Bigl\{ \frac{2^{\lceil \log(\Delta) \rceil}}{f}, 1 \Bigr\} \cdot 32 \cdot \Bigl\lceil \frac{f}{\Delta} \Bigr\rceil
\ge 32
\]
since ⌈f/∆⌉ ≥ 1. By a Chernoff bound, we get
\[
\Pr\Bigl[ \sum_{p \in C} X_p \le \Bigl( 1 - \frac{1}{2} \Bigr) \cdot \mathrm{E}\Bigl[ \sum_{p \in C} X_p \Bigr] \Bigr]
\le \exp\Bigl( -\frac{\mathrm{E}\bigl[ \sum_{p \in C} X_p \bigr]}{8} \Bigr)
\le \exp\Bigl( -\frac{32}{8} \Bigr)
\le \frac{1}{32} .
\]
Thus, with probability at least 31/32, the cell C in G^{(⌈log(∆)⌉)} is marked. Due to our
construction, we have |F| ≥ 1, so FLrand(P, f) ∈ Ω(f). Hence, we get FLrand(P, f) ∈
Ω(FL(P, f)), which completes the proof.
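The numerical constants used in the proof of Lemma 5.2.2 can be checked with a few lines of Python. The snippet below is only a sanity check of the arithmetic; the variable and function names are chosen here for illustration.

```python
import math

# Chernoff bound for a group S_j with E[Y_j] >= 40: Pr[Y_j <= 20] <= exp(-40/8) = e^-5.
group_failure = math.exp(-40 / 8)

# Markov step: 32 * (l - 1) / e^5 <= l / 3 holds because 32 / e^5 <= 1/3.
markov_factor = 32 * group_failure

# Case l <= 2: a cell with at least 32 expected sample points is unmarked
# with probability at most exp(-32/8) = e^-4 <= 1/32.
top_cell_failure = math.exp(-32 / 8)


def marked_cells_lower_bound(l):
    """Lower bound l - 1 - l/3 on the number of marked cells for l > 2 groups."""
    return l - 1 - l / 3
```

Here `markov_factor` is roughly 0.216, which is below 1/3, and `top_cell_failure` is roughly 0.018, which is below 1/32.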
To prove the upper bound, we first observe that every cell C is either contained in SP or
it can be partitioned into cells from SP (C lies above SP) or it is a subcell of a cell in SP
(C lies below SP). We will first show that the overall expected number of sample points
from cells that lie below SP or that do not lie ‘far above’ SP is O(FL(P, f)/f). Hence,
the overall cost caused by these cells is O(FL(P, f)). Then, we prove that the expected
contribution of cells ‘far above’ SP is also O(FL(P, f)). The latter fact follows because
every such cell C in grid G^{(i)} has a (smaller) active cell from SP within distance 2^{i−1}.
These active cells are typically marked, with the result that the expected contribution of
C is small.
Definition 5.2.3 (Height of a Cell). We say that a cell C in grid G^{(i)} has height k if the
smallest cell in SP that is contained in C has side length 2^{i−k}. If no cell in SP is contained
in C, then we define its height to be −∞.
Lemma 5.2.4. FLrand(P, f) ∈ O(FL(P, f)) with probability at least 15/16.
Proof. Let i ∈ [⌈log(∆)⌉ + 1] be any level, and let X_p denote the indicator random variable
for the event h_i(p) = 1. Furthermore, for a cell C in grid G^{(i)}, let
\[ X_C := \sum_{p \in P \cap C} X_p \]
denote the random variable for the number of sample points in cell C. With this definition,
it follows that, for every cell C in grid G^{(i)}, we have
\[ \mathrm{E}[X_C] = n_P(C) \cdot \min\Bigl\{ \frac{1}{a(i)}, 1 \Bigr\} . \]
By the definition of a(i), we get
\[ \mathrm{E}[X_C] \le \frac{n_P(C)}{a(i)} = \frac{n_P(C) \cdot 2^i}{f} . \]
For any k ∈ N_0 and any level i ∈ {k, ..., ⌈log(∆)⌉}, let us now consider an arbitrary
cell C in grid G^{(i)} with height k. The cell C can be partitioned into cells C_1, ..., C_ℓ from
SP that differ in their side lengths by a factor of at most 2^k. Since α(i) ≤ 2^k · α(i − k),
we have
\[ \mathrm{E}[X_C] \le 2^k \cdot \mathrm{E}\Bigl[ \sum_{j=1}^{\ell} X_{C_j} \Bigr] . \]
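Observe that cells from the same grid cannot overlap and two cells from different grids
only overlap if the smaller cell is completely contained in the bigger cell. Thus, due to
the definition of height, the cells of height k do not overlap. Due to linearity of
expectation, it follows that
\[
\begin{aligned}
\mathrm{E}\Bigl[ \sum_{\text{cells } C \text{ of height } k \text{ with } k \in \mathbb{N}_0} X_C \Bigr]
&\le 2^k \cdot \mathrm{E}\Bigl[ \sum_{i=0}^{\lceil \log(\Delta) \rceil} \sum_{C \in SP^{(i)}} X_C \Bigr] \\
&\le 2^k \cdot \sum_{i=0}^{\lceil \log(\Delta) \rceil} \sum_{C \in SP^{(i)}} \frac{n_P(C) \cdot 2^i}{f} \\
&\le 2^k \cdot \frac{\mathrm{FL}(P, f)}{f} .
\end{aligned}
\]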
Hence, for k* := ⌈log(10√d)⌉, the expected number of sample points in cells with a non-negative
height of at most k* is less than 10√d · FL(P, f)/f. Next, we consider the expected
number of sample points in cells with a negative height. For any level i ∈ [⌈log(∆)⌉ + 1],
let C′ ∈ SP^{(i)} be any cell with height 0. Then, the expected number of sample points in
all the cells that are below SP and that are contained in C′ is
\[
\mathrm{E}\Bigl[ \sum_{j=0}^{i-1} \sum_{C \in G^{(j)} : C \subset C'} X_C \Bigr]
\le \sum_{j=0}^{i-1} \sum_{C \in G^{(j)} : C \subset C'} \frac{n_P(C) \cdot 2^j}{f}
= \sum_{j=0}^{i-1} \frac{n_P(C') \cdot 2^{i-1-j}}{f}
\le \mathrm{E}[X_{C'}] .
\]
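Summing up over all cells in SP, we obtain that the expected number of sample points in
cells below SP is at most
\[
\mathrm{E}\Bigl[ \sum_{\text{cells } C \text{ of height } -\infty} X_C \Bigr]
\le \sum_{i=0}^{\lceil \log(\Delta) \rceil} \sum_{C \in SP^{(i)}} \mathrm{E}[X_C]
\le \sum_{i=0}^{\lceil \log(\Delta) \rceil} \sum_{C \in SP^{(i)}} \frac{n_P(C) \cdot 2^i}{f}
= \frac{\mathrm{FL}(P, f)}{f} .
\]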
Thus, the expected number of sample points in cells with height at most k* is less than
11√d · FL(P, f)/f. By Markov’s inequality, we obtain
\[
\Pr\Bigl[ \sum_{\text{cells } C \text{ of height at most } k^*} X_C \ge 32 \cdot \mathrm{E}\Bigl[ \sum_{\text{cells } C \text{ of height at most } k^*} X_C \Bigr] \Bigr] \le \frac{1}{32} .
\]
Hence, with probability at least 31/32, the opening cost for facilities in cells with height
at most k* is less than f · 352√d · FL(P, f)/f ∈ O(FL(P, f)).
Now, for any level i ∈ {k* + 1, ..., ⌈log(∆)⌉}, let us consider an arbitrary cell C in grid
G^{(i)} with height greater than k*. By the definition of height and the value of k*, C contains
a subcell from SP with side length less than 2^{i−k*} ≤ 2^i/(10√d). Due to Property (iii) in
Lemma 5.1.5, we know that, for any level j ∈ [⌈log(∆)⌉ + 1], every cell in SP^{(j)} has an
active cell with side length at most 2^j in SP within a distance of at most 5√d · 2^j. We
conclude that there is an active cell with side length less than 2^i/(10√d) in SP within a
distance of less than 2^{i−1} from C. Now, observe that every parent cell of an active cell is
active and contains the cell. Hence, there is a cell in grid G^{(i−1)} within a distance of less
than 2^{i−1} from C that is active. To simplify further descriptions, we will rephrase this as
follows. Let the level-j-neighborhood of C be the set of all the cells in G^{(j)} that share some
part of their boundary or interior with C. Then, every cell in grid G^{(i)} with height at least
k* is active and contains an active cell in SP or has a cell in its level-(i − 1)-neighborhood
that is active and contains an active cell in SP.
Now, we proceed as follows. For each active cell A_i in SP^{(i)}, we consider the cell itself
and all cells that contain it. For each such cell A_j in grid G^{(j)}, j ∈ {i, ..., ⌈log(∆)⌉}, we
assume that all the 2^d cells in the level-(j + 1)-neighborhood of A_j contain an open facility
if and only if A_j is not marked. Thus, the expected contribution of A_i and all the cells
which belong to a level-j-neighborhood of A_i with j ∈ {i, ..., ⌈log(∆)⌉} is at most
\[ f + 2^d \cdot f \cdot \sum_{j=i}^{\lceil \log(\Delta) \rceil - 1} \Pr[A_j \text{ is not marked}] . \]
Each point in A_j is not sampled with probability at most 1 − min{1/a(j), 1}. Hence, if
a(j) ≤ 1, we have Pr[A_j is not marked] = 0. Otherwise, since n_P(A_j) ≥ n_P(A_i) ≥ a(i),
we obtain
\[
\Pr[A_j \text{ is not marked}]
\le \Bigl( 1 - \frac{1}{a(j)} \Bigr)^{n_P(A_i)}
\le \exp\Bigl( -\frac{n_P(A_i)}{a(j)} \Bigr)
\le \exp\Bigl( -\frac{2^{j-i} \cdot n_P(A_i)}{a(i)} \Bigr)
\le \exp(-(j - i)) ,
\]
where the second inequality is due to a bound on Euler’s number (see Inequality (B.2)).
It follows that
\[
\sum_{j=i}^{\lceil \log(\Delta) \rceil - 1} \Pr[A_j \text{ is not marked}]
\le \sum_{j=i}^{\lceil \log(\Delta) \rceil - 1} e^{-(j-i)}
\le \sum_{j=0}^{\infty} e^{-j}
= \frac{e}{e - 1} .
\]
Thus, the expected contribution of A_i and all the cells which belong to a level-j-neighborhood
of A_i with j ∈ {i, ..., ⌈log(∆)⌉} is O(f). Observe that each cell with height
greater than k* is a cell in such a neighborhood of an active cell since it contains itself an
active cell from SP or one of its neighbors contains an active cell from SP. It follows that
the expected contribution from cells with height greater than k* is O(f) times the number
of active cells in SP. Since a cell in any level i ∈ [⌈log(∆)⌉ + 1] must contain at least a(i)
points to be active, the number of active cells in SP is
\[
\sum_{i=0}^{\lceil \log(\Delta) \rceil} \sum_{C \in SP^{(i)}} \min\Bigl\{ \Bigl\lfloor \frac{n_P(C)}{a(i)} \Bigr\rfloor, 1 \Bigr\}
\le \sum_{i=0}^{\lceil \log(\Delta) \rceil} \sum_{C \in SP^{(i)}} \frac{n_P(C)}{a(i)}
= \frac{1}{f} \cdot \sum_{i=0}^{\lceil \log(\Delta) \rceil} \sum_{C \in SP^{(i)}} n_P(C) \cdot 2^i
\le \frac{\mathrm{FL}(P, f)}{f} .
\]
Thus, the expected opening cost for facilities in cells with height greater than k∗ is
O(FL(P, f)). By Markov’s inequality, the opening cost for facilities in cells with height
greater than k∗ is less than 32 times its expected value, which is also O(FL(P, f)), with
probability at least 31/32. Together with the first part of the proof, we obtain that the
total opening cost for facilities is O(FL(P, f)) with probability at least 15/16.
5.3 Streaming Algorithm
In this section, we describe how our randomized algorithm can be transferred to the dynamic
geometric data stream model. For each level i ∈ [⌈log(∆)⌉ + 1], let M^{(i)} be the
subset of marked cells in G^{(i)}, and let U^{(i)} be the subset of cells in G^{(i)} that have a
cell contained in the set ⋃_{j=0}^{i−1} M^{(j)} within a distance of less than 2^{i−1}. Thus, we have
FLrand(P, f) = f · Σ_{i=0}^{⌈log(∆)⌉} |M^{(i)} \ U^{(i)}|. Recall that, in the streaming context, P refers to
the current point set, i.e., the set of points obtained after having applied an input sequence
of insertions and deletions.
The difficulty is to maintain, for each level i ∈ [⌈log(∆)⌉ + 1], a good estimator for the
value |M^{(i)} \ U^{(i)}| in the streaming model. We use a technique similar to the one described in [64]
to solve this problem. In particular, we use two data structures that both maintain the
number of distinct elements in a stream under insertions and deletions. The first data
structure, called DE_1(i), is supposed to maintain a good estimator for the value |M^{(i)} ∪ U^{(i)}|.
The second data structure, called DE_2(i), is supposed to maintain a good estimator for the
value |U^{(i)}|. We can show that the difference of these two estimators is a good estimator
for the desired value |M^{(i)} \ U^{(i)}|. Next, we explain this method in more detail.
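The reason why the difference of the two counts yields the desired value is the set identity |M ∪ U| − |U| = |M \ U|, which holds because U is a subset of M ∪ U. The following minimal Python check illustrates this with exact sets; the cell names and the function `open_cell_count` are made up for the example.

```python
def open_cell_count(marked, blocked):
    """Number of marked cells that are not blocked, computed as |M ∪ U| - |U|."""
    return len(marked | blocked) - len(blocked)


M = {"c1", "c2", "c3"}   # marked cells of one level (hypothetical names)
U = {"c3", "c4"}         # cells with a smaller marked cell nearby
```

In this example, `open_cell_count(M, U)` evaluates to 2, which equals the size of M \ U = {c1, c2}.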
For any point p ∈ {1, ..., ∆}^d and any level i ∈ [⌈log(∆)⌉ + 1], the cell in level i that
contains the point p is denoted by C_p(i), and the set of neighbor cells of C_p(i) in level i is
denoted by Γ(C_p(i)). Furthermore, let h_i : {1, ..., ∆}^d → {0, 1} be the random function
introduced in Section 5.2.1. Then, our implementation of an insert operation of p is as
follows (see also Algorithm 5.3.2). If h_k(p) = 0 for each index k ∈ [⌈log(∆)⌉ + 1], then p
is not a sample point and we do nothing with p. Otherwise, let k ∈ [⌈log(∆)⌉ + 1] be the
smallest index with h_k(p) = 1. Thus, p is sampled in level k, so C_p(k) is a marked cell.
We insert C_p(k) in DE_1(k) and, for each j ∈ {k + 1, k + 2, ..., ⌈log(∆)⌉}, we insert all cells
in G^{(j)} such that C_p(k) is within a distance of less than 2^{j−1} in both DE_1(j) and DE_2(j).
A deletion of p is implemented analogously (see Algorithm 5.3.3). After having processed
the whole input stream according to this, we compute an estimator for the facility location
cost. For each level i ∈ [⌈log(∆)⌉ + 1], let |DE_1(i)| and |DE_2(i)| be the number of distinct
elements in DE_1(i) and DE_2(i), respectively. Then, our estimator for the optimal facility
location cost in the dynamic geometric data stream model is
\[ \mathrm{FL}_{\mathrm{stream}}(P, f) := f \cdot \sum_{i=0}^{\lceil \log(\Delta) \rceil} \bigl( |DE_1(i)| - |DE_2(i)| \bigr) . \tag{5.6} \]
A description of our algorithm in pseudocode is given by Algorithms 5.3.1, 5.3.2, and 5.3.3.
5.3.1 Analysis of the Estimator
In this section, we show that our streaming algorithm outputs a constant-factor approx-
imation of the optimal facility location cost, has polylogarithmic update time, and uses
polylogarithmic memory space. For that purpose, we analyze the quality and complexity
of the distinct elements data structures and the random sampling technique.
Distinct Elements Data Structures
It follows from the analysis in Sections 5.1 and 5.2 that FLrand(P, f) ∈ Ω(FacLoc*(P, f))
and FLrand(P, f) ∈ O(FacLoc*(P, f)) is true with probability at least 7/8. Thus, with probability
at least 7/8, we have that 1/c · FacLoc*(P, f) ≤ FLrand(P, f) ≤ c · FacLoc*(P, f) for an
appropriately chosen constant c. Next, we analyze how much the estimator FLstream(P, f)
might differ from FLrand(P, f). For this purpose, let us assume that we have one fixed random
function h_i : {1, ..., ∆}^d → {0, 1} for each level i ∈ [⌈log(∆)⌉ + 1] that is used by both
our randomized algorithm and our streaming algorithm. We will show that the difference
between FLstream(P, f) and FLrand(P, f), which is caused by using DE data structures to
maintain the number of distinct elements in a data stream under insertions and deletions,
can be upper bounded by 1/(2c) · FacLoc*(P, f) with probability greater than 7/8. Then,
we have 1/(2c) · FacLoc*(P, f) ≤ FLstream(P, f) ≤ (2c^2 + 1)/(2c) · FacLoc*(P, f) with high
constant probability, which implies FLstream(P, f) ∈ Θ(FacLoc*(P, f)) with high constant
probability.
Algorithm 5.3.1 FacLocCost(f, ∆)
for i ← 0 to ⌈log(∆)⌉ do
    create random function h_i(·)
    initialize empty data structures DE_1(i) and DE_2(i)
for each pair (p, operation) in the stream do
    if operation = insert then
        Insert(p)
    if operation = delete then
        Delete(p)
z ← 0
for i ← 0 to ⌈log(∆)⌉ do
    z ← z + |DE_1(i)| − |DE_2(i)|
return f · z
Algorithm 5.3.2 Insert(p)
sampled ← false
i ← 0
while sampled = false and i ≤ ⌈log(∆)⌉ do
    if h_i(p) = 1 then
        sampled ← true
        insert C_p(i) in DE_1(i)
        for j ← i + 1 to ⌈log(∆)⌉ do
            insert all cells from G^{(j)} that contain a cell from Γ(C_p(j − 1)) in DE_1(j) and DE_2(j)
    i ← i + 1
We use the technique proposed by Kane et al. [72] to realize the DE data structures.
Then, due to Corollary 5.1.2, for a precision parameter ε and an error probability parameter
δ, which we will both specify later, each of our DE data structures computes a (1 ± ε)-
approximation of the number of distinct elements in a data stream under insertions and
deletions with probability at least 1− δ.
Algorithm 5.3.3 Delete(p)
sampled ← false
i ← 0
while sampled = false and i ≤ ⌈log(∆)⌉ do
    if h_i(p) = 1 then
        sampled ← true
        delete C_p(i) from DE_1(i)
        for j ← i + 1 to ⌈log(∆)⌉ do
            delete all cells from G^{(j)} that contain a cell from Γ(C_p(j − 1)) from DE_1(j) and DE_2(j)
    i ← i + 1
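To make the interplay of Algorithms 5.3.1, 5.3.2, and 5.3.3 concrete, the following Python sketch mirrors their structure. It is a simplified illustration under several assumptions, not the streaming implementation analyzed in this chapter: the DE data structures are replaced by exact counters (class `ExactDistinct`), which do not have polylogarithmic size; points have zero-based coordinates in an unshifted grid; Γ(C) is taken to consist of C and all cells touching it; and the callback `sample(i, p)` is a hypothetical stand-in for h_i(p).

```python
from collections import Counter
from itertools import product


class ExactDistinct:
    """Exact stand-in for a distinct-elements structure under insertions and deletions."""

    def __init__(self):
        self.counts = Counter()

    def insert(self, item):
        self.counts[item] += 1

    def delete(self, item):
        self.counts[item] -= 1
        if self.counts[item] == 0:
            del self.counts[item]

    def distinct(self):
        return len(self.counts)


def cell_of(point, level):
    """Cell of side length 2^level containing the point (assumed zero-based coordinates)."""
    return tuple(x >> level for x in point)


def neighbors(cell):
    """Assumed neighborhood: the cell itself and all cells touching it."""
    return [tuple(c + o for c, o in zip(cell, offs))
            for offs in product((-1, 0, 1), repeat=len(cell))]


class FacLocCost:
    def __init__(self, f, levels, sample):
        self.f = f
        self.levels = levels            # number of levels, i.e., ceil(log(Delta)) + 1
        self.sample = sample            # sample(i, p) in {0, 1}, stand-in for h_i(p)
        self.de1 = [ExactDistinct() for _ in range(levels)]
        self.de2 = [ExactDistinct() for _ in range(levels)]

    def _update(self, p, op):
        for i in range(self.levels):
            if self.sample(i, p) == 1:  # smallest level in which p is sampled
                getattr(self.de1[i], op)(cell_of(p, i))
                for j in range(i + 1, self.levels):
                    # Cells of level j that contain a cell from the neighborhood of C_p(j - 1).
                    parents = {tuple(c >> 1 for c in nb)
                               for nb in neighbors(cell_of(p, j - 1))}
                    for cell in parents:
                        getattr(self.de1[j], op)(cell)
                        getattr(self.de2[j], op)(cell)
                break

    def insert(self, p):
        self._update(p, "insert")

    def delete(self, p):
        self._update(p, "delete")

    def estimate(self):
        return self.f * sum(d1.distinct() - d2.distinct()
                            for d1, d2 in zip(self.de1, self.de2))
```

With exact counters, inserting a point and deleting it again restores the previous estimate, which reflects why the sampling decisions have to be reproducible.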
We will show that the difference between FLstream(P, f) and FLrand(P, f) depends on the
value Σ_{i=0}^{⌈log(∆)⌉} f · |M^{(i)}|. For that reason, we next give an appropriate upper bound on
Σ_{i=0}^{⌈log(∆)⌉} f · |M^{(i)}|.
Lemma 5.3.1. If FLrand(P, f) ≤ c · FacLoc*(P, f), then we have
\[ \sum_{i=0}^{\lceil \log(\Delta) \rceil} f \cdot |M^{(i)}| \le c \cdot 2^d \cdot (\log(\Delta) + 2) \cdot \mathrm{FacLoc}^{*}(P, f) . \]
Proof. We open one facility in each marked cell in G^{(0)}. Thus, we have
\[ f \cdot |M^{(0)}| = f \cdot |F^{(0)}| \le \mathrm{FL}_{\mathrm{rand}}(P, f) \le c \cdot \mathrm{FacLoc}^{*}(P, f) . \]
For any level i ∈ [⌈log(∆)⌉ + 1], we open a facility in each cell in M^{(i)} such that no
subcell of this cell is marked and no smaller cell within a distance of less than 2^{i−1} is
marked. Let us consider any cell C in level i − 1 that is a marked cell or contains a marked
cell. Let C′ be this marked cell. There are 2^d cells in G^{(i)} such that C is within a distance
of less than 2^{i−1} from these cells, namely the set of cells in the level-i-neighborhood of C
(see Figure 5.4). Recall that the level-i-neighborhood of C is the set of all the cells in level
i that share some part of their boundary or interior with C. It follows that C′ prevents
at most 2^d cells in M^{(i)} from opening a facility. Now, either C′ contains an open facility
or there exists a smaller marked cell C′′ that prevents C′ from opening a facility. Since C′′
has to be within a distance of less than 2^{i−2} from C′ ⊆ C, it is located in the level-(i − 2)-neighborhood
of C. We can recursively apply this argument until we have found the first
marked cell that is not prevented by any smaller marked cell from opening a facility. Note
that this happens in level 0 at the latest. Since Σ_{j=0}^{i−2} 2^j ≤ 2^{i−1}, these marked cells are all
located in the level-(i − 1)-neighborhood of C and, thus, also in the level-i-neighborhood
of C. Hence, for a fraction of at least 1/2^d cells in M^{(i)}, there exists at least one cell in
⋃_{j=0}^{i} F^{(j)}. Thus, we have
\[ f \cdot |M^{(i)}| \le f \cdot 2^d \cdot \sum_{j=0}^{i} |F^{(j)}| \le 2^d \cdot \mathrm{FL}_{\mathrm{rand}}(P, f) \le 2^d \cdot c \cdot \mathrm{FacLoc}^{*}(P, f) . \]
Figure 5.4: Illustration of the area of influence of a marked cell. The cell C ∈ G^{(i−1)} is a
marked cell or contains a marked cell. The area of influence in grid G^{(i)}, i.e.,
the subset of cells in G^{(i)} that are within a distance of less than 2^{i−1} from C, is
colored in light gray. The white cells in grid G^{(i)} are outside of the influence
of the marked cell in C.
Now, the lemma follows from the fact that there are at most log(∆) + 2 levels.
Finally, we can upper bound the difference between FLstream(P, f) and FLrand(P, f).
Lemma 5.3.2. If FLrand(P, f) ≤ c · FacLoc*(P, f) and if we run each DE data structure to
maintain a (1 ± ε)-approximation of the number of distinct elements in data streams under
insertions and deletions with a precision parameter ε, 0 < ε ≤ 1/(8c^2 · 2^{2d} · (log(∆) + 2)^2),
and an error probability δ, 0 < δ < 1/(16(log(∆) + 2)), then
\[ |\mathrm{FL}_{\mathrm{stream}}(P, f) - \mathrm{FL}_{\mathrm{rand}}(P, f)| < \frac{1}{2c} \cdot \mathrm{FacLoc}^{*}(P, f) \]
with probability greater than 7/8.
Proof. Since we use two DE data structures per level and there are at most (log(∆) + 2)
levels, we use at most 2(log(∆) + 2) DE data structures in total. By the union bound, the
probability that at least one of these DE data structures fails is less than 1/8. Hence, the
probability that each DE data structure maintains a (1± ε)-approximation is greater than
7/8.
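Written out, the union bound used in this step reads as follows, where the strict inequality uses the assumption δ < 1/(16(log(∆) + 2)) from the statement of the lemma:

```latex
\[
  \Pr[\text{at least one DE data structure fails}]
  \;\le\; 2(\log(\Delta) + 2) \cdot \delta
  \;<\; \frac{2(\log(\Delta) + 2)}{16(\log(\Delta) + 2)}
  \;=\; \frac{1}{8} .
\]
```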
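Since we run each DE data structure with a precision parameter ε, we can upper bound
the difference between FLstream(P, f) and FLrand(P, f) by
\[ |\mathrm{FL}_{\mathrm{stream}}(P, f) - \mathrm{FL}_{\mathrm{rand}}(P, f)| \le \varepsilon \cdot f \cdot \sum_{i=0}^{\lceil \log(\Delta) \rceil} \bigl( |M^{(i)}| + 2 \cdot |U^{(i)}| \bigr) . \]
Due to Lemma 5.3.1 and ε ≤ 1/(8c^2 · 2^{2d} · (log(∆) + 2)^2) < 1/(4c^2 · 2^d · (log(∆) + 2)), we
have
\[
\varepsilon \cdot f \cdot \sum_{i=0}^{\lceil \log(\Delta) \rceil} |M^{(i)}|
\le \varepsilon \cdot c \cdot 2^d \cdot (\log(\Delta) + 2) \cdot \mathrm{FacLoc}^{*}(P, f)
< \frac{1}{4c} \cdot \mathrm{FacLoc}^{*}(P, f) .
\]
Next, we upper bound the value ε · f · Σ_{i=0}^{⌈log(∆)⌉} |U^{(i)}|. Observe that the set U^{(0)} is empty.
Furthermore, for any cell C ∈ ⋃_{j=0}^{i−1} M^{(j)}, there are at most 2^d cells in G^{(i)} that are within
a distance of less than 2^{i−1} from C. Thus, there exists at least one cell in ⋃_{j=0}^{i−1} M^{(j)} for a
fraction of at least 1/2^d cells in U^{(i)}. Hence, we have
\[ |U^{(i)}| \le 2^d \cdot \sum_{j=0}^{i-1} |M^{(j)}| \le 2^d \cdot \sum_{j=0}^{\lceil \log(\Delta) \rceil} |M^{(j)}| . \]
Summation over all levels results in
\[ \sum_{i=0}^{\lceil \log(\Delta) \rceil} |U^{(i)}| \le 2^d \cdot (\log(\Delta) + 2) \cdot \sum_{i=0}^{\lceil \log(\Delta) \rceil} |M^{(i)}| . \]
Due to Lemma 5.3.1 and ε ≤ 1/(8c^2 · 2^{2d} · (log(∆) + 2)^2), we get
\[
\begin{aligned}
\varepsilon \cdot f \cdot \sum_{i=0}^{\lceil \log(\Delta) \rceil} 2 \cdot |U^{(i)}|
&\le \varepsilon \cdot f \cdot 2 \cdot 2^d \cdot (\log(\Delta) + 2) \cdot \sum_{i=0}^{\lceil \log(\Delta) \rceil} |M^{(i)}| \\
&\le \varepsilon \cdot 2c \cdot 2^{2d} \cdot (\log(\Delta) + 2)^2 \cdot \mathrm{FacLoc}^{*}(P, f) \\
&\le \frac{1}{4c} \cdot \mathrm{FacLoc}^{*}(P, f) .
\end{aligned}
\]
Thus, we obtain
\[
|\mathrm{FL}_{\mathrm{stream}}(P, f) - \mathrm{FL}_{\mathrm{rand}}(P, f)|
\le \varepsilon \cdot f \cdot \sum_{i=0}^{\lceil \log(\Delta) \rceil} \bigl( |M^{(i)}| + 2 \cdot |U^{(i)}| \bigr)
< \frac{1}{2c} \cdot \mathrm{FacLoc}^{*}(P, f)
\]
with probability greater than 7/8.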
We summarize our results achieved so far in the following lemma. Note that Lemma 5.3.3
considers only the space requirement and update time of the DE data structures. The space
and time needed to do the random sampling will be analyzed later.
Lemma 5.3.3. If we run each DE data structure to maintain a (1 ± ε)-approximation of the
number of distinct elements in data streams under insertions and deletions with a precision
parameter ε := 1/(8c^2 · 2^{2d} · (log(∆) + 2)^2) and an error probability δ := 1/(17(log(∆) + 2)),
then FLstream(P, f) ∈ Θ(FacLoc*(P, f)) with probability greater than 3/4. The DE data
structures require O(log^6(∆) · (log(log(∆)))^2) bits of space and O(log(∆) · log(log(∆)))
update time.
Proof. Due to Lemmas 5.1.6, 5.1.7, 5.2.2, and 5.2.4, we have
\[ \frac{1}{c} \cdot \mathrm{FacLoc}^{*}(P, f) \le \mathrm{FL}_{\mathrm{rand}}(P, f) \le c \cdot \mathrm{FacLoc}^{*}(P, f) \]
for an appropriately chosen constant c ≥ 1 with probability at least 7/8. If this is the case
and each DE data structure is run with a precision parameter ε = 1/(8c^2 · 2^{2d} · (log(∆) + 2)^2)
and an error probability δ = 1/(17(log(∆) + 2)), then it follows from Lemma 5.3.2 that the
difference between FLstream(P, f) and FLrand(P, f) is at most 1/(2c) · FacLoc*(P, f) with
probability greater than 7/8. Thus, we obtain
\[ \frac{1}{2c} \cdot \mathrm{FacLoc}^{*}(P, f) \le \mathrm{FL}_{\mathrm{stream}}(P, f) \le \frac{2c^2 + 1}{2c} \cdot \mathrm{FacLoc}^{*}(P, f) \]
with probability greater than 3/4. This proves that FLstream(P, f) ∈ Θ(FacLoc*(P, f))
with probability greater than 3/4.
Due to Corollary 5.1.2 and for our values of ε and δ, each DE data structure has a space
requirement of
\[
\begin{aligned}
& O\bigl( 1/\varepsilon^2 \cdot \log(N) \cdot (\log(1/\varepsilon) + \log(\log(M))) \cdot \log(1/\delta) \bigr) \\
&\quad = O\bigl( \log^4(\Delta) \cdot \log(N) \cdot (\log(\log(\Delta)) + \log(\log(M))) \cdot \log(\log(\Delta)) \bigr)
\end{aligned}
\]
bits, where N is the size of the domain of the elements and M is the multiplicity of single
elements in the DE data structure. Since the grid of any level contains at most ∆^d cells and
each DE data structure contains only cells from one level, we have N = ∆^d. Furthermore,
due to our implementation of Insert(p) and Delete(p) for any point p ∈ {1, ..., ∆}^d,
the multiplicity of a cell in any DE data structure is at most M = ∆^d. Hence, each DE
data structure needs O(log^5(∆) · (log(log(∆)))^2) bits of space. Since we use O(log(∆)) DE
data structures, the total space requirement for the DE data structures is upper bounded
by O(log^6(∆) · (log(log(∆)))^2) bits.
While running Insert(p) for any point p ∈ {1, ..., ∆}^d, we insert at most a constant
number of cells in each DE data structure. Thus, due to Corollary 5.1.2 and for our value
of δ, the time to process an Insert operation is O(log(log(∆))) for each DE data structure.
Analogously, we can upper bound the time to process a Delete operation. It follows that
the total update time for the O(log(∆)) DE data structures is O(log(∆) · log(log(∆))),
which completes the proof of the lemma.
Random Sampling
For each level i ∈ [⌈log(∆)⌉ + 1], we use the random function h_i : {1, ..., ∆}^d → {0, 1}
described in Section 5.2.1 to realize the random sampling of points. To overcome the
assumption of full randomness needed for the creation of these h_i(·), we use a pseudo-random
generator of Nisan [95]. This approach was first proposed in [62].
First, we briefly summarize some facts about pseudo-random generators for space-bounded
computation proposed by Nisan [95]. Then, we show how to utilize these facts for the
creation of the random functions. Let ALG be any algorithm that uses at most O(k)
bits of memory and, thus, has at most 2^{O(k)} distinct states. Furthermore, we assume that
ALG uses at most g chunks of random bits, where each chunk is of length ℓ ∈ O(k). Let
ALG(x) be the state of ALG after having used the random bit sequence x ∈ {0, 1}^{gℓ}.
Then, there is a pseudo-random generator for ALG with the following properties:
Lemma 5.3.4 ([95]). Let ALG be an algorithm that uses O(k) bits of memory and g
chunks of random bits, where each chunk is of length ℓ ∈ O(k). Then, there exists a
pseudo-random generator R : {0, 1}^s → ({0, 1}^ℓ)^g for ALG which expands s random bits
into t := g · ℓ bits such that s ∈ O(k log(t)) and
\[ \sum_{\text{states } z \text{ of } \mathrm{ALG}} \bigl| \Pr[\mathrm{ALG}(x) = z] - \Pr[\mathrm{ALG}(R(y)) = z] \bigr| \le 2^{-k} \]
where x is chosen uniformly at random from {0, 1}^t and y is chosen uniformly at random
from {0, 1}^s. For any y ∈ {0, 1}^s, any length-ℓ chunk of R(y) can be computed using
O(log(t)) arithmetic operations on O(ℓ)-bit words.
In the proof of the following lemma, we show how to apply a pseudo-random generator
to reduce the randomness needed for the creation of the random functions. To do so, we
proceed in a similar way as Indyk [62, 65].
Lemma 5.3.5. There is an implementation of algorithm FacLocCost that outputs a
constant-factor approximation of the optimal facility location cost for the current point set
with probability greater than 2/3. The implementation requires O(log^7(∆) · (log(log(∆)))^2)
bits of space and O(log^7(∆) · (log(log(∆)))^2) random bits. An insertion or deletion of a
point requires O(log^2(∆)) arithmetic operations on O(log(∆))-bit words.
Proof. Let FLCFullyRandom be the implementation of algorithm FacLocCost that
uses the type of DE data structures given in Corollary 5.1.2 and the kind of random
functions proposed in Lemma 5.2.1. To prove the lemma, we adopt the argumentation
given by Indyk in [62, 65]. Due to Lemma 5.3.3, the total space requirement for the
DE data structures is upper bounded by O(log^6(∆) · (log(log(∆)))^2) bits. Furthermore,
FLCFullyRandom requires O(log^2(∆)) random bits per point in {1, ..., ∆}^d in total
for the creation of the O(log(∆)) random functions. Since we might access a specific point
several times and the output of each random function for this point should always be
the same, we have to store the random bits of all ∆^d points. Obviously, an algorithm
working in the dynamic geometric data stream model cannot use Ω(∆) bits of space.
This problem is avoidable by allowing a negligible probability of error in the computation
of the number of open facilities. For the moment, let us assume the stream is sorted,
which means that the insertions and deletions of a specific point occur consecutively in
the stream. Then, it is sufficient to compute the output of each random function only
once per point. Thus, in case of a sorted stream, algorithm FLCFullyRandom uses
O(log^6(∆) · (log(log(∆)))^2) bits and O(∆^d · log(∆)) chunks each consisting of O(log(∆))
random bits. Note that there are O(log(∆)) chunks per point in {1, ..., ∆}^d, one for each
level. Due to Lemma 5.3.4, there exists a pseudo-random generator R which, given a random
seed of size O(log^6(∆) · (log(log(∆)))^2 · log(∆)), expands it to ∆^d · log(∆) chunks of O(log(∆))
random bits such that each chunk can be computed using O(log(∆)) arithmetic operations
on O(log(∆))-bit words and using these chunks results in a negligible probability of error
in the computation of the number of open facilities. Let us denote the implementation of
algorithm FacLocCost which uses a pseudo-random generator R for the creation of the
random functions by FLCPseudoRandom. Then, according to Lemma 5.3.4 and since
we can assume that ∆ ≥ 4, the probability that the implementation FLCFullyRandom
differs in its computations from the ones of the implementation FLCPseudoRandom is
at most
\[
\Pr[\textsc{FLCFullyRandom} \ne \textsc{FLCPseudoRandom}]
\le 2^{-\log^6(\Delta) \cdot (\log(\log(\Delta)))^2}
\le 2^{-6 \log(\Delta)}
= \Delta^{-6} .
\]
Since, for a fixed random seed, algorithm FLCPseudoRandom does not depend on the
order in which the insertions and deletions of points appear in the stream, we also get
Pr[FLCFullyRandom ≠ FLCPseudoRandom] ≤ 1/∆^6 for the unsorted stream.
Due to Lemma 5.3.3, we obtain that the implementation FLCFullyRandom of algorithm
FacLocCost has an error probability of less than 1/4. Since we assume that
∆ ≥ 4 and Pr[FLCFullyRandom ≠ FLCPseudoRandom] ≤ 1/∆^6, the implementation
FLCPseudoRandom works with error probability less than 1/3.
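The two numerical facts used in this step can be checked directly. The sketch below assumes that log denotes the logarithm to base 2 (which is what the identity 2^{−6 log(∆)} = ∆^{−6} requires); the function names are chosen here for illustration.

```python
import math


def prg_exponent(delta):
    """Exponent log^6(Delta) * (log log Delta)^2 of the pseudo-random generator error bound."""
    log_delta = math.log2(delta)
    return log_delta ** 6 * math.log2(log_delta) ** 2


def total_error_bound(delta):
    """Error of the fully random version (< 1/4) plus the deviation bound 1/Delta^6."""
    return 0.25 + delta ** -6
```

For ∆ = 4, the exponent is 64 · 1 = 64 ≥ 6 · log(4) = 12, and the total error bound is 1/4 + 1/4096, which is less than 1/3.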
Due to Lemma 5.3.3, the space requirement of the DE data structures is upper bounded
by O(log^6(∆) · (log(log(∆)))^2) bits and their update time is O(log(∆) · log(log(∆))). For
the random functions, the pseudo-random generator of Nisan [95] needs a random seed
of size O(log^7(∆) · (log(log(∆)))^2). Furthermore, for any level i ∈ [⌈log(∆)⌉ + 1] and
any point p ∈ {1, ..., ∆}^d, the value h_i(p) can be computed using O(log(∆)) arithmetic
operations on O(log(∆))-bit words. Thus, we need O(log^7(∆) · (log(log(∆)))^2) random bits
in total and, for any point, the output values of all random functions can be computed
using O(log^2(∆)) arithmetic operations on O(log(∆))-bit words.
As explained in the proof of Corollary 5.1.2, we can use a standard amplification tech-
nique to obtain the following main result:
Theorem 5. There is a randomized streaming algorithm that computes with probability
1 − δ a constant-factor approximation of the facility location cost for a stream of points
with uniform opening costs and demands in the discrete Euclidean space {1, ..., ∆}^d under
insertions and deletions, where d is a constant. The algorithm has a space requirement of
O(log^7(∆) · (log(log(∆)))^2 · log(1/δ)) bits and uses O(log^7(∆) · (log(log(∆)))^2 · log(1/δ))
random bits. An insertion or deletion of a point requires O(log^2(∆) · log(1/δ)) arithmetic
operations on O(log(∆))-bit words.
6 A k-Means Implementation for Data Streams
In this chapter, we develop an efficient algorithm for the k-means clustering problem in the
insertion-only data stream model. We call our algorithm StreamKM++. The k-means
clustering problem is closely related to the facility location problem. Given a set of points
and a natural number k, the goal of the k-means clustering problem is to place k facilities,
the so-called cluster centers, such that the sum of the squared distances of the points
to their nearest cluster center is minimized. To approach this problem, our streaming
algorithm maintains a small summary of the input points using the merge-and-reduce
technique [16, 58], i.e., the data is organized in a small number of samples, each representing
$2^{i} m$ input points (for some $i \in \mathbb{N}_0$ and a fixed $m \in \mathbb{N}$). Whenever two samples
representing the same number of input points exist, we take the union (merge) and create
a new sample (reduce). After having processed the whole input stream in this way, we
apply the k-Means++ algorithm [9] on the sample to obtain a k-means clustering.
For the reduce step, we develop a new coreset construction. A coreset is a small weighted
point set that approximates the original input point set with respect to a given optimization
problem, which is in our case the k-means clustering problem. Our focus is to propose a
coreset construction that is suitable for high-dimensional data. Existing constructions
based on grid-computations [44, 58] yield coresets of a size that is exponential in the
dimension. Since the k-Means++ seeding works well for high-dimensional data, a coreset
construction based on this approach seems to be more promising. We give a theoretical
analysis of this approach in Section 6.2.
In order to implement this coreset construction efficiently, we propose a new data struc-
ture, which we call coreset tree. This is a tree-like data structure that stores points in
such a way that we can perform a fast adaptive sampling which is very similar to the
k-Means++ seeding. According to our experiments, the seed computed on the coreset tree
has essentially the same properties as the original k-Means++ seed. The advantage of
the coreset tree approach is that the running time is significantly shorter than the running
time of the original k-Means++ seeding.
It should be noted that the k-Means++ seeding has also been theoretically investigated
in [3] and [4]. Aggarwal et al. [3] used the k-Means++ seeding to obtain a small weighted
point set such that an optimal k-means clustering of the original point set can be ap-
proximated well by clustering the small weighted set. Ailon et al. [4] used the k-Means++
seeding to obtain a streaming algorithm for the k-means clustering problem that guarantees
an approximation factor of $O(c^{\alpha} \log(k))$, where c is some constant, α ≈ log(n)/ log(M),
n is the number of input points in the stream, and M is the amount of work memory
available to the algorithm. However, our result differs from the results given in [3] and [4]
and was obtained independently.
In Section 6.5, we compare algorithm StreamKM++ with algorithms BIRCH [111] and
StreamLS [96, 52] as well as with the non-streaming version of algorithm k-Means++. It
turns out that our algorithm is slower than BIRCH, but it computes significantly better
solutions (in terms of the sum of squared errors). In addition, to obtain the desired number
of clusters, our algorithm does not require the trial-and-error adjustment of parameters as
BIRCH does. The quality of the clustering of algorithm StreamLS is comparable to
that of our algorithm, but the running time of StreamKM++ scales much better with the
number of cluster centers. For example, on the dataset Tower, our algorithm computes
a clustering with k = 100 centers in about 3% of the running time of StreamLS. In
comparison with the standard implementation of k-Means++, our algorithm runs much
faster on larger datasets and computes solutions that are on a par with k-Means++. For
example, on the dataset Covertype, our algorithm computes a clustering with k = 50
centers of essentially the same quality as k-Means++ does, but it needs only about 3% of
the running time of k-Means++.
Next, we introduce some notation and give a brief overview of the considered competitors
of StreamKM++.
6.1 Preliminaries
6.1.1 Definition of Euclidean k-Means Clusterings
Recall from Section 2.1 that, for any two points $p, q \in \mathbb{R}^{d}$ and any set of points
$C \subset \mathbb{R}^{d}$, we denote the Euclidean distance between p and q by
$D(p, q) := \|p - q\|$, and we define
\[
D(p, C) := \min_{c \in C} D(p, c) .
\]
Similarly, for squared Euclidean distances, we define
\[
D^{2}(p, q) := \|p - q\|^{2}
\quad \text{and} \quad
D^{2}(p, C) := \min_{c \in C} D^{2}(p, c) .
\]
Let $P \subset \mathbb{R}^{d}$ be a set of points with size $|P| =: n$. The Euclidean k-means
clustering problem for P is given as follows.
Definition 6.1.1 (Euclidean k-Means Clustering Problem). For a set $P \subset \mathbb{R}^{d}$ and
$k \in \mathbb{N}$, the Euclidean k-means clustering problem is to find a set
$C := \{c_1, \dots, c_k\}$ of k cluster centers in $\mathbb{R}^{d}$ and a partition of the set P
into k clusters $C_1, \dots, C_k$ such that the k-means clustering cost
\[
\mathrm{Means}(P, C, C_1, \dots, C_k) := \sum_{i=1}^{k} \sum_{p \in C_i} D^{2}(p, c_i)
\]
is minimized.
Analogously, for a weighted set $S \subset \mathbb{R}^{d}$ with weight function
$w : S \to \mathbb{R}_{\ge 0}$ and $k \in \mathbb{N}$, the weighted Euclidean k-means clustering
problem is to find a set $C := \{c_1, \dots, c_k\}$ of k cluster centers in $\mathbb{R}^{d}$ and a
partition of the set S into k clusters $C_1, \dots, C_k$ such that the k-means clustering cost
\[
\mathrm{Means}(S, C, C_1, \dots, C_k) := \sum_{i=1}^{k} \sum_{q \in C_i} w(q) \cdot D^{2}(q, c_i)
\]
is minimized.
If a partition $C_1, \dots, C_k$ of P relates each point to its nearest cluster center, i.e.,
if, for each $p \in P$ and each $i \in \{1, \dots, k\}$, we have
\[
p \in C_i \;\Rightarrow\; D(p, c_i) = \min_{j \in \{1, \dots, k\}} D(p, c_j) ,
\]
then we write for short
\[
\mathrm{Means}(P, C) := \mathrm{Means}(P, C, C_1, \dots, C_k) .
\]
Furthermore, we denote the cost of an optimal Euclidean k-means clustering of P by
\[
\mathrm{Means}^{*}_{k}(P) := \min_{C' \subset \mathbb{R}^{d} \,:\, |C'| = k} \mathrm{Means}(P, C') .
\]
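To make the notation concrete, the following minimal Python sketch evaluates $D^{2}(p, C)$ and
$\mathrm{Means}(P, C)$ for points given as tuples of floats. The function names `sq_dist`,
`sq_dist_to_set`, and `means_cost` are illustrative choices and do not appear in the thesis.

```python
from typing import Sequence

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance D^2(p, q)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sq_dist_to_set(p: Point, centers: Sequence[Point]) -> float:
    """D^2(p, C): squared distance from p to its nearest center in C."""
    return min(sq_dist(p, c) for c in centers)


def means_cost(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """Means(P, C): every point is charged to its nearest center."""
    return sum(sq_dist_to_set(p, centers) for p in points)


P = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
C = [(1.0, 0.0), (10.0, 0.0)]
print(means_cost(P, C))  # the three points contribute 1, 1 and 0
```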
6.1.2 Definition of Coresets
An important concept that we use is the notion of coresets. Generally speaking, a coreset
for a set P is a small (weighted) set such that, for any set of k cluster centers, the (weighted)
clustering cost for the coreset is an approximation of the clustering cost for the original set
P with small relative error. The advantage of such a coreset is that we can apply any fast
approximation algorithm (for the weighted problem) on the usually much smaller coreset
to compute an approximate solution for the original set P more efficiently. We use the
following formal definition:
Definition 6.1.2 (Coreset for k-Means Clustering Problem). Let $P \subset \mathbb{R}^{d}$ be a set
of points, let $k \in \mathbb{N}$, and let ε, 0 < ε ≤ 1, be a precision parameter. A weighted
multiset $S \subset \mathbb{R}^{d}$ with positive weight function $w : S \to \mathbb{R}_{\ge 0}$ is
called a (k, ε)-coreset of P for the k-means clustering problem if, for each
$C \subset \mathbb{R}^{d}$ of size $|C| = k$, we have
\[
(1 - \varepsilon) \cdot \mathrm{Means}(P, C) \le \mathrm{Means}(S, C) \le (1 + \varepsilon) \cdot \mathrm{Means}(P, C) .
\]
Our clustering algorithm maintains a small coreset in the insertion-only data stream
model (see Section 2.4.4 for a definition of this data stream model).
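The coreset inequality of Definition 6.1.2 can be spot-checked numerically for one particular
center set C. The sketch below does exactly that; note that passing this check for a single C
does not establish the coreset property, which quantifies over all center sets of size k. The
helper names are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance D^2(p, q)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def means_cost(points, centers, weights=None):
    """(Weighted) k-means cost; unit weights are assumed if none are given."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(sq_dist(p, c) for c in centers)
               for p, w in zip(points, weights))


def satisfies_coreset_bound(points, coreset, weights, centers, eps):
    """Check (1 - eps) * Means(P, C) <= Means(S, C) <= (1 + eps) * Means(P, C)
    for the single center set `centers`."""
    cost_p = means_cost(points, centers)
    cost_s = means_cost(coreset, centers, weights)
    return (1 - eps) * cost_p <= cost_s <= (1 + eps) * cost_p


P = [(0.0,), (0.0,), (4.0,)]
S = [(0.0,), (4.0,)]   # duplicates of P collapsed into weighted points
w = [2, 1]
print(satisfies_coreset_bound(P, S, w, centers=[(1.0,)], eps=0.1))
```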
6.1.3 k-Means Clustering Algorithms
In the experiments, we compare StreamKM++ with two frequently used clustering algo-
rithms for processing data streams, namely with algorithm BIRCH of Zhang et al. [111]
and with a streaming variant of the local search algorithm given by O’Callaghan et al. [96]
and Guha et al. [52], which we refer to as algorithm StreamLS. On smaller datasets,
we also compare our algorithm with a classical implementation of Lloyd’s k-means algo-
rithm [80], using initial seeds either uniformly at random (algorithm k-Means) or ac-
cording to the adaptive, non-uniform seeding from Arthur and Vassilvitskii [9] (algorithm
k-Means++). In the following, we will give a brief overview of these k-means clustering
algorithms.
Algorithm k-Means
One of the most widely used clustering algorithms is Lloyd’s algorithm. This algorithm
is sometimes also called the k-means algorithm [80, 39, 82]. Lloyd’s algorithm is based on
two observations:
1. Given a fixed set of centers, we obtain the best clustering by assigning each point to
the nearest center.
2. Given a cluster, the best center of the cluster is the center of gravity (i.e., the mean)
of its points.
Lloyd’s algorithm applies these two local optimization steps repeatedly to the current
solution until no further improvement is possible. See Algorithm 6.1.1 for a description in
pseudocode.
Algorithm 6.1.1 k-Means(P, k)
1: choose k initial centers c_1, . . . , c_k uniformly at random from P
2: repeat
3:     partition P into k subsets P_1, . . . , P_k such that P_i, 1 ≤ i ≤ k, contains all points
       whose nearest center is c_i
4:     replace the current set of centers by a new set of centers c_1, . . . , c_k such that center
       c_i, 1 ≤ i ≤ k, is the center of gravity of P_i
5: until the set of centers has not changed
It is known that the algorithm converges to a local optimum [100], and the quality of the
computed solution is sensitive to the choice of the starting centers. Kanungo et al. [73] give
a simple example where, for a fixed set of starting centers, Lloyd’s algorithm converges to
a local minimum that is arbitrarily bad compared to the optimal solution. This example
can be extended to the case where the starting centers are chosen by uniform seeding as
given in Algorithm 6.1.1.
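As an illustration of Algorithm 6.1.1, the following Python sketch implements Lloyd’s iteration
with uniform seeding. Two details are assumptions of this sketch and are not specified by the
pseudocode: an empty cluster keeps its previous center, and a `max_iter` guard bounds the number
of rounds.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance D^2(p, q)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    """Center of gravity (coordinate-wise mean) of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def lloyd(points, k, rng=None, max_iter=1000):
    rng = rng or random.Random()
    # Line 1: uniform seeding from the input points.
    centers = [tuple(c) for c in rng.sample(points, k)]
    for _ in range(max_iter):
        # Line 3: assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Line 4: move every center to the center of gravity of its cluster.
        new_centers = [centroid(cl) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        # Line 5: stop as soon as the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(sorted(lloyd(points, 2, random.Random(0))))
```

On this tiny, well-separated input every choice of starting centers leads to the same local
optimum; in general, the result depends on the seeding, as discussed above.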
Algorithm k-Means++
Recently, Arthur and Vassilvitskii developed the k-Means++ algorithm [9], which is a
seeding procedure for Lloyd’s k-means algorithm. This seeding procedure considers the
fact that the quality of the solution of Lloyd’s k-means algorithm depends strongly on
the initial set of centers. In order to achieve a better arrangement, it chooses the initial
set of centers adaptively and non-uniformly at random by choosing each point as the next
center with probability proportional to its squared distance from the nearest center already
chosen. Note that, for a given set of centers, the squared distance of a point from its nearest
center corresponds to the current contribution of this point to the total k-means clustering
cost. The k-Means++ seeding procedure is given by Algorithm 6.1.2. For simplicity of
description, we say that Algorithm 6.1.2 chooses the set C at random according to D2.
Algorithm 6.1.2 AdaptiveSeeding(P, k)
1: choose an initial center c_1 uniformly at random from P
2: C ← {c_1}
3: for i ← 2 to k do
4:     choose the next center c_i at random from P, where the probability of each p ∈ P is
       given by D^2(p, C)/Means(P, C)
5:     C ← C ∪ {c_i}
By replacing line 1 of Algorithm 6.1.1 with Algorithm 6.1.2, Arthur and Vassilvitskii
developed a k-means clustering algorithm, known as k-Means++ algorithm, which gives
good experimental results and guarantees a solution with certain quality. More precisely,
they showed the following:
Lemma 6.1.3 ([9]). Let C ⊆ P be a set of k points chosen at random according to $D^{2}$.
Then, we have
\[
\mathbb{E}[\mathrm{Means}(P, C)] \le 8 \, (2 + \ln(k)) \cdot \mathrm{Means}^{*}_{k}(P) .
\]
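The seeding of Algorithm 6.1.2 can be sketched in a few lines of Python. The sketch keeps, for
every point, its squared distance to the nearest center chosen so far and uses these values as
sampling weights. The fallback to uniform sampling when all weights are zero is an assumption of
this sketch; the pseudocode does not cover that degenerate case.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance D^2(p, q)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def adaptive_seeding(points, k, rng=None):
    """Choose k centers from `points` at random according to D^2."""
    rng = rng or random.Random()
    centers = [rng.choice(points)]                   # line 1: uniform first center
    d2 = [sq_dist(p, centers[0]) for p in points]    # D^2(p, C) for every p
    for _ in range(1, k):
        if sum(d2) > 0:
            # line 4: probability of p is D^2(p, C) / Means(P, C)
            nxt = rng.choices(points, weights=d2)[0]
        else:
            nxt = rng.choice(points)                 # degenerate case (assumption)
        centers.append(nxt)
        # only the new center can reduce a point's distance to C
        d2 = [min(old, sq_dist(p, nxt)) for old, p in zip(d2, points)]
    return centers


points = [(0.0,), (0.0,), (100.0,)]
print(sorted(set(adaptive_seeding(points, 2, random.Random(1)))))
```

Since a point that coincides with an already chosen center has weight zero, the second center in
this example is always located at the other position.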
Algorithm BIRCH
One of the earliest and best known practical clustering algorithms for data streams is
BIRCH (which is an acronym for ‘Balanced Iterative Reducing and Clustering using Hier-
archies’) [111]. BIRCH is a heuristic which exploits the observation that the point space
is usually not uniformly occupied. It scans the given set of input points once and computes
a pre-clustering by summarizing dense regions of points by their so-called clustering fea-
tures. Such a clustering feature consists of the number of points in the region, the center of
gravity, and the sum of squared distances to the origin. Thereby, the problem of clustering
the original input point set is reduced to the problem of clustering the set of summaries,
which is much smaller than the original point set. The pre-clustering is then clustered by
using an agglomerative (bottom-up) clustering algorithm. In this process, the algorithm
uses the clustering features to calculate the intra-cluster distances. BIRCH successively
merges the closest pair of clusters until the desired number of clusters is obtained.
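The clustering features described above are additive, which is what makes the summaries cheap to
merge. The sketch below stores exactly the three quantities named in the text (number of points,
center of gravity, sum of squared distances to the origin) and derives the sum of squared
distances to the centroid from them. It is an illustration of the idea only, not the BIRCH
implementation of [111]; the class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ClusteringFeature:
    n: int                        # number of summarized points
    centroid: Tuple[float, ...]   # center of gravity
    sum_sq: float                 # sum of squared distances to the origin

    @classmethod
    def of_point(cls, p):
        return cls(1, tuple(p), sum(x * x for x in p))

    def merge(self, other):
        """Feature of the union of two disjoint point sets."""
        n = self.n + other.n
        centroid = tuple((self.n * a + other.n * b) / n
                         for a, b in zip(self.centroid, other.centroid))
        return ClusteringFeature(n, centroid, self.sum_sq + other.sum_sq)

    def sse(self):
        """Sum of squared distances of the summarized points to the centroid,
        using sum ||p - mu||^2 = sum ||p||^2 - n * ||mu||^2."""
        return self.sum_sq - self.n * sum(x * x for x in self.centroid)


cf = ClusteringFeature.of_point((0.0, 0.0)).merge(ClusteringFeature.of_point((2.0, 0.0)))
print(cf.n, cf.centroid, cf.sse())
```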
To a certain extent, BIRCH uses a kind of coreset construction. However, no theoretical
analysis of this method is known. For more details about BIRCH, the reader is
referred to [111].
Algorithm StreamLS
Another well-known clustering algorithm for data streams is the streaming implementation
of algorithm LSearch from O’Callaghan et al. [96] and Guha et al. [52], which we refer
to as StreamLS. This algorithm partitions the input stream into chunks and computes
for each chunk a k-means clustering solution using a local search algorithm from Guha
et al. [53]. Finally, the local search algorithm is applied once more on the union of the
solutions for the chunks to obtain a k-means clustering for the whole input stream.
The local search algorithm of Guha et al. [53] takes advantage of the relationship between
the k-means clustering problem and the uniform facility location problem (see Section 2.2
for a definition of the uniform facility location problem). More precisely, it is based on
the observation that if the opening cost of a facility increases, then the number of facilities
(or cluster centers) of an optimal solution tends to decrease. Hence, to solve the k-means
problem, the algorithm of Guha et al. [53] performs a binary search on the opening cost of
a facility to find a cost that gives the desired number of cluster centers. During the binary
search, each facility location problem is solved by starting with an initial solution that is
obtained by a simple non-uniform sampling approach and then refining this solution by
making local improvements. More details can be found in [96, 53, 52].
6.2 Coreset Construction
Our k-means clustering algorithm uses a coreset construction based on the k-Means++
seeding procedure from Arthur and Vassilvitskii [9]. One reason for this design decision
was that the k-Means++ seeding works well for high-dimensional datasets, which is often
required in practice. This nice property does not apply to many other clustering meth-
ods, like the grid-based methods from Har-Peled and Mazumdar [58] and Frahling and
Sohler [44], for instance.
Let $P \subset \mathbb{R}^{d}$ be a set of points with size $|P| =: n$. For an arbitrary fixed
parameter $m \in \mathbb{N}$, our coreset construction is as follows (see also Algorithm 6.2.1).
First, we choose a set $S := \{q_1, q_2, \dots, q_m\}$ of size m at random according to $D^{2}$.
Let $Q_i$ denote the set of points from P that are closest to $q_i$ (breaking ties arbitrarily).
By using the weight function $w : S \to \mathbb{R}_{\ge 0}$ with $w(q_i) = |Q_i|$, we obtain the
weighted set S as our coreset.
Note that our coreset construction is rather easy to implement and its running time
depends only linearly on the dimension d. Furthermore, empirical evaluation
(as given in Section 6.5) suggests that our construction leads to good coresets even for
relatively small choices of m (e.g., m = 200k). Unfortunately, we do not have a formal
proof supporting this observation. However, we are able to take a first step by proving that,
at least in low-dimensional spaces, our construction indeed leads to small coresets.
Algorithm 6.2.1 AdaptiveCoreset(P, m)
1: choose an initial coreset point q_1 uniformly at random from P
2: w(q_1) ← 0
3: S ← {q_1}
4: for i ← 2 to m do
5:     choose q_i at random according to D^2 from P
6:     w(q_i) ← 0
7:     S ← S ∪ {q_i}
8: for each p ∈ P do
9:     let q_i ∈ S, 1 ≤ i ≤ m, be the nearest coreset point to p
10:    w(q_i) ← w(q_i) + 1
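A direct Python rendering of Algorithm 6.2.1 might look as follows. Instead of a separate final
pass over P, this sketch tracks the index of the nearest coreset point of every input point
while sampling, which yields the same weights up to tie-breaking. The fallback to uniform
sampling when all squared distances are zero is an assumption of the sketch.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance D^2(p, q)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def adaptive_coreset(points, m, rng=None):
    """Return (coreset points, weights) following Algorithm 6.2.1."""
    rng = rng or random.Random()
    coreset = [rng.choice(points)]                   # line 1
    d2 = [sq_dist(p, coreset[0]) for p in points]    # D^2(p, S) for every p
    nearest = [0] * len(points)                      # index of nearest coreset point
    for i in range(1, m):                            # lines 4-7
        if sum(d2) > 0:
            q = rng.choices(points, weights=d2)[0]   # sample according to D^2
        else:
            q = rng.choice(points)                   # degenerate case (assumption)
        coreset.append(q)
        for idx, p in enumerate(points):
            d = sq_dist(p, q)
            if d < d2[idx]:                          # q is the new nearest point
                d2[idx], nearest[idx] = d, i
    weights = [0] * m                                # lines 8-10
    for j in nearest:
        weights[j] += 1
    return coreset, weights


points = [(0.0,), (0.0,), (100.0,)]
S, w = adaptive_coreset(points, 2, random.Random(3))
print(dict(zip(S, w)))
```

This straightforward version needs time proportional to d·n·m, since every new coreset point
triggers a pass over all of P.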
Our proof is based on Lemma 6.2.1. Intuitively, this lemma states that if we consider
an optimal m-clustering of P , with m large enough, then the optimal m-clustering cost is
merely a tiny fraction of the optimal k-clustering cost of P. Lemma 6.2.1 is a consequence
of the fact that there exist (k, γ)-coresets of size $m \in (d/\gamma)^{O(d)} \cdot k \log(n)$,
which has already been proven by Har-Peled and Mazumdar [58].
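Lemma 6.2.1. Let γ, 0 < γ ≤ 1, and let $m \in \mathbb{N}$. If
\[
m \ge \left( \frac{16d}{\gamma} \right)^{d/2} \cdot k \cdot \lceil \log(n) + 3 \rceil ,
\]
then we get
\[
\mathrm{Means}^{*}_{m}(P) \le \gamma \cdot \mathrm{Means}^{*}_{k}(P) .
\]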
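To get a feeling for the size condition of Lemma 6.2.1, the threshold can simply be evaluated.
The small helper below does this; it assumes that $\log$ is the logarithm to base 2 (which the
exponential grid in the proof suggests) and the function name is hypothetical.

```python
import math


def lemma_min_m(d: int, gamma: float, k: int, n: int) -> int:
    """Smallest integer m with m >= (16 d / gamma)^(d/2) * k * ceil(log2(n) + 3)."""
    return math.ceil((16 * d / gamma) ** (d / 2) * k * math.ceil(math.log2(n) + 3))


print(lemma_min_m(d=2, gamma=0.5, k=3, n=1024))    # 64 * 3 * 13
print(lemma_min_m(d=10, gamma=0.5, k=3, n=1024))   # grows quickly with d
```

The exponential dependence on d is the reason why the guarantee is only meaningful in
low-dimensional spaces.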
Proof. Let C := {c1, . . . , ck} be an optimal solution to the Euclidean k-means clustering
problem for P with |P | =: n, i.e., Means(P,C) = Means∗k(P ). We consider an exponential
grid around each center in C. The construction of this grid is essentially the same as the
one from Har-Peled and Mazumdar [58]. In detail, the construction is defined as follows.
Let the average cost per point of an optimal k-clustering solution for P be denoted by
\[
R := \frac{\mathrm{Means}^{*}_{k}(P)}{n} .
\]
Furthermore, for each $j \in \{0, 1, \dots, \lceil \log(n) + 2 \rceil\}$ and each center
$c_i \in C$, let $V_{ij}$ be the axis-parallel square centered at $c_i$ with side length
\[
r_j := \sqrt{2^{j} R} .
\]
Then, we recursively define $W_{i0} := V_{i0}$ and $W_{ij} := V_{ij} \setminus V_{i,j-1}$ for
$j \in \{1, 2, \dots, \lceil \log(n) + 2 \rceil\}$. Obviously, each point in P is contained
within a $W_{ij}$ since otherwise there would be a point $p \in P$ with
\[
D^{2}(p, C) > \left( \frac{r_{\lceil \log(n) + 2 \rceil}}{2} \right)^{2}
  = \frac{2^{\lceil \log(n) + 2 \rceil} R}{4}
  \ge nR \ge \mathrm{Means}^{*}_{k}(P) ,
\]
which is a contradiction.
For each i, j individually, we partition $W_{ij}$ into small grid cells with side length
\[
r'_j := \sqrt{\frac{\gamma}{9d}} \cdot r_j = \sqrt{\frac{\gamma}{9d} \cdot 2^{j} R} .
\]
We remark that the small grid cells do not have to fit properly in Wij. In fact, we impose
a grid with side length r′j on Wij such that Wij is completely covered. Then, the partition
of Wij consists of all the small cells that completely cover Wij as well as all parts of the
small cells that partly cover Wij. This partition is illustrated by Figure 6.1.
Figure 6.1: Illustration of the partition of $W_{ij}$ into small grid cells. The area $W_{ij}$ is
colored in gray. The white parts of the small cells do not belong to the partition of
$W_{ij}$.
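For each grid cell C such that $C \cap W_{ij}$ contains points from P, we select a single point
from $P \cap C \cap W_{ij}$ as the representative of all the points in $P \cap C \cap W_{ij}$.
Let G be the set of all these representatives. Since we have
\[
\frac{r_j}{r'_j} = \sqrt{\frac{9d}{\gamma}} \ge 3 ,
\]
there are at most
\[
\sum_{c_i \in C} \sum_{j=0}^{\lceil \log(n) + 2 \rceil} \left( \left\lceil \frac{r_j}{r'_j} \right\rceil \right)^{d}
  \le \sum_{c_i \in C} \sum_{j=0}^{\lceil \log(n) + 2 \rceil} \left( \frac{4}{3} \cdot \frac{r_j}{r'_j} \right)^{d}
  = k \cdot \lceil \log(n) + 3 \rceil \cdot \left( \frac{16d}{\gamma} \right)^{d/2}
\]
grid cells. Since this number is smaller than or equal to m, we obtain $|G| \le m$.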
Let $g_p$ denote the representative of $p \in P$ in G. Then, we have
\[
\mathrm{Means}^{*}_{m}(P) \le \mathrm{Means}^{*}_{|G|}(P) \le \mathrm{Means}(P, G) \le \sum_{p \in P} D^{2}(p, g_p) . \tag{6.1}
\]
For each point $p \in P$, the distance from its representative $g_p$ is upper bounded by the
diagonal of the grid cell that contains p. Thus, for each $p \in W_{i0}$, we have
\[
D^{2}(p, g_p) \le \left( \sqrt{d} \cdot r'_0 \right)^{2} \le \frac{\gamma R}{9} . \tag{6.2}
\]
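Furthermore, for each $p \in W_{ij}$ with $j \ge 1$, we know that $c_i$ is the center of
$V_{i,j-1}$ and p is not contained in $V_{i,j-1}$. It follows that
\[
D^{2}(p, C) \ge \left( \frac{r_{j-1}}{2} \right)^{2} \ge 2^{j-3} R .
\]
Hence, in this case, we get
\[
D^{2}(p, g_p) \le \left( \sqrt{d} \cdot r'_j \right)^{2} = \frac{\gamma}{9} \cdot 2^{j} R \le \frac{8\gamma}{9} \cdot D^{2}(p, C) . \tag{6.3}
\]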
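Due to Inequalities (6.1)–(6.3) and the definition of R, we obtain
\begin{align*}
\mathrm{Means}^{*}_{m}(P)
  &\le \sum_{p \in P} D^{2}(p, g_p) \\
  &\le \sum_{p \in P} \left( \frac{\gamma R}{9} + \frac{8\gamma}{9} \cdot D^{2}(p, C) \right) \\
  &= n \cdot \frac{\gamma}{9} R + \frac{8\gamma}{9} \sum_{p \in P} D^{2}(p, C) \\
  &= \frac{\gamma}{9} \cdot \mathrm{Means}^{*}_{k}(P) + \frac{8\gamma}{9} \cdot \mathrm{Means}^{*}_{k}(P) \\
  &= \gamma \cdot \mathrm{Means}^{*}_{k}(P) .
\end{align*}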
Now, we go back to our coreset construction. Given the point set P and a parameter
m ∈ N, let S be our weighted coreset chosen at random according to D2 from P by
Algorithm 6.2.1. Furthermore, let C be an arbitrary set of k centers. For each point
$p \in P$, we denote the point from S whose weight has been increased by 1 due to p in line 10
of Algorithm 6.2.1 by $q_p$, i.e., $q_p$ is a point from S closest to p. Then, the difference
between the cost of clustering P and the cost of clustering S is at most
\[
|\mathrm{Means}(P, C) - \mathrm{Means}(S, C)|
  = \left| \sum_{p \in P} D^{2}(p, C) - \sum_{p \in P} D^{2}(q_p, C) \right|
  \le \sum_{p \in P} \left| D^{2}(p, C) - D^{2}(q_p, C) \right| .
\]
We partition P into two subsets $P_{\mathrm{near}}$ and $P_{\mathrm{dist}}$. Roughly speaking,
the set $P_{\mathrm{near}}$ contains each point $p \in P$ whose distance from its coreset point
$q_p$ is small compared to the distance from its nearest center in C. More precisely, for any
constant ε with 0 < ε ≤ 1, we define
\[
P_{\mathrm{near}} := \{ p \in P \mid D(p, q_p) \le \varepsilon D(p, C) \} .
\]
The set $P_{\mathrm{dist}}$ contains all the other points from P, i.e.,
\[
P_{\mathrm{dist}} := \{ p \in P \mid D(p, q_p) > \varepsilon D(p, C) \} .
\]
First, in Claim 6.2.2, we estimate the error of the clustering cost that occurs for any
point in Pnear. Then, in Claim 6.2.3, we give an estimation of the error for any point in
Pdist.
Claim 6.2.2. If $p \in P_{\mathrm{near}}$, then
\[
\left| D^{2}(p, C) - D^{2}(q_p, C) \right| \le 3\varepsilon D^{2}(p, C) .
\]
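Proof. For the moment, let us assume that $D(p, C) \le D(q_p, C)$. Let $c_p$ denote the element
from C closest to p. By the triangle inequality and the definition of $P_{\mathrm{near}}$, we have
\begin{align*}
D(q_p, C) &\le D(q_p, c_p) \\
          &\le D(p, c_p) + D(p, q_p) \\
          &\le (1 + \varepsilon) \cdot D(p, C) .
\end{align*}
Hence, for the squared distances, we obtain
\begin{align*}
D^{2}(q_p, C) &\le (1 + \varepsilon)^{2} \cdot D^{2}(p, C) \\
              &\le (1 + 3\varepsilon) \cdot D^{2}(p, C) .
\end{align*}
Thus, we get $D^{2}(q_p, C) - D^{2}(p, C) \le 3\varepsilon D^{2}(p, C)$, which proves the claim
in the case $D(p, C) \le D(q_p, C)$.
Now, assume that $D(p, C) > D(q_p, C)$. Let $c_s$ denote the element from C closest to $q_p$.
Again, by the triangle inequality and the definition of $P_{\mathrm{near}}$, we have
\begin{align*}
D(p, C) &\le D(p, c_s) \\
        &\le D(q_p, c_s) + D(p, q_p) \\
        &\le D(q_p, C) + \varepsilon D(p, C) .
\end{align*}
It follows that $(1 - \varepsilon) \cdot D(p, C) \le D(q_p, C)$. For the squared distances, we obtain
\begin{align*}
D^{2}(q_p, C) &\ge (1 - \varepsilon)^{2} \cdot D^{2}(p, C) \\
              &> (1 - 2\varepsilon) \cdot D^{2}(p, C) .
\end{align*}
Hence, we get
\begin{align*}
D^{2}(p, C) - D^{2}(q_p, C) &< 2\varepsilon D^{2}(p, C) \\
                            &< 3\varepsilon D^{2}(p, C) ,
\end{align*}
which proves the claim in the case $D(p, C) > D(q_p, C)$.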
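The two elementary estimates on the squared factors used in this proof follow from
$0 < \varepsilon \le 1$. Spelled out as a short worked step (with $\varepsilon = 0.1$ as a purely
illustrative value):

```latex
% Upper estimate: eps^2 <= eps because 0 < eps <= 1.
\[
(1 + \varepsilon)^{2} = 1 + 2\varepsilon + \varepsilon^{2} \le 1 + 3\varepsilon ,
\qquad \text{e.g. } 1.1^{2} = 1.21 \le 1.3 .
\]
% Lower estimate: eps^2 > 0.
\[
(1 - \varepsilon)^{2} = 1 - 2\varepsilon + \varepsilon^{2} > 1 - 2\varepsilon ,
\qquad \text{e.g. } 0.9^{2} = 0.81 > 0.8 .
\]
```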
Claim 6.2.3. If $p \in P_{\mathrm{dist}}$, then
\[
\left| D^{2}(p, C) - D^{2}(q_p, C) \right| \le \frac{3}{\varepsilon} D^{2}(p, q_p) .
\]
Proof. Let $c_p$ denote the element from C closest to p, and let $c_s$ denote the element from
C closest to $q_p$. By the triangle inequality, we have
\begin{align*}
D(p, C) &\le D(p, c_s) \\
        &\le D(p, q_p) + D(q_p, c_s) \\
        &= D(p, q_p) + D(q_p, C) .
\end{align*}
Similarly, we get
\begin{align*}
D(q_p, C) &\le D(q_p, c_p) \\
          &\le D(p, q_p) + D(p, c_p) \\
          &= D(p, q_p) + D(p, C) .
\end{align*}
It follows that $|D(p, C) - D(q_p, C)| \le D(p, q_p)$ and
$D(p, C) + D(q_p, C) \le 2 D(p, C) + D(p, q_p)$.
Since $D(p, q_p) > \varepsilon D(p, C)$ and $\varepsilon \le 1$, we get
\begin{align*}
\left| D^{2}(p, C) - D^{2}(q_p, C) \right|
  &= |D(p, C) - D(q_p, C)| \cdot (D(p, C) + D(q_p, C)) \\
  &\le D(p, q_p) \cdot (2 D(p, C) + D(p, q_p)) \\
  &\le \left( \frac{2}{\varepsilon} + 1 \right) D^{2}(p, q_p) \\
  &\le \frac{3}{\varepsilon} D^{2}(p, q_p) .
\end{align*}
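For completeness, the last two inequalities of this chain can be expanded as follows; this is
only a restatement of the step, using nothing beyond the definition of $P_{\mathrm{dist}}$ and
$0 < \varepsilon \le 1$.

```latex
% From the definition of P_dist: D(p, C) < D(p, q_p) / eps.
\[
2 D(p, C) + D(p, q_p)
  \le \frac{2}{\varepsilon} D(p, q_p) + D(p, q_p)
  = \left( \frac{2}{\varepsilon} + 1 \right) D(p, q_p) .
\]
% From eps <= 1: 1 <= 1 / eps.
\[
\frac{2}{\varepsilon} + 1 \le \frac{2}{\varepsilon} + \frac{1}{\varepsilon} = \frac{3}{\varepsilon} .
\]
```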
Now, we can show our main result.
Theorem 6. Let $k \in \mathbb{N}$, let ε, 0 < ε ≤ 1, be a precision parameter, and let δ,
0 < δ < 1, be an error probability. Given a point set $P \subset \mathbb{R}^{d}$ of size
$|P| =: n$ and a size parameter
\[
m \in \left( \frac{d}{\delta\varepsilon} \right)^{O(d)} \cdot k \cdot \log(n) \cdot \log^{d/2}\!\left( \frac{k \log(n)}{\delta\varepsilon} \right) ,
\]
algorithm AdaptiveCoreset computes a weighted multiset S with size m that is a (k, 6ε)-coreset
of P with probability at least 1 − δ.
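Proof. Due to Claims 6.2.2 and 6.2.3, we have
\begin{align*}
|\mathrm{Means}(P, C) - \mathrm{Means}(S, C)|
  &\le \sum_{p \in P} \left| D^{2}(p, C) - D^{2}(q_p, C) \right| \\
  &\le \sum_{p \in P_{\mathrm{near}}} \left| D^{2}(p, C) - D^{2}(q_p, C) \right|
     + \sum_{p \in P_{\mathrm{dist}}} \left| D^{2}(p, C) - D^{2}(q_p, C) \right| \\
  &\le 3\varepsilon \sum_{p \in P_{\mathrm{near}}} D^{2}(p, C)
     + \frac{3}{\varepsilon} \sum_{p \in P_{\mathrm{dist}}} D^{2}(p, q_p) \\
  &\le 3\varepsilon \cdot \mathrm{Means}(P, C) + \frac{3}{\varepsilon} \cdot \mathrm{Means}(P, S) .
\end{align*}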
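Due to Lemma 6.1.3 and Markov’s inequality, we obtain
\[
\mathrm{Means}(P, S) \le \frac{8}{\delta} (2 + \ln(m)) \cdot \mathrm{Means}^{*}_{m}(P)
\]
with probability at least 1 − δ. Hence, by using Lemma 6.2.1 with
\[
\gamma := \frac{\varepsilon^{2} \delta}{8 (2 + \ln(m))} ,
\]
we have
\begin{align*}
\mathrm{Means}(P, S)
  &\le \frac{8}{\delta} (2 + \ln(m)) \cdot \mathrm{Means}^{*}_{m}(P) \\
  &\le \frac{8}{\delta} (2 + \ln(m)) \cdot \gamma \cdot \mathrm{Means}^{*}_{k}(P) \\
  &\le \varepsilon^{2} \, \mathrm{Means}^{*}_{k}(P) \\
  &\le \varepsilon^{2} \, \mathrm{Means}(P, C)
\end{align*}
and, thus, $|\mathrm{Means}(P, C) - \mathrm{Means}(S, C)| \le 6\varepsilon \cdot \mathrm{Means}(P, C)$
with probability at least 1 − δ, provided that the coreset size m satisfies the condition
\[
m \ge \left( \frac{16d}{\gamma} \right)^{d/2} \cdot k \cdot \lceil \log(n) + 3 \rceil . \tag{6.4}
\]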
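Hence, the only thing left to do is to prove that there exists a coreset size
\[
m \in \left( \frac{d}{\delta\varepsilon} \right)^{O(d)} \cdot k \cdot \log(n) \cdot \log^{d/2}\!\left( \frac{k \log(n)}{\delta\varepsilon} \right) \tag{6.5}
\]
that satisfies Inequality (6.4). Since we can assume that n ≥ 16 and m ≥ 8, Inequality (6.4)
is satisfied if we have
\[
\frac{m}{\log^{d/2}(m)} \ge \frac{2 \cdot 16^{d} \, d^{d/2} \, k \log(n)}{\delta^{d/2} \varepsilon^{d}} . \tag{6.6}
\]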
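We conclude that Condition (6.5) and Inequality (6.6) are both satisfied for a choice of
\[
m = (2d)^{d/2} \cdot \frac{2 \cdot 16^{d} \, d^{d/2} \, k \log(n)}{\delta^{d/2} \varepsilon^{d}}
    \cdot \log^{d/2}\!\left( \frac{2 \cdot 16^{d} \, d^{d/2} \, k \log(n)}{\delta^{d/2} \varepsilon^{d}} \right)
\]
since we have
\begin{align*}
\log^{d/2}(m)
  &= \log^{d/2}\!\left( (2d)^{d/2} \cdot \frac{2 \cdot 16^{d} \, d^{d/2} \, k \log(n)}{\delta^{d/2} \varepsilon^{d}}
       \cdot \log^{d/2}\!\left( \frac{2 \cdot 16^{d} \, d^{d/2} \, k \log(n)}{\delta^{d/2} \varepsilon^{d}} \right) \right) \\
  &\le \left( \frac{d}{2} \right)^{d/2} \cdot \log^{d/2}\!\left( 2d \cdot \frac{2 \cdot 16^{d} \, d^{d/2} \, k \log(n)}{\delta^{d/2} \varepsilon^{d}}
       \cdot \log\!\left( \frac{2 \cdot 16^{d} \, d^{d/2} \, k \log(n)}{\delta^{d/2} \varepsilon^{d}} \right) \right) \\
  &\le \left( \frac{d}{2} \right)^{d/2} \cdot \log^{d/2}\!\left( \left( \frac{2 \cdot 16^{d} \, d^{d/2} \, k \log(n)}{\delta^{d/2} \varepsilon^{d}} \right)^{4} \right) \\
  &= (2d)^{d/2} \cdot \log^{d/2}\!\left( \frac{2 \cdot 16^{d} \, d^{d/2} \, k \log(n)}{\delta^{d/2} \varepsilon^{d}} \right) .
\end{align*}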
Please note that the size bound on the number of coreset points m from Theorem 6 is
merely a sufficient condition, and that, to the best of our knowledge, there is no reason
to assume that this size bound is tight. Hence, consistent with our experiments, the
actual dependency of m on the dimension d may well be better than the theorem
suggests.
6.3 The Coreset Tree
Unfortunately, there is one practical problem concerning the k-Means++ seeding proce-
dure. Assume that we have chosen a sample set S = {q1, q2, . . . , qi} from the input set
P ⊆ Rd so far, where i < m and |P | =: n. In order to compute the probabilities to
choose the next sample point qi+1, we need to determine the squared distance from each
point in P to its nearest neighbor in S. Hence, using a standard implementation of such
a computation, we require Θ(dnm) time to obtain all m coreset points, which is too slow
for larger values of m. For this reason, we propose a new data structure called coreset
tree, which speeds up this computation. Roughly speaking, a coreset tree is a hierarchical
decomposition of P where each leaf represents a set of this decomposition. The advantage
of the coreset tree is that it allows us to compute subsequent sample points by taking
only points from a subset of P into account whose size is significantly smaller than n. We
obtain that if the constructed coreset tree is balanced (i.e., the tree is of depth Θ(log(m))
and each leaf represents roughly the same number of points), we merely need Θ(dn log(m))
time to compute all m coreset points. This intuition is supported by our empirical evalua-
tion on real-world datasets, where we find that the process of sampling according to D2 is
significantly sped up while the resulting sample set S has essentially the same properties
as the original k-Means++ seed.
In the following, we will explain the construction of the coreset tree in more detail. A
description in pseudocode is given by Algorithm 6.3.1.
6.3.1 Definition of the Coreset Tree
A coreset tree T for a point set P is a binary tree that is associated with a hierarchical
divisive clustering for P : One starts with a single cluster that contains the whole point set
P and successively partitions existing clusters into two subclusters such that the points in
one subcluster are far from the points in the other subcluster. The division step is repeated
until the number of clusters corresponds to the desired number of clusters. Associated with
this procedure, the coreset tree T has to satisfy the following properties:
(i) Each node of T is associated with a cluster in the hierarchical divisive clustering.
(ii) The root of T is associated with the single cluster that contains the whole point set
P .
(iii) The nodes associated with the two subclusters of a cluster C are the child nodes of
the node associated with C.
With each node v of T, we store the following attributes: a point set $P_v$, a representative
point $q_v$ from $P_v$, and a value weight(v). Here, point set $P_v$ is the cluster associated
with node v. Note that this attribute only has to be stored explicitly in the leaf nodes of T,
while, for an internal node v, the set $P_v$ is implicitly defined by the union of the point sets
of its children. The representative point $q_v$ of a node v is obtained by using the technique of
non-uniform sampling according to $D^{2}$. At any time, the points $q_\ell$ stored at the leaf
nodes $\ell$ are the points that have been chosen so far to be points of the eventual coreset.
Furthermore, for a leaf node $\ell$, the attribute weight($\ell$) equals
$\mathrm{Means}(P_\ell, q_\ell)$, which is the sum of squared distances over all points in
$P_\ell$ to $q_\ell$. The value weight(v) of an internal node v is defined as the sum of the
weights of its children.
6.3.2 Construction of the Coreset Tree
For the sake of simplicity, at any time, we number the leaf nodes of the current coreset tree
consecutively starting with 1. At the beginning, T consists of one node, the root, which is
given the number 1 and is associated with the whole point set P. The attribute $q_1$ of the
root is our first point in S and is obtained by sampling one point uniformly at random from
P. Now, let us assume that our current tree has i leaf nodes 1, 2, . . . , i, the corresponding
sample points are $q_1, q_2, \dots, q_i$, and $P_1, P_2, \dots, P_i$ are the associated clusters.
We obtain the next sample point $q_{i+1}$, new clusters in our hierarchical divisive clustering,
and, thus, new nodes in T by performing the following three steps:
1. Choose a leaf node $\ell$ at random, where the probability of each leaf node $\ell'$ is
proportional to $\mathrm{cost}(P_{\ell'}, q_{\ell'})$, i.e., to the leaf weight
weight($\ell'$) = $\mathrm{Means}(P_{\ell'}, q_{\ell'})$.
2. Choose a new sample point, denoted by $q_{i+1}$, from the subset $P_\ell$ at random according
to $D^{2}$.
3. Based on $q_\ell$ and $q_{i+1}$, split $P_\ell$ into two subclusters and create two child
nodes of $\ell$ in T.
The first step is implemented as follows: Starting at the root of T, let u be the current
inner node. Then, we randomly select a child node of u, where the probability distribution
for the child nodes of u is given by their associated weights. More precisely, each child
node v of the current node u is chosen with probability weight(v)/weight(u). We continue
this selection process until we reach a leaf node. Let $\ell$ be the selected leaf node, let
$q_\ell$ be the sample point stored at $\ell$, and let $P_\ell$ be the subset of P associated
with $\ell$. It is easy to check that, in doing so, we have chosen $\ell$ among the leaf nodes
with probability
\[
\mathrm{cost}(P_\ell, q_\ell) \Big/ \sum_{j=1}^{i} \mathrm{cost}(P_j, q_j) .
\]
In the second step, we choose the new sample point $q_{i+1}$ from $P_\ell$ at random according
to $D^{2}$, i.e., each $p \in P_\ell$ is chosen with probability
$D^{2}(p, q_\ell) / \mathrm{cost}(P_\ell, q_\ell)$. In doing so, each point in P is sampled with
a probability that is proportional to its squared distance to its center in the clustering
induced by the partition of the leaf nodes (giving the clusters) and their sample points (being
the centers). That is, we use the same distribution as the k-Means++ seeding does with the
exception that the probability of choosing a point $p \in P_j$ is proportional to
$D^{2}(p, q_j)$ rather than proportional to $D^{2}(p, \{q_1, \dots, q_i\})$.

Algorithm 6.3.1 TreeCoreset(P, m)
1: choose q_1 uniformly at random from P
2: root ← node with q_root ← q_1 and weight(root) ← Means(P, q_1)
3: S ← {q_1}
4: for i ← 2 to m do
5:     start at root, iteratively select one of the two child nodes at random according to
       their weights until a leaf ℓ is chosen
6:     choose q_i according to D^2 from P_ℓ
7:     S ← S ∪ {q_i}
8:     create two child nodes ℓ_1, ℓ_2 of ℓ and update weight(ℓ)
9:     propagate update of weight attribute upwards up to node root
In the third step, we create two child nodes `1 and `2 of ` and compute the associated
partition of P` as well as the corresponding attributes. We store at node `1 the point q`
and at node `2 our new sample point qi+1. Based on these two representative points, we
partition P` into two subsets P`1 and P`2 . The set P`1 contains all the points from P` which
are closer to q` than to qi+1, i.e.,
P`1 = {p ∈ P` | D(p, q`) < D(p, qi+1)} .
The set P`2 contains all the remaining points from P`, i.e.,
P`2 = {p ∈ P` | D(p, qi+1) ≤ D(p, q`)} .
The node `1 is associated with the set P`1 , and `2 is associated with the set P`2 . We
determine the weight attribute for the nodes `1 and `2 as described above. Recall here that
the weight attribute of an inner node of T is defined as the sum of the weights of its child
nodes. Consequently, we update the weight of the parent node ` of `1 and `2 according to
this. Afterwards, this update is propagated upwards, until we reach the root of the tree.
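The three steps described above can be sketched in Python as follows. This is only a minimal illustration under simplifying assumptions, not the implementation evaluated in this chapter: points are tuples of numbers, the input contains at least m distinct points, and the names `Node`, `tree_coreset`, and `leaf_representatives` are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance D^2(p, q)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

class Node:
    def __init__(self, points, rep, parent=None):
        self.points = points      # P_l (kept only while the node is a leaf)
        self.rep = rep            # representative point q_l
        self.parent = parent
        self.children = None
        self.weight = sum(sq_dist(p, rep) for p in points)  # cost(P_l, q_l)

def tree_coreset(P, m, rng):
    root = Node(list(P), rng.choice(P))
    num_leaves = 1
    while num_leaves < m and root.weight > 0:
        # Step 1: descend from the root, choosing a child with probability
        # proportional to its weight (zero-weight children are never taken).
        leaf = root
        while leaf.children is not None:
            left, right = leaf.children
            if right.weight == 0:
                leaf = left
            elif left.weight == 0:
                leaf = right
            else:
                leaf = left if rng.random() * leaf.weight < left.weight else right
        # Step 2: sample the new point from P_l according to D^2.
        d2 = [sq_dist(p, leaf.rep) for p in leaf.points]
        q_new = rng.choices(leaf.points, weights=d2)[0]
        # Step 3: split P_l between the old and the new representative.
        near_old = [p for p in leaf.points
                    if sq_dist(p, leaf.rep) < sq_dist(p, q_new)]
        near_new = [p for p in leaf.points
                    if sq_dist(p, q_new) <= sq_dist(p, leaf.rep)]
        leaf.children = (Node(near_old, leaf.rep, leaf),
                         Node(near_new, q_new, leaf))
        leaf.points = None
        # Propagate the weight update from the split node up to the root.
        node = leaf
        while node is not None:
            node.weight = node.children[0].weight + node.children[1].weight
            node = node.parent
        num_leaves += 1
    return root

def leaf_representatives(node):
    """Collect the representative points stored at the leaf nodes."""
    if node.children is None:
        return [node.rep]
    left, right = node.children
    return leaf_representatives(left) + leaf_representatives(right)
```

Storing the summed weights at the inner nodes is what allows a leaf to be selected by a single root-to-leaf walk instead of a scan over all leaves.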
6.3.3 Extraction of the Coreset
As soon as the coreset tree T has m leaf nodes, we can construct our coreset. Let
q1, q2, . . . , qm be the representative points stored at the leaf nodes of T . Furthermore,
let Qi denote the set of points from P which are closest to qi (breaking ties arbitrarily).
Then, we obtain the coreset S = {q1, q2, . . . , qm} where the weight of qi is given by the
number of points in Qi.
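As a small illustration of this extraction step, the following Python sketch computes the coreset weights for given representatives. The function name `coreset_weights` is hypothetical, and ties are broken in favor of the representative with the smallest index, which is one admissible way of breaking ties arbitrarily.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def coreset_weights(P, reps):
    """Weight of reps[i] = number of points of P whose closest representative is reps[i]."""
    weights = [0] * len(reps)
    for p in P:
        nearest = min(range(len(reps)), key=lambda i: sq_dist(p, reps[i]))
        weights[nearest] += 1
    return weights

P = [(0,), (1,), (10,), (11,), (12,)]
reps = [(0,), (11,)]
print(coreset_weights(P, reps))  # prints [2, 3]
```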
6.4 Streaming Algorithm
In this section, we describe our clustering algorithm for data streams. To this end, let m be
a fixed size parameter. First, we extract a small coreset of size m from the data stream by
Algorithm 6.3.2 InsertPoint(p)
    put p into B_0
    if B_0 is full then
        create empty bucket S
        move points from B_0 to S
        empty B_0
        i ← 1
        while B_i is not empty do
            create coreset from the union of B_i and S
            store coreset in S
            empty B_i
            i ← i + 1
        move points from S to B_i
using the merge-and-reduce technique from Har-Peled and Mazumdar [58], which is based
on the theory of decomposable search problems of Bentley and Saxe [16]. This streaming
method is described in detail in the section below. For the reduce step, we employ our new
coreset construction, using the coreset trees as given in Section 6.3.
Afterwards, a k-clustering can be obtained at any time by running any k-means algorithm
on the coreset. Note that since the size of the coreset is much smaller than the size of the
data stream, it is no longer inefficient to use algorithms that require random access on
their input data. In our implementation, we run the k-Means++ algorithm from Arthur
and Vassilvitskii [9] on our coreset five times independently and choose the best clustering
result obtained this way. We call the resulting algorithm StreamKM++.
6.4.1 The Merge-and-Reduce Technique
In order to maintain a small coreset for all points in the data stream, we use the
merge-and-reduce method [16, 58]. For a data stream containing n points, the algorithm maintains
$L := \lceil \log(n/m) + 2 \rceil$ buckets $B_0, B_1, \ldots, B_{L-1}$. Bucket $B_0$ can store any number between
0 and m points. For $i \ge 1$, bucket $B_i$ is either empty or contains exactly m points. The
idea of this approach is that, at any time, if bucket $B_i$ is full, it contains a coreset of size
m representing $2^{i-1} m$ points from the data stream.
New points from the data stream are always inserted into the first bucket B0. If bucket
B0 is full (i.e., contains m points), all points from B0 need to be moved to bucket B1. If
bucket B1 is empty, we are done. However, if bucket B1 already contains m points, we
compute a new coreset S of size m from the union of the 2m points stored in B0 and B1 by
using the coreset construction described above. Now, both buckets B0 and B1 are emptied
and the m points from coreset S need to be moved into bucket B2. If bucket B2 is full, we
repeat the process with S and B2. Overall, the process is repeated iteratively until we find
the first empty bucket in which we can move the coreset S. A description in pseudocode
for inserting a point from the data stream into the buckets is given by Algorithm 6.3.2.
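The bucket handling can be sketched in Python as follows. This is a simplified illustration, not the implementation evaluated here: the class name `MergeReduce` is hypothetical, buckets are created on demand instead of being preallocated, and the reduce step is passed in as a function. In the usage example, uniform sampling without point weights serves only as a stand-in for the coreset tree construction.

```python
import random

class MergeReduce:
    def __init__(self, m, reduce_fn):
        self.m = m                  # bucket capacity and coreset size
        self.reduce_fn = reduce_fn  # maps (points, m) to a coreset of size m
        self.buckets = [[]]         # buckets[i] plays the role of B_i

    def insert(self, p):
        self.buckets[0].append(p)
        if len(self.buckets[0]) < self.m:
            return
        # B_0 is full: move its points into a temporary bucket S.
        s = self.buckets[0]
        self.buckets[0] = []
        i = 1
        # Merge with full buckets until the first empty bucket is found.
        while i < len(self.buckets) and self.buckets[i]:
            s = self.reduce_fn(self.buckets[i] + s, self.m)
            self.buckets[i] = []
            i += 1
        if i == len(self.buckets):
            self.buckets.append([])
        self.buckets[i] = s

    def coreset(self):
        """Reduce the union of all buckets to at most m points."""
        union = [p for bucket in self.buckets for p in bucket]
        if len(union) <= self.m:
            return union
        return self.reduce_fn(union, self.m)

rng = random.Random(1)
stream = MergeReduce(4, lambda pts, m: rng.sample(pts, m))  # stand-in reduce step
for x in range(20):
    stream.insert(x)
print([i for i, b in enumerate(stream.buckets) if b])  # prints [1, 3]
```

After 20 insertions with m = 4, five full blocks have been processed, and the non-empty buckets mirror the binary representation 101 of the number 5.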
At any time, it is possible to compute a coreset of size m for all the points in the
data stream that we have seen so far. For this purpose, we compute a coreset from the
union of the at most $m \lceil \log(n/m) + 2 \rceil$ weighted coreset points stored in all the buckets
$B_0, B_1, \ldots, B_{L-1}$ by using the coreset tree construction. In this way, we obtain the desired
coreset of size m.
Note that the coreset tree construction can be easily generalized to input points with
integer weights. To this end, each time we choose a new coreset point, we compute
the probabilities of the points according to $D^2$, as described before, and then multiply
each probability by the weight of the corresponding point (normalizing afterwards so that
the probabilities sum to one). We also incorporate the point
weights when we compute the weight attribute of a new leaf node. These two adaptations
can be thought of as replacing each weighted point by multiple copies of the same point,
each having weight 1.
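The equivalence between weighted points and multiple unit-weight copies can be checked on a toy example. The following Python sketch assumes integer coordinates and integer weights so that exact fractions can be used; the function name `weighted_d2_probabilities` is hypothetical.

```python
from fractions import Fraction

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def weighted_d2_probabilities(points, weights, center):
    """Sampling probability of each point: proportional to weight * D^2(p, center)."""
    scores = [w * sq_dist(p, center) for p, w in zip(points, weights)]
    total = sum(scores)
    return [Fraction(s, total) for s in scores]

points = [(1,), (2,), (3,)]
weights = [3, 1, 2]
center = (0,)
probs = weighted_d2_probabilities(points, weights, center)

# The same distribution arises from unit-weight copies of each point.
copies = [p for p, w in zip(points, weights) for _ in range(w)]
copy_probs = weighted_d2_probabilities(copies, [1] * len(copies), center)
aggregated = [sum(pr for q, pr in zip(copies, copy_probs) if q == p) for p in points]
print(probs == aggregated)  # prints True
```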
6.4.2 Complexity
Using our implementation, a single merge-and-reduce step is guaranteed to be executed
in time $O(dm^2)$ (or even in time $\Theta(dm \log(m))$ if we assume the used coreset tree to
be balanced). For a stream of n points, $\lceil n/m \rceil$ such steps are needed. The amortized
running time of all merge-and-reduce steps is at most $O(dnm)$. The final merge-and-reduce
step, to obtain a coreset of size m for the union of all buckets, can be done in time
$O(dm^2 \log(n/m))$. Finally, algorithm k-Means++ is executed five times on an input set
of size m, requiring time $\Theta(dkm)$ per iteration. Summing up, the total running time of
algorithm StreamKM++ is $O(dnm)$, and the amortized processing time per data item
is $O(dm)$. Obviously, algorithm StreamKM++ needs at most $\Theta(dm \log(n/m))$ memory
units. Hence, both the processing time and the space requirement have a low dependency
on the dimension d. As a result, our approach is suitable for high-dimensional data.
Of course, careful consideration has to be given to the choice of the coreset size parameter
m. Our experiments show that a choice of m = 200k is sufficient for a good clustering
quality without sacrificing too much running time.
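To get a feeling for these bounds, the following Python sketch evaluates the number of buckets and the number of merge-and-reduce steps for a stream of the size of the BigCross dataset with k = 20 and m = 200k. It assumes that the logarithm is taken to base 2, which is not stated explicitly above; the function names are hypothetical.

```python
import math

def num_buckets(n, m):
    """L = ceil(log(n/m) + 2), assuming the logarithm is base 2."""
    return math.ceil(math.log2(n / m) + 2)

def num_reduce_steps(n, m):
    """Number of merge-and-reduce steps, ceil(n/m)."""
    return math.ceil(n / m)

k = 20
m = 200 * k          # coreset size m = 200k
n = 11_620_300       # number of points in BigCross

print(num_buckets(n, m))       # prints 14
print(m * num_buckets(n, m))   # prints 56000 (upper bound on stored points)
print(num_reduce_steps(n, m))  # prints 2906
```

Under these assumptions, at most 56 000 weighted points are held in memory for a stream of more than eleven million points.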
6.5 Empirical Evaluation
We conducted several experiments on different datasets to evaluate the quality of algorithm
StreamKM++.¹ A description of the datasets can be found in the next section. The
computation on the biggest dataset, which is denoted by BigCross, was performed on a
DELL Optiplex 620 machine with 3 GHz Pentium D CPU and 2 GB main memory, using
Linux 2.6.9 kernel. For all remaining datasets, the computation was performed on a DELL
Optiplex 620 machine with 3 GHz Pentium D CPU and 4 GB main memory, using Linux
2.6.18 kernel.
¹ The source code, the documentation, and the datasets of our experiments can be found at
http://www.cs.upb.de/en/fachgebiete/ag-bloemer/research/clustering/streamkmpp/
We compared algorithm StreamKM++ with two frequently used clustering algorithms
for processing data streams, namely with algorithm BIRCH [111] and with algorithm
StreamLS [96, 52]. On the smaller datasets, we also compared our algorithm with a
vanilla implementation of Lloyd’s algorithm [80], using initial seeds either uniformly at
random (algorithm k-Means) or according to the non-uniform seeding from Arthur and
Vassilvitskii [9] (algorithm k-Means++). All algorithms were compiled using g++ from the
GNU Compiler Collection on optimization level 2. The quality measure for all experiments
was the sum of squared distances, referred to as the cost of the clustering.
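For concreteness, this quality measure can be computed as in the following Python sketch. The function name `clustering_cost` and the optional `weights` argument (useful for weighted coreset points) are hypothetical additions for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def clustering_cost(points, centers, weights=None):
    """Sum of (weighted) squared distances of the points to their nearest center."""
    if weights is None:
        weights = [1] * len(points)
    return sum(w * min(sq_dist(p, c) for c in centers)
               for p, w in zip(points, weights))

points = [(0, 0), (1, 0), (4, 0)]
centers = [(0, 0), (5, 0)]
print(clustering_cost(points, centers))  # prints 2
```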
6.5.1 Datasets
Since synthetic datasets are typically easy to cluster, we focused our experiments on real-
world datasets to obtain practically relevant results. Our main source was the UCI Machine
Learning Repository [13]. In the following, we will give a brief description of all the datasets
used in our empirical evaluation.
Spambase² is a dataset that contains data about spam e-mails and non-spam e-mails,
including work and personal e-mails. Each data entry is a vector consisting of frequencies
of certain words or characters occurring in the message and a class attribute that denotes
whether the corresponding e-mail was considered as spam or not. After removing the
classification attribute, 4 601 data points in 57 dimensions remained.
Intrusion²,³ comprises data about TCP transmissions in a simulated environment. This
simulation included different types of network attacks and intrusion attempts as well as
normal network traffic. We used a 10% subset of the whole unlabeled dataset⁴ and excluded
all symbolic features. Eventually, 311 078 data points in 34 dimensions remained.
Covertype²,⁵ contains cartographic data about some wilderness areas inside the Roosevelt
National Forest of northern Colorado. The main purpose of analyzing this dataset is to
predict the forest cover type of specific regions from cartographic variables, which
is a classification task. After removing the classification attribute, 581 012 data points in
54 dimensions remained.
The Tower⁶ dataset consists of the RGB values of a 2 560 by 1 920 pixel image file.
All 4 915 200 pixels are mapped into a 3-dimensional space of integer values between 0
and 255, representing the colors used in the image. Note that clustering techniques are
frequently used for lossy image compression: Individual colors can be substituted with
their corresponding cluster center.
The Census 1990² dataset consists of a one percent sample of the Public Use Microdata
Samples (PUMS) person records, sampled from the full 1990 census set contributed by
² The dataset was contributed by the UCI Machine Learning Repository [13].
³ The Intrusion dataset is part of the kddcup99 dataset.
⁴ Available for free download at
http://kdd.ics.uci.edu/databases/kddcup99/kddcup.newtestdata_10_percent_unlabeled.gz
⁵ Copyright by Jock A. Blackard, Colorado State University
⁶ The Tower dataset was contributed by Gereon Frahling and is available for free download at:
http://homepages.uni-paderborn.de/frahling/coremeans.html
dataset        data points   dimension   type
Spambase             4 601          57   float
Intrusion          311 078          34   int, float
Covertype          581 012          54   int
Tower            4 915 200           3   int
Census 1990      2 458 285          68   int
BigCross        11 620 300          57   int
Normdata           100 000          15   float
Table 6.1: Overview of the datasets
the U.S. Department of Commerce Census Bureau. Most of the data is citizen-related
information, such as personal income or age. The dataset has 2 458 285 data
points in 68 dimensions. To our knowledge, it is one of the largest naturally structured
and freely accessible datasets available.
To run our algorithm on really huge datasets, we created the Cartesian product of the
Tower and Covertype datasets. In this way, we got a naturally structured dataset that is
large enough to test our algorithm's ability to handle huge amounts of data. We used a
1.5 GB sized subset of the Cartesian product consisting of 11 620 300 data points with 57
attributes, which we refer to as the BigCross dataset.
To evaluate the impact of the number of well separated clusters of a dataset, we also
considered a number of synthetic datasets, to which we collectively refer as the Normdata
datasets. To generate these datasets, we used essentially the same construction that has
already been used in [9] to evaluate the k-Means++ algorithm. More precisely, for different
values of k, we chose k 'true' centers uniformly at random from a 15-dimensional
hypercube of side length 100. We then randomly chose points from a uniform mixture of
15-dimensional normal distributions of variance 1 around these center points. In this way,
we obtained k well separated clusters. Each Normdata dataset consists of 100 000 points.
The size and the dimensionality of the datasets are summarized in Table 6.1.
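The construction of the Normdata datasets can be sketched in Python as follows. This is an illustration of the description above rather than the generator actually used: the function name `normdata` and its parameters are hypothetical, and each coordinate is drawn independently with standard deviation 1 (i.e., variance 1).

```python
import random

def normdata(k, n, rng, dim=15, side=100.0):
    """Draw n points from a uniform mixture of k spherical normal distributions."""
    # k 'true' centers, uniformly at random from a hypercube of the given side length
    centers = [[rng.uniform(0.0, side) for _ in range(dim)] for _ in range(k)]
    points = []
    for _ in range(n):
        center = rng.choice(centers)  # uniform mixture: each center is equally likely
        points.append([rng.gauss(x, 1.0) for x in center])
    return centers, points

rng = random.Random(42)
centers, points = normdata(k=5, n=1000, rng=rng)
```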
6.5.2 Parameters of the Algorithms
For algorithm BIRCH, we set all parameters of the experimental environment, except for
the memory settings, as recommended by the authors of BIRCH. Like Guha et al. [52],
we observed that the CF-Tree had fewer leaves than it was allowed to use. The CF-Tree is
the data structure that is used to compute the pre-clustering into the so-called clustering
features (see also Section 6.1). The more leaves it has, the finer the pre-clustering is.
Therefore, from time to time, BIRCH did not produce the correct number of centers,
especially when the number of clusters k was high. For this reason, the memory settings
had to be manually adjusted for each individual dataset. The complete list of parameters
is given in Tables A.1 and A.2 in Appendix A.1.
For algorithm StreamKM++, we experimentally determined an appropriate coreset size
m as a function of k. For obvious reasons, we need to choose m ≥ k. To estimate an m that
is sufficient to obtain good approximation results, we ran several experiments for different
values of k and m on the datasets Covertype and Tower. Due to the randomized⁷ nature of
StreamKM++, we conducted ten runs for each combination of k and m. Figure 6.2 shows
the average running times and cost of the clusterings. Concerning the cost, we observed
that, for coreset sizes that are only marginally larger than k, the quality of a clustering can
be improved considerably by increasing the coreset size. In contrast to that, for coreset
sizes of, say, m = 100k or more, the quality improves only slightly with increasing coreset
size. For instance, the cost of a 50-clustering of either dataset computed on 20 000 coreset
points is only marginally smaller than the cost of a 50-clustering computed on 10 000
coreset points. However, with respect to the running time, we observed that the
running time grows roughly linearly with the coreset size. Overall, we conclude the
following. On the one hand, m should not be chosen too small (e.g., a very small multiple
of k) because, for these values of m, the quality of a clustering can be easily improved
without sacrificing too much running time. On the other hand, m should not be chosen
too large (e.g., a large multiple of k) because the increase in quality is only very small
compared to clusterings for smaller coresets, but the running time is significantly higher.
Therefore, we assume that our choice of m = 200k provides a good trade-off for arbitrary
datasets. However, smaller sizes such as m = 20k or m = 50k might still be sufficient to
obtain very good clustering results on datasets with k well separated clusters.
For algorithm StreamLS the size of the data chunks is set equal to the coreset size m of
algorithm StreamKM++. This is done to enable a fair comparison of both algorithms by
allowing the same memory usage. We have to point out that, due to its nature, algorithm
StreamLS does not always compute the prespecified number of cluster centers. In such
a case, the difference varies from dataset to dataset and usually lies within a 20% margin
from the prespecified number.
6.5.3 Comparison of the Algorithms
Comparison with BIRCH and StreamLS
To compare StreamKM++ with BIRCH and StreamLS, we conducted several experiments
for different values of k on the four larger real-world datasets, i.e., the datasets
Covertype, Tower, Census 1990, and BigCross. In each of these experiments, we set m = 200k.
For the randomized algorithms StreamKM++ and StreamLS, ten experiments were
conducted for each fixed k. For BIRCH, a single run was used since it is a deterministic
algorithm. The average running times and cost of the clusterings are summarized in Fig-
ures 6.3 and 6.4. The interested reader can find the concrete values of all experiments in
Appendices A.2 and A.3.
In our experiments, algorithm BIRCH had the best running time of all algorithms.
However, this comes at the expense of a high k-means clustering cost. In terms of the sum
⁷ We used the Mersenne Twister PRNG [85].
Figure 6.2: Experimental results for different coreset sizes
of squared distances, algorithms StreamKM++ and StreamLS outperform BIRCH by a
factor of up to 2. Furthermore, as already mentioned, one drawback of algorithm BIRCH
is the need to adjust parameters manually to obtain a clustering with the desired number
of centers.
By comparing StreamKM++ and StreamLS, we observed that the quality of the
clusterings was on a par. More precisely, the absolute values of the cost of both algorithms
lie within a ±5% margin of each other. In contrast to algorithm StreamLS, the
number of centers computed by our algorithm always equals its prespecified value. Hence,
the cost of clusterings computed by algorithm StreamKM++ tends to be more stable than
the costs computed by algorithm StreamLS. The standard deviations of the running times
and clustering cost for k = 20 are given in Tables 6.2 and 6.3. A complete overview for all
experiments can be found in Appendix A.4.
In terms of running time, it turns out that our algorithm scales much better with in-
creasing number of centers than algorithm StreamLS does. While for about k ≤ 10
centers StreamLS is sometimes faster than our algorithm, for a larger number of cen-
ters, our algorithm easily outperforms StreamLS. For instance, on the dataset Tower,
StreamKM++ computes a clustering with k = 100 centers in about 3% of the running
time of StreamLS.
To investigate the impact of the number of clusters on the running time further, we
conducted experiments on the synthetic datasets Normdata for different values of k and
m. As described before, for both StreamKM++ and StreamLS, ten experiments were
conducted for each combination of k and m. The average running times of the clusterings
Figure 6.3: Experimental results for the datasets Census 1990 and BigCross
k = 20         running time (in sec)
               StreamKM++   StreamLS   k-Means++
Spambase             1.09          -        3.88
Intrusion            3.22          -       98.11
Covertype            6.93      18.18     1249.18
Tower                0.58      14.11     1594.76
Census 1990          5.16      54.30           -
BigCross            11.49     162.44           -
Table 6.2: Standard deviation of the running time for k = 20
are shown in Figure 6.5. Note that we omitted a figure presenting the average cost of the
clusterings because both StreamKM++ and StreamLS always found an optimal or near
optimal clustering. The interested reader can find the average values as well as the standard
deviations for both running times and cost of the clusterings in the appendix. Figure 6.5
reveals the difference between the running times of StreamKM++ and StreamLS. The
ratio between the running time needed by StreamKM++ and the running time needed by
StreamLS is decreasing with increasing number of clusters. For m = 500, StreamKM++
computed the clusterings for k = 100 in about 8% of the running time of StreamLS and,
for k = 200, it finished the clustering in about 2% of the running time of StreamLS.
Figure 6.4: Experimental results for the datasets Covertype and Tower
k = 20         cost
               StreamKM++     StreamLS       k-Means++
Spambase       6.49 · 10^5    -              1.73 · 10^6
Intrusion      8.54 · 10^10   -              3.70 · 10^11
Covertype      1.08 · 10^9    1.03 · 10^10   9.17 · 10^8
Tower          7.31 · 10^6    2.71 · 10^7    4.39 · 10^7
Census 1990    3.66 · 10^6    3.14 · 10^6    -
BigCross       2.46 · 10^10   3.36 · 10^11   -
Table 6.3: Standard deviation of the cost for k = 20
For m = 1 000, StreamKM++ computed the clusterings for k = 100 in about 38% of the
running time of StreamLS, whereas, for k = 200, it needed about 3% of the running time
of StreamLS.
Overall, we conclude that, if the first priority is the quality of the clustering, then our
algorithm provides a good alternative to BIRCH and StreamLS. This applies particularly
if the number of cluster centers is large.
Figure 6.5: Experimental results for the Normdata datasets
Figure 6.6: Experimental results for the datasets Spambase and Intrusion
Comparison with k-Means and k-Means++
We also compared the quality of StreamKM++ with the quality of classical non-streaming
k-means algorithms. Because of their popularity, we have chosen k-Means and k-Means++
as competitors. These algorithms are designed to work in a classical non-streaming setting
and, due to their need for random access on the data, are not suited for larger datasets.
For this reason, we have run k-Means only on the two smallest datasets Spambase and
Intrusion, while k-Means++ has been evaluated only on the four smaller datasets
Covertype, Tower, Spambase, and Intrusion. For each fixed k, we conducted ten experiments.
The results of these experiments are summarized in Figures 6.6 and 6.4. Please note that
the results for the dataset Intrusion are on a logarithmic scale. The concrete values of
all experiments can be found in Appendices A.2 and A.3. The standard deviations of the
running times and clustering cost are given in Appendix A.4.
As expected, k-Means++ is clearly superior to the classical algorithm k-Means both
in terms of quality and running time. Comparing k-Means++ with our streaming algo-
rithm, we find that on all datasets the quality of the clusterings computed by algorithm
StreamKM++ is on a par with or even better than the clusterings obtained by algo-
rithm k-Means++. We conjecture that this is due to the fact that, in the last step of
our algorithm, we run the k-Means++ algorithm five times on the coreset and choose the
best clustering result obtained this way. On the other hand, for the experiments with
the k-Means++ algorithm, we run the k-Means++ algorithm only once in each repetition
of the experiment. However, the running time of k-Means++ is comparable with that of
algorithm StreamKM++ only for the smallest dataset Spambase. Even for moderately large
datasets, like the Covertype dataset, we observe that algorithm StreamKM++ is orders
of magnitude faster than k-Means++. We conclude that algorithm k-Means++ should
only be used if the size of the dataset is not too large. For larger datasets, algorithm
StreamKM++ computes comparable clusterings in significantly less running time.
7 Well-Separated Pair Decomposition with Slack
In this chapter, we study the construction of well-separated pair decompositions (WSPDs)
for point sets. Intuitively, two point sets are called a well-separated pair if the shortest
distance from any point in one set to any point in the other set is large compared to the
diameter of both sets. A well-separated pair decomposition of a point set consists of a
collection of well-separated pairs that covers all the pairs of distinct points, i.e., any two
distinct points belong to different subsets of some pair. In this way, a WSPD of size t
allows all pairwise distances to be compactly summarized by t distances.
Now, let us assume that we are given a huge point set P . Due to the size of P , it could
be useful to have a compact representation that fairly captures the pairwise distances of
P and uses space sublinear in |P|. If the structure of P is very simple (e.g.,
P is a multiset with many duplicates), it might be possible to construct a WSPD whose
representation has sublinear size. However, if the structure of P is more complex,
one cannot find a sublinear space representation of a WSPD such that all pairwise distances
of P are well preserved. To be able to deal with this problem, we introduce the notion of a
WSPD with slack. A WSPD with slack σ for P guarantees that at least a (1− σ)-fraction
of all the pairwise distances of P are well preserved.
After giving a formal definition of a WSPD with slack, we present an efficient construc-
tion of a WSPD with low slack for low-dimensional Euclidean point sets in Section 7.2.
Our construction is similar to the one we used in Chapter 5. We build a quadtree parti-
tion for the input points in which we recursively split every cell that contains more than
a certain threshold of points. Based on this partition, we obtain a representation whose
space requirement is polylogarithmic in both the size and the spread of the point set. In
Section 7.3, we show how to transfer our construction for low-dimensional Euclidean point
sets to point sets with bounded doubling dimension. Based on the techniques developed
in this chapter, we will design streaming algorithms to compute low-distortion embeddings
with low slack in Chapter 8.
7.1 Preliminaries
At first, we briefly recapitulate the classic notion of a well-separated pair decomposition
(WSPD). A more detailed description can be found in [22, 103]. Afterwards, we relax the
classic notion and give a formal definition of a WSPD with slack.
Let M = (X,D) be any metric space, where X is a set of n points and D is a distance
function defined on X (see Section 2.1 for a definition of metric spaces). Throughout this
chapter, we assume that the minimum pairwise distance between two points in X is at
least 1, and the maximum pairwise distance is at most ∆. For any constant parameter
ε with 0 < ε < 1, two non-empty subsets X1, X2 ⊆ X are called ε-well-separated if we
have max{diam(X1), diam(X2)} ≤ ε · D(X1, X2), where diam(X1) and diam(X2) are the
diameters of X1 and X2, respectively. The value ε is often called the separation parameter.
Based on this, an ε-WSPD for M is defined as follows.
Definition 7.1.1 (WSPD). Let ε, 0 < ε < 1, be a separation parameter. Let M = (X, D)
be any n-point metric space, and let P be a collection of ε-well-separated pairs of subsets
$\{(A_0, B_0), \ldots, (A_{t-1}, B_{t-1})\}$, where $A_i, B_i \subseteq X$ for $i \in [t]$. P is called an ε-WSPD for M
if every pair of points $(a, b) \in X \times X$, $a \ne b$, lies in $A_i \times B_i$ or $B_i \times A_i$ for exactly one
index $i \in [t]$.
The usefulness of an ε-WSPD for M is that, for any $i \in [t]$, the distances between pairs
of points from $A_i \times B_i$ are all identical to within a factor of $1 + 2\varepsilon$. Thus, if we store instead
of each pair $(A_i, B_i)$ a pair of representative points $(R(A_i), R(B_i))$ with $R(A_i) \in A_i$ and
$R(B_i) \in B_i$, then $D(R(A_i), R(B_i))$ is a $(1 \pm 2\varepsilon)$-approximation of all the distances between
pairs of points from $A_i \times B_i$. We also say that the distances between pairs of points from
$A_i \times B_i$ are $(1 \pm 2\varepsilon)$-preserved. Hence, an ε-WSPD for M has the property that all pairwise
distances of M are $(1 \pm 2\varepsilon)$-preserved.
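The separation condition and the resulting approximation guarantee can be checked on a small planar example. The following Python sketch assumes Euclidean points given as tuples and takes D(A, B) to be the smallest distance between a point of A and a point of B; the function names are hypothetical.

```python
import math
from itertools import combinations, product

def diam(X):
    """Largest pairwise distance within X (0 for a single point)."""
    return max((math.dist(a, b) for a, b in combinations(X, 2)), default=0.0)

def set_dist(A, B):
    """Smallest distance between a point of A and a point of B."""
    return min(math.dist(a, b) for a, b in product(A, B))

def is_well_separated(A, B, eps):
    return max(diam(A), diam(B)) <= eps * set_dist(A, B)

A = [(0, 0), (1, 0)]
B = [(10, 0), (10, 1)]
eps = 0.2
print(is_well_separated(A, B, eps))  # prints True

# The distance between two representatives approximates every distance in A x B.
rep_dist = math.dist(A[0], B[0])
preserved = all((1 - 2 * eps) * math.dist(a, b) <= rep_dist <= (1 + 2 * eps) * math.dist(a, b)
                for a, b in product(A, B))
print(preserved)  # prints True
```

In this example, both diameters are 1 and the sets are at distance 9, so the condition 1 ≤ 0.2 · 9 holds.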
Typically, one assumes that the size of an ε-WSPD for M is linear in n. However, we restrict
the space requirement of the representation of M to be sublinear in n, and not every
metric M admits an ε-WSPD of sublinear size for every separation parameter ε (e.g., the
uniform n-point metric does not). Therefore, we introduce the notion of a WSPD with slack.
Definition 7.1.2 (WSPD with Slack). Let ε, 0 < ε < 1, be a separation parameter and
σ, 0 < σ < 1, be a slack parameter. Let M = (X, D) be any n-point metric space, and
let P be a collection of pairs of subsets $\{(A_0, B_0), \ldots, (A_{t-1}, B_{t-1})\}$, where $A_i, B_i \subseteq X$ for
$i \in [t]$. Let $I_\varepsilon$ be the subset of indices such that, for all $j \in I_\varepsilon$, $(A_j, B_j)$ is ε-well-separated.
P is called an ε-WSPD with slack σ for M if every pair of points $(a, b) \in X \times X$, $a \ne b$,
lies in $A_i \times B_i$ or $B_i \times A_i$ for at most one index $i \in [t]$ and
$$\sum_{j \in I_\varepsilon} |A_j| \cdot |B_j| \ge (1 - \sigma) \cdot n^2 .$$
Despite the fact that the distance function D is symmetric, the slack σ of a WSPD is
measured as the fraction of all ordered pairs $(a, b) \in X \times X$ that do not
satisfy the condition given in Definition 7.1.1. The assumption of having $n^2$ (instead of
$\binom{n}{2}$) pairwise distances simplifies descriptions and makes our proofs cleaner, without changing
the results in any significant way.
7.2 Construction for Euclidean Metric Spaces
This section deals with the construction of a WSPD with slack for Euclidean metrics. Let
M = (P, D) be an n-point Euclidean space with constant dimension d, let ε, 0 < ε < 1,
be a separation parameter, and let σ, $2^{2d}/n < \sigma < 1$, be a slack parameter. In order to
construct an ε-WSPD with slack σ for M, we impose $\lceil \log(\Delta) \rceil + 1$ nested square grids over
P denoted by $G^{(0)}, G^{(1)}, \ldots, G^{(\lceil \log(\Delta) \rceil)}$. The side length of each cell in grid $G^{(i)}$ is $2^i$.
We say that the grid cells in $G^{(i)}$ are in level i.
Our algorithm consists of three phases. In the first phase, we compute a partition of
the space based on the heavy cells in the grids (see Definition 7.2.1). Then follows a
refinement phase, in which each cell of the space partition is further subdivided into smaller
cells, which we call cubelets. In the last phase, we determine a so-called representative for
each cubelet and compute an ε-WSPD with slack σ from the set of representatives.
Definition 7.2.1 (Heavy Cell). We call a grid cell heavy if it contains at least $h(\sigma) \cdot n$
points of P, where $h(\sigma) := \sigma / 2^d$ is a function dependent on σ. A grid cell that is not heavy
is called light.
Now, we describe the three phases in detail (see Algorithm 7.2.1 for a description in
pseudocode). In the first phase, we build a partition of the point space based on a quadtree.
To recapitulate, a quadtree for a d-dimensional point set is a rooted tree in which every
internal node has $2^d$ children. Furthermore, every node corresponds to a grid cell and, for
any internal node v, the cells of its children form a partition of the cell corresponding to v.
Thus, the cells of the leaf nodes form a partition of the cell of the root node. We call this
partition a quadtree partition. Our quadtree partition for P is now constructed as follows.
We start with the coarsest grid $G^{(\lceil \log(\Delta) \rceil)}$ and identify all heavy cells in this grid, i.e.,
cells containing at least $h(\sigma) \cdot n$ points. Then, we subdivide every heavy cell C into $2^d$ equal
sized subcells. These subcells are contained in grid $G^{(\lceil \log(\Delta) \rceil - 1)}$. We call C the parent
cell of these subcells. If none of the subcells is heavy, we stop our process. Otherwise, the
algorithm recursively subdivides every heavy cell such that, at the end of the first phase,
we have only light cells in our space partition. Note that all the cells in grid $G^{(0)}$ are
light since such a cell can contain at most $2^d$ points and $\sigma > 2^{2d}/n$ implies $h(\sigma) \cdot n > 2^d$.
Figure 7.1 illustrates the quadtree partitioning with the help of an example.
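The first phase can be sketched in Python as follows. This is a simplified illustration under assumptions that go beyond the text: all points are assumed to lie in a single half-open top-level cell $[0, 2^{top})^d$, a cell is represented by its level and its lower corner, and the function name `quadtree_partition` is hypothetical.

```python
from itertools import product

def quadtree_partition(points, sigma, top_level):
    """Split heavy cells recursively; return the leaf cells as (level, corner, points)."""
    n, d = len(points), len(points[0])
    threshold = sigma / 2 ** d * n          # h(sigma) * n
    leaves = []
    stack = [(top_level, (0,) * d, list(points))]
    while stack:
        level, corner, pts = stack.pop()
        # A light cell (or a cell of the finest grid) becomes a leaf.
        if len(pts) < threshold or level == 0:
            leaves.append((level, corner, pts))
            continue
        # A heavy cell is split into 2^d equal sized subcells of half the side length.
        half = 2 ** (level - 1)
        for offsets in product((0, 1), repeat=d):
            sub = tuple(c + o * half for c, o in zip(corner, offsets))
            inside = [p for p in pts
                      if all(s <= x < s + half for s, x in zip(sub, p))]
            stack.append((level - 1, sub, inside))
    return leaves

# 40 lattice points in [0, 16)^2; with sigma = 0.5 a cell is heavy from 5 points on.
points = [(x, y) for x in range(6) for y in range(6)] + [(15, 15), (12, 3), (3, 12), (9, 9)]
leaves = quadtree_partition(points, sigma=0.5, top_level=4)
```

Empty subcells are kept as leaves so that the leaf cells still form a partition of the top-level cell.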
The refinement phase consists of three steps. The first refinement is that we build a so-
called balanced or restricted quadtree partition of the quadtree partition obtained so far,
i.e., the side length of each cell is allowed to differ from the side lengths of all neighboring
cells by a factor of at most 2 [33, 107]. That means that we further subdivide every leaf cell
C of the quadtree which has a neighboring cell whose side length is less than half of the side
length of C. We say that two cells are neighbors if they share some part of the boundary.
In Figure 7.2, the first refinement step is illustrated by using the example from Figure 7.1.
In the second refinement step, every leaf cell of the balanced quadtree is subdivided into
$\ell_1^d$ equal sized cubes, where $\ell_1 := \lceil 6\sqrt{d} \rceil$. Finally, we subdivide every cube into $\ell_2(\varepsilon)^d$
equal sized cubelets, where $\ell_2(\varepsilon) := \lceil 2\sqrt{d}/\varepsilon \rceil$ is a function dependent on ε. Note that we
could have merged the second and third refinement step into one step, but the definition
of cubes makes the analysis easier.
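As a small numerical illustration of these two parameters, the following Python sketch evaluates them for the plane (d = 2) and ε = 0.5. The function names are hypothetical.

```python
import math

def l1(d):
    """Number of cubes per side of a leaf cell: ceil(6 * sqrt(d))."""
    return math.ceil(6 * math.sqrt(d))

def l2(d, eps):
    """Number of cubelets per side of a cube: ceil(2 * sqrt(d) / eps)."""
    return math.ceil(2 * math.sqrt(d) / eps)

def cubelets_per_leaf_cell(d, eps):
    """Each leaf cell is split into l1^d cubes, each cube into l2(eps)^d cubelets."""
    return (l1(d) * l2(d, eps)) ** d

print(l1(2))                           # prints 9
print(l2(2, 0.5))                      # prints 6
print(cubelets_per_leaf_cell(2, 0.5))  # prints 2916
```

For constant d and ε, the number of cubelets per leaf cell is therefore a constant, although a rather large one.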
Figure 7.1: Example illustrating the quadtree partition for a point set in the plane. A
cell is heavy if it contains at least 5 points. (a)-(d) The quadtree partition for
subsequent depths of the recursion, i.e., after having subdivided each heavy cell
in grid $G^{(\lceil \log(\Delta) \rceil)}$, $G^{(\lceil \log(\Delta) \rceil - 1)}$, $G^{(\lceil \log(\Delta) \rceil - 2)}$, and $G^{(\lceil \log(\Delta) \rceil - 3)}$,
respectively. Since no cell in partition (d) is heavy, the recursion stops here.
Figure 7.2: Example illustrating the refinement of a quadtree partition to get a balanced
quadtree partition. Cells created during the balancing process are indicated by
dashed lines.
It remains to determine the representatives. For each non-empty cubelet C, we replace
all the points inside of C by one representative. This representative is set to the location of
an arbitrary point it represents (i.e., a point from P ∩ C) and weighted by the total number
of replaced points. Finally, the collection of all representative pairs is our ε-WSPD with
slack σ for M. Note that we implicitly store this information by storing the set of all
representatives.
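The choice of the representatives can be sketched in Python for a single cell. The sketch assumes that the cell with lower corner `corner` and side length `side` is split into `parts` cubelets per dimension and that all given points lie inside this cell; the function name `representatives` is hypothetical, and the first point found in a cubelet is taken as the arbitrary representative.

```python
def representatives(points, corner, side, parts):
    """Return one (point, weight) pair per non-empty cubelet of the cell."""
    step = side / parts
    groups = {}
    for p in points:
        # Index of the cubelet containing p (clamped to guard against rounding).
        key = tuple(min(int((x - c) // step), parts - 1)
                    for x, c in zip(p, corner))
        groups.setdefault(key, []).append(p)
    # The representative is weighted by the number of points it replaces.
    return [(pts[0], len(pts)) for pts in groups.values()]

points = [(0.5, 0.5), (1, 1), (3, 3)]
print(representatives(points, corner=(0, 0), side=4, parts=2))
# prints [((0.5, 0.5), 2), ((3, 3), 1)]
```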
7.2.1 Analysis of the Construction
First, we prove that our construction yields an ε-WSPD with slack σ for M. Then, we
analyze its complexity. For that purpose, for any level $i \in [\lceil \log(\Delta) \rceil + 1]$, let $L(i)$ be the
set of all the leaf cells of the unbalanced quadtree whose side length is $2^i$, i.e., leaf cells
in level i. Analogously, let $L^+(i)$ be the set of all the leaf cells of the balanced quadtree
whose side length is $2^i$. We define $L^*(i)$ to be the set of all the cubes contained in a cell in
$L^+(i)$. Furthermore, we denote by $H(i)$ the set of heavy cells in level i that do not have a
heavy subcell. Note that the parent cell of any cell in $L(i)$ is in $H(i + 1)$.
Algorithm 7.2.1 ConstructWSPD(P, ε, σ)
initialize space partition with the cells in grid $\mathcal{G}(\lceil\log(\Delta)\rceil)$
initialize queue Q with the cells in grid $\mathcal{G}(\lceil\log(\Delta)\rceil)$
QuadtreePartition(P, σ, Q)
Q ← insert all cells of space partition
BalancedQuadtreePartition(P, σ, Q)
for each cell C in space partition do
    split C into $\ell_1^d$ cubes
for each cube C in space partition do
    split C into $\ell_2(\varepsilon)^d$ cubelets
initialize empty set R of representatives
for each non-empty cubelet C in space partition do
    q ← arbitrary point from P ∩ C weighted by number of points in C
    R ← R ∪ {q}
return R

Algorithm 7.2.2 QuadtreePartition(P, σ, Q)
while Q is not empty do
    C ← remove first cell from Q
    if C is heavy then
        split C into $2^d$ subcells
        Q ← insert all subcells of C

Algorithm 7.2.3 BalancedQuadtreePartition(P, σ, Q)
while Q is not empty do
    C ← remove first cell from Q
    if C violates the balancing condition then
        split C into $2^d$ subcells
        Q ← insert all subcells of C
        Q ← insert all neighbors of C that now violate the balancing condition
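As a rough illustration of the queue-based splitting in QuadtreePartition, the following Python sketch subdivides heavy cells until every leaf is light. It makes several assumptions that are not part of the pseudocode: the partition starts from a single root cell $[0, \text{root\_side})^d$ rather than from all cells of a grid, heaviness is given by an explicit point-count `threshold`, and splitting stops at side length `min_side` (in the thesis, the lower bound on σ ensures that cells of the finest grid are light, so no such guard is needed). Balancing is not included.

```python
from collections import deque

def quadtree_partition(points, root_side, threshold, min_side=1.0):
    """Split heavy cells (>= threshold points) and return the leaf cells.

    Each leaf is returned as ((corner, side), points_in_cell).
    """
    d = len(points[0])
    root = (tuple(0.0 for _ in range(d)), float(root_side))
    queue = deque([(root, list(points))])
    leaves = []
    while queue:
        (corner, side), pts = queue.popleft()
        if len(pts) < threshold or side <= min_side:
            leaves.append(((corner, side), pts))
            continue
        half = side / 2
        children = {}
        for p in pts:
            # Per coordinate, 0 selects the lower half and 1 the upper half.
            idx = tuple(min(int((p[k] - corner[k]) // half), 1) for k in range(d))
            children.setdefault(idx, []).append(p)
        # A split always creates all 2^d subcells, including empty ones.
        for mask in range(2 ** d):
            idx = tuple((mask >> k) & 1 for k in range(d))
            child_corner = tuple(corner[k] + idx[k] * half for k in range(d))
            queue.append(((child_corner, half), children.get(idx, [])))
    return leaves

pts = [(0.5, 0.5), (0.6, 0.6), (0.7, 0.7), (3.5, 3.5)]
leaves = quadtree_partition(pts, root_side=4, threshold=2)
```

In this small example the root and one of its subcells are split, so the partition ends with 7 leaf cells that together still contain all 4 points.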
Separation and Slack
In the second phase of the algorithm, every cube in $\bigcup_{i=0}^{\lceil\log(\Delta)\rceil} L^*(i)$ is divided into $\ell_2(\varepsilon)^d$
equal-sized cubelets. The next lemma shows that the choice of the function $\ell_2(\varepsilon)$ guarantees
that any two cubelets of different non-neighboring cubes are ε-well-separated.
Lemma 7.2.2. Let $\ell_2(\varepsilon) := \lceil 2\sqrt{d}/\varepsilon \rceil$. If each cube in $\bigcup_{i=0}^{\lceil\log(\Delta)\rceil} L^*(i)$ is divided into $\ell_2(\varepsilon)^d$
equal-sized cubelets, then any two cubelets which are not contained in the same cube or in
neighboring cubes are ε-well-separated.
Proof. Let $C_1$ and $C_2$ be any two cubelets which are not contained in the same cube or in
neighboring cubes. Furthermore, let $C_1$ be in any level $i \in [\lceil\log(\Delta)\rceil + 1]$ and $C_2$ be in a
level $j$. Without loss of generality, we assume that $j \in \{i, \ldots, \lceil\log(\Delta)\rceil\}$. We consider the
two cases $j \in \{i, i+1\}$ and $j \in \{i+2, \ldots, \lceil\log(\Delta)\rceil\}$.
We start with the case $j \in \{i, i+1\}$. The side length of a cube in level $i$ is $2^i/\ell_1$, where
$\ell_1 = \lceil 6\sqrt{d} \rceil$. Since the side lengths of neighboring cells in $\bigcup_{k=0}^{\lceil\log(\Delta)\rceil} L^+(k)$ differ by a factor
of at most 2 and each cell in $\bigcup_{k=0}^{\lceil\log(\Delta)\rceil} L^+(k)$ is divided into equal-sized cubes, the distance
between the cube containing $C_1$ and the cube containing $C_2$ is at least $2^i/\ell_1$. Since the
diagonal of the bigger cubelet $C_2$ is
$$\mathrm{diag}(C_2) = \frac{\sqrt{d} \cdot 2^j}{\ell_1 \cdot \ell_2(\varepsilon)} \le \frac{\varepsilon \cdot 2^i}{\ell_1} ,$$
we get that $C_1$ and $C_2$ are ε-well-separated (see Section 7.1 for a definition of an
ε-well-separated pair).
Now, we consider the case $j \in \{i+2, \ldots, \lceil\log(\Delta)\rceil\}$. Due to the balanced quadtree
partitioning, the side lengths of neighboring cells in $\bigcup_{k=0}^{\lceil\log(\Delta)\rceil} L^+(k)$ differ by a factor of at
most 2. Hence, the distance between any cell in $L^+(i)$ and any cell in $L^+(j)$ is at least
$\sum_{k=i+1}^{j-1} 2^k = 2^j - 2^{i+1} \ge 2^{j-1}$. Since $C_1$ is contained in a cell in $L^+(i)$ and $C_2$ is contained
in a cell in $L^+(j)$, the distance between $C_1$ and $C_2$ is at least $2^{j-1}$. Since the diagonal of
the bigger cubelet $C_2$ is
$$\mathrm{diag}(C_2) = \frac{\sqrt{d} \cdot 2^j}{\ell_1 \cdot \ell_2(\varepsilon)} < \varepsilon \cdot 2^{j-1} ,$$
the two cubelets $C_1$ and $C_2$ are ε-well-separated.
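The two diagonal bounds used in this proof can be checked numerically. The sketch below is not part of the original argument; it simply evaluates both inequalities for given d, ε, and level i, using $\ell_1 = \lceil 6\sqrt{d} \rceil$ and $\ell_2(\varepsilon) = \lceil 2\sqrt{d}/\varepsilon \rceil$. The function names and the small floating-point tolerance `TOL` are assumptions made for this illustration (the first inequality can hold with equality, e.g., for d = 1 and ε = 0.5).

```python
import math

TOL = 1e-12  # tolerance for floating-point comparisons (an assumption)

def ell1(d):
    return math.ceil(6 * math.sqrt(d))

def ell2(d, eps):
    return math.ceil(2 * math.sqrt(d) / eps)

def cubelet_diag(d, eps, level):
    """Diagonal of a cubelet inside a leaf cell of side length 2**level."""
    return math.sqrt(d) * 2 ** level / (ell1(d) * ell2(d, eps))

def check_separation(d, eps, i):
    # Case j in {i, i+1}: worst case j = i+1, compared with eps * 2^i / l1.
    near = cubelet_diag(d, eps, i + 1) <= eps * 2 ** i / ell1(d) * (1 + TOL)
    # Case j >= i+2, checked here for j = i+2: compared with eps * 2^(j-1).
    j = i + 2
    far = cubelet_diag(d, eps, j) < eps * 2 ** (j - 1)
    return near and far

ok = all(check_separation(d, eps, 3)
         for d in (1, 2, 3, 8) for eps in (0.5, 0.25, 0.1))
```

The first case is the tight one: it reduces to $2\sqrt{d}/\ell_2(\varepsilon) \le \varepsilon$, which is exactly what the ceiling in the definition of $\ell_2(\varepsilon)$ guarantees.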
Due to Lemma 7.2.2, to bound the slack of the ε-WSPD for M, we have to bound the
number of points in each cube in $\bigcup_{i=0}^{\lceil\log(\Delta)\rceil} L^*(i)$. The following lemma shows that this is
guaranteed by our choice of $h(\sigma)$.
Lemma 7.2.3. Let $h(\sigma) := \sigma/2^d$, and let $p_1$ and $p_2$ be any two points in P. If the cubelet
that contains $p_1$ and the cubelet that contains $p_2$ are not ε-well-separated, then $p_2$ belongs
to the σn closest points of $p_1$.
Proof. At first, we bound the maximum distance $D(p_1, p_2)$ between $p_1$ and $p_2$. Then, we
bound the total number of points whose distance from $p_1$ is at most $D(p_1, p_2)$. If this
number is at most σn, then the correctness of the lemma follows.
For any level $i \in [\lceil\log(\Delta)\rceil + 1]$, let $C_1^* \in L^*(i)$ be the cube that contains $p_1$, and let $C_2^*$
be the cube that contains $p_2$. Due to Lemma 7.2.2, $C_1^*$ and $C_2^*$ must be neighbors. Since
we use a balanced quadtree partitioning, the side lengths of $C_1^*$ and $C_2^*$ differ by a factor of
at most 2. Since $C_1^* \in L^*(i)$, the side length of $C_1^*$ is $2^i/\ell_1$ with $\ell_1 = \lceil 6\sqrt{d} \rceil$. We have to
consider the cases (i) $C_2^* \in L^*(i)$, (ii) $C_2^* \in L^*(i+1)$, and (iii) $C_2^* \in L^*(i-1)$.
Figure 7.3: Sketch of Case (i) in the proof of Lemma 7.2.3.
Case (i) is illustrated in Figure 7.3. In this case, the maximum distance between $p_1$ and
$p_2$ is at most
$$D(p_1, p_2) \le \sqrt{d} \cdot \left( \frac{2^i}{\ell_1} + \frac{2^i}{\ell_1} \right) = \sqrt{d} \cdot \left( \frac{2^{i+1}}{\ell_1} + \frac{2^i}{\sqrt{d}\,\ell_1} \right) - \frac{2^i}{\ell_1} \le 2^{i-1} - \frac{2^i}{\ell_1} .$$
Since the cube $C_1^*$ is contained in a cell in $L^+(i)$ and we use a balanced quadtree partitioning,
the side lengths of all neighboring cells of the cell containing $C_1^*$ are at least $2^{i-1}$. Now, since
the side length of $C_1^*$ is $2^i/\ell_1$, the ball with center $p_1$ and radius $D(p_1, p_2) \le 2^{i-1} - 2^i/\ell_1$ is
covered by at most $2^d$ cells in $\bigcup_{k=0}^{\lceil\log(\Delta)\rceil} L^+(k)$.
In Case (ii), the maximum distance between $p_1$ and $p_2$ is at most
$$D(p_1, p_2) \le \sqrt{d} \cdot \left( \frac{2^i}{\ell_1} + \frac{2^{i+1}}{\ell_1} \right) < 2^i - \frac{2^i}{\ell_1} .$$
Furthermore, since $C_2^*$ is contained in a cell in $L^+(i+1)$, the side lengths of all common
neighbors of the cells containing $C_1^*$ and $C_2^*$ are at least $2^i$. Now, since the side length of
$C_1^*$ is $2^i/\ell_1$, the ball with center $p_1$ and radius $D(p_1, p_2) \le 2^i - 2^i/\ell_1$ can be covered by at
most $2^d$ cells in $\bigcup_{k=0}^{\lceil\log(\Delta)\rceil} L^+(k)$.
Case (iii) is symmetric to Case (ii).
As a result, in all three cases, we have to count the number of points in $2^d$ cells in
$\bigcup_{k=0}^{\lceil\log(\Delta)\rceil} L^+(k)$. Since the cells in $\bigcup_{k=0}^{\lceil\log(\Delta)\rceil} L^+(k)$ are light cells, each one of them contains
at most $h(\sigma) \cdot |P|$ points. It follows that the number of points whose distance from $p_1$ is
at most $D(p_1, p_2)$ is at most $\sigma \cdot |P|$.
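The distance bounds of Cases (i) and (ii) depend only on d and on the constant $\ell_1 = \lceil 6\sqrt{d} \rceil$, so they are easy to verify numerically. The following sketch is an illustration added here, not part of the proof; the function name `case_bounds_hold` and the tolerance `TOL` are assumptions (for d = 1 the bound of Case (i) holds with equality, hence the tolerance).

```python
import math

TOL = 1e-12  # tolerance for floating-point comparisons (an assumption)

def case_bounds_hold(d, i=0):
    l1 = math.ceil(6 * math.sqrt(d))
    cube = 2 ** i / l1  # side length of a cube in level i
    # Case (i): both cubes are in level i.
    dist_same = math.sqrt(d) * (cube + cube)
    # Case (ii): the neighboring cube is in level i+1 (twice the side length).
    dist_up = math.sqrt(d) * (cube + 2 * cube)
    case_i = dist_same <= (2 ** (i - 1) - cube) * (1 + TOL)
    case_ii = dist_up < 2 ** i - cube
    return case_i and case_ii

all_hold = all(case_bounds_hold(d, i) for d in range(1, 11) for i in (0, 3, 7))
```

Both inequalities scale linearly with $2^i$, so checking a single level per dimension would already suffice; the loop over several levels is only a sanity check.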
Remark 7.2.4. Lemma 7.2.3 does not only imply that the collection of all representative
pairs is an ε-WSPD with slack σ for M . It also lets us know where the slack arises.
More precisely, for each point in P , the distances to its σn closest neighbors in P can be
arbitrarily distorted, but the distances to all other points in P are (1± 2ε)-preserved.
Complexity
In order to upper bound the number of representatives, we have to bound the number of
cells in the balanced quadtree partition. This is done by first analyzing the dependency of
this number on the number of cells in the unbalanced quadtree partition. The proof of the
following lemma is essentially given in [33]. Only a few adjustments to our scenario have
been made. However, for the sake of completeness, we include the full proof here.
Lemma 7.2.5 ([33]). The number of cells in the balanced quadtree partition is
$$\left| \bigcup_{i=0}^{\lceil\log(\Delta)\rceil} L^+(i) \right| \in O\left( 6^d \cdot \left| \bigcup_{i=0}^{\lceil\log(\Delta)\rceil} L(i) \right| \right) .$$
Proof. The proof follows the one in Chapter 14 from [33]. We call the cells in the
unbalanced quadtree old cells and the cells that are in the balanced but not in the unbalanced
quadtree new cells. We will show that, for each old cell, there are at most $3^d - 1$ cells in
the same level that have to be split. Since each split operation creates $2^d$ new cells, the
total number of new cells is at most $6^d$ times the total number of old cells. Since the total
number of cells in the unbalanced quadtree is at most $2 \cdot |\bigcup_{i=0}^{\lceil\log(\Delta)\rceil} L(i)|$, the total number
of new cells is at most $2 \cdot 6^d \cdot |\bigcup_{i=0}^{\lceil\log(\Delta)\rceil} L(i)|$. Thus, the number of cells in $\bigcup_{i=0}^{\lceil\log(\Delta)\rceil} L^+(i)$ is
upper bounded by the number of leaf cells in the unbalanced quadtree plus the
total number of new cells, which is at most $|\bigcup_{i=0}^{\lceil\log(\Delta)\rceil} L(i)| + 2 \cdot 6^d \cdot |\bigcup_{i=0}^{\lceil\log(\Delta)\rceil} L(i)|$.
Figure 7.4: Illustration of the arrangement of cells in the proof of Lemma 7.2.5. The
neighboring cell of C which causes the splitting of C is indicated in gray.
We use a charging argument to prove that, for each old cell, there are at most $3^d - 1$
cells in the same level that have to be split. Let us suppose that we have to split an (old
or new) cell $C \in \mathcal{G}(i)$ in any level $i \in [\lceil\log(\Delta)\rceil + 1]$ during the balancing process. Then,
we claim that at least one of the $3^d - 1$ neighboring cells of $C$ in level $i$ is old. We charge
the splitting of $C$ to this old cell.
Let us assume that our claim is wrong. Let $i \in [\lceil\log(\Delta)\rceil + 1]$ be the smallest level and
$C \in \mathcal{G}(i)$ be a cell such that $C$ is split during the balancing process and has no neighboring
cell in level $i$ which is old (see Figure 7.4). Since $C$ is split, there is a neighboring cell $C''$
of $C$ with side length at most $2^{i-2}$. Let $C' \in \mathcal{G}(i-1)$ be the cell that contains $C''$. Due to
the fact that $C'$ is contained in a new cell, it is new itself. Also, all the neighboring cells
of $C'$ in level $i-1$ are new. This follows from the fact that all neighboring cells of $C$ in
level $i$ are new and $C$ is split during the balancing process. Furthermore, since $C'$ contains
the cell $C''$, $C'$ is split during the balancing process. Thus, $C'$ is a cell that is split in the
balancing process and there is no neighboring cell of $C'$ in level $i-1$ which is old. This is
a contradiction to the choice of $C$.
Lemma 7.2.6. The number of cells in the balanced quadtree partition is $O(2^{O(d)} \cdot \log(\Delta)/\sigma)$.
Proof. Each ancestor cell of a heavy cell is heavy, and heavy cells are split during the
quadtree partitioning. Recall that, for any level $i \in [\lceil\log(\Delta)\rceil + 1]$, $H(i)$ is the set of all
heavy cells in level $i$ that do not have a heavy subcell. Then, for each cell in $H(i)$, there are
$2^d$ cells in $\mathcal{G}(i-1)$ and at most $2^d \cdot (\lceil\log(\Delta)\rceil + 1)$ cells in $\bigcup_{j=i}^{\lceil\log(\Delta)\rceil} \mathcal{G}(j)$ that are cells in our
quadtree. Since there are at most $1/h(\sigma)$ heavy cells in $\bigcup_{i=0}^{\lceil\log(\Delta)\rceil} H(i)$, the total number of
cells in the unbalanced quadtree is bounded by $O(1/h(\sigma) \cdot 2^d \cdot \log(\Delta))$. Due to Lemma 7.2.5,
the number of cells in the balanced quadtree partition is $O(1/h(\sigma) \cdot 12^d \cdot \log(\Delta))$. Now,
the lemma follows from $h(\sigma) = \sigma/2^d$.
Lemma 7.2.7. The space partition consists of $O(2^{O(d)} \cdot d^d \cdot \log(\Delta)/(\varepsilon^d \sigma))$ cubelets.
Proof. Due to Lemma 7.2.6, there are $O(2^{O(d)} \cdot \log(\Delta)/\sigma)$ cells in the balanced quadtree
partition. Now, the lemma follows from the fact that each cell in the balanced quadtree
partition contains $\lceil 6\sqrt{d} \rceil^d$ cubes and each cube contains $\lceil 2\sqrt{d}/\varepsilon \rceil^d$ cubelets.
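To see where the factor $2^{O(d)} \cdot d^d/\varepsilon^d$ comes from, one can count the cubelets per leaf cell directly. The sketch below is an added illustration: it computes $(\ell_1 \cdot \ell_2(\varepsilon))^d$ and compares it with the explicit bound $(21 d/\varepsilon)^d$, which follows from $\lceil 6\sqrt{d} \rceil \le 7\sqrt{d}$ and $\lceil 2\sqrt{d}/\varepsilon \rceil \le 3\sqrt{d}/\varepsilon$ for $d \ge 1$ and $0 < \varepsilon < 1$. The constant 21 and the function names are choices made here and do not appear in the original text.

```python
import math

def cubelets_per_cell(d, eps):
    """Number of cubelets into which one leaf cell is subdivided."""
    l1 = math.ceil(6 * math.sqrt(d))
    l2 = math.ceil(2 * math.sqrt(d) / eps)
    return (l1 * l2) ** d

def explicit_bound(d, eps):
    # (7*sqrt(d) * 3*sqrt(d)/eps)^d = 21^d * d^d / eps^d
    return (21 * d / eps) ** d

count = cubelets_per_cell(2, 0.5)
within_bound = all(cubelets_per_cell(d, eps) <= explicit_bound(d, eps)
                   for d in range(1, 6) for eps in (0.9, 0.5, 0.1))
```

For d = 2 and ε = 0.5, each leaf cell is split into 9 · 6 = 54 intervals per axis, i.e., 2916 cubelets, compared with the explicit bound of 7056.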
Based on the results given above, we can now analyze the complexity of our algorithm.
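Lemma 7.2.8. The algorithm has a running time of
$$O\left( n \cdot \left( \frac{d^2}{\varepsilon} + d \log(\Delta) \right) + \frac{2^{O(d)} \cdot d \cdot \log^2(\Delta)}{\sigma} + \frac{2^{O(d)} \cdot d^d \cdot \log(\Delta)}{\varepsilon^d \sigma} \right)$$
and a space requirement of
$$O\left( dn + \frac{2^{O(d)} \cdot d^d \cdot \log(\Delta)}{\varepsilon^d \sigma} \right) .$$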
Proof. Since each level of a quadtree forms a partition of P, the total number of points
associated with cells at the same level of the quadtree is at most n. Thus, the arrangement
of points in any level of the quadtree can be computed from the preceding level in $O(dn)$
time. Since our quadtree has a height of at most $\lceil\log(\Delta)\rceil + 1$, the total running time to
compute the arrangements of the points in all the levels of the quadtree is $O(dn \log(\Delta))$.
During the balancing process, for each cell that has ever been split, we check if this
cell has neighbors in the quadtree partition that violate the balancing condition. Let $C$
be such a split cell in any level $i \in [\lceil\log(\Delta)\rceil + 1]$. Since we must only check neighbors
of $C$ whose side length is at least twice as big as the side length of $C$, the level of the
neighbors that we have to check is at least $i+1$ and the total number of these neighbors is
at most $2^d - 1$ (see Figure 7.5). Let $C'$ be any cell in $\mathcal{G}(i+1)$ that shares some part of its
boundary with $C$. If the subcells of $C'$ are not cells of the current quadtree, then the leaf
cell of the current quadtree that contains $C'$ is a violating neighbor of $C$. We can find this
violating neighbor by searching for the leaf cell in the quadtree that contains the midpoint
of $C'$. Since the height of the quadtree is $O(\log(\Delta))$, this cell can be found in $O(d \log(\Delta))$
time. Furthermore, due to Lemma 7.2.6, the number of cells in the balanced quadtree
is $O(2^{O(d)} \cdot \log(\Delta)/\sigma)$. Thus, the time for checking neighborhoods during the balancing
process is $O(2^{O(d)} \cdot d \cdot \log^2(\Delta)/\sigma)$.
Figure 7.5: The cell C is split during the balancing operation. A neighbor of C in the
current quadtree partition that violates the balancing condition has to contain
at least one of the cells colored in gray.
Finally, the arrangement of the points in the cubelets can be computed from the balanced
quadtree partition in $n \cdot d \cdot (\ell_1 \cdot \ell_2(\varepsilon)) \in O(d^2 n/\varepsilon)$ time.
Each node of the final partition tree has to be created once. Since it follows
from Lemma 7.2.7 that this tree consists of $O(2^{O(d)} \cdot d^d \cdot \log(\Delta)/(\varepsilon^d \sigma))$ nodes, this can be
done in $O(2^{O(d)} \cdot d^d \cdot \log(\Delta)/(\varepsilon^d \sigma))$ time. Thus, the total running time is as claimed in
the lemma. Furthermore, the space requirement for the d-dimensional points in P and the
partition tree is upper bounded by $O(dn + 2^{O(d)} \cdot d^d \cdot \log(\Delta)/(\varepsilon^d \sigma))$.
We summarize our results in the following theorem:
Theorem 7 (WSPD with Slack for Euclidean Spaces). Let P be a set of n points with
spread ∆ from a low-dimensional Euclidean space $\mathbb{R}^d$, let ε, 0 < ε < 1, be a separation
parameter, and let σ, $2^{2d}/n < \sigma < 1$, be a slack parameter. Then, there exists an algorithm
that computes a weighted point set $P' \subset P$ with cardinality $O(2^{O(d)} \cdot d^d \cdot \log(\Delta)/(\varepsilon^d \sigma))$
which is an implicit representation of an ε-WSPD with slack σ for P. The algorithm has
a running time of
$$O\left( n \cdot \left( \frac{d^2}{\varepsilon} + d \log(\Delta) \right) + \frac{2^{O(d)} \cdot d \cdot \log^2(\Delta)}{\sigma} + \frac{2^{O(d)} \cdot d^d \cdot \log(\Delta)}{\varepsilon^d \sigma} \right)$$
and a space requirement of
$$O\left( dn + \frac{2^{O(d)} \cdot d^d \cdot \log(\Delta)}{\varepsilon^d \sigma} \right) .$$
Proof. We have $P' \subset P$ since the location of each representative is a point from P. It
follows from Lemmas 7.2.2 and 7.2.3 that $P'$ is an implicit representation of an ε-WSPD
with slack σ for P. The cardinality of $P'$ is implied by Lemma 7.2.7, and the running time
and space requirement to compute $P'$ are due to Lemma 7.2.8. In our construction, we made
the assumption that $\sigma > 2^{2d}/n$ to ensure that all the cells in grid $\mathcal{G}(0)$ are light.
7.3 Construction for Doubling Metric Spaces
We transfer our approach to construct a WSPD with slack for low-dimensional Euclidean
point sets to point sets with bounded doubling dimension. The input of our algorithm is
an n-point doubling metric space M = (X, D) with bounded dimension λ, a separation
parameter ε, 0 < ε < 1, and a slack parameter σ, $(\lceil\log(\Delta)\rceil + 1) \cdot 2^{6\lambda+3}/n < \sigma < 1$.
Recall from the definition of doubling metric spaces that each ball in M with any radius r
centered at any point in X can be covered by $2^\lambda$ balls each of radius r/2 and centered at a
point in X. We assume that the minimum pairwise distance between two points in X is at
least 1, and the maximum pairwise distance is at most ∆. Furthermore, we assume access
to a distance oracle that, given any two points from X, can compute in constant time the
distance between these two points.
Our idea is to replace the square grids from Section 7.2 by uniform cut decompositions,
which are defined as follows:
Definition 7.3.1 (Uniform Cut Decomposition, [35]). Let X be a non-empty point set
with a distance function defined on it. An r-cut decomposition of X is a partition of X
into balls such that the following conditions are satisfied:
i) Each ball is centered at a point in X.
ii) Each ball has radius r.
iii) Each point in X is covered by a ball.
Now, we explain our construction in detail (see Algorithm 7.3.1 for a description in
pseudocode). For each $i \in [\lceil\log(\Delta)\rceil + 1]$, we compute a $2^i$-cut decomposition of the point
set X. We say that the balls with radius $2^i$ are in level $i$. We denote the set of balls in
level $i$ by $\mathcal{G}(i)$. In case that a point is covered by more than one ball, we assign it to any
one of them. We point out that the arrangement of balls in the uniform cut decomposition
of any level does not depend on the arrangement of balls in the uniform cut decomposition
of any other level. Accordingly, and in contrast to our approach for low-dimensional
Euclidean point sets, our algorithm for doubling metric spaces is not recursive.
To compute the WSPD for X, we identify in each level the heavy balls.
Definition 7.3.2 (Heavy Ball). We call a ball heavy if it contains at least $h(\sigma) \cdot n$ points
of X, where $h(\sigma) := \sigma/((\lceil\log(\Delta)\rceil + 1) \cdot 2^{5\lambda+3})$ is a function dependent on σ. A ball that
is not heavy is called light.
Algorithm 7.3.1 ConstructWSPD(X, ε, σ)
initialize empty set R of representatives
for i ← 0 to $\lceil\log(\Delta)\rceil$ do
    $\mathcal{G}(i)$ ← set of balls in $2^i$-cut decomposition of X
    for each heavy ball B in $\mathcal{G}(i)$ do
        decompose B into mini balls with radius $2^{i-\ell(\varepsilon)}$
        for each mini ball B′ in B do
            y ← point located at center of B′ and with weight w(y) ← 0
            for each point x ∈ X ∩ B′ do
                if x is not marked then
                    mark x
                    w(y) ← w(y) + 1
            R ← R ∪ {y}
return R
In each level $i \in [\lceil\log(\Delta)\rceil + 1]$, we identify the heavy balls, i.e., balls containing at least
$h(\sigma) \cdot n$ points from X. Each of these heavy balls is decomposed into mini balls of radius
$2^{i-\ell(\varepsilon)}$, where $\ell(\varepsilon) := \lceil\log(1/\varepsilon)\rceil$. Note that all the balls in $\mathcal{G}(0)$ are light since such a ball
can contain at most $2^\lambda$ points and $\sigma > (\lceil\log(\Delta)\rceil + 1) \cdot 2^{6\lambda+3}/n$ implies $h(\sigma) \cdot n > 2^\lambda$.
Next, we compute a representative for each point in X. Let x ∈ X be any point. Then,
the representative of x is the center of the smallest mini ball that contains x. Note that
each mini ball belongs to a uniform cut decomposition of a heavy ball and each point in X
is contained in at least one heavy ball since the ball in level $\lceil\log(\Delta)\rceil$ is heavy and covers
all points from X. The set of representatives for X is the weighted set of center points of
the mini balls where each such point is weighted by the number of points it represents. The
collection of all representative pairs is our compact representation for M. As mentioned in
Section 7.2, we implicitly store this information by storing the set of all representatives.
Note that, for any level $i \in [\lceil\log(\Delta)\rceil + 1]$, a $2^i$-cut decomposition of X can be computed
by applying the well-known 2-approximation algorithm for vertex cover. Let G = (V, E)
be any simple graph. Then, the vertex-cover algorithm repeatedly chooses any edge
{x, y} ∈ E, inserts x and y into the currently found vertex cover and removes all edges
incident to x or y from E. This is done until the edge set E is empty. Since, in this way,
we implicitly find a non-extendable matching of G which is always a vertex cover for G
and an optimal cover contains at least one endpoint of each edge in this matching, the
algorithm outputs a 2-approximation for vertex cover. Now, let G be the graph that has a
vertex for each point in X and an edge {x, y} for each unordered pair of points x, y ∈ X
with $D(x, y) \le 2^i$. Then, by applying the 2-approximation algorithm for vertex cover on
G, we compute a $2^i$-cut decomposition of X whose size is at most twice as big as the size
of an optimal $2^i$-cut decomposition of X.
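The matching-based vertex-cover step described above can be sketched as follows. This is a minimal illustration, assuming points are given as a list and the distance oracle as a Python function `dist`; the names `threshold_graph_edges` and `matching_vertex_cover` are hypothetical. The sketch only covers the graph construction and the 2-approximation itself, not the subsequent assignment of points to balls.

```python
def threshold_graph_edges(points, dist, r):
    """Edges {x, y}, as index pairs, for all pairs with dist(x, y) <= r."""
    n = len(points)
    return [(a, b) for a in range(n) for b in range(a + 1, n)
            if dist(points[a], points[b]) <= r]

def matching_vertex_cover(edges):
    """Greedy maximal matching; both endpoints of each matched edge are taken.

    Skipping an edge that already has a chosen endpoint has the same effect
    as removing all edges incident to chosen vertices.
    """
    cover = set()
    for a, b in edges:
        if a not in cover and b not in cover:
            cover.update((a, b))
    return cover

points = [0, 1, 2, 10]
edges = threshold_graph_edges(points, lambda x, y: abs(x - y), r=1)
cover = matching_vertex_cover(edges)
```

In this example the graph has the edges (0, 1) and (1, 2); the sketch returns the cover {0, 1}, while the optimal cover {1} has size 1, which matches the factor-2 guarantee.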
7.3.1 Analysis of the Construction
First, we show that our construction computes an ε-WSPD with slack σ for M . Then, we
analyze its complexity.
For any level $i \in [\lceil\log(\Delta)\rceil + 1]$, let $H(i)$ be the set of heavy balls in level $i$. We call any
two balls $B(x_1, r_1)$ and $B(x_2, r_2)$ neighboring balls if $B(x_1, 3r_1)$ contains at least one point
of $B(x_2, r_2)$ or $B(x_2, 3r_2)$ contains at least one point of $B(x_1, r_1)$.
Separation and Slack
The following lemma proves that the choice of the function $\ell(\varepsilon)$ guarantees that any two
mini balls contained in different non-neighboring heavy balls are ε-well-separated.
Lemma 7.3.3. Let $\ell(\varepsilon) := \lceil\log(1/\varepsilon)\rceil$. If, for each level $i \in [\lceil\log(\Delta)\rceil + 1]$, each heavy ball
in $H(i)$ is decomposed into mini balls with radius $2^{i-\ell(\varepsilon)}$, then any two mini balls contained
in different non-neighboring heavy balls are ε-well-separated.
Proof. Let $B_1$ be any heavy ball in $H(i)$ in any level $i \in [\lceil\log(\Delta)\rceil + 1]$, and let $B_2$ be any
heavy ball in $H(j)$ such that $B_1$ and $B_2$ are non-neighboring balls (see Figure 7.6 for an
illustration). Without loss of generality, we assume that $j \in \{i, \ldots, \lceil\log(\Delta)\rceil\}$. Since the
radius of $B_2$ is $2^j$ and $B_1$ is not a neighboring ball of $B_2$, the distance between the center of
$B_2$ and any point in $B_1$ is larger than $3 \cdot 2^j$. Hence, the distance between any point in $B_1$
and any point in $B_2$ is larger than $2^{j+1}$. Thus, the distance between any point in any mini
ball $B_1^*$ in $B_1$ and any point in any mini ball $B_2^*$ in $B_2$ is larger than $2^{j+1}$. Now, the assertion
of the lemma follows from the fact that the diameters of $B_1^*$ and $B_2^*$ are at most
$$\mathrm{diam}(B_2^*) = 2 \cdot 2^{j-\ell(\varepsilon)} \le \varepsilon \cdot 2^{j+1} .$$
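The final inequality reduces to $2^{-\ell(\varepsilon)} \le \varepsilon$, which holds whenever $\ell(\varepsilon) \ge \log(1/\varepsilon)$. The short sketch below checks it numerically; it assumes that log denotes the base-2 logarithm, which is what the inequality requires, and the function name `mini_ball_diameter` is introduced only for this illustration.

```python
import math

def mini_ball_diameter(j, eps):
    """Diameter bound 2 * 2^(j - l(eps)) of a mini ball in level j."""
    ell = math.ceil(math.log2(1 / eps))  # l(eps), base-2 logarithm assumed
    return 2 * 2 ** (j - ell)

checks = [mini_ball_diameter(j, eps) <= eps * 2 ** (j + 1)
          for j in (0, 3, 10) for eps in (0.5, 0.3, 0.25, 0.01)]
```

For ε = 0.25 and j = 3 the bound is tight: the diameter bound is 4, and $\varepsilon \cdot 2^{j+1} = 4$ as well.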
Due to Lemma 7.3.3, to bound the slack of the ε-WSPD for M, we have to bound the
number of points in each heavy ball. The following lemma shows that this is guaranteed
by our choice of $h(\sigma)$.
Lemma 7.3.4. Let $h(\sigma) := \sigma/((\lceil\log(\Delta)\rceil + 1) \cdot 2^{5\lambda+3})$. Then the collection of all
representative pairs is an ε-WSPD with slack σ for M.
Figure 7.6: Illustration of the two non-neighboring heavy balls $B_1$ and $B_2$ in the proof of
Lemma 7.3.3.
Proof. For any point $x_1 \in X$, we compute an upper bound on the number of points $x_2 \in X$
such that the radius of the smallest mini ball containing $x_2$ is at least as big as the radius
of the smallest mini ball containing $x_1$ and the two mini balls are not ε-well-separated. By
multiplying this number by 2n, we get an upper bound on the slack of our ε-WSPD for all
ordered pairs $(x_1, x_2) \in X \times X$.
Let $x_1$ and $x_2$ be any points in X that satisfy the condition above. Let $i \in [\lceil\log(\Delta)\rceil + 1]$
be the level such that $B_1 \in H(i)$ is the smallest heavy ball that contains $x_1$. Let $x$ be
the center of $B_1$. Furthermore, let $B_2 \in H(j)$, $j \in \{i, \ldots, \lceil\log(\Delta)\rceil\}$, be the smallest
heavy ball that contains $x_2$ (see Figure 7.7 for an illustration). First, we show that $B_1$ and
$B_2$ must be neighboring balls. Then, we prove an upper bound of $2^{3\lambda+1}$ on the number
of heavy balls in level $j$ that are neighboring balls of $B_1$. This allows us to derive an
upper bound of $(\lceil\log(\Delta)\rceil + 1) \cdot 2^{3\lambda+1}$ on the total number of heavy balls in any level
$j \in \{i, \ldots, \lceil\log(\Delta)\rceil\}$ that are neighboring balls of $B_1$. Finally, we show that the total
weight of all the representative points located in mini balls forming a cut decomposition of
such a heavy ball is at most $\sigma n/((\lceil\log(\Delta)\rceil + 1) \cdot 2^{3\lambda+2})$. Then, the number of neighboring
heavy balls of $B_1$ times the representative weight associated with a heavy ball times 2n is
at most
$$(\lceil\log(\Delta)\rceil + 1) \cdot 2^{3\lambda+1} \cdot \frac{\sigma n}{(\lceil\log(\Delta)\rceil + 1) \cdot 2^{3\lambda+2}} \cdot 2n \le \sigma \cdot n^2 ,$$
which proves that the slack is at most σ.
Since the smallest mini ball that contains $x_1$ and the smallest mini ball that contains $x_2$
are not ε-well-separated, it follows from Lemma 7.3.3 that $B_1$ and $B_2$ are neighboring balls.
Since $B_2$ is at least as big as $B_1$, the distance from the center of $B_2$ to the closest point in
$B_1$ is at most $3 \cdot 2^j$. It follows that the distance from the center of $B_1$ to any point in $B_2$
is at most $2^i + 3 \cdot 2^j + 2^j < 2^{j+3}$. Hence, $B_2$ is completely contained in the ball $B(x, 2^{j+3})$.
Since M is a doubling metric space, $B(x, 2^{j+3})$ can be covered by at most $2^{3\lambda}$ balls of radius
$2^j$. Since we use a 2-approximation algorithm to compute cut decompositions, the number
of balls in level $j$ that are completely contained in $B(x, 2^{j+3})$ is at most $2^{3\lambda+1}$. Hence, the
number of balls in level $j$ that are neighboring balls of $B_1$ is at most $2^{3\lambda+1}$. Summing up
over all levels $j \in \{i, \ldots, \lceil\log(\Delta)\rceil\}$, the total number of balls that are neighboring balls of
$B_1$ is at most $(\lceil\log(\Delta)\rceil + 1) \cdot 2^{3\lambda+1}$.
Figure 7.7: Illustration of the two neighboring heavy balls $B_1$ and $B_2$ in the proof of
Lemma 7.3.4. The ball $B(x, 2^{j+3})$, which completely contains $B_2$, is indicated
by the dashed arc.
Let $B(x', 2^j) \in H(j)$, $j \in \{i, \ldots, \lceil\log(\Delta)\rceil\}$, be any neighboring ball of $B_1$. Observe that
the representative point of any point $x'' \in B(x', 2^j)$ is not from level $j$ if $x''$ is covered
by a heavy ball with radius smaller than $2^j$. In particular, the representative
point of any point $x'' \in B(x', 2^j)$ is not from level $j$ if $x''$ is covered by a heavy ball in
$H(j-1)$. Thus, we compute an upper bound on the number of points in all the light
balls in $\mathcal{G}(j-1)$ that are at least partly covered by $B(x', 2^j)$. Observe that these balls are
completely contained in the ball $B(x', 2^{j+1})$. Due to our construction, the number of balls
in $\mathcal{G}(j-1)$ that are completely contained in the ball $B(x', 2^{j+1})$ is at most $2^{2\lambda+1}$. Since
any light ball contains less than $h(\sigma) \cdot n$ points and $h(\sigma) = \sigma/((\lceil\log(\Delta)\rceil + 1) \cdot 2^{5\lambda+3})$, the
total number of points in all the light balls in $\mathcal{G}(j-1)$ that are completely contained in
the ball $B(x', 2^{j+1})$ is less than $h(\sigma) \cdot n \cdot 2^{2\lambda+1} \le \sigma n/((\lceil\log(\Delta)\rceil + 1) \cdot 2^{3\lambda+2})$. Thus, the total
weight of representative points in $B(x', 2^j)$ is at most $\sigma n/((\lceil\log(\Delta)\rceil + 1) \cdot 2^{3\lambda+2})$, which
was the only thing left to prove the assertion of the lemma.
Unfortunately, in contrast to our WSPD construction for low-dimensional Euclidean
spaces, the property that, for each point, the distances to its σn closest neighbors can be
arbitrarily distorted, but the distances to all the other points are (1 ± 2ε)-preserved (see
Remark 7.2.4), does not hold for our WSPD construction in doubling metric spaces. The
reason is that neighboring balls can differ in their radii by more than a constant factor.
Due to this fact, there might exist a ball with many neighboring balls of smaller radius,
so the number of the distances from a single point to other points in X that are not
(1± 2ε)-preserved might be bigger than σn.
Complexity
In order to upper bound the number of representatives, we have to bound the total number
of mini balls.
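Lemma 7.3.5. The number of mini balls is $O(2^{O(\lambda)} \cdot \log^2(\Delta)/(\varepsilon^\lambda \sigma))$.
Proof. Let $i \in [\lceil\log(\Delta)\rceil + 1]$ be any level. Since any heavy ball contains at least $h(\sigma) \cdot n$
points from X and $h(\sigma) = \sigma/((\lceil\log(\Delta)\rceil + 1) \cdot 2^{5\lambda+3})$, the total number of heavy balls
in level $i$ is at most $1/h(\sigma) = ((\lceil\log(\Delta)\rceil + 1) \cdot 2^{5\lambda+3})/\sigma$. Since M is a doubling metric
space with dimension λ, any ball in level $i$ can be decomposed into at most $(2^\lambda)^{\ell(\varepsilon)}$ balls
with radius $2^{i-\ell(\varepsilon)}$. Thus, by applying the 2-approximation algorithm for vertex cover, we
decompose each heavy ball into at most
$$2 \cdot (2^\lambda)^{\ell(\varepsilon)} = 2 \cdot (2^\lambda)^{\lceil\log(1/\varepsilon)\rceil} \le \frac{2^{\lambda+1}}{\varepsilon^\lambda}$$
mini balls. It follows that the total number of mini balls in level $i$ is at most
$$\frac{(\lceil\log(\Delta)\rceil + 1) \cdot 2^{5\lambda+3}}{\sigma} \cdot \frac{2^{\lambda+1}}{\varepsilon^\lambda} = \frac{(\lceil\log(\Delta)\rceil + 1) \cdot 2^{6\lambda+4}}{\varepsilon^\lambda \sigma} .$$
Summing up over all levels, we obtain that the total number of mini balls is upper bounded
by $(\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+4}/(\varepsilon^\lambda \sigma)$.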
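The counting in this proof can be reproduced with a few lines of Python. The sketch is an added illustration with hypothetical function names; it assumes base-2 logarithms, evaluates the per-ball count $2 \cdot (2^\lambda)^{\ell(\varepsilon)}$ against the bound $2^{\lambda+1}/\varepsilon^\lambda$, and computes the final bound for concrete parameters.

```python
import math

def mini_balls_per_heavy_ball(lam, eps):
    """2 * (2^lam)^l(eps), the count used for one heavy ball."""
    ell = math.ceil(math.log2(1 / eps))  # base-2 logarithm assumed
    return 2 * (2 ** lam) ** ell

def total_mini_ball_bound(lam, eps, sigma, spread):
    """(ceil(log spread) + 1)^2 * 2^(6 lam + 4) / (eps^lam * sigma)."""
    levels = math.ceil(math.log2(spread)) + 1
    return levels ** 2 * 2 ** (6 * lam + 4) / (eps ** lam * sigma)

per_ball_ok = all(
    mini_balls_per_heavy_ball(lam, eps) <= 2 ** (lam + 1) / eps ** lam
    for lam in (1, 2, 3) for eps in (0.5, 0.3, 0.25, 0.1)
)
bound = total_mini_ball_bound(lam=1, eps=0.5, sigma=0.5, spread=4)
```

The bound is a worst-case estimate and grows quickly with λ; it is independent of n, which is the point of the lemma.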
Lemma 7.3.6. The algorithm has a running time of $O(n^2 \cdot \log(\Delta))$ and a space requirement
of $O(n \cdot \log(\Delta))$.
Proof. By applying a standard implementation of the described 2-approximation algorithm
for vertex cover, we can decompose the set X into balls with any specified radius in $O(n^2)$
time. Recall that we assume access to a distance oracle that, given any two points from
X, computes in constant time the distance between these two points. Since we have
$\lceil\log(\Delta)\rceil + 1$ levels, the total running time to compute all uniform cut decompositions is
$O(n^2 \cdot \log(\Delta))$. Since each uniform cut decomposition is a partition of X, we can find the
heavy balls of any level in $O(n)$ time. Setting the representative points for any level can
also be done in $O(n)$ time. Since there are $\lceil\log(\Delta)\rceil + 1$ levels, the total running time for
finding heavy balls and setting representatives is $O(n \cdot \log(\Delta))$. Thus, the total running
time is $O(n^2 \cdot \log(\Delta))$.
For each level, we store a uniform cut decomposition of X, and, for a subset of X, we
store a decomposition into mini balls. Since the number of balls and mini balls per level is
less than n, the space requirement of our algorithm is $O(n \cdot \log(\Delta))$.
We summarize our results in the following theorem:
Theorem 8 (WSPD with Slack for Doubling Metric Spaces). Let M = (X, D) be an
n-point metric space with bounded doubling dimension λ and spread ∆, let ε, 0 < ε < 1, be
a separation parameter, and let σ, $(\lceil\log(\Delta)\rceil + 1) \cdot 2^{6\lambda+3}/n < \sigma < 1$, be a slack parameter.
Then, there exists an algorithm that computes a weighted point set $X' \subset X$ with cardinality
$O(2^{O(\lambda)} \cdot \log^2(\Delta)/(\varepsilon^\lambda \sigma))$ which is an implicit representation of an ε-WSPD with slack σ
for M. The algorithm has a running time of $O(n^2 \cdot \log(\Delta))$ and a space requirement of
$O(n \cdot \log(\Delta))$.
Proof. We have $X' \subset X$ since the location of each representative is a point from X. Due
to Lemma 7.3.4, $X'$ is an implicit representation of an ε-WSPD with slack σ for M. The
cardinality of $X'$ follows from Lemma 7.3.5, and the running time and space requirement
to compute $X'$ are due to Lemma 7.3.6. In our construction, we made the assumption that
$\sigma > (\lceil\log(\Delta)\rceil + 1) \cdot 2^{6\lambda+3}/n$ to ensure that all the balls in $\mathcal{G}(0)$ are light.
8 Embeddings with Slack in Data Streams and
Applications
This chapter is devoted to the problem of computing low-distortion embeddings in the
streaming model. Given a stream of points from an n-point metric space M, our streaming
algorithms compute an embedding of M into another n-point metric space M′ that
preserves a (1 − σ)-fraction of all the pairwise distances with small distortion. The parameter
σ is called the slack of the embedding. The strict space limitations specified by the
streaming model prevent us from storing our embedding explicitly. We bypass this obstacle
by computing a compact representation of M′ without storing the actual mapping from
M into M′.
We present streaming embeddings with low distortion and low slack for n-point Euclidean
metric spaces in Section 8.2, doubling metric spaces in Section 8.4, and general metric
spaces in Section 8.5. The embeddings for Euclidean and doubling metric spaces are based
on the techniques developed in Chapter 7. The embedding for general metric spaces takes
advantage of the existence of certain subsets of points called edge-dense nets. Intuitively,
an edge-dense net N ⊆ X of a metric space M = (X,D) has the property that, for a
(1 − σ)-fraction of pairs of points (x, y) ∈ X × X, the distance between N and both x
and y is small compared to D(x, y). The existence of such nets follows from results on
embeddings with beacons by Kleinberg et al. [75]. After some modifications, this allows
us to compute a low-distortion embedding with low slack in the streaming model. Our
method resembles the construction of spanners with slack of Chan et al. [24]. Finally, we
prove some lower bounds on the space requirement of streaming embeddings with slack in
Section 8.6.
8.1 Preliminaries
A metric embedding is the transformation of one metric space into another metric space.
Definition 8.1.1 (Metric Embedding). A mapping ϕ : X → X′ from a metric space
M = (X, D) into a target metric space M′ = (X′, D′) is called a metric embedding.
In this thesis, we are only interested in embedding metric spaces M = (X, D), where
X is a set of n points. Given such an n-point metric space M = (X, D), our streaming
algorithms compute an embedding ϕ : X → X′ into a target metric space M′ = (X′, D′)
whose representation uses only sublinear space. Besides the space requirement, we will
measure the quality of ϕ by its distortion and slack.
Definition 8.1.2 (Embedding with Distortion and Slack, [23]). Let $\varrho \ge 1$ be a precision
parameter, and let σ, 0 < σ < 1, be a slack parameter. An embedding ϕ : X → X′ from a
finite metric space M = (X, D) into a target metric space M′ = (X′, D′) has distortion $\varrho$
and slack σ if there are two values α, β ≥ 1 with $\alpha \cdot \beta \le \varrho$ such that
$$\frac{1}{\alpha} \cdot D(x, y) \le D'(\varphi(x), \varphi(y)) \le \beta \cdot D(x, y) \qquad (8.1)$$
is true for a (1 − σ)-fraction of pairs (x, y) ∈ X × X.
Similar to our definition of slack of a WSPD, the slack σ of an embedding ϕ is measured
by the quantity of the fraction of all ordered pairs (x, y) ∈ X × X that do not satisfy
Inequality (8.1). The assumption of having n2 (instead of
(
n
2
)
) pairwise distances simplifies
descriptions and makes our proofs cleaner without changing the results in any significant
way. In case that an embedding ϕ has distortion % and slack σ with σ = 0, we just say
that ϕ has distortion %.
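The definition can be checked by brute force on small instances. The following is a minimal sketch, not part of the streaming algorithms: the function name `slack_of_embedding` and the toy instance are illustrative assumptions. It counts the ordered pairs that violate Inequality (8.1) for given values α and β.

```python
import itertools


def slack_of_embedding(points, phi, dist, dist_target, alpha, beta):
    """Return the fraction of ordered pairs (x, y) that violate Inequality (8.1)."""
    violated = 0
    for x, y in itertools.product(points, repeat=2):
        d = dist(x, y)
        d_target = dist_target(phi(x), phi(y))
        # Inequality (8.1): D(x, y) / alpha <= D'(phi(x), phi(y)) <= beta * D(x, y)
        if not (d / alpha <= d_target <= beta * d):
            violated += 1
    return violated / len(points) ** 2


# Toy instance on the real line: the mapping collapses the three left points.
points = [0, 1, 2, 10]
line_dist = lambda a, b: abs(a - b)
collapse = lambda x: 0 if x < 5 else 10
toy_slack = slack_of_embedding(points, collapse, line_dist, line_dist, 1.5, 1.5)
```

In this toy instance, only the six ordered pairs of distinct points among {0, 1, 2} violate the inequality, so the mapping has distortion 1.5 · 1.5 = 2.25 and slack 6/16.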
In the following, we will present streaming embeddings for Euclidean, doubling, and
general metric spaces. Our algorithms for general and doubling metric spaces work in the
insertion-only data stream model, whereas the ones for Euclidean metric spaces work in
the dynamic geometric data stream model. In each case, we assume that the minimum
pairwise distance of the given metric space is at least 1, and the maximum pairwise distance
is at most ∆. Furthermore, we assume that the parameter n is known in advance by our
algorithms.
8.2 Embedding Euclidean Metric Spaces
In this section, we explain how to compute with high probability a low-distortion embedding
with low slack for a Euclidean metric space M = (P,D) given as a dynamic geometric
data stream. Recall that, in this streaming model, the input is a sequence of m Insert
and Delete operations of points from a discrete Euclidean space $\{1, \ldots, \Delta\}^d$. At first,
we assume that the dimension d is a constant. We will show in Section 8.2.2 how to get
rid of this assumption.
8.2.1 Low Dimensions
Our algorithm for constant-dimensional Euclidean spaces is based on the WSPD construc-
tion described in Section 7.2. In order to construct an ε-WSPD with slack σ for a point set
P , we first use a certain quadtree partition of the point space into a few cells and an elab-
orate refinement of this partition, where each cell is further subdivided into a few cubelets.
Then, we replace each point by a representative such that all the representatives of points
located inside of the same cubelet have the same position. Let R(p) be the representative
of a point p ∈ P , then R : P → P ′ is an embedding from M into the target metric space
M ′ = (P ′,D). The advantage of this embedding is that it can be computed by using only
the information about the number of points in certain cells or cubelets and is not reliant
on the exact location of the points in P . We will show that we can use a random sampling
technique to estimate the number of points in the relevant cells and cubelets. To sum up,
the idea of our streaming algorithm is to maintain a random sample of the current point
set P given as a dynamic geometric data stream and to apply the algorithm described in
Section 7.2 on the sample set.
Now, we explain the sample step in more detail. We read the items of the input stream
one by one. Each time, we decide whether we use the associated point for further computations
or not. For that purpose, we use the technique described in [43, 42] to maintain a
sample set of the current point set P with size
$s \in \Theta(2^{\Theta(d)} \cdot d^d \cdot \log(\Delta) \log(n/\delta)/(\varepsilon^d \sigma^4))$,
where δ is the error probability of the algorithm. We denote this sample set by S.
After the sample step, we execute the quadtree partitioning for S based on the heavy
cells in $\lceil \log(\Delta) \rceil + 1$ nested square grids over S as described in Section 7.2. During this
process, a cell is identified as heavy if it contains at least $\sigma s/2^{d+1}$ sample points. A cell
containing fewer sample points is identified as light. Thereafter, we build the balanced
quadtree partition and perform the refinement into equal-sized cubelets as explained in
Section 7.2. For each cubelet C that contains at least $\lceil \ln(n)/\sigma^2 \rceil$ sample points, we replace
the points in C by one representative. This point is set to the location of an arbitrary
sample point inside of C and weighted by $\lceil |C| \cdot n/s \rceil$, where |C| denotes the number of
sample points in C. To ensure that the total weight of the representatives equals n, we
sum up all weights and increase or decrease the weight of some arbitrary representatives by
the required amount. The set of all weighted representatives is our compact representation
for M ′.
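The weighting step can be sketched as follows. This is an illustrative simplification under stated assumptions: the helper `cubelet_of` (mapping a sample point to an identifier of its cubelet) is hypothetical and stands in for the partition of Section 7.2, and the final correction is applied to a single arbitrary representative, which is one admissible choice among many.

```python
import math
from collections import defaultdict


def weighted_representatives(sample, cubelet_of, n, sigma):
    """Place one weighted representative in every sufficiently full cubelet."""
    s = len(sample)
    threshold = math.ceil(math.log(n) / sigma ** 2)  # ceil(ln(n) / sigma^2)
    buckets = defaultdict(list)
    for q in sample:
        buckets[cubelet_of(q)].append(q)  # cubelet_of is a hypothetical helper
    reps = {}
    for pts in buckets.values():
        if len(pts) >= threshold:
            # The representative sits on an arbitrary sample point of the cubelet
            # and receives weight ceil(|C| * n / s).
            reps[pts[0]] = math.ceil(len(pts) * n / s)
    if reps:
        # Assumption: the whole correction is put on one arbitrary representative.
        surplus = sum(reps.values()) - n
        reps[next(iter(reps))] -= surplus
    return reps


# Toy run: 20 one-dimensional sample points, cubelets of width 100, n = 100.
sample = [(x,) for x in list(range(12)) + list(range(100, 106)) + [200, 201]]
reps = weighted_representatives(sample, lambda q: q[0] // 100, n=100, sigma=0.9)
```

In the toy run the threshold is ⌈ln(100)/0.81⌉ = 6, so the cubelet with only two sample points receives no representative, and the weight missing from the total is added to the first representative.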
Let us first ignore that we use a sample step. Then, due to the fact that the embedding
R : P → P ′ from M into the target metric space M ′ = (P ′,D) is determined by the
construction of an ε-WSPD with slack σ for P , the embedding R has distortion $(1 + 2\varepsilon)^2$
and slack σ. In the following, we will show that the sample step, which enables us to
compute a representation of M ′ in the dynamic geometric data stream model, does not
significantly increase the slack.
Maintenance of the Sample Set
By applying the technique described in [43, 42], we are able to maintain a sample set of
the current point set P under insertions and deletions such that every point in the sample
set is chosen nearly uniformly at random from P . More precisely, by adopting the results
in [43, 42], we obtain the following lemma:
Lemma 8.2.1 (Sample Data Structure, [43, 42]). Let δ, 0 < δ ≤ 1, be an error probability
parameter. Given a sequence of Insert and Delete operations of points from the discrete
Euclidean space $\{1, \ldots, \Delta\}^d$, there is a data structure that, with probability 1 − δ, returns
s points $q_0, \ldots, q_{s-1}$ from the current point set $P := \{p_0, \ldots, p_{n-1}\}$ such that
\[ \Pr[q_i = p_j] = \frac{1}{n} \pm \frac{\delta}{\Delta^d} \]
is true for every j ∈ [n] and for every i ∈ [s]. Both update time and space requirement of
the algorithm are $O((s + \log(1/\delta)) \cdot d^2 \cdot \log^2(\Delta/\delta))$.
Slack Induced by the Sample Step
Due to the fact that we use a sample set to estimate the number of points in certain cells
and cubelets, we make an error which increases the slack. In order to measure this increase
of the slack, we first investigate how much the quadtree partition computed on the sample
set S might differ from the one that we would get by taking the whole input point set P
into account.
Recall that the algorithm identifies each cell that contains at least $\sigma s/2^{d+1}$ sample points
as heavy. Next, we show that if each point in S is chosen uniformly at random from P ,
then, with high probability, the algorithm identifies every heavy cell as heavy and every
cell which contains significantly fewer points than a heavy cell as light.
Lemma 8.2.2. If each point in S is chosen uniformly at random from P , then every heavy
cell is identified as heavy with probability at least 1− δ.
Proof. Let H be the set of all heavy cells that do not have a heavy subcell. Obviously,
there are at most n cells in H. Let C be any such cell. Let $Y_i$ be the indicator random
variable for the event that the i-th point in S is contained in cell C. Since C is heavy, it
contains at least $\sigma n/2^d$ points. Thus, we have $\mathrm{E}[Y_i] \ge \sigma/2^d$. By a Chernoff bound and
linearity of expectation, we get
\[
\Pr\left[\sum_{i=1}^{|S|} Y_i < \left(1 - \frac{1}{2}\right) \cdot \mathrm{E}\left[\sum_{i=1}^{|S|} Y_i\right]\right]
\le \exp\left(-\frac{|S| \cdot \mathrm{E}[Y_i]}{2^3}\right)
\le \exp\left(-\frac{\sigma \cdot |S|}{2^{d+3}}\right) .
\]
For $|S| \ge 8 \cdot 2^d \ln(n/\delta)/\sigma$, this probability is at most δ/n. Since we have chosen
\[
|S| \in \Theta\left(\frac{2^{\Theta(d)} \cdot d^d \cdot \log(\Delta) \log(n/\delta)}{\varepsilon^d \sigma^4}\right)
\subset \Omega\left(\frac{2^d \ln(n/\delta)}{\sigma}\right) ,
\]
C contains at least
\[
\left(1 - \frac{1}{2}\right) \cdot \mathrm{E}\left[\sum_{i=1}^{|S|} Y_i\right] \ge \frac{\sigma s}{2^{d+1}}
\]
sample points with probability at least 1 − δ/n. By the union bound, the probability that
every cell in H contains at least $\sigma s/2^{d+1}$ sample points is at least 1 − δ. It then
follows that each ancestor cell of a cell in H also contains at least $\sigma s/2^{d+1}$ sample points.
Hence, every heavy cell is identified as heavy with probability at least 1 − δ.
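The sample size required in this proof can be checked numerically. The sketch below only evaluates the two formulas from the proof; the function names and the concrete parameter values are illustrative assumptions.

```python
import math


def heavy_cell_failure_bound(s, d, sigma):
    """Chernoff bound exp(-sigma * s / 2^(d+3)) from the proof of Lemma 8.2.2."""
    return math.exp(-sigma * s / 2 ** (d + 3))


def min_sample_size(n, d, sigma, delta):
    """Smallest integer s with s >= 8 * 2^d * ln(n / delta) / sigma."""
    return math.ceil(8 * 2 ** d * math.log(n / delta) / sigma)


# Illustrative parameters: one million points in the plane.
n, d, sigma, delta = 10 ** 6, 2, 0.1, 0.01
s_min = min_sample_size(n, d, sigma, delta)
bound = heavy_cell_failure_bound(s_min, d, sigma)
```

For these values, the failure probability per cell stays below δ/n, so the union bound over at most n cells yields a total failure probability of at most δ.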
We call a cell that contains at least $\sigma n/2^{d+2}$ points from P quarter-heavy. The following
lemma proves that, with high probability, no cell which is not quarter-heavy is identified
as heavy by the algorithm.
Lemma 8.2.3. If each point in S is chosen uniformly at random from P , then every cell
that is not quarter-heavy is identified as light with probability at least 1− δ.
Proof. Let C be any cell that contains $\sigma n/(2^{d+2} k)$ points from P , where k > 1. Let $Y_i$ be
the indicator random variable for the event that the i-th point in S is contained in cell C.
We have $\mathrm{E}[Y_i] = \sigma/(2^{d+2} k)$. By a Chernoff bound and linearity of expectation, we get
\[
\Pr\left[\sum_{i=1}^{|S|} Y_i \ge (1 + k) \cdot \mathrm{E}\left[\sum_{i=1}^{|S|} Y_i\right]\right]
\le \exp\left(-\frac{k \cdot |S| \cdot \mathrm{E}[Y_i]}{3}\right)
= \exp\left(-\frac{|S| \cdot \sigma}{3 \cdot 2^{d+2}}\right) .
\]
For $|S| \ge 12 \cdot 2^d \ln(n/\delta)/\sigma$, this probability is at most δ/n. Since we have chosen
\[
|S| \in \Theta\left(\frac{2^{\Theta(d)} \cdot d^d \cdot \log(\Delta) \log(n/\delta)}{\varepsilon^d \sigma^4}\right)
\subset \Omega\left(\frac{2^d \ln(n/\delta)}{\sigma}\right) ,
\]
C contains fewer than
\[
(1 + k) \cdot \mathrm{E}\left[\sum_{i=1}^{|S|} Y_i\right] < \frac{\sigma s}{2^{d+1}}
\]
sample points with probability at least 1 − δ/n.
Let us assume that the algorithm is performing the quadtree partitioning and has already
identified all cells which are at least quarter-heavy as heavy cells and has not yet tried to
identify any cell that is not quarter-heavy. Since a cell must contain at least one point
from P to be a potential candidate for being identified as heavy, there are at most n such
candidate cells in the current space partition. By the union bound, the probability that all
of them contain fewer than $\sigma s/2^{d+1}$ sample points is at least 1 − δ. Since each cell contains
at most as many sample points as its parent cell, it then follows that every descendant cell
of a cell in the current space partition contains fewer than $\sigma s/2^{d+1}$ sample points. Thus,
every cell that is not quarter-heavy contains fewer than $\sigma s/2^{d+1}$ sample points and is, hence,
identified as light with probability at least 1 − δ.
Due to Lemmas 8.2.2 and 8.2.3, we know that the quadtree partition on S is fairly close
to the quadtree partition on P . By utilizing this fact, we next analyze the slack induced
by estimating the number of points in cubelets based on the sample set S.
Lemma 8.2.4. Let $Z \in \Theta(2^{\Theta(d)} \cdot d^d \cdot \log(\Delta)/(\varepsilon^d \sigma))$. If each point in S is chosen uniformly
at random from P , then we can define a set $\mathcal{Z}$ of Z cubelets such that the set of cubelets
constructed by the algorithm is a subset of $\mathcal{Z}$ with probability at least 1 − 2δ.
Proof. Due to Lemmas 8.2.2 and 8.2.3, with probability at least 1 − 2δ, the algorithm
satisfies the condition that it identifies every heavy cell as heavy and every cell which is
not quarter-heavy as light. Let us now consider the quadtree partition that we get by
splitting exactly the quarter-heavy cells and performing the balancing operations. Let $L^+$
be the set of cells in the resulting balanced quadtree. Then, the set of cells in any balanced
quadtree partition obtained from a run of our algorithm that satisfies the above condition
is a subset of $L^+$. Furthermore, let $\mathcal{Z}$ be the set of cubelets that we obtain by subdividing
each cell in $L^+$ into cubelets. Then, the set of cubelets constructed during any run of the
algorithm that satisfies the above condition is a subset of $\mathcal{Z}$. Next, we upper bound the
size of $\mathcal{Z}$.
We proceed exactly as we have done in the proof of Lemma 7.2.6. There are at most
$2^{d+2}/\sigma$ quarter-heavy cells which do not have a quarter-heavy subcell in a quadtree partition.
For each such cell, there exist at most $2^d (\lceil \log(\Delta) \rceil + 1)$ cells in the quadtree. Hence,
the unbalanced quadtree partition contains $O(2^{2d} \cdot \log(\Delta)/\sigma)$ cells. Due to Lemma 7.2.5,
the number of cells in the balanced quadtree partition is $O(2^{4d} \cdot \log(\Delta)/\sigma)$. Since in the last
step of the algorithm a cell is subdivided into $\lceil 6\sqrt{d} \rceil^d$ cubes and each cube into
$\lceil 2\sqrt{d}/\varepsilon \rceil^d$ cubelets, the set $\mathcal{Z}$ consists of
$O(288^d \cdot d^d \cdot \log(\Delta)/(\varepsilon^d \sigma))$ cubelets.
Lemma 8.2.5. Let $Z \in \Theta(2^{\Theta(d)} \cdot d^d \cdot \log(\Delta)/(\varepsilon^d \sigma))$, and let U be the union of all the cubelets
that contain at most σn/(2Z) points from P . If each point in S is chosen uniformly at
random from P and the space partition consists of at most Z cubelets, then, with probability
at least 1 − δ, the number of points from P in U is at most σn/2 and the number of sample
points in U is at most σs.
Proof. Obviously, if there are at most Z cubelets in the space partition, then the number
of points in cubelets that contain at most σn/(2Z) points from P is at most σn/2. Thus,
we have that, for some k ≥ 1, the total number of points from P in U is σn/(2k). Let $Y_i$
be the indicator random variable for the event that the i-th point in S is contained in U .
We have $\mathrm{E}[Y_i] = \sigma/(2k)$. By a Chernoff bound and linearity of expectation, we get
\[
\Pr\left[\sum_{i=1}^{|S|} Y_i \ge (1 + k) \cdot \mathrm{E}\left[\sum_{i=1}^{|S|} Y_i\right]\right]
\le \exp\left(-\frac{k \cdot |S| \cdot \mathrm{E}[Y_i]}{3}\right)
= \exp\left(-\frac{|S| \cdot \sigma}{6}\right) .
\]
For $|S| \ge 6 \ln(1/\delta)/\sigma$, this probability is at most δ. Since we have chosen
\[
|S| \in \Theta\left(\frac{2^{\Theta(d)} \cdot d^d \cdot \log(\Delta) \log(n/\delta)}{\varepsilon^d \sigma^4}\right)
\subset \Omega\left(\frac{\ln(1/\delta)}{\sigma}\right) ,
\]
U contains fewer than
\[
(1 + k) \cdot \mathrm{E}\left[\sum_{i=1}^{|S|} Y_i\right] \le \sigma s
\]
sample points with probability at least 1 − δ.
Lemma 8.2.6. Let $Z \in \Theta(2^{\Theta(d)} \cdot d^d \cdot \log(\Delta)/(\varepsilon^d \sigma))$. If each point in S is chosen uniformly
at random from P , then the number of points in every cubelet that contains at least σn/(2Z)
points from P can be (1 ± σ)-approximated by S with probability 1 − δ.
Proof. Let C be any cubelet that contains at least σn/(2Z) points from P . Let $Y_i$ be the
indicator random variable for the event that the i-th point in S is contained in cubelet C.
We have $\mathrm{E}[Y_i] \ge \sigma/(2Z)$. By Chernoff bounds and linearity of expectation, we get
\[
\Pr\left[\sum_{i=1}^{|S|} Y_i \ge (1 + \sigma) \cdot \mathrm{E}\left[\sum_{i=1}^{|S|} Y_i\right]\right]
\le \exp\left(-\frac{\sigma^2 \cdot |S| \cdot \mathrm{E}[Y_i]}{3}\right)
\le \exp\left(-\frac{\sigma^3 \cdot |S|}{6Z}\right)
\]
and
\[
\Pr\left[\sum_{i=1}^{|S|} Y_i \le (1 - \sigma) \cdot \mathrm{E}\left[\sum_{i=1}^{|S|} Y_i\right]\right]
\le \exp\left(-\frac{\sigma^2 \cdot |S| \cdot \mathrm{E}[Y_i]}{2}\right)
\le \exp\left(-\frac{\sigma^3 \cdot |S|}{4Z}\right) .
\]
For $|S| \ge 6Z \ln(n/\delta)/\sigma^3$, each of these probabilities is at most δ/n. Since we have chosen
\[
|S| \in \Theta\left(\frac{2^{\Theta(d)} \cdot d^d \cdot \log(\Delta) \log(n/\delta)}{\varepsilon^d \sigma^4}\right)
\subseteq \Omega\left(\frac{Z \ln(n/\delta)}{\sigma^3}\right) ,
\]
the number of points in C can be (1 ± σ)-approximated with probability 1 − 2δ/n. By the
union bound, the number of points in every cubelet that contains at least σn/(2Z) points
is (1 ± σ)-approximated with probability at least 1 − δ.
Based on the lemmas given above, we will show that, with high probability, our streaming
algorithm computes an ε-WSPD with slack σ′ = 4σ for P . Unfortunately, in contrast to our
WSPD construction in a classical non-streaming model, the property that, for each point,
the distances to its σ′n closest neighbors can be arbitrarily distorted, but the distances
to all the other points are (1 ± 2ε)-preserved (see Remark 7.2.4), does not hold for our
streaming embedding. This is caused by the use of the random sampling technique. Simply
said, we know how big the slack is, but we do not know where it arises.
Weight of the Representatives
To avoid that the total weight of the representatives differs from n, we adjust the weight of
some representatives in the last phase of the algorithm. Now, we prove that this adjustment
is small.
Lemma 8.2.7. Let R be the set of representatives before the adjustment, and let w(r)
denote the weight of a representative r ∈ R. Then,
\[
(1 - \sigma) \cdot n < \sum_{r \in R} w(r) \le \left(1 + \frac{\sigma^2}{\ln(n)}\right) \cdot n
\]
holds with probability at least 1 − 4δ.
Proof. In the third phase, we place one representative in each cubelet C that contains at
least $\lceil \ln(n)/\sigma^2 \rceil$ sample points. The weight of this representative is set to $\lceil |C| \cdot n/s \rceil$,
where |C| denotes the number of sample points in C. It follows that the total weight of
the representatives can be smaller than n. The sample data structure from Lemma 8.2.1
fails with probability δ. Furthermore, the statistical difference from the exact uniform
distribution is at most δ. However, in case that the sample data structure works as required,
it follows from Lemma 8.2.4 that, with an error probability of 2δ, the number of cubelets
containing fewer than $\lceil \ln(n)/\sigma^2 \rceil$ sample points is at most
$Z \in O(2^{O(d)} \cdot d^d \cdot \log(\Delta)/(\varepsilon^d \sigma))$ and
\[
Z \le s \cdot \frac{\sigma^3}{6 \log(n)} \le \frac{\sigma s}{3 \lceil \ln(n)/\sigma^2 \rceil}
\]
for our chosen value of s. Thus, we get
\[
n - \sum_{r \in R} w(r)
\le \left\lceil \left\lceil \frac{\ln(n)}{\sigma^2} \right\rceil \cdot \frac{n}{s} \right\rceil \cdot \frac{\sigma s}{3 \lceil \ln(n)/\sigma^2 \rceil}
< 2 \cdot \left\lceil \frac{\ln(n)}{\sigma^2} \right\rceil \cdot \frac{n}{s} \cdot \frac{\sigma s}{3 \lceil \ln(n)/\sigma^2 \rceil}
< \sigma n
\]
with probability at least 1 − 4δ, which proves the first inequality of the lemma.
The sum of the weights can be larger than n because the weight of each representative
is rounded up to the next integer. Thus, the sum of the weights is at most n + |R|. Since
s ≤ n and every cubelet in which we set a representative contains at least $\lceil \ln(n)/\sigma^2 \rceil$ points
of the sample set S, we get
\[
\sum_{r \in R} w(r) - n \le n + |R| - n \le \frac{s}{\lceil \ln(n)/\sigma^2 \rceil} \le \frac{\sigma^2 n}{\ln(n)} ,
\]
which proves the second inequality of the lemma.
We summarize our results in the following theorem:
Theorem 9. Given a stream of Insert and Delete operations of points from a discrete
Euclidean space $\{1, \ldots, \Delta\}^d$, where d is a constant, a precision parameter ε, 0 < ε < 1, a
slack parameter σ, 1/o(n) < σ < 1, and an error probability parameter δ, 0 < δ < 1, there
is a randomized streaming algorithm that computes with probability 1 − δ, for the current
point set P of size n, a point set P ′ ⊂ P of size $O(\log(\Delta)/(\varepsilon^d \sigma))$ such that P embeds into
P ′ with distortion 1 + ε and slack σ. The algorithm has an update time of
\[
O\left(\frac{\log(\Delta) \cdot \log^2(\Delta/\delta) \cdot \log(n/\delta)}{\varepsilon^{d+1} \sigma^4}\right)
\]
and a space requirement of
\[
O\left(\frac{\log(\Delta) \cdot \log^2(\Delta/\delta) \cdot \log(n/\delta)}{\varepsilon^d \sigma^4}\right) .
\]
Proof. Due to Lemmas 8.2.1 and 8.2.2, with high probability, the algorithm identifies and
splits each heavy cell during the quadtree partitioning, which implies that Lemmas 7.2.2
and 7.2.3 are applicable. It follows that, for any two points p1 and p2 in P , if the cubelet
containing p1 and the cubelet containing p2 are not ε-well-separated, then p2 belongs to
the σn closest points of p1. This induces a slack of σ. Since we estimate the number
of points in each cubelet based on the sample set S, we get an additional slack. Due
to Lemmas 8.2.4, 8.2.5 and 8.2.6, with high probability, the additional slack induced by
this estimation is at most 2σ. Finally, we get more additional slack since we only place
representatives in cubelets containing more than a certain threshold of points and we
round up the weight of each representative. Due to Lemma 8.2.7, with high probability,
this slack is at most σ. Thus, with high probability, our streaming algorithm computes a
representation of an ε-WSPD with slack 4σ for P . This ε-WSPD is an implicit embedding
of P with distortion $(1 + 2\varepsilon)^2$ and slack 4σ.
The error probability is given as follows. We use a random sample as given by the data
structure from Lemma 8.2.1. This data structure fails with probability δ. Furthermore,
the statistical difference from the exact uniform distribution is at most δ. In case that the
sample data structure works as required, Lemmas 8.2.2 and 8.2.3 hold with probability at
least 1 − 2δ. Thus, the probability that Lemmas 8.2.1, 8.2.2, and 8.2.3 hold is at least
1− 4δ. If this is the case, then the assertions given in Lemmas 8.2.4, 8.2.7, 7.2.2 and 7.2.3
follow directly and the assertions given in Lemmas 8.2.5 and 8.2.6 hold each with an error
probability of at most δ. Thus, the total error probability of our algorithm is at most 6δ.
In summary, if we run our embedding algorithm with a precision parameter ε′ ≤ ε/5, a
slack parameter σ′ ≤ σ/4, and an error probability parameter δ′ ≤ δ/6, then the embedding
has distortion $(1 + 2\varepsilon')^2 \le 1 + \varepsilon$ and slack 4σ′ ≤ σ and works with error probability 6δ′ ≤ δ.
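The distortion bound for the rescaled precision parameter follows from a short calculation, spelled out here for convenience. It only uses ε′ ≤ ε/5 and ε² ≤ ε, which holds because 0 < ε < 1.

```latex
\begin{align*}
(1 + 2\varepsilon')^2
  &\le \left(1 + \tfrac{2\varepsilon}{5}\right)^2
   = 1 + \tfrac{4\varepsilon}{5} + \tfrac{4\varepsilon^2}{25} \\
  &\le 1 + \tfrac{4\varepsilon}{5} + \tfrac{4\varepsilon}{25}
   = 1 + \tfrac{24\varepsilon}{25}
   \le 1 + \varepsilon .
\end{align*}
```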
Due to Theorem 7, the size of P ′ is $O(\log(\Delta)/(\varepsilon^d \sigma))$. Furthermore, it follows from
Theorem 7 and S ⊂ P that we have P ′ ⊂ P .
The update time is given as follows. At first, we use the data structure of Lemma 8.2.1
to decide if the point is sampled or not. This costs $O((s + \log(1/\delta)) \cdot \log^2(\Delta/\delta))$ time.
Afterwards, we build the balanced quadtree partition and the refinement into a set of
$O(\log(\Delta)/(\varepsilon^d \cdot \sigma))$ cubelets for a set of s points. By applying Theorem 7, this can be done
in $O(s(1/\varepsilon + \log(\Delta)) + \log^2(\Delta)/\sigma + \log(\Delta)/(\varepsilon^d \sigma))$ time. Since the size of the sample set
is $s \in \Theta(\log(n/\delta) \cdot \log(\Delta)/(\varepsilon^d \cdot \sigma^4))$, the total update time is as claimed in the theorem.
Due to Lemma 8.2.1, the space required to store the sample data structure is upper
bounded by $O((s + \log(1/\delta)) \cdot \log^2(\Delta/\delta))$. Furthermore, we have to store the partition tree
with $O(\log(\Delta)/(\varepsilon^d \cdot \sigma))$ nodes for the sample set S. This requires $O(s + \log(\Delta)/(\varepsilon^d \sigma))$
space. Since the size of the sample set is $s \in \Theta(\log(n/\delta) \cdot \log(\Delta)/(\varepsilon^d \cdot \sigma^4))$, the space for
the sample data structure dominates. Thus, the total space requirement is as stated
in the theorem.
Since we use the WSPD construction given in Section 7.2, we have to make sure that
$\sigma' > 2^{2d}/n$ (cf. Theorem 7). However, this is implicitly required by the fact that
the space requirement of a streaming algorithm has to be sublinear in n and the space
requirement of our streaming algorithm is ω(1/σ).
8.2.2 High Dimensions
If the points in P have a high dimension, we first use the Johnson-Lindenstrauss embedding [71]
with $d(\varepsilon, \sigma, \delta) \in \Theta(1/(\varepsilon^2 \sigma \delta))$ dimensions to get an embedding into a
low-dimensional space that has distortion 1 + ε and slack σ with probability 1 − δ. Afterwards,
we apply the techniques described in Sections 7.2 and 8.2 on the low-dimensional point
set. This composition of two embeddings, both with distortion 1 + ε and slack σ, yields
an embedding with distortion 1 + 3ε and slack 2σ that can be computed in the dynamic
geometric data stream model.
In the following, we will give an idea how to use the techniques developed by Johnson
and Lindenstrauss [71] to obtain an embedding from a high-dimensional space into a low-
dimensional space with distortion 1 + ε and slack σ. Overall, we apply the AMS-sketch [6]
to get an embedding similar to the Johnson-Lindenstrauss embedding [71]. The main
difference between both techniques is as follows. The AMS-sketch computes one random
variable for the whole input stream such that the sum of the squared coordinate values
is close to the second frequency moment of the input stream with high probability. In
contrast, our method computes for each d-dimensional point in the data stream d(ε, σ, δ)
random variables, the d(ε, σ, δ) coordinates of the embedded point, whose squared values
are all equal to the so-called second frequency moment of this d-dimensional input point
with high probability. More precisely, we can look upon one d-dimensional point as a
stream consisting of d different elements where the frequency of element i, 1 ≤ i ≤ d, is
given by the value of the i-th coordinate of the d-dimensional point. It is easy to see that,
for this definition of frequency moments, the second frequency moment of a point is equal
to the squared norm of this point. Because the embedding of a point is given by a linear
mapping, the embedded distance vector of two points is equal to the distance vector of the
two embedded points. Consequently, our method computes an embedding that preserves
an approximation of almost all squared pairwise distances. Hence, the embedding preserves
an approximation of almost all simple pairwise distances. Next, we state our result and
give a detailed proof of its correctness.
Theorem 10. Let ε, 0 < ε < 1, be a precision parameter, let σ, 0 < σ < 1, be a slack
parameter, let δ, 0 < δ < 1, be an error probability parameter, and let
$d(\varepsilon, \sigma, \delta) := 2/(\varepsilon^2 \sigma \delta)$
be a function dependent on ε, σ, and δ. Given a set P of n points in $\mathbb{R}^d$, there exists an
embedding $\varphi : P \to \mathbb{R}^{d(\varepsilon,\sigma,\delta)}$ such that
\[ (1 - \varepsilon) \cdot D(p, q) \le D(\varphi(p), \varphi(q)) \le (1 + \varepsilon) \cdot D(p, q) \]
is true for at least $(1 - \sigma) \cdot n^2$ pairs of points (p, q) ∈ P × P with probability at least 1 − δ.
Each point can be embedded in $O(d \cdot \log^2(d)/(\varepsilon^2 \sigma \delta))$ time using
$O(\log(d)/(\varepsilon^2 \sigma \delta))$ space.
Proof. Our proof of the theorem is almost identical to the proof of Theorem 2.2 in [6].
However, since we will present another result that is based on similar techniques, we do
not only describe our modifications to the proof of Theorem 2.2 in [6] but include below
the full proof.
For each point p ∈ P and each coordinate i ∈ {1, . . . , d(ε, σ, δ)}, the algorithm computes
a random variable $Y_i(p)$. We define the embedding ϕ of the point p by
\[
\varphi(p) := \frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot \left(Y_1(p), \ldots, Y_{d(\varepsilon,\sigma,\delta)}(p)\right)^{T} .
\]
For each point p ∈ P , the value $Y_i(p)$ is computed in the same way.
Fix an explicit set $V := \{v_1, \ldots, v_Z\}$ of $Z \in O(d^2)$ vectors of length d with +1, −1 entries
which are four-wise independent, i.e., for every four distinct coordinates, each of the 16
possible combinations in $\{-1, 1\}^4$ occurs equally often among the vectors of V . As described in [5, 6],
such sets can be constructed with the help of the parity check matrices of BCH codes. The
implementation of this construction requires an irreducible polynomial of degree g over the
finite field $\mathbb{F}_2$, where $2^g$ is the smallest power of 2 greater than d. Such a polynomial can
be found by using only O(log d) space. Then, the construction enables us to compute each
coordinate of each vector in V in O(log d) space, using a constant number of multiplications
in the finite field $\mathbb{F}_{2^g}$ and binary inner products of vectors of length g.
In order to compute $Y_i(p)$, for any $p \in \mathbb{R}^d$, we choose a random vector
\[
v_z =: r_i = \left(r_i^{(1)}, r_i^{(2)}, \ldots, r_i^{(d)}\right) \in V ,
\]
where z is chosen uniformly between 1 and Z. Note that, once we have chosen a random
vector $r_i$ to compute the i-th coordinate of ϕ(p) for the first point p ∈ P , we use $r_i$ to
compute the i-th coordinate of ϕ(p′) for every point p′ ∈ P , i.e., we choose d(ε, σ, δ) random
vectors in total. Recall that we denote the i-th coordinate of a point p by $p^{(i)}$. We define
\[
Y_i(p) := \sum_{k=1}^{d} r_i^{(k)} \cdot p^{(k)} .
\]
To compute $\varphi(p) = 1/\sqrt{d(\varepsilon, \sigma, \delta)} \cdot (Y_1(p), \ldots, Y_{d(\varepsilon,\sigma,\delta)}(p))^{T}$, we have to keep the value z
and have to maintain the sum $Y_i(p)$ for each coordinate i ∈ {1, . . . , d(ε, σ, δ)}. Recall that
the bits of each $r_i = v_z$ can be generated from z in O(log(d)) space, using a constant
number of arithmetic and finite field operations on elements of O(log(d)) bits. Thus,
the embedding of one d-dimensional input point requires O(log(d) · d(ε, σ, δ)) space and
$O(d \cdot \log^2(d) \cdot d(\varepsilon, \sigma, \delta))$ time. Furthermore, if A denotes the d(ε, σ, δ) × d matrix whose
rows are the vectors $r_1, \ldots, r_{d(\varepsilon,\sigma,\delta)}$, then we can write
$\varphi(p) = 1/\sqrt{d(\varepsilon, \sigma, \delta)} \cdot Ap$. Hence, ϕ
is a linear function, i.e., ϕ(p − q) = ϕ(p) − ϕ(q) for all pairs $(p, q) \in \mathbb{R}^d \times \mathbb{R}^d$.
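The mapping ϕ can be sketched in a few lines. Note the simplifying assumption: the sketch draws fully independent random signs and stores them explicitly, instead of generating four-wise independent vectors from a BCH-based family in O(log d) space as described above. The function names are illustrative.

```python
import math
import random


def make_sign_vectors(d, target_dim, seed=0):
    """Draw target_dim random +1/-1 vectors of length d.

    Assumption: fully independent signs stored explicitly; this replaces the
    space-efficient four-wise independent construction used in the text.
    """
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(target_dim)]


def embed(p, sign_vectors):
    """Compute phi(p) = (Y_1(p), ..., Y_m(p)) / sqrt(m), where m = len(sign_vectors)."""
    scale = 1.0 / math.sqrt(len(sign_vectors))
    return [scale * sum(r_k * p_k for r_k, p_k in zip(r, p)) for r in sign_vectors]


# Linearity check on two toy points in dimension 5, embedded into dimension 8.
vecs = make_sign_vectors(d=5, target_dim=8, seed=1)
p, q = [3, 1, 4, 1, 5], [2, 7, 1, 8, 2]
diff_then_embed = embed([a - b for a, b in zip(p, q)], vecs)
embed_then_diff = [a - b for a, b in zip(embed(p, vecs), embed(q, vecs))]
```

Because the same sign vectors are reused for every point, embedding the difference vector and subtracting the two embedded points give the same result up to floating-point rounding, which is the linearity used in the remainder of the proof.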
Let $Y(\nu) := \|\varphi(\nu)\|^2$ be the random variable for the squared length of ϕ(ν). Due to our
definition of ϕ(ν), we have
\[
Y(\nu) = \sum_{i=1}^{d(\varepsilon,\sigma,\delta)} \left(\frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot Y_i(\nu)\right)^2 .
\]
Next, we show that the expected value of Y (ν) is $\|\nu\|^2$ and, by bounding the variance of
Y (ν), that Y (ν) is sharply concentrated.
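Due to the fact that the random variables $r_i^{(k)}$ are pairwise independent and $\mathrm{E}[r_i^{(k)}] = 0$
for all pairs (i, k) ∈ {1, . . . , d(ε, σ, δ)} × {1, . . . , d}, we have
\[
\begin{aligned}
\mathrm{E}\left[\left(\frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot Y_i(\nu)\right)^2\right]
&= \mathrm{E}\left[\left(\frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot \sum_{k=1}^{d} r_i^{(k)} \cdot \nu^{(k)}\right)^2\right] \\
&= \sum_{k=1}^{d} \frac{1}{d(\varepsilon, \sigma, \delta)} \cdot \mathrm{E}\left[\left(r_i^{(k)}\right)^2\right] \cdot \left(\nu^{(k)}\right)^2
 + \sum_{1 \le k < \ell \le d} \frac{2}{d(\varepsilon, \sigma, \delta)} \cdot \mathrm{E}\left[r_i^{(k)}\right] \cdot \mathrm{E}\left[r_i^{(\ell)}\right] \cdot \nu^{(k)} \cdot \nu^{(\ell)} \\
&= \sum_{k=1}^{d} \frac{1}{d(\varepsilon, \sigma, \delta)} \cdot \left(\nu^{(k)}\right)^2
 = \frac{1}{d(\varepsilon, \sigma, \delta)} \cdot \|\nu\|^2 .
\end{aligned}
\]
Due to linearity of expectation, it follows that
\[
\mathrm{E}[Y(\nu)] = \mathrm{E}\left[\sum_{i=1}^{d(\varepsilon,\sigma,\delta)} \left(\frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot Y_i(\nu)\right)^2\right] = \|\nu\|^2 .
\]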
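Since the variables $r_i^{(k)}$ are four-wise independent, we have
\[
\mathrm{E}\left[\left(\frac{Y_i(\nu)}{\sqrt{d(\varepsilon, \sigma, \delta)}}\right)^4\right]
= \sum_{k=1}^{d} \frac{1}{d(\varepsilon, \sigma, \delta)^2} \cdot \left(\nu^{(k)}\right)^4
+ \sum_{1 \le k < \ell \le d} \frac{6}{d(\varepsilon, \sigma, \delta)^2} \cdot \left(\nu^{(k)}\right)^2 \cdot \left(\nu^{(\ell)}\right)^2 .
\]
Furthermore, we obtain
\[
\begin{aligned}
\mathrm{E}\left[\left(\frac{Y_i(\nu)}{\sqrt{d(\varepsilon, \sigma, \delta)}}\right)^2\right]^2
&= \left(\sum_{k=1}^{d} \frac{1}{d(\varepsilon, \sigma, \delta)} \cdot \left(\nu^{(k)}\right)^2\right)^2 \\
&= \sum_{k=1}^{d} \frac{1}{d(\varepsilon, \sigma, \delta)^2} \cdot \left(\nu^{(k)}\right)^4
+ \sum_{1 \le k < \ell \le d} \frac{2}{d(\varepsilon, \sigma, \delta)^2} \cdot \left(\nu^{(k)}\right)^2 \cdot \left(\nu^{(\ell)}\right)^2 .
\end{aligned}
\]
It follows that
\[
\begin{aligned}
\mathrm{V}\left[\left(\frac{Y_i(\nu)}{\sqrt{d(\varepsilon, \sigma, \delta)}}\right)^2\right]
&= \mathrm{E}\left[\left(\frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot Y_i(\nu)\right)^4\right]
 - \mathrm{E}\left[\left(\frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot Y_i(\nu)\right)^2\right]^2 \\
&= \sum_{1 \le k < \ell \le d} \frac{4}{d(\varepsilon, \sigma, \delta)^2} \cdot \left(\nu^{(k)}\right)^2 \cdot \left(\nu^{(\ell)}\right)^2
 \le 2 \cdot \mathrm{E}\left[\left(\frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot Y_i(\nu)\right)^2\right]^2 .
\end{aligned}
\]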
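Now, we can upper bound the variance of Y (ν) by
\[
\begin{aligned}
\mathrm{V}[Y(\nu)]
&= \mathrm{V}\left[\sum_{i=1}^{d(\varepsilon,\sigma,\delta)} \left(\frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot Y_i(\nu)\right)^2\right]
 = d(\varepsilon, \sigma, \delta) \cdot \mathrm{V}\left[\left(\frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot Y_i(\nu)\right)^2\right] \\
&\le d(\varepsilon, \sigma, \delta) \cdot 2 \cdot \mathrm{E}\left[\left(\frac{1}{\sqrt{d(\varepsilon, \sigma, \delta)}} \cdot Y_i(\nu)\right)^2\right]^2
 \le \frac{2 \cdot \|\nu\|^4}{d(\varepsilon, \sigma, \delta)} .
\end{aligned}
\]
Hence, by Chebyshev’s inequality and the definition of d(ε, σ, δ), we get
\[
\Pr\left[\,|Y(\nu) - \mathrm{E}[Y(\nu)]| > \varepsilon \cdot \|\nu\|^2\,\right]
\le \frac{\mathrm{V}[Y(\nu)]}{\varepsilon^2 \cdot \|\nu\|^4}
\le \frac{2}{d(\varepsilon, \sigma, \delta) \cdot \varepsilon^2}
= \sigma\delta .
\]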
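Let $Z_\ell$ be the indicator random variable for the event that the distance between the ℓ-th
pair of points is not (1 ± ε)-approximated. By the above, we get $\mathrm{E}[Z_\ell] \le \sigma\delta$. Then, by
Markov’s inequality, we have
\[
\Pr\left[\sum_{\ell=1}^{n^2} Z_\ell \ge \sigma \cdot n^2\right]
\le \frac{\mathrm{E}\left[\sum_{\ell=1}^{n^2} Z_\ell\right]}{\sigma n^2}
\le \frac{\sigma\delta n^2}{\sigma n^2}
= \delta .
\]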
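The two probabilistic steps of the proof can be traced numerically. In the sketch below, the function names are illustrative, and rounding the target dimension up to an integer is an assumption, since the theorem states d(ε, σ, δ) = 2/(ε²σδ) without rounding.

```python
import math


def target_dimension(eps, sigma, delta):
    """d(eps, sigma, delta) = 2 / (eps^2 * sigma * delta), rounded up (assumption)."""
    return math.ceil(2 / (eps ** 2 * sigma * delta))


def pair_failure_bound(eps, target_dim):
    """Chebyshev bound 2 / (d * eps^2) on the probability that one pair is distorted."""
    return 2 / (target_dim * eps ** 2)


def slack_failure_bound(pair_bound, sigma):
    """Markov bound on the probability that more than a sigma-fraction of pairs fails."""
    return min(1.0, pair_bound / sigma)


eps, sigma, delta = 0.5, 0.1, 0.1
dim = target_dimension(eps, sigma, delta)
per_pair = pair_failure_bound(eps, dim)
overall = slack_failure_bound(per_pair, sigma)
```

For ε = 0.5 and σ = δ = 0.1, the target dimension is 800, each pair fails with probability at most σδ = 0.01, and the slack exceeds σ with probability at most δ = 0.1.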
By combining the above result with the results from Sections 7.2 and 8.2, we obtain the
following theorem:
Theorem 11. Given a stream of Insert and Delete operations of points from a discrete
Euclidean space $\{1, \ldots, \Delta\}^d$, a precision parameter ε, 0 < ε < 1, a slack parameter
σ, 1/o(n) < σ < 1, and an error probability parameter δ, 0 < δ < 1/2, there is a randomized
streaming algorithm that computes with probability 1 − δ, for the current point
set P of size n, a point set P ′ from a discrete Euclidean space $\{1, \ldots, \Delta'\}^{d'}$ with spread
$\Delta' \in O(\sqrt{dn} \cdot \Delta/(\varepsilon^2 \sqrt{\sigma\delta}))$ and dimension $d' \in \Theta(1/(\varepsilon^2 \sigma \delta))$ and of size
\[
O\left(\left(\frac{1}{\varepsilon\sigma\delta}\right)^{O(1/(\varepsilon^2 \sigma \delta))} \cdot \log(dn\Delta)\right)
\]
such that P embeds into P ′ with distortion 1 + ε and slack σ. The algorithm has an update
time of
\[
O\left(\frac{d \cdot \log^2(d)}{\varepsilon^2 \sigma \delta} + \left(\frac{1}{\varepsilon\sigma\delta}\right)^{O(1/(\varepsilon^2 \sigma \delta))} \cdot \log^4(dn\Delta)\right)
\]
and a space requirement of
\[
O\left(\frac{\log(d)}{\varepsilon^2 \sigma \delta} + \left(\frac{1}{\varepsilon\sigma\delta}\right)^{O(1/(\varepsilon^2 \sigma \delta))} \cdot \log^4(dn\Delta)\right) .
\]
Proof. We combine the embedding from Theorem 10 with the construction described in
Sections 7.2 and 8.2. At first, we embed the discrete high-dimensional Euclidean point set
P into a low-dimensional Euclidean space. Then, we impose an appropriately fine grid on
the target space and move each embedded point to its nearest grid point. This technique
is sometimes called snap rounding. It follows that, by appropriate scaling of the point
space, the resulting point set is from a discrete low-dimensional Euclidean space. On this
point set, we apply the construction described in Sections 7.2 and 8.2. Now, we explain
our approach in more detail.
By applying the techniques given in the proof of Theorem 10 with a precision parameter
ε′ := ε/18, a slack parameter σ′ := σ/2, and an error probability parameter δ′ := δ/3, we
get an embedding $\varphi : P \to \mathbb{R}^{d(\varepsilon',\sigma',\delta')}$ with $d(\varepsilon', \sigma', \delta') \in \Theta(1/(\varepsilon^2 \sigma \delta))$ such that
\[
\left(1 - \frac{\varepsilon}{18}\right) \cdot D(p, q) \le D(\varphi(p), \varphi(q)) \le \left(1 + \frac{\varepsilon}{18}\right) \cdot D(p, q) \tag{8.2}
\]
is true for at least $(1 - \sigma/2) \cdot n^2$ pairs of points (p, q) ∈ P × P with probability at least
1 − δ/3. Furthermore, we can upper bound the maximum distance between two embedded
points as follows. Let p and q be any two points from P . We define ν := p − q and
$Y(\nu) := \|\varphi(\nu)\|^2$. Then, as explained in the proof of Theorem 10, the expected value of
Y (ν) is $\mathrm{E}[Y(\nu)] = \|\nu\|^2$, and we can upper bound the variance of Y (ν) by
\[
\mathrm{V}[Y(\nu)] \le \frac{2 \cdot \|\nu\|^4}{d(\varepsilon', \sigma', \delta')} .
\]
Thus, by Chebyshev’s inequality, we get
\[
\Pr\left[\,|Y(\nu) - \mathrm{E}[Y(\nu)]| > n \cdot \|\nu\|^2\,\right]
\le \frac{\mathrm{V}[Y(\nu)]}{n^2 \cdot \|\nu\|^4}
\le \frac{\delta}{3n^2} .
\]
Due to the union bound and $\|\nu\|^2 \le d\Delta^2$, we have that all squared pairwise distances of
the embedded points are at most $O(n \cdot d\Delta^2)$ with probability at least 1 − δ/3. It follows
that the diameter of the embedded point set is $O(\sqrt{dn} \cdot \Delta)$ with probability at least 1 − δ/3.
Next, we apply the snap-rounding technique. More precisely, we impose a square grid on
the target space Rd(ε
′,σ′,δ′), where each cell has side length ε/(18
√
d(ε′, σ′, δ′)), and move
each embedded point to its nearest grid point. Each point is moved by a distance of at
most
ε
18
√
d(ε′, σ′, δ′)
·
√
d(ε′, σ′, δ′)
2
=
ε
36
.
Thus, by moving each point to its nearest grid point, the distance between any two points
is decreased or increased by at most ε/18. Let ϕ′(p) be the position of an embedded and
moved point p ∈ P . Since the minimum pairwise distance from distinct points in P is 1
and Inequality (8.2) is true for at least $(1 - \sigma/2) \cdot n^2$ pairs of points $(p,q) \in P \times P$ with probability at least $1 - \delta/3$, we have that
\[
\left(1 - \frac{\varepsilon}{9}\right) \cdot D(p,q) \le D(\varphi'(p),\varphi'(q)) \le \left(1 + \frac{\varepsilon}{9}\right) \cdot D(p,q)
\]
is true for at least $(1 - \sigma/2) \cdot n^2$ pairs of points $(p,q) \in P \times P$ with probability at least
1 − δ/3. It follows that the embedding ϕ′ has distortion (1 + ε/9)/(1 − ε/9) ≤ (1 + ε/3)
and slack σ/2 with probability at least 1 − δ/3. Furthermore, each point lies on a grid
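with cell size $\varepsilon/(18\sqrt{d(\varepsilon',\sigma',\delta')})$ and the maximum pairwise distance of points is $O(\sqrt{dn}\,\Delta)$ with probability at least $1 - \delta/3$. Hence, by scaling the point space by $18\sqrt{d(\varepsilon',\sigma',\delta')}/\varepsilon$, we get a set of points from a discrete low-dimensional space $\{1,\ldots,\Delta'\}^{d'}$ with spread $\Delta' \in O(\sqrt{dn}\,\Delta/(\varepsilon^2\sqrt{\sigma\delta}))$ and dimension $d' \in O(1/(\varepsilon^2\sigma\delta))$.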
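The snap-rounding step can be illustrated with a minimal Python sketch. The function names, the random test points, and the concrete values of `eps` and `dim` are assumptions made only for this illustration; the translation of the integer coordinates into the range $\{1,\ldots,\Delta'\}$ is omitted.

```python
import itertools
import math
import random


def snap_to_grid(point, side):
    """Move a point to its nearest grid point of a square grid with the given side length."""
    return tuple(side * round(x / side) for x in point)


def to_integer_grid(point, side):
    """Rescale by 1/side so that a snapped point gets integer coordinates."""
    return tuple(round(x / side) for x in point)


# Illustrative values only (not taken from the text).
eps, dim = 0.5, 8
side = eps / (18 * math.sqrt(dim))          # cell side length eps / (18 * sqrt(dim))

rng = random.Random(1)
points = [tuple(rng.uniform(0.0, 50.0) for _ in range(dim)) for _ in range(30)]
snapped = [snap_to_grid(p, side) for p in points]

# Largest displacement of a single point: at most side * sqrt(dim) / 2 = eps / 36.
max_move = max(math.dist(p, s) for p, s in zip(points, snapped))

# Largest change of a pairwise distance: at most twice the displacement, i.e. eps / 18.
max_change = max(
    abs(math.dist(points[i], points[j]) - math.dist(snapped[i], snapped[j]))
    for i, j in itertools.combinations(range(len(points)), 2)
)
```

Rounding each coordinate independently moves it by at most half a cell, which is where the factor $\sqrt{d}/2$ in the displacement bound comes from.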
On the obtained point set, we run our construction from Sections 7.2 and 8.2 with a
precision parameter ε′′ := ε/3, a slack parameter σ′′ := σ/2, and an error probability
parameter δ′′ := δ/3. Then, with a total error probability of δ, the resulting point set P ′
embeds P with distortion (1+ε/3)·(1+ε/3) ≤ (1+ε) and slack σ. It follows from the above
and Theorem 7 that we also have $P' \subset \{1,\ldots,\Delta'\}^{d'}$ with spread $\Delta' \in O(\sqrt{dn}\,\Delta/(\varepsilon^2\sqrt{\sigma\delta}))$ and dimension $d' \in O(1/(\varepsilon^2\sigma\delta))$.
As explained before in the proof of Theorem 9, we have to ensure that $\sigma'' > 2^{2d}/n$
(confer Theorem 7) since we use the construction given in Section 7.2. However, this is
implicitly required by the fact that the space requirement of a streaming algorithm has to
be sublinear in n and the space requirement of our streaming algorithm is ω(1/σ).
Finally, we analyze the complexity of our construction. Due to Theorem 10, each point
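in $P$ can be embedded into the low-dimensional space $\mathbb{R}^{d(\varepsilon',\sigma',\delta')}$ in $O(d \cdot \log^2(d)/(\varepsilon^2\sigma\delta))$ time using $O(\log(d)/(\varepsilon^2\sigma\delta))$ space. Due to Theorems 7 and 9, the construction from Sections 7.2 and 8.2 applied on a set of points with dimension $O(1/(\varepsilon^2\sigma\delta))$ and spread $O(\sqrt{dn}\,\Delta/(\varepsilon^2\sqrt{\sigma\delta}))$ has both an update time and space requirement of
\[
O\left( \left(\frac{1}{\varepsilon\sigma\delta}\right)^{O(1/(\varepsilon^2\sigma\delta))} \cdot \log^4(dn\Delta) \right) .
\]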
The size of the set of representatives follows from Lemma 7.2.7.
8.3 Max-Cut in High Dimensions
In this section, we show how to embed a set of high-dimensional Euclidean points into
a low-dimensional Euclidean space such that the sum of the pairwise distances is well
preserved. Afterwards, we use this result to design a streaming algorithm that implicitly
computes a (1 ± ε)-approximation of the max-cut problem for a dynamic data stream of
high-dimensional Euclidean points.
Let $\varphi : P \to \mathbb{R}^{d(\varepsilon,\delta)}$ be the Johnson-Lindenstrauss embedding where each point is mapped into a Euclidean space with dimension $d(\varepsilon,\delta) \in \Theta(1/(\varepsilon^2\delta^2))$. Then, we will show that, for a pair of points $(p,q) \in P \times P$, the expected value of $|D(\varphi(p),\varphi(q)) - D(p,q)|$ is $\delta\varepsilon \cdot D(p,q)$ and $|D(\varphi(p),\varphi(q)) - D(p,q)|$ is sharply concentrated around its expected value with probability $1 - \delta$. This leads to the following lemma:
Lemma 8.3.1. Let ε, 0 < ε < 1, be a precision parameter, let δ, 0 < δ < 1, be an error
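probability parameter, and let $d(\varepsilon,\delta) := 50/(\varepsilon^2\delta^2)$ be a function dependent on $\varepsilon$ and $\delta$. Given a set $P$ of $n$ points in $\mathbb{R}^d$, there exists an embedding $\varphi : P \to \mathbb{R}^{d(\varepsilon,\delta)}$ such that
\[
\sum_{(p,q) \in P \times P} |D(\varphi(p),\varphi(q)) - D(p,q)| \le \varepsilon \cdot \sum_{(p,q) \in P \times P} D(p,q)
\]
is true with probability at least $1 - \delta$. Each point can be embedded in $O(d \cdot \log^2(d)/(\varepsilon^2\delta^2))$ time using $O(\log(d)/(\varepsilon^2\delta^2))$ space.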
Proof. For each point $p \in P$ and each coordinate $i \in \{1,\ldots,d(\varepsilon,\delta)\}$, we compute a random variable $Y_i(p)$ as explained in the proof of Theorem 10. We define the embedding $\varphi$ for the point $p$ by
\[
\varphi(p) := \frac{1}{\sqrt{d(\varepsilon,\delta)}} \cdot \bigl(Y_1(p),\ldots,Y_{d(\varepsilon,\delta)}(p)\bigr)^T .
\]
Following the construction in the proof of Theorem 10, each point can be embedded using a space of $O(\log(d)/(\varepsilon^2\delta^2))$ and by performing $O(d/(\varepsilon^2\delta^2))$ arithmetic and finite field operations on elements of $O(\log(d))$ bits. Furthermore, since $\varphi$ is a linear function, we have $\varphi(p - q) = \varphi(p) - \varphi(q)$ for all pairs $(p,q) \in \mathbb{R}^d \times \mathbb{R}^d$.
Now, let $p$ and $q$ be any two points in $\mathbb{R}^d$. We define $\nu := p - q$ and $Y(\nu) := \|\varphi(\nu)\|^2$ to be the random variable for the squared length of $\varphi(\nu)$. Then, as explained in the proof of Theorem 10, the expected value of $Y(\nu)$ is $E[Y(\nu)] = \|\nu\|^2$, and we can upper bound the variance of $Y(\nu)$ by
\[
V[Y(\nu)] \le \frac{2 \cdot \|\nu\|^4}{d(\varepsilon,\delta)} .
\]
Let $\mathrm{err}(p,q)$ be the error that occurs due to the estimation of $\|\nu\| = \|p - q\|$, i.e.,
\[
\mathrm{err}(p,q) := \left| \sqrt{Y(\nu)} - \|\nu\| \right| .
\]
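The expected value of $\mathrm{err}(p,q)$ is given by
\begin{align*}
E[\mathrm{err}(p,q)]
&\le \frac{\varepsilon\delta}{5} \cdot \|\nu\| + \sum_{i=0}^{\infty} \Pr\left[ \frac{\varepsilon\delta}{5} \cdot 2^i \cdot \|\nu\| < \left| \sqrt{Y(\nu)} - \|\nu\| \right| \le \frac{\varepsilon\delta}{5} \cdot 2^{i+1} \cdot \|\nu\| \right] \cdot \frac{\varepsilon\delta}{5} \cdot 2^{i+1} \cdot \|\nu\| \\
&\le \frac{\varepsilon\delta}{5} \cdot \|\nu\| + \sum_{i=0}^{\infty} \Pr\left[ \left| \sqrt{Y(\nu)} - \|\nu\| \right| > \frac{\varepsilon\delta}{5} \cdot 2^i \cdot \|\nu\| \right] \cdot \frac{\varepsilon\delta}{5} \cdot 2^{i+1} \cdot \|\nu\| .
\end{align*}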
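It follows that, in order to upper bound the expected value of $\mathrm{err}(p,q)$, we have to upper bound the probability that $\mathrm{err}(p,q) > \varepsilon\delta/5 \cdot 2^i \cdot \|\nu\|$ for each $i \in \mathbb{N}_0$. Let $\ell$ be any fixed power of 2. Then, for $0 \le \varepsilon\delta\ell/5 \le 1$, we get
\begin{align*}
\Pr\left[ \left| \sqrt{Y(\nu)} - \|\nu\| \right| > \frac{\varepsilon\delta\ell}{5} \cdot \|\nu\| \right]
&= \Pr\left[ \sqrt{Y(\nu)} < \left(1 - \frac{\varepsilon\delta\ell}{5}\right) \cdot \|\nu\| \ \text{or}\ \sqrt{Y(\nu)} > \left(1 + \frac{\varepsilon\delta\ell}{5}\right) \cdot \|\nu\| \right] \\
&= \Pr\left[ Y(\nu) < \left(1 - \frac{\varepsilon\delta\ell}{5}\right)^2 \cdot \|\nu\|^2 \ \text{or}\ Y(\nu) > \left(1 + \frac{\varepsilon\delta\ell}{5}\right)^2 \cdot \|\nu\|^2 \right] \\
&\le \Pr\left[ Y(\nu) < \left(1 - \frac{\varepsilon\delta\ell}{5}\right) \cdot \|\nu\|^2 \ \text{or}\ Y(\nu) > \left(1 + \frac{\varepsilon\delta\ell}{5}\right) \cdot \|\nu\|^2 \right] \\
&= \Pr\left[ \left| Y(\nu) - \|\nu\|^2 \right| > \frac{\varepsilon\delta\ell}{5} \cdot \|\nu\|^2 \right] .
\end{align*}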
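Similarly, for $\varepsilon\delta\ell/5 > 1$, we get
\begin{align*}
\Pr\left[ \left| \sqrt{Y(\nu)} - \|\nu\| \right| > \frac{\varepsilon\delta\ell}{5} \cdot \|\nu\| \right]
&= \Pr\left[ \sqrt{Y(\nu)} > \left(1 + \frac{\varepsilon\delta\ell}{5}\right) \cdot \|\nu\| \right] \\
&= \Pr\left[ Y(\nu) > \left(1 + \frac{\varepsilon\delta\ell}{5}\right)^2 \cdot \|\nu\|^2 \right] \\
&\le \Pr\left[ Y(\nu) > \left(1 + \frac{\varepsilon\delta\ell}{5}\right) \cdot \|\nu\|^2 \right] \\
&= \Pr\left[ Y(\nu) - \|\nu\|^2 > \frac{\varepsilon\delta\ell}{5} \cdot \|\nu\|^2 \right] \\
&\le \Pr\left[ \left| Y(\nu) - \|\nu\|^2 \right| > \frac{\varepsilon\delta\ell}{5} \cdot \|\nu\|^2 \right] ,
\end{align*}
where the first equality follows from the fact that the case $\sqrt{Y(\nu)} < (1 - \varepsilon\delta\ell/5) \cdot \|\nu\| < 0$ cannot occur. Thus, for any value $\varepsilon\delta\ell/5 \in \mathbb{R}$, we have
\[
\Pr\left[ \left| \sqrt{Y(\nu)} - \|\nu\| \right| > \frac{\varepsilon\delta\ell}{5} \cdot \|\nu\| \right] \le \Pr\left[ \left| Y(\nu) - \|\nu\|^2 \right| > \frac{\varepsilon\delta\ell}{5} \cdot \|\nu\|^2 \right] .
\]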
By Chebyshev’s inequality, we can upper bound this probability by
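\[
\Pr\left[ \left| Y(\nu) - \|\nu\|^2 \right| > \frac{\varepsilon\delta\ell}{5} \cdot \|\nu\|^2 \right] \le \frac{25 \cdot V[Y(\nu)]}{\varepsilon^2\delta^2\ell^2\|\nu\|^4} \le \frac{50}{d(\varepsilon,\delta) \cdot \varepsilon^2\delta^2\ell^2} = \frac{1}{\ell^2} .
\]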
Now, the expected value of err(p, q) can be upper bounded by
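\begin{align*}
E[\mathrm{err}(p,q)]
&\le \frac{\varepsilon\delta}{5} \cdot \|\nu\| + \sum_{i=0}^{\infty} \Pr\left[ \left| \sqrt{Y(\nu)} - \|\nu\| \right| > \frac{\varepsilon\delta}{5} \cdot 2^i \cdot \|\nu\| \right] \cdot \frac{\varepsilon\delta}{5} \cdot 2^{i+1} \cdot \|\nu\| \\
&\le \frac{\varepsilon\delta}{5} \cdot \|\nu\| + \sum_{i=0}^{\infty} \frac{1}{2^{2i}} \cdot \frac{\varepsilon\delta}{5} \cdot 2^{i+1} \cdot \|\nu\| \\
&= \frac{\varepsilon\delta}{5} \cdot \|\nu\| + 2 \cdot \sum_{i=0}^{\infty} \frac{1}{2^i} \cdot \frac{\varepsilon\delta}{5} \cdot \|\nu\| \\
&\le \varepsilon\delta\,\|\nu\| .
\end{align*}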
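For reference, the last two steps of this bound can be checked by evaluating the geometric series explicitly. The following worked calculation only restates the bound on $E[\mathrm{err}(p,q)]$ and adds no new assumption:

```latex
\[
\sum_{i=0}^{\infty} \frac{1}{2^{2i}} \cdot 2^{i+1}
  = 2 \cdot \sum_{i=0}^{\infty} \frac{1}{2^{i}}
  = 4 ,
\qquad\text{so}\qquad
\frac{\varepsilon\delta}{5} \cdot \|\nu\| + 4 \cdot \frac{\varepsilon\delta}{5} \cdot \|\nu\|
  = \varepsilon\delta\,\|\nu\| .
\]
```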
Due to Markov’s inequality, it follows that
\[
\Pr\left[ \sum_{p \in P} \sum_{q \in P} \mathrm{err}(p,q) \ge \frac{1}{\delta} \cdot E\left[ \sum_{p \in P} \sum_{q \in P} \mathrm{err}(p,q) \right] \right] \le \delta .
\]
Due to linearity of expectation, we have
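\begin{align*}
\Pr\left[ \sum_{p \in P} \sum_{q \in P} \mathrm{err}(p,q) \le \varepsilon \cdot \sum_{p \in P} \sum_{q \in P} \|p - q\| \right]
&\ge \Pr\left[ \sum_{p \in P} \sum_{q \in P} \mathrm{err}(p,q) \le \frac{1}{\delta} \cdot \sum_{p \in P} \sum_{q \in P} E[\mathrm{err}(p,q)] \right] \\
&= \Pr\left[ \sum_{p \in P} \sum_{q \in P} \mathrm{err}(p,q) \le \frac{1}{\delta} \cdot E\left[ \sum_{p \in P} \sum_{q \in P} \mathrm{err}(p,q) \right] \right] \\
&\ge 1 - \delta .
\end{align*}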
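The statement of Lemma 8.3.1 can be explored numerically with a small Python sketch. Note the assumption: the sketch uses a dense matrix of independent Gaussian entries as a stand-in for the random variables $Y_i(p)$, which is a swapped-in textbook variant and not the space-efficient construction from the proof of Theorem 10. All function names and parameter values are chosen for illustration only.

```python
import math
import random


def target_dimension(eps, delta):
    """d(eps, delta) := 50 / (eps^2 * delta^2) from Lemma 8.3.1, rounded up."""
    return math.ceil(50.0 / (eps * eps * delta * delta))


def gaussian_projection(dim, k, rng):
    """k x dim matrix with i.i.d. N(0, 1) entries (assumed stand-in for Theorem 10)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]


def embed(point, matrix):
    """phi(p) := (1 / sqrt(k)) * (Y_1(p), ..., Y_k(p)) with Y_i(p) = <row_i, p>."""
    scale = 1.0 / math.sqrt(len(matrix))
    return [scale * sum(r * x for r, x in zip(row, point)) for row in matrix]


def total_error_and_distance(points, images):
    """Sum |D(phi(p), phi(q)) - D(p, q)| and D(p, q) over all ordered pairs (p, q)."""
    err = dist = 0.0
    for p, fp in zip(points, images):
        for q, fq in zip(points, images):
            d = math.dist(p, q)
            err += abs(math.dist(fp, fq) - d)
            dist += d
    return err, dist


rng = random.Random(0)
eps, delta, dim, n = 0.5, 0.5, 30, 10          # illustrative values
k = target_dimension(eps, delta)
matrix = gaussian_projection(dim, k, rng)
points = [[rng.uniform(0.0, 10.0) for _ in range(dim)] for _ in range(n)]
images = [embed(p, matrix) for p in points]
err, dist = total_error_and_distance(points, images)
```

Because the same matrix is applied to every point, the map is linear, which is the property used in the proof to reduce the analysis to a single difference vector $\nu = p - q$.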
Given any Euclidean point set P , the embedding described above is useful for all geo-
metric problems that satisfy the following four properties:
(i) The cost of an optimal solution for P is a function whose set of input parameters is
a subset of all pairwise distances of P .
(ii) The cost of an optimal solution for $P$ is at least $\sum_{p \in P} \sum_{q \in P} 1/c \cdot D(p,q)$, where $c \ge 1$ is any small constant.
(iii) If the distance D(p, q) between any two points p, q ∈ P is increased or decreased by
any value α > 0, the cost of an optimal solution for P is increased or decreased by
at most O(α).
(iv) The complexity of all known (1±ε)-approximation algorithms depends exponentially
on the dimension of P .
To handle these problems, we first embed the input points and afterwards apply any
efficient (1± ε)-approximation algorithm on the embedded points.
One suitable problem is the max-cut problem in the dynamic geometric data stream
model.
Definition 8.3.2 (Euclidean Max-Cut Problem). For a set $P \subset \mathbb{R}^d$, the Euclidean max-cut problem is to find a partition of $P$ into two subsets $C_1$ and $C_2$ such that the sum
\[
\mathrm{Cut}(P, C_1, C_2) := \sum_{(p,q) \in C_1 \times C_2} D(p,q)
\]
of inter-cluster distances is maximized.
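To make the definition concrete, here is a brute-force Python sketch that enumerates all bipartitions of a tiny point set. It is exponential in the number of points and is meant only as an illustration of the objective, not as one of the algorithms discussed in this chapter. The function names and the random test points are assumptions of this sketch.

```python
import itertools
import math
import random


def cut_value(points, side):
    """Cut(P, C1, C2): sum of D(p, q) over p in C1 and q in C2, where side[i] is 0 or 1."""
    return sum(
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(len(points))
        if side[i] == 0 and side[j] == 1
    )


def brute_force_max_cut(points):
    """Try all 2^(n-1) bipartitions (point 0 is fixed in C1); feasible only for tiny n."""
    n = len(points)
    best_value, best_side = 0.0, (0,) * n
    for rest in itertools.product((0, 1), repeat=n - 1):
        side = (0,) + rest
        value = cut_value(points, side)
        if value > best_value:
            best_value, best_side = value, side
    return best_value, best_side


rng = random.Random(3)
points = [(rng.randint(1, 20), rng.randint(1, 20), rng.randint(1, 20)) for _ in range(8)]
max_cut, side = brute_force_max_cut(points)

# Sum of D(p, q) over all ordered pairs in P x P, as used in Property (ii).
pair_sum = sum(math.dist(p, q) for p in points for q in points)
```

On such small instances one can also observe the lower bound of Property (ii) with $c = 4$: a uniformly random bipartition cuts each unordered pair with probability 1/2, so the maximum cut is at least a quarter of the sum over all ordered pairs.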
Obviously, the max-cut problem satisfies Properties (i) and (iii). Furthermore, it is
shown in [44] that Property (ii) is satisfied for c = 4. Concerning Property (iv), the
authors of [44] gave an efficient (1 ± ε)-approximation for the max-cut problem in low-
dimensions that has the following properties:
Lemma 8.3.3 ([44]). Let $\varepsilon$, $0 < \varepsilon < 1$, be a precision parameter. Given a stream of $m$ Insert and Delete operations of points from a discrete Euclidean space $\{1,\ldots,\Delta\}^d$, where $d$ is a constant, there exists a streaming algorithm that computes with probability at least 2/3, for the current point set $P$ with cardinality $n$, a data structure of size $O(\log^3(\Delta m) \cdot \log^4(\Delta)/\varepsilon^{2d+4})$ from which an implicit $(1 \pm \varepsilon)$-approximate solution for the max-cut problem can be extracted in $\mathrm{poly}(\exp(1/\varepsilon)^{O(1)}, (1/\varepsilon)^d, \log(\Delta), \log(n), \log(m))$ time. An update can be processed in $O(\log^2(\Delta) \cdot \log(\Delta m))$ time.
By combining the embedding given in Lemma 8.3.1 with the approximation algorithm
presented in [44], we can implicitly compute a (1 ± ε)-approximation for the max-cut
problem on dynamic geometric data streams of high-dimensional points.
Theorem 12. Let $\varepsilon$, $0 < \varepsilon < 1$, be a precision parameter. Given a stream of $m$ Insert and Delete operations of points from a discrete high-dimensional Euclidean space $\{1,\ldots,\Delta\}^d$, there is a randomized streaming algorithm that has a space requirement of $O(\log^7(d\Delta mn)/\varepsilon^{O(1/\varepsilon^2)})$ and computes with probability at least 5/8, for the current point set $P$ of size $n$, a data structure from which an implicit $(1 \pm \varepsilon)$-approximation for the max-cut problem can be extracted in $\mathrm{poly}(\exp(1/\varepsilon)^{O(1)}, (1/\varepsilon)^{1/\varepsilon^2}, \log(d), \log(\Delta), \log(n), \log(m))$ time. An update requires $O(d \cdot \log^2(d)/\varepsilon^2 + \log^3(d\Delta nm/\varepsilon))$ time.
Proof. We proceed in a similar way as we have done in the proof of Theorem 11. At
first, we embed the discrete high-dimensional Euclidean point set P into a low-dimensional
Euclidean space. This embedding induces a small multiplicative error on the cost of a max-
imum cut. Then, we apply the snap-rounding technique, i.e., we impose an appropriately
fine grid on the target space and move each embedded point to its nearest grid point. This
movement of the points induces an additive error, which can be charged against a lower
bound on the cost of a maximum cut for P to get a small multiplicative error. Finally, by
applying the techniques described in [44] on the embedded and moved points, we obtain
the results stated in the theorem. Next, we explain our construction in more detail.
In the first step, we apply the embedding ϕ : P → P ′ given in Lemma 8.3.1 with precision
parameter ε′ := ε/16 and error probability parameter δ′ := 1/24 on P . Then, we have that
\[
\sum_{(p,q) \in P \times P} |D(\varphi(p),\varphi(q)) - D(p,q)| \le \varepsilon' \cdot \sum_{(p,q) \in P \times P} D(p,q)
\]
is true with probability at least 1 − δ′. Since Property (ii) (on page 160) is satisfied for
$c = 4$ [44], we have $\mathrm{MaxCut}(P) \ge 1/4 \cdot \sum_{(p,q) \in P \times P} D(p,q)$. Due to the fact that each cut of $P$ is a subset of $(p,q) \in P \times P$, we obtain
\begin{align*}
\left| \sum_{(p,q) \in C_1 \times C_2} D(\varphi(p),\varphi(q)) - \sum_{(p,q) \in C_1 \times C_2} D(p,q) \right|
&\le \sum_{(p,q) \in C_1 \times C_2} |D(\varphi(p),\varphi(q)) - D(p,q)| \\
&\le \sum_{(p,q) \in P \times P} |D(\varphi(p),\varphi(q)) - D(p,q)| \\
&\le \varepsilon' \cdot \sum_{(p,q) \in P \times P} D(p,q) \\
&\le 4\varepsilon' \cdot \mathrm{MaxCut}(P) \\
&= \frac{\varepsilon}{4} \cdot \mathrm{MaxCut}(P)
\end{align*}
for all cuts $(C_1, C_2)$ of $P$ with probability 23/24. Let $(C_1', C_2')$ be a maximum cut of $P$, and let $(C_1'', C_2'')$ be any cut of $P$ such that the embedded point sets of $C_1''$ and $C_2''$ build a maximum cut of $P'$. It follows from the above that
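\begin{align*}
\sum_{(p,q) \in C_1'' \times C_2''} D(\varphi(p),\varphi(q))
&\ge \sum_{(p,q) \in C_1' \times C_2'} D(\varphi(p),\varphi(q)) \\
&\ge \sum_{(p,q) \in C_1' \times C_2'} D(p,q) - \frac{\varepsilon}{4} \cdot \mathrm{MaxCut}(P) \\
&= \left(1 - \frac{\varepsilon}{4}\right) \cdot \mathrm{MaxCut}(P)
\end{align*}
and
\begin{align*}
\sum_{(p,q) \in C_1'' \times C_2''} D(\varphi(p),\varphi(q))
&\le \sum_{(p,q) \in C_1'' \times C_2''} D(p,q) + \frac{\varepsilon}{4} \cdot \mathrm{MaxCut}(P) \\
&\le \left(1 + \frac{\varepsilon}{4}\right) \cdot \mathrm{MaxCut}(P) .
\end{align*}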
Thus, we have
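\[
\left(1 - \frac{\varepsilon}{4}\right) \cdot \mathrm{MaxCut}(P) \le \mathrm{MaxCut}(P') \le \left(1 + \frac{\varepsilon}{4}\right) \cdot \mathrm{MaxCut}(P) \tag{8.3}
\]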
with probability at least 23/24.
In the second step, we apply the snap-rounding technique. We impose a square grid
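on the target space $\mathbb{R}^{d(\varepsilon',\delta')}$ with $d(\varepsilon',\delta') \in \Theta(1/(\varepsilon^2\delta^2))$, where each cell has side length $\varepsilon/(16 \cdot \sqrt{d(\varepsilon',\delta')})$, and move each point in $P'$ to its nearest grid point. Let $P''$ be the set of points that we obtain after moving the points in $P'$. Each point is moved by a distance of at most
\[
\frac{\varepsilon}{16 \cdot \sqrt{d(\varepsilon',\delta')}} \cdot \frac{\sqrt{d(\varepsilon',\delta')}}{2} = \frac{\varepsilon}{32} .
\]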
Thus, the movement of the points induces an additive error of at most $\varepsilon n^2/16$ on the sum of the pairwise distances. Since Property (ii) (on page 160) is satisfied for $c = 4$ [44] and the minimum pairwise distance of $P$ is 1, a lower bound on the cost of a maximum cut for $P$ is $n^2/4$. Hence, we have $\varepsilon n^2/16 \le \varepsilon/4 \cdot \mathrm{MaxCut}(P)$. Due to Inequality (8.3), we get
\[
\left(1 - \frac{\varepsilon}{2}\right) \cdot \mathrm{MaxCut}(P) \le \mathrm{MaxCut}(P'') \le \left(1 + \frac{\varepsilon}{2}\right) \cdot \mathrm{MaxCut}(P)
\]
with probability at least 1 − 1/24. Besides, we can upper bound the diameter of P ′′ as
follows. Since the maximum pairwise distance of $P$ is $\sqrt{d}\,\Delta$, the value $n^2 \cdot \sqrt{d}\,\Delta$ is an upper bound on the cost of a maximum cut for $P$. Since the diameter of a point set is a lower bound on the cost of a maximum cut of the point set, we get
\[
\mathrm{diam}(P') \le \mathrm{MaxCut}(P') \le \left(1 + \frac{\varepsilon}{4}\right) \cdot \mathrm{MaxCut}(P) \le \left(1 + \frac{\varepsilon}{4}\right) \cdot n^2 \cdot \sqrt{d}\,\Delta ,
\]
where the second inequality follows from Inequality (8.3). As a result, the diameter of P ′′
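is $O(\sqrt{d}\,\Delta n^2)$. Furthermore, each point in $P''$ lies on a grid with cell size $\varepsilon/(16 \cdot \sqrt{d(\varepsilon',\delta')})$. Thus, by scaling the point space by $16 \cdot \sqrt{d(\varepsilon',\delta')}/\varepsilon$, we get a set of points from a discrete low-dimensional space $\{1,\ldots,\Delta'\}^{d'}$ with $\Delta' \in O(\sqrt{d}\,\Delta n^2/\varepsilon^2)$ and $d' \in O(1/\varepsilon^2)$.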
On the scaled point set, we run the approximation algorithm of [44] with precision
parameter ε′′ := ε/3. Due to Lemma 8.3.3 and our calculations above, with probability at
least 23/24− 1/3 = 5/8, we can compute a point set P ′′′ such that
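\[
\left(1 - \frac{\varepsilon}{2}\right)\left(1 - \frac{\varepsilon}{3}\right) \cdot \mathrm{MaxCut}(P) \le \frac{\varepsilon \cdot \mathrm{MaxCut}(P''')}{16 \cdot \sqrt{d(\varepsilon',\delta')}} \le \left(1 + \frac{\varepsilon}{2}\right)\left(1 + \frac{\varepsilon}{3}\right) \cdot \mathrm{MaxCut}(P) .
\]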
Since (1 − ε/2)(1 − ε/3) ≥ (1 − ε) and (1 + ε/2)(1 + ε/3) ≤ (1 + ε), after rescaling, our
construction computes an implicit (1 ± ε)-approximate solution for the max-cut problem
with probability at least 5/8.
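For completeness, the two product bounds used in the last step can be verified by expanding the products; the calculation below relies only on $0 < \varepsilon < 1$:

```latex
\[
\left(1 - \frac{\varepsilon}{2}\right)\left(1 - \frac{\varepsilon}{3}\right)
  = 1 - \frac{5\varepsilon}{6} + \frac{\varepsilon^2}{6}
  \ge 1 - \varepsilon ,
\qquad
\left(1 + \frac{\varepsilon}{2}\right)\left(1 + \frac{\varepsilon}{3}\right)
  = 1 + \frac{5\varepsilon}{6} + \frac{\varepsilon^2}{6}
  \le 1 + \varepsilon ,
\]
```

where the second bound uses $\varepsilon^2 \le \varepsilon$.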
Note that our construction works in the streaming model, where the first two steps are
used to transform a stream of high-dimensional points into a stream of low-dimensional
points. Due to Lemma 8.3.1, the transformation of one high-dimensional input point requires $O(\log(d)/\varepsilon^2)$ space and $O(d \cdot \log^2(d)/\varepsilon^2)$ time. Finally, since we apply the approximation algorithm of [44] on a stream of points with dimension $O(1/\varepsilon^2)$ and spread $O(\sqrt{d}\,\Delta n^2/\varepsilon^2)$, the complexity of our construction is as claimed in the theorem.
8.4 Embedding Doubling Metric Spaces
In this section, we show how to compute a low-distortion embedding with low slack for an
n-point doubling metric space M = (X,D) with bounded dimension λ given as a stream
of points in the insertion-only data stream model. We assume that the minimum pairwise
distance between two points in X is at least 1, and the maximum pairwise distance is at
most ∆. Furthermore, we assume access to a distance oracle that, given any two points
from X, can compute in constant time the distance between these two points.
The idea of our streaming algorithm is based on the results obtained in Section 7.3. Recall
that our WSPD construction from Section 7.3 works as follows. It computes the uniform cut decompositions $\mathcal{G}^{(0)},\ldots,\mathcal{G}^{(\lceil\log(\Delta)\rceil)}$ and decomposes each heavy ball in the uniform cut decompositions into a set of mini balls. Then, the weighted centers of these mini balls
are the representatives of the WSPD. The idea of our streaming algorithm is to take two
sample sets from the input stream. The first sample set is our set of representatives and is
supposed to hit every mini ball that contains more than a certain threshold of points with
high probability. The second sample set is supposed to approximate the weight of the mini
ball centers. Next, we explain our streaming algorithm in more detail (see Algorithm 8.4.1
for a description in pseudocode).
We take two sample sets from the input stream denoted by $R$ and $S$. For that purpose, each input point is chosen at random with probability
\[
\Pr[\text{point is taken into } R] := \frac{((\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5}) \cdot \ln(n/\delta)}{\varepsilon^{\lambda}\sigma^2 \cdot n}
\]
to be a sample point in $R$, where $\delta$ is the error probability of the algorithm. Similarly, each input point is taken at random with probability
\[
\Pr[\text{point is taken into } S] := \frac{6 \cdot (\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5} \cdot \ln(n/\delta)}{\varepsilon^{\lambda}\sigma^4 \cdot n}
\]
into the sample set $S$. Let $R := \{y_0,\ldots,y_{k-1}\}$ and $S := \{s_0,\ldots,s_{\ell-1}\}$ be the sample sets after having read the whole input stream. Then, $R$ is our set of representatives for $X$, and $S$ determines the weight of the representatives in $R$. More precisely, each point in $S$ is assigned to the nearest representative in $R$. For each representative $y_i \in R$, let $c_i$ be the total number of points in $S$ that have been assigned to $y_i$. Then, if $c_i > 0$, we set the weight of $y_i$ to $\lceil c_i \cdot n/|S| \rceil$. Otherwise, we remove $y_i$ from the set of representatives. To
avoid that the total weight of the representatives is larger than n, we sum up all weights
and decrease the weight of some arbitrary representatives by the required amount. The
set of all weighted representatives is our compact representation for M ′.
Slack Induced by the Sample Step
Let $\mathcal{M} := \{B_0,\ldots,B_{m-1}\}$ be the set of mini balls that we would obtain by running the
WSPD construction from Section 7.3 on the n-point metric space M = (X,D). Then, we
show that, with high probability, there is at least one sample point from R in each mini
ball that contains at least a certain fraction of points from X. Furthermore, for each ball
in the same set of mini balls, we show that, with high probability, the number of points
inside the ball can be (1 ± σ)-approximated by S. Finally, we prove that the remaining
mini balls contain only a few points from X as well as from the sample set S.
Algorithm 8.4.1 EmbedDoublingMetric(n, ∆, ε, σ, δ)
 1: initialize empty point sets R and S
 2: i ← 0
 3: for each point x in the stream do
 4:     flip a coin that shows head with probability Pr[point is taken into R]
 5:     if coin shows head then
 6:         y_i ← x
 7:         R ← R ∪ {y_i}
 8:         initialize counter c_i with 0
 9:         i ← i + 1
10:     flip a coin that shows head with probability Pr[point is taken into S]
11:     if coin shows head then
12:         S ← S ∪ {x}
13: for each point x ∈ S do
14:     compute nearest neighbor y_i in R
15:     increment counter c_i by 1
16: for each point y_i ∈ R do
17:     set weight of y_i to ⌈c_i · n/|S|⌉
18: return R
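The pseudocode above can be turned into a short Python sketch. The following version is a minimal illustration under stated assumptions: the two sampling probabilities are passed in as plain parameters `prob_r` and `prob_s` instead of being computed from the formulas, `dist` plays the role of the distance oracle, representatives with counter 0 are dropped as described in the text, and the final adjustment of the total weight is left out. All names are hypothetical.

```python
import random


def embed_doubling_metric(stream, n, prob_r, prob_s, dist, rng):
    """Sketch of Algorithm 8.4.1; returns a list of (representative, weight) pairs."""
    reps, sample = [], []
    for x in stream:
        if rng.random() < prob_r:      # coin for the set of representatives R
            reps.append(x)
        if rng.random() < prob_s:      # independent coin for the weight sample S
            sample.append(x)
    if not reps or not sample:
        return []                      # assumption: degenerate case, nothing to report
    counters = [0] * len(reps)
    for x in sample:                   # assign each sample point to its nearest representative
        nearest = min(range(len(reps)), key=lambda i: dist(x, reps[i]))
        counters[nearest] += 1
    weighted = []
    for y, c in zip(reps, counters):
        if c > 0:                      # representatives with counter 0 are removed
            weight = -(-c * n // len(sample))   # integer ceiling of c * n / |S|
            weighted.append((y, weight))
    return weighted


# Usage on a toy one-dimensional stream with the absolute difference as metric.
toy_stream = [0.0, 1.0, 5.0]
result = embed_doubling_metric(
    toy_stream, n=len(toy_stream), prob_r=1.0, prob_s=1.0,
    dist=lambda a, b: abs(a - b), rng=random.Random(0),
)
```

With both probabilities set to 1, every point is its own representative with weight 1, which is a convenient sanity check of the assignment and rounding logic.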
In some proofs, we need to know good estimators for the number of points in the sample
sets R and S. For this reason, we first give appropriate lower and upper bounds on the
sizes of R and S.
Lemma 8.4.1. If each point in X is taken with probability
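\[
\Pr[\text{point is taken into } R] := \frac{((\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5}) \cdot \ln(n/\delta)}{\varepsilon^{\lambda}\sigma^2 \cdot n}
\]
into the set $R$, then we have
\[
\frac{((\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+4}) \cdot \ln(n/\delta)}{\varepsilon^{\lambda}\sigma^2} < |R| < \frac{3 \cdot ((\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+4}) \cdot \ln(n/\delta)}{\varepsilon^{\lambda}\sigma^2}
\]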
with probability 1− δ/n.
Proof. Let $Y_i$ be the indicator random variable for the event that the $i$-th point in $X$ is taken into the sample set $R$. We have
\[
E[Y_i] = \frac{((\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5}) \cdot \ln(n/\delta)}{\varepsilon^{\lambda}\sigma^2 \cdot n} .
\]
By a Chernoff bound and linearity of expectation, we get
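\begin{align*}
\Pr\left[ \sum_{i=1}^{|X|} Y_i \ge \left(1 + \frac{1}{2}\right) \cdot E\left[ \sum_{i=1}^{|X|} Y_i \right] \right]
&\le \exp\left( -\frac{n \cdot E[Y_i]}{12} \right) \\
&\le \exp\left( -\frac{((\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+3}) \cdot \ln(n/\delta)}{3 \cdot \varepsilon^{\lambda}\sigma^2} \right) \\
&\le \frac{\delta}{2n}
\end{align*}
and
\begin{align*}
\Pr\left[ \sum_{i=1}^{|X|} Y_i \le \left(1 - \frac{1}{2}\right) \cdot E\left[ \sum_{i=1}^{|X|} Y_i \right] \right]
&\le \exp\left( -\frac{n \cdot E[Y_i]}{8} \right) \\
&\le \exp\left( -\frac{((\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+2}) \cdot \ln(n/\delta)}{\varepsilon^{\lambda}\sigma^2} \right) \\
&\le \frac{\delta}{2n} .
\end{align*}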
Thus, we have
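\[
\left(1 - \frac{1}{2}\right) \cdot E\left[ \sum_{i=1}^{|X|} Y_i \right] < \sum_{i=1}^{|X|} Y_i < \left(1 + \frac{1}{2}\right) \cdot E\left[ \sum_{i=1}^{|X|} Y_i \right]
\]
with probability at least $1 - \delta/n$. Now, the assertion follows from $\sum_{i=1}^{|X|} Y_i = |R|$ and $E\bigl[\sum_{i=1}^{|X|} Y_i\bigr] = n \cdot E[Y_i]$.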
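As a numerical illustration of Lemma 8.4.1, the following Python sketch evaluates the sampling probability and the resulting size bounds for one set of parameters. The function names and the parameter values are assumptions for this example, and the logarithm of the spread is taken to base 2 here, which the text does not state explicitly.

```python
import math


def sampling_probability_r(n, spread, eps, sigma, delta, lam):
    """Pr[point is taken into R] from Lemma 8.4.1 (log of the spread assumed to be base 2)."""
    levels = math.ceil(math.log2(spread)) + 1
    numerator = levels ** 2 * 2 ** (6 * lam + 5) * math.log(n / delta)
    return numerator / (eps ** lam * sigma ** 2 * n)


def size_bounds_r(n, spread, eps, sigma, delta, lam):
    """Lower and upper bound on |R|: (1 - 1/2) and (1 + 1/2) times the expected size."""
    expected = n * sampling_probability_r(n, spread, eps, sigma, delta, lam)
    return 0.5 * expected, 1.5 * expected


# Illustrative parameters; n must be large for the probability to be below 1.
params = dict(n=10 ** 12, spread=16, eps=0.5, sigma=0.5, delta=0.1, lam=1)
prob = sampling_probability_r(**params)
lower, upper = size_bounds_r(**params)
```

The example also shows why the construction targets huge inputs: the constants are large, so the sampling probability only drops below 1 once $n$ is far larger than the expected sample size.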
Lemma 8.4.2. If each point in X is taken with probability
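\[
\Pr[\text{point is taken into } S] := \frac{6 \cdot (\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5} \cdot \ln(n/\delta)}{\varepsilon^{\lambda}\sigma^4 \cdot n}
\]
into the set $S$, then we have
\[
\frac{3 \cdot (\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5} \cdot \ln(n/\delta)}{\varepsilon^{\lambda}\sigma^4} < |S| < \frac{9 \cdot (\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5} \cdot \ln(n/\delta)}{\varepsilon^{\lambda}\sigma^4}
\]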
with probability 1− δ/n.
Proof. The proof runs through with the same approach as used in the proof of Lemma 8.4.1.
Let Yi be the indicator random variable for the event that the i-th point in X is taken into
the sample set S. We have
\[
E[Y_i] = \frac{6 \cdot (\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5} \cdot \ln(n/\delta)}{\varepsilon^{\lambda}\sigma^4 \cdot n} .
\]
By a Chernoff bound and linearity of expectation, we obtain
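\begin{align*}
\Pr\left[ \sum_{i=1}^{|X|} Y_i \ge \left(1 + \frac{1}{2}\right) \cdot E\left[ \sum_{i=1}^{|X|} Y_i \right] \right]
&\le \exp\left( -\frac{n \cdot E[Y_i]}{12} \right) \\
&\le \exp\left( -\frac{(\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+4} \cdot \ln(n/\delta)}{\varepsilon^{\lambda}\sigma^4} \right) \\
&\le \frac{\delta}{2n}
\end{align*}
and
\begin{align*}
\Pr\left[ \sum_{i=1}^{|X|} Y_i \le \left(1 - \frac{1}{2}\right) \cdot E\left[ \sum_{i=1}^{|X|} Y_i \right] \right]
&\le \exp\left( -\frac{n \cdot E[Y_i]}{8} \right) \\
&\le \exp\left( -\frac{3 \cdot (\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+3} \cdot \ln(n/\delta)}{\varepsilon^{\lambda}\sigma^4} \right) \\
&\le \frac{\delta}{2n} .
\end{align*}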
Hence, we get
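\[
\left(1 - \frac{1}{2}\right) \cdot E\left[ \sum_{i=1}^{|X|} Y_i \right] < \sum_{i=1}^{|X|} Y_i < \left(1 + \frac{1}{2}\right) \cdot E\left[ \sum_{i=1}^{|X|} Y_i \right]
\]
with probability at least $1 - \delta/n$. Now, the assertion is due to $\sum_{i=1}^{|X|} Y_i = |S|$ and $E\bigl[\sum_{i=1}^{|X|} Y_i\bigr] = n \cdot E[Y_i]$.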
The following lemma shows that, with high probability, there is at least one sample point
from R in each mini ball that contains at least a certain fraction of points from X.
Lemma 8.4.3. With probability 1 − δ, there is at least one sample point from R in each
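mini ball that contains at least $\varepsilon^{\lambda}\sigma^2 n/((\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5})$ points from $X$.
Proof. Let $B$ be any mini ball that contains at least $\varepsilon^{\lambda}\sigma^2 n/((\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5})$ points from $X$. Then, we have
\begin{align*}
\Pr[B \text{ contains no sample point from } R]
&\le \bigl(1 - \Pr[\text{point is taken into } R]\bigr)^{\frac{\varepsilon^{\lambda}\sigma^2 n}{(\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5}}} \\
&= \left(1 - \frac{((\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5}) \cdot \ln(n/\delta)}{\varepsilon^{\lambda}\sigma^2 \cdot n}\right)^{\frac{\varepsilon^{\lambda}\sigma^2 n}{(\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5}}} \\
&\le \delta/n ,
\end{align*}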
where the second inequality is due to a bound on Euler’s number (see Inequality (B.2)).
Finally, the assertion of the lemma follows by the union bound since we can assume that
the number of mini balls is less than n.
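The key inequality of this proof, $(1 - p)^t \le \delta/n$ for $p = \ln(n/\delta)/t$, can be checked numerically. In the Python sketch below, `ball_size` stands for the threshold $t$ on the number of points in a mini ball; its value and the values of `n` and `delta` are hypothetical and chosen only for illustration.

```python
import math


def miss_probability(p, ball_size):
    """Probability that none of ball_size points is sampled if each is taken with probability p."""
    return (1.0 - p) ** ball_size


n, delta = 10 ** 6, 0.1
ball_size = 50_000                       # hypothetical value of the threshold t
p = math.log(n / delta) / ball_size      # sampling probability expressed via the threshold
bound = delta / n
miss = miss_probability(p, ball_size)
```

The check reflects the standard estimate $(1 - p)^t \le e^{-pt}$, which gives $e^{-\ln(n/\delta)} = \delta/n$ for this choice of $p$.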
Now, we can show that, with high probability, there are just a few sample points from
S located in mini balls that contain less than a certain fraction of points from X. Further-
more, the number of points in each of the remaining mini balls can be (1±σ)-approximated
by S with high probability.
Lemma 8.4.4. Let $U$ be the union of all the mini balls in $\mathcal{M}$ that contain less than $\varepsilon^{\lambda}\sigma^2 n/((\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5})$ points from $X$. If $|S| \ge 6\ln(1/\delta)/\sigma$, then, with probability
at least 1 − δ, the number of points from X in U is less than σn/2 and the number of
sample points from S that are contained in U is at most σ|S|.
Proof. As we have shown in the proof of Lemma 7.3.5, the number of mini balls is at most
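$(\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+4}/(\varepsilon^{\lambda}\sigma)$. Thus, the total number of points in mini balls contained in $U$ is less than
\[
\frac{(\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+4}}{\varepsilon^{\lambda}\sigma} \cdot \frac{\varepsilon^{\lambda}\sigma^2 n}{(\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5}} = \frac{\sigma n}{2} .
\]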
It follows that, for some t ≥ 1, the total number of points in mini balls from U is σn/(2t).
Let Yi be the indicator random variable for the event that the i-th point in S is contained
in U . We have E [Yi] = σ/(2t). By a Chernoff bound and linearity of expectation, we get
\[
\Pr\left[ \sum_{i=1}^{|S|} Y_i \ge (1 + t) \cdot E\left[ \sum_{i=1}^{|S|} Y_i \right] \right] \le \exp\left( -\frac{t \cdot |S| \cdot E[Y_i]}{3} \right) = \exp\left( -\frac{|S| \cdot \sigma}{6} \right) .
\]
Since we assume that |S| ≥ 6 ln(1/δ)/σ, this probability is at most δ. Thus, U contains
less than
\[
(1 + t) \cdot E\left[ \sum_{i=1}^{|S|} Y_i \right] \le \sigma |S|
\]
sample points with probability at least 1− δ.
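Lemma 8.4.5. If $|S| \ge 3 \cdot (\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5} \cdot \ln(n/\delta)/(\varepsilon^{\lambda}\sigma^4)$, then the number of points in every mini ball that contains at least $\varepsilon^{\lambda}\sigma^2 n/((\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5})$ points from $X$ can be $(1 \pm \sigma)$-approximated by $S$ with probability $1 - \delta$.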
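Proof. Let $B$ be any mini ball that contains at least $\varepsilon^{\lambda}\sigma^2 n/((\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5})$ points from $X$. Let $Y_i$ be the indicator random variable for the event that the $i$-th point in $S$ is contained in the mini ball $B$. We have $E[Y_i] \ge \varepsilon^{\lambda}\sigma^2/((\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5})$. By a Chernoff bound and linearity of expectation, we get
\begin{align*}
\Pr\left[ \sum_{i=1}^{|S|} Y_i \ge (1 + \sigma) \cdot E\left[ \sum_{i=1}^{|S|} Y_i \right] \right]
&\le \exp\left( -\frac{\sigma^2 \cdot |S| \cdot E[Y_i]}{3} \right) \\
&\le \exp\left( -\frac{\varepsilon^{\lambda}\sigma^4 \cdot |S|}{3 \cdot (\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5}} \right)
\end{align*}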
and
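\begin{align*}
\Pr\left[ \sum_{i=1}^{|S|} Y_i \le (1 - \sigma) \cdot E\left[ \sum_{i=1}^{|S|} Y_i \right] \right]
&\le \exp\left( -\frac{\sigma^2 \cdot |S| \cdot E[Y_i]}{2} \right) \\
&\le \exp\left( -\frac{\varepsilon^{\lambda}\sigma^4 \cdot |S|}{(\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+6}} \right) .
\end{align*}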
Since we assume that $|S| \ge 3 \cdot (\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5} \cdot \ln(n/\delta)/(\varepsilon^{\lambda}\sigma^4)$, each of these probabilities is at most $\delta/n$. Hence, the number of points in $B$ can be $(1 \pm \sigma)$-approximated with probability $1 - 2\delta/n$. By the union bound, the number of points in every mini ball that contains at least $\varepsilon^{\lambda}\sigma^2 n/((\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5})$ points from $X$ is $(1 \pm \sigma)$-approximated
with probability at least 1− δ.
Weight of the Representatives
To avoid that the total weight of the representatives differs from n, we adjust the weight of
some representatives. We adopt the result given in Section 8.2 to show that the adjustment
is small.
Lemma 8.4.6. Let R be the set of representatives before the adjustment, and let w(yi)
denote the weight of a representative yi ∈ R. Then,
\[
n \le \sum_{y_i \in R} w(y_i) < \left(1 + \frac{\sigma^2}{2}\right) \cdot n
\]
holds with probability at least 1− 2δ/n.
Proof. Since the sum of the counters over all representatives is equal to |S| and the weight
of a representative is at least its counter multiplied by n/|S|, the total weight of the
representatives is at least n. This proves the first inequality of the lemma.
The sum of the weights can be larger than n because the weight of each representative
is rounded up to the next integer. Thus, the sum of the weights is at most n + |R|. Due
to Lemma 8.4.1, we have
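\[
|R| < \frac{3 \cdot ((\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+4}) \cdot \ln(n/\delta)}{\varepsilon^{\lambda}\sigma^2}
\]
with probability $1 - \delta/n$. Furthermore, due to Lemma 8.4.2, we have
\[
|S| > \frac{3 \cdot (\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5} \cdot \ln(n/\delta)}{\varepsilon^{\lambda}\sigma^4}
\]
with probability $1 - \delta/n$. It follows that $|R| < |S| \cdot \sigma^2/2$ with probability $1 - 2\delta/n$. Since $|S| \le n$, we obtain
\[
\sum_{y_i \in R} w(y_i) - n \le n + |R| - n < \frac{\sigma^2}{2} \cdot |S| \le \frac{\sigma^2}{2} \cdot n ,
\]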
which proves the second inequality of the lemma.
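The rounding argument of Lemma 8.4.6 and the subsequent weight adjustment can be illustrated with a small Python sketch. The counter values are hypothetical, and since the text leaves the choice of the decreased representatives arbitrary, the sketch simply reduces weights from the front of the list.

```python
def rounded_weights(counters, n):
    """Weight ceil(c * n / |S|) per representative, where |S| is the sum of the counters."""
    sample_size = sum(counters)
    return [-(-c * n // sample_size) for c in counters]   # integer ceiling


def adjust_to_total(weights, n):
    """Decrease weights (arbitrarily: from the front) until they sum to exactly n."""
    excess = sum(weights) - n
    adjusted = list(weights)
    for i, w in enumerate(adjusted):
        if excess <= 0:
            break
        take = min(excess, w)
        adjusted[i] = w - take
        excess -= take
    return adjusted


counters = [3, 5, 2]          # hypothetical counters c_i, so |S| = 10
n = 101
weights = rounded_weights(counters, n)
adjusted = adjust_to_total(weights, n)
```

Each weight is rounded up by less than 1, so the total exceeds $n$ by at most the number of representatives, which matches the bound $n + |R|$ used in the proof.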
We summarize our results in the following theorem:
Theorem 13. Given a stream of points from an $n$-point doubling metric space $M = (X, D)$ with bounded dimension $\lambda$, a precision parameter $\varepsilon$, $0 < \varepsilon < 1$, a slack parameter $\sigma$, $1/o(n) < \sigma < 1$, and an error probability parameter $\delta$, $0 < \delta < 1$, there is a randomized streaming algorithm that computes with probability $1 - \delta$ a set $X' \subset X$ of cardinality $O(\log^2(\Delta) \cdot \log(n/\delta)/(\varepsilon^{\lambda}\sigma^2))$ such that $M = (X, D)$ embeds into $M' = (X', D)$ with distortion $1 + \varepsilon$ and slack $\sigma$. The algorithm requires $O(\log^2(\Delta) \cdot \log(n/\delta)/(\varepsilon^{\lambda}\sigma^4))$ space and has a constant update time. The set $X'$ can be extracted in $O(\log^4(\Delta) \cdot \log^2(n/\delta)/(\varepsilon^{2\lambda}\sigma^6))$ time.
Proof. Due to Lemma 8.4.3, with high probability, there is at least one representative in each mini ball that contains at least $\varepsilon^{\lambda}\sigma^2 n/((\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5})$ points from $X$. Let $B(x, r)$ be such a mini ball. Then, each point $x' \in S \cap B(x, r)$ is either assigned to a representative in $B(x, r)$ or to another representative which is closer to $x'$. It follows that the representative to which we assign $x'$ is contained in the ball $B(x, 3r)$. Hence, due to Lemmas 7.3.3 and 7.3.4, the representatives for points located in mini balls that contain at least $\varepsilon^{\lambda}\sigma^2 n/((\lceil\log(\Delta)\rceil + 1)^2 \cdot 2^{6\lambda+5})$ points from $X$ form a $3\varepsilon$-WSPD with slack $\sigma$ for $X$. Since we estimate the number of points in the mini balls based on the sample set $S$, we get an additional slack. Due to Lemmas 8.4.2, 8.4.4 and 8.4.5, with high probability, the additional slack induced by this estimation is at most $2\sigma$. Furthermore, we get another additional slack of $\sigma^2/2$ (see Lemma 8.4.6) since we round up the weight of each representative. Thus, with high probability, our streaming algorithm computes a $3\varepsilon$-WSPD with slack $4\sigma$ for $X$. Due to our construction, this WSPD is an embedding for $X$ with distortion $(1 + 3\varepsilon)^2$ and slack $4\sigma$.
Due to Lemma 8.4.1, we can assume with high probability that the set of representatives
$R$ and, hence, the set $X'$ has a cardinality of $O(\log^2(\Delta) \cdot \log(n/\delta)/(\varepsilon^{\lambda}\sigma^2))$. Furthermore, due to Lemma 8.4.2, we can assume with high probability that the set $S$ has a cardinality of $O(\log^2(\Delta) \cdot \log(n/\delta)/(\varepsilon^{\lambda}\sigma^4))$. Since we only store the two sample sets $R$ and $S$, the total space requirement of our algorithm is $O(\log^2(\Delta) \cdot \log(n/\delta)/(\varepsilon^{\lambda}\sigma^4))$.
The error probability is given as follows. Lemmas 8.4.1 and 8.4.2 hold with a total error
probability of at most 2δ/n. If this is the case, then the assertions given in Lemmas 8.4.4
and 8.4.5 follow with a total error probability of 2δ. The assertion of Lemma 8.4.3 is true
with probability 1− δ. Thus, the total error probability of our algorithm is at most 4δ.
Overall, if we run our embedding algorithm with a precision parameter ε′ ≤ ε/9, a slack
parameter σ′ ≤ σ/4, and an error probability parameter δ′ ≤ δ/4, then the embedding has
distortion $(1 + 3\varepsilon')^2 \le 1 + \varepsilon$ and slack $4\sigma' \le \sigma$ and works with error probability $4\delta' \le \delta$.
Since we can decide in constant time whether a point is taken into one or both of the
two sample sets, the algorithm has a constant update time. To extract the weighted set
X ′, we assign each point in S to its nearest neighbor in R. Since we assume access to
a distance oracle that, given any two points from X, can compute in constant time the
distance between these two points, the assignment of S can be done in |R| · |S| time.
Since we use the construction from Section 7.3 in the analysis of our streaming algorithm,
we have to ensure that $\sigma' > (\lceil\log(\Delta)\rceil + 1) \cdot 2^{3\lambda}/n$ (confer Theorem 8). However, this is
implicitly required by the fact that the space requirement of a streaming algorithm has to
be sublinear in n and the space requirement of our streaming algorithm is ω(1/σ).
8.5 Embedding General Metric Spaces
This section deals with a streaming algorithm that embeds a general n-point metric space
M = (X,D) with constant distortion and slack σ into a metric space M ′ = (X ′,D′). As
in the previous sections, we assume that the minimum pairwise distance of M is at least
1, and the maximum pairwise distance is at most ∆. Furthermore, we assume access to
a distance oracle that, given any two points from X, can compute in constant time the
distance between these two points.
Our algorithm works in the insertion-only data stream model and resembles the con-
struction of spanners with slack proposed by Chan et al. [24]. A spanner with slack σ
for M is a sparse graph G whose vertices are the points in X and whose shortest-path
metric approximates a (1−σ)-fraction of all pairwise distances of M with small distortion.
The first step of the spanner construction presented in [24] is the computation of a small
edge-dense net N ⊂ X of M . Intuitively, N has the property that, for a large fraction of
pairs of points (x, y) ∈ X×X, the distance between N and both x and y is small compared
to D(x, y). Based on N , the edges of G are constructed as follows. For each pair of points
(x, y) ∈ N ×N , an edge with length D(x, y) is added to G. For each point x ∈ X\N , its
closest neighbor y in N is determined and an edge with length D(x, y) is added to G.
Now, we transform the construction to the streaming model. Our first modification is
that we replace the edge-dense net N by a sample set S drawn uniformly at random from
X, i.e., G contains a clique S and each point x ∈ X\S is connected to its closest neighbor
in S. We will show that, if the size of S is chosen carefully, this modification changes the
properties of G only slightly. Secondly, instead of storing for each point x ∈ X\S an edge
to a point in S, we store for each point x′ ∈ S the number of points at each distance scale
that have x′ as their nearest neighbor. This technique was applied earlier by Czumaj
and Sohler [32] to obtain 2-pass streaming algorithms for clustering problems. Our
streaming algorithm, however, has to make do with a single pass, and after having read only
a part of the input stream one cannot know the nearest neighbor of a point x ∈ X in the
final sample set S. Therefore, we compute the nearest neighbor in the current sample set S
at the point of time when x appears in the stream. In this way, we are able to compute a
compact representation of a spanner with slack for M in the streaming model. Next, we
describe our streaming algorithm in more detail. A description in pseudocode is given by
Algorithm 8.5.1.
We read the points of the input stream one by one and sample each point with probability
Pr [point is sampled] := m log(n/δ)(⌈log(∆)⌉ + 1)/(σ²n), where δ is the error probability
of the algorithm and m is the size of the edge-dense net N mentioned above (which is a
constant depending on σ). Let S := {s_0, . . . , s_{k−1}} be the set of sampled points. For each
i ∈ [k], we maintain counters c_{i,0}, c_{i,1}, . . . , c_{i,⌈log(∆)⌉}, which are initially set to 0. Moreover,
for each point x ∈ X\S, we compute its nearest neighbor τ(x) = s_i in S, at the point of
time when x appears in the stream, and we increment c_{i,j}, where j = ⌈log(D(x, s_i))⌉. By
storing the points in S and the counters c_{i,j}, we implicitly store the following metric space
M ′. The metric M ′ is the shortest-path metric of a graph G with vertex set X. For each
pair of points (s_i, s_j) ∈ S×S, the graph G contains an edge {s_i, s_j} of length D(s_i, s_j). For
each point x ∈ X\S, the graph G contains an edge {x, τ(x)} of length 2^⌈log(D(x,τ(x)))⌉. We
denote the resulting embedding by ϕ. Note that we do not store the mapping ϕ : X → X ′
since this would require Ω(n) space.
Algorithm 8.5.1 EmbedGeneralMetric(n, ∆, σ, δ)
initialize empty point set S
i ← 0
for each point x in the stream do
    flip a coin that shows head with probability Pr [point is sampled]
    if coin shows head then
        s_i ← x
        S ← S ∪ {s_i}
        for j ← 0 to ⌈log(∆)⌉ do
            initialize counter c_{i,j} with 0
        i ← i + 1
    else
        compute nearest neighbor τ(x) = s_{i′} in S
        increment counter c_{i′,⌈log(D(x,s_{i′}))⌉} by 1
return points in S together with their counters
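The pseudocode above can be sketched in Python as follows. This is only an illustrative sketch under several assumptions that are not fixed by the text: the net size m is passed in as a parameter, log(n/δ) is taken as a natural logarithm, the distance scales use base-2 logarithms, and the sampling probability is capped at 1. The name embed_general_metric and the counter skipped (points that arrive while S is still empty, a case the pseudocode does not treat) are hypothetical additions.

```python
import math
import random

def embed_general_metric(stream, dist, n, spread, sigma, delta, m, rng=None):
    """One-pass sketch of Algorithm 8.5.1.

    stream : iterable of points; dist : distance oracle D(x, y);
    spread : the bound Delta on the maximum distance (minimum distance >= 1);
    m      : assumed size of the edge-dense net (a constant depending on sigma).
    """
    rng = rng or random.Random()
    scales = math.ceil(math.log2(spread)) + 1      # scales 0 .. ceil(log Delta)
    p = min(1.0, m * math.log(n / delta) * scales / (sigma ** 2 * n))
    samples, counters, skipped = [], [], 0
    for x in stream:
        if rng.random() < p:
            # x becomes a new sample point with fresh counters
            samples.append(x)
            counters.append([0] * scales)
        elif samples:
            # nearest neighbor in the *current* sample set
            i = min(range(len(samples)), key=lambda t: dist(x, samples[t]))
            j = math.ceil(math.log2(dist(x, samples[i])))
            counters[i][j] += 1
        else:
            skipped += 1    # assumption: S is still empty, nothing to count
    return samples, counters, skipped
```

Every point of the stream ends up either in the sample set, in exactly one counter, or in the hypothetical skipped count, so the stored information has size proportional to |S| · (⌈log(∆)⌉ + 1) rather than n.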
Analysis of the Embedding
In order to prove that the embedding ϕ has constant distortion and slack σ, we first show
that M indeed contains some small edge-dense net N .
Definition 8.5.1 (Edge-Dense Net). Let M = (X,D) be any general metric space, let
γ > 0 be any precision parameter, and let σ, 0 < σ < 1, be any slack parameter. We say
that a subset N ⊂ X is a (σ, γ)-edge-dense net for M if, for at least a (1− σ)-fraction of
pairs (x, y) ∈ X ×X, there exists a pair (bx, by) ∈ N ×N such that
max{D(x, bx),D(y, by)} ≤ γ ·D(x, y) .
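To make the definition concrete, the following brute-force check tests whether a candidate set is a (σ, γ)-edge-dense net. It is a sketch under one simplifying assumption: the fraction is taken over unordered pairs of distinct points rather than over all of X × X. The function name is_edge_dense_net is hypothetical.

```python
from itertools import combinations

def is_edge_dense_net(points, net, dist, sigma, gamma):
    """Brute-force check of Definition 8.5.1 over unordered pairs of distinct points."""
    pairs = list(combinations(points, 2))
    covered = 0
    for x, y in pairs:
        limit = gamma * dist(x, y)
        # b_x and b_y may be chosen independently, so the closest net points suffice
        near_x = min(dist(x, b) for b in net)
        near_y = min(dist(y, b) for b in net)
        if max(near_x, near_y) <= limit:
            covered += 1
    return covered >= (1 - sigma) * len(pairs)
```

For example, for the points 0, . . . , 9 on a line, the single-point set {0} covers 29 of the 45 pairs with γ = 2, so it is a (0.4, 2)-edge-dense net but not a (0.1, 2)-edge-dense net.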
Lemma 8.5.2 ([75]). For any general metric space M = (X,D) and for any slack param-
eter σ, 0 < σ < 1, there exists a subset N ⊂ X with |N | = C(σ), where C(σ) is a constant
depending on σ, such that
$$\min_{b \in N} \{D(x, b), D(y, b)\} \le D(x, y)$$
is true for at least a (1− σ)-fraction of pairs (x, y) ∈ X ×X.
We reformulate the above lemma as follows.
Lemma 8.5.3. For any general metric space M = (X,D) and for any slack parameter σ,
0 < σ < 1, there exists a (σ, 2)-edge-dense net N ⊂ X with |N | = C(σ), where C(σ) is a
constant depending on σ.
Proof. Let N be the set given by Lemma 8.5.2. Then, for at least a (1 − σ)-fraction of
pairs (x, y) ∈ X ×X, there exists an element b ∈ N such that
$$\min \{D(x, b), D(y, b)\} \le D(x, y) .$$
Without loss of generality, we assume that D(x, b) ≤ D(x, y). By triangle inequality, we
have
D(y, b) ≤ D(y, x) + D(x, b) ≤ 2 ·D(x, y) ,
and the assertion follows by choosing b_x = b_y = b.
Now, let N := {z_0, . . . , z_{m−1}} be a (σ, 2)-edge-dense net for the input metric space M .
For each ℓ ∈ [m], let X_ℓ be the set of points in X for which the nearest neighbor in N is
z_ℓ (breaking ties arbitrarily). Furthermore, for each j ∈ [⌈log(∆)⌉ + 1], we define
$$X_{\ell,j} := \left\{ x \in X_\ell \mid D(x, z_\ell) \in \left( 2^{j-1}, 2^j \right] \right\} .$$
We say that X_{ℓ,j} is good if after σ|X_{ℓ,j}| points from X_{ℓ,j} have appeared in the stream, the
set S contains at least one point from X_{ℓ,j}. In case this condition fails, we say that X_{ℓ,j} is
bad. The next lemma shows that each set X_{ℓ,j} that contains more than a certain threshold
of points is good with high probability.
Lemma 8.5.4. With probability at least 1−δ, for each ℓ ∈ [m] and each j ∈ [⌈log(∆)⌉ + 1]
with |X_{ℓ,j}| ≥ σn/(m(⌈log(∆)⌉ + 1)), X_{ℓ,j} is good.
Proof. Pick any ℓ ∈ [m] and any j ∈ [⌈log(∆)⌉ + 1] such that
$$|X_{\ell,j}| \ge \frac{\sigma n}{m(\lceil \log(\Delta) \rceil + 1)} .$$
Then, we have
$$\Pr[X_{\ell,j} \text{ is bad}] = (1 - \Pr[\text{point is sampled}])^{\sigma |X_{\ell,j}|} \le \left( 1 - \frac{m \log(n/\delta)(\lceil \log(\Delta) \rceil + 1)}{\sigma^2 n} \right)^{\frac{\sigma^2 n}{m(\lceil \log(\Delta) \rceil + 1)}} < \delta/n ,$$
where the second inequality is due to a bound on Euler’s number (see Inequality (B.2)).
Finally, we obtain the assertion of the lemma by applying the union bound over all pairs
(ℓ, j) ∈ [m] × [⌈log(∆)⌉ + 1]. Recall that m is a constant depending on σ (see Lemma 8.5.3),
so we can assume that m · (⌈log(∆)⌉ + 1) ≤ n.
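The bound in the proof can be checked numerically for concrete parameters. The sketch below evaluates the probability that no point is sampled among the first σ|X| points of a set X of exactly the threshold size. It assumes a natural logarithm in log(n/δ) and base-2 logarithms for the distance scales; the function name bad_probability_bound and the parameter values in the example are illustrative only.

```python
import math

def bad_probability_bound(n, spread, sigma, delta, m):
    """(1 - p)^(sigma * threshold) for a set of exactly the threshold size."""
    scales = math.ceil(math.log2(spread)) + 1
    p = m * math.log(n / delta) * scales / (sigma ** 2 * n)   # sampling probability
    threshold = sigma * n / (m * scales)                       # minimum size of the set
    return (1 - p) ** (sigma * threshold)
```

For n = 10^6, ∆ = 1024, σ = 0.5, δ = 0.1 and m = 4, the sampling probability is about 0.0028 and the resulting value lies just below δ/n = 10^−7, in line with the lemma.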
The results above enable us to bound the distortion and the slack of our embedding ϕ.
Lemma 8.5.5. With probability at least 1 − δ, for at least a (1 − 3σ)-fraction of pairs
(x, y) ∈ X ×X, we have
D(x, y) ≤ D′(ϕ(x), ϕ(y)) ≤ 46 ·D(x, y) .
Proof. The number of points contained in sets X_{ℓ,j} with |X_{ℓ,j}| < σn/(m(⌈log(∆)⌉ + 1))
is at most σn. Now, by Lemmas 8.5.2, 8.5.3, and 8.5.4 and since N is a (σ, 2)-edge-dense
net, it follows with probability at least 1− δ that, for at least a (1− 3σ)-fraction of pairs
(x, y) ∈ X ×X, there exist bx, by ∈ N , and x′, y′ ∈ S (see Figure 8.1) such that:
• x′ and y′ appear in the stream before x and y, respectively,
• D(x, bx) ≤ D(x, y) or D(y, by) ≤ D(x, y),
• max{D(x, bx),D(y, by)} ≤ 2 ·D(x, y),
• D(x′, bx) ≤ 2 ·D(x, bx), and D(y′, by) ≤ 2 ·D(y, by).
Without loss of generality, we assume that D(x, bx) ≤ D(x, y). Now, for a pair (x, y) ∈
X ×X, we get
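$$\begin{aligned}
D'(\varphi(x), \varphi(y))
&= 2^{\lceil \log(D(x,\tau(x))) \rceil} + D(\tau(x), \tau(y)) + 2^{\lceil \log(D(y,\tau(y))) \rceil} \\
&\le 2^{\lceil \log(D(x,x')) \rceil} + D(\tau(x), x) + D(x, x') + D(x', y') + D(y', y) + D(y, \tau(y)) + 2^{\lceil \log(D(y,y')) \rceil} \\
&\le 2^{\lceil \log(D(x,x')) \rceil} + 2 \cdot D(x, x') + D(x', y') + 2 \cdot D(y', y) + 2^{\lceil \log(D(y,y')) \rceil} \\
&\le 4 \cdot D(x, x') + D(x', y') + 4 \cdot D(y, y') \\
&\le 4 \cdot (D(x, b_x) + D(b_x, x')) + D(x', y') + 4 \cdot (D(y, b_y) + D(b_y, y')) \\
&\le 12 \cdot D(x, b_x) + D(x', y') + 12 \cdot D(y, b_y) \\
&\le 12 \cdot D(x, b_x) + D(x', b_x) + D(b_x, x) + D(x, y) + D(y, b_y) + D(b_y, y') + 12 \cdot D(y, b_y) \\
&\le 15 \cdot D(x, b_x) + D(x, y) + 15 \cdot D(y, b_y) \\
&\le 46 \cdot D(x, y) .
\end{aligned}$$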
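The constant in the last step can be traced back to the properties listed at the beginning of the proof. As a short worked step, using the assumption D(x, b_x) ≤ D(x, y) and the bound D(y, b_y) ≤ 2 · D(x, y):

```latex
15 \cdot D(x, b_x) + D(x, y) + 15 \cdot D(y, b_y)
  \le 15 \cdot D(x, y) + D(x, y) + 15 \cdot 2 \cdot D(x, y)
  = 46 \cdot D(x, y)
```

In the same way, the factors 12 and 15 in the preceding lines come from the bounds D(x′, b_x) ≤ 2 · D(x, b_x) and D(y′, b_y) ≤ 2 · D(y, b_y), since 4 · (1 + 2) = 12 and 12 + 2 + 1 = 15.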
By combining our results, we obtain the following theorem:
Theorem 14. Let σ, 0 < σ < 1, be a slack parameter, and let δ, 0 < δ < 1, be an error
probability parameter. Given a stream of points from any general n-point metric space M ,
there exists a randomized streaming algorithm that computes with probability at least 1− δ
an implicit representation of an n-point metric space M ′ such that M embeds into M ′ with
distortion O(1) and slack σ. The algorithm requires O(C(σ) · log(n/δ) · log(n) · log²(∆)/σ²)
space, where C(σ) is a constant depending on σ.
Figure 8.1: Illustration of the embedding ϕ for a set of points in the Euclidean plane. The
points z1 and z2 belong to the edge-dense net N . The areas which contain
the sets X1,1 and X2,2 are colored in gray. Both sets are good, which implies
that X1,1 contains a sample point x′ and X2,2 contains a sample point y′. The
distance scale for each sample point is indicated by the dashed red circles. The
distance between x and y is represented by D(ϕ(x), x′)+D(x′, y′)+D(y′, ϕ(y)),
which is indicated by the line segments.
Proof. First, we compute an upper bound on the space requirement of our algorithm. Let
Yi be the indicator random variable for the event that the i-th point in the data stream
is sampled. Recall that each point is sampled with probability Pr [point is sampled] =
C(σ) log(n/δ)(⌈log(∆)⌉ + 1)/(σ²n). Thus, we have E [Yi] = Pr [point is sampled]. By a
Chernoff bound, we get
$$\Pr\left[ \sum_{i=1}^{|X|} Y_i \ge 4 \cdot \mathrm{E}\left[ \sum_{i=1}^{|X|} Y_i \right] \right] \le \exp\left( -\frac{3 \cdot n \cdot \mathrm{E}[Y_i]}{3} \right) \le \exp\left( -n \cdot \Pr[\text{point is sampled}] \right) \le \delta .$$
It follows that, with probability at least 1− δ, the size of the sample set S is
$$|S| \in O\left( \frac{C(\sigma) \cdot \log(n/\delta) \cdot \log(\Delta)}{\sigma^2} \right) .$$
Since our algorithm only stores each point in S together with its O(log(∆)) counters and
each counter is set to at most n, the total space requirement is as claimed.
Due to Lemma 8.5.5 and the considerations above, the embedding ϕ has distortion O(1)
and slack 3σ and works with a total error probability of at most 2δ. Thus, if we run our
algorithm with a slack parameter σ′ ≤ σ/3 and an error probability parameter of δ′ ≤ δ/2,
the resulting embedding has slack 3σ′ ≤ σ and an error probability of 2δ′ ≤ δ.
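The parameter rescaling and the sample-size bound from the proof can be put into numbers with a small helper. This is a sketch only: C(σ) is not specified in the text, so it is passed in as the hypothetical argument c_sigma, and the logarithm bases (natural for log(n/δ), base 2 for the distance scales) are assumptions.

```python
import math

def run_parameters(sigma, delta):
    """Parameters to run the algorithm with so that the final slack is at most
    sigma and the final error probability is at most delta."""
    return sigma / 3, delta / 2

def sample_size_bounds(n, spread, sigma, delta, c_sigma):
    """Expected sample size n * Pr[point is sampled] and the cap 4 * E[|S|]
    used in the Chernoff argument."""
    scales = math.ceil(math.log2(spread)) + 1
    expected = c_sigma * math.log(n / delta) * scales / sigma ** 2
    return expected, 4 * expected
```

Note that the expected sample size does not depend on n except through the factor log(n/δ), which is what makes the representation compact.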
8.6 Lower Bounds
In this section, we derive two lower bounds. First, we show that any algorithm that
embeds an n-point metric space M into another metric space M ′ with distortion ϱ < 2
and slack σ < 1/4 requires Ω(n/ log n) bits of memory. The second lower bound depends
on the spread ∆ of M . More precisely, we prove that any algorithm that embeds a metric
space M into another metric space M ′ with distortion ϱ < 2 and slack σ < 1/4 needs
Ω(log log ∆) bits of memory. Both proofs are based on the pigeonhole principle. We show
that if we restrict the memory space to a certain number of bits, there cannot exist a
so-called (σ, ϱ)-net.
Definition 8.6.1 (Net for Metric Spaces). Let ϱ ≥ 1 be a precision parameter, and let σ,
0 < σ < 1, be a slack parameter. A set of n-point metric spaces N is called a (σ, ϱ)-net
if every n-point metric space M embeds into some metric space M ′ ∈ N with distortion ϱ
and slack σ.
Theorem 15. Let ϱ, 1 ≤ ϱ < 2, be a precision parameter, and let σ, 0 ≤ σ < 1/4, be a slack
parameter. Then, any algorithm that computes for every arbitrary n-point metric space
M = (X,D) with positive probability an (implicit or explicit) representation of another
metric space M ′ = (X ′,D′) such that there is an embedding from M to M ′ with distortion
ϱ and slack σ requires Ω(n/ log n) bits of memory.
Proof. We use the probabilistic method to prove the assertion. In general, the probabilistic
method says that if an object chosen at random from a given universe satisfies a certain
property with positive probability, then there must exist an object in the universe that
satisfies this property. Applied to our problem, the universe is a set of n-point metric spaces
and the desired property of an n-point metric space M is that M cannot be embedded by
an algorithm using O(n/ log n) bits of memory such that the embedding has distortion ϱ
and slack σ. If a randomly chosen n-point metric space has this property with positive
probability, then there must exist an n-point metric space that cannot be embedded by an
algorithm using O(n/ log n) bits of memory such that the embedding has distortion ϱ and
slack σ.
Now, let us consider any algorithm for embedding n-point metric spaces that uses at
most k bits of memory. Then, this algorithm has at most 2^k distinct states. Each of these
states can correspond to at most one target metric space. Let us denote the set of these
target metric spaces by N . By using the probabilistic method, we show that, for a certain
value of k, N is not a (σ, ϱ)-net, i.e., there exists an n-point metric space that cannot
be embedded into any metric space in N with distortion ϱ and slack σ. This proves the
assertion of the theorem for any algorithm using at most k bits of memory.
Let M = (X,D) be a random n-point 1-2-metric space, i.e., every distance is chosen
uniformly at random from {1, 2}. Without loss of generality, we assume that the computed
embedding is non-expanding, i.e., the distance between any two points in X is at least as
big as the corresponding distance of the embedded points in the target metric space. Let us
consider an arbitrary target metric space M ′ ∈ N . There are n! ways to embed M into M ′.
We fix one of these possible embeddings. Without loss of generality, we can assume that
X = X ′, i.e., our fixed embedding is the identity function. For the moment, let us assume
that the embedding must have slack 0. Then, since our embedding is non-expanding, we
must map all 1-distances in M to distances of length at most 1 in M ′. Furthermore, since
our embedding has distortion ϱ < 2, we must map all 2-distances in M to a value greater
than 1 in M ′. We call an assignment that violates these conditions a wrong assignment.
Let err(M,M ′) be the total number of wrong assignments that occur by embedding M into
M ′. Since we allow a slack of σ, we are allowed to make at most σ · $\binom{n}{2}$ wrong assignments.
For two arbitrary points x, y ∈ X in M ′, the distance D′(x, y) is either bigger than 1 or
at most 1. Since D(x, y) is chosen uniformly at random from {1, 2}, the chance that the
assignment of D(x, y) to D′(x, y) belongs to the wrong assignments is 1/2. Therefore, by
a Chernoff bound, we have
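$$\begin{aligned}
\Pr\left[ \mathrm{err}(M,M') \le \sigma \cdot \binom{n}{2} \right]
&< \Pr\left[ \mathrm{err}(M,M') \le \frac{1}{4} \cdot \binom{n}{2} \right] \\
&= \Pr\left[ \mathrm{err}(M,M') \le \left( 1 - \frac{1}{2} \right) \cdot \mathrm{E}[\mathrm{err}(M,M')] \right] \\
&\le \exp\left( -\frac{\mathrm{E}[\mathrm{err}(M,M')]}{8} \right) \\
&= \exp\left( -\frac{1}{16} \cdot \binom{n}{2} \right) .
\end{aligned}$$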
Since |N | ≤ 2^k and there are n! ways to embed M into one metric space from N , the
union bound implies that the overall probability that M embeds into any metric space
from N with distortion ϱ and slack σ is at most
$$n! \cdot 2^k \cdot \exp\left( -\frac{1}{16} \cdot \binom{n}{2} \right) ,$$
which is less than 1 for k = cn/ log n with a sufficiently small constant c. Thus, the
randomly chosen n-point metric space M cannot be embedded into any metric space from
N with positive probability. It follows that, for such k, there must exist an n-point metric
space that cannot be embedded into any metric space from N with distortion ϱ and slack
σ, which completes the proof.
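The union-bound expression can be evaluated in logarithmic form to see for which n and k it drops below 1. The following sketch does this with the log-gamma function; the function name log_union_bound and the sample values are illustrative and not part of the proof.

```python
import math

def log_union_bound(n, k):
    """Natural logarithm of n! * 2^k * exp(-C(n, 2) / 16)."""
    # lgamma(n + 1) equals ln(n!) and avoids computing the huge factorial
    return math.lgamma(n + 1) + k * math.log(2) - math.comb(n, 2) / 16
```

For n = 1000 and k = 100 (roughly n/ log n) the value is clearly negative, so the bound is below 1, whereas for very small n such as n = 10 it is positive even for k = 0. This matches the asymptotic nature of the statement.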
Theorem 16. Let ϱ, 1 ≤ ϱ < 2, be a precision parameter, and let σ, 0 ≤ σ < 1/4,
be a slack parameter. Then, any algorithm that computes with positive probability for
every metric space M = (P,D), where P is a set of points from the discrete Euclidean
space {0, . . . ,∆} ⊆ R and D is the Euclidean distance function defined on P , an (implicit
or explicit) representation of another metric space M ′ = (X ′,D′) such that there is an
embedding from M to M ′ with distortion ϱ and slack σ requires Ω(log(log(∆))) bits of
memory.
Proof. For each i ∈ {1, . . . , ⌊log ∆⌋}, let Mi be the metric space obtained by placing |P |/2
points at the coordinate 0 and |P |/2 points at the coordinate 2^i. Any pair of these metric
spaces differs by a factor of at least 2 in |P |²/4 of its distances. Thus, there is no metric
space M ′ such that both Mi and Mj, i ≠ j, embed into M ′ with distortion less than 2 and
slack less than 1/4. Hence, for each of these metric spaces, there must exist a unique state
of an algorithm that computes such an embedding. It follows that the algorithm has at
least ⌊log ∆⌋ states and so it needs Ω(log(log(∆))) bits of memory to distinguish them.
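The counting in this proof can be illustrated with a small script. It builds two of the spaces M_i and M_j, matches their points by index (an assumption made only for this illustration, since the proof argues about arbitrary embeddings), and counts the unordered pairs whose distances differ by a factor of at least 2. All function names are hypothetical, and logarithms are taken to base 2.

```python
import math

def two_cluster_space(size, i):
    """Coordinates of M_i: half of the points at 0, half at 2^i."""
    half = size // 2
    return [0] * half + [2 ** i] * half

def pairs_differing_by_factor_two(size, i, j):
    """Unordered pairs whose distances in M_i and M_j differ by a factor >= 2."""
    a, b = two_cluster_space(size, i), two_cluster_space(size, j)
    count = 0
    for s in range(size):
        for t in range(s + 1, size):
            da, db = abs(a[s] - a[t]), abs(b[s] - b[t])
            if da != db and max(da, db) >= 2 * min(da, db):
                count += 1
    return count

def bits_needed(spread):
    """Bits required to distinguish floor(log spread) different states."""
    return math.ceil(math.log2(math.floor(math.log2(spread))))
```

For |P| = 8, the spaces M_1 and M_3 differ in 16 = |P|²/4 pairs, and for ∆ = 2^16 there are 16 such spaces, which require 4 bits to tell apart.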
9 Conclusions and Future Work
In this thesis, we developed facility location algorithms and embeddings with slack for huge
datasets.
Chapter 3
We presented a randomized distributed algorithm for the facility location problem, con-
sidering both metric spaces and powers of metric spaces. For the special case of uniform
costs and demands, our algorithm provides a constant-factor approximation using three
communication rounds. We believe that our algorithm is particularly well-suited for facility
location types of problems in wireless networks. This is because the algorithm uses only
a few broadcasts (in every communication round each node sends the same message to its
neighbors), which can be easily done in wireless networks.
In the analysis, we used the fact that the sum of the radii of the points is a constant-
factor approximation of the expected total cost. This fact is not directly applicable to the
non-uniform case, which means that our result cannot directly be generalized to the non-
uniform metric facility location problem. However, motivated by our result, Pandit and
Pemmaraju [97] obtained a constant-factor approximation in O(log(n)) communication
rounds for the variant of our metric facility location problem in which the opening
costs of the facilities are non-uniform. It would be interesting to find out whether Ω(log(n))
communication rounds are required.
Chapter 4
We initiated the study on a KDS for the mobile facility location problem. In particular, we
proposed a KDS that maintains a subset of the moving input points as open facilities such
that, at any time, the associated total cost is at most a constant factor larger than the
current optimal cost. We showed that our KDS is compact, local, responsive, and efficient.
Note that the complexity of our KDS is polylogarithmic in R, which is a value depending
on the opening cost and demand values of the input points. Hence, the compactness,
locality, responsiveness, and efficiency are not fully polylogarithmic, but only
pseudo-polylogarithmic. An interesting direction for future work is to reduce this
pseudo-polylogarithmic term to a truly polylogarithmic term. Furthermore, future work in
the area of mobile facility location problems could consider an additional opening cost
that arises at the point of time when a point changes its status from client to open
facility. Here, we point out that in our scenario the additional opening cost per event
would already be bounded because we open at most a logarithmic number of facilities per
event.
Chapter 5
Chapter 5 addresses one of the central results in this thesis. In this chapter, we developed
a randomized algorithm that computes a constant-factor approximation of the cost for the
uniform facility location problem over dynamic geometric data streams. Our streaming
algorithm strongly improves the best previous one, which guarantees an approximation
factor of O(log²(∆)).
We think that it is worthwhile to further investigate the uniform facility location problem
over dynamic geometric data streams. In particular, we are optimistic that one can obtain
a (1 ± ε)-approximation algorithm for the facility location cost. This might also provide
new insights into handling other problems in the dynamic geometric data stream model,
like computing the earth mover distance or the minimum length of a traveling-salesperson
tour, for instance. Obviously, future work could also generalize our results to
the non-uniform facility location variant.
Chapter 6
We presented a streaming implementation of a k-means clustering algorithm that is based
on a new coreset construction. We have shown that this algorithm is capable of efficiently
clustering huge amounts of data in the insertion-only data stream model. To evaluate
our algorithm, we ran a series of experiments on large real-world datasets. We found
empirical evidence that in terms of the cost of the clustering, our algorithm is comparable
with StreamLS and significantly better than BIRCH. In terms of the running time, our
algorithm outperforms StreamLS, especially for a large number of centers k.
From a theoretical point of view, we showed that, for a precision parameter ε with 0 <
ε < 1, our adaptive sampling approach computes a (k, ε)-coreset in constant-dimensional
Euclidean space. However, the bound on the coreset size depends exponentially on the
dimension d (see Theorem 6). In line with our experiments, we conjecture that one
can prove a size bound with a much lower dependency on the dimension. Also, from an
experimental point of view, it would be interesting to examine the effect of the dimension
on an appropriate coreset size more extensively.
Chapters 7 and 8
We considered compact representations of metric spaces. In Chapter 7, we introduced the
notion of a WSPD with slack and gave constructions of WSPDs with slack for Euclidean and
doubling metric spaces. In Chapter 8, we presented streaming algorithms to compute low-
distortion embeddings with low slack for Euclidean, doubling, and general metric spaces.
Furthermore, we used an embedding to obtain a randomized algorithm that computes a
(1 ± ε)-approximation of the max-cut problem for a dynamic geometric data stream of
high-dimensional Euclidean points.
Metric embeddings with slack preserve much information about the original pairwise
distances and can be stored in small space. For this reason, we believe that they are an
important tool in the analysis of data streams and deserve further investigation.
A Additional Tables for Chapter 6
A.1 Parameters of Algorithm BIRCH
        Covertype    Tower    Census 1990    BigCross
p =     10           5        5              25
Table A.1: Manual adjustment of the parameter TotalMemSize as percentage of the dataset
size for algorithm BIRCH
parameter value
CorD 0
TotalMemSize (in bytes) p% of dataset size
TotalBufferSize (in bytes) 5% of TotalMemSize
TotalQueueSize (in bytes) 5% of TotalMemSize
TotalOutlierTreeSize (in bytes) 5% of TotalMemSize
WMflag 0
W vector (1, 1, . . . , 1)
M vector (0, 0, . . . , 0)
PageSize (in bytes) 1024
BDtype 4
Ftype 0
Phase1Scheme 0
RebuiltAlg 0
StatTimes 3
NoiseRate 0.25
Range 2000
CFDistr 0
H 0
Bars vector (100, 100, . . . , 100)
K number of clusters k
InitFt 0
Ft 0
Gtype 1
GDtype 2
Qtype 0
RefineAlg 1
NoiseFlag 0
MaxRPass 1
Table A.2: Setting of the parameters of algorithm BIRCH
A.2 Running Times of the Algorithms
running time (in sec)
dataset k StreamKM++ StreamLS BIRCH k-Means++ k-Means
Spambase 10 3.06 - - 3.57 19.02
20 7.04 - - 8.22 59.85
30 16.45 - - 19.05 88.8
40 28.93 - - 20.54 132.03
50 44.48 - - 25.9 182.08
Intrusion 10 74.1 - - 50.6 408.8
20 103.1 - - 262.4 2711.3
30 143.8 - - 1973.3 4389.1
40 197.6 - - 1257.0 10733.7
50 250.5 - - 1339.5 14282.0
Covertype 10 245 147 44 3389 -
20 297 460 44 5160 -
30 378 1027 44 14933 -
40 454 1773 44 16713 -
50 617 2588 44 25803 -
Tower 20 157 679 77 2960 -
40 168 1989 78 6902 -
60 187 3849 77 11247 -
80 211 6212 77 19206 -
100 248 8946 77 17161 -
Census 1990 10 1571 631 271 - -
20 1724 2362 271 - -
30 1839 5504 271 - -
40 1956 10054 272 - -
50 2057 11842 272 - -
BigCross 15 5486 6239 1006 - -
20 5738 10502 998 - -
25 5933 15780 996 - -
30 6076 22779 996 - -
Normdata 100 14.5 178.2 - - -
(m = 500) 125 14.9 401.8 - - -
150 15.1 569.3 - - -
175 15.1 659.3 - - -
200 15.6 731.8 - - -
Normdata 100 16.7 44.8 - - -
(m = 1000) 125 17.1 92.6 - - -
150 17.5 176.9 - - -
175 17.6 378.1 - - -
200 18.3 586.7 - - -
Table A.3: Average running times of the algorithms
A.3 Clustering Cost of the Algorithms
cost
dataset       k     StreamKM++      StreamLS        BIRCH           k-Means++       k-Means
Spambase      10    7.85 · 10^7     -               -               8.71 · 10^7     1.70 · 10^8
              20    2.27 · 10^7     -               -               2.45 · 10^7     1.53 · 10^8
              30    1.24 · 10^7     -               -               1.34 · 10^7     1.51 · 10^8
              40    8.64 · 10^6     -               -               9.01 · 10^6     1.49 · 10^8
              50    6.29 · 10^6     -               -               6.68 · 10^6     1.48 · 10^8
Intrusion     10    1.27 · 10^13    -               -               1.75 · 10^13    9.52 · 10^14
              20    1.26 · 10^12    -               -               1.55 · 10^12    9.51 · 10^14
              30    4.29 · 10^11    -               -               4.96 · 10^11    9.51 · 10^14
              40    1.95 · 10^11    -               -               2.25 · 10^11    9.50 · 10^14
              50    1.11 · 10^11    -               -               1.29 · 10^11    9.50 · 10^14
Covertype     10    3.43 · 10^11    3.42 · 10^11    4.24 · 10^11    3.42 · 10^11    -
              20    2.06 · 10^11    2.05 · 10^11    2.97 · 10^11    2.03 · 10^11    -
              30    1.57 · 10^11    1.56 · 10^11    1.89 · 10^11    1.54 · 10^11    -
              40    1.31 · 10^11    1.32 · 10^11    1.59 · 10^11    1.29 · 10^11    -
              50    1.15 · 10^11    1.18 · 10^11    1.41 · 10^11    1.13 · 10^11    -
Tower         20    6.24 · 10^8     6.16 · 10^8     9.26 · 10^8     6.51 · 10^8     -
              40    3.34 · 10^8     3.34 · 10^8     4.75 · 10^8     3.30 · 10^8     -
              60    2.43 · 10^8     2.37 · 10^8     3.89 · 10^8     2.40 · 10^8     -
              80    1.95 · 10^8     1.91 · 10^8     3.47 · 10^8     1.92 · 10^8     -
              100   1.65 · 10^8     1.63 · 10^8     2.98 · 10^8     1.63 · 10^8     -
Census 1990   10    2.48 · 10^8     2.40 · 10^8     3.98 · 10^8     -               -
              20    1.90 · 10^8     1.85 · 10^8     3.17 · 10^8     -               -
              30    1.59 · 10^8     1.53 · 10^8     2.94 · 10^8     -               -
              40    1.41 · 10^8     1.35 · 10^8     2.78 · 10^8     -               -
              50    1.28 · 10^8     1.24 · 10^8     2.73 · 10^8     -               -
BigCross      15    5.05 · 10^12    5.23 · 10^12    6.69 · 10^12    -               -
              20    4.15 · 10^12    4.23 · 10^12    4.85 · 10^12    -               -
              25    3.59 · 10^12    3.54 · 10^12    4.45 · 10^12    -               -
              30    3.18 · 10^12    3.18 · 10^12    3.83 · 10^12    -               -
Normdata      100   1.50 · 10^6     1.50 · 10^6     -               -               -
(m = 500)     125   1.50 · 10^6     1.50 · 10^6     -               -               -
              150   1.50 · 10^6     1.50 · 10^6     -               -               -
              175   1.50 · 10^6     1.50 · 10^6     -               -               -
              200   1.50 · 10^6     1.50 · 10^6     -               -               -
Normdata      100   1.50 · 10^6     1.50 · 10^6     -               -               -
(m = 1000)    125   1.50 · 10^6     1.50 · 10^6     -               -               -
              150   1.50 · 10^6     1.50 · 10^6     -               -               -
              175   1.50 · 10^6     1.50 · 10^6     -               -               -
              200   1.50 · 10^6     1.50 · 10^6     -               -               -
Table A.4: Average clustering cost of the algorithms
A.4 Standard Deviation of Running Time and Cost
running time (in sec)
dataset k StreamKM++ StreamLS k-Means++ k-Means
Spambase 10 0.29 - 1.5 3.33
20 1.09 - 3.88 6.36
30 1.52 - 11.27 17.61
40 6.56 - 6.97 26.95
50 6.59 - 12.83 68.1
Intrusion 10 0.68 - 40.81 58.84
20 3.22 - 98.11 499.7
30 6.07 - 1263.44 345.6
40 24.91 - 563.20 1306.2
50 31.58 - 706.00 1190.78
Covertype 10 0.88 2.43 2295.85 -
20 6.93 18.18 1249.18 -
30 4.15 2.14 9653.06 -
40 14.02 7.64 6838.93 -
50 39.28 123.28 12231.98 -
Tower 20 0.58 14.11 1594.76 -
40 1.79 50.83 2085.12 -
60 3.96 58.27 3656.87 -
80 7.95 122.65 5162.60 -
100 11.34 315.31 1795.07 -
Census 1990 10 2.04 9.08 - -
20 5.16 54.3 - -
30 5.38 98.03 - -
40 23.31 193.00 - -
50 17.43 533.39 - -
BigCross 15 10.49 93.6 - -
20 11.49 162.44 - -
25 15.69 226.38 - -
30 16.66 200.68 - -
Normdata 100 0.07 1.22 - -
(m = 500) 125 0.05 1.14 - -
150 0.05 2.19 - -
175 0.03 2.89 - -
200 0.03 4.05 - -
Normdata 100 0.06 0.6 - -
(m = 1000) 125 0.06 1.32 - -
150 0.04 2.56 - -
175 0.08 3.96 - -
200 0.2 2.41 - -
Table A.5: Standard deviation of the running time
cost
dataset       k     StreamKM++      StreamLS        k-Means++       k-Means
Spambase      10    2.05 · 10^6     -               9.57 · 10^6     1.06 · 10^6
              20    6.49 · 10^5     -               1.73 · 10^6     8.78 · 10^4
              30    3.14 · 10^5     -               9.51 · 10^5     8.81 · 10^4
              40    1.93 · 10^5     -               5.31 · 10^5     3.42 · 10^6
              50    1.49 · 10^5     -               2.47 · 10^5     2.91 · 10^6
Intrusion     10    1.39 · 10^12    -               6.61 · 10^12    3.09 · 10^11
              20    8.54 · 10^10    -               3.70 · 10^11    8.20 · 10^9
              30    3.13 · 10^10    -               6.85 · 10^10    2.54 · 10^10
              40    7.03 · 10^9     -               3.25 · 10^10    1.53 · 10^8
              50    6.01 · 10^9     -               1.61 · 10^10    6.82 · 10^8
Covertype     10    2.47 · 10^9     2.70 · 10^10    3.63 · 10^9     -
              20    1.08 · 10^9     1.03 · 10^10    9.17 · 10^8     -
              30    1.49 · 10^9     6.61 · 10^9     6.12 · 10^8     -
              40    8.38 · 10^8     5.63 · 10^9     6.64 · 10^8     -
              50    5.68 · 10^8     3.90 · 10^9     2.92 · 10^8     -
Tower         20    7.31 · 10^6     2.71 · 10^7     4.39 · 10^7     -
              40    1.85 · 10^6     1.65 · 10^7     4.37 · 10^6     -
              60    1.52 · 10^6     1.55 · 10^7     1.61 · 10^6     -
              80    1.03 · 10^6     9.63 · 10^6     1.54 · 10^6     -
              100   7.73 · 10^5     1.03 · 10^7     1.17 · 10^6     -
Census 1990   10    5.02 · 10^6     1.45 · 10^5     -               -
              20    3.66 · 10^6     3.14 · 10^6     -               -
              30    1.61 · 10^6     9.34 · 10^5     -               -
              40    1.21 · 10^6     8.13 · 10^5     -               -
              50    1.01 · 10^6     6.80 · 10^5     -               -
BigCross      15    3.22 · 10^10    1.75 · 10^11    -               -
              20    2.46 · 10^10    3.36 · 10^11    -               -
              25    1.86 · 10^10    1.76 · 10^11    -               -
              30    1.94 · 10^10    1.29 · 10^11    -               -
Normdata      100   0               0               -               -
(m = 500)     125   0               0               -               -
              150   0               0               -               -
              175   0               0               -               -
              200   0               0               -               -
Normdata      100   0               0               -               -
(m = 1000)    125   0               0               -               -
              150   0               0               -               -
              175   0               0               -               -
              200   0               0               -               -
Table A.6: Standard deviation of the clustering cost
B Mathematical Fundamentals
This appendix deals with some mathematical fundamentals which are assumed to be com-
mon knowledge throughout this thesis. In Section B.1, we specify partial sums of some
classical series and state some useful inequalities concerning Euler’s number. Section B.2
addresses probability theory.
B.1 Sequences, Series, and Inequalities
Arithmetic Series
Let (a1, a2, . . .) be any infinite arithmetic sequence, i.e., there is a fixed constant d ∈ R,
called the common difference, such that ai − ai−1 = d for all i ∈ N\{1}. It is well-known
and can be proven by simple induction that the n-th partial sum of the associated infinite
series is equal to
$$\sum_{i=1}^{n} a_i = \frac{n}{2} \cdot (a_1 + a_n) .$$
In particular, the sum of the first n natural numbers is equal to
$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2} .$$
Geometric Series
Let (a1, a2, . . .) be any infinite geometric sequence, i.e., there is a fixed constant q ∈ R with
q ≠ 1, called the common ratio, such that ai/ai−1 = q for all i ∈ N\{1}. It is well-known
and can be proven by simple induction that the n-th partial sum of the associated infinite
series is equal to
$$\sum_{i=1}^{n} a_i = a_1 \cdot \sum_{i=0}^{n-1} q^i = a_1 \cdot \frac{q^n - 1}{q - 1} .$$
In case that |q| < 1, the sum of the infinite geometric series is
$$\sum_{i=1}^{\infty} a_i = a_1 \cdot \frac{1}{1 - q} .$$
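Both partial-sum formulas are easy to sanity-check against a direct summation. The following sketch uses exact rational arithmetic; the function names are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction

def arithmetic_partial_sum(a1, d, n):
    """n-th partial sum of the arithmetic series with first term a1 and difference d."""
    an = a1 + (n - 1) * d
    return Fraction(n, 2) * (a1 + an)

def geometric_partial_sum(a1, q, n):
    """n-th partial sum of the geometric series with first term a1 and ratio q != 1."""
    return a1 * (q ** n - 1) / (q - 1)
```

For instance, with a1 = 3, d = 2 and n = 10 the closed form gives 120, which equals 3 + 5 + . . . + 21.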
Bounds on Euler’s Number
In the following, we will state some useful inequalities concerning Euler’s number (see [91,
99]). For all n ∈ N, we have
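$$\left( 1 + \frac{1}{n} \right)^n \le e \le \left( 1 + \frac{1}{n} \right)^{n+1} \tag{B.1}$$
and
$$\left( 1 - \frac{1}{n} \right)^n \le \frac{1}{e} \le \left( 1 - \frac{1}{n} \right)^{n-1} . \tag{B.2}$$
These inequalities imply
$$\lim_{n \to \infty} \left( 1 + \frac{1}{n} \right)^n = e \quad \text{and} \quad \lim_{n \to \infty} \left( 1 - \frac{1}{n} \right)^n = \frac{1}{e} .$$
Furthermore, it is known that, for all n ∈ N, we have
$$\left( \frac{n}{e} \right)^n \le n! \le n^n .$$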
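The inequalities of this section can be verified numerically for small n. The sketch below does so in floating-point arithmetic, which is adequate for moderate n because the gaps in the inequalities are much larger than the rounding error; the function name euler_bounds_hold is hypothetical.

```python
import math

def euler_bounds_hold(n):
    """Check (B.1), (B.2) and the factorial bounds for a given natural number n."""
    b1 = (1 + 1 / n) ** n <= math.e <= (1 + 1 / n) ** (n + 1)
    b2 = (1 - 1 / n) ** n <= 1 / math.e <= (1 - 1 / n) ** (n - 1)
    fact = (n / math.e) ** n <= math.factorial(n) <= n ** n
    return b1 and b2 and fact
```

Note that for n = 1 the right-hand side of (B.2) is 0^0, which Python evaluates to 1.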
B.2 Probability Theory
This section addresses some basics in probability theory which are assumed to be com-
mon knowledge throughout this thesis. The interested reader can find a more general
introduction in [89, 99].
The set Ω of all possible outcomes of a random experiment is called a sample space. In
this thesis, we only consider discrete sample spaces, i.e., any considered sample space is
a countable set of elementary events of the form Ω = {ω1, ω2, . . . , ωn}. The probability
distribution on Ω is a function p : Ω→ R which satisfies the following two conditions:
(i) the probability associated with any elementary event is non-negative, i.e.,
p(ωi) ≥ 0 , for any ωi ∈ Ω ,
(ii) the sum of probabilities over all elementary events is equal to 1, i.e.,
$$\sum_{\omega_i \in \Omega} p(\omega_i) = 1 .$$
A subset of Ω is called an event. Any event E ⊆ Ω is said to be true if the outcome of
the random experiment is any ω ∈ E. Otherwise, E is said to be false. The probability of
E is defined by
$$\Pr[E] := \sum_{\omega \in E} p(\omega) .$$
The probability of an event E2 assuming an event E1 with Pr [E1] > 0 is called
conditional probability and is defined as
$$\Pr[E_2 \mid E_1] := \frac{\Pr[E_1 \cap E_2]}{\Pr[E_1]} .$$
Random Variables, Expectation, and Variance
A random variable is a function X := X(ω) defined on a sample space Ω. In this thesis,
we only consider discrete and real-valued random variables. Such a random variable is
a function X : Ω → R which only takes isolated values with non-zero probabilities. A
random variable X is called an indicator random variable for an event E in case that, for
all ω ∈ Ω, we have X(ω) ∈ {0, 1} and X(ω) = 1 if and only if ω ∈ E.
For any discrete and real-valued random variable X and any real k ∈ R, we define
[X = k] := {ω ∈ Ω | X(ω) = k}. Based on this definition, we use the abbreviations
$$\Pr[X \le k] := \sum_{\ell \le k :\, \ell \in X(\Omega)} \Pr[X = \ell] \quad \text{and} \quad \Pr[X \ge k] := \sum_{\ell \ge k :\, \ell \in X(\Omega)} \Pr[X = \ell] .$$
Furthermore, for two random variables X and Y , we use the abbreviations
Pr [[X = k] ∩ [Y = `]] := Pr [X = k ∧ Y = `]
and
Pr [[X = k] ∪ [Y = `]] := Pr [X = k ∨ Y = `] .
Two random variables X and Y are called independent if, for all x, y ∈ R,
Pr [X = x | Y = y] = Pr [X = x] .
A set X0, X1, . . . , Xn−1 of random variables is called independent if, for all i ∈ [n] and
I ⊆ [n]\{i},
$$\Pr\left[ X_i = x_i \;\middle|\; \bigwedge_{j \in I} X_j = x_j \right] = \Pr[X_i = x_i] \tag{B.3}$$
for all xi ∈ R and xj ∈ R with j ∈ I. A set X0, X1, . . . , Xn−1 of random variables is called
k-wise independent if (B.3) holds for all I ⊆ [n]\{i} with |I| ≤ k.
The expectation of any discrete and real-valued random variable X is defined as
$$\mathrm{E}[X] := \sum_{k \in X(\Omega)} k \cdot \Pr[X = k] .$$
The following three properties of the expectation are often used in the analysis of random-
ized algorithms. Proofs can be found in [89].
(i) For any random variable X and any real k ∈ R, we have E [k ·X] = k · E [X].
(ii) For any two random variables X and Y , we have E [X + Y ] = E [X] + E [Y ]. This
property is called linearity of expectation.
(iii) For any two independent random variables X and Y , E [X · Y ] = E [X] · E [Y ].
The variance of any random variable X is defined as
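$$\mathrm{V}[X] := \mathrm{E}\left[ (X - \mathrm{E}[X])^2 \right] .$$
By expanding the term (X − E [X])², we get
$$\begin{aligned}
\mathrm{V}[X] &= \mathrm{E}\left[ (X - \mathrm{E}[X])^2 \right] \\
&= \mathrm{E}\left[ X^2 - 2X \cdot \mathrm{E}[X] + \mathrm{E}[X]^2 \right] \\
&= \mathrm{E}\left[ X^2 \right] - 2 \cdot \mathrm{E}[X] \cdot \mathrm{E}[X] + \mathrm{E}[X]^2 \\
&= \mathrm{E}\left[ X^2 \right] - \mathrm{E}[X]^2 .
\end{aligned}$$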
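The two forms of the variance can be compared on a small discrete distribution. The sketch below represents a distribution as a dictionary from values to exact probabilities; this representation and the function names are assumptions made for the illustration.

```python
from fractions import Fraction

def expectation(pmf):
    """E[X] for a distribution given as {value: probability}."""
    return sum(k * p for k, p in pmf.items())

def variance(pmf):
    """V[X] computed from the definition E[(X - E[X])^2]."""
    mu = expectation(pmf)
    return sum((k - mu) ** 2 * p for k, p in pmf.items())

def variance_via_second_moment(pmf):
    """V[X] computed as E[X^2] - E[X]^2."""
    return sum(k * k * p for k, p in pmf.items()) - expectation(pmf) ** 2
```

For a fair six-sided die, both functions return 35/12.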
The following three properties of the variance are often used in the analysis of randomized
algorithms. Proofs can be found in [89].
(i) For any random variable X, we have V [X] ≥ 0.
(ii) For any random variable X and any two reals a, b ∈ R, $V[a + b \cdot X] = b^2 \cdot V[X]$.
(iii) For any two independent random variables X and Y , V [X + Y ] = V [X] + V [Y ].
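The identity V[X] = E[X^2] − E[X]^2 and properties (ii) and (iii) can be checked exactly on a fair six-sided die. This is a minimal sketch under assumptions: the die distribution and the helpers `expectation`, `variance`, `affine`, and `sum_of_independent` are hypothetical names introduced for illustration.

```python
from fractions import Fraction

# Distribution of one fair six-sided die: value -> probability.
DIE = {k: Fraction(1, 6) for k in range(1, 7)}


def expectation(dist):
    """E[X] as the sum over all values k of k * Pr[X = k]."""
    return sum(k * p for k, p in dist.items())


def variance(dist):
    """V[X] = E[(X - E[X])^2], computed directly from the definition."""
    mu = expectation(dist)
    return sum((k - mu) ** 2 * p for k, p in dist.items())


def affine(dist, a, b):
    """Distribution of a + b * X (probabilities of coinciding values are merged)."""
    result = {}
    for k, p in dist.items():
        result[a + b * k] = result.get(a + b * k, 0) + p
    return result


def sum_of_independent(dist_x, dist_y):
    """Distribution of X + Y for independent X and Y."""
    result = {}
    for x, p_x in dist_x.items():
        for y, p_y in dist_y.items():
            result[x + y] = result.get(x + y, 0) + p_x * p_y
    return result


v_x = variance(DIE)                                        # 35/12
second_moment = sum(k * k * p for k, p in DIE.items())     # E[X^2]
v_affine = variance(affine(DIE, 5, 2))                     # property (ii), a = 5, b = 2
v_sum = variance(sum_of_independent(DIE, DIE))             # property (iii)
```

Exact rational arithmetic is used so that the equalities can be compared without rounding error.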
The standard deviation of a random variable X is defined as $\sigma := \sqrt{V[X]}$.
Useful Inequalities
The following inequalities are frequently used in the analysis of our randomized algorithms.
Union Bound
Let E1 and E2 be any two events. Then, we have
Pr [E1 ∨ E2] ≤ Pr [E1] + Pr [E2] .
The union bound is implied by the inclusion-exclusion principle from combinatorics.
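For two events, the inclusion-exclusion step can be written out explicitly. The following display uses only the notation introduced above and relies on the fact that probabilities are non-negative.

```latex
\[
  \Pr[E_1 \vee E_2]
  = \Pr[E_1] + \Pr[E_2] - \Pr[E_1 \wedge E_2]
  \le \Pr[E_1] + \Pr[E_2],
  \qquad \text{since } \Pr[E_1 \wedge E_2] \ge 0 .
\]
```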
Markov’s Inequality
Let X : Ω → R≥0 be a non-negative random variable. Then, for any k ∈ R with k > 0,
we have
\[ \Pr[X \ge k] \le \frac{E[X]}{k} . \]
A proof of Markov’s inequality can be found in [89].
Chebyshev’s Inequality
Let X : Ω→ R be a random variable. Then, for any k ∈ R with k > 0, we have
\[ \Pr\bigl[ |X - E[X]| \ge k \bigr] \le \frac{V[X]}{k^2} . \]
Chebyshev’s inequality follows from Markov’s inequality. A formal proof can be found
in [89].
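To get a feeling for how loose Markov's and Chebyshev's inequalities can be, one can compare the bounds with exact tail probabilities of a fair six-sided die. The sketch below is only an illustration; the distribution and all helper names are assumptions that do not appear in the thesis.

```python
from fractions import Fraction

# Distribution of one fair six-sided die: value -> probability.
DIE = {k: Fraction(1, 6) for k in range(1, 7)}


def expectation(dist):
    """E[X] as the sum over all values v of v * Pr[X = v]."""
    return sum(v * p for v, p in dist.items())


def variance(dist):
    """V[X] = E[(X - E[X])^2]."""
    mu = expectation(dist)
    return sum((v - mu) ** 2 * p for v, p in dist.items())


def tail_at_least(dist, k):
    """Exact value of Pr[X >= k]."""
    return sum((p for v, p in dist.items() if v >= k), Fraction(0))


def deviation_at_least(dist, k):
    """Exact value of Pr[|X - E[X]| >= k]."""
    mu = expectation(dist)
    return sum((p for v, p in dist.items() if abs(v - mu) >= k), Fraction(0))


def markov_bound(dist, k):
    """Upper bound E[X] / k on Pr[X >= k] for non-negative X and k > 0."""
    return expectation(dist) / k


def chebyshev_bound(dist, k):
    """Upper bound V[X] / k^2 on Pr[|X - E[X]| >= k] for k > 0."""
    return variance(dist) / k ** 2
```

For instance, `tail_at_least(DIE, 5)` is 1/3 while `markov_bound(DIE, 5)` is 7/10, and `deviation_at_least(DIE, 2)` is 1/3 while `chebyshev_bound(DIE, 2)` is 35/48.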
Chernoff Bounds
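Let $X_1, X_2, \ldots, X_n : \Omega \to \{0, 1\}$ be a set of independent 0-1-random variables. Then, it
holds for all ε ≥ 0 that
\[ \Pr\Bigl[ \sum_{i=1}^{n} X_i \ge (1 + \varepsilon) \cdot E\Bigl[ \sum_{i=1}^{n} X_i \Bigr] \Bigr] \le \exp\Bigl( -\frac{1}{3} \cdot \min\{\varepsilon, \varepsilon^2\} \cdot E\Bigl[ \sum_{i=1}^{n} X_i \Bigr] \Bigr) . \]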
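Furthermore, it holds for all ε, 0 ≤ ε ≤ 1, that
\[ \Pr\Bigl[ \sum_{i=1}^{n} X_i \le (1 - \varepsilon) \cdot E\Bigl[ \sum_{i=1}^{n} X_i \Bigr] \Bigr] \le \exp\Bigl( -\frac{1}{2} \cdot \varepsilon^2 \cdot E\Bigl[ \sum_{i=1}^{n} X_i \Bigr] \Bigr) . \]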
A formal proof can be found in [86].
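Both bounds can be compared numerically with exact tails in the special case where all X_i have the same success probability p, so that the sum is binomially distributed. This restriction, the parameter names `n`, `p`, `eps`, and the helper functions are assumptions made for this sketch; the bounds themselves do not require identical distributions.

```python
import math


def binomial_pmf(n, p, k):
    """Pr[sum of n independent 0-1 variables with success probability p equals k]."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def upper_tail(n, p, eps):
    """Exact Pr[sum >= (1 + eps) * E[sum]], where E[sum] = n * p."""
    threshold = (1 + eps) * n * p
    return sum(binomial_pmf(n, p, k) for k in range(n + 1) if k >= threshold)


def lower_tail(n, p, eps):
    """Exact Pr[sum <= (1 - eps) * E[sum]], where E[sum] = n * p."""
    threshold = (1 - eps) * n * p
    return sum(binomial_pmf(n, p, k) for k in range(n + 1) if k <= threshold)


def chernoff_upper_bound(n, p, eps):
    """exp(-(1/3) * min(eps, eps^2) * E[sum]), stated for all eps >= 0."""
    return math.exp(-min(eps, eps ** 2) * n * p / 3)


def chernoff_lower_bound(n, p, eps):
    """exp(-(1/2) * eps^2 * E[sum]), stated for 0 <= eps <= 1."""
    return math.exp(-(eps ** 2) * n * p / 2)


if __name__ == "__main__":
    n, p = 100, 0.5
    for eps in (0.1, 0.5, 1.0):
        print(eps, upper_tail(n, p, eps), chernoff_upper_bound(n, p, eps))
        print(eps, lower_tail(n, p, eps), chernoff_lower_bound(n, p, eps))
```

The exact tails are computed in floating-point arithmetic, so the printed values are approximations; they are meant only to show how quickly both the tails and the bounds decay as ε grows.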
Bibliography
[1] I. Abraham, Y. Bartal, and O. Neiman. Advances in metric embedding theory. In Pro-
ceedings of the 38th Annual ACM Symposium on Theory of Computing (STOC ’06),
pages 271–286. Association for Computing Machinery, 2006.
[2] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Approximating extent measures
of points. Journal of the ACM (JACM), 51(4):606–635, July 2004.
[3] A. Aggarwal, A. Deshpande, and R. Kannan. Adaptive sampling for k-means cluster-
ing. In Proceedings of the 12th International Workshop on Approximation Algorithms
for Combinatorial Optimization Problems (APPROX ’09), pages 15–28. Springer,
2009.
[4] N. Ailon, R. Jaiswal, and C. Monteleoni. Streaming k-means approximation. In
Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors,
Advances in Neural Information Processing Systems 22, pages 10–18. MIT Press,
2009.
[5] N. Alon, L. Babai, and A. Itai. A fast and simple randomized parallel algorithm for
the maximal independent set problem. Journal of Algorithms, 7(4):567–583, 1986.
[6] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the
frequency moments. Journal of Computer and System Sciences (JCSS), 58(1):137–
147, February 1999.
[7] S. Arora. Polynomial time approximation schemes for Euclidean traveling sales-
man and other geometric problems. Journal of the ACM (JACM), 45(5):753–782,
September 1998.
[8] S. Arora, P. Raghavan, and S. Rao. Approximation schemes for Euclidean k-medians
and related problems. In Proceedings of the 30th Annual ACM Symposium on Theory
of Computing (STOC ’98), pages 106–113. Association for Computing Machinery,
1998.
[9] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding.
In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms
(SODA ’07), pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
[10] S. Arya, G. Das, D. M. Mount, J. S. Salowe, and M. H. M. Smid. Euclidean spanners:
Short, thin, and lanky. In Proceedings of the 27th Annual ACM Symposium on Theory
of Computing (STOC ’95), pages 489–498. Association for Computing Machinery,
1995.
[11] S. Arya, D. M. Mount, and M. H. M. Smid. Randomized and deterministic algo-
rithms for geometric spanners of small diameter. In Proceedings of the 35th Annual
Symposium on Foundations of Computer Science (FOCS ’94), pages 703–712. IEEE
Computer Society, 1994.
[12] V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit. Local
search heuristics for k-median and facility location problems. SIAM Journal on
Computing, 33(3):544–562, 2004.
[13] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007. University
of California, Irvine, School of Information and Computer Sciences, available at
http://www.ics.uci.edu/~mlearn/MLRepository.html.
[14] M. Bădoiu, A. Czumaj, P. Indyk, and C. Sohler. Facility location in sublinear time.
In Proceedings of the 32nd Annual International Colloquium on Automata, Languages
and Programming (ICALP ’05), volume 3580, pages 866–877. Springer, 2005.
[15] J. Basch, L. J. Guibas, and J. Hershberger. Data structures for mobile data. Journal
of Algorithms, 31(1):1–28, 1999.
[16] J. L. Bentley and J. B. Saxe. Decomposable searching problems I. Static-to-dynamic
transformation. Journal of Algorithms, 1(4):301–358, December 1980.
[17] S. Bereg, B. K. Bhattacharya, D. G. Kirkpatrick, and M. Segal. Competitive algorithms
for maintaining a mobile center. Mobile Networks and Applications (MONET), 11(2):177–186, April 2006.
[18] J. Byrka and K. Aardal. An optimal bifactor approximation algorithm for the metric
uncapacitated facility location problem. SIAM Journal on Computing, 39(6):2212–
2231, March 2010.
[19] P. B. Callahan. Optimal parallel all-nearest-neighbors using the well-separated pair
decomposition (preliminary version). In Proceedings of the 34th Annual Symposium
on Foundations of Computer Science (FOCS ’93), pages 332–340. IEEE Computer
Society, 1993.
[20] P. B. Callahan. Dealing with higher dimensions: The well-separated pair decomposi-
tion and its applications. PhD thesis, Johns Hopkins University, Baltimore, Mary-
land, 1995.
[21] P. B. Callahan and S. R. Kosaraju. Algorithms for dynamic closest pair and n-
body potential fields. In Proceedings of the 6th Annual ACM-SIAM Symposium on
Discrete Algorithms (SODA ’95), pages 263–272. Society for Industrial and Applied
Mathematics, 1995.
[22] P. B. Callahan and S. R. Kosaraju. A decomposition of multidimensional point sets
with applications to k-nearest-neighbors and n-body potential fields. Journal of the
ACM (JACM), 42(1):67–90, 1995.
[23] T.-H. H. Chan, K. Dhamdhere, A. Gupta, J. Kleinberg, and A. Slivkins. Metric
embeddings with relaxed guarantees. SIAM Journal on Computing, 38(6):2303–2329,
March 2009.
[24] T.-H. H. Chan, M. Dinitz, and A. Gupta. Spanners with slack. In Proceedings
of the 14th Annual European Symposium on Algorithms (ESA ’06), pages 196–207.
Springer, 2006.
[25] T. M. Chan. Well-separated pair decomposition in linear time? Information Pro-
cessing Letters, 107(5):138–141, August 2008.
[26] K. L. Chang. Pass-efficient algorithms for facility location. Technical Report
YALEU/DCS/TR-1337, Yale University, November 2005.
[27] M. Charikar and S. Guha. Improved combinatorial algorithms for facility location
problems. SIAM Journal on Computing, 34(4):803–824, 2005.
[28] K. Chen. On k-median clustering in high dimensions. In Proceedings of the 17th
Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’06), pages 1177–
1185. Association for Computing Machinery, 2006.
[29] K. Chen. On coresets for k-median and k-means clustering in metric and Euclidean
spaces and their applications. SIAM Journal on Computing, 39(3):923–947, August
2009.
[30] F. A. Chudak and D. B. Shmoys. Improved approximation algorithms for the unca-
pacitated facility location problem. SIAM Journal on Computing, 33(1):1–25, 2003.
[31] A. Czumaj, G. Frahling, and C. Sohler. Efficient kinetic data structures for max-
cut. In Proceedings of the 19th Canadian Conference on Computational Geometry
(CCCG ’07), pages 157–160. Carleton University, Ottawa, Canada, 2007.
[32] A. Czumaj and C. Sohler. Small space representations for metric min-sum k-
clustering and their applications. Theory of Computing Systems, 46(3):416–442, April
2010.
[33] M. de Berg, O. Cheong, M. van Kreveld, and M. Overmars. Computational Geometry:
Algorithms and Applications. Springer, 3rd edition, 2008.
[34] J. Erickson. Dense point sets have sparse Delaunay triangulations or "... but not too
nasty". Discrete & Computational Geometry, 33(1):83–115, January 2005.
[35] J. Fakcharoenphol, S. Rao, and K. Talwar. A tight bound on approximating arbi-
trary metrics by tree metrics. Journal of Computer and System Sciences (JCSS),
69(3):485–497, November 2004.
[36] D. Feldman, A. Fiat, and M. Sharir. Coresets for weighted facilities and their appli-
cations. In Proceedings of the 47th IEEE Symposium on Foundations of Computer
Science (FOCS ’06), pages 315–324. IEEE Computer Society, 2006.
[37] D. Feldman, M. Monemizadeh, and C. Sohler. A PTAS for k-means clustering based
on weak coresets. In Proceedings of the 23rd ACM Symposium on Computational
Geometry (SCG ’07), pages 11–18. Association for Computing Machinery, 2007.
[38] J. Fischer and S. Har-Peled. Dynamic well-separated pair decomposition made
easy. In Proceedings of the 17th Canadian Conference on Computational Geome-
try (CCCG ’05), pages 235–238. University of Windsor, Ontario, Canada, 2005.
[39] E. W. Forgy. Cluster analysis of multivariate data: Efficiency versus interpretability
of classifications. Biometrics, 21:768–780, 1965.
[40] D. Fotakis. Incremental algorithms for facility location and k-median. Theoretical
Computer Science, 361(2–3):275–313, September 2006.
[41] D. Fotakis. Memoryless facility location in one pass. In Proceedings of the 23rd
Annual Symposium on Theoretical Aspects of Computer Science (STACS ’06), pages
608–620. Springer, 2006.
[42] G. Frahling. Algorithms for Dynamic Geometric Data Streams. PhD thesis, Univer-
sity of Paderborn, 2006.
[43] G. Frahling, P. Indyk, and C. Sohler. Sampling in dynamic data streams and applica-
tions. International Journal of Computational Geometry and Applications (IJCGA),
18(1–2):3–28, 2008.
[44] G. Frahling and C. Sohler. Coresets in dynamic geometric data streams. In Pro-
ceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC ’05),
pages 209–217. Association for Computing Machinery, 2005.
[45] S. A. Friedler and D. M. Mount. Approximation algorithm for the kinetic robust
k-center problem. Computational Geometry, 43(6–7):572–586, August 2010.
[46] J. Gao, L. J. Guibas, J. Hershberger, L. Zhang, and A. Zhu. Discrete mobile centers.
Discrete & Computational Geometry, 30(1):45–63, July 2003.
[47] J. Gao, L. J. Guibas, and A. T. Nguyen. Deformable spanners and applications.
Computational Geometry, 35(1–2):2–19, August 2006.
[48] J. Gao and L. Zhang. Well-separated pair decomposition for the unit-disk graph
metric and its applications. SIAM Journal on Computing, 35(1):151–169, 2005.
[49] B. Gfeller and E. Vicari. A randomized distributed algorithm for the maximal in-
dependent set problem in growth-bounded graphs. In Proceedings of the 26th An-
nual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing
(PODC ’07), pages 53–60. Association for Computing Machinery, 2007.
[50] J. Gudmundsson, C. Levcopoulos, G. Narasimhan, and M. H. M. Smid. Approximate
distance oracles for geometric graphs. In Proceedings of the 13th Annual ACM-SIAM
Symposium on Discrete Algorithms (SODA ’02), pages 828–837. Society for Industrial
and Applied Mathematics, 2002.
[51] S. Guha and S. Khuller. Greedy strikes back: Improved facility location algorithms.
Journal of Algorithms, 31(1):228–248, April 1999.
[52] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering
data streams: Theory and practice. IEEE Transactions on Knowledge and Data
Engineering (TKDE), 15(3):515–528, January/February 2003.
[53] S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering data streams. In
Proceedings of the 41st Symposium on Foundations of Computer Science (FOCS ’00),
pages 359–366. IEEE Computer Society, 2000.
[54] L. J. Guibas. Kinetic data structures: A state of the art report. In Proceedings of
the 3rd Workshop on the Algorithmic Foundations of Robotics (WAFR ’98), pages
191–209. A. K. Peters, Ltd., 1998.
[55] L. J. Guibas. Kinetic data structures. In D. P. Mehta and S. Sahni, editors, Handbook
of Data Structures and Applications. Chapman and Hall/CRC, 2004.
[56] S. Har-Peled. Clustering motion. Discrete & Computational Geometry, 31(4):545–
565, March 2004.
[57] S. Har-Peled and A. Kushal. Smaller coresets for k-median and k-means clustering.
Discrete & Computational Geometry, 37(1):3–19, January 2007.
[58] S. Har-Peled and S. Mazumdar. On coresets for k-means and k-median cluster-
ing. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing
(STOC ’04), pages 291–300. Association for Computing Machinery, 2004.
[59] S. Har-Peled and M. Mendel. Fast construction of nets in low-dimensional metrics
and their applications. SIAM Journal on Computing, 35(5):1148–1184, 2006.
[60] J. Hershberger. Smooth kinetic maintenance of clusters. Computational Geometry,
31(1–2):3–30, May 2005.
[61] P. Indyk. Sublinear time algorithms for metric space problems. In Proceedings of the
31st Annual ACM Symposium on Theory of Computing (STOC ’99), pages 428–434.
Association for Computing Machinery, 1999.
[62] P. Indyk. Stable distributions, pseudorandom generators, embeddings and data
stream computation. In Proceedings of the 41st IEEE Symposium on Foundations of
Computer Science (FOCS ’00), pages 189–197. IEEE Computer Society, 2000.
[63] P. Indyk. Algorithmic applications of low-distortion geometric embeddings. In
Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science
(FOCS ’01), pages 10–33. IEEE Computer Society, 2001.
[64] P. Indyk. Algorithms for dynamic geometric problems over data streams. In Pro-
ceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC ’04),
pages 373–380. Association for Computing Machinery, 2004.
[65] P. Indyk. Stable distributions, pseudorandom generators, embeddings, and data
stream computation. Journal of the ACM (JACM), 53(3):307–323, May 2006.
[66] P. Indyk. Sketching, streaming and sub-linear space algorithms. Graduate course
notes, available at http://stellar.mit.edu/S/course/6/fa07/6.895/, 2007.
[67] P. Indyk and J. Matousek. Low-distortion embeddings of finite metric spaces. In
J. E. Goodman and J. O’Rourke, editors, Handbook of Discrete and Computational
Geometry. Chapman and Hall/CRC, 2004.
[68] K. Jain, M. Mahdian, E. Markakis, A. Saberi, and V. V. Vazirani. Greedy facility
location algorithms analyzed using dual fitting with factor-revealing LP. Journal of
the ACM (JACM), 50(6):795–824, November 2003.
[69] K. Jain, M. Mahdian, and A. Saberi. A new greedy approach for facility location
problems. In Proceedings of the 34th Annual ACM Symposium on Theory of Com-
puting (STOC ’02), pages 731–740. Association for Computing Machinery, 2002.
[70] K. Jain and V. V. Vazirani. Approximation algorithms for metric facility location
and k-median problems using the primal-dual schema and Lagrangian relaxation.
Journal of the ACM (JACM), 48(2):274–296, March 2001.
[71] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert
space. In Conference in Modern Analysis and Probability, volume 26 of Contemporary
Mathematics, pages 189–206. American Mathematical Society, 1984.
[72] D. M. Kane, J. Nelson, and D. P. Woodruff. An optimal algorithm for the distinct
elements problem. In Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART
Symposium on Principles of Database Systems (PODS ’10), pages 41–52. Association
for Computing Machinery, 2010.
[73] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y.
Wu. A local search approximation algorithm for k-means clustering. Computational
Geometry, 28(2–3):89–112, June 2004.
[74] D. R. Karger and M. Ruhl. Finding nearest neighbors in growth-restricted met-
rics. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing
(STOC ’02), pages 741–750. Association for Computing Machinery, 2002.
[75] J. M. Kleinberg, A. Slivkins, and T. Wexler. Triangulation and embedding using
small sets of beacons. Journal of the ACM (JACM), 56(6), September 2009.
[76] S. G. Kolliopoulos and S. Rao. A nearly linear-time approximation scheme for the
Euclidean k-median problem. SIAM Journal on Computing, 37(3):757–782, 2007.
[77] M. R. Korupolu, C. G. Plaxton, and R. Rajaraman. Analysis of a local search
heuristic for facility location problems. Journal of Algorithms, 37(1):146–188, Octo-
ber 2000.
[78] C. Levcopoulos, G. Narasimhan, and M. H. M. Smid. Improved algorithms for
constructing fault-tolerant spanners. Algorithmica, 32(1):144–156, 2002.
[79] J.-H. Lin and J. S. Vitter. Approximation algorithms for geometric median problems.
Information Processing Letters, 44(5):245–249, 1992.
[80] S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information
Theory, 28(2):129–137, March 1982.
[81] N. A. Lynch. Distributed Algorithms. Morgan Kaufmann, 1996.
[82] J. B. MacQueen. Some methods for classification and analysis of multivariate obser-
vations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics
and Probability, volume 1, pages 281–297. University of California Press, 1967.
[83] M. Mahdian, Y. Ye, and J. Zhang. Approximation algorithms for metric facility
location problems. SIAM Journal on Computing, 36(2):411–432, 2006.
[84] J. Matousek. Lectures on Discrete Geometry. Graduate Texts in Mathematics.
Springer, 1st edition, 2002.
[85] M. Matsumoto and T. Nishimura. Mersenne Twister: A 623-dimensionally equidis-
tributed uniform pseudo-random number generator. ACM Transactions on Modeling
and Computer Simulation (TOMACS), 8(1):3–30, January 1998.
[86] C. J. H. McDiarmid. Concentration. In Probabilistic Methods for Algorithmic Discrete
Mathematics, volume 16 of Algorithms and Combinatorics, pages 195–248. Springer,
1998.
[87] R. R. Mettu and C. G. Plaxton. The online median problem. SIAM Journal on
Computing, 32(3):816–832, 2003.
[88] A. Meyerson. Online facility location. In Proceedings of the 42nd IEEE Symposium
on Foundations of Computer Science (FOCS ’01), pages 426–431. IEEE Computer
Society, 2001.
[89] M. Mitzenmacher and E. Upfal. Probability and Computing. Cambridge University
Press, 2005.
[90] T. Moscibroda and R. Wattenhofer. Facility location: Distributed approximation.
In Proceedings of the 24th Annual ACM SIGACT-SIGOPS Symposium on Principles
of Distributed Computing (PODC ’05), pages 108–117. Association for Computing
Machinery, 2005.
[91] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press,
1995.
[92] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and
Trends in Theoretical Computer Science, 1(2), 2005.
[93] S. Muthukrishnan. Data stream algorithms. Notes from Barbados Complex-
ity Theory Meeting, available at http://sites.google.com/site/algoresearch/
datastreamalgorithms, 2009.
[94] G. Narasimhan and M. H. M. Smid. Approximating the stretch factor of Euclidean
graphs. SIAM Journal on Computing, 30(3):978–989, 2000.
[95] N. Nisan. Pseudorandom generators for space-bounded computation. Combinatorica,
12(4):449–461, December 1992.
[96] L. O’Callaghan, A. Meyerson, R. Motwani, N. Mishra, and S. Guha. Streaming-data
algorithms for high-quality clustering. In Proceedings of the 18th International Con-
ference on Data Engineering (ICDE ’02), pages 685–696. IEEE Computer Society,
2002.
[97] S. Pandit and S. V. Pemmaraju. Return of the primal-dual: Distributed metric
facility location. In Proceedings of the 28th Annual ACM Symposium on Principles
of Distributed Computing (PODC ’09), pages 180–189. Association for Computing
Machinery, 2009.
[98] D. Peleg. Distributed Computing: A Locality-Sensitive Approach. SIAM Monographs
on Discrete Mathematics and Applications, 2000.
[99] C. Scheideler. Probabilistic Methods for Coordination Problems. Habilitation thesis,
University of Paderborn, 2000.
[100] S. Z. Selim and M. A. Ismail. k-means-type algorithms: A generalized convergence
theorem and characterization of local optimality. IEEE Transactions on Pattern
Analysis and Machine Intelligence (PAMI), 6(1):81–87, January 1984.
[101] D. B. Shmoys. Approximation algorithms for facility location problems. In Proceed-
ings of the 3rd International Workshop on Approximation Algorithms for Combina-
torial Optimization Problems (APPROX ’00), pages 27–33. Springer, 2000.
[102] D. B. Shmoys, É. Tardos, and K. Aardal. Approximation algorithms for facility
location problems. In Proceedings of the 29th Annual ACM Symposium on Theory
of Computing (STOC ’97), pages 265–274. Association for Computing Machinery,
1997.
[103] M. H. M. Smid. The well-separated pair decomposition and its applications. In
T. F. Gonzalez, editor, Handbook of Approximation Algorithms and Metaheuristics.
Chapman & Hall/CRC, 2007.
[104] M. Sviridenko. An improved approximation algorithm for the metric uncapacitated
facility location problem. In Proceedings of the 9th International Conference on
Integer Programming and Combinatorial Optimization (IPCO ’02), pages 240–257.
Springer, 2002.
[105] K. Talwar. Bypassing the embedding: Algorithms for low dimensional metrics.
In Proceedings of the 36th Annual ACM Symposium on Theory of Computing
(STOC ’04), pages 281–290. Association for Computing Machinery, 2004.
[106] M. Thorup. Quick k-median, k-center, and facility location for sparse graphs. SIAM
Journal on Computing, 34(2):405–432, 2004.
[107] B. Von Herzen and A. H. Barr. Accurate triangulations of deformed, intersecting
surfaces. In Proceedings of the 14th Annual Conference on Computer Graphics and
Interactive Techniques (SIGGRAPH ’87), pages 103–110. Association for Computing
Machinery, 1987.
[108] J. Vygen. Approximation algorithms for facility location problems (Lecture notes).
Technical Report 05950-OR, Research Institute for Discrete Mathematics, University
of Bonn, 2005. Available at http://www.or.uni-bonn.de/~vygen/files/fl.pdf.
[109] D. E. Willard and G. S. Lueker. Adding range restriction capability to dynamic data
structures. Journal of the ACM (JACM), 32(3):597–617, July 1985.
[110] B. Yao, F. Li, and P. Kumar. Reverse furthest neighbors in spatial databases. In Pro-
ceedings of the 25th IEEE International Conference on Data Engineering (ICDE ’09),
pages 664–675. IEEE Computer Society, 2009.
[111] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: A new data clustering algorithm
and its applications. Data Mining and Knowledge Discovery, 1(2):141–182, June
1997.