Streaming Euclidean k-median and k-means to a (1 + ε)-approximation with o_{k,ε}(log n) Memory Words

David P. Woodruff
Samson Zhou
FOCS 2023 (to appear)

Abstract

We consider the classic Euclidean k-median and k-means objectives on insertion-only streams,
where the goal is to maintain a (1 + ε)-approximation to the optimal k-median or k-means cost, while using
as little memory as possible. Over the last 20 years, clustering in data streams has received a
tremendous amount of attention and has been the test-bed for a large variety of new techniques,
including coresets, merge-and-reduce, bicriteria approximation, and sensitivity sampling. Despite
this intense effort to obtain smaller and smaller sketches for these problems, all known techniques
require storing at least Ω(log(n∆)) words of memory, where n is the size of the input and ∆ is
the aspect ratio. In this paper, we show how to finally bypass this bound and obtain the first
algorithm that achieves a (1 + ε)-approximation to the more general (k, z)-clustering problem,
using only Õ(dk/ε^2) · 2^{z log z} · min(1/ε^z, k) · poly(log log(n∆)) words of memory.
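
For concreteness, the (k, z)-clustering problem referenced above is the standard generalization of both objectives (we write cost_z below as shorthand; this notation is ours, not the abstract's): given an input set X ⊆ ℝ^d of n points, the goal is to choose a set C of k centers minimizing

    cost_z(X, C) = Σ_{x ∈ X} min_{c ∈ C} ‖x − c‖_2^z,

so that z = 1 recovers k-median and z = 2 recovers k-means.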