New Streaming Algorithms for High Dimensional EMD and MST
Abstract
We study streaming algorithms for two fundamental geometric problems: computing the cost of a Minimum Spanning Tree (MST) of an $n$-point set $X \subset \{1,2,\dots,\Delta\}^d$, and~computing the Earth Mover Distance (EMD) between two multi-sets $A,B \subset \{1,2,\dots,\Delta\}^d$ of size $n$. We~consider the turnstile model, where points can be added and removed. We give a one-pass streaming algorithm for MST and a two-pass streaming algorithm~for EMD, both achieving an approximation factor of $\tilde{O}(\log n)$ and using $\polylog(n,d,\Delta)$-space only. Furthermore, our algorithm for EMD can be compressed to a single pass using $\Delta d \cdot \polylog( n,d)$-space whenever ${|A \cap B|}/{|A \cup B|} \leq 1-\Omega(1)$. Previously, the best known sublinear-space streaming algorithms for either problem achieved an approximation of $O(\min\{ \log n , \log (\Delta d)\} \log n)$ \cite{AIK08, BDIRW20}. For MST, we also prove that any constant space streaming algorithm can only achieve an approximation of $\Omega(\log n)$, analogous to the $\Omega(\log n)$ lower bound for EMD of \cite{AIK08}.
Our algorithms are based on an improved analysis of a recursive space partitioning method known generically as the Quadtree. Specifically, we show that the Quadtree achieves an $\tilde{O}(\log n)$ approximation for both EMD and MST, improving on the $O(\min\{ \log n , \log (\Delta d)\} \log n)$ approximation of \cite{AIK08, BDIRW20}. %The main conceptual contribution is an analytical framework for studying the quadtree which goes beyond worst-case distortion of randomized tree embeddings.
Our algorithms are based on an improved analysis of a recursive space partitioning method known generically as the Quadtree. Specifically, we show that the Quadtree achieves an $\tilde{O}(\log n)$ approximation for both EMD and MST, improving on the $O(\min\{ \log n , \log (\Delta d)\} \log n)$ approximation of \cite{AIK08, BDIRW20}. %The main conceptual contribution is an analytical framework for studying the quadtree which goes beyond worst-case distortion of randomized tree embeddings.