The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

Tyler Akidau

Robert Bradshaw

Craig Chambers

Slava Chernyak

Rafael J. Fernández-Moctezuma

Reuven Lax

Sam McVeety

Daniel Mills

Frances Perry

Eric Schmidt

Sam Whittle

Proceedings of the VLDB Endowment, 8 (2015), pp. 1792-1803


Abstract

Unbounded, unordered, global-scale datasets are increasingly
common in day-to-day business (e.g. Web logs, mobile
usage statistics, and sensor networks). At the same time,
consumers of these datasets have evolved sophisticated requirements,
such as event-time ordering and windowing by
features of the data themselves, in addition to an insatiable
hunger for faster answers. Meanwhile, practicality dictates
that one can never fully optimize along all dimensions of correctness,
latency, and cost for these types of input. As a result,
data processing practitioners are left with the quandary
of how to reconcile the tensions between these seemingly
competing propositions, often resulting in disparate implementations
and systems.

We propose that a fundamental shift of approach is necessary
to deal with these evolved requirements in modern
data processing. We as a field must stop trying to groom unbounded
datasets into finite pools of information that eventually
become complete, and instead live and breathe under
the assumption that we will never know if or when we have
seen all of our data, only that new data will arrive, old data
may be retracted, and the only way to make this problem
tractable is via principled abstractions that allow the practitioner
the choice of appropriate tradeoffs along the axes of
interest: correctness, latency, and cost.

In this paper, we present one such approach, the Dataflow
Model, along with a detailed examination of the semantics
it enables, an overview of the core principles that guided its
design, and a validation of the model itself via the real-world
experiences that led to its development.
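
As a rough illustration of the event-time windowing mentioned above, the following Python sketch assigns out-of-order records to fixed windows by their event timestamps rather than by their arrival order. The function name, the 60-second window size, and the sample records are assumptions made for illustration only; this is not the Dataflow Model's API.

```python
from collections import defaultdict


def fixed_window_sums(records, window_size):
    """Sum values per fixed event-time window.

    records: iterable of (event_time_seconds, value) pairs in arrival order.
    window_size: window width in seconds (hypothetical parameter).
    """
    sums = defaultdict(int)
    for event_time, value in records:
        # The window is chosen from the event time, not the arrival position.
        window_start = (event_time // window_size) * window_size
        sums[window_start] += value
    return dict(sums)


# Records listed in arrival order; their event times are out of order.
arrivals = [(65, 3), (10, 5), (70, 1), (59, 2), (125, 4)]
result = fixed_window_sums(arrivals, 60)
print(result)
```

Because each record is keyed by its own timestamp, a late arrival such as `(59, 2)` still lands in the window starting at 0. A streaming system must additionally decide when to emit each window's result, which is where the tradeoff among correctness, latency, and cost arises.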

Research Areas

Distributed Systems and Parallel Computing
