Spinning Fast Iterative Data Flows

Notes from the whitepaper "Spinning Fast Iterative Data Flows (PDF)".

On iterative computation types:

We refer to the recomputed state as the partial solution of the iteration, and henceforth distinguish between two different kinds of iterations: bulk iterations, and incremental iterations.

On 'classes of operators' in a dataflow system:

It is interesting, however, to distinguish between certain classes of operators. First, we distinguish between operators that produce output by consuming one record, called record-at-a-time operators, and operators that need to consume multiple records before producing output, called group-at-a-time operators

On encoding a 'partial solution update' mutation in a dataflow:

In imperative programming, updating the partial solution is achievable by [...] passing a reference to the state of the partial solution and modifying that shared state. Dataflow programs (like functional programs) require that the operators/functions are side effect free. [...] We hence express an update of a record in the partial solution through the replacement of that record.

On why a dataflow operator/function needs to be side effect free:

An intransparent side effect would void the possibility of automatic parallelization, which is one of the main reasons to use dataflow programming for large scale analytics.

On the applicability of iterative dataflow:

Every algorithm that can be expressed via a message-passing interface can also be expressed as an incremental iteration.

On the fundamental optimization of the incremental iteration:

By computing a delta set instead of the next partial solution, we can achieve that the iteration returns fewer records when fewer changes need to be made to the partial solution.

On characteristics of the step function:

  • Since ∆ must return two data sets, it is necessarily a non-tree DAG.

  • Because the delta set D is the result of a dataflow operator, it is naturally an unordered bag of records, rather than a set. D may hence contain multiple different records that are identified by the same key, and would replace the same record in the current partial solution.

  • we allow the optional definition of a comparator for the data type of S. Whenever a record in S is to be replaced by a record in D, the comparator establishes an order among the two records.

On Pregel:

  • Pregel is a graph processing adoption of bulk synchronous parallel processing. Programs directly model a graph, where vertices hold state and send messages to other vertices along the edges. By receiving messages, vertices update their state.

  • It is straightforward to implement Pregel on top of Stratosphere’s iterative abstraction: the partial solution holds the state of the vertices, the workset holds the messages. The step function ∆ updates the vertex state from the messages in the working set and creates new update messages.

  • Pregel combines vertex activation with messaging, while incremental iterations give you the freedom to separate these aspects