Spark Versus Flink: Understanding Performance in Big Data Analytics Frameworks

Notes from the whitepaper "Spark Versus Flink: Understanding Performance in Big Data Analytics Frameworks".

On how the core innovation of Spark — RDDs — actually contribute to an improvement over Hadoop/MapReduce:

Resilient Distributed Datasets (RDDs), a set of in-memory data structures able to cache intermediate data across a set of nodes, in order to efficiently support iterative algorithms.

On RDD data provenance:

Spark is built on top of RDDs (read-only, resilient collections of objects partitioned across multiple nodes) that hold provenance information (referred to as lineage) and can be rebuilt in case of failures by partial recomputation from ancestor RDDs.

On configuring task parallelism (for a given workload):

Spark’s default parallelism parameter (spark.def.parallelism) refers to the default number of partitions in the RDDs returned by various transformations

Flink’s default parallelism parameter (flink.def.parallelism) allows to use all the available execution resources

in Flink the partitioning of data is hidden from the user

On shuffle tuning:

The default buffer size is pre-configured in both frameworks to 32 KB, but it can be increased on systems with more memory, leading to less spilling to disk and better results.

On configuration for performance tuning and fault tolerance:

Spark needs a careful parameter configuration (for parallelism, partitions etc.), which is highly dependent on the dataset, in order to obtain an optimal performance. In Flink’s case, one needs to make sure that enough memory is allocated so that its CoGroup operator that builds the solution set in memory could be successfully executed.

On operationalizing these engines:

While extensive research efforts have been dedicated to optimize the execution of MapReduce based frameworks, there has been relatively less progress on identifying, analyzing and understanding the performance issues of more recent data analytics frameworks like Spark and Flink.

On performance evaluation:

Overall, most of the previous work typically focuses on some specific low-level issues of big data frameworks that are not necessarily well correlated with the higher level design. It is precisely this gap that we aim to address in this work by linking bottlenecks observed through parameter configuration and low level resource utilization with high-level behavior in order to better understand performance.

On separating pipeline models from engines:

Identifying and understanding the impact of the different architectural components and parameter settings on the resource usage and, ultimately, on performance, could trigger a new wave of research on data analytics frameworks. Apache Beam [47] is such an example, proposing a unified framework based on the Dataflow model [48], that can be used to execute data processing pipelines on separated distributed engines like Spark and Flink.