Hadoop、Storm、Samza、Spark 和 Flink：大数据框架比较

发表 admin at 2024年8月26日

类别

未分类

标签

介绍

大数据是一个统称，指的是收集、组织、处理和从大型数据集中获取见解所需的非传统策略和技术。虽然处理超出单台计算机的计算能力或存储能力的数据的问题并不新鲜，但近年来，这种计算的普及性、规模和价值已大大扩展。

在之前的指南中，我们讨论了大数据系统中使用的一些一般概念、处理阶段和术语。在本文中，我们将介绍大数据系统最基本的组件之一：处理框架。处理框架通过从非易失性存储中读取数据或将数据输入系统来计算系统中的数据。数据计算是从大量单个数据点中提取信息和见解的过程。

我们将介绍以下框架：

仅限批处理的框架：
- Apache Hadoop
仅流框架：
- 阿帕奇风暴
- 阿帕奇 Samza
混合框架：
- Apache Spark
- Apache Flink

什么是大数据处理框架？

处理框架和处理引擎负责对数据系统中的数据进行计算。虽然没有权威的定义来区分“引擎”和“框架”，但有时将前者定义为负责对数据进行操作的实际组件，将后者定义为一组旨在执行相同操作的组件是有用的。

例如，Apache Hadoop可以被视为一个处理框架，其默认处理引擎是MapReduce。引擎和框架通常可以互换或串联使用。例如，另一个框架Apache Spark可以挂接到 Hadoop 中以替换 MapReduce。组件之间的互操作性是大数据系统具有极大灵活性的原因之一。

虽然处理数据生命周期这一阶段的系统可能很复杂，但从广义上讲，其目标非常相似：对数据进行操作以增加理解、表面模式并深入了解复杂的交互。

为了简化对这些组件的讨论，我们将根据这些处理框架所设计处理的数据的状态对其进行分组。有些系统以批量方式处理数据，而有些系统则在数据流入系统时以连续流的方式处理数据。还有一些系统可以采用这两种方式处理数据。

在深入探讨各种实现的具体细节和后果之前，我们将首先介绍每种类型的处理概念。

批处理系统

批处理在大数据领域有着悠久的历史。批处理涉及对大型静态数据集进行操作，并在计算完成后稍后返回结果。

批处理中的数据集通常是……

有界：批量数据集代表有限的数据集合
持久性：数据几乎总是由某种类型的永久存储支持
大：批处理操作通常是处理极大数据集的唯一选择

批处理非常适合需要访问完整记录集的计算。例如，在计算总数和平均值时，必须整体处理数据集，而不是将其视为单个记录的集合。这些操作要求在计算期间保持状态。

需要大量数据的任务通常最好通过批处理操作来处理。无论数据集是直接从永久存储中处理还是加载到内存中，批处理系统都是为大量数据而构建的，并且拥有处理这些数据的资源。由于批处理擅长处理大量持久数据，因此它经常用于处理历史数据。

处理大量数据的代价是更长的计算时间。因此，在处理时间特别重要的场合，批处理并不合适。

Apache Hadoop

Apache Hadoop 是一个专门提供批处理的处理框架。Hadoop 是第一个在开源社区中获得广泛关注的大数据框架。基于 Google 当时关于如何处理大量数据的几篇论文和演示文稿，Hadoop 重新实现了算法和组件堆栈，使大规模批处理更容易实现。

Hadoop 的现代版本由多个组件或层组成，它们协同处理批量数据：

HDFS：HDFS 是分布式文件系统层，用于协调集群节点之间的存储和复制。HDFS 可确保即使主机发生不可避免的故障，数据仍然可用。它用作数据源、存储中间处理结果以及保存最终计算结果。
YARN：YARN 代表“又一个资源协商器”，是 Hadoop 堆栈的集群协调组件。它负责协调和管理底层资源以及调度要运行的作业。YARN 充当集群资源的接口，使在 Hadoop 集群上运行比早期迭代更加多样化的工作负载成为可能。
MapReduce：MapReduce 是 Hadoop 原生的批处理引擎。

批处理模型

Hadoop 的处理功能来自 MapReduce 引擎。MapReduce 的处理技术遵循使用键值对的 map、shuffle、reduce 算法。基本过程包括：

从 HDFS 文件系统读取数据集
将数据集分成块并分布在可用节点之间
将每个节点上的计算应用于数据子集（中间结果写回到 HDFS）
将中间结果重新分配到按键分组
通过汇总和组合各个节点计算的结果来“减少”每个键的值
将计算的最终结果写回HDFS

优点和局限性

由于这种方法大量利用永久存储，每个任务多次读写，因此速度往往相当慢。另一方面，由于磁盘空间通常是最丰富的服务器资源之一，这意味着 MapReduce 可以处理大量数据集。这也意味着 Hadoop 的 MapReduce 通常可以在比某些替代方案更便宜的硬件上运行，因为它不会尝试将所有内容存储在内存中。MapReduce 具有令人难以置信的可扩展性潜力，已在数万个节点上投入生产。

作为开发目标，MapReduce 以学习难度高而闻名。Hadoop 生态系统中的其他新增功能可以不同程度地降低这种影响，但它仍然是在 Hadoop 集群上快速实现想法的一个因素。

Hadoop 拥有广泛的生态系统，Hadoop 集群本身经常被用作其他软件的构建块。许多其他处理框架和引擎都集成了 Hadoop，以利用 HDFS 和 YARN 资源管理器。

概括

Apache Hadoop 及其 MapReduce 处理引擎提供了经过充分测试的批处理模型，最适合处理时间不是重要因素的超大数据集。运行良好的 Hadoop 集群所需的组件成本低廉，使这种处理在许多用例中既便宜又有效。与其他框架和引擎的兼容性和集成意味着 Hadoop 通常可以作为使用多种技术的多种处理工作负载的基础。

流处理系统

流处理系统在数据进入系统时对其进行计算。这需要与批处理范式不同的处理模型。流处理器不会定义应用于整个数据集的操作，而是定义将应用于通过系统的每个单独数据项的操作。

流处理中的数据集被认为是“无界的”。这有几个重要的含义：

总数据集仅定义为迄今为止进入系统的数据量。
工作数据集可能更相关，并且每次仅限于一个项目。
处理是基于事件的，除非明确停止，否则不会“结束”。结果可立即获得，并将随着新数据的到来而不断更新。

流处理系统可以处理几乎无限量的数据，但它们每次只能处理一个（真正的流处理）或很少的（微批处理）项目，记录之间仅保持最少的状态。虽然大多数系统都提供了保持某些状态的方法，但流处理经过高度优化，可实现更具功能性的处理，且副作用很少。

函数式操作专注于具有有限状态或副作用的离散步骤。对同一数据执行相同操作将产生相同的输出，而不受其他因素的影响。这种处理非常适合流，因为项目之间的状态通常是困难、有限且有时不受欢迎的组合。因此，虽然通常可以进行某种类型的状态管理，但如果没有这些状态管理，这些框架会更简单、更高效。

这种类型的处理适合某些类型的工作负载。流式处理模型非常适合具有近乎实时要求的处理。分析、服务器或应用程序错误日志记录以及其他基于时间的指标是自然选择，因为对这些领域的变化做出反应对业务功能至关重要。流式处理非常适合您必须对变化或峰值做出反应的数据，以及您对随时间变化的趋势感兴趣的数据。

阿帕奇风暴

Apache Storm 是一个专注于极低延迟的流处理框架，可能是需要近乎实时处理的工作负载的最佳选择。它可以处理大量数据，并以比其他解决方案更低的延迟提供结果。

流处理模型

Storm 流处理通过在称为拓扑的框架中编排 DAG（有向无环图）来工作。这些拓扑描述了进入系统时将对每个传入数据采取的各种转换或步骤。

拓扑结构由以下部分组成：

流：传统数据流。这是连续到达系统的无界数据。
Spouts：拓扑边缘的数据流来源。这些可以是生成要操作的数据的 API、队列等。
Bolts：Bolts 表示一个处理步骤，它使用流、对其应用操作并将结果输出为流。Bolts 连接到每个 spout，然后相互连接以安排所有必要的处理。在拓扑的末尾，最终的 Bolt 输出可用作连接系统的输入。

Storm 背后的理念是使用上述组件定义小型、离散的操作，然后将它们组合成拓扑。默认情况下，Storm 提供至少一次处理保证，这意味着它可以保证每条消息至少被处理一次，但在某些故障情况下可能会出现重复。Storm 不保证消息将按顺序处理。

为了实现精确一次、有状态的处理，还可以使用称为Trident的抽象。明确地说，没有 Trident 的 Storm 通常被称为核心 Storm。 Trident 显著改变了 Storm 的处理动态，增加了延迟，为处理添加了状态，并实现了微批处理模型，而不是逐项纯流式系统。

Storm 用户通常建议尽可能使用 Core Storm 来避免这些惩罚。考虑到这一点，Trident 保证只处理一次项目在系统无法智能处理重复消息的情况下非常有用。当您需要在项目之间保持状态时，Trident 也是 Storm 中的唯一选择，例如计算一小时内有多少用户点击了链接。Trident 为 Storm 提供了灵活性，尽管它没有发挥框架的天然优势。

Trident 拓扑由以下部分组成：

流批次：这些是流数据的微批次，它们被分块以提供批处理语义。
操作：这些是可以对数据执行的批处理程序。

优点和局限性

Storm 可能是目前可用于近实时处理的最佳解决方案。它能够以极低的延迟处理必须以最小延迟处理的工作负载的数据。当处理时间直接影响用户体验时，Storm 通常是一个不错的选择，例如当处理反馈直接反馈到网站上的访问者页面时。

Storm 搭配 Trident 可让您选择使用微批次而不是纯流处理。虽然这为用户提供了更大的灵活性，可以根据预期用途塑造工具，但也往往会抵消该软件相对于其他解决方案的一些最大优势。话虽如此，选择流处理样式仍然很有帮助。

Core Storm 不提供消息的排序保证。Core Storm 提供至少一次处理保证，这意味着可以保证处理每条消息，但可能会出现重复。Trident 提供精确一次保证，并且可以在批次之间提供排序，但不能在批次内提供排序。

在互操作性方面，Storm 可以与 Hadoop 的 YARN 资源协商器集成，从而轻松连接到现有的 Hadoop 部署。与大多数处理框架相比，Storm 具有非常广泛的语言支持，为用户提供了许多定义拓扑的选项。

概括

对于具有非常严格的延迟要求的纯流处理工作负载，Storm 可能是最好的成熟选择。它可以保证消息处理，并且可以与大量编程语言一起使用。由于 Storm 不进行批处理，因此如果您需要这些功能，则必须使用其他软件。如果您强烈需要一次性处理保证，Trident 可以提供。但是，其他流处理框架在这一点上可能也更合适。

阿帕奇 Samza

Apache Samza 是一个与 Apache Kafka 消息传递系统紧密相关的流处理框架。虽然许多流处理系统都可以使用 Kafka，但 Samza 的设计专门是为了利用 Kafka 独特的架构和保证。它使用 Kafka 提供容错、缓冲和状态存储。

Samza 使用 YARN 进行资源协商。这意味着默认情况下需要 Hadoop 集群（至少需要 HDFS 和 YARN），但这也意味着 Samza 可以依赖 YARN 内置的丰富功能。

流处理模型

Samza 依赖 Kafka 的语义来定义处理流的方式。Kafka 在处理数据时使用以下概念：

主题：进入 Kafka 系统的每个数据流称为主题。主题基本上是消费者可以订阅的相关信息流。
分区：为了在节点之间分配主题，Kafka 将传入的消息划分为分区。分区划分基于密钥，这样可以保证将具有相同密钥的每条消息发送到同一分区。分区具有保证的顺序。
代理：组成 Kafka 集群的各个节点称为代理。
生产者：任何写入 Kafka 主题的组件都称为生产者。生产者提供用于对主题进行分区的密钥。
消费者：消费者是从 Kafka 主题读取数据的任何组件。消费者负责维护有关其自身偏移量的信息，以便在发生故障时知道哪些记录已被处理。

因为 Kafka 代表不可变日志，所以 Samza 处理不可变流。这意味着任何转换都会创建新的流，这些流会被其他组件使用，而不会影响初始流。

优点和局限性

乍一看，Samza 对 Kafka 类队列系统的依赖似乎有些限制。然而，它为系统提供了一些其他流处理系统中不常见的独特保证和功能。

例如，Kafka 已经提供了可以低延迟访问的复制数据存储。它还为每个单独的数据分区提供了一个非常简单且便宜的多订阅者模型。所有输出（包括中间结果）也写入 Kafka，并可由下游阶段独立使用。

在很多方面，这种对 Kafka 的严格依赖反映了 MapReduce 引擎频繁引用 HDFS 的方式。虽然在每次计算之间引用 HDFS 会导致批处理时出现一些严重的性能问题，但它解决了流处理时的许多问题。

Samza 与 Kafka 的紧密联系使得处理步骤本身可以非常松散地联系在一起。可以将任意数量的订阅者添加到任何步骤的输出中，而无需事先协调。这对于多个团队可能需要访问类似数据的组织非常有用。团队可以全部订阅进入系统的数据主题，或者可以轻松订阅其他经过某些处理的团队创建的主题。这可以在不给数据库等负载敏感型基础设施增加额外压力的情况下完成。

直接写入 Kafka 还可以消除背压问题。背压是指负载峰值导致数据流入的速度超过组件实时处理的速度，从而导致处理停滞并可能造成数据丢失。Kafka 旨在长时间保存数据，这意味着组件可以随时处理，并且可以重新启动而不会产生任何后果。

Samza 能够使用以本地键值存储形式实现的容错检查点系统来存储状态。这样 Samza 就可以提供至少一次交付保证，但在发生故障时无法准确恢复聚合状态（如计数），因为数据可能会交付多次。

Samza 提供高级抽象，在许多方面比 Storm 等系统提供的原语更易于使用。Samza 目前仅支持 JVM 语言，这意味着它不具备与 Storm 相同的语言灵活性。

概括

对于已经可以使用 Hadoop 和 Kafka 或易于实施的流式工作负载，Apache Samza 是一个不错的选择。Samza 本身非常适合拥有多个团队的组织，这些团队在各个处理阶段使用（但不一定紧密协调）数据流。Samza 大大简化了流处理的许多部分，并提供了低延迟性能。如果部署要求与您当前的系统不兼容，如果您需要极低的延迟处理，或者如果您对精确一次语义有强烈的需求，那么它可能不是一个很好的选择。

混合处理系统：批处理器和流处理器

一些处理框架可以同时处理批处理和流处理工作负载。这些框架允许对两种类型的数据使用相同或相关的组件和 API，从而简化了不同的处理要求。

如您所见，我们将讨论的两个框架 Spark 和 Flink 实现这一点的方式存在很大差异。这在很大程度上取决于两种处理范式如何结合在一起，以及对固定和非固定数据集之间关系的假设。

虽然专注于一种处理类型的项目可能非常适合特定用例，但混合框架试图为数据处理提供通用解决方案。它们不仅提供处理数据的方法，还有自己的集成、库和工具，用于执行图形分析、机器学习和交互式查询等操作。

Apache Spark

Apache Spark 是具有流处理功能的下一代批处理框架。Spark 使用与 Hadoop 的 MapReduce 引擎相同的许多原理构建而成，主要致力于通过提供完整的内存计算和处理优化来加速批处理工作负载。

Spark 可以作为独立集群部署（如果与功能强大的存储层配对），也可以作为 MapReduce 引擎的替代品加入 Hadoop。

批处理模型

与 MapReduce 不同，Spark 在内存中处理所有数据，仅与存储层交互，最初将数据加载到内存中，最后将最终结果持久化。所有中间结果都在内存中管理。

虽然内存处理对速度有很大贡献，但 Spark 在处理磁盘相关任务时也更快，因为它可以通过提前分析整个任务集来实现整体优化。它通过创建有向无环图 ( DAG)来实现这一点，这些图表示必须执行的所有操作、要操作的数据以及它们之间的关系，从而使处理器能够更智能地协调工作。

为了实现内存批量计算，Spark 使用一种称为弹性分布式数据集 ( RDD)的模型来处理数据。这些是存在于内存中的不可变结构，代表数据集合。对 RDD 的操作会产生新的 RDD。每个 RDD 都可以通过其父 RDD 追溯其谱系，最终追溯到磁盘上的数据。本质上，RDD 是 Spark 保持容错能力的一种方式，无需在每次操作后写回磁盘。

流处理模型

流处理功能由 Spark Streaming 提供。Spark 本身的设计考虑了面向批处理的工作负载。为了解决引擎设计与流式工作负载特征之间的差异，Spark 实现了一个称为微批处理的概念*。此策略旨在将数据流视为一系列非常小的批处理，可以使用批处理引擎的本机语义进行处理。

Spark Streaming works by buffering the stream in sub-second increments. These are sent as small fixed datasets for batch processing. In practice, this works fairly well, but it does lead to a different performance profile than true stream processing frameworks.

Advantages and Limitations

The obvious reason to use Spark over Hadoop MapReduce is speed. Spark can process the same datasets significantly faster due to its in-memory computation strategy and its advanced DAG scheduling.

Another of Spark’s major advantages is its versatility. It can be deployed as a standalone cluster or integrated with an existing Hadoop cluster. It can perform both batch and stream processing, letting you operate a single cluster to handle multiple processing styles.

Beyond the capabilities of the engine itself, Spark also has an ecosystem of libraries that can be used for machine learning, interactive queries, etc. Spark tasks are almost universally acknowledged to be easier to write than MapReduce, which can have significant implications for productivity.

Adapting the batch methodology for stream processing involves buffering the data as it enters the system. The buffer allows it to handle a high volume of incoming data, increasing overall throughput, but waiting to flush the buffer also leads to a significant increase in latency. This means that Spark Streaming might not be appropriate for processing where low latency is imperative.

Since RAM is generally more expensive than disk space, Spark can cost more to run than disk-based systems. However, the increased processing speed means that tasks can complete much faster, which may completely offset the costs when operating in an environment where you pay for resources hourly.

One other consequence of the in-memory design of Spark is that resource scarcity can be an issue when deployed on shared clusters. In comparison to Hadoop’s MapReduce, Spark uses significantly more resources, which can interfere with other tasks that might be trying to use the cluster at the time. In essence, Spark might be a less considerate neighbor than other components that can operate on the Hadoop stack.

Summary

Spark is a great option for those with diverse processing workloads. Spark batch processing offers incredible speed advantages, trading off high memory usage. Spark Streaming is a good stream processing solution for workloads that value throughput over latency.

Apache Flink

Apache Flink is a stream processing framework that can also handle batch tasks. It considers batches to simply be data streams with finite boundaries, and thus treats batch processing as a subset of stream processing. This stream-first approach to all processing has a number of interesting side effects.

This stream-first approach has been called the Kappa architecture, in contrast to the more widely known Lambda architecture (where batching is used as the primary processing method with streams used to supplement and provide early but unrefined results). Kappa architecture, where streams are used for everything, simplifies the model and has only recently become possible as stream processing engines have grown more sophisticated.

Stream Processing Model

Flink’s stream processing model handles incoming data on an item-by-item basis as a true stream. Flink provides its DataStream API to work with unbounded streams of data. The basic components that Flink works with are:

Streams are immutable, unbounded datasets that flow through the system
Operators are functions that operate on data streams to produce other streams
Sources are the entry point for streams entering the system
Sinks are the place where streams flow out of the Flink system. They might represent a database or a connector to another system

Stream processing tasks take snapshots at set points during their computation to use for recovery in case of problems. For storing state, Flink can work with a number of state backends depending with varying levels of complexity and persistence.

Additionally, Flink’s stream processing is able to understand the concept of “event time”, meaning the time that the event actually occurred, and can handle sessions as well. This means that it can guarantee ordering and grouping in some interesting ways.

Batch Processing Model

Flink’s batch processing model in many ways is just an extension of the stream processing model. Instead of reading from a continuous stream, it reads a bounded dataset off of persistent storage as a stream. Flink uses the exact same runtime for both of these processing models.

Flink offers some optimizations for batch workloads. For instance, since batch operations are backed by persistent storage, Flink removes snapshotting from batch loads. Data is still recoverable, but normal processing completes faster.

Another optimization involves breaking up batch tasks so that stages and components are only involved when needed. This helps Flink play well with other users of the cluster. Preemptive analysis of the tasks gives Flink the ability to also optimize by seeing the entire set of operations, the size of the data set, and the requirements of steps coming down the line.

Advantages and Limitations

Flink is currently a unique option in the processing framework world. While Spark performs batch and stream processing, its streaming is not appropriate for many use cases because of its micro-batch architecture. Flink’s stream-first approach offers low latency, high throughput, and real entry-by-entry processing.

Flink manages many things by itself. Somewhat unconventionally, it manages its own memory instead of relying on the native Java garbage collection mechanisms for performance reasons. Unlike Spark, Flink does not require manual optimization and adjustment when the characteristics of the data it processes change. It handles data partitioning and caching automatically as well.

Flink analyzes its work and optimizes tasks in a number of ways. Part of this analysis is similar to what SQL query planners do within relationship databases, mapping out the most effective way to implement a given task. It is able to parallelize stages that can be completed in parallel, while bringing data together for blocking tasks. For iterative tasks, Flink attempts to do computation on the nodes where the data is stored for performance reasons. It can also do “delta iteration”, or iteration on only the portions of data that have changes.

In terms of user tooling, Flink offers a web-based scheduling view to easily manage tasks and view the system. Users can also display the optimization plan for submitted tasks to see how it will actually be implemented on the cluster. For analysis tasks, Flink offers SQL-style querying, graph processing and machine learning libraries, and in-memory computation.

Flink operates well with other components. It is written to be a good neighbor if used within a Hadoop stack, taking up only the necessary resources at any given time. It integrates with YARN, HDFS, and Kafka easily. Flink can run tasks written for other processing frameworks like Hadoop and Storm with compatibility packages.

One of the largest drawbacks of Flink at the moment is that it is still a very young project. Large scale deployments in the wild are still not as common as other processing frameworks and there hasn’t been much research into Flink’s scaling limitations. With the rapid development cycle and features like the compatibility packages, there may begin to be more Flink deployments as organizations get the chance to experiment with it.

Summary

Flink offers both low latency stream processing with support for traditional batch tasks. Flink is probably best suited for organizations that have heavy stream processing requirements and some batch-oriented tasks. Its compatibility with native Storm and Hadoop programs, and its ability to run on a YARN-managed cluster can make it easy to evaluate. Its rapid development makes it worth keeping an eye on.

Conclusion

There are plenty of options for processing within a big data system.

For batch-only workloads that are not time-sensitive, Hadoop is a good choice that is likely less expensive to implement than some other solutions.

对于纯流式工作负载，Storm 具有广泛的语言支持，可以提供非常低延迟的处理，但可能会提供重复项，并且无法保证其默认配置下的排序。Samza 与 YARN 和 Kafka 紧密集成，以提供灵活性、易于多团队使用以及直接的复制和状态管理。

对于混合工作负载，Spark 提供高速批处理和微批处理，用于流式处理。它具有广泛的支持、集成的库和工具以及灵活的集成。Flink 提供真正的流处理和批处理支持。它经过大量优化，可以运行为其他平台编写的任务，并提供低延迟处理，但仍处于采用的早期阶段。

最适合您情况的解决方案在很大程度上取决于要处理的数据的状态、您的要求的时间限制以及您感兴趣的结果类型。在实施一体化解决方案和开展紧密相关的项目之间存在权衡，在评估新的创新解决方案与成熟且经过充分测试的解决方案时也需要考虑类似的因素。