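Although I've come across Kafka before, I only recently realized that Kafka might be usable as (the basis of) a CQRS event store.

Some of the main points that Kafka supports:

- Event capturing/storing, all HA of course.
- Pub/sub architecture.
- Ability to replay the event log, which allows new subscribers to register with the system after the fact.

Admittedly, I'm not 100% versed in CQRS / event sourcing, but this seems pretty close to what an event store should be. Funny thing is: I really can't find that much about Kafka being used as an event store, so perhaps I'm missing something.

So, is anything missing from Kafka for it to be a good event store? Would it work? Is anyone using it in production? Interested in insight, links, etc.

Basically, the state of the system is saved based on the transactions/events the system has ever received, instead of just saving the current state/snapshot of the system, which is what is usually done. (Think of it as a general ledger in accounting: all transactions ultimately add up to the final state.) This allows all kinds of cool things, but just read up on the links provided.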
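To make the ledger analogy concrete, here is a minimal sketch of rebuilding state by replaying events. The `Transaction` shape, the account names, and the amounts are all made up for illustration; nothing here is specific to Kafka.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Transaction:
    """One immutable ledger entry (hypothetical event shape)."""
    account: str
    amount: int  # positive = deposit, negative = withdrawal, in cents


def apply_tx(balances: dict, tx: Transaction) -> dict:
    """Return a new balances mapping with one transaction applied."""
    updated = dict(balances)
    updated[tx.account] = updated.get(tx.account, 0) + tx.amount
    return updated


def current_state(log: list) -> dict:
    """Rebuild the current state by replaying every event in order."""
    return reduce(apply_tx, log, {})


log = [
    Transaction("alice", 10_000),
    Transaction("bob", 5_000),
    Transaction("alice", -2_500),
]

state = current_state(log)
print(state)
```

The log is the only thing stored; the balances are derived on demand, so replaying a prefix of the log gives the state as of that point in time.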


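Current answer

I have been thinking about this Q&A, and I find that the existing answers are not nuanced enough, so I am adding this one.

TL;DR: Yes or no, depending on your event sourcing usage.

There are two primary kinds of event-sourced systems that I am aware of.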

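Downstream event processors = Yes

In this kind of system, events happen in the real world and are recorded as facts. An example is a warehouse system that keeps track of pallets of products. There are basically no conflicting events. Everything has already happened, even if it was wrong (e.g. pallet 123456 was put on truck A but was scheduled for truck B). Later, the facts are checked for exceptions via reporting mechanisms. Kafka seems well suited to this kind of downstream event-processing application.

In this context, it is understandable why Kafka folks advocate it as an event sourcing solution: it is quite similar to how it is already used for, say, click streams. However, people who use the term event sourcing (as opposed to stream processing) are likely referring to the second usage...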

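Application-controlled source of truth = No

This kind of application declares its own events as a result of user requests passing through business logic. Kafka does not work well in this case for two primary reasons.

Lack of entity isolation

This scenario needs the ability to load the event stream for a specific entity. The common reason is to build a transient write model for the business logic to use when processing the request. Doing this is impractical in Kafka. Topic-per-entity would allow it, but that is a non-starter when there may be thousands or millions of entities, because of technical limits in Kafka/ZooKeeper.

One of the main reasons to use a transient write model in this way is to make business logic changes cheap and easy to deploy.

Topic-per-type is recommended for Kafka instead, but this would require loading events for every entity of that type just to get the events for a single entity, since you cannot tell by log position which events belong to which entity. Even when using snapshots to start from a known log position, there could be a significant number of events to churn through if a logic change requires structural changes to the snapshot.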
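A rough illustration of the difference, using a plain Python list as a stand-in for a topic-per-type log (this is not Kafka client code, and the event shapes are invented):

```python
from collections import defaultdict

# Hypothetical topic-per-type log: every "order" event from every entity,
# interleaved in arrival order. The list index stands in for the offset.
orders_topic = [
    {"entity_id": "order-1", "type": "Created"},
    {"entity_id": "order-2", "type": "Created"},
    {"entity_id": "order-1", "type": "ItemAdded"},
    {"entity_id": "order-3", "type": "Created"},
    {"entity_id": "order-1", "type": "Submitted"},
]


def load_by_scanning(topic, entity_id):
    """Topic-per-type: read the whole log and discard other entities' events."""
    scanned = 0
    events = []
    for event in topic:
        scanned += 1
        if event["entity_id"] == entity_id:
            events.append(event)
    return events, scanned


def build_stream_index(topic):
    """What a per-entity stream store effectively keeps: events by entity."""
    index = defaultdict(list)
    for event in topic:
        index[event["entity_id"]].append(event)
    return index


events, scanned = load_by_scanning(orders_topic, "order-1")
index = build_stream_index(orders_topic)
print(len(events), "events found after scanning", scanned)
print(len(index["order-1"]), "events read directly from the entity's stream")
```

With five events the scan is harmless, but the number of records scanned grows with the whole type's history, while the per-entity read only grows with that one entity's history.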

Lack of conflict detection

Secondly, users can create race conditions due to concurrent requests against the same entity. It may be quite undesirable to save conflicting events and resolve them after the fact. So it is important to be able to prevent conflicting events. To scale request load, it is common to use stateless services while preventing write conflicts using conditional writes (only write if the last entity event was #x). A.k.a. Optimistic Concurrency. Kafka does not support optimistic concurrency. Even if it supported it at the topic level, it would need to be all the way down to the entity level to be effective. To use Kafka and prevent conflicting events, you would need to use a stateful, serialized writer (per "shard" or whatever is Kafka's equivalent) at the application level. This is a significant architectural requirement/restriction.
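For readers who have not seen a conditional write before, here is a toy in-memory sketch of the idea. The `InMemoryEventStore` class and its `append` signature are hypothetical, not any real library's API, and a real store would have to make the version check and the write atomic.

```python
class ConcurrencyError(Exception):
    """Raised when the stream moved on since the caller last read it."""


class InMemoryEventStore:
    """Toy per-entity event store with conditional appends (not a Kafka API)."""

    def __init__(self):
        self._streams = {}

    def load(self, stream_id):
        """Return (events, version); version is the number of events stored."""
        events = list(self._streams.get(stream_id, []))
        return events, len(events)

    def append(self, stream_id, expected_version, new_events):
        """Append only if the stream is still at expected_version."""
        stream = self._streams.setdefault(stream_id, [])
        if len(stream) != expected_version:
            raise ConcurrencyError(
                f"{stream_id}: expected version {expected_version}, "
                f"found {len(stream)}"
            )
        stream.extend(new_events)
        return len(stream)


store = InMemoryEventStore()
store.append("account-1", 0, ["Opened"])

# Two stateless request handlers read the same version before either writes.
_, seen_by_a = store.load("account-1")
_, seen_by_b = store.load("account-1")

store.append("account-1", seen_by_a, ["Withdrew 80"])  # first writer wins

try:
    store.append("account-1", seen_by_b, ["Withdrew 50"])
    conflict_detected = False
except ConcurrencyError:
    conflict_detected = True  # handler B must reload and re-run its logic

print(conflict_detected)
```

The point is that the check is per entity stream: handler B is rejected only because it raced on `account-1`, and writes to other streams are unaffected.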

Primary reason: fitness for the problem

Added 2021/09/29

Kafka is designed to solve giant-scale data problems. An app-controlled source of truth is a smaller-scale, in-depth solution. Using event sourcing to good effect requires crafting events and streams to match the business processes. This usually means a much higher level of detail than would be generally useful to at-scale consumers. Consider if your bank statement contained an entry for every step of a bank's internal transaction processes. A single deposit or withdrawal could have many entries before it is confirmed to your account. The bank needs that level of detail to process transactions. But it's mostly inscrutable bank jargon (domain-specific language) to you, unusable for reconciling your account. Instead, the bank publishes separate events for consumers. These are coarse-grained summaries of each completed transaction. These summary events are what consumers know as "transactions" on their bank statement.

When I asked myself the same question as the OP, I wanted to know if Kafka was a scaling option for event sourcing. But perhaps a better question is whether it makes sense for my event-sourced solution to operate at a giant scale. I can't speak to every case, but I think often it does not. When this scale enters the picture, like with the bank statement example, the granularity of events tends to be different. My event-sourced system should probably publish coarse-grained events to the Kafka cluster to feed at-scale consumers rather than use Kafka as internal storage.
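As a sketch of what "publish coarse-grained events" could look like, the function below collapses a made-up internal stream for one deposit into a single public summary. All event names and fields are invented for the example, and the actual publishing step (to Kafka or anything else) is left out.

```python
# Hypothetical fine-grained internal events for one deposit.
internal_stream = [
    {"type": "DepositRequested", "tx": "tx-42", "account": "acct-7", "amount": 12_500},
    {"type": "FundsHoldPlaced", "tx": "tx-42"},
    {"type": "FraudCheckPassed", "tx": "tx-42"},
    {"type": "FundsHoldReleased", "tx": "tx-42"},
    {"type": "DepositConfirmed", "tx": "tx-42"},
]


def summarize(stream):
    """Return one public summary event, or None if the transaction is not done."""
    requested = next((e for e in stream if e["type"] == "DepositRequested"), None)
    confirmed = any(e["type"] == "DepositConfirmed" for e in stream)
    if requested is None or not confirmed:
        return None
    return {
        "type": "TransactionCompleted",
        "tx": requested["tx"],
        "account": requested["account"],
        "amount": requested["amount"],
    }


public_event = summarize(internal_stream)      # the one event consumers would see
pending = summarize(internal_stream[:3])       # not confirmed yet, nothing to publish
print(public_event)
```

The internal stream stays in the service's own store with its full detail; only the summary would cross the service boundary.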

Scale can still be needed for event sourcing. Strategies differ depending on why. Often event streams have a "done" or "no-longer-useful" state. Archiving those streams is a good answer if event size/volume is a problem. Sharding is another option, and a perfect fit for regional- or tenant-isolated scenarios. In less siloed scenarios, when streams are arbitrarily related in a way that can cross shard boundaries, sharding is still the move (partition by stream ID). But there are no order guarantees across streams, which can make the event consumer's job harder. For example, the consumer may receive transaction events before it receives events describing the accounts involved.

The first instinct is to "just use timestamps" to order received events. But it is still not possible to guarantee perfect occurrence order. There are too many uncontrollable factors: network hiccups, clock drift, cosmic rays, etc. Ideally you design the consumer to not require cross-stream dependencies. Have a strategy for temporarily missing data, like progressive enhancement for data. If you really need the data to be unavailable instead of incomplete, use the same tactic, but keep the incomplete data in a separate area or marked unavailable until it's all filled in.

You can also just attempt to process each event, knowing it may fail due to missing prerequisites. Put failed events in a retry queue, process the next events, and retry the failed events later. But watch out for poison messages (events).
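The retry-queue tactic could look roughly like this. It is a sketch under several assumptions of mine: the event shapes are invented, a missing prerequisite surfaces as a `KeyError`, parked events are retried whenever another event succeeds, and `MAX_ATTEMPTS` is an arbitrary cutoff for treating an event as poison.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed limit before an event is treated as poison


def process(event, accounts):
    """Apply one event; raises KeyError if the account is not known yet."""
    if event["type"] == "AccountOpened":
        accounts[event["account"]] = 0
    elif event["type"] == "Deposited":
        accounts[event["account"]] += event["amount"]


def consume(events):
    """Process events in arrival order, parking those that fail for later."""
    accounts, dead_letters = {}, []
    retry = deque()  # (event, attempts so far)
    for event in events:
        try:
            process(event, accounts)
        except KeyError:
            retry.append((event, 1))
            continue
        # Something succeeded, so give each parked event one more chance.
        for _ in range(len(retry)):
            parked, attempts = retry.popleft()
            try:
                process(parked, accounts)
            except KeyError:
                if attempts + 1 >= MAX_ATTEMPTS:
                    dead_letters.append(parked)  # poison: stop retrying
                else:
                    retry.append((parked, attempts + 1))
    still_parked = [parked for parked, _ in retry]
    return accounts, dead_letters, still_parked


events = [
    {"type": "Deposited", "account": "acct-1", "amount": 100},  # arrives early
    {"type": "Deposited", "account": "ghost", "amount": 5},     # never resolvable
    {"type": "AccountOpened", "account": "acct-1"},
    {"type": "AccountOpened", "account": "acct-2"},
]

accounts, dead_letters, still_parked = consume(events)
print(accounts, dead_letters, still_parked)
```

The early deposit is applied once its account shows up, while the deposit to a non-existent account is moved aside after the attempt limit instead of blocking or looping forever.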

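Conclusion

Can you force Kafka to work for an app-controlled source of truth? Sure, if you try hard enough and integrate deeply enough. But is it a good idea? No.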


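Update per comment

The comment was deleted, but the question was something like: what do people use for event storage then?

It seems that most people roll their own event storage implementation on top of an existing database. For non-distributed scenarios, like internal back-ends or stand-alone products, it is well documented how to create a SQL-based event store. There are also many libraries available on top of various kinds of databases. There is also EventStoreDB, which is built for this purpose.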
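For a feel of what a SQL-based event store can look like, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout and the `append`/`load` helpers are my own assumptions rather than a reference design; the key idea is that a primary key on (stream_id, version) lets the database reject conflicting writes.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        stream_id TEXT    NOT NULL,
        version   INTEGER NOT NULL,
        type      TEXT    NOT NULL,
        data      TEXT    NOT NULL,
        PRIMARY KEY (stream_id, version)
    )
    """
)


def append(conn, stream_id, expected_version, events):
    """Write events at expected_version + 1 onward; return False on a conflict."""
    rows = [
        (stream_id, expected_version + i, event_type, json.dumps(data))
        for i, (event_type, data) in enumerate(events, start=1)
    ]
    try:
        with conn:  # one transaction: all rows commit or none do
            conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False


def load(conn, stream_id):
    """Read one entity's events in version order."""
    cursor = conn.execute(
        "SELECT version, type, data FROM events "
        "WHERE stream_id = ? ORDER BY version",
        (stream_id,),
    )
    return [(v, t, json.loads(d)) for v, t, d in cursor]


first = append(conn, "cart-1", 0, [("Created", {}), ("ItemAdded", {"sku": "A1"})])
stale = append(conn, "cart-1", 1, [("ItemAdded", {"sku": "B2"})])  # version 2 is taken
history = load(conn, "cart-1")
print(first, stale, history)
```

This sketch only rejects a write whose version already exists. It does not check for gaps (for example appending at version 10 to a stream that has 2 events), which a fuller implementation would also want to guard against.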

In distributed scenarios, I've seen a couple of different implementations. Jet's Panther project uses Azure CosmosDB, with the Change Feed feature to notify listeners. Another similar implementation I've heard about on AWS uses DynamoDB with its Streams feature to notify listeners. The partition key should probably be the stream ID for best data distribution (to lessen the amount of over-provisioning). However, a full replay across streams in Dynamo is expensive (in both reads and cost). So this implementation was also set up for Dynamo Streams to dump events to S3. When a new listener comes online, or an existing listener wants a full replay, it reads S3 to catch up first.

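My current project is a multi-tenant scenario, and I rolled my own on top of Postgres. Something like Citus seems appropriate for scalability, partitioning by tenant+stream.

Kafka is still very useful in distributed scenarios. It is a non-trivial problem to expose each service's key events to other services. An event store is usually not built for that, but that is precisely what Kafka does well. Each service has its own internal source of truth (could be events, BNF, graph, etc.), then listens to Kafka to know what is happening "outside". The service posts public events to Kafka to inform the outside of interesting things it encountered.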

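Other answers

I am one of the original authors of Kafka. Kafka will work very well as a log for event sourcing. It is fault-tolerant, scales to enormous data sizes, and has a built-in partitioning model.

At LinkedIn we use it for several use cases of this form. For example, our open-source stream processing system, Apache Samza, comes with built-in support for event sourcing.

I think you don't hear much about using Kafka for event sourcing primarily because the event sourcing terminology doesn't seem to be very prevalent in the consumer web space where Kafka is most popular.

I have written a bit about this style of Kafka usage here.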

I think you should have a look at Axon Framework along with their support for Kafka.

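Kafka is meant to be a messaging system, which has many similarities to an event store. However, to quote their intro:

The Kafka cluster retains all published messages, whether or not they have been consumed, for a configurable period of time. For example, if the retention is set for two days, then for the two days after a message is published it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size, so retaining lots of data is not a problem.

So while messages can potentially be retained indefinitely, the expectation is that they will be deleted. This doesn't mean you can't use this as an event store, but it may be better to use something else. Take a look at EventStoreDB for an alternative.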
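To see why retention matters for an event store, here is a small simulation of replaying a log after old records have been discarded. It is a simplification of mine: the `Record` type and the numbers are made up, and the cutoff is applied per record, whereas Kafka applies retention to log segments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    timestamp: int  # seconds since an arbitrary start (hypothetical)
    amount: int


def apply_retention(log, now, retention_seconds):
    """Drop records older than the retention window; None keeps everything."""
    if retention_seconds is None:
        return list(log)
    return [r for r in log if now - r.timestamp <= retention_seconds]


def replay(log):
    """Rebuild a balance by summing every remaining record."""
    return sum(r.amount for r in log)


DAY = 24 * 60 * 60
log = [Record(0 * DAY, 100), Record(1 * DAY, -30), Record(4 * DAY, 20)]
now = 5 * DAY

full = replay(apply_retention(log, now, None))          # nothing discarded
truncated = replay(apply_retention(log, now, 2 * DAY))  # two-day retention
print(full, truncated)
```

With a two-day window, the replay silently produces a different balance because the first two records are gone. If Kafka is the only copy of the events, the topic's retention would have to be configured so that records are never discarded (for example via the topic-level `retention.ms` setting), or state would have to be snapshotted before records expire.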

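Update

From the Kafka documentation:

Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka's support for very large stored log data makes it an excellent backend for an application built in this style.

Update 2

One concern with using Kafka for event sourcing is the number of required topics. Typically in event sourcing, there is a stream (topic) of events per entity (such as user, product, etc.). This way, the current state of an entity can be reconstituted by re-applying all events in the stream. Each Kafka topic consists of one or more partitions, and each partition is stored as a directory on the file system. There will also be pressure from ZooKeeper as the number of znodes increases.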

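Yes, Kafka works well in the event sourcing model, especially with CQRS, but take care when setting TTLs for topics, and always keep in mind that Kafka was not designed for this model, even though it can be used for it quite well.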