Although I had come across Kafka before, I only recently realized that Kafka could perhaps be used as (the basis of) a CQRS event store.

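Some of the main things Kafka supports:

Event capturing/storing, all HA of course.
Pub/sub architecture.
Ability to replay the event log, which allows new subscribers to register with the system after the fact.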

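Admittedly I'm not 100% versed in CQRS / event sourcing, but this seems pretty close to what an event store should be. The funny thing is that I really can't find much about Kafka being used as an event store, so perhaps I am missing something.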

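So, is anything missing from Kafka for it to be a good event store? Would it work? Is anyone using it in production? I'm interested in insights, links, etc.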

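Basically, the state of the system is saved based on the transactions/events the system has ever received, instead of just saving the current state/snapshot of the system, which is what is usually done. (Think of it as a general ledger in accounting: all transactions ultimately add up to the final state.) This allows all kinds of cool things, but just read up on the links provided.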


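Current answer

Yes, Kafka works well in the event sourcing model, especially with CQRS. However, you have to take care when setting TTLs (retention) for topics, and always keep in mind that Kafka was not designed for this model, even though it can be used for it quite well.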

Other answers

Kafka is meant to be a messaging system which has many similarities to an event store. However, to quote their intro:

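The Kafka cluster retains all published messages, whether or not they have been consumed, for a configurable period of time. For example, if the retention is set for two days, then for the two days after a message is published it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size, so retaining lots of data is not a problem.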

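So, while messages can potentially be retained indefinitely, the expectation is that they will be deleted. This doesn't mean you can't use Kafka as an event store, but it may be better to use something else. Take a look at EventStoreDB for an alternative.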

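UPDATE

Kafka documentation:

Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka's support for very large stored log data makes it an excellent backend for an application built in this style.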

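UPDATE 2

One concern with using Kafka for event sourcing is the number of required topics. Typically in event sourcing, there is a stream (topic) of events per entity (such as user, product, etc.). This way, the current state of an entity can be reconstituted by re-applying all events in the stream. Each Kafka topic consists of one or more partitions, and each partition is stored as a directory on the file system. There will also be pressure from ZooKeeper as the number of znodes increases.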

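I am one of the original authors of Kafka. Kafka will work very well as a log for event sourcing. It is fault-tolerant, scales to enormous data sizes, and has a built-in partitioning model.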

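We use it for several use cases of this form at LinkedIn. For example, our open source stream processing system, Apache Samza, comes with built-in support for event sourcing.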

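I think you don't hear much about using Kafka for event sourcing primarily because the event sourcing terminology doesn't seem to be very prevalent in the consumer web space, where Kafka is most popular.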

I have written a bit about this style of Kafka usage here.

All the existing answers seem to be quite comprehensive, but there is a terminology issue that I'd like to resolve in my answer.

What is Event Sourcing?

It seems that if you look in five different places, you get five different answers to that question.

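However, if you look at Greg Young's paper from 2010, it summarises the idea quite nicely from page 32 onwards. It doesn't contain the ultimate definition, though, so I dare to formulate it myself.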

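Event Sourcing is a way to persist state. Instead of replacing one state with another as a result of a state mutation, you persist an event that represents that mutation. Therefore, you can always get the current state of the entity by reading all the entity events and applying those state mutations in sequence. By doing that, the current entity state becomes a left fold of all the events for that entity.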
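To make the "left fold" wording concrete, here is a minimal Python sketch. The `Account` entity, the event tuples and the `apply` function are all made up for illustration; they are not taken from Greg's paper.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Account:
    """Hypothetical entity whose state is derived purely from events."""
    balance: int = 0


def apply(state: Account, event: tuple) -> Account:
    """Return the state that results from applying one event to `state`."""
    kind, amount = event
    if kind == "deposited":
        return Account(state.balance + amount)
    if kind == "withdrawn":
        return Account(state.balance - amount)
    return state  # unknown event types leave the state untouched


# The persisted history of one entity, in the order the events happened.
events = [("deposited", 100), ("withdrawn", 30), ("deposited", 5)]

# Current state = left fold of the events, starting from the empty state.
current = reduce(apply, events, Account())
print(current)  # Account(balance=75)
```

Nothing but the events is stored; the current state is recomputed from them whenever it is needed.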

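What makes a "good" event store (database)?

Any persistence mechanism needs to perform two basic operations:

Save the new entity state to the database.
Retrieve the entity state from the database.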

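That's where Greg talks about the concept of entity streams, where each entity has its own stream of events, uniquely identified by the entity id. When you have a database that is capable of reading all the entity events by the entity id (reading the stream), using Event Sourcing is not a hard problem.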

As Greg's paper mentions Event Sourcing in the context of CQRS, he explains why those two concepts play nicely with each other. Although you have a database full of atomic state mutations for a bunch of entities, querying across the current state of multiple entities is hard work. The issue is solved by separating the transactional (event-sourced) store, which is used as the source of truth, from the reporting (query, read) store, which is used for reports and queries of the current system state across multiple entities. The query store doesn't contain any events. It contains the projected state of multiple entities, composed based on the needs for querying data. It doesn't necessarily need to contain snapshots of each entity; you are free to choose the shape and form of the query model, as long as you can project your events to that model.

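For that reason, a "proper" event database needs to support what we call real-time subscriptions, which deliver new (and historical, if we need to replay) events to the query model to project.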
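As a rough illustration of projecting events into a query model, here is a small Python sketch. The `BalanceProjection` class, the event shape and the account ids are assumptions made for the example; a real subscription would be pushed by the event database rather than driven by a loop.

```python
from collections import defaultdict

# Hypothetical mapping from event type to its effect on a balance.
SIGNS = {"deposited": 1, "withdrawn": -1}


class BalanceProjection:
    """Read model: current balance per account, across many entities."""

    def __init__(self):
        self.balances = defaultdict(int)
        self.handled = 0  # how many events have been projected so far

    def handle(self, event):
        entity_id, kind, amount = event
        self.balances[entity_id] += SIGNS.get(kind, 0) * amount
        self.handled += 1


# Historical events are replayed first, in order...
history = [
    ("acc-1", "deposited", 100),
    ("acc-2", "deposited", 50),
    ("acc-1", "withdrawn", 30),
]
projection = BalanceProjection()
for event in history:
    projection.handle(event)

# ...and then new events keep arriving through the same handler.
projection.handle(("acc-2", "withdrawn", 20))

print(dict(projection.balances))  # {'acc-1': 70, 'acc-2': 30}
```

The query model holds no events at all, only the state derived from them, so it can be thrown away and rebuilt by replaying the history.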

We also know that we need the entity state in hand when making decisions about its allowed state transitions. For example, a money transfer that has already been executed should not be executed twice. As the query model is by definition stale (even if only by milliseconds), making decisions on it is dangerous. Therefore, we use the most recent, fully consistent state from the transactional (event) store to reconstruct the entity state when executing operations on the entity.

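Sometimes you also want to remove a whole entity from the database, which means deleting all its events. That could be a requirement, for example, for GDPR compliance.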

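So, what attributes does a database need in order to work as an event store for a decent event-sourced system? Just a few:

Append events to the ordered, append-only log, using the entity id as a key.
Load all the events for a single entity, in order, using the entity id as a key.
Delete all the events for a given entity, using the entity id as a key.
Support real-time subscriptions to project events to query models.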
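The four requirements above can be sketched as a toy in-memory store. This is only an illustration of the interface under my own assumptions: the class and method names are made up, nothing is persisted, and the replay goes stream by stream rather than in a global order.

```python
from collections import defaultdict


class InMemoryEventStore:
    """Toy event store: one ordered, append-only list of events per entity id."""

    def __init__(self):
        self._streams = defaultdict(list)  # entity id -> ordered events
        self._subscribers = []

    def append(self, entity_id, event):
        """Append an event to the entity's stream and notify subscribers."""
        self._streams[entity_id].append(event)
        for callback in self._subscribers:
            callback(entity_id, event)

    def load(self, entity_id):
        """Return all events of one entity, in the order they were appended."""
        return list(self._streams.get(entity_id, []))

    def delete(self, entity_id):
        """Remove every event of one entity (e.g. for a GDPR request)."""
        self._streams.pop(entity_id, None)

    def subscribe(self, callback, replay=True):
        """Deliver historical events (optionally), then all future ones."""
        if replay:
            for entity_id, events in list(self._streams.items()):
                for event in list(events):
                    callback(entity_id, event)
        self._subscribers.append(callback)


store = InMemoryEventStore()
store.append("user-1", "registered")
store.append("user-2", "registered")

seen = []
store.subscribe(lambda entity_id, event: seen.append((entity_id, event)))

store.append("user-1", "renamed")
store.delete("user-2")

print(store.load("user-1"))  # ['registered', 'renamed']
print(store.load("user-2"))  # []
```

Every operation except the subscription is keyed by the entity id, which is the point of the list.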

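What is Kafka?

Kafka is a highly scalable message broker based on an append-only log. Messages in Kafka are produced to topics, and nowadays one topic often contains a single message type, to play nicely with the schema registry. A topic could be something like cpu-load, where we produce time-series measurements of the CPU load for many servers.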

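Kafka topics can be partitioned. Partitioning allows you to produce and consume messages in parallel. Messages are ordered only within a single partition, and you normally need to use a predictable partition key so that Kafka can distribute the messages across the partitions.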
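A small Python sketch of what a predictable partition key buys you. The `partition_for` helper and the CRC32 hash are stand-ins chosen for the example (Kafka's own clients use a different hash function), and the lists only mimic partitions.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically (illustrative hash only)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]

# CPU load measurements for several servers, in the order they are produced.
messages = [
    ("server-a", 0.41),
    ("server-b", 0.93),
    ("server-a", 0.47),
    ("server-c", 0.12),
    ("server-b", 0.90),
]

for key, load in messages:
    partitions[partition_for(key, NUM_PARTITIONS)].append((key, load))

# Messages with the same key always land in the same partition, so their
# relative order is preserved. There is no ordering across partitions.
for number, content in enumerate(partitions):
    print(number, content)
```

Several keys can share one partition, which is why the key is a partition key and not a stream identifier.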

Now, let's go through the checklist:

Can you append events to Kafka? Yes, it's called produce.

Can you append events with the entity id as a key? Not really. The partition key is used to distribute messages across partitions, so it's really just a partition key. One thing mentioned in another answer is optimistic concurrency. If you have worked with a relational database, you have probably used a Version column. For NoSQL databases you might have used the document eTag. Both allow you to ensure that you update the entity in the state that you know about, and that it hasn't been mutated during your operation. Kafka does not provide you with anything to support optimistic concurrency for such state transitions.

Can you read all the events for a single entity from a Kafka topic, using the entity id as a key? No, you can't. As Kafka is not a database, it has no index on its topics, so the only way to retrieve messages from a topic is to consume them.

Can you delete events from Kafka using the entity id as a key? No, it's impossible. Messages get removed from the topic only after their retention period expires.

Can you subscribe to a Kafka topic to receive live (and historical) events in order, so you can project them to your query models? Yes, and because topics are partitioned, you can scale out your projections to increase performance.
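For readers who haven't met optimistic concurrency on an event stream, here is a minimal Python sketch of the expected-version check that the checklist says Kafka lacks. `VersionedStream` and `ConcurrencyError` are hypothetical names, and the stream lives in memory only.

```python
class ConcurrencyError(Exception):
    """Raised when the stream has moved on since the caller last read it."""


class VersionedStream:
    """Single entity stream whose version is the number of stored events."""

    def __init__(self):
        self.events = []

    @property
    def version(self):
        return len(self.events)

    def append(self, event, expected_version):
        """Append only if nobody else has appended since `expected_version`."""
        if expected_version != self.version:
            raise ConcurrencyError(
                f"expected version {expected_version}, stream is at {self.version}"
            )
        self.events.append(event)
        return self.version


stream = VersionedStream()
v = stream.append("transfer-requested", expected_version=0)
stream.append("transfer-executed", expected_version=v)

# A second writer that still believes the stream is at version `v` is rejected,
# so the transfer cannot be executed twice.
try:
    stream.append("transfer-executed", expected_version=v)
    conflict = False
except ConcurrencyError:
    conflict = True

print(conflict)       # True
print(stream.events)  # ['transfer-requested', 'transfer-executed']
```

It is the same idea as a Version column or an eTag, applied to the append operation.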

So, why do people keep doing it?

I believe the reason why a lot of people claim that Kafka is a good choice for an event store in event-sourced systems is that they confuse Event Sourcing with simple pub-sub (you can use the hype word "EDA", or Event-Driven Architecture, instead). Using message brokers to fan out events to other system components is a pattern that has been known for decades. The issue with "classic" brokers is that messages are gone as soon as they are consumed, so you cannot build something like a query model that would be built from history. Another issue is that when projecting events, you want them to be consumed in the same order as they are produced, and "classic" brokers normally aim to support the competing consumers pattern, which doesn't support ordered message processing by definition. Make no mistake, Kafka does not support competing consumers: it has a limitation of one consumer per one or more partitions, but not the other way around.

Kafka solved the ordering issue and the historical message retention issue quite nicely. So, you can now build query models from events you push through Kafka. But that's not what the original idea of Event Sourcing is about; it's what we today call EDA. As soon as this separation is clear, we will, hopefully, stop seeing claims that any append-only event log is a good candidate to be an event store database for event-sourced systems.

You can use Kafka as an event store, but I do not recommend doing so, although it might look like a good choice:

Kafka only guarantees at-least-once delivery, so there can be duplicates in the event store that cannot be removed. Update: here you can read why this is so hard with Kafka, and some news about how to finally achieve this behavior: https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/

Due to immutability, there is no way to manipulate the event store when the application evolves and events need to be transformed (there are of course methods like upcasting, but...). One might say you never need to transform events, but that is not a correct assumption. There could be situations where you back up the originals but upgrade the events to the latest versions. That is a valid requirement in event-driven architectures.

There is no place to persist snapshots of entities/aggregates, so replay will become slower and slower. Creating snapshots is a must-have feature for an event store from a long-term perspective.

Kafka partitions are distributed, and they are hard to manage and back up compared with databases. Databases are simply simpler :-)
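To show why snapshots matter for replay time, here is a minimal Python sketch. The event shape, the `apply` and `rebuild` functions and the `(state, version)` snapshot format are assumptions for the example, not part of any particular event store.

```python
def apply(balance, event):
    """Apply one (kind, amount) event to an account balance."""
    kind, amount = event
    return balance + amount if kind == "deposited" else balance - amount


def rebuild(events, snapshot=None):
    """Rebuild state, starting from a snapshot if one is available.

    A snapshot is a (state, version) pair, where version is the number of
    events already folded into the state. Only events after it are replayed.
    """
    state, version = snapshot if snapshot else (0, 0)
    for event in events[version:]:
        state = apply(state, event)
    return state, len(events)


events = [("deposited", 100), ("withdrawn", 30)]
snapshot = rebuild(events)  # full replay, then persist the result somewhere
print(snapshot)             # (70, 2)

events.append(("deposited", 5))
state, version = rebuild(events, snapshot)  # replays only the one new event
print(state, version)                       # 75 3
```

Without a place to keep the snapshot, every rebuild has to start from the first event, and that gets slower as the stream grows.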

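So, think twice before you make your choice. An event store built as a combination of application-layer interfaces (monitoring and management), an SQL/NoSQL store, and Kafka as the broker is a better choice than leaving Kafka to handle both roles in order to create a complete, full-featured solution.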

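An event store is a complex service that requires more than Kafka can offer if you are serious about applying Event Sourcing, CQRS, Sagas and other patterns in an event-driven architecture while staying highly performant.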

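Feel free to challenge my answer! You might not like what I say about your favorite broker with lots of overlapping capabilities, but still, Kafka wasn't designed as an event store. It was designed as a high-performance broker and, at the same time, a buffer that handles scenarios such as fast producers versus slow consumers.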

Please look at the eventuate.io open source microservices framework to discover more about the potential problems: http://eventuate.io/

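Update as of 8th Feb 2018

I haven't incorporated new information from the comments, but I agree with some of those points. This update is more about some recommendations for a microservice event-driven platform. If you are serious about robust microservice design and the highest possible performance in general, here are a few hints you might be interested in.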

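Don't use Spring. It is great (I use it a lot myself), but it is heavy and slow at the same time, and it is not a microservice platform at all. It's "just" a framework to help you implement one (with a lot of work behind this...). Other frameworks are "just" lightweight REST or JPA or differently focused frameworks. I recommend what is probably the best-in-class open source complete microservice platform available, which goes back to pure Java roots: https://github.com/networknt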

If you wonder about performance, you can compare for yourself with the existing benchmark suite: https://github.com/networknt/microservices-framework-benchmark

Don't use Kafka at all :-)) That is half a joke. I mean, while Kafka is great, it is another broker-centric system. I think the future is in broker-less messaging systems. You might be surprised, but there are systems faster than Kafka :-), although of course you must get down to a lower level. Look at Chronicle.

For the event store I recommend the superior PostgreSQL extension called TimescaleDB, which focuses on high-performance processing of time-series data (events are time series) in large volumes. Of course, CQRS and Event Sourcing (replay and similar features) are built into the light4j framework out of the box, which uses Postgres as the underlying storage.

For messaging, try looking at Chronicle Queue, Map, Engine and Network. I mean, get rid of these old-fashioned broker-centric solutions and go with a micro messaging system (an embedded one). Chronicle Queue is actually even faster than Kafka. But I agree it is not an all-in-one solution, and you need to do some development; otherwise you go and buy the Enterprise (paid) version. In the end, the effort of building your own messaging layer from Chronicle will be paid back by removing the burden of maintaining the Kafka cluster.