Ukkonen的后缀树算法

如果我的回答似乎多余，我很抱歉，但我最近实施了Ukkonen的算法，并发现自己为此挣扎了好几天;我不得不阅读了关于这个主题的多篇论文，以理解算法的一些核心方面的原因和方法。

我发现之前回答的“规则”方法对理解潜在的原因没有帮助，所以我写了下面的所有内容，只关注语用学。如果你像我一样，对其他解释感到困惑，也许我的补充解释会让你“点击”。

我在这里发布了我的c#实现:https://github.com/baratgabor/SuffixTree

请注意，我不是这方面的专家，所以以下部分可能包含不准确(或更糟)。如果你遇到任何问题，请随意编辑。

先决条件

以下解释的出发点假设你熟悉后缀树的内容和使用，以及Ukkonen算法的特征，例如，你如何从头到尾一个字符一个字符地扩展后缀树。基本上，我猜你已经读过一些其他的解释了。

(然而，我必须为流程添加一些基本的叙述，所以开头可能会让人觉得多余。)

最有趣的部分是解释了使用后缀链接和从根目录重新扫描之间的区别。这就是在我的实现中给我带来许多bug和头痛的原因。

开放式叶节点及其局限性

我相信你已经知道，最基本的“技巧”是意识到我们可以只留下后缀的结束“开放”，即引用字符串的当前长度，而不是将结束设置为一个静态值。这样，当我们添加额外的字符时，这些字符将隐式地添加到所有后缀标签中，而不必访问和更新所有这些字符。

但是，由于显而易见的原因，这种后缀的开放结尾只适用于表示字符串结尾的节点，即树结构中的叶节点。我们在树上执行的分支操作(添加新的分支节点和叶节点)不会自动传播到需要的任何地方。

这可能是基本的，并且不需要提及，重复的子字符串不会显式地出现在树中，因为树已经包含了这些重复的子字符串;然而，当重复的子字符串以遇到非重复字符结束时，我们需要在该点创建一个分支，以表示从该点开始的发散。

例如，对于字符串'ABCXABCY'(见下文)，到X和Y的分支需要添加到三个不同的后缀ABC, BC和C;否则它就不是一个有效的后缀树，我们不能通过从根向下匹配字符来找到字符串的所有子字符串。

再次强调，我们对树中的后缀执行的任何操作都需要通过其连续的后缀反映出来(例如ABC > BC > C)，否则它们就不再是有效的后缀。

但是，即使我们接受我们必须进行这些手动更新，我们如何知道需要更新多少个后缀呢?因为，当我们添加重复字符A(以及其他连续重复字符)时，我们还不知道何时/何处需要将后缀分成两个分支。只有当我们遇到第一个非重复字符(在本例中是Y，而不是树中已经存在的X)时，才确定需要分割。

我们能做的是匹配最长的重复字符串，并计算后面需要更新的后缀的数量。这就是“余数”的含义。

“剩余”和“重新扫描”的概念

变量remainder告诉我们在没有分支的情况下隐式添加了多少个重复字符;也就是说，一旦我们发现第一个不能匹配的字符，我们需要访问多少个后缀来重复分支操作。这本质上等于我们在树中从根开始的“深度”有多少字符。

So, staying with the previous example of the string ABCXABCY, we match the repeated ABC part 'implicitly', incrementing remainder each time, which results in remainder of 3. Then we encounter the non-repeating character 'Y'. Here we split the previously added ABCX into ABC->X and ABC->Y. Then we decrement remainder from 3 to 2, because we already took care of the ABC branching. Now we repeat the operation by matching the last 2 characters – BC – from the root to reach the point where we need to split, and we split BCX too into BC->X and BC->Y. Again, we decrement remainder to 1, and repeat the operation; until the remainder is 0. Lastly, we need to add the current character (Y) itself to the root as well.

在Ukkonen的算法中，这个操作，从根开始，跟随连续的后缀到达我们需要做操作的点，这被称为“重新扫描”，通常这是算法中最昂贵的部分。想象一个更长的字符串，您需要跨几十个节点(我们将在后面讨论)“重新扫描”长子字符串，可能需要数千次。

作为解决方案，我们引入了所谓的“后缀链接”。

后缀链接的概念

后缀链接基本上指向我们通常必须“重新扫描”到的位置，因此我们可以简单地跳转到被链接的位置，完成我们的工作，跳转到下一个链接位置，然后重复—直到没有更多的位置需要更新。

Of course one big question is how to add these links. The existing answer is that we can add the links when we insert new branch nodes, utilizing the fact that, in each extension of the tree, the branch nodes are naturally created one after another in the exact order we'd need to link them together. Though, we have to link from the last created branch node (the longest suffix) to the previously created one, so we need to cache the last we create, link that to the next one we create, and cache the newly created one.

一个后果是，我们实际上经常没有后缀链接，因为给定的分支节点刚刚创建。在这些情况下，我们仍然必须回到前面提到的“重新扫描”。这就是为什么在插入之后，会指示您使用后缀link或跳转到根目录。

(或者，如果您在节点中存储父指针，您可以尝试跟踪父指针，检查它们是否有链接，并使用它。我发现这一点很少被提及，但是后缀link的用法并不是一成不变的。有多种可能的方法，如果你了解底层机制，你可以实现最适合你的需求。)

“主动点”的概念

到目前为止，我们讨论了构建树的多种有效工具，并模糊地提到了遍历多个边和节点，但还没有探索相应的结果和复杂性。

前面解释的“余数”概念对于跟踪我们在树中的位置很有用，但我们必须意识到它不能存储足够的信息。

首先，我们总是位于节点的特定边缘，因此我们需要存储边缘信息。我们称之为“活动边”。

其次，即使在添加了边缘信息之后，我们仍然没有办法识别树中更下方的位置，并且不直接连接到根节点。所以我们也需要存储节点。我们称其为“活动节点”。

最后，我们可以注意到，“余数”不足以识别不直接连接到根结点的边上的位置，因为“余数”是整个路由的长度;我们可能不想麻烦地记住并减去之前的边的长度。所以我们需要一个表示就是当前这条边的余数。这就是我们所说的“活动长度”。

这就导致了我们所说的“活动点”——一个包含三个变量的包，其中包含了我们需要维护的关于我们在树中的位置的所有信息:

活动点=(活动节点，活动边，活动长度)

你可以在下图中观察到ABCABD的匹配路由是如何由边缘AB上的2个字符(从根开始)，加上边缘CABDABCABD上的4个字符(从节点4开始)组成的——结果是“剩余”6个字符。所以，我们当前的位置可以被识别为活动节点4，活动边C，活动长度4。

“活动点”的另一个重要作用是它为我们的算法提供了一个抽象层，这意味着我们的算法的部分可以在“活动点”上完成它们的工作，而不管这个活动点是在根结点还是其他任何地方。这使得它很容易在我们的算法中以一种干净和直接的方式实现后缀链接的使用。

重新扫描和使用后缀链接的区别

现在，棘手的部分，在我的经验中，可能会导致大量的错误和头痛，并且在大多数来源中解释得很差，是处理后缀链接案例和重新扫描案例的区别。

考虑以下字符串'AAAABAAAABAAC'的例子:

您可以在上面观察到7的“余数”对应于根节点的字符总数，而4的“活动长度”对应于活动节点的活动边缘的匹配字符总数。

现在，在活动点执行分支操作之后，活动节点可能包含也可能不包含后缀链接。

如果存在后缀链接:我们只需要处理“活动长度”部分。“余数”是无关紧要的，因为我们通过后缀链接跳转到的节点已经隐式编码了正确的“余数”，仅仅是因为它在树中。

如果后缀链接不存在:我们需要从0 /根开始“重新扫描”，这意味着从头处理整个后缀。为此，我们必须使用整个“余数”作为重新扫描的基础。

使用和不使用后缀链接的处理比较示例

考虑一下上面例子的下一步会发生什么。让我们比较一下如何实现相同的结果——即移动到下一个后缀来处理——使用和不使用后缀链接。

使用后缀link

注意，如果我们使用后缀链接，我们就自动地“在正确的地方”。这通常不是严格正确的，因为“活动长度”可能与新位置“不兼容”。

在上面的例子中，由于“活动长度”是4，我们使用后缀“ABAA”，从链接的节点4开始。但在找到与后缀('A')的第一个字符对应的边后，我们注意到我们的“活动长度”超出了这条边3个字符。所以我们跳过整个边缘，跳转到下一个节点，并通过跳跃消耗的字符减少“活动长度”。

然后，在我们找到下一条边'B'后，对应于后缀'BAA'，我们最终注意到边长大于剩余的'活动长度' 3，这意味着我们找到了正确的位置。

请注意，这个操作似乎通常不被称为“重新扫描”，尽管对我来说，它似乎直接等同于重新扫描，只是长度缩短了，起点不是根。

使用“重新扫描”

注意，如果我们使用传统的“重新扫描”操作(这里假设我们没有后缀链接)，我们从树的顶部开始，从根开始，我们必须沿着当前后缀的整个长度再次向下工作到正确的位置。

这个后缀的长度就是我们之前讨论过的“余数”。我们要消耗这个余数的全部，直到它为0。这可能(通常是这样)包括跳过多个节点，在每次跳跃时减少我们跳过的边的长度。最后，我们到达一条比剩下的“余数”更长的边;这里我们将活动边设置为给定的边，将“活动长度”设置为剩余的“余数”，然后我们就完成了。

但是请注意，需要保留实际的“余数”变量，并且只在每次插入节点后递减。所以我上面所描述的假设使用一个单独的变量初始化为'remainder'。

注意后缀链接和重新扫描

1)请注意，这两种方法都会导致相同的结果。然而，在大多数情况下，后缀链接跳转要快得多;这就是后缀链接背后的基本原理。

2)实际的算法实现不需要有所不同。正如我上面提到的，即使在使用后缀link的情况下，“活动长度”通常与链接位置不兼容，因为树的分支可能包含额外的分支。所以本质上你只需要使用“活动长度”而不是“剩余长度”，并执行相同的重新扫描逻辑，直到你找到比剩余后缀长度短的边。

3) One important remark pertaining to performance is that there is no need to check each and every character during rescanning. Due to the way a valid suffix tree is built, we can safely assume that the characters match. So you're mostly counting the lengths, and the only need for character equivalence checking arises when we jump to a new edge, since edges are identified by their first character (which is always unique in the context of a given node). This means that 'rescanning' logic is different than full string matching logic (i.e. searching for a substring in the tree).

4)这里描述的原始后缀链接只是一种可能的方法。例如，NJ Larsson等人将这种方法命名为面向节点的自顶向下方法，并将其与面向节点的自底向上方法和两个面向边方法进行了比较。不同的方法有不同的典型和最坏情况的性能、需求、限制等，但通常情况下，面向边缘的方法似乎是对原始方法的全面改进。

2019-04-24 09:58:09

如果我的回答似乎多余，我很抱歉，但我最近实施了Ukkonen的算法，并发现自己为此挣扎了好几天;我不得不阅读了关于这个主题的多篇论文，以理解算法的一些核心方面的原因和方法。

我发现之前回答的“规则”方法对理解潜在的原因没有帮助，所以我写了下面的所有内容，只关注语用学。如果你像我一样，对其他解释感到困惑，也许我的补充解释会让你“点击”。

我在这里发布了我的c#实现:https://github.com/baratgabor/SuffixTree

请注意，我不是这方面的专家，所以以下部分可能包含不准确(或更糟)。如果你遇到任何问题，请随意编辑。

先决条件

以下解释的出发点假设你熟悉后缀树的内容和使用，以及Ukkonen算法的特征，例如，你如何从头到尾一个字符一个字符地扩展后缀树。基本上，我猜你已经读过一些其他的解释了。

(然而，我必须为流程添加一些基本的叙述，所以开头可能会让人觉得多余。)

最有趣的部分是解释了使用后缀链接和从根目录重新扫描之间的区别。这就是在我的实现中给我带来许多bug和头痛的原因。

开放式叶节点及其局限性

我相信你已经知道，最基本的“技巧”是意识到我们可以只留下后缀的结束“开放”，即引用字符串的当前长度，而不是将结束设置为一个静态值。这样，当我们添加额外的字符时，这些字符将隐式地添加到所有后缀标签中，而不必访问和更新所有这些字符。

但是，由于显而易见的原因，这种后缀的开放结尾只适用于表示字符串结尾的节点，即树结构中的叶节点。我们在树上执行的分支操作(添加新的分支节点和叶节点)不会自动传播到需要的任何地方。

这可能是基本的，并且不需要提及，重复的子字符串不会显式地出现在树中，因为树已经包含了这些重复的子字符串;然而，当重复的子字符串以遇到非重复字符结束时，我们需要在该点创建一个分支，以表示从该点开始的发散。

例如，对于字符串'ABCXABCY'(见下文)，到X和Y的分支需要添加到三个不同的后缀ABC, BC和C;否则它就不是一个有效的后缀树，我们不能通过从根向下匹配字符来找到字符串的所有子字符串。

再次强调，我们对树中的后缀执行的任何操作都需要通过其连续的后缀反映出来(例如ABC > BC > C)，否则它们就不再是有效的后缀。

但是，即使我们接受我们必须进行这些手动更新，我们如何知道需要更新多少个后缀呢?因为，当我们添加重复字符A(以及其他连续重复字符)时，我们还不知道何时/何处需要将后缀分成两个分支。只有当我们遇到第一个非重复字符(在本例中是Y，而不是树中已经存在的X)时，才确定需要分割。

我们能做的是匹配最长的重复字符串，并计算后面需要更新的后缀的数量。这就是“余数”的含义。

“剩余”和“重新扫描”的概念

变量remainder告诉我们在没有分支的情况下隐式添加了多少个重复字符;也就是说，一旦我们发现第一个不能匹配的字符，我们需要访问多少个后缀来重复分支操作。这本质上等于我们在树中从根开始的“深度”有多少字符。

So, staying with the previous example of the string ABCXABCY, we match the repeated ABC part 'implicitly', incrementing remainder each time, which results in remainder of 3. Then we encounter the non-repeating character 'Y'. Here we split the previously added ABCX into ABC->X and ABC->Y. Then we decrement remainder from 3 to 2, because we already took care of the ABC branching. Now we repeat the operation by matching the last 2 characters – BC – from the root to reach the point where we need to split, and we split BCX too into BC->X and BC->Y. Again, we decrement remainder to 1, and repeat the operation; until the remainder is 0. Lastly, we need to add the current character (Y) itself to the root as well.

在Ukkonen的算法中，这个操作，从根开始，跟随连续的后缀到达我们需要做操作的点，这被称为“重新扫描”，通常这是算法中最昂贵的部分。想象一个更长的字符串，您需要跨几十个节点(我们将在后面讨论)“重新扫描”长子字符串，可能需要数千次。

作为解决方案，我们引入了所谓的“后缀链接”。

后缀链接的概念

后缀链接基本上指向我们通常必须“重新扫描”到的位置，因此我们可以简单地跳转到被链接的位置，完成我们的工作，跳转到下一个链接位置，然后重复—直到没有更多的位置需要更新。

Of course one big question is how to add these links. The existing answer is that we can add the links when we insert new branch nodes, utilizing the fact that, in each extension of the tree, the branch nodes are naturally created one after another in the exact order we'd need to link them together. Though, we have to link from the last created branch node (the longest suffix) to the previously created one, so we need to cache the last we create, link that to the next one we create, and cache the newly created one.

一个后果是，我们实际上经常没有后缀链接，因为给定的分支节点刚刚创建。在这些情况下，我们仍然必须回到前面提到的“重新扫描”。这就是为什么在插入之后，会指示您使用后缀link或跳转到根目录。

(或者，如果您在节点中存储父指针，您可以尝试跟踪父指针，检查它们是否有链接，并使用它。我发现这一点很少被提及，但是后缀link的用法并不是一成不变的。有多种可能的方法，如果你了解底层机制，你可以实现最适合你的需求。)

“主动点”的概念

到目前为止，我们讨论了构建树的多种有效工具，并模糊地提到了遍历多个边和节点，但还没有探索相应的结果和复杂性。

前面解释的“余数”概念对于跟踪我们在树中的位置很有用，但我们必须意识到它不能存储足够的信息。

首先，我们总是位于节点的特定边缘，因此我们需要存储边缘信息。我们称之为“活动边”。

其次，即使在添加了边缘信息之后，我们仍然没有办法识别树中更下方的位置，并且不直接连接到根节点。所以我们也需要存储节点。我们称其为“活动节点”。

最后，我们可以注意到，“余数”不足以识别不直接连接到根结点的边上的位置，因为“余数”是整个路由的长度;我们可能不想麻烦地记住并减去之前的边的长度。所以我们需要一个表示就是当前这条边的余数。这就是我们所说的“活动长度”。

这就导致了我们所说的“活动点”——一个包含三个变量的包，其中包含了我们需要维护的关于我们在树中的位置的所有信息:

活动点=(活动节点，活动边，活动长度)

你可以在下图中观察到ABCABD的匹配路由是如何由边缘AB上的2个字符(从根开始)，加上边缘CABDABCABD上的4个字符(从节点4开始)组成的——结果是“剩余”6个字符。所以，我们当前的位置可以被识别为活动节点4，活动边C，活动长度4。

“活动点”的另一个重要作用是它为我们的算法提供了一个抽象层，这意味着我们的算法的部分可以在“活动点”上完成它们的工作，而不管这个活动点是在根结点还是其他任何地方。这使得它很容易在我们的算法中以一种干净和直接的方式实现后缀链接的使用。

重新扫描和使用后缀链接的区别

现在，棘手的部分，在我的经验中，可能会导致大量的错误和头痛，并且在大多数来源中解释得很差，是处理后缀链接案例和重新扫描案例的区别。

考虑以下字符串'AAAABAAAABAAC'的例子:

您可以在上面观察到7的“余数”对应于根节点的字符总数，而4的“活动长度”对应于活动节点的活动边缘的匹配字符总数。

现在，在活动点执行分支操作之后，活动节点可能包含也可能不包含后缀链接。

如果存在后缀链接:我们只需要处理“活动长度”部分。“余数”是无关紧要的，因为我们通过后缀链接跳转到的节点已经隐式编码了正确的“余数”，仅仅是因为它在树中。

如果后缀链接不存在:我们需要从0 /根开始“重新扫描”，这意味着从头处理整个后缀。为此，我们必须使用整个“余数”作为重新扫描的基础。

使用和不使用后缀链接的处理比较示例

考虑一下上面例子的下一步会发生什么。让我们比较一下如何实现相同的结果——即移动到下一个后缀来处理——使用和不使用后缀链接。

使用后缀link

注意，如果我们使用后缀链接，我们就自动地“在正确的地方”。这通常不是严格正确的，因为“活动长度”可能与新位置“不兼容”。

在上面的例子中，由于“活动长度”是4，我们使用后缀“ABAA”，从链接的节点4开始。但在找到与后缀('A')的第一个字符对应的边后，我们注意到我们的“活动长度”超出了这条边3个字符。所以我们跳过整个边缘，跳转到下一个节点，并通过跳跃消耗的字符减少“活动长度”。

然后，在我们找到下一条边'B'后，对应于后缀'BAA'，我们最终注意到边长大于剩余的'活动长度' 3，这意味着我们找到了正确的位置。

请注意，这个操作似乎通常不被称为“重新扫描”，尽管对我来说，它似乎直接等同于重新扫描，只是长度缩短了，起点不是根。

使用“重新扫描”

注意，如果我们使用传统的“重新扫描”操作(这里假设我们没有后缀链接)，我们从树的顶部开始，从根开始，我们必须沿着当前后缀的整个长度再次向下工作到正确的位置。

这个后缀的长度就是我们之前讨论过的“余数”。我们要消耗这个余数的全部，直到它为0。这可能(通常是这样)包括跳过多个节点，在每次跳跃时减少我们跳过的边的长度。最后，我们到达一条比剩下的“余数”更长的边;这里我们将活动边设置为给定的边，将“活动长度”设置为剩余的“余数”，然后我们就完成了。

但是请注意，需要保留实际的“余数”变量，并且只在每次插入节点后递减。所以我上面所描述的假设使用一个单独的变量初始化为'remainder'。

注意后缀链接和重新扫描

1)请注意，这两种方法都会导致相同的结果。然而，在大多数情况下，后缀链接跳转要快得多;这就是后缀链接背后的基本原理。

2)实际的算法实现不需要有所不同。正如我上面提到的，即使在使用后缀link的情况下，“活动长度”通常与链接位置不兼容，因为树的分支可能包含额外的分支。所以本质上你只需要使用“活动长度”而不是“剩余长度”，并执行相同的重新扫描逻辑，直到你找到比剩余后缀长度短的边。

3) One important remark pertaining to performance is that there is no need to check each and every character during rescanning. Due to the way a valid suffix tree is built, we can safely assume that the characters match. So you're mostly counting the lengths, and the only need for character equivalence checking arises when we jump to a new edge, since edges are identified by their first character (which is always unique in the context of a given node). This means that 'rescanning' logic is different than full string matching logic (i.e. searching for a substring in the tree).

4)这里描述的原始后缀链接只是一种可能的方法。例如，NJ Larsson等人将这种方法命名为面向节点的自顶向下方法，并将其与面向节点的自底向上方法和两个面向边方法进行了比较。不同的方法有不同的典型和最坏情况的性能、需求、限制等，但通常情况下，面向边缘的方法似乎是对原始方法的全面改进。

2019-04-24 09:58:09

@jogojapan你带来了很棒的解释和可视化。但正如@makagonov提到的，它缺少一些关于设置后缀链接的规则。在http://brenden.github.io/ukkonen-animation/上通过单词“aabaaabb”一步一步地浏览时，可以很好地看到它。从步骤10到步骤11，节点5到节点2之间没有后缀链接，但活动点突然移动到那里。

@makagonov因为我生活在Java世界，我也试图遵循你的实现来掌握ST构建工作流，但这对我来说很难，因为:

边与节点的组合使用索引指针而不是引用优惠声明; 继续陈述;

所以我最终在Java中实现了这样的实现，我希望以更清晰的方式反映所有步骤，并将减少其他Java人的学习时间:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ST {

  public class Node {
    private final int id;
    private final Map<Character, Edge> edges;
    private Node slink;

    public Node(final int id) {
        this.id = id;
        this.edges = new HashMap<>();
    }

    public void setSlink(final Node slink) {
        this.slink = slink;
    }

    public Map<Character, Edge> getEdges() {
        return this.edges;
    }

    public Node getSlink() {
        return this.slink;
    }

    public String toString(final String word) {
        return new StringBuilder()
                .append("{")
                .append("\"id\"")
                .append(":")
                .append(this.id)
                .append(",")
                .append("\"slink\"")
                .append(":")
                .append(this.slink != null ? this.slink.id : null)
                .append(",")
                .append("\"edges\"")
                .append(":")
                .append(edgesToString(word))
                .append("}")
                .toString();
    }

    private StringBuilder edgesToString(final String word) {
        final StringBuilder edgesStringBuilder = new StringBuilder();
        edgesStringBuilder.append("{");
        for(final Map.Entry<Character, Edge> entry : this.edges.entrySet()) {
            edgesStringBuilder.append("\"")
                    .append(entry.getKey())
                    .append("\"")
                    .append(":")
                    .append(entry.getValue().toString(word))
                    .append(",");
        }
        if(!this.edges.isEmpty()) {
            edgesStringBuilder.deleteCharAt(edgesStringBuilder.length() - 1);
        }
        edgesStringBuilder.append("}");
        return edgesStringBuilder;
    }

    public boolean contains(final String word, final String suffix) {
        return !suffix.isEmpty()
                && this.edges.containsKey(suffix.charAt(0))
                && this.edges.get(suffix.charAt(0)).contains(word, suffix);
    }
  }

  public class Edge {
    private final int from;
    private final int to;
    private final Node next;

    public Edge(final int from, final int to, final Node next) {
        this.from = from;
        this.to = to;
        this.next = next;
    }

    public int getFrom() {
        return this.from;
    }

    public int getTo() {
        return this.to;
    }

    public Node getNext() {
        return this.next;
    }

    public int getLength() {
        return this.to - this.from;
    }

    public String toString(final String word) {
        return new StringBuilder()
                .append("{")
                .append("\"content\"")
                .append(":")
                .append("\"")
                .append(word.substring(this.from, this.to))
                .append("\"")
                .append(",")
                .append("\"next\"")
                .append(":")
                .append(this.next != null ? this.next.toString(word) : null)
                .append("}")
                .toString();
    }

    public boolean contains(final String word, final String suffix) {
        if(this.next == null) {
            return word.substring(this.from, this.to).equals(suffix);
        }
        return suffix.startsWith(word.substring(this.from,
                this.to)) && this.next.contains(word, suffix.substring(this.to - this.from));
    }
  }

  public class ActivePoint {
    private final Node activeNode;
    private final Character activeEdgeFirstCharacter;
    private final int activeLength;

    public ActivePoint(final Node activeNode,
                       final Character activeEdgeFirstCharacter,
                       final int activeLength) {
        this.activeNode = activeNode;
        this.activeEdgeFirstCharacter = activeEdgeFirstCharacter;
        this.activeLength = activeLength;
    }

    private Edge getActiveEdge() {
        return this.activeNode.getEdges().get(this.activeEdgeFirstCharacter);
    }

    public boolean pointsToActiveNode() {
        return this.activeLength == 0;
    }

    public boolean activeNodeIs(final Node node) {
        return this.activeNode == node;
    }

    public boolean activeNodeHasEdgeStartingWith(final char character) {
        return this.activeNode.getEdges().containsKey(character);
    }

    public boolean activeNodeHasSlink() {
        return this.activeNode.getSlink() != null;
    }

    public boolean pointsToOnActiveEdge(final String word, final char character) {
        return word.charAt(this.getActiveEdge().getFrom() + this.activeLength) == character;
    }

    public boolean pointsToTheEndOfActiveEdge() {
        return this.getActiveEdge().getLength() == this.activeLength;
    }

    public boolean pointsAfterTheEndOfActiveEdge() {
        return this.getActiveEdge().getLength() < this.activeLength;
    }

    public ActivePoint moveToEdgeStartingWithAndByOne(final char character) {
        return new ActivePoint(this.activeNode, character, 1);
    }

    public ActivePoint moveToNextNodeOfActiveEdge() {
        return new ActivePoint(this.getActiveEdge().getNext(), null, 0);
    }

    public ActivePoint moveToSlink() {
        return new ActivePoint(this.activeNode.getSlink(),
                this.activeEdgeFirstCharacter,
                this.activeLength);
    }

    public ActivePoint moveTo(final Node node) {
        return new ActivePoint(node, this.activeEdgeFirstCharacter, this.activeLength);
    }

    public ActivePoint moveByOneCharacter() {
        return new ActivePoint(this.activeNode,
                this.activeEdgeFirstCharacter,
                this.activeLength + 1);
    }

    public ActivePoint moveToEdgeStartingWithAndByActiveLengthMinusOne(final Node node,
                                                                       final char character) {
        return new ActivePoint(node, character, this.activeLength - 1);
    }

    public ActivePoint moveToNextNodeOfActiveEdge(final String word, final int index) {
        return new ActivePoint(this.getActiveEdge().getNext(),
                word.charAt(index - this.activeLength + this.getActiveEdge().getLength()),
                this.activeLength - this.getActiveEdge().getLength());
    }

    public void addEdgeToActiveNode(final char character, final Edge edge) {
        this.activeNode.getEdges().put(character, edge);
    }

    public void splitActiveEdge(final String word,
                                final Node nodeToAdd,
                                final int index,
                                final char character) {
        final Edge activeEdgeToSplit = this.getActiveEdge();
        final Edge splittedEdge = new Edge(activeEdgeToSplit.getFrom(),
                activeEdgeToSplit.getFrom() + this.activeLength,
                nodeToAdd);
        nodeToAdd.getEdges().put(word.charAt(activeEdgeToSplit.getFrom() + this.activeLength),
                new Edge(activeEdgeToSplit.getFrom() + this.activeLength,
                        activeEdgeToSplit.getTo(),
                        activeEdgeToSplit.getNext()));
        nodeToAdd.getEdges().put(character, new Edge(index, word.length(), null));
        this.activeNode.getEdges().put(this.activeEdgeFirstCharacter, splittedEdge);
    }

    public Node setSlinkTo(final Node previouslyAddedNodeOrAddedEdgeNode,
                           final Node node) {
        if(previouslyAddedNodeOrAddedEdgeNode != null) {
            previouslyAddedNodeOrAddedEdgeNode.setSlink(node);
        }
        return node;
    }

    public Node setSlinkToActiveNode(final Node previouslyAddedNodeOrAddedEdgeNode) {
        return setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, this.activeNode);
    }
  }

  private static int idGenerator;

  private final String word;
  private final Node root;
  private ActivePoint activePoint;
  private int remainder;

  public ST(final String word) {
    this.word = word;
    this.root = new Node(idGenerator++);
    this.activePoint = new ActivePoint(this.root, null, 0);
    this.remainder = 0;
    build();
  }

  private void build() {
    for(int i = 0; i < this.word.length(); i++) {
        add(i, this.word.charAt(i));
    }
  }

  private void add(final int index, final char character) {
    this.remainder++;
    boolean characterFoundInTheTree = false;
    Node previouslyAddedNodeOrAddedEdgeNode = null;
    while(!characterFoundInTheTree && this.remainder > 0) {
        if(this.activePoint.pointsToActiveNode()) {
            if(this.activePoint.activeNodeHasEdgeStartingWith(character)) {
                activeNodeHasEdgeStartingWithCharacter(character, previouslyAddedNodeOrAddedEdgeNode);
                characterFoundInTheTree = true;
            }
            else {
                if(this.activePoint.activeNodeIs(this.root)) {
                    rootNodeHasNotEdgeStartingWithCharacter(index, character);
                }
                else {
                    previouslyAddedNodeOrAddedEdgeNode = internalNodeHasNotEdgeStartingWithCharacter(index,
                            character, previouslyAddedNodeOrAddedEdgeNode);
                }
            }
        }
        else {
            if(this.activePoint.pointsToOnActiveEdge(this.word, character)) {
                activeEdgeHasCharacter();
                characterFoundInTheTree = true;
            }
            else {
                if(this.activePoint.activeNodeIs(this.root)) {
                    previouslyAddedNodeOrAddedEdgeNode = edgeFromRootNodeHasNotCharacter(index,
                            character,
                            previouslyAddedNodeOrAddedEdgeNode);
                }
                else {
                    previouslyAddedNodeOrAddedEdgeNode = edgeFromInternalNodeHasNotCharacter(index,
                            character,
                            previouslyAddedNodeOrAddedEdgeNode);
                }
            }
        }
    }
  }

  private void activeNodeHasEdgeStartingWithCharacter(final char character,
                                                    final Node previouslyAddedNodeOrAddedEdgeNode) {
    this.activePoint.setSlinkToActiveNode(previouslyAddedNodeOrAddedEdgeNode);
    this.activePoint = this.activePoint.moveToEdgeStartingWithAndByOne(character);
    if(this.activePoint.pointsToTheEndOfActiveEdge()) {
        this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
    }
  }

  private void rootNodeHasNotEdgeStartingWithCharacter(final int index, final char character) {
    this.activePoint.addEdgeToActiveNode(character, new Edge(index, this.word.length(), null));
    this.activePoint = this.activePoint.moveTo(this.root);
    this.remainder--;
    assert this.remainder == 0;
  }

  private Node internalNodeHasNotEdgeStartingWithCharacter(final int index,
                                                         final char character,
                                                         Node previouslyAddedNodeOrAddedEdgeNode) {
    this.activePoint.addEdgeToActiveNode(character, new Edge(index, this.word.length(), null));
    previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkToActiveNode(previouslyAddedNodeOrAddedEdgeNode);
    if(this.activePoint.activeNodeHasSlink()) {
        this.activePoint = this.activePoint.moveToSlink();
    }
    else {
        this.activePoint = this.activePoint.moveTo(this.root);
    }
    this.remainder--;
    return previouslyAddedNodeOrAddedEdgeNode;
  }

  private void activeEdgeHasCharacter() {
    this.activePoint = this.activePoint.moveByOneCharacter();
    if(this.activePoint.pointsToTheEndOfActiveEdge()) {
        this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
    }
  }

  private Node edgeFromRootNodeHasNotCharacter(final int index,
                                             final char character,
                                             Node previouslyAddedNodeOrAddedEdgeNode) {
    final Node newNode = new Node(idGenerator++);
    this.activePoint.splitActiveEdge(this.word, newNode, index, character);
    previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, newNode);
    this.activePoint = this.activePoint.moveToEdgeStartingWithAndByActiveLengthMinusOne(this.root,
            this.word.charAt(index - this.remainder + 2));
    this.activePoint = walkDown(index);
    this.remainder--;
    return previouslyAddedNodeOrAddedEdgeNode;
  }

  private Node edgeFromInternalNodeHasNotCharacter(final int index,
                                                 final char character,
                                                 Node previouslyAddedNodeOrAddedEdgeNode) {
    final Node newNode = new Node(idGenerator++);
    this.activePoint.splitActiveEdge(this.word, newNode, index, character);
    previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, newNode);
    if(this.activePoint.activeNodeHasSlink()) {
        this.activePoint = this.activePoint.moveToSlink();
    }
    else {
        this.activePoint = this.activePoint.moveTo(this.root);
    }
    this.activePoint = walkDown(index);
    this.remainder--;
    return previouslyAddedNodeOrAddedEdgeNode;
  }

  private ActivePoint walkDown(final int index) {
    while(!this.activePoint.pointsToActiveNode()
            && (this.activePoint.pointsToTheEndOfActiveEdge() || this.activePoint.pointsAfterTheEndOfActiveEdge())) {
        if(this.activePoint.pointsAfterTheEndOfActiveEdge()) {
            this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge(this.word, index);
        }
        else {
            this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
        }
    }
    return this.activePoint;
  }

  public String toString(final String word) {
    return this.root.toString(word);
  }

  public boolean contains(final String suffix) {
    return this.root.contains(this.word, suffix);
  }

  public static void main(final String[] args) {
    final String[] words = {
            "abcabcabc$",
            "abc$",
            "abcabxabcd$",
            "abcabxabda$",
            "abcabxad$",
            "aabaaabb$",
            "aababcabcd$",
            "ababcabcd$",
            "abccba$",
            "mississipi$",
            "abacabadabacabae$",
            "abcabcd$",
            "00132220$"
    };
    Arrays.stream(words).forEach(word -> {
        System.out.println("Building suffix tree for word: " + word);
        final ST suffixTree = new ST(word);
        System.out.println("Suffix tree: " + suffixTree.toString(word));
        for(int i = 0; i < word.length() - 1; i++) {
            assert suffixTree.contains(word.substring(i)) : word.substring(i);
        }
    });
  }
}

2017-04-21 14:22:32

我尝试使用jogojapan回答中给出的方法来实现后缀树，但由于规则使用的措辞，它在某些情况下不起作用。此外，我还提到过，没有人能够使用这种方法实现绝对正确的后缀树。下面我将对jogojapan的答案进行“概述”，并对规则进行了一些修改。我还将描述当我们忘记创建重要的后缀链接时的情况。

使用的其他变量

活动点-一个三元(active_node;active_edge;Active_length)，显示我们必须从哪里开始插入新后缀。剩余-显示必须显式添加的后缀的数量。例如，如果我们的单词是'abcaabca'，并且remainder = 3，这意味着我们必须处理3个最后的后缀:bca, ca和a。

让我们使用一个内部节点的概念——所有的节点，除了根节点和叶节点都是内部节点。

观察1

当我们发现需要插入的最后一个后缀已经存在于树中时，树本身根本不会改变(我们只更新活动点和余数)。

观察2

如果在某一点上active_length大于或等于当前边缘的长度(edge_length)，我们将活动点向下移动，直到edge_length严格大于active_length。

现在，让我们重新定义规则:

规则1

如果从活动节点= root插入后，活动长度大于0，则: 主节点未改变活动长度减少活动边向右移动(到必须插入的下一个后缀的第一个字符)

规则2

如果我们创建一个新的内部节点或从一个内部节点创建一个插入器，并且这不是当前步骤中的第一个此类内部节点，那么我们将通过后缀链接将前一个此类节点与this节点链接起来。

规则2的定义与jogojapan的不同，因为这里我们不仅考虑了新创建的内部节点，还考虑了我们进行插入的内部节点。

规则3

在从活动节点(不是根节点)进行插入之后，必须遵循后缀链接并将活动节点设置为它所指向的节点。如果没有后缀链路，请将主节点设置为根节点。无论哪种方式，活动边和活动长度保持不变。

在规则3的这个定义中，我们还考虑了叶节点的插入(不仅仅是分裂节点)。

最后，观察3:

当我们想要添加到树中的符号已经在边缘上时，根据观察1，我们只更新活动点和余数，保持树不变。但是如果有一个内部节点被标记为需要后缀链接，我们必须通过后缀链接将该节点与我们当前的活动节点连接起来。

让我们看看cdddcdc的后缀树的例子，如果我们在这种情况下添加后缀链接，如果我们不添加:

如果我们不使用后缀link连接节点: 在加上最后一个字母c之前: 加上最后一个字母c后: 如果我们确实通过后缀link连接节点: 在加上最后一个字母c之前: 加上最后一个字母c后:

Seems like there is no significant difference: in the second case there are two more suffix links. But these suffix links are correct, and one of them - from the blue node to the red one - is very important for our approach with active point. The problem is that if we don't put a suffix link here, later, when we add some new letters to the tree, we might omit adding some nodes to the tree due to the Rule 3, because, according to it, if there's no a suffix link, then we must put the active_node to the root.

When we were adding the last letter to the tree, the red node had already existed before we made an insert from the blue node (the edge labled 'c'). As there was an insert from the blue node, we mark it as needing a suffix link. Then, relying on the active point approach, the active node was set to the red node. But we don't make an insert from the red node, as the letter 'c' is already on the edge. Does it mean that the blue node must be left without a suffix link? No, we must connect the blue node with the red one through a suffix link. Why is it correct? Because the active point approach guarantees that we get to a right place, i.e., to the next place where we must process an insert of a shorter suffix.

最后，这里是我对后缀树的实现:

Java c++

希望这个“概述”结合jogojapan的详细回答将帮助某人实现他自己的后缀树。

2013-01-29 09:57:43

嗨，我已经尝试在ruby中实现上述解释的实现，请检查它。它似乎工作得很好。

在实现中唯一的区别是，我尝试使用边缘对象，而不是仅仅使用符号。

它也出现在https://gist.github.com/suchitpuri/9304856

    require 'pry'


class Edge
    attr_accessor :data , :edges , :suffix_link
    def initialize data
        @data = data
        @edges = []
        @suffix_link = nil
    end

    def find_edge element
        self.edges.each do |edge|
            return edge if edge.data.start_with? element
        end
        return nil
    end
end

class SuffixTrees
    attr_accessor :root , :active_point , :remainder , :pending_prefixes , :last_split_edge , :remainder

    def initialize
        @root = Edge.new nil
        @active_point = { active_node: @root , active_edge: nil , active_length: 0}
        @remainder = 0
        @pending_prefixes = []
        @last_split_edge = nil
        @remainder = 1
    end

    def build string
        string.split("").each_with_index do |element , index|


            add_to_edges @root , element        

            update_pending_prefix element                           
            add_pending_elements_to_tree element
            active_length = @active_point[:active_length]

            # if(@active_point[:active_edge] && @active_point[:active_edge].data && @active_point[:active_edge].data[0..active_length-1] ==  @active_point[:active_edge].data[active_length..@active_point[:active_edge].data.length-1])
            #   @active_point[:active_edge].data = @active_point[:active_edge].data[0..active_length-1]
            #   @active_point[:active_edge].edges << Edge.new(@active_point[:active_edge].data)
            # end

            if(@active_point[:active_edge] && @active_point[:active_edge].data && @active_point[:active_edge].data.length == @active_point[:active_length]  )
                @active_point[:active_node] =  @active_point[:active_edge]
                @active_point[:active_edge] = @active_point[:active_node].find_edge(element[0])
                @active_point[:active_length] = 0
            end
        end
    end

    def add_pending_elements_to_tree element

        to_be_deleted = []
        update_active_length = false
        # binding.pry
        if( @active_point[:active_node].find_edge(element[0]) != nil)
            @active_point[:active_length] = @active_point[:active_length] + 1               
            @active_point[:active_edge] = @active_point[:active_node].find_edge(element[0]) if @active_point[:active_edge] == nil
            @remainder = @remainder + 1
            return
        end



        @pending_prefixes.each_with_index do |pending_prefix , index|

            # binding.pry           

            if @active_point[:active_edge] == nil and @active_point[:active_node].find_edge(element[0]) == nil

                @active_point[:active_node].edges << Edge.new(element)

            else

                @active_point[:active_edge] = node.find_edge(element[0]) if @active_point[:active_edge]  == nil

                data = @active_point[:active_edge].data
                data = data.split("")               

                location = @active_point[:active_length]


                # binding.pry
                if(data[0..location].join == pending_prefix or @active_point[:active_node].find_edge(element) != nil )                  


                else #tree split    
                    split_edge data , index , element
                end

            end
        end 
    end



    def update_pending_prefix element
        if @active_point[:active_edge] == nil
            @pending_prefixes = [element]
            return

        end

        @pending_prefixes = []

        length = @active_point[:active_edge].data.length
        data = @active_point[:active_edge].data
        @remainder.times do |ctr|
                @pending_prefixes << data[-(ctr+1)..data.length-1]
        end

        @pending_prefixes.reverse!

    end

    def split_edge data , index , element
        location = @active_point[:active_length]
        old_edges = []
        internal_node = (@active_point[:active_edge].edges != nil)

        if (internal_node)
            old_edges = @active_point[:active_edge].edges 
            @active_point[:active_edge].edges = []
        end

        @active_point[:active_edge].data = data[0..location-1].join                 
        @active_point[:active_edge].edges << Edge.new(data[location..data.size].join)


        if internal_node
            @active_point[:active_edge].edges << Edge.new(element)
        else
            @active_point[:active_edge].edges << Edge.new(data.last)        
        end

        if internal_node
            @active_point[:active_edge].edges[0].edges = old_edges
        end


        #setup the suffix link
        if @last_split_edge != nil and @last_split_edge.data.end_with?@active_point[:active_edge].data 

            @last_split_edge.suffix_link = @active_point[:active_edge] 
        end

        @last_split_edge = @active_point[:active_edge]

        update_active_point index

    end


    def update_active_point index
        if(@active_point[:active_node] == @root)
            @active_point[:active_length] = @active_point[:active_length] - 1
            @remainder = @remainder - 1
            @active_point[:active_edge] = @active_point[:active_node].find_edge(@pending_prefixes.first[index+1])
        else
            if @active_point[:active_node].suffix_link != nil
                @active_point[:active_node] = @active_point[:active_node].suffix_link               
            else
                @active_point[:active_node] = @root
            end 
            @active_point[:active_edge] = @active_point[:active_node].find_edge(@active_point[:active_edge].data[0])
            @remainder = @remainder - 1     
        end
    end

    def add_to_edges root , element     
        return if root == nil
        root.data = root.data + element if(root.data and root.edges.size == 0)
        root.edges.each do |edge|
            add_to_edges edge , element
        end
    end
end

suffix_tree = SuffixTrees.new
suffix_tree.build("abcabxabcd")
binding.pry

2014-03-02 10:54:58

下面尝试描述Ukkonen算法，首先展示当字符串很简单(即不包含任何重复字符)时它会做什么，然后将其扩展到完整的算法。

首先，做一些初步陈述。

我们正在构建的，基本上就像一个搜索树。因此, 是根节点，它的边指向新的节点再往外延伸一些边，等等但是:与搜索树不同，边缘标签不是单一的字符。相反，每条边都使用一对整数进行标记 [,]。这些是指向文本的指针。在这个意义上，每一个 edge携带任意长度的字符串标签，但只接受O(1) 空格(两个指针)。

基本原理

我想首先演示如何创建的后缀树特别简单的字符串，没有重复字符的字符串:

abc

该算法按从左到右的步骤工作。字符串的每个字符都有一个步骤。每一步可能涉及多个单独的操作，但我们将看到(参见最后的观察结果)操作的总数为O(n)。

因此，我们从左边开始，首先只插入单个字符 A通过创建一条从根节点(左边)到叶节点的边，并将其标记为[0，#]，这意味着这条边代表从位置0开始到当前结束的子字符串。我使用符号#表示当前的结束，它位于位置1 (就在a后面)。

我们有了一个初始树，它看起来像这样:

它的意思是:

现在我们前进到位置2(在b之后)。每一步的目标是插入到当前位置的所有后缀。我们这样做通过

将现有的a边扩展到ab边为b插入一条新边

在我们的表示法中是这样的

它的意思是:

我们观察到两点:

ab的边表示和以前一样在初始树中:[0，#]。它的含义自动改变了因为我们将当前位置#从1更新到2。每条边占用O(1)个空间，因为它只由两条边组成指向文本的指针，不管它有多少个字符代表。

接下来，我们再次增加位置，并通过追加更新树 A c到每条现有的边，并为新边插入一条新边后缀c。

在我们的表示法中是这样的

它的意思是:

我们观察到:

该树是到当前位置为止的正确后缀树在每一步之后文本中有多少字符，就有多少步骤每一步的工作量是O(1)，因为所有现有的边通过增加#和插入最后一个字符的新边可以在O(1)中完成时间。因此，对于长度为n的字符串，只需要O(n)个时间。

第一个扩展:简单重复

当然，这工作得很好，只是因为我们的字符串没有不要重复。我们现在看一个更现实的字符串:

abcabxabcd

它像前面的例子一样以abc开头，然后重复ab 后面跟着x，然后重复ABC，后面跟着d。

步骤1到步骤3:在前3步之后，我们有了前面例子中的树:

第四步:我们移动#到位置4。这将隐式地更新所有现有的边缘问题:

我们需要插入当前步骤的最后一个后缀，a, at 根。

在此之前，我们再引入两个变量(除了 #)，当然，它一直都在那里，但我们没有使用过到目前为止:

活动点，是一个三倍的 (active_edge active_node active_length) 余数，它是一个整数，表示有多少个新后缀我们需要插入

这两个词的确切含义很快就会清楚，但现在让我们这么说吧:

In the simple abc example, the active point was always (root,'\0x',0), i.e. active_node was the root node, active_edge was specified as the null character '\0x', and active_length was zero. The effect of this was that the one new edge that we inserted in every step was inserted at the root node as a freshly created edge. We will see soon why a triple is necessary to represent this information. The remainder was always set to 1 at the beginning of each step. The meaning of this was that the number of suffixes we had to actively insert at the end of each step was 1 (always just the final character).

现在这将会改变。当我们插入当前的final时字符a在根，我们注意到已经有一个外向以a开头的边，特别是abca。这是我们在这种情况:

We do not insert a fresh edge [4,#] at the root node. Instead we simply notice that the suffix a is already in our tree. It ends in the middle of a longer edge, but we are not bothered by that. We just leave things the way they are. We set the active point to (root,'a',1). That means the active point is now somewhere in the middle of outgoing edge of the root node that starts with a, specifically, after position 1 on that edge. We notice that the edge is specified simply by its first character a. That suffices because there can be only one edge starting with any particular character (confirm that this is true after reading through the entire description). We also increment remainder, so at the beginning of the next step it will be 2.

观察:当我们需要插入的最后一个后缀被找到时已经存在于树中，树本身根本没有改变(我们只更新活动点和余数)。这棵树然后不是一个准确的表示后缀树到当前位置不再存在，但它包含所有后缀(因为final 后缀a是隐式包含的)。因此，除了更新变量(都是固定长度的，所以这是O(1) 这一步没有做功。

步骤5:我们将当前位置#更新为5。这自动更新树如下:

因为余数是2，我们需要插入两个final 当前位置的后缀:ab和b。这基本上是因为:

上一步中的a后缀一直不正确插入。所以它一直存在，自从我们进步了一个 Step，现在从a变成了ab。我们需要插入最后一条新边b。

在实践中，这意味着我们去到活动点(指向在a后面，也就是现在的abcab边缘)，然后插入现在的最终字符b。但是:再次证明b是这条边也已经出现了。

同样，我们没有改变树。我们只是:

更新活动点到(根，'a'，2)(相同的节点和边和之前一样，但现在我们指向b) 将余数加为3，因为我们仍然没有正确地插入前一步的最后一条边，我们不插入当前的最终边。

明确一点:我们必须在当前步骤中插入ab和b，但是因为ab已经被找到，所以我们更新了活动点甚至没有尝试插入b。为什么?因为如果ab在树中，它的每个后缀(包括b)都必须在树中，了。也许只是隐含的，但它必须在那里，因为就像我们到目前为止建立的树。

我们通过增加#来继续第6步。树是自动更新为:

因为余数是3，我们必须插入abx bx和活动点告诉我们ab的终点，所以我们只需要跳转到那里并插入x，实际上，x还没有在那里，所以我们分割abcabx边缘并插入一个内部节点:

边缘表示仍然是指向文本的指针，所以分裂和插入内部节点可以在O(1)时间内完成。

我们处理了abx，并将余数减为2。现在我们需要插入下一个剩余的后缀，bx。但在此之前我们需要更新活动点。分裂后的规则并插入一条边，将被称为下面的规则1，它适用于 Active_node是根(我们将进一步学习规则3用于其他情况下文)。规则一:

从根节点插入后， Active_node保持root Active_edge被设置为新后缀we的第一个字符需要插入，即b Active_length减少1

因此，新的活动点三元组(根，'b'，1)表示下一个插入必须在bcabx边缘，后面1个字符，我们可以在O(1)时间内识别插入点和检查x是否已经存在。如果它存在，我们将结束当前步骤，让一切保持原样。但x 不存在，所以我们通过分割边缘插入它:

同样，这花费了O(1)时间我们将余数更新为1 如规则1所述，活动指向(根，'x'，0)。

但我们还需要做一件事。我们称之为规则2:

如果我们分割一条边并插入一个新节点，如果不是在当前步骤中创建的第一个节点，连接先前的节点插入节点和新节点通过一个特殊指针、后缀链接。我们稍后会看到为什么这是有用的。这就是我们得到的后缀link表示为虚线边:

我们仍然需要插入当前步骤的最后一个后缀， x.由于主节点的active_length组件下降对于0，最后的插入直接在根节点上进行。因为没有从x开始的根节点的外向边，我们插入一条新的优势:

正如我们所看到的，在当前步骤中，所有剩余的插入都已完成。

我们通过设置#=7继续到第7步，这将自动添加下一个字符， A，到所有的叶边，一如既往。然后我们尝试插入新的final 字符到活动点(根)，并发现它在那里了。我们结束当前的步骤，不插入任何东西将活动点更新为(root，'a'，1)。

在第8步，#=8，我们附加b，正如前面看到的，这只是这意味着我们将活动点更新为(root，'a'，2)，并增加余数还有别的吗，因为b已经存在了。然而，我们注意到(在O(1)时间)活动点现在在一条边缘的尽头。我们通过将其重新设置为来反映这一点 (node1 ' \ 0 x ' 0)。在这里，我使用node1来引用 ab边结束于的内部节点。

然后，在步骤#=9中，我们需要插入'c'，这将帮助我们了解最后一个技巧:

第二个扩展:使用后缀链接

像往常一样，# update会自动将c追加到叶边我们到活动点，看看能不能插入c。结果 c已经在那条边存在了，所以我们把活动点设为 (node1，'c'，1)，增加余数，其他什么都不做。

现在在步骤#=10中，余数是4，所以我们首先需要插入 Abcd(从3步前保留下来)通过在active中插入d 点。

试图在活动点插入d会导致边缘分裂 O(1)时间:

在中标记了active_node，分裂就是从它开始的上面的红色。这是最后的规则，规则3:

在从一个不是根的active_node中分离边缘之后节点，如果有的话，我们跟踪从该节点出来的后缀链接并将active_node重置为它所指向的节点。如果有的话没有后缀链接，我们将active_node设置为根。active_edge active_length保持不变。

所以活动点现在是(node2，'c'，1)， node2被标记进来红色下面:

由于abcd的插入已经完成，我们将余数减为 3，并考虑当前步骤的下一个剩余后缀， bcd。规则3将活动点设置为正确的节点和边所以插入BCD可以通过简单地插入它的最后一个字符来完成 D在活动点。

这样做会导致另一条边分裂，由于规则2，我们必须创建从先前插入的节点到新节点的后缀链接吗一:

我们观察到:后缀链接使我们能够重置活动点，因此我们可以以O(1)次努力进行下一个剩余插入。看看以确认标签ab上的节点确实被链接到 b处的节点(它的后缀)和ABC处的节点被链接到 bc。

当前步骤还没有完成。余数现在是2，我们需要按照规则3重新设置活动点。自当前active_node(上面红色)没有后缀链接，我们重置为根。活动点现在是(根，c'，1)。

因此，下一次插入发生在根节点的一个外向边缘其标签以c: cabxabcd开头，在第一个字符后面，这导致了另一个分裂:

因为这涉及到一个新的内部节点的创建，我们遵循规则2，并设置一个新的后缀链接之前创建的内部节点:

(我使用Graphviz Dot的这些小图表。新的后缀链接导致点重新排列现有的边，所以仔细检查，以确认唯一的东西上面插入了一个新的后缀链接。)

这样，余数可以设置为1，因为active_node是根，我们使用规则1将活动点更新为(根，'d'，0)。这表示当前步骤的最后一个插入是插入单个d 根:

这是最后一步，我们做完了。有一些最终的不过,观察:

In each step we move # forward by 1 position. This automatically updates all leaf nodes in O(1) time. But it does not deal with a) any suffixes remaining from previous steps, and b) with the one final character of the current step. remainder tells us how many additional inserts we need to make. These inserts correspond one-to-one to the final suffixes of the string that ends at the current position #. We consider one after the other and make the insert. Important: Each insert is done in O(1) time since the active point tells us exactly where to go, and we need to add only one single character at the active point. Why? Because the other characters are contained implicitly (otherwise the active point would not be where it is). After each such insert, we decrement remainder and follow the suffix link if there is any. If not we go to root (rule 3). If we are at root already, we modify the active point using rule 1. In any case, it takes only O(1) time. If, during one of these inserts, we find that the character we want to insert is already there, we don't do anything and end the current step, even if remainder>0. The reason is that any inserts that remain will be suffixes of the one we just tried to make. Hence they are all implicit in the current tree. The fact that remainder>0 makes sure we deal with the remaining suffixes later. What if at the end of the algorithm remainder>0? This will be the case whenever the end of the text is a substring that occurred somewhere before. In that case we must append one extra character at the end of the string that has not occurred before. In the literature, usually the dollar sign $ is used as a symbol for that. Why does that matter? --> If later we use the completed suffix tree to search for suffixes, we must accept matches only if they end at a leaf. Otherwise we would get a lot of spurious matches, because there are many strings implicitly contained in the tree that are not actual suffixes of the main string. Forcing remainder to be 0 at the end is essentially a way to ensure that all suffixes end at a leaf node. However, if we want to use the tree to search for general substrings, not only suffixes of the main string, this final step is indeed not required, as suggested by the OP's comment below. So what is the complexity of the entire algorithm? If the text is n characters in length, there are obviously n steps (or n+1 if we add the dollar sign). In each step we either do nothing (other than updating the variables), or we make remainder inserts, each taking O(1) time. Since remainder indicates how many times we have done nothing in previous steps, and is decremented for every insert that we make now, the total number of times we do something is exactly n (or n+1). Hence, the total complexity is O(n). However, there is one small thing that I did not properly explain: It can happen that we follow a suffix link, update the active point, and then find that its active_length component does not work well with the new active_node. For example, consider a situation like this:

虚线表示树的其余部分。虚线是a 后缀链接。)

现在设活动点为(红色，'d'，3)，所以它指向某个位置在defg边缘的f后面。现在假设我们做了必要的事情更新，现在跟随后缀链接更新活动点根据规则3。新的活动点是(绿色，'d'，3)。然而, 绿色节点外的d边是de，所以它只有2 字符。为了找到正确的活动点，我们显然需要沿着这条边到蓝色节点，并重置为(蓝色，'f'，1)。

在特别糟糕的情况下，active_length可能大到余数，可以大到n，这很可能发生为了找到正确的活动点，我们不仅仅需要跳过一个内部节点，但可能很多，最坏情况下最多n个。这意味着算法有一个隐藏的O(n2)复杂度，因为每步余数一般为O(n)，后调整到主节点后的后缀链接也可以是O(n) ?

No. The reason is that if indeed we have to adjust the active point (e.g. from green to blue as above), that brings us to a new node that has its own suffix link, and active_length will be reduced. As we follow down the chain of suffix links we make the remaining inserts, active_length can only decrease, and the number of active-point adjustments we can make on the way can't be larger than active_length at any given time. Since active_length can never be larger than remainder, and remainder is O(n) not only in every single step, but the total sum of increments ever made to remainder over the course of the entire process is O(n) too, the number of active point adjustments is also bounded by O(n).

2012-03-01 09:13:18

Ukkonen的后缀树算法

推荐文章

最新文章

标签