在这一点上我觉得有点笨。我花了几天时间试图完全理解后缀树结构,但因为我没有数学背景,很多解释都让我难以理解,因为它们开始过度使用数学符号学。我发现的最接近好的解释是使用后缀树的快速字符串搜索,但他掩盖了各种点,算法的一些方面仍然不清楚。

我敢肯定,Stack Overflow上对这个算法的一步一步的解释对除了我之外的许多人来说都是无价的。

作为参考,这里是Ukkonen关于算法的论文:http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf

目前我的基本理解是:

I need to iterate through each prefix P of a given string T I need to iterate through each suffix S in prefix P and add that to tree To add suffix S to the tree, I need to iterate through each character in S, with the iterations consisting of either walking down an existing branch that starts with the same set of characters C in S and potentially splitting an edge into descendent nodes when I reach a differing character in the suffix, OR if there was no matching edge to walk down. When no matching edge is found to walk down for C, a new leaf edge is created for C.

基本算法似乎是O(n2),正如在大多数解释中指出的那样,因为我们需要逐级遍历所有前缀,然后我们需要逐级遍历每个前缀的每个后缀。Ukkonen的算法显然是独一无二的,因为他使用了后缀指针技术,尽管我认为这是我难以理解的地方。

我也很难理解:

“活动点”被分配、使用和改变的确切时间和方式 算法的规范化方面发生了什么 为什么我看到的实现需要“修复”他们正在使用的边界变量


下面是完整的c#源代码。它不仅工作正确,而且支持自动规范化,并呈现输出的文本图形。源代码和示例输出在:

https://gist.github.com/2373868


更新2017-11-04

多年后,我发现了后缀树的新用途,并在JavaScript中实现了算法。要点如下。它应该没有bug。将它转储到一个js文件中,从相同的位置npm install chalk,然后运行node.js来查看一些彩色的输出。在同一个Gist中有一个精简的版本,没有任何调试代码。

https://gist.github.com/axefrog/c347bf0f5e0723cbd09b1aaed6ec6fc6


当前回答

感谢@jogojapan精心解释的教程,我在Python中实现了算法。

@jogojapan提到的几个小问题比我想象的要复杂,需要非常谨慎地对待。我花了好几天时间来使我的实现足够健壮(我想)。问题及解决方法如下:

End with Remainder > 0 It turns out this situation can also happen during the unfolding step, not just the end of the entire algorithm. When that happens, we can leave the remainder, actnode, actedge, and actlength unchanged, end the current unfolding step, and start another step by either keep folding or unfolding depending on if the next char in the original string is on the current path or not. Leap Over Nodes: When we follow a suffix link, update the active point, and then find that its active_length component does not work well with the new active_node. We have to move forward to the right place to split, or insert a leaf. This process might be not that straightforward because during the moving the actlength and actedge keep changing all the way, when you have to move back to the root node, the actedge and actlength could be wrong because of those moves. We need additional variable(s) to keep that information.

@managonov以某种方式指出了另外两个问题

Split Could Degenerate When trying to split an edge, sometime you'll find the split operation is right on a node. That case we only need add a new leaf to that node, take it as a standard edge split operation, which means the suffix links if there's any, should be maintained correspondingly. Hidden Suffix Links There is another special case which is incurred by problem 1 and problem 2. Sometimes we need to hop over several nodes to the right point for split, we might surpass the right point if we move by comparing the remainder string and the path labels. That case the suffix link will be neglected unintentionally, if there should be any. This could be avoided by remembering the right point when moving forward. The suffix link should be maintained if the split node already exists, or even the problem 1 happens during a unfolding step.

最后,我在Python中的实现如下:

Python

提示:在上面的代码中包含了一个朴素的树打印函数,这在调试时非常重要。这帮我省了很多钱 时间短,便于定位特殊情况。

其他回答

嗨,我已经尝试在ruby中实现上述解释的实现,请检查它。 它似乎工作得很好。

在实现中唯一的区别是,我尝试使用边缘对象,而不是仅仅使用符号。

它也出现在https://gist.github.com/suchitpuri/9304856

    require 'pry'


class Edge
    attr_accessor :data , :edges , :suffix_link
    def initialize data
        @data = data
        @edges = []
        @suffix_link = nil
    end

    def find_edge element
        self.edges.each do |edge|
            return edge if edge.data.start_with? element
        end
        return nil
    end
end

class SuffixTrees
    attr_accessor :root , :active_point , :remainder , :pending_prefixes , :last_split_edge , :remainder

    def initialize
        @root = Edge.new nil
        @active_point = { active_node: @root , active_edge: nil , active_length: 0}
        @remainder = 0
        @pending_prefixes = []
        @last_split_edge = nil
        @remainder = 1
    end

    def build string
        string.split("").each_with_index do |element , index|


            add_to_edges @root , element        

            update_pending_prefix element                           
            add_pending_elements_to_tree element
            active_length = @active_point[:active_length]

            # if(@active_point[:active_edge] && @active_point[:active_edge].data && @active_point[:active_edge].data[0..active_length-1] ==  @active_point[:active_edge].data[active_length..@active_point[:active_edge].data.length-1])
            #   @active_point[:active_edge].data = @active_point[:active_edge].data[0..active_length-1]
            #   @active_point[:active_edge].edges << Edge.new(@active_point[:active_edge].data)
            # end

            if(@active_point[:active_edge] && @active_point[:active_edge].data && @active_point[:active_edge].data.length == @active_point[:active_length]  )
                @active_point[:active_node] =  @active_point[:active_edge]
                @active_point[:active_edge] = @active_point[:active_node].find_edge(element[0])
                @active_point[:active_length] = 0
            end
        end
    end

    def add_pending_elements_to_tree element

        to_be_deleted = []
        update_active_length = false
        # binding.pry
        if( @active_point[:active_node].find_edge(element[0]) != nil)
            @active_point[:active_length] = @active_point[:active_length] + 1               
            @active_point[:active_edge] = @active_point[:active_node].find_edge(element[0]) if @active_point[:active_edge] == nil
            @remainder = @remainder + 1
            return
        end



        @pending_prefixes.each_with_index do |pending_prefix , index|

            # binding.pry           

            if @active_point[:active_edge] == nil and @active_point[:active_node].find_edge(element[0]) == nil

                @active_point[:active_node].edges << Edge.new(element)

            else

                @active_point[:active_edge] = node.find_edge(element[0]) if @active_point[:active_edge]  == nil

                data = @active_point[:active_edge].data
                data = data.split("")               

                location = @active_point[:active_length]


                # binding.pry
                if(data[0..location].join == pending_prefix or @active_point[:active_node].find_edge(element) != nil )                  


                else #tree split    
                    split_edge data , index , element
                end

            end
        end 
    end



    def update_pending_prefix element
        if @active_point[:active_edge] == nil
            @pending_prefixes = [element]
            return

        end

        @pending_prefixes = []

        length = @active_point[:active_edge].data.length
        data = @active_point[:active_edge].data
        @remainder.times do |ctr|
                @pending_prefixes << data[-(ctr+1)..data.length-1]
        end

        @pending_prefixes.reverse!

    end

    def split_edge data , index , element
        location = @active_point[:active_length]
        old_edges = []
        internal_node = (@active_point[:active_edge].edges != nil)

        if (internal_node)
            old_edges = @active_point[:active_edge].edges 
            @active_point[:active_edge].edges = []
        end

        @active_point[:active_edge].data = data[0..location-1].join                 
        @active_point[:active_edge].edges << Edge.new(data[location..data.size].join)


        if internal_node
            @active_point[:active_edge].edges << Edge.new(element)
        else
            @active_point[:active_edge].edges << Edge.new(data.last)        
        end

        if internal_node
            @active_point[:active_edge].edges[0].edges = old_edges
        end


        #setup the suffix link
        if @last_split_edge != nil and @last_split_edge.data.end_with?@active_point[:active_edge].data 

            @last_split_edge.suffix_link = @active_point[:active_edge] 
        end

        @last_split_edge = @active_point[:active_edge]

        update_active_point index

    end


    def update_active_point index
        if(@active_point[:active_node] == @root)
            @active_point[:active_length] = @active_point[:active_length] - 1
            @remainder = @remainder - 1
            @active_point[:active_edge] = @active_point[:active_node].find_edge(@pending_prefixes.first[index+1])
        else
            if @active_point[:active_node].suffix_link != nil
                @active_point[:active_node] = @active_point[:active_node].suffix_link               
            else
                @active_point[:active_node] = @root
            end 
            @active_point[:active_edge] = @active_point[:active_node].find_edge(@active_point[:active_edge].data[0])
            @remainder = @remainder - 1     
        end
    end

    def add_to_edges root , element     
        return if root == nil
        root.data = root.data + element if(root.data and root.edges.size == 0)
        root.edges.each do |edge|
            add_to_edges edge , element
        end
    end
end

suffix_tree = SuffixTrees.new
suffix_tree.build("abcabxabcd")
binding.pry

@jogojapan你带来了很棒的解释和可视化。但正如@makagonov提到的,它缺少一些关于设置后缀链接的规则。在http://brenden.github.io/ukkonen-animation/上通过单词“aabaaabb”一步一步地浏览时,可以很好地看到它。从步骤10到步骤11,节点5到节点2之间没有后缀链接,但活动点突然移动到那里。

@makagonov因为我生活在Java世界,我也试图遵循你的实现来掌握ST构建工作流,但这对我来说很难,因为:

边与节点的组合 使用索引指针而不是引用 优惠声明; 继续陈述;

所以我最终在Java中实现了这样的实现,我希望以更清晰的方式反映所有步骤,并将减少其他Java人的学习时间:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ST {

  public class Node {
    private final int id;
    private final Map<Character, Edge> edges;
    private Node slink;

    public Node(final int id) {
        this.id = id;
        this.edges = new HashMap<>();
    }

    public void setSlink(final Node slink) {
        this.slink = slink;
    }

    public Map<Character, Edge> getEdges() {
        return this.edges;
    }

    public Node getSlink() {
        return this.slink;
    }

    public String toString(final String word) {
        return new StringBuilder()
                .append("{")
                .append("\"id\"")
                .append(":")
                .append(this.id)
                .append(",")
                .append("\"slink\"")
                .append(":")
                .append(this.slink != null ? this.slink.id : null)
                .append(",")
                .append("\"edges\"")
                .append(":")
                .append(edgesToString(word))
                .append("}")
                .toString();
    }

    private StringBuilder edgesToString(final String word) {
        final StringBuilder edgesStringBuilder = new StringBuilder();
        edgesStringBuilder.append("{");
        for(final Map.Entry<Character, Edge> entry : this.edges.entrySet()) {
            edgesStringBuilder.append("\"")
                    .append(entry.getKey())
                    .append("\"")
                    .append(":")
                    .append(entry.getValue().toString(word))
                    .append(",");
        }
        if(!this.edges.isEmpty()) {
            edgesStringBuilder.deleteCharAt(edgesStringBuilder.length() - 1);
        }
        edgesStringBuilder.append("}");
        return edgesStringBuilder;
    }

    public boolean contains(final String word, final String suffix) {
        return !suffix.isEmpty()
                && this.edges.containsKey(suffix.charAt(0))
                && this.edges.get(suffix.charAt(0)).contains(word, suffix);
    }
  }

  public class Edge {
    private final int from;
    private final int to;
    private final Node next;

    public Edge(final int from, final int to, final Node next) {
        this.from = from;
        this.to = to;
        this.next = next;
    }

    public int getFrom() {
        return this.from;
    }

    public int getTo() {
        return this.to;
    }

    public Node getNext() {
        return this.next;
    }

    public int getLength() {
        return this.to - this.from;
    }

    public String toString(final String word) {
        return new StringBuilder()
                .append("{")
                .append("\"content\"")
                .append(":")
                .append("\"")
                .append(word.substring(this.from, this.to))
                .append("\"")
                .append(",")
                .append("\"next\"")
                .append(":")
                .append(this.next != null ? this.next.toString(word) : null)
                .append("}")
                .toString();
    }

    public boolean contains(final String word, final String suffix) {
        if(this.next == null) {
            return word.substring(this.from, this.to).equals(suffix);
        }
        return suffix.startsWith(word.substring(this.from,
                this.to)) && this.next.contains(word, suffix.substring(this.to - this.from));
    }
  }

  public class ActivePoint {
    private final Node activeNode;
    private final Character activeEdgeFirstCharacter;
    private final int activeLength;

    public ActivePoint(final Node activeNode,
                       final Character activeEdgeFirstCharacter,
                       final int activeLength) {
        this.activeNode = activeNode;
        this.activeEdgeFirstCharacter = activeEdgeFirstCharacter;
        this.activeLength = activeLength;
    }

    private Edge getActiveEdge() {
        return this.activeNode.getEdges().get(this.activeEdgeFirstCharacter);
    }

    public boolean pointsToActiveNode() {
        return this.activeLength == 0;
    }

    public boolean activeNodeIs(final Node node) {
        return this.activeNode == node;
    }

    public boolean activeNodeHasEdgeStartingWith(final char character) {
        return this.activeNode.getEdges().containsKey(character);
    }

    public boolean activeNodeHasSlink() {
        return this.activeNode.getSlink() != null;
    }

    public boolean pointsToOnActiveEdge(final String word, final char character) {
        return word.charAt(this.getActiveEdge().getFrom() + this.activeLength) == character;
    }

    public boolean pointsToTheEndOfActiveEdge() {
        return this.getActiveEdge().getLength() == this.activeLength;
    }

    public boolean pointsAfterTheEndOfActiveEdge() {
        return this.getActiveEdge().getLength() < this.activeLength;
    }

    public ActivePoint moveToEdgeStartingWithAndByOne(final char character) {
        return new ActivePoint(this.activeNode, character, 1);
    }

    public ActivePoint moveToNextNodeOfActiveEdge() {
        return new ActivePoint(this.getActiveEdge().getNext(), null, 0);
    }

    public ActivePoint moveToSlink() {
        return new ActivePoint(this.activeNode.getSlink(),
                this.activeEdgeFirstCharacter,
                this.activeLength);
    }

    public ActivePoint moveTo(final Node node) {
        return new ActivePoint(node, this.activeEdgeFirstCharacter, this.activeLength);
    }

    public ActivePoint moveByOneCharacter() {
        return new ActivePoint(this.activeNode,
                this.activeEdgeFirstCharacter,
                this.activeLength + 1);
    }

    public ActivePoint moveToEdgeStartingWithAndByActiveLengthMinusOne(final Node node,
                                                                       final char character) {
        return new ActivePoint(node, character, this.activeLength - 1);
    }

    public ActivePoint moveToNextNodeOfActiveEdge(final String word, final int index) {
        return new ActivePoint(this.getActiveEdge().getNext(),
                word.charAt(index - this.activeLength + this.getActiveEdge().getLength()),
                this.activeLength - this.getActiveEdge().getLength());
    }

    public void addEdgeToActiveNode(final char character, final Edge edge) {
        this.activeNode.getEdges().put(character, edge);
    }

    public void splitActiveEdge(final String word,
                                final Node nodeToAdd,
                                final int index,
                                final char character) {
        final Edge activeEdgeToSplit = this.getActiveEdge();
        final Edge splittedEdge = new Edge(activeEdgeToSplit.getFrom(),
                activeEdgeToSplit.getFrom() + this.activeLength,
                nodeToAdd);
        nodeToAdd.getEdges().put(word.charAt(activeEdgeToSplit.getFrom() + this.activeLength),
                new Edge(activeEdgeToSplit.getFrom() + this.activeLength,
                        activeEdgeToSplit.getTo(),
                        activeEdgeToSplit.getNext()));
        nodeToAdd.getEdges().put(character, new Edge(index, word.length(), null));
        this.activeNode.getEdges().put(this.activeEdgeFirstCharacter, splittedEdge);
    }

    public Node setSlinkTo(final Node previouslyAddedNodeOrAddedEdgeNode,
                           final Node node) {
        if(previouslyAddedNodeOrAddedEdgeNode != null) {
            previouslyAddedNodeOrAddedEdgeNode.setSlink(node);
        }
        return node;
    }

    public Node setSlinkToActiveNode(final Node previouslyAddedNodeOrAddedEdgeNode) {
        return setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, this.activeNode);
    }
  }

  private static int idGenerator;

  private final String word;
  private final Node root;
  private ActivePoint activePoint;
  private int remainder;

  public ST(final String word) {
    this.word = word;
    this.root = new Node(idGenerator++);
    this.activePoint = new ActivePoint(this.root, null, 0);
    this.remainder = 0;
    build();
  }

  private void build() {
    for(int i = 0; i < this.word.length(); i++) {
        add(i, this.word.charAt(i));
    }
  }

  private void add(final int index, final char character) {
    this.remainder++;
    boolean characterFoundInTheTree = false;
    Node previouslyAddedNodeOrAddedEdgeNode = null;
    while(!characterFoundInTheTree && this.remainder > 0) {
        if(this.activePoint.pointsToActiveNode()) {
            if(this.activePoint.activeNodeHasEdgeStartingWith(character)) {
                activeNodeHasEdgeStartingWithCharacter(character, previouslyAddedNodeOrAddedEdgeNode);
                characterFoundInTheTree = true;
            }
            else {
                if(this.activePoint.activeNodeIs(this.root)) {
                    rootNodeHasNotEdgeStartingWithCharacter(index, character);
                }
                else {
                    previouslyAddedNodeOrAddedEdgeNode = internalNodeHasNotEdgeStartingWithCharacter(index,
                            character, previouslyAddedNodeOrAddedEdgeNode);
                }
            }
        }
        else {
            if(this.activePoint.pointsToOnActiveEdge(this.word, character)) {
                activeEdgeHasCharacter();
                characterFoundInTheTree = true;
            }
            else {
                if(this.activePoint.activeNodeIs(this.root)) {
                    previouslyAddedNodeOrAddedEdgeNode = edgeFromRootNodeHasNotCharacter(index,
                            character,
                            previouslyAddedNodeOrAddedEdgeNode);
                }
                else {
                    previouslyAddedNodeOrAddedEdgeNode = edgeFromInternalNodeHasNotCharacter(index,
                            character,
                            previouslyAddedNodeOrAddedEdgeNode);
                }
            }
        }
    }
  }

  private void activeNodeHasEdgeStartingWithCharacter(final char character,
                                                    final Node previouslyAddedNodeOrAddedEdgeNode) {
    this.activePoint.setSlinkToActiveNode(previouslyAddedNodeOrAddedEdgeNode);
    this.activePoint = this.activePoint.moveToEdgeStartingWithAndByOne(character);
    if(this.activePoint.pointsToTheEndOfActiveEdge()) {
        this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
    }
  }

  private void rootNodeHasNotEdgeStartingWithCharacter(final int index, final char character) {
    this.activePoint.addEdgeToActiveNode(character, new Edge(index, this.word.length(), null));
    this.activePoint = this.activePoint.moveTo(this.root);
    this.remainder--;
    assert this.remainder == 0;
  }

  private Node internalNodeHasNotEdgeStartingWithCharacter(final int index,
                                                         final char character,
                                                         Node previouslyAddedNodeOrAddedEdgeNode) {
    this.activePoint.addEdgeToActiveNode(character, new Edge(index, this.word.length(), null));
    previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkToActiveNode(previouslyAddedNodeOrAddedEdgeNode);
    if(this.activePoint.activeNodeHasSlink()) {
        this.activePoint = this.activePoint.moveToSlink();
    }
    else {
        this.activePoint = this.activePoint.moveTo(this.root);
    }
    this.remainder--;
    return previouslyAddedNodeOrAddedEdgeNode;
  }

  private void activeEdgeHasCharacter() {
    this.activePoint = this.activePoint.moveByOneCharacter();
    if(this.activePoint.pointsToTheEndOfActiveEdge()) {
        this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
    }
  }

  private Node edgeFromRootNodeHasNotCharacter(final int index,
                                             final char character,
                                             Node previouslyAddedNodeOrAddedEdgeNode) {
    final Node newNode = new Node(idGenerator++);
    this.activePoint.splitActiveEdge(this.word, newNode, index, character);
    previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, newNode);
    this.activePoint = this.activePoint.moveToEdgeStartingWithAndByActiveLengthMinusOne(this.root,
            this.word.charAt(index - this.remainder + 2));
    this.activePoint = walkDown(index);
    this.remainder--;
    return previouslyAddedNodeOrAddedEdgeNode;
  }

  private Node edgeFromInternalNodeHasNotCharacter(final int index,
                                                 final char character,
                                                 Node previouslyAddedNodeOrAddedEdgeNode) {
    final Node newNode = new Node(idGenerator++);
    this.activePoint.splitActiveEdge(this.word, newNode, index, character);
    previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, newNode);
    if(this.activePoint.activeNodeHasSlink()) {
        this.activePoint = this.activePoint.moveToSlink();
    }
    else {
        this.activePoint = this.activePoint.moveTo(this.root);
    }
    this.activePoint = walkDown(index);
    this.remainder--;
    return previouslyAddedNodeOrAddedEdgeNode;
  }

  private ActivePoint walkDown(final int index) {
    while(!this.activePoint.pointsToActiveNode()
            && (this.activePoint.pointsToTheEndOfActiveEdge() || this.activePoint.pointsAfterTheEndOfActiveEdge())) {
        if(this.activePoint.pointsAfterTheEndOfActiveEdge()) {
            this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge(this.word, index);
        }
        else {
            this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
        }
    }
    return this.activePoint;
  }

  public String toString(final String word) {
    return this.root.toString(word);
  }

  public boolean contains(final String suffix) {
    return this.root.contains(this.word, suffix);
  }

  public static void main(final String[] args) {
    final String[] words = {
            "abcabcabc$",
            "abc$",
            "abcabxabcd$",
            "abcabxabda$",
            "abcabxad$",
            "aabaaabb$",
            "aababcabcd$",
            "ababcabcd$",
            "abccba$",
            "mississipi$",
            "abacabadabacabae$",
            "abcabcd$",
            "00132220$"
    };
    Arrays.stream(words).forEach(word -> {
        System.out.println("Building suffix tree for word: " + word);
        final ST suffixTree = new ST(word);
        System.out.println("Suffix tree: " + suffixTree.toString(word));
        for(int i = 0; i < word.length() - 1; i++) {
            assert suffixTree.contains(word.substring(i)) : word.substring(i);
        }
    });
  }
}

下面尝试描述Ukkonen算法,首先展示当字符串很简单(即不包含任何重复字符)时它会做什么,然后将其扩展到完整的算法。

首先,做一些初步陈述。

我们正在构建的,基本上就像一个搜索树。因此, 是根节点,它的边指向新的节点 再往外延伸一些边,等等 但是:与搜索树不同,边缘标签不是单一的 字符。相反,每条边都使用一对整数进行标记 [,]。这些是指向文本的指针。在这个意义上,每一个 edge携带任意长度的字符串标签,但只接受O(1) 空格(两个指针)。

基本原理

我想首先演示如何创建的后缀树 特别简单的字符串,没有重复字符的字符串:

abc

该算法按从左到右的步骤工作。字符串的每个字符都有一个步骤。每一步可能涉及多个单独的操作,但我们将看到(参见最后的观察结果)操作的总数为O(n)。

因此,我们从左边开始,首先只插入单个字符 A通过创建一条从根节点(左边)到叶节点的边, 并将其标记为[0,#],这意味着这条边代表 从位置0开始到当前结束的子字符串。我 使用符号#表示当前的结束,它位于位置1 (就在a后面)。

我们有了一个初始树,它看起来像这样:

它的意思是:

现在我们前进到位置2(在b之后)。每一步的目标 是插入到当前位置的所有后缀。我们这样做 通过

将现有的a边扩展到ab边 为b插入一条新边

在我们的表示法中是这样的

它的意思是:

我们观察到两点:

ab的边表示和以前一样 在初始树中:[0,#]。它的含义自动改变了 因为我们将当前位置#从1更新到2。 每条边占用O(1)个空间,因为它只由两条边组成 指向文本的指针,不管它有多少个字符 代表。

接下来,我们再次增加位置,并通过追加更新树 A c到每条现有的边,并为新边插入一条新边 后缀c。

在我们的表示法中是这样的

它的意思是:

我们观察到:

该树是到当前位置为止的正确后缀树 在每一步之后 文本中有多少字符,就有多少步骤 每一步的工作量是O(1),因为所有现有的边 通过增加#和插入 最后一个字符的新边可以在O(1)中完成 时间。因此,对于长度为n的字符串,只需要O(n)个时间。

第一个扩展:简单重复

当然,这工作得很好,只是因为我们的字符串没有 不要重复。我们现在看一个更现实的字符串:

abcabxabcd

它像前面的例子一样以abc开头,然后重复ab 后面跟着x,然后重复ABC,后面跟着d。

步骤1到步骤3:在前3步之后,我们有了前面例子中的树:

第四步:我们移动#到位置4。这将隐式地更新所有现有的 边缘问题:

我们需要插入当前步骤的最后一个后缀,a, at 根。

在此之前,我们再引入两个变量(除了 #),当然,它一直都在那里,但我们没有使用过 到目前为止:

活动点,是一个三倍的 (active_edge active_node active_length) 余数,它是一个整数,表示有多少个新后缀 我们需要插入

这两个词的确切含义很快就会清楚,但现在 让我们这么说吧:

In the simple abc example, the active point was always (root,'\0x',0), i.e. active_node was the root node, active_edge was specified as the null character '\0x', and active_length was zero. The effect of this was that the one new edge that we inserted in every step was inserted at the root node as a freshly created edge. We will see soon why a triple is necessary to represent this information. The remainder was always set to 1 at the beginning of each step. The meaning of this was that the number of suffixes we had to actively insert at the end of each step was 1 (always just the final character).

现在这将会改变。当我们插入当前的final时 字符a在根,我们注意到已经有一个外向 以a开头的边,特别是abca。这是我们在 这种情况:

We do not insert a fresh edge [4,#] at the root node. Instead we simply notice that the suffix a is already in our tree. It ends in the middle of a longer edge, but we are not bothered by that. We just leave things the way they are. We set the active point to (root,'a',1). That means the active point is now somewhere in the middle of outgoing edge of the root node that starts with a, specifically, after position 1 on that edge. We notice that the edge is specified simply by its first character a. That suffices because there can be only one edge starting with any particular character (confirm that this is true after reading through the entire description). We also increment remainder, so at the beginning of the next step it will be 2.

观察:当我们需要插入的最后一个后缀被找到时 已经存在于树中,树本身根本没有改变(我们只更新活动点和余数)。这棵树 然后不是一个准确的表示后缀树到 当前位置不再存在,但它包含所有后缀(因为final 后缀a是隐式包含的)。因此,除了更新 变量(都是固定长度的,所以这是O(1) 这一步没有做功。

步骤5:我们将当前位置#更新为5。这 自动更新树如下:

因为余数是2,我们需要插入两个final 当前位置的后缀:ab和b。这基本上是因为:

上一步中的a后缀一直不正确 插入。所以它一直存在,自从我们进步了一个 Step,现在从a变成了ab。 我们需要插入最后一条新边b。

在实践中,这意味着我们去到活动点(指向 在a后面,也就是现在的abcab边缘),然后插入 现在的最终字符b。但是:再次证明b是 这条边也已经出现了。

同样,我们没有改变树。我们只是:

更新活动点到(根,'a',2)(相同的节点和边 和之前一样,但现在我们指向b) 将余数加为3,因为我们仍然没有正确地 插入前一步的最后一条边,我们不插入 当前的最终边。

明确一点:我们必须在当前步骤中插入ab和b,但是 因为ab已经被找到,所以我们更新了活动点 甚至没有尝试插入b。为什么?因为如果ab在树中, 它的每个后缀(包括b)都必须在树中, 了。也许只是隐含的,但它必须在那里,因为 就像我们到目前为止建立的树。

我们通过增加#来继续第6步。树是 自动更新为:

因为余数是3,我们必须插入abx bx和 活动点告诉我们ab的终点,所以我们只需要 跳转到那里并插入x,实际上,x还没有在那里,所以我们 分割abcabx边缘并插入一个内部节点:

边缘表示仍然是指向文本的指针,所以 分裂和插入内部节点可以在O(1)时间内完成。

我们处理了abx,并将余数减为2。现在我们 需要插入下一个剩余的后缀,bx。但在此之前 我们需要更新活动点。分裂后的规则 并插入一条边,将被称为下面的规则1,它适用于 Active_node是根(我们将进一步学习规则3用于其他情况 下文)。规则一:

从根节点插入后, Active_node保持root Active_edge被设置为新后缀we的第一个字符 需要插入,即b Active_length减少1

因此,新的活动点三元组(根,'b',1)表示 下一个插入必须在bcabx边缘,后面1个字符, 我们可以在O(1)时间内识别插入点和 检查x是否已经存在。如果它存在,我们 将结束当前步骤,让一切保持原样。但x 不存在,所以我们通过分割边缘插入它:

同样,这花费了O(1)时间我们将余数更新为1 如规则1所述,活动指向(根,'x',0)。

但我们还需要做一件事。我们称之为规则2:

如果我们分割一条边并插入一个新节点,如果不是 在当前步骤中创建的第一个节点,连接先前的节点 插入节点和新节点通过一个特殊指针、后缀 链接。我们稍后会看到为什么这是有用的。这就是我们得到的 后缀link表示为虚线边:

我们仍然需要插入当前步骤的最后一个后缀, x.由于主节点的active_length组件下降 对于0,最后的插入直接在根节点上进行。因为没有 从x开始的根节点的外向边,我们插入一条新的 优势:

正如我们所看到的,在当前步骤中,所有剩余的插入都已完成。

我们通过设置#=7继续到第7步,这将自动添加下一个字符, A,到所有的叶边,一如既往。然后我们尝试插入新的final 字符到活动点(根),并发现它在那里 了。我们结束当前的步骤,不插入任何东西 将活动点更新为(root,'a',1)。

在第8步,#=8,我们附加b,正如前面看到的,这只是 这意味着我们将活动点更新为(root,'a',2),并增加余数 还有别的吗,因为b已经存在了。然而,我们注意到(在O(1)时间)活动点 现在在一条边缘的尽头。我们通过将其重新设置为来反映这一点 (node1 ' \ 0 x ' 0)。在这里,我使用node1来引用 ab边结束于的内部节点。

然后,在步骤#=9中,我们需要插入'c',这将帮助我们 了解最后一个技巧:

第二个扩展:使用后缀链接

像往常一样,# update会自动将c追加到叶边 我们到活动点,看看能不能插入c。结果 c已经在那条边存在了,所以我们把活动点设为 (node1,'c',1),增加余数,其他什么都不做。

现在在步骤#=10中,余数是4,所以我们首先需要插入 Abcd(从3步前保留下来)通过在active中插入d 点。

试图在活动点插入d会导致边缘分裂 O(1)时间:

在中标记了active_node,分裂就是从它开始的 上面的红色。这是最后的规则,规则3:

在从一个不是根的active_node中分离边缘之后 节点,如果有的话,我们跟踪从该节点出来的后缀链接 并将active_node重置为它所指向的节点。如果有的话 没有后缀链接,我们将active_node设置为根。active_edge active_length保持不变。

所以活动点现在是(node2,'c',1), node2被标记进来 红色下面:

由于abcd的插入已经完成,我们将余数减为 3,并考虑当前步骤的下一个剩余后缀, bcd。规则3将活动点设置为正确的节点和边 所以插入BCD可以通过简单地插入它的最后一个字符来完成 D在活动点。

这样做会导致另一条边分裂,由于规则2,我们 必须创建从先前插入的节点到新节点的后缀链接吗 一:

我们观察到:后缀链接使我们能够重置活动点,因此我们 可以以O(1)次努力进行下一个剩余插入。看看 以确认标签ab上的节点确实被链接到 b处的节点(它的后缀)和ABC处的节点被链接到 bc。

当前步骤还没有完成。余数现在是2,我们 需要按照规则3重新设置活动点。自 当前active_node(上面红色)没有后缀链接,我们重置为 根。活动点现在是(根,c',1)。

因此,下一次插入发生在根节点的一个外向边缘 其标签以c: cabxabcd开头,在第一个字符后面, 这导致了另一个分裂:

因为这涉及到一个新的内部节点的创建,我们遵循 规则2,并设置一个新的后缀链接之前创建的内部 节点:

(我使用Graphviz Dot的这些小 图表。新的后缀链接导致点重新排列现有的 边,所以仔细检查,以确认唯一的东西 上面插入了一个新的后缀链接。)

这样,余数可以设置为1,因为active_node是 根,我们使用规则1将活动点更新为(根,'d',0)。这 表示当前步骤的最后一个插入是插入单个d 根:

这是最后一步,我们做完了。有一些最终的 不过,观察:

In each step we move # forward by 1 position. This automatically updates all leaf nodes in O(1) time. But it does not deal with a) any suffixes remaining from previous steps, and b) with the one final character of the current step. remainder tells us how many additional inserts we need to make. These inserts correspond one-to-one to the final suffixes of the string that ends at the current position #. We consider one after the other and make the insert. Important: Each insert is done in O(1) time since the active point tells us exactly where to go, and we need to add only one single character at the active point. Why? Because the other characters are contained implicitly (otherwise the active point would not be where it is). After each such insert, we decrement remainder and follow the suffix link if there is any. If not we go to root (rule 3). If we are at root already, we modify the active point using rule 1. In any case, it takes only O(1) time. If, during one of these inserts, we find that the character we want to insert is already there, we don't do anything and end the current step, even if remainder>0. The reason is that any inserts that remain will be suffixes of the one we just tried to make. Hence they are all implicit in the current tree. The fact that remainder>0 makes sure we deal with the remaining suffixes later. What if at the end of the algorithm remainder>0? This will be the case whenever the end of the text is a substring that occurred somewhere before. In that case we must append one extra character at the end of the string that has not occurred before. In the literature, usually the dollar sign $ is used as a symbol for that. Why does that matter? --> If later we use the completed suffix tree to search for suffixes, we must accept matches only if they end at a leaf. Otherwise we would get a lot of spurious matches, because there are many strings implicitly contained in the tree that are not actual suffixes of the main string. Forcing remainder to be 0 at the end is essentially a way to ensure that all suffixes end at a leaf node. However, if we want to use the tree to search for general substrings, not only suffixes of the main string, this final step is indeed not required, as suggested by the OP's comment below. So what is the complexity of the entire algorithm? If the text is n characters in length, there are obviously n steps (or n+1 if we add the dollar sign). In each step we either do nothing (other than updating the variables), or we make remainder inserts, each taking O(1) time. Since remainder indicates how many times we have done nothing in previous steps, and is decremented for every insert that we make now, the total number of times we do something is exactly n (or n+1). Hence, the total complexity is O(n). However, there is one small thing that I did not properly explain: It can happen that we follow a suffix link, update the active point, and then find that its active_length component does not work well with the new active_node. For example, consider a situation like this:

虚线表示树的其余部分。虚线是a 后缀链接。)

现在设活动点为(红色,'d',3),所以它指向某个位置 在defg边缘的f后面。现在假设我们做了必要的事情 更新,现在跟随后缀链接更新活动点 根据规则3。新的活动点是(绿色,'d',3)。然而, 绿色节点外的d边是de,所以它只有2 字符。为了找到正确的活动点,我们显然 需要沿着这条边到蓝色节点,并重置为(蓝色,'f',1)。

在特别糟糕的情况下,active_length可能大到 余数,可以大到n,这很可能发生 为了找到正确的活动点,我们不仅仅需要跳过一个 内部节点,但可能很多,最坏情况下最多n个。这 意味着算法有一个隐藏的O(n2)复杂度,因为 每步余数一般为O(n),后调整 到主节点后的后缀链接也可以是O(n) ?

No. The reason is that if indeed we have to adjust the active point (e.g. from green to blue as above), that brings us to a new node that has its own suffix link, and active_length will be reduced. As we follow down the chain of suffix links we make the remaining inserts, active_length can only decrease, and the number of active-point adjustments we can make on the way can't be larger than active_length at any given time. Since active_length can never be larger than remainder, and remainder is O(n) not only in every single step, but the total sum of increments ever made to remainder over the course of the entire process is O(n) too, the number of active point adjustments is also bounded by O(n).

我尝试使用jogojapan回答中给出的方法来实现后缀树,但由于规则使用的措辞,它在某些情况下不起作用。此外,我还提到过,没有人能够使用这种方法实现绝对正确的后缀树。下面我将对jogojapan的答案进行“概述”,并对规则进行了一些修改。我还将描述当我们忘记创建重要的后缀链接时的情况。

使用的其他变量

活动点-一个三元(active_node;active_edge;Active_length),显示我们必须从哪里开始插入新后缀。 剩余-显示必须显式添加的后缀的数量。例如,如果我们的单词是'abcaabca',并且remainder = 3,这意味着我们必须处理3个最后的后缀:bca, ca和a。

让我们使用一个内部节点的概念——所有的节点,除了根节点和叶节点都是内部节点。

观察1

当我们发现需要插入的最后一个后缀已经存在于树中时,树本身根本不会改变(我们只更新活动点和余数)。

观察2

如果在某一点上active_length大于或等于当前边缘的长度(edge_length),我们将活动点向下移动,直到edge_length严格大于active_length。

现在,让我们重新定义规则:

规则1

如果从活动节点= root插入后,活动长度大于0,则: 主节点未改变 活动长度减少 活动边向右移动(到必须插入的下一个后缀的第一个字符)

规则2

如果我们创建一个新的内部节点或从一个内部节点创建一个插入器,并且这不是当前步骤中的第一个此类内部节点,那么我们将通过后缀链接将前一个此类节点与this节点链接起来。

规则2的定义与jogojapan的不同,因为这里我们不仅考虑了新创建的内部节点,还考虑了我们进行插入的内部节点。

规则3

在从活动节点(不是根节点)进行插入之后,必须遵循后缀链接并将活动节点设置为它所指向的节点。如果没有后缀链路,请将主节点设置为根节点。无论哪种方式,活动边和活动长度保持不变。

在规则3的这个定义中,我们还考虑了叶节点的插入(不仅仅是分裂节点)。

最后,观察3:

当我们想要添加到树中的符号已经在边缘上时,根据观察1,我们只更新活动点和余数,保持树不变。但是如果有一个内部节点被标记为需要后缀链接,我们必须通过后缀链接将该节点与我们当前的活动节点连接起来。

让我们看看cdddcdc的后缀树的例子,如果我们在这种情况下添加后缀链接,如果我们不添加:

如果我们不使用后缀link连接节点: 在加上最后一个字母c之前: 加上最后一个字母c后: 如果我们确实通过后缀link连接节点: 在加上最后一个字母c之前: 加上最后一个字母c后:

Seems like there is no significant difference: in the second case there are two more suffix links. But these suffix links are correct, and one of them - from the blue node to the red one - is very important for our approach with active point. The problem is that if we don't put a suffix link here, later, when we add some new letters to the tree, we might omit adding some nodes to the tree due to the Rule 3, because, according to it, if there's no a suffix link, then we must put the active_node to the root.

When we were adding the last letter to the tree, the red node had already existed before we made an insert from the blue node (the edge labled 'c'). As there was an insert from the blue node, we mark it as needing a suffix link. Then, relying on the active point approach, the active node was set to the red node. But we don't make an insert from the red node, as the letter 'c' is already on the edge. Does it mean that the blue node must be left without a suffix link? No, we must connect the blue node with the red one through a suffix link. Why is it correct? Because the active point approach guarantees that we get to a right place, i.e., to the next place where we must process an insert of a shorter suffix.

最后,这里是我对后缀树的实现:

Java c++

希望这个“概述”结合jogojapan的详细回答将帮助某人实现他自己的后缀树。

我的直觉如下:

在主循环的k次迭代之后,您已经构造了一个后缀树,其中包含从前k个字符开始的完整字符串的所有后缀。

在开始时,这意味着后缀树包含一个表示整个字符串的根节点(这是唯一从0开始的后缀)。

在len(string)迭代之后,您将得到一个包含所有后缀的后缀树。

在循环过程中,关键是活动点。我的猜测是,这表示后缀树中最深的点,对应于字符串的前k个字符的适当后缀。(我认为合适的意思是后缀不能是整个字符串。)

例如,假设你见过字符'abcabc'。活动点将表示树中对应后缀'abc'的点。

活动点由(原点,第一个,最后一个)表示。 这意味着你目前在树中的点上,你可以从节点原点开始,然后输入字符串[first:last]中的字符。

当你添加一个新字符时,你要看看活动点是否仍然在现有的树中。如果是,那你就完蛋了。 否则,您需要在活动点向后缀树添加一个新节点,退回到下一个最短匹配,并再次检查。

注1: 后缀指针为每个节点提供指向下一个最短匹配项的链接。

注2: 当您添加新节点和回退时,您为新节点添加了一个新的后缀指针。 该后缀指针的目标将是缩短活动点上的节点。 这个节点要么已经存在,要么在此回退循环的下一次迭代中创建。

注意3:封圣部分只是节省了检查活动点的时间。 例如,假设您总是使用origin=0,只是改变了第一个和最后一个。 为了检查活动点,您必须每次沿着所有中间节点跟随后缀树。 通过只记录到最后一个节点的距离来缓存遵循这条路径的结果是有意义的。

您能给出一个代码示例来说明“修复”边界变量是什么意思吗?

健康警告:我还发现这个算法特别难以理解,所以请意识到这个直觉可能在所有重要细节上都是不正确的……