Updating an Aho-Corasick trie in the face of inserts and deletes

All the literature and implementations I've found of Aho-Corasick are about building the entire trie beforehand from a set of phrases. However, I'm interested in ways to work with it as a mutable data structure, where it can handle occasional adds and removes without needing to rebuild the whole trie (imagine there are 1 million entries in it). It's OK if the worst case is awful, as long as the average case is close to logarithmic.



As I figure it, the fail state for each node is always another node reached by the same symbol (or the root). So if we keep a hash multimap from each symbol to the set of nodes reached by that symbol, we have our candidates whose fail states may need updating.
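To make that bookkeeping concrete, here is a minimal sketch of the node structure plus the symbol-to-nodes multimap (all names here are illustrative, not from any particular implementation):

```python
from collections import defaultdict

class Node:
    """One trie state; `symbol` labels the edge leading into it."""
    def __init__(self, symbol, parent):
        self.symbol = symbol
        self.parent = parent
        self.children = {}     # symbol -> child Node
        self.fail = None       # fail-state link (None only on the root)
        self.outputs = set()   # phrases that end at this node

class Trie:
    def __init__(self):
        self.root = Node(None, None)
        # hash multimap: symbol -> every node reached via that symbol;
        # these are the candidates whose fail states may need updating
        self.nodes_by_symbol = defaultdict(set)

    def new_node(self, symbol, parent):
        node = Node(symbol, parent)
        parent.children[symbol] = node
        self.nodes_by_symbol[symbol].add(node)
        return node
```

With this in place, an add or remove touching symbol "A" only has to consult `nodes_by_symbol["A"]`, never the whole trie.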



Remove is straightforward. You find all other nodes which use the deleted node as a fail state and recompute their fail states. Go from the end of the string backwards, and the trie should still be in good shape.



Add is a bit trickier. Any node whose fail state resolves on that symbol could now have the new node as a better candidate. However, again it seems to be enough to traverse every other node with that symbol and fully recompute its fail state.



In other words, if we're adding or removing a node with symbol "A", we need to visit every other "A" node anywhere in the trie, and recompute the fail state (look for its closest ancestor with "A" as a child, or the root). This would require visiting every node for symbol "A", but in my case, that would be several orders of magnitude less than visiting every node in the trie.
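For the recomputation itself, the standard Aho-Corasick rule is that a node's fail state is goto(fail(parent), symbol): walk the parent's fail chain until some state has a child on the node's symbol, otherwise fall back to the root. A self-contained sketch (the tiny node class is purely illustrative):

```python
class _N:
    """Minimal illustrative node: just enough to compute fail states."""
    def __init__(self, symbol, parent):
        self.symbol, self.parent = symbol, parent
        self.children, self.fail = {}, None
        if parent is not None:
            parent.children[symbol] = self

def recompute_fail(node, root):
    """Walk the parent's fail chain until some state has a child on
    node.symbol; otherwise fall back to the root (standard AC rule)."""
    state = node.parent.fail
    while state is not None:
        nxt = state.children.get(node.symbol)
        if nxt is not None:
            node.fail = nxt
            return
        state = state.fail
    node.fail = root

# Tiny example: trie for {"ab", "b"}; recompute fails parents-first.
root = _N(None, None)
a = _N("a", root)
b1 = _N("b", root)
b2 = _N("b", a)        # the "ab" node
for n in (a, b1, b2):
    recompute_fail(n, root)
assert a.fail is root and b1.fail is root
assert b2.fail is b1   # longest proper suffix of "ab" in the trie is "b"
```

One caveat when recomputing every node of a given symbol this way: process them shallowest-first, so that any same-symbol fail link the chain walk passes through has already been refreshed.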



Would that algorithm work, or am I missing something obvious?

  • Visiting every node with a symbol "A" to update its fail state wouldn't be so difficult, I guess. Finding every such node, though, will require a linear scan of the tree, unless you keep an auxiliary data structure that maps every node by symbol. I think to do this update efficiently will require O(n) extra space, where n is the number of nodes in your trie.

    – Jim Mischel
    Nov 14 '18 at 12:35











  • I think this is going to be more than just updating the fail states. Consider removing the string "the" from the list of matched terms. But you still have to match "their," "they," "them," "other," etc. And, yeah ... adding gets pretty tricky. I've never seen an article on updating an Aho-Corasick trie in-place, which leads me to believe that this is a pretty hard problem.

    – Jim Mischel
    Nov 14 '18 at 12:42











  • @JimMischel Yes, I expect it to take a bit more memory, my main concern is correctness and time complexity. The question is whether there is a sequence of operations that would leave the trie in a bad state. In that particular case, you could remove the "the" output from the "e" node, but leave the node intact as a non-output node, as long as you're not compressing nodes. My intuition says that would work fine, but I'm not certain; there could be cases where a node would need to move, which would mean other seemingly-unrelated fail states could be wrong.

    – Robert Fraser
    Nov 14 '18 at 17:02

database algorithm search data-structures aho-corasick

edited Nov 14 '18 at 17:05
asked Nov 13 '18 at 20:03
Robert Fraser

1 Answer

I went ahead and implemented it and it seems to be working. A somewhat nicer description of the algorithm, including pictures: https://burningmime.gitlab.io/setmatch/implementation-overview.html#supporting-add-remove-in-an-aho-corasick-trie



For posterity (and per StackOverflow policy), I've copied it here:




It supports modification (add/remove) instead of being built all at
once from a pre-set dictionary. This isn't mentioned in any literature
about it, and all the open-source implementations I've seen have
followed suit in that respect. It turns out that supporting add and
remove operations on an AC automaton is fairly trivial. For deep tries
over a small alphabet it would not be very time-efficient, but luckily
we have a shallow trie over a large alphabet.



We store a hash multimap of each token to every node that uses that
token. When we remove a phrase, we start from the last (bottom-most)
node. We remove the pointer to the phrase from this bottom-most node.
Then, if it has any children, we can't delete any more. Otherwise, go
through each other node that uses this node as a fail state and
recompute its fail state. Finally, delete the node, then go up to the
node's parent and do the same process. We'll eventually reach either a
node that has another phrase output, a child, or the root node.



That's kind of difficult to picture, so consider this trie (stolen
from the Wikipedia article). We're going to remove the string caa
(light gray color), which will require that the fail states in yellow
be updated:



[diagram: the trie before removing "caa"]



The result:



[diagram: the trie after removal]



Note there are cases where a node's fail state might be updated two or
more times in the same remove operation, but it will eventually be
correct. This could be avoided by removing everything first and then
going back through the tokens, but the code is complicated enough as
it is, and that would only help in certain awkward circumstances while
hurting performance in the average case. We just assume that strings
containing the same token more than once are rare.
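The remove procedure described above can be sketched as follows. This is an illustrative, self-contained toy, not the setmatch code; all names and structure are hypothetical, and `add` is deliberately naive (it fixes fails only for newly created nodes, which happens to suffice for this example):

```python
from collections import defaultdict

class Node:
    def __init__(self, symbol, parent):
        self.symbol, self.parent = symbol, parent
        self.children, self.fail, self.outputs = {}, None, set()

def refail(node, root):
    # Standard rule: node's fail is goto(fail(parent), node.symbol).
    state = node.parent.fail
    while state is not None:
        nxt = state.children.get(node.symbol)
        if nxt is not None:
            node.fail = nxt
            return
        state = state.fail
    node.fail = root

def add(root, by_symbol, phrase):
    # Naive insertion used only to build the example below.
    node = root
    for sym in phrase:
        child = node.children.get(sym)
        if child is None:
            child = Node(sym, node)
            node.children[sym] = child
            by_symbol[sym].add(child)
            refail(child, root)
        node = child
    node.outputs.add(phrase)

def remove(root, by_symbol, phrase):
    """Walk to the bottom-most node, drop the output, then delete
    now-useless nodes bottom-up, re-pointing anyone failing into them."""
    node = root
    for sym in phrase:
        node = node.children[sym]
    node.outputs.discard(phrase)
    while node is not root and not node.children and not node.outputs:
        parent = node.parent
        del parent.children[node.symbol]
        by_symbol[node.symbol].discard(node)
        for other in by_symbol[node.symbol]:
            if other.fail is node:
                refail(other, root)
        node = parent

root, by_symbol = Node(None, None), defaultdict(set)
add(root, by_symbol, "ab")
add(root, by_symbol, "cab")
b3 = root.children["c"].children["a"].children["b"]   # the "cab" node
assert b3.fail is root.children["a"].children["b"]    # fails into "ab"
remove(root, by_symbol, "ab")
assert "a" not in root.children   # the "a"/"ab" states were deleted
assert b3.fail is root            # re-pointed during the delete
assert b3.outputs == {"cab"}      # "cab" still matches
```

Note that if the phrase's last node still has children (the "the" vs "their" case from the comments), the while loop never runs: only the output is removed and every node stays put.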



Adding a node is slightly more difficult, but can make use of the same
hash multimap structure. Each time a node is added, every other node
in the trie that uses the same token and sits deeper in the trie than
the added node could need to have its fail state updated to point at
the new node.



To help picture that, imagine going from the second diagram back to
the first diagram. The same two fail states are the ones that would
need to be updated when adding the string caa back into that trie, and
since we recompute every c and a node, it's not a problem.



We could keep track of the node depths to skip over about half, but
right now it's just recomputing the fail state for every node that
shares the token to save memory. This means that tries where the
tokens appear many times will be slow to add to, but again that's
considered a rare enough situation. It would be a concern if you were
trying to adapt this technique to something like a DNA sequence or
antivirus database where there is a small alphabet.
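The add-side fix-up can be sketched like this (again an illustrative, self-contained toy rather than the setmatch code). One correctness detail worth calling out: same-symbol nodes are recomputed shallowest-first, so that every fail link the chain walk consults has already been refreshed:

```python
from collections import defaultdict

class Node:
    def __init__(self, symbol, parent):
        self.symbol, self.parent = symbol, parent
        self.depth = 0 if parent is None else parent.depth + 1
        self.children, self.fail, self.outputs = {}, None, set()

def refail(node, root):
    # Standard rule: node's fail is goto(fail(parent), node.symbol).
    state = node.parent.fail
    while state is not None:
        nxt = state.children.get(node.symbol)
        if nxt is not None:
            node.fail = nxt
            return
        state = state.fail
    node.fail = root

def add(root, by_symbol, phrase):
    node = root
    for sym in phrase:
        child = node.children.get(sym)
        if child is None:
            child = Node(sym, node)
            node.children[sym] = child
            by_symbol[sym].add(child)
            # The new node may be a better fail target for any other
            # node on this symbol; recompute them all, shallowest-first.
            for n in sorted(by_symbol[sym], key=lambda n: n.depth):
                refail(n, root)
        node = child
    node.outputs.add(phrase)

root, by_symbol = Node(None, None), defaultdict(set)
add(root, by_symbol, "bab")
b3 = root.children["b"].children["a"].children["b"]   # the "bab" node
assert b3.fail is root.children["b"]    # best suffix state so far: "b"
add(root, by_symbol, "ab")              # creates states "a" and "ab"
assert b3.fail is root.children["a"].children["b"]   # now fails to "ab"
```

Only same-symbol nodes ever need revisiting here: a new state's string ends in the added symbol, so it can only become the fail target of another node reached by that same symbol.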



The time complexity depends on how many nodes share symbols and how
deep the trie is: worst-case O(N), but in practice a fairly small
number (average case is roughly O(log² N)).




Incredibly messy and mostly uncommented code, but if you're curious, here's the header, the body, and some tests






share|improve this answer

























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53288664%2fupdating-an-aho-corasick-trie-in-the-face-of-inserts-and-deletes%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    1














    I went ahead and implemented it and it seems to be working. A somewhat nicer description of the algorithm, including pictures: https://burningmime.gitlab.io/setmatch/implementation-overview.html#supporting-add-remove-in-an-aho-corasick-trie



    For posterity (and per StackOverflow policy), I've copied it here:




    It supports modification (add/remove) instead of being built all at
    once from a pre-set dictionary. This isn't mentioned in any literature
    about it, and all the open-source implementations I've seen of it have
    followed in that respect. It turns out that supporting add and remove
    operations to an AC automaton is fairly trivial. For deep tries over a
    small alphabet, it would not be very time-efficient, but luckily we
    have a shallow trie over a large alphabet.



    We store a hash multimap of each token to every node that uses that
    token. When we remove a phrase, we start from the last (bottom-most)
    node. We remove the pointer to the phrase from this bottom-most node.
    Then, if it has any children, we can't delete any more. Otherwise, go
    through each other node that uses this node as a fail state and
    recompute its fail state. Finally, delete the node, then go up to the
    node's parent and do the same process. We'll eventually reach either a
    node that has another phrase output, a child, or the root node.



    That's kind of difficult to picture, so consider this trie (stolen
    from the Wikipedia article). We're going to remove the string caa
    (light gray color), which will require that the fail states in yellow
    be updated:



    trie-before



    The result:



    trie-after



    Note there are cases where a node's fail state might be updated 2 or
    more times in the same remove operation, but will eventually be
    correct. This could be avoided by removing everything first, then
    going back through tokens, but the code is complicated enough as it
    is, and that would only provide a performance benefit in certain
    awkward circumstances, while being a performance hit in the average
    case. We just assume that strings containing the same token more than
    once are rare.



    Adding a node is slightly more difficult, but can make use of the same
    hash multimap structure. Each time a node is added, every other node
    in the trie that uses the same token and is at a lower depth than the
    added node could need to have its fail state updated to the new node.



    To help picture that, imagine going from the second diagram back to
    the first diagram. The same two fail states are the ones who would
    need to be updated if adding the string caa back into that trie, and
    since we recompute every c and a node, it's not a problem.



    We could keep track of the node depths to skip over about half, but
    right now it's just recomputing the fail state for every node that
    shares the token to save memory. This means that tries where the
    tokens appear many times will be slow to add to, but again that's
    considered a rare enough situation. It would be a concern if you were
    trying to adapt this technique to something like a DNA sequence or
    antivirus database where there is a small alphabet.



    The time complexity depends on how many nodes share symbols and how
    deep the trie is, meaning it's worst-case O(N), but in practice a
    fairly small number (average-case is roughly O(log^2 N) ).




    Incredibly messy and mostly uncommented code, but if you're curious, here's the header, the body, and some tests






    share|improve this answer






























      1














      I went ahead and implemented it and it seems to be working. A somewhat nicer description of the algorithm, including pictures: https://burningmime.gitlab.io/setmatch/implementation-overview.html#supporting-add-remove-in-an-aho-corasick-trie



      For posterity (and per StackOverflow policy), I've copied it here:




      It supports modification (add/remove) instead of being built all at
      once from a pre-set dictionary. This isn't mentioned in any literature
      about it, and all the open-source implementations I've seen of it have
      followed in that respect. It turns out that supporting add and remove
      operations to an AC automaton is fairly trivial. For deep tries over a
      small alphabet, it would not be very time-efficient, but luckily we
      have a shallow trie over a large alphabet.



      We store a hash multimap of each token to every node that uses that
      token. When we remove a phrase, we start from the last (bottom-most)
      node. We remove the pointer to the phrase from this bottom-most node.
      Then, if it has any children, we can't delete any more. Otherwise, go
      through each other node that uses this node as a fail state and
      recompute its fail state. Finally, delete the node, then go up to the
      node's parent and do the same process. We'll eventually reach either a
      node that has another phrase output, a child, or the root node.



      That's kind of difficult to picture, so consider this trie (stolen
      from the Wikipedia article). We're going to remove the string caa
      (light gray color), which will require that the fail states in yellow
      be updated:



      trie-before



      The result:



      trie-after



      Note there are cases where a node's fail state might be updated 2 or
      more times in the same remove operation, but will eventually be
      correct. This could be avoided by removing everything first, then
      going back through tokens, but the code is complicated enough as it
      is, and that would only provide a performance benefit in certain
      awkward circumstances, while being a performance hit in the average
      case. We just assume that strings containing the same token more than
      once are rare.



      Adding a node is slightly more difficult, but can make use of the same
      hash multimap structure. Each time a node is added, every other node
      in the trie that uses the same token and is at a lower depth than the
      added node could need to have its fail state updated to the new node.



      To help picture that, imagine going from the second diagram back to
      the first diagram. The same two fail states are the ones who would
      need to be updated if adding the string caa back into that trie, and
      since we recompute every c and a node, it's not a problem.



      We could keep track of the node depths to skip over about half, but
      right now it's just recomputing the fail state for every node that
      shares the token to save memory. This means that tries where the
      tokens appear many times will be slow to add to, but again that's
      considered a rare enough situation. It would be a concern if you were
      trying to adapt this technique to something like a DNA sequence or
      antivirus database where there is a small alphabet.



      The time complexity depends on how many nodes share symbols and how
      deep the trie is, meaning it's worst-case O(N), but in practice a
      fairly small number (average-case is roughly O(log^2 N) ).




      Incredibly messy and mostly uncommented code, but if you're curious, here's the header, the body, and some tests






      share|improve this answer




























        1












        1








        1







        I went ahead and implemented it and it seems to be working. A somewhat nicer description of the algorithm, including pictures: https://burningmime.gitlab.io/setmatch/implementation-overview.html#supporting-add-remove-in-an-aho-corasick-trie



        For posterity (and per StackOverflow policy), I've copied it here:




        It supports modification (add/remove) instead of being built all at
        once from a pre-set dictionary. This isn't mentioned in any literature
        about it, and all the open-source implementations I've seen of it have
        followed in that respect. It turns out that supporting add and remove
        operations to an AC automaton is fairly trivial. For deep tries over a
        small alphabet, it would not be very time-efficient, but luckily we
        have a shallow trie over a large alphabet.



        We store a hash multimap of each token to every node that uses that

        I went ahead and implemented it and it seems to be working. A somewhat nicer description of the algorithm, including pictures: https://burningmime.gitlab.io/setmatch/implementation-overview.html#supporting-add-remove-in-an-aho-corasick-trie



        For posterity (and per StackOverflow policy), I've copied it here:




        It supports modification (add/remove) instead of being built all at
        once from a pre-set dictionary. This isn't mentioned in any of the
        literature, and all the open-source implementations I've seen have
        followed the literature in building the automaton up front. It turns
        out that supporting add and remove operations in an AC automaton is
        fairly straightforward. For deep tries over a small alphabet it would
        not be very time-efficient, but luckily we have a shallow trie over a
        large alphabet.



        We store a hash multimap from each token to every node that uses that
        token. To remove a phrase, we start from its last (bottom-most) node
        and remove the pointer to the phrase from it. Then, if the node has
        any children, we can't delete anything further. Otherwise, we go
        through each other node that uses this node as a fail state and
        recompute its fail state, delete the node, move up to its parent, and
        repeat the process. We eventually reach either a node with another
        phrase output, a node with a child, or the root node.
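
        The linked implementation isn't reproduced here, but the procedure
        above can be sketched roughly as follows (all names are illustrative,
        not the answer's actual API; an add operation with its fixup loop is
        included so the sketch is self-contained, and the node is unlinked
        *before* recomputing the fail states that pointed at it, so the dead
        node can't be found again during the recomputation):

```python
from collections import defaultdict

class Node:
    def __init__(self, token=None, parent=None):
        self.token = token
        self.parent = parent
        self.children = {}    # token -> child Node
        self.fail = None      # failure link (None on the root)
        self.outputs = set()  # phrases ending at this node

def compute_fail(node, root):
    # Walk the parent's fail chain looking for the nearest node
    # that has a child with the same token; fall back to the root.
    f = node.parent.fail
    while f is not None:
        hit = f.children.get(node.token)
        if hit is not None:
            return hit
        f = f.fail
    return root

def add_phrase(root, nodes_by_token, phrase):
    node = root
    for token in phrase:
        if token not in node.children:
            child = Node(token, node)
            node.children[token] = child
            nodes_by_token[token].add(child)
            child.fail = compute_fail(child, root)
            # blanket fixup: any existing node with this token may now
            # prefer the new node as its fail state
            for other in nodes_by_token[token]:
                if other is not child:
                    other.fail = compute_fail(other, root)
        node = node.children[token]
    node.outputs.add(phrase)

def remove_phrase(root, nodes_by_token, phrase):
    node = root
    for token in phrase:
        node = node.children[token]
    node.outputs.discard(phrase)
    # prune upward while the node is a dead leaf
    while node is not root and not node.outputs and not node.children:
        parent = node.parent
        del parent.children[node.token]           # unlink first, so the
        nodes_by_token[node.token].discard(node)  # dead node can't be
        for other in nodes_by_token[node.token]:  # found again below
            if other.fail is node:
                other.fail = compute_fail(other, root)
        node = parent
```

        Note that only same-token nodes can ever fail to the deleted node
        (a fail state always targets a node with the same token), which is
        why scanning the multimap bucket for that one token is enough.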



        That's kind of difficult to picture, so consider this trie (stolen
        from the Wikipedia article). We're going to remove the string caa
        (light gray color), which will require that the fail states in yellow
        be updated:



        [image: trie-before]



        The result:



        [image: trie-after]



        Note there are cases where a node's fail state may be updated two or
        more times within the same remove operation, but it will eventually
        be correct. This could be avoided by removing everything first and
        then going back through the tokens, but the code is complicated
        enough as it is, and that would only help in certain awkward
        circumstances while hurting performance in the average case. We
        simply assume that strings containing the same token more than once
        are rare.



        Adding a node is slightly trickier, but can use the same hash
        multimap structure. Each time a node is added, every other node in
        the trie that uses the same token and sits deeper in the trie than
        the added node may need its fail state updated to point at the new
        node.
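
        A tiny isolated demo of that effect (again with illustrative names,
        not the answer's actual code): with only "ab" in the trie, the b node
        under a has nowhere to fail except the root; once a top-level "b"
        node is linked in, the blanket fixup repoints it.

```python
from collections import defaultdict

class Node:
    def __init__(self, token=None, parent=None):
        self.token, self.parent = token, parent
        self.children, self.fail = {}, None  # fail is None on the root

def compute_fail(node, root):
    # Nearest node on the parent's fail chain with a matching child.
    f = node.parent.fail
    while f is not None:
        hit = f.children.get(node.token)
        if hit is not None:
            return hit
        f = f.fail
    return root

def link_node(root, nodes_by_token, parent, token):
    # Attach a new node, then run the blanket fixup: recompute the
    # fail state of every other node sharing the new node's token.
    node = Node(token, parent)
    parent.children[token] = node
    nodes_by_token[token].add(node)
    node.fail = compute_fail(node, root)
    for other in nodes_by_token[token]:
        if other is not node:
            other.fail = compute_fail(other, root)
    return node
```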



        To help picture that, imagine going from the second diagram back to
        the first. The same two fail states are the ones that would need
        updating if the string caa were added back into that trie, and since
        we recompute every c and a node, it's not a problem.



        We could track node depths to skip roughly half of the candidates,
        but for now the code simply recomputes the fail state of every node
        sharing the token, which saves memory. This means tries where tokens
        appear many times will be slow to add to, but again that's considered
        a rare enough situation. It would be a concern if you were adapting
        this technique to something like DNA sequences or an antivirus
        database, where the alphabet is small.
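
        The depth optimization mentioned above would be a one-line filter:
        since a fail state always points to a strictly shallower node, only
        nodes deeper than the new node can need repointing at it. A
        hypothetical sketch (the stored-depth field is the extra memory cost
        being traded off):

```python
from collections import namedtuple

# Minimal stand-in for a trie node that also records its depth.
TrieNode = namedtuple("TrieNode", ["token", "depth"])

def fixup_candidates(nodes_by_token, new_node):
    # Only strictly deeper same-token nodes can fail to new_node.
    return [n for n in nodes_by_token[new_node.token]
            if n is not new_node and n.depth > new_node.depth]
```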



        The time complexity depends on how many nodes share a symbol and on
        the trie's depth: the worst case is O(N), but in practice the number
        of affected nodes is fairly small (average case roughly O(log^2 N)).




        Incredibly messy and mostly uncommented code, but if you're curious, here's the header, the body, and some tests








        answered Nov 27 '18 at 6:51









        Robert Fraser





























