AaronCrane.co.uk

Tries and Text::Match::FastAlternatives

I’ve just released version 1.00 of my Text::Match::FastAlternatives Perl module. Since I’m apparently declaring it stable, I thought it was worth writing up a description of what it does, and how it does it.

Suppose you have a large list of strings, and a set of keys, and you need to determine, for each of the strings, whether any of the keys occur in it. For example, the list of strings might be a list of user-agent headers sent to a web server, and the keys a set of strings that are good indicators of robots accessing your site; you want to calculate some server statistics, but disregard any robotic traffic.

How do you go about doing that?

A simple regex

If you’re using Perl, the obvious approach is to rely on the language’s excellent integrated regular expressions. First build a regex that matches any of the keys you’re interested in:

my $rx = join '|', map { quotemeta } @robots;
$rx = qr/$rx/;

Then for each user agent, check whether it matches:

for my $ua (@uagents) {
    process($ua) if $ua !~ $rx;
}

That’s certainly simple, but it’s far from fast. With Perl 5.8.8 on a 64-bit Linux machine, running that on 374 keys and about 100,000 user agents took over 40 seconds.

There are some Perl modules on CPAN to help with that (surprise, surprise). For example, Regexp::Trie and Regexp::Assemble each build regexes that run about seven times faster than the naïve regex. But that’s not fast enough for me.

Backtracking

Why does the naïve Perl regex run so slowly? It’s all about the algorithm Perl uses to evaluate regexes — the Backtracking NFA algorithm. When you have a regex of the form foo|bar|baz|fum, and you try to work out whether the string fubar matches it, Perl’s regex engine has to perform the following steps:

  1. Start at the beginning of the string (position 0) and the beginning of the regex.
  2. Does the next character f match the next piece of regex f? Yes, so advance.
  3. Does u match o? No, so backtrack to the closest decision point: start of string, and the bar alternation in the regex.
  4. Does f match b? No, so backtrack: start of string, and the baz alternation in the regex.
  5. Does f match b? No, so backtrack: start of string, and the fum alternation in the regex.
  6. Does f match f? Yes, so advance.
  7. Does u match u? Yes, so advance.
  8. Does b match m? No, so backtrack. But we’ve exhausted all the possible decision points, so retry the whole regex at the next position in the string.
  9. Does u match f? No, so backtrack: position 1, and the bar alternation.
  10. Does u match b? No, so backtrack: position 1, and the baz alternation.
  11. Does u match b? No, so backtrack: position 1, and the fum alternation.
  12. Does u match f? No, so backtrack. But we’ve exhausted, so retry the regex at the next position.
  13. Does b match f? No, so backtrack: position 2, and the bar alternation.
  14. Does b match b? Yes, so advance.
  15. Does a match a? Yes, so advance.
  16. Does r match r? Yes, so advance. But we’ve reached the end of the regex, so now we can finish, and the answer is that, yes, the string matches the regex.

Note that no progress at all is made in steps 4 through 7, or 9 through 12. When you have a lot of alternations in a regex, that’s only to be expected for this regex algorithm: for every position in the string, you have to serially examine each of the alternations in turn. If matches are relatively rare (which they very often are), you end up spending a lot of time on work that doesn’t actually get you anywhere, examining the same characters over and over again.

Fortunately, better algorithms exist. For regexes, there’s a DFA algorithm that has much better worst-case performance, at the cost of some expressiveness. But for the task we’re looking at, using a regex at all is massive overkill — we just don’t need that.

Tries

So, let’s consider a well-known alternative algorithm, the trie. The magic here is actually in the data structure; once you’ve built it, actually applying the algorithm is trivial. It’s essentially a carefully-structured tree that encodes a set of strings, one character per node. Here’s a trie containing the same strings as that foo|bar|baz|fum regex:

Diagram showing a trie containing the strings “foo”, “bar”, “baz”, and “fum”

Like any ordinary tree, a trie is a collection of nodes in which each node has zero or more child nodes, and no node has more than one parent. The clever bit is what’s stored in each node — namely, a mapping from characters to child nodes. The node at the root of this particular trie has entries for b and f, because all the keys in the trie begin with one of those two characters. The child node corresponding to b has an entry for a, because every key that begins with b has an a as its next character. The grandchild node for ba has one entry for r and one for z. Here, though, it’s slightly different: instead of these entries being links to other nodes, they are distinguished as being valid endpoints for keys, so no child nodes are necessary.

Given that data structure, the algorithm to determine whether a string S contains any of the keys at a position P is simply this:

  1. Let N be the node at the root of the tree.
  2. Find the entry E in N corresponding to the character at P in S.
  3. If no such entry E was found, none of the keys were found at the starting position in S; terminate with a false return value.
  4. Otherwise, if E indicates a valid endpoint, one of the keys was found at the starting position; terminate with a true return value.
  5. Otherwise, E must be a link to a child node. Set N ← E; increment P; and go to step 2.

So at each position in the string, you only ever examine each character once, and the worst-case performance is the length of the longest key (or, equivalently, the maximum depth of the tree). Since a match must be attempted at each position in the target string, the total worst-case search time is O(mn) where m is the length of the target string and n is the length of the longest key.

(The astute reader will spot that you still have to examine a character multiple times if you end up stepping through the string. There’s a somewhat more complex algorithm that avoids much of that work, at the cost of some additional work at construction time.)

Text::Match::FastAlternatives

Now, one approach at this point would be to run off and implement this algorithm in Perl, or Python, or whatever high-level language you’re using. My expectation, however, is that doing so would be unproductive. That’s because regex engines are usually written in low-level languages, like C; the implementation win bought from doing that is very likely to be greater than the win from using a smarter algorithm. (That’s certainly the case for Perl 5, where the implementation has a relatively slow op dispatch loop.) So writing in C is probably a good idea if you want to compete with a C-written regex engine.

And, of course, that’s what I did: Text::Match::FastAlternatives is a trie implementation written in C, intended to be called from Perl. It takes some shortcuts to make the implementation especially simple and tight; in particular, it only handles printable ASCII characters, but that covers my use case admirably.

The guts, tweaked slightly to remove the Perl-specific bits, look like this:

#define MAX_NODES 95   /* 0x20 to 0x7E */

struct trie_node;
struct trie_node {
    int final;
    struct trie_node *next[MAX_NODES];
};

int match_once(struct trie_node *node, const char *s, size_t len) {
    unsigned char c;
    for (;;) {
        if (node->final)
            return 1;
        if (len == 0)
            return 0;
        c = *s;
        if (c < 32 || c >= 127)
            return 0;
        node = node->next[c - 32];
        if (!node)
            return 0;
        s++;
        len--;
    }
}

int match(struct trie_node *node, const char *target) {
    size_t len = strlen(target);
    do {
        if (match_once(node, target, len))
            return 1;
        target++;
    } while (len-- > 0);
    return 0;
}

One of the advantages of the very simplistic implementation — just using a vanilla dense array for the character/entry mapping — is that the “find the entry” step is completely trivial:

        node = node->next[c - 32];

That will get compiled to something like a couple of memory references and additions, which is about as fast as the innermost loop can get. In a general-purpose implementation, some kind of sparse mapping would be needed, especially if you want to handle all 2²¹ potential Unicode characters.

Perl 5.10

A new major release of Perl 5 was released on December 18th, 2007. Among its many improvements was a large swathe of enhancements to the regex engine, including a built-in trie optimisation for regexes containing alternations of large numbers of constant strings.

For most purposes, that optimisation will be everything you’d need. For example, the test I described above takes 40 seconds under Perl 5.8.8, but only about half a second under Perl 5.10.0. Yet Text::Match::FastAlternatives is even faster, at about 0.23 seconds. Here are some more detailed numbers:

Perl Keys Regex FastAlternatives Improvement
5.10.0, non-threaded 5 5.05/s 6.11/s 21%
50 3.92/s 5.89/s 50%
100 3.03/s 5.45/s 80%
374 2.09/s 4.48/s 115%
Perl 5.10.0, threaded 5 4.75/s 5.23/s 10%
50 3.49/s 5.06/s 45%
100 2.98/s 4.68/s 57%
374 2.00/s 4.13/s 107%

I’d have guessed that Perl 5.10 would have asymptotic performance at least as good as Text::Match::FastAlternatives, even if the constant factors had turned out bigger. It looks like I’m wrong on that — the Text::Match::FastAlternatives advantage keeps getting bigger as the data set increases, especially on non-threaded Perl builds.

So if your problem can be solved with Text::Match::FastAlternatives, given its limitations, it’s still the fastest option available, even if you’re using Perl 5.10.

Release 1.00

We’ve been using Text::Match::FastAlternatives in production for quite some time now, and it seems rock-solid in practice. So I’ve decided to push out a 1.00 release, with documentation updates to describe the situation with Perl 5.10. The new version has no code changes (except that it declares a previously-implicit dependency on Perl 5.6, which should squash the two remaining CPAN Testers failures for release 0.05). There’s no need to upgrade if you’re already using an earlier release, though.

The new release should be available from CPAN by the time you read this, with any luck.

Thanks to David Blank-Edelman for spurring me to update the docs.