Published at 17:01, Tue 6 May 2008
I’ve just released version 1.00 of my Text::Match::FastAlternatives Perl module. Since I’m apparently declaring it stable, I thought it was worth writing up a description of what it does, and how it does it.
Suppose you have a large list of strings, and a set of keys, and you need to determine, for each of the strings, whether any of the keys occur in it. For example, the list of strings might be a list of user-agent headers sent to a web server, and the keys a set of strings that are good indicators of robots accessing your site; you want to calculate some server statistics, but disregard any robotic traffic.
How do you go about doing that?
If you’re using Perl, the obvious approach is to rely on the language’s excellent integrated regular expressions. First build a regex that matches any of the keys you’re interested in:
my $rx = join '|', map { quotemeta } @robots;
$rx = qr/$rx/;
Then for each user agent, check whether it matches:
for my $ua (@uagents) {
process($ua) if $ua !~ $rx;
}
That’s certainly simple, but it’s far from fast. With Perl 5.8.8 on a 64-bit Linux machine, running that on 374 keys and about 100,000 user agents took over 40 seconds.
There are some Perl modules on CPAN to help with that (surprise, surprise). For example, Regexp::Trie and Regexp::Assemble each build regexes that run about seven times faster than the naïve regex. But that’s not fast enough for me.
Why does the naïve Perl regex run so slowly? It’s all about the algorithm
Perl uses to evaluate regexes — the Backtracking NFA algorithm. When you
have a regex of the form foo|bar|baz|fum, and you try to work out whether
the string fubar matches it, Perl’s regex engine has to perform the
following steps:
f match the next piece of regex f? Yes,
so advance.u match o? No, so backtrack to the closest decision point:
start of string, and the bar alternation in the regex.f match_b? No, so backtrack: start of string, and the baz
alternation in the regex.f match_b? No, so backtrack: start of string, and the fum
alternation in the regex.f match_f? Yes, so advance.u match u? Yes, so advance.b match m? No, so backtrack. But we’ve exhausted all the
possible decision points, so retry the whole regex at the next position
in the string.u match f? No, so backtrack: position 1, and the bar
alternation.u match b? No, so backtrack: position 1, and the baz
alternation.u match b? No, so backtrack: position 1, and the fum
alternation.u match f? No, so backtrack. But we’ve exhausted, so retry
the regex at the next position.b match f? No, so backtrack: position 2, and the bar
alternation.b match b? Yes, so advance.a match a? Yes, so advance.r match r? Yes, so advance. But we’ve reached the end of the
regex, so now we can finish, and the answer is that, yes, the string
matches the regex.Note that no progress at all is made in steps 4 through 7, or 9 through 12. When you have a lot of alternations in a regex, that’s only to be expected for this regex algorithm: for every position in the string, you have to serially examine each of the alternations in turn. If matches are relatively rare (which they very often are), you end up spending a lot of time on work that doesn’t actually get you anywhere, examining the same characters over and over again.
Fortunately, better algorithms exist. For regexes, there’s a DFA algorithm that has much better worst-case performance, at the cost of some expressiveness. But for the task we’re looking at, using a regex at all is massive overkill — we just don’t need that.
So, let’s consider a well-known alternative algorithm, the trie. The magic
here is actually in the data structure; once you’ve built it, actually
applying the algorithm is trivial. It’s essentially a carefully-structured
tree that encodes a set of strings, one character per node. Here’s a trie
containing the same strings as that foo|bar|baz|fum regex:

Like any ordinary tree, a trie is a collection of nodes in which each
node has zero or more child nodes, and no node has more than one parent.
The clever bit is what’s stored in each node — namely, a mapping from
characters to child nodes. The node at the root of this particular trie
has entries for b and f, because all the keys in the trie begin with
one of those two characters. The child node corresponding to b has an
entry for a, because every key that begins with b has an a as its
next character. The grandchild node for ba has one entry for r and
one for z. Here, though, it’s slightly different: instead of these
entries being links to other nodes, they are distinguished as being valid
endpoints for keys, so no child nodes are necessary.
Given that data structure, the algorithm to determine whether a string S contains any of the keys at a position P is simply this:
So at each position in the string, you only ever examine each character once, and the worst-case performance is the length of the longest key (or, equivalently, the maximum depth of the tree). Since a match must be attempted at each position in the target string, the total worst-case search time is O(mn) where m is the length of the target string and n is the length of the longest key.
(The astute reader will spot that you still have to examine a character multiple times if you end up stepping through the string. There’s a somewhat more complex algorithm that avoids much of that work, at the cost of some additional work at construction time.)
Now, one approach at this point would be to run off and implement this algorithm in Perl, or Python, or whatever high-level language you’re using. My expectation, however, is that doing so would be unproductive. That’s because regex engines are usually written in low-level languages, like C; the implementation win bought from doing that is very likely to be greater than the win from using a smarter algorithm. (That’s certainly the case for Perl 5, where the implementation has a relatively slow op dispatch loop.) So writing in C is probably a good idea if you want to compete with a C-written regex engine.
And, of course, that’s what I did: Text::Match::FastAlternatives is a trie implementation written in C, intended to be called from Perl. It takes some shortcuts to make the implementation especially simple and tight; in particular, it only handles printable ASCII characters, but that covers my use case admirably.
The guts, tweaked slightly to remove the Perl-specific bits, look like this:
#define MAX_NODES 95 /* 0x20 to 0x7E */
struct trie_node;
struct trie_node {
int final;
struct trie_node *next[MAX_NODES];
};
int match_once(struct trie_node *node, const char *s, size_t len) {
unsigned char c;
for (;;) {
if (node->final)
return 1;
if (len == 0)
return 0;
c = *s;
if (c < 32 || c >= 127)
return 0;
node = node->next[c - 32];
if (!node)
return 0;
s++;
len--;
}
}
int match(struct trie_node *node, const char *target) {
size_t len = strlen(target);
do {
if (match_once(node, target, len))
return 1;
target++;
} while (len-- > 0);
return 0;
}
One of the advantages of the very simplistic implementation — just using a vanilla dense array for the character/entry mapping — is that the “find the entry” step is completely trivial:
node = node->next[c - 32];
That will get compiled to something like a couple of memory references and additions, which is about as fast as the innermost loop can get. In a general-purpose implementation, some kind of sparse mapping would be needed, especially if you want to handle all 2²¹ potential Unicode characters.
A new major release of Perl 5 was released on December 18th, 2007. Among its many improvements was a large swathe of enhancements to the regex engine, including a built-in trie optimisation for regexes containing alternations of large numbers of constant strings.
For most purposes, that optimisation will be everything you’d need. For example, the test I described above takes 40 seconds under Perl 5.8.8, but only about half a second under Perl 5.10.0. Yet Text::Match::FastAlternatives is even faster, at about 0.23 seconds. Here are some more detailed numbers:
| Perl | Keys | Regex | FastAlternatives | Improvement |
|---|---|---|---|---|
| 5.10.0, non-threaded | 5 | 5.05/s | 6.11/s | 21% |
| 〃 | 50 | 3.92/s | 5.89/s | 50% |
| 〃 | 100 | 3.03/s | 5.45/s | 80% |
| 〃 | 374 | 2.09/s | 4.48/s | 115% |
| Perl 5.10.0, threaded | 5 | 4.75/s | 5.23/s | 10% |
| 〃 | 50 | 3.49/s | 5.06/s | 45% |
| 〃 | 100 | 2.98/s | 4.68/s | 57% |
| 〃 | 374 | 2.00/s | 4.13/s | 107% |
I’d have guessed that Perl 5.10 would have asymptotic performance at least as good as Text::Match::FastAlternatives, even if the constant factors had turned out bigger. It looks like I’m wrong on that — the Text::Match::FastAlternatives advantage keeps getting bigger as the data set increases, especially on non-threaded Perl builds.
So if your problem can be solved with Text::Match::FastAlternatives, given its limitations, it’s still the fastest option available, even if you’re using Perl 5.10.
We’ve been using Text::Match::FastAlternatives in production for quite some time now, and it seems rock-solid in practice. So I’ve decided to push out a 1.00 release, with documentation updates to describe the situation with Perl 5.10. The new version has no code changes (except that it declares a previously-implicit dependency on Perl 5.6, which should squash the two remaining CPAN Testers failures for release 0.05). There’s no need to upgrade if you’re already using an earlier release, though.
The new release should be available from CPAN by the time you read this, with any luck.
Thanks to David Blank-Edelman for spurring me to update the docs.