AaronCrane.co.uk

Regex matching algorithms and asymptotic run time

[Previously published here.]

Russ Cox recently wrote an article about the worst-case run time of two different implementation strategies for matching regexes: Thompson NFA, and backtracking NFA. In particular, Russ points to the Perl regex engine as an example of how not to do it.

The article’s an interesting read. But I don’t think the approach described is necessarily easy and/or useful to fit into Perl, though.

There are two circumstances where the try-and-backtrack NFA execution algorithm for regexes (that is, the algorithm used by Perl and PCRE and many other tools) causes O(2n) (exponential-time) performance. One is when the regex uses backreferences. There are no known backreference algorithms that avoid O(2n) execution in all cases, so that’s not surprising. The other is when the regex uses nested quantifiers — things like /(a*)+/.

The thing that these two situations have in common is that they’re fairly rare in practice. Backreferences are almost never useful, and when they are, it’s usually for something whose execution can’t be O(2n), like /\b (\w+) \s+ \1 \b/x for finding duplicate words. And nested quantifiers seem to be used mainly in pathological test cases for regex engines. The few real-world situations they’re needed for are things like / " (?: [^"\\]+ | \\ .)* " /x, which also can’t take O(2n) time.

It is admittedly possible to write such things in ways that can exhibit O(2n) performance, but I don’t find that significantly more interesting than the fact that you can write buggy code that contains an infinite loop. (For the record, you can also write regexes that match the same strings, but are even more efficient; I have production code containing / " (?> [^"\\]* ) (?> (?> \\ . [^"\\]* )* ) " /x, which can be executed a lot more quickly by Perl’s engine.)

Russ mentions this point, but sees the issue as “a choice between an implementation with a predictable, consistent, fast running time on all inputs or one that usually runs quickly but can take years of CPU time (or more) on some inputs”. That’s a fair comment — it would be nice not to have to spend mental effort on constructing regexes in ways that avoid misuse of nested quantifiers — but it’s also a simplification of the situation.

First, as I’m sure Russ knows, asymptotic time complexity isn’t the only factor in how well a piece of code performs. For small problem sizes (and often, real-world problems are small), the lower-order terms can hurt a lot. We never describe an algorithm as taking O(k1n² + k2n + k3) time, because that’s asymptotically equivalent to O(n²) time complexity. But for small values of n, the ki constants can dominate the actual runtime.

I recently encountered an excellent example of this, related precisely to this issue of regex performance. Another part of the production code mentioned above needs to determine whether any of a set of 339 literal strings can be found in another string, and we need to answer that question millions of times per day. We were building a regex from the literal strings, separating each one with |. Profiling revealed that executing this regex was a performance bottleneck. That made sense: at each position in the string being matched, Perl’s regex engine has to try each of the alternatives in turn. On the other hand, a DFA execution algorithm, as described by Russ, can keep track of all of the alternatives in parallel (at each match position).

With that in mind, I looked round for a suitable off-the-shelf DFA engine to st— I mean, to borrow. I found TRE (one of the “efficient implementations” that Russ mentions), and it’s already in Debian, so it would be easy for us to use. Some simple proof-of-concept XS code later, I benchmarked the same regex on the Perl engine and on TRE.

Perl’s asymptotically-slow engine was about twice as fast as the asymptotically-fast TRE engine. So, I gave up on DFAs for this task, and wrote something even faster instead.

It’s entirely possible that there are DFA implementations that perform at least as well as Perl’s regex engine, but TRE is at least an existence proof that picking a better algorithm isn’t necessarily sufficient for good performance.

The second problem with Russ’s position is that backtracking NFA algorithms typically have more features than DFA or DFA-conversion implementations.

  • Capturing parentheses have been considered hard to implement in DFA engines, though Russ says that “Thompson-style algorithms can be adapted to track submatch boundaries without giving up efficient performance”, and certainly the Tcl regex implementation (a hybrid DFA/NFA engine, as far as I can tell) offers them.

  • There are no known algorithms for backreferences that always avoid O(2n) matching time.

  • NFA implementations are beginning to offer recursive regex invocation. I’m not a mathematician, but I’m pretty sure that you can’t construct a DFA from a regex that recursively invokes parts of itself.

  • Perl lets you embed code in regexes, and that code may have arbitrary side-effects. Further, the side-effects are unpicked when Perl backtracks past the code. Again, I’m no mathematician, but I think this makes it very hard to use a DFA implementation for embedded code blocks.

There seems to be a trade-off between expressiveness in regex syntax, versus fast worst-case execution time. Backtracking NFA implementations come down firmly in favour of expressiveness. Traditional DFA implementations prefer to minimise asymptotic time complexity. Hybrid implementations get to pick and choose.

My own preference is for expressiveness, on balance. That’s in line with my preference for Perl over more minimal languages; good code in a more expressive language is typically easier to understand than good code in a less expressive language.

That said, Russ’s article is definitely interesting, and perhaps it will move people to try out the algorithms he suggests in widely-used systems like Perl or PCRE.

And this does seem to be a good time to go about doing that. Perl 5.10 will have many significant enhancements to its regex engine, including facilities that make it much easier to build and use pluggable regex engines. It’s quite easy to imagine Perl continuing to offer its current expressive but sometimes slow regex engine as the default, but letting you switch to a DFA engine for individual regexes where that would make sense.

Update, 14 Feb: There seems to have been a small explosion of pushback on this article lately. Here are some relevant links: