Just a Theory

Trans rights are human rights

How to Use Regex Named Captures in Perl 5

I ran some Perl 5 regular expression syntax that I’d never seen the other day. It used two features I’d never seen before:

  • (?{ }), a zero-width, non-capturing assertion that executes arbitrary Perl code.
  • $^N, a variable for getting the contents of the most recent capture in a regular expression.

The cool thing is that, used in combination, these two features can be used to hack named captures into Perl regular expressions. Here’s an example:

use warnings;
use strict;
use Data::Dumper;

my $string = 'The quick brown fox jumps over the lazy dog';

my %found;

my @captures = $string =~ /
    (?: (quick|slow) \s+    (?{ $found{speed}  = $^N  }) )
    (?: (brown|blue) \s+    (?{ $found{color}  = $^N  }) )
    (?: (sloth|fox)  \s+    (?{ $found{animal} = $^N  }) )
    (?: (eats|jumps)        (?{ $found{action} = $^N  }) )
/xms;

print Dumper \@captures;
print Dumper \%found;

The output of running this program is:

$VAR1 = [
            'quick',
            'brown',
            'fox',
            'jumps'
        ];
$VAR1 = {
            'color' => 'brown',
            'speed' => 'quick',
            'action' => 'jumps',
            'animal' => 'fox'
        };

So the positional captures are still returned, and we’ve assigned them to keys in a hash. This can be very convenient for complex regular expressions.

This is a cool feature, but there are a few caveats. First, according to the Perl regular expression documentation, (?{ }) is a highly experimental feature that could go away at any time. But more importantly, if you’re relying on this feature you should be aware of the side effects. What I mean by that is that, if a regular expression match fails, but there are some successful matches during execution, then the code in the (?{ }) assertions could still execute. For example, if you changed the word “jumps” to “poops” in the above example, the output becomes:

$VAR1 = [];
$VAR1 = {
            'color' => 'brown',
            'speed' => 'quick',
            'animal' => 'fox'
        };

Which means that the match failed, but there were still assignments to our hash, because some of the captures succeeded before the overall match failed. The upshot is that you should always check the return value from the match before relying on whatever the code inside the (?{ }) assertions did.

The problem becomes even more subtle if your regular expressions trigger backtracking. In that case, you might have an optional group match and its value assigned to the hash, and then the next required group fail. Perl will then backtrack to throw out the successful group match and then see if the next required match succeeds. If so, you can have a successful match and potentially invalid data in your hash. Here’s an example:

my @captures = $string =~ /
    (?: (quick|slow) \s+    (?{ $found{speed}  = $^N  }) )
    (?: (brown|blue) \s+    (?{ $found{color}  = $^N  }) )?
    (?: (brown\s+fox)       (?{ $found{animal} = $^N  }) )
/xms;

print Dumper \@captures;
print Dumper \%found;

And the output is:

$VAR1 = [
            'quick',
            undef,
            'brown fox'
        ];
$VAR1 = {
            'color' => 'brown',
            'speed' => 'quick',
            'animal' => 'brown fox'
        };

So while the second group returned undef for the color capture, the %foundhash still had the color key in it. This may or may not be what you want.

Of course, all this seems cool, but since it’s a truly evil hack, you have to be careful. If you can wait, though, perhaps we’ll see named captures in Perl 5.10.

Looking for the comments? Try the old layout.