Just a Theory

By David E. Wheeler

Posts about Regular Expressions

How to Use Regex Named Captures in Perl 5

I ran across some Perl 5 regular expression syntax the other day that I’d never seen before. It used two features that were new to me:

  • (?{ }), a zero-width, non-capturing assertion that executes arbitrary Perl code.
  • $^N, a variable for getting the contents of the most recent capture in a regular expression.

The cool thing is that, used in combination, these two features can be used to hack named captures into Perl regular expressions. Here’s an example:

use warnings;
use strict;
use Data::Dumper;

my $string = 'The quick brown fox jumps over the lazy dog';

my %found;

my @captures = $string =~ /
    (?: (quick|slow) \s+    (?{ $found{speed}  = $^N  }) )
    (?: (brown|blue) \s+    (?{ $found{color}  = $^N  }) )
    (?: (sloth|fox)  \s+    (?{ $found{animal} = $^N  }) )
    (?: (eats|jumps)        (?{ $found{action} = $^N  }) )
/xms;

print Dumper \@captures;
print Dumper \%found;

The output of running this program is:

$VAR1 = [
            'quick',
            'brown',
            'fox',
            'jumps'
        ];
$VAR1 = {
            'color' => 'brown',
            'speed' => 'quick',
            'action' => 'jumps',
            'animal' => 'fox'
        };

So the positional captures are still returned, and we’ve assigned them to keys in a hash. This can be very convenient for complex regular expressions.

This is a cool feature, but there are a few caveats. First, according to the Perl regular expression documentation, (?{ }) is a highly experimental feature that could go away at any time. More importantly, if you rely on this feature, you should be aware of its side effects: if the overall match fails but some groups matched along the way, the code in those (?{ }) assertions may still have executed. For example, if you change the word “jumps” to “poops” in the above example, the output becomes:

$VAR1 = [];
$VAR1 = {
            'color' => 'brown',
            'speed' => 'quick',
            'animal' => 'fox'
        };

That means the match failed, but there were still assignments to our hash, because some of the captures succeeded before the overall match failed. The upshot is that you should always check the return value from the match before relying on anything the code inside the (?{ }) assertions did.
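One way to guard against this is to check the match result and discard the hash when the overall match failed. Here’s a minimal sketch along those lines (using a shortened version of the pattern above):

```perl
use strict;
use warnings;

my $string = 'The quick brown fox poops over the lazy dog';

my %found;
my $matched = $string =~ /
    (?: (quick|slow) \s+ (?{ $found{speed}  = $^N }) )
    (?: (eats|jumps)     (?{ $found{action} = $^N }) )
/xms;

# Only trust %found if the overall match succeeded.
%found = () unless $matched;

print $matched ? "matched\n" : "no match; partial captures discarded\n";
```

Because “poops” matches neither alternative in the second group, the overall match fails, but $found{speed} was assigned during the attempt; resetting the hash keeps the side effects from leaking out.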

The problem becomes even more subtle if your regular expression triggers backtracking. In that case, an optional group might match and assign its value to the hash, only for the next required group to fail. Perl then backtracks, throwing out the successful group match, and checks whether the next required match succeeds. If it does, you can end up with a successful match and potentially invalid data in your hash. Here’s an example:

my @captures = $string =~ /
    (?: (quick|slow) \s+    (?{ $found{speed}  = $^N  }) )
    (?: (brown|blue) \s+    (?{ $found{color}  = $^N  }) )?
    (?: (brown\s+fox)       (?{ $found{animal} = $^N  }) )
/xms;

print Dumper \@captures;
print Dumper \%found;

And the output is:

$VAR1 = [
            'quick',
            undef,
            'brown fox'
        ];
$VAR1 = {
            'color' => 'brown',
            'speed' => 'quick',
            'animal' => 'brown fox'
        };

So while the second group returned undef for the color capture, the %found hash still had the color key in it. This may or may not be what you want.

Of course, all this seems cool, but since it’s a truly evil hack, you have to be careful. If you can wait, though, perhaps we’ll see named captures in Perl 5.10.
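For comparison, the native named-capture syntax slated for Perl 5.10 needs no (?{ }) tricks at all: each (?<name>…) group records its capture in the %+ hash. A quick sketch of the same match in that style:

```perl
use strict;
use warnings;
use 5.010;

my $string = 'The quick brown fox jumps over the lazy dog';

if ( $string =~ /
    (?<speed>  quick|slow ) \s+
    (?<color>  brown|blue ) \s+
    (?<animal> sloth|fox  ) \s+
    (?<action> eats|jumps )
/xms ) {
    # Prints: action => jumps, animal => fox, color => brown, speed => quick
    say "$_ => $+{$_}" for sort keys %+;
}
```

And since %+ is only populated on a successful match, the failed-match and backtracking side effects above simply don’t arise.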

Looking for the comments? Try the old layout.


Add Regular Expression Operator to SQLite

As I discussed a couple of months ago, DBD::SQLite exposes the SQLite sqlite3_create_function() API for adding Pure-Perl functions and aggregates to SQLite on a per-connection basis. This is cool, but in perusing the SQLite expression documentation, I came across this gem:

The REGEXP operator is a special syntax for the regexp() user function. No regexp() user function is defined by default and so use of the REGEXP operator will normally result in an error message. If a user-defined function named “regexp” is defined at run-time, that function will be called in order to implement the REGEXP operator.

Well hell! I thought. I can do that!

In a brief search, I could find no further documentation of this feature, but all it took was a little experimentation to figure it out. The regexp() function should expect two arguments. The first is the regular expression, and the second is the value to match. So it can be added to DBD::SQLite like this:

$dbh = DBI->connect('dbi:SQLite:dbname=test.db');
$dbh->func('regexp', 2, sub {
    my ($regex, $string) = @_;
    return $string =~ /$regex/;
}, 'create_function');
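With that in place, the REGEXP operator works right away. Here’s a minimal sanity check, assuming DBD::SQLite is installed, against an in-memory database:

```perl
use strict;
use warnings;
use DBI;

# An in-memory database; the regexp() function is per-connection.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '', { RaiseError => 1 });

$dbh->func('regexp', 2, sub {
    my ($regex, $string) = @_;
    return $string =~ /$regex/ ? 1 : 0;
}, 'create_function');

# x REGEXP y calls regexp(y, x), so the pattern is the second operand.
my ($match) = $dbh->selectrow_array(q{SELECT 'brown fox' REGEXP 'b.+n'});
print "REGEXP result: $match\n";  # REGEXP result: 1
```

Note the explicit 1-or-0 return, so SQLite always gets a defined value back from the Perl function.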

Yep, that’s it! Now, I have my own module for handling database connections, and I wanted to make sure that all of my custom functions are always present, every time I connect to the database. In a mod_perl environment, you can end up with a lot of connections, and a single process has the potential to disconnect and reconnect more than once (due to exceptions thrown by the database and whatnot). The easiest way to ensure that the functions are there as soon as you connect, and every time you connect, is to subclass the DBI and implement a connected() method (a trick I learned from Tim Bunce). Here’s what it looks like:

package MyApp::SQLite;
use base 'DBI';

package MyApp::SQLite::st;
use base 'DBI::st';

package MyApp::SQLite::db;
use base 'DBI::db';

sub connected {
    my $dbh = shift;
    # Add regexp function.
    $dbh->func('regexp', 2, sub {
        my ($regex, $string) = @_;
        return $string =~ /$regex/;
    }, 'create_function');
}

So how does this work? Here’s a quick app I wrote to demonstrate the use of the REGEXP expression in SQLite using Perl regular expressions:

#!/usr/bin/perl -w

use strict;

my $dbfile = shift || die "Usage: $0 db_file\n";
my $dbh = MyApp::SQLite->connect(
    "dbi:SQLite:dbname=$dbfile", '', '',
    {
        RaiseError  => 1,
        PrintError  => 0,
    }
);

END {
    $dbh->do('DROP TABLE try');
    $dbh->disconnect;
}

$dbh->do('CREATE TABLE try (a TEXT)');

my $ins = $dbh->prepare('INSERT INTO try (a) VALUES (?)');
for my $val (qw(foo bar bat woo oop craw)) {
    $ins->execute($val);
}

my $sel = $dbh->prepare('SELECT a FROM try WHERE a REGEXP ?');

for my $regex (qw( ^b a w?oop?)) {
    print "'$regex' matches:\n  ";
    print join "\n  " =>
        @{ $dbh->selectcol_arrayref($sel, undef, $regex) };
    print "\n\n";
}

This script outputs:

'^b' matches:
  bar
  bat

'a' matches:
  bar
  bat
  craw

'w?oop?' matches:
  foo
  woo
  oop

Pretty slick, no? I wonder if it’d make sense for DBD::SQLite to add the regexp() function itself, in C, using the Perl API, so that it’s just always available to DBD::SQLite apps?


Splitting Words in Perl

I’ve created a new module, Text::WordDiff, now on its way to CPAN, to show the differences between two documents using words as tokens, rather than lines as Text::Diff does. I plan to use it in Bricolage to give people a change-tracking view (as seen in word processors) comparing two versions of a document. Fortunately, Algorithm::Diff makes this extremely easy to do. My only real problem was figuring out how to tokenize a string into words.

After looking at discussions in The Perl Cookbook and Mastering Regular Expressions, I settled on using Friedl’s pattern for identifying the starting boundary of words, which is qr/(?<!\w)(?=\w)/msx. This pattern will turn the string “this is O’Reilly’s string” into the following tokens:

[
    q{this },
    q{is },
    q{O'},
    q{Reilly'},
    q{s },
    q{string},
];

So it’s imperfect, but it works well enough for me. I’m thinking of using the Unicode character class for words, instead, at least for more recent versions of Perl that understand them (5.8.0 and later?). That would be /(?<!\p{IsWord})(?=\p{IsWord})/msx. The results using that regular expression are the same.
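The tokenization itself is just a split on that zero-width word-start boundary (conveniently, a zero-width match at the start of the string produces no leading empty field):

```perl
use strict;
use warnings;

# Friedl's pattern: split where the previous character is not a
# word character and the next one is.
my @tokens = split /(?<!\w)(?=\w)/msx, q{this is O'Reilly's string};

# Prints: <this ><is ><O'><Reilly'><s ><string>
print "<$_>", for @tokens;
print "\n";
```

Note that each token keeps its trailing whitespace, which is exactly what you want for reassembling the diffed document.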

But otherwise, I’m not sure whether or not this is the best approach. I think that it’s good enough for the general cases I have, and the matching of words in and of themselves is not that important. What I mean is that, as long as most tokens are words, it’s okay with me if some, such as “O’”, “Reilly’”, and “s” in the above example, are not words. What I don’t know is how well it’ll work for non-Roman glyphs, such as in Japanese or Korean text. I tried a test on a Korean string I have lying around (borrowed from the Encode.pm test suite), but it didn’t split it up at all (with use utf8;).

So what do you think? Does Text::WordDiff work for your text? Is there a better and more general solution for tokenizing the words in a string?


Regular Expressions are Faster than Unpacking

Bricolage has always used unpack() to parse ISO-8601 date strings into their component parts. A few months back, I added support for subsecond precision using the DateTime module, and couldn’t figure out how to parse out the optional subsecond part of the date (if it’s 0, PostgreSQL doesn’t include the decimal part of the seconds). So I switched to parsing with the regular expression /(\d\d\d\d).(\d\d).(\d\d).(\d\d).(\d\d).(\d\d)(\.\d*)?/. This worked well, but I lamented the loss of the performance of unpack(). I mean, surely it’s faster to tell a parser where, exactly, to find each character than it is to use a pattern, right?

Well, last week I finally figured out how to unpack the decimal place using unpack() whether it’s there or not (the secret is the * modifier, which somehow I’d never noticed before). So I ran a benchmark to see how much of a performance gain I would get:

#!/usr/bin/perl -w
use strict;
use Benchmark;

my $date = '2005-03-23T19:30:05.1234';
my $ISO_TEMPLATE =  'a4 x a2 x a2 x a2 x a2 x a2 a*';

sub with_pack {
    my %args;
    @args{qw(year month day hour minute second nanosecond)}
        = unpack $ISO_TEMPLATE, $date;
    {
        no warnings;
        $args{nanosecond} *= 1.0E9;
    }
}

sub with_regex {
    $date =~ m/(\d\d\d\d).(\d\d).(\d\d).(\d\d).(\d\d).(\d\d)(\.\d*)?/;
    my %args = (
        year       => $1,
        month      => $2,
        day        => $3,
        hour       => $4,
        minute     => $5,
        second     => $6,
        nanosecond => $7 ? $7 * 1.0E9 : 0
    );
}

timethese(100000, {
    pack => \&with_pack,
    regex => \&with_regex
});

__END__

I quickly got my answer (all hail Benchmark!). This script outputs:

  Benchmark: timing 100000 iterations of pack, regex...
        pack:  3 wallclock secs ( 2.14 usr +  0.00 sys =  2.14 CPU) @ 46728.97/s (n=100000)
       regex:  3 wallclock secs ( 2.11 usr +  0.01 sys =  2.12 CPU) @ 47169.81/s (n=100000)

I sure didn’t expect them to be so close, let alone to see the regular expression approach nose out the unpack() solution. Clearly the Perl regex engine is highly optimized. And perhaps pack()/unpack() is not.
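In case the * modifier is the unfamiliar part, here it is in isolation: a* grabs however many characters remain, whether that’s the fractional seconds or nothing at all:

```perl
use strict;
use warnings;

my $template = 'a4 x a2 x a2 x a2 x a2 x a2 a*';

# With fractional seconds, a* soaks up the remainder...
my @with = unpack $template, '2005-03-23T19:30:05.1234';
print "fraction: $with[6]\n";  # fraction: .1234

# ...and without them, there's simply nothing left for it to grab.
my @without = unpack $template, '2005-03-23T19:30:05';
print "second: $without[5]\n";  # second: 05
```

That’s why the benchmark’s with_pack() wraps the nanosecond math in no warnings: when there’s no decimal part, the field is empty and the multiplication would otherwise complain.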

Live and learn, I guess.
