home :: computers :: programming

SVN::Notify 2.70: Output Filtering and Character Encoding

I’m very pleased to announce the release of SVN::Notify 2.70. You can see an example of its colordiff output here. This is a major release that I’ve spent the last several weeks polishing and tweaking to get just right. There are quite a few changes, but the two most important are imporoved character encoding support and output filtering.

Improved Character Encoding Support

I’ve had a number of bug reports regarding issues with character encodings. Particularly for folks working in Europe and Asia, but really for anyone using multibyte characters in their source code and log messages (and we all do nowadays, don’t we?), it has been difficult to find the proper incantation to get SVN::Notify to convert data from and to their proper encodings. Using a patch from Toshikazu Kinkoh as a starting-point, and with a lot of reading and experimentation, as well as regular and patient tests on Toshikazu’s and Martin Lindhe’s production systems, I think I’ve finally got it nailed down.

Now you can use the --encoding (formerly --charset), --svn-encoding, and --diff-encoding options—as well as --language—to get SVN::Notify to do the right thing. As long as your Subversion server’s OS supports an appropriate locale, you should be golden (mine is old, with no UTF-8 locales :\). And if all else fails, you can still set the $LANG environment variable before executing svnnotify.

There is actually a fair bit to know about encodings to get it to work properly, but if you use UTF-8 throughout and your OS supports UTF-8 locales, you shouldn’t have to do anything. You might have to set --language in order to get it to use the proper locale. See the new documentation of the encoding support for all the details. And if you still have problems, please do let me know.

Output Filtering

Much sexier is the addition of output filtering in SVN::Notify 2.70. I got pretty tired of getting feature requests for what are essentially formatting modifications, such as this one requesting support for KDE-style keyword support. I myself was using Trac wiki syntax in commit messages on a recent project and wanted to see them converted to HTML for messages output by SVN::Notify::HTML::ColorDiff.

So I finally sat down and gave some though on how to implement a simple plugin architecture for SVN::Notify. When I realized that it was generally just formatting that people wanted, it became simpler: I just needed a way to allow folks to write simple output filters. The solution I came up with was to just use Perl. Output filters are simply subroutines named for the kind of output they filter. They live in perl packages. That’s it.

For example, say that your developers write their commit log messages in Textile, and rather than receive them stuck inside <pre> tags, you’d like them converted to HTML. It’s simple. Just put this code in a Perl module file:

package SVN::Notify::Filter::Textile;
use Text::Textile ();

sub log_message {
    my ($notifier, $lines) = @_;
    return $lines unless $notify->content_type eq 'text/html';
    return [ Text::Textile->new->process( join $/, @$lines ) ];
}

Put the file, SVN/Notify/Filter/Textile.pm somewhere in a Perl library directory. Then use the new --filter option to svnnotify to put it to work:

svnnotify -p "$1" -r "$2" --handler HTML::ColorDiff --filter Textile

Yep, that’s it! SVN::Notify will find the filter module, load it, register its filtering subroutine, and then call it at the appropriate time. Of course, there are a lot of things you can filter; consult the complete documentation for all of the details. But hopefully this gives you a flavor for how easy it is to write new filters for SVN::Notify. I’m hoping that all those folks who want featurs can now stop bugging me and writing their own filters to do the job, and uploading them to CPAN for all to share!

To get things started, I scratched my own itch, writing a Trac filter myself. The filter is almost as simple as the Textile example above, but I also spent quite a bit of time tweaking the CSS so that most of the Trac-generated HTML looks good. You can see an example right here. Thanks to a number of bug fixes in Text::Trac, as well as Trac-specific CSS added via a filter on CSS output, it works beautifully. If I’m feeling motivated in the next week or so, I’ll create a separate CPAN distribution with just a Markdown filter and upload it. That will create a nice distriution example for folks to copy to creat their own. Or maybe someone on the Lazy Web Will do it for me! Maybe you?

I wish I’d thought to do this from the beginning; it would have saved me from having to add so many features/cruft to SVN::Notify over the years. Here’s a quick list of the features that likely could have been implemented via filters instead of added to the core:

  • --user-domain: Combine the SVN username with a domain for the From header.
  • --add-header: Add a header to the message.
  • --reply-to: Add a specific header to the message.
  • SVN::Notify::HTML::ColorDiff: Frankly, looking back on it, I don’t know why I didn’t just put this support right into SVN::Notify::HTML. But even if I hadn’t, it could have been implemented via filters.
  • --subject-prefix:: Modify the message subject.
  • --subject-cx: Add the commit context to the subject.
  • --strip-cx-regex: More subject context modification.
  • --no-first-line: Another subject filter.
  • --max-sub-length: Yet another!
  • --max-diff-length: A filter could truncate the diff, although this might be tricky with the HTML formatting.
  • --author-url: Modify the metadata section to add a link to the author URL.
  • --revision-url: Ditto for the revision URL.
  • --ticket-map: Filter the log message for various ticketing system strings to convert to URLs. This also encompasses the old --rt-url, --bugzilla-url, --gnats-url, and --jira-url options.
  • --header: Filter the beginning of the message.
  • --footer: Filter the end of the message.
  • --linkize: Filter the log message to convert URLs to links for HTML messages.
  • --css-url: Filter the CSS to modify it, or filter the start of the HTML to add a link to an external CSS URL.
  • --wrap-log: Reformat the log message for HTML.

Yes, really! That’s about half the functionality right there. I’m glad that I won’t have to add any more like that; filters are a much better way to go.

So download it, install it, write some filters, get your multibyte characters output properly, and enjoy! And as usual, send me your bug reports, but implement your own improvements using filters!

Using sudo to Install the Postgres Gem on Leopard

Been getting this error with the latest postgres gem?

% sudo gem install postgres
Bulk updating Gem source index for: http://gems.rubyforge.org
Building native extensions.  This could take a while...
ERROR:  While executing gem ... (Gem::Installer::ExtensionBuildError)
   ERROR: Failed to build gem native extension.

ruby extconf.rb install postgres
checking for main() in -lpq... yes
checking for libpq-fe.h... yes
checking for libpq/libpq-fs.h... yes
checking for PQsetClientEncoding()... no
checking for pg_encoding_to_char()... no
checking for PQfreemem()... no
checking for PQserverVersion()... no
checking for PQescapeString()... no
creating Makefile

I have, too. I’ve known about the fix for a while, thanks to a post from maintainer Jeff Davis from last month. But I was unable to get it to work. But then I found this gem of a comment (pun not intended) from Gluttonous:

FYI, this does NOT work with sudo since sudo strips the env var out. You must ‘sudo -s’ or ‘sudo su’ and run the command straight up.

D’oh! I’ve been doing this all this time:

ARCHFLAGS='-arch i386' sudo gem install postgres

And getting the same failures. But this works beautifully:

sudo env ARCHFLAGS='-arch i386' gem install postgres

And away we go!

Ruby Time Object Time Zone Bug

This is disappointing.

To summarize, Ruby’s Time class has a bug in its zone method. The example is simple:

tz = ENV['TZ']
ENV['TZ'] = 'Africa/Luanda'
t = Time.now
puts t.zone
ENV['TZ'] = 'Australia/Lord_Howe'
puts t.zone

This outputs:

WAT
WAT

So far so good. But look what happens when I add a single line to the program, foo = t.to_s:

tz = ENV['TZ']
ENV['TZ'] = 'Africa/Luanda'
t = Time.now
puts t.zone
ENV['TZ'] = 'Australia/Lord_Howe'
foo = t.to_s
puts t.zone

The result changes:

WAT
LHST

This is clearly wrong. Changing the $TZ environment variable and stringifying the object should not change the underlying value of any of the object’s attributes. The Time object should remember the value of the time zone when it is initialized, and should never change.

Unfortunately, The Ruby core developers (or at least the owner of the bug report) feel that, since Time relies on the system C API, and since time zones are a PITA, it’s not worth it to change this behavior. The downside, however, is that you cannot rely on Time zones to ever be correct unless you’re very, very careful.

Personally, in my subclass of Time, I took care of stashing the time zone at object instantiation as a workaround for this bug. It seemed reasonable to me, and I was just surprised that the idea was rejected by the Ruby developers.

What do you think?

Array to Hash One-Liner

Programming in Ruby, I’ve badly missed Perl’s list syntax, which, among other things, makes converting between arrays and hashes really easy. In Ruby I have forever been converting an array to a hash like this:

a = [ 1, 2, 3, 4, 5 ]
h = {}
a.each { |v| h[v] = v }

Of course, this is anything bug concise. In Perl, I can just do this:

my @a = (1, 2, 3, 4, 5, 6);
my %h = map { $_ => $_ } @a;

Easy, huh? Well, I finally got fed up with the nasty hack in Ruby, did a little Googling, and figured out a way to do it in a single line:

a = [ 1, 2, 3, 4, 5, 6 ]
h = Hash[ *a.collect { |v| [ v, v ] }.flatten ]

Not quite as consice as the Perl version, and I have to construct a bunch of arrays that I then throw away with the call to flatten, but at least it’s concise and, I think, clearer what it’s doing. So I think I’ll go with that.

How to Use Regex Named Captures in Perl 5

I ran some Perl 5 regular expression syntax that I’d never seen the other day. It used two features I’d never seen before:

  • (?{ }), a zero-width, non-capturing assertion that executes arbitrary Perl code.
  • $^N, a variable for getting the contents of the most recent capture in a regular expresion.

The cool thing is that, used in combination, these two features can be used to hack named captures into Perl regular expressions. Here’s an example:

use warnings;
use strict;
use Data::Dumper;

my $string = 'The quick brown fox jumps over the lazy dog';

my %found;

my @captures = $string =~ /
    (?: (quick|slow) \s+    (?{ $found{speed}  = $^N  }) )
    (?: (brown|blue) \s+    (?{ $found{color}  = $^N  }) )
    (?: (sloth|fox)  \s+    (?{ $found{animal} = $^N  }) )
    (?: (eats|jumps)        (?{ $found{action} = $^N  }) )
/xms;

print Dumper \@captures;
print Dumper \%found;

The output of running this program is:

$VAR1 = [
          'quick',
          'brown',
          'fox',
          'jumps'
        ];
$VAR1 = {
          'color' => 'brown',
          'speed' => 'quick',
          'action' => 'jumps',
          'animal' => 'fox'
        };

So the positional captures are still returned, and we’ve assigned them to keys in a hash. This can be very convenient for complex regular expressions.

This is a cool feature, but there are a few caveats. First, according to the Perl regular expression documentation, (?{ }) is a highly experimental feature that could go away at any time. But more importantly, if you’re relying on this feature you should be aware of the side effects. What I mean by that is that, if a regular expression match fails, but there are some successful matches during execution, then the code in the (?{ }) assertions could still execute. For example, if you changed the word jumps to poops in the above example, the output becomes:

$VAR1 = [];
$VAR1 = {
          'color' => 'brown',
          'speed' => 'quick',
          'animal' => 'fox'
        };

Which means that the match failed, but there were still assignments to our hash, because some of the captures succeeded before the overall match failed. The upshot is that you should always check the return value from the match before relying on whatever the code inside the (?{ }) assertions did.

The problem becomes even more subtle if your regular expressions trigger backgracking. In that case, you might have an optional group match and its value assigned to the hash, and then the next required group fail. Perl will then backtrack to throw out the successfull group match and then see if the next required match succeeds. If so, you can have a successful match and potentially invalid data in your hash. Here’s an example:

my @captures = $string =~ /
    (?: (quick|slow) \s+    (?{ $found{speed}  = $^N  }) )
    (?: (brown|blue) \s+    (?{ $found{color}  = $^N  }) )?
    (?: (brown\s+fox)       (?{ $found{animal} = $^N  }) )
/xms;

print Dumper \@captures;
print Dumper \%found;

And the output is:

$VAR1 = [
          'quick',
          undef,
          'brown fox'
        ];
$VAR1 = {
          'color' => 'brown',
          'speed' => 'quick',
          'animal' => 'brown fox'
        };

So while the second group returned undef for the color capture, the %foundhash still had the color key in it. This may or may not be what you want.

Of course, all this seems cool, but since it’s a truly evil hack, you have to be careful. If you can wait, though, perhaps we’ll see named captures in Perl 5.10.

Powered by KinoSearch