Just a Theory

Trans rights are human rights

Posts about HTML

Encoding is a Headache

I have to spend way too much of my programming time worrying about character encodings. Take my latest module, Text::Markup, for example. The purpose of the module is very simple: give it the name of a file, and it will figure out the markup it uses (HTML, Markdown, Textile, whatever) and return a string containing the HTML generated from the file. Simple, right?

But, hang on. Should the HTML it returns be decoded to Perl’s internal form? I’m thinking not, because the HTML itself might declare the encoding, either in an XML declaration or via something like

<meta http-equiv="Content-type" content="text/html;charset=Big5" />

And as you can see, that’s not UTF-8. If the HTML were decoded, the declaration would be lying. So it should be encoded, right? Parsers like XML::LibXML::Parser are smart enough to see such declarations and decode as appropriate.

But wait a minute! Some markup languages, like Markdown, don’t have XML declarations or headers. They’re HTML fragments. So there’s no way to tell the encoding of the resulting HTML unless it’s decoded. So maybe it should be decoded. Or perhaps it should be decoded, then given an XML declaration that declares the encoding as UTF-8, and encoded as UTF-8 before returning it.
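To make that last option concrete, here is a minimal sketch using only the core Encode module. The helper name fragment_to_utf8_doc is made up for illustration; it is not part of Text::Markup, and whether prepending a declaration is the right policy is exactly the open question.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not a Text::Markup API): take an HTML fragment that
# has already been decoded to Perl characters, prepend an XML declaration
# naming UTF-8, and return the whole thing as UTF-8 encoded bytes.
sub fragment_to_utf8_doc {
    my ($html) = @_;
    my $doc = qq{<?xml version="1.0" encoding="UTF-8"?>\n$html};
    return encode('UTF-8', $doc);
}

my $bytes = fragment_to_utf8_doc("<p>caf\x{e9}</p>");
```

The point of the sketch is that the declaration and the actual bytes now agree, so a parser reading the result has no reason to guess.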

But, hold the phone! When reading in a markup file, should it be decoded before it’s passed to the parser? Does Text::Markdown know or care about encodings? And if it should be decoded, what encoding should one assume the source file uses? Unless it uses a BOM, how do you know what its encoding is?
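For the BOM case, at least, the check is mechanical. The following is a rough sketch with the core Encode module; bom_encoding and decode_markup are hypothetical names, and the UTF-8 fallback is an assumption rather than anything Text::Markup is known to do.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: guess the encoding from a leading byte-order mark.
# Returns (encoding name, BOM length in bytes), or an empty list if no BOM.
sub bom_encoding {
    my ($bytes) = @_;
    # Test the 4-byte UTF-32 marks first: the UTF-32LE mark begins with
    # the same two bytes as the UTF-16LE mark.
    return ('UTF-32LE', 4) if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return ('UTF-32BE', 4) if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return ('UTF-8',    3) if $bytes =~ /\A\xEF\xBB\xBF/;
    return ('UTF-16LE', 2) if $bytes =~ /\A\xFF\xFE/;
    return ('UTF-16BE', 2) if $bytes =~ /\A\xFE\xFF/;
    return;
}

# Hypothetical helper: decode raw file bytes to Perl's internal form,
# using the BOM if present and an assumed default (UTF-8) otherwise.
sub decode_markup {
    my ($bytes, $default) = @_;
    my ($enc, $len) = bom_encoding($bytes);
    if ($enc) {
        substr($bytes, 0, $len, '');    # strip the BOM itself
    }
    else {
        $enc = $default || 'UTF-8';     # assumption: no BOM means UTF-8
    }
    # Die on malformed input instead of silently substituting characters.
    return decode($enc, $bytes, Encode::FB_CROAK);
}
```

A caller that knows better can pass its own default, as in decode_markup($bytes, 'ISO-8859-1'). That still leaves the hard part unanswered: without a BOM or a declaration, the default is a guess.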

Text::Markup is a dead simple idea, but virtually all of my time is going into thinking about this stuff. It drives me nuts. When will the world cease to be this way?

Oh, and if you have answers to any of these questions, please do feel free to leave a comment. I hate having to spend so much time on this, but I’d much rather do so and get things right (or close to right) than wrong.

Looking for the comments? Try the old layout.

Use Rubyish Blocks with Test::XPath

Thanks to the slick Devel::Declare-powered PerlX::MethodCallWithBlock created by gugod, the latest version of Test::XPath supports Ruby-style blocks. The Ruby version of assert_select, as I mentioned previously, looks like this:

assert_select("ol") { |elements|
  elements.each { |element|
    assert_select element, "li", 4
  }
}

I’ve switched to the brace syntax for greater parity with Perl. Test::XPath, meanwhile, looks like this:

my @css = qw(foo.css bar.css);
$tx->ok( '/html/head/style', sub {
    my $css = shift @css;
    shift->is( './@src', $css, "Style src should be $css");
}, 'Should have style' );

But as of Test::XPath 0.13, you can now just use PerlX::MethodCallWithBlock to pass blocks in the Rubyish way:

use PerlX::MethodCallWithBlock;
my @css = qw(foo.css bar.css);
$tx->ok( '/html/head/style', 'Should have style' ) {
    my $css = shift @css;
    shift->is( './@src', $css, "Style src should be $css");
};

Pretty slick, eh? It required a single-line change to the source code. I’m really happy with this sugar. Thanks for the great hack, gugod!


Test XML and HTML with XPath

When I was hacking Rails projects back in 2006-2007, there was a lot of stuff about Rails that drove me absolutely batshit (<cough>ActiveRecord</cough>), but there were also a (very) few things that I really liked. One of those things was the assert_select test method. There was a bunch of magic involved in sending a request to your Rails app and stuffing the body someplace hidden (hrm, that sounds kind of evil; intentional?), but then you could call assert_select to use CSS selectors to test the structure and content of the document (assuming, of course, that it was HTML or XML). For example, (to borrow from the Rails docs), if you wanted to test that a response contains two ordered lists, each with four list elements then you’d do something like this:

assert_select "ol" do |elements|
    elements.each do |element|
        assert_select element, "li", 4
    end
end

What it does is select all of the <ol> elements and pass them to the do block, where you can call assert_select on each of them. Nice, huh? You can also implicitly call assert_select on the entire array of passed elements, like so:

assert_select "ol" do
    assert_select "li", 8
end

Slick, right? I’ve always wanted to have something like this in Perl, but until last week, I didn’t really have an immediate need for it. But I’ve started on a Catalyst project with my partners at PGX, and of course I’m using a view to generate XHTML output. So I started asking around for advice on proper unit testing for Catalyst views. The answer I got was, basically, Test::WWW::Mechanize::Catalyst. But I found it insufficient:

$mech->get_ok("/");
$mech->html_lint_ok( "HTML should be valid" );
$mech->title_is( "Root", "On the root page" );
$mech->content_contains( "This is the root page", "Correct content" );

Okay, I can check the title of the document directly, which is kind of cool, but there’s no other way to examine the structure? Really? And to check the content, there’s just content_contains(), which concatenates all of the content without any tags! This is useful for certain very simple tests, but if you want to make sure that your document is properly structured, and the content is in all the right places, you’re SOL.

Furthermore, the html_lint_ok() method didn’t like the Unicode characters output by my view:

#   Failed test 'HTML should be valid (http://localhost/)'
#   at t/view_TD.t line 30.
# HTML::Lint errors for http://localhost/
#  (4:3) Invalid character \x2019 should be written as &rsquo;
#  (18:5) Invalid character \xA9 should be written as &copy;
# 2 errors on the page

Of course, those characters aren’t invalid, they’re perfectly good UTF-8 characters. In some worlds, I suppose, they should be wrong, but I actually want them in my document.

So I switched to Test::XML, which uses a proper XML parser to validate a document:

ok my $res = request("http://localhost:3000/"), "Request home page";
ok $res->is_success, "Request should have succeeded";

is_well_formed_xml $res->content, "The HTML should be well-formed";

Cool. Now that I know my XHTML document is well-formed, it’s time to start examining the content and structure in more detail. Thinking fondly on assert_select, I went looking for a test module that uses XPath to test an XML document, and found Test::XML::XPath right in the Test::XML distribution, which looked to be just what I wanted. So I added it to my test script and added this line to test the content of the <title> tag:

is_xpath $res->content, "/html/head/title", "Welcome!";

I ran the test…and waited. It took around 20 seconds for that test to run, and then it failed!

#   Failed test at t/view_TD.t line 25.
#          got: ''
#     expected: 'Welcome!'
#   evaluating: /html/head/title
#      against: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
# <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
#  <head>
#   <title>Welcome!</title>
#  </head>
# </html>

No doubt the alert among my readership will spot the problem right away, but I was at a loss. Fortunately, Ovid was over for dinner last week, and he pointed out that it was due to the namespace. That is, the xmlns attribute of the <html> element requires that one register a namespace prefix to use in the XPath expression. He pointed me to his fork of Test::XML::XPath, called Test::XHTML::XPath, in his Escape project. It mostly duplicates Test::XML::XPath, but contains this crucial line of code:

$xpc->registerNs( x => "http://www.w3.org/1999/xhtml" );

By registering the prefix “x” for the XHTML namespace, he’s able to write tests like this:

is_xpath $res->content, "/x:html/x:head/x:title", "Welcome!";

And that works. It seems that the XPath spec requires that one use prefixes when referring to elements within a namespace. Test::XML::XPath, alas, provides no way to register a namespace prefix.
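Here is a small standalone sketch of the same effect, using XML::LibXML directly rather than any of the test modules. The document is a trimmed copy of the one above with the DOCTYPE left out so nothing gets fetched, and the variable names are mine.

```perl
use strict;
use warnings;
use XML::LibXML;

# A trimmed copy of the XHTML from the failing test, minus the DOCTYPE.
my $xhtml = <<'XML';
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
 <head>
  <title>Welcome!</title>
 </head>
</html>
XML

my $doc = XML::LibXML->load_xml( string => $xhtml );
my $xpc = XML::LibXML::XPathContext->new($doc);

# An unprefixed path names elements in no namespace, so it selects nothing
# here and findvalue() returns an empty string.
my $unprefixed = $xpc->findvalue('/html/head/title');

# Bind the (arbitrary) prefix "x" to the XHTML namespace URI, then use it.
$xpc->registerNs( x => 'http://www.w3.org/1999/xhtml' );
my $prefixed = $xpc->findvalue('/x:html/x:head/x:title');

print "unprefixed: '$unprefixed'\n";
print "prefixed:   '$prefixed'\n";
```

The unprefixed lookup comes back empty, which matches the got: '' in the failure output, while the prefixed one finds the title.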

Perhaps worse is the performance problem. I discovered that if I stripped out the DOCTYPE declaration from the XHTML before I passed it to is_xpath, the test was lightning fast. Here the issue is that XML::LibXML, used by Test::XML::XPath, is fetching the DTD from the w3.org Web site as the test runs. I can disable this by setting the no_network and recover_silently XML::LibXML options, but, again, Test::XML::XPath provides no way to do so.

Add to that the fact that Test::XML::XPath has no interface for recursive testing like assert_select and I was ready to write my own module. One could perhaps update Test::XML::XPath to be more flexible, but for the fact that it falls back on XML::XPath when it can’t find XML::LibXML, and XML::XPath, alas, behaves differently than XML::LibXML (it didn’t choke on my lack of a namespace prefix, for example). So if you ship an application that uses Test::XML::XPath, tests might fail on other systems where it would use a different XPath module than you used.

And so I have written a new test module.

Introducing Test::XPath, your Perl module for flexibly running XPath-powered tests on the content and structure of your XML and HTML documents. With this new module, the test for my Catalyst application becomes:

my $tx = Test::XPath->new( xml => $res->content, is_html => 1 );
$tx->is("/html/head/title", "Welcome!", "Title should be correct" );

Notice how I didn’t need a namespace prefix there? That’s because the is_html parameter coaxes XML::LibXML into using its HTML parser instead of its XML parser. One of the side-effects of doing so is that the namespace appears to be assumed, so I can ignore it in my tests. The HTML parser doesn’t bother to fetch the DTD, either. For tests where you really need namespaces, you’d do this:

my $tx = Test::XPath->new(
    xml     => $res->content,
    xmlns   => { x => "http://www.w3.org/1999/xhtml" },
    options => { no_network => 1, recover_silently => 1 },
);
$tx->is("/x:html/x:head/x:title", "Welcome!", "Title should be correct" );

Yep, you can specify XML namespace prefixes via the xmlns parameter, and pass options to XML::LibXML via the options parameter. Here I’ve shut off the network, so that XML::LibXML prevents network access, and told it to recover silently when it tries to fetch the DTD, but fails (because, you know, it can’t access the network). Not bad, eh?

Of course, the module provides the usual array of Test::More-like test methods, including ok(), is(), like() and cmp_ok(). They all work just like in Test::More, except that the first argument must be an XPath expression. Some examples borrowed from the documentation:

$tx->ok( '//foo/bar', 'Should have bar element under foo element' );
$tx->ok( 'contains(//title, "Welcome")', 'Title should "Welcome"' );

$tx->is( '/html/head/title', 'Welcome', 'Title should be welcoming' );
$tx->isnt( '/html/head/link/@type', 'hello', 'Link type should not' );

$tx->like( '/html/head/title', qr/^Foobar Inc.: .+/, 'Title context' );
$tx->unlike( '/html/head/title', qr/Error/, 'Should be no error in title' );

$tx->cmp_ok( '/html/head/title', 'eq', 'Welcome' );
$tx->cmp_ok( '//story[1]/@id', '==', 1 );

But the real gem is the recursive testing feature of the ok() test method. By passing a code reference as the second argument, you can descend into various parts of your XML or HTML document to test things more deeply. ok() will pass if the XPath expression argument selects one or more nodes, and then it will call the code reference for each of those nodes, passing the Test::XPath object as the first argument. This is a bit different than assert_select, but I view the reduced magic as a good thing.

For example, if you wanted to test for the presence of <story> elements in your document, and to test that each such element had an incremented id attribute, you’d do something like this:

my $i = 0;
$tx->ok( '//assets/story', sub {
    shift->is('./@id', ++$i, "ID should be $i in story $i");
}, 'Should have story elements' );

For convenience, the Test::XPath object is also assigned to $_ for the duration of the call to the code reference. Either way, you can call ok() and pass code references anywhere in the hierarchy. For example, to ensure that an Atom feed has entries and that each entry has a title, a link, and a very specific author element with name, uri, and email subnodes, you can do this:

$tx->ok( '/feed/entry', sub {
    $_->ok( './title', 'Should have a title' );
    $_->ok( './author', sub {
        $_->is( './name',  'Larry Wall',       'Larry should be author' );
        $_->is( './uri',   'http://wall.org/', 'URI should be correct' );
        $_->is( './email', 'perl@example.com', 'Email should be right' );
    }, 'Should have author elements' );
}, 'Should have entry elements' );

There are a lot of core XPath functions you can use, too. For example, I’m going to write a test for every page returned by my application to make sure that I have the proper numbers of various tags:

$tx->is('count(/html)',      1, 'Should have 1 html element' );
$tx->is('count(/html/head)', 1, 'Should have 1 head element' );
$tx->is('count(/html/body)', 1, 'Should have 1 body element' );

I’m going to use this module to the hilt in all my tests for HTML and XML documents from here on in. The only thing I’m missing from assert_select is that it supports CSS 2 selectors, rather than XPath expressions, and the implementation offers quite a few other features including regular expression operators for matching attributes, pseudo-classes, and other fun stuff. Still, XPath gets me all that I need; the rest is just sugar, really. And with the ability to define custom XPath functions in Perl, I can live without the extra sugar.

Maybe you’ll find it useful, too.


Doomed To Reinvent

There’s an old saying, “Whoever doesn’t understand X is doomed to reinvent it.” X can stand for any number of things. The other day, I was pointing out that such is the case for ORM developers. Take ActiveRecord, for example. As I demonstrated in a 2007 presentation, because ActiveRecord doesn’t support simple things like aggregates or querying against functions or changing how objects are identified, you have to fall back on using its find_by_sql() method to actually run the SQL, or using fuck typing to force ActiveRecord to do what you want. There are only two ways to get around this: Abandon the ORM and just use SQL, or keep improving the ORM until it has, in effect, reinvented SQL. Which would you choose?

I was thinking about this as I was hacking on a Drupal installation for a client. The design spec called for the comment form to be styled in a very specific way, with image submit buttons. Drupal has this baroque interface for building forms: essentially an array of arrays. Each element of the array is a form element, unless it’s markup. Or something. I can’t really make heads or tails of it. What’s important is that there are a limited number of form elements you can create, and as of Drupal 5, image isn’t fucking one of them!

Now, as a software developer, I can understand this. I sometimes overlook a feature when implementing some code. But the trouble is: why have some bizarre data structure to represent a subset of HTML when you have something that already works: it’s called HTML. Drupal, it seems, is doomed to reinvent HTML.

So just as I have often had to use find_by_sql() as the fallback to get ActiveRecord to fetch the data I want, as opposed to what it thinks I want, I had to fall back on the Drupal form data structure’s ability to accept embedded HTML like so:

$form['submit_stuff'] = array(
  '#weight' => 20,
  '#type'   => 'markup',
  '#value'  => '<div class="form-submits">'
              . '<label></label><p class="message">(Maximum 3000 characters)</p>'
              . '<div class="btns">'
              . '<input type="image" value="Preview comment" name="op" src="preview.png" />'
              . '<img width="1" height="23" src="divider.png" />'
              . '<input type="image" value="Post comment" name="op" src="post.png" />'
              . '</div></div>',
);

Dear god, why? I understand that you can create images using an array in Drupal 6, but I fail to understand why it was ever a problem. Just give me a templating environment where I can write the fucking HTML myself. Actually, Drupal already has one: it’s called PHP! Please don’t make me deal with this weird hierarchy of arrays; it’s just a bad reimplementation of a subset of HTML.

I expect that there actually is some way to get what I want, even in Drupal 5, as I’m doing some templating for comments and pages and whatnot. But that should be the default IMHO. The weird combining of code and markup into this hydra-headed data structure (and don’t even get me started on the need for the #weight key to get things where I want them) is just so unnecessary.

In short, if it ain’t broke, don’t reinvent it!

</rant>


Embed HTML on Your Site

If you’re a regular visitor to my blog (and who could blame you?), you likely have noticed a few changes recently. In addition to adding the sociable links a couple days ago, I’ve also been adding bits of embedded JavaScript in the right column displaying my three most recent Tweets and my three most recent Delicious bookmarks. These work reasonably well: I just embed <script> tags with the appropriate stuff, then style the HTML that they deliver.

Tonight I was talking to Skud about embedding like this. It turns out that when some folks viewed a blog entry from her site in RSS readers and the like, they sometimes just saw a big blank area where there was supposed to be a list of books. She was looking for examples of sites that provided HTML snippets that people could cut-n-paste into their blog entries, so that they can avoid this problem, or use it in places that disallow JavaScript embedding, such as LiveJournal. I had no examples for her, but it suddenly occurred to me: Why not embed a link to an HTML URL that serves a snippet of HTML, rather than a bit of JavaScript that uses the document object to write HTML?

A quick Googling and I found a great article about the <object> element. It was intended as a general replacement for the <img> and <applet> elements, although that really hasn’t happened. But what you can do is embed HTML with it. Here’s a quick example:

If you can see this, then the <object> tag doesn't work in your browser. :-(

Hopefully you can see the embedded HTML above. I’ve styled it with a light blue background and dark blue dotted border, so it stands out. That styling is in the <object> tag, BTW, not in the HTML loaded from the snippet. I’m sure I could figure out how to add <param> tags that would tell it to include various styles, too, since it appears that CSS I have in this page has no effect on the content of the object (I have some CSS to make the <code> tag have a green background, but for me at least, it has no effect).

So why isn’t this more common? It seems to work well in a lot of browsers. Would you use it? What are the downsides?


SVN::Notify 2.41 Adds Plain Text Issue Tracking Links

I expect that this will be my last release of SVN::Notify for a while. I’ve already spent more time on it than I had anticipated. But anyway, this is a pretty solid release. It doesn’t change the API or anything, but I feel that the jump from 2.30 to 2.40 is justified because of the sheer number of changes. From now on, I expect that it will mostly be maintenance, like 2.41, which fixes a minor formatting bug. Grab it now from CPAN.

First, I’ve added a new, complex example of the SVN::Notify::HTML::ColorDiff output that I will keep up-to-date with all future changes. This will allow people to get a better idea of what it’s capable of than my previous contrived examples allowed.

The biggest change is that I’ve moved the Request Tracker, Bugzilla, and JIRA support from SVN::Notify::HTML to SVN::Notify. I realized, after the release of 2.30, that it might be cool to add links to the text-only email message generated by SVN::Notify, too. So I’ve done that, including for ViewCVS links. Unlike in SVN::Notify::HTML, the links won’t be inline in the message (that doesn’t work too well in plain text, IMO), but will come in their own sections after the message. So you’ll get something like this (extreme example):

Log Message:
-----------
Let's try a few links to other applications. First, we have
A Bugzilla Bug # 709. Then we have a JIRA key, TST-1608. And
finally, we have an RT link to Ticket # 4321.

Hey, we could add one to ViewCVS for a Subversion Revision
#606, too!

ViewCVS Links:
-------------
    http://viewsvn.bricolage.cc/?rev=606&view=rev

Bugzilla Links:
--------------
    http://bugzilla.mozilla.org/show_bug.cgi?id=709

RT Links:
--------
    http://rt.cpan.org/NoAuth/Bugs.html?id=4321

JIRA Links:
----------
    http://jira.atlassian.com/secure/ViewIssue.jspa?key=TST-1608

The nice thing is that, for many mail clients, these will be turned into clickable links. You’ll also notice that the text that creates the ViewCVS link is split over two lines. This is new in this release, and works for SVN::Notify::HTML, too. I made a few other tweaks to the regular expressions, as well. Here’s a complete list of changes:

  • Fixed accessor generation so that accessors created for the attributes passed to register_attributes() by a subclass are created in the subclass’ package instead of in SVN::Notify.
  • Changed parsing for JIRA keys to use any set of capital letters followed by a dash and then a number, rather than the literal string “JIRA-” followed by a number. Reported by Garrett Rooney.
  • Modified the regular expression patterns for the RT, Bugzilla, and ViewCVS links to properly match on word boundaries, so that strings like “humbug 12” don’t match.
  • Modified the ViewCVS link regular expression pattern so that it matches strings like “rev 12” as well as “revision 12”.
  • Modified the RT link regular expression pattern so that it matches strings like “RT-Ticket: 23” as well as “Ticket 1234”. Suggested by Jesse Vincent.
  • Added complicated example to try to show off all of the major features. I will keep this up-to-date going forward in order to post sample output on the Web.
  • Fixed the parsing of log messages so that empty lines are no longer eliminated.
  • HTML::ColorDiff now properly handles the listing of binary files in the diff, marking them with a new class, “binary”, and using the same CSS as is used for the “propset” class.
  • In HTML::ColorDiff, fixed CSS for the “delfile” class to properly wrap it in a border like the other files in the diff.
  • Added labels to the HTML::ColorDiff diff file sections to indicate the type of change (“Modified”, “Added”, “Deleted”, or “Property changes”).
  • Moved the rt_url, bugzilla_url, and jira_url parameters from SVN::Notify::HTML to SVN::Notify, where they are used to add URLs to the text version of log messages.
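To give a feel for the matching behavior described in the list above, here is an illustrative sketch. These patterns are my own approximations written for this example, not the expressions SVN::Notify actually ships, and find_refs is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: pull issue and revision references out of a log
# message. The patterns are illustrative approximations only.
sub find_refs {
    my ($msg) = @_;
    return {
        # Any run of capital letters, a dash, then a number: TST-1608
        jira     => [ $msg =~ /\b([A-Z]+-\d+)\b/g ],
        # "Ticket 1234", "Ticket # 4321", or "RT-Ticket: 23"
        rt       => [ $msg =~ /\b(?:RT-)?Ticket:?\s*#?\s*(\d+)\b/gi ],
        # "Bug # 709"; the leading \b keeps "humbug 12" from matching
        bugzilla => [ $msg =~ /\bbug\s*#?\s*(\d+)\b/gi ],
        # "rev 12" or "revision 12"; \s* also spans a line break
        viewcvs  => [ $msg =~ /\brev(?:ision)?\s*#?\s*(\d+)\b/gi ],
    };
}

my $refs = find_refs(<<'MSG');
A Bugzilla Bug # 709. Then we have a JIRA key, TST-1608. And
finally, we have an RT link to Ticket # 4321.

Hey, we could add one to ViewCVS for a Subversion Revision
#606, too!
MSG

print "$_: @{ $refs->{$_} }\n" for sort keys %$refs;
```

Because whitespace in the patterns is matched with \s*, the “Revision” and “#606” split across two lines in the sample message is still picked up as a single reference.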

Enjoy!


SVN::Notify 2.20 Adds Colorized Diffs

After getting prodded by Erik Hatcher, I went ahead and added another subclass to SVN::Notify. This one adds a pretty colorized diff to the message, instead of just the plain text one. See an example here. I’ve also added links from the lists of affected files into the diffs in the HTML and new HTML::ColorDiff layouts.

Enjoy!

Update: And now I’ve released SVN::Notify 2.21 with a few minor fixes, including XHTML 1.1 compliance.
