Lessons Learned with Perl and UTF-8

I learned quite a lot last week as I was making Bricolage much more Unicode-aware. Bricolage has always managed Unicode content and stored it in a PostgreSQL Unicode-encoded database. And by Unicode I of course mean UTF-8. By far the biggest nightmare was figuring out the bug with Apache::Util::escape_html(), but ultimately it came down to an interesting lesson.

Why was I making Bricolage Unicode-aware? Well, it all started with a bug report from Kang-min Liu (a.k.a. Gugod). I had naïvely thought that if strings were Unicode that Perl would know it and do the right thing. It turns out I was wrong. Perl assumes that everything is binary unless you tell it otherwise. This means that Perl operators such as length and substr will count bytes instead of characters. And in the case of Unicode, where characters can be multiple bytes, this can cause serious problems. Not only were strings improperly concatenated mid-character for Gugod, but PostgreSQL could refuse to accept such strings, since a chopped-up multibyte character isn't valid Unicode!
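To make the byte-versus-character problem concrete, here is a minimal sketch using only the core Encode module. The variable names are mine, chosen for illustration; the byte string is the UTF-8 encoding of the two characters U+4E2D and U+6587.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for the two characters U+4E2D U+6587 (three bytes each).
my $bytes = "\xE4\xB8\xAD\xE6\x96\x87";
my $chars = decode('UTF-8', $bytes);

my $byte_len = length $bytes;   # counts bytes, because Perl sees binary data
my $char_len = length $chars;   # counts characters, because the string is decoded

# Cutting the undecoded string at 4 bytes splits the second character;
# cutting the decoded string at 1 character does not.
my $bad_cut  = substr $bytes, 0, 4;
my $good_cut = substr $chars, 0, 1;

printf "bytes: %d, characters: %d\n", $byte_len, $char_len;
```

The four bytes left in $bad_cut are the sort of chopped-up sequence a UTF-8 database would be right to refuse.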

So I had to make a decision: Either stop using Perl operators that count bytes, or let Perl know that all the strings that Bricolage deals with are Unicode strings. The former wasn't really an option, of course, since users can specify that certain content fields be a certain length of characters. So with a lot of testing help from Gugod and his Bricolage install full of multibyte characters, I set about doing the latter. The result is in the recently released Bricolage 1.8.2 and I'm blogging what I learned for both your reference and mine.

Perl considers its internal representation of strings to be UTF-8 strings, and it knows what variables contain valid UTF-8 strings because they have a special flag set on them, called, strangely enough, utf8. This flag isn't set by default, but can be set in a number of ways. The ways I've found so far are:

  • Using Encode::decode() to decode a string from binary to Perl's internal representation. The use of the word decode here had confused me for a while, because I thought it was a special encoding. But the truth is that it's not. Strings can have any number of encodings, such as ISO-8859-1, GB2312, EUC-KR, UTF-8, and the like. But when you decode a string, you're telling Perl that it's not any of those encodings, but Perl's own representation. I was confused because Perl's internal representation is UTF-8, which is an encoding. But really it's not UTF-8, it's utf8, which isn't an encoding, but Perl's own thing.

  • Cheat: Use Encode::_utf8_on(). This private function is nevertheless documented by the Encode module, and therefore usable. What it does is simply turn on the utf8 flag on a variable. You need to be confident that the variable contains only valid UTF-8 characters, but if it does, then you should be pretty safe.

  • Using the three-argument version of open, such as

    open my $fh, "<:utf8", "/foo/bar"
      or die "Cannot open file: $!\n";

    Now when you read lines from this file, they will automatically be decoded to utf8.

  • Using binmode to set the mode on a file handle:

    binmode $fh, ":utf8";

    As with the three-argument version of open, this forces Perl to decode the strings read from the file handle.

  • use utf8;. This Perl pragma indicates that the source code within its scope is UTF-8, and therefore that string literals in it should be decoded to utf8.
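As a rough illustration of the first two approaches in the list above, the following sketch decodes a byte string, inspects the flag with the core utf8::is_utf8() function, and then does the same thing the "cheat" way. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes for U+00E9, as they might arrive from a file or a socket.
my $octets = "\xC3\xA9";

# Before decoding: a two-byte binary string with the flag off.
my $flag_before = utf8::is_utf8($octets) ? 1 : 0;

# decode() interprets the bytes as UTF-8 and returns a character string.
my $text       = decode('UTF-8', $octets);
my $flag_after = utf8::is_utf8($text) ? 1 : 0;

# The cheat: flip the flag on a copy of the bytes without any checking.
my $cheat = $octets;
Encode::_utf8_on($cheat);

# encode() goes the other way, for when the text is written back out.
my $round_trip = encode('UTF-8', $text);
```

With valid input both routes end up with the same one-character string. The difference only shows when the bytes are not valid UTF-8, since the cheat does not look at them at all.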

So I started applying these approaches in various places. The first thing I did was to set the utf8 flag on data coming from the browser with Encode::_utf8_on(). Shitty browsers can of course send shitty data, but I'm deciding, for the moment at least, to trust browsers to send only UTF-8 when I tell them that's what I want. This solved Gugod's immediate problem, and I happily closed the bug. But then he started to run into places where strings appeared properly in some places but not in others. We spent an entire day (night for Gugod--I really appreciated the help!) tracking down the problem, and there turned out to be two of them. One was the bug with Apache::Util::escape_html() that I've described elsewhere, but the other proved more interesting.

It seems that if you concatenate a UTF-8 string with the utf8 flag turned on with a UTF-8 string without utf8 turned on, the text in the unflagged variable turns to crap! I have no idea why this is, but Gugod noticed that strings pulled into the UI from the Bricolage zh_tw localization library simply didn't display properly. I had him add use utf8; to the zh_tw module, and the problem went away!
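The effect can be reproduced in a few lines. As far as I understand Perl's behavior, when a flagged string is joined to an unflagged one, the unflagged bytes are upgraded one by one as if they were Latin-1, so each byte of a multibyte character becomes a character of its own. A minimal sketch, with illustrative variable names:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $flagged   = decode('UTF-8', "\xE4\xB8\xAD");  # U+4E2D, flag on
my $unflagged = "\xC3\xA9";                       # UTF-8 bytes for U+00E9, flag off

# The two bytes of $unflagged become two separate characters,
# U+00C3 and U+00A9, instead of the single character U+00E9.
my $mixed = $flagged . $unflagged;

# Decoding the second string before joining keeps it as one character.
my $clean = $flagged . decode('UTF-8', $unflagged);

my $mixed_len = length $mixed;
my $clean_len = length $clean;

# Written back out as UTF-8, the mixed string is double-encoded.
my $mixed_out = encode('UTF-8', $mixed);
```

This is why fixing only some of the strings is not enough: any unflagged UTF-8 string that touches a flagged one gets mangled in this way.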

So the lesson learned here is: If you're going to make Perl strings Unicode-aware, then all of your Perl strings need to be Unicode-aware. It's an all or nothing kind of thing.

So while setting the utf8 flag on browser submits and adding use utf8; to the localization modules got us part of the way toward a solution, it turned out to be trickier than I expected to get the utf8 flag set on everything. The places I needed to get it working were in the UI Mason components, in templates, and in strings pulled from the database.

It took a bit of research, but I think I successfully figured out how to make the UI Mason components UTF-8 aware. I just added preamble => "use utf8;\n" to the creation of the Mason interpreter. This gets passed on to its compiler, and now that string is added to the beginning of every template. This made things behave better in the UI. I applied the same approach to the interpreter created for Mason templates with equal success.

I'm less confident that I pulled it off for the HTML::Template and Template Toolkit templating architectures. In a discussion on the templates mailing list, Andy Wardley suggested that it wasn't currently possible. But I wasn't so sure. It seemed to me that, since Bricolage reads in the templates and asks TT to execute them within a certain scope, that I could just set the mode to utf8 on the file handle and then execute the template within the scope of a use utf8; statement. So that's what I did. Feedback on whether it works or not would be warmly welcomed.

I tried a similar approach with the HTML::Template burner. Again, the burner reads the templates from files and passes them to HTML::Template for execution (as near as I could tell, anyway; I'm not an HTML::Template template user). Hopefully it'll just work.

So that just left the database. Since the database is Unicode-only, all I needed to do was to turn on the utf8 flag for all content pulled from the database. Amazingly, this hasn't come up as an issue for people very much, even though DBI doesn't do anything about Unicode. I picked up an older discussion started by Matt Sergeant on the dbi-dev mailing list, but it looks like it might be a while before DBI has fast, integrated support for turning utf8 on and off for various database handles and columns. I look forward to seeing the results of Tim's work in the next release of DBI, because it's likely to be very efficient. I opened another bug report to remind myself to take advantage of the new feature when it's ready.

So in the meantime, I needed to find another solution. Fortunately, my fellow PostgreSQL users had run into it before, and added what I needed to DBD::Pg back in version 1.22. The pg_enable_utf8 database handle parameter forces the utf8 flag to be turned on for all string data returned from the database. I added this parameter to Bricolage, and now all data pulled from the database is utf8. And so are the UI components, templates, localization libraries, and data submitted from browsers. I think that nailed everything, but I know that Unicode issues are a slippery slope. I can't wait until I have to deal with them again!
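For reference, passing the parameter might look something like the sketch below. This is not Bricolage's actual code: the helper names, the DSN, and the credentials are placeholders I made up, and later DBD::Pg releases may treat pg_enable_utf8 differently, so check the documentation for your version.

```perl
use strict;
use warnings;

# Hypothetical helper: the attributes to hand to DBI->connect().
sub connect_attrs {
    return {
        RaiseError     => 1,
        AutoCommit     => 1,
        pg_enable_utf8 => 1,   # ask DBD::Pg to flag returned string data as utf8
    };
}

# Hypothetical helper: opens a handle using those attributes.
sub connect_utf8 {
    my ($dsn, $user, $pass) = @_;
    require DBI;    # loaded only when a connection is actually requested
    return DBI->connect($dsn, $user, $pass, connect_attrs());
}

# Example call, left commented out because it needs a live database:
# my $dbh = connect_utf8('dbi:Pg:dbname=example', 'example_user', 'secret');
```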

Not.

Backtalk

Mark Fowler wrote:

Okay.

First up, I'm really really nervous about setting the utf8 flag on anything that comes from the browser. If you accidentally mark non-utf8 data (i.e. you don't get utf8 from a badly behaved browser) then Perl won't treat that data at all right - you can actually core dump Perl if you're not careful. Maybe you should actually decode things properly with the decode routine. It's a lot safer.
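The safer route suggested here could be sketched roughly as follows, using Encode's FB_CROAK check so that malformed input is rejected rather than flagged. The decode_param() helper is hypothetical, not part of Bricolage or Encode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded text, or undef when the
# input is not well-formed UTF-8.
sub decode_param {
    my ($octets) = @_;
    my $copy = $octets;   # decode() may modify its argument when a check is set
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return $text;         # undef if decode() died on malformed input
}

my $good = decode_param("\xE4\xB8\xAD");   # valid UTF-8 for U+4E2D
my $bad  = decode_param("na\xEFve");       # Latin-1 bytes, not valid UTF-8
```

A caller can then treat an undefined result as a bad request instead of passing suspect data further into the application.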

Secondly, if you want me to take a look at the TT stuff, blast me a URL to what's changed and I'll have a look and see if you're doing The Right Thing, though from what I've read it sounds like it. TT does the right thing with templates it reads from disk (if it's got a BOM it treats it as utf8/utf16, and it sticks "use utf8;" at the top of the intermediate Perl code it creates from the templates if there are utf8 values in the templates) now that those patches I wrote have gone in. But for strings you pass from your code, it expects you to flag them correctly for everything to work.

Theory wrote:

So far it hasn't been a problem to just set the flag on data sent from the browser. Modern browsers are pretty good about doing the right thing. But I have comments in the code for how to change it, if there are bug reports because of stupid browsers. But Bricolage is in daily production at RFA now, publishing in 10 different languages with a variety of browsers, and so far there have been no complaints.

My original change is here, and an additional change to take advantage of TT 2.14 is here. You can see the full source code here. The BOM solution wasn't really an option, since it would require Bricolage templators to always put it in their templates.

Eric Mowrer wrote:

Another trick that will get you around some of the perl 5.6 Unicode bugs: my $fixed_utf8_string = pack('U*', unpack('C*', $broken_utf8_string)); I have had to resort to this on a few occasions when 'use utf8' didn't fix the problem.

Justin Mason wrote:

http://taint.org/

Hey -- I wound up returning to this page today while working on a UTF-8 bug in spamassassin. lots of good advice in general, and thanks.

However, I'm a bit worried by the danger of causing the interpreter to dump core based on input from remote HTTP clients. That's a very big deal -- (a) there's no guarantee they really are browsers and not some l33t_bricolage_exploit.pl script hitting port 80 directly; and (b) core dumps in the interpreter, if reliably reproducible, can mean exploits -- even a single byte overflow can be exploited. So I'm with Mark Fowler on that point ;)