Lessons Learned with Perl and UTF-8
I learned quite a lot last week as I was making Bricolage much more
Unicode-aware. Bricolage has always managed Unicode content and stored it in a
PostgreSQL Unicode-encoded database. And by Unicode
I of course
mean UTF-8
. By far the biggest nightmare was figuring out
the bug
with Apache::Util::escape_html(), but ultimately it came down
to an interesting lesson.
Why was I making Bricolage Unicode-aware? Well, it all started with a
bug report from
Kang-min Liu (a.k.a. Gugod
). I had naïvely thought that if strings
were Unicode that Perl would know it and do the right thing. It turns out I
was wrong. Perl assumes that everything is binary unless you tell it
otherwise. This means that Perl operators such as length
and substr will count bytes instead of characters. And in the
case of Unicode, where characters can be multiple bytes, this can cause
serious problems. Not only were strings improperly concatenated mid-character
for Gugod, but PostgreSQL could refuse to accept such strings, since a chopped-up multibyte
character isn't valid Unicode!
So I had to make some decisions: Either stop using Perl operators that count bytes, or let Perl know that all the strings that Bricolage deals with are Unicode strings. The former wasn't really an option, of course, since users can specify that certain content fields be a certain length of characters. So with a lot of testing help from Gugod and his Bricolage install full of multibyte characters, I set about doing so. The result is in the recently released Bricolage 1.8.2 and I'm blogging what I learned for both your reference and mine.
Perl considers its internal representation of strings to be UTF-8 strings,
and it knows what variables contain valid UTF-8 strings because they have a
special flag set on them, called, strangely enough, utf8. This
flag isn't set by default, but can be set in a number of ways. The ways I've
found so far are:
Using
Encode::decode()to decode a string from binary to Perl's internal representation. The use of the worddecode
here had confused me for a while, because I thought it was a special encoding. But the truth is that it's not. Strings can have any number of encodings, such asISO-8859-1
,GB3212
,EUC-KR
,UTF-8
, and the like. But when youdecode
a string, you're telling Perl that it's not any of those encodings, but Perl's own representation. I was confused because Perl's internal representation is UTF-8, which is an encoding. But really it's not UTF-8, It'sutf8
, which isn't an encoding, but Perl's own thing.Cheat: Use
Encode::_set_utf8_on(). This private function is nevertheless documented by the Encode module, and therefore usable. What it does is simply turn on theutf8flag on a variable. You need be confident that the variable contains only valid UTF-8 characters, but if it does, then you should be pretty safe.Using the three-argument version of
open, such asopen my $fh, "<utf8", "/foo/bar" or die "Cannot open file: $!\n"
Now when you read lines from this file, they will automatically be decoded to
utf8.Using
binmodeto set the mode on a file handle:binmode $fh, ":utf8";
As with the three-argument version of
openthis forces Perl to decode the strings read from the file handle.use utf8;. This Perl pragma indicates that everything within its scope is UTF-8, and therefore should be decoded toutf8.
So I started applying these approaches in various places. The first thing I
did was to set the utf8 flag on data coming from the browser with
Encode::_set_utf8_on(). Shitty browsers can of course send shitty
data, but I'm deciding, for the moment at least, to trust browser to send only
UTF-8 when I tell them that's what I want. This solved Gugod's immediate
problem, and I happily closed the bug. But then he started to run into places
where strings appeared properly in some places but not in others. We spent an
entire day (night for Gugod--I really appreciated the help!) tracking down the
problem, and there turned out to be two of them. One was the the bug
with Apache::Util::escape_html() that I've described
elsewhere, but the other proved more interesting.
It seems that if you concatenate a UTF-8 string with the utf8
flagged turned on with a UTF-8 string without utf8 turned on, the
text in the unflagged variable turns to crap! I have no idea why this is, but
Gugod noticed that strings pulled into the UI from the Bricolage zh_tw
localization library simply didn't display properly. I had him add use
utf8; to the zh_tw module, and the problem went away!
So the lesson learned here is: If you're going to make Perl strings Unicode-aware, then all of your Perl strings need to be Unicode-aware. It's an all or nothing kind of thing.
So while setting the utf8 flag on browser submits and
adding use utf8; to the localization modules got us part of the
way toward a solution, it turned out to be trickier than I expected to get
the utf8 flag set on everything. The places I needed to get it
working were in the UI Mason components, in templates, and in strings pulled
from the database.
It took a bit of research, but I think I successfully figured out how to
make the UI Mason components UTF-8 aware. I just added preamble =>
"use utf8\n;" to the creation of the Mason interpretor. This
gets passed on to is compiler, and now that string is added to the beginning
of every template. This made things behave better in the UI. I applied the
same approach to the interpetor created for Mason templates with equal
success.
I'm less confident that I pulled it off for the HTML::Template and Template
Toolkit templating architectures. In a discussion on the templates mailing
list, Andy Wardley suggested that it wasn't currently possible.
But I wasn't so sure. It seemed to me that, since Bricolage reads in the
templates and asks TT to execute them within a certain scope, that I could
just set the mode to utf8 on the file handle and then execute the
template within the scope of a use utf8; statement. So that's
what I did. Feedback on whether it works or not would be warmly welcomed.
I tried a similar approach with the HTML::Template burner. Again, the burner reads the templates from files and passes them to HTML::Template for execution (as near as I could tell, anyway; I'm not an HTML::Template template user). Hopefully it'll just work.
So that just left the database. Since the database is Unicode-only, all I
needed to do was to turn on the utf8 flag for all content pulled
from the database. Amazingly, this hasn't come up as an issue for people very
much, because DBI doesn't do anything about Unicode. I picked up an older discussion started by Matt Sergeant on
the dbi-dev mail list, but it looks like it might be a while before DBI has
fast, integrated support for turning utf8 on and off for various
database handles and columns. I look forward to it, though, because it's
likely to be very efficient. I greatly look forward to seeing the results
of Tim's work in the next release of DBI. I opened
another bug report to remind myself to take
advantage of the new feature when it's ready.
So in the meantime, I needed to find another solution. Fortunately, my
fellow PostgreSQL users had run into it before, and added what I needed to DBD::Pg
back in version 1.22. The pg_enable_utf8 database handle
parameter forces the utf8 flag to be turned on for all string
data returned from the database. I added this parameter to Bricolage, and now
all data pulled from the database is utf8. And so are the UI
components, templates, localization libraries, and data submitted from
browsers. I think that nailed everything, but I know that Unicode issues
are a slippery slope. I can't wait until I have to deal with them again!
Not.
Backtalk
Mark Fowler wrote:
Theory wrote:
Eric Mowrer wrote:
Justin Mason wrote: