Just a Theory

By David E. Wheeler

Posts about UTF-8

Encoding is a Headache

I have to spend way too much of my programming time worrying about character encodings. Take my latest module, Text::Markup for example. The purpose of the module is very simple: give in the name of a file, and it will figure out the markup it uses (HTML, Markdown, Textile, whatever) and return a string containing the HTML generated from the file. Simple, right?

But, hang on. Should the HTML it returns be decoded to Perl’s internal form? I’m thinking not, because the HTML itself might declare the encoding, either in a XML declaration or via something like

<meta http-equiv="Content-type" content="text/html;charset=Big5" />

And as you can see, it’s not UTF-8. So decoded it would be lying. So it should be encoded, right? Parsers like XML::LibXML::Parser are smart enough to see such declarations and decode as appropriate.

But wait a minute! Some markup languages, like Markdown, don’t have XML declarations or headers. They’re HTML fragments. So there’s no wait to tell the encoding of the resulting HTML unless it’s decoded. So maybe it should be decoded. Or perhaps it should be decoded, and then given an XML declaration that declares the encoding as UTF-8 and encoded it as UTF-8 before returning it.

But, hold the phone! When reading in a markup file, should it be decoded before it’s passed to the parser? Does Text::Markdown know or care about encodings? And if it should be decoded, what encoding should one assume the source file uses? Unless it uses a BOM, how do you know what its encoding is?

Text::Markup is a dead simple idea, but virtually all of my time is going into thinking about this stuff. It drives me nuts. When will the world cease to be this way?

Oh, and you have answers to any of these questions, please do feel free to leave a comment. I hate having to spend so much time on this, but I’d much rather do so and get things right (or close to right) than wrong.

Looking for the comments? Try the old layout.

More about…

CITEXT Patch Submitted to PostgreSQL Contrib

On Friday, I submitted a patch to add a locale-aware case-insensitive text type as a PostgreSQL contrib module. This has been among my top requests as a feature for PostgreSQL ever since I started using it. And as I started work on yet another application recently, I decided to look into what it would take to just make it happen myself. I’m hopeful that everyone will be able to benefit from this bit of yak shaving.

I started out by trying to use the citext project on pgFoundry, but immediately identified two issues with it:

  1. It does not install properly on PostgreSQL 8.3 (it uses a lot of casts that were removed in 8.3); and
  2. It only case-insensitively compared ASCII characters. So accented multibyte characters work just as they do in the text type.

So I set about trying to create my own, new type, originally called “lctext”, since what it does is not true case-insensitive comparisons, but lowercases text and then compares, just as millions of us developers already do by using LOWER() on both sides of a query:

SELECT *
  FROM tab
 WHERE lower(col) = LOWER(?);

I just finally got fed up with this. The last straw for me was wanting to create a primary key that would be stored case-insensitively, which would have required that I create two indexes for it: One created for the primary key by default, the other a functional UNIQUE INDEX on LOWER(col), which would just be stupid.

So this patch is the culmination of my work to make a locale-aware case-insensitive text type. It’s locale-aware in that it uses the same locale-aware string comparison code as that used for the text type, and it uses the same C function as LOWER() uses. The nice thing is that it works just as if you had used LOWER() in all your SQL, but now you don’t have to.

So while this is not a true case-insensitive text type, in the sense that it doesn’t do a case-insensitive comparison, but changes the cases and then compares, it is likely more efficient than the LOWER() workaround that we’ve all been using for years, and it neater, too. Using this type, it will now be much easier to create, e.g, an email domain, like so:

CREATE OR REPLACE FUNCTION is_email(text)
RETURNS BOOLEAN
AS $$
    use Email::Valid;
    return TRUE if Email::Valid->address( $_[0] );
    return FALSE;
$$ LANGUAGE 'plperlu' STRICT IMMUTABLE;

CREATE DOMAIN email AS CITEXT CHECK ( is_email( value ) );

No more nasty workarounds to account for the lack of case-insensitive comparisons for text types. It works great for time zones and other data types that are defined to compare case-insensitively:

CREATE OR REPLACE FUNCTION is_timezone( tz TEXT ) RETURNS BOOLEAN as $$
BEGIN
  PERFORM now() AT TIME ZONE tz;
  RETURN TRUE;
EXCEPTION WHEN invalid_parameter_value THEN
  RETURN FALSE;
END;
$$ language plpgsql STABLE;

CREATE DOMAIN timezone AS CITEXT
CHECK ( is_timezone( value ) );

And that should just work!

I’m hoping that this is accepted during the July CommitFest. Of course I will welcome suggestions for how to improve it. Since I sent the patch, for example, I’ve been thinking that I should suggest in the documentation that it is best used for short text entries (say, up to 256 characters), rather than longer entries (like email bodies or journal articles), and that for longer entries, one should really make use of tsearch2, instead. There are other notes and caveats in the patch submission. Please do let me know what you think.

Looking for the comments? Try the old layout.

Intelligent MySQL Configuration

James Duncan Davidson’s Configuring MySQL on MacOS X post earlier today reminded me that I wanted to blog about the configuration I came up with while installing MySQL 5 on my box. Nothing has irritated me more than when MySQL’s syntax has violated the ANSI SQL standards in the most blatant ways, or when transactions have appeared to work, but mysteriously not worked. Yes, I use Duncan’s settings to make sure that the MySQL box on my PowerBook only listens on local sockets, but I additionally add this configuration to /etc/my.cnf:

[mysqld]
sql-mode=ansi,strict_trans_tables,no_auto_value_on_zero,no_zero_date,no_zero_in_date,only_full_group_by
character-set-server=utf8
default-storage-engine=InnoDB
default-time-zone=utc

That last configuration can actually only be added after running this command:

/usr/local/mysql/bin/mysql_tzinfo_to_sql /usr/share/zoneinfo | \
mysql -u root mysql

But then the upshot is that I have everything configured to be as compliant as possible (although the time zone stuff is just my personal preference):

mysql> SELECT @@global.sql_mode;
mysql> SELECT @@global.sql_mode;
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| @@global.sql_mode                                                                                                                                     |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| REAL_AS_FLOAT,PIPES_AS_CONCAT,ANSI_QUOTES,IGNORE_SPACE,ONLY_FULL_GROUP_BY,ANSI,NO_AUTO_VALUE_ON_ZERO,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> show variables like '%character_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | utf8   |
| character_set_connection | utf8   |
| character_set_database   | utf8   |
| character_set_filesystem | binary |
| character_set_results    | utf8   |
| character_set_server     | utf8   |
| character_set_system     | utf8   |
+--------------------------+--------+
7 rows in set (0.01 sec)

mysql> show variables like '%table_ty%';
+---------------+--------+
| Variable_name | Value  |
+---------------+--------+
| table_type    | InnoDB |
+---------------+--------+
1 row in set (0.00 sec)

mysql> show variables like 'time_zone%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| time_zone     | utc   |
+---------------+-------+
1 row in set (0.00 sec)

Now that’s the way things should be! Or at least as close as I’m going to get to it in MySQL 5.

Update 2006-11-04: Ask Bjørn Hansen turned me on to the “strict_trans_tables” mode, which prevents MySQL from trying to guess what you mean when you leave out a value for a required column. So I’ve now updated my configuration with sql-mode=ansi,strict_trans_tables.

Update 2009-11-05: I found myself configuring MySQL again today, and there were some other settings I found it useful to add:

  • no_auto_value_on_zero forces AUTO_INCREMENT columns to increment only when inserting a NULL, rather than when inserting a NULL or a zero(!).
  • no_zero_date and no_zero_in_date disallow dates where the the year or month are set to 0.
  • only_full_group_by requires that non-aggregated columns in a select list be included in a GROUP BY clause, as is mandated by the SQL standard. This only applies if an aggregate function is used in a query

I’ve added all of these to the example above.

Looking for the comments? Try the old layout.

No UTF-8 Support on Windows?

This just blows. It will be a while before Bricolage runs on Windows, then. The PostgreSQL team is understandably reluctant to simply include the whole ICU library in PostgreSQL. Maybe it could be compiled into the binaries, though?

Looking for the comments? Try the old layout.

More about…

Lessons Learned with Perl and UTF-8

I learned quite a lot last week as I was making Bricolage much more Unicode-aware. Bricolage has always managed Unicode content and stored it in a PostgreSQL Unicode-encoded database. And by “Unicode” I of course mean “UTF-8”. By far the biggest nightmare was figuring out the bug with Apache::Util::escape_html(), but ultimately it came down to an interesting lesson.

Why was I making Bricolage Unicode-aware? Well, it all started with a bug report from Kang-min Liu (a.k.a. “Gugod”). I had naïvely thought that if strings were Unicode that Perl would know it and do the right thing. It turns out I was wrong. Perl assumes that everything is binary unless you tell it otherwise. This means that Perl operators such as length and substr will count bytes instead of characters. And in the case of Unicode, where characters can be multiple bytes, this can cause serious problems. Not only were strings improperly concatenated mid-character for Gugod, but PostgreSQL could refuse to accept such strings, since a chopped-up multibyte character isn’t valid Unicode!

So I had to make some decisions: Either stop using Perl operators that count bytes, or let Perl know that all the strings that Bricolage deals with are Unicode strings. The former wasn’t really an option, of course, since users can specify that certain content fields be a certain length of characters. So with a lot of testing help from Gugod and his Bricolage install full of multibyte characters, I set about doing so. The result is in the recently released Bricolage 1.8.2 and I’m blogging what I learned for both your reference and mine.

Perl considers its internal representation of strings to be UTF-8 strings, and it knows what variables contain valid UTF-8 strings because they have a special flag set on them, called, strangely enough, utf8. This flag isn’t set by default, but can be set in a number of ways. The ways I’ve found so far are:

  • Using Encode::decode() to decode a string from binary to Perl’s internal representation. The use of the word “decode” here had confused me for a while, because I thought it was a special encoding. But the truth is that it’s not. Strings can have any number of encodings, such as “ISO-8859-1”, “GB3212”, “EUC-KR”, “UTF-8”, and the like. But when you “decode” a string, you’re telling Perl that it’s not any of those encodings, but Perl’s own representation. I was confused because Perl’s internal representation is UTF-8, which is an encoding. But really it’s not UTF-8, It’s “utf8”, which isn’t an encoding, but Perl’s own thing.

  • Cheat: Use Encode::_set_utf8_on(). This private function is nevertheless documented by the Encode module, and therefore usable. What it does is simply turn on the utf8 flag on a variable. You need be confident that the variable contains only valid UTF-8 characters, but if it does, then you should be pretty safe.

  • Using the three-argument version of open, such as

    open my $fh, "<utf8", "/foo/bar" or die "Cannot open file: $!\n"

    Now when you read lines from this file, they will automatically be decoded to utf8.

  • Using binmode to set the mode on a file handle:

    binmode $fh, ":utf8";

    As with the three-argument version of open this forces Perl to decode the strings read from the file handle.

  • use utf8;. This Perl pragma indicates that everything within its scope is UTF-8, and therefore should be decoded to utf8.

So I started applying these approaches in various places. The first thing I did was to set the utf8 flag on data coming from the browser with Encode::_set_utf8_on(). Shitty browsers can of course send shitty data, but I’m deciding, for the moment at least, to trust browser to send only UTF-8 when I tell them that’s what I want. This solved Gugod’s immediate problem, and I happily closed the bug. But then he started to run into places where strings appeared properly in some places but not in others. We spent an entire day (night for Gugod–I really appreciated the help!) tracking down the problem, and there turned out to be two of them. One was the the bug with Apache::Util::escape_html() that I’ve [described elsewhere]the bug with Apache::Util::escape_html(), but the other proved more interesting.

It seems that if you concatenate a UTF-8 string with the utf8 flagged turned on with a UTF-8 string without utf8 turned on, the text in the unflagged variable turns to crap! I have no idea why this is, but Gugod noticed that strings pulled into the UI from the Bricolage zh_tw localization library simply didn’t display properly. I had him add use utf8; to the zh_tw module, and the problem went away!

So the lesson learned here is: If you’re going to make Perl strings Unicode-aware, then all of your Perl strings need to be Unicode-aware. It’s an all or nothing kind of thing.

So while setting the utf8 flag on browser submits and adding use utf8; to the localization modules got us part of the way toward a solution, it turned out to be trickier than I expected to get the utf8 flag set on everything. The places I needed to get it working were in the UI Mason components, in templates, and in strings pulled from the database.

It took a bit of research, but I think I successfully figured out how to make the UI Mason components UTF-8 aware. I just added preamble => "use utf8\n;" to the creation of the Mason interpretor. This gets passed on to is compiler, and now that string is added to the beginning of every template. This made things behave better in the UI. I applied the same approach to the interpreter created for Mason templates with equal success.

I’m less confident that I pulled it off for the HTML::Template and Template Toolkit templating architectures. In a discussion on the templates mailing list, Andy Wardley suggested that it wasn’t currently possible. But I wasn’t so sure. It seemed to me that, since Bricolage reads in the templates and asks TT to execute them within a certain scope, that I could just set the mode to utf8 on the file handle and then execute the template within the scope of a use utf8; statement. So that’s what I did. Feedback on whether it works or not would be warmly welcomed.

I tried a similar approach with the HTML::Template burner. Again, the burner reads the templates from files and passes them to HTML::Template for execution (as near as I could tell, anyway; I’m not an HTML::Template template user). Hopefully it’ll just work.

So that just left the database. Since the database is Unicode-only, all I needed to do was to turn on the utf8 flag for all content pulled from the database. Amazingly, this hasn’t come up as an issue for people very much, because DBI doesn’t do anything about Unicode. I picked up an older discussion started by Matt Sergeant on the dbi-dev mail list, but it looks like it might be a while before DBI has fast, integrated support for turning utf8 on and off for various database handles and columns. I look forward to it, though, because it’s likely to be very efficient. I greatly look forward to seeing the results of Tim’s work in the next release of DBI. I opened another bug report to remind myself to take advantage of the new feature when it’s ready.

So in the meantime, I needed to find another solution. Fortunately, my fellow PostgreSQL users had run into it before, and added what I needed to DBD::Pg back in version 1.22. The pg_enable_utf8 database handle parameter forces the utf8 flag to be turned on for all string data returned from the database. I added this parameter to Bricolage, and now all data pulled from the database is utf8. And so are the UI components, templates, localization libraries, and data submitted from browsers. I think that nailed everything, but I know that Unicode issues are a slippery slope. I can’t wait until I have to deal with them again!

Not.

Looking for the comments? Try the old layout.

More about…

Apache::Util::escape_html() Doesn't Like Perl UTF-8 Strings

I got bit by a bug with Apache::Util’s escape_html() function in mod_perl 1. It seems that it doesn’t like Perl’s Unicode encoded strings! This patch demonstrates the issue (be sure that your editor understands utf-8):

--- modperl/t/net/perl/util.pl.~1.18.~  Sun May 25 03:54:08 2003+++ modperl/t/net/perl/util.pl  Thu Sep  9 19:38:40 2004@@ -74,6 +74,25 @@  #print $esc_2; test ++$i, $esc eq $esc_2;++# Make sure that escape_html() understands multibyte characters.+my $utf8 = '<專輯>';+my $esc_utf8 = '<專輯>';+my $test_esc_utf8 = Apache::Util::escape_html($utf8);+test ++$i, $test_esc_utf8 eq $esc_utf8;+#print STDERR "Compare '$test_esc_utf8'\n     to '$esc_utf8'\n";++eval { require Encode };+unless ($@) {+    # Make sure escape_html() properly handles strings with Perl's+    # Unicode encoding.+    $utf8 = Encode::decode_utf8($utf8);+    $esc_utf8 = Encode::decode_utf8($esc_utf8);+    $test_esc_utf8 = Apache::Util::escape_html($utf8);+    test ++$i, $test_esc_utf8 eq $esc_utf8;+    #print STDERR "Compare '$test_esc_utf8'\n     to '$esc_utf8'\n";+}+ use Benchmark;  =pod

If I enable the print statements and look at the log, I see this:

Compare '<專輯>'
     to '<專輯>'
Compare '<å°è¼¯>'
     to '<專輯>'

The first escape appears to work correctly, but when I decode the string to Perl’s Unicode representation, you can see how badly escape_html() munges the text!

Curiously, both tests fail, although the first conversion appears to be correct. This could be due to the behavior of eq, though I’m not sure why. But it’s the second test that’s the more interesting, since it really screws things up.

Looking for the comments? Try the old layout.

More about…

Always use the C Locale with PostgreSQL

I ran into the weirdest bug with Bricolage today. We use the LIKE operator to do string comparisons throughout Bricolage. In one usage, the code checks to see if there’s a record in the “keyword” table before creating it. This is because keyword names are unique. So it looks for a keyword record like this:

SELECT name, screen_name, sort_name, active
  FROM   keyword
 WHERE  LOWER(name) LIKE ?

If it finds a keyword, it creates a relationship between it and a story document. If it doesn’t find it, it creates a new keyword record and then associates the new keyword with a story document.

However, one of our customers was getting SQL errors when attempting to add keywords to a story, and it took me a while to figure out what the problem was. This is because I couldn’t replicate the problem until I started trying to create multibyte keywords. Now, Bricolage uses a UTF-8 PostgreSQL database, but something very odd was going on. When I attempted to add the keyword “북한의”, it didn’t find an existing keyword, but then threw an error when the unique index thought it existed already! Running tests in psql, I found that = would find the existing record, but LIKE wouldn’t!

Once I posted a query on the pgsql-general list, someone noticed that the record returned when using = actually had a different value than was actually queried for. I had searched for “북한의”, but the database found “국방비”. It seems that = compares bytes, while LIKE compares characters. The error I was getting meant that the unique index was also using bytes. And because of the locale used when initdb was run, PostgreSQL thought that they actually were the same!

The solution to this problem, it turns out, was to dump the database, shut down PostgreSQL, move the old data directory, and create a new one with initdb -locale=C. I then restored the database, and suddenly = and LIKE (and the unique index) were doing the same thing. Hallelujah!

Naturally, I’m not the first to notice this issue. It’s particularly an issue with RedHat Linux installations, since RedHat has lately decided to set a system-wide locale. In my case, it was “en_US.UTF-8.” This apparently can break collations in other languages, and this affects indices, of course. So I was led to wonder if initdb shouldn’t default to a locale of C instead of the system default. What do you think?

You can read the whole thread here.

Looking for the comments? Try the old layout.