Unicode Normalization in SQL

I've been peripherally aware of the need for Unicode normalization in my code for a while, but only got around to looking into it today. Although I use Encode to decode text inputs into Perl's internal form and to encode my outputs as UTF-8 or another appropriate encoding, it does nothing about normalization.

What's normalization, you ask?

Well, Unicode allows some characters to be represented in more than one way. For example, “é” can be written either as the precomposed character “é” (U+00E9) or as “é”, a combination of “e” (U+0065) and the combining acute accent “́” (U+0301). This is all well and good, but the trouble comes when you want to compare values. Observe this Perl one-liner:

% perl -le 'print "\x{00e9}" eq "\x{0065}\x{0301}" ? "t" : "f"'
f

The same issue exists in your database. Here's an example from PostgreSQL:

try=# select U&'\00E9' = U&'\0065\0301';
 ?column? 
----------
 f
(1 row)

If one user enters data in your Web app using combining characters, and another then searches using precomposed characters, the search will fail. This won't do at all.

The solution is to normalize your Unicode data. In Perl, you can use Unicode::Normalize, a C/XS module that uses Perl's internal Unicode tables to do the conversion. For general use, the NFC normalization form is probably best:

use Unicode::Normalize;
$string = NFC $string;
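
With both values normalized to NFC, the comparison from the earlier one-liner should now succeed:

% perl -MUnicode::Normalize -le 'print NFC("\x{00e9}") eq NFC("\x{0065}\x{0301}") ? "t" : "f"'
t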

PostgreSQL offers no normalization routines. However, the SQL standard mandates one (as of SQL:2008, at least). It looks like this:

<normalize function> ::= NORMALIZE <left paren> <character value expression> [ <comma> <normal form> [ <comma> <normalize function result length> ] ] <right paren>
<normal form> ::= NFC | NFD | NFKC | NFKD

The second argument defaults to NFC and the third, which specifies a maximum length of the return value, is optional. The fact that it looks like a function means that we can use PL/PerlU to emulate it in PostgreSQL until a proper implementation makes it into core. Here's how:

CREATE OR REPLACE FUNCTION NORMALIZE(
    string TEXT,
    form   TEXT,
    maxlen INT
) RETURNS TEXT LANGUAGE plperlu AS $$
    use Unicode::Normalize 'normalize';
    my ($string, $form, $maxlen) = @_;
    my $ret = normalize($form, $string);
    elog(ERROR, 'Normalized value is too long') if length $ret > $maxlen;
    return $ret;
$$;

CREATE OR REPLACE FUNCTION NORMALIZE(
    string TEXT,
    form   TEXT
) RETURNS TEXT LANGUAGE plperlu AS $$
    use Unicode::Normalize 'normalize';
    return normalize($_[1], $_[0]);
$$;

CREATE OR REPLACE FUNCTION NORMALIZE(
    string TEXT
) RETURNS TEXT LANGUAGE plperlu AS $$
    use Unicode::Normalize 'normalize';
    return normalize('NFC', shift);
$$;
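
With these functions in place, the equality comparison that failed earlier should succeed once both sides are normalized, something like this:

try=# select normalize(U&'\00E9') = normalize(U&'\0065\0301');
 ?column? 
----------
 t
(1 row)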

I wrote a few tests to make sure it was sane:

SELECT U&'\0065\0301' as combined,
       char_length(U&'\0065\0301'),
       NORMALIZE(U&'\0065\0301') as normalized,
       char_length(NORMALIZE(U&'\0065\0301'));

SELECT NORMALIZE(U&'\0065\0301', 'NFC')  AS NFC,
       NORMALIZE(U&'\0065\0301', 'NFD')  AS NFD,
       NORMALIZE(U&'\0065\0301', 'NFKC') AS NFKC,
       NORMALIZE(U&'\0065\0301', 'NFKD') AS NFKD
;

SELECT NORMALIZE(U&'\0065\0301', 'NFC', 1)  AS NFC,
       NORMALIZE(U&'\0065\0301', 'NFD', 2)  AS NFD,
       NORMALIZE(U&'\0065\0301', 'NFKC', 1) AS NFKC,
       NORMALIZE(U&'\0065\0301', 'NFKD', 2) AS NFKD;

SELECT NORMALIZE(U&'\0065\0301', 'NFD', 1);

And the output:

 combined | char_length | normalized | char_length 
----------+-------------+------------+-------------
 é        |           2 | é          |           1
(1 row)

 nfc | nfd | nfkc | nfkd 
-----+-----+------+------
 é   | é   | é    | é
(1 row)

 nfc | nfd | nfkc | nfkd 
-----+-----+------+------
 é   | é   | é    | é
(1 row)

psql:try.sql:45: ERROR:  error from Perl function "normalize": Normalized value is too long at line 5.

Cool! So that's fairly close to the standard. The main difference is that the form argument must be a string rather than a constant literal, though PostgreSQL would likely support both. In the standard, the length argument can also specify its unit, such as 10 characters or 64 octets, but for our purposes a plain integer is fine. The only downside is that it's slow: PostgreSQL must convert its text value to a Perl string to pass to the function, Unicode::Normalize then turns it into a C string to do the conversion and back into a Perl string, which, in turn, is returned to PostgreSQL and converted back into its text form. Not the quickest process, but it may prove useful anyway.
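
To illustrate the difference, here's roughly how a standard-conforming call compares to a call to these functions, using a hypothetical profiles table with a bio column:

-- SQL standard syntax: the form is a keyword (not supported by PostgreSQL).
SELECT NORMALIZE(bio, NFC, 64) FROM profiles;

-- The PL/PerlU emulation: the form is a string, the length a plain integer.
SELECT NORMALIZE(bio, 'NFC', 64) FROM profiles;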

Update: 1 Hour Later

Note that this issue applies when using full text search, too. Alas, full text search does not normalize Unicode characters for you:

try=# select to_tsvector(U&'\00E9clair') @@ to_tsquery(U&'\0065\0301clair');
 ?column? 
----------
 f
(1 row)

But normalizing with the functions I introduced does work:

try=# select to_tsvector(U&'\00E9clair') @@ to_tsquery(normalize(U&'\0065\0301clair'));
 ?column? 
----------
 t
(1 row)

So yes, this really can be an issue in your applications.

Backtalk

Peter Karman wrote:

Thanks for pointing me at Unicode::Normalize. And to Aristotle for that good humour.

Darren Duncan wrote:

NFC vs NFD

You say that in general NFC normalization is probably best. While it may make for more compact strings, I would argue that for the purposes of identity or comparison it is actually NFD that is better. One main reason I cite is that NFC doesn't provide a single codepoint for many graphemes that use multiple combining characters with a base character; it only provides precomposed codepoints that combine the base with one of them, leaving the rest as separate combining characters. So to get such a grapheme with base 'B' and 2 combining characters '1' and '2', you can choose either the 2 codepoints 'B1' plus '2' or 'B2' plus '1', but you can't have 'B12'. So comparing for that grapheme with one input using B1+2 and the other B2+1 will fail. On the other hand, using NFD you would always have B+1+2, so no problem there.

Theory wrote:

Re: NFC vs NFD

Darren,

“Die, heretic scum!”

—Theory