Just a Theory

By David E. Wheeler

Posts about Postgres

PGXN Development Project

I’m pleased to announce the launch of the PGXN development project. I’ve written a detailed specification and pushed it through general approval on pgsql-hackers, drawn up a detailed project plan, and estimated things at a highly reduced PostgreSQL Experts rate to come up with a fundraising goal: $25,000. And now, thanks to founding contributions from myYearbook.com and PostgreSQL Experts, we have started the fundraising phase of the project.

So what’s this all about? PGXN, the PostgreSQL Extension Network, is modeled on CPAN, the Perl community’s archive of “all things Perl.” PGXN will provide four major pieces of infrastructure to the PostgreSQL community.

I’ve been wanting to start this project for a long time, but given my need to pay the bills, it didn’t seem like I’d ever be able to find the time for it. Then Josh Berkus suggested that we try to get community interest and raise money for me to have the time to work on it. So I jumped on that, putting in the hours needed to get general approval from the core PostgreSQL developers and to create a reasonable project plan and web site. And thanks to myYearbook’s and PGX’s backing, I’m really excited about it. I hope to start on it in August.

If you’d like to contribute, first: thank you! The PGXN site has a Google Checkout widget that makes it easy to make a donation. If you’d rather pay by some other means (checks are great for us!), drop me a line and we’ll work something out. We have a few levels of contribution as well, including permanent linkage on the PGXN site for your organization, as well as the usual t-shirts and launch party invitations.

Looking for the comments? Try the old layout.


PGAN Bikeshedding

I’ve put together a description of PGAN, the PostgreSQL extension distribution system I plan to develop later this year based on the Comprehensive Perl Archive Network, or CPAN. Its primary features will be:

  • Extension distribution
  • Search site with extension documentation
  • Client for downloading, building, testing, and installing extensions

I’ve never been thrilled with the name, though, so I’m asking for suggestions for a better one. I’ve used the term “extension” here because it seems to be the term that the PostgreSQL community has settled on, but other terms might work, since things other than extensions might be distributed.

What I’ve come up with so far is:

  • PGAN (PostgreSQL Add-on Network), pronounced “pee-gan.” Advantages: short, similar to CPAN. Disadvantages: ugly.
  • PGEX (PostgreSQL Extensions), pronounced “pee-gee-ex” or “pee-gex.” Advantages: short, easier to pronounce. Disadvantages: too similar to PGX.
  • PGCAN (PostgreSQL Comprehensive Archive Network), pronounced “pee-gee-can.” Advantages: similar to CPAN. Disadvantages: similar to CPAN.
  • PGDAN (PostgreSQL Distribution Archive Network), pronounced “pee-gee-dan.” Advantages: short, easy to pronounce. Disadvantages: who’s “Dan”? Doesn’t distribute PostgreSQL itself.
  • PGEDAN (PostgreSQL Extension Distribution Archive Network), pronounced “pee-gee-ee-dan.” Advantages: references extensions. Disadvantages: long, sounds stupid.

Of these, I think I like “PGEX” best, but none are really great. So I’m opening up the bike shed to all. What’s a better name? Or if you can’t think of one, which of the above do you like best? Just leave a comment on this post. The only requirements for suggestions are that a .org domain be available and that it suck less than the alternatives.

Comments close in 2 weeks. Thanks!


Execute SQL Code on Connect

I’ve been writing a fair bit of PL/Perl for a client, and one of the things I’ve been doing is eliminating a ton of duplicate code by creating utility functions in the %_SHARED hash. This is great, as long as the code that creates those functions gets executed at the beginning of every database connection. So I put the utility generation code into a single function, called prepare_perl_utils(). It looks something like this:

CREATE OR REPLACE FUNCTION prepare_perl_utils(
) RETURNS bool LANGUAGE plperl IMMUTABLE AS $$
    # Don't bother if we've already loaded.
    return 1 if $_SHARED{escape_literal};

    $_SHARED{escape_literal} = sub {
        $_[0] =~ s/'/''/g; $_[0] =~ s/\\/\\\\/g; $_[0];
    };

    # Create other code refs in %_SHARED…

    return 1;
$$;

So now all I have to do is make sure that all the client’s apps execute this function as soon as they connect, so that the utilities will all be loaded up and ready to go. Here’s how I did it.

First, for the Perl app, I just took advantage of the DBI’s callbacks to execute the SQL I need when the DBI connects to the database. That link might not work just yet, as the DBI’s callbacks have only just been documented and that documentation appears only in dev releases so far. Once 1.611 drops, the link should work. At any rate, the use of callbacks I’m exploiting here has been in the DBI since 1.49, which was released in November 2005.

The approach is the same as I’ve described before: Just specify the Callbacks parameter to DBI->connect, like so:

my $dbh = DBI->connect_cached($dsn, $user, $pass, {
    PrintError     => 0,
    RaiseError     => 1,
    AutoCommit     => 1,
    Callbacks      => {
        connected => sub { shift->do('SELECT prepare_perl_utils()') },
    },
});

That’s it. The connected method is a no-op in the DBI that gets called to alert subclasses that they can do any post-connection initialization. Even without a subclass, we can take advantage of it to do our own initialization.

It was a bit trickier to make the same thing happen for the client’s Rails app. Rails, alas, provides no on-connection callbacks. So instead we have to monkey-patch Rails to do what we want. With some help from “dfr|mac” on #rubyonrails (I haven’t touched Rails in 3 years!), we worked it down to this:

class ActiveRecord::ConnectionAdapters::PostgreSQLAdapter
  def initialize_with_perl_utils(*args)
    returning(initialize_without_perl_utils(*args)) do
      execute('SELECT prepare_perl_utils()')
    end
  end
  alias_method_chain :initialize, :perl_utils
end

Basically, we overpower the PostgreSQL adapter’s initialize method: alias_method_chain arranges for initialize_with_perl_utils to run in its place, where it calls the original initialize and then executes prepare_perl_utils() before returning. It’s a neat trick; if you’re going to practice fuck typing, alias_method_chain makes it about as clean as can be, albeit a little too magical for my tastes.

Anyway, recorded here for posterity (my blog is my other brain!).


No more USE_PGXS=1?

I’ve become very tired of having to set USE_PGXS=1 every time I build pgTAP outside the contrib directory of a PostgreSQL distribution:

make USE_PGXS=1
make USE_PGXS=1 install
make USE_PGXS=1 installcheck

I am forever forgetting to set it, and it’s just not how one normally expects a build incantation to work. It was required because that’s how the core contrib extensions work: They all have this code in their Makefiles, which those of us who develop third-party modules have borrowed:

ifdef USE_PGXS
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
else
subdir = contrib/citext
top_builddir = ../..
include $(top_builddir)/src/Makefile.global
include $(top_srcdir)/contrib/contrib-global.mk
endif

They generally expect ../../src/Makefile.global to exist, and if it doesn’t, you have to tell it so. I find this annoying, because third-party extensions are almost never built from the contrib directory, so one must always remember to specify USE_PGXS=1.

I’d like to propose, instead, that those of us who maintain third-party extensions like pgTAP, PL/Parrot, and Temporal PostgreSQL not force our users to have to remember this special variable by instead checking to see if it’s needed ourselves. As such, I’ve just added this code to pgTAP’s Makefile:

ifdef USE_PGXS
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
else
ifeq (exists, $(shell [ -e ../../src/bin/pg_config/pg_config ] && echo exists) ) 
top_builddir = ../..
PG_CONFIG := $(top_builddir)/src/bin/pg_config/pg_config
else
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
endif
endif

So it still respects USE_PGXS=1, but if it’s not set, it looks to see if it can find pg_config where it would be if built from the contrib directory. If it’s not there, it simply uses the pg_config found in the path, just as if USE_PGXS=1 had been set. This makes building from the contrib directory or from anywhere else the same process:

make
make install
make installcheck

Much better, much easier to remember.

Is there any reason why third-party PostgreSQL extensions should not adopt this pattern? I don’t think it makes sense for contrib extensions in core to do it, but for those that will never be in core, I think it makes a lot of sense.

Comments?


SQL Hack: The Something-est From Each Entity

This is a pattern that I have dealt with many times, but never figured out how to adequately handle. Say that you have imported a mailbox into your database, and you want a list of the latest messages between each pair of recipients (sender and receiver — I’m ignoring multiple receivers for the moment). The data might look like this:

BEGIN;

CREATE TABLE messages (
    sender   TEXT        NOT NULL,
    receiver TEXT        NOT NULL,
    sent_at  TIMESTAMPTZ NOT NULL DEFAULT clock_timestamp(),
    body     TEXT        NOT NULL DEFAULT ''
);

INSERT INTO messages ( sender, receiver, body )
VALUES ('Theory', 'Strongrrl', 'Hi There.' );

INSERT INTO messages ( sender, receiver, body )
VALUES ('Strongrrl', 'Theory', 'Hi yourself.' );

INSERT INTO messages ( sender, receiver, body )
VALUES ('Anna', 'Theory', 'What''s for dinner?' );

INSERT INTO messages ( sender, receiver, body )
VALUES ('Theory', 'Anna', 'Brussels Sprouts.' );

INSERT INTO messages ( sender, receiver, body )
VALUES ('Anna', 'Theory', 'Oh man!' );

COMMIT;

So the goal is to show the most recent message between Theory and Strongrrl and the most recent message between Theory and Anna, without regard to who is the sender and who is the receiver. After running into this many times, today I consulted my colleagues, showing them this dead simple (and wrong!) query to demonstrate what I wanted:

SELECT sender, receiver, sent_at, body
  FROM messages
 GROUP BY sender, receiver
HAVING sent_at = max(sent_at);

That’s wrong because one can’t have columns in the SELECT list that are not either aggregate expressions or included in the GROUP BY clause. It’s a violation of the standard (and prone to errors, I suspect). Andrew immediately said, “Classic case for DISTINCT ON.” This lovely little expression is a PostgreSQL extension not included in the SQL standard. Its implementation looks like this:

SELECT DISTINCT ON (
          CASE WHEN receiver > sender
               THEN receiver || sender
               ELSE sender   || receiver
          END
       ) sender, receiver, sent_at, body
  FROM messages
 ORDER BY CASE WHEN receiver > sender
               THEN receiver || sender
               ELSE sender   || receiver
          END, sent_at DESC;

This query says, “fetch the rows where the combination of sender and receiver is distinct, ordering by sent_at DESC.” The CASE expression, needed to derive a single uniform value for each sender/receiver pair, is a bit unfortunate, but it does the trick:

  sender   | receiver |            sent_at            |     body     
-----------+----------+-------------------------------+--------------
 Anna      | Theory   | 2010-01-12 05:00:07.026711+00 | Oh man!
 Strongrrl | Theory   | 2010-01-12 05:00:07.02589+00  | Hi yourself.

Great, exactly the data I wanted. And the CASE statement can actually be indexed to speed up filtering. But I wondered if it would be possible to get the same results without the DISTINCT ON. In other words, can this be done with standard SQL? If you’re using PostgreSQL 8.4, the answer is “yes.” All you have to do is exploit window functions and a subquery. It looks like this:

SELECT sender, receiver, sent_at, body
  FROM (
    SELECT sender, receiver, sent_at, body,
           row_number() OVER ( PARTITION BY 
               CASE WHEN receiver > sender
                    THEN receiver || sender
                    ELSE sender   || receiver
               END
               ORDER BY sent_at DESC
           ) AS rnum
      FROM messages
  ) AS t
 WHERE rnum = 1;

Same nasty CASE statement as before (hard to avoid with this database design, alas), but this is fully conforming SQL. It’s also the first time I’ve ever used window functions. If you just focus on the row_number() OVER () expression, it’s simply partitioning the table according to the same value as in the DISTINCT ON query, but ordering each partition by sent_at directly. The result is a row number, where the number is 1 for the most recent message for each combination of recipients. Then we just filter for that in the WHERE clause.
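As an aside, the concatenation key can collide in edge cases (for example, ‘ab’ || ‘c’ and ‘a’ || ‘bc’ produce the same string). Here’s a sketch of the same query using PostgreSQL’s least() and greatest() functions to partition on the unordered pair directly, should you prefer that:

```sql
SELECT sender, receiver, sent_at, body
  FROM (
    SELECT sender, receiver, sent_at, body,
           row_number() OVER (
               -- Partition on the unordered pair without concatenating.
               PARTITION BY least(sender, receiver),
                            greatest(sender, receiver)
               ORDER BY sent_at DESC
           ) AS rnum
      FROM messages
  ) AS t
 WHERE rnum = 1;
```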

Not exactly intuitive (I’m really only understanding it now as I write it out), but quite straight-forward once you accept the expressivity of this particular OVER expression. It might be easier to understand if we remove some of the cruft. If instead we wanted the most recent message from each sender (regardless of the recipient), we’d write:

SELECT sender, receiver, sent_at, body
  FROM (
    SELECT sender, receiver, sent_at, body,
           row_number() OVER (
               PARTITION BY sender ORDER BY sent_at DESC
           ) AS rnum
      FROM messages
  ) AS t
 WHERE rnum = 1;

And that yields:

  sender   | receiver |            sent_at            |     body     
-----------+----------+-------------------------------+--------------
 Anna      | Theory   | 2010-01-12 05:00:07.026711+00 | Oh man!
 Strongrrl | Theory   | 2010-01-12 05:00:07.02589+00  | Hi yourself.
 Theory    | Anna     | 2010-01-12 05:00:07.24982+00  | Brussels Sprouts.

Furthermore, we can use a common table expression to eliminate the subquery. This query is functionally identical to the subquery example (returning to uniqueness for sender and receiver), just with the WITH clause coming before the SELECT clause, setting things up for it:

WITH t AS (
    SELECT sender, receiver, sent_at, body,
           row_number() OVER (PARTITION BY CASE
               WHEN receiver > sender
                   THEN receiver || sender
                   ELSE sender   || receiver
                   END
               ORDER BY sent_at DESC
           ) AS rnum
      FROM messages
) SELECT sender, receiver, sent_at, body
    FROM t
   WHERE rnum = 1;

So it’s kind of like putting the subquery first, only it’s not a subquery, it’s more like a temporary view. Nice, eh? Either way, the results are the same as before:

  sender   | receiver |            sent_at            |     body     
-----------+----------+-------------------------------+--------------
 Anna      | Theory   | 2010-01-12 05:00:07.026711+00 | Oh man!
 Strongrrl | Theory   | 2010-01-12 05:00:07.02589+00  | Hi yourself.

I hereby dub this “The Entity’s Something-est” pattern (I’m certain someone else has already come up with a good name for it, but this will do). I can see it working any place requiring the highest, lowest, latest, earliest, or something else-est item from each of a list of entities. Perhaps the latest headline from every news source:

WITH t AS (
    SELECT source, headline, dateline, row_number() OVER (
               PARTITION BY source ORDER BY dateline DESC
           ) AS rnum
      FROM news
) SELECT source, headline, dateline
    FROM t
   WHERE rnum = 1;

Or perhaps the lowest score for each basketball team over the course of a season:

WITH t AS (
    SELECT team, date, score, row_number() OVER (
               PARTITION BY team ORDER BY score
           ) AS rnum
      FROM games
) SELECT team, date, score
    FROM t
   WHERE rnum = 1;

Easy! How have you handled a situation like this in your database hacking?


RFC: PostgreSQL Add-on Network

I’ve posted a plan to implement PGAN, a CPAN for PostgreSQL extensions. I’ve tried to closely follow the CPAN philosophy to come up with a plan that requires a minimum-work implementation that builds on the existing PostgreSQL tools and the examples of the CPAN and JSAN. My hope is that it’s full of JFDI! I would be very grateful for feedback and suggestions.


Enforcing a Set of Values

Enumerate Me

I love enums. They’re a terrific way to quickly create self-documenting data types that represent a set of values, and the nice thing is that the underlying values are stored as integers, making them very space- and performance-efficient. A typical example might be a workflow approval process for publishing magazine articles. You create it like so:

CREATE TYPE article_states AS ENUM (
    'draft', 'copy', 'approved', 'published'
);

Nice: we now have a simple data type that’s self-documenting. An important feature of enums is that values sort in the same order in which their labels were declared. For a workflow such as this, that makes a lot of sense, because the workflow states are inherently ordered: “draft” comes before “copy” and so on.
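To see that ordering in action, here’s a quick sketch; the articles table and its rows are hypothetical:

```sql
CREATE TABLE articles (
    title TEXT NOT NULL,
    state article_states NOT NULL DEFAULT 'draft'
);

INSERT INTO articles (title, state)
VALUES ('Enums Explained',     'published'),
       ('Domains Demystified', 'draft'),
       ('Constraint Yourself', 'copy');

-- Sorts in declaration order (draft, copy, approved, published),
-- not alphabetically:
SELECT title, state FROM articles ORDER BY state;
```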

Unfortunately, enums aren’t a panacea. I would use them all over the place if I could, but, alas, the value-set data types I need tend not to have an inherent order other than the collation order of their text. For example, say that we need a table describing people’s faces. Using an enum to manage eye colors might look something like this:

CREATE TYPE eye_color AS ENUM ( 'blue', 'green', 'brown' );

CREATE TABLE faces (
    face_id SERIAL PRIMARY KEY,
    name TEXT NOT NULL DEFAULT '',
    eye_color eye_color NOT NULL
);

Nice, huh? So let’s insert a few values and see what it looks like:

INSERT INTO faces (name, eye_color)
VALUES ('David', 'blue' ),
       ('Julie', 'green' ),
       ('Anna', 'blue' ),
       ('Noriko', 'brown' )
;

Now let’s look at the data ordered by the enum:

% SELECT name, eye_color FROM faces ORDER BY eye_color;
  name  | eye_color 
--------+-----------
 David  | blue
 Anna   | blue
 Julie  | green
 Noriko | brown

Hrm. That’s not good. I forgot to put “green” after “brown” when I created the enum. Oh, and I forgot the color “hazel”:

% INSERT INTO faces (name, eye_color) VALUES ('Kat', 'hazel' );
ERROR:  invalid input value for enum eye_color: "hazel"

Well, nice to know that it’s enforced, and that message is really helpful. But the real problem is that we run into the inherent ordering of enum labels, and now we need to adjust the enum to meet our needs. Here’s how to do it:

ALTER TABLE faces RENAME eye_color TO eye_color_tmp;
ALTER TABLE faces ALTER eye_color_tmp TYPE TEXT;
DROP TYPE eye_color;
CREATE TYPE eye_color AS ENUM ( 'blue', 'brown', 'green', 'hazel' );
ALTER TABLE faces ADD eye_color eye_color;
UPDATE faces SET eye_color = eye_color_tmp::eye_color;
ALTER TABLE faces ALTER eye_color SET NOT NULL;
ALTER TABLE faces DROP COLUMN eye_color_tmp;

Yikes! I have to rename the column, change its type to TEXT, drop the enum, create a new enum, and then copy all of the data into the new column before finally dropping the old column. If I have a lot of data, this will not be very efficient, requiring that every single row be rewritten. Still, it does work:

% INSERT INTO faces (name, eye_color) VALUES ('Kat', 'hazel' );
% SELECT name, eye_color FROM faces ORDER BY eye_color;
  name  | eye_color 
--------+-----------
 David  | blue
 Anna   | blue
 Noriko | brown
 Julie  | green
 Kat    | hazel

The upshot is that enums are terrific if you have a very well-defined set of values that are inherently ordered (or where order is not important) and that are extremely unlikely to change. Perhaps someday PostgreSQL will have a more robust ALTER TYPE that allows enums to be more efficiently reorganized, but even then it seems likely that re-ordering values will require a table rewrite.

Lookup to Me

Another approach to handling a type as a set of values is to take advantage of the relational model and store the values in a table. Going with the faces example, it looks like this:

CREATE TABLE eye_colors (
    eye_color TEXT PRIMARY KEY
);

INSERT INTO eye_colors VALUES ('blue'), ('green'), ('brown');

CREATE TABLE faces (
    face_id SERIAL PRIMARY KEY,
    name TEXT NOT NULL DEFAULT '',
    eye_color TEXT REFERENCES eye_colors(eye_color)
);

We can use this table much as we did before:

INSERT INTO faces (name, eye_color)
VALUES ('David', 'blue' ),
       ('Julie', 'green' ),
       ('Anna', 'blue' ),
       ('Noriko', 'brown' )
;

And of course we can get the rows back properly ordered by eye_color, unlike the original enum example:

% SELECT name, eye_color FROM faces ORDER BY eye_color;
  name  | eye_color 
--------+-----------
 David  | blue
 Anna   | blue
 Noriko | brown
 Julie  | green

Cool! But there are a couple of downsides. One is that you’re adding a bit of I/O overhead to every update. Most likely you won’t have very many values in the eye_colors table, so given PostgreSQL’s caching, this isn’t a big deal. A bigger deal is error handling:

% INSERT INTO faces (name, eye_color) VALUES ('Kat', 'hazel' );
ERROR:  insert or update on table "faces" violates foreign key constraint "faces_eye_color_fkey"

That’s not an incredibly useful error message. One might ask, without knowing the schema, what an eye color has to do with a foreign key constraint. At least looking at the tables can tell you a bit more:

% \dt
          List of relations
 Schema |    Name    | Type  | Owner 
--------+------------+-------+-------
 public | eye_colors | table | david
 public | faces      | table | david

A quick look at the eye_colors table will tell you what’s going on, and you can figure out that you just need to add a new row:

INSERT INTO eye_colors VALUES ('hazel');
INSERT INTO faces (name, eye_color) VALUES ('Kat', 'hazel' );

So it is self-documenting, but unlike enums it doesn’t do a great job of it. Plus if you have a bunch of set-constrained value types, you can end up with a whole slew of lookup tables. This can make it harder to sort the important tables that contain actual business data from those that are just lookup tables, because there is nothing inherent in them to tell the difference. You could put them into a separate schema, of course, but still, it’s not exactly intuitive.
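If you do go the lookup-table route, one way to make that distinction explicit is to corral the lookup tables in a dedicated schema; the lookups schema name here is just an example:

```sql
-- Move the lookup table into its own schema; the foreign key
-- constraint follows the table automatically.
CREATE SCHEMA lookups;
ALTER TABLE eye_colors SET SCHEMA lookups;

-- New references must then qualify the name:
--     eye_color TEXT REFERENCES lookups.eye_colors(eye_color)
```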

Given these downsides, I’m not a big fan of using lookup tables for managing what is in fact a simple list of allowed values for a particular column unless those values change frequently. So what else can we do?

Constrain Me

A third approach is to use a table constraint, like so:

CREATE TABLE faces (
    face_id SERIAL PRIMARY KEY,
    name TEXT NOT NULL DEFAULT '',
    eye_color TEXT NOT NULL,
    CONSTRAINT valid_eye_colors CHECK (
        eye_color IN ( 'blue', 'green', 'brown' )
    )
);

No lookup table, no inherent ENUM ordering. And in regular usage it works just like the lookup table example. The usual INSERT and SELECT once again yields:

% SELECT name, eye_color FROM faces ORDER BY eye_color;
  name  | eye_color 
--------+-----------
 David  | blue
 Anna   | blue
 Noriko | brown
 Julie  | green

The error message, however, is a bit more helpful:

% INSERT INTO faces (name, eye_color) VALUES ('Kat', 'hazel' );
ERROR:  new row for relation "faces" violates check constraint "valid_eye_colors"

A check constraint violation on eye_color is much more informative than a foreign key constraint violation. The downside to a check constraint, however, is that it’s not as self-documenting. You have to look at the entire table in order to find the constraint:

% \d faces
                             Table "public.faces"
  Column   |  Type   |                        Modifiers                        
-----------+---------+---------------------------------------------------------
 face_id   | integer | not null default nextval('faces_face_id_seq'::regclass)
 name      | text    | not null default ''::text
 eye_color | text    | not null
Indexes:
    "faces_pkey" PRIMARY KEY, btree (face_id)
Check constraints:
    "valid_eye_colors" CHECK (eye_color = ANY (ARRAY['blue', 'green', 'brown']))

There it is at the bottom. Kind of tucked away there, eh? At least now we can change it. Here’s how:

ALTER TABLE faces DROP CONSTRAINT valid_eye_colors;
ALTER TABLE faces ADD CONSTRAINT valid_eye_colors CHECK (
    eye_color IN ( 'blue', 'green', 'brown', 'hazel' )
);

Not as straight-forward as updating the lookup table, and much less efficient (because PostgreSQL must validate that existing rows don’t violate the constraint before committing the constraint). But it’s pretty simple and at least doesn’t require the entire table be UPDATEd as with enums. For occasional changes to the value list, a table scan is not a bad tradeoff. And of course, once that’s done, it just works:

INSERT INTO faces (name, eye_color) VALUES ('Kat', 'hazel' );

So this is almost perfect for our needs. Only poor documentation persists as an issue.

This is My Domain

To solve that problem, switch to domains. A domain is simply a custom data type that inherits behavior from another data type and to which one or more constraints can be added. It’s pretty simple to switch from the table constraint to a domain:

CREATE DOMAIN eye_color AS TEXT
CONSTRAINT valid_eye_colors CHECK (
    VALUE IN ( 'blue', 'green', 'brown' )
);

CREATE TABLE faces (
    face_id SERIAL PRIMARY KEY,
    name TEXT NOT NULL DEFAULT '',
    eye_color eye_color NOT NULL
);

Nice table declaration, eh? Very clean. Looks exactly like the enum example, in fact. And it works as well as the table constraint:

% SELECT name, eye_color FROM faces ORDER BY eye_color;
  name  | eye_color 
--------+-----------
 David  | blue
 Anna   | blue
 Noriko | brown
 Julie  | green

A constraint violation is a bit more useful than with the table constraint:

% INSERT INTO faces (name, eye_color) VALUES ('Kat', 'hazel' );
ERROR:  value for domain eye_color violates check constraint "valid_eye_colors"

This points directly to the domain. It’d be nice if it mentioned the violating value the way the enum error did, but at least we can look at the domain like so:

% \dD eye_color
                                  List of domains
 Schema |   Name    | Type | Modifier |                         Check                          
--------+-----------+------+----------+--------------------------------------------------------
 public | eye_color | text |          | CHECK (VALUE = ANY (ARRAY['blue', 'green', 'brown']))

None of the superfluous stuff about the entire table to deal with, just the constraint, thank you very much. Changing it is just as easy as changing the table constraint:

ALTER DOMAIN eye_color DROP CONSTRAINT valid_eye_colors;
ALTER DOMAIN eye_color ADD CONSTRAINT valid_eye_colors CHECK (
    VALUE IN ( 'blue', 'green', 'brown', 'hazel' )
);

Yep, you can alter domains just as you can alter tables. And of course now it will work:

INSERT INTO faces (name, eye_color) VALUES ('Kat', 'hazel' );

And as usual the data is well-ordered when we need it to be:

% SELECT name, eye_color FROM faces ORDER BY eye_color;
  name  | eye_color 
--------+-----------
 David  | blue
 Anna   | blue
 Noriko | brown
 Julie  | green
 Kat    | hazel

And as an added bonus, if you happened to need an eye color in another table, you can just use the same domain and get all the proper semantics. Sweet!
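For example, a second table (hypothetical here) can simply declare a column of the domain type, and the same constraint applies with no further declaration:

```sql
-- Hypothetical second table reusing the eye_color domain.
CREATE TABLE avatars (
    avatar_id SERIAL PRIMARY KEY,
    face_id   INTEGER NOT NULL REFERENCES faces(face_id),
    eye_color eye_color NOT NULL
);

-- This would fail with the same check constraint violation as before:
--     INSERT INTO avatars (face_id, eye_color) VALUES (1, 'chartreuse');
```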

Color Me Happy

Someday I’d love to see support for a PostgreSQL feature like enums, but allowing an arbitrary list of strings that are ordered by the contents of the text rather than the order in which they were declared, and that’s efficient to update. Maybe it could use integers for the underlying storage, too, and allow values to be modified without a table rewrite. Such would be the ideal for this use case. Hell, I’d find it much more useful than enums.

But domains get us pretty close to that without too much effort, so maybe it’s not that important. I’ve tried all of the above approaches and discussed them quite a lot with my colleagues before settling on domains, and I’m quite pleased with them. The only caveat I’d offer is that they’re not to be used lightly. If the value set is likely to change fairly often (at least once a week, say), then you’d be better off with the lookup table.

In short, I recommend:

  • For an inherently ordered set of values that’s extremely unlikely to ever change, use an enum.
  • For a set of values that won’t often change and has no inherent ordering, use a domain.
  • For a set of values that changes often, use a lookup table.

What do you use to constrain a column to a defined set of unordered values?


Quest for PostgreSQL Project Hosting

The pgTAP project is currently hosted by pgFoundry. This is an old version of GForge, and from what I understand, highly modified for the PostgreSQL project. That’s fine, except that it apparently makes it impossible for anyone to find the tuits to upgrade it to newer versions.

And it needs upgrading. One annoying thing I noticed is that the URLs for release files include an integer in them. For example, the URL to download pgTAP 0.23 is http://pgfoundry.org/frs/download.php/2511/pgtap-0.23.tar.bz2. See the “2511” there? It appears to be a primary key value or something, but it’s completely irrelevant for a release URL. I would much prefer that the URL be something like http://pgfoundry.org/frs/download.php/pgtap-0.23.tar.bz2 or, even better, http://pgfoundry.org/projects/pgtap/frs/pgtap-0.23.tar.bz2. But such is not the case now.

Another issue is hosting. I’ve registered pgtap.org to use for hosting the pgTAP Web site, but there is no support for pointing a hostname at a pgFoundry/GForge site.

These issues could of course be worked out if someone had the tuits to take them on, but apparently there is no one. So I’m looking to move.

The question is, where to? I could get a paid GitHub account (the pgTAP source is already on GitHub) and be able to have a pgTAP site on pgtap.org from there, so that’s a plus. And I can do file releases, too, in which case the URL format would be something like http://cloud.github.com/downloads/theory/pgtap/pgtap-0.23.tar.bz2, which isn’t ideal, but is a hell of a lot better than a URL with a sequence number in it. I could put them on the hosted site, too, in which case they’d have whatever URL I wanted them to have.

There are only two downsides I can think of to moving to GitHub:

  1. No mail list support. The pgTAP mail list has next to no traffic so far, so I’m not sure this is a big deal. I could also set up a list elsewhere, like Librelist, if I really needed one. I’d prefer to have @pgtap.org mail lists, but it’s not a big deal.

  2. I would lose whatever community presence I gain from hosting on pgFoundry. I know that when I release a Perl module to CPAN that it will be visible to lots of people in the Perl community, and automatically searchable via search.cpan.org and other tools. A CPAN release is a release to the Perl community.

    There is nothing like this for PostgreSQL. pgFoundry is the closest thing, and, frankly, nowhere near as good (pgFoundry’s search rankings have always stunk). So if I were to remove my projects from pgFoundry, how could I make them visible to the community? Is there any other central repository of or searchable list of third-party PostgreSQL offerings?

So I’m looking for advice. Does having an email list matter? If I can get pgTAP announcements included in the PostgreSQL Weekly News, is that enough community visibility? Do you know of a nice project hosting site that offers hosting, mail lists, download mirroring and custom domain handling?

I’ll follow up with a summary of what I’ve found in a later post.


JPUG Talk Posted

No Perl content today, I’m afraid. I’m just back from my trip to Japan and wanted to post this very nice video of my talk [Update 2018: Sadly gone now]. Unlike the versions from other conferences, this one focuses primarily on the slides, with me appearing in audio only. This makes it really easy to follow. Enjoy.

Looking for the comments? Try the old layout.

More about…

My Catalyst Tutorial: Add Authors to the View

Another post in my ongoing series on using Catalyst with Template::Declare and DBIx::Connector. This will be the last post covering material from chapter 3, I promise. This is a fun one, though, because we continue to use this really nice DSL called “SQL,” which I think is more expressive than an ORM would be.

To wit, the next task is to add the missing list of authors to the book list. The thing is, the more I work with databases, the more I’m inclined to think about them not only as the “M” in “MVC”, but also the “V”. I’ll show you what I mean.

A Quick Fix

But first, a diversion. In the second post in this series, I created an SQL statement to insert book authors, but I made a mistake: the values for surnames and given names were reversed. Oops. Furthermore, I included explicit author IDs, even though the id column uses a sequence for its default value. So first we need to fix these issues. Change the INSERT INTO authors statement in sql/001-books.sql to:

INSERT INTO authors (surname, given_name)
VALUES ('Bastien',      'Greg'),
       ('Nasseh',       'Sara'),
       ('Degu',         'Christian'),
       ('Stevens',      'Richard'),
       ('Comer',        'Douglas'),
       ('Christiansen', 'Tom'),
       ('Torkington',   'Nathan'),
       ('Zeldman',      'Jeffrey')
;

This time, we’re letting the sequence populate the id column. Fortunately, it starts from 1 just like we did, so we don’t need to update the values in the INSERT INTO book_author statement. Now let’s fix the database:

DELETE FROM book_author;
DELETE FROM authors;

Then run the above SQL query to restore the authors with their proper names, and then run the INSERT INTO book_author statement. That will get us back in business.

Constructing our Query

Now it’s time for the fun. The original SQL query we wrote to get the list of books was:

SELECT isbn, title, rating FROM books;

Nothing unusual there. But to get at the authors, we need to join to book_author and from there to authors. Our first cut looks like this:

SELECT b.isbn, b.title, b.rating, a.surname
  FROM books       b
  JOIN book_author ba ON b.isbn       = ba.isbn
  JOIN authors     a  ON ba.author_id = a.id;

Which yields this data:

       isbn        |               title                | rating |   surname    
-------------------+------------------------------------+--------+--------------
 978-1-58720-153-0 | CCSP SNRS Exam Certification Guide |      5 | Bastien
 978-1-58720-153-0 | CCSP SNRS Exam Certification Guide |      5 | Nasseh
 978-1-58720-153-0 | CCSP SNRS Exam Certification Guide |      5 | Degu
 978-0-201-63346-7 | TCP/IP Illustrated, Volume 1       |      5 | Stevens
 978-0-13-018380-4 | Internetworking with TCP/IP Vol.1  |      4 | Comer
 978-1-56592-243-3 | Perl Cookbook                      |      5 | Christiansen
 978-1-56592-243-3 | Perl Cookbook                      |      5 | Torkington
 978-0-7357-1201-0 | Designing with Web Standards       |      5 | Zeldman

Good start, but note how we now have three rows for “CCSP SNRS Exam Certification Guide” and two for “Perl Cookbook”. We could of course modify our Perl code to look at the ISBN in each row and combine as appropriate, but it’s better to get the database to do that work, since it’s designed for that sort of thing. So let’s use an aggregate function to combine the values over multiple rows into a single row. All we have to do is use the column that changes (surname) in an aggregate function and tell PostgreSQL to use the other columns to group rows into one. PostgreSQL 8.4 introduces a really nice aggregate function, array_agg(), for pulling a series of strings together into an array. Let’s put it to use:

SELECT b.isbn, b.title, b.rating, array_agg(a.surname) as authors
  FROM books       b
  JOIN book_author ba ON b.isbn     = ba.isbn
  JOIN authors     a  ON ba.author_id = a.id
 GROUP BY b.isbn, b.title, b.rating;

Now the output is:

       isbn        |               title                | rating |          authors         
-------------------+------------------------------------+--------+--------------------------
 978-0-201-63346-7 | TCP/IP Illustrated, Volume 1       |      5 | {Stevens}
 978-0-13-018380-4 | Internetworking with TCP/IP Vol.1  |      4 | {Comer}
 978-1-56592-243-3 | Perl Cookbook                      |      5 | {Christiansen,Torkington}
 978-1-58720-153-0 | CCSP SNRS Exam Certification Guide |      5 | {Bastien,Nasseh,Degu}
 978-0-7357-1201-0 | Designing with Web Standards       |      5 | {Zeldman}

Much better. We now have a single row for each book, and the authors are all grouped into a single column. Cool. But we can go one step further. Although we could use Perl to turn the array of author surnames into a comma-delimited string, there’s a PostgreSQL function for that, too: array_to_string(). Check it out:

SELECT b.isbn, b.title, b.rating,
       array_to_string(array_agg(a.surname), ', ') as authors
  FROM books       b
  JOIN book_author ba ON b.isbn     = ba.isbn
  JOIN authors     a  ON ba.author_id = a.id
 GROUP BY b.isbn, b.title, b.rating;

Now the rows will be:

       isbn        |               title                | rating |          authors          
-------------------+------------------------------------+--------+--------------------------
 978-0-201-63346-7 | TCP/IP Illustrated, Volume 1       |      5 | Stevens
 978-0-13-018380-4 | Internetworking with TCP/IP Vol.1  |      4 | Comer
 978-1-56592-243-3 | Perl Cookbook                      |      5 | Christiansen, Torkington
 978-1-58720-153-0 | CCSP SNRS Exam Certification Guide |      5 | Bastien, Nasseh, Degu
 978-0-7357-1201-0 | Designing with Web Standards       |      5 | Zeldman

Create a Database View

Cool! All the formatting work is done! But since we’ll likely often need to fetch book titles along with their authors, let’s create an SQL view for this query. That way, we don’t have to write the same SQL in different places in the application: we can just use the view. So create a new file, sql/002-books_with_authors.sql, and add this SQL:

CREATE VIEW books_with_authors AS
SELECT b.isbn, b.title, b.rating,
       array_to_string(array_agg(a.surname), ', ') as authors
  FROM books       b
  JOIN book_author ba ON b.isbn     = ba.isbn
  JOIN authors     a  ON ba.author_id = a.id
 GROUP BY b.isbn, b.title, b.rating;

Now install this view in the database:

psql -U postgres -d myapp -f sql/002-books_with_authors.sql

And now we can make use of the view any time we want and get the results of the full query. It’s time to do that in our controller. Edit lib/MyApp/Controller/Books.pm and change this line in the list action:

my $sth = $_->prepare('SELECT isbn, title, rating FROM books');

To:

my $sth = $_->prepare(q{
    SELECT isbn, title, rating, authors FROM books_with_authors
});

The use of the q{} operator is a style I use for SQL queries in Perl code; you can use whatever style you like. Since this is a very short SQL statement (thanks to the view), it’s not really necessary to have it on multiple lines, but I like to be fairly consistent about this sort of thing.

The last thing we need to do is make a very simple change to the list template in lib/MyApp/Templates/HTML/Books.pm. In previous posts, I was referring to the non-existent “author” key in each hash reference fetched from the database. In the new view, however, I’ve named that column “authors”. So change this line:

cell { $book->{author} };

To:

cell { $book->{authors} };

And that’s it. Restart the server and reload http://localhost:3000/books/list and you should now see all of the books listed with their authors.

Notes

I think you can appreciate why, to a certain degree, I’m starting to think of the database as handling both the “M” and the “V” in “MVC”. It’s no mistake that the database object we created is known as a “view”. It was written in such a way that it not only expressed the relationship between books and authors in a compact but clear way, but it formatted the appropriate data for publishing on the site—all in a single, efficient query. All the Template::Declare view does is wrap it all up in the appropriate HTML.

PostgreSQL isn’t the only database to support features such as this, by the way. All of the databases I’ve used support views, and many offer useful aggregate functions, as well. Among the MySQL aggregates, for example, is group_concat(), which sort of combines the array_to_string(array_agg()) PostgreSQL syntax into a single function. And I’ve personally written a custom aggregate for SQLite in Perl. So although I use PostgreSQL for these examples and make use of its functionality, you can do much the same thing in most other databases.
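As a concrete illustration, here’s a small sketch of that group_concat() rollup, written in Python with the bundled sqlite3 module so it’s self-contained and runnable; the table and data are invented for the demo:

```python
# SQLite's group_concat() rolling author surnames up into one column,
# much like array_to_string(array_agg(...)) does in PostgreSQL.
# Demo-only schema and data.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE book_authors (title TEXT, surname TEXT);
    INSERT INTO book_authors VALUES
        ('Perl Cookbook', 'Christiansen'),
        ('Perl Cookbook', 'Torkington'),
        ('Designing with Web Standards', 'Zeldman');
""")
result = dict(db.execute("""
    SELECT title, group_concat(surname, ', ') AS authors
      FROM book_authors
     GROUP BY title
""").fetchall())
print(result)
```

Note that SQLite makes no promise about the concatenation order within a group, whereas in PostgreSQL you can control it by ordering the input to the aggregate.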

Either way, I find this to be a lot less work than using an ORM or other abstraction layer between my app and the database. Frankly, SQL provides just the right level of abstraction.

Looking for the comments? Try the old layout.

pgTAP Best Practices Slides Available

Last month I gave two presentations at the PostgreSQL Conference West. The first was a variation on Unit Test Your Database!, which I’ve now given about six times (and will give at least two more times, including tomorrow night for Portland PLUG and in two weeks at the JPUG 10th Anniversary Conference). The second was a new talk, a 90-minute tutorial, called “pgTAP Best Practices.” And here it is, published on Slideshare. Enjoy.

You can also download a 76MB PDF if that’s more your style.

Enjoy!

Looking for the comments? Try the old layout.

Catalyst with DBIx::Connector and Template::Declare

Following up on my post yesterday introducing Catalyst::View::TD, today I’d like to continue with the next step in chapter 3 of the Catalyst tutorial. The twist here is that I’m going to use PostgreSQL for the database back-end and start introducing some database best practices. I’m also going to make use of my DBIx::Connector module to interact with the database.

Create the Database

Picking up with the database creation section of the tutorial, the first change I’d like to make is to use a natural key for the books table. All books have unique identifiers, known as ISBNs, so creating a surrogate key (the typical id column in ORM-managed applications) is redundant. One of the nice things about PostgreSQL is that it ships with a contributed library, isn, which validates ISBN and other international identifiers. So we use this contrib module (usually included in package-installed PostgreSQL servers) for the primary key for books. If you need to install it from source, it’s pretty easy:

cd postgresql-8.4.1/contrib/isn
make
make install

Ideally I’d use a natural key for the authors table too, but despite some attempts to create universal identifiers for authors, nothing has really caught on as far as I know. So I’ll just stick to a surrogate key for now.

First step: create the database and install isn if it’s not already included in the template database:

createdb -U postgres myapp
psql -U postgres -d myapp -f /usr/local/pgsql/share/contrib/isn.sql

The isn.sql file may be somewhere else on your system. Now let’s create the tables. Create sql/001-books.sql in the MyApp directory and paste this into it:

BEGIN;

CREATE TABLE books (
    isbn   ISBN13   PRIMARY KEY,
    title  TEXT     NOT NULL DEFAULT '',
    rating SMALLINT NOT NULL DEFAULT 0 CHECK (rating BETWEEN 0 AND 5)
);

CREATE TABLE authors (
    id         BIGSERIAL PRIMARY KEY,
    surname    TEXT NOT NULL DEFAULT '',
    given_name TEXT NOT NULL DEFAULT ''
);

CREATE TABLE book_author (
    isbn       ISBN13 REFERENCES books(isbn),
    author_id  BIGINT REFERENCES authors(id),
    PRIMARY KEY (isbn, author_id)
);

INSERT INTO books
VALUES ('1587201534',        'CCSP SNRS Exam Certification Guide', 5),
       ('978-0201633467',    'TCP/IP Illustrated, Volume 1',       5),
       ('978-0130183804',    'Internetworking with TCP/IP Vol.1',  4),
       ('978-1-56592-243-3', 'Perl Cookbook',                      5),
       ('978-0735712010',    'Designing with Web Standards',       5)
;

INSERT INTO authors
VALUES (1, 'Greg',      'Bastien'),
       (2, 'Sara',      'Nasseh'),
       (3, 'Christian', 'Degu'),
       (4, 'Richard',   'Stevens'),
       (5, 'Douglas',   'Comer'),
       (6, 'Tom',       'Christiansen'),
       (7, 'Nathan',    'Torkington'),
       (8, 'Jeffrey',   'Zeldman')
;

INSERT INTO book_author
VALUES ('1587201534',        1),
       ('1587201534',        2),
       ('1587201534',        3),
       ('978-0201633467',    4),
       ('978-0130183804',    5),
       ('978-1-56592-243-3', 6),
       ('978-1-56592-243-3', 7),
       ('978-0735712010',    8)
;

COMMIT;

Yeah, I Googled for the ISBNs for those books. I found the ISBN-13 number for most of them, but the isn module handles the old ISBN-10 format, too, automatically upgrading it to ISBN-13. I also added a CHECK constraint for the rating column, to be sure that the value is always BETWEEN 0 AND 5. I also like to include default values where it’s sensible to do so, and that syntax for inserting multiple rows at once is pretty nice to have.
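For the curious, that ISBN-10-to-ISBN-13 upgrade is purely mechanical: prepend the 978 “Bookland” prefix, drop the old check digit, and recompute a new one with the EAN-13 rule. The isn type does all this for you; here’s a rough Python sketch of the idea, for illustration only:

```python
# Upgrade an ISBN-10 to ISBN-13: prefix with 978, drop the ISBN-10
# check digit, and recompute the EAN-13 check digit (alternating
# weights of 1 and 3).
def isbn10_to_isbn13(isbn10: str) -> str:
    # Drop hyphens and the trailing ISBN-10 check digit.
    digits = "978" + isbn10.replace("-", "")[:9]
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    check = (10 - total % 10) % 10
    return digits + str(check)

print(isbn10_to_isbn13("1587201534"))  # → 9781587201530
```

That matches the 978-1-58720-153-0 that psql displays for the CCSP book below.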

Go ahead and run this against your database:

psql -U postgres -d myapp -f sql/001-books.sql

Now if you connect to the server, you should be able to query things like so:

$ psql -U postgres myapp
psql (8.4.1)
Type "help" for help.

myapp=# select * from books;
       isbn        |               title                | rating 
-------------------+------------------------------------+--------
 978-1-58720-153-0 | CCSP SNRS Exam Certification Guide |      5
 978-0-201-63346-7 | TCP/IP Illustrated, Volume 1       |      5
 978-0-13-018380-4 | Internetworking with TCP/IP Vol.1  |      4
 978-1-56592-243-3 | Perl Cookbook                      |      5
 978-0-7357-1201-0 | Designing with Web Standards       |      5
(5 rows)

Setup the Database Connection

Great! The database is set. Now we need a way for the app to talk to it. I’ve not yet decided how I’m going to integrate DBIx::Connector into a Catalyst model class; maybe I’ll figure it out as I write these posts. But since my mantra is “the database is the model,” for now I won’t bother with a model at all. Instead, I’ll create a simple accessor in MyApp so we can easily get at the database connection wherever we need it. To do that, add these lines to lib/MyApp.pm:

use Moose;
use DBIx::Connector;
use Exception::Class::DBI;

has conn => (is => 'ro', lazy => 1, default => sub {
    DBIx::Connector->new( 'dbi:Pg:dbname=myapp', 'postgres', '', {
        PrintError     => 0,
        RaiseError     => 0,
        HandleError    => Exception::Class::DBI->handler,
        AutoCommit     => 1,
        pg_enable_utf8 => 1,
    });
});

We load Moose to get the has keyword, the officially sanctioned interface for defining attributes in Catalyst classes. Then I use that keyword to create the conn attribute. This attribute is read-only and has a DBIx::Connector object for its default value. The nice thing about this is that the DBIx::Connector object won’t be instantiated until it’s actually needed, and then it will be kept forever. We never have to do anything else to use it.
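The same lazy-build-then-cache pattern shows up in other languages, too. For instance, here’s a rough Python analogue using the standard library’s functools.cached_property; the App class and the stand-in connection value are invented for the demo:

```python
# Lazy attribute: the "connection" isn't built until first access,
# and the very same object is returned on every subsequent access.
from functools import cached_property

class App:
    @cached_property
    def conn(self):
        # Stand-in for an expensive constructor like DBIx::Connector->new.
        return {"dsn": "dbi:Pg:dbname=myapp"}

app = App()
assert app.conn is app.conn  # built once, cached forever
```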

Oh, and I like to make sure that text data coming back from PostgreSQL is properly encoded as UTF-8, and I like to use Exception::Class::DBI to turn DBI errors into exception objects.

Now it’s time to update our controller and template to fetch actual data from the database. Edit lib/MyApp/Controller/Books.pm and change the list method to:

sub list : Local {
    my ($self, $c) = @_;
    $c->stash->{books} = $c->conn->run(fixup => sub {
        my $sth = $_->prepare('SELECT isbn, title, rating FROM books');
        $sth->execute;
        $sth;
    });
}

All we’re doing here is creating a statement handle for the query, executing the query, and storing the statement handle in the stash. Now we need to update the template to use the statement handle. Open up lib/MyApp/Templates/HTML/Books.pm and change the list template to:

template list => sub {
    my ($self, $args) = @_;
    table {
        row {
            th { 'Title'  };
            th { 'Rating' };
            th { 'Author' };
        };
        my $sth = $args->{books};
        while (my $book = $sth->fetchrow_hashref) {
            row {
                cell { $book->{title}  };
                cell { $book->{rating} };
                cell { $book->{author} };
            };
        };
    };
};

All we do is fetch each row from the statement handle and output it. The only thing that’s changed is the use of the statement handle as an iterator rather than an array reference.

And now we’re set! Restart your server with script/myapp_server.pl and point your browser at http://localhost:3000/books/list. Now you should see the book titles and ratings, though the authors still aren’t present. We’ll fix that in a later post.

Takeaway

The takeaway from this post: Use PostgreSQL’s support for custom data types to create validated natural keys for your data, and use a stable, persistent database connection to talk directly to the database. No need for an ORM here, as the DBI provides very Perlish access to a very capable DSL for models called SQL.

More soon.

Looking for the comments? Try the old layout.

Pg::Priv Hits CPAN (Thanks Etsy!)

Earlier this year, I was working on an administrative utility for Etsy that validates PostgreSQL database permissions. Of course, in order to verify that permissions were correct or needed updating, I had to have a way to examine PostgreSQL ACLs, which are arrays made up of strings that look like this:

my $acl = [
    'miriam=arwdDxt/miriam',
    '=r/miriam',
    'admin=arw/miriam',
];
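To make the format concrete: each string is grantee=privileges/grantor, an empty grantee means PUBLIC, and each letter abbreviates a privilege (the letters are documented on the GRANT reference page). Here’s a rough sketch of the decomposition, in Python so it can stand alone; Pg::Priv is the real implementation, and only the table-privilege letters are shown:

```python
# Decompose a PostgreSQL ACL item like 'miriam=arwdDxt/miriam'.
# Letter meanings per the GRANT documentation (table privileges only).
PRIVS = {
    'r': 'SELECT', 'w': 'UPDATE', 'a': 'INSERT', 'd': 'DELETE',
    'D': 'TRUNCATE', 'x': 'REFERENCES', 't': 'TRIGGER',
}

def parse_priv(item: str) -> dict:
    grantee, _, rest = item.partition('=')
    letters, _, grantor = rest.partition('/')
    return {
        'to': grantee or 'public',  # empty grantee means PUBLIC
        'by': grantor,
        # A trailing '*' marks grant option; ignore it in this sketch.
        'privileges': [PRIVS[l] for l in letters if l != '*'],
    }

print(parse_priv('=r/miriam'))
```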

So following the documentation, I wrote a module that iterates over an ACL, parses each privilege string, and returns an object describing it. Using it is pretty easy. If you wanted to see what the permissions looked like on all the tables in a database, you could do it like so:

#!/usr/bin/perl -w
use strict;
use warnings;
use DBI;
use Pg::Priv;

my $dbname = shift or die "Usage: $0 dbname\n";

my $dbh = DBI->connect("dbi:Pg:dbname=$dbname", 'postgres', '');
my $sth = $dbh->prepare(
    q{SELECT relname, relacl FROM pg_class WHERE relkind = 'r'}
);

$sth->execute;
print "Permissions for $dbname:\n";
while (my $row = $sth->fetchrow_hashref) {
    print "  Table $row->{relname}:\n";
    for my $priv ( Pg::Priv->parse_acl( $row->{relacl} ) ) {
        print '    ', $priv->by, ' granted to ', $priv->to, ': ',
            join( ', ', $priv->labels ), $/;
    }
}

And here’s what the output looks like:

Permissions for bric:
  Table media__output_channel:
    postgres granted to postgres: UPDATE, SELECT, INSERT, TRUNCATE, REFERENCE, DELETE, TRIGGER
    postgres granted to bric: UPDATE, SELECT, INSERT, DELETE
  Table media_uri:
    postgres granted to postgres: UPDATE, SELECT, INSERT, TRUNCATE, REFERENCE, DELETE, TRIGGER
    postgres granted to bric: UPDATE, SELECT, INSERT, DELETE
  Table media_fields:
    postgres granted to postgres: UPDATE, SELECT, INSERT, TRUNCATE, REFERENCE, DELETE, TRIGGER

There are a bunch of utility methods to make it pretty simple to examine PostgreSQL privileges.

And now, I’m pleased to announce the release yesterday of Pg::Priv. My thanks to Etsy for agreeing to the release, and particularly to Chad Dickerson for championing it. This module is a little thing compared to some things I’ve seen open-sourced by major players, but even the simplest utilities can save folks mountains of time. I hope you find Pg::Priv useful.

Looking for the comments? Try the old layout.

More about…

Unicode Normalization in SQL

I’ve been peripherally aware of the need for unicode normalization in my code for a while, but only got around to looking into it today. Although I use Encode to convert text inputs into Perl’s internal form and UTF-8 or an appropriate encoding in all my outputs, it does nothing about normalization.

What’s normalization you ask?

Well, Unicode allows some characters to be encoded in different ways. For example, é can be written as either “é”, which is a “precomposed character,” or as “é”, which is a combination of “e” and “́”. This is all well and good, but the trouble comes when you want to compare values. Observe this Perl one-liner:

% perl -le 'print "\x{00e9}" eq "\x{0065}\x{0301}" ? "t" : "f"'
f

The same issue exists in your database. Here’s an example from PostgreSQL:

try=# select U&'\00E9' = U&'\0065\0301';
 ?column? 
----------
 f
(1 row)

If you have a user enter data in your Web app using combining characters, and then another does a search with canonical characters, the search will fail. This won’t do at all.

The solution is to normalize your Unicode data. In Perl, you can use Unicode::Normalize, a C/XS module that uses Perl’s internal unicode tables to convert things as appropriate. For general use the NFC normalization is probably best:

use Unicode::Normalize;
$string = NFC $string;
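For comparison, the equivalent fix in Python uses the standard library’s unicodedata module; the two spellings of “é” compare unequal until both sides are normalized:

```python
# NFC normalization collapses "e" + combining acute accent into the
# single precomposed code point, after which comparison succeeds.
from unicodedata import normalize

precomposed = "\u00e9"   # é as one code point
combining   = "e\u0301"  # e followed by a combining acute accent

assert precomposed != combining
assert normalize("NFC", precomposed) == normalize("NFC", combining)
```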

PostgreSQL offers no normalization routines. However, the SQL standard mandates one (as of SQL 2008, at least). It looks like this:

<normalize function> ::= NORMALIZE <left paren> <character value expression> [ <comma> <normal form> [ <comma> <normalize function result length> ] ] <right paren>
<normal form> ::= NFC | NFD | NFKC | NFKD

The second argument defaults to NFC and the third, which specifies a maximum length of the return value, is optional. The fact that it looks like a function means that we can use PL/PerlU to emulate it in PostgreSQL until a proper implementation makes it into core. Here’s how:

CREATE OR REPLACE FUNCTION NORMALIZE(
    string TEXT,
    form   TEXT,
    maxlen INT
) RETURNS TEXT LANGUAGE plperlu AS $$
    use Unicode::Normalize 'normalize';
    my ($string, $form, $maxlen) = @_;
    my $ret = normalize($form, $string);
    elog(ERROR, 'Normalized value is too long') if length $ret > $maxlen;
    return $ret;
$$;

CREATE OR REPLACE FUNCTION NORMALIZE(
    string TEXT,
    form   TEXT
) RETURNS TEXT LANGUAGE plperlu AS $$
    use Unicode::Normalize 'normalize';
    return normalize($_[1], $_[0]);
$$;

CREATE OR REPLACE FUNCTION NORMALIZE(
    string TEXT
) RETURNS TEXT LANGUAGE plperlu AS $$
    use Unicode::Normalize 'normalize';
    return normalize('NFC', shift);
$$;

I wrote a few tests to make sure it was sane:

SELECT U&'\0065\0301' as combined,
       char_length(U&'\0065\0301'),
       NORMALIZE(U&'\0065\0301') as normalized,
       char_length(NORMALIZE(U&'\0065\0301'));

SELECT NORMALIZE(U&'\0065\0301', 'NFC')  AS NFC,
       NORMALIZE(U&'\0065\0301', 'NFD')  AS NFD,
       NORMALIZE(U&'\0065\0301', 'NFKC') AS NFKC,
       NORMALIZE(U&'\0065\0301', 'NFKD') AS NFKD
;

SELECT NORMALIZE(U&'\0065\0301', 'NFC', 1)  AS NFC,
       NORMALIZE(U&'\0065\0301', 'NFD', 2)  AS NFD,
       NORMALIZE(U&'\0065\0301', 'NFKC', 1) AS NFKC,
       NORMALIZE(U&'\0065\0301', 'NFKD', 2) AS NFKD;

SELECT NORMALIZE(U&'\0065\0301', 'NFD', 1);

And the output

 combined | char_length | normalized | char_length 
----------+-------------+------------+-------------
 é        |           2 | é          |           1
(1 row)

 nfc | nfd | nfkc | nfkd 
-----+-----+------+------
 é   | é   | é    | é
(1 row)

 nfc | nfd | nfkc | nfkd 
-----+-----+------+------
 é   | é   | é    | é
(1 row)

psql:try.sql:45: ERROR:  error from Perl function "normalize": Normalized value is too long at line 5.

Cool! So that’s fairly close to the standard. The main difference is that the form argument must be a string instead of a constant literal. But PostgreSQL would likely support both. The length argument is also a literal, and can be 10 characters or 64 bytes, but for our purposes, this is fine. The only downside to it is that it’s slow: PostgreSQL must convert its text value to a Perl string to pass to the function, and then Unicode::Normalize turns it into a C string again to do the conversion, then back to a Perl string which, in turn, is returned to PostgreSQL and converted back into the text form. Not the quickest process, but may prove useful anyway.

Update: 1 Hour Later

Note that this issue applies when using full text search, too. Alas, it does not normalize unicode characters for you:

try=# select to_tsvector(U&'\00E9clair') @@ to_tsquery(U&'\0065\0301clair');
 ?column? 
----------
 f
(1 row)

But normalizing with the functions I introduced does work:

try=# select to_tsvector(U&'\00E9clair') @@ to_tsquery(normalize(U&'\0065\0301clair'));
 ?column? 
----------
 t
(1 row)

So yes, this really can be an issue in your applications.

Looking for the comments? Try the old layout.

Enable CSV Logging in PostgreSQL

One of the cooler features of recent versions of PostgreSQL is support for CSV-formatted logging. I’ve never had a chance to use it, but after reading Josh’s cool hack for determining sums of concurrent queries using windowing functions in PostgreSQL 8.4 to query a table generated from a CSV log, I just had to give it a try. But while there is decent documentation for loading up the contents of a CSV-formatted log file, I couldn’t find simple information on how to set it up. So I figured it out and am recording it here for posterity.

Configuration is pretty straight-forward. You need to edit only two configuration directives (known as “GUCs” in PostgreSQL parlance) in your postgresql.conf file: log_destination and logging_collector. I just changed them from:

log_destination = 'stderr'
#logging_collector = off

To:

log_destination = 'csvlog'
logging_collector = on

Then all I had to do was cold-restart PostgreSQL; that is, stop it and start it again. I’m told that a mere reload won’t do, for security reasons. After that, I had a shiny new .csv log file in the pg_log subdirectory of my data directory. It looks like this:

2009-08-19 10:44:08.128 PDT,,,36596,,4a8c39e8.8ef4,1,,2009-08-19 10:44:08 PDT,,0,LOG,00000,"database system was shut down at 2009-08-19 10:44:06 PDT",,,,,,,,
2009-08-19 10:44:08.411 PDT,,,36573,,4a8c39e7.8edd,1,,2009-08-19 10:44:07 PDT,,0,LOG,00000,"database system is ready to accept connections",,,,,,,,
2009-08-19 10:44:08.412 PDT,,,36599,,4a8c39e8.8ef7,1,,2009-08-19 10:44:08 PDT,,0,LOG,00000,"autovacuum launcher started",,,,,,,,

Cool!
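Since it really is proper CSV, pulling fields out is trivial in any language with a CSV parser. For example, a quick Python sketch; the field positions follow the csvlog column list in the docs, where error_severity is the 12th column and message the 14th:

```python
# Parse one csvlog line and pull out the severity and message fields.
# The csv module handles the quoting around the message for us.
import csv
import io

log = io.StringIO(
    '2009-08-19 10:44:08.128 PDT,,,36596,,4a8c39e8.8ef4,1,,'
    '2009-08-19 10:44:08 PDT,,0,LOG,00000,'
    '"database system was shut down at 2009-08-19 10:44:06 PDT",,,,,,,,\n'
)
for row in csv.reader(log):
    timestamp, severity, message = row[0], row[11], row[13]
    print(severity, message)
```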

The only other thing I ought to note is that, because I removed the “stderr” value from the log_destination GUC, the old log location I used, specified in my start script, is no longer necessary. I was even using log rotation (in the default Mac OS X start script that ships with PostgreSQL), but that’s not necessary anymore, either. So I just turned that stuff off.

Now I have something to refer back to, and so do you. Enjoy!

Looking for the comments? Try the old layout.

More about…

pgTAP 0.22: Test Your Results

I’m pleased to announce that, after much thinking, committing, and not an insignificant amount of hair-pulling, pgTAP 0.22 has finally landed. Download it here. Many, many thanks to all who commented on my previous posts, made suggestions, and helped me on IRC to figure out how to get all this stuff to work. The crazy thing is that it does, quite well, all the way back to PostgreSQL 8.0.

So here’s what I’ve come up with: ten simple functions. Sound like a lot? Well, it’s simpler than it might at first appear. Here’s a quick overview:

results_eq()

Compares two queries row-for-row. Pass in strings with SQL queries, strings with prepared statement names, or cursors. If the query you’re testing returns a single column, the expected results can be passed as an array. If a test fails, you get useful diagnostics:

# Failed test 146
#     Results differ beginning at row 3:
#         have: (1,Anna)
#         want: (22,Betty)

If a row is missing, the diagnostics will show it as a NULL:

# Failed test 147
#     Results differ beginning at row 5:
#         have: (1,Anna)
#         want: NULL

results_ne()

Just like results_eq(), except that it tests that the results of the two arguments are *not* equivalent. May not be very useful, but it’s cute.

set_eq()

Tests that two queries return the same results, without regard to the order of the results or duplicates. Pass in strings with SQL queries or strings with prepared statement names. As with results_eq() the expected results can be passed as an array if the test query returns a single column. Failure diagnostics look like this:

# Failed test 146
#     Extra records:
#         (87,Jackson)
#         (1,Jacob)
#     Missing records:
#         (44,Anna)
#         (86,Angelina)

If the failure is due to incompatible column types, the diagnostics will help you out there, too:

# Failed test 147
#     Columns differ between queries:
#         have: (integer,text)
#         want: (text,integer)

set_ne()

The inverse of set_eq(), the test passes if the results of the two queries are different, without regard to order or duplicate rows. No diagnostics on failure, though; if it fails, it’s because the results are the same.

set_has()

Tests that a query contains a subset of results without regard to order or duplicates. Useful if you need to ensure that a query returns at least some set of rows. Failure diagnostics are useful again:

# Failed test 122
#     Missing records:
#         (44,Anna)
#         (86,Angelina)

set_hasnt()

Tests that a query does not contain a subset of results, without regard to order or duplicates.

bag_eq()

Just like set_eq(), except that duplicates matter. So if the first query has duplicate rows, the second must have the same dupes. Diagnostics are equally useful.

bag_ne()

Just like set_ne(), except that duplicates matter.

bag_has()

Just like set_has(), except that duplicates matter.

bag_hasnt()

Just like set_hasnt(), except that duplicates matter.

Be sure to look at my previous post for usage examples. Since I wrote it, I’ve also added the ability to pass an array as the second argument to these functions. This is specifically for the case when the query you’re testing returns a single column of results; the array just makes it easier to specify expected values in a common case:

SELECT results_eq(
    'SELECT * FROM active_user_ids()',
    ARRAY[ 2, 3, 4, 5]
);

Check the documentation for all the details on how to use these functions.

I’m really happy with these functions. It was definitely worth it to really think things through, look at prior art, and spend the time to try different approaches. In the process, I’ve found an approach that works in nearly all circumstances.

The one exception is in results_eq() and results_ne() on PostgreSQL 8.3 and down. The issue there is that there were no operators to compare two record objects before PostgreSQL 8.4. So for earlier versions, the code has to cast the records representing each row to text. This means that two rows can be different but appear to be the same to 8.3 and down. In practice this should be pretty rare, but I’m glad that record comparisons are more correct in 8.4.

The only other issue is performance. Although you can write your tests in SQL, rather than strings containing SQL, the set and bag functions use the PL/pgSQL EXECUTE statement to execute each SQL statement and insert it into a temporary table. Then they select the data from the temporary tables once or twice to do the comparisons. That’s a lot more processing than simply running the query itself, and it slows down the performance significantly.

Similarly, the results functions use cursors and fetch each row one-at-a-time. The nice thing is that, in the event of a failure for results_eq() or a pass for results_ne(), the functions can stop fetching results before reaching the end of the queries. But either way, a fair bit of processing goes on.

I’m not sure which is slower, the set and bag functions or the results functions, but, short of adding new syntax to SQL (not an option), I could see no other way to adequately do the comparisons and emit useful diagnostics.

But those are minor caveats, I think. I’m pretty pleased with the function names and the interfaces I’ve created for them. Please download the latest and let me know what you think.

So what’s next? Well, there are a few more schema-testing functions I’d like to add, but after that, I’d like to declare pgTAP stable and start using it in new projects. I’m thinking about writing a test suite for database normalization, starting with testing that all tables have primary keys.

But that’s after my vacation. Back in two weeks.

Looking for the comments? Try the old layout.