Just a Theory

pgTAP 0.16 in the Wild

I’ve been writing a lot of tests for a client in pgTAP lately. It’s given me a lot to think about in terms of features I need and best practices in writing tests. I’m pleased to say that, overall, it has been absolutely invaluable. I’m doing a lot of database refactoring, and having the safety of solid test coverage has been an absolute godsend. pgTAP has done a lot to free me from worry about the effects of my changes, as it ensures that everything about the databases continues to just work.

Of course, that’s not to say that I don’t screw up. There are times when my refactorings have introduced new bugs or incompatibilities; after all, the tests I write of existing functionality extend only so far as I can understand that functionality. But as such issues come up, I just add regression tests, fix the issues, and move on, confident in the knowledge that, as long as the tests continue to be run regularly, those bugs will never come up again. Ever.

As a result, I’ll likely be posting a bit on best practices I’ve found while writing pgTAP tests. As I’ve been writing them, I’ve started to find the cow paths that help me to keep things sane. Most helpful is the large number of assertion functions that pgTAP offers, of course, but there are a number of techniques I’ve been developing as I’ve worked. Some are better than others, and still others suggest that I need to find other ways to do things (you know, when I’m cut-and-pasting a lot, there must be another way, though I’ve always done a lot of cut-and-pasting in tests).

In the meantime, I’m happy to announce the release of pgTAP 0.16. This version includes a number of improvements to the installer (including detection of Perl and TAP::Harness, which are required to use the included pg_prove test harness app). The installer also has an important bug fix that greatly increases the chances that the os_name() function will actually know the name of your operating system. And of course, there are new test functions (a quick example follows the list):

  • has_schema() and hasnt_schema(), which test for the presence and absence of a schema
  • has_type() and hasnt_type(), which test for the presence and absence of a data type, domain, or enum
  • has_domain() and hasnt_domain(), which test for the presence and absence of a data domain
  • has_enum() and hasnt_enum(), which test for the presence and absence of an enum
  • enum_has_labels(), which tests that an enum has an expected list of labels
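
Here’s a quick sketch of a few of the new functions at work. The schema and enum names are hypothetical, of course:

BEGIN;
SELECT plan(4);

SELECT has_schema( 'ape' );
SELECT hasnt_schema( 'aap' );
SELECT has_enum( 'bug_status' );
SELECT enum_has_labels( 'bug_status', ARRAY[ 'new', 'open', 'closed' ] );

SELECT * FROM finish();
ROLLBACK;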

As usual, you can download the latest release from GitHub. Visit the pgTAP site for more information and documentation.

Dynamic OFFSETs and LIMITs

I discovered a great hack for dealing with optional offsets and limits in PostgreSQL functions while working for a client, and I wanted to get it down here so that I wouldn’t forget it.

The deal is that I was writing tests for functions that returned a set of IDs for objects based on some criterion plus an optional offset and limit. The functions looked something like this:

CREATE OR REPLACE FUNCTION get_widgets_for_user_id(
    a_user_id integer,
    a_offset  integer,
    a_limit   integer
) RETURNS SETOF integer AS $$ 
DECLARE  
    l_id    integer;
    l_query text;
BEGIN
    l_query := 'SELECT id FROM widgets WHERE user_id = ' || a_user_id
            || ' ORDER BY created_at DESC';
    IF a_offset IS NOT NULL THEN 
        l_query := l_query || ' OFFSET ' || a_offset; 
    END IF; 
    IF a_limit IS NOT NULL THEN 
        l_query := l_query || ' LIMIT ' || a_limit; 
    END IF; 

    FOR l_id IN EXECUTE l_query LOOP 
            RETURN NEXT l_id;
    END LOOP; 
        
    RETURN; 
END;
$$ LANGUAGE PLPGSQL;

It seemed silly to me that this should be in PL/pgSQL: ultimately, it’s such a simple query that I wanted it to be plain SQL. Of course I knew that if a_offset was NULL I could fall back on 0. But what about dealing with a NULL limit?

Well, it turns out that you can pass a CASE expression to the LIMIT clause that optionally returns no value at all, and it will just work. Observe:

try=% select id from widgets LIMIT CASE WHEN false THEN 3 END;
 id
----
   1
   2
   3
   4
(4 rows)

Pretty weird, huh? Well, for my purposes, it’s perfect, because I was able to rewrite that function as a pure SQL function, and it’s a lot simpler, to boot:

CREATE OR REPLACE FUNCTION get_widgets_for_user_id(
    a_user_id integer,
    a_offset  integer,
    a_limit   integer
) RETURNS SETOF integer AS $$ 
    SELECT id
        FROM widgets
        WHERE user_id = $1
        ORDER BY created_at DESC
    OFFSET COALESCE( $2, 0 )
        LIMIT CASE WHEN $3 IS NOT NULL THEN $3 END
$$ LANGUAGE SQL;

Now isn’t that a hell of a lot easier to read? Like I said, it’s a little weird, but overall I think it’s a pretty good hack. I’ve tested it with PostgreSQL 8.2 and 8.3. Not sure about other versions, but give it a try!
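
For instance, with the widgets data from above, a call with no offset and a limit of 3 (hypothetical values) might look like this:

try=% select * from get_widgets_for_user_id( 1, NULL, 3 );

Pass NULL for both arguments and you get all of the user’s widget IDs, newest first.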

Update: Thanks for the comments! With the insight that CASE is ultimately passing a NULL to LIMIT when the value is NULL, I realized that I could switch from CASE to COALESCE for nice parity with the handling of the OFFSET clause. Check it out:

CREATE OR REPLACE FUNCTION get_widgets_for_user_id(
    a_user_id integer,
    a_offset  integer,
    a_limit   integer
) RETURNS SETOF integer AS $$ 
    SELECT id
        FROM widgets
        WHERE user_id = $1
        ORDER BY created_at DESC
    OFFSET COALESCE( $2, 0 )
        LIMIT COALESCE( $3, NULL )
$$ LANGUAGE SQL;

pgTAP 0.14 Released

I’ve just released pgTAP 0.14. This release focuses on getting more schema functions into your hands, as well as fixing a few issues. Changes:

  • Added SET search_path statements to uninstall_pgtap.sql.in so that it will work properly when TAP is installed in its own schema. Thanks to Ben for the catch!
  • Added commands to drop pg_version() and pg_version_num() to uninstall_pgtap.sql.in.
  • Added has_index(), index_is_unique(), index_is_primary(), is_clustered(), and index_is_type(). (See the example after this list.)
  • Added os_name(). This is somewhat experimental. If you have uname, it’s probably correct, but assistance in improving OS detection in the Makefile would be greatly appreciated. Notably, it does not detect Windows.
  • Made ok() smarter when the test result is passed as NULL. It was dying, but now it simply fails and attaches a diagnostic message reporting that the test result was NULL. Reported by Jason Gordon.
  • Fixed an issue in check_test() where an extra character was removed from the beginning of the diagnostic output before testing it.
  • Fixed a bug comparing name[]s on PostgreSQL 8.2, previously hacked around.
  • Added has_trigger() and trigger_is().
  • Switched to pure SQL implementations of the pg_version() and pg_version_num() functions, to simplify including pgTAP in module distributions.
  • Added a note to README.pgtap about the need to avoid pg_typeof() and cmp_ok() in tests run as part of a distribution.
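
Here’s a brief sketch of the new index assertions, assuming a hypothetical widgets table with a default primary key index and a unique index on its name column:

SELECT plan(4);

SELECT has_index( 'widgets', 'widgets_pkey' );
SELECT index_is_primary( 'widgets', 'widgets_pkey' );
SELECT has_index( 'widgets', 'idx_widgets_name' );
SELECT index_is_unique( 'widgets', 'idx_widgets_name' );

SELECT * FROM finish();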

Enjoy!

pgTAP 0.12 Released

In anticipation of my PostgreSQL Conference West 2008 talk on Sunday, I’ve just released pgTAP 0.12. This is a minor release with just a few tweaks:

  • Updated plan() to disable warnings while it creates its tables. This means that plan() no longer sends NOTICE messages when it runs, although tests still might, depending on the setting of client_min_messages.
  • Added hasnt_table(), hasnt_view(), and hasnt_column(). (See the sketch after this list.)
  • Added hasnt_pk(), hasnt_fk(), col_isnt_pk(), and col_isnt_fk().
  • Added missing DROP statements to uninstall_pgtap.sql.in.
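
A quick sketch of the new negative assertions, with hypothetical object names:

SELECT plan(3);

SELECT hasnt_table( 'widgets_backup' );
SELECT hasnt_view( 'deprecated_widgets' );
SELECT hasnt_column( 'users', 'middle_name' );

SELECT * FROM finish();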

I also have an idea to add functions that return the server version number (and each of the version number parts) and an OS string, to make testing things on various versions of PostgreSQL and on various operating systems a lot simpler.

I think I’ll also spend some time in the next few weeks on an article explaining exactly what pgTAP is and why you’d want to use it. Provided, of course, I can find the tuits for that.

pgTAP 0.11 Released

So I’ve just released pgTAP 0.11. I know I said I wasn’t going to work on it for a while, but I changed my mind. Here’s what’s changed:

  • Simplified the tests so that they now load test_setup.sql instead of setting a bunch of stuff themselves. Now only test_setup.sql needs to be created from test_setup.sql.in, and the other .sql files depend on it, meaning that one no longer has to specify TAPSCHEMA for any make target other than the default.
  • Eliminated all uses of E'' in the tests, so that we don’t have to process them for testing on 8.0.
  • Fixed the spelling of ON_ROLLBACK in the test setup. Can’t believe I had it with one L in all of the test files before! Thanks to Curtis “Ovid” Poe for the spot.
  • Added a couple of variants of todo() and skip(), since I can never remember whether the numeric argument comes first or second. Thanks to PostgreSQL’s functional polymorphism, I don’t have to. Also, there are variants where the numeric value, if not passed, defaults to 1. (See the example after this list.)
  • Updated the link to the pgTAP home page in pgtap.sql.in.
  • TODO tests can now nest.
  • Added todo_start(), todo_end(), and in_todo().
  • Added variants of throws_ok() that test error messages as well as error codes.
  • Converted some more tests to use check_test().
  • Added can() and can_ok().
  • Fixed a bug in check_test() where the leading whitespace for diagnostic messages could be off by 1 or more characters.
  • Fixed the installcheck target so that it properly installs PL/pgSQL into the target database before the tests run.
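
Here’s a sketch of the new TODO and throws_ok() features in action (the functions under test are hypothetical):

SELECT plan(3);

-- Mark the next two tests as TODO; either argument order works.
SELECT todo( 'win32 support pending', 2 );
SELECT ok( foo_on_windows(), 'foo() should work on Windows' );
SELECT ok( bar_on_windows(), 'bar() should work on Windows' );

-- New variant: check the error message as well as the error code.
SELECT throws_ok(
    'SELECT 1/0',
    '22012',
    'division by zero',
    'Should get a division-by-zero error'
);

SELECT * FROM finish();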

Now I really am going to do some other stuff for a bit, although I do want to see what I can poach from Epic Test. And I do have that talk on pgTAP next month. So I’ll be back with more soon enough.

pgTAP 0.10 Released, Web Site Launched

Two major announcements this week with regard to pgTAP:

First, I’ve released pgTAP 0.10. The two major categories of changes are compatibility as far back as PostgreSQL 8.0 and new functions for testing database schemas. Here’s a quick example:

BEGIN;
SELECT plan(7);

SELECT has_table( 'users' );
SELECT has_pk('users');
SELECT col_is_pk( 'users', ARRAY[ 'family_name', 'given_name' ]);

SELECT has_table( 'widgets' );
SELECT has_pk( 'widgets' );
SELECT col_is_pk( 'widgets', 'id' );
SELECT fk_ok(
    'widgets',
    ARRAY[ 'user_family_name', 'user_given_name' ],
    'users',
    ARRAY[ 'family_name', 'given_name' ]
);

SELECT * FROM finish();
ROLLBACK;

Pretty cool, right? Check the documentation for all the details.

Speaking of the documentation, that link goes to the new pgTAP Web site. Not only does it include the complete documentation for pgTAP, but also instructions for integrating pgTAP into your application’s preferred test environment. Right now it includes detailed instructions for Perl + Module::Build and for PostgreSQL, but has only placeholders for PHP and Python. Send me the details on those languages or any others into which you integrate pgTAP tests and I’ll update the page.

Oh, and it has a beer. Enjoy.

I think I’ll take a little time off from pgTAP next week to give Bricolage some much-needed love. But as I’ll be giving another talk on pgTAP at PostgreSQL Conference West next month, worry not! I’ll be doing a lot more with pgTAP in the coming weeks.

Oh, and one more thing: I’m looking for consulting work. Give me a shout (david - at - justatheory.com) if you have some PostgreSQL, Perl, Ruby, MySQL, or JavaScript hacking you’d like me to do. I’m free through November.

That is all.

CITEXT Patch Submitted to PostgreSQL Contrib

On Friday, I submitted a patch to add a locale-aware case-insensitive text type as a PostgreSQL contrib module. This has been among my top requests as a feature for PostgreSQL ever since I started using it. And as I started work on yet another application recently, I decided to look into what it would take to just make it happen myself. I’m hopeful that everyone will be able to benefit from this bit of yak shaving.

I started out by trying to use the citext project on pgFoundry, but immediately identified two issues with it:

  1. It does not install properly on PostgreSQL 8.3 (it uses a lot of casts that were removed in 8.3); and
  2. It only case-insensitively compares ASCII characters, so accented multibyte characters compare just as they do in the text type.

So I set about trying to create my own, new type, originally called “lctext”, since what it does is not a true case-insensitive comparison; rather, it lowercases text and then compares, just as millions of us developers already do by using LOWER() on both sides of a query:

SELECT *
  FROM tab
 WHERE lower(col) = LOWER(?);

I just finally got fed up with this. The last straw for me was wanting to create a primary key that would be stored case-insensitively, which would have required that I create two indexes for it: One created for the primary key by default, the other a functional UNIQUE INDEX on LOWER(col), which would just be stupid.

So this patch is the culmination of my work to make a locale-aware case-insensitive text type. It’s locale-aware in that it uses the same locale-aware string comparison code as that used for the text type, and it uses the same C function as LOWER() uses. The nice thing is that it works just as if you had used LOWER() in all your SQL, but now you don’t have to.

So while this is not a true case-insensitive text type, in the sense that it doesn’t do a case-insensitive comparison but changes the case and then compares, it is likely more efficient than the LOWER() workaround that we’ve all been using for years, and it’s neater, too. Using this type, it will now be much easier to create, e.g., an email domain, like so:

CREATE OR REPLACE FUNCTION is_email(text)
RETURNS BOOLEAN
AS $$
    use Email::Valid;
    return TRUE if Email::Valid->address( $_[0] );
    return FALSE;
$$ LANGUAGE 'plperlu' STRICT IMMUTABLE;

CREATE DOMAIN email AS CITEXT CHECK ( is_email( value ) );
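
Once the domain is in place, a column using it compares case-insensitively with no extra effort. A hypothetical example:

CREATE TABLE users (
    email EMAIL PRIMARY KEY
);

INSERT INTO users VALUES ('David@Example.COM');

-- Matches, because both sides are lowercased before comparison.
SELECT * FROM users WHERE email = 'david@example.com';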

No more nasty workarounds to account for the lack of case-insensitive comparisons for text types. It works great for time zones and other data types that are defined to compare case-insensitively:

CREATE OR REPLACE FUNCTION is_timezone( tz TEXT ) RETURNS BOOLEAN as $$
BEGIN
  PERFORM now() AT TIME ZONE tz;
  RETURN TRUE;
EXCEPTION WHEN invalid_parameter_value THEN
  RETURN FALSE;
END;
$$ language plpgsql STABLE;

CREATE DOMAIN timezone AS CITEXT
CHECK ( is_timezone( value ) );

And that should just work!

I’m hoping that this is accepted during the July CommitFest. Of course I will welcome suggestions for how to improve it. Since I sent the patch, for example, I’ve been thinking that I should suggest in the documentation that it is best used for short text entries (say, up to 256 characters), rather than longer entries (like email bodies or journal articles), and that for longer entries, one should really make use of tsearch2, instead. There are other notes and caveats in the patch submission. Please do let me know what you think.

pgTAP pgFoundry Project and Lightning Talk

A couple of quick announcements:

pgTAP on pgFoundry

First, the PostgreSQL community approved my project, so now there is a pgTAP project page, including a couple of mail lists, a bug tracker, and downloads. I uploaded a new version shortly after the project was approved, and 0.03 should be there soon, as well.

pgTAP YAPC::NA Lightning Talk

I gave a Lightning talk at YAPC::NA 2008 in Chicago this afternoon. I’ve now posted the slides for your enjoyment.

Care to help me with development? Want to add your own test functions or make it all integrate better with standard PostgreSQL regression tests? Want to help me get Module::Build or Module::Install to run Perl and PostgreSQL and whatever tests side-by-side, all at once? Join the pgtap-users mail list and join the fun!

Introducing pgTAP

So I started working on a new PostgreSQL data type this week. More on that soon; in the meantime, I wanted to create a test suite for it, and wasn’t sure where to go. The only PostgreSQL tests I’ve seen are those distributed with Elein Mustain’s email data type, which she created in a PostgreSQL General Bits posting from a couple of years ago. I used the same approach myself for my GTIN data type, but it was rather hard to use: I had to pay very close attention to what was output in order to tell the description output from the test output. It was quite a PITA, actually.

This time, I started down the same path, then started thinking about Perl testing, where each unit test, or “assertion” in the xUnit parlance, triggers output of a single line of information indicating whether or not a test succeeded. It occurred to me that I could just run a bunch of queries that returned booleans to do my testing. So my first stab looked something like this:

\set ON_ERROR_STOP 1
\set AUTOCOMMIT off
\pset format unaligned
\pset tuples_only
\pset pager
\pset null '[NULL]'

SELECT foo() = 'bar';
SELECT foo(1) = 'baz';
SELECT foo(2) = 'foo';

The output looked like this:

% psql try -f ~/Desktop/try.sql
t
t
t

Once I started down that path, and had written ten or so tests, it suddenly dawned on me that the Perl Test::More module and its core ok() subroutine worked just like that. It essentially just depends on a boolean value and outputs text based on that value. A couple minutes of hacking and I had this:

CREATE TEMP SEQUENCE __tc__;
CREATE OR REPLACE FUNCTION ok ( boolean, text ) RETURNS TEXT AS $$
    SELECT (CASE $1 WHEN TRUE THEN '' ELSE 'not ' END) || 'ok'
        || ' ' || NEXTVAL('__tc__')
        || CASE $2 WHEN '' THEN '' ELSE COALESCE( ' - ' || $2, '' ) END;
$$ LANGUAGE SQL;

I then rewrote my test queries like so:

\echo 1..3
SELECT ok( foo() = 'bar',  'foo() should return "bar"' );
SELECT ok( foo(1) = 'baz', 'foo(1) should return "baz"' );
SELECT ok( foo(2) = 'foo', 'foo(2) should return "foo"' );

Running these tests, I now got:

% psql try -f ~/Desktop/try.sql
1..3
ok 1 - foo() should return "bar"
ok 2 - foo(1) should return "baz"
ok 3 - foo(2) should return "foo"

And, BAM! I had the beginning of a test framework that emits pure TAP output.

Well, I was so excited about this that I put aside my data type for a few hours and banged out the rest of the framework. Why was this exciting to me? Because now I can use a standard test harness to run the tests, even mix them in with other TAP tests on any project I might work on. Just now, I hacked together a quick script to run the tests:

use strict;
use warnings;
use TAP::Harness;

my $harness = TAP::Harness->new({
    timer => 0, # set to 1 to time each test script
    exec  => [qw( psql try -f )],
});

$harness->runtests( @ARGV );

Now I’m able to run the tests like so:

% try ~/Desktop/try.sql        
/Users/david/Desktop/try........ok   
All tests successful.
Files=1, Tests=3,  0 wallclock secs ( 0.00 usr  0.00 sys +  0.01 cusr  0.00 csys =  0.01 CPU)
Result: PASS

Pretty damn cool! And lest you wonder whether such a suite of TAP-emitting test functions is suitable for testing SQL, here are a few examples of tests I’ve written:

-- Plan the tests.
SELECT plan(4);

-- Emit a diagnostic message for users of different locales.
SELECT diag(
    E'These tests expect LC_COLLATE to be en_US.UTF-8,\n'
    || 'but yours is set to ' || setting || E'.\n'
    || 'As a result, some tests may fail. YMMV.'
)
    FROM pg_settings
    WHERE name = 'lc_collate'
    AND setting <> 'en_US.UTF-8';

SELECT is( 'a', 'a', '"a" should = "a"' );
SELECT is( 'B', 'B', '"B" should = "B"' );

CREATE TEMP TABLE try (
    name lctext PRIMARY KEY
);

INSERT INTO try (name)
VALUES ('a'), ('ab'), ('â'), ('aba'), ('b'), ('ba'), ('bab'), ('AZ');

SELECT ok( 'a' = name, 'We should be able to select the value' )
    FROM try
    WHERE name = 'a';

SELECT throws_ok(
    'INSERT INTO try (name) VALUES (''a'')',
    '23505',
    'We should get an error inserting a lowercase letter'
);

-- Finish the tests and clean up.
SELECT * FROM finish();

As you can see, it’s just SQL. And yes, I have ported most of the test functions from Test::More, as well as a couple from Test::Exception.

So, without further ado, I’d like to introduce pgTAP, a lightweight test framework for PostgreSQL implemented in PL/pgSQL and SQL. I’ll be hacking on it more in the coming days, mostly to get a proper client for running tests hacked together. Then I think I’ll see if pgFoundry is interested in it.

Whaddya think? Is this something you could use? I can see many uses, myself, not only for testing a custom data type as I develop it, but also custom functions in PL/pgSQL or PL/Perl, and, heck, just regular schema stuff. I’ve had to write a lot of Perl tests to test my database schema (triggers, rules, functions, etc.), all using the DBI and being very verbose. Being able to do it all in a single psql script seems so much cleaner. And if I can end up mixing the output of those scripts in with the rest of my unit tests, so much the better!

Anyway, feedback welcome. Leave your comments, suggestions, complaints, patches, etc., below. Thanks!

How to Generate Recurring Events in the Database

This is a followup to my request for help fixing the performance of a database view that generated recurrences for events. This was an essential feature of Sandy, and thus important to get right. The idea when I started was simple:

  • Add a recurrence domain to the database that supports a number of different values, including “daily”, “weekly”, and “monthly”.
  • Add a recurrence column to the events table that identifies how an event recurs.
  • Add a recurrence_dates table that contains a pre-generated list of recurrences for a given date. I’d populated this table with five years of dates, each one mapped to five years’ worth of recurrence dates (see the original blog entry for more on the format of this table).
  • Create a view that maps each events row to its date and recurrence in the recurrence_dates table.
  • Profit.

It was this last bullet point that didn’t quite work out: although the data was perfectly accurate, queries for a lot of events in the view were very expensive. I mean, the query could run for 3-4 minutes. It was just crazy! I couldn’t figure out the problem, so I posted my request for help. It was through the discussions that followed with depesz that I finally figured out what the problem was: although I was usually selecting only a week’s or month’s worth of events, the view was calculating rows for all five years’ worth of data for all of the events for a given user. Um, not efficient.

So here I finally document how, with a lot of help and example code from depesz, I solved the problem. The trick was to use a function instead of a view to generate the recurring event rows, and to limit it only to the dates we’re interested in. For convenience sake, I broke this down into two PL/pgSQL functions: one to generate recurring dates and one to return the recurring event rows. But first, here’s the recurrence domain and the events table, both of which are unchanged from the original approach:

CREATE DOMAIN recurrence AS TEXT
CHECK ( VALUE IN ( 'none', 'daily', 'weekly', 'monthly' ) );

CREATE TABLE events (
    id         SERIAL     PRIMARY KEY,
    user_id    INTEGER    NOT NULL,
    starts_at  TIMESTAMP  NOT NULL,
    start_tz   TEXT       NOT NULL,
    ends_at    TIMESTAMP,
    end_tz     TEXT       NOT NULL,
    recurrence RECURRENCE NOT NULL DEFAULT 'none'
);

Just assume the user_id is a foreign key. Now let’s populate this table with some data. For the purposes of this demonstration, I’m going to create one event per day for 1000 days, evenly divided between daily, weekly, monthly, and no recurrences, as well as five different times of day and six different durations:

INSERT INTO events (user_id, starts_at, start_tz, ends_at, end_tz, recurrence)
SELECT 1,
       ts::timestamp,
       'PST8PDT',
       ts::timestamp + dur::interval,
       'PST8PDT',
       recur
  FROM (
    SELECT '2007-12-19'::date + i || ' ' || CASE i % 5
               WHEN 0 THEN '06:00'
               WHEN 1 THEN '10:00'
               WHEN 2 THEN '14:00'
               WHEN 3 THEN '18:00'
               ELSE        '22:30'
               END,
           CASE i % 6
               WHEN 0 THEN '2 hours'
               WHEN 1 THEN '1 hour'
               WHEN 2 THEN '45 minutes'
               WHEN 3 THEN '3.5 hours'
               WHEN 4 THEN '15 minutes'
               ELSE        '30 minutes'
               END,
           CASE i % 4
               WHEN 0 THEN 'daily'
               WHEN 1 THEN 'weekly'
               WHEN 2 THEN 'monthly'
               ELSE        'none'
               END
    FROM generate_series(1, 1000) as gen(i)
  ) AS ser( ts, dur, recur);

This gives us some nicely distributed data:

try=# select * from events limit 10;
  id  | user_id |      starts_at      | start_tz |       ends_at       | end_tz  | recurrence 
------+---------+---------------------+----------+---------------------+---------+------------
    1 |       1 | 2007-12-20 10:00:00 | PST8PDT  | 2007-12-20 11:00:00 | PST8PDT | weekly
    2 |       1 | 2007-12-21 14:00:00 | PST8PDT  | 2007-12-21 14:45:00 | PST8PDT | monthly
    3 |       1 | 2007-12-22 18:00:00 | PST8PDT  | 2007-12-22 21:30:00 | PST8PDT | none
    4 |       1 | 2007-12-23 22:30:00 | PST8PDT  | 2007-12-23 22:45:00 | PST8PDT | daily
    5 |       1 | 2007-12-24 06:00:00 | PST8PDT  | 2007-12-24 06:30:00 | PST8PDT | weekly
    6 |       1 | 2007-12-25 10:00:00 | PST8PDT  | 2007-12-25 12:00:00 | PST8PDT | monthly
    7 |       1 | 2007-12-26 14:00:00 | PST8PDT  | 2007-12-26 15:00:00 | PST8PDT | none
    8 |       1 | 2007-12-27 18:00:00 | PST8PDT  | 2007-12-27 18:45:00 | PST8PDT | daily
    9 |       1 | 2007-12-28 22:30:00 | PST8PDT  | 2007-12-29 02:00:00 | PST8PDT | weekly
   10 |       1 | 2007-12-29 06:00:00 | PST8PDT  | 2007-12-29 06:15:00 | PST8PDT | monthly
(10 rows)

Now let’s get to the recurring date function:

CREATE OR REPLACE FUNCTION generate_recurrences(
    recurs     RECURRENCE,
    start_date DATE,
    end_date   DATE
)
RETURNS SETOF DATE
LANGUAGE plpgsql IMMUTABLE
AS $BODY$
DECLARE
    next_date DATE := start_date;
    duration  INTERVAL;
    day       INTERVAL;
    check     TEXT;
BEGIN
    IF recurs = 'none' THEN
        -- Only one date ever.
        RETURN NEXT next_date;
    ELSIF recurs = 'weekly' THEN
        duration := '7 days'::interval;
        WHILE next_date <= end_date LOOP
            RETURN NEXT next_date;
            next_date := next_date + duration;
        END LOOP;
    ELSIF recurs = 'daily' THEN
        duration := '1 day'::interval;
        WHILE next_date <= end_date LOOP
            RETURN NEXT next_date;
            next_date := next_date + duration;
        END LOOP;
    ELSIF recurs = 'monthly' THEN
        duration := '27 days'::interval;
        day      := '1 day'::interval;
        check    := to_char(start_date, 'DD');
        WHILE next_date <= end_date LOOP
            RETURN NEXT next_date;
            next_date := next_date + duration;
            WHILE to_char(next_date, 'DD') <> check LOOP
                next_date := next_date + day;
            END LOOP;
        END LOOP;
    ELSE
        -- Someone needs to update this function, methinks.
        RAISE EXCEPTION 'Recurrence % not supported by generate_recurrences()', recurs;
    END IF;
END;
$BODY$;

The idea here is pretty simple: pass in a recurrence (“daily”, “weekly”, or “monthly”), a start date, and an end date, and get back a set of all the recurrence dates between the start and end dates:

try=# \timing
Timing is on.
try=# select * from generate_recurrences('daily', '2008-01-29', '2008-02-05');
 generate_recurrences 
----------------------
 2008-01-29
 2008-01-30
 2008-01-31
 2008-02-01
 2008-02-02
 2008-02-03
 2008-02-04
 2008-02-05
(8 rows)

Time: 0.548 ms
try=# select * from generate_recurrences('weekly', '2008-01-29', '2008-03-05');
 generate_recurrences 
----------------------
 2008-01-29
 2008-02-05
 2008-02-12
 2008-02-19
 2008-02-26
 2008-03-04
(6 rows)

Time: 0.670 ms
try=# select * from generate_recurrences('monthly', '2008-01-29', '2008-05-05');
 generate_recurrences 
----------------------
 2008-01-29
 2008-02-29
 2008-03-29
 2008-04-29
(4 rows)

Time: 0.644 ms

Not bad, eh? And PostgreSQL’s date and interval calculation operators are wicked fast. Check out how long it takes to generate two years’ worth of daily recurrence dates:

try=# select * from  generate_recurrences('daily', '2008-01-29', '2010-02-05');
 generate_recurrences 
----------------------
 2008-01-29
 2008-01-30
 2008-01-31
...
 2010-02-03
 2010-02-04
 2010-02-05
(739 rows)

Time: 4.982 ms

Awesome. And the great thing about this function is that any time I need to add new recurrences (yearly, biweekly, quarterly, weekends, weekdays, etc.), I just modify the domain and this function and we’re ready to go.
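
For example, a yearly recurrence might be sketched like so (this branch is not part of the original function, and assumes 'yearly' has been added to the recurrence domain; a leap-day start date would need the same day-check treatment as the monthly branch):

    ELSIF recurs = 'yearly' THEN
        duration := '1 year'::interval;
        WHILE next_date <= end_date LOOP
            RETURN NEXT next_date;
            next_date := next_date + duration;
        END LOOP;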

And now, part two: the recurring event function:

CREATE OR REPLACE FUNCTION recurring_events_for(
    for_user_id INTEGER,
    range_start TIMESTAMP,
    range_end   TIMESTAMP
)
    RETURNS SETOF events
    LANGUAGE plpgsql STABLE
    AS $BODY$
DECLARE
    event events;
    start_date TIMESTAMPTZ;
    start_time TEXT;
    ends_at    TIMESTAMPTZ;
    next_date  DATE;
    recurs_at  TIMESTAMPTZ;
BEGIN
    FOR event IN 
        SELECT *
          FROM events
         WHERE user_id = for_user_id
           AND (
                   recurrence <> 'none'
               OR  (
                      recurrence = 'none'
                  AND starts_at BETWEEN range_start AND range_end
               )
           )
    LOOP
        IF event.recurrence = 'none' THEN
            RETURN NEXT event;
            CONTINUE;
        END IF;

        start_date := event.starts_at::timestamptz AT TIME ZONE event.start_tz;
        start_time := start_date::time::text;
        ends_at    := event.ends_at::timestamptz AT TIME ZONE event.end_tz;

        FOR next_date IN
            SELECT *
                FROM generate_recurrences(
                        event.recurrence,
                        start_date::date,
                        (range_end AT TIME ZONE event.start_tz)::date
                )
        LOOP
            recurs_at := (next_date || ' ' || start_time)::timestamp
                AT TIME ZONE event.start_tz;
            EXIT WHEN recurs_at > range_end;
            CONTINUE WHEN recurs_at < range_start AND ends_at < range_start;
            event.starts_at := recurs_at;
            event.ends_at   := ends_at;
            RETURN NEXT event;
        END LOOP;
    END LOOP;
    RETURN;
END;
$BODY$;

The idea here is to select the appropriate events for a given user between two dates, and for each event iterate over all of the recurrences between those dates and return a row for each one. So the lines starting with FOR event IN and ending with LOOP select the original events, looking for either recurring events or non-recurring events that fall between the two dates. Note that if you needed to, you could easily refine this query for your particular application, or even use PL/pgSQL’s EXECUTE statement to dynamically generate queries to suit particular application needs, as sketched below.
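
Here’s a minimal sketch of what such a dynamic query might look like, assuming a hypothetical extra_criteria TEXT parameter holding a pre-validated SQL fragment (run any user-supplied values through quote_literal() to avoid SQL injection):

    FOR event IN EXECUTE
        'SELECT * FROM events WHERE user_id = ' || for_user_id
        || ' AND (' || extra_criteria || ')'
    LOOP
        -- Same loop body as above.
    END LOOP;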

Next, the block starting with IF event.recurrence = 'none' THEN simply returns any non-recurring events. Although the next block already handles this case, adding this optimization eliminates a fair bit of calculation for the common case of non-recurring events.

Then the lines starting with FOR next_date IN and ending with LOOP select all of the dates for the recurrence in question, using the generate_recurrences() function created earlier. From LOOP to END LOOP;, the function generates the start and end timestamps, exiting the loop when the start date falls after the range or when it falls before the range and the end date falls after the range. There are many other tweaks one could make here to modify which recurrences are included and which are excluded. For example, if you had a column in the events table such as exclude_dates TIMESTAMP[] NOT NULL DEFAULT '{}' that stored an array of dates to ignore when generating recurrences, you could just add this line to go ahead and exclude them from the results returned by the function:

            CONTINUE WHEN recurs_at = ANY( event.exclude_dates );

But enough of the details: let’s see how it works! Here’s a query for a week’s worth of data:

try=# select * from recurring_events_for(1, '2007-12-19', '2007-12-26');
  id  | user_id |      starts_at      | start_tz |       ends_at       | end_tz  | recurrence 
------+---------+---------------------+----------+---------------------+---------+------------
    1 |       1 | 2007-12-20 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    2 |       1 | 2007-12-21 14:00:00 | PST8PDT  | 2007-12-21 06:45:00 | PST8PDT | monthly
    3 |       1 | 2007-12-22 18:00:00 | PST8PDT  | 2007-12-22 21:30:00 | PST8PDT | none
    4 |       1 | 2007-12-23 22:30:00 | PST8PDT  | 2007-12-23 14:45:00 | PST8PDT | daily
    4 |       1 | 2007-12-24 22:30:00 | PST8PDT  | 2007-12-23 14:45:00 | PST8PDT | daily
    4 |       1 | 2007-12-25 22:30:00 | PST8PDT  | 2007-12-23 14:45:00 | PST8PDT | daily
    5 |       1 | 2007-12-24 06:00:00 | PST8PDT  | 2007-12-23 22:30:00 | PST8PDT | weekly
    6 |       1 | 2007-12-25 10:00:00 | PST8PDT  | 2007-12-25 04:00:00 | PST8PDT | monthly
(8 rows)

Time: 51.890 ms

Note the time it took to execute this query. 52 ms is a hell of a lot faster than the several minutes it took to run a similar query using the old view. Plus, I’m not limited to just the recurrence dates I’ve pre-calculated in the old recurrence_dates table. Now we can use whatever dates are supported by PostgreSQL. It’s even fast when we look at a year’s worth of data:

try=# select * from recurring_events_for(1, '2007-12-19', '2008-12-19');
  id  | user_id |      starts_at      | start_tz |       ends_at       | end_tz  | recurrence 
------+---------+---------------------+----------+---------------------+---------+------------
    1 |       1 | 2007-12-20 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    1 |       1 | 2007-12-27 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    1 |       1 | 2008-01-03 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    1 |       1 | 2008-01-10 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    1 |       1 | 2008-01-17 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    1 |       1 | 2008-01-24 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    1 |       1 | 2008-01-31 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
...
  364 |       1 | 2008-12-17 22:30:00 | PST8PDT  | 2008-12-17 14:45:00 | PST8PDT | daily
  364 |       1 | 2008-12-18 22:30:00 | PST8PDT  | 2008-12-17 14:45:00 | PST8PDT | daily
  365 |       1 | 2008-12-18 06:00:00 | PST8PDT  | 2008-12-17 22:30:00 | PST8PDT | weekly
(19691 rows)

Time: 837.759 ms

Not stellar, but still respectable. Given that for a typical application, a user will be looking at only a day’s or a week’s or a month’s events at a time, this seems to be an acceptable trade-off. I mean, how often will your users need to see a list of 20,000 events? And even if a user was looking at a year’s worth of data, it’s unlikely that 75% of them would be recurring as in the example data here.

I was fucking pumped with this solution, and Sandy has hummed along nicely since we put it into production. If you’re interested in trying it for yourself, you can get all the SQL from this blog entry here.

The only thing I would like to have been able to do differently was to encapsulate the recurring_events_for() function in a view. Such would have made it much easier to actually use this solution in Rails. If you know how to do that, please do leave a comment. As for how I hacked Rails to use the function, well, that’s a blog post for another day.

Using sudo to Install the Postgres Gem on Leopard

Been getting this error with the latest postgres gem?

% sudo gem install postgres
Bulk updating Gem source index for: http://gems.rubyforge.org
Building native extensions.  This could take a while...
ERROR:  While executing gem ... (Gem::Installer::ExtensionBuildError)
   ERROR: Failed to build gem native extension.

ruby extconf.rb install postgres
checking for main() in -lpq... yes
checking for libpq-fe.h... yes
checking for libpq/libpq-fs.h... yes
checking for PQsetClientEncoding()... no
checking for pg_encoding_to_char()... no
checking for PQfreemem()... no
checking for PQserverVersion()... no
checking for PQescapeString()... no
creating Makefile

I have, too. I’ve known about the fix for a while, thanks to a post from maintainer Jeff Davis from last month. But I was unable to get it to work. But then I found this gem of a comment (pun not intended) from Gluttonous:

FYI, this does NOT work with sudo since sudo strips the env var out. You must ‘sudo -s’ or ‘sudo su’ and run the command straight up.

D’oh! I’ve been doing this all this time:

ARCHFLAGS='-arch i386' sudo gem install postgres

And getting the same failures. But this works beautifully:

sudo env ARCHFLAGS='-arch i386' gem install postgres

And away we go!

Validating Time Zones in PostgreSQL

I recently needed to validate that a value stored in a TEXT column was a valid time zone identifier. Why? Because I was using its value inside the database to convert timestamp columns from UTC to a valid zone. So I set about writing a function I could use in a constraint.

It turns out that PostgreSQL has a pretty nice view that lists all of the time zones that it recognizes. It’s called pg_timezone_names:

try=# select * from pg_timezone_names limit 5;
        name        | abbrev | utc_offset | is_dst 
--------------------+--------+------------+--------
 Africa/Abidjan     | GMT    | 00:00:00   | f
 Africa/Accra       | GMT    | 00:00:00   | f
 Africa/Addis_Ababa | EAT    | 03:00:00   | f
 Africa/Algiers     | CET    | 01:00:00   | f
 Africa/Asmara      | EAT    | 03:00:00   | f
(5 rows)

Cool. So all I had to do was to look up the value in this view. My first stab at creating a time zone validation function therefore looked like this:

CREATE OR REPLACE FUNCTION is_timezone( tz TEXT ) RETURNS BOOLEAN as $$
DECLARE
    bool BOOLEAN;
BEGIN
    SELECT TRUE INTO bool
    FROM pg_timezone_names
    WHERE LOWER(name) = LOWER(tz)
        OR LOWER(abbrev) = LOWER(tz);
    RETURN FOUND;
END;
$$ language plpgsql STABLE;

This should pretty well cover anything that PostgreSQL considers valid. So does it work? You bet:

sandy_development=# \timing
Timing is on.
sandy_development=# select is_timezone('America/Los_Angeles');
 is_timezone 
-------------
 t
(1 row)

Time: 457.096 ms
sandy_development=# select is_timezone('Foo/Bar');
 is_timezone 
-------------
 f
(1 row)

Time: 472.752 ms

Perfect! Well, except for just one thing: performance is abysmal. A half second per shot? Not very useful for constraints. And since pg_timezone_names is a view (and, under that, a function), I can’t create indexes.

But then I did something dangerous: I started thinking. I realized that I needed this function when our app started getting errors like this:

try=# select now() at time zone 'Foo/Bar';
ERROR:  time zone "Foo/Bar" not recognized

So the underlying C code throws an error when a time zone is invalid. What if I could just trap the error? Well, PL/pgSQL conveniently has exception handling, so I could do just that. But there was only one problem. PL/pgSQL’s exception handling syntax requires that you specify an error condition. Here’s what the documentation has:

EXCEPTION
    WHEN condition [ OR condition ... ] THEN
        handler_statements
    [ WHEN condition [ OR condition ... ] THEN
            handler_statements
        ... ]
END;

Conditions are error codes. But which one corresponds to the invalid time zone error? I tried a few, but couldn’t figure out which one. (Anyone know how to map errors you see in psql to the error codes listed in Appendix A? Let me know!) But really, my function just needed to do one thing. Couldn’t I just trap any old error?

A careful re-read of the PL/pgSQL documentation reveals that, yes, you can. Use the condition “OTHERS,” and you can catch almost anything. With this information in hand, I quickly wrote:

CREATE OR REPLACE FUNCTION is_timezone( tz TEXT ) RETURNS BOOLEAN as $$
DECLARE
    date TIMESTAMPTZ;
BEGIN
    date := now() AT TIME ZONE tz;
    RETURN TRUE;
EXCEPTION WHEN OTHERS THEN
    RETURN FALSE;
END;
$$ language plpgsql STABLE;

And how well does this one work?

sandy_development=# select is_timezone('America/Los_Angeles');
 is_timezone 
-------------
 t
(1 row)

Time: 3.009 ms
sandy_development=# select is_timezone('Foo/Bar');
 is_timezone 
-------------
 f
(1 row)

Time: 1.224 ms

Yes, I’ll take 1-3 ms over 400-500 ms any day! I might even create a domain for this and be done with it:

CREATE DOMAIN timezone AS TEXT
CHECK ( is_timezone( value ) );
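
With the domain in place, validation happens wherever it’s used. A hypothetical example:

CREATE TABLE appointments (
    starts_at TIMESTAMP NOT NULL,
    start_tz  TIMEZONE  NOT NULL
);

-- Works fine.
INSERT INTO appointments VALUES ( now(), 'America/Los_Angeles' );

-- Fails the CHECK constraint.
INSERT INTO appointments VALUES ( now(), 'Foo/Bar' );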

Enjoy!

Update: From a comment left by Tom Lane, use invalid_parameter_value rather than OTHERS:

CREATE OR REPLACE FUNCTION is_timezone( tz TEXT ) RETURNS BOOLEAN as $$
DECLARE
    date TIMESTAMPTZ;
BEGIN
    date := now() AT TIME ZONE tz;
    RETURN TRUE;
EXCEPTION WHEN invalid_parameter_value THEN
    RETURN FALSE;
END;
$$ language plpgsql STABLE;

Need Help Reducing View Calculations

I could use some advice and suggestions for how to solve a performance problem due to the highly redundant calculation of values in a view. Sorry for the longish explanation. I wanted to make sure that I omitted no details in describing the problem.

In order to support recurring events in an application I’m working on, we have a lookup table that maps dates to their daily, weekly, monthly, and yearly recurrences. It looks something like this:

try=# \d recurrence_dates
   Table "public.recurrence_dates"
   Column   |    Type    | Modifiers 
------------+------------+-----------
 date       | date       | not null
 recurrence | recurrence | not null
 next_date  | date       | not null
Indexes:
    "recurrence_dates_pkey" PRIMARY KEY, btree (date, recurrence, next_date)
    "index_recurrence_dates_on_date_and_recurrence" btree (date, recurrence)

try=# select * from recurrence_dates
try-# where date = '2007-11-04'
try-# order by recurrence, next_date;
    date    | recurrence | next_date  
------------+------------+------------
 2007-11-04 | daily      | 2007-11-04
 2007-11-04 | daily      | 2007-11-05
 2007-11-04 | daily      | 2007-11-06
 2007-11-04 | weekly     | 2007-11-04
 2007-11-04 | weekly     | 2007-11-11
 2007-11-04 | weekly     | 2007-11-18
 2007-11-04 | monthly    | 2007-11-04
 2007-11-04 | monthly    | 2007-12-04
 2007-11-04 | monthly    | 2008-01-04
 2007-11-04 | annually   | 2007-11-04
 2007-11-04 | annually   | 2008-11-04
 2007-11-04 | annually   | 2009-11-04
 2007-11-04 | none       | 2007-11-04

To get all of the permutations of recurring events, we simply select from a view rather than from the events table that contains the actual event data. The view joins events to the recurrence_dates table like so:

CREATE OR REPLACE VIEW recurring_events AS
SELECT id, name, user_id, duration,
       (rd.next_date || ' ' ||
       (starts_at::timestamptz at time zone start_tz)::time)::timestamp
       at time zone start_tz AS starts_at,
       start_tz
  FROM events LEFT JOIN recurrence_dates rd
    ON (events.starts_at::timestamptz at time zone events.start_tz)::date = rd.date
   AND events.recurrence = rd.recurrence;

Then, to get all of the recurrences of events for a user within a week, we do something like this in the client code:

SELECT *
  FROM recurring_events
 WHERE user_id = 2
   AND starts_at BETWEEN '2007-11-04 07:00:00' AND '2007-11-10 07:59:59';

This works perfectly, as all of our dates and times are stored in UTC in timestamp columns. We pass UTC times for the appropriate offset to the query (Pacific Time in this example) and, because the view does the right thing in mapping the starts_at time for each event to its proper time zone, we get all of the events within the date range, even if they are recurrences of an earlier event, and with their times properly set.

The trouble we’re having, however, is all of those conversions. Until last week, the view just kept everything in UTC and left it to the client to convert to the proper zone in the start_tz column. But that didn’t work so well when an event’s starts_at was during daylight savings time and recurrences were in standard time: the standard time recurrences were all an hour off! So I added the repeated instances of events.starts_at::timestamptz at time zone events.start_tz. But now the view is really slow.

Since the only thing that has changed is the addition of the time zone conversions, I believe that the performance penalty is because of them. The calculation executes multiple times per row: once for the join and once again for the starts_at column. We can have an awful lot of events for a given user, and an awful lot of recurrences of a given event. If, for example, an event recurs daily for 2 years, there will be around 730 rows for that one event. And the calculation has to be executed for every one of them before the WHERE clause can be properly evaluated. Ouch! Worse still, we actually have three columns that do this in our application, not just one as in the example here.

So what I need is a way to execute that calculation just once for each row in the events table, rather than once for each row in the recurring_events view. I figure 1 calculation will be a heck of a lot faster than 730! So the question is, how do I do this? How do I get the view to execute the conversion of the starts_at to the start_tz time zone only once for each row in events, regardless of how many rows it ends up generating in the recurring_events view?

Suggestions warmly welcomed. This is a bit of a tickler for me, and since the query performance on these views is killing us, I need to get this adjusted post haste!

Meanwhile, tomorrow I’ll post a cool hack I came up with for validating time zones in the database. Something to look forward to as you ponder my little puzzle, eh?

Update 2008-01-30: Thanks to help from depesz, I figured out what the underlying problem was and solved it much more elegantly using PL/pgSQL. I’ve now written up the basic recipe. Enjoy!

PostgreSQL Warm Standby Using Ruby

The new PostgreSQL Warm Standby support is pretty nice. Since my app doesn’t currently require read access to a slave database, I’ve dumped Slony-I (and all of the headache that went with it), and now have a warm failover server being updated at least once per minute. W00t!

I used Charles Duffy’s example, as well as the documentation, to build my warm standby configuration, but unfortunately, our server OS does not have the usleep utility, so rather than have 1 second sleeps, I ported Charles’s shell script to Ruby. Here it is for your enjoyment:

#!/usr/bin/env ruby

DELAY         = 0.01
FAILOVER_FILE = "/path/to/failover"

@@triggered = false

require 'ftools'

def move (from, to)
  # Do not overwrite! Throws an exception on failure, existing the script.
  File.copy( from, to ) unless @@triggered || File.exists?( to )
end

from, to = ARGV

# If PostgreSQL is asking for .history, just try to move it and exit.
if from =~ /\.history$/
  move from, to
  exit
end

# Sleep while waiting for the file.
while !File.exists?(from) && !@@triggered
  sleep DELAY
  @@triggered = true if File.exists?( FAILOVER_FILE )
end

# Move the file.
move from, to

Just change the DELAY value to the number of seconds you want to sleep, and the FAILOVER_FILE value to the location of a file that will trigger a failover.
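
For context, the standby invokes a script like this via the restore_command setting in recovery.conf; it might look something like the following (the script and archive paths are hypothetical):

restore_command = '/usr/local/bin/warm_standby.rb /mnt/wal_archive/%f %p'

PostgreSQL replaces %f with the name of the WAL file it needs and %p with the path to copy it to, which map to the script’s from and to arguments.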

This is all well and good, but I ultimately ended up using pg_standby, a new contrib utility in PostgreSQL CVS (which will therefore ship with 8.3), as it has the nice feature of cleaning up old WAL files. It does not have subsecond precision, but hey, maybe we don’t really need it.

Rails Migrations with Slony?

The new app I’m developing is written in Ruby on Rails and runs on PostgreSQL. We’re replicating our production database using Slony-I, but we’ve run into a bit of a snag: database schema updates must be run as plain SQL through a Slony script in order to ensure proper replication of the schema changes within a transaction, but Rails migrations run as Ruby code updating the database via the Rails database adapter.

So how do others handle Rails migrations with their Slony-I replication setups? How do you update the Slony-I configuration file for the changes? How do you synchronize changes to the master schema out to the slaves? Do you shut down your apps, shut down Slony-I, make the schema changes to both the master and the slaves, and then restart Slony-I and your apps?

For that matter, people running Slony for their Bricolage databases must have the same issue, because the Bricolage upgrade scripts are just Perl using the DBI, not SQL files. Can anyone shed a little light on this for me?

Oh, and one last question: Why is this such a PITA? Can’t we have decent replication that replicates everything, including schema changes? Please?

My First C: A GTIN Data Type for PostgreSQL

After all of my recent experimentation creating UPC, EAN, and GTIN validation functions, I became interested in trying to create a GTIN PostgreSQL data type in C. The fact that I don’t know C didn’t stop me from learning enough to do some damage. And now I have a first implementation done. Check it out!

So how did I do this? Well, chapter six of the Douglas Book was a great help to get me started. I also learned what I could by reading the source code for the core and contributed PostgreSQL data types, as well as the EnumKit enumerated data type builder (download from here). And the denizens of the #postgresql channel on FreeNode were also extremely helpful. Thank you, guys!

I would be very grateful if the C hackers among you, and especially any PostgreSQL core hackers who happen to read my blog, would download the GTIN source code and have a look at it. This is the first C code I’ve written, so it would not surprise me if there were some gotchas that I missed (memory leaks, anyone?). And yes, I know that the new ISN contributed data types in the forthcoming 8.2 are a far more featureful implementation of bar code data types; I learned about them after I had nearly finished this first release of GTIN. But I did want to learn some C and how to create PostgreSQL data types, and provide the code for others to learn from, as well. It may also end up as the basis for an article. Stay tuned.

In the meantime, share and enjoy.

Update: I forgot to mention that you can check out the source code from the Kineticode Subversion repository.
