Just a Theory


Posts about SQL

Suggest Method Names for DBIx::Connector

Thanks to feedback from Tim Bunce and Peter Rabbitson in a DBIx::Class bug report, I’ve been reworking DBIx::Connector’s block-handling methods. Tim’s objection is that the feature of do() and txn_do() that executes the code reference a second time in the event of a connection failure can be dangerous. That is, it can lead to action-at-a-distance bugs that are hard to find and fix. Tim suggested renaming the methods do_with_retry() and txn_do_with_retry() in order to make explicit what’s going on, and to have non-retry versions of the methods.

I’ve made this change in the repository. But I wasn’t happy with the method names; even though they’re unambiguous, they are also overly long and not very friendly. I want people to use the retrying methods, but felt that the long names make the non-retrying methods preferable to users. While I was at it, I also wanted to get rid of do(), since it quickly became clear that it could cause some confusion with the DBI’s do() method.

I’ve been thesaurus spelunking for the last few days, and have come up with a few options, but would love to hear other suggestions. I like using run instead of do to avoid confusion with the DBI, but otherwise I’m not really happy with what I’ve come up with. There are basically five different methods (using Tim’s suggestions for the moment):

run( sub {} )
Just run a block of code.
txn_run( sub {} )
Run a block of code inside a transaction.
run_with_retry( sub {} )
Run a block of code without pinging the database, and re-run the code if it throws an exception and the database turned out to be disconnected.
txn_run_with_retry( sub {} )
Like run_with_retry(), but run the block inside a transaction.
svp_run( sub {} )
Run a block of code inside a savepoint (no retry for savepoints).

Here are some of the names I’ve come up with so far:

Run block | Run in txn | Run in savepoint | Run with retry | Run in txn with retry | Retry Mnemonic
----------+------------+------------------+----------------+-----------------------+------------------------------------------
run       | txn_run    | svp_run          | runup          | txn_runup             | Run assuming the db is up, retry if not.
run       | txn_run    | svp_run          | run_up         | txn_run_up            | Same as above.
run       | txn_run    | svp_run          | rerun          | txn_rerun             | Run assuming the db is up, rerun if not.
run       | txn_run    | svp_run          | run::retry     | txn_run::retry        | :: means “with”

That last one is a cute hack suggested by Rob Kinyon on IRC. As you can see, I’m pretty consistent with the non-retrying method names; it’s the methods that retry that I’m not satisfied with. An approach I’ve avoided is to use an adverb for the non-retry methods, mainly because there is no retry possible for the savepoint methods, so it seemed silly to have svp_run_safely() to complement do_safely() and txn_do_safely().

Brilliant suggestions warmly appreciated.


Unicode Normalization in SQL

I’ve been peripherally aware of the need for Unicode normalization in my code for a while, but only got around to looking into it today. Although I use Encode to convert text inputs into Perl’s internal form, and to encode all my outputs as UTF-8 or another appropriate encoding, it does nothing about normalization.

What’s normalization, you ask?

Well, Unicode allows some characters to be encoded in different ways. For example, é can be written as either “é”, which is a “precomposed character,” or as “é”, which is a combination of “e” and “́”. This is all well and good, but the trouble comes when you want to compare values. Observe this Perl one-liner:

% perl -le 'print "\x{00e9}" eq "\x{0065}\x{0301}" ? "t" : "f"'
f

The same issue exists in your database. Here’s an example from PostgreSQL:

try=# select U&'\00E9' = U&'\0065\0301';
 ?column? 
----------
 f
(1 row)

If you have a user enter data in your Web app using combining characters, and then another does a search with canonical characters, the search will fail. This won’t do at all.
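
To make the problem concrete, here’s a minimal sketch of such a failed lookup, using a hypothetical users table: the stored value spells “Renée” with a combining accent, while the search term uses the precomposed character, so the equality comparison finds nothing.

CREATE TEMPORARY TABLE users (name TEXT);
INSERT INTO users VALUES (U&'Ren\0065\0301e'); -- “Renée” with a combining acute accent

SELECT name FROM users WHERE name = U&'Ren\00E9e'; -- precomposed “é”: returns no rows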

The solution is to normalize your Unicode data. In Perl, you can use Unicode::Normalize, a C/XS module that uses Perl’s internal unicode tables to convert things as appropriate. For general use the NFC normalization is probably best:

use Unicode::Normalize;
$string = NFC $string;

PostgreSQL offers no normalization routines. However, the SQL standard mandates one (as of SQL 2008, at least). It looks like this:

<normalize function> ::= NORMALIZE <left paren> <character value expression> [ <comma> <normal form> [ <comma> <normalize function result length> ] ] <right paren>
<normal form> ::= NFC | NFD | NFKC | NFKD

The second argument defaults to NFC and the third, which specifies a maximum length of the return value, is optional. The fact that it looks like a function means that we can use PL/PerlU to emulate it in PostgreSQL until a proper implementation makes it into core. Here’s how:

CREATE OR REPLACE FUNCTION NORMALIZE(
    string TEXT,
    form   TEXT,
    maxlen INT
) RETURNS TEXT LANGUAGE plperlu AS $$
    use Unicode::Normalize 'normalize';
    my ($string, $form, $maxlen) = @_;
    my $ret = normalize($form, $string);
    elog(ERROR, 'Normalized value is too long') if length $ret > $maxlen;
    return $ret;
$$;

CREATE OR REPLACE FUNCTION NORMALIZE(
    string TEXT,
    form   TEXT
) RETURNS TEXT LANGUAGE plperlu AS $$
    use Unicode::Normalize 'normalize';
    return normalize($_[1], $_[0]);
$$;

CREATE OR REPLACE FUNCTION NORMALIZE(
    string TEXT
) RETURNS TEXT LANGUAGE plperlu AS $$
    use Unicode::Normalize 'normalize';
    return normalize('NFC', shift);
$$;

I wrote a few tests to make sure it was sane:

SELECT U&'\0065\0301' as combined,
       char_length(U&'\0065\0301'),
       NORMALIZE(U&'\0065\0301') as normalized,
       char_length(NORMALIZE(U&'\0065\0301'));

SELECT NORMALIZE(U&'\0065\0301', 'NFC')  AS NFC,
       NORMALIZE(U&'\0065\0301', 'NFD')  AS NFD,
       NORMALIZE(U&'\0065\0301', 'NFKC') AS NFKC,
       NORMALIZE(U&'\0065\0301', 'NFKD') AS NFKD
;

SELECT NORMALIZE(U&'\0065\0301', 'NFC', 1)  AS NFC,
       NORMALIZE(U&'\0065\0301', 'NFD', 2)  AS NFD,
       NORMALIZE(U&'\0065\0301', 'NFKC', 1) AS NFKC,
       NORMALIZE(U&'\0065\0301', 'NFKD', 2) AS NFKD;

SELECT NORMALIZE(U&'\0065\0301', 'NFD', 1);

And the output:

 combined | char_length | normalized | char_length 
----------+-------------+------------+-------------
 é        |           2 | é          |           1
(1 row)

 nfc | nfd | nfkc | nfkd 
-----+-----+------+------
 é   | é   | é    | é
(1 row)

 nfc | nfd | nfkc | nfkd 
-----+-----+------+------
 é   | é   | é    | é
(1 row)

psql:try.sql:45: ERROR:  error from Perl function "normalize": Normalized value is too long at line 5.

Cool! So that’s fairly close to the standard. The main difference is that the form argument must be a string instead of a constant literal. But PostgreSQL would likely support both. The length argument is also a literal in the standard, which can be expressed in characters or bytes, but for our purposes a simple integer is fine. The only downside is that it’s slow: PostgreSQL must convert its text value to a Perl string to pass to the function; Unicode::Normalize then turns it into a C string to do the conversion, then back into a Perl string, which, in turn, is returned to PostgreSQL and converted back into the text form. Not the quickest process, but it may prove useful anyway.

Update: 1 Hour Later

Note that this issue applies when using full text search, too. Alas, it does not normalize unicode characters for you:

try=# select to_tsvector(U&'\00E9clair') @@ to_tsquery(U&'\0065\0301clair');
 ?column? 
----------
 f
(1 row)

But normalizing with the functions I introduced does work:

try=# select to_tsvector(U&'\00E9clair') @@ to_tsquery(normalize(U&'\0065\0301clair'));
 ?column? 
----------
 t
(1 row)

So yes, this really can be an issue in your applications.


Neither NULL nor NOT NULL: An SQL WTF

While working on result set testing functions, I ran into a bit of weirdness when comparing rows between two cursors. I had code that looked more or less like this:

FETCH have INTO rec_have;
FETCH want INTO rec_want;
WHILE rec_have IS NOT NULL OR rec_want IS NOT NULL LOOP
    IF rec_have IS DISTINCT FROM rec_want THEN
        RETURN FALSE;
    END IF;
    FETCH have INTO rec_have;
    FETCH want INTO rec_want;
END LOOP;
RETURN TRUE;

Basically, the idea is to return true if the two cursors return equivalent rows in the same order. However, things started to get weird when any of the rows included a NULL: it seemed that the loop exited as soon as a NULL was encountered, even if there were also non-NULL values in the row. I poked around a bit and discovered, to my astonishment, that such a record is neither NULL nor NOT NULL:

try=# select ROW(1, NULL) IS NULL;
 ?column? 
----------
 f
(1 row)

try=# select ROW(1, NULL) IS NOT NULL;
 ?column? 
----------
 f
(1 row)

I had thought that a value, even a composite value, had to be either NULL or NOT NULL, so I thought it was a bug. I mean, this isn’t possible, is it? I dutifully asked on the pgsql-hackers list and was informed, to further astonishment, that this is, in fact, mandated by the SQL standard. WTF? As Jeff says, “The standard is what it is. If it says that some NULLs are red and some NULLs are blue, then we’d probably support it.”

Through the discussion, I learned that a record is considered NULL only if all of its values are NULL, and it’s considered NOT NULL only if none of its values are NULL:

try=# select ROW(NULL, NULL) IS NULL;
 ?column? 
----------
 t
(1 row)

try=# select ROW(1, 1) IS NOT NULL;
 ?column? 
----------
 t
(1 row)

The upshot is that composite values with at least one NULL and at least one NOT NULL value are ambiguous. It’s insane, but there you have it.

Jeff thought that you could cheat the standard by moving the NOT in front of the value before checking its NULLness. I changed my code to reflect this, and things got better:

FETCH have INTO rec_have;
FETCH want INTO rec_want;
WHILE NOT rec_have IS NULL OR NOT rec_want IS NULL LOOP
    IF rec_have IS DISTINCT FROM rec_want THEN
        RETURN FALSE;
    END IF;
    FETCH have INTO rec_have;
    FETCH want INTO rec_want;
END LOOP;
RETURN TRUE;

Kind of confusing to read, but at least it’s not too ugly. In truth, however, it’s still inconsistent: it just makes it so that such records are both NULL and NOT NULL:

try=# select NOT ROW(1, NULL) IS NULL;
 ?column? 
----------
 t
(1 row)

try=# select NOT ROW(1, NULL) IS NOT NULL;
 ?column? 
----------
 t
(1 row)

But it was good enough for me. For a while. But then I started testing the pathological scenario where a row contains only NULLs. I call it pathological because no one should ever have rows with only NULLs. But the truth is that the SQL standard allows it (despite the objections of relational theorists) and, I’ve little doubt, application developers get such rows all the time.

The problem with such rows is that they are inherently NULL, but, according to the documentation for the use of cursors in PL/pgSQL, when fetching rows from a cursor, “if there is no next row, the target is set to NULL(s).” The upshot is that, because I’m using a WHILE loop to fetch rows from a cursor, and rows with only NULLs are themselves considered NULL, there is no way to tell the difference between a row that contains NULLs and the end of a cursor.

To demonstrate, I sent an example of two functions that process a cursor, one using a plain PL/pgSQL FOR rec IN stuff LOOP, which internally detects the difference between rows full of NULLs and the end of the cursor, and one using the WHILE NOT rec IS NULL LOOP syntax required by the pgTAP testing functions. The output looked like this:

    dob     |     ssn
------------+-------------
 1965-12-31 |
            |
            | 932-45-3456
 1963-03-23 | 123-45-6789
(4 rows)

    dob     | ssn
------------+-----
 1965-12-31 |
(1 row)

The two functions are processing the same query in cursors, but while the FOR loop properly returned all four rows, the WHILE loop stopped when it hit a row with only NULLs. I found this annoying, to say the least. Fortunately, other folks were paying better attention to the docs, pointing out that the special PL/pgSQL FOUND variable does just the trick, being set to TRUE when a row is fetched, even if the row is all NULLs, and FALSE when there are no more rows in the cursor. In fact, had I read two more sentences in the relevant documentation, I would have noticed that it says, “As with SELECT INTO, the special variable FOUND can be checked to see whether a row was obtained or not.” D’oh!

So now my function looks more or less like this:

FETCH have INTO rec_have;
have_found := FOUND;
FETCH want INTO rec_want;
want_found := FOUND;
WHILE have_found OR want_found LOOP
    IF rec_have IS DISTINCT FROM rec_want THEN
        RETURN FALSE;
    END IF;
    FETCH have INTO rec_have;
    have_found := FOUND;
    FETCH want INTO rec_want;
    want_found := FOUND;
END LOOP;
RETURN TRUE;

Yeah, pretty verbose and full of a lot of explicit processing that I can just take for granted in more sane languages, but it does the trick. Don’tcha just love SQL?

That issue behind me, I’ll do a bit more hacking on it this week, and hopefully I’ll get a release of pgTAP out with the new result set testing support before I leave for vacation early next week.


Committed: pgTAP Result Set Assertion Functions

Regular readers will know that I’ve been thinking a lot about testing SQL result sets, how to name result testing functions, and various implementation issues. I am very happy to say that I’ve now committed the first three such test functions to the Git repository. They’ve been tested on 8.4 and 8.3. Here’s what I came up with.

I had a pretty good idea how to compare sets and how to compare ordered bags, but ordered sets and unordered bags of results escaped me. During two days of intense hacking and experimentation, I quickly wrote set_eq(), which performs a set comparison of the results of two queries, and obag_eq(), which performs a row-by-row comparison of the results of two queries. I then set to work on bag_eq(), which would do a set comparison but require the same number of duplicate rows between the two queries. set_eq() was easy because I just needed to create temporary tables of the two queries and then execute two EXCEPT queries against them to see where they differ, if at all. bag_eq() was getting kind of hairy, though, so I asked about it on the Freenode #postgresql channel, where depesz looked at my example and pointed out that EXCEPT ALL would do just what I needed.

Hot damn, all it took was the addition of a single extra word to the same queries used by set_eq(), and I was set. This made me very happy, and such well-thought-out features are the reason I love PostgreSQL. My main man depesz made my day.
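
To illustrate the technique (this is just a sketch of the idea, not pgTAP’s actual internals, and the temporary table names are mine), the set version finds rows present in one result but not the other, and adding ALL makes duplicate counts significant:

-- Set comparison: any row returned means the two results differ as sets.
(SELECT * FROM have_tmp EXCEPT SELECT * FROM want_tmp)
UNION ALL
(SELECT * FROM want_tmp EXCEPT SELECT * FROM have_tmp);

-- Bag comparison: EXCEPT ALL also catches differing numbers of duplicate rows.
(SELECT * FROM have_tmp EXCEPT ALL SELECT * FROM want_tmp)
UNION ALL
(SELECT * FROM want_tmp EXCEPT ALL SELECT * FROM have_tmp);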

But oset_eq(), which was to compare ordered sets of results, was proving much harder. The relational operators that operate on sets don’t care about order, so I would have to write the code to care myself. And because dupes needed to be ignored, it got even harder. In fact, it proved just not worth the effort. The main reason I abandoned this test function, though, was not difficulties of implementation (which were significant), but ambiguity of interpretation. After all, if duplicates are allowed but ignored, how does one deal with their effect on order? For example, say that I have two queries that order people based on name. One query might order them like so:

select * from people order by name;
  name  | age 
--------+-----
 Damian |  19
 Larry  |  53
 Tom    |  35
 Tom    |  44
 Tom    |  35

Another run of the same query could give me a different order:

select * from people order by name;
  name  | age 
--------+-----
 Damian |  19
 Larry  |  53
 Tom    |  35
 Tom    |  35
 Tom    |  44

Because I ordered only on “name,” the database was free to sort records with the same name in an undefined way, meaning that the rows could be in different orders. This is known, if I understand correctly, as a “partially ordered set,” or “poset.” Which is all well and good, but from my point of view it makes it damn near impossible to do a row-by-row comparison and ignore dupes, because they could be in different orders!

So once I gave up on that, I was down to three functions instead of four, and only one depends on ordering. So I also dropped the idea of having the “o” in the function names. Instead, I changed obag_eq() to results_eq(), and now I think I have three much more descriptive names. To summarize, the functions are:

results_eq
Compares two result sets row by row, meaning that they must be in the same order and have the same number of duplicate rows in the same places.
set_eq
Compares result sets to ensure they have the same rows, without regard to order or duplicate rows.
bag_eq
Compares result sets without regard to order, but each must have the same duplicate rows.

I’m very happy with this, because I was able to give up on the stupid function names with the word “order” included or implicit in them. Plus, I have different names for functions that are similar, which is nicely in adherence to the principle of distinction. They all provide nice diagnostics on failure, as well, like this from results_eq():

# Failed test 146
#     Results differ beginning at row 3:
#         have: (1,Anna)
#         want: (22,Betty)

Or this from set_eq() or bag_eq():

# Failed test 146
#     Extra records:
#         (87,Jackson)
#         (1,Jacob)
#     Missing records:
#         (44,Anna)
#         (86,Angelina)

set_eq() and bag_eq() also offer up useful diagnostics when the data types of the rows vary:

# Failed test 147
#     Columns differ between queries:
#         have: (integer,text)
#         want: (text,integer)

results_eq() doesn’t have access to such data, though if I can find some tuits (got any to give me?), I’ll write a quick C function that can return an array of the data types in a record object.

Now, as for the issue of arguments, what I settled on is, like Epic, passing strings of SQL to these functions. However, unlike Epic, if you pass in a simple string with no spaces, or a double-quoted string, pgTAP assumes that it’s the name of a prepared statement. The documentation now recommends prepared statements, which you can use like this:

PREPARE my_test AS SELECT * FROM active_users() WHERE name LIKE 'A%';
PREPARE expect AS SELECT * FROM users WHERE active = $1 AND name LIKE $2;
SELECT results_eq('my_test', 'expect');

This allows you to keep your SQL written as SQL, keeping your test, um, SQLish. But in those cases where you have some really simple SQL, you can just use that, too:

SELECT set_eq(
    'SELECT * FROM active_users()',
    'SELECT * FROM users ORDER BY id'
);

This feels like a good compromise to me, allowing the best of both worlds: keeping things in pure SQL to avoid quoting ugliness in SQL strings, while letting users pass in SQL strings if they really want to.

It turns out that I wasn’t able to support cursors for set_eq() or bag_eq(), because they use the statements passed to them to create temporary tables and then compare the records in those temporary tables. But results_eq() uses cursors internally. And it turns out that there’s a data type for cursors, refcursor. So it was easy to add cursor support to results_eq() for those who want to use it:

DECLARE cwant CURSOR FOR SELECT * FROM active_users();
DECLARE chave CURSOR FOR SELECT * FROM users WHERE active ORDER BY name;
SELECT results_eq('cwant'::refcursor, 'chave'::refcursor );

Neat, huh? As I said, I’m very pleased with this approach overall. There are a few caveats, such as less strict comparisons in results_eq() on 8.3 and lower, and less useful diagnostics for data type differences in results_eq(), but overall, I think that the implementation is pretty good, and that these functions will be really useful.

So what do you think? Please clone the Git repository and take the functions for a test drive on 8.3 or 8.4. Let me know what you think!

In the meantime, before releasing a new version, I still plan to add:

  • set_includes() - Set includes records in another set.
  • set_excludes() - Set excludes records in another set.
  • bag_includes() - Bag includes records in another bag.
  • bag_excludes() - Bag excludes records in another bag.
  • col_eq() - Single column result set equivalent to an array of values.
  • row_eq() - Single row from a query equivalent to a record.
  • rowtype_is() - The data type of the rows in a query is equivalent to an array of types.

Hopefully I can find some time to work on those next week. The only challenging one is row_eq(), so I may skip that one for now.


Need Help Naming Result Set Testing Functions

I’ve been thinking more since I posted about testing SQL result sets, and I think I’ve settled on two sets of functions for pgTAP: one that tests two SQL queries (though you will be encouraged to use a prepared statement), and one to test two cursors. I’m thinking of naming them:

  • query_gets()
  • cursor_gets()

I had been planning on *_returns() or *_yields(), but they didn’t feel right. “Returns” implies that I would be passing a query and a data structure (to me at least), and while I want to support that, too, it’s not what I was looking for right now. “Yield,” on the other hand, is more related to set-returning functions in my mind (even if PL/pgSQL doesn’t use that term). Anyway, I like the use of “gets” because it’s short and pretty unambiguous.

These functions will compare query results as unordered sets, but I want variants that test ordered sets, as well. I’ve been struggling to come up with a decent name for these variants, but not liking any very well. The obvious ones are:

  • ordered_query_gets()
  • ordered_cursor_gets()

And:

  • sorted_query_gets()
  • sorted_cursor_gets()

But these are kind of long for functions that will be, I believe, used frequently. I could just add a character to get the same idea, in the spirit of sprintf:

  • oquery_gets()
  • ocursor_gets()

Or:

  • squery_gets()
  • scursor_gets()

I think that these are okay, but might be somewhat confusing. I think that the “s” variant probably won’t fly, since for sprintf and friends, the “s” stands for “string.” So I’m leaning towards the “o” variants.

But I’m throwing it out there for the masses to make suggestions: Got any ideas for better function names? Are there some relational terms for ordered sets, for example, that might make more sense? What do you think?

As a side note, I’m also considering:

  • col_is() to compare the result of a single column query to an array or other query. This would need an ordered variant, as well.
  • row_is(), although I have no idea how I’d be able to support passing a row expression to a function, since PostgreSQL doesn’t allow RECORDs to be passed to functions.


Doomed To Reinvent

There’s an old saying, “Whoever doesn’t understand X is doomed to reinvent it.” X can stand for any number of things. The other day, I was pointing out that such is the case for ORM developers. Take ActiveRecord, for example. As I demonstrated in a 2007 Presentation, because ActiveRecord doesn’t support simple things like aggregates or querying against functions or changing how objects are identified, you have to fall back on using its find_by_sql() method to actually run the SQL, or using fuck typing to force ActiveRecord to do what you want. There are only two ways to get around this: Abandon the ORM and just use SQL, or keep improving the ORM until it has, in effect, reinvented SQL. Which would you choose?

I was thinking about this as I was hacking on a Drupal installation for a client. The design spec called for the comment form to be styled in a very specific way, with image submit buttons. Drupal has this baroque interface for building forms: essentially an array of arrays. Each element of the array is a form element, unless it’s markup. Or something. I can’t really make heads or tails of it. What’s important is that there are a limited number of form elements you can create, and as of Drupal 5, image isn’t fucking one of them!

Now, as a software developer, I can understand this. I sometimes overlook a feature when implementing some code. But the trouble is: why have some bizarre data structure to represent a subset of HTML when you already have something that works? It’s called HTML. Drupal, it seems, is doomed to reinvent HTML.

So just as I have often had to use find_by_sql() as the fallback to get ActiveRecord to fetch the data I want, as opposed to what it thinks I want, I had to fall back on the Drupal form data structure’s ability to accept embedded HTML like so:

$form['submit_stuff'] = array(
  '#weight' => 20,
  '#type'   => 'markup',
  '#value'  => '<div class="form-submits">'
              . '<label></label><p class="message">(Maximum 3000 characters)</p>'
              . '<div class="btns">'
              . '<input type="image" value="Preview comment" name="op" src="preview.png" />'
              . '<img width="1" height="23" src="divider.png" />'
              . '<input type="image" value="Post comment" name="op" src="post.png" />'
              . '</div></div>',
);

Dear god, why? I understand that you can create images using an array in Drupal 6, but I fail to understand why it was ever a problem. Just give me a templating environment where I can write the fucking HTML myself. Actually, Drupal already has one: it’s called PHP! Please don’t make me deal with this weird hierarchy of arrays; it’s just a bad reimplementation of a subset of HTML.

I expect that there actually is some way to get what I want, even in Drupal 5, as I’m doing some templating for comments and pages and whatnot. But that should be the default IMHO. The weird combining of code and markup into this hydra-headed data structure (and don’t even get me started on the need for the #weight key to get things where I want them) is just so unnecessary.

In short, if it ain’t broke, don’t reinvent it!

</rant>


Thoughts on Testing SQL Result Sets

pgTAP: The Critique

I’ve been continuing hacking on pgTAP in order to add a lot more schema-testing functionality and a few other niceties. But back when I started the project, I was using it to write tests for CITEXT, which was great for my sanity as I developed it, but proved a bit controversial. In a pgsql-hackers post, Tom Lane wrote:

There was some discussion earlier about whether the proposed regression tests for citext are suitable for use in contrib or not. After playing with them for awhile, I have to come down very firmly on the side of “not”. I have these gripes:

  1. The style is gratuitously different from every other regression test in the system. This is not a good thing. If it were an amazingly better way to do things, then maybe, but as far as I can tell the style pgTAP forces on you is really pretty darn poorly suited for SQL tests. You have to contort what could naturally be expressed in SQL as a table result into a scalar. Plus it’s redundant with the expected-output file.

  2. It’s ridiculously slow; at least a factor of ten slower than doing equivalent tests directly in SQL. This is a very bad thing. Speed of regression tests matters a lot to those of us who run them a dozen times per day — and I do not wish to discourage any developers who don’t work that way from learning better habits ;-)

Because of #1 and #2 I find the use of pgTAP to be a nonstarter.

These are legitimate criticisms, of course. To take the second item first, I would eventually like to figure out a way to make pgTAP a lot faster (in my own benchmarks, I found it to be about 4 times slower than pure SQL, not ten times, but still). A number of functions can likely be rewritten in C, and maybe data can be stored in memory rather than in a temporary table. Overall, though, the speed of the tests doesn’t really concern me much. I’m quite used to large test suites, such as that for Bricolage, that take 5 or 10 minutes or more. This is because, compared to the time it would take me to maintain the code without tests, it’s nothing. I find and fix bugs much more quickly thanks to regression tests. And really, one should just run a subset of the tests for whatever one is working on, and then run the full suite before checking in. One could even have a larger, more comprehensive (read: slower) test suite that’s run via a cron job, so that it identifies bugs in checked in code but developers don’t have to spend a lot of time waiting for tests to finish running.

As a result, I wouldn’t advocate for converting the existing PostgreSQL regression test suite to pgTAP. I could see writing a new suite of tests on pgTAP that run on the build farm. This would be great, as they would complement the existing test suite, and be able to test stuff that can’t be tested with pg_regress.

So really, the performance issue can be addressed in a few ways, some technical, some social, some structural. Like I said, I’m not overly concerned about it, and I wouldn’t make Tom suffer unduly from it, either (I converted all of the CITEXT tests to plain SQL).

Coercing Composite Values

The first issue is tougher, however. Tom was responding to a test like this:

SELECT is(
    ARRAY( SELECT name FROM srt ORDER BY name )::text,
    ARRAY['AAA', 'aardvark', 'aba', 'ABC', 'abc']::text,
    'The words should be case-insensitively sorted'
);

Now, I agree that it’s redundant with the expected-output file, but the assumption with TAP is that there is no expected output file: you just analyze its output using a harness. The need for an expected output file is driven by the legacy of pg_regress.

A bigger issue, and the one I’ll focus on for the remainder of this post, is the requirement currently inherent in pgTAP to “contort what could naturally be expressed in SQL as a table result into a scalar.” The issue is apparent in the above example: even though I’m selecting a number of rows from a table, I use the ARRAY() constructor function to force them into a scalar value—an array—in order to easily do the comparison. It also results in a useful diagnostic message in case the test fails:

# Failed test 40: "The words should be case-insensitively sorted"
#         have: {AAA,aardvark,ABC,abc,aba}
#         want: {AAA,aardvark,aba,ABC,abc}

So for simple cases like this, it doesn’t bother me much personally. But I’ve also had to write tests for functions that return composite types—that is, rows—and again I had to fall back on coercing them into scalar values to do the comparison. For example, say that the fooey() function returns a dude value, which is a composite type with an integer and a text string. Here’s how to test it with pgTAP:

SELECT is(
    fooey()::text,
    ROW( 42, 'Bob' )::text,
    'Should get what we expect from fooey()'
);

So I’m again coercing a value into something else (of course, if I could pass records to functions, that issue goes away). And it does yield nice diagnostics on failure:

# Failed test 96: "Should get what we expect from fooey()"
#         have: (42,Fred)
#         want: (42,Bob)

It gets much worse with set-returning functions—Tom’s “table result”: it requires both type and row coercion (or “contortion,” if you’d prefer). Here’s an example of a fooies() function that returns a set of dudes:

SELECT is(
    ARRAY( SELECT ROW(f.*)::text FROM fooies() f ),
    ARRAY[
        ROW( 42, 'Fred' )::text,
        ROW( 99, 'Bob' )::text
    ],
    'Should get what we expect from fooies()'
);

As you can see, it’s do-able, but clumsy and error prone. We really are taking a table result and turning it into a scalar value. And thanks to the casts to text, the test can actually incorrectly pass if, for example, the integer was actually stored as text (although, to be fair, the same is true of a pg_regress test, where everything is converted to text before comparing results).
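
Here’s a quick illustration of that pitfall: once both rows are cast to text, an integer 42 and the string '42' render identically, so the comparison can’t tell them apart.

SELECT ROW(42, 'Fred')::text = ROW('42'::text, 'Fred')::text; -- true, alas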

What we really need is a way to write two queries and compare their result sets, preferably without any nasty casts or coercion into scalar values, and with decent diagnostics when a test fails.

As an aside, another approach is to use EXCEPT queries to make sure that two data sets are the same:

SELECT ok(
    NOT EXISTS (
        (SELECT 42, 'Fred' UNION SELECT 99, 'Bob')
        EXCEPT
        SELECT * from fooies()
    ),
    'Should get what we expect from fooies()'
);

SELECT ok(
    NOT EXISTS (
        SELECT * from fooies()
        EXCEPT
        (SELECT 42, 'Fred' UNION SELECT 99, 'Bob')
    ),
    'Should have no unexpected rows from fooies()'
);

Here I’ve created two separate tests. The first makes sure that fooies() returns all the expected rows, and the second makes sure that it doesn’t return any unexpected rows. But since this is just a boolean test (yes, we’ve coerced the results into booleans!), there are no diagnostics if the test fails: you’d have to go ahead and run the query yourself to see what’s unexpected. Again, this is do-able, and probably a more correct comparison than using the casts of rows to text, but makes it harder to diagnose failures. And besides, EXCEPT compares sets, which are inherently unordered. That means that if you need to test that results come back in a specific order, you can’t use this approach.

That said, if someone knows of a way to do this in one query—somehow make some sort of NOT EXCEPT operator work—I’d be very glad to hear it!

Prior Art

pgTAP isn’t the only game in town. There is also Dmitry Koterov’s PGUnit framework and Bob Brewer’s Epic. PGUnit seems to have one main assertion function, assert_same(), which works much like pgTAP’s is(). Epic’s assert_equal() does, too, but Epic also offers a few functions for testing result sets that neither pgTAP nor PGUnit support. One such function is assert_rows(), to which you pass strings that contain SQL to be evaluated. For example:

CREATE OR REPLACE FUNCTION test.test_fooies() RETURNS VOID AS $_$
BEGIN
    PERFORM test.assert_rows(
        $$ VALUES(42, 'Fred'), (99, 'Bob') $$,
        $$ SELECT * FROM fooies()          $$
    );
    RAISE EXCEPTION '[OK]';
END;
$_$ LANGUAGE plpgsql;

This works reasonably well. Internally, Epic runs each query twice, using EXCEPT to compare result sets, just as in my boolean example above. This yields a proper comparison, and because assert_rows() iterates over returned rows, it emits a useful message upon test failure:

psql:try_epic.sql:21: ERROR:  Record: (99,Bob) from: VALUES(42, 'Fred'), (99, 'Bob') not found in: SELECT * FROM fooies()
CONTEXT:  SQL statement "SELECT  test.assert_rows( $$ VALUES(42, 'Fred'), (99, 'Bob') $$, $$ SELECT * FROM fooies() $$ )"
PL/pgSQL function "test_fooies" line 2 at PERFORM

A bit hard to read with all of the SQL exception information, but at least the information is there. At PGCon, Bob told me that passing strings of SQL code made things a lot easier to implement in Epic, and I can certainly see how that could be (pgTAP uses SQL code strings too, with its throws_ok(), lives_ok(), and performs_ok() assertions). But it just doesn’t feel SQLish to me. I mean, if you needed to write a really complicated query, it might be harder to maintain: even using dollar quoting, it’s harder to track stuff. Furthermore, it’s slow, as PL/pgSQL’s EXECUTE must be called twice, and thus each query must be planned twice. And don’t even try to test a query with side-effects—such as a function that inserts a row and returns an ID—as the second run will likely lead to a test failure, or just might blow something up.

SQL Blocks?

One approach is to use blocks. I’m thinking here of something like Ruby blocks or Perl code references: a way to dynamically create some code that is compiled and planned when it loads, but its execution can be deferred. In Perl it works like this:

my $code = sub { say "woof!" };
$code->(); # prints "woof!"

In Ruby (and to a lesser extent in Perl), you can pass a block to a method:

foo.bar { puts "woof!" }

The bar method can then run that code at its leisure. We can sort of do this in PostgreSQL using PREPARE. To take advantage of it for Epic’s assert_rows() function, one can do something like this:

CREATE OR REPLACE FUNCTION test.test_fooies() RETURNS VOID AS $_$
BEGIN
    PREPARE want AS VALUES(42, 'Fred'), (99, 'Bob');
    PREPARE have AS SELECT * FROM public.fooies();
    PERFORM test.assert_rows(
        test.global($$ EXECUTE want $$),
        test.global($$ EXECUTE have $$)
    );
    RAISE EXCEPTION '[OK]';
END;
$_$ LANGUAGE plpgsql;

The nice thing about using a prepared statement is that you can actually write all of your SQL in SQL, rather than in an SQL string, and then pass the simple EXECUTE statement to assert_rows(). Also note the calls to test.global() in this example. This is a tricky function in Epic that takes an SQL statement, turns its results into a temporary table, and then returns the table name. This is required for the EXECUTE statements to work properly, but a nice side-effect is that the actual queries are executed only once each, to create the temporary tables. Thereafter, those temporary tables are used to fetch results for the test.

Another benefit of prepared statements is that you can write a query once and use it over and over again in your tests. Say that you had a few set returning functions that return different results from the users table. You could then test them all like so:

CREATE OR REPLACE FUNCTION test.test_user_funcs() RETURNS VOID AS $_$
BEGIN
    PREPARE want(bool) AS SELECT * FROM users WHERE active = $1;
    PREPARE active     AS SELECT * FROM get_active_users();
    PREPARE inactive   AS SELECT * FROM get_inactive_users();
    PERFORM test.assert_rows(
        test.global($$ EXECUTE want(true) $$),
        test.global($$ EXECUTE active     $$)
    );
    PERFORM test.assert_rows(
        test.global($$ EXECUTE want(false) $$),
        test.global($$ EXECUTE inactive    $$)
    );
    RAISE EXCEPTION '[OK]';
END;
$_$ LANGUAGE plpgsql;

Note how I’ve tested both the get_active_users() and the get_inactive_users() function by passing different values when executing the want prepared statement. Not bad. I think that this is pretty SQLish, aside from the necessity for test.global().

Still, the use of prepared statements with Epic’s assert_rows() is not without issues. There is still a lot of execution here (to create the temporary tables and to select from them a number of times). Hell, this last example reveals an inefficiency in the creation of the temporary tables, as the two different executions of want create two separate temporary tables for data that’s already in the users table. If you have a lot of rows to compare, a lot more memory will be used. And you still can’t check the ordering of your results, either.

So for small result sets and no need to check the ordering of results, this is a pretty good approach. But there’s another.

Result Set Handles

Rather than passing blocks to be executed by the tests, in many dynamic testing frameworks you can pass data structures to be compared. For example, Test::More’s is_deeply() assertion allows you to test that two data structures contain the same values in the same structures:

is_deeply \@got_data, \@want_data, 'We should have the right stuff';

This does a deep comparison between the contents of the @got_data array and @want_data. Similarly, I could imagine a test to check the contents of a DBIx::Class result set object:

results_are( $got_resultset, $want_resultset );

In this case, the results_are() function would iterate over the two result sets, comparing each result to make sure that they were identical. So if prepared statements in SQL are akin to blocks in dynamic languages, what is akin to a result set?

The answer, if you’re still with me, is cursors.

Now, cursors don’t work with Epic’s SQL-statement style tests, but I could certainly see how a pgTAP function like this would be useful:

DECLARE want CURSOR FOR SELECT * FROM users WHERE active;
DECLARE have CURSOR FOR SELECT * FROM get_active_users();
SELECT results_are( 'want', 'have' );

The nice thing about this approach is that, even more than with prepared statements, everything is written in SQL. The results_are() function would simply iterate over each row returned from the two cursors to make sure that they were the same. In the event that there was a difference, the diagnostic output would be something like:

#   Failed test 42:
#     Results begin differing at row 3:
#          have: (3,Larry,t)
#          want: (3,Larry,f)

So there’s a useful diagnostic, ordering is preserved, no temporary tables are created, and the data is fetched directly from its sources (tables or functions or whatever) just as it would be in a straight SQL statement. You still have the overhead of PL/pgSQL’s EXECUTE, and iterating over the results, but, outside of some sort of NOT INTERSECT operator, I don’t see any other way around it.
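
For the record, here’s a minimal PL/pgSQL sketch of how such a function might walk the two cursors. The signature, the return value, and the bare-bones failure message are just my assumptions; a real implementation would emit proper TAP with the diagnostics shown above.

CREATE OR REPLACE FUNCTION results_are( have refcursor, want refcursor )
RETURNS TEXT LANGUAGE plpgsql AS $$
DECLARE
    rec_have   RECORD;
    rec_want   RECORD;
    have_found BOOLEAN;
    want_found BOOLEAN;
    rownum     INTEGER := 0;
BEGIN
    LOOP
        FETCH have INTO rec_have; have_found := FOUND;
        FETCH want INTO rec_want; want_found := FOUND;
        -- Stop only when both cursors are exhausted.
        EXIT WHEN NOT have_found AND NOT want_found;
        rownum := rownum + 1;
        IF rec_have IS DISTINCT FROM rec_want THEN
            RETURN 'not ok - results begin differing at row ' || rownum;
        END IF;
    END LOOP;
    RETURN 'ok';
END;
$$;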

The Plan

So I think I’ll actually look at adding support for doing this in two ways: one with prepared statements (or query strings, if that’s what floats your boat) like Epic does, though I’m going to look at avoiding the necessity for something like Epic’s global() function. But I’ll also add functions to test cursors. And maybe a few combinations of these things.

So, does an approach like this, especially the cursor solution, address Tom’s criticism? Does it feel more relational? Just to rewrite the kind of test Tom originally objected to, it would now look something like this:

DECLARE have CURSOR FOR SELECT name FROM srt ORDER BY name;
DECLARE want CURSOR FOR VALUES ('AAA'), ('aardvark'), ('aba'), ('ABC'), ('abc');
SELECT results_are(
    'have', 'want',
    'The words should be case-insensitively sorted'
);

Thoughts? I’m not going to get to it this week, so feedback would be greatly appreciated.


pgTAP 0.21 Drops

Mmmmm…pgTAP

I just dropped a new version of pgTAP following a few marathon hack sessions since my talk at PGCon (movie here, BTW). Actually, the new performs_ok() function came about as I wrote the presentation, but all the rest came on the flight home and in the few days since. Not sure when I’ll hack on it this much again (it’s getting a bit big at 5,758 lines of PL/pgSQL and SQL).

Overall I’m very happy with this release, as it adds a lot of new assertion functions. In particular, I added a slew of functions that test that the objects in a given schema (or visible in the search path, if you prefer) are exactly the objects that should be there. This is useful for a couple of things. For one, Norman Yamada, in his PGCon talk, mentioned that his team was using pgTAP to compare database objects between replicated databases. I like this because it’s a great example of using pgTAP for system testing, rather than just unit testing as I’ve been advocating. See, pgTAP can be used for any kind of testing!

Another use for these functions is in a large organization where many different people might be making changes to a schema. In this scenario, you might have application developers adding new objects to the database (or dropping objects) without necessarily informing the DBAs. Using, for example, tables_are() and functions_are() and continuous testing, the DBAs can see when objects have been modified by the developers. Even better, if the developers are running the pgTAP tests themselves (as they should be!), they will be reminded to add new tests for their changes when the existing tests notice that things have been added or dropped and thus fail.
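
For example, a couple of assertions along these lines (the schema and object names here are hypothetical) will start failing as soon as somebody adds or drops something without updating the tests:

SELECT tables_are(
    'public',
    ARRAY[ 'users', 'widgets', 'orders' ],
    'public should have exactly the expected tables'
);

SELECT functions_are(
    'public',
    ARRAY[ 'get_active_users', 'get_inactive_users' ],
    'public should have exactly the expected functions'
);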

Beyond that, I added a bunch of new functions for testing functions and a number of other goodies. Check out the release notes for all the details.

With these changes, I’ve finished nearly everything I’ve thought of for pgTAP. There are only a few sequence-testing functions left on the To Do list, as well as a call to add a throws_like() function, which I’ll throw in soon if no one else steps up. Beyond these changes, I have a few ideas of where to take it next, but so far I’m kind of stumped. Mainly what I think should be done is to add an interface that makes it easier to compare relations (or result sets, if you prefer). Epic does this by allowing query strings to be passed to a function, but I’d really like to keep queries in SQL rather than in SQL strings. I’ll be giving it some more thought and will post about it soon.


Unit Test Your Database!

Gave my talk at PGCon today. I felt that it went well, and was well-received. So here it is for everyone else, for posterity, thanks to SlideShare. Enjoy!


Learn Mad Database Skillz at YAPC::NA 2009

A few weeks ago, I twittered that, in my opinion, application developers should really learn how to use databases. And by that I mean SQL, really. I know that a lot of app developers like to use ORMs to access the database, so that you don’t have to really think about it, but most ORMs are written by people who don’t like databases, don’t like SQL, haven’t taken the time to learn it in any depth, and thus don’t write very good SQL. And even if they do like SQL, that usually means they provide a way for you to execute SQL queries directly. The reason? Because the ORM does not really understand how building more and more complex queries can hurt performance, or that there is more than one way to do it. It’s pretty common to have to go back to custom SQL to solve performance issues. So to solve those problems, you gotta understand SQL.

Another sin of application developers is to try to use very standard SQL syntax and features when writing SQL queries, so that they can easily be ported to other databases. Well, if you’re going to do that, you might as well use an ORM, but never mind. Think about it this way: If you were writing an application in Ruby, would you avoid the use of blocks because you might someday want to port it to Perl? And how often have you decided to port an application to another database, anyway? Sure, some OSS projects add support for new databases, but they seldom drop support for one RDBMS in favor of another.

If you’re writing an application in Perl, it pays to learn idiomatic Perl. If you’re writing it in Ruby, it pays to use idiomatic Ruby. So why would you settle for anything less when using an RDBMS? SQL is, after all, just another programming language, and the various dialects have their advantages and disadvantages. Learning how SQL really works and how to leverage the features of your RDBMS will only improve the performance, reliability, and scalability of your app. If your Perl or Ruby or Python code doesn’t look like C, why would you write least-common denominator ANSI-92 compliant SQL? You have a powerful programming language and application server with an amazing array of features and capabilities. Use them!

All of which is a very long-winded way to say that it really, truly pays to learn the ins and outs of SQL, just like any other language. And if you’re a Perl hacker, you have a great opportunity to do just that at YAPC::NA 10 this summer. In response to my tweet, YAPC organizer Robert Blackwell replied in agreement, and pointed out that famed SQL expert Joe Celko, author of numerous volumes on SQL syntax and techniques, will be offering two classes on SQL at YAPC:

This is a great opportunity to expand your knowledge of SQL, how it works, and why it’s so powerful. (Even if you’re not fond of the idea of relational databases, think of it as an opportunity to follow Tom Christiansen’s injunction and learn a bit about logical programming.) Celko knows SQL like nobody’s business, and will be sharing that knowledge in two remarkably cheap courses. Even if you’re not a Perl hacker, if you want to really learn the ins and outs of SQL—how to write idiomatic SQL to match the mad skillz you already apply to your application code—you could hardly do better than to get in on these deals and drink from the Celko firehose. I only wish I were going to be there (alas, prior plans interfered). But do please tell me all about it!


More on Database Testing

I’ve been meaning for a while to come back to the topic of database testing. After posting that entry, I thought more about the quote from a PostgreSQL core hacker, which I think bears repeating:

Well, you are testing for bugs, and bugs are pretty specific in where they appear. Writing the tests is 90% of the job; writing the infrastructure is minor. If the infrastructure has limitations, which all do, you might as well write that extra 10% too.

I had been so focused on the first sentence, on the why of database testing, that I’d not really addressed the rest. I failed to notice that he was questioning the utility of a testing infrastructure, or what I would call a framework, like pgTAP. So let me rectify that right now by addressing his actual point.

The idea of using an established framework and protocol is to be able to focus exclusively on the task of writing tests, rather than worrying about how to analyze test results. I agree that writing tests can be time-consuming, but that doesn’t mean that one should write one’s own testing framework. The great thing about pgTAP is that it emits TAP, which can then be analyzed along with any other TAP-emitting test framework in any environment, including Perl, C, JavaScript, PHP, and even Oracle, among others.

The other argument that might support writing one’s own testing infrastructure is if it’s too hard to apply one style of testing to a given application. For example, most of the existing TAP producers provide a functional interface to writing tests. SQL, on the other hand, is not a functional language. So—leaving aside for the moment that one can provide an effective functional interface for writing database tests—even if one wanted to write a relational-style testing framework, it could still emit TAP! TAP is, after all, just a stream of text. So as long as a SQL SELECT statement returns a stream of TAP, then you can take advantage of the myriad of test analysis tools out there.
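
For example, with pgTAP installed, each assertion is simply a SELECT that returns a line of TAP, which any TAP harness can then consume:

SELECT plan( 2 );
SELECT ok( 1 + 1 = 2, 'Addition should work' );
SELECT is( lower('TAP'), 'tap', 'lower() should fold case' );
SELECT * FROM finish();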

Now, I was discussing the use of TAP with a different PostgreSQL contributor, who was asking me about modifying the output of pg_regress to be TAP. The way that pg_regress works—and therefore how PostgreSQL core tests work—is simple: One writes SQL statements into a test script, and then one writes an expected output file. If the output of the tests might vary by platform, database setting, or compile-time feature, one just creates more expected files, each with the appropriate variations.

The PostgreSQL test runner, pg_regress, then simply runs the script through psql and diffs the output against each expected file. If one of the files is identical to the output, the test passes. Otherwise it fails. When the tests run, the output looks like this:

parallel group (2 tests):  copyselect copy
   copy                 ... ok
   copyselect           ... ok

My core hacker correspondent was thinking of modifying this output to be TAP, something like this:

# Parallel group (2 tests):  copyselect copy
1..2
ok 1 - copy
ok 2 - copyselect

With this change, he could then run the regression tests through TAP::Harness in a cron job and send failure reports when a test failed. This is good as far as it goes, but it has a couple of significant limitations. For one, there are no diagnostics if something goes wrong. This is because, and this is the second shortcoming, it just turns the result of testing a single script into TAP, not individual assertions. There might be 1000s of SQL statements in one script, but if the test fails, one won’t know what failed until one looks at regression.diff.

One of the great features of TAP is the support for diagnostics. For example, if an assertion fails, you might see output something like this:

not ok 38 - The frobnitz should be named "foo"
# Failed test 38: "The frobnitz should be named "foo""
#         have: NULL
#         want: foo

Just changing the listing of the test scripts run does not get you this advantage. That’s not to say that it doesn’t make certain things easier, or that one couldn’t simply have shorter test scripts in order to limit the scope of what’s being tested and what’s a result. But a single test script does not make for a good assertion. In short, pg_regress tests don’t do assertions at all. They simply compare actual and expected output from very verbose scripts. This is a hell of a lot better than nothing, but is still quite limited.

I suggested to my correspondent that he consider modifying the tests he was working on to use pgTAP, instead. Of course, if you have a lot of existing tests, it might be more trouble than it’s worth to rewrite them all. But that doesn’t mean that you can’t write new tests going forward using something that’s more granular, and gives you a lot more control over the output.

His answer rather shocked me:

I’m lucky enough to only be dealing with really good developers, so I can produce software that works even without focusing specifically on low-level unit tests.

To me, it’s a misconception to think “really good developers” don’t need tests. As I said in reply, I consider myself a “really good developer,” and I’d have a whole lot more pain with the code I maintain if it weren’t for the tests I’ve written. Tests do not compensate for poor coders. Rather, they make it easier to maintain, modify, and refactor code, to fix bugs, and to add new features. Not to mention testing my code on new versions of software, such as testing my CPAN modules with Perl 5.10 and testing my client’s PostgreSQL databases with 8.3 or 8.4 as they look to upgrade from 8.2.

One place where my correspondent agreed that pg_regress could use improvement is in the whole approach to matching different outputs. Using pgTAP, one can write tests in such a way that there are different expected results depending on database settings right in the test scripts themselves! This is because the TAP output can vary in any number of ways, and the harness doesn’t care, as long as the tests pass. pg_regress is extremely strict about what it considers a passing test, and this lacks the necessary flexibility for testing some advanced features.

This is why there are no tests for multibyte character semantics (such as locale-aware sorting) in the PostgreSQL core tests.
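
Here’s a rough sketch of the kind of flexibility I mean; the table and the collation check are made up, but the point is that the expected value can be computed right in the test, based on a server setting, rather than maintained in a separate expected-output file:

SELECT is(
    ARRAY( SELECT name FROM people ORDER BY name )::text,
    CASE WHEN current_setting('lc_collate') = 'C'
         THEN ARRAY['AAA', 'ABC', 'aardvark', 'aba', 'abc']::text
         ELSE ARRAY['AAA', 'aardvark', 'aba', 'ABC', 'abc']::text
    END,
    'Names should sort according to the current locale'
);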

Are you a database testing doubter? Have I convinced you yet that a rock-solid test suite can actually make your job easier and more enjoyable? Perhaps I never will, but I am still very interested in your reasons for doubting the utility of database testing. What other points should I be thinking about as I prepare for my PGCon presentation?


pgTAP 0.16 in the Wild

I’ve been writing a lot of tests for a client in pgTAP lately. It’s given me a lot to think about in terms of features I need and best practices in writing tests. I’m pleased to say that, overall, it has been absolutely invaluable. I’m doing a lot of database refactoring, and having the safety of solid test coverage has been an absolute godsend. pgTAP has done a lot to free me from worry about the effects of my changes, as it ensures that everything about the databases continue to just work.

Of course, that’s not to say that I don’t screw up. There are times when my refactorings have introduced new bugs or incompatibilities; after all, the tests I write of existing functionality extend only so far as I can understand that functionality. But as such issues come up, I just add regression tests, fix the issues, and move on, confident in the knowledge that, as long as the tests continue to be run regularly, those bugs will never come up again. Ever.

As a result, I’ll likely be posting a bit on best practices I’ve found while writing pgTAP tests. As I’ve been writing them, I’ve started to find the cow paths that help me to keep things sane. Most helpful is the large number of assertion functions that pgTAP offers, of course, but there are a number of techniques I’ve been developing as I’ve worked. Some are better than others, and still others suggest that I need to find other ways to do things (you know, when I’m cut-and-pasting a lot, there must be another way, though I’ve always done a lot of cut-and-pasting in tests).

In the meantime, I’m happy to announce the release of pgTAP 0.16. This version includes a number of improvements to the installer (including detection of Perl and TAP::Harness, which are required to use the included pg_prove test harness app). The installer also has an important bug fix that greatly increases the chances that the os_name() function will actually know the name of your operating system. And of course, there are new test functions, with a quick usage sketch after the list:

  • has_schema() and hasnt_schema(), which test for the presence or absence of a schema
  • has_type() and hasnt_type(), which test for the presence and absence of a data type, domain, or enum
  • has_domain() and hasnt_domain(), which test for the presence and absence of a data domain
  • has_enum() and hasnt_enum(), which test for the presence and absence of an enum
  • enum_has_labels(), which tests that an enum has an expected list of labels

As usual, you can download the latest release from pgFoundry. Visit the pgTAP site for more information and for documentation.
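
If you’re wondering what the new assertions look like in use, here’s a minimal sketch of a test script (the schema, enum, and label names are invented for the example):

BEGIN;
SELECT plan(3);

SELECT has_schema('contacts', 'should have a contacts schema');
SELECT has_enum('mood', 'should have a mood enum');
SELECT enum_has_labels('mood', ARRAY['sad', 'ok', 'happy'], 'mood should have its labels');

SELECT * FROM finish();
ROLLBACK;

Run it through pg_prove with something like pg_prove -d mydb schema_test.sql and you get the usual TAP summary.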

Looking for the comments? Try the old layout.

Dynamic OFFSETs and LIMITs

I discovered a great hack for dealing with optional offsets and limits in PostgreSQL functions while working for a client, and I wanted to get it down here so that I wouldn’t forget it.

The deal is that I was writing tests for functions that returned a set of IDs for objects based on some criterion plus an optional offset and limit. The functions looked something like this:

CREATE OR REPLACE FUNCTION get_widgets_for_user_id(
    a_user_id integer,
    a_offset  integer,
    a_limit   integer
) RETURNS SETOF integer AS $$ 
DECLARE  
    l_id    integer;
    l_query text;
BEGIN
    l_query := 'SELECT id FROM widgets WHERE user_id = ' || a_user_id
            || ' ORDER BY created_at DESC';
    IF a_offset IS NOT NULL THEN 
        l_query := l_query || ' OFFSET ' || a_offset; 
    END IF; 
    IF a_limit IS NOT NULL THEN 
        l_query := l_query || ' LIMIT ' || a_limit; 
    END IF; 

    FOR l_id IN EXECUTE l_query LOOP 
            RETURN NEXT l_id;
    END LOOP; 
        
    RETURN; 
END;
$$ LANGUAGE PLPGSQL;

It seemed silly to me that this should be in PL/pgSQL: ultimately, it’s such a simple query that I wanted it to be a SQL query. Of course I knew that if a_offset was NULL I could fall back on 0. But what about dealing with a NULL limit?

Well, it turns out that you can pass a CASE expression to the LIMIT clause that optionally returns no value at all, and it will just work. Observe:

try=% select id from widgets LIMIT CASE WHEN false THEN 3 END;
 id
----
   1
   2
   3
   4
(4 rows)
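
For comparison, when the CASE does produce a value, LIMIT applies as usual (again assuming that same little widgets table):

-- Returns just three rows.
select id from widgets LIMIT CASE WHEN true THEN 3 END;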

Pretty weird, huh? Well, for my purposes, it’s perfect, because I was able to rewrite that function as a pure SQL function, and it’s a lot simpler, to boot:

CREATE OR REPLACE FUNCTION get_widgets_for_user_id(
    a_user_id integer,
    a_offset  integer,
    a_limit   integer
) RETURNS SETOF integer AS $$ 
    SELECT id
        FROM widgets
        WHERE user_id = $1
        ORDER BY created_at DESC
    OFFSET COALESCE( $2, 0 )
        LIMIT CASE WHEN $3 IS NOT NULL THEN $3 END
$$ LANGUAGE SQL;

Now isn’t that a hell of a lot easier to read? Like I said, it’s a little weird, but overall I think it’s a pretty good hack. I’ve tested it with PostgreSQL 8.2 and 8.3. Not sure about other versions, but give it a try!
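
To make the calling convention concrete, here’s roughly how it gets called (the user ID and paging values here are made up):

-- Everything for user 42, newest first.
SELECT * FROM get_widgets_for_user_id(42, NULL, NULL);

-- Skip the first 20 rows, then return at most 10.
SELECT * FROM get_widgets_for_user_id(42, 20, 10);

-- No offset, but cap the results at 5.
SELECT * FROM get_widgets_for_user_id(42, NULL, 5);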

Update: Thanks for the comments! With the insight that CASE is ultimately passing a NULL to LIMIT when the value is NULL, I realized that I could switch from CASE to COALESCE for nice parity with the handling of the OFFSET clause. Check it out:

CREATE OR REPLACE FUNCTION get_widgets_for_user_id(
    a_user_id integer,
    a_offset  integer,
    a_limit   integer
) RETURNS SETOF integer AS $$ 
    SELECT id
        FROM widgets
        WHERE user_id = $1
        ORDER BY created_at DESC
    OFFSET COALESCE( $2, 0 )
        LIMIT COALESCE( $3, NULL )
$$ LANGUAGE SQL;

Looking for the comments? Try the old layout.

How to Generate Recurring Events in the Database

This is a followup to my request for help fixing the performance of a database view that generated recurrences for events. This was an essential feature of Sandy, and thus important to get right. The idea when I started was simple:

  • Add a recurrence domain to the database that supports a number of different values, including “daily”, “weekly”, and “monthly”.
  • Add a recurrence column to the events table that identifies how an event recurs.
  • Add a recurrence_dates table that contains a pre-generated list of recurrences for a given date. I’d populated this table with five years of dates, each one mapped to five years’ worth of recurrence dates (see the original blog entry for more on the format of this table).
  • Create a view that maps each events row to its date and recurrence in the recurrence_dates table.
  • Profit.

It was this last bullet point that didn’t quite work out: although the data was perfectly accurate, queries for a lot of events in the view were very expensive. I mean, the query could run for 3-4 minutes. It was just crazy! I couldn’t figure out the problem, so I posted my request for help. It was through the discussions that followed with depesz that I finally figured out what the problem was: although I was usually selecting only a week’s or a month’s worth of events, the view was calculating rows for all five years’ worth of data for all of the events for a given user. Um, not efficient.

So here I finally document how, with a lot of help and example code from depesz, I solved the problem. The trick was to use a function instead of a view to generate the recurring event rows, and to limit it only to the dates we’re interested in. For convenience’s sake, I broke this down into two PL/pgSQL functions: one to generate recurring dates and one to return the recurring event rows. But first, here’s the recurrence domain and the events table, both of which are unchanged from the original approach:

CREATE DOMAIN recurrence AS TEXT
CHECK ( VALUE IN ( 'none', 'daily', 'weekly', 'monthly' ) );

CREATE TABLE events (
    id         SERIAL     PRIMARY KEY,
    user_id    INTEGER    NOT NULL,
    starts_at  TIMESTAMP  NOT NULL,
    start_tz   TEXT       NOT NULL,
    ends_at    TIMESTAMP,
    end_tz     TEXT       NOT NULL,
    recurrence RECURRENCE NOT NULL DEFAULT 'none'
);

Just assume the user_id is a foreign key. Now let’s populate this table with some data. For the purposes of this demonstration, I’m going to create one event per day for 1000 days, evenly divided between daily, weekly, monthly, and no recurrences, as well as five different times of day and six different durations:

INSERT INTO events (user_id, starts_at, start_tz, ends_at, end_tz, recurrence)
SELECT 1,
       ts::timestamp,
       'PST8PDT',
       ts::timestamp + dur::interval,
       'PST8PDT',
       recur
  FROM (
    SELECT '2007-12-19'::date + i || ' ' || CASE i % 5
               WHEN 0 THEN '06:00'
               WHEN 1 THEN '10:00'
               WHEN 2 THEN '14:00'
               WHEN 3 THEN '18:00'
               ELSE        '22:30'
               END,
           CASE i % 6
               WHEN 0 THEN '2 hours'
               WHEN 1 THEN '1 hour'
               WHEN 2 THEN '45 minutes'
               WHEN 3 THEN '3.5 hours'
               WHEN 4 THEN '15 minutes'
               ELSE        '30 minutes'
               END,
           CASE i % 4
               WHEN 0 THEN 'daily'
               WHEN 1 THEN 'weekly'
               WHEN 2 THEN 'monthly'
               ELSE        'none'
               END
    FROM generate_series(1, 1000) as gen(i)
  ) AS ser( ts, dur, recur);

This gives us some nicely distributed data:

try=# select * from events limit 10;
  id  | user_id |      starts_at      | start_tz |       ends_at       | end_tz  | recurrence 
------+---------+---------------------+----------+---------------------+---------+------------
    1 |       1 | 2007-12-20 10:00:00 | PST8PDT  | 2007-12-20 11:00:00 | PST8PDT | weekly
    2 |       1 | 2007-12-21 14:00:00 | PST8PDT  | 2007-12-21 14:45:00 | PST8PDT | monthly
    3 |       1 | 2007-12-22 18:00:00 | PST8PDT  | 2007-12-22 21:30:00 | PST8PDT | none
    4 |       1 | 2007-12-23 22:30:00 | PST8PDT  | 2007-12-23 22:45:00 | PST8PDT | daily
    5 |       1 | 2007-12-24 06:00:00 | PST8PDT  | 2007-12-24 06:30:00 | PST8PDT | weekly
    6 |       1 | 2007-12-25 10:00:00 | PST8PDT  | 2007-12-25 12:00:00 | PST8PDT | monthly
    7 |       1 | 2007-12-26 14:00:00 | PST8PDT  | 2007-12-26 15:00:00 | PST8PDT | none
    8 |       1 | 2007-12-27 18:00:00 | PST8PDT  | 2007-12-27 18:45:00 | PST8PDT | daily
    9 |       1 | 2007-12-28 22:30:00 | PST8PDT  | 2007-12-29 02:00:00 | PST8PDT | weekly
   10 |       1 | 2007-12-29 06:00:00 | PST8PDT  | 2007-12-29 06:15:00 | PST8PDT | monthly
(10 rows)

Now let’s get to the recurring date function:

CREATE OR REPLACE FUNCTION generate_recurrences(
    recurs     RECURRENCE,
    start_date DATE,
    end_date   DATE
)
    RETURNS SETOF DATE
    LANGUAGE plpgsql IMMUTABLE
    AS $BODY$
DECLARE
    next_date DATE := start_date;
    duration  INTERVAL;
    day       INTERVAL;
    dom       TEXT; -- day of the month that monthly recurrences must land on
BEGIN
    IF recurs = 'none' THEN
        -- Only one date ever.
        RETURN NEXT next_date;
    ELSIF recurs = 'weekly' THEN
        duration := '7 days'::interval;
        WHILE next_date <= end_date LOOP
            RETURN NEXT next_date;
            next_date := next_date + duration;
        END LOOP;
    ELSIF recurs = 'daily' THEN
        duration := '1 day'::interval;
        WHILE next_date <= end_date LOOP
            RETURN NEXT next_date;
            next_date := next_date + duration;
        END LOOP;
    ELSIF recurs = 'monthly' THEN
        duration := '27 days'::interval;
        day      := '1 day'::interval;
        dom      := to_char(start_date, 'DD');
        WHILE next_date <= end_date LOOP
            RETURN NEXT next_date;
            next_date := next_date + duration;
            WHILE to_char(next_date, 'DD') <> dom LOOP
                next_date := next_date + day;
            END LOOP;
        END LOOP;
    ELSE
        -- Someone needs to update this function, methinks.
        RAISE EXCEPTION 'Recurrence % not supported by generate_recurrences()', recurs;
    END IF;
END;
$BODY$;

The idea here is pretty simple: pass in a recurrence (“daily”, “weekly”, or “monthly”), a start date, and an end date, and get back a set of all the recurrence dates between the start and end dates:

try=# \timing
Timing is on.
try=# select * from generate_recurrences('daily', '2008-01-29', '2008-02-05');
 generate_recurrences 
----------------------
 2008-01-29
 2008-01-30
 2008-01-31
 2008-02-01
 2008-02-02
 2008-02-03
 2008-02-04
 2008-02-05
(8 rows)

Time: 0.548 ms
try=# select * from generate_recurrences('weekly', '2008-01-29', '2008-03-05');
 generate_recurrences 
----------------------
 2008-01-29
 2008-02-05
 2008-02-12
 2008-02-19
 2008-02-26
 2008-03-04
(6 rows)

Time: 0.670 ms
try=# select * from generate_recurrences('monthly', '2008-01-29', '2008-05-05');
 generate_recurrences 
----------------------
 2008-01-29
 2008-02-29
 2008-03-29
 2008-04-29
(4 rows)

Time: 0.644 ms

Not bad, eh? And PostgreSQL’s date and interval calculation operators are wicked fast. Check out how long it takes to generate two years’ worth of daily recurrence dates:

try=# select * from  generate_recurrences('daily', '2008-01-29', '2010-02-05');
 generate_recurrences 
----------------------
 2008-01-29
 2008-01-30
 2008-01-31
...
 2010-02-03
 2010-02-04
 2010-02-05
(739 rows)

Time: 4.982 ms

Awesome. And the great thing about this function is that any time I need to add new recurrences (yearly, biweekly, quarterly, weekends, weekdays, etc.), I just modify the domain and this function and we’re ready to go.
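
For instance, a hypothetical “yearly” recurrence would mean loosening the domain (this sketch assumes PostgreSQL’s default name for the domain’s CHECK constraint):

ALTER DOMAIN recurrence DROP CONSTRAINT recurrence_check;
ALTER DOMAIN recurrence ADD CONSTRAINT recurrence_check
CHECK ( VALUE IN ( 'none', 'daily', 'weekly', 'monthly', 'yearly' ) );

And then adding one more branch to generate_recurrences(), right alongside the others:

    ELSIF recurs = 'yearly' THEN
        duration := '1 year'::interval;
        WHILE next_date <= end_date LOOP
            RETURN NEXT next_date;
            next_date := next_date + duration;
        END LOOP;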

And now, part two: the recurring event function:

CREATE OR REPLACE FUNCTION recurring_events_for(
    for_user_id INTEGER,
    range_start TIMESTAMP,
    range_end   TIMESTAMP
)
    RETURNS SETOF events
    LANGUAGE plpgsql STABLE
    AS $BODY$
DECLARE
    event events;
    start_date TIMESTAMPTZ;
    start_time TEXT;
    ends_at    TIMESTAMPTZ;
    next_date  DATE;
    recurs_at  TIMESTAMPTZ;
BEGIN
    FOR event IN 
        SELECT *
          FROM events
         WHERE user_id = for_user_id
           AND (
                   recurrence <> 'none'
               OR  (
                      recurrence = 'none'
                  AND starts_at BETWEEN range_start AND range_end
               )
           )
    LOOP
        IF event.recurrence = 'none' THEN
            RETURN NEXT event;
            CONTINUE;
        END IF;

        start_date := event.starts_at::timestamptz AT TIME ZONE event.start_tz;
        start_time := start_date::time::text;
        ends_at    := event.ends_at::timestamptz AT TIME ZONE event.end_tz;

        FOR next_date IN
            SELECT *
                FROM generate_recurrences(
                        event.recurrence,
                        start_date::date,
                        (range_end AT TIME ZONE event.start_tz)::date
                )
        LOOP
            recurs_at := (next_date || ' ' || start_time)::timestamp
                AT TIME ZONE event.start_tz;
            EXIT WHEN recurs_at > range_end;
            CONTINUE WHEN recurs_at < range_start AND ends_at < range_start;
            event.starts_at := recurs_at;
            event.ends_at   := ends_at;
            RETURN NEXT event;
        END LOOP;
    END LOOP;
    RETURN;
END;
$BODY$;

The idea here is to select the appropriate events for a given user between two dates, and for each event iterate over all of the recurrences between the two dates and return a row for each one. So the lines starting with FOR event IN and ending with LOOP select the original events, looking for either recurring events or non-recurring events that fall between the two dates. Note that, if you needed to, you could easily refine this query for your particular application, or even use PL/pgSQL’s EXECUTE statement to generate the query dynamically to suit particular application needs.

Next, the block starting with IF event.recurrence = 'none' THEN simply returns any non-recurring events. Although the next block already handles this case, adding this optimization eliminates a fair bit of calculation for the common case of non-recurring events.

Then the lines starting with FOR next_date IN and ending with LOOP select all of the dates for the recurrence in question, using the generate_recurrences() function created earlier. From LOOP to END LOOP;, the function generates the start and end timestamps, exiting the loop when the start date falls after the range or when it falls before the range and the end date falls after the range. There are many other tweaks one could make here to modify which recurrences are included and which are excluded. For example, if you had a column in the events table such as exclude_dates TIMESTAMP[] NOT NULL DEFAULT '{}' that stored an array of dates to ignore when generating recurrences, you could just add this line to go ahead and exclude them from the results returned by the function:

            CONTINUE WHEN recurs_at = ANY( event.exclude_dates );
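
For completeness, adding the column itself is a one-liner, matching the definition described above:

ALTER TABLE events
  ADD COLUMN exclude_dates TIMESTAMP[] NOT NULL DEFAULT '{}';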

But enough of the details: let’s see how it works! Here’s a query for a week’s worth of data:

try=# select * from recurring_events_for(1, '2007-12-19', '2007-12-26');
  id  | user_id |      starts_at      | start_tz |       ends_at       | end_tz  | recurrence 
------+---------+---------------------+----------+---------------------+---------+------------
    1 |       1 | 2007-12-20 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    2 |       1 | 2007-12-21 14:00:00 | PST8PDT  | 2007-12-21 06:45:00 | PST8PDT | monthly
    3 |       1 | 2007-12-22 18:00:00 | PST8PDT  | 2007-12-22 21:30:00 | PST8PDT | none
    4 |       1 | 2007-12-23 22:30:00 | PST8PDT  | 2007-12-23 14:45:00 | PST8PDT | daily
    4 |       1 | 2007-12-24 22:30:00 | PST8PDT  | 2007-12-23 14:45:00 | PST8PDT | daily
    4 |       1 | 2007-12-25 22:30:00 | PST8PDT  | 2007-12-23 14:45:00 | PST8PDT | daily
    5 |       1 | 2007-12-24 06:00:00 | PST8PDT  | 2007-12-23 22:30:00 | PST8PDT | weekly
    6 |       1 | 2007-12-25 10:00:00 | PST8PDT  | 2007-12-25 04:00:00 | PST8PDT | monthly
(8 rows)

Time: 51.890 ms

Note the time it took to execute this query. 52 ms is a hell of a lot faster than the several minutes it took to run a similar query using the old view. Plus, I’m not limited to just the recurrence dates I’ve pre-calculated in the old recurrence_dates table. Now we can use whatever dates are supported by PostgreSQL. It’s even fast when we look at a year’s worth of data:

try=# select * from recurring_events_for(1, '2007-12-19', '2008-12-19');
  id  | user_id |      starts_at      | start_tz |       ends_at       | end_tz  | recurrence 
------+---------+---------------------+----------+---------------------+---------+------------
    1 |       1 | 2007-12-20 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    1 |       1 | 2007-12-27 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    1 |       1 | 2008-01-03 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    1 |       1 | 2008-01-10 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    1 |       1 | 2008-01-17 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    1 |       1 | 2008-01-24 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    1 |       1 | 2008-01-31 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
...
  364 |       1 | 2008-12-17 22:30:00 | PST8PDT  | 2008-12-17 14:45:00 | PST8PDT | daily
  364 |       1 | 2008-12-18 22:30:00 | PST8PDT  | 2008-12-17 14:45:00 | PST8PDT | daily
  365 |       1 | 2008-12-18 06:00:00 | PST8PDT  | 2008-12-17 22:30:00 | PST8PDT | weekly
(19691 rows)

Time: 837.759 ms

Not stellar, but still respectable. Given that for a typical application, a user will be looking at only a day’s or a week’s or a month’s events at a time, this seems to be an acceptable trade-off. I mean, how often will your users need to see a list of 20,000 events? And even if a user was looking at a year’s worth of data, it’s unlikely that 75% of them would be recurring, as in the example data here.

I was fucking pumped about this solution, and Sandy has hummed along nicely since we put it into production. If you’re interested in trying it for yourself, you can get all the SQL from this blog entry here.

The only thing I would have liked to do differently is to encapsulate the recurring_events_for() function in a view. That would have made it much easier to actually use this solution in Rails. If you know how to do that, please do leave a comment. As for how I hacked Rails to use the function, well, that’s a blog post for another day.

Looking for the comments? Try the old layout.

My First C: A GTIN Data Type for PostgreSQL

After all of my recent experimentation creating UPC, EAN, and GTIN validation functions, I became interested in trying to create a GTIN PostgreSQL data type in C. The fact that I don’t know C didn’t stop me from learning enough to do some damage. And now I have a first implementation done. Check it out!
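
For anyone who missed those earlier posts, the heart of UPC, EAN, and GTIN validation is the GS1 check digit: weight the digits 1 and 3 alternately, starting from the right, and the total must come out evenly divisible by 10. A quick SQL sketch of just that rule (ignoring length checks and assuming a digits-only input) looks something like this:

CREATE OR REPLACE FUNCTION is_gtin(text) RETURNS boolean AS $$
    -- Weight each digit 1 or 3, alternating from the rightmost digit,
    -- and require the weighted sum to be divisible by 10.
    SELECT sum(
               substring($1 FROM length($1) - i + 1 FOR 1)::integer
               * CASE WHEN i % 2 = 0 THEN 3 ELSE 1 END
           ) % 10 = 0
      FROM generate_series(1, length($1)) AS gen(i);
$$ LANGUAGE SQL IMMUTABLE STRICT;

SELECT is_gtin('036000291452') returns true, and flipping any single digit flips it to false.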

So how did I do this? Well, chapter six of the Douglas Book was a great help to get me started. I also learned what I could by reading the source code for the core and contributed PostgreSQL data types, as well as the EnumKit enumerated data type builder (download from here). And the denizens of the #postgresql channel on FreeNode were also extremely helpful. Thank you, guys!

I would be very grateful if the C hackers among you, and especially any PostgreSQL core hackers who happen to read my blog, would download the GTIN source code and have a look at it. This is the first C code I’ve written, so it would not surprise me if there were some gotchas that I missed (memory leaks, anyone?). And yes, I know that the new ISN contributed data types in the forthcoming 8.2 are a far more featureful implementation of bar code data types; I learned about them after I had nearly finished this first release of GTIN. But I did want to learn some C and how to create PostgreSQL data types, and to provide the code for others to learn from, as well. It may also end up as the basis for an article. Stay tuned.

In the meantime, share and enjoy.

Update: I forgot to mention that you can check out the source code from the Kineticode Subversion repository.

Looking for the comments? Try the old layout.

What Name Do You Use for an Order Column?

Quick poll.

Say that you have a join table mapping blog entries to tags, and you want the tags to be ordered for each entry. The table might look something like this:

CREATE TABLE entry_join_tag (
    entry_id integer REFERENCES entry(id)
                    ON UPDATE CASCADE
                    ON DELETE CASCADE,
    tag_id   integer REFERENCES tag(id)
                    ON UPDATE CASCADE
                    ON DELETE CASCADE,
    ord       smallint,
    PRIMARY KEY (entry_id, tag_id)
);

It’s the ord column I’m talking about here: to order the tags for each blog entry, you’d do a SELECT like this:

SELECT entry_id, tag_id
  FROM entry_join_tag
 ORDER BY entry_id, ord;

So my question is this: What name do you typically give to the ordering column, since “order” itself isn’t available in SQL (it’s a reserved word, of course). Some of the options I can think of:

  • ord
  • ordr
  • seq
  • place
  • rank
  • tag_ord
  • tag_order
  • tag_place
  • tag_rank
  • tag_seq

Leave a comment to let me know. Thanks!

Looking for the comments? Try the old layout.