Just a Theory

Black lives matter

Posts about postgres

Batch Updates with PL/pgSQL

The third in my series of articles about PL/pgSQL, “Batch Updates with PL/pgSQL” has been published on The O’Reilly Network. Actually it was published last week, but I’ve not been very attentive to my blog lately. Sorry about that. Anyway, it improves upon the code in the second article in the series, “Managing Many-to-Many Relationships with PL/pgSQL,” by modifying the updating functions to use PostgreSQL batch query syntax. This means that the number of database calls in a given call to a function are constant, no matter how many IDs are passed to it.

So check it out!

Looking for the comments? Try the old layout.

PL/pgSQL Talk Slides Posted

The talk that I announced a couple of days ago went off very well, I think. Rich Shepard, a local PostgreSQL who was at the meeting, later said on the mail list:

On the happy side, my thanks to Selena for organizing the group and rounding up a half-dozen cats for the first meeting, and to David for the best presentation at a computer users group meeting I’ve seen in 20 years. The topic was very interesting and the presentation itself highly professional and polished. It’s a standard to be met by future presenters for the benefit of us all.

—Rich Shepard

Wow, praise doesn’t get much higher than that. I’m glad the group got a lot out of the presentation. For the benefit of those who couldn’t make it, I’ve posted the slides for your enjoyment. They might be a bit hard to follow at times without my commentary, but you should be able to fill in the gaps by reading my articles for O’Reilly.

So start having some fun with PL/pgSQL!

Looking for the comments? Try the old layout.

PL/pgSQL Talk for Portland PostgreSQL Users

Attention PostgreSQL users in the Portland metropolitan area and environs (Salem, Eugene, Vancouver, Seattle)! I am honored to be giving the inaugural talk to the newly-formed Portland PostgreSQL Users Group on Wednesday, 19 July 2006 at 19:00 at FreeGeek. My talk will be an introduction to PL/pgSQL. Come check it out and join the fun! Beer and schmoozing to take place after the talk at The Lucky Lab.

Looking for the comments? Try the old layout.

Managing Many-to-Many Relationships with PL/pgSQL

The second in my series of articles on programming PostgreSQL PL/pgSQL, “Managing Many-to-Many Relationships with PL/pgSQL”, has just been published onLamp.com. The idea is to abstract out of the application layer the management of the many-to-many relationships, moving it into the database layer where execution is both safer and faster. And you can learn more about PL/pgSQL along the way.

So what are you waiting for? Check it out!

Looking for the comments? Try the old layout.

Benchmarking PostgreSQL Functions

Update 2006-05-19: I realized that there was a nasty error in my algorithm for determining the runtime of a function: It was only fetching the milliseconds part of the runtime, without adding in seconds and minutes! This led to getting negative runtimes then the milliseconds part of the end time was less than the milliseconds part of the start time. Ugh. But with the help of yain on IRC, I’ve switched to calculating the number of seconds by converting the start and end times to epoch seconds (which have subsecond precision in PostgreSQL, and now things are just dandy. While I was at it, I reorganized the function so that it was a bit easier to read, by constructing the created function in the order it would be executed, and fixed the caching problem, as suggested by Aidan in a comment below.

Following yesterday’s post, Klint Gore sent me some PL/pgSQL code that might be useable as a benchmark function. Today I took that code and ran with it.

The idea was to create a function like the Perl Benchmark module’s timethese() function. In the process, I found, with help from Josh Berkus, that PL/pgSQL’s EXECUTE statement has quite a lot of overhead, and the amount of overhead per call is pretty random. The overhead resulted in pretty inaccurate benchmark numbers, unfortunately.

At Josh’s suggestion, I rewrote the function to just test each function inline, rather than passing the function code as parameters. This time, the results were dead on. So then I refactored the original benchmark function to create its own benchmark function, inlining all of the code, and then call that function. Almost higher-order PL/pgSQL! Again the results were just right, and so now I present it to you:

create type _benchmark as (
    code      text,
    runtime   real,
    corrected real
);

CREATE OR REPLACE FUNCTION benchmark(n INTEGER, funcs TEXT[])
RETURNS SETOF _benchmark AS $$
DECLARE
    code TEXT := '';
    a    _benchmark;
BEGIN
    -- Start building the custom benchmarking function.
    code := $_$
        CREATE OR REPLACE FUNCTION _bench(n INTEGER)
        RETURNS SETOF _benchmark AS $__$
        DECLARE
            s TIMESTAMP;
            e TIMESTAMP;
            a RECORD;
            d numeric;
            res numeric;
            ret _benchmark;
        BEGIN
            -- Create control.
            s := timeofday();
            FOR a IN SELECT TRUE FROM generate_series( 1, $_$ || n || $_$ )
            LOOP
            END LOOP;
            e := timeofday();
            d := extract(epoch from e) - extract(epoch from s);
            ret := ROW( '[Control]', d, 0 );
            RETURN NEXT ret;
$_$;
    -- Append the code to bench each function call.
    FOR i IN array_lower(funcs,1) .. array_upper(funcs, 1) LOOP
        code := code || '
            s := timeofday();
            FOR a IN SELECT ' || funcs[i] || ' FROM generate_series( 1, '
                || n || $__$ ) LOOP
            END LOOP;
            e := timeofday();
            res := extract(epoch from e) - extract(epoch from s);
            ret := ROW(
                $__$ || quote_literal(funcs[i]) || $__$,
                res,
                res - d
            );
            RETURN NEXT ret;
$__$;
    END LOOP;

    -- Create the function.
    execute code || $_$
        END;
        $__$ language plpgsql;
$_$;

    -- Now execute the function.
    FOR a IN EXECUTE 'SELECT * FROM _bench(' || n || ')' LOOP
        RETURN NEXT a;
    END LOOP;

    -- Drop the function.
    DROP FUNCTION _bench(integer);
    RETURN;
END;
$$ language 'plpgsql';

You call the function like this:

try=# select * from benchmark(10000, ARRAY[
try(#     'ean_substr(''036000291452'')',
try(#     'ean_byte(''036000291452'')',
try(#     'ean_c(''036000291452'')'
try(# ]);
            code            | runtime   | corrected 
----------------------------+-----------+-----------
    [Control]                  | 0.0237451 |          0
    ean_substr('036000291452') |  0.497734 |   0.473989
    ean_byte(  '036000291452') |  0.394456 |   0.370711
    ean_c(     '036000291452') | 0.0277281 | 0.00398302
(4 rows)

Pretty slick, eh? The only downside was that, when the DROP FUNCTION line was not commented out, the function would run once, and then, the next time, I’d get this error:

ERROR:  cache lookup failed for function 17323
CONTEXT:  PL/pgSQL function "benchmark" line 49 at for over select rows

I have no idea why. So I just leave the function and let the CREATE OR REPLACE take care of it.

Looking for the comments? Try the old layout.

Corrected PostgreSQL EAN Functions

Update: I updated the benchmarks based on the fixed version of my benchmarking function.

In doing a bit more reading about EAN codes, I realized that my previous attempts to write a validating function for UPC and EAN codes had a significant error: they would only properly validate EAN codes if the first numeral was 0! So I went back and fixed them all, and present them here for posterity.

  • The substring solution:

    CREATE OR REPLACE FUNCTION ean_substr (
        TEXT
    ) RETURNS boolean AS $$
    DECLARE
        offset integer := 0;
        -- Support UPCs.
        ean   TEXT    := CASE WHEN length($1) = 12 THEN '0' || $1 ELSE $1 END;
    BEGIN
        -- Make sure we really have an EAN.
        IF ean !~ '^\\d{13}$' THEN RETURN FALSE; END IF;
    
        RETURN 10 - (
            (
                -- Sum even numerals.
                substring(ean,  2 + offset, 1)::integer
                + substring(ean,  4 + offset, 1)::integer
                + substring(ean,  6 + offset, 1)::integer
                + substring(ean,  8 + offset, 1)::integer
                + substring(ean, 10 + offset, 1)::integer
                + substring(ean, 12 + offset, 1)::integer
            ) * 3 -- Multiply total by 3.
            -- Add odd numerals except for checksum (13).
            + substring(ean,  1 + offset, 1)::integer
            + substring(ean,  3 + offset, 1)::integer
            + substring(ean,  5 + offset, 1)::integer
            + substring(ean,  7 + offset, 1)::integer
            + substring(ean,  9 + offset, 1)::integer
            + substring(ean, 11 + offset, 1)::integer
        -- Compare to the checksum.
        ) % 10 = substring(ean, 13 + offset, 1)::integer;
    END;
    $$ LANGUAGE 'plpgsql' immutable;
    
  • The looping solution:

    CREATE OR REPLACE FUNCTION ean_loop(
        TEXT
    ) RETURNS boolean AS $$
    DECLARE
        total INTEGER := 0;
        -- Support UPCs.
        ean   TEXT    := CASE WHEN length($1) = 12 THEN '0' || $1 ELSE $1 END;
    BEGIN
        -- Make sure we really have an EAN.
        IF ean !~ '^\\d{13}$' THEN RETURN FALSE; END IF;
    
        -- Sum even numerals.
        FOR i IN 2..12 LOOP
            total := total + substring(ean, i, 1)::INTEGER;
            i := i + 1;
        END LOOP;
    
        -- Multiply total by 3.
        total := total * 3;
    
        -- Add odd numerals except for checksum (13).
        FOR i IN 1..11 LOOP
            total := total + substring(ean, i, 1)::INTEGER;
            i := i + 1;
        END LOOP;
    
        -- Compare to the checksum.
        RETURN 10 - total % 10 = substring(ean, 13, 1)::INTEGER;
    END;
    $$ LANGUAGE 'plpgsql' immutable;
    
  • The BYTEA solution:

    CREATE OR REPLACE FUNCTION ean_byte (
        arg TEXT
    ) RETURNS boolean AS $$
    DECLARE
        -- Convert to BYTEA; support UPCs.
        ean BYTEA := CASE WHEN length($1) = 12 THEN '0' || $1 ELSE $1 END;
    BEGIN
        -- Make sure we really have an EAN.
        IF arg !~ '^\\d{12,13}$' THEN RETURN FALSE; END IF;
    
        RETURN 10 - (
            (
                -- Sum odd numerals.
                get_byte(ean,  1) - 48
                + get_byte(ean,  3) - 48
                + get_byte(ean,  5) - 48
                + get_byte(ean,  7) - 48
                + get_byte(ean,  9) - 48
                + get_byte(ean, 11) - 48
            ) * 3 -- Multiply total by 3.
            -- Add even numerals except for checksum (12).
            + get_byte(ean,  0) - 48
            + get_byte(ean,  2) - 48
            + get_byte(ean,  4) - 48
            + get_byte(ean,  6) - 48
            + get_byte(ean,  8) - 48
            + get_byte(ean, 10) - 48
        -- Compare to the checksum.
        ) % 10 = get_byte(ean, 12) - 48;
            
    END;
    $$ LANGUAGE plpgsql immutable;
    
  • The PL/Perl solution:

    CREATE OR REPLACE FUNCTION ean_perl (
        TEXT
    ) RETURNS boolean AS $_$
        my $ean = length $_[0] == 12 ? "0$_[0]" : $_[0];
        # Make sure we really have an EAN.
        return 'false' unless $ean =~ /^\d{13}$/;
        my @nums = split '', $ean;
        return 10 - (
            # Sum even numerals.
            (   (   $nums[1] + $nums[3] + $nums[5] + $nums[7] + $nums[9]
                        + $nums[11]
                ) * 3 # Multiply total by 3.
            # Add odd numerals except for checksum (12).
            ) + $nums[0] + $nums[2] + $nums[4] + $nums[6] + $nums[8] + $nums[10]
        # Compare to the checksum.
        ) % 10 == $nums[12] ? 'true' : 'false';
    $_$ LANGUAGE plperl immutable;
    
  • The C solution (thanks StuckMojo!):

    #include <string.h>
    #include "postgres.h"
    #include "fmgr.h"
    
    Datum ean_c(PG_FUNCTION_ARGS);
    
    PG_FUNCTION_INFO_V1(ean_c);
    
    Datum ean_c(PG_FUNCTION_ARGS) {
    
        char *ean;
        text *arg = PG_GETARG_TEXT_P(0);
        int  arglen = VARSIZE(arg) - VARHDRSZ;
        bool ret = false;
    
        /* Validate the easy stuff: 12 or 13 digits. */
        if ((arglen != 12 && arglen != 13) || 
            strspn(VARDATA(arg), "0123456789") != arglen) {
            PG_RETURN_BOOL(ret);
        }
    
        /* Support UPCs. */
        if (arglen == 12) {
            ean = (char *) palloc(13);
            ean[0] = '0';
            memcpy(&ean[1], VARDATA(arg), arglen);
        } else {
            ean = (char *) palloc(arglen);
            memcpy(ean, VARDATA(arg), arglen);
        }
    
        ret = 10 - (
                /* Sum even numerals and multiply total by 3. */
                (  ean[1] - '0' + ean[3] - '0' + ean[5]  - '0' 
                    + ean[7] - '0' + ean[9] - '0' + ean[11] - '0') * 3
                /* Add odd numerals except for checksum (12). */
                + ean[0] - '0' + ean[2] - '0' + ean[4]  - '0'
                + ean[6] - '0' + ean[8] - '0' + ean[10] - '0'
            /* Compare to the checksum. */
            ) % 10 == ean[12] - '0';
    
        PG_RETURN_BOOL(ret);
    }
    

And here are the benchmarks for them (without immutable):

try=# select * from benchmark(100000, ARRAY[
try(#     'ean_substr(''4007630000116'')',
try(#     'ean_loop(  ''4007630000116'')',
try(#     'ean_byte(  ''4007630000116'')',
try(#     'ean_perl(  ''4007630000116'')',
try(#     'ean_c(     ''4007630000116'')'
try(# ]);
            code             | runtime  |    rate     | corrected | corrected_rate 
-----------------------------+----------+-------------+-----------+----------------
 [Control]                   | 0.257728 | 388006.17/s |  0.257728 | 388006.17/s
 ean_substr('4007630000116') |  5.07296 | 19712.37/s  |   4.81523 | 20767.44/s
 ean_loop(  '4007630000116') |  9.18085 | 10892.24/s  |   8.92312 | 11206.84/s
 ean_byte(  '4007630000116') |   3.9248 | 25479.02/s  |   3.66707 | 27269.73/s
 ean_perl(  '4007630000116') |   5.5062 | 18161.33/s  |   5.24848 | 19053.15/s
 ean_c(     '4007630000116') | 0.285376 | 350415.10/s |  0.027648 | 3616901.80/s
(6 rows)

Enjoy!

Looking for the comments? Try the old layout.

Benchmarking UPC Validation

Just to follow up on my query about validating UPC codes in PL/pgSQL, Klint Gore sent me a private email demonstrating that treating the UPC code as a binary string performed better than my substringing approach. I modified his version to work like the others, but it looked to me like the performance was about the same. They were just too close for me to really be able to tell.

What I needed was a way to run the queries a whole bunch of times to see the real difference. I asked on #postgresql, and dennisb suggested a simple brute-force approach:

select foo(42) FROM generate_series (1, 10000);

So that’s what I did. The functions I tested were:

  • A refinement of my original substring solution:

    CREATE OR REPLACE FUNCTION ean_substr (
        TEXT
    ) RETURNS boolean AS $$
    DECLARE
        offset integer := 0;
        -- Support UPCs.
        ean   TEXT    := CASE WHEN length($1) = 12 THEN
            '0' || $1
        ELSE
            $1
        END;
    BEGIN
        -- Make sure we really have an EAN.
        IF ean !~ '^\\d{13}$' THEN RETURN FALSE; END IF;
    
        RETURN 10 - (
            (
                -- Sum even numerals.
                substring(ean,  2 + offset, 1)::integer
                + substring(ean,  4 + offset, 1)::integer
                + substring(ean,  6 + offset, 1)::integer
                + substring(ean,  8 + offset, 1)::integer
                + substring(ean, 10 + offset, 1)::integer
                + substring(ean, 12 + offset, 1)::integer
            ) * 3 -- Multiply total by 3.
            -- Add odd numerals except for checksum (13).
            + substring(ean,  3 + offset, 1)::integer
            + substring(ean,  5 + offset, 1)::integer
            + substring(ean,  7 + offset, 1)::integer
            + substring(ean,  9 + offset, 1)::integer
            + substring(ean, 11 + offset, 1)::integer
        -- Compare to the checksum.
        ) % 10 = substring(ean, 12 + offset, 1)::integer;
    END;
    $$ LANGUAGE plpgsql;
    
  • A looping version, based on the comment from Adrian Klaver in the original post:

    CREATE OR REPLACE FUNCTION ean_loop(
        TEXT
    ) RETURNS boolean AS $$
    DECLARE
        total INTEGER := 0;
        -- Support UPCs.
        ean   TEXT    := CASE WHEN length($1) = 12 THEN
            '0' || $1
        ELSE
            $1
        END;
    BEGIN
        -- Make sure we really have an EAN.
        IF ean !~ '^\\d{13}$' THEN RETURN FALSE; END IF;
    
        -- Sum even numerals.
        FOR i IN 2..12 LOOP
            total := total + substring(ean, i, 1)::INTEGER;
            i := i + 1;
        END LOOP;
    
        -- Multiply total by 3.
        total := total * 3;
    
        -- Add odd numerals except for checksum (13).
        FOR i IN 3..11 LOOP
            total := total + substring(ean, i, 1)::INTEGER;
            i := i + 1;
        END LOOP;
    
        -- Compare to the checksum.
        RETURN 10 - total % 10
            = substring(ean, 13, 1)::INTEGER;
    END;
    $$ LANGUAGE 'plpgsql';
            CREATE OR REPLACE FUNCTION ean_loop(
        TEXT
    ) RETURNS boolean AS $$
    DECLARE
        total INTEGER := 0;
        -- Support UPCs.
        ean   TEXT    := CASE WHEN length($1) = 12 THEN
            '0' || $1
        ELSE
            $1
        END;
    BEGIN
        -- Make sure we really have an EAN.
        IF ean !~ '^\\d{13}$' THEN RETURN FALSE; END IF;
    
        -- Sum even numerals.
        FOR i IN 2..12 LOOP
            total := total + substring(ean, i, 1)::INTEGER;
            i := i + 1;
        END LOOP;
    
        -- Multiply total by 3.
        total := total * 3;
    
        -- Add odd numerals except for checksum (13).
        FOR i IN 3..11 LOOP
            total := total + substring(ean, i, 1)::INTEGER;
            i := i + 1;
        END LOOP;
    
        -- Compare to the checksum.
        RETURN 10 - total % 10
            = substring(ean, 13, 1)::INTEGER;
    END;
    $$ LANGUAGE 'plpgsql';
    
  • A PL/Perl version for Josh and Ovid:

    CREATE OR REPLACE FUNCTION ean_perl (
        TEXT
    ) RETURNS boolean AS $_$
        my $ean = length $_[0] == 12 ? "0$_[0]" : $_[0];
        # Make sure we really have an EAN.
        return 'false' unless $ean =~ /^\d{13}$/;
        my @nums = split '', shift;
        return 10 - (
            # Sum even numerals.
            (   (   $nums[1] + $nums[3] + $nums[5]
                    + $nums[7] + $nums[9] + $nums[11]
                ) * 3 # Multiply total by 3.
            # Add odd numerals except for checksum (12).
            ) + $nums[2] + $nums[4] + $nums[6] + $nums[8]
                + $nums[10]
        # Compare to the checksum.
        ) % 10 == $nums[11] ? 'true' : 'false';
    $_$ LANGUAGE plperl;
    
  • And finally, the new version using a byte string:

    CREATE OR REPLACE FUNCTION ean_byte (
        arg TEXT
    ) RETURNS boolean AS $$
    DECLARE
        -- Convert to BYTEA; support UPCs.
        ean BYTEA := CASE WHEN length($1) = 12 THEN
            '0' || $1
        ELSE
            $1
        END;
    BEGIN
        -- Make sure we really have an EAN.
        IF arg !~ '^\\d{12,13}$' THEN RETURN FALSE; END IF;
    
        RETURN 10 - (
            (
                -- Sum even numerals.
                get_byte(ean,  2) - 48
                + get_byte(ean,  4) - 48
                + get_byte(ean,  6) - 48
                + get_byte(ean,  8) - 48
                + get_byte(ean, 10) - 48
                + get_byte(ean, 12) - 48
            ) * 3 -- Multiply total by 3.
            -- Add odd numerals except for checksum (13).
            + get_byte(ean,  3) - 48
            + get_byte(ean,  7) - 48
            + get_byte(ean,  5) - 48
            + get_byte(ean,  9) - 48
            + get_byte(ean, 11) - 48
        -- Compare to the checksum.
        ) % 10  = get_byte(ean, 12) - 48;
    END;
    $$ LANGUAGE plpgsql;
    

And then I ran the benchmarks:

try=# \timing
Timing is on.
try=# \o /dev/null
try=# select ean_substr('036000291452')
try-# FROM generate_series (1, 10000);
Time: 488.743 ms
try=# select ean_loop('036000291452')
try-# FROM generate_series (1, 10000);
Time: 881.553 ms
try=# select ean_perl('036000291452')
try-# FROM generate_series (1, 10000);
Time: 540.962 ms
try=# select ean_byte('036000291452')
try-# FROM generate_series (1, 10000);
Time: 395.124 ms

So the binary approach is the clear winner here, being 23.69% faster than my substring approach, 36.91% faster than the Perl version, and 2.23 times faster (123.11%) than the looping approach. So I think I’ll go with that.

Meanwhile, I’m pleased to have this simple benchmarking tool in my arsenal for future PostgreSQL function development.

Looking for the comments? Try the old layout.

Validating UPCs with PL/pgSQL

So I wanted to write a PL/pgSQL function to validate UPC codes. The rules for validation are:

  • The UPC must consist of 12 or 13 numerals
  • The last numeral is a checksum for the previous 11 numerals
  • The checksum is calculated as follows:
    • Add the digits in the odd-numbered positions in the string and multiply by three
    • Add in the digits in the even-numbered positions
    • Subtract the result from the next-higher multiple of ten.

It took me a few minutes to whip up this implementation in Perl:

use List::Util qw(sum);

sub validate_upc {
    my $upc = shift;
    my @nums = split $upc;
    shift @nums if @nums == 13; # Support EAN codes.
    die "$upc is not a valid UPC" if @upc != 12;
    10 - (  sum( @nums[0,2,4,6,8,10] ) * 3
          + sum( @nums[1,3,5,7,9] )
    ) % 10 == $nums[11];
}

Trying to do the same thing in PL/pgSQL was harder, mainly because I couldn’t find an easy way to split a string up into its individual characters. string_to_array() seems ideal, but don’t follow the same rules as Perl when it comes to the empty string:

try=% select string_to_array('123', '');
 string_to_array
 -----------------
 {123}
(1 row)

Bummer. So I had to fall back on individual calls to substring(), instead:

CREATE OR REPLACE FUNCTION validate_upc (
upc text
) RETURNS boolean AS $$
DECLARE
    offset integer := 0;
BEGIN
    IF char_length(upc) = 13 THEN
        offset := 1;
    ELSIF char_length(upc) <> 12 THEN
        RAISE EXCEPTION '% is not a valid UPC', upc;
    END IF;

    IF 10 - (
        (
            substring(upc,  1 + offset, 1)::integer
          + substring(upc,  3 + offset, 1)::integer
          + substring(upc,  5 + offset, 1)::integer
          + substring(upc,  7 + offset, 1)::integer
          + substring(upc,  9 + offset, 1)::integer
          + substring(upc, 11 + offset, 1)::integer
         ) * 3
         + substring(upc,  2 + offset, 1)::integer
         + substring(upc,  4 + offset, 1)::integer
         + substring(upc,  6 + offset, 1)::integer
         + substring(upc,  8 + offset, 1)::integer
         + substring(upc, 10 + offset, 1)::integer
         ) % 10  = substring(upc, 12 + offset, 1)::integer
    THEN
        RETURN true;
    ELSE
        RETURN false;
    END IF;
END;
$$ LANGUAGE plpgsql;

This works, and seems pretty fast, but I’m wondering if there isn’t an easier way to do this in PL/pgSQL. Do you know of one? Leave me a comment.

Looking for the comments? Try the old layout.

PL/pgSQL Article Published

The first in a series of articles I’ve written about PL/pgSQL has been published on onLamp.com. The article, Check it out!

Looking for the comments? Try the old layout.

Higher-Order PL/pgSQL

Well, it’s not really higher-order PL/pgSQL, since the language doesn’t support closures, as far as I know. But I have been working on a series of articles for O’Reilly Databases site, drawing inspiration from Mark Jason Dominus’s Higher-Order Perl, and specifically using the Fibonacci sequence to create example PL/pgSQL functions. It turns out that, while the Fibonacci sequence may not be of much use in day-to-day database work, it makes for great pedagogy. And I learned a fair bit along the way, as well.

So the initial, naïve implementation of a Fibonacci calculate in PL/pgSQL, using recursion, is quite straight-forward:

CREATE OR REPLACE FUNCTION fib (
    fib_for int
) RETURNS integer AS $$
BEGIN
    IF fib_for < 2 THEN
        RETURN fib_for;
    END IF;
    RETURN fib(fib_for - 2) + fib(fib_for - 1);
END;
$$ LANGUAGE plpgsql;

Pretty simple, right? The “$$”, by the way, is PL/pgSQL dollar-quoting, which prevents me from having to escape single quotes in the function body (when the function body has them, that is!). of course, this is as slow in PL/pgSQL as it would be in any other language:

try=% explain analyze select fib(26);
----------------------------
 Total runtime: 3772.315 ms

Pretty sad, right? So then I added memoization. In PL/pgSQL, this is as simple as using a table for the cache:

CREATE TABLE fib_cache (
        num integer PRIMARY KEY,
        fib integer NOT NULL
);

CREATE OR REPLACE FUNCTION fib_cached(
    fib_for int
) RETURNS integer AS $$
DECLARE
    ret integer;
BEGIN
    if fib_for < 2 THEN
        RETURN fib_for;
    END IF;

    SELECT INTO ret fib
    FROM   fib_cache
    WHERE  num = fib_for;

    IF ret IS NULL THEN
        ret := fib_cached(fib_for - 2) + fib_cached(fib_for - 1);
        INSERT INTO fib_cache (num, fib)
        VALUES (fib_for, ret);
    END IF;
    RETURN ret;

END;
$$ LANGUAGE plpgsql;CREATE TABLE fib_cache (
        num integer PRIMARY KEY,
        fib integer NOT NULL
);

CREATE OR REPLACE FUNCTION fib_cached(
    fib_for int
) RETURNS integer AS $$
DECLARE
    ret integer;
BEGIN
    if fib_for < 2 THEN
        RETURN fib_for;
    END IF;

    SELECT INTO ret fib
    FROM   fib_cache
    WHERE  num = fib_for;

    IF ret IS NULL THEN
        ret := fib_cached(fib_for - 2) + fib_cached(fib_for - 1);
        INSERT INTO fib_cache (num, fib)
        VALUES (fib_for, ret);
    END IF;
    RETURN ret;

END;
$$ LANGUAGE plpgsql;

This gets me a big performance boost:

try=% explain analyze select fib_cached(26);
--------------------------
 Total runtime: 50.889 ms

try=% explain analyze select fib_cached(26);
--------------------------
 Total runtime: 2.249 ms

So even before any sequence numbers have been calculated it’s a big win, and after they’ve been calculated, it is of course even better.

So then, following Dominus’s lead, I set about refactoring the original, recursive version with tail call elimination. Here’s what I came up with:

CREATE OR REPLACE FUNCTION fib_stacked(
    n integer
) RETURNS integer AS $$
DECLARE
    fib_for integer := n;
    branch  integer := 0;
    ret     integer := 0;
    s1      integer := 0;
    stack   integer[][] := ARRAY[ARRAY[0, 0, 0]];
    bound   integer := 1;
BEGIN
    LOOP
        IF fib_for < 2 THEN
            ret := fib_for;
        ELSE
            IF branch = 0 THEN
                WHILE fib_for >= 2 LOOP
                    stack = array_cat(stack, ARRAY[1, 0, fib_for]);
                    fib_for := fib_for - 1;
                END LOOP;
                ret := fib_for;
            ELSIF branch = 1 THEN
                stack = array_cat(stack, ARRAY[2, ret, fib_for]);
                fib_for := fib_for - 2;
                branch  := 0;
                CONTINUE;
            ELSIF branch = 2 THEN
                ret := ret + s1;
            END IF;
        END IF;

        bound := array_upper(stack, 1);
        IF bound <= 1 THEN
            RETURN ret;
        END IF;

        SELECT INTO branch,          s1,              fib_for
                    stack[bound][1], stack[bound][2], stack[bound][3];
        SELECT INTO stack stack[1:bound-1][1:3];
    END LOOP;
    RETURN ret;
END;
$$ LANGUAGE plpgsql;

It took me forever to figure out how to do it, primarily because arrays in PostgreSQL are quite limited, in the sense that, while can add to them, you cannot remove from them! Hence the SELECT INTO to reassign to stack at the end of the loop. That syntax constructs a completely new, smaller array and assigns it to stack. Ick. Another problem is that PL/pgSQL does not like an empty multidimensional array; hence the initial assignment to stack in the DECLARE block. I then have to remember that the array always has at least one item in it, and respond accordingly. It didn’t much help that SQL arrays start at 1 rather than 0.

But the big surprise to me was just how badly this function performed:

try=% explain analyze select fib_stacked(26);
-----------------------------
 Total runtime: 30852.697 ms

Yow! So I don’t think that’s the best approach. But I did figure out another approach, based on an example I saw in Agile Web Development with Rails. It’s a very simple loop-based approach:

CREATE OR REPLACE FUNCTION fib_fast(
    fib_for int
) RETURNS integer AS $$
DECLARE
    ret integer := 0;
    nxt integer := 1;
    tmp integer;
BEGIN
    FOR num IN 1..fib_for LOOP
        tmp := ret;
        ret := nxt;
        nxt := tmp + nxt;
    END LOOP;

    RETURN ret;
END;
$$ LANGUAGE plpgsql;

This one works the best of all:

try=# explain analyze select fib_fast(26);
-------------------------
 Total runtime: 0.326 ms

Sweet! I was left wondering why this algorithm wasn’t used in HOP, but then realized that it probably didn’t fit in with Dominus’s pedagogical goals. But it was an interesting learning experience for me all the same.

Look for the first of my articles to be published 2006-5-10.

Looking for the comments? Try the old layout.

Use Variables in PL/pgSQL DECLARE Blocks

I’m working on some PL/pgSQL functions for my application framework and for an upcoming article for the O’Reilly Databases site, and was showing some of the code to a PostgreSQL guru. The code looked something like this:

CREATE OR REPLACE FUNCTION update_table (
    key_name text,
    pk_id    integer,
    fk_ids   integer[]

) RETURNS VOID AS $$
DECLARE
    table_name text := quote_ident(key_name);
BEGIN
    EXECUTE 'UPDATE ' || table_name || ' SET pk = ' || pk_id
            || ' WHERE fk IN(' || array_to_string(fk_ids, ', ')
            || ')';
END;
$$ LANGUAGE plpgsql;

No, that’s not the real code, it’s just a dummy example to illustrate something. Illustrate what? Well, my PostgreSQL friend said, “Crap, can you really use variables to set other variables in the DECLARE section?” The answer is “yes,” of course. The above does work. I’m new to PostgreSQL functions, so I didn’t know any better than to just try it, and it worked. But my friend has been writing PL/pgSQL functions for years. Why didn’t he know that you could use variables in a DECLARE block? As he said, “Damn, one of the problems with starting with a language 6 years ago is that you get in the habit of coding around the restrictions from 6 years ago.”

Anyway, I just wanted to share this tidbit, in case there were other PostgreSQL pros who missed it. I don’t know when the feature was added, but it works fine for me in 8.1.

Looking for the comments? Try the old layout.

Issues with INSERT Execution Ordering

My latest experiments in my never-ending quest to move as much object/relational mapping into the database as possible has yielded more hacks to work around database features. This time, the problem is that I have two classes, Simple and Extend, where I want the latter to extend the former (hence its name). This is a little different from inheritance, in that, internally, an Extend just references a single Simple object, but externally, it just has a single interface, where attributes of a Simple object are just attributes of an Extend object. The benefit here is that I can have multiple Extend objects that reference the same Simple object—something you can’t do with simple inheritance

Now, how I wanted to implement this in the database is where the extend view has all of the columns from the _simple table and the _extend table. That’s pretty simple. It looks like this:

CREATE SEQUENCE seq_kinetic;

CREATE TABLE _simple (
    id INTEGER NOT NULL DEFAULT NEXTVAL('seq_kinetic'),
    uuid UUID NOT NULL DEFAULT UUID_V4(),
    state INTEGER NOT NULL DEFAULT 1,
    name TEXT NOT NULL,
    description TEXT
);

CREATE TABLE _extend (
    id INTEGER NOT NULL DEFAULT NEXTVAL('seq_kinetic'),
    uuid UUID NOT NULL DEFAULT UUID_V4(),
    state INTEGER NOT NULL DEFAULT 1,
    simple_id INTEGER NOT NULL
);

CREATE VIEW extend AS
  SELECT _extend.id AS id, _extend.uuid AS uuid, _extend.state AS state,
         _extend.simple_id AS simple__id, simple.uuid AS simple__uuid,
         simple.state AS simple__state, simple.name AS name,
         simple.description AS description
  FROM   _extend, simple
  WHERE  _extend.simple_id = simple.id;

Pretty simple, right? Well, I like to put RULEs on VIEWs like this, so that I can just use the VIEW for INSERTs, UPDATEs, AND DELETESs. For now I’m just going to talk about the INSERT RULEs for this this view, as they are where the trouble came in.

Why the plural, RULEs? Well, what I wanted was to have two behaviors on insert, depending on the value of the simple__id column. If it’s NOT NULL, it should assume that it references an existing record in the _simple table, UPDATE it, and then INSERT into the _extend table. If it’s NULL, however, then it should INSERT into both the _simple table and the _extend table. What I came up with looked like this:

CREATE RULE insert_extend AS
ON INSERT TO extend WHERE NEW.simple__id IS NULL DO INSTEAD (
  INSERT INTO _simple (id, uuid, state, name, description)
  VALUES (NEXTVAL('seq_kinetic'), UUID_V4(), NEW.simple__state, NEW.name,
          NEW.description);

  INSERT INTO _extend (id, uuid, state, simple_id)
  VALUES (NEXTVAL('seq_kinetic'), COALESCE(NEW.uuid, UUID_V4()), NEW.state,
          CURRVAL('seq_kinetic'));
);

CREATE RULE extend_extend AS
ON INSERT TO extend WHERE NEW.simple__id IS NOT NULL DO INSTEAD (
  UPDATE _simple
  SET    state = COALESCE(NEW.simple__state, state),
         name  = COALESCE(NEW.name, name),
         description = COALESCE(NEW.description, description)
  WHERE  id = NEW.simple__id;

  INSERT INTO _extend (id, uuid, state, simple_id)
  VALUES (NEXTVAL('seq_kinetic'), COALESCE(NEW.uuid, UUID_V4()),
          NEW.state, NEW.simple__id);
);

CREATE RULE insert_extend_dummy AS
ON INSERT TO extend DO INSTEAD NOTHING;

That third RULE is required, as a VIEW must have an unconditional DO INSTEAD RULE. The second RULE also works, for those situations where I want to to “extend” a Simple object into an Extend object. It’s that first rule that’s causing problems, when no Simple object yet exists and I need to create it. When I try to INSERT with simple__id set to NULL, I get an error. Can you guess what it is? Let me not keep you on the edge or your seat:

kinetic=# INSERT INTO EXTEND (simple__id, name) VALUES (NULL, 'Four');
ERROR:  insert or update on table "_extend" violates foreign key constraint "fk_simple_id"
DETAIL:  Key (simple_id)=(22) is not present in table "_simple".

Why’s that? Well, it turns out that NEXTVAL('seq_kinetic') is executed twice, once for the id in the _simple table, and once for the id in the _extend table. But by the time CURRVAL('seq_kinetic') is called to reference the value inserted into the _simple table, it the value has already been fetched from the sequence for insertion into the _extend table. So of course, it fails, because the current value in the sequence is not in the _simple table at all.

At first, I thought that this might be an order of execution problem with the INSERT statement, so I tried this:

CREATE RULE insert_extend AS
ON INSERT TO extend WHERE NEW.simple__id IS NULL DO INSTEAD (
  INSERT INTO _simple (id, uuid, state, name, description)
  VALUES (NEXTVAL('seq_kinetic'), UUID_V4(), NEW.simple__state, NEW.name,
          NEW.description);

  INSERT INTO _extend (simple_id, id, uuid, state)
  VALUES (CURRVAL('seq_kinetic'), NEXTVAL('seq_kinetic'),
          COALESCE(NEW.uuid, UUID_V4()), NEW.state);
);

Unfortunately, that yielded the same error. So if the order of the columns in the INSERT statement didn’t define the execution order, what did? Well, a little research helped me to figure it out: It’s the order of the columns in the table, as this example demonstrates:

test=# CREATE SEQUENCE s;
CREATE SEQUENCE
test=# CREATE TABLE a (a0 int, a1 int, a2 int, a3 int);
CREATE TABLE
test=# INSERT INTO a (a3, a2, a0, a1) VALUES (NEXTVAL('s'), NEXTVAL('s'),
test=# NEXTVAL('s'), NEXTVAL('s'));
INSERT 0 1
test=# SELECT * FROM a;
 a0 | a1 | a2 | a3 
----+----+----+----
  1 |  2 |  3 |  4
(1 row)

Even though the values from the sequence were inserted into the columns by the INSERT statement in a more or less random order, they ended up being inserted into the table in the order in which they were declared in the CREATE TABLE statement.

Damn SQL!

So what’s the solution to this? Well, I came up with three. The first, and perhaps simplest, is to use two sequences instead of one:

CREATE RULE insert_extend AS
ON INSERT TO extend WHERE NEW.simple__id IS NULL DO INSTEAD (
  INSERT INTO _simple (id, uuid, state, name, description)
  VALUES (NEXTVAL('seq_kinetic'), UUID_V4(), NEW.simple__state, NEW.name,
          NEW.description);

  INSERT INTO _extend (id, uuid, state, simple_id)
  VALUES (NEXTVAL('seq_kinetic_alt'), COALESCE(NEW.uuid, UUID_V4()), NEW.state,
          CURRVAL('seq_kinetic'));
);

This works very well, and if you have separate sequences for each table, this is what you would do. But I want to use just one sequence for every primary key in the database, so as to prevent any possibility of duplicates. I could use two mutually exclusive sequences, one for odd numbers and the other for even numbers:

CREATE SEQUENCE seq_kinetic_odd INCREMENT BY 2;
CREATE SEQUENCE seq_kinetic_odd INCREMENT BY 2 START WITH 2;

But then I have to keep track of which sequence I’m using where. If I just use the “even” sequence for this special case (which may be rare), then I’m essentially throwing out half the numbers in the sequence. And I like things to be somewhat orderly, and the skipping of even or odd values would annoy me when I had to work with the database. Yeah, I’m a freak.

The solution I’ve currently worked out is to create a PL/pgSQL function that can keep track of the sequence numbers ahead of time, and just call it from the RULE:

CREATE FUNCTION insert_extend(NEWROW extend) RETURNS VOID AS $$
    DECLARE
        _first_id  integer := NEXTVAL(''seq_kinetic'');
        _second_id integer := NEXTVAL(''seq_kinetic'');
    BEGIN
    INSERT INTO _simple (id, uuid, state, name, description)
    VALUES (_first_id, UUID_V4(), NEWROW.simple__state, NEWROW.name,
            NEWROW.description);

    INSERT INTO _extend (id, uuid, state, simple_id)
    VALUES (_second_id, COALESCE(NEWROW.uuid, UUID_V4()), NEWROW.state, _first_id);
    END;
$$ LANGUAGE plpgsql VOLATILE;
    
CREATE RULE insert_extend AS
ON INSERT TO extend WHERE NEW.simple__id IS NULL DO INSTEAD (
    SELECT insert_extend(NEW);
);

This approach works pretty nicely, and doesn’t add much more code than my original solution with the ordering problem. I think I’ll keep it.

One other solution is to use a TRIGGER instead of a rule, but in truth, it would amount to nearly the same thing:

CREATE FUNCTION insert_extend() RETURNS trigger AS $$
    DECLARE
        _first_id  integer := NEXTVAL(''seq_kinetic'');
        _second_id integer := NEXTVAL(''seq_kinetic'');
    BEGIN
    INSERT INTO _simple (id, uuid, state, name, description)
    VALUES (_first_id, UUID_V4(), NEW.simple__state, NEW.name, NEW.description);

    INSERT INTO _extend (id, uuid, state, simple_id)
    VALUES (_second_id, COALESCE(NEW.uuid, UUID_V4()), NEW.state, _first_id);
    END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER insert_extend BEFORE UPDATE ON extend
FOR EACH ROW EXECUTE PROCEDURE insert_extend();

Um, but looking at it now (I just now typed it up, I haven’t tested it), I don’t think it’d work, because you can’t put a condition on a rule. On the other hand, I could use it to combine the three rules I have (two conditional and mutually exclusive, one that does nothing) into a single trigger:

CREATE FUNCTION insert_extend() RETURNS trigger AS $$
    DECLARE
        _first_id  integer;
        _second_id integer;
    BEGIN
    IF NEW.simple__id IS NULL THEN
        _first_id  := NEXTVAL(''seq_kinetic'');
        _second_id := NEXTVAL(''seq_kinetic'');

        INSERT INTO _simple (id, uuid, state, name, description)
        VALUES (_first_id, UUID_V4(), NEW.simple__state, NEW.name, NEW.description);

        INSERT INTO _extend (id, uuid, state, simple_id)
        VALUES (_second_id, COALESCE(NEW.uuid, UUID_V4()), NEW.state, _first_id);
    ELSE
        UPDATE _simple
        SET    state = COALESCE(NEW.simple__state, state),
                name  = COALESCE(NEW.name, name),
                description = COALESCE(NEW.description, description)
        WHERE  id = NEW.simple__id;

        INSERT INTO _extend (id, uuid, state, simple_id)
        VALUES (NEXTVAL('seq_kinetic'), COALESCE(NEW.uuid, UUID_V4()),
                NEW.state, NEW.simple__id);
    END IF;
    END;
$$ LANGUAGE plpgsql;

Hrm. That just might be the best way to go, period. Thoughts? Have I missed some other obvious solution?

Looking for the comments? Try the old layout.

How Do I Avoid Tiger’s readline When Compiling PostgreSQL?

I was delighted to find that Mac OS X 10.4 “Tiger” includes the readline library. So I was able to just compile PostgreSQL and have psql just work. Only it kinda doesn’t. For reasons that Tom Lane has explained, Tiger’s readline implementation is somewhat buggy. I’ve reported the issue to Apple (Radar # 4356545), but in the meantime, I’ve compiled and installed GNU readline 5.0 and wan to use it, instead.

The only problem is that there is no easy way to do it with environment variables or options when configuring PostgreSQL. I’ve tried:

./configure --includes=/usr/local/include -with-libs=/usr/local/lib

And:

CFLAGS=-L/usr/local/lib LDFLAGS=-I/usr/local/include; ./configure

Neither approach worked. In both cases, it still compiled in Apple’s buggy readline library. The only approach I’ve found to work is the brute force approach:

mv /usr/lib/libreadline.* /tmp
mv /usr/include/readline /tmp
./configure
make
make install
mv /tmp/libreadline.* /usr/lib
mv /tmp/readline /usr/include

But surely I’m missing something! Is there no better way to do it?

Looking for the comments? Try the old layout.

What Advanced SQL Book Should I Buy?

So, what advanced SQL book should I buy? I’ve learned a lot about SQL over the last year or so, but I’m sure that Josh Berkus is tired of being my own personal advanced SQL reference. So I’d like to really learn more about triggers, stored procedures, rules, views, and whatnot, what they’re best used for and when to use them. And other typical database features that I’m not familiar with, of course.

What I don’t need is an introduction to SQL. There are a million of those, and they all have much the same stuff. I want to really get into advanced concepts.

So what’s the best choice? Leave me a comment with your opinion. Thanks!

Looking for the comments? Try the old layout.

How to Ensure Unique Values Relative to a Column in Another Table

I recently came across a need to ensure that the value of a column in one table is unique relative to the value of a column in another table. This came about through my use of views to represent object-oriented inheritance relationships in PostgreSQL. For example, say I have a parent class, “Person”, and its subclass, “User.” The person table is quite straight-forward:

CREATE TABLE person (
    id INTEGER NOT NULL PRIMARY KEY SERIAL,
    first_name TEXT,
    last_name  TEXT,
    state      INTEGER
);

I use a combination of a table and a view to represent the User class while keeping all of the data nicely denormalized:

CREATE TABLE person_usr (
    id INTEGER NOT NULL PRIMARY KEY REFERENCES person (id),
    username TEXT,
    password TEXT
);

CREATE VIEW usr AS
SELECT p.id AS id, p.first_name AS first_name, p.last_name AS last_name,
       p.state AS state, u.username AS username, u.password AS password
  FROM   person p, usr u
 WHERE  p.id = u.id;

So to maintain things, I write rules on the usr view to execute INSERT, UPDATE, and DELETE queries against the person and person_usr tables as appropriate.

Now, say that I have a business requirement to allow there to be duplicate usernames for users provided that only one is not “deleted.” Whether a user object is active, inactive, or deleted is determined by the value in its state attribute, which is stored in the person table. The value of the state attribute can be “1” for active, “0” for inactive, and “-1” for deleted. So how can I ensure, in the database, that this rule is followed?

Well, if the state and username columns were in a single table, this is very easy in PostgreSQL: just create a partial unique index:

CREATE UNIQUE INDEX udx_usr_unique
    ON usr(username)
 WHERE state > -1;

This does the trick beautifully, and is nice and compact. However, my OO design has the User class inheriting from Person, so I have the username column in one table and the state column in another. At first, I thought that I could still use a partial unique index, something like this:

CREATE UNIQUE INDEX udx_usr_unique
    ON usr(username)
 WHERE id IN (SELECT id FROM person WHERE state > -1);

Unfortunately, as of PostgreSQL 8.1, the PostgreSQL documentation states:

The expression used in the WHERE clause may refer only to columns of the underlying table, but it can use all columns, not just the ones being indexed. Presently, subqueries and aggregate expressions are also forbidden in WHERE. The same restrictions apply to index fields that are expressions.

D’oh!

So I had to figure out another method. CHECK constraints cannot reference another table, either. So I was left with triggers. It’s ugly and verbose, but it appears to do the trick. Here is the recipe:

CREATE FUNCTION cki_usr_username_unique() RETURNS trigger AS $$
    BEGIN
    /* Lock the relevant records in the parent and child tables. */
    PERFORM true
    FROM    person_usr, person
    WHERE   person_usr.id = person.id AND username = NEW.username FOR UPDATE;
    IF (SELECT true
        FROM   usr
        WHERE  id <> NEW.id AND username = NEW.username AND usr.state > -1
        LIMIT 1
    ) THEN
        RAISE EXCEPTION ''duplicate key violates unique constraint "ck_person_usr_username_unique"'';
    END IF;
    RETURN NEW;
    END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER cki_usr_username_unique BEFORE INSERT ON person_usr
    FOR EACH ROW EXECUTE PROCEDURE cki_usr_username_unique();

CREATE FUNCTION cku_usr_username_unique() RETURNS trigger AS $$
    BEGIN
    IF (NEW.username <> OLD.username) THEN
        /* Lock the relevant records in the parent and child tables. */
        PERFORM true
        FROM    person_usr, person
        WHERE   person_usr.id = person.id AND username = NEW.username FOR UPDATE;
        IF (SELECT true
            FROM   usr
            WHERE  id <> NEW.id AND username = NEW.username AND usr.state > -1
            LIMIT 1
        ) THEN
            RAISE EXCEPTION ''duplicate key violates unique constraint "ck_person_usr_username_unique"'';
        END IF;
    END IF;
    RETURN NEW;
    END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER cku_usr_username_unique BEFORE UPDATE ON person_usr
    FOR EACH ROW EXECUTE PROCEDURE cku_usr_username_unique();

CREATE FUNCTION ckp_usr_username_unique() RETURNS trigger AS $$
    BEGIN
    IF (NEW.state > -1 AND OLD.state < 0
        AND (SELECT true FROM person_usr WHERE id = NEW.id)
        ) THEN
        /* Lock the relevant records in the parent and child tables. */
        PERFORM true
        FROM    person_usr, person
        WHERE   person_usr.id = person.id
                AND username = (SELECT username FROM person_usr WHERE id = NEW.id)
        FOR UPDATE;

        IF (SELECT COUNT(username)
            FROM   person_usr
            WHERE username = (SELECT username FROM person_usr WHERE id = NEW.id)
        ) > 1 THEN
            RAISE EXCEPTION ''duplicate key violates unique constraint "ck_person_usr_username_unique"'';
        END IF;
    END IF;
    RETURN NEW;
    END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ckp_usr_username_unique BEFORE UPDATE ON person
    FOR EACH ROW EXECUTE PROCEDURE ckp_usr_username_unique();

Why am I locking rows? To prevent some other transaction from changing another row to create username conflicts. For example, while I might be changing a username to “foo” for one record, an existing record with that username but its state set to -1 might be getting activated in a separate transaction. So gotta try to prevent that. Josh Berkus pointed out that issue in an earlier iteration of the triggers.

Anyway, am I on crack here? Isn’t there a simpler way to do this sort of thing? And if not, have I really got the race conditions all eliminated with the row locks?

Update: In further testing, I discovered that the SELECT ... FOR UPDATE was failing on views with the error “ERROR: no relation entry for relid 7”. I have no idea what that means, but I found that it didn’t happen when I selected against the tables directly. So I’ve updated the functions above to reflect that change. I’ve also fixed a few pastos as pointed out in the comments.

Looking for the comments? Try the old layout.

Custom SQLite Aggregates in Perl

About a year ago, Josh Berkus was reviewing some Bricolage SQL code, looking to optimize it for PostgreSQL. One of the things he noticed was that we were fetching a lot more rows for an object than we needed to. The reason for this is that an object might be associated with one or more groups, and to get back a list of all of the group IDs, we were getting multiple rows. For example, if I wanted to fetch a single story with the ID 10, I might get back rows like this:

SELECT s.id, s.title, grp.id
FROM   story s, member m, grp g
WHERE  s.id = m.story_id
       AND m.grp_id = g.id
       AND s.id = 10;
s.id |        s.title      | grp.id
-----+---------------------+--------
  10 | The Princess Bride  | 23
  10 | The Princess Bride  | 24
  10 | The Princess Bride  | 25
  10 | The Princess Bride  | 26
  10 | The Princess Bride  | 27

Now, that’s a lot of extra data to have to fetch for just a single row to be different; it’s very wasteful, really. So Josh said, “Why don’t you use a custom aggregate for that?” I knew nothing about aggregates, but I did some research, and figured out how to write PostgreSQL custom aggregates in SQL. I wrote a very simple one, called id_list(), that joins up all of the values in a column with an empty space. The aggregate code looks like this:

CREATE   FUNCTION append_id(TEXT, INTEGER)
RETURNS  TEXT AS '
    SELECT CASE WHEN $2 = 0 THEN
                $1
            ELSE
                $1 || '' '' || CAST($2 AS TEXT)
            END;'
LANGUAGE 'sql'
WITH     (ISCACHABLE, ISSTRICT);

CREATE AGGREGATE id_list (
    SFUNC    = append_id,
    BASETYPE = INTEGER,
    STYPE    = TEXT,
    INITCOND = ''
);

Now I was able to vastly simplify the results returned by the query:

SELECT s.id, s.title, id_list(grp.id)
FROM   story s, member m, grp g
WHERE  s.id = m.story_id
       AND m.grp_id = g.id
       AND s.id = 10;
GROUP BY s.id, s.title
s.id |        s.title      | id_list
-----+---------------------+---------------
  10 | The Princess Bride  | 23 24 25 26 27

So then I just had to split the id_list column on the white space and I was ready to go. Cool!

So recently, was thinking about how I might do something similar in SQLite. It turns out that SQLite has a way to add custom aggregates, too, via its sqlite_add_function function. But I don’t know C, and had been wondering for a while how, even if I figured out how to write an aggregate function in C, whether I would have to require users to compile SQLite with my C aggregate in order to get it to work.

However, as a Perl developer, I thought it might be worthwhile to just quickly check the DBD::SQLite docs might have to say on the matter. And it turns out that the ability to add aggregates to SQLite is supported in DBD::SQLite via the create_aggregate custom function. And what’s more, the aggregate can be written in Perl! Whoa! I couldn’t believe that it could be that easy, but a quick test script demonstrated that it is:

#!/usr/bin/perl -w

use strict;

use DBI;

my $dbfile = shift;
my $dbh = DBI->connect("dbi:SQLite:dbname=$dbfile", '', '');

END {
    $dbh->disconnect;
    unlink $dbfile;
};

$dbh->do('CREATE TABLE foo (a int)');
$dbh->do('BEGIN');
$dbh->do('INSERT INTO foo (a) VALUES (?)', undef, $_) for 1..10;
$dbh->do('COMMIT');

# Create a new aggregate.
$dbh->func('joiner', 1, 'My::Join', 'create_aggregate');

my $sel = $dbh->prepare(q{SELECT joiner(a) FROM foo});
$sel->execute;
$sel->bind_columns(\my $val);
print "$val\n" while $sel->fetch;

The first argument to create_aggregate() (itself invoked via the DBI func() method) the name of the aggregate, the second is the number of arguments to the aggregate (use -1 for an unlimited number), and the third is the name of a Perl class that implements the aggregate. That class needs just three methods: new(), an object constructor; step(), called for each aggregate row, and finalize, which must return the value calculated by the aggregate. My simple implementation looks like this:

package My::Join;

sub new { bless [] }
sub step {
    my $self = shift;
    push @$self, @_;
}
sub finalize {
    my $self = shift;
    return join q{ }, @$self;
}

Yep, that’s really it! When I run the script, the output looks like this:

% try foo
1 2 3 4 5 6 7 8 9 10

Keen! I mean, that is just so slick! And it really demonstrates the power of SQLite as an embeddable database, as well. Thanks Matt, for making the SQLite API available to us mere mortal Perl developers!

Looking for the comments? Try the old layout.