Just a Theory

By David E. Wheeler

Posts about PL/Perl

Execute SQL Code on Connect

I’ve been writing a fair bit of PL/Perl for a client, and one of the things I’ve been doing is eliminating a ton of duplicate code by creating utility functions in the %_SHARED hash. This is great, as long as the code that creates those functions gets executed at the beginning of every database connection. So I put the utility generation code into a single function, called prepare_perl_utils(). It looks something like this:

CREATE OR REPLACE FUNCTION prepare_perl_utils(
) RETURNS bool LANGUAGE plperl IMMUTABLE AS $$
    # Don't bother if we've already loaded.
    return 1 if $_SHARED{escape_literal};

    $_SHARED{escape_literal} = sub {
        $_[0] =~ s/'/''/g; $_[0] =~ s/\\/\\\\/g; $_[0];
    };

    # Create other code refs in %_SHARED…
$$;

So now all I have to do is make sure that all the client’s apps execute this function as soon as they connect, so that the utilities will all be loaded up and ready to go. Here’s how I did it.

First, for the Perl app, I just took advantage of the DBI’s callbacks to execute the SQL I need when the DBI connects to the database. That link might not work just yet, as the DBI’s callbacks have only just been documented and that documentation appears only in dev releases so far. Once 1.611 drops, the link should work. At any rate, the use of callbacks I’m exploiting here has been in the DBI since 1.49, which was released in November 2005.

The approach is the same as I’ve described before: Just specify the Callbacks parameter to DBI->connect, like so:

my $dbh = DBI->connect_cached($dsn, $user, $pass, {
    PrintError     => 0,
    RaiseError     => 1,
    AutoCommit     => 1,
    Callbacks      => {
        connected => sub { shift->do('SELECT prepare_perl_utils()' },
    },
});

That’s it. The connected method is a no-op in the DBI that gets called to alert subclasses that they can do any post-connection initialization. Even without a subclass, we can take advantage of it to do our own initialization.

It was a bit trickier to make the same thing happen for the client’s Rails app. Rails, alas, provides no on-connection callbacks. So we instead have to monkey-patch Rails to do what we want. With some help from “dfr|mac” on #rubyonrails (I haven’t touched Rails in 3 years!), I got it worked down to this:

class ActiveRecord::ConnectionAdapters::PostgreSQLAdapter
    def initialize_with_perl_utils(*args)
    returning(initialize_without_perl_utils(*args)) do
        execute('SELECT prepare_perl_utils()')
    end
    end
    alias_method_chain :initialize, :perl_utils
end

Basically, we overpower the PostgreSQL adapter’s initialize method and have it call initialize_with_perl_utils before it returns. It’s a neat trick; if you’re going to practice fuck typing, alias_method_chain makes it about as clean as can be, albeit a little too magical for my tastes.

Anyway, recorded here for posterity (my blog is my other brain!).

Looking for the comments? Try the old layout.

Benchmarking PostgreSQL Functions

Update 2006-05-19: I realized that there was a nasty error in my algorithm for determining the runtime of a function: It was only fetching the milliseconds part of the runtime, without adding in seconds and minutes! This led to getting negative runtimes then the milliseconds part of the end time was less than the milliseconds part of the start time. Ugh. But with the help of yain on IRC, I’ve switched to calculating the number of seconds by converting the start and end times to epoch seconds (which have subsecond precision in PostgreSQL, and now things are just dandy. While I was at it, I reorganized the function so that it was a bit easier to read, by constructing the created function in the order it would be executed, and fixed the caching problem, as suggested by Aidan in a comment below.

Following yesterday’s post, Klint Gore sent me some PL/pgSQL code that might be useable as a benchmark function. Today I took that code and ran with it.

The idea was to create a function like the Perl Benchmark module’s timethese() function. In the process, I found, with help from Josh Berkus, that PL/pgSQL’s EXECUTE statement has quite a lot of overhead, and the amount of overhead per call is pretty random. The overhead resulted in pretty inaccurate benchmark numbers, unfortunately.

At Josh’s suggestion, I rewrote the function to just test each function inline, rather than passing the function code as parameters. This time, the results were dead on. So then I refactored the original benchmark function to create its own benchmark function, inlining all of the code, and then call that function. Almost higher order PL/pgSQL! Again the results were just right, and so now I present it to you:

create type _benchmark as (
    code      text,
    runtime   real,
    corrected real
);

CREATE OR REPLACE FUNCTION benchmark(n INTEGER, funcs TEXT[])
RETURNS SETOF _benchmark AS $$
DECLARE
    code TEXT := '';
    a    _benchmark;
BEGIN
    -- Start building the custom benchmarking function.
    code := $_$
        CREATE OR REPLACE FUNCTION _bench(n INTEGER)
        RETURNS SETOF _benchmark AS $__$
        DECLARE
            s TIMESTAMP;
            e TIMESTAMP;
            a RECORD;
            d numeric;
            res numeric;
            ret _benchmark;
        BEGIN
            -- Create control.
            s := timeofday();
            FOR a IN SELECT TRUE FROM generate_series( 1, $_$ || n || $_$ )
            LOOP
            END LOOP;
            e := timeofday();
            d := extract(epoch from e) - extract(epoch from s);
            ret := ROW( '[Control]', d, 0 );
            RETURN NEXT ret;
$_$;
    -- Append the code to bench each function call.
    FOR i IN array_lower(funcs,1) .. array_upper(funcs, 1) LOOP
        code := code || '
            s := timeofday();
            FOR a IN SELECT ' || funcs[i] || ' FROM generate_series( 1, '
                || n || $__$ ) LOOP
            END LOOP;
            e := timeofday();
            res := extract(epoch from e) - extract(epoch from s);
            ret := ROW(
                $__$ || quote_literal(funcs[i]) || $__$,
                res,
                res - d
            );
            RETURN NEXT ret;
$__$;
    END LOOP;

    -- Create the function.
    execute code || $_$
        END;
        $__$ language plpgsql;
$_$;

    -- Now execute the function.
    FOR a IN EXECUTE 'SELECT * FROM _bench(' || n || ')' LOOP
        RETURN NEXT a;
    END LOOP;

    -- Drop the function.
    DROP FUNCTION _bench(integer);
    RETURN;
END;
$$ language 'plpgsql';

You call the function like this:

try=# select * from benchmark(10000, ARRAY[
try(#     'ean_substr(''036000291452'')',
try(#     'ean_byte(''036000291452'')',
try(#     'ean_c(''036000291452'')'
try(# ]);
            code            | runtime   | corrected 
----------------------------+-----------+-----------
    [Control]                  | 0.0237451 |          0
    ean_substr('036000291452') |  0.497734 |   0.473989
    ean_byte(  '036000291452') |  0.394456 |   0.370711
    ean_c(     '036000291452') | 0.0277281 | 0.00398302
(4 rows)

Pretty slick, eh? The only downside was that, when the DROP FUNCTION line was not commented out, the function would run once, and then, the next time, I’d get this error:

ERROR:  cache lookup failed for function 17323
CONTEXT:  PL/pgSQL function "benchmark" line 49 at for over select rows

I have no idea why. So I just leave the function and let the CREATE OR REPLACE take care of it.

Looking for the comments? Try the old layout.

Corrected PostgreSQL EAN Functions

Update: I updated the benchmarks based on the fixed version of my benchmarking function.

In doing a bit more reading about EAN codes, I realized that my previous attempts to write a validating function for UPC and EAN codes had a significant error: they would only properly validate EAN codes if the first numeral was 0! So I went back and fixed them all, and present them here for posterity.

  • The substring solution:

    CREATE OR REPLACE FUNCTION ean_substr (
        TEXT
    ) RETURNS boolean AS $$
    DECLARE
        offset integer := 0;
        -- Support UPCs.
        ean   TEXT    := CASE WHEN length($1) = 12 THEN '0' || $1 ELSE $1 END;
    BEGIN
        -- Make sure we really have an EAN.
        IF ean !~ '^\\d{13}$' THEN RETURN FALSE; END IF;
    
        RETURN 10 - (
            (
                -- Sum even numerals.
                substring(ean,  2 + offset, 1)::integer
                + substring(ean,  4 + offset, 1)::integer
                + substring(ean,  6 + offset, 1)::integer
                + substring(ean,  8 + offset, 1)::integer
                + substring(ean, 10 + offset, 1)::integer
                + substring(ean, 12 + offset, 1)::integer
            ) * 3 -- Multiply total by 3.
            -- Add odd numerals except for checksum (13).
            + substring(ean,  1 + offset, 1)::integer
            + substring(ean,  3 + offset, 1)::integer
            + substring(ean,  5 + offset, 1)::integer
            + substring(ean,  7 + offset, 1)::integer
            + substring(ean,  9 + offset, 1)::integer
            + substring(ean, 11 + offset, 1)::integer
        -- Compare to the checksum.
        ) % 10 = substring(ean, 13 + offset, 1)::integer;
    END;
    $$ LANGUAGE 'plpgsql' immutable;
  • The looping solution:

    CREATE OR REPLACE FUNCTION ean_loop(
        TEXT
    ) RETURNS boolean AS $$
    DECLARE
        total INTEGER := 0;
        -- Support UPCs.
        ean   TEXT    := CASE WHEN length($1) = 12 THEN '0' || $1 ELSE $1 END;
    BEGIN
        -- Make sure we really have an EAN.
        IF ean !~ '^\\d{13}$' THEN RETURN FALSE; END IF;
    
        -- Sum even numerals.
        FOR i IN 2..12 LOOP
            total := total + substring(ean, i, 1)::INTEGER;
            i := i + 1;
        END LOOP;
    
        -- Multiply total by 3.
        total := total * 3;
    
        -- Add odd numerals except for checksum (13).
        FOR i IN 1..11 LOOP
            total := total + substring(ean, i, 1)::INTEGER;
            i := i + 1;
        END LOOP;
    
        -- Compare to the checksum.
        RETURN 10 - total % 10 = substring(ean, 13, 1)::INTEGER;
    END;
    $$ LANGUAGE 'plpgsql' immutable;
  • The BYTEA solution:

    CREATE OR REPLACE FUNCTION ean_byte (
        arg TEXT
    ) RETURNS boolean AS $$
    DECLARE
        -- Convert to BYTEA; support UPCs.
        ean BYTEA := CASE WHEN length($1) = 12 THEN '0' || $1 ELSE $1 END;
    BEGIN
        -- Make sure we really have an EAN.
        IF arg !~ '^\\d{12,13}$' THEN RETURN FALSE; END IF;
    
        RETURN 10 - (
            (
                -- Sum odd numerals.
                get_byte(ean,  1) - 48
                + get_byte(ean,  3) - 48
                + get_byte(ean,  5) - 48
                + get_byte(ean,  7) - 48
                + get_byte(ean,  9) - 48
                + get_byte(ean, 11) - 48
            ) * 3 -- Multiply total by 3.
            -- Add even numerals except for checksum (12).
            + get_byte(ean,  0) - 48
            + get_byte(ean,  2) - 48
            + get_byte(ean,  4) - 48
            + get_byte(ean,  6) - 48
            + get_byte(ean,  8) - 48
            + get_byte(ean, 10) - 48
        -- Compare to the checksum.
        ) % 10 = get_byte(ean, 12) - 48;
            
    END;
    $$ LANGUAGE plpgsql immutable;
  • The PL/Perl solution:

    CREATE OR REPLACE FUNCTION ean_perl (
        TEXT
    ) RETURNS boolean AS $_$
        my $ean = length $_[0] == 12 ? "0$_[0]" : $_[0];
        # Make sure we really have an EAN.
        return 'false' unless $ean =~ /^\d{13}$/;
        my @nums = split '', $ean;
        return 10 - (
            # Sum even numerals.
            (   (   $nums[1] + $nums[3] + $nums[5] + $nums[7] + $nums[9]
                        + $nums[11]
                ) * 3 # Multiply total by 3.
            # Add odd numerals except for checksum (12).
            ) + $nums[0] + $nums[2] + $nums[4] + $nums[6] + $nums[8] + $nums[10]
        # Compare to the checksum.
        ) % 10 == $nums[12] ? 'true' : 'false';
    $_$ LANGUAGE plperl immutable;
  • The C solution (thanks StuckMojo!):

    #include <string.h>
    #include "postgres.h"
    #include "fmgr.h"
    
    Datum ean_c(PG_FUNCTION_ARGS);
    
    PG_FUNCTION_INFO_V1(ean_c);
    
    Datum ean_c(PG_FUNCTION_ARGS) {
    
        char *ean;
        text *arg = PG_GETARG_TEXT_P(0);
        int  arglen = VARSIZE(arg) - VARHDRSZ;
        bool ret = false;
    
        /* Validate the easy stuff: 12 or 13 digits. */
        if ((arglen != 12 && arglen != 13) || 
            strspn(VARDATA(arg), "0123456789") != arglen) {
            PG_RETURN_BOOL(ret);
        }
    
        /* Support UPCs. */
        if (arglen == 12) {
            ean = (char *) palloc(13);
            ean[0] = '0';
            memcpy(&ean[1], VARDATA(arg), arglen);
        } else {
            ean = (char *) palloc(arglen);
            memcpy(ean, VARDATA(arg), arglen);
        }
    
        ret = 10 - (
                /* Sum even numerals and multiply total by 3. */
                (  ean[1] - '0' + ean[3] - '0' + ean[5]  - '0' 
                    + ean[7] - '0' + ean[9] - '0' + ean[11] - '0') * 3
                /* Add odd numerals except for checksum (12). */
                + ean[0] - '0' + ean[2] - '0' + ean[4]  - '0'
                + ean[6] - '0' + ean[8] - '0' + ean[10] - '0'
            /* Compare to the checksum. */
            ) % 10 == ean[12] - '0';
    
        PG_RETURN_BOOL(ret);
    }

And here are the benchmarks for them (without immutable):

try=# select * from benchmark(100000, ARRAY[
try(#     'ean_substr(''4007630000116'')',
try(#     'ean_loop(  ''4007630000116'')',
try(#     'ean_byte(  ''4007630000116'')',
try(#     'ean_perl(  ''4007630000116'')',
try(#     'ean_c(     ''4007630000116'')'
try(# ]);
            code             | runtime  |    rate     | corrected | corrected_rate 
-----------------------------+----------+-------------+-----------+----------------
 [Control]                   | 0.257728 | 388006.17/s |  0.257728 | 388006.17/s
 ean_substr('4007630000116') |  5.07296 | 19712.37/s  |   4.81523 | 20767.44/s
 ean_loop(  '4007630000116') |  9.18085 | 10892.24/s  |   8.92312 | 11206.84/s
 ean_byte(  '4007630000116') |   3.9248 | 25479.02/s  |   3.66707 | 27269.73/s
 ean_perl(  '4007630000116') |   5.5062 | 18161.33/s  |   5.24848 | 19053.15/s
 ean_c(     '4007630000116') | 0.285376 | 350415.10/s |  0.027648 | 3616901.80/s
(6 rows)

Enjoy!

Looking for the comments? Try the old layout.

Benchmarking UPC Validation

Just to follow up on my query about validating UPC codes in PL/pgSQL, Klint Gore sent me a private email demonstrating that treating the UPC code as a binary string performed better than my substringing approach. I modified his version to work like the others, but it looked to me like the performance was about the same. They were just too close for me to really be able to tell.

What I needed was a way to run the queries a whole bunch of times to see the real difference. I asked on #postgresql, and dennisb suggested a simple brute-force approach:

select foo(42) FROM generate_series (1, 10000);

So that’s what I did. The functions I tested were:

  • A refinement of my original substring solution:

    CREATE OR REPLACE FUNCTION ean_substr (
        TEXT
    ) RETURNS boolean AS $$
    DECLARE
        offset integer := 0;
        -- Support UPCs.
        ean   TEXT    := CASE WHEN length($1) = 12 THEN
            '0' || $1
        ELSE
            $1
        END;
    BEGIN
        -- Make sure we really have an EAN.
        IF ean !~ '^\\d{13}$' THEN RETURN FALSE; END IF;
    
        RETURN 10 - (
            (
                -- Sum even numerals.
                substring(ean,  2 + offset, 1)::integer
                + substring(ean,  4 + offset, 1)::integer
                + substring(ean,  6 + offset, 1)::integer
                + substring(ean,  8 + offset, 1)::integer
                + substring(ean, 10 + offset, 1)::integer
                + substring(ean, 12 + offset, 1)::integer
            ) * 3 -- Multiply total by 3.
            -- Add odd numerals except for checksum (13).
            + substring(ean,  3 + offset, 1)::integer
            + substring(ean,  5 + offset, 1)::integer
            + substring(ean,  7 + offset, 1)::integer
            + substring(ean,  9 + offset, 1)::integer
            + substring(ean, 11 + offset, 1)::integer
        -- Compare to the checksum.
        ) % 10 = substring(ean, 12 + offset, 1)::integer;
    END;
    $$ LANGUAGE plpgsql;
  • A looping version, based on the comment from Adrian Klaver in the original post:

    CREATE OR REPLACE FUNCTION ean_loop(
        TEXT
    ) RETURNS boolean AS $$
    DECLARE
        total INTEGER := 0;
        -- Support UPCs.
        ean   TEXT    := CASE WHEN length($1) = 12 THEN
            '0' || $1
        ELSE
            $1
        END;
    BEGIN
        -- Make sure we really have an EAN.
        IF ean !~ '^\\d{13}$' THEN RETURN FALSE; END IF;
    
        -- Sum even numerals.
        FOR i IN 2..12 LOOP
            total := total + substring(ean, i, 1)::INTEGER;
            i := i + 1;
        END LOOP;
    
        -- Multiply total by 3.
        total := total * 3;
    
        -- Add odd numerals except for checksum (13).
        FOR i IN 3..11 LOOP
            total := total + substring(ean, i, 1)::INTEGER;
            i := i + 1;
        END LOOP;
    
        -- Compare to the checksum.
        RETURN 10 - total % 10
            = substring(ean, 13, 1)::INTEGER;
    END;
    $$ LANGUAGE 'plpgsql';
            CREATE OR REPLACE FUNCTION ean_loop(
        TEXT
    ) RETURNS boolean AS $$
    DECLARE
        total INTEGER := 0;
        -- Support UPCs.
        ean   TEXT    := CASE WHEN length($1) = 12 THEN
            '0' || $1
        ELSE
            $1
        END;
    BEGIN
        -- Make sure we really have an EAN.
        IF ean !~ '^\\d{13}$' THEN RETURN FALSE; END IF;
    
        -- Sum even numerals.
        FOR i IN 2..12 LOOP
            total := total + substring(ean, i, 1)::INTEGER;
            i := i + 1;
        END LOOP;
    
        -- Multiply total by 3.
        total := total * 3;
    
        -- Add odd numerals except for checksum (13).
        FOR i IN 3..11 LOOP
            total := total + substring(ean, i, 1)::INTEGER;
            i := i + 1;
        END LOOP;
    
        -- Compare to the checksum.
        RETURN 10 - total % 10
            = substring(ean, 13, 1)::INTEGER;
    END;
    $$ LANGUAGE 'plpgsql';
  • A PL/Perl version for Josh and Ovid:

    CREATE OR REPLACE FUNCTION ean_perl (
        TEXT
    ) RETURNS boolean AS $_$
        my $ean = length $_[0] == 12 ? "0$_[0]" : $_[0];
        # Make sure we really have an EAN.
        return 'false' unless $ean =~ /^\d{13}$/;
        my @nums = split '', shift;
        return 10 - (
            # Sum even numerals.
            (   (   $nums[1] + $nums[3] + $nums[5]
                    + $nums[7] + $nums[9] + $nums[11]
                ) * 3 # Multiply total by 3.
            # Add odd numerals except for checksum (12).
            ) + $nums[2] + $nums[4] + $nums[6] + $nums[8]
                + $nums[10]
        # Compare to the checksum.
        ) % 10 == $nums[11] ? 'true' : 'false';
    $_$ LANGUAGE plperl;
  • And finally, the new version using a byte string:

    CREATE OR REPLACE FUNCTION ean_byte (
        arg TEXT
    ) RETURNS boolean AS $$
    DECLARE
        -- Convert to BYTEA; support UPCs.
        ean BYTEA := CASE WHEN length($1) = 12 THEN
            '0' || $1
        ELSE
            $1
        END;
    BEGIN
        -- Make sure we really have an EAN.
        IF arg !~ '^\\d{12,13}$' THEN RETURN FALSE; END IF;
    
        RETURN 10 - (
            (
                -- Sum even numerals.
                get_byte(ean,  2) - 48
                + get_byte(ean,  4) - 48
                + get_byte(ean,  6) - 48
                + get_byte(ean,  8) - 48
                + get_byte(ean, 10) - 48
                + get_byte(ean, 12) - 48
            ) * 3 -- Multiply total by 3.
            -- Add odd numerals except for checksum (13).
            + get_byte(ean,  3) - 48
            + get_byte(ean,  7) - 48
            + get_byte(ean,  5) - 48
            + get_byte(ean,  9) - 48
            + get_byte(ean, 11) - 48
        -- Compare to the checksum.
        ) % 10  = get_byte(ean, 12) - 48;
    END;
    $$ LANGUAGE plpgsql;

And then I ran the benchmarks:

try=# \timing
Timing is on.
try=# \o /dev/null
try=# select ean_substr('036000291452')
try-# FROM generate_series (1, 10000);
Time: 488.743 ms
try=# select ean_loop('036000291452')
try-# FROM generate_series (1, 10000);
Time: 881.553 ms
try=# select ean_perl('036000291452')
try-# FROM generate_series (1, 10000);
Time: 540.962 ms
try=# select ean_byte('036000291452')
try-# FROM generate_series (1, 10000);
Time: 395.124 ms

So the binary approach is the clear winner here, being 23.69% faster than my substring approach, 36.91% faster than the Perl version, and 2.23 times faster (123.11%) than the looping approach. So I think I’ll go with that.

Meanwhile, I’m pleased to have this simple benchmarking tool in my arsenal for future PostgreSQL function development.

Looking for the comments? Try the old layout.

More about…