Just a Theory

By David E. Wheeler

Posts about PL/pgSQL

Validating UPCs with PL/pgSQL

So I wanted to write a PL/pgSQL function to validate UPC codes. The rules for validation are:

  • The UPC must consist of 12 or 13 numerals
  • The last numeral is a checksum for the previous 11 numerals
  • The checksum is calculated as follows:
    • Add the digits in the odd-numbered positions in the string and multiply by three
    • Add in the digits in the even-numbered positions
    • Subtract the result from the next-higher multiple of ten.

It took me a few minutes to whip up this implementation in Perl:

use List::Util qw(sum);

sub validate_upc {
    my $upc = shift;
    my @nums = split $upc;
    shift @nums if @nums == 13; # Support EAN codes.
    die "$upc is not a valid UPC" if @upc != 12;
    10 - (  sum( @nums[0,2,4,6,8,10] ) * 3
          + sum( @nums[1,3,5,7,9] )
    ) % 10 == $nums[11];
}

Trying to do the same thing in PL/pgSQL was harder, mainly because I couldn’t find an easy way to split a string up into its individual characters. string_to_array() seems ideal, but don’t follow the same rules as Perl when it comes to the empty string:

try=% select string_to_array('123', '');
 string_to_array
 -----------------
 {123}
(1 row)

Bummer. So I had to fall back on individual calls to substring(), instead:

CREATE OR REPLACE FUNCTION validate_upc (
upc text
) RETURNS boolean AS $$
DECLARE
    offset integer := 0;
BEGIN
    IF char_length(upc) = 13 THEN
        offset := 1;
    ELSIF char_length(upc) <> 12 THEN
        RAISE EXCEPTION '% is not a valid UPC', upc;
    END IF;

    IF 10 - (
        (
            substring(upc,  1 + offset, 1)::integer
          + substring(upc,  3 + offset, 1)::integer
          + substring(upc,  5 + offset, 1)::integer
          + substring(upc,  7 + offset, 1)::integer
          + substring(upc,  9 + offset, 1)::integer
          + substring(upc, 11 + offset, 1)::integer
         ) * 3
         + substring(upc,  2 + offset, 1)::integer
         + substring(upc,  4 + offset, 1)::integer
         + substring(upc,  6 + offset, 1)::integer
         + substring(upc,  8 + offset, 1)::integer
         + substring(upc, 10 + offset, 1)::integer
         ) % 10  = substring(upc, 12 + offset, 1)::integer
    THEN
        RETURN true;
    ELSE
        RETURN false;
    END IF;
END;
$$ LANGUAGE plpgsql;

This works, and seems pretty fast, but I’m wondering if there isn’t an easier way to do this in PL/pgSQL. Do you know of one? Leave me a comment.

Looking for the comments? Try the old layout.

More about…

PL/pgSQL Article Published

The first in a series of articles I’ve written about PL/pgSQL has been published on onLamp.com. The article, Check it out!

Looking for the comments? Try the old layout.

More about…

Higher Order PL/pgSQL

Well, it’s not really higher order PL/pgSQL, since the language doesn’t support closures, as far as I know. But I have been working on a series of articles for O’Reilly Databases site, drawing inspiration from Mark Jason Dominus’s Higher-Order Perl, and specifically using the Fibonacci sequence to create example PL/pgSQL functions. It turns out that, while the Fibonacci sequence may not be of much use in day-to-day database work, it makes for great pedagogy. And I learned a fair bit along the way, as well.

So the initial, naïve implementation of a Fibonacci calculate in PL/pgSQL, using recursion, is quite straight-forward:

CREATE OR REPLACE FUNCTION fib (
    fib_for int
) RETURNS integer AS $$
BEGIN
    IF fib_for < 2 THEN
        RETURN fib_for;
    END IF;
    RETURN fib(fib_for - 2) + fib(fib_for - 1);
END;
$$ LANGUAGE plpgsql;

Pretty simple, right? The “$$”, by the way, is PL/pgSQL dollar-quoting, which prevents me from having to escape single quotes in the function body (when the function body has them, that is!). of course, this is as slow in PL/pgSQL as it would be in any other language:

try=% explain analyze select fib(26);
----------------------------
 Total runtime: 3772.315 ms

Pretty sad, right? So then I added memoization. In PL/pgSQL, this is as simple as using a table for the cache:

CREATE TABLE fib_cache (
        num integer PRIMARY KEY,
        fib integer NOT NULL
);

CREATE OR REPLACE FUNCTION fib_cached(
    fib_for int
) RETURNS integer AS $$
DECLARE
    ret integer;
BEGIN
    if fib_for < 2 THEN
        RETURN fib_for;
    END IF;

    SELECT INTO ret fib
    FROM   fib_cache
    WHERE  num = fib_for;

    IF ret IS NULL THEN
        ret := fib_cached(fib_for - 2) + fib_cached(fib_for - 1);
        INSERT INTO fib_cache (num, fib)
        VALUES (fib_for, ret);
    END IF;
    RETURN ret;

END;
$$ LANGUAGE plpgsql;CREATE TABLE fib_cache (
        num integer PRIMARY KEY,
        fib integer NOT NULL
);

CREATE OR REPLACE FUNCTION fib_cached(
    fib_for int
) RETURNS integer AS $$
DECLARE
    ret integer;
BEGIN
    if fib_for < 2 THEN
        RETURN fib_for;
    END IF;

    SELECT INTO ret fib
    FROM   fib_cache
    WHERE  num = fib_for;

    IF ret IS NULL THEN
        ret := fib_cached(fib_for - 2) + fib_cached(fib_for - 1);
        INSERT INTO fib_cache (num, fib)
        VALUES (fib_for, ret);
    END IF;
    RETURN ret;

END;
$$ LANGUAGE plpgsql;

This gets me a big performance boost:

try=% explain analyze select fib_cached(26);
--------------------------
 Total runtime: 50.889 ms

try=% explain analyze select fib_cached(26);
--------------------------
 Total runtime: 2.249 ms

So even before any sequence numbers have been calculated it’s a big win, and after they’ve been calculated, it is of course even better.

So then, following Dominus’s lead, I set about refactoring the original, recursive version with tail call elimination. Here’s what I came up with:

CREATE OR REPLACE FUNCTION fib_stacked(
    n integer
) RETURNS integer AS $$
DECLARE
    fib_for integer := n;
    branch  integer := 0;
    ret     integer := 0;
    s1      integer := 0;
    stack   integer[][] := ARRAY[ARRAY[0, 0, 0]];
    bound   integer := 1;
BEGIN
    LOOP
        IF fib_for < 2 THEN
            ret := fib_for;
        ELSE
            IF branch = 0 THEN
                WHILE fib_for >= 2 LOOP
                    stack = array_cat(stack, ARRAY[1, 0, fib_for]);
                    fib_for := fib_for - 1;
                END LOOP;
                ret := fib_for;
            ELSIF branch = 1 THEN
                stack = array_cat(stack, ARRAY[2, ret, fib_for]);
                fib_for := fib_for - 2;
                branch  := 0;
                CONTINUE;
            ELSIF branch = 2 THEN
                ret := ret + s1;
            END IF;
        END IF;

        bound := array_upper(stack, 1);
        IF bound <= 1 THEN
            RETURN ret;
        END IF;

        SELECT INTO branch,          s1,              fib_for
                    stack[bound][1], stack[bound][2], stack[bound][3];
        SELECT INTO stack stack[1:bound-1][1:3];
    END LOOP;
    RETURN ret;
END;
$$ LANGUAGE plpgsql;

It took me forever to figure out how to do it, primarily because arrays in PostgreSQL are quite limited, in the sense that, while can add to them, you cannot remove from them! Hence the SELECT INTO to reassign to stack at the end of the loop. That syntax constructs a completely new, smaller array and assigns it to stack. Ick. Another problem is that PL/pgSQL does not like an empty multidimensional array; hence the initial assignment to stack in the DECLARE block. I then have to remember that the array always has at least one item in it, and respond accordingly. It didn’t much help that SQL arrays start at 1 rather than 0.

But the big surprise to me was just how badly this function performed:

try=% explain analyze select fib_stacked(26);
-----------------------------
 Total runtime: 30852.697 ms

Yow! So I don’t think that’s the best approach. But I did figure out another approach, based on an example I saw in Agile Web Development with Rails. It’s a very simple loop-based approach:

CREATE OR REPLACE FUNCTION fib_fast(
    fib_for int
) RETURNS integer AS $$
DECLARE
    ret integer := 0;
    nxt integer := 1;
    tmp integer;
BEGIN
    FOR num IN 1..fib_for LOOP
        tmp := ret;
        ret := nxt;
        nxt := tmp + nxt;
    END LOOP;

    RETURN ret;
END;
$$ LANGUAGE plpgsql;

This one works the best of all:

try=# explain analyze select fib_fast(26);
-------------------------
 Total runtime: 0.326 ms

Sweet! I was left wondering why this algorithm wasn’t used in HOP, but then realized that it probably didn’t fit in with Dominus’s pedagogical goals. But it was an interesting learning experience for me all the same.

Look for the first of my articles to be published 2006-5-10.

Looking for the comments? Try the old layout.

Use Variables in PL/pgSQL DECLARE Blocks

I’m working on some PL/pgSQL functions for my application framework and for an upcoming article for the O’Reilly Databases site, and was showing some of the code to a PostgreSQL guru. The code looked something like this:

CREATE OR REPLACE FUNCTION update_table (
    key_name text,
    pk_id    integer,
    fk_ids   integer[]

) RETURNS VOID AS $$
DECLARE
    table_name text := quote_ident(key_name);
BEGIN
    EXECUTE 'UPDATE ' || table_name || ' SET pk = ' || pk_id
            || ' WHERE fk IN(' || array_to_string(fk_ids, ', ')
            || ')';
END;
$$ LANGUAGE plpgsql;

No, that’s not the real code, it’s just a dummy example to illustrate something. Illustrate what? Well, my PostgreSQL friend said, “Crap, can you really use variables to set other variables in the DECLARE section?” The answer is “yes,” of course. The above does work. I’m new to PostgreSQL functions, so I didn’t know any better than to just try it, and it worked. But my friend has been writing PL/pgSQL functions for years. Why didn’t he know that you could use variables in a DECLARE block? As he said, “Damn, one of the problems with starting with a language 6 years ago is that you get in the habit of coding around the restrictions from 6 years ago.”

Anyway, I just wanted to share this tidbit, in case there were other PostgreSQL pros who missed it. I don’t know when the feature was added, but it works fine for me in 8.1.

Looking for the comments? Try the old layout.

More about…

How to Ensure Unique Values Relative to a Column in Another Table

I recently came across a need to ensure that the value of a column in one table is unique relative to the value of a column in another table. This came about through my use of views to represent object-oriented inheritance relationships in PostgreSQL. For example, say I have a parent class, “Person”, and its subclass, “User.” The person table is quite straight-forward:

CREATE TABLE person (
    id INTEGER NOT NULL PRIMARY KEY SERIAL,
    first_name TEXT,
    last_name  TEXT,
    state      INTEGER
);

I use a combination of a table and a view to represent the User class while keeping all of the data nicely denormalized:

CREATE TABLE person_usr (
    id INTEGER NOT NULL PRIMARY KEY REFERENCES person (id),
    username TEXT,
    password TEXT
);

CREATE VIEW usr AS
SELECT p.id AS id, p.first_name AS first_name, p.last_name AS last_name,
       p.state AS state, u.username AS username, u.password AS password
  FROM   person p, usr u
 WHERE  p.id = u.id;

So to maintain things, I write rules on the usr view to execute INSERT, UPDATE, and DELETE queries against the person and person_usr tables as appropriate.

Now, say that I have a business requirement to allow there to be duplicate usernames for users provided that only one is not “deleted.” Whether a user object is active, inactive, or deleted is determined by the value in its state attribute, which is stored in the person table. The value of the state attribute can be “1” for active, “0” for inactive, and “-1” for deleted. So how can I ensure, in the database, that this rule is followed?

Well, if the state and username columns were in a single table, this is very easy in PostgreSQL: just create a partial unique index:

CREATE UNIQUE INDEX udx_usr_unique
    ON usr(username)
 WHERE state > -1;

This does the trick beautifully, and is nice and compact. However, my OO design has the User class inheriting from Person, so I have the username column in one table and the state column in another. At first, I thought that I could still use a partial unique index, something like this:

CREATE UNIQUE INDEX udx_usr_unique
    ON usr(username)
 WHERE id IN (SELECT id FROM person WHERE state > -1);

Unfortunately, as of PostgreSQL 8.1, the PostgreSQL documentation states:

The expression used in the WHERE clause may refer only to columns of the underlying table, but it can use all columns, not just the ones being indexed. Presently, subqueries and aggregate expressions are also forbidden in WHERE. The same restrictions apply to index fields that are expressions.

D’oh!

So I had to figure out another method. CHECK constraints cannot reference another table, either. So I was left with triggers. It’s ugly and verbose, but it appears to do the trick. Here is the recipe:

CREATE FUNCTION cki_usr_username_unique() RETURNS trigger AS $$
    BEGIN
    /* Lock the relevant records in the parent and child tables. */
    PERFORM true
    FROM    person_usr, person
    WHERE   person_usr.id = person.id AND username = NEW.username FOR UPDATE;
    IF (SELECT true
        FROM   usr
        WHERE  id <> NEW.id AND username = NEW.username AND usr.state > -1
        LIMIT 1
    ) THEN
        RAISE EXCEPTION ''duplicate key violates unique constraint "ck_person_usr_username_unique"'';
    END IF;
    RETURN NEW;
    END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER cki_usr_username_unique BEFORE INSERT ON person_usr
    FOR EACH ROW EXECUTE PROCEDURE cki_usr_username_unique();

CREATE FUNCTION cku_usr_username_unique() RETURNS trigger AS $$
    BEGIN
    IF (NEW.username <> OLD.username) THEN
        /* Lock the relevant records in the parent and child tables. */
        PERFORM true
        FROM    person_usr, person
        WHERE   person_usr.id = person.id AND username = NEW.username FOR UPDATE;
        IF (SELECT true
            FROM   usr
            WHERE  id <> NEW.id AND username = NEW.username AND usr.state > -1
            LIMIT 1
        ) THEN
            RAISE EXCEPTION ''duplicate key violates unique constraint "ck_person_usr_username_unique"'';
        END IF;
    END IF;
    RETURN NEW;
    END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER cku_usr_username_unique BEFORE UPDATE ON person_usr
    FOR EACH ROW EXECUTE PROCEDURE cku_usr_username_unique();

CREATE FUNCTION ckp_usr_username_unique() RETURNS trigger AS $$
    BEGIN
    IF (NEW.state > -1 AND OLD.state < 0
        AND (SELECT true FROM person_usr WHERE id = NEW.id)
        ) THEN
        /* Lock the relevant records in the parent and child tables. */
        PERFORM true
        FROM    person_usr, person
        WHERE   person_usr.id = person.id
                AND username = (SELECT username FROM person_usr WHERE id = NEW.id)
        FOR UPDATE;

        IF (SELECT COUNT(username)
            FROM   person_usr
            WHERE username = (SELECT username FROM person_usr WHERE id = NEW.id)
        ) > 1 THEN
            RAISE EXCEPTION ''duplicate key violates unique constraint "ck_person_usr_username_unique"'';
        END IF;
    END IF;
    RETURN NEW;
    END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ckp_usr_username_unique BEFORE UPDATE ON person
    FOR EACH ROW EXECUTE PROCEDURE ckp_usr_username_unique();

Why am I locking rows? To prevent some other transaction from changing another row to create username conflicts. For example, while I might be changing a username to “foo” for one record, an existing record with that username but its state set to -1 might be getting activated in a separate transaction. So gotta try to prevent that. Josh Berkus pointed out that issue in an earlier iteration of the triggers.

Anyway, am I on crack here? Isn’t there a simpler way to do this sort of thing? And if not, have I really got the race conditions all eliminated with the row locks?

Update: In further testing, I discovered that the SELECT ... FOR UPDATE was failing on views with the error “ERROR: no relation entry for relid 7”. I have no idea what that means, but I found that it didn’t happen when I selected against the tables directly. So I’ve updated the functions above to reflect that change. I’ve also fixed a few pastos as pointed out in the comments.

Looking for the comments? Try the old layout.