Just a Theory

By David E. Wheeler

Posts about PL/pgSQL

pgTAP Best Practices Slides Available

Last month I gave two presentations at the PostgreSQL Conference West. The first was a variation on Unit Test Your Database!, which I’ve now given about six times (and will give at least two more times, including tomorrow night for Portland PLUG and in two weeks at the JPUG 10th Anniversary Conference). The second was a new talk, a 90-minute tutorial called “pgTAP Best Practices.” And here it is, published on Slideshare.

You can also download a 76MB PDF if that’s more your style.

Enjoy!

Looking for the comments? Try the old layout.

pgTAP 0.21 Drops

Mmmmm…pgTAP

I just dropped a new version of pgTAP following a few marathon hack sessions since my talk at PGCon (movie here, BTW). Actually, the new performs_ok() function came about as I wrote the presentation, but all the rest came on the flight home and in the few days since. Not sure when I’ll hack on it this much again (it’s getting a bit big at 5,758 lines of PL/pgSQL and SQL).

Overall I’m very happy with this release, as it adds a lot of new assertion functions. In particular, I added a slew of functions that test that the objects in a given schema (or visible in the search path, if you prefer) are exactly the objects that should be there. This is useful for a couple of things. For one, Norman Yamada, in his PGCon talk, mentioned that his team was using pgTAP to compare database objects between replicated databases. I like this because it’s a great example of using pgTAP for system testing, rather than just unit testing as I’ve been advocating. See, pgTAP can be used for any kind of testing!

Another use for these functions is in a large organization where many different people might be making changes to a schema. In this scenario, you might have application developers adding new objects to the database (or dropping objects) without necessarily informing the DBAs. Using, for example, tables_are() and functions_are() and continuous testing, the DBAs can see when objects have been modified by the developers. Even better, if the developers are running the pgTAP tests themselves (as they should be!), they will be reminded to add new tests for their changes when the existing tests notice that things have been added or dropped and thus fail.
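
For example, a continuous-testing check on the public schema might look something like this sketch (the table and function names here are hypothetical; tables_are() and functions_are() take a schema and an array of the names that should be there):

```sql
BEGIN;
SELECT plan(2);

-- Fail if public contains any tables other than exactly these.
SELECT tables_are( 'public', ARRAY[ 'users', 'widgets', 'orders' ] );

-- Likewise for functions.
SELECT functions_are( 'public', ARRAY[ 'get_widgets_for_user_id' ] );

SELECT * FROM finish();
ROLLBACK;
```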

Beyond that, I added a bunch of new functions for testing functions and a number of other goodies. Check out the release notes for all the details.

With these changes, I’ve finished nearly everything I’ve thought of for pgTAP. There are only a few sequence-testing functions left on the To Do list, as well as a call to add a throws_like() function, which I’ll throw in soon if no one else steps up. Beyond these changes, I have a few ideas of where to take it next, but so far I’m kind of stumped. Mainly what I think should be done is to add an interface that makes it easier to compare relations (or result sets, if you prefer). Epic does this by allowing query strings to be passed to a function, but I’d really like to keep queries in SQL rather than in SQL strings. I’ll be giving it some more thought and will post about it soon.

Looking for the comments? Try the old layout.

Unit Test Your Database!

Gave my talk at PGCon today. I felt that it went well, and was well-received. So here it is for everyone else, for posterity, thanks to SlideShare. Enjoy!

Looking for the comments? Try the old layout.

New Gig: PostgreSQL Experts, Inc.

A bit of good news: In addition to my ongoing Kineticode work doing Bricolage consulting services, training, and support, I have a new gig! I, along with Josh Berkus, David Fetter, Andrew Dunstan, and a team of other PostgreSQL experts, have started a new company: PostgreSQL Experts, Inc. I’m really excited about PGX, a cooperative of solid and experienced (dare I say expert?) people dedicated to providing exceptional PostgreSQL professional services, including consulting, training, and support.

Moreover, we have a solid group of experienced application developers, who are ready and willing to build your PostgreSQL-backed applications on Rails, Catalyst, PHP, or whatever environment you prefer. If it’s related to PostgreSQL, it’s what we do.

So get in touch or meet us at PGCon (we’re sponsoring!) or at OSCON 2009. I’m really excited about our company, and looking forward to growing it as PostgreSQL adoption grows.

Looking for the comments? Try the old layout.

pgTAP 0.20 Infiltrates Community

I did all I could to stop it, but it just wasn’t possible. pgTAP 0.20 has somehow made its way from my Subversion server and infiltrated the PostgreSQL community. Can nothing be done to stop this menace? Its use leads to cleaner, more stable, and more-safely refactored code. This insanity must be stopped! Please review the following list of its added vileness since 0.19 to determine how you can stop the terrible, terrible influence on your PostgreSQL unit-testing practices that is pgTAP:

  • Changed the names of the functions tested in sql/do_tap.sql and sql/runtests.sql so that they are less likely to be ordered differently given varying collation orders provided in different locales and by different vendors. Reported by Ingmar Brouns.
  • Added the --formatter and --archive options to pg_prove.
  • Fixed the typos in pg_prove where the output of --help listed --test-match and --test-schema instead of --match and --schema.
  • Added has_cast(), hasnt_cast(), and cast_context_is().
  • Fixed a borked function signature in has_trigger().
  • Added has_operator(), has_leftop(), and has_rightop().
  • Fixed a bug where the order of columns found for multicolumn indexes by has_index() could be wrong. Reported by Jeff Wartes. Thanks to Andrew Gierth for help fixing the query.
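
A quick sketch of a few of the new assertions in action (the cast and operator tested here are stock PostgreSQL ones, used purely as examples):

```sql
SELECT plan(3);

-- integer is implicitly castable to numeric.
SELECT has_cast( 'integer', 'numeric' );
SELECT cast_context_is( 'integer', 'numeric', 'implicit' );

-- integer + integer yields integer.
SELECT has_operator( 'integer', '+', 'integer', 'integer' );

SELECT * FROM finish();
```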

Don’t make the same mistake I did, where I wrote a lot of pgTAP tests for a client, and now testing database upgrades from 8.2 to 8.3 is just too reliable! YOU HAVE BEEN WARNED.

Good luck with your mission.

Looking for the comments? Try the old layout.

pgTAP 0.16 in the Wild

I’ve been writing a lot of tests for a client in pgTAP lately. It’s given me a lot to think about in terms of features I need and best practices in writing tests. I’m pleased to say that, overall, it has been absolutely invaluable. I’m doing a lot of database refactoring, and having the safety of solid test coverage has been an absolute godsend. pgTAP has done a lot to free me from worry about the effects of my changes, as it ensures that everything about the databases continues to just work.

Of course, that’s not to say that I don’t screw up. There are times when my refactorings have introduced new bugs or incompatibilities; after all, the tests I write of existing functionality extend only so far as I can understand that functionality. But as such issues come up, I just add regression tests, fix the issues, and move on, confident in the knowledge that, as long as the tests continue to be run regularly, those bugs will never come up again. Ever.

As a result, I’ll likely be posting a bit on best practices I’ve found while writing pgTAP tests. As I’ve been writing them, I’ve started to find the cow paths that help me to keep things sane. Most helpful is the large number of assertion functions that pgTAP offers, of course, but there are a number of techniques I’ve been developing as I’ve worked. Some are better than others, and still others suggest that I need to find other ways to do things (you know, when I’m cut-and-pasting a lot, there must be another way, though I’ve always done a lot of cut-and-pasting in tests).

In the meantime, I’m happy to announce the release of pgTAP 0.16. This version includes a number of improvements to the installer (including detection of Perl and TAP::Harness, which are required to use the included pg_prove test harness app). The installer also has an important bug fix that greatly increases the chances that the os_name() function will actually know the name of your operating system. And of course, there are new test functions:

  • has_schema() and hasnt_schema(), which test for the presence and absence of a schema
  • has_type() and hasnt_type(), which test for the presence and absence of a data type, domain, or enum
  • has_domain() and hasnt_domain(), which test for the presence and absence of a data domain
  • has_enum() and hasnt_enum(), which test for the presence and absence of an enum
  • enum_has_labels(), which tests that an enum has an expected list of labels
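
A sketch of how these might be used together (the “contrib” schema and the “mood” enum are hypothetical):

```sql
SELECT plan(4);

SELECT has_schema( 'contrib' );
SELECT has_type( 'mood' );
SELECT has_enum( 'mood' );
SELECT enum_has_labels( 'mood', ARRAY[ 'sad', 'ok', 'happy' ] );

SELECT * FROM finish();
```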

As usual, you can download the latest release from pgFoundry. Visit the pgTAP site for more information and for documentation.

Looking for the comments? Try the old layout.

Dynamic OFFSETs and LIMITs

I discovered a great hack for dealing with optional offsets and limits in PostgreSQL functions while working for a client, and I wanted to get it down here so that I wouldn’t forget it.

The deal is that I was writing tests for functions that returned a set of IDs for objects based on some criterion plus an optional offset and limit. The functions looked something like this:

CREATE OR REPLACE FUNCTION get_widgets_for_user_id(
    a_user_id integer,
    a_offset  integer,
    a_limit   integer
) RETURNS SETOF integer AS $$ 
DECLARE  
    l_id    integer;
    l_query text;
BEGIN
    l_query := 'SELECT id FROM widgets WHERE user_id = ' || a_user_id
            || ' ORDER BY created_at DESC';
    IF a_offset IS NOT NULL THEN 
        l_query := l_query || ' OFFSET ' || a_offset; 
    END IF; 
    IF a_limit IS NOT NULL THEN 
        l_query := l_query || ' LIMIT ' || a_limit; 
    END IF; 

    FOR l_id IN EXECUTE l_query LOOP 
            RETURN NEXT l_id;
    END LOOP; 
        
    RETURN; 
END;
$$ LANGUAGE PLPGSQL;

It seemed silly to me that this should be in PL/pgSQL: ultimately, it’s such a simple query that I wanted it to be a SQL query. Of course I knew that if a_offset was NULL I could fall back on 0. But what about dealing with a NULL limit?

Well, it turns out that you can pass a CASE expression to the LIMIT clause that returns no value at all when no condition matches, and it will just work. Observe:

try=% select id from widgets LIMIT CASE WHEN false THEN 3 END;
 id
----
   1
   2
   3
   4
(4 rows)

Pretty weird, huh? Well, for my purposes, it’s perfect, because I was able to rewrite that function as a pure SQL function, and it’s a lot simpler, to boot:

CREATE OR REPLACE FUNCTION get_widgets_for_user_id(
    a_user_id integer,
    a_offset  integer,
    a_limit   integer
) RETURNS SETOF integer AS $$ 
    SELECT id
        FROM widgets
        WHERE user_id = $1
        ORDER BY created_at DESC
    OFFSET COALESCE( $2, 0 )
        LIMIT CASE WHEN $3 IS NOT NULL THEN $3 END
$$ LANGUAGE SQL;

Now isn’t that a hell of a lot easier to read? Like I said, it’s a little weird, but overall I think it’s a pretty good hack. I’ve tested it with PostgreSQL 8.2 and 8.3. Not sure about other versions, but give it a try!
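
Calling it is straightforward; here are a couple of hypothetical calls (user ID 42 is made up). Passing NULL for both the offset and the limit returns everything, since the OFFSET falls back to 0 and the LIMIT clause gets no value:

```sql
-- All widget IDs for user 42, newest first.
SELECT * FROM get_widgets_for_user_id( 42, NULL, NULL );

-- The second page of 10.
SELECT * FROM get_widgets_for_user_id( 42, 10, 10 );
```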

Update: Thanks for the comments! With the insight that CASE is ultimately passing a NULL to LIMIT when the value is NULL, I realized that I could switch from CASE to COALESCE for nice parity with the handling of the OFFSET clause. Check it out:

CREATE OR REPLACE FUNCTION get_widgets_for_user_id(
    a_user_id integer,
    a_offset  integer,
    a_limit   integer
) RETURNS SETOF integer AS $$ 
    SELECT id
        FROM widgets
        WHERE user_id = $1
        ORDER BY created_at DESC
    OFFSET COALESCE( $2, 0 )
        LIMIT COALESCE( $3, NULL )
$$ LANGUAGE SQL;

Looking for the comments? Try the old layout.

More about…

How to Generate Recurring Events in the Database

This is a followup to my request for help fixing the performance of a database view that generated recurrences for events. This was an essential feature of Sandy, and thus important to get right. The idea when I started was simple:

  • Add a recurrence domain to the database that supports a number of different values, including “daily”, “weekly”, and “monthly”.
  • Add a recurrence column to the events table that identifies how an event recurs.
  • Add a recurrence_dates table that contains a pre-generated list of recurrences for a given date. I populated this table with five years of dates, each one mapped to five years’ worth of recurrence dates (see the original blog entry for more on the format of this table).
  • Create a view that maps each events row to its date and recurrence in the recurrence_dates table.
  • Profit.

It was this last bullet point that didn’t quite work out: although the data was perfectly accurate, queries for a lot of events in the view were very expensive. I mean, the query could run for 3-4 minutes. It was just crazy! I couldn’t figure out the problem, so I posted my request for help. It was through the discussions that followed with depesz that I finally figured out what the problem was: although I was usually selecting only a week’s or a month’s worth of events, the view was calculating rows for all five years’ worth of data for all of the events for a given user. Um, not efficient.

So here I finally document how, with a lot of help and example code from depesz, I solved the problem. The trick was to use a function instead of a view to generate the recurring event rows, and to limit it to only the dates we’re interested in. For convenience’s sake, I broke this down into two PL/pgSQL functions: one to generate recurring dates and one to return the recurring event rows. But first, here’s the recurrence domain and the events table, both of which are unchanged from the original approach:

CREATE DOMAIN recurrence AS TEXT
CHECK ( VALUE IN ( 'none', 'daily', 'weekly', 'monthly' ) );

CREATE TABLE events (
    id         SERIAL     PRIMARY KEY,
    user_id    INTEGER    NOT NULL,
    starts_at  TIMESTAMP  NOT NULL,
    start_tz   TEXT       NOT NULL,
    ends_at    TIMESTAMP,
    end_tz     TEXT       NOT NULL,
    recurrence RECURRENCE NOT NULL DEFAULT 'none'
);

Just assume the user_id is a foreign key. Now let’s populate this table with some data. For the purposes of this demonstration, I’m going to create one event per day for 1000 days, evenly divided between daily, weekly, monthly, and no recurrences, as well as five different times of day and six different durations:

INSERT INTO events (user_id, starts_at, start_tz, ends_at, end_tz, recurrence)
SELECT 1,
       ts::timestamp,
       'PST8PDT',
       ts::timestamp + dur::interval,
       'PST8PDT',
       recur
  FROM (
    SELECT '2007-12-19'::date + i || ' ' || CASE i % 5
               WHEN 0 THEN '06:00'
               WHEN 1 THEN '10:00'
               WHEN 2 THEN '14:00'
               WHEN 3 THEN '18:00'
               ELSE        '22:30'
               END,
           CASE i % 6
               WHEN 0 THEN '2 hours'
               WHEN 1 THEN '1 hour'
               WHEN 2 THEN '45 minutes'
               WHEN 3 THEN '3.5 hours'
               WHEN 4 THEN '15 minutes'
               ELSE        '30 minutes'
               END,
           CASE i % 4
               WHEN 0 THEN 'daily'
               WHEN 1 THEN 'weekly'
               WHEN 2 THEN 'monthly'
               ELSE        'none'
               END
    FROM generate_series(1, 1000) as gen(i)
  ) AS ser( ts, dur, recur);

This gives us some nicely distributed data:

try=# select * from events limit 10;
  id  | user_id |      starts_at      | start_tz |       ends_at       | end_tz  | recurrence 
------+---------+---------------------+----------+---------------------+---------+------------
    1 |       1 | 2007-12-20 10:00:00 | PST8PDT  | 2007-12-20 11:00:00 | PST8PDT | weekly
    2 |       1 | 2007-12-21 14:00:00 | PST8PDT  | 2007-12-21 14:45:00 | PST8PDT | monthly
    3 |       1 | 2007-12-22 18:00:00 | PST8PDT  | 2007-12-22 21:30:00 | PST8PDT | none
    4 |       1 | 2007-12-23 22:30:00 | PST8PDT  | 2007-12-23 22:45:00 | PST8PDT | daily
    5 |       1 | 2007-12-24 06:00:00 | PST8PDT  | 2007-12-24 06:30:00 | PST8PDT | weekly
    6 |       1 | 2007-12-25 10:00:00 | PST8PDT  | 2007-12-25 12:00:00 | PST8PDT | monthly
    7 |       1 | 2007-12-26 14:00:00 | PST8PDT  | 2007-12-26 15:00:00 | PST8PDT | none
    8 |       1 | 2007-12-27 18:00:00 | PST8PDT  | 2007-12-27 18:45:00 | PST8PDT | daily
    9 |       1 | 2007-12-28 22:30:00 | PST8PDT  | 2007-12-29 02:00:00 | PST8PDT | weekly
   10 |       1 | 2007-12-29 06:00:00 | PST8PDT  | 2007-12-29 06:15:00 | PST8PDT | monthly
(10 rows)

Now let’s get to the recurring date function:

CREATE OR REPLACE FUNCTION generate_recurrences(
    recurs     RECURRENCE,
    start_date DATE,
    end_date   DATE
)
RETURNS SETOF DATE
LANGUAGE plpgsql IMMUTABLE
AS $BODY$
DECLARE
    next_date DATE := start_date;
    duration  INTERVAL;
    day       INTERVAL;
    check_day TEXT;  -- day-of-month the monthly recurrence must land on
BEGIN
    IF recurs = 'none' THEN
        -- Only one date ever.
        RETURN NEXT next_date;
    ELSIF recurs = 'weekly' THEN
        duration := '7 days'::interval;
        WHILE next_date <= end_date LOOP
            RETURN NEXT next_date;
            next_date := next_date + duration;
        END LOOP;
    ELSIF recurs = 'daily' THEN
        duration := '1 day'::interval;
        WHILE next_date <= end_date LOOP
            RETURN NEXT next_date;
            next_date := next_date + duration;
        END LOOP;
    ELSIF recurs = 'monthly' THEN
        duration  := '27 days'::interval;
        day       := '1 day'::interval;
        check_day := to_char(start_date, 'DD');
        WHILE next_date <= end_date LOOP
            RETURN NEXT next_date;
            next_date := next_date + duration;
            WHILE to_char(next_date, 'DD') <> check_day LOOP
                next_date := next_date + day;
            END LOOP;
        END LOOP;
    ELSE
        -- Someone needs to update this function, methinks.
        RAISE EXCEPTION 'Recurrence % not supported by generate_recurrences()', recurs;
    END IF;
END;
$BODY$;

The idea here is pretty simple: pass in a recurrence (“daily”, “weekly”, or “monthly”), a start date, and an end date, and get back a set of all the recurrence dates between the start and end dates:

try=# \timing
Timing is on.
try=# select * from generate_recurrences('daily', '2008-01-29', '2008-02-05');
 generate_recurrences 
----------------------
 2008-01-29
 2008-01-30
 2008-01-31
 2008-02-01
 2008-02-02
 2008-02-03
 2008-02-04
 2008-02-05
(8 rows)

Time: 0.548 ms
try=# select * from generate_recurrences('weekly', '2008-01-29', '2008-03-05');
 generate_recurrences 
----------------------
 2008-01-29
 2008-02-05
 2008-02-12
 2008-02-19
 2008-02-26
 2008-03-04
(6 rows)

Time: 0.670 ms
try=# select * from generate_recurrences('monthly', '2008-01-29', '2008-05-05');
 generate_recurrences 
----------------------
 2008-01-29
 2008-02-29
 2008-03-29
 2008-04-29
(4 rows)

Time: 0.644 ms

Not bad, eh? And PostgreSQL’s date and interval calculation operators are wicked fast. Check out how long it takes to generate two years’ worth of daily recurrence dates:

try=# select * from  generate_recurrences('daily', '2008-01-29', '2010-02-05');
 generate_recurrences 
----------------------
 2008-01-29
 2008-01-30
 2008-01-31
...
 2010-02-03
 2010-02-04
 2010-02-05
(739 rows)

Time: 4.982 ms

Awesome. And the great thing about this function is that any time I need to add new recurrences (yearly, biweekly, quarterly, weekends, weekdays, etc.), I just modify the domain and this function and we’re ready to go.
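
For instance, adding a “yearly” recurrence might look something like this sketch (untested; the constraint name assumes PostgreSQL’s default name for an unnamed domain CHECK, so adjust if yours differs):

```sql
-- Allow 'yearly' in the domain...
ALTER DOMAIN recurrence DROP CONSTRAINT recurrence_check;
ALTER DOMAIN recurrence ADD CHECK (
    VALUE IN ( 'none', 'daily', 'weekly', 'monthly', 'yearly' )
);

-- ...and add a corresponding branch to generate_recurrences():
--     ELSIF recurs = 'yearly' THEN
--         WHILE next_date <= end_date LOOP
--             RETURN NEXT next_date;
--             next_date := next_date + '1 year'::interval;
--         END LOOP;
```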

And now, part two: the recurring event function:

CREATE OR REPLACE FUNCTION recurring_events_for(
    for_user_id INTEGER,
    range_start TIMESTAMP,
    range_end   TIMESTAMP
)
    RETURNS SETOF events
    LANGUAGE plpgsql STABLE
    AS $BODY$
DECLARE
    event events;
    start_date TIMESTAMPTZ;
    start_time TEXT;
    ends_at    TIMESTAMPTZ;
    next_date  DATE;
    recurs_at  TIMESTAMPTZ;
BEGIN
    FOR event IN 
        SELECT *
          FROM events
         WHERE user_id = for_user_id
           AND (
                   recurrence <> 'none'
               OR  (
                      recurrence = 'none'
                  AND starts_at BETWEEN range_start AND range_end
               )
           )
    LOOP
        IF event.recurrence = 'none' THEN
            RETURN NEXT event;
            CONTINUE;
        END IF;

        start_date := event.starts_at::timestamptz AT TIME ZONE event.start_tz;
        start_time := start_date::time::text;
        ends_at    := event.ends_at::timestamptz AT TIME ZONE event.end_tz;

        FOR next_date IN
            SELECT *
                FROM generate_recurrences(
                        event.recurrence,
                        start_date::date,
                        (range_end AT TIME ZONE event.start_tz)::date
                )
        LOOP
            recurs_at := (next_date || ' ' || start_time)::timestamp
                AT TIME ZONE event.start_tz;
            EXIT WHEN recurs_at > range_end;
            CONTINUE WHEN recurs_at < range_start AND ends_at < range_start;
            event.starts_at := recurs_at;
            event.ends_at   := ends_at;
            RETURN NEXT event;
        END LOOP;
    END LOOP;
    RETURN;
END;
$BODY$;

The idea here is to select the appropriate events for a given user between two dates, and for each event iterate over all of the recurrences between the two dates and return a row for each one. So the lines starting with FOR event IN and ending with LOOP select the original events, looking for either recurring events or non-recurring events that fall between the two dates. Note that if you needed to, you could easily refine this query for your particular application, or even use PL/pgSQL’s EXECUTE statement to dynamically generate queries to suit particular application needs.

Next, the block starting with IF event.recurrence = 'none' THEN simply returns any non-recurring events. Although the next block already handles this case, adding this optimization eliminates a fair bit of calculation for the common case of non-recurring events.

Then the lines starting with FOR next_date IN and ending with LOOP select all of the dates for the recurrence in question, using the generate_recurrences() function created earlier. From LOOP to END LOOP;, the function generates the start and end timestamps, exiting the loop when the start date falls after the range or when it falls before the range and the end date falls after the range. There are many other tweaks one could make here to modify which recurrences are included and which are excluded. For example, if you had a column in the events table such as exclude_dates TIMESTAMP[] NOT NULL DEFAULT '{}' that stored an array of dates to ignore when generating recurrences, you could just add this line to go ahead and exclude them from the results returned by the function:

            CONTINUE WHEN recurs_at = ANY( event.exclude_dates );
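
Setting that up might look like this sketch (the column and the sample exclusion date are hypothetical):

```sql
-- Add the exclusion column described above.
ALTER TABLE events
    ADD COLUMN exclude_dates TIMESTAMP[] NOT NULL DEFAULT '{}';

-- Suppress, say, the Christmas Day occurrence of event 4.
UPDATE events
   SET exclude_dates = exclude_dates || '2007-12-25 22:30:00'::timestamp
 WHERE id = 4;
```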

But enough of the details: let’s see how it works! Here’s a query for a week’s worth of data:

try=# select * from recurring_events_for(1, '2007-12-19', '2007-12-26');
  id  | user_id |      starts_at      | start_tz |       ends_at       | end_tz  | recurrence 
------+---------+---------------------+----------+---------------------+---------+------------
    1 |       1 | 2007-12-20 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    2 |       1 | 2007-12-21 14:00:00 | PST8PDT  | 2007-12-21 06:45:00 | PST8PDT | monthly
    3 |       1 | 2007-12-22 18:00:00 | PST8PDT  | 2007-12-22 21:30:00 | PST8PDT | none
    4 |       1 | 2007-12-23 22:30:00 | PST8PDT  | 2007-12-23 14:45:00 | PST8PDT | daily
    4 |       1 | 2007-12-24 22:30:00 | PST8PDT  | 2007-12-23 14:45:00 | PST8PDT | daily
    4 |       1 | 2007-12-25 22:30:00 | PST8PDT  | 2007-12-23 14:45:00 | PST8PDT | daily
    5 |       1 | 2007-12-24 06:00:00 | PST8PDT  | 2007-12-23 22:30:00 | PST8PDT | weekly
    6 |       1 | 2007-12-25 10:00:00 | PST8PDT  | 2007-12-25 04:00:00 | PST8PDT | monthly
(8 rows)

Time: 51.890 ms

Note the time it took to execute this query. 52 ms is a hell of a lot faster than the several minutes it took to run a similar query using the old view. Plus, I’m not limited to just the recurrence dates I’ve pre-calculated in the old recurrence_dates table. Now we can use whatever dates are supported by PostgreSQL. It’s even fast when we look at a year’s worth of data:

try=# select * from recurring_events_for(1, '2007-12-19', '2008-12-19');
  id  | user_id |      starts_at      | start_tz |       ends_at       | end_tz  | recurrence 
------+---------+---------------------+----------+---------------------+---------+------------
    1 |       1 | 2007-12-20 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    1 |       1 | 2007-12-27 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    1 |       1 | 2008-01-03 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    1 |       1 | 2008-01-10 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    1 |       1 | 2008-01-17 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    1 |       1 | 2008-01-24 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
    1 |       1 | 2008-01-31 10:00:00 | PST8PDT  | 2007-12-20 03:00:00 | PST8PDT | weekly
...
  364 |       1 | 2008-12-17 22:30:00 | PST8PDT  | 2008-12-17 14:45:00 | PST8PDT | daily
  364 |       1 | 2008-12-18 22:30:00 | PST8PDT  | 2008-12-17 14:45:00 | PST8PDT | daily
  365 |       1 | 2008-12-18 06:00:00 | PST8PDT  | 2008-12-17 22:30:00 | PST8PDT | weekly
(19691 rows)

Time: 837.759 ms

Not stellar, but still respectable. Given that for a typical application, a user will be looking at only a day’s or a week’s or a month’s events at a time, this seems to be an acceptable trade-off. I mean, how often will your users need to see a list of 20,000 events? And even if a user *was* looking at a year’s worth of data, it’s unlikely that 75% of them would be recurring as in the example data here.

I was fucking pumped with this solution, and Sandy has hummed along nicely since we put it into production. If you’re interested in trying it for yourself, you can get all the SQL from this blog entry here.

The only thing I would like to have been able to do differently was to encapsulate the recurring_events_for() function in a view. Such would have made it much easier to actually use this solution in Rails. If you know how to do that, please do leave a comment. As for how I hacked Rails to use the function, well, that’s a blog post for another day.

Looking for the comments? Try the old layout.

More about…

Validating Time Zones in PostgreSQL

I recently needed to validate that a value stored in a TEXT column was a valid time zone identifier. Why? Because I was using its value inside the database to convert timestamp columns from UTC to a valid zone. So I set about writing a function I could use in a constraint.

It turns out that PostgreSQL has a pretty nice view that lists all of the time zones that it recognizes. It’s called pg_timezone_names:

try=# select * from pg_timezone_names limit 5;
        name        | abbrev | utc_offset | is_dst 
--------------------+--------+------------+--------
 Africa/Abidjan     | GMT    | 00:00:00   | f
 Africa/Accra       | GMT    | 00:00:00   | f
 Africa/Addis_Ababa | EAT    | 03:00:00   | f
 Africa/Algiers     | CET    | 01:00:00   | f
 Africa/Asmara      | EAT    | 03:00:00   | f
(5 rows)

Cool. So all I had to do was to look up the value in this view. My first stab at creating a time zone validation function therefore looked like this:

CREATE OR REPLACE FUNCTION is_timezone( tz TEXT ) RETURNS BOOLEAN as $$
DECLARE
    bool BOOLEAN;
BEGIN
    SELECT TRUE INTO bool
    FROM pg_timezone_names
    WHERE LOWER(name) = LOWER(tz)
        OR LOWER(abbrev) = LOWER(tz);
    RETURN FOUND;
END;
$$ language plpgsql STABLE;

This should pretty well cover anything that PostgreSQL considers valid. So does it work? You bet:

sandy_development=# \timing
Timing is on.
sandy_development=# select is_timezone('America/Los_Angeles');
 is_timezone 
-------------
 t
(1 row)

Time: 457.096 ms
sandy_development=# select is_timezone('Foo/Bar');
 is_timezone 
-------------
 f
(1 row)

Time: 472.752 ms

Perfect! Well, except for just one thing: performance is abysmal. A half second per shot? Not very useful for constraints. And since pg_timezone_names is a view (and, under that, a function), I can’t create indexes.

But then I did something dangerous: I started thinking. I realized that I needed this function when our app started getting errors like this:

try=# select now() at time zone 'Foo/Bar';
ERROR:  time zone "Foo/Bar" not recognized

So the underlying C code throws an error when a time zone is invalid. What if I could just trap the error? Well, PL/pgSQL conveniently has exception handling, so I could do just that. But there was only one problem. PL/pgSQL’s exception handling syntax requires that you specify an error condition. Here’s what the documentation has:

EXCEPTION
    WHEN condition [ OR condition ... ] THEN
        handler_statements
    [ WHEN condition [ OR condition ... ] THEN
            handler_statements
        ... ]
END;

Conditions are error codes. But which one corresponds to the invalid time zone error? I tried a few, but couldn’t figure out which one. (Anyone know how to map the errors you see in psql to the error codes listed in Appendix A? Let me know!) But really, my function just needed to do one thing. Couldn’t I just trap any old error?

A careful re-read of the PL/pgSQL documentation reveals that, yes, you can. Use the condition “OTHERS,” and you can catch almost anything. With this information in hand, I quickly wrote:

CREATE OR REPLACE FUNCTION is_timezone( tz TEXT ) RETURNS BOOLEAN as $$
DECLARE
    date TIMESTAMPTZ;
BEGIN
    date := now() AT TIME ZONE tz;
    RETURN TRUE;
EXCEPTION WHEN OTHERS THEN
    RETURN FALSE;
END;
$$ language plpgsql STABLE;

And how well does this one work?

sandy_development=# select is_timezone('America/Los_Angeles');
 is_timezone 
-------------
 t
(1 row)

Time: 3.009 ms
sandy_development=# select is_timezone('Foo/Bar');
 is_timezone 
-------------
 f
(1 row)

Time: 1.224 ms

Yes, I’ll take 1-3 ms over 400-500 ms any day! I might even create a domain for this and be done with it:

CREATE DOMAIN timezone AS TEXT
CHECK ( is_timezone( value ) );
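
With the domain in place, any column declared with it rejects bogus zone names at INSERT or UPDATE time. A hypothetical example:

```sql
CREATE TABLE user_prefs (
    user_id INTEGER  NOT NULL,
    tz      TIMEZONE NOT NULL DEFAULT 'UTC'
);

INSERT INTO user_prefs VALUES ( 1, 'America/Los_Angeles' );  -- succeeds
INSERT INTO user_prefs VALUES ( 2, 'Mars/Olympus_Mons' );    -- fails the CHECK
```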

Enjoy!

Looking for the comments? Try the old layout.

More about…

Batch Updates with PL/pgSQL

The third in my series of articles about PL/pgSQL, “Batch Updates with PL/pgSQL,” has been published on The O’Reilly Network. Actually it was published last week, but I’ve not been very attentive to my blog lately. Sorry about that. Anyway, it improves upon the code in the second article in the series, “Managing Many-to-Many Relationships with PL/pgSQL,” by modifying the updating functions to use PostgreSQL batch query syntax. This means that the number of database calls in a given call to a function is constant, no matter how many IDs are passed to it.

So check it out!

Looking for the comments? Try the old layout.

More about…

PL/pgSQL Talk Slides Posted

The talk that I announced a couple of days ago went off very well, I think. Rich Shepard, a local PostgreSQL user who was at the meeting, later said on the mail list:

On the happy side, my thanks to Selena for organizing the group and rounding up a half-dozen cats for the first meeting, and to David for the best presentation at a computer users group meeting I’ve seen in 20 years. The topic was very interesting and the presentation itself highly professional and polished. It’s a standard to be met by future presenters for the benefit of us all.

—Rich Shepard

Wow, praise doesn’t get much higher than that. I’m glad the group got a lot out of the presentation. For the benefit of those who couldn’t make it, I’ve posted the slides for your enjoyment. They might be a bit hard to follow at times without my commentary, but you should be able to fill in the gaps by reading my articles for O’Reilly.

So start having some fun with PL/pgSQL!

Looking for the comments? Try the old layout.

PL/pgSQL Talk for Portland PostgreSQL Users

Attention PostgreSQL users in the Portland metropolitan area and environs (Salem, Eugene, Vancouver, Seattle)! I am honored to be giving the inaugural talk to the newly-formed Portland PostgreSQL Users Group on Wednesday, 19 July 2006 at 19:00 at FreeGeek. My talk will be an introduction to PL/pgSQL. Come check it out and join the fun! Beer and schmoozing to take place after the talk at The Lucky Lab.

Looking for the comments? Try the old layout.

Managing Many-to-Many Relationships with PL/pgSQL

The second in my series of articles on programming PostgreSQL PL/pgSQL, “Managing Many-to-Many Relationships with PL/pgSQL”, has just been published on ONLamp.com. The idea is to abstract the management of many-to-many relationships out of the application layer, moving it into the database layer where execution is both safer and faster. And you can learn more about PL/pgSQL along the way.
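
The shape of the problem, for those who haven’t read the article: a link table joins two others, and database functions encapsulate its maintenance. Something like this (illustrative names, not the article’s exact schema):

-- Illustrative many-to-many schema: entries and tags joined by a
-- link table, which the PL/pgSQL functions manage.
CREATE TABLE entry (
    id    SERIAL PRIMARY KEY,
    title TEXT   NOT NULL
);

CREATE TABLE tag (
    id   SERIAL PRIMARY KEY,
    name TEXT   NOT NULL UNIQUE
);

CREATE TABLE entry_coll_tag (
    entry_id INTEGER NOT NULL REFERENCES entry(id),
    tag_id   INTEGER NOT NULL REFERENCES tag(id),
    PRIMARY KEY (entry_id, tag_id)
);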

So what are you waiting for? Check it out!

Looking for the comments? Try the old layout.

Benchmarking PostgreSQL Functions

Update 2006-05-19: I realized that there was a nasty error in my algorithm for determining the runtime of a function: it was only fetching the milliseconds part of the runtime, without adding in seconds and minutes! This led to negative runtimes when the milliseconds part of the end time was less than the milliseconds part of the start time. Ugh. But with the help of yain on IRC, I’ve switched to calculating the number of seconds by converting the start and end times to epoch seconds (which have subsecond precision in PostgreSQL), and now things are just dandy. While I was at it, I reorganized the function so that it was a bit easier to read, by constructing the created function in the order it would be executed, and fixed the caching problem, as suggested by Aidan in a comment below.

Following yesterday’s post, Klint Gore sent me some PL/pgSQL code that might be useable as a benchmark function. Today I took that code and ran with it.

The idea was to create a function like the Perl Benchmark module’s timethese() function. In the process, I found, with help from Josh Berkus, that PL/pgSQL’s EXECUTE statement has quite a lot of overhead, and the amount of overhead per call is pretty random. The overhead resulted in pretty inaccurate benchmark numbers, unfortunately.

At Josh’s suggestion, I rewrote the function to test each function inline, rather than passing the function code as parameters. This time, the results were dead on. So then I refactored the original benchmark function to create its own benchmark function, inlining all of the code, and then call that function. Almost higher-order PL/pgSQL! Again the results were just right, and so now I present it to you:

create type _benchmark as (
    code      text,
    runtime   real,
    corrected real
);

CREATE OR REPLACE FUNCTION benchmark(n INTEGER, funcs TEXT[])
RETURNS SETOF _benchmark AS $$
DECLARE
    code TEXT := '';
    a    _benchmark;
BEGIN
    -- Start building the custom benchmarking function.
    code := $_$
        CREATE OR REPLACE FUNCTION _bench(n INTEGER)
        RETURNS SETOF _benchmark AS $__$
        DECLARE
            s TIMESTAMP;
            e TIMESTAMP;
            a RECORD;
            d numeric;
            res numeric;
            ret _benchmark;
        BEGIN
            -- Create control.
            s := timeofday();
            FOR a IN SELECT TRUE FROM generate_series( 1, $_$ || n || $_$ )
            LOOP
            END LOOP;
            e := timeofday();
            d := extract(epoch from e) - extract(epoch from s);
            ret := ROW( '[Control]', d, 0 );
            RETURN NEXT ret;
$_$;
    -- Append the code to bench each function call.
    FOR i IN array_lower(funcs,1) .. array_upper(funcs, 1) LOOP
        code := code || '
            s := timeofday();
            FOR a IN SELECT ' || funcs[i] || ' FROM generate_series( 1, '
                || n || $__$ ) LOOP
            END LOOP;
            e := timeofday();
            res := extract(epoch from e) - extract(epoch from s);
            ret := ROW(
                $__$ || quote_literal(funcs[i]) || $__$,
                res,
                res - d
            );
            RETURN NEXT ret;
$__$;
    END LOOP;

    -- Create the function.
    execute code || $_$
        END;
        $__$ language plpgsql;
$_$;

    -- Now execute the function.
    FOR a IN EXECUTE 'SELECT * FROM _bench(' || n || ')' LOOP
        RETURN NEXT a;
    END LOOP;

    -- Drop the function.
    DROP FUNCTION _bench(integer);
    RETURN;
END;
$$ language 'plpgsql';

You call the function like this:

try=# select * from benchmark(10000, ARRAY[
try(#     'ean_substr(''036000291452'')',
try(#     'ean_byte(''036000291452'')',
try(#     'ean_c(''036000291452'')'
try(# ]);
            code            |  runtime  | corrected 
----------------------------+-----------+------------
 [Control]                  | 0.0237451 |          0
 ean_substr('036000291452') |  0.497734 |   0.473989
 ean_byte(  '036000291452') |  0.394456 |   0.370711
 ean_c(     '036000291452') | 0.0277281 | 0.00398302
(4 rows)

Pretty slick, eh? The only downside was that, when the DROP FUNCTION line was not commented out, the function would run once, and then, the next time, I’d get this error:

ERROR:  cache lookup failed for function 17323
CONTEXT:  PL/pgSQL function "benchmark" line 49 at for over select rows

I have no idea why. So I just leave the function and let the CREATE OR REPLACE take care of it.

Looking for the comments? Try the old layout.

Corrected PostgreSQL EAN Functions

Update: I updated the benchmarks based on the fixed version of my benchmarking function.

In doing a bit more reading about EAN codes, I realized that my previous attempts to write a validating function for UPC and EAN codes had a significant error: they would only properly validate EAN codes if the first numeral was 0! So I went back and fixed them all, and present them here for posterity.
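
For reference, here’s the EAN-13 check-digit math worked by hand for a valid code, which is what each of the functions below implements:

-- Worked example for the valid EAN 4007630000116:
--   Even-position digits (2,4,6,8,10,12): 0+7+3+0+0+1 = 11
--   Multiply by 3:                        11 * 3      = 33
--   Odd-position digits (1,3,5,7,9,11):   4+0+6+0+0+1 = 11
--   Total:                                33 + 11     = 44
--   10 - (44 % 10) = 6, which matches the 13th (check) digit.
SELECT ean_substr('4007630000116');  -- returns true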

  • The substring solution:

    CREATE OR REPLACE FUNCTION ean_substr (
        TEXT
    ) RETURNS boolean AS $$
    DECLARE
        offset integer := 0;
        -- Support UPCs.
        ean   TEXT    := CASE WHEN length($1) = 12 THEN '0' || $1 ELSE $1 END;
    BEGIN
        -- Make sure we really have an EAN.
        IF ean !~ '^\\d{13}$' THEN RETURN FALSE; END IF;
    
        RETURN 10 - (
            (
                -- Sum even numerals.
                substring(ean,  2 + offset, 1)::integer
                + substring(ean,  4 + offset, 1)::integer
                + substring(ean,  6 + offset, 1)::integer
                + substring(ean,  8 + offset, 1)::integer
                + substring(ean, 10 + offset, 1)::integer
                + substring(ean, 12 + offset, 1)::integer
            ) * 3 -- Multiply total by 3.
            -- Add odd numerals except for checksum (13).
            + substring(ean,  1 + offset, 1)::integer
            + substring(ean,  3 + offset, 1)::integer
            + substring(ean,  5 + offset, 1)::integer
            + substring(ean,  7 + offset, 1)::integer
            + substring(ean,  9 + offset, 1)::integer
            + substring(ean, 11 + offset, 1)::integer
        -- Compare to the checksum.
        ) % 10 = substring(ean, 13 + offset, 1)::integer;
    END;
    $$ LANGUAGE 'plpgsql' immutable;
  • The looping solution:

    CREATE OR REPLACE FUNCTION ean_loop(
        TEXT
    ) RETURNS boolean AS $$
    DECLARE
        total INTEGER := 0;
        -- Support UPCs.
        ean   TEXT    := CASE WHEN length($1) = 12 THEN '0' || $1 ELSE $1 END;
    BEGIN
        -- Make sure we really have an EAN.
        IF ean !~ '^\\d{13}$' THEN RETURN FALSE; END IF;
    
        -- Sum even numerals.
        FOR i IN 2..12 LOOP
            total := total + substring(ean, i, 1)::INTEGER;
            i := i + 1;
        END LOOP;
    
        -- Multiply total by 3.
        total := total * 3;
    
        -- Add odd numerals except for checksum (13).
        FOR i IN 1..11 LOOP
            total := total + substring(ean, i, 1)::INTEGER;
            i := i + 1;
        END LOOP;
    
        -- Compare to the checksum.
        RETURN 10 - total % 10 = substring(ean, 13, 1)::INTEGER;
    END;
    $$ LANGUAGE 'plpgsql' immutable;
  • The BYTEA solution:

    CREATE OR REPLACE FUNCTION ean_byte (
        arg TEXT
    ) RETURNS boolean AS $$
    DECLARE
        -- Convert to BYTEA; support UPCs.
        ean BYTEA := CASE WHEN length($1) = 12 THEN '0' || $1 ELSE $1 END;
    BEGIN
        -- Make sure we really have an EAN.
        IF arg !~ '^\\d{12,13}$' THEN RETURN FALSE; END IF;
    
        RETURN 10 - (
            (
                -- Sum odd numerals.
                get_byte(ean,  1) - 48
                + get_byte(ean,  3) - 48
                + get_byte(ean,  5) - 48
                + get_byte(ean,  7) - 48
                + get_byte(ean,  9) - 48
                + get_byte(ean, 11) - 48
            ) * 3 -- Multiply total by 3.
            -- Add even numerals except for checksum (12).
            + get_byte(ean,  0) - 48
            + get_byte(ean,  2) - 48
            + get_byte(ean,  4) - 48
            + get_byte(ean,  6) - 48
            + get_byte(ean,  8) - 48
            + get_byte(ean, 10) - 48
        -- Compare to the checksum.
        ) % 10 = get_byte(ean, 12) - 48;
    END;
    $$ LANGUAGE plpgsql immutable;
  • The PL/Perl solution:

    CREATE OR REPLACE FUNCTION ean_perl (
        TEXT
    ) RETURNS boolean AS $_$
        my $ean = length $_[0] == 12 ? "0$_[0]" : $_[0];
        # Make sure we really have an EAN.
        return 'false' unless $ean =~ /^\d{13}$/;
        my @nums = split '', $ean;
        return 10 - (
            # Sum even numerals.
            (   (   $nums[1] + $nums[3] + $nums[5] + $nums[7] + $nums[9]
                        + $nums[11]
                ) * 3 # Multiply total by 3.
            # Add odd numerals except for checksum (12).
            ) + $nums[0] + $nums[2] + $nums[4] + $nums[6] + $nums[8] + $nums[10]
        # Compare to the checksum.
        ) % 10 == $nums[12] ? 'true' : 'false';
    $_$ LANGUAGE plperl immutable;
  • The C solution (thanks StuckMojo!):

    #include <string.h>
    #include "postgres.h"
    #include "fmgr.h"
    
    Datum ean_c(PG_FUNCTION_ARGS);
    
    PG_FUNCTION_INFO_V1(ean_c);
    
    Datum ean_c(PG_FUNCTION_ARGS) {
    
        char *ean;
        text *arg = PG_GETARG_TEXT_P(0);
        int  arglen = VARSIZE(arg) - VARHDRSZ;
        bool ret = false;
    
        /* Validate the easy stuff: 12 or 13 digits. */
        if ((arglen != 12 && arglen != 13) || 
            strspn(VARDATA(arg), "0123456789") != arglen) {
            PG_RETURN_BOOL(ret);
        }
    
        /* Support UPCs. */
        if (arglen == 12) {
            ean = (char *) palloc(13);
            ean[0] = '0';
            memcpy(&ean[1], VARDATA(arg), arglen);
        } else {
            ean = (char *) palloc(arglen);
            memcpy(ean, VARDATA(arg), arglen);
        }
    
        ret = 10 - (
                /* Sum even numerals and multiply total by 3. */
                (  ean[1] - '0' + ean[3] - '0' + ean[5]  - '0' 
                    + ean[7] - '0' + ean[9] - '0' + ean[11] - '0') * 3
                /* Add odd numerals except for checksum (12). */
                + ean[0] - '0' + ean[2] - '0' + ean[4]  - '0'
                + ean[6] - '0' + ean[8] - '0' + ean[10] - '0'
            /* Compare to the checksum. */
            ) % 10 == ean[12] - '0';
    
        PG_RETURN_BOOL(ret);
    }

And here are the benchmarks for them (without immutable):

try=# select * from benchmark(100000, ARRAY[
try(#     'ean_substr(''4007630000116'')',
try(#     'ean_loop(  ''4007630000116'')',
try(#     'ean_byte(  ''4007630000116'')',
try(#     'ean_perl(  ''4007630000116'')',
try(#     'ean_c(     ''4007630000116'')'
try(# ]);
            code             | runtime  |    rate     | corrected | corrected_rate 
-----------------------------+----------+-------------+-----------+----------------
 [Control]                   | 0.257728 | 388006.17/s |  0.257728 | 388006.17/s
 ean_substr('4007630000116') |  5.07296 | 19712.37/s  |   4.81523 | 20767.44/s
 ean_loop(  '4007630000116') |  9.18085 | 10892.24/s  |   8.92312 | 11206.84/s
 ean_byte(  '4007630000116') |   3.9248 | 25479.02/s  |   3.66707 | 27269.73/s
 ean_perl(  '4007630000116') |   5.5062 | 18161.33/s  |   5.24848 | 19053.15/s
 ean_c(     '4007630000116') | 0.285376 | 350415.10/s |  0.027648 | 3616901.80/s
(6 rows)

Enjoy!

Looking for the comments? Try the old layout.

Benchmarking UPC Validation

Just to follow up on my query about validating UPC codes in PL/pgSQL, Klint Gore sent me a private email demonstrating that treating the UPC code as a binary string performed better than my substringing approach. I modified his version to work like the others, but it looked to me like the performance was about the same. They were just too close for me to really be able to tell.
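
The trick in Klint’s version is that, once the text is cast to BYTEA, get_byte() returns each character’s ASCII value, so subtracting 48 (the code for “0”) yields the digit itself, with no substring or cast per digit. For example:

-- get_byte() gives the ASCII value of a byte, so subtracting 48
-- (ASCII '0') converts a digit character to its numeric value.
SELECT get_byte('036000291452'::bytea, 0) - 48;  -- 0 (first digit)
SELECT get_byte('036000291452'::bytea, 1) - 48;  -- 3 (second digit)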

What I needed was a way to run the queries a whole bunch of times to see the real difference. I asked on #postgresql, and dennisb suggested a simple brute-force approach:

select foo(42) FROM generate_series (1, 10000);

So that’s what I did. The functions I tested were:

  • A refinement of my original substring solution:

    CREATE OR REPLACE FUNCTION ean_substr (
        TEXT
    ) RETURNS boolean AS $$
    DECLARE
        offset integer := 0;
        -- Support UPCs.
        ean   TEXT    := CASE WHEN length($1) = 12 THEN
            '0' || $1
        ELSE
            $1
        END;
    BEGIN
        -- Make sure we really have an EAN.
        IF ean !~ '^\\d{13}$' THEN RETURN FALSE; END IF;
    
        RETURN 10 - (
            (
                -- Sum even numerals.
                substring(ean,  2 + offset, 1)::integer
                + substring(ean,  4 + offset, 1)::integer
                + substring(ean,  6 + offset, 1)::integer
                + substring(ean,  8 + offset, 1)::integer
                + substring(ean, 10 + offset, 1)::integer
                + substring(ean, 12 + offset, 1)::integer
            ) * 3 -- Multiply total by 3.
            -- Add odd numerals except for checksum (13).
            + substring(ean,  3 + offset, 1)::integer
            + substring(ean,  5 + offset, 1)::integer
            + substring(ean,  7 + offset, 1)::integer
            + substring(ean,  9 + offset, 1)::integer
            + substring(ean, 11 + offset, 1)::integer
        -- Compare to the checksum.
        ) % 10 = substring(ean, 12 + offset, 1)::integer;
    END;
    $$ LANGUAGE plpgsql;
  • A looping version, based on the comment from Adrian Klaver in the original post:

    CREATE OR REPLACE FUNCTION ean_loop(
        TEXT
    ) RETURNS boolean AS $$
    DECLARE
        total INTEGER := 0;
        -- Support UPCs.
        ean   TEXT    := CASE WHEN length($1) = 12 THEN
            '0' || $1
        ELSE
            $1
        END;
    BEGIN
        -- Make sure we really have an EAN.
        IF ean !~ '^\\d{13}$' THEN RETURN FALSE; END IF;
    
        -- Sum even numerals.
        FOR i IN 2..12 LOOP
            total := total + substring(ean, i, 1)::INTEGER;
            i := i + 1;
        END LOOP;
    
        -- Multiply total by 3.
        total := total * 3;
    
        -- Add odd numerals except for checksum (13).
        FOR i IN 3..11 LOOP
            total := total + substring(ean, i, 1)::INTEGER;
            i := i + 1;
        END LOOP;
    
        -- Compare to the checksum.
        RETURN 10 - total % 10
            = substring(ean, 13, 1)::INTEGER;
    END;
    $$ LANGUAGE 'plpgsql';
  • A PL/Perl version for Josh and Ovid:

    CREATE OR REPLACE FUNCTION ean_perl (
        TEXT
    ) RETURNS boolean AS $_$
        my $ean = length $_[0] == 12 ? "0$_[0]" : $_[0];
        # Make sure we really have an EAN.
        return 'false' unless $ean =~ /^\d{13}$/;
        my @nums = split '', shift;
        return 10 - (
            # Sum even numerals.
            (   (   $nums[1] + $nums[3] + $nums[5]
                    + $nums[7] + $nums[9] + $nums[11]
                ) * 3 # Multiply total by 3.
            # Add odd numerals except for checksum (12).
            ) + $nums[2] + $nums[4] + $nums[6] + $nums[8]
                + $nums[10]
        # Compare to the checksum.
        ) % 10 == $nums[11] ? 'true' : 'false';
    $_$ LANGUAGE plperl;
  • And finally, the new version using a byte string:

    CREATE OR REPLACE FUNCTION ean_byte (
        arg TEXT
    ) RETURNS boolean AS $$
    DECLARE
        -- Convert to BYTEA; support UPCs.
        ean BYTEA := CASE WHEN length($1) = 12 THEN
            '0' || $1
        ELSE
            $1
        END;
    BEGIN
        -- Make sure we really have an EAN.
        IF arg !~ '^\\d{12,13}$' THEN RETURN FALSE; END IF;
    
        RETURN 10 - (
            (
                -- Sum even numerals.
                get_byte(ean,  2) - 48
                + get_byte(ean,  4) - 48
                + get_byte(ean,  6) - 48
                + get_byte(ean,  8) - 48
                + get_byte(ean, 10) - 48
                + get_byte(ean, 12) - 48
            ) * 3 -- Multiply total by 3.
            -- Add odd numerals except for checksum (13).
            + get_byte(ean,  3) - 48
            + get_byte(ean,  7) - 48
            + get_byte(ean,  5) - 48
            + get_byte(ean,  9) - 48
            + get_byte(ean, 11) - 48
        -- Compare to the checksum.
        ) % 10  = get_byte(ean, 12) - 48;
    END;
    $$ LANGUAGE plpgsql;

And then I ran the benchmarks:

try=# \timing
Timing is on.
try=# \o /dev/null
try=# select ean_substr('036000291452')
try-# FROM generate_series (1, 10000);
Time: 488.743 ms
try=# select ean_loop('036000291452')
try-# FROM generate_series (1, 10000);
Time: 881.553 ms
try=# select ean_perl('036000291452')
try-# FROM generate_series (1, 10000);
Time: 540.962 ms
try=# select ean_byte('036000291452')
try-# FROM generate_series (1, 10000);
Time: 395.124 ms

So the binary approach is the clear winner here, being 23.69% faster than my substring approach, 36.91% faster than the Perl version, and 2.23 times faster (123.11%) than the looping approach. So I think I’ll go with that.

Meanwhile, I’m pleased to have this simple benchmarking tool in my arsenal for future PostgreSQL function development.

Looking for the comments? Try the old layout.
