Benchmarking UPC Validation

Just to follow up on my query about validating UPC codes in PL/pgSQL, Klint Gore sent me a private email demonstrating that treating the UPC code as a binary string performed better than my substringing approach. I modified his version to work like the others, but it looked to me like the performace was about the same. They were just too close for me to really be able to tell.

What I needed was a way to run the queries a whole bunch of times to see the real difference. I asked on #postgresql, and dennisb suggested a simple brute-force approach:

select foo(42) FROM generate_series (1, 10000);

So that’s what I did. The functions I tested were:

  • A refinement of my original substring solution:

    CREATE OR REPLACE FUNCTION ean_substr (
        TEXT
    ) RETURNS boolean AS $$
    DECLARE
        offset integer := 0;
        -- Support UPCs.
        ean   TEXT    := CASE WHEN length($1) = 12 THEN
           '0' || $1
        ELSE
           $1
        END;
    BEGIN
        -- Make sure we really have an EAN.
        IF ean !~ '^\\d{13}$' THEN RETURN FALSE; END IF;
    
        RETURN 10 - (
            (
              -- Sum even numerals.
                substring(ean,  2 + offset, 1)::integer
              + substring(ean,  4 + offset, 1)::integer
              + substring(ean,  6 + offset, 1)::integer
              + substring(ean,  8 + offset, 1)::integer
              + substring(ean, 10 + offset, 1)::integer
              + substring(ean, 12 + offset, 1)::integer
             ) * 3 -- Multiply total by 3.
             -- Add odd numerals except for checksum (13).
             + substring(ean,  3 + offset, 1)::integer
             + substring(ean,  5 + offset, 1)::integer
             + substring(ean,  7 + offset, 1)::integer
             + substring(ean,  9 + offset, 1)::integer
             + substring(ean, 11 + offset, 1)::integer
        -- Compare to the checksum.
        ) % 10 = substring(ean, 12 + offset, 1)::integer;
    END;
    $$ LANGUAGE plpgsql;
        
  • A looping version, based on the comment from Adrian Klaver in the original post:

    CREATE OR REPLACE FUNCTION ean_loop(
        TEXT
    ) RETURNS boolean AS $$
    DECLARE
        total INTEGER := 0;
        -- Support UPCs.
        ean   TEXT    := CASE WHEN length($1) = 12 THEN
           '0' || $1
        ELSE
           $1
        END;
    BEGIN
        -- Make sure we really have an EAN.
        IF ean !~ '^\\d{13}$' THEN RETURN FALSE; END IF;
    
        -- Sum even numerals.
        FOR i IN 2..12 LOOP
            total := total + substring(ean, i, 1)::INTEGER;
            i := i + 1;
        END LOOP;
    
        -- Multiply total by 3.
        total := total * 3;
    
        -- Add odd numerals except for checksum (13).
        FOR i IN 3..11 LOOP
            total := total + substring(ean, i, 1)::INTEGER;
            i := i + 1;
        END LOOP;
    
        -- Compare to the checksum.
        RETURN 10 - total % 10
            = substring(ean, 13, 1)::INTEGER;
    END;
    $$ LANGUAGE 'plpgsql';
        
  • A PL/Perl version for Josh and Ovid:

    CREATE OR REPLACE FUNCTION ean_perl (
        TEXT
    ) RETURNS boolean AS $_$
        my $ean = length $_[0] == 12 ? "0$_[0]" : $_[0];
        # Make sure we really have an EAN.
        return 'false' unless $ean =~ /^\d{13}$/;
        my @nums = split '', shift;
        return 10 - (
            # Sum even numerals.
            (   (   $nums[1] + $nums[3] + $nums[5]
                    + $nums[7] + $nums[9] + $nums[11]
                ) * 3 # Multiply total by 3.
            # Add odd numerals except for checksum (12).
            ) + $nums[2] + $nums[4] + $nums[6] + $nums[8]
              + $nums[10]
        # Compare to the checksum.
        ) % 10 == $nums[11] ? 'true' : 'false';
    $_$ LANGUAGE plperl;
        
  • And finally, the new version using a byte string:

    CREATE OR REPLACE FUNCTION ean_byte (
       arg TEXT
    ) RETURNS boolean AS $$
    DECLARE
        -- Convert to BYTEA; support UPCs.
        ean BYTEA := CASE WHEN length($1) = 12 THEN
            '0' || $1
        ELSE
            $1
        END;
    BEGIN
        -- Make sure we really have an EAN.
        IF arg !~ '^\\d{12,13}$' THEN RETURN FALSE; END IF;
    
        RETURN 10 - (
            (
                -- Sum even numerals.
                get_byte(ean,  2) - 48
              + get_byte(ean,  4) - 48
              + get_byte(ean,  6) - 48
              + get_byte(ean,  8) - 48
              + get_byte(ean, 10) - 48
              + get_byte(ean, 12) - 48
             ) * 3 -- Multiply total by 3.
             -- Add odd numerals except for checksum (13).
             + get_byte(ean,  3) - 48
             + get_byte(ean,  7) - 48
             + get_byte(ean,  5) - 48
             + get_byte(ean,  9) - 48
             + get_byte(ean, 11) - 48
        -- Compare to the checksum.
        ) % 10  = get_byte(ean, 12) - 48;
    END;
    $$ LANGUAGE plpgsql;
        

And then I ran the benchmarks:

try=# \timing
Timing is on.
try=# \o /dev/null
try=# select ean_substr('036000291452')
try-# FROM generate_series (1, 10000);
Time: 488.743 ms
try=# select ean_loop('036000291452')
try-# FROM generate_series (1, 10000);
Time: 881.553 ms
try=# select ean_perl('036000291452')
try-# FROM generate_series (1, 10000);
Time: 540.962 ms
try=# select ean_byte('036000291452')
try-# FROM generate_series (1, 10000);
Time: 395.124 ms

So the binary approach is the clear winner here, being 23.69% faster than my substring approach, 36.91% faster than the Perl version, and 2.23 times faster (123.11%) than the looping approach. So I think I’ll go with that.

Meanwhile, I’m pleased to have this simple benchmarking tool in my arsenal for future PostgreSQL function development.

  • E-mail this story to a friend!
  • Sphinn
  • StumbleUpon
  • Facebook
  • del.icio.us
  • LinkedIn
  • TwitThis
  • Digg
  • Google
  • MySpace
  • Reddit
  • StumbleUpon
  • Technorati
  • Yahoo! Buzz

Comments & Trackbacks

StuckMojo wrote:

Well, here's the results of it as a C function:

test=# select validate_upc('0083514870406') from generate_series (1, 10000);
Time: 138.571 ms
test=# select ean_byte('0083514870406') from generate_series (1, 10000);
Time: 151.628 ms
test=# select ean_c('0083514870406') from generate_series (1, 10000);
Time: 12.612 ms

Pretty god damn impressive!!

#include <string.h>
#include "postgres.h"
#include "fmgr.h"

PG_FUNCTION_INFO_V1(ean_c);

Datum ean_c(PG_FUNCTION_ARGS) {

    char *ean;
    text *arg = PG_GETARG_TEXT_P(0);
    int  arglen = VARSIZE(arg) - VARHDRSZ;
    bool ret = false;

    /* Validate the easy stuff: 12 or 13 digits. */
    if ((arglen != 12 && arglen != 13) ||
        strspn(VARDATA(arg), "0123456789") != arglen) {
        PG_RETURN_BOOL(ret);
    }

    /* Support UPCs. */
    if (arglen == 12) {
        ean = (char *) palloc(13);
        ean[0] = '0';
        memcpy(&ean[1], VARDATA(arg), arglen);
    } else {
        ean = (char *) palloc(arglen);
        memcpy(ean, VARDATA(arg), arglen);
    }

    ret = 10 - (
            /* Sum even numerals and multiply total by 3. */
            (ean[1] - '0' + ean[3] - '0' + ean[5] - '0'
             + ean[7] - '0' + ean[9] - '0' + ean[11] - '0') * 3
            /* Add odd numerals except for checksum (12). */
            + ean[2] - '0' + ean[4] - '0' + ean[6] - '0'
            + ean[8] - '0' + ean[10] - '0'
        /* Compare to the checksum. */
        ) % 10 == ean[12] - '0';

   PG_RETURN_BOOL(ret);
}

StuckMojo wrote:

I packaged this up as a contrib module for easy use:

http://www.thewickedtribe.net/~jon/postgres/ean_c.tgz

Alvaro Herrera wrote:

Performance comparisons

pgbench is a contrib module you can use for performance comparison. Starting with PostgreSQL 8.1 you can give it arbitrary SQL scripts to run using the -f option.

Theory wrote:

Reflecting on Performance

Just thinking about the performance of these functions, I realized that, on my system, all but the looping version can execute 10,000 times in less than half a second. As compelling as I find StuckMojo's C version, I think I'll stick to the pure PL/pgSQL for the time being, since that's how all my other functions are written. Because, when you come down to it and are talking about this kind of performance for functions, figuring out which is fastest is just a kind of splitting hairs.If I later find that I need C functions for other stuff, I might use StuckMojo's version, because then I'll have to deal with the issues of distribution, anyway, so it won't hurt to then make everything that I can be in C, and this version would require no extra tuits. :-)

—Theory

Discussion is now closed.

Powered by KinoSearch