Just a Theory

By David E. Wheeler

Posts about C

My First C: A GTIN Data Type for PostgreSQL

After all of my recent experimentation creating UPC, EAN, and GTIN validation functions, I became interested in trying to create a GTIN PostgreSQL data type in C. The fact that I don’t know C didn’t stop me from learning enough to do some damage. And now I have a first implementation done. Check it out!

So how did I do this? Well, chapter six of the Douglas Book was a great help to get me started. I also learned what I could by reading the source code for the core and contributed PostgreSQL data types, as well as the EnumKit enumerated data type builder (download from here). And the denizens of the #postgresql channel on FreeNode were also extremely helpful. Thank you, guys!

I would be very grateful if the C hackers among you, and especially any PostgreSQL core hackers who happen to read my blog, would download the GTIN source code and have a look at it. This is the first C code I’ve written, so it would not surprise me if there were some gotchas that I missed (memory leaks, anyone?). And yes, I know that the new ISN contributed data types in the forthcoming 8.2 is a far more featureful implementation of bar code data types; I learned about it after I had nearly finished this first release of GTIN. But I did want to learn some C and how to create PostgreSQL data types, and provide the code for others to learn from, as well. It may also end up as the basis for an article. Stay tuned

In the meantime, share and enjoy.

Update: I forgot to mention that you can check out the source code from the Kineticode Subversion repository.

Looking for the comments? Try the old layout.

Benchmarking PostgreSQL Functions

Update 2006-05-19: I realized that there was a nasty error in my algorithm for determining the runtime of a function: It was only fetching the milliseconds part of the runtime, without adding in seconds and minutes! This led to getting negative runtimes then the milliseconds part of the end time was less than the milliseconds part of the start time. Ugh. But with the help of yain on IRC, I’ve switched to calculating the number of seconds by converting the start and end times to epoch seconds (which have subsecond precision in PostgreSQL, and now things are just dandy. While I was at it, I reorganized the function so that it was a bit easier to read, by constructing the created function in the order it would be executed, and fixed the caching problem, as suggested by Aidan in a comment below.

Following yesterday’s post, Klint Gore sent me some PL/pgSQL code that might be useable as a benchmark function. Today I took that code and ran with it.

The idea was to create a function like the Perl Benchmark module’s timethese() function. In the process, I found, with help from Josh Berkus, that PL/pgSQL’s EXECUTE statement has quite a lot of overhead, and the amount of overhead per call is pretty random. The overhead resulted in pretty inaccurate benchmark numbers, unfortunately.

At Josh’s suggestion, I rewrote the function to just test each function inline, rather than passing the function code as parameters. This time, the results were dead on. So then I refactored the original benchmark function to create its own benchmark function, inlining all of the code, and then call that function. Almost higher order PL/pgSQL! Again the results were just right, and so now I present it to you:

create type _benchmark as (
    code      text,
    runtime   real,
    corrected real
);

CREATE OR REPLACE FUNCTION benchmark(n INTEGER, funcs TEXT[])
RETURNS SETOF _benchmark AS $$
DECLARE
    code TEXT := '';
    a    _benchmark;
BEGIN
    -- Start building the custom benchmarking function.
    code := $_$
        CREATE OR REPLACE FUNCTION _bench(n INTEGER)
        RETURNS SETOF _benchmark AS $__$
        DECLARE
            s TIMESTAMP;
            e TIMESTAMP;
            a RECORD;
            d numeric;
            res numeric;
            ret _benchmark;
        BEGIN
            -- Create control.
            s := timeofday();
            FOR a IN SELECT TRUE FROM generate_series( 1, $_$ || n || $_$ )
            LOOP
            END LOOP;
            e := timeofday();
            d := extract(epoch from e) - extract(epoch from s);
            ret := ROW( '[Control]', d, 0 );
            RETURN NEXT ret;
$_$;
    -- Append the code to bench each function call.
    FOR i IN array_lower(funcs,1) .. array_upper(funcs, 1) LOOP
        code := code || '
            s := timeofday();
            FOR a IN SELECT ' || funcs[i] || ' FROM generate_series( 1, '
                || n || $__$ ) LOOP
            END LOOP;
            e := timeofday();
            res := extract(epoch from e) - extract(epoch from s);
            ret := ROW(
                $__$ || quote_literal(funcs[i]) || $__$,
                res,
                res - d
            );
            RETURN NEXT ret;
$__$;
    END LOOP;

    -- Create the function.
    execute code || $_$
        END;
        $__$ language plpgsql;
$_$;

    -- Now execute the function.
    FOR a IN EXECUTE 'SELECT * FROM _bench(' || n || ')' LOOP
        RETURN NEXT a;
    END LOOP;

    -- Drop the function.
    DROP FUNCTION _bench(integer);
    RETURN;
END;
$$ language 'plpgsql';

You call the function like this:

try=# select * from benchmark(10000, ARRAY[
try(#     'ean_substr(''036000291452'')',
try(#     'ean_byte(''036000291452'')',
try(#     'ean_c(''036000291452'')'
try(# ]);
            code            | runtime   | corrected 
----------------------------+-----------+-----------
    [Control]                  | 0.0237451 |          0
    ean_substr('036000291452') |  0.497734 |   0.473989
    ean_byte(  '036000291452') |  0.394456 |   0.370711
    ean_c(     '036000291452') | 0.0277281 | 0.00398302
(4 rows)

Pretty slick, eh? The only downside was that, when the DROP FUNCTION line was not commented out, the function would run once, and then, the next time, I’d get this error:

ERROR:  cache lookup failed for function 17323
CONTEXT:  PL/pgSQL function "benchmark" line 49 at for over select rows

I have no idea why. So I just leave the function and let the CREATE OR REPLACE take care of it.

Looking for the comments? Try the old layout.

Corrected PostgreSQL EAN Functions

Update: I updated the benchmarks based on the fixed version of my benchmarking function.

In doing a bit more reading about EAN codes, I realized that my previous attempts to write a validating function for UPC and EAN codes had a significant error: they would only properly validate EAN codes if the first numeral was 0! So I went back and fixed them all, and present them here for posterity.

  • The substring solution:

    CREATE OR REPLACE FUNCTION ean_substr (
        TEXT
    ) RETURNS boolean AS $$
    DECLARE
        offset integer := 0;
        -- Support UPCs.
        ean   TEXT    := CASE WHEN length($1) = 12 THEN '0' || $1 ELSE $1 END;
    BEGIN
        -- Make sure we really have an EAN.
        IF ean !~ '^\\d{13}$' THEN RETURN FALSE; END IF;
    
        RETURN 10 - (
            (
                -- Sum even numerals.
                substring(ean,  2 + offset, 1)::integer
                + substring(ean,  4 + offset, 1)::integer
                + substring(ean,  6 + offset, 1)::integer
                + substring(ean,  8 + offset, 1)::integer
                + substring(ean, 10 + offset, 1)::integer
                + substring(ean, 12 + offset, 1)::integer
            ) * 3 -- Multiply total by 3.
            -- Add odd numerals except for checksum (13).
            + substring(ean,  1 + offset, 1)::integer
            + substring(ean,  3 + offset, 1)::integer
            + substring(ean,  5 + offset, 1)::integer
            + substring(ean,  7 + offset, 1)::integer
            + substring(ean,  9 + offset, 1)::integer
            + substring(ean, 11 + offset, 1)::integer
        -- Compare to the checksum.
        ) % 10 = substring(ean, 13 + offset, 1)::integer;
    END;
    $$ LANGUAGE 'plpgsql' immutable;
  • The looping solution:

    CREATE OR REPLACE FUNCTION ean_loop(
        TEXT
    ) RETURNS boolean AS $$
    DECLARE
        total INTEGER := 0;
        -- Support UPCs.
        ean   TEXT    := CASE WHEN length($1) = 12 THEN '0' || $1 ELSE $1 END;
    BEGIN
        -- Make sure we really have an EAN.
        IF ean !~ '^\\d{13}$' THEN RETURN FALSE; END IF;
    
        -- Sum even numerals.
        FOR i IN 2..12 LOOP
            total := total + substring(ean, i, 1)::INTEGER;
            i := i + 1;
        END LOOP;
    
        -- Multiply total by 3.
        total := total * 3;
    
        -- Add odd numerals except for checksum (13).
        FOR i IN 1..11 LOOP
            total := total + substring(ean, i, 1)::INTEGER;
            i := i + 1;
        END LOOP;
    
        -- Compare to the checksum.
        RETURN 10 - total % 10 = substring(ean, 13, 1)::INTEGER;
    END;
    $$ LANGUAGE 'plpgsql' immutable;
  • The BYTEA solution:

    CREATE OR REPLACE FUNCTION ean_byte (
        arg TEXT
    ) RETURNS boolean AS $$
    DECLARE
        -- Convert to BYTEA; support UPCs.
        ean BYTEA := CASE WHEN length($1) = 12 THEN '0' || $1 ELSE $1 END;
    BEGIN
        -- Make sure we really have an EAN.
        IF arg !~ '^\\d{12,13}$' THEN RETURN FALSE; END IF;
    
        RETURN 10 - (
            (
                -- Sum odd numerals.
                get_byte(ean,  1) - 48
                + get_byte(ean,  3) - 48
                + get_byte(ean,  5) - 48
                + get_byte(ean,  7) - 48
                + get_byte(ean,  9) - 48
                + get_byte(ean, 11) - 48
            ) * 3 -- Multiply total by 3.
            -- Add even numerals except for checksum (12).
            + get_byte(ean,  0) - 48
            + get_byte(ean,  2) - 48
            + get_byte(ean,  4) - 48
            + get_byte(ean,  6) - 48
            + get_byte(ean,  8) - 48
            + get_byte(ean, 10) - 48
        -- Compare to the checksum.
        ) % 10 = get_byte(ean, 12) - 48;
            
    END;
    $$ LANGUAGE plpgsql immutable;
  • The PL/Perl solution:

    CREATE OR REPLACE FUNCTION ean_perl (
        TEXT
    ) RETURNS boolean AS $_$
        my $ean = length $_[0] == 12 ? "0$_[0]" : $_[0];
        # Make sure we really have an EAN.
        return 'false' unless $ean =~ /^\d{13}$/;
        my @nums = split '', $ean;
        return 10 - (
            # Sum even numerals.
            (   (   $nums[1] + $nums[3] + $nums[5] + $nums[7] + $nums[9]
                        + $nums[11]
                ) * 3 # Multiply total by 3.
            # Add odd numerals except for checksum (12).
            ) + $nums[0] + $nums[2] + $nums[4] + $nums[6] + $nums[8] + $nums[10]
        # Compare to the checksum.
        ) % 10 == $nums[12] ? 'true' : 'false';
    $_$ LANGUAGE plperl immutable;
  • The C solution (thanks StuckMojo!):

    #include <string.h>
    #include "postgres.h"
    #include "fmgr.h"
    
    Datum ean_c(PG_FUNCTION_ARGS);
    
    PG_FUNCTION_INFO_V1(ean_c);
    
    Datum ean_c(PG_FUNCTION_ARGS) {
    
        char *ean;
        text *arg = PG_GETARG_TEXT_P(0);
        int  arglen = VARSIZE(arg) - VARHDRSZ;
        bool ret = false;
    
        /* Validate the easy stuff: 12 or 13 digits. */
        if ((arglen != 12 && arglen != 13) || 
            strspn(VARDATA(arg), "0123456789") != arglen) {
            PG_RETURN_BOOL(ret);
        }
    
        /* Support UPCs. */
        if (arglen == 12) {
            ean = (char *) palloc(13);
            ean[0] = '0';
            memcpy(&ean[1], VARDATA(arg), arglen);
        } else {
            ean = (char *) palloc(arglen);
            memcpy(ean, VARDATA(arg), arglen);
        }
    
        ret = 10 - (
                /* Sum even numerals and multiply total by 3. */
                (  ean[1] - '0' + ean[3] - '0' + ean[5]  - '0' 
                    + ean[7] - '0' + ean[9] - '0' + ean[11] - '0') * 3
                /* Add odd numerals except for checksum (12). */
                + ean[0] - '0' + ean[2] - '0' + ean[4]  - '0'
                + ean[6] - '0' + ean[8] - '0' + ean[10] - '0'
            /* Compare to the checksum. */
            ) % 10 == ean[12] - '0';
    
        PG_RETURN_BOOL(ret);
    }

And here are the benchmarks for them (without immutable):

try=# select * from benchmark(100000, ARRAY[
try(#     'ean_substr(''4007630000116'')',
try(#     'ean_loop(  ''4007630000116'')',
try(#     'ean_byte(  ''4007630000116'')',
try(#     'ean_perl(  ''4007630000116'')',
try(#     'ean_c(     ''4007630000116'')'
try(# ]);
            code             | runtime  |    rate     | corrected | corrected_rate 
-----------------------------+----------+-------------+-----------+----------------
 [Control]                   | 0.257728 | 388006.17/s |  0.257728 | 388006.17/s
 ean_substr('4007630000116') |  5.07296 | 19712.37/s  |   4.81523 | 20767.44/s
 ean_loop(  '4007630000116') |  9.18085 | 10892.24/s  |   8.92312 | 11206.84/s
 ean_byte(  '4007630000116') |   3.9248 | 25479.02/s  |   3.66707 | 27269.73/s
 ean_perl(  '4007630000116') |   5.5062 | 18161.33/s  |   5.24848 | 19053.15/s
 ean_c(     '4007630000116') | 0.285376 | 350415.10/s |  0.027648 | 3616901.80/s
(6 rows)

Enjoy!

Looking for the comments? Try the old layout.