Managing Key/Value Pairs in PostgreSQL

Let’s say that you’ve been following the latest research in key/value data storage and are interested in managing such data in a PostgreSQL database. You want to have functions to store and retrieve pairs, but there is no natural way to represent pairs in SQL. Many languages have hashes or or data dictionaries to fulfill this role, and you can pass them to functional interfaces. SQL’s got nothin’. In PostgreSQL, have two options: use nested arrays (simple, fast) or use a custom composite data type (sugary, legible).

Let’s assume you have this table for storing your pairs:

CREATE TEMPORARY TABLE kvstore (
    key        TEXT PRIMARY KEY,
    value      TEXT,
    expires_at TIMESTAMPTZ DEFAULT NOW() + '12 hours'::interval
);

To store pairs, you can use nested arrays like so:

SELECT store(ARRAY[ ['foo', 'bar'], ['baz', 'yow'] ]);

Not too bad, and since SQL arrays are a core feature of PostgreSQL, there’s nothing special to do. Here’s the store() function:

CREATE OR REPLACE FUNCTION store(
    params text[][]
) RETURNS VOID LANGUAGE plpgsql AS $$
BEGIN
    FOR i IN 1 .. array_upper(params, 1) LOOP
        UPDATE kvstore
           SET value      = params[i][2],
               expires_at = NOW() + '12 hours'::interval
         WHERE key        = param[i][1];
        CONTINUE WHEN FOUND;
        INSERT INTO kvstore (key, value)
        VALUES (params[i][1], params[i][2]);
    END LOOP;
END;
$$;

I’ve seen worse. The trick is to iterate over each nested array, try an update for each, and insert when no row is updated. Alas, you have no control over how many elements a user might include in a nested array. One might call it as:

SELECT store(ARRAY[ ['foo', 'bar', 'baz'] ]);

Or:

SELECT store(ARRAY[ ['foo'] ]);

No errors will be thrown in either case. In the first the "baz" will be ignored, and in the second the value will default to NULL. If you really didn’t like these behaviors, you could add some code to throw an exception if array_upper(params, 2) returns anything other than 2.

Let’s look at fetching values for keys. PostgreSQL 8.4 added variadic function arguments, so it’s easy to provide a nice interface for retrieving one or more values. The obvious one fetches a single value:

CREATE OR REPLACE FUNCTION getval(
    text
) RETURNS TEXT LANGUAGE SQL AS $$
    SELECT value FROM kvstore WHERE key = $1;
$$;

Nice and simple:

SELECT getval('baz');

 getval 
--------’
 yow

The variadic version looks like this:

CREATE OR REPLACE FUNCTION getvals(
    variadic text[]
) RETURNS SETOF text LANGUAGE SQL AS $$
    SELECT value
      FROM kvstore
      JOIN (SELECT generate_subscripts($1, 1)) AS f(i)
        ON kvstore.key = $1[i]
     ORDER BY i;
$$;

Note the use of ORDER BY i to ensure that the values are returned in the same order as the keys are passed to the function. So if I’ve got the key/value pairs ’foo’ => ‘bar’ and ’baz’ => ‘yow’, the output is:

SELECT * FROM getvals('foo', 'baz');

 getvals 
---------
 bar
 yow

If we want to the rows to have the keys and values together, we can return them as arrays, like so:

CREATE OR REPLACE FUNCTION getpairs(
    variadic text[]
) RETURNS SETOF text[] LANGUAGE SQL AS $$
    SELECT ARRAY[key, value]
      FROM kvstore
      JOIN unnest($1) AS k ON kvstore.key = k
$$;

Here I’m assuming that order isn’t important, which means we can use unnest to "flatten" the array, instead of the slightly more baroque generate_subscripts() with array access. The output:

SELECT * FROM getpairs('foo', 'baz');

  getpairs   
-------------
 {baz,yow}
 {foo,bar}

Now, this is good as far as it goes, but the use of nested arrays to represent key/value pairs is not exactly ideal: just looking at the use of a function, there’s nothing to indicate that you’re using key/value pairs. What would be ideal is to use row constructors to pass arbitrary pairs:

SELECT store( ROW('foo', 'bar'), ROW('baz', 42) );

Alas, one cannot pass RECORD values (the data type returned by ROW()) to non-C functions in PostgreSQL.1 But if you don’t mind your keys and values always being TEXT, we can get almost all the way there by creating an "ordered pair" data type as a composite type like so:

CREATE TYPE pair AS ( k text, v text );

Then we can create store() with a signature of VARIADIC pair[] and pass in any number of these suckers:

CREATE OR REPLACE FUNCTION store(
    params variadic pair[]
) RETURNS VOID LANGUAGE plpgsql AS $$
DECLARE
    param pair;
BEGIN
    FOR param IN SELECT * FROM unnest(params) LOOP
        UPDATE kvstore
           SET value = param.v,
               expires_at = NOW() + '12 hours'::interval
         WHERE key = param.k;
        CONTINUE WHEN FOUND;
        INSERT INTO kvstore (key, value) VALUES (param.k, param.v);
    END LOOP;
END;
$$;

Isn’t it nice how we can access keys and values as param.k and param.v? Call the function like this:

SELECT store( ROW('foo', 'bar')::pair, ROW('baz', 'yow')::pair );

Of course, that can get a bit old, casting to pair all the time, so let’s create some pair constructor functions to simplify things:

CREATE OR REPLACE FUNCTION pair(anyelement, text)
RETURNS pair LANGUAGE SQL AS 'SELECT ROW($1, $2)::pair';

CREATE OR REPLACE FUNCTION pair(text, anyelement)
RETURNS pair LANGUAGE SQL AS 'SELECT ROW($1, $2)::pair';

CREATE OR REPLACE FUNCTION pair(anyelement, anyelement)
RETURNS pair LANGUAGE SQL AS 'SELECT ROW($1, $2)::pair';

CREATE OR REPLACE FUNCTION pair(text, text)
RETURNS pair LANGUAGE SQL AS 'SELECT ROW($1, $2)::pair;';

I’ve created four variants here to allow for the most common combinations of types. So any of the following will work:

SELECT pair('foo', 'bar');
SELECT pair('foo', 1);
SELECT pair(12.3, 'foo');
SELECT pair(1, 43);

Alas, you can’t mix any other types, so this will fail:

SELECT pair(1, 12.3);

ERROR:  function pair(integer, numeric) does not exist
LINE 1: SELECT pair(1, 12.3);

We could create a whole slew of additional constructors, but since we’re using a key/value store, it’s likely that the keys will usually be text anyway. So now we can call store() like so:

SELECT store( pair('foo', 'bar'), pair('baz', 'yow') );

Better, eh? Hell, we can go all the way and create a nice binary operator to make it still more sugary. Just map each of the pair functions to the operator like so:

CREATE OPERATOR -> (
    LEFTARG   = text,
    RIGHTARG  = anyelement,
    PROCEDURE = pair
);

CREATE OPERATOR -> (
    LEFTARG   = anyelement,
    RIGHTARG  = text,
    PROCEDURE = pair
);

CREATE OPERATOR -> (
    LEFTARG   = anyelement,
    RIGHTARG  = anyelement,
    PROCEDURE = pair
);

CREATE OPERATOR -> (
    LEFTARG   = text,
    RIGHTARG  = text,
    PROCEDURE = pair
);

Looks like a lot of repetition, I know, but checkout the new syntax:

SELECT store( 'foo' -> 'bar', 'baz' -> 1 );

Cute, eh? I chose to use -> because => is deprecated as an operator in PostgreSQL 9.0: SQL 2011 reserves that operator for named parameter assignment.2

As a last twist, let’s rewrite getpairs() to return pairs instead of arrays:

CREATE OR REPLACE FUNCTION getpairs(
    variadic text[]
) RETURNS SETOF pair LANGUAGE SQL AS $$
    SELECT key -> value
      FROM kvstore
      JOIN unnest($1) AS k ON kvstore.key = k
$$;

Cute, eh? Its use is just like before, only now the output is more table-like:

SELECT * FROM getpairs('foo', 'baz');

  k  |   v   
-----+-------
 baz | yow
 foo | bar

You can also get them back as composites by omitting * FROM:

SELECT getpairs('foo', 'baz');

  getpairs   
-------------
 (foo,bar)
 (baz,yow)

Anyway, just something to consider the next time you need a function that allows any number of key/value pairs to be passed. It’s not perfect, but it’s pretty sweet.


  1. In the recent pgsql-hackers discussion that inspired this post, Pavel Stehule suggested adding something like Oracle COLLECTIONs to address this shortcoming. I don’t know how far this idea will get, but it sure would be nice to be able to pass objects with varying kinds of data, rather than be limited to data all of one type (values in an SQL array must all be of the same type).

  2. No, you won’t be able to use named parameters for this application because named parameters are inherently non-variadic. That is, you can only pre-declare so many named parameters: you can’t anticipate every parameter that’s likely to be wanted as a key in our key/value store.

  • E-mail this story to a friend!
  • Sphinn
  • StumbleUpon
  • Facebook
  • del.icio.us
  • LinkedIn
  • TwitThis
  • Digg
  • Google
  • MySpace
  • Reddit
  • StumbleUpon
  • Technorati
  • Yahoo! Buzz

PGXN Blog and Twitterstream

I crated the PGXN Blog yesterday. Tune in there for news and announcements. I’ll also be posting status reports once development gets underway, so that all you fans out there can follow my progress. Once the site is done (or at 1.0 anyway), the blog will be used for announcements, discussion of support issues, etc. So tune in!

Oh, and I created a PGXN Twitterstream, too. You should follow it! New blog posts will be tweeted, and once the site gets going, new uploads will be tweeted, too. Check it out!

  • E-mail this story to a friend!
  • Sphinn
  • StumbleUpon
  • Facebook
  • del.icio.us
  • LinkedIn
  • TwitThis
  • Digg
  • Google
  • MySpace
  • Reddit
  • StumbleUpon
  • Technorati
  • Yahoo! Buzz

Introducing MyTAP

I gave my OSCON tutorial (slides) last week. It went okay. I spent way too much time helping to get everyone set up with pgTAP, and then didn’t have time to have the attendees do the exercises, and I had to rush through 2.5 hours of material in 1.5 hours. Yikes! At least the video will be better when it’s released (more when that happens).

But as often happens, I was asked whether something like pgTAP exists for MySQL. But this time I was asked by MySQL Community Manager Giuseppe Maxia, who also said that he’d tried to create a test framework himself (a fellow Perl hacker!), but that it wasn’t as nice as pgTAP. Well, since I was at OSCON and tend to like to hack on side projects while at conferences, and since I hoped that Giuseppe will happily take it over once I’ve implemented the core, I started hacking on it myself. And today, I’m pleased to announce the release of MyTAP 0.01 (downloads).

Once you’ve downloaded it, install it against your MySQL server like so:

mysql -u root < mytap.sql

Here’s a very simple example script:

-- Start a transaction.
BEGIN;

-- Plan the tests.
SELECT tap.plan(1);

-- Run the tests.
SELECT tap.pass( 'My test passed, w00t!' );

-- Finish the tests and clean up.
CALL tap.finish();
ROLLBACK;

You can run this test from a .sql file using the mysql client like so:

mysql -u root --disable-pager --batch --raw --skip-column-names --unbuffered --database try --execute 'source test.sql'

But that’s a PITA and can only run one test at a time. Instead, put all of your tests into a directory, perhaps named tests, each with the suffix “.my”, and use my_prove (install TAP::Parser::SourceHandler::MyTAP from CPAN to get it) instead:

my_prove -u root --database try tests/

For MyTAP’s own tests, the output looks like this:

tests/eq.my ........ ok
tests/hastap.my .... ok
tests/matching.my .. ok
tests/moretap.my ... ok
tests/todotap.my ... ok
tests/utils.my ..... ok
All tests successful.
Files=6, Tests=137,  1 wallclock secs
(0.06 usr  0.03 sys +  0.01 cusr  0.02 csys =  0.12 CPU)
Result: PASS

Nice, eh? Of course there are quite a few more assertion functions. See the complete documentation for details.

Now, I did my best to keep the interface the same as pgTAP, but there are a few differences:

  • MySQL temporary tables are teh suck, so I had to use permanent tables to track test state. To make this more feasible, MyTAP is always installed in its own database, (named “tap” by default), and you must always schema-qualify your use of the MyTAP functions.
  • Another side-effect of permanent tables is that MyTAP must keep track of test outcomes without colliding with the state from tests running in multiple concurrent connections. So MyTAP uses connection_id() to keep track of state for a single test run. It also deletes the state when tests finish(), but if there’s a crash before then, data can be left in those tables. If the connection ID is ever re-used, this can lead to conflicts. This seems mostly avoidable by using InnoDB tables and transactions in the tests.
  • The word “is” is strictly reserved by MySQL, so the function that corresponds to pgTAP’s is() is eq() in MyTAP. Similarly, isnt() is called not_eq() in MyTAP.
  • There is no way to throw an exception in MySQL functions an procedures, so the code cheats by instead performing an illegal operation: selecting from a non-existent column, where the name of that column is the error message. Hinky, but should get the point across.

Other than these issues, things went fairly smoothly. I finished up the 0.01 version last night and released it today with most of the core functionality in place. And now I want to find others to take over, as I am not a MySQL hacker myself and thus unlikely ever to use it. If you’re interested, my recommendations for things to do next are:

So fork on GitHub or contact me if you’d like to be added as a collaborator (I’m looking at you, Giuseppe!).

Hope you find it useful.

  • E-mail this story to a friend!
  • Sphinn
  • StumbleUpon
  • Facebook
  • del.icio.us
  • LinkedIn
  • TwitThis
  • Digg
  • Google
  • MySpace
  • Reddit
  • StumbleUpon
  • Technorati
  • Yahoo! Buzz

PGXN Development Project

I’m pleased to announce the launch of the PGXN development project. I’ve written a detailed specification and pushed it through general approval on pgsql-hackers. I’ve written up a detailed project plan and estimated things at a highly reduced PostgreSQL Experts rate to come up with a fundraising goal: $25,000. And now, thanks to founding contributions from myYearbook.com, and PostgreSQL Experts, we have started the fundraising phase of the project.

So what’s this all about? PGXN, the PostgreSQL Extension Network, is modeled on CPAN, the Perl community’s archive of “all things Perl.” PGXN will provide four major pieces of infrastructure to the PostgreSQL community:

I’ve been wanting to start this project for a long time, but given my need to pay the bills, it didn’t seem like I’d ever be able to find the time for it. Then Josh Berkus suggested that we try to get community interest and raise money for me to have the time to work on it. So I jumped on that, putting in the hours needed to get general approval from the core PostgreSQL developers and to create a reasonable project plan and web site. And thanks to MyYearook’s and PGX’s backing, I’m really excited about it. I hope to start on it in August.

If you’d like to contribute, first: Thank You!. The PGXN site has a Google Checkout widget that makes it easy to make a donation. If you’d rather pay by some other means (checks are great for us!), drop me a line and we’ll work something out. We have a few levels of contribution as well, including permanent linkage on the PGXN site for your organization, as well as the usual t-shirts launch party invitations.

  • E-mail this story to a friend!
  • Sphinn
  • StumbleUpon
  • Facebook
  • del.icio.us
  • LinkedIn
  • TwitThis
  • Digg
  • Google
  • MySpace
  • Reddit
  • StumbleUpon
  • Technorati
  • Yahoo! Buzz

PGAN Bikeshedding

I’ve put together a description of PGAN, the PostgreSQL extension distribution system I plan to develop later this year based on the Comprehensive Archive Perl Network or CPAN. Its primary features will be:

  • Extension distribution
  • Search site with extension documentation
  • Client for downloading, building, testing, and installing extensions.

I’ve never been thrilled with the name, though, so I’m asking for suggestions for a better one. I’ve used the term "extension" here because it seems to be the term that the PostgreSQL community has settled on, but other terms might work, since things other than extensions might be distributed.

What I’ve come up with so far is:

Name Long Name Pronounciation Advantages Disadvantages
PGAN PostgreSQL Add-on Network pee-gan Short, similar to CPAN Ugly
PGEX PostgreSQL Extensions pee-gee-ex or pee-gex Short, easier to pronounce Too similar to PGX)
PGCAN PostgreSQL Comprehensive Archive Network pee-gee-can Similar to CPAN Similar to CPAN
PGDAN PostgreSQL Distribution Archive Network pee-gee-dan Short, easy to pronounce Who’s “Dan”? Doesn’t distribute PostgreSQL itself.
PGEDAN PostgreSQL Extension Distribution Archive Network pee-gee-ee-dan References extensions Long, sounds stupid

Of these, I think I like “PGEX” best, but none are really great. So I’m opening up the bike shed to all. What’s a better name? Or if you can’t think of one, which of the above do you like best? Just leave a comment on this post. The only requirements for suggestions are that a .org domain be available and that it suck less than the alternatives.

Comments close in 2 weeks. Thanks!

  • E-mail this story to a friend!
  • Sphinn
  • StumbleUpon
  • Facebook
  • del.icio.us
  • LinkedIn
  • TwitThis
  • Digg
  • Google
  • MySpace
  • Reddit
  • StumbleUpon
  • Technorati
  • Yahoo! Buzz

Execute SQL Code on Connect

I’ve been writing a fair bit of PL/Perl for a client, and one of the things I’ve been doing is eliminating a ton of duplicate code by creating utility functions in the %_SHARED hash. This is great, as long as the code that creates those functions gets executed at the beginning of every database connection. So I put the utility generation code into a single function, called prepare_perl_utils(). It looks something like this:

CREATE OR REPLACE FUNCTION prepare_perl_utils(
) RETURNS bool LANGUAGE plperl IMMUTABLE AS $$
    # Don't bother if we've already loaded.
    return 1 if $_SHARED{escape_literal};

    $_SHARED{escape_literal} = sub {
        $_[0] =~ s/'/''/g; $_[0] =~ s/\\/\\\\/g; $_[0];
    };

    # Create other code refs in %_SHARED…
$$;

So now all I have to do is make sure that all the client’s apps execute this function as soon as they connect, so that the utilities will all be loaded up and ready to go. Here’s how I did it.

First, for the Perl app, I just took advantage of the DBI’s callbacks to execute the SQL I need when the DBI connects to the database. That link might not work just yet, as the DBI’s callbacks have only just been documented and that documentation appears only in dev releases so far. Once 1.611 drops, the link should work. At any rate, the use of callbacks I’m exploiting here has been in the DBI since 1.49, which was released in November 2005.

The approach is the same as I’ve described before: Just specify the Callbacks parameter to DBI->connect, like so:

my $dbh = DBI->connect_cached($dsn, $user, $pass, {
    PrintError     => 0,
    RaiseError     => 1,
    AutoCommit     => 1,
    Callbacks      => {
        connected => sub { shift->do('SELECT prepare_perl_utils()' },
    },
});

That’s it. The connected method is a no-op in the DBI that gets called to alert subclasses that they can do any post-connection initialization. Even without a subclass, we can take advantage of it to do our own initialization.

It was a bit trickier to make the same thing happen for the client’s Rails app. Rails, alas, provides no on-connection callbacks. So we instead have to monkey-patch Rails to do what we want. With some help from “dfr|mac” on #rubyonrails (I haven’t touched Rails in 3 years!), I got it worked down to this:

class ActiveRecord::ConnectionAdapters::PostgreSQLAdapter
  def initialize_with_perl_utils(*args)
    returning(initialize_without_perl_utils(*args)) do
      execute('SELECT prepare_perl_utils()')
    end
  end
  alias_method_chain :initialize, :perl_utils
end

Basically, we overpower the PostgreSQL adapter’s initialize method and have it call initialize_with_perl_utils before it returns. It’s a neat trick; if you’re going to practice fuck typing, alias_method_chain makes it about as clean as can be, albeit a little too magical for my tastes.

Anyway, recorded here for posterity (my blog is my other brain!).

  • E-mail this story to a friend!
  • Sphinn
  • StumbleUpon
  • Facebook
  • del.icio.us
  • LinkedIn
  • TwitThis
  • Digg
  • Google
  • MySpace
  • Reddit
  • StumbleUpon
  • Technorati
  • Yahoo! Buzz

No more USE_PGXS=1?

I’ve become very tired of having to set USE_PGXS=1 every time I build pgTAP outside the contrib directory of a PostgreSQL distribution:

make USE_PGXS=1
make USE_PGXS=1 install
make USE_PGXS=1 installcheck

I am forever forgetting to set it, and it’s just not how one normally expects a build incantation to work. It was required because that’s how the core contrib extensions work: They all have this code in their Makefiles, which those of us who develop third-party modules have borrowed:

ifdef USE_PGXS
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) —pgxs)
include $(PGXS)
else
subdir = contrib/citext
top_builddir = ../..
include $(top_builddir)/src/Makefile.global
include $(top_srcdir)/contrib/contrib-global.mk
endif

They generally expect ../../src/Makefile.global to exist, and if it doesn’t, you have to tell it so. I find this annoying, because third-party extensions are almost never built from the contrib directory, so one must always remember to specify USE_PGXS=1.

I’d like to propose, instead, that those of us who maintain third-party extensions like pgTAP, PL/Parrot, and Temporal PostgreSQL not force our users to have to remember this special variable by instead checking to see if it’s needed ourselves. As such, I’ve just added this code to pgTAP’s Makefile:

ifdef USE_PGXS
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) —pgxs)
else
ifeq (exists, $(shell [ -e ../../src/bin/pg_config/pg_config ] && echo exists) ) 
top_builddir = ../..
PG_CONFIG := $(top_builddir)/src/bin/pg_config/pg_config
else
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) —pgxs)
endif
endif

So it still respects USE_PGXS=1, but if it’s not set, it looks to see if it can find pg_config where it would expect it to be if built from the contrib directory. If it’s not there, it simply uses pg_config as if USE_PGXS=1 was set. This makes building from the contrib directory or from anywhere else the same process:

make
make install
make installcheck

Much better, much easier to remember.

Is there any reason why third-party PostgreSQL extensions should not adopt this pattern? I don’t think it makes sense for contrib extensions in core to do it, but for those that will never be in core, I think it makes a lot of sense.

Comments?

  • E-mail this story to a friend!
  • Sphinn
  • StumbleUpon
  • Facebook
  • del.icio.us
  • LinkedIn
  • TwitThis
  • Digg
  • Google
  • MySpace
  • Reddit
  • StumbleUpon
  • Technorati
  • Yahoo! Buzz

Powered by KinoSearch