Just a Theory

By David E. Wheeler

Indexing Nested hstore

In my first Nested hstore post yesterday, I ran a query against unindexed hstore data, which required a table scan. But hstore is able to take advantage of GIN indexes. So let’s see what that looks like. Connecting to the same database, I indexed the review column:

reviews=# CREATE INDEX idx_reviews_gin ON reviews USING GIN(review);
CREATE INDEX
Time: 360448.426 ms
reviews=# SELECT pg_size_pretty(pg_database_size(current_database()));
 pg_size_pretty 
----------------
 421 MB

Well, that takes a while, and makes the database a lot bigger (it was 277 MB unindexed). But is it worth it? Let’s find out. Oleg and Teodor’s patch adds support for a nested hstore value on the right-hand side of the @> operator. In practice, that means we can specify the full path to a nested value as an hstore expression. In our case, to query only for Books, instead of using this expression:

WHERE review #> '{product,group}' = 'Book'

We can use an hstore value with the entire path, including the value:

WHERE review @> '{product => {group => Book}}'

Awesome, right? Let’s give it a try:

reviews=# SELECT
    width_bucket(length(review #> '{product,title}'), 1, 50, 5) title_length_bucket,
    round(avg(review #^> '{review,rating}'), 2) AS review_average,
    count(*)
FROM
    reviews
WHERE
    review @> '{product => {group => Book}}'
GROUP BY
    title_length_bucket
ORDER BY
    title_length_bucket;
 title_length_bucket | review_average | count  
---------------------+----------------+--------
                   1 |           4.42 |  56299
                   2 |           4.33 | 170774
                   3 |           4.45 | 104778
                   4 |           4.41 |  69719
                   5 |           4.36 |  47110
                   6 |           4.43 |  43070
(6 rows)

Time: 849.681 ms

That time looks better than yesterday’s, but in truth I first ran this query just before building the GIN index and got about the same result. Must be that Mavericks is finished indexing my disk or something. At any rate, the index is not buying us much here.

But hey, we’re dealing with 1998 Amazon reviews, so querying against books probably isn’t very selective. I don’t blame the planner for deciding that a table scan is cheaper than an index scan. But what if we try a more selective value, say “DVD”?

reviews=# SELECT
    width_bucket(length(review #> '{product,title}'), 1, 50, 5) title_length_bucket,
    round(avg(review #^> '{review,rating}'), 2) AS review_average,
    count(*)
FROM
    reviews
WHERE
    review @> '{product => {group => DVD}}'
GROUP BY
    title_length_bucket
ORDER BY
    title_length_bucket;
 title_length_bucket | review_average | count 
---------------------+----------------+-------
                   1 |           4.27 |  2646
                   2 |           4.44 |  4180
                   3 |           4.53 |  1996
                   4 |           4.38 |  2294
                   5 |           4.48 |   943
                   6 |           4.42 |   738
(6 rows)

Time: 73.913 ms

Wow! Under 100ms. That’s more like it! Inverted indexing FTW!
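
To confirm that the GIN index really is behind that speedup, we can ask the planner directly. A quick sketch, assuming the same table and index as above:

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
  FROM reviews
 WHERE review @> '{product => {group => DVD}}';

A Bitmap Index Scan on idx_reviews_gin in the plan means the index is doing the work; a plain sequential scan means the planner still thinks a full table scan is cheaper.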


Testing Nested hstore

I’ve been helping Oleg Bartunov and Teodor Sigaev with documentation for the forthcoming nested hstore patch for PostgreSQL. It adds support for arrays, numeric and boolean types, and of course arbitrarily nested data structures. This gives it feature parity with JSON, but unlike the JSON type, its values are stored in a binary representation, which makes it much more efficient to query. The support for GiST and GIN indexes to speed up path searches doesn’t hurt, either.

As part of the documentation, we wanted to include a short tutorial, something to show off the schemaless flexibility of the new hstore. The CitusDB guys were kind enough to show off their json_fdw with some Amazon review data in a blog post a few months back; it even includes an interesting query against the data. Let’s see what we can do with it. First, load it:

> createdb reviews
> psql -d reviews -c '
    CREATE EXTENSION HSTORE;
    CREATE TABLE reviews(review hstore);
'
CREATE TABLE
> gzcat customer_reviews_nested_1998.json.gz | sed -e 's/\\/\\\\/g' \
 | sed -e "s/'/''/g" | sed -e 's/":/" =>/g' > /tmp/hstore.copy
> time psql -d reviews -c "COPY reviews FROM '/tmp/hstore.copy'"
COPY 589859
       0.00s user 0.00s system 0% cpu 13.059 total

13 seconds to load 589,859 records from a file – a little over 45k records per second. Not bad. Let’s see what the storage looks like:

> psql -d reviews -c 'SELECT pg_size_pretty(pg_database_size(current_database()));'
 pg_size_pretty 
----------------
 277 MB

The original, uncompressed data is 208 MB on disk, so the database is roughly a third bigger, thanks to its overhead. Just for fun, let’s compare it to JSON:

> createdb reviews_js
> psql -d reviews_js -c 'CREATE TABLE reviews(review json);'
CREATE TABLE
> gzcat customer_reviews_nested_1998.json.gz | sed -e 's/\\/\\\\/g' \
 | sed -e "s/'/''/g" > /tmp/json.copy
> time psql -d reviews_js -c "COPY reviews FROM '/tmp/json.copy'"
COPY 589859
       0.00s user 0.00s system 0% cpu 7.434 total
> psql -d reviews_js -c 'SELECT pg_size_pretty(pg_database_size(current_database()));'
 pg_size_pretty 
----------------
 239 MB

Almost 80K records per second. Faster, I’m guessing, because the JSON type doesn’t convert the data to a binary representation on its way in. JSON currently uses less storage overhead, as well; I wonder if that’s the benefit of TOAST storage?
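
One way to poke at that theory would be to compare each table’s main heap size to its total size, which includes TOAST and indexes. A sketch, assuming the table is named reviews in both databases:

-- pg_relation_size covers only the main heap; pg_total_relation_size
-- adds TOAST and indexes, so a large gap between the two points at TOAST.
SELECT pg_size_pretty(pg_relation_size('reviews'))       AS heap_size,
       pg_size_pretty(pg_total_relation_size('reviews')) AS total_size;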

Let’s try querying these guys. I adapted the query from the CitusDB blog post and ran it on my 2013 MacBook Air (1.7 GHz Intel Core i7) with iTunes and a bunch of other apps running in the background (yeah, I’m lazy). Check out those operators, by the way! Given a path, #^> returns a numeric value:

reviews=# SELECT
    width_bucket(length(review #> '{product,title}'), 1, 50, 5) title_length_bucket,
    round(avg(review #^> '{review,rating}'), 2) AS review_average,
    count(*)
FROM
    reviews
WHERE
    review #> '{product,group}' = 'Book'
GROUP BY
    title_length_bucket
ORDER BY
    title_length_bucket;
 title_length_bucket | review_average | count  
---------------------+----------------+--------
                   1 |           4.42 |  56299
                   2 |           4.33 | 170774
                   3 |           4.45 | 104778
                   4 |           4.41 |  69719
                   5 |           4.36 |  47110
                   6 |           4.43 |  43070
(6 rows)

Time: 2301.620 ms

The benefit of the native type is pretty apparent here. I ran this query several times, and the time was always between 2.3 and 2.4 seconds. The Citus json_fdw query took “about 6 seconds on a 3.1 GHz CPU core.” Let’s see how well the JSON type does (pity there is no operator to fetch a value as numeric; we have to cast from text):

reviews_js=# SELECT
    width_bucket(length(review #>> '{product,title}'), 1, 50, 5) title_length_bucket,
    round(avg((review #>> '{review,rating}')::numeric), 2) AS review_average,
    count(*)
FROM
    reviews
WHERE
    review #>> '{product,group}' = 'Book'
GROUP BY
    title_length_bucket
ORDER BY
    title_length_bucket;
 title_length_bucket | review_average | count  
---------------------+----------------+--------
                   1 |           4.42 |  56299
                   2 |           4.33 | 170774
                   3 |           4.45 | 104778
                   4 |           4.41 |  69719
                   5 |           4.36 |  47110
                   6 |           4.43 |  43070
(6 rows)

Time: 5530.120 ms

A little faster than the json_fdw version, but comparable. It takes well over twice as long as the hstore version, though. For queries, hstore is the clear winner. Yes, you pay up-front for loading and storage, but the payoff at query time is substantial. Ideally, of course, we would have the insert and storage benefits of JSON and the query performance of hstore. There was talk last spring at PGCon of using the same representation for JSON and hstore; perhaps that can still come about.

Meanwhile, I expect to play with some other data sets over the next week; watch this spot for more!


The Women at XOXO Rocked It

XOXO was fantastic again this year. Since it wrapped up a week ago, there have been a number of fantastic posts from the likes of Frank Chimero, Maciej Ceglowski, and Kelly Kend; you should go read them. I was particularly moved by the unabashed vulnerability and honesty of the speakers, and their willingness to recruit the audience to confront the debilitating effects of impostor syndrome. I expect to write more about the festival in the future. Or maybe I’ll just create something awesome in the coming year, instead.

One thing struck me after the first day of talks that I’ve not noticed addressed elsewhere. While women were under-represented in the audience (though far better than at most tech conferences), they were pretty well-represented among the speakers. And the women speakers — oh man. They. Kicked. Ass.

Erika Moen openly shared her quest for sexual identity via comics. Vi Hart hilariously described how to make money on YouTube — and Andy told her she could keep going for as long as she wanted. Molly Crabapple raised important yet seldom discussed issues around the independence of artists and the availability of capital on the internet. Julie Uhrman humbly shared all the ways in which Ouya has failed, and in doing so is making it better. And Christina Xu discussed BreadPig’s objective to enable artists to make a living online without exploitation.

These amazing people weren’t at XOXO because they’re women; they weren’t there to represent women as a separate entity, or to talk about the under-representation of women in tech. No, they were there because they’ve created incredible things and wanted to share. Their energy was palpable. They just came out on stage and totally fucking rocked it.

This is how it ought be. You make something. You’re excited about it. Your energy infects the audience. And your gender and ethnicity have nothing to do with it. It’s moving simply to be amazing.

Alas, XOXO is the outlier here. But it points to the future, and I’m excited to help push it forward.

This post originally appeared on Svbtle.

The Power of Enums

Jim Mlodgenski on using Enums in place of references to small lookup tables:

I saw something else I didn’t expect: […] There was an 8% increase in performance. I was expecting the test with the enums to be close to the baseline, but I wasn’t expecting it to be faster. Thinking about it, it makes sense. Enum values are just numbers, so we’re effectively using surrogate keys under the covers, but the users would still see the enum labels when they are looking at the data. It ended up being a no-brainer to use enums for these static tables. There was an increase in performance while still maintaining the integrity of the data.

I’ve been a big fan of Enums since Andrew and Tom Dunstan released a patch for them during the PostgreSQL 8.2 era. Today they’re a core feature, and as of 9.1, you can even modify their values! You’re missing out if you’re not using them yet.
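
If you haven’t tried them, here’s a minimal sketch; the type and table are hypothetical, not from Jim’s benchmark:

-- An enum standing in for a small lookup table.
CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'delivered');

CREATE TABLE orders (
    id     SERIAL PRIMARY KEY,
    status order_status NOT NULL DEFAULT 'pending'
);

-- Since 9.1, new values can be added without rebuilding the type.
ALTER TYPE order_status ADD VALUE 'cancelled';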


Whither Impala Fault Tolerance?

Justin Erickson on the Cloudera Blog:

In December 2012, while Cloudera Impala was still in its beta phase, we provided a roadmap for planned functionality in the production release. In the same spirit of keeping Impala users, customers, and enthusiasts well informed, this post provides an updated roadmap for upcoming releases later this year and in early 2014.

Impala is a pretty nice-looking SQLish query engine that runs on Hadoop. It provides the same basic interface as Hive, but circumvents MapReduce to access data directly from data nodes in parallel. But, in my opinion, to be useful as a real-time query engine, it needs fault tolerance. From the Impala FAQ, under the list of unsupported features:

Fault tolerance for running queries (not currently). In the current release, Impala aborts the query if any host on which the query is executing fails. In the future, we will consider adding fault tolerance to Impala, so that a running query would complete even in the presence of host failures.

Sounds like an unfortunate issue. Since it’s pretty typical for data nodes to go down, this seems like an essential feature. Products like CitusDB offer fault tolerance:

Does CitusDB recover from failures?

Yes. The CitusDB master node intelligently re-routes the work on any failed nodes to the remaining nodes in real-time. Since the underlying data are kept in fixed-size blocks in HDFS, a failed node’s work can evenly be distributed among the remaining nodes in the cluster.

That sounds exactly right. I’m excited about Citus, and if it adds solid support for more data formats, such as ORC and Parquet, it may well be the way to go. But Impala will be a nice alternative if it can get fault tolerance figured out. I’m disappointed it’s not on the road map.

Lexical Subroutines

Ricardo Signes:

One of the big new experimental features in Perl 5.18.0 is lexical subroutines. In other words, you can write this:

my sub quickly { ... }
my @sorted = sort quickly @list;

my sub greppy (&@) { ... }
my @grepped = greppy { ... } @input;

These two examples show cases where lexical references to anonymous subroutines would not have worked. The first argument to sort must be a block or a subroutine name, which leads to awful code like this:

sort { $subref->($a, $b) } @list

With our greppy, above, we get to benefit from the parser-affecting behaviors of subroutine prototypes.

My favorite tidbit about this feature? Because lexical subs are lexical, and method-dispatch is package-based, lexical subs are not subject to method lookup and dispatch! This just might alleviate the confusion of methods and subs, as chromatic complained about just yesterday. Probably doesn’t solve the problem for imported subs, though.


Multirow Database Updates

William Blunn:

So, given a list of updates to apply we could effect them using the following steps:

  1. Use CREATE TEMPORARY TABLE to create a temporary table to hold the updates
  2. Use INSERT to populate the temporary table with the updates
  3. Use UPDATE … FROM to update the target table using updates in the temporary table
  4. Use DROP TABLE to drop the temporary table

So in the example above we can reduce five statements to four. This isn’t a significant improvement in this case. But now the number of statements is no longer directly dependent on the number of rows requiring updates.

Even if we wanted to update a thousand rows with different values, we could still do it with four statements.

Or you could just use one statement. Here’s how to do it with a CTE on PostgreSQL 9.2 and higher:

WITH up(name, salary) AS ( VALUES
     ('Jane',  1200),
     ('Frank', 1100),
     ('Susan', 1175),
     ('John',  1150)
)
UPDATE staff
   SET salary = up.salary
  FROM up
 WHERE staff.name = up.name;

Still on PostgreSQL 9.1 or lower? Use a subselect in the FROM clause instead:

UPDATE staff
   SET salary = up.salary
   FROM (VALUES
       ('Jane',  1200),
       ('Frank', 1100),
       ('Susan', 1175),
       ('John',  1150)
   ) AS up(name, salary)
 WHERE staff.name = up.name;

Stuck with MySQL or Oracle? Use a UNION query as a derived table:

UPDATE staff, (
         SELECT 'Jane' AS name, 1200 AS salary
   UNION SELECT 'Frank',        1100
   UNION SELECT 'Susan',        1175
   UNION SELECT 'John',         1150
) AS up
   SET staff.salary = up.salary
 WHERE staff.name = up.name;

Using SQLite? Might make sense to use a temporary table for thousands or millions of rows. But for just a few, use a CASE expression:

UPDATE staff
   SET salary = CASE name
       WHEN 'Jane'  THEN 1200
       WHEN 'Frank' THEN 1100
       WHEN 'Susan' THEN 1175
       WHEN 'John'  THEN 1150
   END
 WHERE name in ('Jane', 'Frank', 'Susan', 'John');
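
To try any of these statements yourself, you need a staff table. Here’s a minimal, hypothetical setup (Postgres-flavored; the examples above assume only the name and salary columns):

CREATE TABLE staff (
    name   TEXT PRIMARY KEY,
    salary INTEGER NOT NULL
);

-- Seed a few rows to update.
INSERT INTO staff (name, salary) VALUES
    ('Jane', 1000), ('Frank', 1000), ('Susan', 1000), ('John', 1000);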

If you need to support multiple database architectures, sure, use something like DBIx::MultiRow to encapsulate things. But if, like most of us, you’re on one database for an app, I can’t recommend strongly enough how well it pays to get to know your database well.


Sqitch Templating

Last week saw the v0.980 release of Sqitch, a database change management system. The headline feature in this version is support for MySQL 5.6.4 or higher. Why 5.6.4 rather than 5.1 or even 5.5? Mainly because 5.6.4 finally added support for fractional seconds in DATETIME columns (details in the release notes). This feature is essential for Sqitch, because changes often execute within a second of each other, and the deploy time is included in the log table’s primary key.

With the requirement for fractional seconds satisfied by 5.6.4, there was nothing to prevent usage of SIGNAL, added in 5.5, to mimic check constraints in a trigger. This brings the Sqitch MySQL implementation into line with what was already possible in the Postgres, SQLite, and Oracle support. Check out the tutorial and the accompanying Git repository to get started managing your MySQL databases with Sqitch.
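
If you haven’t seen the SIGNAL pattern before, it looks roughly like this. This is a generic sketch, not the actual Sqitch schema; the widgets table and the rule it enforces are made up:

DELIMITER //

CREATE TRIGGER widgets_quantity_check BEFORE INSERT ON widgets
FOR EACH ROW
BEGIN
    -- Mimic CHECK (quantity >= 0) by raising an error for bad rows.
    IF NEW.quantity < 0 THEN
        SIGNAL SQLSTATE '45000'
            SET MESSAGE_TEXT = 'quantity must be non-negative';
    END IF;
END//

DELIMITER ;

The trigger rejects any row that violates the rule, which is about as close to a CHECK constraint as MySQL gets.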

The MySQL support might be the headliner, but the change in v0.980 I’m most excited about is improved template support. Sqitch executes templates to create the default deploy, revert, and verify scripts, but up to now they have not been easy to customize. With v0.980, you can create as many custom templates as you like, and use them as appropriate.

A Custom Template

Let’s create a custom template for creating a table. The first step is to create the template files. Custom templates can live in `sqitch --etc-path`/templates or in ~/.sqitch/templates. Let’s use the latter. Each template goes into a directory for the type of script, so we’ll create them:

mkdir -p ~/.sqitch/templates/deploy
mkdir -p ~/.sqitch/templates/revert 
mkdir -p ~/.sqitch/templates/verify

Copy the default templates for your preferred database engine; here I copy the Postgres templates:

tmpldir=`sqitch --etc-path`/templates
cp $tmpldir/deploy/pg.tmpl ~/.sqitch/templates/deploy/createtable.tmpl
cp $tmpldir/revert/pg.tmpl ~/.sqitch/templates/revert/createtable.tmpl
cp $tmpldir/verify/pg.tmpl ~/.sqitch/templates/verify/createtable.tmpl
chmod -R +w ~/.sqitch/templates

Here’s what the default deploy template looks like:

-- Deploy [% change %]
[% FOREACH item IN requires -%]
-- requires: [% item %]
[% END -%]
[% FOREACH item IN conflicts -%]
-- conflicts: [% item %]
[% END -%]

BEGIN;

-- XXX Add DDLs here.

COMMIT;

The code before the BEGIN names the change and lists dependencies, which is reasonably useful, so we’ll leave it as-is. We’ll focus on replacing that comment, -- XXX Add DDLs here., with the template for a CREATE TABLE statement. Start simple: just use the change name for the table name. In ~/.sqitch/templates/deploy/createtable.tmpl, replace the comment with these lines:

CREATE TABLE [% change %] (
    -- Add columns here.
);

In the revert template, ~/.sqitch/templates/revert/createtable.tmpl, replace the comment with a DROP TABLE statement:

DROP TABLE [% change %];

And finally, in the verify template, ~/.sqitch/templates/verify/createtable.tmpl, replace the comment with a simple SELECT statement, which is just enough to verify the creation of a table:

SELECT * FROM [% change %];

Great, we’ve created a set of simple customized templates for adding a CREATE TABLE change to a Sqitch project. To use them, just pass the --template option to sqitch add, like so:

> sqitch add widgets --template createtable -n 'Add widgets table.'
Created deploy/widgets.sql
Created revert/widgets.sql
Created verify/widgets.sql
Added "widgets" to sqitch.plan

Now have a look at deploy/widgets.sql:

-- Deploy widgets

BEGIN;

CREATE TABLE widgets (
    -- Add columns here.
);

COMMIT;

Cool! The revert template should also have done its job. Here’s revert/widgets.sql:

-- Revert widgets

BEGIN;

DROP TABLE widgets;

COMMIT;

And the verify script, verify/widgets.sql:

-- Verify widgets

BEGIN;

SELECT * FROM widgets;

ROLLBACK;

Custom Table Name

What if you want to name the change one thing and the table it creates something else? What if you want to schema-qualify the table? Easy! Sqitch’s dead simple default templating language, Template::Tiny, features if statements. Try using them with custom variables for the schema and table names:

SET search_path TO [% IF schema %][% schema %],[% END %]public;

CREATE TABLE [% IF table %][% table %][% ELSE %][% change %][% END %] (
    -- Add columns here.
);

If the schema variable is set, the search_path, which determines where objects will go, gets set to $schema,public. If schema is not set, the path is simply public, which is the default schema in Postgres.

We take a similar tack with the CREATE TABLE statement: If the table variable is set, it’s used as the name of the table. Otherwise, we use the change name, as before.

The revert script needs the same treatment:

SET search_path TO [% IF schema %][% schema %],[% END %]public;
DROP TABLE [% IF table %][% table %][% ELSE %][% change %][% END %];

As does the verify script:

SET search_path TO [% IF schema %][% schema %],[% END %]public;
SELECT * FROM [% IF table %][% table %][% ELSE %][% change %][% END %];

Take it for a spin:

> sqitch add corp_widgets --template createtable \
  --set schema=corp --set table=widgets \
  -n 'Add corp.widgets table.'
Created deploy/corp_widgets.sql
Created revert/corp_widgets.sql
Created verify/corp_widgets.sql
Added "corp_widgets" to sqitch.plan

The resulting deploy script will create corp.widgets:

-- Deploy corp_widgets

BEGIN;

SET search_path TO corp,public;

CREATE TABLE widgets (
    -- Add columns here.
);

COMMIT;

Cool, right? The revert and verify scripts of course yield similar results. Omitting the --set option, the template falls back on the change name:

-- Deploy widgets

BEGIN;

SET search_path TO public;

CREATE TABLE widgets (
    -- Add columns here.
);

COMMIT;

Add Columns

Template variables may contain array values. The default templates take advantage of this feature to list dependencies in SQL comments. It works great for custom variables, too. For the purposes of our CREATE TABLE template, let’s add columns. Replace the -- Add columns here comment in the deploy template with these three lines:

[% FOREACH col IN column -%]
    [% col %] TEXT NOT NULL,
[% END -%]

We can similarly improve the verify script: change its SELECT statement to:

SELECT [% FOREACH col IN column %][% col %], [% END %]
  FROM [% IF table %][% table %][% ELSE %][% change %][% END %];

Just pass multiple --set (or -s) options to sqitch add to add as many columns as you like:

> sqitch add corp_widgets --template createtable \
  -s schema=corp -s table=widgets \
  -s column=id -s column=name -s column=quantity \
  -n 'Add corp.widgets table.'

Behold the resulting deploy script!

-- Deploy corp_widgets

BEGIN;

SET search_path TO corp,public;

CREATE TABLE widgets (
    id TEXT NOT NULL,
    name TEXT NOT NULL,
    quantity TEXT NOT NULL,
);

COMMIT;

You still have to edit the resulting file, of course. Maybe NULLs should be allowed in the name column. And I suspect that quantity ought be an integer. There’s that pesky trailing comma to remove, too. The verify script suffers the same deficiency:

-- Verify corp_widgets

BEGIN;

SET search_path TO corp,public;
SELECT id, name, quantity, 
  FROM widgets;

ROLLBACK;

Still, these templates remove much of the grunt work of adding CREATE TABLE changes, giving you the scaffolding on which to build the objects you need.

Upgraded Templates

We call Sqitch’s templating language “default” because it can be replaced with a more capable one. Simply install Template Toolkit to transparently upgrade your Sqitch templates. Template Toolkit’s comprehensive feature set covers just about any functionality you could want out of a templating system. It’s big and complex, but relatively straightforward to install: just run cpan Template, cpanm Template, yum install perl-Template-Toolkit, or the like and you’ll be in business.

We can resolve the trailing comma issue thanks to Template Toolkit’s loop variable, which is implicitly created in the FOREACH loop. Simply replace the comma in the template with the expression [% loop.last ? '' : ',' %]:

[% FOREACH col IN column -%]
    [% col %] TEXT NOT NULL[% loop.last ? '' : ',' %]
[% END -%]

Now the comma will be omitted for the last iteration of the loop. The fix for the verify script is even simpler: use the join() VMethod instead of a FOREACH loop to emit all the columns in a single expression:

SELECT [% column.join(', ') %]
  FROM [% IF table %][% table %][% ELSE %][% change %][% END %];

Really simplifies things, doesn’t it?

Better still, going back to the deploy template, we can add data types for each column. Try this on for size:

[% FOREACH col IN column -%]
    [% col %] [% type.item( loop.index ) or 'TEXT' %] NOT NULL[% loop.last ? '' : ',' %]
[% END -%]
);

As we iterate over the list of columns, simply pass loop.index to the item() VMethod on the type variable to get the corresponding type. Then specify a type for each column when you create the change:

> sqitch add corp_widgets --template createtable \
  -s schema=corp -s table=widgets \
  -s column=id -s type=SERIAL \
  -s column=name -s type=TEXT \
  -s column=quantity -s type=INTEGER \
  -n 'Add corp.widgets table.'

This yields a much more comprehensive deploy script:

-- Deploy corp_widgets

BEGIN;

SET search_path TO corp,public;

CREATE TABLE widgets (
    id SERIAL NOT NULL,
    name TEXT NOT NULL,
    quantity INTEGER NOT NULL
);

COMMIT;

Go Crazy

The basics for creating task-specific change templates are baked into Sqitch, and a transparent upgrade to advanced templating is a simple install away. I can imagine lots of uses for task-specific changes, including:

  • Adding schemas, users, procedures, and views
  • Modifying tables to add columns, constraints and indexes
  • Inserting or Updating data

Maybe folks will even start sharing templates! You should subscribe to the mailing list to find out. See you there?


Understanding Window Functions

Dimitri Fontaine:

There was SQL before window functions and SQL after window functions: that’s how powerful this tool is. Being that of a deal breaker unfortunately means that it can be quite hard to grasp the feature. This article aims at making it crystal clear so that you can begin using it today and are able to reason about it and recognize cases where you want to be using window functions.

Great intro to a powerful feature.
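
If you haven’t used one before, here’s the flavor of the thing. A tiny sketch in the style of the PostgreSQL docs, assuming a hypothetical empsalary table:

-- Rank employees by salary within each department,
-- without collapsing the rows the way GROUP BY would.
SELECT depname,
       empno,
       salary,
       rank() OVER (PARTITION BY depname ORDER BY salary DESC) AS dept_rank
  FROM empsalary;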


Data Deployment with Sqitch

I’ve been thinking about data migrations. I love how well Sqitch works for schema changes, but so far have avoided data changes. Some data ought to be managed by the deployment process, rather than by end-user applications. Lists of countries, for example. Yet none of our Sqitch-managed databases include INSERTs, UPDATEs, or DELETEs in deploy scripts. Why not? Two reasons:

  1. These are mainly Postgres ports of existing Oracle databases. As such, I’ve written independent migration scripts that use oracle_fdw to copy data from Oracle. It made no sense to commit hard-coded changes to the deploy script, even for static data, as it was all to be copied from the old production Oracle databases — often months after I wrote the migrations.

  2. These projects include extensive pgTAP unit tests that expect to run many times against an empty database with no side effects. Having different data in testing than in production increases the likelihood of unforeseen behavioral variations. Better to expect no data in tests, freeing them to focus on units of behavior without regard to preexisting data.

Now that we have multiple Sqitch-deployed databases in production, the time has come to address these issues.

Deploy Hooks for External Sources

I propose to resolve the one-time migration requirement with deploy hooks. The idea is similar to Git hooks: Before or after any sqitch deploy, one or more hook scripts can run. The impetus was to ensure some higher level of consistency after every deploy. For example, a post-deploy hook might grant privileges on all tables in a database. Another might run VACUUM; ANALYZE;.
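
A hook script along those lines might be nothing more than a couple of statements. A hypothetical sketch (the reporting role is an assumption, and VACUUM would need to run outside a transaction block, so I leave it out here):

-- Hypothetical post-deploy hook: re-grant read access and refresh planner stats.
GRANT SELECT ON ALL TABLES IN SCHEMA public TO reporting;
ANALYZE;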

But we could also use it for one-time data migrations. An option to deploy will disable them, which would be useful for development and test databases. Once the migration has been run in production, we just delete the migration hook scripts from the project. Sqitch won’t record hook executions, so adding or removing them will be no problem.

I like this approach, as Sqitch will automatically run migration scripts, but hooks will not change the interface of Sqitch itself. And it’s more generally useful. Hell, I might want a deploy hook script that sends an email notification announcing a deployment (though that might require adding support for non-SQL scripts). There are all kinds of things for which hooks will prove useful.

Changes for Static Data Maintenance

For data that must be tied to the deployment process, there are two complications that have prevented me from simply managing them in normal Sqitch changes:

  1. There might be side-effects to the deployment. For example, a foreign key constraint to the users table, to identify the users who added rows to the database. But in a new database, perhaps there are no users — and Sqitch doesn’t create them, the app does. Chicken, meet egg.

  2. The initial data set might be derived from some external source, such as another database. Consequently, none of that data was defined in Sqitch deploy scripts. This situation complicates future updates of the data. We can add data via Sqitch in the future, but then we don’t have the canonical list of all rows that should exist in all such databases.

However, I can think of no alternative that does not over-complicate Sqitch itself. I considered adding another change-related script type, named “update”, to complement the existing deploy, verify, and revert scripts. But oftentimes a data change would not be tied to a schema change, so the corresponding deploy script would be a no-op. Seems silly.

I also considered adding a completely separate command specifically for deploying data changes. Yet these data migrations are exactly like schema changes: Sqitch must execute them in the proper order relative to other changes, record successful or failed deployment, and be able to revert them when required. The only difference is what’s defined in them: data modification rather than definition.

Through several drafts of this post, I have come around to the idea that I should change nothing in Sqitch with regard to data deployments. A better solution than the above, I believe, is organizational.

Data Deploy Best Practice

Let the best practice for data deploys be this: they should be contained in Sqitch changes, but such changes should contain only data modifications. No change script should both define a table and insert its initial rows. Keep the table and its data in separate changes — keep DML separate from DDL.

For our list of countries, we might have a change named “countries”, which creates the countries table, and another, named “country_data”, which inserts the appropriate data into that table. Where necessary and appropriate, these changes may use conditional paths to bring the data up-to-date and in sync across deployments.
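
A hypothetical “country_data” deploy script might look something like this sketch, assuming a countries table with code and name columns; the NOT EXISTS condition is one such conditional path, inserting only the rows a given database is missing:

BEGIN;

INSERT INTO countries (code, name)
SELECT v.code, v.name
  FROM (VALUES ('US', 'United States'),
               ('DE', 'Germany'),
               ('JP', 'Japan')
       ) AS v(code, name)
 WHERE NOT EXISTS (
    SELECT 1 FROM countries c WHERE c.code = v.code
 );

COMMIT;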

Conditions must deal with side-effects, such as foreign key constraints. Where possible, such side effects ought be removed from deployment-managed data. For tracking the user or users who added data to a database, for example, one can use the tools of the source code repository (git log, git blame) to assign blame. Other side-effects may be more necessary, but to the extent possible, deployed data should be independent.

Naturally, unit tests must expect static data to be present, and be updated when that data changes. We are, after all, talking about infrequently-updated data. Frequently-updated data should have separate interfaces provided by applications to change the data. Otherwise, how static is it, really?


TPF To Revamp Grants

Alberto Simões:

Nevertheless, this lack of “lower than $3000” grant proposals, and the fact that lot of people have been discussing (and complaining) about this value being too low, the Grants Committee is starting a discussion on rewriting and reorganizing the way it works. Namely, in my personal blog I opened a discussion about the Grants Committee some time ago, and had plenty of feedback, that will be helpful for our internal discussion.

This is great news. I would love to see more and more ambitious grant proposals, as well as awards people can subsist on. I look forward to seeing the new rules.


Mopping the Moose

Stevan Little:

I spent much of last week on vacation with the family so very little actual coding got done on the p5-mop, but instead I did a lot of thinking. My next major goal for the p5-mop is to port a module written in Moose, in particular, one that uses many different Moose features. The module I have chosen to port is Bread::Board and I chose it for two reasons; first, it was the first real module that I wrote using Moose and second, it makes heavy use of a lot of Moose’s features.

I’m so happy to see Stevan making progress on the Perl 5 MOP again.


A Perl Blog

I have been unsatisfied with Just a Theory for some time. I started that blog in 2004 more or less for fun, thinking it would be my permanent home on the internet. And it has been. But the design, while okay in 2004, is just awful by today’s standards. A redesign is something I have planned to do for quite some time.

I had also been thinking about my audience. Or rather, audiences. I’ve blogged about many things, but while a few dear family members might want to read everything I ever post, most folks, I think, are interested in only a subset of topics. Readers of Just a Theory came for posts about Perl, or PostgreSQL, or culture, travel, or politics. But few came for all those topics, in my estimation.

More recently, a whole bunch of top-level domains have opened up, often with the opportunity for anyone to register them. I was lucky enough to snag theory.pm and theory.pl, thinking that perhaps I would create a site just for blogging about Perl. I also nabbed theory.so, which I might dedicate to database-related blogging, and theory.me, which would be my personal blog (travel, photography, cultural essays, etc.).

And then there is Octopress. A blogging engine for hackers. Perfect for me. Hard to imagine something more appropriate (unless it was written in Perl). It seemed like a good opportunity to partition my online blogging.

So here we are with my first partition. theory.pm is a Perl blog. Seemed like the perfect name. I fiddled with it off and on for a few months, often following Matt Gemmell’s Advice, and I’m really happy with it. The open-source fonts Source Sans Pro and Source Code Pro, from Adobe, look great. The source code examples are beautifully marked up and displayed using the Solarized color scheme (though presentation varies in feed readers). Better still, it’s equally attractive and readable on computers, tablets and phones, thanks to the foundation laid by Aron Cedercrantz’s BlogTheme.

I expect to fork this code to create a database blog soon, and then perhaps put together a personal blog. Maybe the personal blog will provide link posts for posts on the other sites, so that if anyone really wants to read everything, they can. I haven’t decided yet.

In the meantime, now that I have a dedicated Perl blog, I guess I’ll have to start writing more Perl-related stuff. I’m starting with some posts about the state of exception handling in Perl 5, the first of which is already up. Stay tuned for more.

Update: 2018-05-23: I decided that a separate Perl blog wasn’t a great idea after all, and relaunched Just a Theory.

Trying Times

Exception handling is a bit of a pain in Perl. Traditionally, we use eval {}:

eval {
    foo();
};
if (my $err = $@) {
    # Inspect $err…
}

The use of the if block is a bit unfortunate; worse is the use of the global $@ variable, which has inflicted unwarranted pain on developers over the years.1 Many Perl hackers put Try::Tiny to work to circumvent these shortcomings:

try {
    foo();
} catch {
    # Inspect $_…
};

Alas, Try::Tiny introduces its own idiosyncrasies, particularly its use of subroutine references rather than blocks. While a necessity of a pure-Perl implementation, it prevents returning from the calling context. One must work around this deficiency by checking return values:

my $rv = try {
   f();
} catch {
   # …
};

if (!$rv) {
   return;
}

I can’t tell you how often this quirk burns me.

Sadly, there is a deeper problem than syntax: Just what, exactly, is an exception? How does one determine the exceptional condition, and what can be done about it? It might be a string. The string might be localized. It might be an Exception::Class object, or a Throwable object, or a simple array reference. Or any other value a Perl scalar can hold. This lack of specificity requires careful handling of exceptions:

if (my $err = $@) {
    if (ref $err) {
        if (eval { $err->isa('Exception::Class') }) {
            if ( $err->isa('SomeException') ) {
                # …
            } elsif ( $err->isa('SomeOtherException') ) {
                # …
            } else {
                # …
            }
        } elsif (eval { $err->DOES('Throwable') }) {
            # …
        } elsif ( ref $err eq 'ARRAY') {
            # …
        }
    } else {
        if ( $err =~ /DBI/ ) {
            # …
        } elsif ( $err =~ /cannot open '([^']+)'/ ) {
            # …
        }
    }
}

Not every exception handler requires so many conditions, but I have certainly exercised all these approaches. Usually my exception handlers accrete conditions as users report new, unexpected errors.

That’s not all. My code frequently requires parsing information out of a string error. Here’s an example from PGXN::Manager:

try {
    $self->distmeta(decode_json scalar $member->contents );
} catch {
    my $f = quotemeta __FILE__;
    (my $err = $_) =~ s/\s+at\s+$f.+//ms;
    $self->error([
        'Cannot parse JSON from “[_1]”: [_2]',
        $member->fileName,
        $err
    ]);
    return;
} or return;

return $self;

When JSON throws an exception on invalid JSON, the code must catch that exception to show the user. The user cares not at all what file threw the exception, nor the line number. The code must strip that stuff out before passing the original message off to a localizing error method.

Gross.

It’s time to end this. A forthcoming post will propose a plan for adding proper exception handling to the core Perl language, including exception objects and an official try/catch syntax.


  1. In fairness much of the $@ pain has been addressed in Perl 5.14.

Requiring Reworked Sqitch Changes

I recently discovered a mildly annoying bug in Sqitch, the Git-inspired database schema change management app I’ve been working on for the past year. One of its key features is the ability to “rework” changes. For example, if you have a change that defines a function change_password(), and discover sometime after release that it has a bug (maybe the hashing algorithm is too weak), you can “rework” it – essentially modify it in place – and save some headaches. Check out the “In Place Changes” section of the PostgreSQL, SQLite, Oracle, or MySQL (coming soon) tutorials for detailed examples of how it works.

The bug was about what happens when one adds a new change that depends on a reworked change, but just specifies it by name, such as change_password:

sqitch add meow --requires change_password

This added the change fine, but at deploy time, Sqitch complained that there were multiple instances of a change in the database. Of course, that’s true, because change_password will have been deployed twice: once for the original version, and the second time for the reworked version. This was inconsistent with how it looked up changes in the plan, where it would just return the first instance of a change in the plan. So I changed it so that dependency lookups in the database also return the first instance of the change. I believe this makes sense, because if you require change_password, without specifying which instance you want, you probably want any instance, starting with the earliest.

But what if you actually need to require a specific instance of a reworked change? Let’s say your plan looks like this:

users
widgets
change_pass
sleep
@v1.0

work_stuff
change_pass [change_pass@v1.0]

The third change is change_pass, and it has been reworked in the sixth change (requiring the previous version, as of the @v1.0 tag). If you want to require any instance of change_pass, you specify it as in the previous example. But what if there were changes in the reworked version that you require? You might try to require it as-of the symbolic tag @HEAD:

sqitch add meow --requires change_password@HEAD

This means, “Require the last instance of change_password in the plan.” And that would work…until you reworked it again, then it would be updated to point at the newer instance. Sqitch will choke on that, because you can’t require changes that appear later in the plan.

So what we have to do instead is add a new tag after the second instance of change_pass:

sqitch tag rehash

Now the plan will look like this:

users
widgets
change_pass
sleep
@v1.0

work_stuff
change_pass [change_pass@v1.0]
@rehash

Now we can identify exactly the instance we need by specifying that tag:

sqitch add meow --requires change_password@rehash

Meaning “The instance of change_password as of @rehash.” If what you really needed was the first version, you can specify the tag that follows it:

sqitch add meow --requires change_password@v1.0

Which, since it is the first instance, is the same as specifying no tag at all. But if there were, say, four instances of change_pass, you can see how it might be important to use tags to specify specific instances for dependencies.

For what it’s worth, this is how to get around the original bug referenced above: just specify which instance of the change to require by using a tag that follows that instance, and the error should go away.



Notes on Upcoming Sqitch Improvements

I was traveling last week, and knowing I would be offline a fair bit, not to mention seriously jet-lagged, I put my hacking efforts into getting MySQL support into Sqitch. I merged it in yesterday; check out the tutorial if you’re interested in it. I expect to release v0.980 with the MySQL support in a couple of weeks; testing and feedback would be most appreciated.

There is a caveat, though: it requires MySQL v5.6.4. So if you’re stuck with an older MySQL, it won’t work. There are two reasons to require v5.6.4:

  • The microsecond precision support in DATETIME values, added in v5.6.4. This makes it much easier to keep things in the proper order (deployments usually take less than a second).
  • The SIGNAL functionality, introduced in v5.5. This allows the schema to mock a check constraint in the Sqitch database, as well as make it much easier to write verify tests (as described in the tutorial and figured out on StackOverflow).

But if you can afford to take advantage of a relatively modern MySQL, give it a shot!

The next release also makes a backwards-incompatible change to the SQLite engine: the default Sqitch database is no longer $db_dir/$db_name-sqitch.$suffix, but $db_dir/sqitch.$suffix. In other words, if you were deploying to a db named /var/db/myapp.db, Sqitch previously kept its metadata in /var/db/myapp-sqitch.db, but now will keep it in /var/db/sqitch.db. This is to make it more like the other engines (MySQL defaults to a database named “sqitch”, and Postgres and Oracle default to a schema named “sqitch”).

It’s also useful if you use the SQLite ATTACH DATABASE command to manage multiple database files in a single project. In that case, you will want to use the same metadata file for all the databases. Keep them all in the same directory with the same suffix and you get just that with the default sqitch database.

If you’d like it to have a different name, use sqitch config core.sqlite.sqitch_db $name to configure it. This will be useful if you don’t want to use the same Sqitch database to manage multiple databases, or if you do, but they live in different directories.

I haven’t released this change yet, and I am not a big-time SQLite user. So if this makes no sense, please comment on this issue. It’ll be a couple of weeks before I release v0.980, so there is time to reverse it if there’s consensus that it’s a bad idea.

But given another idea I’ve had, I suspect it will be okay. The idea is to expand on the concept of a Sqitch “target” by giving it its own command and configuration settings. Basically, it would be sort of like Git remotes: use URIs to specify database connection and parameter info (such as the sqitch database name for SQLite). These can be passed to database-touching commands, such as deploy, revert, log, and the like. They can also be given names and stored in the configuration file. The upshot is that it would enable invocations such as

sqitch deploy production
sqitch log qa
sqitch status pg://localhost/flipr?sqitch_schema=meta

See the GitHub issue for a fuller description of this feature. I’m certain that this would be useful at work, as we have a limited number of databases that we deploy each Sqitch project to, and it’s more of a PITA for my co-workers to remember to use different values for the --db-host, --db-user, --db-name and friends options. The project itself would just store the named list of relevant deployment targets.

And it alleviates the issue of specifying a different Sqitch database on SQLite or MySQL, as one can just create a named target that specifies it in the URI.

Not sure when I will get to this feature, though. I think it would be great to have, and maybe iovation would want me to spend some time on it in the next couple of months. But it might also be a great place for someone else to get started adding functionality to Sqitch.

Oh, and before I forget: it looks like Sqitch might soon get CUBRID support, too, thanks to Ștefan Suciu. Stay tuned!
