Posts about SQL

JSON Path Operator Confusion

The CipherDoc service offers a robust secondary key lookup API and search interface powered by JSON/SQL Path queries run against a GIN-indexed JSONB column. SQL/JSON Path, introduced in SQL:2016 and added to Postgres in version 12 in 2019, nicely enables an end-to-end JSON workflow and entity lifecycle. It’s a powerful enabler and fundamental technology underpinning CipherDoc. I’m so happy to have found it.
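
For context, here's a minimal sketch of that pattern. The table, column, and path here are hypothetical stand-ins, not the actual CipherDoc schema:

CREATE TABLE documents (
    id       BIGSERIAL PRIMARY KEY,
    contents JSONB NOT NULL
);

CREATE INDEX idx_documents_contents ON documents USING GIN (contents);

-- Secondary-key lookup via a SQL/JSON Path query; the default GIN
-- operator class supports the @? and @@ operators discussed below.
SELECT id FROM documents WHERE contents @? '$.keys[*] ? (@ == "peach")';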

Confusion

However, the distinction between the SQL/JSON Path operators @@ and @? confused me. Even as I found that the @? operator worked for my needs and @@ did not, I tucked the problem into my mental backlog for later study.

The question arose again on a recent work project, and I can take a hint. It’s time to figure this thing out. Let’s see where it goes.

The docs say:

jsonb @? jsonpath → boolean
Does JSON path return any item for the specified JSON value?

'{"a":[1,2,3,4,5]}'::jsonb @? '$.a[*] ? (@ > 2)' → t


jsonb @@ jsonpath → boolean
Returns the result of a JSON path predicate check for the specified JSON value. Only the first item of the result is taken into account. If the result is not Boolean, then NULL is returned.

'{"a":[1,2,3,4,5]}'::jsonb @@ '$.a[*] > 2' → t

These read quite similarly to me: Both return true if the path query returns an item. So what’s the difference? When should I use @@ and when @?? I went so far as to ask Stack Overflow about it. The one answer directed my attention back to the jsonb_path_query() function, which returns the results from a path query.

So let’s explore how various SQL/JSON Path queries work and what values various expressions return.

Queries

The docs for jsonb_path_query say:1

jsonb_path_query ( target jsonb, path jsonpath [, vars jsonb [, silent boolean ]] ) → setof jsonb
Returns all JSON items returned by the JSON path for the specified JSON value. If the vars argument is specified, it must be a JSON object, and its fields provide named values to be substituted into the jsonpath expression. If the silent argument is specified and is true, the function suppresses the same errors as the @? and @@ operators do.
select * from jsonb_path_query(
    '{"a":[1,2,3,4,5]}',
    '$.a[*] ? (@ >= $min && @ <= $max)',
    '{"min":2, "max":4}'
);
 jsonb_path_query
------------------
 2
 3
 4

The first thing to note is that a SQL/JSON Path query may return more than one value. This feature matters for the @@ and @? operators, which return a single boolean value based on the values returned by a path query. And path queries can return a huge variety of values. Let’s explore some examples, derived from the sample JSON value and path query from the docs.2

select jsonb_path_query('{"a":[1,2,3,4,5]}', '$ ?(@.a[*] > 2)');
    jsonb_path_query    
------------------------
 {"a": [1, 2, 3, 4, 5]}
(1 row)

This query returns the entire JSON value, because that’s what $ selects at the start of the path expression. The ?() filter returns true because its predicate expression finds at least one value in the $.a array greater than 2. Here’s what happens when the filter returns false:

select jsonb_path_query('{"a":[1,2,3,4,5]}', '$ ?(@.a[*] > 5)');
 jsonb_path_query 
------------------
(0 rows)

None of the values in the $.a array are greater than five, so the query returns no value.

To select just the array, append it to the path expression after the ?() filter:

select jsonb_path_query('{"a":[1,2,3,4,5]}', '$ ?(@.a[*] > 2).a');
 jsonb_path_query 
------------------
 [1, 2, 3, 4, 5]
(1 row)

Path Modes

One might think you could select $.a at the start of the path query to get the full array if the filter returns true, but look what happens:

select jsonb_path_query('{"a":[1,2,3,4,5]}', '$.a ?(@[*] > 2)');
 jsonb_path_query 
------------------
 3
 4
 5
(3 rows)

That’s not the array, but the individual array values that each match the predicate. Turns out this is a quirk of the Postgres implementation of path modes. From what I can glean, the SQL:2016 standard dictates something like these SQL Server descriptions:

  • In lax mode, the function returns empty values if the path expression contains an error. For example, if you request the value $.name, and the JSON text doesn’t contain a name key, the function returns null, but does not raise an error.
  • In strict mode, the function raises an error if the path expression contains an error.

But the Postgres lax mode does more than suppress errors. From the docs (emphasis added):

The lax mode facilitates matching of a JSON document structure and path expression if the JSON data does not conform to the expected schema. If an operand does not match the requirements of a particular operation, it can be automatically wrapped as an SQL/JSON array or unwrapped by converting its elements into an SQL/JSON sequence before performing this operation. Besides, comparison operators automatically unwrap their operands in the lax mode, so you can compare SQL/JSON arrays out-of-the-box.

There are a few more details, but this is the crux of it: In lax mode, which is the default, Postgres always unwraps an array. Hence the unexpected list of results.3 This could be particularly confusing when querying multiple rows:

select jsonb_path_query(v, '$.a ?(@[*] > 2)')
        from (values ('{"a":[1,2,3,4,5]}'::jsonb), ('{"a":[3,5,8]}')) x(v);
 jsonb_path_query 
------------------
 3
 4
 5
 3
 5
 8
(6 rows)

Switching to strict mode by prepending strict to the JSON Path query restores the expected behavior:

select jsonb_path_query(v, 'strict $.a ?(@[*] > 2)')
        from (values ('{"a":[1,2,3,4,5]}'::jsonb), ('{"a":[3,5,8]}')) x(v);
 jsonb_path_query 
------------------
 [1, 2, 3, 4, 5]
 [3, 5, 8]
(2 rows)

Important gotcha to watch for, and a good reason to test path queries thoroughly to ensure you get the results you expect. Lax mode nicely prevents errors when a query references a path that doesn’t exist, as this simple example demonstrates:

select jsonb_path_query('{"a":[1,2,3,4,5]}', 'strict $.b');
ERROR:  JSON object does not contain key "b"

select jsonb_path_query('{"a":[1,2,3,4,5]}', 'lax $.b');
 jsonb_path_query 
------------------
(0 rows)

In general, I suggest always using strict mode when executing queries. Better still, pair strict mode with our friends the @@ and @? operators, which suppress some errors even in strict mode:

The jsonpath operators @? and @@ suppress the following errors: missing object field or array element, unexpected JSON item type, datetime and numeric errors. The jsonpath-related functions described below can also be told to suppress these types of errors. This behavior might be helpful when searching JSON document collections of varying structure.

Have a look:

select '{"a":[1,2,3,4,5]}' @? 'strict $.a';
 ?column? 
----------
 t
(1 row)

select '{"a":[1,2,3,4,5]}' @? 'strict $.b';
 ?column? 
----------
 <null>
(1 row)

No error for the unknown JSON key b in that second query! As for the error suppression in the jsonpath-related functions, that’s what the silent argument does. Compare:

select jsonb_path_query('{"a":[1,2,3,4,5]}', 'strict $.b');
ERROR:  JSON object does not contain key "b"

select jsonb_path_query('{"a":[1,2,3,4,5]}', 'strict $.b', '{}', true);
 jsonb_path_query 
------------------
(0 rows)

Boolean Predicates

The Postgres SQL/JSON Path Language docs briefly mention a pretty significant deviation from the SQL standard:

A path expression can be a Boolean predicate, although the SQL/JSON standard allows predicates only in filters. This is necessary for implementation of the @@ operator. For example, the following jsonpath expression is valid in PostgreSQL:

$.track.segments[*].HR < 70

This pithy statement has pretty significant implications for the return value of a path query. The SQL standard allows predicate expressions, which are akin to an SQL WHERE expression, only in ?() filters, as seen previously:

select jsonb_path_query('{"a":[1,2,3,4,5]}', '$ ?(@.a[*] > 2)');
    jsonb_path_query    
------------------------
 {"a": [1, 2, 3, 4, 5]}
(1 row)

This can be read as “return the path $ if @.a[*] > 2 is true.” But have a look at a predicate-only path query:

select jsonb_path_query('{"a":[1,2,3,4,5]}', '$.a[*] > 2');
 jsonb_path_query 
------------------
 true
(1 row)

This path query can be read as “Return the result of the predicate $.a[*] > 2,” which in this case is true. This is quite the divergence from the standard, which returns contents from the JSON queried, while a predicate query returns the result of the predicate expression itself. It’s almost like they’re two different things!

Don’t confuse the predicate path query return value with selecting a boolean value from the JSON. Consider this example:

select jsonb_path_query('{"a":[true,false]}', '$.a ?(@[*] == true)');
 jsonb_path_query 
------------------
 true
(1 row)

Looks the same as the predicate-only query, right? But it’s not, as shown by adding another true value to the $.a array:

select jsonb_path_query('{"a":[true,false,true]}', '$.a ?(@[*] == true)');
 jsonb_path_query 
------------------
 true
 true
(2 rows)

This path query returns the trues it finds in the $.a array. The fact that it returns values from the JSON rather than the filter predicate becomes more apparent in strict mode, which returns all of $.a if one or more elements of the array has the value true:

select jsonb_path_query('{"a":[true,false,true]}', 'strict $.a ?(@[*] == true)');
  jsonb_path_query   
---------------------
 [true, false, true]
(1 row)

This brief aside, and its mention of the @@ operator, turns out to be key to understanding the difference between @? and @@. Because it’s not just that this feature is “necessary for implementation of the @@ operator”. No, I would argue that it’s the only kind of expression usable with the @@ operator.

Match vs. Exists

Let’s get back to the @@ operator. We can use a boolean predicate JSON Path like so:

select '{"a":[1,2,3,4,5]}'::jsonb @@ '$.a[*] > 2';
 ?column? 
----------
 t
(1 row)

It returns true because the predicate JSON path query $.a[*] > 2 returns true. And when it returns false?

select '{"a":[1,2,3,4,5]}'::jsonb @@ '$.a[*] > 6';
 ?column? 
----------
 f
(1 row)

So far so good. What happens when we try to use a filter expression that returns a true value selected from the JSONB?

select '{"a":[true,false]}'::jsonb @@ '$.a ?(@[*] == true)';
 ?column? 
----------
 t
(1 row)

Looks right, doesn’t it? But recall that this query returns all of the true values from $.a, but @@ wants only a single boolean. What happens when we add another?

select '{"a":[true,false,true]}'::jsonb @@ '$.a ?(@[*] == true)';
 ?column? 
----------
 <null>
(1 row)

Now it returns NULL, even though it’s clearly true that @[*] == true matches. This is because it returns all of the values it matches, as jsonb_path_query() demonstrates:

select jsonb_path_query('{"a":[true,false,true]}'::jsonb, '$.a ?(@[*] == true)');
 jsonb_path_query 
------------------
 true
 true
(2 rows)

This clearly violates the @@ documentation claim that “Only the first item of the result is taken into account”. If that were true, it would see the first value is true and return true. But it doesn’t. Turns out, the corresponding jsonb_path_match() function shows why:

select jsonb_path_match('{"a":[true,false,true]}'::jsonb, '$.a ?(@[*] == true)');
ERROR:  single boolean result is expected

Conclusion: The documentation is inaccurate. Only a single boolean is expected by @@. Anything else is an error.

Furthermore, it’s dangerous, at best, to use an SQL standard JSON Path expression with @@. If you need to use it with a filter expression, you can turn it into a boolean predicate by wrapping it in exists():

select jsonb_path_match('{"a":[true,false,true]}'::jsonb, 'exists($.a ?(@[*] == true))');
 jsonb_path_match 
------------------
 t
(1 row)

But there’s no reason to do so, because that’s effectively what the @? operator (and the corresponding, cleverly-named jsonb_path_exists() function) does: it returns true if the SQL standard JSON Path expression contains any results:

select '{"a":[true,false,true]}'::jsonb @? '$.a ?(@[*] == true)';
 ?column? 
----------
 t
(1 row)

Here’s the key thing about @?: you don’t want to use a boolean predicate path query with it, either. Consider this predicate-only query:

select jsonb_path_query('{"a":[1,2,3,4,5]}'::jsonb, '$.a[*] > 6');
 jsonb_path_query 
------------------
 false
(1 row)

But see what happens when we use it with @?:

select '{"a":[1,2,3,4,5]}'::jsonb @? '$.a[*] > 6';
 ?column? 
----------
 t
(1 row)

It returns true even though the query itself returns false! Why? Because false is a value that exists and is returned by the query. Even a query that returns null is considered to exist, as it will when a strict query encounters an error:

select jsonb_path_query('{"a":[1,2,3,4,5]}'::jsonb, 'strict $[*] > 6');
 jsonb_path_query 
------------------
 null
(1 row)

select '{"a":[1,2,3,4,5]}'::jsonb @? 'strict $[*] > 6';
 ?column? 
----------
 t
(1 row)

The key thing to know about the @? operator is that it returns true if anything is returned by the path query, and returns false only if nothing is selected at all.
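
And here is that false case — a lax query for a missing key selects nothing at all, matching the earlier jsonb_path_query example that returned zero rows:

select '{"a":[1,2,3,4,5]}'::jsonb @? 'lax $.b';
 ?column? 
----------
 f
(1 row)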

The Difference

In summary, the difference between the @? and @@ JSONB operators is this:

  • @? (and jsonb_path_exists()) returns true if the path query returns any values — even false or null — and false if it returns no values. This operator should be used only with SQL-standard JSON path queries that select data from the JSONB. Do not use predicate-only JSON path expressions with @?.
  • @@ (and jsonb_path_match()) returns true if the path query returns the single boolean value true, and false otherwise. This operator should be used only with Postgres-specific boolean predicate JSON path queries, which return the result of the predicate expression. Do not use SQL-standard JSON path expressions with @@. See the two-line recap below.
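
Or, as that two-line recap (both queries return t, each for the reason its operator intends):

select '{"a":[1,2,3]}'::jsonb @? '$.a[*] ? (@ > 2)'; -- standard path: does it select anything?
select '{"a":[1,2,3]}'::jsonb @@ '$.a[*] > 2';       -- predicate path: is the single result true?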

This difference, of course, assumes awareness of the distinction between predicate path queries and SQL-standard path queries. To that end, I submitted a patch that expounds the difference between these types of JSON Path queries, and plan to submit another linking these differences in the docs for @@ and @?.

Oh, and probably another to explain the difference in return values between strict and lax queries due to array unwrapping.

Thanks

Many thanks to Erik Wienhold for patiently answering my pgsql-hackers questions and linking me to a detailed pgsql-general thread in which the oddities of @@ were previously discussed in detail.


  1. Well almost. The docs for jsonb_path_query actually say, about the last two arguments, “The optional vars and silent arguments act the same as for jsonb_path_exists.” I replaced that sentence with the relevant sentences from the jsonb_path_exists docs, about which more later.

  2. Though omitting the vars argument, as variable interpolation just gets in the way of understanding basic query result behavior.

  3. In fairness, the Oracle docs also discuss “implicit array wrapping and unwrapping”, but I don’t have a recent Oracle server to experiment with at the moment.

Multirow Database Updates

William Blunn:

So, given a list of updates to apply we could effect them using the following steps:

  1. Use CREATE TEMPORARY TABLE to create a temporary table to hold the updates
  2. Use INSERT to populate the temporary table with the updates
  3. Use UPDATE … FROM to update the target table using updates in the temporary table
  4. Use DROP TABLE to drop the temporary table

So in the example above we can reduce five statements to four. This isn’t a significant improvement in this case. But now the number of statements is no longer directly dependent on the number of rows requiring updates.

Even if we wanted to update a thousand rows with different values, we could still do it with four statements.
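
For reference, Blunn’s four-statement approach might look something like this against the staff table used in the examples below (a sketch, not his verbatim code):

CREATE TEMPORARY TABLE updates (name TEXT, salary INTEGER);

INSERT INTO updates (name, salary) VALUES
    ('Jane',  1200),
    ('Frank', 1100),
    ('Susan', 1175),
    ('John',  1150);

UPDATE staff
   SET salary = updates.salary
  FROM updates
 WHERE staff.name = updates.name;

DROP TABLE updates;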

Or you could just use one statement. Here’s how to do it with a CTE on PostgreSQL 9.2 and higher:

WITH up(name, salary) AS ( VALUES
     ('Jane',  1200),
     ('Frank', 1100),
     ('Susan', 1175),
     ('John',  1150)
)
UPDATE staff
   SET salary = up.salary
  FROM up
 WHERE staff.name = up.name;

Still on PostgreSQL 9.1 or lower? Use a subselect in the FROM clause instead:

UPDATE staff
   SET salary = up.salary
   FROM (VALUES
       ('Jane',  1200),
       ('Frank', 1100),
       ('Susan', 1175),
       ('John',  1150)
   ) AS up(name, salary)
 WHERE staff.name = up.name;

Stuck with MySQL or Oracle? Use a UNION query as a derived table:

UPDATE staff, (
         SELECT 'Jane' AS name, 1200 AS salary
   UNION SELECT 'Frank',        1100
   UNION SELECT 'Susan',        1175
   UNION SELECT 'John',         1150
) AS up
   SET staff.salary = up.salary
 WHERE staff.name = up.name;

Using SQLite? Might make sense to use a temporary table for thousands or millions of rows. But for just a few, use a CASE expression:

UPDATE staff
   SET salary = CASE name
       WHEN 'Jane'  THEN 1200
       WHEN 'Frank' THEN 1100
       WHEN 'Susan' THEN 1175
       WHEN 'John'  THEN 1150
   END
 WHERE name in ('Jane', 'Frank', 'Susan', 'John');

If you need to support multiple database architectures, sure, use something like DBIx::MultiRow to encapsulate things. But if, like most of us, you’re on one database for an app, I can’t recommend strongly enough how well it pays to get to know your database well.

Understanding Window Functions

Dimitri Fontaine:

There was SQL before window functions and SQL after window functions: that’s how powerful this tool is. Being that of a deal breaker unfortunately means that it can be quite hard to grasp the feature. This article aims at making it crystal clear so that you can begin using it today and are able to reason about it and recognize cases where you want to be using window functions.

Great intro to a powerful feature.
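
If you want a taste before clicking through, here’s the canonical sort of example (my own sketch, assuming a hypothetical empsalary table; not from Fontaine’s article): rank employees by salary within each department, without collapsing rows the way GROUP BY would.

SELECT depname, empno, salary,
       rank() OVER (PARTITION BY depname ORDER BY salary DESC)
  FROM empsalary;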

Sqitch on Oracle

I found myself with a little unexpected time at work recently, and since we use Oracle (for a few more months), I decided to port Sqitch. Last night, I released v0.970 with full support for Oracle. I did the development against an 11.2 VirtualBox VM, though I think it should work on 10g, as well.

Sqitch is available from the usual locations. For Oracle support, you’ll need the Instant Client, including SQL*Plus. Make sure you have $ORACLE_HOME set and you’ll be ready to install. Via CPAN, it’s

cpan install App::Sqitch DBD::Oracle

Via Homebrew:

brew tap theory/sqitch
brew install sqitch-oracle

Via ActiveState PPM, install ActivePerl, then run:

ppm install App-Sqitch DBD-Oracle

There are a few other minor tweaks and fixes in this release; check the release notes for details.

PGCon 2013

Want more? I will be giving a half-day tutorial, entitled “Agile Database Development,” on database development with Git, Sqitch, and pgTAP on May 22 at PGCon 2013 in Ottawa, Ontario. Come on up!

Sqitch Homebrew Tap

If Sqitch is to succeed, it needs to get into the hands of as many people as possible. That means making it easy to install for people who are not Perl hackers and don’t want to deal with CPAN. The Sqitch Homebrew Tap is my first public stab at that. It provides a series of “Formulas” for Homebrew users to easily download, build, and install Sqitch and all of its dependencies.

If you are one of these lucky people, here’s how to configure the Sqitch tap:

brew tap theory/sqitch

Now you can install the core Sqitch application:

brew install sqitch

That’s it. Make sure it works:

> sqitch --version
sqitch (App::Sqitch) 0.953

It won’t do you much good without support for your database, though. Currently, there is a build for PostgreSQL. Note that this requires the Homebrew core PostgreSQL server:

brew install sqitch_pg

Sqitch hasn’t been ported to other database engines yet, but once it is, expect other formulas to follow. But if you use PostgreSQL (or just want to experiment with it), you’re ready to rock! I suggest following along with the tutorial, or downloading or taking in the latest iteration of the introductory presentation (video of an older version on Vimeo).

My thanks to IRC user “mistym” for the help and suggestions in getting this going. My Ruby is pretty much rusted through, so I could not have done it without the incredibly responsive help!

Sqitch: Trust, But Verify

New today: Sqitch v0.950. There are a few bug fixes, but the most interesting new feature in this release is the verify command, as well as the complementary --verify option to the deploy command. The add command has created test scripts since the beginning; they were renamed verify in v0.940. In v0.950 these scripts are actually made useful.

The idea is simply to test that a deploy script did what it was supposed to do. Such a test should make no assumptions about data or state other than that affected by the deploy script, so that it can be run against a production database without doing any damage. If it finds that the deploy script failed, it should die.

This is easier than you might at first think. Got a Sqitch change that creates a table with two columns? Just SELECT from it:

SELECT user_id, name
  FROM user
 WHERE FALSE;

If the table does not exist, the query will die. Got a change that creates a function? Make sure it was created by checking a privilege:

SELECT has_function_privilege('insert_user(text, text)', 'execute');

PostgreSQL will throw an error if the function does not exist. Not running PostgreSQL? Well, you’re probably not using Sqitch yet, but if you were, you might force an error by dividing by zero. Here’s an example verifying that a schema exists:

SELECT 1/COUNT(*)
  FROM information_schema.schemata
 WHERE schema_name = 'myapp';

At this point, Sqitch doesn’t care at all what you put into your verify scripts. You just need to make sure that they indicate failure by throwing an error when passed to the database command-line client.

The best time to run a change verify script is right after deploying the change. The --verify option to the deploy command does just that. If a verify script fails, the deploy is considered to have failed. Here’s what failure looks like:

> sqitch deploy
Deploying changes to flipr_test
  + appschema ................. ok
  + users ..................... ok
  + insert_user ............... ok
  + change_pass @v1.0.0-dev1 .. ok
  + lists ..................... psql:verify/lists.sql:7: ERROR:  column "timestamp" does not exist
LINE 1: SELECT nickname, name, description, timestamp
                                            ^
Verify script "verify/lists.sql" failed.
not ok
Reverting all changes
  - change_pass @v1.0.0-dev1 .. ok
  - insert_user ............... ok
  - users ..................... ok
  - appschema ................. ok
Deploy failed

Good, right? In addition, you can always verify the state of a database using the verify command. It runs the verify scripts for all deployed changes. It also ensures that all the deployed changes were deployed in the same order as they’re listed in the plan, and that no changes are missing. The output is similar to that for deploy:

> sqitch verify
Verifying flipr_test
  * appschema ................. ok
  * users ..................... ok
  * insert_user ............... ok
  * change_pass @v1.0.0-dev1 .. ok
  * lists ..................... ok
  * insert_list ............... ok
  * delete_list ............... ok
  * flips ..................... ok
  * insert_flip ............... ok
  * delete_flip @v1.0.0-dev2 .. ok
  * pgcrypto .................. ok
  * insert_user ............... ok
  * change_pass ............... ok
Verify successful

Don’t want verification tests/scripts? Use --no-verify when you call sqitch add and none will be created. Or tell it never to create verify scripts by turning off the add.with_verify option:

sqitch config --bool add.with_verify no

If you somehow run deploy --verify or verify anyway, Sqitch will emit a warning for any changes without verify scripts, but won’t consider them failures.

Up Front Dependency Checking

The other significant change in v0.950 is that the deploy and revert commands (and, by extension, the rebase command) now check dependencies for all changes up front, before deploying or reverting anything. Previously, Sqitch checked the dependencies for each change just before deploying it, but it makes much more sense to check them for all changes to be deployed before doing anything at all. This reduces the chances of unexpected reversions.

Still hacking on Sqitch, of course, though nearly all the commands I initially envisioned are done. Next up, I plan to finally implement support for SQLite, add a few more commands to simplify plan file modification, and to create a new site, since the current site is woefully out-of-date. Until then, though, check out this presentation and, of course, the tutorial.

Sqitch Update: All Your Rebase Are…Never Mind

I’m pleased to announce the release of Sqitch v0.940. The focus of this release? Sanity.

I’ve been doing a lot of Sqitch-based database development at work. Overall it has worked quite well. Except for one thing: often the order in which changes would be arranged would change from one run to the next. Oy.

Out of Order

The reason? The plan parser would perform a topological sort of all the changes between tags based on their dependencies. I’ve been careful, for the most part, to keep my changes in the proper order in our plan files, but the topological sort would often pick a different order. Still valid in terms of dependency ordering, but different from the plan file.

Given the same inputs, the sort always produced the same order. However, whenever I added a new change (and I do that all the time while developing), there would then be a new input, which could result in a completely different order. The downside is that I would add a change, run sqitch deploy, and it would die because it thought something needed to be deployed that had already been deployed, simply because it sorted it to come after an undeployed change. So annoying. It also caused problems for production deployments, because different machines with different Perls would sort the plans in different ways.

So I re-wrote the sorting part of the plan parser so that it no longer sorts. The list of changes is now always identical to the order in the plan file. It still checks dependencies, of course, only now it throws an exception if it finds an ordering problem, rather than re-ordering for you. I’ve made an effort to tell the user how to move things around in the plan file to fix ordering issues, so hopefully everything will be less mysterious.

Of course, many will never use dependencies, in which case this change has no effect. But it was important to me, as I like to specify dependencies as much as I can, for my own sanity.

See? There’s that theme!

Everyone has a Mom

Speaking of ordering, as we have been starting to do production deployments, I realized that my previous notion to allow developers to reorder changes in the plan file without rebuilding databases was a mistake. It was too easy for someone to deploy to an existing database and miss changes, because there was nothing to signal that changes had not been deployed. This was especially a problem before I addressed the ordering issue.

Even with ordering fixed, I thought about how git push works, and realized that it was much more important to make sure things really were consistent than it was to make things slightly more convenient for developers.

So I changed the way change IDs are generated. The text hashed for IDs now includes the ID of the parent change (if there is one), the change dependencies, and the change note. If any of these things change, the ID of the change will change. So they might change a lot during development, while one moves things around, changes dependencies, and tweaks the description. But the advantage is for production, where things have to be deployed exactly right, with no modifications, or else the deploy will fail. This is sort of like requiring all Git merges to be fast-forwarded, and philosophically in line with the Git practice of never changing commits after they’re pushed to a remote repository accessible to others.

Curious what text is hashed for the IDs? Check out the new show command!
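
For example, to inspect a single change (change name borrowed from the tutorial; output elided):

sqitch show change change_pass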

Rebase

As a database hacker, I still need things to be relatively convenient for iterative development. So I’ve also added the rebase command. It’s simple, really: it just does a revert and a deploy in a single command. I’m doing this all day long, so I’m happy to save myself a few steps. It’s also nice that I can do sqitch rebase @HEAD^ to revert and re-apply the latest change over and over again without fear that it will fail because of an ordering problem. But I already mentioned that, didn’t I?
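
Spelled out, that one command takes the place of the two-step dance (a sketch of the equivalence described above):

sqitch rebase @HEAD^

# Roughly the same as:
sqitch revert --to @HEAD^
sqitch deploy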

Order Up

Well, mostly. Another ordering issue I addressed was for the revert --to option. It used to be that it would find the change to revert to in the plan, and revert based on the plan order. (And did I mention that said order might have changed since the last deploy?) v0.940 now searches the database for the revert target. Not only that, the full list of changes to revert to reach the target is also returned from the database. In fact, the revert no longer consults the plan file at all. This is great if you’ve re-ordered things, because the revert will always be the reverse order of the previous deploy. Even if IDs have changed, revert will find the changes to revert by name. It will only fail if you’ve removed the revert script for a change.

So simple, conceptually: revert reverts in the proper order based on what was deployed before. deploy deploys based on the order in the plan.

Not @FIRST, Not @LAST

As a result of the improved intelligence of revert, I have also deprecated the @FIRST and @LAST symbolic tags. These tags forced a search of the database, but were mainly used for revert. Now that revert always searches the database, there’s nothing to force. They’re still around for backward compatibility, but no longer documented. Use @ROOT and @HEAD, instead.

Not Over

So lots of big changes, including some compatibility changes. But I’ve tried hard to make them as transparent as possible (old IDs will automatically be updated by deploy). So take it for a spin!

Meanwhile, I still have quite a few other improvements on my short list.

New in PostgreSQL 9.2: format()

There’s a new feature in PostgreSQL 9.2 that I don’t recall seeing blogged about elsewhere: the format() function. From the docs:

Format a string. This function is similar to the C function sprintf; but only the following conversion specifications are recognized: %s interpolates the corresponding argument as a string; %I escapes its argument as an SQL identifier; %L escapes its argument as an SQL literal; %% outputs a literal %. A conversion can reference an explicit parameter position by preceding the conversion specifier with n$, where n is the argument position.
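
To get a quick feel for each specifier, here are a few one-liners of my own (results in comments; not from the docs):

select format('Hello, %s', 'world');          -- Hello, world
select format('%I.%I', 'my schema', 'users'); -- "my schema".users
select format('WHERE name = %L', 'O''Brien'); -- WHERE name = 'O''Brien'
select format('%1$s %1$s', 'echo');           -- echo echo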

If you do a lot of dynamic query building in PL/pgSQL functions, you’ll immediately see the value in format(). Consider this function:

CREATE OR REPLACE FUNCTION make_month_partition(
    base_table   TEXT,
    schema_name  TEXT,
    month        TIMESTAMP
) RETURNS VOID LANGUAGE plpgsql AS $_$
DECLARE
    partition TEXT := quote_ident(base_table || '_' || to_char(month, '"y"YYYY"m"MM'));
    month_start TIMESTAMP := date_trunc('month', month);
BEGIN
    EXECUTE '
        CREATE TABLE ' || quote_ident(schema_name) || '.' || partition || ' (CHECK (
                created_at >= ' || quote_literal(month_start) || '
            AND created_at < '  || quote_literal(month_start + '1 month'::interval) || '
        )) INHERITS (' || quote_ident(schema_name) || '.' || base_table || ')
    ';
    EXECUTE 'GRANT SELECT ON ' || quote_ident(schema_name) || '.' || partition || '  TO dude;';
END;
$_$;

Lots of concatenation and use of quote_ident() to get things just right. I don’t know about you, but I always found this sort of thing quite difficult to read. But format() allows us to eliminate most of the operators and function calls. Check it:

CREATE OR REPLACE FUNCTION make_month_partition(
    base_table   TEXT,
    schema_name  TEXT,
    month        TIMESTAMP
) RETURNS VOID LANGUAGE plpgsql AS $_$
DECLARE
    partition TEXT := base_table || '_' || to_char(month, '"y"YYYY"m"MM');
    month_start TIMESTAMP := date_trunc('month', month);
BEGIN
    EXECUTE format(
        'CREATE TABLE %I.%I (
            CHECK (created_at >= %L AND created_at < %L)
        ) INHERITS (%I.%I)',
        schema_name, partition,
        month_start, month_start + '1 month'::interval,
        schema_name, base_table
    );
    EXECUTE format('GRANT SELECT ON %I.%I TO dude', schema_name, partition);
END;
$_$;

I don’t know about you, but I find that a lot easier to read, which means it’ll be easier to maintain. So if you do much dynamic query generation inside the database, give format() a try; I think you’ll find it a winner.

Update 2012-11-16: Okay, so I somehow failed to notice that format() was actually introduced in 9.1 and covered by depesz. D’oh! Well, hopefully my little post will help to get the word out more, at least. Thanks to my commenters.

Mocking Serialization Failures

I’ve been hacking on the forthcoming Bucardo 5 code base the last couple weeks, as we’re going to start using it pretty extensively at work, and it needed a little love to get it closer to release. The biggest issue I fixed was the handling of serialization failures.

When copying deltas from one database to another, Bucardo sets the transaction isolation to “Serializable”. As of PostgreSQL 9.1, this is true serializable isolation. However, there were no tests for it in Bucardo. And since previous versions of PostgreSQL had poorer isolation (retained in 9.1 as “Repeatable Read”), I don’t think anyone really noticed it much. As I’m doing all my testing against 9.2, I was getting the serialization failures about half the time I ran the test suite. It took me a good week to chase down the issue. Once I did, I posted to the Bucardo mail list pointing out that Bucardo was not attempting to run a transaction again after failure, and at any rate, the model for how it thought to do so was a little wonky: it let the replicating process die, on the assumption that a new process would pick up where it left off. It did not.

Bucardo maintainer Greg Sabino Mullane proposed that we let the replicating process try again on its own. So I went and made it do that. And then the tests started passing every time. Yay!

Returning to the point of this post, I felt that there ought to be tests for serialization failures in the Bucardo test suite, so that we can ensure that this continues to work. My first thought was to use PL/pgSQL in 8.4 and higher to mock a serialization failure. Observe:

david=# \set VERBOSITY verbose
david=# DO $$BEGIN RAISE EXCEPTION 'Serialization error'
       USING ERRCODE = 'serialization_failure'; END $$;
ERROR:  40001: Serialization error
LOCATION:  exec_stmt_raise, pl_exec.c:2840

Cool, right? Well, the trick is to get this to run on the replication target, but only once. When Bucardo retries, we want it to succeed, thus properly demonstrating the COPY/SERIALIZATION FAIL/ROLLBACK/COPY/SUCCESS pattern. Furthermore, when it copies deltas to a target, Bucardo disables all triggers and rules. So how to get something trigger-like to run on a target table and throw the serialization error?

Studying the Bucardo source code, I discovered that Bucardo itself does not disable triggers and rules. Rather, it sets the session_replication_role GUC to “replica”. This causes PostgreSQL to disable the triggers and rules — except for those that have been set to ENABLE REPLICA. The PostgreSQL ALTER TABLE docs:

The trigger firing mechanism is also affected by the configuration variable session_replication_role. Simply enabled triggers will fire when the replication role is “origin” (the default) or “local”. Triggers configured as ENABLE REPLICA will only fire if the session is in “replica” mode, and triggers configured as ENABLE ALWAYS will fire regardless of the current replication mode.

Well how cool is that? So all I needed to do was plug in a replica trigger and have it throw an exception once but not twice. Via email, Kevin Grittner pointed out that a sequence might work, and indeed it does. Because sequence values are non-transactional, sequences return different values every time they’re accessed.

Here’s what I came up with:

CREATE SEQUENCE serial_seq;

CREATE OR REPLACE FUNCTION mock_serial_fail(
) RETURNS trigger LANGUAGE plpgsql AS $_$
BEGIN
    IF nextval('serial_seq') % 2 = 0 THEN RETURN NEW; END IF;
    RAISE EXCEPTION 'Serialization error'
            USING ERRCODE = 'serialization_failure';
END;
$_$;

CREATE TRIGGER mock_serial_fail AFTER INSERT ON bucardo_test2
    FOR EACH ROW EXECUTE PROCEDURE mock_serial_fail();
ALTER TABLE bucardo_test2 ENABLE REPLICA TRIGGER mock_serial_fail;

The first INSERT (or, in Bucardo’s case, COPY) to bucardo_test2 will die with the serialization error. The second INSERT (or COPY) succeeds. This worked great, and I was able to write tests in a few hours and get them committed. And now we can be reasonably sure that Bucardo will always properly handle serialization failures.
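
The alternation is easy to see by polling the sequence directly, just as the trigger does. On a freshly created sequence, the first value fails the evenness test (so the trigger raises), and the second passes:

select nextval('serial_seq') % 2 = 0 AS succeeds;
 succeeds 
----------
 f
(1 row)

select nextval('serial_seq') % 2 = 0 AS succeeds;
 succeeds 
----------
 t
(1 row)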

Sqitch Symbolism

It has been a while since I last blogged about Sqitch. The silence is in part due to the fact that I’ve moved from full-time Sqitch development to actually putting it to use building databases at work. This is exciting, because it needs the real-world experience to grow up.

That’s not to say that nothing has happened with Sqitch. I’ve just released v0.931 which includes a bunch of improvement since I wrote about v0.90. First a couple of the minor things:

  • Sqitch now checks dependencies before reverting, and dies if they would be broken by the revert. This change, introduced in v0.91, required that the dependencies be moved to their own table, so if you’ve been messing with an earlier version of Sqitch, you’ll have to rebuild the database. Sorry about that.
  • I fixed a bunch of Windows-related issues, including finding the current user’s full name, correctly setting the locale for displaying dates and times, executing shell commands, and passing tests. The awesome ActiveState PPM Index has been invaluable in tracking these issues down.
  • Added the bundle command. All it does is copy your project configuration file, plan, and deploy, revert, and test scripts to a directory you identify. The purpose is to be able to export the project into a directory structure suitable for distribution in a tarball, RPM, or whatever. That may not sound incredibly useful, since copying files is no big deal. However, the long-term plan is to add VCS support to Sqitch, which would entail fetching scripts from various places in VCS history. At that point, it will be essential to use bundle to do the export, so that scripts are properly exported from the VCS history. That said, I’m actually using it already to build RPMs. Useful already!

Symbolic References

And now the more immediately useful changes. First, I added new symbolic tags, @FIRST and @LAST. These represent the first and last changes currently deployed to a database, respectively. These complement the existing @ROOT and @HEAD symbolic tags, which represent the first and last changes listed in the plan. The distinction is important: The change plan vs actual deployments to a database.

The addition of @FIRST and @LAST may not sound like much, but there’s more.

I also added forward and reverse change reference modifiers ^ and ~. The basic idea was stolen from Git Revisions, though the semantics vary. For Sqitch changes, ^ appended to a name or tag means “the change before this change,” while ~ means “the change after this change”. I find ^ most useful when doing development, where I’m constantly deploying and reverting a change as I work. Here’s how I do that revert:

sqitch revert --to @LAST^

That means “revert to the change before the last change”, or simply “revert the last change”. If I want to revert two changes, I use two ^s:

sqitch revert --to @LAST^^

To go back any further, I need to use an integer with the ^. Here’s how to revert the last four changes deployed to the database:

sqitch revert --to @LAST^4

The cool thing about this is that I don’t have to remember the name of the change to revert, as was previously required. And of course, if I just wanted to deploy two changes since the last deployment, I would use ~~:

sqitch deploy --to @LAST~~

Nice, right? One thing to bear in mind, as I was reminded while giving a Sqitch presentation to PDXPUG: Changes are deployed in a sequence. You can think of them as a linked list. So this command:

sqitch revert @LAST^^

Does not mean to revert the second-to-last change, leaving the two after it. It will revert the last change and the penultimate change. This is why I actually encourage the use of the --to option, to emphasize that you’re deploying or reverting all changes to the named point, rather than deploying or reverting the named point in isolation. Sqitch simply doesn’t do that.

Internationalize Me

One more change. With today’s release of v0.931, there is now proper internationalization support in Sqitch. The code has been localized for a long time, but there was no infrastructure for internationalizing. Now there is, and I’ve stubbed out files for translating Sqitch messages into French and German. Adding others is easy.

If you’re interested in translating Sqitch’s messages (only 163 of them, should be quick!), just fork Sqitch, juice up your favorite gettext editor, and start editing. Let me know if you need a language file generated; I’ve built the tools to do it easily with dzil, but haven’t released them yet. Look for a post about that later in the week.

Presentation

Oh, and that PDXPUG presentation? Here are the slides (also for download and on Slideshare). Enjoy!

Sqitch: Depend On It!

Sqitch v0.90 dropped last week (updated to v0.902 today). The focus of this release of the “sane database change management” app was cross-project dependencies. Jim Nasby first put the idea for this feature into my head, and then I discovered that our first Sqitch-using project at work needs it, so blame them.

Depend On It

Earlier versions of Sqitch allow you to declare dependencies on other changes within a project. For example, if your project has a change named users_table, you can create a new change that requires it like so:

sqitch add --requires users_table widgets_table

As of v0.90, you can also require a change from different Sqitch project. Say that you have a project that installs a bunch of utility functions, and that you want to require it in your current Sqitch project. To do so, just prepend the project name to the name of the change you want to require:

sqitch add --requires utils:uuidgen widgets_table

When you go to deploy your project, Sqitch will not deploy the widgets_table change if the uuidgen change from the utils project is not already present.

Sqitch discriminates projects simply by name, as required since v0.80. When you initialize a new Sqitch project, you have to declare its name, too:

sqitch init --name utils

I’ve wondered a bit as to whether that was sufficient. Within a small organization, it’s probably no big deal, as there is unlikely to be much namespace overlap. But thinking longer term, I could foresee folks developing and distributing interdependent open-source Sqitch projects. And without a central name registry, conflicts are likely to pop up. To a certain degree, the risks can be minimized by comparing project URIs, but that works only for project registration, not dependency specification. But perhaps it’s enough. Thoughts?

It’s All Relative

Next up I plan to implement the SQLite support and the bundle command. But first, I want to support relative change specifications. Changes have an order, both in the plan and as deployed to the database. I want to be able to specify relative changes, kind of like you can specify relative commits in Git. So, if you want to revert just one change, you could say something like this:

sqitch revert HEAD^

And that would revert one change. I also think the ability to specify later changes might be useful. So if you wanted to deploy to the change after change foo, you could say something like:

sqitch deploy foo+

You can use ^ or + any number of times, or specify numbers for them. These would both revert three changes:

sqitch revert HEAD^^^
sqitch revert HEAD^3

I like ^ because of its use in Git, although perhaps ~ is more appropriate (Sqitch does not have concepts of branching or multiple parent changes). But + is not a great complement. Maybe - and + would be better, if a bit less familiar? Or maybe there is a better complement to ^ or ~ I haven’t thought of? (I don’t want to use characters that have special meaning in the shell, like <>, if I can avoid it.) Suggestions greatly appreciated.

Oops

I discovered a bug late in the development of v0.90. Well, not so much a bug as an oversight: Sqitch does not validate dependencies in the revert command. That means it’s possible to revert a change without error when some other change depends on it. Oops. Alas, fixing this issue is not exactly trivial, but it’s something I will have to give attention to soon. While I’m at it, I will probably make dependency failures fail earlier. Watch for those fixes soon.

And You?

Still would love help getting a dzil plugin to build Locale::TextDomain l10n files. I suspect this would take a knowledgeable Dist::Zilla user a couple of hours to do. (And thanks to @notbenh and @RJBS for getting Sqitch on Dist::Zilla!) And if anyone really wanted to dig into Sqitch, implementing a bundle command would be a great place to start.

Or just give it a try! You can install it from CPAN with cpan App::Sqitch. Read the tutorial for an overview of what Sqitch is and how it works. Thanks!

Sqitch v0.80: Now With More You

Last night, I uploaded Sqitch v0.80, the latest dev release of the simple database change management system I’ve been working on. I’m kind of stunned by the sheer number of changes in this release, given that the interface has not changed much. Mainly, there’s more you in this version. That is, like Git, the first thing you’ll want to do after installing Sqitch is tell it who you are:

> sqitch config --user user.name 'Marge N. O’Vera'
> sqitch config --user user.email 'marge@example.com'

This information is now recorded for every change added to a project plan, as well as every commit to the database (deploys, reverts, and failures). If you don’t tell Sqitch who you are, it will try to guess, but you might not like who it finds.

Changes and tags now also require a note to be associated with them, kind of like a Git commit message. This allows a bit more context to be provided about a change or tag, since the name may not be sufficient. All of this is recorded in the plan file, which makes it harder to edit by hand, since the lines are so much longer now. An example:

%syntax-version=1.0.0-b1
%project=flipr
%uri=https://github.com/theory/sqitch-intro/

appuser 2012-08-01T15:04:13Z Marge N. O’Vera <marge@example.com> # App database user with limited permissions.
users [:appuser] 2012-08-01T15:36:00Z Marge N. O’Vera <marge@example.com> # Creates table to track our users.
insert_user [:users :appuser] 2012-08-01T15:41:17Z Marge N. O’Vera <marge@example.com> # Creates a function to insert a user.
change_pass [:users :appuser] 2012-08-01T15:41:46Z Marge N. O’Vera <marge@example.com> # Creates a function to change a user password.
@v1.0.0-dev1 2012-08-01T15:48:04Z Marge N. O’Vera <marge@example.com> # Tag v1.0.0-dev1.

But each change and tag is still on a single line, so it’s not too bad if you absolutely must edit it. Still, I expect to discourage that in favor of adding more commands for manipulating it (adding and removing dependencies, changing the note, etc.).

Given all this data, the output of the log command has expanded quite a lot. Here’s an example from the tutorial’s example project:

On database flipr_test
Deploy 7ad1cc6d1706c559dceb3101e7c21786dc7d7b4c
Name:      change_pass
Committer: Marge N. O’Vera <marge@example.com>
Date:      2012-08-01 22:20:54 +0200

    Change change_pass to use pgcrypto.

Deploy 799ecd26730a684cf02a889c30371a0af55150cc
Name:      insert_user
Committer: Marge N. O’Vera <marge@example.com>
Date:      2012-08-01 22:20:54 +0200

    Change insert_user to use pgcrypto.

Revert 799ecd26730a684cf02a889c30371a0af55150cc
Name:      insert_user
Committer: Marge N. O’Vera <marge@example.com>
Date:      2012-08-01 22:20:52 +0200

    Change insert_user to use pgcrypto.

Revert 7ad1cc6d1706c559dceb3101e7c21786dc7d7b4c
Name:      change_pass
Committer: Marge N. O’Vera <marge@example.com>
Date:      2012-08-01 22:20:52 +0200

    Change change_pass to use pgcrypto.

Deploy 7ad1cc6d1706c559dceb3101e7c21786dc7d7b4c
Name:      change_pass
Committer: Marge N. O’Vera <marge@example.com>
Date:      2012-08-01 22:20:46 +0200

    Change change_pass to use pgcrypto.

Deploy 799ecd26730a684cf02a889c30371a0af55150cc
Name:      insert_user
Committer: Marge N. O’Vera <marge@example.com>
Date:      2012-08-01 22:20:46 +0200

    Change insert_user to use pgcrypto.

Note the use of color to identify the event type: green for deploys and blue for reverts. Failures appear in red. Not sure I like it yet, but I think it might be useful. We’ll see.

Back to the plan. Notice that it now also includes pragmas for a project name and URI. Those lines again:

%syntax-version=1.0.0-b1
%project=flipr
%uri=https://github.com/theory/sqitch-intro/

The project name is required when initializing a Sqitch project, but the URI is optional (at least for now). The point of these data points is twofold:

  • The project name is used (along with the current timestamp and your name and email address) when hashing changes and tags to generate IDs. This ensures that the IDs are likely to be globally unique.
  • In the future, you will be able to declare cross-project dependencies.

The second point is the more important. The plan is to require the name of a project before the : in a dependency. For example, if I wanted to require the insert_user change from the “flipr” project plan above, I would declare it as flipr:insert_user. Sqitch will then know to check for it. I will be adding this pretty soon, since it requires some database changes and we’re going to need it at work. The need for database changes is also why v0.80 is still a dev release. (However I don’t expect the plan format to change beyond this tweak to dependency specification.)
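
Once that lands, requiring a change from another project should look something like this (the syntax that eventually shipped in v0.90, covered above):

sqitch add --requires flipr:insert_user widgets_table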

Beyond that, next steps include:

  • Creating an RPM targeting work’s servers. This will probably not be public, though I might add the spec file to the public project.
  • Starting to use Sqitch for some work projects. This will be the first real-world use, which I deem essential for proving the technology. I hope that it does not lead to any more radical redesigns. :-)
  • Implement the SQLite interface to iron out any kinks in the engine API.
  • Switch to Dist::Zilla for building the distribution. I would love a volunteer to help with this; I expect it to be simple for someone well-versed in Dist::Zilla.
  • Add support for localization. Sqitch already uses Locale::TextDomain throughout, so it’s localization-ready. We just need the tools put in place as described in the dzil ticket. Again, I would love help with this.
  • Implement the Bundle command. Should be pretty simple, since, for now at least, all it does is copy files and directories.
  • Add VCS integration. This is less important than it once was, but will still help a lot when working with Sqitch within a VCS. The bundle command would also need to be updated, once this work was done.

But even with all that, I think that Sqitch is finally ready for some serious tire-kicking. To get started, skim the tutorial and take it for a spin (install it by running cpan DWHEELER/App-Sqitch-0.80-TRIAL.tar.gz). Let me know what you like, what you don’t like, and let’s have a discussion about it.

Oh, and for discussions, where should I set up a mail list? Google Groups? Someplace else?

Sqitch’s Log

Just uploaded Sqitch v0.70 and v0.71. The big change is the introduction of the log command, which allows one to view the deployment history in a database. All events are logged and searchable, including deploys, failed deploys, and reverts. Unlike most other database migration systems, Sqitch has the whole history, so even if you revert back to the very beginning, there is still a record of everything that happened.

I stole most of the interface for the log command from git-log, including:

  • Colorized output
  • Searching against change and committer names via regular expressions
  • A variety of formatting options (“full”, “long”, “medium”, “oneline”, etc.)
  • Extensible formatting with printf-style codes

Here are a couple of examples searching the tutorial’s test database:

> sqitch -d flipr_test log -n 3
On database flipr_test
Deploy 18d7aab59bd0c914a561dc324b1da5549605c376
Name:   change_pass
Date:   2012-07-07 13:26:30 +0200

Deploy 87b4e131897ec370d78be177a3f91fdc877a2515
Name:   insert_user
Date:   2012-07-07 13:26:30 +0200

Deploy 20d9af30b97a3071dce12d91665dcd6237265d60
Name:   pgcrypto
Date:   2012-07-07 13:26:30 +0200

> sqitch -d flipr_test log -n 6 --format oneline --abbrev 7
On database flipr_test
18d7aab deploy change_pass
87b4e13 deploy insert_user
20d9af3 deploy pgcrypto
540359a deploy delete_flip
d4dce7d deploy insert_flip
b715d73 deploy flips

> sqitch -d flipr_test log -n 4 --event revert --event fail --format \
'format:%a %eed %{blue}C%{6}h%{reset}C - %c%non %{cldr:YYYY-MM-dd}d at %{cldr:h:mm a}d%n' 
On database flipr_test
theory reverted 9df095 - appuser
on 2012-07-07 at 1:26 PM

theory reverted 9df0959d078b - users
on 2012-07-07 at 1:26 PM

theory reverted 9df095131e25 - insert_user
on 2012-07-07 at 1:26 PM

theory reverted 9df09502c559 - change_pass
on 2012-07-07 at 1:26 PM

I’m pretty happy with this. Not sure how much it will be used, but it works great!

Sqitch Status: Now With Status

I’ve just released Sqitch v0.60. The main change is the implementation of the status command, which allows one to see the current deployment status of a database. An example from the updated tutorial:

> sqitch status
# On database flipr_test
# Change:   18d7aab59bd0c914a561dc324b1da5549605c376
# Name:     change_pass
# Deployed: 2012-07-07 13:23:42 +0200
# By:       theory
# 
Nothing to deploy (up-to-date)

If there are changes in the plan after the most recently deployed change, they would be listed like so:

> sqitch status
# On database flipr_test
# Change:   540359a3892d1476f9ca6ccf7d3f9993ac383b68
# Name:     delete_flip
# Tag:      @v1.0.0-dev2
# Deployed: 2012-07-06 19:31:14 +0200
# By:       theory
# 
Undeployed changes:
  * pgcrypto
  * insert_user
  * change_pass

You can also ask it to show the list of deployed changes and applied tags:

> sqitch status --show-tags --show-changes
# On database flipr_test
# Change:   18d7aab59bd0c914a561dc324b1da5549605c376
# Name:     change_pass
# Deployed: 2012-07-07 13:26:30 +0200
# By:       theory
# 
# Changes:
#   change_pass - 2012-07-07 13:26:30 +0200 - theory
#   insert_user - 2012-07-07 13:26:30 +0200 - theory
#   pgcrypto    - 2012-07-07 13:26:30 +0200 - theory
#   delete_flip - 2012-07-07 13:26:30 +0200 - theory
#   insert_flip - 2012-07-07 13:26:30 +0200 - theory
#   flips       - 2012-07-07 13:26:30 +0200 - theory
#   delete_list - 2012-07-07 13:26:30 +0200 - theory
#   insert_list - 2012-07-07 13:26:30 +0200 - theory
#   lists       - 2012-07-07 13:26:30 +0200 - theory
#   change_pass - 2012-07-07 13:26:30 +0200 - theory
#   insert_user - 2012-07-07 13:26:30 +0200 - theory
#   users       - 2012-07-07 13:26:30 +0200 - theory
#   appuser     - 2012-07-07 13:26:30 +0200 - theory
# 
# Tags:
#   @v1.0.0-dev2 - 2012-07-07 13:26:30 +0200 - theory
#   @v1.0.0-dev1 - 2012-07-07 13:26:30 +0200 - theory
# 
Nothing to deploy (up-to-date)

The --date-format option allows one to display the dates in a variety of formats, inspired by the git log --date option:

> sqitch status --date-format long
# On database flipr_test
# Change:   18d7aab59bd0c914a561dc324b1da5549605c376
# Name:     change_pass
# Deployed: 7 juillet 2012 13:26:30 CEST
# By:       theory
# 
Nothing to deploy (up-to-date)

Want to give it a try? Install it with cpan D/DW/DWHEELER/App-Sqitch-0.60-TRIAL.tar.gz and follow along with the tutorial.

Now I’m off to add the log command, which shows a history of all deploys and reverts.

Sqitch Update: Almost Usable

This week, I released v0.50 of Sqitch, the database change management app I’ve been working on for the last couple of months. Those interested in how it works should read the tutorial. A lot has changed since v0.30; here are some highlights:

  • The plan file is now required. This can make merges more annoying, but thanks to a comment from Jakub Narębski, I discovered that Git can be configured to use a “union merge driver”, which seems to simplify things a great deal (see the sketch after this list). See the tutorial for a detailed example.
  • The plan now consists solely of a list of changes, roughly analogous to Git commits. Tags are simply pointers to specific changes.
  • Dependencies are now specified in the plan file, rather than in the deployment scripts. Once the plan file became required, this seemed like the much more obvious place for them.
  • The plan file now goes into the top-level directory of a project (which defaults to the current directory, assumed to be the top level directory of a VCS project), while the configuration file goes into the current directory. This allows one to have multiple top-level directories for different database engines, each with its own plan, and a single configuration file for them all.
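
The union merge driver is built into Git; a minimal sketch of enabling it for the plan file is a one-line .gitattributes entry at the root of the repository (the file name assumes the default plan file location):

gitattributes merge=union entry:
sqitch.plan merge=union

With that in place, Git resolves concurrent additions to the plan file by keeping the lines from both sides, rather than reporting a conflict.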

Seems like a short list, but in reality, this release is the first I would call almost usable. Most of the core functionality and infrastructure is in place, and the architectural designs have been finalized. There should be much less flux in how things work from here on in, though this is still very much a developer release. Things might still change, so I’m being conservative and not doing a “stable” release just yet.

What works

So what commands actually work at this point? All of the most important functional ones:

  • sqitch init – Initialize a Sqitch project. Creates the project configuration file, a plan file, and directories for deploy, revert, and test scripts
  • sqitch config – Configure Sqitch. Uses the same configuration format as Git, including cascading local, user, and system-wide configuration files
  • sqitch help – Get documentation for specific commands
  • sqitch add – Add a new change to the plan. Generates deploy, revert, and test scripts based on user-modifiable templates
  • sqitch tag – Tag the latest change in the plan, or show a list of existing tags
  • sqitch deploy – Deploy changes to a database. Includes a --mode option to control how to revert changes in the event of a deploy failure (not at all, to last tag, or to starting point)
  • sqitch revert – Revert changes from a database
  • sqitch rework – Copy and modify a previous change

Currently, only PostgreSQL is supported by deploy and revert; I will at least add SQLite support soon.
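
To give a feel for how these commands fit together, here is a minimal sketch of a session against the tutorial’s flipr_test database; the change name and the --mode value are illustrative, and exact flag spellings may vary in this developer release:

% sqitch init
% sqitch add appuser
% sqitch deploy -d flipr_test --mode tag
% sqitch revert -d flipr_test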

The rework command is my solution to the problem of code duplication. It does not (yet) rely on VCS history, so it still duplicates code. However, it does so in such a way that it is easier to see what has changed: the newly copied files are used by the previous instance of the change, while the new instance continues to use the existing files. So a diff, while showing the copied files in toto, shows the actual modifications to the existing scripts, making them easier to follow. I think this is a decent compromise: it allows Sqitch to be used with or without a VCS, without giving up the advantages of VCS integration in the future.

The only requirement for reworking a change is that there must be a tag on that change or on a change following it. Sqitch uses that tag in the names of the files for the previous instance of the change, as well as in internal IDs, so it’s required to disambiguate the scripts and deployment records of the two instances. The assumption here is that tags are generally used when a project is released; otherwise, during development, you would just go back and modify the change’s scripts directly, then revert and re-deploy to get the changes into your dev database. But once you tag, that’s a sort of promise that nothing will be changed prior to the tag.
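
For example, here is a hypothetical sketch of the deploy directory after reworking change_pass, assuming a tag @v1.0.0-dev2 on or after it (the tag-suffixed file name illustrates the naming pattern, not its exact spelling):

% sqitch rework change_pass
% find ./deploy
./deploy/change_pass.sql
./deploy/change_pass@v1.0.0-dev2.sql

The tag-suffixed copy is frozen for the previous instance, while change_pass.sql is free to be edited for the new one.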

I modify change scripts a lot in my own database development projects. Naturally, I think it is important to be free to change deployment scripts however one likes while doing development, and also important to promise not to change them once they have been released. As long as tags are generally thought of as marking releases or other significant milestones, it seems a reasonable promise not to change anything that appears before a tag.

See the tutorial for a detailed example. In a future release, VCS integration will be added, and the duplicated files will be unnecessary, too. But the current approach has the advantage that it will work anywhere, VCS or no. The VCS support will be backward-compatible with this design (indeed, it depends on it).

Still To Do

I think I might hold off a bit on the VCS integration, since the rework command no longer requires it. There also needs to be support for database engines other than PostgreSQL. But otherwise, what mostly remains are the informational commands, packaging, and testing:

  • sqitch status – Show the current deployment status of a database
  • sqitch log – Show the deploy and revert history of a database
  • sqitch bundle – Bundle up the configuration, plan, and scripts for distribution packaging
  • sqitch test – Test changes. Mostly hand-wavy; see below
  • sqitch check – Validate a database deployment history against the plan

I will likely be working on the status and log commands next, as well as an SQLite engine, to make sure I have the engine encapsulation right.

Outstanding Questions

I’m still pondering some design decisions. Your thoughts and comments greatly appreciated.

  • Sqitch now requires a URI, which is set in the local configuration file by the init command. If you don’t specify one, it just creates a UUID-based URI. The URI is required to make sure that changes have unique IDs across projects (a change in one project may have the same name as a change in another). But maybe this should be more flexible? Maybe, like Git, Sqitch should require a user name and email address instead? They would have to be added to the change lines of the plan, which is what has given me pause up to now: it would be annoying to parse.

  • How should testing work? When I do PostgreSQL stuff, I am of course rather keen on pgTAP. But I don’t think it makes sense to require a particular format of output or anything of that sort; it just wouldn’t be engine-neutral enough. So maybe test scripts should just be run, and considered passing if the engine client exits successfully and failing if it does not? That would allow one to use whatever testing the engine supports, although I would have to find some way to get pgTAP to make psql exit non-zero on failure (a sketch of one possibility follows this list).

    Another possibility is to require expected output files, and to diff them. I’m not too keen on this approach, as it makes it much more difficult to write tests to run on multiple engine versions and platforms, since the output might vary. It’s also more of a PITA to maintain separate test and expect files and keep them in sync. Still, it’s a tried-and-true approach.
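
For the pgTAP case, one possibility would be to check pgTAP’s num_failed() at the end of a test script and raise an exception if anything failed, so that psql, run with ON_ERROR_STOP, exits non-zero. A minimal sketch, assuming pgTAP is installed in the target database and using a hypothetical test file path:

-- Run with: psql --set ON_ERROR_STOP=1 -d flipr_test -f test/appuser.sql
BEGIN;
SELECT plan(1);
SELECT has_table('users');
-- Abort the script (and force a non-zero psql exit) if any test failed.
DO $$
BEGIN
    IF num_failed() > 0 THEN
        RAISE EXCEPTION '% test(s) failed', num_failed();
    END IF;
END
$$;
SELECT * FROM finish();
ROLLBACK;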

Help Wanted

Contributions would be warmly welcomed. See the to-do list for what needs doing. Some highlights and additional items:

  • Convert to Dist::Zilla
  • Implement the Locale::TextDomain-based localization build. Should be done at distribution build time, not install time. Ideally, there would be a Dist::Zilla plugin to do it, based on the pattern implemented in this example Makefile (see also this README).
  • The web site could use some updating, though I realize it will regularly need changing until most of the core development has completed and more documentation has been written.
  • Handy with graphics? The project could use a logo. Possible themes: SQL, databases, change management, baby Sasquatch.
  • Packaging. It would greatly help developers and system administrators who don’t do CPAN if they could just use their familiar OS installers to get Sqitch. So RPM, Debian package, Homebrew, BSD Ports, and Windows distribution support would be hugely appreciated.

Take it for a Spin!

Please do install the v0.51 developer release from the CPAN (run cpan D/DW/DWHEELER/App-Sqitch-0.51-TRIAL.tar.gz) and kick the tires a bit. Follow along with the tutorial to get a feel for it, or even just review the tutorial example’s Git history. And if there is something you want out of Sqitch that you don’t see, please feel free to file an issue with your suggestion.

Sqitch: Rename Step Objects and the SQL Directory?

After all of the thinking and rethinking about how to manage a Sqitch plan, I am just about done with all the changes to make it work. One of the changes I’ve made is that tags are no longer objects that stand on their own between change steps, but simply names that refer to specific change steps. Not only is this much more like how a VCS thinks of tags (basically another name for a single commit), but it also greatly simplifies the code for iterating over a plan and updating metadata in the database.

But now that a plan is, in its essence, just a list of “steps”, I’m wondering if I should change that term. I originally used the term “steps” because the original plan was to have a deploy work on a tag-to-tag basis, where a single tag could have a series of changes associated with it. By that model, each change was a “step” toward deploying the tag. If any of the steps for a single tag failed, they were all reverted.

But while one can still specify a tag as a deploy target (and optionally have it revert to an earlier tag on failure), it no longer makes sense to think of each change script as a step toward deploying a target. It’s just a change. Yes, as an object it has separate deploy, revert, and test scripts associated with it, but I’m thinking it still makes sense to call them “changes” instead of “steps,” because they’re individual things rather than collections of things that lead to some goal.

What do you think?

In other renaming news, I’m thinking of changing the default directory that stores the step/change scripts. Right now it’s sql (though you can make it whatever you want). The plan file goes into the current directory (assumed to be the root directory of your project), as does the local configuration file. So the usual setup is:

% find .
./sqitch.conf
./sqitch.plan
./sql/deploy/
./sql/revert/
./sql/test/

I’m thinking of changing this in two ways:

  • Make the default location of the plan file be in the top-level script directory. This is because you might have different Sqitch change directories for different database platforms, each with its own plan file.
  • Change the default top-level script directory to . (the current directory).

As a result, the usual setup would be:

% find .
./sqitch.conf
./sqitch.plan
./deploy/
./revert/
./test/

If you still wanted the change scripts kept all in a subdirectory, say db/, it would be:

% find .
./sqitch.conf
./db/sqitch.plan
./db/deploy/
./db/revert/
./db/test/

And if you have a project with, say, two sqitch deployment setups, one for PostgreSQL and one for SQLite, you might make it:

% find .
./sqitch.conf
./postgres/sqitch.plan
./postgres/deploy/
./postgres/revert/
./postgres/test/
./sqlite/sqitch.plan
./sqlite/deploy/
./sqlite/revert/
./sqlite/test/

This works because the configuration file has separate sections for each engine (PostgreSQL and SQLite), and so can be used for all the projects; only the --top-dir option would need to change to switch between them. Each engine has its own plan file.
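
Here is a hypothetical sketch of such a configuration file, in the Git-style INI format Sqitch uses; the section and key names are assumptions for illustration only:

# ./sqitch.conf
[core]
    engine = pg
[core "pg"]
    db_name = flipr_test
[core "sqlite"]
    db_name = flipr_test.db

Switching engines would then just be a matter of sqitch --top-dir postgres deploy versus sqitch --top-dir sqlite deploy.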

And yeah, having written it all out here, I’m pretty convinced. What do you think? Comments welcome.
