Just a Theory

Black lives matter

Posts about Databases

Sqitch v1.0.0

Sqitch Logo

After seven years of development and hundreds of production database deployments, I finally decide it was time to release Sqitch v1.0.0, and today’s the day. I took the opportunity to resolve all known bugs in previous releases, so there’s no new functionality since v0.9999. Still, given the typical attention given to a significant milestone release like 1.0.0, my employer published a blog post describing a bit of the history and philosophy behind Sqitch.

The new site goes into great detail describing how to install Sqitch, but the important links are:

  • CPAN — Install Sqitch from CPAN
  • Docker — Run Sqitch from a Docker container
  • Homebrew — Homebrew Sqitch on macOS
  • GitHub — Sqitch releases on GitHub

Thanks to everyone who helped get Sqitch to this point, I appreciate it tremendously. I’m especially grateful to:

Thanks a million for all your help and support!

Brent Simmons is Not Wrong

Brent Simmons:

Database people are already gasping for air, because they know what’s coming. Instead of creating a separate table for attachment metadata, I created an attachments column in the notes table and just encoded the attachment metadata there.

On iOS it uses Core Data’s built-in object archiving feature. On the server it’s stored as JSON.

This is wrong, surely; it’s not how to do this. Except, in this case, it is. Incomplete object graphs are wrong; inefficient and slower syncing with more complex server-side code is also wrong.

This is less wrong than the alternatives.

Some database folks might be gasping for air, but not those of us steeped in relational theory. In Database in Depth, relational theorist C.J. Date poses a question:

“Database in Depth,” by C.J. Date

In Chapter 1, I said that 1NF meant that every tuple in every relation contains just a single value (of the appropriate type, of course) in every attribute position—and it’s usual to add that those “single values” are supposed to be atomic. But this latter requirement raises the obvious question: what does it mean for data to be atomic?

Well, on page 6 of the book mentioned earlier, Codd defines atomic data as data that “cannot be decomposed into smaller pieces by the DBMS (excluding certain special functions).” But even if we ignore that parenthetical exclusion, this definition is a trifle puzzling, and not very precise. For example, what about character strings? Are character strings atomic? Every product I know provides several operators on such strings—LIKE, SUBSTR (substring), “||” (concatenate), and so on—that clearly rely on the fact that character strings in general can be decomposed by the DBMS. So are those strings atomic? What do you think?

The whole book is worth a read, especially the first four chapters, as it does an excellent job of dispelling the myth that complex data types are verboten in a properly normalized relational model. Another gem:

But I could have used any number of different examples to make my point: I could have shown attributes (and therefore domains) that contained arrays; or bags; or lists; or photographs; or audio or video recordings; or X rays; or fingerprints; or XML documents; or any other kind of value, “atomic” or “nonatomic,” that you might care to think of. Attributes, and therefore domains, can contain anything (any values, that is). All of which goes a long way, incidentally, toward explaining why a true “object/relational” system would be nothing more nor less than a true relational system—which is to say, a system that supports the relational model, with all that such support entails.

This is exactly why PostgreSQL offers array, XML, and JSON data types: because sometimes, they are exactly what you need to properly model your system.

So you go, Brent, you’re doing it exactly right.

Toward A Database URI Standard

As part of my effort to improve Sqitch, I plan to add support for specifying deployment targets via URIs. Inspired by Git remotes, targets will greatly simplify the specification of databases to update — especially when stored as named targets in the configuration file.

Before implementing it, though, I started casting about for a standard URI Scheme for database connections. Imagine my surprise1 to find that there is none! The closest thing to a standard is JDBC URLs. Formally, their format is simply:


Turns out that JDBC URLs are barely URLs at all. I mean, fine, according to RFC 3986 they start with the jdbc: scheme followed by whatever. According to the JDBC docs, what comes after the scheme is defined as follows:


The “subprotocol” is simply a driver name, while the the format of the “subname can vary, depending on the subprotocol, and it can have any internal syntax the driver writer chooses, including a subsubname.” In other words, it can be anything at all. Not very satisfying, or particularly “standard."2

In poking around the net, however, I found a fair number of database URI formats defined by various projects:

All very similar, right? Most database engines support all or a subset of these connection parts in common:

  • username
  • password
  • host address
  • port
  • database name
  • configuration parameters

So why not define a standard database URI format with all those parts, and use them where appropriate for each engine? It’s all right there, just like http URLs.

The Proposal

Here’s my proposal. Formally, it’s an opaque URI like JDBC. All database URIs start with the scheme db:. But in this case, the opaque part is an embedded URI that may be in one of two formats:


In other words, a pretty typical http- or mailto-style URI format. We embed it in a db: URI in order to identify the URI as a database URI, and to have a single reasonable scheme to register. Informally, it’s simplest to think of a database URI as a single URI starting with the combination of the scheme and the engine, e.g., db:mysql.

Some notes on the formats:

  • The Database URI scheme is db. Consequently, database URIs always start with db:. This is the URI scheme that defines a database URI.

  • Next comes the database engine. This part is a string naming the type of database engine for the database. It must always be followed by a colon, :. There is no formal list of supported engines, though certain implementations may specify engine-specific semantics, such as a default port.

  • The authority part is separated from the engine by a double slash, //, and terminated by the next slash or end of the URI. It consists of an optional user-information part, terminated by @ (e.g., username:password@); a required host address (e.g., domain name or IP address); and an optional port number, preceded by a colon, :.

  • The path part specifies the database name or path. For URIs that contain an authority part, a path specifying a file name must be absolute. URIs without an authority may use absolute or relative paths.

  • The optional query part, separated by a question mark, ?, contains key=value pairs separated by a semicolon, ;, or ampersand, &. These parameters may be used to configure a database connection with parameters not directly supported by the rest of the URI format.


Here are some database URIs without an authority part, which is typical for non-server engines such as SQLite, where the path part is a relative or absolute file name:

  • db:sqlite:
  • db:sqlite:foo.db
  • db:sqlite:../foo.db
  • db:sqlite:/var/db/foo.sqlite

Other engines may use a database name rather than a file name:

  • db:ingres:mydb
  • db:postgresql:template1

When a URI includes an authority part, it must be preceded by a double slash:

  • db:postgresql://example.com
  • db:mysql://root@localhost
  • db:pg://postgres:secr3t@example.net

To add the database name, separate it from the authority by a single slash:

  • db:postgresql://example.com/template1
  • db:mongodb://localhost:27017/myDatabase
  • db:oracle://scott:tiger@foo.com/scott

Some databases, such as Firebird, take both a host name and a file path. These paths must be absolute:

  • db:firebird://localhost/tmp/test.gdb
  • db:firebird://localhost/C:/temp/test.gdb

Any URI format may optionally have a query part containing key/value pairs:

  • db:sqlite:foo.db?foreign_keys=ON;journal_mode=WAL
  • db:pg://localhost:5433/postgres?client_encoding=utf8;connect_timeout=10


In discussing this proposal with various folks, I’ve become aware of a few challenges to standardization.

First, the requirement that the authority part must include a host address prevents the specification of a URI with a username that can be used to connect to a Unix socket. PostgreSQL and MySQL, among others provide authenticated socket connections. While RFC 3986 requires the host name, its predecessor, RFC 2396, does not. Furthermore, as a precedent, neither do file URIs. So I’m thinking of allowing something like this to connect to a PostgreSQL database


In short, it makes sense to allow the user information without a host name.

The second issue is the disallowing of relative file names in the path part following an authority part. The problem here is that most database engines don’t use paths for database names, so a leading slash makes no sense. For example, in db:pg:localhost/foo, the PostgreSQL database name is foo, not /foo. Yet in db:firebird:localhost/foo, the Firebird database name is a path, /foo. So each engine implementation must know whether or not the path part is a file name.

But some databases may in fact allow a path to be specified for a local connection, and a name for a remote connection. Informix appears to support such variation. So how is one to know whether the path is a file path or a named database? The two variants cannot be distinguished.

RFC 2396 is quite explicit that the path part must be absolute when following an authority part. But RFC 3986 forbids the double slash only when there is no authority part. Therefore, I think it might be best to require a second slash for absolute paths. Engines that use a simple name or relative path can have it just after the slash, while an absolute path could use a second slash:

  • Absolute: db:firebird://localhost//tmp/test.gdb
  • Relative: db:firebird://localhost/db/test.gdb
  • Name: db:postgresql://localhost/template1

That’s It

The path issue aside, I feel like this is a pretty simple proposal, and could have wide utility. I’ve already knocked out a Perl reference implementation, URI::db. Given the wide availability of URI parsers in various programming languages, I wouldn’t expect it to be difficult to port, either.

The uri-db project is the canonical home for the proposal for now, so check there for updates. And your feedback would be appreciated! What other issues have I overlooked? What have I got wrong? Let me know!

  1. As in not surprised at all. Though I was hoping! ↩︎

  2. DSNs for Perl’s DBI aren’t much better: dbi:<driver>:<driver-specific-stuff>↩︎

Whither Impala Fault Tolerance?

Justin Erickson on the Cloudera Blog

In December 2012, while Cloudera Impala was still in its beta phase, we provided a roadmap for planned functionality in the production release. In the same spirit of keeping Impala users, customers, and enthusiasts well informed, this post provides an updated roadmap for upcoming releases later this year and in early 2014.

Impala is a pretty nice-looking SQLish query engine that runs on Hadoop. It provides the same basic interface as Hive, but circumvents MapReduce to access data directly from data nodes in parallel. But, in my opinion, to be useful as a real-time query engine, it needs fault tolerance. From the Impala FAQ, under the list of unsupported features:

Fault tolerance for running queries (not currently). In the current release, Impala aborts the query if any host on which the query is executing fails. In the future, we will consider adding fault tolerance to Impala, so that a running query would complete even in the presence of host failures.

Sounds like an unfortunate issue. Since it’s pretty typical for data nodes to go down, this seems like an essential feature. Products like CitusDB offer fault tolerance:

Does CitusDB recover from failures?

Yes. The CitusDB master node intelligently re-routes the work on any failed nodes to the remaining nodes in real-time. Since the underlying data are kept in fixed-size blocks in HDFS, a failed node’s work can evenly be distributed among the remaining nodes in the cluster.

That sound exactly right. I’m excited about Citus, and if it adds solid support for more data formats, such as ORC and Parquet, it may well be the way to go. But Impala will be a nice alternative if it can get fault tolerance figured out. I’m disappointed it’s not on the road map.

Multirow Database Updates

William Blunn:

So, given a list of updates to apply we could effect them using the following steps:

  1. Use CREATE TEMPORARY TABLE to create a temporary table to hold the updates
  2. Use INSERT to populate the temporary table with the updates
  3. Use UPDATE … FROM to update the target table using updates in the temporary table
  4. Use DROP TABLE to drop the temporary table

So in the example above we can reduce five statements to four. This isn’t a significant improvement in this case. But now the number of statements is no longer directly dependent on the number of rows requiring updates.

Even if we wanted to update a thousand rows with different values, we could still do it with four statements.

Or you could just use one statement. Here’s how to do it with a CTE on PostgreSQL 9.2 and higher:

WITH up(name, salary) AS ( VALUES
     ('Jane',  1200),
     ('Frank', 1100),
     ('Susan', 1175),
     ('John',  1150)
UPDATE staff
   SET salary = up.salary
  FROM up
 WHERE staff.name = up.name;

Still on PostgreSQL 9.1 or lower? Use a subselect in the FROM clause instead:

UPDATE staff
   SET salary = up.salary
       ('Jane',  1200),
       ('Frank', 1100),
       ('Susan', 1175),
       ('John',  1150)
   ) AS up(name, salary)
 WHERE staff.name = up.name;

Stuck with MySQL or Oracle? Use a UNION query in a second table:

UPDATE staff, (
         SELECT 'Jane' AS name, 1200 AS salary
   UNION SELECT 'Frank',        1100
   UNION SELECT 'Susan',        1175
   UNION SELECT 'John',         1150
) AS up
   SET staff.salary = up.salary
 WHERE staff.name = up.name;

Using SQLite? Might make sense to use a temporary table for thousands or millions of rows. But for just a few, use a CASE expression:

UPDATE staff
   SET salary = CASE name
       WHEN 'Jane'  THEN 1200
       WHEN 'Frank' THEN 1100
       WHEN 'Susan' THEN 1175
       WHEN 'John'  THEN 1150
 WHERE name in ('Jane', 'Frank', 'Susan', 'John');

If you need to support multiple database architectures, sure, use something like DBIx::MultiRow to encapsulate things. But if, like most of us, you’re on one database for an app, I can’t recommend stongly enough how well it pays to get to know your database well.

Data Deployment with Sqitch

I’ve been thinking about data migrations. I love how well Sqitch works for schema changes, but so far have avoided data changes. Some data ought to be managed by the deployment process, rather than by end-user applications. Lists of countries, for example. Yet none of our Sqitch-managed databases include INSERTs, UPDATEs, or DELETEs in deploy scripts. Why not? Two reasons:

  1. These are mainly Postgres ports of existing Oracle databases. As such, I’ve written independent migration scripts that use oracle_fdw to copy data from Oracle. It made no sense to commit hard-coded changes to the deploy script, even for static data, as it was all to be copied from the old production Oracle databases — often months after I wrote the migrations.

  2. These projects include extensive pgTAP unit tests that expect to run many times against an empty database with no side effects. Having different data in testing than in production increases the likelihood of unforeseen behavioral variations. Better to expect no data in tests, freeing them to focus on units of behavior without regard to preexisting data.

Now that we have multiple Sqitch-deployed databases in production, the time has come to address these issues.

Deploy Hooks for External Sources

I propose to resolve the one-time migration requirement with deploy hooks. The idea is similar to Git hooks: Before or after any sqitch deploy, one or more hook scripts can run. The impetus was to ensure some higher level of consistency after every deploy. For example, a post-deploy hook might grant privileges on all tables in a database. Another might run VACCUM; ANALZYE;.

But we could also use it for one-time data migrations. An option to deploy will disable them, which would be useful for development and test databases. Once the migration has been run in production, we just delete the migration hook scripts from the project. Sqitch won’t record hook executions, so adding or removing them will be no problem.

I like this approach, as Sqitch will automatically run migration scripts, but hooks will not change the interface of Sqitch itself. And it’s more generally useful. Hell, I might want deploy hook script that sends an email notification announcing a deployment (though that might require adding support for non-SQL scripts). There are all kinds of things for which hooks will prove useful.

Changes for Static Data Maintenance

For data that must be tied to the deployment process, there are two complications that have prevented me from simply managing them in normal Sqitch changes:

  1. There might be side-effects to the deployment. For example, a foreign key constraint to the users table, to identify the users who added rows to the database. But in a new database, perhaps there are no users — and Sqitch doesn’t create them, the app does. Chicken, meet egg.

  2. The initial data set might derived from some external source, such as another database. Consequently, none of that data was defined in Sqitch deploy scripts. This situation complicates future updates of the data. We can add data via Sqitch in the future, but then we don’t have the canonical list of all rows that should exist in all such databases.

However, I can think of no alternative that does not over-complicate Sqitch itself. I considered adding another change-related script type, named “update”, to complement the existing deploy, verify, and revert scripts. But oftentimes a data change would not be tied to a schema change, so the corresponding deploy script would be a no-op. Seems silly.

I also considered adding a completely separate command specifically for deploying data changes. Yet these data migrations are exactly like schema changes: Sqitch must execute them in the proper order relative to other changes, record successful or failed deployment, and be able to revert them when required. The only difference is what’s defined in them: data modification rather than definition.

Through several drafts of this post, I have come around to the idea that I should change nothing in Sqitch with regard to data deployments. A better solution than the above, I believe, is organizational.

Data Deploy Best Practice

Let the best practice for data deploys be this: they should be contained in Sqitch changes, but such changes should contain only data modifications. No change script should both define a table and insert its initial rows. Keep the table and its data in separate changes — keep DML separate from DDL.

For our list of countries, we might have a change named “countries”, which creates the countries table, and another, named “country_data”, which inserts the appropriate data into that table. Where necessary and appropriate, these changes may use conditional paths to bring the data up-to-date and in sync across deployments.

Conditions must deal with side-effects, such as foreign key constraints. Where possible, such side effects ought be removed from deployment-managed data. For tracking the user or users who added data to a database, for example, one can use the tools of the source code repository (git log, git blame) to assign blame. Other side-effects may be more necessary, but to the extent possible, deployed data should be independent.

Naturally, unit tests must expect static data to be present, and be updated when that data changes. We are, after all, talking about infrequently-updated data. Frequently-updated data should have separate interfaces provided by applications to change the data. Otherwise, how static is it, really?

Notes on Upcoming Sqitch Improvements

I was traveling last week, and knowing I would be offline a fair bit, not to mention seriously jet-lagged, I put my hacking efforts into getting MySQL support into Sqitch. I merged it in yesterday; check out the tutorial if you’re interested in it. I expect to release v0.980 with the MySQL support in a couple of weeks; testing and feedback would most appreciated.

There is a caveat, though: it requires MySQL v5.6.4. So if you’re stuck with an older MySQL, it won’t work. There are two reasons to require v5.6.4:

  • The microsecond precision support in DATETIME values, added in v5.6.4. This makes it much easier to keep things in the proper order (deployments usually take less than a second).
  • The SIGNAL functionality, introduced in v5.5. This allows the schema to mock a check constraint in the Sqitch database, as well as make it much easier to write verify tests (as described in the tutorial and figured out on StackOverflow).

But if you can afford to take advantage of a relatively modern MySQL, give it a shot!

The next release also makes a backwards-incompatible change to the SQLite engine: the default Sqitch database is no longer $db_dir/$db_name-sqitch.$suffix, but $db_dir/sqitch.$suffix. In other words, if you were deploying to a db named /var/db/myapp.db, Sqitch previously kept its metadata in /var/db/myapp-sqitch.db, but now will keep it in /var/db/sqitch.db. This is to make it more like the other engines (MySQL defaults to a database named “sqitch”, and Postgres and Oracle default to a schema named “sqitch”).

It’s also useful if you use the SQLite ATTACHDATABASE command to manage multiple database files in a single project. In that case, you will want to use the same metadata file for all the databases. Keep them all in the same directory with the same suffix and you get just that with the default sqitch database.

If you’d like it to have a different name, use sqitch config core.sqlite.sqitch_db $name to configure it. This will be useful if you don’t want to use the same Sqitch database to manage multiple databases, or if you do, but they live in different directories.

I haven’t released this change yet, and I am not a big-time SQLite user. So if this makes no sense, please comment on this issue. It’ll be a couple of weeks before I release v0.980, so there is time to reverse if if there’s consensus that it’s a bad idea.

But given another idea I’ve had, I suspect it will be okay. The idea is to expand on the concept of a Sqitch “target” by giving it its own command and configuration settings. Basically, it would be sort of like Git remotes: use URIs to specify database connection and parameter info (such as the sqitch database name for SQLite). These can be passed to database-touching commands, such as deploy, revert, log, and the like. They can also be given names and stored in the configuration file. The upshot is that it would enable invocations such as

sqitch deploy production
sqitch log qa
sqitch status pg://localhost/flipr?sqitch_schema=meta

See the GitHub issue for a fuller description of this feature. I’m certain that this would be useful at work, as we have a limited number of databases that we deploy each Sqitch project to, and it’s more of a PITA for my co-workers to remember to use different values for the --db-host, --db-user, --db-name and friends options. The project itself would just store the named list of relevant deployment targets.

And it alleviates the issue of specifying a different Sqitch database on SQLite or MySQL, as one can just create a named target that specifies it in the URI.

Not sure when I will get to this feature, though. I think it would be great to have, and maybe iovation would want me to spend some time on it in the next couple of months. But it might also be a great place for someone else to get started adding functionality to Sqitch.

Oh, and before I forget: it looks like Sqitch might soon get CUBRID support, too, thanks to Ștefan Suciu. Stay tuned!

Looking for the comments? Try the old layout.

Sqitch Mail List

Just a quick post to announce that I’ve set up a Sqitch Google Group. I’ve been getting a lot more email about it lately, and a fair bit of it should be shared more generally. So if you’re interested in Sqitch, sign up! (Don’t like web forums? Me neither. Feel free to subscribe by email, instead.)

Looking for the comments? Try the old layout.

Sqitch on Oracle

I found myself with a little unexpected time at work recently, and since we use Oracle (for a few more months), I decided to port Sqitch. Last night, I released v0.970 with full support for Oracle. I did the development against an 11.2 VirtualBox VM, though I think it should work on 10g, as well.

Sqitch is available from the usual locations. For Oracle support, you’ll need the Instant Client, including SQL*Plus. Make sure you have $ORACLE_HOM set and you’ll be ready to install. Via CPAN, it’s

cpan install App::Sqitch DBD::Oracle

Via Homebrew:

brew tap theory/sqitch
brew install sqitch-oracle

Via ActiveState PPM, install ActivePerl, then run:

ppm install App-Sqitch DBD-Oracle
PGCon 2013

There are a few other minor tweaks and fixed in this release; check the release notes for details.

Want more? I will be giving a half-day tutorial, entitled “Agile Database Development,” on database development with Git, Sqitch, and pgTAP at on May 22 PGCon 2013 in Ottawa, Ontario. Come on up!

Looking for the comments? Try the old layout.

Sqitch: Now with SQLite Support

This week I released Sqitch v0.961. There are a number of great new features v0.95x, including the beginning of two features I’ve had in mind since the beginning: VCS integration and support for multiple databases.

First the VCS integration. This comes in the form of the new checkout command, which automatically makes database changes for you when you change VCS branches. Say you have two branches, “widgets” and “big-fix”, and that their Sqitch plans diverge. If you’re in the “widgets” branch and want to switch to “big-fix”, just run

sqitch checkout big-fix

Sqitch will look at the “big-fix” plan, figure out the last change in common with “widgets”, and revert to it. Then it checks out “big-fix” and deploys. That’s it. Yes, you could do this yourself, but do you really remember the last common change between the two branches? Do you want to take the time to look for it, then revert, check out the new branch, and deploy? This is exactly the sort of common developer task that Sqitch aims to take the pain out of, and I’m thrilled to provide it.

You know what’s awesome, though? This feature never occurred to me. I didn’t come up with it, and didn’t implement it. No, it was dreamt up and submitted in a pull request by Ronan Dunklau. I have wanted VCS integration since the beginning, but had yet to get ‘round to it. Now Ronan has jumpstarted it. A million thanks!

One downside: it’s currently Git-only. I plan to add infrastructure for supporting multiple VCSes, probably with Git and Subversion support to begin with. Watch for that in v0.970 in the next couple months.

The other big change is the addition of SQLite support alongside the existing PostgreSQL support. Fortunately, I was able to re-use nearly all the code, so the SQLite adapter is just a couple hundred lines long. For the most part, Sqitch on SQLite works just like on PostgreSQL. The main difference is that Sqitch stores its metadata in a separate SQLite database file. This allows one to use a single metadata file to maintain multiple databases, which can be important if you use multiple databases as schemas pulled into a single connection via ATTACH DATABASE.

Curious to try it out? Install Sqitch from CPAN or via the Homebrew Tap and then follow the new Sqitch SQLite tutorial.

Of the multitude of other Changes, one other bears mentioning: the new plan command. This command is just like log, except that it shows what is in the plan file, rather than what changes have been made to the database. This can be useful for quickly listing what’s in a plan, for example when you need to remember the names of changes required by a change you’re about to add. The --oneline option is especially useful for this functionality. An example from the tutorial’s plan:

> sqitch plan --oneline
In sqitch.plan
6238d8 deploy change_pass
d82139 deploy insert_user
7e6e8b deploy pgcrypto
87952d deploy delete_flip @v1.0.0-dev2
b0a951 deploy insert_flip
834e6a deploy flips
d0acfa deploy delete_list
77fd99 deploy insert_list
1a4b9a deploy lists
0acf77 deploy change_pass @v1.0.0-dev1
ec2dca deploy insert_user
bbb98e deploy users
ae1263 deploy appschema

I personally will be using this a lot, Yep, scratching my own itch here. What itch do you have to scratch with Sqitch?

In related news, I’ll be giving a tutorial at PGCon next month, entitled “Agile Database Development”. We’ll be developing a database for a web application using Git for source code management, Sqitch for database change management, and pgTAP for unit testing. This is the stuff I do all day long at work, so you can also think of it as “Theory’s Pragmatic approach to Database Development.” See you there?

Looking for the comments? Try the old layout.

Sqitch Update: All Your Rebase Are…Never Mind

I’m pleased to announce the release of Sqitch v0.940. The focus of this release? Sanity.

I’ve been doing a lot of Sqitch-based database development at work. Overall it has worked quite well. Except for one thing: often the order in which changes would be arranged would change from one run to the next. Oy.

Out of Order

The reason? The plan parser would perform a topological sort of all the changes between tags based on their dependencies. I’ve been careful, for the most part, to keep my changes in the proper order in our plan files, but the topological sort would often pick a different order. Still valid in terms of dependency ordering, but different from the plan file.

Given the same inputs, the sort always produced the same order. However, whenever I added a new changes (and I do that all the time while developing), there would then be a new input, which could result in a completely different order. The downside is that I would add a change, run sqitch deploy, and it would die because it thought something needed to be deployed that had already been deployed, simply because it sorted it to come after an undeployed change. So annoying.. It also caused problems in for production deployments, because different machines with different Perls would sort the plans in different ways.

So I re-wrote the sorting part of the the plan parser so that it no longer sorts. The list of changes is now always identical to the order in the plan file. It still checks dependencies, of course, only now it throws an exception if it finds an ordering problem, rather than re-ordering for you. I’ve made an effort to tell the user how to move things around in the plan file to fix ordering issues, so hopefully everything will be less mysterious.

Of course, many will never use dependencies, in which case this change has effect. But it was important to me, as I like to specify dependencies as much as I can, for my own sanity.

See? There’s that theme!

Everyone has a Mom

Speaking of ordering, as we have been starting to do production deployments, I realized that my previous notion to allow developers to reorder changes in the plan file without rebuilding databases was a mistake. It was too easy for someone to deploy to an existing database and miss changes because there was nothing to notice that changes had not been deployed. This was especially a problem before I addressed the ordering issue.

Even with ordering fixed, I thought about how git push works, and realized that it was much more important to make sure things really were consistent than it was to make things slightly more convenient for developers.

So I changed the way change IDs are generated. The text hashed for IDs now includes the ID of the parent change (if there is one), the change dependencies, and the change note. If any of these things change, the ID of the change will change. So they might change a lot during development, while one moves things around, changes dependencies, and tweaks the description. But the advantage is for production, where things have to be deployed exactly right, with no modifications, or else the deploy will fail. This is sort of like requiring all Git merges to be fast-forwarded, and philosophically in line with the Git practice of never changing commits after they’re pushed to a remote repository accessible to others.

Curious what text is hashed for the IDs? Check out the new show command!


As a database hacker, I still need things to be relatively convenient for iterative development. So I’ve also added the rebase command. It’s simple, really: It just does a revert and a deploy a single command. I’m doing this all day long, so I’m happy to save myself a few steps. It’s also nice that I can do sqitch rebase @HEAD^ to revert and re-apply the latest change over and over again without fear that it will fail because of an ordering problem. But I already mentioned that, didn’t I?

Order Up

Well, mostly. Another ordering issue I addressed was for the revert --to option. It used to be that it would find the change to revert to in the plan, and revert based on the plan order. (And did I mention that said order might have changed since the last deploy?) v0.940 now searches the database for the revert target. Not only that, the full list of changes to deploy to revert to the target is also returned from the database. In fact, the revert no longer consults the plan file at all. This is great if you’ve re-ordered things, because the revert will always be the reverse order of the previous deploy. Even if IDs have changed, revert will find the changes to revert by name. It will only fail if you’ve removed the revert script for a change.

So simple, conceptually: revert reverts in the proper order based on what was deployed before. deploy deploys based on the order in the plan.


As a result of the improved intelligence of revert, I have also deprecated the @FIRST and @LAST symbolic tags. These tags forced a search of the database, but were mainly used for revert. Now that revert always searches the database, there’s nothing to force. They’re still around for backward compatibility, but no longer documented. Use @ROOT and @HEAD, instead.

Not Over

So lots of big changes, including some compatibility changes. But I’ve tried hard to make them as transparent as possible (old IDs will automatically be updated by deploy). So take it for a spin!

Meanwhile, I still have quite a few other improvements I need to make. On my short list are:

Looking for the comments? Try the old layout.

Sqitch v0.80: Now With More You

Last night, I uploaded Sqitch v0.80, the latest dev release of the simple database change management system I’ve been working on. I’m kind of stunned by the sheer number of changes in this release, given that the interface has not changed much. Mainly, there’s more you in this version. That is, like Git, the first thing you’ll want to do after installing Git is tell it who you are:

> sqitch config --user user.name 'Marge N. O’Vera'
> sqitch config --user user.email 'marge@example.com'

This information is now recorded for every change added to a project plan, as well as every commit to the database (deploys, reverts, and failures). If you don’t tell Sqitch who you are, it will try to guess, but you might not like who it finds.

Changes and tags now also require a note to be associated with them, kind of like a Git commit message. This allows a bit more context to be provided about a change or tag, since the name may not be sufficient. All of this is recorded in the plan file, which makes it harder to edit by hand, since the lines are so much longer now. An example:


appuser 2012-08-01T15:04:13Z Marge N. O’Vera <marge@example.com> # App database user with limited permissions.
users [:appuser] 2012-08-01T15:36:00Z Marge N. O’Vera <marge@example.com> # Creates table to track our users.
insert_user [:users :appuser] 2012-08-01T15:41:17Z Marge N. O’Vera <marge@example.com> # Creates a function to insert a user.
change_pass [:users :appuser] 2012-08-01T15:41:46Z Marge N. O’Vera <marge@example.com> # Creates a function to change a user password.
@v1.0.0-dev1 2012-08-01T15:48:04Z Marge N. O’Vera <marge@example.com> # Tag v1.0.0-dev1.

But each change and tag is still on a single line, so it’s not too bad if you absolutely must edit it. Still, I expect to discourage that in favor of adding more commands for manipulating it (adding and removing dependencies, changing the note, etc.).

Given all this data, the output of the log command has expanded quite a lot. Here’s an example from the tutorial’s example project:

On database flipr_test
Deploy 7ad1cc6d1706c559dceb3101e7c21786dc7d7b4c
Name:      change_pass
Committer: Marge N. O’Vera 
Date:      2012-08-01 22:20:54 +0200

    Change change_pass to use pgcrypto.

Deploy 799ecd26730a684cf02a889c30371a0af55150cc
Name:      insert_user
Committer: Marge N. O’Vera 
Date:      2012-08-01 22:20:54 +0200

    Change insert_user to use pgcrypto.

Revert 799ecd26730a684cf02a889c30371a0af55150cc
Name:      insert_user
Committer: Marge N. O’Vera 
Date:      2012-08-01 22:20:52 +0200

    Change insert_user to use pgcrypto.

Revert 7ad1cc6d1706c559dceb3101e7c21786dc7d7b4c
Name:      change_pass
Committer: Marge N. O’Vera 
Date:      2012-08-01 22:20:52 +0200

    Change change_pass to use pgcrypto.

Deploy 7ad1cc6d1706c559dceb3101e7c21786dc7d7b4c
Name:      change_pass
Committer: Marge N. O’Vera 
Date:      2012-08-01 22:20:46 +0200

    Change change_pass to use pgcrypto.

Deploy 799ecd26730a684cf02a889c30371a0af55150cc
Name:      insert_user
Committer: Marge N. O’Vera 
Date:      2012-08-01 22:20:46 +0200

    Change insert_user to use pgcrypto.

Note the use of color to identify the event type: green for deploys and blue for reverts. Failures appear in red. Not sure I like it yet, but I think it might be useful. We’ll see.

Back to the plan. Notice that it now also includes pragmas for a project name and URI. Those lines again:


The project name is required when initializing a Sqitch project, but the URI is optional (at least for now). The point of these data points is double:

  • The project name is used (along with the current timestamp and your name and email address) when hashing changes and tags to generate IDs. This ensures that the IDs are likely to be globally unique.
  • In the future, you will be able to declare cross-project dependencies.

The second point is the more important. The plan is to require the name of a project before the : in a dependency. For example, if I wanted to require the insert_user change from the “flipr” project plan above, I would declare it as flipr:insert_user. Sqitch will then know to check for it. I will be adding this pretty soon, since it requires some database changes and we’re going to need it at work. The need for database changes is also why v0.80 is still a dev release. (However I don’t expect the plan format to change beyond this tweak to dependency specification.)

Beyond that, next steps include:

  • Creating an RPM targeting work’s servers. This will probably not be public, though I might add the spec file to the public project.
  • Starting to use Sqitch for some work projects. This will be the first real-world use, which I deem essential for proving the technology. I hope that it does not lead to any more radical redesigns. :–)
  • Implement the SQLite interface to iron out any kinks in the engine API.
  • Switch to Dist::Zilla for building the distribution. I would love a volunteer to help with this; I expect it to be simple for someone well-versed in Dist::Zilla.
  • Add support for localization. Sqitch already uses Locale::TextDomain throughout, so it’s localization-ready. We just need the tools put in place as described in the dzil ticket. Again, I would love help with this.
  • Implement the Bundle command. Should be pretty simple, since, for now at least, all it does is copy files and directories.
  • Add VCS integration. This is less important than it once was, but will still help a lot when working with Sqitch within a VCS. The bundle command would also need to be updated, once this work was done.

But even with all that, I think that Sqitch is finally ready for some serious tire-kicking. To get started, skim the tutorial and take it for a spin (install it by running cpan DWHEELER/App-Sqitch-0.80-TRIAL.tar.gz). Let me know what you like, what you don’t like, and let’s have a discussion about it.

Oh, and for discussions, where should I set up a mail list? Google Groups? Someplace else?

Looking for the comments? Try the old layout.

Sqitch’s Log

Just uploaded Sqitch v0.70 and v0.71. The big change is the introduction of the log command, which allows one to view the deployment history in a database. All events are logged and searchable, including deploys, failed deploys, and reverts. Unlike most other database migration systems, Sqitch has the whole history, so even if you revert back to the very beginning, there is still a record of everything that happened.

I stole most of the interface for the log command from git-log, including:

  • Colorized output
  • Searching against change and committer names via regular expressions
  • A variety of formatting options (“full”, “long”, “medium”, “oneline”, etc.)
  • Extensible formatting with printf-style codes

Here are a couple of examples searching the tutorial’s test database:

> sqitch -d flipr_test log -n 3
On database flipr_test
Deploy 18d7aab59bd0c914a561dc324b1da5549605c376
Name:   change_pass
Date:   2012-07-07 13:26:30 +0200

Deploy 87b4e131897ec370d78be177a3f91fdc877a2515
Name:   insert_user
Date:   2012-07-07 13:26:30 +0200

Deploy 20d9af30b97a3071dce12d91665dcd6237265d60
Name:   pgcrypto
Date:   2012-07-07 13:26:30 +0200
> sqitch -d flipr_test log -n 6 --format oneline --abbrev 7
On database flipr_test
18d7aab deploy change_pass
87b4e13 deploy insert_user
20d9af3 deploy pgcrypto
540359a deploy delete_flip
d4dce7d deploy insert_flip
b715d73 deploy flips

> sqitch -d flipr_test log -n 4 --event revert --event fail --format \
'format:%a %eed %{blue}C%{6}h%{reset}C - %c%non %{cldr:YYYY-MM-dd}d at %{cldr:h:mm a}d%n' 
On database flipr_test
theory reverted 9df095 - appuser
on 2012-07-07 at 1:26 PM

theory reverted 9df0959d078b - users
on 2012-07-07 at 1:26 PM

theory reverted 9df095131e25 - insert_user
on 2012-07-07 at 1:26 PM

theory reverted 9df09502c559 - change_pass
on 2012-07-07 at 1:26 PM

I’m pretty happy with this. Not sure how much it will be used, but it works great!

Looking for the comments? Try the old layout.

Sqitch Status: Now With Status

I’ve just released Sqitch v0.60. The main change is the implementation of the status command, which allows one to see the current deployment status of a database. An example from the updated tutorial:

> sqitch status
# On database flipr_test
# Change:   18d7aab59bd0c914a561dc324b1da5549605c376
# Name:     change_pass
# Deployed: 2012-07-07 13:23:42 +0200
# By:       theory
Nothing to deploy (up-to-date)

If there are changes in the plan after the most recently deployed change, they would be listed like so:

> sqitch status
# On database flipr_test
# Change:   540359a3892d1476f9ca6ccf7d3f9993ac383b68
# Name:     delete_flip
# Tag:      @v1.0.0-dev2
# Deployed: 2012-07-06 19:31:14 +0200
# By:       theory
Undeployed changes:
  * pgcrypto
  * insert_user
  * change_pass

You can also ask it to show the list of deployed changes and applied tags:

> sqitch status --show-tags --show-changes
# On database flipr_test
# Change:   18d7aab59bd0c914a561dc324b1da5549605c376
# Name:     change_pass
# Deployed: 2012-07-07 13:26:30 +0200
# By:       theory
# Changes:
#   change_pass - 2012-07-07 13:26:30 +0200 - theory
#   insert_user - 2012-07-07 13:26:30 +0200 - theory
#   pgcrypto    - 2012-07-07 13:26:30 +0200 - theory
#   delete_flip - 2012-07-07 13:26:30 +0200 - theory
#   insert_flip - 2012-07-07 13:26:30 +0200 - theory
#   flips       - 2012-07-07 13:26:30 +0200 - theory
#   delete_list - 2012-07-07 13:26:30 +0200 - theory
#   insert_list - 2012-07-07 13:26:30 +0200 - theory
#   lists       - 2012-07-07 13:26:30 +0200 - theory
#   change_pass - 2012-07-07 13:26:30 +0200 - theory
#   insert_user - 2012-07-07 13:26:30 +0200 - theory
#   users       - 2012-07-07 13:26:30 +0200 - theory
#   appuser     - 2012-07-07 13:26:30 +0200 - theory
# Tags:
#   @v1.0.0-dev2 - 2012-07-07 13:26:30 +0200 - theory
#   @v1.0.0-dev1 - 2012-07-07 13:26:30 +0200 - theory
Nothing to deploy (up-to-date)

The --date-format option allows one to display the dates in a variety of formats, inspired by the git log --date option:

> sqitch status --date-format long
# On database flipr_test
# Change:   18d7aab59bd0c914a561dc324b1da5549605c376
# Name:     change_pass
# Deployed: 7 juillet 2012 13:26:30 CEST
# By:       theory
Nothing to deploy (up-to-date)

Want to give it a try? Install it with cpan D/DW/DWHEELER/App-Sqitch-0.60-TRIAL.tar.gz and follow along the tutorial.

Now I’m off to add the log command, which shows a history of all deploys and reverts.

Looking for the comments? Try the old layout.

Sqitch Update: Almost Usable

This week, I released v0.50 of Sqitch, the database change management app I’ve been working on for the last couple of months. Those interested in how it works should read the tutorial. A lot has changed since v0.30; here are some highlights:

  • The plan file is now required. This can make merges more annoying, but thanks to a comment from Jakub Narębski, I discovered that Git can be configured to use a “union merge driver”, which seems to simplify things a great deal. See the tutorial for a detailed example.
  • The plan now consists solely of a list of changes, roughly analogous to Git commits. Tags are simply pointers to specific changes.
  • Dependencies are now specified in the plan file, rather than in the deployment scripts. Once the plan file became required, this seemed like the much more obvious place for them.
  • The plan file now goes into the top-level directory of a project (which defaults to the current directory, assumed to be the top level directory of a VCS project), while the configuration file goes into the current directory. This allows one to have multiple top-level directories for different database engines, each with its own plan, and a single configuration file for them all.

Seems like a short list, but in reality, this release is the first I would call almost usable. Most of the core functionality and infrastructure is in place, and the architectural designs have been finalized. There should be much less flux in how things work from here on in, though this is still very much a developer release. Things might still change, so I’m being conservative and not doing a “stable” release just yet.

What works

So what commands actually work at this point? All of the most important functional ones:

  • sqitch init – Initialize a Sqitch project. Creates the project configuration file, a plan file, and directories for deploy, revert, and test scripts
  • sqitch config – Configure Sqitch. Uses the same configuration format as Git, including cascading local, user, and system-wide configuration files
  • sqitch help – Get documentation for specific commands
  • sqitch add – Add a new change to the plan. Generates deploy, revert, and test scripts based on user-modifiable templates
  • sqitch tag – Tag the latest change in the plan, or show a list of existing tags
  • sqitch deploy – Deploy changes to a database. Includes a --mode option to control how to revert changes in the event of a deploy failure (not at all, to last tag, or to starting point)
  • sqitch revert – Revert changes from a database
  • sqitch rework – Copy and modify a previous change

Currently, only PostgreSQL is supported by deploy and revert; I will at least add SQLite support soon.

The rework command is my solution to the problem of code duplication. It does not (yet) rely on VCS history, so it still duplicates code. However, it does so in such a way that it is still easier to see what has changed, because the new files are actually used by the previous instance of the command, while the new one uses the existing files. So a diff command, while showing the new files in toto, actually shows what changed in the existing scripts, making it easier to follow. I think this is a decent compromise, to allow Sqitch to be used with or without a VCS, and without disabling the advantages of VCS integration in the future.

The only requirement for reworking a change is that there must be a tag on that change or a change following it. Sqitch uses that tag in the name of the files for the previous instance of the change, as well as in internal IDs, so it’s required to disambiguate the scripts and deployment records of the two instances. The assumption here is that tags are generally used when a project is released, as otherwise, if you were doing development, you would just go back and modify the change’s scripts directly, and revert and re-deploy to get the changes in your dev database. But once you tag, this is a sort of promise that nothing will be changed prior to the tag.

I modify change scripts a lot in my own database development projects. Naturally, I think it is important to be free to change deployment scripts however one likes while doing development, and also important to promise not to change them once they have been released. As long as tags are generally thought of as marking releases or other significant milestones, it seems a reasonable promise not to change anything that appears before a tag.

See the tutorial for a detailed example. In a future release, VCS integration will be added, and the duplicated files will be unnecessary, too. But the current approach has the advantage that it will work anywhere, VCS or no. The VCS support will be backward-compatible with this design (indeed, it depends on it).

Still To Do

I think I might hold off a bit on the VCS integration, since the rework command no longer requires it. There also needs to be support for database engines other than PostgreSQL. But otherwise, mostly what needs to be done is the informational commands, packaging, and testing:

  • sqitch status – Show the current deployment status of a database
  • sqitch log – Show the deploy and revert history of a database
  • sqitch bundle – Bundle up the configuration, plan, and scripts for distribution packaging
  • sqitch test – Test changes. Mostly hand-wavy; see below
  • sqitch check – Validate a database deployment history against the plan

I will likely be working on the status and log commands next, as well as an SQLite engine, to make sure I have the engine encapsulation right.

Outstanding Questions

I’m still pondering some design decisions. Your thoughts and comments greatly appreciated.

  • Sqitch now requires a URI, which is set in the local configuration file by the init command. If you don’t specify one, it just creates a UUID-based URI. The URI is required to make sure that changes have unique IDs across projects (a change may have the same name as in another project). But maybe this should be more flexible? Maybe, like Git, Sqitch should require a user name and email address, instead? They would have to be added to the change lines of the plan, which is what has given me pause up to now. It would be annoying to parse.

  • How should testing work? When I do PostgreSQL stuff, I am of course rather keen on pgTAP. But I don’t think it makes sense to require a particular format of output or anything of that sort. It just wouldn’t be engine-neutral enough. So maybe test scripts should just run and considered passing if the engine client exits successfully, and failing if it exited unsuccessfully? That would allow one to use whatever testing was supported by the engine, although I would have to find some way to get pgTAP to make psql exit non-zero on failure.

    Another possibility is to require expected output files, and to diff them. I’m not too keen on this approach, as it makes it much more difficult to write tests to run on multiple engine versions and platforms, since the output might vary. It’s also more of a PITA to maintain separate test and expect files and keep them in sync. Still, it’s a tried-and-true approach.

Help Wanted

Contributions would be warmly welcomed. See the to-do list for what needs doing. Some highlights and additional items:

  • Convert to Dist::Zilla
  • Implement the Locale::TextDomain-based localization build. Should be done at distribution build time, not install time. Ideally, there would be a Dist::Zilla plugin to do it, based pattern implemented in this example Makefile (see also this README).
  • The web site could use some updating, though I realize it will regularly need changing until most of the core development has completed and more documentation has been written.
  • Handy with graphics? The project could use a logo. Possible themes: SQL, databases, change management, baby Sasquatch.
  • Packaging. It would greatly help developers and system administrators who don’t do CPAN if they could just use their familiar OS installers to get Sqitch. So RPM, Debian package, Homebrew, BSD Ports, and Windows distribution support would be hugely appreciated.

Take it for a Spin!

Please do install the v0.51 developer release from the CPAN (run cpan D/DW/DWHEELER/App-Sqitch-0.51-TRIAL.tar.gz) and kick the tires a bit. Follow along the tutorial to get a feel for it, or even just review the tutorial example’s Git history to get a feel for it. And if there is something you want out of Sqitch that you don’t see, please feel free to file an issue with your suggestion.

Looking for the comments? Try the old layout.

Sqitch: Rename Step Objects and the SQL Directory?

After all of the thinking and rethinking about how to manage a Sqitch plan, I am just about done with all the changes to make it all work. One of the changes I’ve made is that tags are no longer objects that stand on their own between change steps, but are simply names the refer to specific change steps. Not only is this much more like how a VCS thinks of tags (basically another name for a single commit), but it also greatly simplifies the code for iterating over a plan and updating metadata in the database.

But now that a plan is, in its essence, just a list of “steps”, I’m wondering if I should change that term. I originally used the term “steps” because the original plan was to have a deploy work on a tag-to-tag basis, where a single tag could have a series of changes associated with it. By that model, each change was a “step” toward deploying the tag. If any of the steps for a single tag failed, they were all reverted.

But while one can still specify a tag as a deploy target (and optionally have it revert to an earlier tag one failure), it no longer makes sense to think of each change script as a step toward deploying a target. It’s just a change. Yes, as an object it has separate deploy, revert, and test scripts associated with it, but I’m thinking it still makes sense to call them “changes” instead of “steps.” Because they’re individual things, rather than collections of things that lead to some goal.

What do you think?

In other renaming news, I’m thinking of changing the default directory that stores the step/change scripts. Right now it’s sql (though you can make it whatever you want). The plan file goes into the current directory (assumed to be the root directory of your project), as does the local configuration file. So the usual setup is:

% find .

I’m thinking of changing this in two ways:

  • Make the default location of the plan file be in the top-level script directory. This is because you might have different Sqitch change directories for different database platforms, each with its own plan file.
  • Change the default top-level script directory to ..

As a result, the usual setup would be:

% find .

If you still wanted the change scripts kept in all in a subdirectory, say db/, it would be:

% find .

And if you have a project with, say, two sqitch deployment setups, one for PostgreSQL and one for SQLite, you might make it:

% find .

This works because the configuration file has separate sections for each engine (PostgreSQL and SQLite), and so can be used for all the projects; only the --top-dir option would need to change to switch between them. Each engine has its own plan file.

And yeah, having written out here, I’m pretty convinced. What do you think? Comments welcome.

Looking for the comments? Try the old layout.