Posts about Git

Agile Database Development Tutorial

I gave a tutorial at PGCon a couple weeks back, entitled “Agile Database Development with Git, Sqitch, and pgTAP.” It went well, I think. The Keynote document and an exported PDF have been posted on PGCon.org, and also uploaded here and to Speaker Deck. Want to follow along? Clone the tutorial Git repository. Here’s the teaser:

Hi, I’m David. I like to write database apps. Just as much as I like to write web apps. (Maybe more!) How? Not by relying on bolted-on, half-baked database integration tools like migrations, I’ll tell you that! Instead, I make extensive use of best-of-breed tools for source control (Git), database unit testing (pgTAP), and database change management and deployment (Sqitch). If you’d like to get as much pleasure out of database development as you do application development, join me for this tutorial. We’ll develop a sample application using the processes and tools I’ve come to depend on, and you’ll find out whether they might work for you. Either way, I promise it will at least be an amusing use of your time.


Sqitch: Now with SQLite Support

This week I released Sqitch v0.961. There are a number of great new features since v0.95x, including the beginnings of two features I’ve had in mind from the start: VCS integration and support for multiple databases.

First the VCS integration. This comes in the form of the new checkout command, which automatically makes database changes for you when you change VCS branches. Say you have two branches, “widgets” and “big-fix”, and that their Sqitch plans diverge. If you’re in the “widgets” branch and want to switch to “big-fix”, just run

sqitch checkout big-fix

Sqitch will look at the “big-fix” plan, figure out the last change in common with “widgets”, and revert to it. Then it checks out “big-fix” and deploys. That’s it. Yes, you could do this yourself, but do you really remember the last common change between the two branches? Do you want to take the time to look for it, then revert, check out the new branch, and deploy? This is exactly the sort of common developer task that Sqitch aims to take the pain out of, and I’m thrilled to provide it.
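For the curious, here is roughly the sequence that checkout automates, sketched by hand; the change name is a placeholder for whatever the last common change turns out to be:

# Revert to the last change the two plans share.
sqitch revert --to <last_common_change>
# Switch branches, then deploy the new branch’s changes.
git checkout big-fix
sqitch deploy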

You know what’s awesome, though? This feature never occurred to me. I didn’t come up with it, and didn’t implement it. No, it was dreamt up and submitted in a pull request by Ronan Dunklau. I have wanted VCS integration since the beginning, but had yet to get ‘round to it. Now Ronan has jumpstarted it. A million thanks!

One downside: it’s currently Git-only. I plan to add infrastructure for supporting multiple VCSes, probably with Git and Subversion support to begin with. Watch for that in v0.970 in the next couple months.

The other big change is the addition of SQLite support alongside the existing PostgreSQL support. Fortunately, I was able to re-use nearly all the code, so the SQLite adapter is just a couple hundred lines long. For the most part, Sqitch on SQLite works just like on PostgreSQL. The main difference is that Sqitch stores its metadata in a separate SQLite database file. This allows one to use a single metadata file to maintain multiple databases, which can be important if you use multiple databases as schemas pulled into a single connection via ATTACH DATABASE.
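For example, if your app attaches separate database files as schemas in a single connection, one Sqitch registry file can track changes to all of them. The file names here are illustrative:

-- Attach two database files as schemas in one SQLite connection;
-- a single Sqitch registry database can track changes to both.
ATTACH DATABASE 'users.db' AS users;
ATTACH DATABASE 'widgets.db' AS widgets;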

Curious to try it out? Install Sqitch from CPAN or via the Homebrew Tap and then follow the new Sqitch SQLite tutorial.

Of the multitude of other Changes, one other bears mentioning: the new plan command. This command is just like log, except that it shows what is in the plan file, rather than what changes have been made to the database. This can be useful for quickly listing what’s in a plan, for example when you need to remember the names of changes required by a change you’re about to add. The --oneline option is especially useful for this functionality. An example from the tutorial’s plan:

> sqitch plan --oneline
In sqitch.plan
6238d8 deploy change_pass
d82139 deploy insert_user
7e6e8b deploy pgcrypto
87952d deploy delete_flip @v1.0.0-dev2
b0a951 deploy insert_flip
834e6a deploy flips
d0acfa deploy delete_list
77fd99 deploy insert_list
1a4b9a deploy lists
0acf77 deploy change_pass @v1.0.0-dev1
ec2dca deploy insert_user
bbb98e deploy users
ae1263 deploy appschema

I personally will be using this a lot. Yep, scratching my own itch here. What itch do you have to scratch with Sqitch?

In related news, I’ll be giving a tutorial at PGCon next month, entitled “Agile Database Development”. We’ll be developing a database for a web application using Git for source code management, Sqitch for database change management, and pgTAP for unit testing. This is the stuff I do all day long at work, so you can also think of it as “Theory’s Pragmatic approach to Database Development.” See you there?


Thinking about Changing Sqitch Change IDs

When Sqitch (the database change management app I’ve been working on for the last several months) parses a deployment plan, it creates a unique ID for each change in the plan. This ID is a SHA1 hash generated from information about the change, a string that looks something like this:

project flipr
change add_users_table
planner Marge N. O’Vera <marge@example.com>
date 2012-11-14T01:10:13Z
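To make the idea concrete, here is a minimal Perl sketch of deriving such an ID; the Git-style type-and-length header is an assumption on my part, not necessarily what Sqitch does byte-for-byte:

use Digest::SHA qw(sha1_hex);

# Assemble the change info string shown above.
my $info = join "\n",
    'project flipr',
    'change add_users_table',
    'planner Marge N. O\'Vera <marge@example.com>',
    'date 2012-11-14T01:10:13Z';

# Hash it Git-style: type, length, NUL byte, then the content.
my $id = sha1_hex('change ' . length($info) . "\0" . $info);
print "$id\n";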

The nice thing about the ID is that it’s unique: it’s unlikely that the same user with the same email address will add a change with the same name to a project with the same name within a single second. If the plan includes a URI, that’s included, too, for additional uniqueness.

Note, however, that it does not include information about any other changes. Git, on which I modeled the generation of these IDs, always includes the parent commit SHA1 in its uniquely-identifying info. An example:

> git cat-file commit 744c01bfa3798360c1792a8caf784b650e52d89e               
tree d3a64897cca4538ff5c0c41db3f82ab033a09bec
parent 482a79ae2cda5085eed731be2e70739ab37997ee
author David E. Wheeler <david@justatheory.com> 1337355746 -0400
committer David E. Wheeler <david@justatheory.com> 1337355746 -0400

Timestamp v0.30.

The reason Git does this is so that a commit is not just uniquely identified globally, but can only follow an existing commit. Mark Jason Dominus calls this Linus Torvalds’ greatest invention. Why? This is how Git knows it can fast-forward changes.

Why doesn’t Sqitch do something similar? My original thinking had been to make it easier for a database developer to do iterative development. One of the requirements for that, in my experience, is the ability to freely reorder changes in the plan. Including the SHA1 of the preceding change would make that trickier. But omitting it also means that, when you deploy to a production database, you lose an extra layer of security: the assurance that, yes, the next change really should be deployed. Including the parent ID would make it much harder to deploy with changes missing or changed from what was previously expected. And I think that check is only sane for a production environment.

Given that, I’ve started to rethink my decision to omit the previous change SHA1 from the identifier of a change. Yes, it could be a bit more of a hassle for a developer, but not, I think, that much of a hassle. The main thing would be to allow reverts to look up their scripts just by change name or even file name, rather than by ID. We want deploys to always be correct, but I’m thinking that reverts should always just try very hard to remove changes. Even in production.

I am further thinking that the ID should also include the list of prerequisite changes, for even stronger identification. After all, one might change just the dependencies and nothing else, but it would still be a different change. And maybe it should include the note, too? The end result would be a hash of something like this:

project flipr
change add_users_table
parent 7cd96745746cd6baa5da352de782354b21838b25
requires [schemas roles common:utils]
planner Marge N. O’Vera <marge@example.com>
date 2012-11-14T01:10:13Z

Adds the users table to the database.

This will break existing installations, so I’d need to add a way to update them, but otherwise, I think it might be a win overall. Thoughts?


Sqitch Update: The Plan

I gave my first presentation on Sqitch at PGCon last week. The slides are on Slideshare and the PGCon site. It came together at the last minute, naturally. I was not able to pay as close attention to PGCon sessions as I would have liked, as I was doing last-minute hacking to get the deploy command working on PostgreSQL, and then writing the slides (which are based on the tutorial). I was pleased with the response, given that this is very much a project still under heavy development and available only as a very, very early alpha. There was great discussion and feedback afterward, which I appreciate.

A number of folks offered to help, too. For that I am grateful. I’ve started a list of to-dos to give folks a starting point. Please fork and hack! Find me on #sqitch on Freenode for questions/comments/discussion.

But back to the guts. As a result of the work on the deploy command, as well as thinking about how I and my co-workers do database development with Git, I am starting to revise how I think about the deployment plan. You see, I personally often make a lot of changes to a deployment script as I develop a database, generally over many commits and even many days or weeks. If I were to then rely on the Git history to do deployments, it would probably work, but there might be ten times as many deployments as I actually need, just to get it from zero to release state. I had originally thought that using sqitch bundle --tags-only to create a bundle with a written plan would get around this, as it would write a plan file with only VCS tags for Sqitch tags, rather than every commit. That might be okay for releases, but still not great for the developers, such as myself, who will be using Sqitch as part of the development process all day long.

So now I’m thinking more that Sqitch should rely on an explicit plan file (which was to be the preferred method, if it existed, all along) rather than VCS history. That is, the plan file would be required, and a new command, sqitch plan, would allow one to interactively add steps and tags to it. It would also make it easier for the developer to hand-edit, as appropriate, so as not to rely on a funky Git history.

So I’m toying with changing the plan format, which up until now looked like this:

[alpha]
foo
bar
init

[beta]
users
insert_user
delete_user
update_user

[gamma]
widgets
insert_widget

Each item in brackets is a tag, and each item below it is a deployment step (which corresponds to a script) that is part of that tag. So if you deployed to the beta tag, it would deploy all the way up to the update_user step. You could only specify tags for deployment, and either all the steps for a given tag succeeded or they failed. When you added a step, it was added to the most recent tag.
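With the deploy command’s --to option, deploying to a tag under this scheme would presumably look like this:

sqitch deploy --to beta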

I came up with this approach by playing with git log. But now I’m starting to think that it should feel a bit more gradual, where steps are added and a tag is applied to a certain step. Perhaps a format like this:

foo
bar
init
@alpha

users
insert_user
delete_user
update_user
@beta

widgets
insert_widget

With this approach, one could deploy or revert to any step or tag. And a tag is just added to a particular step. So if you deployed to @beta, it would run all the steps through update_user, as before. But you could also deploy everything, through insert_widget, and then the currently deployed point in the database would have no tag (though perhaps it could use a symbolic tag, HEAD?).
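Deployment and reversion targets could then name either tags or bare steps, presumably something like:

sqitch deploy --to @beta          # deploy through update_user
sqitch deploy --to insert_widget  # deploy past the last tag
sqitch revert --to @alpha         # revert back to the alpha tag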

I like this because it feels a bit more VCS-y. It also makes it easier to add steps to the plan without worrying about tagging before one was ready. And adding steps and tags can be automated by a sqitch plan command pretty easily.

So the plan file becomes the canonical source for deployment planning, and is required. What we’ve lost, however, is the ability to use the same step name at different points in the plan, and to get the proper revision of the step by traveling back in VCS history for it. (Examples of what I mean are covered in a previous post, as well as the aforementioned presentation.) However, I think that we can still do that by complementing the plan with VCS history.

For example, take this plan:

foo
bar
init
@alpha

users
insert_user
delete_user
update_user
@beta

insert_user
update_user
@gamma

Note how insert_user and update_user repeat. Normally, this would not be allowed. But if the plan is in a VCS, and if that VCS has tags corresponding to the plan tags, then we might allow it: when deploying, each step would be deployed at the point in time of the tag that follows it. In other words:

  • foo, bar, and init would be deployed as of the alpha tag.
  • users, insert_user, delete_user, and update_user would be deployed as they were as of the beta tag.
  • insert_user and update_user would again be deployed, this time as of the gamma tag.

This is similar to what I’ve described before, in terms of where in VCS history steps are read from. But whereas before I was using the VCS history to derive the plan, I am here reversing things, requiring an explicit plan and using its hints (tags) to pull stuff from the VCS history as necessary.
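Here is a sketch of what that retrieval might look like under the hood, assuming the plan tags map to Git tags of the same names, the sql/deploy layout used elsewhere in these posts, and a PostgreSQL database named flipr:

# First occurrence: apply insert_user as it stood at the beta tag.
git show beta:sql/deploy/insert_user.sql | psql -d flipr
# Second occurrence: apply it again as it stands at the gamma tag.
git show gamma:sql/deploy/insert_user.sql | psql -d flipr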

I think this could work. I am not sure if I would require that all tags be present, or only those necessary to resolve duplications (both approaches feel a bit magical to me, though I haven’t tried either yet). The latter would probably be more forgiving for users. And overall, I think the whole approach is less rigid, and more likely to allow developers to work the way they are used to working.

But I could be off my rocker entirely. What do you think? I want to get this right, so please, if you have an opinion here, let me have it!


Sqitch — VCS-powered SQL Change Management

Back in January, I wrote three posts outlining some ideas I had for a straightforward, sane approach to SQL change management. The idea revolved around specifying scripts to deploy and revert in a plan file, and generating that plan file from VCS history. I still feel pretty good about the ideas there, and work has agreed to let me write it and open-source it. Here is the first step toward making it happen. I call it “Sqitch.”

Why “Sqitch”? Think of it as SQL changes with Git stuck in the middle. Of course I expect to support VCSs other than Git (probably Subversion and Mercurial, though I am not sure yet), but since Git is what I now have the most familiarity with, I thought it kind of fun to reference a VCS in the name, if only obliquely.

This week, I started work on it. My first task is to outline a draft for the interface. Sqitch will be a command-line tool, primarily. The remainder of this post contains the documentation for the draft interface. Thoughts and feedback would be greatly appreciated, especially if you think I’ve overlooked anything! I do want to keep features pretty minimal for now, though, to build up a solid core to be built on later. But other than that, your criticism is greatly desired.

Next up, I will probably write a tutorial, just so I can make my way through some real-life(ish) examples and notice if I missed anything else. Besides, I’m going to need the tutorial myself! Watch for that next week.

Thanks!


Name

Sqitch - VCS-powered SQL change management

Synopsis

sqitch [<options>] <command> [<command-options>] [<args>]

Description

Sqitch is a VCS-aware SQL change management application. What makes it different from your typical migration-style approaches? A few things:

No opinions

Sqitch is not integrated with any framework, ORM, or platform. Rather, it is a standalone change management system with no opinions on your database or development choices.

Native scripting

Changes are implemented as scripts native to your selected database engine. Writing a PostgreSQL application? Write SQL scripts for psql. Writing a MySQL-backed app? Write SQL scripts for mysql.

VCS integration

Sqitch likes to use your VCS history to determine in what order to execute changes. No need to keep track of execution order; your VCS already tracks information sufficient for Sqitch to figure it out for you.

Dependency resolution

Deployment steps can declare dependencies on other deployment steps. This ensures proper order of execution, even when you’ve committed changes to your VCS out-of-order.

No numbering

Change deployment is managed either by maintaining a plan file or, more usefully, by your VCS history. As such, there is no need to number your changes, although you can if you want. Sqitch does not care what you name your changes.

Packaging

Using your VCS history for deployment but need to ship a tarball or RPM? Easy: just have Sqitch read your VCS history and write out a plan file with your change scripts. Once deployed, Sqitch can use the plan file to deploy the changes in the proper order.

Reduced Duplication

If you’re using a VCS to track your changes, you don’t have to duplicate entire change scripts for simple changes. As long as the changes are idempotent, you can change your code directly, and Sqitch will know it needs to be updated.
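The classic idempotent change is a function definition. A minimal sketch, with an illustrative table and columns:

-- Idempotent: running this twice yields the same function, so the
-- script can be edited in place and simply re-applied.
CREATE OR REPLACE FUNCTION add_user(
    nickname TEXT,
    email    TEXT
) RETURNS VOID LANGUAGE SQL AS $$
    INSERT INTO users (nickname, email) VALUES ($1, $2);
$$;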

Terminology

step

A named unit of change. A step name must be used in the file names of its corresponding deployment and reversion scripts. It may also be used in a test script file name.

tag

A known deployment state with a list of one or more steps that define the tag. A tag also implies that steps from previous tags in the plan have been applied. Think of it as a version number or VCS revision. A given point in the plan may have one or more tags.

state

The current state of the database. This is represented by the most recent tag or tags deployed. If the state of the database is the same as the most recent tag, then it is considered “up-to-date”.

plan

A list of one or more tags and associated steps that define the order of deployment execution. Sqitch reads the plan to determine what steps to execute to change the database from one state to another. The plan may be represented by a “Plan File” or by VCS history.

deploy

The act of deploying database changes to reach a tagged deployment point. Sqitch reads the plan, checks the current state of the database, and applies all the steps necessary to change the state to the specified tag.

revert

The act of reverting database changes to reach an earlier tagged deployment point. Sqitch checks the current state of the database, reads the plan, and applies reversion scripts for all steps to return the state to an earlier tag.

Options

-p --plan-file  FILE    Path to a deployment plan file.
-e --engine     ENGINE  Database engine.
-c --client     PATH    Path to the engine command-line client.
-d --db-name    NAME    Database name.
-u --username   USER    Database user name.
-h --host       HOST    Database server host name.
-n --port       PORT    Database server port number.
   --sql-dir    DIR     Path to directory with deploy and revert scripts.
   --deploy-dir DIR     Path to directory with SQL deployment scripts.
   --revert-dir DIR     Path to directory with SQL reversion scripts.
   --test-dir   DIR     Path to directory with SQL test scripts.
   --extension  EXT     SQL script file name extension.
   --dry-run            Execute command without making any changes.
-v --verbose            Increment verbosity.
-V --version            Print the version number and exit.
-H --help               Print a usage statement and exit.
-M --man                Print the complete documentation and exit.

Options Details

-p
--plan-file

sqitch --plan-file plan.conf
sqitch -p sql/deploy.conf

Path to the deployment plan file. Defaults to ./sqitch.plan. If this file is not present, Sqitch will attempt to read from VCS files. If no supported VCS system is in place, an exception will be thrown. See “Plan File” for a description of its structure.

-e
--engine

sqitch --engine pg
sqitch -e sqlite

The database engine to use. Supported engines include pg (PostgreSQL), mysql (MySQL), and sqlite (SQLite).

-c
--client

sqitch --client /usr/local/pgsql/bin/psql
sqitch -c /usr/bin/sqlite3

Path to the command-line client for the database engine. Defaults to a client in the current path named appropriately for the specified engine.

-d
--db-name

Name of the database. For some engines, such as PostgreSQL and MySQL, the database must already exist. For others, such as SQLite, the database will be automatically created on first connect.

-u
--user
--username

User name to use when connecting to the database. Does not apply to all engines.

-h
--host

Host name to use when connecting to the database. Does not apply to all engines.

-n
--port

Port number to connect to. Does not apply to all engines.

--sql-dir

sqitch --sql-dir migrations/

Path to directory containing deployment, reversion, and test SQL scripts. It should contain subdirectories named deploy, revert, and (optionally) test. These may be overridden by --deploy-dir, --revert-dir, and --test-dir. Defaults to ./sql.
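Assuming all the defaults, the expected layout would be:

sql/deploy/   # deployment scripts
sql/revert/   # reversion scripts
sql/test/     # test scripts (optional)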

--deploy-dir

sqitch --deploy-dir db/up

Path to a directory containing SQL deployment scripts. Overrides the value implied by --sql-dir.

--revert-dir

sqitch --revert-dir db/down

Path to a directory containing SQL reversion scripts. Overrides the value implied by --sql-dir.

--test-dir

sqitch --test-dir db/t

Path to a directory containing SQL test scripts. Overrides the value implied by --sql-dir.

--extension

sqitch --extension ddl

The file name extension on deployment, reversion, and test SQL scripts. Defaults to sql.

--dry-run

sqitch --dry-run

Execute the Sqitch command without making any actual changes. This allows you to see what Sqitch would actually do, without doing it. Implies a verbosity level of 1; add extra --verboses for greater verbosity.

-v
--verbose

sqitch --verbose
sqitch -v

A value between 0 and 3 specifying how verbose Sqitch should be. The default is 0, meaning that Sqitch will be silent. A value of 1 causes Sqitch to output some information about what it’s doing, while 2 and 3 each cause greater verbosity.

-H
--help

sqitch --help
sqitch -H

Outputs a brief description of the options supported by sqitch and exits.

-M
--man

sqitch --man
sqitch -M

Outputs this documentation and exits.

-V
--version

sqitch --version
sqitch -V

Outputs the program name and version and exits.

Sqitch Commands

init

Initialize the database and create deployment script directories if they do not already exist.

status

Output information about the current status of the deployment, including a list of tags, deployments, and dates in chronological order. If any deploy scripts are not currently deployed, they will be listed separately.

check

Sanity check the deployment scripts. Checks include:

  • Make sure all deployment scripts have complementary reversion scripts.

  • Make sure no deployment script appears more than once in the plan file.

deploy

Deploy changes. Configuration properties may be specified under the [deploy] section of the configuration file, or via sqitch config:

sqitch config deploy.$property $value

Options and configuration properties:

--to

Tag to deploy up to. Defaults to the latest tag or to the VCS HEAD commit. Property name: deploy.to.

revert

Revert changes. Configuration properties may be specified under the [revert] section of the configuration file, or via sqitch config:

sqitch config revert.$property $value

Options and configuration properties:

--to

Tag to revert to. Defaults to reverting all changes. Property name: revert.to.

test

Test changes. All SQL scripts in --test-dir will be run. [XXX Not sure whether to have subdirectories for tests and expected output and to diff them, or to use some other approach.]

config

Set configuration options. By default, the options will be written to the local configuration file, sqitch.ini. Options:

--get

Get the value for a given key. Returns error code 1 if the key is not found.

--unset

Remove the line matching the key from the config file.

--list

List all variables set in the config file.

--global

For writing options: write to global ~/.sqitch/config.ini file rather than the local sqitch.ini.

For reading options: read only from global ~/.sqitch/config.ini rather than from all available files.

--system

For writing options: write to system-wide $prefix/etc/sqitch.ini file rather than the local sqitch.ini.

For reading options: read only from system-wide $prefix/etc/sqitch.ini rather than from all available files.

--config-file

Use the given config file.
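Putting these options together, typical usage would presumably look like:

sqitch config core.engine pg           # write to the local sqitch.ini
sqitch config --global core.engine pg  # write to ~/.sqitch/config.ini
sqitch config --get core.engine        # read the current value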

package

Package up all deployment and reversion scripts and write out a plan file. Configuration properties may be specified under the [package] section of the configuration file, or via the sqitch config package.$property $value command. Options and configuration properties:

--from

Tag to start the plan from. All tags and steps prior to that tag will not be included in the plan, and their change scripts will be omitted from the package directory. Useful if you’ve rejiggered your deployment steps to start from a point later in your VCS history than the beginning of time. Property name: package.from.

--to

Tag with which to end the plan. No steps or tags after that tag will be included in the plan, and their change scripts will be omitted from the package directory. Property name: package.to.

--tags-only

Write the plan file with deployment targets listed under VCS tags, rather than individual commits. Property name: package.tags_only.

--destdir

Specify a destination directory. The plan file and deploy, revert, and test directories will be written to it. Defaults to “package”. Property name: package.dest_dir.

Configuration

Sqitch configuration information is stored in standard INI files. The # and ; characters begin comments that continue to the end of the line; blank lines are ignored.

The file consists of sections and properties. A section begins with the name of the section in square brackets and continues until the next section begins. Section names are not case sensitive. Only alphanumeric characters, - and . are allowed in section names. Each property must belong to some section, which means that there must be a section header before the first setting of a property.

All the other lines (and the remainder of the line after the section header) are recognized as setting properties, in the form name = value. Leading and trailing whitespace in a property value is discarded. Internal whitespace within a property value is retained verbatim.

All sections are named for commands except for one, named “core”, which contains core configuration properties.

Here’s an example of a configuration file that might usefully be checked into a VCS for a project that deploys to PostgreSQL and stores its deployment scripts with the extension ddl under the migrations directory. It also wants packages to be created in the directory _build/sql, and packaging to start from the “gamma” tag:

[core]
    engine    = pg
    db        = widgetopolis
    sql_dir   = migrations
    extension = ddl

[revert]
    to        = gamma

[package]
    from      = gamma
    tags_only = yes
    dest_dir  = _build/sql

Core Properties

This is the list of core variables, which must appear under the [core] section. See the documentation for individual commands for their configuration options.

plan_file

The plan file to use. Defaults to sqitch.plan or, if that does not exist, to the VCS history, if available.

engine

The database engine to use. Supported engines include pg (PostgreSQL), mysql (MySQL), and sqlite (SQLite).

client

Path to the command-line client for the database engine. Defaults to a client in the current path named appropriately for the specified engine.

db_name

Name of the database.

username

User name to use when connecting to the database. Does not apply to all engines.

password

Password to use when connecting to the database. Does not apply to all engines.

host

Host name to use when connecting to the database. Does not apply to all engines.

port

Port number to connect to. Does not apply to all engines.

sql_dir

Path to directory containing deployment, reversion, and test SQL scripts. It should contain subdirectories named deploy, revert, and (optionally) test. These may be overridden by deploy_dir, revert_dir, and test_dir. Defaults to ./sql.

deploy_dir

Path to a directory containing SQL deployment scripts. Overrides the value implied by sql_dir.

revert_dir

Path to a directory containing SQL reversion scripts. Overrides the value implied by sql_dir.

test_dir

Path to a directory containing SQL test scripts. Overrides the value implied by sql_dir.

extension

The file name extension on deployment, reversion, and test SQL scripts. Defaults to sql.

Plan File

A plan file describes the deployment tags and scripts to be run against a database. In general, if you use a VCS, you probably won’t need a plan file, since your VCS history should be able to provide all the information necessary to derive a deployment plan. However, if you really do need to maintain a plan file by hand, or just want to better understand the file as output by the package command, read on.

Format

The contents of the plan file are plain text encoded as UTF-8. It is divided up into sections that denote deployment states. Each state has a bracketed, space-delimited list of one or more tags to identify it, followed by any number of deployment steps. Here’s an example of a plan file with a single state and a single step:

[alpha]
users_table

The state has one tag, named “alpha”, and one step, named “users_table”. A state may of course have many steps. Here’s an expansion:

[root alpha]
users_table
insert_user
update_user
delete_user

This state has two tags, “root” and “alpha”, and four steps, “users_table”, “insert_user”, “update_user”, and “delete_user”.

Most plans will have multiple states. Here’s a longer example with three states:

[root alpha]
users_table
insert_user
update_user
delete_user

[beta]
widgets_table
list_widgets

[gamma]
ftw

Using this plan, to deploy to the “beta” tag, the “root”/“alpha” state steps must be deployed, as must the “beta” steps. To then deploy to the “gamma” tag, the “ftw” step must be deployed. If you then choose to revert to the “alpha” tag, then the “gamma” step (“ftw”) and all of the “beta” steps will be reverted in reverse order.

Using this model, steps cannot be repeated between states. One can repeat them, however, if the contents of a file as of a given tag can be retrieved from a VCS. An example:

[alpha]
users_table

[beta]
add_widget
widgets_table

[gamma]
add_user

[44ba615b7813531f0acb6810cbf679791fe57bf2]
widgets_created_at

[HEAD epsilon master]
add_widget

This example is derived from a Git log history. Note that the “add_widget” step is repeated under the state tagged “beta” and under the last state. Sqitch will notice the repetition when it parses this file and, if it is applying all changes, will fetch the version of the file as of the “beta” tag and apply it at that step; then, when it gets to the last state, it will retrieve the deployment file as of that state’s tag and apply it. This works in reverse, as well, as long as the changes in this file are always idempotent.

Grammar

Here is the EBNF Grammar for the plan file:

plan-file   = { <state> | <empty-line> | <comment> }* ;

state       = <tags> <steps> ;

tags        = "[" <taglist> "]" <line-ending> ;
taglist     = <name> | <name> <white-space> <taglist> ;

steps       = { <step> | <empty-line> | <line-ending> }* ;
step        = <name> <line-ending> ;

empty-line  = [ <white-space> ] <line-ending> ;
line-ending = [ <comment> ] <EOL> ;
comment     = [ <white-space> ] "#" [ <string> ] ;

name        = ? non-white space characters ? ;
white-space = ? white space characters ? ;
string      = ? non-EOL characters ? ;

See Also

The original design for Sqitch was sketched out in a number of blog posts.

Other tools that do database change management include:

Rails migrations

Numbered migrations for Ruby on Rails.

Module::Build::DB

Numbered changes in pure SQL, integrated with Perl’s Module::Build build system. Does not support reversion.

DBIx::Migration

Numbered migrations in pure SQL.

Versioning

PostgreSQL-specific dependency-tracking solution by depesz.

Author

David E. Wheeler <david@justatheory.com>

License

Copyright (c) 2012 iovation Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


SQL Change Management Sans Duplication

In the previous episode in this series, I had one more issue with regard to SQL change management that I wanted to resolve:

  1. There is still more duplication of code than I would like, in that a procedure defined in one change script would have to be copied whole to a new script for any changes, even simple single-line changes.

So let’s see what we can do about that. Loading it into Git, our first example looks like this:

> alias sqlhist="git log -p --format='[%H%d]' --name-only --reverse sql/deploy \
| awk '/^\[/ {print \"\"} /./'"
> sqlhist

[3852b378aa029cc610a03806e8268ed452dce8a6 (alpha)]
sql/deploy/users_table.sql

[32883d5a08691351b07928fa4e4fb7e68c500973 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[b8b9f5c152675305c6b2d3e105d55a25019e0828 (HEAD, gamma, master)]
sql/deploy/add_user.sql

(Aside: I’ve created an alias, sqlhist, on the first line, so that all the Git and Awk magic doesn’t clutter the remaining examples.)

So, we’ve got the creation of the users table under the alpha tag, the addition of the widgets table and an accompanying add_widget() function under the beta tag, and the creation of the add_user() function under the gamma tag. So far so good. Now, let’s say that gamma has been deployed to production, and now we’re ready to add a feature for the next release.

Modify This

It turns out that our users really want a timestamp for the time a widget was created. So let’s add a new change script that adds a created_at column to the widgets table. First we add sql/deploy/widgets_created_at.sql with:

-- requires: widgets_table
ALTER TABLE widgets ADD created_at TIMESTAMPTZ;

And then the accompanying revert script, sql/revert/widgets_created_at.sql:

ALTER TABLE widgets DROP COLUMN IF EXISTS created_at;

Commit them and now our deployment configuration looks like this:

> sqlhist

[3852b378aa029cc610a03806e8268ed452dce8a6 (alpha)]
sql/deploy/users_table.sql

[32883d5a08691351b07928fa4e4fb7e68c500973 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[b8b9f5c152675305c6b2d3e105d55a25019e0828 (gamma)]
sql/deploy/add_user.sql

[44ba615b7813531f0acb6810cbf679791fe57bf2 (HEAD, master)]
sql/deploy/widgets_created_at.sql

So far so good. We have a simple delta script that modifies the existing table, and there is no code duplication. Time to modify the add_widget() function to insert the timestamp. Recall that, in the first article in this series, I created a separate sql/deploy/add_widget_v2.sql file, copied the existing function in its entirety into the new file, and modified it there. If we were to do that here, the resulting deployment configuration would look something like this:

> sqlhist

[3852b378aa029cc610a03806e8268ed452dce8a6 (alpha)]
sql/deploy/users_table.sql

[32883d5a08691351b07928fa4e4fb7e68c500973 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[b8b9f5c152675305c6b2d3e105d55a25019e0828 (gamma)]
sql/deploy/add_user.sql

[44ba615b7813531f0acb6810cbf679791fe57bf2]
sql/deploy/widgets_created_at.sql

[dfba488cfd9145928a25d8d48de3231da8454bd2 (HEAD, master)]
sql/deploy/add_widget_v2.sql

Which would be fine, except that if someone else wanted to see what had changed, here’s what git diff would output:

diff --git a/sql/deploy/add_widget_v2.sql b/sql/deploy/add_widget_v2.sql
new file mode 100644
index 0000000..9132195
--- /dev/null
+++ b/sql/deploy/add_widget_v2.sql
@@ -0,0 +1,8 @@
+-- requires: widgets_created_at
+CREATE OR REPLACE FUNCTION add_widget(
+    username   TEXT,
+    widgetname TEXT
+) RETURNS VOID LANGUAGE SQL AS $$
+    INSERT INTO widgets (created_by, name, created_at)
+    VALUES ($1, $2, NOW());
+$$;

So, what changed in the add_widget() function between gamma and now? One cannot tell from this diff: it looks like a brand new function. And no web-based VCS interface will show you, either; it’s just not inherent in the commit. We have to actually know that it was just an update to an existing function, and what files to manually diff, like so:

> diff -u sql/deploy/add_widget.sql sql/deploy/add_widget_v2.sql 
--- sql/deploy/add_widget.sql   2012-01-28 13:06:24.000000000 -0800
+++ sql/deploy/add_widget_v2.sql    2012-01-28 13:26:59.000000000 -0800
@@ -1,8 +1,8 @@
--- requires: widgets_table
-
+-- requires: widgets_created_at
    CREATE OR REPLACE FUNCTION add_widget(
        username   TEXT,
        widgetname TEXT
    ) RETURNS VOID LANGUAGE SQL AS $$
-    INSERT INTO widgets (created_by, name) VALUES ($1, $2);
+    INSERT INTO widgets (created_by, name, created_at)
+    VALUES ($1, $2, NOW());
    $$;

Much better, but how annoying is that? It doesn’t allow us to really take advantage of the VCS, all because we need SQL changes to run in a very specific order.

But let’s ignore that for the moment. Let’s just throw out the commit with add_widget_v2.sql and go ahead and change the add_widget change script directly. So the history now looks like this:

> sqlhist

[3852b378aa029cc610a03806e8268ed452dce8a6 (alpha)]
sql/deploy/users_table.sql

[32883d5a08691351b07928fa4e4fb7e68c500973 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[b8b9f5c152675305c6b2d3e105d55a25019e0828 (gamma)]
sql/deploy/add_user.sql

[44ba615b7813531f0acb6810cbf679791fe57bf2]
sql/deploy/widgets_created_at.sql

[e4b970aa36f27451fe377791eab040a73c6eb47a (HEAD, epsilon, master)]
sql/deploy/add_widget.sql

Naturally, the add_widget script appears twice now, once under the beta tag and once under epsilon (which I’ve just tagged). What are the consequences for our migration? Well, if we were to build a new database from the beginning, running these migrations as listed here, we would get an error while applying the beta changes:

ERROR:  column "created_at" of relation "widgets" does not exist
LINE 5:     INSERT INTO widgets (created_by, name, created_at)

This is because the created_at column won’t exist until the widgets_created_at change is applied. That won’t do, will it? Fortunately, Git knows exactly what the add_widget deploy script looked like under the beta tag, and we can ask it:

> git show beta:sql/deploy/add_widget.sql
-- requires: widgets_table

CREATE OR REPLACE FUNCTION add_widget(
    username   TEXT,
    widgetname TEXT
) RETURNS VOID LANGUAGE SQL AS $$
    INSERT INTO widgets (created_by, name) VALUES ($1, $2);
$$;

Boom, there it is, with no reference to created_at. Using this technique, our SQL deployment app can successfully apply all of our database changes by iterating over the list of changes and applying the contents of each script at the time of the appropriate commit or tag. In other words, it could apply the output from each of these commands:

git show alpha:sql/deploy/users_table.sql
git show beta:sql/deploy/widgets_table.sql
git show beta:sql/deploy/add_widget.sql
git show gamma:sql/deploy/add_user.sql
git show 44ba615b7813531f0acb6810cbf679791fe57bf2:sql/deploy/widgets_created_at.sql
git show epsilon:sql/deploy/add_widget.sql

And everything will work exactly as it should: the original version of the add_widget change script will be applied for the beta tag, and the next version for the epsilon tag. Not bad, right? We get a nice, clean Git history and can exploit it to manage the changes.

Reversion to the Mean

But what about reversion? What if the deploy to epsilon failed, and we need to revert back to gamma? Recall that in the first article, I eliminated duplication by having the add_widget_v2 revert script simply call the add_widget deploy script. But such is not possible now that we’ve changed add_widget in place. What to do?

The key is for the change management script to know the difference between a new change script and a modified one. Fortunately, Git knows that, too, and we can get it to cough up that information with a simple change to the sqlhist alias: instead of passing --name-only, pass --name-status:

> alias sqlhist="git log -p --format='[%H%d]' --name-status --reverse sql/deploy \
| awk '/^\[/ {print \"\"} /./'"

Using this new alias, our history looks like:

> sqlhist

[3852b378aa029cc610a03806e8268ed452dce8a6 (alpha)]
A   sql/deploy/users_table.sql

[32883d5a08691351b07928fa4e4fb7e68c500973 (beta)]
A   sql/deploy/add_widget.sql
A   sql/deploy/widgets_table.sql

[b8b9f5c152675305c6b2d3e105d55a25019e0828 (gamma)]
A   sql/deploy/add_user.sql

[44ba615b7813531f0acb6810cbf679791fe57bf2]
A   sql/deploy/widgets_created_at.sql

[e4b970aa36f27451fe377791eab040a73c6eb47a (HEAD, epsilon, master)]
M   sql/deploy/add_widget.sql

Now we have a letter defining the status of each file. An “A” means the file was added in that commit; an “M” means it was modified. The upshot is that, to revert to gamma, our change management app can see that add_widget was modified in epsilon and, rather than apply a revert change script, it can just apply the version of the script as it existed under gamma:

> git show gamma:sql/deploy/add_widget.sql
-- requires: widgets_table

CREATE OR REPLACE FUNCTION add_widget(
    username   TEXT,
    widgetname TEXT
) RETURNS VOID LANGUAGE SQL AS $$
    INSERT INTO widgets (created_by, name) VALUES ($1, $2);
$$;

And there we are, right back to where we should be. Of course, the remaining deploy script between gamma and epsilon, widgets_created_at, was added in its commit, so we just apply its revert script and we’re set, back to gamma.
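Here is a hypothetical Perl sketch of that revert logic, assuming the --name-status output has already been parsed into (status, file) pairs and a PostgreSQL database named mydb:

# @changes_since_gamma holds [$status, $file] pairs, oldest first.
for my $change (reverse @changes_since_gamma) {
    my ($status, $file) = @$change;
    if ($status eq 'M') {
        # Modified in place: re-apply the version as of the target tag.
        system("git show gamma:$file | psql -d mydb") == 0
            or die "failed to restore $file as of gamma\n";
    }
    else {
        # Added: run the corresponding revert script.
        (my $revert = $file) =~ s{/deploy/}{/revert/};
        system('psql', '-d', 'mydb', '-f', $revert) == 0
            or die "failed to revert $file\n";
    }
}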

Still Configurable

To get back to the original idea of a migration configuration file, I still think it’s entirely doable. All we need is to have the change management app generate it, just as before. When it comes to modified — rather than added — deploy scripts, it can automatically insert new scripts with full copies of previous versions, much as before. The resulting configuration would look something like this:

[3852b378aa029cc610a03806e8268ed452dce8a6 (alpha)]
sql/deploy/users_table.sql

[32883d5a08691351b07928fa4e4fb7e68c500973 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[b8b9f5c152675305c6b2d3e105d55a25019e0828 (gamma)]
sql/deploy/add_user.sql

[44ba615b7813531f0acb6810cbf679791fe57bf2]
sql/deploy/widgets_created_at.sql

[e4b970aa36f27451fe377791eab040a73c6eb47a (HEAD, epsilon, master)]
sql/deploy/add_widget_v2.sql

Note that last line, where we now have add_widget_v2. The change management script would simply generate this file, and create an additional revert script with the same name that just contains the contents of the deploy script as it was under the gamma tag.

Too Baroque?

Having written down these ideas that have plagued my brain for the last week, along with some examples using Git to confirm them, I’m convinced more than ever that this is entirely workable. But it also leads me to wonder if it’s too baroque. I intend these posts as a rough spec for how this thing should work, and I plan to implement it in the coming weeks. But I’m wondering how difficult it will be to explain it all to people.

So let me see if I can break it down to a few simple rules.

  • In general, you should create independent deploy and revert scripts for your SQL. Put a CREATE TABLE statement into its own script. If it requires some other table, declare the dependency. If you need to change it later, create a new script that uses ALTER TABLE.
  • In special cases where a simple change cannot be made without copying something wholesale, and where the deploy script is idempotent, you may simply modify the deploy script in-place.

That’s about it. The idempotence of the deploy script is important for ensuring consistency, and applies very well to features such as user-defined functions. For other objects, there are generally ALTER statements that allow changes to be made without wholesale copying of existing code.

So what am I missing? What have I overlooked? What mistakes in my logic have I made? Do you think this will be too tricky to implement, or to use? Is it hard to understand? Your comments would be greatly appreciated, because I am going to write an app to do this stuff, and want to get it right.

Thanks for sticking with me through all the thought experiments. For my next post on this topic, I expect to have an interface spec for the new app.


VCS-Enabled SQL Change Management

In my previous post, I outlined the basics of a configuration-file and dependency-tracking SQL deployment architecture, but left a couple of additional challenges unresolved. They were:

  1. I would rather not have to hand-edit a configuration file, as it’s finicky and error-prone.

  2. There is still more duplication of code than I would like, in that a procedure defined in one change script would have to be copied whole to a new script for any changes, even simple single-line changes.

I believe I can solve both of these issues by simple use of a VCS. Since all of my current projects use Git, I will use it for the examples here.

Git it On

First, recall the structure of the configuration file, which was something like this:

[alpha]
users_table

[beta]
add_widget
widgets_table

[gamma]
add_user

[delta]
widgets_created_at
add_widget_v2

Basically, we have bracketed tags identifying changes that should be deployed. Now have a look at this:

> git log -p --format='[%H]' --name-only --reverse sql/deploy
[8920aaf7947a56f6777e69a21b70fd877c8fd6dc]

sql/deploy/users_table.sql
[f7da5fd4b7391747f75d85db6fa82de47b9e4c00]

sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql
[ea10b9e566934ef256debe8752504189436e162a]

sql/deploy/add_user.sql
[89e85f98d891a2984ad4e3c42d8ca8cf31f3b2b4]

sql/deploy/add_widget_v2.sql
sql/deploy/widgets_created_at.sql

Look familiar? Let’s use a bit of awk magic to neaten things up (thanks, helwig!):

> git log -p --format='[%H]' --name-only --reverse sql/deploy \
| awk '/^\[/ {print ""} /./'

[8920aaf7947a56f6777e69a21b70fd877c8fd6dc]
sql/deploy/users_table.sql

[f7da5fd4b7391747f75d85db6fa82de47b9e4c00]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[ea10b9e566934ef256debe8752504189436e162a]
sql/deploy/add_user.sql

[89e85f98d891a2984ad4e3c42d8ca8cf31f3b2b4]
sql/deploy/add_widget_v2.sql
sql/deploy/widgets_created_at.sql

Ah, that’s better. We have commit SHA1s for tags, followed by the appropriate lists of deployment scripts. But wait, we can decorate it, too:

> git log -p --format='[%H%d]' --name-only --reverse sql/deploy \
| awk '/^\[/ {print ""} /./'

[8920aaf7947a56f6777e69a21b70fd877c8fd6dc (alpha)]
sql/deploy/users_table.sql

[f7da5fd4b7391747f75d85db6fa82de47b9e4c00 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[ea10b9e566934ef256debe8752504189436e162a (gamma)]
sql/deploy/add_user.sql

[89e85f98d891a2984ad4e3c42d8ca8cf31f3b2b4 (HEAD, delta, master)]
sql/deploy/add_widget_v2.sql
sql/deploy/widgets_created_at.sql

Look at that! Actual VCS tags built right in to the output. So, assuming our deployment app can parse this output, we can deploy or revert to any commit or tag. Better yet, we don’t have to maintain a configuration file, because the VCS is already tracking all that stuff for us! Our change management app can automatically detect if we’re in a Git repository (or Mercurial or CVS or Subversion or whatever) and fetch the necessary information for us. It’s all there in the history. We can name revision identifiers (SHA1s here) to deploy or revert to, or use tags (alpha, beta, gamma, delta, HEAD, or master in this example).
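Parsing that output is straightforward. Here is a quick hypothetical sketch in Perl, assuming $log is a filehandle reading the command’s output:

my (@states, $cur);
while (my $line = <$log>) {
    chomp $line;
    next unless length $line;
    if ($line =~ /^\[([0-9a-f]+)(?:\s+\(([^)]+)\))?\]$/) {
        # A new state: a SHA1 plus any decorations (tags, branches).
        $cur = { sha => $1, tags => [ split /,\s*/, $2 // '' ], steps => [] };
        push @states, $cur;
    }
    else {
        # A deployment script belonging to the current state.
        push @{ $cur->{steps} }, $line;
    }
}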

And with careful repository maintenance, this approach will work for branches, as well. For example, say you have developers working in two branches, feature_foo and feature_bar. In feature_foo, a foo_table change script gets added in one commit, and an add_foo script in a second commit. Merge it into master and the history now looks like this:

> git log -p --format='[%H%d]' --name-only --reverse sql/deploy \
| awk '/^\[/ {print ""} /./'

[8920aaf7947a56f6777e69a21b70fd877c8fd6dc (alpha)]
sql/deploy/users_table.sql

[f7da5fd4b7391747f75d85db6fa82de47b9e4c00 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[ea10b9e566934ef256debe8752504189436e162a (gamma)]
sql/deploy/add_user.sql

[89e85f98d891a2984ad4e3c42d8ca8cf31f3b2b4 (delta)]
sql/deploy/add_widget_v2.sql
sql/deploy/widgets_created_at.sql

[cbb48144065dd345c5248e5f1e42c1c7391a88ed]
sql/deploy/foo_table.sql

[7f89e23c9f1e7fc298c69400f6869d701f76759e (HEAD, master, feature_foo)]
sql/deploy/add_foo.sql

So far so good.

Meanwhile, development in the feature_bar branch has added a bar_table change script in one commit and add_bar in another. Because development in this branch was going on concurrently with the feature_foo branch, if we just merged it into master, we might get a history like this:

> git log -p --format='[%H%d]' --name-only --reverse sql/deploy \
| awk '/^\[/ {print ""} /./'
[8920aaf7947a56f6777e69a21b70fd877c8fd6dc (alpha)]
sql/deploy/users_table.sql

[f7da5fd4b7391747f75d85db6fa82de47b9e4c00 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[ea10b9e566934ef256debe8752504189436e162a (gamma)]
sql/deploy/add_user.sql

[89e85f98d891a2984ad4e3c42d8ca8cf31f3b2b4 (delta)]
sql/deploy/add_widget_v2.sql
sql/deploy/widgets_created_at.sql

[cbb48144065dd345c5248e5f1e42c1c7391a88ed]
sql/deploy/foo_table.sql

[d1882d7b4cfcf5c57030bd5a15f8571bfd7e48e2]
sql/deploy/bar_table.sql

[7f89e23c9f1e7fc298c69400f6869d701f76759e]
sql/deploy/add_foo.sql

[2330da1caae9a46ea84502bd028ead399ca3ca02 (feature_bar)]
sql/deploy/add_bar.sql

[73979ede2c8589cfe24c9213a9538f305e6f508f (HEAD, master, feature_foo)]

Note that bar_table comes before add_foo. In other words, the feature_foo and feature_bar commits are interleaved. If we were to deploy to HEAD, and then need to revert feature_bar, bar_table would not be reverted. This is, shall we say, less than desirable.

There are at least two ways to avoid this issue. One is to squash the merge into a single commit using git merge --squash feature_bar. This would be similar to accepting a single patch and applying it. The resulting history would look like this:

> git log -p --format='[%H%d]' --name-only --reverse sql/deploy \
| awk '/^\[/ {print ""} /./'

[8920aaf7947a56f6777e69a21b70fd877c8fd6dc (alpha)]
sql/deploy/users_table.sql

[f7da5fd4b7391747f75d85db6fa82de47b9e4c00 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[ea10b9e566934ef256debe8752504189436e162a (gamma)]
sql/deploy/add_user.sql

[89e85f98d891a2984ad4e3c42d8ca8cf31f3b2b4 (delta)]
sql/deploy/add_widget_v2.sql
sql/deploy/widgets_created_at.sql

[cbb48144065dd345c5248e5f1e42c1c7391a88ed]
sql/deploy/foo_table.sql

[7f89e23c9f1e7fc298c69400f6869d701f76759e]
sql/deploy/add_foo.sql

[91a048c05e0444682e2e4763e8a7999a869b4a77 (HEAD, master)]
sql/deploy/add_bar.sql
sql/deploy/bar_table.sql

Now both of the feature_bar change scripts come after the feature_foo changes. But it might be nice to keep the history. So a better solution (and the best practice, I believe) is to rebase the feature_bar branch before merging it into master, like so:

> git rebase master
First, rewinding head to replay your work on top of it...
Applying: Add bar.
Applying: Add add_bar().
> git checkout master
Switched to branch 'master'
> git merge feature_bar
Updating 7f89e23..0fab7a0
Fast-forward
 0 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 sql/deploy/add_bar.sql
 create mode 100644 sql/deploy/bar_table.sql
 create mode 100644 sql/revert/add_bar.sql
 create mode 100644 sql/revert/bar_table.sql

And now we should have:

> git log -p --format='[%H%d]' --name-only --reverse sql/deploy \
| awk '/^\[/ {print ""} /./'

[8920aaf7947a56f6777e69a21b70fd877c8fd6dc (alpha)]
sql/deploy/users_table.sql

[f7da5fd4b7391747f75d85db6fa82de47b9e4c00 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[ea10b9e566934ef256debe8752504189436e162a (gamma)]
sql/deploy/add_user.sql

[89e85f98d891a2984ad4e3c42d8ca8cf31f3b2b4 (delta)]
sql/deploy/add_widget_v2.sql
sql/deploy/widgets_created_at.sql

[cbb48144065dd345c5248e5f1e42c1c7391a88ed]
sql/deploy/foo_table.sql

[7f89e23c9f1e7fc298c69400f6869d701f76759e]
sql/deploy/add_foo.sql

[0e53c29eb47c618d0a8818cc17bd5a0aab0acd6d]
sql/deploy/bar_table.sql

[0fab7a0ba928b34a46a9495d4efc1c73d9133d37 (HEAD, master, feature_bar)]
sql/deploy/add_bar.sql

Awesome, now everything is in the correct order. We did lose the feature_foo “tag,” though. That’s because it wasn’t a tag, and neither is feature_bar here. They are, rather, branch names, which becomes obvious when using “full” decoration:

> git log --format='%d' --decorate=full HEAD^..
 (HEAD, refs/heads/master, refs/heads/feature_foo)

After the next commit, it will disappear from the history. So let’s just tag the relevant commits ourselves:

> git tag feature_foo 7f89e23c9f1e7fc298c69400f6869d701f76759e
> git tag feature_bar
> git log -p --format='[%H%d]' --name-only --reverse sql/deploy \
| awk '/^\[/ {print ""} /./'

[8920aaf7947a56f6777e69a21b70fd877c8fd6dc (alpha)]
sql/deploy/users_table.sql

[f7da5fd4b7391747f75d85db6fa82de47b9e4c00 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[ea10b9e566934ef256debe8752504189436e162a (gamma)]
sql/deploy/add_user.sql

[89e85f98d891a2984ad4e3c42d8ca8cf31f3b2b4 (delta)]
sql/deploy/add_widget_v2.sql
sql/deploy/widgets_created_at.sql

[cbb48144065dd345c5248e5f1e42c1c7391a88ed]
sql/deploy/foo_table.sql

[7f89e23c9f1e7fc298c69400f6869d701f76759e (feature_foo)]
sql/deploy/add_foo.sql

[0e53c29eb47c618d0a8818cc17bd5a0aab0acd6d]
sql/deploy/bar_table.sql

[0fab7a0ba928b34a46a9495d4efc1c73d9133d37 (HEAD, feature_bar, master, feature_bar)]
sql/deploy/add_bar.sql

Ah, there we go! After the next commit, one of those feature_bars will disappear, since the branch will have been left behind. But we’ll still have the tag.

Not Dead Yet

Clearly we can use Git intelligently for SQL change management. (Kind of stands to reason, doesn’t it?) Nevertheless, I believe that a configuration file still might have its uses. Not only because not every project is in a VCS (it ought to be!), but because oftentimes a project is not deployed to production as a git clone. It might be distributed as a source tarball or an RPM. In such a case, including a configuration file in the distribution would be very useful. But there is still no need to manage it by hand; our deployment app can generate it from the VCS history before packaging for release.

More to Come

I’d planned to cover the elimination of duplication, but I think this is enough for one post. Watch for that idea in my next post.


Adding a Git SHA1 to your Xcode Project

I found a decent Perl script for adding a Git SHA1 to an Xcode project last week. However, since by day I’m actually a Perl hacker, I couldn’t help finessing it a bit. Here are the steps to add it to your project:

  1. Open your project settings (Project -> Edit Project Settings) and in the “Build” tab, make sure that “Info.plist Output Encoding” is set to “xml”. I lost about an hour of fiddling before I realized that my plist file was binary, which of course couldn’t really be parsed by the simple Perl script.

  2. Edit your app’s Info.plist file. If you want the version to be the SHA1, set “Bundle Version” to “0x000”. I personally prefer to have a separate key for the SHA1, so I created the “LTGitSHA1” key and set its value to “0x000”. This is the placeholder value that will be replaced when your app is built.

  3. Right-click your target app and select Add -> New Build Phase -> New Run Script Build Phase. For the shell command, enter:

    /usr/bin/env perl -wpi -0777
    
  4. Paste this into the “Script” field:

    #!/usr/bin/env perl -wpi -0777
    # Xcode auto-versioning script for Subversion by Axel Andersson
    # Updated for git by Marcus S. Zarra and Matt Long
    # Refactored by David E. Wheeler.
    
    BEGIN {
        @ARGV = ("$ENV{BUILT_PRODUCTS_DIR}/$ENV{INFOPLIST_PATH}");
        ($sha1 = `git rev-parse --short HEAD`) =~ s/\s+$//;
    }
    
    s{0x000}{$sha1};
    
  5. Fetch the value in your code from your bundle, something like this:

    NSLog(
        @"SHA1: %@",
        [[[NSBundle mainBundle] infoDictionary] objectForKey:@"LTGitSHA1"]
    );
    

That’s it. If you want to use a placeholder other than “0x000”, just change it on the last line of the script.

Looking for the comments? Try the old layout.

Tutorial on GitHub

Following a very good suggestion from Pedro Melo, I’ve created a Git repository for this tutorial and put it on GitHub. I replayed each step, making each into its own commit, and tagged the state of the code for each entry.

So as I continue to make modifications, I’ll keep this repository up-to-date, and tag things as of each blog entry. This will make it easy for you to follow along; you can simply clone the repository and git pull for each post.

More soon.

Looking for the comments? Try the old layout.

Checkout and Build Perl 5.10 from Git

Tonight we had a Portland Perl Mongers Perl 5.10 code sprint. This was the first time I’ve ever worked with the Perl 5 core code, and while things weren’t exactly intuitive, it wasn’t bad to get started, either. I already had a clone of the Perl Git repository, so here’s what I did to get started.

Here’s how to clone the Git repository, and then check out the maint-5.10 branch and build and test Perl:

git clone git://perl5.git.perl.org/perl.git
cd perl
git checkout -b maint-5.10 origin/maint-5.10
sh Configure -Ode \
 -DDEBUGGING \
 -Dprefix=/usr/local/perl/blead \
 -Dusedevel \
 -Duseithreads \
 -Dccflags='-I/usr/local/include' -Dldflags='-L/usr/local/lib' \
 -Dlibpth='/usr/local/lib /usr/lib' \
 -Uversiononly \
 -Uinstallusrbinperl $@
make
make test

I had two test failures:

#   at ../lib/ExtUtils/CBuilder/t/00-have-compiler.t line 39.
#          got: '0'
#     expected: '1'
# Looks like you failed 1 test of 4.
FAILED at test 4

And:

#   at pod/pod2usage2.t line 235.
# Got:
# #Usage:
# #    This is a test for CPAN#33020
# #
# #Usage:
# #    And this will be also printed.
# #
# 

So the first thing I want to do is fix those failures. It took a bit of fiddling to figure out how to get it to run a single test. On the advice of Paul Fenwick and Schwern, I tried:

./perl -Ilib lib/ExtUtils/CBuilder/t/00-have-compiler.t

But then the test passed! So something has to be different when running under make test. With help from Duke Leto, I finally figured out the proper incantation:

make test TEST_FILES=../lib/ExtUtils/CBuilder/t/00-have-compiler.t

So the TEST_FILES option tells make test what tests to run, relative to the t/ directory. Shortly thereafter, Schwern told me about:

cd t
./perl TEST ../lib/ExtUtils/CBuilder/t/00-have-compiler.t

Which is nice, because it does a lot less work.

Now I was ready to figure out the ExtUtils::CBuilder bug and fix it. This is good, I thought, because ExtUtils::CBuilder is a Perl module, and while my C is teh suck, my Perl skillz are mad. So I did some poking around, and finally pinned down the failure to the have_compiler() method in ExtUtils::CBuilder::Base. A bit more poking and I had the error printed out:

# error building /tmp/compilet-1529033816.o from '/tmp/compilet-1529033816.c'
at ../../ExtUtils/CBuilder/Base.pm line 110.

Well fuck, that sure looks like a C problem. But then I added some more debugging code to see what command it’s actually running, and from where. So I added this code:

use Cwd;
print STDERR "# CWD: ", getcwd, $/;
print STDERR "# ", join( ' ', @cc, @flags), $/;

And then I got this output:

# CWD: /Users/david/dev/perl/lib/ExtUtils/CBuilder
# ./perl -e1 -- -I/Users/david/dev/perl/perl -c -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -O3 -o ./compilet-892348855.o

And of course, there’s no perl in lib/ExtUtils/CBuilder. Why does it think there is? The test file has this code:

$b->{config}{cc} = "$^X -e1 --";
$b->{config}{ld} = "$^X -e1 --";

Well, that cinches it. $^X is the path to the currently-executing Perl, but in core (for me at least) that’s a relative path. The test changes directories away from that path, and so it can’t find it. So I just needed to get it to use the proper Perl. Frankly, this test should fail on every platform that runs it, so I’m not sure how it got committed to the maint-5.10 branch, but here we are.

So I moved up use File::Spec; and assigned File::Spec->rel2abs($^X) to a variable before the chdir, and used that variable for the cc mock. And that was it, the test passed. Yay!
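
The fix looked something like this (reconstructed from memory; the variable name is mine):

use File::Spec;
# Resolve $^X to an absolute path before the test chdirs away from it.
my $perl = File::Spec->rel2abs($^X);

# Then, where the test mocks the compiler and linker:
$b->{config}{cc} = "$perl -e1 --";
$b->{config}{ld} = "$perl -e1 --";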

Now it was time to submit a patch to p5p. I committed the fix to my local clone, and then, with help from Jacob, generated a patch with git format-patch -1. The -1 means just a single commit, which is all I’d committed. This generated a file, named 0001-Fixed-failing-test-for-ExtUtils-CBuilder.patch. I edited it to add a comment to the effect that the patch is targeted at the maint-5.10 branch.

Next Jacob helped me get git configured to send mail through my gmail account. I just added these lines to ~/.gitconfig:

[sendemail]
        smtpserver = smtp.gmail.com
        smtpserverport = 587
        smtpuser = justatheory
        smtppass = idontthinkso!
        smtpencryption = tls

I then sent the patch with:

git send-email --to perl5-porters@perl.org 0001-Fixed-failing-test-for-ExtUtils-CBuilder.patch

And that was it! You can see the result here. I started to fiddle with the pod2usage failure, but it was a bit obscure for me. Perhaps next time.

And speaking of next time, I’m forking the Perl repository on GitHub, and will probably start to use my own repository for stuff (if it can be kept up to date; RJBS points out that it’s currently way behind the core repo). I’ll likely also create a topic branch before fixing a bug, and then send the patch from there. That way, when it’s committed to the remote branch in core, I won’t have to rebase my local copy: I can just delete the topic branch. This is a nice way to track individual bug fixes that I’ve worked on, while letting the maint-5.10 branch just track the core repository, with no local changes.
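
In practice, that workflow would look something like this (the branch name is hypothetical):

git checkout -b fix-cbuilder origin/maint-5.10
# Hack, test, commit, then generate and mail the patch:
git format-patch maint-5.10
git send-email --to perl5-porters@perl.org 0001-*.patch
# Once the fix is committed upstream:
git checkout maint-5.10
git branch -D fix-cbuilder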

Looking for the comments? Try the old layout.

Git-R-Done

Just a quick followup on the completion of the Bricolage Git migration last week: today I finished writing up a set of GitHub wiki documents explaining to my fellow Bricoleurs how to start hacking. The most important bits are:

  • Working with Git, explaining how to get set up with a forked Bricolage repository
  • Contributing a Bug Fix, an intro to the Git way of doing things (as far as I understand it)
  • Working with Branches, describing how to track a maintenance branch in your fork
  • Merging with Git, to cover the frequent merging from Bricolage maintenance branches into master, and how to get said merges pushed upstream
  • Starting a Project Branch, which you’d need to read if you were taking on a major development task, such as a Summer of Code project
  • Contributing via Email, for those who don’t want a GitHub account (needs fleshing out)
  • Creating a Release, in which the fine art of branching, tagging, and releasing is covered

If you’re familiar with the “Git way,” I would greatly appreciate your feedback on these documents. Corrections and comments are most welcome.

I also just wanted to say that the process of reconstructing the merge history from CVS and Subversion was quite an eye-opener for me. Not because it was difficult (it was) and required a number of hacks (it did), but because it highlighted just how much better a fit Git is for the way in which we do Open Source software development. Hell, probably closed-source, too, for that matter. I no longer will have to think about what revisions to include in a merge, or create a branch just to “tag” a merge. Hell, I’ll probably be doing merges a hell of a lot more often, just because it’s so easy, the history remains intact, and everything just stays more up-to-date and closely integrated.

But I also really appreciate the project-based emphasis of Git. A Subversion repository, I now realize, is really very much like a versioned file system. That means where things go is completely ad-hoc, or convention-driven at best. And god forbid you decide to change the convention and move stuff around! It’s just so much more sane to get a project repository, with all of the history, branches, tags, merges, and everything else, all in one package. It’s more portable, it’s a hell of a lot faster (ever tried to check out a Subversion repository with 80 tags?), and just tighter. It encourages modularization, which can only be good. I’ll tell you, I expect to have some frustrations and challenges as I learn more about using Git, but I’m already very much happier with the overall philosophy.

Enough evangelizing. As a last statement on this, I’ve uploaded the Perl scripts I wrote to do this migration, just in case someone else finds them useful:

  • bric_cvs_to_git migrated a CVS backup to Git.
  • bric_to_git migrated Subversion from r5517 to Git.
  • stitch stitched the CVS-migrated Git repository into the Subversion-migrated Git repository for a final product.

It turned out that there were a few files lost in the conversion, which I didn’t notice until after all was said and done, but overall I’m very happy. My thanks again to Ask and the denizens of #git for all the help.

Looking for the comments? Try the old layout.

Migrating Bricolage CVS and SVN to Git

Now that I’ve successfully migrated the old Bricolage SourceForge CVS repository to Git, and also migrated Subversion to Git, it’s time to stitch the two repositories together into one with all history intact. I’m glad to say that figuring out how to do so took substantially less time than the first two steps, thanks in large part to the great help from “doener,” “Ilari,” and “Fissure” on the Freenode #git channel.

Actually, they helped me with a bit more tweaking of my CVS and Subversion conversions. One thing I realized after writing yesterday’s post was that, after running git filter-branch, I had twice as many commits as I should have had. It turns out that git filter-branch rewrites all commits, but keeps the old ones around in case you mess something up. doener also pointed out that not all of my grafts were being applied, because git filter-branch applies only to the currently checked-out branch. To get all of the branches, he suggested that I read the git-filter-branch documentation, where I’d find that git filter-branch --tag-name-filter cat -- --all would hit all branches. Actually, such was not clear to me from the documentation, but I took his word for it. Once I did that, to get rid of the dupes, all I had to do was git clone the repository to a new repository. And that was that.

This worked great for my CVS migration, but I realized that I also wanted to clean out metadata from the Subversion migration. Of course, git clone throws out most of the metadata, but git svn also stores some metadata at the end of every commit log message, like this:

git-svn-id: file:///Users/david/svn/bricolage/trunk@8581 e630fb3e-2914-11de-beb6-f300316f8eb1

This had been very handy as I looked through commits in GitX to find parents to set up for grafts, but with that done and everything grafted, I no longer needed it. Ilari helped me to figure out how to properly use git filter-branch to get rid of those. To do it, all I had to do was add a filter for commit messages, like so:

git filter-branch --msg-filter \
'perl -0777 -pe "s/\r?\ngit-svn-id:.+\n//ms"' \
--tag-name-filter cat -- --all

This properly strips out that ugly bit of metadata and finalizes the grafts all at the same time. Very nice.

Now it was time to combine these two repositories for a single unified history. I wasn’t able to find a good tutorial for this on the web, other than one that used a third-party Debian utility and only hooked up the master branch, using a bogus intermediary commit to do it. On the other hand, simply copying the pack files, as mentioned in the Git Wiki (and demonstrated by the scripts linked from there), also appeared to be suboptimal: The new commits were not showing up in GitX! And besides, Ilari said, “just copying packs might not suffice. There can also be loose objects.” Well, we can’t have that, can we?

Ilari suggested git-fetch, the documentation for which says that it will “download objects and refs from another repository.” Perfect! I wanted to copy the objects from my CVS migration to the Subversion migration.

My first attempt failed: some commits showed up, but not others. Ilari pointed out that it wouldn’t copy remote branches unless you asked it to do so, via “refspecs.” Since I’d cloned the repositories to get rid of the duplicate commits created by git filter-branch, all of my lovingly recreated local branches were now remote branches. Actually, this is what I want for the final repository, so I just had to figure out how to copy them. What I came up with was this:

chdir $cvs;
my @branches = map { s/^\s+//; s/\s+$//; $_ } `git branch -r`;

chdir $svn;
system qw(git fetch --tags), $cvs;

for my $branch (@branches) {
    next if $branch eq 'origin/HEAD';
    my $target = $branch =~ m{/master|rev_1_[68]$} ? "$branch-cvs" : $branch;
    system qw(git fetch --tags), $cvs,
        "refs/remotes/$branch:refs/remotes/$target";
}

It took me a while to figure out the proper incantation for referencing and creating remote branches. Once I got the refs/remotes part figured out, I found that the master, rev_1_6, and rev_1_8 branches from CVS were overwriting the Subversion branches with the same names. What I really needed was to have the CVS branches grafted as parents to the Subversion branches. The #git channel again came to my rescue, where Fissure suggested that I rename those branches when importing them, do the grafts, and then drop the renamed branches. Hence the line above that adds “-cvs” to the names of those branches.

Once the branches were imported, I simply looked for the earliest commits to those branches in Subversion and mapped them to the latest commits to the same branches in CVS, then wrote their SHA1 IDs to .git/info/grafts, like so:

open my $fh, '>', ".git/info/grafts" or die "Cannot open grafts: $!\n";
print $fh '77a35487f18d68b96d294facc1f1a41745ad914c '
        => "835ff47ee1e3d1bf228b8d0976fbebe3c7f02ae6\n", # rev_1_6
            '97ef646f5c2a7c6f47c2046c8d289c1dfc30a73d '
        => "2b9f3c5979d062614ef54afd0a01631f746fa3cb\n", # rev_1_8
            'b3b2e7f53d789bea962fe8047e119148e28865c0 '
        => "8414b64a6a434b2117294c0568c1012a17bc863b\n", # master
    ;
close $fh;

With the branches all imported and the grafts created, I simply had to run git filter-branch to make them permanent and drop the temporary CVS branches:

system qw(git filter-branch --tag-name-filter cat -- --all);
unlink '.git/info/grafts';
system qw(git branch -r -D), "origin/$_-cvs" for qw(rev_1_6 rev_1_8 master);

Now I had a complete repository, but with duplicate commits left over by git-filter-branch. To get rid of those, I need to clone the repository. But before I clone, I need the remote branches to be local branches, so that the clone will see them as remotes. For this, I wrote the following function:

sub fix_branches {
    for my $remote (map { s/^\s+//; s/\s+$//; $_ } `git branch -r`) {
        (my $local = $remote) =~ s{origin/}{};
        next if $local eq 'master';
        next if $local eq 'HEAD';
        system qw(git checkout), $remote;
        system qw(git checkout -b), $local;
    }
    system qw(git checkout master);
}

It’s important to skip the master and HEAD branches, as they’ll automatically be created by git clone. So then I call the function and run git gc to take out the trash, and then clone:

fix_branches();

system qw(git gc);
chdir '..';
system qw(git clone), "file://$svn", 'git_final';

It’s important to use the file:/// URL to clone so as to get a real clone; just pointing to the directory instead makes hard links.
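
That is, the difference is between these two invocations (paths illustrative):

git clone file:///Users/david/git_from_svn git_final  # real copy of all objects
git clone /Users/david/git_from_svn git_final         # hard links the object files instead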

Now that I had the final repository with all history intact, I was ready to push it to GitHub! Well, almost ready. First I needed to make the branches local again, and then see if I could get the repository size down a bit:

chdir 'git_final';
fix_branches();
system qw(git remote rm origin);
system qw(git remote add origin git@github.com:bricoleurs/bricolage.git);
system qw(git gc);
system qw(git repack -a -d -f --depth 50 --window 50);

And that’s it! My new Bricolage Git repository is complete, and I’ve now pushed it up to its new home on GitHub. I pushed it like this:

git push origin --all
git push origin --tags

Damn I’m glad that’s done! I’ll be getting the Subversion repository set to read-only next, and then writing some documentation for my fellow Bricoleurs on how to work with Git. For those of you who already know, fork and enjoy!

Looking for the comments? Try the old layout.

Migrating Bricolage Subversion to Git

Following up on last week’s post on migrating the old Bricolage SourceForge CVS repository to Git, here are my notes on migrating the current Bricolage Subversion repository to Git.

It turns out that migrating from Subversion is much more of a pain than migrating from CVS. Why? Because CVS has real tags, while Subversion does not. So while git-svn tries to identify all of your tags and branches, it’s really relying on your Subversion repository using standard directories for all of your branches and tags. And while we’ve used a standard branches directory, our tags setup is a bit more complicated.

The problem was that we used tags every time we merged between branches. This meant that we ended up with a lot of tags with names like “merge_rev_1_10_5665” to indicate a merge from the “rev_1_10” branch into trunk at r5665. Plus we had tags for releases. So Marshall took it upon himself to reorganize the tags in the Subversion tree so that all release tags went into the “releases” subdirectory, and merges went into subdirectories named for the branch from which the merge derived. Those subdirectories went into the “merges” subdirectory. We ended up with a directory structure organized like this:

/tags/
  /releases/
    /1.10.1/
    /1.10.2/
    /1.10.3/
  /merges/
    /dev_ajax/
      /trunk-7890/
    /rev_1_10/
      /trunk-7043/
      /trunk-7194/
      /trunk-7300/

This was useful for keeping things organized in Subversion, so that we could easily find a tag for a previous merge in order to determine the revisions to specify for a new merge. But because older tags were moved from previous locations, and because newer tags were in subdirectories of the “tags” directory, git-svn did not identify them as tags. Well, that’s not really fair. It did identify earlier tags, before they were moved, but all the other tags were not found. Instead I ended up with tags in Git named tags/releases and tags/merges, which was useless. But even if all of our tags had been identified as tags, none had parent commit IDs, so there was no place to see where they actually came from.

So to rebuild the commit, release, and merge history from Subversion, I first created a local copy of the subversion repository using svnsync. Then I cloned it to Git like so:

SVNREPO=file:///Users/david/svn_bricolage_cc
git svn init $SVNREPO --stdlayout
git config svn.authorsfile /Users/david/bric_authors.txt
git svn fetch --no-follow-parent --revision 5517:HEAD
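
For reference, creating that local copy with svnsync goes something like this (svnsync refuses to sync until the mirror has a pre-revprop-change hook that exits zero):

svnadmin create /Users/david/svn_bricolage_cc
echo '#!/bin/sh' > /Users/david/svn_bricolage_cc/hooks/pre-revprop-change
chmod +x /Users/david/svn_bricolage_cc/hooks/pre-revprop-change
svnsync initialize file:///Users/david/svn_bricolage_cc http://svn.bricolage.cc/bricolage
svnsync synchronize file:///Users/david/svn_bricolage_cc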

By starting with r5517, which was the first real commit to Subversion, I avoided the git-svn error I reported last week. In truth, though, I ended up running this clone many, many times. The first few times, I ran it with --no-metadata, as recommended in various HOWTOs. But then I kept getting errors such as:

git svn log
fatal: bad default revision 'refs/remotes/git-svn'
----------------------------------------------------

This was more than a little annoying, and it took me a day or so to realize that this was because I had been using --no-metadata. Once I killed off that option, things worked much better.

Furthermore, by starting at r5517 and passing the --no-follow-parent option, git-svn ran much more quickly. Rather than taking 30 hours to get all revisions including stuff that had been moved around (and then failing), it now took around 90 minutes to do the export. Much more manageable, although I also started making backup copies and restoring from them as I experimented with fixing branches and tags. Ultimately, I ended up also passing the --ignore-paths option, to exclude various branches that were never really used or that I had already fetched in their entirety from CVS:

git svn fetch --no-follow-parent --revision 5517:HEAD \
--ignore-paths '(David|Kineticode|Release_|dev_(callback|(media_)?templates)|rev_1_([024]|[68]_temp)|tags/(Dev-|Release_|Start|help|mark|rel_1_([24567]|8_0)|rev_1_([26]|8_merge-2004-05-04)))|tmp'
svn2git --no-clone

The call to svn2git converts remote branches to local tags and branches. Now I had a reasonably clean copy of the repository (aside from the 120 or so commits from when Marshall did the tags reorganization) for me to work with. I opened it up with GitX and started scripting out merges.

To assist in this, I took a hint from Ask Bjørn Hansen, sent in email in response to a Tweet, and tagged every single commit with its corresponding Subversion revision number, like so (in Perl):

for my $c (`git rev-list --all --date-order --timestamp | sort -n | awk '{print \$2}'`) {
    chomp $c;
    my ($svnid) = `git show -s $c | tail -1` =~ /[@](\d+)\s+/;
    system qw(git tag -f), $svnid, $c;
}

The nice thing about this is that it made it easy for me to scan through the commits in GitX and see where things were. It also meant that I could reference these tags when I wrote the code to manage the merges. So what I did was sort the commits in reverse chronological order, and then search for those with the word “merge” in their subjects. When one was clearly for a merge (as opposed to simply using the word “merge”), I would disable the search, scroll through the commits until I found the selected commit, and then look for a likely prior commit that it merged from.

This was a bit of a pain in the ass because, unfortunately, GitX doesn’t keep the selected commit in the middle of the screen when you cancel a search. Mail.app does this right: if I do a search, select a message, then cancel the search, the selected message is still in the middle of the screen. But with GitX, as I said, I have to scroll to find it. That wasn’t going to scale very well. So what I did instead was search for “merge,” take a screenshot of the results, and cancel the search. Then I just opened the screenshot in Preview, looked at the records there, and found them in GitX. This made things go quite a bit faster.

Commits that mention merging in GitX

As a result, I added a migration function to properly tag merges. It looked like this:

sub graft_merges {
    print "Grafting merges\n";
    # Handle the merges.
    for my $graft (
        [qw( trunk@5524   rev_1_8@5523 )],
        [qw( trunk@5614   rev_1_8@5613 )],
        [qw( rev_1_8@5591 trunk@5590   )],
    ) {
        my ($commit, $parent) = map { s/.+[@]//; $_ } @$graft;
        my $cmd = "\$(git rev-parse $commit) "
                . "\$(git rev-parse $commit^) "
                . "\$(git rev-parse $parent)";
        `echo "$cmd" >> .git/info/grafts`;
    }
}

By referencing revision tags explicitly, I was able to just use git rev-parse to look up SHA1 hash IDs to put into .git/info/grafts. This saved me the headache of dealing with very long IDs, but also allowed me to easily keep track of revision numbers and branches (the branch information is actually superfluous here, but I kept it for my sanity). So, basically, for [qw( trunk@5524 rev_1_8@5523 )], it ends up writing the SHA1 hashes for r5524, the existing parent commit for r5524 (that’s the $commit^ bit), and for the new parent, r5523. I ended up with 73 merges that needed to be properly recorded.
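
The git rev-parse calls expand those short revision tags into full SHA1 IDs, so each line written to .git/info/grafts ends up with the shape:

<sha1 of commit> <sha1 of existing parent> <sha1 of new merge parent>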

With the merges done, I next dove into branches. For some reason, git-svn failed to identify a parent commit for any branch. Maybe because I started with r5517? I have no idea. So I had to search through the commits to see when branches were started. I mainly did this by looking at the branches in ViewVC. By clicking each one, I was able to see the earliest commit, which usually had a name like “Created a branch for my SoC project.” I would then look up that commit in ViewVC, such as r7423, which started the “dev_ajax” branch, just to make sure that it was copied from trunk. Then I simply went into GitX, found r7423, then looked back to the last commit to trunk before r7423. That was the parent of the branch. With such data, I was able to write a function like this:

sub graft_branches {
    print "Grafting branches\n";
    for my $graft (
        [qw( dev_ajax@7423            trunk@7301 )],
        [qw( dev_mysql@7424           trunk@7301 )],
        [qw( dev_elem_occurrence@7427 trunk@7301 )],
    ) {
        my ($commit, $parent) = map { s/.+[@]//; $_ } @$graft;
        my $cmd = "\$(git rev-parse $commit) "
                . "\$(git rev-parse $parent)";
        `echo "$cmd" >> .git/info/grafts`;
    }
}

Here I only needed to look up the revision and its parent and write it to .git/info/grafts. Then all of my branches had parents. Or nearly all of them; those that were also in the old CVS repository will have to wait until the two are stitched together to find their parents.

Next I needed to get releases properly tagged. This was not unlike the merge tag work: I just had to find the proper revision and tag it. This time, I looked through the commits in GitX for those with “tag for” in their subjects because, conveniently, I nearly always used this phrase in a release tag, as in “Tag for the 1.8.11 release of Bricolage.” Then I just looked back from the tag commit to find the commit copied to the tag, and that commit would be tagged with the release tag. The function to create the tags looked like this:

sub tag_releases {
    print "Tagging releases\n";
    for my $spec (
        [ 'rev_1_8@5726' => 'v1.8.1'  ],
        [ 'rev_1_8@5922' => 'v1.8.2'  ],
        [ 'rev_1_8@6073' => 'v1.8.3'  ],
    ) {
        my ($where, $tag) = @{$spec};
        my ($branch, $rev) = split /[@]/, $where;
        my $tag_date = `git show --pretty=format:%cd -s $rev`;
        chomp $tag_date;
        local $ENV{GIT_COMMITTER_DATE} = $tag_date;
        system qw(git tag -fa), $tag, '-m', "Tag for $tag release of Bricolage.", $rev;
    }
}

I am again indebted to Ask for the code here, especially to set the date for the tag.

Since I had created new release tags and recreated the merge history in Git, I no longer needed the old tags from Subversion, so next I rewrote the --ignore-paths option to exclude all of the tags directories, as well as some branches that were never used:

SVNREPO=file:///Users/david/svn_bricolage_cc
git svn init $SVNREPO --stdlayout
git config svn.authorsfile /Users/david/bric_authors.txt
git svn fetch --no-follow-parent --revision 5517:HEAD \
--ignore-paths '(David|Kineticode|Release_|dev_(callback|(media_)?templates)|rev_1_([024]|[68]_temp)|tags/)|tmp'

With this in hand, I killed off the call to svn2git, opting to convert trunk and the remote branches myself (easily done by copying-and-pasting the relevant Perl code). Then all I needed to do was clean up the extant tags and run git-filter-branch to make the grafts permanent:

sub finish {
    print "Deleting old tags\n";
    my @tags = grep m{^tags/}, map { s/^\s+//; s/\s+$//; $_ } `git branch -a`;
    system qw(git branch -r -D), $_ for @tags;

    print "Deleting revision tags\n";
    my @tags_to_delete = grep { /^\d+$/ } map { s/^\s+//; s/\s+$//; $_ } `git tag`;
    system qw(git tag -d), $_ for @tags_to_delete;

    print "Grafting...\n";
    system qw(git filter-branch);
    system qw(git gc);
}

And now I have a nicely organized Git repository based on the Bricolage Subversion repository, with all (or most) merges in their proper places, release tags, and branch tracking. Now all I have to do is stitch it together with the repository based on CVS and I’ll be ready to put this sucker on GitHub! More on that in my next post.

Looking for the comments? Try the old layout.

Migrating Bricolage CVS to Git

Following a discussion on the Bricolage developers mail list, I started down the path last week of migrating the Bricolage Subversion repository to Git. This turned out to be much more work than I expected, but to the benefit of the project, I think. Since I had a lot of questions about how to do certain things and how Git thinks about certain things, I wanted to record what I worked out here over the course of a few entries. Maybe it will help you manage your migration to Git.

The first thing I tried to do was use git-svn to migrate Bricolage to Git. I pointed it to the root directory and let it rip. I immediately saw that it noticed that the root was originally at the root of the repository, rather than the “bricolage” subdirectory, and so followed that path and started pulling stuff down. In a separate terminal window, I was watching the branches build up, and there were a lot of them, many named like:

David
David@5248
David@584
tags/Release_1_2_1
tags/Release_1_2_1@5249
tags/Release_1_2_1@577

Although many of those branches and tags hadn’t been used since the beginning of time, and certainly not since Bricolage was moved to Subversion from its original home in SourceForge CVS, because Subversion has no real concept of branches or tags, git-svn was duly copying them all, including the separate histories for each. Yow.

I could have dealt with that, renaming things, deleting others, and grafting where appropriate (more on grafting in a minute), but then I got this error from git-svn:

bricolage/branches/rev_1_8/lib/Bric/App/ApacheConfig.pm was not
found in commit e5145931069a511e98a087d4cb1a8bb75f43f899 (r5256)

This was annoying, especially since the file clearly does exist in that commit:

svn list -r5256 http://svn.bricolage.cc/bricolage/branches/rev_1_8/lib/Bric/App/ApacheConfig.pm
ApacheConfig.pm

I posted to the Git mail list about this issue, but unfortunately got no reply. Given that it was taking around 30 hours(!) to get to that point (and about 18 hours once I started using a local copy of the Subversion repository, thanks to a suggestion from Ask Bjørn Hansen), I started thinking about how to simplify things a bit.

Since most of the moving stuff around happened immediately after the move to Subversion, and before we started committing working code to the repository, it occurred to me that I could probably go back to the original Bricolage CVS Repository on SourceForge, migrate that to Git, and then just migrate from Subversion starting from the first real commit there. Then I could just stitch the two repositories together.

From CVS to Git

Thanks to advice from IRC, I used cvs2git to build a repository from a dump from CVS. Apparently, git cvsimport makes a lot of mistakes, while cvs2git does a decent job keeping branches and tags where they should be. It’s also pretty fast; once I set up its configuration and ran it, it took only around 5 minutes to build the import files for git fast-import. It also has some nice features to rename symbols (tags), ignore tags, assign authors, etc. I’m aware of no tool for migrating Subversion to Git that does the same thing.
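
The configuration lives in an options file (the distribution ships a cvs2git-example.options to crib from), so the invocation itself is trivial, something like:

cvs2git --options=cvs2git.options

The CVS repository path, author mappings, symbol handling, and the cvs2svn-tmp output directory used below are all specified in that options file.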

Once I had my dump, I started writing a script to import it into Git. The basic import looks like this:

GITREPO=/Users/david/Desktop/git_from_cvs
rm -rf $GITREPO
mkdir $GITREPO
cd $GITREPO
git init
cat ../cvs2svn-tmp/git-blob.dat ../cvs2svn-tmp/git-dump.dat | git fast-import
svn2git --no-clone
git gc
git reset --hard

I used svn2git to convert remote branches to local tags and branches. The --no-clone option is what keeps it from doing the Subversion stuff; everything else is the same as for a new conversion from CVS. I also had to run git reset --hard to throw out uncommitted local changes. What changes? I’m not sure where they came from, but after the last commit is imported from CVS, all of the local files in the master branch are deleted, but that change is not committed. Strange, but by doing a hard reset, I reverted that change with no harm done.

Next, I started looking at the repository in GitX, which provides a decent graphical interface for browsing around a Git repository on Mac OS X. There I discovered that a major benefit to importing from CVS rather than Subversion is that, because CVS has real tags, those tags are properly migrated to Git. What this means is that, because the Bricolage project (nearly) always tagged merges between branches and included the name of the appropriate tag name in a merge commit message, I was able to reconstruct the merge history in Git.

For example, there were a lot of tags named like so:

% git tag
rev_1_8_merge-2004-05-04
rev_1_6_merge-2004-05-02
rev_1_6_merge-2004-04-10
rev_1_6_merge-2004-04-09
rev_1_6_merge-2004-03-16

So if I wanted to find the merge commit that corresponded to that first tag, all I had to do was sort the commits in GitX by date and look near 2004-05-04 for a commit message that said something like:

Merge from rev_1_8. Will tag that branch "rev_1_8_merge-2004-05-04".

That commit’s SHA key is “b786ad1c0eeb9df827d658a81dc2d32ec6108e92”. Its parent’s SHA key is “11dbbd49644aaa607bd83f8d542d37fcfbd5e63b”. So then all I had to do was to tell git that there is a second parent for that commit. Looking in GitX for the commit tagged “rev_1_8_merge-2004-05-04”, I found that its SHA key is “4fadb117a71a49add69950eccc14b77a04c8ec68”. So to assign that as a second parent, I write a line to the file .git/info/grafts that describes its parentage:

b786ad1c0eeb9df827d658a81dc2d32ec6108e92 11dbbd49644aaa607bd83f8d542d37fcfbd5e63b 4fadb117a71a49add69950eccc14b77a04c8ec68

Once I had all the grafts written, I just ran git filter-branch and they were permanently rewritten to the new hierarchy.

And that’s it! The parentage is now correct. It was a lot of busy work to create the mapping between tags and merges, but it’s nice to have it all done and properly mapped out historically in Git. I even found a bunch of merges with no corresponding tags and figured out the proper commits to link them up to (though I stopped when I got back to 2002 and things got really confusing). And because the merge relationships are now properly recorded in Git, I can drop those old merge tags: as workarounds for the lack of merge tracking in CVS, they are no longer necessary.
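
Something like this one-liner should take care of them (the pattern matches the tag names shown above):

git tag -d $(git tag -l '*_merge-*')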

Next up, how I completed the merge from Subversion. I’ll write that once I’ve finally got it nailed down. Unfortunately, it takes an hour or two to export from Subversion to Git, and I’m having to do it over and over again as I figure stuff out. But it will be done, and you’ll hear more about it here.

Looking for the comments? Try the old layout.

List All Subversion Committers

In preparation for migrating a large Subversion repository to GitHub, I needed to get a list of all of the Subversion committers throughout history, so that I could create a file mapping them to Git users. Here’s how I did it:

svn log --quiet http://svn.example.com/ \
| grep '^r' | awk '{print $3}' | sort | uniq > committers.txt

Now I just have to edit committers.txt and I have my mapping file.
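
For reference, git-svn’s authors file wants one mapping per line, like so (the names here are made up):

jrandom = J. Random Hacker <jrandom@example.com>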

Looking for the comments? Try the old layout.

The Future of SVN::Notify

This week, I imported pgTAP into GitHub. It took me a day or so to wrap my brain around how it’s all supposed to work, with generous help from Tekkub. But I’m starting to get the hang of it, and I like it. By the end of the day, I had sent pull requests to Test::More and Blosxom Plugins. I’m well on my way to being hooked.

One of the things I want, however, is SVN::Notify-type commit emails. I know that there are feeds, but they don’t have diffs, and for however much I like using NetNewsWire to feed my political news addiction, it never worked for me for commit activity. And besides, why download the whole damn thing again, diffs and all (assuming that ever happens), for every refresh? Seems like a hell of a lot of unnecessary network activity, not to mention actual CPU cycles.

So I would need a decent notification application. I happen to have one. I originally wrote SVN::Notify after I had already written activitymail, which sends notices for CVS commits. SVN::Notify has changed a lot over the years, and now it’s looking a bit daunting to consider porting it to Git.

However, just to start thinking about it, SVN::Notify really does several different things:

  • Fetches relevant information about a Subversion event.
  • Parses that information for a number of different outputs.
  • Writes the event information into one or more outputs (currently plain text or XHTML).
  • Constructs an email message from the outputs.
  • Sends the email message via a specified method (sendmail or SMTP).

For the initial implementation of SVN::Notify, this made a lot of sense, because it was doing something fairly simple. It was designed to be extensible by subclassing (successfully done by SVN::Notify::Config and SVN::Notify::Mirror), and, later, by output filters, and that was about it.

But as I think about moving stuff to Git, and consider the weaknesses of extensibility by subclassing (it’s just not pretty), I’m naturally rethinking this architecture. I wouldn’t want to have to do it all over again should some new SCM system come along one day. So, following from a private exchange with Martijn Van Beers, I have some preliminary thoughts on how a hypothetical SCM::Notify (VCS::Notify?) module might be constructed:

  • A single interface for fetching SCM activity information. There could be any number of implementations, just as long as they all provided the same interface. There would be a class for fetching information from Subversion, one for Git, one for CVS, etc.
  • A single interface for writing a report for a given transaction. Again, there could be any number of implementations, but all would have the same interface: taking an SCM module and writing output to a file handle.
  • A single interface for doing something with one or more outputs. Again, they can do things as varied as simply writing files to disk, appending to a feed, inserting into a database, or, of course, sending an email.
  • The core module would process command-line arguments to determine what SCM is being used, gather any necessary contextual information, and pass it all on to the appropriate classes.

In pseudo-code, what I’m thinking is something like this:

package SCM::Notify;

sub run {
    my $args = shift->getopt;
    my $scm  = SCM::Interface->new(
        scm      => $args->{scm},     # e.g., "SVN" or "Git", etc.
        revision => $args->{revision},
        context  => $args->{context}, # Might include repository path for SVN.
    );

    my $report = SCM::Report->new(
        method => $args->{method}, # e.g., SMTP, sendmail, Atom, etc.
        scm    => $scm,
        format => $args->{output}, # text, html, both, etc.
        params => $args->{params}, # to, from, subject, etc.
    );

    $report->send;
}

Then a report class just has to create the report in the specified format or formats and do something with them. For example, a Sendmail report would put together a multipart message with each format in its own part, and then deliver it via /sbin/sendmail, something like this:

package SCM::Report::Sendmail;

sub send {
    my $self = shift;
    my $fh = $self->fh;
    for my $format ( $self->formats ) {
        print $fh SCM::Format->new(
            format => $format,
            scm    => $self->scm,
        );
    }

    $self->deliver;
}

So those are my rather preliminary thoughts. I think it’d actually be pretty easy to port the logic of this stuff over from SVN::Notify; what needs some more thought is what the command-line interface might look like and how options are passed to the various classes, since the Sendmail report class will require different parameters than the SMTP report class or the Atom report class. But once that’s worked out in a way that can be handled neutrally, we’ll have a much more extensible implementation that will be easy to add on to going forward.

Any suggestions for passing different parameters to different classes in a single interface? Everything needs to be handled via command-line options without being ugly or difficult to use.
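
One half-baked idea, just to get the ball rolling: have Getopt::Long collect handler-specific parameters into a hash via repeated --param key=value options, and pass that hash blindly to whatever report class gets loaded. A minimal sketch (all names hypothetical):

use strict;
use Getopt::Long;

my %args = ( output => [], params => {} );
GetOptions(
    'scm=s'     => \$args{scm},    # e.g., SVN or Git
    'method=s'  => \$args{method}, # e.g., Sendmail, SMTP, Atom
    'output=s@' => $args{output},  # one or more output formats
    'param=s%'  => $args{params},  # e.g., --param to=commits@example.com
) or die "usage: scm_notify --scm Git --method Sendmail --param to=me\@example.com\n";

Each report class would then simply document the --param keys it understands and pluck them out of the hash.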

So, you wanna work on this? :-)

Looking for the comments? Try the old layout.