Just a Theory

Black lives matter

Posts about Hashing

Thinking about Changing Sqitch Change IDs

When Sqitch (the database change management app I’ve been working on for the last several months) parses a deployment plan, it creates a unique ID for each change in the plan. This ID is a SHA1 hash generated from a string of information about the change that looks something like this:

project flipr
change add_users_table
planner Marge N. O’Vera <marge@example.com>
date 2012-11-14T01:10:13Z

The nice thing about the ID is that it’s unique: it’s unlikely that the same user with the same email address will add a change with the same name to a project with the same name within a single second. If the plan includes a URI, that’s included, too, for additional uniqueness.
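As a rough illustration of the idea, here is a minimal Ruby sketch that hashes an info string like the one above with SHA1. It assumes the string is hashed exactly as shown and that a URI line, when present, follows the project line; the precise bytes Sqitch feeds to SHA1 may differ, and `change_id` is a hypothetical helper name.

```ruby
require 'digest'

# Hypothetical helper: build the info string for a change and return its
# SHA1 hex digest. The field layout mirrors the example above; the exact
# bytes Sqitch hashes, and the position of the uri line, are assumptions.
def change_id(project:, change:, planner:, date:, uri: nil)
  lines = ["project #{project}"]
  lines << "uri #{uri}" if uri # included only when the plan has a URI
  lines << "change #{change}"
  lines << "planner #{planner}"
  lines << "date #{date}"
  Digest::SHA1.hexdigest(lines.join("\n"))
end

id = change_id(
  project: 'flipr',
  change:  'add_users_table',
  planner: 'Marge N. O’Vera <marge@example.com>',
  date:    '2012-11-14T01:10:13Z'
)
puts id # 40 hexadecimal characters
```

Changing any one field, even the timestamp by a single second, yields a completely different ID, which is what makes the ID useful as a unique key.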

Note, however, that it does not include information about any other changes. Git, on which I modeled the generation of these IDs, always includes the parent commit SHA1 in its uniquely-identifying info. An example:

> git cat-file commit 744c01bfa3798360c1792a8caf784b650e52d89e               
tree d3a64897cca4538ff5c0c41db3f82ab033a09bec
parent 482a79ae2cda5085eed731be2e70739ab37997ee
author David E. Wheeler <david@justatheory.com> 1337355746 -0400
committer David E. Wheeler <david@justatheory.com> 1337355746 -0400

Timestamp v0.30.

The reason Git does this is so that a commit is not just uniquely identified globally, but can also only follow an existing commit. Mark Jason Dominus calls this Linus Torvalds’ greatest invention. Why? This is how Git knows it can fast-forward changes.
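To make the effect of the parent line concrete, here is a small Ruby sketch. Git names an object by the SHA1 of a header (`<type> <size>` plus a NUL byte) followed by the content. The helper names `git_object_id` and `commit_body` are made up for this example, and the commit body is reconstructed from the `cat-file` output above, so it is not guaranteed to reproduce the real commit ID byte for byte.

```ruby
require 'digest'

# Git names an object by the SHA1 of "<type> <size>\0<content>".
def git_object_id(type, content)
  Digest::SHA1.hexdigest("#{type} #{content.bytesize}\0#{content}")
end

# Hypothetical helper: commit content shaped like the cat-file output above,
# with the parent SHA1 left as a parameter.
def commit_body(parent)
  <<~COMMIT
    tree d3a64897cca4538ff5c0c41db3f82ab033a09bec
    parent #{parent}
    author David E. Wheeler <david@justatheory.com> 1337355746 -0400
    committer David E. Wheeler <david@justatheory.com> 1337355746 -0400

    Timestamp v0.30.
  COMMIT
end

original = git_object_id('commit', commit_body('482a79ae2cda5085eed731be2e70739ab37997ee'))
# Same tree, author, date, and message, but a made-up parent:
moved    = git_object_id('commit', commit_body('0' * 40))

puts original
puts moved
puts original == moved # false
```

Because the parent is part of the hashed content, the same tree and message attached to a different parent is a different commit, with a different ID.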

Why doesn’t Sqitch do something similar? My original thinking had been to make it easier for a database developer to do iterative development. And one of the requirements for that, in my experience, is the ability to freely reorder changes in the plan. Including the SHA1 of the preceding change would make that trickier. But omitting it also means that, when you deploy to a production database, you lose that extra layer of security that ensures that, yes, the next change really should be deployed. That is, with the preceding change’s SHA1 included, it would be much harder to deploy with changes missing or changed from what was previously expected. And I think that’s only sane for a production environment.

Given that, I’ve started to rethink my decision to omit the previous change SHA1 from the identifier of a change. Yes, it could be a bit more of a hassle for a developer, but not, I think, that much of a hassle. The main thing would be to allow reverts to look up their scripts just by change name or even file name, rather than ID. We want deploys to always be correct, but I’m thinking that reverts should always just try very hard to remove changes. Even in production.

I am further thinking that the ID should even include the list of prerequisite changes for even stronger identification. After all, one might change just the dependencies and nothing else, but it would still be a different change. And maybe it should include the note, too? The end result would be a hash of something like this:

project flipr
change add_users_table
parent 7cd96745746cd6baa5da352de782354b21838b25
requires [schemas roles common:utils]
planner Marge N. O’Vera <marge@example.com>
date 2012-11-14T01:10:13Z

Adds the users table to the database.
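A minimal Ruby sketch of how such chained IDs might be computed for a whole plan follows. It is only an illustration of the proposal: the `Change` struct, the `plan_ids` helper, and the choice to omit the `parent` and `requires` lines when they are empty are all assumptions, not Sqitch’s actual behavior.

```ruby
require 'digest'

# Hypothetical structure for one plan entry; fields follow the example above.
Change = Struct.new(:name, :requires, :planner, :date, :note, keyword_init: true)

# Compute IDs for a plan in order, feeding each ID into the next change as
# its parent. Assumption: the first change simply has no parent line.
def plan_ids(project, changes)
  parent = nil
  changes.map do |c|
    lines = ["project #{project}", "change #{c.name}"]
    lines << "parent #{parent}" if parent
    lines << "requires [#{c.requires.join(' ')}]" unless c.requires.empty?
    lines << "planner #{c.planner}"
    lines << "date #{c.date}"
    lines << '' << c.note # blank line, then the note
    parent = Digest::SHA1.hexdigest(lines.join("\n"))
  end
end

marge   = 'Marge N. O’Vera <marge@example.com>'
widgets = Change.new(name: 'add_widgets_table', requires: [], planner: marge,
                     date: '2012-11-14T01:05:00Z', note: 'Adds the widgets table.')
users   = Change.new(name: 'add_users_table', requires: %w[schemas roles common:utils],
                     planner: marge, date: '2012-11-14T01:10:13Z',
                     note: 'Adds the users table to the database.')

in_order  = plan_ids('flipr', [widgets, users])
reordered = plan_ids('flipr', [users, widgets])

# The same change gets a different ID once its position in the plan changes.
puts in_order.last == reordered.first # false
```

This is the trade-off in miniature: reordering changes during development rewrites every ID from that point on, which is exactly what makes a deploy against a mismatched plan detectable.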

This will break existing installations, so I’d need to add a way to update them, but otherwise, I think it might be a win overall. Thoughts?


Array to Hash One-Liner

Programming in Ruby, I’ve badly missed Perl’s list syntax, which, among other things, makes converting between arrays and hashes really easy. In Ruby I have forever been converting an array to a hash like this:

a = [ 1, 2, 3, 4, 5 ]
h = {}
a.each { |v| h[v] = v }

Of course, this is anything but concise. In Perl, I can just do this:

my @a = (1, 2, 3, 4, 5, 6);
my %h = map { $_ => $_ } @a;

Easy, huh? Well, I finally got fed up with the nasty hack in Ruby, did a little Googling, and figured out a way to do it in a single line:

a = [ 1, 2, 3, 4, 5, 6 ]
h = Hash[ *a.collect { |v| [ v, v ] }.flatten ]

Not quite as concise as the Perl version, and I have to construct a bunch of arrays that I then throw away with the call to flatten, but at least it’s concise and, I think, clearer what it’s doing. So I think I’ll go with that.
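For comparison, here is a short sketch of two variants that avoid the call to flatten. It assumes a newer Ruby than the one this post was written against: `Hash[]` given a single array of pairs needs Ruby 1.8.7 or later, and `Array#to_h` with a block needs Ruby 2.6 or later. The `nested` example is made up to show where the flatten approach can surprise you.

```ruby
a = [1, 2, 3, 4, 5, 6]

# Hash[] also accepts a single array of [key, value] pairs, so no flatten
# and no splat are needed.
h1 = Hash[a.map { |v| [v, v] }]

# Array#to_h with a block builds the pairs directly.
h2 = a.to_h { |v| [v, v] }

# The difference matters when the elements are themselves arrays, because
# flatten without an argument also flattens the elements.
nested = [[1, 2], [3, 4]]
flat   = Hash[*nested.collect { |v| [v, v] }.flatten]
pairs  = nested.to_h { |v| [v, v] }

puts h1 == h2                             # true
puts flat == { 1 => 2, 3 => 4 }           # true: the inner arrays were taken apart
puts pairs.keys == [[1, 2], [3, 4]]       # true: each element maps to itself
```

With plain numbers or strings the original one-liner behaves the same as these variants; the pair-based forms just keep working when the elements are arrays.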


Which Digest Should I Use?

With the recent release of MD5 collision code, I’m reading that it’s long past time that MD5 was dropped from applications. But it seems that SHA-1 isn’t well thought of anymore, either. So what should Perl programmers use now, instead? Digest::Whirlpool? Digest::SHA2? Digest::Tiger? Digest::Haval256? A combination of these? Something else? I mainly used MD5 for hashing passwords. What’s the best choice for that use? For other uses?
