Just a quick follow-up on the completion of the Bricolage Git migration last
week: today I finished writing up a set of GitHub wiki documents explaining
to my fellow Bricoleurs how to start hacking. The most important bits are:

- Working with Git, explaining how to get set up with a forked Bricolage
  repository
- Contributing a Bug Fix, an intro to the Git way of doing things (as far as
  I understand it)
- Creating a Release, in which the fine art of branching, tagging, and
  releasing is covered

If you’re familiar with the “Git way,” I would greatly appreciate your feedback
on these documents; corrections and comments are very welcome.
I also just wanted to say that the process of reconstructing the merge history
from CVS and Subversion was quite an eye-opener for me. Not because it was
difficult (it was) and required a number of hacks (it did), but because it
highlighted just how much better a fit Git is for the way in which we do Open
Source software development. Hell, probably closed-source, too, for that matter.
I no longer will have to think about what revisions to include in a merge, or
create a branch just to “tag” a merge. Hell, I’ll probably be doing merges a
hell of a lot more often, just because it’s so easy, the history remains intact,
and everything just stays more up-to-date and closely integrated.
But I also really appreciate the project-based emphasis of Git. A Subversion
repository, I now realize, is really very much like a versioned file system.
That means where things go is completely ad-hoc, or convention-driven at best.
And god forbid if you decide to change the convention and move stuff around!
It’s just so much more sane to get a project repository, with all of the
history, branches, tags, merges, and everything else, all in one package. It’s
more portable, it’s a hell of a lot faster (ever tried to check out a Subversion
repository with 80 tags?), and just tighter. It encourages modularization,
which can only be good. I’ll tell you, I expect to have some frustrations and
challenges as I learn more about using Git, but I’m already very much happier
with the overall philosophy.
Enough evangelizing. As a last statement on this, I’ve uploaded the Perl scripts
I wrote to do this migration, just in case someone else finds them useful:
- bric_to_git migrated Subversion from r5517 to Git.
- stitch stitched the CVS-migrated Git repository into the
  Subversion-migrated Git repository for a final product.
It turned out that there were a few files lost in the conversion, which I
didn’t notice until after all was said and done, but overall I’m very happy. My
thanks again to Ask and the denizens of #git for all the help.
Now that I’ve successfully migrated the old Bricolage SourceForge CVS
repository to Git, and also migrated Subversion to Git, it’s time to stitch
the two repositories together into one with all history intact. I’m glad to say
that figuring out how to do so took substantially less time than the first two
steps, thanks in large part to the great help from “doener,” “Ilari,” and
“Fissure” on the Freenode #git channel.
Actually, they helped me with a bit more tweaking of my CVS and Subversion
conversions. One thing I realized after writing yesterday’s post was that, after running git filter-branch, I had twice as many
commits as I should have had. It turns out that git filter-branch rewrites all
commits, but keeps the old ones around in case you mess something up. doener
also pointed out that I wasn’t having all grafts properly applied, because
git filter-branch only applies to the currently checked-out branch. To get all
of the branches, he suggested that I read the git-filter-branch documentation,
where I’ll find that git filter-branch --tag-name-filter cat -- --all would
hit all branches. Actually, such was not clear to me from the documentation, but
I took his word for it. Once I did that, to get rid of the dupes, all I had to
do was git clone the repository to a new repository. And that was that.
This worked great for my CVS migration, but I realized that I also wanted to
clean out metadata from the Subversion migration. Of course, git clone throws
out most of the metadata, but git svn also stores some metadata at the end of
every commit log message, like this:
This had been very handy as I looked through commits in GitX to find parents to
set up for grafts, but with that done and everything grafted, I no longer needed
it. Ilari helped me to figure out how to properly use git filter-branch to get
rid of those. To do it, all I had to do was add a filter for commit messages,
like so:
This properly strips out that ugly bit of metadata and finalizes the grafts all
at the same time. Very nice.
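
A minimal sketch of such a message filter, not the exact command used here. It assumes the metadata is the usual single `git-svn-id:` trailer line that git svn appends; the function name `strip_git_svn_id` is made up for illustration, and the invocation in the comment simply combines the filter with the `--tag-name-filter cat -- --all` arguments mentioned above:

```shell
# Delete the "git-svn-id:" trailer that git svn appends to each commit message.
# Reads a commit message on stdin and writes the cleaned message to stdout.
strip_git_svn_id() {
    sed -e '/^git-svn-id:/d'
}

# Assumed usage (git filter-branch runs the filter once per commit message):
#   git filter-branch --msg-filter 'sed -e "/^git-svn-id:/d"' \
#       --tag-name-filter cat -- --all
```

Because the filter only touches lines that start with `git-svn-id:`, the rest of each log message passes through unchanged.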
Now it was time to combine these two repositories for a single unified
history. I wasn’t able to find a good tutorial for this on the web, other than
one that used a third-party Debian utility and only hooked up the master
branch, using a bogus intermediary commit to do it. On the other hand, simply
copying the pack files, as mentioned in the Git Wiki–and demonstrated by the
scripts linked from there–also appeared to be suboptimal: The new commits were
not showing up in GitX! And besides, Ilari said, “just copying packs might not
suffice. There can also be loose objects.” Well, we can’t have that, can we?
Ilari suggested git-fetch, the documentation for which says that it will
“download objects and refs from another repository.” Perfect! I wanted to copy
the objects from my CVS migration to the Subversion migration.
My first attempt failed: some commits showed up, but not others. Ilari pointed
out that it wouldn’t copy remote branches unless you asked it to do so, via
“refspecs.” Since I’d cloned the repositories to get rid of the duplicate
commits created by git filter-branch, all of my lovingly recreated local
branches were now remote branches. Actually, this is what I want for the final
repository, so I just had to figure out how to copy them. What I came up with
was this:
It took me a while to figure out the proper incantation for referencing and
creating remote branches. Once I got the refs/remotes part figured out, I
found that the master, rev_1_6, and rev_1_8 branches from CVS were
overwriting the Subversion branches with the same names. What I really needed
was to have the CVS branches grafted as parents to the Subversion branches. The
#git channel again came to my rescue, where Fissure suggested that I rename
those branches when importing them, do the grafts, and then drop the renamed
branches. Hence the line above that adds “-cvs” to the names of those branches.
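
The renaming idea can be sketched with a small helper that builds the refspecs. This is an illustration under assumptions, not the original incantation: the `origin` remote name, the helper name `cvs_refspec`, and the `../bricolage-cvs` path are all hypothetical.

```shell
# Print a fetch refspec that copies a remote-tracking branch from the
# CVS-derived clone into refs/remotes/origin/ of the current repository.
# Branches that exist on both sides get a "-cvs" suffix so they do not
# overwrite the Subversion-derived branches of the same name.
cvs_refspec() {
    branch=$1
    case "$branch" in
        master|rev_1_6|rev_1_8) dest="${branch}-cvs" ;;
        *)                      dest="$branch" ;;
    esac
    printf 'refs/remotes/origin/%s:refs/remotes/origin/%s\n' "$branch" "$dest"
}

# Assumed usage, with ../bricolage-cvs as a placeholder path:
#   git fetch ../bricolage-cvs "$(cvs_refspec master)" "$(cvs_refspec rev_1_6)"
```

Once the grafts are in place and made permanent, the `-cvs` copies can be deleted again.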
Once the branches were imported, I simply looked for the earliest commits to
those branches in Subversion and mapped them to the latest commits to the same
branches in CVS, then wrote their SHA1 IDs to .git/info/grafts, like so:
    open my $fh, '>', ".git/info/grafts" or die "Cannot open grafts: $!\n";
    print $fh '77a35487f18d68b96d294facc1f1a41745ad914c '
           => "835ff47ee1e3d1bf228b8d0976fbebe3c7f02ae6\n", # rev_1_6
              '97ef646f5c2a7c6f47c2046c8d289c1dfc30a73d '
           => "2b9f3c5979d062614ef54afd0a01631f746fa3cb\n", # rev_1_8
              'b3b2e7f53d789bea962fe8047e119148e28865c0 '
           => "8414b64a6a434b2117294c0568c1012a17bc863b\n", # master
    ;
    close $fh;
With the branches all imported and the grafts created, I simply had to run
git filter-branch to make them permanent and drop the temporary CVS branches:
Now I had a complete repository, but with duplicate commits left over by
git-filter-branch. To get rid of those, I need to clone the repository. But
before I clone, I need the remote branches to be local branches, so that the
clone will see them as remotes. For this, I wrote the following function:
It’s important to skip the master and HEAD branches, as they’ll automatically be
created by git clone. So then I call the function, run git gc to take out the
trash, and then clone:
It’s important to use the file:/// URL to clone so as to get a real clone;
just pointing to the directory instead makes hard links.
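
A rough shell sketch of the remote-to-local step, assuming the remote is named `origin`; the function name `remote_to_local_cmds` is hypothetical, and it prints the `git branch` commands rather than running them so the result can be checked first:

```shell
# Read `git branch -r` output on stdin and print one `git branch` command
# per remote-tracking branch, skipping HEAD and master (git clone creates
# those on its own).
remote_to_local_cmds() {
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]].*$//' |
    while IFS= read -r ref; do
        name=${ref#origin/}
        case "$name" in
            HEAD|master) continue ;;
        esac
        printf 'git branch %s %s\n' "$name" "$ref"
    done
}

# Assumed usage:
#   git branch -r | remote_to_local_cmds | sh
#   git gc
#   git clone file:///path/to/repo final-repo   # placeholder paths
```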
Now that I had the final repository with all history intact, I was ready to
push it to GitHub! Well, almost ready. First I needed to make the branches local
again, and then see if I could get the repository size down a bit:
And that’s it! My new Bricolage Git repository is complete, and I’ve now pushed
it up to its new home on GitHub. I pushed it like this:
git push origin --all
git push origin --tags
Damn I’m glad that’s done! I’ll be getting the Subversion repository set to
read-only next, and then writing some documentation for my fellow Bricoleurs on
how to work with Git. For those of you who already know, fork and enjoy!
Following up on last week’s post on migrating the old Bricolage SourceForge
CVS repository to Git, here are my notes on migrating the current Bricolage
Subversion repository to Git.
It turns out that migrating from Subversion is much more of a pain than
migrating from CVS. Why? Because CVS has real tags, while Subversion does not.
So while git-svn tries to identify all of your tags and branches, it’s really
relying on your Subversion repository using standard directories for all of your
branches and tags. And while we’ve used a standard directory for branches, our
tags setup is a bit more complicated.
The problem was that we used tags every time we merged between branches. This
meant that we ended up with a lot of tags with names like
“merge_rev_1_10_5665” to indicate a merge from the “rev_1_10” branch into
trunk at r5665. Plus we had tags for releases. So Marshall took it upon
himself to reorganize the tags in the Subversion tree so that all release tags
went into the “releases” subdirectory, and merges went into subdirectories named
for the branch from which the merge derived. Those subdirectories went into the
“merges” subdirectory. We ended up with a directory structure organized like
this:
This was useful for keeping things organized in Subversion, so that we could
easily find a tag for a previous merge in order to determine the revisions to
specify for a new merge. But because older tags were moved from previous
locations, and because newer tags were in subdirectories of the “tags”
directory, git-svn did not identify them as tags. Well, that’s not really
fair. It did identify earlier tags, before they were moved, but all the other
tags were not found. Instead I ended up with tags in Git named tags/releases
and tags/merges, which was useless. But even if all of our tags had been
identified as tags, none had parent commit IDs, so there was no place to see
where they actually came from.
So to rebuild the commit, release, and merge history from Subversion, I first
created a local copy of the Subversion repository using svnsync. Then I cloned
it to Git like so:
By starting with r5517, which was the first real commit to Subversion, I avoided
the git-svn error I reported last week. In truth, though, I ended up running
this clone many, many times. The first few times, I ran it with
--no-metadata, as recommended in various HOWTOs. But then I kept getting
errors such as:
git svn log
fatal: bad default revision 'refs/remotes/git-svn'
----------------------------------------------------
This was more than a little annoying, and it took me a day or so to realize that
this was because I had been using --no-metadata. Once I killed off that
option, things worked much better.
Furthermore, by starting at r5517 and passing the --no-follow-parent option,
git-svn ran much more quickly. Rather than taking 30 hours to get all
revisions including stuff that had been moved around (and then failing), it now
took around 90 minutes to do the export. Much more manageable, although I also
started making backup copies and restoring from them as I experimented with
fixing branches and tags. Ultimately, I ended up also passing the
--ignore-paths option, to exclude various branches that were never really used
or that I had already fetched in their entirety from CVS:
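
Since `--ignore-paths` takes a single regular expression, one way to keep a growing list of excluded paths readable is to join them programmatically. The sketch below is an assumption-laden illustration: the helper name `ignore_paths_regex`, the branch names, and the repository path are placeholders, and what the pattern is matched against depends on the URL you clone from.

```shell
# Join path prefixes into one anchored regular expression for --ignore-paths.
# Usage: ignore_paths_regex PATH [PATH...]
ignore_paths_regex() {
    alternation=$(printf '%s|' "$@")
    # Drop the trailing "|" left by the printf loop above.
    printf '^(%s)\n' "${alternation%|}"
}

# Assumed usage; branch names and repository path are placeholders:
#   git svn clone -r5517:HEAD --no-follow-parent \
#       --ignore-paths="$(ignore_paths_regex branches/foo branches/bar)" \
#       file:///path/to/svn/bricolage bricolage
```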
The call to svn2git converts remote branches to local tags and branches. Now I
had a reasonably clean copy of the repository (aside from the 120 or so commits
from when Marshall did the tags reorganization) for me to work with. I opened it
up with GitX and started scripting out merges.
To assist in this, I took a hint from Ask Bjørn Hansen, sent in email in
response to a Tweet, and tagged every single commit with its corresponding
Subversion revision number, like so (in Perl):
The nice thing about this is that it made it easy for me to scan through the
commits in GitX and see where things were. It also meant that I could reference
these tags when I wrote the code to manage the merges. So what I did was sort
the commits in reverse chronological order, and then search for those with the
word “merge” in their subjects. When one was clearly for a merge (as opposed to
simply using the word “merge”), I would disable the search, scroll through the
commits until I found the selected commit, and then look for a likely prior
commit that it merged from.
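
The revision-tagging idea can be sketched in shell rather than Perl. This is not the original code: it assumes each commit message still carries the `git-svn-id: URL@REV UUID` trailer (that is, the clone was made without `--no-metadata`), and the function name `svn_rev_from_message` is made up for illustration.

```shell
# Extract the Subversion revision number from a commit message that carries
# a "git-svn-id: URL@REV UUID" trailer. Prints nothing if there is none.
svn_rev_from_message() {
    sed -n 's/^git-svn-id: .*@\([0-9][0-9]*\) .*$/\1/p'
}

# Assumed usage: tag every commit with its Subversion revision number.
#   for sha in $(git rev-list --all); do
#       rev=$(git log -1 --format=%B "$sha" | svn_rev_from_message)
#       [ -n "$rev" ] && git tag "$rev" "$sha"
#   done
```

With tags named after revisions, `git rev-parse 5524` resolves a Subversion revision straight to its commit.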
This was a bit of a pain in the ass, because, unfortunately, GitX doesn’t keep the
selected commit record in the middle of the screen when you cancel the search.
Mail.app does this right: If I do a search, select a message, then cancel the
search, the selected message is still in the middle of the screen. But with
GitX, as I said, I have to scroll to find it. This wasn’t going to scale very
well. So what I did instead was search for “merge”, then I took a screenshot of
the results and cancelled the search. Then I just opened the screenshot in
Preview, looked at the records there, then found them in GitX. This made things
go quite a bit faster.
As a result, I added a migration function to properly tag merges. It looked like
this:
By referencing revision tags explicitly, I was able to just use git rev-parse
to look up SHA1 hash IDs to put into .git/info/grafts. This saved me the
headache of dealing with very long IDs, but also allowed me to easily keep track
of revision numbers and branches (the branch information is actually superfluous
here, but I kept it for my sanity). So, basically, for
[qw( trunk@5524 rev_1_8@5523 )], it ends up writing the SHA1 hashes for r5524,
the existing parent commit for r5524 (that’s the $commit^ bit), and for the
new parent, r5523. I ended up with 73 merges that needed to be properly
recorded.
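
To make the `[qw( trunk@5524 rev_1_8@5523 )]` example concrete, here is a small shell sketch of the lookup. It assumes the commits are tagged with their bare revision numbers as described above; the helper name `merge_graft_args` is hypothetical.

```shell
# Turn a "merged-into" spec and a "merged-from" spec into the three revisions
# a graft line needs: the merge commit, its existing parent, the new parent.
# Usage: merge_graft_args trunk@5524 rev_1_8@5523
merge_graft_args() {
    to=${1#*@}      # revision of the merge commit, e.g. 5524
    from=${2#*@}    # revision merged from, e.g. 5523
    printf '%s %s^ %s\n' "$to" "$to" "$from"
}

# Assumed usage: resolve the revision tags to SHA1s and append one graft line.
#   git rev-parse $(merge_graft_args trunk@5524 rev_1_8@5523) \
#       | paste -s -d ' ' - >> .git/info/grafts
```

The branch half of each spec is discarded here, which matches the remark that it is superfluous for the lookup itself.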
With the merges done, I next dove into branches. For some reason, git-svn
failed to identify a parent commit for any branch. Maybe because I started
with r5517? I have no idea. So I had to search through the commits to see when
branches were started. I mainly did this by looking at the branches in ViewVC.
By clicking each one, I was able to see the earliest commit, which usually had a
name like “Created a branch for my SoC project.” I would then look up that
commit in ViewVC, such as r7423, which started the “dev_ajax” branch, just to
make sure that it was copied from trunk. Then I simply went into GitX, found
r7423, then looked back to the last commit to trunk before r7423. That was the
parent of the branch. With such data, I was able to write a function like this:
Here I only needed to look up the revision and its parent and write it to
.git/info/grafts. Then all of my branches had parents. Or nearly all of them;
those that were also in the old CVS repository will have to wait until the two
are stitched together to find their parents.
Next I needed to get releases properly tagged. This was not unlike the merge tag
work: I just had to find the proper revision and tag it. This time, I looked
through the commits in GitX for those with “tag for” in their subjects because,
conveniently, I nearly always used this phrase in a release tag, as in “Tag for
the 1.8.11 release of Bricolage.” Then I just looked back from the tag commit to
find the commit copied to the tag, and that commit would be tagged with the
release tag. The function to create the tags looked like this:
    sub tag_releases {
        print "Tagging releases\n";
        for my $spec (
            [ 'rev_1_8@5726' => 'v1.8.1' ],
            [ 'rev_1_8@5922' => 'v1.8.2' ],
            [ 'rev_1_8@6073' => 'v1.8.3' ],
        ) {
            my ($where, $tag) = @{ $spec };
            my ($branch, $rev) = split /[@]/, $where;
            my $tag_date = `git show --pretty=format:%cd -s $rev`;
            chomp $tag_date;
            local $ENV{GIT_COMMITTER_DATE} = $tag_date;
            system qw(git tag -fa), $tag, '-m', "Tag for $tag release of Bricolage.", $rev;
        }
    }
I am again indebted to Ask for the code here, especially to
set the date for the tag.
Since I had created new release tags and recreated the merge history in Git, I
no longer needed the old tags from Subversion, so next I rewrote the
--ignore-paths option to exclude all of the tags directories, as well as some
branches that were never used:
With this in hand, I killed off the call to svn2git, opting to convert trunk
and the remote branches myself (easily done by copying-and-pasting the relevant
Perl code). Then all I needed to do was clean up the extant tags and run
git-filter-branch to make the grafts permanent:
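    sub finish {
        # Drop the leftover Subversion "tags/..." branches.
        print "Deleting old tags\n";
        my @tags = grep m{^tags/}, map { s/^\s+//; s/\s+$//; $_ } `git branch -a`;
        system qw(git branch -r -D), $_ for @tags;

        # Drop the temporary tags named after Subversion revision numbers.
        print "Deleting revision tags\n";
        my @tags_to_delete = grep { /^\d+$/ } map { s/^\s+//; s/\s+$//; $_ } `git tag`;
        system qw(git tag -d), $_ for @tags_to_delete;

        # Make the grafts permanent, then clean up.
        print "Grafting...\n";
        system qw(git filter-branch);
        system qw(git gc);
    }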
And now I have a nicely organized Git repository based on the Bricolage
Subversion repository, with all (or most) merges in their proper places, release
tags, and branch tracking. Now all I have to do is stitch it together with the
repository based on CVS and I’ll be ready to put this sucker on GitHub!
More on that in my next post.
Following a discussion on the Bricolage developers mail list, I started down
the path last week of migrating the Bricolage Subversion repository to Git. This
turned out to be much more work than I expected, but to the benefit of the
project, I think. Since I had a lot of questions about how to do certain things
and how Git thinks about certain things, I wanted to record what I worked out
here over the course of a few entries. Maybe it will help you manage your
migration to Git.
The first thing I tried to do was use git-svn to migrate Bricolage to Git. I
pointed it to the root directory and let it rip. I immediately saw that it
noticed that the root was originally at the root of the repository, rather than
the “bricolage” subdirectory, and so followed that path and started pulling
stuff down. In a separate terminal window, I was watching the branches build up,
and there were a lot of them, many named like:
David
David@5248
David@584
tags/Release_1_2_1
tags/Release_1_2_1@5249
tags/Release_1_2_1@577
Although many of those branches and tags hadn’t been used since the beginning of
time, and certainly not since Bricolage was moved to Subversion from its
original home in SourceForge CVS, because Subversion has no real concept of
branches or tags, git-svn was duly copying them all, including the separate
histories for each. Yow.
I could have dealt with that, renaming things, deleting others, and grafting
where appropriate (more on grafting in a minute), but then I got this error from
git-svn:
bricolage/branches/rev_1_8/lib/Bric/App/ApacheConfig.pm was not
found in commit e5145931069a511e98a087d4cb1a8bb75f43f899 (r5256)
This was annoying, especially since the file clearly does exist in that
commit:
svn list -r5256 http://svn.bricolage.cc/bricolage/branches/rev_1_8/lib/Bric/App/ApacheConfig.pm
ApacheConfig.pm
I posted to the Git mail list about this issue, but unfortunately got no
reply. Given that it was taking around 30 hours(!) to get to that point (and
about 18 hours once I started using a local copy of the Subversion repository,
thanks to a suggestion from Ask Bjørn Hansen), I started thinking about how to
simplify things a bit.
Since most of the moving stuff around happened immediately after the move to
Subversion, and before we started committing working code to the repository, it
occurred to me that I could probably go back to the original Bricolage CVS
Repository on SourceForge, migrate that to Git, and then just
migrate from Subversion starting from the first real commit there. Then I could
just stitch the two repositories together.
From CVS to Git
Thanks to advice from IRC, I used cvs2git to build a repository from a dump
from CVS. Apparently, git cvsimport makes a lot of mistakes, while cvs2git
does a decent job keeping branches and tags where they should be. It’s also
pretty fast; once I set up its configuration and ran it, it took only around 5
minutes for it to build import files for git fast-import. It also has some
nice features to rename symbols (tags), ignore tags, assign authors, etc. I’m
aware of no tool to migrate Subversion to Git that does the same thing.
Once I had my dump, I started writing a script to import it into Git. The basic
import looks like this:
I used svn2git to convert remote branches to local tags and branches. The
--no-clone option is what keeps it from doing the Subversion stuff; everything
else is the same for a new conversion from CVS. I also had to run
git reset --hard to throw out uncommitted local changes. What changes? I’m not
sure where they came from, but after the last commit is imported from CVS, all
of the local files in the master branch are deleted, but that change is not
committed. Strange, but by doing a hard reset, I reverted that change with no
harm done.
Next, I started looking at the repository in GitX, which provides a decent
graphical interface for browsing around a Git repository on Mac OS X. There I
discovered that a major benefit to importing from CVS rather than Subversion is
that, because CVS has real tags, those tags are properly migrated to Git. What
this means is that, because the Bricolage project (nearly) always tagged merges
between branches and included the name of the appropriate tag name in a merge
commit message, I was able to reconstruct the merge history in Git.
For example, there were a lot of tags named like so:
% git tag
rev_1_8_merge-2004-05-04
rev_1_6_merge-2004-05-02
rev_1_6_merge-2004-04-10
rev_1_6_merge-2004-04-09
rev_1_6_merge-2004-03-16
So if I wanted to find the merge commit that corresponded to that first tag, all
I had to do was sort the commits in GitX by date and look near 2004-05-04 for a
commit message that said something like:
Merge from rev_1_8. Will tag that branch "rev_1_8_merge-2004-05-04".
That commit’s SHA key is “b786ad1c0eeb9df827d658a81dc2d32ec6108e92”. Its
parent’s SHA key is “11dbbd49644aaa607bd83f8d542d37fcfbd5e63b”. So then all I
had to do was to tell git that there is a second parent for that commit. Looking
in GitX for the commit tagged “rev_1_8_merge-2004-05-04”, I found that its
SHA key is “4fadb117a71a49add69950eccc14b77a04c8ec68”. So to assign that as a
second parent, I write a line to the file .git/info/grafts that describes its
parentage:
Once I had all the grafts written, I just ran git filter-branch and they were
permanently rewritten to the new hierarchy.
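
Using the three SHA keys above, the graft entry is the commit followed by each of its parents, separated by spaces. A minimal shell sketch (the helper name `graft_line` is hypothetical):

```shell
# Format one line of .git/info/grafts: a commit followed by all of its parents.
# Usage: graft_line COMMIT PARENT [PARENT...]
graft_line() {
    printf '%s\n' "$*"
}

# The merge commit, its existing parent, and the tagged commit as second parent:
graft_line b786ad1c0eeb9df827d658a81dc2d32ec6108e92 \
           11dbbd49644aaa607bd83f8d542d37fcfbd5e63b \
           4fadb117a71a49add69950eccc14b77a04c8ec68
# Append the output to .git/info/grafts to record the extra parent.
```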
And that’s it! The parentage is now correct. It was a lot of busy work to create
the mapping between tags and merges, but it’s nice to have it all done and
properly mapped out historically in Git. I even found a bunch of merges with no
corresponding tags and figured out the proper commit to link them up to (though
I stopped when I got back to 2002 and things get really confusing). And
because the merge relationships are now properly recorded in Git, I can drop
those old merge tags: as workarounds for a lack of merge tracking in CVS, they
are no longer necessary in Git.
Next up, how I completed the merge from Subversion. I’ll write that once I’ve
finally got it nailed down. Unfortunately, it takes an hour or two to export
from Subversion to Git, and I’m having to do it over and over again as I figure
stuff out. But it will be done, and you’ll hear more about it here.
This week, I imported pgTAP into GitHub. It took me a day or so to wrap my
brain around how it’s all supposed to work, with generous help from Tekkub.
But I’m starting to get the hang of it, and I like it. By the end of the day, I
had sent pull requests to Test::More and Blosxom Plugins. I’m well on my way
to being hooked.
One of the things I want, however, is SVN::Notify-type commit emails. I know
that there are feeds, but they don’t have diffs, and for however much I like
using NetNewsWire to feed my political news addiction, it never worked for me
for commit activity. And besides, why download the whole damn thing again, diffs
and all (assuming that ever happens), for every refresh? Seems like a hell of a
lot of unnecessary network activity, not to mention actual CPU cycles.
So I would need a decent notification application. I happen to have one. I
originally wrote SVN::Notify after I had already written activitymail, which
sends notices for CVS commits. SVN::Notify has changed a lot over the years,
and now it’s looking a bit daunting to consider porting it to Git.
However, just to start thinking about it, SVN::Notify really does several
different things:

1. Fetches relevant information about a Subversion event.
2. Parses that information for a number of different outputs.
3. Writes the event information into one or more outputs (currently plain text
   or XHTML).
4. Constructs an email message from the outputs.
5. Sends the email message via a specified method (sendmail or SMTP).

For the initial implementation of SVN::Notify, this made a lot of sense, because
it was doing something fairly simple. It was designed to be extensible by
subclassing (successfully done by SVN::Notify::Config and
SVN::Notify::Mirror), and, later, by output filters, and that was about it.
But as I think about moving stuff to Git, and consider the weaknesses of
extensibility by subclassing (it’s just not pretty), I’m naturally rethinking
this architecture. I wouldn’t want to have to do it all over again should some
new SCM system come along in the future. So, following from a private
exchange with Martijn Van Beers, I have some preliminary thoughts on how a
hypothetical SCM::Notify (VCS::Notify?) module might be constructed:

- A single interface for fetching SCM activity information. There could be any
  number of implementations, just as long as they all provided the same
  interface. There would be a class for fetching information from Subversion,
  one for Git, one for CVS, etc.
- A single interface for writing a report for a given transaction. Again,
  there could be any number of implementations, but all would have the same
  interface: taking an SCM module and writing output to a file handle.
- A single interface for doing something with one or more outputs. Again, they
  can do things as varied as simply writing files to disk, appending to a
  feed, inserting into a database, or, of course, sending an email.
- The core module would process command-line arguments to determine what SCM
  is being used and any necessary contextual information, and just pass it on
  to the appropriate classes.

In pseudo-code, what I’m thinking is something like this:
    package SCM::Notify;

    sub run {
        my $args = shift->getopt;
        my $scm  = SCM::Interface->new(
            scm      => $args->{scm},      # e.g., "SVN" or "Git", etc.
            revision => $args->{revision},
            context  => $args->{context},  # Might include repository path for SVN.
        );
        my $report = SCM::Report->new(
            method => $args->{method}, # e.g., SMTP, sendmail, Atom, etc.
            scm    => $scm,
            format => $args->{output}, # text, html, both, etc.
            params => $args->{params}, # to, from, subject, etc.
        );
        $report->send;
    }
Then a report class just has to create a report in the specified format or formats
and do something with them. For example, a Sendmail report would put together a
report as a multipart message with each format in a single part, and then
deliver it via /sbin/sendmail, something like this:
    package SCM::Report::Sendmail;

    sub send {
        my $self = shift;
        my $fh   = $self->fh;
        for my $format ( $self->formats ) {
            print $fh SCM::Format->new(
                format => $format,
                scm    => $self->scm,
            );
        }
        $self->deliver;
    }
So those are my rather preliminary thoughts. I think it’d actually be pretty
easy to port the logic of this stuff over from SVN::Notify; what needs some more
thought is what the command-line interface might look like and how options are
passed to the various classes, since the Sendmail report class will require
different parameters than the SMTP report class or the Atom report class. But
once that’s worked out in a way that can be handled neutrally, we’ll have a much
more extensible implementation that will be easy to add on to going forward.
Any suggestions for passing different parameters to different classes in a
single interface? Everything needs to be able to be handled via command-line
options and not be ugly or difficult to use.