Just a Theory

Trans rights are human rights

Posts about GitHub

Accelerate Perl Github Workflows with Caching

I’ve spent quite a few hours on evenings and weekends recently building out a comprehensive suite of GitHub Actions for Sqitch. They cover a dozen versions of Perl, nearly 70 database versions amongst nine database engines, plus a coverage test and a release workflow. A pull request can expect over 100 actions to run. Each build requires over 100 direct dependencies, plus all their dependencies. Installing them for every build would make any given run untenable.

Happily, GitHub Actions include a caching feature, and thanks to a recent improvement to shogo82148/actions-setup-perl, it’s quite easy to use in a version-independent way. Here’s an example:

name: Test
on: [push, pull_request]
jobs:
  OS:
    strategy:
      matrix:
        os: [ ubuntu, macos, windows ]
        perl: [ 'latest', '5.34', '5.32', '5.30', '5.28' ]
    name: Perl ${{ matrix.perl }} on ${{ matrix.os }}
    runs-on: ${{ matrix.os }}-latest
    steps:
      - name: Checkout Source
        uses: actions/checkout@v3
      - name: Setup Perl
        id: perl
        uses: shogo82148/actions-setup-perl@v1
        with: { perl-version: "${{ matrix.perl }}" }
      - name: Cache CPAN Modules
        uses: actions/cache@v3
        with:
          path: local
          key: perl-${{ steps.perl.outputs.perl-hash }}
      - name: Install Dependencies
        run: cpm install --verbose --show-build-log-on-failure --no-test --cpanfile cpanfile
      - name: Run Tests
        env: { PERL5LIB: "${{ github.workspace }}/local/lib/perl5" }
        run: prove -lrj4

This workflow tests every permutation of OS and Perl version specified in jobs.OS.strategy.matrix, resulting in 15 jobs. The runs-on value determines the OS, while the steps section defines steps for each permutation. Let’s take each step in turn:

  • “Checkout Source” checks the project out of GitHub. Pretty much required for any project.
  • “Setup Perl” sets up the version of Perl using the value from the matrix. Note the id key set to perl, used in the next step.
  • “Cache CPAN Modules” uses the cache action to cache the directory named local with the key perl-${{ steps.perl.outputs.perl-hash }}. The key lets us keep different versions of the local directory based on a unique key. Here we’ve used the perl-hash output from the perl step defined above. The actions-setup-perl action outputs this value, which contains a hash of the output of perl -V, so we’re tying the cache to a very specific version and build of Perl. This is important since compiled modules are not compatible across major versions of Perl.
  • “Install Dependencies” uses cpm to quickly install the Perl dependencies listed in the cpanfile (a minimal cpanfile sketch follows this list). By default, it puts them into the local subdirectory of the current directory — just where we configured the cache. On the first run for a given OS and Perl version, it will install all the dependencies. But on subsequent runs it will find the dependencies already present, thanks to the cache, and quickly exit, reporting “All requirements are satisfied.” In this Sqitch job, it takes less than a second.
  • “Run Tests” runs the tests that require the dependencies. It requires the PERL5LIB environment variable to point to the location of our cached dependencies.
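
For the record, cpm reads the dependency list from the cpanfile passed to the “Install Dependencies” step via --cpanfile. Here’s a minimal sketch of one; the module names and versions are purely illustrative, not Sqitch’s actual requirements:

# cpanfile: runtime and test dependencies for cpm to install (illustrative only).
requires 'DBI';
requires 'Moo', '2.000000';
requires 'Path::Class';

on 'test' => sub {
    requires 'Test::More', '0.94';
};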

That’s the whole deal. The first run will be the slowest, depending on the number of dependencies, but subsequent runs will be much faster, up to the seven-day caching period. For a complex project like Sqitch, which uses the same OS and Perl version for most of its actions, this results in a tremendous build time savings. CI configurations we’ve used in the past often took an hour or more to run. Today, most builds take only a few minutes to test, with longer times determined not by dependency installation but by container and database latency.

Automate Postgres Extension Releases on GitHub and PGXN

Back in June, I wrote about testing Postgres extensions on multiple versions of Postgres using GitHub Actions. The pattern relies on the Docker image pgxn/pgxn-tools, which contains scripts to build and run any version of PostgreSQL, install additional dependencies, build, test, bundle, and release an extension. I’ve since updated it to support testing on the latest development release of Postgres, meaning one can test on any major version from 8.4 to (currently) 14. I’ve also created GitHub workflows for all of my PGXN extensions (except for pgTAP, which is complicated). I’m quite happy with it.

But I was never quite satisfied with the release process. Quite a number of Postgres extensions also release on GitHub; indeed, Paul Ramsey told me straight up that he did not want to manually upload extensions like pgsql-http and PostGIS to PGXN, but for PGXN to automatically pull them in when they were published on GitHub. It’s pretty cool that newer packaging systems like pkg.go.dev auto-index any packages on GitHub. Adding such a feature to PGXN would be an interesting exercise.

But since I’m low on TUITs for such a significant undertaking, I decided instead to work out how to automatically publish a release on GitHub and PGXN via GitHub Actions. After experimenting for a few months, I’ve worked out a straightforward method that should meet the needs of most projects. I’ve proven the pattern via the pair extension’s release.yml, which successfully published the v0.1.7 release today on both GitHub and PGXN. With that success, I updated the pgxn/pgxn-tools documentation with a starter example. It looks like this:

 1  name: Release
 2  on:
 3    push:
 4      tags:
 5        - 'v*' # Push events matching v1.0, v20.15.10, etc.
 6  jobs:
 7    release:
 8      name: Release on GitHub and PGXN
 9      runs-on: ubuntu-latest
10      container: pgxn/pgxn-tools
11      env:
12        # Required to create GitHub release and upload the bundle.
13        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
14      steps:
15      - name: Check out the repo
16        uses: actions/checkout@v3
17      - name: Bundle the Release
18        id: bundle
19        run: pgxn-bundle
20      - name: Release on PGXN
21        env:
22          # Required to release on PGXN.
23          PGXN_USERNAME: ${{ secrets.PGXN_USERNAME }}
24          PGXN_PASSWORD: ${{ secrets.PGXN_PASSWORD }}
25        run: pgxn-release
26      - name: Create GitHub Release
27        id: release
28        uses: actions/create-release@v1
29        with:
30          tag_name: ${{ github.ref }}
31          release_name: Release ${{ github.ref }}
32          body: |
33            Changes in this Release
34            - First Change
35            - Second Change
36      - name: Upload Release Asset
37        uses: actions/upload-release-asset@v1
38        with:
39          # Reference the upload URL and bundle name from previous steps.
40          upload_url: ${{ steps.release.outputs.upload_url }}
41          asset_path: ./${{ steps.bundle.outputs.bundle }}
42          asset_name: ${{ steps.bundle.outputs.bundle }}
43          asset_content_type: application/zip

Here’s how it works:

  • Lines 4-5 trigger the workflow only when a tag starting with the letter v is pushed to the repository. This follows the common convention of tagging releases with version numbers, such as v0.1.7 or v4.6.0-dev. This assumes that the tag represents the commit for the release.

  • Line 10 specifies that the job run in the pgxn/pgxn-tools container, where we have our tools for building and releasing extensions.

  • Line 13 passes the GITHUB_TOKEN variable into the container. This is the GitHub access token that’s automatically created for every workflow run. It lets us call the GitHub API via actions later in the workflow.

  • Step “Bundle the Release”, on Lines 17-19, validates the extension META.json file and creates the release zip file. It does so by simply reading the distribution name and version from the META.json file and archiving the Git repo into a zip file. If your process for creating a release file is more complicated, you can do it yourself here; just be sure to include an id for the step, and emit a line of text so that later actions know what file to release. The output should be appended to the $GITHUB_OUTPUT file like this, with $filename representing the name of the release file, usually $extension-$version.zip:

    echo bundle=$filename >> $GITHUB_OUTPUT
    
  • Step “Release on PGXN”, on lines 20-25, releases the extension on PGXN. We take this step first because it’s the strictest, and therefore the most likely to fail. If it fails, we don’t end up with an orphan GitHub release to clean up once we’ve fixed things for PGXN.

  • With the success of a PGXN release, step “Create GitHub Release”, on lines 26-35, uses the GitHub create-release action to create a release corresponding to the tag. Note the inclusion of id: release, which will be referenced below. You’ll want to customize the body of the release; for the pair extension, I added a simple make target to generate a file (a rough sketch of one possible generator follows this list), then pass it via the body_path config:

    - name: Generate Release Changes
      run: make latest-changes.md
    - name: Create GitHub Release
      id: release
      uses: actions/create-release@v1
      with:
        tag_name: ${{ github.ref }}
        release_name: Release ${{ github.ref }}
        body_path: latest-changes.md
    
  • Step “Upload Release Asset”, on lines 36-43, adds the release file to the GitHub release, using output of the release step to specify the URL to upload to, and the output of the bundle step to know what file to upload.
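
For the curious, here’s a rough sketch of the sort of script such a make target might run to generate latest-changes.md. It’s only an illustration, not the pair extension’s actual implementation, and it assumes a conventional Changes file in which each release section starts with a version number at the left margin:

#!/usr/bin/env perl
# Sketch: copy the notes for the most recent release from Changes into
# latest-changes.md, for use with the create-release body_path config.
use strict;
use warnings;

open my $in,  '<', 'Changes'           or die "Cannot open Changes: $!\n";
open my $out, '>', 'latest-changes.md' or die "Cannot open latest-changes.md: $!\n";

my $seen_release = 0;
while (my $line = <$in>) {
    if ($line =~ /^\d+[.]\d+/) {     # a version number starts a release section
        last if $seen_release;       # stop at the start of the previous release
        $seen_release = 1;
        next;                        # skip the version line itself
    }
    print {$out} $line if $seen_release;
}
close $out or die "Cannot close latest-changes.md: $!\n";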

Lotta steps, but works nicely. I only wish I could require that the testing workflow finish before doing a release, but I generally tag a release once it has been thoroughly tested in previous commits, so I think it’s acceptable.

Now if you’ll excuse me, I’m off to add this workflow to my other PGXN extensions.

Tutorial on GitHub

Following a very good suggestion from Pedro Melo, I’ve created a Git repository for this tutorial and put it on GitHub. I replayed each step, making each into its own commit, and tagged the state of the code for each entry.

So as I continue to make modifications, I’ll keep this repository up-to-date, and tag things as of each blog entry. This will make it easy for you to follow along; you can simply clone the repository and git pull for each post.

More soon.

Git-R-Done

Just a quick followup on the completion of the Bricolage Git migration last week: today I finished writing up a set of GitHub wiki documents explaining to my fellow Bricoleurs how to start hacking. The most important bits are:

  • Working with Git, explaining how to get set up with a forked Bricolage repository
  • Contributing a Bug Fix, an intro to the Git way of doing things (as far as I understand it)
  • Working with Branches, describing how to track a maintenance branch in your fork
  • Merging with Git, to cover the frequent merging from Bricolage maintenance branches into master, and how to get said merges pushed upstream
  • Starting a Project Branch, which you’d need to read if you were taking on a major development task, such as a Summer of Code project
  • Contributing via Email, for those who don’t want a GitHub account (needs fleshing out)
  • Creating a Release, in which the fine art of branching, tagging, and releasing is covered

If you’re familiar with the “Git way,” I would greatly appreciate your feedback on these documents. Corrections and comments welcome.

I also just wanted to say that the process of reconstructing the merge history from CVS and Subversion was quite an eye-opener for me. Not because it was difficult (it was) and required a number of hacks (it did), but because it highlighted just how much better a fit Git is for the way in which we do Open Source software development. Hell, probably closed-source, too, for that matter. I no longer will have to think about what revisions to include in a merge, or create a branch just to “tag” a merge. Hell, I’ll probably be doing merges a hell of a lot more often, just because it’s so easy, the history remains intact, and everything just stays more up-to-date and closely integrated.

But I also really appreciate the project-based emphasis of Git. A Subversion repository, I now realize, is really very much like a versioned file system. That means where things go is completely ad-hoc, or convention-driven at best. And god forbid if you decide to change the convention and move stuff around! It’s just so much more sane to get a project repository, with all of the history, branches, tags, merges, and everything else, all in one package. It’s more portable, it’s a hell of a lot faster (ever tried to check out a Subversion repository with 80 tags?), and just tighter. It encourages modularization, which can only be good. I’ll tell you, I expect to have some frustrations and challenges as I learn more about using Git, but I’m already very much happier with the overall philosophy.

Enough evangelizing. As a last statement on this, I’ve uploaded the Perl scripts I wrote to do this migration, just in case someone else finds them useful:

  • bric_cvs_to_git migrated a CVS backup to Git.
  • bric_to_git migrated Subversion from r5517 to Git.
  • stitch stitched the CVS-migrated Git repository into the Subversion-migrated Git repository for a final product.

It turned out that there were a few files lost in the conversion, which I didn’t notice until after all was said and done, but overall I’m very happy. My thanks again to Ask and the denizens of #git for all the help.

Migrating Bricolage CVS and SVN to Git

Now that I’ve successfully migrated the old Bricolage SourceForge CVS repository to Git, and also migrated Subversion to Git, it’s time to stitch the two repositories together into one with all history intact. I’m glad to say that figuring out how to do so took substantially less time than the first two steps, thanks in large part to the great help from “doener,” “Ilari,” and “Fissure” on the Freenode #git channel.

Actually, they helped me with a bit more tweaking of my CVS and Subversion conversions. One thing I realized after writing yesterday’s post was that, after running git filter-branch, I had twice as many commits as I should have had. It turns out that git filter-branch rewrites all commits, but keeps the old ones around in case you mess something up. doener also pointed out that not all of my grafts were being properly applied, because git filter-branch only applies to the currently checked-out branch. To get all of the branches, he suggested that I read the git-filter-branch documentation, where I’d find that git filter-branch --tag-name-filter cat -- --all would hit all branches. Actually, such was not clear to me from the documentation, but I took his word for it. Once I did that, to get rid of the dupes, all I had to do was git clone the repository to a new repository. And that was that.

This worked great for my CVS migration, but I realized that I also wanted to clean out metadata from the Subversion migration. Of course, git clone throws out most of the metadata, but git svn also stores some metadata at the end of every commit log message, like this:

git-svn-id: file:///Users/david/svn/bricolage/trunk@8581 e630fb3e-2914-11de-beb6-f300316f8eb1

This had been very handy as I looked through commits in GitX to find parents to set up for grafts, but with that done and everything grafted, I no longer needed it. Ilari helped me to figure out how to properly use git filter-branch to get rid of those. To do it, all I had to do was add a filter for commit messages, like so:

git filter-branch --msg-filter \
'perl -0777 -pe "s/\r?\ngit-svn-id:.+\n//ms"' \
--tag-name-filter cat -- --all

This properly strips out that ugly bit of metadata and finalizes the grafts all at the same time. Very nice.

Now it was time to combine these two repositories for a single unified history. I wasn’t able to find a good tutorial for this on the web, other than one that used a third-party Debian utility and only hooked up the master branch, using a bogus intermediary commit to do it. On the other hand, simply copying the pack files, as mentioned in the Git Wiki–and demonstrated by the scripts linked from there–also appeared to be suboptimal: The new commits were not showing up in GitX! And besides, Ilari said, “just copying packs might not suffice. There can also be loose objects.” Well, we can’t have that, can we?

Ilari suggested git-fetch, the documentation for which says that it will “download objects and refs from another repository.” Perfect! I wanted to copy the objects from my CVS migration to the Subversion migration.

My first attempt failed: some commits showed up, but not others. Ilari pointed out that it wouldn’t copy remote branches unless you asked it to do so, via “refspecs.” Since I’d cloned the repositories to get rid of the duplicate commits created by git filter-branch, all of my lovingly recreated local branches were now remote branches. Actually, this is what I want for the final repository, so I just had to figure out how to copy them. What I came up with was this:

chdir $cvs;
my @branches = map { s/^\s+//; s/\s+$//; $_ } `git branch -r`;

chdir $svn;
system qw(git fetch --tags), $cvs;

for my $branch (@branches) {
    next if $branch eq 'origin/HEAD';
    my $target = $branch =~ m{/master|rev_1_[68]$} ? "$branch-cvs" : $branch;
    system qw(git fetch --tags), $cvs,
        "refs/remotes/$branch:refs/remotes/$target";
}

It took me a while to figure out the proper incantation for referencing and creating remote branches. Once I got the refs/remotes part figured out, I found that the master, rev_1_6, and rev_1_8 branches from CVS were overwriting the Subversion branches with the same names. What I really needed was to have the CVS branches grafted as parents to the Subversion branches. The #git channel again came to my rescue, where Fissure suggested that I rename those branches when importing them, do the grafts, and then drop the renamed branches. Hence the line above that adds “-cvs” to the names of those branches.

Once the branches were imported, I simply looked for the earliest commits to those branches in Subversion and mapped them to the latest commits to the same branches in CVS, then wrote their SHA1 IDs to .git/info/grafts, like so:

open my $fh, '>', ".git/info/grafts" or die "Cannot open grafts: $!\n";
print $fh '77a35487f18d68b96d294facc1f1a41745ad914c '
        => "835ff47ee1e3d1bf228b8d0976fbebe3c7f02ae6\n", # rev_1_6
            '97ef646f5c2a7c6f47c2046c8d289c1dfc30a73d '
        => "2b9f3c5979d062614ef54afd0a01631f746fa3cb\n", # rev_1_8
            'b3b2e7f53d789bea962fe8047e119148e28865c0 '
        => "8414b64a6a434b2117294c0568c1012a17bc863b\n", # master
    ;
close $fh;

With the branches all imported and the grafts created, I simply had to run git filter-branch to make them permanent and drop the temporary CVS branches:

system qw(git filter-branch --tag-name-filter cat -- --all);
unlink '.git/info/grafts';
system qw(git branch -r -D), "origin/$_-cvs" for qw(rev_1_6 rev_1_8 master);

Now I had a complete repository, but with duplicate commits left over by git-filter-branch. To get rid of those, I need to clone the repository. But before I clone, I need the remote branches to be local branches, so that the clone will see them as remotes. For this, I wrote the following function:

sub fix_branches {
    for my $remote (map { s/^\s+//; s/\s+$//; $_ } `git branch -r`) {
        (my $local = $remote) =~ s{origin/}{};
        next if $local eq 'master';
        next if $local eq 'HEAD';
        system qw(git checkout), $remote;
        system qw(git checkout -b), $local;
    }
    system qw(git checkout master);
}

It’s important to skip the master and HEAD branches, as they’ll automatically be created by git clone. So then I call the function and run git gc to take out the trash, and then clone:

fix_branches();

run qw(git gc);
chdir '..';
run qw(git clone), "file://$svn", 'git_final';

It’s important to use the file:/// URL to clone so as to get a real clone; just pointing to the directory instead makes hard links.

Now that I had the final repository with all history intact, I was ready to push it to GitHub! Well, almost ready. First I needed to make the branches local again, and then see if I could get the repository size down a bit:

chdir 'git_final';
fix_branches();
system qw(git remote rm origin);
system qw(git remote add origin git@github.com:bricoleurs/bricolage.git);
system qw(git gc);
system qw(git repack -a -d -f --depth 50 --window 50);

And that’s it! My new Bricolage Git repository is complete, and I’ve now pushed it up to its new home on GitHub. I pushed it like this:

git push origin --all
git push origin --tags

Damn I’m glad that’s done! I’ll be getting the Subversion repository set to read-only next, and then writing some documentation for my fellow Bricoleurs on how to work with Git. For those of you who already know, fork and enjoy!

The Future of SVN::Notify

This week, I imported pgTAP into GitHub. It took me a day or so to wrap my brain around how it’s all supposed to work, with generous help from Tekkub. But I’m starting to get the hang of it, and I like it. By the end of the day, I had sent pull requests to Test::More and Blosxom Plugins. I’m well on my way to being hooked.

One of the things I want, however, is SVN::Notify-type commit emails. I know that there are feeds, but they don’t have diffs, and for however much I like using NetNewsWire to feed my political news addiction, it never worked for me for commit activity. And besides, why download the whole damn thing again, diffs and all (assuming that ever happens), for every refresh? Seems like a hell of a lot of unnecessary network activity—not to mention actual CPU cycles.

So I would need a decent notification application. I happen to have one. I originally wrote SVN::Notify after I had already written activitymail, which sends notices for CVS commits. SVN::Notify has changed a lot over the years, and now it’s looking a bit daunting to consider porting it to Git.

However, just to start thinking about it, SVN::Notify really does several different things:

  • Fetches relevant information about a Subversion event.
  • Parses that information for a number of different outputs.
  • Writes the event information into one or more outputs (currently plain text or XHTML).
  • Constructs an email message from the outputs.
  • Sends the email message via a specified method (sendmail or SMTP).

For the initial implementation of SVN::Notify, this made a lot of sense, because it was doing something fairly simple. It was designed to be extensible by subclassing (successfully done by SVN::Notify::Config and SVN::Notify::Mirror), and, later, by output filters, and that was about it.

But as I think about moving stuff to Git, and consider the weaknesses of extensibility by subclassing (it’s just not pretty), I’m naturally rethinking this architecture. I wouldn’t want to have to do it all over again should some new SCM system come along in the future. So, following from a private exchange with Martijn Van Beers, I have some preliminary thoughts on how a hypothetical SCM::Notify (VCS::Notify?) module might be constructed:

  • A single interface for fetching SCM activity information. There could be any number of implementations, just as long as they all provided the same interface. There would be a class for fetching information from Subversion, one for Git, one for CVS, etc.
  • A single interface for writing a report for a given transaction. Again, there could be any number of implementations, but all would have the same interface: taking an SCM module and writing output to a file handle.
  • A single interface for doing something with one or more outputs. Again, they can do things as varied as simply writing files to disk, appending to a feed, inserting into a database, or, of course, sending an email.
  • The core module would process command-line arguments to determine what SCM is being used and any necessary contextual information, then pass it all on to the appropriate classes.

In pseudo-code, what I’m thinking is something like this:

package SCM::Notify;

sub run {
    my $args = shift->getopt;
    my $scm  = SCM::Interface->new(
        scm      => $args->{scm}, # e.g., "SVN" or "Git", etc.
        revision => $args->{revision},
        context  => $args->{context} # Might include repository path for SVN.
    );

    my $report = SCM::Report->new(
        method => $args->{method}, # e.g., SMTP, sendmail, Atom, etc.
        scm    => $scm,
        format => $args->{output}, # text, html, both, etc.
        params => $args->{params}, # to, from, subject, etc.
    );

    $report->send;
}

Then a report class just has to create a report in the specified format or formats and do something with them. For example, a Sendmail report would put together a report as a multipart message with each format in its own part, and then deliver it via /sbin/sendmail, something like this:

package SCM::Report::Sendmail;

sub send {
    my $self = shift;
    my $fh = $self->fh;
    for my $format ( $self->formats ) {
        print $fh SCM::Format->new(
            format => $format,
            scm    => $self->scm,
        );
    }

    $self->deliver;
}

So those are my rather preliminary thoughts. I think it’d actually be pretty easy to port the logic of this stuff over from SVN::Notify; what needs some more thought is what the command-line interface might look like and how options are passed to the various classes, since the Sendmail report class will require different parameters than the SMTP report class or the Atom report class. But once that’s worked out in a way that can be handled neutrally, we’ll have a much more extensible implementation that will be easy to add on to going forward.

Any suggestions for passing different parameters to different classes in a single interface? Everything needs to be able to be handled via command-line options and not be ugly or difficult to use.
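
One preliminary idea, sketched below: have the core class parse only the options it knows about, plus repeatable --param key=value pairs that it hands off, unexamined, to whichever report class gets loaded, leaving validation to that class. The sketch assumes Getopt::Long, and the option names are hypothetical, not an existing SVN::Notify interface:

package SCM::Notify;
# Sketch only: collect report-class-specific settings as generic key/value
# pairs so the core module never needs to know what the Sendmail, SMTP, or
# Atom report classes require. Option names are hypothetical.
use strict;
use warnings;
use Getopt::Long;

sub getopt {
    my $class = shift;
    my %args = ( params => {} );
    GetOptions(
        'scm=s'      => \$args{scm},      # e.g., SVN or Git
        'revision=s' => \$args{revision},
        'method=s'   => \$args{method},   # e.g., Sendmail, SMTP, Atom
        'output=s'   => \$args{output},   # text, html, both
        'param=s%'   => $args{params},    # e.g., --param to=commits@example.com
    ) or die "Error parsing command-line options\n";
    return \%args;
}

1;

The report class would then validate the params hash itself, complaining about anything missing or unrecognized.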

So, you wanna work on this? :-)
