
Talk: Building Trunk

A line drawing of an old-style trunk with three diamond shaped stars above its closed lid.

This week, my fellow Tembonaut Ian Stanton will present the extension ecosystem mini-summit talk, “Building Trunk: A Postgres Extension Registry and CLI”. We felt it important to get some insight from a couple of the recently-developed Postgres extension registries: what problems they set out to solve, how they were built and operate, their successes in addressing those problems, and what issues remain, both for the projects and the ecosystem overall. Ian plans to give us the low-down on trunk.

Join us! Need other information or just want an invitation without using Eventbrite, hit me up at david@ this domain, on Mastodon, or via the #extensions channel on the Postgres Slack.

Mini Summit One

Great turnout and discussion for the first in a series of community talks and discussions on the postgres extension ecosystem leading up to the Extension Ecosystem Summit at pgconf.dev on May 28. Thank you!

The talk, “State of the Extension Ecosystem”, was followed by 15 minutes or so of super interesting discussion. Here are the relevant links:

For posterity, I listened through my droning and tried to capture the general outline, posted here along with interspersed chat history and some relevant links. Apologies in advance for any inaccuracies or missed nuance; I’m happy to update these notes with your corrections.

And now, to the notes!

Introduction

  • Introduced myself, first Mini Summit, six leading up to the in-person summit on May 28 at PGConf.dev in Vancouver, Canada.

  • Thought I would get things started, provide a bit of history of extensions and context for what’s next.

Presentation

  • Postgres has a long history of extensibility, originally using pure SQL or shared preload libraries. Used by a few early adopters, perhaps a couple dozen, including …

  • Explicit extension support added in Postgres 9.1 by Dimitri Fontaine, with PGXS, CREATE EXTENSION, and pg_dump & pg_restore support.

  • Example pair--1.0.sql:

    -- complain if script is sourced in psql and not CREATE EXTENSION
    \echo Use "CREATE EXTENSION pair" to load this file. \quit
    
    CREATE TYPE pair AS ( k text, v text );
    
    CREATE FUNCTION pair(text, text)
    RETURNS pair LANGUAGE SQL AS 'SELECT ROW($1, $2)::pair;';
    
    CREATE OPERATOR ~> (LEFTARG = text, RIGHTARG = text, FUNCTION = pair);
    
  • Bagel makes an appearance.

  • Example pair.control:

    # pair extension
    comment = 'A key/value pair data type'
    default_version = '1.0'
    module_pathname = '$libdir/pair'
    relocatable = true
    
  • Example Makefile:

    EXTENSION    = pair
    MODULEDIR    = $(EXTENSION)
    DOCS         = README.md
    DATA         = sql/pair--1.0.sql
    TESTS        = test/sql/base.sql
    REGRESS      = base
    REGRESS_OPTS = --inputdir=test
    MODULES      = src/pair
    PG_CONFIG   ?= pg_config
    
    PGXS := $(shell $(PG_CONFIG) --pgxs)
    include $(PGXS)
    
  • Build and Install:

    $ make
    make: Nothing to be done for `all'.
    
    $ make install
    mkdir -p '/pgsql/share/extension'
    mkdir -p '/pgsql/share/pair'
    mkdir -p '/pgsql/share/doc/pair'
    install -c -m 644 pair.control '/pgsql/share/extension/'
    install -c -m 644 sql/pair--1.0.sql  '/pgsql/share/pair/'
    install -c -m 644 README.md '/pgsql/share/doc/pair/'
    
    $ make installcheck
    # +++ regress install-check in  +++
    # using postmaster on Unix socket, default port
    ok 1         - base                                       15 ms
    1..1
    # All 1 tests passed.
    
  • CREATE EXTENSION:

    $ psql -d try -c 'CREATE EXTENSION pair'
    CREATE EXTENSION
    
    $ pg_dump -d try
    --
    -- Name: pair; Type: EXTENSION; Schema: -; Owner: -
    --
    CREATE EXTENSION IF NOT EXISTS pair WITH SCHEMA public;
    
    --
    -- Name: EXTENSION pair; Type: COMMENT; Schema: -; Owner:
    --
    COMMENT ON EXTENSION pair IS 'A key/value pair data type';
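
  • Quick sanity check (my addition, not on the original slide); a minimal usage sketch of the pair type and ~> operator defined above:

    SELECT 'foo' ~> 'bar' AS p;
    --      p
    -- -----------
    --  (foo,bar)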
    
  • Many of us saw opportunity in this new feature.

    PostgreSQL today is not merely a database, it’s an application development platform.

    — Me, 2010

  • Proposed to build PGXN. Raised funds to build it in late 2010. Launched site April 2011; Daniele Varrazzo released the CLI, and Dickson Guedes released the dev CLI.

  • Problems PGXN set out to solve:

    • Source code distribution with user registration and namespacing
    • Discovery: Search, read docs, browse tags
    • Installation: CLI to compile and install using PGXS or Configure
  • PGXN Components:

  • Problems out of scope for PGXN:

    • Binary packaging and distribution
      • Defer to apt/yum
    • Developer tooling (though dev utils helped)
    • Build tooling
      • Defer to core (PGXS)
  • PGXN Shortcomings:

    • Little development since 2012
    • Search limitations
    • Source of Record
      • Minority of available extensions on PGXN
      • Releases uneven or neglected

    In classic SDLC fashion, PGXN POC shipped as an MVP and was neglected.

    — Me, Just Now

  • Been peripheral to Postgres extensions for the last 10-12 years, but some things have happened.

  • Non-Core extension counts:

  • Daniele asks about that last source, which is just a list in a gist.

  • Joe Nelson links to the gist in Zoom chat. It is not his list, contrary to my off-the-cuff guess.

  • Why haven’t extensions taken off?

  • Lost Opportunities

    • No one canonical source to discover and install extensions
    • Difficult to find and discover extensions without canonical registry
    • Most extensions are under-documented and difficult to understand
    • They are also hard to configure and install; most people don’t want or need a compiler
    • The maturity of extensions can be difficult to gauge; it’s not systematized, and each must be independently researched
      • David Christensen in Chat “attention economy/awareness, NIH, etc”
      • Jeremy S in chat: “Maybe some people don’t know they are using extensions (I think that’s possible to some degree)”
    • There is no comprehensive binary packaging
    • Centralized source distribution is insufficient (even if it were complete)
      • jubilee in chat: Trust aspect?
      • David Johnson in chat: To seem legit you need to touch the repo at least annually to ensure it works on the newest major release. Even if you just do your compile and update the readme.
      • I mention using pgxn-utils and GitHub workflows to ensure my extensions continue working
    • There is insufficient developer tooling; pgxn-utils is not well-maintained and doesn’t build on recent Rubies, though pgrx has a lot of Rust-oriented tooling
      • Eric in chat: ❤️
      • jubilee in chat: 🦀 mentioned!
  • Filling the Gaps

  • dbdev: “The Database Package Manager for Trusted Language Extensions”: Includes only TLEs, no binary extensions

  • trunk: “A Postgres Extension Registry”: Binary distribution of curated extensions, desires to be comprehensive and cross-platform

  • pgxman: “npm for PostgreSQL”: Binary Apt package distribution of curated extensions, desires to be comprehensive and cross-platform

  • Emphases: Ease of Use. Screenshot from pgxman:

    $ curl -sfL https://install.pgx.sh | sh -
    👏🎉 pgxman successfully installed
    $ pgxman install pgvector
    The following Debian packages will be installed:
    postgresql-14-pgxman-pgvector=0.5.1
    Do you want to continue? [Y/n] y
    pgvector has been successfully installed.
    
    • Daniele in chat: “Missing a “curl | sudo sh” there…. 👀”
    • Greg Mullane (CrunchyData) [he/him] in chat: “Really not a fan of that “pipe curl stuff from internet into sh” system.”
    • Jeremy S in chat: “Someone recently reprimanded me for putting curl | psql in an extension README. From a security perspective it probably sets a better example to do curl >file.sql … psql file.sql (encourage users not to run from Internet but read/review first)”
    • jubilee in chat: “apt/yum install is just a better UI over curl | sh :^)”
    • Jeremy S in chat: “Yes and once you’re to that point there’s already more supply chain verification happening”
    • Jeremy S in chat: “It’s usually just the initial bootstrap into any system, if the setup wasn’t already in your distro”
  • Emphases: Platform neutrality. Screenshot from trunk:

    Architecture x86-64
    Operating system Debian/Ubuntu
  • Emphases: Stats. Screenshot from dbdev:

    Downloads


    20 all time downloads 0 downloads in last 30 days 1 download in last 90 days 0 downloads in last 180 days

  • Emphases: Curation. Screenshot from trunk:

    Featured 7
    Analytics 13
    Auditing / Logging 7
    Data Change Capture 6
    Connectors 27
    Data / Transformations 49
    • Damien Clochard in chat: gtg, see you later guys !
  • MVPs

    • trunk: Manual integration, Currently Debian-only

    • pgxman: Form-based submission, Currently Apt-only

    • dbdev: TLEs only, CLI publishing

    • David Christensen in chat: “go has a pretty good extensions infra, imho, wrt discovery/docs, etc. also has the benefit of the package names being the URL to access it, which is a nice convention.”

  • New Opportunities Today

    What are the community opportunities for the extension ecosystem?

    Some ideas:

    • Improved dev tools: More than pgxn-utils and pgrx
    • Canonical registry: All publicly-available extensions in one place
    • Easy publishing: auto-discovery or CI/CD pipeline publishing
    • Continuous Delivery: CI/CD pipeline publishing
    • File-free installation: TLEs
    • Documentation: Something like Go docs or Rust docs
    • File management: Put all the files for an extension in one directory
    • Improved metadata
      • Library Dependencies: utilities used by extensions
      • Platform Dependencies: system packages
      • Build pipelines: PGXS, pgrx, make, cpan, pypi, etc.
      • Artifacts: Binaries built on release
      • Classification: Curated in addition to tags
      • Extension Types: Extensions, apps, background workers, loadable libraries
    • Derived Services
      • Binary Packaging: Distributed binaries for many platforms
      • Ratings & Reviews: Stars, thumbs, comments
      • Aggregated Stats: Repository stats, etc.
      • Smoke Testing: Matrix of multi-platform test results
      • Security Scanning: Reporting vulnerabilities
      • Badging & Curation: Third-party classification, badging various statuses
  • Extension Ecosystem Summit

    Collaborate to examine the ongoing work on PostgreSQL extension distribution, examine its challenges, identify questions, propose solutions, and agree on directions for execution.

  • 🏔️ Your Summit Organizers

  • Devrim Gunduz in chat: Thanks David!

  • Schedule:

    • March 6: David Wheeler, PGXN: “State of the Extension Ecosystem”
    • March 20: Ian Stanton, Tembo: “Building Trunk: A Postgres Extension Registry and CLI”
    • April 3: Devrim Gündüz: “yum.postgresql.org and the challenges RPMifying extensions”
    • April 17: Jonathan Katz: “TLE Vision and Specifics”
    • May 1: Yurii Rashkovskii, Omnigres: “Universally buildable extensions: dev to prod”
    • May 15: (Placeholder) David Wheeler, PGXN: “Metadata for All: Enabling discovery, packaging, and community”
  • Ultimately want to talk about what’s important to you, the members of the community, to make extensions successful.

Discussion

  • Eric: I’m Eric Ridge, one of the developers behind pgrx, as you’re going through this process of building a modern extension ecosystem, let us know what we can do on the Rust side to help make your lives easier, we’re happy to help any way we can.

  • Steven Miller in chat:

    These are some areas of interest we noticed building Tembo

    Binary packaging / distribution:

    • Variable installation location
    • System dependencies / uncommon system dependencies or versions
    • Chip specific instructions (e.g. vector compiled with avx512)
    • Extension-specific file types / extra data files (e.g. anonymizer .csv data)

    Turning on extensions automatically

    • Different ways to enable extensions
    • does it need load (shared_preload_libraries, session_… etc)?
    • Does it use create extension framework?
    • Does it require a specific schema?
    • What about turning on in multiple DBs at the same time in the same cluster, with background worker?
    • Disabling, what data will be lost?
    • Validating safety / user feedback on upgrade?

    In cloud / SaaS:

    • Installing + enabling extensions quickly, without restart
    • Persisting extension files
    • Extension-specific files (e.g. libraries) versus postgres’ libraries
    • Updating
    • Troubleshooting crashes / core dumps

    Anyone else have similar problems / tips?

  • Steven Miller: These were just things I noted during the presentation. Curious if these are interesting to others on the call.

  • Daniele in chat: “Regards binary distributions, python wheels might be a useful reference.”

  • Steven Miller: That’s a good point! What do people think of the idea to just install extensions onto servers, not packages, persisted on disk next to PGDATA so they go into a single persistent volume, and the rest is managed by an immutable container?

  • Daniele: Had experience on Crunchy where we had to replace an image to get an extension. Looked for feature to have a sidecar or a volume with the extension.

  • Steven Miller: Didn’t have a separate directory just for extensions; it’s just pg_config --libdir for everything. Had to persist the entire directory, including those files from the base build, their internal files. Would have been nice to have a separate directory, extra-libdir or extra-sharedir, something like that.

  • Yurii Rashkovskii: I was working on a patch to do exactly that, but haven’t completed it. Was going to introduce additional directories to search for this stuff.

  • Steven Miller: That would be really awesome.

  • Jeremy S in chat: “Advantage of that is that a single image can be shared among systems with different needs”

  • Eric in chat: “Thoughts around “enterprise repositories” that could be self-hosted and disconnected from the internet?”

    • Ian Stanton in chat: “I’ll touch on this in the next talk, it’s crossed our minds when building the Trunk registry”
  • Steven Miller: I think that’s a great idea.

  • Bagel reappears.

  • David Wheeler: PGXN originally designed so anyone could run Manager and their own root mirror, and maybe rsync from the community one. Don’t know that anyone ever did, it’s a little complicated and most people don’t want to work with Perl. [Chuckles]. Definitely think there’s space for that. If you work with Java or Go or maybe Rust, lots of orgs like Artifactory that provide internal registries. Could be cool use case for Postgres extensions.

  • David Christensen in chat: “something that could support extension batches; like groups of related extensions that could be installed in bulk or loaded in bulk (so could accommodate the shared systems with different individual extension needs, but could be shared_preload_library configured)”

  • Steven Miller in chat: “Sounds familiar”

  • Greg Mullane (CrunchyData) [he/him] in chat: “All these items remind me of CPAN. We should see what things it (and other similar systems) get right and wrong. I’ve learned from CPAN that ratings, reviews, and badging are going to be very difficult.”

    • David Christensen in chat: “I’d assumed at the time that it was largely lifted (at least in philosophy/design) from CPAN. 🙂”
    • David Wheeler (he/him) in chat: “yes”
  • Jeremy S: I think this is mostly focused on developers, but I had a recent experience where multiple people in the past few months, new to Postgres, are trying to understand extensions. They install a version and then see there are like 15 versions installed, and get confused. Goes back to the install file. Bit of a UX angle: there are sharp edges for people trying to make sense of extensions, and the flexibility makes it hard to understand. There might be some nice guides, some architectural things explaining PGXS, or improvements to make to the design. Related, not the main topic, but good to keep in mind the UX of end users and of devs who build on Postgres but are not Postgres developers, who run their businesses on it.

  • David Wheeler: Yeah, all the files can be confusing, which is why I think trunk and pgxman are trying to simplify: just run this command and then you have it.

  • Steven Miller in chat: “I really agree with what Jeremy is saying. Right now PGXN and Trunk are taking an approach like “whatever an extension could do / how it may work, it should work on this registry”. But I think more standards / “what is a normal extension” would make the UX much easier.”

  • Jeremy S: Even with that, the available extensions view is still there. Some of that is just schema management, and that’s how core is doing schema management.
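
  • (A note added for these summaries, not said in the session: the view in question is presumably pg_available_extension_versions, which lists one row for every version script found on disk, which is why a single extension can appear to have many versions.) For example:

    SELECT name, version, installed
      FROM pg_available_extension_versions
     WHERE name = 'postgis';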

  • Steven Miller in chat: I exactly agree about the concern about multiple extensions within a package. Also version and name mismatches

  • David Wheeler: And not everything is an extension; sometimes you just want something to work, or your extension is just a utility like pg_top that you just want to use. Extensions I think were a tremendous contribution to Postgres itself, but a lot of it was wrangling the existing system for building Postgres itself to make it work for that. Could be very interesting, though quite long term — and I know Dimitri has tried this multiple times — to build a proper package management system within Postgres itself, to ease a lot of that pain and burden.

  • Tobias Bussmann in chat: “Thank you for the great overview and for taking this topic further! Unfortunately, I’ll not be able to join at pgConf.dev but will follow whatever results this will lead to. As a package maintainer, I am constantly looking in a easy way to allow users to use extensions without having to package everything ;)”

  • Steven Miller in chat: “Like auto explain for example right. i.e. a LOAD only “extension””

  • Yurii Rashkovskii: An interesting topic: what extensions are capable of doing and how they can be more self-contained. Like Steven was saying in chat: how easy is it to load and unload extensions? Example: you want an extension to hook into a part of Postgres: executor, planner, etc. How do you go about enabling them? How do you unload them, introspect them, list the hooks?

    The Omni extension provides a list of all hooks, and when you remove an extension it removes the hooks the extension provides, but this is still not part of the core. Hooks are one of the greatest ways to expand the functionality of Postgres; they allow us to experiment with Postgres before committing to the full cycle of getting a patch into Postgres. Lets us get it to users today to try. If it makes a lot of sense and people want it, it’s time to commit to the process of getting a patch committed. But if we don’t have this venue for getting extensions in, our ability to try things is limited.

  • jubilee in chat: Hmm. It seems my audio is not working.

  • David Wheeler: The next session is two weeks from today: Ian Stanton is going to talk about “Building Trunk: A Postgres Extension Registry and CLI”. Will be interesting because a number of people have decided to build a binary packaging system for extensions, just to air out what the challenges were, what problems they wanted to solve, what problems remain, and where they want to take it in the future.

  • Jeremy S in chat: “Bagel clearly has something to say”

  • David Wheeler: jubilee, I see your audio issues; do you just want to type your question into chat? We can also discuss things in the #extensions channel on the Postgres Slack.

  • David Wheeler: Thank you all for coming!

  • jubilee in chat: “I had a question which is about: Does Postgres actually support docs for extensions? Like, a lot of people don’t really WANT to read a README. Can you get docstrings for a function in psql?”

  • Ian Stanton in chat: “Thank you David!”

  • jubilee in chat: And if not, why not?

Post Presentation Discussion

From Slack:

  • David Wheeler: I see now that “jubilee” left their question in the Zoom chat.
  • David Wheeler: The closest attempt at this I’ve seen is pg_readme, which will generate Markdown for an extension from comments in the catalog and write it out to a file.
  • David G. Johnson: The COMMENT ON command adds database comments that psql describe commands should display.

Also on Slack, Greg Sabino Mullane started a longish thread on the things we want to do and build.

Talk: State of the Extension Ecosystem

Photo of the summit of Mount Hood

Update: 2024-03-06: Slides and video linked below.

A quick reminder that I’ll be giving a brief talk on the “State of the Extension Ecosystem” on Wednesday at noon US Eastern / 17:00 UTC. This talk is the first in a series of community talks and discussions on the postgres extension ecosystem leading up to the Extension Ecosystem Summit at pgconf.dev on May 28.

I plan to give a brief history of Postgres extension tools and distribution, the challenges encountered, recent developments, and opportunities for the future. It should take about 30 minutes, followed by discussion. Following this pattern for all the talks in the series, I hope to set up some engaging discussions and to surface significant topics ahead of the summit.

Join us! Need other information or just want an invitation without using Eventbrite, hit me up at david@ this domain, on Mastodon, or via the #extensions channel on the Postgres Slack.

Update: 2024-03-06: Great turnout and discussion, thank you! Links:

Extension Ecosystem Summit 2024

Logo for PGConf.dev

I’m pleased to announce that some pals and I have organized and will host the (first annual?) Extension Ecosystem Summit at PGConf.dev in Vancouver (and more, see below) on May 28:

Enabling comprehensive indexing, discovery, and binary distribution.

Participants will collaborate to examine the ongoing work on PostgreSQL extension distribution, examine its challenges, identify questions, propose solutions, and agree on directions for execution.

Going to PGConf? Select it as an “Additional Option” when you register, or update your registration if you’ve already registered. Hope to see you there!


Photo of the summit of Mount Hood

Extension Ecosystem Mini-Summit

But if you can’t make it, that’s okay, because in the lead-up to the Summit we’re hosting a series of six virtual gatherings, the Postgres Extension Ecosystem Mini-Summit.

Join us for an hour or so every other Wednesday starting March 6 to hear contributors to a variety of community and commercial extension initiatives outline the problems they want to solve, their attempts to do so, challenges discovered along the way, and dreams for an ideal extension ecosystem in the future. Tentative speaker lineup:

  • March 6: David Wheeler, PGXN: “State of the Extension Ecosystem”
  • March 20: Ian Stanton, Tembo: “Building Trunk: A Postgres Extension Registry and CLI”
  • April 3: Devrim Gündüz: “yum.postgresql.org and the challenges RPMifying extensions”
  • April 17: Jonathan Katz: “TLE Vision and Specifics”
  • May 1: Yurii Rashkovskii, Omnigres: “Universally buildable extensions: dev to prod”
  • May 15: (Placeholder) David Wheeler, PGXN: “Metadata for All: Enabling discovery, packaging, and community”

Hit the event page for details. Many thanks to my co-organizers Jeremy Schneider, David Christensen, Keith Fiske, and Devrim Gündüz, as well as the PGConf.dev organizers for making this all happen!

Update: 2024-03-06: Updated the talk schedule.

RFC: Extension Metadata Typology

Lately I’ve been thinking a lot about metadata for Postgres extensions. Traditional use cases include control file metadata, which lives in .control files used by CREATE EXTENSION and friends, and PGXN metadata, which lives in META.json files used by PGXN to index and publish extensions. But these two narrow use cases for SQL behavior and source code distribution don’t provide the information necessary to enable other use cases, including building, installing, configuring, and more.

So I have also been exploring other metadata formats, including:

These standards from neighboring communities reveal a great deal of overlap, as one might expect (everything has a name, a version, an author, license, and so on), but also types of metadata that had not occurred to me. As I took notes and gathered suggestions from colleagues and coworkers, I began to recognize natural groupings of metadata. This led to the realization that it might be easier — and more productive — to think about these groupings rather than individual fields.

I therefore propose a typology for Postgres extension metadata.

Extension Metadata Typology

Essentials

Essential information about the extension itself, including its name (or unique package name), version, list of authors, license, etc. Pretty much every metadata format encompasses this data. Ecosystem applications use it for indexing, installation locations, naming conventions, and display information.

Artifacts

A list of links and checksums for downloading the extension in one or more formats, including source code, binaries, system packages, and more. Apps use this information to determine the best option for installing an extension on a particular system.

Resources

External information about the extension, mostly links, including source code repository, bug reporting, documentation, badges, funding, etc. Apps use this data for links, of course, but also full text indexing, documentation rendering, and displaying useful information about the extension.

Contents

A description of what’s included in the extension package. Often an “extension” consists of multiple extensions, such as PostGIS, which includes postgis, postgis_tiger_geocoder, address_standardizer, and more. Furthermore, some extensions are not CREATE EXTENSION-type extensions at all, such as background workers, command-line apps, libraries, etc. Each should be listed along with documentation links where they differ from the package overall (or are simply more specific).
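
A quick way to see this from inside a cluster, assuming the PostGIS packages are installed, is to ask Postgres which extensions it can create:

SELECT name, comment
  FROM pg_available_extensions
 WHERE name LIKE 'postgis%' OR name LIKE 'address_standardizer%';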

Prerequisites

A list of external dependencies required to configure, build, test, install, and run the extension. These include not only other extensions, but also external libraries and OS-specific lists of binary package dependencies. And let’s not forget the versions of Postgres required, as well as any OS and version dependencies (e.g., does it work on Windows? FreeBSD? What versions?) and architectures (arm64, amd64, etc.)

How to Build It

Metadata that apps use to determine how to build the extension. Does it use the PostgreSQL PGXS build pipeline? Or perhaps it needs the cargo-based pgrx toolchain. Maybe a traditional ./configure && make pattern? Perl, Ruby, Python, Go, Rust, or NPM tooling? Whatever the pattern, this metadata needs to be sufficient for an ecosystem app to programmatically determine how to build an extension.

How to Install It

Usually an extension of the build metadata, the install metadata describes how to install the extension. That could be PGXS or pgrx again, but could also use other patterns — or multiple patterns! For example, perhaps an extension can be built and installed with PGXS, but it might also be TLE-safe, and therefore provide details for handing the SQL files off to a TLE installer.

This typology might include additional data, such as documentation files to install (man pages anyone?), or directories of dependent files or libraries, and the like — whatever needs to be installed for the extension.

How to Run It

Not all Postgres extensions are CREATE EXTENSION extensions. Some provide background workers to perform various tasks; others simply provide utility applications like pg_top and pg_repack. In fact pg_repack provides both a command-line application and a CREATE EXTENSION extension in one package!

This metadata also provides configuration information, both control file parameters like trusted, superuser, and schema, but also load configuration information, like whether an extension needs its libraries included in shared_preload_libraries to enable LOAD or requires a cluster restart. (Arguably this information should be in the “install” typology rather than “run”.)
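
A minimal sketch of the two load paths such metadata would have to distinguish, using auto_explain as a familiar loadable library:

-- Session-level load; no restart required.
LOAD 'auto_explain';

-- Cluster-wide preload; takes effect only after a server restart.
ALTER SYSTEM SET shared_preload_libraries = 'auto_explain';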

Classification

Classification metadata lets the extension developer associate additional information to improve discovery, such as key words. It might also allow selections from a curated list of extension classifications, such as the category slugs supported for the cargo categories field. Ecosystem apps use this data to organize extensions under key words or categories, making it easier for users to find extensions often used together or for various workloads or tasks.

Metrics and Reports

This final typology differs from the others in that its metadata derives from third party sources rather than the extension developer. It includes data such as number of downloads, build and test status on various Postgres/OS/version combinations, binary packaging distributions, test coverage, security scan results, vulnerability detection, quality metrics and user ratings, and more.

In the broader ecosystem, it would be the responsibility of the root registry to ensure that the canonical data for each extension comes only from trusted sources, although applications downstream of the root registry might extend metrics and reports metadata with their own information.

What More?

Reading through various metadata standards, I suspect this typology is fairly comprehensive, but I’m usually mistaken about such things. What other types of metadata do you find essential for the use cases you’re familiar with? Do they fit one of the types here, or do they require some other typology I’ve failed to imagine? Hit the #extensions channel on the Postgres Slack to contribute to the discussion, or give me a holler on Mastodon.

Meanwhile, I’ll be refining this typology and assigning all the metadata fields to them in the coming weeks, with an eye to proposing a community-wide metadata standard. I hope it will benefit us all; your input will ensure it does.

The History and Future of Extension Versioning

Every software distribution system deals with versioning. Early in the design of PGXN, I decided to require semantic versions (SemVer), a clearly-defined and widely-adopted version standard, even in its pre-1.0 specification. I implemented the semver data type that would properly sort semantic versions, later ported to C by Sam Vilain and eventually updated to semver 2.0.0.
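
The point of a dedicated type is correct ordering; a minimal sketch, assuming the semver extension is installed:

CREATE EXTENSION semver;

SELECT v
  FROM (VALUES ('1.2.2'::semver), ('1.10.0'), ('1.3.0-beta1'), ('1.3.0')) AS t(v)
 ORDER BY v;
-- Result: 1.2.2, 1.3.0-beta1, 1.3.0, 1.10.0. Plain text ordering would put
-- 1.10.0 before 1.2.2.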

As I’ve been thinking through the jobs and tools for the Postgres extension ecosystem, I wanted to revisit this decision, the context in which it was made, and survey the field for other options. Maybe a “PGXN v2” should do something different?

But first that context, starting with Postgres itself.

PostgreSQL Extension Version Standard

From the introduction of extensions in PostgreSQL 9.1, the project side-stepped the need for version standardization and enforcement by requiring extension authors to adopt a file naming convention instead. For example, an extension named “pair” must have a file with its name, two dashes, then the version as listed in its control file, like so:

pair--1.1.sql

As long as the file name is correct and the version part byte-compatible with the control file entry, CREATE EXTENSION will find it. To upgrade an extension the author must provide a second file with the extension name, the old version, and the new version, all delimited by double dashes. For example, to upgrade our “pair” extension to version 1.2, the author supplies all the SQL commands necessary to upgrade it in this file:

pair--1.1--1.2.sql

This pattern avoids the whole question of version standards, ordering for upgrades or downgrades, and all the rest: extension authors have full responsibility to name their files correctly.
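
To make the mechanism concrete, here is a sketch of such an upgrade; the added function is hypothetical, not part of any real pair release:

-- pair--1.1--1.2.sql: everything needed to move an installed 1.1 to 1.2
CREATE FUNCTION key(pair) RETURNS text LANGUAGE SQL AS 'SELECT $1.k';

With that file installed, a database upgrades to the new version with:

ALTER EXTENSION pair UPDATE TO '1.2';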

PGXN Versions

SemVer simplified a number of issues for PGXN in ways that the PostgreSQL extension versioning did not (without having to re-implement the core’s file naming code). PGXN wants all metadata for an extension in its META.json file, and not to derive it from other sources that could change over time.

Following the CPAN model, PGXN also required that extension releases never decrease the version.1 The well-defined sortability of semantic versions made this validation trivial. PGXN later relaxed enforcement to allow updates to previously-released versions. SemVer’s clearly specified sorting made this change possible, as the major.minor.patch precedence intuitively compares from left to right.

In other words, if one had previously released version 1.2.2, then released 1.3.0, a follow-up 1.2.3 is allowed, increasing the 1.2.x branch version, but not, say, 1.2.1, which decreases the 1.2.x branch version.
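
The semver type’s comparison operators make this rule easy to check; a sketch, again assuming the extension described above:

SELECT '1.2.3'::semver > '1.2.2'::semver;  -- true: the 1.2.x branch moves forward
SELECT '1.2.1'::semver > '1.2.2'::semver;  -- false: it would move the branch backward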

Overall, semantic versions have been great for clarity of versioning of PGXN extensions. The one bit of conflict comes from extensions that use some other version standard in the control file, usually a two-part x.y version not allowed by SemVer, which requires x.y.z (or, more specifically, major.minor.patch).

But such versions are usually compatible with SemVer, and because PGXN cares only about the contents of the META.json, they’re free to use their own versions in the control file, just as long as the META.json file uses SemVers.

For example, the recent nominatim_fdw v1.0.0 release, which of course lists "version": "1.0.0" in its META.json file, sticks to its preferred default_version = '1.0' in its control file. The extension author simply appends .0 to create a valid SemVer from their preferred version, and as long as they never use any other patch number, it remains compatible.

Versioning Alternatives

Surveying the versioning landscape in 2024 yields a number of approaches. Might we prefer an alternative for future extensions distribution? Let’s look at the possibilities.

Ad Hoc Versions

As described above, the Postgres file naming convention allows ad hoc versions. As far as I can tell, so does the R Project’s CRAN. This approach seems fine for systems that don’t need to follow version changes themselves, but is much trickier for systems that do. If I want to install the latest version of an extension, how does the installer know what that latest version is?

The answer is that the extension author must always release them in the proper order. But if someone releases 1.3.1 of an extension, and then 1.2.1, well then 1.2.1 is the latest, isn’t it? It could get confusing pretty quickly.

Seems better to require some system, so that download and install clients can get the latest version — or the latest maintenance version of an earlier release if they need it.

User Choice

Quite a few registries allow users to choose their own versioning standards, but generally with some very specific recommendations to prevent confusion for users.

  • Python Packaging is fairly liberal in the versions it allows, but strongly recommends semantic versioning or calendar versioning (more on that below).
  • CPAN (Perl) is also fairly liberal, due to its long history of module distribution, but currently requires “Decimal versions”, which are evaluated as floating-point numbers, or dotted integer versions, which require three dot-separated positive integers and must begin with the letter v.
  • RubyGems does not enforce a versioning policy, but warns that “using an ‘irrational’ policy will only be a disservice to those in the community who use your gems.” The project therefore urges developers to follow SemVer.

These three venerable projects date from an earlier period of registration and distribution, and have made concessions to times when no policies existed. Their solutions either try to cover as many legacy examples as possible while recommending better patterns going forward (Python, Perl), or simply make recommendations and punt responsibility to developers.

SemVer

More recently-designed registries avoid this problem by requiring some level of versioning standard from their inception. Nearly all use SemVer, including:

  • Go Modules, where “Each version starts with the letter v, followed by a semantic version.”
  • Cargo (Rust), which “uses SemVer for specifying version numbers. This establishes a common convention for what is compatible between different versions of a package.”
  • npm, where the “version must be parseable by node-semver, which is bundled with npm as a dependency.”

CalVer

CalVer eschews context-free incrementing integers in favor of semantically-meaningful versions, at least for some subset of a version string. In other words: make the version date-based. CalVer-versioned projects usually include the year and sometimes the month. Some examples:

  • Ubuntu uses YY.0M.MICRO, e.g., 23.04, released in April 2023, and 23.10.1, released in October 2023
  • Twisted uses YY.MM.MICRO, e.g., 22.4.0, released in April 2022

Ultimately, adoption of a CalVer format is more a choice about embedding calendar-based meaning into a version than standardizing a specific format. One can of course use CalVer semantics in a semantic version, as in the Twisted example, which is fully SemVer-compliant.

In other words, adoption of CalVer need not necessitate rejection of SemVer.
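
A quick illustration with the semver data type described earlier, assuming it is installed:

SELECT '22.4.0'::semver;    -- Twisted-style CalVer, also a valid semantic version
SELECT '2023.10.1'::semver; -- a year-based scheme without leading zeros also parses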

Package Managers

What about package managers, like RPM and Apt? Some canonical examples:

  • RPM packages use the format:

    <name>-<version>-<release>.<architecture>
    

    Here <version> is the upstream version, but RPM practices a reasonable (if baroque) version comparison of all its parts. But it does not impose a standard on upstream packages, since they of course vary tremendously between communities and projects.

  • Apt packages use a similar format:

    [epoch:]upstream_version[-debian_revision]
    

    Again, upstream_version is the version of the upstream package, and not enforced by Apt.

  • APK (Alpine Linux) packages use the format

    {digit}{.digit}...{letter}{_suf{#}}...{-r#}
    

    I believe that {digit}{.digit}...{letter} is the upstream package version.

This pattern makes perfect sense for registries that repackage software from dozens of upstream sources that may or may not have their own policies. But a system that defines the standard for a specific ecosystem, like Rust or PostgreSQL, need not maintain that flexibility.

Recommendation

Given this survey, I’m inclined to recommend that the PostgreSQL community follow the PGXN (and Go, and Rust, and npm) precedent and continue to rely on and require semantic versions for extension distribution. It’s not perfect, given the contrast with the core’s lax version requirements. CalVer partisans can still use it, though with fewer formatting options (SemVer forbids leading zeros, as in the Ubuntu 23.04 example).

But with its continuing adoption, and especially its requirement by more recent, widely-used registries, and capacity to support date-based semantics for those who desire it, I think it continues to make the most sense.

Wrong!

I’m probably wrong. I’m often mistaken in one way or another, on the details or the conclusion. Please tell me how I’ve messed up! Find me on the #extensions channel on the Postgres Slack or ping me on Mastodon.


  1. Why? Because every module on CPAN has one and only one entry in the index file. Ricardo Signes explains↩︎

Extension Ecosystem Jobs to be Done

Over on the Tembo blog I’ve published a thinking-through of what all the jobs to be done of the ideal Postgres extension ecosystem might be:

These challenges and the interest and energy put into exploring new solutions make clear that the time has come to revisit the whole idea of the PostgreSQL extension ecosystem: to work through the jobs to be done, specify the tools to do those jobs, and outline a plan for the broader Postgres community to design and build them.

Future posts will dream up the tools and make the plan; today we begin with the jobs.

🎬 Let’s get started.

Presentation: Introduction to the PGXN Architecture

As I started digging into the jobs and tools for the Postgres extension ecosystem as part of my new gig, I realized that most people have little knowledge of the PGXN architecture. I learned a lot designing PGXN and its services, and am quite pleased with where it ended up, warts and all. So I thought it worthwhile to put together a brief presentation on the fundamental design principles (static REST file API), inter-related services (root mirror, manager, API, site) and tools (CLI, CI/CD).

Yesterday, the Tembo blog published the presentation, including the video and slides, along with a high-level architecture diagram. I hope it’s a useful point of reference for the Postgres community as we look to better distribute extensions in the future.

Contemplating Decentralized Extension Publishing

TL;DR

As I think through the future of the Postgres extension ecosystem as a key part of the new job, I wanted to understand how Go decentralized publishing works. In this post I work it out, and think through how we might do something similar for Postgres extension publishing. It covers the Go architecture, namespacing challenges, and PGXS abuse; then experiments with URL-based namespacing and ponders reorganizing installed extension files; and closes with a high-level design for making it work now and in the future.

It is, admittedly, a lot, mainly written for my own edification and for the information of my fellow extension-releasing travelers.

I find it fascinating and learned a ton. Maybe you will too! But feel free to skip this post if you’re less interested in the details of the journey and want to wait for more decisive posts once I’ve reached the destination.

Introduction

Most language registries require developers to take some step to make releases. Many automate the process in CI/CD pipelines, but it requires some amount of effort on the developer’s part:

  • Register for an account
  • Learn how to format things to publish a release
  • Remember to publish again for every new version
  • Create a pipeline to automate publishing (e.g., a GitHub workflow)

Decentralized Publishing

Go’s decentralized publishing has revised this pattern: it does not require user registration or authentication to publish a module to pkg.go.dev. Rather, Go developers simply tag the source repository, and the first time someone refers to the tag in Go tools, the Go module index will include it.

For example, publishing v1.2.1 of a module in the github.com/golang/example repository takes just three commands:

git tag v1.2.1 -sm 'Tag v1.2.1'
git push --tags
go list -m github.com/golang/example@v1.2.1

After a few minutes, the module will show up in the index and then on pkg.go.dev. Anyone can run go get -u github.com/golang/example to get the latest version. Go developers rest easy in the knowledge that they’re getting the exact module they need thanks to the global checksum database, which Go uses “in many situations to detect misbehavior by proxies or origin servers”.

This design requires go get to understand multiple source code management systems: it supports Git, Subversion, Mercurial, Bazaar, and Fossil.1 It also needs the go.mod metadata file to live in the project defining the package.

But that’s really it. From the developer’s perspective it could not be easier to publish a module, because it’s a natural extension of the module development tooling and workflow of committing, tagging, and fetching code.

Decentralized Extension Publishing

Could we publish Postgres extensions in such a decentralized pattern? It might look something like this:

  • The developer places a metadata file in the proper location (control file, META.json, Cargo.toml, whatever — standard TBD)
  • To publish a release, the developer tags the repository and calls some sort of indexing service hook (perhaps from a tag-triggered release workflow)
  • The indexing service validates the extension and adds it to the index

Note that there is no registration required. It simply trusts the source code repository. It also avoids name collision: github.com/bob/hash is distinct from github.com/carol/hash.

This design does raise challenges for clients, whether they’re compiling extensions on a production system or building binary packages for distribution: they have to support various version control systems to pull the code (though starting with Git is a decent 90% solution).

Namespacing

Then there are name conflicts. Perhaps github.com/bob/hash and github.com/carol/hash both create an extension named hash. By the current control file format, the script directory and module path can use any name, but in all likelihood they use these defaults:

directory = 'extension'
module_pathname = '$libdir/hash'

Meaning .sql files will be installed in the Postgres share/extension subdirectory — along with all the other installed extensions — and library files will be installed in the library directory along with all other libraries. Something like this:

pgsql
├── lib
│   └── hash.so
└── share
    ├── extension
    │   ├── hash.control
    │   └── hash--1.0.0.sql
    └── doc
        └── hash.md

If both projects include, say, hash.control, hash--1.0.0.sql, and hash.so, the files from one will stomp all over the files of the other.

Installer Abuse

Go avoids this issue by using the domain and path from each package’s repository in its directory structure. For example, here’s a list of modules from google.golang.org repositories:

$ ls -1 ~/go/pkg/mod/google.golang.org
api@v0.134.0
api@v0.152.0
appengine@v1.6.7
genproto
genproto@v0.0.0-20230731193218-e0aa005b6bdf
grpc@v1.57.0
grpc@v1.59.0
protobuf@v1.30.0
protobuf@v1.31.0
protobuf@v1.32.0

The ~/go/pkg/mod directory has subdirectories for each VCS host name, and each of those has subdirectories for package paths. For the github.com/bob/hash example, the files would all live in ~/go/pkg/mod/github.com/bob/hash.

Could a Postgres extension build tool follow a similar distributed pattern by renaming the control file and installation files and directories to something specific for each, say github.com+bob+hash and github.com+carol+hash? That is, using the repository host name and path, but replacing the slashes in the path with some other character that wouldn’t create subdirectories — because PostgreSQL won’t find control files in subdirectories. The control file entries for github.com/carol/hash would look like this:

directory = 'github.com+carol+hash'
module_pathname = '$libdir/github.com+carol+hash'

Since PostgreSQL expects the control file to have the same name as the extension, and for SQL scripts to start with that name, the files would have to be named like so:

hash
├── Makefile
├── github.com+carol+hash.control
└── sql
    └── github.com+carol+hash--1.0.0.sql

And the Makefile contents:

EXTENSION  = github.com+carol+hash
MODULEDIR  = $(EXTENSION)
DATA       = sql/$(EXTENSION)--1.0.0.sql
PG_CONFIG ?= pg_config

PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

In other words, the extension name is the full repository host name and path, and the Makefile MODULEDIR variable tells PGXS to put all the SQL and documentation files into a directory named github.com+carol+hash — preventing them from conflicting with any other extension.

Finally, the github.com+carol+hash.control file — so named because it must have the same name as the extension — contains:

default_version = '1.0.0'
relocatable = true
directory = 'github.com+carol+hash'
module_pathname = '$libdir/github.com+carol+hash'

Note the directory parameter, which must match MODULEDIR from the Makefile, so that CREATE EXTENSION can find the SQL files. Meanwhile, module_pathname ensures that the library file has a unique name — the same as the long extension name — again to avoid conflicts with other projects.

That unsightly naming extends to SQL: using the URL format could get to be a mouthful:

CREATE EXTENSION "github.com+carol+hash";

Which is do-able, but some new SQL syntax might be useful, perhaps something like:

CREATE EXTENSION hash FROM "github.com+carol+hash";

Or, if we’re gonna really go for it, use slashes after all!

CREATE EXTENSION hash FROM "github.com/carol/hash";

Want to use both extensions but they have conflicting objects (e.g., both create a “hash” data type)? Put them into separate schemas (assuming relocatable = true in the control file):

CREATE EXTENSION hash FROM "github.com/carol/hash" WITH SCHEMA carol;
CREATE EXTENSION hash FROM "github.com/bob/hash" WITH SCHEMA bob;
CREATE TABLE try (
    h1 carol.hash,
    h2 bob.hash
);

Of course it would be nice if PostgreSQL added support for something like Oracle packages, but using schemas in the meantime may be sufficient.

Clearly we’re getting into changes to the PostgreSQL core, so put that aside and we can just use long names for creating, modifying, and dropping extensions, but not necessarily otherwise:

CREATE EXTENSION "github.com+carol+hash" WITH SCHEMA carol;
CREATE EXTENSION "github.com+bob+hash" WITH SCHEMA bob;
CREATE EXTENSION "gitlab.com+barack+kicker_type";
CREATE TABLE try (
    h1 carol.hash,
    h2 bob.hash,
    kt kicker
);

Namespacing Experiment

To confirm that this approach might work, I committed 24134fd and pushed it in the namespace-experiment branch of the semver extension. This commit changes the extension name from semver to github.com+theory+pg-semver, and follows the above steps to ensure that its files are installed with that name.

Abusing the Postgres extension installation infrastructure like this does work, but suffers from a number of drawbacks, including:

  • The extension name is super long, as before, but now so too are the files in the repository (as opposed to the installer renaming them on install). The shared library file has to have the long name, so therefore does the .c source file. The SQL files must all start with github.com+theory+pg-semver, although I skipped that bit in this commit; instead the Makefile generates just one from sql/semver.sql.
  • Any previous installation of the semver type would remain unchanged, with no upgrade path. Changing an extension’s name isn’t a great idea.

I could probably script renaming and modifying file contents like this and make it part of the build process, but it starts to get complicated. We could also modify installers to make the changes, but there are a bunch of moving parts they would have to compensate for, and given how dynamic this can be (e.g., the semver Makefile reads the extension name from META.json), we would rapidly enter the territory of edge case whac-a-mole. I suspect it’s simply too error-prone.

Proposal: Update Postgres Extension Packaging

Perhaps the Go directory pattern could inspire a similar model in Postgres, eliminating the namespace issue by teaching the Postgres extension infrastructure to include all but one of the files for an extension in a single directory. In other words, rather than files distributed like so for semver:

pgsql
├── lib
│   └── semver.so
└── share
    ├── extension
    │   ├── semver.control
    │   ├── semver--0.32.1.sql
    │   └── semver--0.32.0--0.32.1.sql
    └── doc
        └── semver.md

Make it more like this:

pgsql
└── share
    └── extension
        └── github.com
            └── theory
                └── pg-semver
                    ├── extension.control
                    ├── lib
                    │   └── semver.so
                    ├── sql
                    │   ├── semver--0.32.1.sql
                    │   └── semver--0.32.0--0.32.1.sql
                    └── doc
                        └── semver.md

Or perhaps:

pgsql
└── share
    └── extension
        └── github.com
            └── theory
                └── pg-semver
                    ├── extension.control
                    ├── semver.so
                    ├── semver--0.32.1.sql
                    ├── semver--0.32.0--0.32.1.sql
                    └── semver.md

The idea is to copy the files exactly as they’re stored in or compiled in the repository. Meanwhile, the new semver.name file — the only relevant file stored outside the extension module directory — simply points to that path:

github.com/theory/pg-semver

Then for CREATE EXTENSION semver, Postgres reads semver.name and knows where to find all the files to load the extension.

This configuration would require updates to the control file, now named extension.control, to record the full package name and appropriate locations. Add:

name = 'semver'
package = 'github.com/theory/pg-semver'

This pattern could also allow aliasing. Say we try to install a different semver extension from github.com/example/pg-semver. This is in its extension.control file:

name = 'semver'
package = 'github.com/example/pg-semver'

The installer detects that semver.name already exists for a different package and raises an error. The user could then give it a different name by running something like:

make install ALIAS_EXTENSION_NAME=semver2

This would add semver2.name right next to semver.name, and its contents would contain github.com/example/pg-semver, where all of its files are installed. This would allow CREATE EXTENSION semver2 to load it without issue (assuming no object conflicts, hopefully resolved by relocate-ability).

I realize a lot of extensions with libraries could wreak some havoc on the library resolver having to search so many library directories, but perhaps there’s some way around that as well? Curious what techniques experienced C developers might have adopted.

Back to Decentralized Publishing

An updated installed extension file structure would be nice, and is surely worth a discussion, but even if it shipped in Postgres 20, we need an updated extension ecosystem today, to work well with all supported versions of Postgres. So let’s return to the idea of decentralized publishing without such changes.

I can think of two pieces that’d be required to get Go-style decentralized extension publishing to work with the current infrastructure.

Module Uniqueness

The first is to specify a new metadata field to be unique for the entire index, and which would contain the repository path. Call it module, after Go (a single Git repository can have multiple modules). In PGXN Meta Spec-style JSON it’d look something like this:

{
    "module": "github.com/theory/pg-semver",
    "version": "0.32.1",
    "provides": {
      "semver": {
         "abstract": "A semantic version data type",
      }
    }
}

Switch from the PGXN-style uniqueness on the distribution name (usually the name of the extension) and let the module be globally unique. This would allow another party to release an extension with the same name. Even a fork where only the module is changed:

{
    "module": "github.com/example/pg-semver",
    "version": "0.32.1",
    "provides": {
      "semver": {
         "abstract": "A semantic version data type",
      }
    }
}

Both would be indexed and appear under the module name, and both would be find-able by the provided extension name, semver.

Where that name must still be unique is in a given install. In other words, while github.com/theory/pg-semver and github.com/example/pg-semver both exist in the index, the semver extension can be installed from only one of them in a given Postgres system, where the extension name semver defines its uniqueness.

This pattern would allow for much more duplication of ideas while preserving the existing per-cluster namespacing. It also allows for a future Postgres release that supports something like the flexible per-cluster packaging as described above.2

Extension Toolchain App

The second piece is an extension management application that understands all this stuff and makes it possible. It would empower both extension development workflows — including testing, metadata management, and releasing — and extension user workflows — finding, downloading, building, and installing.

Stealing from Go, imagine a developer making a release with something like this:

git tag v1.2.1 -sm 'Tag v1.2.1'
git push --tags
pgmod list -m github.com/theory/pg-semver@v1.2.1

The creatively named pgmod tells the registry to index the new version directly from its Git repository. Thereafter anyone can find it and install it with:

  • pgmod get github.com/theory/pg-semver@v1.2.1 — installs the specified version
  • pgmod get github.com/theory/pg-semver — installs the latest version
  • pgmod get semver — installs the latest version or shows a list of matching modules to select from

Any of these would fail if the cluster already has an extension named semver with a different module name. But with something like the updated extension installation locations in a future version of Postgres, that limitation could be loosened.

Challenges

Every new idea comes with challenges, and this little thought experiment is no exception. Some that immediately occur to me:

  • Not every extension can be installed directly from its repository. Perhaps the metadata could include a download link for a tarball with the results of any pre-release execution?
  • Adoption of a new CLI could be tricky. It would be useful to include the functionality in existing tools people already use, like pgrx.
  • Updating the uniqueness constraint in existing systems like PGXN might be a challenge. Most record the repository info in the resources object of META.json, so it would be doable to adapt it into a new META format, either on PGXN itself or in a new registry, should we choose to build one.
  • Getting everyone to adopt standardized version tags might take some effort. Go had the benefit of controlling its entire toolchain, while Postgres extension versioning and release management has been all over the place. However, PGXN long ago standardized on semantic versioning, and those who have released extensions on PGXN have had few issues (one can still use other version formats in the control file, for better or worse).
  • Some PGXN distributions have shipped different versions of extensions in a single release, or the same version as in other releases. The release version of the overall package (repository, really) would have to become canonical.

I’m sure there are more, I just thought of these offhand. What have you thought of? Post ’em if you got ’em in the #extensions channel on the Postgres Slack, or give me a holler on Mastodon or via email.


  1. Or does it? Yes, it does. Although the Go CLI downloads most public modules from a module proxy server like proxy.golang.org, it still must know how to download modules from a version control system when a proxy is not available. ↩︎

  2. Assuming, of course, that if and when the Postgres core adopts more bundled packaging that they’d use the same naming convention as we have in the broader ecosystem. Not a perfectly safe assumption, but given the Go precedent and wide adoption of host/path-based projects, it seems sound. ↩︎

PGXN Tools v1.4

Over on the PGXN Blog I’ve posted a brief update on recent bug fixes and improvements to the pgxn-tools Docker image, which is used fairly widely these days to test, bundle, and release Postgres extensions to PGXN. This fix is especially important for Git repositories:

v1.4.1 fixes an issue where git archive was never actually used to build a release zip archive. This behavior changed at some point, unnoticed, with the introduction of the safe.directory configuration in recent versions of Git. Inside the container the directory was never trusted, so the pgxn-bundle command caught the error, decided it wasn’t working with a Git repository, and used the zip command instead.

I also posted a gist listing PGXN distributions with a .git directory.

PGXN Challenges

PGXN Gear

Last week, I informally shared Extension Ecosystem: Jobs and Tools with colleagues in the #extensions channel on the Postgres Slack. The document surveys the jobs to be done by the ideal Postgres extension ecosystem and suggests the tools and services required to do those jobs — without reference to existing extension registries and packaging systems.

The last section enumerates some questions we need to ponder and answer. The first one on the list is:

What will PGXN’s role be in this ideal extension ecosystem?

The PostgreSQL Extension Network, or PGXN, is the original extension distribution system, created 2010–11. It has been a moderate success, but as we in the Postgres community imagine the ideal extension distribution future, it’s worthwhile to also critically examine existing tools like PGXN, both to inform the project and to realistically determine their roles in that future.

With that in mind, I here jot down some thoughts on the challenges with PGXN.

PGXN Challenges

PGXN sets a lot of precedents, particularly in its decoupling of the registry from the APIs and services that depend on it. It’s not an all-in-one thing, and it’s designed for maximum distributed dissemination via rsync and static JSON files.

But there are a number of challenges with PGXN as it currently stands; a sampling:

  • PGXN has not comprehensively indexed all public PostgreSQL extensions. While it indexes more extensions than any other registry, it falls far short of all known extensions. To be a truly canonical registry, we need to make it as simple as possible for developers to register their extensions. (More thoughts on that topic in a forthcoming post.)

  • In that vein, releasing extensions is largely a manual process. The pgxn-tools Docker image has improved the situation, allowing developers to create relatively simple GitHub workflows to automatically test and release extensions. Still, it requires intention and work by extension developers. The more seamless we can make publishing extensions the better. (More thoughts on that topic in a forthcoming post.)

  • It’s written in Perl, and therefore doesn’t feel modern or easily accessible to other developers. It’s also a challenge to build and distribute the Perl services, though Docker images could mitigate this issue. Adopting a modern compiled language like Go or Rust might increase community credibility and attract more contributions.

  • Similarly, pgxnclient is written in Python and the pgxn-utils developer tools in Ruby, increasing the universe of knowledge and skill required for developers to maintain all the tools. They’re also more difficult to distribute than compiled tools would be. Modern cross-compilable languages like Go and Rust once again simplify distribution and are well-suited to building both web services and CLIs (but not, perhaps, native UX applications — but then neither are dynamic languages like Ruby and Python).

  • The PGXN Search API uses the Apache Lucy search engine library, a project that retired in 2018. Moreover, the feature never worked very well, thanks to the decision to expose separate search indexes for different objects — and to require the user to select which one to search. People often can’t find what they need because the selected index doesn’t contain it. Worse, the default index on the site is “Documentation”, on the surface a good choice. But most extensions include no documentation other than the README, which appears in the “Distribution” index, not “Documentation”. Fundamentally, the search API and UX need to be completely re-architected and re-implemented.

  • PGXN uses its own very simple identity management and basic authentication. It would be better to have tighter community identity, perhaps through the PostgreSQL community account.

Given these issues, should we continue building on PGXN, rewrite some or all of its components, or abandon it for new services? The answer may come as a natural result of designing the overall extension ecosystem architecture or from the motivations of community consensus. But perhaps not. In the end, we’ll need a clear answer to the question.

What are your thoughts? Hit us up in the #extensions channel on the Postgres Slack, or give me a holler on Mastodon or via email. We expect to start building in earnest in February, so now’s the time!

I’m a Postgres Extensions Tembonaut

Tembo Logo

New year, new job.

I’m pleased to announce that I started a new job on January 2 at Tembo, a fully-managed PostgreSQL developer platform. Tembo blogged the news, too.

I first heard from Tembo CTO Samay Sharma last summer, when he inquired about the status of PGXN, the PostgreSQL Extension Network, which I built in 2010–11. Tembo bundles extensions into Postgres stacks, which let developers quickly spin up Postgres clusters with tools and features optimized for specific use cases and workloads. The company therefore needs to provide a wide variety of easy-to-install and well-documented extensions to power those use cases. Could PGXN play a role?

I’ve tended to PGXN’s maintenance for the last fourteen years, thanks in no small part to hosting provided by depesz. As of today’s stats, it distributes 376 extensions on behalf of 419 developers. PGXN has been a moderate success, but Samay asked how we could collaborate to build on its precedent to improve the extensions ecosystem overall.

It quickly became apparent that we share a vision for what that ecosystem could become, including:

  • Establishing the canonical Postgres community index of extensions, something PGXN has yet to achieve
  • Improving metadata standards to enable new patterns, such as automated binary packaging
  • Working with the Postgres community to establish documentation standards that encourage developers to provide comprehensive extension docs
  • Designing and building developer tools that empower more developers to build, test, distribute, and maintain extensions

Over the past decade I’ve had many ideas and discussions on these topics, but seldom had the bandwidth to work on them. In the last couple of years I’ve enabled TLS and improved the site display, increased password security, and added a notification queue with hooks that post to both Twitter (RIP @pgxn) and Mastodon (@pgxn@botsin.space). Otherwise, aside from keeping the site going, periodically approving new accounts, and eyeing the latest releases, I’ve had little bandwidth for PGXN or the broader extension ecosystem.

Now, thanks to the vision and strategy of Samay and Tembo CEO Ry Walker, I will focus on these projects full time. The Tembo team have already helped me enumerate the extension ecosystem jobs to be done and the tools required to do them. This week I’ll submit it to collaborators from across the Postgres community1 to fill in the missing parts, make adjustments and improvements, and work up a project plan.

The work also entails determining the degree to which PGXN and other extension registries (e.g., dbdev, trunk, pgxman, pgpm (WIP), etc.) will play a role or provide inspiration, and what bits should be adopted, rewritten, or discarded.2 Our goal is to build the foundations for a community-owned extensions ecosystem that people care about and will happily adopt and contribute to.

I’m thrilled to return to this problem space, re-up my participation in the PostgreSQL community, and work with great people to build out the extensions ecosystem for the future.

Want to help out or just follow along? Join the #extensions channel on the Postgres Slack. See you there.


  1. Tembo was not the only company whose representatives have reached out in the past year to talk about PGXN and improving extensions. I’ve also had conversations with Supabase, Omnigres, Hydra, and others. ↩︎

  2. Never be afraid to kill your darlings↩︎

Times Up

December 22, 2023 was my last day at The New York Times. My tenure was just under two and a half years.

My first 19 months at the company were pretty much everything I had hoped, as I collaborated with Design and Product to design a distributed platform and conceived, designed, and implemented CipherDoc, a service for encrypted data management. I’m incredibly proud of that work!

But alas, plans change. After the company mothballed the project, I refocused my time on glue work: re-platforming services, upgrading runtimes and dependencies, improving logging and observability, and documenting code, architectures, and playbooks. It felt good to reduce the onboarding, maintenance, and on-call overhead for my teams; I hope it helps them to be more productive and fulfilled in their work.

I treasure the terrific people I met at The Times, just super thoughtful, empathetic, and creative co-workers, partners, and colleagues. It’s no wonder they had the energy to form a union, The Times Tech Guild, for which I’m gratified to have helped organize and steward members. The Guild connected me with a far broader range of talented and committed people than I would have otherwise. I will miss them all, and continue to cheer for and support the movement from the outside.

And now, as 2023 winds down, I’ve decided to try something new. More news in the new year!

JSON Path Operator Confusion

The CipherDoc service offers a robust secondary key lookup API and search interface powered by JSON/SQL Path queries run against a GIN-indexed JSONB column. SQL/JSON Path, introduced in SQL:2016 and added to Postgres in version 12 in 2019, nicely enables an end-to-end JSON workflow and entity lifecycle. It’s a powerful enabler and fundamental technology underpinning CipherDoc. I’m so happy to have found it.

Confusion

However, the distinction between the SQL/JSON Path operators @@ and @? confused me. Even as I found that the @? operator worked for my needs and @@ did not, I tucked the problem into my mental backlog for later study.

The question arose again on a recent work project, and I can take a hint. It’s time to figure this thing out. Let’s see where it goes.

The docs say:

jsonb @? jsonpath → boolean
Does JSON path return any item for the specified JSON value?

'{"a":[1,2,3,4,5]}'::jsonb @? '$.a[*] ? (@ > 2)' → t


jsonb @@ jsonpath → boolean
Returns the result of a JSON path predicate check for the specified JSON value. Only the first item of the result is taken into account. If the result is not Boolean, then NULL is returned.

'{"a":[1,2,3,4,5]}'::jsonb @@ '$.a[*] > 2' → t

These read quite similarly to me: Both return true if the path query returns an item. So what’s the difference? When should I use @@ and when @?? I went so far as to ask Stack Overflow about it. The one answer directed my attention back to the jsonb_path_query() function, which returns the results from a path query.

So let’s explore how various SQL/JSON Path queries work and what values various expressions return.

Queries

The docs for jsonb_path_query say:1

jsonb_path_query ( target jsonb, path jsonpath [, vars jsonb [, silent boolean ]] ) → setof jsonb
Returns all JSON items returned by the JSON path for the specified JSON value. If the vars argument is specified, it must be a JSON object, and its fields provide named values to be substituted into the jsonpath expression. If the silent argument is specified and is true, the function suppresses the same errors as the @? and @@ operators do.
select * from jsonb_path_query(
    '{"a":[1,2,3,4,5]}',
    '$.a[*] ? (@ >= $min && @ <= $max)',
    '{"min":2, "max":4}'
) 
 jsonb_path_query
------------------
 2
 3
 4

The first thing to note is that a SQL/JSON Path query may return more than one value. This feature matters for the @@ and @? operators, which return a single boolean value based on the values returned by a path query. And path queries can return a huge variety of values. Let’s explore some examples, derived from the sample JSON value and path query from the docs.2

select jsonb_path_query('{"a":[1,2,3,4,5]}', '$ ?(@.a[*] > 2)');
    jsonb_path_query    
------------------------
 {"a": [1, 2, 3, 4, 5]}
(1 row)

This query returns the entire JSON value, because that’s what $ selects at the start of the path expression. The ?() filter returns true because its predicate expression finds at least one value in the $.a array greater than 2. Here’s what happens when the filter returns false:

select jsonb_path_query('{"a":[1,2,3,4,5]}', '$ ?(@.a[*] > 5)');
 jsonb_path_query 
------------------
(0 rows)

None of the values in the $.a array are greater than five, so the query returns no value.

To select just the array, append it to the path expression after the ?() filter:

select jsonb_path_query('{"a":[1,2,3,4,5]}', '$ ?(@.a[*] > 2).a');
 jsonb_path_query 
------------------
 [1, 2, 3, 4, 5]
(1 row)

Path Modes

One might think you could select $.a at the start of the path query to get the full array if the filter returns true, but look what happens:

select jsonb_path_query('{"a":[1,2,3,4,5]}', '$.a ?(@[*] > 2)');
 jsonb_path_query 
------------------
 3
 4
 5
(3 rows)

That’s not the array, but the individual array values that each match the predicate. Turns out this is a quirk of the Postgres implementation of path modes. From what I can glean, the SQL:2016 standard dictates something like these SQL Server descriptions:

  • In lax mode, the function returns empty values if the path expression contains an error. For example, if you request the value $.name, and the JSON text doesn’t contain a name key, the function returns null, but does not raise an error.
  • In strict mode, the function raises an error if the path expression contains an error.

But the Postgres lax mode does more than suppress errors. From the docs (emphasis added):

The lax mode facilitates matching of a JSON document structure and path expression if the JSON data does not conform to the expected schema. If an operand does not match the requirements of a particular operation, it can be automatically wrapped as an SQL/JSON array or unwrapped by converting its elements into an SQL/JSON sequence before performing this operation. Besides, comparison operators automatically unwrap their operands in the lax mode, so you can compare SQL/JSON arrays out-of-the-box.

There are a few more details, but this is the crux of it: In lax mode, which is the default, Postgres always unwraps an array. Hence the unexpected list of results.3 This could be particularly confusing when querying multiple rows:

select jsonb_path_query(v, '$.a ?(@[*] > 2)')
        from (values ('{"a":[1,2,3,4,5]}'::jsonb), ('{"a":[3,5,8]}')) x(v);
 jsonb_path_query 
------------------
 3
 4
 5
 3
 5
 8
(6 rows)

Switching to strict mode by prepending strict to the JSON Path query restores the expected behavior:

select jsonb_path_query(v, 'strict $.a ?(@[*] > 2)')
        from (values ('{"a":[1,2,3,4,5]}'::jsonb), ('{"a":[3,5,8]}')) x(v);
 jsonb_path_query 
------------------
 [1, 2, 3, 4, 5]
 [3, 5, 8]
(2 rows)

This is an important gotcha to watch for, and a good reason to test path queries thoroughly to ensure you get the results you expect. On the other hand, lax mode nicely prevents errors when a query references a path that doesn’t exist, as this simple example demonstrates:

select jsonb_path_query('{"a":[1,2,3,4,5]}', 'strict $.b');
ERROR:  JSON object does not contain key "b"

select jsonb_path_query('{"a":[1,2,3,4,5]}', 'lax $.b');
 jsonb_path_query 
------------------
(0 rows)

In general, I suggest always using strict mode when executing queries. Better still, pair strict mode with our friends the @@ and @? operators, which suppress some errors even in strict mode:

The jsonpath operators @? and @@ suppress the following errors: missing object field or array element, unexpected JSON item type, datetime and numeric errors. The jsonpath-related functions described below can also be told to suppress these types of errors. This behavior might be helpful when searching JSON document collections of varying structure.

Have a look:

select '{"a":[1,2,3,4,5]}' @? 'strict $.a';
 ?column? 
----------
 t
(1 row)

select '{"a":[1,2,3,4,5]}' @? 'strict $.b';
 ?column? 
----------
 <null>
(1 row)

No error for the unknown JSON key b in that second query! As for the error suppression in the jsonpath-related functions, that’s what the silent argument does. Compare:

select jsonb_path_query('{"a":[1,2,3,4,5]}', 'strict $.b');
ERROR:  JSON object does not contain key "b"

select jsonb_path_query('{"a":[1,2,3,4,5]}', 'strict $.b', '{}', true);
 jsonb_path_query 
------------------
(0 rows)

Boolean Predicates

The Postgres SQL/JSON Path Language docs briefly mention a pretty significant deviation from the SQL standard:

A path expression can be a Boolean predicate, although the SQL/JSON standard allows predicates only in filters. This is necessary for implementation of the @@ operator. For example, the following jsonpath expression is valid in PostgreSQL:

$.track.segments[*].HR < 70

This pithy statement has pretty significant implications for the return value of a path query. The SQL standard allows predicate expressions, which are akin to an SQL WHERE expression, only in ?() filters, as seen previously:

select jsonb_path_query('{"a":[1,2,3,4,5]}', '$ ?(@.a[*] > 2)');
    jsonb_path_query    
------------------------
 {"a": [1, 2, 3, 4, 5]}
(1 row)

This can be read as “return the path $ if @.a[*] > 2 is true”. But have a look at a predicate-only path query:

select jsonb_path_query('{"a":[1,2,3,4,5]}', '$.a[*] > 2');
 jsonb_path_query 
------------------
 true
(1 row)

This path query can be read as “Return the result of the predicate $.a[*] > 2”, which in this case is true. This is quite the divergence from the standard, which returns contents from the JSON queried, while a predicate query returns the result of the predicate expression itself. It’s almost like they’re two different things!

Don’t confuse the predicate path query return value with selecting a boolean value from the JSON. Consider this example:

select jsonb_path_query('{"a":[true,false]}', '$.a ?(@[*] == true)');
 jsonb_path_query 
------------------
 true
(1 row)

Looks the same as the predicate-only query, right? But it’s not, as shown by adding another true value to the $.a array:

select jsonb_path_query('{"a":[true,false,true]}', '$.a ?(@[*] == true)');
 jsonb_path_query 
------------------
 true
 true
(2 rows)

This path query returns the trues it finds in the $.a array. The fact that it returns values from the JSON rather than the filter predicate becomes more apparent in strict mode, which returns all of $.a if one or more elements of the array have the value true:

select jsonb_path_query('{"a":[true,false,true]}', 'strict $.a ?(@[*] == true)');
  jsonb_path_query   
---------------------
 [true, false, true]
(1 row)

This brief aside, and its mention of the @@ operator, turns out to be key to understanding the difference between @? and @@. Because it’s not just that this feature is “necessary for implementation of the @@ operator”. No, I would argue that it’s the only kind of expression usable with the @@ operator.

Match vs. Exists

Let’s get back to the @@ operator. We can use a boolean predicate JSON Path like so:

select '{"a":[1,2,3,4,5]}'::jsonb @@ '$.a[*] > 2';
 ?column? 
----------
 t
(1 row)

It returns true because the predicate JSON path query $.a[*] > 2 returns true. And when it returns false?

select '{"a":[1,2,3,4,5]}'::jsonb @@ '$.a[*] > 6';
 ?column? 
----------
 f
(1 row)

So far so good. What happens when we try to use a filter expression that returns a true value selected from the JSONB?

select '{"a":[true,false]}'::jsonb @@ '$.a ?(@[*] == true)';
 ?column? 
----------
 t
(1 row)

Looks right, doesn’t it? But recall that this query returns all of the true values from $.a, but @@ wants only a single boolean. What happens when we add another?

select '{"a":[true,false,true]}'::jsonb @@ 'strict $.a ?(@[*] == true)';
 ?column? 
----------
 <null>
(1 row)

Now it returns NULL, even though it’s clearly true that @[*] == true matches. This is because it returns all of the values it matches, as jsonb_path_query() demonstrates:

select jsonb_path_query('{"a":[true,false,true]}'::jsonb, '$.a ?(@[*] == true)');
 jsonb_path_query 
------------------
 true
 true
(2 rows)

This clearly violates the @@ documentation claim that “Only the first item of the result is taken into account”. If that were true, it would see the first value is true and return true. But it doesn’t. Turns out, the corresponding jsonb_path_match() function shows why:

select jsonb_path_match('{"a":[true,false,true]}'::jsonb, '$.a ?(@[*] == true)');
ERROR:  single boolean result is expected

Conclusion: The documentation is inaccurate. Only a single boolean is expected by @@. Anything else is an error.
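
To make that concrete, here’s a quick check I’d expect to see, reusing the sample value from the docs: hand @@ a standard path query that selects a single non-boolean item (the whole array, in this case) and the suppressed error surfaces as NULL, just as the docs quoted above describe:

select '{"a":[1,2,3,4,5]}'::jsonb @@ '$.a';
 ?column? 
----------
 <null>
(1 row)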

Furthermore, it’s dangerous, at best, to use an SQL standard JSON Path expression with @@. If you need to use it with a filter expression, you can turn it into a boolean predicate by wrapping it in exists():

select jsonb_path_match('{"a":[true,false,true]}'::jsonb, 'exists($.a ?(@[*] == true))');
 jsonb_path_match 
------------------
 t
(1 row)

But there’s no reason to do so, because that’s effectively what the @? operator (and the corresponding, cleverly-named jsonb_path_exists() function) does: it returns true if the SQL standard JSON Path expression contains any results:

select '{"a":[true,false,true]}'::jsonb @? '$.a ?(@[*] == true)';
 ?column? 
----------
 t
(1 row)

Here’s the key thing about @?: you don’t want to use a boolean predicate path query with it, either. Consider this predicate-only query:

select jsonb_path_query('{"a":[1,2,3,4,5]}'::jsonb, '$.a[*] > 6');
 jsonb_path_query 
------------------
 false
(1 row)

But see what happens when we use it with @?:

select '{"a":[1,2,3,4,5]}'::jsonb @? '$.a[*] > 6';
 ?column? 
----------
 t
(1 row)

It returns true even though the query itself returns false! Why? Because false is a value that exists and is returned by the query. Even a query that returns null is considered to exist, as it will when a strict query encounters an error:

select jsonb_path_query('{"a":[1,2,3,4,5]}'::jsonb, 'strict $[*] > 6');
 jsonb_path_query 
------------------
 null
(1 row)

select '{"a":[1,2,3,4,5]}'::jsonb @? 'strict $[*] > 6';
 ?column? 
----------
 t
(1 row)

The key thing to know about the @? operator is that it returns true if anything is returned by the path query, and returns false only if nothing is selected at all.
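
For instance, here’s the false case as I’d expect it to behave, using the same sample value: a standard path query whose filter selects nothing at all:

select '{"a":[1,2,3,4,5]}'::jsonb @? '$.a[*] ? (@ > 6)';
 ?column? 
----------
 f
(1 row)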

The Difference

In summary, the difference between the @? and @@ JSONB operators is this:

  • @? (and jsonb_path_exists()) returns true if the path query returns any values — even false or null — and false if it returns no values. This operator should be used only with SQL-standard JSON path queries that select data from the JSONB. Do not use predicate-only JSON path expressions with @?.
  • @@ (and jsonb_path_match()) returns true if the path query returns the single boolean value true and false otherwise. This operator should be used only with Postgres-specific boolean predicate JSON path queries, which return the result of the predicate expression. Do not use SQL-standard JSON path expressions with @@ (see the recap below).
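
Put concretely, here’s a minimal recap pairing each operator with its intended kind of query, using the examples from the docs quoted at the top of this post:

select '{"a":[1,2,3,4,5]}'::jsonb @? '$.a[*] ? (@ > 2)' as "@?",
       '{"a":[1,2,3,4,5]}'::jsonb @@ '$.a[*] > 2' as "@@";
 @? | @@ 
----+----
 t  | t
(1 row)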

This difference, of course, assumes awareness of the distinction between predicate path queries and SQL-standard path queries. To that end, I submitted a patch that expounds the difference between these types of JSON Path queries, and plan to submit another linking to these differences from the docs for @@ and @?.

Oh, and probably another to explain the difference in return values between strict and lax queries due to array unwrapping.

Thanks

Many thanks to Erik Wienhold for patiently answering my pgsql-hackers questions and linking me to a detailed pgsql-general thread in which the oddities of @@ were previously discussed in detail.


  1. Well almost. The docs for jsonb_path_query actually say, about the last two arguments, “The optional vars and silent arguments act the same as for jsonb_path_exists.” I replaced that sentence with the relevant sentences from the jsonb_path_exists docs, about which more later. ↩︎

  2. Though omitting the vars argument, as variable interpolation just gets in the way of understanding basic query result behavior. ↩︎

  3. In fairness, the Oracle docs also discuss “implicit array wrapping and unwrapping”, but I don’t have a recent Oracle server to experiment with at the moment. ↩︎

CipherDoc: A Searchable, Encrypted JSON Document Service on Postgres

Over the last year, I designed and implemented a simple web service, code-named “CipherDoc”, that provides a CRUD API for creating, updating, searching, and deleting JSON documents. The app enforces document structure via JSON schema, while JSON/SQL Path powers the search API by querying a hashed subset of the schema stored in a GIN-indexed JSONB column in Postgres.
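
For the curious, here’s a minimal sketch of the pattern, with hypothetical table and index names (the real schema is more involved): a JSONB column with a GIN index using the jsonb_path_ops operator class, which supports the @? and @@ path operators that power the search API:

create table documents (
    id  bigint generated always as identity primary key,
    doc jsonb not null
);

-- jsonb_path_ops GIN indexes support the @>, @?, and @@ operators.
create index documents_doc_idx on documents using gin (doc jsonb_path_ops);

-- A search using an SQL-standard JSON path query against the indexed column.
select id from documents where doc @? '$.a[*] ? (@ > 2)';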

In May I gave a public presentation on the design and implementation of the service at PGCon: CipherDoc: A Searchable, Encrypted JSON Document Service on Postgres. Links:

I enjoyed designing this service. The ability to dynamically change the JSON schema at runtime without database changes enables more agile development cycles for busy teams. Its data privacy features required a level of intellectual challenge and raw problem-solving (a.k.a., engineering) that challenged and invigorated me.

Two minor updates since May:

  1. I re-implemented the JSON/SQL Path parser using the original Postgres path grammar and goyacc, replacing the hand-written parser roundly castigated in the presentation.
  2. The service has yet to be open-sourced, but I remain optimistic, and continue to work with leadership at The Times towards an open-source policy to enable its release.

UK Architecture

We saw some spectacular architecture on a whirlwind trip to London, Bath, Edinburgh, Glasgow, and Canterbury last month. Among the long list of museums, churches, and universities we visited, I managed to capture a few photos I quite like.