Just a Theory

Trans rights are human rights

Posts about RFC

RFC: Restful Secondary Key API

I’ve been working on a simple CRUD API at work, with an eye to make a nicely-designed REST interface for managing a single type of resource. It’s not a complicated API, following best practices recommended by Apigee and Microsoft. It features exactly the sorts for APIs you’d expect if you’re familiar with REST, including:

  • POST /users: Create a new user resource
  • GET /users/{uid}: Read a user resource
  • PUT /users/{uid}: Update a user resource
  • DELETE /users/{uid}: Delete a user resource
  • GET /users?{params}: Search for user resources

If you’re familiar with REST, you get the idea.

There is one requirement that proved a bit of design challenge. We will be creating canonical ID for all resources managed by the service, which will function as the primary key. The APIs above reference that key by the {uid} path variable. However, we also need to support fetching a single resource by a number of existing identifiers, including multiple legacy IDs, and natural keys like, sticking to the users example, usernames and email addresses. Unlike the search API, which returns an array of resources, we need a nice single API like GET /users/{uid} that returns a single resource, but for a secondary key. What should it look like?

None of my initial proposals were great (using username as the sample secondary key, though again, we need to support a bunch of these):

  • GET /users?username={username} — consistent with search, but does it return a collection like search or just a single entry like GET /users/{uid}? Would be weird not to return an array or not based on which parameters were used.
  • GET /users/by/username/{username} — bit weird to put a preposition in the URL. Besides, it might conflict with a planned API to fetch subsets of info for a single resource, e.g., GET /users/{uid}/profile, which might return just the profile object.
  • GET /user?username={username} — Too subtle to have the singular rather than plural, but perhaps the most REST-ish.
  • GET /lookup?obj=user&username={username} Use special verb, not very RESTful

I asked around a coding Slack, posting a few possibilities, and friendly API designers suggested some others. We agreed it was an interesting problem, easily solved if there was just one alternate that never conflicts with the primary key ID, such as GET /users/{uid || username}. But of course that’s not the problem we have: there are a bunch of these fields, and they may well overlap!

There was some interest in GET /users/by/username/{username} as an aesthetically-pleasing URL, plus it allows for

  • /by => list of unique fields
  • /by/username/ => list of all usernames?

But again, it runs up against the planned use of subdirectories to return sub-objects of a resource. One other I played around with was: GET /users/user?username={username}: The user sub-path indicates we want just one user much more than /by does, and it’s unlikely we’d ever use user to name an object in a user resource. But still, it overloads the path to mean one thing when it’s user and another when it’s a UID.

Looking back through the options, I realized that what we really want is an API that is identical to GET /users/{uid} in its behaviors and response, just with a different key. So what if we just keep using that, as originally suggested by a colleague as GET /users/{uid || username} but instead of just the raw value, we encode the key name in the URL. Turns out, colons (:) are valid in paths, so I defined this route:

  • GET /users/{key}:{value}: Fetch a single resource by looking up the {key} with the {value}. Supported {key} params are legacy_id, username, email_address, and even uid. This then becomes the canonical “look up a user resource by an ID” API.

The nice thing about this API is that it’s consistent: all keys are treated the same, as long as no key name contains a colon. Best of all, we can keep the original GET /users/{uid} API around as an alias for GET /users/uid:{value}. Or, better, continue to refer to it as the canonical path, since the PUT and DELETE actions map only to it, and document the GET /users/{key}:{value} API as accessing an alias for symlink for GET /users/{uid}. Perhaps return a Location header to the canonical URL, too?

In any event, as far as I can tell this is a unique design, so maybe it’s too weird or not properly RESTful? Would love to know of any other patterns designed to solve the problem of supporting arbitrarily-named secondary unique keys. What do you think?

Update: Aristotle Pagaltzis started a discussion on this pattern in a Gist.

Toward A Database URI Standard

As part of my effort to improve Sqitch, I plan to add support for specifying deployment targets via URIs. Inspired by Git remotes, targets will greatly simplify the specification of databases to update — especially when stored as named targets in the configuration file.

Before implementing it, though, I started casting about for a standard URI Scheme for database connections. Imagine my surprise1 to find that there is none! The closest thing to a standard is JDBC URLs. Formally, their format is simply:

jdbc:<jdbc-specific-stuff>

Turns out that JDBC URLs are barely URLs at all. I mean, fine, according to RFC 3986 they start with the jdbc: scheme followed by whatever. According to the JDBC docs, what comes after the scheme is defined as follows:

jdbc:<subprotocol>:<subname>

The “subprotocol” is simply a driver name, while the the format of the “subname can vary, depending on the subprotocol, and it can have any internal syntax the driver writer chooses, including a subsubname.” In other words, it can be anything at all. Not very satisfying, or particularly “standard.”2

In poking around the net, however, I found a fair number of database URI formats defined by various projects:

All very similar, right? Most database engines support all or a subset of these connection parts in common:

  • username
  • password
  • host address
  • port
  • database name
  • configuration parameters

So why not define a standard database URI format with all those parts, and use them where appropriate for each engine? It’s all right there, just like http URLs.

The Proposal

Here’s my proposal. Formally, it’s an opaque URI like JDBC. All database URIs start with the scheme db:. But in this case, the opaque part is an embedded URI that may be in one of two formats:

engine://[username[:password]@]host[:port][/dbname][?params]
engine:[dbname][?params]

In other words, a pretty typical http- or mailto-style URI format. We embed it in a db: URI in order to identify the URI as a database URI, and to have a single reasonable scheme to register. Informally, it’s simplest to think of a database URI as a single URI starting with the combination of the scheme and the engine, e.g., db:mysql.

Some notes on the formats:

  • The Database URI scheme is db. Consequently, database URIs always start with db:. This is the URI scheme that defines a database URI.

  • Next comes the database engine. This part is a string naming the type of database engine for the database. It must always be followed by a colon, :. There is no formal list of supported engines, though certain implementations may specify engine-specific semantics, such as a default port.

  • The authority part is separated from the engine by a double slash, //, and terminated by the next slash or end of the URI. It consists of an optional user-information part, terminated by @ (e.g., username:password@); a required host address (e.g., domain name or IP address); and an optional port number, preceded by a colon, :.

  • The path part specifies the database name or path. For URIs that contain an authority part, a path specifying a file name must be absolute. URIs without an authority may use absolute or relative paths.

  • The optional query part, separated by a question mark, ?, contains key=value pairs separated by a semicolon, ;, or ampersand, &. These parameters may be used to configure a database connection with parameters not directly supported by the rest of the URI format.

Examples

Here are some database URIs without an authority part, which is typical for non-server engines such as SQLite, where the path part is a relative or absolute file name:

  • db:sqlite:
  • db:sqlite:foo.db
  • db:sqlite:../foo.db
  • db:sqlite:/var/db/foo.sqlite

Other engines may use a database name rather than a file name:

  • db:ingres:mydb
  • db:postgresql:template1

When a URI includes an authority part, it must be preceded by a double slash:

  • db:postgresql://example.com
  • db:mysql://root@localhost
  • db:pg://postgres:secr3t@example.net

To add the database name, separate it from the authority by a single slash:

  • db:postgresql://example.com/template1
  • db:mongodb://localhost:27017/myDatabase
  • db:oracle://scott:tiger@foo.com/scott

Some databases, such as Firebird, take both a host name and a file path. These paths must be absolute:

  • db:firebird://localhost/tmp/test.gdb
  • db:firebird://localhost/C:/temp/test.gdb

Any URI format may optionally have a query part containing key/value pairs:

  • db:sqlite:foo.db?foreign_keys=ON;journal_mode=WAL
  • db:pg://localhost:5433/postgres?client_encoding=utf8;connect_timeout=10

Issues

In discussing this proposal with various folks, I’ve become aware of a few challenges to standardization.

First, the requirement that the authority part must include a host address prevents the specification of a URI with a username that can be used to connect to a Unix socket. PostgreSQL and MySQL, among others provide authenticated socket connections. While RFC 3986 requires the host name, its predecessor, RFC 2396, does not. Furthermore, as a precedent, neither do file URIs. So I’m thinking of allowing something like this to connect to a PostgreSQL database

db:pg://postgres:secr3t@/

In short, it makes sense to allow the user information without a host name.

The second issue is the disallowing of relative file names in the path part following an authority part. The problem here is that most database engines don’t use paths for database names, so a leading slash makes no sense. For example, in db:pg:localhost/foo, the PostgreSQL database name is foo, not /foo. Yet in db:firebird:localhost/foo, the Firebird database name is a path, /foo. So each engine implementation must know whether or not the path part is a file name.

But some databases may in fact allow a path to be specified for a local connection, and a name for a remote connection. Informix appears to support such variation. So how is one to know whether the path is a file path or a named database? The two variants cannot be distinguished.

RFC 2396 is quite explicit that the path part must be absolute when following an authority part. But RFC 3986 forbids the double slash only when there is no authority part. Therefore, I think it might be best to require a second slash for absolute paths. Engines that use a simple name or relative path can have it just after the slash, while an absolute path could use a second slash:

  • Absolute: db:firebird://localhost//tmp/test.gdb
  • Relative: db:firebird://localhost/db/test.gdb
  • Name: db:postgresql://localhost/template1

That’s It

The path issue aside, I feel like this is a pretty simple proposal, and could have wide utility. I’ve already knocked out a Perl reference implementation, URI::db. Given the wide availability of URI parsers in various programming languages, I wouldn’t expect it to be difficult to port, either.

The uri-db project is the canonical home for the proposal for now, so check there for updates. And your feedback would be appreciated! What other issues have I overlooked? What have I got wrong? Let me know!


  1. As in not surprised at all. Though I was hoping! ↩︎

  2. DSNs for Perl’s DBI aren’t much better: dbi:<driver>:<driver-specific-stuff>↩︎

RFC: A Simple Markdown Table Format

I’ve been thinking about markdown tables a bit lately. I’ve had in mind to follow up on my definition list proposal with a second proposal for the creation and editing of simple tables in Markdown. For better or for worse, an aside on the markdown-discuss mail list led to a longish thread about a syntax for continuing lines in tables (not to mention a long aside on the use of monospaced fonts, but I digress), wherein I realized, after an open-minded post from MultiMarkdown’s Fletcher Penney, that I needed to set to working up this request for comments sooner rather than later.

Requirements

All of which is to say that this blog entry is a request for comments on a proposed syntax for simple tables in Markdown. The requirements for such a feature, to my thinking, are:

  • Simple tables only
  • Formatting should be implicit
  • Support for a simple caption
  • Support for representing column headers
  • Support for left, right, and center alignment
  • Support for multicolumn cells
  • Support for empty cells
  • Support for multiline (wrapping) cells
  • Support for multiple table bodies
  • Support for inline Markdown (spans, lists, etc.)
  • Support for all features (if not syntax) of MultiMarkdown tables.

By “simple tables” in that first bullet, I mean that they should look good in 78 character-wide monospaced plain text. Anything more complicated should just be done in XHTML. My goal is to be able to handle the vast majority of simple cases, not to handle every kind of table. That’s not to say that one won’t be able to use the syntax to create more complicated tables, just that it might not be appropriate to do so, and many more advanced features of tables will just have to be done in XHTML.

And by “implicit formatting” in the second bullet, I mean that the syntax should use the bare minimum number of punctuation characters to provide hints about formatting. Another way to think about it is that formatting hints should be completely invisible to a casual reader of the Markdown text.

Most of the rest of the requirements I borrowed from MultiMarkdown, with the last bullet thrown in just to cover anything I might have missed. The MultiMarkdown syntax appears to be a superset of the PHP Markdown Extra syntax, so that’s covered, too.

Prior Art: Databases

When I think about the display of tables in plain text, the first piece of prior art I think of is the output from command-line database clients. Database developers have been thinking about tables since, well, the beginning, so it makes sense to see what they’re doing. So I wrote a bit of SQL and ran it in three databases. The SQL builds a table with an integer, a short name, a textual description, and a decimal number. Here’s the code:

CREATE TEMPORARY TABLE widgets (
    id          integer,
    name        text,
    description text,
    price       numeric(6,2)
);

INSERT INTO widgets VALUES( 1, 'gizmo', 'Takes care of the doohickies', 1.99);
INSERT INTO widgets VALUES( 2, 'doodad', 'Collects *gizmos*', 23.8);
INSERT INTO widgets VALUES( 10, 'dojigger', 'Handles:
*   gizmos
*   doodads
*   thingamobobs', 102.98);
INSERT INTO widgets VALUES(1024, 'thingamabob', 'Self-explanatory, no?', 0.99);

SELECT * FROM widgets;

My goal here was to see how the database client would format a variety of data formats, as well as a textual column (“description”) with newlines in it (and a Markdown list, no less!), as the newlines will force the output to appear on multiple lines for a single row. This is one of the features that is missing from the existing Markdown implementations, which all require that the text all be on a single line.

The first database client in which I ran this code was psql 8.3, the interactive terminal for PostgreSQL 8.3. Its output looks like this:

  id  |    name     |         description          | price  
------+-------------+------------------------------+--------
    1 | gizmo       | Takes care of the doohickies |   1.99
    2 | doodad      | Collects *gizmos*            |  23.80
   10 | dojigger    | Handles:                     | 102.98
                    : * gizmos                       
                    : * doodads                      
                    : * thingamobobs                 
 1024 | thingamabob | Self-explanatory, no?        |   0.99

As you can see, PostgreSQL properly right-aligned the integer and numeric columns. It also has a very nice syntax for demonstrating continuing lines for a given column: the colon. The colon is really nice here because it looks kind of like a broken pipe character, which is an excellent mnemonic for a string of text that breaks over multiple lines. Really, this is just a very nice output format overall.

The next database client I tried was mysql 5.0, the command-line client for MySQL 5.0. Its output looks like this:

+------+-------------+--------------------------------------------+--------+
| id   | name        | description                                | price  |
+------+-------------+--------------------------------------------+--------+
|    1 | gizmo       | Takes care of the doohickies               |   1.99 | 
|    2 | doodad      | Collects *gizmos*                          |  23.80 | 
|   10 | dojigger    | Handles:
* gizmos
* doodads
* thingamobobs | 102.98 | 
| 1024 | thingamabob | Self-explanatory, no?                      |   0.99 | 
+------+-------------+--------------------------------------------+--------+

Once again we have very good alignment of the numeric data types. Furthermore, MySQL uses exactly the same syntax as PostgreSQL to represent the separation between column headers and column rows, although the PostgreSQL version is a bit more minimalist. The MySQL version just hast a little more stuff in it

Where the MySQL version fails, however, is in the representation of the continuing lines for the “dojigger” row. First of all, it set the width of the “description” column to the longest value in that column, but since that longest value includes newlines, it actually ends up being much too long—much longer than PostgreSQL’s representation of the same column. And second, as a symptom of that problem, nothing special is done with the wrapped lines. The newlines are simply output like any other character, with no attempt to line up the column. This has the side effect of orphaning the price for the “dojigger” after the last line of the continuing description. So its alignment is shot, too.

To be fair, PostgreSQL’s display featured almost exactly the same handling of continuing columns prior to version 8.2. But I do think that their solution featuring the colons is a good one.

The last database client I tried was SQLite 3.6. This client is the most different of all. I set .header ON and .mode column and got this output:

id          name        description                   price     
----------  ----------  ----------------------------  ----------
1           gizmo       Takes care of the doohickies  1.99      
2           doodad      Collects *gizmos*             23.8      
10          dojigger    Handles:
* gizmos
* doodads
  102.98    
1024        thingamabo  Self-explanatory, no?         0.99      

I don’t think this is at all useful for Markdown.

Prior Art: MultiMarkdown

Getting back to Markdown now, here is the MultiMarkdown syntax, borrowed from the documentation:

|             |          Grouping           ||
First Header  | Second Header | Third Header |
 ------------ | :-----------: | -----------: |
Content       |          *Long Cell*        ||
Content       |   **Cell**    |         Cell |

New section   |     More      |         Data |
And more      |            And more          |
[Prototype table]

There are a few interesting features to this syntax, including support for multiple lines of headers, multicolumn cells alignment queues, and captions. I like nearly everything about this syntax, except for two things:

  1. There is no support for multiline cell values.
  2. The explicit alignment queues are, to my eye, distracting.

The first issue can be solved rather nicely with PostgreSQL’s use of the colon to indicate continued lines. I think it could even optionally use colons to highlight all rows in the output, not just the continuing one, as suggested by Benoit Perdu on the markdown-discuss list:

  id  |    name     |         description          | price  
------+-------------+------------------------------+--------
    1 | gizmo       | Takes care of the doohickies |   1.99
    2 | doodad      | Collects *gizmos*            |  23.80
   10 | dojigger    | Handles:                     | 102.98
      :             : * gizmos                     : 
      :             : * doodads                    : 
      :             : * thingamobobs               : 
 1024 | thingamabob | Self-explanatory, no?        |   0.99

I think I prefer the colon only in front of the continuing cell, but see no reason why both couldn’t be supported.

The second issue is a bit more subtle. My problem with the alignment hints, embodied by the colons in the header line, is that to the reader of the plain-text Markdown they fill no obvious purpose, but are provided purely for the convenience of the parser. In my opinion, if there is some part of the Markdown syntax that provides no obvious meaning to the user, it should be omitted. I take this point of view not only for my own selfish purposes, which are, of course, many and rampant, but from John Gruber’s original design goal for Markdown, which was:

The overriding design goal for Markdown’s formatting syntax is to make it as readable as possible. The idea is that a Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions. While Markdown’s syntax has been influenced by several existing text-to-HTML filters, the single biggest source of inspiration for Markdown’s syntax is the format of plain text email.

To me, those colons are formatting instructions. So, how else could we support alignment of cells but with formatting instructions? Why, by formatting the cells themselves, of course. Take a look again at the PostgreSQL and MySQL outputs. both simply align values in their cells. There is absolutely no reason why a decent parser couldn’t do the same on a cell-by-cell basis if the table Markdown follows these simple rules:

  • For a left-aligned cell, the content should have no more than one space between the pipe character that precedes it, or the beginning of the line.
  • For a right-aligned cell, the content should have no more than one space between itself and the pipe character that succeeds it, or the end of the line.
  • For a centered cell, the content should have at least two characters between itself and both its left and right borders.
  • If a cell has one space before and one space after its content, it is assumed to be left-aligned unless the cell that precedes it or, in the case of the first cell, the cell that succeeds it, is right-aligned.

What this means, in effect, is that you can create tables wherein you line things up for proper display with a proportional font and, in general, the Markdown parser will know what you mean. A quick example, borrowing from the PostgreSQL output:

  id  |    name     |         description          |  price  
------+-------------+------------------------------+--------
    1 | gizmo       | Takes care of the doohickies |   1.99 
    2 | doodad      | Collects *gizmos*            |  23.80 
   10 | dojigger    | Handles stuff                | 102.98 
 1024 | thingamabob | Self-explanatory, no?        |   0.99 

The outcome for this example is that:

  • The table headers are all center-aligned, because they all have 2 or more spaces on each side of their values
  • The contents of the “id” column are all right-aligned. This includes 1024, which ambiguously has only one space on each side of it, so it makes the determination based on the preceding line.
  • The contents of the “name” column are all left-aligned. This includes “thingamabob”, which ambiguously has only one space on each side of it, so it makes the determination based on the preceding line.
  • The contents of the “description” column are also all left-aligned. This includes first row, which ambiguously has only one space on each side of it, so it makes the determination based on the succeeding line.
  • And finally, the contents of the “price” column are all right-aligned. This includes 102.98, which ambiguously has only one space on each side of it, so it makes the determination based on the preceding line.

And that’s it. The alignments are perfectly clear to the parser and highly legible to the reader. No further markup is required.

Proposed Syntax

So, with this review, I’d like to propose the following syntax. It is inspired largely by a combination of PostgreSQL and MySQL’s output, as well as by MultiMarkdown’s syntax.

  • A table row is identifiable by the use of one or more pipe (|) characters in a line of text, aside from those found in a literal span (backticks).
  • Table headers are identified as a table row with the immediately-following line containing only -, |, +, :or spaces. (This is the same as the MultiMarkdown syntax, but with the addition fo the plus sign.)
  • Columns are separated by |, except on the header underline, where they may optionally be separated by +, and on continuing lines (see next point).
  • Lines that continue content from one or more cells from a previous line must use : to separate cells with continued content. The content of such cells must line up with the cell width on the first line, determined by the number of spaces (tabs won’t work). They may optionally demarcate all cells on continued lines, or just the cells that contain continued content.
  • Alignment of cell content is to be determined on a cell-by-cell basis, with reference to the same cell on the preceding or succeeding line as necessary to resolve ambiguities.
  • To indicate that a cell should span multiple columns, there should be additional pipes (|) at the end of the cell, as in MultiMarkdown. If the cell in question is at the end of the row, then of course that means that pipes are not optional at the end of that row.
  • You can use normal Markdown markup within the table cells, including multiline formats such as lists, as long as they are properly indented and denoted by colons on succeeding lines.
  • Captions are optional, but if present must be at the beginning of the line immediately preceding or following the table, start with [ and end with ], as in MultiMarkdown. If you have a caption before and after the table, only the first match will be used.
  • If you have a caption, you can also have a label, allowing you to create anchors pointing to the table, as in MultiMarkdown. If there is no label, then the caption acts as the label.
  • Cells may not be empty, except as represented by the appropriate number of space characters to match the width of the cell in all rows.
  • As in MultiMarkdown. You can create multiple <tbody> tags within a table by having a single empty line between rows of the table.

Sound like a lot? Well, if you’re acquainted with MultiMarkdown’s syntax, it’s essentially the same, but with these few changes:

  • Implicit cell alignment
  • Cell content continuation
  • Stricter use of space, for proper alignment in plain text (which all of the MultiMarkdown examples I’ve seen tend to do anyway)
  • Allow + to separate columns in the header-demarking lines
  • A table does not have to start right at the beginning of a line

I think that, for purposes of backwards compatibility, we could still allow the use of : in the header lines to indicate alignment, thus also providing a method to override implicit alignment in those rare cases where you really need to do so. I think that the only other change I would make is to eliminate the requirement that the first row be made the table header row if now header line is present. But that’s a gimme, really.

Taking the original MultiMarkdown example and rework it with these changes yields:

|               |            Grouping            ||
+---------------+---------------------------------+
| First Header  |  Second Header  |  Third Header |
+---------------+-----------------+---------------+
| Content       |           *Long Cell*          ||
: continued     :                                ::
: content       :                                ::
| Content       |    **Cell**     |          Cell |
: continued     :                 :               :
: content       :                 :               :

| New section   |      More       |          Data |
| And more      |             And more           ||
 [Prototype table]

Comments?

I think I’ve gone on long enough here, especially since it ultimately comes down to some refinements to the MultiMarkdown syntax. Ultimately, what I’m trying to do here is to push MultiMarkdown to be just a bit more Markdownish (by which I mean that it’s more natural to read as plain text), as well as to add a little more support for some advanced features. The fact that I’ll be able to cut-and-paste the output from my favorite database utilities is a handy bonus.

As it happens, John Gruber today posted a comment to the markdown-discuss mail list in which he says (not for the first time, I expect):

A hypothetical official table syntax for Markdown will almost certainly look very much, if not exactly, like Michel’s table syntax in PHP Markdown Extra.

I hope that he finds this post in that vein, as my goals here were to embrace the PHP Markdown Extra and MultiMarkdown formats, make a few tweaks, and see what people think, with an eye toward contributing toward a (currently hypothetical) official table syntax.

So what do you think? Please leave a comment, or comment on the markdown-discuss list, where I’ll post a synopsis of my proposal and a link to this entry. What have I missed? What mistakes have I made? What do you like? What do you hate? Please do let me know.

Thanks!

Looking for the comments? Try the old layout.