Just a Theory

By David E. Wheeler

Posts about SQLite

Notes on Upcoming Sqitch Improvements

I was traveling last week, and knowing I would be offline a fair bit, not to mention seriously jet-lagged, I put my hacking efforts into getting MySQL support into Sqitch. I merged it in yesterday; check out the tutorial if you’re interested in it. I expect to release v0.980 with the MySQL support in a couple of weeks; testing and feedback would most appreciated.

There is a caveat, though: it requires MySQL v5.6.4. So if you’re stuck with an older MySQL, it won’t work. There are two reasons to require v5.6.4:

  • The microsecond precision support in DATETIME values, added in v5.6.4. This makes it much easier to keep things in the proper order (deployments usually take less than a second).
  • The SIGNAL functionality, introduced in v5.5. This allows the schema to mock a check constraint in the Sqitch database, as well as make it much easier to write verify tests (as described in the tutorial and figured out on StackOverflow).

But if you can afford to take advantage of a relatively modern MySQL, give it a shot!

The next release also makes a backwards-incompatible change to the SQLite engine: the default Sqitch database is no longer $db_dir/$db_name-sqitch.$suffix, but $db_dir/sqitch.$suffix. In other words, if you were deploying to a db named /var/db/myapp.db, Sqitch previously kept its metadata in /var/db/myapp-sqitch.db, but now will keep it in /var/db/sqitch.db. This is to make it more like the other engines (MySQL defaults to a database named “sqitch”, and Postgres and Oracle default to a schema named “sqitch”).

It’s also useful if you use the SQLite ATTACHDATABASE command to manage multiple database files in a single project. In that case, you will want to use the same metadata file for all the databases. Keep them all in the same directory with the same suffix and you get just that with the default sqitch database.

If you’d like it to have a different name, use sqitch config core.sqlite.sqitch_db $name to configure it. This will be useful if you don’t want to use the same Sqitch database to manage multiple databases, or if you do, but they live in different directories.

I haven’t released this change yet, and I am not a big-time SQLite user. So if this makes no sense, please comment on this issue. It’ll be a couple of weeks before I release v0.980, so there is time to reverse if if there’s consensus that it’s a bad idea.

But given another idea I’ve had, I suspect it will be okay. The idea is to expand on the concept of a Sqitch “target” by giving it its own command and configuration settings. Basically, it would be sort of like Git remotes: use URIs to specify database connection and parameter info (such as the sqitch database name for SQLite). These can be passed to database-touching commands, such as deploy, revert, log, and the like. They can also be given names and stored in the configuration file. The upshot is that it would enable invocations such as

sqitch deploy production
sqitch log qa
sqitch status pg://localhost/flipr?sqitch_schema=meta

See the GitHub issue for a fuller description of this feature. I’m certain that this would be useful at work, as we have a limited number of databases that we deploy each Sqitch project to, and it’s more of a PITA for my co-workers to remember to use different values for the --db-host, --db-user, --db-name and friends options. The project itself would just store the named list of relevant deployment targets.

And it alleviates the issue of specifying a different Sqitch database on SQLite or MySQL, as one can just create a named target that specifies it in the URI.

Not sure when I will get to this feature, though. I think it would be great to have, and maybe iovation would want me to spend some time on it in the next couple of months. But it might also be a great place for someone else to get started adding functionality to Sqitch.

Oh, and before I forget: it looks like Sqitch might soon get CUBRID support, too, thanks to Ștefan Suciu. Stay tuned!

Looking for the comments? Try the old layout.

Sqitch: Now with SQLite Support

This week I released Sqitch v0.961. There are a number of great new features v0.95x, including the beginning of two features I’ve had in mind since the beginning: VCS integration and support for multiple databases.

First the VCS integration. This comes in the form of the new checkout command, which automatically makes database changes for you when you change VCS branches. Say you have two branches, “widgets” and “big-fix”, and that their Sqitch plans diverge. If you’re in the “widgets” branch and want to switch to “big-fix”, just run

sqitch checkout big-fix

Sqitch will look at the “big-fix” plan, figure out the last change in common with “widgets”, and revert to it. Then it checks out “big-fix” and deploys. That’s it. Yes, you could do this yourself, but do you really remember the last common change between the two branches? Do you want to take the time to look for it, then revert, check out the new branch, and deploy? This is exactly the sort of common developer task that Sqitch aims to take the pain out of, and I’m thrilled to provide it.

You know what’s awesome, though? This feature never occurred to me. I didn’t come up with it, and didn’t implement it. No, it was dreamt up and submitted in a pull request by Ronan Dunklau. I have wanted VCS integration since the beginning, but had yet to get ‘round to it. Now Ronan has jumpstarted it. A million thanks!

One downside: it’s currently Git-only. I plan to add infrastructure for supporting multiple VCSes, probably with Git and Subversion support to begin with. Watch for that in v0.970 in the next couple months.

The other big change is the addition of SQLite support alongside the existing PostgreSQL support. Fortunately, I was able to re-use nearly all the code, so the SQLite adapter is just a couple hundred lines long. For the most part, Sqitch on SQLite works just like on PostgreSQL. The main difference is that Sqitch stores its metadata in a separate SQLite database file. This allows one to use a single metadata file to maintain multiple databases, which can be important if you use multiple databases as schemas pulled into a single connection via ATTACH DATABASE.

Curious to try it out? Install Sqitch from CPAN or via the Homebrew Tap and then follow the new Sqitch SQLite tutorial.

Of the multitude of other Changes, one other bears mentioning: the new plan command. This command is just like log, except that it shows what is in the plan file, rather than what changes have been made to the database. This can be useful for quickly listing what’s in a plan, for example when you need to remember the names of changes required by a change you’re about to add. The --oneline option is especially useful for this functionality. An example from the tutorial’s plan:

> sqitch plan --oneline
In sqitch.plan
6238d8 deploy change_pass
d82139 deploy insert_user
7e6e8b deploy pgcrypto
87952d deploy delete_flip @v1.0.0-dev2
b0a951 deploy insert_flip
834e6a deploy flips
d0acfa deploy delete_list
77fd99 deploy insert_list
1a4b9a deploy lists
0acf77 deploy change_pass @v1.0.0-dev1
ec2dca deploy insert_user
bbb98e deploy users
ae1263 deploy appschema

I personally will be using this a lot, Yep, scratching my own itch here. What itch do you have to scratch with Sqitch?

In related news, I’ll be giving a tutorial at PGCon next month, entitled “Agile Database Development”. We’ll be developing a database for a web application using Git for source code management, Sqitch for database change management, and pgTAP for unit testing. This is the stuff I do all day long at work, so you can also think of it as “Theory’s Pragmatic approach to Database Development.” See you there?

Looking for the comments? Try the old layout.

More about…

RFC: A Simple Markdown Table Format

I’ve been thinking about markdown tables a bit lately. I’ve had in mind to follow up on my definition list proposal with a second proposal for the creation and editing of simple tables in Markdown. For better or for worse, an aside on the markdown-discuss mail list led to a longish thread about a syntax for continuing lines in tables (not to mention a long aside on the use of monospaced fonts, but I digress), wherein I realized, after an open-minded post from MultiMarkdown’s Fletcher Penney, that I needed to set to working up this request for comments sooner rather than later.

Requirements

All of which is to say that this blog entry is a request for comments on a proposed sytnax for simple tables in Markdown. The requirements for such a feature, to my thinking, are:

  • Simple tables only
  • Formatting should be implicit
  • Support for a simple caption
  • Support for representing column headers
  • Support for left, right, and center alignment
  • Support for multicolumn cells
  • Support for empty cells
  • Support for multiline (wrapping) cells
  • Support for multiple table bodies
  • Support for inline Markdown (spans, lists, etc.)
  • Support for all features (if not syntax) of MultiMarkdown tables.

By “simple tables” in that first bullet, I mean that they should look good in 78 character-wide monospaced plain text. Anything more complicated should just be done in XHTML. My goal is to be able to handle the vast majority of simple cases, not to handle every kind of table. That’s not to say that one won’t be able to use the syntax to create more complicated tables, just that it might not be appropriate to do so, and many more advanced features of tables will just have to be done in XHTML.

And by “implicit formatting” in the second bullet, I mean that the syntax should use the bare minimum number of punctuation characters to provide hints about formatting. Another way to think about it is that formatting hints should be completely invisible to a casual reader of the Markdown text.

Most of the rest of the requirements I borrowed from MultiMarkdown, with the last bullet thrown in just to cover anything I might have missed. The MultiMarkdown syntax appears to be a superset of the PHP Markdown Extra syntax, so that’s covered, too.

Prior Art: Databases

When I think about the display of tables in plain text, the first piece of prior art I think of is the output from command-line database clients. Database developers have been thinking about tables since, well, the beginning, so it makes sense to see what they’re doing. So I wrote a bit of SQL and ran it in three databases. The SQL builds a table with an integer, a short name, a textual description, and a decimal number. Here’s the code:

CREATE TEMPORARY TABLE widgets (
    id          integer,
    name        text,
    description text,
    price       numeric(6,2)
);

INSERT INTO widgets VALUES( 1, 'gizmo', 'Takes care of the doohickies', 1.99);
INSERT INTO widgets VALUES( 2, 'doodad', 'Collects *gizmos*', 23.8);
INSERT INTO widgets VALUES( 10, 'dojigger', 'Handles:
* gizmos
* doodads
* thingamobobs', 102.98);
INSERT INTO widgets VALUES(1024, 'thingamabob', 'Self-explanatory, no?', 0.99);

SELECT * FROM widgets;

My goal here was to see how the database client would format a variety of data formats, as well as a textual column (“description”) with newlines in it (and a Markdown list, no less!), as the newlines will force the output to appear on multiple lines for a single row. This is one of the features that is missing from the existing Markdown implementations, which all require that the text all be on a single line.

The first database client in which I ran this code was psql 8.3, the interactive terminal for PostgreSQL 8.3. Its output looks like this:

  id  |    name     |         description          | price  
------+-------------+------------------------------+--------
    1 | gizmo       | Takes care of the doohickies |   1.99
    2 | doodad      | Collects *gizmos*            |  23.80
   10 | dojigger    | Handles:                     | 102.98
                    : * gizmos                       
                    : * doodads                      
                    : * thingamobobs                 
 1024 | thingamabob | Self-explanatory, no?        |   0.99

As you can see, PostgreSQL properly right-aligned the integer and numeric columns. It also has a very nice syntax for demonstrating continuing lines for a given column: the colon. The colon is really nice here because it looks kind of like a broken pipe character, which is an excellent mnemonic for a string of text that breaks over multiple lines. Really, this is just a very nice output format overall.

The next database client I tried was mysql 5.0, the command-line client for MySQL 5.0. Its output looks like this:

+------+-------------+--------------------------------------------+--------+
| id   | name        | description                                | price  |
+------+-------------+--------------------------------------------+--------+
|    1 | gizmo       | Takes care of the doohickies               |   1.99 | 
|    2 | doodad      | Collects *gizmos*                          |  23.80 | 
|   10 | dojigger    | Handles:
* gizmos
* doodads
* thingamobobs | 102.98 | 
| 1024 | thingamabob | Self-explanatory, no?                      |   0.99 | 
+------+-------------+--------------------------------------------+--------+

Once again we have very good alignment of the numeric data types. Furthermore, MySQL uses exactly the same syntax as PostgreSQL to represent the separation between column headers and column rows, although the PostgreSQL version is a bit more minimalist. The MySQL version just hast a little more stuff in it

Where the MySQL version fails, however, is in the representation of the continuing lines for the “dojigger” row. First of all, it set the width of the “description” column to the longest value in that column, but since that longest value includes newlines, it actually ends up being much too long—much longer than PostgreSQL’s representation of the same column. And second, as a symptom of that problem, nothing special is done with the wrapped lines. The newlines are simply output like any other character, with no attempt to line up the column. This has the side effect of orphaning the price for the “dojiggger” after the last line of the continuing description. So its alignment is shot, too.

To be fair, PostgreSQL’s display featured almost exactly the same handling of continuing columns prior to version 8.2. But I do think that their solution featuring the colons is a good one.

The last database client I tried was SQLite 3.6. This client is the most different of all. I set .header ON and .mode column and got this output:

id          name        description                   price     
----------  ----------  ----------------------------  ----------
1           gizmo       Takes care of the doohickies  1.99      
2           doodad      Collects *gizmos*             23.8      
10          dojigger    Handles:
* gizmos
* doodads
  102.98    
1024        thingamabo  Self-explanatory, no?         0.99      

I don’t think this is at all useful for Markdown.

Prior Art: MultiMarkdown

Getting back to Markdown now, here is the MultiMarkdown syntax, borrowed from the documentation:

|             |          Grouping           ||
First Header  | Second Header | Third Header |
 ------------ | :-----------: | -----------: |
Content       |          *Long Cell*        ||
Content       |   **Cell**    |         Cell |

New section   |     More      |         Data |
And more      |            And more          |
[Prototype table]

There are a few interesting features to this syntax, including support for multiple lines of headers, multicolumn cells alignment queues, and captions. I like nearly everything about this syntax, except for two things:

  1. There is no support for multiline cell values.
  2. The explicit alignment queues are, to my eye, distracting.

The first issue can be solved rather nicely with PostgreSQL’s use of the colon to indicate continued lines. I think it could even optionally use colons to highlight all rows in the output, not just the continuing one, as suggested by Benoit Perdu on the markdown-discuss list:

  id  |    name     |         description          | price  
------+-------------+------------------------------+--------
    1 | gizmo       | Takes care of the doohickies |   1.99
    2 | doodad      | Collects *gizmos*            |  23.80
   10 | dojigger    | Handles:                     | 102.98
      :             : * gizmos                     : 
      :             : * doodads                    : 
      :             : * thingamobobs               : 
 1024 | thingamabob | Self-explanatory, no?        |   0.99

I think I prefer the colon only in front of the continuing cell, but see no reason why both couldn’t be supported.

The second issue is a bit more subtle. My problem with the alignment hints, embodied by the colons in the header line, is that to the reader of the plain-text Markdown they fill no obvious purpose, but are provided purely for the convenience of the parser. In my opinion, if there is some part of the Markdown syntax that provides no obvious meaning to the user, it should be omitted. I take this point of view not only for my own selfish purposes, which are, of course, many and rampant, but from John Gruber’s original design goal for Markdown, which was:

The overriding design goal for Markdown’s formatting syntax is to make it as readable as possible. The idea is that a Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions. While Markdown’s syntax has been influenced by several existing text-to-HTML filters, the single biggest source of inspiration for Markdown’s syntax is the format of plain text email.

To me, those colons are formatting instructions. So, how else could we support alignment of cells but with formatting instructions? Why, by formatting the cells themselves, of course. Take a look again at the PostgreSQL and MySQL outputs. both simply align values in their cells. There is absolutely no reason why a decent parser couldn’t do the same on a cell-by-cell basis if the table Markdown follows these simple rules:

  • For a left-aligned cell, the content should have no more than one space between the pipe character that precedes it, or the beginning of the line.
  • For a right-aligned cell, the content should have no more than one space between itself and the pipe character that succeeds it, or the end of the line.
  • For a centered cell, the content should have at least two characters between itself and both its left and right borders.
  • If a cell has one space before and one space after its content, it is assumed to be left-aligned unless the cell that precedes it or, in the case of the first cell, the cell that succeeds it, is right-aligned.

What this means, in effect, is that you can create tables wherein you line things up for proper display with a proportional font and, in general, the Markdown parser will know what you mean. A quick example, borrowing from the PostgreSQL output:

  id  |    name     |         description          |  price  
------+-------------+------------------------------+--------
    1 | gizmo       | Takes care of the doohickies |   1.99 
    2 | doodad      | Collects *gizmos*            |  23.80 
   10 | dojigger    | Handles stuff                | 102.98 
 1024 | thingamabob | Self-explanatory, no?        |   0.99 

The outcome for this example is that:

  • The table headers are all center-aligned, because they all have 2 or more spaces on each side of their values
  • The contents of the “id” column are all right-aligned. This includes 1024, which ambiguously has only one space on each side of it, so it makes the determination based on the preceding line.
  • The contents of the “name” column are all left-aligned. This includes “thingamabob”, which ambiguously has only one space on each side of it, so it makes the determination based on the preceding line.
  • The contents of the “description” column are also all left-aligned. This includes first row, which ambiguously has only one space on each side of it, so it makes the determination based on the succeeding line.
  • And finally, the contents of the “price” column are all right-aligned. This includes 102.98, which ambiguously has only one space on each side of it, so it makes the determination based on the preceding line.

And that’s it. The alignments are perfectly clear to the parser and highly legible to the reader. No further markup is required.

Proposed Syntax

So, with this review, I’d like to propose the following syntax. It is inspired largely by a combination of PostgreSQL and MySQL’s output, as well as by MultiMarkdown’s syntax.

  • A table row is identifiable by the use of one or more pipe (|) characters in a line of text, aside from those found in a literal span (backticks).
  • Table headers are identified as a table row with the immediately-following line containing only -, |, +, :or spaces. (This is the same as the MultiMarkdown syntax, but with the addition fo the plus sign.)
  • Columns are separated by |, except on the header underline, where they may optionally be separated by +, and on continuing lines (see next point).
  • Lines that continue content from one or more cells from a previous line must use : to separate cells with continued content. The content of such cells must line up with the cell width on the first line, determined by the number of spaces (tabs won’t work). They may optionally demarcate all cells on continued lines, or just the cells that contain continued content.
  • Alignment of cell content is to be determined on a cell-by-cell basis, with reference to the same cell on the preceding or succeeding line as necessary to resolve ambiguities.
  • To indicate that a cell should span multiple columns, there should be additional pipes (|) at the end of the cell, as in MultiMarkdown. If the cell in question is at the end of the row, then of course that means that pipes are not optional at the end of that row.
  • You can use normal Markdown markup within the table cells, including multiline formats such as lists, as long as they are properly indented and denoted by colons on succeeding lines.
  • Captions are optional, but if present must be at the beginning of the line immediately preceding or following the table, start with [ and end with ], as in MultiMarkdown. If you have a caption before and after the table, only the first match will be used.
  • If you have a caption, you can also have a label, allowing you to create anchors pointing to the table, as in MultiMarkdown. If there is no label, then the caption acts as the label.
  • Cells may not be empty, except as represented by the appropriate number of space characters to match the width of the cell in all rows.
  • As in MultiMarkdown. You can create multiple <tbody> tags within a table by having a single empty line between rows of the table.

Sound like a lot? Well, if you’re acquainted with MultiMarkdown’s syntax, it’s essentially the same, but with these few changes:

  • Implicit cell alignment
  • Cell content continuation
  • Stricter use of space, for proper alignment in plain text (which all of the MultiMarkdown examples I’ve seen tend to do anyway)
  • Allow + to separate columns in the header-demarking lines
  • A table does not have to start right at the beginning of a line

I think that, for purposes of backwards compatibility, we could still allow the use of : in the header lines to indicate alignment, thus also providing a method to override implicit alignment in those rare cases where you really need to do so. I think that the only other change I would make is to eliminate the requirement that the first row be made the table header row if now header line is present. But that’s a gimme, really.

Taking the original MultiMarkdown example and rework it with these changes yields:

|               |            Grouping            ||
+---------------+---------------------------------+
| First Header  |  Second Header  |  Third Header |
+---------------+-----------------+---------------+
| Content       |           *Long Cell*          ||
: continued     :                                ::
: content       :                                ::
| Content       |    **Cell**     |          Cell |
: continued     :                 :               :
: content       :                 :               :

| New section   |      More       |          Data |
| And more      |             And more           ||
 [Prototype table]

Comments?

I think I’ve gone on long enough here, especially since it ultimately comes down to some refinements to the MultiMarkdown syntax. Ultimately, what I’m trying to do here is to push MultiMarkdown to be just a bit more Markdownish (by which I mean that it’s more natural to read as plain text), as well as to add a little more support for some advanced features. The fact that I’ll be able to cut-and-paste the output from my favorite database utilities is a handy bonus.

As it happens, John Gruber today posted a comment to the markdown-discuss mail list in which he says (not for the first time, I expect):

A hypothetical official table syntax for Markdown will almost certainly look very much, if not exactly, like Michel’s table syntax in PHP Markdown Extra.

I hope that he finds this post in that vein, as my goals here were to embrace the PHP Markdown Extra and MultiMarkdown formats, make a few tweaks, and see what people think, with an eye toward contributing toward a (currently hypothetical) official table syntax.

So what do you think? Please leave a comment, or comment on the markdown-discuss list, where I’ll post a synopsis of my proposal and a link to this entry. What have I missed? What mistakes have I made? What do you like? What do you hate? Please do let me know.

Thanks!

Looking for the comments? Try the old layout.

Add Regular Expression Operator to SQLite

As I discussed a couple of months ago, DBD::SQLite exposes the SQLite sqlite3_create_function() API for adding Pure-Perl functions and aggregates to SQLite on a per-connection basis. This is cool, but in perusing the SQLite expression documentation, I came across this gem:

The REGEXP operator is a special syntax for the regexp() user function. No regexp() user function is defined by default and so use of the REGEXP operator will normally result in an error message. If a user-defined function named “regexp” is defined at run-time, that function will be called in order to implement the REGEXP operator.

Well hell! I thought. I can do that!

In a brief search, I could find no further documentation of this feature, but all it took was a little experimentation to figure it out. The regexp() function should expect two arguments. The first is the regular expression, and the second is the value to match. So it can be added to DBD::SQLite like this:

$dbh = DBI->connect('dbi:SQLite:dbfile=test.db');
$dbh->func('regexp', 2, sub {
    my ($regex, $string) = @_;
    return $string =~ /$regex/;
}, 'create_function');

Yep, that’s it! Now, I have my own module for handling database connections, and I wanted to make sure that all of my custom functions are always present, every time I connect to the database. In a mod_perl environment, you can end up with a lot of connections, and a single process has the potential disconnect and reconnect more than once (due to exceptions thrown by the database and whatnot). The easiest way to ensure that the functions are always there as soon as you connect and every time you connect, I learned thanks to a tip from Tim Bunce, is to subclass the DBI and implement a connected() method. Here’s what it looks like:

package MyApp::SQLite;
use base 'DBI';

package MyApp::SQLite::st;
use base 'DBI::st';

package MyApp::SQLite::db;
use base 'DBI::db';

sub connected {
    my $dbh = shift;
    # Add regexp function.
    $dbh->func('regexp', 2, sub {
        my ($regex, $string) = @_;
        return $string =~ /$regex/;
    }, 'create_function');
}

So how does this work? Here’s a quick app I wrote to demonstrate the use of the REGEXP expression in SQLite using Perl regular expressions:

#!/usr/bin/perl -w

use strict;

my $dbfile = shift || die "Usage: $0 db_file\n";
my $dbh = MyApp::SQLite->connect(
    "dbi:SQLite:dbname=$dbfile", '', '',
    {
        RaiseError  => 1,
        PrintError  => 0,
    }
);

END {
    $dbh->do('DROP TABLE try');
    $dbh->disconnect;
}

$dbh->do('CREATE TABLE try (a TEXT)');

my $ins = $dbh->prepare('INSERT INTO try (a) VALUES (?)');
for my $val (qw(foo bar bat woo oop craw)) {
    $ins->execute($val);
}

my $sel = $dbh->prepare('SELECT a FROM try WHERE a REGEXP ?');

for my $regex (qw( ^b a w?oop?)) {
    print "'$regex' matches:\n  ";
    print join "\n  " =>
        @{ $dbh->selectcol_arrayref($sel, undef, $regex) };
    print "\n\n";
}

This script outputs:

'^b' matches:
  bar
  bat

'a' matches:
  bar
  bat
  craw

'w?oop?' matches:
  foo
  woo
  oop

Pretty slick, no? I wonder if it’d make sense for DBD::SQLite to add the regexp() function itself, in C, using the Perl API, so that it’s just always available to DBD::SQLite apps?

Looking for the comments? Try the old layout.

What Advanced SQL Book Should I Buy?

So, what advanced SQL book should I buy? I’ve learned a lot about SQL over the last year or so, but I’m sure that Josh Berkus is tired of being my own personal advanced SQL reference. So I’d like to really learn more about triggers, stored procedures, rules, views, and whatnot, what they’re best used for and when to use them. And other typical database features that I’m not familiar with, of course.

What I don’t need is an introduction to SQL. There are a million of those, and they all have much the same stuff. I want to really get into advanced concepts.

So what’s the best choice? Leave me a comment with your opinion. Thanks!

Looking for the comments? Try the old layout.

Custom SQLite Aggregates in Perl

About a year ago, Josh Berkus was reviewing some Bricolage SQL code, looking to optimize it for PostgreSQL. One of the things he noticed was that we were fetching a lot more rows for an object than we needed to. The reason for this is that an object might be associated with one or more groups, and to get back a list of all of the group IDs, we were getting multiple rows. For example, if I wanted to fetch a single story with the ID 10, I might get back rows like this:

SELECT s.id, s.title, grp.id
FROM   story s, member m, grp g
WHERE  s.id = m.story_id
       AND m.grp_id = g.id
       AND s.id = 10;
s.id |        s.title      | grp.id
-----+---------------------+--------
  10 | The Princess Bride  | 23
  10 | The Princess Bride  | 24
  10 | The Princess Bride  | 25
  10 | The Princess Bride  | 26
  10 | The Princess Bride  | 27

Now, that’s a lot of extra data to have to fetch for just a single row to be different; it’s very wasteful, really. So Josh said, “Why don’t you use a custom aggregate for that?” I knew nothing about aggregates, but I did some research, and figured out how to write PostgreSQL custom aggregates in SQL. I wrote a very simple one, called id_list(), that joins up all of the values in a column with an empty space. The aggregate code looks like this:

CREATE   FUNCTION append_id(TEXT, INTEGER)
RETURNS  TEXT AS '
    SELECT CASE WHEN $2 = 0 THEN
                $1
            ELSE
                $1 || '' '' || CAST($2 AS TEXT)
            END;'
LANGUAGE 'sql'
WITH     (ISCACHABLE, ISSTRICT);

CREATE AGGREGATE id_list (
    SFUNC    = append_id,
    BASETYPE = INTEGER,
    STYPE    = TEXT,
    INITCOND = ''
);

Now I was able to vastly simplify the results returned by the query:

SELECT s.id, s.title, id_list(grp.id)
FROM   story s, member m, grp g
WHERE  s.id = m.story_id
       AND m.grp_id = g.id
       AND s.id = 10;
GROUP BY s.id, s.title
s.id |        s.title      | id_list
-----+---------------------+---------------
  10 | The Princess Bride  | 23 24 25 26 27

So then I just had to split the id_list column on the white space and I was ready to go. Cool!

So recently, was thinking about how I might do something similar in SQLite. It turns out that SQLite has a way to add custom aggregates, too, via its sqlite_add_function function. But I don’t know C, and had been wondering for a while how, even if I figured out how to write an aggregate function in C, whether I would have to require users to compile SQLite with my C aggregate in order to get it to work.

However, as a Perl developer, I thought it might be worthwhile to just quickly check the DBD::SQLite docs might have to say on the matter. And it turns out that the ability to add aggregates to SQLite is supported in DBD::SQLite via the create_aggregate custom function. And what’s more, the aggregate can be written in Perl! Whoa! I couldn’t believe that it could be that easy, but a quick test script demonstrated that it is:

#!/usr/bin/perl -w

use strict;

use DBI;

my $dbfile = shift;
my $dbh = DBI->connect("dbi:SQLite:dbname=$dbfile", '', '');

END {
    $dbh->disconnect;
    unlink $dbfile;
};

$dbh->do('CREATE TABLE foo (a int)');
$dbh->do('BEGIN');
$dbh->do('INSERT INTO foo (a) VALUES (?)', undef, $_) for 1..10;
$dbh->do('COMMIT');

# Create a new aggregate.
$dbh->func('joiner', 1, 'My::Join', 'create_aggregate');

my $sel = $dbh->prepare(q{SELECT joiner(a) FROM foo});
$sel->execute;
$sel->bind_columns(\my $val);
print "$val\n" while $sel->fetch;

The first argument to create_aggregate() (itself invoked via the DBI func() method) the name of the aggregate, the second is the number of arguments to the aggregate (use -1 for an unlimited number), and the third is the name of a Perl class that implements the aggregate. That class needs just three methods: new(), an object constructor; step(), called for each aggregate row, and finalize, which must return the value calculated by the aggregate. My simple implementation looks like this:

package My::Join;

sub new { bless [] }
sub step {
    my $self = shift;
    push @$self, @_;
}
sub finalize {
    my $self = shift;
    return join q{ }, @$self;
}

Yep, that’s really it! When I run the script, the output looks like this:

% try foo
1 2 3 4 5 6 7 8 9 10

Keen! I mean, that is just so slick! And it really demonstrates the power of SQLite as an embeddable database, as well. Thanks Matt, for making the SQLite API available to us mere mortal Perl developers!

Looking for the comments? Try the old layout.

More about…

MySQL's REPLACE Considered Harmful

So we’ve set up a client with an online poll application using MySQL. Polls are created in Bricolage, and when they’re published, rather than writing data to files, the template writes data to the MySQL database. PHP code on the front-end server then uses the database records to manage the polls.

On the recommendation of one of my colleagues, I was using the MySQL REPLACE statement to insert and update poll answers in the database. At first, this seemed like a cool idea. All I had to do was create a unique index on the story_id and ord (for answer order) columns and I was set. Any time someone reordered the answers or changed their wording in Bricolage, the REPLACE statement would change the appropriate records and just do the right thing.

Or so I thought.

Come the day after the launch of the new site, I get a complaint from the customer that the percentage spread between the answers doesn’t add up to 100%. After some investigation, I realized that the poll_results table is using the ID of each question to identify the votes submitted by readers. This makes sense, of course, and is excellent relational practice, but I have overlooked the fact that REPLACE essentially replaces rows every time it is used. This means that even when a poll answer hasn’t changed, it gets a new ID. Yes, that’s right, its primary key value was changing. Yow!

Now we might have caught this earlier, but the database was developed on MySQL 3.23.58 and, as is conventional among MySQL developers, there were no foreign key constraints. So the poll results were still happily pointing to non-existent records. So a poll might appear to have 800 votes, but the percentages might be counted for only 50 votes. Hence the problem with the percentages not adding up to 100% (nowhere near it, in fact).

Fortunately, the production application is on a MySQL 4.1 server, so I made a number of changes to correct this issue:

  • Added foreign key constraints
  • Exploited a little-known (mis)feature of Bricolage to store primary keys for all poll answers (and questions, for that matter)
  • Switched from REPLACE to INSERT, UPDATE, and DELETE statements using the primary keys

I also started using transactions when making all these updates when a poll is published so that changes are always atomic. Now it works beautifully.

But the lesson learned is that REPLACE is a harmful construct. Yes, it was my responsibility to recognize that it would create new rows and therefore new primary keys. But any construct that changes primary keys should be stricken from any database developer’s toolbox. The fact that MySQL convention omits the use of foreign key constraints makes this a particularly serious issue that can appear to have mysterious consequences.

So my advice to you, gentle reader, is don’t use it.

Looking for the comments? Try the old layout.

More about…

Enforce Foreign Key Integrity in SQLite with Triggers

After some some Googling and experimentation, I’ve figured out how to enforce foreign key constraints in SQLite. I got most of the code from Cody Pisto’s sqlite_fk utility. I couldn’t get it to work, but the essential code for the triggers was in its fk.c file, so I just borrowed from that (public domain) code to figure it out.

Since I couldn’t find documentation for this elsewhere on the Net (though I’m sure it exists somewhere), I decided to just put an example here. Interested? Read on!

Say you have these two table declarations:

create table foo (
    id INTEGER NOT NULL PRIMARY KEY
);

CREATE TABLE bar (
    id INTEGER NOT NULL PRIMARY KEY,
    foo_id INTEGER NOT NULL
            CONSTRAINT fk_foo_id REFERENCES a(id) ON DELETE CASCADE
);

Table bar has a foreign key reference to the primary key column in the foo table. Although SQLite supports this syntax (as well as named foreign key constraints), it ignores them. So if you want the references enforced, you need to create triggers to do the job. Triggers were added to SQLite version 2.5, so most users can take advantage of this feature. Each constraint must have three triggers: one for INSERTs, one for UPDATESs, and one for DELETESs. The INSERT trigger looks like this:

CREATE TRIGGER fki_bar_foo_id
BEFORE INSERT ON bar
FOR EACH ROW BEGIN 
    SELECT CASE
        WHEN ((SELECT id FROM foo WHERE id = NEW.foo_id) IS NULL)
        THEN RAISE(ABORT, 'insert on table "bar" violates foreign key '
                || 'constraint "fk_foo_id"')
    END;
END;

(You can put the RAISE error string all on one line; I’ve concatenated two lines to keep line lengths reasonable here.) If your foreign key column is not NOT NULL, the trigger’s SELECT CASE clause needs to an extra case:

CREATE TRIGGER fki_bar_foo_id
BEFORE INSERT ON bar
FOR EACH ROW BEGIN 
    SELECT CASE
        WHEN ((new.foo_id IS NOT NULL)
            AND ((SELECT id FROM foo WHERE id = new.foo_id) IS NULL))
        THEN RAISE(ABORT, 'insert on table "bar" violates foreign key '
                || 'constraint "fk_foo_id"')
    END;
END;

The UPDATE statements are almost identical; if your foreign key column is NOT NULL, then do this:

CREATE TRIGGER fku_bar_foo_id
BEFORE UPDATE ON bar
FOR EACH ROW BEGIN 
    SELECT CASE
        WHEN ((SELECT id FROM foo WHERE id = new.foo_id) IS NULL))
        THEN RAISE(ABORT, 'update on table "bar" violates foreign key '
                || 'constraint "fk_foo_id"')
    END;
END;

And if NULLs are allowed, do this:

CREATE TRIGGER fku_bar_foo_id
BEFORE UPDATE ON bar
FOR EACH ROW BEGIN 
    SELECT CASE
        WHEN ((new.foo_id IS NOT NULL)
            AND ((SELECT id FROM foo WHERE id = new.foo_id) IS NULL))
        THEN RAISE(ABORT, 'update on table "bar" violates foreign key '
                || 'constraint "fk_foo_id"')
    END;
END;

The DELETE trigger is, of course, the reverse of the INSERT and UPDATE triggers, in that it applies to the primary key table, rather than the foreign key table. To whit, in our example, it watches for DELETEs on the foo table:

CREATE TRIGGER fkd_bar_foo_id
BEFORE DELETE ON foo
FOR EACH ROW BEGIN 
    SELECT CASE
    WHEN ((SELECT foo_id FROM bar WHERE foo_id = OLD.id) IS NOT NULL)
    THEN RAISE(ABORT, 'delete on table "foo" violates foreign key '
                || ' constraint "fk_foo_id"')
    END;
END;

This trigger will prevent DELETEs on the foo table when there are existing foreign key references in the bar table. This is generally the default behavior in databases with referential integrity enforcement, sometimes specified explicitly as ON DELETE RESTRICT. But sometimes you want the deletes in the primary key table to “cascade” to the foreign key tables. Such is what our example declaration above specifies, and this is the trigger to to the job:

CREATE TRIGGER fkd_bar_foo_id
BEFORE DELETE ON foo
FOR EACH ROW BEGIN 
    DELETE from bar WHERE foo_id = OLD.id;
END;

Pretty simple, eh? The trigger support in SQLite is great for building your own referential integrity checks. Hopefully, these examples will get you started down the path of creating your own.

Looking for the comments? Try the old layout.

More about…